Causal network discovery by iterative conditioning: comparison of algorithms
Jaroslav Hlinka (1, 2) and Jakub Kořenek (1, 3)

(1) Institute of Computer Science, Czech Academy of Sciences, Pod vodarenskou vezi 271/2, 182 07 Prague, Czech Republic
(2) National Institute of Mental Health, Topolová 748, 250 67 Klecany, Czech Republic
(3) Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Břehová 7, 115 19 Prague, Czech Republic

(Dated: 20 January 2020)
Estimating causal interactions in complex dynamical systems is an important problem encountered in many fields of current science. While a theoretical solution for detecting the causal interactions has been previously formulated in the framework of prediction improvement, it generally requires the computation of high-dimensional information functionals – a situation invoking the curse of dimensionality with increasing network size. Recently, several methods have been proposed to alleviate this problem, based on iterative procedures for the assessment of conditional (in)dependences. In the current work, we bring a comparison of several such prominent approaches. This is done both by a theoretical comparison of the algorithms, using a formulation in a common framework, and by numerical simulations including realistic complex coupling patterns. The theoretical analysis highlights the key similarities and differences between the algorithms, hinting at their comparative strengths and weaknesses. The method assumptions and specific properties such as false positive control and order-dependence are discussed. Numerical simulations suggest that while the accuracy of most of the algorithms is almost indistinguishable, there are substantial differences in their computational demands, ranging theoretically from polynomial to exponential complexity, and leading to substantial differences in computation time in realistic scenarios depending on the density and size of the networks. Based on the analysis of the algorithms and the numerical simulations, we propose a hybrid approach providing competitive accuracy with improved computational efficiency.

Keywords: causal inference, complex networks, transfer entropy, partial correlation, Granger causality
Characterization of the structure of interactions in large heterogeneous systems based on observational data has become one of the dominant challenges across scientific fields. In many cases, measurements of the dynamical behavior are available, allowing inference of causal interactions among the subsystems by exploiting the principle of temporal precedence of the cause before the effect. Half a century ago, Sir Clive Granger proposed a formal treatment of the problem of detecting such interactions, based on statistical testing of the improvement of the prediction of a target variable by a candidate source variable. This approach has been generalized to nonlinear processes using the framework of information theory. However, the practical applicability of this methodology has been hindered by the need to properly account for all other potential intervening variables in the system, bringing in both computational and accuracy issues growing with network size. In this work we compare several prominent algorithms proposed recently for estimating causal structure in large networks. We introduce the algorithms within a common framework highlighting the similarities and differences, and compare their accuracy and computational demands on both simulated random networks and realistic examples derived from real-world brain and climate dynamics datasets. Finally, we suggest an algorithm with competitive accuracy and faster performance.
I. INTRODUCTION
The study of complex dynamical systems is a growing area of research with applications in multiple fields ranging from neuroscience through genetics, ecology, social anthropology, informatics, economy and energetics to climate research – see the authoritative reviews covering a range of application fields. This growth is fed by the increasing availability of large datasets with observational data from multiple subsystems of the studied systems, as well as by the rapidly increasing computational power of modern computers and progress in the algorithms for complex data analysis. A key principle in complex network research is viewing the system at hand as a network of interacting subsystems, with one of the central questions being that of estimating the pattern of their mutual interactions. Notably, there is an ongoing transition from the previously prevailing study of purely statistical dependences between the subsystems (commonly denoted by the term 'functional connectivity', borrowed from the neurosciences) to the quest of characterizing the pattern of the direct causal connections between the subsystems ('effective connectivity'). Note that for many complex systems, our knowledge of their structure and dynamics, although increasing at a tremendous pace, is very far from perfect. Therefore, the structure of interactions commonly needs to be estimated directly from observed time series. For instance, in the case of the human brain, resolving the pattern of anatomical connections (structural connectivity) still poses serious challenges, in particular using non-invasive methods. Moreover, depending on system parameters, the same structural connectivity can give rise to vastly different patterns of dynamical interactions.
The pattern of functional connectivity can even change dynamically, giving rise to a progression of brain states, the detectability of which poses further methodological challenges. In this context, the development and proper validation of methods for estimating the structure of causal interactions from observed time series is of key importance. For stochastic processes, the problem of causal interaction discovery was considered already by Norbert Wiener and later formulated by Sir Clive Granger in his famous concept of (Granger) causality. This generally states that a variable is to be considered causal with respect to some target variable if its inclusion in a model improves the prediction of the target. This operationalization has of course many practical and philosophical limitations; however, it has become commonly used at least as an interim approach in situations where testing causality by e.g. direct experimental manipulation is not readily available. For a more general discussion of causality and its inference we refer the reader to the literature. In principle, two key challenges appear in causal interaction discovery: the first is under what conditions the 'true' causality can be uncovered from observational time series (such as observing all intervening variables), while the second lies in finding efficient algorithms for inference of the network of causal relations from data (particularly in the case of many dependent variables, short samples and potentially nonlinear relations). Thus, the practical application of the Granger/Wiener conceptual solution of the former requires an effective algorithm for establishing causality from observed time series of finite size to solve the latter challenge.
Building on the original Granger's approach based on linear vector autoregressive processes, there has been a long line of attempts to widen the applicability of the principle by devising algorithms that would perform well also for nonlinear processes as well as in the situation of relatively large system size. The former problem of nonlinearity is commonly addressed by utilizing entropy as a general measure of uncertainty, or equivalently using mutual information as a measure of statistical dependence. This has motivated the definition of transfer entropy, a special case of conditional mutual information, as a measure of causality in nonlinear dynamical systems. Similarly as in Granger causality, to avoid spurious inference due to indirect causation, the method can be extended by taking into account other potentially intervening variables by conditioning on them. However, the use of information-theoretical functionals only escalates the latter problem of dealing with high-dimensional data. Indeed, the complexity of standard binning algorithms grows exponentially with the dimension of the variables considered (as the number of multidimensional bins, at which the probability density is estimated, scales exponentially with the dimension). Alternative algorithms to estimate conditional mutual information based on non-discretizing approaches, such as kernel methods or k-nearest-neighbor-based estimators, also exist; however, estimating entropy functionals is still a difficult task. To remedy this problem, several researchers have recently turned to schemes that reduce the number of conditions considered by some principled variable selection procedure.
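To make the estimation problem concrete, a minimal binning estimator of mutual information can be written in a few lines. This is only an illustrative sketch (the function names, the fixed bin count and the synthetic data are our choices, not part of any of the reviewed methods); the comment in `mutual_information` points out why the same approach becomes infeasible in high dimension:

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy in bits from a histogram of counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y, bins=8):
    """Binned estimate of I(X;Y) = H(X) + H(Y) - H(X,Y).

    The joint histogram has bins**2 cells; for d variables it would
    need bins**d cells -- the curse of dimensionality in action.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_x = entropy_bits(joint.sum(axis=1))   # marginal histogram of X
    h_y = entropy_bits(joint.sum(axis=0))   # marginal histogram of Y
    return h_x + h_y - entropy_bits(joint.ravel())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + 0.5 * rng.normal(size=10_000)  # strongly coupled to x
z = rng.normal(size=10_000)            # independent of x

print(mutual_information(x, y))  # large (on the order of 1 bit)
print(mutual_information(x, z))  # near zero (small positive estimator bias)
```

Note that even for independent variables the binned estimate is slightly positive, which is why the reviewed algorithms replace the idealized test "CMI = 0" by a statistical test or threshold.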
In particular, four such algorithms were formulated in a way either directly inspired by or at least resembling in some aspects the PC algorithm, in particular the first phase of PC, the skeleton discovery phase. These methods aim to construct a (directed) network representation of the system's causal structure by evaluating the (conditional) mutual information from potential source variables to each target variable. Importantly, this network representation provides only a simplified picture of the full causal structure in a generic case, due to the existence of higher-order (sometimes called polyadic, in contrast to dyadic) dependences. Such an approximation has been recently criticized, in particular concerning the problems of interpreting the reconstructed networks as information flows. However, a proper theoretical treatment of higher-order dependences, as well as methods for their quantification from finite-size samples, is a matter of ongoing research (see the recent literature and references therein). While we believe that building such theoretical fundamentals is key for a proper interpretation of complex causal structures including higher-order dependences, in many practical situations such higher-order dependences may be negligible, particularly given their problematic estimability from small data samples. Therefore, we believe that algorithms for the construction of a directed network representation of the system causal structure will continue to be widely applied in practice, albeit this should be done with due caution. While all four above-mentioned methods have been reported as performing reasonably on both simulated examples and real-world data, to the best of our knowledge, there has been no systematic theoretical and numerical comparison available that would help making an informed choice concerning which of these algorithms to use in practical situations.
However, we direct the attention of the readers at least to two recent works published during the revisions of this manuscript, which contain in particular a comparison of the naive fully multivariate transfer entropy with Sun's optimal causation entropy and the PCMCI, discussing several advantages of the PCMCI algorithm, and a recent review of causal inference with emphasis on applications in Earth sciences. In this paper, we therefore set out to fill the gap by providing a comparison of the structure, performance and computational demands of the reviewed algorithms, to help making an informed choice in practice as well as to assess the space and possible directions for improvement. Based on this comparison, we also propose a hybrid method that outperforms the other methods in some scenarios in terms of computational demands under conserved accuracy. The paper is structured as follows: after this introductory section, in section II we describe the compared methods as well as the selected procedure for statistical testing. Section III describes the data that we use for assessing the accuracy and computational demands of the approaches. The results are shown in section IV, and the paper is finalized by a detailed discussion in section V and a final conclusion (section VI).

II. PROBLEM STATEMENT AND METHODS
As mentioned above, under some assumptions causality may be defined in terms of reducing the uncertainty of the prediction, or in other words in terms of the conditional dependence of the target and source variable. These notions are conveniently formalized in terms of information-theoretical functionals. For two discrete random variables X, Y with sets of values Ξ and Υ, marginal probability distribution functions p(x), p(y), and joint probability distribution function p(x, y), the Shannon entropy H(X) is defined as

H(X) = − ∑_{x∈Ξ} p(x) log p(x),    (1)

and the joint entropy H(X, Y) of X and Y as

H(X, Y) = − ∑_{x∈Ξ} ∑_{y∈Υ} p(x, y) log p(x, y).    (2)

The conditional entropy H(X|Y) of X given Y is

H(X|Y) = − ∑_{x∈Ξ} ∑_{y∈Υ} p(x, y) log p(x|y).    (3)

The amount of common information contained in the variables X and Y is quantified by the mutual information I(X; Y) defined as

I(X; Y) = H(X) + H(Y) − H(X, Y).    (4)

The conditional mutual information (CMI) I(X; Y|Z) of the variables X, Y given the variable Z is given as

I(X; Y|Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z).    (5)

Entropy and mutual information are measured in bits if the base of the logarithms in their definitions is 2. It is straightforward to extend these definitions to more variables, and to continuous rather than discrete variables. In practice, the estimation of information-theoretical functionals for continuous variables is often carried out through their discretization by binning procedures, or by alternative non-discretizing approaches such as kernel methods or k-nearest-neighbor-based estimators. Alternatively, when the data are considered sufficiently close to Gaussianity, estimates of linear quantities can be used – in particular Pearson's correlation coefficient in place of mutual information and partial correlation in place of CMI.

Let us now consider a random process {X_t | t ∈ Z}, where X_t is (for all t ∈ Z) a multivariate random variable X_t = (X^1_t, ..., X^n_t)^⊤, with the random variable X^i_t indicating the state of element i at time t. Next we define X^−_t = (X_{t−1}, ..., X_{t−τ_max}), τ_max ∈ [1, +∞), which expresses the previous states of the system, and similarly for each element of the system X^{i−}_t = (X^i_{t−1}, ..., X^i_{t−τ_max}).

A natural way to quantify the causal effect of the variable X^j_{t−τ} on the variable X^i_t conditioned on all other elements of the system X_t is the calculation of the CMI I(X^i_t; X^j_{t−τ} | X^−_t ∖ X^j_{t−τ}). Indeed, following Granger's/Wiener's idea, a variable X^j_{t−τ} is to be considered causal with respect to the variable X^i_t if I(X^i_t; X^j_{t−τ} | X^−_t ∖ X^j_{t−τ}) > 0. The reviewed causal discovery algorithms are thus trying to estimate the set of such nodes for a given target node, called the causal parent set:

N_{X^i_t} = { X^j_{t−τ} | I(X^i_t; X^j_{t−τ} | X^−_t ∖ X^j_{t−τ}) > 0 }.    (6)

However, the evaluation of this CMI may in practice be unfeasible due to the problems with estimating high-dimensional information functionals, including computational demands and the common inaccuracy of estimates from short time series samples. Therefore the below-outlined algorithms for the reduction of the dimension of the conditioning variable were proposed. While Kugiumtzis et al. gave their algorithm a name (PMIME – partial mutual information from mixed embedding), two other algorithms were not introduced with an explicit name – for simplicity and ease of orientation, we will refer to them as Runge's and Sun's algorithm throughout the paper, although an abbreviation based on the names of all original coauthors or a procedural description might be advocated. The newer variant of the algorithm proposed by Runge et al. is then denoted, in line with the original paper, as PCMCI.

A. Runge's algorithm and PCMCI
The first of the studied algorithms is the algorithm introduced by Runge et al. Throughout this article we denote n̂ = {1, ..., n} and similarly τ̂_max = {1, ..., τ_max}. The first step of the algorithm is to compute the mutual information I(X^i_t; X^j_{t−τ}) for all j ∈ n̂ and τ ∈ τ̂_max. Elements X^j_{t−τ} which share non-zero mutual information with X^i_t form the set of potential causal parents of X^i_t, which we denote

Ñ_{X^i_t} = { X^j_{t−τ} | I(X^i_t; X^j_{t−τ}) > 0 }.    (7)

This set contains the true causal parents, but also indirectly associated elements that have non-zero mutual information with the element X^i_t, for example because they are both influenced by some other element.

Algorithm 1
The first phase of Runge's algorithm

  Ñ_{X^i_t} ← ∅
  for j ∈ n̂ do
    for τ ∈ τ̂_max do
      if I(X^i_t; X^j_{t−τ}) > 0 then
        Ñ_{X^i_t} ← Ñ_{X^i_t} ∪ {X^j_{t−τ}}

In the second (reduction) phase of the algorithm, these indirect links are therefore excluded from the set Ñ_{X^i_t}. The natural way of this reduction is to determine for each element X^j_{t−τ} of the set Ñ_{X^i_t} the CMI

I(X^i_t; X^j_{t−τ} | Ñ_{X^i_t} ∖ {X^j_{t−τ}})    (8)

and, in the case when this information is equal to zero, exclude X^j_{t−τ} from the set of potential causal parents Ñ_{X^i_t}. However, the size of the set of potential causal parents may be large; generally it may include up to n × τ_max elements, and the practical calculation of the CMI may fail due to the availability of only a short sample of the time series or due to the computational demands. Therefore, in the reduction phase of Runge's algorithm, instead of computing the single conditional information (8), the CMI I(X^i_t; X^j_{t−τ} | Ñ^{m,k}_{X^i_t}) is computed over subsets Ñ^{m,k}_{X^i_t} of the original set Ñ_{X^i_t}, where m is the size of the subset and k is the index of the subset.

Algorithm 2
The second phase of Runge's algorithm

  m ← m_0
  while m < |Ñ_{X^i_t}| do
    for k ∈ {1, ..., C(|Ñ_{X^i_t}|, m)} do
      if Ñ^{m,k}_{X^i_t} ⊆ Ñ_{X^i_t} then
        for X^j_{t−τ} ∈ Ñ_{X^i_t} ∖ Ñ^{m,k}_{X^i_t} do
          if I(X^i_t; X^j_{t−τ} | Ñ^{m,k}_{X^i_t}) = 0 then
            Ñ_{X^i_t} ← Ñ_{X^i_t} ∖ {X^j_{t−τ}}
    m ← m + 1
  N_{X^i_t} ← Ñ_{X^i_t}

In detail, the second (reduction) phase of Runge's algorithm proceeds as follows. In the outer loop, the parameter m, which denotes the number of conditions in the CMI, iterates upward from a predefined value m_0, while k iterates through all the different subsets of size m of Ñ_{X^i_t}. If for any m and any k the CMI I(X^i_t; X^j_{t−τ} | Ñ^{m,k}_{X^i_t}) is equal to zero, then the element X^j_{t−τ} is removed from the set Ñ_{X^i_t}. If the size of the new set Ñ_{X^i_t} is less than or equal to m, the algorithm terminates. Otherwise m is increased by one and the algorithm continues.

Runge's algorithm was further developed into a more computationally efficient version introduced under the name PCMCI, where PC stands for the names of Peter Spirtes and Clark Glymour, the authors of the PC algorithm, and MCI stands for Momentary Conditional Independence. The initial phase of this algorithm is similar to the original algorithm (see Algorithm 3); however, the number of considered subsets Ñ^{m,k}_{X^i_t} is reduced. For a given X^i_t and X^j_{t−τ} and for a given cardinality m of the subsets Ñ^{m,k}_{X^i_t}, only q_max CMIs of the form I(X^i_t; X^j_{t−τ} | Ñ^{m,k}_{X^i_t}) are assessed (instead of up to all possible C(n, m) combinations). Thanks to this reduction, the complexity of this algorithm changes from exponential to polynomial – similarly to Sun's algorithm. The input parameter q_max is selected by the user; for the current paper we used the setting q_max = 1 that was applied for the simulations in the original work. Note that the preferable choice may depend on the relative weight of the required speed in the PC phase and the size of the condition set entering the MCI phase.
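The interplay of the two phases of Runge's original algorithm can be condensed into a short sketch. The code below is only an illustration under simplifying assumptions of ours: partial correlation replaces the CMI as a linear proxy, a fixed decision threshold replaces the statistical test, and all function and parameter names are our own, not the authors' reference implementation:

```python
import itertools
import numpy as np

def parcorr(x, y, Z):
    """Partial correlation of x and y given the columns of Z
    (a linear proxy for conditional mutual information)."""
    if Z.shape[1] > 0:
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def runge_parents(data, i, tau_max=1, thresh=0.1):
    """Two-phase parent search for node i on a (T, n) data array."""
    T, n = data.shape
    lag = lambda j, tau: data[tau_max - tau:T - tau, j]
    target = data[tau_max:, i]
    no_cond = np.empty((T - tau_max, 0))
    # Phase 1: keep every lagged variable unconditionally dependent on the target
    cand = [(j, tau) for j in range(n) for tau in range(1, tau_max + 1)
            if abs(parcorr(lag(j, tau), target, no_cond)) > thresh]
    # Phase 2: condition on subsets of increasing size m and remove
    # candidates whose dependence vanishes
    m = 1
    while m < len(cand):
        for S in itertools.combinations(list(cand), m):
            if not set(S) <= set(cand):
                continue  # subset already invalidated by earlier removals
            Z = np.column_stack([lag(j, t) for j, t in S])
            for p in [c for c in cand if c not in S]:
                if abs(parcorr(lag(*p), target, Z)) < thresh:
                    cand.remove(p)
        m += 1
    return cand

# Toy chain X0 -> X1 -> X2 with an autocorrelated driver
rng = np.random.default_rng(1)
T = 5000
x = np.zeros((T, 3))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + rng.normal()
    x[t, 1] = 0.8 * x[t - 1, 0] + rng.normal()
    x[t, 2] = 0.8 * x[t - 1, 1] + rng.normal()

print(runge_parents(x, 2))  # the only detected parent of X2 is X1 at lag 1
```

In this toy chain, the indirect link X0 → X2 and the spurious self-link of X2 both pass phase 1 but are conditioned out in phase 2 once X1 at lag 1 enters the conditioning subset.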
Another major difference is order-independence, which is achieved by not removing an independent parent immediately, but only after the loop (see the marking step in Algorithm 3). This leads in some situations to different results than the common heuristic used in the classical PC algorithm, where the variables 'conditioned out' are removed straight away, and therefore the result is not invariant with respect to the order of testing. Such an approach is e.g. used in the later-introduced FACDA algorithm, making it from this perspective order-dependent.

The last step of the PCMCI algorithm is the MCI step (see Algorithm 4). In this step, all elements X^j_{t−τ} (including those which were excluded in the PC phase) are tested against the output set of candidate variables from the PC phase. Moreover, the conditioning set in this phase does not contain only the potential parents of the target X^i_t, but also the potential parents of the source X^j_{t−τ}, although only the p_X strongest parents of the source are included, to limit the size of the condition. In this phase, for each element X^j_{t−τ}, the CMI I(X^i_t; X^j_{t−τ} | Ñ_{X^i_t} ∪ Ñ^{p_X}_{X^j_{t−τ}} ∖ {X^j_{t−τ}}) is thus assessed. The set of all X^j_{t−τ} for which this CMI is nonzero is declared as the set of causal parents of X^i_t. The testing in the MCI phase is aimed at controlling the false positive rate at a predefined level; due to the inclusion of the parents of the source and of the target, the tests should be valid even for highly autocorrelated variables, as, effectively, due to the conditioning, only the relation between the residuals stripped of the autocorrelation is tested.

In the PC phase, the authors recommend setting a relatively high value of the parameter α, which denotes the level of statistical significance at which the null hypothesis H_0 (CMI is equal to zero) is rejected; in particular, numerical examples show that too small values of α lead to a too high false positive rate. In our simulations we use a relatively high fixed setting of α.
A false discovery rate (FDR) procedure was recommended as a correction for multiple testing. For a predefined FDR level, this effectively corresponds to using a corrected threshold that depends on the observed p-values across all the tests. To keep comparability, we use the same range of the threshold parameter θ as for the other algorithms. Further, because of the potential problem with high dimensionality, the authors recommend restricting the number of conditions Ñ_{X^j_{t−τ}} with a free parameter p_X. We use the default setting p_X = 1 recommended in similar simulations in the original study, i.e. we consider only one element of Ñ_{X^j_{t−τ}}, and for comparison the minimal choice p_X = 0. In fact, the choice of higher values had a detrimental effect on the accuracy of the algorithm in our simulations (see Figure 12); we believe this is a design choice of the authors of the algorithm for a particular reason, namely achieving nominal FPR control under autocorrelation.

Algorithm 3
Algorithm PCMCI – PC phase

  Ñ_{X^i_t} = { X^j_{t−τ} | j ∈ n̂; τ ∈ τ̂_max }
  I_min(X^j_{t−τ}) = +∞ for all X^j_{t−τ} ∈ Ñ_{X^i_t}
  for m = 0, ..., m_max do
    if |Ñ_{X^i_t}| − 1 < m then
      Break for-loop
    for X^j_{t−τ} ∈ Ñ_{X^i_t} do
      q = −1
      for all lexicographically chosen S ⊆ Ñ_{X^i_t} ∖ {X^j_{t−τ}} with |S| = m do
        q = q + 1
        if q ≥ q_max then
          Break from inner for-loop
        [I^j_{t−τ}, p-value] = I(X^i_t, X^j_{t−τ} | S)
        if |I^j_{t−τ}| < I_min(X^j_{t−τ}) then
          I_min(X^j_{t−τ}) = |I^j_{t−τ}|
        if p-value > α then
          Mark X^j_{t−τ} for removal from Ñ_{X^i_t}
          Break from inner for-loop
  Remove non-significant parents from Ñ_{X^i_t}
  Sort parents in Ñ_{X^i_t} by I_min(X^j_{t−τ}) from largest to smallest

Algorithm 4
Algorithm PCMCI – MCI phase

  for X^j_{t−τ} ∈ { X^j_{t−τ} | j ∈ n̂; τ ∈ τ̂_max } do
    Ñ^{p_X}_{X^j_{t−τ}} ← first p_X parents from Ñ_{X^j_t}, shifted by τ
    if I(X^i_t; X^j_{t−τ} | Ñ_{X^i_t} ∪ Ñ^{p_X}_{X^j_{t−τ}} ∖ {X^j_{t−τ}}) = 0 then
      Mark X^j_{t−τ} for removal from Ñ_{X^i_t}
  Remove non-significant parents from Ñ_{X^i_t}

B. PMIME & Sun's algorithm
Two other studied algorithms are the PMIME algorithm and Sun's algorithm. The PMIME algorithm (partial mutual information from mixed embedding) was originally formulated in a more general setting for multiple time lags, unlike Sun's algorithm. However, in a basic setting (that means a maximum time lag equal to 1 for every variable in the system), this algorithm is equivalent to the first phase of Sun's algorithm, which has been originally designed only for Markov processes of order one.

Algorithm 5
The first phase of Sun's algorithm

  K ← { X^j_{t−1} | j ∈ n̂ }, Ñ_{X^i_t} ← ∅, I ← +∞, p ← ∅
  while I > 0 do
    Ñ_{X^i_t} ← Ñ_{X^i_t} ∪ p
    for X^j_{t−1} ∈ (K ∖ Ñ_{X^i_t}) do
      I_j ← I(X^i_t; X^j_{t−1} | Ñ_{X^i_t})
    I ← max_j I_j
    j̃ ← argmax_j I_j
    p ← { X^j̃_{t−1} }

This first phase of Sun's algorithm proceeds as follows. The initial step is to estimate the mutual information I(X^i_t; X^j_{t−1}) for each element j ∈ n̂. If this mutual information is equal to zero for every j ∈ n̂, the algorithm terminates. Otherwise the element with maximal mutual information is added to the (initially empty) set of potential causal parents Ñ_{X^i_t}. In the next steps, the CMI I(X^i_t; X^j_{t−1} | Ñ_{X^i_t}) is assessed for each j ∈ n̂ for which X^j_{t−1} ∉ Ñ_{X^i_t}. If this CMI is equal to zero for each j, the algorithm terminates. Otherwise the element with maximal CMI is added to the set.

However, Sun et al. suggested (in contrast to the authors of the PMIME method) the necessity of a second phase that would attempt to remove any spurious links, i.e. indirect links due to a common mediator or false links due to a common driver, included during the first phase. In the second (reduction) phase of Sun's algorithm, at each step j the CMI I(X^i_t; X^j_{t−1} | Ñ_{X^i_t} ∖ {X^j_{t−1}}) is assessed. If this CMI is equal to zero, the element X^j_{t−1} is excluded from the set of potential causal parents.

Algorithm 6
The second phase of Sun's algorithm

  for X^j_{t−1} ∈ Ñ_{X^i_t} do
    if I(X^i_t; X^j_{t−1} | Ñ_{X^i_t} ∖ {X^j_{t−1}}) = 0 then
      Ñ_{X^i_t} = Ñ_{X^i_t} ∖ {X^j_{t−1}}

Similarly to Runge's original algorithm, but unlike in the PCMCI algorithm, the order of testing of the elements from the set of potential causal parents Ñ_{X^i_t} may also influence the outcome of Sun's algorithm. In the original article this fact is not discussed. In our implementation we use testing from the weakest element to the strongest; in this case we quantify the strength of the element X^j_{t−1} by the mutual information I(X^i_t; X^j_{t−1}). Note that the PCMCI is order-independent in that it avoids the need for an order choice by only marking the explained parents for removal instead of removing them straight away. In principle, the PCMCI thus removes a superset of the variables compared to removing directly during testing in any particular order.

C. Relations between the algorithms
The description of the algorithms back to back already hints at their similarities and differences. In the following we shall make this comparison even more explicit and draw some suggestions and conclusions from it. A naive approach to detecting the parent set of a given node would be to assess each potential parent node one at a time by computing its information on the target node conditional on all other nodes. However, this would require the computation of information functionals of high dimension, posing both computational and numerical problems. The reviewed algorithms sidestep this problem by limiting the candidate parent set in one way or another. In particular, for each target node, all reviewed algorithms include an initial phase that generates a set of its candidate causal parents. This is done either at once by evaluating the (unconditional) mutual information with the target (Runge's algorithm and PCMCI), or iteratively by evaluating the mutual information conditional on the already identified candidate parents (Sun's algorithm and PMIME). Then, a second phase may follow: potential candidates are removed by iterative testing of their added value (CMI) with respect to the rest of the candidate set (Sun) or with respect to its subsets of increasing size (Runge, PCMCI).

The approach of Runge's algorithm is to first obtain a superset of the true parents by assessing the mutual information of each node with the target, and in the second phase iteratively try to remove them by conditioning on increasing subsets of other strong candidates. In contrast to Runge's algorithm, in the first phase of Sun's algorithm the candidate parents are added one by one (i.e., an evaluation of conditional mutual information, conditioned by the elements of the current set of potential causal parents Ñ_{X^i_t}, is used), and therefore after the first phase of Sun's algorithm the set of potential causal parents Ñ_{X^i_t} should contain fewer (if any) indirectly connected elements than after the first, simple phase of Runge's algorithm, allowing one to assess the fully conditioned mutual information. The two approaches thus principally differ in which phase they treat iteratively – the forward inclusion phase or the backward removal phase. While the number of iterations is generally larger in Runge's algorithm, the number of evaluated nodes in each step of the iteration is larger in the (iterative) first phase of Sun's algorithm; therefore it depends on the circumstances which algorithm leads to fewer CMI evaluations in total.

TABLE I: Asymptotic worst-case number of CMI evaluations for obtaining the parent set of one node.

Phase/algorithm | PMIME | Sun  | Runge | PCMCI | FACDA
Forward         | ~n^2  | ~n^2 | ~n    | ~n    | ~n^2
Backward        | −     | ~n   | ~2^n  | ~n^2  | ~n
Repair          | −     | −    | −     | ~n    | −
Total           | ~n^2  | ~n^2 | ~2^n  | ~n^2  | ~n^2

In general, we expect Sun's algorithm to be more effective than Runge's for large dense networks due to its only polynomial complexity in the network size. In particular, in the case when the i-th element of the system is influenced by all other elements, at most n(n − 1)/2 ∼ n^2 CMIs are evaluated in the first (more computationally demanding) phase of Sun's algorithm. On the other side, in such an extreme case, Runge's algorithm would pass (in the second phase) through all subsets of the (full) set of potential causal parents, in an attempt to 'condition out' the effect of a given candidate causal parent.
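Before turning to the worst-case counts, the iterative forward phase that distinguishes Sun's algorithm (and PMIME) can be made concrete with a minimal sketch. As before, this is an illustration under our own simplifying assumptions: lag 1 (Markov order one), partial correlation as a linear proxy for CMI, a fixed threshold in place of a significance test, and function names of our choosing:

```python
import numpy as np

def parcorr(x, y, Z):
    """Partial correlation of x and y given the columns of Z
    (a linear proxy for conditional mutual information)."""
    if Z.shape[1] > 0:
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def sun_forward(data, i, thresh=0.1):
    """Forward phase of Sun's algorithm for lag-1 data: greedily add the
    candidate with the strongest dependence conditional on the set so far."""
    T, n = data.shape
    target = data[1:, i]
    parents = []
    while len(parents) < n:
        Z = (np.column_stack([data[:-1, j] for j in parents])
             if parents else np.empty((T - 1, 0)))
        scores = {j: abs(parcorr(data[:-1, j], target, Z))
                  for j in range(n) if j not in parents}
        best = max(scores, key=scores.get)
        if scores[best] < thresh:   # no remaining dependence -> terminate
            break
        parents.append(best)
    return parents

# Toy chain X0 -> X1 -> X2 with an autocorrelated driver
rng = np.random.default_rng(1)
T = 5000
x = np.zeros((T, 3))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + rng.normal()
    x[t, 1] = 0.8 * x[t - 1, 0] + rng.normal()
    x[t, 2] = 0.8 * x[t - 1, 1] + rng.normal()

print(sun_forward(x, 2))  # only X1 enters: given it, X0 adds nothing
```

Note how the indirect driver X0 never enters the parent set of X2: once X1 is selected, the conditional dependence of X2 on X0 vanishes, which is exactly why the set after Sun's forward phase tends to contain fewer indirect elements than Runge's unconditional filtering.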
In a system of n elements this leads to assessing up to 2^(n−1) subsets, leading to a worst-case complexity exponential in n. Importantly, the PCMCI variant of Runge's algorithm largely remedies this weakness by limiting, for each tentative parent, the number of subsets of size m it is tested against from above by a constant q_max, effectively providing a polynomial (quadratic) computational complexity of ∼ n^2 q_max. The last phase, added in the PCMCI algorithm to provide control of false positives at a predefined rate, does not substantially affect the computation time. Note that (similarly as in the original Runge's algorithm) the algorithm could be further sped up by limiting the maximum size of the condition by a constant m_max, leading to a further potential speedup in exchange for a higher false positive rate at the backward stage.

D. FACDA
Based on the theoretical analysis above, we conjecture that a key challenge for practically applicable algorithms is being able to deal with large dense networks. For this purpose, limiting oneself in each step to testing using only a few strongest candidates, instead of carrying out a full search through conditioning sets, might be a suitable heuristic. We implement this idea in a hybrid algorithm between Runge's and the PMIME algorithms, proposing thus a new Fast Approximate Causal Discovery Algorithm (FACDA), described in pseudo-code below.
Algorithm 7
The first phase of the FACDA algorithm

  K ← { X^j_{t−τ} | j ∈ n̂; 0 < τ ≤ τ_max }, Ñ_{X^i_t} ← ∅, p ← ∅
  while (K ∖ Ñ_{X^i_t}) ≠ ∅ do
    Ñ_{X^i_t} ← Ñ_{X^i_t} ∪ p
    for X^j_{t−τ} ∈ (K ∖ Ñ_{X^i_t}) do
      I^j_{t−τ} ← I(X^i_t; X^j_{t−τ} | Ñ_{X^i_t})
      if I^j_{t−τ} = 0 then
        K ← K ∖ { X^j_{t−τ} }
    [j_m, τ_m] ← argmax I^j_{t−τ}
    p ← { X^{j_m}_{t−τ_m} }

Algorithm 8
The second phase of the FACDA algorithm:

  for X^j_{t−τ} ∈ Ñ_{X^i_t} do
      if I(X^i_t ; X^j_{t−τ} | Ñ_{X^i_t} ∖ {X^j_{t−τ}}) = 0 then
          Ñ_{X^i_t} ← Ñ_{X^i_t} ∖ {X^j_{t−τ}}

To understand the relation of FACDA to the algorithms presented earlier, it is useful to introduce some concepts concerning feature selection procedures, in particular forward selection, backward selection and early dropping. The former two denote commonly used heuristic algorithms, which are specific instances of stepwise methods. In the basic forward selection algorithm, the predictor/feature set is initiated as empty, and in each step the variable with maximal improvement in model fit is added to the set. The usual stopping criterion is lack of improvement in model fit by any of the remaining variables. Conversely, the backward selection algorithm initiates the feature set by the whole set of available features/variables, and iteratively removes the least relevant one. Combination of these basic heuristic approaches gives rise to a rich family of feature selection methods. For a simplified overview of the phases and computational complexities of the compared algorithms see Table I.

From this perspective, the first phase of Sun's algorithm is a forward selection, while the second phase is a backward selection. Similarly, Runge's original algorithm consists of initialization of the feature set by a filtering step, with a subsequent variant of backward selection (using iteratively increasing subsets, potentially avoiding getting stuck in a local minimum). It is known that forward selection may suffer from a high count of false positives and relatively high computational demands for large data.
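As a concrete illustration, Algorithms 7 and 8 can be sketched in a few lines of Python for the linear Gaussian case, where the CMI is estimated via partial correlation and the idealized test "I = 0" is replaced by a finite threshold. The helper names (`gaussian_cmi`, `facda_parents`), the threshold value and the toy data are our illustrative choices, not part of the original specification:

```python
import numpy as np

def gaussian_cmi(x, y, z):
    """Estimate I(X;Y|Z) for near-Gaussian data via partial correlation;
    z is a (T, k) array of conditioning variables (k may be zero)."""
    if z.shape[1] > 0:
        # regress the conditioning set out of both variables
        x = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
        y = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)

def facda_parents(target, candidates, threshold=0.01):
    """FACDA sketch: forward selection with early dropping (Algorithm 7)
    followed by backward removal (Algorithm 8)."""
    pool = dict(candidates)                    # the candidate set K
    parents = {}                               # the set of tentative parents
    while pool:
        cond = (np.column_stack(list(parents.values()))
                if parents else np.empty((len(target), 0)))
        scores = {k: gaussian_cmi(v, target, cond) for k, v in pool.items()}
        # early dropping: discard candidates deemed conditionally independent
        pool = {k: v for k, v in pool.items() if scores[k] > threshold}
        if not pool:
            break
        best = max(pool, key=scores.get)       # strongest remaining candidate
        parents[best] = pool.pop(best)
    for k in list(parents):                    # backward removal phase
        rest = [v for m, v in parents.items() if m != k]
        cond = (np.column_stack(rest)
                if rest else np.empty((len(target), 0)))
        if gaussian_cmi(parents[k], target, cond) <= threshold:
            del parents[k]
    return set(parents)

# toy check: two true parents, one indirectly correlated distractor
rng = np.random.default_rng(1)
T = 4000
z1, z2, eps = rng.standard_normal((3, T))
target = 0.7 * z1 + 0.5 * z2 + 0.6 * eps
w = z1 + rng.standard_normal(T)     # correlated with target, not a parent
u = rng.standard_normal(T)          # independent distractor
found = facda_parents(target, {'z1': z1, 'z2': z2, 'w': w, 'u': u})
```

On this toy example, the forward phase drops `u` immediately and drops `w` once `z1` enters the conditioning set, so `found` contains exactly `z1` and `z2`.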
These problems can be alleviated by narrowing down the search by filtering out variables that are deemed conditionally independent of the target given the current set of selected variables – a heuristic recently introduced under the name Early Dropping. In this context, the first phase of the FACDA algorithm entails a forward selection accelerated by applying the early dropping heuristic, followed by the backward selection in the second phase of the FACDA algorithm. For a more detailed review of iterative feature selection procedures in a general context not specific to causal network inference, we refer to the latter paper, which explicitly introduces and studies the properties of Forward-Backward selection with early dropping (FBED_K), an algorithm combining K + 1 runs of the forward selection with early dropping with a final backward selection phase. Note that FBED can be proven to correctly identify the Markov blanket of the target variable under the faithfulness assumption and perfect statistical inference (for details and proof see ). In this general nomenclature, FACDA would correspond to FBED_∞ (or perhaps FBED_{nτ_max}, as only a finite set of possible parents is considered).

In the case of causal network inference from time series, we are interested in whether the algorithms correctly detect the causal parent set for each node. Below we show a sketch of a proof of such convergence for the FACDA algorithm; note that similar arguments apply to the other presented algorithms (apart from PMIME, which, due to the lack of the second phase, should provide a superset of the causal parents under the below assumptions). We assume the following conditions: causal sufficiency, which assumes that common causes of all variables are measured; faithfulness, which ensures that a true parent X^j_{t−τ} ∈ N_{X^i_t} will not be eliminated by any set of other elements S, i.e. I(X^i_t ; X^j_{t−τ} | S) ≠ 0; the causal Markov condition, which guarantees that all elements X^j_{t−τ} ∉ N_{X^i_t} will be eliminated by the set of all causal parents N_{X^i_t}, i.e. I(X^i_t ; X^j_{t−τ} | N_{X^i_t} ∖ {X^j_{t−τ}}) = 0; and perfect statistical inference. Of course, for finite size samples, statistical inference is imperfect, and therefore the proof below holds only asymptotically. For the exact definitions of faithfulness and the causal Markov condition see .

First, we show that the first phase of the FACDA algorithm finds a superset of the true causal parents, i.e. N_{X^i_t} ⊆ Ñ_{X^i_t}. Let us suppose that X^j_{t−τ} is a true parent of X^i_t, i.e. X^j_{t−τ} ∈ N_{X^i_t}, but X^j_{t−τ} ∉ Ñ_{X^i_t}. Hence there is a set L ⊂ Ñ_{X^i_t} such that I(X^i_t ; X^j_{t−τ} | L) = 0, which is in contradiction with faithfulness; hence X^j_{t−τ} ∈ Ñ_{X^i_t}.

In the second phase, faithfulness guarantees that no true causal parent X^j_{t−τ} ∈ N_{X^i_t} will be excluded. Now let us assume that X^j_{t−τ} ∈ (Ñ_{X^i_t} ∖ N_{X^i_t}) = M; we will prove that X^j_{t−τ} will be eliminated in the second phase of the FACDA algorithm. Based on step 2, element X^j_{t−τ} will be eliminated if I(X^i_t ; X^j_{t−τ} | Ñ_{X^i_t} ∖ {X^j_{t−τ}}) = 0. From the first part we know that N_{X^i_t} ⊆ Ñ_{X^i_t}, hence Ñ_{X^i_t} = N_{X^i_t} ∪ M. Now the causal Markov condition ensures that I(X^i_t ; M | N_{X^i_t}) = 0, because no elements of M are causal parents of X^i_t; hence from the weak union property I(X^i_t ; X^j_{t−τ} | N_{X^i_t} ∪ (M ∖ {X^j_{t−τ}})) = 0, thus I(X^i_t ; X^j_{t−τ} | Ñ_{X^i_t} ∖ {X^j_{t−τ}}) = 0 and X^j_{t−τ} will be eliminated; in the end Ñ_{X^i_t} = N_{X^i_t}.

E. CMI estimation and statistical testing
In all presented algorithms it has to be repeatedly decided whether a CMI is equal to zero or not. However, such an estimate from a finite sample is generally nonzero even for independent variables; therefore a statistical test is required of the null hypothesis H₀ in the form

H₀ : I(X ; Y | Z) = 0     (9)

at a predefined level of statistical significance θ.

For speed and tractability reasons, in our numerical simulations we use only linear Gaussian models of random processes. Thus we can efficiently utilize an estimate of the CMI based on the partial correlation ρ(X, Y | Z):

I(X ; Y | Z) = −(1/2) log(1 − ρ(X, Y | Z)²)     (10)

and thus we evaluate partial correlation instead of CMI. Note that in practice the choice of estimator has substantial impact on computational complexity; see also the Discussion section.

To test if the CMI is zero, the authors recommend using a permutation test, which does not assume normal distribution and independence of samples. In our data situation, due to the normality of the time series, we speed up the simulations by using the (approximate) default setting of the function partialcorr (MATLAB), in which the p-value is assessed by Student's t-test. Note that potential autocorrelation of the time series might lead to an increased false positive rate in the individual tests.

III. DATA EXAMPLES
The numerical comparison of the above presented algorithms is demonstrated on examples of vector autoregressive processes of order 1 (VAR(1) processes) in the form

X_t = A X_{t−1} + E_t     (11)

where E_t denotes a white noise vector E_t = (ε¹_t, …, εⁿ_t)⊤ with covariance matrix Cov(E_t) = I. The structural matrix A carries information about the causal relationships. If we express the i-th row of this vector equation as

X^i_t = a_{i,1} X¹_{t−1} + … + a_{i,n} Xⁿ_{t−1} + ε^i_t,     (12)

it is obvious that the i-th element of the system is affected by all elements for which a_{i,j} ≠ 0.

In our numerical simulations we always work with a known matrix A. From the expression (11), time series of length T are generated. These data serve as the input for the studied algorithms, whose output should ideally be the original matrix A, or more precisely the binary structure of the matrix A.

A. Randomly connected networks
We consider systems with a random interaction structure A, which we model by the Erdős-Rényi model of a random graph (matrix). In this model the probability of presence of a direct link between each two elements is given by a predefined density value D ∈ [0, 1]. Practically, we fix the required density of the matrix A (percentage of the direct links) and assign a value of 1 to the corresponding number of randomly selected elements. This binary matrix is further normalized to ascertain stationarity of the process by multiplying it with a constant s/λ_max, where λ_max is the largest eigenvalue (in absolute value) of the matrix A, and s ∈ (0, 1) is an optional parameter. We set s = 0.8.

B. Realistically connected networks
As real complex systems have a structure that is neither random nor strictly regular, we further use two datasets to provide realistic scenarios - one from the field of climatology and another from the field of neuroscience (described in detail below; a more technical data description is available in a previous publication concerning small-world bias in correlation graphs of real-world networks). We make the approximation that these datasets correspond to realizations of a VAR(1) process (11); therefore the elements of the matrix A are estimated from the original data using linear regression. Then we retain in the matrix only a predefined percentage of the largest elements; the rest is set to zero. This matrix is subsequently normalized by the constant s/λ_max, and this matrix A defines the VAR(1) process (11). Note that both these datasets also demonstrate a high level of autocorrelation of the time series, an important property of real-world data that may affect the causal network recovery – see Figures 2 and 4 in the Supplementary Material.

First we consider a 'climate network' constructed from regional daily time series. The network has 42 nodes and was obtained by thresholding the interaction matrix in a data-fitted VAR(1) model to 15 percent density. Details of the data origin and preprocessing are described in the Supplementary Material.

The second real-world example is a 'brain network'. The network has 90 nodes and was obtained by thresholding the interaction matrix in a data-fitted VAR(1) model to 5 percent density. We use data obtained as part of a study of healthy subjects' brain activity. The data describing the activity in 90 brain regions of 84 subjects were temporally concatenated in order to provide sufficiently long time series (20160 time points in total).
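The construction just described (lag-1 least-squares fit of the VAR(1) model, retention of the largest coefficients, normalization by s/λ_max) can be sketched as follows; the function name and the surrogate input data are our own illustration, standing in for the actual climate or brain recordings:

```python
import numpy as np

def realistic_coupling(data, density, s=0.8):
    """Structural matrix from data: fit X_t ~ A X_{t-1} by least
    squares, keep the `density` fraction of largest-magnitude
    coefficients, rescale so the spectral radius equals s < 1."""
    A = np.linalg.lstsq(data[:-1], data[1:], rcond=None)[0].T
    cut = np.quantile(np.abs(A), 1.0 - density)
    A[np.abs(A) < cut] = 0.0            # keep only the strongest links
    lam = np.max(np.abs(np.linalg.eigvals(A)))
    return s * A / lam                  # spectral radius becomes s

# surrogate data from a known sparse VAR(1)
rng = np.random.default_rng(0)
n, T = 10, 4000
A_true = 0.3 * np.eye(n)
A_true[np.arange(1, n), np.arange(n - 1)] = 0.25   # a driving chain
X = np.zeros((T, n))
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + rng.standard_normal(n)
A = realistic_coupling(X, density=0.15)
```

By construction the returned matrix has spectral radius s, so the VAR(1) process it defines is stationary, and only about the requested fraction of entries is nonzero.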
Details of the data origin and preprocessing are described in the Supplementary Material.

Note that while in both the brain and the climate case the approximation of the system by a linear vector autoregressive model of order one is clearly a drastic simplification of the original system, it has actually been previously shown to provide a surprisingly accurate representation of the observed system dynamics at commonly studied spatiotemporal scales.

IV. NUMERICAL RESULTS
In this section we compare the studied algorithms using numerical simulations. In particular, we study the accuracy and the computational demands of the algorithms. Concerning the accuracy, as the algorithms are meant to estimate the binary structure of the matrix A, we compare the ground-truth matrix χ(A) defined as

[χ(A)]_{i,j} = { 1 if a_{i,j} ≠ 0 ; 0 if a_{i,j} = 0 }     (13)

with the estimated matrix χ(Â). The accuracy of each algorithm is described by two error measures, the false positive ratio ε⁺ and the false negative ratio ε⁻, given by

ε⁺ = |{(i, j) | χ(A)_{i,j} = 0 ∧ χ(Â)_{i,j} = 1}| / |{(i, j) | χ(A)_{i,j} = 0}|,     (14)

ε⁻ = |{(i, j) | χ(A)_{i,j} = 1 ∧ χ(Â)_{i,j} = 0}| / |{(i, j) | χ(A)_{i,j} = 1}|.     (15)

The computational demands are quantified by the total time of calculation. The calculations were evaluated on a single core of an Intel(R) Xeon(R) CPU E5-2630 v2 2.60GHz processor; of course, mainly a relative interpretation of the computation time is informative, as the speed depends on many parameters of the hardware and implementation, and in practice parallelization is easily available to speed up the computation. Therefore, we also provide the number of evaluations of CMI.

The numerical simulations proceeded as follows. According to the expression (11), time series of length T were generated. For robustness of evidence, 35 independent realizations of time series were generated for each specific parameter setting (35 random matrices were generated for the ER model). From these data the matrix χ(Â) was determined using each of the algorithms. We present the median values of the false positive ratio, false negative ratio and computational demands.

FIG. 1: Accuracy of the algorithms evaluated on simulations of a VAR(1) process with random structural matrix A of density D = 10% and size n = 42. Dependence of the false positive ratio ε⁺ on the false negative ratio ε⁻ for a range of statistical test significance thresholds θ (denoted by increasing marker size) and time series lengths T.

A. White Noise & Erdős-Rényi model
The first studied model is a VAR(1) process with a random (Erdős-Rényi model) structural matrix A = ER(n, D). A realization of this model is a binary matrix A of dimension n × n with a density D of nonzero elements. For each of the 35 simulations, an independent realization of the random matrix A was generated.

For the randomly connected VAR process, we choose a network size corresponding to the above described climate dataset (n = 42), with density fixed to D = 10%. The corresponding Figure 1 shows the dependence of the false positive ratio on the false negative ratio for all algorithms. The simulation was carried out for a range of time series lengths T and a range of statistical threshold choices θ. Let us note that the accuracy of the PMIME algorithm corresponded almost perfectly to the results of Sun's algorithm and is thus not plotted separately. We also put into comparison only the last version of Runge's algorithm - PCMCI, due to its polynomial computational demands. The PCMCI algorithm is studied in two parameter settings which showed different accuracy. However, the parameter setting in PCMCI does not have a significant effect on the computational demands; for this reason, we only include the results of numerical simulations of computational demands of the version PCMCI (p_x = 0).

FIG. 2: Accuracy of the algorithms evaluated on simulations of a VAR(2) process with random structural matrices of density D = 10% and size n = 20. Visualization and settings as in Figure 1.

As can be seen from Figure 1, in line with reasonable expectations, the overall error of the algorithms decreases with the increasing length of time series T. More interestingly, the accuracy of the algorithms seems to be comparable; only the PCMCI algorithm (for both parameter settings: p_x = 0 and p_x = 1) slightly differs from the others. This observation will be discussed in more detail in subsection IV B. The hypothetical curves of ε⁺ as a function of ε⁻ largely overlap.
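The error measures of Eqs. (14) and (15) underlying Figure 1 reduce to a few lines of code; `error_ratios` is our illustrative helper name:

```python
import numpy as np

def error_ratios(A_true, A_hat):
    """False positive ratio eps+ and false negative ratio eps-
    of Eqs. (14)-(15), comparing the binarized ground truth chi(A)
    with the binarized estimate chi(A_hat)."""
    truth = A_true != 0
    est = A_hat != 0
    fp = np.sum(~truth & est) / np.sum(~truth)   # Eq. (14)
    fn = np.sum(truth & ~est) / np.sum(truth)    # Eq. (15)
    return fp, fn

# toy example: one spurious link detected, one true link missed
A_true = np.array([[0.5, 0.0], [0.4, 0.5]])
A_hat  = np.array([[0.6, 0.1], [0.0, 0.5]])
fp, fn = error_ratios(A_true, A_hat)   # fp = 1/1, fn = 1/3
```

Note that both ratios are normalized by the respective ground-truth counts, so ε⁺ and ε⁻ are directly comparable across network densities.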
However, for a fixed value of θ, these algorithms are not comparable in their error rates – the algorithms of Sun/PMIME and PCMCI give more false positives and fewer false negatives, i.e. they are more liberal. Conversely, our algorithm FACDA is more conservative.

A similar result concerning accuracy is reproduced also for denser networks – see Supplementary Material Figure 5 for results obtained for a corresponding simulation using a network density D = 15%. Indeed, here longer time series were needed to achieve comparable accuracy.

Comparison of the algorithms was also carried out on an example of a VAR(2) model (see Figure 2), with both lag-1 and lag-2 matrices having density 10 percent and network size n = 20 nodes. Stationarity of the corresponding VAR(2) process was again ensured by scaling the matrices so that the leading eigenvalue of the corresponding VAR(1) matrix is fixed to 0.8.

Comparison of computational demands was carried out for a range of network sizes n and densities D. For density D = 0 the VAR(1) process is equivalent to the vector form of a white noise process. As a baseline example, the total computation time for Sun's method is shown in Figure 3. In line with the theoretical expectation, the computational demands grow substantially with increasing network size and density.

As documented in Figure 4, the new PCMCI algorithm by Runge et al. provides, particularly for large dense networks, a substantial speedup over the original Sun's algorithm. Similar if not better performance than PCMCI is provided by our algorithm FACDA. A detailed comparison with respect to Sun's algorithm is shown in Figure 5.

As described in Subsection II B, in the current simplified setup, the PMIME algorithm is equivalent to the first phase of Sun's algorithm. For this reason, the computational demands of PMIME are compared directly to those of Sun's algorithm in Figure 6.

FIG. 3: Computational demands of network estimation by Sun's algorithm. Results for a range of network sizes n and densities D. Decadic logarithm of the median computation time in seconds shown in grayscale. Time series of size T = 1024 were generated from a VAR(1) process with random structural matrix A; statistical threshold set to a fixed θ.

B. Realistic datasets
Further simulations were carried out with structural matrices derived from the realistic datasets, including the climatic dataset (n = 42) with density of the structural matrix D = 15% and the brain dataset (n = 90) with density of the structural matrix D = 5%. The corresponding structural matrices are shown in the Supplementary Material Figures 1-4. Numerical assessment of the accuracy of the algorithms was carried out for a range of time series lengths T and thresholds θ.

The simulation results are shown in Figure 7 and Figure 8, respectively. Similarly to the Erdős-Rényi model, the simulations also suggest that the PCMCI (p_x = 0), Sun's and FACDA algorithms are comparable in their accuracy. For both realistic datasets, the achieved accuracy was lower than for the randomly connected networks analyzed in the previous section (for a given time series length). This can be ascribed to the heterogeneous strength of links in realistic datasets, with a substantial proportion of relatively weak links that are difficult to estimate correctly from short samples. As in the case of the Erdős-Rényi model, we simulated also a VAR(2) process modeling the 'climate network', further supporting the previous conclusions; see Supplementary Results Figure 7.

However, the PCMCI (p_x = 1) differs from the other three in that while it achieves a lower false positive ratio (which is fixed by the value of θ, as can be seen in Figure 11), this is more than outweighed by increases in the false negative ratio. While this effect is present also in the simulated ER random networks in a weaker form, it is clearest for these inhomogeneous networks.

The results of computational demands for the 'climate network' with 15 percent density are shown in Figure 9. While the worst-case complexity is polynomial for both (PCMCI and Sun's) algorithms, particularly for low values of q_max the PCMCI is faster. The FACDA algorithm provides similar if not better performance than the PCMCI algorithm, that is, a substantial speedup particularly for large dense networks. A detailed comparison with respect to Sun's algorithm is shown in Figure 10. Qualitatively equivalent results were obtained for other settings; see the results for the 'brain network' with density 5% in the Supplementary Materials Figure 6.

FIG. 5: Relative computational demands of network estimation by FACDA and Sun's algorithm. Visualization and settings as in Figure 3.

FIG. 6: Relative computational demands of network estimation by Sun's with respect to the PMIME algorithm. Median computation time of Sun's algorithm divided by median computation time of PMIME is shown. Other visualization and parameter settings as in Figure 3.

FIG. 7: Accuracy of the algorithms evaluated on a VAR(1) process from climate data with structural matrix A of density D = 15%. Dependence of the false positive ratio ε⁺ on the false negative ratio ε⁻ for a range of thresholds θ and time series lengths T.

V. DISCUSSION
The comparison of the algorithms (PCMCI, Sun's, and PMIME by Kugiumtzis) has shown that in realistic settings they do not substantially differ in accuracy (PCMCI (p_x = 0)), across a range of systems and parameter settings.

The newly introduced FACDA method appears to keep the improved computational performance without the detrimental effects on the accuracy, giving similar or better results than the original three methods.

Notably, there are substantial differences in computational demands among the methods. Only a subtle difference exists between the PMIME and Sun's method, given by PMIME missing a second phase – for larger networks this difference appeared negligible.

Runge's original algorithm is remedied in its new variant, PCMCI, which limits the number of tests in each cycle to q_max, leading thus to at most polynomial complexity. Similar or even stronger improvement is also achieved in the FACDA approach, which provided here up to an order of magnitude speedup over Sun's algorithm in the case of large dense networks. Notably, while the FACDA method can be considered as derived from Sun's approach, a theoretical comparison shows that it is conceptually a hybrid between this and PCMCI, being equivalent to PCMCI with several alterations: fixing q_max = 1, accepting the strongest candidate in each cycle without testing, defining the strength in each cycle by the current CMI instead of the lowest value achieved so far, marking for removal instead of removing straight away (achieving thus order-independence, and therefore irrelevance of the order of testing candidates in line 6 of the PC phase of PCMCI; while FACDA chooses lexicographic order, its change would generally alter the specific results), and omitting the final (MCI) phase. The specific or combined effect of these variations is a topic for further study that may lead to potential improvement of the algorithms. As it may depend on system parameters, one of the possible avenues is to provide adaptive data-informed algorithms.

A somewhat open problem is the choice and overall statistical interpretation of the threshold parameter controlling the leniency of the statistical test for the inclusion (or exclusion) of a candidate parent. Firstly, setting it to a given value θ does not guarantee fixing the resulting false positive rate to such value (not even asymptotically), due to the complex multiple testing procedure giving rise to the resulting networks – unless a final 'repair' phase is included, as in the PCMCI algorithm. Secondly, as this bias differs between methods, setting the same θ leads to different behaviour of the methods, as they work at a different point along their receiver operating characteristic curve.

FIG. 8: Accuracy of the algorithms on a VAR(1) process from brain data with structural matrix A of density D = 5%. Visualization and settings as in Figure 7.

FIG. 9: Ratio of the total time of network estimation of PCMCI with respect to Sun's algorithm as a function of the statistical threshold θ. Results for time series of lengths T ∈ {128, 512, 2048, 8192} generated from a VAR(1) process with structural matrix A (density D = 15%) from the climate dataset.

FIG. 10: Ratio of the total time of network estimation of FACDA with respect to Sun's algorithm as a function of the statistical threshold θ. Visualization as in Figure 9.

FIG. 11: Accuracy of the algorithms in simulations of a VAR(1) process from brain data with structural matrix of density D = 5%. Dependence of the false positive ratio ε⁺ on the parameter θ for time series length T = 2048.

FIG. 12: Accuracy of the PCMCI algorithm in dependence on the parameter p_X, evaluated on simulations of a VAR(1) process with random structural matrix of density D = 10%, size n = 42, sample length T = 256.
In particular, for a fixed θ, FACDA typically gave fewer false positives but more false negatives, so the overall procedure can be considered more conservative for a fixed θ. However, similar performance can be obtained from Sun's method and PMIME by decreasing their θ parameter.

The MCI phase of the PCMCI algorithm guarantees (asymptotically) the control of the false positive rate at the predefined level α. However, it is likely responsible for the overall decreased performance (particularly because the testing of conditional independences is carried out with respect to the parents of both source and target, therefore working with larger condition sets and smaller estimated effects). This is even stronger for a high maximum included number of source parents p_X, as shown in Figure 12.

On the other hand, apart from estimating causal strength stripped of the autocorrelation effects, the MCI phase of PCMCI has the advantage that the false positive rate is controlled asymptotically at the prescribed level given by the statistical threshold in this phase; see Figure 11. This was in the simulations approximately true also for Sun's method, while FACDA has lower-than-prescribed false positive rates, which can be attributed to the early reduction of the candidate set.
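The baseline behavior of the elementary test itself can be checked directly: under the null hypothesis, the t-statistic of the sample partial correlation (the test behind MATLAB's partialcorr, re-implemented here with our own helper names and a normal approximation to the t distribution) rejects at p < θ with a rate close to θ. It is the iterative multiple-testing procedures built on top of this test that lose this direct guarantee:

```python
import math
import numpy as np

def partial_corr_pval(x, y, z):
    """Two-sided p-value for the null rho(X,Y|Z) = 0, using the
    t-statistic with T - 2 - dim(Z) degrees of freedom (normal
    approximation to the t distribution, adequate for T ~ hundreds)."""
    x = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    y = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    df = len(x) - 2 - z.shape[1]
    t = r * math.sqrt(df / (1.0 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2.0))

# empirical false positive rate of the single test under the null:
# x and y depend on z but not on each other
rng = np.random.default_rng(0)
theta, T, reps = 0.05, 200, 2000
hits = 0
for _ in range(reps):
    z = rng.standard_normal((T, 2))
    x = z @ np.array([0.5, 0.3]) + rng.standard_normal(T)
    y = z @ np.array([0.4, -0.2]) + rng.standard_normal(T)
    hits += partial_corr_pval(x, y, z) < theta
fpr = hits / reps   # close to theta for this single, well-calibrated test
```

Running the same experiment through a whole iterative parent-selection procedure instead of a single test would, as discussed above, generally not return a network-level false positive rate equal to θ.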
Results for shorter time series are shown in the Supplementary Material Figure 9; note that for small samples the parametric partial correlation test may be imprecise and the use of some permutation scheme may be more suitable for exact control of the false positive rate; for longer time series (Supplementary Material Figure 8) the false positive rates of Sun's and PCMCI methods converge to the prescribed value.

Notably, the provided numerical comparisons were carried out using linear vector autoregressive processes. This is a standard type of stochastic system used in the original papers introducing the methods, as it allows more extensive numerical comparisons due to the possibility of very efficient estimation of CMI even in high dimensions through the use of partial correlation. Indeed, for Gaussian processes the transfer entropy is equivalent to Granger causality, which supports the use of linear methods for data that are deemed reasonably close to Gaussian; however, even in the linear case, reduction of the number of conditions may be computationally beneficial. When the assumption of Gaussianity is not suitable, other estimators of the (conditional) mutual information need to be used, and this may further (detrimentally) affect both the accuracy and the computational demands of the algorithms, in ways that would depend on the particular estimator in use. In this sense, our results provide only a rough guide, valid as long as these extra demands are comparable across methods. Experiments with kNN estimators and other variants reported in the literature find considerable trade-offs in runtime, showing that sometimes it is faster to run a full conditioning and sometimes not, offering interesting insights while providing space for the development of adaptive approaches.

Apart from the general argument mentioned above, it is important to note that the linear (Gaussian) approximation of the CMI by partial correlation is indeed commonly used in causal network discovery practice.
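To give a flavor of the alternative estimators mentioned above, a brute-force bivariate version of the popular kNN (KSG) mutual information estimator fits in a few lines of numpy; this is our own minimal sketch (quadratic in sample size, unlike tree-based implementations), not the estimator used in any of the compared packages:

```python
import numpy as np

def ksg_mi(x, y, k=4):
    """Brute-force KSG (kNN) estimate of I(X;Y) in nats for 1-D
    samples: max-norm k-th neighbor distances in the joint space,
    strict neighbor counts in the marginals (O(N^2) memory/time)."""
    n = len(x)
    # digamma at positive integers: psi(m) = -gamma + H_{m-1}
    gamma = 0.5772156649015329
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n + 1))))
    psi = lambda m: H[np.asarray(m) - 1] - gamma
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dj = np.maximum(dx, dy)                     # joint max-norm distances
    eps = np.sort(dj, axis=1)[:, k]             # k-th neighbor (col 0 = self)
    nx = np.sum(dx < eps[:, None], axis=1) - 1  # strictly closer, minus self
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    return psi(k) + psi(n) - float(np.mean(psi(nx + 1) + psi(ny + 1)))

rng = np.random.default_rng(2)
x = rng.standard_normal(1500)
y = 0.6 * x + 0.8 * rng.standard_normal(1500)
mi = ksg_mi(x, y)   # analytic Gaussian value: -0.5*log(1 - 0.36) ~ 0.223
```

Even this modest sample size makes the quadratic cost noticeable relative to the closed-form partial correlation estimate, illustrating why the choice of estimator dominates the computational comparison in the non-Gaussian setting.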
This pragmatic choice is for many systems indeed substantiated by quantitative evidence concerning the near-Gaussianity of the studied time series, e.g. for brain activity data measured by functional magnetic resonance imaging or climate temperature time series.

Notably, even in the linear setting, the compared algorithms become computationally intractable for networks larger than about a hundred nodes (particularly for larger network densities). This constitutes a serious limitation. For such situations, some amendments to the methods or the use or development of other algorithms would be necessary. As a sidenote, a more detailed analysis suggests that the limiting factor is not necessarily the overall link density, but the maximum in-degree, i.e. the maximal (candidate) parent set. Conversely, further speedup can of course be achieved e.g. by limiting the maximum size of the conditioning set or the number of tested conditions, with the trade-off of a larger false positive rate in the forward phase. Such parametric variation of the algorithm (explicitly suggested e.g. in PCMCI) effectively renders a family of methods, of which we tested only some recommended default variants.

Alternatively, the use of some weak heuristic assumptions may allow effective estimation of even much larger causal networks. An example is the successful estimation of the global climate network of causal interactions based on temperature measurements in 2512 equidistantly distributed nodes on the globe, based on estimation of Granger causality and selection of the outgoing link with the highest Granger causality index for each node. Despite being apparently simplistic, this approach was well suited to the data and allowed the discovery of a smooth causal flow in the global climate network, which until then went unnoticed due to the inability of general-purpose causal network analysis methods to deal with such a large network.
Yet another alternative approach is to reduce the dimension before constructing the network by a suitable procedure.

Of course, while we have compared the most prominent algorithms introduced within the complex networks community, there are other alternatives for causal network estimation, building on the concepts of nonlinear extensions of Granger causality analysis as well as applying regularization procedures. From the breadth of reports concerning network reconstruction we direct the attention of the reader at least to a recent work including an interesting combination of these ideas and also a useful comparison and overview of some of these alternative methods. Yet another family of methods for the detection of causal interactions has developed in the area of deterministic nonlinear dynamics; we refer the reader to comparative reviews for a detailed discussion of a range of methods formulated for detecting causality in the bivariate case. Generalization of many nonlinear methods to the fully multivariate setting is not readily available and is a matter of further research; however, for some indices it is already available. Apart from the use of CMI in the multivariate setting, and the already mentioned nonlinear kernel Granger approaches, another recently proposed principled Granger causality generalization is the definition of nonlinear Granger causality through local linearization. This approach provides a consistent and well defined generalization of linear Granger causality and lends itself to straightforward generalization to the conditional and multivariate setting.

The research in causal network discovery is a very dynamic field that is being addressed by experts from multiple fields, sometimes not necessarily aware of the developments in other disciplines.
We believe that further progress will be made by cross-fertilization between various approaches including the methods compared in this paper (PC-algorithm variants or other iterative approaches), regularization techniques and Bayesian inference with context-informed priors.

On the other hand, the suitability, or at least the proper interpretation, of the characterization of the causal structure by a (directed) network has been recently problematized, see e.g. Ref. . In the current paper, we have on purpose used a system example that does not contain higher-order (polyadic) dependences; the true causal structure is thus unambiguous and well represented by a directed graph. However, to at least comment on the potentially difficult to interpret behavior of the considered algorithms when applied to processes with higher-order dependences, we invite the reader to consider the case of a process given by Y_{t+1} = XOR(X_t, Z_t), with X_t, Z_t being independent boolean variables with p(0) = p(1) = 0.5. The initial inclusion decisions of the presented algorithms are based on the mutual informations I(Y_{t+1}; X_t) and I(Y_{t+1}; Z_t), both of which are equal to zero (of course, in practice, random sampling would give rise to some 'false' detections). While this example may seem singular or too artificial, less trivial and more realistic examples could be considered, and we agree that for general complex systems, particularly with substantial higher-order interactions, we may need more fundamental theoretical formalisms as well as algorithms that would allow a suitable representation of the causal structure, going possibly beyond bivariate dependences – we refer the reader to recent works in this area and references therein for a discussion of the possible avenues.

VI. CONCLUSIONS
We have carried out a comparison of several prominent algorithms for causal network reconstruction. While they were originally introduced within slightly different contexts (such as explicit inclusion of arbitrary temporal lag or multivariate target variables), these algorithms share common ground and are related to the general PC-algorithm. The main difference between the algorithms is whether they use correlation or partial correlation for deciding on inclusion into the set of candidate parents in an initial phase of the algorithm, and whether they include a second phase for removal of indirect links from this set of candidate parents. By testing the algorithms using simulations of Gaussian processes on randomly and realistically connected networks (motivated by neuroscientific and climate data), we have shown that in practical usage these algorithms provide close to equivalent performance. However, the methods differ in their computational demands, most substantially for large networks: for sparse networks, selection of candidate parents by a single run of mutual information can be more effective; for denser networks, using CMI in the first phase provides substantial speedup through decreasing the size of the candidate parents set. However, similar computational demands can be achieved in the reduction phase by limiting the testing to a heuristically selected non-exhaustive sampling of the strongest conditions. We also commented on the problems of control of false positives and order-dependence, although for a detailed discussion we referred the reader to other works.

Finally, we have proposed a new hybrid Fast Approximate Causal Discovery Algorithm (FACDA), designed for improved performance while essentially conserving accuracy. Despite the current progress in algorithms, large and dense networks represent a challenge for all presented methods, constituting a key open problem in causal network analysis.
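For Gaussian processes, where CMI reduces to a function of partial correlation, the two-phase scheme summarized above can be sketched in a few lines. This is a minimal illustrative sketch under simplifying assumptions, not any of the compared implementations: the names `two_phase_parents` and `parcorr`, the restriction to lag 1, and the fixed correlation threshold (in place of a proper significance test) are ours.

```python
import numpy as np
from itertools import combinations

def parcorr(x, y, Z):
    """Partial correlation of x and y given the columns of Z,
    computed from linear-regression residuals."""
    if Z.size:
        A = np.column_stack([Z, np.ones(len(x))])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def two_phase_parents(data, target, threshold=0.1):
    """Lag-1 parents of `target` in a (samples x variables) array.

    Phase 1 admits candidates by plain bivariate correlation; phase 2
    prunes links explained away by conditioning on subsets of the
    remaining candidates (for Gaussian processes, partial correlation
    plays the role of CMI).
    """
    past, future = data[:-1], data[1:, target]
    # Phase 1: bivariate screening of all potential lag-1 parents.
    candidates = [j for j in range(data.shape[1])
                  if abs(np.corrcoef(past[:, j], future)[0, 1]) > threshold]
    # Phase 2: iterative conditioning on subsets of co-candidates.
    parents = set(candidates)
    for j in candidates:
        others = [k for k in parents if k != j]
        if any(abs(parcorr(past[:, j], future, past[:, list(S)])) <= threshold
               for size in range(1, len(others) + 1)
               for S in combinations(others, size)):
            parents.discard(j)
    return sorted(parents)
```

For example, on a simulated chain X0 -> X1 -> Y, phase 1 admits all three variables as candidate parents of Y (due to indirect and serial correlations), while phase 2 retains only the direct parent X1. The exponential worst case of the subset search in phase 2 is exactly the cost that the compared algorithms reduce, e.g. by a stricter first phase or by restricting the conditioning to the strongest candidates.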
VII. SUPPLEMENTARY MATERIAL
See Supplementary material for the results of further analysis described in the Results section.
ACKNOWLEDGMENTS
This work was supported by the Czech Health Research Council Projects No. NV15-29835A, No. NV15-33250A, and No. NV17-28427A; and by project No. LO1611 with financial support from the MEYS under the NPU I program. We thank Nikola Jajcay, David Hartman and David Tomeček for valuable help with data preparation.

S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D. U. Hwang, "Complex networks: Structure and dynamics," Physics Reports, 175–308 (2006).
K. J. Friston, "Functional and effective connectivity in neuroimaging: A synthesis," Human Brain Mapping, 56–78 (1994).
K. G. Schilling, A. Daducci, K. Maier-Hein, C. Poupon, J.-C. Houde, V. Nath, A. W. Anderson, B. A. Landman, and M. Descoteaux, "Challenges in diffusion MRI tractography – lessons learned from international benchmark competitions," Magnetic Resonance Imaging (2018).
J. Hlinka and S. Coombes, "Using computational models to relate structural and functional brain connectivity," European Journal of Neuroscience, 2137–2145 (2012).
J. Hlinka and M. Hadrava, "On the danger of detecting network states in white noise," Frontiers in Computational Neuroscience (2015), 10.3389/fncom.2015.00011.
M. G. Preti, T. A. Bolton, and D. V. D. Ville, "The dynamic functional connectome: State-of-the-art and perspectives," NeuroImage, 41–54 (2017).
N. Wiener, "Modern mathematics for engineers," (McGraw-Hill, New York, 1956), Chap. The theory of prediction, pp. 165–190.
C. W. Granger, "Investigating causal relations by econometric model and cross spectral methods," Econometrica, 424–438 (1969).
P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. (MIT Press, 2000).
T. Schreiber, "Measuring information transfer," Physical Review Letters, 461–464 (2000).
M. Palus, V. Komarek, T. Prochazka, Z. Hrncir, and K. Sterbova, "Synchronization and information flow in EEG of epileptic patients," IEEE Engineering in Medicine and Biology Magazine, 65–71 (2001).
A. Kraskov, H. Stogbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, 066138 (2004).
J. Runge, J. Heitzig, V. Petoukhov, and J. Kurths, "Escaping the curse of dimensionality in estimating multivariate transfer entropy," Physical Review Letters (2012).
D. Kugiumtzis, "Direct-coupling information measure from nonuniform embedding," Phys. Rev. E, 062918 (2013).
J. Sun, D. Taylor, and E. M. Bollt, "Causal network inference by optimal causation entropy," SIAM Journal on Applied Dynamical Systems, 73–106 (2015).
J. Runge, D. Sejdinovic, and S. Flaxman, "Detecting causal associations in large nonlinear time series datasets," arXiv preprint 1702.07007v1.
R. G. James, N. Barnett, and J. P. Crutchfield, "Information flows? A critique of transfer entropies," Phys. Rev. Lett., 238701 (2016).
R. G. James and J. P. Crutchfield, "Multivariate dependence beyond Shannon information," Entropy, 531 (2017).
E. A. Martin, J. Hlinka, A. Meinke, F. Děchtěrenko, J. Tintěra, I. Oliver, and J. Davidsen, "Network inference and maximum entropy estimation on information diagrams," Scientific Reports, 7062 (2017).
B. Allen, B. C. Stacey, and Y. Bar-Yam, "Multiscale information theory and the marginal utility of information," Entropy, 273 (2017).
J. Hlinka, D. Hartman, M. Vejmelka, J. Runge, N. Marwan, J. Kurths, and M. Paluš, "Reliability of inference of directed climate networks using conditional mutual information," Entropy, 2023–2045 (2013).
Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B (Methodological), 289–300 (1995).
G. Borboudakis and I. Tsamardinos, "Forward-backward selection with early dropping," arXiv preprint arXiv:1705.10770 (2017).
J. Hlinka, D. Hartman, N. Jajcay, D. Tomeček, J. Tintěra, and M. Paluš, "Small-world bias of correlation networks: From brain to climate," Chaos: An Interdisciplinary Journal of Nonlinear Science, 035812 (2017).
J. Hlinka, M. Palus, M. Vejmelka, D. Mantini, and M. Corbetta, "Functional connectivity in resting-state fMRI: Is linear correlation sufficient?" NeuroImage, 2218–2225 (2011).
D. Hartman, J. Hlinka, M. Paluš, D. Mantini, and M. Corbetta, "The role of nonlinearity in computing graph-theoretical properties of resting-state functional magnetic resonance imaging brain networks," Chaos (2011).
J. Hlinka, D. Hartman, M. Vejmelka, D. Novotna, and M. Palus, "Non-linear dependence and teleconnections in climate data: sources, relevance, nonstationarity," Climate Dynamics, 1873–1886 (2014).
L. Barnett, A. B. Barrett, and A. K. Seth, "Granger causality and transfer entropy are equivalent for Gaussian variables," Physical Review Letters (2009).
J. Hlinka, N. Jajcay, D. Hartman, and M. Paluš, "Smooth information flow in temperature climate network reflects mass transport," Chaos: An Interdisciplinary Journal of Nonlinear Science, 035811 (2017).
M. Vejmelka, L. Pokorna, J. Hlinka, D. Hartman, N. Jajcay, and M. Palus, "Non-random correlation structures and dimensionality reduction in multivariate climate data," Climate Dynamics, 2663–2682 (2015).
J. Runge, V. Petoukhov, J. F. Donges, J. Hlinka, N. Jajcay, M. Vejmelka, D. Hartman, N. Marwan, M. Palus, and J. Kurths, "Identifying causal gateways and mediators in complex spatio-temporal systems," Nature Communications, 8502 (2015).
D. Marinazzo, M. Pellicoro, and S. Stramaglia, "Kernel method for nonlinear Granger causality," Phys. Rev. Lett., 144103 (2008).
A. Arnold, Y. Liu, and N. Abe, "Temporal causal modeling with graphical Granger methods," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07 (ACM, New York, NY, USA, 2007), pp. 66–75.
G. Yang, L. Wang, and X. Wang, "Reconstruction of complex directional networks with group lasso nonlinear conditional Granger causality," Scientific Reports (2017).
E. Pereda, R. Quiroga, and J. Bhattacharya, "Nonlinear multivariate analysis of neurophysiological signals," Progress in Neurobiology, 1–37 (2005).
D. Chicharro and R. G. Andrzejak, "Reliable detection of directional couplings using rank statistics," Phys. Rev. E, 026217 (2009).
B. Wahl, U. Feudel, J. Hlinka, M. Wächter, J. Peinke, and J. A. Freund, "Granger-causality maps of diffusion processes," Physical Review E, 022213 (2016).
B. Wahl, U. Feudel, J. Hlinka, M. Wächter, J. Peinke, and J. A. Freund, "Conditional Granger causality of diffusion processes," The European Physical Journal B 90.