[PDF] Inferring network properties from time series using transfer entropy and mutual information: validation of multivariate versus bivariate approaches

Abstract

Functional and effective networks inferred from time series are at the core of network neuroscience. Interpreting their properties requires inferred network models to reflect key underlying structural features; however, even a few spurious links can distort network measures, challenging functional connectomes. We study the extent to which micro- and macroscopic properties of underlying networks can be inferred by algorithms based on mutual information and bivariate/multivariate transfer entropy. The validation is performed on two macaque connectomes and on synthetic networks with various topologies (regular lattice, small-world, random, scale-free, modular). Simulations are based on a neural mass model and on autoregressive dynamics (employing Gaussian estimators for direct comparison to functional connectivity and Granger causality). We find that multivariate transfer entropy captures key properties of all networks for longer time series. Bivariate methods can achieve higher recall (sensitivity) for shorter time series but are unable to control false positives (lower specificity) as available data increases. This leads to overestimated clustering, small-world, and rich-club coefficients, underestimated shortest path lengths and hub centrality, and fattened degree distribution tails. Caution should therefore be used when interpreting network properties of functional connectomes obtained via correlation or pairwise statistical dependence measures, rather than more holistic (yet data-hungry) multivariate models.

Full PDF

IInferring network properties from time series via transfer entropy andmutual information: validation of bivariate versus multivariate approaches

Leonardo Novelli ∗ and Joseph T. Lizier Centre for Complex Systems, Faculty of Engineering, The University of Sydney, Sydney, Australia (Dated: July 16, 2020)Functional and eﬀective networks inferred from time series are at the core of network neuroscience.Since it is common practice to compare network properties between patients and controls, it iscrucial for inferred network models to reﬂect key underlying structural properties. However, evena few spurious links severely distort the shortest-path length and derived measures including thesmall-world coeﬃcient, providing misleading insights into the macroscopic topology of the network.This poses a challenge for functional connectomes based on correlation, whose transitivity necessarilyinduce numerous false positives. We assess the extent to which false positives bias the propertiesof the networks inferred by bivariate and multivariate algorithms. Mutual information, bivariatetransfer entropy, and multivariate transfer entropy are respectively used for inferring functional,directed-functional, and eﬀective networks. The validation is performed on synthetic ground truthnetworks with 100–200 nodes and various topologies (regular lattice, small-world, random, scale-free,modular), as well as on a real macaque connectome. The time series of node activity are simulated viasimple autoregressive dynamics. We ﬁnd that multivariate transfer entropy is able to capture the keyproperties of all network structures for longer time series. Bivariate methods can achieve higher recall(sensitivity) for shorter time series under some circumstances; however, as available data increases,they are unable to control false positives (lower speciﬁcity). This leads to overestimated clustering,small-world, and rich-club coeﬃcients, underestimated shortest path lengths and hub centrality,and fattened degree distribution tails. Caution should therefore be used when interpreting networkproperties of functional connectomes obtained via correlation or pairwise statistical dependencemeasures, rather than more holistic (yet data-hungry) multivariate models.

I. INTRODUCTION

Functional and eﬀective network inference in neuro-science typically involves pre-processing the data, deﬁn-ing the parcellation, extracting the time series, inferringthe links in the model network, and measuring networkproperties, e.g. to compare patients and controls or to pre-dict phenotype [1, 2]. Each step in the pipeline requiresmaking modelling and analysis choices, whose inﬂuenceon the ﬁnal results is the subject of ongoing research [3–6]. As part of this eﬀort, we study how the choice ofdiﬀerent inference algorithms aﬀects the properties of theresulting network model in comparison to the underlyingstructural network, and whether the ability to accuratelyreﬂect these properties changes across diﬀerent underlyingstructural networks.The structure (or topology) of a network can be de-scribed at multiple scales [1]: from the microscopic (in-dividual links), to the mesoscopic (modules and motifs)and the macroscopic (summary statistics, such as averageshortest-path length and measures of small-worldness) [7].At each scale, the structure is associated with develop-ment, ageing, cognition, and neuropsychiatric diseases [8].Previous studies have assessed the performance of diﬀerentnetwork inference algorithms in identifying the structurallinks at the microscale [9–12]. The goal of this work isto extend the assessment to all scales and to a variety ∗ [email protected] of topologies, across a range of related network inferencealgorithms. We link the performance at the microscale tothe resulting network properties at the macroscale, anddescribe how this changes as a function of the overalltopology.We compare bivariate and multivariate approaches forinferring network models, employing statistical depen-dence measures based on information theory [13]. Theseapproaches include functional network inference , whichproduces models of networks of pairwise or bivariate statis-tical relationships between nodes, and can either quantifyundirected statistical dependence, in the case of mutualinformation (MI) [14], or directed dependence, in the caseof transfer entropy (TE) [15, 16]. These approaches alsoinclude eﬀective network inference , which is intended toproduce the simplest possible circuit models that explainthe observed responses [17]. In this class, we evaluate theuse of multivariate TE, which, in contrast to the bivariateapproaches, aims to minimise spurious links and inferminimal models of the parent sets for each target nodein the network.All of these inference techniques seek to infer a net-work model of the relationships between the nodes in asystem. Diﬀerent methods capture diﬀerent aspects ofthese relationships and don’t necessarily seek to repli-cate the underlying structural topology, nor do we expectthem to in general (particularly in neuroimaging experi-ments, where aspects of the structure may be expressedmore or less or not at all, depending on the cognitivetask). Why do we then seek to evaluate these methodsin inferring microscopic, mesoscopic and macroscopic fea- a r X i v : . [ q - b i o . N C ] J u l ture of the underlying network structure? Under certaincircumstances, we do expect eﬀective network inferencemethods to replicate underlying structure: in the absenceof hidden nodes (and other simplifying assumptions, in-cluding stationarity), networks inferred via multivariateTE are proven to converge to the underlying structurefor suﬃciently long time series [18, 19]. It is important tovalidate that these algorithms perform to that expectationwhere it is applicable; in doing so, we also address therecent call for more extensive model diversity in testingmultivariate algorithms: “ To avoid biased conclusions, alarge number of diﬀerent randomly selected connectivitystructures should be tested [including link density as wellas properties such as small-worldness] ” [19]. Outside ofthese circumstances though, it is at least desirable forfunctional and eﬀective networks to recognise importantfeatures in the underlying network structure: to trackoverall regime changes in that structure reliably, and toreﬂect the properties of distinctive nodes (or groups ofnodes) in the structure.This motivates our validation study, which is primarilybased on synthetic datasets involving ground truth net-works of 100–200 nodes with diﬀerent topologies, fromregular lattice to small-world and random (Section III),scale-free (Section IV), and modular (Section V). Manyof these structural properties are incorporated in themacaque connectome analysed in Section VI. At themacroscale, we measure several fundamental and widely-used properties, including shortest-path length, clusteringcoeﬃcient, small-world coeﬃcient, betweenness centrality,and features of the degree distributions [7]. These prop-erties of the inferred network models are compared tothose of the real underlying structural networks in orderto validate and benchmark diﬀerent inference algorithmsin terms of their ability to capture the key properties ofthe underlying topologies. At the microscale, the per-formance is assessed in terms of precision, recall, andspeciﬁcity of the inferred model in classifying the linksof the underlying structural network. As above, whilstwe do not expect all approaches to strictly capture themicroscale features, these results help to explain theirperformance at the macroscale.The time series of node activity on these networks aregenerated by vector autoregressive (VAR) dynamics, withlinearly coupled nodes and Gaussian noise. Both theVAR process and the inference algorithms are describedin detail in Section II, where we also discuss how MI andthe magnitude of Pearson correlation are equivalent forstationary VAR processes. This implies that the undi-rected networks obtained via the bivariate MI algorithmare equivalent to the widely employed undirected func-tional networks obtained via correlation, extending theimplications of our results beyond information-theoreticmethods. Further, our results based on TE extend toGranger causality, which is equivalent to TE for stationaryVAR processes [20]. Networks inferred using bivariate TEare typically referred to as directed functional networks,to emphasise the directed nature of their links. The exten- sion to multivariate TE for eﬀective network inference canalso be viewed as an extension to multivariate Grangercausality for the stationary VAR processes here.We ﬁnd that multivariate TE performs better on allnetwork topologies at all scales, for longer time series.Bivariate methods can achieve better recall with limitedamount of data (shorter time series) in some circum-stances, but the precision and the ability to control falsepositives are not consistent nor predictable a priori. Onthe other hand, thanks to recent statistical improvements,multivariate TE guarantees high speciﬁcity and precisionregardless of the amount of data available, and the re-call steadily increases with more data. We discuss howthe mesoscopic properties of the underlying structuralnetwork—particularly the network motifs—can inﬂuencethe precision and recall of the model at the microscale.In turn, we show how the performance at the microscaleaﬀects the inferred network properties at the macroscale.We observe that bivariate methods are often unable tocapture the most distinctive topological features of thenetworks under study (including path length, clustering,degree distribution, and modularity), largely due to theirinability to control false positives at the microscale.

II. METHODSA. Networks of linearly-coupled Gaussian variables

We simulate discrete-time, stationary, ﬁrst-order vectorautoregressive processes (VAR) on underlying structuralnetworks of N nodes. A VAR process is described by therecurrence relation Z ( t + 1) = Z ( t ) · C + ε ( t ) , (1)where Z ( t ) is a row vector and Z i ( t ) is the activity ofnode i at time t . The Gaussian noise ε ( t ) is spatially andserially uncorrelated, with standard deviation θ = 0 . N × N weighted adjacency matrix C = [ C ij ] describesthe network structure, where C ij is the weight of thedirected connection from node i to node j . The choiceof the weights C ij (detailed in the following sections)guarantees the stability of the system, which is a suﬃcientcondition for stationarity [21]. Since stationary VARprocesses have multivariate Gaussian distributions, theinformation-theoretic measures we use can be directlyrelated to Pearson correlation and Granger causality [22]. B. Network inference algorithms

Three algorithms are employed to infer network modelsfrom the synthetic time series generated by the VARprocesses, using the IDTxl Python package [23]:

1. Bivariate mutual information for functional connectivity

Mutual information (MI) is computed between all pairsof nodes independently, in a bivariate fashion, and onlythe measurements that pass a strict statistical signiﬁcancetest (described below) are interpreted as undirected links.MI is a measure of statistical dependence between ran-dom variables [14], introduced by Shannon in laying thefoundations of information theory [13]. Formally, the MIbetween two continuous random variables X and Y withjoint probability density function µ ( x, y ) and marginaldensities µ X ( x ) and µ Y ( y ) is deﬁned as I ( X ; Y ) := (cid:90) (cid:90) µ ( x, y ) log µ ( x, y ) µ X ( x ) µ Y ( y ) dxdy, (2)where the integral is taken over the set of pairs ( x, y ) suchthat µ ( x, y ) >

0. The strength of MI lies in its model-freenature, meaning that it doesn’t require any assumptionson the distribution or the variables (e.g. Gaussian). Beingable to capture nonlinear relationships, MI is typicallypresented as a generalised version of the Pearson cor-relation coeﬃcient. However, since the VAR processesconsidered here [Eq. (1)] have stationary multivariateGaussian distributions, the MI between two variables X and Y is completely determined by the magnitude of theirPearson correlation coeﬃcient ρ [14]: I ( X ; Y ) = − ln(1 − ρ ) . (3)Crucially, this one-to-one relationship between MI and theabsolute value of ρ implies that the networks inferred viathe bivariate MI algorithm are equivalent to the functionalnetworks obtained via cross-correlation—widely employedin neuroscience. Diﬀerences may lie in how the raw MIvalues are transformed into a network structure. Early ap-proaches often used a ﬁxed threshold aimed at obtaininga prescribed link density, while the bivariate MI algorithmused here adopts an adaptive threshold (diﬀerent for eachlink) to meet a desired statistical signiﬁcance level. Thestatistical signiﬁcance is computed via null hypothesistesting to reﬂect the probability of observing a larger MIfrom the same samples if their temporal relationship weredestroyed (the p -value is obtained from a chi-square test,as summarised in [24]). The critical level for statisticalsigniﬁcance is set to α = 0 . /N , where N is the networksize. This produces a Bonferroni correction for the infer-ence of parent nodes for each target (i.e., for each target,there is a 0 .

01 chance under the null hypothesis that atleast one spurious parent node is selected).

2. Bivariate transfer entropy for directed functionalconnectivity

Transfer entropy (TE) is computed between all pairs ofnodes independently, in a bivariate fashion, and only themeasurements that pass a strict statistical signiﬁcancetest (described below) are interpreted as links. TE is a model-free measure of statistical dependencebetween random variables [15]; however, diﬀerently fromMI and cross-correlation, it is a directed and not sym-metric measure (i.e., the TE from a source node X to atarget node Y is not necessarily the same as the TE from Y to X ), and speciﬁcally considers information about thedynamic state updates of the target Y . Thus, employingTE has the advantage of generating directed networksand providing a more detailed model of the dynamics ofthe system under investigation. Formally, the TE froma source stochastic process X to a target process Y isdeﬁned as [15] T X → Y ( t ) := I ( X t − ; Y t | Y

3. Multivariate transfer entropy for eﬀective networkinference

Diﬀerently from the bivariate approaches above, themultivariate TE approach does not consider pairs of nodesin isolation, but focusses on modelling dynamic updates ineach target process in the system by selecting a minimalset of sources that collectively contribute to the compu-tation of the target’s next state. More formally, for eachtarget Y , this method aims at identifying the minimalset of sources X

01 using the max statistic [9], meaning thatthere is a 0 .

01 chance that, under the null hypothesis,at least one spurious parent node is selected for eachtarget. The conditioning on previously selected sourcesserves to prevent the selection of spurious sources whichare correlated with true sources (referred to as holdingredundant information only [35]), and also enables captur-ing multivariate or synergistic eﬀects on the target thatcannot be detected by examining individual sources inisolation. Every node is studied as a target (in parallelon a computing cluster, using IDTxl [23]) and the resultsare then combined into a directed network describingthe information ﬂows in the system. Similarly to the bi-variate TE discussed above, a non-uniform embedding ofthe target’s past Y

At the microscale (individual links), the network in-ference performance is evaluated against the known un-derlying network structure as a binary classiﬁcation task,using standard statistics based on the number of truepositives (TP, i.e., correctly classiﬁed existing links), falsepositives (FP, i.e., absent links falsely classiﬁed as exist-ing), true negatives (TN, i.e., correctly classiﬁed absent links), and false negatives (FN, i.e., existing links falselyclassiﬁed as absent). The following standard statistics areemployed in the evaluation:

Precision: = T P/ ( T P + F P ) Recall (true positive rate): = T P/ ( T P + F N ) Speciﬁcity (true negative rate): = T N/ ( T N + F P )Intuitively, the precision measures how often an inferredlink is actually present in the underlying structure, therecall measures the proportion of true links that are de-tected, and the speciﬁcity measures the proportion ofabsent links that are correctly not inferred. For a prop-erly controlled family-wise error rate ( α ), the expectedspeciﬁcity is 1 − α .At the macroscale, the performance is evaluated interms of the accuracy in measuring network propertiesof interest on the inferred network, as compared to theirreal values when measured on the underlying structuralnetwork. These properties include: Characteristic path length:

The average shortest dis-tance between all pairs of nodes [7]. A shorteraverage value is typically interpreted as an indica-tion of the eﬃciency of the network in propagatinginformation. The characteristic path length is onlywell deﬁned for connected networks—a problemthat is avoided by construction in this study byonly generating connected networks. This limita-tion could be alternatively overcome by replacingthe characteristic path length with the analogousglobal eﬃciency [37], reported in Appendix A forcompleteness.

Clustering coeﬃcient:

The clustering coeﬃcient of anode is the fraction of triangles in its neighbourhood,i.e., the fraction of the node’s neighbours that arealso neighbours of each other [38]. The mean cluster-ing coeﬃcient hence reﬂects the average prevalenceof clustered connectivity around individual nodes.

Small-worldness coeﬃcient:

Small-world networksare formally deﬁned as networks that are signif-icantly more clustered than random networks,yet have approximately the same characteristicpath length as random networks [38]. Thesmall-worldness coeﬃcient was proposed to capturethis eﬀect in a single statistic [39], althoughimprovements on the original measure have beenrecently suggested [40, 41].

Degree distribution:

The probability distribution ofthe in- and out-degree, i.e., the number incomingand outgoing connections of each node.

Betweenness centrality:

The fraction of all shortestpaths in the network that pass through a givennode [7].

Modularity:

A measure of the separation of a networkinto speciﬁc groups (or modules), deﬁned as the frac-tion of the edges that fall within the given groupsminus the expected fraction if the edges were dis-tributed at random [42].

Rich-club coeﬃcient:

The rich-club coeﬃcient mea-sures the extent to which high-degree nodes connectto each other [43].

III. SMALL-WORLD NETWORKSA. Numerical simulations

The ﬁrst experiment is aimed at testing the robustnessof the three inference algorithms with respect to vastchanges in network structure. The Watts-Strogatz modelis used to generate a spectrum of topologies, ranging fromregular lattices to random Erd˝os-R´enyi networks througha small-world transition [38]. Each simulation starts witha directed ring network of 100 nodes with uniform linkweights C ij = C ii = 0 .

15 and ﬁxed in-degree d in = 4(i.e., each node is linked to two neighbours on each side,as well as to itself via a self-loop). The source of eachlink is then rewired with a given probability p ∈ [0 , p are repeated 10 timeson diﬀerent network realisations and with random initialconditions. B. Results

At the microscale, the performance is evaluated interms of precision, recall, and speciﬁcity in the classi-ﬁcation of the links (present or absent) in the inferrednetwork compared to the underlying structural network.In the case of bivariate MI, each undirected link in theinferred network is represented as two directed links in op-posite directions. For longer time series of 10 000 samples,multivariate TE is the most accurate method, achiev-ing optimal performance according to all metrics on allthe network topologies generated by the Watts-Strogatzrewiring model (Figure 1, right column). Bivariate TEalso achieves nearly optimal recall and high speciﬁcityon all topologies; however, despite the strict statisticalsigniﬁcance level, the precision is signiﬁcantly lower onlattice-like topologies (low rewiring probability) than onrandom ones (high rewiring probability). The oppositetrend is shown by the bivariate MI algorithm, whoseprecision and recall drastically decrease with increasingrewiring probability. As expected, the recall of all meth-ods decreases when shorter time series of 1000 samples areprovided (Figure 1, left column). However, the recall for . . . . . . P e r f o r m a n ce Bivariate MI (T=1000)

PrecisionRecallSpeciﬁcity

Bivariate MI (T=10000)

PrecisionRecallSpeciﬁcity . . . . . . P e r f o r m a n ce Bivariate TE (T=1000)

PrecisionRecallSpeciﬁcity

Bivariate TE (T=10000)

PrecisionRecallSpeciﬁcity − − Rewiring probability . . . . . . P e r f o r m a n ce Multivariate TE (T=1000)

PrecisionRecallSpeciﬁcity − − Rewiring probabilityMultivariate TE (T=10000)

PrecisionRecallSpeciﬁcity

FIG. 1. Performance as a function of the rewiring probabilityin Watts-Strogatz ring networks ( N =100 nodes). MultivariateTE is consistent across network structure and guarantees highprecision regardless of the amount of data available (thirdrow). Bivariate TE (second row) can have better recall thanmultivariate TE for shorter time series ( T = 1000, left column)but its precision is not consistent (it drops when T =10 000,right column) and the optimal time series length cannot bedetermined a priori. Bivariate MI (ﬁrst row) has lower pre-cision and recall than TE-based methods, both for T =1000and T =10 000. For each value of the rewiring probability, theresults for 10 simulations on diﬀerent networks are presented(low-opacity markers) in addition to the mean values (solidmarkers). multivariate TE is consistent across topologies, while itdecreases with higher rewiring probability when bivariatemethods are used. This results in the bivariate TE havinglarger recall for lattice-like topologies whilst multivariateTE has larger recall for more random topologies (i.e., for arewiring probability larger than p = 0 . − − Rewiring probability C h a r a c t e r i s t i c p a t h l e n g t h RealBivariate MIBivariate TEMultivariate TE

FIG. 2. Characteristic path length as a function of therewiring probability in Watts-Strogatz ring networks ( N =100nodes and T =10 000 time samples). Multivariate TE is ableto closely approximate the characteristic path length of thereal topologies (ground truth). On the other hand, bivariateMI and TE produce signiﬁcant underestimates due to spuriouslinks creating shortcuts across the network, particularly onlattice-like topologies (low rewiring probability). The resultsfor 10 simulations on diﬀerent network realisations are pre-sented (low-opacity markers) in addition to the mean values(solid markers). For real networks, only the mean values areplotted (cross markers). − − Rewiring probability . . . . C l u s t e r i n g c o e ﬃ c i e n t RealBivariate MIBivariate TEMultivariate TE

FIG. 3. Average clustering coeﬃcient as a function of therewiring probability in Watts-Strogatz ring networks ( N =100nodes and T =10 000 time samples). The multivariate TE algo-rithm closely matches the average clustering coeﬃcient of thereal networks (ground truth), which is instead overestimatedby bivariate MI and TE. The results for 10 simulations ondiﬀerent network realisations are presented (low-opacity mark-ers) in addition to the mean values (solid markers). For realnetworks, only the mean values are plotted (cross markers). real mean clustering on all the network topologies gen-erated by the Watts-Strogatz rewiring model, while thebivariate MI and TE algorithms consistently overestimateit (Figure 3). The related measure of local eﬃciency [37]is reported in Appendix A.Given the above results on the characteristic pathlength and the mean clustering coeﬃcient, it is not surpris-ing that the bivariate MI and TE algorithms signiﬁcantlyoverestimate the real small-worldness coeﬃcient, whilethe multivariate TE method produces accurate estimateson all the network topologies generated by the Watts-Strogatz rewiring model (Figure 4). Equivalent resultsare found if the alternative measures of “small-world in-dex” [40] or “double-graph normalized index” [44] are − − Rewiring probability S m a ll - w o r l d c o e ﬃ c i e n t RealBivariate MIBivariate TEMultivariate TE

FIG. 4. Small-world coeﬃcient as a function of the rewiringprobability in Watts-Strogatz ring networks ( N =100 nodesand T =10 000 time samples). The multivariate TE algorithmproduces accurate estimates of the small-world coeﬃcient ofthe real topologies (ground truth), which is instead stronglyoverestimated by bivariate MI and TE. The results for 10simulations on diﬀerent network realisations are presented(low-opacity markers) in addition to the mean values (solidmarkers). For real networks, only the mean values are plotted(cross markers). computed instead (not shown). C. Discussion

At the microscale, the results concerning the bivariateTE can be explained in the light of the recent theoreticalderivation of TE from network motifs [45]. For a ﬁxedin-degree, the TE decreases with the rewiring probability,making it harder for candidate links to pass the statisticalsigniﬁcance tests when only short time series are available.This explains why the recall for the bivariate TE slightlydrops with higher rewiring probability for T =1000 (Fig-ure 1). We can speculate that a similar mechanism couldbe responsible for the more drastic drop in the recall forthe bivariate MI (via evidence from derivations for covari-ances from network structure for similar processes [46]).The fact that bivariate TE is larger for regular latticestructures also explains why its recall is slightly higherthan for multivariate TE here: the redundancy betweenclose sources that elevates their bivariate TE is explicitlyconditioned out of the multivariate TE. On the other hand,as the rewiring increases, the higher recall for multivari-ate TE must be due to this method capturing synergisticeﬀects that (more disparate) multiple sources have on thetarget, which the bivariate method does not.Comparing the results between shorter and longer timeseries raises another question: why is the precision of thebivariate TE worse for longer time series than for shorterones, especially for lattice-like topologies? More com-plex motifs involving common parents and multiple walks,which are more prevalent in regular lattice topologies, canresult in nonzero TE on spurious links. These indirecteﬀects are typically weak; however, for long enough timeseries, the low TE values can be distinguished from noiseand thus pass the statistical signiﬁcance tests. The result-ing spurious links (false positives) decrease the precisionand the speciﬁcity as the time series length is increased,with the eﬀect being stronger in regular lattice topologies.In other words, the Bonferroni correction of the statisti-cal signiﬁcance level (i.e., dividing α by the network size N ) does not result in a well calibrated test for bivariateinference methods—the sources are correlated, and thetests on them are not independent. The diﬀerences in thespeciﬁcity on the plots are subtle because the networksare sparse; however, they manifest in large diﬀerences inthe precision. Crucially, this eﬀect is not seen for themultivariate TE, which maintains speciﬁcity consistentwith the requested α for all topologies and time serieslengths. Thus, lower recall achieved by multivariate TEon regular lattice networks for short time series (comparedto bivariate TE) can be viewed as a compromise to con-trol the speciﬁcity in a consistent fashion. A compellingargument in favour of controlling the speciﬁcity is pro-vided by Zalesky et al. [4], who conclude that “ speciﬁcityis at least twice as important as sensitivity [i.e., recall]when estimating key properties of brain networks, includ-ing topological measures of network clustering, networkeﬃciency and network modularity ”. Unfortunately, thereis currently no consistent a priori way (nor a reasonablecandidate) to determine the optimal time series lengthfor bivariate TE to attain high precision.Moving to the macroscale results, it is clear that theability to control the false positives while building connec-tomes is a crucial prerequisite for the application of com-plex network measures. Adding only a few spurious linksleads to signiﬁcant underestimate of the average shortest-path length—an eﬀect that has previously been reportedfor lattice-like networks using MI [47] and extended here toTE and across a range of topologies (Figure 2). Togetherwith the clustering coeﬃcient, the shortest-path lengthis a deﬁning feature of small-world networks. Althoughevidence of small-world properties of functional networksobtained from fMRI recordings have been provided inseveral studies (e.g. [48]), whether or not the brain is asmall-world network is still being debated [49, 50]. Fol-lowing Papo et al. [50], the question addressed here isof a pragmatic rather than an ontological nature: inde-pendently of whether the brain is a small-world networkor not, to what extent can neuroscientists using stan-dard system-level neuroimaging techniques interpret thesmall-world construct in the context of functional brainnetworks? An indication that the interpretation is prob-lematic was provided by Hlinka et al. [51], who showedthat functional connectivity matrices of randomly coupledautoregressive processes show small-world properties. Theeﬀect is due to intrinsic properties of correlation ratherthan just to the ﬁnite sample size problem or spatialoversampling. Speciﬁcally, correlation has a transitivityproperty: for any node X with neighbours Y and Z (andrespective correlations ρ XY and ρ XZ ), a lower bound canbe derived for the correlation between the neighbours [52]: ρ Y Z ≥ ρ XY ρ XZ − (cid:113) − ρ XY (cid:113) − ρ XZ . (6) In particular, a strong positive correlation between twopairs of them implies a positive correlation within thethird pair: ρ XY + ρ XZ > ρ Y Z > IV. SCALE-FREE NETWORKSA. Numerical simulations

The preferential attachment algorithm [54] is usedto generate undirected scale-free networks of 200 nodes.Starting with two connected nodes, a new node is addedat each iteration and linked bidirectionally to two ex-isting nodes, selected with probability proportional totheir current degree. This preferential mechanism makeshigh-degree nodes more likely to be selected and furtherincrease their degree—a positive feedback loop that gen-erates few highly-connected hubs and many low-degreenodes. A constant uniform link weight C XY = C XX = 0 . B. Results

At the microscale, the performance is evaluated in termsof precision and recall in the classiﬁcation of the links. Theoutcome is qualitatively similar to the small-world casepresented above: for longer time series (10 000 samples),multivariate TE is the most accurate method, achievingoptimal performance according to all metrics (Figure 5,right column). Bivariate TE also achieves optimal recall;however, despite the strict statistical signiﬁcance level,the precision is signiﬁcantly lower than multivariate TE.The bivariate MI algorithm scores comparatively verypoorly both in terms of precision and recall ( <

40% onaverage). As expected, the recall of all methods decreases . . . . . . P r ec i s i o n T = 1000 T = 10000 Bivariate MI Bivariate TE Multivariate TE . . . . . . R ec a ll T = 1000 Bivariate MI Bivariate TE Multivariate TE T = 10000 FIG. 5. Precision (top row) and recall (bottom row) in scale-free networks obtained via preferential attachment ( N =200nodes). Multivariate TE guarantees high precision regardlessof the amount of data available ( T = 1000 in the left columnand T =10 000 in the right column). Bivariate TE can achieveslightly better recall than multivariate TE for shorter timeseries (bottom left panel) but its precision drops substantiallyfor longer time series (top right panel) and the optimal timeseries length cannot be determined a priori. Bivariate MI haslower precision and recall than TE-based methods, both for T =1000 and T =10 000. The box-whiskers plots summarise theresults over 10 simulations on diﬀerent network realisations,with median values indicated in colour. when shorter time series of 1000 samples are provided(Figure 5, left column). Once more, bivariate TE attainsbetter precision on shorter time series than on longer ones,and for these networks attains slightly better recall thanmultivariate TE on the shorter time series.At the macroscale, the three algorithms are testedon their ability to accurately measure several relevantproperties of scale-free networks. It is well known thatthe degree distribution of networks generated via thepreferential attachment algorithm follows a power-law,with theoretical exponent β = 3 in the limit of largenetworks [54]. Fitting power-laws to empirical data re-quires some caution, e.g. adopting a logarithmic binningscheme [55], and the dedicated powerlaw Python packageis employed for this purpose [56]. For suﬃciently longtime series ( T = 10 000 in this study), multivariate TE isable to accurately recover the in-degrees of the nodes inour scale-free networks, while the bivariate MI and TEalgorithms produce signiﬁcant overestimates (Figure 6).As a consequence, the (absolute value of the) exponentof the ﬁtted power-law is underestimated by the lattermethods, as shown in Figure 7.Hubs are a key feature of scale-free networks and havehigh betweenness centrality, since most shortest pathspass through them. However, their centrality is highly un-derestimated by bivariate methods, often making it indis-tinguishable from the centrality of peripheral nodes (Fig- Real in-degree I n f e rr e d i n - d e g r ee Bivariate MIBivariate TEMultivariate TE

FIG. 6. Inferred vs. real in-degree in scale-free net-works obtained via preferential attachment ( N =200 nodesand T =10 000 time samples). Multivariate TE is the onlyalgorithm able to preserve the in-degrees of the nodes as com-pared to their value in the real networks (ground truth). Thedashed black line represents the identity between real andinferred values. Surprisingly, bivariate methods can inﬂate thein-degree of non-hubs by over one order of magnitude, makinghubs less distinguishable. The results are collected over 10simulations on diﬀerent network realisations. In-degree − − − F r e qu e n c y Bivariate MI ( β =2.1)Bivariate TE ( β =2.2)Multivariate TE ( β =3.1)Real ( β =3.1) FIG. 7. Log-log plot of the in-degree distribution in scale-freenetworks obtained via preferential attachment ( N =200 nodesand T =10 000 time samples). The best power-law distributionﬁt for the real networks (ground truth) and the inferred net-work models are plotted with dashed lines (decay exponentsreported in the legend). Despite the ﬁnite-size eﬀect due tothe small network size, multivariate TE is able to approximatethe theoretical power-law decay exponent β = 3 and to matchthe power-law ﬁt to the real in-degree distribution ( β = 3 . ure 8).As in the small-world case, multivariate TE is able tovery closely approximate the real mean clustering coeﬃ-cient, while the bivariate MI and TE algorithms consis-tently overestimate it (Figure 9). The related measure oflocal eﬃciency [37] is reported in Appendix A. A closerexamination of the clustering of individual nodes (insteadof the average) reveals that low clustering values areconsistently overestimated by bivariate methods, whilehigh clustering values are underestimated (Appendix B).Finally, bivariate methods overestimate the rich-club co-eﬃcient (Appendix C). . . . . . Real centrality . . . . . I n f e rr e d ce n t r a li t y Bivariate MIBivariate TEMultivariate TE

FIG. 8. Inferred vs. real betweenness centrality for nodesin scale-free networks obtained via preferential attachment( N =200 nodes and T =10 000 time samples). MultivariateTE is the only algorithm able to preserve the centrality ofthe nodes as compared to their value in the real networks(ground truth). The dashed black line represents the identitybetween real and inferred values. Bivariate MI underestimatesthe centrality of all nodes, while bivariate TE particularlyunderestimates the centrality of the most central nodes inthe ground-truth network. The results are collected over 10simulations on diﬀerent network realisations. Bivariate MI Bivariate TE Multivariate TE Real . . . . . . . C l u s t e r i n g c o e ﬃ c i e n t FIG. 9. Inferred vs. real average clustering coeﬃcient in scale-free networks obtained via preferential attachment ( N =200nodes and T =10 000 time samples). Multivariate TE is theonly algorithm able to preserve the average clustering of thereal networks (ground truth), while bivariate TE and MI con-sistently overestimate it. The box-whiskers plots summarisethe results over 10 simulations, with median values indicatedin colour. C. Discussion

Echoing the discussion of small-world networks in Sec-tion III, the ability to control the false positives whilebuilding connectomes—exhibited only by multivariateTE—is also crucial for correctly identifying fundamentalfeatures of scale-free networks, such as the power-law de-gree distribution and the presence of hub nodes. Hubsare characterised by high degree and betweenness central-ity. Unfortunately, the centrality of hubs is not robustwith respect to false positives: the addition of spuriouslinks cause strong underestimates of the betweennesscentrality of real hubs, since additional links provide al-ternative shortest paths. For bivariate TE, the eﬀect isso prominent that the inferred centrality of real hubscan be indistinguishable from the centrality of periph- eral nodes, as shown in Figure 8. The in-degree is inprinciple more robust with respect to false positives; how-ever, bivariate methods infer so many spurious incominglinks into non-hubs that they become as connected (ormore) than real hubs (Figure 6). Taken together, theseeﬀects on the in-degree and centrality greatly hinder theidentiﬁcation of real hubs when bivariate MI or TE areemployed. The inﬂation of the in-degree of peripheralnodes also fattens the tail of the in-degree distribution(Figure 7), resulting in an underestimate of the exponentof the ﬁtted power-law with respect to the theoreticalvalue β = 3 [54]. This has severe implications for thesynthetic networks used in this study, erroneously provid-ing evidence against the simple preferential attachmentalgorithm used to generate them. The third distinct char-acteristic of these networks is their low average clustering,which is also induced by the preferential attachment al-gorithm, whereby each new node is only connected totwo existing ones. However, bivariate methods fail tocapture this feature, producing a strong overestimate ofthe average clustering coeﬃcient (Figure 9). This can beattributed to the transitivity property of Pearson’s corre-lation, which produces overabundant triangular cliques infunctional networks (as already discussed in Section III).Given the signiﬁcant biases aﬀecting all the distinctiveproperties of scale-free networks—in addition to the small-world networks presented above—it is evident that greatcaution should be used when applying bivariate inferencemethods (cross-correlation, MI, TE) to draw conclusionsas to topological properties of real-world networks. Incontrast, again, the multivariate TE was demonstratedto produce network models with microscopic and macro-scopic topological properties consistent with those of theunderlying structural scale-free networks. V. MODULAR NETWORKSA. Numerical simulations

In order to study the performance of the three inferencealgorithms on modular topologies, networks of 100 nodesare generated and equally partitioned into ﬁve groupsof 20. Initially, each node is directly linked to 10 ran-dom targets within its own group, such that the ﬁvecommunities are completely disconnected. The initialdensity is thus 50% within each group and 10% overall.Link targets are then gradually rewired from within tobetween groups, weakening the modular structure butpreserving the overall density and keeping the out-degreesﬁxed. Eventually, the concepts of “within” and “between”groups are no longer meaningful—the links are equallydistributed and the topology resembles a random Erd˝os-R´enyi network of equal overall density. This happenswhen the rewiring is so prevalent that only 2 links are leftwithin the initial groups and 8 out of 10 links are formedbetween them (for each node). Going even further, whenall 10 links are formed between the initial groups and0none within, the network becomes multipartite, i.e., thenodes are partitioned into ﬁve independent sets havingno internal connections. A constant uniform link weight C XY = C XX = 0 .

08 is assigned to all the links, achiev-ing strong coupling but ensuring the stationarity of theVAR dynamics. Each simulation is repeated 10 timeson diﬀerent network realisations and with random initialconditions.

B. Results

At the microscale, we ﬁnd that bivariate MI and TEinfer more spurious links within the initial groups thanbetween them for smaller between-group densities (Fig-ure 10, left column). As the between-group density in-creases though, we ﬁnd more spurious links between theinitial groups than within them. The normalised false-positive rate is also signiﬁcantly higher within groups forsmaller between-group densities (right column), howeverthe normalisation sees the false-positive rate becomingcomparable between and within group as the between-group density increases. The number of false positivesproduced by multivariate TE is comparatively negligible.At the mesoscale, the modularity of the partition corre-sponding to the ﬁve disconnected communities is maximalin the absence of rewiring and decreases as more and morelinks are formed between groups rather than within them(Figure 11). Bivariate and multivariate TE produce accu-rate estimates of the real modularity, while bivariate MIoften underestimates it, particularly for shorter time se-ries ( T =1000) and intermediate between-group densities. C. Discussion

Our results on modular networks conﬁrm and extendprevious ﬁndings on correlation-based functional connec-tivity, stating that “ false positives occur more prevalentlybetween network modules than within them, and the spu-rious inter-modular connections have a dramatic impacton network topology ” [4]. Indeed, the left column of Fig-ure 10 shows that bivariate MI and TE infer a larger number of false positives between the initial groups thanbetween them, once we have a mid-range between-grouplink density in the underlying structure (which induces thetransitive relationships). However, the same does not ap-ply to the false positive rate (i.e., the normalised numberof false positives) shown in the right column of Figure 10:where an edge does not actually exist, it is more likely tobe inferred if it is within rather than across group (forup to mid-range between-group link densities). As such,the higher number of false positives between modules ismostly due to the larger number of potential spuriouslinks available between diﬀerent communities comparedto those within them. Nonetheless, the key message is . . . . . . F a l s e p o s i t i v e s ( p e r n o d e ) Bivariate MI

Within groupsBetween groups . . . . . . F a l s e p o s i t i v e r a t e Bivariate MI

Within groupsBetween groups . . . . . . F a l s e p o s i t i v e s ( p e r n o d e ) Bivariate TE

Within groupsBetween groups . . . . . . F a l s e p o s i t i v e r a t e Bivariate TE

Within groupsBetween groups

Links between groups (per node) . . . . . . F a l s e p o s i t i v e s ( p e r n o d e ) Multivariate TE

Within groupsBetween groups

Links between groups (per node) . . . . . . F a l s e p o s i t i v e r a t e Multivariate TE

Within groupsBetween groups

FIG. 10. False positives per node (left column) and falsepositive rate (right column) in modular networks of N =100nodes and T =10 000 time samples. Each node is connected to10 neighbours and the horizontal axis represents the numberof links between groups (in the two extreme cases all 10 linksare formed exclusively within each group or exclusively be-tween groups). Bivariate MI and TE infer more spurious links(left column) within the initial groups than between themfor smaller between-group densities; the comparison then re-verses for larger between-group densities. The false positiverate (i.e., the normalised number of false positives; right col-umn) is also higher within groups for smaller between-groupdensities, whilst the rate for between groups becomes compa-rable (instead of larger) as between-group density increases.The number of false positives produced by multivariate TEis comparatively negligible. The results for 10 simulationson diﬀerent networks are presented (low-opacity markers) inaddition to the mean values (solid markers). that the modular structure (at the mesoscale level) aﬀectsthe performance of bivariate algorithms in inferring singlelinks (at the microscale level). This provides further em-pirical evidence for the theoretical ﬁnding that bivariateTE—despite being a pairwise measure—does not dependsolely on the directed link weight between a single pairof nodes, but on the larger network structure they areembedded in, via the mesoscopic network motifs [45]. Inparticular, the abundance of speciﬁc “clustered motifs” inmodular structure increase the bivariate TE, making linkswithin each group easier to detect but also increasing thefalse positive rate within modules. Other studies haverelated also the correlation-based functional connectivityto speciﬁc structural features, such as search information,path transitivity [57], and topological similarity [58].1 Links between groups − . . . . . M o du l a r i t y T = 1000 RealBivariate MIBivariate TEMultivariate TE

Links between groups − . . . . . M o du l a r i t y T = 10000 RealBivariate MIBivariate TEMultivariate TE

FIG. 11. Modularity of the partition representing ﬁve groupsof 20 nodes ( T =10 000 time samples). Each node is initiallyconnected to 10 neighbours within the same group via directedlinks. As link targets are gradually rewired from within tobetween groups, the modularity of the initial partition linearlydecreases. The horizontal axis represents the number of linksbetween groups for each node (in the two extreme cases all 10links are formed exclusively within each group or exclusivelybetween groups). Bivariate and multivariate TE produceaccurate estimates of the real modularity, while bivariate MIoften underestimates it, both for shorter and longer timeseries ( T =1000 in the top panel and T =10 000 in the bottomone). The results for 10 simulations on diﬀerent networks arepresented (low-opacity markers) in addition to the mean values(solid markers). For real networks, only the mean values areplotted (cross markers). The underestimate of the modularity of the initial par-tition by bivariate MI (Figure 11, bottom panel) is adirect result of these higher numbers of spurious between-group links. This has important implications for theidentiﬁcation of the modules, since a lower score makesthis partition less likely to be deemed optimal by popu-lar greedy modularity maximisation algorithms [59]. Wespeculate that the spurious inter-modular links wouldalso hinder the identiﬁcation of the modules when alter-native approaches for community detection are employed(a thorough comparison of which is beyond the scope ofthis study).

VI. MACAQUE CONNECTOMEA. Numerical simulations

Finally, the three inference algorithms are comparedon a real macaque brain connectome obtained via tract- tracing [60]. This directed network consists of 71 nodesand 746 links and incorporates multiple properties inves-tigated above, including a small-world topology and thepresence of hubs and modules. With the structure ﬁxed,we investigate the scaling of the performance as a functionof the cross-coupling strength (i.e., the sum of incominglink weights into each node, denoted as C in and formallydeﬁned as C in = (cid:80) X C XY for each node Y ). We vary C in ∈ [0 . , . C XY constant for each parent X for a given Y to achieve this, and keep the self-linkweights constant at C XX = 0 . B. Results

At the microscale, the results summarise the mainﬁndings presented so far. There exists an optimal window—characterised by low cross-coupling strength and shorttime series—where bivariate TE attains similar or betterperformance compared to multivariate TE in terms ofrecall, speciﬁcity, and precision. For stronger couplingor longer time series, the recall of all methods increase,but the precision and speciﬁcity of the bivariate methodssubstantially drop whilst those of multivariate TE remainconsistently high (Figure 12). The macroscale resultsconcerning the local and global eﬃciency are reported inAppendix A.

C. Discussion

Interestingly, Figure 12 shows how similar outcomesare produced by either stronger coupling (link weights) orlonger time series. An explanation is readily available inthe simple case of VAR dynamics used in this study, sincethe bivariate TE on spurious links is typically lower thanthe TE on real links, and it increases with the couplingstrength [61]. Therefore, spurious links can only passstatistical signiﬁcance tests when suﬃciently long timeseries are available (in order for their weak TE valuesto be distinguished from noise); for the same reason, forshorter time series, spurious links can only be detectedin the presence of strong enough links. Unfortunately,for real datasets, there is no consistent way to determinethese optimal windows for bivariate TE a priori.

VII. CONCLUSION

In this paper, we have sought to evaluate how wellnetwork models produced via bivariate (directed andundirected) and multivariate network inference methodscapture features of underlying structural topologies. Weevaluated the performance of these methods at the micro-scopic and macroscopic scales of the network. As outlinedin the Introduction, these inference techniques seek to2 . . . . . . P e r f o r m a n ce Bivariate MI (T=1000)

PrecisionRecallSpeciﬁcity

Bivariate MI (T=10000)

PrecisionRecallSpeciﬁcity . . . . . . P e r f o r m a n ce Bivariate TE (T=1000)

PrecisionRecallSpeciﬁcity

Bivariate TE (T=10000)

PrecisionRecallSpeciﬁcity . . . Cross-coupling . . . . . . P e r f o r m a n ce Multivariate TE (T=1000)

PrecisionRecallSpeciﬁcity . . . Cross-couplingMultivariate TE (T=10000)

PrecisionRecallSpeciﬁcity

FIG. 12. Performance as a function of coupling weight ina real macacque connectome ( N =71 nodes). MultivariateTE (third row) guarantees high speciﬁcity (adequate controlof the false positive rate) regardless of the cross-couplingstrength and time series length ( T =1000 in the left columnand and T =10 000 in the right column). Bivariate TE (secondrow) attains similar performance to multivariate TE for lowcross-coupling strength and short time series, according to allmetrics. For stronger coupling or longer time series, the recallof all methods increase, but the precision and speciﬁcity ofthe bivariate methods substantially drop. For each value ofthe cross-coupling weights, the results for 10 simulations fromrandom initial conditions are presented (low-opacity markers)in addition to the mean values (solid markers). infer a network model of the relationships between thenodes in a system, and are not necessarily designed norexpected to replicate the underlying structural topology.Yet we should expect all of the methods to identify rele-vant features in the underlying network structure, bothin the network regime overall as well as in identifyingdistinctive nodes or groups of nodes in the structure.For longer time series, multivariate TE performs betteron all network topologies (lattice-like, small-world, scale-free, modular, and the real macaque connectome). Thisenhanced performance is very clear at the microscaleof single links, achieving high precision and recall, andconsequently at the macroscale of network properties,accurately reﬂecting the key summary statistics of theground truth networks used for validation.Bivariate methods can exhibit higher recall (sensitiv-ity) for shorter time series for certain underlying topolo-gies; however, as available data increases, they are un- able to control false positives (lower speciﬁcity). At themacroscale, this leads to overestimated clustering, small-world, and rich-club coeﬃcients, underestimated shortestpath lengths and hub centrality, and fattened degree dis-tribution tails. Caution should therefore be used wheninterpreting network properties of functional connectomesobtained via correlation or pairwise statistical dependencemeasures. Their use is only advisable when the limitedamount of data doesn’t allow the use of the more sophis-ticated but more accurate multivariate TE, which morefaithfully tracks trends in underlying structural topology.Further research is required to try to reliably identify apriori the situations where bivariate TE will exhibit higherprecision and recall (particularly in terms of time serieslength), as there is no clear candidate approach to do so atpresent. In the current status quo, the key strength of themultivariate approach lies in its ability to appropriatelycontrol the false positive rate to meet the requested values.Avenues for future research also include assessing the abil-ity of inference algorithms to capture key network featureswhen only a subset of nodes is observed. Finally, whilewe focus on functional brain networks, our conclusionsand methods also apply to anatomical brain networks inwhich connectivity is measured using correlation in corti-cal thickness or volume [62]. Beyond neuroscience, theyalso extend to metabolite, protein and gene correlationnetworks [63] (a similar validation study using syntheticnetworks was carried out in gene regulatory networksusing bivariate MI and TE [64]). SUPPORTING INFORMATION

The network inference algorithms described in this pa-per are implemented in the open-source Python softwarepackage IDTxl [23], which is freely available on GitHub( https://github.com/pwollstadt/IDTxl ). The codeused for the systematic exploration of network struc-tures and inference methods is also publicly available( https://github.com/LNov/infonet ). ACKNOWLEDGMENTS

JL was supported through the Australian ResearchCouncil DECRA Fellowship grant DE160100630 andthrough The University of Sydney Research Accelera-tor (SOAR) prize program. The authors acknowledge theSydney Informatics Hub and the University of Sydney’shigh-performance computing cluster Artemis for provid-ing the high-performance computing resources that havecontributed to the research results reported within thispaper.3

AUTHOR CONTRIBUTIONS

Leonardo Novelli: Conceptualization, Methodology,Software, Validation, Formal analysis, Visualization, Writ-ing - Original Draft. Joseph Lizier: Conceptualization,Methodology, Supervision, Funding acquisition, Writing -Review & Editing.

Appendix A: Local and global eﬃciency

The crucial limitation of shortest path length measures(and of the derived small-world coeﬃcient) is being onlydeﬁned for connected networks. Therefore, the analogousglobal eﬃciency measure [37] is often used to overcomethis shortcoming (also see [41] for an alternative measureof small-worldness based on global eﬃciency). The relatedlocal eﬃciency measure can instead be regarded as anal-ogous to the clustering coeﬃcient [37]. Complementingthe results in the main text, we report the global andlocal eﬃciency of small-world networks (Figure 13), scale-free networks (Figure 14), and the macaque connectome(Figure 15).

Appendix B: Clustering coeﬃcient of scale-freenetworks

Plotting the clustering coeﬃcient values of individualnodes instead of the average shows that the low clus-tering values are consistently overestimated by bivariatemethods (which is the most prominent eﬀect aﬀecting theaverage), while high clustering values are underestimated(Figure 16).

Appendix C: Rich-club coeﬃcient and assortativityof scale-free networks

The rich-club coeﬃcient measures the extent to whichhigh-degree nodes connect to each other [43]. Insteadof choosing a speciﬁc threshold to deﬁne high-degreenodes, the rich-club coeﬃcient is plotted in Figure 17 fora range of thresholds (non-normalised values). The rich-club coeﬃcient is overestimated by bivariate MI and TEacross all thresholds, although the eﬀect is less prominentthan on other network properties.Assortativity (or assortative mixing) is a preferencefor nodes to attach to others with similar degree. Thein-degree assortativity coeﬃcient is the Pearson correla-tion coeﬃcient of degree between pairs of linked nodes.Positive values indicate a correlation between nodes ofsimilar in-degree, while negative values indicate relation-ships between nodes of diﬀerent in-degree. As shown inFigure 18, the scale-free networks obtained via preferen-tial attachment are disassortative (i.e., they have negativeassortativity coeﬃcients). Bivariate and multivariate TE − − Rewiring probability . . . . . G l o b a l e ﬃ c i e n c y RealBivariate MIBivariate TEMultivariate TE − − Rewiring probability . . . . . L o c a l e ﬃ c i e n c y RealBivariate MIBivariate TEMultivariate TE

FIG. 13. Global and local eﬃciency as a function of therewiring probability in Watts-Strogatz ring networks ( N =100nodes and T =10 000 time samples). Multivariate TE recon-structs networks having the same eﬃciency as the real topolo-gies (ground truth). On the other hand, bivariate MI and TEproduce signiﬁcant overestimates due to spurious links. Thesecreate shortcuts across the network (inﬂating the global eﬃ-ciency in the top panel) and form spurious triangular cliques(inﬂating the local eﬃciency in the bottom panel), particularlyon lattice-like topologies (low rewiring probability). For eachvalue of the rewiring probability, the results for 10 simulationson diﬀerent network realisations are presented (low-opacitymarkers) in addition to the mean values (solid markers). Forreal networks, only the mean values are plotted (cross mark-ers). accurately reproduce the assortativity of the real net-works (ground truth), while bivariate MI consistentlyunderestimate it.4 Bivariate MI Bivariate TE Multivariate TE Real . . . . . G l o b a l e ﬃ c i e n c y Bivariate MI Bivariate TE Multivariate TE Real . . . . . . L o c a l e ﬃ c i e n c y FIG. 14. Global and local eﬃciency in scale-free net-works obtained via preferential attachment ( N =200 nodesand T =10 000 time samples). Multivariate TE is the onlyalgorithm able to preserve the eﬃciency of the real networks(ground truth), while bivariate TE and MI consistently over-estimate it. The box-whiskers plots summarise the resultsover 10 simulations, with median values indicated in colour. . . . . . . . Cross-coupling . . . . . G l o b a l e ﬃ c i e n c y RealBivariate MIBivariate TEMultivariate TE . . . . . . . Cross-coupling . . . . . L o c a l e ﬃ c i e n c y RealBivariate MIBivariate TEMultivariate TE

FIG. 15. Global and local eﬃciency as a function of cou-pling weight in a real macacque connectome ( N =71 nodes and T =10 000 time samples). All inference algorithms produceunderestimates for low coupling. For stronger coupling, mul-tivariate TE converges to the real global and local eﬃciencyof the underlying networks (ground truth), while bivariatemethods overestimate both measures. For each value of thecross-coupling weights, the results for 10 simulations fromrandom initial conditions are presented (low-opacity markers)in addition to the mean values (solid markers). . . . . . . Real clustering . . . . . . I n f e rr e d c l u s t e r i n g Bivariate MIBivariate TEMultivariate TE

FIG. 16. Inferred vs. real clustering coeﬃcient of indi-vidual nodes in scale-free networks obtained via preferentialattachment ( N =200 nodes and T =10 000 time samples). Mul-tivariate TE accurately reproduces the clustering coeﬃcientof the real networks (ground truth), while bivariate methodsoverestimate low clustering values and underestimate highclustering values. The results are collected over 10 simulationson diﬀerent network realisations. In-degree . . . . . . R i c h - c l ub c o e ﬃ c i e n t Bivariate MIBivariate TEMultivariate TEReal

FIG. 17. Rich-club coeﬃcient in scale-free networks obtainedvia preferential attachment ( N =200 nodes and T =10 000 timesamples). Multivariate TE accurately reproduces the rich-club coeﬃcient of the real networks (ground truth) for lowerin-degree thresholds, while bivariate TE and MI consistentlyoverestimate it. Mean values over 10 simulations. − . − . − . − . − . − . − . Real in-degree assortativity − . − . − . − . − . I n f e rr e d a ss o r t a t i v i t y Bivariate MIBivariate TEMultivariate TE

FIG. 18. In-degree assortativity coeﬃcient in scale-freenetworks obtained via preferential attachment ( N =200 nodesand T =10 000 time samples). Bivariate and multivariate TEaccurately reproduce the assortativity of the real networks(ground truth), while bivariate MI consistently underestimateit. The black dashed line represents the identity between realand inferred values. The results are shown for 10 diﬀerentnetwork realisations. [1] D. S. Bassett and O. Sporns, Nature Neuroscience ,353 (2017).[2] A. Fornito, A. Zalesky, and E. T. Bullmore, Fundamentalsof Brain Network Analysis , 1st ed. (Academic Press, SanDiego, 2016) p. 494.[3] A. Zalesky, A. Fornito, I. H. Harding, L. Cocchi, M. Y¨ucel,C. Pantelis, and E. T. Bullmore, NeuroImage , 970(2010).[4] A. Zalesky, A. Fornito, L. Cocchi, L. L. Gollo, M. P.van den Heuvel, and M. Breakspear, NeuroImage ,407 (2016).[5] K. M. Aquino, B. D. Fulcher, L. Parkes, K. Sabaroedin,and A. Fornito, NeuroImage , 116614 (2020).[6] O. M. Cliﬀ, L. Novelli, B. D. Fulcher, J. M. Shine, andJ. T. Lizier, “Exact Inference of Linear Dependence Be-tween Multiple Autocorrelated Time Series,” (2020),arXiv:2003.03887.[7] M. Rubinov and O. Sporns, NeuroImage , 1059 (2010).[8] C. H. Xia, Z. Ma, Z. Cui, D. Bzdok, B. Thirion, D. S.Bassett, T. D. Satterthwaite, R. T. Shinohara, and D. M.Witten, Human Brain Mapping , 1 (2020).[9] L. Novelli, P. Wollstadt, P. Mediano, M. Wibral, andJ. T. Lizier, Network Neuroscience , 827 (2019).[10] J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, andD. Sejdinovic, “Detecting causal associations in large non-linear time series datasets,” (2018), arXiv:1702.07007v2.[11] P. Kim, J. Rogers, J. Sun, and E. M. Bollt, Journalof Computational and Nonlinear Dynamics , 011008(2016).[12] A. Razi, J. Kahan, G. Rees, and K. J. Friston, NeuroIm-age , 1 (2015).[13] C. E. Shannon, Bell System Technical Journal , 379(1948).[14] T. M. Cover and J. A. Thomas, Elements of InformationTheory , 2nd ed. (John Wiley & Sons, Inc., Hoboken, NJ,USA, 2005) p. 748.[15] T. Schreiber, Physical Review Letters , 461 (2000).[16] T. Bossomaier, L. Barnett, M. Harr´e, and J. T. Lizier, AnIntroduction to Transfer Entropy (Springer InternationalPublishing, Cham, 2016) p. 190.[17] A. M. Aertsen, G. L. Gerstein, M. K. Habib, and G. Palm,Journal of Neurophysiology , 900 (1989).[18] J. Sun, D. Taylor, and E. M. Bollt, SIAM Journal onApplied Dynamical Systems , 73 (2015).[19] J. Runge, Chaos , 075310 (2018).[20] L. Barnett, A. B. Barrett, and A. K. Seth, PhysicalReview Letters , 238701 (2009).[21] F. M. Atay and ¨O. Karabacak, SIAM Journal on AppliedDynamical Systems , 508 (2006).[22] C. W. J. Granger, Econometrica , 424 (1969).[23] P. Wollstadt, J. T. Lizier, R. Vicente, C. Finn,M. Mart´ınez-Zarzuela, P. Mediano, L. Novelli, andM. Wibral, Journal of Open Source Software , 1081(2019).[24] J. T. Lizier, Frontiers in Robotics and AI , 11 (2014).[25] F. Takens, in Dynamical Systems and Turbulence , editedby D. Rand and L. Young (Springer Berlin Heidelberg,1981) pp. 366–381.[26] I. Vlachos and D. Kugiumtzis, Physical Review E ,016207 (2010).[27] L. Faes, G. Nollo, and A. Porta, Physical Review E , 051112 (2011).[28] C. J. Honey, R. Kotter, M. Breakspear, and O. Sporns,Proceedings of the National Academy of Sciences ,10240 (2007).[29] O. Stetter, D. Battaglia, J. Soriano, and T. Geisel, PLoSComputational Biology , e1002653 (2012).[30] J. G. Orlandi, O. Stetter, J. Soriano, T. Geisel, andD. Battaglia, PLoS ONE , e98842 (2014).[31] D. Marinazzo, G. Wu, M. Pellicoro, L. Angelini, andS. Stramaglia, PLoS ONE , e45026 (2012).[32] J. T. Lizier, J. Heinzle, A. Horstmann, J.-D. Haynes, andM. Prokopenko, Journal of Computational Neuroscience , 85 (2011).[33] M. Wibral, B. Rahm, M. Rieder, M. Lindner, R. Vicente,and J. Kaiser, Progress in Biophysics and Molecular Biol-ogy , 80 (2011).[34] M. Wibral, R. Vicente, and J. T. Lizier, Directed Informa-tion Measures in Neuroscience , Understanding ComplexSystems (Springer Berlin Heidelberg, Berlin, Heidelberg,2014) p. 225.[35] J. T. Lizier and M. Rubinov, Max Planck Institute:Preprint (2012).[36] A. Montalto, L. Faes, and D. Marinazzo, PLoS ONE ,e109462 (2014).[37] V. Latora and M. Marchiori, Physical Review Letters ,198701 (2001).[38] D. J. Watts and S. H. Strogatz, Nature , 440 (1998).[39] M. D. Humphries and K. Gurney, PLoS ONE , e0002051(2008).[40] Z. P. Neal, Network Science , 30 (2017).[41] M. Zanin, (2015), arXiv:1505.03689.[42] M. E. J. Newman and M. Girvan, Physical Review E ,026113 (2004).[43] V. Colizza, A. Flammini, M. A. Serrano, and A. Vespig-nani, Nature Physics , 110 (2006).[44] Q. K. Telesford, K. E. Joyce, S. Hayasaka, J. H. Burdette,and P. J. Laurienti, Brain Connectivity , 367 (2011).[45] L. Novelli, F. M. Atay, J. Jost, and J. T. Lizier, Proceed-ings of the Royal Society A: Mathematical, Physical andEngineering Sciences , 20190779 (2020).[46] V. Pernice, B. Staude, S. Cardanobile, and S. Rotter,PLOS Computational Biology , e1002059 (2011).[47] S. Bialonski, M.-T. Horstmann, and K. Lehnertz, Chaos:An Interdisciplinary Journal of Nonlinear Science ,013134 (2010).[48] M. van den Heuvel, C. Stam, M. Boersma, and H. Hul-shoﬀ Pol, NeuroImage , 528 (2008).[49] C. C. Hilgetag and A. Goulas, Brain Structure and Func-tion , 2361 (2015).[50] D. Papo, M. Zanin, J. H. Mart´ınez, and J. M. Buld´u,Frontiers in Human Neuroscience , 1 (2016).[51] J. Hlinka, D. Hartman, and M. Paluˇs, Chaos: An In-terdisciplinary Journal of Nonlinear Science , 033107(2012).[52] E. Langford, N. Schwertman, and M. Owens, The Amer-ican Statistician , 322 (2001).[53] A. Zalesky, A. Fornito, and E. Bullmore, NeuroImage , 2096 (2012).[54] A.-L. Barab´asi and R. Albert, Science , 509 (1999).[55] Y. Virkar and A. Clauset, The Annals of Applied Statistics , 89 (2014). [56] J. Alstott, E. Bullmore, and D. Plenz, PLoS ONE ,e85777 (2014).[57] J. Goni, M. P. van den Heuvel, A. Avena-Koenigsberger,N. Velez de Mendizabal, R. F. Betzel, A. Griﬀa, P. Hag-mann, B. Corominas-Murtra, J.-P. Thiran, and O. Sporns,Proceedings of the National Academy of Sciences ,833 (2014).[58] R. G. Bettinardi, G. Deco, V. M. Karlaftis, T. J. VanHartevelt, H. M. Fernandes, Z. Kourtzi, M. L. Kringel-bach, and G. Zamora-L´opez, Chaos: An InterdisciplinaryJournal of Nonlinear Science , 047409 (2017).[59] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, andE. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment , P10008 (2008).[60] M. P. Young, Proceedings of the Royal Society of London.Series B: Biological Sciences , 13 (1993).[61] Bivariate TE can be analytically derived from the networkstructure under the assumption of VAR dynamics, andspurious links are associated with larger motifs (e.g. longerchains), contributing to the TE at lower orders of magni-tude [45].[62] Y. He, Z. J. Chen, and A. C. Evans, Cerebral Cortex ,2407 (2007).[63] J. Gillis and P. Pavlidis, Bioinformatics , 1860 (2011).[64] D. M. Budden and E. J. Crampin, BMC Systems Biology10