Building surrogate temporal network data from observed backbones

Abstract

In many data sets, crucial elements co-exist with non-essential ones and noise. For data represented as networks in particular, several methods have been proposed to extract a "network backbone", i.e., the set of most important links. However, the question of how the resulting compressed views of the data can effectively be used has not been tackled. Here we address this issue by putting forward and exploring several systematic procedures to build surrogate data from various kinds of temporal network backbones. In particular, we explore how much information about the original data need to be retained alongside the backbone so that the surrogate data can be used in data-driven numerical simulations of spreading processes. We illustrate our results using empirical temporal networks with a broad variety of structures and properties.



Charley Presigny, Petter Holme, and Alain Barrat
Aix Marseille Univ, Université de Toulon, CNRS, CPT, Turing Center for Living Systems, Marseille, France
Tokyo Tech World Research Hub Initiative (WRHI), Tokyo Institute of Technology, Tokyo, Japan
* Corresponding author: [email protected]


Keywords: temporal networks; surrogate data; processes on networks

Many data sets coming from the world around us (transportation systems, human proximity, interactions on social media, etc.) take the form of networks [1, 2]. One of network science's main objectives is to simplify complex, large-scale data sets to highlight important structures, like a map simplifies the geography of a country. There are several different perspectives one can take on how to map out a network. Perhaps the most popular direction is to detect mesoscopic structures, such as community structure. This is somewhat analogous to charting the cities of a country. In this paper, however, we start from the orthogonal problem of identifying the country's highways, i.e., the network backbone structure. Mapping network backbones is intimately connected to another one of network science's aims: to explain and predict dynamic processes on the network. To understand the flow of information or disease on a network, knowing the highway structure is more helpful than knowing the cities.

Like statistical models, backbone extraction gives a compressed picture of a network. It can tell us many things about the original data, but it is not a model per se without an associated decompression algorithm. In this paper, we develop and investigate such algorithms, which generate surrogate networks from a network backbone [6, 9].

These algorithms thus model the rest of the network, apart from the backbone (see Figure 1). In particular, we consider surrogate network construction from temporal network backbones, and investigate how to recreate the same behavior as the original temporal network with respect to the outcome of epidemic models unfolding on the network (note that this is a distinct problem from the one of inferring what contact network is the most likely to correspond to an observed empirical spreading pattern). Temporal networks encode not only which nodes are connected but also when the interactions between them happen [12, 14, 20], and several studies have pointed out the necessity of including realistic topological and temporal structures to correctly describe spreading dynamics on temporal networks, including epidemic, opinion or information spreading [2, 12]. The performance of surrogate data in correctly reproducing the outcome of spreading dynamics might thus be influenced by how such structures are taken into account in the backbone and surrogate data.

arXiv: [physics.soc-ph]

Several methods have been put forward to extract network backbones. For static weighted networks, the simplest way of filtering edges is to remove all the edges with weight below a given threshold value. More principled procedures use statistical tests based on null models to compare the weights of the edges with the ones that would be generated at random by a certain null model [4, 11, 25, 26, 30].

One then fixes a desired significance level and selects only those edges whose weight cannot be explained by the null model at the chosen significance level. These significant edges form the backbone of the network.

In the case of temporal networks, a simple approach for the extraction of backbones is to aggregate the data into a weighted static network. The weight of a link in this network is the number, or total duration, of the interactions between the involved nodes. To avoid neglecting potentially critical temporal features, it is however necessary to define an adequate temporal null model. Such a procedure makes it possible to extract a backbone of significant ties, i.e., of meaningful sequences of temporal contacts between nodes, possibly taking into account the temporal evolution of the nodes' properties [17, 21, 22].
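
As an illustration of the aggregation and weight-thresholding steps described above, the following minimal Python sketch may help. It assumes temporal edges are stored as `(t, i, j)` tuples; the function names `aggregate` and `filter_by_weight` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

def aggregate(temporal_edges):
    """Count the temporal edges of each unordered node pair (the tie weight)."""
    weights = Counter()
    for _t, i, j in temporal_edges:
        weights[(min(i, j), max(i, j))] += 1
    return weights

def filter_by_weight(weights, w_min):
    """Keep only the ties whose aggregated weight is at least w_min."""
    return {tie: w for tie, w in weights.items() if w >= w_min}

# Toy temporal network: (snapshot index, node, node); purely illustrative.
edges = [(0, "a", "b"), (1, "a", "b"), (2, "b", "a"), (2, "a", "c"), (3, "c", "d")]
weights = aggregate(edges)
kept = filter_by_weight(weights, w_min=2)
```

Here the tie between `a` and `b` is the only one that survives the threshold, since it is the only pair with more than one temporal edge.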

Typical backbone-method studies validate the procedures to extract backbones from static and temporal networks on synthetic benchmark tests and various empirical data sets. One explores the main properties of the resulting backbones, and compares these to known properties to understand by which network features they are influenced. However, there is most often no explicit interpretation of the “importance” of the links (except for the simple weight-thresholding procedure). Most importantly, it is unknown whether the information contained in the extracted backbone is enough to correctly summarize the original data and to be actionable, i.e., whether a user with access to the backbone but not to the original data can use it in data-driven applications such as simulations of dynamical processes.

Figure 1: Illustration of backbones and surrogates. The backbone construction identifies the most important links, and thereby compresses the original data. The surrogate generation model adds auxiliary links, extracted at random using specific procedures, to create a network of the same size as the original. The more links (information) retained in the backbone construction, the more similar the surrogate data is to the original. (Diagram labels: backbone construction, surrogate generation, backbone links, auxiliary links, information.)

In this paper, we explore this issue for backbone methods in temporal networks. Given a backbone representing only a fraction of the original data, we put forward and explore several systematic procedures to reconstruct surrogate actionable data by adding auxiliary links to the backbone (see Figure 1) [6, 9].

These auxiliary links are extracted at random with a procedure depending on how the backbone was created. We compare several such procedures applied to backbones obtained through a simple thresholding procedure (serving as baseline) and the significant tie (ST) filter for temporal networks. We also propose a new version of this filter that considers the data's potential group structure. In each case, we explore how much information about the original data needs to be kept alongside the backbone (e.g., some statistical properties concerning the links that have been filtered out).

To show our results' generality, we study temporal-network data with a broad range of topological and temporal structures, collected in an office building (InVS15), a high school (Thiers13), a primary school (LyonSchool) and a scientific conference (SFHH). These data describe close face-to-face proximity of individuals equipped with wearable sensors, with a temporal resolution of seconds. To limit the effect of noise, the data are moreover often aggregated over a coarser resolution of ∆ minutes (e.g., Ref. [17] considers backbones for a range of values of ∆). Here we will use ∆ = 3 minutes, but we have obtained similar results for other temporal resolutions. Such data are conveniently represented as temporal networks in which nodes represent individuals. These networks are in discrete time, i.e., composed of T successive snapshots at times t_0, t_0 + ∆, ..., where t_0 is the initial time of the data set. A temporal edge between two nodes i and j at time t = t_0 + n∆ represents the fact that the corresponding nodes have been in contact during the time interval [t, t + ∆]. We also define a “contact” between i and j as an uninterrupted series of timestamps in which there is a temporal edge between them. The duration of the contact is the length of this series.
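
The definition of a contact given above can be made concrete with a short Python sketch. It assumes a tie's timeline is given as a list of integer snapshot indices n (so that t = t_0 + n∆), and it takes the inter-contact duration to be the number of empty snapshots between two consecutive contacts, which is an assumption made here for illustration. The function name `contacts_and_gaps` is hypothetical.

```python
def contacts_and_gaps(active_steps):
    """Split one tie's timeline into contact and inter-contact durations.

    active_steps: snapshot indices at which the tie has a temporal edge.
    Durations are counted in snapshots; multiply by Delta to get times.
    """
    steps = sorted(set(active_steps))
    contacts, gaps = [], []
    if not steps:
        return contacts, gaps
    run = 1  # length of the contact currently being scanned
    for prev, cur in zip(steps, steps[1:]):
        if cur == prev + 1:
            run += 1  # same uninterrupted series
        else:
            contacts.append(run)
            gaps.append(cur - prev - 1)  # empty snapshots in between
            run = 1
    contacts.append(run)
    return contacts, gaps

contacts, gaps = contacts_and_gaps([3, 4, 5, 9, 12, 13])
```

For this toy timeline the tie has three contacts, of 3, 1 and 2 snapshots, separated by gaps of 3 and 2 snapshots.
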
In each case, we also define the aggregated network as the static weighted network in which a link between two nodes denotes that these two nodes have been in contact at least once, and the weight of the link is given by the number of temporal edges between these nodes. Table 1 gives the main characteristics of each data set.

Data set    Location         Year  N    Duration  N_g   E      T      E_T     Ref.
InVS15      Office building  2015  217  2 weeks   12    4,274  2,307  28,950  8
LyonSchool  Primary school   2009  242  2 days    10    8,317  345    64,419  29
SFHH        Conference       2009  403  2 days    None  9,565  421    73,620  8
Thiers13    High school      2013  326  1 week

Table 1: Data sets considered. N is the number of participants, "Duration" the total duration of the data collection, N_g the number of groups in the population, E the number of ties (i.e., links in the aggregated network), T the number of time stamps (once nights and week-ends, with no activity, have been removed), E_T the number of temporal edges. Here the temporal resolution is ∆ = 3 min.

These data sets were collected in very different contexts, so that the resulting structural and temporal properties of the contact network differ strongly. School and high school populations are divided into classes of similar sizes, with a strong community structure and interactions between classes only during the breaks (occurring with similar patterns in different days) [19, 29].

In the office building, individuals are divided into departments of unequal sizes, and interactions are not limited by strict schedules. In the conference, a homogeneous aggregated contact network is observed.

For each data set, we first extract its backbones according either to a simple thresholding procedure or using the significant tie filter (see below and Methods). Each backbone contains only a tunable fraction f of the original ties (we will use f = 40%, 10% and 5%). In addition to the list of backbone ties (and possibly the corresponding lists of temporal edges), we assume that some additional statistics of the original data sets are conserved, such as the total number of temporal edges, the distributions of contact and inter-contact durations (or simply the parameters of their fit to simple functional forms such as power-laws [5, 9, 18]). Whenever the data presents a group structure, the corresponding metadata can also be conserved alongside the backbones.

We then consider several methods to reconstruct surrogate data from the backbones. Each method consists in adding temporal edges to the backbone in a way tailored to reproduce several statistical features of the original data (see below and Methods). For the resulting surrogate data, we investigate whether they are suitable to feed numerical simulations of dynamical processes, i.e., whether the outcome of dynamical processes simulated on top of the surrogate data is close to the one obtained when using the original data. Specifically, we focus on the paradigmatic susceptible-infectious-recovered (SIR) model of epidemic propagation. In this model, a susceptible (S) node becomes infectious (I) at rate β when in contact with an infectious node. Infectious nodes recover spontaneously at rate ν and enter an immune recovered (R) state. We quantify the outcome of these processes, i.e., the epidemic risk, by two quantities: (i) the basic reproductive number R_0 (the average number of secondary infections by the source) and (ii) the average final size Ω of the spread, i.e., the fraction of nodes that have been in the infectious state at any time. We explore a wide range of parameter values (see Methods for details on numerical simulations and measures).

In the following, we will show in the main text the results for the Thiers13 data set. As we indeed observe a robust phenomenology across data sets, the results for the other data sets are shown in the Supplementary Material.

To extract a backbone of a given size from a temporal network data set, we consider the Significant Ties (ST) filter. In this method, the actual number of temporal edges between two nodes is compared to the one of a temporal null model. The significant ties at significance level α are the ones such that their number of temporal edges cannot be explained by the null model at significance level α.
Specifically, the null model is defined as follows: an “activity level” a_i is associated to each node i, and two nodes i and j have a temporal edge at each time with probability a_i a_j. The activity levels of the nodes are obtained from the data by maximum likelihood estimation (see Methods and Ref. [17]). Tuning α makes it possible to select backbones representing a specific fraction f of the ties of the original data.

Moreover, we extend the ST filter to take into account the group structure of several data sets. The resulting GST filter is obtained by modifying the temporal null model as follows: the probability of a temporal edge between i and j is equal to a_i a_j if i and j belong to the same group, and to p a_i a_j if they belong to different groups. The node activities and parameter p are obtained by maximum likelihood estimation as for the ST filter (see Methods). Note that p < 1 corresponds to cohesive group structures, while p > 1 would be obtained for disassortative structures. It would also be possible to use several values of p depending on the respective groups of i and j, but we consider here for simplicity only one parameter.

In addition, we consider as baseline the simplest method to extract ties that can be interpreted as the most important in a network: we order the ties according to their weight in the aggregated network, as given by their number of temporal edges (in the context of contact networks, this corresponds to the total duration of the contacts between the two nodes forming the tie). The “threshold” backbone (TB) of the original data is then given by the fraction f of ties with the largest weights.

We report in Table 2, for backbones formed of a fraction f = 40%, 10% and 5% of the original network, the corresponding number of temporal edges for each backbone extraction method. Moreover, Figure S1 in the Supplementary Material shows how some statistics of the backbones compare to the ones of the original data.
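
A minimal Python sketch of the (G)ST null model test described above is given here for illustration. It assumes that the activities a_i (and p) are already known, since the maximum likelihood estimation is not reproduced, and that "cannot be explained by the null model" is read as a one-sided upper-tail binomial test over the T snapshots. Both points are assumptions of this sketch, and the names `binom_upper_tail` and `significant_ties` are hypothetical.

```python
import math

def binom_upper_tail(k, n, q):
    """P(X >= k) for X ~ Binomial(n, q), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    total = 0.0
    for m in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
                    + m * math.log(q) + (n - m) * math.log1p(-q))
        total += math.exp(log_term)
    return min(1.0, total)

def significant_ties(weights, activity, T, alpha, group=None, p=1.0):
    """Return the ties whose weight is unlikely under the null model.

    weights: {(i, j): number of temporal edges}; activity: {node: a_i}.
    If group is given, inter-group pairs use probability p * a_i * a_j (GST).
    """
    backbone = {}
    for (i, j), w in weights.items():
        q = activity[i] * activity[j]
        if group is not None and group[i] != group[j]:
            q *= p
        q = min(q, 1.0)
        if binom_upper_tail(w, T, q) < alpha:
            backbone[(i, j)] = w
    return backbone

# Toy example with assumed activities; the expected weight is T * a_i * a_j = 4.
activity = {"a": 0.2, "b": 0.2, "c": 0.2}
weights = {("a", "b"): 15, ("a", "c"): 4}
st_backbone = significant_ties(weights, activity, T=100, alpha=0.01)
```

In this toy case only the tie with 15 temporal edges is retained, because a weight of 4 is exactly what the null model expects. Lowering or raising `alpha` changes how many ties pass, which is how a target fraction f could be reached.
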
As already discussed in Ref. [17], the ST backbone ties tend to have large weights, with distributions clearly shifted to large values with respect to the original data. However, while this happens by definition in the threshold backbone, the distribution of weights in the ST backbone is smooth and does not have a sharp cutoff at a minimal value. Moreover, when the group structure is included (GST backbone), the distribution of weights becomes notably broader. This is due to inter-group ties that tend to have lower weights: these ties appear as significant only when we take into account, through the adequate null model (i.e., through the use of the parameter p), that pairs of individuals belonging to different groups have an a priori tendency to form fewer temporal edges than individuals of the same group. In fact, the ST filter tends to filter out most ties joining nodes of different groups; the GST filter instead keeps ties both within and between groups. We also note that both the clustering coefficient and the modularity of the partition in groups, when measured in the backbones, can strongly deviate from the values in the original data (see Tables S1 and S2 in the Supplementary Material). On the other hand, the distributions of contact and inter-contact durations are close to the ones observed in the original data (see the Supplementary Material).

Backbones are by definition composed of a much smaller number of temporal edges and ties than the original data. As discussed above, their statistical properties are not identical to the ones of the data. It is therefore expected that numerical simulations of spreading processes on top of a backbone largely underestimate their outcome. We illustrate this in Figure 2 and in the Supplementary Material.
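
To make the kind of simulation referred to above concrete, here is a minimal discrete-time SIR sketch in Python. It is not the simulation code of the paper (see Methods there): it assumes nodes labelled 0 to n_nodes − 1, temporal edges given as `(t, i, j)` with integer snapshot index t, and β and ν treated as per-snapshot probabilities. The function name `sir_final_size` is hypothetical.

```python
import random

def sir_final_size(temporal_edges, n_nodes, T, beta, nu, seed_node, rng):
    """Run one SIR outbreak over T snapshots; return the fraction ever infected."""
    contacts = {}
    for t, i, j in temporal_edges:
        contacts.setdefault(t, []).append((i, j))
    state = ["S"] * n_nodes
    state[seed_node] = "I"
    for t in range(T):
        infected_now = set()
        # Transmission along the temporal edges of the current snapshot.
        for i, j in contacts.get(t, []):
            for src, dst in ((i, j), (j, i)):
                if state[src] == "I" and state[dst] == "S" and rng.random() < beta:
                    infected_now.add(dst)
        # Spontaneous recovery of nodes that were infectious in this snapshot.
        for node in range(n_nodes):
            if state[node] == "I" and rng.random() < nu:
                state[node] = "R"
        for node in infected_now:
            state[node] = "I"
    return sum(s != "S" for s in state) / n_nodes

rng = random.Random(0)
forward = [(0, 0, 1), (1, 1, 2)]    # edge 0-1 first, then edge 1-2
backward = [(0, 1, 2), (1, 0, 1)]   # same ties, opposite time order
size_forward = sir_final_size(forward, 3, 2, beta=1.0, nu=0.0, seed_node=0, rng=rng)
size_backward = sir_final_size(backward, 3, 2, beta=1.0, nu=0.0, seed_node=0, rng=rng)
```

The two toy networks have the same aggregated ties, but the infection seeded at node 0 reaches node 2 only when the edges occur in the forward order. Averaging the returned value over many runs and seed nodes would give an estimate of Ω under these assumptions.
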
Note that the underestimation is not as strong as the one that would be obtained by a random sampling of the events, as the backbone ties tend to have large weights.

Figure 2: Original vs. backbone. R_0 (top row) and Ω (bottom row) values obtained from the simulations on the original data and on the backbones, as a function of the SIR parameters, for the Thiers13 data set. Left: original data. Middle: GST backbone with f = 10%. Right: GST backbone with f = 5%. (Panels A to F; axes: transmission probability and recovery rate; color scales: R_0 and outbreak size.)

We therefore put forward several methods to construct surrogate data that are statistically more similar to the original data and, most importantly, yield more accurate estimations of processes' outcomes. Starting from a backbone composed of E^b ties and E^b_T temporal edges, we want to recreate a temporal network with approximately E ties and E_T temporal edges. To this aim, we need to use complementary information, in addition to the list of temporal edges composing the backbone. For instance, it is quite clear that we cannot guess from the backbone itself the correct numbers of ties and temporal edges to be added. Thus, this additional information should be kept alongside the backbone to make it a usable summary of the data. Here we consider several procedures, highlighting in each case the necessary type and amount of information. Note that the resulting list of procedures does not claim to be exhaustive but addresses a wide range of possibilities in terms of available information.

Each procedure can be separated into two steps: (i) choosing ties (not included in the backbone) that interact in the surrogate data, and (ii) building timelines of interactions on the chosen ties. Step (ii) might also need to be performed on the backbone ties if the temporal information of the backbone ties is not available.

For step (i), we consider three distinct methods for backbones extracted using the ST or GST method.

(G)ST-OA, where “OA” stands for “original activities”. We assume that the parameters of the null model used to extract the backbones are available, namely the original node activities {a_i, i = 1, ..., N} (and the parameter p for the GST). Moreover we assume that E_T is known and, for the GST, that the numbers of temporal edges between groups and within groups, E_T,inter and E_T,intra, are also known, as well as the group to which each node belongs. In this procedure, for each pair of nodes (i, j) not in the backbone, we add a temporal edge between i and j at each timestamp with probability α a_i a_j, calibrating α so that the obtained total number of temporal edges is close to E_T (see Methods). For the GST, we use at each time the probabilities α_intra a_i a_j if i and j are in the same group and α_inter p a_i a_j if they are not, calibrating α_inter and α_intra to get approximately the correct number of temporal edges both at the inter-group and intra-group levels.

(G)ST-RA, where “RA” stands for “recomputed activities”. If the parameters of the null model (i.e., the activities {a_i} of the nodes) are not known, we use the fact that applying the MLE equations to the backbone itself yields activity parameters correlated to the original ones (see Table 3 in the Supplementary Material). We thus compute the activity ã_i of each node i (and the parameter p̃ if the group structure is known) in the backbone; we then add at each time a temporal edge between i and j with probability α ã_i ã_j, calibrating α to get approximately the correct number of temporal edges (we assume as previously that E_T is known). For the GST case, we use probabilities α_intra ã_i ã_j and α_inter p̃ ã_i ã_j and calibrate α_inter and α_intra as in the previous method, assuming E_T,inter and E_T,intra are known.

(G)ST-RT, where “RT” stands for “random ties”.
We moreover consider a baseline in which we add to the backbone the correct number of ties at random (i.e., E − E^b), with weights drawn from the list of weights of the non-backbone ties. Note that here we do not consider simply adding the correct number of temporal edges at random between nodes, because that would result in a very large number of ties with only one or few temporal edges, a structure very different from the original data. We thus assume that the number of ties in the original data E is known (E_inter and E_intra if there are groups), in addition to the original number of temporal edges. Moreover, as the distribution of the backbone weights is very different from the original data (see Figure S1 and Ref. [17]), we do not have a simple functional form for the weights of the non-backbone ties. We thus assume that the list of weights of the non-significant ties has been kept.

Finally, for the backbones consisting of the ties with the largest weights (TB), as there is no underlying null model, we only consider the baseline reconstruction method which we denote by (G)TB-RT: we proceed here exactly as for the (G)ST-RT procedure.

Once the ties and the number of temporal edges on each tie have been chosen by one of these procedures, we can create surrogate timelines (step (ii) of the procedure) in various ways. In each case, for each tie (i, j) with number of temporal edges n_ij, the aim is to choose n_ij timestamps out of the T possible ones.

Poisson: if no temporal information on the original data is available, the simplest procedure consists in choosing totally at random the timestamps of the temporal edges for each tie.

BTL-Poisson: if the actual timelines of the backbone ties are known, one can keep these actual timelines and choose at random the timestamps of temporal edges only for the surrogate ties.

Stats: we can instead assume that some information on the statistics of contact and inter-contact durations is known, as these properties have been shown to be extremely robust [5, 8] (see also Figure S1 in the Supplementary Material). They can moreover be approximately fitted to (truncated) power-law forms, meaning that the whole list of values is not needed, but only the parameters of the fit. We can then build a timeline of temporal edges for each tie using contact and inter-contact durations generated randomly from these fitted distributions.

BTL-Stats: if the actual timelines of the backbone ties are known, we keep these actual timelines, and proceed as in the Stats case for the surrogate ties only.

We note here that each step of the procedure is stochastic, with random choices of ties and temporal edges. Thus, repeating the same procedure multiple times yields an ensemble of surrogate temporal networks. In the Methods section, we provide a summary table of these procedures and the corresponding data used.
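
The two steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the calibration of α matches the expected number of added temporal edges to the missing number (the actual calibration is given in the paper's Methods and may differ), and the contact and inter-contact samplers are passed in as hypothetical callables returning positive integers rather than fitted power laws. All function names are hypothetical.

```python
import random
from itertools import combinations

def calibrate_alpha(nodes, activity, backbone, T, n_missing):
    """Scale alpha so the expected number of added temporal edges is n_missing.

    backbone: set of ties (i, j) with i < j; only non-backbone pairs are counted.
    """
    norm = sum(activity[i] * activity[j]
               for i, j in combinations(sorted(nodes), 2)
               if (i, j) not in backbone)
    return n_missing / (T * norm)

def draw_weight(prob, T, rng):
    """Step (i): number of temporal edges on a pair, one trial per snapshot."""
    return sum(rng.random() < prob for _ in range(T))

def poisson_timeline(n, T, rng):
    """Step (ii), 'Poisson': n distinct snapshots chosen uniformly at random."""
    return sorted(rng.sample(range(T), n))

def stats_timeline(n, T, draw_contact, draw_gap, rng):
    """Step (ii), 'Stats': alternate sampled gap and contact durations.

    May return fewer than n snapshots if the T snapshots run out first.
    """
    steps = []
    t = draw_gap(rng) % T  # offset of the first contact
    while len(steps) < n and t < T:
        length = min(draw_contact(rng), n - len(steps), T - t)
        steps.extend(range(t, t + length))
        t += length + draw_gap(rng)
    return steps

rng = random.Random(1)
activity = {"a": 0.1, "b": 0.2, "c": 0.3}   # assumed activities
backbone = {("a", "b")}
alpha = calibrate_alpha(activity.keys(), activity, backbone, T=50, n_missing=9)
n_ac = draw_weight(alpha * activity["a"] * activity["c"], 50, rng)
timeline_poisson = poisson_timeline(5, 50, rng)
timeline_stats = stats_timeline(6, 50, lambda r: 2, lambda r: 3, rng)
```

With constant samplers (contacts of 2 snapshots, gaps of 3), the Stats timeline is a regular pattern, which makes the alternation easy to check by hand. In the BTL variants, the backbone ties would simply keep their observed timelines and only the added ties would go through `poisson_timeline` or `stats_timeline`.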

Figure 3 shows distributions of degrees and weights for the aggregate networks resulting from surrogate data created by several methods for the Thiers13 data set and f = 10%. Similar figures are shown in the Supplementary Material for the other data sets as well as a table with the relative values of the clustering coefficient and of the modularity of the aggregated surrogate networks.

The surrogates based on adding ties according to the ST null model tend to overestimate the node degrees, with the whole distribution shifting to larger values than in the original data and becoming broader. This effect is very strong for the ST-OA and ST-RA, but taking into account groups (GST-OA and GST-RA) leads to much weaker deviations from the data. Using group data also leads to distributions of weights close to the original ones. At the same time, ST-OA and ST-RA have a substantial depletion of the distribution at intermediate weight values (the tail of the distribution being correctly represented as most ties with large weights belong to the ST backbone). Note that these distributions emerge from the surrogate's construction, as the initial distribution is not assumed to be known here.

Figure 3: Distributions of (aggregated) degrees (left column) and weights (right column) in the surrogate data obtained by various methods. From top to bottom: (G)ST-OA, (G)ST-RA, (G)ST-RT, (G)TB-RT. In each case, the blue line shows the distribution for the original data, and the red and green lines show the distributions for the surrogates built respectively without and with group information. Using group information yields distributions closer to the original ones. The surrogates were built from backbones with f = 10% of the ties of the original data. (Panels A to H; axes: P(k) versus k and P(w) versus w; legend: Original; Surrogate, without groups; Surrogate, with groups.)

For surrogates created using random ties, (G)ST-RT and (G)TB-RT, the average degree is well reproduced by design, as the number of ties in the original data is assumed to be known. On the other hand, the distribution of degrees is much narrower than the original one. The distribution of weights is almost identical to the original data since the list of the actual weights of the non-significant ties is assumed to be known.

In terms of clustering and modularity, the procedures in which group information is known and used all lead to values that are close to the original ones, while ignoring group information can yield large discrepancies (see Supplementary Material).

Finally, the distributions of contact and inter-contact durations depend only on the way in which the timelines of ties are built in the surrogate data: they are exponential for the Poisson procedure, and very close to the original data distributions for the Stats procedures (not shown).

Outcome of SIR processes on surrogate data sets

We present our main results in Figures 4 and 5: each panel of the figures shows, as a function of the parameters β and ν, a color plot of the relative differences in the outcomes of SIR processes simulated either on surrogate data or on the original data.
The outcome is measured here by the basic reproductive number R_0, and we show similar results for the epidemic size Ω in the Supplementary Material.

Let us first note a general pattern: R_0 tends to be underestimated, when using surrogate data, at large β and small ν, i.e., at very large R_0 and epidemic size. At large β and ν on the other hand, the tendency is to overestimate the epidemic outcome. Finally, smaller deviations with respect to the simulations on the original data are observed in parameter regions where R_0 is close to 1, i.e., close to the epidemic threshold.

Let us now consider in more detail the results obtained with various types of surrogate data and the effect of the choices made in the reconstruction procedure.

Figures 4 and S6 highlight the impact of using information on the group structure of data. The surrogate data gets more ties between groups when group information is not taken into account (see also the degree distribution in Fig. 3), leading to larger values of R_0 and Ω. As a result, the range of parameters in which R_0 is underestimated is slightly smaller. On the other hand, both R_0 and Ω can be strongly overestimated in some parameter regions and in particular close to the epidemic threshold.

Figure 4: Effect of taking group structure into account when building the surrogate data, for the Thiers13 data set. Each panel shows the relative difference in R_0 obtained from the simulations on surrogate data with respect to simulations on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). (Panels A to F; vertical axis: recovery rate; color scale: relative deviation, from −0.3 to 0.3.)

We furthermore examine (see Figures S11 and S12 in the Supplementary Material) the effect of different timeline reconstruction methods, at a fixed procedure for choosing the surrogate ties. As could be expected, better results are obtained when more statistical information about the actual data timelines is used. In particular, using Poisson timelines leads to stronger overestimations. On the other hand, using timelines with random contact and inter-contact durations reflecting the original data statistics leads to smaller deviations, and using these statistics to create surrogate timelines even on the backbone ties does not have a strong impact.

We thus consider all the surrogate reconstruction methods that take into account the group structure of the data and use the BTL-Stats method for the timelines. Figure 5 shows the results for R_0, while the results for Ω are shown in the Supplementary Material. The main result underlined by the panels is that all methods give rather good results. The deviations with respect to the original data naturally tend to increase as f decreases, but wide ranges of parameters with small variations are still observed even at f = 5%. We also see that recomputing the activities leads to worse underestimations than if the original activities are known. Despite being based on the simple procedure of adding random ties and not using data on the nodes' activities, the RT methods produce results of comparable quality.
However, their cost in terms of conserved information is much higher (as we use the list of weights of non-backbone links in the surrogate construction method).

In this paper, we have considered how to bridge the gap between the backbone of a network and its actual use, particularly in data-driven numerical simulations of dynamic processes. In other words, we have asked how to turn network backbone extraction into the production of surrogate network data. Several backbone extraction procedures have indeed been put forward in the literature to extract a network's most important ties, which are supposed to summarize the most important information in the network. The issue of whether this summary suffices for actual data-driven uses has not been explored.

Here, we have tackled this issue for several types of backbones of a temporal network, by proposing systematic ways to construct surrogate data from the backbone information. We have then used these surrogate data in numerical simulations of epidemic processes and investigated how well the outcomes of simulations and the measure of epidemic risk match the simulations on the original data.

We have considered a wide variety of procedures, with different amounts and types of information on the original data kept in the summary formed by the backbone and completed by additional statistical information. The threshold backbone arbitrarily selects the links with the largest weights, while the significant ties filters are more principled and retain ties that cannot be explained by a null model. In all cases, the data summaries need to be informed by the number of temporal edges in the original data set. Still, the amount of other additional data they contain can vary significantly. In particular, these summaries might include information on the network structure and retain, or not, the values of the node activities computed on the whole data set.
If these values are unknown, we have shown that it is possible to recompute approximate values by applying the ST filter null model to the backbone itself. Alternatively, it is possible to add ties at random between nodes to reach the original number of ties contained in the data, at the cost of also keeping the list of link weights of the non-backbone links. The same procedure can be used to build surrogate data from the threshold backbone.

Most procedures yield surrogate data that allow us to obtain a reasonable approximation of the original outcome when used to simulate epidemic spreading processes. The quality of the approximation, however, depends on the method used to build the surrogate. In particular, the information on the data's group structure turns out to play an important role, in line with other results showing its importance in diffusion processes [9, 27]. Using realistic activity timelines of temporal edges also yields better results. Once group information and realistic timelines are included in the construction of surrogate data, all methods give good results. The largest discrepancies between the original and surrogate data outcomes are obtained at large spreading and recovery parameters. This is not surprising as these parameters correspond to fast processes. In this case, the outcome can depend on the data's details and temporal structures at short timescales that are not present in the surrogate data. For instance, in school data, temporal edges between classes occur in a synchronized way during the breaks, creating activity patterns that would need to be put by hand in the surrogate data and thus be contained in some way in the data summary.

Figure 5: Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the Thiers13 data set. Relative difference in the values of $R_0$ measured in simulations on the surrogate and on the original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. Left column: f = 40%; middle column: f = 10%; right column: f = 5%. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). (Panels A to L; axis label: recovery rate; color scale: relative deviation, from −0.3 to 0.3.)

Our results give hints on how to summarize complex data sets best so that they remain actionable. Moreover, as the construction of surrogate data is a stochastic process, each of the procedures discussed here yields an ensemble of surrogate data with similar statistical properties. This highlights an interesting potential application of our results. Indeed, collecting data sets is an expensive task, and several data properties depend on context, making modeling of realistic temporal networks a problematic task. Simultaneously, the availability of data with real properties is crucial to inform data-driven models of diffusion processes such as epidemics of infectious diseases. Moreover, collected data typically have a limited duration, and merely repeating the data might create undesired biases. The various procedures we have described here make it possible to create synthetic surrogate data with properties very similar to empirical data without modeling assumptions.
By tuning the backbone size, and hence the amount of surrogate data needed to be added to it, one can moreover tune the similarity between the original data and the surrogate replicas.

Our work has some limitations that also indicate the way for future work. First, we have limited our study to data describing contact between individuals. However, these data cover a broad range of contexts, have widely different temporal properties, and are particularly relevant for simulations of epidemic spread. Second, we have considered only a limited number of backbone and surrogate construction methods. We sought to keep the methods parsimonious, so one could consider other backbone extraction methods, taking, e.g., temporal variations of the activities into account. Finally, networks could support other types of processes, such as synchronization or complex contagion, which might also involve higher-order structures going beyond ties [3, 15]. Correctly reproducing the outcome of such processes from a network summary might require the development of backbones of significant structures and corresponding new surrogate data construction methods.

Data and Methods

Data

We use state-of-the-art publicly available data sets describing contacts between individuals in different settings, with high spatial and temporal resolution. All data were collected by the SocioPatterns collaboration, using an infrastructure based on wearable sensors that exchange radio packets, detecting close proximity ( ≤ . m ) of individuals wearing the devices, with temporal resolution of

• The LyonSchool data set contains the contact events between 242 individuals (232 children and 10 teachers) in a primary school in Lyon, France, during two days in October 2009. The children are divided into ten classes of similar sizes (two classes per grade) and follow strict schedules, with mixing between classes limited to the breaks.

• The Thiers13 data set gives the interactions between 327 students of nine classes of similar sizes within a high school in Marseille, during five days in December 2013.

• The SFHH conference data set describes the face-to-face interactions of 405 participants in the 2009 SFHH conference in Nice, France (June 4–5, 2009) [8, 28]. No metadata on the participants was collected and the resulting contact network does not show any group structure.

Significant ties backbones

For completeness, we recall here the procedure to extract the significant ties at a given significance level α from a temporal network.

We first define a temporal fitness model in which each node i has an activity level $a_i$, and the probability u that nodes i and j interact during any given time interval is given by the product of their activity levels, $u(a_i, a_j) = a_i a_j$. Given a data set of N nodes and temporal length T, we estimate the node activity levels $\mathbf{a}^* \equiv (a_1^*, \ldots, a_N^*)$ within the temporal fitness model from the N maximum likelihood equations

$$\sum_{j: j \neq i} \frac{m^o_{ij} - T a_i^* a_j^*}{1 - a_i^* a_j^*} = 0, \quad \forall i = 1, \ldots, N,$$

that can be solved by standard numerical algorithms. We then compute for each pair of nodes i and j the probability distribution of their total number of interactions $m_{ij}$ in the null model, which is simply given by the binomial distribution

$$g(m_{ij} \mid a_i^*, a_j^*) = \binom{T}{m_{ij}} u(a_i^*, a_j^*)^{m_{ij}} \left(1 - u(a_i^*, a_j^*)\right)^{T - m_{ij}}.$$

Let $m^c_{ij}$ denote the c-th percentile ($0 \leq c \leq 100$) of $g(m_{ij} \mid a_i^*, a_j^*)$. If the actual empirical number of interactions $m^o_{ij}$ between i and j is larger than $m^c_{ij}$, this empirical number cannot be explained by the null model at significance level $\alpha \equiv 1 - c/100$: in other words, i and j are connected by a significant tie at significance level α.

For a given value of α, the set of significant ties and the corresponding temporal edges form the ST backbone of the network. As α decreases, the number of significant ties decreases, and one can tune α in order to obtain a backbone formed by a given fraction f of ties.
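The two steps above (estimating the activities, then testing each tie against the binomial null model) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the damped fixed-point iteration is one possible choice of "standard numerical algorithm", not necessarily the solver used for the results reported here.

```python
import math


def estimate_activities(m, T, n_iter=500):
    """Solve sum_{j != i} (m_ij - T a_i a_j) / (1 - a_i a_j) = 0 for all i.

    m is a symmetric N x N list of lists of interaction counts (zero diagonal).
    The damped fixed-point iteration is an assumption of this sketch.
    """
    N = len(m)
    # Starting guess: pretend all partners of i share the activity of i.
    a = [math.sqrt(max(sum(row), 1e-12) / (T * (N - 1))) for row in m]
    for _ in range(n_iter):
        new_a = []
        for i in range(N):
            num = sum(m[i][j] / (1 - a[i] * a[j]) for j in range(N) if j != i)
            den = sum(T * a[j] / (1 - a[i] * a[j]) for j in range(N) if j != i)
            # Geometric damping avoids oscillations of the plain iteration.
            new_a.append(math.sqrt(a[i] * num / den) if den > 0 else 0.0)
        a = new_a
    return a


def log_binom_pmf(k, T, u):
    """Log of the Binomial(T, u) probability of k successes, for 0 < u < 1."""
    return (math.lgamma(T + 1) - math.lgamma(k + 1) - math.lgamma(T - k + 1)
            + k * math.log(u) + (T - k) * math.log(1 - u))


def percentile_threshold(T, u, c):
    """Smallest m such that the Binomial(T, u) CDF at m is at least c/100."""
    cdf = 0.0
    for k in range(T + 1):
        cdf += math.exp(log_binom_pmf(k, T, u))
        if cdf >= c / 100:
            return k
    return T


def is_significant(m_obs, T, a_i, a_j, alpha):
    """True if m_obs exceeds the c-th percentile, with alpha = 1 - c/100."""
    c = 100 * (1 - alpha)
    return m_obs > percentile_threshold(T, a_i * a_j, c)
```

In this sketch, a backbone with a prescribed fraction f of ties would be obtained by scanning α and keeping the ties for which `is_significant` returns `True`.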
Note that, as the significant ties tend to have a large number of interactions, the relative sizes of backbones in terms of number of temporal edges are higher than in terms of number of ties (see Table 2).

| Data set | Threshold, f = 40% | Threshold, f = 10% | Threshold, f = 5% | ST, f = 40% | ST, f = 10% | ST, f = 5% | GST, f = 40% | GST, f = 10% | GST, f = 5% |
|---|---|---|---|---|---|---|---|---|---|
| InVS15 | 25,417 | 16,541 | 12,014 | 24,738 | 16,166 | 9,084 | 20,890 | 13,581 | 9,522 |
| LyonSchool | 56,807 | 33,205 | 22,912 | 55,773 | 32,346 | 18,746 | 41,032 | 24,732 | 16,431 |
| SFHH | 18,802 | 12,650 | 10,253 | 17,257 | 11,950 | 9,763 | – | – | – |
| Thiers13 | 53,992 | 39,171 | 30,239 | 52,975 | 30,981 | 12,678 | 43,834 | 34,613 | 22,572 |

Table 2: Number of temporal edges $E^b_T$ of the various backbones, for various values of the fraction f of ties forming the backbone.

When the nodes are divided into groups, we moreover consider a modified null model in which the probability of interaction at each time between i and j is $u_p(a_i, a_j) \equiv a_i a_j \left(\delta_{g_i,g_j} + p\,(1 - \delta_{g_i,g_j})\right)$, where $g_i$ indicates the group of node i and δ is the Kronecker symbol.

For a given data set, we can write the maximum likelihood equations (MLE) to estimate the node activity levels $\mathbf{a}^* \equiv (a_1^*, \ldots, a_N^*)$ and the parameter $p^*$, similarly to the procedure of Ref. 17: given the null model, the number of times temporal edges are formed between nodes i and j over T time intervals is a random variable $m_{ij}$ that follows a binomial distribution with parameters T and $u_p(a_i, a_j)$. Therefore, the joint probability function reads

$$p(\{m_{ij}\} \mid \mathbf{a}, p) = \prod_{i,j: i \neq j} \binom{T}{m_{ij}} u_p(a_i, a_j)^{m_{ij}} \left(1 - u_p(a_i, a_j)\right)^{T - m_{ij}}, \qquad (1)$$

and the N + 1 MLE equations are

$$\sum_{j: j \neq i} \frac{m^o_{ij} - T a_i^* a_j^* \left(\delta_{g_i,g_j} + p^* (1 - \delta_{g_i,g_j})\right)}{1 - \left(\delta_{g_i,g_j} + p^* (1 - \delta_{g_i,g_j})\right) a_i^* a_j^*} = 0, \quad \forall i = 1, \ldots, N,$$

and

$$\sum_{i,j: g_i \neq g_j} \frac{m^o_{ij} - T p^* a_i^* a_j^*}{1 - p^* a_i^* a_j^*} = 0.$$

The (G)ST filter can be applied to the original data set but also to the backbone itself. In Table 3 we give the correlation coefficients between the activities obtained by solving the MLE equations for a data set and for its extracted backbone representing a fraction f of ties.

| Data set | ST, f = 40% | ST, f = 10% | ST, f = 5% | GST, f = 40% | GST, f = 10% | GST, f = 5% |
|---|---|---|---|---|---|---|
| InVS15 | 0.99 | 0.92 | 0.62 | 0.97 | 0.87 | 0.75 |
| LyonSchool | 0.99 | 0.90 | 0.60 | 0.96 | 0.80 | 0.64 |
| SFHH | 0.97 | 0.93 | 0.90 | – | – | – |
| Thiers13 | 0.99 | 0.73 | 0.25 | 0.97 | 0.92 | 0.68 |

Table 3: Correlation between original activities and activities recomputed using the backbone ties.
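For readers who want to check the N + 1 MLE equations of the group-structured null model, the short derivation below shows how they follow from Eq. (1). It is a sketch that uses only the definitions given above; the proportionality signs absorb constant factors coming from the double counting of pairs in the product.

$$\ln p(\{m_{ij}\} \mid \mathbf{a}, p) = \sum_{i,j: i \neq j} \left[ m_{ij} \ln u_p(a_i, a_j) + (T - m_{ij}) \ln\left(1 - u_p(a_i, a_j)\right) \right] + \mathrm{const}.$$

Since $\partial u_p(a_i, a_j) / \partial a_i = u_p(a_i, a_j) / a_i$,

$$\frac{\partial \ln p}{\partial a_i} \propto \frac{1}{a_i} \sum_{j: j \neq i} \left[ m_{ij} - \frac{(T - m_{ij})\, u_p(a_i, a_j)}{1 - u_p(a_i, a_j)} \right] = \frac{1}{a_i} \sum_{j: j \neq i} \frac{m_{ij} - T u_p(a_i, a_j)}{1 - u_p(a_i, a_j)}.$$

Similarly, $\partial u_p(a_i, a_j) / \partial p = u_p(a_i, a_j) / p$ if $g_i \neq g_j$ and 0 otherwise, so that

$$\frac{\partial \ln p}{\partial p} \propto \frac{1}{p} \sum_{i,j: g_i \neq g_j} \frac{m_{ij} - T p\, a_i a_j}{1 - p\, a_i a_j}.$$

Setting both derivatives to zero at $(\mathbf{a}^*, p^*)$ with $m_{ij} = m^o_{ij}$ gives the N + 1 equations; setting $p = 1$ recovers the N equations of the ST filter.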

Surrogate data

As described in the main text, we have put forward several methods to build surrogate data starting from a backbone. These methods consist of two steps, first choosing the surrogate ties and then creating timelines of temporal edges on each tie.

In Table 4, we summarize each method's main points for each type of backbone, the data needed in addition to the backbone information, and the size of these additional data. Note that the methods adding random ties need a number of inputs of the order of the number of ties in the original data, while the methods based on the null model instead use an input scaling with the number of nodes. The method needing the least extra data is the one recomputing the activities by applying the ST filter methodology to the backbone data itself.

In the methods based on the (G)ST null models, we need to calibrate the parameter α (or the two parameters $\alpha_{intra}$ and $\alpha_{inter}$). To this aim, we first try at each timestamp to add a temporal edge with probability $a_i a_j$ for each (i, j) not in the backbone. This creates a total number of temporal edges $E'_T$. The number of surrogate temporal edges actually needed is $E_T - E^b_T$, i.e., the difference between the number of temporal edges in the data and in the backbone. Therefore, we set $\alpha = (E_T - E^b_T) / E'_T$ and we use $\alpha a_i a_j$ as probabilities of creation of temporal edges. When the data group structure is taken into account, the procedure is performed separately for intra- and inter-group ties. We note that the final number of temporal edges in the surrogate data is not strictly fixed by this procedure but remains a stochastic outcome. The number of ties is not fixed either but is also an outcome of the procedure, contrary to the procedures based on adding random links.

Finally, to construct surrogate timelines respecting the data statistics of event and interevent durations, we proceed as follows, for each tie (i, j) with number of temporal edges $n_{ij}$:

1. we extract a random initial time t in [0, T]; all the times are then considered modulo T; we set $n = n_{ij}$;
2. we iterate the procedure
   (a) extract a random duration τ from the fitted distribution of event durations
   (b) check that τ ≤ n, else replace τ by n
   (c) add τ temporal edges between i and j, namely on the interval [t, t + τ − 1]
   (d) extract a random interevent time Δt from the fitted distribution of interevent durations
   (e) replace t by t + τ + Δt and n by n − τ
   until n = 0, i.e., until $n_{ij}$ temporal edges have been created.

Simulations of the epidemic spread
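Before turning to the simulations, the timeline construction of steps 1 and 2 in the Surrogate data section can be sketched as follows. This is a minimal illustration, not the code used for the results: the function name and arguments are hypothetical, the durations are resampled uniformly from lists of observed values as a stand-in for the fitted distributions, and all contact durations are assumed to be at least 1.

```python
import random


def build_timeline(n_ij, T, durations, gaps, rng, t0=None):
    """Return the timestamps (modulo T) of n_ij temporal edges on one tie.

    durations / gaps: observed contact and inter-contact durations (each
    duration >= 1), resampled uniformly here instead of drawing from a
    fitted distribution (an assumption of this sketch).
    """
    # Step 1: random initial time, unless the caller fixes it.
    t = rng.randrange(T) if t0 is None else t0
    n = n_ij
    timeline = []
    while n > 0:
        tau = min(rng.choice(durations), n)                # steps (a), (b)
        timeline.extend((t + k) % T for k in range(tau))   # step (c)
        dt = rng.choice(gaps)                              # step (d)
        t, n = t + tau + dt, n - tau                       # step (e)
    return timeline
```

For instance, with a single possible contact duration of 2 and a single possible gap of 3, starting at t = 10, five temporal edges are placed at times 10, 11, 15, 16 and 20.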

For the simulation of the SIR model on temporal networks we use the approach and code presented in Ref. 13. We start the simulation with all nodes susceptible and introduce the disease at a random node at a random time (uniformly chosen between the beginning and end of the temporal network). Then, if there is an event between a susceptible and an infectious individual, a contagion occurs with probability β. The infected person recovers with a rate ν, i.e., the time to recovery is a random variable δ extracted from the distribution $\nu \exp(-\nu\delta)$. Finally, we assume that an individual that gets infected at time t′ cannot infect anyone else until t > t′. For every pair of parameter values (β, ν), we run this algorithm times for averages.

We calculate the basic reproductive number $R_0$ directly from the simulations as the average number of individuals infected directly by the source. Calculating the average outbreak size Ω is a similarly straightforward average over the number of nodes in the recovered state when the outbreak is extinct. If the outbreak is not extinct when the simulation reaches the end of the data set, the outbreak size is the number of nodes in either the infectious or the recovered state at the last time stamp of the data.
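The rules above can be illustrated with a naive Python sketch. It is not the event-based algorithm of Ref. 13, only a direct transcription of the stated rules; the function names, the `(t, i, j)` contact format and the number of runs are assumptions made for the example.

```python
import random


def simulate_sir(contacts, beta, nu, source, t_start, rng):
    """One SIR run on a temporal network.

    contacts: list of (t, i, j) tuples sorted by time t.
    Returns (number of nodes infected directly by the source, outbreak size).
    """
    infected_at = {source: t_start}
    recovers_at = {source: t_start + rng.expovariate(nu)}
    infector = {}

    def infectious(node, t):
        # Infected strictly before t and not yet recovered at t.
        return node in infected_at and infected_at[node] < t < recovers_at[node]

    for t, i, j in contacts:
        if t < t_start:
            continue
        for a, b in ((i, j), (j, i)):
            if infectious(a, t) and b not in infected_at and rng.random() < beta:
                infected_at[b] = t
                recovers_at[b] = t + rng.expovariate(nu)
                infector[b] = a
    n_secondary = sum(1 for src in infector.values() if src == source)
    # Every node ever infected is infectious or recovered at the end.
    return n_secondary, len(infected_at)


def average_outcomes(contacts, nodes, beta, nu, n_runs, seed=0):
    """Average R_0 and outbreak size over runs with random source and start."""
    rng = random.Random(seed)
    t_min, t_max = contacts[0][0], contacts[-1][0]
    runs = [simulate_sir(contacts, beta, nu, rng.choice(nodes),
                         rng.uniform(t_min, t_max), rng)
            for _ in range(n_runs)]
    r0 = sum(r for r, _ in runs) / n_runs
    omega = sum(s for _, s in runs) / n_runs
    return r0, omega
```

The same functions can be applied to the original contact list and to a surrogate one, and the relative difference of the two averages gives the kind of quantity shown in the figures.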
| Backbone type | Surrogate type | Method summary | Extra data needed | Size of extra data needed |
|---|---|---|---|---|
| ST | ST-OA | For each (i, j) not in backbone, at each timestamp add a temporal edge with probability $\alpha a_i a_j$, with α scaled to adjust the total number of temporal edges | List of original node activities; Number of temporal edges | N + 1 |
| ST | ST-RA | Compute the activity $\tilde{a}_i$ of each node with the ST backbone method applied on the backbone itself; add surrogate ties as for the ST-OA method, using the recomputed activities | Number of temporal edges | |
| ST | ST-RT | Add ties at random in order to reach the number of ties of the original data, with weights extracted at random from the list of weights of the non-backbone ties | Number of ties; List of weights of ties not in backbone | $E - E^b + 1$ |
| TB | TB-RT | Same as for ST-RT | Number of ties; List of weights of ties not in backbone | $E - E^b + 1$ |
| GST | GST-OA | For each (i, j) not in backbone, at each timestamp add a temporal edge with probability $\alpha_{intra} a_i a_j$ if i and j are in the same group, $\alpha_{inter}\, p\, a_i a_j$ else, with $\alpha_{intra/inter}$ scaled to adjust the total number of temporal edges within and between groups | List of original node activities and parameter p; Group membership; Number of temporal edges within groups and between groups | N + 3 |
| GST | GST-RA | Compute the activity $\tilde{a}_i$ of each node and the parameter $\tilde{p}$ with the GST backbone method applied on the backbone itself; add surrogate ties as for the GST-OA method, using the recomputed activities | Group membership; Number of temporal edges within groups and between groups | N + 2 |
| GST | GST-RT | Add ties at random in order to reach the same number of ties within and between groups as in the original data, with weights extracted at random from the list of weights of the non-backbone ties | Group membership; Number of ties between and within groups; List of weights of non-backbone ties, between and within groups | $N + E - E^b + 2$ |
| TB | GTB-RT | Same as for GST-RT | Group membership; Number of ties between and within groups; List of weights of non-backbone ties, between and within groups | $N + E - E^b + 2$ |

Table 4: Summary of the various methods to choose the surrogate ties.

References

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks.
Rev. Mod. Phys., 74:47–97, 2002.
[2] A. Barrat, M. Barthélemy, and A. Vespignani. Dynamical processes on complex networks. Cambridge University Press, Cambridge, 2008.
[3] F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J.-G. Young, and G. Petri. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports, 874:1–92, 2020.
[4] G. Casiraghi, V. Nanumyan, I. Scholtes, and F. Schweitzer. From relational data to graphs: Inferring significant links using generalized hypergeometric ensembles. In International Conference on Social Informatics, pages 111–120. Springer, 2017.
[5] C. Cattuto, W. Van den Broeck, A. Barrat, V. Colizza, J.-F. Pinton, and A. Vespignani. Dynamics of person-to-person interactions from distributed RFID sensor networks. PLoS ONE, 5(7):e11596, 2010.
[6] J. Fournet and A. Barrat. Estimating the epidemic risk using non-uniformly sampled contact data. Scientific Reports, 7(1):9975, 2017.
[7] L. Gauvin, A. Panisson, and C. Cattuto. Detecting the community structure and activity patterns of temporal networks: a non-negative tensor factorization approach. PLOS ONE, 9(1):e86028, 2014.
[8] M. Génois and A. Barrat. Can co-location be used as a proxy for face-to-face contacts? EPJ Data Science, 7(1):11, 2018.
[9] M. Génois, C. L. Vestergaard, C. Cattuto, and A. Barrat. Compensating for population sampling in simulations of epidemic spread on temporal contact networks. Nature Communications, 6(1):8860, 2015.
[10] M. Génois, C. L. Vestergaard, J. Fournet, A. Panisson, I. Bonmarin, and A. Barrat. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science, 3(3):326–347, 2015.
[11] V. Hatzopoulos, G. Iori, R. N. Mantegna, S. Miccichè, and M. Tumminello. Quantifying preferential trading in the e-MID interbank market. Quantitative Finance, 15(4):693–710, 2015.
[12] P. Holme. Temporal network structures controlling disease spreading. Phys. Rev. E, 94:022305, 2016.
[13] P. Holme. Fast and principled simulations of the SIR model on temporal networks. e-print arXiv:2007.14386, 2020.
[14] P. Holme and J. Saramäki. Temporal networks. Physics Reports, 519:97–125, 2012.
[15] I. Iacopini, G. Petri, A. Barrat, and V. Latora. Simplicial models of social contagion. Nature Communications, 10(1):2485, 2019.
[16] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. Van den Broeck. What's in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol., 271:166–180, 2011.
[17] T. Kobayashi, T. Takaguchi, and A. Barrat. The structured backbone of temporal social ties. Nature Communications, 10(1):220, 2019.
[18] A. Machens, F. Gesualdo, C. Rizzo, A. E. Tozzi, A. Barrat, and C. Cattuto. An infectious disease model on empirical networks of human contact: bridging the gap between dynamic network data and contact matrices. BMC Infectious Diseases, 13(1):1–15, 2013.
[19] R. Mastrandrea, J. Fournet, and A. Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE, 10(9):1–26, 2015.
[20] N. Masuda and R. Lambiotte. A Guide to Temporal Networks. World Scientific Publishing, 2016.
[21] M. Nadini, C. Bongiorno, A. Rizzo, and M. Porfiri. Detecting network backbones against time variations in node properties. Nonlinear Dynamics, 99(1):855–878, 2020.
[22] M. Nadini, A. Rizzo, and M. Porfiri. Reconstructing irreducible links in temporal networks: Which tool to choose depends on the network size. Journal of Physics: Complexity, 1:015001, 2020.
[23] P. Sah, M. Otterstatter, S. T. Leu, S. Leviyang, and S. Bansal. Revealing mechanisms of infectious disease transmission through empirical contact networks. bioRxiv, 2018.
[24] M. T. Schaub, J.-C. Delvenne, M. Rosvall, and R. Lambiotte. The many facets of community detection in complex networks. Applied Network Science, 2(1):4, 2017.
[25] M. Á. Serrano, M. Boguná, and A. Vespignani. Extracting the multiscale backbone of complex weighted networks. Proceedings of the National Academy of Sciences of the United States of America, 106(16):6483–6488, 2009.
[26] L. M. Shekhtman, J. P. Bagrow, and D. Brockmann. Robustness of skeletons and salient features in networks. Journal of Complex Networks, 2(2):110–120, 2014.
[27] T. Smieszek and M. Salathé. A low-cost method to assess the epidemiological importance of individuals in controlling infectious disease outbreaks. BMC Medicine, 11(1):35, 2013.
[28] J. Stehlé, N. Voirin, A. Barrat, C. Cattuto, V. Colizza, L. Isella, C. Régis, J.-F. Pinton, N. Khanafer, W. Van den Broeck, and P. Vanhems. Simulation of an SEIR infectious disease model on the dynamic contact network of conference attendees. BMC Medicine, 9(1):87, 2011.
[29] J. Stehlé, N. Voirin, A. Barrat, C. Cattuto, L. Isella, J. Pinton, M. Quaggiotto, W. Van den Broeck, C. Régis, B. Lina, and P. Vanhems. High-resolution measurements of face-to-face contact patterns in a primary school. PLOS ONE, 6(8):e23176, 2011.
[30] M. Tumminello, S. Micciché, F. Lillo, J. Piilo, and R. N. Mantegna. Statistically validated networks in bipartite complex systems. PLOS ONE, 6(3):1–11, 2011.

Acknowledgements

A.B. was supported by the ANR project DATAREDUX (ANR-19-CE46-0008) and JSPS KAKENHI Grant Number JP 20H04288. P.H. was supported by JSPS KAKENHI Grant Number JP 18H01655.

Author contributions statement

A.B. and P.H. conceived the study. C.P., P.H., A.B. designed and conducted the numerical experiments andanalysed the results. All authors reviewed the manuscript.

Competing interests

The authors declare no competing interests.

Supplementary Material

Supplementary Note 1. Statistics

Figure S1: Backbone statistics for f = 10%. Left column: distributions of weights. Middle column: distribution of contact durations. Right column: distribution of intercontact durations. The black circles (resp. lines for the right column) correspond to the distributions for the original data sets. Red squares (resp. lines) are for the threshold backbone, magenta crosses (resp. lines) for the ST filter and blue triangles (resp. lines) for the GST filter. From top to bottom: InVS15, LyonSchool, SFHH, Thiers13.

| Data set | f | ST | GST | TB | ST-OA | ST-RA | ST-RT | TB-RT | GST-OA | GST-RA | GST-RT | GTB-RT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InVS15 | 0.4 | 0.84 | 0.48 | 0.98 | 0.82 | 0.82 | 0.55 | 0.58 | 0.99 | 0.99 | 0.66 | 0.65 |
| InVS15 | 0.1 | 0.72 | 0.31 | 0.64 | 1.37 | 1.47 | 0.48 | 0.60 | 1.15 | 1.22 | 0.48 | 0.60 |
| InVS15 | 0.05 | 0.36 | 0.16 | 0.39 | 1.64 | 1.83 | 0.48 | 0.49 | 1.18 | 1.40 | 0.60 | 0.59 |
| LyonSchool | 0.4 | 1.23 | 0.50 | 1.22 | 0.82 | 0.82 | 0.62 | 0.62 | 0.77 | 0.77 | 0.69 | 0.65 |
| LyonSchool | 0.1 | 0.71 | 0.54 | 0.79 | 1.32 | 1.34 | 0.55 | 0.54 | 0.87 | 0.92 | 0.63 | 0.62 |
| LyonSchool | 0.05 | 0.37 | 0.35 | 0.41 | 1.48 | 1.51 | 0.54 | 0.55 | 0.96 | 1.05 | 0.62 | 0.62 |
| Thiers13 | 0.4 | 0.89 | 0.40 | 0.97 | 0.49 | 0.49 | 0.32 | 0.33 | 1.11 | 1.10 | 0.85 | 0.74 |
| Thiers13 | 0.1 | 0.50 | 0.41 | 0.57 | 1.04 | 1.16 | 0.22 | 0.22 | 1.01 | 1.01 | 0.72 | 0.71 |
| Thiers13 | 0.05 | 0.24 | 0.23 | 0.29 | 1.27 | 1.47 | 0.22 | 0.22 | 0.99 | 1.03 | 0.72 | 0.71 |
| SFHH | 0.4 | 0.52 | – | 0.90 | 0.76 | 0.74 | 0.46 | 0.52 | – | – | – | – |
| SFHH | 0.1 | 1.04 | – | 1.05 | 1.05 | 1.27 | 0.43 | 0.43 | – | – | – | – |
| SFHH | 0.05 | 1.11 | – | 1.04 | 1.16 | 1.52 | 0.42 | 0.43 | – | – | – | – |

Table S1: Clustering coefficients in backbones and surrogates, normalized by the value of the clustering coefficient in the original data.

| Data set | f | ST | GST | TB | ST-OA | ST-RA | ST-RT | TB-RT | GST-OA | GST-RA | GST-RT | GTB-RT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InVS15 | 0.4 | 0.89 | 0.25 | 0.85 | 0.94 | 0.94 | 0.95 | 0.96 | 0.99 | 0.99 | 0.96 | 1.02 |
| InVS15 | 0.1 | 1.06 | 0.50 | 1.13 | 0.69 | 0.69 | 0.69 | 0.70 | 0.99 | 0.99 | 0.93 | 1.05 |
| InVS15 | 0.05 | 1.03 | 0.53 | 1.22 | 0.38 | 0.35 | 0.41 | 0.54 | 0.99 | 0.99 | 0.91 | 1.05 |
| LyonSchool | 0.4 | 1.03 | 0.22 | 0.93 | 0.99 | 0.98 | 0.99 | 0.98 | 1.00 | 1.01 | 1.01 | 1.05 |
| LyonSchool | 0.1 | 1.26 | 0.45 | 1.24 | 0.63 | 0.62 | 0.62 | 0.64 | 1.00 | 0.99 | 1.00 | 1.14 |
| LyonSchool | 0.05 | 1.24 | 0.52 | 1.32 | 0.35 | 0.33 | 0.38 | 0.47 | 1.00 | 0.98 | 1.01 | 1.12 |
| Thiers13 | 0.4 | 0.98 | 0.32 | 0.97 | 0.92 | 0.92 | 0.92 | 0.94 | 1.00 | 1.00 | 1.00 | 1.02 |
| Thiers13 | 0.1 | 1.05 | 0.77 | 1.06 | 0.55 | 0.54 | 0.57 | 0.69 | 1.00 | 0.99 | 1.01 | 1.03 |
| Thiers13 | 0.05 | 1.03 | 0.53 | 1.22 | 0.38 | 0.35 | 0.41 | 0.54 | 0.99 | 0.99 | 0.91 | 1.05 |

Table S2: Modularity in backbones and surrogates, normalized by the value of the modularity in the original data. Here the modularity is computed imposing as partition the known group structure of the data (as it represents the ground truth), rather than using a community detection algorithm.

Supplementary Note 2. Original vs. backbone

Figure S2: Original vs. backbones. $R_0$ values obtained from the simulations on the original data and on the backbones, for the Thiers13 data set. Top: original data. Second row: ST backbone; third row: GST backbone; fourth row: TB. Left: f = 40%. Middle: f = 10%. Right: f = 5%.

Figure S3: Original vs. backbone. Relative difference in $R_0$ values obtained from the simulations on the original data and on the backbones, for the Thiers13 data set. First row: ST backbone; second row: GST backbone; third row: TB. Left: f = 40%. Middle: f = 10%. Right: f = 5%.

Figure S4: Original vs. backbone. Ω values obtained from the simulations on the original data and on the backbones, for the Thiers13 data set. Top: original data. Second row: ST backbone; third row: GST backbone; fourth row: TB. Left: f = 40%. Middle: f = 10%. Right: f = 5%.

Figure S5: Original vs. backbone. Relative difference in Ω values obtained from the simulations on the original data and on the backbones, for the Thiers13 data set. First row: ST backbone; second row: GST backbone; third row: TB. Left: f = 40%. Middle: f = 10%. Right: f = 5%.

Supplementary Note 3. Using vs. not using the group structure in the backbones and surrogates

Figure S6: Surrogate, effect of taking into account group structure for the Thiers13 data set. Relative difference in Ω obtained from the simulations on surrogate data with respect to simulations on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method).

Figure S7: Surrogate, effect of taking into account group structure for the LyonSchool data set.
Relative difference in $R_0$ obtained from the simulations on surrogate data with respect to simulations on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%.

Figure S8: Surrogate, effect of taking into account group structure for the LyonSchool data set. Relative difference in Ω obtained from the simulations on surrogate data and on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%.

Figure S9: Surrogate, effect of taking into account group structure for the InVS15 data set. Relative difference in $R_0$ obtained from the simulations on surrogate data and on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%.

Figure S10: Surrogate, effect of taking into account group structure for the InVS15 data set. Relative difference in Ω obtained from the simulations on surrogate data and on the original data. Top: ST-RA. Bottom: GST-RA. Left: f = 40%; middle: f = 10%; right: f = 5%.

Supplementary Note 4. Effect of the timeline reconstruction

Figure S11:
Effect of the surrogate timelines, for the Thiers13 data set. Relative difference in $R_0$ obtained from the simulations on surrogate data (GST-RA method) and on the original. First row: Poisson timelines. Second row: BTL-Poisson (backbone timelines kept, Poisson timelines for surrogate ties). Third row: Stats (synthetic timelines respecting the data's statistics). Fourth row: BTL-Stats (same, with backbone timelines kept). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Figure S12: Effect of the surrogate timelines, for the Thiers13 data set. Each panel shows the relative difference in Ω obtained from the simulations on surrogate data and on the original. Here we use the GST-RA method. First row: Poisson timelines for all links. Second row: BTL-Poisson method (backbone timelines kept, Poisson timelines for surrogate ties). Third row: Stats method (synthetic timelines respecting the data's statistics for all ties). Fourth row: BTL-Stats method (backbone timelines kept, synthetic timelines respecting the data's statistics for surrogate ties). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Supplementary Note 5. Various surrogates, other data sets

Figure S13:
Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the Thiers13 data set. Relative difference in Ω measured in simulations on the surrogate and original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. The backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Figure S14: Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the LyonSchool data set. Relative difference in $R_0$ measured in simulations on the surrogate and on the original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Relative difference in the values of Ω measured in simulations on the surrogate and on the original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Figure S16: Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the InVS15 data set. Relative difference in the values of $R_0$ measured in simulations on the surrogate and on the original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Figure S17: Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the InVS15 data set. Relative difference in the values of Ω measured in simulations on the surrogate and on the original data. First row: GST-RA. Second row: GST-OA. Third row: GST-RT. Fourth row: GTB-RT. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.

Figure S18: Outcome of SIR processes on surrogate data obtained by various reconstruction methods, for the SFHH data set. Relative difference in the values of $R_0$ measured in simulations on the surrogate and on the original data. First row: ST-RA. Second row: ST-OA. Third row: ST-RT. Fourth row: TB-RT. In each case the backbone timelines are kept, and timelines respecting the statistics of contact and inter-contact durations are built for the surrogate ties (BTL-Stats method). Left column: f = 40%. Middle column: f = 10%. Right column: f = 5%.