Deep learning of contagion dynamics on complex networks
Charles Murphy, Edward Laurence, and Antoine Allard
Département de Physique, de Génie Physique et d'Optique, Université Laval, Québec (Québec), Canada G1V 0A6
and Centre interdisciplinaire en modélisation mathématique, Université Laval, Québec (Québec), Canada G1V 0A6
(Dated: June 16, 2020)

Forecasting the evolution of contagion dynamics is still an open problem to which mechanistic models only offer a partial answer. To remain mathematically and/or computationally tractable, these models must rely on simplifying assumptions, thereby limiting the quantitative accuracy of their predictions and the complexity of the dynamics they can model. Here, we propose a complementary approach based on deep learning where the effective local mechanisms governing a dynamics are learned automatically from time series data. Our graph neural network architecture makes very few assumptions about the dynamics, and we demonstrate its accuracy using stochastic contagion dynamics of increasing complexity on static and temporal networks. By allowing simulations on arbitrary network structures, our approach makes it possible to explore the properties of the learned dynamics beyond the training data. Our results demonstrate how deep learning offers a new and complementary perspective to build effective models of contagion dynamics on networks.
Our capacity to prevent or contain outbreaks of infectious diseases is directly linked to our ability to accurately model contagion dynamics. Since the seminal work of Kermack and McKendrick almost a century ago [1], a variety of models incorporating ever more sophisticated contagion mechanisms have been proposed, studied and used [2–5]. These mechanistic models have provided invaluable insights about how infectious diseases spread, and have thereby contributed to the design of better public health policies. However, several challenges remain unresolved, which call for contributions from new modeling approaches [6–8]. For instance, many complex contagion processes involve the nontrivial interaction of several pathogens [9–12], and some social contagion phenomena, like the spread of misinformation, require going beyond pairwise interactions between individuals [13–15]. Also, while qualitatively informative, the forecasts of most mechanistic models lack quantitative accuracy. Indeed, most models are constructed from a handful of mechanisms which can hardly reproduce the intricacies of real complex contagion dynamics. One approach to these challenges is to complexify the models by adding more detailed and refined mechanisms. However, mechanistic models rapidly become intractable as new mechanisms are added. Moreover, more complex models require the specification of a larger number of parameters whose values can be difficult to infer from limited data.

On another front, the range of applications of deep learning has grown remarkably fast in recent years [16], from computer vision [17] to natural language processing [18] and time series forecasting [19]. More specifically, deep learning is now being applied to problems related to Network Science using tools like Graph Neural Networks (GNN), which have been designed to handle arbitrarily structured data [20–22].
Recent work has shown great promise for applications in community detection and link prediction [23, 24], in the prediction of dynamical observables [25], in network reconstruction [26] and the detection of structural perturbations [27], as well as in the context of discovering new materials and drugs [28, 29]. In general, GNNs have been shown to adequately handle intricate tasks, making them prime candidates to tackle several challenges of contagion dynamics modeling.

Here, we demonstrate how deep learning can be used to build effective models of stochastic contagion dynamics taking place on complex networks. Instead of constructing a model by specifying the mechanisms driving the dynamics, we consider learning these mechanisms directly from data. We start by posing the machine learning problem explicitly and propose a deep GNN architecture with a reliable protocol to train it. We demonstrate the validity of our approach using various dynamics of increasing complexity on static and temporal networks. Finally, we show how our approach can provide predictions for previously unseen network structures, therefore allowing the exploration of the properties of the learned dynamics beyond the training data.

Our approach assumes that a hidden stochastic contagion dynamics takes place on a network G = (V, E), where V is the set of nodes and E is the set of links. As this hidden dynamical process evolves over time, it generates a multivariate time series X = (X(t) : t ∈ [0, T]), where X(t) is the global state of the system at time t. This global state X(t) = (x_i(t) : i ∈ V) consists of the state of every node i at this time, noted x_i(t). We also denote the finite set of possible node states by Ω, such that x_i(t) ∈ Ω. We assume that this dynamical process can be entirely defined by its local transition probabilities (LTPs), denoted p^{μ→ν}_σ.
These correspond to the probability that a node of degree k in state μ at time t transitions to state ν at time t + Δt given the states σ of its first neighbors, where μ, ν ∈ Ω and σ ∈ Ω^k. By doing so, we assume that the hidden process is stationary, Markovian, and discrete both in time and in terms of its available global states. Note that generalizing our approach to non-Markovian and/or continuous-state dynamics is straightforward; we choose to limit the presentation of our approach to these more restrictive assumptions for the sake of clarity and conciseness.

We consider a trainable GNN model composed of a set of free and tunable parameters Θ that takes as inputs the current global state of the system, X(t), and the structure of the network at time t (e.g. its adjacency list), and outputs the predicted LTPs, denoted p̂^{μ→ν}_σ(Θ). Our objective is then to learn a set of parameters Θ* such that

    p̂^{μ→ν}_σ(Θ*) ≈ p^{μ→ν}_σ ,    (1)

for all transitions μ, ν ∈ Ω and all possible neighborhoods σ. Note that there exist other techniques that can also perform this task. For instance, Bayesian approaches have been used for parameter inference of simple and complex contagion under the assumption of a predefined model [30, 31]. In general, these approaches work well on synthetic data generated by the predefined model, in some cases even without the need of likelihood computations [31]. However, these simple models are only coarse representations of real systems. A more general approach is then to infer the complete Markov chains directly [32], where each LTP is considered as a parameter of the model. We adopt a similar strategy in that we use GNNs to parametrize these Markov chains with an underlying network structure.
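To make the setting concrete, the following sketch (all function and variable names are ours, not the authors') shows how a time series X can be generated from a dynamics that is specified only through its LTPs: at each step, every node samples its next state from the distribution returned by an `ltp` callable given its own state and the states of its first neighbors.

```python
import random

def simulate(adj, x0, ltp, steps, seed=0):
    """Generate a time series X = (X(0), ..., X(steps)) of global states.

    adj   : adjacency list {node: [neighbors]}
    x0    : dict {node: initial state}
    ltp   : callable (state, neighbor_states) -> {next_state: probability}
    """
    rng = random.Random(seed)
    X = [dict(x0)]
    for _ in range(steps):
        current = X[-1]
        nxt = {}
        for i, neighbors in adj.items():
            sigma = [current[j] for j in neighbors]   # states of the first neighbors
            probs = ltp(current[i], sigma)            # local transition probabilities
            states, weights = zip(*probs.items())
            nxt[i] = rng.choices(states, weights=weights)[0]
        X.append(nxt)
    return X

# Hypothetical SIS-like LTPs: infection prob. 1-(1-tau)^l, recovery prob. gamma.
def sis_ltp(state, sigma, tau=0.2, gamma=0.1):
    if state == "S":
        p_inf = 1 - (1 - tau) ** sigma.count("I")
        return {"I": p_inf, "S": 1 - p_inf}
    return {"S": gamma, "I": 1 - gamma}

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}               # a triangle, for illustration
X = simulate(adj, {0: "I", 1: "S", 2: "S"}, sis_ltp, steps=5)
```

In this framing, the training data is the sequence of snapshots X; the learning problem is to recover `ltp` from X alone.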
With their easily adaptable architecture, GNNs have the advantage of combining their parameters hierarchically to compute the LTPs, as well as being universal estimators [33, 34], meaning that they can in principle estimate any set of LTPs. Furthermore, their parameter estimation is scalable and efficiently performed with stochastic gradient descent [35, 36].

In practice, the objective described by Eq. (1) must be encoded into a loss function, denoted L(Θ). An appropriate choice is

    L(Θ) = Σ_{k∈K} Σ_{μ∈Ω} Σ_{σ∈Ω^k} (1/Z) [ −Σ_{ν∈Ω} p^{μ→ν}_σ log p̂^{μ→ν}_σ ] ,    (2)

where K is the set of degree classes, Z is the total number of possible input configurations (μ, σ), and the term between brackets corresponds to the cross entropy between the LTPs of the underlying process (the "ground truth") and the LTPs predicted by the GNN, p̂^{μ→ν}_σ. Equation (2) quantifies the information difference [37] between the original LTPs and the ones predicted by the GNN, averaged over a uniform distribution of all possible inputs (μ, σ).

Two obstacles prevent the exact evaluation of Eq. (2) in realistic settings. First, the LTPs p^{μ→ν}_σ that we want the model to predict—the "ground truth"—are a priori unknown. Second, finite datasets only explore a finite fraction of the configuration space. We propose to resolve both of these issues by approximating Eq. (2) from the dataset X directly. To do so, we consider the following approximate loss,

    L(Θ) ≃ Σ_{t∈T′} Σ_{i∈V′(t)} [ ω_i(t) / (|T′| |V′(t)|) ] [ −log p̂^{x_i(t)→x_i(t+Δt)}_{x_{N_i}(t)} ] ,    (3)

where ω_i(t) is a weight assigned to node i at time t, x_{N_i}(t) is the state of the neighbors of node i at time t, and where V′ ⊆ V and T′ ⊆ [0, T] are subsets of the training dataset.
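Concretely, the approximate loss of Eq. (3) is a weighted negative log-likelihood over the transitions observed in the time series. The sketch below (our naming; for readability it normalizes by the total weight rather than by |T′||V′(t)|) computes the weights ω_i(t) = ρ(x_i(t), x_{N_i}(t))^{−λ} from the empirical frequency ρ of each input configuration:

```python
import math
from collections import Counter

def approximate_loss(transitions, p_hat, lam=0.5):
    """Importance-weighted NLL over observed transitions, cf. Eq. (3).

    transitions : list of (mu, sigma, nu) tuples observed in the time series,
                  with sigma a hashable summary of the neighbors' states
    p_hat       : callable (mu, sigma) -> {nu: predicted probability}
    lam         : relaxation exponent; lam = 1 recovers standard importance sampling
    """
    # Empirical frequency rho(mu, sigma) of each input configuration
    counts = Counter((mu, sigma) for mu, sigma, _ in transitions)
    n = len(transitions)
    rho = {key: c / n for key, c in counts.items()}

    total, norm = 0.0, 0.0
    for mu, sigma, nu in transitions:
        w = rho[(mu, sigma)] ** (-lam)   # down-weight overrepresented inputs
        total += w * (-math.log(p_hat(mu, sigma)[nu]))
        norm += w
    return total / norm
```

With a uniform predictor over two states, every term is −log(1/2), so the loss is log 2 regardless of the weights, which makes the weighting easy to sanity-check.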
FIG. 1. (color online) Validation of the predictions of GNNs trained on a Barabási–Albert random network (BA) [38] for the (A) simple and (B) complex contagion dynamics. The solid lines correspond to the LTPs of the dynamics used to generate the training data (labeled GT for "ground truth"), and the dashed lines show the predictions of the GNNs. Markers correspond to the maximum likelihood estimation (MLE) of the LTPs computed from the training dataset. LTPs corresponding to transmission events are shown in blue (S → I), and the ones related to recovery events are shown in red (I → S). Lines and markers were obtained by averaging the LTPs over every σ corresponding to a same value of ℓ. The standard deviation around these average values is shown using a colored area around the lines (typically narrower than the width of the lines) and using vertical bars for the markers. The training datasets were generated using a BA network composed of |V| = 1000 nodes and with average degree ⟨k⟩ = 4, and we used the parameters (τ, γ) = (0. , . ) and (η, γ) = (8, . ) for the simple and complex contagion dynamics, respectively. All technical details related to training are provided in Supplementary Material.

Note that the complements of these subsets can also be used to validate the model during training (see Supplementary Material).

Two approximations are at play in Eq. (3). First, the first three sums of Eq. (2) are approximated using a sampling scheme where the weights ω_i(t) rebalance the importance of "overrepresented" inputs, i.e. inputs (μ, σ) whose frequency of occurrence in X, noted ρ(μ, σ), is high. Choosing ω_i(t) = ρ(x_i(t), x_{N_i}(t))^{−λ} effectively evaluates Eq. (3) with a relaxed version of importance sampling [39], which reduces the importance of frequent inputs and increases that of the rare ones. This procedure is identical to standard importance sampling when λ = 1. Second, the average over ν is replaced by a Monte Carlo approximation since the "ground truth" LTPs, p^{μ→ν}_σ, are a priori unknown. Note that, for any specific input (μ, σ), this approximation converges to its expected value only if the corresponding number of transitions found in the subsets of the training dataset is sufficiently large.
This second approximation will therefore necessarily be poor for rare inputs, and we set λ < 1 to limit their influence on the quality of the training. To isolate the effect of this second approximation, we consider a semi-exact training scheme where only the first three sums of Eq. (2) are approximated using the importance sampling scheme, and the sum on ν is computed using the "ground truth" LTPs—identically to Eq. (2). The models trained under the semi-exact scheme, denoted GNN*, represent a best-case scenario of the models trained using Eq. (3), and will therefore allow us to assess the impact of the dataset quality on the accuracy of the predictions.

We propose the GNN architecture detailed in Supplementary Material. In a nutshell, the GNN is a nonlinear function f(x_i(t), x_{N_i}(t)) that receives the state of a node i and the states of its neighbors N_i, and then returns the LTPs providing the probabilities for node i to be in each of the available states at t + Δt,

    f(x_i(t), x_{N_i}(t)) = ( p̂^{x_i(t)→ν}_{x_{N_i}(t)} )_{ν∈Ω} .    (4)

This function is identical for all nodes and, in practice, the GNN is applied in parallel to every node. To allow the GNN to be applicable to any neighborhood size, the states of the neighbors are aggregated with an attention mechanism inspired by [40].

FIG. 2. (color online) Evaluation of the extrapolation capability of GNNs trained on Erdős–Rényi (ER, top row) and Barabási–Albert (BA, bottom row) random networks [38] for the (A, D) simple, (B, E) complex and (C, F) interacting contagion dynamics. The dashed lines correspond to the GNN prediction error, i.e. the average Jensen–Shannon distance (JSD) between the LTPs of the dynamics used to generate the training datasets (labeled GT for "ground truth") and the predictions of the GNNs trained with these datasets. Similarly, the dotted lines correspond to the JSD error of GNNs trained on the same datasets but using a semi-exact training scheme (denoted GNN*). For comparison, the solid lines correspond to the JSD error of an uninformed baseline in which all local transitions are equiprobable, i.e. p^{μ→ν}_σ = |Ω|^{−1}, where |Ω| is the number of node states. The prediction error of the MLEs is shown with markers to clearly illustrate where data is lacking in the training datasets. Lines and markers were obtained by averaging the JSD errors over every σ corresponding to a same value of k. The standard deviation around these average values is shown with a colored area around the lines and with vertical bars for the markers. The training datasets were generated using an ER or BA network composed of |V| = 1000 nodes and with average degree ⟨k⟩ = 4. We used the same parameters as in Fig. 1 for the simple and complex contagion dynamics, and used (τ₁, τ₂, γ₁, γ₂, ζ) = (0. , . , . , . , ) for the interacting contagion dynamics. All technical details related to training are provided in Supplementary Material.
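The authors' full architecture is specified in their Supplementary Material; the toy sketch below (entirely our own naming and weight shapes, not their model) only illustrates the generic shape of Eq. (4): one-hot node states, an attention-weighted aggregation over neighbors of arbitrary number, and a softmax head returning p̂ for every node in parallel, independently of the global network structure.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gnn_step(states, adj, params):
    """Toy inductive GNN: returns p_hat[i, nu] for every node i in parallel.

    states : (N,) int array of node states in {0, ..., |Omega|-1}
    adj    : adjacency list {i: [neighbors]}
    params : dict of weight arrays (W_self, W_nbr, a, W_out), shapes are ours
    """
    n_states = params["W_out"].shape[1]
    x = np.eye(n_states)[states]                           # one-hot encoding of x_i(t)
    p_hat = np.zeros((len(states), n_states))
    for i, nbrs in adj.items():
        h_nbr = x[nbrs]                                    # neighbor features
        scores = softmax(h_nbr @ params["a"])              # attention over neighbors
        agg = (scores[:, None] * h_nbr).sum(axis=0)        # size-independent aggregation
        h = x[i] @ params["W_self"] + agg @ params["W_nbr"]
        p_hat[i] = softmax(h @ params["W_out"])            # distribution over Omega
    return p_hat
```

Because the same function is applied edge-by-edge at every node, a model of this shape can be evaluated on any network once trained, which is the inductive property the text describes.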
A notable advantage of this GNN architecture is its inductive nature: it learns how to combine the states of the neighbors locally, i.e. through the edges of the network. It is therefore independent of the global network structure, and can consequently be used on any network structure once it has been trained.

We illustrate our approach by applying it to three types of stochastic contagion dynamics whose behaviors heavily depend on the structure of the underlying network [3, 12, 41]. We first consider a simple contagion dynamics: the discrete-time susceptible-infected-susceptible (SIS) dynamics, in which nodes are either susceptible (S) or infected (I). At each time step, infected nodes transmit the disease to each of their susceptible neighbors with probability τ, and recover from the disease with probability γ, thus becoming susceptible again. The LTPs of this dynamics are

    p^{S→I}_ℓ = 1 − p^{S→S}_ℓ = 1 − (1 − τ)^ℓ ,    (5a)
    p^{I→S}_ℓ = 1 − p^{I→I}_ℓ = γ .    (5b)

Note that the state of the neighbors of a node, σ, is fully specified by the number of infected neighbors, ℓ. We stress that the GNN is not a priori designed to compute the LTPs as a function of ℓ, but rather computes them as a function of the complete state of the neighbors, σ. The GNN must therefore learn that the number of infected neighbors is sufficient to compute the LTPs.

Equation (5a) assumes that disease transmission events are independent. In other words, the probability for a susceptible node to be infected by any of its infectious neighbors does not depend on the state of its other neighbors. The second dynamical process we consider lifts this assumption by replacing Eq. (5a) with the "Planck-like" nonmonotonic infection probability

    p^{S→I}_ℓ = ℓ³ / [ z(η) (e^{ℓ/η} − 1) ] ,    (6)

where z(η) is fixed such that p^{S→I}_ℓ equals 1 at its maximum, the position of this maximum being controlled by the parameter η > 0.
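For reference, Eqs. (5a) and (6) can be written out directly. A caveat: the functional form of `p_infection_complex` is our reconstruction of the Planck-like shape (the cubic numerator follows Planck's law and matches the nonmonotonic behavior described in the text), and normalizing z(η) over integer ℓ is likewise our assumption.

```python
import math

def p_infection_simple(l, tau):
    """Eq. (5a): l independent transmission attempts, each succeeding w.p. tau."""
    return 1 - (1 - tau) ** l

def p_infection_complex(l, eta, l_max=100):
    """Eq. (6), as we reconstruct it: a 'Planck-like' nonmonotonic probability."""
    def f(m):
        return m ** 3 / (math.exp(m / eta) - 1) if m > 0 else 0.0
    # z(eta): normalize so the maximum over integer l in [1, l_max] equals 1
    z = max(f(m) for m in range(1, l_max + 1))
    return f(l) / z
```

Under this form, the infection probability first grows with the number of infected neighbors ℓ and then decays once ℓ exceeds a scale set by η, which is the qualitative behavior the surrounding text attributes to the complex contagion.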
We refer to this second dynamics as complex [43] since transmission is now a nonlinear process that depends on the state of the other neighbors of a node. While unrealistic in the context of disease spreading, Eq. (6) has an interesting interpretation in the context of the propagation of a social behavior. The probability that an individual adopts a new behavior increases monotonically with their number of friends that have adopted the behavior if it appears new or scarce (low ℓ). However, the same individual will become reluctant to adopt this behavior if it has become mainstream and popular (high ℓ), resulting in a monotonically decreasing probability of adoption.

The third contagion dynamics we consider is an extension of the SIS dynamics to two interacting infectious diseases, in which nodes can be in four different states: susceptible to both diseases (SS), infected by one disease and susceptible to the other (IS or SI), or infected by both diseases (II). Like the SIS dynamics, nodes infected by disease g will transmit it to a susceptible neighbor with probability τ_g and will recover from g with probability γ_g. The interaction between the two diseases affects the probability of transmission whenever either node, the infected or the susceptible one, is already infected by the other disease. A node infected by disease g will then infect a neighbor susceptible to g with probability ζτ_g, where ζ controls the strength of the interaction between the two diseases. This interacting contagion dynamics is encoded in 12 LTPs similar to Eqs. (5), whose expressions are given in Supplementary Material.
FIG. 3. Bifurcation diagrams of the simple, complex and interacting contagion dynamics on Poisson networks [38] composed of |V| = 2000 nodes with different average degrees ⟨k⟩. Prevalence is defined as the fraction of nodes infected by at least one disease. Panels (A–C) and (G–I) show the steady-state prevalence obtained by solving a set of mean-field equations adapted from Ref. [42] (see Supplementary Material). The solid vertical lines indicate the critical points where a steady state changes its stability; a colored area is used to indicate the bistable regime that exists when the two steady states do not change their stability simultaneously. Panels (D–F) and (J–L) show the steady-state prevalence obtained by standard stochastic numerical simulations; each marker is averaged over 100 simulations and the vertical bars show the standard deviation. Red lines, markers and areas are obtained by using the LTPs of the original dynamics (labeled GT for "ground truth") in the mean-field equations as well as in the numerical simulations. Blue lines, markers and areas (first row) are obtained by using the LTPs inferred with the GNN trained on a dataset generated with a Barabási–Albert network composed of |V| = 1000 nodes and with average degree ⟨k⟩ = 4 (value indicated by a vertical dotted black line for reference). Green lines, markers and areas (second row) are obtained by using the LTPs inferred with the GNN trained on the same dataset but using a semi-exact training scheme (denoted GNN*). The parameters of the contagion dynamics are the same as for Figs. 1 and 2.

In Fig. 1, we show the LTPs predicted by the GNN and compare them with Eqs. (5)–(6) as well as with the maximum likelihood estimators (MLEs) of p^{μ→ν}_σ. The MLEs are computed from the time series and correspond to the frequency at which nodes in state μ with neighborhood σ transition to ν at the next time step.
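The MLE just described amounts to transition counting, with one disjoint counter per input configuration (μ, σ); a minimal sketch (our naming):

```python
from collections import Counter, defaultdict

def mle_ltps(transitions):
    """Maximum likelihood estimate of the LTPs from observed transitions.

    transitions : list of (mu, sigma, nu) tuples, where sigma is a hashable
                  summary of the neighbors' states (e.g. number infected)
    Returns {(mu, sigma): {nu: estimated probability}}.
    """
    counts = defaultdict(Counter)
    for mu, sigma, nu in transitions:
        counts[(mu, sigma)][nu] += 1    # disjoint subsets: one counter per input
    return {key: {nu: c / sum(ctr.values()) for nu, c in ctr.items()}
            for key, ctr in counts.items()}

obs = [("S", 1, "I"), ("S", 1, "S"), ("S", 1, "S"), ("I", 0, "S")]
est = mle_ltps(obs)
```

Note that an input (μ, σ) never observed in the data simply has no entry here, which is exactly the inability of the MLE to interpolate or extrapolate discussed below.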
FIG. 4. (color online) Projection on real-world temporal networks of proximity contacts between individuals in conferences (Hypertext 2009, SFHH), workplaces (InVS13, InVS15), schools (Thiers13, HS13) and hospitals (LH10) [44–46]. (A) Steady-state fraction of susceptible nodes (in blue) and of infected nodes (in red) for the simple contagion dynamics on the aggregated version of these temporal networks. Triangles correspond to numerical simulations averaged over 100 realizations obtained using the same LTPs as in Fig. 3(A, D). Circles show the results of numerical simulations performed with the "ground truth" LTPs of the original dynamics. (B–D) Numerical simulations of the (B) simple, (C) complex and (D) interacting contagion dynamics over time on the workplace InVS15 temporal network, averaged over 100 realizations. Solid lines used the "ground truth" LTPs, and dashed lines used the LTPs extracted from the GNN (the same LTPs as in Fig. 3). Colored areas around the lines indicate the standard deviation. Alternating white and blue zones indicate the time windows, each consisting of 10 time steps of the original dataset. Results for the other temporal networks are available in Supplementary Material.

We find that the GNN fits the LTPs of the simple and complex contagion dynamics remarkably well. In fact, we found the predictions of the GNN to be systematically smoother than the ones provided by the MLEs. This is because the MLEs estimate the LTPs for each individual pair (μ, σ) from disjoint subsets of the training dataset. This implies that a large number of samples of each pair (μ, σ) is needed for the MLEs to provide accurate estimates of every LTP; a condition rarely met in realistic settings, especially for high-degree nodes. This also means that the MLE can neither interpolate between nor extrapolate beyond the pairs (μ, σ) present in the training dataset. In contrast, the GNN works in a conceptually different manner: it is differentiable by construction and all parameters are hierarchically involved in the estimation of the LTPs. Its predictions are therefore smoother, more consistent, and able to inter- or extrapolate beyond the training dataset. It also means that the GNN benefits from any sample to improve its predictions for all LTPs.

Figure 2 addresses the accuracy of the predictions of the GNN for unseen local network structures, i.e. predictions of LTPs corresponding to pairs (μ, σ) that are not necessarily present in the training dataset. These predictions were extracted from the GNN by applying it to a star graph of k + 1 nodes—k nodes of degree 1 connected to a central node of degree k. Because σ is typically multivariate, and thus cannot be summarized by a single scalar ℓ as for the simple and complex contagion dynamics presented in Fig. 1—it is three-dimensional for the interacting contagion dynamics—we use the Jensen–Shannon distance (JSD) [37] to quantify the similarity between the LTPs predicted by the GNN and the "ground truth". Figure 2 shows averages of the JSD, with standard deviations, over all possible neighborhoods σ and states μ for various values of k. Note that the JSD is similar to the loss function of Eq. (2), but it has the notable advantages of being symmetrical, of being more forgiving when an LTP is wrongly predicted to be equal to zero (something that happens with MLEs), and of allowing meaningful comparisons between more than two distributions at the same time (by virtue of being a metric).

Figure 2 confirms that the GNN provides more accurate predictions than MLEs in general. This is especially true in the case of the interacting contagion on BA networks, where MLEs are only marginally different from the uninformed baseline for high-degree nodes. This is a consequence of how scarce the inputs are for this dynamics compared to both the simple and complex contagion dynamics for training datasets of the same size, and of how fast the size of the set of possible inputs scales, thereby quickly rendering MLEs completely ineffective for small training datasets. Figure 2(F) therefore provides a telling illustration of the challenge of inferring the parameters of slightly complex contagion models from limited data. The GNN, on the other hand, is less affected by the scarcity of the data, since any sample improves the predictions for all LTPs, as discussed above. In fact, the GNN accuracy is often close to the best-case scenario, illustrated by GNN*, trained under similar conditions but with a semi-exact training scheme.

Figure 2 also suggests that the underlying network structure of the training dataset (mainly its degree distribution) plays a crucial role in the accuracy of the GNN's predictions for unseen local structures. Indeed, our results show that the GNN provides more accurate predictions when interpolating between observed values of k than when extrapolating to values of k higher than those present in the training dataset.
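The Jensen–Shannon distance used in Fig. 2 can be sketched as the square root of the Jensen–Shannon divergence between two LTP distributions (here with natural logarithms; choosing base 2 instead would bound the metric by 1):

```python
import math

def jsd(p, q):
    """Jensen-Shannon distance between two distributions over the same states.

    p, q : dicts {state: probability}. Returns a metric in [0, sqrt(log 2)].
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero probability contribute 0
        return sum(pa * math.log(pa / b[s]) for s, pa in a.items() if pa > 0)
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in set(p) | set(q)}
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Unlike the cross entropy of Eq. (2), this quantity is symmetric and remains finite even when one distribution assigns zero probability to an observed transition, which is why it is the natural choice for comparing GNN, MLE and ground-truth LTPs.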
In other words, it is easier for the GNN to interpolate between "landmarks" provided by the dataset than to extrapolate beyond the training dataset without any bearings whatsoever. Heterogeneous networks, like Barabási–Albert networks, are therefore expected to yield accurate predictions over a wider range of local structures (μ, σ) than homogeneous networks, even if the training datasets are similar in size. This is also supported by the fact that the best-case scenario in terms of GNN accuracy (GNN*) is significantly poorer for homogeneous networks [Figs. 2(A–C)] than for heterogeneous networks [Figs. 2(D–F)]. The amount of valuable information contained in a given dataset can therefore be interpreted through the lens of the properties of the training dataset's underlying network structure. Altogether, these findings suggest a wide applicability of our approach to real complex systems, whose underlying network structures recurrently exhibit a heterogeneous degree distribution [47].

To further illustrate the inductive nature of our GNN architecture, we use it to recover the bifurcation diagram of the three contagion dynamics and to project their behaviors on real-world networks. In the infinite-size limit |V| → ∞, these dynamics have two possible steady states: the absorbing state, where all nodes are susceptible, and the endemic state, in which a fraction of nodes remains infected over time [3, 12, 41]. These steady states exchange stability during a phase transition, which is continuous for the simple contagion and discontinuous for the complex and interacting contagion dynamics. The position of the phase transition depends both on the parameters of the dynamics and on the topology of the network on which they take place.
Note that for the complex and interacting contagion dynamics, the stabilities of the absorbing and endemic states do not change at the same point, giving rise to a bistable regime where both states are stable.

Figure 3 shows the bifurcation diagrams obtained by analyzing a mean-field framework which uses the LTPs extracted from the GNN, as well as those obtained by performing numerical simulations with these same LTPs. In particular, the mean-field perspective demonstrates the possibility to develop an approximate representation of the GNN predictions that, although not as accurate as the predictions obtained by numerical simulations, is amenable to mathematical analysis (i.e. position of fixed points, their stability, and the critical thresholds of the phase transitions). The first row of Fig. 3 shows that the GNN correctly predicts whether the phase transition is continuous or discontinuous, as well as the existence of a bistable regime. Quantitatively, the predictions are also strikingly accurate—essentially perfect for the simple contagion dynamics—which is remarkable given that the bifurcation diagrams were obtained on networks the GNN had never seen before. The second row of Fig. 3 shows the same bifurcation diagrams but using GNN* models trained under the semi-exact training scheme. The almost perfect overlap of the curves and thresholds indicates that there is virtually no intrinsic limit to the accuracy of the GNN's predictions, and highlights once more the critical influence of the information contained in the training dataset. In other words, our GNN architecture is as accurate as its training data allows it to be.

Finally, similar levels of accuracy are obtained when simulating the three contagion dynamics on static [Fig. 4(A)] and temporal [Figs. 4(B–D)] real-world complex networks. Recall that the GNN used to produce these results is the same as the one used in Fig. 3, and was trained on a synthetic static network of a different size.
The projections obtained from the GNN remain in the same vicinity as those generated by the "ground truth" LTPs throughout the timespan of the dataset, and show similar patterns of population infections/recoveries when the configuration of the network changes.

In summary, we introduced a data-driven approach that learns the effective mechanisms governing the propagation of contagion dynamics on complex networks. We proposed a reliable training protocol, and we validated the projections of our GNN architecture on simple, complex and interacting contagion dynamics using synthetic as well as real-world static and temporal complex networks. Interestingly, we found that our approach performs better when trained on data whose underlying network structure is heterogeneous, which could prove useful in real-world applications of our method given the ubiquity of scale-free networks [47].

By recovering the bifurcation diagram of various dynamics, we illustrated how our approach can leverage time series from an unknown dynamical process to gain insights about its properties—the existence of a phase transition and its order. Most importantly, we showed how to explicitly extract the LTPs inferred by the GNN architecture, which in turn can help uncover the underlying mechanisms governing the dynamics and help build effective mechanistic models. In a way, we see this approach as the equivalent of a numerical Petri dish—offering a new way to experiment and gain insights about an unknown contagion dynamics—that is complementary to traditional mechanistic modeling.

Although we focused the presentation of our method on contagion dynamics, its potential applicability reaches many other realms of complex systems modeling where intricate mechanisms are at play.
We believe this work establishes solidfoundations for the use of deep learning in the design of moreeffective realistic models of complex systems.We are grateful to Emily Cyr and Guillaume St-Ongefor helpful comments and to Franc¸ois Thibault, PatrickDesrosiers, Louis J. Dub´e, Simon Hardy and Laurent H´ebert-Dufresne for fruitful discussions. This work was supported bythe Sentinelle Nord initiative from the Canada First ResearchExcellence Fund (CM, EL, AA), the Conseil de recherches ensciences naturelles et en g´enie du Canada (CM, AA) and theFonds de recherche du Qu´ebec-Nature et technologie (EL). [1] W. O. Kermack and A. G. McKendrick, “A Contribution to theMathematical Theory of Epidemics,” Proc. R. Soc. A , 700–721 (1927).[2] H. W. Hethcote, “The Mathematics of Infectious Diseases,”SIAM Rev. , 599–653 (2000).[3] I. Z. Kiss, J. C. Miller, and P. L. Simon, Mathematics of Epi-demics on Networks (Springer, 2017) p. 598.[4] F. Brauer, C. Castillo-Chavez, and Z. Feng,
Mathematical Models in Epidemiology (Springer, 2019).
[5] C. I. Siettos and L. Russo, "Mathematical modeling of infectious disease dynamics," Virulence, 295–306 (2013).
[6] N. C. Grassly and C. Fraser, "Mathematical models of infectious disease transmission," Nat. Rev. Microbiol., 477–487 (2008).
[7] A. Pastore y Piontti, N. Perra, L. Rossi, N. Samay, and A. Vespignani, Charting the Next Pandemic: Modeling Infectious Disease Spreading in the Data Science Age (Springer, 2019).
[8] C. Viboud and A. Vespignani, "The future of influenza forecasts," Proc. Natl. Acad. Sci. U.S.A., 2802–2804 (2019).
[9] L. Hébert-Dufresne, S. V. Scarpino, and J.-G. Young, "Macroscopic patterns of interacting contagions are indistinguishable from social reinforcement," Nat. Phys., 426–431 (2020).
[10] S. Nickbakhsh, C. Mair, L. Matthews, R. Reeve, P. C. D. Johnson, F. Thorburn, B. von Wissmann, A. Reynolds, J. McMenamin, R. N. Gunson, and P. R. Murcia, "Virus–virus interactions impact the population dynamics of influenza and the common cold," Proc. Natl. Acad. Sci. U.S.A., 27142–27150 (2019).
[11] D. M. Morens, J. K. Taubenberger, and A. S. Fauci, "Predominant Role of Bacterial Pneumonia as a Cause of Death in Pandemic Influenza: Implications for Pandemic Influenza Preparedness," J. Infect. Dis., 962–970 (2008).
[12] J. Sanz, C.-Y. Xia, S. Meloni, and Y. Moreno, "Dynamics of Interacting Diseases," Phys. Rev. X, 041005 (2014).
[13] D. Centola, "The Spread of Behavior in an Online Social Network Experiment," Science, 1194–1197 (2010).
[14] S. Lehmann and Y.-Y. Ahn, eds., Complex Spreading Phenomena in Social Systems, Computational Social Sciences (Springer, 2018).
[15] I. Iacopini, G. Petri, A. Barrat, and V. Latora, "Simplicial models of social contagion," Nat. Commun., 2485 (2019).
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in
NIPS Proc. (2012) pp. 1097–1105.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," in
Adv. Neural Inf. Process. Syst. 30 (NIPS 2017) (2017) pp. 5998–6008.
[19] M. Längkvist, L. Karlsson, and A. Loutfi, "A review of unsupervised feature learning and deep learning for time-series modeling," Pattern Recognit. Lett., 11–24 (2014).
[20] Z. Zhang, P. Cui, and W. Zhu, "Deep Learning on Graphs: A Survey," (2018), arXiv:1812.04202.
[21] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph Neural Networks: A Review of Methods and Applications," (2018), arXiv:1812.08434.
[22] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How Powerful are Graph Neural Networks?" (2019), arXiv:1810.00826.
[23] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation Learning on Graphs: Methods and Applications," (2017), arXiv:1709.05584.
[24] M. Balakrishnan and G. T. V, "A neural network framework for predicting dynamic variations in heterogeneous social networks," PLOS ONE, e0231842 (2020).
[25] F. A. Rodrigues, T. Peron, C. Connaughton, J. Kurths, and Y. Moreno, "A machine learning approach to predicting dynamical observables from network structure," (2019), arXiv:1910.00544.
[26] Z. Zhang, Y. Zhao, J. Liu, S. Wang, R. Tao, R. Xin, and J. Zhang, "A general deep learning framework for network reconstruction and dynamics learning," Appl. Netw. Sci., 110 (2019).
[27] E. Laurence, C. Murphy, G. St-Onge, X. Roy-Pomerleau, and V. Thibeault, "Detecting structural perturbations from time series with deep learning," (2020), arXiv:2006.05232.
[28] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, "Protein Interface Prediction using Graph Convolutional Networks," in Adv. Neural Inf. Process. Syst. 30 (2017) pp. 6530–6539.
[29] M. Zitnik, M. Agrawal, and J. Leskovec, "Modeling polypharmacy side effects with graph convolutional networks," in
Bioinformatics, Vol. 34 (2018) pp. i457–i466.
[30] F. Altarelli, A. Braunstein, L. Dall'Asta, A. Lage-Castellanos, and R. Zecchina, "Bayesian Inference of Epidemics on Networks via Belief Propagation," Phys. Rev. Lett., 118701 (2014).
[31] R. Dutta, A. Mira, and J.-P. Onnela, "Bayesian inference of spreading processes on networks," Proc. R. Soc. A, 20180129 (2018).
[32] P. Eichelsbacher and A. Ganesh, "Bayesian Inference for Markov Chains," J. Appl. Probab., 91–99 (2002).
[33] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control Signals Syst., 303–314 (1989).
[34] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, 251–257 (1991).
[35] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, 533–536 (1986).
[36] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," (2014), arXiv:1412.6980.
[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley-Interscience, 2005) p. 776.
[38] A.-L. Barabási,
Network Science (Cambridge University Press, 2016) p. 474.
[39] R. Y. Rubinstein and D. P. Kroese,
Simulation and the Monte Carlo Method, 3rd ed. (Wiley, 2016) pp. 1–414.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in
International Conference on Learning Representations (2018).
[41] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, "Epidemic processes in complex networks," Rev. Mod. Phys., 925–979 (2015).
[42] P. G. Fennell and J. P. Gleeson, "Multistate Dynamical Processes on Networks: Analysis through Degree-Based Approximation Frameworks," SIAM Rev., 92–118 (2019).
[43] S. Lehmann and Y.-Y. Ahn, eds., Complex Spreading Phenomena in Social Systems (Springer, 2018) p. 368.
[44] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. Van den Broeck, "What's in a crowd? Analysis of face-to-face behavioral networks," J. Theor. Biol., 166–180 (2011).
[45] R. Mastrandrea, J. Fournet, and A. Barrat, "Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys," PLOS ONE, e0136497 (2015).
[46] M. Génois and A. Barrat, "Can co-location be used as a proxy for face-to-face contacts?" EPJ Data Sci., 11 (2018).
[47] I. Voitalov, P. van der Hoorn, R. van der Hofstad, and D. Krioukov, "Scale-free networks well done," Phys. Rev. Research, 033034 (2019).

Deep learning of stochastic contagion dynamics on complex networks
— Supplementary Material —

Charles Murphy, Edward Laurence, and Antoine Allard
Département de Physique, de Génie Physique, et d'Optique, Université Laval, Québec (Québec), Canada G1V 0A6 and Centre interdisciplinaire en modélisation mathématique, Université Laval, Québec (Québec), Canada G1V 0A6
(Dated: June 13, 2020)
CONTENTS
I. Model
   A. Details of the Architecture
   B. Attention Mechanism
II. Projections on Real-World Temporal Networks
III. Local Transition Probabilities of the Interacting Contagion Dynamics
IV. Loss Optimization Patterns
V. Dataset Generation
   A. Algorithm
   B. Impact of the Data Generation Hyperparameters
VI. Training Settings
   A. Optimization
   B. Validation Dataset Selection
   C. Hyperparameters
VII. Mean Field framework
   A. Approximate master equations
   B. Numerical Evaluation of the Thresholds
References
I. MODEL

A. Details of the Architecture
We propose the graph neural network (GNN) architecture shown in Fig. 1. First, we apply a sequence of input layers to the state of each node, denoted x_i(t), individually to transform them into features. A single layer l with f_l features is composed of a linear transformation and a rectified linear unit activation function (ReLU, expressed as ReLU(x) = max{x, 0}) and acts on a feature vector u_i^{l−1} such that

    u_i^l = ReLU(W^l u_i^{l−1} + b^l)    (1)

where W^l ∈ R^{f_l × f_{l−1}} and b^l ∈ R^{f_l} are trainable parameters. After these first layers, the resulting feature vectors are aggregated using a graph attention module inspired by Ref. [1] (described in Sec. I B). In short, the graph attention module, illustrated in blue in Fig. 1, is a trainable nonlinear function, denoted 𝒜, which combines the node feature vectors using the adjacency matrix A. This layer returns a set of new feature vectors describing both the state of each node and the states of its first neighbors. The aggregated features are then processed into a set of final feature vectors, namely v_i ∈ R^{|Ω|}, with another sequence of output layers composed of linear transformations and ReLU activation functions [similar to Eq. (1)]. From the aggregated and transformed feature vectors v_i, we finally compute a discrete probability distribution, using a Softmax function, corresponding to the local transition probabilities (LTPs) of the GNN, where each probability is expressed as

    p̂_{i;ν} = [Softmax(v_i)]_ν = exp([v_i]_ν) / Σ_{ξ=1}^{|Ω|} exp([v_i]_ξ).    (2)

Recall that each entry p̂_{i;ν} of this probability distribution corresponds to the probability that node i transitions to state ν, conditioned on its state x_i(t) and the states of its first neighbors x_{N_i}(t) thanks to the GNN architecture. Using the notation of the main paper, we get the equivalence

    p̂_{i;ν} = p̂_{i;ν}(t) ≡ p̂_{x_{N_i}(t)}^{x_i(t)→ν}.    (3)
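The pipeline of Eqs. (1)–(2) can be sketched as follows for a single node. The weights below are illustrative stand-ins rather than trained parameters, and the attention-based neighborhood aggregation of Sec. I B is omitted for brevity:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max{x, 0}, applied element-wise [Eq. (1)]
    return np.maximum(x, 0.0)

def softmax(v):
    # Subtract the max for numerical stability; the result sums to 1 [Eq. (2)]
    e = np.exp(v - np.max(v))
    return e / e.sum()

def node_ltp(x_i, input_layers, W_out, b_out):
    """Forward pass for a single node: a stack of Linear + ReLU input
    layers [Eq. (1)], then a linear output layer whose |Omega| outputs
    are normalized by a Softmax into predicted LTPs [Eq. (2)].  The
    aggregation over neighbors (Sec. I B) is omitted in this sketch."""
    u = np.atleast_1d(np.asarray(x_i, dtype=float))
    for W, b in input_layers:
        u = relu(W @ u + b)      # u^l = ReLU(W^l u^{l-1} + b^l)
    v = W_out @ u + b_out        # final features v_i in R^{|Omega|}
    return softmax(v)            # \hat{p}_{i;nu}, a proper distribution
```

Whatever the (untrained) weights, the output is non-negative and sums to one, as required of a distribution over Ω.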
We summarize the architecture of the GNN models for each contagion dynamics used in the main paper in Tab. I.

B. Attention Mechanism
Our attention mechanism is a variant of the recently introduced graph attention networks (GAT) [1] and corresponds to a nonlinear trainable function 𝒜 that combines the feature vectors of the nodes with respect to the adjacency matrix A.

FIG. 1. (color online) Illustration of the GNN architecture. The blocks of different colors represent the GNN operations. The red blocks correspond to trainable linear transformations parametrized by weights and biases (see text). The purple blocks represent non-trainable activation functions between each layer. The core of the model is the attention module [1], which is represented in blue. The orange block at the end is a Softmax activation that transforms the output into properly normalized transition probabilities.

Dynamics                   | Simple contagion | Complex contagion | Interacting contagion
Input layers               | Linear(1, …)     | Linear(1, …)      | Linear(1, …)
Number of attention layers | 1                | 1                 | 2
Output layers              | Linear(32, …)    | Linear(32, …)     | Linear(32, …)
Number of parameters       | 2307             | 2307              | 6630

TABLE I. Layer-by-layer description of the GNN models for each contagion dynamics. For each sequence, the operations are applied from top to bottom. The operations represented by Linear(m, n) correspond to linear transformations of the form f(x) = Wx + b, where x ∈ R^m is the input, and W ∈ R^{n×m} and b ∈ R^n are trainable parameters. The operations ReLU and Softmax are activation functions given by ReLU(x) = max{x, 0} and [Softmax(x)]_i = exp(x_i) / Σ_j exp(x_j).

The attention mechanism works by assigning an attention coefficient α_ij that modulates the effect of the features of node j on the resulting features of node i. Considering a pair of connected nodes (i, j) whose input feature vectors are respectively u_i^l and u_j^l, we compute the attention coefficient α_ij using the following expression:

    α_ij = σ[ W_𝒜 (u_i^l || u_j^l) + b_𝒜 ]    (4)

where

    σ(x) = 1 / (1 + e^{−x})    (5)

is the logistic function, W_𝒜 ∈ R^{1×2f_l} and b_𝒜 ∈ R are additional trainable parameters corresponding to a linear transformation, and ·||· denotes a concatenation. The logistic function acts as an activation function guaranteeing that the attention coefficients range between 0 (no attention) and 1 (full attention). The attention coefficients are then used to compute the aggregated features v_i as follows:

    v_i = [𝒜(u_1, …, u_N; A)]_i = u_i + Σ_{j∈N_i} α_ij u_j.    (6)

Note that our attention mechanism differs from the original GAT of Ref. [1] in two ways. First, the original paper used a Softmax function in Eq. (4) instead of a logistic function, which constrains the attention coefficients to be normalized, i.e. Σ_j α_ij = 1. Under this constraint, Eq. (6) becomes an averaging operation rather than a general weighted sum. We argue that this normalization constraint is detrimental when learning contagion dynamics specifically, because it does not capture the individual contribution of each neighbor in an absolute way. Rather, it captures the relative contributions of the neighbors with respect to each other (the features of an "average" neighbor), which is more desirable in network embedding tasks [2]. This difference also allows us to enforce a complete self-attention (α_ii = 1), as one can deduce from Eq. (6).
Second, contrary to the original GAT architecture, we do not apply an activation function directly after the linear transformation in Eq. (4). The reason is that negative-valued features are, for us, of great importance, because they lead to small attention coefficients. Applying a nonlinear activation function, e.g. a ReLU function, inhibits the negative values and, consequently, prevents the attention coefficients from vanishing when needed.

Ref. [1] also showed how using multiple attention layers, in parallel and in series, can be beneficial to network embedding tasks. We also consider using attention layers in parallel in our GNN architecture. To do so, we apply multiple attention layers, denoted 𝒜^q with q = 1, …, Q, to each input feature vector u_i, and concatenate the resulting outputs into a single feature vector v_i as follows:

    v_i = ||_{q=1}^{Q} [𝒜^q(u_1, …, u_N; A)]_i    (7)

Name           | Number of nodes | Number of edges | Timespan [s] | Reference
Hypertext 2009 | 113             | 2196            | 212341       | [3]
Thiers13       | 328             | 43496           | 380421       | [4]
SFHH           | 403             | 73557           | 106541       | [4]
LH10           | 72              | 1381            | 259181       | [4]
InVS13         | 95              | 3915            | 993561       | [4]
InVS15         | 219             | 16725           | 993561       | [4]
HS13           | 327             | 5818            | 363561       | [5]

TABLE II.
Properties of social contact networks.
We show the number of nodes, the total number of edges observed, and the timespan (in seconds) of these time-varying networks. In Refs. [3–5], these networks were collected by monitoring the proximity of pairs of individuals: if two individuals are in close proximity for at least 20 seconds, a contact (or an edge) is recorded at this time.
During our experiments, we also considered applying attention layers in series, but it turns out that, in our case, using more than one attention layer leads to overfitting. Indeed, recall that the first attention layer aggregates the node features with respect to their first neighbors. Applying a second attention layer therefore indirectly aggregates the features of the second neighbors, which in turn assumes that their contribution matters in the computation of the LTPs. Because only the first neighbors actually contribute to the true LTPs, the GNN model is forced to unwind this assumption during its training, which is difficult to do and might explain the overfitting.
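A minimal numpy sketch of a single attention layer as defined by Eqs. (4)–(6), assuming a dense adjacency matrix and a scalar attention score per edge; the function and parameter names are illustrative, not the trained model:

```python
import numpy as np

def logistic(x):
    # sigma(x) = 1 / (1 + e^{-x}) [Eq. (5)]
    return 1.0 / (1.0 + np.exp(-x))

def attention_aggregate(U, A, W_att, b_att):
    """Aggregation of Eqs. (4)-(6): alpha_ij is a logistic function of a
    linear transformation of the concatenated features (u_i || u_j), and
    v_i = u_i + sum_{j in N_i} alpha_ij u_j, so self-attention is
    complete (alpha_ii = 1) and the sum is not normalized over neighbors.
    U: (N, f) node features; A: (N, N) adjacency; W_att: (2f,); b_att: scalar."""
    N, f = U.shape
    V = U.copy()                                  # the u_i term (alpha_ii = 1)
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                alpha = logistic(W_att @ np.concatenate([U[i], U[j]]) + b_att)
                V[i] = V[i] + alpha * U[j]        # weighted, un-normalized sum
    return V
```

With W_att = 0 and b_att = 0, every edge gets α_ij = σ(0) = 1/2, which makes the absolute (rather than relative) weighting of neighbors easy to verify by hand.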
II. PROJECTIONS ON REAL-WORLD TEMPORAL NETWORKS
We investigate the capabilities of our GNN architecture when used on real-world time-varying networks, after training on synthetic data. More precisely, we apply our GNN models to 7 different temporal network datasets, namely networks of social contacts in conferences (labeled Hypertext 2009 and SFHH), workplaces (labeled InVS13 and InVS15), schools (labeled Thiers13 and HS13) and hospitals (labeled LH10) [3–5]; a more detailed description of these datasets is given in Tab. II. These datasets contain timestamped edges in the form (t_ij, i, j), where t_ij corresponds to the time at which edge (i, j), representing an active contact between individuals i and j, was observed.

In the main paper, we consider two coarse-grained representations of these temporal networks. For the first one, we aggregate all edges at any given time into a single static network. We use this more convenient representation in Fig. 2 to compute the steady-state distribution of node states. For each dynamics, we sample 100 points from the LTPs of the GNN models (same model as Fig. 3 of the main paper) and from the LTPs of the original dynamics for comparison, and display the resulting distributions using boxplots.

FIG. 2. Steady-state distributions of (a) the simple, (b) the complex and (c) the interacting contagion dynamics on real-world proximity networks in conferences (labeled Hypertext 2009 and SFHH), workplaces (labeled InVS13 and InVS15), schools (labeled Thiers13 and HS13) and hospitals (labeled LH10) [3–5]. The steady-state distributions generated by the LTPs of the GNN models are indicated by triangles; those generated by the LTPs of the original dynamics (labeled GT for "ground truth") are indicated by circles. The colors of the symbols indicate the state corresponding to the fraction of nodes.

For the second coarse-grained representation, we aggregate edges in time windows of 12 hours. For instance, a network whose edges are timestamped across 24 hours would result in two different networks: one for the first 12 hours and a second for the remaining 12 hours. We then generate 100 trajectories in which we run the GNN and the ground truth models for 10 time steps per network, i.e. per 12 hours, consequently setting a difference between the time scales of the network and of the contagion dynamics. It is important to note that we could have chosen arbitrary time scales and window sizes, as setting these time scales differently would change the evolution of the projections. We chose this setting to obtain clear trajectories with potentially small and large variations.
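The window-based coarse-graining described above can be sketched as follows; the function name and data layout are illustrative, not the exact preprocessing code used for the paper:

```python
from collections import defaultdict

def aggregate_in_windows(edges, window=12 * 3600):
    """Coarse-grain timestamped contacts (t_ij, i, j) into a sequence of
    static networks, one per time window (12 hours here, matching the
    second representation described above).  Returns a dict mapping each
    window index to the set of undirected edges active in that window."""
    t0 = min(t for t, _, _ in edges)
    networks = defaultdict(set)
    for t, i, j in edges:
        k = int((t - t0) // window)               # which window this contact falls in
        networks[k].add((min(i, j), max(i, j)))   # undirected edge, deduplicated
    return dict(networks)
```

Repeated contacts within a window collapse into a single edge, so each window yields the static network on which the dynamics is then run for 10 time steps.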
III. LOCAL TRANSITION PROBABILITIES OF THE INTERACTING CONTAGION DYNAMICS
For the interacting contagion dynamics, the support of the state of nodes is Ω = {SS, IS, SI, II}. The LTPs therefore consist of 16 probabilities, one for each possible transition, minus the redundant ones obtained from the normalization

    Σ_{ν∈Ω} p_σ^{μ→ν} = 1,  ∀ μ ∈ Ω, ∀ σ ∈ Ω^k, ∀ k ∈ Z^+.    (8)

Thus there are 12 probabilities in total. Recall that the mechanisms leading to these probabilities are the following: at each time step, a node infected by disease g transmits the disease to its neighbors susceptible to g with probability τ_g, and recovers from it with probability γ_g. If more than one disease is involved in the transmission (from the receiver or the transmitter), the transmission probability is changed to ζτ_g, where ζ encodes the coupling between the two diseases. From these mechanisms, we obtain the complete distribution of LTPs. For the transitions from the completely susceptible state, SS, the LTPs are

    p_ℓ^{SS→IS} = [1 − (1−τ_1)^{ℓ_IS} (1−ζτ_1)^{ℓ_II}] (1−τ_2)^{ℓ_SI} (1−ζτ_2)^{ℓ_II},    (9a)
    p_ℓ^{SS→SI} = (1−τ_1)^{ℓ_IS} (1−ζτ_1)^{ℓ_II} [1 − (1−τ_2)^{ℓ_SI} (1−ζτ_2)^{ℓ_II}],    (9b)
    p_ℓ^{SS→II} = [1 − (1−τ_1)^{ℓ_IS} (1−ζτ_1)^{ℓ_II}] [1 − (1−τ_2)^{ℓ_SI} (1−ζτ_2)^{ℓ_II}],    (9c)
    p_ℓ^{SS→SS} = 1 − [p_ℓ^{SS→IS} + p_ℓ^{SS→SI} + p_ℓ^{SS→II}].    (9d)

For the transitions from the state IS, i.e. infected by disease 1, we have

    p_ℓ^{IS→SS} = γ_1 (1−ζτ_2)^{ℓ_SI + ℓ_II},    (10a)
    p_ℓ^{IS→SI} = γ_1 [1 − (1−ζτ_2)^{ℓ_SI + ℓ_II}],    (10b)
    p_ℓ^{IS→II} = (1−γ_1) [1 − (1−ζτ_2)^{ℓ_SI + ℓ_II}],    (10c)
    p_ℓ^{IS→IS} = 1 − [p_ℓ^{IS→SS} + p_ℓ^{IS→SI} + p_ℓ^{IS→II}],    (10d)

and similarly, for the transitions from the state SI, i.e. infected by disease 2,

    p_ℓ^{SI→SS} = (1−ζτ_1)^{ℓ_IS + ℓ_II} γ_2,    (11a)
    p_ℓ^{SI→IS} = [1 − (1−ζτ_1)^{ℓ_IS + ℓ_II}] γ_2,    (11b)
    p_ℓ^{SI→II} = [1 − (1−ζτ_1)^{ℓ_IS + ℓ_II}] (1−γ_2),    (11c)
    p_ℓ^{SI→SI} = 1 − [p_ℓ^{SI→SS} + p_ℓ^{SI→IS} + p_ℓ^{SI→II}].    (11d)

Finally, the LTPs corresponding to the transitions from the state II, i.e. infected by both diseases, are

    p_ℓ^{II→SS} = γ_1 γ_2,    (12a)
    p_ℓ^{II→IS} = (1−γ_1) γ_2,    (12b)
    p_ℓ^{II→SI} = γ_1 (1−γ_2),    (12c)
    p_ℓ^{II→II} = 1 − [p_ℓ^{II→SS} + p_ℓ^{II→IS} + p_ℓ^{II→SI}].    (12d)

FIG. 3. Projection of (a) the simple, (b) the complex and (c) the interacting contagion dynamics on real-world temporal networks. Panels are organized in triples (one panel per contagion dynamics) corresponding to the same dataset. For instance, panels (a–c) correspond to projections on the HS13 temporal network. Dashed lines correspond to projections using the LTPs predicted by the GNN model (same as in Fig. 2), while solid lines use the LTPs of the original dynamics (labeled GT for "ground truth"). Similarly to Fig. 2, colors indicate the state of the fraction of nodes. Changes in shaded areas indicate where the network transitioned to its next configuration.

IV. LOSS OPTIMIZATION PATTERNS
The machine learning problem we address is similar to a classification problem: for a given input, the model learns to assign it the correct label, i.e. the discrete state to which the node transitions. Contrary to standard classification problems, for a given input (μ, σ), the label ν is not assigned deterministically, but stochastically with LTP p_σ^{μ→ν}, which from the classification point of view can be seen as a label distribution. This difference changes how the loss decreases during the learning phase. For example, when we consider a cross-entropy objective function and deterministic labels, we expect the loss to descend gradually to zero because the entries of the label distributions are either zeros or ones, i.e. p_σ^{μ→ν} ∈ {0, 1}. For stochastic labels, the model achieves maximal accuracy when the cross entropy is equal to the entropy of the label distribution, i.e. H(p_ℓ^{μ→ν}). This is obtained by pointing out that minimizing the cross entropy with respect to the parameters Θ is completely equivalent to minimizing the Kullback–Leibler (KL) divergence between the ground truth p_σ^{μ→ν} and the model p̂_σ^{μ→ν} = p̂_σ^{μ→ν}(Θ), defined as follows:

    D(p_σ^{μ→ν} || p̂_σ^{μ→ν}) = − Σ_{ν∈Ω} p_σ^{μ→ν} log p̂_σ^{μ→ν} − H(p_σ^{μ→ν}).    (13)

Because the KL divergence is non-negative, it follows that its minimum value, corresponding to maximal accuracy, is exactly zero, hence

    min_Θ [ − Σ_{ν∈Ω} p_σ^{μ→ν} log p̂_σ^{μ→ν} ] = H(p_σ^{μ→ν}).    (14)

This motivates the monitoring of the entropy of the LTPs predicted by the GNN alongside the loss function during training. Indeed, because we expect p̂_σ^{μ→ν} ≈ p_σ^{μ→ν} after a while, the entropy of the predicted LTPs should be close to the true value of H(p_σ^{μ→ν}), which in turn should remain of the same magnitude as the loss function if the training is going smoothly.

In Fig. 4, we show the evolution of different metrics throughout the training, namely the loss function, the entropy of the GNN predictions and the Jensen–Shannon distance (JSD) between the GNN predicted LTPs and the ones given by the maximum likelihood estimators (MLE). As expected, the loss decreases as the training goes on, but it does not descend to zero because of the stochasticity of the labels. In fact, it remains of the same order of magnitude as the entropy of the GNN predictions, as expected. The JSD, however, clearly confirms that the model gets closer to the MLE, regardless of the fact that the loss is far from zero. Note that the JSD for the interacting contagion remains larger than for the other contagion models because the quality of the MLE is quite poor in this case.

V. DATASET GENERATION

A. Algorithm
We generate data from each dynamics using the following algorithm:

1. Sample a graph G from a given generative model (e.g. the Erdős–Rényi or the Barabási–Albert models).
2. Initialize the state of the system X(0) = (x_i(0))_{i=1..N}. That is, for disease g, sample n_g uniformly between 0 and N and select uniformly and without replacement n_g nodes from the network. Then, assign a state to each node according to this selection: for instance, if node i is selected to have both disease 1 and disease 2, it is assigned the state II.
3. At time t, sample X(t + Δt) with the LTPs conditioned on X(t), where Δt = 1 without loss of generality. We record the states X(t) and X(t + Δt) for training as inputs and targets, respectively.
4. Repeat step 3 until (t mod t_s) = 0, where t_s denotes a resampling time. At this moment, apply step 2 to reinitialize the states X(t) and repeat step 3.
5. Stop when t = T, where T is the targeted number of samples.

FIG. 4. Loss optimization patterns during training. (a–c) Loss as expressed by Eq. (3) in the main text, (d–f) average entropy of the GNN model predictions, (g–i) average Jensen–Shannon distance (JSD) between the GNN predicted LTPs and the ones given by the MLE. We show the results obtained when using Barabási–Albert networks to generate the data; similar conclusions are obtained when using data generated with Erdős–Rényi networks. All measures shown by these plots are approximated using the importance sampling strategies used to compute the loss. The vertical dotted lines show the minimum value of the validation loss, corresponding to the criterion for our model selection.

FIG. 5. (color online) Impact of the dataset size on the prediction error. (a–c) Models trained on ER networks, (d–f) models trained on BA networks, (g) normalized effective sample size (ESS), (h) average logarithm of the JSD error against the ESS. The specific dynamics are indicated on top of each column. We show the prediction error, calculated from the Jensen–Shannon distance (JSD), for different dataset sizes T. Solid lines correspond to the error of the GNN predictions, dotted lines denote the error of the uninformed baseline, and symbols the error of the MLE computed from the training dataset. In panels (a–f), the shade of blue indicates the value of T: darker blue means larger T. In panels (g–h), the colors indicate the type of dynamics used for training, and the lines and symbols the type of networks. All hyperparameters are listed in Sec. VI C of the Supplementary Material.

The resampling step parametrized by t_s indirectly controls the diversity of the training dataset. We allow t_s to be small to emphasize the performance of the GNN rather than the quality of the training dataset, while acknowledging that large values of t_s could lead to poor training. Because of the very nature of contagion dynamics, which contain absorbing and endemic steady states, it is expected that letting the dynamics evolve on its own could lead to datasets with high redundancy, composed largely of degenerate states. This artificial mechanism has the advantage of increasing the effective sample size of the dataset by reducing this naturally occurring redundancy.

B. Impact of the Data Generation Hyperparameters
We investigate the impact of several hyperparameters involved in the data generation on the performance of the GNN, namely the dataset size T, the resampling time t_s and the number of nodes N. In general, we quantify the diversity of a dataset using the effective sample size (ESS),

    n_eff = ( Σ_{μ,ℓ} n_{μℓ} )² / Σ_{μ,ℓ} ( n_{μℓ} )²    (15)

where

    n_{μℓ} = Σ_{t=1}^{T} Σ_{i∈V} I[x_i(t) = μ] I[ℓ_i(t) = ℓ],  with  ℓ_i(t) = (ℓ_i^μ(t))_{μ∈Ω} = ( Σ_{j∈N_i} I[x_j(t) = μ] )_{μ∈Ω},    (16)

and I[c] is an indicator function, with I[c] = 1 if c is true and I[c] = 0 otherwise. We expect that a larger ESS will lead to a better approximated loss, in turn yielding better trained models. To compare all experiments, we consider the normalized ESS, denoted by

    ( n_eff − E(n_eff) ) / sqrt( var(n_eff) )    (17)

where the expectation and the variance are taken over the values of the hyperparameters.

FIG. 6. (color online) Impact of the resampling time on the prediction error. (a–c) Models trained on ER networks, (d–f) models trained on BA networks, (g) normalized effective sample size (ESS), (h) average logarithm of the JSD error against the ESS. In these experiments, we varied the resampling time t_s over a set of fixed values. For both the GNN and MLE errors, the shade of blue indicates the value of t_s: darker blue means smaller t_s. See the caption of Fig. 5 of the Supplementary Material for further details.

FIG. 7. (color online) Impact of the number of nodes of the training networks on the average error. (a–c) Models trained on ER networks, (d–f) models trained on BA networks, (g) normalized effective sample size (ESS), (h) average logarithm of the JSD error against the ESS. In order to better appreciate the impact of increasing N, we fixed T ∝ 1/N to maintain a comparable number of samples, and fixed ⟨k⟩ = 4 in all experiments. For both the GNN and MLE errors, the shade of blue indicates the value of N: darker blue means larger N. See the caption of Fig. 5 of the Supplementary Material for further details.

In Figs. 5–7, we show the prediction error, measured by the JSD, on star graphs while changing the different data generation hyperparameters (T, t_s and N, respectively). As we can see from Figs. 5 and 6, increasing T and reducing t_s tend to help the models achieve higher performance, with diminished prediction errors. In most cases, this is well correlated with the ESS, which tends to be large when the error is small, although the effect is more subtle for the resampling time than for the dataset size. It is interesting to note that small values of t_s are not always ideal, as we observe an increase in ESS for the interacting contagion dynamics when t_s is larger.

Surprisingly, increasing the network size N does not seem to increase the performance of our models. First, for ER networks, increasing N does not tend to increase the ESS. This is expected because the maximum degree only slightly increases when the number of nodes is increased, for a fixed average degree ⟨k⟩. Hence, we do not observe additional degree classes when N is marginally increased, and the variety of the training dataset remains similar. For BA networks, we observe something different: while the increase in N leads to a higher ESS, there is still no substantial gain in performance. This can be explained by looking at the degree distribution. As more nodes are added to the network, the degree classes get more populated, resulting in an increased ESS. However, because the degree distribution is scale-free (with exponent −3), these classes are not populated evenly and more degree classes are created as N increases. This has the effect of raising the difficulty of the problem, because increasing the number of accessible degree classes also increases the number of cases, i.e. LTPs with specific inputs, that the GNN model needs to fit.

VI. TRAINING SETTINGS

A. Optimization
We use the Rectified Adam optimization algorithm presented in Ref. [6]. Similarly to Ref. [7], this algorithm minimizes an objective function by estimating the average and variance of its gradient from moving averages. These moving averages are parametrized by β_1, β_2 ∈ [0, 1) for the average and the variance, respectively, which we specify below.

B. Validation Dataset Selection
Validating the model is a crucial step of the learning pipeline, where the performance of the model is objectively evaluated. In most cases, this is done by selecting a subset of the training dataset, called the validation dataset, from which the model does not learn. The performance is then evaluated by computing the average loss over this validation dataset, which allows us to monitor the learning process. The validation set is usually selected by randomly sampling a small percentage, about 20%, of the samples in the training dataset. In general, this is not an issue in high dimensional problems, where each data point is virtually unique. In our case, however, the data points correspond to individual inputs $(\mu, \ell)$, each of which is likely to be repeated multiple times. It is therefore not recommended to split the training dataset at random; doing so could jeopardize the whole validation procedure if the training and validation dataset distributions are too similar.

Instead, we propose to sample the transitions by importance, similarly to the importance sampling scheme approximating the loss function presented in the main text. Let us consider $\mathcal{W}(t) \subset \mathcal{V}$, a subset of nodes at time $t$ selected for validation. The probability that the transition of node $i$ at time $t$ belongs to the validation dataset is

\[ \Pr\bigl[i \in \mathcal{W}(t)\bigr] = \frac{\rho\bigl(x_i(t), \ell_i(t)\bigr)^{-\delta}}{\sum_{i\in\mathcal{V}} \rho\bigl(x_i(t), \ell_i(t)\bigr)^{-\delta}}, \qquad (18) \]

where $\rho(\mu, \ell)$ denotes the input distribution, and $\delta \geq 0$ is a parameter controlling the bias towards rare inputs. During training, the loss function is then approximated by

\[ \mathcal{L}(\Theta) \simeq \sum_{t\in T} \sum_{i\in\mathcal{V}(t)} \frac{\omega_i(t)}{|T|\,|\mathcal{V}(t)|} \Bigl[ -\log \hat{p}_{\,x_i(t)\to x_i(t+\Delta t)}\bigl(x_{\mathcal{N}_i}(t)\bigr) \Bigr], \qquad (19) \]

where $\mathcal{V}(t) = \mathcal{V}\setminus\mathcal{W}(t)$. Sampling the validation dataset by importance allows us to validate the model on all available inputs with equal weights, i.e. when $\delta \to 1$.
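As a minimal sketch, the selection rule of Eq. (18) amounts to weighting each observed transition by $\rho^{-\delta}$ and sampling without replacement. The toy inputs, `delta = 0.5`, and the 20% split below are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy transitions: one input (mu, ell) per node and time step, where mu is the
# node state and ell the number of infected neighbors. Real inputs would come
# from the simulated time series, not from this random draw.
inputs = [(int(rng.integers(0, 2)), int(rng.integers(0, 5))) for _ in range(1000)]

# Empirical input distribution rho(mu, ell).
counts = Counter(inputs)
rho = {key: c / len(inputs) for key, c in counts.items()}

def sample_validation(inputs, delta=0.5, fraction=0.2):
    """Pick ~`fraction` of the transitions with probability ~ rho^(-delta), Eq. (18)."""
    weights = np.array([rho[key] ** (-delta) for key in inputs])
    probs = weights / weights.sum()
    n_val = int(fraction * len(inputs))
    return rng.choice(len(inputs), size=n_val, replace=False, p=probs)

val_idx = sample_validation(inputs)
```

With $\delta = 0$ this reduces to a uniform random split; as $\delta \to 1$, every distinct input $(\mu, \ell)$ becomes equally likely to be selected, so rare inputs are over-represented in the validation set.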
It also helps minimize the similarity between the training and validation datasets, as the rarest inputs will likely be available only in the latter.

C. Hyperparameters
For all experiments, we fix the optimizer parameters to $\beta_1 = 0.9$ and $\beta_2 = 0.999$, as suggested in Ref. [7], and we schedule the learning rate $\epsilon$ to decrease by a factor of 2 every 10 epochs, that is, $\epsilon_i = 2^{-\lfloor i/10 \rfloor}\,\epsilon_0$ at epoch $i$, with initial value $\epsilon_0$. A small weight decay is used as well to help regularize the training. We set the number of epochs to 20 and choose the model with the lowest loss on the validation dataset as our best model. We fix the importance sampling bias exponents for the training and the validation to $\lambda$ and $\delta$, respectively. For the data generation, we fix the sequence size to $t_s = 2$ and use $T = 10000$ samples for the experiments presented in the main paper.

VII. MEAN FIELD FRAMEWORK

A. Approximate master equations
The mean field framework we use is inspired by Ref. [8] and provides an approximate solution to the stationary distribution of contagion dynamics using a set of discrete-time approximate master equations (AME). To construct the AME, we consider a set of LTPs $p^{\mu\to\nu}_{\ell}$ and a state matrix $Q(t)$, where the entry $[Q(t)]_{\mu,k}$ corresponds to the probability that a node of degree $k$ is in state $\mu$ at time $t$. The AME describing the evolution of $Q(t)$ is then

\[ [Q(t+\Delta t)]_{\mu,k} = \sum_{\nu} [Q(t)]_{\nu,k} \sum_{|\ell| = k} M_k(\ell; t)\, p^{\nu\to\mu}_{\ell}, \qquad (20) \]

where $M_k(\ell; t)$ is the probability that the neighborhood of a node of degree $k$ has state $\ell$ at time $t$, recalling that $[\ell]_\mu = \ell_\mu$ corresponds to the number of neighbors in state $\mu$. The probability $M_k(\ell; t)$ is approximated by a multinomial distribution,

\[ M_k(\ell; t) = k! \prod_{\nu} \frac{\phi_\nu(t)^{\ell_\nu}}{\ell_\nu!}, \qquad (21) \]

where $\phi_\nu(t)$ is the probability that a node at the end of a randomly selected edge is in state $\nu$ at time $t$. The underlying assumption behind this parameterization of $M_k(\ell; t)$ is that the neighbors of a given node are dynamically and structurally uncorrelated. These assumptions lead to the following closed form for $\phi_\nu(t)$:

\[ \phi_\nu(t) = \frac{\sum_k k \rho_k\, [Q(t)]_{\nu,k}}{\langle k \rangle}, \qquad (22) \]

where $\rho_k = \Pr(K = k)$ is the probability that a node has degree $k$, and $k\rho_k / \langle k \rangle$ is the probability of reaching a node of degree $k$ from a randomly selected edge.

We use this mean field framework to compute the stationary distributions of the three contagion dynamics we investigated in the main text, as well as the stationary distributions predicted by the trained GNN models. To simplify our analysis, we consider that $\rho_k = e^{-\kappa}\kappa^k / k!$ is a Poisson distribution with parameter $\kappa$ and a truncated support $K = [k_\mathrm{min}, k_\mathrm{max}] \subset \mathbb{Z}^+$. In our experiments, we consider the average degree $\langle k \rangle$, rather than $\kappa$, as our control parameter.
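For concreteness, Eqs. (20)–(22) can be iterated numerically. The sketch below does so for a two-state SIS dynamics, where the multinomial of Eq. (21) reduces to a binomial over the number of infected neighbors; the parameter values, LTPs, and degree support are illustrative assumptions, not those used in the paper:

```python
import numpy as np
from math import comb, factorial

# Illustrative parameters: infection/recovery probabilities and a truncated
# Poisson degree distribution (these values are assumptions for the sketch).
tau, gamma, kappa = 0.2, 0.1, 4.0
ks = np.arange(1, 16)                                   # degree support K
rho_k = np.array([np.exp(-kappa) * kappa**k / factorial(k) for k in ks])
rho_k /= rho_k.sum()                                    # normalize on K

def ame_step(Q):
    """One iteration of Eq. (20) for SIS; Q[0] = susceptible, Q[1] = infected."""
    # Eq. (22): probability that the node at the end of a random edge is infected.
    phi = (ks * rho_k * Q[1]).sum() / (ks * rho_k).sum()
    Qn = np.zeros_like(Q)
    for ik, k in enumerate(ks):
        for l in range(k + 1):                          # number of infected neighbors
            # Eq. (21) with two states reduces to a binomial distribution.
            M = comb(k, l) * phi**l * (1 - phi)**(k - l)
            p_SI = 1 - (1 - tau)**l                     # LTP S -> I (illustrative)
            Qn[0, ik] += M * (Q[0, ik] * (1 - p_SI) + Q[1, ik] * gamma)
            Qn[1, ik] += M * (Q[0, ik] * p_SI + Q[1, ik] * (1 - gamma))
    return Qn

# Relax to the stationary distribution from a mostly susceptible state.
Q = np.vstack([np.full(len(ks), 0.99), np.full(len(ks), 0.01)])
for _ in range(500):
    Q = ame_step(Q)
```

Each column of $Q$ remains a probability distribution over states after every iteration, and for these parameter values the relaxation converges to the endemic branch.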
To ensure that $\langle k \rangle$ remains fixed, we therefore fix the support using

\[ k_\mathrm{min} = \max\bigl\{ 0,\, \lfloor \langle k \rangle \rfloor - 7 \bigr\}, \qquad k_\mathrm{max} = \max\bigl\{ 14,\, \lceil \langle k \rangle \rceil + 7 \bigr\}, \qquad (23) \]

where $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$ are the floor and ceiling functions, respectively. This guarantees that $|K| = k_\mathrm{max} - k_\mathrm{min} + 1 = 15$. Then, we fix $\kappa$ by solving numerically the following equation:

\[ \sum_{k\in K} k\rho_k = \sum_{k\in K} k\, \frac{e^{-\kappa}\kappa^k}{k!} = \langle k \rangle. \qquad (24) \]

To apply the AME framework to a GNN model, we begin by extracting its LTPs using the prescription described in the main text, which involves a star graph of $k + 1$ nodes whose central node can generate any desired input $(\mu, \ell)$ given its degree $k$. Then, Eq. (20) is solved numerically using a relaxation method [9].

It is possible to obtain a more refined version of Eq. (20), as prescribed in Ref. [8]. However, even in this simple case, iterating Eq. (20) rapidly becomes tedious as the number of available states increases. Specifically, to update Eq. (20), one needs to enumerate

\[ |\Omega| \binom{k + |\Omega| - 1}{|\Omega| - 1} \qquad (25) \]

terms, which scales poorly with $|\Omega|$ and $k$. The scaling is even worse for more refined frameworks.

B. Numerical Evaluation of the Thresholds
Contagion dynamics are known to exhibit phase transitions delineating different dynamical behaviors. More specifically, beyond some threshold values of the dynamical and structural parameters, contagion dynamics shift abruptly from an absorbing phase, where all nodes are susceptible in the steady state (noted $Q^*$, where $[Q^*]_{\mu,k} = \mathbb{I}[\mu = S]$), to an endemic phase in which a nonzero fraction of nodes remains infected over time [10–12]. We note the endemic state $Q^\dagger$, where $[Q^\dagger]_{\mu,k} > 0$ is obtained numerically from Eq. (20) using a relaxation method with the initial condition $[Q(0)]_{\mu,k} = (1-\epsilon)\,\mathbb{I}[\mu = S] + \epsilon\,\mathbb{I}[\mu = I]$, assuming $\epsilon \ll 1$ [13]. In the case of the susceptible-infected-susceptible (SIS) dynamics, the phase transition occurs at the point where the infection and recovery probabilities, $\tau$ and $\gamma$, are related to the first and second moments of the degree distribution as follows:

\[ \frac{\tau}{\gamma} = \frac{\langle k \rangle}{\langle k^2 \rangle}. \qquad (26) \]

One obtains this relation by computing the stability $\lambda(Q^*)$ of the absorbing state $Q^*$, which corresponds to the largest eigenvalue of the Jacobian matrix $J(Q^*)$ evaluated at that point. The entries of the Jacobian matrix, indexed by the pairs $(\mu, k)$ and $(\mu', k')$, evaluated at an arbitrary point $Q$ are

\[ [J(Q)]_{(\mu,k),(\mu',k')} = \left. \frac{\partial\, [Q(t+\Delta t)]_{\mu,k}}{\partial\, [Q(t)]_{\mu',k'}} \right|_{Q(t) = Q}. \qquad (27) \]

A fixed point $Q$ is stable when $\lambda(Q) < 1$, and unstable otherwise. Since we consider the parameters of the contagion dynamics to remain unchanged, we compute the thresholds of the phase transitions with respect to the average degree $\langle k \rangle$. For simple contagion dynamics, the threshold is obtained numerically by solving for the value of $\langle k \rangle$ at which the stability of the absorbing state yields $\lambda(Q^*) = 1$. For complex and interacting dynamics, we must find two thresholds delineating a bistable regime in which both the absorbing state $Q^*$ and the endemic state $Q^\dagger$ are stable.
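The stability computation can be sketched numerically: build the Jacobian of the update map by finite differences, as in Eq. (27), and compare its leading eigenvalue to 1. The block below applies this recipe to a simple one-dimensional mean-field SIS update rather than the full AME; the map and its parameters are illustrative assumptions:

```python
import numpy as np

def jacobian(F, q, eps=1e-6):
    """Central-difference Jacobian of a map F: R^n -> R^n at the point q."""
    n = len(q)
    J = np.zeros((n, n))
    for j in range(n):
        dq = np.zeros(n)
        dq[j] = eps
        J[:, j] = (F(q + dq) - F(q - dq)) / (2 * eps)
    return J

def stability(F, q):
    """Largest eigenvalue modulus of the Jacobian, i.e. lambda(Q) of Eq. (27)."""
    return np.abs(np.linalg.eigvals(jacobian(F, q))).max()

def sis_map(tau, gamma, mean_k):
    # One-dimensional mean-field SIS update; x is the infected fraction.
    return lambda x: x + tau * mean_k * x * (1 - x) - gamma * x

absorbing = np.array([0.0])                                   # Q*: all susceptible
lam_stable = stability(sis_map(0.05, 0.5, 4.0), absorbing)    # tau<k> < gamma
lam_unstable = stability(sis_map(0.20, 0.5, 4.0), absorbing)  # tau<k> > gamma
```

For this map the Jacobian at the absorbing state is the scalar $1 + \tau\langle k\rangle - \gamma$, so the first parameter set gives $\lambda = 0.7 < 1$ (stable) and the second $\lambda = 1.3 > 1$ (unstable), reproducing the threshold condition $\lambda(Q^*) = 1$.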
We therefore apply the same strategy and solve numerically for $\langle k \rangle$ the two equations $\lambda(Q^*) = 1$ and $\lambda(Q^\dagger) = 1$.

[1] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations (2018).
[2] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation Learning on Graphs: Methods and Applications," (2017), arXiv:1709.05584.
[3] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. Van den Broeck, "What's in a crowd? Analysis of face-to-face behavioral networks," J. Theor. Biol., 166–180 (2011).
[4] M. Génois and A. Barrat, "Can co-location be used as a proxy for face-to-face contacts?" EPJ Data Sci., 11 (2018).
[5] R. Mastrandrea, J. Fournet, and A. Barrat, "Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys," PLOS ONE, e0136497 (2015).
[6] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the Variance of the Adaptive Learning Rate and Beyond," (2020), arXiv:1908.03265.
[7] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," (2014), arXiv:1412.6980.
[8] P. G. Fennell and J. P. Gleeson, "Multistate Dynamical Processes on Networks: Analysis through Degree-Based Approximation Frameworks," SIAM Rev., 92–118 (2019).
[9] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed. (Cambridge University Press, USA, 2007).
[10] I. Z. Kiss, J. C. Miller, and P. L. Simon,
Mathematics of Epidemics on Networks (Springer, 2017) p. 598.
[11] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, "Epidemic processes in complex networks," Rev. Mod. Phys., 925–979 (2015).
[12] J. Sanz, C.-Y. Xia, S. Meloni, and Y. Moreno, "Dynamics of Interacting Diseases," Phys. Rev. X, 041005 (2014).
[13] Equivalently, in the case of the interacting contagion dynamics, we use $[Q(0)]_{\mu,k} = (1-\epsilon)\,\mathbb{I}[\mu = SS] + \epsilon\,\mathbb{I}[\mu = II]$.