Mutually exciting point process graphs for modelling dynamic networks
Francesco Sanna Passino and Nicholas A. Heard
Department of Mathematics, Imperial College London
180 Queen's Gate, SW7 2AZ, London
Abstract
A new class of models for dynamic networks is proposed, called mutually exciting point process graphs (MEG), motivated by a practical application in computer network security. MEG is a scalable network-wide statistical model for point processes with dyadic marks, which can be used for anomaly detection when assessing the significance of previously unobserved connections. The model combines mutually exciting point processes to estimate dependencies between events and latent space models to infer relationships between the nodes. The intensity functions for each network edge are parameterised exclusively by node-specific parameters, which allows information to be shared across the network. Fast inferential procedures using modern gradient ascent algorithms are exploited. The model is tested on simulated graphs and real-world computer network datasets, demonstrating excellent performance.
Keywords — dynamic network, Hawkes process, self-exciting process, statistical cyber-security.
Dynamic networks are encountered in many domains, for example representing interactions in social networks, messaging applications, or computer networks. Event data from dynamic networks are observed as triplets $(t_1, x_1, y_1), \ldots, (t_m, x_m, y_m)$, where $0 \le t_1 \le t_2 \le \ldots$ are event times and the dyadic marks $(x_k, y_k)$ denote the source and destination nodes, each belonging to a set of nodes $V = \{1, \ldots, n\}$ of size $n$. The sequence of graph edges $(x_1, y_1), \ldots, (x_m, y_m)$ induces a directed network adjacency matrix $A = \{A_{ij}\} \in \{0, 1\}^{n \times n}$, where $A_{ij} = 1$ if node $i$ connected to node $j$ at least once during the entire observation period, and $A_{ij} = 0$ otherwise.

This article presents a new model for the arrival of connection events between nodes in a network, motivated by applications in computer network security. Modelling arrival times in computer networks is complicated by several factors: events tend to appear in bursts, they might be recorded multiple times, and they exhibit polling at regular intervals (Heard et al., 2014). Furthermore, in computer network security, it is particularly important to assess the significance of observing new links, corresponding to connections on previously unobserved edges (Metelli and Heard, 2019). New links might be indicative of lateral movement, which is a common behaviour of network attackers (Neil et al., 2013). The proposed mutually exciting graphs (MEG) framework for modelling point processes on networks simultaneously addresses two fundamental tasks in network security: monitoring the normality of observed connectivity traffic, and anomaly detection for unusual new connections. Despite the focus of this article on applications
to computer network security, the proposed methodology is general and can be applied to any dynamic network expressed as a point process with dyadic marks.

The MEG model builds upon mutually exciting point processes and latent space models. Mutually exciting point processes have already been successfully used for a variety of different applications: modelling of earthquakes (Ogata, 1988), financial markets (Bowsher, 2007), criminal activities (Mohler et al., 2011; Stomakhin et al., 2011), and popularity of tweets (Zhao et al., 2015; Chen and Tan, 2018). Let $t_1, t_2, \ldots, t_m$ denote an increasing sequence of observed event times, and $N(t) = \sum_{k=1}^m \mathbb{1}_{[0,t]}(t_k)$ the corresponding counting process, representing the number of events observed up to time $t$. The conditional intensity $\lambda(t)$ is assumed to depend on the last $r$ observed arrival times:
$$\lambda(t) = \lambda + \sum_{k = N(t)-r+1}^{N(t)} \omega(t - t_k), \qquad (1.1)$$
where $\lambda \in \mathbb{R}_+$ is a baseline intensity level and $\omega(\cdot)$ is a non-increasing and non-negative excitation function. For simplicity, $\omega(\cdot)$ is usually chosen to be a scaled exponential function: $\omega(t) = \beta \exp\{-(\beta + \theta)t\}$, where $\beta \ge 0$ and $\theta > 0$. Alternative choices of $\omega(\cdot)$ are nonparametric step functions (Price-Williams and Heard, 2020), or the power-law $\omega(t) = \theta(t + \gamma)^{-1-\delta}$, where $\theta \ge 0$, $\gamma, \delta > 0$ and $\theta < \delta\gamma^\delta$ (Ozaki, 1979). In the literature, two extreme cases for the intensity in (1.1) are usually considered: $r = 1$, corresponding to a first order Markov-like structure, and $r = \infty$, called a Hawkes process (Hawkes, 1971). If $r = 0$, the model reduces to a Poisson process. Price-Williams and Heard (2020) showed that self-exciting point processes appear to be suitable for modelling arrival times on individual network edges in a cyber-security application.
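To make the role of $r$ and the scaled exponential excitation concrete, the conditional intensity (1.1) can be evaluated directly; the following is an illustrative sketch (not code from the paper), where `lam0`, `beta` and `theta` stand for $\lambda$, $\beta$ and $\theta$:

```python
import math

def intensity(t, event_times, lam0, beta, theta, r=None):
    """Conditional intensity (1.1) with exponential excitation
    omega(u) = beta * exp(-(beta + theta) * u). Only the last r events
    before t contribute; r=None plays the role of r = infinity (Hawkes)."""
    past = [tk for tk in event_times if tk < t]   # arrivals strictly before t
    if r is not None:
        past = past[-r:]                          # keep only the last r arrivals
    return lam0 + sum(beta * math.exp(-(beta + theta) * (t - tk)) for tk in past)
```

With `r=0` the excitation sum is empty and the function returns the Poisson baseline `lam0`; with `r=1` only the most recent arrival excites the process, matching the Markov-like case.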
In this article, an extension of this approach is proposed, aimed at modelling network data and, more generally, point processes with dyadic marks.

In large graphs, simultaneously modelling all the edge processes using individual intensities of the form (1.1) is computationally challenging, and ignores possible correlations between different edges and nodes. Inference would require estimating $O(n^2)$ parameters, or $O\{\mathrm{nnz}(A)\}$ parameters if the graph is sparse, which is not feasible in most real-world applications. Furthermore, this approach would not parameterise new edges appearing after the model training period. Hence, traditional statistical models for networks, for example latent space models (Hoff et al., 2002), aim to reduce the representation of the network to $O(n)$ parameters. Inspired by the literature on latent space models, here a dynamic graph is modelled through edge-specific point processes with intensity functions parametrised by node-specific latent features. In standard latent space network models, the probability of a link between two nodes is expressed as a function of node-specific latent vectors $a_i, b_j \in \mathbb{R}^d$, such that $\mathbb{P}(A_{ij} = 1) = f(a_i, b_j)$, for some kernel function $f$. Similarly, here it is assumed that the arrival times on each observed network edge can be modelled using a mutually exciting point process depending on node-specific characteristics.

The related literature on mutually exciting point processes is vast, although mostly focusing on univariate and multivariate point processes; limited attention is devoted to using such processes for modelling large dynamic graphs. Hawkes processes are traditionally used to estimate causal interactions within multivariate processes (Linderman and Adams, 2014), because of their appealing theoretical properties in terms of Granger causality and directed information (Etesami et al., 2016; Eichler et al., 2017).
The approach proposed in this paper is related to Perry and Wolfe (2013), who consider directed interactions within dynamic networks as a multivariate point process using a Cox multiplicative intensity model, with covariates depending on the history of the process. The methodology proposed in this work uses mutually exciting processes parametrised by node-specific features. Hawkes processes have also been used in Fox et al. (2016) to analyse e-mail networks, primarily focusing on point processes on each node.

Furthermore, dynamic models for network snapshots observed at discrete points in time have also been proposed in the literature. In particular, the methodology proposed in this work could be related to dynamic latent space models, which are based on latent feature representations of each node, evolving according to a temporal dynamics. Examples are Sarkar and Moore (2006); Krivitsky and Handcock (2014); Sewell and Chen (2015); Durante and Dunson (2016). The MEG model proposed in this article extends the latent feature framework to a continuous time setting, using node-specific latent vectors to parametrise point processes on each edge.

The remainder of this article is structured as follows: Section 2 introduces the MEG model and related inferential procedures. Section 3 discusses simulation from the model and the calculation of $p$-values for each observed network event. Results on simulated and real-world computer networks are discussed in Section 4.

The main contribution proposed in this article is a mutually exciting graph model (MEG) for dynamic network point processes, defined by an $n \times n$ time-varying matrix of non-negative functions $\lambda(t) = \{\lambda_{ij}(t)\}$. Each entry $\lambda_{ij}(t)$ represents the intensity of the counting process $N_{ij}(t) = \sum_{k=1}^m \mathbb{1}_{[0,t] \times \{i\} \times \{j\}}(t_k, x_k, y_k)$ of events occurring on the graph edge $(i, j)$.
For generality, it is assumed that for each edge $(i, j)$ there exists a changepoint $\tau_{ij} \ge 0$ after which the edge becomes observable. In the simplest case, $\tau_{ij} = 0$ for all $i, j$. To parameterise $\lambda(t)$, each entry is represented as an additive model with three non-negative components. The first, denoted $\alpha_i(t)$, characterises the process of arrival times involving $i$ as source node; the second, $\beta_j(t)$, corresponds to arrivals for which $j$ is the destination node; the third, $\gamma_{ij}(t)$, is an interaction term which will also be parameterised by node-specific parameters, giving:
$$\lambda_{ij}(t) = \alpha_i(t) + \beta_j(t) + \gamma_{ij}(t), \qquad t \ge \tau_{ij}. \qquad (2.1)$$
Note that the intensity function (2.1) resembles the link function used in additive and multiplicative effect network (AMEN) models for network adjacency matrices, proposed in Hoff (2018).

Define the source and destination counting processes as $N_i(t) = \sum_{k=1}^m \mathbb{1}_{[0,t] \times \{i\}}(t_k, x_k)$ and $N'_j(t) = \sum_{k=1}^m \mathbb{1}_{[0,t] \times \{j\}}(t_k, y_k)$. Furthermore, let $\ell^i_1, \ell^i_2, \ldots$ denote the indices $\{k : x_k = i\}$ of the arrival times such that $i$ appears as source node, and $\ell'^j_1, \ell'^j_2, \ldots$ denote the event indices $\{k : y_k = j\}$ for which $j$ is the destination node. To allow self-excitation of both source and destination nodes, the latent functions $\alpha_i(t)$ and $\beta_j(t)$ are assigned a similar form to the conditional intensity (1.1):
$$\alpha_i(t) = \alpha_i + \sum_{k = N_i(t)-r+1}^{N_i(t)} \omega_i(t - t_{\ell^i_k}), \qquad \beta_j(t) = \beta_j + \sum_{k = N'_j(t)-r+1}^{N'_j(t)} \omega'_j(t - t_{\ell'^j_k}), \qquad (2.2)$$
where $\alpha = (\alpha_1, \ldots, \alpha_n), \beta = (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n_+$ are node-specific baseline intensity levels, and $\omega_i, \omega'_j$ are node-specific, non-increasing excitation functions from $\mathbb{R}_+$ to $\mathbb{R}_+$.
For simplicity, the excitation functions assume the following scaled exponential form, for non-negative parameters $\mu_i, \mu'_j, \phi_i, \phi'_j$, with $\mu, \mu', \phi, \phi' \in \mathbb{R}^n_+$:
$$\omega_i(t) = \mu_i \exp\{-(\mu_i + \phi_i)t\}, \qquad \omega'_j(t) = \mu'_j \exp\{-(\mu'_j + \phi'_j)t\}. \qquad (2.3)$$
Similarly, let $\ell^{ij}_1, \ell^{ij}_2, \ldots$ be the indices $\{k : x_k = i, y_k = j\}$ of the events observed on the edge $(i, j)$. The interaction term $\gamma_{ij}(t)$ in (2.1) assumes a similar form to (2.2), but with a background rate obtained as the inner product between two node-specific $d$-dimensional baseline parameter vectors $\gamma_i, \gamma'_j \in \mathbb{R}^d_+$, with $d > 0$:
$$\gamma_{ij}(t) = \gamma_i \cdot \gamma'_j + \sum_{k = N_{ij}(t)-r+1}^{N_{ij}(t)} \omega_{ij}(t - t_{\ell^{ij}_k}). \qquad (2.4)$$

Figure 1 (panels: $\alpha_i(t)$, $\beta_j(t)$, $\gamma_{ij}(t)$ and $\lambda_{ij}(t)$, each shown for $r = \infty$ and $r = 1$):
Cartoon of a 1-dimensional MEG model (2.1)-(2.5). Events with source node $i$ and destination node $j$ are denoted by triangles; other events with source node $i$ are denoted by circles, and other events with destination node $j$ are denoted by squares.

The inner product baseline is inspired by random dot product graph models (see, for example, Athreya et al., 2018) for link probabilities. The excitation function $\omega_{ij}(t)$ is also expressed as a sum of scaled exponential functions, parameterised by four node-specific, non-negative latent $d$-vectors $\nu_i, \nu'_j, \theta_i, \theta'_j \in \mathbb{R}^d_+$:
$$\omega_{ij}(t) = \sum_{\ell=1}^d \nu_{i\ell} \nu'_{j\ell} \exp\{-(\theta_{i\ell} + \nu_{i\ell})(\theta'_{j\ell} + \nu'_{j\ell})t\}. \qquad (2.5)$$
A cartoon example of the intensity $\lambda_{ij}(t)$ for the $d = 1$ dimensional MEG model with scaled exponential functions is given in Figure 1, with $\alpha_i = 0.$, $\mu_i = 0.$, $\phi_i = 0.$, $\beta_j = 0.$, $\mu'_j = 0.$, $\phi'_j = 0.$, $\gamma_i = 0.$, $\nu_i = 0.$, $\theta_i = 1.$, $\gamma'_j = 0.$, $\nu'_j = 0.$, $\theta'_j = 0.$

As remarked in Section 1, alternative decay functions are possible, such as nonparametric excitation functions (Price-Williams and Heard, 2020) or power-laws (Ozaki, 1979). In addition to the self-exciting component, non-stationary background rates are also considered in the literature, often with a seasonal component (Fox et al., 2016; Price-Williams and Heard, 2020). In cyber-security applications, including seasonal components when modelling arrival times was demonstrated to have limited effect on the predictive performance of the model (Price-Williams and Heard, 2020); therefore, the background rate here is assumed to be constant.

A key feature of the model is the representation of the intensity (2.1) with only node-specific parameters $\Psi = (\alpha, \mu, \phi, \beta, \mu', \phi', \gamma, \nu, \theta, \gamma', \nu', \theta')$.
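Evaluating (2.1) with the components (2.2)-(2.5) only requires the node-specific parameters and the event history. The following sketch illustrates this (the dictionary keys and function names are this example's own, not the paper's notation; primed parameters are stored under keys ending in `_`):

```python
import math

def meg_intensity(t, i, j, events, p, r=None):
    """Illustrative evaluation of the MEG intensity (2.1)-(2.5):
    lambda_ij(t) = alpha_i(t) + beta_j(t) + gamma_ij(t).
    `events` is a list of (t_k, x_k, y_k) triplets; `p` maps parameter
    names to node-indexed values; r=None plays the role of r = infinity."""
    def excite(times, jump):
        past = [tk for tk in times if tk < t]   # arrivals strictly before t
        if r is not None:
            past = past[-r:]                    # keep only the last r arrivals
        return sum(jump(t - tk) for tk in past)

    src = [tk for tk, x, _ in events if x == i]              # i as source
    dst = [tk for tk, _, y in events if y == j]              # j as destination
    edge = [tk for tk, x, y in events if (x, y) == (i, j)]   # edge (i, j)
    d = len(p['gamma'][i])

    alpha = p['alpha'][i] + excite(
        src, lambda u: p['mu'][i] * math.exp(-(p['mu'][i] + p['phi'][i]) * u))
    beta = p['beta'][j] + excite(
        dst, lambda u: p['mu_'][j] * math.exp(-(p['mu_'][j] + p['phi_'][j]) * u))
    gamma = sum(p['gamma'][i][l] * p['gamma_'][j][l] for l in range(d)) + excite(
        edge, lambda u: sum(
            p['nu'][i][l] * p['nu_'][j][l] * math.exp(
                -(p['theta'][i][l] + p['nu'][i][l])
                * (p['theta_'][j][l] + p['nu_'][j][l]) * u)
            for l in range(d)))
    return alpha + beta + gamma
```

Before any event is observed on the edge, the intensity reduces to the baseline $\alpha_i + \beta_j + \gamma_i \cdot \gamma'_j$, which is exactly what allows new links to be scored.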
This construction allows estimation of intensities even for unobserved edges. Therefore, in practical applications, when a new link is observed, it is possible to immediately provide an estimate of the intensity of the process on that edge. This is a substantial difference with respect to models based on edge-specific parameters for the edge intensities. Such models do not perform well for scoring new links, since the score for a new observation could only be based on a prior guess of the intensity, whereas MEG "borrows strength" from events observed on similar nodes and edges in the graph, providing an informed estimate of the intensity function on the newly observed edge.

For a sequence of observed events $H_T = \{(x_1, y_1, t_1), \ldots, (x_m, y_m, t_m)\}$, with event times in $[0, T]$, the log-likelihood (Daley and Vere-Jones, 2002) of a generic MEG model is:
$$\log L(H_T \mid \Psi) = \sum_{i=1}^n \sum_{j=1}^n \left\{ \sum_{k=1}^{n_{ij}} \log \lambda_{ij}(t_{\ell^{ij}_k}) - \int_{\tau_{ij}}^T \lambda_{ij}(t)\,\mathrm{d}t \right\}, \qquad (2.6)$$
where $n_{ij}$ is the number of events observed on edge $(i, j)$. Explicit forms of the likelihood function (2.6) for the scaled exponential excitation function model, with $d = 1$ and $r = 1$ or $r = \infty$, are discussed in Appendix A.1.

For $r = \infty$, the main computational burden associated with the calculation of the likelihood (2.6) is the double summation over each $(i, j)$ pair of the sum of intensities $\lambda_{ij}(t_{\ell^{ij}_k})$ for the events $t_{\ell^{ij}_k}$, and the summations required in (2.2) and (2.4) to evaluate that intensity for each event. Appendix A.1 discusses a recursive form of the likelihood (2.6) for the MEG model with $r = \infty$, which can be evaluated in linear time on each active edge, significantly reducing the computational effort.
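The linear-time recursion is standard for exponential kernels: the excitation at event $t_k$ equals the decayed excitation at $t_{k-1}$ plus one. A minimal single-edge sketch for the kernel $\omega(u) = \beta e^{-(\beta+\theta)u}$ with $r = \infty$ (an illustration of the idea, not the paper's Appendix A.1 code):

```python
import math

def hawkes_loglik(times, T, lam0, beta, theta):
    """Log-likelihood on [0, T] of a univariate Hawkes process with
    excitation omega(u) = beta * exp(-(beta + theta) * u), evaluated in
    linear time via A_k = exp(-(beta+theta)(t_k - t_{k-1})) * (1 + A_{k-1}),
    so that lambda(t_k) = lam0 + beta * A_k."""
    rate = beta + theta
    ll = -lam0 * T            # compensator of the constant baseline
    A = 0.0
    prev = None
    for tk in times:
        if prev is not None:
            A = math.exp(-rate * (tk - prev)) * (1.0 + A)
        ll += math.log(lam0 + beta * A)
        # compensator contribution of the excitation triggered by t_k
        ll -= (beta / rate) * (1.0 - math.exp(-rate * (T - tk)))
        prev = tk
    return ll
```

Each event is visited once, so the cost on an edge with $n_{ij}$ events is $O(n_{ij})$ rather than the $O(n_{ij}^2)$ of the naive double sum.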
Inference in Hawkes processes is usually carried out using maximum likelihood estimation, and for MEG models maximum likelihood estimation via gradient ascent can be deployed, since it is not possible to optimise (2.6) analytically. If the start times $\tau_{ij}$ in (2.6) are unknown, the maximum likelihood estimates are simply $\hat{\tau}_{ij} = t_{\ell^{ij}_1}$ if at least one event is observed on the edge, and $\hat{\tau}_{ij} = \infty$ otherwise (see Appendix A.1). For maximising (2.6) with respect to the remaining parameters $\Psi$, the adaptive moment estimation method (Adam,
Kingma and Ba, 2015) is used, with a minor adaptation of optimising the logarithm of each parameter, since all parameters are constrained to be positive. The resulting optimisation procedure is detailed in Algorithm 1. Alternative gradient ascent techniques for optimisation are surveyed in Ruder (2016). The gradient of the likelihood (2.6) with respect to $\Psi$ is explicitly derived in Appendix A.2 for $d = 1$ and $r = \infty$.

The proposed modelling framework has been presented for directed graphs, but could be easily extended to undirected and bipartite networks. For undirected graphs $A = A^\top$, hence there is no distinction between source and destination nodes. Therefore, $\beta_j(t)$ in (2.1) could simply be replaced by $\alpha_j(t)$, and $\gamma_{ij}(t)$ modified as follows:
$$\gamma_{ij}(t) = \gamma_i \cdot \gamma_j + \sum_{k = N_{ij}(t)-r+1}^{N_{ij}(t)} \sum_{\ell=1}^d \nu_{i\ell} \nu_{j\ell} \exp\{-(\theta_{i\ell} + \nu_{i\ell})(\theta_{j\ell} + \nu_{j\ell})(t - t_{\ell^{ij}_k})\}.$$
Furthermore, bipartite graphs can be considered as special cases of directed graphs, where the node set $V = V_1 \cup V_2$ is divided into two sets $V_1$ and $V_2$ of cardinality $n_1$ and $n_2$, such that $V_1 \cap V_2 = \emptyset$, and all the edges are of the form $(i, j)$ with $i \in V_1$ and $j \in V_2$. Therefore, (2.1) and the corresponding equations of the intensity functions still hold, assuming that $i \in V_1$ and $j \in V_2$.

Algorithm 1:
Adam gradient ascent algorithm for optimisation of the likelihood (2.6).
Input: stepsize $\eta \in \mathbb{R}_+$, decay rates $\rho_1, \rho_2 \in (0, 1)$, smoothing term $\varepsilon \in \mathbb{R}_+$, initial values $\Psi_0$.
Output: model parameters $\Psi$.
Initialise the moment vectors $m_0$ and $v_0$, and the number of iterations $k$: $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $k \leftarrow 0$,
repeat
  $k \leftarrow k + 1$,
  $g_k \leftarrow \frac{\partial}{\partial \Psi} \log L(H_T \mid \Psi) \big|_{\Psi = \Psi_{k-1}}$,
  $m_k \leftarrow \rho_1 m_{k-1} + (1 - \rho_1)(g_k \odot \Psi_{k-1})$, where $\odot$ is the Hadamard product,
  $v_k \leftarrow \rho_2 v_{k-1} + (1 - \rho_2)[(g_k \odot \Psi_{k-1}) \odot (g_k \odot \Psi_{k-1})]$,
  $\Psi_k \leftarrow \Psi_{k-1} \odot \exp\left\{\eta\, [m_k/(1 - \rho_1^k)] \oslash \left(\sqrt{v_k/(1 - \rho_2^k)} + \varepsilon\right)\right\}$, where $\oslash$ is the Hadamard division, and the exponential and square root are applied element-wise,
until convergence in $\log L(H_T \mid \Psi)$;

In order to validate the inferential procedure, it is necessary to simulate data from the MEG model (2.1), which can be interpreted as an extended multivariate Hawkes process where some of the parameters are shared across the individual processes. Therefore, simulation of MEG models is possible under the framework described in Ogata (1981), and follows the standard technique of simulating self-exciting processes via thinning. The procedure is described in Algorithm 2.
Algorithm 2:
Simulation of a MEG model in $[0, T]$.
set $t^\star = 0$,
repeat
  set $\lambda^\star = \sum_{i=1}^n \sum_{j=1}^n \lambda_{ij}(t^{\star+})$, where $t^{\star+}$ denotes the limit from the right,
  generate the inter-arrival time $q = -\log(u)/\lambda^\star$, where $u$ is a draw from Uniform$[0, 1]$,
  obtain the candidate arrival time $t^\star \leftarrow t^\star + q$,
  assign the arrival time $t^\star$ to the edge $(i, j)$ with probability $\lambda_{ij}(t^{\star-})/\lambda^\star$, and do not assign it to any edge with probability $1 - \sum_{i=1}^n \sum_{j=1}^n \lambda_{ij}(t^{\star-})/\lambda^\star$, where $t^{\star-}$ denotes the limit from the left,
until $t^\star > T$;

Furthermore, it is possible to assess the performance of the inferential procedure by evaluating the goodness-of-fit on out-of-sample events. If the model parameters are estimated only from the event times observed in $[0, T^\star]$, with $T^\star < T$, using Algorithm 1, the goodness-of-fit can then be evaluated from the event times in $(T^\star, T]$. Goodness-of-fit measures can be calculated from functions of the compensator for the model. Given the conditional intensity $\lambda_{ij}(t)$, the compensator $\Lambda_{ij}(t)$ is:
$$\Lambda_{ij}(t) = \int_{\tau_{ij}}^t \lambda_{ij}(s)\,\mathrm{d}s.$$
Examples of compensator functions for some MEG models, for $t = T$, can be found in Appendix A.1. Given arrival times $t_1, \ldots, t_{n_{ij}}$ on the edge $(i, j)$, under the null hypothesis of correct specification of the conditional intensity $\lambda_{ij}(t)$, by the time rescaling theorem (see, for example, Brown et al., 2002) $\Lambda_{ij}(t_1), \ldots, \Lambda_{ij}(t_{n_{ij}})$ are event times of a homogeneous Poisson process with unit rate. It follows that, for example, the upper tail $p$-values
$$p_k = \exp\{-\Lambda_{ij}(t_k) + \Lambda_{ij}(t_{k-1})\} = \exp\left\{-\int_{t_{k-1}}^{t_k} \lambda_{ij}(s)\,\mathrm{d}s\right\} \qquad (3.1)$$
follow a standard uniform distribution under the null hypothesis.
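When the compensator increment in (3.1) has a closed form it should be used directly; otherwise the $p$-value can be approximated numerically. A minimal sketch using a trapezoidal rule (`lam` is any callable intensity; the function name and grid size are this example's own):

```python
import math

def upper_tail_pvalue(lam, t_prev, t_k, n_grid=1000):
    """Approximate the upper tail p-value (3.1),
    p_k = exp{-integral of lambda(s) over (t_{k-1}, t_k]},
    via a trapezoidal approximation of the compensator increment."""
    h = (t_k - t_prev) / n_grid
    grid = [t_prev + i * h for i in range(n_grid + 1)]
    # trapezoidal rule: full weight on interior points, half on the endpoints
    integral = h * (sum(lam(s) for s in grid) - 0.5 * (lam(grid[0]) + lam(grid[-1])))
    return math.exp(-integral)
```

For a constant intensity $\lambda$ this recovers the exponential inter-arrival $p$-value $\exp\{-\lambda(t_k - t_{k-1})\}$ exactly, since the trapezoidal rule is exact for constant and linear integrands.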
Therefore, given the estimates of the conditional intensity functions obtained from the arrival times in $[0, T^\star]$, approximately uniform $p$-values for the test event times in $(T^\star, T]$ should be observed if the model is specified and estimated correctly.

In this section, the MEG model is tested on simulated network data and on two real-world computer network datasets: the Enron e-mail network, and a bipartite graph obtained from network flow data collected at Imperial College London. Across the experiments, the decay rates $(\rho_1, \rho_2)$ in Algorithm 1 have been set to $(0.9, 0.999)$, and $\varepsilon = 10^{-8}$.

In order to evaluate Algorithm 1 and its performance at estimating MEG models, simulated network data are initially used. First, an adjacency matrix is simulated from an Erdős-Rényi graph with $n = 10$ nodes, such that $A_{ij} \sim \mathrm{Bernoulli}(p)$, with $p = 1/$. For the edges such that $A_{ij} = 1$, then $\tau_{ij} = 0$; otherwise, if $A_{ij} = 0$, then $\tau_{ij} = \infty$. Second, $m = 5,$ event times are generated from a MEG model with $r = \infty$ using Algorithm 2, with parameters in $\Psi$ sampled at random from uniform distributions, restricted to the following ranges: $\alpha_i, \beta_j \in (10^{-}, 10^{-})$, $\mu_i, \mu'_j, \phi_i, \phi'_j \in (10^{-}, 10^{-})$, $\gamma_{i\ell}, \gamma'_{j\ell} \in (10^{-}, 10^{-})$, $\nu_{i\ell}, \nu'_{j\ell}, \theta_{i\ell}, \theta'_{j\ell} \in (10^{-}, 1)$. In the simulation, the expected number of events per active edge is $mp/[n(n-1)] \approx$ . Algorithm 1 is used to estimate $n \times$ parameters, with learning rate $\eta = 0.$, after a random initialisation from the same uniform distributions used in the data generating process. Using the estimated parameters, the $p$-values (3.1) are then calculated for all the simulated events. Finally, the Kolmogorov-Smirnov (KS) score against the uniform distribution is calculated on those $p$-values.
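The KS score used throughout is the one-sample Kolmogorov-Smirnov statistic of the fitted $p$-values against the Uniform$(0,1)$ distribution, which can be computed directly (an illustrative implementation, equivalent in spirit to `scipy.stats.kstest` with a uniform reference):

```python
def uniform_ks(pvals):
    """One-sample Kolmogorov-Smirnov statistic sup_x |F_hat(x) - x| of a
    sample of p-values against the standard uniform, evaluated at the jump
    points of the empirical distribution function F_hat."""
    n = len(pvals)
    s = sorted(pvals)
    # at the k-th order statistic, F_hat jumps from k/n to (k+1)/n
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(s))
```

Small values indicate $p$-values close to uniform, and hence a well-specified, well-estimated model; values near 1 arise when the $p$-values pile up near 0 or 1.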
The entire procedure is then repeated times.

A second simulation is conducted for a graph with $n = 20$ nodes, simulating $m = 10,$ events from a MEG model with interaction term only, corresponding to $\lambda_{ij}(t) = A_{ij} \gamma_{ij}(t)$, with $r = 1$ and $d = 5$. A minor modification is made to the range of the uniform distributions for sampling some of the interaction term parameters: $\gamma_{i\ell}, \gamma'_{j\ell} \in (10^{-}, 10^{-})$, $\nu_{i\ell}, \nu'_{j\ell}, \theta_{i\ell}, \theta'_{j\ell} \in (10^{-}, 1)$. Despite the simpler form of the intensity functions $\lambda_{ij}(t)$, the inferential task is more difficult than in the previous simulation: more parameters must be estimated ($6nd = 600$), and the expected number of connections per edge is smaller. The resulting boxplots of the KS scores obtained for the two simulations are plotted in Figure 2.

As might be expected, the results of the first simulation, which corresponds to an easier estimation task, are slightly better than those of the second simulation. Both boxplots demonstrate that the algorithm is able to recover sensible estimates of the parameter values, resulting in small KS scores, corresponding to a good model fit.

The Enron e-mail network collection is a record of e-mails exchanged between the employees of Enron Corporation before its bankruptcy. These data have already been demonstrated to be well-modelled as
Figure 2:
Boxplots of the Kolmogorov-Smirnov scores obtained for the two simulations described in Section 4.1.

self-exciting point processes by Fox et al. (2016). In this article, the version of these data used in Priebe et al. (2005) is analysed, where e-mails recorded multiple times have been used only once, and e-mails with incorrectly recorded sent times (coded in the data as 9pm, 31 December 1979) have been removed. After such pre-processing, the e-mail data consist of distinct triplets $(x_k, y_k, t_k)$, corresponding to messages exchanged between $n = 184$ employees between November 1998 and June 2002, forming a total of graph edges. Note that some of the e-mails are sent to multiple receivers, and fewer unique event times than triplets are observed, implying that on average each e-mail is sent to more than one node. Because an e-mail can have multiple recipients, and because the event times are recorded to the nearest second, the likelihood (2.6) must be adapted slightly to handle tied arrival times. The approach of Price-Williams and Heard (2020, Section 8.1) is followed, with the arrivals modelled by an analogous discrete time process: in particular, arrivals at time $t$ are assumed to contribute to the intensities $\lambda_{ij}(\cdot)$ from time $t + \mathrm{d}t$ onwards, where $\mathrm{d}t$ is the sampling interval, equal to one second in this example. The $p$-values of the process are approximated using (3.1), following Fox et al. (2016).

The model is trained on the e-mails sent before 1st December 2001, and tested on the remaining e-mails. Some of the edges in the test set are not observed in the training period; one of the advantages of the proposed methodology is the possibility to score events for such new links.

A range of MEG models are fitted to the training data, using different combinations of $r$ and $d$ for characterising main effects and interactions.
A good configuration for the initial parameter values is obtained through the quantities $q_i = N_i(T)/(nT)$ and $q'_j = N'_j(T)/(nT)$, corresponding to the average rates of outgoing and incoming connections observed for each node. In particular, good results and convergence are obtained setting initial values $\alpha_i = \mu_i = q_i$, $\phi_i = 3q_i$, $\beta_j = \mu'_j = q'_j$, and $\phi'_j = 3q'_j$. For the interaction term, the initial values used to obtain the results are $\gamma_{i\ell} = \gamma'_{j\ell} = \nu_{i\ell} = \nu'_{j\ell} = 10^{-}$, and $\theta_{i\ell} = \theta'_{j\ell} = 5 \cdot 10^{-}$. If $d > 1$, Gaussian noise with small standard deviation is added to the interaction parameters. In general, the algorithm is fairly robust to different initialisations if the scale of the parameters is similar to the choices above.

Three strategies are used for the estimation of $\tau_{ij}$: (i) using the MLE $\hat{\tau}_{ij} = t_{\ell^{ij}_1}$; (ii) setting $\tau_{ij} = 0$; (iii) setting $\tau_{ij} = 0$ if $A_{ij} = 1$, and $\tau_{ij} = \infty$ if $A_{ij} = 0$. The MLE approach (i) has a drawback: the $p$-values (3.1) for the first observation on each edge are always
1. This implies that the KS scores are

(The data are freely available at .)
Table 1:
Training and test Kolmogorov-Smirnov scores on the Enron e-mail network for different configurations of the MEG model.
KS scores (train & test)
Main effects $\alpha_i(\cdot)$ and $\beta_j(\cdot)$: Absent | Poisson ($r = 0$) | Markov ($r = 1$) | Hawkes ($r = \infty$); each cell reports the training and test KS scores.
$\tau_{ij} = t_{\ell^{ij}_1}$ (MLE), interactions absent: – – | 0.4530 0.4133 | 0.3678 0.3484 | 0.4443 0.3586
$\tau_{ij} = 0$, interactions absent: – – | 0.7678 0.7983 | 0.7456 0.7360 | 0.7058 0.6046
$\tau_{ij} = 0$ if $A_{ij} = 1$, $\tau_{ij} = \infty$ if $A_{ij} = 0$, interactions absent: – – | 0.5590 0.5941 | 0.4112 0.3667 | 0.4593 0.2758
(For each $\tau_{ij}$ strategy, the table also reports rows for Poisson ($r = 0$), Markov ($r = 1$) and Hawkes ($r = \infty$) interactions with $d \in \{1, 5, 10\}$.)

bounded below by a positive quantity on both the training and the test set.

The KS scores obtained on the training and test sets after fitting different MEG models are reported in Table 1. The best performance is achieved when a Markov process is used for the interaction term, with $d = 5$ or $d = 10$, combined with a Hawkes process for the main effects, setting $\tau_{ij}$ using option (iii). The same model achieves the best performance when alternative strategies for the estimation of $\tau_{ij}$ are used. If $\tau_{ij}$ is set to its MLE (i), then the lower bound for the KS score on the training set is attained. In general, setting $\tau_{ij}$ using option (iii) seems to outperform competing strategies for the estimation of $\tau_{ij}$ in terms of KS scores. More importantly, overall the results demonstrate that the interaction term plays a key role in obtaining a good fit on the observed event times.

The results on the training set can also be compared to alternative node-based models from the literature. For example, Fox et al. (2016) propose the following node-specific intensity function for sending
e-mails:
$$\lambda_i(t) = \alpha_i + \sum_{k=1}^{N'_i(t)} \mu_i \exp\{-(\mu_i + \phi_i)(t - t'_{ik})\}, \qquad (4.1)$$
where the intensity jumps according to the event times of the received e-mails (cf. (2.2) and (2.3)). Despite the present article using a slightly different number of e-mails, the Kolmogorov-Smirnov score obtained on the training data using (4.1) corresponds almost exactly to the result in Fox et al. (2016), demonstrating that MEG appears to have superior performance for the Enron network. The parameters of (4.1) are estimated by direct optimisation using the Nelder-Mead method on the negative log-likelihood function for each source node. Nearly identical results to Fox et al. (2016) are also obtained from fitting an independent Poisson process $\lambda_i(t) = \alpha_i$ on each source node. Finally, independent Hawkes process models of the form (1.1) are also fitted to each source node, and are significantly outperformed by the best configuration of the MEG model. Since the MEG model KS score outperforms the value obtained using (4.1), it could be inferred that users tend to respond to multiple e-mails in sessions, and not necessarily immediately after an individual e-mail is received.

Many enterprises routinely collect network flow (NetFlow) data, representing summaries of connections between internet protocol (IP) addresses (see, for example, Hofstede et al., 2014).
A bipartite dynamic network has been constructed from a subset of such data collected at Imperial College London (ICL). The network consists of arrival times recorded to the nearest millisecond, observed between 20th January 2020 and 9th February 2020, recorded from $n_1 = 173$ clients hosted within the Department of Mathematics at ICL, connecting to $n_2 = 6,$ internet servers on ports 80 and 443 (corresponding to unencrypted and encrypted web traffic). Periodic and automated activity has been filtered by considering only edges such that the percentage of arrival times observed between 7am and 12am is larger than 99%, corresponding to the Department of Mathematics building opening hours. To learn connectivity patterns, the MEG model is trained on the first two weeks of data and tested on the events observed in the final week. Only a subset of the unique edges is observed in both time windows, which implies that a substantial number of new edges is observed in the test set.

As discussed in Section 1, computer network data are observed in bursts and exhibit periodic behaviour. Figure 3 gives an example of the connections from two of the clients to the ICL Virtual Learning Environment (VLE) server. Each session begins at an hour consistent with human behaviour, while the frequency of subsequent connections within each session is likely to be due to automated activity and page refreshing.

The models have been initialised using a similar initialisation scheme to Section 4.2, with learning rate $\eta = 0.$
In particular, setting q_i = N_i(T)/(nT) and q'_j = N'_j(T)/(nT), the chosen initial values are α_i = μ_i = q_i, φ_i = 3q_i, β_j = μ'_j = q'_j, φ'_j = 3q'_j, γ_{iℓ} = (q_i)^{1/2}, γ'_{jℓ} = (q'_j)^{1/2}, ν_{iℓ} = ν'_{jℓ} = 10^{−}, and θ_{iℓ} = θ'_{jℓ} = 5·10^{−}. As before, Gaussian noise is added to the interaction parameters if d > 1. The likelihood for the Hawkes process is highly multimodal, and more sensitive to the initial values of the parameters than the Markov process with r = 1. Therefore, the parameters for the Hawkes process models are initialised at the optimal values obtained from the corresponding Markov process models, which appears to lead to fast convergence. The Kolmogorov–Smirnov scores calculated on the training and test sets are reported in Table 2. Figure 3:
Connections to the ICL Virtual Learning Environment from two clients.
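The opening-hours filter described above (retaining only edges whose events fall almost entirely within building hours) can be sketched with pandas. The column names and the toy event log below are hypothetical, chosen only to illustrate the per-edge computation.

```python
import pandas as pd

# hypothetical event log: one row per connection event (edge = client -> server)
events = pd.DataFrame({
    "client": ["c1", "c1", "c2", "c2", "c2"],
    "server": ["s1", "s1", "s1", "s2", "s2"],
    "time": pd.to_datetime([
        "2020-01-20 09:15", "2020-01-20 03:00",
        "2020-01-21 10:30", "2020-01-21 02:00", "2020-01-21 02:30",
    ]),
})

# fraction of each edge's events falling in the 7am-to-midnight window
in_hours = events["time"].dt.hour >= 7
frac = in_hours.groupby([events["client"], events["server"]]).mean()

# keep only edges whose in-hours fraction exceeds the threshold (99% in the paper)
kept_edges = frac[frac > 0.99].index
filtered = events.set_index(["client", "server"]).loc[kept_edges].reset_index()
```

Grouping a boolean indicator and taking its mean gives the in-hours fraction per edge directly, which is then thresholded.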
Table 2:
Kolmogorov-Smirnov scores on the ICL NetFlow data for different configurations of the MEG model.
KS scores (train & test)
Interactions γ_ij(·) →              Absent        Poisson (r = 0)    Markov (r = 1)     Hawkes (r = ∞)
Main effects α_i(·) and β_j(·) ↓
Absent                              – / –         0.7351 / 0.7148    0.6678 / 0.6489    0.7312 / 0.6950
Poisson (r = 0), d ∈ {1, 5, 10}
Markov (r = 1), d ∈ {1, 5, 10}
Hawkes (r = ∞), d ∈ {1, 5, 10}

For all models, τ_ij is set according to option (iii) from Section 4.2, which was observed to have the best performance on the Enron data. The best performance is achieved by a Markov process with r = 1 for both the main effects and interactions, with latent dimensionality d = 5 for the parameters of the interaction term. Corresponding Q-Q plots for some of the models are shown in Figure 4. Overall, the table and plots demonstrate that correctly modelling the arrival times requires the inclusion of an interaction term with a self-exciting component. Given the extremely bursty behaviour of NetFlow arrival times, the Markov process model for main effects and interactions intuitively appears to be a suitable choice.

Finally, for the best performing model, the corresponding KS scores are calculated individually for each edge, and plotted in Figure 5 as a function of the number of connections on the edge. Clearly, the model performs better at scoring arrival times on more active edges.
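The Kolmogorov–Smirnov scores compared in Table 2 measure the distance between the empirical distribution of the model's p-values and the uniform distribution. A minimal sketch of such a score, on synthetic p-values:

```python
import numpy as np
from scipy.stats import kstest

def ks_score(p_values):
    """Kolmogorov-Smirnov distance between the empirical distribution of the
    p-values and Uniform(0, 1); smaller means better calibration."""
    return kstest(np.asarray(p_values), "uniform").statistic

# synthetic p-values: a calibrated model vs a miscalibrated one
rng = np.random.default_rng(1)
well_calibrated = rng.uniform(size=10_000)
poorly_calibrated = rng.beta(2.0, 5.0, size=10_000)  # skewed away from uniform
good, bad = ks_score(well_calibrated), ks_score(poorly_calibrated)
```

Under a correctly specified model the p-values are approximately uniform, so the well-calibrated sample yields a much smaller score than the skewed one.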
(a) Training set. (b) Test set. [Axes: observed quantiles against the Uniform distribution, for the models listed in the legend.]

Figure 4:
Q-Q plots for the training and test p-values obtained from different MEG models, with main effects α_i(t) and β_j(t) with r = 1, and different parameters for the interaction term γ_ij(t), specified in the legend.

(a) Training set. (b) Test set.
Figure 5:
Scatterplot of the Kolmogorov-Smirnov scores, calculated for each edge, versus the logarithm of the total number of connections on the edge, for the best performing model in Table 2.
Conclusion

The mutually exciting graph (MEG), a novel network-wide model for point processes with dyadic marks, has been proposed. MEG uses mutually exciting point processes to model intensity functions, and borrows ideas from latent space models to infer relationships between the nodes. Edge-specific intensities are obtained exclusively via node-specific parameters, which is useful for large and sparse graphs. Importantly, the proposed model is able to predict events observed on new edges. Inference is performed via maximum
likelihood estimation, optimised numerically using modern gradient ascent algorithms. The model has been tested on simulated data and on two data sources related to computer networking: ICL NetFlow data and the Enron e-mail network. MEG appears to have excellent goodness-of-fit on training and test data, resulting in low Kolmogorov–Smirnov scores even on very large and heterogeneous data such as network flows. Furthermore, for the Enron e-mail network, MEG greatly outperforms results previously obtained in the literature on the same data. The model has been specifically motivated by cyber-security applications, where scoring observations on new links is particularly important for network security. Within this context, MEG might be used to complement existing techniques for modelling sequences of edges on dynamic networks (Sanna Passino and Heard, 2019), providing a network-wide method for scoring arrival times.
Code
The Python code to reproduce the results in this paper, and a bash script to obtain the Enron e-mail network data, are available in the
GitHub repository fraspass/meg.
Acknowledgements
This work is funded by the Microsoft Security AI research grant “Understanding the enterprise: Host-based event prediction for automatic defence in cyber-security”. The authors thank Dr Melissa J. M. Turcotte for helpful discussions and comments.
References
Athreya, A., Fishkind, D. E., Tang, M., Priebe, C. E., Park, Y., Vogelstein, J. T., Levin, K., Lyzinski, V., Qin, Y. and Sussman, D. L. (2018) Statistical inference on random dot product graphs: a survey. Journal of Machine Learning Research, 1–92.

Bowsher, C. G. (2007) Modelling security market events in continuous time: Intensity based, multivariate point process models. Journal of Econometrics, 876–912.

Brown, E., Barbieri, R., Ventura, V., Kass, R. and Frank, L. (2002) The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 325–346.

Chen, F. and Tan, W. H. (2018) Marked self-exciting point process modelling of information diffusion on Twitter. Annals of Applied Statistics, 2175–2196.

Daley, D. and Vere-Jones, D. (2002) An Introduction to the Theory of Point Processes – Volume I: Elementary Theory and Methods. Probability and Its Applications. Springer.

Durante, D. and Dunson, D. B. (2016) Locally adaptive dynamic networks. Annals of Applied Statistics, 2203–2232.

Eichler, M., Dahlhaus, R. and Dueck, J. (2017) Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 225–242.

Etesami, J., Kiyavash, N., Zhang, K. and Singhal, K. (2016) Learning network of multivariate Hawkes processes: A time series approach. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16, 162–171. Arlington, Virginia, USA: AUAI Press.

Fox, E., Short, M., Schoenberg, F., Coronges, K. and Bertozzi, A. (2016) Modeling e-mail networks and inferring leadership using self-exciting point processes. Journal of the American Statistical Association, 564–584.

Hawkes, A. (1971) Spectra of some self-exciting and mutually exciting point processes. Biometrika, 83–90.

Heard, N. A., Rubin-Delanchy, P. T. G. and Lawson, D. J. (2014) Filtering automated polling traffic in computer network flow data. Proceedings – 2014 IEEE Joint Intelligence and Security Informatics Conference, JISIC 2014, 268–271.

Hoff, P. D. (2018) Additive and multiplicative effects network models. arXiv e-prints.

Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002) Latent space approaches to social network analysis. Journal of the American Statistical Association, 1090–1098.

Hofstede, R., Čeleda, P., Trammell, B., Drago, I., Sadre, R., Sperotto, A. and Pras, A. (2014) Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX. IEEE Communications Surveys & Tutorials, 2037–2064.

Kingma, D. P. and Ba, J. (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, ICLR 2015 (eds. Y. Bengio and Y. LeCun). San Diego, CA, USA.

Krivitsky, P. N. and Handcock, M. S. (2014) A separable model for dynamic networks. Journal of the Royal Statistical Society: Series B, 29–46.

Linderman, S. W. and Adams, R. P. (2014) Discovering latent network structure in point process data. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, II-1413–II-1421.

Metelli, S. and Heard, N. A. (2019) On Bayesian new edge prediction and anomaly detection in computer networks. Annals of Applied Statistics, 2586–2610.

Mohler, G. O., Short, M. B., Brantingham, P. J., Schoenberg, F. P. and Tita, G. E. (2011) Self-exciting point process modeling of crime. Journal of the American Statistical Association, 100–108.

Neil, J., Hash, C., Brugh, A., Fisk, M. and Storlie, C. B. (2013) Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 403–414.

Ogata, Y. (1978) The asymptotic behaviour of maximum likelihood estimators for stationary point processes. Annals of the Institute of Statistical Mathematics, 243–261.

Ogata, Y. (1981) On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 23–31.

Ogata, Y. (1988) Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 9–27.

Ozaki, T. (1979) Maximum likelihood estimation of Hawkes' self-exciting point processes. Annals of the Institute of Statistical Mathematics, 145–155.

Perry, P. and Wolfe, P. (2013) Point process modelling for directed interaction networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 821–849.

Price-Williams, M. and Heard, N. A. (2020) Nonparametric self-exciting models for computer network traffic. Statistics and Computing, 209–220.

Priebe, C. E., Conroy, J. M., Marchette, D. J. and Park, Y. (2005) Scan statistics on Enron graphs. Computational & Mathematical Organization Theory, 229–247.

Ruder, S. (2016) An overview of gradient descent optimization algorithms. arXiv e-prints.

Sanna Passino, F. and Heard, N. A. (2019) Modelling dynamic network evolution as a Pitman–Yor process. Foundations of Data Science, 293–306.

Sarkar, P. and Moore, A. W. (2006) Dynamic social network analysis using latent space models. In Advances in Neural Information Processing Systems 18, 1145–1152.

Sewell, D. K. and Chen, Y. (2015) Latent space models for dynamic networks. Journal of the American Statistical Association, 1646–1657.

Stomakhin, A., Short, M. B. and Bertozzi, A. L. (2011) Reconstruction of missing data in social networks based on temporal patterns of interactions. Inverse Problems.

Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A. and Leskovec, J. (2015) SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, 1513–1522.
A Appendix
A.1 Calculation of the log-likelihood in MEG models
According to the choice of the excitation function and the parameter r, the log-likelihood (2.6) takes different forms. Here, examples are given for MEG models with d = 1 and r = 1 or r = ∞, with scaled exponential excitation functions. For simplicity, since d = 1, the second subscript ℓ is dropped from the triplets (γ_{iℓ}, ν_{iℓ}, θ_{iℓ}) and (γ'_{jℓ}, ν'_{jℓ}, θ'_{jℓ}). Assume two sequences of arrival times t_1 < ··· < t_{n_i} involving i as source node, and t'_1 < ··· < t'_{n'_j} such that j is the destination of the connection. Within the two sequences of arrival times t_1, ..., t_{n_i} and t'_1, ..., t'_{n'_j}, assume that a subset of n_{ij} events, with n_{ij} ≤ min{n_i, n'_j}, is observed on the edge (i, j). Denote the indices of such events as ℓ_1, ..., ℓ_{n_{ij}} and ℓ'_1, ..., ℓ'_{n_{ij}}, such that t_{ℓ_k} = t'_{ℓ'_k}. Define t̄_k = max{t_h : t_h < t_{ℓ_k}} and t̄'_k = max{t'_h : t'_h < t'_{ℓ'_k}}. For the edge (i, j), assuming r = 1, the first part of the log-likelihood (2.6) is:
\[
\sum_{k=1}^{n_{ij}} \log \lambda_{ij}(t_{\ell_k}) = \sum_{k=1}^{n_{ij}} \log\Big\{ \alpha_i + \mu_i e^{-(\mu_i+\phi_i)(t_{\ell_k}-\bar t_k)} + \beta_j + \mu_j' e^{-(\mu_j'+\phi_j')(t'_{\ell'_k}-\bar t'_k)} + \gamma_i\gamma_j' + \nu_i\nu_j' e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_{k-1}})} \Big\}. \tag{A.1}
\]
For Hawkes process models, the main computational burden associated with the calculation of the likelihood is the double summation arising in the first term of (2.6) when r = ∞. The term in the first sum in (2.6) can be written as:
\[
\sum_{k=1}^{n_{ij}} \log \lambda_{ij}(t_{\ell_k}) = \sum_{k=1}^{n_{ij}} \log\bigg\{ \alpha_i + \mu_i \sum_{h=1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)} + \beta_j + \mu_j' \sum_{h=1}^{\ell'_k-1} e^{-(\mu_j'+\phi_j')(t'_{\ell'_k}-t'_h)} + \gamma_i\gamma_j' + \nu_i\nu_j' \sum_{h=1}^{k-1} e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_h})} \bigg\}. \tag{A.2}
\]
Using a technique similar to the method proposed in Ogata (1978), it is possible to calculate (A.2) in linear time using a recursive formulation of the inner summations. For k ∈ {1, ..., n_{ij}}, define ψ_{ij}(k), ψ'_{ij}(k) and ψ̃_{ij}(k) as follows:
\[
\psi_{ij}(k) = \sum_{h=1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)}, \qquad
\psi'_{ij}(k) = \sum_{h=1}^{\ell'_k-1} e^{-(\mu_j'+\phi_j')(t'_{\ell'_k}-t'_h)}, \qquad
\tilde\psi_{ij}(k) = \sum_{h=1}^{k-1} e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_h})}, \tag{A.3}
\]
assuming the initial conditions ψ̃_{ij}(1) = 0 and
\[
\psi_{ij}(1) = \sum_{h=1}^{\ell_1-1} e^{-(\mu_i+\phi_i)(t_{\ell_1}-t_h)}, \qquad
\psi'_{ij}(1) = \sum_{h=1}^{\ell'_1-1} e^{-(\mu_j'+\phi_j')(t'_{\ell'_1}-t'_h)}.
\]
Note that the subscript (i, j) for ψ_{ij}(k) and ψ'_{ij}(k) is required since ℓ_k and ℓ'_k are edge-specific values, and represent a shorthand notation for ℓ_{ijk} and ℓ'_{ijk}.
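The linear-time recursion underlying (A.3) can be checked numerically against the naive double summation. Below is a minimal sketch for the ψ̃ term, which follows the standard exponential-Hawkes recursion; the single decay constant stands in for the product (ν_i + θ_i)(ν'_j + θ'_j), and the event times are synthetic.

```python
import numpy as np

def psi_tilde_naive(times, decay):
    """tilde-psi(k) = sum_{h < k} exp(-decay * (t_k - t_h)): O(n^2) double sum."""
    return np.array([
        np.exp(-decay * (t_k - times[:k])).sum() for k, t_k in enumerate(times)
    ])

def psi_tilde_recursive(times, decay):
    """Same quantity in O(n), via the recursion
    tilde-psi(k) = exp(-decay * (t_k - t_{k-1})) * (1 + tilde-psi(k-1))."""
    psi = np.zeros(len(times))
    for k in range(1, len(times)):
        psi[k] = np.exp(-decay * (times[k] - times[k - 1])) * (1.0 + psi[k - 1])
    return psi

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 10, size=500))
decay = 1.7  # illustrative stand-in for (nu_i + theta_i) * (nu'_j + theta'_j)
naive, recursive = psi_tilde_naive(t, decay), psi_tilde_recursive(t, decay)
```

The two functions agree to floating-point precision, while the recursive version avoids the quadratic cost that dominates the Hawkes likelihood evaluation.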
Using (A.3), the first term (A.2) of the likelihood becomes:
\[
\sum_{k=1}^{n_{ij}} \log \lambda_{ij}(t_{\ell_k}) = \sum_{k=1}^{n_{ij}} \log\Big\{ \alpha_i + \beta_j + \gamma_i\gamma_j' + \mu_i \psi_{ij}(k) + \mu_j' \psi'_{ij}(k) + \nu_i\nu_j' \tilde\psi_{ij}(k) \Big\}.
\]
Proposition 1.
The terms ψ_{ij}(k), ψ'_{ij}(k) and ψ̃_{ij}(k) can be written recursively as follows:
\[
\psi_{ij}(k) = e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_{\ell_{k-1}})}\,[1 + \psi_{ij}(k-1)] + \sum_{h=\ell_{k-1}+1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)}, \tag{A.4}
\]
\[
\psi'_{ij}(k) = e^{-(\mu_j'+\phi_j')(t'_{\ell'_k}-t'_{\ell'_{k-1}})}\,[1 + \psi'_{ij}(k-1)] + \sum_{h=\ell'_{k-1}+1}^{\ell'_k-1} e^{-(\mu_j'+\phi_j')(t'_{\ell'_k}-t'_h)},
\]
\[
\tilde\psi_{ij}(k) = e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_{k-1}})}\,[1 + \tilde\psi_{ij}(k-1)].
\]
Proof.
The result is proved here for ψ_{ij}(k):
\begin{align*}
\psi_{ij}(k) &= \sum_{h=1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)} = \sum_{h=1}^{\ell_{k-1}} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)} + \sum_{h=\ell_{k-1}+1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)} \\
&= e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_{\ell_{k-1}})} \sum_{h=1}^{\ell_{k-1}} e^{-(\mu_i+\phi_i)(t_{\ell_{k-1}}-t_h)} + \sum_{h=\ell_{k-1}+1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)} \\
&= e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_{\ell_{k-1}})} \bigg[ 1 + \sum_{h=1}^{\ell_{k-1}-1} e^{-(\mu_i+\phi_i)(t_{\ell_{k-1}}-t_h)} \bigg] + \sum_{h=\ell_{k-1}+1}^{\ell_k-1} e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)},
\end{align*}
which, using the definition of ψ_{ij}(k−1), gives the result (A.4). The proof for ψ'_{ij}(k) follows similar steps, but the summation is split at ℓ'_{k-1}, and the first summation is multiplied and divided by e^{-(μ'_j+φ'_j)(t'_{ℓ'_k}-t'_{ℓ'_{k-1}})}. For ψ̃_{ij}(k), the decomposition is analogous to standard Hawkes processes.

The second part of the log-likelihood (2.6) is equivalent to Λ_{ij}(T) and follows from integration of λ_{ij}(t) over the observation period.
For r = 1:
\begin{align*}
\int_{\tau_{ij}}^{T} \lambda_{ij}(t)\,\mathrm{d}t &= (\alpha_i + \beta_j + \gamma_i\gamma_j')(T - \tau_{ij}) - \frac{\nu_i\nu_j'}{(\nu_i+\theta_i)(\nu_j'+\theta_j')} \sum_{k=1}^{n_{ij}} \Big[ e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_{k+1}}-t_{\ell_k})} - 1 \Big] \\
&\quad - \frac{\mu_i}{\mu_i+\phi_i} \sum_{k=1}^{n_i} \mathbb{1}_{[\tau_{ij},\infty)}(t_k) \Big[ e^{-(\mu_i+\phi_i)(t_{k+1}-t_k)} - 1 \Big] - \frac{\mu_j'}{\mu_j'+\phi_j'} \sum_{k=1}^{n_j'} \mathbb{1}_{[\tau_{ij},\infty)}(t_k') \Big[ e^{-(\mu_j'+\phi_j')(t_{k+1}'-t_k')} - 1 \Big] \\
&\quad - \frac{\mu_i}{\mu_i+\phi_i} \Big[ e^{-(\mu_i+\phi_i)(\min\{t_h : t_h \geq \tau_{ij}\} - \max\{t_h : t_h \leq \tau_{ij}\})} - e^{-(\mu_i+\phi_i)(\tau_{ij} - \max\{t_h : t_h \leq \tau_{ij}\})} \Big] \\
&\quad - \frac{\mu_j'}{\mu_j'+\phi_j'} \Big[ e^{-(\mu_j'+\phi_j')(\min\{t_h' : t_h' \geq \tau_{ij}\} - \max\{t_h' : t_h' \leq \tau_{ij}\})} - e^{-(\mu_j'+\phi_j')(\tau_{ij} - \max\{t_h' : t_h' \leq \tau_{ij}\})} \Big], \tag{A.5}
\end{align*}
where t_{n_i+1} = t'_{n'_j+1} = t_{ℓ_{n_{ij}}+1} = T. Similarly, for r = ∞ and 𝟙_{[τ_{ij},∞)}(T) = 1:
\begin{align*}
\int_{\tau_{ij}}^{T} \lambda_{ij}(t)\,\mathrm{d}t &= (\alpha_i + \beta_j + \gamma_i\gamma_j')(T - \tau_{ij}) \tag{A.6} \\
&\quad - \frac{\mu_i}{\mu_i+\phi_i} \sum_{k=1}^{n_i} \Big[ e^{-(\mu_i+\phi_i)(T-t_k)} - e^{-(\mu_i+\phi_i)\max\{\tau_{ij}-t_k,\,0\}} \Big] \\
&\quad - \frac{\mu_j'}{\mu_j'+\phi_j'} \sum_{k=1}^{n_j'} \Big[ e^{-(\mu_j'+\phi_j')(T-t_k')} - e^{-(\mu_j'+\phi_j')\max\{\tau_{ij}-t_k',\,0\}} \Big] \\
&\quad - \frac{\nu_i\nu_j'}{(\nu_i+\theta_i)(\nu_j'+\theta_j')} \sum_{k=1}^{n_{ij}} \Big[ e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(T-t_{\ell_k})} - 1 \Big].
\end{align*}
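Closed-form compensators of this kind can be sanity-checked by numerical integration. Below is a minimal sketch for an interaction-style term with intensity λ(t) = c + a Σ_{t_k < t} e^{-b(t - t_k)} on [0, T], a simplified stand-in for one summand of (A.6) (with τ_ij = 0 and generic constants, not the full MEG compensator):

```python
import numpy as np

def compensator_closed_form(times, c, a, b, T):
    """Closed form of the integral over [0, T] of
    lambda(t) = c + a * sum_{t_k < t} exp(-b * (t - t_k)):
    c*T + (a/b) * sum_k (1 - exp(-b * (T - t_k)))."""
    return c * T + (a / b) * (1.0 - np.exp(-b * (T - times))).sum()

def compensator_numeric(times, c, a, b, T, grid=50_000):
    """Brute-force Riemann-sum check of the same integral."""
    s = np.linspace(0.0, T, grid)
    diff = s[:, None] - times[None, :]
    lam = c + a * (np.exp(-b * diff) * (diff > 0)).sum(axis=1)
    return (lam[:-1] * np.diff(s)).sum()

rng = np.random.default_rng(4)
events = np.sort(rng.uniform(0, 10, size=30))
closed = compensator_closed_form(events, c=0.5, a=0.8, b=1.2, T=10.0)
numeric = compensator_numeric(events, c=0.5, a=0.8, b=1.2, T=10.0)
```

The two values agree up to discretisation error of the Riemann sum, which is dominated by the jumps of the intensity at the event times.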
Note that (A.5) and (A.6) are monotonically decreasing functions of τ_{ij} for any choice of the remaining parameters, under the constraint τ_{ij} < t_{ℓ_{ij}}, where t_{ℓ_{ij}} is the first arrival time on the edge (i, j). Furthermore, τ_{ij} does not explicitly appear in the first part of the likelihood, cf. (A.1) and (A.2). Therefore, using (2.6), the maximum likelihood estimate for τ_{ij} is simply τ̂_{ij} = t_{ℓ_{ij}} if at least one event is observed on the edge, and τ̂_{ij} = ∞ otherwise. If τ_{ij} is set to its MLE, then the last two lines of (A.5) cancel out.

For the p-values in (3.1), for r = ∞, the difference between the compensators is calculated sequentially at the observed times t_1, ..., t_{n_{ij}} using ψ_{ij}(k), ψ'_{ij}(k) and ψ̃_{ij}(k):
\begin{align*}
\Lambda_{ij}(t_k) - \Lambda_{ij}(t_{k-1}) &= (\alpha_i + \beta_j + \gamma_i\gamma_j')(t_k - t_{k-1}) - \frac{\mu_i}{\mu_i+\phi_i} \big[ \psi_{ij}(k) - N_i(t_k) - \psi_{ij}(k-1) + N_i(t_{k-1}) \big] \\
&\quad - \frac{\mu_j'}{\mu_j'+\phi_j'} \big[ \psi'_{ij}(k) - N_j'(t_k) - \psi'_{ij}(k-1) + N_j'(t_{k-1}) \big] \\
&\quad - \frac{\nu_i\nu_j'}{(\nu_i+\theta_i)(\nu_j'+\theta_j')} \big[ \tilde\psi_{ij}(k) - N_{ij}(t_k) - \tilde\psi_{ij}(k-1) + N_{ij}(t_{k-1}) \big].
\end{align*}
An expression similar to (A.5) can be used for Λ_{ij}(t_k) − Λ_{ij}(t_{k-1}) when r = 1.

A.2 Calculation of the gradient for r = ∞

In the derivations of the gradient, the following notation is used:
\[
\zeta_{ij}(k) = \alpha_i + \beta_j + \gamma_i\gamma_j' + \mu_i \psi_{ij}(k) + \mu_j' \psi'_{ij}(k) + \nu_i\nu_j' \tilde\psi_{ij}(k).
\]
The partial derivatives of log L(H_T | Ψ) with respect to α_i and γ_i take the following form:
\begin{align*}
\frac{\partial}{\partial \alpha_i} \log L(\mathcal H_T \mid \Psi) &= \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T) \bigg[ -(T - \tau_{ij}) + \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \bigg], \\
\frac{\partial}{\partial \gamma_i} \log L(\mathcal H_T \mid \Psi) &= \sum_{j=1}^{n} \gamma_j'\, \mathbb{1}_{[\tau_{ij},\infty)}(T) \bigg[ -(T - \tau_{ij}) + \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \bigg].
\end{align*}
Similar equations can be derived for the partial derivatives with respect to β_j and γ'_j. The calculations of the partial derivatives with respect to the parameters μ_i, φ_i, μ'_j and φ'_j use the recursive structure defined in the previous section, since the expressions ψ_{ij}(k) and ψ'_{ij}(k) are functions of (μ_i, φ_i) and (μ'_j, φ'_j) respectively. For μ_i and φ_i:
\begin{align*}
\frac{\partial}{\partial \mu_i} \log L(\mathcal H_T \mid \Psi) &= \frac{1}{\mu_i+\phi_i} \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T) \sum_{k=1}^{n_i} \bigg\{ \frac{\phi_i}{\mu_i+\phi_i} \Big[ e^{-(\mu_i+\phi_i)(T-t_k)} - e^{-(\mu_i+\phi_i)\max\{\tau_{ij}-t_k,\,0\}} \Big] \\
&\qquad - \mu_i \Big[ (T-t_k)\, e^{-(\mu_i+\phi_i)(T-t_k)} - \max\{\tau_{ij}-t_k,\,0\}\, e^{-(\mu_i+\phi_i)\max\{\tau_{ij}-t_k,\,0\}} \Big] \bigg\} \\
&\quad + \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T) \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \Big[ \psi_{ij}(k) + \mu_i \frac{\partial}{\partial \mu_i} \psi_{ij}(k) \Big],
\end{align*}
\begin{align*}
\frac{\partial}{\partial \phi_i} \log L(\mathcal H_T \mid \Psi) &= -\frac{\mu_i}{\mu_i+\phi_i} \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T) \sum_{k=1}^{n_i} \bigg\{ \frac{1}{\mu_i+\phi_i} \Big[ e^{-(\mu_i+\phi_i)(T-t_k)} - e^{-(\mu_i+\phi_i)\max\{\tau_{ij}-t_k,\,0\}} \Big] \\
&\qquad + \Big[ (T-t_k)\, e^{-(\mu_i+\phi_i)(T-t_k)} - \max\{\tau_{ij}-t_k,\,0\}\, e^{-(\mu_i+\phi_i)\max\{\tau_{ij}-t_k,\,0\}} \Big] \bigg\} \\
&\quad + \mu_i \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T) \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \frac{\partial}{\partial \phi_i} \psi_{ij}(k).
\end{align*}
In the above expressions, the partial derivatives of ψ_{ij}(k) with respect to μ_i and φ_i coincide and are computed recursively as follows:
\[
\frac{\partial}{\partial \mu_i} \psi_{ij}(k) = \frac{\partial}{\partial \phi_i} \psi_{ij}(k) = e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_{\ell_{k-1}})} \Big\{ \frac{\partial}{\partial \phi_i} \psi_{ij}(k-1) - (t_{\ell_k}-t_{\ell_{k-1}})\,[1 + \psi_{ij}(k-1)] \Big\} - \sum_{h=\ell_{k-1}+1}^{\ell_k-1} (t_{\ell_k}-t_h)\, e^{-(\mu_i+\phi_i)(t_{\ell_k}-t_h)}.
\]
Similar considerations can be made for the partial derivatives with respect to (ν_i, θ_i) and (ν'_j, θ'_j). In this case, the recursive form stems from ψ̃_{ij}(k), which is a function of the two pairs of parameters.
For ν_i and θ_i:
\begin{align*}
\frac{\partial}{\partial \nu_i} \log L(\mathcal H_T \mid \Psi) &= \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T)\, \frac{\nu_j'}{\nu_i+\theta_i} \sum_{k=1}^{n_{ij}} \bigg\{ \frac{\theta_i}{(\nu_i+\theta_i)(\nu_j'+\theta_j')} \Big[ e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(T-t_{\ell_k})} - 1 \Big] \\
&\qquad - \nu_i (T-t_{\ell_k})\, e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(T-t_{\ell_k})} \bigg\} + \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T)\, \nu_j' \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \Big[ \tilde\psi_{ij}(k) + \nu_i \frac{\partial}{\partial \nu_i} \tilde\psi_{ij}(k) \Big], \\
\frac{\partial}{\partial \theta_i} \log L(\mathcal H_T \mid \Psi) &= -\frac{\nu_i}{\nu_i+\theta_i} \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T)\, \nu_j' \sum_{k=1}^{n_{ij}} \bigg\{ \frac{1}{(\nu_i+\theta_i)(\nu_j'+\theta_j')} \Big[ e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(T-t_{\ell_k})} - 1 \Big] \\
&\qquad + (T-t_{\ell_k})\, e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(T-t_{\ell_k})} \bigg\} + \nu_i \sum_{j=1}^{n} \mathbb{1}_{[\tau_{ij},\infty)}(T)\, \nu_j' \sum_{k=1}^{n_{ij}} \zeta_{ij}(k)^{-1} \frac{\partial}{\partial \theta_i} \tilde\psi_{ij}(k).
\end{align*}
The recursive equations for the partial derivatives of ψ̃_{ij}(k) with respect to ν_i and θ_i are equivalent. For θ_i:
\[
\frac{\partial}{\partial \theta_i} \tilde\psi_{ij}(k) = e^{-(\nu_i+\theta_i)(\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_{k-1}})} \Big\{ \frac{\partial}{\partial \theta_i} \tilde\psi_{ij}(k-1) - (\nu_j'+\theta_j')(t_{\ell_k}-t_{\ell_{k-1}})\,[1 + \tilde\psi_{ij}(k-1)] \Big\},
\]
with, similarly to the previous cases, the initial condition ∂ψ̃_{ij}(1)/∂θ_i = 0.
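Analytic gradients of this kind are conveniently validated by finite differences. Below is a minimal sketch for the α-derivative of a univariate exponential Hawkes log-likelihood; it is a simplified stand-in mirroring the structure of the α_i derivative above, not the full MEG gradient, and all data are synthetic.

```python
import numpy as np

def log_lik(params, times, T):
    """Log-likelihood of a univariate exponential Hawkes process with
    lambda(t) = alpha + nu * sum_{t_k < t} exp(-theta * (t - t_k))."""
    alpha, nu, theta = params
    # linear-time recursion: psi(k) = sum_{h < k} exp(-theta * (t_k - t_h))
    psi = np.zeros(len(times))
    for k in range(1, len(times)):
        psi[k] = np.exp(-theta * (times[k] - times[k - 1])) * (1.0 + psi[k - 1])
    log_term = np.log(alpha + nu * psi).sum()
    comp = alpha * T + (nu / theta) * (1.0 - np.exp(-theta * (T - times))).sum()
    return log_term - comp

def grad_alpha(params, times, T):
    """Analytic derivative with respect to alpha: sum_k 1/lambda(t_k) - T,
    mirroring the structure of the alpha_i derivative above."""
    alpha, nu, theta = params
    psi = np.zeros(len(times))
    for k in range(1, len(times)):
        psi[k] = np.exp(-theta * (times[k] - times[k - 1])) * (1.0 + psi[k - 1])
    return (1.0 / (alpha + nu * psi)).sum() - T

rng = np.random.default_rng(5)
t = np.sort(rng.uniform(0, 50, size=100))
params, eps = np.array([0.3, 0.5, 2.0]), 1e-6
fd = (log_lik(params + [eps, 0, 0], t, 50.0)
      - log_lik(params - [eps, 0, 0], t, 50.0)) / (2 * eps)
analytic = grad_alpha(params, t, 50.0)
```

The central finite difference and the analytic expression agree to several decimal places, which is a useful check before plugging gradients of this form into a gradient ascent routine.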