Scalable Inference for Nonparametric Hawkes Process Using Pólya-Gamma Augmentation
Feng Zhou, Zhidong Li, Xuhui Fan, Yang Wang, Arcot Sowmya, Fang Chen
Data61, CSIRO; University of New South Wales; University of Technology Sydney
Abstract
In this paper, we consider the sigmoid Gaussian Hawkes process model: the baseline intensity and triggering kernel of the Hawkes process are both modeled as sigmoid transformations of random trajectories drawn from Gaussian processes (GP). By introducing auxiliary latent random variables (the branching structure, Pólya-Gamma random variables and latent marked Poisson processes), the likelihood is converted into two decoupled components of Gaussian form, which allows for efficient conjugate analytical inference. Using the augmented likelihood, we derive an expectation-maximization (EM) algorithm to obtain the maximum a posteriori (MAP) estimate. Furthermore, we extend the EM algorithm to an efficient approximate Bayesian inference algorithm: mean-field variational inference. We demonstrate the performance of the two algorithms on simulated fictitious data. Experiments on real data show that the proposed inference algorithms can efficiently recover the underlying prompting characteristics.
Background
The Hawkes process is an important class of point processes which can be used to model self-exciting phenomena in numerous application domains, e.g. crime (Liu, Yan, and Chen 2018), ecosystems (Gupta et al. 2018), transportation (Du et al. 2016) and social networks (Pinto, Chahed, and Altman 2015). An important statistic of a point process is the conditional intensity: the probability of an event occurring in an infinitesimal time interval given the history. Specifically, the conditional intensity of the Hawkes process is

$$\lambda(t) = \mu(t) + \int_0^t \phi(t-s)\, dN(s) = \mu(t) + \sum_{t_i < t} \phi(t - t_i), \qquad (1)$$

where $\mu(t)$ is the baseline intensity, $\phi(\tau)$ is the triggering kernel and $N(t)$ is the counting process.
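For concreteness, here is a minimal numerical sketch of Eq.(1): it evaluates $\lambda(t)$ for a given event history. The constant baseline and exponential kernel below are chosen purely for illustration and are not the paper's sigmoid-GP model.

```python
import numpy as np

def hawkes_intensity(t, history, mu, phi):
    """Evaluate lambda(t) = mu(t) + sum_{t_i < t} phi(t - t_i) for a Hawkes process."""
    past = history[history < t]
    return mu(t) + np.sum(phi(t - past))

# Illustrative choices only: constant baseline, exponentially decaying kernel.
mu = lambda t: 0.5
phi = lambda tau: 0.8 * np.exp(-1.5 * tau)

events = np.array([0.3, 1.1, 2.4, 2.5])
print(hawkes_intensity(3.0, events, mu, phi))
```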
All GP-modulated intensity models mentioned above share two issues. 1) Due to the link function, the likelihood of the GP variables is non-conjugate to the prior, resulting in a non-Gaussian posterior; this non-conjugacy leads to a complicated and time-consuming inference procedure. 2) Furthermore, in the Hawkes process, the exogenous component (baseline intensity) and the endogenous component (triggering kernel) are coupled in the likelihood, which further hampers the tractability of inference.

Targets
To circumvent these problems, we augment the likelihood with auxiliary latent random variables: the branching structure, Pólya-Gamma random variables and latent marked Poisson processes. The branching structure of the Hawkes process is introduced to decouple $\mu(t)$ and $\phi(\tau)$ into two independent components in the likelihood; inspired by Polson, Scott, and Windle (2013) and Donner and Opper (2018), we use a sigmoid link function in the model and convert the sigmoid into an infinite mixture of Gaussians involving Pólya-Gamma random variables; the latent marked Poisson processes are augmented to linearize the exponential integral term in the likelihood. Augmented in this way, the likelihood becomes conjugate to the GP prior. With these latent random variables, we use the augmented likelihood to construct two efficient analytical iterative algorithms. The first is an EM algorithm to obtain the MAP estimate; we further extend the EM algorithm to a mean-field variational inference algorithm that is slightly slower but provides uncertainty via a distribution estimate rather than a point estimate. It is worth noting that naïve implementations of both algorithms are time-consuming; to improve efficiency markedly, the sparse GP approximation (Titsias 2009) is introduced.

Contributions
Specifically, we make the following contributions:
- An EM algorithm and a mean-field variational inference algorithm are derived for the Bayesian nonparametric Hawkes process in which the baseline intensity and triggering kernel are both sigmoid-GP rates.
- The original Hawkes process likelihood is converted into two decoupled factors that are conjugate to the GP priors by augmenting the branching structure, Pólya-Gamma random variables and latent marked Poisson processes. This results in simple and efficient EM and variational inference algorithms with explicit expressions.
- The complexity is further reduced to almost linear by utilizing the sparse GP approximation.
A Hawkes process is a stochastic process whose realization is a sequence of timestamps $D = \{t_i\}_{i=1}^N$ in $[0, T]$, where $t_i$ is the occurrence time of the $i$-th event and $T$ is the observation window. The conditional intensity of the Hawkes process is given in Eq.(1). Given $\mu(t)$ and $\phi(\tau)$, the Hawkes process likelihood (Daley and Vere-Jones 2003) is

$$p(D \mid \mu(t), \phi(\tau)) = \prod_{i=1}^N \Big[ \mu(t_i) + \sum_{t_j < t_i} \phi(t_i - t_j) \Big] \exp\Big( -\int_0^T \lambda(t)\, dt \Big).$$

The conditional density of $\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}$ given $(\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}}$ can be factorized and obtained from Eq.(7) and (8). More specifically, we provide details of these factors below.

The conditional distributions of the Pólya-Gamma variables $\omega_{ii}$ and $\omega_{ij}$ depend on the function values $f^{\text{old}}$ and $g^{\text{old}}$ at $t_i$ and $\tau_{ij}$:

$$p(\boldsymbol{\omega}_{ii} \mid f^{\text{old}}) = \prod_{i=1}^N p_{\mathrm{PG}}(\omega_{ii} \mid 1, f^{\text{old}}(t_i)), \qquad p(\boldsymbol{\omega}_{ij} \mid g^{\text{old}}) = \prod_{i=2}^N \prod_{j=1}^{i-1} p_{\mathrm{PG}}(\omega_{ij} \mid 1, g^{\text{old}}(\tau_{ij})), \qquad (11)$$

where we marginalize out $\mathbf{X}$ and utilize the tilted Pólya-Gamma distribution $p_{\mathrm{PG}}(\omega \mid b, c) \propto e^{-c^2\omega/2}\, p_{\mathrm{PG}}(\omega \mid b, 0)$, whose first-order moment is $\mathbb{E}[\omega] = \frac{b}{2c}\tanh\frac{c}{2}$ (Polson, Scott, and Windle 2013).

The conditional density of $\Pi_\mu$ depends on $f^{\text{old}}$ and $\lambda_\mu^{*\,\text{old}}$:

$$p(\Pi_\mu \mid f^{\text{old}}, \lambda_\mu^{*\,\text{old}}) = \frac{p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^{*\,\text{old}}) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f^{\text{old}}(t))}}{\int p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^{*\,\text{old}}) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f^{\text{old}}(t))}\, d\Pi_\mu}. \qquad (12)$$

Using Eq.(5) and (6) to convert the denominator, Eq.(12) can be rewritten as

$$p(\Pi_\mu \mid f^{\text{old}}, \lambda_\mu^{*\,\text{old}}) = \frac{p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^{*\,\text{old}}) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f^{\text{old}}(t))}}{\exp\big( -\iint \big(1 - e^{h(\omega_\mu, -f^{\text{old}}(t))}\big)\, \lambda_\mu^{*\,\text{old}}\, p_{\mathrm{PG}}(\omega_\mu \mid 1, 0)\, d\omega_\mu\, dt \big)}$$
$$= \prod_{(\omega_\mu, t)\in\Pi_\mu} \Big( e^{h(\omega_\mu, -f^{\text{old}}(t))}\, \lambda_\mu^{*\,\text{old}}\, p_{\mathrm{PG}}(\omega_\mu \mid 1, 0) \Big) \cdot \exp\Big( -\iint e^{h(\omega_\mu, -f^{\text{old}}(t))}\, \lambda_\mu^{*\,\text{old}}\, p_{\mathrm{PG}}(\omega_\mu \mid 1, 0)\, d\omega_\mu\, dt \Big).$$

It is straightforward to see that the above conditional distribution has the likelihood form of a marked Poisson process with intensity function

$$\Lambda_\mu(t, \omega_\mu) = e^{h(\omega_\mu, -f^{\text{old}}(t))}\, \lambda_\mu^{*\,\text{old}}\, p_{\mathrm{PG}}(\omega_\mu \mid 1, 0) = \lambda_\mu^{*\,\text{old}}\, \sigma(-f^{\text{old}}(t))\, p_{\mathrm{PG}}(\omega_\mu \mid 1, f^{\text{old}}(t)). \qquad (13)$$

The derivation of the conditional distribution of $\Pi_{\phi_i}$ is the same as for $\Pi_\mu$, with the corresponding subscripts replaced. It is worth noting that there are $N$ independent marked Poisson processes $\{\Pi_{\phi_i}\}_{i=1}^N$ with the same intensity function

$$\Lambda_\phi(\tau, \omega_\phi) = \lambda_\phi^{*\,\text{old}}\, \sigma(-g^{\text{old}}(\tau))\, p_{\mathrm{PG}}(\omega_\phi \mid 1, g^{\text{old}}(\tau)). \qquad (14)$$

Combining Eq.(7) and (8) and marginalizing out $\omega_{ii}$ and $\omega_{ij}$, we obtain the conditional distribution of $\mathbf{X}$:

$$p(\mathbf{X} \mid (\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}}) \propto \prod_{i=1}^N \big( \mu^{\text{old}}(t_i) \big)^{x_{ii}} \prod_{i=2}^N \prod_{j=1}^{i-1} \big( \phi^{\text{old}}(\tau_{ij}) \big)^{x_{ij}},$$

with $\mu^{\text{old}}(t_i) = \lambda_\mu^{*\,\text{old}} \sigma(f^{\text{old}}(t_i))$ and $\phi^{\text{old}}(\tau_{ij}) = \lambda_\phi^{*\,\text{old}} \sigma(g^{\text{old}}(\tau_{ij}))$. This is a multinomial distribution with

$$p(x_{ii}=1) = \frac{\mu^{\text{old}}(t_i)}{\mu^{\text{old}}(t_i) + \sum_{j=1}^{i-1} \phi^{\text{old}}(\tau_{ij})}, \qquad p(x_{ij}=1) = \frac{\phi^{\text{old}}(\tau_{ij})}{\mu^{\text{old}}(t_i) + \sum_{j=1}^{i-1} \phi^{\text{old}}(\tau_{ij})}, \qquad (15)$$

which is a well-known result in Lewis and Mohler (2011) and Zhou, Zha, and Song (2013).
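As an illustration of the E step, the following sketch (ours, not from the paper) computes the two quantities that appear repeatedly above: the tilted Pólya-Gamma first moment $\mathbb{E}[\omega] = \frac{b}{2c}\tanh\frac{c}{2}$ and the branching-structure responsibilities of Eq.(15). The callables `mu_old` and `phi_old` are assumed to return the current $\mu$ and $\phi$ and to accept NumPy arrays.

```python
import numpy as np

def pg_mean(c, b=1.0):
    """First moment of a tilted Polya-Gamma variable: E[omega] = (b / 2c) tanh(c / 2),
    with the limit b/4 as c -> 0."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    safe = np.where(np.abs(c) > 1e-8, c, 1.0)
    return np.where(np.abs(c) > 1e-8, b / (2.0 * safe) * np.tanh(safe / 2.0), b / 4.0)

def branching_responsibilities(times, mu_old, phi_old, t_phi):
    """E step of Eq. (15): posterior probability that event i is exogenous (x_ii = 1)
    or was triggered by an earlier event j (x_ij = 1), under the current mu and phi."""
    p_exo, p_trig = [], []
    for i, t_i in enumerate(times):
        taus = t_i - times[:i]
        w = np.where(taus < t_phi, phi_old(taus), 0.0)  # kernel support limited to T_phi
        total = mu_old(t_i) + w.sum()
        p_exo.append(mu_old(t_i) / total)
        p_trig.append(w / total)
    return np.array(p_exo), p_trig
```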
Lower-bound of log-posterior. Given the conditional densities above, we can compute the lower bound $Q$. The expectation of the log-likelihood (ELL) term in Eq.(10) can be rewritten as the sum of a baseline intensity part and a triggering kernel part. The ELL of the baseline intensity part is

$$\mathrm{ELL}_\mu(\lambda_\mu^*, f) = \mathbb{E}_{p(\Pi_\mu, \boldsymbol{\omega}_{ii}, \mathbf{X}_{ii} \mid (\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}})}\big[ \log p(D, \Pi_\mu, \boldsymbol{\omega}_{ii}, \mathbf{X}_{ii} \mid \lambda_\mu^*, f) \big]$$
$$= -\frac{1}{2}\int_0^T A_\mu(t)\, f^2(t)\, dt + \int_0^T B_\mu(t)\, f(t)\, dt - \lambda_\mu^* T + \Big( \sum_{i=1}^N \mathbb{E}(x_{ii}) + \iint \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu\, dt \Big) \log \lambda_\mu^*, \qquad (16)$$

where

$$A_\mu(t) = \sum_{i=1}^N \mathbb{E}[\omega_{ii}]\,\mathbb{E}[x_{ii}]\,\delta(t - t_i) + \int_0^\infty \omega_\mu\, \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu, \qquad B_\mu(t) = \frac{1}{2}\sum_{i=1}^N \mathbb{E}[x_{ii}]\,\delta(t - t_i) - \frac{1}{2}\int_0^\infty \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu,$$

with the expectations taken over $p(\omega_{ii} \mid f^{\text{old}}(t_i))$ or $p(x_{ii} \mid (\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}})$. Similarly, the ELL of the triggering kernel part is

$$\mathrm{ELL}_\phi(\lambda_\phi^*, g) = \mathbb{E}_{p(\{\Pi_{\phi_i}\}, \boldsymbol{\omega}_{ij}, \mathbf{X}_{ij} \mid (\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}})}\big[ \log p(D, \{\Pi_{\phi_i}\}, \boldsymbol{\omega}_{ij}, \mathbf{X}_{ij} \mid \lambda_\phi^*, g) \big]$$
$$= -\frac{1}{2}\int_0^{T_\phi} A_\phi(\tau)\, g^2(\tau)\, d\tau + \int_0^{T_\phi} B_\phi(\tau)\, g(\tau)\, d\tau - N\lambda_\phi^* T_\phi + \Big( \sum_{i=2}^N \sum_{j=1}^{i-1} \mathbb{E}(x_{ij}) + N\iint \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi\, d\tau \Big) \log \lambda_\phi^*, \qquad (17)$$

where

$$A_\phi(\tau) = \sum_{i,j} \mathbb{E}[\omega_{ij}]\,\mathbb{E}[x_{ij}]\,\delta(\tau - \tau_{ij}) + N\int_0^\infty \omega_\phi\, \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi, \qquad B_\phi(\tau) = \frac{1}{2}\sum_{i,j} \mathbb{E}[x_{ij}]\,\delta(\tau - \tau_{ij}) - \frac{N}{2}\int_0^\infty \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi,$$

with the expectations taken over $p(\omega_{ij} \mid g^{\text{old}}(\tau_{ij}))$ or $p(x_{ij} \mid (\lambda_\mu^*, f, \lambda_\phi^*, g)^{\text{old}})$.

However, the ELL is intractable for general GP priors because the ELL is a functional. To circumvent this problem, we utilize the sparse GP approximation and introduce inducing points. $f$ and $g$ are assumed to depend on their corresponding inducing points $\{t_s\}_{s=1}^{S_\mu}$ and $\{\tau_s\}_{s=1}^{S_\phi}$; the function values of $f$ and $g$ at these inducing points are $f_{t_s}$ and $g_{\tau_s}$. Given a sample $f_{t_s}$ and $g_{\tau_s}$, $f(t)$ and $g(\tau)$ in Eq.(16) and (17) are assumed to be the posterior mean functions

$$f(t) = \mathbf{k}_{t_s t}^\top K_{t_s t_s}^{-1} f_{t_s}, \qquad g(\tau) = \mathbf{k}_{\tau_s \tau}^\top K_{\tau_s \tau_s}^{-1} g_{\tau_s}, \qquad (18)$$

with $\mathbf{k}_{t_s t}$ and $\mathbf{k}_{\tau_s \tau}$ being the kernel vectors between the observations and inducing points, and $K_{t_s t_s}$, $K_{\tau_s \tau_s}$ the kernel matrices of the inducing points only.

Substituting Eq.(18) into Eq.(16) and (17), we obtain

$$Q\big( (\lambda_\mu^*, f_{t_s}, \lambda_\phi^*, g_{\tau_s}) \mid (\lambda_\mu^*, f_{t_s}, \lambda_\phi^*, g_{\tau_s})^{\text{old}} \big) = \mathrm{ELL}_\mu(\lambda_\mu^*, f_{t_s}) + \mathrm{ELL}_\phi(\lambda_\phi^*, g_{\tau_s}) - \tfrac{1}{2} f_{t_s}^\top K_{t_s t_s}^{-1} f_{t_s} - \tfrac{1}{2} g_{\tau_s}^\top K_{\tau_s \tau_s}^{-1} g_{\tau_s}. \qquad (19)$$

In the M step, we maximize the lower bound $Q$. The optimal parameters $\hat\lambda_\mu^*, \hat f_{t_s}, \hat\lambda_\phi^*, \hat g_{\tau_s}$ are obtained by setting the gradient of Eq.(19) to zero. Owing to the auxiliary-variable augmentation, we have the analytical solutions

$$\hat\lambda_\mu^* = \Big( \sum_{i=1}^N p_{ii} + M_\mu \Big) \Big/ T, \qquad \hat\lambda_\phi^* = \Big( \sum_{i=2}^N \sum_{j=1}^{i-1} p_{ij} + N M_\phi \Big) \Big/ (N T_\phi),$$
$$\hat f_{t_s} = \Sigma_{t_s} K_{t_s t_s}^{-1} \int_0^T B_\mu(t)\, \mathbf{k}_{t_s t}\, dt, \qquad \hat g_{\tau_s} = \Sigma_{\tau_s} K_{\tau_s \tau_s}^{-1} \int_0^{T_\phi} B_\phi(\tau)\, \mathbf{k}_{\tau_s \tau}\, d\tau, \qquad (20)$$

where $\Sigma_{t_s} = \big[ K_{t_s t_s}^{-1} \int A_\mu(t)\, \mathbf{k}_{t_s t}\mathbf{k}_{t_s t}^\top\, dt\, K_{t_s t_s}^{-1} + K_{t_s t_s}^{-1} \big]^{-1}$, $\Sigma_{\tau_s} = \big[ K_{\tau_s \tau_s}^{-1} \int A_\phi(\tau)\, \mathbf{k}_{\tau_s \tau}\mathbf{k}_{\tau_s \tau}^\top\, d\tau\, K_{\tau_s \tau_s}^{-1} + K_{\tau_s \tau_s}^{-1} \big]^{-1}$, $M_\mu = \iint \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu\, dt$ and $M_\phi = \iint \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi\, d\tau$. All intractable integrals can be solved by Gaussian quadrature. Another advantage of the sparse GP approximation is that the complexity of matrix inversion is fixed to $O(S_\mu^3 + S_\phi^3)$, where $S_\mu$ (or $S_\phi$) $\ll N$.
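The sketch below illustrates, under our own naming conventions, how the M-step update of Eq.(20) for the baseline GP can be assembled: the smooth parts of $A_\mu$ and $B_\mu$ are integrated by Gauss-Legendre quadrature, while the delta parts at the events are added as finite sums. `A_cont`, `B_cont`, `E_omega` and `E_x` are assumed inputs (the continuous integrands derived from Eq.(13) and the per-event expectations from the E step); this is a schematic, not the authors' implementation.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

def se_kernel(x, y, theta0=1.0, theta1=1.0):
    """Squared-exponential kernel k(x, x') = theta0 * exp(-theta1 * (x - x')^2)."""
    return theta0 * np.exp(-theta1 * np.subtract.outer(x, y) ** 2)

def project(f_inducing, inducing, t_query, jitter=1e-6):
    """Posterior-mean projection of Eq. (18): f(t) = k(t)^T K^{-1} f_s."""
    K = se_kernel(inducing, inducing) + jitter * np.eye(len(inducing))
    return se_kernel(t_query, inducing) @ np.linalg.solve(K, f_inducing)

def m_step_gp_update(inducing, events, E_omega, E_x, A_cont, B_cont, T,
                     n_quad=50, jitter=1e-6):
    """M step for the baseline GP (Eq. 20):
    Sigma = [K^{-1} int A(t) k k^T dt K^{-1} + K^{-1}]^{-1},
    f_hat = Sigma K^{-1} int B(t) k dt.
    The smooth parts A_cont, B_cont are integrated by quadrature; the delta parts
    located at the events are added directly."""
    K = se_kernel(inducing, inducing) + jitter * np.eye(len(inducing))
    Kinv = np.linalg.inv(K)
    nodes, weights = leggauss(n_quad)
    t = 0.5 * T * (nodes + 1.0)          # map [-1, 1] -> [0, T]
    w = 0.5 * T * weights
    k_qt = se_kernel(inducing, t)        # S x n_quad
    k_ev = se_kernel(inducing, events)   # S x N
    A_int = (k_qt * (w * A_cont(t))) @ k_qt.T + (k_ev * (E_omega * E_x)) @ k_ev.T
    B_int = k_qt @ (w * B_cont(t)) + k_ev @ (0.5 * E_x)
    Sigma = np.linalg.inv(Kinv @ A_int @ Kinv + Kinv)
    return Sigma @ Kinv @ B_int, Sigma
```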
Combined with the sparsity of the expectation of the branching structure, this results in a complexity scaling almost linearly with the data size, $O(NL)$, where $L = \int_0^{T_\phi} \frac{\mu(t)}{1 - \int \phi(\tau)\, d\tau}\, dt \ll N$: previous points that are more than $T_\phi$ away from event $i$ have no influence on event $i$ ($\mathbb{E}[x_{ij}] = 0$).

Throughout this paper, the GP covariance kernel is the squared exponential kernel $k(x, x') = \theta_0 \exp\big( -\theta_1 \|x - x'\|^2 \big)$. The hyperparameters $\theta_0$ and $\theta_1$ can be optimized by maximizing $Q$ over $\boldsymbol\theta = \{\theta_0, \theta_1\}$ using numerical packages. Normally, we update $\boldsymbol\theta$ every 20 iterations.

Additional hyperparameters are the number and location of the inducing points, which affect the complexity and the estimation quality of $\mu(t)$ and $\phi(\tau)$. A large number of inducing points leads to high complexity, while a small number cannot capture the dynamics. For fast inference, the inducing points are uniformly located on the domain. Another advantage of uniform locations is that the kernel matrix has Toeplitz structure (Cunningham, Shenoy, and Sahani 2008), which means the matrix inversion can be implemented more efficiently. The number of inducing points is gradually increased until there is no further significant improvement. The final pseudo code is provided in Alg.1.

Algorithm 1: EM algorithm for SGPHP
Result: $\mu(t) = \lambda_\mu^* \sigma(f(t))$, $\phi(\tau) = \lambda_\phi^* \sigma(g(\tau))$
Initialize hyperparameters and $\mathbf{X}$, $\lambda_\mu^*$, $\lambda_\phi^*$, $\omega_{ii}$, $\omega_{ij}$, $f_{t_s}$, $g_{\tau_s}$, $\Pi_\mu$, $\{\Pi_{\phi_i}\}_{i=1}^N$;
for each EM iteration do
  Update the posterior of $\omega_{ii}$ and $\omega_{ij}$ by Eq.(11);
  Update the intensities of $\Pi_\mu$ and $\{\Pi_{\phi_i}\}$ by Eq.(13), (14);
  Update the posterior of $\mathbf{X}$ by Eq.(15);
  Update $\lambda_\mu^*$, $f_{t_s}$, $\lambda_\phi^*$ and $g_{\tau_s}$ by Eq.(20);
  Update hyperparameters.
end
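To illustrate the Toeplitz remark above: with uniformly spaced inducing points the squared-exponential kernel matrix is Toeplitz, so systems of the form $K^{-1}v$ can be solved from the first column alone. A small check, assuming SciPy is available; the variable names are ours.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

S, T = 10, 100.0
inducing = np.linspace(0.0, T, S)        # uniformly spaced inducing points
theta0, theta1 = 1.0, 0.05
first_col = theta0 * np.exp(-theta1 * (inducing - inducing[0]) ** 2)
first_col[0] += 1e-6                     # jitter on the diagonal
v = np.random.default_rng(0).normal(size=S)

x_toeplitz = solve_toeplitz(first_col, v)        # exploits the Toeplitz structure
x_dense = np.linalg.solve(toeplitz(first_col), v)
print(np.allclose(x_toeplitz, x_dense))
```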
In the following, we extend the EM algorithm to a mean-field variational inference (Bishop 2007) algorithm, which solves the inference problem slightly more slowly than EM but provides uncertainty via a distribution estimate rather than a point estimate. In variational inference, the posterior distribution over latent variables is approximated by a variational distribution; the optimal variational distribution is chosen by minimizing the Kullback-Leibler (KL) divergence or, equivalently, maximizing the evidence lower bound (ELBO). A common approach is the mean-field method, where the variational distribution is assumed to factorize over some partition of the latent variables. For the problem at hand, incorporating the priors of $\lambda_\mu^*$, $f$, $\lambda_\phi^*$ and $g$ into Eq.(7) and (8), we obtain the joint distribution over all variables. Without loss of generality, an improper prior $p(\lambda_\cdot^*) = 1/\lambda_\cdot^*$ and a symmetric GP prior $\mathcal{GP}(\cdot \mid 0, K_\cdot)$ (Bishop 2007) are utilized here. We assume the variational distribution $q$ factorizes as

$$q(\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}, \lambda_\mu^*, f, \lambda_\phi^*, g) = q(\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X})\; q(\lambda_\mu^*, f, \lambda_\phi^*, g).$$

The derivation of the variational mean-field approach is shown in Appendix IV; it is similar to that of the EM algorithm. Here we only show the final result. The optimal distribution for each factor maximizing the ELBO is given below.

Optimal Density of Pólya-Gamma Variables

$$q(\boldsymbol{\omega}_{ii}) = \prod_{i=1}^N p_{\mathrm{PG}}(\omega_{ii} \mid 1, \tilde f(t_i)), \qquad q(\boldsymbol{\omega}_{ij}) = \prod_{i=2}^N \prod_{j=1}^{i-1} p_{\mathrm{PG}}(\omega_{ij} \mid 1, \tilde g(\tau_{ij})), \qquad (21)$$

where $\tilde f(t_i) = \sqrt{\mathbb{E}(f^2(t_i))}$ and $\tilde g(\tau_{ij}) = \sqrt{\mathbb{E}(g^2(\tau_{ij}))}$, which can be computed using $\mathbb{E}(C^2) = \mathbb{E}(C)^2 + \mathrm{Var}(C)$.

Optimal Marked Poisson Processes

$$\Lambda_\mu(t, \omega_\mu) = \tilde\lambda_\mu^*\, \sigma(-\tilde f(t))\, p_{\mathrm{PG}}(\omega_\mu \mid 1, \tilde f(t))\, e^{(\tilde f(t) - \bar f(t))/2}, \qquad \Lambda_\phi(\tau, \omega_\phi) = \tilde\lambda_\phi^*\, \sigma(-\tilde g(\tau))\, p_{\mathrm{PG}}(\omega_\phi \mid 1, \tilde g(\tau))\, e^{(\tilde g(\tau) - \bar g(\tau))/2}, \qquad (22)$$

where $\tilde\lambda_\mu^* = e^{\mathbb{E}(\log \lambda_\mu^*)}$, $\bar f(t) = \mathbb{E}(f(t))$, $\tilde\lambda_\phi^* = e^{\mathbb{E}(\log \lambda_\phi^*)}$ and $\bar g(\tau) = \mathbb{E}(g(\tau))$.

Optimal Density of Intensity Upper-bounds

$$q(\lambda_\mu^*) = \mathrm{Gamma}(\lambda_\mu^* \mid \alpha_\mu, \beta_\mu), \qquad q(\lambda_\phi^*) = \mathrm{Gamma}(\lambda_\phi^* \mid \alpha_\phi, \beta_\phi), \qquad (23)$$

where $\alpha_\mu = \sum_{i=1}^N \mathbb{E}(x_{ii}) + \iint \Lambda_\mu(t, \omega_\mu)\, dt\, d\omega_\mu$, $\beta_\mu = T$, $\alpha_\phi = \sum_{i=1}^N \sum_{j=1}^{i-1} \mathbb{E}(x_{ij}) + N \iint \Lambda_\phi(\tau, \omega_\phi)\, d\tau\, d\omega_\phi$, $\beta_\phi = N T_\phi$, and all intractable integrals can be solved by Gaussian quadrature. This provides the expectations required in Eq.(22) via $\mathbb{E}(\lambda^*) = \alpha/\beta$ and $\mathbb{E}(\log \lambda^*) = \psi(\alpha) - \log \beta$, where $\psi(\cdot)$ is the digamma function.

Optimal Sparse Gaussian Process

$$q(f_{t_s}) = \mathcal{N}(f_{t_s} \mid \tilde m_{t_s}, \tilde\Sigma_{t_s}), \qquad q(g_{\tau_s}) = \mathcal{N}(g_{\tau_s} \mid \tilde m_{\tau_s}, \tilde\Sigma_{\tau_s}), \qquad (24)$$

where $\tilde\Sigma_{t_s} = \big[ K_{t_s t_s}^{-1} \int \tilde A_\mu(t)\, \mathbf{k}_{t_s t}\mathbf{k}_{t_s t}^\top dt\, K_{t_s t_s}^{-1} + K_{t_s t_s}^{-1} \big]^{-1}$ and $\tilde m_{t_s} = \tilde\Sigma_{t_s} K_{t_s t_s}^{-1} \int \tilde B_\mu(t)\, \mathbf{k}_{t_s t}\, dt$, with $\tilde A_\mu(t) = \sum_{i=1}^N \mathbb{E}[\omega_{ii}]\mathbb{E}[x_{ii}]\delta(t - t_i) + \int_0^\infty \omega_\mu \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu$ and $\tilde B_\mu(t) = \frac{1}{2}\sum_{i=1}^N \mathbb{E}[x_{ii}]\delta(t - t_i) - \frac{1}{2}\int_0^\infty \Lambda_\mu(t, \omega_\mu)\, d\omega_\mu$; and $\tilde\Sigma_{\tau_s} = \big[ K_{\tau_s \tau_s}^{-1} \int \tilde A_\phi(\tau)\, \mathbf{k}_{\tau_s \tau}\mathbf{k}_{\tau_s \tau}^\top d\tau\, K_{\tau_s \tau_s}^{-1} + K_{\tau_s \tau_s}^{-1} \big]^{-1}$ and $\tilde m_{\tau_s} = \tilde\Sigma_{\tau_s} K_{\tau_s \tau_s}^{-1} \int \tilde B_\phi(\tau)\, \mathbf{k}_{\tau_s \tau}\, d\tau$, with $\tilde A_\phi(\tau) = \sum_{i,j} \mathbb{E}[\omega_{ij}]\mathbb{E}[x_{ij}]\delta(\tau - \tau_{ij}) + N\int_0^\infty \omega_\phi \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi$ and $\tilde B_\phi(\tau) = \frac{1}{2}\sum_{i,j} \mathbb{E}[x_{ij}]\delta(\tau - \tau_{ij}) - \frac{N}{2}\int_0^\infty \Lambda_\phi(\tau, \omega_\phi)\, d\omega_\phi$. All intractable integrals are solved by Gaussian quadrature. Note the similarity to the EM algorithm in Eq.(20).

Optimal Density of Branching Structure

$$q(x_{ii}=1) = \frac{\tilde\mu(t_i)}{\tilde\mu(t_i) + \sum_{j=1}^{i-1}\tilde\phi(\tau_{ij})}, \qquad q(x_{ij}=1) = \frac{\tilde\phi(\tau_{ij})}{\tilde\mu(t_i) + \sum_{j=1}^{i-1}\tilde\phi(\tau_{ij})}, \qquad (25)$$

with $\tilde\mu(t_i) = \tilde\lambda_\mu^*\, e^{\mathbb{E}(\log\sigma(f(t_i)))}$ and $\tilde\phi(\tau_{ij}) = \tilde\lambda_\phi^*\, e^{\mathbb{E}(\log\sigma(g(\tau_{ij})))}$. The $\mathbb{E}(\log\sigma(\cdot))$ terms can be solved by Gaussian quadrature.

Hyperparameters. Similarly, the hyperparameters $\theta_0$ and $\theta_1$ can be optimized by maximizing the ELBO over $\{\theta_0, \theta_1\}$ using numerical packages. The optimization of the number and location of inducing points is the same as for the EM algorithm. The final pseudo code is provided in Alg.2.

Algorithm 2: Mean-field algorithm for SGPHP
Result: $\mu(t) = \lambda_\mu^* \sigma(f(t))$, $\phi(\tau) = \lambda_\phi^* \sigma(g(\tau))$
Initialize hyperparameters and the variational distributions of $\mathbf{X}$, $\lambda_\mu^*$, $\lambda_\phi^*$, $\omega_{ii}$, $\omega_{ij}$, $f_{t_s}$, $g_{\tau_s}$, $\Pi_\mu$, $\{\Pi_{\phi_i}\}_{i=1}^N$;
for each iteration do
  Update $q$ of $\omega_{ii}$ and $\omega_{ij}$ by Eq.(21);
  Update $\Lambda$ of $\Pi_\mu$ and $\{\Pi_{\phi_i}\}$ by Eq.(22);
  Update $q$ of $\lambda_\mu^*$ and $\lambda_\phi^*$ by Eq.(23);
  Update $q$ of $f_{t_s}$ and $g_{\tau_s}$ by Eq.(24);
  Update $q$ of $\mathbf{X}$ by Eq.(25);
  Update hyperparameters.
end
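Two of the moment computations used in the updates above are simple enough to state directly; the helper names below are ours, and SciPy is assumed only for the digamma function.

```python
import numpy as np
from scipy.special import digamma

def gamma_moments(alpha, beta):
    """Moments of q(lambda*) = Gamma(alpha, beta) used in the mean-field updates:
    E[lambda*] = alpha / beta and E[log lambda*] = psi(alpha) - log(beta)."""
    return alpha / beta, digamma(alpha) - np.log(beta)

def tilde_f(mean_f, var_f):
    """The ~f(t) = sqrt(E[f(t)^2]) entering the Polya-Gamma factors of Eq. (21),
    computed via E[f]^2 + Var[f]."""
    return np.sqrt(mean_f ** 2 + var_f)

# Example: a posterior Gamma(12, 100) for the baseline upper bound lambda*_mu.
print(gamma_moments(12.0, 100.0))
print(tilde_f(np.array([0.3, -1.2]), np.array([0.5, 0.1])))
```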
Experiments

We evaluate the performance of the proposed EM and mean-field (MF) algorithms on both fictitious and real-world data. Specifically, we compare the proposed algorithms to the following alternatives. 1. Maximum Likelihood Estimation (MLE): the vanilla Hawkes process with constant $\mu$ and an exponential-decay triggering kernel. 2. Wiener-Hopf (WH): a nonparametric algorithm for the Hawkes process where $\mu$ is constant and $\phi(\tau)$ is a nonparametric function (Bacry and Muzy 2016). 3. Majorization Minimization Euler-Lagrange (MMEL): another nonparametric algorithm for the Hawkes process with constant $\mu$ and smooth $\phi(\tau)$, which similarly utilizes the branching structure and estimates $\phi(\tau)$ via an Euler-Lagrange equation (Zhou, Zha, and Song 2013). We also tried to compare with the long short-term memory (LSTM) based neural Hawkes process (Mei and Eisner 2017) but found it hard to converge, at least on our data; in contrast, the proposed algorithms converge more easily because there are fewer parameters to tune, which constitutes another advantage.

We use the following metrics to evaluate the performance of the various methods. TestLL: the log-likelihood of held-out data under the trained model, a metric describing predictive ability. EstErr: the mean squared error between the estimated $\hat\mu(t)$, $\hat\phi(\tau)$ and the ground truth; it is only used for fictitious data. RunTime: the running time of the various methods w.r.t. the number of training data. Q-Q plot: a plot visualizing goodness-of-fit for different models using the time-rescaling theorem (Papangelou 1972).

Table 1: EstErr and TestLL for the fictitious and real datasets (methods: MLE, WH, MMEL, EM, MF; rows: EstErr$(\hat\mu,\mu)$, EstErr$(\hat\phi,\phi)$ and TestLL for cases 1 and 2, and TestLL for the collision and crime datasets).

Fictitious Data. In the fictitious data experiments, we use the thinning algorithm (Ogata 1998) to generate 100 sets of training data and 10 sets of hold-out data with $T_\phi = 6$ and $T = 100$ in two cases: $\mu(t)$ constant and $\mu(t)$ changing over time. Case 1: $\mu(t) = 1$ and
$$\phi(\tau) = \begin{cases} 0.33\sin\tau & 0 < \tau \le \pi, \\ 0 & \pi < \tau < T_\phi. \end{cases}$$
Case 2: $\mu(t) = \sin\big(\tfrac{\pi}{T}\, t\big) + 1$ for $0 < t < T$, and $\phi(\tau)$ is a sinusoid-modulated exponential-decay kernel of the form $c_1\big(\sin(\pi\tau) + 1\big)\exp(-c_2\tau)$ for $0 < \tau < T_\phi$.

The inducing points and hyperparameters are optimized during inference. The estimated $\hat\mu(t)$ and $\hat\phi(\tau)$ are shown in Fig.1a and 1b. From the accuracy perspective, the results in Tab.1 confirm that the EM and MF algorithms outperform the other alternatives in all cases w.r.t. TestLL and EstErr (posterior mean for MF). For MLE, the estimates are far from the ground truth because of the parametric constraint. For WH and MMEL, the constant restriction on $\mu(t)$ remains even though the restriction on $\phi(\tau)$ has been relaxed. In contrast, the EM and MF algorithms provide nonparametric $\hat\mu(t)$ and $\hat\phi(\tau)$ simultaneously. From the efficiency perspective, the EM and MF algorithms are compared with the other iterative nonparametric algorithm, MMEL, using the same number of iterations: their RunTime scales linearly with the number of observations in Fig.1c, which demonstrates their superior efficiency.

Figure 1: Fictitious data experimental results. (a) The estimated $\hat\mu(t)$ and $\hat\phi(\tau)$ in case 1 (shading: one standard deviation for MF); (b) the same for case 2; both EM and MF capture the structure of the underlying rates better than the alternatives. (c) Running time (seconds) of the iterative nonparametric algorithms for varying numbers of observations; EM and MF scale linearly with the observations and are more efficient than MMEL. (GT = Ground Truth)
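The fictitious data are generated by thinning. Below is a minimal sketch (ours) of an Ogata-style thinning simulator; it uses a fixed global upper bound `lam_bound` instead of the adaptive bound of Ogata (1998), which is a simplification, and the bound must dominate $\lambda(t)$ throughout the run.

```python
import numpy as np

def simulate_hawkes_thinning(mu, phi, T, t_phi, lam_bound, seed=0):
    """Thinning: propose candidates from a homogeneous process with rate lam_bound and
    accept each with probability lambda(t) / lam_bound."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_bound)
        if t > T:
            break
        hist = np.array([s for s in events if 0.0 < t - s < t_phi])
        lam = mu(t) + (phi(t - hist).sum() if hist.size else 0.0)
        if rng.uniform() < lam / lam_bound:
            events.append(t)
    return np.array(events)

# Case-1-style toy run: constant baseline and a half-sine kernel on (0, pi].
mu = lambda t: 1.0
phi = lambda tau: np.where(tau <= np.pi, 0.33 * np.sin(tau), 0.0)
print(len(simulate_hawkes_thinning(mu, phi, T=100.0, t_phi=6.0, lam_bound=10.0)))
```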
Real-World Data. We apply the EM and MF algorithms to two different datasets. Motor Vehicle Collisions in New York City: a vehicle collision dataset from the New York City Police Department. Crime in Vancouver: a dataset of crimes in Vancouver from the Vancouver Open Data Catalogue. More experimental details (e.g. dataset introduction, data pre-processing, hyperparameter selection) are provided in Appendix V.

We compare the performance of the proposed algorithms with the other alternatives. For each inference algorithm, we evaluate its predictive performance using TestLL; the TestLL of EM, MF and the other alternatives is shown in Tab.1. We observe that the EM and MF algorithms are consistently superior to the other alternatives, whose baseline intensity or triggering kernel is too restricted to capture the dynamics. To further measure performance, we generate the Q-Q plot: we transform a sequence of timestamps in the hold-out data by the fitted model into a set of independent uniform random variables on the interval $(0,1)$. The result is shown in Appendix V. All experimental results show that the proposed algorithms not only describe $\mu(t)$ and $\phi(\tau)$ in a completely flexible manner, leading to better goodness-of-fit, but also do so with superior efficiency.

Conclusions

In this paper, scalable EM and mean-field variational inference algorithms are proposed for the sigmoid Gaussian Hawkes process. By augmenting the branching structure, Pólya-Gamma random variables and latent marked Poisson processes, inference can be performed efficiently in a conjugate way. Furthermore, by introducing the sparse GP approximation, the proposed algorithms scale linearly with the number of observations. The fictitious and real data experimental results confirm that the accuracy and efficiency of the proposed algorithms are superior to other state-of-the-art alternatives. Future work includes the extension to multivariate or multidimensional Hawkes processes.

References

[Adams, Murray, and MacKay 2009] Adams, R. P.; Murray, I.; and MacKay, D. J. 2009. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, 9-16. ACM.
[Bacry and Muzy 2016] Bacry, E., and Muzy, J.-F. 2016. First- and second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory.
[Bishop 2007] Bishop, C. M. 2007. Pattern Recognition and Machine Learning. Springer, New York.
[Cunningham, Shenoy, and Sahani 2008] Cunningham, J. P.; Shenoy, K. V.; and Sahani, M. 2008. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, 192-199. ACM.
[Daley and Vere-Jones 2003] Daley, D. J., and Vere-Jones, D. 2003. An Introduction to the Theory of Point Processes. Vol. I. Probability and its Applications.
[Donner and Opper 2018] Donner, C., and Opper, M. 2018. Efficient Bayesian inference for a Gaussian process density model. arXiv preprint arXiv:1805.11494.
[Donner 2019] Donner, C. 2019. Bayesian inference of inhomogeneous point process models.
[Du et al. 2016] Du, N.; Dai, H.; Trivedi, R.; Upadhyay, U.; Gomez-Rodriguez, M.; and Song, L. 2016. Recurrent marked temporal point processes: embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1555-1564. ACM.
[Eichler, Dahlhaus, and Dueck 2017] Eichler, M.; Dahlhaus, R.; and Dueck, J. 2017. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis.
[Kingman 2005] Kingman, J. F. C. 2005. Poisson processes. Encyclopedia of Biostatistics.
[Lewis and Mohler 2011] Lewis, E., and Mohler, G. 2011. A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics.
[Lloyd et al. 2015] Lloyd, C.; Gunter, T.; Osborne, M.; and Roberts, S. 2015. Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, 1814-1822.
[Marsan and Lengline 2008] Marsan, D., and Lengline, O. 2008. Extending earthquakes' reach through cascading. Science.
[Mei and Eisner 2017] Mei, H., and Eisner, J. 2017. The neural Hawkes process: a neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, 6754-6764.
[Møller, Syversveen, and Waagepetersen 1998] Møller, J.; Syversveen, A. R.; and Waagepetersen, R. P. 1998. Log Gaussian Cox processes. Scandinavian Journal of Statistics.
[Ogata 1998] Ogata, Y. 1998. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics.
[Papangelou 1972] Papangelou, F. 1972. Integrability of expected increments of point processes and a related random change of scale. Transactions of the American Mathematical Society.
[Pinto, Chahed, and Altman 2015] Pinto, J. C. L.; Chahed, T.; and Altman, E. 2015. Trend detection in social networks using Hawkes processes. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 1441-1448. ACM.
[Polson, Scott, and Windle 2013] Polson, N. G.; Scott, J. G.; and Windle, J. 2013. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association.
[Reynaud-Bouret and Schbath 2010] Reynaud-Bouret, P.; Schbath, S.; et al. 2010. Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics.
[Titsias 2009] Titsias, M. 2009. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, 567-574.
[Zhou et al. 2018] Zhou, F.; Li, Z.; Fan, X.; Wang, Y.; Sowmya, A.; and Chen, F. 2018. A refined MISD algorithm based on Gaussian process regression. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 584-596. Springer.
[Zhou, Zha, and Song 2013] Zhou, K.; Zha, H.; and Song, L. 2013. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, 1301-1309.
[Zhou 2019] Zhou, F. 2019. Efficient EM-variational inference for Hawkes process. arXiv preprint arXiv:1905.12251.
Appendices

I Proof of Transformation of Sigmoid Function

Polson, Scott, and Windle (2013) found that the inverse hyperbolic cosine can be expressed as an infinite mixture of Gaussian densities:

$$\cosh^{-b}(z/2) = \int_0^\infty e^{-z^2\omega/2}\, p_{\mathrm{PG}}(\omega \mid b, 0)\, d\omega, \qquad (1)$$

where $p_{\mathrm{PG}}(\omega \mid b, 0)$ is the Pólya-Gamma distribution with $\omega \in \mathbb{R}^+$. As a result, the sigmoid function admits the Gaussian representation

$$\sigma(z) = \frac{e^{z/2}}{2\cosh(z/2)} = \int_0^\infty e^{h(\omega, z)}\, p_{\mathrm{PG}}(\omega \mid 1, 0)\, d\omega, \qquad (2)$$

where $h(\omega, z) = z/2 - z^2\omega/2 - \log 2$. This proves Eq.(4) in the main paper.
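The identity in Eq.(2) can be checked numerically by drawing $\omega$ from $\mathrm{PG}(1,0)$ via its truncated-series construction and averaging $e^{h(\omega, z)}$. The sketch below is ours and uses only NumPy; the truncation level `K` controls the (small) bias of the series approximation.

```python
import numpy as np

def sample_pg1(rng, n, K=200):
    """Approximate PG(1, 0) draws via the truncated series
    omega = (1 / (2 pi^2)) * sum_k g_k / (k - 1/2)^2, g_k ~ Exp(1)
    (Polson, Scott & Windle 2013), truncated at K terms."""
    k = np.arange(1, K + 1)
    g = rng.exponential(size=(n, K))
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

def sigmoid_via_pg(z, n=200_000, seed=0):
    """Monte Carlo estimate of sigma(z) = E_omega[exp(h(omega, z))] with
    h(omega, z) = z/2 - z^2 omega / 2 - log 2 and omega ~ PG(1, 0)."""
    rng = np.random.default_rng(seed)
    omega = sample_pg1(rng, n)
    return np.mean(np.exp(z / 2.0 - z ** 2 * omega / 2.0 - np.log(2.0)))

for z in (-2.0, 0.5, 3.0):
    print(z, sigmoid_via_pg(z), 1.0 / (1.0 + np.exp(-z)))
```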
II Campbell's Theorem

Let $\Pi_{\hat Z} = \{(\mathbf{z}_n, \omega_n)\}_{n=1}^N$ be a marked Poisson process on the product space $\hat Z = Z \times \Omega$ with intensity $\Lambda(\mathbf{z}, \omega) = \Lambda(\mathbf{z})\, p(\omega \mid \mathbf{z})$, where $\Lambda(\mathbf{z})$ is the intensity of the unmarked Poisson process $\{\mathbf{z}_n\}_{n=1}^N$ and $\omega_n \sim p(\omega_n \mid \mathbf{z}_n)$ is an independent mark drawn at each $\mathbf{z}_n$. Furthermore, define a function $h(\mathbf{z}, \omega): Z \times \Omega \to \mathbb{R}$ and the sum $H(\Pi_{\hat Z}) = \sum_{(\mathbf{z}, \omega)\in\Pi_{\hat Z}} h(\mathbf{z}, \omega)$. If $\Lambda(\mathbf{z}, \omega) < \infty$, then

$$\mathbb{E}_{\Pi_{\hat Z}}\big[ \exp\big( \xi H(\Pi_{\hat Z}) \big) \big] = \exp\Big[ \int_{\hat Z} \big( e^{\xi h(\mathbf{z}, \omega)} - 1 \big) \Lambda(\mathbf{z}, \omega)\, d\omega\, d\mathbf{z} \Big]$$

for any $\xi \in \mathbb{C}$. The above equation defines the characteristic functional of a marked Poisson process and proves Eq.(6) in the main paper. The mean and variance are

$$\mathbb{E}_{\Pi_{\hat Z}}\big[ H(\Pi_{\hat Z}) \big] = \int_{\hat Z} h(\mathbf{z}, \omega)\, \Lambda(\mathbf{z}, \omega)\, d\omega\, d\mathbf{z}, \qquad \mathrm{Var}_{\Pi_{\hat Z}}\big[ H(\Pi_{\hat Z}) \big] = \int_{\hat Z} [h(\mathbf{z}, \omega)]^2\, \Lambda(\mathbf{z}, \omega)\, d\omega\, d\mathbf{z}.$$

III Proof of Augmented Likelihood

Substituting Eq.(4) and (6) into Eq.(3) in the main paper, we obtain the augmented joint likelihood of the baseline intensity part:

$$p(D, \mathbf{X}_{ii} \mid \lambda_\mu^*, f) = \prod_{i=1}^N \big( \lambda_\mu^* \sigma(f(t_i)) \big)^{x_{ii}} \exp\Big( -\int_0^T \lambda_\mu^* \sigma(f(t))\, dt \Big)$$
$$= \prod_{i=1}^N \Big( \int_0^\infty \lambda_\mu^*\, e^{h(\omega_{ii}, f(t_i))}\, p_{\mathrm{PG}}(\omega_{ii} \mid 1, 0)\, d\omega_{ii} \Big)^{x_{ii}} \cdot \mathbb{E}_{p_{\lambda_\mu}}\Big[ \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f(t))} \Big]$$
$$= \iint \prod_{i=1}^N \big( \lambda_\mu(t_i, \omega_{ii})\, e^{h(\omega_{ii}, f(t_i))} \big)^{x_{ii}} \cdot p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^*) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f(t))}\, d\boldsymbol{\omega}_{ii}\, d\Pi_\mu,$$

with $\boldsymbol{\omega}_{ii}$ denoting the vector of the $\omega_{ii}$ and $\lambda_\mu(t_i, \omega_{ii}) = \lambda_\mu^*\, p_{\mathrm{PG}}(\omega_{ii} \mid 1, 0)$. Therefore, the augmented joint likelihood is

$$p(D, \Pi_\mu, \boldsymbol{\omega}_{ii}, \mathbf{X}_{ii} \mid \lambda_\mu^*, f) = \prod_{i=1}^N \big( \lambda_\mu(t_i, \omega_{ii})\, e^{h(\omega_{ii}, f(t_i))} \big)^{x_{ii}} \cdot p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^*) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f(t))}.$$

This proves Eq.(7) in the main paper. The proof of Eq.(8) in the main paper is the same and is omitted here.

IV Derivation of Mean-field Approach

A standard derivation in the variational mean-field approach shows that the optimal distribution for each factor maximizing the ELBO is given by

$$\log q_1(\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}) = \mathbb{E}_{q_2}\big[ \log p(\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}, \lambda_\mu^*, f, \lambda_\phi^*, g) \big] + C_1,$$
$$\log q_2(\lambda_\mu^*, f, \lambda_\phi^*, g) = \mathbb{E}_{q_1}\big[ \log p(\Pi_\mu, \boldsymbol{\omega}_{ii}, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}, \lambda_\mu^*, f, \lambda_\phi^*, g) \big] + C_2, \qquad (3)$$

where each expectation is taken under the other factor. For our problem, incorporating the priors of $\lambda_\mu^*$, $f$, $\lambda_\phi^*$ and $g$ into Eq.(7) and (8) in the main paper, we obtain the joint distribution over all variables. Without loss of generality, an improper prior $p(\lambda_\cdot^*) = 1/\lambda_\cdot^*$ and a symmetric GP prior $\mathcal{GP}(\cdot \mid 0, K_\cdot)$ (Bishop 2007) are utilized here. The final augmented joint distribution of the baseline intensity part is

$$p(D, \Pi_\mu, \boldsymbol{\omega}_{ii}, \mathbf{X}_{ii}, \lambda_\mu^*, f) = \prod_{i=1}^N \big( \lambda_\mu(t_i, \omega_{ii})\, e^{h(\omega_{ii}, f(t_i))} \big)^{x_{ii}} \cdot p_{\lambda_\mu}(\Pi_\mu \mid \lambda_\mu^*) \prod_{(\omega_\mu, t)\in\Pi_\mu} e^{h(\omega_\mu, -f(t))} \cdot \lambda_\mu^{*\,-1}\, \mathcal{GP}(f); \qquad (4)$$

and that of the triggering kernel part is

$$p(D, \{\Pi_{\phi_i}\}_{i=1}^N, \boldsymbol{\omega}_{ij}, \mathbf{X}_{ij}, \lambda_\phi^*, g) = \prod_{i=2}^N \prod_{j=1}^{i-1} \big( \lambda_\phi(\tau_{ij}, \omega_{ij})\, e^{h(\omega_{ij}, g(\tau_{ij}))} \big)^{x_{ij}} \cdot \prod_{i=1}^N p_{\lambda_\phi}(\Pi_{\phi_i} \mid \lambda_\phi^*) \prod_{(\omega_\phi, \tau)\in\Pi_{\phi_i}} e^{h(\omega_\phi, -g(\tau))} \cdot \lambda_\phi^{*\,-1}\, \mathcal{GP}(g). \qquad (5)$$

Substituting Eq.(4) and (5) into Eq.(3), we obtain the optimal distribution for each factor (Eq.(21)-(25) in the main paper). It is worth noting that $\omega_{ii}$ and $\omega_{ij}$ are coupled with the branching structure $\mathbf{X}$: we marginalize out $\boldsymbol\omega$ when solving $q(\mathbf{X})$ and vice versa.
V Experimental Details

In this appendix, we elaborate on the details of data generation, processing, hyperparameter setup and some experimental results.

Fictitious Data Experimental Details. The first case has a constant $\mu(t)$ and a half-sinusoidal triggering kernel $\phi(\tau)$. The bandwidth of WH is set to 0.3 and there are 10 inducing points on both $\mu(t)$ and $\phi(\tau)$ for EM and MF. The learned results in Fig.1 and Tab.1 show that EM and MF can learn the correct triggering kernel in non-monotonically-decreasing cases. The second case is a general case with a time-changing $\mu(t)$ and a sinusoidal exponential-decay triggering kernel. The bandwidth of WH is 0.1 and there are 10 inducing points on both $\mu(t)$ and $\phi(\tau)$ for EM and MF. The learned results in Fig.1 and Tab.1 show that the proposed algorithms again perform best.

Vehicle Collision Dataset. In daily transportation, car collisions that happened in the past have a triggering influence on future ones because of the traffic congestion caused by the initial accident, so a self-exciting phenomenon exists in this kind of application. The vehicle collision dataset is from the New York City Police Department. We filter out weekday records over nearly one month (Sep. 18th to Oct. 13th, 2017); the number of collisions per day is about 600. Records from Sep. 18th to Oct. 6th are used as training data and Oct. 9th to 13th are held out as test data. We compare the performance of EM, MF and the other alternatives. The whole observation period $T$ is set to 1440 minutes (24 hours) and the support of the triggering kernel $T_\phi$ is set to 60 minutes. For hyperparameters, the bandwidth of WH is set to 0.3 and there are 10 inducing points on $\mu(t)$ and $\phi(\tau)$ for EM and MF, with hyperparameters $\{\theta_0, \theta_1\}$ optimized during inference. The final result is the average of the learned $\mu(t)$ or $\phi(\tau)$ over the days.

Crime in Vancouver. In the crime domain, past crimes have a triggering effect on future ones, which has been reported in much of the literature. The data of crimes in Vancouver comes from the Vancouver Open Data Catalogue. It was extracted on 2017-07-18 and includes all valid felony, misdemeanour and violation crimes from 2003-01-01 to 2017-07-13. The columns are crime type, year, month, day, hour, minute, block, neighbourhood, latitude, longitude, etc. We filter out the crime records from 2016-06-01 to 2016-08-31 as training data and 2016-09-01 to 2016-11-30 as test data, add a small time interval to separate simultaneous records, and delete records with empty values. The whole observation period $T$ is set to 92 days and the support of $\phi(\tau)$ is set to 6 days. For hyperparameters, the bandwidth of WH is set to 0.5 and there are again 10 inducing points on $\mu(t)$ and $\phi(\tau)$ for EM and MF, with hyperparameters $\{\theta_0, \theta_1\}$ optimized during inference.

Q-Q plot. With the baseline intensity and triggering kernel estimated from the training data, we can compute the rescaled timestamps of the test data: $\tau_i = \Lambda(t_i) - \Lambda(t_{i-1})$, where $\Lambda(t_i) = \int_0^{t_i} \lambda(t \mid \mathcal{H}_t)\, dt$ and $\mathcal{H}_t$ is the history before $t$. According to the time-rescaling theorem, the $\{\tau_i\}$ are independent exponential random variables with mean 1 if the model is correct. With the further transformation $z_i = 1 - \exp(-\tau_i)$, the $\{z_i\}$ are independent uniform random variables on the interval $(0, 1)$. Any statistical assessment that measures agreement between the transformed data and a uniform distribution evaluates how well the fitted model agrees with the test data, so we can draw the Q-Q plot of the transformed timestamps with respect to the uniform distribution. The plots are shown in Fig. 2; a perfect model follows the straight line $y = x$.

Figure 2: Q-Q plot of EM, MF and other alternative inference algorithms for the real data.
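A minimal sketch (ours) of the time-rescaling transform used for the Q-Q plots: it integrates the conditional intensity numerically between consecutive test events and maps the resulting $\tau_i$ to $z_i = 1 - \exp(-\tau_i)$. The ground-truth $\mu$ and $\phi$ of case 1 are used here only as an example of a fitted model.

```python
import numpy as np

def rescaled_uniforms(events, mu, phi, t_phi, n_grid=2000):
    """tau_i = Lambda(t_i) - Lambda(t_{i-1}) with Lambda evaluated by the trapezoidal
    rule, then z_i = 1 - exp(-tau_i); z should be Uniform(0, 1) under a correct model."""
    taus, prev = [], 0.0
    for i, t_i in enumerate(events):
        grid = np.linspace(prev, t_i, n_grid)
        lam = np.array([
            mu(s) + phi(s - events[:i][(events[:i] < s) & (s - events[:i] < t_phi)]).sum()
            for s in grid
        ])
        taus.append(np.sum(0.5 * (lam[1:] + lam[:-1]) * np.diff(grid)))
        prev = t_i
    z = 1.0 - np.exp(-np.array(taus))
    return np.sort(z)   # plot against the uniform quantiles (i - 0.5)/N for the Q-Q plot

mu = lambda t: 1.0
phi = lambda tau: np.where(tau <= np.pi, 0.33 * np.sin(tau), 0.0)
events = np.array([0.4, 1.0, 1.3, 2.8, 4.1])
print(rescaled_uniforms(events, mu, phi, t_phi=6.0))
```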