A Poisson process reparameterisation for Bayesian inference for extremes
Paul Sharkey · Jonathan A. Tawn
STOR-i Centre for Doctoral Training, Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, United Kingdom. E-mail: [email protected]
Abstract
A common approach to modelling extreme values is to consider the excesses above a high threshold as realisations of a non-homogeneous Poisson process. While this method offers the advantage of modelling using threshold-invariant extreme value parameters, the dependence between these parameters makes estimation more difficult. We present a novel approach for Bayesian estimation of the Poisson process model parameters by reparameterising in terms of a tuning parameter $m$. This paper presents a method for choosing the optimal value of $m$ that near-orthogonalises the parameters, which is achieved by minimising the correlation between the asymptotic posterior distributions of the parameters. This choice of $m$ ensures more rapid convergence and efficient sampling from the joint posterior distribution using Markov Chain Monte Carlo methods. Samples from the parameterisation of interest are then obtained by a simple transform. Results are presented in the cases of identically and non-identically distributed models for extreme rainfall in Cumbria, UK.
Keywords: Poisson processes · extreme value theory · Bayesian inference · reparameterisation · covariate modelling

1 Introduction

The aim of extreme value analysis is to model rare occurrences of an observed process to extrapolate to give estimates of the probabilities of unobserved levels. In this way, one can make predictions of future extreme behaviour by estimating the behaviour of the process using an asymptotically justified limit model. Let $X_1, X_2, \ldots, X_n$ be a series of independent and identically distributed (iid) random variables with common distribution function $F$. Defining $M_n = \max\{X_1, X_2, \ldots, X_n\}$, if there exist sequences of normalising constants $a_n > 0$ and $b_n$ such that

$$\Pr\left\{\frac{M_n - b_n}{a_n} \le x\right\} \to G(x) \quad \text{as } n \to \infty, \tag{1}$$

where $G$ is non-degenerate, then $G$ follows a generalised extreme value (GEV) distribution, with distribution function

$$G(x) = \exp\left\{-\left[1 + \xi\left(\frac{x - \mu}{\sigma}\right)\right]_+^{-1/\xi}\right\}, \tag{2}$$

where $y_+ = \max(y, 0)$, $\sigma > 0$ and $\mu, \xi \in \mathbb{R}$. Here, $\mu$, $\sigma$ and $\xi$ are location, scale and shape parameters respectively.

Using a series of block maxima from $X_1, \ldots, X_n$, typically with blocks corresponding to years, the standard inference approach to give estimates of $(\mu, \sigma, \xi)$ is the maximum likelihood technique, which requires numerical optimisation methods. In these problems, particularly when covariates are involved, such methods may converge to local optima, with the consequence that parameter estimates are largely influenced by the choice of starting values. The standard asymptotic properties of the maximum likelihood estimators are subject to certain regularity conditions outlined in Smith (1985), but can give a poor representation of true uncertainty. In addition, flat likelihood surfaces can cause identifiability issues (Smith, 1987a). For these reasons, we choose to work in a Bayesian setting. Bayesian approaches have been used to make inferences about $\theta = (\mu, \sigma, \xi)$ using standard Markov Chain Monte Carlo (MCMC) techniques. They have the advantage of being able to incorporate prior information when little is known about the extremes of interest, while also better accounting for parameter uncertainty when estimating functions of $\theta$, such as return levels (Coles and Tawn, 1996). For a recent review, see Stephenson (2016).

An approach to inference that is considered to be more efficient than using block maxima is to consider a model for threshold excesses, which is superior in the sense that it reduces uncertainty by utilising more extreme data (Smith, 1987b). Given a high threshold $u$, the conditional distribution of excesses above $u$ can be approximated by a generalised Pareto (GP) distribution (Pickands, 1975) such that

$$\Pr(X - u > x \mid X > u) = \left(1 + \frac{\xi x}{\psi_u}\right)_+^{-1/\xi}, \quad x > 0,$$

where $\psi_u > 0$ and $\xi \in \mathbb{R}$ denote the scale and shape parameters respectively, with $\psi_u$ dependent on the threshold $u$, while $\xi$ is identical to the shape parameter of the GEV distribution. This model conditions on an exceedance, but a third parameter $\lambda_u$, denoting the rate of exceedance of $X$ above the threshold $u$, must also be estimated. Both of these extreme value approaches are special cases of a unifying limiting Poisson process characterisation of extremes (Smith, 1989; Coles, 2001).
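As a concrete illustration of these two limiting forms, the following minimal sketch evaluates the GEV distribution function (2) and the GP exceedance probability. It is illustrative only: the function names are our own, and the $\xi \to 0$ cases are handled by their exponential limits.

```python
import numpy as np

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function G(x) from (2); Gumbel limit used for xi near 0."""
    if abs(xi) < 1e-9:
        t = np.exp(-(x - mu) / sigma)
    else:
        # [.]_+ convention: values beyond a support endpoint map to G = 0 or 1
        t = np.maximum(1.0 + xi * (x - mu) / sigma, 0.0) ** (-1.0 / xi)
    return np.exp(-t)

def gp_survival(x, psi_u, xi):
    """Pr(X - u > x | X > u) for threshold excesses x > 0."""
    if abs(xi) < 1e-9:
        return np.exp(-x / psi_u)
    return np.maximum(1.0 + xi * x / psi_u, 0.0) ** (-1.0 / xi)
```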
Let $P_n$ be a sequence of point processes such that

$$P_n = \left\{\left(\frac{i}{n+1}, \frac{X_i - b_n}{a_n}\right) : i = 1, \ldots, n\right\},$$

where $a_n > 0$ and $b_n$ are the normalising constants in limit (1). The limit process is non-degenerate since the limit distribution of $(M_n - b_n)/a_n$ is non-degenerate. Small points are normalised to the same value $b_L = \lim_{n \to \infty}(x_L - b_n)/a_n$, where $x_L$ is the lower endpoint of the distribution $F$. Large points are retained in the limit process. It follows that $P_n$ converges to a non-homogeneous Poisson process $P$ on regions of the form $A_y = (0, 1) \times [y, \infty)$, for $y > b_L$. The limit process $P$ has an intensity measure on $A_y$ given by

$$\Lambda(A_y) = \left[1 + \xi\left(\frac{y - \mu}{\sigma}\right)\right]_+^{-1/\xi}. \tag{3}$$

It is typical to assume that the limit process is a reasonable approximation to the behaviour of $P_n$, without normalisation of the $\{X_i\}$, on $A_u = (0, 1) \times [u, \infty)$, where $u$ is a sufficiently high threshold and $a_n$, $b_n$ are absorbed into the location and scale parameters of the intensity (3). It is often convenient to rescale the intensity by a factor $m$, where $m > 0$, so that the $n$ observations consist of $m$ blocks of size $n/m$, with the maximum $M_m$ of each block following a GEV$(\mu_m, \sigma_m, \xi)$ distribution, with $\xi$ invariant to the choice of $m$. The Poisson process likelihood can be expressed as

$$L(\theta_m) = \exp\left\{-m\left[1 + \xi\left(\frac{u - \mu_m}{\sigma_m}\right)\right]_+^{-1/\xi}\right\} \prod_{j=1}^{r}\frac{1}{\sigma_m}\left[1 + \xi\left(\frac{x_j - \mu_m}{\sigma_m}\right)\right]_+^{-1/\xi - 1}, \tag{4}$$

where $\theta_m = (\mu_m, \sigma_m, \xi)$ denotes the rescaled parameters, $r$ denotes the number of excesses above the threshold $u$ and $x_j > u$, $j = 1, \ldots, r$, denote the exceedances. It is possible to move between parameterisations associated with different numbers of blocks. If for $k$ blocks the block maximum is denoted by $M_k$ and follows a GEV distribution with the parameters $\theta_k = (\mu_k, \sigma_k, \xi)$, then for all $x$, $\Pr(M_k < x) = \Pr(M_m < x)^{m/k}$. As $M_k$ is GEV$(\mu_k, \sigma_k, \xi)$ and $M_m$ is GEV$(\mu_m, \sigma_m, \xi)$, it follows that

$$\mu_k = \mu_m - \frac{\sigma_m}{\xi}\left(1 - \left(\frac{k}{m}\right)^{-\xi}\right), \qquad \sigma_k = \sigma_m\left(\frac{k}{m}\right)^{-\xi}. \tag{5}$$

In this paper, we present a method to improve inference for $\theta_k$, the parameterisation of interest. For an 'optimal' choice of $m$ we first undertake inference for $\theta_m$ before transforming our results to give inference for $\theta_k$ using the mapping in expression (5).
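The mapping (5) is the workhorse of the method: posterior samples drawn under one block scale $m$ are transformed to any other scale $k$. A minimal sketch, with the function name our own and the $\xi = 0$ limit ($\mu_k = \mu_m - \sigma_m\log(k/m)$, $\sigma_k = \sigma_m$) included:

```python
import numpy as np

def theta_k_from_theta_m(mu_m, sigma_m, xi, k, m):
    """Map GEV parameters between block counts via (5)."""
    if abs(xi) < 1e-9:
        return mu_m - sigma_m * np.log(k / m), sigma_m, xi
    ratio = (k / m) ** (-xi)
    mu_k = mu_m - (sigma_m / xi) * (1.0 - ratio)
    sigma_k = sigma_m * ratio
    return mu_k, sigma_k, xi
```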
In many practical problems, $k$ is taken to be $n_y$, the number of years of observation, so that the annual maximum has a GEV distribution with parameters $\theta_{n_y} = (\mu_{n_y}, \sigma_{n_y}, \xi)$. Although inference is for the annual maximum distribution parameters $\theta_{n_y}$, the Poisson process model makes use of all data that are extreme, so inferences are more precise than estimates based on a direct fit of the GEV distribution to the annual maximum data, as noted above.

To help see how the choice of $m$ affects inference, consider the case when $m = r$, the number of excesses above the threshold $u$. If likelihood inference were being used with this choice of $m$, the maximum likelihood estimators would satisfy $(\hat{\mu}_r, \hat{\sigma}_r, \hat{\xi}) = (u, \hat{\psi}_u, \hat{\xi})$; see Appendix A for more details. Therefore, Bayesian inference for the parameterisation of the Poisson process model when $m = r$ is equivalent to Bayesian inference for the GP model.

Although inference for the Poisson process and GP models is essentially the same approach when $m = r$, they differ in parameterisation, and hence inference, when $m \neq r$. The GP model is advantageous in that $\lambda_u$ is globally orthogonal to $\psi_u$ and $\xi$. Chavez-Demoulin and Davison (2005) achieved local orthogonalisation of the GP model at the maximum likelihood estimates by reparameterising the scale parameter as $\nu_u = \psi_u(1 + \xi)$. This ensures all the GP tail model parameters are orthogonal locally at the likelihood mode. However, the scale parameter is still dependent on the choice of threshold. Unlike the GP, the parameters of the Poisson process model are invariant to the choice of threshold, which makes it more suitable for covariate modelling and hence suggests that it may be the better parameterisation to use. In contrast, it has been found that the parameters are highly dependent, making estimation more difficult.

As we are working in the Bayesian framework, strongly dependent parameters lead to poor mixing in our MCMC procedure (Hills and Smith, 1992). A common way of overcoming this is to explore the parameter space using a dependent proposal random walk Metropolis-Hastings algorithm, though this requires a knowledge of the parameter dependence structure a priori. Even in this case, the dependence structure potentially varies in different regions of the parameter space, which may require different parameterisations of the proposal to be applied. The alternative approach is to consider a reparameterisation to give orthogonal parameters. However, Cox and Reid (1987) show that global orthogonalisation cannot be achieved in general.

This paper illustrates an approach to improving Bayesian inference and efficiency for the Poisson process model. Our method exploits the scaling factor $m$ as a means of creating a near-orthogonal representation of the parameter space. While it is not possible in our case to find a value of $m$ that diagonalises the Fisher information matrix, we focus on minimising the off-diagonal components of the covariance matrix. We present a method for choosing the 'best' value of $m$ such that near-orthogonality of the model parameters is achieved, which improves the convergence of the MCMC and the sampling from the joint posterior distribution. Our focus is on Bayesian inference, but the reparameterisations we find can be used to improve likelihood inference as well, simply by ignoring the prior term.

The structure of the paper is as follows.
Section 2 examines the idea of reparameterising in terms of the scaling factor $m$ and how this can be implemented in a Bayesian framework. Section 3 discusses the choice of $m$ to optimise the sampling from the joint posterior distribution in the case where $X_1, \ldots, X_n$ are iid. Section 4 explores this choice when allowing for non-identically distributed variables through covariates in the model parameters. Section 5 describes an application of our methodology to extreme rainfall in Cumbria, UK, which experienced major flooding events in November 2009 and December 2015.

2 Reparameterising the Poisson process model

Bayesian estimation of the Poisson process model parameters involves the specification of a prior distribution $\pi(\theta_m)$. Then, using Bayes Theorem, the posterior distribution of $\theta_m$ can be expressed as

$$\pi(\theta_m \mid \mathbf{x}) \propto \pi(\theta_m)L(\theta_m),$$

where $L(\theta_m)$ is the likelihood as defined in (4) and $\mathbf{x}$ denotes the excesses of the threshold $u$. We sample from the posterior distribution using a random walk Metropolis-Hastings scheme. Proposal values of each parameter are drawn sequentially from a univariate Normal distribution and accepted with a probability defined as the posterior ratio of the proposed state relative to the current state of the Markov chain. In all cases throughout the paper, each individual parameter chain is tuned to give an acceptance rate in the range of 20%-25% to satisfy the optimality criterion of Roberts et al (2001). For illustration purposes, results in Sections 2 and 3 are from the analysis of simulated iid data, consisting of 300 exceedances above a threshold $u$ generated from a Poisson process with known parameters. Figure 1 shows individual parameter chains for $\theta_k$ from a random walk Metropolis scheme run for 50,000 iterations with a burn-in of 5,000 removed, where $k = m = 1$, indicating non-convergence and strong dependence in the posterior sampling.

Fig. 1: Random-walk Metropolis chains run for each component of $\theta_1$.

We explore how reparameterising the model in terms of $m$ can improve sampling performance. For a general prior on the parameterisation of interest $\theta_k$, denoted by $\pi(\theta_k)$, Appendix B derives that the prior on the transformed parameter space $\theta_m$ is

$$\pi(\theta_m) = \left(\frac{m}{k}\right)^{-\xi}\pi(\theta_k). \tag{6}$$

In this example, independent Uniform priors are placed on $\mu$, $\log\sigma$ and $\xi$, which gives

$$\pi(\theta) \propto \sigma^{-1}; \quad \mu \in \mathbb{R},\ \sigma > 0,\ \xi \in \mathbb{R}. \tag{7}$$

This choice of prior results in a proper posterior distribution, provided there are at least 4 threshold excesses (Northrop and Attalides, 2016). By finding a value of $m$ that near-orthogonalises the parameters of the posterior distribution $\pi(\theta_m \mid \mathbf{x})$, we can run an efficient MCMC scheme on $\theta_m$ before transforming the samples to $\theta_k$. It is noted in Wadsworth et al (2010) that setting $m$ to be the number of exceedances above the threshold, i.e. $m = r$, improves the mixing properties of the chain, as is illustrated in Figure 2. This is approximately equivalent to inference using a GP model, as discussed in Section 1. A sketch of the sampler used throughout is given after Figure 2.

Fig. 2: Random-walk Metropolis chains run for parameters $\theta_r$, where $r = 300$ is the number of exceedances in the simulated data.
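The following sketch implements the sampler just described: the negative log of likelihood (4), the prior (7) on $\theta_k$ carried to $\theta_m$ through (6), and a componentwise Gaussian random-walk Metropolis update. The step sizes, starting values and run length are illustrative, `x` is assumed to be a numpy array of exceedances, and the $\xi = 0$ limit is not handled.

```python
import numpy as np

def pp_nll(theta, x, u, m):
    """Negative log of the Poisson process likelihood (4)."""
    mu, sigma, xi = theta
    if sigma <= 0:
        return np.inf
    zu = 1.0 + xi * (u - mu) / sigma
    zx = 1.0 + xi * (x - mu) / sigma
    if zu <= 0 or np.any(zx <= 0):
        return np.inf
    return m * zu ** (-1.0 / xi) + x.size * np.log(sigma) \
        + (1.0 / xi + 1.0) * np.log(zx).sum()

def log_post(theta, x, u, m, k):
    """Log-posterior of theta_m: prior (7) on theta_k, mapped through (6)."""
    mu, sigma, xi = theta
    if sigma <= 0:
        return -np.inf
    sigma_k = sigma * (k / m) ** (-xi)                 # scale on the k-block scale, from (5)
    log_prior = -np.log(sigma_k) - xi * np.log(m / k)  # log pi(theta_k) plus log Jacobian (6)
    return log_prior - pp_nll(theta, x, u, m)

def rw_metropolis(x, u, m, k, init, steps, n_iter=50_000, seed=1):
    """Componentwise Gaussian random-walk Metropolis for theta_m = (mu, sigma, xi)."""
    rng = np.random.default_rng(seed)
    theta = np.array(init, dtype=float)
    cur = log_post(theta, x, u, m, k)
    out = np.empty((n_iter, 3))
    for t in range(n_iter):
        for j in range(3):                             # update mu_m, sigma_m, xi in turn
            prop = theta.copy()
            prop[j] += steps[j] * rng.standard_normal()
            new = log_post(prop, x, u, m, k)
            if np.log(rng.uniform()) < new - cur:      # Metropolis accept/reject
                theta, cur = prop, new
        out[t] = theta
    return out
```

Samples returned by `rw_metropolis` are then pushed through the mapping (5) to give draws from the posterior of $\theta_k$.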
Given this choice of $m$, the MCMC scheme is run for $\theta_m$ before transforming to estimate the posterior of $\theta_1$ using the mapping in (5), where $k = 1$. Figure 3 shows the estimated joint posterior of $\theta_1$ based on 5,000 and 50,000 run lengths, with burn-in periods of 1,000 and 5,000 respectively. It compares the samples from directly estimating the posterior of $\theta_1$ with those obtained by transforming the MCMC samples of the posterior of $\theta_m$ to give a posterior sample for $\theta_1$. Figure 3 indicates that the components of $\theta_1$ are highly correlated, with the result that we only sample from a small proportion of the parameter space when exploring using independent random walks for each parameter. This explains the poor mixing if we were to run the MCMC without a transformation. In particular, very different estimates of the joint posterior are achieved for the 5,000 and 50,000 run lengths. Even with 50,000 iterations the estimated density contours are very rough, indicating considerable Monte Carlo noise as a result of poor mixing. In contrast, it is clear that, after back-transforming to $\theta_1$, the reparameterisation enables a more thorough exploration of the parameter space, with almost identical estimated joint density contours based on both 5,000 and 50,000 iterations. This shows the very rapid mixing of the associated MCMC. In fact, we found that the reparameterisation yielded smoother density contours for 5,000 iterations than for 5 million iterations without the transformation. However, while this transformation is a useful tool in enabling an efficient Bayesian inference procedure, further investigation is necessary into the choice of $m$ to achieve near-orthogonality of the parameter space and thus maximise the efficiency of the MCMC procedure.

Fig. 3: Contour plots of the estimated joint posterior of $\theta_1$ for 4,000 iterations (top) and 45,000 iterations (bottom), created from the transformed samples drawn from the MCMC procedure for $\theta_m$ (in black) and samples of $\theta_1$ drawn directly (in red).

3 Choosing m optimally

As illustrated in Section 2, the choice of $m$ in the Poisson process likelihood can improve the performance of the MCMC required to estimate the posterior density of the model parameters $\theta_k$. We desire a value of $m$ such that near-orthogonality of $\theta_m$ is achieved, before using the expressions in (5) to transform to the parameterisation of interest, e.g. $\theta_1$ or $\theta_{n_y}$. As a measure of dependence, we use the asymptotic expected correlation matrix of the posterior distribution of $\theta_m \mid \mathbf{x}$. In particular, we explore how the off-diagonal components of the matrix, that is, the correlations between parameters, change with $m$. The covariance matrix associated with $\theta_m \mid \mathbf{x}$ can be derived analytically by inverting the Fisher information matrix of the Poisson process log-likelihood (see Appendix C). The correlation matrix is then obtained by normalising so that the matrix has a unit diagonal.

Other choices for the measure of the dependence of the posterior could have been used, such as the inverse of the Hessian matrix (or the expected Hessian matrix) of the log-posterior, evaluated at the posterior mode. For inference problems with strong information from the data relative to the prior, there will be limited differences between the approaches and similar values for the optimal $m$ will be found. In contrast, if the prior is strongly informative and the number of threshold exceedances is small, then the choice of $m$ from using our approach could be far from optimal. Also, the use of the observed, rather than expected, Hessian may better represent the actual posterior distribution of $\theta_m$ and deliver a choice of $m$ that better achieves orthogonalisation; see Efron and Hinkley (1978) and Tawn (1987) respectively.

We prefer our choice of measure of dependence as for iid problems it gives closed-form results for $m$ which can be used without the computational work required for other approaches, and this gives valuable insight into the choice of $m$ to guide future implementation without the need for detailed computation of an optimal $m$. Furthermore, informative priors rarely arise in extreme value problems, and so information in the data typically dominates information in the prior, particularly around the posterior mode. It should be pointed out, however, that the prior is used in the MCMC, so there is no loss of prior information in our approach. Also, standard MCMC diagnostics should be used even after the selection of an optimal $m$, so if the asymptotic posterior correlations differ much from the posterior correlations, making our choice of $m$ poor, this will be obvious and a more complete but computationally burdensome analysis can be conducted using the methods described above.

In this section, we use the data introduced in Section 2. For a range of integer values of $m$, maximum posterior mode estimates $\hat{\theta}_m$ are computed and pairwise asymptotic posterior correlations calculated by substituting $\hat{\theta}_m$ into the expressions for the Fisher information matrix in Appendix C and taking the inverse. Figure 4 shows how the parameter correlations change with the choice of $m$, illustrating that the asymptotic posterior distributions of $\mu_m$ and $\xi$ are orthogonal when $m = r$, the number of excesses above the threshold, which explains the findings of Wadsworth et al (2010). A numerical stand-in for this calculation is sketched below.
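The analytic Fisher information used here is set out in Appendix C. As a numerical stand-in, the sketch below approximates the asymptotic correlation matrix by inverting a finite-difference Hessian of the Poisson process negative log-likelihood at its mode, i.e. the observed rather than expected information, one of the variants discussed above. It reuses `pp_nll` from the earlier sketch, and the starting values and grid of $m$ are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def hessian_fd(f, x0, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x0."""
    p = x0.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = eps
            ej = np.zeros(p); ej[j] = eps
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4.0 * eps ** 2)
    return H

def asymptotic_corr(x, u, m, start):
    """Correlation matrix of theta_m | x from the inverted information at the mode."""
    nll = lambda th: pp_nll(th, x, u, m)      # pp_nll as defined earlier
    mode = minimize(nll, np.asarray(start, float), method="Nelder-Mead").x
    cov = np.linalg.inv(hessian_fd(nll, mode))
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

# tracking the off-diagonal entries over a grid of m reproduces curves like Figure 4:
# corr_by_m = {m: asymptotic_corr(x, u, m, start) for m in range(50, 501, 10)}
```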
Fig. 4: Left: estimated parameter correlations changing with $m$: $\rho_{\mu_m,\sigma_m}$ (black), $\rho_{\mu_m,\xi}$ (red), $\rho_{\sigma_m,\xi}$ (blue). Right: expanded region of the graph showing that $\rho_{\mu_m,\xi} = 0$ at a value of $m$ close to $r$, where $r = 300$ is the number of excesses above the threshold, together with the nearby root of $\rho_{\mu_m,\sigma_m} = 0$.

We would like to find the value of $m$ such that $\rho(\theta_m)$ is minimised, where $\rho(\theta_m)$ is defined as

$$\rho(\theta_m) = |\rho_{\mu_m,\sigma_m}| + |\rho_{\mu_m,\xi}| + |\rho_{\sigma_m,\xi}|, \tag{8}$$

where $\rho_{\mu_m,\sigma_m}$ denotes the asymptotic posterior correlation between $\mu_m$ and $\sigma_m$, for example. We also look at the sum of the asymptotic posterior correlation terms involving each individual parameter estimate. For example, we define $\rho_{\mu_m}$, the asymptotic posterior correlation associated with the estimate of $\mu_m$, to be

$$\rho_{\mu_m} = |\rho_{\mu_m,\sigma_m}| + |\rho_{\mu_m,\xi}|. \tag{9}$$

Figure 5 shows how the asymptotic posterior correlation associated with each parameter varies with $m$. From Figure 5 we see that while $\rho_{\mu_m}$ is minimised at the value of $m$ for which $\rho_{\mu_m,\sigma_m} = 0$, $\rho_{\sigma_m}$ and $\rho_{\xi}$ have minima at the value of $m$ for which $\rho_{\sigma_m,\xi} = 0$. We denote the latter minimum by $m_1$ and the former by $m_2$. In terms of the covariance function, this can be written as

$$\mathrm{ACov}(\sigma_{m_1}, \xi \mid \mathbf{x}) = \mathrm{ACov}(\mu_{m_2}, \sigma_{m_2} \mid \mathbf{x}) = 0, \tag{10}$$

where ACov denotes the asymptotic covariance. Figure 5 shows that $m_2$ also minimises the total asymptotic posterior correlation in the model.

Fig. 5: How $\rho(\theta_m)$ changes with $m$ (top left) and how correlations in each individual estimated parameter, as measured by $\rho_{\mu_m}$, $\rho_{\sigma_m}$ and $\rho_{\xi}$, change with $m$.

One would expect the values of $m$ for which $\rho(\theta_m)$ is minimised to correspond to MCMC chains of $\theta_m$ with good mixing properties. We examine the effective sample size (ESS) as a way of evaluating this objectively. ESS is a measure of the equivalent number of independent iterations that the chain represents (Robert and Casella, 2009). MCMC samples are often positively autocorrelated, and thus are less precise in representing the posterior than if the chain were independent. The ESS of a parameter chain $\phi$ is defined as

$$\mathrm{ESS}_{\phi} = \frac{n}{1 + 2\sum_{i=1}^{\infty}\rho_i}, \tag{11}$$

where $n$ is the length of the chain and $\rho_i$ denotes the autocorrelation in the sampled chain of $\phi$ at lag $i$. In practice, the sum of the autocorrelations is truncated when $\rho_i$ drops beneath a certain level; a sketch of this estimator is given at the end of this section. Figure 6 shows how the ESS varies with $m$ for each parameter in $\theta_m$. For these data the ESS follow a pattern we found to typically occur. We see that $\mathrm{ESS}_{\mu_m}$ is maximised at $m = m_2$ due to the near-orthogonality of $\mu_m$ with $\sigma_m$ and $\xi$. We find that $\mathrm{ESS}_{\sigma_m}$ is maximised for $m_1 < m < m_2$, as $\sigma_m$ remains substantially positively correlated with $\mu_m$ and $\sigma_m$ is negatively correlated with $\xi$. Similarly, $\mathrm{ESS}_{\xi}$ is maximised at a value of $m$ close to $m_1$, but $\xi$ is negatively correlated with $\mu_m$, which explains the slight distortion. From these results, we postulate that a selection of $m$ in the interval $(m_1, m_2)$ would ensure the most rapid convergence of the MCMC chain of $\theta_m$, thus enabling an effective sampling procedure from the joint posterior. Figure 6 shows clearly the benefits of the proposed approach.
Fig. 6: How the ESS varies with $m$ for each parameter in $\theta_m$. The blue dashed lines represent $m = m_1$ (left) and $m = m_2$ (right) in the simulated data example for 45,000 iterations of the MCMC, where $m_1$ and $m_2$ are defined by property (10). In the calculations, the sum of the autocorrelations was truncated when the autocorrelations in the chain dropped below a small cutoff.

As the ESS is reasonably stable for $m$ in $(m_1, m_2)$, this approach gives a degree of flexibility to the choice of $m$, giving a balance of mixing quality across the model parameters. The quantities $m_1$ and $m_2$ can be found by numerical solution of the equations $\mathrm{ACov}(\sigma_m, \xi \mid \mathbf{x}) = 0$ and $\mathrm{ACov}(\mu_m, \sigma_m \mid \mathbf{x}) = 0$ respectively, where the asymptotic covariance matrix of $\theta_m$ is given by the inverse of the Fisher information (see Appendix C). Approximate analytical expressions for $m_1$ and $m_2$ can be derived using Halley's method for root-finding (Gander, 1985) applied to equations (10). This method yields the closed-form approximations (12) for $\hat{m}_1$ and (13) for $\hat{m}_2$, which are explicit functions of $r$ and $\xi$ alone. In practice, the values of $\hat{m}_1$ and $\hat{m}_2$ are estimated by using an estimate of $\xi$, such as the maximum likelihood or probability weighted moments estimates. Figure 7 shows how $\hat{m}_1$ and $\hat{m}_2$ change relative to $r$ for a range of $\xi$. This illustrates that for negative estimates of the shape parameter, $r$ is not a suitable candidate to be the 'optimal' value of $m$, as it is not in the range $(m_1, m_2)$. In the simulated data used in this section, although a selection of $m = r$ is reasonable, Figure 6 shows that this may not be wise if one is primarily concerned about sampling well from $\xi$, for example. In this case, $\hat{m}_1$ is relatively close to $r$, but Figure 7 shows that this is not the case for models with a larger positive estimate of $\xi$.

Fig. 7: How $\hat{m}_1$ and $\hat{m}_2$ change as a multiple of $r$ with respect to $\hat{\xi}$: $\hat{m}_1/r$ (bottom curve), $\hat{m}_2/r$ (top curve).

A simulation study was carried out to assess the suitability of the expressions $\hat{m}_1$ and $\hat{m}_2$ as approximations to $m_1$ and $m_2$ respectively. A total of 1000 Poisson processes were simulated with different values of $\theta_m$. The approximations were calculated and compared with the true values of $m_1$ and $m_2$, which were obtained exactly by numerical methods. The approximations $\hat{m}_i$ were found to lie very close to the true values $m_i$ in the vast majority of cases for both $i = 1$ and $i = 2$, with the errors from Halley's method smaller than those from the corresponding Newton's method approximations.
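A minimal sketch of the truncated-sum ESS estimator (11); the truncation cutoff is illustrative.

```python
import numpy as np

def ess(chain, cutoff=0.05):
    """Effective sample size (11): n / (1 + 2 * sum of lag autocorrelations),
    truncating the sum once the autocorrelation drops below `cutoff`."""
    y = np.asarray(chain, float) - np.mean(chain)
    n = y.size
    acf = np.correlate(y, y, mode="full")[n - 1:] / (np.arange(n, 0, -1) * y.var())
    s = 0.0
    for rho in acf[1:]:
        if rho < cutoff:
            break
        s += rho
    return n / (1.0 + 2.0 * s)
```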
Algorithm 1: Sampling from the posterior distribution of the Poisson process model parameters $\theta_k = (\mu_k, \sigma_k, \xi)$ or $\theta_k = (\mu_k^{(0)}, \mu_k^{(1)}, \sigma_k, \xi)$ after reparameterising.

Data: threshold excesses $\mathbf{x}$.
Result: samples from the posterior distribution $\pi(\theta_k \mid \mathbf{x})$.
1. Choose the parameterisation of interest $\theta_k$;
2. if $\theta_k = (\mu_k, \sigma_k, \xi)$ then
   (a) obtain an estimate of the shape parameter $\xi$, using maximum likelihood for example;
   (b) compute $\hat{m}_1$ and $\hat{m}_2$ as defined in (12) and (13);
   (c) choose $m$ in the range $(\hat{m}_1, \hat{m}_2)$;
   else choose $m$ to be the value of $m$ that numerically solves $\rho_{\mu_m^{(0)},\sigma_m} = 0$;
3. Obtain MCMC samples from the posterior distribution $\pi(\theta_m \mid \mathbf{x})$;
4. Transform to obtain samples from $\pi(\theta_k \mid \mathbf{x})$ using expression (5).

4 Choosing m in the presence of non-stationarity

In many practical applications, processes exhibit trends or seasonal effects caused by underlying mechanisms. The standard methods for modelling extremes of non-identically distributed random variables were introduced by Davison and Smith (1990) and Smith (1989), using a generalised Pareto distribution and a Poisson process respectively. Both approaches involve setting a constant threshold and modelling the parameters as functions of covariates. In this way, we model the non-stationarity through the conditional distribution of the process on the covariates. We follow the Poisson process model of Smith (1989), as the parameters are invariant to the choice of threshold if the model is appropriate. We define the covariate-dependent parameters $\theta_m(z) = (\mu_m(z), \sigma_m(z), \xi(z))$, for covariates $z$. Often in practice, the shape parameter $\xi$ is assumed to be constant. A log-link is typically used to ensure the positivity of $\sigma_m(z)$.

The process of choosing $m$ is complicated when modelling in the presence of covariates. This is partially caused by a modification of the integrated intensity measure, which becomes

$$\Lambda(A) = m\int_{\mathcal{Z}}\left[1 + \xi(z)\left(\frac{u - \mu_m(z)}{\sigma_m(z)}\right)\right]_+^{-1/\xi(z)} g(z)\,\mathrm{d}z, \tag{14}$$

where $g$ denotes the probability density function of the covariates, which is unknown, and $\mathcal{Z}$ is the covariate space. The density term $g$ is required as the covariates associated with exceedances of the threshold $u$ are random. In addition, the extra parameters introduced by modelling covariates increase the overall correlation in the model parameters.

For simplicity, we restrict our attention to the case of modelling when the location parameter is a linear function of a covariate, that is,

$$\mu_m(z) = \mu_m^{(0)} + \mu_m^{(1)}z, \qquad \sigma_m(z) = \sigma_m, \qquad \xi(z) = \xi,$$

where we centre the covariate $z$, as this leads to the parameters $\mu_m^{(0)}$ and $\mu_m^{(1)}$ being orthogonal. Note that the regression parameter $\mu_m^{(1)}$ is invariant to the choice of $m$. A total of 233 excesses above a threshold of $u = 15$ are simulated from a Poisson process model of this form with shape parameter $\xi = -0.05$. We choose $g$ to be an exponential density, noting that one could also choose $g$ to be the density of a covariate that is used in practice. We impose an improper Uniform prior on the regression parameter $\mu^{(1)}$ and set up the MCMC scheme in the same manner as in Section 3. The objective remains to identify the value of $m$ that achieves near-orthogonality of the parameters of the posterior distribution. As before, we run an MCMC sampler on $\theta_m(z)$ and transform the samples back to the parameterisation of interest $\theta_k(z)$, which can be obtained as in (5) using the relations

$$\mu_k^{(0)} = \mu_m^{(0)} - \frac{\sigma_m}{\xi}\left(1 - \left(\frac{k}{m}\right)^{-\xi}\right), \qquad \mu_k^{(1)} = \mu_m^{(1)}, \qquad \sigma_k = \sigma_m\left(\frac{k}{m}\right)^{-\xi}. \tag{15}$$
Fig. 8: Contour plots of estimated posterior densities of $\theta_1(z)$, having sampled from the joint posterior directly (red) and having transformed using (15) after reparameterising from $\theta_m(z)$ (black). Both sets of contours are constructed from 50,000 MCMC iterations with a burn-in of 5,000.

The complication of the integral term in the likelihood for non-identically distributed variables means that it is no longer feasible to gain an analytical approximation for the optimal value of $m$. A referee has suggested that a possible route to obtaining such expressions for $m$ in the non-stationary case is to build on results in Attalides (2015) and use a non-constant threshold as in Northrop and Jonathan (2011), but as this moves away from our constant threshold case we do not pursue it. We therefore choose a value of $m$ that minimises the asymptotic posterior correlation in the model. The asymptotic posterior correlation matrix is found by inverting the Fisher information matrix of the log-likelihood with the modified integrated intensity measure (14) and normalising so that the matrix has a unit diagonal. Because of the integral term (14) in the log-likelihood, the Fisher information contains various integrals that require numerical evaluation. We compute these using adaptive quadrature methods. Empirical evidence suggests that the optimal $m$ coincides with the value of $m$ such that $\rho_{\mu_m^{(0)},\sigma_m} = 0$, which is similar to how $m_2$ is defined in Section 3. Using numerical methods, we identify that this corresponds to a value of $m = 85$ for the simulated data example. Figure 8 shows contour plots of estimated posterior densities of $\theta_1(z)$, comparing the samples from directly estimating the posterior of $\theta_1(z)$ with those from transforming the samples from the estimated posterior of $\theta_m(z)$ to give a sample from the posterior of $\theta_1(z)$. From this figure, we see that the reparameterisation improves the sampling from the posterior of $\theta_1(z)$.
Fig. 9: Effective sample size of each parameter chain of the MCMC procedure.

We again inspect the effective sample size for each parameter as a way of comparing the efficiency of the MCMC under different parameterisations. Figure 9 shows how the effective sample size varies with $m$ for each parameter. This figure shows that the quality of mixing is approximately maximised in $\mu_m^{(0)}$ for the value of $m$ that minimises the asymptotic posterior correlation. Mixing for $\mu_m^{(1)}$ is consistent across all values of $m$. Interestingly, mixing in $\xi$ improves as the value of $m$ increases. Without a formal measure for the quality of mixing across the parameters, it is found that, when averaging the effective sample size over the number of parameters, the ESS is stable with respect to $m$ in the interval spanning from the value of $m$ such that $\rho_{\mu_m^{(0)},\sigma_m} = 0$ to the value of $m$ such that $\rho_{\sigma_m,\xi} = 0$, as in Section 3. For a summary of how the reparameterisation method can be used in the presence of non-stationarity, see Algorithm 1; a numerical sketch of the quantities involved is given below.
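In the non-stationary case both the integrated intensity (14) and the information matrix must be handled numerically. The sketch below evaluates (14) by quadrature against a covariate density `g` for the location-linear model used above, and gives the corresponding negative log-likelihood; `asymptotic_corr` from the earlier sketch, with `cov_nll` substituted for `pp_nll`, can then locate the $m$ at which $\rho_{\mu_m^{(0)},\sigma_m} = 0$. The support $(0, \infty)$ for `g` and positivity of the bracket over the covariate range are assumptions of the sketch.

```python
import numpy as np
from scipy import integrate

def intensity_integral(theta, u, m, g):
    """Integrated intensity (14) for mu_m(z) = mu0 + mu1*z, by quadrature over g."""
    mu0, mu1, sigma, xi = theta
    def h(z):
        arg = 1.0 + xi * (u - mu0 - mu1 * z) / sigma   # assumed positive for all z
        return arg ** (-1.0 / xi) * g(z)
    val, _ = integrate.quad(h, 0.0, np.inf)
    return m * val

def cov_nll(theta, x, z, u, m, g):
    """Negative log-likelihood of the covariate model; x are excesses, z their covariates."""
    mu0, mu1, sigma, xi = theta
    if sigma <= 0:
        return np.inf
    zx = 1.0 + xi * (x - mu0 - mu1 * z) / sigma
    if np.any(zx <= 0):
        return np.inf
    return intensity_integral(theta, u, m, g) + x.size * np.log(sigma) \
        + (1.0 / xi + 1.0) * np.log(zx).sum()
```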
5 Case study: extreme rainfall in Cumbria, UK

In this section, we present a study as an example of how this reparameterisation method can be used in practice. In particular, we analyse data taken from the Met Office UKCP09 project, which contains daily baseline averages of surface rainfall observations, measured in millimetres, on a 25 km × 25 km grid; we use the grid cell covering Cumbria. For the purposes of illustrating our method, we initially make the assumption that the rainfall observations are iid and proceed with the method outlined in Section 3. We wish to obtain information about the parameters corresponding to the distribution of annual maxima, i.e. $\theta_{n_y}$. Standard threshold diagnostics (Coles, 2001) indicate that a threshold of $u = 15$ is appropriate, which corresponds to the 95.6% quantile of the data. There are $r = 880$ excesses of $u$ (see Figure 10). We obtain bounds $m_1$ and $m_2$, then choose a value of $m$, with $m_1 < m < m_2$, that will achieve near-orthogonality of the Poisson process model parameters to improve MCMC sampling from the joint posterior distribution. We obtain $\hat{\xi} = 0.087$ using maximum likelihood when $m = r$, which we use to obtain approximations for $m_1$ and $m_2$ as in (12) and (13). We verify that $\hat{m}_1$ and $\hat{m}_2$ represent good approximations by solving equations (10) numerically. Since $r = 880$ is contained in the interval $(m_1, m_2)$, we choose $m = r$. We run an MCMC chain for $\theta_r$ for 50,000 iterations, discarding the first 1,000 samples as burn-in. We transform the remaining samples using the mapping in (5), where $k = 55$, the number of years of observation, to obtain samples from the joint posterior of $\theta_{55}$.
Fig. 10: (Left) Daily rainfall observations in the Cumbria grid cell in the period 1958-2012. The red line represents the extreme value threshold of $u = 15$. (Right) Boxplots of rainfall above $u$ against the corresponding monthly NAO index.

The estimated posterior density for each parameter is shown in Figure 11. To estimate probabilities of events beyond the range of the data, we can use the estimated parameters to estimate extreme quantiles of the annual maximum distribution. The quantity $y_N$ satisfying

$$1/N = 1 - G(y_N), \tag{16}$$

is termed the $N$-year return level, where $G$ is defined as in expression (2). The level $y_N$ is expected to be exceeded on average once every $N$ years. By inverting (16) we get

$$y_N = \begin{cases}\mu - \dfrac{\sigma}{\xi}\left[1 - \{-\log(1 - 1/N)\}^{-\xi}\right] & \text{for } \xi \neq 0, \\[1ex] \mu - \sigma\log\{-\log(1 - 1/N)\} & \text{for } \xi = 0.\end{cases} \tag{17}$$

The posterior density of the 100-year return level in Figure 11 is estimated by inputting the MCMC samples of the model parameters into expression (17); a sketch of this calculation follows.

Fig. 11: Estimated posterior densities of $\mu$, $\sigma$, $\xi$ and the 100-year return level.
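A minimal sketch of this return-level calculation, vectorised over posterior samples and assuming $\xi \neq 0$ (the log-limit in (17) applies otherwise):

```python
import numpy as np

def return_level(N, mu, sigma, xi):
    """N-year return level y_N from (17) for xi != 0."""
    p = -np.log(1.0 - 1.0 / N)
    return mu - (sigma / xi) * (1.0 - p ** (-xi))

# posterior sample of the 100-year return level, as in Figure 11,
# with samples holding columns (mu, sigma, xi):
# rl = return_level(100, samples[:, 0], samples[:, 1], samples[:, 2])
```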
We use the same methodology to explore the effect of the monthly NAO index on the probability of extreme rainfall levels in Cumbria. The NAO index describes the surface sea-level pressure difference between the Azores High and the Icelandic Low. The low-frequency variability of the monthly scale is chosen to represent the large-scale atmospheric processes affecting the distribution of wind and rain. In the UK, a positive NAO index is associated with cool summers and wet winters, while a negative NAO index typically corresponds to cold winters, pushing the North Atlantic storm track further south to the Mediterranean region (Hurrell et al, 2003). In this analysis, we incorporated the effect of NAO by introducing it as a covariate in the location parameter. The threshold of $u = 15$ was retained for this analysis.

To obtain the value of $m$ that minimises the overall correlation in the model, we solve numerically the equation $\rho_{\mu_m^{(0)},\sigma_m} = 0$, following the reasoning in Section 4. We obtain a kernel density estimate of the NAO covariate, which represents $g$ as defined in expression (14). We use this to obtain maximum posterior mode estimates $\hat{\theta}_r$. These quantities are substituted into the Fisher information matrix. The matrix is then inverted numerically to estimate the optimal $m$, which is close to the value of $m$ estimated during the iid analysis. We would expect this as the covariate effect is small, as shown in Figure 12. This example nevertheless illustrates the benefit of numerically solving for $m$ when modelling non-stationarity, as the range $(m_1, m_2)$ estimated analytically during the iid analysis need no longer contain the optimal value of $m$.

We run an MCMC chain for $\theta_m$ for 50,000 iterations before discarding the first 5,000 samples as burn-in. We transform the remaining MCMC samples to the annual maximum scale using the mapping in (15), where $k = 55$. Figure 12 indicates that NAO has a significantly positive effect on the location parameter, as almost all posterior mass is distributed with $\mu^{(1)} > 0$.
Fig. 12: Estimated posterior densities of $\mu^{(0)}$, $\mu^{(1)}$, $\sigma$ and $\xi$.

The predictive distribution accounts both for parameter uncertainty and for randomness in future observations (Coles and Tawn, 1996). On the basis of threshold excesses $\mathbf{x} = (x_1, \ldots, x_n)$, the predictive distribution of a future November maximum $M$ is

$$\Pr\{M \le y \mid \mathbf{x}\} = \int_{\theta}\Pr\{M \le y \mid \theta\}\,\pi(\theta \mid \mathbf{x})\,\mathrm{d}\theta, \tag{18}$$

where

$$\Pr\{M \le y \mid \theta\} = \begin{cases}\exp\left\{-\left[1 + \xi\left(\dfrac{y - (\mu^{(0)} + \mu^{(1)}z)}{\sigma}\right)\right]_+^{-1/\xi}\right\} & \text{where } z \text{ is known}, \\[2ex] \exp\left\{-\displaystyle\int_{\mathcal{Z}}\left[1 + \xi\left(\dfrac{y - (\mu^{(0)} + \mu^{(1)}z)}{\sigma}\right)\right]_+^{-1/\xi} g_N(z)\,\mathrm{d}z\right\} & \text{where } z \text{ is unknown},\end{cases}$$

where $g_N$ is the density of NAO in November and the integral is evaluated numerically using adaptive quadrature methods. The integral in (18) can be approximated using a Monte Carlo summation over the samples from the joint posterior of $\theta$. From this, we estimate the predictive probability of an event exceeding 51.6 mm in a typical November, which corresponds to an 89-year event. For November 2009, when a negative NAO index was measured, the probability of such an event corresponds to a 90-year event. For the maximum observed value of NAO in November, the predictive probability corresponds to a 75-year flood event. This illustrates that the impact that different phases of climate variability can have on the probabilities of extreme events is slight but potentially important.

Fig. 13: Return levels corresponding to November maxima. The full line represents the posterior mean and the two dashed lines represent 95% credible intervals.
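A minimal sketch of the Monte Carlo approximation to (18) for a known NAO value $z$; `samples` is assumed to hold posterior draws of $(\mu^{(0)}, \mu^{(1)}, \sigma, \xi)$ on the November-maximum scale, and the unknown-$z$ case would additionally integrate over $g_N$.

```python
import numpy as np

def predictive_cdf(y, samples, z):
    """Pr(M <= y | x) from (18), averaging the GEV df over posterior samples."""
    mu0, mu1, sigma, xi = samples.T
    arg = 1.0 + xi * (y - (mu0 + mu1 * z)) / sigma
    t = np.empty_like(arg)
    ok = arg > 0
    t[ok] = arg[ok] ** (-1.0 / xi[ok])
    t[~ok] = np.where(xi[~ok] < 0, 0.0, np.inf)   # beyond a support endpoint
    return np.exp(-t).mean()

# predictive exceedance probability of 51.6 mm in a November with NAO index z:
# p_exc = 1.0 - predictive_cdf(51.6, samples, z)
```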
Appendix

A Proof that $\hat{\mu}_r = u$ when $m = r$

We can write the full likelihood for parameters $\theta_r$, given a series of excesses $\{x_i\}$ above a threshold $u$, as $L(\theta_r) = L_1 \times L_2$, where $L_1$ is the Poisson probability of $r$ exceedances of $u$ and $L_2$ is the joint density of these $r$ exceedances, so that

$$L_1 = \frac{1}{r!}\left\{r\left[1 + \xi\left(\frac{u - \mu_r}{\sigma_r}\right)\right]_+^{-1/\xi}\right\}^r \exp\left\{-r\left[1 + \xi\left(\frac{u - \mu_r}{\sigma_r}\right)\right]_+^{-1/\xi}\right\},$$

$$L_2 = \prod_{i=1}^{r}\frac{1}{\sigma_r}\left[1 + \xi\left(\frac{x_i - \mu_r}{\sigma_r}\right)\right]_+^{-1/\xi - 1}\left[1 + \xi\left(\frac{u - \mu_r}{\sigma_r}\right)\right]_+^{1/\xi}.$$

By defining $\Lambda = \left[1 + \xi\left(\frac{u - \mu_r}{\sigma_r}\right)\right]_+^{-1/\xi}$ and $\psi_u = \sigma_r + \xi(u - \mu_r)$, we can reparameterise the likelihood in terms of $\theta^* = (\Lambda, \psi_u, \xi)$ to give

$$L(\theta^*) \propto \Lambda^r\exp\{-r\Lambda\}\prod_{i=1}^{r}\frac{1}{\psi_u - \xi(u - \mu_r)}\left[\frac{\psi_u + \xi(x_i - u)}{\psi_u - \xi(u - \mu_r)}\right]_+^{-1/\xi - 1}\left[\frac{\psi_u}{\psi_u - \xi(u - \mu_r)}\right]_+^{1/\xi} = \Lambda^r\exp\{-r\Lambda\}\prod_{i=1}^{r}\frac{1}{\psi_u}\left[1 + \xi\left(\frac{x_i - u}{\psi_u}\right)\right]_+^{-1/\xi - 1}.$$

Taking the log-likelihood and maximising with respect to $\Lambda$, we get

$$\ell(\theta^*) := \log L(\theta^*) = r\log\Lambda - r\Lambda - r\log\psi_u - \left(\frac{1}{\xi} + 1\right)\sum_{i=1}^{r}\log\left[1 + \xi\left(\frac{x_i - u}{\psi_u}\right)\right]_+ + \text{const}, \qquad \frac{\partial\ell}{\partial\Lambda} = \frac{r}{\hat{\Lambda}} - r = 0,$$

which gives $\hat{\Lambda} = 1$. Then, by the invariance property of maximum likelihood estimators, $\hat{\mu}_r = u$, and using the identity for $\psi_u$, we get $\hat{\sigma}_r = \hat{\psi}_u$. Because the $\xi$-dependent term in the log-likelihood is identical to that in a GP log-likelihood, the maximum likelihood estimators of the two models coincide.

B Derivation of the prior for inference on $\theta_m$

We define a joint prior on the parameterisation of interest $\theta_k$. However, as we are making inference for the 'optimal' parameterisation $\theta_m$, we must derive the prior for $\theta_m$. We can calculate the prior density of $\theta_m$ by using the density method for one-to-one bivariate transformations. Inverting (5) to get expressions for $\mu_m$ and $\sigma_m$, i.e.

$$\mu_m = \mu_k - \frac{\sigma_k}{\xi}\left(1 - \left(\frac{m}{k}\right)^{-\xi}\right) = g_1(\mu_k, \sigma_k), \qquad \sigma_m = \sigma_k\left(\frac{m}{k}\right)^{-\xi} = g_2(\mu_k, \sigma_k),$$

we can use this transformation to calculate the prior for $\theta_m$:

$$\pi(\theta_m) = \pi(\mu_m, \sigma_m, \xi) = \pi(\mu_k, \sigma_k, \xi)\,|\det J|\,\Big|_{\mu_k = g_1^{-1}(\mu_m, \sigma_m),\ \sigma_k = g_2^{-1}(\mu_m, \sigma_m),\ \xi = \xi},$$

where

$$\det J = \begin{vmatrix}\dfrac{\partial\mu_m}{\partial\mu_k} & \dfrac{\partial\mu_m}{\partial\sigma_k} & \dfrac{\partial\mu_m}{\partial\xi} \\[1ex] \dfrac{\partial\sigma_m}{\partial\mu_k} & \dfrac{\partial\sigma_m}{\partial\sigma_k} & \dfrac{\partial\sigma_m}{\partial\xi} \\[1ex] \dfrac{\partial\xi}{\partial\mu_k} & \dfrac{\partial\xi}{\partial\sigma_k} & \dfrac{\partial\xi}{\partial\xi}\end{vmatrix} = \begin{vmatrix}\dfrac{\partial\mu_m}{\partial\mu_k} & \dfrac{\partial\mu_m}{\partial\sigma_k} & \dfrac{\partial\mu_m}{\partial\xi} \\[1ex] 0 & \dfrac{\partial\sigma_m}{\partial\sigma_k} & \dfrac{\partial\sigma_m}{\partial\xi} \\[1ex] 0 & 0 & \dfrac{\partial\xi}{\partial\xi}\end{vmatrix} = \frac{\partial\sigma_m}{\partial\sigma_k}\frac{\partial\xi}{\partial\xi} = \left(\frac{m}{k}\right)^{-\xi}.$$

Therefore, $\pi(\theta_m) = \left(\frac{m}{k}\right)^{-\xi}\pi(\theta_k)$.
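The Jacobian calculation above is easily checked symbolically. A short sketch using sympy, computing only the 2×2 block that determines the determinant since the matrix is triangular in $\xi$:

```python
import sympy as sp

mu_k, xi = sp.symbols("mu_k xi")
sigma_k, m, k = sp.symbols("sigma_k m k", positive=True)

# inverse of mapping (5), as in Appendix B
mu_m = mu_k - (sigma_k / xi) * (1 - (m / k) ** (-xi))
sigma_m = sigma_k * (m / k) ** (-xi)

J = sp.Matrix([[sp.diff(mu_m, mu_k), sp.diff(mu_m, sigma_k)],
               [sp.diff(sigma_m, mu_k), sp.diff(sigma_m, sigma_k)]])
print(sp.simplify(J.det()))   # simplifies to (m/k)**(-xi)
```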
C Fisher information matrix calculations for iid random variables

The log-likelihood of the Poisson process model with parameterisation $\theta_m = (\mu_m, \sigma_m, \xi)$ can be expressed as

$$\ell(\theta_m) = -m\left[1 + \xi\left(\frac{u - \mu_m}{\sigma_m}\right)\right]_+^{-1/\xi} - r\log\sigma_m - \left(\frac{1}{\xi} + 1\right)\sum_{j=1}^{r}\log\left[1 + \xi\left(\frac{x_j - \mu_m}{\sigma_m}\right)\right]_+,$$

where $r$ is the number of exceedances of $X$ above the threshold $u$. For simplicity, we drop the $[\cdot]_+$ subscript in subsequent calculations. In order to produce analytic expressions for the asymptotic covariance matrix, we must evaluate the observed information matrix $\hat{I}(\theta_m)$. For simplicity, we define

$$v_m = \frac{u - \mu_m}{\sigma_m}, \qquad z_{j,m} = \frac{x_j - \mu_m}{\sigma_m}.$$

The required second derivatives are

$$\frac{\partial^2\ell}{\partial\mu_m^2} = -\frac{m(\xi + 1)}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 2} + \frac{\xi(\xi + 1)}{\sigma_m^2}\sum_{j=1}^{r}\left[1 + \xi z_{j,m}\right]^{-2},$$

$$\frac{\partial^2\ell}{\partial\sigma_m^2} = \frac{2m v_m}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 1} - \frac{m(\xi + 1)v_m^2}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 2} + \frac{r}{\sigma_m^2} - \frac{2(\xi + 1)}{\sigma_m^2}\sum_{j=1}^{r}z_{j,m}\left[1 + \xi z_{j,m}\right]^{-1} + \frac{\xi(\xi + 1)}{\sigma_m^2}\sum_{j=1}^{r}z_{j,m}^2\left[1 + \xi z_{j,m}\right]^{-2},$$

$$\frac{\partial^2\ell}{\partial\xi^2} = -m\left[1 + \xi v_m\right]^{-1/\xi}\left\{\frac{2v_m}{\xi^2}\left[1 + \xi v_m\right]^{-1} - \frac{2}{\xi^3}\log\left[1 + \xi v_m\right] + \frac{v_m^2}{\xi}\left[1 + \xi v_m\right]^{-2} + \left(\frac{1}{\xi^2}\log\left[1 + \xi v_m\right] - \frac{v_m}{\xi}\left[1 + \xi v_m\right]^{-1}\right)^2\right\} - \frac{2}{\xi^3}\sum_{j=1}^{r}\log\left[1 + \xi z_{j,m}\right] + \frac{2}{\xi^2}\sum_{j=1}^{r}z_{j,m}\left[1 + \xi z_{j,m}\right]^{-1} + \frac{\xi + 1}{\xi}\sum_{j=1}^{r}z_{j,m}^2\left[1 + \xi z_{j,m}\right]^{-2},$$

$$\frac{\partial^2\ell}{\partial\mu_m\partial\sigma_m} = \frac{m}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 1} - \frac{m(\xi + 1)v_m}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 2} - \frac{\xi + 1}{\sigma_m^2}\sum_{j=1}^{r}\left[1 + \xi z_{j,m}\right]^{-1} + \frac{\xi(\xi + 1)}{\sigma_m^2}\sum_{j=1}^{r}z_{j,m}\left[1 + \xi z_{j,m}\right]^{-2},$$

$$\frac{\partial^2\ell}{\partial\mu_m\partial\xi} = -\frac{m}{\sigma_m}\left\{\frac{1}{\xi^2}\left[1 + \xi v_m\right]^{-1/\xi - 1}\log\left[1 + \xi v_m\right] - \frac{\xi + 1}{\xi}\left[1 + \xi v_m\right]^{-1/\xi - 2}v_m\right\} + \frac{1}{\sigma_m}\sum_{j=1}^{r}\left[1 + \xi z_{j,m}\right]^{-1} - \frac{\xi + 1}{\sigma_m}\sum_{j=1}^{r}z_{j,m}\left[1 + \xi z_{j,m}\right]^{-2},$$

$$\frac{\partial^2\ell}{\partial\sigma_m\partial\xi} = -\frac{m v_m}{\sigma_m}\left\{\frac{1}{\xi^2}\left[1 + \xi v_m\right]^{-1/\xi - 1}\log\left[1 + \xi v_m\right] - \frac{\xi + 1}{\xi}\left[1 + \xi v_m\right]^{-1/\xi - 2}v_m\right\} + \frac{1}{\sigma_m}\sum_{j=1}^{r}z_{j,m}\left[1 + \xi z_{j,m}\right]^{-1} - \frac{\xi + 1}{\sigma_m}\sum_{j=1}^{r}z_{j,m}^2\left[1 + \xi z_{j,m}\right]^{-2}.$$

To obtain the Fisher information matrix, we take the expected value of each term in the observed information with respect to the probability density of points of a Poisson process. Let $Z = (X - \mu_m)/\sigma_m$, and let $R$ be a random variable denoting the number of excesses of $X$ above $u$. The density of the rescaled points in the set $A_u$ can be defined by

$$f(z) = \frac{\lambda(z)}{\Lambda(A_u)} = \frac{\left[1 + \xi z\right]^{-1/\xi - 1}}{\left[1 + \xi v_m\right]^{-1/\xi}}, \qquad z > v_m,$$

where $\lambda$ is a function denoting the rate of exceedance.
Then, for example,

$$\mathbb{E}_{Z,R}\left\{\sum_{j=1}^{R}\left[1 + \xi z_{j,m}\right]^{-2}\right\} = \mathbb{E}_R\left\{R\,\mathbb{E}_Z\left\{\left[1 + \xi Z\right]^{-2}\right\}\right\} = \mathbb{E}_R\left\{\frac{R}{\left[1 + \xi v_m\right]^{-1/\xi}}\int_{v_m}^{\infty}\left[1 + \xi z\right]^{-1/\xi - 3}\mathrm{d}z\right\} = \frac{m}{2\xi + 1}\left[1 + \xi v_m\right]^{-1/\xi - 2},$$

using $\mathbb{E}(R) = m\left[1 + \xi v_m\right]^{-1/\xi}$. Taking expectations of each term in this way gives the Fisher information matrix $I(\theta_m)$; for example,

$$\mathbb{E}\left\{-\frac{\partial^2\ell}{\partial\mu_m^2}\right\} = \frac{m(\xi + 1)}{\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 2} - \frac{m\,\xi(\xi + 1)}{(2\xi + 1)\sigma_m^2}\left[1 + \xi v_m\right]^{-1/\xi - 2},$$

with analogous, though lengthier, expressions for the remaining entries. By inverting the Fisher information matrix using a technical computing tool like Wolfram Mathematica, making the substitution $r = m\left[1 + \xi v_m\right]^{-1/\xi}$, the expected number of exceedances, and using the mapping in (5), we can obtain closed-form expressions for the asymptotic posterior covariances $\mathrm{ACov}(\mu_m, \xi)$, $\mathrm{ACov}(\mu_m, \sigma_m)$ and $\mathrm{ACov}(\sigma_m, \xi)$, each a function of $\xi$, $\sigma_m$, $(r/m)^{\xi}$ and $\log(r/m)$. When $m = r$, $\mathrm{ACov}(\mu_m, \xi) = 0$. In addition, the $m$ for which $\mathrm{ACov}(\mu_m, \sigma_m) = 0$ is the $m$ that minimises $\rho(\theta_m)$ as defined in (8). This root can easily be found numerically, but an analytical approximation can be calculated using a one-step Halley's method. Using $m = r$ as the initial seed and the formula

$$x_{n+1} = x_n - \frac{2f(x_n)f'(x_n)}{2f'(x_n)^2 - f(x_n)f''(x_n)},$$

we get the expression (13) for $\hat{m}_2$ after one step. The quantity $\hat{m}_1$, given by expression (12), requires two iterations of this method.
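A sketch of this iteration as a generic routine; `f`, `fp` and `fpp` stand for the relevant asymptotic covariance as a function of $m$ and its first two derivatives, which the user is assumed to supply.

```python
def halley(f, fp, fpp, x0, steps=1):
    """Halley's root-finding update: x <- x - 2 f f' / (2 f'^2 - f f'')."""
    x = x0
    for _ in range(steps):
        fx, fpx, fppx = f(x), fp(x), fpp(x)
        x = x - 2.0 * fx * fpx / (2.0 * fpx ** 2 - fx * fppx)
    return x

# one step from the seed m = r on ACov(mu_m, sigma_m | x) gives (13);
# two steps on ACov(sigma_m, xi | x) give (12).
```

Acknowledgements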
We gratefully acknowledge the support of the EPSRC-funded EP/H023151/1 STOR-i Centre for Doctoral Training, the Met Office and EDF Energy. We extend our thanks to Jenny Wadsworth of Lancaster University, Simon Brown of the Met Office and two referees for very helpful comments. We also thank the Met Office for the rainfall data.