Inference for extreme earthquake magnitudes accounting for a time-varying measurement process
Zak Varty, Jonathan A. Tawn, Peter M. Atkinson, Stijn Bierman
Submitted to the Annals of Applied Statistics
Lancaster University: [email protected]; [email protected]; [email protected]
Shell Global Solutions Netherlands: [email protected]

Investment in measuring a process more completely or accurately is only useful if these improvements can be utilised during modelling and inference. We consider how improvements to data quality over time can be incorporated when selecting a modelling threshold and in the subsequent inference of an extreme value analysis. Motivated by earthquake catalogues, we consider variable data quality in the form of rounded and incompletely observed data. We develop an approach to select a time-varying modelling threshold that makes best use of the available data, accounting for uncertainty in the magnitude model and for the rounding of observations. We show the benefits of the proposed approach on simulated data and apply the method to a catalogue of earthquakes induced by gas extraction in the Netherlands. This more than doubles the usable catalogue size and greatly increases the precision of high magnitude quantile estimates. This has important consequences for the design and cost of earthquake defences. For the first time, we find compelling data-driven evidence against the applicability of the Gutenberg-Richter law to these earthquakes. Furthermore, our approach to automated threshold selection appears to have much potential for generic applications of extreme value methods.
1. Introduction.
1.1. Aims and motivation.
The observational nature of environmental data can lead to challenges during statistical modelling and inference. In particular, improved measurement of an environmental process within a dataset should be acknowledged as part of any inference. Failing to do so leads to biased inference, while including data based only on the initial quality of measurements is overly conservative, leads to inefficient inference, and makes financial investment into the measurement process redundant. We consider how to include changing data quality in an extreme value analysis where low data quality is present as the partial censoring of rounded data. Here and throughout, censored data are values that are missing-not-at-random (Little and Rubin, 2019). This paper is motivated by the modelling of earthquake catalogues, but results in a method that is applicable more widely where these data features are present. This new threshold selection method should also be of value in more general extreme value analyses.

1.2. Earthquake data.
Earthquakes are recorded if their locations and magnitudes can be inferred from ground vibrations at sensor locations; this requires an earthquake to be detected by multiple sensors. An earthquake is detected or missed depending on its magnitude and location relative to the sensor network. A low sensitivity network of sensors therefore leads to the partial or complete censoring of small magnitude seismic events. As the network is extended or upgraded over time, the censoring of small events is reduced. It is usual in earthquake catalogues for magnitudes to be reported to one decimal place; this data feature is often overlooked during statistical analyses (Marzocchi et al., 2019). Using these rounded, incomplete observations we seek to understand the tail behaviour of the magnitude distribution.

Since 1991 the Groningen region of the Netherlands has experienced induced earthquakes. These seismic events are caused by gas extraction and have relatively small magnitudes compared to tectonic events. However, they also occur at much shallower depths than their tectonic equivalents. This means that for equal magnitudes they pose a greater hazard than their tectonic counterparts because their impact is spread over a smaller spatial extent. These small earthquakes are therefore both hazardous and difficult to detect. This has led to continued investment in the geophone network around the Groningen gas field to increase detection of small earthquakes and to better understand earthquake activity in the region. Estimating high quantiles of the magnitude distribution, and quantifying their uncertainty, is instrumental to appropriate design and improvement of buildings to withstand these earthquakes.

Keywords and phrases: earthquake, extreme value methods, magnitude of completion, generalised Pareto distribution, threshold selection, partial censoring.

1.3. Magnitude of completion.
The magnitude of completion m_c is the lowest magnitude above which all earthquakes are certain to be recorded in a given area and time interval. The magnitude of completion therefore depends on the density and sensitivity of the sensor network as well as the local geology. When a sensor network changes substantially over time t, the magnitude of completion in that region can be considered as a function of time, denoted m_c(t). The magnitude of completion is not a quantity that can be determined experimentally; it must be inferred from the set of observed event magnitudes.

Existing methods for statistical estimation of a constant m_c use parametric or non-parametric methods to detect deviations from the assumed monotonicity of the magnitude distribution (Mignan and Woessner, 2012). Parametric methods typically assume an exponential magnitude distribution, based on the empirical magnitude-frequency relationship of Gutenberg and Richter (1956). Heuristic techniques are used to detect deviations from this model based on maximum curvature, goodness-of-fit, or parameter stability.

Several methods exist to estimate a spatially varying magnitude of completion (Wiemer and Wyss, 2000; Mignan et al., 2011). In contrast, little attention has been given to estimating a changing magnitude of completion over time. Where it has been considered, focus has been on temporary increases in m_c(t) due to residual vibrations following large earthquakes (Woessner and Wiemer, 2005; Utsu, Ogata and Matsu'ura, 1995). Long-term changes in m_c(t) have been considered by assuming a constant value within a pre-determined temporal partitioning (Hutton, Woessner and Hauksson, 2010) or a locally constant value estimated using a rolling window (Mignan and Woessner, 2012).

1.4. Extreme value methods.
To specify a model for earthquake magnitudes we adapt a model from extreme value theory. An asymptotic argument justifies the use of the generalised Pareto distribution (GPD) to model the excesses of a continuous random variable over a suitably chosen threshold, under weak assumptions on the distribution of that random variable (Pickands, 1975). The distribution function of a random variable Y that follows a GPD, given that it is above the threshold u, is

(1)  F(y; σ, ξ) = 1 − [1 + ξ(y − u)/σ]_+^{−1/ξ}   for ξ ≠ 0, y ≥ u,
     F(y; σ, ξ) = 1 − exp[−(y − u)/σ]             for ξ = 0, y ≥ u;

where the shape parameter ξ ∈ R, the scale parameter σ > 0, and y_+ = max(0, y). The distribution is exponential when ξ = 0, heavy-tailed when ξ > 0, and decays to a finite upper end point u − σ/ξ when ξ < 0 (Davison and Smith, 1990). The GPD generalises the Gutenberg-Richter model, in which magnitudes are independent and identically distributed (i.i.d.) exponential random variables, by allowing greater flexibility in the tail behaviour of the distribution.

Standard extreme value modelling deals with i.i.d. data, observed at regular intervals without rounding or censoring. The standard approach is to select a constant threshold u that is a fixed, high quantile of the empirical distribution. Heuristic methods are used to select an appropriate quantile; see Scarrott and MacDonald (2012) for a review. These methods can be based on the stability of parameter estimates, goodness-of-fit measures, or the mean threshold exceedance size (Coles, 2001).
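As a concrete reference for later computations, the distribution function (1) can be coded directly. The sketch below is a minimal plain-Python version; the function name and argument order are our own, not from the paper.

```python
import math

def gpd_cdf(y, u, sigma, xi):
    """Distribution function (1): Pr(Y <= y | Y > u) for a GPD with
    threshold u, scale sigma > 0 and shape xi."""
    if y < u:
        return 0.0
    z = (y - u) / sigma
    if xi == 0.0:
        return 1.0 - math.exp(-z)          # exponential special case
    t = 1.0 + xi * z
    if t <= 0.0:
        return 1.0                         # beyond the end point u - sigma/xi (xi < 0)
    return 1.0 - t ** (-1.0 / xi)
```

Note that the ξ = 0 case is the continuous limit of the ξ ≠ 0 case, so the two branches agree for small |ξ|.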
When interest lies in estimating a particular extreme value property, such as the shape parameter, an alternative strategy is to select the threshold that optimises inference for that property (Danielsson et al., 2001).

Using a constant threshold is inefficient when the data distribution changes over time. This type of change is likely to alter the quantile value above which a GPD is appropriate and cause the GPD parameters to change over time. To avoid this issue, quantiles can be estimated locally as a function of time and a global decision can be made on which quantile to use as a time-varying threshold u(t) (Eastoe and Tawn, 2009; Northrop and Jonathan, 2011).

1.5. Shortcomings of current methods.
Estimating the magnitude of completion and selecting an extreme value threshold are closely linked problems. Both aim to select a value (possibly time-varying) above which a probability model is appropriate. Standard methods from either setting do not meet our modelling needs, for the reasons that follow.

Methods assuming an exponential magnitude distribution are problematic for two reasons. Firstly, an exponential tail model can lead to bias and false confidence in quantile estimates. Coles and Pericchi (2003) demonstrated in a hydrological context the benefits of using the encompassing generalised Pareto model to properly represent uncertainty in the tail shape. Secondly, the exponential distribution does not account for rounding of the data, resulting in biased parameter estimates (Marzocchi et al., 2019; Rohrbeck et al., 2018). Failing to acknowledge this rounding can therefore also cause bias in threshold selection.

Methods to select a static threshold are also unsuitable for our problem. To obtain precise estimates of the GPD parameters and high quantiles, as much data as possible should be used in the analysis. However, this must be balanced by the need to represent model uncertainty and avoid bias from incorrectly including small magnitude events. This bias has two sources: using either data values for which the extreme value model does not apply or values that are below the magnitude of completion at the time of their occurrence. The optimal choice of time-varying threshold is therefore v(t) = max{m_c(t), u(t)}.
Methods for selecting a static modelling threshold v are inefficient when the true threshold varies with time, since the static threshold must satisfy v ≥ max_t v(t) and so excludes viable data from the analysis.

Finally, current approaches to selecting or estimating time-varying thresholds are also unsuitable for our problem; methods for estimating m_c(t) consider only a small portion of the data at once, while the selection of u(t) by a local quantile approach is impeded by the temporal development of the censoring process.

1.6. Contributions and outline.
In this paper we develop an automated method to select a dynamic threshold for rounded GPD data. This is, to our knowledge, the first time that data rounding has been considered during threshold selection. Our proposed threshold selection method uses as much data as possible while guarding against the use of values where a tail model is not appropriate or observations are not complete. This threshold choice leads to more precise estimation of high magnitude quantiles, properly represents their uncertainty, and can also suggest how the magnitude of completion changes over time. The selection method is developed for earthquake data, but the core idea of the method can be applied to extreme value threshold selection more generally. We demonstrate, via simulation, the benefits of including additional, small magnitude events in an extreme value analysis to both parameter recovery and return level estimation. We go on to select dynamic thresholds for partially censored earthquake catalogues and investigate the impact of this threshold when estimating high quantiles of the magnitude distribution.

This paper is structured as follows. Section 2 describes the Groningen earthquake catalogue that motivates the proposed methodology, the model for observed magnitudes, and the novel inference for the underlying parameters. Section 3 demonstrates the benefits of including small magnitude events into an extreme value analysis. Section 4 introduces our proposed method of threshold selection. The method is applied to simulated earthquake catalogues in Section 5 and to the Groningen catalogue in Section 6. Concluding remarks are given in Section 7.
2. Motivating data and model formulation.
2.1. Data description.
We study the induced earthquakes in the Groningen region of the Netherlands from January 1st onwards, with magnitudes (in local magnitude, M_L) reported to one decimal place (KNMI, 2020).

Figure 1 shows Groningen earthquake magnitudes against both occurrence time and earthquake index, along with smoothed estimates of their mean using a generalised additive model with cubic-spline basis. Assuming that magnitudes are i.i.d. (which is supported by the exploratory analysis in Section A of the Supplementary Materials (Varty et al., 2021)) and that departures from this are due to the partial censoring of small magnitude events, the reduction in mean magnitude indicates that fewer small magnitude events were censored at later times. It is unclear whether this change in detection was sudden or gradual. The KNMI report an upper bound on m_c(t) that holds for the entire period (Dost et al., 2012). Paleja and Bierman (2016) and Dost, Ruigrok and Spetzler (2017) used a fixed temporal partitioning and conclude that for the period 2014-09-24 to 2016-09-27 the magnitude of completion was likely to be below 1.0 M_L. Since sensors have not been removed from the network, this suggests that the magnitude of completion should be less than or equal to this in the period following their analysis (i.e. to 2020 in Figure 1).

FIG 1. Full Groningen earthquake catalogue, with magnitudes reported in M_L and smoothed mean estimate; shown using natural- [left] and index-times [right].

2.2. Data model and inference.
This section introduces our notation and data model for threshold selection and inference on extreme earthquake magnitudes. We define an earthquake catalogue to be the set of n recorded time-magnitude pairs {(t_i, x_i) : i = 1, …, n}, where the recorded magnitudes x = (x_1, …, x_n) are given rounded to the nearest δ (δ > 0) and the event times t = (t_1, …, t_n) are each within the observation interval (t_min, t_max). The unrounded magnitudes associated with each event are represented by the vector y = (y_1, …, y_n). An event (t_i, x_i) therefore corresponds to an earthquake of magnitude y_i ∈ (x_i − δ, x_i + δ] that occurred at time t_i and that was not censored.

Recall from Figure 1 that earthquake intensity is not constant over the observation period. To separate exposition of our threshold selection method from estimation of this temporally-varying earthquake rate, we map each event time to its corresponding index. This transforms event times from an irregular sequence on the natural timescale t to a regular sequence on the index scale τ, where observed events occur at τ = 1, …, n. A modelling threshold v(τ) is then specified for the transformed observation period τ ∈ (0, τ_max), and the threshold values at each event time are given by the vector v = (v(1), …, v(n)) = (v_1, …, v_n). The threshold function v(τ) and threshold vector v will be treated as known until threshold selection is discussed in Section 4.

The probability α(τ, y) that an event is detected by the sensor network and included in the earthquake catalogue is an unknown function of its time and magnitude. It is expected that for the Groningen catalogue α(τ, y) is a non-decreasing function in each of τ and y; larger or later earthquakes are more likely to be detected.
We make two assumptions on α(τ, y): firstly, that observation is complete above the modelling threshold, so that α(τ, y) = 1 for y ≥ v(τ); secondly, that censoring begins gradually, so that for all τ, α(τ, y) ≈ 1 when y ∈ [v(τ) − δ, v(τ)]. This allows rounded magnitudes within δ of the modelling threshold to be included during inference without constructing a full model for the censoring process.

In constructing our model for magnitudes exceeding v(τ), we assume that the unrounded magnitudes y may be modelled as i.i.d. GPD random variables (Y_1, …, Y_n) with parameters θ = (σ_u, ξ) when they exceed a constant, lower threshold u < min_τ v(τ) − δ. Formally, we assume Y_i − u | Y_i > u ~ GPD(σ_u, ξ). Since events exceeding v(τ) are never censored, excess magnitudes of v(τ) may also be modelled using a GPD but with threshold-dependent scale parameters σ_{v_i} = σ_u + ξ(v_i − u), so that Y_i − v_i | Y_i > v_i ~ GPD(σ_{v_i}, ξ).

When using this probability model to construct a likelihood function for the GPD parameters, rounded magnitudes x_i should contribute only if the latent value y_i > v_i. Events with x_i > v_i + δ should certainly contribute to the likelihood function and events with x_i < v_i − δ certainly should not. When |x_i − v_i| < δ it is uncertain whether y_i > v_i and whether event i should contribute to the likelihood. Each event is therefore weighted in the log-likelihood by w_i = Pr(Y_i > v_i | x_i, θ), the probability that it truly exceeds v(τ). This is equivalent to using the expected likelihood over all possible unrounded magnitude vectors.
The resulting log-likelihood function for the parameters θ = (σ_u, ξ) of the GPD distribution function F in (1) is

(2)  ℓ(θ | x, v) = Σ_{i=1}^n w_i log Pr(X_i = x_i | Y_i > v_i, θ)
                 = Σ_{i=1}^n w_i log Pr(max(v_i, x_i − δ) < Y_i < x_i + δ | Y_i > v_i, θ)
                 = Σ_{i=1}^n w_i log [F(x_i + δ − v_i; σ_{v_i}, ξ) − F(max(v_i, x_i − δ) − v_i; σ_{v_i}, ξ)],

where

(3)  w_i = Pr(max(v_i, x_i − δ) < Y_i < x_i + δ | θ) / Pr(x_i − δ < Y_i < x_i + δ | θ)
         = [F(x_i + δ − u; σ_u, ξ) − F(max(v_i, x_i − δ) − u; σ_u, ξ)] / [F(x_i + δ − u; σ_u, ξ) − F(x_i − δ − u; σ_u, ξ)].

The maximum likelihood estimate θ̂ can be found using numerical optimisation of this function. Confidence intervals may be obtained based on asymptotic normality, but this approximation can be poor for the estimated shape parameter ξ̂ and quantile values. To avoid this, and to ensure that confidence bounds on σ̂_u are positive, we use a parametric bootstrap approach to describe parameter uncertainty, as described in Section B of the Supplementary Materials (Varty et al., 2021).
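A sketch of how the weighted log-likelihood (2)-(3) might be evaluated; the function and variable names are illustrative, not from the paper, and in practice the negated value would be passed to a numerical minimiser to obtain θ̂.

```python
import math

def gpd_cdf(y, u, sigma, xi):
    """GPD distribution function (1): Pr(Y <= y | Y > u)."""
    if y < u:
        return 0.0
    z = (y - u) / sigma
    if xi == 0.0:
        return 1.0 - math.exp(-z)
    t = 1.0 + xi * z
    return 1.0 if t <= 0.0 else 1.0 - t ** (-1.0 / xi)

def log_likelihood(theta, x, v, u, delta):
    """Weighted log-likelihood (2)-(3) for rounded magnitudes x with
    per-event threshold values v, latent threshold u and rounding
    half-width delta; theta = (sigma_u, xi)."""
    sigma_u, xi = theta
    ll = 0.0
    for xr, vi in zip(x, v):
        if xr < vi - delta:                 # certainly below v_i: excluded
            continue
        sigma_v = sigma_u + xi * (vi - u)   # threshold-dependent scale
        lo, hi = max(vi, xr - delta), xr + delta
        # weight (3): probability the latent magnitude truly exceeds v_i
        num = gpd_cdf(hi, u, sigma_u, xi) - gpd_cdf(lo, u, sigma_u, xi)
        den = gpd_cdf(hi, u, sigma_u, xi) - gpd_cdf(xr - delta, u, sigma_u, xi)
        w = num / den if den > 0.0 else 0.0
        # contribution (2): rounding-interval probability above v_i
        p = gpd_cdf(hi, vi, sigma_v, xi) - gpd_cdf(lo, vi, sigma_v, xi)
        if w > 0.0 and p > 0.0:
            ll += w * math.log(p)
    return ll
```

Events with x_i > v_i + δ get weight w_i = 1 automatically, since then max(v_i, x_i − δ) = x_i − δ and the numerator and denominator of (3) coincide.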
3. Motivating the inclusion of small magnitudes.
3.1. Simulation study overview.
Here we show that using a non-constant threshold to include additional, small magnitude earthquakes in an extreme value analysis can be beneficial to both parameter and quantile estimation. We compare three approaches to inference on 1000 simulated earthquake catalogues that have a known, stepped threshold. Each catalogue is simulated by first generating 1000 latent magnitudes as independent GPD exceedances of a threshold u with parameters θ = (σ_u, ξ). Each event i = 1, …, 1000 is censored if τ_i ≤ 500 and y_i < 1.65 M_L. The retained magnitudes are then rounded to the nearest δ = 0.1 M_L, resulting in a catalogue of the form shown in Figure 2 (left). The size of the retained catalogue depends on the simulated magnitudes, and so varies between catalogues.

A GPD model is fitted to each of the simulated catalogues by maximising the log-likelihood (2) under each of three approaches. The first, conservative approach to inference uses only exceedances of the flat modelling threshold v(τ) = 1.65 M_L for 0 ≤ τ ≤ 1000. The second approach uses exceedances of the stepped threshold, where v(τ) = 1.65 M_L for 0 ≤ τ ≤ 500 and a lower constant level for 500 < τ ≤ 1000. The number of data points used by the stepped approach will be at least as large as by the conservative approach. A third approach, possible in simulation but not in practice, is also considered. In this third approach, additional earthquakes are simulated above the conservative level to extend the simulated catalogue until the number of exceedances of 1.65 M_L matches the number of events used by the stepped approach. A GPD model is then fitted to the extended set of earthquakes that exceed 1.65 M_L.

We compare the three approaches to inference in terms of parameter and quantile estimation. The conclusion of each comparison can differ because of the non-linear relationship between GPD parameters and quantiles, which are also sensitive to small changes in the estimated shape parameter ξ.
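The catalogue-generation scheme described above can be sketched as follows. The censoring level 1.65 M_L and the change point at index 500 follow the description of Figure 2; the GPD parameter values below (u, sigma, xi) are illustrative placeholders, not the values used in the paper.

```python
import math
import random

def rgpd(n, u, sigma, xi, rng):
    """Sample n GPD exceedances of u by inverting distribution function (1)."""
    out = []
    for _ in range(n):
        p = rng.random()
        if xi == 0.0:
            out.append(u - sigma * math.log(1.0 - p))
        else:
            out.append(u + (sigma / xi) * ((1.0 - p) ** (-xi) - 1.0))
    return out

def simulate_catalogue(n=1000, u=1.05, sigma=0.5, xi=-0.1,
                       step_at=500, censor_below=1.65, delta=0.1, seed=1):
    """Stepped-threshold catalogue: events with index <= step_at and latent
    magnitude < censor_below are censored; survivors are rounded to the
    nearest delta. Returns a list of (index, rounded magnitude) pairs."""
    rng = random.Random(seed)
    latent = rgpd(n, u, sigma, xi, rng)
    catalogue = []
    for tau, yi in enumerate(latent, start=1):
        if tau <= step_at and yi < censor_below:
            continue                        # censored: never recorded
        mag = round(round(yi / delta) * delta, 10)
        catalogue.append((tau, mag))
    return catalogue
```

As in the text, the retained catalogue size is random: it depends on how many of the first 500 latent magnitudes fall below the censoring level.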
Parameter estimates are compared using their bias and variance over the 1000 simulated catalogues. To be able to compare quantile estimates across modelling thresholds we consider the conditional quantiles above the conservative threshold level, using conditional return levels. The conditional p-quantile above some magnitude c > u is the magnitude y_{p,c} that satisfies

Pr(Y ≤ y_{p,c} | Y > c) = p.

Letting ζ_c = Pr(Y > c | Y > u) = 1 − F(c; θ), where F is the distribution function (1), y_{p,c} can be expressed as a function of θ:

(4)  y_{p,c}(θ) = u + (σ_u/ξ) [ (ζ_c(1 − p))^{−ξ} − 1 ]   for ξ ≠ 0,
     y_{p,c}(θ) = u − σ_u log(ζ_c(1 − p))                  for ξ = 0.

An alternative representation of conditional quantiles, more in keeping with the extreme value approach, is the m-event conditional return level above c. This can be found by setting p = 1 − 1/m in equation (4) and interpreted as the magnitude exceeded (on average) by one in every m events that exceed c. We compare point estimates and confidence intervals of conditional return levels under the three approaches to inference.

3.2. Simulation study results.
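The m-event conditional return level can be computed by inverting the GPD survivor function directly: the target survivor probability above u is ζ_c/m. This is an independent sketch under the model stated above, with illustrative names.

```python
import math

def gpd_survivor(y, u, sigma, xi):
    """Pr(Y > y | Y > u) under distribution function (1)."""
    z = (y - u) / sigma
    if xi == 0.0:
        return math.exp(-z)
    t = 1.0 + xi * z
    return 0.0 if t <= 0.0 else t ** (-1.0 / xi)

def conditional_return_level(m, c, u, sigma_u, xi):
    """Magnitude exceeded on average by one in every m events above c,
    i.e. the p = 1 - 1/m conditional quantile above c."""
    zeta_c = gpd_survivor(c, u, sigma_u, xi)   # Pr(Y > c | Y > u)
    q = zeta_c / m                             # target survivor prob. above u
    if xi == 0.0:
        return u - sigma_u * math.log(q)
    return u + (sigma_u / xi) * (q ** (-xi) - 1.0)
```

A useful sanity check: taking m = 1 recovers the conditioning level c itself, since the p = 0 conditional quantile above c is c.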
Figure E.1 of the Supplementary Materials (Varty et al., 2021) shows the sampling distribution of parameter estimates and an error decomposition under each approach to inference. The stepped threshold is best for parameter estimation, with the smallest bias and variance of the three approaches. The mean squared error of the stepped estimator is markedly smaller than that of the conservative estimator, mainly due to its increased precision. For comparison, artificially extending the earthquake catalogue gives a smaller reduction factor. In this example, each small magnitude event added by lowering the threshold is more than twice as valuable to parameter estimation than an additional observation above the conservative level.

FIG 2. [Left] Simulated catalogue structure: events are censored (grey dots) if in the first 500 and below 1.65 M_L. For this catalogue, the conservative threshold (dashed red line) includes 181 events, while the stepped threshold (solid black line) includes 582 events. [Right] Magnitude conditional return level estimates in M_L against return period in number of earthquakes exceeding 1.65 M_L. Point estimates and confidence intervals are given under conservative, extended and stepped approaches to inference, along with the true values.

Figure 2 (right) shows the conditional return levels for magnitudes above c = 1.65 M_L under each approach. Point estimates are qualitatively similar in each case, but confidence intervals are narrowed by using the stepped rather than constant threshold. Confidence intervals are further narrowed by artificially extending the observation period. This is because of the additional large values in the extended data, which have a strong influence over the estimated return levels (Davison and Smith, 1990).

These results show clearly the benefits for parameter and quantile inference that can be achieved by using a dynamic modelling threshold to include additional small magnitude events in an extreme value analysis. Using a conservative constant threshold leads to wasteful inference and the squandering of these potential gains.
4. Threshold selection.
4.1. Overview.
In practice, the true modelling threshold v(τ) is always unknown. To choose between potential thresholds, we must define what it means for one threshold to be preferred over another. A generalised likelihood ratio test is not appropriate for this comparison because it compares nested models on the same data, rather than comparing the same model on nested data (Wadsworth and Tawn, 2012; Wadsworth, 2016).

To select a model that is robust to sampling variability, v(τ) should include as much data as possible in the model and therefore be chosen to be as low as possible. However, selecting v(τ) < max(u(τ), m_c(τ)) for any 0 < τ < τ_max will cause bias in the fitted model, making it incapable of obtaining an asymptotically consistent estimator of the true parameter values. The best choice of v(τ) is therefore the threshold that includes the most data while maintaining a good agreement between observed threshold exceedances and the fitted GPD.

For i.i.d. continuous-valued data, the distributional agreement with a probability model can be assessed graphically by using a PP- or QQ-plot and adding tolerance intervals to show expected behaviour under that model. Alternatively, the distributional fit can be summarised using a metric, such as the Anderson-Darling or Cramer-von Mises distances (Laio, 2004). Both graphical- and metric-based approaches can be adapted for data y that are independent and continuous-valued, but which do not have a shared distribution. This is achieved by using the probability integral transform and the fitted distribution to transform the data to have a shared marginal distribution before using methods for i.i.d. data to produce plots or metric values (Heffernan and Tawn, 2001).

We further adapt these methods to handle both rounded data and parameter uncertainty, before showing how they can be used to inform selection of a modelling threshold.
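The probability-integral-transform step can be sketched as below: each exceedance of its (possibly different) threshold v_i is mapped to a common Exp(1) scale using the fitted GPD with threshold-dependent scale σ_{v_i} = σ_u + ξ(v_i − u). This follows from −log of the GPD survivor function; names are illustrative.

```python
import math

def to_exponential_margins(y, v, u, sigma_u, xi):
    """Map exceedances y_i > v_i of a varying threshold onto a shared
    Exp(1) scale via the fitted GPD: z_i = -log Pr(Y > y_i | Y > v_i)."""
    z = []
    for yi, vi in zip(y, v):
        if yi <= vi:
            continue                          # not an exceedance: dropped
        sigma_v = sigma_u + xi * (vi - u)     # threshold-dependent scale
        if xi == 0.0:
            zi = (yi - vi) / sigma_v          # exponential special case
        else:
            zi = math.log(1.0 + xi * (yi - vi) / sigma_v) / xi
        z.append(zi)
    return z
```

If the fitted parameters are correct, the returned values are i.i.d. Exp(1) and can be compared against that reference distribution with standard PP- or QQ-plots.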
In doing so, we transform to standard exponential margins because this distribution is central within the GPD family and follows the precedent of Heffernan and Tawn (2001). Alternative marginal distributions could be used; we additionally consider PP-plots, which correspond to the special case of uniform margins.

4.2. Graphical assessment.
The observed magnitudes x that exceed v(τ) do not have a shared marginal distribution when v(τ) is non-constant, and they are not continuous-valued due to their rounding. This presents challenges when trying to create a PP- or QQ-plot for exceedances of the modelling threshold v(τ). Firstly, constructing these plots using rounded values can lead to many probabilities or quantiles of equal value and the plots being difficult to interpret. The second challenge relates to observed, rounded values close to the modelling threshold, {x_i : |x_i − v_i| < δ, i = 1, …, n}; it is not known which, or how many, of these events satisfy y_i ≥ v_i and so should be included when constructing the plot.

To overcome these challenges we use simulation to construct Monte Carlo confidence intervals for the sample quantiles (or probabilities) of the unrounded threshold exceedances transformed onto shared exponential margins. The process is described in Section C of the Supplementary Materials (Varty et al., 2021) and leads to a modified plot with two sets of intervals: tolerance intervals show the expected variability of sample quantiles (or probabilities) under the fitted model, while confidence intervals show the uncertainty about the observed sample quantile values. Confidence and tolerance intervals that do not overlap suggest that the distribution of the rounded exceedances is not coherent with the fitted GPD model.

Examples of such PP- and QQ-plots are shown in Figure 3. These use the simulated catalogue shown in Figure 2 (left) and constant modelling thresholds of v(τ) = 1.85 M_L and 1.15 M_L. For this catalogue, exceedances of a flat threshold should be consistent with a GPD model only if that threshold is 1.65 M_L or greater.
For the higher threshold v(τ) = 1.85 M_L, the confidence intervals on sample probabilities and quantiles overlap with the tolerance intervals, indicating that exceedances of this threshold are consistent with the fitted GPD model. For the lower threshold v(τ) = 1.15 M_L this is not the case, with the large sample quantiles bigger than expected under the fitted model. Notice the shape of the tolerance intervals in Figure 3: the largest deviations from the line y = x are expected at central probabilities in the PP-plots and at the largest quantiles of the QQ-plots. This is important in Section 4.3, where we propose metrics to summarise these plots.

FIG 3. PP-plots [left] and QQ-plots [right] for threshold exceedance sizes shown on Exp(1) margins for constant modelling thresholds v(τ) = 1.85 M_L [top] and v(τ) = 1.15 M_L [bottom]. 95% tolerance intervals are shown as grey regions, while confidence intervals on each probability or quantile are shown as vertical lines. These are coloured red (blue) where the confidence interval is entirely above (below) the tolerance interval.

4.3. Metric-based assessment.
Using a metric rather than a graphic to assess the distributional coherence of modelled and observed threshold exceedances facilitates the comparison of many thresholds. We therefore aim to summarise the PP- and QQ-plots using metrics that reward accurate estimation of the magnitude distribution function. An unbiased estimate results in a plot that covers the line y = x, while a precise estimate results in plots that are stable between sampled values for the mle θ̂ and unrounded data y. Our approach to creating a metric that summarises these plots is novel in its design, which rewards large sample sizes through their effect to increase the precision of the distribution estimate.

We propose four metrics to summarise deviation from the line y = x in PP- and QQ-plots, using the mean absolute distance and mean squared distance in what follows. The calculation of these metrics is described below for a single sampled vector of threshold exceedances on exponential margins z̃. Let d be the realised metric value for an arbitrary dataset using one of the four methods, and let d̄ = E_{Y, θ̂ | x, v}(d) be the expected value of d over the joint distribution of Y, θ̂ | x, v, thus accounting for the rounding and parameter uncertainties that are represented by the confidence intervals of Figure 3. We select a modelling threshold by minimising d̄ and investigate which choice of d is best. Here the expected values of the metrics are calculated by a Monte Carlo approximation.

Smaller values of each metric are preferable, with large values caused by the quantiles of the fitted model being either highly uncertain or incoherent with the observed data. Minimising these metrics provides a new approach to threshold selection, which rewards thresholds that give low sampling variability and small bias in the resulting estimator.
The remainder of this section covers the calculation of these metrics, while Section 5 explores their relative performance on simulated data.

In the following, z̃^{(i)} is the i-th parametric-bootstrapped vector of threshold exceedances transformed onto exponential margins, for independent, replicated samples i = 1, ..., k. An algorithm to obtain these is given in Section C of the Supplementary Materials (Varty et al., 2021). Also let H^{(i)}(y): ℝ⁺ → [0, 1] and Q^{(i)}(p): [0, 1] → ℝ⁺, respectively, be the empirical distribution function and the sample quantile function of z̃^{(i)} for i = 1, ..., k. The sample quantile functions are defined as linear interpolations of the points {((j − 1)/(ñ^{(i)} − 1), z̃^{(i)}_{(j)}): j = 1, ..., ñ^{(i)}}, where ñ^{(i)} is the length of z̃^{(i)} and z̃^{(i)}_{(j)} is the j-th order statistic of z̃^{(i)}.

The quantile-based distance metrics d^{(i)}(q,1) and d^{(i)}(q,2) summarise the expected deviation in the QQ-plot of z̃^{(i)} from the line y = x at a set of m ∈ ℕ⁺ equally spaced evaluation probabilities {p_j = j/(m + 1): j = 1, ..., m}. The two metrics respectively give the mean absolute distance and mean squared distance between model and sample quantiles over the set of evaluation probabilities. They are given by
d^{(i)}(q,1) = (1/m) Σ_{j=1}^{m} | −log(1 − p_j) − Q^{(i)}(p_j) |

and
d^{(i)}(q,2) = (1/m) Σ_{j=1}^{m} ( −log(1 − p_j) − Q^{(i)}(p_j) )².

In a PP-plot the variance of deviations from the line y = x is greatest when p_j = 0.5 and shrinks to 0 as p_j approaches 0 or 1. In the PP-based metrics we therefore weight the sum of the deviations to account for large discrepancies being less surprising for central probabilities. The metrics d^{(i)}(p,1) and d^{(i)}(p,2) are therefore calculated using, respectively, the weighted-absolute and weighted-squared errors:
d^{(i)}(p,1) = (1/m) Σ_{j=1}^{m} [ p_j(1 − p_j) / √ñ^{(i)} ]^{−1/2} | p_j − H^{(i)}(−log(1 − p_j)) |

and
d^{(i)}(p,2) = (1/m) Σ_{j=1}^{m} [ p_j(1 − p_j) / √ñ^{(i)} ]^{−1/2} ( p_j − H^{(i)}(−log(1 − p_j)) )².

These deviations are again measured at the equally spaced evaluation probabilities p_1, ..., p_m. In the quantile-based metrics the weighting is handled implicitly by choosing equally spaced evaluation probabilities, which gives dense evaluation where discrepancies from y = x are expected to be small and sparse evaluation where they are expected to be large. In this way, the weights reflect the width of the tolerance intervals in Figure 3.

Uncertainties in the estimated GPD parameters, the size of the exceedance set and the values of the unrounded exceedances should all be accounted for when using a metric to select a modelling threshold. This can be achieved by calculating the distance metrics for each of k realisations of the vector z̃, where each uses one of k bootstrap parameter estimates of θ̂. The expected metric values over these realisations are denoted by d̄(a, b), where a ∈ {p, q} and b ∈ {1, 2}. The expected distance metric d̄(q,1) is defined as:
d̄(q,1) = (1/k) Σ_{i=1}^{k} d^{(i)}(q,1),

with the other expected distance metrics defined similarly.

4.4. Minimisation procedure.
To select the most appropriate threshold, the threshold parameters which minimise the selected metric d̄ must be found. Standard, gradient-based optimisation procedures are not well suited to this task because the censoring mechanism can cause multiple local minima, and the Monte Carlo evaluation leads to local roughness over parameter values. When using a simple parametric form for the threshold, such as a constant or stepped threshold (where the change location is known), a simple grid search can be used to overcome these issues and find the threshold parameters that minimise the metrics. For more complex threshold forms, with a higher-dimensional parameter space to optimise over, a grid search becomes prohibitively expensive.

To find the threshold parameter set for more complicated thresholds we explore the threshold parameter space in a more principled manner. To do this we use Bayesian optimisation (Snoek, Larochelle and Adams, 2012) as implemented in the R package ParBayesianOptimization (Wilson, 2020). The optimisation procedure begins by evaluating d̄ at a small initial collection of randomly chosen parameter vectors within a bounded search space. Based on the resulting metric values, future evaluation points are selected sequentially as the parameter vector with the greatest expected reduction in d̄ as compared to the current best value. This search method balances evaluations between parts of the parameter space where the metric is known to have low values and parts where it is most uncertain.

Bayesian optimisation is a heuristic search method but has been shown in other applications to find good parameter combinations using a relatively small number of function evaluations (Shahriari et al., 2015). To establish its suitability in our setting we compared Bayesian optimisation to a grid search for two sub-problems: catalogues with a flat threshold, and catalogues with a stepped threshold with known change location.
In both cases Bayesian optimisation performed favourably compared to grid search, selecting thresholds close to the true value at a lower computational cost. We do not claim that Bayesian optimisation is the best method for optimising the proposed metrics over threshold parameters, only that it appears to be an efficient method of finding good thresholds.
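To make the construction of the quantile-based metric concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of d(q,1) for a single vector of exceedances on Exp(1) margins. In the full procedure this value would be averaged over k parametric-bootstrap samples to approximate d̄(q,1), and that average minimised over candidate thresholds.

```python
import numpy as np

def d_q1(z, m=100):
    """Mean absolute QQ-distance d(q,1): compare the sample quantiles of the
    exceedances z (assumed transformed to Exp(1) margins) with the Exp(1)
    model quantiles at m equally spaced evaluation probabilities."""
    p = np.arange(1, m + 1) / (m + 1)      # p_j = j / (m + 1)
    model_q = -np.log(1 - p)               # Exp(1) quantile function
    sample_q = np.quantile(z, p)           # linear-interpolation sample quantiles
    return np.mean(np.abs(model_q - sample_q))

# Toy check: a sample that really is Exp(1) scores lower than one that is not.
rng = np.random.default_rng(1)
well_specified = rng.exponential(size=2000)
mis_specified = rng.exponential(size=2000) ** 2
assert d_q1(well_specified) < d_q1(mis_specified)
```

In the constant-threshold case, minimising the Monte Carlo average of this quantity over a grid of candidate thresholds is a one-dimensional search; the Bayesian optimisation above replaces the grid when the threshold has several parameters.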
5. Threshold selection on simulated catalogues.
5.1. Simulation study overview.
We consider the performance of the proposed threshold selection metrics on a collection of simulated data sets with either constant or stepped threshold forms. This simulation study illustrates the effectiveness of our method and establishes which of the distance metrics proposed in Section 4 is best.

We attempt to select the most appropriate threshold from a set of candidate thresholds. Two censoring types (hard and phased) are considered for magnitudes that are below the modelling threshold. For hard censoring, all simulated continuous magnitudes below the modelling threshold are undetected. In phased censoring the detection probability of each event, α(y_i, v_i) = exp(−λ[v_i − y_i]_+), decreases as the simulated continuous magnitude falls further below the threshold, as controlled by the parameter λ > 0. The particular choices of exponential decay and the value of λ are arbitrary, but were chosen to reflect, in a broad sense, the censoring observed in the Groningen earthquake catalogue. Note that either of these censoring types can result in some rounded magnitudes that are below the threshold even though their simulated continuous values are above the threshold.

5.2. Constant threshold, hard censoring.
We first use the four proposed metrics to select a constant threshold for 1500 simulated i.i.d. GPD exceedances of the constant threshold v(τ) = 0.42M_L, hard-censored below v(τ) and rounded to the nearest 0.1M_L. We first consider the metrics for a single dataset. Expected metric values are calculated at the 41 equally spaced, constant candidate thresholds shown in Figure 4.

FIG 4. Flat threshold selection on a simulated catalogue. Top row: expected mean absolute [left] and expected mean squared [right] QQ-distances against threshold value. Bottom row: expected PP-distance metrics based on absolute [left] and squared [right] errors against threshold. Selected and true thresholds are indicated by solid black and dashed red lines.

The candidate threshold selected by minimising d̄(q,1) is the closest threshold on the grid to the true value. This metric also appears to provide the most clearly defined minimum, indicating that it penalises thresholds that are either too low or too high. All four metrics show clear increases in metric value for candidate thresholds that are too low, but not when the candidate threshold is too high. The probability-based metrics do not increase greatly when the candidate threshold is too high, and so fail to adequately reward the inclusion of valid events with smaller magnitudes. This is presumably because they do not sufficiently penalise the increased uncertainty in the estimated parameters when using a higher threshold.

When selecting a constant threshold, the standard approach is to exploit the well-established property that the GPD shape parameter is invariant to threshold choice (Coles, 2001).
Point estimates and confidence intervals for ξ were obtained using exceedances of each candidate threshold, accounting for the rounding of observations. The confidence intervals for ξ overlap for all candidate thresholds above 0.275M_L, and so by the parameter stability method any greater threshold is also valid. The thresholds chosen by our proposed method are therefore consistent with the parameter stability method, but are preferable in that the selected thresholds are not below the true level. Our proposed selection method is also more general; it allows comparison of many non-constant thresholds without the need for subjective and time-consuming interpretation of parameter stability plots.

Figure 5 presents the sampling distribution and RMSE of the thresholds selected from the candidate set by each of the QQ-based metrics over 500 replicated datasets, simulated as previously described. The thresholds chosen by the PP-based metrics are shown in Figure E.2 of the Supplementary Materials (Varty et al., 2021) and are frequently much higher than the true value, resulting in higher RMSE values of 0.34 for d̄(p,1) and 0.12 for d̄(p,2). The metric d̄(q,1) has the lowest RMSE and so appears to be the best of the proposed metrics in this case. All metrics have a tendency to overestimate the threshold value; this is likely to be attributable to the hard censoring process. We therefore also consider the performance of each metric using catalogues with phased censoring.

5.3. Constant threshold, phased censoring.
To assess the performance of each metric on simulated catalogues with phased censoring, we consider the thresholds selected by each metric for each of 500 simulated catalogues. For each catalogue, 2400 i.i.d. GPD exceedances of 0M_L were simulated. Each exceedance was retained with probability α(y_i, v_i), as defined in Section 5.1, with v(τ) = 0.42M_L and λ = 7. This combination of simulated catalogue size and censoring parameter gave an average catalogue size of 1500 recorded values, similar to those in Section 5.2.

FIG 5. Sampling distribution of threshold selection methods for quantile-based metrics over 500 simulated catalogues with constant threshold and hard censoring. The true threshold is shown by a dashed red line and the root mean squared error (RMSE) for each method is given in the plot titles: d(q,1), RMSE = 0.04; d(q,2), RMSE = 0.07.

The resulting RMSEs in threshold selection over these 500 catalogues were: 0.06 for d̄(q,1), 0.08 for d̄(q,2), 0.35 for d̄(p,1), and 0.12 for d̄(p,2). For all metrics the RMSE is slightly increased compared to hard censoring, as threshold selection is made more difficult by the retention of some events that are truly below the threshold. As with hard censoring, the metrics d̄(p,1) and d̄(p,2) were prone to selecting conservative threshold values and d̄(q,1) resulted in the lowest RMSE. Unlike for hard censoring, the sampling distributions of selected thresholds now cover the true threshold values; this is shown in Figure E.4 of the Supplementary Materials (Varty et al., 2021). Similar selection properties for each metric were seen when considering more complex threshold forms, and so further exposition is limited to the metric d̄(q,1); we subsequently refer to d̄ = d̄(q,1).

5.4. Non-constant threshold selection.
Here catalogues are simulated by generating 4000 i.i.d. GPD exceedances of 0M_L and censoring (either hard or phased) based on a stepped threshold taking the value v(τ) = 0.83M_L before the change point and v(τ) = 0.42M_L after it; see Figure 6, where λ = 7.

FIG 6. Example simulated catalogues, plotting magnitude against event index, with hard censoring [left] and phased censoring [right] for stepped thresholds of (v^{(1)}, v^{(2)}) = (0.83, 0.42), shown as a red line, and phasing parameter λ = 7.
We considered threshold selection behaviour over 500 earthquake catalogues simulated using the above change-point threshold, for each of hard and phased censoring. Note that the number of retained events and the threshold change location τ* within these will vary between simulations, because they both depend on the simulated event magnitudes and on how many of these are retained. However, in each case the true value of τ* is known.

For each simulated catalogue we selected a threshold of the form v(τ) = v^{(1)} for 0 < τ ≤ τ* and v(τ) = v^{(2)} for τ* < τ < τ_max, where the threshold parameters (v^{(1)}, v^{(2)}, τ*) are unknown. Threshold parameters were selected using the Bayesian optimisation method of Section 4.4 to minimise the metric d̄. The sampling distributions of the errors in the selected threshold parameters are shown in Figure 7, where it can be seen that our threshold selection method regularly recovers the non-constant modelling threshold to within δ/2 of its true value.

FIG 7. Marginal sampling distributions of errors in the selected values of v^{(1)} [left], v^{(2)} [centre] and τ* [right] for 500 simulated catalogues with change-point type thresholds and hard (top row) or phased (bottom row) censoring.

Specific findings vary by censoring type. For hard censoring, as would be expected, the threshold levels v^{(1)} and v^{(2)} are rarely selected to be below the true values. The error distribution of τ* has, in both cases, a mode close to 0 but with large variance.
As expected, the sampling variability of the error in each parameter is larger for phased censoring than for hard censoring, though it is reassuring to see that the distributions of selected threshold parameters are now centred on the true values. This demonstrates that the tendency to select threshold values that are too high for catalogues with hard censoring is a consequence of the censoring mechanism, not a bias in the selection method.
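The catalogue simulation used in this section can be sketched as follows. This is our own minimal illustration: the GPD parameters (sigma, xi) are arbitrary choices, not values taken from the paper, and hard censoring corresponds to replacing the retention probability alpha by an indicator of exceeding the threshold.

```python
import numpy as np

def simulate_catalogue(n, v1, v2, lam=7.0, sigma=0.4, xi=-0.1,
                       delta=0.1, seed=0):
    """Simulate n i.i.d. GPD(sigma, xi) exceedances of 0 ML (xi != 0 assumed),
    apply phased censoring below a stepped threshold (v1 then v2, changing at
    index n // 2) with retention probability exp(-lam * max(v - y, 0)), and
    round the retained magnitudes to the nearest delta."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    y = sigma / xi * ((1.0 - u) ** (-xi) - 1.0)   # GPD inverse-CDF sampling
    v = np.where(np.arange(n) < n // 2, v1, v2)   # stepped detection threshold
    alpha = np.exp(-lam * np.clip(v - y, 0.0, None))
    kept = rng.uniform(size=n) < alpha
    return np.round(y[kept] / delta) * delta

mags = simulate_catalogue(4000, v1=0.83, v2=0.42)
```

Because detection is probabilistic and magnitudes are rounded, some retained rounded magnitudes fall below the threshold, reproducing the data feature discussed in Section 5.1.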
6. Application to Groningen earthquakes.
6.1. Validating the data model for the Groningen catalogue.
We compare GPD and exponential models for Groningen earthquake magnitudes. Rohrbeck et al. (2018) and Marzocchi et al. (2019) demonstrated the importance of acknowledging rounding of observations, and so this is accounted for within the inference for both models. We focus on earthquakes exceeding the constant conservative threshold of 1.45M_L, subsequently referred to as v_C. This is the magnitude of completion stated by the KNMI (Dost et al., 2012), adjusted to account for rounding.

FIG 8. Modified PP [left] and QQ [right] plots for Groningen magnitudes exceeding 1.45M_L under the GPD model. Grey regions show tolerance intervals while vertical lines show confidence intervals on sample probabilities/quantiles. All confidence intervals overlap with the associated tolerance intervals.

Both the GPD and exponential models assume that magnitudes are i.i.d.; this is supported by our exploratory analysis of the Groningen catalogue in Section A of the Supplementary Materials (Varty et al., 2021). The two models may be compared by considering the sampling distribution of the estimated shape parameter under a GPD model, because the exponential model is a special case of the GPD where ξ = 0. Fitting a GPD to the 311 exceedances of v_C leads to point estimates and bootstrap confidence intervals for (σ, ξ). Since the confidence interval for ξ covers 0, the exponential model cannot be discounted at the 5% significance level using only exceedances of v_C. A second method of comparison is to fit both an exponential and a GPD model to exceedances of v_C and, appealing to the asymptotic distribution of the MLE, perform a likelihood ratio test.
This produces a likelihood ratio of 1.04 and an associated p-value of 0.214, leading us to draw the same conclusion from both comparisons: there is insufficient evidence to conclude that the Groningen magnitudes deviate from the Gutenberg-Richter law when using only exceedances of v_C.

However, if an exponential magnitude model is assumed then the uncertainty about ξ is ignored. This has the effect of dramatically, but artificially, narrowing the confidence intervals on the estimated magnitude quantiles, as shown in Figure E.3 of the Supplementary Materials (Varty et al., 2021). The potential repercussions of ignoring this uncertainty are described in detail in Coles and Pericchi (2003). A GPD model should therefore be used for the underlying magnitudes, to properly represent this uncertainty when selecting a modelling threshold for the Groningen gas field.

If the rounding of observations had been ignored when fitting the GPD model, the resulting parameter estimates would not be significantly different to those using the correct likelihood, because the small number of threshold exceedances means that parameter uncertainty obscures the bias induced by neglecting to account for rounding.

Finally, in Figure 8 we check that the fitted GPD model is consistent with the empirical distribution of exceedances of 1.45M_L, through the use of the modified QQ and PP plots introduced in Section 4.2. Since the tolerance intervals and confidence intervals overlap for both the sample quantiles and sample probabilities, we conclude that a GPD model is appropriate for Groningen rounded earthquake magnitudes exceeding 1.45M_L.

6.2. Parametric threshold forms.
Now we select thresholds of two parametric forms for the Groningen catalogue and explore the results of the subsequent inference. The first is a constant threshold, v(τ) = v, where the level v is to be chosen. This will allow us to assess the level of conservatism in the conventional modelling threshold, where v = 1.45M_L. The second form is a sigmoid-type threshold v(τ) = v_R + (v_L − v_R)Φ([µ − τ]/ς), with parameters (v_L, v_R, µ, ς) ∈ ℝ³ × ℝ⁺ and where Φ is the standard Gaussian distribution function. This extends the idea of the change-point threshold to accommodate smooth change in the threshold value, centred on µ. The threshold parameters may be interpreted as follows. The left and right asymptotic levels of the threshold are given by v_L and v_R, µ is the index-time at which the threshold takes the value (v_L + v_R)/2, and ς controls how rapidly the threshold changes about µ, with ς → 0 corresponding to a step change. In the context of the Groningen catalogue we expect that v_R < v_L.

FIG 9. [Left] Grid search to minimise d̄(q,1) over threshold values v. Metric values are shown on a log-scale and vertical lines mark the edges of magnitude rounding intervals. [Right] Point estimates (solid lines) and confidence intervals (dashed lines) for the conditional return levels for exceedances of 1.45M_L, using the conservative (black) and selected (red) thresholds. Sample conditional return levels are shown in blue.

6.3. Threshold selection.
6.3.1. Constant threshold.
A grid search was used to find the flat threshold that minimises the metric d̄, as shown in Figure 9. There are two local minima, with the higher threshold value giving the global minimum. For thresholds above the higher minimum, including the conservative threshold of 1.45M_L, the metric values increase as not all viable data are utilised. For thresholds below the lower minimum the metric also increases, as the validity of the tail model breaks down. The small peak between these minima is likely attributable to the reduction of the magnitude of completion m_c over time. In Figure 1 we saw that fewer small-magnitude events are censored at later times. The higher minimum uses less data to achieve good distributional agreement for the entire period, while the lower minimum compromises on the distributional agreement at early times to retain a larger proportion of the data. As the threshold is lowered between the two minima, enough additional data are added to more than compensate for the reduced goodness-of-fit in the early part of the observation period, and so the metric value reduces. Since the global minimum corresponds to the more conservative threshold, we select it as our constant modelling threshold.

6.3.2. Sigmoid threshold.
Bayesian optimisation was used to find the sigmoid threshold parameters (v_L, v_R, µ, ς) that minimise the metric d̄, where the search space was constrained to a bounded region of the parameter space. For an initial set of 20 randomly selected threshold parameter combinations, d̄ was evaluated. A further fixed budget of 100 metric evaluations was then allocated, and the threshold with the smallest metric value retained for further inspection. To assess the sensitivity of the selected threshold to the set of initial evaluation points, this was repeated for five initial parameter combination sets.

The thresholds with the lowest values of d̄ based on each initialisation are shown in Figure 10 (left). The selected threshold values at the ends of the observation interval appear to be stable across initialisations, but the transition between these levels is not. Further investigation supports the stability of the end levels; the blue and turquoise thresholds have significantly greater metric values than the other thresholds, suggesting that these initialisations had too few evaluations to explore beyond a local minimum. These conclusions are consistent with the simulation study of Section 5.4, illustrating that threshold levels are more easily estimated than the change between those levels.
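For concreteness, the sigmoid threshold form of Section 6.2 can be evaluated as below. This is a sketch with made-up parameter values, not the selected Groningen threshold.

```python
import math

def sigmoid_threshold(tau, v_L, v_R, mu, varsigma):
    """v(tau) = v_R + (v_L - v_R) * Phi((mu - tau) / varsigma), where Phi is
    the standard Gaussian CDF: the threshold decays smoothly from v_L to v_R,
    centred on index-time mu; varsigma -> 0 recovers a step change at mu."""
    phi = 0.5 * (1.0 + math.erf((mu - tau) / (varsigma * math.sqrt(2.0))))
    return v_R + (v_L - v_R) * phi

# Illustrative values only: the threshold sits halfway between its end
# levels at tau = mu, as stated in the parameter interpretation above.
assert abs(sigmoid_threshold(500, v_L=1.25, v_R=0.75, mu=500, varsigma=50) - 1.0) < 1e-9
```

Each Bayesian-optimisation evaluation scores one such parameter vector by Monte Carlo approximation of d̄ under the implied censoring.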
FIG 10. Selected sigmoid thresholds using Bayesian optimisation from 5 random initial parameter sets, plotted as magnitude against index or date. [Left] Optimising over all threshold parameters. [Centre, right] Optimising over (µ, ς) with (v_L, v_R) fixed at the shared end levels, on index- (centre) and natural- (right) timescales. Colours are comparable only between the centre and right plots. Dashed horizontal lines show the conservative threshold value. Important dates relating to the development of the Groningen seismic detection network are shown as vertical lines: (A) development begins, (B) first additional sensors activated, (C) upgrade complete.
A second Bayesian optimisation was performed, fixing the end levels of the sigmoid threshold to those shared by the best performing thresholds in the previous optimisation. This reduces the dimension of the parameter space and simplifies the optimisation task. Using the same procedure as for the unconstrained optimisation, the resulting selected thresholds from each initialisation are shown in Figure 10 (centre). Upon repeated Monte Carlo evaluation of the metric value for each of these thresholds, there is insufficient evidence to select one over the others. When transformed onto the natural timescale, as shown in Figure 10 (right), the selected thresholds are all consistent with the known dates at which sensor installation occurred. This shows that, from the earthquake catalogue alone, our method is able to detect the starting and ending threshold levels and the period in which the threshold changed. However, we cannot identify precisely the way in which the threshold changed during the installation period. This is not a major setback, since between the most and least conservative of the chosen thresholds (turquoise and red in the centre and right panels of Figure 10) the expected number of observations above the threshold differs by only 50 earthquakes. We fitted the GPD model using each of these five threshold functions, reaching similar conclusions, and so present further results for only the turquoise threshold.

6.3.3. Threshold comparison.
We compare the conservative, selected constant, and selected sigmoid thresholds, referred to as v̂_C, v̂ and v̂_S respectively. Comparisons are made on: the expected metric value, the number of events used to fit the GPD model, the estimated GPD parameter values, and the estimated return levels.

Metric evaluations are subject to Monte Carlo noise, so the metric value was evaluated 100 times for each threshold. The mean metric values and Monte Carlo noise intervals were: 0.091 (0.088, 0.096) for v̂_C, 0.054 (0.053, 0.055) for v̂, and 0.041 (0.039, 0.043) for v̂_S. This suggests that the model fitted using v̂_S fits the observed data best, with v̂ preferred over v̂_C. These improvements in model fit may be attributable to the increased data usage of the selected thresholds. The threshold v̂_C is at the edge of a rounding interval, so the resulting model uses 311 threshold exceedances. For v̂ and v̂_S, the rounding of magnitudes means that the exact number of exceedances is unknown; the expected numbers of exceedances under the fitted magnitude models are 629 and 702 respectively. By using either of the selected thresholds, we have more than doubled the size of the usable catalogue compared to the conservative threshold.

FIG 11. Bootstrap GPD parameter estimates based on exceedances of the conservative (black), flat (red) and sigmoid (blue) thresholds [left]. Estimated return levels in M_L and confidence intervals for magnitudes exceeding 1.45 M_L [right].

Figure 11 (left) shows the estimated parameter values under the fitted GPD model using each threshold. The uncertainty in both parameters is reduced when using v̂ rather than v̂_C, and further reduced when using v̂_S. To give a sense of scale to this reduction, we can calculate the number of additional exceedances of v̂_C to which it is equivalent, under the
assumption that the standard error of parameter estimates scales with exceedance count n as n^{-1/2}. In doing this, the additional small magnitude earthquakes included by using v̂ or v̂_S are equivalent to at least as many additional events above v̂_C. Therefore, point-for-point, the small magnitude earthquakes are at least as valuable for parameter estimation as additional data above v̂_C.

When modelling exceedances of v̂ or v̂_S, the respective point estimates and confidence intervals for the shape parameter are -0.084 (-0.168, -0.017) and -0.069 (-0.144, -0.008). Using exceedances of v̂ or v̂_S leaves only a small proportion of the sampling distribution for ξ̂ above 0. This provides empirical evidence that the Groningen magnitude distribution has a finite upper endpoint, unlike the conventional Gutenberg-Richter magnitude model. This dramatic conclusion could not be reached using the smaller dataset exceeding v̂_C, for which a substantially larger proportion of the sampling distribution for ξ̂ lay above 0.

Similar conclusions can be reached by using likelihood ratio tests to compare exponential and GPD models for exceedances of each of v̂_C, v̂ and v̂_S; the respective p-values are 0.78, 0.046, and 0.064. By using more of the available data, we have increased our ability to discern between an exponential model and the observed magnitude distribution. The conclusions that can be drawn from this test agree with, but are less strong than, those of the previous comparison: a Gutenberg-Richter magnitude model is likely inferior to a GPD. The discrepancy in conclusion strength between the two comparisons is likely due to the asymptotic assumptions of the likelihood ratio test not being met by our finite sample size.

The estimated conditional return levels above 1.45 M_L are shown for each threshold in Figure 11 (right). The estimated return levels are similar when using v̂ and v̂_S, but confidence intervals for large return periods are narrower when using v̂_S.
In either case, the return levels have both smaller point estimates and smaller uncertainties when using our threshold selection method than when using the conservative threshold. This is an important finding when deciding what measures to take in designing or retrofitting earthquake defences for buildings.
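The likelihood ratio comparison between exponential and GPD magnitude models described above can be illustrated with a minimal numerical sketch. The snippet below is not the paper's analysis: it fits the nested exponential (ξ = 0) and full GPD models to synthetic excesses with an assumed, slightly negative shape parameter, and computes a p-value from the χ²(1) reference distribution.

```python
import numpy as np
from scipy.stats import chi2, expon, genpareto

rng = np.random.default_rng(1)
# Synthetic threshold excesses from a GPD with an assumed negative shape.
z = genpareto.rvs(c=-0.08, scale=0.5, size=700, random_state=rng)

# Exponential model (GPD with xi = 0): the MLE of the scale is the sample mean.
ll_exp = expon.logpdf(z, scale=z.mean()).sum()

# Full GPD model, location fixed at zero.
c_hat, _, scale_hat = genpareto.fit(z, floc=0)
ll_gpd = genpareto.logpdf(z, c=c_hat, scale=scale_hat).sum()

# Likelihood ratio statistic and chi-squared(1) p-value for H0: xi = 0.
lr_stat = 2.0 * (ll_gpd - ll_exp)
p_value = chi2.sf(lr_stat, df=1)
print(f"LR statistic = {lr_stat:.3f}, p-value = {p_value:.4f}")
```

As the text notes, the χ²(1) reference is an asymptotic approximation, so such p-values should be read with care at finite sample sizes.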
7. Discussion / Conclusion.
This paper introduced a principled method for selecting a time-varying modelling threshold for an extreme value analysis. The effectiveness and value of using this method to include additional, less extreme events in the analysis were demonstrated through simulation studies. Although the method was developed in the context of earthquake catalogues, and to accommodate the additional challenges to inference that these pose, the core of our method applies to extreme value threshold selection more generally and we anticipate it having a much broader impact.

Using the new threshold selection method, we have been able to identify the period in which the Groningen sensor network was being improved by using the earthquake catalogue
alone. Our threshold selection method more than doubled the usable size of the Groningen earthquake catalogue compared to using the conservative threshold given by the KNMI, whilst also improving model fit. This has several important implications beyond the direct improvement to statistical inference.

The use of these additional small magnitude earthquakes leads to greater precision in the estimates of high magnitude quantiles, which is potentially a huge benefit by reducing the cost of designing, constructing or retrofitting earthquake defences. Following threshold selection, a Bayesian modelling approach would allow quantile uncertainty to be included naturally when designing defences against natural hazards (Coles and Tawn, 1996; Fawcett and Green, 2018; Jonathan et al., 2020), and estimates with greater precision can reduce the cost required to provide protection with equivalent confidence. The gain we have made in the efficiency of statistical inference translates to a tangible economic benefit from the additional data recorded by improving the sensor network. The more efficient use of the available data has allowed us to conclude, for the first time based on empirical evidence alone, that Groningen earthquake magnitudes are likely to have a light-tailed distribution. Using the conservative threshold level, this conclusion could only have been achieved by waiting many years to observe additional large magnitude earthquakes. Finally, using a less conservative modelling threshold provides a return on the substantial investment into the earthquake detection network around the Groningen gas field. When a non-constant threshold is selected, the added value of the network improvements is exploited fully and the selected modelling threshold can also offer insights into the reduction of m_c over time.

A limitation of the work is that the computational effort required to select a modelling threshold is relatively high.
We do not view this as a major drawback, since threshold selection must be performed only once during the modelling process. An area for further development would be to investigate alternative, exact methods to optimise the expected selection metric over the threshold parameters. One possible extension to our approach would be to adapt the data model to account for magnitude measurement error causing events to be recorded within incorrect rounding intervals. Another, more ambitious, extension might consider selection of a spatio-temporal threshold function to describe spatial variability as well as the temporal evolution of event detection. Finally, given the critical importance of threshold selection in extreme value methods, an extensive comparison of our proposed method with standard extreme value threshold selection methods would be a valuable piece of further work.

Acknowledgements.
This paper is based on work completed while Zak Varty was part of the EPSRC-funded STOR-i Centre for Doctoral Training (EP/L015692/1), with part-funding from Shell Research Ltd.
SUPPLEMENTARY MATERIAL
Supplement:
Further detail is given on the methods introduced in the main text. Additional plots are provided in support of our simulation studies and the analysis of Groningen earthquakes.

REFERENCES

COLES, S. G. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer, New York.
COLES, S. G. and PERICCHI, L. (2003). Anticipating catastrophes through extreme value modelling. Journal of the Royal Statistical Society: Series C (Applied Statistics).
COLES, S. G. and TAWN, J. A. (1996). A Bayesian analysis of extreme rainfall data. Journal of the Royal Statistical Society: Series C (Applied Statistics).
DANIELSSON, J., DE HAAN, L., PENG, L. and DE VRIES, C. G. (2001). Using a bootstrap method to choose the sample fraction in tail index estimation. Journal of Multivariate Analysis.
DAVISON, A. C. and SMITH, R. L. (1990). Models for exceedances over high thresholds. Journal of the Royal Statistical Society: Series B (Methodological).
DOST, B., RUIGROK, E. and SPETZLER, J. (2017). Development of seismicity and probabilistic hazard assessment for the Groningen gas field. Netherlands Journal of Geosciences.
DOST, B., GOUTBEEK, F., VAN ECK, T. and KRAAIJPOEL, D. (2012). Monitoring induced seismicity in the North of the Netherlands: Status Report 2010. KNMI scientific report.
EASTOE, E. F. and TAWN, J. A. (2009). Modelling non-stationary extremes with application to surface level ozone. Journal of the Royal Statistical Society: Series C (Applied Statistics).
FAWCETT, L. and GREEN, A. C. (2018). Bayesian posterior predictive return levels for environmental extremes. Stochastic Environmental Research and Risk Assessment.
GUTENBERG, B. and RICHTER, C. F. (1956). Earthquake magnitude, intensity, energy, and acceleration: (Second paper). Bulletin of the Seismological Society of America.
HEFFERNAN, J. E. and TAWN, J. A. (2001). Extreme value analysis of a large designed experiment: a case study in bulk carrier safety. Extremes.
HUTTON, K., WOESSNER, J. and HAUKSSON, E. (2010). Earthquake monitoring in southern California for seventy-seven years (1932-2008). Bulletin of the Seismological Society of America.
JONATHAN, P., RANDELL, D., WADSWORTH, J. and TAWN, J. (2020). Uncertainties in return values from extreme value analysis of peaks over threshold using the generalised Pareto distribution. Ocean Engineering.
KNMI (2020). Aardbevings catalogus. Accessed: 2020-05-01.
LAIO, F. (2004). Cramer-von Mises and Anderson-Darling goodness of fit tests for extreme value distributions with unknown parameters. Water Resources Research.
LITTLE, R. J. and RUBIN, D. B. (2019). Statistical Analysis with Missing Data. John Wiley & Sons.
MARZOCCHI, W., SPASSIANI, I., STALLONE, A. and TARONI, M. (2019). How to be fooled searching for significant variations of the b-value. Geophysical Journal International.
MIGNAN, A. and WOESSNER, J. (2012). Estimating the magnitude of completeness for earthquake catalogs. Community Online Resource for Statistical Seismicity Analysis.
MIGNAN, A., WERNER, M., WIEMER, S., CHEN, C.-C. and WU, Y.-M. (2011). Bayesian estimation of the spatially varying completeness magnitude of earthquake catalogs. Bulletin of the Seismological Society of America.
NORTHROP, P. J. and JONATHAN, P. (2011). Threshold modelling of spatially dependent non-stationary extremes with application to hurricane-induced wave heights. Environmetrics.
PALEJA, R. and BIERMAN, S. (2016). Measuring changes in earthquake occurrence rates in Groningen. Shell Global Solutions International BV (Amsterdam).
PICKANDS, J. (1975). Statistical inference using extreme order statistics. Annals of Statistics.
ROHRBECK, C., EASTOE, E. F., FRIGESSI, A. and TAWN, J. A. (2018). Extreme value modelling of water-related insurance claims. Annals of Applied Statistics.
SCARROTT, C. and MACDONALD, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT-Statistical Journal.
SHAHRIARI, B., SWERSKY, K., WANG, Z., ADAMS, R. P. and DE FREITAS, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE.
SNOEK, J., LAROCHELLE, H. and ADAMS, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems.
UTSU, T., OGATA, Y. and MATSU'URA, R. S. (1995). The centenary of the Omori formula for a decay law of aftershock activity. Journal of Physics of the Earth.
VARTY, Z., TAWN, J. A., ATKINSON, P. M. and BIERMAN, S. (2021). Supplement to "Inference for extreme earthquake magnitudes accounting for a time-varying measurement process".
WADSWORTH, J. L. (2016). Exploiting structure of maximum likelihood estimators for extreme value threshold selection. Technometrics.
WADSWORTH, J. L. and TAWN, J. A. (2012). Likelihood-based procedures for threshold diagnostics and uncertainty in extreme value modelling. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
WIEMER, S. and WYSS, M. (2000). Minimum magnitude of completeness in earthquake catalogs: Examples from Alaska, the western United States, and Japan. Bulletin of the Seismological Society of America.
WILSON, S. (2020). ParBayesianOptimization: Parallel Bayesian Optimization of Hyperparameters. R package version 1.2.1.
WOESSNER, J. and WIEMER, S. (2005). Assessing the quality of earthquake catalogues: Estimating the magnitude of completeness and its uncertainty. Bulletin of the Seismological Society of America.

SUPPLEMENTARY MATERIALS FOR "INFERENCE FOR EXTREME EARTHQUAKE MAGNITUDES ACCOUNTING FOR A TIME-VARYING MEASUREMENT PROCESS"

APPENDIX A: ASSESSING I.I.D. ASSUMPTION FOR GRONINGEN EARTHQUAKES
A.1. Connection to main text.
This appendix supports the claim, made in Sections 2 and 6 of the main text, that Groningen earthquakes exceeding 1.45 M_L may be modelled as independent and identically distributed.

A.2. Exploratory analysis.
Here we examine the validity of the assumption, common to both GPD and exponential models, that magnitudes are i.i.d. above 1.45 M_L. In the case of continuous-valued data that are completely observed, this assumption implies that inter-arrival times of threshold exceedances should approximately follow an exponential distribution. Due to incomplete observation below 1.45 M_L and the transformed time scale, we instead consider the inter-arrival times of events exceeding magnitude c ≥ v_C, measured in terms of the number of events with magnitudes between 1.45 M_L and c. If events are i.i.d. then these inter-arrival times are geometrically distributed. It is important to investigate a range of values for c, since lower values lead to more (shorter) observed interval lengths, but these are more concentrated about 0, making assessment of the geometric distribution more difficult. This trade-off can be seen by considering the edge cases: if c = 1.45 M_L then each interval is of length 0, and if c is between the second and third largest observed magnitudes then there is a single observed interval.

The empirical distributions of interval lengths in the Groningen catalogue are shown in the panels of Figure A.1 for exceedances of c = 1.65 M_L, 1.75 M_L, and 1.85 M_L. These are consistent with the confidence interval for the fitted geometric distribution at each value of c. The same conclusion was found for other values of c in this range. Events larger than this show mild evidence of clustering, but overall this suggests that it is reasonable to model magnitudes as i.i.d. above c = 1.45 M_L.

APPENDIX B: BOOTSTRAP DATA-SETS AND PARAMETER ESTIMATES

B.1. Relation to main text.
This section supports the material presented in Section 2.2 of the main paper. To represent the sampling variability in the maximum likelihood estimate θ̂ associated with the log-likelihood (2) of the main text, we take a parametric bootstrapping approach. This appendix describes the simulation of bootstrap catalogues and how they may be used to obtain bootstrap estimates of θ̂. Bootstrap parameter estimates of this type are used throughout the main text.

B.2. Generating bootstrapped data-sets.

FIG A.1. Frequency plots of the number of earthquakes exceeding 1.45 M_L that separate earthquakes exceeding 1.65 M_L (left), 1.75 M_L (centre), and 1.85 M_L (right) for the Groningen earthquake catalogue. Observed frequencies (black lines) fall within the confidence intervals under the fitted models (grey lines).

FIG B.1. Threshold exceedances as a point process.
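Before the catalogue-specific steps below, the overall parametric bootstrap loop can be illustrated in miniature. The sketch below is a generic illustration with simulated GPD excesses and assumed parameter values; it omits the rounding and time-varying threshold that Sections B.2.1-B.2.4 are designed to handle.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(3)
n = 400
# Synthetic threshold excesses with assumed shape and scale.
z = genpareto.rvs(c=-0.07, scale=0.5, size=n, random_state=rng)

# Fit once, then refit to catalogues simulated from the fitted model.
xi_hat, _, sigma_hat = genpareto.fit(z, floc=0)
boot = np.empty((100, 2))
for b in range(100):
    z_b = genpareto.rvs(c=xi_hat, scale=sigma_hat, size=n, random_state=rng)
    xi_b, _, sigma_b = genpareto.fit(z_b, floc=0)
    boot[b] = (sigma_b, xi_b)

# Percentile intervals summarising sampling uncertainty in (sigma, xi).
ci_sigma = np.percentile(boot[:, 0], [2.5, 97.5])
ci_xi = np.percentile(boot[:, 1], [2.5, 97.5])
```

The collection of refitted parameter pairs plays the role of the bootstrap estimates of θ̂ used throughout the main text.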
B.2.1.
Threshold exceedances and a point process.
In bootstrap realisations of the earthquake catalogue, the number, timing and magnitudes of events exceeding v(τ) are all variable. In an alternate catalogue, events remain within the transformed observation interval (0, τ_max) but no longer form a regularly spaced sequence. Rather, the events in the region A_v = {(τ, y) : 0 ≤ τ ≤ τ_max, y ≥ v(τ)}, as shown in Figure B.1, are approximated by a Poisson process.

The Poisson process intensity on A_v is determined by the GPD parameters θ and λ_u, where λ_u is the expected number of events per unit τ exceeding u ≤ min_{0<τ<τ_max} v(τ) (Coles, 2001). Since it is assumed that no censoring occurs on A_v and that the underlying magnitude distribution is identical over time, it follows that λ_u is constant over τ, because time has been transformed so that earthquakes occur at a constant rate. The resulting intensity function on A_v is:

(1)  λ(τ, y) = (λ_u / σ_u) [1 + ξ(y − u)/σ_u]^{−1/ξ − 1}   for (τ, y) ∈ A_v.

Bootstrap catalogues of earthquakes with magnitudes exceeding v(τ) can be obtained as realisations from the point process with intensity function (1), using the estimated parameters θ̂. We first describe how to generate such bootstrap catalogues and then how these can be used to obtain bootstrap estimates of θ̂.

B.2.2.
Simulating the exceedance count.
The first step in generating a bootstrap catalogue is to sample the number of events that occur on A_v. The number of events on A_v is Poisson distributed with expectation

(2)  Λ(A_v) = ∫_{A_v} λ(τ, y) dτ dy,

which must be estimated from the observed catalogue. This is complicated by the rounding of observed magnitudes x to the nearest δ. Recall that for borderline events {x_i ∈ x : |x_i − v_i| < δ} it is not known whether the corresponding unrounded magnitudes exceed v(τ) and so place those events on A_v. Therefore n_v, the observed event count on A_v, is unknown and must itself be estimated.

Given the rounded magnitudes x = (x_1, ..., x_n) and the estimated GPD parameters (σ̂_u, ξ̂), events i = 1, ..., n each exceed their magnitude threshold, and so lie on A_v, independently with probability w_i, as defined in expression (3) of the main text. Uncertainty in the observed event count due to magnitude rounding can be included in the generation of bootstrap sample sizes by simulating a value m_v for n_v as the sum of n independent Bernoulli random variables with expectations w_1 to w_n.

This simulated value of the observed event count can be used as a point estimate Λ̂(A_v) = m_v for Λ(A_v). Since Λ̂(A_v) is an estimate based on a single observed count, it is important also to include uncertainty in the inferred value of Λ(A_v) when generating bootstrap catalogue sizes. This can be done using a bootstrapped value of the estimator Λ̂(A_v). Since inference is based on a single Poisson count, the bootstrap estimator Λ̃(A_v) is obtained simply as a sampled value from the Poisson(Λ̂(A_v)) distribution.

Finally, the event count ñ_v to be used for the bootstrapped catalogue can be sampled from a Poisson(Λ̃(A_v)) distribution. Doing so properly represents rounding, estimation and sampling uncertainties in the bootstrapped event counts.

B.2.3. Simulation of event times.
Having simulated the exceedance count ñ_v, the times τ̃ = (τ_1, ..., τ_{ñ_v}) of the events on A_v can be sampled according to the marginal temporal intensity λ(τ). The marginal temporal intensity is found by integrating the joint intensity (1) over magnitudes. Noticing that λ(τ, y) is proportional to the GPD density, λ(τ) can be stated in terms of the GPD survivor function F̄(y; σ, ξ) = [1 + ξy/σ]_+^{−1/ξ} to give:

(3)  λ(τ) = ∫_{v(τ)}^{∞} λ(τ, y) dy = λ_u F̄(v(τ) − u; σ_u, ξ)   for 0 ≤ τ ≤ τ_max.

Sampling exceedance times from this intensity can be achieved through reverse application of the time-rescaling theorem (Brown et al., 2002). In general, this will require numerical integration and can be computationally intensive; depending on the form of v(τ), more efficient methods may be available. When v(τ) is a step function, the step-wise integral of the temporal intensity has a simple form and event times can be simulated very efficiently: the ñ_v events are allocated independently to steps with probability proportional to the temporal intensity integrated over each step, and are then located uniformly at random within their allocated step. When a more complex form is used for v(τ), alternative sampling approaches may reduce the computational cost of sampling event times. For example, exact samples may be obtained through rejection sampling, and approximate samples by approximating v(τ) as a step function.

B.2.4.
Conditional simulation of event magnitudes.
The magnitude of each event in the bootstrap catalogue may then be simulated conditional on its occurrence time. Given a simulated occurrence time τ̃ ∈ (0, τ_max), the conditional magnitude intensity for magnitudes exceeding v(τ̃) is simply the GPD density function with shape parameter ξ and time-dependent scale parameter σ(τ̃) = σ_u + ξ(v(τ̃) − u):

(4)  λ(y | τ̃) = λ(τ̃, y) / λ(τ̃) = (1/σ(τ̃)) [1 + ξ(y − v(τ̃))/σ(τ̃)]^{−1/ξ − 1},   for y ≥ v(τ̃).

This allows easy simulation of magnitudes ỹ = (ỹ_1, ..., ỹ_{ñ_v}) conditional on their occurrence times τ̃, by generating random variates from the appropriate GPD. The simulation of a bootstrap earthquake catalogue is completed by rounding these to the nearest multiple of δ, to obtain x̃ = (x̃_1, ..., x̃_{ñ_v}). Note that in a bootstrap catalogue ỹ_i > ṽ_i for i = 1, ..., ñ_v, but following rounding it is possible that some of the x̃_i are below their associated threshold value. The simulation of bootstrapped datasets is summarised in Algorithm 1, where σ̂(τ̃) = σ̂_u + ξ̂(v(τ̃) − u).

Algorithm 1:
Simulation of GPD data with variable threshold and rounding
Result:
A bootstrapped dataset of rounded GPD observations x̃, based on the threshold function v(τ), observations x and parameter estimates θ̂.

Sample m_v as the sum of independent Bernoulli(w_i) realisations, i = 1, ..., n;
Sample the estimate of Λ(A_v), Λ̃(A_v), from the Poisson(m_v) distribution;
Sample the bootstrap number of exceedances ñ_v from the Poisson(Λ̃(A_v)) distribution;
for i = 1 to i = ñ_v do
    sample u_i from a Uniform(0, 1) distribution;
    find the bootstrap occurrence time τ̃_i which satisfies ∫_0^{τ̃_i} λ(τ; σ̂_u, ξ̂, v(τ)) dτ = u_i Λ̂(A_v);
    sample the bootstrap magnitude exceedance z̃_i from the GPD(σ̂(τ̃_i), ξ̂) distribution;
    calculate the bootstrap latent magnitude ỹ_i = v(τ̃_i) + z̃_i;
    round ỹ_i to the nearest δ to get the rounded bootstrap magnitude x̃_i (which may be less than v(τ̃_i)).
end

B.3. Generating bootstrapped maximum likelihood estimates.
In the bootstrapped earthquake catalogues, some magnitudes x̃ may be less than their respective threshold values. However, unlike in the original catalogue, each of the corresponding unrounded magnitudes ỹ exceeds the respective modelling threshold. Therefore, an unweighted log-likelihood should be used when obtaining bootstrap maximum likelihood estimates. Letting ṽ_i = v(τ̃_i) and σ_{ṽ_i} = σ_u + ξ(ṽ_i − u), the unweighted log-likelihood function is

ℓ(θ | x̃, ṽ) = Σ_{i=1}^{ñ_v} log [F(x̃_i + δ − ṽ_i; σ_{ṽ_i}, ξ) − F(max(ṽ_i, x̃_i − δ) − ṽ_i; σ_{ṽ_i}, ξ)].

The maximum likelihood estimates resulting from a collection of bootstrap catalogues can be used to represent the sampling uncertainty of the original maximum likelihood point estimate θ̂. This is done in the main text when calculating confidence intervals on parameter values, conditional quantiles and return levels. The bootstrap parameter estimates are also used in the construction of adapted PP and QQ plots and when evaluating metric values.

APPENDIX C: SAMPLING STANDARDISED THRESHOLD EXCEEDANCES
C.1. Connection to main text.
This appendix describes how to sample a vector z̃ of unrounded threshold exceedances transformed to have a common Exp(1) marginal distribution. This is used in Sections 3-6 of the main text and requires: a single bootstrap estimate θ̃ of the estimated GPD parameters θ̂ (obtained as described in Appendix B), the n-vector of rounded observations x, and the corresponding threshold vector v. The process for sampling a single vector z̃ is described below and formalised in Algorithm 2.

C.2. Sampling unrounded threshold exceedances.
It is unknown which, if any, of the borderline values {x_i ∈ x : |x_i − v_i| < δ} correspond to unrounded values y_i ∈ y that exceed the modelling threshold v(τ). The first step in sampling z̃ is therefore to sample the set I of events that exceed the modelling threshold. This is done by simulating independent Bernoulli trials for each event i = 1, ..., n with success probabilities w_i = Pr(Y_i > v_i | x_i, θ̂), as defined in equation (3) of the main text. The vector z̃ will therefore have a randomly sampled length m̃ = |I| ≤ n, where the distribution of m̃ depends on x, v and θ̃.

The unrounded magnitude values for events in I are then simulated from their conditional distribution given their rounded values, the estimated GPD parameters, and that they are threshold exceedances. Letting F be the GPD distribution function as in Equation (1) of the main text, the required conditional distribution function of Y_i | x_i, θ, Y_i ≥ v_i is given by:

(5)  G_{Y_i | x_i, θ, Y_i ≥ v_i}(y) =
      0   for y < b_i,
      [F(y − u; θ) − F(b_i − u; θ)] / [F(x_i + δ − u; θ) − F(b_i − u; θ)]   for b_i ≤ y ≤ x_i + δ,
      1   for y > x_i + δ,

where i ∈ I and b_i = max(x_i − δ, v_i) is the smallest value above the modelling threshold that results in the rounded observation x_i. The sampled values are combined to create ỹ, an m̃-vector of continuous-valued sampled threshold exceedances. Note that the length of this vector is a random variate when there are one or more borderline values. By repeated simulation using k bootstrap parameter estimates {θ̃^(1), ..., θ̃^(k)}, the resulting vectors of unrounded exceedances {ỹ^(1), ..., ỹ^(k)} reflect uncertainty about both the GPD parameter values and the number of threshold exceedances.

C.3. Transformation onto common margins.
When v(τ) is non-constant, the elements of ỹ are random variates from the GPD family, but they do not share a common set of parameters. To resolve this, the probability integral transform can be used to give each element of ỹ an Exp(1) marginal distribution using its fitted GPD parameters. This results in a vector z̃ of standardised, sampled exceedances of the modelling threshold v(τ). The bootstrap simulation of one such vector is formalised in Algorithm 2.

Note that the transformation onto exponential margins depends on the estimated GPD parameters. The effect of parameter uncertainty on this transformation is represented across bootstrap catalogues by transforming each collection of bootstrapped threshold exceedances {ỹ^(1), ..., ỹ^(k)} using their respective bootstrapped parameter estimates {θ̃^(1), ..., θ̃^(k)}. This yields a set of k sampled vectors of threshold exceedances on exponential margins, {z̃^(1), ..., z̃^(k)}. As with ỹ, each z̃ is an m̃-vector, so its length is a random variate when there are one or more borderline events in x.

These sampled vectors of standardised threshold excesses can be used to calculate expected metric values or to construct modified PP and QQ plots, as in Figures 3 and 8 of the main text. In those plots, the variability between the empirical quantiles (or probabilities) of each {z̃^(1), ..., z̃^(k)} is shown by the confidence intervals on sample quantile values (or probabilities). The expected range of values for each sample quantile (or probability) is shown by the tolerance intervals. The uncertainty in the number of threshold exceedances is also incorporated when calculating the tolerance intervals; they are constructed using k sets of Exp(1) random variates of lengths dim z̃^(1), ..., dim z̃^(k).

Algorithm 2:
Simulation of standardised threshold exceedance sets

input:
A bootstrap estimate θ̃ = (σ̃, ξ̃) of the GPD parameters, the n-vector of rounded observed values x, and their corresponding thresholds v.

output: A vector z̃ of length m̃ ≤ n of sampled unrounded values, transformed to have an Exp(1) distribution under the fitted model.

for i = 1 to n do
    calculate w_i = Pr(Y_i > v_i | x_i, θ̃), the probability that each rounded observation corresponds to an unrounded value on A_v, as in Equation (3) of the main text;
end
Generate n independent Uniform[0, 1] random variates u_1, ..., u_n;
Sample the indexing set of events that are on A_v, I = {i ∈ (1, ..., n) : u_i ≤ w_i}, and let m̃ = |I|;
Store the elements of I in the vector β = (β_1, ..., β_m̃) so that ỹ_{β_i} > v_{β_i} for i = 1, ..., m̃, and initialise ỹ and z̃ as vectors of length m̃;
for j = 1 to m̃ do
    Let a = β_j;
    Sample the jth unrounded exceedance ỹ_j from its conditional distribution G_{Y_a | x_a, θ̃, Y_a ≥ v_a}(y), as in equation (5);
    Let θ̃_{v_a} = (σ̃ + ξ̃(v_a − u), ξ̃) be the bootstrapped GPD parameters for exceedances of v_a;
    Transform ỹ_j onto Exp(1) margins under this fitted model, letting F be the GPD distribution function (1) of the main text and setting z̃_j = −log[1 − F(ỹ_j − v_a; θ̃_{v_a})].
end

APPENDIX D: CODE

An R package is currently under development to implement the methods introduced in this paper. Unpolished R code used for the analyses of the main text can be found at https://github.com/zakvarty/threshold_paper_code.

APPENDIX E: SUPPLEMENTARY FIGURES
[Figures E.1–E.4 are not reproducible from the extracted text; their captions are retained below. Figure E.1 plots σ̂_u against ξ̂ with legend entries Conservative, Extended, Stepped and True; Figures E.2 and E.4 plot selection proportion against threshold.]

FIG E.1. [Left] Maximum likelihood estimates of the GPD scale (σ̂_u) and shape (ξ̂) parameters for 1000 simulated catalogues, obtained using the conservative, stepped and extended approaches. [Right] Mean squared error decomposition for the maximum likelihood estimates under each approach. The error is decomposed into squared bias and variance terms for each of the GPD parameters, with the variance terms having the larger contributions.

FIG E.2. Sampling distribution of the threshold selection methods for the PP-based metrics d(p,1) (RMSE = 0.34) and d(p,2) (RMSE = 0.12) over 500 simulated catalogues with a constant threshold and hard censoring. The true threshold is shown by a dashed red line and the root mean squared error (RMSE) for each method is given in the plot titles.

FIG E.3. [Left] Bootstrap maximum likelihood estimates of the GPD parameters for Groningen magnitudes above 1.45 ML, assuming GPD (black) and exponential (red) models. [Right] Resulting conditional return level plot, with return period measured in the number of events exceeding 1.45 ML.

FIG E.4. Sampling distribution of the threshold selection methods for the QQ-based metrics d(q,1) (RMSE = 0.06) and d(q,2) (RMSE = 0.08) (top row) and the PP-based metrics d(p,1) (RMSE = 0.35) and d(p,2) (RMSE = 0.12) (bottom row) over 500 simulated catalogues with a constant threshold and phased censoring. The true threshold is shown by a dashed red line and the RMSE for each method is given in the plot titles.
REFERENCES

BROWN, E. N., BARBIERI, R., VENTURA, V., KASS, R. E. and FRANK, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation.

COLES, S. G. (2001).