R J. | 2019

Fitting Tails by the Empirical Residual Coefficient of Variation: The ercv Package

 
 
 
 

Abstract


This article is a self-contained introduction to the R package ercv and to the methodology on which it is based, through the analysis of nine examples. The methodology is simple and trustworthy for the analysis of extreme values and relates the two main existing methodologies. The package contains R functions for visualizing, fitting and validating the distribution of tails. It also provides multiple threshold tests for a generalized Pareto distribution, together with an automatic threshold selection algorithm.

Introduction and overview

Extreme value theory (EVT) is one of the most important statistical techniques for the applied sciences. A review of the available software on extreme value analysis appears in Gilleland et al. (2013). R (R Core Team, 2017) offers several useful packages for EVT. The R package evir (Pfaff and McNeil, 2012) provides maximum likelihood estimation (MLE) for both the block maxima and the threshold model approaches. The R package ismev (Heffernan and Stephenson, 2018) allows the parameters of a generalized Pareto distribution to be fitted depending on covariates, and offers diagnostics such as Q-Q plots and return level plots with confidence bands. The R package poweRlaw (Gillespie, 2015) enables power laws and other heavy-tailed distributions to be fitted using the techniques proposed by Clauset et al. (2009). This approach has been used to describe the sizes of cities and word frequencies, and is linked to the physics of phase transitions and to complex systems.

This paper shows that the R package ercv (del Castillo et al., 2017a), based on the coefficient of variation (CV), is a complement, and often an alternative, to the available software on EVT.
The mathematical background is presented in Section 2.2, including threshold models and the relationship between the power law distribution and the generalized Pareto distribution (GPD). This relationship links the two approaches followed by the R packages mentioned above: evir or ismev on the one hand, and poweRlaw on the other. Section 2.3 introduces the tools for the empirical residual coefficient of variation developed in del Castillo et al. (2014), del Castillo and Serra (2015) and del Castillo and Padilla (2016). Section 2.3.2 also shows the exploratory data analysis of nine examples, some of them from the R packages evir and poweRlaw, with the cvplot function; see Figure 1. Section 2.4 explains the Tm function in the R package ercv, which provides a multiple thresholds test that reduces the multiple testing problem in threshold selection and yields clearly defined p-values. The function also includes an estimation method for the extreme value index. Section 2.5 explains the automatic threshold selection algorithm provided by the thrselect function, which determines the point above which a GPD can be assumed for the tail distribution. Section 2.6 shows how the methodology developed in the previous sections can be extended with the tdata function to all GPDs, even those with no finite moments. This technique is applied to the MobyDick example and to the Danish fire insurance dataset, a highly heavy-tailed, infinite-variance model. Finally, Section 2.7 describes the functions of the R package ercv that estimate the parameters (fitpot) and draw the fitted distributions (ccdfplot) for the peak-over-threshold method.

Mathematical Background

Extreme value theory is widely used to model exceedances in many disciplines, such as hydrology, insurance, finance, internet traffic data and environmental science. The underlying mathematical basis is now thoroughly established in Leadbetter et al. (1983), Embrechts et al. (1997), de Haan and Ferreira (2007), Novak (2012) and Resnick (2013). Statistical tools and methods for use with a single time series of data, or with a few series, are well developed in Coles (2001), Beirlant et al. (2006) and Markovich (2007).

Threshold models

The first fundamental theorem on EVT, by Fisher and Tippett (1928) and Gnedenko (1943), characterizes the asymptotic distribution of the maximum in observed data. Classical analyses now fit the generalized extreme value family of distribution functions to block maximum data, provided the number of blocks is sufficiently large.

The R Journal Vol. XX/YY, AAAA 20ZZ ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE

Another point of view emerged in the 1970s with the fundamental theorem by Pickands (1975) and Balkema and de Haan (1974). The Pickands-Balkema-de Haan (PBdH) theorem, see McNeil et al. (2005, chap. 7), initiated a new way of studying extreme values via distributions above a threshold, which use more information than the maxima of data grouped into blocks.

Let $X$ be a continuous non-negative random variable (r.v.) with distribution function $F(x)$. For any threshold $t > 0$, the r.v. of threshold excesses $X - t$ conditional on $X > t$, denoted $X_t = \{X - t \mid X > t\}$, is called the residual distribution of $X$ over $t$. The cumulative distribution function of $X_t$, $F_t(x)$, is given by

$$
1 - F_t(x) = \frac{1 - F(x + t)}{1 - F(t)}. \tag{1}
$$

The quantity $M(t) = E(X_t)$ is called the residual mean and $V(t) = \operatorname{var}(X_t)$ the residual variance. The plot of sample mean excesses over increasing thresholds is a commonly used diagnostic tool in risk analysis called the ME-plot (meplot function in the evir R package). The residual coefficient of variation is given by

$$
CV(t) \equiv CV(X_t) = \frac{\sqrt{V(t)}}{M(t)}. \tag{2}
$$

Like the usual CV, the function $CV(t)$ is invariant under changes of scale.
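As a minimal sketch of the empirical counterparts of the residual mean, residual variance and residual CV in Equation (2), the following Python fragment computes them for a sample and a threshold. It is an illustration only, not the ercv implementation: the function name `residual_stats` and the toy data are hypothetical, and the choice of the sample variance with an n − 1 denominator is an assumption.

```python
import math
import statistics


def residual_stats(sample, t):
    """Return (mean, variance, CV) of the excesses x - t over observations x > t.

    Hypothetical helper for illustration; not part of the ercv package.
    """
    excesses = [x - t for x in sample if x > t]
    if len(excesses) < 2:
        raise ValueError("need at least two exceedances over the threshold")
    m = statistics.fmean(excesses)      # empirical residual mean M(t)
    v = statistics.variance(excesses)   # empirical residual variance V(t); n - 1 denominator (assumption)
    return m, v, math.sqrt(v) / m       # empirical CV(t) = sqrt(V(t)) / M(t)


# Toy data, made up for the example.
data = [0.5, 1.0, 2.0, 3.0, 5.0, 9.0]
m, v, cv = residual_stats(data, 1.0)

# Rescaling the data and the threshold by the same factor leaves the CV unchanged.
m10, v10, cv10 = residual_stats([10 * x for x in data], 10.0)
```

Here the excesses over the threshold 1.0 are 1, 2, 4 and 8, so the residual mean is 3.75, and the two CV values agree up to rounding, which illustrates the scale invariance noted above.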
The PBdH theorem characterizes the asymptotic distributions of the residual distribution over a high threshold under widely applicable regularity conditions; see Coles (2001). The result essentially says that the GPD is the canonical distribution for modelling excesses over high thresholds. The probability density function of a GPD($\xi$, $\psi$) is given by

$$
g(x; \xi, \psi) =
\begin{cases}
\psi^{-1}\,(1 + \xi x/\psi)^{-(1+\xi)/\xi}, & \xi \neq 0, \\
\psi^{-1}\exp(-x/\psi), & \xi = 0,
\end{cases}
\tag{3}
$$

where $\xi \in \mathbb{R}$ is called the extreme value index (evi) and $\psi > 0$ is a scale parameter, with $0 \le x \le -\psi/\xi$ if $\xi < 0$ and $x \ge 0$ if $\xi \ge 0$. The value of $\xi$ determines the tail type. If $\xi < 0$ the distribution is said to be light tailed, and if $\xi = 0$ it is exponential tailed. If $\xi > 0$ the distribution is called heavy tailed, and it has finite moments of order $n$ only if $\xi < 1/n$. The mean of a GPD is $\psi/(1 - \xi)$, provided $\xi < 1$, and the variance is $\psi^2/[(1 - \xi)^2(1 - 2\xi)]$, provided $\xi < 1/2$. The coefficient of variation is then

$$
c_\xi = \sqrt{\frac{1}{1 - 2\xi}}. \tag{4}
$$

The cvevi and evicv functions of the R package ercv correspond to this function and its inverse.

The residual distribution of a GPD is again a GPD with the same extreme value index $\xi$ for any threshold $t > 0$; in fact,

$$
\mathrm{GPD}_t(\xi, \psi) = \mathrm{GPD}(\xi, \psi + \xi t). \tag{5}
$$

Therefore, the residual CV of a GPD is independent of the threshold and of the scale parameter, and is given by Equation (4).

The probability density functions (3) are monotonically decreasing (L-shaped) for $\xi > -1$, which covers practically all applications. Therefore, we are mainly concerned with the subset of data that shows this behaviour. For example, if the dataset is concentrated in the centre and decreases on either side (bell-shaped), we study the upper part and the lower part (with the sign changed) of the distribution separately, taking the median or some other location statistic as the origin.

The power law distribution and GPD

The power law distribution is the model, introduced by Pareto,

$$
p(x; \alpha, \sigma) = \frac{\alpha}{\sigma}\left(\frac{\sigma}{x}\right)^{\alpha+1}, \quad x > \sigma, \tag{6}
$$

where $\alpha > 0$ is the tail index and $\sigma > 0$ the minimum value parameter.
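Returning to the GPD, the map in Equation (4), its inverse, and the scale update in Equation (5) can be sketched in a few lines of Python. The names `cv_from_evi`, `evi_from_cv` and `residual_gpd_scale` are hypothetical and chosen for this illustration; they are not the signatures of the ercv functions cvevi and evicv.

```python
import math


def cv_from_evi(xi):
    """Coefficient of variation of a GPD with extreme value index xi, Equation (4)."""
    if xi >= 0.5:
        raise ValueError("the CV is finite only for xi < 1/2")
    return math.sqrt(1.0 / (1.0 - 2.0 * xi))


def evi_from_cv(cv):
    """Inverse of Equation (4): solve cv**2 = 1 / (1 - 2 * xi) for xi."""
    if cv <= 0:
        raise ValueError("the CV must be positive")
    return 0.5 * (1.0 - 1.0 / cv ** 2)


def residual_gpd_scale(xi, psi, t):
    """Scale of the residual distribution over threshold t, Equation (5)."""
    new_scale = psi + xi * t
    if new_scale <= 0:
        raise ValueError("threshold lies outside the support of the GPD")
    return new_scale


cv_exponential = cv_from_evi(0.0)                 # exponential tail
xi_roundtrip = evi_from_cv(cv_from_evi(0.25))     # recovers the index 0.25
scale_over_4 = residual_gpd_scale(0.25, 2.0, 4.0)
```

The exponential case gives a CV of exactly 1, so values of the residual CV above 1 point to a positive index and values below 1 to a negative one. The scale changes with the threshold while the index, and hence the CV, does not.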
The model corresponds to the distribution functions $F$ with the linear relation

$$
\log[1 - F(x)] = -\alpha \log(x) + \alpha \log(\sigma), \tag{7}
$$

see also Gillespie (2015). Note that if $X$ is a r.v. with probability density function $p(x; \alpha, \sigma)$, given by (6), then $Z = X - \sigma$ has probability density function

$$
g(z; 1/\alpha, \sigma/\alpha) = \frac{\alpha}{\sigma}\left(\frac{\sigma}{z + \sigma}\right)^{\alpha+1}, \quad z > 0, \tag{8}
$$

that is, there is a one-to-one correspondence between power law distributions and GPDs with heavy tails ($\xi > 0$), where $\xi = 1/\alpha$ and $\sigma = \psi/\xi$. However, the two statistical models (3) and (6), with $\xi > 0$, are genuinely different, since no single transformation maps every member of one family onto the other: the shift $Z = X - \sigma$ depends on the minimum value parameter $\sigma$ of the variable $X$ itself.

The MLE for model (6) leads to the Hill estimator and the Hill-plot (hill function in the evir R package). The support of the distributions in (6) depends on the minimum value parameter $\sigma$. Hence the standard regularity conditions for the MLE do not hold, and $\sigma$ is estimated with alternative methods; see Clauset et al. (2009) and its implementation in the poweRlaw R package by Gillespie (2015). In contrast, the support of the distributions in (3), with $\xi > 0$, does not depend on the parameters. The MLE exists for large samples provided $\xi > -1$ and is asymptotically efficient provided $\xi > -0.5$; see del Castillo and Serra (2015) and the references therein for details. The gpd function in the evir R package provides the MLE for (3).

Note that model (3) includes all the limit distributions (heavy tailed or not) of the residual distribution over a high threshold and comes from a mathematical result (the PBdH theorem), whereas (6) often comes from empirical evidence of the linear relationship (7) and from comparison with other models. Moreover, the linear relationship (7) is also obtained from the relationship between the parameters (8); see the ccdfplot function in Section 2.7.
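The correspondence in Equation (8) can be checked numerically. The sketch below, in Python, evaluates the power law density (6) at x = z + σ and the GPD density (3) with ξ = 1/α and ψ = σ/α at z, and records the largest difference over a few points. The function names and the parameter values are hypothetical, chosen only for this illustration.

```python
import math


def power_law_pdf(x, alpha, sigma):
    """Power law density of Equation (6), supported on x > sigma."""
    if x <= sigma:
        return 0.0
    return (alpha / sigma) * (sigma / x) ** (alpha + 1)


def gpd_pdf(z, xi, psi):
    """GPD density of Equation (3)."""
    if z < 0:
        return 0.0
    if xi == 0:
        return math.exp(-z / psi) / psi
    base = 1.0 + xi * z / psi
    if base <= 0:
        return 0.0  # outside the support when xi < 0
    return base ** (-(1.0 + xi) / xi) / psi


def power_law_to_gpd(alpha, sigma):
    """Map power law parameters to GPD parameters: xi = 1/alpha, psi = sigma/alpha."""
    return 1.0 / alpha, sigma / alpha


alpha, sigma = 2.5, 3.0   # illustrative values only
xi, psi = power_law_to_gpd(alpha, sigma)

# Largest gap between the shifted power law density and the GPD density.
max_gap = max(
    abs(power_law_pdf(z + sigma, alpha, sigma) - gpd_pdf(z, xi, psi))
    for z in (0.1, 1.0, 5.0, 20.0)
)
```

The two densities agree up to floating-point rounding at every test point. The shift by σ is what makes the identification depend on a parameter of the model, which is why the two families are fitted differently in practice.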
The residual CV approach

Gupta and Kirmani (2000) show that the residual CV characterizes the distribution in univariate and bivariate cases, provided there is a finite second moment ($\xi < 1/2$). In the case of the GPD, the residual CV is constant and is a one-to-one transformation of the extreme value index, which suggests its use to estimate this index. The residual CV can also be expressed in terms of probabilities, rather than the threshold, through the inverse of the distribution function or the quantile function defined b

Volume 11
Pages 56
DOI 10.32614/rj-2019-044
Language English
Journal R J.
