Estimating the Tail Index by Using Model Averaging
J. Martin van Zyl
Department of Mathematical Statistics and Actuarial Science, University of the Free State, Bloemfontein, South Africa
E-mail: [email protected]
Abstract: The ideas of model averaging are used to find weights in peaks-over-threshold problems using a possible range of thresholds. A range of the largest observations is chosen and considered as possible thresholds, with estimation performed at each. Weights based on an information criterion for each threshold are calculated, from which a weighted estimate of the threshold and shape parameter can be obtained.

Keywords: tail index, model averaging, weights
1. Introduction
Often there are various models which all perform well, and each model has properties which may be better than the others with respect to certain aspects of the problem. This situation occurs mostly in problems with one sample and various models with different numbers of parameters. In such problems model averaging can be performed. An overview of model averaging is given by Moral-Benito (2013). There is a Bayesian approach, called Bayesian model averaging (BMA), and the frequentist model averaging (FMA) approach. In this work weights are proposed for different thresholds in peaks-over-threshold (POT) problems as introduced by Pickands (1975). Thus, rather than using a specific point, many points are taken into consideration when performing estimation. The weights are used to calculate a weighted average over a range of estimated indexes of a Pareto distribution, with a range of the largest observations in a sample considered as possible thresholds. The weights must be comparable over different sample sizes, and the average likelihood for a specific threshold is used to calculate the weight for that estimate. Of course, model averaging can also be performed using various estimation methods and the same threshold, but that is not considered in this work.
The weights which will be used are of the form proposed by Buckland et al. (1997): the weight for model $r$, where there are $q$ possible models and $I_r$ is an information criterion for model $r$, is

$$w_r = \exp(-I_r/2) \Big/ \sum_{r=1}^{q} \exp(-I_r/2), \qquad (1)$$

with $I_r = -2\log(L_r) + j_r$. Here $L_r$ is the maximized likelihood for model $r$ and $j_r$ is a penalty term involving the number of parameters used in the model. It can be shown that the average log-likelihood, which cancels the effect of sample size, is also valid. For a model with $p$ parameters estimated from a sample of size $n$, the two most well-known criteria are the Akaike Information Criterion (AIC), with $j_r = 2p$ (Akaike, 1973), and the Bayesian Information Criterion (BIC), with penalty $j_r = p\log(n)$. BIC is also often called SBIC or Schwarz BIC, after the author who introduced it in a paper on estimating the dimension of a model (Schwarz, 1978).

Model selection methodology was derived for a single set of data with several possible models, either to select one model or to perform model averaging. The information criteria, and specifically the likelihood, are functions of the sample size. It can be shown, for example, that the log-likelihood of a Pareto distribution is a linearly increasing function of the sample size, so this must be taken into account when comparing models based on different sample sizes.

Consider the original derivation of the AIC. Suppose a candidate model assumes the density of the data is $p(y; \theta)$. A sample $y_1, \ldots, y_n$ with unknown true density $f(y; \theta)$ is available. Let $\hat\theta$ denote the MLE of $\theta$ assuming that the observations have density $p$. The distance, or quality, of $p(y; \hat\theta)$ as an estimate of $f(y; \theta)$ can be measured using the Kullback-Leibler distance

$$K(f, p_{\hat\theta}) = \int f(y;\theta)\,\log\!\big(f(y;\theta)/p(y;\hat\theta)\big)\,dy = \int f(y;\theta)\log(f(y;\theta))\,dy - \int f(y;\theta)\log(p(y;\hat\theta))\,dy .$$
The first term is fixed, so minimizing $K(f, p_{\hat\theta})$ is equivalent to maximizing the second term. An estimate of the second term is

$$\frac{1}{n}\sum_{j=1}^{n} \log p(y_j; \hat\theta) = l(\hat\theta)/n,$$

where $l(\hat\theta) = l(y_1, \ldots, y_n; \hat\theta)$ denotes the log-likelihood assuming density $p$. This is a biased estimate of the true integral, and it was shown that the bias is approximately $d/n$, where $d$ is the dimension of $\theta$. Using this approximation, the information criterion is

$$\hat K = l(\hat\theta)/n - d/n .$$

In the classical derivation this expression is multiplied by the constant $-2n$ and the result minimized, which gives the usual AIC; here the per-observation form is retained. Let

$$I_r = l_r(\hat\theta_r)/n_r - d/n_r, \qquad r = 1, \ldots, q. \qquad (2)$$
The proposed weights, when $q$ thresholds are considered, are then of the form

$$w_j = \exp(I_j/2) \Big/ \sum_{r=1}^{q} \exp(I_r/2),$$

where $n_r$ observations are used for the $r$-th possible threshold when estimating the tail index, and $l_r(\hat\theta_r)$ is the log-likelihood for this threshold evaluated at the MLE $\hat\theta_r$.
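As a small numerical illustration of this normalization, the criterion values can be shifted by their maximum before exponentiating; the shift cancels in the ratio but avoids overflow. This is a sketch, with the function name and interface chosen here for illustration.

```python
import numpy as np

def averaging_weights(criteria):
    """Weights w_j = exp(I_j/2) / sum_r exp(I_r/2), for criteria where
    a larger I is better, as in equation (2).

    Shifting every I_j by max(I) cancels in the ratio and prevents
    overflow in exp() when the criteria are large in magnitude."""
    I = np.asarray(criteria, dtype=float)
    w = np.exp((I - I.max()) / 2.0)
    return w / w.sum()
```

Equal criteria give equal weights, and a model whose criterion exceeds another's by 2 receives exp(1), roughly 2.72 times the weight.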
2. Weights for two approaches to estimate the tail index
Two of the approaches used to estimate the tail index are, first, assuming that observations above a threshold have a Pareto distribution or a generalized Pareto distribution (GPD), and second, making use of the power-tail form of the survival function above the threshold and performing regression. If a simple average of such estimators is rewritten, it can be seen as a type of weighted estimate in which the largest order statistics carry the most weight, since the largest observations are included for every threshold. In a sample of size $n$, if the largest $k_1$ to $k_2$ observations are considered as possible thresholds, not every order statistic in this range need be included when calculating the weighted average; for example, every tenth order statistic can also be used.

In this section the likelihood assuming a Pareto distribution above a threshold will be considered. It was found that when using the GPD distribution of excesses above a threshold, and many points are considered, there are almost always a few numerical convergence problems at certain sample sizes in POT problems. Under ideal circumstances, when estimating the parameters of GPD-distributed data, these convergence problems almost never occur, but in POT problems they are a factor. Theoretically, in very large samples the GPD should perform better, and in such samples, if the threshold is very large and there are many observations above it, few numerical problems would be encountered. In typical practical problems sample sizes are often a few thousand, and it was found that the Pareto assumption using the weighted model performs well and is, from a numerical and practical viewpoint, the better method to use. The pure Pareto assumption is easy to calculate, has no numerical problems and performs well, so in the simulation study only the pure Pareto assumption was considered.
The Hill estimator can also be derived by assuming that the largest observations above a threshold are Pareto distributed. If a Pareto distribution is assumed above a threshold, similar weights can be derived. Suppose a sample of size $n$, $x_1, \ldots, x_n$, is available, with corresponding order statistics $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. Consider the Pareto distribution in general:

$$p(x; \alpha, \beta) = \alpha \beta^{\alpha} x^{-(\alpha+1)}, \qquad x \ge \beta. \qquad (3)$$

The log-likelihood can be written as

$$l(x_1, \ldots, x_n; \alpha, \beta) = n\log(\alpha) - n\log(\beta) - (\alpha+1)\sum_{j=1}^{n}\log(x_j/\beta).$$

The ML estimates of the parameters are $\hat\beta = x_{(1)}$ and $\hat\alpha = n \big/ \sum_{j=1}^{n}\log(x_j/\hat\beta)$, leading to

$$l(x_1, \ldots, x_n; \hat\alpha, \hat\beta)/n = \log(\hat\alpha) - \log(\hat\beta) - (\hat\alpha+1)/\hat\alpha. \qquad (4)$$

If the $m$ largest observations above a threshold in a POT problem are considered, it follows that

$$\hat\beta_m = x_{(n-m)}, \qquad \hat\alpha_m = m \Big/ \sum_{r=1}^{m}\log\big(x_{(n-m+r)}/\hat\beta_m\big), \qquad (5)$$

$$l(x_{(n-m+1)}, \ldots, x_{(n)}; \hat\alpha_m, \hat\beta_m)/m = \log(\hat\alpha_m) - \log(\hat\beta_m) - (\hat\alpha_m+1)/\hat\alpha_m,$$

$$I_m = l(x_{(n-m+1)}, \ldots, x_{(n)}; \hat\alpha_m, \hat\beta_m)/m - 2/m. \qquad (6)$$

In a sample of size $n$, if the largest $k_1$ to $k_2$ observations are considered as possible thresholds, these equations are used to calculate a weight for each $m = k_1, \ldots, k_2$:

$$w_m = \exp(I_m/2) \Big/ \sum_{j=k_1}^{k_2} \exp(I_j/2),$$

where $w_m$ denotes the weight if the largest $m$ observations are used and the threshold is taken as $x_{(n-m)}$.

A more complex assumption would be to make use of the peaks-over-threshold theorem, where it was shown that above a certain threshold, say $u$, the excesses are generalized Pareto distributed. Let $Y = X - u$; then

$$F_u(y) = 1 - (1 + \xi y/\sigma)^{-1/\xi},$$

where $\xi$ denotes the shape parameter, the index is $\alpha = 1/\xi$, and $\sigma$ is a scale parameter.
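A minimal sketch of the weighted Pareto estimate built from equations (5) and (6), under the reconstruction above; the function name and the default range $k_1 = 50$ to $k_2 = 500$ (the range used in the simulations later) are illustrative choices.

```python
import numpy as np

def weighted_pareto_pot(x, k1=50, k2=500):
    """Model-averaged tail index and threshold, treating each order
    statistic x_(n-m), m = k1..k2, in turn as the threshold
    (equations (5)-(6)); assumes positive data."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    alphas, betas, crit = [], [], []
    for m in range(k1, min(k2, n - 1) + 1):
        beta = x[n - m - 1]                  # threshold x_(n-m), 0-based
        alpha = m / np.sum(np.log(x[n - m:] / beta))
        # average log-likelihood at the MLE (eq. (4)) minus penalty 2/m
        crit.append(np.log(alpha) - np.log(beta)
                    - (alpha + 1.0) / alpha - 2.0 / m)
        alphas.append(alpha)
        betas.append(beta)
    I = np.array(crit)
    w = np.exp((I - I.max()) / 2.0)          # shift by max(I); cancels
    w /= w.sum()
    return float(w @ alphas), float(w @ betas)

# illustrative check on exact Pareto data with alpha = 1.5, beta = 1
rng = np.random.default_rng(1)
sample = rng.uniform(size=5000) ** (-1.0 / 1.5)
alpha_hat, u_hat = weighted_pareto_pot(sample)
```

On exact Pareto data every threshold is valid, so the weighted estimate should land close to the true index of 1.5.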
The log-likelihood for a given threshold $u = x_{(n-k)}$, using as sample the $k$ excesses above the threshold and evaluated at the ML estimators, is

$$l(x_{(n-k+1)}, \ldots, x_{(n)}; \hat\sigma, \hat\xi) = -k\log(\hat\sigma) - (1/\hat\xi + 1)\sum_{j=1}^{k}\log\!\big(1 + (\hat\xi/\hat\sigma)(x_{(n-k+j)} - u)\big).$$

The information criterion used to calculate the weights is

$$I_k = l(x_{(n-k+1)}, \ldots, x_{(n)}; \hat\sigma, \hat\xi)/k - 2/k.$$

This assumption works excellently when the ideal assumptions are met, but in POT problems with smaller sample sizes, fitting the GPD to data above a threshold often leads to convergence problems, and the Pareto assumption was found to be the better practical method when using weighted estimation.
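For completeness, the GPD variant of the criterion can be sketched with SciPy's maximum-likelihood fit of the generalized Pareto distribution, fixing the location at zero since excesses start at the threshold; as noted above, this fit can fail to converge in POT settings, which is why the Pareto form is preferred in practice. The function name is an illustrative choice.

```python
import numpy as np
from scipy.stats import genpareto

def gpd_criterion(x, k):
    """I_k = l(.; sigma_hat, xi_hat)/k - 2/k for the k excesses above
    the threshold u = x_(n-k), assuming GPD-distributed excesses."""
    x = np.sort(np.asarray(x, dtype=float))
    u = x[-k - 1]                                    # threshold x_(n-k)
    excess = x[-k:] - u
    xi, _, sigma = genpareto.fit(excess, floc=0.0)   # location fixed at 0
    ll = genpareto.logpdf(excess, xi, loc=0.0, scale=sigma).sum()
    return ll / k - 2.0 / k, xi

# illustrative check: excesses of GPD data over a high threshold are
# again GPD with the same shape parameter
data = genpareto.rvs(0.5, scale=1.0, size=5000, random_state=2)
I_500, xi_500 = gpd_criterion(data, 500)
```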
If it is assumed that the largest observations from a distribution $P$ obey the power law, that is,

$$1 - P(x) = c x^{-\alpha} \quad \text{or} \quad \log(1 - P(x)) = \log(c) - \alpha\log(x), \qquad (7)$$

and the empirical distribution is used to estimate $P(x)$, linear regression can be performed to estimate the tail index. If normal error terms are assumed and the $m$ largest observations are used, the log-likelihood, ignoring constants, is of the form

$$l(\hat\alpha_m)/m \propto -\log(\hat s_m), \qquad \hat s_m = \frac{1}{m}\sum_{r=1}^{m}\hat e_r^{\,2}, \qquad (8)$$

where $\hat e_1, \ldots, \hat e_m$ denote the residuals when using the largest $m$ observations to estimate $\alpha$. Let

$$I_m = -\log(\hat s_m) - 2/m; \qquad (9)$$

then it follows that

$$w_m = \exp(I_m/2) \Big/ \sum_{j=k_1}^{k_2} \exp(I_j/2),$$

for possible thresholds $x_{(n-k_2)}, \ldots, x_{(n-k_1)}$.
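The regression step and the criterion in equations (8) and (9) can be sketched as follows; the plotting position $(i + 0.5)/n$ is an assumption made here to avoid $\log(0)$ at the sample maximum, and the function name is illustrative.

```python
import numpy as np

def powertail_regression(x, m):
    """Regress log empirical survival on log x for the m largest
    observations; returns the slope-based alpha_hat and I_m (eq. (9))."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    tail = x[n - m:]
    # empirical survival at the order statistics x_(n-m+1), ..., x_(n);
    # the 0.5 shift keeps the survival probability of the maximum positive
    surv = 1.0 - (np.arange(n - m, n) + 0.5) / n
    X = np.column_stack([np.ones(m), np.log(tail)])
    y = np.log(surv)
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    s = np.mean(resid ** 2)                  # s_hat in equation (8)
    return -coef[1], -np.log(s) - 2.0 / m

# illustrative check on exact Pareto data with alpha = 2
rng = np.random.default_rng(3)
sample = rng.uniform(size=5000) ** (-1.0 / 2.0)
alpha_hat, I_300 = powertail_regression(sample, 300)
```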
3. Simulation study and an application
In this section POT estimation is performed on data simulated from a stable distribution, a t-distribution and a GPD, with the largest observations in each sample used to estimate the index of the tail as well as a weighted estimate of the threshold. For the stable and t-distributions, data are simulated from symmetric distributions and the estimation is performed on the absolute values. For all the distributions the shape parameter is denoted by $\xi$, with index $\alpha = 1/\xi$. The skewness, location and scale parameters of the stable distributions are denoted by $\beta$, $\mu$ and $\sigma$; for the t-distribution the degrees of freedom are denoted by $\nu$, with location and scale parameters $\mu$ and $\sigma$; the GPD has location parameter $\mu$ and scale parameter $\sigma$. The mean square error (MSE) and bias are reported, based on using the largest 50 to 500 observations as possible thresholds, for repeated samples of size $n = 2500$ each. Note that in all simulations the results are given as the bias and MSE with respect to the index $\alpha$.

[Table 1: Pareto likelihood versus power-tail regression; estimated threshold, MSE and bias for the stable distribution ($\beta = 0$, $\mu = 0$, $\sigma = 1$), the t-distribution ($\mu = 0$, $\sigma = 1$) and the GPD ($\mu = 1$, $\sigma = 1$).]

Table 1. Estimation of the shape parameter using the largest observations in samples of size n = 2500. MSE and bias given.
It can be seen that the success of the estimation is sensitive to how heavy-tailed the data are, with good results found especially for very heavy-tailed data. A sample of size $n = 2500$ is relatively small for POT problems, but especially where $\alpha < 2$ ($\xi > 0.5$), both methods give good estimates of the shape parameter. The estimated threshold using the Pareto likelihood is smaller than the one using regression.

In Table 2 the scale parameter $\sigma = 2$ was used; the mean square error (MSE) and bias are again given, based on using the largest 50 to 500 observations as possible thresholds, for repeated samples of size $n = 5000$ each.

[Table 2: Pareto likelihood versus power-tail regression; estimated threshold, MSE and bias for the stable distribution ($\beta = 0$, $\mu = 0$, $\sigma = 2$), the t-distribution ($\mu = 0$, $\sigma = 2$) and the GPD ($\mu = 1$, $\sigma = 2$).]

Table 2. Estimation of the shape parameter using the largest observations in samples of size n = 5000. MSE and bias given.
It can be seen that the success of the estimation is again very sensitive to how heavy-tailed the data are, with good results found especially for very heavy-tailed data; especially where $\alpha < 2$ ($\xi > 0.5$), both methods give good estimates of the shape parameter. The estimated thresholds are much larger than in the $n = 2500$ samples, which may be an indication that the distributional behaviour of the largest observations is not yet purely Pareto in the smaller sample. The bias of the estimated parameters is better in the larger sample, as can be expected. For example, estimating $\alpha$ for a Cauchy distribution, that is $\alpha = 1$, $\beta = 0$ for a stable distribution, using the Pareto assumption results in a bias of 0.0048 in a sample of size $n = 2500$ and 0.0014 in a sample of size $n = 5000$, with similar estimated MSEs of 0.0026.

In Figure 1 a histogram is shown of 5000 estimated indices using the Pareto assumption, where estimation is based on the largest order statistics as possible thresholds, from the 500 largest to the 50 largest. The samples of size $n = 5000$ are from a stable distribution. The mean of the 5000 estimated indices is 1.1709, with MSE 0.0119.

Figure 1. Histogram of 5000 estimated indices in POT problems in samples of size n = 5000 each. Samples stable distributed.
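The Cauchy case mentioned above can be checked with a compact, self-contained version of the weighted Pareto estimator applied to absolute values of simulated t-samples; the small number of replications here is an illustrative choice, not the number used in the study.

```python
import numpy as np

def weighted_hill(x, k1=50, k2=500):
    # compact weighted Pareto-likelihood estimator sketched in Section 2
    x = np.sort(np.abs(np.asarray(x, dtype=float)))
    n = len(x)
    ms = np.arange(k1, k2 + 1)
    betas = x[n - ms - 1]                     # candidate thresholds x_(n-m)
    alphas = np.array([m / np.log(x[n - m:] / x[n - m - 1]).sum()
                       for m in ms])
    # criterion I_m of eq. (6); weights as in Section 2
    I = np.log(alphas) - np.log(betas) - (alphas + 1.0) / alphas - 2.0 / ms
    w = np.exp((I - I.max()) / 2.0)
    w /= w.sum()
    return float(w @ alphas)

# a t-distribution with nu degrees of freedom has tail index alpha = nu;
# nu = 1 is the Cauchy case discussed above
rng = np.random.default_rng(0)
est_cauchy = np.mean([weighted_hill(rng.standard_t(1.0, size=2500))
                      for _ in range(20)])
```

The averaged estimate should be close to 1 for the Cauchy case, in line with the small bias reported above.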
4. Application to the Danish fire losses
This set of data has been analysed by various researchers; some of the references are McNeil (1997, 1999), Resnick (1997), Lee et al. (2012), Embrechts et al. (1997) and Zivot and Wang (2003). There are 2156 losses larger than 1 million Danish kroner; the total sample size is 2492 observations.

Different assumptions and models have produced different estimates of the threshold. Zivot and Wang (2003) estimated the threshold as 5.28 (millions) using the Hill quantile estimator, while the ML GPD estimate is 5.20. Stable estimates of the index using the Hill estimator, with a threshold in the region of 10, were found to be approximately 2.01. The estimates using the weighted approach are as follows:

• Using the weighted average assuming the Pareto model, the threshold is estimated as 4.7154, which means the largest 276 observations are used. The index is estimated as $\hat\alpha = 1.4435$ ($\hat\xi = 0.6928$).

• If the power-tail regression approach is followed, the threshold is estimated as 5.3061, which means the largest 234 observations are included. The index is estimated as $\hat\alpha = 1.4521$ ($\hat\xi = 0.6887$).

In Figures 2 and 3 the model using the regression approach is shown against the actual values. It can be seen that the fit is good, especially for the largest observations in the sample.

Figure 2. Fitted Pareto distribution to values over the estimated threshold, on a log scale.

Figure 3. Quantile plot of the estimated distribution on a log scale.
In the weighted method the largest values are taken into account when calculating each weight, and the process is inherently such that the largest observations are considered more important. It can be seen in the figures that the estimated distribution fits the largest observations especially well.
5. Conclusions
In this work the basic ideas of model averaging are applied to estimate the parameters of heavy-tailed distributions in POT problems, and the approach was found to perform well. One advantage of this procedure is that a reasonable estimate of the threshold can be found without having to inspect plots. This estimate depends on the quality of the data and will be good if the sample is such that the POT technique can be applied. When checking the performance of estimation techniques on simulated samples in POT problems, one cannot study charts each time, and such a procedure can be of help in simulation studies by giving a reasonable estimate of the threshold.

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B.N. Petrov and F. Csaki (eds), Second International Symposium on Information Theory, 267-281. Budapest: Akademiai Kiado.

Buckland, S.T., Burnham, K.P., Augustin, N.H. (1997). Model selection: an integral part of inference. Biometrics, 53, 603-618.

Lee, D., Li, W.K., Wong, T.S.T. (2012). Modeling insurance claims via a mixture exponential model combined with peaks-over-threshold approach. Insurance: Mathematics and Economics, 51, 538-550.

McNeil, A. (1997). Estimating the tails of loss severity distributions using extreme value theory. ASTIN Bulletin, 27, 117-137.

McNeil, A.J., Frey, R., Embrechts, P. (2005). Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton: Princeton University Press.

Pickands, J. III (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3, 1163-1174.

Resnick, S.I. (1997). Discussion of the Danish data on large fire insurance losses. ASTIN Bulletin, 27(1), 139-151.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Zivot, E. and Wang, J. (2003). Modeling Financial Time Series with S-PLUS. New York: Springer.