Dynamic Semiparametric Models for Expected Shortfall (and Value-at-Risk)

Abstract

Expected Shortfall (ES) is the average return on a risky asset conditional on the return being below some quantile of its distribution, namely its Value-at-Risk (VaR). The Basel III Accord, which will be implemented in the years leading up to 2019, places new attention on ES, but unlike VaR, there is little existing work on modeling ES. We use recent results from statistical decision theory to overcome the problem of "elicitability" for ES by jointly modelling ES and VaR, and propose new dynamic models for these risk measures. We provide estimation and inference methods for the proposed models, and confirm via simulation studies that the methods have good finite-sample properties. We apply these models to daily returns on four international equity indices, and find the proposed new ES-VaR models outperform forecasts based on GARCH or rolling window models.


arXiv [q-fin.EC]

Andrew J. Patton (Duke University), Johanna F. Ziegel (University of Bern), Rui Chen (Duke University)

First version: 5 December 2015. This version: 11 July 2017.


Keywords:

Risk management, tails, crashes, forecasting, generalized autoregressive score.

J.E.L. codes:

G17, C22, G32, C58.

∗ For helpful comments we thank Tim Bollerslev, Rob Engle, Jia Li, Nour Meddahi, and seminar participants at the Bank of Japan, Duke University, EPFL, Federal Reserve Bank of New York, Hitotsubashi University, New York University, Toulouse School of Economics, the University of Southern California, and the 2015 Oberwolfach Workshop on Quantitative Risk Management where this project started. The first author would particularly like to thank the finance department at NYU Stern, where much of his work on this paper was completed. Contact address: Andrew Patton, Department of Economics, Duke University, 213 Social Sciences Building, Box 90097, Durham NC 27708-0097. Email: [email protected].

1 Introduction

The financial crisis of 2007-08 and its aftermath led to numerous changes in financial market regulation and banking supervision. One important change appears in the Third Basel Accord (Basel Committee, 2010), where new emphasis is placed on "Expected Shortfall" (ES) as a measure of risk, complementing, and in parts substituting, the more-familiar Value-at-Risk (VaR) measure. Expected Shortfall is the expected return on an asset conditional on the return being below a given quantile of its distribution, namely its VaR. That is, if $Y_t$ is the return on some asset over some horizon (e.g., one day or one week) with conditional (on information set $\mathcal{F}_{t-1}$) distribution $F_t$, which we assume to be strictly increasing with finite mean, the $\alpha$-level VaR and ES are:

$$\mathrm{ES}_t = E\left[ Y_t \mid Y_t \le \mathrm{VaR}_t, \mathcal{F}_{t-1} \right] \tag{1}$$

$$\text{where } \mathrm{VaR}_t = F_t^{-1}(\alpha), \text{ for } \alpha \in (0, 1) \tag{2}$$

$$\text{and } Y_t \mid \mathcal{F}_{t-1} \sim F_t \tag{3}$$

As Basel III is implemented worldwide (implementation is expected to occur in the period leading up to January 1st, 2019), ES will inevitably gain, and require, increasing attention from risk managers and banking supervisors and regulators. The new "market discipline" aspects of Basel III mean that ES and VaR will be regularly disclosed by banks, and so a knowledge of these measures will also likely be of interest to these banks' investors and counter-parties.

There is, however, a paucity of empirical models for expected shortfall. The large literature on volatility models (see Andersen et al. (2006) for a review) and VaR models (see Komunjer (2013) and McNeil et al. (2015)) has provided many useful models for these measures of risk. However, while ES has long been known to be a "coherent" measure of risk (Artzner, et al. [...] jointly elicitable with VaR, to build new dynamic models for ES and VaR.

This paper makes three main contributions. Firstly, we present some novel dynamic models for ES and VaR, drawing on the GAS framework of Creal, et al. (2013), as well as successful models from the volatility literature, see Andersen et al. (2006). The models we propose are semiparametric in that they impose parametric structures for the dynamics of ES and VaR, but are completely agnostic about the conditional distribution of returns (aside from regularity conditions required for estimation and inference). The models proposed in this paper are related to the class of "CAViaR" models proposed by Engle and Manganelli (2004a), in that we directly parameterize the measure(s) of risk that are of interest, and avoid the need to specify a conditional distribution for returns.
The models we consider make estimation and prediction fast and simple to implement. Our semiparametric approach eliminates the need to specify and estimate a conditional density, thereby removing the possibility that such a model is misspecified, though at a cost of a loss of efficiency compared with a correctly specified density model.

Our second contribution is asymptotic theory for a general class of dynamic semiparametric models for ES and VaR. This theory is an extension of results for VaR presented in Weiss (1991) and Engle and Manganelli (2004a), and draws on identification results in Fissler and Ziegel (2016) and results for M-estimators in Newey and McFadden (1994). We present conditions under which the estimated parameters of the VaR and ES models are consistent and asymptotically normal, and we present a consistent estimator of the asymptotic covariance matrix. We show via an extensive Monte Carlo study that the asymptotic results provide reasonable approximations in realistic simulation designs. In addition to being useful for the new models we propose, the asymptotic theory we present provides a general framework for other researchers to develop, estimate, and evaluate new models for VaR and ES.

Our third contribution is an extensive application of our new models and estimation methods in an out-of-sample analysis of forecasts of ES and VaR for four international equity indices over the period January 1990 to December 2016. We compare these new models with existing methods from the literature across a range of tail probability values ($\alpha$) used in risk management. We use Diebold and Mariano (1995) tests to identify the best-performing models for ES and VaR, and we present simple regression-based methods, related to those of Engle and Manganelli (2004a) and Nolde and Ziegel (2017), to "backtest" the ES forecasts.

Some work on expected shortfall estimation and prediction has appeared in the literature, overcoming the problem of elicitability in different ways: Engle and Manganelli (2004b) discuss using extreme value theory, combined with GARCH or CAViaR dynamics, to obtain forecasts of ES. Cai and Wang (2008) propose estimating VaR and ES based on nonparametric conditional distributions, while Taylor (2008) and Gschöpf et al. (2015) estimate models for "expectiles" (Newey and Powell, 1987) and map these to ES. Zhu and Galbraith (2011) propose using flexible parametric distributions for the standardized residuals from models for the conditional mean and variance. Drawing on Fissler and Ziegel (2016), we overcome the problem of elicitability more directly, and open up new directions for ES modeling and prediction.

In recent independent work, Taylor (2017) proposes using the asymmetric Laplace distribution to jointly estimate dynamic models for VaR and ES. He shows the intriguing result that the negative log-likelihood of this distribution corresponds to one of the loss functions presented in Fissler and Ziegel (2016), and thus can be used to estimate and evaluate such models. Unlike our paper, Taylor (2017) provides no asymptotic theory for his proposed estimation method, nor any simulation studies of its reliability. However, given the link he presents, the theoretical results we present below can be used to justify ex post the methods of his paper.

The remainder of the paper is structured as follows. In Section 2 we present new dynamic semiparametric models for ES and VaR and compare them with the main existing models for ES and VaR. In Section 3 we present asymptotic distribution theory for a generic dynamic semiparametric model for ES and VaR, and in Section 4 we study the finite-sample properties of the asymptotic theory in some realistic Monte Carlo designs. In Section 5 we apply the new models to daily data on four international equity indices, and compare these models both in-sample and out-of-sample with existing models. Section 6 concludes. Proofs and additional technical details are presented in the appendix, and a supplemental web appendix contains detailed proofs and additional analyses.

2 Dynamic models for ES and VaR

In this section we propose some new dynamic models for expected shortfall (ES) and Value-at-Risk (VaR). We do so by exploiting recent work in Fissler and Ziegel (2016) which shows that these variables are elicitable jointly, despite the fact that ES was known to be not elicitable separately, see Gneiting (2011a). The models we propose are based on the GAS framework of Creal, et al. (2013) and Harvey (2013), which we briefly review in Section 2.2 below.

Fissler and Ziegel (2016) show that the following class of loss functions (or "scoring rules"), indexed by the functions $G_1$ and $G_2$, is consistent for VaR and ES. That is, minimizing the expected loss using any of these loss functions returns the true VaR and ES. In the functions below, we use the notation $v$ and $e$ for VaR and ES.

$$L_{FZ}(Y, v, e; \alpha, G_1, G_2) = \left( \mathbf{1}\{Y \le v\} - \alpha \right) \left( G_1(v) - G_1(Y) + \frac{1}{\alpha} G_2(e)\, v \right) - G_2(e) \left( \frac{1}{\alpha} \mathbf{1}\{Y \le v\}\, Y - e \right) - \mathcal{G}_2(e) \tag{4}$$

where $G_1$ is weakly increasing, $G_2$ is strictly increasing and strictly positive, and $\mathcal{G}_2' = G_2$. We will refer to the above class as "FZ loss functions." Minimizing any member of this class yields VaR and ES:

$$(\mathrm{VaR}_t, \mathrm{ES}_t) = \arg\min_{(v, e)} E_{t-1}\left[ L_{FZ}(Y_t, v, e; \alpha, G_1, G_2) \right] \tag{5}$$

Using the FZ loss function for estimation and forecast evaluation requires choosing $G_1$ and $G_2$. We choose these so that the loss function generates loss differences (between competing forecasts) that are homogeneous of degree zero. Patton and Sheppard (2009) show that, in volatility forecasting applications, this property leads to higher power in Diebold-Mariano (1995) tests. Nolde and Ziegel (2017) show that there does not generally exist an FZ loss function that generates loss differences that are homogeneous of degree zero. However, zero-degree homogeneity may be attained under a sign restriction: for the values of $\alpha$ that are of interest in risk management applications (namely, values ranging from around 0.01 to 0.10), we may assume that $\mathrm{ES}_t < 0\ \forall t$. The following proposition shows that if we further impose that $\mathrm{VaR}_t < 0\ \forall t$, then, up to irrelevant location and scale factors, there is only one FZ loss function that generates loss differences that are homogeneous of degree zero. The fact that the $L_{FZ0}$ loss function defined below is unique has the added benefit that there are, of course, no remaining shape or tuning parameters to be specified.

[Footnote: Consistency of the FZ loss function for VaR and ES also requires imposing that $e \le v$, which follows naturally from the definitions of ES and VaR in equations (1) and (2). We discuss how we impose this restriction empirically in Sections 4 and 5 below.]

Proposition 1

Define the FZ loss difference for two forecasts $(v_{1t}, e_{1t})$ and $(v_{2t}, e_{2t})$ as $L_{FZ}(Y_t, v_{1t}, e_{1t}; \alpha, G_1, G_2) - L_{FZ}(Y_t, v_{2t}, e_{2t}; \alpha, G_1, G_2)$. Under the assumption that VaR and ES are both strictly negative, the loss differences generated by a FZ loss function are homogeneous of degree zero iff $G_1(x) = 0$ and $G_2(x) = -1/x$. The resulting "FZ0" loss function is:

$$L_{FZ0}(Y, v, e; \alpha) = -\frac{1}{\alpha e} \mathbf{1}\{Y \le v\} (v - Y) + \frac{v}{e} + \log(-e) - 1 \tag{6}$$

Figure 1 plots $L_{FZ0}$ for a fixed negative value of $Y$. In the left panel we fix $e = -2.06$ and vary $v$, and in the right panel we fix $v = -1.64$ and vary $e$. (These values for $(v, e)$ are the $\alpha = 0.05$ VaR and ES from a standard Normal distribution.) As neither of these is the complete loss function, the minimum is not zero in either panel. The left panel shows that the implied VaR loss function resembles the "tick" loss function from quantile estimation, see Komunjer (2005) for example. In the right panel we see that the implied ES loss function resembles the "QLIKE" loss function from volatility forecasting, see Patton (2011) for example. In both panels, values of $(v, e)$ where $v < e$ are presented with a dashed line, as by definition $\mathrm{ES}_t$ is below $\mathrm{VaR}_t$, and so such values would never be considered in practice. In Figure 2 we plot the contours of expected FZ0 loss for a standard Normal random variable. The minimum value is attained when $(v, e)$ equals the true VaR and ES of that variable [...] $\alpha$-quantiles, continuous densities, and negative ES.

[Footnote: If VaR can be positive, then there is one free shape parameter in the class of zero-homogeneous FZ loss functions (a ratio of $\varphi$ constants, in the notation of the proof of Proposition 1). In that case, our use of the loss function in equation (6) can be interpreted as setting that shape parameter to zero. This shape parameter does not affect the consistency of the loss function, as it is a member of the FZ class, but it may affect the ranking of misspecified models, see Patton (2016).]

[ INSERT FIGURES 1 AND 2 ABOUT HERE ]

With the FZ0 loss function in hand, it is then possible to consider semiparametric dynamic models for ES and VaR:

$$(\mathrm{VaR}_t, \mathrm{ES}_t) = \left( v(Z_{t-1}; \theta),\ e(Z_{t-1}; \theta) \right) \tag{7}$$

that is, where the true VaR and ES are some specified parametric functions of elements of the information set, $Z_{t-1} \in \mathcal{F}_{t-1}$. The parameters of this model are estimated via:

$$\hat{\theta}_T = \arg\min_{\theta} \frac{1}{T} \sum_{t=1}^{T} L_{FZ0}\left( Y_t, v(Z_{t-1}; \theta), e(Z_{t-1}; \theta); \alpha \right) \tag{8}$$

Such models impose a parametric structure on the dynamics of VaR and ES, through their relationship with lagged information, but require no assumptions, beyond regularity conditions, on the conditional distribution of returns.
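As a concrete illustration of equation (6), the following minimal Python sketch evaluates the FZ0 loss and checks on simulated standard Normal data that the average loss is lower at the true $\alpha = 0.05$ VaR and ES than at nearby perturbed values. The function names `fz0_loss` and `normal_var_es`, the sample size, the seed and the perturbation sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np
from statistics import NormalDist


def fz0_loss(y, v, e, alpha):
    """FZ0 loss of equation (6). Requires e < 0; consistency also needs e <= v."""
    y = np.asarray(y, dtype=float)
    hit = (y <= v).astype(float)  # indicator 1{Y <= v}
    return -hit * (v - y) / (alpha * e) + v / e + np.log(-e) - 1.0


def normal_var_es(alpha):
    """alpha-level VaR and ES of a standard Normal variable."""
    nd = NormalDist()
    v = nd.inv_cdf(alpha)      # VaR is the alpha-quantile
    e = -nd.pdf(v) / alpha     # E[Z | Z <= v] = -phi(v) / alpha
    return v, e


if __name__ == "__main__":
    alpha = 0.05
    v0, e0 = normal_var_es(alpha)
    rng = np.random.default_rng(0)      # fixed seed, illustrative only
    y = rng.standard_normal(200_000)
    base = fz0_loss(y, v0, e0, alpha).mean()
    # Perturb VaR and ES one at a time, keeping e <= v throughout.
    for dv, de in [(0.3, 0.0), (-0.3, 0.0), (0.0, 0.3), (0.0, -0.3)]:
        worse = fz0_loss(y, v0 + dv, e0 + de, alpha).mean()
        print(f"dv={dv:+.1f} de={de:+.1f} loss difference={worse - base:.4f}")
```

The same function can serve as the per-observation term in the sample objective of equation (8), once a model supplies the sequences of $v$ and $e$.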
Because they leave the conditional distribution of returns unspecified, these models are semiparametric. Using theory for M-estimators (see White (1994) and Newey and McFadden (1994) for example) we establish in Section 3 below the asymptotic properties of such estimators. Before doing so, we first consider some new dynamic specifications for ES and VaR.

One of the challenges in specifying a dynamic model for a risk measure, or any other quantity of interest, is the mapping from lagged information to the current value of the variable. Our first proposed specification for ES and VaR draws on the work of Creal, et al. (2013) and Harvey (2013), who proposed a general class of models called "generalized autoregressive score" (GAS) models by the former authors, and "dynamic conditional score" models by the latter author. In both cases the models start from an assumption that the target variable has some parametric conditional distribution, where the parameter (vector) of that distribution follows a GARCH-like equation. The forcing variable in the model is the lagged score of the log-likelihood, scaled by some positive definite matrix, a common choice for which is the inverse Hessian. This specification nests many well known models, including ARMA, GARCH (Bollerslev, 1986) and ACD (Engle and Russell, 1998) models. See Koopman et al. (2016) for an overview of GAS and related models.

We adopt this modeling approach and apply it to our M-estimation problem. In this application, the forcing variable is a function of the derivative and Hessian of the $L_{FZ0}$ loss function rather than a log-likelihood. We will consider the following GAS(1,1) model for ES and VaR:

$$\begin{bmatrix} v_{t+1} \\ e_{t+1} \end{bmatrix} = w + B \begin{bmatrix} v_t \\ e_t \end{bmatrix} + A H_t^{-1} \nabla_t \tag{9}$$

where $w$ is a $(2 \times 1)$ vector and $B$ and $A$ are $(2 \times 2)$ matrices. The forcing variable in this specification is comprised of two components, the first is the score:

$$\nabla_t \equiv \begin{bmatrix} \partial L_{FZ0}(Y_t, v_t, e_t; \alpha) / \partial v_t \\ \partial L_{FZ0}(Y_t, v_t, e_t; \alpha) / \partial e_t \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\alpha v_t e_t} \lambda_{v,t} \\ -\dfrac{1}{\alpha e_t^2} \left( \lambda_{v,t} + \alpha \lambda_{e,t} \right) \end{bmatrix} \tag{10}$$

$$\text{where } \lambda_{v,t} \equiv -v_t \left( \mathbf{1}\{Y_t \le v_t\} - \alpha \right) \tag{11}$$

$$\lambda_{e,t} \equiv \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t\}\, Y_t - e_t \tag{12}$$

[Footnote: Note that the expression given for $\partial L_{FZ0} / \partial v_t$ only holds for $Y_t \ne v_t$. As we assume that $Y_t$ is continuously distributed, this holds with probability one.]

The scaling matrix, $H_t$, is related to the Hessian:

$$I_t \equiv \begin{bmatrix} \dfrac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t)]}{\partial v_t^2} & \dfrac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t)]}{\partial v_t \partial e_t} \\ \bullet & \dfrac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t)]}{\partial e_t^2} \end{bmatrix} = \begin{bmatrix} -\dfrac{f_t(v_t)}{\alpha e_t} & 0 \\ 0 & \dfrac{1}{e_t^2} \end{bmatrix} \tag{13}$$

The second equality above exploits the fact that $\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t; \alpha)] / \partial v_t \partial e_t = 0$ under the assumption that the dynamics for VaR and ES are correctly specified. The first element of the matrix $I_t$ depends on the unknown conditional density of $Y_t$. We would like to avoid estimating this density, and we approximate the term $f_t(v_t)$ as being proportional to $v_t^{-1}$. This approximation holds exactly if $Y_t$ is a zero-mean location-scale random variable, $Y_t = \sigma_t \eta_t$, where $\eta_t \sim iid\ F_\eta(0, 1)$, as in that case we have:

$$f_t(v_t) = f_t(\sigma_t v_\alpha) = \frac{1}{\sigma_t} f_\eta(v_\alpha) \equiv k_\alpha \frac{1}{v_t} \tag{14}$$

where $k_\alpha \equiv v_\alpha f_\eta(v_\alpha)$ is a constant with the same sign as $v_t$. We define $H_t$ to equal $I_t$ with the first element replaced using the approximation in the above equation.
The forcing variable in our GAS model for VaR and ES then becomes:

$$H_t^{-1} \nabla_t = \begin{bmatrix} -\dfrac{1}{k_\alpha} \lambda_{v,t} \\ -\dfrac{1}{\alpha} \left( \lambda_{v,t} + \alpha \lambda_{e,t} \right) \end{bmatrix} \tag{15}$$

Notice that the second element is a linear combination of $\lambda_{v,t}$ and $\lambda_{e,t}$, and since the forcing variable is premultiplied by a coefficient matrix, say $\tilde{A}$, we can equivalently use

$$\tilde{A} H_t^{-1} \nabla_t = A \lambda_t, \quad \text{where } \lambda_t \equiv [\lambda_{v,t}, \lambda_{e,t}]' \tag{16}$$

We choose to work with the $A \lambda_t$ parameterization, as the two elements of this forcing variable $(\lambda_{v,t}, \lambda_{e,t})$ are not directly correlated, while the elements of $H_t^{-1} \nabla_t$ are correlated due to the overlapping term ($\lambda_{v,t}$) appearing in both elements. This aids the interpretation of the results of the model without changing its fit.

[Footnote: Note that we do not use the fact that the scaling matrix is exactly the inverse Hessian (e.g., by invoking the information matrix equality) in our empirical application or our theoretical analysis. Also, note that if we considered a value of $\alpha$ for which $v_t = 0$, then $v_\alpha = 0$ and we cannot justify our approximation using this approach. However, we focus on cases where $\alpha \ll 1/2$, and so we are comfortable assuming $v_t \ne 0$, making $k_\alpha$ invertible.]

To gain some intuition for how past returns affect current forecasts of ES and VaR in this model, consider the "news impact curve" of this model, which presents $(v_{t+1}, e_{t+1})$ as a function of $Y_t$ through its impact on $\lambda_t \equiv [\lambda_{v,t}, \lambda_{e,t}]'$, holding all other variables constant. Figure 3 shows these two curves for a fixed value of $\alpha$, using the estimated parameters for this model when applied to daily returns on the S&P 500 index (details are presented in Section 5 below). We consider two values for the "current" value of $(v, e)$: 10% above and below the long-run average for these variables. We see that for values where $Y_t > v_t$, the news impact curves are flat, reflecting the fact that on those days the value of the realized return does not enter the forcing variable. When $Y_t \le v_t$, we see that ES and VaR react linearly to $Y$ and this reaction is through the $\lambda_{e,t}$ forcing variable; the reaction through the $\lambda_{v,t}$ forcing variable is a simple step (down) in both of these risk measures.

The specification in Section 2.2 allows ES and VaR to evolve as two separate, correlated, processes. In many risk forecasting applications, a useful simpler model is one based on a structure with only one time-varying risk measure, e.g. volatility. We will consider a one-factor model in this section, and will name the model in Section 2.2 a "two-factor" GAS model.

Consider the following one-factor GAS model for ES and VaR, where both risk measures are driven by a single variable, $\kappa_t$:

$$v_t = a \exp\{\kappa_t\} \tag{17}$$

$$e_t = b \exp\{\kappa_t\}, \quad \text{where } b < a < 0$$

$$\text{and } \kappa_t = \omega + \beta \kappa_{t-1} + \gamma H_{t-1}^{-1} s_{t-1}$$

The forcing variable, $H_{t-1}^{-1} s_{t-1}$, in the evolution equation for $\kappa_t$ is obtained from the FZ0 loss function, plugging in $(a \exp\{\kappa_t\}, b \exp\{\kappa_t\})$ for $(v_t, e_t)$. Using details provided in Appendix B.2, we find that the score and Hessian are:

$$s_t \equiv \frac{\partial L_{FZ0}(Y_t, a \exp\{\kappa_t\}, b \exp\{\kappa_t\}; \alpha)}{\partial \kappa_t} = -\frac{1}{e_t} \left( \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t\}\, Y_t - e_t \right) \tag{18}$$

$$\text{and } I_t \equiv \frac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, a \exp\{\kappa_t\}, b \exp\{\kappa_t\}; \alpha)]}{\partial \kappa_t^2} = \frac{\alpha - k_\alpha a_\alpha}{\alpha} \tag{19}$$

where $k_\alpha$ is a negative constant and $a_\alpha$ lies between zero and one. The Hessian, $I_t$, turns out to be a constant in this case, and since we estimate a free coefficient on our forcing variable, we simply set $H_t$ to one. Note that the VaR score, $\lambda_{v,t} = \partial L / \partial v$, turns out to drop out from the forcing variable. Thus the one-factor GAS model for ES and VaR becomes:

$$\kappa_t = \omega + \beta \kappa_{t-1} + \gamma \frac{-1}{b \exp\{\kappa_{t-1}\}} \left( \frac{1}{\alpha} \mathbf{1}\{Y_{t-1} \le a \exp\{\kappa_{t-1}\}\}\, Y_{t-1} - b \exp\{\kappa_{t-1}\} \right) \tag{20}$$

Using the FZ loss function for estimation, we are unable to identify $\omega$, as there exists $(\tilde{\omega}, \tilde{a}, \tilde{b}) \ne (\omega, a, b)$ such that both triplets yield identical sequences of ES and VaR estimates, and thus identical values of the objective function. We fix $\omega = 0$ and forfeit identification of the level of the series for $\kappa_t$, though we of course retain the ability to model and forecast ES and VaR. Foreshadowing the empirical results in Section 5, we find that this one-factor GAS model outperforms the two-factor GAS model in out-of-sample forecasts for most of the asset return series that we study.
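The recursion in equations (17) and (20) can be sketched in a few lines of Python. This is only a minimal illustration with $\omega$ fixed at zero as in the text: the function name `gas1f_filter`, the starting value `kappa0` and the parameter values in the usage example are assumptions made here, not estimates from the paper.

```python
import numpy as np


def gas1f_filter(y, beta, gamma, a, b, alpha, kappa0=0.0):
    """One-factor GAS recursion of equation (20) with omega fixed at 0.

    Returns (v, e, kappa); v[t] and e[t] are the VaR and ES for y[t],
    built only from observations before t.
    """
    if not (b < a < 0):
        raise ValueError("the model requires b < a < 0")
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    kappa = np.empty(T)
    kappa[0] = kappa0  # starting value: an assumption of this sketch
    for t in range(1, T):
        v_prev = a * np.exp(kappa[t - 1])
        e_prev = b * np.exp(kappa[t - 1])
        hit = 1.0 if y[t - 1] <= v_prev else 0.0
        lam_e = hit * y[t - 1] / alpha - e_prev      # lambda_{e,t-1}
        score = -lam_e / e_prev                      # forcing variable, eq. (18)
        kappa[t] = beta * kappa[t - 1] + gamma * score
    return a * np.exp(kappa), b * np.exp(kappa), kappa


if __name__ == "__main__":
    rng = np.random.default_rng(1)          # illustrative data only
    y = rng.standard_normal(1000)
    # Hypothetical parameter values chosen for illustration.
    v, e, kappa = gas1f_filter(y, beta=0.95, gamma=-0.005,
                               a=-1.645, b=-2.063, alpha=0.05)
    print(v[-1], e[-1])
```

With this sign convention the forcing variable equals one on days without a VaR violation and is strongly negative after a violation, so a negative `gamma` moves both risk measures further from zero after a tail event.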

As noted in the introduction, there is a relative paucity of dynamic models for ES and VaR, but there is not a complete absence of such models. The simplest existing model is based on a simple rolling window estimate of these quantities:

$$\widehat{\mathrm{VaR}}_t = \widehat{\mathrm{Quantile}}\{Y_s\}_{s=t-m}^{t-1} \tag{21}$$

$$\widehat{\mathrm{ES}}_t = \frac{1}{\alpha m} \sum_{s=t-m}^{t-1} Y_s \mathbf{1}\{Y_s \le \widehat{\mathrm{VaR}}_s\}$$

where $\widehat{\mathrm{Quantile}}\{Y_s\}_{s=t-m}^{t-1}$ denotes the sample quantile of $Y_s$ over the period $s \in [t-m, t-1]$. Common choices for the window size, $m$, include 125, 250 and 500, corresponding to six months, one year and two years of daily return observations respectively.

More challenging competitors for the new ES and VaR models proposed in this paper are those based on ARMA-GARCH dynamics for the conditional mean and variance, accompanied by some assumption for the distribution of the standardized residuals. These models all take the form:

$$Y_t = \mu_t + \sigma_t \eta_t \tag{22}$$

$$\eta_t \sim iid\ F_\eta(0, 1)$$

where $\mu_t$ and $\sigma_t^2$ are specified to follow some ARMA and GARCH model, and $F_\eta(0, 1)$ is some arbitrary, strictly increasing, distribution with mean zero and variance one. What remains is to specify a distribution for the standardized residual, $\eta_t$. Given a choice for $F_\eta$, VaR and ES forecasts are given by:

$$v_t = \mu_t + a \sigma_t, \quad \text{where } a = F_\eta^{-1}(\alpha) \tag{23}$$

$$e_t = \mu_t + b \sigma_t, \quad \text{where } b = E[\eta_t \mid \eta_t \le a]$$

[Footnote: This one-factor model for ES and VaR can also be obtained by considering a zero-mean volatility model for $Y_t$, with iid standardized residuals, say denoted $\eta_t$. In this case, $\kappa_t$ is the log conditional standard deviation of $Y_t$, and $a = F_\eta^{-1}(\alpha)$ and $b = E[\eta \mid \eta \le a]$. (We exploit this interpretation when linking these models to GARCH models in Section 2.5.1 below.) The lack of identification of $\omega$ means that we do not identify the level of log volatility.]

Two parametric choices for $F_\eta$ are common in the literature:

$$\eta_t \sim iid\ N(0, 1) \tag{24}$$

$$\eta_t \sim iid\ \text{Skew } t\,(0, 1, \nu, \lambda)$$

There are various skew $t$ distributions used in the literature; in the empirical analysis below we use that of Hansen (1994). A nonparametric alternative is to estimate the distribution of $\eta_t$ using the empirical distribution function (EDF), an approach that is also known as "filtered historical simulation," and one that is perhaps the best existing model for ES, see the survey by Engle and Manganelli (2004b). We consider all of these models in our empirical analysis in Section 5.
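The two simplest benchmarks above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: `rolling_var_es` uses the current window's sample quantile as the threshold for every observation in the window and averages the observations at or below it, which is a simplification of the tail average in equation (21), and `gaussian_var_es` applies equation (23) with the Normal choice in equation (24), taking $\mu_t$ and $\sigma_t$ as given. Both function names are assumptions made here.

```python
import numpy as np
from statistics import NormalDist


def rolling_var_es(y, alpha, m):
    """Rolling-window VaR and ES forecasts from the previous m observations.

    Simplification (assumption of this sketch): the ES is the mean of the
    window observations at or below the window's own sample quantile.
    The first m entries are NaN because no full window is available.
    """
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    v = np.full(T, np.nan)
    e = np.full(T, np.nan)
    for t in range(m, T):
        window = y[t - m:t]                  # observations t-m, ..., t-1
        v[t] = np.quantile(window, alpha)
        e[t] = window[window <= v[t]].mean()
    return v, e


def gaussian_var_es(mu, sigma, alpha):
    """VaR and ES from equation (23) when eta is standard Normal."""
    nd = NormalDist()
    a = nd.inv_cdf(alpha)        # a = F_eta^{-1}(alpha)
    b = -nd.pdf(a) / alpha       # b = E[eta | eta <= a]
    return mu + a * sigma, mu + b * sigma


if __name__ == "__main__":
    rng = np.random.default_rng(2)           # illustrative data only
    y = rng.standard_normal(600)
    v_roll, e_roll = rolling_var_es(y, alpha=0.05, m=250)
    v_norm, e_norm = gaussian_var_es(mu=0.0, sigma=1.0, alpha=0.05)
    print(v_roll[-1], e_roll[-1], v_norm, e_norm)
```

In a full ARMA-GARCH benchmark, `mu` and `sigma` would be the fitted conditional mean and standard deviation series; the function accepts NumPy arrays for both.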

In this section we consider two extensions of the models presented above, in an attempt to combine the success and parsimony of GARCH models with this paper's focus on ES and VaR forecasting.

If an ARMA-GARCH model, including the specification for the distribution of standardized residuals, is correctly specified for the conditional distribution of an asset return, then maximum likelihood is the most efficient estimation method, and should naturally be adopted. If, on the other hand, we consider an ARMA-GARCH model only as a useful approximation to the true conditional distribution, then it is no longer clear that MLE is optimal. In particular, if the application of the model is to ES and VaR forecasting, then we might be able to improve the fitted ARMA-GARCH model by estimating the parameters of that model via FZ loss minimization, as discussed in Section 2.1. This estimation method is related to one discussed in Remark 1 of Francq and Zakoïan (2015).

[Footnote: Some authors have also considered modeling the tail of $F_\eta$ using extreme value theory, however for the relatively non-extreme values of $\alpha$ we consider here, past work (e.g., Engle and Manganelli (2004b), Nolde and Ziegel (2016) and Taylor (2017)) has found EVT to perform no better than the EDF, and so we do not include it in our analysis.]

Consider the following model for asset returns:

$$Y_t = \kappa_t \eta_t, \quad \eta_t \sim iid\ F_\eta(0, 1) \tag{25}$$

$$\kappa_t^2 = \omega + \beta \kappa_{t-1}^2 + \gamma Y_{t-1}^2$$

The variable $\kappa_t^2$ is the conditional variance and is assumed to follow a GARCH(1,1) process. This model implies a structure analogous to the one-factor GAS model presented in Section 2.3, as we find:

$$v_t = a \cdot \kappa_t, \quad \text{where } a = F_\eta^{-1}(\alpha) \tag{26}$$

$$e_t = b \cdot \kappa_t, \quad \text{where } b = E[\eta \mid \eta \le a]$$

Some further results on VaR and ES in dynamic location-scale models are presented in Appendix B.3. To apply this model to VaR and ES forecasting, we also have to estimate the VaR and ES of the standardized residual, denoted $(a, b)$. Rather than estimating the parameters of this model using (Q)MLE, we consider here estimating them via FZ loss minimization. As in the one-factor GAS model, $\omega$ is unidentified and we set it to one, so the parameter vector to be estimated is $(\beta, \gamma, a, b)$. This estimation approach leads to a fitted GARCH model that is tailored to provide the best-fitting ES and VaR forecasts, rather than the best-fitting volatility forecasts.

Finally, we consider a direct combination of the forcing variable suggested by a GAS structure for a one-factor model of returns, described in equation (20), with the successful GARCH model for volatility. We specify:

$$Y_t = \exp\{\kappa_t\} \eta_t, \quad \eta_t \sim iid\ F_\eta(0, 1) \tag{27}$$

$$\kappa_t = \omega + \beta \kappa_{t-1} + \gamma \left( -\frac{1}{e_{t-1}} \left( \frac{1}{\alpha} \mathbf{1}\{Y_{t-1} \le v_{t-1}\}\, Y_{t-1} - e_{t-1} \right) \right) + \delta \log |Y_{t-1}|$$

The variable $\kappa_t$ is the log-volatility, identified up to scale. As the latent variable in this model is log-volatility, we use the lagged log absolute return rather than the lagged squared return, so that the units remain in line for the evolution equation for $\kappa_t$. There are five parameters in this model $(\beta, \gamma, \delta, a, b)$, and we estimate them using FZ loss minimization.

This section presents asymptotic theory for the estimation of dynamic ES and VaR models by minimizing FZ loss. Given a sample of observations $(y_1, \cdots, y_T)$ and a constant $\alpha \in (0, 0.5)$, we are interested in the conditional $\alpha$ quantile (VaR) and corresponding expected shortfall of $Y_t$. Suppose $Y_t$ is a real-valued random variable that has, conditional on information set $\mathcal{F}_{t-1}$, distribution function $F_t(\cdot \mid \mathcal{F}_{t-1})$ and corresponding density function $f_t(\cdot \mid \mathcal{F}_{t-1})$. Let $v_1(\theta^0)$ and $e_1(\theta^0)$ be some initial conditions for VaR and ES and let $\mathcal{F}_{t-1} = \sigma\{Y_{t-1}, X_{t-1}, \cdots, Y_1, X_1\}$, where $X_t$ is a vector of exogenous variables or predetermined variables, be the information set available for forecasting $Y_t$. The vector of unknown parameters to be estimated is $\theta^0 \in \Theta \subset \mathbb{R}^p$. The conditional VaR and ES of $Y_t$ at probability level $\alpha$, that is $\mathrm{VaR}_\alpha(Y_t \mid \mathcal{F}_{t-1})$ and $\mathrm{ES}_\alpha(Y_t \mid \mathcal{F}_{t-1})$, are assumed to follow some dynamic model:

$$\begin{bmatrix} \mathrm{VaR}_\alpha(Y_t \mid \mathcal{F}_{t-1}) \\ \mathrm{ES}_\alpha(Y_t \mid \mathcal{F}_{t-1}) \end{bmatrix} = \begin{bmatrix} v(Y_{t-1}, X_{t-1}, \cdots, Y_1, X_1; \theta^0) \\ e(Y_{t-1}, X_{t-1}, \cdots, Y_1, X_1; \theta^0) \end{bmatrix} \equiv \begin{bmatrix} v_t(\theta^0) \\ e_t(\theta^0) \end{bmatrix}, \quad t = 1, \cdots, T \tag{28}$$

The unknown parameters are estimated as:

$$\hat{\theta}_T \equiv \arg\min_{\theta \in \Theta} L_T(\theta) \tag{29}$$

$$\text{where } L_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} L_{FZ0}\left( Y_t, v_t(\theta), e_t(\theta); \alpha \right)$$

and the FZ loss function $L_{FZ0}$ is defined in equation (6).
Below we provide conditions under which estimation of these parameters via FZ loss minimization leads to a consistent and asymptotically normal estimator, with standard errors that can be consistently estimated.

Assumption 1

(A) $L(Y_t, v_t(\theta), e_t(\theta); \alpha)$ obeys the uniform law of large numbers.
(B)(i) $\Theta$ is a compact subset of $\mathbb{R}^p$ for $p < \infty$. (ii) $\{Y_t\}_{t=1}^{\infty}$ is a strictly stationary process. Conditional on all the past information $\mathcal{F}_{t-1}$, the distribution of $Y_t$ is $F_t(\cdot \mid \mathcal{F}_{t-1})$ which, for all $t$, belongs to a class of distribution functions on $\mathbb{R}$ with finite first moments and unique $\alpha$-quantiles. (iii) $\forall t$, both $v_t(\theta)$ and $e_t(\theta)$ are $\mathcal{F}_{t-1}$-measurable and continuous in $\theta$. (iv) If $\Pr\left[ v_t(\theta) = v_t(\theta^0) \cap e_t(\theta) = e_t(\theta^0) \right] = 1\ \forall t$, then $\theta = \theta^0$.

Theorem 1 (Consistency)

Under Assumption 1, $\hat{\theta}_T \xrightarrow{p} \theta^0$ as $T \to \infty$.

The proof of Theorem 1, provided in Appendix A, is straightforward given Theorem 2.1 of Newey and McFadden (1994) and Corollary 5.5 of Fissler and Ziegel (2016). Note that a variety of uniform laws of large numbers (our Assumption 1(A)) are available for the time series applications we consider here, see Andrews (1987) and Pötscher and Prucha (1989) for example. Zwingmann and Holzmann (2016) show that if the $\alpha$-quantile is not unique (violating our Assumption 1(B)(ii)), then the convergence rate and asymptotic distribution of $(\hat{v}_T, \hat{e}_T)$ are non-standard, even in a setting with iid data. We do not consider such problematic cases here.

We next turn to the asymptotic distribution of our parameter estimator. In the assumptions below, $K$ denotes a finite constant that can change from line to line, and we use $\|x\|$ to denote the Euclidean norm of a vector $x$.

Assumption 2

(A) For all $t$, we have (i) $v_t(\theta)$ and $e_t(\theta)$ are twice continuously differentiable in $\theta$, (ii) $v_t(\theta^0) \le 0$.
(B) For all $t$, we have (i) Conditional on all the past information $\mathcal{F}_{t-1}$, $Y_t$ has a continuous density $f_t(\cdot \mid \mathcal{F}_{t-1})$ that satisfies $f_t(y \mid \mathcal{F}_{t-1}) \le K < \infty$ and $|f_t(y_1 \mid \mathcal{F}_{t-1}) - f_t(y_2 \mid \mathcal{F}_{t-1})| \le K |y_1 - y_2|$, (ii) $E\left[ |Y_t|^{4+\delta} \right] \le K < \infty$, for some $0 < \delta < 1$.
(C) There exists a neighborhood of $\theta^0$, $\mathcal{N}(\theta^0)$, such that for all $t$ we have (i) $|1/e_t(\theta)| \le K < \infty$, $\forall \theta \in \mathcal{N}(\theta^0)$, (ii) there exist some (possibly stochastic) $\mathcal{F}_{t-1}$-measurable functions $V(\mathcal{F}_{t-1})$, $V_1(\mathcal{F}_{t-1})$, $H_1(\mathcal{F}_{t-1})$, $V_2(\mathcal{F}_{t-1})$, $H_2(\mathcal{F}_{t-1})$ which satisfy $\forall \theta \in \mathcal{N}(\theta^0)$: $|v_t(\theta)| \le V(\mathcal{F}_{t-1})$, $\|\nabla v_t(\theta)\| \le V_1(\mathcal{F}_{t-1})$, $\|\nabla e_t(\theta)\| \le H_1(\mathcal{F}_{t-1})$, $\|\nabla^2 v_t(\theta)\| \le V_2(\mathcal{F}_{t-1})$, and $\|\nabla^2 e_t(\theta)\| \le H_2(\mathcal{F}_{t-1})$.
(D) For some $0 < \delta < 1$ and for all $t$, moments of the bounding functions in (C) are bounded by $K$: (i) moments of $V_1$, $H_1$, $V_2$ and $H_2$ individually, (ii) a moment of the product of $V$, $V_1$ and $H_1$, and (iii) moments of products of $H_1$, $H_2$ and $|Y_t|$. The exponents in these conditions depend on $\delta$ and are not legible in this copy.
(E) The matrix $D_T$ defined in Theorem 2 has eigenvalues bounded below by a positive constant for $T$ sufficiently large.
(F) The sequence $\{T^{-1/2} \sum_{t=1}^{T} g_t(\theta^0)\}$ obeys the CLT, where

$$g_t(\theta) = \frac{\partial L(Y_t, v_t(\theta), e_t(\theta); \alpha)}{\partial \theta} \tag{30}$$

$$= \nabla v_t(\theta)' \frac{1}{-e_t(\theta)} \left( \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t(\theta)\} - 1 \right) + \nabla e_t(\theta)' \frac{1}{e_t(\theta)^2} \left( \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t(\theta)\} (v_t(\theta) - Y_t) - v_t(\theta) + e_t(\theta) \right) \tag{31}$$

(G) $\{Y_t\}$ is $\alpha$-mixing of size $-2q/(q-2)$ for some $q > 2$.

Most of the above assumptions are standard. Assumption 2(A)(ii) imposes that the VaR is negative, but given our focus on the left-tail ($\alpha \le 0.5$) of asset returns, this is not likely a binding constraint. Assumptions 2(B),(C) and (E) are similar to those in Engle and Manganelli (2004a). Assumption 2(B)(ii) requires at least $4 + \delta$ moments of returns to exist, however 2(D) may actually increase the number of required moments, depending on the VaR-ES model employed. For the familiar GARCH(1,1) process, used in our simulation study, it can be shown that we only need to assume that $4 + \delta$ moments exist. Assumption 2(F) allows for some CLT for mixing data to be invoked, and 2(G) is a standard assumption on the time series dependence of the data.

Theorem 2 (Asymptotic Normality)

Under Assumptions 1 and 2, we have √ T A − / T D T ( ˆ θ T − θ ) d → N (0 , I ) as T → ∞ (32) where D T = E " T − T X t =1 f t (cid:0) v t ( θ ) |F t − (cid:1) − e t ( θ ) α ∇ v t ( θ ) ′ ∇ v t ( θ ) + 1 e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) (33) A T = E " T − T X t =1 g t ( θ ) g t ( θ ) ′ (34) and g t is deﬁned in Assumption 2(F). An outline of the proof of this theorem is given in Appendix A, and the detailed lemmasunderlying it are provided in the supplemental appendix. The proof of Theorem 2 builds on Huber(1967), Weiss (1991) and Engle and Manganelli (2004a), who focused on the estimation of quantiles.16inally, we present a result for estimating the asymptotic covariance matrix of ˆ θ T , therebyenabling the reporting of standard errors and conﬁdence intervals. Assumption 3 (A) The deterministic positive sequence c T satisﬁes c T = o (1) and c − T = o ( T / ) .(B)(i) T − P Tt =1 g t ( θ ) g t ( θ ) ′ − A T p → , where A T is deﬁned in Theorem 2.(ii) T − P Tt =1 1 e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) − E [ T − P Tt =1 1 e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ )] p → .(iii) T − P Tt =1 f t ( v t ( θ ) |F t − ) − e t ( θ ) α ∇ v t ( θ ) ′ ∇ v t ( θ ) − E [ T − P Tt =1 f t ( v t ( θ ) |F t − ) − e t ( θ ) α ∇ v t ( θ ) ′ ∇ v t ( θ )] p → . Theorem 3

Under Assumptions 1-3, ˆA T − A T p → and ˆD T − D T p → , where ˆA T = T − T X t =1 g t ( ˆ θ T ) g t ( ˆ θ T ) ′ ˆD T = T − T X t =1  c T n(cid:12)(cid:12)(cid:12) y t − v t (cid:16) ˆ θ T (cid:17)(cid:12)(cid:12)(cid:12) < c T o ∇ ′ v t (cid:16) ˆ θ T (cid:17) ∇ v t (cid:16) ˆ θ T (cid:17) − αe t (cid:16) ˆ θ T (cid:17) + ∇ ′ e t (cid:16) ˆ θ T (cid:17) ∇ e t (cid:16) ˆ θ T (cid:17) e t (cid:16) ˆ θ T (cid:17)  This result extends Theorem 3 in Engle and Manganelli (2004a) from dynamic VaR modelsto dynamic joint models for VaR and ES. The key choice in estimating the asymptotic covariancematrix is the bandwidth parameter in Assumption 3(A). In our simulation study below we set thisto T − / and we ﬁnd that this leads to satisfactory ﬁnite-sample properties.The results here extend some very recent work in the literature: Dimitriadis and Bayer (2017)consider VaR-ES regression, but focus on iid data and linear speciﬁcations. Barendse (2017)considers “interquantile expectation regression,” which nests VaR-ES regression as a special case.He allows for time series data, but imposes that the models are linear. Our framework allows fortime series data and nonlinear models.

In this section we investigate the ﬁnite-sample accuracy of the asymptotic theory for dynamic ESand VaR models presented in the previous section. For ease of comparison with existing studies of Dimitriadis and Bayer (2017) also consider a variety of FZ loss functions, in contrast with our focus on the FZ0loss function, and they consider both M and GMM (or Z , in their notation) estimation, while we focus only on M estimation. Y t = σ t η t (35) σ t = ω + βσ t − + γY t − η t ∼ iid F η (0 ,

1) (36)We set the parameters of this DGP to ( ω, β, γ ) = (0 . , . , . . We consider two choices for thedistribution of η t : a standard Normal, and the standardized skew t distribution of Hansen (1994),with degrees of freedom and skewness parameters in the latter set to (5 , − . . Under this DGP,the ES and VaR are proportional to σ t , with(VaR αt , ES αt ) = ( a α , b α ) σ t (37)We make the dependence of the coeﬃcients of proportionality ( a α , b α ) on α explicit here, as weconsider a variety of values of α in this simulation study: α ∈ { . , . , . , . , . } . Interestin VaR and ES from regulators focuses on the smaller of these values of α, but we also considerthe larger values to better understand the properties of the asymptotic approximations at variouspoints in the tail of the distribution.For a standard Normal distribution, with CDF and PDF denoted Φ and φ, we have: a α = Φ − ( α ) (38) b α = − φ (cid:0) Φ − ( α ) (cid:1) /α For Hansen’s skew t distribution we can obtain a α from the inverse CDF, but no closed-formexpression for b α is available; we instead use a simulation of 10 million iid draws to estimate it. Asnoted above, FZ loss minimization does not allow us to identify ω in the GARCH model, and in ourempirical work we set this parameter to 1. To facilitate comparisons of the accuracy of estimatesof ( a α , b α ) in our simulation study we instead set ω at its true value. This is done without loss ofgenerality and merely eases the presentation of the results. To match our empirical application, wereplace the parameter a α with c α = a α /b α , and so our parameter vector becomes [ β, γ, b α , c α ] .
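The Normal-innovation constants in equation (38) are easy to check numerically. The sketch below is only an illustration: the function names `normal_var_es_constants` and `var_es_from_sigma` are hypothetical, and it uses Python's standard-library `statistics.NormalDist` rather than whatever software the authors used.

```python
from statistics import NormalDist


def normal_var_es_constants(alpha):
    """Return (a_alpha, b_alpha) for standard Normal innovations, as in eq. (38)."""
    std_normal = NormalDist()          # mean 0, standard deviation 1
    a = std_normal.inv_cdf(alpha)      # a_alpha = Phi^{-1}(alpha)
    b = -std_normal.pdf(a) / alpha     # b_alpha = -phi(Phi^{-1}(alpha)) / alpha
    return a, b


def var_es_from_sigma(alpha, sigma_t):
    """Scale the constants by sigma_t, as in eq. (37); also return c_alpha = a/b."""
    a, b = normal_var_es_constants(alpha)
    return a * sigma_t, b * sigma_t, a / b


if __name__ == "__main__":
    v, e, c = var_es_from_sigma(0.05, sigma_t=1.0)
    print(f"VaR = {v:.4f}, ES = {e:.4f}, c = {c:.4f}")
```

For left-tail values of α the ES constant lies below the VaR constant, so the ratio c_α = a_α/b_α falls strictly between zero and one.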

We consider two sample sizes $T$, corresponding to 10 and 20 years of daily returns respectively. These large sample sizes enable us to consider estimating models for quantiles as low as 1%, which are often used in risk management. We repeat all simulations 1000 times.

Table 1 presents results for the estimation of this model on standard Normal innovations, and Table 2 presents corresponding results for skew $t$ innovations. The top row of each panel presents the true parameter values, with the latter two parameters changing across $\alpha$. The second row presents the median estimated parameter across simulations, and the third row presents the average bias in the estimated parameter. Both of these measures indicate that the parameter estimates are nicely centered on the true parameter values. The penultimate row presents the cross-simulation standard deviations of the estimated parameters, and we observe that these decrease with the sample size and increase as we move further into the tails (i.e., as $\alpha$ decreases), both as expected. Comparing the standard deviations across Tables 1 and 2, we also note that they are higher for skew $t$ innovations than for Normal innovations, again as expected.

The last row in each panel presents the coverage probabilities for 95% confidence intervals for each parameter, constructed using the estimated standard errors, with the bandwidth parameter $c_T$ chosen as described following Theorem 3. For $\alpha \geq 0.05$ we see that the coverage is reasonable, ranging from around 0.88 to 0.96. For $\alpha = 0.025$ or $\alpha = 0.01$ the coverage tends to be too low, particularly for the smaller sample size. Thus some caution is required when interpreting the standard errors for the models with the smallest values of $\alpha$. In Table S1 of the Supplemental Appendix we present results for (Q)MLE for the GARCH model corresponding to the results in Tables 1 and 2, using the theory of Bollerslev and Wooldridge (1992). In Tables S2 and S3 we present results for CAViaR estimation of this model, using the "tick" loss function and the theory of Engle and Manganelli (2004a). We find that (Q)MLE has better finite-sample properties than FZ minimization, but CAViaR estimation has slightly worse properties than FZ minimization.

In (Q)MLE, the parameters to be estimated are $[\omega, \beta, \gamma]$. In "CAViaR" estimation, which is done by minimizing the "tick" loss function, the parameters to be estimated are $[\beta, \gamma, a_\alpha]$, since in this case the parameter $\omega$ is again unidentified. As for the study of FZ estimation, we set $\omega$ to its true value to facilitate interpretation of the results.

[INSERT TABLES 1 AND 2 ABOUT HERE]

In Table 3 we compare the efficiency of FZ estimation relative to (Q)MLE and to CAViaR estimation, for the parameters that all three estimation methods have in common, namely $[\beta, \gamma]$. As expected, when the innovations are standard Normal, FZ estimation is substantially less efficient than MLE; however, when the innovations are skew $t$ the loss in efficiency drops, and for some values of $\alpha$ FZ estimation is actually more efficient than QMLE. This switch in the ranking of the competing estimators is qualitatively in line with results in Francq and Zakoïan (2015). In Panel B of Table 3, we see that FZ estimation is generally, though not uniformly, more efficient than CAViaR estimation.

In many applications, interest is more focused on the forecasted values of VaR and ES than the estimated parameters of the models. To study this, Table 4 presents results on the accuracy of the fitted VaR and ES estimates for the three estimation methods: (Q)MLE, CAViaR and FZ estimation. To obtain estimates of VaR and ES from the (Q)ML estimates, we follow common empirical practice and compute the sample VaR and ES of the estimated standardized residuals. In the first column of each panel we present the mean absolute error (MAE) from (Q)MLE, and in the next two columns we present the relative MAE of CAViaR and FZ to (Q)MLE. Table 4 reveals that (Q)MLE is the most accurate estimation method. Averaging across values of $\alpha$, CAViaR is about 40% worse for Normal innovations, and 24% worse for skew $t$ innovations, while FZ fares somewhat better, being about 30% worse for Normal innovations and 16% worse for skew $t$ innovations. The superior performance of (Q)MLE is not surprising when the innovations are Normal, as that corresponds to (full) maximum likelihood, which has maximal efficiency. Weighing against the loss in FZ estimation efficiency is the robustness that FZ estimation offers relative to QML. For applications even further from Normality, e.g. with time-varying skewness or kurtosis, the loss in efficiency of QML is likely even greater.

[INSERT TABLES 3 AND 4 ABOUT HERE]

Overall, these simulation results show that the asymptotic results of the previous section provide reasonable approximations in finite samples, with the approximations improving for larger sample sizes and less extreme values of $\alpha$. Compared with MLE, estimation by FZ loss minimization is generally less accurate, while it is generally more accurate than estimation using the CAViaR approach of Engle and Manganelli (2004a). The latter outperformance is likely attributable to the fact that FZ estimation draws on information from two tail measures, VaR and ES, while CAViaR was designed to only model VaR.
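The "sample VaR and ES of the estimated standardized residuals" used for the (Q)MLE forecasts can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_var_es` is hypothetical, and the quantile convention (the ceil(αn)-th order statistic) is one of several reasonable choices, not necessarily the one used in the paper.

```python
import math


def sample_var_es(values, alpha):
    """Empirical alpha-quantile (VaR) and the mean of the observations
    at or below that order statistic (ES)."""
    ordered = sorted(values)
    # Assumed convention: take the ceil(alpha * n)-th smallest observation.
    k = max(1, math.ceil(alpha * len(ordered)))
    var = ordered[k - 1]
    tail = ordered[:k]
    es = sum(tail) / len(tail)
    return var, es


if __name__ == "__main__":
    residuals = [3, -1, 0.5, -4, 2, -2, 1, 0]   # toy standardized residuals
    a_hat, b_hat = sample_var_es(residuals, alpha=0.25)
    sigma_t = 1.5                               # hypothetical fitted volatility
    print("VaR forecast:", a_hat * sigma_t, "ES forecast:", b_hat * sigma_t)
```

Multiplying the two sample constants by the fitted conditional volatility gives the VaR and ES forecasts, in the same way as equation (37).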

We now apply the models discussed in Section 2 to the forecasting of ES and VaR for daily returns on four international equity indices. We consider the S&P 500 index, the Dow Jones Industrial Average, the NIKKEI 225 index of Japanese stocks, and the FTSE 100 index of UK stocks. Our sample period is 1 January 1990 to 31 December 2016, yielding between 6,630 and 6,805 observations per series (the exact numbers vary due to differences in holidays and market closures). In our out-of-sample analysis, we use the first ten years for estimation, and reserve the remaining 17 years for evaluation and model comparison.

Table 5 presents full-sample summary statistics on these four return series. Average annualized returns range from -2.7% for the NIKKEI to 7.2% for the DJIA, and annualized standard deviations range from 17.0% to 24.7%. All return series exhibit mild negative skewness (around -0.15) and substantial kurtosis (around 10). The lower two panels of Table 5 present the sample VaR and ES for four choices of $\alpha$.

Table 6 presents results from standard time series models estimated on these return series over the in-sample period (Jan 1990 to Dec 1999). In the first panel we present the estimated parameters of the optimal ARMA($p$, $q$) models, where the choice of ($p$, $q$) is made using the BIC. The $R^2$ values from the optimal models never rise above 1%, consistent with the well-known lack of predictability of these series. The second panel presents the parameters of the GARCH(1,1) model for conditional variance, and the lower panel presents the estimated parameters of the skew $t$ distribution applied to the standardized residuals. All of these parameters are broadly in line with values obtained by other authors for these or similar series.

[INSERT TABLES 5 AND 6 ABOUT HERE]

In-sample estimation

We now present estimates of the parameters of the models presented in Section 2, along with standard errors computed using the theory from Section 3. 
In the interests of space, we only report the parameter estimates for the S&P 500 index for $\alpha = 0.05$. The two-factor GAS model based on the FZ0 loss function is presented in the left panel of Table 7. This model allows for separate dynamics in VaR and ES, and we present the parameters for each of these risk measures in separate columns. We observe that the persistence of these processes is high, with the estimated $b$ parameters equal to 0.973 and 0.977, similar to the persistence found in GARCH models (e.g., see Table 6). The model-implied average values of VaR and ES are -2.001 and -2.556, similar to the sample values of these measures reported in Table 5. We also observe that in neither equation is the coefficient on $\lambda_v$ statistically significant: the $t$-statistics on $a_v$ are both well below one. The coefficients on $\lambda_e$ are both larger, and more significant (the $t$-statistics are 1.58 and 1.75), indicating that the forcing variable from the ES part of the FZ0 loss function is the more informative component. However, the overall imprecision of the four coefficients on the forcing variables suggests that this model is over-parameterized.

The right panel of Table 7 shows three one-factor models for ES and VaR. The first is the one-factor GAS model, which is nested in the two-factor model presented in the left panel. We see a slight loss in fit (the average loss is slightly greater) but the parameters of this model are estimated with greater precision. The one-factor GAS model fits slightly better than the GARCH model estimated via FZ loss minimization (reported in the penultimate column). The "hybrid" model, augmenting the one-factor GAS model with a GARCH-type forcing variable, fits better than the other one-factor models, and also better than the larger two-factor GAS model, and we observe that the coefficient on the GARCH forcing variable ($\delta$) is significantly different from zero (with a $t$-statistic of 2.07).

Computational details on the estimation of these models are given in Appendix C.

Recall that in all of the one-factor models, the intercept ($\omega$) in the GAS equation is unidentified. We fix it at zero for the GAS-1F and Hybrid models, and at one for the GARCH-FZ model. This has no impact on the fit of these models for VaR and ES, but it means that we cannot interpret the estimated ($a$, $b$) parameters as the VaR and ES of the standardized residuals, and we no longer expect the estimated values to match the sample estimates in Table 5.

[INSERT TABLE 7 ABOUT HERE]

We now turn to the out-of-sample (OOS) forecast performance of the models discussed above, as well as some competitor models from the existing literature. We will focus initially on the results for $\alpha = 0.05$, given the focus on that percentile in the extant VaR literature. (Results for other values of $\alpha$ are considered later, with details provided in the supplemental appendix.) We will consider a total of ten models for forecasting ES and VaR. Firstly, we consider three rolling window methods, using window lengths of 125, 250 and 500 days. We next consider ARMA-GARCH models, with the ARMA model orders selected using the BIC, and assuming that the distribution of the innovations is standard Normal or skew $t$, or estimating it nonparametrically using the sample ES and VaR of the estimated standardized residuals. Finally we consider four new semiparametric dynamic models for ES and VaR: the two-factor GAS model presented in Section 2.2, the one-factor GAS model presented in Section 2.3, a GARCH model estimated using FZ loss minimization, and the "hybrid" GAS/GARCH model presented in Section 2.5. We estimate these models using the first ten years as our in-sample period, and retain those parameter estimates throughout the OOS period.

In Figure 4 below we plot the fitted 5% ES and VaR for the S&P 500 return series, using three models: the rolling window model using a window of 125 days, the GARCH-EDF model, and the one-factor GAS model. This figure covers both the in-sample and out-of-sample periods. The figure shows that the average ES was estimated at around -2%, rising as high as around -1% in the mid 90s and mid 00s, and falling to its most extreme values of around -10% during the financial crisis in late 2008. Thus, like volatility, ES fluctuates substantially over time.

Figure 5 zooms in on the last two years of our sample period, to better reveal the differences in the estimates from these models. We observe the usual step-like movements in the rolling window estimate of VaR and ES, as the more extreme observations enter and leave the estimation window. Comparing the GARCH and GAS estimates, we see how they differ in reacting to returns: the GARCH estimates are driven by lagged squared returns, and thus move stochastically each day. The GAS estimates, on the other hand, only use information from returns when the VaR is violated, and on other days the estimates revert deterministically to the long-run mean. This generates a smoother time series of VaR and ES estimates. We investigate below which of these estimates provides a better fit to the data.

[INSERT FIGURES 4 AND 5 ABOUT HERE]

The left panel of Table 8 presents the average OOS losses, using the FZ0 loss function from equation (6), for each of the ten models, for the four equity return series. The lowest values in each column are highlighted in bold, and the second-lowest are in italics. We observe that the one-factor GAS model, labelled FZ1F, is the preferred model for the two US equity indices, while the Hybrid model is the preferred model for the NIKKEI and FTSE indices. The worst model is the rolling window with a window length of 500 days.

While average losses are useful for an initial look at OOS forecast performance, they do not reveal whether the gains are statistically significant. Table 9 presents Diebold-Mariano $t$-statistics on the loss differences, for the S&P 500 index. Corresponding tables for the other three equity return series are presented in Table S4 of the supplemental appendix. The tests are conducted as "row model minus column model" and so a positive number indicates that the column model outperforms the row model. The column "FZ1F" corresponding to the one-factor GAS model contains all positive entries, revealing that this model out-performed all competing models. This outperformance is strongly significant for the comparisons to the rolling window forecasts, as well as the GARCH model with Normal innovations. The gains relative to the GARCH model with skew $t$ or nonparametric innovations are not significant, with DM $t$-statistics of 1.48 and 1.16 respectively. Similar results are found for the best models for each of the other three equity return series. Thus the worst models are easily separated from the better models, but the best few models are generally not significantly different.

[INSERT TABLES 8 AND 9 ABOUT HERE]

Table S5 in the supplemental appendix presents results analogous to Table 8, but with $\alpha = 0.025$, which is the value for ES that is the focus of the Basel III accord. The rankings and results are qualitatively similar to those for $\alpha = 0.05$ discussed here.
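A Diebold-Mariano comparison of the kind reported in Table 9 can be sketched in a few lines. The function name `diebold_mariano_t` and the optional Bartlett-weighted (Newey-West style) long-run variance are assumptions made for illustration; the text does not state which variance estimator was used.

```python
import math


def diebold_mariano_t(loss_row, loss_col, lags=0):
    """t-statistic on the mean of (row loss - column loss).

    A positive value means the column model has the lower average loss.
    `lags` controls an optional Bartlett-kernel correction (an assumption here).
    """
    d = [r - c for r, c in zip(loss_row, loss_col)]
    n = len(d)
    mean_d = sum(d) / n
    dev = [x - mean_d for x in d]
    # Long-run variance: lag-0 term plus weighted autocovariances.
    lrv = sum(x * x for x in dev) / n
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1)
        gamma = sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n
        lrv += 2.0 * weight * gamma
    return mean_d / math.sqrt(lrv / n)


if __name__ == "__main__":
    row_losses = [1.0, 2.0, 3.0, 4.0]   # toy FZ0 losses for the "row" model
    col_losses = [0.5, 1.0, 2.5, 3.0]   # toy FZ0 losses for the "column" model
    print(diebold_mariano_t(row_losses, col_losses))
```

In practice the two inputs would be the per-period OOS FZ0 losses of the two models being compared.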

To complement the study of the relative performance of these models for ES and VaR, we now consider goodness-of-fit tests for the OOS forecasts of VaR and ES. Under correct specification of the model for VaR and ES, we know that

$$E_{t-1}\begin{bmatrix} \partial L_{FZ0}(Y_t, v_t, e_t; \alpha)/\partial v_t \\ \partial L_{FZ0}(Y_t, v_t, e_t; \alpha)/\partial e_t \end{bmatrix} = 0 \tag{39}$$

and we note that this implies that $E_{t-1}[\lambda_{v,t}] = E_{t-1}[\lambda_{e,t}] = 0$, where $(\lambda_{v,t}, \lambda_{e,t})$ are defined in equations (11)-(12). Thus the variables $\lambda_{v,t}$ and $\lambda_{e,t}$ can be considered as a form of "generalized residual" for this model. To mitigate the impact of serial correlation in these measures (which comes through the persistence of $v_t$ and $e_t$) we use standardized versions of these residuals:

$$\lambda^s_{v,t} \equiv -\frac{\lambda_{v,t}}{v_t} = 1\{Y_t \leq v_t\} - \alpha \tag{40}$$

$$\lambda^s_{e,t} \equiv \frac{\lambda_{e,t}}{e_t} = \frac{1}{\alpha} 1\{Y_t \leq v_t\} \frac{Y_t}{e_t} - 1$$

We then regress each standardized residual on a constant, its own lag, and the corresponding forecast:

$$\lambda^s_{v,t} = a_0 + a_1 \lambda^s_{v,t-1} + a_2 v_t + u_{v,t} \tag{41}$$

$$\lambda^s_{e,t} = b_0 + b_1 \lambda^s_{e,t-1} + b_2 e_t + u_{e,t}$$

We test forecast optimality by testing that all terms ($a = [a_0, a_1, a_2]'$ and $b = [b_0, b_1, b_2]'$) in these regressions are zero, against the usual two-sided alternative. Similar "conditional calibration" tests are presented in Nolde and Ziegel (2017). One could also consider a joint test of both of the above null hypotheses; however, we will focus on these separately so that we can determine which variable is well or poorly specified.

The right two panels of Table 8 present the $p$-values from the tests of the goodness-of-fit of the VaR and ES forecasts. Entries greater than 0.10 (indicating no evidence against optimality at the 0.10 level) are in bold, and entries between 0.05 and 0.10 are in italics. For the S&P 500 index and the DJIA, we see that only one model passes both the VaR and ES tests: the one-factor GAS model. For the NIKKEI we see that all of the dynamic models pass these two tests, while all three of the rolling window models fail. For the FTSE index, on the other hand, we see that all ten models considered here fail both the goodness-of-fit tests. 
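The standardized generalized residuals in equation (40) are simple to compute from a series of returns and forecasts. The sketch below is illustrative only; the function name `standardized_residuals` is hypothetical, and the subsequent regression step is left to any OLS routine.

```python
def standardized_residuals(y, v, e, alpha):
    """Return the lists (lambda_v^s, lambda_e^s) for returns y and forecasts (v, e).

    lambda_v^s is the demeaned VaR violation indicator; lambda_e^s compares the
    scaled tail return with the ES forecast. Requires e_t != 0 for every t.
    """
    lam_v, lam_e = [], []
    for y_t, v_t, e_t in zip(y, v, e):
        hit = 1.0 if y_t <= v_t else 0.0
        lam_v.append(hit - alpha)
        lam_e.append(hit * y_t / (alpha * e_t) - 1.0)
    return lam_v, lam_e


if __name__ == "__main__":
    returns = [-3.0, 1.0]
    var_forecasts = [-2.0, -2.0]
    es_forecasts = [-2.5, -2.5]
    print(standardized_residuals(returns, var_forecasts, es_forecasts, alpha=0.5))
```

Under correct specification both series should have conditional mean zero, which is what the regressions in equation (41) test.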
The outcomes for the NIKKEI and the FTSE each, in different ways, present good examples of the problem highlighted in Nolde and Ziegel (2017), that many different models may pass a goodness-of-fit test, or all models may fail, which makes discussing their relative performance difficult. To do so, one can look at Diebold-Mariano tests of differences in average loss, as we do in Table 9.

Finally, in Table 10 we look at the performance of these models across four values of $\alpha$, to see whether the best-performing models change with how deep in the tails we are. We find that this is indeed the case: for the smallest value of $\alpha$, the best-performing model across the four return series is the GARCH model estimated by FZ loss minimization, followed by the GARCH model with nonparametric residuals. These two models are also the (equal) best two models for the next-smallest value of $\alpha$. For $\alpha = 0.05$ and $\alpha = 0.10$ the two best models are the one-factor GAS model and the Hybrid model. These rankings are perhaps related to the fact that the forcing variable in the GAS model depends on observing a violation of the VaR, and for very small values of $\alpha$ these violations occur only infrequently. In contrast, the GARCH model uses the information from the squared residual, and so information from the data moves the risk measures whether a VaR violation was observed or not. When $\alpha$ is not so small, the forcing variable suggested by the GAS model applied to the FZ loss function starts to out-perform.

[INSERT TABLE 10 ABOUT HERE]

Conclusion

With the implementation of the Third Basel Accord in the next few years, risk managers and regulators will place greater focus on expected shortfall (ES) as a measure of risk, complementing and partly substituting previous emphasis on Value-at-Risk (VaR). We draw on recent results from statistical decision theory (Fissler and Ziegel, 2016) to propose new dynamic models for ES and VaR. The models proposed are semiparametric, in that they impose parametric structures for the dynamics of ES and VaR, but are agnostic about the conditional distribution of returns. We also present asymptotic distribution theory for the estimation of these models, and we verify that the theory provides a good approximation in finite samples. We apply the new models and methods to daily returns on four international equity indices, over the period 1990 to 2016, and find the proposed new ES-VaR models outperform forecasts based on GARCH or rolling window models.

The asymptotic theory presented in this paper facilitates considering a large number of extensions of the models presented here. Our models all focus on a single value for the tail probability ($\alpha$), and extending these to consider multiple values simultaneously could prove fruitful. For example, one could consider the values 0.01, 0.025 and 0.05, to capture various points in the left tail, or one could consider 0.05 and 0.95 to capture both the left and right tails simultaneously. Another natural extension is to make use of exogenous information in the model; the models proposed here are all univariate, and one might expect information from options markets, high frequency data, or news announcements to also help predict VaR and ES. We leave these interesting extensions to future research.

Appendix A: Proofs

Proof of Proposition 1.

Theorem C.3 of Nolde and Ziegel (2017) shows that under the assumption that ES is strictly negative, the loss differences generated by a FZ loss function are homogeneous of degree zero iff $G_1(x) = \varphi_1 1\{x \geq 0\}$ and $G_2(x) = -\varphi_2/x$ with $\varphi_1 \geq 0$ and $\varphi_2 > 0$. Denote the resulting loss function as $L^*_{FZ}(Y, v, e; \alpha, \varphi_1, \varphi_2)$, and notice that:

$$\begin{aligned}
L^*_{FZ}(Y, v, e; \alpha, \varphi_1, \varphi_2) &= \varphi_1 \left(1\{Y \leq v\} - \alpha\right)\left(1\{v \geq 0\} - 1\{Y \geq 0\}\right) \\
&\quad + \varphi_2 \left\{ -\left(1\{Y \leq v\} - \alpha\right)\frac{1}{\alpha}\frac{v}{e} + \frac{1}{e}\left(\frac{1}{\alpha}1\{Y \leq v\}Y - e\right) + \log(-e) \right\} \\
&= \varphi_1 \left(1\{Y \leq v\} - \alpha\right)\left(1\{v \geq 0\} - 1\{Y \geq 0\}\right) + \varphi_2 L_{FZ0}(Y, v, e; \alpha) \\
&= \varphi_2 L_{FZ0}(Y, v, e; \alpha) + \varphi_1 \alpha 1\{Y \geq 0\} + \varphi_1 \left(1 - \alpha - 1\{Y \geq 0\}\right) 1\{v \geq 0\}
\end{aligned}$$

Under the assumption that $v < 0$, the third term vanishes. The second term is purely a function of $Y$ and so can be disregarded; we can set $\varphi_1 = 0$ without loss of generality. The first term is affected by a scaling parameter $\varphi_2 > 0$, and we can set $\varphi_2 = 1$ without loss of generality. Thus we obtain the $L_{FZ0}$ given in equation (6). If $v$ can be positive, then setting $\varphi_1 = 0$ is interpretable as fixing this shape parameter at a particular value.

Proof of Theorem 1.

The proof is based on Theorem 2.1 of Newey and McFadden (1994). We only need to show that $E[L_T(\cdot)]$ is uniquely minimized at $\theta^0$, because the other assumptions of Newey and McFadden's theorem are clearly satisfied. By Corollary 5.5 of Fissler and Ziegel (2016), given Assumption 1(B)(iii) and the fact that our choice of the objective function $L_{FZ0}$ satisfies the condition of that corollary, we know that $E[L(Y_t, v_t(\theta), e_t(\theta); \alpha) \mid \mathcal{F}_{t-1}]$ is uniquely minimized at $(\mathrm{VaR}_\alpha(Y_t \mid \mathcal{F}_{t-1}), \mathrm{ES}_\alpha(Y_t \mid \mathcal{F}_{t-1}))$, which equals $(v_t(\theta^0), e_t(\theta^0))$ under correct specification. Combining this with Assumption 1(B)(iv), we know that $\theta^0$ is a unique minimizer of $E[L_T(\cdot)]$, completing the proof.

Outline of proof of Theorem 2.

We consider the population objective function $\lambda_T(\theta) = T^{-1}\sum_{t=1}^{T} E[g_t(\theta)]$, and take a mean-value expansion of $\lambda_T(\hat\theta)$ around $\theta^0$. We show in Lemma 1 that:

$$\sqrt{T}(\hat\theta - \theta^0) = -\Lambda_T^{-1}(\theta^0)\frac{1}{\sqrt{T}}\sum_{t=1}^{T} g_t(\theta^0) + o_p(1)$$

where

$$\Lambda_T(\theta^*) = T^{-1}\sum_{t=1}^{T} \left.\frac{\partial E[g_t(\theta)]}{\partial\theta}\right|_{\theta = \theta^*}$$

In the supplemental appendix we prove Lemma 1 by building on and extending Weiss (1991), who extends Huber (1967) to non-iid data. We draw on Weiss' Lemma A.1, and we verify that all five assumptions (N1-N5 in his notation) for that lemma are satisfied: N1, N2 and N5 are obviously satisfied given our Assumptions 1-2, and we show in Lemmas 3-6 that assumptions N3 and N4 are satisfied. Assumption 2(F) allows a CLT to be applied: the asymptotic covariance matrix is $A_T = E\left[T^{-1}\sum_{t=1}^{T} g_t(\theta^0) g_t(\theta^0)'\right]$, and we denote $\Lambda_T(\theta^0)$ as $D_T$, leading to the stated result.

Proof of Theorem 3.

Given Assumption 3(B)(i) and the result in Theorem 1, the proof that $\hat A_T - A_T \overset{p}{\to} 0$ is standard and omitted. Next, define

$$\tilde D_T = T^{-1}\sum_{t=1}^{T}\left\{ (2c_T)^{-1} 1\{|y_t - v_t(\theta^0)| < c_T\}\frac{1}{-e_t(\theta^0)\alpha}\nabla v_t(\theta^0)'\nabla v_t(\theta^0) + \frac{1}{e_t(\theta^0)^2}\nabla e_t(\theta^0)'\nabla e_t(\theta^0) \right\}$$

To prove the result we will show that $\hat D_T - \tilde D_T = o_p(1)$ and $\tilde D_T - D_T = o_p(1)$. Firstly, consider

$$\begin{aligned}
\|\hat D_T - \tilde D_T\| \leq\ & \Bigg\| (2Tc_T)^{-1} \times \sum_{t=1}^{T}\Bigg\{ \\
& \left(1\{|y_t - v_t(\hat\theta_T)| < c_T\} - 1\{|y_t - v_t(\theta^0)| < c_T\}\right)\frac{1}{-e_t(\hat\theta_T)\alpha}\nabla v_t(\hat\theta_T)'\nabla v_t(\hat\theta_T) \\
& + 1\{|y_t - v_t(\theta^0)| < c_T\}\frac{1}{-e_t(\hat\theta_T)\alpha}\left(\nabla v_t(\hat\theta_T) - \nabla v_t(\theta^0)\right)'\nabla v_t(\hat\theta_T) \\
& + 1\{|y_t - v_t(\theta^0)| < c_T\}\left(\frac{1}{-\alpha e_t(\hat\theta_T)} - \frac{1}{-\alpha e_t(\theta^0)}\right)\nabla v_t(\theta^0)'\nabla v_t(\hat\theta_T) \\
& + 1\{|y_t - v_t(\theta^0)| < c_T\}\frac{1}{-\alpha e_t(\theta^0)}\nabla v_t(\theta^0)'\left(\nabla v_t(\hat\theta_T) - \nabla v_t(\theta^0)\right) \\
& + \frac{c_T - \hat c_T}{c_T} 1\{|y_t - v_t(\theta^0)| < c_T\}\frac{1}{-e_t(\theta^0)\alpha}\nabla v_t(\theta^0)'\nabla v_t(\theta^0) \Bigg\}\Bigg\| \\
& + T^{-1}\sum_{t=1}^{T}\left\| \frac{1}{e_t(\hat\theta_T)^2}\nabla e_t(\hat\theta_T)'\nabla e_t(\hat\theta_T) - \frac{1}{e_t(\theta^0)^2}\nabla e_t(\theta^0)'\nabla e_t(\theta^0) \right\|
\end{aligned}$$

The last line above was shown to be $o_p(1)$ in the proof of Theorem 2. The difficult quantity in the first term (over the first six lines above) is the indicator, and following the same steps as in Engle and Manganelli (2004a), that term is also $o_p(1)$. Next, consider $\tilde D_T - D_T$:

$$\begin{aligned}
\tilde D_T - D_T =\ & \frac{1}{2Tc_T}\sum_{t=1}^{T}\left(1\{|Y_t - v_t(\theta^0)| < c_T\} - E\left[1\{|Y_t - v_t(\theta^0)| < c_T\} \mid \mathcal{F}_{t-1}\right]\right) \times \frac{\nabla v_t(\theta^0)'\nabla v_t(\theta^0)}{-e_t(\theta^0)\alpha} \\
& + \frac{1}{T}\sum_{t=1}^{T}\left\{ \frac{1}{2c_T}E\left[1\{|Y_t - v_t(\theta^0)| < c_T\} \mid \mathcal{F}_{t-1}\right]\frac{1}{-e_t(\theta^0)\alpha}\nabla v_t(\theta^0)'\nabla v_t(\theta^0) - E\left[\frac{f_t(v_t(\theta^0))}{-e_t(\theta^0)\alpha}\nabla v_t(\theta^0)'\nabla v_t(\theta^0)\right] \right\}
\end{aligned}$$

Following Engle and Manganelli (2004a), Assumptions 1-3 are sufficient to show $\tilde D_T - D_T = o_p(1)$ and the result follows.

Appendix B: Derivations

Appendix B.1: Generic calculations for the FZ0 loss function

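The FZ0 loss function is:

$$L_{FZ0}(Y, v, e; \alpha) = -\frac{1}{\alpha e}1\{Y \leq v\}(v - Y) + \frac{v}{e} + \log(-e) - 1 \tag{42}$$

This loss function is not homogeneous, as for any $k > 0$, $L_{FZ0}(kY, kv, ke; \alpha) = L_{FZ0}(Y, v, e; \alpha) + \log(k)$, but it generates loss differences that are homogeneous of degree zero, as the additive term above drops out.

We will frequently use the first derivatives of this loss function, and the second derivatives of the expected loss for an absolutely continuous random variable with density $f$ and CDF $F$. These are (for $v \neq Y$):

$$\nabla_v \equiv \frac{\partial L_{FZ0}(Y, v, e; \alpha)}{\partial v} = -\frac{1}{\alpha e}\left(1\{Y \leq v\} - \alpha\right) \equiv \frac{1}{\alpha v e}\lambda_v \tag{43}$$

$$\begin{aligned}
\nabla_e \equiv \frac{\partial L_{FZ0}(Y, v, e; \alpha)}{\partial e} &= \frac{1}{\alpha e^2}1\{Y \leq v\}(v - Y) - \frac{v}{e^2} + \frac{1}{e} \\
&= \frac{v}{\alpha e^2}\left(1\{Y \leq v\} - \alpha\right) - \frac{1}{e^2}\left(\frac{1}{\alpha}1\{Y \leq v\}Y - e\right) \\
&\equiv -\frac{1}{\alpha e^2}\left(\lambda_v + \alpha\lambda_e\right)
\end{aligned} \tag{44}$$

where

$$\lambda_v \equiv -v\left(1\{Y \leq v\} - \alpha\right) \tag{45}$$

$$\lambda_e \equiv \frac{1}{\alpha}1\{Y \leq v\}Y - e \tag{46}$$

and

$$\frac{\partial^2 E[L_{FZ0}(Y, v, e; \alpha)]}{\partial v^2} = -\frac{1}{\alpha e}f(v) \tag{47}$$

$$\frac{\partial^2 E[L_{FZ0}(Y, v, e; \alpha)]}{\partial v\,\partial e} = \frac{1}{\alpha e^2}\left(F(v) - \alpha\right) = 0, \text{ at the true value of } (v, e) \tag{48}$$

$$\frac{\partial^2 E[L_{FZ0}(Y, v, e; \alpha)]}{\partial e^2} = \frac{1}{e^2} - \frac{2}{\alpha e^3}\left\{\left(F(v) - \alpha\right)v - \left(E[1\{Y \leq v\}Y] - \alpha e\right)\right\} = \frac{1}{e^2}, \text{ at the true value of } (v, e) \tag{49}$$

Appendix B.2: Derivations for the one-factor GAS model for ES and VaR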
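Since the derivations in this subsection reuse the Appendix B.1 derivatives, a small numerical check of those formulas may be useful. The sketch below codes the FZ0 loss and its first derivatives as written in equations (43)-(46); the function names `fz0_loss` and `fz0_gradient` are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import math


def fz0_loss(y, v, e, alpha):
    """FZ0 loss for a return y and forecasts (v, e); requires e < 0."""
    hit = 1.0 if y <= v else 0.0
    return -hit * (v - y) / (alpha * e) + v / e + math.log(-e) - 1.0


def fz0_gradient(y, v, e, alpha):
    """First derivatives with respect to v and e, valid for v != y and v != 0."""
    hit = 1.0 if y <= v else 0.0
    lam_v = -v * (hit - alpha)        # eq. (45)
    lam_e = hit * y / alpha - e       # eq. (46)
    grad_v = lam_v / (alpha * v * e)  # eq. (43)
    grad_e = -(lam_v + alpha * lam_e) / (alpha * e ** 2)  # eq. (44)
    return grad_v, grad_e


if __name__ == "__main__":
    y, v, e, alpha, k = -3.0, -2.0, -2.5, 0.05, 4.0
    # Scaling (Y, v, e) by k shifts the loss by log(k), so loss differences
    # between two forecasts are unaffected by the scale of the data.
    shift = fz0_loss(k * y, k * v, k * e, alpha) - fz0_loss(y, v, e, alpha)
    print(shift, math.log(k))
    print(fz0_gradient(y, v, e, alpha))
```

The gradient can also be compared against central finite differences of `fz0_loss` away from the kink at v = Y.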

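Here we present the calculations to compute $s_t$ and $I_t$ for this model. Below we use:

$$\frac{\partial v}{\partial\kappa} = \frac{\partial^2 v}{\partial\kappa^2} = a\exp\{\kappa\} = v \tag{50}$$

$$\frac{\partial e}{\partial\kappa} = \frac{\partial^2 e}{\partial\kappa^2} = b\exp\{\kappa\} = e \tag{51}$$

And so we find (for $v_t \neq Y_t$)

$$\begin{aligned}
s_t &\equiv \frac{\partial L_{FZ0}(Y_t, v_t, e_t; \alpha)}{\partial\kappa_t} \\
&= \frac{\partial L_{FZ0}(Y_t, v_t, e_t; \alpha)}{\partial v_t}\frac{\partial v_t}{\partial\kappa_t} + \frac{\partial L_{FZ0}(Y_t, v_t, e_t; \alpha)}{\partial e_t}\frac{\partial e_t}{\partial\kappa_t} \\
&= \left\{-\frac{1}{\alpha e_t}\left(1\{Y_t \leq v_t\} - \alpha\right)\right\}v_t + \left\{-\frac{1}{e_t^2}\left(\frac{1}{\alpha}1\{Y_t \leq v_t\}Y_t - e_t\right) + \frac{v_t}{e_t^2\alpha}\left(1\{Y_t \leq v_t\} - \alpha\right)\right\}e_t \\
&= -\frac{1}{e_t}\left(\frac{1}{\alpha}1\{Y_t \leq v_t\}Y_t - e_t\right) \\
&\equiv -\lambda_{e,t}/e_t
\end{aligned} \tag{52-54}$$

Thus, the $\lambda_{v,t}$ term drops out of $s_t$ and we are left with $-\lambda_{e,t}/e_t$. Next we calculate $I_t$:

$$\begin{aligned}
I_t &\equiv \frac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t; \alpha)]}{\partial\kappa_t^2} \\
&= \frac{\partial^2 E_{t-1}[L_{FZ0}]}{\partial v_t^2}\left(\frac{\partial v_t}{\partial\kappa_t}\right)^2 + \frac{\partial^2 E_{t-1}[L_{FZ0}]}{\partial v_t\,\partial e_t}\frac{\partial v_t}{\partial\kappa_t}\frac{\partial e_t}{\partial\kappa_t} \\
&\quad + \frac{\partial^2 E_{t-1}[L_{FZ0}]}{\partial e_t^2}\left(\frac{\partial e_t}{\partial\kappa_t}\right)^2 + \frac{\partial^2 E_{t-1}[L_{FZ0}]}{\partial v_t\,\partial e_t}\frac{\partial v_t}{\partial\kappa_t}\frac{\partial e_t}{\partial\kappa_t} \\
&\quad + \frac{\partial E_{t-1}[L_{FZ0}]}{\partial v_t}\frac{\partial^2 v_t}{\partial\kappa_t^2} + \frac{\partial E_{t-1}[L_{FZ0}]}{\partial e_t}\frac{\partial^2 e_t}{\partial\kappa_t^2}
\end{aligned} \tag{55}$$

But note that under correct specification,

$$\frac{\partial^2 E_{t-1}[L(Y_t, v_t, e_t; \alpha)]}{\partial v_t\,\partial e_t} = \frac{\partial E_{t-1}[L(Y_t, v_t, e_t; \alpha)]}{\partial v_t} = \frac{\partial E_{t-1}[L(Y_t, v_t, e_t; \alpha)]}{\partial e_t} = 0 \tag{56}$$

and so the Hessian simplifies to:

$$\begin{aligned}
I_t &= \frac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t; \alpha)]}{\partial v_t^2}\left(\frac{\partial v_t}{\partial\kappa_t}\right)^2 + \frac{\partial^2 E_{t-1}[L_{FZ0}(Y_t, v_t, e_t; \alpha)]}{\partial e_t^2}\left(\frac{\partial e_t}{\partial\kappa_t}\right)^2 \\
&= -\frac{1}{\alpha e_t}f_t(v_t)v_t^2 + 1 \\
&= \frac{\alpha - k_\alpha a_\alpha}{\alpha}, \text{ since } f_t(v_t) = \frac{k_\alpha}{v_t} \text{ and } \frac{v_t}{e_t} = a_\alpha, \text{ for this DGP.}
\end{aligned} \tag{57-59}$$

Thus although the Hessian could vary with time, as it is a derivative of the conditional expected loss, in this specification it simplifies to a constant.

Appendix B.3: ES and VaR in location-scale models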

Dynamic location-scale models are widely used for asset returns and in this section we consider what such a specification implies for the dynamics of ES and VaR. Consider the following:

\[ Y_t = \mu_t + \sigma_t \eta_t, \quad \eta_t \sim iid\ F_\eta(0, 1) \tag{60} \]

where, for example, $\mu_t$ is some ARMA model and $\sigma_t$ is some GARCH model. For asset returns that follow equation (60) we have:

\[ v_t = \mu_t + a\sigma_t, \text{ where } a = F_\eta^{-1}(\alpha) \tag{61} \]

\[ e_t = \mu_t + b\sigma_t, \text{ where } b = E[\eta_t \mid \eta_t \le a] \]

and we can recover $(\mu_t, \sigma_t)$ from $(v_t, e_t)$:

\[ \begin{bmatrix} \mu_t \\ \sigma_t \end{bmatrix} = \frac{1}{b - a}\begin{bmatrix} b & -a \\ -1 & 1 \end{bmatrix}\begin{bmatrix} v_t \\ e_t \end{bmatrix} \tag{62} \]

Thus under the conditional location-scale assumption, we can back out the conditional mean and variance from the VaR and ES. Next note that if $\mu_t = 0\ \forall\, t$, then $v_t = c \cdot e_t$, where $c = a/b \in (0, 1)$. Similarly, if $\sigma_t = \bar\sigma\ \forall\, t$, then we have the simplification that $v_t = d + e_t$, where $d = (a - b)\bar\sigma > 0$.

Appendix C: Estimation using the FZ0 loss function
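A minimal Python sketch of the smoothing device used in this appendix is given here, as an illustration rather than the authors' Matlab implementation. It replaces the indicator $\mathbf{1}\{Y \le v\}$ with the logistic function $\Gamma(Y, v; \tau) = 1/(1 + \exp\{\tau(Y - v)\})$ and shows that the smoothed loss approaches the FZ0 loss as $\tau$ grows. The function names, the overflow guard, and the numerical inputs are assumptions made for this sketch.

```python
import math

def smooth_indicator(y, v, tau):
    """Logistic approximation to the indicator 1{y <= v}, for tau > 0."""
    z = tau * (y - v)
    if z > 700.0:          # guard added here: exp(z) would overflow a float
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def fz0_loss(y, v, e, alpha):
    """FZ0 loss with the exact indicator (requires e < 0)."""
    hit = 1.0 if y <= v else 0.0
    return -hit * (v - y) / (alpha * e) + v / e + math.log(-e) - 1.0

def smoothed_fz0_loss(y, v, e, alpha, tau):
    """FZ0 loss with the indicator replaced by its logistic approximation."""
    g = smooth_indicator(y, v, tau)
    return -g * (v - y) / (alpha * e) + v / e + math.log(-e) - 1.0

# Illustrative (assumed) inputs: a return slightly above the VaR.
y, v, e, alpha = -1.0, -1.645, -2.063, 0.05
exact = fz0_loss(y, v, e, alpha)
gaps = [abs(smoothed_fz0_loss(y, v, e, alpha, tau) - exact)
        for tau in (5.0, 20.0, 200.0)]

# tau = 5 at distance 1 smooths exactly as much as tau = 20 at distance 0.25.
same_weight = (smooth_indicator(1.0, 0.0, 5.0), smooth_indicator(0.25, 0.0, 20.0))
```

The list `gaps` shrinks as $\tau$ increases, which is the motivation for estimating first with a small $\tau$, then a larger one, and only then with the unsmoothed objective.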

The FZ0 loss function, equation (6), involves the indicator function $\mathbf{1}\{Y_t \le v_t\}$ and so necessitates the use of a numerical search algorithm that does not rely on differentiability of the objective function; we use the function fminsearch in Matlab. However, in preliminary simulation analyses we found that this algorithm was sensitive to the starting values used in the search. To overcome this, we initially consider a “smoothed” version of the FZ0 loss function, where we replace the indicator variable with a Logistic function:

\[ \tilde{L}_{FZ0}(Y, v, e; \alpha, \tau) = -\frac{1}{\alpha e}\Gamma(Y_t, v_t; \tau)(v - Y) + \frac{v}{e} + \log(-e) - 1 \]

\[ \Gamma(Y_t, v_t; \tau) \equiv \frac{1}{1 + \exp\{\tau(Y_t - v_t)\}}, \text{ for } \tau > 0 \]

where $\tau$ is the smoothing parameter, and the smoothing function $\Gamma$ converges to the indicator function as $\tau \to \infty$. In GAS models that involve an indicator function in the forcing variable, we alter the forcing variable in the same way, to ensure that the objective function as a function of $\theta$ is differentiable. In these cases the loss function and the model itself are slightly altered through this smoothing.

In our empirical implementation, we obtain “smart” starting values by first estimating the model using the “smoothed FZ0” loss function with $\tau = 5$. This choice of $\tau$ gives some smoothing for values of $Y_t$ that are roughly within $\pm 1$ of $v_t$. Call the resulting parameter estimate $\tilde\theta^{(5)}_T$. Since this objective function is differentiable, we can use more familiar gradient-based numerical search algorithms, such as fminunc or fmincon in Matlab, which are often less sensitive to starting values. We then re-estimate the model, using $\tilde\theta^{(5)}_T$ as the starting value, setting $\tau = 20$, and obtain $\tilde\theta^{(20)}_T$. This value of $\tau$ smoothes values of $Y_t$ within roughly $\pm 0.25$ of $v_t$, and so this objective function is closer to the true objective function. Finally, we use $\tilde\theta^{(20)}_T$ as the starting value in the optimization of the actual FZ0 objective function, with no artificial smoothing, using the function fminsearch, and obtain $\hat\theta_T$. We found that this approach largely eliminated the sensitivity to starting values.

References

[1] Andersen, T.G., Bollerslev, T., Christoffersen, P., Diebold, F.X., 2006. Volatility and Correlation Forecasting, in (ed.s) G. Elliott, C.W.J. Granger, and A. Timmermann,

Handbook of Economic Forecasting, Vol. 1. Elsevier, Oxford.
[2] Andrews, D.W.K., 1987, Consistency in nonlinear econometric models: a generic uniform law of large numbers, Econometrica, 55, 1465–1471.
[3] Artzner, P., F. Delbaen, J.M. Eber and D. Heath, 1999, Coherent measures of risk, Mathematical Finance, 9, 203-228.
[4] Barendse, S., 2017, Interquantile Expectation Regression, Tinbergen Institute Discussion Paper, TI 2017-034/III.
[5] Basel Committee on Banking Supervision, 2010, Basel III: A Global Regulatory Framework for More Resilient Banks and Banking Systems, Bank for International Settlements.
[6] Bollerslev, T., 1986, Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics, 31, 307-327.
[7] Bollerslev, T. and J.M. Wooldridge, 1992, Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models with Time Varying Covariances, Econometric Reviews, 11(2), 143-172.
[8] Cai, Z. and X. Wang, 2008, Nonparametric estimation of conditional VaR and expected shortfall, Journal of Econometrics, 147, 120-130.
[9] Creal, D.D., S.J. Koopman, and A. Lucas, 2013, Generalized Autoregressive Score Models with Applications, Journal of Applied Econometrics, 28(5), 777-795.
[10] Creal, D.D., S.J. Koopman, A. Lucas, and M. Zamojski, 2015, Generalized Autoregressive Method of Moments, Tinbergen Institute Discussion Paper, TI 2015-138/III.
[11] Diebold, F.X. and R.S. Mariano, 1995. Comparing predictive accuracy, Journal of Business & Economic Statistics, […]
[12] […] arXiv:1704.02213v1.
[13] Du, Z. and J.C. Escanciano, 2017, Backtesting Expected Shortfall: Accounting for Tail Risk, Management Science, 63(4), 940-958.
[14] Engle, R.F. and S. Manganelli, 2004a, CAViaR: Conditional Autoregressive Value at Risk by Regression Quantiles, Journal of Business & Economic Statistics, 22, 367-381.
[15] Engle, R.F. and S. Manganelli, 2004b, A Comparison of Value-at-Risk Models in Finance, in Giorgio Szego (ed.) Risk Measures for the 21st Century, Wiley.
[16] Engle, R.F. and J.R. Russell, 1998, Autoregressive Conditional Duration: A New Model for Irregularly Spaced Transaction Data, Econometrica, 66, 1127-1162.
[17] Fissler, T., 2017, On Higher Order Elicitability and Some Limit Theorems on the Poisson and Wiener Space, PhD thesis, University of Bern.
[18] Fissler, T., and J. F. Ziegel, 2016, Higher order elicitability and Osband’s principle, Annals of Statistics, 44(4), 1680-1707.
[19] Francq, C. and J.-M. Zakoïan, 2015, Risk-parameter estimation in volatility models, Journal of Econometrics, 184, 158-173.
[20] Gerlach, R. and C.W.S. Chen, 2015, Bayesian Expected Shortfall Forecasting Incorporating the Intraday Range, Journal of Financial Econometrics, 14(1), 128-158.
[21] Gneiting, T., 2011, Making and Evaluating Point Forecasts, Journal of the American Statistical Association, 106(494), 746-762.
[22] Gschöpf, P., W.K. Härdle, and A. Mihoci, Tail Event Risk Expectile based Shortfall, SFB 649 Discussion Paper 2015-047.
[23] Hansen, B.E., 1994, Autoregressive Conditional Density Estimation, International Economic Review, 35(3), 705-730.
[24] Harvey, A.C., 2013, Dynamic Models for Volatility and Heavy Tails, Econometric Society Monograph 52, Cambridge University Press, Cambridge.
[25] Huber, P.J., 1967, The behavior of maximum likelihood estimates under nonstandard conditions, in (ed.s) L.M. Le Cam and J. Neyman, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley.
[26] Komunjer, I., 2005, Quasi-Maximum Likelihood Estimation for Conditional Quantiles, Journal of Econometrics, 128(1), 137-164.
[27] Komunjer, I., 2013, Quantile Prediction, in (ed.s) G. Elliott, and A. Timmermann, Handbook of Economic Forecasting, Vol. 2. Elsevier, Oxford.
[28] Koopman, S.J., A. Lucas and M. Scharth, Predicting Time-Varying Parameters with Parameter Driven and Observation-Driven Models, Review of Economics and Statistics, 98(1), 97-110.
[29] Newey, W.K. and D. McFadden, 1994, Large Sample Estimation and Hypothesis Testing, in R.F. Engle and D.L. McFadden (eds.) Handbook of Econometrics, Vol. IV, Elsevier.
[30] Newey, W.K. and J.L. Powell, 1987, Asymmetric least squares estimation and testing, Econometrica, 55(4), 819-847.
[31] Nolde, N. and J. F. Ziegel, 2017, Elicitability and backtesting: Perspectives for banking regulation, Annals of Applied Statistics, forthcoming.
[32] Patton, A.J., 2011, Volatility Forecast Comparison using Imperfect Volatility Proxies, Journal of Econometrics, 160(1), 246-256.
[33] Patton, A.J., 2016, Comparing Possibly Misspecified Forecasts, working paper, Duke University.
[34] Patton, A.J. and K. Sheppard, 2009, Evaluating Volatility and Correlation Forecasts, in T.G. Andersen, R.A. Davis, J.-P. Kreiss and T. Mikosch (eds.) Handbook of Financial Time Series, Springer Verlag.
[35] Pötscher, B.M. and I.R. Prucha, 1989, A uniform law of large numbers for dependent and heterogeneous data processes, Econometrica, 57, 675–683.
[36] Taylor, J.W., 2008, Estimating Value-at-Risk and Expected Shortfall using Expectiles, Journal of Financial Econometrics, 231-252.
[37] Taylor, J.W., 2017, Forecasting Value at Risk and Expected Shortfall using a Semiparametric Approach Based on the Asymmetric Laplace Distribution, Journal of Business & Economic Statistics, forthcoming.
[38] Weiss, A.A., 1991, Estimating Nonlinear Dynamic Models Using Least Absolute Error Estimation, Econometric Theory, 7(1), 46-68.
[39] White, H. 1994, Estimation, Inference and Specification Analysis, Econometric Society Monographs No. 22, Cambridge University Press.
[40] Zhu, D. and J.W. Galbraith, 2011, Modeling and forecasting expected shortfall with the generalized asymmetric Student-t and asymmetric exponential power distributions, Journal of Empirical Finance, 18, 765-778.
[41] Zwingmann T. and H. Holzmann, 2016, Asymptotics for Expected Shortfall, working paper, available at arXiv:1611.07222.

Table 1: Simulation results for Normal innovations

Columns: β, γ, b_α, c_α for T = 2500 and for T = 5000.

Notes:

This table presents results from 1000 replications of the estimation of VaR and ES from a GARCH(1,1) DGP with standard Normal innovations. Details are described in Section 4. The top row of each panel presents the true values of the parameters. The second, third, and fourth rows present the median estimated parameters, the average bias, and the standard deviation (across simulations) of the estimated parameters. The last row of each panel presents the coverage rates for 95% confidence intervals constructed using estimated standard errors.

Table 2: Simulation results for skew t innovations

Columns: β, γ, b_α, c_α for T = 2500 and for T = 5000.

Notes:

This table presents results from 1000 replications of the estimation of VaR and ES from a GARCH(1,1) DGP with skew t innovations. Details are described in Section 4. The top row of each panel presents the true values of the parameters. The second, third, and fourth rows present the median estimated parameters, the average bias, and the standard deviation (across simulations) of the estimated parameters. The last row of each panel presents the coverage rates for 95% confidence intervals constructed using estimated standard errors.

Table 3: Sampling variation of FZ estimation relative to (Q)MLE and CAViaR

Columns: α; then β and γ for Normal innovations (T = 2500, T = 5000) and for Skew t innovations (T = 2500, T = 5000).

Panel A: FZ/(Q)ML

Panel B: FZ/CAViaR

Notes: This table presents the ratio of cross-simulation standard deviations of parameter estimates obtained by FZ loss minimization and (Q)MLE (Panel A), and CAViaR (Panel B). We consider only the parameters that are common to these three estimation methods, namely the GARCH(1,1) parameters β and γ. Ratios greater than one indicate the FZ estimator is more variable than the alternative estimation method; ratios less than one indicate the opposite.

Table 4: Mean absolute errors for VaR and ES estimates

                     Normal innovations                             Skew t innovations
                 VaR                    ES                      VaR                    ES
         MAE     MAE ratio      MAE     MAE ratio       MAE     MAE ratio      MAE     MAE ratio
α        MLE   CAViaR    FZ     MLE   CAViaR    FZ      QMLE  CAViaR    FZ     QMLE  CAViaR    FZ

Panel A: T = 2500
0.01    0.069  1.368  1.369    0.084  1.487  1.345     0.196  1.327  1.381    0.342  1.249  1.252
0.025   0.055  1.305  1.288    0.064  1.341  1.290     0.120  1.228  1.244    0.205  1.166  1.166
0.05    0.043  1.302  1.271    0.051  1.332  1.289     0.084  1.193  1.166    0.141  1.154  1.129
0.10    0.034  1.322  1.253    0.042  1.394  1.302     0.056  1.168  1.089    0.098  1.160  1.083
0.20    0.026  1.443  1.257    0.033  1.652  1.377     0.034  1.301  1.087    0.066  1.404  1.121

Panel B: T = 5000
0.01    0.049  1.404  1.387    0.060  1.443  1.344     0.138  1.369  1.375    0.245  1.256  1.248
0.025   0.038  1.306  1.291    0.044  1.348  1.313     0.087  1.245  1.234    0.145  1.197  1.185
0.05    0.031  1.314  1.264    0.036  1.350  1.290     0.061  1.184  1.143    0.101  1.164  1.119
0.10    0.024  1.365  1.265    0.029  1.449  1.320     0.041  1.155  1.067    0.071  1.158  1.069
0.20    0.018  1.458  1.241    0.023  1.706  1.377     0.024  1.316  1.066    0.048  1.409  1.089

Notes:

This table presents results on the accuracy of the fitted VaR and ES estimates for the three estimation methods: (Q)MLE, CAViaR and FZ estimation. In the first column of each panel we present the mean absolute error (MAE) from (Q)MLE, computed across all dates in a given sample and all 1000 simulation replications. The next two columns present the relative MAE of CAViaR and FZ to (Q)MLE. Values greater than one indicate (Q)MLE is more accurate (has lower MAE); values less than one indicate the opposite.

Table 5: Summary statistics

                       S&P 500     DJIA   NIKKEI     FTSE
Mean (Annualized)        6.776    7.238   -2.682    3.987
Std dev (Annualized)    17.879   17.042   24.667   17.730
Skewness                -0.244   -0.163   -0.114   -0.126
Kurtosis                11.673   11.116    8.580    8.912

VaR-0.01                -3.128   -3.034   -4.110   -3.098
VaR-0.025               -2.324   -2.188   -3.151   -2.346
VaR-0.05                -1.731   -1.640   -2.451   -1.709
VaR-0.10                -1.183   -1.126   -1.780   -1.193

ES-0.01                 -4.528   -4.280   -5.783   -4.230
ES-0.025                -3.405   -3.215   -4.449   -3.295
ES-0.05                 -2.697   -2.553   -3.603   -2.643
ES-0.10                 -2.065   -1.955   -2.850   -2.031

Notes:

This table presents summary statistics on the four daily equity return series studied in Section 5, over the full sample period from January 1990 to December 2016. The first two rows report the annualized mean and standard deviation of these returns in percent. The second panel presents sample Value-at-Risk for four choices of α, and the third panel presents corresponding sample Expected Shortfall estimates.

Table 6: ARMA, GARCH, and Skew t results

         SP500      DJIA       NIKKEI     FTSE
φ
φ
φ        –          -0.0407    –          -0.0438
φ        –          –          –          -0.0585
φ        –          –          –          0.0375
φ        –          –          –          -0.0501
θ        -0.7048    –          –          –
R
ω
β
α
ν
λ        -0.1146    -0.0997    -0.0659    -0.1018

Notes:

This table presents parameter estimates for the four daily equity return series studied in Section 5, over the in-sample period from January 1990 to December 1999. The first panel presents the optimal ARMA model according to the BIC, along with the R² of that model. The second panel presents the estimated GARCH(1,1) parameters, and the third panel presents the estimated parameters of the skewed t distribution applied to the estimated standardized residuals.

Table 7: Estimated parameters of GAS models for VaR and ES

GAS-2F GAS-1F GARCH-FZ Hybrid

VaR ES w -0.046 -0.069 β (s.e.) (0.010) (0.019) (s.e.) (0.004) (0.072) (0.015) b γ -0.010 0.030 -0.011 (s.e.) (0.005) (0.007) (s.e.) (0.002) (0.010) (0.002) a v δ – – 0.018 (s.e.) (0.092) (0.164) (s.e.) (0.009) a e a -1.490 -2.659 -2.443 (s.e.) (0.004) (0.007) (s.e.) (0.346) (0.492) (0.473) b -2.089 -3.761 -3.389 (s.e.) (0.487) (0.747) (0.664) Avg loss 0.747 0.750 0.762 0.745

Notes:

This table presents parameter estimates and standard errors for four GAS models of VaR and ES for the S&P 500 index over the in-sample period from January 1990 to December 1999. The left panel presents the results for the two-factor GAS model in Section 2.2. The right panel presents the results for the three one-factor models: a one-factor GAS model (from Section 2.3), a GARCH model estimated by FZ loss minimization, and a “hybrid” one-factor GAS model that includes an additional GARCH-type forcing variable (both from Section 2.5). The bottom row of this table presents the average (in-sample) losses from each of these four models.

Table 8: Out-of-sample average losses and goodness-of-fit tests (alpha=0.05)

                  Average loss                GoF p-values: VaR              GoF p-values: ES
          S&P    DJIA    NIK    FTSE      S&P    DJIA    NIK    FTSE      S&P    DJIA    NIK    FTSE
RW-125   0.914  0.864  1.290  0.959     0.021  0.013  0.000  0.000     0.029  0.018  0.006  0.000
RW-250   0.959  0.909  1.294  1.002     0.001  0.001  0.007  0.000     0.043  0.014  0.018  0.002
RW-500   1.023  0.976  1.318  1.056     0.001  0.001  0.000  0.000     0.012  0.011  0.001  0.000
GCH-N    0.876  0.808  1.170  0.871     0.031

Notes:

The left panel of this table presents the average losses, using the FZ0 loss function, for four daily equity return series, over the out-of-sample period from January 2000 to December 2016, for ten different forecasting models. The lowest average loss in each column is highlighted in bold, the second-lowest is highlighted in italics. The first three rows correspond to rolling window forecasts, the next three rows correspond to GARCH forecasts based on different models for the standardized residuals, and the last four rows correspond to models introduced in Section 2. The middle and right panels of this table present p-values from goodness-of-fit tests of the VaR and ES forecasts respectively. Values that are greater than 0.10 (indicating no evidence against optimality at the 0.10 level) are in bold, and values between 0.05 and 0.10 are in italics.

Table 9: Diebold-Mariano t-statistics on average out-of-sample loss differences, alpha=0.05, S&P 500 returns

           RW125   RW250   RW500     G-N   G-Skt   G-EDF   FZ-2F   FZ-1F    G-FZ  Hybrid
RW125             -2.580  -4.260   2.109   2.693   2.900   2.978   3.978   3.020   2.967
RW250      2.580          -4.015   3.098   3.549   3.730   3.799   4.701   3.921   4.110
RW500      4.260   4.015           4.401   4.783   4.937   5.168   5.893   5.125   5.450
G-N       -2.109  -3.098  -4.401           3.670   3.068   1.553   2.248   2.818   0.685
G-Skt     -2.693  -3.549  -4.783  -3.670           2.103   0.889   1.475   1.232  -0.403
G-EDF     -2.900  -3.730  -4.937  -3.068  -2.103           0.599   1.157   0.024  -0.769
FZ-2F     -2.978  -3.799  -5.168  -1.553  -0.889  -0.599           0.582  -0.555  -0.580
FZ-1F     -3.912  -4.423  -5.483  -1.986  -1.421  -1.198  -0.582          -1.266  -1.978
G-FZ      -3.020  -3.921  -5.125  -2.818  -1.324  -0.024   0.555   1.266          -0.914
Hybrid    -3.276  -4.137  -5.272  -1.492  -0.419   0.045   0.580   1.978   0.914

Notes:

This table presents t-statistics from Diebold-Mariano tests comparing the average losses, using the FZ0 loss function, over the out-of-sample period from January 2000 to December 2016, for ten different forecasting models. A positive value indicates that the row model has higher average loss than the column model. Values greater than 1.96 in absolute value indicate that the average loss difference is significantly different from zero at the 95% confidence level. Values along the main diagonal are all identically zero and are omitted for interpretability. The first three rows correspond to rolling window forecasts, the next three rows correspond to GARCH forecasts based on different models for the standardized residuals, and the last four rows correspond to models introduced in Section 2.

Table 10: Out-of-sample performance rankings for various alpha

                 α = 0.[…]                          α = 0.[…]
          S&P  DJIA  NIK  FTSE   Avg         S&P  DJIA  NIK  FTSE   Avg
RW-125      7     8   10     7     8           8     8    8     7  7.75
RW-250      8     9    8     8  8.25           9     9    7     8  8.25
RW-500     10    10    9     9   9.5          10    10    9     9   9.5
G-N         6     6    5     4  5.25           7     6    4     3     5
G-Skt       5     3    2     2     3           5     3    1     1   2.5
G-EDF       4     2    3     1   2.5           2     2    3     2  2.25
FZ-2F       1     4    7    10   5.5           4     5   10    10  7.25
FZ-1F       9     7    6     6     7           3     4    6     4  4.25
G-FZ        3     1    1     3     2           1     1    2     5  2.25
Hybrid      2     5    4     5     4           6     7    5     6     6

                 α = 0.[…]                          α = 0.[…]
          S&P  DJIA  NIK  FTSE   Avg         S&P  DJIA  NIK  FTSE   Avg
RW-125      8     8    8     7  7.75           8     8    8     8     8
RW-250      9     9    9     8  8.75           9     9    9     9     9
RW-500     10    10   10     9  9.75          10    10   10    10    10
G-N         7     7    5     6  6.25           3     2    5     5  3.75
G-Skt       5     3    4     2   3.5           7     4    4     4  4.75
G-EDF       4     2    2     5  3.25           4     3    3     3  3.25
FZ-2F       2     6    7    10  6.25           2     6    7     7   5.5
FZ-1F       1     1    6     4     3           1     7    2     2     3
G-FZ        3     5    3     3   3.5           6     5    6     6  5.75
Hybrid      6     4    1     1     3           5     1    1     1     2

Notes:

This table presents the rankings (with the best performing model ranked 1 and the worst ranked 10) based on average losses using the FZ0 loss function, for four daily equity return series, over the out-of-sample period from January 2000 to December 2016, for ten different forecasting models. The first three rows in each panel correspond to rolling window forecasts, the next three rows correspond to GARCH forecasts based on different models for the standardized residuals, and the last four rows correspond to models introduced in Section 2. The last column in each panel represents the average rank across the four equity return series.

[Figure 1. Left panel: "FZ loss as a fn of VaR" (x-axis: VaR forecast; y-axis: Loss). Right panel: "FZ loss as a fn of ES" (x-axis: ES forecast; y-axis: Loss).]

Figure 1: This figure plots the FZ0 loss function when Y = −[…] and α = 0.[…]. In the left panel we fix e = −[…] and vary v, in the right panel we fix v = −[…] and vary e. Values where v < e are indicated with a dashed line.

[Figure 2. Contour plot: "Expected FZ0 loss for a standard Normal variable" (x-axis: ES; y-axis: VaR).]

Figure 2: Contours of expected FZ0 loss when the target variable is standard Normal. Only values where ES < VaR < […]

[Figure 3. "News impact curve for VaR and ES" (x-axis: Return). Legend: v(t+1) when v(t),e(t) low; v(t+1) when v(t),e(t) high; e(t+1) when v(t),e(t) low; e(t+1) when v(t),e(t) high.]

Figure 3:

This figure shows the values of VaR and ES as a function of the lagged return, when the lagged values of VaR and ES are either low (10% below average) or high (10% above average). The function is based on the estimated parameters for daily S&P 500 returns.

[Figure 4. Upper panel: "5% VaR forecasts for S&P 500 daily returns" (y-axis: VaR). Lower panel: "5% ES forecasts for S&P 500 daily returns" (y-axis: ES). x-axis: Jan90 to Dec16. Legend: One-factor GAS; GARCH-EDF; RW-125.]

Figure 4: This figure plots the estimated 5% Value-at-Risk (VaR) and Expected Shortfall (ES) for daily returns on the S&P 500 index, over the period January 1990 to December 2016. The estimates are based on a one-factor GAS model, a GARCH model, and a rolling window using 125 observations.

[Figure 5. Upper panel: "5% VaR forecasts for S&P 500 daily returns" (y-axis: VaR). Lower panel: "5% ES forecasts for S&P 500 daily returns" (y-axis: ES). x-axis: Jan15 to Dec16. Legend: One-factor GAS; GARCH-EDF; RW-125.]

Figure 5: This figure plots the estimated 5% Value-at-Risk (VaR) and Expected Shortfall (ES) for daily returns on the S&P 500 index, over the period January 2015 to December 2016. The estimates are based on a one-factor GAS model, a GARCH model, and a rolling window using 125 observations.

Supplemental Appendix to: Dynamic Semiparametric Models for Expected Shortfall (and Value-at-Risk)

Andrew J. Patton (Duke University), Johanna F. Ziegel (University of Bern), Rui Chen (Duke University)

11 July 2017

This appendix contains lemmas that provide further details on the proof of Theorem 2 presented in the main paper, as well as additional tables of analysis.

Appendix SA.1: Detailed proofs

Throughout this appendix, we suppress the subscript on $\hat\theta_T$ for simplicity of presentation, and we denote the conditional distribution and density functions as $F_t$ and $f_t$ rather than $F_t(\cdot|\mathcal{F}_{t-1})$ and $f_t(\cdot|\mathcal{F}_{t-1})$. In Lemmas 1 and 3 below, we will refer to the expected score, defined as:

\[ \lambda_T(\theta) = T^{-1}\sum_{t=1}^{T} E[g_t(\theta)] \tag{1} \]

\[ = T^{-1}\sum_{t=1}^{T} E\left[\frac{1}{-e_t(\theta)}\left(\frac{F_t(v_t(\theta))}{\alpha} - 1\right)\nabla v_t(\theta)' + \frac{1}{e_t(\theta)^2}\left(\frac{F_t(v_t(\theta))}{\alpha}v_t(\theta) - \frac{1}{\alpha}E_{t-1}\left[Y_t \mathbf{1}\{Y_t \le v_t(\theta)\}\right] - v_t(\theta) + e_t(\theta)\right)\nabla e_t(\theta)'\right] \]

Lemma 1 Let

\[ \Lambda(\theta^*) = T^{-1}\sum_{t=1}^{T}\left.\frac{\partial E[g_t(\theta)]}{\partial\theta}\right|_{\theta=\theta^*} \tag{2} \]

Then under Assumptions 1-2,

\[ \sqrt{T}(\hat\theta - \theta^0) = \left(\Lambda^{-1}(\theta^0) + o_p(1)\right)\left(-\frac{1}{\sqrt{T}}\sum_{t=1}^{T} g_t(\theta^0) + o_p(1)\right) \tag{3} \]

Proof of Lemma 1. Consider a mean-value expansion of $\lambda_T(\hat\theta)$ around $\theta^0$:

\[ \lambda_T(\hat\theta) = \lambda_T(\theta^0) + T^{-1}\sum_{t=1}^{T}\left.\frac{\partial E[g_t(\theta)]}{\partial\theta}\right|_{\theta=\theta^*}(\hat\theta - \theta^0) \tag{4} \]

\[ = \Lambda(\theta^*)(\hat\theta - \theta^0) \tag{5} \]

where $\theta^*$ lies between $\hat\theta$ and $\theta^0$, and noting that $\lambda_T(\theta^0) = 0$ and the definition of $\Lambda(\theta^*)$ given in the statement of the lemma. Proving the claim involves two results: (I) $\Lambda^{-1}(\theta^*) = \Lambda^{-1}(\theta^0) + o_p(1)$, and (II) $\sqrt{T}\lambda_T(\hat\theta) = -\frac{1}{\sqrt{T}}\sum_{t=1}^{T} g_t(\theta^0) + o_p(1)$.

Part (I) is easy to verify: Since $v_t(\theta)$ and $e_t(\theta)$ are twice continuously differentiable, and $e_t(\theta) < 0$, $\Lambda(\theta)$ is continuous in $\theta$ and $\Lambda(\theta)$ is non-singular in a neighborhood of $\theta^0$. Then by the continuous mapping theorem, $\theta^* \overset{p}{\to} \theta^0 \Rightarrow \Lambda(\theta^*)^{-1} \overset{p}{\to} \Lambda^{-1}(\theta^0)$.

Establishing (II) builds on Theorem 3 of Huber (1967) and Lemma A.1 of Weiss (1991), which extends Huber’s conclusion to the case of non-iid dependent random variables. We are going to verify the conditions of Weiss’s Lemma A.1. Since the other conditions are easily checked, we only need to show that $T^{-1/2}\sum_{t=1}^{T} g_t(\hat\theta) = o_p(1)$, which we show in Lemma 2, and that his assumptions N3 and N4 hold, which we show in Lemmas 3-6.

Lemma 2

Under Assumptions 1-2, $T^{-1/2}\sum_{t=1}^{T} g_t(\hat\theta) = o_p(1)$.

Proof of Lemma 2. Let $\{e_j\}_{j=1}^{p}$ be the standard basis of $\mathbb{R}^p$ and define

\[ L_T^j(a) = T^{-1/2}\sum_{t=1}^{T} L_{FZ0}\left(Y_t, v_t(\hat\theta + a e_j), e_t(\hat\theta + a e_j); \alpha\right) \tag{6} \]

where $a$ is a scalar. Following Ruppert and Carroll’s (1980) approach, let $G_T^j(a)$ (a scalar) be the right derivative of $L_T^j(a)$, that is

\[ G_T^j(a) = T^{-1/2}\sum_{t=1}^{T}\left\{\frac{\nabla_j v_t(\hat\theta + a e_j)}{-e_t(\hat\theta + a e_j)}\left(\frac{1}{\alpha}\mathbf{1}\{Y_t \le v_t(\hat\theta + a e_j)\} - 1\right) + \frac{\nabla_j e_t(\hat\theta + a e_j)}{e_t(\hat\theta + a e_j)^2}\left(\frac{1}{\alpha}\mathbf{1}\{Y_t \le v_t(\hat\theta + a e_j)\}\left(v_t(\hat\theta + a e_j) - Y_t\right) - v_t(\hat\theta + a e_j) + e_t(\hat\theta + a e_j)\right)\right\} \]

$G_T^j(0) = \lim_{\xi \to 0^+} G_T^j(\xi)$ is the right partial derivative of $L_T(\theta)$ at $\hat\theta$ in the direction $\theta_j$, while $\lim_{\xi \to 0^+} G_T^j(-\xi)$ is the left partial derivative of $L_T(\theta)$ at $\hat\theta$ in the direction $\theta_j$. Although $L_T(\theta)$ is not differentiable, due to the presence of the indicator function, its left and right derivatives do exist, and because $L_T(\theta)$ achieves its minimum at $\hat\theta$, its left derivative must be non-positive and its right derivative must be non-negative. Thus,

\[ |G_T^j(0)| \le \lim_{\xi \to 0^+} G_T^j(\xi) - \lim_{\xi \to 0^+} G_T^j(-\xi) \]

\[ = T^{-1/2}\sum_{t=1}^{T}\left\{\frac{|\nabla_j v_t(\hat\theta)|}{-e_t(\hat\theta)}\frac{1}{\alpha}\mathbf{1}\{Y_t = v_t(\hat\theta)\} + \frac{|\nabla_j e_t(\hat\theta)|}{e_t(\hat\theta)^2}\frac{1}{\alpha}\left(v_t(\hat\theta) - Y_t\right)\mathbf{1}\{Y_t = v_t(\hat\theta)\}\right\} \tag{8} \]

\[ = T^{-1/2}\sum_{t=1}^{T}\frac{|\nabla_j v_t(\hat\theta)|}{-e_t(\hat\theta)}\frac{1}{\alpha}\mathbf{1}\{Y_t = v_t(\hat\theta)\} \]

The second term in the penultimate line vanishes as $\mathbf{1}\{Y_t = v_t(\hat\theta)\}(v_t(\hat\theta) - Y_t)$ is always zero. By Assumption 2(C), for all $t$, $|\nabla_j v_t(\hat\theta)| \le \|\nabla v_t(\hat\theta)\| \le V_1(\mathcal{F}_{t-1})$ and $|1/e_t(\hat\theta)| \le H$, thus:

\[ |G_T^j(0)| \le \frac{H}{\alpha}\left[T^{-1/2}\max_{1 \le t \le T} V_1(\mathcal{F}_{t-1})\right]\left[\sum_{t=1}^{T}\mathbf{1}\{Y_t = v_t(\hat\theta)\}\right] \tag{9} \]

$H$ is finite by Assumption 2(C), and for all $\epsilon > 0$,

\[ \Pr\left[T^{-1/2}\max_{1 \le t \le T} V_1(\mathcal{F}_{t-1}) > \epsilon\right] \le \sum_{t=1}^{T}\Pr\left[V_1(\mathcal{F}_{t-1}) > \epsilon T^{1/2}\right] \le \sum_{t=1}^{T}\frac{E[V_1(\mathcal{F}_{t-1})^3]}{\epsilon^3 T^{3/2}} \to 0 \]

as $T \to \infty$, since $E[V_1(\mathcal{F}_{t-1})^3]$ is finite by Assumption 2(D). We then have that $T^{-1/2}\max_{1 \le t \le T} V_1(\mathcal{F}_{t-1}) = o_p(1)$. By Assumption 2(B) we have $\sum_{t=1}^{T}\mathbf{1}\{Y_t = v_t(\hat\theta)\} = 0$ a.s. We therefore have $G_T^j(0) \overset{p}{\to} 0$. Since this holds for every $j$, we have $T^{-1/2}\sum_{t=1}^{T} g_t(\hat\theta) = o_p(1)$.

The following three lemmas show each of the three parts of Assumption N3 of Weiss (1991) holds. In the proofs below we make repeated use of mean-value expansions, and we use $\theta^*$ to denote a point on the line connecting $\hat\theta$ and $\theta^0$, and $\theta^{**}$ to denote a point on the line connecting $\theta^*$ and $\theta^0$. The particular point on the line can vary from expansion to expansion.

Lemma 3

Under assumptions 1-2, Assumption N3(i) of Weiss (1991) holds: k λ T ( θ ) k ≥ a k θ − θ k , for k θ − θ k ≤ d .for T suﬃciently large, where a and d are strictly positive numbers. roof of Lemma 3. A mean-value expansion yields: λ T ( ˆ θ ) = λ T ( θ ) + Λ T ( θ ∗ )( ˆ θ − θ ) = Λ T ( θ ∗ )( ˆ θ − θ ) (11)since λ T ( θ ) = 0 , where Λ T ( θ ) = T − P Tt =1 ∂ E [ g t ( θ )] /∂θ. Using the fact that ∂ E [ Y t { Y t ≤ v t ( θ ) }|F t − ] ∂θ = ∂∂θ (Z v t ( θ ) −∞ yf t ( y ) dy ) = v t ( θ ) f t ( v t ( θ )) ∇ v t ( θ ) (12)we can write:Λ T ( θ ) = T − T X t =1 E (cid:20)(cid:18) ∇ v t ( θ ) − e t ( θ ) + ∇ v t ( θ ) ′ ∇ e t ( θ ) e t ( θ ) + ∇ e t ( θ ) ′ ∇ v t ( θ ) e t ( θ ) (cid:19) (cid:18) F t ( v t ( θ )) α − (cid:19) (13)+ (cid:18) ∇ e t ( θ ) 1 e t ( θ ) + − e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) (cid:19) · (cid:18)(cid:18) F t ( v t ( θ )) α − (cid:19) v t ( θ ) − α E [ Y t { Y t ≤ v t ( θ ) }|F t − ] + e t ( θ ) (cid:19) + f t ( v t ( θ )) − αe t ( θ ) ∇ ′ v t ( θ ) ∇ v t ( θ )+ 1 e t ( θ ) ∇ ′ e t ( θ ) ∇ e t ( θ )] } (cid:12)(cid:12)(cid:12)(cid:12) F t − (cid:21) Evaluated at θ , the ﬁrst two terms of Λ T drop out because F t (cid:0) v t ( θ ) (cid:1) = α and α E [ Y t { Y t ≤ v t ( θ ) }|F t − ] = e t (cid:0) θ (cid:1) . Deﬁne D T as D T ≡ Λ T ( θ ) = T − T X t =1 E (cid:20) f t ( v t ( θ )) − αe t ( θ ) ∇ v t ( θ ) ′ ∇ v t ( θ ) + 1 e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) (cid:21) (14)Below we show that Λ T ( θ ∗ ) = D T + O ( k ˆ θ − θ k ) by decomposing k Λ T ( θ ∗ ) − D T k into four termsand showing that each is bounded by a O ( k ˆ θ − θ k ) term. First term:

Using a mean-value expansion around θ and Assumptions 2(C)-(D) we obtain: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E (cid:20)(cid:18) ∇ v t ( θ ∗ ) − e t ( θ ∗ ) + ∇ v t ( θ ∗ ) ′ ∇ e t ( θ ∗ ) e t ( θ ∗ ) + ∇ e t ( θ ∗ ) ′ ∇ v t ( θ ∗ ) e t ( θ ∗ ) (cid:19) (cid:18) F t ( v t ( θ ∗ )) α − (cid:19)(cid:21)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ T − T X t =1 E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13)(cid:0) HV ( F t − ) + 2 H V ( F t − ) H ( F t − ) (cid:1) (cid:18) f t ( v t ( θ ∗∗ )) α ∇ v t ( θ ∗∗ )( θ ∗ − θ ) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:21) (15) ≤ T − T X t =1 Kα n H E [ V ( F t − ) ] / E [ V ( F t − ) / ] / + 2 H E [ V ( F t − ) ] / E [ H ( F t − ) ] / o k θ ∗ − θ k econd term: Again using a mean-value expansion around θ and Assumptions 2(C)-(D): (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E (cid:20)(cid:18) e t ( θ ∗ ) ∇ e t ( θ ∗ ) − e t ( θ ∗ ) ∇ e t ( θ ∗ ) ′ ∇ e t ( θ ∗ ) (cid:19) · (cid:18)(cid:18) F t ( v t ( θ ∗ )) α − (cid:19) v t ( θ ∗ ) − α E [ Y t { Y t ≤ v t ( θ ∗ ) }|F t − ] + e t ( θ ∗ ) (cid:19)(cid:21)(cid:13)(cid:13)(cid:13)(cid:13) ≤ T − T X t =1 E [ k (cid:0) H ( F t − ) H + H ( F t − ) · H · H ( F t − ) (cid:1) (16) · (( F t ( v t ( θ ∗∗ )) /α − ∇ v t ( θ ∗∗ ) + ∇ e t ( θ ∗∗ )) ( θ ∗ − θ ) k ] ≤ T − T X t =1 { (1 /α + 1)( H E [ V ( F t − ) H ( F t − )] + 2 H E [ V ( F t − ) H ( F t − ) ])+ ( H · E [ H ( F t − ) H ( F t − )] + 2 H E [ H ( F t − ) ]) }k θ ∗ − θ k Third term: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E (cid:20) f t ( v t ( θ ∗ )) − e t ( θ ∗ ) α ∇ v t ( θ ∗ ) ′ ∇ v t ( θ ∗ ) − f t ( v t ( θ )) − e t ( θ ) α ∇ v t ( θ ) ′ ∇ v t ( θ ) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 α (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E { f t ( v t ( θ ∗ )) − e t ( θ ∗ ) ∇ v t ( θ ∗ ) ′ ∇ v t ( θ ∗ ) − f t ( v t ( θ ∗ )) − e t ( θ ∗ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ ) (17)+ f t ( v t ( θ ∗ )) − e t ( θ ∗ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ ) − f t ( v t ( θ )) − e t ( θ ∗ ) ∇ v t ( 
θ ) ′ ∇ v t ( θ ∗ )+ f t ( v t ( θ )) − e t ( θ ∗ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ ) − f t ( v t ( θ )) − e t ( θ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ )+ f t ( v t ( θ )) − e t ( θ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ ) − f t ( v t ( θ )) − e t ( θ ) ∇ v t ( θ ) ′ ∇ v t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) = 1 α (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E { f t ( v t ( θ ∗ )) − e t ( θ ∗ ) [ ∇ v t ( θ ∗∗ )( θ ∗ − θ )] ∇ v t ( θ ∗ )+ f t ( v t ( θ ∗ )) − f t ( v t ( θ )) − e t ( θ ∗ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ )+ f t ( v t ( θ )) e t ( θ ∗∗ ) ( θ ∗ − θ ) ∇ v t ( θ ) ′ ∇ v t ( θ ∗ )+ f t ( v t ( θ )) − e t ( θ ) ∇ v t ( θ ) ′ ( θ ∗ − θ ) v t ( θ ∗∗ ) } (cid:13)(cid:13)(cid:13)(cid:13) ≤ α T − T X t =1 E { V ( F t − ) ( KH · V ( F t − )) + KH · V ( F t − ) + KH H ( F t − ) V ( F t − ) + KHV ( F t − ) V ( F t − ) } · k θ ∗ − θ k ourth term: The bound on this term follows similar steps to that of the third term: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E { e t ( θ ∗ ) ∇ e t ( θ ∗ ) ′ ∇ e t ( θ ∗ ) − e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T − T X t =1 E { e t ( θ ∗ ) ∇ e t ( θ ∗ ) ′ ∇ e t ( θ ∗ ) − e t ( θ ∗ ) ∇ e t ( θ ) ′ ∇ e t ( θ ∗ ) (18)+ 1 e t ( θ ∗ ) ∇ e t ( θ ) ′ ∇ e t ( θ ∗ ) − e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ∗ )+ 1 e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ∗ ) − e t ( θ ) ∇ e t ( θ ) ′ ∇ e t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) |≤ T − T X t =1 { H · E [ H ( F t − ) H ( F t − )] + 2 H E (cid:2) H ( F t − ) (cid:3) + H E [ H ( F t − ) H ( F t − )] }k θ ∗ − θ k Therefore, Λ T ( θ ∗ ) = D T + O ( k ˆ θ − θ k ) ⇒ k Λ T ( θ ∗ ) − D T k ≤ K k ˆ θ − θ k , where K is some constant < ∞ , for T suﬃciently large. By Assumption 2(E), D T has eigenvalues bounded below by apositive constant, denoted as a, for T suﬃciently large. 
Thus,

$$
\begin{aligned}
\|\lambda_T(\hat\theta)\| &= \|\Lambda_T(\theta^*)(\hat\theta - \theta^0)\| = \|D_T(\hat\theta - \theta^0) - (D_T - \Lambda_T(\theta^*))(\hat\theta - \theta^0)\| \\
&\ge \|D_T(\hat\theta - \theta^0)\| - \|(D_T - \Lambda_T(\theta^*))(\hat\theta - \theta^0)\| \\
&\ge \left(a - K\|\hat\theta - \theta^0\|\right) \cdot \|\hat\theta - \theta^0\|
\end{aligned}
\tag{19}
$$

The penultimate inequality holds by the triangle inequality, and the final inequality follows from Assumption 2(E) on the minimum eigenvalue of $D_T$. Thus, for $T$ sufficiently large so that $a - K\|\hat\theta - \theta^0\| > 0$, the result follows.

Lemma 4

Define

$$
\mu_t(\theta, d) = \sup_{\|\tau - \theta\| \le d} \|g_t(\tau) - g_t(\theta)\| \tag{20}
$$

Then under Assumptions 1-2, Assumption N3(ii) of Weiss (1991) holds:

$$
E[\mu_t(\theta, d)] \le bd, \quad \text{for } \|\theta - \theta^0\| + d \le d_0,\ d \ge 0
$$

for $T$ sufficiently large, where $b$, $d$, and $d_0$ are strictly positive numbers.

Proof of Lemma 4.

In this proof, the strictly positive constant $c$ and the mean-value expansion term, $\tau^*$, can change from line to line. Pick $d_0$ such that for any $\theta$ that satisfies $\|\theta - \theta^0\| \le d_0$, all the conditions in Assumptions 2(C) and 2(D) hold as well as $e_t(\theta) \le v_t(\theta) \le 0$. Let us expand $g_t(\theta)$ into six terms:

$$
\begin{aligned}
g_t(\theta) ={}& \frac{1}{\alpha} \frac{\nabla' v_t(\theta)}{-e_t(\theta)} \mathbf{1}\{Y_t \le v_t(\theta)\} - \frac{\nabla' v_t(\theta)}{-e_t(\theta)} + \frac{1}{\alpha} \frac{v_t(\theta) \nabla' e_t(\theta)}{e_t(\theta)^2} \mathbf{1}\{Y_t \le v_t(\theta)\} \\
& - \frac{v_t(\theta) \nabla' e_t(\theta)}{e_t(\theta)^2} - \frac{1}{\alpha} \frac{\nabla' e_t(\theta)}{e_t(\theta)^2} \mathbf{1}\{Y_t \le v_t(\theta)\} Y_t + \frac{\nabla' e_t(\theta)}{e_t(\theta)}
\end{aligned}
\tag{22}
$$

We will bound $\mu_t(\theta, d)$ by considering six terms, $\mu_t(\theta, d)^{(i)}$, $i = 1, 2, \cdots,$

6, deﬁned below. Eachterm is shown to be bounded by a constant times d . First term: µ t ( θ, d ) (1) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) { Y t ≤ v t ( τ ) } − ∇ ′ v t ( θ ) − e t ( θ ) { Y t ≤ v t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) (23)Set τ = arg min k τ − θ k≤ d v t ( τ ) and τ = arg max k τ − θ k≤ d v t ( τ ). Since v t ( θ ) and e t ( θ ) are assumedto be twice continously diﬀerentiable, τ and τ exist. We want to take the indicator function outfrom the ‘sup’ operator. To this end, let us discuss what α · µ t ( θ, d ) (1) equals in two cases.Case 1: Y t ≤ v t ( θ ). (a) If Y t > v t ( τ ), α · µ t ( θ, d ) (1) = (cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13) . (b) If Y t < v t ( τ ), α · µ t ( θ, d ) (1) =sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13) . (c) If v t ( τ ) ≤ Y t ≤ v t ( τ ), α · µ t ( θ, d ) (1) = max ( sup k τ − θ k≤ d,Y t ≤ v ( τ ) (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) , (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)) (24) ≤ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) Case 2: Y t > v t ( θ ), α · µ t ( θ, d ) (1) = { Y t ≤ v ( τ ) } · sup k τ − θ k≤ d,Y t ≤ v ( τ ) (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) (cid:13)(cid:13)(cid:13)(cid:13) (25) ≤ { Y t ≤ v ( τ ) } · sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) (cid:13)(cid:13)(cid:13)(cid:13) k θ − θ k + d ≤ d implies that both θ and τ (which are in a d -neighborhood of θ ) are in a d -neighborhood of θ , and so (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) 
(cid:13)(cid:13)(cid:13)(cid:13) ≤ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) (26)7hus, α · µ t ( θ, d ) (1) ≤ ( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) (27) · sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) + sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) , where E t − [ { v t ( τ ) < Y t ≤ v t ( θ ) } ] = Z v t ( θ ) v t ( τ ) f t ( y ) dy (28) ≤ K | v t ( τ ) − v t ( θ ) | ≤ KV ( F t − ) k τ − θ k ≤ KV ( F t − ) d and similarly, E [ { v t ( θ ) < Y t ≤ v t ( τ ) }|F t − ] ≤ KV ( F t − ) d (29)and E [ { v t ( τ ) < Y t ≤ v t ( θ ) }|F t − ] ≤ KV ( F t − ) d Further sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ HV ( F t − ) (30)and by the mean-value theorem, ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) = (cid:13)(cid:13)(cid:13)(cid:13) ∇ v t ( τ ∗ ) − e t ( τ ∗ ) + ∇ ′ v t ( τ ∗ ) ∇ e t ( τ ∗ ) e t ( τ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) · ( τ − θ ) (31) ⇒ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) HV ( F t − ) + H V ( F t − ) H ( F t − ) (cid:1) · d. (32)By Assumption 2(D), E [ V ( F t − )] and E [ V ( F t − ) H ( F t − )] are ﬁnite, so E [ µ t ( θ, d ) (1) ] ≤ cd , wherec is a strictly positive constant. Second term: µ t ( θ, d ) (2) = sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13) . It was shown in the derivations for theﬁrst term that E [ µ t ( θ, d ) (2) ] ≤ cd , where c is a strictly positive constant. 
Third term: µ t ( θ, d ) (3) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) { Y t ≤ v t ( τ ) } − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) { Y t ≤ v t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) (33)8imilar to the ﬁrst term, α · µ t ( θ, d ) (3) can be bounded by( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) (34) · sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) + sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) where E [ { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) }|F t − ] ≤ KV ( F t − ) d (35)and sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ H · H ( F t − ) (36)where e t ( θ ) ≤ v t ( θ ) ≤ v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (37)= (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ∗ ) ∇ v t ( τ ∗ ) e t ( τ ∗ ) − v t ( τ ∗ ) ∇ ′ e t ( τ ∗ ) ∇ e t ( τ ∗ ) e t ( τ ∗ ) + v t ( τ ∗ ) ∇ e t ( τ ∗ ) e t ( τ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) · ( τ − θ ) ⇒ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) (38) ≤ (cid:0) H V ( F t − ) H ( F t − ) + 2 H H ( F t − ) + H · H ( F t − ) (cid:1) · d By Assumption 2(D), E [ V ( F t − ) H ( F t − )], E [ H ( F t − ) ], E [ H ( F t − )] < ∞ . Therefore, E [ µ t ( θ, d ) (3) ] ≤ cd , where c is a strictly positive constant. Fourth term: µ t ( θ, d ) (4) = sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13) . In the derivations for the thirdterm we showed that E [ µ t ( θ, d ) (4) ] ≤ cd , where c is a strictly positive constant. 
Fifth term: µ t ( θ, d ) (5) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) { Y t ≤ v t ( τ ) } Y t − ∇ ′ e t ( θ ) e t ( θ ) { Y t ≤ v t ( θ ) } Y t (cid:13)(cid:13)(cid:13)(cid:13) (39)Similar to the ﬁrst term, α · µ t ( θ, d ) (5) can be bounded by( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) (40) · | Y t | sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) + | Y t | sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) E [ { v t ( τ ) < Y t ≤ v t ( θ ) }| Y t | |F t − ] = Z v t ( θ ) v t ( τ ) | y | f t ( y ) dy ≤ K | v t ( τ ) | · | v t ( τ ) − v t ( θ ) | (41) ≤ KV ( F t − ) V ( F t − ) k τ − θ k ≤ KV ( F t − ) V ( F t − ) d and similarly, E [ { v t ( τ ) < Y t ≤ v t ( θ ) }| Y t | |F t − ] ≤ KV ( F t − ) V ( F t − ) d (42)and E [ { v t ( θ ) < Y t ≤ v t ( τ ) }| Y t | |F t − ] ≤ KV ( F t − ) V ( F t − ) d Further sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ H H ( F t − ) (43)and by the mean-value theorem, ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) = (cid:13)(cid:13)(cid:13)(cid:13) − ∇ ′ e t ( τ ∗ ) ∇ e t ( τ ∗ ) e t ( τ ∗ ) + ∇ e t ( τ ∗ ) e t ( τ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) · ( τ − θ ) (44) ⇒ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) H H ( F t − ) + H H ( F t − ) (cid:1) · d By Assumption 2(D), E [ V ( F t − ) V ( F t − ) H ( F t − )], E [ H ( F t − ) | Y t | ], E [ H ( F t − ) | Y t | ] < ∞ . There-fore, E [ µ t ( θ, d ) (5) ] ≤ cd , where c is a strictly positive constant. 
Sixth term:

$$
\mu_t^{(6)}(\theta, d) = \sup_{\|\tau - \theta\| \le d} \left\| \frac{\nabla' e_t(\tau)}{-e_t(\tau)} - \frac{\nabla' e_t(\theta)}{-e_t(\theta)} \right\| \tag{45}
$$

By the mean-value theorem,

$$
\frac{\nabla' e_t(\tau)}{-e_t(\tau)} - \frac{\nabla' e_t(\theta)}{-e_t(\theta)} = \left( \frac{\nabla' e_t(\tau^*) \nabla e_t(\tau^*)}{e_t(\tau^*)^2} + \frac{\nabla^2 e_t(\tau^*)}{-e_t(\tau^*)} \right) (\tau - \theta) \tag{46}
$$

$$
\Rightarrow \sup_{\|\tau - \theta\| \le d} \left\| \frac{\nabla' e_t(\tau)}{-e_t(\tau)} - \frac{\nabla' e_t(\theta)}{-e_t(\theta)} \right\| \le \left( H^2 H_1(\mathcal{F}_{t-1})^2 + H \cdot H_2(\mathcal{F}_{t-1}) \right) \cdot d. \tag{47}
$$

By Assumption 2(D), $E[H_1(\mathcal{F}_{t-1})^2]$, $E[H_2(\mathcal{F}_{t-1})] < \infty$. Therefore, $E[\mu_t(\theta, d)^{(6)}] \le cd$, where $c$ is a strictly positive constant.

Thus we have shown that $\mu_t(\theta, d) \le \sum_{i=1}^{6} \mu_t(\theta, d)^{(i)}$ with $E[\mu_t(\theta, d)^{(i)}] \le cd$, $\forall i = 1, 2, \cdots, 6$.

Lemma 5

Under Assumptions 1-2, Assumption N3(iii) of Weiss (1991) holds:

$$
E[\mu_t(\theta, d)^q] \le cd, \quad \text{for } \|\theta - \theta^0\| + d \le d_0, \text{ and some } q > 2
$$

where $c$, $d$ and $d_0$ are strictly positive numbers.

Proof of Lemma 5.

In this proof, the strictly positive constant $c$ and the mean-value expansion term, $\tau^*$, can change from line to line. Pick $d_0$ such that for any $\theta$ that satisfies $\|\theta - \theta^0\| \le d_0$, all the conditions in Assumptions 2(C) and 2(D) hold as well as $e_t(\theta) \le v_t(\theta) \le 0$. Similar to Lemma 4, we will decompose $\mu_t(\theta, d)$ into six terms, $\mu_t(\theta, d)^{(i)}$, for $i = 1, 2, \ldots, 6$. By Jensen's inequality,

$$
E[\mu_t(\theta, d)^q] \le 6^{q-1} \sum_{i=1}^{6} E\left[ \left( \mu_t(\theta, d)^{(i)} \right)^q \right], \quad q > 2.
$$

We will show that for some $0 < \delta < 1$, $E\left[ \left( \mu_t(\theta, d)^{(i)} \right)^{2+\delta} \right] \le cd$, $\forall i = 1, 2, \cdots, 6$, where $c$ is a strictly positive constant.
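The Jensen step above rests on the elementary inequality $(a_1 + \cdots + a_n)^q \le n^{q-1}(a_1^q + \cdots + a_n^q)$ for nonnegative $a_i$ and $q \ge 1$, applied with $n = 6$. The following is a minimal numerical sanity check of that inequality, not part of the proof; the function names `power_sum_bound` and `check_bound`, the exponential draws, and the choice $q = 2.5$ are illustrative assumptions.

```python
import random


def power_sum_bound(terms, q):
    """Return (lhs, rhs) of (sum a_i)^q <= n^(q-1) * sum a_i^q.

    Assumes every a_i >= 0 and q >= 1 (convexity of x -> x^q).
    """
    n = len(terms)
    lhs = sum(terms) ** q
    rhs = n ** (q - 1) * sum(a ** q for a in terms)
    return lhs, rhs


def check_bound(num_trials=1000, n=6, q=2.5, seed=0):
    """Check the inequality on random nonnegative draws (hypothetical setup)."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        # Six nonnegative terms stand in for the mu^(i) in the proof.
        terms = [rng.expovariate(1.0) for _ in range(n)]
        lhs, rhs = power_sum_bound(terms, q)
        # Small relative tolerance guards against floating-point rounding.
        if lhs > rhs * (1 + 1e-12):
            return False
    return True


if __name__ == "__main__":
    print(check_bound())
```

Equality holds when all six terms are equal, which is why the constant $6^{q-1}$ cannot be improved in general.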

First term: µ t ( θ, d ) (1) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) { Y t ≤ v t ( τ ) } − ∇ ′ v t ( θ ) − e t ( θ ) { Y t ≤ v t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) (48)Set τ = arg min k τ − θ k≤ d v t ( τ ) and τ = arg max k τ − θ k≤ d v t ( τ ). Following the same argument as inthe proof of Lemma 4, we obtain[ α · µ t ( θ, d ) (1) ] δ ≤ c · ( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } )(49) · sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ + sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ where E t − [ { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ] ≤ KV ( F t − ) d (50)and sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ ≤ ( HV ( F t − )) δ (51)For sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)! δ , we need to combine the two following two results:sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) HV ( F t − ) + H V ( F t − ) H ( F t − ) (cid:1) d (52) sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ ≤ (2 HV ( F t − )) δ E [ (cid:0) µ t ( θ, d ) (1) (cid:1) δ ] ≤ cd , where c is a strictly pos-itive constant. Second term: µ t ( θ, d ) (2) = sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) ∇ ′ v t ( τ ) − e t ( τ ) − ∇ ′ v t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13) . It was shown in the derivations for theﬁrst term that E [ (cid:0) µ t ( θ, d ) (2) (cid:1) δ ] ≤ cd , where c is a strictly positive constant. 
Third term: µ t ( θ, d ) (3) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) { Y t ≤ v t ( τ ) } − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) { Y t ≤ v t ( θ ) } (cid:13)(cid:13)(cid:13)(cid:13) (53)Similar to the ﬁrst term, (cid:0) α · µ t ( θ, d ) (3) (cid:1) δ can be bounded by c · ( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) (54) · sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ + sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ where E t − ( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) ≤ KV ( F t − ) d (55)and sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ ≤ ( H · H ( F t − )) δ (56)As for sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)! δ , we need to combine the following two results:sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) H V ( F t − ) H ( F t − ) + 2 H H ( F t − ) + H · H ( F t − ) (cid:1) d (57) sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ ≤ (2 H · H ( F t − )) δ (58)Combining with Assumption 2(D), we thus have E [ (cid:0) µ t ( θ, d ) (3) (cid:1) δ ] ≤ cd , where c is a strictly pos-itive constant. 12 ourth term: µ t ( θ, d ) (4) = sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) v t ( τ ) ∇ ′ e t ( τ ) e t ( τ ) − v t ( θ ) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13) . 
It was shown in the derivationsfor the third term that E [ (cid:0) µ t ( θ, d ) (4) (cid:1) δ ] ≤ cd , where c is a strictly positive constant. Fifth term: µ t ( θ, d ) (5) = 1 α sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) { Y t ≤ v t ( τ ) } Y t − ∇ ′ e t ( θ ) e t ( θ ) { Y t ≤ v t ( θ ) } Y t (cid:13)(cid:13)(cid:13)(cid:13) (59)Similar to the ﬁrst and third terms, (cid:0) α · µ t ( θ, d ) (5) (cid:1) δ can be bounded by c · ( { v t ( τ ) < Y t ≤ v t ( θ ) } + { v t ( τ ) ≤ Y t ≤ v t ( θ ) } + { v t ( θ ) < Y t ≤ v t ( τ ) } ) (60) · | Y t | δ sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ + | Y t | δ sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ where E t − [ { v t ( τ ) < Y t ≤ v t ( θ ) }| Y t | δ ] = Z v t ( θ ) v t ( τ ) | y | δ f t ( y ) dy ≤ K | v t ( τ ) | δ · | v t ( τ ) − v t ( θ ) | (61) ≤ KV ( F t − ) δ V ( F t − ) k τ − θ k ≤ KV ( F t − ) δ V ( F t − ) d and similarly, E h { v t ( τ ) < Y t ≤ v t ( θ ) }| Y t | δ |F t − i ≤ KV ( F t − ) δ V ( F t − ) d (62)and E h { v t ( θ ) < Y t ≤ v t ( τ ) }| Y t | δ |F t − i ≤ KV ( F t − ) δ V ( F t − ) d Further sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ H H ( F t − ) (63)As for sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)! δ , we also need to combine the following two results:sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) e t ( τ ) − ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) H H ( F t − ) + H H ( F t − ) (cid:1) d (64) sup k θ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( θ ) e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! 
δ ≤ (cid:0) H H ( F t − ) (cid:1) δ Combining with Assumption 2(D), we thus have E [ (cid:0) µ t ( θd ) (5) (cid:1) δ ] ≤ cd , where c is a strictly posi-tive constant. 13 ixth term: µ (6) t ( θ, d ) = sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) − e t ( τ ) − ∇ ′ e t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) (65)We have sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) − e t ( τ ) − ∇ ′ e t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:0) H H ( F t − ) + HH ( F t − ) (cid:1) d (66) sup k τ − θ k≤ d (cid:13)(cid:13)(cid:13)(cid:13) ∇ ′ e t ( τ ) − e t ( τ ) − ∇ ′ e t ( θ ) − e t ( θ ) (cid:13)(cid:13)(cid:13)(cid:13)! δ ≤ (2 HH ( F t − )) δ Combining with Assumption 2(D), we thus have E [ (cid:0) µ t ( θ, d ) (6) (cid:1) δ ] ≤ cd , where c is a strictlypositive constant. Thus E [ µ t ( θ, d ) ( i ) ] δ ≤ cd , ∀ i = 1 , , · · · , , proving the lemma. Lemma 6

Under Assumptions 1-2, Assumption N4 of Weiss (1991) holds: $E\|g_t(\theta^0)\|^2 \le M$, for all $t$ and some $M > 0$.

Proof of Lemma 6.

$$
\begin{aligned}
E\|g_t(\theta^0)\|^2 \le{}& 4 \Bigg\{ E\left\| \frac{\nabla' v_t(\theta^0)}{-e_t(\theta^0)} \left( \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t(\theta^0)\} - 1 \right) \right\|^2 + E\left\| \frac{v_t(\theta^0) \nabla' e_t(\theta^0)}{e_t(\theta^0)^2} \left( \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t(\theta^0)\} - 1 \right) \right\|^2 \\
& + E\left\| \frac{\nabla' e_t(\theta^0)}{e_t(\theta^0)} \right\|^2 + E\left\| \frac{\nabla' e_t(\theta^0)}{e_t(\theta^0)^2} \frac{1}{\alpha} \mathbf{1}\{Y_t \le v_t(\theta^0)\} Y_t \right\|^2 \Bigg\} \\
\le{}& 4 \Bigg\{ E\left[ \left( \frac{1}{\alpha} + 1 \right)^2 H^2 V_1(\mathcal{F}_{t-1})^2 \right] + E\left[ \left( \frac{1}{\alpha} + 1 \right)^2 H^2 H_1(\mathcal{F}_{t-1})^2 \right] \\
& + \frac{1}{\alpha^2} H^4 E\left[ H_1(\mathcal{F}_{t-1})^2 Y_t^2 \right] + E\left[ H^2 H_1(\mathcal{F}_{t-1})^2 \right] \Bigg\} \le M
\end{aligned}
\tag{67}
$$

where $M$ is some finite constant, and the second inequality follows using Assumptions 2(C) and 2(D).

Appendix SA.2: Additional tables

Table S1: Finite-sample performance of (Q)MLE

Panel A: N(0,1) innovations

| | ω (T = 2500) | β (T = 2500) | γ (T = 2500) | ω (T = 5000) | β (T = 5000) | γ (T = 5000) |
|---|---|---|---|---|---|---|
| True | 0.050 | 0.950 | 0.050 | 0.050 | 0.950 | 0.050 |
| Median | 0.053 | 0.897 | 0.050 | 0.051 | 0.899 | 0.050 |
| Avg bias | 0.011 | (0.011) | 0.000 | 0.005 | (0.005) | 0.000 |
| St dev | 0.056 | 0.064 | 0.013 | 0.023 | 0.029 | 0.009 |
| Coverage | 0.936 | 0.930 | 0.928 | 0.936 | 0.933 | 0.937 |

Panel B: Skew t (5,-0.5) innovations

| | ω (T = 2500) | β (T = 2500) | γ (T = 2500) | ω (T = 5000) | β (T = 5000) | γ (T = 5000) |
|---|---|---|---|---|---|---|
| True | 0.050 | 0.950 | 0.050 | 0.050 | 0.950 | 0.050 |
| Median | 0.052 | 0.895 | 0.049 | 0.052 | 0.897 | 0.050 |
| Avg bias | 0.017 | (0.023) | 0.005 | 0.006 | (0.008) | 0.002 |
| St dev | 0.077 | 0.095 | 0.028 | 0.026 | 0.037 | 0.017 |
| Coverage | 0.899 | 0.907 | 0.897 | 0.913 | 0.907 | 0.903 |

Notes:

This table presents results from 1000 replications of the estimation of the parameters of a GARCH(1,1) model, using the Normal likelihood. In Panel A the innovations are standard Normal, and so estimation is then ML. In Panel B the innovations are standardized skew t, and so estimation is QML. Details are described in Section 4 of the main paper. The top row of each panel presents the true values of the parameters. The second, third, and fourth rows present the median estimated parameters, the average bias, and the standard deviation (across simulations) of the estimated parameters. The last row of each panel presents the coverage rates for 95% confidence intervals constructed using estimated standard errors.

Table S2: Simulation results for Normal innovations, estimation by CAViaR

[Table body not recovered from the source. Columns: β, γ, a_α for T = 2500 and for T = 5000; five panels indexed by α.]

Notes:

This table presents results from 1000 replications of the estimation of VaR from a GARCH(1,1) DGP with standard Normal innovations. Details are described in Section 4 of the main paper. The top row of each panel presents the true values of the parameters. The second, third, and fourth rows present the median estimated parameters, the average bias, and the standard deviation (across simulations) of the estimated parameters. The last row of each panel presents the coverage rates for 95% confidence intervals constructed using estimated standard errors.

Table S3: Simulation results for skew t innovations, estimation by CAViaR

[Table body not recovered from the source. Columns: β, γ, a_α for T = 2500 and for T = 5000; five panels indexed by α.]

Notes:

This table presents results from 1000 replications of the estimation of VaR from a GARCH(1,1) DGP with skew t innovations. Details are described in Section 4 of the main paper. The top row of each panel presents the true values of the parameters. The second, third, and fourth rows present the median estimated parameters, the average bias, and the standard deviation (across simulations) of the estimated parameters. The last row of each panel presents the coverage rates for 95% confidence intervals constructed using estimated standard errors.

Table S4: Diebold-Mariano t-statistics on average out-of-sample loss differences for the DJIA, NIKKEI and FTSE100 (alpha=0.05)

Panel A: DJIA

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -2.547 | -4.234 | 3.189 | 3.812 | 3.793 | 3.305 | 4.368 | 3.475 | 3.853 |
| RW250 | 2.547 | | -4.145 | 4.028 | 4.579 | 4.595 | 4.601 | 5.358 | 4.529 | 4.598 |
| RW500 | 4.234 | 4.145 | | 5.328 | 5.802 | 5.825 | 5.903 | 6.553 | 5.868 | 5.901 |
| G-N | -3.189 | -4.028 | -5.328 | | 3.312 | 2.773 | 0.818 | 2.171 | 1.811 | 1.769 |
| G-Skt | -3.812 | -4.579 | -5.802 | -3.312 | | 0.391 | -0.143 | 1.430 | -0.160 | -0.022 |
| G-EDF | -3.793 | -4.595 | -5.825 | -2.773 | -0.391 | | -0.187 | 1.434 | -0.367 | -0.174 |
| FZ-2F | -3.305 | -4.601 | -5.903 | -0.818 | 0.143 | 0.187 | | 0.142 | 0.028 | 1.179 |
| FZ-1F | -4.022 | -4.738 | -5.750 | -0.965 | 0.004 | 0.038 | -0.142 | | -1.597 | -1.402 |
| G-FZ | -3.475 | -4.529 | -5.868 | -1.811 | 0.275 | 0.367 | -0.028 | 1.597 | | 0.086 |
| Hybrid | -3.826 | -4.506 | -5.710 | -2.426 | -1.425 | -1.430 | -1.179 | 1.402 | -0.086 | |

Panel B: NIKKEI

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -0.245 | -1.181 | 4.015 | 3.993 | 4.030 | 3.804 | 3.464 | 3.933 | 4.166 |
| RW250 | 0.245 | | -1.418 | 4.460 | 4.473 | 4.519 | 4.075 | 3.887 | 4.437 | 4.661 |
| RW500 | 1.181 | 1.418 | | 4.412 | 4.433 | 4.476 | 4.348 | 3.965 | 4.431 | 4.582 |
| G-N | -4.015 | -4.460 | -4.412 | | 1.180 | 2.177 | -1.877 | -1.271 | 1.251 | 0.419 |
| G-Skt | -3.993 | -4.473 | -4.433 | -1.180 | | 1.831 | -1.931 | -1.389 | 0.613 | 0.255 |
| G-EDF | -4.030 | -4.519 | -4.476 | -2.177 | -1.831 | | -2.031 | -1.520 | -0.901 | 0.075 |
| FZ-2F | -3.804 | -4.075 | -4.348 | 1.877 | 1.931 | 2.031 | | 1.135 | 1.950 | 2.495 |
| FZ-1F | -3.250 | -3.629 | -3.659 | 1.195 | 1.319 | 1.463 | -1.135 | | 1.426 | 2.741 |
| G-FZ | -3.933 | -4.437 | -4.431 | -1.251 | -0.640 | 0.901 | -1.950 | -1.426 | | 0.171 |
| Hybrid | -3.998 | -4.500 | -4.364 | -0.565 | -0.410 | -0.226 | -2.495 | -2.741 | -0.171 | |

Panel C: FTSE

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -2.707 | -3.955 | 3.723 | 3.988 | 3.846 | -3.329 | 3.623 | 3.651 | 3.398 |
| RW250 | 2.707 | | -3.245 | 4.784 | 5.036 | 4.898 | -2.188 | 4.724 | 4.764 | 4.486 |
| RW500 | 3.955 | 3.245 | | 5.470 | 5.685 | 5.570 | -0.834 | 5.479 | 5.513 | 5.321 |
| G-N | -3.723 | -4.784 | -5.470 | | 4.494 | 3.434 | -6.805 | 0.406 | 1.526 | 0.796 |
| G-Skt | -3.988 | -5.036 | -5.685 | -4.494 | | -4.167 | -6.898 | -0.347 | -0.671 | 0.172 |
| G-EDF | -3.846 | -4.898 | -5.570 | -3.434 | 4.167 | | -6.847 | 0.065 | 0.569 | 0.519 |
| FZ-2F | 3.329 | 2.188 | 0.834 | 6.805 | 6.898 | 6.847 | | 6.187 | 6.920 | 7.263 |
| FZ-1F | -3.831 | -4.853 | -5.382 | -0.247 | 0.355 | 0.020 | -6.187 | | 0.125 | 0.760 |
| G-FZ | -3.651 | -4.764 | -5.513 | -1.526 | 0.710 | -0.569 | -6.920 | -0.125 | | 0.417 |
| Hybrid | -3.208 | -4.242 | -5.027 | -0.643 | 0.008 | -0.355 | -7.263 | -0.760 | -0.417 | |

Notes:

This table presents t-statistics from Diebold-Mariano tests comparing the average losses, using the FZ0 loss function, over the out-of-sample period from January 2000 to December 2016, for ten different forecasting models. A positive value indicates that the row model has higher average loss than the column model. Values greater than 1.96 in absolute value indicate that the average loss difference is significantly different from zero at the 95% confidence level. Values along the main diagonal are all identically zero and are omitted for interpretability. The first three rows correspond to rolling window forecasts, the next three rows correspond to GARCH forecasts based on different models for the standardized residuals, and the last four rows correspond to models introduced in Section 2 of the main paper.

Table S5: Out-of-sample average losses and goodness-of-fit tests (alpha=0.025)

| | Avg loss: S&P | Avg loss: DJIA | Avg loss: NIK | Avg loss: FTSE | GoF p-value, VaR: S&P | GoF p-value, VaR: DJIA | GoF p-value, VaR: NIK | GoF p-value, VaR: FTSE | GoF p-value, ES: S&P | GoF p-value, ES: DJIA | GoF p-value, ES: NIK | GoF p-value, ES: FTSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RW-125 | 1.119 | 1.088 | 1.525 | 1.166 | 0.022 | 0.003 | 0.000 | 0.000 | 0.009 | 0.004 | 0.001 | 0.001 |
| RW-250 | 1.164 | 1.117 | 1.525 | 1.209 | 0.005 | 0.007 | 0.002 | 0.000 | 0.023 | 0.039 | 0.010 | 0.005 |
| RW-500 | 1.245 | 1.187 | 1.561 | 1.294 | 0.001 | 0.000 | 0.004 | 0.000 | 0.019 | 0.011 | 0.007 | 0.000 |
| GCH-N | 1.089 | 1.016 | 1.341 | 1.053 | 0.000 | 0.002 | | | | | | |

Notes:

Table S6: Diebold-Mariano t-statistics on average out-of-sample loss differences for the S&P 500, DJIA, NIKKEI and FTSE100 (alpha=0.025)

Panel A: S&P 500

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -2.035 | -3.587 | 1.100 | 2.728 | 3.125 | 1.972 | 3.599 | 3.212 | 2.642 |
| RW250 | 2.035 | | -3.454 | 1.901 | 3.112 | 3.472 | 2.637 | 4.240 | 3.613 | 3.447 |
| RW500 | 3.587 | 3.454 | | 3.283 | 4.388 | 4.731 | 3.966 | 5.605 | 4.879 | 4.968 |
| G-N | -1.100 | -1.901 | -3.283 | | 4.241 | 3.522 | 1.645 | 2.346 | 3.835 | 1.963 |
| G-Skt | -2.728 | -3.112 | -4.388 | -4.241 | | 2.393 | 0.093 | 0.738 | 2.850 | -0.447 |
| G-EDF | -3.125 | -3.472 | -4.731 | -3.522 | -2.393 | | -0.595 | -0.198 | 1.482 | -1.500 |
| FZ-2F | -1.972 | -2.637 | -3.966 | -1.645 | -0.093 | 0.595 | | 0.348 | 1.111 | 0.368 |
| FZ-1F | -3.599 | -4.240 | -5.605 | -2.346 | -0.738 | 0.198 | -0.348 | | 0.739 | -1.406 |
| G-FZ | -3.212 | -3.613 | -4.879 | -3.835 | -2.850 | -1.482 | -1.111 | -0.739 | | -2.300 |
| Hybrid | -2.642 | -3.447 | -4.968 | -1.963 | 0.447 | 1.500 | -0.368 | 1.406 | 2.300 | |

Panel B: DJIA

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -1.066 | -2.722 | 2.676 | 3.902 | 3.879 | 3.194 | 3.906 | 3.637 | 1.945 |
| RW250 | 1.066 | | -3.065 | 2.754 | 3.852 | 3.900 | 4.102 | 4.343 | 3.744 | 2.249 |
| RW500 | 2.722 | 3.065 | | 3.968 | 5.053 | 5.131 | 5.529 | 5.764 | 5.026 | 3.661 |
| G-N | -2.676 | -2.754 | -3.968 | | 3.430 | 3.009 | 0.703 | 1.313 | 2.775 | -0.970 |
| G-Skt | -3.902 | -3.852 | -5.053 | -3.430 | | 1.390 | -1.211 | -0.958 | 1.722 | -3.640 |
| G-EDF | -3.879 | -3.900 | -5.131 | -3.009 | -1.390 | | -1.553 | -1.265 | 1.620 | -3.563 |
| FZ-2F | -3.194 | -4.102 | -5.529 | -0.703 | 1.211 | 1.553 | | -0.310 | 1.962 | -0.744 |
| FZ-1F | -3.906 | -4.343 | -5.764 | -1.313 | 0.958 | 1.265 | 0.310 | | 1.736 | -1.835 |
| G-FZ | -3.637 | -3.744 | -5.026 | -2.775 | -1.722 | -1.620 | -1.962 | -1.736 | | -3.364 |
| Hybrid | -1.945 | -2.249 | -3.661 | 0.970 | 3.640 | 3.563 | 0.744 | 1.835 | 3.364 | |

Panel C: NIKKEI

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | 0.011 | -0.977 | 4.223 | 4.166 | 4.211 | -16.674 | 2.677 | 4.148 | 4.052 |
| RW250 | -0.011 | | -1.773 | 4.499 | 4.568 | 4.592 | -16.612 | 2.767 | 4.542 | 4.466 |
| RW500 | 0.977 | 1.773 | | 4.536 | 4.628 | 4.638 | -17.116 | 3.019 | 4.602 | 4.620 |
| G-N | -4.223 | -4.499 | -4.536 | | 1.896 | 2.089 | -16.040 | -2.765 | 2.042 | -0.126 |
| G-Skt | -4.166 | -4.568 | -4.628 | -1.896 | | -0.864 | -15.803 | -3.078 | -0.283 | -0.828 |
| G-EDF | -4.211 | -4.592 | -4.638 | -2.089 | 0.864 | | -15.847 | -3.072 | 0.415 | -0.764 |
| FZ-2F | 16.674 | 16.612 | 17.116 | 16.040 | 15.803 | 15.847 | | 15.323 | 15.834 | 15.784 |
| FZ-1F | -2.677 | -2.767 | -3.019 | 2.765 | 3.078 | 3.072 | -15.323 | | 3.035 | 3.650 |
| G-FZ | -4.148 | -4.542 | -4.602 | -2.042 | 0.283 | -0.415 | -15.834 | -3.035 | | -0.785 |
| Hybrid | -4.052 | -4.466 | -4.620 | 0.126 | 0.828 | 0.764 | -15.784 | -3.650 | 0.785 | |

Panel D: FTSE

| | RW125 | RW250 | RW500 | G-N | G-Skt | G-EDF | FZ-2F | FZ-1F | G-FZ | Hybrid |
|---|---|---|---|---|---|---|---|---|---|---|
| RW125 | | -1.754 | -3.623 | 3.329 | 3.989 | 3.639 | -4.888 | 3.253 | 2.818 | 2.375 |
| RW250 | 1.754 | | -3.406 | 4.122 | 4.786 | 4.435 | -4.800 | 4.139 | 3.716 | 3.257 |
| RW500 | 3.623 | 3.406 | | 5.066 | 5.638 | 5.339 | -4.613 | 5.355 | 4.809 | 4.533 |
| G-N | -3.329 | -4.122 | -5.066 | | 4.696 | 3.860 | -5.167 | -0.306 | -0.827 | -2.199 |
| G-Skt | -3.989 | -4.786 | -5.638 | -4.696 | | -4.658 | -5.230 | -2.170 | -3.470 | -3.828 |
| G-EDF | -3.639 | -4.435 | -5.339 | -3.860 | 4.658 | | -5.191 | -1.163 | -2.332 | -3.130 |
| FZ-2F | 4.888 | 4.800 | 4.613 | 5.167 | 5.230 | 5.191 | | 5.173 | 5.154 | 5.110 |
| FZ-1F | -3.253 | -4.139 | -5.355 | 0.306 | 2.170 | 1.163 | -5.173 | | -0.147 | -1.526 |
| G-FZ | -2.818 | -3.716 | -4.809 | 0.827 | 3.470 | 2.332 | -5.154 | 0.147 | | -2.015 |
| Hybrid | -2.375 | -3.257 | -4.533 | 2.199 | 3.828 | 3.130 | -5.110 | 1.526 | 2.015 | |

Notes:

This table presents tt