Encompassing Tests for Value at Risk and Expected Shortfall Multi-Step Forecasts based on Inference on the Boundary

Abstract

We propose forecast encompassing tests for the Expected Shortfall (ES) jointly with the Value at Risk (VaR) based on flexible link (or combination) functions. Our setup allows testing encompassing for convex forecast combinations and for link functions which preclude crossings of the combined VaR and ES forecasts. As the tests based on these link functions involve parameters which are on the boundary of the parameter space under the null hypothesis, we derive and base our tests on nonstandard asymptotic theory on the boundary. Our simulation study shows that the encompassing tests based on our new link functions outperform tests based on unrestricted linear link functions for one-step and multi-step forecasts. We further illustrate the potential of the proposed tests in a real data analysis for forecasting VaR and ES of the S&P 500 index.


Timo Dimitriadis ∗, Xiaochun Liu †, Julie Schnaitmann ‡

September 17, 2020


Keywords: asymptotic theory on the boundary, joint elicitability, multi-step ahead and aggregate forecasts, forecast evaluation and combinations

JEL: C12, C52, C58

∗ Heidelberg Institute for Theoretical Studies (HITS), 69118 Heidelberg, and University of Hohenheim, Institute of Economics, 70599 Stuttgart, Germany, e-mail: [email protected]
† Department of Economics, Finance and Legal Studies, Culverhouse College of Business, University of Alabama, Tuscaloosa, Alabama 35487, USA, e-mail: [email protected]
‡ University of Konstanz, Department of Economics, 78457 Konstanz, Germany, e-mail: [email protected]

arXiv preprint [econ.EM]

Introduction

For nearly two decades, financial institutions and regulators have advocated Value at Risk (VaR) as the main tool for risk management and capital allocation. Owing to a number of weaknesses, including its failure to capture (extreme) tail risks and the resulting discouragement of risk diversification (Artzner et al., 1999; Acerbi and Tasche, 2002; Tasche, 2002), the Basel Committee on Banking Supervision (BCBS) has recently adopted Expected Shortfall (ES), complementing and in parts substituting VaR as the fundamental measure for market risk (Basel Committee, 2013, 2016, 2017, 2019).

The ES at level $\alpha \in (0, 1)$ is defined as the expected return beyond the $\alpha$-quantile, and it is widely used as a coherent measure of tail risks (Artzner et al., 1999; Tasche, 2002). Its inherent deficiency, however, is that the ES is not elicitable on its own, meaning that the ES cannot be obtained as the unique minimizer of the expectation of a loss (scoring) function, see e.g. Gneiting (2011). Nevertheless, Fissler and Ziegel (2016) show that the VaR and ES are jointly elicitable (or 2-elicitable). This joint elicitability property directly suggests evaluating the ES jointly with the VaR in a unified framework (Fissler et al., 2016), as in the present study concerning forecast encompassing tests.

Forecast encompassing of two competing forecasts tests whether one forecast alone performs no worse than any forecast combination stemming from some parametric combination formula, which we also call a link function in this article. If this holds, the rival forecast contains no additional useful information relative to the first forecast (Hendry and Richard, 1982; Mizon and Richard, 1986). This makes forecast encompassing tests an attractive tool for the empirical comparison of competing forecasts, especially when focusing on efficiency gains stemming from forecast combinations. As meaningful measures of forecast performance are based on strictly consistent loss functions (Gneiting, 2011), the existence of such loss functions is essential for testing forecast encompassing.
Hence, we build our encompassing tests on joint loss functions for the VaR and ES (Fissler and Ziegel, 2016), and on recently developed joint semiparametric VaR and ES models (Patton et al., 2019; Dimitriadis and Bayer, 2019; Taylor, 2019; Barendse, 2020).

Footnote: For recent empirical applications of forecast encompassing tests, see e.g. Taylor (2005); Busetti and Marcucci (2013); Fuertes and Olmo (2013); Costantini et al. (2017); Liu (2017); Zhao et al. (2017); Tsiotas (2018); Clements and Reade (2020); You and Liu (2020) among others.

As the main methodological contribution of this paper, we introduce encompassing tests for the ES jointly with the VaR based on flexible link functions or combination formulas, which allow for several important specifications that go beyond those of existing encompassing tests of e.g. Giacomini and Komunjer (2005) and Dimitriadis and Schnaitmann (2020). While linear forecast combination methods with unrestricted parameters are the usual choice in this literature, more flexible approaches are especially important for joint tests of the VaR and ES. First, unrestricted linear link functions regularly result in VaR and ES crossings, i.e. days where the optimally combined ES forecast is larger than the VaR forecast, which immediately contradicts their definitions (Taylor, 2020). To this end, we propose the no-crossing link functions, which preclude such crossings. Second, convex forecast combinations present an attractive alternative as their structure can stabilize the forecast performance and reduce the estimation noise (Timmermann, 2006; Hansen, 2008; Bayer, 2018), which is particularly important for the case of semiparametric models for the VaR and ES with extreme probability levels (Dimitriadis and Bayer, 2019).

The link function specifications considered in this article imply that certain model parameters are on the boundary of the parameter space under the null hypothesis.
This boundary issue is exemplified by encompassing tests for convex forecast combinations, which entail testing whether the convex combination parameter is one (or zero). Under the null hypothesis, this parameter lies on the boundary of the admissible parameter space, i.e., the unit interval. Hence, we derive novel and nonstandard asymptotic theory for the model parameters and the resulting Wald test statistics for semiparametric models for the VaR and ES, which allows some (or all) of the true model parameters to be on the boundary of the parameter space. For this, we follow the approach of Andrews (1999) and Andrews (2001), where the proofs use empirical process methods of Andrews (1994) and Doukhan et al. (1995). To render our tests practically feasible, we draw critical values from the resulting nonstandard asymptotic distributions of the Wald test statistics, obtained from simulations involving the solution of quadratic programming problems.

The proposed encompassing tests allow for testing one-step ahead, multi-step ahead and multi-step aggregate forecasts, where the consideration of multi-step forecasts requires the application of a VaR and ES specific adaptation of the HAC (Heteroskedasticity and Autocorrelation Consistent) estimator of Newey and West (1987) and Andrews (1991). The examination of multi-step (aggregate) forecasts is particularly relevant for the risk measures VaR and ES due to the explicit calls for 10-day aggregate VaR and ES forecasts of the Basel Committee (2016, 2017, 2019, 2020). Furthermore, this goes beyond many recent papers concerning forecast evaluation procedures for the VaR and ES, which mainly focus on one-step ahead forecasts.
Our simulations show that the encompassing tests for the VaR and ES based on our new link functions and on inference on the boundary exhibit accurate empirical sizes and [...]

Footnote: See e.g. Hendry and Richard (1982); Mizon and Richard (1986); Diebold (1989); Giacomini and Komunjer (2005); Clements and Harvey (2009, 2010); Dimitriadis and Schnaitmann (2020).

Footnote: See e.g. Kratz et al. (2018); Costanzino and Curran (2018); Bayer and Dimitriadis (2020); Couperier and Leymarie (2019); Patton et al. (2019); Dimitriadis and Schnaitmann (2020).

Theory

We follow the general setup of Giacomini and Komunjer (2005) and Dimitriadis and Schnaitmann (2020) while further allowing for multi-step forecasts. For this, we consider a stationary stochastic process $Z = \{ Z_t : \Omega \to \mathbb{R}^{\tilde{l}+1},\ t = 1, \dots, R,\ \tilde{l} \in \mathbb{N},\ R \in \mathbb{N} \}$, which is defined on some common and complete probability space $(\Omega, \mathcal{F}, P)$, where $\mathcal{F} = \{\mathcal{F}_t,\ t = 1, \dots, R\}$ and $\mathcal{F}_t = \sigma\{Z_s,\ s \le t\}$. We partition the stochastic process as $Z_t = (Y_t, X_t)$, where $Y_t : \Omega \to \mathbb{R}$ is an absolutely continuous random variable of interest and $X_t : \Omega \to \mathbb{R}^{\tilde{l}}$ is a vector of explanatory variables. For some fixed forecast horizon $h \in \mathbb{N}$, we denote the conditional distribution of $Y_{t+h}$ given the information set $\mathcal{F}_t$ by $F_t$. Accordingly, $\mathbb{E}_t$, $\mathrm{Var}_t$ and $h_t$ denote the expectation, variance and density corresponding to $F_t$. The conditional VaR of $Y_{t+h}$ given $\mathcal{F}_t$ at probability level $\alpha \in (0, 1)$ is formally defined as

$$\mathrm{VaR}_{t,\alpha}(Y_{t+h}) = F_t^{-1}(\alpha) = \inf\{ z \in \mathbb{R} : F_t(z) \ge \alpha \}, \tag{2.1}$$

and given that $F_t$ is continuous at its $\alpha$-quantile, the conditional ES of $Y_{t+h}$ given $\mathcal{F}_t$ at level $\alpha \in (0, 1)$ is defined by

$$\mathrm{ES}_{t,\alpha}(Y_{t+h}) = \mathbb{E}_t\big[ Y_{t+h} \mid Y_{t+h} \le \mathrm{VaR}_{t,\alpha}(Y_{t+h}) \big]. \tag{2.2}$$

In order to allow for forecast evaluation of multi-step (ahead and aggregate) forecasts with horizon $h \in \mathbb{N}$ in an out-of-sample fashion, we split $R = S + T + h - 1$, where $S \in \mathbb{N}$ denotes the length of the in-sample and $T \in \mathbb{N}$ of the out-of-sample window. In detail, for all $t \in \mathbb{N}$ such that $S \le t \le S + T - 1$, we generate $h$-step ahead VaR and ES forecasts for the random variables $Y_{t+h}$ (i.e. for the sequence $(Y_{S+h}, \dots, Y_{S+T+h-1})$) based on the previous $S$ data points. For notational convenience, we define the set $\mathfrak{T} := \{ t \in \mathbb{N} : S \le t \le S + T - 1 \}$ of time points at which the forecasts for the out-of-sample period are issued.

We further denote the competing, $\mathcal{F}_t$-measurable, $h$-step forecasts for the VaR and ES by $\hat{q}_{j,t}$ and $\hat{e}_{j,t}$, for $j = 1, 2$. Following Giacomini and Komunjer (2005), we assume that these are generated through a function $f(\gamma_{t,S}, Z_t, Z_{t-1}, \dots)$, which is fixed over time. Here, $\gamma_{t,S}$ denotes the (estimated or fixed) model parameters at time $t$, or alternatively the semi- or non-parametric estimator used in the construction of the forecasts, (possibly) estimated by data from the in-sample period of length $S$. This construction allows for forecasting schemes with fixed (or no) parameters, forecasting schemes with model parameters $\gamma_{t,S}$ that are estimated only once, and rolling window forecasting schemes where the parameters $\gamma_{t,S}$ are re-estimated in each step (Giacomini and Komunjer, 2005). In our testing approach, we focus on evaluation of the entire forecasting method as e.g. in Giacomini and Komunjer (2005) and Giacomini and White (2006), instead of on a forecasting model, as e.g. in West (1996, 2001). The stacked forecasts are denoted by $\hat{q}_t = (\hat{q}_{1,t}, \hat{q}_{2,t})$ for the VaR, and by $\hat{e}_t = (\hat{e}_{1,t}, \hat{e}_{2,t})$ for the ES. In our notation of the forecasts, we stress the dependence on $t$, the time point at which they are issued, while suppressing the dependence on the forecast horizon $h$ as it is treated as fixed.

Let $r_t$ denote financial log-returns for day $t$. Then, our theoretical setup allows for the treatment of classical multi-step ($h$-step) ahead forecasts, but also for $h$-step aggregate forecasts in the sense of an aggregated return over $h$ days, such as the 10-day aggregate VaR and ES forecasts explicitly stated in the regulatory framework of the Basel Committee (2019, 2020). For classical $h$-step ahead forecasts, we use $Y_{t+h} = r_{t+h}$, while for $h$-step aggregate forecasts we choose $Y_{t+h} = \sum_{s=1}^{h} r_{t+s}$.

In the following exposition, all vectors refer to column vectors. When splitting vectors into subvectors, we often abuse notation and write $\theta = (\theta_1, \theta_2)$ instead of $\theta = (\theta_1^\top, \theta_2^\top)^\top$. The operator $\nabla$ denotes the derivative with respect to $\theta$.
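
As a small illustration of the two target definitions and the sample split above, the following Python sketch builds $Y_{t+h}$ from a return series, once as the $h$-step ahead return and once as the $h$-step aggregate return. The helper names `build_targets`, `forecast_origins` and `sample_length` are hypothetical and introduced only for this sketch, and the 0-based array positions are an implementation choice rather than part of the notation of the paper.

```python
import numpy as np


def build_targets(returns, h, aggregate=False):
    """Return the targets Y_{t+h} for all origins t where they are available.

    Position i of the output belongs to the forecast origin at array
    position i (0-based):
      h-step ahead:     Y_{t+h} = r_{t+h}
      h-step aggregate: Y_{t+h} = r_{t+1} + ... + r_{t+h}
    """
    r = np.asarray(returns, dtype=float)
    n = r.size
    if h < 1 or h >= n:
        raise ValueError("h must satisfy 1 <= h < len(returns)")
    if not aggregate:
        return r[h:]
    # csum[j] holds r[0] + ... + r[j-1], so the sum over r[t+1..t+h]
    # equals csum[t+h+1] - csum[t+1].
    csum = np.concatenate(([0.0], np.cumsum(r)))
    return csum[h + 1:] - csum[1:n - h + 1]


def forecast_origins(S, T):
    """Time points t with S <= t <= S + T - 1 (1-based, as in the text)."""
    return range(S, S + T)


def sample_length(S, T, h):
    """Total number of observations R = S + T + h - 1."""
    return S + T + h - 1


# Toy example with five "returns" and horizon h = 2.
toy_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
y_ahead = build_targets(toy_returns, h=2)
y_aggregate = build_targets(toy_returns, h=2, aggregate=True)
```

With overlapping aggregate targets, neighbouring values of $Y_{t+h}$ share $h-1$ returns, which is one way to see why multi-step evaluation calls for an autocorrelation-robust variance estimator.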
All limits below are taken "as $T \to \infty$" unless stated otherwise, and $\xrightarrow{P}$ and $\xrightarrow{d}$ denote convergence in probability and in distribution, respectively. Let $:=$ denote an equality "by definition". Furthermore, let $\mathbb{R}_+$ and $\mathbb{R}_-$ denote the non-negative and non-positive real half-lines, respectively, and we define $\mathbb{R}_C = \{ z \in \mathbb{R} : |z| \le C \}$ to be a sufficiently large compact subset of the real numbers (for some $C \in \mathbb{R}_+$ large enough).

For the introduction of the joint encompassing tests for VaR and ES forecasts, we follow Dimitriadis and Schnaitmann (2020) and define the flexible link (or combination) functions

$$g^q : \mathcal{Q} \times \mathcal{E} \times \Theta \to \mathbb{R}, \quad (\hat{q}_t, \hat{e}_t, \theta) \mapsto g^q(\hat{q}_t, \hat{e}_t, \theta), \tag{2.3}$$

$$g^e : \mathcal{Q} \times \mathcal{E} \times \Theta \to \mathbb{R}, \quad (\hat{q}_t, \hat{e}_t, \theta) \mapsto g^e(\hat{q}_t, \hat{e}_t, \theta), \tag{2.4}$$

based on some (compact) parameter space $\Theta \subset \mathbb{R}^k$, where $\mathcal{Q}$ and $\mathcal{E}$ denote the random spaces of the VaR and ES forecasts. These link functions represent the parametric, functional forms of the forecast combinations we consider. E.g., in the classical case of testing [...]

Footnote: A generalization of our framework to test encompassing for multiple competing forecasts ($K \ge$ [...]) [...] $\hat{q}_t = (\hat{q}_{1,t}, \dots, \hat{q}_{K,t})$ and $\hat{e}_t = (\hat{e}_{1,t}, \dots, \hat{e}_{K,t})$ and by further using suitable specifications for the link functions and the null hypotheses in the subsequent derivations.

Footnote: The link functions can alternatively be interpreted as (semi-) parametric models for the conditional quantile (VaR) and ES of $F_t$ as in Patton et al. (2019).

$$g^q_t(\theta) := g^q(\hat{q}_t, \hat{e}_t, \theta), \quad \text{and} \quad g^e_t(\theta) := g^e(\hat{q}_t, \hat{e}_t, \theta). \tag{2.5}$$

We further assume that there exists a unique test parameter value $\theta^* \in \Theta$ such that $g^q(\hat{q}_t, \hat{e}_t, \theta^*) = \hat{q}_{1,t}$ and $g^e(\hat{q}_t, \hat{e}_t, \theta^*) = \hat{e}_{1,t}$ almost surely. This assumption ensures that the parametric link function allows for the trivial forecast combination of only choosing the first forecast. In the classical case of unrestricted linear link functions, $\theta^*$ often corresponds to $(1, 0)$ or $(0, \dots)$ [...]

$$\rho(Y, q, e) = \big( \mathbb{1}\{Y \le q\} - \alpha \big) g(q) - \mathbb{1}\{Y \le q\}\, g(Y) + \phi'(e) \left( e - q + \frac{(q - Y)\, \mathbb{1}\{Y \le q\}}{\alpha} \right) - \phi(e) + a(Y), \tag{2.6}$$

where the function $g$ is twice continuously differentiable and increasing, $\phi$ is three times continuously differentiable, strictly increasing and strictly convex, and $a$ and $g$ are $Y_{t+h}$-integrable functions. The most prominent candidate of this class is the zero-homogeneous loss function (Nolde and Ziegel, 2017), sometimes called the FZ0 loss,

$$\rho^{\mathrm{FZ0}}(Y, q, e) = -\frac{1}{e} \left( e - q + \frac{(q - Y)\, \mathbb{1}\{Y \le q\}}{\alpha} \right) + \log(-e), \tag{2.7}$$

which is obtained by choosing $g(z) = 0$, $a(z) = 0$ and $\phi(z) = -\log(-z)$ in (2.6). We henceforth often use the short notations $\rho_t(\theta) := \rho\big(Y_{t+h}, g^q_t(\theta), g^e_t(\theta)\big)$ and $\rho^{\mathrm{FZ0}}_t(\theta) := \rho^{\mathrm{FZ0}}\big(Y_{t+h}, g^q_t(\theta), g^e_t(\theta)\big)$.

Using the general class of loss functions in (2.6), we define the true regression (or optimal combination) parameter $\theta_0 \in \Theta$ by

$$\theta_0 := \arg\min_{\theta \in \Theta} \mathbb{E}\big[ \rho\big(Y_{t+h}, g^q_t(\theta), g^e_t(\theta)\big) \big], \tag{2.8}$$

which is independent of $t$ as we assume stationarity of the process $Z$. The strict consistency result of the loss function from Fissler and Ziegel (2016) together with further weak regularity conditions on the link functions implies that

$$Q_\alpha(Y_{t+h} \mid \mathcal{F}_t) = g^q_t(\theta_0) \quad \text{and} \quad \mathrm{ES}_\alpha(Y_{t+h} \mid \mathcal{F}_t) = g^e_t(\theta_0) \tag{2.9}$$

almost surely, which justifies the notion of the true regression parameter.

Footnote: As the encompassing tests in this article are always formulated as forecast (pair) one encompasses forecast (pair) two, we only assume the existence of $\theta^*$ corresponding to the first (pair) of forecasts. Testing the inverted encompassing hypothesis that the second pair of forecasts encompasses the first forecast pair can be carried out by interchanging the forecast pairs. Alternatively, one could assume that a value $\tilde{\theta}^*$ exists such that $g^q(\hat{q}_t, \hat{e}_t, \tilde{\theta}^*) = \hat{q}_{2,t}$ and $g^e(\hat{q}_t, \hat{e}_t, \tilde{\theta}^*) = \hat{e}_{2,t}$ holds almost surely.

We now define joint forecast encompassing for the VaR and ES following Giacomini and Komunjer (2005) and Dimitriadis and Schnaitmann (2020).

Definition 1 (Joint VaR and ES Forecast Encompassing). We say that the pair $(\hat{q}_{1,t}, \hat{e}_{1,t})$ jointly encompasses $(\hat{q}_{2,t}, \hat{e}_{2,t})$ at time $t$ with respect to the link functions $g^q$ and $g^e$ if and only if

$$\mathbb{E}\big[ \rho\big(Y_{t+h}, \hat{q}_{1,t}, \hat{e}_{1,t}\big) \big] = \mathbb{E}\big[ \rho\big(Y_{t+h}, g^q(\hat{q}_t, \hat{e}_t, \theta_0), g^e(\hat{q}_t, \hat{e}_t, \theta_0)\big) \big], \tag{2.10}$$

where the loss function $\rho$ is given in (2.6).

This holds if and only if $\theta_0 = \theta^*$ as we impose uniqueness of the parameter $\theta^*$. The intuition behind the specification in (2.10) is that the forecasts $(\hat{q}_{1,t}, \hat{e}_{1,t})$ generate the same expected loss as an optimal forecast combination $\big(g^q_t(\theta_0), g^e_t(\theta_0)\big)$ based on the optimal combination parameter defined in (2.8). Hence, using the first pair of forecasts $(\hat{q}_{1,t}, \hat{e}_{1,t})$ is the optimal, but trivial forecast combination. From a different point of view, this implies that the second pair of forecasts $(\hat{q}_{2,t}, \hat{e}_{2,t})$ does not add any useful information which is not already contained in $(\hat{q}_{1,t}, \hat{e}_{1,t})$.

If the interest is mainly placed on the performance of the competing ES forecasts, one can consider the auxiliary ES encompassing test in the spirit of Dimitriadis and Schnaitmann (2020).

Definition 2 (Auxiliary ES Forecast Encompassing). We say that the forecast $\hat{e}_{1,t}$ auxiliarily encompasses its rival $\hat{e}_{2,t}$ at time $t$ with respect to the link functions $g^q$ and $g^e$ if and only if

$$\mathbb{E}\big[ \rho\big(Y_{t+h}, g^q(\hat{q}_t, \hat{e}_t, \theta_0), \hat{e}_{1,t}\big) \big] = \mathbb{E}\big[ \rho\big(Y_{t+h}, g^q(\hat{q}_t, \hat{e}_t, \theta_0), g^e(\hat{q}_t, \hat{e}_t, \theta_0)\big) \big], \tag{2.11}$$

where the loss function $\rho$ is given in (2.6).

Footnote: See e.g. Patton et al. (2019), Dimitriadis and Bayer (2019), Bayer and Dimitriadis (2020), Dimitriadis and Schnaitmann (2020) and Barendse (2020) for details on joint (semi-) parametric models for the VaR and ES.

Footnote: Application of the strict encompassing test of Dimitriadis and Schnaitmann (2020) in the setting of the present article further requires combining the asymptotic theory under misspecification of Dimitriadis and Schnaitmann (2020) with the theory of estimation and testing at the boundary of the present article.

Finding testable conditions for the auxiliary test, corresponding to the condition $\theta_0 = \theta^*$ for the joint test, has to be done on a case-by-case basis for the link functions under consideration, see Section 2.3 for further details.

Given a sample of competing forecasts and corresponding realizations, we can test whether the sequence of joint VaR and ES forecasts $(\hat{q}_{1,t}, \hat{e}_{1,t})$ encompasses the sequence $(\hat{q}_{2,t}, \hat{e}_{2,t})$ for all $t \in \mathfrak{T}$ (in the out-of-sample period) by estimating the parameters of the semiparametric models

$$Y_{t+h} = g^q_t(\theta_0) + u^q_{t+h}, \quad \text{and} \quad Y_{t+h} = g^e_t(\theta_0) + u^e_{t+h}, \tag{2.12}$$

where $Q_\alpha(u^q_{t+h} \mid \mathcal{F}_t) = 0$ and $\mathrm{ES}_\alpha(u^e_{t+h} \mid \mathcal{F}_t) = 0$ almost surely for all $t \in \mathfrak{T}$, by using the M-estimator introduced in Patton et al. (2019) and Dimitriadis and Bayer (2019), and by testing whether $\theta^* = \theta_0$ using a Wald test.

In contrast to Dimitriadis and Schnaitmann (2020) and the remaining literature on testing forecast encompassing, we allow the true, optimal parameter $\theta_0$ to be on the boundary of $\Theta$ under the null hypothesis. This facilitates the consideration of several important link function specifications. E.g., it makes it possible to test encompassing for link specifications which theoretically prevent crossings of the combined VaR and ES forecasts in the sense that $g^e_t(\theta) \le g^q_t(\theta)$ almost surely for all $t \in \mathfrak{T}$ (Taylor, 2020). Furthermore, we can test forecast encompassing based on convex forecast combinations, which stabilize the parameter estimation. While the subsequent section focuses mainly on these two examples, our approach is by no means limited to these link functions.
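
The role of the joint loss in the definitions above can be made concrete numerically. The following Python sketch implements the FZ0 loss of (2.7) and compares, on simulated standard normal data, the average loss of the true VaR and ES pair with that of a distorted pair. The helper names `fz0_loss` and `normal_var_es`, the sample size, the seed and the distortion factor 1.3 are illustrative assumptions and are not taken from the paper.

```python
from statistics import NormalDist

import numpy as np


def fz0_loss(y, q, e, alpha):
    """FZ0 loss of (2.7); the ES forecasts e must be strictly negative."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    e = np.asarray(e, dtype=float)
    if np.any(e >= 0):
        raise ValueError("FZ0 loss requires strictly negative ES forecasts")
    hit = (y <= q).astype(float)  # indicator 1{Y <= q}
    return -(e - q + (q - y) * hit / alpha) / e + np.log(-e)


def normal_var_es(alpha, sigma=1.0):
    """VaR and ES at level alpha of a N(0, sigma^2) variable."""
    nd = NormalDist()
    z = nd.inv_cdf(alpha)
    return sigma * z, -sigma * nd.pdf(z) / alpha


alpha = 0.025
rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)

q_true, e_true = normal_var_es(alpha)
loss_true = fz0_loss(y, q_true, e_true, alpha).mean()
# A distorted forecast pair that is 30 percent too extreme (assumption).
loss_distorted = fz0_loss(y, 1.3 * q_true, 1.3 * e_true, alpha).mean()

print(f"average loss, true pair:      {loss_true:.4f}")
print(f"average loss, distorted pair: {loss_distorted:.4f}")
```

In this setting the correctly specified pair should attain the smaller average loss, which mirrors the strict consistency property that the expected loss comparison in (2.10) relies on.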
In this section, we introduce three link function specifications which are of interest for this article. Other link functions can be treated along the lines of this section by employing an equivalent split of the parameter vector and by formulating the null hypotheses accordingly. The treatment of asymptotic theory on the boundary in the sense of Andrews (2001), detailed in Section 2.4 of the present article, requires splitting the parameter vector $\theta$ into the following structurally different subvectors,

$$\theta = (\beta_1, \beta_2, \delta, \psi), \tag{2.13}$$

where $\beta_1 \in B_1 \subseteq \mathbb{R}^{p_1}$, $\beta_2 \in B_2 \subseteq \mathbb{R}^{p_2}$, $\delta \in \Delta \subseteq \mathbb{R}^{q}$ and $\psi \in \Psi \subseteq \mathbb{R}^{s}$, where $p_1 + p_2 + q + s = k$, $p := p_1 + p_2$ and $\Theta = B_1 \times B_2 \times \Delta \times \Psi$. The intuition behind this decomposition is the following: (1) the null hypothesis we test for is based on $\beta_1$ only, and $\beta_1$ may or may not be on the boundary of the parameter space; (2) $\beta_2$ may or may not be on the boundary, but it is not tested for; (3) $\delta$ is not on the boundary, and it is not tested for; (4) $\psi$ is not tested for, it may or may not be on the boundary, and the off-diagonal elements of the matrix $\mathcal{T}$, defined later in (2.22), corresponding to interactions of $\psi$ and $(\beta_1, \beta_2, \delta)$ are zero.

Most importantly, the null hypothesis is based on $\beta_1$ only, while the remaining parameters can be thought of as nuisance parameters, required for the estimation of the model. The distinction between $\psi$ and the remaining parameter subvectors (in particular $\beta_2$) is that the imposed nullity of certain off-diagonal elements of $\mathcal{T}$ implies that the asymptotic distribution of $\beta_1$ is not affected by whether $\psi$ is on the boundary or not.

Using the subvector decomposition in (2.13), we can formally introduce the link functions and the corresponding null hypotheses of interest for the joint and auxiliary encompassing tests. The subsequent ordering of the parameters in $\theta$ follows the ordering in the decomposition in (2.13).
All following encompassing null hypotheses are formulated for the test that the forecast pair $(\hat{q}_{1,t}, \hat{e}_{1,t})$ encompasses $(\hat{q}_{2,t}, \hat{e}_{2,t})$, whereas the reverse tests can be defined by simply interchanging the forecast pairs.

(1) (Unrestricted) Linear: The unrestricted linear link functions are given by

$$g^q_t(\theta) = \theta_1 + \theta_2 \hat{q}_{1,t} + \theta_3 \hat{q}_{2,t}, \quad \text{and} \tag{2.14}$$

$$g^e_t(\theta) = \theta_4 + \theta_5 \hat{e}_{1,t} + \theta_6 \hat{e}_{2,t}, \tag{2.15}$$

where the parameter space $\Theta := \mathbb{R}_C^6$ is essentially unrestricted, as the constant $C$ can be chosen sufficiently large. We henceforth denote these link functions as linear. We then test

(a) $H_0^{\mathrm{Joint}} : (\theta_2, \theta_3, \theta_5, \theta_6) = (1, 0, 1, 0)$, and

(b) $H_0^{\mathrm{Aux}} : (\theta_5, \theta_6) = (1, 0)$.

This corresponds to the standard case of forecast encompassing tests (Fair and Shiller, 1989; Clements and Harvey, 2009), which is already considered by Dimitriadis and Schnaitmann (2020) for the case of the VaR and ES. As none of the parameters are on the boundary under the null, standard asymptotic theory is sufficient here and we use this specification as the benchmark in this paper.

Footnote: In terms of the subvector decomposition in (2.13), we can assign $\beta_1 := (\theta_2, \theta_3, \theta_5, \theta_6)$ and $\delta := (\theta_1, \theta_4)$ for the joint test and $\beta_1 := (\theta_5, \theta_6)$ and $\delta := (\theta_1, \theta_2, \theta_3, \theta_4)$ for the auxiliary test. As the parameter subvector $\beta_1$ is in the interior of the parameter space under the null for both tests, classical asymptotic theory is sufficient for this unrestricted linear link function specification.

(2) Convex Combinations: We consider the link functions

$$g^q_t(\theta) = \theta_3 + \theta_1 \hat{q}_{1,t} + (1 - \theta_1) \hat{q}_{2,t}, \quad \text{and} \tag{2.16}$$

$$g^e_t(\theta) = \theta_4 + \theta_2 \hat{e}_{1,t} + (1 - \theta_2) \hat{e}_{2,t}, \tag{2.17}$$

where $\Theta := [0, 1]^2 \times \mathbb{R}_C^2$. We then test the following null hypotheses:

(a) $H_0^{\mathrm{Joint}} : (\theta_1, \theta_2) = (1, 1)$, and we assign $\beta_1 := (\theta_1, \theta_2) \in B_1 := [0, 1]^2$, and $\delta := (\theta_3, \theta_4) \in \Delta := \mathbb{R}_C^2$.

(b) $H_0^{\mathrm{Aux}} : \theta_2 = 1$, and we assign $\beta_1 := \theta_2 \in B_1 := [0, 1]$, $\beta_2 := \theta_1 \in B_2 := [0, 1]$ and $\delta := (\theta_3, \theta_4) \in \Delta := \mathbb{R}_C^2$.

In comparison with the linear link functions, the convex forecast combinations require the estimation of fewer parameters and therefore stabilize the parameter estimation, especially for highly correlated forecasts. For both hypotheses formulated above, $\theta_1$ and $\theta_2$ are on the boundary under the null, while $\theta_3$ and $\theta_4$ are not. The latter parameters are assigned to $\delta$ instead of $\psi$ as the matrix $\mathcal{T}$, given in (2.22), does not have null entries at the respective points. As the tested parameters are on the boundary of the parameter space under the null hypotheses of both tests, their corresponding Wald test statistics are subject to a non-standard asymptotic distribution (Andrews, 1999, 2001).

(3) No VaR and ES Crossing: We consider the link functions

$$g^q_t(\theta) = \theta_3 + \theta_2 \hat{e}_{1,t} + (1 - \theta_2) \hat{e}_{2,t} + \theta_1 \big( \hat{q}_{1,t} - \hat{e}_{1,t} \big) + (1 - \theta_1) \big( \hat{q}_{2,t} - \hat{e}_{2,t} \big), \quad \text{and} \tag{2.18}$$

$$g^e_t(\theta) = \theta_3 + \theta_2 \hat{e}_{1,t} + (1 - \theta_2) \hat{e}_{2,t}, \tag{2.19}$$

where $\Theta := [0, 1]^2 \times \mathbb{R}_C$. Provided that the individual forecasts satisfy $\hat{q}_{j,t} \ge \hat{e}_{j,t}$, these link functions imply that $g^q_t(\theta) \ge g^e_t(\theta)$ holds almost surely for all $t \in \mathfrak{T}$, which can be interpreted as a necessary condition for sensible (combinations of) VaR and ES forecasts, and which is closely related to the issue of quantile crossings in quantile regression (Koenker, 2005). We then test

(a) $H_0^{\mathrm{Joint}} : (\theta_1, \theta_2) = (1, 1)$, and we assign $\beta_1 := (\theta_1, \theta_2) \in B_1 := [0, 1]^2$, and $\delta := \theta_3 \in \Delta := \mathbb{R}_C$.

(b) $H_0^{\mathrm{Aux}} : \theta_2 = 1$, and we assign $\beta_1 := \theta_2 \in B_1 := [0, 1]$, $\beta_2 := \theta_1 \in B_2 := [0, 1]$ and $\delta := \theta_3 \in \Delta := \mathbb{R}_C$.

As in the convex setup, the tested parameters are on the boundary under the null and non-standard asymptotic theory is required.

Footnote: Notice that for the estimation of joint VaR and ES models, especially for extreme probabilities such as $\alpha = 2$ [...]

While we focus on these examples of link functions in this article, the asymptotic theory presented in the subsequent section is valid for many other interesting link functions, such as link functions without intercepts, nonlinear functions, and further specifications which prevent a crossing of the VaR and ES forecasts.
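
The three link function specifications can be written down compactly in code. The Python sketch below is one possible implementation under stated assumptions: the function names are hypothetical, the forecasts are stored as arrays with one column per forecaster, and the parameter ordering (intercepts at positions 1 and 4 in the linear case, tested weights first and intercepts last in the convex and no-crossing cases) is the ordering assumed in this sketch.

```python
import numpy as np


def link_linear(theta, q_hat, e_hat):
    """Unrestricted linear links with theta = (t1, ..., t6)."""
    t1, t2, t3, t4, t5, t6 = theta
    g_q = t1 + t2 * q_hat[:, 0] + t3 * q_hat[:, 1]
    g_e = t4 + t5 * e_hat[:, 0] + t6 * e_hat[:, 1]
    return g_q, g_e


def link_convex(theta, q_hat, e_hat):
    """Convex combinations with theta = (w_q, w_e, c_q, c_e)."""
    w_q, w_e, c_q, c_e = theta
    g_q = c_q + w_q * q_hat[:, 0] + (1.0 - w_q) * q_hat[:, 1]
    g_e = c_e + w_e * e_hat[:, 0] + (1.0 - w_e) * e_hat[:, 1]
    return g_q, g_e


def link_no_crossing(theta, q_hat, e_hat):
    """No-crossing links with theta = (w_q, w_e, c).

    The VaR link equals the ES link plus a convex combination of the
    non-negative spreads q_hat - e_hat, so it can never fall below it.
    """
    w_q, w_e, c = theta
    g_e = c + w_e * e_hat[:, 0] + (1.0 - w_e) * e_hat[:, 1]
    spread = (w_q * (q_hat[:, 0] - e_hat[:, 0])
              + (1.0 - w_q) * (q_hat[:, 1] - e_hat[:, 1]))
    return g_e + spread, g_e


# Toy forecasts (assumption): negative ES forecasts and VaR forecasts above them.
rng = np.random.default_rng(0)
e_hat = -rng.uniform(1.0, 3.0, size=(250, 2))
q_hat = e_hat + rng.uniform(0.0, 1.0, size=(250, 2))

g_q, g_e = link_no_crossing((0.4, 0.7, -0.1), q_hat, e_hat)
print(bool(np.all(g_q >= g_e)))
```

At the test parameter value, each specification returns the first forecast pair: `(0, 1, 0, 0, 1, 0)` for the linear links, `(1, 1, 0, 0)` for the convex combinations and `(1, 1, 0)` for the no-crossing links in the ordering assumed here.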

In this section, we derive the asymptotic theory for the M-estimator $\hat{\theta}_T$, given by

$$\hat{\theta}_T = \arg\min_{\theta \in \Theta} l_T(\theta), \quad \text{where} \quad l_T(\theta) = \sum_{t \in \mathfrak{T}} \rho\big(Y_{t+h}, g^q_t(\theta), g^e_t(\theta)\big). \tag{2.20}$$

Classical asymptotic theory for the M-estimator $\hat{\theta}_T$, as given in Patton et al. (2019), states that given certain regularity conditions,

$$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0, \mathcal{T}^{-1} \mathcal{I} \mathcal{T}^{-1}\big), \tag{2.21}$$

where

$$\mathcal{T} = -\mathbb{E}\left[ \nabla g^q_t(\theta_0) \nabla g^q_t(\theta_0)^\top \left( g'\big(g^q_t(\theta_0)\big) + \frac{\phi'\big(g^e_t(\theta_0)\big)}{\alpha} \right) h_t\big(g^q_t(\theta_0)\big) + \nabla g^e_t(\theta_0) \nabla g^e_t(\theta_0)^\top \phi''\big(g^e_t(\theta_0)\big) \right], \quad \text{and} \tag{2.22}$$

$$\mathcal{I} = \mathrm{Var}\left( T^{-1/2} \sum_{t \in \mathfrak{T}} \psi_t(\theta_0) \right), \tag{2.23}$$

with

$$\psi_t(\theta) = \nabla g^q_t(\theta) \left( g'\big(g^q_t(\theta)\big) + \frac{\phi'\big(g^e_t(\theta)\big)}{\alpha} \right) \big( \mathbb{1}\{Y_{t+h} \le g^q_t(\theta)\} - \alpha \big) + \nabla g^e_t(\theta)\, \phi''\big(g^e_t(\theta)\big) \left( g^e_t(\theta) - g^q_t(\theta) + \frac{1}{\alpha} \big(g^q_t(\theta) - Y_{t+h}\big) \mathbb{1}\{Y_{t+h} \le g^q_t(\theta)\} \right). \tag{2.24}$$

The function $\psi_t(\theta)$ corresponds to the gradient of the loss function $\rho_t(\theta)$ almost surely, i.e. on the set $\{\theta \in \Theta : Y_{t+h} \ne g^q_t(\theta)\}$, which has probability one as the distribution $F_t$ is assumed to be absolutely continuous.

Footnote: In order to ensure global convergence of the M-estimator by avoiding local minima, we utilize the implementation of the R package esreg (Bayer and Dimitriadis, 2019) based on the Iterated Local Search (ILS) meta-heuristic of Lourenço et al. (2003). See Section 3 of Dimitriadis and Bayer (2019) for further details.

The classical result in (2.21) requires that $\theta_0$ is in the interior of the parameter space, $\mathrm{int}(\Theta)$. This condition is violated under the null hypothesis for many interesting specifications of the link functions for the considered encompassing tests, as further outlined in Section 2.3.
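
To illustrate how the constrained M-estimator behaves when the true parameter sits on the boundary, the following Python sketch minimizes the average FZ0 loss over convex combination weights in $[0, 1]^2$. It is a deliberately crude stand-in and not the estimation routine of the paper: a grid search replaces a numerical optimizer, the intercepts are fixed at zero, and the data-generating process, the level $\alpha = 0.05$, the seed and the function name `fit_convex_weights` are all assumptions of this sketch.

```python
from statistics import NormalDist

import numpy as np


def fz0_loss(y, q, e, alpha):
    """FZ0 loss; requires strictly negative ES forecasts e."""
    hit = (y <= q).astype(float)
    return -(e - q + (q - y) * hit / alpha) / e + np.log(-e)


def fit_convex_weights(y, q_hat, e_hat, alpha, n_grid=21):
    """Grid-search minimizer of the average FZ0 loss over [0, 1]^2.

    Returns the weights (w_q, w_e) on the first VaR and ES forecast
    and the attained average loss. Intercepts are fixed at zero.
    """
    grid = np.linspace(0.0, 1.0, n_grid)
    best_loss, best_theta = np.inf, None
    for w_q in grid:
        g_q = w_q * q_hat[:, 0] + (1.0 - w_q) * q_hat[:, 1]
        for w_e in grid:
            g_e = w_e * e_hat[:, 0] + (1.0 - w_e) * e_hat[:, 1]
            loss = fz0_loss(y, g_q, g_e, alpha).mean()
            if loss < best_loss:
                best_loss, best_theta = loss, (w_q, w_e)
    return np.array(best_theta), best_loss


# Simulated data: forecast 1 uses the true volatility, forecast 2 a constant.
alpha = 0.05
nd = NormalDist()
z_alpha = nd.inv_cdf(alpha)            # VaR of a standard normal
xi_alpha = -nd.pdf(z_alpha) / alpha    # ES of a standard normal

rng = np.random.default_rng(0)
n = 20_000
sigma = rng.uniform(0.5, 2.0, size=n)
y = sigma * rng.standard_normal(n)
q_hat = np.column_stack([sigma * z_alpha, np.full(n, 1.25 * z_alpha)])
e_hat = np.column_stack([sigma * xi_alpha, np.full(n, 1.25 * xi_alpha)])

theta_hat, loss_hat = fit_convex_weights(y, q_hat, e_hat, alpha)
print(theta_hat, loss_hat)
```

Because the first forecast pair is correctly specified in this design, the true weights equal one and lie on the upper boundary of $[0, 1]^2$. The estimates can therefore only deviate downwards, which is the one-sided behaviour that the boundary theory below formalizes.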
Andrews(1999) derives the non-standard asymptotic distribution of the parameter estimates in ageneral setup, which allows for parameters to be on the boundary and Andrews (2001)extends this result to the asymptotic distribution of the resulting Wald test statistics.Intuitively, the condition θ ∈ int(Θ) implies that parameters to all sides (in a neigh-borhood) of θ are contained in Θ such that the estimator ˆ θ T is allowed to vary to all sidesof θ . The asymptotic normality result in (2.21) formalizes this intuition by quantifyingthis variation as a limiting normal distribution. In contrast, if θ is on the boundary ofΘ, the estimator ˆ θ T cannot attain values to all sides of θ , as values in some directions areexcluded through the boundary. Consequently, in these cases the asymptotic distributionis more complicated and non-standard, which we formalize through deriving asymptotictheory on the boundary in the following. For this, we make the following assumptions. Assumption 1. (A) The parameter space is given as the product space Θ = B × B × ∆ × Ψ, where eachof these four spaces is compact and restricted by individual inequality constraints: • B = (cid:8) β ∈ R p : Γ β β ≤ r β (cid:9) , where Γ β is a l β × p matrix and r β a l β -dimensional vector, • B = (cid:8) β ∈ R p : Γ β β ≤ r β (cid:9) , where Γ β is a l β × p matrix and r β a l β -dimensional vector, • ∆ = (cid:8) δ ∈ R q : Γ δ δ ≤ r δ (cid:9) , where Γ δ is a l δ × q matrix and r δ a l δ -dimensionalvector, • Ψ = (cid:8) ψ ∈ R s : Γ ψ ψ ≤ r ψ (cid:9) , where Γ ψ is a l ψ × s matrix and r ψ a l ψ -dimensionalvector.(B) The process Z t is stationary and β -mixing of size − r/ ( r −

1) for some r > E (cid:2) sup θ ∈ Θ | ρ t ( θ ) | r (cid:3) < ∞ and E [sup θ ∈ Θ || ψ t ( θ ) || r ] < ∞ for all θ ∈ Θand some δ >

0, where r > We state these conditions as high-level moment conditions depending on ρ t ( θ ) and ψ t ( θ ). Thederivations for primitive moment conditions for the semiparametric models for the VaR and ES forspeciﬁc choices of the functions g ( · ) and φ ( · ) are straight-forward, but the resulting conditions are oftenrather convoluted, see e.g. Appendix A of Dimitriadis and Bayer (2019) and Assumption 2 (C) and (D)of Patton et al. (2019). Y t + h given F t , denoted by F t , is absolutely continuous withcontinuous and strictly positive density h t , which is bounded from above almostsurely on the whole support of F t and Lipschitz continuous.(E) The link functions g qt ( θ ) and g et ( θ ) are F t -measurable, twice continuously diﬀeren-tiable in θ on int(Θ) almost surely, and directionally diﬀerentiable on the boundaryof Θ. Moreover, if for some θ , θ ∈ Θ, P (cid:0) g qt ( θ ) = g qt ( θ ) ∩ g et ( θ ) = g et ( θ ) (cid:1) = 1,then θ = θ .(F) The matrices I and T have full rank.(G) The matrix-elements of T governing the dependence of ( β , β , δ ) and of ψ are zero.Apart from the conditions (A) and (G), these assumptions are similar to the onesof Patton et al. (2019) and Dimitriadis and Schnaitmann (2020). However, as we baseour proofs on stochastic equicontinuity and empirical process theory (Andrews, 1994),instead of on the approach of Weiss (1991), some of the conditions diﬀer slightly. Onemain diﬀerence is that we assume the slightly stronger dependence condition of β -mixing(instead of α -mixing) in order to show stochastic equicontinuity of the empirical processbased on the theory of Doukhan et al. (1995). Notice that the parameter space in condition(A) can conveniently be expressed through l inequality constraints using an l × k matrixΓ θ and an l -dimensional vector r θ as Θ = (cid:8) θ ∈ R k : Γ θ θ ≤ r θ (cid:9) . (2.25)This general formulation allows for ﬂexible product spaces of closed real intervals. Theorem 1.

Suppose Assumption 1 holds. Then

$$\sqrt{T}\big(\hat\theta_T - \theta_0\big) \xrightarrow{d} \hat\lambda, \quad \text{where} \quad \hat\lambda = \arg\inf_{\lambda \in \Lambda} (\lambda - Z)^\top \mathcal{T} (\lambda - Z), \tag{2.26}$$

with $Z = \mathcal{T}^{-1} G$, $G \sim N(0, \mathcal{I})$ and $\Lambda = \big\{ \lambda \in \mathbb{R}^k : \Gamma^{(b)}_\theta \lambda \le 0 \big\}$, where $\Gamma^{(b)}_\theta$ denotes the submatrix of $\Gamma_\theta$ from (2.25), which consists of the rows of $\Gamma_\theta$ for which the inequalities $\Gamma_\theta \theta_0 \le r_\theta$ hold as an equality.

The proof of Theorem 1 verifies the necessary assumptions in Andrews (1999) and Andrews (2001). If $\theta_0 \in \operatorname{int}(\Theta)$, none of the inequalities in (2.25) is binding and $\Lambda = \mathbb{R}^k$. In this case, $\hat\lambda = Z$ almost surely in (2.26), which results in the classical asymptotic normality result given in (2.21). In contrast, if $\theta_0$ is on the boundary of $\Theta$, the arg inf in (2.26) results in a non-standard asymptotic distribution of the stabilizing transformation $\sqrt{T}\big(\hat\theta_T - \theta_0\big)$.

[Footnote: In fact, $r_\theta = (r_{\beta_1}, r_{\beta_2}, r_\delta, r_\psi)$ and by expressing $\Gamma_\theta$ as a $4 \times 4$ block matrix, $\Gamma_{\beta_1}$, $\Gamma_{\beta_2}$, $\Gamma_\delta$ and $\Gamma_\psi$ appear on its diagonal with rectangular zero-blocks everywhere else.]

[Footnote: Notice that the notation in Andrews (2001) includes the nuisance parameter $\pi \in \Pi$ which we do not require. Thus, following the comment on p.692 of Andrews (2001), we simply employ a parameter space $\Pi = \{\pi_0\}$ consisting of a single point $\pi_0$, e.g. $\pi_0 = 0$, and suppress the dependency on $\pi$ in the notation.]

Subvector Inference

In the notation of the subvector decomposition of $\theta$ in (2.13), we only test parametric restrictions for the subvector $\beta$, which might be substantially smaller than $\theta$. Thus, the formulation of the arg inf in (2.26) might be unnecessarily complex in these situations. To address this issue, we derive inference for the subvector $\beta = (\beta_1, \beta_2)$ of $\theta$ by following the general approach of Andrews (1999, 2001). In some instances, this considerably simplifies the solution of the arg inf in (2.26).

For this, we define the subvector $\gamma := (\beta, \delta) = (\beta_1, \beta_2, \delta)$, which contains all parameters in $\theta$ but $\psi$, with the intuition that $\psi$ does not have any influence on the asymptotic distribution of $\gamma$ through the nullity restrictions on $\mathcal{T}$ imposed in condition (G) in Assumption 1. We define the following quantities for the subvectors $\beta$ and $\gamma$,

$$Z_\gamma := \mathcal{T}_\gamma^{-1} G_\gamma, \qquad Z_\beta := H Z_\gamma, \qquad \text{with } H := [I_p, 0_{p \times q}], \tag{2.27}$$

where $\mathcal{T}_\gamma$ denotes the upper-left $(p+q) \times (p+q)$ submatrix of $\mathcal{T}$ and $G_\gamma$ the upper $(p+q)$-dimensional subvector of $G$. The following theorem states the asymptotic distribution of the subvector $\beta$.

Theorem 2.

Given Assumption 1, it holds that

$$\sqrt{T}\big(\hat\beta_T - \beta_0\big) \xrightarrow{d} \hat\lambda_\beta, \tag{2.28}$$

where

$$\hat\lambda_\beta = \arg\inf_{\lambda_\beta \in \Lambda_\beta} (\lambda_\beta - Z_\beta)^\top \big(H \mathcal{T}_\gamma^{-1} H^\top\big)^{-1} (\lambda_\beta - Z_\beta), \tag{2.29}$$

and $\Lambda_\beta = \big\{ \lambda_\beta \in \mathbb{R}^p : \Gamma^{(b)}_\beta \lambda_\beta \le 0 \big\}$. The matrix $\Gamma^{(b)}_\beta$ denotes the sub-matrix of $\Gamma_\beta$, which consists of the rows of $\Gamma_\beta$ for which the inequality $\Gamma_\beta \beta_0 \le r_\beta$ holds as an equality.

Theorem 2 shows that the asymptotic distribution of $\beta$ is entirely unaffected by the parameter $\psi$. In contrast, the subvector $\delta$ (which is contained in $\gamma$) influences the asymptotic distribution of $\beta$ through the weighting matrix in the quadratic programming problem in (2.29), even though $\delta$ itself is not on the boundary of the parameter space.

While closed-form representations for the distribution of $\hat\lambda_\beta$ (and of $\hat\lambda$) are only available in special cases (Andrews, 1999), we can conveniently simulate from its distribution by solving a quadratic programming problem. For this, notice that the minimization problem in (2.29) is equivalent to solving

$$\min_{\lambda_\beta \in \mathbb{R}^p} \; \lambda_\beta^\top \big(H \mathcal{T}_\gamma^{-1} H^\top\big)^{-1} \lambda_\beta - 2 Z_\beta^\top \big(H \mathcal{T}_\gamma^{-1} H^\top\big)^{-1} \lambda_\beta \quad \text{subject to} \quad \Gamma^{(b)}_\beta \lambda_\beta \le 0, \tag{2.30}$$

where $\Gamma^{(b)}_\beta$ is given as in Theorem 2 and specifies the binding inequality restrictions of $\Lambda_\beta$. Consequently, we can draw samples from the Gaussian random variable $G_\gamma$, and for each sampled value, we solve the quadratic programming problem given in (2.30). The respective solutions then form a sample of the random variable $\hat\lambda_\beta$, whose distribution is asymptotically equivalent to the one of $\sqrt{T}\big(\hat\beta_T - \beta_0\big)$.

The Wald Test Statistic
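Because the critical values of the Wald test rest on draws of $\hat\lambda_\beta$, the simulation procedure described after Theorem 2 can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation. It assumes a two-dimensional $\beta$ whose second component has one binding upper bound, so that the quadratic program has a closed form, and it uses hypothetical matrices `V` (the limit covariance of $Z_\beta$) and `M` (standing in for $H \mathcal{T}_\gamma^{-1} H^\top$). All function names are illustrative.

```python
import math
import random


def draw_bivariate_normal(cov, rng):
    """Draw one N(0, cov) vector for a 2x2 covariance matrix via its Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (l11 * e1, l21 * e1 + l22 * e2)


def solve_boundary_qp(z, m):
    """Minimise (lam - z)' m^{-1} (lam - z) subject to lam[1] <= 0.

    With a single binding constraint the quadratic program has a closed form:
    an infeasible z is projected onto the line lam[1] = 0 in the m^{-1} metric.
    """
    z1, z2 = z
    if z2 <= 0.0:
        return (z1, z2)
    return (z1 - m[0][1] / m[1][1] * z2, 0.0)


def simulate_wald_limit(v, m, n_draws=20000, seed=1):
    """Simulate draws of lambda_hat and of W = lambda_hat' v^{-1} lambda_hat."""
    rng = random.Random(seed)
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    draws, wald = [], []
    for _ in range(n_draws):
        lam = solve_boundary_qp(draw_bivariate_normal(v, rng), m)
        # Quadratic form using the explicit inverse of the symmetric 2x2 matrix v.
        w = (v[1][1] * lam[0] ** 2 - 2.0 * v[0][1] * lam[0] * lam[1]
             + v[0][0] * lam[1] ** 2) / det
        draws.append(lam)
        wald.append(w)
    return draws, wald


V = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical limit covariance of Z_beta
M = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical H T_gamma^{-1} H' (set equal to V here)
draws, wald = simulate_wald_limit(V, M)
share_on_boundary = sum(lam[1] == 0.0 for lam in draws) / len(draws)
critical_value = sorted(wald)[int(0.95 * len(wald))]
print(share_on_boundary, critical_value)
```

In this setting roughly half of the draws sit exactly on the boundary, and the simulated 5% critical value lies below the corresponding $\chi^2_2$ critical value. With several binding constraints the closed form no longer applies and a general quadratic programming solver would be needed.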

We now consider a Wald test for the null hypothesis $H_0: \beta_0 = \beta^*$ for some $\beta^* \in B$, which may or may not be on the boundary of $B$. We define the Wald test statistic for the null hypothesis $H_0: \beta_0 = \beta^*$ as

$$W_T = T \big(\hat\beta - \beta^*\big)^\top \hat V_T^{-1} \big(\hat\beta - \beta^*\big), \tag{2.31}$$

with weighting matrix $\hat V_T^{-1}$, given by

$$\hat V_T := H \hat{\mathcal{T}}_{T,\gamma}^{-1} \hat{\mathcal{I}}_{T,\gamma} \hat{\mathcal{T}}_{T,\gamma}^{-1} H^\top, \tag{2.32}$$

where $H := [I_p, 0_{p \times q}]$, and where $\hat{\mathcal{T}}_{T,\gamma}$ and $\hat{\mathcal{I}}_{T,\gamma}$ are the upper left $(p+q) \times (p+q)$ submatrices of $\hat{\mathcal{T}}_T$ and $\hat{\mathcal{I}}_T$, respectively, which are consistent estimators for the matrices $\mathcal{T}$ and $\mathcal{I}$. For the matrix $\mathcal{T}$, we use the estimator

$$\hat{\mathcal{T}}_T = \frac{1}{T} \sum_{t \in \mathfrak{T}} \left( \nabla g^q_t(\hat\theta_T) \nabla g^q_t(\hat\theta_T)^\top \left( g\big(g^q_t(\hat\theta_T)\big) + \frac{\phi'\big(g^e_t(\hat\theta_T)\big)}{\alpha} \right) \frac{\mathbb{1}\{|Y_{t+h} - g^q_t(\hat\theta_T)| \le c_T\}}{2 c_T} + \nabla g^e_t(\hat\theta_T) \nabla g^e_t(\hat\theta_T)^\top \phi''\big(g^e_t(\hat\theta_T)\big) \right), \tag{2.33}$$

where the bandwidth $c_T$ satisfies $c_T = o(1)$ and $c_T^{-1} = o(T^{1/2})$. In the specification of $\hat{\mathcal{T}}_T$, the term $\mathbb{1}\{|Y_{t+h} - g^q_t(\hat\theta_T)| \le c_T\} / (2 c_T)$ is a nonparametric estimator of the conditional density $h_t(g^q_t(\theta_0))$, which is also employed in Engle and Manganelli (2004) and Patton et al. (2019).

As we allow for multi-step ahead (aggregate) forecasts in this treatment, we employ a HAC estimator (Newey and West, 1987; Andrews, 1991) for the matrix $\mathcal{I}$,

$$\hat{\mathcal{I}}_T = \hat\Omega_{T,0} + \sum_{j=1}^{m_T} z(j, m_T) \big( \hat\Omega_{T,j} + \hat\Omega_{T,j}^\top \big), \quad \text{where} \quad \hat\Omega_{T,j} = \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T), \tag{2.34}$$

based on some weight functions $z(j, m) \to 1$ as $m \to \infty$ and a bandwidth $m_T = o(T^{1/4})$. Furthermore, $\psi_t(\theta)$ is given in (2.24) and we define $\mathfrak{T}_j := \{ t \in \mathbb{N} : S + j \le t \le S + T - 1 \}$ for all $j \ge 0$. As the functions $\psi_t(\theta)$ are not continuous in $\theta$, we generalize the consistency proofs of the HAC estimator in Newey and West (1987) to nonsmooth objective functions in Lemma 3 in the supplementary material. For the asymptotic distribution of the Wald test statistic, we impose the following assumptions.

Assumption 2.

(H) $m_T \to \infty$ such that $m_T = o(T^{1/4})$ and $z(j, m) \to 1$ as $m \to \infty$.

(I) $c_T = o(1)$ and $c_T^{-1} = o(T^{1/2})$.

(J) The functions $g^q_t(\theta)$ and $g^e_t(\theta)$ are three times continuously differentiable (in $\theta$) and the following moments are finite,

$$\mathbb{E}\Big[ \sup_{\tilde\theta \in U(\theta_0, \delta)} \big\| \nabla_\theta \tilde A_t(\theta) \big\|^{r} \Big], \quad \mathbb{E}\Big[ \sup_{\tilde\theta \in U(\theta_0, \delta)} \big\| \nabla_\theta \tilde B_t(\theta) \big\|^{r} \Big], \quad \mathbb{E}\Big[ \sup_{\tilde\theta \in U(\theta_0, \delta)} \big| \tilde A_t(\tilde\theta) \big|^{r} \times \sup_{\tilde\theta \in U(\theta_0, \delta)} \big\| \nabla_\theta g^q_t(\tilde\theta) \, h_t\big(g^q_t(\tilde\theta)\big) \big\|^{r} \Big],$$

and $\mathbb{E}\big[ \sup_{\theta \in \Theta} \|\psi_t(\theta)\|^{[\ldots](r + \delta)} \big]$, for some $\delta > 0$, where $\tilde A_t(\theta)$ and $\tilde B_t(\theta)$ are given in (S.5.60) and (S.5.61) in the supplementary material.

Conditions (H) and (I) are standard in the literature on HAC estimators and estimating the conditional density, see e.g., Newey and West (1987), Engle and Manganelli (2004) and Patton et al. (2019). The strengthened moment conditions (J) are required to establish stochastic equicontinuity of the discontinuous function $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ for consistency of the HAC estimator.

Theorem 3.

Suppose Assumption 1 and Assumption 2 hold. Then

$$W_T \xrightarrow{d} W := \hat\lambda_\beta^\top V^{-1} \hat\lambda_\beta, \tag{2.35}$$

where $V$ denotes the probability limit of $\hat V_T$ and the $p$-dimensional vector $\hat\lambda_\beta$ is given in Theorem 2.

Using the simulation procedure for the distribution of $\hat\lambda_\beta$ described after Theorem 2, we can easily simulate draws from $\hat\lambda_\beta$ and consequently from the distribution of $W$ by using the formula in (2.35). Hence, we obtain simulated, asymptotic critical values for the Wald test statistic.

We further use a variant of the HAC estimator (Newey and West, 1987; Andrews, 1991), which is specifically designed for the semiparametric VaR and ES models. For most classical HAC estimators, estimation of the contemporaneous variance $\mathbb{E}\big[\psi_t(\theta_0) \psi_t^\top(\theta_0)\big]$ is straightforward by employing a sample counterpart. The major challenge in consistently estimating the matrix $\mathcal{I}$ in (2.23) is then the inclusion of the (sample) autocovariances $\mathbb{E}\big[\psi_t(\theta_0) \psi_{t-j}^\top(\theta_0)\big]$ such that the resulting estimator is positive definite.

However, for the VaR and ES, and especially for extreme quantile levels, estimation of the contemporaneous variance $\mathbb{E}\big[\psi_t(\theta_0) \psi_t^\top(\theta_0)\big]$ is cumbersome in itself as it depends on the conditional truncated variance $\operatorname{Var}_t\big(Y_{t+h} \mid Y_{t+h} \le g^q_t(\theta_0)\big)$, see e.g. Dimitriadis and Bayer (2019). For this, we employ the scl-sp estimator of Dimitriadis and Bayer (2019), which is based on the regularizing assumption that the quantile residuals $u^q_{t+h} = Y_{t+h} - g^q_t(\theta_0)$ follow a location-scale model, conditional on the employed covariates. Imposing a location-scale model might cause some misspecification in the estimation, but it allows us to use all observations to estimate a conditional variance, and then obtain the conditional truncated variance through a transformation formula for location-scale models. We obtain this estimator by replacing the outer product estimator of the contemporaneous variance by the scl-sp estimator,

$$\tilde{\mathcal{I}}_T = \tilde\Omega_{T,0} + \sum_{j=1}^{m_T} z(j, m_T) \big( \hat\Omega_{T,j} + \hat\Omega_{T,j}^\top \big), \tag{2.36}$$

where $\tilde\Omega_{T,0}$ denotes the scl-sp estimator of Dimitriadis and Bayer (2019).

Even though the parametric link functions in (2.5) depend explicitly on the forecasts $\hat q_t$ and $\hat e_t$, it is important to note that the asymptotic theory of this section also holds for general semiparametric models for the VaR and ES in the sense of Patton et al. (2019). Consequently, the asymptotic theory and the proposed Wald test can further be employed for testing (the nullity of) coefficients in the dynamic models of Taylor (2019) and Patton et al. (2019), which are on the boundary of the parameter space under the null hypothesis. Furthermore, the strict ES encompassing test of Dimitriadis and Schnaitmann (2020) allows for testing encompassing of ES forecasts without their accompanying VaR forecasts, which potentially introduces model misspecification in the parametric models. The asymptotic theory for the M-estimator presented here can easily be adapted to the misspecified case by replacing the matrices $\mathcal{T}$ and $\mathcal{I}$ with their misspecification-robust counterparts of Dimitriadis and Schnaitmann (2020), and by replacing the respective steps in the proof of Theorem 1.

Simulations

In this section, we evaluate the empirical properties of the encompassing tests based on the three different link functions specified in Section 2.3, and on the asymptotic theory of Section 2.4. Section 3.1 numerically illustrates the effect testing on the boundary has on the asymptotic distribution of the parameters. Subsequently, we analyze the size and power properties of the encompassing tests in Section 3.2 for one-step ahead forecasts and in Section 3.3 for multi-step ahead and aggregate forecasts.

We illustrate how true parameters on the boundary of the parameter space affect the asymptotic distribution of the M-estimator through simulations. For this, we simulate data according to the standard GARCH model with Gaussian innovations described in (3.1) in Section 3.2 with an out-of-sample window length of $T = 2500$. We estimate the parameters of the three considered link functions for the joint encompassing test that tests whether forecasts stemming from the (true) GARCH model encompass forecasts from the GJR-GARCH model given in (3.2).

Figure 1 illustrates the distribution of the parameter estimates by plotting histograms over 10000 simulation replications for the intercept and slope parameters of the respective ES link functions $g^e_t$, whose true values equal zero and one, respectively, throughout all link functions. For the (unrestricted) linear link function, all true parameters are in the interior of the parameter space and we find that the histograms for both parameters closely approximate the asymptotic normal distribution, derived and employed by Patton et al. (2019) and Dimitriadis and Schnaitmann (2020). In contrast, for the convex and no-crossing link functions, the slope parameter is bounded between zero and one, i.e. its true value of one is on the boundary of the parameter space. This results in the non-standard distributions illustrated by the histograms for the slope parameters in the second and third plot in the lower row of Figure 1. The histograms approximate the asymptotic distribution consisting of a mixture of a point mass at one and a half-normal distribution, which is considerably different from asymptotic normality.
This behavior directly carries over to the resulting asymptotic distributions of the Wald test statistics, which substantiates the necessity of the non-standard asymptotic theory on the boundary presented in Section 2.4.

While this behavior is not unexpected for the parameters on the boundary, the asymptotic distribution of the intercept parameters, which themselves are in the interior of the parameter space, is also affected due to the joint estimation. For instance, we observe a slight skewness in the distribution of the intercept parameter of the convex link function contrasting the Gaussian distribution of the linear intercept parameter.

[Figure 1 panels: Linear Link & ES Intercept, Convex Link & ES Intercept, No-Crossing Link & ES Intercept; Linear Link & ES Slope, Convex Link & ES Slope, No-Crossing Link & ES Slope. x-axis: Parameter Value; y-axis: Density.]

Figure 1: Illustration of the (asymptotic) distributions of the parameter estimates of the ES-specific intercept and slope parameter corresponding to the first ES forecast $\hat e_{1,t}$ for the three considered link functions.

In this section, we investigate the empirical performance of our new encompassing tests for one-step ahead forecasts. For this, we consider encompassing of VaR and ES forecasts stemming from a standard GARCH and a GJR-GARCH model (Bollerslev, 1986; Glosten et al., 1993), which are given by $r_{j,t+1} = \sigma_{j,t+1} u_{t+1}$, for $j = 1, 2$, where the two distinct volatility specifications are given by

$$\sigma_{1,t+1}^2 = 0.04 + 0.[\ldots]\, r_{1,t}^2 + 0.[\ldots]\, \sigma_{1,t}^2, \quad \text{and} \tag{3.1}$$

$$\sigma_{2,t+1}^2 = 0.04 + \big( 0.05 + 0.[\ldots] \cdot \mathbb{1}\{ r_{2,t} \le 0 \} \big)\, r_{2,t}^2 + 0.[\ldots]\, \sigma_{2,t}^2. \tag{3.2}$$

Furthermore, we employ two different residual distributions,

$$(a) \;\; u_{t+1} \overset{iid}{\sim} N(0, 1) \quad \text{and} \quad (b) \;\; u_{t+1} \overset{iid}{\sim} t(0.[\ldots], [\ldots], [\ldots]), \tag{3.3}$$

where the latter denotes a skewed $t$-distribution, parameterized as in Fernández and Steel (1998) and Giot and Laurent (2003), with zero mean, unit variance, a skewness parameter of 0.[…] […] $\hat q_{j,t} = z_\alpha \sigma_{j,t+1}$ and $\hat e_{j,t} = \xi_\alpha \sigma_{j,t+1}$ for $j = 1, 2$, where $z_\alpha$ and $\xi_\alpha$ are the $\alpha$-quantile and $\alpha$-ES of the standard normal and the skewed $t$-distribution, respectively. For both distributions, we simulate $Y_{t+1} = r_{t+1} = \big( (1 - \pi) \sigma_{1,t+1} + \pi \sigma_{2,t+1} \big) u_{t+1}$ for 11 equally spaced values of $\pi \in [0, 1]$ […] $u_{t+1}$ is given as in $(a)$ and $(b)$ in (3.3).

We consider encompassing tests comparing the respective GARCH and GJR-GARCH volatility specifications, where we analyze the models based on Gaussian and $t$-distributed residuals in separate simulation setups. For each forecast pair, we test two null hypotheses: the first tests whether the first forecast encompasses the second, indicated by $H_0^{(1)}$, whereas the second tests the reverse, i.e. that forecast two encompasses forecast one, indicated by $H_0^{(2)}$. These two null hypotheses correspond to the cases $\pi = 0$ and $\pi = 1$ in the simulation design above. For all intermediate values of $\pi \in (0, 1)$ […] scl-sp covariance estimator of Dimitriadis and Bayer (2019) described in Section 2.4. All following results are based on 2000 Monte Carlo replications.

Table 1 reports the empirical test sizes of the joint VaR and ES and the auxiliary ES encompassing tests based on the three link functions described in Section 2.3 for a nominal size of 5%. For this, we consider the two GARCH specifications described in (3.1) and (3.2) for various out-of-sample sizes ranging from $T = 250$ to $T = 5000$. We find that the tests based on the convex and no-crossing link functions outperform the ones built on the linear link function, especially for smaller out-of-sample sizes: the tests based on the linear link function are in some instances severely oversized, while the other two link functions exhibit empirical sizes generally below 10%, even for the smallest of the considered sample sizes. Note for this that a sample size of $T = 250$ is considered to be very small for VaR and ES forecasts at a probability level of $\alpha = 2.5\%$.
[…] $t$ innovations. As the joint test includes testing of the quantile parameters, the asymptotic covariance matrix additionally contains the density quantile function $h_t(g^q_t(\theta_0))$ in (2.22), which is particularly challenging to estimate for small probability levels (see e.g. Koenker and Bassett, 1978; Koenker, 2005; Dimitriadis and Bayer, 2019).

[Footnote: Regarding the time index, notice that $\hat q_{j,t}$ and $\hat e_{j,t}$ represent $\mathcal{F}_t$-measurable forecasts for the return $r_{j,t+1}$, while $\sigma_{j,t+1}$ is equivalently based on time $t$ information and corresponds to the conditional volatility of $r_{j,t+1}$.]

[Footnote: Section S.2 in the supplemental material shows that the results for employing a HAC estimator are qualitatively equivalent for one-step ahead forecasts.]

Table 1: Empirical Test Sizes for One-Step Ahead Forecasts.

                      Normal innovations                  Skewed-t innovations
               H_0^(1)           H_0^(2)            H_0^(1)           H_0^(2)
     T      VaR ES  Aux ES    VaR ES  Aux ES     VaR ES  Aux ES    VaR ES  Aux ES

Linear link function
   250       21.45   11.20     19.65   10.90      31.30   16.10     31.15   16.30
   500       16.60    8.60     15.30    9.10      25.40   10.60     24.25   10.45
  1000       12.95    6.70     11.80    7.05      22.55    7.35     20.25    8.30
  2500       11.35    6.15      9.70    5.05      16.80    5.45     15.65    4.90
  5000        8.65    5.00      8.45    5.30      14.35    5.10     15.00    5.40

Convex link function
   250       10.35    8.70      7.35    6.20      13.30   10.45     10.50    8.10
   500        8.10    7.50      5.35    5.35      11.16    8.91      7.80    6.90
  1000        7.53    6.82      4.75    4.40       9.26    7.71      6.36    4.56
  2500        5.66    5.66      4.10    3.90       7.14    5.78      5.21    3.76
  5000        7.02    6.77      4.65    4.00       5.56    4.31      6.16    3.91

No-crossing link function
   250        3.90    9.15      2.65    5.20       7.45   10.95      8.45    5.25
   500        2.75    8.90      4.95    4.75       7.95   10.80     10.51    4.50
  1000        2.75    9.05      7.30    3.60       9.70    9.35     12.76    3.90
  2500        4.55    6.60      8.55    3.85       9.76    7.56      9.80    3.05
  5000        4.96    6.76      7.35    3.90       8.47    5.96      9.35    3.75

Notes: This table reports the empirical sizes (in %) of the encompassing tests with a nominal size of 5% for one-step ahead forecasts. For this, we consider the two DGPs based on different GARCH specifications, the three link functions, the joint VaR and ES (VaR ES) and auxiliary ES (Aux ES) tests and both encompassing null hypotheses. The columns denoted by “Normal innovations” contain results for the GARCH(1,1) and GJR-GARCH(1,1) in (3.1) and (3.2) with normal innovations, whereas those labeled “Skewed-t innovations” report results for the skewed-$t$ distributed innovations.

Figure 2 shows size-adjusted power curves for the joint VaR and ES and the auxiliary ES tests based on the three link functions for a nominal significance level of 5% and for the various settings described above. For computing the size-adjusted power, we follow the approach of Davidson and MacKinnon (1998). For an increasing degree of misspecification through $\pi$, we find increasing power throughout all considered tests and processes. Both the convex and no-crossing link function specifications exhibit better (size-adjusted) power than the linear link function throughout all considered processes, sample sizes and values of $\pi$. While the convex link function exhibits a slightly superior […]

[Footnote: Figure S.1 in the supplementary material shows the corresponding raw power of the tests.]

[Figure 2 panels: Normal innovations, Joint VaR and ES Test; Normal innovations, Auxiliary ES Test; Skewed-t innovations, Joint VaR and ES Test; Skewed-t innovations, Auxiliary ES Test. Rows: sample size $T$. x-axis: $\pi$; y-axis: rejection rate. Legend: Tested Hypothesis; Link Function (linear, convex, no crossing).]

Figure 2: This figure shows size-adjusted power curves for the joint VaR and ES and the auxiliary ES encompassing tests with a nominal size of 5%. The employed link functions are indicated with the line color and symbol shape while the line type refers to the tested null hypothesis. The plot rows depict different sample sizes while the plot columns show results for the two innovation distributions described in (3.1) - (3.3) and for the joint and the auxiliary tests. An ideal test exhibits a rejection rate of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

In this section, we consider multi-step ahead and multi-step aggregate forecasts for the VaR and ES. For any $h > 1$, we set $Y_{j,t+h} = r_{j,t+h}$ for multi-step ahead forecasts, and $Y_{j,t+h} = \sum_{s=1}^{h} r_{j,t+s}$ for multi-step aggregate forecasts, where the returns $r_{j,t+h}$ are simulated from the respective GARCH specifications in (3.1) - (3.3) for $j = 1, 2$. In order to simulate returns which follow a (probabilistic) convex combination of these two processes, we simulate Bernoulli draws $\pi_{t+h} \sim \operatorname{Bern}(\pi)$ for 11 equally spaced values of $\pi \in [0, 1]$ […] $Y_{t+h} = (1 - \pi_{t+h}) Y_{1,t+h} + \pi_{t+h} Y_{2,t+h}$.

Wong and So (2003) and Lönnbark (2016) among others illustrate that even though the conditional variance of multi-step ahead (aggregate) forecasts for (quadratic) GARCH models is easily tractable, the entire conditional distribution is not. This implies that multi-step ahead (aggregate) VaR and ES forecasts cannot be obtained equivalently to one-step ahead forecasts by simply multiplying their conditional multi-step ahead (aggregate) volatilities with the quantile or ES of the residual distribution. Consequently, we employ a simulation method proposed by Wong and So (2003) which yields very accurate approximations of the true VaR and ES forecasts: for all out-of-sample time points $t \in \mathfrak{T}$, we simulate $R = 10000$ sample paths from the respective GARCH model for $h$ days into the future. To obtain multi-period ahead (aggregate) VaR and ES forecasts, we then take the (pointwise) empirical quantile and ES over the $R$ sample paths of the simulated $h$-period ahead (aggregated) returns.

Here, we restrict attention to the DGP based on Gaussian residuals, the convex link function and the joint VaR and ES encompassing test, as the $t$-distributed residuals and the auxiliary tests perform comparably in the previous section. However, we consider $h$-step ahead and $h$-step aggregate VaR and ES forecasts with forecasting horizons of $h = 1, 2, 5, 10$ […]. We employ a HAC estimator with the embedded scl-sp estimator of Dimitriadis and Bayer (2019) for the contemporaneous variance as described in Section 2.4, as in particular the multi-period aggregate forecasts exhibit a correlated behavior due to their inherently overlapping nature.
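The path-simulation idea attributed to Wong and So (2003) can be sketched as follows for a single forecast origin. This is a simplified illustration under stated assumptions rather than the authors' code: it uses a plain GARCH(1,1) recursion with Gaussian innovations, the coefficient values in the example call are illustrative placeholders, and the function name and arguments are hypothetical.

```python
import math
import random


def simulate_multistep_var_es(sigma2_next, omega, a, b, h, alpha=0.025,
                              n_paths=10000, aggregate=False, seed=0):
    """Approximate the h-step ahead (or h-step aggregate) VaR and ES by simulation.

    sigma2_next is the one-step ahead conditional variance at the forecast origin.
    Returns the empirical alpha-quantile of the simulated outcomes (VaR) and the
    average of the outcomes at or below that quantile (ES).
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_paths):
        sigma2 = sigma2_next
        total = 0.0
        r = 0.0
        for _ in range(h):
            r = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
            total += r
            # GARCH(1,1) variance recursion for the next day on this path.
            sigma2 = omega + a * r * r + b * sigma2
        outcomes.append(total if aggregate else r)
    outcomes.sort()
    k = max(1, int(math.floor(alpha * n_paths)))
    var = outcomes[k - 1]
    es = sum(outcomes[:k]) / k
    return var, es


# Illustrative coefficients only; they are assumptions made for this sketch.
var_1, es_1 = simulate_multistep_var_es(1.0, 0.04, 0.10, 0.85, h=1)
var_10, es_10 = simulate_multistep_var_es(1.0, 0.04, 0.10, 0.85, h=10, aggregate=True)
print(var_1, es_1, var_10, es_10)
```

For $h = 1$ with Gaussian innovations and unit variance, the simulated values should be close to the analytical 2.5% quantile and ES of the standard normal distribution, which offers a quick sanity check before moving to longer horizons.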
In Section S.2 in the supplementary material, we discuss four different covariance estimators and show that the HAC estimator augmented with the scl-sp estimator performs best.

Table 2: Empirical Test Sizes for Multi-Step Ahead and Aggregate Forecasts.

                     H_0^(1)                            H_0^(2)
     T      h=1     h=2     h=5    h=10        h=1     h=2     h=5    h=10

h-step ahead forecasts
   250     10.85   11.87   10.16    7.95       9.17    9.18    8.27    6.35
   500      8.41    8.91   13.67   11.43       7.01    6.12    7.59    8.43
  1000      6.83    6.70   10.87   13.15       3.51    4.62    5.52    6.51
  2500      4.80    4.80    6.80   11.45       4.21    4.52    5.92    6.83
  5000      3.80    4.12    5.02    9.80       3.90    4.30    6.61    5.81

h-step aggregate forecasts
   250     11.46   13.85   22.78   31.54       8.98   11.08   20.58   26.22
   500      8.41   13.02   20.50   30.74       7.01   10.10   18.07   23.01
  1000      6.63    9.04   16.72   26.15       3.61    6.46   14.04   18.69
  2500      4.80    6.73   10.43   19.17       4.11    4.52    9.28   12.24
  5000      4.10    4.42    8.52   14.36       4.40    4.02    7.02    8.72

Notes: This table shows test sizes (in %) for the joint VaR and ES forecast encompassing test based on the convex link function with a nominal size of 5%. We simulate data from the two GARCH specifications in (3.1) - (3.3) with normal innovations and consider $h$-step ahead and $h$-step aggregate forecasts for $h = 1, 2, 5, 10$.

Table 2 reports the test sizes and Figure 3 presents size-adjusted power plots of the joint VaR and ES encompassing test for multi-step ahead and multi-step aggregate forecasts for a nominal significance level of 5%. The encompassing tests for $h$-step ahead forecasts are well-sized, especially for larger sample sizes and for small horizons $h$. The empirical sizes deteriorate slightly with an increasing forecast horizon $h$. While the general behavior is similar for $h$-step aggregate forecasts, these tests suffer considerably more from an increase of the forecasting horizon $h$. The inferior performance of multi-period aggregate forecasts is not surprising given that the moment conditions of the aggregate forecasts are heavily correlated due to the overlapping definition of the aggregate forecasts.

Concerning the size-adjusted power, depicted in Figure 3, we observe similar patterns. For $h = 1, 2$, the size-adjusted power increases substantially for an increasing degree of misspecification for all considered settings. For longer forecast horizons $h = 5, 10$, the test power is generally lower for both forecast types. As before, the encompassing tests for $h$-step ahead forecasts exhibit better properties than for $h$-step aggregate forecasts. This can again be explained by the inherent correlation in $h$-step aggregate forecasts which ne[…]

[Footnote: The size-adjusted power plots for $h = 10$ and $T \in \{[\ldots]\}$ in Figure 3 exhibit test sizes under the null hypotheses slightly above 5%. These are an artifact stemming from the fact that slightly more than 5% of the simulated $p$-values are exactly zero, rendering an exact size-adjustment in the sense of Davidson and MacKinnon (1998) infeasible.]

[Footnote: Figure S.2 in the supplementary material shows the corresponding raw power. Table S.2, Figure S.5 and Figure S.6 in the supplementary material show test results for the auxiliary ES encompassing test.]

[Figure 3 panels: columns $h = 1$, $h = 2$, $h = 5$, $h = 10$; rows: sample size $T$. x-axis: $\pi$; y-axis: rejection rate. Legend: Tested Hypothesis; Forecast Type ($h$-step ahead, $h$-step aggregate).]

Figure 3: This figure shows size-adjusted power curves for the joint VaR and ES encompassing test with a nominal size of 5%, for $h$-step ahead and aggregate forecasts indicated with different colors, and for the two tested null hypotheses indicated with different line types. The plot rows depict different sample sizes, while the plot columns refer to different forecast horizons $h$. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$. Note that we use a Bernoulli draw based combination method in this section as opposed to the variance combination in Section 3.2 and hence, the results of the one-step ahead forecasts are not necessarily identical.

[…] $h$, a larger out-of-sample period is required to obtain encompassing tests with reliable test decisions. Furthermore, small sample sizes paired with large forecast horizons (e.g., $T = 250$ and $h = 10$) yield almost flat (size-adjusted) power curves, which implies that the test becomes unreliable and that practical applications should be interpreted very carefully in these scenarios. This negative result is remarkable concerning the planned evaluation of 10-day ahead aggregate ES forecasts (Basel Committee, 2019, p.89).
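The size adjustment behind the power curves, and the reason it can fail when many simulated $p$-values are exactly zero, can be illustrated with a short sketch. It implements one common reading of size adjustment in the spirit of Davidson and MacKinnon (1998): the critical $p$-value is taken as the empirical quantile of the $p$-values simulated under the null. The function name and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
def size_adjusted_power(p_null, p_alt, level=0.05):
    """Return (adjusted critical p-value, achieved size, size-adjusted power).

    p_null: simulated p-values under the null hypothesis.
    p_alt:  simulated p-values under an alternative.
    """
    ordered = sorted(p_null)
    k = max(1, int(level * len(ordered)))
    crit = ordered[k - 1]  # empirical level-quantile of the null p-values
    size = sum(p <= crit for p in p_null) / len(p_null)
    power = sum(p <= crit for p in p_alt) / len(p_alt)
    return crit, size, power


# Toy example without ties: the adjusted test has exactly the nominal size.
p_null_clean = [(i + 1) / 100 for i in range(100)]
# Toy example in which 8% of the null p-values are exactly zero.
p_null_ties = [0.0] * 8 + [i / 100 for i in range(1, 93)]
p_alt = [0.0] * 50 + [0.5] * 50

print(size_adjusted_power(p_null_clean, p_alt))
print(size_adjusted_power(p_null_ties, p_alt))
```

In the second toy example the smallest attainable critical value is zero, yet more than 5% of the null $p$-values already equal zero, so the achieved size stays above the nominal level regardless of the adjustment.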

This section empirically illustrates the usefulness of the proposed encompassing tests by comparing alternative VaR and ES forecasts for daily S&P 500 returns from August 4, 2000 to June 19, 2020, including a total of 5000 daily observations. We conduct a rolling window forecasting scheme with $S = 2000$ estimation observations, and $T = 3000$ evaluation points starting on July 22, 2008. We follow the Basel Accords (Basel Committee, 2017, 2019) and employ $\alpha = 2.5\%$ […] $t$ distributed innovations (GJR-ST), (iv) the GARCH and GJR-GARCH models with asymmetric Laplace innovations (GARCH-AL and GJR-AL) and the same models with a time varying shape parameter (GARCH-AL-TVP and GJR-AL-TVP) of Chen et al. (2012), (v) the symmetric absolute value (SAV-) and asymmetric slope (AS-) CAViaR-ES models of Taylor (2019), and (vi) the one factor GAS model (GAS-1F) of Patton et al. (2019). Details for the risk models of Chen et al. (2012), Taylor (2019) and Patton et al. (2019) are given in Section S.3 in the supplementary material and an additional absolute evaluation in the form of backtests for these models is given in Section S.4 in the supplementary material.

[Footnote: Along these lines, Harvey et al. (2017) notice similar small-sample issues for forecast encompassing tests and tests for equal predictive ability (Diebold and Mariano, 1995) for multi-step ahead forecasts.]

Table 3: Empirical Encompassing Test Results for One-Step Ahead Forecasts

                     Joint VaR and ES Test          Auxiliary ES Test
Models             E'ing  E'ed  Comb  Incon      E'ing  E'ed  Comb  Incon     Avg. Weights
GJR-ST               10     0     0     0          9      0     0     1       (0.82, 0.98)
GJR-AL-TVP            6     1     3     0          8      1     0     1       (0.86, 0.71)
AS-CAViaR-ES          5     3     1     1          6      0     0     4       (0.56, 0.66)
GARCH-AL              4     1     5     0          2      2     2     4       (0.62, 0.63)
GARCH-AL-TVP          3     1     5     1          2      3     3     2       (0.47, 0.60)
GJR-AL-CP             3     1     3     3          6      3     0     1       (0.47, 0.68)
GARCH-N               2     6     1     1          2      6     0     2       (0.46, 0.35)
SAV-CAViaR-ES         1     4     3     2          2      4     1     3       (0.45, 0.39)
RiskMetrics           1     5     1     3          1      4     2     3       (0.39, 0.21)
GAS-1F                1     7     1     1          1      8     0     1       (0.27, 0.20)
Historical Sim        0     7     3     0          0      8     2     0       (0.09, 0.06)

Notes: This table reports a summary of the test results of the joint VaR and ES and the auxiliary ES encompassing tests based on the convex link function for one-step ahead forecasts. Entries for “E’ing” represent the number of occurrences (out of 10) that a row-heading model encompasses a competing model. Similarly, “E’ed” represents the frequency that the row-heading model is encompassed, “Comb” that neither model encompasses its competitor, and “Incon” that both models encompass each other. The column “Avg. Weights” shows the estimated convex combination weights ( θ , θ ), averaged over the 10 estimates for each model.

In this subsection, we analyze pairwise encompassing for one-step ahead VaR and ES forecasts using the encompassing tests based on the convex link functions. For each model pair, we estimate the combination weights and test both null hypotheses, i.e. that model one encompasses model two and vice versa. We obtain simulated critical values for the test through Theorem 3 and by employing the scl-sp estimator of Dimitriadis and Bayer (2019). Given the simulation results of Section S.2 in the supplementary material, we do not consider estimation of HAC-terms in the covariance for one-step ahead forecasts.

We report the summarized results for the joint VaR and ES and the auxiliary ES encompassing tests with a significance level of 5% for all pairwise combinations of the eleven risk models in Table 3, where the models (in the table rows) are sorted according to their encompassing performance. Out of the ten model combinations each individual model is subject to, we report how often both null hypotheses are rejected (denoted by “Combination” or “Comb”), neither is rejected (“Inconclusive” or “Incon”), only the first one is rejected (“Encompassed” or “E’ed”), and only the second one is rejected (“Encompassing” or “E’ing”). Notice that the “Combination” column is based on rejecting both null hypotheses, which constitutes a multiple testing problem, and the results have to be interpreted at a Bonferroni corrected significance level of 10%, while each individual test is based on a nominal significance level of 5%.

We find that the GJR-GARCH models with Skew-t and asymmetric Laplace innovations achieve the best forecasting performance among the competing models. Interestingly, the CAViaR-ES models of Taylor (2019) and the GAS-1F model of Patton et al. (2019), which are specifically developed for jointly forecasting VaR and ES, generally do not perform as well as the GARCH specifications. As expected, the RiskMetrics and Historical Simulation models perform worst. Furthermore, we find many instances of rejections of both encompassing hypotheses, implying that a forecast combination via the estimated encompassing weights is superior to both individual models. This result justifies the usefulness of the proposed encompassing tests, and is in line with the arguments for forecast combinations of Giacomini and Komunjer (2005), Timmermann (2006), Taylor (2020) and Dimitriadis and Schnaitmann (2020).
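The bookkeeping behind the four outcome columns of Table 3 can be made explicit with a small sketch. It assumes that each model pair comes with two $p$-values, one per encompassing null hypothesis, and it mirrors the labelling rules stated in the text. The model names and $p$-values below are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter


def classify_pair(p_first, p_second, level=0.05):
    """Label a model pair from the perspective of the first model."""
    reject_first = p_first < level    # H0(1): model one encompasses model two
    reject_second = p_second < level  # H0(2): model two encompasses model one
    if reject_first and reject_second:
        return "Comb"
    if not reject_first and not reject_second:
        return "Incon"
    return "E'ed" if reject_first else "E'ing"


def summarise(pvalues, level=0.05):
    """Count the four outcomes per model over all pairwise comparisons."""
    flip = {"E'ing": "E'ed", "E'ed": "E'ing", "Comb": "Comb", "Incon": "Incon"}
    table = {}
    for (model_a, model_b), (p_a, p_b) in pvalues.items():
        label = classify_pair(p_a, p_b, level)
        table.setdefault(model_a, Counter())[label] += 1
        # The second model sees the mirrored outcome of the same pair.
        table.setdefault(model_b, Counter())[flip[label]] += 1
    return table


# Hypothetical p-values: (p for "A encompasses B", p for "B encompasses A").
example = {
    ("A", "B"): (0.40, 0.01),
    ("A", "C"): (0.02, 0.03),
    ("B", "C"): (0.30, 0.60),
}
result = summarise(example)
print(result)
```

Since a "Comb" entry requires two rejections at once, its overall error rate is bounded by the sum of the two individual levels, which is the Bonferroni argument referred to in the text.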

In this subsection, we apply the proposed encompassing tests to 10-day ahead and aggregate VaR and ES forecasts. Note that 10-day aggregate VaR and ES forecasts are required by the Basel Accords for minimal capital requirement and risk weighted assets (Basel Committee, 2019, 2020). As it is unclear how to obtain multi-step ahead forecasts from the CAViaR-ES models of Taylor (2019) and the GAS-1F model of Patton et al. (2019), we reduce the set of evaluation models to the seven members of the GARCH family. For these models, we obtain multi-step ahead and multi-step aggregate VaR and ES forecasts through the simulation method of Wong and So (2003), further described in Section 3.3. Such simulation-based forecasting is necessary as the conditional distribution of multi-step returns generally differs from the imposed innovation distribution of the model, and thus, VaR and ES forecasts cannot be obtained through classical location-scale formulas as for one-step ahead forecasts. Based on the results of Section S.2 in the supplementary material, we use a HAC covariance estimator (Newey and West, 1987), augmented with the scl-sp estimator of Dimitriadis and Bayer (2019) for the contemporaneous variance component, to perform the encompassing tests for multi-step ahead and aggregate forecasts.

Table 4 reports the summarized encompassing test results for 10-step ahead and aggregate forecasts.

Table S.4 in the supplementary material reports the correlations of the VaR and ES forecasts and Table S.5 additionally reports the estimated (convex) combination weights together with the test decisions for the combinations of the six best performing models, chosen by the absolute evaluation in Table S.9.

Table 4: Encompassing Test Results for 10-Step Ahead and Aggregate Forecasts

Notes: This table reports a summary of the test results of the joint VaR and ES and the auxiliary ES encompassing tests based on the convex link function, for 10-step ahead forecasts in the upper panel and for 10-step aggregate forecasts in the lower panel. Entries for "E'ing" represent the number of occurrences (out of 6) that a row-heading model encompasses a competing model. Similarly, "E'ed" represents the frequency with which the row-heading model is encompassed, "Comb" that neither model encompasses its competitor, and "Incon" that both models encompass each other. The column "Avg. Weights" shows the estimated convex combination weights $(\theta_1, \theta_2)$, averaged over the 10 estimates for each model.

The test results show that for both 10-step ahead and aggregate forecasts, the best performing model is the GJR-GARCH model with asymmetric Laplace innovations and a time-varying shape parameter. We find almost no cases of double rejections, i.e. forecast combinations are not (significantly) preferred over the stand-alone models. This may be an artifact of the lower power for multi-step forecasts as illustrated in Section 3.3, or of the high(er) correlations of the forecasts, reported in Table S.6 in the supplementary material.

Overall, the empirical results show that a model specified with an asymmetric volatility process and a skewed error distribution, such as the GJR-ST model, outperforms the competing models considered in this paper for one-step ahead VaR and ES forecasts. Moreover, models based on an asymmetric innovation distribution with time-varying parameters, such as the GJR-AL-TVP and GARCH-AL-TVP models, perform better than the other competing models. Note that the time-varying scale parameter of the asymmetric Laplace distribution produces both time-varying skewness and kurtosis for the innovation distribution. We find that specifying time-varying higher moments for a risk model substantially improves the model forecasting performance in both multi-step ahead and aggregate risk forecasts, much more than in one-step ahead forecasts.
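To illustrate why multi-step forecasts call for simulation, the following Python sketch draws return paths from a GARCH(1,1) model and reads off the 10-step ahead and 10-step aggregate VaR and ES as empirical tail statistics. It is a generic Monte Carlo sketch under stated assumptions, not necessarily the exact procedure of Wong and So (2003): the Gaussian innovations, the parameter values, the starting variance, the 2.5% level and all function names are assumptions chosen for illustration only.

```python
import math
import random


def simulate_garch_paths(omega, alpha_g, beta_g, sigma2_next, horizon, n_paths, seed=0):
    """Simulate return paths (r_{t+1}, ..., r_{t+h}) from a Gaussian GARCH(1,1).

    sigma2_next is the conditional variance of r_{t+1}, known at time t.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        sigma2 = sigma2_next
        path = []
        for _ in range(horizon):
            r = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
            path.append(r)
            # GARCH(1,1) variance recursion for the next period
            sigma2 = omega + alpha_g * r * r + beta_g * sigma2
        paths.append(path)
    return paths


def empirical_var_es(sample, level):
    """Empirical level-quantile (VaR) and mean of the observations at or below it (ES)."""
    ordered = sorted(sample)
    k = max(1, math.ceil(level * len(ordered)))
    tail = ordered[:k]
    return tail[-1], sum(tail) / k


# Hypothetical parameter values, for illustration only.
paths = simulate_garch_paths(omega=0.01, alpha_g=0.10, beta_g=0.85,
                             sigma2_next=0.2, horizon=10, n_paths=20000)
h_step = [p[-1] for p in paths]        # 10-step ahead return
aggregate = [sum(p) for p in paths]    # 10-step aggregate return
var_h, es_h = empirical_var_es(h_step, 0.025)
var_agg, es_agg = empirical_var_es(aggregate, 0.025)
```

Replacing the Gaussian draw by draws from a skewed or heavy-tailed innovation distribution would give the analogous forecasts for the other members of the GARCH family.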

This article proposes joint encompassing tests which compare one-step and multi-step VaR and ES forecasts based on general semiparametric forecast combination methods (link functions) for the VaR and ES. While unrestricted linear methods are often employed in encompassing tests for functionals like the mean and quantiles (the VaR), e.g. in Hendry and Richard (1982) and Giacomini and Komunjer (2005), different combination methods are of particular interest for the ES. For example, our no-crossing link specification theoretically circumvents crossings of the predicted VaR and ES, which is conceptually desirable but not straightforward to achieve (Taylor, 2020).

Our employed link functions imply that some of the tested parameters are on the boundary of the parameter space under the null hypothesis, which necessitates non-standard asymptotic theory. Based on the general framework of Andrews (1999, 2001), we provide such novel asymptotic theory for the proposed encompassing tests and for the accompanying Wald test statistics, which allows for inference and testing on the boundary. Our simulations show that the proposed VaR and ES forecast encompassing tests based on the convex and no-crossing link functions exhibit superior size and power prop-

Table S.7 and Table S.8 in the supplementary material report the detailed test results. Table S.6 additionally reports correlations for the 10-day ahead and aggregate VaR and ES forecasts.

Acknowledgments

Our work has been supported by the University of Hohenheim, the Klaus Tschira Foundation and the University of Konstanz. A previous version of this paper circulated with the title "A Regression-based Joint Encompassing Test for Value-at-Risk and Expected Shortfall Forecasts".

References

Acerbi, C. and Tasche, D. (2002). On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487–1503.

Andrews, D. W. K. (1991). Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. Econometrica, 59(3):817–858.

Andrews, D. W. K. (1992). Generic uniform convergence. Econometric Theory, 8(2):241–257.

Andrews, D. W. K. (1994). Empirical Process Methods in Econometrics. In Engle, R. and McFadden, D., editors, Handbook of Econometrics, volume 4, chapter 37, pages 2247–2294. Elsevier.

Andrews, D. W. K. (1997). Estimation when a parameter is on a boundary: Theory and applications. Cowles Foundation Discussion Paper No. 1153.

Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary. Econometrica, 67(6):1341–1383.

Andrews, D. W. K. (2001). Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica, 69(3):683–734.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1999). Coherent Measures of Risk. Mathematical Finance, 9(3):203–228.

Barendse, S. (2020). Efficiently Weighted Estimation of Tail and Interquartile Expectations. Working Paper, available at https://drive.google.com/file/d/1nI0QAWbM_VchAZDVg79p2vJcKoCrQB8o/view.

Barendse, S., Kole, E., and van Dijk, D. J. (2019). Backtesting value-at-risk and expected shortfall in the presence of estimation error. Tinbergen Institute Discussion Paper 2019-058/III, available at SSRN: https://ssrn.com/abstract=3439309.

Basel Committee (2013). Fundamental review of the trading book: A revised market risk framework. Technical report, Bank for International Settlements.

Basel Committee (2016). Minimum capital requirements for Market Risk. Technical report, Bank for International Settlements.

Basel Committee (2017). Pillar 3 disclosure requirements – consolidated and enhanced framework. Technical report, Basel Committee on Banking Supervision.

Basel Committee (2019). Minimum capital requirements for Market Risk. Technical report, Bank for International Settlements. Revised in February 2019.

Basel Committee (2020). MAR - Calculation of RWA for Market Risk. Technical report, Bank for International Settlements.

Bayer, S. (2018). Combining value-at-risk forecasts using penalized quantile regressions. Econometrics and Statistics, 8:56–77.

Bayer, S. and Dimitriadis, T. (2019). esreg: Joint Quantile and Expected Shortfall Regression. R package version 0.5.0, available at https://CRAN.R-project.org/package=esreg.

Bayer, S. and Dimitriadis, T. (2020). Regression based expected shortfall backtesting. Journal of Financial Econometrics (forthcoming). Available at arXiv:1801.04112 [q-fin.RM].

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327.

Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(1):107–144.

Busetti, F. and Marcucci, J. (2013). Comparing forecast accuracy: A monte carlo investigation. International Journal of Forecasting, 29(1):13–27.

Chen, Q., Gerlach, R., and Lu, Z. (2012). Bayesian value-at-risk and expected shortfall forecasting via the asymmetric laplace distribution. Computational Statistics and Data Analysis, 56(11):3498–3516.

Christoffersen, P. (1998). Evaluating Interval Forecasts. International Economic Review, 39(4):841–862.

Clements, M. and Harvey, D. (2009). Forecast combination and encompassing. In Mills, T. C. and Patterson, K., editors, Palgrave Handbook of Econometrics: Volume 2: Applied Econometrics, pages 169–198. Palgrave Macmillan UK, London.

Clements, M. and Harvey, D. (2010). Forecast encompassing tests and probability forecasts. Journal of Applied Econometrics, 25(6):1028–1062.

Clements, M. P. and Reade, J. J. (2020). Forecasting and forecast narratives: The bank of england inflation reports. International Journal of Forecasting.

Costantini, M., Gunter, U., and Kunst, R. M. (2017). Forecast combinations in a DSGE-VAR lab. Journal of Forecasting, 36(3):305–324.

Costanzino, N. and Curran, M. (2018). A simple traffic light approach to backtesting expected shortfall. Risks, 6(1).

Couperier, O. and Leymarie, J. (2019). Backtesting expected shortfall via multi-quantile regression. Working Paper, available at https://halshs.archives-ouvertes.fr/halshs-01909375v4.

Creal, D., Koopman, S. J., and Lucas, A. (2013). Generalized autoregressive score models with applications. Journal of Applied Econometrics, 28(5):777–795.

Davidson, J. (1994). Stochastic Limit Theory: An Introduction for Econometricians. Advanced Texts in Econometrics. Oxford University Press.

Davidson, R. and MacKinnon, J. G. (1998). Graphical methods for investigating the size and power of hypothesis tests. The Manchester School, 66(1):1–26.

Diebold, F. and Mariano, R. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3):253–63.

Diebold, F. X. (1989). Forecast combination and encompassing: Reconciling two divergent literatures. International Journal of Forecasting, 5(4):589–592.

Dimitriadis, T. and Bayer, S. (2019). A joint quantile and expected shortfall regression framework. Electron. J. Statist., 13(1):1823–1871.

Dimitriadis, T. and Schnaitmann, J. (2020). Forecast Encompassing Tests for the Expected Shortfall. International Journal of Forecasting (forthcoming). Available at arXiv:1908.04569 [q-fin.RM].

Dobric, V. and Liebars, C. (1994). Stochastic Differentiability in Maximum Likelihood Theory. In Hoffmann-Jørgensen, J., Kuelbs, J., and Marcus, M. B., editors, Probability in Banach Spaces, 9, pages 373–384. Birkhäuser Basel.

Doukhan, P., Massart, P., and Rio, E. (1995). Invariance principles for absolutely regular empirical processes. Annales de l'I.H.P. Probabilités et statistiques, 31(2):393–427.

Du, Z. and Escanciano, J. C. (2017). Backtesting Expected Shortfall: Accounting for Tail Risk. Management Science, 63(4):940–958.

Engle, R. and Manganelli, S. (2004). CAViaR: Conditional Autoregressive Value at Risk by Regression Quantiles. Journal of Business & Economic Statistics, 22(4):367–381.

Escanciano, J. C. and Olmo, J. (2010). Backtesting parametric value-at-risk with estimation risk. Journal of Business & Economic Statistics, 28(1):36–51.

Fair, R. C. and Shiller, R. J. (1989). The informational content of ex ante forecasts. The Review of Economics and Statistics, 71(2):325–331.

Fernández, C. and Steel, M. F. (1998). On bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441):359–371.

Fissler, T. and Ziegel, J. F. (2016). Higher order elicitability and Osband's principle. Annals of Statistics, 44(4):1680–1707.

Fissler, T., Ziegel, J. F., and Gneiting, T. (2016). Expected Shortfall is jointly elicitable with Value at Risk - Implications for backtesting. Risk, January:58–61.

Francq, C. and Zakoïan, J. M. (2009). Testing the nullity of GARCH coefficients: Correction of the standard tests and relative efficiency comparisons. Journal of the American Statistical Association, 104(485):313–324.

Fuertes, A.-M. and Olmo, J. (2013). Optimally harnessing inter-day and intra-day information for daily value-at-risk prediction. International Journal of Forecasting, 29(1):28–42.

Gaglianone, W. P., Lima, L. R., Linton, O., and Smith, D. R. (2011). Evaluating Value-at-Risk Models via Quantile Regression. Journal of Business & Economic Statistics, 29(1):150–160.

Gerlach, R. and Wang, C. (2020). Semi-parametric dynamic asymmetric laplace models for tail risk forecasting, incorporating realized measures. International Journal of Forecasting, 36(2):489–506.

Giacomini, R. and Komunjer, I. (2005). Evaluation and combination of conditional quantile forecasts. Journal of Business & Economic Statistics, 23:416–431.

Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6):1545–1578.

Giot, P. and Laurent, S. (2003). Value-at-risk for long and short trading positions. Journal of Applied Econometrics, 18(6):641–663.

Glosten, L. R., Jagannathan, R., and Runkle, D. E. (1993). On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. The Journal of Finance, 48(5):1779–1801.

Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106(494):746–762.

Hansen, B. E. (2008). Least-squares forecast averaging. Journal of Econometrics, 146(2):342–350.

Harvey, D. and Newbold, P. (2000). Tests for multiple forecast encompassing. Journal of Applied Econometrics, 15(5):471–482.

Harvey, D. I., Leybourne, S. J., and Whitehouse, E. J. (2017). Forecast evaluation tests and negative long-run variance estimates in small samples. International Journal of Forecasting, 33(4):833–847.

Hendry, D. and Richard, J.-F. (1982). On the formulation of empirical models in dynamic econometrics. Journal of Econometrics, 20(1):3–33.

Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press.

Koenker, R. W. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.

Kratz, M., Lok, Y. H., and McNeil, A. J. (2018). Multinomial VaR backtests: A simple implicit approach to backtesting expected shortfall. Journal of Banking & Finance, 88(C):393–407.

Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. The Journal of Derivatives, 3(2):73–84.

Liu, X. (2017). An integrated macro-financial risk-based approach to the stressed capital requirement. Review of Financial Economics, 34(1):86–98.

Lourenço, H. R., Martin, O. C., and Stützle, T. (2003). Iterated Local Search. In Glover, F. and Kochenberger, G. A., editors, Handbook of Metaheuristics, pages 320–353. Springer US, Boston, MA.

Lönnbark, C. (2016). Approximation methods for multiple period Value at Risk and Expected Shortfall prediction. Quantitative Finance, 16(6):947–968.

McNeil, A. J. and Frey, R. (2000). Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach. Journal of Empirical Finance, 7(3–4):271–300.

Mizon, G. and Richard, J.-F. (1986). The encompassing principle and its application to testing non-nested hypotheses. Econometrica, 54(3):657–78.

Newey, W. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. and McFadden, D., editors, Handbook of Econometrics, volume 4, chapter 36, pages 2111–2245. Elsevier.

Newey, W. and West, K. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–08.

Nolde, N. and Ziegel, J. F. (2017). Elicitability and backtesting: Perspectives for banking regulation. Ann. Appl. Stat., 11(4):1833–1874.

Patton, A. J., Ziegel, J. F., and Chen, R. (2019). Dynamic semiparametric models for expected shortfall (and value-at-risk). Journal of Econometrics, 211(2):388–413.

Tasche, D. (2002). Expected shortfall and beyond. Journal of Banking & Finance, 26(7):1519–1533.

Taylor, J. W. (2005). Generating volatility forecasts from value at risk estimates. Management Science, 51(5):712–725.

Taylor, J. W. (2019). Forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric laplace distribution. Journal of Business & Economic Statistics, 37(1):121–133.

Taylor, J. W. (2020). Forecast combinations for value at risk and expected shortfall. International Journal of Forecasting, 36(2):428–441.

Timmermann, A. (2006). Forecast combinations. In Elliott, G., Granger, C., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 1, chapter 04, pages 135–196. Elsevier, 1 edition.

Tsiotas, G. (2018). A bayesian encompassing test using combined value-at-risk estimates. Quantitative Finance, 18(3):395–417.

Weiss, A. A. (1991). Estimating nonlinear dynamic models using least absolute error estimation. Econometric Theory, 7(01):46–68.

West, K. D. (1996). Asymptotic inference about predictive ability. Econometrica, 64(5):1067–1084.

West, K. D. (2001). Tests for forecast encompassing when forecasts depend on estimated regression parameters. Journal of Business & Economic Statistics, 19(1):29–33.

White, H. (2001). Asymptotic Theory for Econometricians. Academic Press, San Diego.

Wong, C. M. and So, M. K. (2003). On conditional moments of garch models, with applications to multiple period value at risk estimation. Statistica Sinica, 13(4):1015–1044.

You, Y. and Liu, X. (2020). Forecasting short-run exchange rate volatility with monetary fundamentals: A garch-midas approach. Journal of Banking & Finance, 116:105849.

Zeileis, A. (2004). Econometric computing with HC and HAC covariance matrix estimators. Journal of Statistical Software, 11(10):1–17.

Zeileis, A. (2006). Object-oriented computation of sandwich estimators. Journal of Statistical Software, 16(9):1–16.

Zhao, Y., Li, J., and Yu, L. (2017). A deep learning ensemble approach for crude oil price forecasting. Energy Economics, 66:9–16.

A Proofs

Proof of Theorem 1.

For this proof, we employ Theorem 3 of Andrews (1999) (or equivalently Theorem 1 of Andrews (2001)), for which we verify the necessary Assumptions 1-6 of Andrews (1999) in the following.

We start by showing Assumption 1, i.e. the consistency of $\hat\theta_T$. For this, we employ Theorem 2.1 of Newey and McFadden (1994). Assumption (i), i.e. that $l(\theta)$ is uniquely minimized by $\theta_0$, follows directly from the identification condition (E) and from the strict consistency result of the loss functions of Fissler and Ziegel (2016). Condition (ii) follows directly as we impose that $\Theta$ is compact. Condition (iii) holds as $l(\theta)$ is continuous for all $\theta \in \Theta$, as the distribution $F_t$ is absolutely continuous and we use continuously differentiable functions $g$, $\phi$, $g^q$ and $g^e$. The uniform consistency of $T^{-1} l_T(\theta)$ of condition (iv) is shown by employing Theorem 21.9 of Davidson (1994). For this, we need that a point-wise law of large numbers holds for $T^{-1} l_T(\theta)$ for all $\theta \in \Theta$, which can be verified e.g. by employing Corollary 3.48 of White (2001). This holds as $\rho_t(\theta)$ is $\alpha$-mixing of size $-r/(r-1)$ for $r > 1$, since $\beta$-mixing series are also $\alpha$-mixing of the same size by Bradley (2005), and $E\big[|\rho_t(\theta)|^r\big] < \infty$ for all $\theta \in \Theta$ by condition (C). Furthermore, the sequence $l_T(\theta)$ is stochastically equicontinuous by Lemma 1 in the supplementary material. Thus, $\sup_{\theta \in \Theta} \big| T^{-1} l_T(\theta) - l(\theta) \big| \xrightarrow{P} 0$, and consistency of $\hat\theta_T$ follows from Theorem 2.1 of Newey and McFadden (1994).

Assumption 2$^*$ of Andrews (1999) is shown through the sufficient condition Assumption 1$^*$ on page 53 in Andrews (1997). For this, condition (a) holds trivially, and condition (b) follows directly from the uniform consistency result of $T^{-1} l_T(\theta)$. For condition (c), we set $\Theta^+ = \Theta$. Given condition (A), locally to $\theta_0$, $\Theta$ equals a union of (Cartesian) orthants. For condition (d), we notice that $l(\theta)$ is twice continuously differentiable on the interior of $\Theta$ and has partial right/left derivatives on the boundary of $\Theta$ of order one and two. It further holds that $\nabla_\theta l(\theta_0) = 0$ as the function $l(\theta)$ is uniquely minimized by $\theta_0$. Notice that this also holds for the respective directional derivatives if $\theta_0$ lies on the boundary of $\Theta$. Finally, for condition (e), Lemma 2 in the supplementary material shows that the functions $\rho_t(\theta)$ given in (2.6) form a type IV class (see Andrews (1994), p.2278) with index $p = 2r$ such that by Theorem 6 in Andrews (1994), it satisfies Ossiander's $L^{2r}$-entropy condition and consequently has an $L^{2r}$-envelope. Furthermore, the moments $E\Big[\sup_{\tilde\theta \in U(\theta,\delta)} \big|\rho_t(\tilde\theta)\big|^{2r}\Big]^{1/(2r)} < \infty$ are bounded by assumption. Consequently, by Theorem 1 and Application 1 in Doukhan et al. (1995), we obtain that the empirical process, given by $T^{-1/2} \sum_{t \in \mathfrak{T}} \big(\rho_t(\theta) - E[\rho_t(\theta)]\big)$, is stochastically equicontinuous (see the remark on p.410 of Doukhan et al. (1995)). Hence, the process $T^{-1} l_T(\theta) - l(\theta)$ is stochastically differentiable (see e.g. Newey and McFadden (1994), p.2187 or the proof after Theorem 3.1 in Dobric and Liebars (1994), which does not rely on the imposed iid assumption of that paper). Thus, all conditions in Assumption 1$^*$ of Andrews (1997) are fulfilled and hence, Assumption 2$^*$ of Andrews (1999) holds.

In the following, we verify Assumption 3$^*$ (which implies Assumption 3) of Andrews (1999), i.e. that $T^{-1/2} \sum_{t \in \mathfrak{T}} \psi_t(\theta_0) \xrightarrow{d} G$, where $G \sim N(0, \mathcal{I})$ and $\mathcal{I} = \mathrm{Var}\big(T^{-1/2} \sum_{t \in \mathfrak{T}} \psi_t(\theta_0)\big)$, with $\psi_t(\theta)$ given in (2.24). By using the Cramér-Wold theorem, we instead show that $T^{-1/2} \sum_{t \in \mathfrak{T}} u^\top \psi_t(\theta_0) \xrightarrow{d} u^\top G$ for all $u \in \mathbb{R}^k$ where $\|u\| = 1$. This holds as $Z_t$ is assumed to be $\beta$-mixing of size $-r/(r-1)$ for $r > 1$, and $\beta$-mixing implies $\alpha$-mixing of the same size (Bradley, 2005). By Theorem 3.49 in White (2001), we then get that $u^\top \psi_t(\theta_0)$ are also $\alpha$-mixing of the same size. Furthermore, it holds that $E\big[|u^\top \psi_t(\theta_0)|^r\big] \le E\big[\sup_{\theta \in \Theta} \|\psi_t(\theta)\|^r\big] < \infty$ by condition (C) in Assumption 1. The matrix $\mathcal{I} = \mathrm{Var}\big(T^{-1/2} \sum_{t \in \mathfrak{T}} \psi_t(\theta_0)\big)$ does not depend on $T$ as the process is assumed to be stationary. As $\mathcal{I}$ has full rank by condition (F), it holds that $\mathrm{Var}\big(T^{-1/2} \sum_{t \in \mathfrak{T}} u^\top \psi_t(\theta_0)\big) \ge \lambda_{\min} > 0$, where $\lambda_{\min}$ is the smallest eigenvalue of $\mathcal{I}$. Consequently, applying Theorem 5.20 in White (2001) delivers the asymptotic normality result.

Following condition (A), the parameter space is given as the product $\Theta = B_1 \times B_2 \times \Delta \times \Psi$, and each of these four spaces is given by (linear) inequality constraints. Consequently, $\Theta$ can also be expressed through a system of inequalities of the form $\Gamma_\theta \theta \le r_\theta$, for some matrix $\Gamma_\theta$ and vector $r_\theta$ of appropriate dimensions. Then, following equations (4.6) and (4.7) of Andrews (1999), the cone $\Lambda$ is given by

$$\Lambda = \big\{ \lambda \in \mathbb{R}^k : \Gamma^{(b)}_\theta \lambda \le 0 \big\}, \tag{A.1}$$

where $\Gamma^{(b)}_\theta$ consists of the rows of $\Gamma_\theta$ for which the inequality $\Gamma_\theta \theta_0 \le r_\theta$ is binding (i.e. it holds as an equality). As this specification of $\Lambda$ is a convex cone, this shows Assumptions 5 and 6 of Andrews (1999), i.e. that $\Theta - \theta_0$ locally equals a convex cone $\Lambda \subset \mathbb{R}^k$.

Consequently, we can apply Theorem 3 of Andrews (1999) (or equivalently Theorem 1 of Andrews (2001)), which completes the proof of this theorem.

Proof of Theorem 2.

The result follows directly from Theorem 2 of Andrews (2001). Besides the assumptions of Theorem 1 (of the present article), we further need to verify Assumptions 7 and 8 of Andrews (2001). Assumption 7(a) is fulfilled by the imposed condition (G) in Assumption 2. Furthermore, Assumption 7(b) follows directly from condition (A) and from the specification given in (2.25). Assumption 8 also holds trivially as $\delta_0$ is assumed to be in the interior of $\Theta$, which concludes this proof.

Proof of Theorem 3.

In order to employ Theorem 6 of Andrews (2001), we verify the necessary Assumptions 9 and 12 of Andrews (2001). For the verification of Assumptions 1-8, see the proofs of Theorem 1 and Theorem 2. For Assumption 9, notice that testing $\beta_1 = \beta_1^*$ corresponds to the null hypothesis that $\theta \in \Theta_0 = \big\{ \theta = (\beta_1, \beta_2, \delta, \psi) \in \Theta : \beta_1 = \beta_1^* \big\}$. Consequently, Assumption 9(a) is satisfied. Assumption 9(b) holds as throughout the paper $B_T = \sqrt{T} I_k$, and Assumption 9(c) follows directly from condition (A). Finally, Assumption 9(d) follows as $B_1$ and $B_2$ in condition (A) are given by separate inequality constraints.

Assumption 12$^*$(a) corresponds to Assumption 11(a), which requires that the random variable $G \sim N(0, \mathcal{I})$ (simplified for the case that the space $\Pi$ is single-valued). This follows directly from the proof of Theorem 1 (where Assumption 3$^*$ of Andrews (1999) is verified). Assumption 12$^*$(b) follows directly from (2.32). Regarding the conditions of Assumption 12$^*$(c), consistency of the "bread" matrix, $\hat{\mathcal{T}}_T \xrightarrow{P} \mathcal{T}$, follows directly from Theorem 3 of Patton et al. (2019), and consistency of the HAC estimator, $\hat{\mathcal{I}}_T \xrightarrow{P} \mathcal{I}$, is shown in Lemma 3 in the supplementary material. Notice for this that joint convergence in probability $\big(\hat{\mathcal{T}}_T, \hat{\mathcal{I}}_T\big) \xrightarrow{P} \big(\mathcal{T}, \mathcal{I}\big)$ follows directly from both variables converging in probability separately. Finally, Assumption 12$^*$(e) follows as the matrix $\mathcal{I}$ has full rank by assumption.

Consequently, the conditions of Theorem 6 of Andrews (2001) are satisfied and part (d) yields that $W_T \xrightarrow{d} \hat\lambda_{\beta_1}^\top V^{-1} \hat\lambda_{\beta_1}$, where $\hat\lambda = \big(\hat\lambda_{\beta_1}, \hat\lambda_{\beta_2}, \hat\lambda_\delta, \hat\lambda_\psi\big)$ is given in Theorem 2.

SUPPLEMENTARY MATERIAL FOR

Encompassing Tests for Value at Risk and Expected Shortfall Multi-Step Forecasts based on Inference on the Boundary

Timo Dimitriadis, Xiaochun Liu, Julie Schnaitmann

September 17, 2020

All references to equations, sections, tables and figures starting with S. refer to this supplement while the remaining references refer to the main document of the article.

S.1 Additional DGPs for the Simulation Study

Following Dimitriadis and Schnaitmann (2020), this section provides simulation results for two additional data generating processes (DGPs) outside the class of location-scale models as a robustness check for the proposed encompassing tests.

For the first additional simulation design, we introduce two specifications of generalized autoregressive score (GAS) models proposed by Creal et al. (2013). We generate $r_{1,t+1}$, $\hat q_{1,t}$ and $\hat e_{1,t}$ from a GAS model with Gaussian innovations, which corresponds to the standard GARCH(1,1) specification given in (3.1). We obtain the second sequence of forecasts from a GAS model with Student-$t$ residuals with time-varying variance and degrees of freedom, given by

$$(\hat\mu_2, \hat\sigma^2_{2,t}, \hat\nu_{2,t})^\top = \kappa + B \cdot (\hat\mu_2, \hat\sigma^2_{2,t-1}, \hat\nu_{2,t-1})^\top + A H_t \nabla_t, \tag{S.1.1}$$

where $H_t \nabla_t$ is the forcing variable of the model, the scaling matrix $H_t$ is the Hessian and $\nabla_t$ the derivative of the log-likelihood function. We calibrate both models to daily S&P 500 returns resulting in the parameter values κ = (0. , . , − . ), A = diag(0, . , . ) and B = diag(0, . , . ). Returns follow $r_{2,t+1} \sim t_{\hat\nu_{2,t}}\big(\hat\mu_2, \hat\sigma^2_{2,t}\big)$ and we obtain one-step ahead VaR and ES forecasts from this $t$-distribution.

In the second additional simulation setup, we implement the one-factor (1F) and two-factor (2F) GAS models for the VaR and ES of Patton et al. (2019). The 1F-GAS model evolves as

$$\hat q_{1,t} = -\ .\ 164\, \exp(\hat\kappa_t) \quad \text{and} \quad \hat e_{1,t} = -\ .\ 757\, \exp(\hat\kappa_t), \quad \text{where}$$

$$\hat\kappa_t = 0.\ \ \hat\kappa_{t-1} + \frac{0.\ \ }{\hat e_{1,t-1}} \left( \frac{r_{1,t}}{\alpha} \mathbb{1}\{ r_{1,t} \le \hat q_{1,t-1} \} - \hat e_{1,t-1} \right). \tag{S.1.2}$$

The 2F-GAS model follows the specification

$$\begin{pmatrix} \hat q_{2,t} \\ \hat e_{2,t} \end{pmatrix} = \begin{pmatrix} -\ .\ \\ -\ .\ \end{pmatrix} + \begin{pmatrix} .\ 993 & 0 \\ 0 & 0.\ \end{pmatrix} \begin{pmatrix} \hat q_{2,t-1} \\ \hat e_{2,t-1} \end{pmatrix} + \begin{pmatrix} -\ .\ & -\ .\ \\ -\ .\ & -\ .\ \end{pmatrix} \lambda_t, \tag{S.1.3}$$

where the forcing variable is given by

$$\lambda_t = \Big( \hat q_{2,t-1} \big( \alpha - \mathbb{1}\{ r_{2,t} \le \hat q_{2,t-1} \} \big),\ \mathbb{1}\{ r_{2,t} \le \hat q_{2,t-1} \}\, r_{2,t}/\alpha - \hat e_{2,t-1} \Big)^\top.$$

For both models, $j = 1, 2$, we simulate $r_{j,t+1} \sim \mathcal{N}\big(\hat\mu_{j,t}, \hat\sigma^2_{j,t}\big)$, where the conditional mean and standard deviation are given by

$$\hat\mu_{j,t} = \hat q_{j,t} - z_\alpha \frac{\hat e_{j,t} - \hat q_{j,t}}{\xi_\alpha - z_\alpha} \quad \text{and} \quad \hat\sigma_{j,t} = \frac{\hat e_{j,t} - \hat q_{j,t}}{\xi_\alpha - z_\alpha},$$

such that $Q_\alpha(r_{j,t+1} \mid \mathcal{F}_t) = \hat q_{j,t}$ and $ES_\alpha(r_{j,t+1} \mid \mathcal{F}_t) = \hat e_{j,t}$ almost surely. The parameter values for this model are obtained from Table 8 of Patton et al. (2019) and correspond to parameters calibrated to daily S&P 500 returns.

In order to simulate returns which follow a convex combination of these two conditional distributions (for both DGPs), we simulate Bernoulli draws $\pi_{t+1} \sim \mathrm{Bern}(\pi)$ for 11 equally spaced values of $\pi \in [0, 1]$ and set $Y_{t+1} = r_{t+1} = (1 - \pi_{t+1})\, r_{1,t+1} + \pi_{t+1}\, r_{2,t+1}$. Thus, for $\pi = 0$, $Y_{t+1}$ follows the first model, for $\pi = 1$, $Y_{t+1}$ follows the second model, and for $\pi \in (0, 1)$, $Y_{t+1}$ follows some convex combination of the two models.

Table S.1 shows the empirical sizes for the joint VaR and ES and the auxiliary ES encompassing tests for both null hypotheses, both DGPs, the three different link functions and various out-of-sample sizes $T$, and Figure S.4 presents the empirical rejection frequencies. As for the two GARCH-based DGPs, the tests based on the convex and no-crossing link functions outperform the tests based on the linear link function. Moreover, the auxiliary ES test performs slightly better than the joint VaR and ES encompassing test, especially in terms of its size properties. This qualitatively confirms the result for the GARCH DGPs of Section 3.

While generating returns stemming from convex combinations of GARCH-type volatility models is straightforward by using convex combinations of the conditional volatilities, this is not as simple for the more general GAS models considered in this section. Consequently, we use this more involved approach based on Bernoulli draws in order to generate these convex model combinations. This is comparable to our combination approach of multi-step forecasts.
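The location-scale calibration used above for the Patton et al. (2019) DGPs, together with the Bernoulli mixing step, can be checked numerically with a short Python sketch. It assumes that $z_\alpha$ and $\xi_\alpha$ denote the $\alpha$-quantile and $\alpha$-ES of the standard normal distribution, so that $\xi_\alpha = -\varphi(z_\alpha)/\alpha$; the function names and the input values are hypothetical and serve only as an illustration.

```python
import random
from statistics import NormalDist


def standard_normal_var_es(alpha):
    """Quantile z_alpha and expected shortfall xi_alpha of N(0, 1) at level alpha."""
    std = NormalDist()
    z = std.inv_cdf(alpha)
    xi = -std.pdf(z) / alpha
    return z, xi


def normal_params_from_var_es(q_hat, e_hat, alpha):
    """Mean and standard deviation of the normal law whose VaR and ES equal (q_hat, e_hat)."""
    z, xi = standard_normal_var_es(alpha)
    sigma = (e_hat - q_hat) / (xi - z)
    mu = q_hat - z * sigma
    return mu, sigma


def normal_var_es(mu, sigma, alpha):
    """VaR and ES of N(mu, sigma^2) at level alpha."""
    z, xi = standard_normal_var_es(alpha)
    return mu + sigma * z, mu + sigma * xi


def draw_mixture(r1, r2, pi, rng):
    """Return r2 with probability pi and r1 otherwise (Bernoulli mixing of two draws)."""
    return r2 if rng.random() < pi else r1


# Hypothetical VaR and ES forecasts at an assumed level of 2.5%.
mu, sigma = normal_params_from_var_es(q_hat=-1.8, e_hat=-2.4, alpha=0.025)
q_check, e_check = normal_var_es(mu, sigma, 0.025)

rng = random.Random(1)
y_first = draw_mixture(r1=0.3, r2=-0.7, pi=0.0, rng=rng)   # always the first model
y_second = draw_mixture(r1=0.3, r2=-0.7, pi=1.0, rng=rng)  # always the second model
```

Recomputing the VaR and ES from the implied mean and standard deviation recovers the inputs, which is the property the calibration is designed to deliver.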

S.2 Covariance Estimation

In this section, we compare the performance of the joint VaR and ES encompassing test for four different covariance estimators, where we consider multi-step ahead and multi-step aggregate forecasts at different forecast horizons. These estimators differ with respect to the estimation of the "meat" matrix $\mathcal{I}$, given in (2.23), where we consider an estimator based on HAC terms (Newey and West, 1987) and one without, paired with either the outer product of the gradient of $\psi_t(\theta)$, or the scl-sp estimator of Dimitriadis and Bayer (2019).

More precisely, the first estimator is given by $\hat{\mathcal{I}}^{(1)}_T = \widehat{\Omega}_{T,0}$, where the contemporaneous covariance matrix is estimated by the outer product of the gradient of $\psi_t(\theta)$ as given in (2.34). The second estimator is specified as $\hat{\mathcal{I}}^{(2)}_T = \widetilde{\Omega}_{T,0}$, where $\widetilde{\Omega}_{T,0}$ denotes the scl-sp estimator of Dimitriadis and Bayer (2019). The third specification employs a standard HAC estimator (Newey and West, 1987; Andrews, 1991) as in (2.34), based on an automatic lag selection implemented in the R package sandwich (Zeileis, 2004, 2006). The last specification combines the HAC estimator with the scl-sp estimator of Dimitriadis and Bayer (2019) by replacing the outer product estimator of the contemporaneous variance by the scl-sp estimator as in (2.36).

Figure S.7 shows the empirical rejection frequencies for the joint VaR and ES encompassing test based on the four different covariance estimators for $h$-step ahead forecasts with forecast horizons h = 1, , , 10. Figure S.8 presents equivalent results for $h$-step aggregate forecasts. For one-step ahead forecasts, the respective lines for the HAC and non-HAC estimators coincide. This stems from the fact that the automatic lag selection almost exclusively chooses no additional lag terms beyond the contemporaneous variance term; hence, for one-step ahead forecasts, our encompassing tests do not require HAC-corrected covariance estimators. The difference in performance stems from the estimation of the contemporaneous variance, where the closed-form solution based on the scl-sp estimator performs clearly better than the outer product based version.

For $h$-step ahead forecasts with larger forecast horizons, the covariance estimator combining the scl-sp estimator with additional HAC terms performs (only) slightly better than the raw scl-sp estimator. However, for inherently correlated multi-step ahead aggregate forecasts, presented in Figure S.8, this difference becomes more pronounced, especially for increasing forecast horizons. Consequently, for $h$-step ahead and aggregate forecasts with any $h > 1$, we use the scl-sp estimator augmented with additional HAC-terms.

S.3 Risk Models for the Empirical Application

In this section, we describe the (non-standard) risk models of Chen et al. (2012), Taylor (2019) and Patton et al. (2019) for forecasting VaR and ES in the empirical application in Section 4.

Chen et al. (2012) propose to use GARCH models with innovations which follow an asymmetric Laplace distribution in order to capture potential (dynamic) skewness and heavy tails.
In particular,
$$r_t = \sigma_t (\varepsilon_t - \mu_\varepsilon), \qquad \varepsilon_t \overset{iid}{\sim} AL(0, 1, p), \tag{S.3.1}$$
where $\sigma_t$ follows either a GARCH(1,1) or a GJR-GARCH(1,1) specification, and where $AL(0, 1, p)$ represents the asymmetric Laplace distribution with zero mode, unit variance, and shape parameter $p$, which is defined such that $p = P(\varepsilon_t < 0)$. The $AL(0, 1, p)$ probability density function has the following form
$$f(\varepsilon; p) = b_p \exp\left[ -b_p |\varepsilon| \left( \frac{1}{p} \mathbb{1}_{\{\varepsilon < 0\}} + \frac{1}{1-p} \mathbb{1}_{\{\varepsilon > 0\}} \right) \right], \tag{S.3.2}$$
where $b_p = \sqrt{p^2 + (1-p)^2}$, $\mathrm{Var}[\varepsilon_t] = 1$ and $E[\varepsilon_t] = \mu_\varepsilon = (1 - 2p)/b_p$. Thus, $u_t = \varepsilon_t - \mu_\varepsilon$ has an asymmetric Laplace distribution with zero mean, unit variance, and shape parameter $p$. Note that $p = 0.5$ corresponds to the symmetric case; for $p < 0.5$, the density is skewed to the right, while the opposite applies for $p > 0.5$. The VaR and ES (in the relevant area $\alpha \in (0, p)$) can then be obtained analytically as
$$\hat q_t = \sigma_t \frac{p}{b_p} \log\left( \frac{\alpha}{p} \right) - \mu_\varepsilon \sigma_t, \qquad \text{and} \qquad \hat e_t = \hat q_t - \frac{\hat q_t}{\log\left( \frac{\alpha}{p} \right)}.$$
In practice, $p$ is usually close to 0.5 and $\alpha$ is much smaller, so that the restriction $\alpha \in (0, p)$ is not binding. The GARCH and GJR-GARCH models with a constant shape parameter $p$ are denoted by GARCH-AL and GJR-AL, respectively.

Chen et al. (2012) further propose to augment these models with a time-varying shape parameter, which allows for dynamic higher moments for $r_t$, and whose dynamics are given by
$$p_t = \frac{1}{1 + \sqrt{\xi_t / \zeta_t}},$$
where $\xi_t = (1 - \lambda) |u_{t-1}| \mathbb{1}_{\{u_{t-1} \ge 0\}} + \lambda \xi_{t-1}$ and $\zeta_t = (1 - \lambda) |u_{t-1}| \mathbb{1}_{\{u_{t-1} < 0\}} + \lambda \zeta_{t-1}$, for some smoothing parameter $0 \le \lambda \le 1$. The models with a time-varying shape parameter $p_t$ are denoted as GARCH-AL-TVP and GJR-AL-TVP.

Taylor (2019) employs semiparametric models to forecast VaR and ES by augmenting the CAViaR models of Engle and Manganelli (2004) with an additional component for the ES. In particular, the author assumes that the conditional quantile $\hat q_t$ at level $\alpha$ follows either the symmetric absolute value (SAV) or the asymmetric slope (AS) CAViaR model,
$$\text{SAV:} \quad \hat q_t = \beta_0 + \beta_1 \hat q_{t-1} + \beta_2 |r_{t-1}|, \quad \text{and} \tag{S.3.3}$$
$$\text{AS:} \quad \hat q_t = \beta_0 + \beta_1 \hat q_{t-1} + \beta_2 |r_{t-1}| \mathbb{1}_{\{r_{t-1} \ge 0\}} + \beta_3 |r_{t-1}| \mathbb{1}_{\{r_{t-1} < 0\}}. \tag{S.3.4}$$
Since the dynamics of the VaR may not be the same as the dynamics of the ES, Taylor (2019) equips these CAViaR models with the following ES specification
$$\hat e_t = \hat q_t - x_t, \tag{S.3.5}$$
$$x_t = \begin{cases} \kappa_0 + \kappa_1 (\hat q_{t-1} - r_{t-1}) + \kappa_2 x_{t-1} & \text{if } r_{t-1} \le \hat q_{t-1}, \\ x_{t-1} & \text{otherwise}, \end{cases}$$
where $\kappa_0 > 0$ and $\kappa_1, \kappa_2 \ge 0$ ensure that $\hat e_t < \hat q_t$ for $\hat q_t < 0$. The model specification given by (S.3.3) and (S.3.5) is denoted as the SAV-CAViaR-ES model, and the model specified by (S.3.4) and (S.3.5) as the AS-CAViaR-ES model. These models are estimated by quasi-maximum likelihood based on the asymmetric Laplace distribution, which corresponds to a special case of the M-estimator considered by Patton et al. (2019), and given in (2.6) and (2.20) of this article. In particular, Taylor (2019) shows that under the assumption of a zero (conditional) mean, the negative of the asymmetric Laplace log-likelihood corresponds (up to constants) to the loss function in (2.6) with $g(z) = 0$ and $\phi(z) = -\log(-z)$.

Finally, we consider the one factor GAS model of Patton et al. (2019) (also denoted by GAS-1F), which directly incorporates forcing variables into the dynamic process of the conditional variance in the sense of the GAS models of Creal et al. (2013). In particular,
$$\hat q_t = a \exp(\hat\kappa_t) \qquad \text{and} \qquad \hat e_t = b \exp(\hat\kappa_t), \qquad \text{where}$$
$$\hat\kappa_t = \beta_0 + \beta_1 \hat\kappa_{t-1} + \beta_2 \frac{1}{\hat e_{t-1}} \left( \frac{1}{\alpha} r_t \mathbb{1}_{\{r_t \le \hat q_{t-1}\}} - \hat e_{t-1} \right),$$
and $\hat q_t = \sigma_t \kappa$ and $\hat e_t = \sigma_t \delta$, where the restrictions $\delta < \kappa < 0$ ensure that $\hat e_t < \hat q_t$. The model is estimated by the M-estimator given in (2.6) and (2.20).

Table S.3 in Appendix S.6 reports parameter estimates of the risk models for the full sample.

S.4 Absolute Forecast Evaluation

Table S.9 shows absolute forecast evaluation criteria, including several backtests, for one-step ahead VaR and ES forecasts. For this, the VaR Violation Ratio is given by $\hat\alpha / \alpha$, where $\hat\alpha = T^{-1} \sum_{t \in \mathfrak{T}} \mathbb{1}_{\{Y_{t+1} < \hat q_t\}}$, and the empirical ES ratio is computed as
$$ESR = \frac{\sum_{t \in \mathfrak{T}} \big[ Y_{t+1} \mathbb{1}_{\{Y_{t+1} < \hat q_t\}} \big]}{\sum_{t \in \mathfrak{T}} \big[ \hat e_t \mathbb{1}_{\{Y_{t+1} < \hat q_t\}} \big]}.$$
Furthermore, we report $p$-values of the unconditional coverage (UC) test of Kupiec (1995), the conditional coverage (CC) test of Christoffersen (1998), the dynamic quantile (DQ) test of Engle and Manganelli (2004), the VQR test of Gaglianone et al. (2011), the ES backtest of McNeil and Frey (2000) (MF), the regression-based ES backtest of Bayer and Dimitriadis (2020) (BD), and the calibration test of Nolde and Ziegel (2017) (NZ). Table S.9 shows that six out of the eleven models pass all seven backtests (that is, are not rejected by any of them) at a 5% significance level, where $p$-values in bold mark the models for which none of the tests rejects its null hypothesis.

S.5 Technical Details of the Proofs

Lemma 1 (Stochastic Equicontinuity of the Loss Function). Given Assumption 1, the function $T^{-1} l_T(\theta)$ is stochastically equicontinuous, i.e. for all $\varepsilon > 0$, there exists a $\delta > 0$, such that
$$\limsup_{T \to \infty} P\left[ \sup_{\{\theta, \tilde\theta \in \Theta : \, \|\tilde\theta - \theta\| < \delta\}} \big\| T^{-1} l_T(\theta) - T^{-1} l_T(\tilde\theta) \big\| > \varepsilon \right] < \varepsilon. \tag{S.5.1}$$

Proof.

In the following, we show that for all $\theta, \tilde\theta \in \Theta$ and for all $T \in \mathbb{N}$, it holds that
$$\big| T^{-1} l_T(\theta) - T^{-1} l_T(\tilde\theta) \big| \le K_T \|\theta - \tilde\theta\|, \tag{S.5.2}$$
where $K_T = O_P(1)$, which implies stochastic equicontinuity by Theorem 21.10 of Davidson (1994).

For this, we split the loss function
\begin{align}
\rho_t(\theta) &= \big( \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} - \alpha \big) g(g_t^q(\theta)) - \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} g(Y_{t+h}) \notag \\
&\quad + \phi'(g_t^e(\theta)) \left( g_t^e(\theta) - g_t^q(\theta) + \frac{(g_t^q(\theta) - Y_{t+h}) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}}}{\alpha} \right) - \phi(g_t^e(\theta)) + a(Y_{t+h}) \notag \\
&=: A_t(\theta) \mathbb{1}_{t+h}(\theta) + B_t(\theta) \mathbb{1}_{t+h}(\theta) + C_t(\theta) + a(Y_{t+h}), \tag{S.5.3}
\end{align}
where we use the short notation $\mathbb{1}_{t+h}(\theta) := \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}}$ and
\begin{align}
A_t(\theta) &:= g(g_t^q(\theta)) - g(Y_{t+h}), \tag{S.5.4} \\
B_t(\theta) &:= \frac{\phi'(g_t^e(\theta))}{\alpha} \big( g_t^q(\theta) - Y_{t+h} \big), \quad \text{and} \tag{S.5.5} \\
C_t(\theta) &:= \phi'(g_t^e(\theta)) \big( g_t^e(\theta) - g_t^q(\theta) \big) - \phi(g_t^e(\theta)) - \alpha \, g(g_t^q(\theta)). \tag{S.5.6}
\end{align}
It holds that
\begin{align}
\big| \rho_t(\theta) - \rho_t(\tilde\theta) \big| &\le \big| A_t(\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big| + \big| B_t(\theta) \mathbb{1}_{t+h}(\theta) - B_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big| \notag \\
&\quad + \big| C_t(\theta) - C_t(\tilde\theta) \big|. \tag{S.5.7}
\end{align}
As $C_t$ is continuously differentiable, for the third term in (S.5.7) we get that
$$\big| C_t(\theta) - C_t(\tilde\theta) \big| \le \left( \sup_{\theta \in \Theta} \|\nabla_\theta C_t(\theta)\| \right) \cdot \|\theta - \tilde\theta\|, \tag{S.5.8}$$
where $\sup_{\theta \in \Theta} \|\nabla_\theta C_t(\theta)\| = O_P(1)$ as $E\big[ \sup_{\theta \in \Theta} \|\psi_t(\theta)\|^r \big] < \infty$ by condition (C). For the first term in (S.5.7), first notice that
$$\big( g(g_t^q(\theta)) - g(Y_{t+h}) \big) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} = \frac{1}{2} \Big( g(g_t^q(\theta)) - g(Y_{t+h}) + \big| g(g_t^q(\theta)) - g(Y_{t+h}) \big| \Big). \tag{S.5.9}$$
Thus, it holds that
\begin{align}
&\big| A_t(\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big| \tag{S.5.10} \\
&= \frac{1}{2} \Big| \Big( g(g_t^q(\theta)) - g(Y_{t+h}) + \big| g(g_t^q(\theta)) - g(Y_{t+h}) \big| \Big) - \Big( g(g_t^q(\tilde\theta)) - g(Y_{t+h}) + \big| g(g_t^q(\tilde\theta)) - g(Y_{t+h}) \big| \Big) \Big| \tag{S.5.11} \\
&\le \frac{1}{2} \Big| \big( g(g_t^q(\theta)) - g(Y_{t+h}) \big) - \big( g(g_t^q(\tilde\theta)) - g(Y_{t+h}) \big) \Big| + \frac{1}{2} \Big| \big| g(g_t^q(\theta)) - g(Y_{t+h}) \big| - \big| g(g_t^q(\tilde\theta)) - g(Y_{t+h}) \big| \Big| \tag{S.5.12} \\
&\le \big| g(g_t^q(\theta)) - g(g_t^q(\tilde\theta)) \big| \tag{S.5.13} \\
&\le \left( \sup_{\theta \in \Theta} \|\nabla_\theta g(g_t^q(\theta))\| \right) \cdot \|\theta - \tilde\theta\|, \tag{S.5.14}
\end{align}
where the step to (S.5.13) uses the reverse triangle inequality and where $\sup_{\theta \in \Theta} \|\nabla_\theta g(g_t^q(\theta))\| = O_P(1)$ as $E\big[ \sup_{\theta \in \Theta} \|\psi_t(\theta)\|^r \big] < \infty$. Equivalently, for the second term in (S.5.7), it holds that
$$\frac{\phi'(g_t^e(\theta))}{\alpha} \big( g_t^q(\theta) - Y_{t+h} \big) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} = \frac{\phi'(g_t^e(\theta))}{2\alpha} \Big( \big( g_t^q(\theta) - Y_{t+h} \big) + \big| g_t^q(\theta) - Y_{t+h} \big| \Big). \tag{S.5.15}$$
Consequently,
\begin{align}
&\big| B_t(\theta) \mathbb{1}_{t+h}(\theta) - B_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big| \tag{S.5.16} \\
&= \bigg| \frac{\phi'(g_t^e(\theta))}{2\alpha} \Big( \big( g_t^q(\theta) - Y_{t+h} \big) + \big| g_t^q(\theta) - Y_{t+h} \big| \Big) \tag{S.5.17} \\
&\qquad - \frac{\phi'(g_t^e(\tilde\theta))}{2\alpha} \Big( \big( g_t^q(\tilde\theta) - Y_{t+h} \big) + \big| g_t^q(\tilde\theta) - Y_{t+h} \big| \Big) \bigg| \tag{S.5.18} \\
&\le \bigg| \frac{\phi'(g_t^e(\theta))}{2\alpha} \big( g_t^q(\theta) - Y_{t+h} \big) - \frac{\phi'(g_t^e(\tilde\theta))}{2\alpha} \big( g_t^q(\tilde\theta) - Y_{t+h} \big) \bigg| \tag{S.5.19} \\
&\qquad + \bigg| \frac{\phi'(g_t^e(\theta))}{2\alpha} \big| g_t^q(\theta) - Y_{t+h} \big| - \frac{\phi'(g_t^e(\tilde\theta))}{2\alpha} \big| g_t^q(\tilde\theta) - Y_{t+h} \big| \bigg| \tag{S.5.20} \\
&\le \bigg| \frac{\phi'(g_t^e(\theta))}{\alpha} \big( g_t^q(\theta) - Y_{t+h} \big) - \frac{\phi'(g_t^e(\tilde\theta))}{\alpha} \big( g_t^q(\tilde\theta) - Y_{t+h} \big) \bigg| \tag{S.5.21} \\
&\le \left( \sup_{\theta \in \Theta} \bigg\| \nabla_\theta \Big( \frac{\phi'(g_t^e(\theta))}{\alpha} g_t^q(\theta) \Big) - \frac{\nabla_\theta \big( \phi'(g_t^e(\theta)) \big)}{\alpha} Y_{t+h} \bigg\| \right) \cdot \|\theta - \tilde\theta\|, \tag{S.5.22}
\end{align}
and
$$\sup_{\theta \in \Theta} \bigg\| \nabla_\theta \Big( \frac{\phi'(g_t^e(\theta))}{\alpha} g_t^q(\theta) \Big) - \frac{\nabla_\theta \big( \phi'(g_t^e(\theta)) \big)}{\alpha} Y_{t+h} \bigg\| = O_P(1)$$
as $E\big[ \sup_{\theta \in \Theta} \|\psi_t(\theta)\|^r \big] < \infty$.

Eventually, as $T^{-1} l_T(\theta) = T^{-1} \sum_{t \in \mathfrak{T}} \rho_t(\theta)$, the Lipschitz condition in (S.5.2) holds with $K_T = O_P(1)$, which concludes this proof.

Lemma 2.
(Type IV Class for Stochastic Equicontinuity of the Empirical Process). The class of functions given by $\rho_t(\theta) := \rho\big( Y_{t+h}, g_t^q(\theta), g_t^e(\theta) \big)$ in (2.6) is a type IV class (see Andrews (1994), p. 2278) with index $p = 2r$ (in the notation of Andrews (1994) and where $r > 1$), i.e.,
$$\sup_{1 \le t \le T, \, T \ge 1} E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \rho_t(\theta) - \rho_t(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \le C \delta, \tag{S.5.23}$$
for all $\theta \in \Theta$, for all $\delta > 0$ and some constant $C$.

Proof. For this proof, we split the loss function
\begin{align}
\rho_t(\theta) &= \big( \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} - \alpha \big) g(g_t^q(\theta)) - \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} g(Y_{t+h}) \notag \\
&\quad + \phi'(g_t^e(\theta)) \left( g_t^e(\theta) - g_t^q(\theta) + \frac{(g_t^q(\theta) - Y_{t+h}) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}}}{\alpha} \right) - \phi(g_t^e(\theta)) + a(Y_{t+h}) \notag \\
&=: A_t(\theta) \mathbb{1}_{t+h}(\theta) + B_t(\theta) \mathbb{1}_{t+h}(\theta) Y_{t+h} + C_t(\theta) - \mathbb{1}_{t+h}(\theta) g(Y_{t+h}) + a(Y_{t+h}), \tag{S.5.24}
\end{align}
where $\mathbb{1}_{t+h}(\theta) := \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}}$ and
\begin{align}
A_t(\theta) &:= g(g_t^q(\theta)) + \phi'(g_t^e(\theta)) \, g_t^q(\theta) / \alpha, \tag{S.5.25} \\
B_t(\theta) &:= -\phi'(g_t^e(\theta)) / \alpha, \quad \text{and} \tag{S.5.26} \\
C_t(\theta) &:= -\alpha \, g(g_t^q(\theta)) + \phi'(g_t^e(\theta)) \big( g_t^e(\theta) - g_t^q(\theta) \big) - \phi(g_t^e(\theta)). \tag{S.5.27}
\end{align}
Thus, for all $\theta \in \Theta$, it holds that
\begin{align}
E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \rho_t(\theta) - \rho_t(\tilde\theta) \big|^{2r} \right]^{1/(2r)}
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \notag \\
&\quad + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| B_t(\theta) \mathbb{1}_{t+h}(\theta) - B_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \, Y_{t+h}^{2r} \right]^{1/(2r)} \notag \\
&\quad + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| C_t(\theta) - C_t(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \notag \\
&\quad + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \mathbb{1}_{t+h}(\theta) - \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \, |g(Y_{t+h})|^{2r} \right]^{1/(2r)}, \tag{S.5.28}
\end{align}
by Minkowski's inequality (and as the sup-operator satisfies the triangle inequality). We start by considering the first term in (S.5.28),
\begin{align}
&E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \tag{S.5.29} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) \big|^{2r} \right]^{1/(2r)} + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)}, \tag{S.5.30}
\end{align}
where the first term is bounded from above by $E\big[ \sup_{\tilde\theta \in U(\theta, \delta)} \|\nabla_\theta A_t(\tilde\theta)\|^{2r} \big]^{1/(2r)} \delta$. For the second term, we get that
\begin{align}
&E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) - A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \tag{S.5.31} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\tilde\theta) \big|^{2r} \, E_t\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \mathbb{1}_{t+h}(\theta) - \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right] \right]^{1/(2r)} \tag{S.5.32} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| A_t(\tilde\theta) \big|^{2r} \, E_t\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \nabla_\theta g_t^q(\tilde\theta) \, h_t(g_t^q(\tilde\theta)) \big\|^{2r} \right] \right]^{1/(2r)} \delta, \tag{S.5.33}
\end{align}
by arguments as in the proof of Lemma B.1 of Dimitriadis and Bayer (2019). Similar reasons apply to the second term in (S.5.28), where by arguments similar to equation (58) of Dimitriadis and Bayer (2019),
$$E_t\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \mathbb{1}_{t+h}(\theta) Y_{t+h} - \mathbb{1}_{t+h}(\tilde\theta) Y_{t+h} \big|^{2r} \right] \le \sup_{\tilde\theta \in U(\theta, \delta)} \Big| \nabla_\theta g_t^q(\tilde\theta) \big( g_t^q(\tilde\theta) \big)^{2r} h_t(g_t^q(\tilde\theta)) \Big| \, \delta. \tag{S.5.34}$$
Consequently, the second term is bounded by
\begin{align}
&E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| B_t(\theta) \mathbb{1}_{t+h}(\theta) Y_{t+h} - B_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) Y_{t+h} \big|^{2r} \right]^{1/(2r)} \tag{S.5.35} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \nabla_\theta B_t(\tilde\theta) Y_{t+h} \big|^{2r} + \sup_{\tilde\theta \in U(\theta, \delta)} \Big| B_t(\tilde\theta) \nabla_\theta g_t^q(\tilde\theta) \, g_t^q(\tilde\theta)^{2r} \, h_t(g_t^q(\tilde\theta)) \Big|^{2r} \right]^{1/(2r)} \delta. \tag{S.5.36}
\end{align}
Equivalent arguments apply to the fourth term in (S.5.28), which is bounded by
$$E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \Big| B_t(\tilde\theta) \nabla_\theta g_t^q(\tilde\theta) \, g(g_t^q(\tilde\theta)) \, h_t(g_t^q(\tilde\theta)) \Big|^{2r} \right]^{1/(2r)} \delta. \tag{S.5.37}$$
Eventually, the third term in (S.5.28) is bounded from above by
$$E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \|\nabla_\theta C_t(\tilde\theta)\|^{2r} \right]^{1/(2r)} \delta. \tag{S.5.38}$$
As the respective moments are finite by condition (C) for all $1 \le t \le T$ and all $T \ge 1$, it follows that
$$\sup_{1 \le t \le T, \, T \ge 1} E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \rho_t(\theta) - \rho_t(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \le C \delta, \tag{S.5.39}$$
which concludes this proof.
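The loss function that both proofs decompose can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the loss for user-supplied $g$, $\phi$, $\phi'$ and $a$ (all function and argument names are hypothetical), defaults to the choice $g(z) = 0$, $\phi(z) = -\log(-z)$ mentioned in Section S.3, and verifies that the decomposition of (S.5.3) into $A_t$, $B_t$ and $C_t$ reproduces the same value.

```python
import math


def joint_loss(y, q, e, alpha,
               g=lambda z: 0.0,
               phi=lambda z: -math.log(-z),
               dphi=lambda z: -1.0 / z,
               a=lambda z: 0.0):
    """Joint VaR/ES loss for realization y, quantile forecast q and ES forecast e.

    Defaults: g(z) = 0 and phi(z) = -log(-z), which requires e < 0."""
    hit = 1.0 if y <= q else 0.0
    return ((hit - alpha) * g(q) - hit * g(y)
            + dphi(e) * (e - q + (q - y) * hit / alpha)
            - phi(e) + a(y))


def decomposed_loss(y, q, e, alpha,
                    g=lambda z: 0.0,
                    phi=lambda z: -math.log(-z),
                    dphi=lambda z: -1.0 / z,
                    a=lambda z: 0.0):
    """Same loss, written as A*1 + B*1 + C + a as in (S.5.3)-(S.5.6)."""
    hit = 1.0 if y <= q else 0.0
    A = g(q) - g(y)
    B = dphi(e) / alpha * (q - y)
    C = dphi(e) * (e - q) - phi(e) - alpha * g(q)
    return A * hit + B * hit + C + a(y)


if __name__ == "__main__":
    # Illustrative numbers only: one VaR violation and one non-violation.
    for y in (-3.0, 0.5):
        direct = joint_loss(y, -2.0, -2.5, 0.025, g=lambda z: z)
        split = decomposed_loss(y, -2.0, -2.5, 0.025, g=lambda z: z)
        print(y, direct, split)
```

The indicator enters only through the `hit` variable, which is the single source of discontinuity in $\theta$ that the type IV class argument has to control.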

Lemma 3 (Consistency of the HAC Estimator). Given Assumption 1 and Assumption 2, it holds that $\hat I_T \overset{P}{\longrightarrow} I$.

Proof.

For this proof, we adapt the proof of Newey and West (1987) such that it allows for the discontinuity in $\psi_t(\theta)$. For this, we use a slightly different expansion than in equation (9) of Newey and West (1987) and we have to rely on a uniform law of large numbers in order to establish the desired convergence.

We start by showing the following uniform convergence for all $j \le T$,
$$\sup_{\theta \in \Theta} \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta) - E\big[ \psi_t(\theta) \psi_{t-j}^\top(\theta) \big] \right\| \overset{P}{\longrightarrow} 0. \tag{S.5.40}$$
For this, a pointwise law of large numbers (e.g., Corollary 3.48 of White (2001)) holds as $E\big[ \|\psi_t(\theta)\|^{2(r+\delta)} \big] < \infty$ for some $\delta > 0$, and by Lemma 4, $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ is stochastically equicontinuous. Consequently, a uniform law of large numbers holds, see e.g. Andrews (1992) for details.

Consequently, by defining $\Psi_j(\theta) := E\big[ \psi_t(\theta) \psi_{t-j}^\top(\theta) \big]$, we get that
\begin{align}
&\left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \right\| \tag{S.5.41} \\
&\le \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - \Psi_j(\hat\theta_T) \right\| + \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \Psi_j(\hat\theta_T) - \Psi_j(\theta_0) \right\| \tag{S.5.42} \\
&\le \sup_{\theta \in \Theta} \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta) - \Psi_j(\theta) \right\| + \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \Psi_j(\hat\theta_T) - \Psi_j(\theta_0) \right\|. \tag{S.5.43}
\end{align}
The first term converges to zero by (S.5.40) and, as the function $\Psi_j$ is continuous in $\theta$, the second term converges to zero by the continuous mapping theorem and as $\hat\theta_T$ is consistent. This also implies that for $T$ sufficiently large, it holds (with probability approaching one) that
$$\left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \right\| \le 2 \sup_{\theta \in \Theta} \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta) - \Psi_j(\theta) \right\|. \tag{S.5.44}$$
Furthermore, as $E\big[ \sup_{\theta \in \Theta} \big\| \psi_t(\theta) \psi_{t-j}^\top(\theta) - \Psi_j(\theta) \big\|^{2(r+\delta)} \big] < \infty$ by assumption and as in equation (10) in the proof of Newey and West (1987), we get that for all $j \ge 0$,
$$E\left[ \bigg( \sum_{t \in \mathfrak{T}_j} \sup_{\theta \in \Theta} \big\| \psi_t(\theta) \psi_{t-j}^\top(\theta) - \Psi_j(\theta) \big\| \bigg)^2 \right] \le T (j+1) D^*, \tag{S.5.45}$$
for some finite constant $D^*$. Consequently, for any $\varepsilon > 0$,
\begin{align}
&P\left( \sum_{j=1}^{m_T} z(j, m_T) \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \right\| > \varepsilon \right) \notag \\
&\le \sum_{j=1}^{m_T} P\left( \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \right\| > \frac{\varepsilon}{C m_T} \right) \notag \\
&\le \sum_{j=1}^{m_T} P\left( \sum_{t \in \mathfrak{T}_j} \sup_{\theta \in \Theta} \big\| \psi_t(\theta) \psi_{t-j}^\top(\theta) - E\big[ \psi_t(\theta) \psi_{t-j}^\top(\theta) \big] \big\| > \frac{\varepsilon T}{2 C m_T} \right) \notag \\
&\le \sum_{j=1}^{m_T} E\left[ \bigg( \sum_{t \in \mathfrak{T}_j} \sup_{\theta \in \Theta} \big\| \psi_t(\theta) \psi_{t-j}^\top(\theta) - E\big[ \psi_t(\theta) \psi_{t-j}^\top(\theta) \big] \big\| \bigg)^2 \right] \frac{4 C^2 m_T^2}{T^2 \varepsilon^2} \notag \\
&\le \sum_{j=1}^{m_T} T (j+1) D^* \frac{4 C^2 m_T^2}{T^2 \varepsilon^2} = \frac{4 D^* C^2}{\varepsilon^2} \, \frac{m_T^3 (m_T + 3)}{2 T}, \tag{S.5.46}
\end{align}
where $C$ is a finite bound on the weights $z(j, m_T)$, and where we employ (S.5.44) in the second inequality, Markov's inequality in the penultimate line and (S.5.45) in the last line. The term in (S.5.46) converges to zero as $m_T^3 (m_T + 3) / T \to 0$ since $m_T = o(T^{1/4})$.

Now, similar to Newey and West (1987), we split
\begin{align}
\big\| \hat I_T - I \big\| &\le \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_t^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_t^\top(\theta_0) \big] \right\| \tag{S.5.47} \\
&\quad + 2 \sum_{j=1}^{m_T} z(j, m_T) \left\| \frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\hat\theta_T) \psi_{t-j}^\top(\hat\theta_T) - E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \right\| \tag{S.5.48} \\
&\quad + 2 \sum_{j=1}^{m_T} | z(j, m_T) - 1 | \, \big\| E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \big\| \tag{S.5.49} \\
&\quad + 2 \sum_{j=m_T+1}^{T} \big\| E\big[ \psi_t(\theta_0) \psi_{t-j}^\top(\theta_0) \big] \big\|. \tag{S.5.50}
\end{align}
The terms in the first two lines converge to zero in probability by (S.5.44) and (S.5.46). The proofs for the terms in the last two lines follow the approach in the proof of Theorem 2 in Newey and West (1987). This concludes this proof.

Lemma 4 (Stochastic Equicontinuity for the HAC Estimator). Given Assumption 1 and Assumption 2, the function $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ is stochastically equicontinuous, where
\begin{align}
\psi_t(\theta) &= \nabla g_t^q(\theta) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right) \big( \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} - \alpha \big) \tag{S.5.51} \\
&\quad + \nabla g_t^e(\theta) \, \phi''(g_t^e(\theta)) \left( g_t^e(\theta) - g_t^q(\theta) + \frac{1}{\alpha} (g_t^q(\theta) - Y_{t+h}) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} \right). \tag{S.5.52}
\end{align}

Proof.

We start by showing that the class of functions given by $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ is a type IV class (see Andrews (1994), p. 2278) with index $p = 2r$ (in the notation of Andrews (1994) and where $r > 1$), i.e.,
$$\sup_{t, T} E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \psi_t(\theta) \psi_{t-j}^\top(\theta) - \psi_t(\tilde\theta) \psi_{t-j}^\top(\tilde\theta) \big\|^{2r} \right]^{1/(2r)} \le C \delta, \tag{S.5.53}$$
for all $\theta \in \Theta$, for all $\delta > 0$ and some constant $C$.

First notice that (for $j = 0$),
\begin{align}
&\psi_t(\theta) \psi_t^\top(\theta) \tag{S.5.54} \\
&= \big( \nabla g_t^q(\theta) \nabla^\top g_t^q(\theta) \big) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right)^2 \big( \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} (1 - 2\alpha) + \alpha^2 \big) \tag{S.5.55} \\
&\quad + \big( \nabla g_t^e(\theta) \nabla^\top g_t^e(\theta) \big) \, \phi''(g_t^e(\theta))^2 \left( g_t^e(\theta) - g_t^q(\theta) + \frac{1}{\alpha} (g_t^q(\theta) - Y_{t+h}) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} \right)^2 \tag{S.5.56} \\
&\quad + 2 \big( \nabla g_t^q(\theta) \nabla^\top g_t^e(\theta) \big) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right) \big( \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} - \alpha \big) \times \tag{S.5.57} \\
&\qquad \phi''(g_t^e(\theta)) \left( g_t^e(\theta) - g_t^q(\theta) + \frac{1}{\alpha} (g_t^q(\theta) - Y_{t+h}) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} \right) \tag{S.5.58} \\
&=: \tilde A_t(\theta) \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}} + \tilde B_t(\theta), \tag{S.5.59}
\end{align}
where
\begin{align}
\tilde B_t(\theta) &:= \big( \nabla g_t^q(\theta) \nabla^\top g_t^q(\theta) \big) \, \alpha^2 \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right)^2 + \big( \nabla g_t^e(\theta) \nabla^\top g_t^e(\theta) \big) \, \phi''(g_t^e(\theta))^2 \big( g_t^e(\theta) - g_t^q(\theta) \big)^2 \notag \\
&\quad + 2 \big( \nabla g_t^q(\theta) \nabla^\top g_t^e(\theta) \big) \, \phi''(g_t^e(\theta)) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right) \alpha \big( g_t^q(\theta) - g_t^e(\theta) \big), \tag{S.5.60}
\end{align}
and
\begin{align}
\tilde A_t(\theta) &:= \big( \nabla g_t^q(\theta) \nabla^\top g_t^q(\theta) \big) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right)^2 (1 - 2\alpha) \notag \\
&\quad + \big( \nabla g_t^e(\theta) \nabla^\top g_t^e(\theta) \big) \, \phi''(g_t^e(\theta))^2 \left[ \frac{1}{\alpha^2} (g_t^q(\theta) - Y_{t+h})^2 + \frac{2}{\alpha} \big( g_t^e(\theta) - g_t^q(\theta) \big) (g_t^q(\theta) - Y_{t+h}) \right] \notag \\
&\quad + 2 \big( \nabla g_t^q(\theta) \nabla^\top g_t^e(\theta) \big) \, \phi''(g_t^e(\theta)) \left( g(g_t^q(\theta)) + \frac{\phi'(g_t^e(\theta))}{\alpha} \right) \left[ \big( g_t^e(\theta) - g_t^q(\theta) \big) + \frac{1 - \alpha}{\alpha} (g_t^q(\theta) - Y_{t+h}) \right]. \tag{S.5.61}
\end{align}
Further notice that both $\tilde A_t(\theta)$ and $\tilde B_t(\theta)$ are continuously differentiable. In the following, we use the short notation $\mathbb{1}_{t+h}(\theta) = \mathbb{1}_{\{Y_{t+h} \le g_t^q(\theta)\}}$. For all $\theta \in \Theta$, it holds that
\begin{align}
E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \psi_t(\theta) \psi_t^\top(\theta) - \psi_t(\tilde\theta) \psi_t^\top(\tilde\theta) \big\|^{2r} \right]^{1/(2r)}
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\theta) \mathbb{1}_{t+h}(\theta) - \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \notag \\
&\quad + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde B_t(\theta) - \tilde B_t(\tilde\theta) \big|^{2r} \right]^{1/(2r)}. \tag{S.5.62}
\end{align}
We start by considering the first term in (S.5.62),
\begin{align}
&E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\theta) \mathbb{1}_{t+h}(\theta) - \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \tag{S.5.63} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\theta) \mathbb{1}_{t+h}(\theta) - \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) \big|^{2r} \right]^{1/(2r)} + E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) - \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)}, \tag{S.5.64}
\end{align}
where the first term is bounded from above by $E\big[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \nabla_\theta \tilde A_t(\tilde\theta) \big\|^{2r} \big]^{1/(2r)} \delta$. For the second term, we get that
\begin{align}
&E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\theta) - \tilde A_t(\tilde\theta) \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right]^{1/(2r)} \tag{S.5.65} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\tilde\theta) \big|^{2r} \, E_t\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \mathbb{1}_{t+h}(\theta) - \mathbb{1}_{t+h}(\tilde\theta) \big|^{2r} \right] \right]^{1/(2r)} \tag{S.5.66} \\
&\le E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big| \tilde A_t(\tilde\theta) \big|^{2r} \, E_t\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \nabla_\theta g_t^q(\tilde\theta) \, h_t(g_t^q(\tilde\theta)) \big\|^{2r} \right] \right]^{1/(2r)} \delta, \tag{S.5.67}
\end{align}
by arguments as in the proof of Lemma B.1 of Dimitriadis and Bayer (2019). Eventually, the second term in (S.5.62) is bounded from above by
$$E\left[ \sup_{\tilde\theta \in U(\theta, \delta)} \big\| \nabla_\theta \tilde B_t(\tilde\theta) \big\|^{2r} \right]^{1/(2r)} \delta. \tag{S.5.68}$$
The proofs for $j \ge 1$ follow analogously, and hence the functions $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ are a type IV class of Andrews (1994) with index $p = 2r > 2$. Consequently, by Theorem 5 of Andrews (1994), the class satisfies "Ossiander's $L^{2r}$-entropy" condition and thus, it has an "$L^{2r}$-envelope" given by the supremum of its elements. Consequently, we can apply Theorem 1 (and Application 1) of Doukhan et al. (1995) and obtain that $\frac{1}{T} \sum_{t \in \mathfrak{T}_j} \psi_t(\theta) \psi_{t-j}^\top(\theta)$ is stochastically equicontinuous (see the Remark on p. 410 of Doukhan et al. (1995)).

S.6 Additional Tables and Figures

Table S.1: Empirical Sizes for the GAS processes

                       GAS-t                              VaR/ES GAS
             H0(1)            H0(2)            H0(1)            H0(2)
     T   VaR ES  Aux ES   VaR ES  Aux ES   VaR ES  Aux ES   VaR ES  Aux ES

Linear link function
   250    31.30   20.95    23.30   15.00    30.75   21.80    27.00   17.40
   500    23.45   15.25    15.75   12.25    24.30   19.35    19.55   12.35
  1000    14.80    9.45    12.10    9.25    18.70   13.05    15.90   10.60
  2500    12.70    7.85     8.80    5.85    11.75    9.90    12.30    7.25
  5000     9.60    5.35     8.60    5.75     8.60    7.85     9.30    5.85

Convex link function
   250    20.62   17.07    12.86    9.80     8.30    7.99    10.26    8.05
   500    17.87   16.57     9.70    8.55    15.19   17.41    10.31    7.85
  1000    11.93   10.88     6.83    6.38    15.96   16.79     8.00    6.20
  2500    10.71   10.61     5.90    5.60    10.67   12.02     7.30    4.65
  5000     8.69    7.94     5.47    5.82    10.86   11.37     4.85    3.15

No-crossing link function
   250    13.13   11.77     9.95    7.25     6.87    5.87     8.00    7.55
   500    11.44   10.34    10.55    6.95    10.34   11.05     9.55    8.85
  1000     7.87    7.72     9.90    6.25    10.87   10.82    10.25    9.05
  2500     7.18    6.38    10.10    6.35     8.15    8.05    11.95    8.50
  5000     7.47    6.11     8.55    6.10     8.36    8.31     9.00    6.55

Notes: This table shows the empirical sizes for the encompassing tests for one-step ahead forecasts stemming from the two additional DGPs described in Section S.1, the three link functions, the joint VaR and ES (VaR ES) and auxiliary ES (Aux ES) test and both null hypotheses with a nominal size of 5%. The columns denoted by "GAS-t" contain results for the GARCH(1,1) model with normal innovations and a GAS-t model, whereas those labeled "VaR/ES GAS" report results for the one and two factor GAS models introduced by Patton et al. (2019).

Table S.2: Empirical Sizes for Multi-Step Forecasts

Each row lists the out-of-sample size T, followed by the four forecast horizons h (in increasing order) under H0(1) and the same four horizons under H0(2).

                     H0(1)                           H0(2)
     T

h-step ahead forecasts
   250    8.34   10.26    9.26    5.53     6.96    6.96    7.26    4.64
   500    8.31    8.71   12.56   10.63     4.70    5.42    6.07    6.43
  1000    6.83    7.00   10.16   13.25     3.21    4.22    4.81    6.71
  2500    4.30    4.50    6.60   10.94     3.91    3.92    4.91    7.03
  5000    3.60    4.83    5.82    9.50     3.50    3.30    5.11    6.31

h-step aggregate forecasts
   250    8.44   13.65   21.30   29.91     6.96   10.14   19.80   25.76
   500    8.81   12.51   20.08   30.74     4.90    8.27   16.37   22.25
  1000    6.93    8.63   16.82   25.63     3.31    5.45   13.33   18.07
  2500    4.20    7.04   10.53   20.08     3.81    4.02    7.37   10.63
  5000    3.50    4.92    9.12   15.36     3.30    2.91    5.22    7.92

Notes: This table shows the empirical sizes for the auxiliary ES encompassing test for the h-step ahead and the h-step aggregate forecasts and both null hypotheses with a nominal size of 5%. It shows the results for the GARCH specification with normal innovations and the convex link function.

Table S.3: Parameter Estimates of the Risk Models for the Empirical Application

Volatility Models (column headings in the source: four β coefficients, then v, λ, p, a, b; each row lists its reported estimates in the order printed)
  GARCH-N         0.023   0.859   0.125
  GJR-ST          0.018   0.879   0.001   0.218   7.364   0.869
  GARCH-AL        0.020   0.871   0.129   0.545
  GJR-AL          0.022   0.887  -0.021   0.267   0.560
  GARCH-AL-TVP    0.020   0.870   0.130   0.980
  GJR-AL-TVP      0.022   0.889  -0.020   0.262   0.979
  GAS-1F          0.930  -0.003   0.034  -1.449  -1.848

CAViaR-ES Models (column headings in the source: four β coefficients, then three κ coefficients; each row lists its reported estimates in the order printed)
  SAV            -0.099   0.841  -0.337  -1.233
  AS             -0.072   0.889  -0.004  -0.436   0.006   0.890   0.113

Notes: The entries in this table show parameter estimates from the risk models described in Section 3.2 and Appendix S.3 for the full sample.
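To see how a shape parameter such as those estimated for the GARCH-AL models translates into VaR and ES forecasts, the asymmetric Laplace quantities of Appendix S.3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the inputs are made-up values, and the ES is computed directly as the tail expectation implied by the density (S.3.2), which coincides with the displayed expression $\hat e_t = \hat q_t - \hat q_t / \log(\alpha/p)$ when $\mu_\varepsilon = 0$ (that is, $p = 0.5$).

```python
import math


def al_constants(p):
    """Scale b_p (giving unit variance) and mean mu_eps of AL(0, 1, p)."""
    b_p = math.sqrt(p ** 2 + (1.0 - p) ** 2)
    mu_eps = (1.0 - 2.0 * p) / b_p
    return b_p, mu_eps


def al_cdf_lower(x, p):
    """CDF of eps ~ AL(0, 1, p) at a point x <= 0 (left branch of the density)."""
    b_p, _ = al_constants(p)
    return p * math.exp(b_p * x / p)


def al_var_es(sigma_t, p, alpha):
    """VaR and ES of r_t = sigma_t * (eps_t - mu_eps) for alpha in (0, p)."""
    if not 0.0 < alpha < p < 1.0:
        raise ValueError("requires 0 < alpha < p < 1")
    b_p, mu_eps = al_constants(p)
    eps_quantile = (p / b_p) * math.log(alpha / p)   # alpha-quantile of eps
    q_hat = sigma_t * (eps_quantile - mu_eps)
    # Assumption of this sketch: ES from the exponential left tail of (S.3.2),
    # E[eps | eps <= x] = x - p / b_p for x <= 0.
    e_hat = q_hat - sigma_t * p / b_p
    return q_hat, e_hat


if __name__ == "__main__":
    # Illustrative inputs only: sigma_t = 1.2, p = 0.55, alpha = 2.5%.
    print(al_var_es(1.2, 0.55, 0.025))
```

The guard on `alpha < p` reflects the "relevant area" restriction in the text: the closed form only uses the left branch of the density.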

Table S.[…]: Correlations of VaR and ES One-Step Ahead Forecasts

[Panel "Correlations of VaR Forecasts": upper-triangular correlation matrix over Hist Sim, RiskMetrics, GARCH-N, GJR-ST, GARCH-AL, GJR-AL, GARCH-AL-TVP, GJR-AL-TVP, GAS-1F, SAV-CAViaR-ES and AS-CAViaR-ES. The numerical entries did not survive extraction.]

Notes: This table contains the pairwise correlations for one-step ahead VaR and ES forecasts stemming from the eleven considered risk models.

Table S.[…]: Detailed Encompassing Test Results for One-Step Ahead Forecasts

[The parameter estimates did not survive extraction. "(., .)" marks a lost estimate pair, "*" a rejection, and "-" the diagonal, which carries no entry.]

Joint VaR and ES Test
               GJR-ST         GARCH-AL       GARCH-AL-TVP   GAS-1F         SAV-CAViaR-ES  AS-CAViaR-ES
GJR-ST         -              (., .)         (., .)         (., .)         (., .)         (., .)
GARCH-AL       (., .)*        -              (., .)         (., .)         (., .)         (., .)*
GARCH-AL-TVP   (., .)*        (., .)*        -              (., .)         (., .)*        (., .)*
GAS-1F         (., .)*        (., .)*        (., .)*        -              (., .)*        (., .)*
SAV-CAViaR-ES  (., .)*        (., .)*        (., .)*        (., .)*        -              (., .)*
AS-CAViaR-ES   (., .)*        (., .)*        (., .)         (., .)         (., .)         -

[Panel "Auxiliary ES Test": same row and column models; the numerical entries did not survive extraction.]

Notes: This table reports the estimates of the two convex combination parameters from the convex link function with intercepts for each pair of models. The symbol * indicates that the null hypothesis that the VaR and ES forecasts of a row-heading model jointly encompass those of a column-heading model is rejected at the […]% significance level.

Table S.6: Correlations of VaR and ES Multi-Step Forecasts

Correlations of 10-step Ahead VaR forecasts
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   1.000         0.943         0.907         0.953         0.897         0.989         0.868
GARCH-N                     1.000         0.973         0.994         0.965         0.960         0.933
GJR-ST                                    1.000         0.972         0.991         0.940         0.983
GARCH-AL                                                1.000         0.964         0.969         0.937
GJR-AL                                                                1.000         0.928         0.977
GARCH-AL-TVP                                                                        1.000         0.912
GJR-AL-TVP                                                                                        1.000

Correlations of 10-step Ahead ES forecasts
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   1.000         0.943         0.898         0.952         0.897         0.989         0.869
GARCH-N                     1.000         0.964         0.993         0.964         0.960         0.934
GJR-ST                                    1.000         0.966         0.987         0.934         0.987
GARCH-AL                                                1.000         0.965         0.968         0.939
GJR-AL                                                                1.000         0.928         0.977
GARCH-AL-TVP                                                                        1.000         0.912
GJR-AL-TVP                                                                                        1.000

Correlations of 10-step Aggregate VaR forecasts
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   1.000         0.951         0.910         0.954         0.902         0.982         0.863
GARCH-N                     1.000         0.976         0.994         0.968         0.965         0.938
GJR-ST                                    1.000         0.967         0.987         0.945         0.982
GARCH-AL                                                1.000         0.969         0.962         0.924
GJR-AL                                                                1.000         0.934         0.966
GARCH-AL-TVP                                                                        1.000         0.921
GJR-AL-TVP                                                                                        1.000

Correlations of 10-step Aggregate ES forecasts
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   1.000         0.951         0.908         0.955         0.903         0.984         0.866
GARCH-N                     1.000         0.975         0.995         0.968         0.966         0.941
GJR-ST                                    1.000         0.967         0.987         0.943         0.985
GARCH-AL                                                1.000         0.968         0.965         0.929
GJR-AL                                                                1.000         0.935         0.971
GARCH-AL-TVP                                                                        1.000         0.920
GJR-AL-TVP                                                                                        1.000

Notes: This table reports the pairwise correlations of the seven GARCH-type risk models for the VaR and ES 10-step ahead forecasts in the upper two panels and for the VaR and ES 10-step aggregate forecasts in the lower two panels.
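A pairwise correlation matrix of this kind can be sketched in a few lines. The snippet below is a minimal illustration, assuming Pearson correlations computed over aligned forecast series; the model names `model_a`, `model_b`, `model_c` and the simulated series are hypothetical stand-ins, not the forecasts of the paper.

```python
import numpy as np


def pairwise_correlations(forecasts):
    """Return model names and the Pearson correlation matrix of their forecasts.

    forecasts: dict mapping a model name to a 1-D sequence of forecasts,
    all of equal length and aligned on the same forecast dates.
    """
    names = list(forecasts)
    data = np.vstack([np.asarray(forecasts[name], dtype=float) for name in names])
    return names, np.corrcoef(data)  # each row of `data` is one variable


def print_upper_triangle(names, corr):
    """Print the matrix in the upper-triangular layout used in the table."""
    for i, name in enumerate(names):
        cells = ["     " if j < i else f"{corr[i, j]:.3f}" for j in range(len(names))]
        print(f"{name:<14}" + "  ".join(cells))


# Hypothetical toy forecasts that share a common component.
rng = np.random.default_rng(0)
common = rng.standard_normal(500)
toy = {
    "model_a": -2.0 + 0.5 * common,
    "model_b": -2.1 + 0.5 * common + 0.1 * rng.standard_normal(500),
    "model_c": -1.8 + 0.3 * common + 0.4 * rng.standard_normal(500),
}
names, corr = pairwise_correlations(toy)
print_upper_triangle(names, corr)
```

Correlations close to one, as in the table, mean the competing forecasts carry very similar information, which makes the encompassing comparison between them harder.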

Table S.[…]: Detailed Encompassing Test Results for […]-Day Ahead Forecasts

[The parameter estimates did not survive extraction. "(., .)" marks a lost estimate pair, "*" a rejection, and "-" the diagonal, which carries no entry.]

Joint VaR and ES Test
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   -             (., .)        (., .)        (., .)        (., .)        (., .)*       (., .)*
GARCH-N       (., .)        -             (., .)        (., .)        (., .)        (., .)*       (., .)*
GJR-ST        (., .)*       (., .)        -             (., .)        (., .)        (., .)*       (., .)*
GARCH-AL      (., .)        (., .)        (., .)        -             (., .)        (., .)*       (., .)*
GJR-AL        (., .)        (., .)*       (., .)*       (., .)*       -             (., .)*       (., .)*
GARCH-AL-TVP  (., .)        (., .)        (., .)        (., .)        (., .)        -             (., .)*
GJR-AL-TVP    (., .)        (., .)        (., .)        (., .)        (., .)        (., .)        -

[Panel "Auxiliary ES Test": same row and column models; the numerical entries did not survive extraction.]

Notes: This table reports the estimates of the two convex combination parameters from the convex link function with intercepts for each pair of models for […]-day ahead forecasts. The symbol * indicates that the null hypothesis that the VaR and ES forecasts of a row-heading model jointly encompass those of a column-heading model is rejected at the […]% significance level.

Table S.[…]: Detailed Encompassing Test Results for […]-Day Aggregate Forecasts

[The parameter estimates did not survive extraction. "(., .)" marks a lost estimate pair, "*" a rejection, and "-" the diagonal, which carries no entry.]

Joint VaR and ES Test
              RiskMetrics   GARCH-N       GJR-ST        GARCH-AL      GJR-AL        GARCH-AL-TVP  GJR-AL-TVP
RiskMetrics   -             (., .)*       (., .)*       (., .)*       (., .)*       (., .)*       (., .)*
GARCH-N       (., .)        -             (., .)*       (., .)        (., .)        (., .)*       (., .)*
GJR-ST        (., .)        (., .)        -             (., .)        (., .)        (., .)        (., .)*
GARCH-AL      (., .)        (., .)*       (., .)*       -             (., .)*       (., .)*       (., .)*
GJR-AL        (., .)        (., .)        (., .)        (., .)*       -             (., .)*       (., .)*
GARCH-AL-TVP  (., .)        (., .)        (., .)        (., .)        (., .)        -             (., .)*
GJR-AL-TVP    (., .)        (., .)        (., .)        (., .)        (., .)        (., .)        -

[Panel "Auxiliary ES Test": same row and column models; the numerical entries did not survive extraction.]

Notes: This table reports the estimates of the two convex combination parameters from the convex link function with intercepts for each pair of models for […]-day aggregate forecasts. The symbol * indicates that the null hypothesis that the VaR and ES forecasts of a row-heading model jointly encompass those of a column-heading model is rejected at the […]% significance level.

Table S.9: Backtesting Results for One-Step Ahead Forecasts

Models          Violation Ratio  ESR    UC     CC     DQ     VQR    MF     BD     NZ
Historical Sim  1.47             1.08   […]
[…]
GARCH-AL        0.84             0.99   […]
GJR-AL          0.69             0.98   […]
GJR-AL-TVP      1.01             0.97   0.91   0.40   0.96   0.04   0.43   0.06   0.02
GAS-1F          1.20             1.03   […]
SAV-CAViaR-ES   1.11             1.02   […]
AS-CAViaR-ES    1.15             1.02   […]

Notes: The Violation Ratio is given by $\hat{\alpha}/\alpha$, where $\hat{\alpha} = T^{-1} \sum_{t \in T} \mathbb{1}\{Y_{t+1} < \hat{q}_t\}$, and the empirical ES ratio is computed as $\mathrm{ESR} = \sum_{t \in T} \big[ Y_{t+1} \mathbb{1}\{Y_{t+1} < \hat{q}_t\} \big] / \sum_{t \in T} \big[ \hat{e}_t \mathbb{1}\{Y_{t+1} < \hat{q}_t\} \big]$. Both ratios are expected to equal one for correctly specified VaR and ES forecasts. The remaining columns report backtesting p-values for the unconditional coverage (UC) test of Kupiec (1995), the conditional coverage (CC) test of Christoffersen (1998), the dynamic quantile (DQ) test of Engle and Manganelli (2004), the VQR test of Gaglianone et al. (2011), the ES backtest of McNeil and Frey (2000) (MF), the regression-based ES backtest of Bayer and Dimitriadis (2020) (BD), and the calibration test of Nolde and Ziegel (2017) (NZ). Rows with p-values in bold indicate that, for the respective model, the null hypotheses of all seven backtests cannot be rejected at the 5% significance level.

[Figure S.1 plot: panel columns "Normal innovations, Joint VaR and ES Test", "Normal innovations, Auxiliary ES Test", "Skewed-t innovations, Joint VaR and ES Test", "Skewed-t innovations, Auxiliary ES Test"; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Link Function" (linear, convex, no crossing).]
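The two ratios defined in the notes to Table S.9 can be computed directly from realized returns and the forecast series. The following is a minimal sketch, assuming aligned arrays in which entry t holds $Y_{t+1}$, $\hat{q}_t$ and $\hat{e}_t$; the function names, the probability level 0.025 and the simulated data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def violation_ratio(y, var_forecast, alpha):
    """alpha_hat / alpha, with alpha_hat the share of returns below the VaR forecast."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(var_forecast, dtype=float)
    hits = y < q  # indicator 1{Y_{t+1} < q_hat_t}
    return float(hits.mean() / alpha)


def es_ratio(y, var_forecast, es_forecast):
    """Sum of returns on violation days divided by the sum of ES forecasts on those days."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(var_forecast, dtype=float)
    e = np.asarray(es_forecast, dtype=float)
    hits = y < q
    if not hits.any():
        return float("nan")  # the ratio is undefined without any violation
    return float(y[hits].sum() / e[hits].sum())


# Toy check with correctly specified forecasts for i.i.d. standard normal returns.
alpha = 0.025        # illustrative probability level (assumption)
q_true = -1.959964   # 2.5% quantile of N(0, 1)
e_true = -2.337803   # matching Expected Shortfall, -phi(q_true) / alpha
rng = np.random.default_rng(0)
returns = rng.standard_normal(200_000)
vr = violation_ratio(returns, np.full(returns.size, q_true), alpha)
esr = es_ratio(returns, np.full(returns.size, q_true), np.full(returns.size, e_true))
print(f"violation ratio = {vr:.3f}, ESR = {esr:.3f}")
```

With correctly specified forecasts both numbers should be close to one, which is the benchmark the table's first two columns are read against.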

Figure S.1: This figure shows raw power curves (empirical rejection frequencies) for the joint VaR and ES and the auxiliary ES encompassing tests with a nominal size of 5%. The employed link functions are indicated by line color and symbol shape, while the line type refers to the tested null hypothesis. The plot rows depict different sample sizes, while the plot columns show results for the two innovation distributions described in (3.1)-(3.3) and for the joint and the auxiliary tests. An ideal test exhibits a rejection rate of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

[Figure S.2 plot: panel columns h = 1, h = 2, h = 5, h = 10; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Forecast Type" (h-step ahead, h-step aggregate).]

Figure S.2: This figure shows raw power curves (empirical rejection frequencies) for the joint VaR and ES encompassing test with a nominal size of 5%, for h-step ahead and h-step aggregate forecasts indicated with different colors, and for the two tested null hypotheses indicated with different line types. The plot rows depict different sample sizes, while the plot columns refer to different forecast horizons h. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$. Note that we use a Bernoulli draw based combination method in this section as opposed to the variance combination in Section 3.2; hence, the results of the one-step ahead forecasts are not necessarily identical.

[Figure S.3 plot: panel columns "VaR-ES-GAS, Joint VaR and ES Test", "VaR-ES-GAS, Auxiliary ES Test", "GAS-t, Joint VaR and ES Test", "GAS-t, Auxiliary ES Test"; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Link Function" (linear, convex, no crossing).]

Figure S.3: This figure shows size-adjusted power curves for the joint VaR and ES encompassing test and the auxiliary ES test with a nominal size of 5% and for one-step ahead forecasts of the two GAS-based DGPs described in Section S.1. The plot rows depict different sample sizes, while the colors indicate the three different link functions and the line types refer to the two tested null hypotheses. The plot columns show results for the models described in (S.1.2), (S.1.3) and (S.1.1) and for the joint and auxiliary tests. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

[Figure S.4 plot: panel columns "VaR-ES-GAS, Joint VaR and ES Test", "VaR-ES-GAS, Auxiliary ES Test", "GAS-t, Joint VaR and ES Test", "GAS-t, Auxiliary ES Test"; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Link Function" (linear, convex, no crossing).]

Figure S.4: This figure shows raw power curves (empirical rejection frequencies) for the joint VaR and ES encompassing test and the auxiliary ES test with a nominal size of 5% and for one-step ahead forecasts of the two GAS-based DGPs described in Section S.1. The plot rows depict different sample sizes, while the colors indicate the three different link functions and the line types refer to the two tested null hypotheses. The plot columns show results for the models described in (S.1.2), (S.1.3) and (S.1.1) and for the joint and auxiliary tests. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

[Figure S.5 plot: panel columns h = 1, h = 2, h = 5, h = 10; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Forecast Type" (h-step ahead, h-step aggregate).]

Figure S.5: This figure shows size-adjusted power curves for the auxiliary ES encompassing test with a nominal size of 5% for h-step ahead and h-step aggregate forecasts stemming from the GARCH process specifications in (3.1)-(3.3). The h-step ahead and aggregate forecasts are indicated by different colors and the two tested null hypotheses are indicated with different line types. The plot rows depict different sample sizes, while the plot columns refer to different forecast horizons h = 1, 2, 5, 10. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

[Figure S.6 plot: panel columns h = 1, h = 2, h = 5, h = 10; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Forecast Type" (h-step ahead, h-step aggregate).]

Figure S.6: This figure shows raw power curves (empirical rejection frequencies) for the auxiliary ES encompassing test with a nominal size of 5% for h-step ahead and h-step aggregate forecasts stemming from the GARCH process specifications in (3.1)-(3.3). The h-step ahead and aggregate forecasts are indicated by different colors and the two tested null hypotheses are indicated with different line types. The plot rows depict different sample sizes, while the plot columns refer to different forecast horizons h = 1, 2, 5, 10. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

[Figure S.7 plot: panel columns h = 1, h = 2, h = 5, h = 10; panel rows by sample size T; axis labels "p" and "rejection rate"; legends "Tested Hypothesis" and "Covariance" (scl-sp, OGP, OGP & HAC, scl-sp & HAC).]

Figure S.7: This figure shows raw power curves (empirical rejection frequencies) for the joint VaR and ES encompassing test with a nominal size of 5% for h-step ahead forecasts stemming from the GARCH process specifications in (3.1)-(3.3). The plot rows depict different sample sizes, the plot columns show the different forecast horizons h, the colors indicate the different covariance estimators, and the line types refer to the two tested null hypotheses. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.

Figure S.8: This figure shows raw power curves (empirical rejection frequencies) for the joint VaR and ES encompassing test with a nominal size of 5% for h-step aggregate forecasts stemming from the GARCH process specifications in (3.1)-(3.3). The plot rows depict different sample sizes, the plot columns show the different forecast horizons h, the colors indicate the different covariance estimators, and the line types refer to the two tested null hypotheses. An ideal test exhibits a rejection frequency of 5% for $\pi = 0$ and for $H_0^{(1)}$ (and inversely for $\pi = 1$ and $H_0^{(2)}$) and as sharply increasing rejection rates as possible for increasing (decreasing) values of $\pi$.
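The distinction between raw and size-adjusted power curves in these figures can be illustrated with a small Monte Carlo sketch. The code below assumes a toy two-sided mean test as a stand-in for the encompassing tests: the statistic, the normal p-values, the shift `0.2 * pi` and all function names are hypothetical and do not reproduce the boundary-based asymptotics or the DGPs of the paper. Raw power is the share of replications with a p-value below the nominal level; size-adjusted power replaces the asymptotic critical value with the empirical quantile of the statistic simulated under the null.

```python
import math

import numpy as np


def rejection_frequency(pvalues, level=0.05):
    """Raw power: share of Monte Carlo replications with p-value below the level."""
    return float(np.mean(np.asarray(pvalues, dtype=float) < level))


def size_adjusted_rejection_frequency(stats_alt, stats_null, level=0.05):
    """Size-adjusted power: reject when the statistic exceeds the empirical
    (1 - level) quantile of statistics simulated under the null hypothesis."""
    critical_value = np.quantile(np.asarray(stats_null, dtype=float), 1.0 - level)
    return float(np.mean(np.asarray(stats_alt, dtype=float) > critical_value))


def toy_statistics(shift, n_obs, n_rep, rng):
    """|z|-statistics of a toy mean test (a stand-in, not the encompassing test)."""
    samples = rng.normal(loc=shift, scale=1.0, size=(n_rep, n_obs))
    return np.abs(np.sqrt(n_obs) * samples.mean(axis=1))


def two_sided_pvalues(abs_z):
    """Standard normal two-sided p-values, 2 * (1 - Phi(|z|))."""
    return np.array([math.erfc(z / math.sqrt(2.0)) for z in abs_z])


rng = np.random.default_rng(1)
n_obs, n_rep = 250, 2000
null_stats = toy_statistics(0.0, n_obs, n_rep, rng)

raw_curve, adjusted_curve = {}, {}
for pi in (0.0, 0.5, 1.0):  # pi = 0 is the null; larger pi moves away from it
    stats = toy_statistics(0.2 * pi, n_obs, n_rep, rng)
    raw_curve[pi] = rejection_frequency(two_sided_pvalues(stats))
    adjusted_curve[pi] = size_adjusted_rejection_frequency(stats, null_stats)
    print(f"pi = {pi:.1f}: raw = {raw_curve[pi]:.3f}, size-adjusted = {adjusted_curve[pi]:.3f}")
```

Size adjustment makes tests with different finite-sample sizes comparable in power, while the raw curves show what a practitioner using the asymptotic critical values would actually observe.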