Computation of Expected Shortfall by fast detection of worst scenarios
Bruno Bouchard∗, Adil Reghai†, Benjamin Virrion‡,§

May 27, 2020
Abstract
We consider a multi-step algorithm for the computation of the historical expected shortfall as defined by the Basel Minimum Capital Requirements for Market Risk. At each step of the algorithm, we use Monte Carlo simulations to reduce the number of historical scenarios that potentially belong to the set of worst scenarios. The number of simulations increases as the number of candidate scenarios is reduced and the distance between them diminishes. For the most naive scheme, we show that the $L^p$-error of the estimator of the Expected Shortfall is bounded by a linear combination of the probabilities of inversion of favorable and unfavorable scenarios at each step, and of the last-step Monte Carlo error associated to each scenario. By using concentration inequalities, we then show that, for sub-gamma pricing errors, the probabilities of inversion converge at an exponential rate in the number of simulated paths. We then propose an adaptive version in which the algorithm improves step by step its knowledge of the unknown parameters of interest: the means and variances of the Monte Carlo estimators of the different scenarios. Both schemes can be optimized by using dynamic programming algorithms that can be solved off-line. To our knowledge, these are the first non-asymptotic bounds for such estimators. Our hypotheses are weak enough to allow for the use of estimators for the different scenarios and steps based on the same random variables, which, in practice, considerably reduces the computational effort. First numerical tests are performed.

Keywords:
Expected Shortfall, ranking and selection, sequential design, Bayesian filter.
The Basel Minimum Capital Requirements for Market Risk [4] has brought two main changes in the way that investment banks need to compute their capital requirements. First, Expected Shortfall (ES) replaces Value at Risk (VaR) as the main risk indicator for the computation of capital requirements. The advantages of ES over VaR have been brought forward in Artzner et al. [2], and Expected Shortfall is now considered by most researchers and practitioners as superior to VaR as a risk measure, because it respects the sub-additivity axiom, see [1, 2, 23]. The second main change is that the number of required daily computations of ES has been multiplied by a factor of up to 90. Where banks used to need to compute one VaR per day, they now need to compute up to three ES per liquidity horizon and risk class, as well as three ES per liquidity horizon for all risk classes combined. The above has triggered several works on the fast computation of ES.

∗ CEREMADE, CNRS, Université Paris Dauphine, PSL University. † Natixis. ‡ Natixis and CEREMADE, CNRS, Université Paris Dauphine, PSL University. § The authors would like to thank Nicolas Baradel for helping with the code, Rida Mahi and Mathieu Bernardo from the Natixis Quantitative Research Teams for providing the first results and ideas on the Fast Detection Algorithm, and finally William Leduc for providing all the necessary data to obtain the different book parameters. All over this paper, we measure the performances in terms of losses: a positive number is a loss, a negative number is a gain.

Mathematically, if $V$ is a random variable modeling the level of loss of a portfolio that will be known at a future time, the expected shortfall of level $\alpha \in (0,1)$ is defined by
$$\mathrm{ES}_\alpha := \frac{1}{\alpha}\int_0^\alpha \mathrm{VaR}_\gamma(V)\, d\gamma, \tag{1}$$
where $\mathrm{VaR}_\gamma$ is the Value at Risk at level $\gamma$, i.e.
$$\mathrm{VaR}_\gamma(V) := \max\{x \in \mathbb{R} : \mathbb{P}[V \ge x] > \gamma\}. \tag{2}$$
Nearly all of the literature concentrates on studying the ES by using parametric, non-parametric or semi-parametric approaches to approximate the distribution of $V$ based on historical data; see in particular [9, 11, 12, 15, 16, 19, 20, 21, 24, 25, 26, 27]. Another approach consists in using the fact that $V$ is the risk-neutral value of a book, and is therefore of the form $\mathbb{E}[P|S]$, in which $S$ is a random variable associated to market parameters and $P$ represents the future (discounted) payoffs of the book. This suggests using a nested Monte Carlo approach: simulate a set of values in the distribution of $S$ (outer scenarios) and, for each simulation of $S$, compute a Monte Carlo estimator of $\mathbb{E}[P|S]$ by using simulations in the conditional distribution (inner scenarios). This is for instance the approach of [8, 13].

But, as defined in the regulatory document of Basel [4], the expected shortfall is based on $n_s = 253$ scenarios of market parameters $s = (s^i)_{i \le n_s}$ that are generated in a deterministic way. Therefore, $S$ is just uniformly distributed on the sequence $s$ and there is no need for simulating outer scenarios.
Since $V$ is defined by a pricing formula $\mathbb{E}[P|S]$ that is fully described by the value of $S$, there is also no room for approximating the law of $V$ based on historical data, if we are only interested in the requirements of [4]. The only issue is to compute in an efficient way the loss impacts $(\mu^i)_{i\le n_s}$ of the book,
$$\mu^i := \mathbb{E}[P|S=s^i] - \mathbb{E}[P|S=s^0], \quad i=1,\dots,n_s,$$
in which $s^0$ is the current value of the market parameters, and then compute the average over the $n_w = 6$ worst impacts, say
$$\mathrm{ES} = \frac{1}{n_w}\sum_{i=1}^{n_w} \mu^i, \tag{3}$$
if, for ease of notations, we assume that
$$\mu^1 \ge \mu^2 \ge \cdots \ge \mu^{n_s-1} \ge \mu^{n_s}. \tag{4}$$
Methods that are in line with the above have also been studied, in particular in [17, 22], in which the authors define a distance on the space of scenarios induced by the distance between their risk factors. Starting with the original outer-level scenarios (called "prediction points"), they determine "design points" that are included in their convex hull. Inner-level paths are simulated in order to evaluate the portfolio value at the design points. These values are then used to establish a metamodel of the portfolio price with respect to the risk factors, and this metamodel is then used to select, among the prediction points, those that are most likely to be part of the worst scenarios set. They are then added to the design points and evaluated by using inner-level simulations, after which the metamodel is updated.

These methods are very smart but neglect one important point for practitioners: the cost of launching a pricer is high, as it typically entails instantiating thousands of objects at initialization, as well as volatility surface calibrations and sometimes even graphical interfaces.
Furthermore, these pricers usually do not have the flexibility to add dynamically, at each inner-level pricing, new paths to a given scenario. Therefore, we do not allow ourselves to adapt our strategies at such a level of granularity. Instead, we will consider strategies that only entail $L$ levels of sets of simulations, where $L$ is typically quite low, so as not to pay too many times the overhead of launching the pricer and/or calibrating the required volatility surfaces. We also do not use any concept of distance between scenarios induced by their risk factors. Although this enables [17] and [22] to obtain better empirical convergence rates, we see at least one problem with this approach: at the scale of a bank, the space of risk factors is both of a very high dimension (a few thousand) and of a very complex geometry (the payoffs of the portfolio's derivative products are usually non-convex and path-dependent), so that it is very difficult to establish a model describing the proximity of scenarios in a robust way.

We thus study a relatively simple procedure that also has the advantage of allowing us to establish non-asymptotic bounds on the $L^p$-error of our estimator, in the spirit of the simplest ranking-by-mean procedures, see e.g. [3, 5, 6, 14]. It consists in using a first set of simulated paths to provide a crude estimation of the impact factors $\mu^i$. These first estimators are ranked to select the $q_1 < n_s$ outer-level scenarios with the highest estimated impact values. Then, only the impact values of these $q_1$ pre-selected scenarios are estimated again, by using the previous estimators together with a new set of simulated paths. Among these new estimators, we select the scenarios with the $q_2 < q_1$ highest estimated impact factors. And so on.
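The whole procedure can be sketched in a few lines of code. This is only an illustration under simplifying assumptions, not the authors' implementation: the function `draw` below is a hypothetical stand-in for the bank's pricer (it returns a batch of inner samples of $P$ for each requested scenario index), and the sketch reuses all previously simulated paths across steps, as the weak hypotheses discussed later allow.

```python
import numpy as np

def fast_detection_es(draw, n_s, q, N, n_w=6):
    """Multi-step estimator of the ES as the average of the n_w worst
    estimated impacts.  `draw(idx, n)` must return a (len(idx), n) array of
    inner samples of P given S = s^i for each i in idx.  `q` lists the
    numbers q_1 >= ... >= q_{L-1} = n_w of scenarios kept after each
    selection step, and `N` the cumulated path counts N_1 <= ... <= N_L.
    Paths are reused across steps: only increments N_l - N_{l-1} are drawn."""
    idx = np.arange(n_s)                      # I_0 = [[1, n_s]]
    sums = np.zeros(n_s)                      # running sums of inner samples
    counts = np.zeros(n_s, dtype=int)
    n_prev = 0
    for n_l, q_l in zip(N, list(q) + [n_w]):
        batch = draw(idx, n_l - n_prev)       # simulate the increment only
        sums[idx] += batch.sum(axis=1)
        counts[idx] += n_l - n_prev
        n_prev = n_l
        mu_hat = sums[idx] / counts[idx]      # step-l estimators
        order = np.argsort(-mu_hat, kind="stable")  # stable sort = tie rule
        idx = idx[order[:q_l]]                # I_l: keep the q_l highest
    return (sums[idx] / counts[idx]).mean(), np.sort(idx)
```

With a noiseless toy `draw`, the estimator recovers the average of the $n_w$ largest impacts exactly; in practice, the point of the scheme is how few paths the early elimination steps need.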
After $L \ge 2$ such steps, $L$ being small in practice, we just keep the mean of the six highest estimated impacts.

The rationale behind this is that a first crude estimation should be sufficient to rule out a large part of the scenarios from the candidates of being in the 6 worst ones, because the corresponding values should be far enough apart. As the number of candidates is reduced, one can expect that the differences between the corresponding impacts diminish as well, and that more Monte Carlo simulations are needed to differentiate them. Under an indifference zone hypothesis, similar to the one used in the above-mentioned papers, and a sub-gamma distribution assumption, the convergence is exponential in the number of simulations used at the different steps, and of order $1/\sqrt{N_L}$ in the number $N_L$ of last-step simulations.

In this section, we describe the simplest version of the algorithm. It uses a pre-defined deterministic number of simulations. We establish a strong $L^p$-error bound and discuss several ways of choosing the optimal strategy for minimizing this error.

From now on, we assume that $\mathbb{E}[P|S=s^0]$ is known perfectly and set it to $0$ for ease of notations. As explained above, the algorithm relies on the idea of selecting progressively the scenarios that will be used for the computation of the Expected Shortfall. Namely, let $P|s := (P|s^i)_{i\le n_s}$ be an $n_s$-dimensional random variable such that each $P|s^i$ has the law of $P$ given $S=s^i$. We first simulate independent copies $(P^1_j,\dots,P^{n_s}_j)_{j\ge1}$ of $P|s$ and compute the Monte Carlo estimators of $\mathbb{E}[P|S=s^i]$, $i\le n_s$:
$$\hat\mu^i_1 := \frac{1}{N_1}\sum_{j=1}^{N_1} P^i_j \quad \text{for } i \le n_s,$$
for some $N_1 \ge 1$. Among these random variables, we then select the ones that are the most likely to coincide with the worst scenarios $s^1,\dots,s^{n_w}$, for some $1 \le n_w < n_s$. To do this, one considers the (random) permutation $m_1$ of $[[1,n_s]]$ such that the components of $(\hat\mu^{m_1(i)}_1)_{i\le n_s}$ are in decreasing order:
$$\hat\mu^{m_1(1)}_1 \ge \hat\mu^{m_1(2)}_1 \ge \cdots \ge \hat\mu^{m_1(n_s)}_1, \quad m_1(i) < m_1(i') \text{ if } \hat\mu^{m_1(i)}_1 = \hat\mu^{m_1(i')}_1 \text{ for } 1 \le i < i' \le n_s,$$
and we only keep the indexes $(m_1(\ell))_{\ell\le q_1}$ of the corresponding $q_1 \ge n_w$ highest values, i.e. the indexes belonging to $I_1 := I_0 \cap m_1([[1,q_1]])$, in which $I_0 := [[1,n_s]]$.
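In code, the first $q_1$ values of the permutation $m_1$ can be obtained with a stable sort on the negated estimates, which automatically enforces the tie-breaking rule above (a sketch; `mu_hat` is a hypothetical vector of first-step estimators $\hat\mu^i_1$):

```python
import numpy as np

def select_worst(mu_hat, q):
    """Indexes m_1(1), ..., m_1(q) of the q largest estimated impacts.
    A stable sort on the negated values enforces the tie-breaking rule:
    the smaller scenario index wins when two estimates are equal."""
    order = np.argsort(-np.asarray(mu_hat, dtype=float), kind="stable")
    return order[:q].tolist()
```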
We then iterate the above procedure on the scenarios in $I_1$, and so on. Namely, we fix $L \ge 2$ and a non-increasing sequence $(q_\ell)_{\ell=0,\dots,L-1}$ such that
$$n_w =: q_{L-1} \le \cdots \le q_0 := n_s. \tag{5}$$
Assuming that $I_{\ell-1}$ is given, for some $1 \le \ell \le L-1$, we compute the estimators
$$\hat\mu^i_\ell := \frac{1}{N_\ell}\sum_{j=1}^{N_\ell} P^i_j \quad \text{for } i \le n_s, \tag{6}$$
for some $N_\ell \ge N_{\ell-1}$. If $\ell \le L-1$, we consider the (random) permutation $m_\ell : [[1,q_{\ell-1}]] \mapsto I_{\ell-1}$ such that the components of $(\hat\mu^i_\ell)_{i\in I_{\ell-1}}$ are in decreasing order,
$$\hat\mu^{m_\ell(1)}_\ell \ge \hat\mu^{m_\ell(2)}_\ell \ge \cdots \ge \hat\mu^{m_\ell(q_{\ell-1})}_\ell, \quad m_\ell(i) < m_\ell(i') \text{ if } \hat\mu^{m_\ell(i)}_\ell = \hat\mu^{m_\ell(i')}_\ell \text{ for } 1 \le i < i' \le q_{\ell-1}, \tag{7}$$
and we only keep the elements in
$$I_\ell := I_{\ell-1} \cap m_\ell([[1,q_\ell]])$$
for the next step. If $\ell = L$, we just compute the final estimator of the ES given by
$$\widehat{\mathrm{ES}} := \frac{1}{n_w}\sum_{i=1}^{n_w} \hat\mu^{m_{L-1}(i)}_L = \frac{1}{n_w}\sum_{i\in I_{L-1}} \hat\mu^i_L.$$
Note that only the first $L-1$ steps involve a selection, while the last step $L$ is a pure Monte Carlo step. Again, the general idea is to reduce little by little the number of candidate scenarios to be part of the worst ones. As the number of candidates diminishes, one increases the number of simulated paths so as to reduce the variance of our Monte Carlo estimators and be able to differentiate between potentially closer true values of the associated conditional expectations.

Remark 2.1.
Note that, given $j$, we do not assume that the $P^i_j$, $i\le n_s$, are independent. The simulations associated to different scenarios are in general not independent. Moreover, the $\hat\mu^i_\ell$, $\ell\le L$, use the same simulated paths; only the number of simulations used changes. Both permit to reduce the computational cost, by allowing the use of the same simulations of the underlying processes across scenarios and steps.

$L^p$ error

In this section, we first provide a general $L^p$ estimate of the error. A more tractable formulation will be provided in Corollary 2.3 under an additional sub-gamma distribution assumption. From now on, we assume that $P|s \in L^p$ for all $p \ge 1$, and we use the notations
$$q := (q_0,q_1,\dots,q_L), \qquad N := (N_0,N_1,\dots,N_L),$$
$$\delta q_\ell := q_{\ell-1} - q_\ell \quad\text{and}\quad \delta N_\ell := N_\ell - N_{\ell-1}, \quad \text{for } 1 \le \ell \le L, \tag{8}$$
$$\delta\hat\mu^i_\ell := \frac{\sum_{j=N_{\ell-1}+1}^{N_\ell} P^i_j}{\delta N_\ell} = \frac{N_\ell\hat\mu^i_\ell - N_{\ell-1}\hat\mu^i_{\ell-1}}{\delta N_\ell}, \quad \text{for } 1 \le i \le n_s, \tag{9}$$
with the convention $0/0 := 0$. Note that only the elements $(\hat\mu^i_\ell)_{i\in I_{\ell-1}}$ are needed in practice; the others are only defined here because they will be used in our proofs. The elements $q_L$ and $N_0$ are defined for notational convenience and never appear in our algorithm; to fix ideas, they can be set to $q_L = n_w$ and $N_0 = 0$ all over this paper.

Proposition 2.2. For all $p \ge 1$,
$$\mathbb{E}\big[\big|\mathrm{ES}-\widehat{\mathrm{ES}}\big|^p\big]^{\frac1p} \le \sum_{\ell=1}^{L-1}(\delta q_\ell)^{\frac1p}\max_{(i,k)\in[[1,n_w]]\times[[q_\ell+1,n_s]]}(\mu^i-\mu^k)\,\mathbb{P}\big[\hat\mu^k_\ell>\hat\mu^i_\ell\big]^{\frac1p}$$
$$+\;\frac{1}{n_w}\frac{\delta N_L}{N_L}\max_{1\le i_1<\cdots<i_{n_w}\le n_s}\mathbb{E}\Big[\Big|\sum_{k=1}^{n_w}\big(\delta\hat\mu^{i_k}_L-\mu^{i_k}\big)\Big|^p\Big]^{\frac1p}+\frac{1}{n_w}\frac{N_{L-1}}{N_L}\sum_{i\le n_s}\mathbb{E}\big[\big|\hat\mu^i_{L-1}-\mu^i\big|^p\big]^{\frac1p}. \tag{10}$$

Each term of the first sum corresponds to the situation in which an element $i \in [[1,n_w]]$ gets out of the set of selected indexes $I_\ell$ exactly at the $\ell$-th step, $\ell = 1,\dots,L-1$. In the worst situation, it is replaced by an element of index $k$ larger than $q_\ell$, and this can happen only if $\hat\mu^k_\ell > \hat\mu^i_\ell$. The probability of this event is controlled by the number of Monte Carlo simulations $N_\ell$ used at step $\ell$, but also by the distance between the two scenarios. More specifically, for $\ell$ small, one expects that $\mathbb{P}[\hat\mu^k_\ell > \hat\mu^i_\ell]$ is small because the law of $P|s^k$ is concentrated far away from where the law of $P|s^i$ is. This quantity potentially increases with $\ell$, as we reduce the number of selected indexes. This should be compensated by an increase in the number of Monte Carlo simulations used. Otherwise stated, we expect to balance the various terms of (10) by considering a suitable increasing sequence $(N_\ell)_{\ell\le L}$. Obviously, (10) implies that the algorithm converges as $N_\ell \to \infty$ for all $\ell \le L$; see Proposition 3.3 below for a proof in a more general framework.

Proof of Proposition 2.2.
We split the error into a permutation error and a Monte Carlo error:
$$\mathbb{E}\big[\big|\mathrm{ES}-\widehat{\mathrm{ES}}\big|^p\big]^{\frac1p} \le \mathbb{E}\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\mu^i-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p}+\mathbb{E}\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\hat\mu^{m_{L-1}(i)}_L-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p}. \tag{11}$$
Let us first consider the second term, which corresponds to a Monte Carlo error. Since $\hat\mu^i_L = \frac{N_{L-1}}{N_L}\hat\mu^i_{L-1}+\frac{\delta N_L}{N_L}\delta\hat\mu^i_L$, we have
$$\mathbb{E}\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\hat\mu^{m_{L-1}(i)}_L-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} \le \frac{N_{L-1}}{N_L}\frac1{n_w}\mathbb{E}\Big[\Big|\sum_{i\le n_w}\big(\hat\mu^{m_{L-1}(i)}_{L-1}-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p}+\frac{\delta N_L}{N_L}\frac1{n_w}\mathbb{E}\Big[\Big|\sum_{i\le n_w}\big(\delta\hat\mu^{m_{L-1}(i)}_L-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p},$$
in which, by Minkowski's inequality,
$$\mathbb{E}\Big[\Big|\sum_{i\le n_w}\big(\hat\mu^{m_{L-1}(i)}_{L-1}-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} \le \sum_{i\le n_s}\mathbb{E}\big[\big|\hat\mu^i_{L-1}-\mu^i\big|^p\big]^{\frac1p},$$
while, by conditioning on $m_{L-1}$ (which only depends on the paths used up to $N_{L-1}$, while $\delta\hat\mu_L$ only uses the paths $N_{L-1}+1,\dots,N_L$),
$$\mathbb{E}\Big[\Big|\sum_{i\le n_w}\big(\delta\hat\mu^{m_{L-1}(i)}_L-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} = \mathbb{E}\Big[\mathbb{E}\Big[\Big|\sum_{i\le n_w}\big(\delta\hat\mu^{m_{L-1}(i)}_L-\mu^{m_{L-1}(i)}\big)\Big|^p\,\Big|\,m_{L-1}\Big]\Big]^{\frac1p} \le \max_{1\le i_1<\cdots<i_{n_w}\le n_s}\mathbb{E}\Big[\Big|\sum_{k=1}^{n_w}\big(\delta\hat\mu^{i_k}_L-\mu^{i_k}\big)\Big|^p\Big]^{\frac1p}.$$
This provides the two last terms in (10).

We now turn to the first term in (11). For $\ell \le L-1$, let $\mathcal{S}_{q_\ell}[I_{\ell-1}]$ denote the set of the $q_\ell$ smallest elements of $I_{\ell-1}$, i.e. those associated to the $q_\ell$ largest true impacts by (4), and set $R_\ell := I_\ell \setminus \mathcal{S}_{q_\ell}[I_{\ell-1}]$, the set of selected indexes that should not have been selected. If $i \in I_{\ell-1}\setminus I_\ell$ for some $i \le n_w$, then $R_\ell$ is non-empty and we can pick $k_\ell(i) \in R_\ell$, which satisfies $k_\ell(i) > q_\ell$ and $\hat\mu^{k_\ell(i)}_\ell > \hat\mu^i_\ell$ by our tie-breaking convention. Moreover,
$$|R_\ell| \le q_{\ell-1}-q_\ell, \tag{12}$$
since $R_\ell \subset I_{\ell-1}\setminus\mathcal{S}_{q_\ell}[I_{\ell-1}]$ and $|I_{\ell-1}| = q_{\ell-1}$. Let $\mathcal{A}_{q,q'}$ denote the collection of subsets $A$ of $[[q+1,n_s]]$ such that $|A| = q'$. Then, it follows from (4), Hölder's inequality and (12) that
$$\mathbb{E}\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\mu^i-\mu^{m_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} \le \frac1{n_w}\sum_{i\le n_w}\sum_{\ell=1}^{L-1}\mathbb{E}\Big[\big|\mu^i-\mu^{k_\ell(i)}\big|^p\,\mathbf{1}_{\{i\in I_{\ell-1}\setminus I_\ell\}}\Big]^{\frac1p} \le \max_{i\le n_w}\sum_{\ell=1}^{L-1}\mathbb{E}\Big[\big|\mu^i-\mu^{k_\ell(i)}\big|^p\,\mathbf{1}_{\{i\in I_{\ell-1}\setminus I_\ell\}}\Big]^{\frac1p}$$
$$\le \max_{i\le n_w}\sum_{\ell=1}^{L-1}\Big(\max_{A\in\mathcal{A}_{q_\ell,\delta q_\ell}}\sum_{k\in A}\mathbb{E}\big[\big|\mu^i-\mu^k\big|^p\,\mathbf{1}_{\{i\in I_{\ell-1}\setminus I_\ell,\,k_\ell(i)=k\}}\big]\Big)^{\frac1p} \le \sum_{\ell=1}^{L-1}(\delta q_\ell)^{\frac1p}\max_{(i,k)\in[[1,n_w]]\times[[q_\ell+1,n_s]]}(\mu^i-\mu^k)\,\mathbb{P}\big[\hat\mu^k_\ell>\hat\mu^i_\ell\big]^{\frac1p}. \qquad\square$$

To illustrate how the general error bound of Proposition 2.2 can be used in practice to decide on the sequence $(q_\ell,N_\ell)_\ell$, we now consider the case where the components of $P|s$ have sub-gamma distributions, and apply Bernstein's inequality in (10), see e.g. [7, Chapter 2]. This requires the following assumption.

Assumption 1.
There exists $c \in \mathbb{R}_+$ such that the random variables
$$Z[i,k] := (P|s^i - \mu^i) - (P|s^k - \mu^k), \quad i,k \le n_s,$$
satisfy Bernstein's condition:
$$\mathbb{E}\big[|Z[i,k]|^p\big] \le \frac{p!}{2}\,c^{p-2}\,\mathbb{E}\big[Z[i,k]^2\big], \quad i,k \le n_s, \text{ for all integers } p \ge 3.$$
From now on, we shall assume that the constant $c$ is known. It can usually be estimated in practice.

Corollary 2.3.
Assume that Assumption 1 holds. Then, for all $p \ge 1$,
$$\mathbb{E}\big[\big|\mathrm{ES}-\widehat{\mathrm{ES}}\big|^p\big]^{\frac1p} \le F_p(q,N) \tag{13}$$
in which, with $\sigma_{ik}^2 := \mathbb{E}[Z[i,k]^2]$, $\sigma_i^2 := \mathrm{Var}(P|s^i)$, $C_{p,\sigma} := 2^p\Gamma(p/2)$ and $C_{p,c} := 2^{2p+1}\Gamma(p)$,
$$F_p(q,N) := \sum_{\ell=1}^{L-1}(\delta q_\ell)^{\frac1p}\max_{(i,k)\in[[1,n_w]]\times[[q_\ell+1,n_s]]}(\mu^i-\mu^k)\,e^{-\frac{N_\ell(\mu^i-\mu^k)^2}{2p(\sigma_{ik}^2+c(\mu^i-\mu^k))}}$$
$$+\;\frac{1}{n_w}\frac{\delta N_L}{N_L}\max_{1\le i_1<\cdots<i_{n_w}\le n_s}\sum_{k=1}^{n_w}\Big(C_{p,\sigma}\,p\,\sigma_{i_k}^p(\delta N_L)^{-\frac p2}+C_{p,c}\,p\,c^p(\delta N_L)^{-p}\Big)^{\frac1p}+\frac{1}{n_w}\frac{N_{L-1}}{N_L}\sum_{i\le n_s}\Big(C_{p,\sigma}\,p\,\sigma_i^p\,N_{L-1}^{-\frac p2}+C_{p,c}\,p\,c^p\,N_{L-1}^{-p}\Big)^{\frac1p}. \tag{14}$$

The upper-bound of Corollary 2.3 has two advantages over Proposition 2.2. First, the dependence on $(q_\ell,N_\ell)_{\ell\ge1}$ is more explicit. It depends on unknown quantities, but we can estimate (at least rough) confidence intervals for them, see e.g. Section 2.4 below. Second, as we will see in the next section, it allows one to define a tractable deterministic optimal control problem satisfying a dynamic programming principle, or even simple heuristics (see Section 2.5), to select an appropriate sequence $(q_\ell,N_\ell)_{\ell\ge1}$.

Proof of Corollary 2.3.
The first term in (14) is an upper-bound for the first term in the right-hand side of (10), by Bernstein's inequality, see [7, Theorem 2.1]. As for the two other terms in (10), we use the usual argument: for $i \le n_s$,
$$\mathbb{E}\big[\big|\delta\hat\mu^i_L-\mu^i\big|^p\big]=\int_0^\infty p x^{p-1}\,\mathbb{P}\big[|\delta\hat\mu^i_L-\mu^i|\ge x\big]\,dx \quad\text{and}\quad \mathbb{E}\big[\big|\hat\mu^i_{L-1}-\mu^i\big|^p\big]=\int_0^\infty p x^{p-1}\,\mathbb{P}\big[|\hat\mu^i_{L-1}-\mu^i|\ge x\big]\,dx,$$
and then appeal to [7, Theorem 2.1] again to deduce that
$$\mathbb{E}\big[\big|\delta\hat\mu^i_L-\mu^i\big|^p\big] \le \int_0^\infty 2p x^{p-1}e^{-\frac{\delta N_L x^2}{2(\sigma_i^2+cx)}}\,dx \le \int_0^\infty 2p x^{p-1}e^{-\frac{\delta N_L x^2}{4\sigma_i^2}}\mathbf{1}_{\{x\le\sigma_i^2/c\}}\,dx+\int_0^\infty 2p x^{p-1}e^{-\frac{\delta N_L x}{4c}}\mathbf{1}_{\{x>\sigma_i^2/c\}}\,dx$$
$$\le p\Big(\frac{4\sigma_i^2}{\delta N_L}\Big)^{\frac p2}\Gamma\Big(\frac p2\Big)+2p\Big(\frac{4c}{\delta N_L}\Big)^p\Gamma(p) = C_{p,\sigma}\,p\,\sigma_i^p(\delta N_L)^{-\frac p2}+C_{p,c}\,p\,c^p(\delta N_L)^{-p},$$
with $C_{p,\sigma} := 2^p\Gamma(p/2)$ and $C_{p,c} := 2^{2p+1}\Gamma(p)$, and, similarly,
$$\mathbb{E}\big[\big|\hat\mu^i_{L-1}-\mu^i\big|^p\big] \le C_{p,\sigma}\,p\,\sigma_i^p\,N_{L-1}^{-\frac p2}+C_{p,c}\,p\,c^p\,N_{L-1}^{-p}. \qquad\square$$

Remark 2.4. If the $(\hat\mu^i_\ell)_{i\le n_s}$ and $(\delta\hat\mu^i_\ell)_{i\le n_s}$ are Gaussian, which is the case asymptotically, then the bound of Corollary 2.3 remains valid with $c = 0$. This fact will be used later on to simplify our numerical algorithms.

Given $N := (N_\ell)_{0\le\ell\le L}$ and $q := (q_\ell)_{0\le\ell\le L-1}$, the total computation cost is
$$\mathrm{C}(q,N) := \sum_{\ell=0}^{L-1} q_\ell\,(N_{\ell+1}-N_\ell),$$
with the convention $N_0 := 0$.
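As an illustration of this accounting (the numbers below are arbitrary, not those used in the paper), the cost of a strategy can be tallied as follows:

```python
def total_cost(q, N, n_s):
    """Total number of simulated inner paths C(q, N) = sum_l q_l (N_{l+1} - N_l):
    between steps l and l+1, each of the q_l surviving scenarios is priced
    on the N_{l+1} - N_l newly drawn paths (with q_0 = n_s and N_0 = 0)."""
    q_full = [n_s] + list(q)        # q_0, q_1, ..., q_{L-1}
    N_full = [0] + list(N)          # N_0, N_1, ..., N_L
    return sum(q_full[l] * (N_full[l + 1] - N_full[l])
               for l in range(len(q_full)))
```

For instance, pre-screening 253 scenarios with 100 paths, then 50 of them with 1000, then pricing the 6 survivors with 10000 paths costs 124300 path evaluations, against 2530000 for a full nested run at 10000 paths.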
Let $\mathcal{N}$ denote the collection of non-decreasing sequences $N := (N_\ell)_{0\le\ell\le L}$ with values in $\mathbb{N}$ such that $N_0 = 0$, and let $\mathcal{Q}$ denote the collection of non-increasing sequences $q = (q_\ell)_{0\le\ell\le L}$ with values in $[[n_w,n_s]]$ satisfying (5). We write $(q_\ell)_{0\le\ell\le L}$ for convenience; $q_L$ never plays any role. In this section, we fix a total effort $K > 0$ and explain how $F_p(q,N)$, as defined in (14), can be minimized over the collection $\mathcal{A}$ of sequences $(N,q) \in \mathcal{N}\times\mathcal{Q}$ satisfying $\mathrm{C}(q,N) \le K$ by using a standard dynamic programming approach.

Given $(\bar q,\bar N) \in \mathcal{Q}\times\mathcal{N}$ and $0 \le \ell \le L-1$, we write $F_p(\ell,\bar q,\bar N)$ for the part of $F_p(\bar q,\bar N)$ that only depends on the steps after $\ell$:
$$F_p(\ell,\bar q,\bar N) := \frac{1}{n_w}\frac{\delta\bar N_L}{\bar N_L}\max_{1\le i_1<\cdots<i_{n_w}\le n_s}\sum_{k=1}^{n_w}\Big(C_{p,\sigma}\,p\,\sigma_{i_k}^p(\delta\bar N_L)^{-\frac p2}+C_{p,c}\,p\,c^p(\delta\bar N_L)^{-p}\Big)^{\frac1p}$$
$$+\frac{1}{n_w}\frac{\bar N_{L-1}}{\bar N_L}\sum_{i\le n_s}\Big(C_{p,\sigma}\,p\,\sigma_i^p\,\bar N_{L-1}^{-\frac p2}+C_{p,c}\,p\,c^p\,\bar N_{L-1}^{-p}\Big)^{\frac1p}+\mathbf{1}_{\{\ell<L-1\}}\sum_{l=\ell+1}^{L-1}(\delta\bar q_l)^{\frac1p}\max_{(i,k)\in[[1,n_w]]\times[[\bar q_l+1,n_s]]}(\mu^i-\mu^k)\,e^{-\frac{\bar N_l(\mu^i-\mu^k)^2}{2p(\sigma_{ik}^2+c(\mu^i-\mu^k))}},$$
in which $C_{p,\sigma} := 2^p\Gamma(p/2)$ and $C_{p,c} := 2^{2p+1}\Gamma(p)$, so that $F_p(0,\bar q,\bar N) = F_p(\bar q,\bar N)$. We then set
$$\hat F_p(\ell,\bar q,\bar N) := \min\{F_p(\ell,q,N) : (q,N) \in \mathcal{A}(\ell)(\bar q,\bar N)\},$$
in which $\mathcal{A}(\ell)(\bar q,\bar N)$ denotes the set of elements of $\mathcal{A}$ that coincide with $(\bar q,\bar N)$ up to the step $\ell$. In the following, we only write $\mathcal{A}(0)$ for $\ell = 0$, as it does not depend on $(\bar q,\bar N)$. The quantity of interest is $\hat F_p(0) := \min_{\mathcal{A}(0)} F_p$, and $\hat F_p$ satisfies a standard backward dynamic programming principle over $\ell = L-1,\dots,0$, so that the optimal sequence $(q_\ell,N_\ell)_\ell$ can be computed off-line on a grid of values of the remaining computational budget.

In practice, $(\mu^i,\sigma_i)_{i\le n_s}$ and $(\sigma_{ik})_{i,k\le n_s}$ are not known. However, one can consider robust versions of the above. For instance, if we know that there exist some $(\underline\delta_q,\bar\delta_q)_{q\le n_s}$ and $\bar\sigma$ such that
$$0 \le \underline\delta_q \le \mu^i-\mu^k \le \bar\delta_q,\ (i,k)\in[[1,n_w]]\times[[q+1,n_s]], \quad\text{and}\quad \sigma_i\vee\sigma_k\vee\sigma_{ik} \le \bar\sigma,\ (i,k)\in[[1,n_w]]\times[[n_w+1,n_s]], \tag{17}$$
then $F_p(\ell,\bar q,\bar N)$ can be replaced by the upper-bound
$$\frac{\delta\bar N_L}{\bar N_L}\Big(C_{p,\sigma}\,p\,\bar\sigma^p(\delta\bar N_L)^{-\frac p2}+C_{p,c}\,p\,c^p(\delta\bar N_L)^{-p}\Big)^{\frac1p}+\frac{n_s}{n_w}\frac{\bar N_{L-1}}{\bar N_L}\Big(C_{p,\sigma}\,p\,\bar\sigma^p\,\bar N_{L-1}^{-\frac p2}+C_{p,c}\,p\,c^p\,\bar N_{L-1}^{-p}\Big)^{\frac1p}+\mathbf{1}_{\{\ell<L-1\}}\sum_{l=\ell+1}^{L-1}(\delta\bar q_l)^{\frac1p}\,\bar\delta_{\bar q_l}\,e^{-\frac{\bar N_l\,\underline\delta_{\bar q_l}^2}{2p(\bar\sigma^2+c\,\underline\delta_{\bar q_l})}}.$$
Another choice in practice could be to take the ratio $(\mu^{n_w}-\mu^{100})/(100-n_w)$, which amounts to considering only the first part of the curve and neglecting points that are anyway far from the worst scenarios.

Figure 1: $k \mapsto |\mu^{n_w}-\mu^k|$ for four different books.

A natural strategy used in practice consists of one preliminary "fast pricing" step (meaning $N_1$ rather small) and one final "full pricing" step (meaning $N_2$ large). In theory, this corresponds to $L = 3$ with $q_2 = n_w$, $\delta N_2 = 0$ and $\delta N_3 \to \infty$. As $\delta N_3 \to \infty$, the second and third terms in (14) vanish, as well as the component of the first term corresponding to $\ell = 2$. We therefore neglect them. In practice, we only take $N_2$ large enough (and given) from the point of view of the bank, and minimize over $(q_1,N_1)$ the remaining term in (14):
$$F_\infty(q_1,N_1) := (n_s-q_1)^{\frac1p}\max_{(i,k)\in[[1,n_w]]\times[[q_1+1,n_s]]}(\mu^i-\mu^k)\,e^{-\frac{N_1(\mu^i-\mu^k)^2}{2p(\bar\sigma^2+c(\mu^i-\mu^k))}},$$
in which $\bar\sigma$ is estimated as in (17), under the computation cost constraint
$$\mathrm{C}(q_1,N_1) = q_1(N_2-N_1)+n_sN_1 \le K \tag{18}$$
for some given maximal cost $K \in \mathbb{N}^*$. For $N_2$ (or $K$) large enough, the condition (18) leads to minimizing over $q_1 \in [[n_w,n_s]]\cap[1,K/N_2]$ the upper-bound
$$h_p(q_1) := (n_s-q_1)^{\frac1p}\,(q_1+1-n_w)\,\delta\,\exp\Big(-\frac{(K-q_1N_2)\,(q_1+1-n_w)^2\,\delta^2}{2p\,(n_s-q_1)\,\big(\bar\sigma^2+c\,(q_1+1-n_w)\,\delta\big)}\Big). \tag{19}$$
The optimal $q_1^*$ can then be found easily by performing a one-dimensional numerical minimization. Upon replacing $n_s-q_1$ by $n_s$ in the denominator of the exponential term, which provides a further upper-bound, the optimum can even be computed explicitly, see Appendix A. This provides a very easy-to-use algorithm.

Considering the case $p = 1$, let us now perform first numerical tests to see if the proxy based on $h_1$ is far from $F_\infty$. We use the parameters of Tables 1 and 2 below and $\mu^i = -i\delta$, $i \le n_s$, where $\delta := (\mu^{n_w}-\mu^{100})/(100-n_w)$ for the $\mu^i$s of Figure 6.
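Minimizing (19) numerically is a cheap one-dimensional search. A minimal sketch for $p = 1$ follows; the parameter values are placeholders chosen for illustration, not the calibrated book parameters of Tables 1 and 2:

```python
import math

# Placeholder parameters (illustrative only, not the paper's calibration)
N_S, N_W, DELTA, SIGMA, C, K, N2 = 253, 6, 0.1, 1.0, 0.0, 10**5, 10**3

def h1(q):
    """Heuristic upper bound (19) for p = 1, as a function of q = q_1."""
    gap = (q + 1 - N_W) * DELTA     # mu^{n_w} - mu^{q+1} when mu^i = -i*delta
    budget = K - q * N2             # first-step budget implied by (18)
    if budget <= 0:
        return math.inf             # cost constraint cannot be met
    expo = budget * gap**2 / (2.0 * (N_S - q) * (SIGMA**2 + C * gap))
    return (N_S - q) * gap * math.exp(-expo)

# One-dimensional search for the optimal q_1
q_star = min(range(N_W, N_S), key=h1)
```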
In particular, we take $c = 0$, see Remark 2.4. In Figure 5, the two increasing curves show the optimum $q_1^*$ (right axis) as found when applying the deterministic dynamic programming algorithm (dashed line) of Section 2.4 associated to the real sample book curve of Figure 6, and the heuristic (solid line) based on (19). The two decreasing curves show the corresponding $F_\infty(q_1^*)$ (left axis) found when applying the deterministic dynamic programming algorithm (dotted line) and the heuristic (dash-dot line). We see that the heuristic and the real minimizer are extremely close. The noise in the lines associated to the dynamic programming algorithm is due to grid effects.

Figure 5: $q_1^*$ vs $K$ for the distribution of Figure 6.

Table 1: Book parameters ($\delta$, $c$, $\sigma$, $\rho$). Table 2: Computing power ($n_s$, $n_w$, $K$, $N_2$).

Figure 6: Sample book distribution: $i \mapsto \mu^i$ for $i \le n_s$.

Adaptive algorithm
Although the true value $\theta^\circ = (\mu^\circ,\Sigma^\circ)$ of the vector of means and of the covariance matrix of $P|s$ is unknown, we can set a prior distribution on it, e.g. based on previous Monte Carlo experiments, rather than just working with robust bounds as at the end of Section 2.4. Since the estimation of ES uses Monte Carlo simulations of $P|s$, the knowledge of these quantities can be improved along the different steps $\ell$ of our estimation procedure. This suggests an adaptive algorithm for the optimization of the numerical effort allocation, in which we learn progressively the true value of these parameters, or part of them. From now on, we therefore view the true value of the parameters as a random variable $\tilde\theta := (\tilde\mu,\tilde\Sigma)$ on which a prior law $\nu$ is set. At each step $\ell$, new Monte Carlo simulations will allow us to update this prior, and our strategy for the next steps accordingly.

Let us first adapt the proof of Proposition 2.2 and Corollary 2.3 to the case where the controls are not deterministic but stochastic processes. Given a stochastic process $\alpha$ with values in $\mathcal{Q}\times\mathcal{N}$, we set $(q^\alpha,N^\alpha) := \alpha$, where $q^\alpha$ and $N^\alpha$ are respectively $\mathcal{Q}$- and $\mathcal{N}$-valued. We then define $\hat\mu^\alpha = (\hat\mu^\alpha_\ell)_{\ell\le L}$ and $(I^\alpha_\ell,m^\alpha_\ell)_{\ell\le L}$ as in Section 2.1, except that we see $\hat\mu^\alpha_\ell$ as a $q^\alpha_\ell$-dimensional random variable with entries $(\hat\mu^{\alpha,i}_\ell)_{i\in I^\alpha_\ell}$. We use the same convention for $\delta\hat\mu^\alpha_\ell$, recall (9). We say that $\alpha$ is admissible if it is predictable with respect to $(\mathcal{F}^\alpha_\ell)_{\ell\le L}$, in which $\mathcal{F}^\alpha_0$ is trivial and $\mathcal{F}^\alpha_\ell = \mathcal{F}^\alpha_{\ell-1}\vee\sigma(P^i_j,\,(i,j)\in I^\alpha_\ell\times[[1,N^\alpha_\ell]])$. We call $\mathcal{A}^{ad}$ the collection of such processes. Then, one defines
$$\widehat{\mathrm{ES}}{}^\alpha := \frac{1}{n_w}\sum_{i=1}^{n_w}\hat\mu^{\alpha,m^\alpha_{L-1}(i)}_L, \quad \alpha\in\mathcal{A}^{ad}.$$
The true value of the expected shortfall is now also written as a random variable
$$\widetilde{\mathrm{ES}} := \frac{1}{n_w}\sum_{i=1}^{n_w}\tilde\mu^{\tilde m(i)},$$
in which $\tilde m$ is the random permutation such that
$$\tilde\mu^{\tilde m(1)} \ge \tilde\mu^{\tilde m(2)} \ge \cdots \ge \tilde\mu^{\tilde m(n_s)}, \quad \tilde m(i) < \tilde m(i') \text{ if } \tilde\mu^{\tilde m(i)} = \tilde\mu^{\tilde m(i')} \text{ for } 1 \le i < i' \le n_s.$$
We let $\mathcal{M}$ be a collection of laws on $\mathbb{R}^{n_s}\times\mathbb{S}_{n_s}$, where $\mathbb{S}_{n_s}$ denotes the collection of covariance matrices of size $n_s$. Given $\nu\in\mathcal{M}$, we denote by $\mathbb{E}_\nu$ the expectation operator given that $\tilde\theta$ admits the law $\nu$. When $\nu$ is a Dirac mass, we retrieve the situation of Section 2 (up to re-ordering the components of $\mu$ in a deterministic way).

We first provide a natural extension of Proposition 2.2.

Proposition 3.1.
For all $p \ge 1$, $\nu\in\mathcal{M}$, and $\alpha\in\mathcal{A}^{ad}$,
$$\mathbb{E}_\nu\big[\big|\widetilde{\mathrm{ES}}-\widehat{\mathrm{ES}}{}^\alpha\big|^p\big]^{\frac1p} \le \frac{1}{n_w}\,\mathbb{E}_\nu\Big[\Big|\sum_{i\in I^\alpha_{L-1}}\big(\hat\mu^{\alpha,i}_L-\tilde\mu^i\big)\Big|^p\Big]^{\frac1p} \tag{20}$$
$$+\;\sum_{\ell=1}^{L-1}\mathbb{E}_\nu\Big[\delta q^\alpha_\ell\max_{(i,k)\in\tilde m^\alpha_{\ell-1}([[1,n_w]]\times[[q^\alpha_\ell+1,q^\alpha_{\ell-1}]])}(\tilde\mu^i-\tilde\mu^k)^p\,\mathbb{P}_\nu\big[\hat\mu^{\alpha,k}_\ell>\hat\mu^{\alpha,i}_\ell\,\big|\,\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)\big]\Big]^{\frac1p},$$
with the convention $\max\emptyset = 0$, and in which $\tilde m^\alpha_{\ell-1}$ is defined as $\tilde m$ but on the subset $I^\alpha_{\ell-1}$ instead of $I^\alpha_0 = [[1,n_s]]$.

Proof. We proceed as in the proof of Proposition 2.2 to obtain
$$\mathbb{E}_\nu\big[\big|\widetilde{\mathrm{ES}}-\widehat{\mathrm{ES}}{}^\alpha\big|^p\big]^{\frac1p} \le \mathbb{E}_\nu\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\hat\mu^{\alpha,m^\alpha_{L-1}(i)}_L-\tilde\mu^{m^\alpha_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p}+\mathbb{E}_\nu\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\tilde\mu^{\tilde m(i)}-\tilde\mu^{m^\alpha_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p},$$
in which
$$\mathbb{E}_\nu\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\hat\mu^{\alpha,m^\alpha_{L-1}(i)}_L-\tilde\mu^{m^\alpha_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} = \frac1{n_w}\,\mathbb{E}_\nu\Big[\Big|\sum_{i\in I^\alpha_{L-1}}\big(\hat\mu^{\alpha,i}_L-\tilde\mu^i\big)\Big|^p\Big]^{\frac1p}.$$
We define $k^\alpha_\ell$ as $k_\ell$ in the proof of Proposition 2.2 for the strategy $\alpha$, with $R_\ell$ replaced by $R^\alpha_\ell := I^\alpha_\ell\setminus\tilde m(\mathcal{S}_{q^\alpha_\ell}[\tilde m^{-1}(I^\alpha_{\ell-1})])$. Then,
$$\mathbb{E}_\nu\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\big(\tilde\mu^{\tilde m(i)}-\tilde\mu^{m^\alpha_{L-1}(i)}\big)\Big|^p\Big]^{\frac1p} \le \mathbb{E}_\nu\Big[\Big|\frac1{n_w}\sum_{i\le n_w}\sum_{\ell=1}^{L-1}\big(\tilde\mu^{\tilde m(i)}-\tilde\mu^{k^\alpha_\ell(\tilde m(i))}\big)\mathbf{1}_{\{\tilde m(i)\in I^\alpha_{\ell-1}\setminus I^\alpha_\ell\}}\Big|^p\Big]^{\frac1p}$$
$$\le \frac1{n_w}\sum_{\ell=1}^{L-1}\sum_{i\le n_w}\mathbb{E}_\nu\Big[\big|\tilde\mu^{\tilde m(i)}-\tilde\mu^{k^\alpha_\ell(\tilde m(i))}\big|^p\,\mathbf{1}_{\{\tilde m(i)\in I^\alpha_{\ell-1}\setminus I^\alpha_\ell\}}\Big]^{\frac1p}$$
$$\le \frac1{n_w}\sum_{\ell=1}^{L-1}\sum_{i\le n_w}\mathbb{E}_\nu\Big[\sum_{k\in\tilde m^\alpha_{\ell-1}([[q^\alpha_\ell+1,q^\alpha_{\ell-1}]])}\mathbb{E}_\nu\big[\big|\tilde\mu^{\tilde m(i)}-\tilde\mu^k\big|^p\,\mathbf{1}_{\{\tilde m(i)\in I^\alpha_{\ell-1}\setminus I^\alpha_\ell,\,k^\alpha_\ell(\tilde m(i))=k\}}\,\big|\,\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)\big]\Big]^{\frac1p}$$
$$\le \sum_{\ell=1}^{L-1}\mathbb{E}_\nu\Big[\delta q^\alpha_\ell\max_{(i,k)\in\tilde m^\alpha_{\ell-1}([[1,n_w]]\times[[q^\alpha_\ell+1,q^\alpha_{\ell-1}]])}(\tilde\mu^i-\tilde\mu^k)^p\,\mathbb{P}_\nu\big[\hat\mu^{\alpha,k}_\ell>\hat\mu^{\alpha,i}_\ell\,\big|\,\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)\big]\Big]^{\frac1p}. \qquad\square$$

Remark 3.2.
Note that, when $\alpha$ is deterministic and $\nu$ is concentrated on a Dirac mass, the right-hand side of (20) is bounded from above by the right-hand side of (10), which coincides with the bound of Proposition 2.2.

The above guarantees the convergence of the algorithm.
Proposition 3.3.
Let $(K_n)_{n\ge1}\subset\mathbb{N}^*$ be a sequence converging to infinity and let $(\alpha_n)_{n\ge1}$ be a sequence in $\mathcal{A}^{ad}$ such that $\mathrm{C}(q^{\alpha_n},N^{\alpha_n})\le K_n$ for each $n\ge1$. Assume further that $\min_{1\le\ell\le L}N^{\alpha_n}_\ell\to\infty$ a.s. Let $\nu$ be the Dirac mass at $\theta^\circ$. Then,
$$\mathbb{E}_\nu\big[\big|\widetilde{\mathrm{ES}}-\widehat{\mathrm{ES}}{}^{\alpha_n}\big|^p\big]\to0 \quad \text{as } n\to\infty.$$
Proof. It suffices to use the fact that, for some $C_p>0$,
$$\mathbb{E}_\nu\big[\big|\hat\mu^{\alpha_n,i}_\ell-\tilde\mu^i\big|^p\big] \le C_p\,\mathbb{E}_\nu\Big[\frac{\delta N^{\alpha_n}_\ell}{N^{\alpha_n}_\ell}\,\mathbb{E}_\nu\big[\big|\delta\hat\mu^{\alpha_n,i}_\ell-\mu^i_\circ\big|^p\,\big|\,\mathcal{F}^{\alpha_n}_{\ell-1}\big]\Big]+C_p\,\mathbb{E}_\nu\Big[\frac{N^{\alpha_n}_{\ell-1}}{N^{\alpha_n}_\ell}\big|\hat\mu^{\alpha_n,i}_{\ell-1}-\mu^i_\circ\big|^p\Big],$$
in which
$$\frac{\delta N^{\alpha_n}_\ell}{N^{\alpha_n}_\ell}\,\mathbb{E}_\nu\big[\big|\delta\hat\mu^{\alpha_n,i}_\ell-\mu^i_\circ\big|^p\,\big|\,\mathcal{F}^{\alpha_n}_{\ell-1}\big]\to0, \quad \nu\text{-a.s., for } \ell\ge1 \text{ and } i\le n_s.$$
By induction, this implies that
$$\mathbb{E}_\nu\big[\big|\hat\mu^{\alpha_n,i}_\ell-\tilde\mu^i\big|^p\big] = \mathbb{E}_\nu\big[\big|\hat\mu^{\alpha_n,i}_\ell-\mu^i_\circ\big|^p\big]\to0, \quad \text{for } \ell\le L \text{ and } i\le n_s.$$
Moreover, for some $C>0$,
$$\mathbb{E}_\nu\big[(\tilde\mu^i-\tilde\mu^k)^p\,\mathbb{P}_\nu[\hat\mu^{\alpha_n,k}_\ell>\hat\mu^{\alpha_n,i}_\ell\,|\,\mathcal{F}^{\alpha_n}_{\ell-1}\vee\sigma(\tilde\theta)]\big] \le C\,\mathbf{1}_{\{\mu^i_\circ-\mu^k_\circ>0\}}\,\frac{\mathbb{E}_\nu\big[\big|\hat\mu^{\alpha_n,i}_\ell-\hat\mu^{\alpha_n,k}_\ell-(\mu^i_\circ-\mu^k_\circ)\big|\big]}{\mu^i_\circ-\mu^k_\circ}\to0$$
for $i<k$ and $\ell\le L-1$. It then remains to appeal to Proposition 3.1. $\square$

When $\alpha\in\mathcal{A}^{ad}$ is predictable, one can then proceed as in the proof of Corollary 2.3 to derive a more tractable upper-bound. It appeals to the following version of Assumption 1.

Assumption 2.
There exists $c>0$ such that, for all $\nu\in\mathcal{M}$,
$$\mathbb{E}_\nu\big[|Z[i,k]|^p\,\big|\,\sigma(\tilde\theta)\big]\le p!\,c^{p-2}\,\mathbb{E}_\nu\big[Z[i,k]^2\,\big|\,\sigma(\tilde\theta)\big]\quad\nu\text{-a.s., for all }i,k\le n_s,\ p\ge 2.$$

Corollary 3.4.
Let Assumption 2 hold. Then, for all $p\ge 1$, $\alpha\in\mathcal{A}^{ad}$ and $\nu\in\mathcal{M}$,
$$\mathbb{E}_\nu\big[|\widetilde{\mathrm{ES}}-\widehat{\mathrm{ES}}^{\alpha}|^p\big]^{\frac1p}\le \mathrm{F}^{ad}_p(\alpha,\nu)$$
in which
$$\mathrm{F}^{ad}_p(\alpha,\nu):=\frac{1}{n_w}\,\mathbb{E}_\nu\Big[\Big|\sum_{i\in I^\alpha_{L-1}}\hat\mu^{\alpha,i}_L-\tilde\mu^i\Big|^p\Big]^{\frac1p}+\sum_{\ell=1}^{L-1}\mathbb{E}_\nu\big[f^{ad}_p(\ell,\alpha,\tilde\theta)\big]^{\frac1p}$$
where
$$f^{ad}_p(\ell,\alpha,\tilde\theta):=\delta q^\alpha_\ell\max_{(i,k)\in\tilde m^\alpha_{\ell-1}([[1,n_w]]\times[[q^\alpha_\ell+1,\,q^\alpha_{\ell-1}]])}(\tilde\mu^i-\tilde\mu^k)^p\Big(e^{-\frac{\delta N^\alpha_\ell\,(\rho^\alpha_\ell[i,k])^2}{2(\tilde\sigma^2_{ik}+c\,\rho^\alpha_\ell[i,k])}}\,1_{\{\rho^\alpha_\ell[i,k]\ge 0\}}+1_{\{\rho^\alpha_\ell[i,k]<0\}}\Big)\qquad(21)$$
with
$$\rho^\alpha_\ell[i,k]:=\tilde\mu^i-\tilde\mu^k+\frac{N^\alpha_{\ell-1}}{\delta N^\alpha_\ell}\big(\hat\mu^{\alpha,i}_{\ell-1}-\hat\mu^{\alpha,k}_{\ell-1}\big)$$
for $\ell\ge 1$ and $i,k\le n_s$.
Proof. We use Bernstein's inequality, see [7, Theorem 2.1], conditionally to $\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)$, to deduce that
$$\begin{aligned}\mathbb{P}_\nu\big[\hat\mu^{\alpha,k}_\ell>\hat\mu^{\alpha,i}_\ell\,\big|\,\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)\big]&=\mathbb{P}_\nu\Big[\delta\hat\mu^{\alpha,k}_\ell-\tilde\mu^k-(\delta\hat\mu^{\alpha,i}_\ell-\tilde\mu^i)>\tfrac{N^\alpha_{\ell-1}}{\delta N^\alpha_\ell}(\hat\mu^{\alpha,i}_{\ell-1}-\hat\mu^{\alpha,k}_{\ell-1})-(\tilde\mu^k-\tilde\mu^i)\,\Big|\,\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)\Big]\\&\le e^{-\frac{\delta N^\alpha_\ell(\rho^\alpha_\ell[i,k])^2}{2(\tilde\sigma^2_{ik}+c\,\rho^\alpha_\ell[i,k])}}\,1_{\{\rho^\alpha_\ell[i,k]\ge 0\}}+1_{\{\rho^\alpha_\ell[i,k]<0\}}.\end{aligned}$$
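The scenario-inversion bound used in the proof above is cheap to evaluate numerically. The following sketch (an illustration with our own names, not the authors' code) computes the right-hand side of the Bernstein estimate for a given gap $\rho$, variance proxy $\tilde\sigma^2_{ik}$, Bernstein constant $c$ and batch size $\delta N$:

```python
import math

def inversion_bound(rho, sigma2, c, dN):
    """Bernstein-type upper bound on the probability that scenario k
    overtakes scenario i after dN additional simulations.
    rho   : signed gap, as in the definition of rho_l[i, k]
    sigma2: conditional variance proxy of the pairwise pricing error
    c     : sub-gamma (Bernstein) constant of Assumption 2
    dN    : number of additional Monte Carlo paths at this step
    """
    if rho < 0:
        # the empirical ordering already contradicts the true one:
        # only the trivial bound is available
        return 1.0
    return math.exp(-dN * rho**2 / (2.0 * (sigma2 + c * rho)))
```

As expected, the bound decays exponentially in $\delta N$ whenever $\rho > 0$, which is what drives the choice of increasing batch sizes along the steps of the algorithm.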
Let us now describe how the result of Corollary 3.4 can be turned into a (stochastic) dynamic programming algorithm, in the spirit of Section 2.4, that can be implemented in practice. By Jensen's inequality, the upper bound of Corollary 3.4 can be rewritten as
$$\mathbb{E}_\nu\big[|\widetilde{\mathrm{ES}}-\widehat{\mathrm{ES}}^{\alpha}|^p\big]^{\frac1p}\le \mathrm{F}^{ad}_p(0,\alpha,\nu)^{\frac1p}\qquad(22)$$
where
$$\mathrm{F}^{ad}_p(0,\alpha,\nu):=\mathbb{E}_\nu\Big[\Big|\frac{1}{n_w}\sum_{i\in I^\alpha_{L-1}}\hat\mu^{\alpha,i}_L-\tilde\mu^i\Big|^p+\sum_{\ell=1}^{L-1}f^{ad}_p(\ell,\alpha,\tilde\theta)\Big],$$
to which we can associate the optimal control problem
$$\hat{\mathrm{F}}^{ad}_p(\ell,\alpha,\nu)=\operatorname*{ess\,inf}_{\alpha'\in\mathcal{A}^{ad}(\ell,\alpha)}\mathrm{F}^{ad}_p(\ell,\alpha',\nu)\quad\text{for }0\le\ell\le L-1,\ \nu\in\mathcal{M}\text{ and }\alpha\in\mathcal{A}^{ad},$$
where $\mathcal{A}^{ad}(\ell,\alpha):=\{\alpha'=(q',N')\in\mathcal{A}^{ad} : (\alpha'_l)_{l\le\ell}=(\alpha_l)_{l\le\ell}\text{ and }\mathrm{C}(q',N')\le K\}$ and
$$\mathrm{F}^{ad}_p(\ell,\alpha',\nu):=\mathbb{E}_\nu\Big[\Big|\frac{1}{n_w}\sum_{i\in I^{\alpha'}_{L-1}}\hat\mu^{\alpha',i}_L-\tilde\mu^i\Big|^p+1_{\{\ell<L-1\}}\sum_{\ell'=\ell+1}^{L-1}f^{ad}_p(\ell',\alpha',\tilde\theta)\Big].$$

Assumption 3. Given $\nu\in\mathcal{M}$, there exists a measure $m$ such that, for all $\alpha\in\mathcal{A}^{ad}$ and $1\le\ell\le L$, the law of $\mathcal{I}^\alpha_\ell:=(P^{ij},(i,j)\in I^\alpha_{\ell-1}\times[[N^\alpha_{\ell-1}+1,N^\alpha_\ell]])$ given $\mathcal{F}^\alpha_{\ell-1}\vee\sigma(\tilde\theta)$ admits $\nu$-a.s. the density $g^\alpha_\ell(\cdot,\tilde\theta^\alpha_{\ell-1}):=g(\cdot,I^\alpha_{\ell-1},N^\alpha_{\ell-1},N^\alpha_\ell,\tilde\theta^\alpha_{\ell-1})$ with respect to $m$, in which $g$ is a bounded measurable map that is continuous in its first argument, uniformly in the other ones. Moreover, for all $\alpha\in\mathcal{A}^{ad}$ and $\ell\le L$, the law of $\tilde\theta$ given $\mathcal{F}^\alpha_\ell$ belongs to $\mathcal{M}$, $\nu$-a.s.
Under this assumption, we can compute the law $\nu^{\alpha,\ell-1}_\ell$ of $\tilde\theta^\alpha_{\ell-1}=T^\alpha_{\ell-1}(\tilde\theta)$ given $\mathcal{F}^\alpha_\ell$ in terms of its counterpart $\nu^\alpha_{\ell-1}$ given $\mathcal{F}^\alpha_{\ell-1}$, in which $T^\alpha_{\ell-1}:=T^{I^\alpha_{\ell-1}}_{I^\alpha_{\ell-2}}\circ\cdots\circ T^{I^\alpha_1}_{I^\alpha_0}$. It is given by $\nu^{\alpha,\ell-1}_\ell=\mathrm{U}(\ell,\alpha,\nu^\alpha_{\ell-1})$ with $\nu^\alpha_0=\nu$ and
$$\mathrm{U}(\ell,\alpha,\nu^\alpha_{\ell-1})(A):=\frac{\int_{D^\alpha_{\ell-1}}g^\alpha_\ell(\mathcal{I}^\alpha_\ell,\theta)\,1_{\{\theta\in A\}}\,\nu^\alpha_{\ell-1}(d\theta)}{\int_{D^\alpha_{\ell-1}}g^\alpha_\ell(\mathcal{I}^\alpha_\ell,\theta)\,\nu^\alpha_{\ell-1}(d\theta)}\qquad(23)$$
for $A$ a Borel subset of $D^\alpha_{\ell-1}:=T^\alpha_{\ell-1}(\mathbb{R}^{n_s}\times\mathbb{S}^{n_s})$. (Only the conditional law given $\mathcal{F}^\alpha_\ell$ of the components of $\tilde\theta$ corresponding to indexes in $I^\alpha_\ell$ plays a role in the definitions of $\hat{\mathrm{F}}^{ad}_p(\ell,\alpha,\nu)$ and $\mathrm{F}^{ad}_p(\ell,\alpha,\nu)$ below. To avoid introducing new complex notations, we shall indifferently take $\nu$ or only the conditional law of the corresponding components as an argument, depending on the context. As for measurability, we identify $I^\alpha_{\ell-1}$ with the element of $\mathbb{R}^{n_s}$ whose $i$-th entry is $1_{\{i\in I^\alpha_{\ell-1}\}}$.) From this, one can deduce the law $\nu^\alpha_\ell$ of $\tilde\theta^\alpha_\ell=T^\alpha_\ell(\tilde\theta)$ given $\mathcal{F}^\alpha_\ell$, in the form $\nu^\alpha_\ell=\bar{\mathrm{U}}(\ell,\alpha,\nu^\alpha_{\ell-1})$, by simply integrating over the components corresponding to indexes that are not in $I^\alpha_\ell$ (meaning that $\bar{\mathrm{U}}$ is explicit in terms of $\mathrm{U}$). We are now in a position to state our dynamic programming principle, see e.g. [10]. Again, note that the law of $f^{ad}_p(\ell+1,\alpha',\tilde\theta)$ given $\mathcal{F}^{\alpha'}_\ell$ depends on $\tilde\theta$ only through $\tilde\theta^{\alpha'}_\ell$. For ease of notations, we identify all measures with an element of $\mathcal{M}$ (even if supported by a space smaller than $\mathbb{R}^{n_s}\times\mathbb{S}^{n_s}$).

Proposition 3.5. Let Assumption 3 hold.
Then, for all $\alpha\in\mathcal{A}^{ad}$, $0\le\ell\le L-1$ and $\nu\in\mathcal{M}$,
$$\hat{\mathrm{F}}^{ad}_p(\ell,\alpha,\nu)=\operatorname*{ess\,inf}_{\alpha'\in\mathcal{A}^{ad}(\ell,\alpha)}\mathbb{E}_\nu\big[\hat{\mathrm{F}}^{ad}_p(\ell+1,\alpha',\mathrm{U}(\ell+1,\alpha',\nu))+f^{ad}_p(\ell+1,\alpha',\tilde\theta)\,\big|\,\mathcal{F}^\alpha_\ell\big].$$
In principle, this dynamic programming algorithm allows one to estimate numerically the optimal policy $\alpha^\star$ in feedback form, off-line. Importantly, solving this problem given an initial prior $\nu$ is very different from first estimating the parameter $\tilde\theta$ and then solving the control problem as if $\tilde\theta$ were given: in the first case, we take into account the risk due to the uncertainty on the true value of $\tilde\theta$; in the second, we do not.

Remark 3.6. In practice, the algorithm requires estimating and manipulating the law of a high-dimensional parameter, at least in the first steps. But the above can be modified by replacing the filtration $(\mathcal{F}^\alpha_\ell)_{\ell\le L}$ with $(\bar{\mathcal{F}}^\alpha_\ell)_{\ell\le L}$, where $\bar{\mathcal{F}}^\alpha_\ell=\sigma(1_{\{\ell\ge\tau^\alpha\}}P^{ij},(i,j)\in I^\alpha_\ell\times[[1,N^\alpha_\ell]])$ with $\tau^\alpha:=\inf\{l\le L : q^\alpha_l\le\rho\}$ for some $\rho>0$. In this case, no additional information is used up to step $\tau^\alpha$: the update of the prior only takes place from step $\tau^\alpha$ on, and it only concerns $\tilde\theta^\alpha_{\tau^\alpha}$, whose dimension is controlled by $\rho$. As for the first steps of the algorithm, namely before $\tau^\alpha$, one can replace $f^{ad}_p$ by a robust version in the spirit of Section 2.4.

Remark 3.7. The algorithm also requires knowing the conditional density $g^\alpha_\ell$. Although $P|s$ can be simulated, its conditional density is not known in general. However, one can use a proxy and/or again modify the flow of information to reduce to a more explicit situation. Let us consider the situation in which $(\mathcal{F}^\alpha_\ell)_{\ell\le L}$ is replaced by $(\bar{\mathcal{F}}^\alpha_\ell)_{\ell\le L}$ with $\bar{\mathcal{F}}^\alpha_\ell=\bar{\mathcal{F}}^\alpha_{\ell-1}\vee\sigma(\delta\hat\mu^{\alpha,i}_\ell,i\in I^\alpha_\ell)$ and $\bar{\mathcal{F}}^\alpha_0$ trivial.
Then, conditionally to $\bar{\mathcal{F}}^\alpha_{\ell-1}\vee\sigma(\tilde\theta^\alpha_{\ell-1})$, $\sqrt{\delta N^\alpha_\ell}\,(\tilde\Sigma^\alpha_\ell)^{-\frac12}(\delta\hat\mu^\alpha_\ell-\tilde\mu^\alpha_\ell)$ is asymptotically Gaussian as $\delta N^\alpha_\ell$ increases to infinity. In practice, we can proceed as if $\sqrt{\delta N^\alpha_\ell}\,(\tilde\Sigma^\alpha_\ell)^{-\frac12}(\delta\hat\mu^\alpha_\ell-\tilde\mu^\alpha_\ell)$ actually followed a standard Gaussian distribution, conditionally to $\tilde\theta^\alpha_\ell$ and $\mathcal{F}^\alpha_{\ell-1}$, which provides an explicit formula for the conditional density $\bar g^\alpha_\ell$ of $\delta\hat\mu^\alpha_\ell$ given $\tilde\theta^\alpha_{\ell-1}$ and $\mathcal{F}^\alpha_{\ell-1}$, to be plugged into (23). Namely, the updating procedure takes the form $\nu^\alpha_\ell=\check{\mathrm{U}}(\ell,\alpha,\nu^\alpha_{\ell-1})$ where $\check{\mathrm{U}}$ is explicit. Then, if the initial prior $\nu$ is such that $(\tilde\mu,\tilde\Sigma)$ has a Normal-inverse-Wishart distribution, all the posterior distributions $\nu^\alpha_\ell$, $\ell\le L$, are such that $(\tilde\mu,\tilde\Sigma)$ remains in the class of Normal-inverse-Wishart distributions, with parameters that can be computed explicitly from our simulations. Namely, if, given $\bar{\mathcal{F}}^\alpha_\ell$, $\tilde\Sigma$ has the distribution $\mathcal{W}^{-1}_{i^\alpha_\ell}(\Sigma^\alpha_\ell)$ and, given $\tilde\Sigma$, $\tilde\mu$ has the distribution $N(m^\alpha_\ell,\tilde\Sigma/k^\alpha_\ell)$, then the coefficients corresponding to the law given $\bar{\mathcal{F}}^\alpha_{\ell+1}$ are
$$i^\alpha_{\ell+1}=i^\alpha_\ell+\delta N^\alpha_{\ell+1},\qquad k^\alpha_{\ell+1}=k^\alpha_\ell+\delta N^\alpha_{\ell+1},$$
$$m^\alpha_{\ell+1}=\frac{1}{k^\alpha_\ell+\delta N^\alpha_{\ell+1}}\Big[k^\alpha_\ell\,T^{I^\alpha_{\ell+1}}_{I^\alpha_\ell}(m^\alpha_\ell)+\delta N^\alpha_{\ell+1}\,\delta\hat\mu^\alpha_{\ell+1}\Big],$$
$$\Sigma^\alpha_{\ell+1}=T^{I^\alpha_{\ell+1}}_{I^\alpha_\ell}(\Sigma^\alpha_\ell)+\sum_{j=N^\alpha_\ell+1}^{N^\alpha_{\ell+1}}\big(T^\alpha_{\ell+1}(P_j)-\delta\hat\mu^\alpha_{\ell+1}\big)\big(T^\alpha_{\ell+1}(P_j)-\delta\hat\mu^\alpha_{\ell+1}\big)^\top+\frac{k^\alpha_\ell\,\delta N^\alpha_{\ell+1}}{k^\alpha_\ell+\delta N^\alpha_{\ell+1}}\big(T^{I^\alpha_{\ell+1}}_{I^\alpha_\ell}(m^\alpha_\ell)-\delta\hat\mu^\alpha_{\ell+1}\big)\big(T^{I^\alpha_{\ell+1}}_{I^\alpha_\ell}(m^\alpha_\ell)-\delta\hat\mu^\alpha_{\ell+1}\big)^\top,\qquad(24)$$
see e.g. [18, Section 9].
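The conjugate update (24) is cheap to implement. Below is an illustrative NumPy sketch of the standard Normal-inverse-Wishart batch update in the spirit of [18]; the variable names are ours, and the scenario-selection operators $T$ are deliberately omitted for simplicity:

```python
import numpy as np

def niw_update(m, k, i, S, batch):
    """Batch update of Normal-inverse-Wishart hyperparameters (m, k, i, S)
    after observing `batch` (one row per newly simulated price vector).
    Mirrors the structure of (24), without the selection operators T."""
    n = batch.shape[0]
    xbar = batch.mean(axis=0)
    k_new = k + n
    i_new = i + n
    m_new = (k * m + n * xbar) / k_new
    # within-batch scatter around the batch mean
    centered = batch - xbar
    scatter = centered.T @ centered
    # shrinkage term between the prior mean and the batch mean
    d = m - xbar
    S_new = S + scatter + (k * n / k_new) * np.outer(d, d)
    return m_new, k_new, i_new, S_new
```

As batches accumulate, $k$ and $i$ grow linearly in the number of simulations, so the posterior on $(\tilde\mu,\tilde\Sigma)$ concentrates, which is exactly the learning effect exploited by the adaptative algorithm.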
Later on, we shall write the corresponding law as $\mathcal{NW}^{-1}(p^\alpha_{\ell+1})$ with $p^\alpha_{\ell+1}:=(m^\alpha_{\ell+1},k^\alpha_{\ell+1},i^\alpha_{\ell+1},\Sigma^\alpha_{\ell+1})$. Hereafter, $\mathcal{W}^{-1}_i(\Sigma)$ stands for the Inverse-Wishart distribution with degree of freedom $i$ and scale matrix $\Sigma$, while $N(m,\Sigma)$ is the Gaussian distribution with mean $m$ and covariance matrix $\Sigma$.

3.3 Example of numerical implementation using neural networks

In this section, we aim at solving the version of the dynamic programming equation of Proposition 3.5 obtained by using an initial Normal-inverse-Wishart prior and the approximate updating procedure suggested in Remark 3.7:
$$\check{\mathrm{F}}^{ad}_p(\ell,\alpha,\nu)=\operatorname*{ess\,inf}_{\alpha'\in\mathcal{A}^{ad}(\ell,\alpha)}\mathbb{E}_\nu\big[\check{\mathrm{F}}^{ad}_p(\ell+1,\alpha',\check{\mathrm{U}}(\ell+1,\alpha',\nu))+f^{ad}_p(\ell+1,\alpha',\tilde\theta)\,\big|\,\bar{\mathcal{F}}^\alpha_\ell\big],$$
with $\check{\mathrm{U}}$ as in Remark 3.7 and
$$\check{\mathrm{F}}^{ad}_p(L-1,\alpha,\nu):=\mathbb{E}_\nu\Big[\Big|\frac{1}{n_w}\sum_{i\in I^{\alpha}_{L-1}}\hat\mu^{\alpha,i}_L-\tilde\mu^i\Big|^p\,\Big|\,\bar{\mathcal{F}}^\alpha_{L-1}\Big].$$
It would be tempting to use a standard grid-based approximation. However, to turn this problem into a Markovian one, one needs to let the value function at step $\ell$ depend on $q^\alpha_\ell$, $N^\alpha_\ell$, $\mathrm{C}^\alpha_\ell$, $\hat\mu^\alpha_\ell$ and $p^\alpha_\ell$, where $\mathrm{C}^\alpha_\ell$ is the running cost of the strategy $\alpha$ up to level $\ell$, defined for $\ell\ne 0$ by $\mathrm{C}^\alpha_\ell=\sum_{l=0}^{\ell-1}q^\alpha_l\,\delta N^\alpha_{l+1}$ and $\mathrm{C}^\alpha_0=0$. The dimension of the state space is then $1+1+1+q^\alpha_\ell+(1+q^\alpha_\ell+1+(q^\alpha_\ell)^2)$. Even for $q^\alpha_\ell=20$, the corresponding space is already much too big to construct a reasonable grid on it. We therefore suggest using a neural network approximation.
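For concreteness, such an approximating family can be realized as a one-hidden-layer network whose flattened weights play the role of the index of the family; the sketch below is purely illustrative (our own names and toy dimensions, not the authors' implementation), using a bounded activation so that the surrogate stays bounded and continuous:

```python
import numpy as np

def phi(x, features):
    """One-hidden-layer surrogate value function.
    x        : tuple (W1, b1, w2, b2) packing the network weights
    features : flattened state (dq, dN, q, N, C, mu, hyperparameters)
    tanh keeps the output bounded and continuous in (x, features)."""
    W1, b1, w2, b2 = x
    hidden = np.tanh(W1 @ features + b1)
    return float(w2 @ hidden + b2)

rng = np.random.default_rng(0)
d = 8  # toy feature dimension for the demo
x = (rng.standard_normal((16, d)), np.zeros(16),
     rng.standard_normal(16), 0.0)
value = phi(x, rng.standard_normal(d))
```

Since $|\tanh|\le 1$, the output is bounded by $\sum_j |w_2^j| + |b_2|$ uniformly in the features, matching the boundedness requirement on the family.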
Let us consider a family of bounded continuous functions $\{\phi_{\mathrm{x}},\mathrm{x}\in X\}$, $X$ being a compact subset of $\mathbb{R}^{d_X}$ for some $d_X\ge 1$, such that, for all $q,\delta q\le n_s$ and $N,\delta N\ge 0$,
$$\phi_\cdot(\delta q,\delta N,q,N,\mathrm{C},\cdot):(\mathrm{x},\mu,p)\in X\times\mathbb{R}^q\times\mathbb{R}^{q+q^2+2}\mapsto\phi_{\mathrm{x}}(\delta q,\delta N,q,N,\mathrm{C},\mu,p)\in\mathbb{R}$$
is continuous. We then fix a family $\{\alpha^k\}_{k\le\bar k}$ of deterministic paths of $\mathcal{A}(0)$ and simulate independent copies $\{\tilde\theta_j\}_{j\le\bar j}$ of $\tilde\theta$ according to $\nu_0$, a Normal-inverse-Wishart distribution $\mathcal{NW}^{-1}(p_0)$. For each $j$, we consider an i.i.d. sequence $(P^{1,j}_{j'},\ldots,P^{n_s,j}_{j'})_{j'\ge 1}$ in the law $N(\tilde\mu_j,\tilde\Sigma_j)$ with $\tilde\theta_j=:(\tilde\mu_j,\tilde\Sigma_j)$. We take these sequences independent and independent of $\tilde\theta$. For each $k$ and $j$, we denote by $(\hat\mu^{k,j}_\ell)_{\ell\le L}$, $(p^{k,j}_\ell)_{\ell\le L}$ and $(I^{k,j}_\ell)_{\ell\le L}$ the paths $(\hat\mu^{\alpha^k}_\ell)_{\ell\le L}$, $(p^{\alpha^k}_\ell)_{\ell\le L}$ and $(I^{\alpha^k}_\ell)_{\ell\le L}$ associated to the $j$-th sequence $(P^{1,j}_{j'},\ldots,P^{n_s,j}_{j'})_{j'\ge 1}$ and the control $\alpha^k$. Similarly, we write $f^{ad,k,j}_p(\ell,\cdot)$ to denote the function $f^{ad}_p(\ell,\cdot)$ defined as in (21) but in terms of $I^{k,j}_{\ell-1}$ in place of $I^\alpha_{\ell-1}$. Given an integer $r\ge 1$, we first compute $\check{\mathrm{x}}_{L-1}$ as the argmin over $\mathrm{x}\in X$ of
$$\sum_{k=1}^{\bar k}\sum_{j=1}^{\bar j}\Big|\mathbb{E}^{\nu^{k,j}_{L-1}}_{L-1}\Big[\Big|\frac{1}{n_w}\sum_{i\in I^{k,j}_{L-1}}(\hat\mu^{k,j}_L)^i-\tilde\mu^i\Big|^p\Big]-\phi_{\mathrm{x}}\big(0,\delta N^{\alpha^k}_L,q^{\alpha^k}_{L-1},N^{\alpha^k}_{L-1},\mathrm{C}^{\alpha^k}_{L-1},\hat\mu^{k,j}_{L-1},p^{k,j}_{L-1}\big)\Big|^r$$
in which $\mathbb{E}^{\nu^{k,j}_{L-1}}_{L-1}$ means that the expectation is taken only over $\tilde\mu$ according to the law $\nu^{k,j}_{L-1}$, i.e. $\mathcal{NW}^{-1}(p^{k,j}_{L-1})$, and $(\cdot)^i$ means that we take the $i$-th component of the vector in brackets.
Then, for any $\alpha\in\mathcal{A}^{ad}$, we set
$$\check\phi_{L-1}(q^\alpha_{L-1},N^\alpha_{L-1},\mathrm{C}^\alpha_{L-1},\cdot):=\min_{(0,\delta N)\in\mathrm{A}(L-1,\alpha)}\phi_{\check{\mathrm{x}}_{L-1}}(0,\delta N,q^\alpha_{L-1},N^\alpha_{L-1},\mathrm{C}^\alpha_{L-1},\cdot),$$
where $\mathrm{A}(L-1,\alpha):=\{(\delta q,\delta N)\in\{0\}\times\mathbb{N} : \mathrm{C}^\alpha_{L-1}+n_w\,\delta N\le K\}$. Given $\check\phi_{\ell+1}$ for some $\ell\le L-2$, we then compute a minimizer $\check{\mathrm{x}}_\ell\in X$ of
$$\sum_{k=1}^{\bar k}\sum_{j=1}^{\bar j}\Big|\mathbb{E}^{\nu^{k,j}_\ell}_\ell\big[\check\phi_{\ell+1}(q^{\alpha^k}_{\ell+1},N^{\alpha^k}_{\ell+1},\mathrm{C}^{\alpha^k}_{\ell+1},\hat\mu^{k}_{\ell+1},p^{k}_{\ell+1})+f^{ad,k,j}_p(\ell+1,\alpha^k,\tilde\theta)\big]-\phi_{\mathrm{x}}\big(\delta q^{\alpha^k}_{\ell+1},\delta N^{\alpha^k}_{\ell+1},q^{\alpha^k}_\ell,N^{\alpha^k}_\ell,\mathrm{C}^{\alpha^k}_\ell,\hat\mu^{k,j}_\ell,p^{k,j}_\ell\big)\Big|^r,$$
where $\mathbb{E}^{\nu^{k,j}_\ell}_\ell$ means that the expectation is computed over $(\hat\mu^{\alpha^k}_{\ell+1},p^{\alpha^k}_{\ell+1},\tilde\theta,\tilde m^{\alpha^k}_\ell)$ given $(\hat\mu^k_\ell,p^k_\ell)=(\hat\mu^{k,j}_\ell,p^{k,j}_\ell)$ and using the prior $\nu^{k,j}_\ell$ on $\tilde\theta$ associated to $p^{k,j}_\ell$. Then, we set
$$\check\phi_\ell(q^\alpha_\ell,N^\alpha_\ell,\mathrm{C}^\alpha_\ell,\cdot):=\min_{(\delta q,\delta N)\in\mathrm{A}(\ell,\alpha)}\phi_{\check{\mathrm{x}}_\ell}(\delta q,\delta N,q^\alpha_\ell,N^\alpha_\ell,\mathrm{C}^\alpha_\ell,\cdot),$$
with
$$\mathrm{A}(\ell,\alpha):=\{(\delta q,\delta N)\in[[0,q^\alpha_\ell-n_w]]\times\mathbb{N} : \mathrm{C}^\alpha_\ell+(q^\alpha_\ell-\delta q)\,\delta N\le K\},\quad\ell<L-1,$$
$$\mathrm{A}(L-1,\alpha):=\{(\delta q,\delta N)\in\{q^\alpha_{L-1}-n_w\}\times\mathbb{N} : \mathrm{C}^\alpha_{L-1}+(q^\alpha_{L-1}-\delta q)\,\delta N\le K\},$$
and so on until obtaining $\check\phi_0(n_s,0,0,0,p_0)$. By continuity of $\phi_\cdot(\cdot)$ and compactness of $X$ and $\mathrm{A}(\ell,\alpha)$ for $\alpha$ given, the minimum above is achieved, possibly not uniquely, and one can choose a measurable map $a^\star_\ell$ such that
$$a^\star_\ell(q^\alpha_\ell,N^\alpha_\ell,\mathrm{C}^\alpha_\ell,\cdot)\in\operatorname*{arg\,min}_{(\delta q,\delta N)\in\mathrm{A}(\ell,\alpha)}\phi_{\check{\mathrm{x}}_\ell}(\delta q,\delta N,q^\alpha_\ell,N^\alpha_\ell,\mathrm{C}^\alpha_\ell,\cdot)$$
for all $\alpha\in\mathcal{A}^{ad}$.
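Once a fitted surrogate is available, extracting the feedback action is a plain constrained minimization over the admissible set $\mathrm{A}(\ell,\alpha)$. A hypothetical sketch (our names; the surrogate is passed in as a callable, and the candidate grids are assumptions for the demo):

```python
def greedy_action(phi, q, N, C, K, n_w, dq_choices, dN_choices):
    """Pick (dq, dN) minimizing the fitted surrogate phi, subject to the
    budget constraint C + (q - dq) * dN <= K, as in the definition of A(l, a).
    phi : callable (dq, dN) -> estimated cost-to-go from the current state
    N   : carried for interface symmetry with the state (q, N, C)."""
    best, best_val = None, float("inf")
    for dq in dq_choices:
        if not 0 <= dq <= q - n_w:     # keep at least n_w scenarios alive
            continue
        for dN in dN_choices:
            if C + (q - dq) * dN > K:  # admissibility: stay within budget K
                continue
            val = phi(dq, dN)
            if val < best_val:
                best, best_val = (dq, dN), val
    return best

# toy usage: a surrogate that rewards large batches and few kept scenarios
choice = greedy_action(lambda dq, dN: -dN + 0.1 * dq,
                       q=20, N=0, C=0, K=1000, n_w=6,
                       dq_choices=range(0, 15), dN_choices=[10, 50, 100])
```

In the toy run, the budget constraint forces at least 10 scenarios to be dropped before the largest batch size becomes affordable, so the minimizer is `(10, 100)`.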
Then, given the parameter $p_0$ of our initial prior $\nu_0$, our estimator of the optimal policy is given by $\alpha^\star=(q^\star,N^\star)$ defined by induction by
$$(\delta q^\star_1,\delta N^\star_1)=a^\star_0(n_s,0,0,0,p_0)\quad\text{and}\quad(\delta q^\star_{\ell+1},\delta N^\star_{\ell+1})=a^\star_\ell(q^\star_\ell,N^\star_\ell,\mathrm{C}^\star_\ell,\hat\mu^{\alpha^\star}_\ell,p^{\alpha^\star}_\ell)\quad\text{for }0<\ell<L.$$
Note that the above algorithm for the estimation of the optimal control only requires off-line simulations according to the initial prior $\nu_0$. It is certainly costly, but it does not require evaluating the real financial book: it can be trained on a proxy, and it can be done off-line. It can be combined with the approach of Remark 3.6 to reduce the computation time. In order to prepare for the use of a different initial prior, one can also slightly adapt the above algorithm by considering different initial values of $p_0$ (e.g. drawn from another distribution around $p_0$), so as to estimate $\check\phi$ not only at the point $p_0$. When applied to the real book, the update of the prior according to (24) induces an additional cost that is negligible with respect to the simulation of the book. It leads to the computation of new priors associated to the financial book at hand, which can be used for a new estimation of the optimal policy, or simply as a new initial prior for the next computation of the ES. An example of a simple practical implementation is detailed in Appendix B, while numerical tests are performed in Section 4.

Numerical experiments

This section is dedicated to first numerical tests of the different algorithms presented in the previous sections. The settings of the experiments are as follows. We first choose a Normal-inverse-Wishart prior distribution $\nu_0$ with parameters $p_0:=(m_0,k_0,i_0,\Sigma_0)$. The vector $m_0$ is represented on Figure 6, with $m_0^i=\mu^i_\circ$, $i\le n_s$, and
$$\Sigma_0=(i_0-n_s-1)\,\bar\Sigma,\qquad\bar\Sigma^{ii}=4.\cdots\text{ if }i=j,\qquad\bar\Sigma^{ij}=\rho\times 4.\cdots\text{ if }i\ne j,\qquad(25)$$
with $\rho=0.6$ or
$\rho=0$ depending on the experiments below. As for $k_0$ and $i_0$, they are chosen equal to 300, meaning that we have a low confidence in our prior. The computing power is $K=10^{\cdots}$. We apply the four different algorithms on 5 000 runs (i.e. 5 000 independent implementations of each algorithm). For each run, we:
• first simulate a value of the real scenario means and covariance matrix $(\tilde\mu,\tilde\Sigma)\sim\mathcal{NW}^{-1}(p_0)$;
• apply each of the four algorithms, with simulated prices following $P|s\sim N(\tilde\mu,\tilde\Sigma)$;
• for each algorithm, measure the relative error $(\widehat{\mathrm{ES}}-\widetilde{\mathrm{ES}})/\widetilde{\mathrm{ES}}$ and the error $\widehat{\mathrm{ES}}-\widetilde{\mathrm{ES}}$, where $\widetilde{\mathrm{ES}}=\frac{1}{n_w}\sum_{i=1}^{n_w}\tilde\mu^{\tilde m(i)}$.
The four algorithms that we compare are:
• The Uniform Pricing Algorithm: all the scenarios are priced with $K/n_s$ Monte Carlo simulations, and the estimator $\widehat{\mathrm{ES}}$ is the average of the $n_w=6$ worst scenarios. This is the most naive method, with only one step and with all scenarios priced with an equal number of Monte Carlo simulations. It serves as a benchmark.
• The Heuristic Algorithm: we use the 2-level strategy described in Section 2.5 with the book sample parameters of Table 1 and the computation parameters of Table 2. We do not evaluate the constant $c$ of Assumption 1 but simply set it to 0, see Remark 2.4. The optimal strategy is given by $(q_0,q_1,N_1,N_2)=(253,\cdot,17\,297,100\,000)$.
• The Deterministic Algorithm: we run the deterministic algorithm of Section 2.4 optimized with $\mu=m_0$ as the values of the scenarios, $\Sigma_0$ with $\rho=0.6$, and $L=4$. Note that using the real mean parameter as an input for the optimization is quite favorable to this algorithm, although the "true" parameter of each run will actually deviate from this mean value.
This gives us the strategy $(q_0,q_1,q_2,q_3,N_1,N_2,N_3,N_4)=(253,\cdot,\cdot,\cdot,\cdot,\cdot,44\,000,44\,000)$.
• The Adaptative Algorithm: we do the training part of the adaptative algorithm using our prior $p_0:=(m_0,k_0,i_0,\Sigma_0)$, with $\rho=0.6$, as parameters, and $L=4$. We use a very simple one-hidden-layer neural network. It could certainly be improved by using a more sophisticated multi-layer neural network, but this version will be enough for our discussion. Details on the implementation are given in Appendix B. Once this is done, we apply the optimal adaptative strategy on each run.

ρ = 0.6

In this first experiment, the simulated runs use the values $\rho=0.6$ and $i_0=k_0=300$. To get an idea of how much noise is added to the average scenario values in our simulations, we plot in Figure 7 the prior value $m_0^i$ for each scenario of index $i\le n_s$ (this is the line) and the first 20 $\tilde\mu^i_j$ out of the 5 000 runs for each scenario (these are the points).

Figure 7: True value of $\mu_\circ$ and simulations of $\tilde\mu$.

For the adaptative algorithm, the three most used strategies are:
• $(q_0,q_1,q_2,q_3,N_1,N_2,N_3,N_4)=(253,\cdot,\cdot,\cdot,\cdot,97\,995,172\,504,577\,252)$
• $(q_0,q_1,q_2,q_3,N_1,N_2,N_3,N_4)=(253,\cdot,\cdot,\cdot,\cdot,99\,733,148\,560,608\,040)$
• $(q_0,q_1,q_2,q_3,N_1,N_2,N_3,N_4)=(253,\cdot,\cdot,\cdot,\cdot,75\,033,123\,860,748\,007)$
Compared to the deterministic algorithm, we see that the adaptative one uses far fewer Monte Carlo simulations at the final steps and focuses more on the intermediate steps to select the worst scenarios. The deterministic algorithm is also more aggressive in the choice of $q_1$ and $q_2$.
This can be easily explained by the fact that the latter believes that the real distribution is not far from the solid curve on Figure 7 (up to the standard deviation), while the adaptative one only knows a much more diffuse distribution, corresponding to the cloud of points of Figure 7, since its level of uncertainty is quite high for our choice $i_0=k_0=300$. On Figures 8-11, we plot the histograms of the relative errors. We see that the distribution is tightest for the deterministic algorithm, followed quite closely by the adaptative algorithm. Both of them perform very well. As expected, the uniform algorithm is very poor. Note that the heuristic one already improves very significantly on the uniform algorithm, although, unsurprisingly, it does not reach the precision of the two most sophisticated algorithms. Because of the huge uncertainty mentioned above, the adaptative algorithm is rather conservative, while the deterministic algorithm takes full advantage of essentially knowing the correct distribution, and performs better. We will see in our second experiment that things change when we deviate from the parameters used for optimizing the deterministic algorithm (by simply passing from $\rho=0.6$ to $\rho=0$ in the simulated runs).

Figure 8: Relative Error for Adaptative Algorithm. Figure 9: Relative Error for Deterministic Algorithm. Figure 10: Relative Error for Heuristic Algorithm. Figure 11: Relative Error for Uniform Algorithm.

In Table 3, we provide the $L^1$ and relative errors (with standard deviations), the $L^2$ error and the number of correct selections, that is, the number of runs for which a given algorithm has chosen the correct 6 worst scenarios. In terms of $L^1$ or $L^2$ error, the relative performance of the algorithms is as above. However, if we look at the number of correct selections, we see that the adaptative algorithm performs better than the other three.
Again, by comparing the strategies of the deterministic and the adaptative algorithms, we see that those of the adaptative algorithm are more conservative on the ranking and filtering part versus the final pricing: it puts relatively more Monte Carlo simulations into detecting the correct scenarios and relatively fewer into their estimation.

Algorithm | $L^1$ Err. | $L^1$ Err. Std | Rel. Err. (%) | Rel. Err. Std (%) | $L^2$ Err. | Correct Selections
Ad. Alg. | 1 891 | 20.4 | 0.623 | 0.00886 | 2 377 | 4 247
Det. Alg. | 1 411 | 16.1 | 0.465 | 0.00693 | 1 813 | 3 499
Heur. Alg. | 4 562 | 50.2 | 1.49 | 0.0234 | 5 779 | 4 054
Unif. Alg. | 7 269 | 81.6 | 2.38 | 0.0348 | 9 279 | 3 500

Table 3: Errors for ρ = 0.6.

In Figure 12, we plot the function $x\mapsto\mathbb{P}[X>-x]$, where $X$ is the absolute error of the algorithm on a run.

Figure 12: Tail distribution of the errors. First top lines: Uniform and Heuristic algorithms, respectively. Solid line: Adaptative algorithm. Dotted line: Deterministic algorithm.

In Figure 13, we provide, for the first 4 runs, the values and real ranks of the 6 worst scenarios selected by each algorithm. The numbers displayed are the true ranks of the selected scenarios given by $\tilde\mu$, and their y-coordinate is the value obtained when running the algorithm. "Real" is the real values as sampled.

Figure 13: Worst Scenarios Ranks and Values.

ρ = 0

We now do the numerical test with $\rho=0$ as the true correlation. The deterministic and adaptative algorithms are still trained with $\rho=0.6$, but $P|s$ is simulated using $\rho=0$. On Figures 14-17, we show the histograms of the relative errors. We see that the distribution of the relative errors is now tightest for the adaptative method, followed by the deterministic method, then by the heuristic and the uniform methods. Furthermore, we see that the distribution corresponding to the deterministic method is significantly biased to the left. This is actually true for all algorithms, but at a less significant level.
This suggests that a large part of the error now comes not from the final pricing error, but from errors in the selection of scenarios.

Figure 14: Relative Error for Adaptative Algorithm. Figure 15: Relative Error for Deterministic Algorithm. Figure 16: Relative Error for Heuristic Algorithm. Figure 17: Relative Error for Uniform Algorithm.

In Table 4, we provide the $L^1$ and relative errors (with standard deviations), the $L^2$ error and the number of correct selections for the four algorithms. For all algorithms, compared to the case $\rho=0.6$, we observe simultaneously a lower number of correct selections of scenarios (which we could expect to increase the errors) and a lower $L^1$ error. This surprising result is explained by the fact that lowering the correlation has two effects. The filtering and ranking part of the algorithm becomes harder, as can be seen from Corollary 2.3; this explains why the number of correct selections becomes lower. However, we compute at the end an average over the $n_w$ worst scenarios, and the error on this average is lower when the pricings are uncorrelated than when they exhibit a positive correlation. The adaptative algorithm now has simultaneously the lowest $L^1$ and $L^2$ errors, as well as the highest number of correct selections. It is especially good in $L^2$ error, so we expect it to produce very few large errors. As, by construction, it has been trained to detect misspecifications of the parameters, it now has a clear advantage over the deterministic algorithm, which does not see the misspecification. This results in an improvement of almost 20% of the $L^1$ error. Following the above reasoning, we understand that, compared to the previous experiment, the final pricing error now plays a smaller role and the ranking and selection error a bigger one, which explains why the histogram of the errors for the deterministic algorithm is strongly biased to the left, as it now incorrectly selects scenarios more often.
Algorithm | $L^1$ Err. | $L^1$ Err. Std | Rel. Err. (%) | Rel. Err. Std (%) | $L^2$ Err. | Correct Selections
Ad. Alg. | 1 083 | 11.8 | 0.27 | 0.00294 | 1 366 | 3 930
Det. Alg. | 1 175 | 17.5 | 0.293 | 0.00448 | 1 705 | 3 202
Heur. Alg. | 2 547 | 28.33 | 0.628 | 0.00700 | 3 240 | 3 753
Unif. Alg. | 4 062 | 44.7 | 1.00 | 0.0111 | 5 147 | 3 102

Table 4: Errors for ρ = 0.

In Figure 18, we plot the function $x\mapsto\mathbb{P}[X>-x]$, where $X$ is the absolute error of the algorithm on a run. As was suggested by the $L^2$ errors of Table 4, we see that the tail distribution of the errors is lowest for the adaptative algorithm, followed by the deterministic algorithm (for big errors), and then by the heuristic and uniform algorithms.

Figure 18: Tail distribution of the errors. First top lines: Uniform and Heuristic algorithms, respectively. Solid line: Adaptative algorithm. Dotted line: Deterministic algorithm.

Conclusion

We propose in this paper different algorithms for the computation of the expected shortfall based on given historical scenarios. All are multi-step algorithms that use Monte Carlo simulations to reduce the number of historical scenarios that potentially belong to the set of worst scenarios. We provide explicit error bounds, and we test the algorithms on simulated data deviating from the true values of the historical impacts used for computing the associated optimal strategies. The first algorithm is a very easy to implement 2-step algorithm that already provides relatively small errors in our numerical tests. A four-step deterministic dynamic programming algorithm performs very well when the real data are not far from the parameters used in the optimization procedure. It even seems to be quite robust, as shown by our numerical test in the case where the true correlation parameter is not the one used for computing the optimal policy. Finally, we propose an adaptative algorithm that aims at learning the true value of the parameters at the different steps of the algorithm.
Our first numerical tests suggest that it is more conservative than the deterministic one, but probably more robust to parameter misspecifications, as expected. The version we use is built on a very simple one-hidden-layer neural network and can certainly be considerably improved for industrial purposes.

References

[1] Carlo Acerbi and Dirk Tasche. On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487–1503, 2002.
[2] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[3] Raghu Raj Bahadur and Herbert Robbins. The problem of the greater mean. The Annals of Mathematical Statistics, pages 469–487, 1950.
[4] Basel Committee on Banking Supervision. Minimum capital requirements for market risk. 2016.
[5] Robert E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics, pages 16–39, 1954.
[6] Robert E. Bechhofer, Charles W. Dunnett, and Milton Sobel. A two-sample multiple decision procedure for ranking means of normal populations with a common unknown variance. Biometrika, 41(1-2):170–176, 1954.
[7] Bernard Bercu, Bernard Delyon, and Emmanuel Rio. Concentration Inequalities for Sums and Martingales. Springer, 2015.
[8] Mark Broadie, Yiping Du, and Ciamac C. Moallemi. Efficient risk estimation via nested sequential simulation. Management Science, 57(6):1172–1194, 2011.
[9] Simon A. Broda, Jochen Krause, and Marc S. Paolella. Approximating expected shortfall for heavy-tailed distributions. Econometrics and Statistics, 8:184–203, 2018.
[10] David Easley and Nicholas M. Kiefer. Controlling a stochastic process with unknown parameters. Econometrica: Journal of the Econometric Society, pages 1045–1064, 1988.
[11] Robert J. Elliott and Hong Miao. VaR and expected shortfall: a non-normal regime switching framework.
Quantitative Finance, 9(6):747–755, 2009.
[12] Christian Francq and Jean-Michel Zakoïan. Multi-level conditional VaR estimation in dynamic models. In Modeling Dependence in Econometrics, pages 3–19. Springer, 2014.
[13] Michael B. Gordy and Sandeep Juneja. Nested simulation in portfolio risk measurement. Management Science, 56(10):1833–1848, 2010.
[14] Shanti S. Gupta and S. Panchapakesan. Sequential ranking and selection procedures. Handbook of Sequential Analysis, pages 363–380, 1991.
[15] Lennart Hoogerheide and Herman K. van Dijk. Bayesian forecasting of value at risk and expected shortfall using adaptive importance sampling. International Journal of Forecasting, 26(2):231–247, 2010.
[16] Jochen Krause and Marc S. Paolella. A fast, accurate method for value-at-risk and expected shortfall. Econometrics, 2(2):98–122, 2014.
[17] Ming Liu and Jeremy Staum. Stochastic kriging for efficient nested simulation of expected shortfall. Journal of Risk, 12(3):3, 2010.
[18] Kevin P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf.
[19] Saralees Nadarajah, Bo Zhang, and Stephen Chan. Estimation methods for expected shortfall. Quantitative Finance, 14(2):271–291, 2014.
[20] Luis Ortiz-Gracia and Cornelis W. Oosterlee. Efficient VaR and expected shortfall computations for nonlinear portfolios within the delta-gamma approach. Applied Mathematics and Computation, 244:16–31, 2014.
[21] Franco Peracchi and Andrei V. Tanase. On estimating the conditional expected shortfall. Applied Stochastic Models in Business and Industry, 24(5):471–493, 2008.
[22] Jimmy Risk and Michael Ludkovski. Sequential design and spatial modeling for portfolio tail risk measurement. SIAM Journal on Financial Mathematics, 9(4):1137–1174, 2018.
[23] R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.
[24] Jules Sadefo Kamdem.
Value-at-risk and expected shortfall for linear portfolios with elliptically distributed risk factors. International Journal of Theoretical and Applied Finance, 8(5):537–551, 2005.
[25] Jean-Guy Simonato. The performance of Johnson distributions for computing value at risk and expected shortfall. The Journal of Derivatives, 19(1):7–24, 2011.
[26] Keming Yu, A. Allay, Shanchao Yang, and D. Hand. Kernel quantile-based estimation of expected shortfall. 2010.
[27] Meng-Lan Yueh and Mark C. W. Wong. Analytical VaR and expected shortfall for quadratic portfolios. The Journal of Derivatives, 17(3):33–44, 2010.

A Proxy of the optimal strategy for the heuristic (19)

In the case $p=1$, (19) can even be further simplified by using the upper bound
$$\tilde h(q)\le\max\{\tilde h_1(q);\tilde h_2(q)\}\qquad(26)$$
where
$$\tilde h_1(q):=n_s(q+1-n_w)\delta\,\exp\Big(-\frac{(K-qN_1)(q+1-n_w)\delta}{n_s\,c}\Big),\qquad\tilde h_2(q):=n_s(n_s-n_w)\delta\,\exp\Big(-\frac{(K-qN_1)\,((q+1-n_w)\delta)^2}{n_s\,\bar\sigma^2}\Big).$$
The right-hand side of (26) is now tractable for minimization. Given
$$\Delta:=(K-(n_w-1)N_1)^2-\cdots\,n_s N_1 c\,\delta,\qquad B:=\frac{\bar\sigma^2}{c\delta}+n_w-1,$$
$$q^{1,*}:=\max\Big(n_w-1+\frac{K}{2N_1},\,n_w\Big),\quad q^{2,1,*}:=\max\Big(n_w-1+\frac{K-\sqrt\Delta}{4N_1},\,n_w\Big),\quad q^{2,2,*}:=\max\Big(n_w-1+\frac{K+\sqrt\Delta}{4N_1},\,n_w\Big),\qquad(27)$$
the optimal policy $q^h$ is defined by the following table:
q , , ∗ Choice of q h ≥ n s q h := q , ∗ ≤ n w > q h := argmin q ∈{ n w ,q , , ∗ } h ( q ) ≤ n w ≤ q h := n w n w < · < n s > ≤ B ≤ B ≤ B q h := argmin q ∈{ q , ∗ ,B } h ( q ) n w < · < n s > ≤ B ≤ B ≥ B q h := argmin q ∈{ q , ∗ ,q , , ∗ } h ( q ) n w < · < n s > ≤ B ≥ B ≥ B q h := argmin q ∈{ q , ∗ ,B,q , , ∗ } h ( q ) n w < · < n s > ≥ B ≤ B ≤ B q h := Bn w < · < n s > ≥ B ≥ B q h := argmin q ∈{ B,q , , ∗ } h ( q ) n w < · < n s ≤ ≤ B q h := argmin q ∈{ q , ∗ ,B } h ( q ) n w < · < n s ≤ ≥ B q h := B Table 5: Optimal q h for ˜ h .For simplicity, let us consider the case c = 0, see Remark 2.4.On Figure 19, the square is q , , ∗ = 52 . 41, the circle is q , ∗ = 68 . 33 and the cross is the real optimum q ∗ = 71 of h , for the parameters of Tables 1 and 2. We see that we actually almost reach the correctminimum. It corresponds (up to rounding) to N , ∗ = 23723, N , ∗ = 17148, N ∗ = 15934. We optimize here over real positive numbers. h ( q , , ∗ ). Circle: h ( q , ∗ ). Cross: h ( q ∗ ).Using the same set of parameters, we plot on Figure 20 the two functions h and ˜ h . Although these twofunctions have values of different orders of magnitude, their shapes are quite close, which explains whywe manage to obtain a relatively good approximation for the minimizer.Figure 20: Solid line : h . Dashed line : ˜ h . B Precise implementation of the neural network algorithm In this Appendix, we describe in more details how the neural network approximation of the optimalpolicy of the adaptative algorithm is constructed. All the parameters values are given in Tables 6, 7 and8 below. B.1 Initialization • In practice, the neural network input’s size depends on the window size q . Therefore, we need totrain different neural networks for each window size. In order to get enough points to train each of27hese neural networks, we have chosen the grid q g = [6 , , , , , , , , , , , , , , , , , q . 
• We simulate independent copies {θ̃_j}_{j ≤ j bar} = {(μ̃_j, Σ̃_j)}_{j ≤ j bar} of θ̃, where j bar is given in Table 8. For each 1 ≤ j ≤ j bar, Σ̃_j is an inverse-Wishart random matrix with parameters i_0, Σ_0, and μ̃_j is a Gaussian random vector with mean m_0 and covariance matrix Σ̃_j/k_0. The parameters i_0, k_0 and Σ_0 are defined in Table 8 and (25), while (m_0)_i = μ_i, i ≤ n_s, with the μ_i's of Figure 6.

B.2 Strategy Generation

To generate the deterministic strategies (α_k)_{k ≤ k bar}, where k bar is given in Table 8, we proceed as follows.

• For each 1 ≤ k ≤ k bar, we simulate L + 1 uniform random variables (U_n)_{n=0}^{L} on [0, 1]. We sort them in increasing order, (U_{s(n)})_{n=0}^{L}, and define the costs K_ℓ := K (U_{s(ℓ)} − U_{s(ℓ−1)}) for 1 ≤ ℓ ≤ L − 1, and K_L := K (U_{s(0)} + 1 − U_{s(L−1)}). The idea is that we select L + 1 points at random on a circle of total length K: we choose one of these points and, starting from it, the computational power used at each level 1 ≤ ℓ ≤ L is K times the (normalized) length between two consecutive points.

• Once we have the computational cost of each step, we choose the q_ℓ of each strategy, from which we deduce δN_{ℓ+1} := K_ℓ / q_ℓ. For ℓ = 0, we choose q index_0 = 18, where 18 is the number of terms in the grid q_g, which therefore gives q_0 = q_g[q index_0] = n_s. For ℓ = L − 1, we choose q index_{L−1} = 0, that is, q_{L−1} = n_w. For 1 ≤ ℓ ≤ L − 2, we choose q index_ℓ as a random integer in [L − ℓ, q index_{ℓ−1} − 1]. The corresponding q_ℓ is then q_ℓ = q_g[q index_ℓ]. We check that the sequence (N_ℓ)_{ℓ ≤ L} is non-decreasing. If this is the case, we keep it; if not, we reject it and do another run.

B.3 Forward Pass

The next step is to generate all prices and, for each k and each j, execute the strategy α_k.
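The circle-based budget split of Section B.2 above can be sketched as follows. This is an illustrative reconstruction under our own naming (`split_budget` is not from the paper), and the convention for the wrap-around arc is one plausible reading of the construction described in the text.

```python
import random

def split_budget(K, L, rng=random.Random(0)):
    """Split a total computational budget K into L per-level costs by
    dropping L + 1 uniform points on a circle of circumference 1 and
    scaling the arc lengths between consecutive points by K."""
    u = sorted(rng.random() for _ in range(L + 1))
    # Arcs between consecutive sorted points, plus the wrap-around arc.
    arcs = [u[n] - u[n - 1] for n in range(1, L + 1)]
    arcs.append(u[0] + 1.0 - u[L])
    # Keep L of the L + 1 arcs as the per-level budgets K_1, ..., K_L.
    return [K * a for a in arcs[:L]]

costs = split_budget(K=1_000_000, L=4)
```

The L + 1 arcs sum exactly to K; keeping L of them gives the per-level budgets K_1, …, K_L, with the choice of which arc to drop being one reading of the construction above.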
• For 1 ≤ j ≤ j bar, 1 ≤ k ≤ k bar and 1 ≤ ℓ ≤ L, we simulate δN^k_ℓ Gaussian vectors (P^{j,1}_{j'}, …, P^{j,n_s}_{j'}), N^k_{ℓ−1} < j' ≤ N^k_ℓ, of mean μ̃_j and covariance matrix Σ̃_j (independently across j and k).

• We then update μ̂^{k,j}_ℓ, m^{k,j}_ℓ, i^{k,j}_ℓ, k^{k,j}_ℓ and Σ^{k,j}_ℓ accordingly, recall (24).

• Updating Σ^{k,j}_ℓ from level ℓ − 1 to level ℓ can use a lot of memory. Indeed, the sum Σ_{j = N^α_ℓ + 1}^{N^α_{ℓ+1}} (T^α_{ℓ+1}(P_j) − δμ̂^α_{ℓ+1})(T^α_{ℓ+1}(P_j) − δμ̂^α_{ℓ+1})^⊤ consists of δN^α_{ℓ+1} × |q_{ℓ+1}| terms, which can quickly exceed memory limits. Therefore, we compute the sum with only N memory new pricings opt terms at a time, see Table 8 below.

B.4 Computation of f precompute, running costs and admissible sets

• In order to speed up the computation, we precompute several values that will be used many times afterwards. First, we compute f precompute(ℓ, k, j) as f_ad1(ℓ, ·) at the point corresponding to (k, j), except that, in the definition of f_ad1(ℓ, ·), we replace the random permutation m̃ by its estimation from the previous step, μ̃ by its average under the posterior distribution at ℓ + 1, and σ̃ by its estimation at step ℓ + 1.

• We compute the running cost, running cost(ℓ, k) := C^{α_k}_ℓ, of each strategy α_k at step ℓ.

• We restrict the set of possible actions at step ℓ, given that we have followed the strategy α_k so far, to the admissible sets(ℓ, k), defined as the collection of {(δq^{k'}_{ℓ+1}, δN^{k'}_{ℓ+1}), k' ≤ k bar} such that q^k_ℓ + δq^{k'}_{ℓ+1} ∈ q_g, N^{k'}_{ℓ+1} > N^k_ℓ, and

running cost(ℓ, k) + q^k_ℓ δN^{k'}_{ℓ+1} ≤ max_{1 ≤ k'' ≤ k bar} running cost(ℓ + 1, k'').
The last condition avoids inducing a strategy with a running cost that is not present in our data set when doing the one-step optimization.

B.5 Computation of the final expectations

We first pre-compute the quantities

E_{ν^{k,j}_L} | (1/n_w) Σ_{i ∈ I^{k,j}_{L−1}} (μ̂^{k,j}_L)_i − μ̃_i |

by Monte Carlo using N_e simulations. As the simulation of an inverse-Wishart random variable is significantly slower than that of a Gaussian random variable, we only simulate one inverse-Wishart for every N_p Gaussians. The values of N_e and N_p are given by N mu tildes simulated and N wishart proportion of Table 8. The estimation is called expectation^{k,j}_L.

B.6 Training of the neural network at level L

• We use a neural network with one inner layer of 256 neurons and one output layer with a single neuron to fit (expectation^{k,j}_L)_{j ≤ j bar, k ≤ k bar}. The neurons of the inner layer consist of the composition of the softplus function with an affine transformation of the inputs.

• We initialize the neural network parameters using a Xavier initialization. We then train the neural network by selecting a random new batch every N batch change proportion iterations. This random new batch is composed of the samples indexed by 1 ≤ m_a(j) ≤ j batch and the strategies indexed by 1 ≤ m_b(k) ≤ k batch, where m_a and m_b are uniform random permutations of [[1, j bar]] and [[1, k bar]]. For each batch, the algorithm used for the training is the standard gradient descent of TensorFlow. We do N Iter training steps in total. The learning rate used is given in Table 7. In order to bring the input values close to 0 and 1, we renormalize them according to the values in Table 6.
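The level-L network of Section B.6 — a single hidden layer of 256 softplus neurons followed by one linear output neuron, with Xavier initialization — can be sketched in plain numpy. This is a minimal illustration of the stated architecture, not the authors' TensorFlow code; the gradient step at the end is ordinary full-batch descent on the squared error rather than the batched scheme described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(n_in, n_out):
    # Xavier/Glorot uniform initialization.
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def softplus(z):
    # log(1 + exp(z)), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

d_in, d_hidden = 3, 256                      # input size depends on q
W1, b1 = xavier(d_in, d_hidden), np.zeros(d_hidden)
W2, b2 = xavier(d_hidden, 1), np.zeros(1)

def forward(x):
    h = softplus(x @ W1 + b1)                # inner layer: softplus of affine map
    return h @ W2 + b2                       # single linear output neuron

x = rng.normal(size=(5, d_in))               # a small batch of renormalized inputs
y = forward(x)

# One full-batch gradient step on the squared error (up to a factor 2).
target = rng.normal(size=(5, 1))
pre = x @ W1 + b1
h = softplus(pre)
err = (h @ W2 + b2) - target
sig = 1.0 / (1.0 + np.exp(-pre))             # softplus' = sigmoid
grad_W2 = h.T @ err / len(x)
grad_W1 = x.T @ ((err @ W2.T) * sig) / len(x)
W2 -= 1e-3 * grad_W2
W1 -= 1e-3 * grad_W1
```

The same structure, with the input dimension adjusted, serves for the networks trained at the lower levels.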
B.7 Computation of the expectations at level L − 1

We now estimate

E_{ν^{k,j}_{L−1}} [ φ̌_L( q^{α_k}_L, N^{α_k}_L, C^{α_k}_L, μ̂^k_L, p^k_L ) ]

where φ̌_L is the fit of (expectation^{k,j}_L)_{j ≤ j bar, k ≤ k bar} from the previous step. The most CPU-demanding part is no longer the simulation of the inverse-Wisharts, but the update of the parameters of the inverse-Wishart. Therefore, we simulate as many Gaussian random variables as inverse-Wishart random variables, with N_e given by N mu tildes simulated non final level of Table 8.

For our computations, we need to update Σ^{k,j}_{L−1} to the corresponding posterior parameter according to (24). This can however lead to an enormous number of multiplications and additions. Therefore, instead of updating the whole matrix, we only update the diagonal terms according to (24) and estimate the non-diagonal terms by keeping the correlation terms equal to those of Σ^{k,j}_{L−1}. This enables us to gain approximately a factor of q^k_L in speed in this critical step.

B.8 Training of the neural network at level L − 1

• To fit the expectation of the previous step, we use a neural network with the same structure as at level L, with the same cost function.

• The initialization, choice of batches, and training of the neural network are the same as for level L. The number of iterations, learning rate, and renormalization constants are given in Tables 6, 8 and 7.

• We take j batch = min(j batch size, j bar) and k batch = min(k batch size, k bar), where j batch size and k batch size are defined in Table 8.

B.9 Computation of the expectations at levels 1 ≤ ℓ ≤ L − 2

• The expectations at step ℓ are computed by Monte Carlo after replacing the value function at step ℓ + 1 by its neural network approximation, and f_ad1(ℓ, ·) by f precompute(ℓ, ·).

• We simulate as many Gaussian random variables as inverse-Wishart random variables, with N_e given by N mu tildes simulated non final level of Table 8.
• We do not fully update Σ^{k,j}_ℓ to the corresponding posterior parameter, but proceed as at level L − 1.

B.10 Training of the neural networks at levels 1 ≤ ℓ ≤ L − 2

• We now have to optimize over q^k_ℓ ∈ q_g. Therefore, we must now train up to |q_g| different neural networks (with different input sizes). In practice, we only train the neural networks indexed by q ∈ (q^k_ℓ)_{1 ≤ k ≤ k bar} ⊂ q_g, that is, for all the choices of q that are obtained by at least one strategy at level ℓ.

• We must also choose a δN that should be added as an entry of the neural network before optimizing. Furthermore, to help the neural networks converge, we decided to add f precompute(ℓ, j, k) as an input.

• The loss function and the structure of the neural network are as above, and we still use Xavier initialization and bring the inputs of the neural networks to reasonable values close to 0 and 1 by renormalizing them using the constants of Table 6.

• Compared to levels L and L − 1, the choice of batches is slightly different. Indeed, to train the neural network associated with q ∈ q_g, we only use the strategies such that q^k_ℓ = q. To do so, we first define S_q = {k ∈ [[1, k bar]] : q^k_ℓ = q}. We then define k batch = min(k batch size, |S_q|) and j batch = min(j batch size, j bar). We then proceed nearly identically as for levels L and L − 1, with batches indexed by 1 ≤ m_a(j) ≤ j batch and 1 ≤ m_b(k) ≤ k batch, where m_a and m_b are uniform random permutations of [[1, j bar]] and S_q. For each batch, the algorithm used for the training is again the standard gradient descent of TensorFlow.

• Compared to levels L and L − 1, we found that making the neural networks converge was much harder. In particular, the learning rate had to be finely tuned. In order to automate the process, for each q, we proceed as follows. We instantiate not one, but number of neural networks for learning rate test neural networks.
For each of these neural networks, we do N Iter learning rate test training steps, but use a different learning rate for each. For the first neural network, we use base learning rate as the learning rate; for the second, base learning rate/10; and for the k-th, base learning rate/10^{k−1}. For each of these neural networks, we store the log error at each iteration step. Once the N Iter learning rate test training steps have been done for each of these neural networks, we keep the neural network instance that has the lowest average log error. If it is the k-th neural network, we then train it again for N Iter training steps, using base learning rate/10^{k−1} as the learning rate.

B.11 Parallelization

In practice, we parallelize the forward pass according to the strategy indices k. We run thread batch size processes in parallel, where thread batch size is defined in Table 8.

At a given level ℓ, the computation of expectation^{k,j}_ℓ can be parallelized according to the sample indices j. In practice, we run number of threads for level expectations processes in parallel, where number of threads for level expectations is defined in Table 8.

For a given level, the training of each neural network corresponding to a given q ∈ q_g can be done independently. Therefore, at a given level, we multiprocess our code in order to train all the neural networks in parallel.

B.12 Normalization constants, implementation parameters, and learning rates

Table 6: Inputs' renormalization constants by level (columns: Level, q, m, Σ, N, running cost, f precompute).

Table 7: Neural network base learning rates (columns: Level, q, base learning rate).

Parameter                            Value
j batch size                         4
k batch size                         4
N batch change proportion            1 000
N iter show proportion               100
smaller learning rate proportion     10
N Iter smaller learning rate         10 000
L                                    4
n_s                                  253
n_w                                  6
k bar                                200
j bar                                40
i_0                                  300
k_0                                  300
Σ_0                                  (300 − …
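The learning-rate selection loop of Section B.10 — instantiate several copies of the network, train the k-th with base learning rate/10^{k−1}, and keep the copy with the lowest average log error — can be illustrated on a toy objective. Here a scalar quadratic stands in for the network loss, and all names (`train`, `select_learning_rate`) are ours, not the paper's.

```python
import math
import random

def train(lr, steps, seed=0):
    """Toy stand-in for one neural-network training run: gradient descent
    on f(w) = w**2 from a random start; returns the per-step log errors."""
    rng = random.Random(seed)
    w = rng.uniform(1.0, 2.0)
    log_errors = []
    for _ in range(steps):
        w -= lr * 2.0 * w                  # gradient step, f'(w) = 2w
        log_errors.append(math.log(w * w + 1e-300))
    return log_errors

def select_learning_rate(base_lr, n_instances, steps):
    """Train one instance per candidate rate base_lr / 10**(k-1) and keep
    the rate whose run has the lowest average log error."""
    candidates = [base_lr / 10 ** k for k in range(n_instances)]
    avg = {lr: sum(train(lr, steps)) / steps for lr in candidates}
    return min(avg, key=avg.get)

best = select_learning_rate(base_lr=0.1, n_instances=4, steps=200)
```

A diverging rate produces growing log errors and is automatically discarded by the average-log-error criterion, which is what makes this sweep a robust way to automate the tuning.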