Tile test for back-testing risk evaluation
Gilles Zumbach
Edgelab, Avenue de la Rasude 5, 1006 Lausanne
[email protected]
July 27, 2020
Abstract
A new test for measuring the accuracy of financial market risk estimations is introduced. It is based on the probability integral transform (PIT) of the ex post realized returns using the ex ante probability distributions underlying the risk estimation. If the forecast is correct, the result of the PIT, which we call the probtile, should be an iid random variable with a uniform distribution. The new test measures the variance of the number of probtiles in a tiling over the whole sample. Using different tilings allows checking both the dynamical and the distributional aspects of risk methodologies. The new test is very powerful, and new benchmarks need to be introduced to take into account subtle mean reversion effects induced by some risk estimations. The test is applied on 2 data sets for risk horizons of 1 and 10 days. The results show unambiguously the importance of capturing correctly the dynamics of the financial markets, and exclude some broadly used risk methodologies.
Keywords: Value-at-Risk, distribution forecast, risk evaluation, back test, probability integral transform, PIT, tile test.
JEL codes: C12, C22, C53

1 Introduction
The evaluation of financial risk is an important part of any investment activity, and the regulators are imposing ever stricter rules over all participants. The overall financial risk can be classified along several categories, and this paper concerns market risk. It is, together with credit risk, a major component of the financial risk, and it can be quantified efficiently. In this context, a difficult issue is the validation of the quantitative figures provided by a risk evaluation.

A validation of a risk computation is done using back-testing. The core idea of a back-test is fairly simple: using a long sample of historical data, evaluate the risk at different dates, and check over the following days that the risk has been correctly computed. In detail, the scheme is more complex. The primary outcome of a risk evaluation is a probability distribution for the returns over a selected risk horizon ΔT. From this distribution, the usual risk measures can be computed, like the standard deviation, the value-at-risk (VaR) or the expected shortfall (ES, also called CVaR). With the new information obtained each day, this distribution changes in time. As attested by the large body of literature on the volatility dynamics and the heteroskedasticity, the dynamics is quantitatively large, and the volatility can easily change by a factor 5 over different periods (quiet, volatile or crisis). Hence, we are dealing with a clearly non-stationary system, with widely changing distributions.

The core of the back-test algorithm is better explained at a one-day risk horizon. Each day, a forecast of the probability distribution of the returns is made, an ex ante evaluation. The next day, the actual return realized by the market becomes available, an ex post value. This single return should be drawn from the distribution computed on the previous day. The goal of the back-test is to assess if the sequence of ex post draws is coming from the sequence of ex ante probability distributions. This is clearly not an easy problem, quite different from the usual test with repeated draws from a fixed distribution.

A simple possible test is provided by VaR: since it is defined as the α quantile of the loss distribution, the fraction of exceedances should be 1 − α. This reduces the back-test to a simple counting exercise. The weaknesses of this simple test are to check only one level for α, and to ignore the dynamical aspect of the risk evaluation. Better tests using the whole distribution of exceedances and/or the independence of the exceedances were proposed and improved in many contributions [Kupiec, 1995], [Crnkovic and Drachman, 1996], [Diebold et al., 1998], [Barbachan et al., 2006], [Christoffersen, 1998], [Christoffersen and Pelletier, 2004]; see also the reviews by [Campbell, 2006], [Christoffersen, 2010] and the master thesis of [Roccioletti, 2016].

Better tests can be built using a probability integral transform (PIT): given a probability distribution and a return, compute the cumulative probability associated to the return. We named this realized p-value the "probtile", since it corresponds to the cumulative probability of a quantile. If the forecast for the pdf is correct, the probtiles should be iid with a uniform distribution.
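To make the construction concrete, here is a minimal sketch of the probtile computation in Python. It assumes the simplest possible forecast, the empirical cdf of a trailing window of daily returns (the historical return methodology discussed later); the function name and the window length are illustrative choices, not a reference implementation.

```python
import numpy as np

def probtiles(returns, n_hist=500):
    """Probability integral transform (PIT) of realized returns.

    For each day t, the ex ante forecast is taken as the empirical cdf of
    the previous n_hist daily returns; the probtile is this cdf evaluated
    at the ex post realized return r(t).
    """
    returns = np.asarray(returns)
    z = np.empty(len(returns) - n_hist)
    for i, t in enumerate(range(n_hist, len(returns))):
        window = returns[t - n_hist:t]          # ex ante sample
        z[i] = np.mean(window < returns[t])     # empirical cdf at r(t)
    return z

# If the forecast is correct, z should behave as iid U(0,1) draws.
```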
After the PIT transformation, the back-test problem becomes a standard statistical exercise for which numerous tests can be used, for example a Kolmogorov-Smirnov test. This testing strategy has been known to statisticians for a long time, going back at least to [Rosenblatt, 1952], and has been rediscovered several times in finance [Christoffersen, 1998; Zumbach, 2007]. Its main advantage is to test the whole distribution, and tests can be constructed to emphasize the tails, which are the focal point for risk evaluation. The dynamical aspect of the risk evaluation can also be tested, for example using lagged correlations of the probtiles.

An interesting recent development concerns scoring functions, elicitability and the direct comparison of risk methodologies. The set-up is to have a fixed distribution for a random variable, playing the role of the returns. The idea is to have some functions that attribute a score to the set (confidence level α, tentative risk measure(s), realized return). For some selected scoring functions, the minimum of the expectation of the scoring function is given by the actual risk measure, say VaR_α. A risk measure for which such a scoring function exists is called elicitable. An interesting property of the scoring function is to allow for a direct comparison of risk methodologies, essentially using the expected difference of the scores for the respective methodologies (see e.g. [Nolde and Ziegel, 2017] and references therein). With this approach, various risk methodologies can be compared pairwise in order to decide which one is better. Yet, the application of this approach to empirical data requires the distribution to be stationary, a hypothesis which is invalid in finance. Given this limitation, we have to use an absolute approach, where each risk methodology is assessed independently for its qualities.

So far, most of the literature about financial risk checks the distributional aspect of the forecast, and the dynamical aspect of risk evaluation has been much less discussed. We find this bias quite odd, since having the dynamics right is quantitatively more important for actual decision-making than having the asymptotic distribution right. We will first present a simple illustrative example for one stock, showing visually the key role of the dynamics. Then, we introduce the tile test, which gives a joint test for the dynamics and the distribution.

Figure 1: The price for the UBS stock (green line, right axis), the daily returns (black squares) and the 5 and 95% VaR, computed with the historical return methodology (blue line) and the historical innovation methodology (red line).

The example uses the UBS stock price (a major Swiss bank), with a quite eventful history over the last 20 years related to the major crises and to events specific to UBS. This choice provides for spectacular figures, but all stocks, indexes or FX rates show similar properties. Figure 1 shows the UBS prices, the daily returns, and the 5 and 95% VaR computed with two methodologies. The daily returns are marked with black squares, and clearly the width of the return distribution changes with time.
Clear periods can be seen where the distribution is narrow or wide, a property called heteroskedasticity, namely the variance (skedasticity) of the return time series is not constant in time (hetero). Accordingly, the risk is changing, and this should be reflected in the risk measures.

In Fig. 1, the blue line corresponds to the widely used historical return methodology, namely the distribution for the next return is spanned by the daily returns in a moving 2 year window. The red line is based on a long-memory ARCH process, with a historical distribution for the innovations (details about the risk methodologies are given below in Sec. 5). The dynamics is quite different between both models, with a slow (fast) adaptation to the market conditions for the blue (red) line. If the risk computation is correct at the 5 and 95% level, 5% of the daily returns should be below and above the VaR lines. This property should be verified not only asymptotically on a long sample, but at all times. With the naked eye, we see that the red line is about right, but the blue line shows clear periods with too many or too few exceedances. The difference can be large, see for example the covid crisis on the right of the graph. Let us emphasize that the blue line can be wrong during extended periods, yet be correct on a long enough sample (i.e. asymptotically). This figure gives the gist of this paper, namely that the dynamics of the exceedances, or of the probtiles, should be tested thoroughly.

Figure 2: The probtiles for the UBS stock, computed using the historical return methodology. The non-uniformities during some periods are clearly visible, corresponding to a clear over- or under-estimation of the risk.

Figure 1 is important since it gives the key message in bare form, without any statistics. These time series are obviously not stationary, and therefore not easily amenable to statistics. As explained previously, the key idea to obtain a stationary problem is to use a PIT to transform the realized returns into probtiles, using the forecasted distributions. Figures 2 and 3 show the realized probtiles using respectively the blue and red risk methodologies. If a risk methodology is correct, the probtiles should be iid with a uniform distribution. Visually, the points should be uniformly spread in the figure. This is obviously not the case with the blue methodology, based on historical returns, where clusters are observed. Even so, the asymptotic distribution, measured on the whole sample, could be uniform. Hence, testing only the marginal distributions is a fairly weak test of the adequacy of a risk methodology. Let us emphasize that the blue risk methodology can be off for periods up to years, during which the risk is systematically under- or over-valued.

Figure 3: The probtiles for the UBS stock, computed using the historical innovation methodology.

The core idea for the tile test is to check the number of points on sub-tiles of the whole sample. The whole sample is divided regularly in T_t intervals along the t-axis, and T_z intervals along the probtile z-axis, creating T = T_t T_z tiles with an equal area. In each tile, N/T points are expected, with N the total number of points in the sample, and T the number of tiles. The actual number of points in a given tile is n_i, and δn_i = n_i − N/T should be a random variable with zero mean. If the risk model is perfect, the variance of δn_i should be in line with the theoretical model of random points. If not, the risk model can be rejected.
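A minimal sketch of this counting, assuming one probtile per business day and using the per-column expected count N_t/T_z introduced below (so that holidays are handled correctly); the function name is an illustrative choice.

```python
import numpy as np

def tile_stat(z, T_t, T_z=8):
    """Tile test statistic: the std dev of the count deviations dn_i
    over a regular T_t x T_z tiling of the (time, probtile) plane."""
    z = np.asarray(z)
    N = len(z)
    # tile indices along the time axis and the probtile axis
    it = np.minimum(np.arange(N) * T_t // N, T_t - 1)
    iz = np.minimum((z * T_z).astype(int), T_z - 1)
    counts = np.zeros((T_t, T_z))
    np.add.at(counts, (it, iz), 1)
    # expected count per tile, column by column: N_t / T_z
    expected = counts.sum(axis=1, keepdims=True) / T_z
    dn = counts - expected
    return np.sqrt(np.mean(dn ** 2))
```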
In the literature, similar tests have been used to test random generators, checking that, in a sequence of draws x_i, the vectors (x_i, x_{i+1}, ..., x_{i+n}) are uniformly spread in a unit cube [0,1]^n [Knuth, 1998]. This test is of interest in finance since it is sensitive both to the uniformity in z and to the time dynamics. When applied to Fig. 2, the fluctuations between tiles are larger than with Fig. 3, and the hypothesis of uniformity should be rejected for the first methodology. A slight complication is that the points are deterministic in time, with one point per business day, and random in the z direction. This is important to derive an analytical benchmark, but for several reasons detailed in this paper, we will only use numerical benchmarks.

This test is used with empirical time series in order to check some major methodologies used to compute financial risk. A main outcome is that the dynamical part is crucial in delivering a good risk evaluation, then the distributional model is important. This conclusion is in line with [Zumbach, 2007], where the lagged correlations of the probtiles were tested, yet the tile test gives a clearly more powerful assessment of the dynamics. Interestingly, the bulk of the literature on back-testing focuses on the marginal distributions computed with the full sample, a quantity that is only indirectly sensitive to the dynamics through the choice of the sample time boundaries. This also contradicts the belief of many practitioners who are using long samples of historical returns, inducing a very slow dynamics for the volatility. The resulting lack of reactivity cost some banks dearly during the sub-prime crisis in Fall 2008, whereas the volatility had already started to rise at the end of 2007, following a long period of decreasing volatility. Unable to evaluate properly the risks with their dynamics, they kept their positions until September 2008, and then made the headlines of newspapers.

The plan of this paper is as follows. The notation and definition of the tile test is presented in the next section. Section 3 defines the benchmark following the idea that the probtiles z should be iid in U(0,1).

The notation is as follows. The risk horizon is ΔT. The total number of points in the sample is N. The numbers of tiles in the probtile z and time t directions are T_z and T_t respectively, with the total number of tiles T = T_z T_t. The number of points in one given tile i is n_i. The time axis is divided regularly from the start to the end of the sample. The length of the tiles in the t-direction is denoted by ΔT_tile = ΔT_sample / T_t, with ΔT_sample the sample length in years. Consider now a column of tiles, spanning some time interval. The probability for one point to fall in one given tile in the z direction at a given time t is p = 1/T_z. The statistical tests involve the total number of points in the column of tiles. Due to holidays, this number changes slightly, and should be counted from the number of valid data points in each column. The total number of points in a column of tiles, in the time span for these tiles, is N_t, with N_t ≃ N/T_t.

The general idea for the tile test is that the points should be uniformly spread in the tiles, and we want to measure the deviation from uniformity. In a given tile indexed by i, the number of points is n_i, with the expected mean μ_t = N_t / T_z. The random variable δn_i = n_i − μ_t measures the difference between the actual number of points in the tile i and the mean.
Its sample standard deviation σ_δn is given by

$$ \sigma_{\delta n} = \sigma_{\delta n}(T, \Delta T) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \delta n_i^2} $$

and measures the fluctuations of the point counts between tiles. The point count standard deviation σ_δn is our base random variable, depending on the number of tiles T, on the time series, on the risk horizon ΔT, and on the risk methodology used to compute the ex ante probability distribution. Our goal is to construct a statistical test based on σ_δn. Our null hypothesis is that the probtiles are iid with a uniform distribution, and this induces the null distribution for σ_δn. Following the standard statistical testing procedure, a methodology can be rejected if the fluctuations are too large compared to the theoretical distribution obtained with the null hypothesis.

The usual approach is to demonstrate (or assume) an asymptotic distribution with the null hypothesis, then to compute the mean and standard deviation of σ_δn. Yet, this usual standardization strategy has the following limits.

• It is more natural to take σ_δn as the base random variable to construct a test, but analytical computations can be done only on σ²_δn.

• For a risk horizon ΔT larger than 1 day, the one-day sampling of the probtiles is correlated. Unfortunately, the theoretical distribution for σ_δn or σ²_δn is then unknown. A possible remedy is to sub-sample the data every ΔT points, but this reduces the sample size by the same factor. As the back-test gets increasingly difficult for growing ΔT, we would like to keep as many data points as possible.

• Stocks with low liquidity have many days with zero return, essentially due to no trading. Many stocks also have quite low prices, making the minimal price increment apparent for small price changes. Both effects are often compounded. For such cases, the return distributions become singular, with peaks at zero and at the price increments, and the subsequent probtile distributions are not uniform in the neighbourhood of z = 1/2.
Unfortunately, many stocks are in this case, and it is important to validate risk evaluations also in such cases, and not only for indexes or major stocks.

• Depending on the algorithm used to compute the risk forecast, a subtle small negative auto-correlation makes the results better than the benchmark. This point is important to interpret the statistical test results, and is discussed extensively in Sec. 7.

For these reasons, we decided to use exclusively a Monte Carlo approach, simulating σ_δn as computed from random walks with constant volatility and with normal iid returns, in order to have a reference distribution for the null hypothesis. Then, the value for σ_δn is computed for the actual data and risk methodology, and the rejection probability is computed numerically with a probability integral transform (PIT) using the null distribution for σ_δn. For stocks, in order to deal with the slow trading and price granularity, we censor the tiles around z = 1/2.

The rejection test using a base Monte Carlo benchmark is built as follows, described for ΔT = 1. A sample of random probtiles is drawn from a uniform distribution, one per day along the time axis, and with the same length N as the empirical data sample. The tile test statistic σ_δn,MC is computed on this sample, possibly including a censorship around z = 1/2, and for the numbers of tiles T that we want to use. The procedure is repeated n_MC = 500 times in order to obtain a numerical estimate of the cumulative distribution cdf_MC(σ_δn) for σ_δn,MC. For an empirical value σ_δn,emp, the tile test statistic is given as

$$ p = 1 - \mathrm{cdf}_{\mathrm{MC}}(\sigma_{\delta n, \mathrm{emp}}). \qquad (1) $$

This quantity measures the probability that a value as large as σ_δn,emp or larger is observed according to the null hypothesis that the probtiles are independent with a uniform distribution. If p is smaller than a threshold, say 5%, then the null hypothesis can be rejected. Essentially, p measures the probability that a risk methodology is correct, and a small p indicates that the risk is incorrectly estimated (or equivalently that σ_δn,emp is too large compared to a uniform sampling).

This computation is done for a given time series with index α, and the corresponding probability p_α is obtained. In order to have an overall measure for a risk methodology, the values p_α are computed for a set of series, and the simple mean gives the probability that the methodology evaluates correctly the risk for this sample:

$$ p_{\mathrm{methodology}} = \frac{1}{n} \sum_{\alpha} p_{\alpha}. \qquad (2) $$

When sampling daily a risk computation at a risk horizon ΔT larger than one, the points separated by less than ΔT are correlated due to the overlap. Thus, the Monte Carlo simulation needs to be adapted to include the correlations induced by the overlap. Yet, another subtle problem occurs with this test, which needs an extension based on normal random walks. This is discussed in Sec. 7, and this extension naturally provides a benchmark for ΔT larger than 1 day, while being equivalent to this one for ΔT = 1.

The choice of the tiles is free, but the overall idea is to get a sequence of coarse to fine tiles, in order to test risk methodologies at various scales. We have tried a few tilings, in particular increasing simultaneously the divisions in the z and t directions. Yet, the most interesting test is in t, since it depends on the dynamics of the market and of the risk methodologies. After several exploratory studies, we singled out the following tiling: a fixed division in the z direction with T_z = 8 tiles, and an increasing number of tiles in the t direction.
The number of tiles in t follows a geometric progression with a factor √2. The number of tiles goes from T_t = 1 (i.e. no division, or the full sample in t) to a maximal number of tiles corresponding to having no less than 2 points per tile (so roughly 1 month). For the figures, the horizontal axis is the time length of the tiles, expressed in years. This tiling and representation allow a diagnostic of a risk methodology as a function of the characteristic time length for the test.

In the figures below for the tile test, the left side of the graph, corresponding to short tiles, is mainly sensitive to the dynamics, while the right side of the graph, corresponding to long tiles, is mainly sensitive to the asymptotic distribution of the probtiles. Many risk evaluations used in the following empirical study use a 2 year trailing window, which is apparent in the graphs at the centre.

The users typically focus on VaR or ES, but the core object produced by a risk methodology is the (forecast for the) probability distribution for the losses (or for the returns, up to a trivial sign). A risk methodology is a recipe to construct this forecast, in practice in a multivariate setting, albeit we will investigate only the univariate case. At the core level, the risk methodologies can be based on the returns or on the innovations. Since the ones based on the returns are simpler, let us present them first.

The methodologies based on the returns typically "just" resample past returns, using a sample of length N_hist. Over the last N_hist days, the daily returns are computed, and they span the forecasted distribution. For a risk horizon ΔT, the daily returns are scaled by √ΔT. A variation is to compute the returns over ΔT (using a sample of N_hist + ΔT past prices), with no scaling. We have investigated these 2 methodologies.

• Historical returns @ 1d (with a √ΔT scaling)

• Historical returns @ ΔT (without scaling)

For the empirical investigation, we have used N_hist = 500 days, corresponding roughly to 2 years. The algorithm's simplicity is very appealing, and is the reason for its wide usage. Since it is so simple ("it just resamples past returns"), it seems void of hypotheses. This argument is often used, but is wrong. The key hypothesis is that the returns have a stationary distribution. But this is incorrect because of the heteroskedasticity. The figures given in the introduction display the important dynamics of the volatility and of the return distribution. To be sure: in finance, the return distributions are time dependent for all time series, hence not stationary.

The related volatility forecast is given by the standard deviation of the return sample. Accordingly, the variance is an equally weighted sum of squared returns over the last N_hist days. A large return entering the sample has a weight 1/N_hist, and the weight is constant over the N_hist subsequent days. Then, this information is abruptly forgotten when the point leaves the trailing sample, possibly leading to an "echo" behaviour with a step down in the volatility.

The methodologies based on the innovations use a process structure to model the volatility dynamics [Zumbach, 2006]. At the time t, the base equation is

$$ r(t + \delta t) = \tilde{\mu}(t) + \tilde{\sigma}(t)\, \epsilon(t + \delta t), \qquad \epsilon \sim p(\epsilon) \qquad (3) $$

where μ̃(t) and σ̃(t) are the forecasts for the mean and standard deviation, computed using the information up to t.
This equation can also be viewed as a location/size/shape decomposition of the return distribution, with the location and size considered as predictable, while the shape is stationary. In a process, the innovations ε(t + δt) are assumed to have a fixed distribution p(ε), typically normal or Student (with zero mean and unit variance). In this setting, the predictable and time dependent parts are in μ̃(t) and σ̃(t), while the ε's are random and have a time independent distribution p(ε). Since the expected returns are difficult to predict and small, the default μ̃(t) = 0 is often used. Using historical data and given the returns and the forecasts for the mean and volatility, this equation can be solved for the innovations

$$ \epsilon(t + \delta t) = \frac{r(t + \delta t) - \tilde{\mu}(t)}{\tilde{\sigma}(t)}. \qquad (4) $$

With the innovations computed from historical data, the stationarity assumption for the distribution and its shape can be studied.

Depending on the forms for μ̃, σ̃ and p(ε), many methodologies can be constructed. For all methodologies used in this empirical study, a null mean return is assumed, μ̃ = 0. The remaining components are the volatility forecast, characterized by the shape of the memory kernel, and the probability distribution for the innovations.

• Risk Metrics '94: This is the original proposition, based on an exponential moving average for the volatility and a normal distribution for the innovations [Mina and Xiao, 2001].

• LM-ARCH + Student: This is the methodology proposed by RiskMetrics [Zumbach, 2006], using a long memory ARCH (LM-ARCH) process for the volatility forecast and a Student distribution for the innovations. In this model, the volatility forecast is inferred from a daily process using conditional expectations for the squared returns [Zumbach, 2004]. Since they are derived from a process, the forecasts are consistent for increasing ΔT, have a non-trivial term structure, and do not involve new parameters or adjustments. The forecast is a weighted sum of past squared returns, with weights that decay smoothly with increasing lags. This property captures the progressive decay of the information as it recedes into the past. The original distribution is a Student with 5 degrees of freedom (dof), regardless of ΔT. We report in this study the same model but for dof = 6, which provides slightly better results.

• LM-ARCH + normal: The same volatility model, with a normal distribution.
• LM-ARCH + historical innovations: The volatility forecast is based on the same LM-ARCH process, and the distribution for the innovations corresponds to the empirical distribution of the historical innovations using a sample of length N_hist. For the empirical investigation, we have used N_hist = 500 days, as for the historical return methodologies. This methodology has a non-trivial mean in the return distribution, corresponding to the mean of the historical innovations multiplied by the volatility forecast. As for the historical return methodologies, two variants can be studied, first with innovations at 1 day, second with innovations at the risk horizon ΔT.

More methodologies can be crafted along these lines, but the main issue is to validate, or to invalidate, the various ingredients entering the computations. Beyond a sound mathematical structure and extensive statistical studies of the stylized facts, back-tests become crucial at this point.

Two data sets are used for the empirical investigations below.

• Stock indexes:
A set of major stock indexes. Number of time series: 10, from 1.1.1993 to 29.4.2020, data length 6614 days.

• FX: Major foreign exchange rates, all against USD. Number of time series: 6, from 1.1.1990 to 4.5.2020, data length 7654 points.

The results have also been verified on a set of commodity indexes and on a random selection of large and small stocks from the Swiss market. All the results and conclusions are consistent between the various sets.

Figure 4: Tile test for the 'indexes' sample at ΔT = 1 day, using a uniform benchmark. The left (right) figure is for the historical return (innovation) methodology.

For the sample of indexes, figure 4 shows the results of a tile test using the iid uniform benchmark. The horizontal axis gives the time length of the tile in the t direction, the vertical axis is the probability to observe a value as large as or larger than the empirical one compared to an iid uniform benchmark. Using the historical return methodology leads to the left figure. The values close to zero for the test show that this algorithm can be rejected at almost all time scales. It is only for very long samples, say more than 5 years, that it is not rejected for some indexes. This shows that asymptotically this methodology could be correct, but a very long sample is required. Notice that actual risk management operates in the days to months regime, on the left of the figure, where this methodology is rejected. Using the LM-ARCH + historical innovation methodology leads to the right figure. At first sight, it shows that this algorithm is excellent. On second sight, the probabilities close to one for long time intervals show that it is always better than a uniform distribution. To say it otherwise, the probtiles are distributed systematically more uniformly compared to random uniform points! Something is wrong, since a perfect model should have values randomly distributed over the
[0, 1] interval, and with a mean over many realizations around 1/2. Here, all empirical time series would reject the theoretical model (using other data sets leads to the same conclusion). A plot of the probtiles shows nothing special, just random uniform dots. We checked our software very carefully, to no avail. But it is clearly not obvious how an actual risk computation can induce probtiles that are more uniform than a theoretical uniform distribution.

Such a subtle effect is indeed introduced by the trailing sample of historical innovations (and similarly for the historical return algorithm). In order to understand intuitively the cause, let us assume that one or a few large innovations do occur out of sample. They are associated with z values close to one. These innovations get incorporated into the trailing sample, for the next 500 days in our case. Subsequently, small and normal innovation values will have z values slightly smaller compared to a theoretical distribution, as long as the large innovations are in the trailing sample. The same argument applies for large negative innovations, with slightly larger values for the subsequent probtiles. An excess of realized returns around zero produces an excess of probtiles around z = 1/2, later compensated by probtiles pushed toward more extreme values of z (closer to z = 0 or z = 1), pushing the distribution toward uniformity. This is a type of "mean reverting" effect, that produces a slightly different dynamics for the probtiles, while still with a uniform distribution. The subtle point is that the asymptotic distribution is still a uniform one, but a sequence of such random variables has less variability than independent draws. Hence, the values returned by the tile test are smaller compared to the ones obtained by iid uniform draws.

The benchmark needs to be adapted in order to include the possible negative lagged correlations induced by an algorithm based on a "trailing" sample. This is done by Monte Carlo simulations using a normal random walk with constant volatility, followed by a part that computes the probtiles using a simple algorithm that reproduces the key parts of the actual risk algorithm. In this form, it is very simple to modify the benchmark to support risk horizons longer than 1 day, while still using the full sample (i.e. not decimating by taking one point every ΔT days). The random sample of returns is computed as follows, for a final sample length of N. A random sample of normal innovations is drawn. A moving sum of length ΔT is performed, with the sum normalized by 1/√ΔT. The resulting random variables correspond to the (scaled) returns at ΔT, have a normal distribution, a unit variance, correlations when separated by less than ΔT, and a sample length reduced by ΔT − 1.
On these paths, three benchmarks are computed that reproduce the key parts of the main risk evaluation algorithms.

• Benchmark 1: A mapping of the returns with a normal cdf leads to iid uniform random variables with the desired correlation. For this benchmark, the raw Monte Carlo sample length is N_mod = N + ΔT − 1.
This benchmark is appropriate when the distribution for the returns or innovations is fixed, for example using a normal distribution.

• Benchmark 2: A trailing sample of 500 daily returns is used to obtain an empirical realisation of the cdf. This cdf is used to map the next out-of-sample return (computed at scale ΔT and scaled by 1/√ΔT) to the corresponding probtile z. Then, the oldest point in the trailing sample is dropped, the next 1 day return is added to the trailing sample, and the date is moved by 1 day. For this benchmark, the Monte Carlo sample length is N_mod = N + N_trailing + ΔT − 1 (with N_trailing = 500). This benchmark is appropriate when the distribution for the returns or innovations is based on a trailing sample computed at 1 day.

• Benchmark 3: This benchmark is identical to benchmark 2, but returns at the scale ΔT are used in the trailing sample. For this benchmark, the Monte Carlo sample length is N_mod = N + 500 + 2(ΔT − 1), as the returns in the trailing sample are also computed over ΔT days.

Finally, this one-path algorithm is repeated n_MC = 500 times in order to obtain the desired statistics for σ_δn. For the present tile test, the tile statistics for the desired tilings are computed on each random path, and the n_MC values are used to build cdf_MC(σ_δn) and to compute the corresponding mean μ_MC and standard deviation σ_MC. These statistics are obtained for each tiling. These Monte Carlo cumulative distributions are the final benchmarks for the actual algorithms used with empirical data, and benchmark 1, 2 or 3 should be chosen according to the algorithm. We denote by 'adapted benchmark' the one of these benchmarks appropriate for a risk methodology.

A few points should be emphasized. First, at a risk horizon of 1 day, benchmark 1 draws iid normal variables followed by a mapping with the normal cdf, hence uniform iid variables are obtained. Therefore, benchmark 1 at 1 day corresponds to the usual uniform benchmark, namely to iid uniform probtiles. For a longer risk horizon, it is similar but with correlations for values separated by less than ΔT. For a 1 day risk horizon, benchmarks 2 and 3 are identical, but they are different for ΔT > 1.
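As an illustration, a sketch of one benchmark-2 path is given below; it follows the recipe above under the stated assumptions (normal iid daily innovations with constant volatility, trailing empirical cdf of 500 daily returns, overlapping ΔT returns scaled by 1/√ΔT). The names and the loop structure are illustrative choices.

```python
import numpy as np

def benchmark2_path(N, dT=1, n_trailing=500, rng=None):
    """One Monte Carlo path of probtiles for benchmark 2."""
    rng = np.random.default_rng() if rng is None else rng
    n_raw = N + n_trailing + dT - 1          # N_mod, the raw sample length
    eps = rng.standard_normal(n_raw)         # daily returns, constant vol
    # overlapping dT-day returns, normalized by 1/sqrt(dT)
    r = np.convolve(eps, np.ones(dT), mode="valid") / np.sqrt(dT)
    z = np.empty(N)
    for t in range(N):
        trailing = eps[t:t + n_trailing]               # trailing daily sample
        z[t] = np.mean(trailing < r[t + n_trailing])   # PIT of next return
    return z

# Null distribution of the tile statistic and the p-value of eq. (1),
# reusing the tile_stat sketch above (sigma_emp is the empirical value):
#   stats = np.sort([tile_stat(benchmark2_path(5052), T_t=64)
#                    for _ in range(500)])
#   p = 1.0 - np.searchsorted(stats, sigma_emp) / len(stats)
```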
Using the base set-up, the probability distribution for σ_δn is investigated. The sample length is N = 5052 business days, corresponding approximately to 20 years of data. The number of tiles in the z direction is constant, T_z = 8; in the t direction it ranges from 1 to 256 (at which there is on average 2.5 points per tile). The number of tiles increases with a geometric progression in the t direction with ratio √2.
The results are better presented using a folded cdf, namely the cdf for cdf < 0.5 and 1 − cdf for cdf > 0.5.
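For reference, the folded cdf is straightforward to compute from a Monte Carlo sample; this small sketch uses illustrative names.

```python
import numpy as np

def folded_cdf(samples):
    """Sorted sample x and min(cdf, 1 - cdf) at each point; plotted with a
    logarithmic vertical axis, both tails of the cdf become visible."""
    x = np.sort(np.asarray(samples))
    cdf = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    return x, np.minimum(cdf, 1.0 - cdf)
```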
The folded-cdf representation gives a truthful picture of an empirical or Monte Carlo sampling, without using a smoothing kernel. Furthermore, a logarithmic vertical scale can be used, focusing on both tails of the cdf.

Figure 5: The folded cdf(σ_δn) of the tile test statistic for benchmarks 1 (black curve) and 2 (blue curve) for ΔT = 1, with 8 divisions in the z direction and 4 in the t direction (corresponding to a tile length of 5 years).

Figure 5 gives the folded cdf for benchmarks 1 and 2 for ΔT = 1, with a low number of divisions in the t direction. Both distributions have almost disjoint supports, with benchmark 2 displaying markedly lower values for σ_δn, namely fewer fluctuations than the usual iid uniform benchmark corresponding to benchmark 1. Depending on the tiling, the displacement of the distributions shows the very strong effect induced by the trailing sample for the empirical return distribution. This difference is the cause of the too good empirical results reported in Fig. 4.

The core of the distributions is plotted in Fig. 6 as a function of the tile length in the t direction. For increasing tile lengths, the differences between both benchmarks increase, showing the importance of selecting an appropriate benchmark. The difference becomes large when the tile length is of the order of the trailing sample, in our case 500 business days, equivalent to 2 years. This difference explains figure 4, where the historical innovation algorithm is better than the benchmark for tile lengths longer than 2 years.

Figure 7 displays the scaled mean of the tile statistics σ_δn for increasing risk horizons, for the benchmarks 1 and 3. This graph shows clearly the different behaviours when the tile length is shorter or longer than the memory length of 500 days ≃ 2 years. The reduction of the mean for large ΔT_tile originates in the mean reversion induced by the trailing windows, and it occurs regardless of the risk horizon. The curves move upward with increasing ΔT. This is due to the reduction of the effective sample length induced by the correlations of the overlapping returns. As a first estimate, the effective sample length goes as 1/√ΔT, hence the mean of the fluctuations increases as √ΔT. The actual increase is smaller, below this square-root rate. This smaller increase is the gain due to the overlapping sample, namely to sampling every day even when the risk horizon is longer. The standard deviation shows a very similar behaviour when scaled as σ_MC/√ΔT_tile, both in terms of the reduction of the variance for benchmark 3 with increasing tile length, and of the slower increase in ΔT than a square root. These features, for the mean and the standard deviation, lead to a more powerful test, including for risk horizons larger than 1 day. Indeed, most applications of risk evaluation are for medium to long risk horizons, say 5 days to a few months. This domain is increasingly difficult to test, and most of the numerical tests are inconclusive. The above gains are therefore instrumental in extending risk validation above 1 day.
Figure 6: The "one σ range" of the tile statistics for benchmarks 1 and 2 at ΔT = 1, as a function of the tile length in the t direction. The coloured areas correspond to a range of ±σ around the mean.

Figure 7: The scaled mean of the tile statistics σ_δn obtained from Monte Carlo simulations, for benchmarks 1 and 3, as a function of the tile length in the t direction, for increasing risk horizon ΔT. The scaled mean is μ_MC/√ΔT_tile, with μ_MC the mean of the tile statistics in the Monte Carlo simulations and ΔT_tile the tile length in years. The risk horizons are 1 (black), 2, 5, 10, 30 (blue); benchmark 1 is drawn with full lines, benchmark 3 with dashed lines.

Figure 8: Tile test for the 'indexes' sample at ΔT = 1 day, using an adapted benchmark. The left (right) figure is for the historical return (innovation) methodology.

This section analyses the benchmarks for the tile test statistics, and their dependency on the risk algorithm. But a similar problem occurs for all statistics applied on a back-test: depending on the risk methodology, the simple benchmark can be (strongly) biased toward accepting a methodology. Hence, for all statistics, a similar approach should be used, namely to compute a benchmark from the statistics evaluated on random paths with constant volatility and normal returns.

9 Empirical results for ΔT = 1 day

Fig. 8 is identical to Fig. 4, but with the benchmark 3, namely taking into account that both algorithms use a trailing sample of returns or innovations. Against this benchmark, the historical return methodology is rejected strongly, at all time lengths, and for all series in this sample. The rejection of this model demonstrates the importance of the dynamics. Even with a tile length using the full sample (more than 25 years), this algorithm is strongly rejected. This means that 2 decades is not long enough to reach an asymptotic regime where the return distributions converge to their long term limits. By contrast, the historical innovations cannot be rejected for most time series. It should be emphasized that the test is comparing a normal random walk with constant volatility against real data with heteroskedasticity and fat tails. The brutal difference between both methodologies shows that it is important to understand the stylized facts of the financial markets, to have good mathematical models for the time evolution; only then can sound risk evaluation algorithms be designed.

Now that we have a good understanding of the interactions between the risk methodologies and the benchmarks, a performance analysis of the main methodologies can be made using the tile test. The main ingredients we want to explore are return- or innovation-based methodologies, the model for the volatility forecast, and the probability distribution for the returns or innovations. On this basis, a few risk methodologies have been chosen, see Sec. 5 for more details. Figures 9 and 10 present the tile test p-value as a function of the tile length in the t direction for both data sets. The salient results for both data sets are the following. First, the historical return methodology is unable to get the short term dynamics, and is rejected for tile lengths up to a few years, or is rejected for all tile lengths. This is a particularly strong result, and a clear warning for its supporters.
The core reason can be anticipated from our analysis of the UBS stock in the introduction: the tile test puts clean statistics on the deficient model for the market dynamics.

Figure 9: Tile test for several risk methodologies, for the 'indexes' sample, at ΔT = 1 day, using the benchmark 1 (left column), and the adapted benchmark (right column). The left column allows comparing directly the methodologies using the same benchmark, while the right column allows comparing a methodology against its adapted random walk benchmark.

Figure 10: Tile test for several risk methodologies, for the 'FX' sample, at ΔT = 1 day.

Second, the original RiskMetrics methodology and the LM-ARCH + Student innovations have comparable performances, which are not very good. Other statistics are needed to understand the core reason, but essentially, a fixed distribution is not able to capture the peculiarities of each instrument. For example, the indexes have positive mean values for the innovations, related to their long term upward trends. Consequently, the innovations should have a positive mean value, whereas the normal and Student distributions used in the present computation are centred.

Third, a methodology using the empirical innovations on a trailing window is very good. Using the benchmark 1 (left column) shows the clear superiority of this model, while using an adapted benchmark shows that there is still space for improvements against a perfect model. The volatility forecast seems less important when using an empirical distribution. The reason likely lies in the structure of the risk forecast based on the historical innovations ε = r/σ (more precisely ε(t′) = r(t′)/σ̃(t′ − ΔT)) and the risk scenarios for the returns r = σ̃ε (more precisely r̃^(t)(t′) = σ̃(t) ε(t′), with t′ indexing the scenarios). Since the algorithm contains both a multiplication and a division by the volatility (at different times), the defects of a volatility model become less important.
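A hedged sketch of this forecast structure: past returns are divided by their own ex ante volatility to obtain innovations, and today's scenarios are these innovations rescaled by today's volatility forecast. The alignment convention and the names are illustrative; any volatility forecast (EMA, LM-ARCH, ...) can be plugged in.

```python
import numpy as np

def innovation_scenarios(returns, vol, t, n_hist=500, dT=1):
    """Return scenarios at date t for the historical-innovation methodology.

    vol[s] is assumed to be the volatility forecast made with information
    up to s, for the return realized over (s, s + dT].
    Requires t >= n_hist + dT.
    """
    r, vol = np.asarray(returns), np.asarray(vol)
    past = np.arange(t - n_hist, t)
    eps = r[past] / vol[past - dT]     # eps(t') = r(t') / vol(t' - dT)
    return vol[t] * eps                # r~(t') = vol(t) * eps(t')
```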
10 Empirical results for ΔT = 10 business days

In the back-test literature, results beyond one day are very scarce. The reason is the shrinking samples, with the effective sizes going down as 1/ΔT, and an effective test size growing as √ΔT. Notice that in our implementation, the computations are always done with the full sample, and the Monte Carlo benchmarks have an increasing size, albeit growing at a slower pace than a square root (see Sec. 8). Interestingly, the original BIS requirement is to back-test risk evaluations at the 95% level, for 1 and 10 day risk horizons.

In order to tackle this challenge, both very long samples and powerful statistical tests are needed. In this section, we analyse the same methodologies, but for a risk horizon ΔT = 10 business days. Such validations are very important for practical applications with medium to low turnover, and with an investment horizon ranging from a few weeks to several months.

Two methodologies are added to the panel. When computing a historical sample of returns, the time interval to compute the returns must be chosen. Two choices are natural, namely to compute returns at 1 day or at the risk horizon ΔT. Both choices have advantages, namely returns at 1 day lead to the largest sample of independent values, while returns at the risk horizon incorporate possible effects beyond a simple random walk. A priori, it is not clear which argument is better between the statistical sample and the financial time series model. This argument can be used for the returns and for the innovations, hence the 2 methodologies added to the panel.

The results are given in figures 11 and 12 for the same empirical samples. Overall, the results are consistent with the analysis done for a 1 day risk horizon, up to an overall loss of statistical power due to the shrinking samples. In particular, the historical return methodologies cannot capture correctly the short term dynamics, the methodologies with a fixed (centred) distribution have difficulties capturing the peculiarities of the time series, and the empirical innovations show the best performances. On the question raised in the previous paragraph, the analysis shows unambiguously that using innovations (and returns) at the risk horizon is better (in particular for indexes). This clear-cut answer points to possible specificities of the time series model beyond the heteroskedasticity using an EMA or a LM-ARCH volatility model.

A similar analysis has been done on a set of commodity indexes (agricultural, corn, soybeans, energy, non-precious metal, gold...) and on a panel of Swiss equities (including large, medium and small caps). All the results are consistent with the above graphs and analysis, showing the robustness of our conclusions.
Figure 11: Tile test for several risk methodologies, for the 'indexes' sample, at ΔT = 10 days.

Figure 12: Tile test for several risk methodologies, for the 'FX' sample, at ΔT = 10 days.

11 Conclusions

We propose a novel statistical test for the back-test of risk methodologies. It is based on the fundamental object that a risk evaluation produces, namely a forecast for the probability distribution of the expected returns (or losses), and the properties that a correct forecast should have, namely that the probability integral transform of the realized returns should be iid with a uniform distribution. Since these properties should be valid for any sub-sample, a test of uniformity on a tiling allows probing a risk methodology at various scales. Because the dynamics of a risk methodology is a crucial feature to capture risk correctly, we choose a tiling with increasingly fine divisions in the t direction, and a fixed number of tiles in the probtile z direction. This tiling allows probing a risk computation from the short term dynamics (short tiles, on the left of the graphs) to the asymptotic long term behaviour (long tiles, on the right of the graphs).

A first application of this test using a uniform benchmark leads to a paradox, that is, a risk methodology applied on real financial data shows better uniformity than a uniform distribution. This paradox is resolved by realizing that a risk computation based on historical data induces negative correlations, creating a "return to uniformity" for the probtiles.
Consequently, many risk methodologies based on trailing distributions have too good statistical results when gauged with the natural benchmark.

In order to take this effect into account, a benchmark based on random walks with constant volatility and normal returns is used. On the random paths, the algorithm used for the risk forecast is applied, possibly including a trailing sample of returns. This strategy leads to three benchmarks, whether the return distribution is fixed (benchmark 1), is based on a trailing sample of daily returns (benchmark 2), or is using a trailing sample of returns at the risk horizon (benchmark 3). The quantitative surprise is that the "return to uniformity" effect can be strong, leading potentially to distributions with disjoint supports for the benchmarks. As a side benefit, the benchmarks can be used for any risk horizon and with the full sample, without the need to decimate the sample by a factor ΔT. A censoring can also be used for stocks, in order to eliminate spurious effects due to low liquidity. The benchmark evaluations are purely based on Monte Carlo simulations, and unfortunately an analytical approach seems quite difficult.

Equipped with a powerful test and with a good understanding of the benchmarks, a panel of typical risk methodologies can be investigated. The key results are that methodologies based on historical returns do not behave correctly at scales up to a few years, since they do not capture correctly the multi-scale dynamics of the financial markets. Methodologies based on innovations perform better, with a clear advantage to the methodologies using an empirical distribution for the innovations at the risk horizon. The figures presented in the paper are based on two data sets and two risk horizons, but the empirical results are consistent with other data sets and other risk horizons (1, 2, 5, 10 and 30 days).

So far, most statistical tests applied on risk evaluation are based on the asymptotic distribution of the returns, the simplest one being counting the number of exceedances at a given probability α. The test proposed in this paper is a significant improvement over the existing tests, in particular since it is a joint test of the dynamics and the distribution. On the downside, the test does not provide a diagnostic of what is right or wrong in a methodology, and other diagnostics should be used in parallel. In particular, the asymptotic distributions of the returns, probtiles and innovations, as well as the lagged correlations of these quantities, give good complementary diagnostics to the tile test. Finally, the tile test is only based on a forecast in the form of a probability distribution. Hence, it can be applied to other fields, for example weather forecasting.

Acknowledgement:
The author thanks Samuel Quinodoz and Gwenol Grandperrin for their help with the software development part of this project.

References
José Barbachan, Aquiles Farias, and José Ornelas. Goodness-of-fit Tests focus on Value-at-Risk Estimation. Brazilian Review of Econometrics, 2006.

Sean D. Campbell. A Review of Backtesting and Backtesting Procedures. Journal of Risk, 9(2):1–17, 2006.

Peter Christoffersen. Evaluating Interval Forecasts. International Economic Review, 39:841–862, 1998.

Peter Christoffersen. Backtesting. John Wiley & Sons, Ltd, 2010. doi: 10.1002/9780470061602.eqf15018. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470061602.eqf15018.

Peter Christoffersen and Denis Pelletier. Backtesting Value-at-Risk: A Duration-Based Approach. Journal of Financial Econometrics, 2:84–108, 2004. Available at SSRN: https://ssrn.com/abstract=821715.

C. Crnkovic and J. Drachman. Quality Control. Risk, 9(9):138–143, 1996.

F.X. Diebold, T. Gunther, and A. Tay. Evaluating Density Forecasts with Applications to Financial Risk Management. International Economic Review, 39:863–883, 1998.

D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, third edition, 1998.

P. Kupiec. Techniques for Verifying the Accuracy of Risk Management Models. In Risk Measurement and Systemic Risk: Proceedings of a Joint Central Bank Research Conference, Washington, D.C., November 16–17, 1995.

J. Mina and J. Xiao. Return to RiskMetrics: The Evolution of a Standard. Technical report, RiskMetrics Group, 2001.

Natalia Nolde and Johanna F. Ziegel. Elicitability and backtesting: Perspectives for banking regulation. The Annals of Applied Statistics, 11(4):1833–1874, 2017.

Simona Roccioletti. Backtesting Value at Risk and Expected Shortfall. Springer Gabler, 2016. ISBN 978-3-658-11907-2. Master Thesis, University of Applied Science Vienna, Austria.

M. Rosenblatt. Remarks on a Multivariate Transformation. Annals of Mathematical Statistics, 23:1052–1057, 1952.

Gilles Zumbach. Volatility processes and volatility forecast with long memory. Quantitative Finance, 2004.