arXiv [q-fin.RM], June

Proving prediction prudence
Dirk Tasche ∗

We study how to perform tests on samples of pairs of observations and predictions in order to assess whether or not the predictions are prudent. Prudence requires that the mean of the difference of the observation-prediction pairs can be shown to be significantly negative. For safe conclusions, we suggest testing both unweighted (or equally weighted) and weighted means and explicitly taking into account the randomness of individual pairs. The test methods presented are mainly specified as bootstrap and normal approximation algorithms. The tests are general but can be applied in particular in the area of credit risk, both for regulatory and accounting purposes.
Keywords:
Paired difference test, weighted mean, credit risk, PD, LGD, EAD, CCF.
1. Introduction
Testing if the means of two samples significantly differ, or whether the mean of one sample significantly exceeds the mean of the other sample, is a problem that is widely covered in the statistical literature (see for instance Casella and Berger, 2002; Davison and Hinkley, 1997; Venables and Ripley, 2002). At the latest with the validation requirements for credit risk parameter estimates in the regulatory Basel II framework (BCBS, 2006, paragraph 501), such tests also became an important issue in the banking industry:
• "Banks must regularly compare realised default rates with estimated PDs for each grade and be able to demonstrate that the realised default rates are within the expected range for that grade", and
• "banks using the advanced IRB approach must complete such analysis for their estimates of LGDs and EADs".
More recently, as a consequence of the introduction of new rules for loss provisioning in financial reporting standards, the validation of risk parameter estimates also attracted interest in the accounting community (see, e.g., Bellini, 2019). Over the course of the past fifteen years or so, a variety of statistical tests for the comparison of realised and predicted values have been proposed for use in the banks' validation exercises. For overviews on estimation and validation as well as references see Blümke (2019, PD), Loterman et al. (2014, LGD), and Gürtler et al. (2018, EAD). Scandizzo (2016) presents validation methods for all these kinds of parameters in the general context of model risk management.
In order to make validation results by different banks to some extent comparable, in February 2019, the European Central Bank (ECB, 2019) asked the banks it supervises under the Single Supervisory Mechanism (SSM) to deliver standardised annual reports on their internal model validation exercises.
∗ Independent researcher, e-mail: [email protected]
[Footnote: PD means 'probability of default', IRB means 'internal ratings based', LGD means 'loss given default' and EAD means 'exposure at default'.]
In particular, the requested reports are assumed to include data and tests regarding the "predictive ability (or calibration)" of PD, LGD and CCF (credit conversion factor) parameters in the most recent observation period. Predictive ability for LGD estimation is explained through the statement "the analysis of predictive ability (or calibration) is aimed at ensuring that the LGD parameter adequately predicts the loss rate in the event of a default i.e. that LGD estimates constitute reliable forecasts of realised loss rates" (ECB, 2019, Section 2.6.2). The meanings of predictive ability for PD and EAD / CCF respectively are illustrated in similar ways.
ECB (2019) proposed "one-sample t-test[s] for paired observations" to test the "null hypothesis that estimated LGD [or CCF or EAD] is greater than true LGD" (or CCF or EAD). ECB (2019) also suggested a Jeffreys binomial test for the "null hypothesis that the PD applied in the portfolio/rating grade at the beginning of the relevant observation period is greater than the true one (one sided hypothesis test)".
In this paper,
• we make a case for also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter, in order to be able to 'prove' that the estimate is prudent (or conservative),
• we suggest additionally using exposure- (or limit-)weighted sample averages in order to better inform assessments of estimation (or prediction) prudence, and
• we propose more elaborate statements of the hypotheses for the tests (by including 'variance expansion') in order to account for portfolio inhomogeneity in terms of composition (exposure sizes) and riskiness.
The proposal to look for a 'proof' of prediction prudence is inspired by the regulatory requirement (BCBS, 2006, paragraph 451): "In order to avoid over-optimism, a bank must add to its estimates a margin of conservatism that is related to the likely range of errors".
As a matter of fact, the statistical tests discussed in this paper can be deployed both for proving prudence and for proving aggressiveness of estimates. However, an asymmetric approach is recommended for making use of the evidence from the tests:
• For proving prudence, request that both the equal-weights test and the exposure-weighted test reject the null hypothesis of the parameter being aggressive.
• For an alert of potential aggressiveness, request only that the equal-weights test or the exposure-weighted test reject the null hypothesis of the parameter being prudent.
The paper is organised as follows:
• In Section 2, we introduce a general non-parametric paired difference test approach to testing for the sign of a weighted mean value (Section 2.1). We compare this approach to the t-test for LGD, CCF and EAD proposed in ECB (2019) and note possible improvements of both approaches (Section 2.2). We then present in Section 2.3 a test approach to put into practice these improvements in the case of variables with values in the unit interval like LGD and CCF.
Appendices A and B supplement Section 2.3 with regard to weight-adjustments as an alternative to sampling with inhomogeneous weights and to testing non-negative but not necessarily bounded variables like EAD.
• In Section 3, we discuss paired difference tests in the special case of differences between observed event indicators and the predicted probabilities of the events. We start in Section 3.1 with the presentation of a test approach that takes account of potential weighting of the observation pairs and of variance expansion to deal with the individual randomness of the observations. In Section 3.2, we compare this approach to the Jeffreys test for PD proposed in ECB (2019).
• In Section 4, the test methods presented in the preceding sections are illustrated with two examples of test results.
• Section 5 concludes the paper with summarising remarks.
[Footnote: EAD and CCF of a credit facility are linked by the relation EAD = DA + CCF · (limit − DA), where DA is the already drawn amount. ECB (2019) presumably only looks at "number-weighted" (i.e. equally weighted) averages because the Basel framework (BCBS, 2006) requires such averages for the risk parameter estimates. In banking practice, however, also exposure-weighted averages are considered (see, e.g., Li et al., 2009).]
2. Paired difference tests
The statistical tests considered in this paper are 'paired difference tests'. This test design accounts for the strong dependence that is to be expected between the observation and the prediction in the matched observation-prediction pairs which the analysed samples consist of. See Mendenhall et al. (2008, Chapter 10) for a discussion of the advantages of such test designs.
Starting point.
• One sample of real-valued observations $\Delta_1, \dots, \Delta_n$.
• Weights $0 < w_i < 1$, $i = 1, \dots, n$, with $\sum_{i=1}^n w_i = 1$.
• Define the weighted-average observation $\Delta_w$ as
$$\Delta_w = \sum_{i=1}^n w_i\,\Delta_i. \qquad (2.1)$$

Interpretation in the context of credit risk back-testing.
• $\Delta_1, \dots, \Delta_n$ may be a sample of differences (residuals) between observed and predicted LGD (or CCF or EAD) for defaulted credit facilities (matched pairs of observations and predictions).
• The weight $w_i$ reflects the relative importance of observation $i$. For instance, in the case of CCF or EAD estimates of credit facilities, one might choose
$$w_i = \frac{\mathrm{limit}_i}{\sum_{j=1}^n \mathrm{limit}_j}, \qquad (2.2a)$$
where $\mathrm{limit}_j$ is the limit of credit facility $j$ at the time when the estimates were made.
• In case of LGD estimates, the weights $w_i$ could be chosen as (Li et al., 2009, Section 5)
$$w_i = \frac{\mathrm{EAD}_i}{\sum_{j=1}^n \mathrm{EAD}_j}, \qquad (2.2b)$$
where $\mathrm{EAD}_j$ is the exposure at default estimate for credit facility $j$ at the time when the estimates were made.

Goal.
We consider $\Delta_w$ as defined by (2.1) the realisation of a test statistic to be defined below and want to answer the following two questions:
• If $\Delta_w < 0$, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If $\Delta_w > 0$, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right.
In order to be able to examine the properties of the sample and $\Delta_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following assumption.

Assumption 2.1
The sample $\Delta_1, \dots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by
$$P\big[X_\vartheta = \Delta_i - \vartheta\big] = w_i, \quad i = 1, \dots, n, \qquad (2.3)$$
where the value of the parameter $\vartheta \in \mathbb{R}$ is unknown.

Note that (2.3) includes the case of equally weighted observations, by choosing $w_i = 1/n$ for all $i$.

Proposition 2.2
For $X_\vartheta$ as described in Assumption 2.1, the expected value and the variance are given by
$$E[X_\vartheta] = \Delta_w - \vartheta, \quad \text{and} \qquad (2.4a)$$
$$\mathrm{var}[X_\vartheta] = \sum_{i=1}^n w_i\,(\Delta_i - \Delta_w)^2. \qquad (2.4b)$$

Proof.
Obvious. ✷

By Assumption 2.1 and Proposition 2.2, the questions on the safety of conclusions from the sign of $\Delta_w$ can be translated into hypotheses on the value of the parameter $\vartheta$:
• If $\Delta_w < 0$, can we conclude that $H_0: \vartheta \le \Delta_w$ is false and $H_1: \vartheta > \Delta_w \Leftrightarrow E[X_\vartheta] < 0$ is true?
• If $\Delta_w > 0$, can we conclude that $H_0^*: \vartheta \ge \Delta_w$ is false and $H_1^*: \vartheta < \Delta_w \Leftrightarrow E[X_\vartheta] > 0$ is true?
Even if the sample $\Delta_1, \dots, \Delta_n$ was generated by independent realisations of $X_\vartheta$, the distribution of the sample mean is different from the distribution of $X_\vartheta$, as shown in the following corollary to Proposition 2.2.

Corollary 2.3
Let $X_{1,\vartheta}, \dots, X_{n,\vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 2.1 and define $\bar{X}_\vartheta = \frac{1}{n}\sum_{i=1}^n X_{i,\vartheta}$. Then for the mean and the variance of $\bar{X}_\vartheta$, it holds that
$$E[\bar{X}_\vartheta] = \Delta_w - \vartheta, \qquad (2.5a)$$
$$\mathrm{var}[\bar{X}_\vartheta] = \frac{1}{n}\Big(\sum_{i=1}^n w_i\,(\Delta_i - \Delta_w)^2\Big). \qquad (2.5b)$$
In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w$ as its observed value.

Bootstrap test.
Generate a Monte Carlo sample $\bar{x}_1, \dots, \bar{x}_R$ from $\Delta_1, \dots, \Delta_n$ as follows:
• For $j = 1, \dots, R$: $\bar{x}_j$ is the equally weighted mean of $n$ independent draws from the distribution of $X_{\hat\vartheta}$ as given by (2.3), with $\hat\vartheta = 0$. Equivalently, $\bar{x}_j$ is the mean of $n$ draws with replacement from the sample $\Delta_1, \dots, \Delta_n$, where $\Delta_i$ is drawn with probability $w_i$.
• $\bar{x}_1, \dots, \bar{x}_R$ are realisations of independent, identically distributed random variables.
Then a bootstrap p-value for the test of $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$ can be calculated as
$$\text{p-value} = \frac{1 + \#\big\{j : j \in \{1, \dots, R\},\ \bar{x}_j \le 2\,\Delta_w\big\}}{R + 1}. \qquad (2.6a)$$
A bootstrap p-value for the test of $H_0^*: \vartheta \ge \Delta_w$ against $H_1^*: \vartheta < \Delta_w$ is given by
$$\text{p-value}^* = \frac{1 + \#\big\{j : j \in \{1, \dots, R\},\ \bar{x}_j \ge 2\,\Delta_w\big\}}{R + 1}. \qquad (2.6b)$$
[Footnotes: See Appendix A for a more detailed discussion of special cases with equal weights. For arithmetic reasons, most of the time $\Delta_w$ actually cannot be a realisation of $\bar{X}_\vartheta$; as long as the sample size $n$ is not too small, however, by (2.5a) and the law of large numbers, considering $\Delta_w$ as realisation of $\bar{X}_\vartheta$ is not unreasonable. According to Davison and Hinkley (1997, Section 5.2.3), sample size $R = 999$ should suffice for the purposes of this paper.]

Rationale.
By (2.3), for each $\vartheta$ the distributions of $X_0 - \vartheta$ and $X_\vartheta$ are identical. As a consequence, if under $H_0$ the true parameter is $\vartheta \le \Delta_w$ and $(-\infty, x]$ is the critical (rejection) range for the test of $H_0$ against $H_1$ based on the test statistic $\bar{X}_\vartheta$, then it holds that
$$P\big[\bar{X}_\vartheta \in (-\infty, x]\big] = P[\bar{X}_0 \le x + \vartheta] \le P[\bar{X}_0 \le x + \Delta_w]. \qquad (2.7)$$
Hence, by Theorem 8.3.27 of Casella and Berger (2002), in order to obtain a p-value for $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$, according to (2.7) it suffices to specify:
• the upper limit $x$ of the critical range for rejection of $H_0: \vartheta \le \Delta_w$ as the 'observed' value $\Delta_w$ of $\bar{X}_\vartheta$, and
• an approximation of the distribution of $\bar{X}_0$, as is done by generating the bootstrap sample $\bar{x}_1, \dots, \bar{x}_R$,
so that the p-value is approximated by the proportion of bootstrap means not exceeding $x + \Delta_w = 2\,\Delta_w$. This implies Equation (2.6a) for the bootstrap p-value of the test of $H_0$ against $H_1$. The rationale for (2.6b) is analogous.

Normal approximate test.
By Corollary 2.3 for $\vartheta = \Delta_w$, we find that the distribution of $\bar{X}_{\Delta_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.5b). With $x = \Delta_w$, therefore, we obtain the following expression for the normal approximate p-value of $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$:
$$\text{p-value} = P[\bar{X}_{\Delta_w} \le x] \approx \Phi\!\left(\frac{\sqrt{n}\,\Delta_w}{\sqrt{\sum_{i=1}^n w_i\,(\Delta_i - \Delta_w)^2}}\right). \qquad (2.8a)$$
The same reasoning gives for the normal approximate p-value of $H_0^*: \vartheta \ge \Delta_w$ against $H_1^*: \vartheta < \Delta_w$:
$$\text{p-value}^* \approx 1 - \Phi\!\left(\frac{\sqrt{n}\,\Delta_w}{\sqrt{\sum_{i=1}^n w_i\,(\Delta_i - \Delta_w)^2}}\right). \qquad (2.8b)$$

In Sections 2.6.2 (for LGD back-testing), 2.9.3.1 (for CCF back-testing) and 2.9.3.2 (for EAD back-testing) of ECB (2019), the ECB proposes a t-test for – in the terms of Section 2.1 – $H_0^*: \vartheta \ge \Delta_w$ against $H_1^*: \vartheta < \Delta_w$. Transcribed into the notation of Section 2.1, the test can be described as follows:
• $n$ is the number of matched pairs of observations and predictions in the sample.
• $\Delta_i$ is the difference of
  – the realised LGD for facility $i$ and the estimated LGD for facility $i$ in Section 2.6.2,
  – the realised CCF for facility $i$ and the estimated CCF for facility $i$ in Section 2.9.3.1, and
  – the drawings (balance sheet exposure) at the time of default of facility $i$ and the estimated EAD of facility $i$ in Section 2.9.3.2.
• All $w_i$ equal $1/n$.
• The right-hand side of (2.5b) is replaced by the sample variance
$$s_n^2 = \frac{1}{n-1}\sum_{i=1}^n\big(\Delta_i - \Delta_{1/n}\big)^2, \qquad \Delta_{1/n} = \frac{1}{n}\sum_{i=1}^n \Delta_i.$$
• The p-value is computed as
$$\text{p-value}^* = 1 - \Psi_{n-1}\!\left(\frac{\sqrt{n}\,\Delta_{1/n}}{s_n}\right), \qquad (2.9)$$
where $\Psi_{n-1}$ denotes the distribution function of Student's t-distribution with $n-1$ degrees of freedom. For large $n$, the p-value (2.9) approximately equals the p-value (2.8b) computed with equal weights $w_i = 1/n$ for all $i = 1, \dots, n$. For smaller $n$, the value of (2.9) would be exact if the variables $X_{i,\vartheta}$ in Corollary 2.3 were normally distributed.
[Footnote: $\#S$ denotes the number of elements of the set $S$. We adopt here the definition of the bootstrap p-value provided by Davison and Hinkley (1997, Eq. (4.11)).]

Criticisms of the basic approach.
The basic approach as described in Sections 2.1 and 2.2 fails to take account of the following issues:
• The random mechanism reflected by (2.3) can be interpreted as an expression of uncertainty about the cohort / portfolio composition. The randomness of the loss rate / exposure of the individual facilities – the degree of which potentially can differ between facilities – is not captured by (2.3).
• The parametrisation of the distribution by a location parameter in (2.3) could result in distributions with features that are not realistic, for instance negative exposures or loss rates greater than one.
In the following section and in Appendix B, we are going to modify the basic approach for LGD / CCF on the one hand and EAD on the other hand in such a way as to take into account these two issues.
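Before the modifications, the basic tests of Sections 2.1 and 2.2 can be sketched in code. The following is a minimal sketch on hypothetical residuals (the data, the seed and the choice R = 999 are illustrative assumptions, not from the paper): the bootstrap p-values follow (2.6a)/(2.6b), where the resampling distribution has mean $\Delta_w$, so the null boundary corresponds to the threshold $2\,\Delta_w$; the normal approximate p-values follow (2.8a)/(2.8b).

```python
import math
import random

def weighted_mean(deltas, weights):
    # Delta_w = sum_i w_i * Delta_i, eq. (2.1)
    return sum(w * d for w, d in zip(weights, deltas))

def bootstrap_p_values(deltas, weights, R=999, seed=1):
    # Bootstrap test of Section 2.1: resample Delta_i with probability w_i;
    # the resampled means have expectation Delta_w, so the comparison with
    # the null boundary uses the threshold 2 * Delta_w, cf. (2.6a)/(2.6b)
    rng = random.Random(seed)
    n = len(deltas)
    delta_w = weighted_mean(deltas, weights)
    x_bars = [sum(rng.choices(deltas, weights=weights, k=n)) / n
              for _ in range(R)]
    p = (1 + sum(x <= 2 * delta_w for x in x_bars)) / (R + 1)       # (2.6a)
    p_star = (1 + sum(x >= 2 * delta_w for x in x_bars)) / (R + 1)  # (2.6b)
    return p, p_star

def normal_approx_p_values(deltas, weights):
    # Normal approximation of Section 2.1, cf. (2.8a) and (2.8b)
    n = len(deltas)
    delta_w = weighted_mean(deltas, weights)
    var_w = sum(w * (d - delta_w) ** 2 for w, d in zip(weights, deltas))  # (2.4b)
    z = math.sqrt(n) * delta_w / math.sqrt(var_w)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return phi, 1.0 - phi

# Hypothetical residuals (observation minus prediction) and equal weights
deltas = [-0.10, -0.05, 0.02, -0.03, -0.07, -0.01]
weights = [1 / 6] * 6
p_boot, p_boot_star = bootstrap_p_values(deltas, weights)
p_norm, p_norm_star = normal_approx_p_values(deltas, weights)
```

With these mostly negative residuals, the p-value of $H_0$ (parameter aggressive) comes out small under both methods, which supports the conclusion that the predictions are prudent.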
By definition, both LGD and CCF take values only in the unit interval $[0, 1]$.

Starting point.
• A sample of paired observations $(\lambda_1, \ell_1), \dots, (\lambda_n, \ell_n)$, with predicted LGDs $0 < \lambda_i < 1$ and realised loss rates $0 \le \ell_i \le 1$.
• Weights $0 < w_i < 1$, $i = 1, \dots, n$, with $\sum_{i=1}^n w_i = 1$.
• Weighted average loss rate $\ell_w = \sum_{i=1}^n w_i\,\ell_i$ and weighted average loss prediction $\lambda_w = \sum_{i=1}^n w_i\,\lambda_i$.

Interpretation in the context of LGD back-testing.
• A sample of $n$ defaulted credit facilities / loans is analysed.
• The LGD $\lambda_i$ is an estimate of loan $i$'s loss rate as a consequence of the default, measured as percentage of the exposure at the time of default (EAD).
• The realised loss rate $\ell_i$ shows the percentage of loan $i$'s exposure at the time of default that cannot be recovered.
• The weight $w_i$ reflects the relative importance of observation $i$. In the case of LGD predictions, one might choose (2.2b) for the definition of the weights; for CCF one might choose (2.2a) instead.
• Define $\Delta_i = \ell_i - \lambda_i$, $i = 1, \dots, n$. If $|\Delta_i| \approx 0$ then $\lambda_i$ is a good LGD prediction. If $|\Delta_i| \approx 1$ then $\lambda_i$ is a poor LGD prediction.

Goal.
We want to use the observed weighted average difference / residual $\Delta_w = \sum_{i=1}^n w_i\,\Delta_i = \ell_w - \lambda_w$ to assess the quality of the calibration of the model / approach for the $\lambda_i$ to predict the realised loss rates $\ell_i$. Again we want to answer the following two questions:
• If $\Delta_w < 0$, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If $\Delta_w > 0$, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right.
In order to be able to examine the specific properties of the sample and $\Delta_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumption 2.1.

Assumption 2.4
The sample $\Delta_1, \dots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by
$$X_\vartheta = \ell_I - Y_\vartheta, \qquad (2.10a)$$
where $I$ is a random variable with values in $\{1, \dots, n\}$ and $P[I = i] = w_i$, $i = 1, \dots, n$. $Y_\vartheta$ is a $\mathrm{beta}(\alpha_i, \beta_i)$-distributed random variable conditional on $I = i$ for $i = 1, \dots, n$. The parameters $\alpha_i$ and $\beta_i$ of the beta-distribution depend on the unknown parameter $0 < \vartheta < 1$ by
$$\alpha_i = \vartheta_i\,\frac{1 - v}{v}, \quad \text{and} \quad \beta_i = (1 - \vartheta_i)\,\frac{1 - v}{v}. \qquad (2.10b)$$
In (2.10b), the constant $0 < v < 1$ is the same for all $i$. The $\vartheta_i$ are determined by
$$\vartheta_i = (\lambda_i)^{h(\vartheta)}, \qquad (2.10c)$$
where $0 < h(\vartheta) < \infty$ is the unique solution $h$ of the equation
$$\vartheta = \sum_{i=1}^n w_i\,(\lambda_i)^h. \qquad (2.10d)$$

Assumption 2.4 introduces randomness of the difference between loss rate and LGD prediction for individual facilities. Comparison between (2.13b) below and (2.4b) shows that this entails variance expansion of the sample $\Delta_1, \dots, \Delta_n$.
Note that Assumption 2.4 also describes a method for recalibration of the LGD estimates $\lambda_1, \dots, \lambda_n$ to match targets $\vartheta$ with the weighted average of the $\vartheta_i$. In contrast to (2.3), the transformation (2.10c) makes sure that the transformed parameters still take values in the unit interval. By definition of $Y_\vartheta$, it holds that $E[Y_\vartheta\,|\,I = i] = \vartheta_i$. The constant $v$ specifies the variance of $Y_\vartheta$ conditional on $I = i$ as percentage of the supremum $\vartheta_i\,(1 - \vartheta_i)$ of its possible conditional variance, i.e. it holds that
$$\mathrm{var}[Y_\vartheta\,|\,I = i] = v\,\vartheta_i\,(1 - \vartheta_i), \quad i = 1, \dots, n. \qquad (2.11)$$
The constant $v$ must be pre-defined or separately estimated. We suggest estimating it from the sample $\ell_1, \dots, \ell_n$ as
$$\hat{v} = \frac{\sum_{i=1}^n w_i\,(\ell_i - \ell_w)^2}{\ell_w\,(1 - \ell_w)}. \qquad (2.12)$$
This approach yields $0 \le \hat{v} \le 1$ because $0 \le \ell_i \le 1$, $i = 1, \dots, n$, implies
$$\sum_{i=1}^n w_i\,(\ell_i - \ell_w)^2 \le \ell_w\,(1 - \ell_w).$$
A simpler alternative to the definition (2.10c) of $\vartheta_i$ would be linear scaling: $\vartheta_i = \lambda_i\,\frac{\vartheta}{\lambda_w}$. However, with this definition $\vartheta_i > 1$ could occur such that the conditional distribution of $Y_\vartheta$ given $I = i$ would be ill-defined.
[Footnote: See Casella and Berger (2002, Section 3.3) for a definition of the beta-distribution.]

Proposition 2.5
For $X_\vartheta$ as described in Assumption 2.4, the expected value and the variance are given by
$$E[X_\vartheta] = \ell_w - \vartheta, \quad \text{and} \qquad (2.13a)$$
$$\mathrm{var}[X_\vartheta] = \sum_{i=1}^n w_i\,(\ell_i - \vartheta_i)^2 - (\ell_w - \vartheta)^2 + v\,\sum_{i=1}^n w_i\,\vartheta_i\,(1 - \vartheta_i). \qquad (2.13b)$$

Proof.
For deriving the formula for $\mathrm{var}[X_\vartheta]$, make use of the well-known variance decomposition
$$\mathrm{var}[X_\vartheta] = E\big[\mathrm{var}[X_\vartheta\,|\,I]\big] + \mathrm{var}\big[E[X_\vartheta\,|\,I]\big]. \qquad ✷$$
In contrast to (2.4b), the variance of $X_\vartheta$ as shown in (2.13b) depends on the parameter $\vartheta$ and has an additional component $v\,\sum_{i=1}^n w_i\,\vartheta_i\,(1 - \vartheta_i)$ which reflects the potentially different variances of the loss rates in an inhomogeneous portfolio.
By Assumption 2.4 and Proposition 2.5, the questions on the safety of conclusions from the sign of $\Delta_w = \ell_w - \lambda_w$ again can be translated into hypotheses on the value of the parameter $\vartheta$:
• If $\Delta_w < 0$, can we conclude that $H_0: \vartheta \le \ell_w$ is false and $H_1: \vartheta > \ell_w \Leftrightarrow E[X_\vartheta] < 0$ is true?
• If $\Delta_w > 0$, can we conclude that $H_0^*: \vartheta \ge \ell_w$ is false and $H_1^*: \vartheta < \ell_w \Leftrightarrow E[X_\vartheta] > 0$ is true?
Even if the sample $\Delta_1, \dots, \Delta_n$ was generated by independent realisations of $X_\vartheta$, the distribution of the sample mean is different from the distribution of $X_\vartheta$, as shown in the following corollary to Proposition 2.5.

Corollary 2.6
Let $X_{1,\vartheta}, \dots, X_{n,\vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 2.4 and define $\bar{X}_\vartheta = \frac{1}{n}\sum_{i=1}^n X_{i,\vartheta}$. Then for the mean and variance of $\bar{X}_\vartheta$, it holds that
$$E[\bar{X}_\vartheta] = \ell_w - \vartheta, \qquad (2.14a)$$
$$\mathrm{var}[\bar{X}_\vartheta] = \frac{1}{n}\left(\sum_{i=1}^n w_i\,(\ell_i - \vartheta_i)^2 - (\ell_w - \vartheta)^2 + v\,\sum_{i=1}^n w_i\,\vartheta_i\,(1 - \vartheta_i)\right). \qquad (2.14b)$$
In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w = \ell_w - \lambda_w$ as its observed value.

Proposition 2.7 In the setting of Assumption 2.4 and Corollary 2.6, $\vartheta \le \hat\vartheta$ implies that
$$P[\bar{X}_\vartheta \le x] \le P[\bar{X}_{\hat\vartheta} \le x], \quad \text{for all } x \in \mathbb{R}.$$

Proof.
Observe that $\vartheta \le \hat\vartheta$ implies $\vartheta_i \le \hat\vartheta_i$ for all $i = 1, \dots, n$. For fixed $i$, the family of $\mathrm{beta}(\alpha_i, \beta_i)$-distributions, parametrised by $\vartheta \in (0, 1)$, has a monotone likelihood ratio in the sense of Definition 8.3.16 of Casella and Berger (2002). This implies that for $\vartheta \le \hat\vartheta$, conditional on $I = i$, the distribution of $Y_{\hat\vartheta}$ is stochastically not less than the distribution of $Y_\vartheta$, i.e. it holds that
$$P[Y_\vartheta \le x\,|\,I = i] \ge P[Y_{\hat\vartheta} \le x\,|\,I = i], \quad \text{for all } x \in \mathbb{R}.$$
From this, it follows that for all $i = 1, \dots, n$
$$P[X_\vartheta \le x\,|\,I = i] \le P[X_{\hat\vartheta} \le x\,|\,I = i], \quad \text{for all } x \in \mathbb{R}.$$
But this inequality implies for all $x \in \mathbb{R}$ that
$$P[X_\vartheta \le x] = \sum_{i=1}^n w_i\,P[X_\vartheta \le x\,|\,I = i] \le P[X_{\hat\vartheta} \le x]. \qquad (2.15)$$
Property (2.15) is passed on to convolutions of independent copies of $X_\vartheta$ and $X_{\hat\vartheta}$. This proves the assertion. ✷

Bootstrap test.
Generate a Monte Carlo sample $\bar{x}_1, \dots, \bar{x}_R$ from $X_\vartheta$ with $\vartheta = \ell_w$ as follows:
• For $j = 1, \dots, R$: $\bar{x}_j$ is the equally weighted mean of $n$ independent draws from the distribution of $X_\vartheta$ as given by Assumption 2.4, with $\vartheta = \ell_w$.
• $\bar{x}_1, \dots, \bar{x}_R$ are realisations of independent, identically distributed random variables.
Then a bootstrap p-value for the test of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ can be calculated as
$$\text{p-value} = \frac{1 + \#\big\{j : j \in \{1, \dots, R\},\ \bar{x}_j \le \ell_w - \lambda_w\big\}}{R + 1}. \qquad (2.16a)$$
A bootstrap p-value for the test of $H_0^*: \vartheta \ge \ell_w$ against $H_1^*: \vartheta < \ell_w$ is given by
$$\text{p-value}^* = \frac{1 + \#\big\{j : j \in \{1, \dots, R\},\ \bar{x}_j \ge \ell_w - \lambda_w\big\}}{R + 1}. \qquad (2.16b)$$

Rationale.
By Proposition 2.7, if under $H_0$ the true parameter is $\vartheta \le \ell_w$ and $(-\infty, x]$ is the critical (rejection) range for the test of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ based on the test statistic $\bar{X}_\vartheta$, then it holds that
$$P\big[\bar{X}_\vartheta \in (-\infty, x]\big] \le P[\bar{X}_{\ell_w} \le x]. \qquad (2.17)$$
Hence, by Theorem 8.3.27 of Casella and Berger (2002), in order to obtain a p-value for $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$, according to (2.17) it suffices to specify:
• the upper limit $x$ of the critical range for rejection of $H_0: \vartheta \le \ell_w$ as our realisation $\Delta_w = \ell_w - \lambda_w$ of $\bar{X}_\vartheta$, and
• an approximation of the distribution of $\bar{X}_{\ell_w}$, as has been done by generating the bootstrap sample $\bar{x}_1, \dots, \bar{x}_R$.
This implies Equation (2.16a) for the bootstrap p-value. The rationale for (2.16b) is analogous.

Normal approximate test. By Corollary 2.6, we find that the distribution of $\bar{X}_{\ell_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.14b) with $\vartheta = \ell_w$. With $x = \ell_w - \lambda_w$, one obtains for the approximate p-value of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$:
$$\text{p-value} = P[\bar{X}_{\ell_w} \le x] \approx \Phi\!\left(\frac{\sqrt{n}\,(\ell_w - \lambda_w)}{\sqrt{\sum_{i=1}^n w_i\,(\ell_i - \hat\vartheta_i)^2 + v\,\sum_{i=1}^n w_i\,\hat\vartheta_i\,(1 - \hat\vartheta_i)}}\right), \qquad (2.18a)$$
with $\hat\vartheta_i = (\lambda_i)^{h(\ell_w)}$ as in Assumption 2.4. The same reasoning gives for the normal approximate p-value of $H_0^*: \vartheta \ge \ell_w$ against $H_1^*: \vartheta < \ell_w$:
$$\text{p-value}^* \approx 1 - \Phi\!\left(\frac{\sqrt{n}\,(\ell_w - \lambda_w)}{\sqrt{\sum_{i=1}^n w_i\,(\ell_i - \hat\vartheta_i)^2 + v\,\sum_{i=1}^n w_i\,\hat\vartheta_i\,(1 - \hat\vartheta_i)}}\right). \qquad (2.18b)$$
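The machinery of Section 2.3 can be sketched end-to-end. The following is a minimal sketch on hypothetical LGD data (the data, the seed, R = 999 and the bracketing interval of the root search are all illustrative assumptions): the exponent $h(\ell_w)$ is found by bisection of (2.10d), $v$ is estimated as in (2.12), and the bootstrap draws follow (2.10a)/(2.10b) with $\vartheta = \ell_w$, giving the p-values (2.16a)/(2.16b).

```python
import random

def solve_h(lambdas, weights, target, lo=1e-9, hi=60.0):
    # Bisection for h with sum_i w_i * lambda_i**h == target, eq. (2.10d);
    # the left-hand side is strictly decreasing in h for 0 < lambda_i < 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(w * lam ** mid for w, lam in zip(weights, lambdas)) > target:
            lo = mid   # weighted mean still above target -> raise h
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_v(losses, weights):
    # v-hat of eq. (2.12); always in [0, 1] for loss rates in [0, 1]
    ell_w = sum(w * l for w, l in zip(weights, losses))
    return (sum(w * (l - ell_w) ** 2 for w, l in zip(weights, losses))
            / (ell_w * (1 - ell_w)))

def lgd_bootstrap_p_values(losses, lambdas, weights, R=999, seed=1):
    # Bootstrap test under Assumption 2.4 with theta = ell_w, cf. (2.16a)/(2.16b)
    rng = random.Random(seed)
    n = len(losses)
    ell_w = sum(w * l for w, l in zip(weights, losses))
    lambda_w = sum(w * lam for w, lam in zip(weights, lambdas))
    thetas = [lam ** solve_h(lambdas, weights, ell_w) for lam in lambdas]  # (2.10c)
    scale = (1.0 - estimate_v(losses, weights)) / estimate_v(losses, weights)
    idx = list(range(n))
    x_bars = []
    for _ in range(R):
        draws = rng.choices(idx, weights=weights, k=n)   # P[I = i] = w_i
        x_bars.append(sum(losses[i] - rng.betavariate(thetas[i] * scale,
                                                      (1.0 - thetas[i]) * scale)
                          for i in draws) / n)            # X = ell_I - Y, (2.10a)
    delta_w = ell_w - lambda_w                            # observed value
    p = (1 + sum(x <= delta_w for x in x_bars)) / (R + 1)       # (2.16a)
    p_star = (1 + sum(x >= delta_w for x in x_bars)) / (R + 1)  # (2.16b)
    return p, p_star

# Hypothetical defaulted facilities: predicted LGDs and realised loss rates
lambdas = [0.45, 0.55, 0.60, 0.40, 0.50]
losses = [0.30, 0.42, 0.55, 0.25, 0.38]
weights = [0.2] * 5
p, p_star = lgd_bootstrap_p_values(losses, lambdas, weights)
```

Since the realised loss rates here sit below the predicted LGDs, the p-value of $H_0$ comes out small, pointing to prudent LGD estimates under this illustrative data set.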
3. Tests of probabilities
Starting point.
• A sample of paired observations $(p_1, b_1), \dots, (p_n, b_n)$, with predicted probabilities $0 < p_i < 1$ and status indicators $b_i \in \{0, 1\}$ (1 for defaulted, 0 for performing).
• Weights $0 < w_i < 1$, $i = 1, \dots, n$, with $\sum_{i=1}^n w_i = 1$.
• Weighted default rate $b_w = \sum_{i=1}^n w_i\,b_i$ and weighted average PD $p_w = \sum_{i=1}^n w_i\,p_i$.

Interpretation in the context of PD back-testing.
• A sample of $n$ borrowers is observed for a certain period of time, most commonly one year.
• The PD $p_i$ is an estimate of borrower $i$'s probability to default during the observation period, estimated before the beginning of the period.
• The status indicator $b_i$ shows borrower $i$'s performance status at the end of the observation period. $b_i = 1$ means "borrower has defaulted", $b_i = 0$ means "borrower is performing".
• $w_i$ could be the relative importance of observation $i$. In the case of default predictions, one might choose weights as in (2.2b).
• Define $\Delta_i = b_i - p_i$, $i = 1, \dots, n$. If $|\Delta_i| \approx 0$ then $p_i$ is a good default prediction. If $|\Delta_i| \approx 1$ then $p_i$ is a poor default prediction.

Goal.
We want to use the observed weighted average difference / residual $\Delta_w = \sum_{i=1}^n w_i\,\Delta_i = b_w - p_w$ to assess the quality of the calibration of the model / approach for the $p_i$ to predict the realised status indicators $b_i$. Again we want to answer the following two questions:
• If $\Delta_w < 0$, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If $\Delta_w > 0$, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In determining the p-values, we take into account the criticisms of the basic approach as mentioned at the end of Section 2.2.

3.1. Testing probabilities on inhomogeneous samples
In order to be able to examine the PD-specific properties of the sample and $\Delta_w = b_w - p_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumptions 2.1 and 2.4.

Assumption 3.1
The sample $\Delta_1, \dots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by
$$X_\vartheta = b_I - Y_\vartheta, \qquad (3.1a)$$
where $I$ is a random variable with values in $\{1, \dots, n\}$ and $P[I = i] = w_i$, $i = 1, \dots, n$. $Y_\vartheta$ is a Bernoulli variable with
$$P[Y_\vartheta = 1\,|\,I = i] = \vartheta_i, \quad i = 1, \dots, n. \qquad (3.1b)$$
Define $\varrho_i = \frac{1 - p_i}{p_i}\,\frac{p_w}{1 - p_w}$. Then the $\vartheta_i$ depend on the unknown parameter $0 < \vartheta < 1$ by
$$\vartheta_i = \frac{\vartheta}{\vartheta + (1 - \vartheta)\,\varrho_i^{\,h(\vartheta)}}, \qquad (3.1c)$$
where $0 < h(\vartheta) < \infty$ is the unique solution of the equation
$$\vartheta = \sum_{i=1}^n w_i\,\frac{\vartheta}{\vartheta + (1 - \vartheta)\,\varrho_i^{\,h}}, \qquad (3.1d)$$
when solved for $h$.

Assumption 3.1 introduces randomness of the difference between status indicator and PD prediction for individual facilities. Comparison between (3.2b) below and (2.4b) shows that this entails variance expansion of the sample $\Delta_1, \dots, \Delta_n$.
Note that Assumption 3.1 also describes a method for recalibration of the PD estimates $p_1, \dots, p_n$ to match targets $\vartheta$ with the weighted average of the $\vartheta_i$. In contrast to (2.3), the transformation (3.1c) makes sure that the transformed PD parameters still are values in the unit interval. In principle, instead of (3.1c) also the transformation (2.10c) could have been used. (3.1c) was preferred because it has a probabilistic foundation through Bayes' theorem. By definition of $Y_\vartheta$, it holds that $E[Y_\vartheta\,|\,I = i] = \vartheta_i$. Another simple alternative to the definition (3.1c) of $\vartheta_i$ would be linear scaling: $\vartheta_i = p_i\,\frac{\vartheta}{p_w}$. However, with this definition $\vartheta_i > 1$ could occur such that the conditional distribution of $Y_\vartheta$ given $I = i$ would be ill-defined.

Proposition 3.2
For $X_\vartheta$ as described in Assumption 3.1, the expected value and the variance are given by
$$E[X_\vartheta] = b_w - \vartheta, \quad \text{and} \qquad (3.2a)$$
$$\mathrm{var}[X_\vartheta] = \sum_{i=1}^n w_i\,(b_i - \vartheta_i)^2 - (b_w - \vartheta)^2 + \sum_{i=1}^n w_i\,\vartheta_i\,(1 - \vartheta_i). \qquad (3.2b)$$

Proof.
Similar to the proof of Proposition 2.5. ✷

Note that $\sum_{i=1}^n w_i\,(b_i - \vartheta_i)^2$ is a weighted version of the Brier score (see, e.g., Hand, 1997) for the observation-prediction sample $(b_1, \vartheta_1), \dots, (b_n, \vartheta_n)$. This observation suggests that the power of the calibration tests considered in this section will be the greater, the better the discriminatory power of the PD predictions is (reflected by lower Brier scores).
[Footnote: See Tasche (2013a, Section 4.2.4).]
By Assumption 3.1 and Proposition 3.2, the questions on the safety of conclusions from the sign of $\Delta_w = b_w - p_w$ again can be translated into hypotheses on the value of the parameter $\vartheta$:
• If $\Delta_w < 0$, can we conclude that $H_0: \vartheta \le b_w$ is false and $H_1: \vartheta > b_w \Leftrightarrow E[X_\vartheta] < 0$ is true?
• If $\Delta_w > 0$, can we conclude that $H_0^*: \vartheta \ge b_w$ is false and $H_1^*: \vartheta < b_w \Leftrightarrow E[X_\vartheta] > 0$ is true?
Even if the sample $\Delta_1, \dots, \Delta_n$ was generated by independent realisations of $X_\vartheta$, the distribution of the sample mean is different from the distribution of $X_\vartheta$, as shown in the following corollary to Proposition 3.2.

Corollary 3.3
Let $X_{1,\vartheta}, \dots, X_{n,\vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 3.1 and define $\bar{X}_\vartheta = \frac{1}{n}\sum_{i=1}^n X_{i,\vartheta}$. Then for the mean and variance of $\bar{X}_\vartheta$, it holds that
$$E[\bar{X}_\vartheta] = b_w - \vartheta, \qquad (3.3a)$$
$$\mathrm{var}[\bar{X}_\vartheta] = \frac{1}{n}\left(\sum_{i=1}^n w_i\,(b_i - \vartheta_i)^2 - (b_w - \vartheta)^2 + \sum_{i=1}^n w_i\,\vartheta_i\,(1 - \vartheta_i)\right). \qquad (3.3b)$$
In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w = b_w - p_w$ as its observed value.

Lemma 3.4
In the setting of Assumption 3.1, $\vartheta < \hat\vartheta$ implies that $\vartheta_i < \hat\vartheta_i$ for all $i = 1, \dots, n$.

Proof.
Assume $\vartheta < \hat\vartheta$ and let $h = h(\vartheta)$ and $\hat{h} = h(\hat\vartheta)$. Along the same lines of algebra as in Section 3 of Tasche (2013b), it can be shown that (with $w_i$ and $\varrho_i$ as in Assumption 3.1) for $0 < t < 1$ and $\eta > 0$
$$t = \sum_{i=1}^n w_i\,\frac{t}{t + (1 - t)\,\varrho_i^{\,\eta}} \iff 0 = \sum_{i=1}^n w_i\,\frac{1 - \varrho_i^{\,\eta}}{t + (1 - t)\,\varrho_i^{\,\eta}}. \qquad (3.4)$$
Define $f(t, \eta) = \sum_{i=1}^n w_i\,\frac{1 - \varrho_i^{\,\eta}}{t + (1 - t)\,\varrho_i^{\,\eta}}$. Then we obtain
$$\frac{\partial f}{\partial t}(t, \eta) = -\sum_{i=1}^n w_i\,\frac{(1 - \varrho_i^{\,\eta})^2}{(t + (1 - t)\,\varrho_i^{\,\eta})^2} < 0, \qquad (3.5a)$$
$$\frac{\partial f}{\partial \eta}(t, \eta) = -\sum_{i=1}^n w_i\,\frac{\varrho_i^{\,\eta}\,\log \varrho_i}{(t + (1 - t)\,\varrho_i^{\,\eta})^2} < 0. \qquad (3.5b)$$
By definition, (3.1d) holds for $\vartheta$ and $h$, i.e. $f(\vartheta, h) = 0$. From (3.4) and (3.5a) it then follows that
$$0 > \sum_{i=1}^n w_i\,\frac{1 - \varrho_i^{\,h}}{\hat\vartheta + (1 - \hat\vartheta)\,\varrho_i^{\,h}}.$$
However, by (3.4) we also have
$$0 = \sum_{i=1}^n w_i\,\frac{1 - \varrho_i^{\,\hat h}}{\hat\vartheta + (1 - \hat\vartheta)\,\varrho_i^{\,\hat h}}.$$
By (3.5b), this only is possible if it holds that $h > \hat{h}$. Hence it follows that
$$\frac{(1 - \vartheta)\,h}{\vartheta} > \frac{(1 - \hat\vartheta)\,\hat h}{\hat\vartheta}.$$
By (3.1c) (i.e. the definition of $\vartheta_i$ and $\hat\vartheta_i$), this inequality implies $\vartheta_i < \hat\vartheta_i$. ✷

Theorem 3.5 In the setting of Assumption 3.1 and Corollary 3.3, $\vartheta \le \hat\vartheta$ implies that
$$P[\bar{X}_\vartheta \le x] \le P[\bar{X}_{\hat\vartheta} \le x], \quad \text{for all } x \in \mathbb{R}.$$

Proof.
By Lemma 3.4, $\vartheta \le \hat\vartheta$ implies for all $i = 1, \dots, n$ that $\vartheta_i \le \hat\vartheta_i$ and therefore also
$$P[Y_\vartheta \le x\,|\,I = i] \ge P[Y_{\hat\vartheta} \le x\,|\,I = i], \quad \text{for all } x \in \mathbb{R}.$$
The remainder of the proof is identical to the last part of the proof of Proposition 2.7. ✷

Exact p-values.
Since by definition, up to the constant $1/n$, the test statistic $\bar{X}_\vartheta$ as defined in Assumption 3.1 and Corollary 3.3 takes only integer values in the range $\{-n, \dots, -1, 0, 1, \dots, n\}$, its distribution can readily be exactly determined by means of an inverse Fourier transform (Rolski et al., 1999, Section 4.7). By Theorem 3.5 and Theorem 8.3.27 of Casella and Berger (2002), a p-value for the test of $H_0: \vartheta \le b_w$ against $H_1: \vartheta > b_w$ can then be exactly computed as
$$\text{p-value} = P[\bar{X}_{b_w} \le b_w - p_w]. \qquad (3.6a)$$
A p-value for the test of $H_0^*: \vartheta \ge b_w$ against $H_1^*: \vartheta < b_w$ is given by
$$\text{p-value}^* = P[\bar{X}_{b_w} \ge b_w - p_w]. \qquad (3.6b)$$

Normal approximate test.
By Corollary 3.3, we find that the distribution of $\bar X_{b_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (3.3b). With $x = b_w - p_w$, one obtains for the approximate p-value of $H_0: \vartheta \le b_w$ against $H_1: \vartheta > b_w$:
$$\text{p-value} = P[\bar X_{b_w} \le x] \approx \Phi\!\left(\frac{\sqrt{n}\,(b_w - p_w)}{\sqrt{\sum_{i=1}^n w_i\,(b_i - \hat\vartheta_i)^2 + \sum_{i=1}^n w_i\,\hat\vartheta_i\,(1-\hat\vartheta_i)}}\right), \qquad (3.7a)$$
with $\hat\vartheta_i = \frac{b_w}{b_w + (1-b_w)\,\varrho_i\,h(b_w)}$ as in Assumption 3.1. The same reasoning gives for the normal approximate p-value of $H_0^*: \vartheta \ge b_w$ against $H_1^*: \vartheta < b_w$:
$$\text{p-value}^* \approx 1 - \Phi\!\left(\frac{\sqrt{n}\,(b_w - p_w)}{\sqrt{\sum_{i=1}^n w_i\,(b_i - \hat\vartheta_i)^2 + \sum_{i=1}^n w_i\,\hat\vartheta_i\,(1-\hat\vartheta_i)}}\right). \qquad (3.7b)$$

3.2. The Jeffreys test for the success parameter of a binomial distribution

In Section 2.5.3.1 of ECB (2019), the ECB proposes “PD back testing using a Jeffreys test”. Transcribed into the notation of Section 3.1, the starting point for the test can be described as follows:
• $n = N$, where “N is the number of customers in the portfolio/rating grade”.
• $\sum_{i=1}^n b_i = D$, where “D is the number of those customers that have defaulted within that observation period”.
• $\frac{1}{n}\sum_{i=1}^n p_i = PD$, where PD means the “PD [probability of default] of the portfolio/rating grade”.
• All $w_i$ equal $1/n$.
• In a Bayesian setting, an “objective Bayesian” prior distribution beta(1/2, 1/2) for the PD is chosen such that – assuming a binomial distribution for the number of defaults – the posterior distribution (i.e. conditional on the observed number of defaults) of the PD is beta(D + 1/2, N − D + 1/2), with mean (D + 1/2)/(N + 1).
• Null hypothesis is “the PD applied in the portfolio/rating grade . . . is greater than the true one (one sided hypothesis test)”, i.e. $H_0: \theta \le \hat\theta$ with $\hat\theta$ = “applied PD” and $\theta$ = “true PD”. In the notation of Section 3.1, this can be phrased as testing $H_0^*: \vartheta \ge b_{1/n}$ against $H_1^*: \vartheta < b_{1/n}$.
• ECB (2019): “The test statistic is the PD of the portfolio/rating grade.” The construction principle for the Jeffreys test is to determine a credibility interval for the PD and then to check if the applied PD is inside or outside of the interval.
• The p-value for this kind of Jeffreys test is
$$\text{p-value}_{\text{Jeffreys}} = F_{D+1/2,\,N-D+1/2}(PD), \qquad (3.8)$$
where $F_{\alpha,\beta}$ denotes the distribution function of the beta($\alpha$, $\beta$)-distribution.

Comments.
• The standard (frequentist) one-sided binomial test would be: ‘Reject $H_0$ if $D \ge c$’, where $c$ is a ‘critical’ value such that the probability under $H_0$ to observe $c$ or more defaults is small. For this test, the p-value is
$$\text{p-value}_{\text{freq}} = \sum_{i=D}^{N} \binom{N}{i}\, PD^i\, (1 - PD)^{N-i} = F_{D,\,N-D+1}(PD). \qquad (3.9)$$
Hence, unless the observed number of defaults D is very small or even zero, it follows from (3.8) and (3.9) that in practice the Jeffreys test and the standard binomial test give similar results most of the time.
• For a ‘fair’ comparison of the Jeffreys test and the test proposed in Section 3.1, we have to modify Assumption 3.1 such that there is no variance expansion and all weights are equal, i.e. the random variable $X_\vartheta$ is simply defined by
$$P[X_\vartheta = b_i - \vartheta_i] = \frac{1}{n}, \quad i = 1, \ldots, n, \qquad (3.10)$$
where the $\vartheta_i$ depend on the unknown parameter $0 < \vartheta < 1$. The normal approximate p-value for the test of $H_0^*$ against $H_1^*$ is then (using the ECB notation)
$$\text{p-value}^* \approx 1 - \Phi\!\left(\frac{\sqrt{N}\,(D/N - PD)}{\sqrt{D/N\,(1 - D/N)}}\right). \qquad (3.11)$$
• The normal approximation of the frequentist (and by (3.8) and (3.9) also the Jeffreys) binomial test p-value is
$$\text{p-value}_{\text{freq}} \approx 1 - \Phi\!\left(\frac{\sqrt{N}\,(D/N - PD)}{\sqrt{PD\,(1 - PD)}}\right). \qquad (3.12)$$
• The test of $H_0^*$ as required by the ECB would typically be performed when $D/N > PD$, i.e. when there are doubts with regard to the conservatism of the PD estimate. Rejection of $H_0^*$ would then be regarded as ‘proof’ of the estimate being aggressive while non-rejection would entail ‘acquittal’ for lack of evidence. In the case $1/2 \ge D/N > PD$, it holds that
$$PD\,(1 - PD) < D/N\,(1 - D/N),$$
such that the p-value according to the ECB test is lower than the p-value according to (3.10) and (3.11), i.e. the ECB test would reject $H_0^*$ earlier than the simplified version of the test according to Section 3.1.
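The relations between (3.8), (3.9) and their normal approximations (3.11) and (3.12) can be checked numerically. The following Python sketch is purely illustrative (the paper's own scripts are in R, and the values of N, D and PD below are made up):

```python
import numpy as np
from scipy.stats import beta, binom, norm

N, D, PD = 1000, 35, 0.025   # portfolio size, observed defaults, applied PD (made-up values)

# (3.8): Jeffreys p-value = F_{D+1/2, N-D+1/2}(PD)
p_jeffreys = beta.cdf(PD, D + 0.5, N - D + 0.5)

# (3.9): frequentist p-value = P[Bin(N, PD) >= D], once as the binomial tail sum
# and once via the beta-distribution identity F_{D, N-D+1}(PD)
p_freq_sum = binom.sf(D - 1, N, PD)
p_freq_beta = beta.cdf(PD, D, N - D + 1)

# (3.11) and (3.12): normal approximations with the observed-rate variance
# D/N (1 - D/N) and the null-hypothesis variance PD (1 - PD) respectively
z = np.sqrt(N) * (D / N - PD)
p_obs_var = 1 - norm.cdf(z / np.sqrt(D / N * (1 - D / N)))   # (3.11)
p_null_var = 1 - norm.cdf(z / np.sqrt(PD * (1 - PD)))        # (3.12)

print(p_jeffreys, p_freq_sum, p_freq_beta, p_obs_var, p_null_var)
```

With $D \ge 1$ the Jeffreys p-value is always below the frequentist one, and for $1/2 \ge D/N > PD$ the approximation (3.12) is below (3.11), in line with the comments above.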
4. Numerical examples
The test methods of Section 2 and the appendices are illustrated in Section 4.1 below with numerical results from tests on a data set from Fischer and Pfeuffer (2014, Table 1). The test methods of Section 3 are illustrated in Section 4.2 below with numerical results from tests on a data set consisting of simulated data. However, the exposures in the data set are again from Fischer and Pfeuffer (2014, Table 1). A zip-archive with the R-scripts and csv-files that were used for computing the results can be downloaded from .

[1] "2020-05-05 20:48:57 CEST"
R Script: PairedDifferences.R
Input data: LGD.csv
Summary of sample distribution:
Sample size: 100
Sample means:
EqWeighted   Weighted
   0.02110   -0.09814
Sample standard deviations:
EqWeighted   Weighted   W.adjusted
    0.3011     0.3186       0.6419
Three largest weights: 0.07998 0.07994 0.04192
Sample quantiles:
    10%     25%     50%     75%     90%
-0.3770 -0.2075  0.0250  0.2700  0.4580
Weight-adjusted sample quantiles:
       10%        25%        50%        75%        90%
-0.4474643 -0.0921226  0.0009514  0.0617173  0.3173528
Random seed: 23
Bootstrap iterations: 999
p-values for H0: mean(obs-pred)>=0 vs. H1: mean(obs-pred)<0
                  Eq-weighted  Weighted  W-adjusted
t-test                 0.7564  0.001403     0.06569
Basic                  0.7240  0.001000     0.07000
Basic normal           0.7583  0.001034     0.06314
Expanded variance      0.6670  0.016000     0.11400
Exp var normal         0.6846  0.015140     0.13474
p-values for H0: mean(obs-pred)<=0 vs. H1: mean(obs-pred)>0
                  Eq-weighted  Weighted  W-adjusted
t-test                 0.2436    0.9986      0.9343
Basic                  0.2770    1.0000      0.9310
Basic normal           0.2417    0.9990      0.9369
Expanded variance      0.3340    0.9850      0.8870
Exp var normal         0.3154    0.9849      0.8653

Explanations.
• Sample means: According to (2.1). Weights according to (2.2b) with EAD from the column ‘raw.w’ of the data set, and $w_i = 1/100$ in the equally weighted case.
• Sample standard deviations: First two values according to the square root of the right-hand side of (2.4b). Third value also according to (2.4b), but with $\tilde\Delta_i$ from (A.3a) and equal weights.
• Sample quantiles: Based on the sample $\Delta_1, \ldots, \Delta_{100}$, computed as the difference of the columns ‘obs’ and ‘pred’ of the data set.
• Weight-adjusted sample quantiles: Based on the sample $\tilde\Delta_1, \ldots, \tilde\Delta_{100}$ according to (A.3a).
• t-test results: ‘Eq-weighted’ according to (2.9) and 1 − p-value* for the first row of the t-test results. ‘Weighted’ analogously adapted for the weighted case (but without strong theoretical foundation). ‘W-adjusted’ like ‘Eq-weighted’ but for the sample $\tilde\Delta_1, \ldots, \tilde\Delta_{100}$.
• ‘Basic’ results: Bootstrapped according to (2.6a) and (2.6b) respectively, with weights and samples like for the t-test rows.
• ‘Basic normal’ results: Normal approximations according to (2.8a) and (2.8b) respectively, with weights and samples like for the t-test rows.
• ‘Expanded variance’ results: With weights and samples like for the t-test rows, bootstrapped according to (2.16a) and (2.16b) respectively for the first two values, and according to (B.6a) and (B.6b) respectively for the third value.
• ‘Exp var normal’ results: With weights and samples like for the t-test rows, normal approximations according to (2.18a) and (2.18b) respectively for the first two values, and according to (B.7a) and (B.7b) respectively for the third value.

This example demonstrates that
• test results based on equally weighted means and means with inhomogeneous weights can lead to contradictory conclusions,
• variance expansion to capture the individual randomness of single observation-prediction pairs can have some impact on the degree of certainty of the test results, by entailing greater p-values, and
• the two different approaches to account for the weights of the observation-prediction pairs discussed in this paper can deliver similar but still clearly different results.

[1] "2020-05-05 20:50:56 CEST"
R Script: Probabilities.R
Input data: PD.csv
Summary of sample distribution:
Sample size: 100
Sample means:
EqWeighted  Weighted
   0.01913   0.06584
Sample standard deviations:
EqWeighted  Weighted
    0.3023    0.3367
Three largest weights: 0.07998 0.07994 0.04192
Sample quantiles:
       10%        25%        50%        75%        90%
-0.1803235 -0.0359915 -0.0043570 -0.0005316  0.1322876
Random seed: 23
Bootstrap iterations: 999
p-values for H0: mean(obs-pred)>=0 vs. H1: mean(obs-pred)<0
                  Eq-weighted  Weighted
Jeffreys               0.7668        NA
Basic                  0.7390    0.9770
Basic normal           0.7365    0.9747
Expanded variance      0.6525    0.9295
Exp var normal         0.6902    0.9315
p-values for H0: mean(obs-pred)<=0 vs. H1: mean(obs-pred)>0
                  Eq-weighted  Weighted
Jeffreys               0.2332        NA
Basic                  0.2620   0.02400
Basic normal           0.2635   0.02526
Expanded variance      0.3475   0.07048
Exp var normal         0.3098   0.06850

Explanations.
• See Section 4.1 for an explanation of the summary of the sample distribution.
• ‘Jeffreys’ results: The ‘Eq-weighted’ value for ‘H0: mean(obs-pred) ≤ 0 vs. H1: mean(obs-pred) > 0’ is computed according to (3.8). The ‘Eq-weighted’ value for ‘H0: mean(obs-pred) ≥ 0 vs. H1: mean(obs-pred) < 0’ is 1 − p-value_Jeffreys. No ‘Weighted’ results are computed because there is no obvious ‘weighted mean’-version of the binomial Jeffreys test.
• ‘Basic’ results: Bootstrapped according to (2.6a) and (2.6b) respectively.
• ‘Basic normal’ results: Normal approximations according to (2.8a) and (2.8b) respectively.
• ‘Expanded variance’ results: Exact p-values by inverse Fourier transform according to (3.6a) and (3.6b) respectively.
• ‘Exp var normal’ results: Normal approximations according to (3.7a) and (3.7b) respectively.

This example demonstrates that
• as mentioned in Section 3.2, the Jeffreys test has a tendency to reject ‘H0: mean(obs-pred) ≤ 0’ earlier than the other tests considered,
• test results based on equally weighted means and means with inhomogeneous weights can lead to different outcomes (no conclusion vs. rejection of the null hypothesis), and
• variance expansion to capture the individual randomness of single observation-prediction pairs can have some impact on the degree of certainty of the test results, by entailing greater p-values.
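The exact p-values by inverse Fourier transform (the ‘Expanded variance’ rows of the second example, per Section 3.1) can be sketched as follows. This is a simplified Python illustration (the paper's scripts are in R) with made-up inputs, and without the variance-expansion details of Assumption 3.1: each copy of $X_\vartheta$ takes values in {−1, 0, 1}, so the distribution of the sum of n independent copies is obtained by an n-fold convolution computed via the (inverse) Fourier transform:

```python
import numpy as np

# Made-up inputs: default indicators b_i, weights w_i, per-obligor probabilities theta_i.
rng = np.random.default_rng(23)
n = 50
b = (rng.random(n) < 0.1).astype(float)
w = rng.random(n); w /= w.sum()
theta = rng.uniform(0.02, 0.2, n)

# One copy of X = b_I - Y with Y | I=i ~ Bernoulli(theta_i) and P[I=i] = w_i,
# so X is supported on {-1, 0, 1}:
p_minus = np.sum(w * (1 - b) * theta)       # b_i = 0 and Y = 1
p_plus = np.sum(w * b * (1 - theta))        # b_i = 1 and Y = 0
pmf1 = np.array([p_minus, 1 - p_minus - p_plus, p_plus])   # values -1, 0, 1

# Distribution of the sum of n iid copies, supported on {-n, ..., n}:
# the n-fold convolution in one step via FFT (length 2n+1 avoids wrap-around).
size = 2 * n + 1
f = np.fft.fft(pmf1, size)
pmf_sum = np.real(np.fft.ifft(f ** n))      # index k corresponds to the value k - n

# Cross-check against the n-fold convolution computed iteratively:
direct = pmf1
for _ in range(n - 1):
    direct = np.convolve(direct, pmf1)

# Exact p-value of type (3.6a) for a hypothetical observed mean difference d_obs:
# p-value = P[ sum <= n * d_obs ]
d_obs = -0.01
p_value = pmf_sum[: int(np.floor(n * d_obs)) + n + 1].sum()
```

The FFT route gives the same result as the iterated convolution but in a single step, which is what makes the exact computation practical for large n.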
5. Conclusions
In this paper, we have made suggestions of how to improve on the t-test and the Jeffreys test presented in ECB (2019) for assessing the ‘predictive ability (or calibration)’ of credit risk parameters. The improvements refer to
• also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter in order to be able to ‘prove’ that the estimate is prudent (or conservative),
• additionally using exposure- or limit-weighted sample averages in order to better inform assessments of estimation (or prediction) prudence, and
• ‘variance expansion’ in order to account for sample inhomogeneity in terms of composition (exposure sizes) and riskiness.
The suggested test methods have been illustrated with exemplary test results. R-scripts with code for the tests are available.

References
BCBS. International Convergence of Capital Measurement and Capital Standards. A Revised Framework, Comprehensive Version. Basel Committee on Banking Supervision, June 2006.

T. Bellini. IFRS 9 and CECL Credit Risk Modelling and Validation: A Practical Guide with Examples Worked in R and SAS. Academic Press, 2019.

O. Blümke. Out-of-Time Validation of Default Probabilities within the Basel Accord: A comparative study. Available at SSRN 2945931, 2019.

G. Casella and R.L. Berger. Statistical Inference. Duxbury Press, second edition, 2002.

A.C. Davison and D.V. Hinkley. Bootstrap Methods and their Application. Cambridge University Press, 1997.

ECB. Instructions for reporting the validation results of internal models: IRB Pillar I models for credit risk. European Central Bank – Banking Supervision, February 2019.

M. Fischer and M. Pfeuffer. A statistical repertoire for quantitative loss given default validation: overview, illustration, pitfalls and extensions. The Journal of Risk Model Validation, 8(1):3–29, 2014.

M. Gürtler, M.T. Hibbeln, and P. Usselmann. Exposure at default modeling – A theoretical and empirical assessment of estimation approaches and parameter choice. Journal of Banking & Finance, 91:176–188, 2018.

D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, 1997.

H. Kazianka. Objective Bayesian estimation of the probability of default. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(1):1–27, 2016. doi: 10.1111/rssc.12107. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssc.12107.

D. Li, R. Bhariok, S. Keenan, and S. Santilli. Validation techniques and performance metrics for loss given default models. The Journal of Risk Model Validation, 3(3):3–26, 2009.

G. Loterman, M. Debruyne, K. Vanden Branden, T. Van Gestel, and C. Mues. A proposed framework for backtesting loss given default models. Journal of Risk Model Validation, 8(1):69–90, 2014.

W. Mendenhall, R.J. Beaver, and B.M. Beaver. Introduction to probability and statistics. Cengage Learning, 13th edition, 2008.

T. Rolski, H. Schmidli, V. Schmidt, and J. Teugels. Stochastic Processes for Insurance and Finance. Wiley Series in Probability and Statistics. John Wiley & Sons, 1999.

S. Scandizzo. The validation of risk models: A handbook for practitioners. Springer, 2016.

D. Tasche. The art of probability-of-default curve calibration. Journal of Credit Risk, 9(4):63–103, 2013a.

D. Tasche. The law of total odds. arXiv preprint arXiv:1312.0365, 2013b.

W.N. Venables and B.D. Ripley. Modern Applied Statistics with S. Springer, fourth edition, 2002.
A. Appendix: Special cases of the weighted paired difference approach
Equal weights in the basic approach.
In this case, the variable of interest is the ordinary average of the sample $\Delta_1, \ldots, \Delta_n$, as reflected by the fact that then, instead of (2.4a), it holds that
$$E[X_\vartheta] = \frac{1}{n} \sum_{i=1}^n \Delta_i - \vartheta. \qquad (A.1)$$
In the same vein, the algorithms and formulae of Section 2.1 can be adapted to the equal weights case by replacing all weights $w_i$ and $w_j$ with $1/n$.

Weight-adjusted sample.
In this case, the weights $w_i$ are accounted for by replacing the sample $\Delta_1, \ldots, \Delta_n$ with the sample $\Delta_1^*, \ldots, \Delta_n^*$, where $\Delta_i^*$ is defined by $\Delta_i^* = w_i\,\Delta_i$. The adjusted sample $\Delta_1^*, \ldots, \Delta_n^*$ in turn is treated as in the equal weights case. Then, in particular, (2.3) for the distribution of $X_\vartheta$ reads $P[X_\vartheta = \Delta_i^* - \vartheta] = \frac{1}{n}$, $i = 1, \ldots, n$. If $\sum_{i=1}^n w_i\,\Delta_i \ne 0$, it follows that
$$E[X_\vartheta] = \frac{1}{n} \sum_{i=1}^n \Delta_i^* - \vartheta = \frac{1}{n} \sum_{i=1}^n w_i\,\Delta_i - \vartheta \;\ne\; \sum_{i=1}^n w_i\,\Delta_i - \vartheta. \qquad (A.2)$$
As a consequence of (A.2), the adaptation of the algorithms and formulae from Section 2.1 for the weight-adjusted sample case would appear somewhat misleading if comparability in magnitude of the values of the test statistic $\bar X_\vartheta$ to its values in the unequal weights case as discussed in Section 2.1 were intended. A workaround for this problem is to adjust the sample not only for the weights but also for the sample size, i.e. to define the adjusted sample $\tilde\Delta_1, \ldots, \tilde\Delta_n$ by
$$\tilde\Delta_i = n\,w_i\,\Delta_i. \qquad (A.3a)$$
Assuming equal weights now means $P[X_\vartheta = \tilde\Delta_i - \vartheta] = 1/n$, which implies
$$E[X_\vartheta] = \frac{1}{n} \sum_{i=1}^n \tilde\Delta_i - \vartheta = \sum_{i=1}^n w_i\,\Delta_i - \vartheta, \qquad (A.3b)$$
$$\mathrm{var}[X_\vartheta] = \frac{1}{n} \sum_{i=1}^n \tilde\Delta_i^2 - \left(\frac{1}{n} \sum_{i=1}^n \tilde\Delta_i\right)^2 = n \sum_{i=1}^n w_i^2\,\Delta_i^2 - \left(\sum_{i=1}^n w_i\,\Delta_i\right)^2. \qquad (A.3c)$$
Comparison with (2.4b) shows that the variances of $X_\vartheta$ according to the weighting scheme (A.3a) and the weighting scheme deployed in Section 2.1 differ by
$$\sum_{i=1}^n \big(n\,w_i^2 - w_i\big)\,\Delta_i^2,$$
which can be positive or negative. The algorithms and formulae from Section 2.1 can be applied to the weight-adjusted sample case as specified by (A.3a) and $P[X_\vartheta = \tilde\Delta_i - \vartheta] = 1/n$ if the following two modifications are taken into account in the given order:
• Replace the value of $\Delta_i$ by the value of $\tilde\Delta_i = n\,w_i\,\Delta_i$ for $i = 1, \ldots, n$.
• Replace all remaining appearances of the weights $w_i$ by $1/n$.
Note that the weight-adjustment (A.3a) can also be deployed for samples with more special structure like the ones considered in Section 2.3 and Appendix B below.
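As a quick numerical check of (A.3c), the following Python sketch (with made-up data; the paper's own scripts are in R) computes the variance of $X_\vartheta$ under both weighting schemes and verifies that their difference equals $\sum_i (n\,w_i^2 - w_i)\,\Delta_i^2$. The formula $\sum_i w_i \Delta_i^2 - (\sum_i w_i \Delta_i)^2$ is assumed here for (2.4b), consistent with the comparison above:

```python
import numpy as np

# Made-up sample of paired differences obs - pred, and random weights summing to 1.
rng = np.random.default_rng(42)
n = 100
delta = rng.normal(0.0, 0.3, n)
w = rng.random(n); w /= w.sum()

# Variance of X_theta under the weighted scheme of Section 2.1 (assumed form of (2.4b)):
var_weighted = np.sum(w * delta**2) - np.sum(w * delta)**2

# Variance under the weight-adjusted scheme (A.3a)/(A.3c), using equal weights 1/n
# on the adjusted sample tilde_Delta_i = n w_i Delta_i:
tilde = n * w * delta
var_adjusted = np.mean(tilde**2) - np.mean(tilde)**2

# Difference of the two variances, per the comparison following (A.3c):
diff = var_adjusted - var_weighted
```

Algebraically the difference reduces to $n\sum_i w_i^2\Delta_i^2 - \sum_i w_i\Delta_i^2$, so the check below is exact up to floating-point error.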
There is no guarantee, however, that adjustment (A.3a) would preserve the ‘values in the unit interval’ constraint of Section 2.3. There is no such preservation issue with regard to Appendix B.

B. Appendix: Tests for non-negative variables
In contrast to LGD and CCF, which by definition are variables with values in the unit interval, EAD may in principle take any non-negative value. This requires some modifications in order to adapt the approach from Section 2.3 to the assessment of EAD estimates.
Starting point.
• A sample of paired observations $(h_1, \eta_1), \ldots, (h_n, \eta_n)$, with predicted EADs $0 < \eta_i < \infty$ and realised exposures $0 \le h_i < \infty$.
• Weights $0 < w_i < 1$, $i = 1, \ldots, n$, with $\sum_{i=1}^n w_i = 1$.
• Weighted average observed EAD $h_w = \sum_{i=1}^n w_i\,h_i$ and weighted average EAD prediction $\eta_w = \sum_{i=1}^n w_i\,\eta_i$.

Interpretation in the context of EAD back-testing.
• A sample of $n$ defaulted credit facilities / loans is analysed.
• The EAD $\eta_i$ is an estimate of loan $i$'s exposure at the moment of the default, measured in currency units.
• The realised exposure $h_i$ shows loan $i$'s exposure at the time of default.
• The weight $w_i$ reflects the relative importance of observation $i$. In the case of direct EAD predictions, one might choose $w_i$ according to (2.2a).
• Define $\Delta_i = h_i - \eta_i$, $i = 1, \ldots, n$. If $|\Delta_i| \approx 0$ then $\eta_i$ is a good EAD prediction. If $|\Delta_i|$ is large then $\eta_i$ is a poor EAD prediction.

Goal.
We want to use the observed weighted average difference / residual $\Delta_w = \sum_{i=1}^n w_i\,\Delta_i = h_w - \eta_w$ to assess the quality of the calibration of the model / approach for the $\eta_i$ to predict the realised exposures $h_i$. Again we want to answer the following two questions:
• If $\Delta_w < 0$, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If $\Delta_w > 0$, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In order to be able to examine the specific properties of the sample and $\Delta_w$ with statistical methods, we have to make the assumption that the sample was generated by some random mechanism. This mechanism is described in the following modification of Assumption 2.4.

Assumption B.1
The sample $\Delta_1, \ldots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by
$$X_\vartheta = h_I - Y_\vartheta, \qquad (B.1a)$$
where $I$ is a random variable with values in $\{1, \ldots, n\}$ and $P[I = i] = w_i$, $i = 1, \ldots, n$. $Y_\vartheta$ is a gamma$(\alpha_i, \beta_i)$-distributed random variable conditional on $I = i$, for $i = 1, \ldots, n$. The parameters $\alpha_i$ and $\beta_i$ of the gamma-distribution depend on the unknown parameter $0 < \vartheta < \infty$ by
$$\alpha_i = \frac{\vartheta_i}{v}, \quad \text{and} \quad \beta_i = v. \qquad (B.1b)$$
In (B.1b), the constant $0 < v < \infty$ is the same for all $i$. The $\vartheta_i$ are determined by
$$\vartheta_i = \frac{\eta_i\,\vartheta}{\eta_w}. \qquad (B.1c)$$
Note that Assumption B.1 describes a method for the recalibration of the EAD estimates $\eta_1, \ldots, \eta_n$ to match targets $\vartheta$ with the weighted average of the $\vartheta_i$. By definition of $Y_\vartheta$, it holds that $E[Y_\vartheta \,|\, I = i] = \vartheta_i$. The constant $v$ specifies the variance of $Y_\vartheta$ conditional on $I = i$ as a multiple of its expected value $\vartheta_i$, i.e. it holds that
$$\mathrm{var}[Y_\vartheta \,|\, I = i] = v\,\vartheta_i, \quad i = 1, \ldots, n. \qquad (B.2)$$
The constant $v$ must be pre-defined or separately estimated. We suggest estimating it from the sample $h_1, \ldots, h_n$ as
$$\hat v = \frac{\sum_{i=1}^n w_i\,h_i^2 - h_w^2}{h_w}. \qquad (B.3)$$
See Casella and Berger (2002, Section 3.3) for a definition of the gamma-distribution.

Proposition B.2 For $X_\vartheta$ as described in Assumption B.1, the expected value and the variance are given by
$$E[X_\vartheta] = h_w - \vartheta, \quad \text{and} \qquad (B.4a)$$
$$\mathrm{var}[X_\vartheta] = \sum_{i=1}^n w_i\,(h_i - \vartheta_i)^2 - (h_w - \vartheta)^2 + v\,\vartheta. \qquad (B.4b)$$

Proof.
For deriving the formula for $\mathrm{var}[X_\vartheta]$, make use of the well-known variance decomposition
$$\mathrm{var}[X_\vartheta] = E\big[\mathrm{var}[X_\vartheta \,|\, I]\big] + \mathrm{var}\big[E[X_\vartheta \,|\, I]\big]. \quad \text{✷}$$
Like in (2.13b), the variance of $X_\vartheta$ as shown in (B.4b) depends on the parameter $\vartheta$ and has an additional component $v\,\vartheta$ which reflects the potentially different variances of the exposures at default in an inhomogeneous portfolio. By Assumption B.1 and Proposition B.2, the questions on the safety of conclusions from the sign of $\Delta_w$ can again be translated into hypotheses on the value of the parameter $\vartheta$:
• If $\Delta_w < 0$, can we conclude that $H_0: \vartheta \le h_w$ is false and $H_1: \vartheta > h_w \Leftrightarrow E[X_\vartheta] < 0$ is true?
• If $\Delta_w > 0$, can we conclude that $H_0^*: \vartheta \ge h_w$ is false and $H_1^*: \vartheta < h_w \Leftrightarrow E[X_\vartheta] > 0$ is true?
If the sample $\Delta_1, \ldots, \Delta_n$ was generated by independent realisations of $X_\vartheta$, then the distribution of the sample mean is different from the distribution of $X_\vartheta$, as shown in the following corollary to Proposition B.2.

Corollary B.3
Let $X_{1,\vartheta}, \ldots, X_{n,\vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption B.1 and define $\bar X_\vartheta = \frac{1}{n} \sum_{i=1}^n X_{i,\vartheta}$. Then for the mean and variance of $\bar X_\vartheta$, it holds that
$$E[\bar X_\vartheta] = h_w - \vartheta, \qquad (B.5a)$$
$$\mathrm{var}[\bar X_\vartheta] = \frac{1}{n} \left( \sum_{i=1}^n w_i\,(h_i - \vartheta_i)^2 - (h_w - \vartheta)^2 + v\,\vartheta \right). \qquad (B.5b)$$
In the following, we use $\bar X_\vartheta$ as the test statistic and interpret $\Delta_w = h_w - \eta_w$ as its observed value.

Proposition B.4
In the setting of Assumption B.1 and Corollary B.3, $\vartheta \le \hat\vartheta$ implies that $P[\bar X_\vartheta \le x] \le P[\bar X_{\hat\vartheta} \le x]$, for all $x \in \mathbb{R}$.

Proof.
Same as the proof of Proposition 2.7. ✷ Bootstrap test.
Generate a Monte Carlo sample $\bar x_1, \ldots, \bar x_R$ from $\bar X_\vartheta$ with $\vartheta = h_w$ as follows:
• For $j = 1, \ldots, R$: $\bar x_j$ is the equally weighted mean of $n$ independent draws from the distribution of $X_\vartheta$ as given by Assumption B.1, with $\vartheta = h_w$.
• $\bar x_1, \ldots, \bar x_R$ are realisations of independent, identically distributed random variables.
Then a bootstrap p-value for the test of $H_0: \vartheta \le h_w$ against $H_1: \vartheta > h_w$ can be calculated as
$$\text{p-value} = \frac{1 + \#\{j : j \in \{1, \ldots, R\},\ \bar x_j \le h_w - \eta_w\}}{R + 1}. \qquad (B.6a)$$
A bootstrap p-value for the test of $H_0^*: \vartheta \ge h_w$ against $H_1^*: \vartheta < h_w$ is given by
$$\text{p-value}^* = \frac{1 + \#\{j : j \in \{1, \ldots, R\},\ \bar x_j \ge h_w - \eta_w\}}{R + 1}. \qquad (B.6b)$$

Rationale. Same as the rationale for (2.16a) and (2.16b).
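The bootstrap algorithm above can be sketched in Python as follows (the paper's own scripts are in R; all inputs below are simulated, and the particular choice of weights is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(23)
n, R = 100, 999
eta = rng.lognormal(10, 1, n)                # made-up predicted EADs
h = eta * rng.uniform(0.7, 1.2, n)           # made-up realised exposures
w = eta / eta.sum()                          # illustrative weights summing to 1

h_w = np.sum(w * h)                          # weighted average observed EAD
eta_w = np.sum(w * eta)                      # weighted average EAD prediction
d_obs = h_w - eta_w                          # observed value of the test statistic

v_hat = (np.sum(w * h**2) - h_w**2) / h_w    # variance constant, per (B.3)
theta_i = eta * h_w / eta_w                  # recalibrated targets (B.1c) with theta = h_w

# Monte Carlo sample of the test statistic under theta = h_w (Assumption B.1):
xbar = np.empty(R)
for j in range(R):
    idx = rng.choice(n, size=n, p=w)         # n independent draws of the index I
    # Y | I = i is gamma-distributed with shape theta_i/v and scale v,
    # so that E[Y|I=i] = theta_i and var[Y|I=i] = v * theta_i, per (B.1b)/(B.2)
    y = rng.gamma(shape=theta_i[idx] / v_hat, scale=v_hat)
    xbar[j] = np.mean(h[idx] - y)            # equally weighted mean of n draws of X_theta

# Bootstrap p-values per (B.6a) and (B.6b):
p_value = (1 + np.sum(xbar <= d_obs)) / (R + 1)        # H0: theta <= h_w
p_value_star = (1 + np.sum(xbar >= d_obs)) / (R + 1)   # H0*: theta >= h_w
```

Note that the gamma parametrisation is chosen to match the moment conditions of Assumption B.1: shape times scale gives the mean $\vartheta_i$ and shape times scale squared gives the variance $v\,\vartheta_i$.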
Normal approximate test.
By Corollary B.3, we find that the distribution of $\bar X_{h_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (B.5b) with $\vartheta = h_w$. With $x = h_w - \eta_w$, one obtains for the approximate p-value of $H_0: \vartheta \le h_w$ against $H_1: \vartheta > h_w$:
$$\text{p-value} = P[\bar X_{h_w} \le x] \approx \Phi\!\left(\frac{\sqrt{n}\,(h_w - \eta_w)}{\sqrt{\sum_{i=1}^n w_i\,(h_i - \hat\vartheta_i)^2 + v\,h_w}}\right), \qquad (B.7a)$$
with $\hat\vartheta_i = \frac{\eta_i\,h_w}{\eta_w}$ as in Assumption B.1. The same reasoning gives for the normal approximate p-value of $H_0^*: \vartheta \ge h_w$ against $H_1^*: \vartheta < h_w$:
$$\text{p-value}^* \approx 1 - \Phi\!\left(\frac{\sqrt{n}\,(h_w - \eta_w)}{\sqrt{\sum_{i=1}^n w_i\,(h_i - \hat\vartheta_i)^2 + v\,h_w}}\right). \qquad (B.7b)$$
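A minimal Python sketch of the normal approximations (B.7a) and (B.7b) follows (the paper's own scripts are in R; all inputs are made up, and equal weights are used only for simplicity):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100
eta = rng.lognormal(10, 1, n)                # made-up predicted EADs
h = eta * rng.uniform(0.7, 1.2, n)           # made-up realised exposures
w = np.full(n, 1.0 / n)                      # equal weights for simplicity

h_w, eta_w = np.sum(w * h), np.sum(w * eta)
v_hat = (np.sum(w * h**2) - h_w**2) / h_w    # per (B.3)
theta_i = eta * h_w / eta_w                  # per (B.1c) with theta = h_w

# Standardised observed value of the test statistic:
z = np.sqrt(n) * (h_w - eta_w) / np.sqrt(np.sum(w * (h - theta_i)**2) + v_hat * h_w)

p_value = norm.cdf(z)            # (B.7a), test of H0: theta <= h_w
p_value_star = 1 - norm.cdf(z)   # (B.7b), test of H0*: theta >= h_w
```

By construction the two approximate p-values sum to one, so at most one of the two conclusions (prudent or aggressive) can ever be supported by a small p-value.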