Machine Learning approach for Credit Scoring
A. R. Provenzano∗, D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi, G. Le Pera, M. Spadaccino, L. Massaron, C. Nordio
August 5, 2020

∗ Corresponding author: [email protected]. This paper reflects the authors' opinions and not necessarily those of their employers.
Working paper

Abstract
In this work we build a stack of machine learning models aimed at composing a state-of-the-art credit rating and default prediction system, obtaining excellent out-of-sample performances. Our approach is an excursion through the most recent ML/AI concepts, starting from natural language processing (NLP) applied to economic sectors' (textual) descriptions using embedding and autoencoders (AE), going through the classification of defaultable firms on the basis of a wide range of economic features using gradient boosting machines (GBM), and calibrating their probabilities paying due attention to the treatment of unbalanced samples. Finally, we assign credit ratings through genetic algorithms (differential evolution, DE). Model interpretability is achieved by implementing recent techniques such as SHAP and LIME, which explain predictions locally in features' space.
JEL classification codes: C45, C55, G24, G32, G33
AMS classification codes: 62M45, 68T01, 68T50, 91G40
Keywords: Artificial Intelligence, Machine Learning, Explainable AI, Autoencoders, Embedding, LightGBM, Differential Evolution, SHAP, LIME, Credit Risk, Rating Model, Default, Probability of Default, Classification
Introduction
In the aftermath of the economic crisis, the probability of default (PD) has become a topical theme in the field of financial research. Indeed, given its usage in risk management, in the valuation of credit derivatives, in the estimation of the creditworthiness of a borrower and in the calculation of economic or regulatory capital for banking institutions (under Basel II), an incorrect PD prediction can lead to a false valuation of risk, unreasonable ratings and incorrect pricing of financial instruments. In the last decades, a growing number of approaches has been developed to model the credit quality of a company by exploring statistical techniques. Several works have employed probit models [1] or linear and logistic regression to estimate company ratings using the main financial indicators as model input. However, these models suffer from their clear inability to capture non-linear dynamics, which are prevalent in financial ratio data [2]. New statistical techniques, especially from the field of machine learning, have gained a worldwide reputation thanks to their ability to efficiently capture information from big datasets by recognizing non-linear patterns and temporal dependencies among data. Zhao et al. (2015) [3] employed feed-forward neural networks in corporate credit rating determination. Petropoulos et al. [4] explore two state-of-the-art techniques, namely Extreme Gradient Boosting (XGBoost) and deep learning neural networks, in order to estimate loan PD and calibrate an internal rating system, useful both for internal usage and for regulatory scope. Addo et al. (2018) [5] built binary classifiers based on machine and deep learning models on real data to predict loan probability of default. They observed that tree-based models are more stable than those based on multilayer artificial neural networks.

Starting from these studies, we propose a sophisticated framework of machine learning models which, on the basis of company annual (end-of-year) financial statements coupled with relevant macroeconomic indicators, attempts to classify the status of a company (performing, "in bonis", or defaulted) and to build a robust rating system in which each rating class is matched to an internally calibrated default probability. In this regard, here the target variable is different from a previous work by some of the authors [6], where the goal was to predict the credit rating that Moody's would assign, according to an approach commonly called "shadow rating". The novelty of our approach lies in the combination of data preprocessing algorithms, responsible for feature engineering and feature selection, and a core model architecture made of a concatenation of a Boosted Tree default classifier, a probability calibrator and a rating attribution system based on a genetic algorithm. Great attention is then given to model interpretability, as we propose two intuitive approaches to interpret the model output by exploring the property of local explainability. In detail, the article is composed of the following sections: Section 1 is devoted to describing the input dataset and the preprocessing phase; Section 2 explains the core model architecture; Section 3 collects results from the core model structure (i.e. default classifier, PD calibrator and rating clustering); finally, Section 4 is left to model explainability.
1 Input dataset

Data used for model training have been collected from the Credit Research Database (CRD) provided by Moody's, and consist of 919,636 annual (end-of-year) financial statements of 157,… firms. The target of the proposed default prediction model is a binary indicator with the value of 1 flagging a default event (i.e. a bankruptcy occurrence over a one-year horizon), 0 otherwise. In accordance with the above-defined target variable, the input variables of our model have been selected to be consistent with factors that can affect a company's capacity to service external debt (a full explanation of the input model features is reported in Appendix A). In particular, they consist of balance-sheet indexes and ratios, and Key Performance Indicators (KPI) calculated from the CRD's financial reports [7]. The latter include indicators for efficiency (i.e. measures of operating performance), liquidity (i.e. ratios used to determine how quickly a company can turn its assets into cash if it is experiencing financial distress or impending bankruptcy), solvency (i.e. ratios that depict how much a company relies upon its debt to fund operations) and profitability (i.e. measures that demonstrate how profitable a company is). Since business cycles can have a great impact on a firm's profitability and influence its risk profile, we joined the original information with more general macro variables (2-year-lagged historical data) addressing the surrounding climate in which companies operate. Among the wide range of macroeconomic indicators provided by Oxford Economics [8], a subset of the most influential ones has been selected as explanatory variables¹. Some of them are country-specific, others are common to the whole Eurozone². The combined dataset of balance-sheet indexes, financial ratios and macro variables, along with data transformations and feature selection (better described hereafter in Section 1.1), led to a set of 179 features and covers the period 2011–2017.

¹ The list of selected indicators is reported in Section A.3.
² Regional aggregate: the Eurozone includes the following countries: Austria, Belgium, Cyprus, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Portugal, Slovenia, Slovakia and Spain.

1.1 Data preprocessing

A preliminary step for building a machine learning model consists in generating a set of features suitable for model training. This task involves data manipulation processes like transformation of categorical features, missing values treatment, infinite values handling, outliers detection and data leakage avoidance. In particular, categorical, non-ordinal variables are one of the main issues that must be tackled in order to feed any machine learning model [9]. Different encoding techniques can be used to make categorical data legible for a machine learning algorithm. The most common way to deal with categories is to simply map each category to an integer: with Label Encoding, a model would treat categories as ordered integers, which would imply non-existent ordinal relationships between data and could be misleading for model training.
Another simple way to handle categorical data is One-Hot Encoding, which consists in transforming each categorical feature into a fixed-size sparse vector of all zeros but a 1 in the cell used to uniquely identify a specific realization of that variable. The main drawback of this technique lies in the fact that categories with a high number of possible realizations generate large-dimension datasets, which makes it a memory-inefficient encoder. Moreover, this sparse representation does not preserve similarity between feature values.

An alternative approach to overcome these issues is Categorical Embedding, which consists in mapping, via a Deep Neural Network (DNN), each possible discrete value of a given categorical variable into a low-dimensional, learned, continuous vector representation. This method allows placing each categorical feature in a Euclidean space, keeping coherent relationships with other realizations of the same variable. The extension of the categorical embedding approach to words and document representation is known as Word Embedding [10]. In particular, Sentence Embedding is an application of word embedding aiming at representing a full sentence in a vector space. In this study, we applied sentence embedding to represent the industry sector descriptions associated to each "NACE code"³. In order to guarantee the "semantics" of the original data and work in a low-dimensional space, we propose a framework of embedding with autoencoder regularization, in which the original data are embedded into low-dimension vectors. The obtained embeddings maintain local similarity and can be easily reverted to their original forms. The encoding of the NACE is a novel way to overcome the NACE1-NACE2 mapping conundrum: in our dataset both NACE versions are used and, as already stated in many papers, the two encoding systems are not fully compatible [11]. Moreover, the NACE encoding allows for a proper industry segment description of multi-sector firms that cannot be easily described by a single NACE code, further extending the predictive power of the economic sector category.

³ The "Statistical Classification of Economic Activities in the European Community", commonly referred to as NACE, is the industry standard classification system used in the European Union.

A different encoding method is Target Encoding, in which categorical features are replaced with the mean target value for samples having that category. This allows encoding an arbitrary number of features without increasing data dimensionality. However, as a drawback, a naive application of this type of encoding can allow data leakage, leading to model overfitting and poor predictive performance. A target encoding algorithm developed to prevent data leakage is known as the
James-Stein estimator, and is the one used in our model. In more detail, it transforms each categorical feature into a weighted average of the mean target value for the observed feature value and the mean target value computed regardless of the feature realization.
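A minimal sketch of this encoding step, assuming the open-source `category_encoders` package; the column names are illustrative, not the actual CRD schema:

```python
# James-Stein target encoding sketch, assuming `category_encoders`
# (pip install category_encoders); column names are illustrative.
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"incorporationRegion": ["IT", "DE", "IT", "FR"],
                  "entityConsolidationType": ["solo", "group", "solo", "solo"]})
y = pd.Series([0, 1, 0, 0])  # 1 = default within one year

# Each category is replaced by a weighted average of the per-category target
# mean and the global target mean, shrinking rare categories toward the prior.
encoder = ce.JamesSteinEncoder(cols=list(X.columns))
X_encoded = encoder.fit_transform(X, y)
print(X_encoded.head())
```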
As described above, some feature transformations can result in a general increase of input data dimensionality, which makes it urgent to implement a robust and independent feature selection framework. In fact, training a machine learning model on a huge number of independent variables is doomed to suffer from the so-called curse of dimensionality [12], i.e. the problem of the exponential increase in volume associated with adding extra dimensions to a vector space. We employed a voting ensemble of models to independently assign importance to the available features and efficiently select those features which contribute most to model prediction. Hereafter in this section we look into the implementation of satellite models aiming at: performing sentence embedding of the industry sector descriptions; reducing embedding dimensionality via a stacked autoencoder; selecting relevant features via a voting approach.

Sentence embedding of sector descriptions A common practice in Natural Language Processing (NLP) is the use of pre-trained embeddings to represent words or sentences in a document. Following this common practice, we use the pre-trained models built into the SpaCy NLP library for embedding the sequence of NACE sector textual descriptions. In particular, we performed sentence embedding, i.e. we transformed each description into a 300-dimensional real-valued vector. Each sentence embedding is automatically constructed by SpaCy by averaging the 300-dimensional real-valued pre-trained vectors which map each word in that sentence. Here is a glimpse at how SpaCy processes textual data. It first segments text into words, punctuation, symbols and others by applying rules specific to each language (i.e. it tokenizes the text). Then it performs Part-of-Speech (POS) tagging to understand the grammatical properties of each word by means of a built-in statistical model. A model consists of binary data trained on a dataset large enough to allow the system to make predictions that generalize across the language. A key assumption of the word embedding approach is the idea of using, for each word, a dense distributed representation learned from the usage of words [13]. This allows words that are used in similar ways to have similar representations, naturally capturing their meaning [14]. Given the high importance the industry sector has in the financial literature as a default prediction driver, embedding NACE industry descriptions improves the overall model performance in application by helping the model to generalize better and to smoothly handle unseen elements.
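A minimal sketch of this embedding step, assuming a pre-trained spaCy pipeline with 300-dimensional word vectors (e.g. `en_core_web_md`) is installed; the sector description is illustrative:

```python
# Sentence embedding of a NACE-style sector description with spaCy.
import spacy

nlp = spacy.load("en_core_web_md")

# Illustrative sector description, not an actual CRD record.
description = "Manufacture of machinery for food, beverage and tobacco processing"

doc = nlp(description)        # tokenization + POS tagging under the hood
sentence_vector = doc.vector  # average of the 300-d vectors of the tokens
print(sentence_vector.shape)  # (300,)
```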
Dimensionality reduction via stacked autoencoder
The aforementioned word embedding models are a powerful way to represent categorical variables while preserving relationships between data, but at the cost of an increase in dimensionality. In order to reduce the number of dimensions of the output embeddings from 300 to 5, a stacked autoencoder (SAE) of 6 layers⁴ has been developed via TensorFlow [15]. In detail, autoencoders (AE) are a family of neural networks in which input and output coincide. They work by compressing the input into a latent-space representation and then reconstructing the output by means of this representation. They consist of two principal components: the encoder, which takes the input and compresses it into a representation with fewer dimensions, and the decoder, which tries to reconstruct the input. Among AEs, stacked autoencoders are deep neural networks in which the output of each hidden layer is connected to the input of the successive hidden layer. All hidden layers are trained by an unsupervised algorithm and then fine-tuned by a supervised method aimed at minimizing the cost function. Since they can learn even non-linear transformations, unlike PCA, by using a non-linear activation function and a multiple-layer structure, autoencoders are efficient tools for dimensionality reduction. Moreover, in our application, the SAE exhibited a low reconstruction loss⁵ (around 6% of MSE), contrary to the low fraction of variance explained by the PCA.

⁴ A 3-layer encoder and a 3-layer decoder.
⁵ The reconstruction loss is the loss function (usually either the mean-squared error or the cross-entropy between the reconstructed output and the input) which penalizes the network for creating outputs different from the original input.
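A minimal sketch of such a 300-to-5 stacked autoencoder in TensorFlow/Keras; the intermediate layer sizes are illustrative assumptions, as the text does not report them:

```python
# 6-layer stacked autoencoder compressing 300-d embeddings to 5 dimensions.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(300,))
# 3-layer encoder
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dense(32, activation="relu")(h)
code = layers.Dense(5, activation="linear", name="bottleneck")(h)
# 3-layer decoder mirroring the encoder
h = layers.Dense(32, activation="relu")(code)
h = layers.Dense(128, activation="relu")(h)
outputs = layers.Dense(300, activation="linear")(h)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss (MSE)
# autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256)

# After training, the encoder alone maps each 300-d embedding to 5 features.
encoder = Model(inputs, code)
```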
Voting approach for feature selection Feature selection is a key component when building machine learning models. We can either demand this task of the main model or use a set of lighter models in a preparatory task, so that the required effort for further feature selection is reduced when training the main model. This is particularly useful for multi-parameter models like Light-GBM, where the training phase also involves the calibration of a set of hyperparameters usually spanning very wide ranges. Neglecting the expert-based component, algorithmic feature selection methods are usually divided into three classes: filter methods, wrapper methods and embedded methods. Filter-based methods apply a statistical measure to assign a score to each feature; variables of the starting dataset are then ranked according to their scores and either selected to be kept or removed. Wrapper-based methods consider the selection of a set of features as a "search problem", where different combinations are prepared, evaluated and compared to other combinations. In detail, a predictive model is used to evaluate a combination of features and assign a score based on model accuracy. Embedded-based methods learn which features best contribute to the accuracy of the model while the model is being created.

We combined a set of 6 different models for feature selection, stacking each algorithm into a hard voting framework where the features which receive the highest number of votes among all the models are selected (a minimal sketch of this voting scheme is given after the list). In particular, after having transformed categorical features via target encoding (by means of the James-Stein encoder), each feature in the dataset has been ranked on the basis of the following models:

• Pearson criterion. A filter-based method which consists in checking the absolute value of the Pearson correlation between the target and the features in the input dataset and keeping the top n features based on this score.

• Chi-squared criterion. Another filter-based method in which we calculate the chi-squared metric between each feature and the target and select the desired number of features which exhibit the best chi-squared scores. The underlying intuition is that if a feature is independent of the target, it is uninformative for classification.

• Recursive Feature Elimination (RFE). A wrapper-based method whose goal is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is computed. In our specific case the estimator used is a Logistic Regression. Then, the least important features are pruned from the current set of features. The procedure is recursively repeated on the pruned set until the desired number of features is eventually reached.

• Random Forest Classifier (RF)⁶. A wrapper-based method that uses a built-in algorithm for feature selection. In particular, variables are selected according to feature importance, obtained by averaging all decision-tree feature importances.

• Logistic Lasso Regression. An embedded-based method which uses the built-in feature selection algorithm embedded in the Logistic Regression with L1 regularization.

• Light-GBM [16] (LGBM)⁷. A wrapper-based method analogous to the above-mentioned RF classifier.

⁶ An RF is an ensemble of Decision Trees generally trained via the bagging method: this approach consists in using the same training algorithm for every predictor, but training them on different random subsets of the train-set. Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors.
⁷ LGBM is a fast, high-performance gradient boosting framework based on decision tree algorithms.
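A minimal sketch of the hard-voting scheme, assuming scikit-learn and LightGBM; the six rankers mirror the list above, while `k` and the hyper-parameters are placeholders:

```python
# Hard-voting feature selection: each of six rankers casts one vote per feature.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from lightgbm import LGBMClassifier

def feature_votes(X: pd.DataFrame, y, k=50):
    votes = pd.DataFrame(index=X.columns)
    # 1) Pearson correlation with the target (filter)
    pearson = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    votes["pearson"] = pearson.rank(ascending=False) <= k
    # 2) Chi-squared (filter; needs non-negative inputs, hence the scaling)
    X_pos = MinMaxScaler().fit_transform(X)
    votes["chi2"] = SelectKBest(chi2, k=k).fit(X_pos, y).get_support()
    # 3) Recursive Feature Elimination around a logistic regression (wrapper)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    votes["rfe"] = rfe.get_support()
    # 4) Random forest impurity importance (wrapper)
    rf = RandomForestClassifier(n_estimators=200).fit(X, y)
    votes["rf"] = pd.Series(rf.feature_importances_, index=X.columns).rank(ascending=False) <= k
    # 5) L1-regularized logistic regression (embedded)
    lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
    votes["lasso"] = abs(lasso.coef_[0]) > 0
    # 6) LightGBM split importance (wrapper)
    lgbm = LGBMClassifier(n_estimators=200).fit(X, y)
    votes["lgbm"] = pd.Series(lgbm.feature_importances_, index=X.columns).rank(ascending=False) <= k
    return votes

# Keep the features collecting the most votes across the six models, e.g.:
# votes = feature_votes(X_train, y_train)
# selected = votes.sum(axis=1).nlargest(50).index
```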
2 Core model

Moving beyond the satellite models described in Section 1.1 and used in the preprocessing phase, in this section we present the core model architecture. It consists of a concatenation of three components: a Boosted Tree default classifier, a probability calibrator and a rating attribution system based on a genetic algorithm.

2.1 Default classifier
In order to leverage the availability of a large-scale dataset, enriched with a high number of features, we developed a robust machine learning approach based on Gradient Boosting decision trees, known as Light-GBM. The Gradient Boosting trees model [17] is one method of combining a group of "weak learners" (specifically decision trees) to form a "strong predictor" model, reducing both variance and bias. Differently from other tree methods like Random Forest, Boosted Trees work by sequentially adding predictors to an ensemble, each one correcting its predecessor by trying to fit the new predictor to the residuals of the previous one. These residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point, evaluated at the current step. Specifically, at each iteration a sub-sample of the training data is drawn at random (without replacement) from the full training dataset. This randomly selected sub-sample is then used in place of the full sample to fit the "weak learner" and compute the model update for the current iteration.

In particular, Light-GBM (LGBM) is a fast, high-performance gradient boosting framework based on decision tree algorithms, which has proved to be highly effective in classification and regression models when applied to tabular, structured data, such as the ones we are dealing with. The model hyper-parameters have been tuned via an out-of-time cross-validation procedure based on a custom extension of the F_β-measure, where the balance between specificity⁸ (also called "true negative rate") and recall⁹ (also called "true positive rate") in the calculation of the harmonic mean is controlled by a coefficient β as follows:

F_β = (1 + β²) · (specificity · recall) / (β² · specificity + recall)    (1)

In this procedure, each test set consists of a single year of future observations, while the corresponding training set is made up of the observations that occurred prior to those that form the test set. In this way, the model is optimized in predicting what will happen in the future using only information available up to the present day. The objective function used for the classification problem was the log-loss, which measures the distance between each predicted probability and the actual class output value by means of a logarithmic penalty. Due to the high unbalance between 0 and 1 target flags, we used a modified, unbalanced log-loss, by setting the scale_pos_weight parameter of the Light-GBM equal to the ratio between the number of 0s and the number of 1s. Other objective functions we tried, like the Focal Loss [18] and custom weighted log-losses, did not give any specific advantage compared to the unbalanced log-loss.

⁸ The specificity is defined as the number of true negatives over the number of true negatives plus the number of false positives.
⁹ The recall is defined as the number of true positives over the number of true positives plus the number of false negatives.
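A minimal sketch of this unbalanced setup, assuming the LightGBM scikit-learn API; hyper-parameter values are placeholders to be tuned by the out-of-time cross-validation with the F_β-measure of Equation (1):

```python
# Unbalanced binary log-loss via scale_pos_weight, plus the custom F_beta metric.
import numpy as np
from lightgbm import LGBMClassifier

def f_beta(specificity, recall, beta=1.0):
    # Harmonic mean of specificity and recall, balance controlled by beta.
    return (1 + beta**2) * specificity * recall / (beta**2 * specificity + recall)

def fit_default_classifier(X_train, y_train):
    n_neg, n_pos = int((y_train == 0).sum()), int((y_train == 1).sum())
    clf = LGBMClassifier(
        objective="binary",              # log-loss objective
        scale_pos_weight=n_neg / n_pos,  # re-weight the rare default class
        n_estimators=1000,               # placeholder hyper-parameters
        learning_rate=0.05,
    )
    clf.fit(X_train, y_train)
    return clf
```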
2.2 PD calibration

A natural extension of the corporate default classification problem consists in predicting the probability of default. Complex non-linear machine learning algorithms can provide poor estimates of the class probabilities, especially when the target variable is highly unbalanced, so that the distribution and behaviour of the probabilities may not reflect the true underlying probability of the sample. The unbalanced log-loss objective chosen for the classification task creates a custom metric in the default probability space that is reflected in distorted class probabilities. The perfect classifier would have only 0 and 1 probabilities, but these would not be able to match historical default rates: they simply represent the probability of belonging to a class with the switch threshold at 0.5, they are not predicted default rates. Fortunately, it is possible to adjust the probability distribution in order to better match the actual distribution observed in the data, without losing predictive power. This adjustment is referred to as calibration. In particular, calibrating a classifier consists in fitting a regressor (known as the calibrator) which maps the output of the classifier f_i to a calibrated probability p(y_i = 1 | f_i) in [0, 1]. In our case the calibrator is a Logistic Regression (LR, whose C parameter has been optimized on the out-of-time sample of the training-set) fitted on the classifier's one-hot encoded leaf assignments, in the spirit of [19].

We first fitted the LGBM on the stratified test-set we left aside for the classification task, as the sample used to train the calibrator should not be used to train the target classifier. We treated the output of each individual tree of the LGBM classifier as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We applied one-hot encoding to obtain dummies indicating leaf assignments, on which the LR model is fitted. Finally, we tested the calibrator on the train-set previously used for the LGBM training phase. This methodology of taking an intermediate result and changing the output from classification to regression is analogous to what is currently known as Transfer Learning in the Deep Neural Network world, where the final neural-net layer is removed and substituted with a novel output. The main advantage of this method is preserving all the inner complex feature engineering that the system learned in the original training task and transferring it to a different problem, in our specific case the prediction of actual default rates.
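A minimal sketch of the leaf-assignment calibration, assuming the LightGBM scikit-learn API (`pred_leaf=True`) and scikit-learn; `fit_calibrator` and `predict_pd` are hypothetical helper names:

```python
# GBM leaves -> one-hot dummies -> logistic regression calibrator.
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

def fit_calibrator(clf, X_cal, y_cal):
    # Index of the leaf each sample falls into, one column per boosted tree.
    leaves = clf.predict(X_cal, pred_leaf=True)
    encoder = OneHotEncoder(handle_unknown="ignore")
    leaves_1hot = encoder.fit_transform(leaves)
    # C would be tuned on the out-of-time sample, as described in the text.
    calibrator = LogisticRegression(C=1.0, max_iter=1000)
    calibrator.fit(leaves_1hot, y_cal)
    return encoder, calibrator

def predict_pd(clf, encoder, calibrator, X):
    leaves_1hot = encoder.transform(clf.predict(X, pred_leaf=True))
    return calibrator.predict_proba(leaves_1hot)[:, 1]  # calibrated PD
```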
2.3 Rating attribution

A robust default classification system, able to meet both supervisory requirements and internal banking usage, provides a way to map the internally calibrated probability of default to a rating system, in which each PD bucket is matched to a rating grade. In order to calibrate our own rating system, the refitted default probability has been split into 9 groups (corresponding to 9 different rating classes) by means of a genetic algorithm known as Differential Evolution [20]. The algorithmic task of calibrating a rating system can be stated as an optimization problem, as it tries to: minimize the Brier Score; maximize the similarities among elements of the same group (the so-called cohesion, i.e. the items in a cluster should be as similar as possible); minimize the dissimilarities between different groups (the so-called separation, i.e. any two clusters should be as distinct as possible in terms of similarity of items); ensure PD monotonicity (i.e. lower default rates have to correspond to low rating grades and vice versa); and obtain an acceptable cluster size (i.e. each cluster has to include a fraction of the total population that is roughly homogeneous among clusters). Among partitioning clustering algorithms, Genetic Algorithms (GA) are stochastic search heuristics inspired by the concepts of Darwinian evolution and genetics. They are based on the idea of creating a population of candidate solutions to an optimization problem, which is iteratively refined by alteration (mutation) and selection of good solutions for the next iteration. Candidate solutions are selected according to a so-called fitness function, which evaluates their quality with respect to the optimization problem. In the case of Differential Evolution (DE) algorithms, the candidate solutions are linear combinations of existing solutions. In the end, the best individual of the population is returned; this individual represents the best solution discovered by the algorithm.
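A minimal sketch of the rating calibration as a differential-evolution problem via scipy; the objective below is a simplified stand-in combining a Brier term with monotonicity and minimum-size penalties, not the authors' exact fitness function:

```python
# Differential evolution over 8 interior PD cut-offs defining 9 rating classes.
import numpy as np
from scipy.optimize import differential_evolution

def make_objective(pd_hat, defaults, n_classes=9, min_frac=0.02):
    def objective(raw):
        bounds = np.sort(raw)                 # 8 interior cut-offs in [0, 1]
        labels = np.digitize(pd_hat, bounds)  # rating class per firm
        brier, sizes, rates = 0.0, [], []
        for k in range(n_classes):
            mask = labels == k
            if mask.sum() == 0:
                return 1e6                    # empty class: infeasible
            class_pd = pd_hat[mask].mean()
            brier += ((class_pd - defaults[mask]) ** 2).sum()
            sizes.append(mask.mean())
            rates.append(defaults[mask].mean())
        penalty = 0.0
        if np.any(np.diff(rates) < 0):        # observed PD must be monotone
            penalty += 1e3
        if min(sizes) < min_frac:             # roughly homogeneous class sizes
            penalty += 1e3
        return brier / len(pd_hat) + penalty
    return objective

# result = differential_evolution(make_objective(pd_hat, defaults),
#                                 bounds=[(0.0, 1.0)] * 8, seed=0)
```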
3 Results

The metric used to evaluate the model performance is the AUROC or AUC-ROC score (Area Under the Receiver Operating Characteristics). In particular, ROC is a probability curve and AUC represents the degree of separability. This measure tells how capable a model is of distinguishing between classes: for an excellent model it is near 1, for a poor model it is near 0. The ROC curve is constructed by evaluating the fraction of "true positives" (tpr or True Positive Rate) and "false positives" (fpr or False Positive Rate) for different threshold values. In detail, tpr, also known as Recall or Sensitivity, is defined in Equation (2) as the number of items correctly identified as positive out of the total true positives:

tpr = TP / (TP + FN)    (2)

where TP is the number of true positives and FN is the number of false negatives. The fpr, also known as Type I Error, is defined in Equation (3) as the number of items wrongly identified as positive out of the total true negatives:

fpr = FP / (FP + TN)    (3)

where FP is the number of false positives and TN is the number of true negatives. Prediction results are then summarized into a confusion matrix, which counts the number of correct and incorrect predictions made by the classifier. A threshold is applied as the cut-off point in probability between the positive and negative classes, which for the default classifier has been set at 0.5. However, a trade-off exists between tpr and fpr, such that changing the classification threshold shifts the balance of predictions towards improving the True Positive Rate at the expense of the False Positive Rate, or vice versa.
The metric used to evaluate the performance of the internally calibrated PD prediction is the Brier score (BS), i.e. a way to verify the accuracy of a probability forecast in terms of distance from the actual results. The most common formulation of the Brier score is the mean squared error:

BS = (1/N) · Σ_{t=1}^{N} (f_t − o_t)²    (4)

in which f_t is the forecast probability, o_t the actual outcome of the event at instance t, and N is the number of forecasting instances. The best possible Brier score is 0, for total accuracy; the worst possible score is 1, which means the forecast was wholly inaccurate. Note that all the metrics described so far have been calculated on the test-set obtained by splitting the dataset along the financial statement year (the train-set spans 2011 to 2016, the test-set covers 2017).
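A minimal sketch of these metrics with scikit-learn's standard implementations; `evaluate` is a hypothetical helper applying Equations (2)-(4):

```python
# AUROC, confusion-matrix rates and Brier score on the out-of-time test-set.
from sklearn.metrics import roc_auc_score, confusion_matrix, brier_score_loss

def evaluate(y_test, pd_forecast, threshold=0.5):
    auroc = roc_auc_score(y_test, pd_forecast)       # area under the ROC curve
    y_pred = (pd_forecast >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    tpr = tp / (tp + fn)                             # Equation (2)
    fpr = fp / (fp + tn)                             # Equation (3)
    bs = brier_score_loss(y_test, pd_forecast)       # Equation (4)
    return {"AUROC": auroc, "tpr": tpr, "fpr": fpr, "Brier": bs}
```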
3.1 Default classifier

We obtained a high performance, corresponding to an AUROC of 95.0% (see Figure 2), summarized in the normalized confusion matrix of Figure 3. In highly unbalanced datasets, the confusion matrix is usually skewed towards predicting well only the majority class, producing unsatisfactory performances on the minority class, even though the misclassification of the minority class is the event businesses usually try hardest to minimize. The distortion in the default probability space and an accurate choice of feature selection and hyperparameters created a system able to effectively discriminate events in the minority class, reducing the occurrence of false negatives.

Figure 2: ROC curve for the Light-GBM classifier (train-set: 2011 to 2016; test-set: 2017).
3.2 PD calibration

Default probability forecasts before and after the refitting procedure are summarized in the calibration plots (also called reliability curves) of Figure 4, which allow checking whether the predicted probabilities produced by the model are well calibrated. Specifically, a calibration plot consists of a line plot of the relative observed frequency (y-axis) versus the predicted probabilities (x-axis)¹⁰. A perfect classifier would produce only 0 and 1 predictions but would not be able to forecast actual default rates. A perfect actual-default-rate model would produce reliability diagrams as close as possible to the main diagonal from the bottom left to the top right of the plot. The refitting procedure maps the perfect classifier to a reliable default rate predictor.

The refitting procedure left the AUROC score of the model unchanged (AUROC = 95.…, BS = 1.…¹¹); classification results are summarized in the normalized confusion matrix reported in Figure 6, where the threshold for the cut-off point between the positive and negative classes has been optimized on the ROC curve of Figure 5.

¹⁰ In detail, the predicted probabilities are divided up into a fixed number of buckets along the x-axis. The number of target events (i.e. the occurrence of 1-year default) is then counted for each bin (i.e. the relative observed frequency). Finally, the counts are normalized and the results are plotted as a line plot.
¹¹ The closer the Brier score is to zero, the better the forecast of default probabilities.

Figure 4: Calibration plots and log-scaled histograms of forecast probability before (a) and after (b) refitting. Accuracy of predicted probabilities is expressed in terms of the log-loss measure.

Figure 5: ROC curve for the calibrated classifier.

Figure 6: Normalized confusion matrix with optimized threshold for the calibrated classifier.
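A minimal sketch of such a reliability curve with scikit-learn; the bin count and binning strategy are illustrative choices:

```python
# Reliability curve: bucketed forecasts vs. observed default frequencies.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_observed, mean_predicted = calibration_curve(y_test, pd_forecast,
                                                  n_bins=10, strategy="quantile")
plt.plot(mean_predicted, frac_observed, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("forecast probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```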
3.3 PD clustering

Among the several common statistical tests that can be performed to validate the assignment of a probability of default to a certain rating grade, two approaches have been used: the one-sided Binomial Test and the Extended Traffic-Light Approach.

The Binomial Test is one of the most popular single-grade, single-period¹² tests performed for rating system validation. For a certain rating grade k ∈ {1, …, K}, K being the number of rating classes, we made the assumption that default events are independent within grade k and can be modelled as a binomially distributed random variable X with size parameter N_k and "success" probability PD_k. Thus, we can assess the correctness of the PD forecast by testing the null hypothesis H₀, where:

• H₀: the actual default rate is less than or equal to the forecast default rate given by the PD.

The null hypothesis H₀ is rejected at a confidence level α in case the number of observed defaults d per rating grade is greater than or equal to the critical value reported in Equation (5):

d_α = min{ d : Σ_{j=d}^{N_k} (N_k choose j) · PD_k^j · (1 − PD_k)^{N_k − j} ≤ 1 − α }    (5)

¹² Usually one year.

The Extended Traffic-Light Approach is a novel technique for default probability validation, first adopted by Tasche (2003) [21]. The implementation used in this section refers to a heuristic approach proposed by Blochwitz et al. (2005) [22], which is based on the estimation of a relative distance between observed default rates and forecast probabilities of default, under the key assumption of binomially distributed default events. Four coloured zones, Green, Yellow, Orange and Red, are established to analyse the deviation between forecasts and actual realizations. In detail: if the result of the validation assessment lies in the Green zone, there is no obvious contradiction between forecast and realized default rate; the Yellow and Orange lights indicate that the realized default rate is not compatible with the PD forecast, but the difference between realized rate and forecast is still in the range of usual statistical fluctuations; finally, the Red traffic light indicates a wrong forecast of the default probability. The boundaries between the aforementioned light zones are summarized in Equation (6):

Green:  p_k < PD_k
Yellow: PD_k ≤ p_k < PD_k + K_y · σ(PD_k, N_k)
Orange: PD_k + K_y · σ(PD_k, N_k) ≤ p_k < PD_k + K · σ(PD_k, N_k)
Red:    PD_k + K · σ(PD_k, N_k) ≤ p_k    (6)

where σ(PD_k, N_k) = √(PD_k (1 − PD_k) / N_k). The parameters K_y and K play a major role in the validation assessment, so they have to be tuned carefully. A proper choice based on practical considerations is setting K_y = 0.84 and K = 1.44, which corresponds to a probability of observing Green of 0.5, Yellow of 0.3, Orange of 0.15 and Red of 0.05.

| Rating class | PD bins (%) | Rating class PD (%) | Out-of-sample default rate (%) | One-sided Binomial Test | Extended Traffic-Light Approach |
|---|---|---|---|---|---|
| AAA | [0.00, 0.05) | 0.03 | 0.00 | Passed | Green |
| AA | [0.05, 0.42) | 0.24 | 0.03 | Passed | Green |
| A | [0.42, 0.55) | 0.48 | 0.08 | Passed | Green |
| BBB | [0.55, 0.74) | 0.64 | 0.21 | Passed | Green |
| BB | [0.74, 1.00) | 0.87 | 0.40 | Passed | Green |
| B | [1.00, 1.42) | 1.21 | 0.83 | Passed | Green |
| CCC | [1.42, 2.12) | 1.77 | 1.29 | Passed | Green |
| CC | [2.12, 9.03) | 5.57 | 5.06 | Passed | Green |
| C | [9.03, 100) | 54.52 | 33.77 | Passed | Green |

Table 1: Internally calibrated PD clustering into 9 rating classes. Despite being borrowed from S&P rating scales, the labels are assigned to a PD calibrated on an internal dataset (the one used during the training phase) and do not correspond to any rating agency's PD.
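A minimal sketch of the two validation tests, assuming scipy; `binomial_test_passed` and `traffic_light` are hypothetical helpers implementing Equations (5) and (6):

```python
# One-sided binomial test and extended traffic-light zones per rating grade.
import numpy as np
from scipy.stats import binom

def binomial_test_passed(n_k, d_observed, pd_k, alpha=0.99):
    # Smallest d with P(X >= d) <= 1 - alpha, X ~ Binomial(n_k, pd_k):
    # P(X >= d) = 1 - CDF(d - 1), so d_alpha = ppf(alpha) + 1.
    d_alpha = binom.ppf(alpha, n_k, pd_k) + 1
    return d_observed < d_alpha

def traffic_light(p_k, pd_k, n_k, k_y=0.84, k_o=1.44):
    # Zone boundaries of Equation (6) with K_y = 0.84 and K = 1.44.
    sigma = np.sqrt(pd_k * (1.0 - pd_k) / n_k)
    if p_k < pd_k:
        return "Green"
    if p_k < pd_k + k_y * sigma:
        return "Yellow"
    if p_k < pd_k + k_o * sigma:
        return "Orange"
    return "Red"
```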
4 Model explainability

Machine learning models which operate in higher dimensions than can be directly visualized by the human mind are often referred to as "black boxes", in the sense that high model performance is often achieved to the detriment of output explainability, leaving users unable to understand the logic behind model predictions. The ever greater attention to model interpretability has led to the development of several methods to provide an explanation of machine learning outputs, both in terms of global and local interpretability. In the first case, the goal is being able to explain and understand model decisions based on conditional interactions between the dependent variable (i.e. the target) and the independent features over the entire dataset. In the latter case, the aim is to understand the model output for a single prediction by looking at a local subregion of the feature space around that instance.

Two popular approaches, described hereafter in this section, are SHAP and LIME, which explore and leverage the property of local explainability to build surrogate models able to interpret the output of any machine learning model. The technique upon which these algorithms are based is slightly tweaking the input and modelling the changes in prediction by means of surrogate agnostic models. In particular, SHAP measures how much each feature in our model contributes, either positively or negatively, to each prediction, in terms of the difference between the actual prediction and its expected value. LIME builds sparse linear models around each prediction to explain how the black-box model works in that local vicinity.
4.1 SHAP

SHAP, which stands for SHapley Additive exPlanation [23], is a novel approach to model explainability which exploits the idea of the Shapley regression value¹³ to model feature influence scoring. SHAP values quantify the magnitude and direction (positive or negative) of a feature's effect on a prediction via an additive feature attribution method. In simple words, SHAP builds model explanations by asking, for each prediction i and feature j, how i changes when j is removed from the model. Since SHAP considers all possible predictions for an instance using all possible feature coalitions, it is firmly rooted in coalitional game theory. The feature values of a data instance act as players¹⁴ in a coalition: Shapley values suggest how to fairly distribute the payout (i.e. the prediction) among the features.

¹³ The technical definition of the Shapley value is the average marginal contribution of a feature value over all possible coalitions.
¹⁴ Note that a player can be an individual feature value or a group of feature values.
SHAP summary plot As reported in Figure 7, it combines feature importance with feature effects to measure the global impact of features on the model. For each feature shown on the y-axis, ordered according to importance, each point on the plot represents the Shapley value (reported along the x-axis) for a given prediction. The colour of each point represents the impact of the feature on the model output, from low (i.e. blue) to high (i.e. red). Overlapping points are jittered in the y-axis direction, so we get a sense of the distribution of the Shapley values per feature.

Figure 7: SHAP summary plot for the Light-GBM classifier. The details of the model's feature descriptions are reported in Appendix A.
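A minimal sketch of how such a plot is produced, assuming the `shap` package and a fitted LightGBM classifier `clf`:

```python
# SHAP values and summary plot for a fitted tree ensemble.
import shap

explainer = shap.TreeExplainer(clf)          # fast SHAP values for tree models
shap_values = explainer.shap_values(X_test)  # one value per sample and feature
shap.summary_plot(shap_values, X_test)       # importance + direction of effect
```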
SHAP dependence plot A scatter plot that shows the effect a single feature has on the model predictions. In particular, each dot represents a single prediction, where the feature value is on the x-axis and its SHAP value, representing how much knowing that feature's value changes the output of the model for that sample's prediction, is on the y-axis. The colour corresponds to a second feature that may have an interaction effect with the plotted feature. If an interaction effect is present, it shows up as a distinct vertical pattern of colouring.
SHAP waterfall plot
The waterfall plot reported in Figure 9 is designed to display how the SHAP values of each feature move the model output from our prior expectation under the background data distribution, E[f(X)], to the final model prediction, f(X), given the evidence of all the features. Features are sorted by the magnitude of their SHAP values, with the smallest-magnitude features grouped together at the bottom of the plot. The colour of each row represents the impact of the feature on the model output, from low (i.e. blue) to high (i.e. red).

Figure 8: SHAP dependence plots for ACTIVITY (8a), cashAndMarketableSecurities (8b), DEBT EQUITY (8c), EBITDA RATIO (8d), netIncome (8e), ROI (8f) and totalInterestExpense (8g). The details of the model's feature descriptions are reported in Appendix A.

Figure 9: SHAP waterfall plot. The details of the model's feature descriptions are reported in Appendix A.

4.2 LIME

LIME, Local Interpretable Model-agnostic Explanations, is a novel technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction [24]. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale, so it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The output of LIME is a list of explanations reflecting the contribution of each feature to the prediction of a data sample, allowing one to determine which feature changes will have the most impact on the prediction. Note that LIME has the desirable property of additivity, i.e. the sum of the individual impacts is equal to the total impact. Results for a prediction are summarized in Figure 10.
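A minimal sketch of a local LIME explanation, assuming the `lime` package and a fitted classifier `clf` exposing `predict_proba`; `X_train` and `X_test` are pandas DataFrames:

```python
# Sparse linear surrogate fitted around a single test instance.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train.values,
                                 feature_names=list(X_train.columns),
                                 class_names=["in-bonis", "default"],
                                 mode="classification")
explanation = explainer.explain_instance(X_test.values[0],
                                         clf.predict_proba,
                                         num_features=10)
print(explanation.as_list())  # per-feature contribution to this prediction
```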
Figure 10: LIME local explanation for a prediction from the Light-GBM classifier. The details of the model's feature descriptions are reported in Appendix A.

Conclusions

Starting from Moody's dataset of historical balance sheets, bankruptcy statuses and macroeconomic variables, we have built three models: a classifier, a default probability model and a rating system. By leveraging modern techniques in both data processing and parameter calibration we have reached state-of-the-art results. The three models show excellent out-of-sample performances, allowing for intensive usage in risk-averse businesses where the occurrence of false negatives can dramatically harm the firm itself. The explainability layers via SHAP and LIME give a set of extra tools to increase confidence in the model and help in understanding the main features determining a specific result. This information can be leveraged by the analyst to understand how to reduce the bankruptcy probability of a specific firm, or to get insight into which balance-sheet fields need to be improved to increase the rating, therefore providing a business instrument to actively manage clients and structured finance deals.
Acknowledgements

We are grateful to Corrado Passera for encouraging our research.
References

[1] P. Mizen and S. Tsoukas, "Forecasting US bond default ratings allowing for previous and initial state dependence in an ordered probit model," International Journal of Forecasting, vol. 28, no. 1, pp. 273–287, 2012.
[2] P. Gurný and M. Gurný, "Comparison of credit scoring models on probability of default estimation for US banks," 2013.
[3] Z. Zhao, S. Xu, B. H. Kang, M. M. J. Kabir, Y. Liu, and R. Wasinger, "Investigation and improvement of multi-layer perceptron neural networks for credit scoring," Expert Systems with Applications, vol. 42, no. 7, pp. 3508–3516, 2015.
[4] A. Petropoulos, V. Siakoulis, E. Stavroulakis, A. Klamargias, et al., "A robust machine learning approach for credit risk analysis of large loan level datasets using deep learning and extreme gradient boosting," Are Post-crisis Statistical Initiatives Completed, vol. 49, pp. 49–49, 2019.
[5] P. M. Addo, D. Guegan, and B. Hassani, "Credit risk analysis using machine and deep learning models," Risks, vol. 6, no. 2, p. 38, 2018.
[6] A. R. Provenzano, D. Trifirò, N. Jean, G. Le Pera, M. Spadaccino, L. Massaron, and C. Nordio, "An artificial intelligence approach to shadow rating," 2019.
[7] Moody's Analytics, "Credit Research Database."
[8] Oxford Economics, "Global Economic Databank."
[9] A. Zheng and A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc., 2018.
[10] C. Guo and F. Berkhahn, "Entity embeddings of categorical variables," arXiv preprint arXiv:1604.06737, 2016.
[11] G. Perani, V. Cirillo, et al., "Matching industry classifications. A method for converting NACE rev. 2 to NACE rev. 1," tech. rep., 2015.
[12] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[13] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[14] Y. Goldberg, "Neural network methods for natural language processing," Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.
[15] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
[16] LightGBM, https://github.com/microsoft/LightGBM
[17] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[19] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., "Practical lessons from predicting clicks on ads at Facebook," in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9, 2014.
[20] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, pp. 341–359, 1997.
[21] D. Tasche, "A traffic lights approach to PD validation," arXiv preprint cond-mat/0305038, 2003.
[22] S. Blochwitz, S. Hohl, and C. Wehn, "Reconsidering ratings," Wilmott Magazine, vol. 5, pp. 60–69, 2005.
[23] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
[24] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?' Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
Appendices

A Model features descriptions

In this section the details of the selected features, on which the model has been trained, are reported.

A.1 Balance-sheet index descriptions
| Code | Definition |
|---|---|
| cashAndMarketableSecurities | Cash and marketable securities. |
| depreciationExpense | The depreciation expense for the current period. |
| ebitda | Earnings before interest, taxes, depreciation and amortization, before extraordinary items. |
| entityConsolidationType | For companies with subsidiaries, … |
| incorporationRegion | The entity's incorporation region. |
| incorporationState | The entity's incorporation province or administrative division where the entity has a legal representation. |
| longTermDebtCurrentMaturities | The current maturities of long-term debt, principal payments due within 12 months. |
| netIncome | Net income is the total period-end earnings. |
| netWorth | Net worth is the sum of all equity items, including retained earnings and other equity. |
| payableToTrade | Accounts payable to regular trade accounts. |
| receivableFromTrade | Accounts receivable from trade. |
| retainedEarnings | Retained earnings. |
| tangibleNetWorth | Defined as the difference between netWorth and totalIntangibleAssets. |
| totalAccountsPayable | The sum of accounts payable. |
| totalAccountsReceivable | The total accounts receivable, net of any provision from loss. |
| totalAmortizationAndDepreciaton | The sum of amortization and depreciation expense for the current period. |
| totalAssets | The total assets of the borrower, which is the sum of the current assets and non-current assets. |
| totalCapital | Total subscribed and share capital. |
| totalCapital_and_totalLiabilities | Defined as the sum of totalLiabilities and totalCapital. |
| totalCurrentAssets | The sum of all current assets. |
| totalCurrentLiabilities | The sum of all current liabilities. |
| totalFixedAssets | Total fixed assets are the Gross Fixed Assets less Accumulated Depreciation. |
| totalIntangibleAssets | Total intangible assets. |
| totalInterestExpense | The total interest expense is any gross interest expense generated from short-term, long-term, subordinated or related debt. |
| totalInventory | The sum of all the inventories. |
| totalLiabilities | The sum of Total Current Liabilities and Total non-current liabilities. |
| totalLongTermDebt | The amount due to financial and other institutions after 12 months. |
| totalOperatingExpense | The sum of all operating expenses. |
| totalOperatingProfit | The Gross Profit less Total Operating Expense. |
| totalProvisions | Total provisions for pensions, taxes, etc. |
| totalSales | Total sales. |
| totalWageExpense | The total wage expense. |
| workingCapital | Defined as the sum of receivableFromTrade, totalAccountsReceivable and totalInventory minus payableToTrade and totalAccountsPayable. |
A.2 KPI descriptions
ACID = (cashAndMarketableSecurities + totalAccountsReceivable) / totalCurrentLiabilities    (7)

ACTIVITY = totalCurrentLiabilities / totalSales    (8)

AGE = (financialStatementDate − incorporationDate) / 365.…

…

… − cashAndMarketableSecurities    (23)

ROA = netIncome / totalAssets    (24)

ROE = netIncome / netWorth    (25)

ROI = totalOperatingProfit / totalAssets    (26)

SHORT-TERM-DEBT EQUITY = totalCurrentLiabilities / netWorth    (27)

A.3 Macro-economic factors descriptions
Country-specific indicators

| Code | Name | Definition |
|---|---|---|
| C | Consumption, private, real | The volume of goods and services consumed by households and non-profit institutions serving households. |
| CD | Durable goods | The volume of real personal consumption expenditures. |
| CREDR | Credit rating, average | The sovereign risk rating, based on the average of the sovereign ratings provided by Moody's, S&P and Fitch. |
| CU | Capacity utilisation | A measure of the extent to which the productive capacity of a business is being used. |
| DOMD | Domestic demand, real | The volume of consumption, investment, stockbuilding and government consumption expressed in local currency and at prices of the country's base year. |
| EE | Employees in employment | Employees in employment. |
| ET | Employment, total | Employment, total. |
| GC | Consumption, government, real | The volume of government spending on goods and services. |
| GDP | GDP, real | The volume of all final goods and services produced within a country in a given period of time. |
| GDPHEAD | GDP per capita, real, US$, constant prices | GDP per capita, real, US$, constant prices. |
| IF | Investment, total fixed investment, real | The volume of investment in tangible and intangible capital goods, including machinery and equipment, software, and construction. |
| IP | Industrial production index | Industrial production index. |
| IPNR | Investment, private sector business, real | The volume of investment in private sector business. |
| IPRD | Investment, private dwellings, real | The volume of investment in private dwellings. |
| IS | Stockbuilding, real | The volume of stocks of outputs that are still held by the units that produced them, and stocks of products acquired from other units that are intended to be used for intermediate consumption or for resale. |
| M | Imports, goods & services, real | The volume of goods and services imports. |
| MG | Imports, goods, real | The volume of goods imports. |
| MS | Imports, services, real | The volume of services imports. |
| PEWFP | GDP, compensation of employees, total, nominal | The values of wages and salaries of employees as a component of GDP. |
| PH | House price index | Index of house prices. |
| POIL$ | Oil price, US$ per toe | Oil price in US$ per tonne of oil equivalent. |
| RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average). |
| RCORP_SPREADEOP | Credit spreads, end of period | The difference in yield between two bonds of similar maturity but different credit quality, expressed as an end-of-period value. |
| RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields. |
| RS | Retail sales volume index, excluding automotive | Volume index for retail sales excluding automotive. |
| RSH | Interest rate, short-term | The 3-month interbank rate. |
| RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of the period. |
| SMEPS | Stockmarket earnings per share | Stockmarket earnings per share, calculated as Stockmarket earnings, LCU * 1000 / Stockmarket shares outstanding. |
| SMP_TR | Share price total return index | Share price total return index. |
| TFE | Total final expenditure, real | The sum of volumes of consumption, investment, stockbuilding, government consumption and exports. |
| U | Unemployment | The total number of people without a job, but actively searching for one. |
| UP | Unemployment rate | The percentage of the labour force that is unemployed at a given date. |
| X | Exports, goods & services, real | The volume of goods and services exports expressed in local currency and at the country's base year. |
| XG | Exports, goods, real | The volume of goods exports expressed in local currency. |
| XS | Exports, services, real | The volume of services exports expressed in local currency. |
Eurozone indicators

| Code | Name | Definition |
|---|---|---|
| PH | House price index | Index of house prices. |
| RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average). |