Machine Learning approach for Credit Scoring
A. R. Provenzano∗, D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi, G. Le Pera, M. Spadaccino, L. Massaron, C. Nordio
August 5, 2020

∗ Corresponding author: [email protected]. This paper reflects the authors' opinions and not necessarily those of their employers.
Working paper

Abstract
In this work we build a stack of machine learning models aimed at composing a state-of-the-art credit rating and default prediction system, obtaining excellent out-of-sample performances. Our approach is an excursion through the most recent ML/AI concepts, starting from natural language processing (NLP) applied to economic sectors' (textual) descriptions using embedding and autoencoders (AE), going through the classification of defaultable firms on the basis of a wide range of economic features using gradient boosting machines (GBM), and calibrating their probabilities paying due attention to the treatment of unbalanced samples. Finally, we assign credit ratings through genetic algorithms (differential evolution, DE). Model interpretability is achieved by implementing recent techniques such as SHAP and LIME, which explain predictions locally in features' space.
JEL classification codes: C45, C55, G24, G32, G33
AMS classification codes: 62M45, 68T01, 68T50, 91G40
Keywords: Artificial Intelligence, Machine Learning, Explainable AI, Autoencoders, Embedding, LightGBM, Differential Evolution, SHAP, LIME, Credit Risk, Rating Model, Default, Probability of Default, Classification
Introduction
In the aftermath of the economic crisis, the probability of default (PD) has become a topical theme in the field of financial research. Indeed, given its usage in risk management, in the valuation of credit derivatives, in the estimation of the creditworthiness of a borrower and in the calculation of economic or regulatory capital for banking institutions (under Basel II), an incorrect PD prediction can lead to a false valuation of risk, unreasonable ratings and incorrect pricing of financial instruments. In the last decades, a growing number of approaches has been developed to model the credit quality of a company by exploring statistical techniques. Several works have employed probit models [1] or linear and logistic regression to estimate company ratings using the main financial indicators as model input. However, these models suffer from their clear inability to capture non-linear dynamics, which are prevalent in financial ratio data [2]. New statistical techniques, especially from the field of machine learning, have gained a worldwide reputation thanks to their ability to efficiently capture information from big datasets by recognizing non-linear patterns and temporal dependencies among data. Zhao et al. (2015) [3] employed feed-forward neural networks in corporate credit rating determination. Petropoulos et al. [4] explore two state-of-the-art techniques, namely Extreme Gradient Boosting (XGBoost) and deep learning neural networks, in order to estimate loan PD and calibrate an internal rating system, useful both for internal usage and for regulatory scope. Addo et al. (2018) [5] built binary classifiers based on machine and deep learning models on real data to predict loan probability of default. They observed that tree-based models are more stable than those based on multilayer artificial neural networks.

Starting from these studies, we propose a sophisticated framework of machine learning models which, on the basis of company annual (end-of-year) financial statements coupled with relevant macroeconomic indicators, attempts to classify the status of a company (performing, "in bonis", or defaulted) and to build a robust rating system in which each rating class is matched to an internally calibrated default probability. In this regard, here the target variable is different from a previous work by some of the authors [6], where the goal was to predict the credit rating that Moody's would assign, according to an approach commonly called "shadow rating". The novelty of our approach lies in the combination of data preprocessing algorithms, responsible for feature engineering and feature selection, and a core model architecture made of a concatenation of a Boosted Tree default classifier, a probability calibrator and a rating attribution system based on a genetic algorithm. Great attention is then given to model interpretability, as we propose two intuitive approaches to interpret the model output by exploring the property of local explainability. In detail, the article is composed of the following sections: Section 1 is devoted to describing the input dataset and the preprocessing phase; Section 2 explains the core model architecture; Section 3 collects results from the core model structure (i.e. default classifier, PD calibrator and rating clustering); finally, Section 4 is left to model explainability.
1 Input dataset

Data used for model training have been collected from the Credit Research Database (CRD) provided by Moody's, and consist of 919,636 annual (end-of-year) financial statements of 157,… firms. The target of the proposed default prediction model is a binary indicator with the value of 1 flagging a default event (i.e. a bankruptcy occurrence over a one-year horizon), 0 otherwise. In accordance with the above-defined target variable, the input variables of our model have been selected to be consistent with factors that can affect a company's capacity to service external debt (a full explanation of the input model features is reported in Appendix A). In particular, they consist of balance-sheet indexes and ratios, and Key Performance Indicators (KPI) calculated from the CRD's financial reports [7]. The latter include indicators for efficiency (i.e. measures of operating performance), liquidity (i.e. ratios used to determine how quickly a company can turn its assets into cash if it is experiencing financial distress or impending bankruptcy), solvency (i.e. ratios that depict how much a company relies upon its debt to fund operations) and profitability (i.e. measures that demonstrate how profitable a company is). Since business cycles can have a great impact on a firm's profitability and influence its risk profile, we joined the original information with more general macro variables (2-year-lagged historical data) addressing the surrounding climate in which companies operate. Among the wide range of macroeconomic indicators provided by Oxford Economics [8], a subset of the most influential ones has been selected as explanatory variables¹. Some of them are country-specific, others are common to the whole Eurozone². The combined dataset of balance-sheet indexes, financial ratios and macro variables, along with data transformations and feature selection (better described hereafter in Section 1.1), led to a set of 179 features and covers the period 2011–2017.

¹ The list of selected indicators is reported in Section A.3.
² Regional aggregate: the Eurozone includes the following countries: Austria, Belgium, Cyprus, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Portugal, Slovenia, Slovakia and Spain.

1.1 Data preprocessing

A preliminary step for building a machine learning model consists in generating a set of features suitable for model training. This task involves data manipulation processes like transformation of categorical features, missing values treatment, infinite values handling, outliers detection and data leakage avoidance. In particular, categorical, non-ordinal variables are one of the main issues that must be tackled in order to feed any machine learning model [9]. Different encoding techniques can be used to make categorical data legible for a machine learning algorithm. The most common way to deal with categories is to simply map each category to an integer: with Label Encoding, a model would treat categories as ordered integers, which would imply non-existent ordinal relationships between data and could be misleading for model training.
Another simple way to handle categorical data is One-Hot Encoding, which consists in transforming each categorical feature into a fixed-size sparse vector of all zeros but a 1 in the cell used to uniquely identify a specific realization of that variable. The main drawback of this technique lies in the fact that categories with a high number of possible realizations generate large-dimension datasets, which makes it a memory-inefficient encoder. Moreover, this sparse representation does not preserve similarity between feature values.

An alternative approach to overcome these issues is Categorical Embedding, which consists in mapping, via a Deep Neural Network (DNN), each possible discrete value of a given categorical variable into a low-dimensional, learned, continuous vector representation. This method allows placing each categorical feature in a Euclidean space, keeping coherent relationships with other realizations of the same variable. The extension of the categorical embedding approach to words and document representation is known as Word Embedding [10]. In particular, Sentence Embedding is an application of word embedding aiming at representing a full sentence in a vector space. In this study, we applied sentence embedding to represent the industry sector descriptions associated to each "NACE code"³. In order to guarantee the "semantics" of the original data and work in a low-dimensional space, we propose a framework of embedding with autoencoder regularization, in which the original data are embedded into low-dimension vectors. The obtained embeddings maintain local similarity and can be easily reverted to their original forms. The encoding of the NACE is a novel way to overcome the NACE1-NACE2 mapping conundrum: in our dataset both NACE versions are used and, as already stated in many papers, the two encoding systems are not fully compatible [11]. Moreover, the NACE encoding allows for a proper industry segment description of multi-sector firms that cannot be easily described by a single NACE code, further extending the predictive power of the economic sector category.

³ The "Statistical Classification of Economic Activities in the European Community", commonly referred to as NACE, is the industry standard classification system used in the European Union.

A different encoding method is Target Encoding, in which categorical features are replaced with the mean target value for samples having that category. This allows encoding an arbitrary number of features without increasing data dimensionality. However, as a drawback, a naive application of this type of encoding can allow data leakage, leading to model overfitting and poor predictive performance. A target encoding algorithm developed to prevent data leakage is known as the
James-Stein estimator, and is the one used in our model. In more detail, it transforms each categorical feature into a weighted average of the mean target value for the observed feature value and the mean target value computed regardless of the feature realization.
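A minimal sketch of this encoding step, assuming the open-source `category_encoders` package; the column names are illustrative, not the actual CRD schema:

```python
# James-Stein target encoding sketch, assuming `category_encoders`
# (pip install category_encoders); column names are illustrative.
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"incorporationRegion": ["IT", "DE", "IT", "FR"],
                  "entityConsolidationType": ["solo", "group", "solo", "solo"]})
y = pd.Series([0, 1, 0, 0])  # 1 = default within one year

# Each category is replaced by a weighted average of the per-category target
# mean and the global target mean, shrinking rare categories toward the prior.
encoder = ce.JamesSteinEncoder(cols=list(X.columns))
X_encoded = encoder.fit_transform(X, y)
print(X_encoded.head())
```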
As described above, some feature transformations can result in a general increase of input data dimensionality, which makes it urgent to implement a robust and independent feature selection framework. In fact, training a machine learning model on a huge number of independent variables is doomed to suffer from the so-called curse of dimensionality [12], i.e. the problem of the exponential increase in volume associated with adding extra dimensions to a vector space. We employed a voting ensemble of models to independently assign importance to the available features and efficiently select those features which contribute most to model prediction. Hereafter in this section we look into the implementation of satellite models aiming at: performing sentence embedding of the industry sector descriptions; reducing embedding dimensionality via a stacked autoencoder; selecting relevant features via a voting approach.

Sentence embedding of sector descriptions A common practice in Natural Language Processing (NLP) is the use of pre-trained embeddings to represent words or sentences in a document. Following this common practice, we use the pre-trained models built into the SpaCy NLP library for embedding the sequence of NACE sector textual descriptions. In particular, we performed sentence embedding, i.e. we transformed each description into a 300-dimensional real-valued vector. Each sentence embedding is automatically constructed by SpaCy by averaging the 300-dimensional real-valued pre-trained vectors which map each word in that sentence. Here is a glimpse at how SpaCy processes textual data. It first segments text into words, punctuation, symbols and others by applying rules specific to each language (i.e. it tokenizes the text). Then it performs Part-of-Speech (POS) tagging to understand the grammatical properties of each word by means of a built-in statistical model. A model consists of binary data trained on a dataset large enough to allow the system to make predictions that generalize across the language. A key assumption of the word embedding approach is the idea of using, for each word, a dense distributed representation learned from the usage of words [13]. This allows words that are used in similar ways to have similar representations, naturally capturing their meaning [14]. Given the high importance the industry sector has in the financial literature as a default prediction driver, embedding NACE industry descriptions improves the overall model performance in application by helping the model to generalize better and to smoothly handle unseen elements.
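A minimal sketch of this embedding step, assuming a pre-trained spaCy pipeline with 300-dimensional word vectors (e.g. `en_core_web_md`) is installed; the sector description is illustrative:

```python
# Sentence embedding of a NACE-style sector description with spaCy.
import spacy

nlp = spacy.load("en_core_web_md")

# Illustrative sector description, not an actual CRD record.
description = "Manufacture of machinery for food, beverage and tobacco processing"

doc = nlp(description)        # tokenization + POS tagging under the hood
sentence_vector = doc.vector  # average of the 300-d vectors of the tokens
print(sentence_vector.shape)  # (300,)
```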
Dimensionality reduction via stacked autoencoder
The aforementioned word embedding models are a powerful way to represent categorical variables while preserving relationships between data, but at the cost of an increase in dimensionality. In order to reduce the number of dimensions of the output embeddings from 300 to 5, a stacked autoencoder (SAE) of 6 layers⁴ has been developed via TensorFlow [15]. In detail, autoencoders (AE) are a family of neural networks in which input and output coincide. They work by compressing the input into a latent-space representation and then reconstructing the output by means of this representation. They consist of two principal components: the encoder, which takes the input and compresses it into a representation with fewer dimensions, and the decoder, which tries to reconstruct the input. Among AEs, stacked autoencoders are deep neural networks in which the output of each hidden layer is connected to the input of the successive hidden layer. All hidden layers are trained by an unsupervised algorithm and then fine-tuned by a supervised method aimed at minimizing the cost function. Since they can learn even non-linear transformations, unlike PCA, by using a non-linear activation function and a multiple-layer structure, autoencoders are efficient tools for dimensionality reduction. Moreover, in our application, the SAE exhibited a low reconstruction loss⁵ (around 6% of MSE), contrary to the low fraction of variance explained by the PCA.

⁴ A 3-layer encoder and a 3-layer decoder.
⁵ The reconstruction loss is the loss function (usually either the mean-squared error or the cross-entropy between the reconstructed output and the input) which penalizes the network for creating outputs different from the original input.
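A minimal sketch of such a 300-to-5 stacked autoencoder in TensorFlow/Keras; the intermediate layer sizes are illustrative assumptions, as the text does not report them:

```python
# 6-layer stacked autoencoder compressing 300-d embeddings to 5 dimensions.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(300,))
# 3-layer encoder
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dense(32, activation="relu")(h)
code = layers.Dense(5, activation="linear", name="bottleneck")(h)
# 3-layer decoder mirroring the encoder
h = layers.Dense(32, activation="relu")(code)
h = layers.Dense(128, activation="relu")(h)
outputs = layers.Dense(300, activation="linear")(h)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss (MSE)
# autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256)

# After training, the encoder alone maps each 300-d embedding to 5 features.
encoder = Model(inputs, code)
```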
Voting approach for feature selection Feature selection is a key component when building machine learning models. We can either demand this task of the main model or use a set of lighter models in a preparatory task, so that the required effort for further feature selection is reduced when training the main model. This is particularly useful for multi-parameter models like Light-GBM, where the training phase also involves the calibration of a set of hyperparameters usually spanning very wide ranges. Neglecting the expert-based component, algorithmic feature selection methods are usually divided into three classes: filter methods, wrapper methods and embedded methods. Filter-based methods apply a statistical measure to assign a score to each feature; variables of the starting dataset are then ranked according to their scores and either selected to be kept or removed. Wrapper-based methods consider the selection of a set of features as a "search problem", where different combinations are prepared, evaluated and compared to other combinations. In detail, a predictive model is used to evaluate a combination of features and assign a score based on model accuracy. Embedded-based methods learn which features best contribute to the accuracy of the model while the model is being created.

We combined a set of 6 different models for feature selection, stacking each algorithm into a hard voting framework where the features which receive the highest number of votes among all the models are selected (a minimal sketch of this voting scheme is given after the list). In particular, after having transformed categorical features via target encoding (by means of the James-Stein encoder), each feature in the dataset has been ranked on the basis of the following models:

• Pearson criterion. A filter-based method which consists in checking the absolute value of the Pearson correlation between the target and the features in the input dataset and keeping the top n features based on this score.

• Chi-squared criterion. Another filter-based method in which we calculate the chi-squared metric between each feature and the target and select the desired number of features which exhibit the best chi-squared scores. The underlying intuition is that if a feature is independent of the target, it is uninformative for classification.

• Recursive Feature Elimination (RFE). A wrapper-based method whose goal is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is computed. In our specific case the estimator used is a Logistic Regression. Then, the least important features are pruned from the current set of features. The procedure is recursively repeated on the pruned set until the desired number of features is eventually reached.

• Random Forest Classifier (RF)⁶. A wrapper-based method that uses a built-in algorithm for feature selection. In particular, variables are selected according to feature importance, obtained by averaging all decision-tree feature importances.

• Logistic Lasso Regression. An embedded-based method which uses the built-in feature selection algorithm embedded in the Logistic Regression with L1 regularization.

• Light-GBM [16] (LGBM)⁷. A wrapper-based method analogous to the above-mentioned RF classifier.

⁶ An RF is an ensemble of Decision Trees generally trained via the bagging method: this approach consists in using the same training algorithm for every predictor, but training them on different random subsets of the train-set. Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors.
⁷ LGBM is a fast, high-performance gradient boosting framework based on decision tree algorithms.
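A minimal sketch of the hard-voting scheme, assuming scikit-learn and LightGBM; the six rankers mirror the list above, while `k` and the hyper-parameters are placeholders:

```python
# Hard-voting feature selection: each of six rankers casts one vote per feature.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from lightgbm import LGBMClassifier

def feature_votes(X: pd.DataFrame, y, k=50):
    votes = pd.DataFrame(index=X.columns)
    # 1) Pearson correlation with the target (filter)
    pearson = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    votes["pearson"] = pearson.rank(ascending=False) <= k
    # 2) Chi-squared (filter; needs non-negative inputs, hence the scaling)
    X_pos = MinMaxScaler().fit_transform(X)
    votes["chi2"] = SelectKBest(chi2, k=k).fit(X_pos, y).get_support()
    # 3) Recursive Feature Elimination around a logistic regression (wrapper)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    votes["rfe"] = rfe.get_support()
    # 4) Random forest impurity importance (wrapper)
    rf = RandomForestClassifier(n_estimators=200).fit(X, y)
    votes["rf"] = pd.Series(rf.feature_importances_, index=X.columns).rank(ascending=False) <= k
    # 5) L1-regularized logistic regression (embedded)
    lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
    votes["lasso"] = abs(lasso.coef_[0]) > 0
    # 6) LightGBM split importance (wrapper)
    lgbm = LGBMClassifier(n_estimators=200).fit(X, y)
    votes["lgbm"] = pd.Series(lgbm.feature_importances_, index=X.columns).rank(ascending=False) <= k
    return votes

# Keep the features collecting the most votes across the six models, e.g.:
# votes = feature_votes(X_train, y_train)
# selected = votes.sum(axis=1).nlargest(50).index
```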
2 Core model

Moving beyond the satellite models described in Section 1.1 and used in the preprocessing phase, in this section we present the core model architecture. It consists of a concatenation of three components: a Boosted Tree default classifier, a probability calibrator and a rating attribution system based on a genetic algorithm.

2.1 Default classifier
In order to leverage the availability of a large-scale dataset, enriched with a high number of features, we developed a robust machine learning approach based on Gradient Boosting decision trees, known as Light-GBM. The Gradient Boosting trees model [17] is one method of combining a group of "weak learners" (specifically decision trees) to form a "strong predictor" model, reducing both variance and bias. Differently from other tree methods like Random Forest, Boosted Trees work by sequentially adding predictors to an ensemble, each one correcting its predecessor by trying to fit the new predictor to the residuals of the previous one. These residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point, evaluated at the current step. Specifically, at each iteration a sub-sample of the training data is drawn at random (without replacement) from the full training dataset. This randomly selected sub-sample is then used in place of the full sample to fit the "weak learner" and compute the model update for the current iteration.

In particular, Light-GBM (LGBM) is a fast, high-performance gradient boosting framework based on decision tree algorithms, which has proved to be highly effective in classification and regression models when applied to tabular, structured data, such as the ones we are dealing with. The model hyper-parameters have been tuned via an out-of-time cross-validation procedure based on a custom extension of the F_β-measure, where the balance between specificity⁸ (also called "true negative rate") and recall⁹ (also called "true positive rate") in the calculation of the harmonic mean is controlled by a coefficient β as follows:

F_β = (1 + β²) · (specificity · recall) / (β² · specificity + recall)    (1)

In this procedure, each test set consists of a single year of future observations, while the corresponding training set is made up of the observations that occurred prior to those that form the test set. In this way, the model is optimized in predicting what will happen in the future using only information available up to the present day. The objective function used for the classification problem was the log-loss, which measures the distance between each predicted probability and the actual class output value by means of a logarithmic penalty. Due to the high unbalance between 0 and 1 target flags, we used a modified, unbalanced log-loss, by setting the scale_pos_weight parameter of the Light-GBM equal to the ratio between the number of 0s and the number of 1s. Other objective functions we tried, like the Focal Loss [18] and custom weighted log-losses, did not give any specific advantage compared to the unbalanced log-loss.

⁸ The specificity is defined as the number of true negatives over the number of true negatives plus the number of false positives.
⁹ The recall is defined as the number of true positives over the number of true positives plus the number of false negatives.
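A minimal sketch of this unbalanced setup, assuming the LightGBM scikit-learn API; hyper-parameter values are placeholders to be tuned by the out-of-time cross-validation with the F_β-measure of Equation (1):

```python
# Unbalanced binary log-loss via scale_pos_weight, plus the custom F_beta metric.
import numpy as np
from lightgbm import LGBMClassifier

def f_beta(specificity, recall, beta=1.0):
    # Harmonic mean of specificity and recall, balance controlled by beta.
    return (1 + beta**2) * specificity * recall / (beta**2 * specificity + recall)

def fit_default_classifier(X_train, y_train):
    n_neg, n_pos = int((y_train == 0).sum()), int((y_train == 1).sum())
    clf = LGBMClassifier(
        objective="binary",              # log-loss objective
        scale_pos_weight=n_neg / n_pos,  # re-weight the rare default class
        n_estimators=1000,               # placeholder hyper-parameters
        learning_rate=0.05,
    )
    clf.fit(X_train, y_train)
    return clf
```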
2.2 PD calibration

A natural extension of the corporate default classification problem consists in predicting the probability of default. Complex non-linear machine learning algorithms can provide poor estimates of the class probabilities, especially when the target variable is highly unbalanced, so that the distribution and behaviour of the probabilities may not reflect the true underlying probability of the sample. The unbalanced log-loss objective chosen for the classification task creates a custom metric in the default probability space that is reflected in distorted class probabilities. The perfect classifier would have only 0 and 1 probabilities, but these would not be able to match historical default rates: they simply represent the probability of belonging to a class with the switch threshold at 0.5, they are not predicted default rates. Fortunately, it is possible to adjust the probability distribution in order to better match the actual distribution observed in the data, without losing predictive power. This adjustment is referred to as calibration. In particular, calibrating a classifier consists in fitting a regressor (known as the calibrator) which maps the output of the classifier f_i to a calibrated probability p(y_i = 1 | f_i) in [0, 1]. In our case the calibrator is a Logistic Regression (LR, whose C parameter has been optimized on the out-of-time sample of the training-set) fitted on the classifier's one-hot encoded leaf assignments, in the spirit of [19].

We first fitted the LGBM on the stratified test-set we left aside for the classification task, as the sample used to train the calibrator should not be used to train the target classifier. We treated the output of each individual tree of the LGBM classifier as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We applied one-hot encoding to obtain dummies indicating leaf assignments, on which the LR model is fitted. Finally, we tested the calibrator on the train-set previously used for the LGBM training phase. This methodology of taking an intermediate result and changing the output from classification to regression is analogous to what is currently known as Transfer Learning in the Deep Neural Network world, where the final neural-net layer is removed and substituted with a novel output. The main advantage of this method is preserving all the inner complex feature engineering that the system learned in the original training task and transferring it to a different problem, in our specific case the prediction of actual default rates.
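A minimal sketch of the leaf-assignment calibration, assuming the LightGBM scikit-learn API (`pred_leaf=True`) and scikit-learn; `fit_calibrator` and `predict_pd` are hypothetical helper names:

```python
# GBM leaves -> one-hot dummies -> logistic regression calibrator.
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

def fit_calibrator(clf, X_cal, y_cal):
    # Index of the leaf each sample falls into, one column per boosted tree.
    leaves = clf.predict(X_cal, pred_leaf=True)
    encoder = OneHotEncoder(handle_unknown="ignore")
    leaves_1hot = encoder.fit_transform(leaves)
    # C would be tuned on the out-of-time sample, as described in the text.
    calibrator = LogisticRegression(C=1.0, max_iter=1000)
    calibrator.fit(leaves_1hot, y_cal)
    return encoder, calibrator

def predict_pd(clf, encoder, calibrator, X):
    leaves_1hot = encoder.transform(clf.predict(X, pred_leaf=True))
    return calibrator.predict_proba(leaves_1hot)[:, 1]  # calibrated PD
```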
2.3 Rating attribution

A robust default classification system, able to meet both supervisory requirements and internal banking usage, provides a way to map the internally calibrated probability of default to a rating system, in which each PD bucket is matched to a rating grade. In order to calibrate our own rating system, the refitted default probability has been split into 9 groups (corresponding to 9 different rating classes) by means of a genetic algorithm known as Differential Evolution [20]. The algorithmic task of calibrating a rating system can be stated as an optimization problem, as it tries to: minimize the Brier Score; maximize the similarities among elements of the same group (the so-called cohesion, i.e. the items in a cluster should be as similar as possible); minimize the dissimilarities between different groups (the so-called separation, i.e. any two clusters should be as distinct as possible in terms of similarity of items); ensure PD monotonicity (i.e. lower default rates have to correspond to low rating grades and vice versa); and obtain an acceptable cluster size (i.e. each cluster has to include a fraction of the total population that is roughly homogeneous among clusters). Among partitioning clustering algorithms, Genetic Algorithms (GA) are stochastic search heuristics inspired by the concepts of Darwinian evolution and genetics. They are based on the idea of creating a population of candidate solutions to an optimization problem, which is iteratively refined by alteration (mutation) and selection of good solutions for the next iteration. Candidate solutions are selected according to a so-called fitness function, which evaluates their quality with respect to the optimization problem. In the case of Differential Evolution (DE) algorithms, the candidate solutions are linear combinations of existing solutions. In the end, the best individual of the population is returned; this individual represents the best solution discovered by the algorithm.
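A minimal sketch of the rating calibration as a differential-evolution problem via scipy; the objective below is a simplified stand-in combining a Brier term with monotonicity and minimum-size penalties, not the authors' exact fitness function:

```python
# Differential evolution over 8 interior PD cut-offs defining 9 rating classes.
import numpy as np
from scipy.optimize import differential_evolution

def make_objective(pd_hat, defaults, n_classes=9, min_frac=0.02):
    def objective(raw):
        bounds = np.sort(raw)                 # 8 interior cut-offs in [0, 1]
        labels = np.digitize(pd_hat, bounds)  # rating class per firm
        brier, sizes, rates = 0.0, [], []
        for k in range(n_classes):
            mask = labels == k
            if mask.sum() == 0:
                return 1e6                    # empty class: infeasible
            class_pd = pd_hat[mask].mean()
            brier += ((class_pd - defaults[mask]) ** 2).sum()
            sizes.append(mask.mean())
            rates.append(defaults[mask].mean())
        penalty = 0.0
        if np.any(np.diff(rates) < 0):        # observed PD must be monotone
            penalty += 1e3
        if min(sizes) < min_frac:             # roughly homogeneous class sizes
            penalty += 1e3
        return brier / len(pd_hat) + penalty
    return objective

# result = differential_evolution(make_objective(pd_hat, defaults),
#                                 bounds=[(0.0, 1.0)] * 8, seed=0)
```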
3 Results

The metric used to evaluate the model performance is the AUROC or AUC-ROC score (Area Under the Receiver Operating Characteristics). In particular, ROC is a probability curve and AUC represents the degree of separability. This measure tells how capable a model is of distinguishing between classes: for an excellent model it is near 1, for a poor model it is near 0. The ROC curve is constructed by evaluating the fraction of "true positives" (tpr or True Positive Rate) and "false positives" (fpr or False Positive Rate) for different threshold values. In detail, tpr, also known as Recall or Sensitivity, is defined in Equation (2) as the number of items correctly identified as positive out of the total true positives:

tpr = TP / (TP + FN)    (2)

where TP is the number of true positives and FN is the number of false negatives. The fpr, also known as Type I Error, is defined in Equation (3) as the number of items wrongly identified as positive out of the total true negatives:

fpr = FP / (FP + TN)    (3)

where FP is the number of false positives and TN is the number of true negatives. Prediction results are then summarized into a confusion matrix, which counts the number of correct and incorrect predictions made by the classifier. A threshold is applied as the cut-off point in probability between the positive and negative classes, which for the default classifier has been set at 0.5. However, a trade-off exists between tpr and fpr, such that changing the classification threshold shifts the balance of predictions towards improving the True Positive Rate at the expense of the False Positive Rate, or vice versa.
The metric used to evaluate the performance of the internally calibrated PD prediction is the Brier score (BS), i.e. a way to verify the accuracy of a probability forecast in terms of distance from the actual results. The most common formulation of the Brier score is the mean squared error:

BS = (1/N) · Σ_{t=1}^{N} (f_t − o_t)²    (4)

in which f_t is the forecast probability, o_t the actual outcome of the event at instance t, and N is the number of forecasting instances. The best possible Brier score is 0, for total accuracy; the worst possible score is 1, which means the forecast was wholly inaccurate. Note that all the metrics described so far have been calculated on the test-set obtained by splitting the dataset along the financial statement year (the train-set spans 2011 to 2016, the test-set covers 2017).
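A minimal sketch of these metrics with scikit-learn's standard implementations; `evaluate` is a hypothetical helper applying Equations (2)-(4):

```python
# AUROC, confusion-matrix rates and Brier score on the out-of-time test-set.
from sklearn.metrics import roc_auc_score, confusion_matrix, brier_score_loss

def evaluate(y_test, pd_forecast, threshold=0.5):
    auroc = roc_auc_score(y_test, pd_forecast)       # area under the ROC curve
    y_pred = (pd_forecast >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    tpr = tp / (tp + fn)                             # Equation (2)
    fpr = fp / (fp + tn)                             # Equation (3)
    bs = brier_score_loss(y_test, pd_forecast)       # Equation (4)
    return {"AUROC": auroc, "tpr": tpr, "fpr": fpr, "Brier": bs}
```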
3.1 Default classifier

We obtained a high performance, corresponding to an AUROC of 95.0% (see Figure 2), summarized in the normalized confusion matrix of Figure 3. In highly unbalanced datasets, the confusion matrix is usually skewed towards predicting well only the majority class, producing unsatisfactory performances on the minority class, even though the misclassification of the minority class is the event businesses usually try hardest to minimize. The distortion in the default probability space and an accurate choice of feature selection and hyperparameters created a system able to effectively discriminate events in the minority class, reducing the occurrence of false negatives.

Figure 2: ROC curve for the Light-GBM classifier (train-set: 2011 to 2016; test-set: 2017).
3.2 PD calibration

Default probability forecasts before and after the refitting procedure are summarized in the calibration plots (also called reliability curves) of Figure 4, which allow checking whether the predicted probabilities produced by the model are well calibrated. Specifically, a calibration plot consists of a line plot of the relative observed frequency (y-axis) versus the predicted probabilities (x-axis)¹⁰. A perfect classifier would produce only 0 and 1 predictions but would not be able to forecast actual default rates. A perfect actual-default-rate model would produce reliability diagrams as close as possible to the main diagonal from the bottom left to the top right of the plot. The refitting procedure maps the perfect classifier to a reliable default rate predictor.

The refitting procedure left the AUROC score of the model unchanged (AUROC = 95.…, BS = 1.…¹¹); classification results are summarized in the normalized confusion matrix reported in Figure 6, where the threshold for the cut-off point between the positive and negative classes has been optimized on the ROC curve of Figure 5.

¹⁰ In detail, the predicted probabilities are divided up into a fixed number of buckets along the x-axis. The number of target events (i.e. the occurrence of 1-year default) is then counted for each bin (i.e. the relative observed frequency). Finally, the counts are normalized and the results are plotted as a line plot.
¹¹ The closer the Brier score is to zero, the better the forecast of default probabilities.

Figure 4: Calibration plots and log-scaled histograms of forecast probability before (a) and after (b) refitting. Accuracy of predicted probabilities is expressed in terms of the log-loss measure.

Figure 5: ROC curve for the calibrated classifier.

Figure 6: Normalized confusion matrix with optimized threshold for the calibrated classifier.
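A minimal sketch of such a reliability curve with scikit-learn; the bin count and binning strategy are illustrative choices:

```python
# Reliability curve: bucketed forecasts vs. observed default frequencies.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_observed, mean_predicted = calibration_curve(y_test, pd_forecast,
                                                  n_bins=10, strategy="quantile")
plt.plot(mean_predicted, frac_observed, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("forecast probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```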
3.3 PD clustering

Among the several common statistical tests that can be performed to validate the assignment of a probability of default to a certain rating grade, two approaches have been used: the one-sided Binomial Test and the Extended Traffic-Light Approach.

The Binomial Test is one of the most popular single-grade, single-period¹² tests performed for rating system validation. For a certain rating grade k ∈ {1, …, K}, K being the number of rating classes, we made the assumption that default events are independent within grade k and can be modelled as a binomially distributed random variable X with size parameter N_k and "success" probability PD_k. Thus, we can assess the correctness of the PD forecast by testing the null hypothesis H₀, where:

• H₀: the actual default rate is less than or equal to the forecast default rate given by the PD.

The null hypothesis H₀ is rejected at a confidence level α in case the number of observed defaults d per rating grade is greater than or equal to the critical value reported in Equation (5):

d_α = min{ d : Σ_{j=d}^{N_k} (N_k choose j) · PD_k^j · (1 − PD_k)^{N_k − j} ≤ 1 − α }    (5)

¹² Usually one year.

The Extended Traffic-Light Approach is a novel technique for default probability validation, first adopted by Tasche (2003) [21]. The implementation used in this section refers to a heuristic approach proposed by Blochwitz et al. (2005) [22], which is based on the estimation of a relative distance between observed default rates and forecast probabilities of default, under the key assumption of binomially distributed default events. Four coloured zones, Green, Yellow, Orange and Red, are established to analyse the deviation between forecasts and actual realizations. In detail: if the result of the validation assessment lies in the Green zone, there is no obvious contradiction between forecast and realized default rate; the Yellow and Orange lights indicate that the realized default rate is not compatible with the PD forecast, but the difference between realized rate and forecast is still in the range of usual statistical fluctuations; finally, the Red traffic light indicates a wrong forecast of the default probability. The boundaries between the aforementioned light zones are summarized in Equation (6):

Green:  p_k < PD_k
Yellow: PD_k ≤ p_k < PD_k + K_y · σ(PD_k, N_k)
Orange: PD_k + K_y · σ(PD_k, N_k) ≤ p_k < PD_k + K · σ(PD_k, N_k)
Red:    PD_k + K · σ(PD_k, N_k) ≤ p_k    (6)

where σ(PD_k, N_k) = √(PD_k (1 − PD_k) / N_k). The parameters K_y and K play a major role in the validation assessment, so they have to be tuned carefully. A proper choice based on practical considerations is setting K_y = 0.84 and K = 1.44, which corresponds to a probability of observing Green of 0.5, Yellow of 0.3, Orange of 0.15 and Red of 0.05.

| Rating class | PD bins (%) | Rating class PD (%) | Out-of-sample default rate (%) | One-sided Binomial Test | Extended Traffic-Light Approach |
|---|---|---|---|---|---|
| AAA | [0.00, 0.05) | 0.03 | 0.00 | Passed | Green |
| AA | [0.05, 0.42) | 0.24 | 0.03 | Passed | Green |
| A | [0.42, 0.55) | 0.48 | 0.08 | Passed | Green |
| BBB | [0.55, 0.74) | 0.64 | 0.21 | Passed | Green |
| BB | [0.74, 1.00) | 0.87 | 0.40 | Passed | Green |
| B | [1.00, 1.42) | 1.21 | 0.83 | Passed | Green |
| CCC | [1.42, 2.12) | 1.77 | 1.29 | Passed | Green |
| CC | [2.12, 9.03) | 5.57 | 5.06 | Passed | Green |
| C | [9.03, 100) | 54.52 | 33.77 | Passed | Green |

Table 1: Internally calibrated PD clustering into 9 rating classes. Despite being borrowed from S&P rating scales, the labels are assigned to a PD calibrated on an internal dataset (the one used during the training phase) and do not correspond to any rating agency's PD.
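A minimal sketch of the two validation tests, assuming scipy; `binomial_test_passed` and `traffic_light` are hypothetical helpers implementing Equations (5) and (6):

```python
# One-sided binomial test and extended traffic-light zones per rating grade.
import numpy as np
from scipy.stats import binom

def binomial_test_passed(n_k, d_observed, pd_k, alpha=0.99):
    # Smallest d with P(X >= d) <= 1 - alpha, X ~ Binomial(n_k, pd_k):
    # P(X >= d) = 1 - CDF(d - 1), so d_alpha = ppf(alpha) + 1.
    d_alpha = binom.ppf(alpha, n_k, pd_k) + 1
    return d_observed < d_alpha

def traffic_light(p_k, pd_k, n_k, k_y=0.84, k_o=1.44):
    # Zone boundaries of Equation (6) with K_y = 0.84 and K = 1.44.
    sigma = np.sqrt(pd_k * (1.0 - pd_k) / n_k)
    if p_k < pd_k:
        return "Green"
    if p_k < pd_k + k_y * sigma:
        return "Yellow"
    if p_k < pd_k + k_o * sigma:
        return "Orange"
    return "Red"
```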
4 Model explainability

Machine learning models which operate in higher dimensions than can be directly visualized by the human mind are often referred to as "black boxes", in the sense that high model performance is often achieved to the detriment of output explainability, leaving users unable to understand the logic behind model predictions. The ever greater attention to model interpretability has led to the development of several methods to provide an explanation of machine learning outputs, both in terms of global and local interpretability. In the first case, the goal is being able to explain and understand model decisions based on conditional interactions between the dependent variable (i.e. the target) and the independent features over the entire dataset. In the latter case, the aim is to understand the model output for a single prediction by looking at a local subregion of the feature space around that instance.

Two popular approaches, described hereafter in this section, are SHAP and LIME, which explore and leverage the property of local explainability to build surrogate models able to interpret the output of any machine learning model. The technique upon which these algorithms are based is slightly tweaking the input and modelling the changes in prediction by means of surrogate agnostic models. In particular, SHAP measures how much each feature in our model contributes, either positively or negatively, to each prediction, in terms of the difference between the actual prediction and its expected value. LIME builds sparse linear models around each prediction to explain how the black-box model works in that local vicinity.
4.1 SHAP

SHAP, which stands for SHapley Additive exPlanation [23], is a novel approach to model explainability which exploits the idea of the Shapley regression value¹³ to model feature influence scoring. SHAP values quantify the magnitude and direction (positive or negative) of a feature's effect on a prediction via an additive feature attribution method. In simple words, SHAP builds model explanations by asking, for each prediction i and feature j, how i changes when j is removed from the model. Since SHAP considers all possible predictions for an instance using all possible feature coalitions, it is firmly rooted in coalitional game theory. The feature values of a data instance act as players¹⁴ in a coalition: Shapley values suggest how to fairly distribute the payout (i.e. the prediction) among the features.

¹³ The technical definition of the Shapley value is the average marginal contribution of a feature value over all possible coalitions.
¹⁴ Note that a player can be an individual feature value or a group of feature values.
SHAP summary plot As reported in Figure 7, it combines feature importance with feature effects to measure the global impact of features on the model. For each feature shown on the y-axis, ordered according to importance, each point on the plot represents the Shapley value (reported along the x-axis) for a given prediction. The colour of each point represents the impact of the feature on the model output, from low (i.e. blue) to high (i.e. red). Overlapping points are jittered in the y-axis direction, so we get a sense of the distribution of the Shapley values per feature.

Figure 7: SHAP summary plot for the Light-GBM classifier. The details of the model's feature descriptions are reported in Appendix A.
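A minimal sketch of how such a plot is produced, assuming the `shap` package and a fitted LightGBM classifier `clf`:

```python
# SHAP values and summary plot for a fitted tree ensemble.
import shap

explainer = shap.TreeExplainer(clf)          # fast SHAP values for tree models
shap_values = explainer.shap_values(X_test)  # one value per sample and feature
shap.summary_plot(shap_values, X_test)       # importance + direction of effect
```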
SHAP dependence plot A scatter plot that shows the effect a single feature has on the model predictions. In particular, each dot represents a single prediction, where the feature value is on the x-axis and its SHAP value, representing how much knowing that feature's value changes the output of the model for that sample's prediction, is on the y-axis. The colour corresponds to a second feature that may have an interaction effect with the plotted feature. If an interaction effect is present, it shows up as a distinct vertical pattern of colouring.
SHAP waterfall plot
The waterfall plot reported in Figure 9 is designed to display how the SHAP values of each feature move the model output from our prior expectation under the background data distribution, E[f(X)], to the final model prediction, f(X), given the evidence of all the features. Features are sorted by the magnitude of their SHAP values, with the smallest-magnitude features grouped together at the bottom of the plot. The colour of each row represents the impact of the feature on the model output, from low (i.e. blue) to high (i.e. red).

Figure 8: SHAP dependence plots for ACTIVITY (8a), cashAndMarketableSecurities (8b), DEBT EQUITY (8c), EBITDA RATIO (8d), netIncome (8e), ROI (8f) and totalInterestExpense (8g). The details of the model's feature descriptions are reported in Appendix A.

Figure 9: SHAP waterfall plot. The details of the model's feature descriptions are reported in Appendix A.

4.2 LIME

LIME, Local Interpretable Model-agnostic Explanations, is a novel technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction [24]. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale, so it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The output of LIME is a list of explanations reflecting the contribution of each feature to the prediction of a data sample, allowing one to determine which feature changes will have the most impact on the prediction. Note that LIME has the desirable property of additivity, i.e. the sum of the individual impacts is equal to the total impact. Results for a prediction are summarized in Figure 10.
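A minimal sketch of a local LIME explanation, assuming the `lime` package and a fitted classifier `clf` exposing `predict_proba`; `X_train` and `X_test` are pandas DataFrames:

```python
# Sparse linear surrogate fitted around a single test instance.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train.values,
                                 feature_names=list(X_train.columns),
                                 class_names=["in-bonis", "default"],
                                 mode="classification")
explanation = explainer.explain_instance(X_test.values[0],
                                         clf.predict_proba,
                                         num_features=10)
print(explanation.as_list())  # per-feature contribution to this prediction
```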
Figure 10: LIME local explanation for a prediction from the Light-GBM classifier. The details of the model's feature descriptions are reported in Appendix A.

Conclusions

Starting from Moody's dataset of historical balance sheets, bankruptcy statuses and macroeconomic variables, we have built three models: a classifier, a default probability model and a rating system. By leveraging modern techniques in both data processing and parameter calibration we have reached state-of-the-art results. The three models show excellent out-of-sample performances, allowing for intensive usage in risk-averse businesses where the occurrence of false negatives can dramatically harm the firm itself. The explainability layers via SHAP and LIME give a set of extra tools to increase confidence in the model and help in understanding the main features determining a specific result. This information can be leveraged by the analyst to understand how to reduce the bankruptcy probability of a specific firm, or to get insight into which balance-sheet fields need to be improved to increase the rating, therefore providing a business instrument to actively manage clients and structured finance deals.
Acknowledgements

We are grateful to Corrado Passera for encouraging our research.
References

[1] P. Mizen and S. Tsoukas, "Forecasting US bond default ratings allowing for previous and initial state dependence in an ordered probit model," International Journal of Forecasting, vol. 28, no. 1, pp. 273–287, 2012.
[2] P. Gurný and M. Gurný, "Comparison of credit scoring models on probability of default estimation for US banks," 2013.
[3] Z. Zhao, S. Xu, B. H. Kang, M. M. J. Kabir, Y. Liu, and R. Wasinger, "Investigation and improvement of multi-layer perceptron neural networks for credit scoring," Expert Systems with Applications, vol. 42, no. 7, pp. 3508–3516, 2015.
[4] A. Petropoulos, V. Siakoulis, E. Stavroulakis, A. Klamargias, et al., "A robust machine learning approach for credit risk analysis of large loan level datasets using deep learning and extreme gradient boosting," Are Post-crisis Statistical Initiatives Completed, vol. 49, pp. 49–49, 2019.
[5] P. M. Addo, D. Guegan, and B. Hassani, "Credit risk analysis using machine and deep learning models," Risks, vol. 6, no. 2, p. 38, 2018.
[6] A. R. Provenzano, D. Trifirò, N. Jean, G. Le Pera, M. Spadaccino, L. Massaron, and C. Nordio, "An artificial intelligence approach to shadow rating," 2019.
[7] Moody's Analytics, "Credit Research Database."
[8] Oxford Economics, "Global Economic Databank."
[9] A. Zheng and A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc., 2018.
[10] C. Guo and F. Berkhahn, "Entity embeddings of categorical variables," arXiv preprint arXiv:1604.06737, 2016.
[11] G. Perani, V. Cirillo, et al., "Matching industry classifications. A method for converting NACE rev. 2 to NACE rev. 1," tech. rep., 2015.
[12] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[13] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[14] Y. Goldberg, "Neural network methods for natural language processing," Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.
[15] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
[16] LightGBM, https://github.com/microsoft/LightGBM
[17] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[19] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., "Practical lessons from predicting clicks on ads at Facebook," in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9, 2014.
[20] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, pp. 341–359, 1997.
[21] D. Tasche, "A traffic lights approach to PD validation," arXiv preprint cond-mat/0305038, 2003.
[22] S. Blochwitz, S. Hohl, and C. Wehn, "Reconsidering ratings," Wilmott Magazine, vol. 5, pp. 60–69, 2005.
[23] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
[24] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?' Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
Appendices

A Model features descriptions

In this section the details of the selected features, on which the model has been trained, are reported.

A.1 Balance-sheet index descriptions
| Code | Definition |
|---|---|
| cashAndMarketableSecurities | Cash and marketable securities. |
| depreciationExpense | The depreciation expense for the current period. |
| ebitda | Earnings before interest, taxes, depreciation and amortization, before extraordinary items. |
| entityConsolidationType | For companies with subsidiaries, … |
| incorporationRegion | The entity's incorporation region. |
| incorporationState | The entity's incorporation province or administrative division where the entity has a legal representation. |
| longTermDebtCurrentMaturities | The current maturities of long-term debt, principal payments due within 12 months. |
| netIncome | Net income is the total period-end earnings. |
| netWorth | Net worth is the sum of all equity items, including retained earnings and other equity. |
| payableToTrade | Accounts payable to regular trade accounts. |
| receivableFromTrade | Accounts receivable from trade. |
| retainedEarnings | Retained earnings. |
| tangibleNetWorth | Defined as the difference between netWorth and totalIntangibleAssets. |
| totalAccountsPayable | The sum of accounts payable. |
| totalAccountsReceivable | The total accounts receivable, net of any provision from loss. |
| totalAmortizationAndDepreciaton | The sum of amortization and depreciation expense for the current period. |
| totalAssets | The total assets of the borrower, which is the sum of the current assets and non-current assets. |
| totalCapital | Total subscribed and share capital. |
| totalCapital_and_totalLiabilities | Defined as the sum of totalLiabilities and totalCapital. |
| totalCurrentAssets | The sum of all current assets. |
| totalCurrentLiabilities | The sum of all current liabilities. |
| totalFixedAssets | Total fixed assets are the Gross Fixed Assets less Accumulated Depreciation. |
| totalIntangibleAssets | Total intangible assets. |
| totalInterestExpense | The total interest expense is any gross interest expense generated from short-term, long-term, subordinated or related debt. |
| totalInventory | The sum of all the inventories. |
| totalLiabilities | The sum of Total Current Liabilities and Total non-current liabilities. |
| totalLongTermDebt | The amount due to financial and other institutions after 12 months. |
| totalOperatingExpense | The sum of all operating expenses. |
| totalOperatingProfit | The Gross Profit less Total Operating Expense. |
| totalProvisions | Total provisions for pensions, taxes, etc. |
| totalSales | Total sales. |
| totalWageExpense | The total wage expense. |
| workingCapital | Defined as the sum of receivableFromTrade, totalAccountsReceivable and totalInventory minus payableToTrade and totalAccountsPayable. |
A.2 KPI descriptions
ACID = (cashAndMarketableSecurities + totalAccountsReceivable) / totalCurrentLiabilities    (7)

ACTIVITY = totalCurrentLiabilities / totalSales    (8)

AGE = (financialStatementDate − incorporationDate) / 365.…

…

… − cashAndMarketableSecurities    (23)

ROA = netIncome / totalAssets    (24)

ROE = netIncome / netWorth    (25)

ROI = totalOperatingProfit / totalAssets    (26)

SHORT-TERM-DEBT EQUITY = totalCurrentLiabilities / netWorth    (27)

A.3 Macro-economic factors descriptions
Country-specific indicators

| Code | Name | Definition |
|---|---|---|
| C | Consumption, private, real | The volume of goods and services consumed by households and non-profit institutions serving households. |
| CD | Durable goods | The volume of real personal consumption expenditures. |
| CREDR | Credit rating, average | The sovereign risk rating, based on the average of the sovereign ratings provided by Moody's, S&P and Fitch. |
| CU | Capacity utilisation | A measure of the extent to which the productive capacity of a business is being used. |
| DOMD | Domestic demand, real | The volume of consumption, investment, stockbuilding and government consumption expressed in local currency and at prices of the country's base year. |
| EE | Employees in employment | Employees in employment. |
| ET | Employment, total | Employment, total. |
| GC | Consumption, government, real | The volume of government spending on goods and services. |
| GDP | GDP, real | The volume of all final goods and services produced within a country in a given period of time. |
| GDPHEAD | GDP per capita, real, US$, constant prices | GDP per capita, real, US$, constant prices. |
| IF | Investment, total fixed investment, real | The volume of investment in tangible and intangible capital goods, including machinery and equipment, software, and construction. |
| IP | Industrial production index | Industrial production index. |
| IPNR | Investment, private sector business, real | The volume of investment in private sector business. |
| IPRD | Investment, private dwellings, real | The volume of investment in private dwellings. |
| IS | Stockbuilding, real | The volume of stocks of outputs that are still held by the units that produced them, and stocks of products acquired from other units that are intended to be used for intermediate consumption or for resale. |
| M | Imports, goods & services, real | The volume of goods and services imports. |
| MG | Imports, goods, real | The volume of goods imports. |
| MS | Imports, services, real | The volume of services imports. |
| PEWFP | GDP, compensation of employees, total, nominal | The values of wages and salaries of employees as a component of GDP. |
| PH | House price index | Index of house prices. |
| POIL$ | Oil price, US$ per toe | Oil price in US$ per tonne of oil equivalent. |
| RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average). |
| RCORP_SPREADEOP | Credit spreads, end of period | The difference in yield between two bonds of similar maturity but different credit quality, expressed as an end-of-period value. |
| RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields. |
| RS | Retail sales volume index, excluding automotive | Volume index for retail sales excluding automotive. |
| RSH | Interest rate, short-term | The 3-month interbank rate. |
| RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of the period. |
| SMEPS | Stockmarket earnings per share | Stockmarket earnings per share, calculated as Stockmarket earnings, LCU * 1000 / Stockmarket shares outstanding. |
| SMP_TR | Share price total return index | Share price total return index. |
| TFE | Total final expenditure, real | The sum of volumes of consumption, investment, stockbuilding, government consumption and exports. |
| U | Unemployment | The total number of people without a job, but actively searching for one. |
| UP | Unemployment rate | The percentage of the labour force that is unemployed at a given date. |
| X | Exports, goods & services, real | The volume of goods and services exports expressed in local currency and at the country's base year. |
| XG | Exports, goods, real | The volume of goods exports expressed in local currency. |
| XS | Exports, services, real | The volume of services exports expressed in local currency. |
Eurozone indicators

| Code | Name | Definition |
|---|---|---|
| PH | House price index | Index of house prices. |
| RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average). |