Modelling Sovereign Credit Ratings: Evaluating the Accuracy and Driving Factors using Machine Learning Techniques
Bart H.L. Overes† (Erasmus University Rotterdam)

Michel van der Wel (Erasmus University Rotterdam, ERIM, Tinbergen Institute, and CREATES)

February 1, 2021
Abstract
Sovereign credit ratings summarize the creditworthiness of countries. These ratings have a large influence on the economy and the yields at which governments can issue new debt. This paper investigates the use of a Multilayer Perceptron (MLP), Classification and Regression Trees (CART), and an Ordered Logit (OL) model for the prediction of sovereign credit ratings. We show that MLP is best suited for predicting sovereign credit ratings, with an accuracy of 68%, followed by CART (59%) and OL (33%). Investigation of the determining factors shows that roughly the same explanatory variables are important in all models, with regulatory quality, GDP per capita and unemployment rate as common important variables. Consistent with economic theory, a higher regulatory quality and/or GDP per capita are associated with a higher credit rating, while a higher unemployment rate is associated with a lower credit rating.
Keywords:
Sovereign Credit Ratings; Machine Learning; Determining Factors; Ordered Logit
JEL classification:
G12, C32.

∗ The authors would like to thank Dick van Dijk, Hüseyin Özturk and Corne Vriends for their comments and suggestions.
† Corresponding author. Mail:
[email protected]

1. Introduction

A sovereign credit rating is an evaluation of the credit risk of a country and gives an indication of the likelihood that the country will be able to make promised payments. These ratings have a large influence on the interest rate at which governments are able to issue new debt, and thereby a big effect on government spending and the government deficit. Sovereign credit ratings are usually given by one of three Credit Rating Agencies (CRAs): Moody's, S&P, and Fitch. These agencies use a combination of objective and subjective factors to determine the rating; unfortunately, however, the exact rating methodology and the determining factors remain unknown. This lack of transparency has resulted in widespread criticism of the CRAs. They have, among other things, been accused of giving biased ratings (Luitel et al., 2016), reacting slowly to changing circumstances (Elkhoury, 2009), and behaving procyclically (Ferri et al., 1999).

An understanding of the rating methodology and the determining factors would be very helpful for governments, investors, and financial institutions. Governments would be able to anticipate possible rating changes, while investors and financial institutions could check if ratings deviate from what the fundamentals of a country imply. In order to get an understanding of the credit rating process, a model is needed that can predict the ratings, ideally with high accuracy. Research has, up until now, mostly focussed on modelling sovereign credit ratings using various forms of the Ordered Probit/Logit (OP/OL) model, which assumes a particular functional form for the relation between a linear combination of the input variables and the continuous output variable, or other related models; see, for example, Cantor and Packer (1996); Dimitrakopoulos and Kolossiatis (2016); Reusens and Croux (2017).
These models allow for easy interpretation of the determining factors and prove to be fairly accurate, but come at the cost that the linear relation they assume might not always hold. (For ease of reference, we will refer to the probit/logit variants simply as linear forms because of the linear relation among variables.) A recent branch of research has therefore focussed on using Machine Learning (ML) techniques to model sovereign credit ratings (Bennell et al., 2006; Ozturk et al., 2015, 2016). Ozturk et al. (2015, 2016) show that ML models outperform linear models on predictive accuracy, sometimes by a large margin. Especially the Multilayer Perceptron (MLP) and Classification and Regression Trees (CART) prove to be well suited for modelling sovereign credit ratings. However, getting an insight into the inner workings of the models and their determining factors is difficult.

This paper focusses on obtaining the determining factors of two ML models used for sovereign credit ratings, MLP and CART, which has, up until now, not been done for ML models.

2. Methodology
In this section, we discuss the methods used in this study, starting with the modelling techniques. Thereafter, we discuss the so-called SHAP values, which allow us to isolate the influence of individual variables in complex models. Lastly, we discuss the methods used to evaluate and compare the accuracy of the different models.
2.1. Models

This section gives an overview of the different models. These models are used to predict the sovereign credit rating, denoted by y_i for observation i, which represents a rating class, with m being the total number of classes. In this research we use Moody's credit ratings. Moody's gives categorical credit ratings ranging from Aaa (highest) to C (lowest), with 19 categories in between. As algorithms in general cannot handle categorical ratings, they are transformed to numeric ratings from 17 (Aaa) to 1 (Caa1 and lower), where all ratings of Caa1 and lower have been grouped in "C combined" because of their infrequent occurrence. The structure of y_i therefore is

    y_i ∈ { Aaa (17), Aa1 (16), ..., C combined (1) },   (1)

with the numerical value corresponding to a rating given in brackets. Thus, a high numerical value corresponds to a high credit rating. The explanatory variables are contained in X_i, and n is the total number of observations. Consistent with the main modelling approaches in the literature that we follow (see, e.g., Dimitrakopoulos and Kolossiatis (2016); Ozturk et al. (2015, 2016)), the panel structure of the credit rating data is not taken into account. The main reason for this is that the ML models used in this study do not support panel data, although there are developments in this area.

Multilayer Perceptron
The Multilayer Perceptron (MLP) is a form of Artificial Neural Network (ANN), which mimics the way that the human brain processes information. MLPs, or similar Neural Network types of algorithms, are often found to perform very well in classification problems involving corporate and sovereign credit ratings; see, for example, Baesens et al. (2003); Lessmann et al. (2015) and Ozturk et al. (2015, 2016). An MLP is able to model non-linearities in the data, and can therefore handle very complex classification problems with heterogeneous groups. However, interpretation of the MLP is extremely difficult and, up to a certain degree, it will always remain a "black box".

The MLP consists of an input layer, an output layer and a certain number of hidden layers in between. The input layer contains a number of neurons equal to the number of explanatory variables; here the set of explanatory variables (X_i) is fed into the model. This layer is followed by a certain number of hidden layers, which contain neurons that get an input signal from all the neurons in the previous layer and process that information in order to generate an output that is passed on to every neuron in the next layer. The output layer is the final layer in the MLP structure and has a number of neurons equal to the number of desired outputs, in this case the probability of belonging to each of the credit rating categories in y_i.

The output for each neuron j in hidden layer i is given by

    h_j^(i) = σ^(i)( z_j^(i) ) = σ^(i)( b_j^(i) + Σ_{k=1}^{n^(i-1)} W_jk^(i-1) · h_k^(i-1) ),   (2)

where b_j^(i) represents the bias term, W_jk^(i-1) gives the weight connecting neuron k from layer i-1 to neuron j in layer i, and n^(i-1) is the total number of neurons in layer i-1. The activation function σ^(i)(z) enables the algorithm to model non-linearities that might be present in the data, and can be varied for each layer (Baesens et al., 2003). We use a Rectified Linear Unit (ReLU) function for the hidden layers, given by

    σ(z) = max(0, z),   (3)

since it is often found to perform best (Ramachandran et al., 2017). For the output layer, we use the Softmax function, given by

    σ(z_j) = e^{z_j} / Σ_{k=1}^{m} e^{z_k}   for j = 1, ..., m and z = (z_1, ..., z_m),   (4)

where m is the number of desired output categories. This activation function gives us, for every country, the probability of belonging to each rating class and is therefore well suited for multiclass classification problems. In the end, the estimate ŷ_i for every country is set to the numerical class for which it has the highest probability.

The MLP is optimized by minimizing the categorical cross-entropy function, given by

    C(y_{i,j}, p̂_{i,j}) = − Σ_{j=1}^{m} Σ_{i=1}^{n} y_{i,j} · ln(p̂_{i,j}),   (5)

where the number of categories in y_i is given by m and the total number of observations by n. The true value of observation i for class j is given by y_{i,j}, which is 1 if observation i belongs to class j and 0 otherwise. The predicted probability that observation i belongs to class j is given by p̂_{i,j}. The algorithm is trained by backward propagation of error information through the network. That is, the partial derivative of the cost function with respect to all weights and biases is determined. Thereafter, the weights connecting all the nodes, and the biases, are adjusted in such a way that the cost is minimized.

The MLP architecture, that is, the number of hidden layers and the number of neurons per hidden layer, is optimized through a grid search. In this grid search, we also determine the optimal dropout rate, which is the fraction of neurons dropped at random to prevent overfitting to the training data.
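The building blocks in Equations (2)-(5) can be sketched in a few lines of NumPy. This is an illustrative re-implementation for clarity, not the Keras code actually used in the paper; function names are our own.

```python
import numpy as np

def relu(z):
    """Hidden-layer activation, Eq. (3): sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Output-layer activation, Eq. (4); shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def layer_forward(h_prev, W, b, activation):
    """One layer's forward pass, Eq. (2): sigma(b + W h_prev)."""
    return activation(b + W @ h_prev)

def cross_entropy(y_onehot, p_hat):
    """Categorical cross-entropy, Eq. (5), for a single observation."""
    return -np.sum(y_onehot * np.log(p_hat))
```

With one hidden ReLU layer and a softmax output over the rating classes, stacking two `layer_forward` calls reproduces the single-hidden-layer architecture described in the text.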
The optimal performance-complexity trade-off for this data set is given by the MLP with 1 hidden layer, 256 neurons and a dropout rate of 0.1. Estimation of this MLP is done using a batch size of 8 and 400 epochs. Details of the grid searches can be found in Appendix A.3. The MLP is implemented in Python's Keras package (Chollet et al., 2015).

Classification and Regression Trees
The idea behind the Classification and Regression Tree (CART) is quite simple: the algorithm finds the optimal splits based on the values of the explanatory variables in order to classify the observations. CARTs have been shown to be well suited for credit rating; see, for example, Moor et al. (2018) and Ozturk et al. (2015, 2016). A few of the advantages of CARTs are that they can handle outliers, do automatic feature selection, and allow for easy interpretation of the model. However, CARTs can be very prone to overfitting.

A CART consists of a root, one or more nodes and several leaves. The first split of the data, based on one of the explanatory variables in X_i, is made at the root. That split leads either to a node, where the remaining data is split further, again based on one of the explanatory variables in X_i, or to a leaf, meaning a decision is made for these observations. Every observation moves through the tree until it ends up at a leaf, which in our case represents one of the different rating categories in y_i.

In this research, we use an algorithm that splits the data in two at every node. The sequential data splits are determined using the Gini method. That is, for each variable the algorithm calculates the weighted average Gini impurity, i.e. how effectively the different categories can be separated based on that variable, using the following formula:

    Gini = Σ_{j=1}^{2} ( (n_j / n_node) Σ_{i=1}^{m} p(i) · (1 − p(i)) ),   (6)

where m is the number of different categories in y_i and p(i) is the probability of picking a data point of class i within that branch of the split. Furthermore, n_j is the number of data points assigned to branch j and n_node gives the total number of data points entering that node. The split that leads to the largest decrease in Gini impurity is used at that node. This means that the CART is greedy, i.e.
it does not take future splits into account when choosing the current one.

CARTs are notorious for overfitting, and therefore sometimes need to be restricted. There are two ways of doing this: restricted growth and pruning. In the case of restricted growth, constraints that limit the growth of the tree in certain ways are implemented, which prevents it from overfitting. With pruning, the tree is left to grow unrestricted and is decreased in size afterwards. Both methods show no improvement in the cross-validated out-of-sample accuracy, and thus an unrestricted CART is used in this study. The details of the CART optimization can be found in Appendix A.4. The CART is implemented in Python's scikit-learn package (Pedregosa et al., 2011).

Ordered Logit
The Ordered Logit (OL) model is, together with the Ordered Probit model, the most frequently used model in the literature (Dimitrakopoulos and Kolossiatis, 2016; Afonso et al., 2011; Reusens and Croux, 2017). It therefore provides a good benchmark for the Machine Learning models, because these more complex models are only useful when they are able to outperform the OL model. As opposed to OLS, the OL model can deal with unequal distances between rating classes and the presence of a top and bottom category. Furthermore, the OL model allows for interpretation and significance testing of the explanatory variables' coefficients, which makes it easy to obtain the determining factors.

A pooled OL model is implemented. Here, the latent continuous variable y*_i has the following specification:

    y*_i = α + X'_i β + ε_i,   (7)

where the intercept is given by α, X_i contains the explanatory variables for data point i, β is a vector containing the coefficients, and the idiosyncratic errors are given by ε_i, which has a standard logistic distribution.

However, rating categories are not continuous, and our continuous variable therefore needs to be transformed into a categorical rating using

    y_i = Aaa (17)          if y*_i ≥ τ_16,
          Aa1 (16)          if τ_16 > y*_i ≥ τ_15,
          ...
          C combined (1)    if τ_1 > y*_i,   (8)

where the boundaries between the different classes are given by τ_j. The OL model is implemented in Python's Mord package (Pedregosa-Izquierdo, 2015).

2.2. SHAP Values

Getting insight into the inner workings of complex models is difficult. Therefore, Lundberg and Lee (2017) came up with a method, called SHAP, to approximate the effects that the individual explanatory variables have on the model outcome.
This method, based on Shapley values (Shapley, 1953), evaluates how model outcomes differ from the baseline by varying all the explanatory variables individually, or in combination with a selection of other explanatory variables, while keeping the others constant.

The basic framework for explaining a model f(x) using SHAP values is the explanation model

    g(x) = φ_0 + Σ_{i=1}^{n_input} φ_i x_i,   (9)

where x is a vector containing all the explanatory variables, φ_0 is the baseline prediction, φ_i is the weight of the i-th explanatory variable in the final prediction, and n_input is the total number of explanatory variables. The explanation model g(x) gives an approximation of the output of the real model f(x) by using a linear combination of the input variables and a baseline prediction. The contribution of each variable x_i to the explanation model g(x) is calculated as

    φ_i(f(x), x) = Σ_{v ⊆ x} [ |v|! (n_input − |v| − 1)! / n_input! ] · ( f_x(v) − f_x(v \ i) ),   (10)

where the first factor gives the number of permutations and the second the contribution of variable i. Here, v ⊆ x represents all the possible v vectors where the non-zero elements are a combination of the non-zero elements in x, f_x(v \ i) is the model output of the original model with the i-th element of v set to zero, and |v| gives the total number of non-zero elements in v (Lundberg and Lee, 2017). The SHAP values are now given by the solution to Equation (10) that satisfies

    f_x(v) = E[ f(v) | v_S ],   (11)

where S represents the set of non-zero indices in v. This constraint ensures that the SHAP values do not violate the consistency and/or local accuracy properties; for more information see Lundberg and Lee (2017). Thus, in the end, we get a specific contribution of each explanatory variable to the prediction for every individual observation considered.
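For a small number of features, Equation (10) can be computed exactly by enumerating all feature subsets. The sketch below is illustrative only: "removing" a feature is simulated by substituting a baseline value, which is one of several conventions, and the SHAP package itself uses far more efficient approximations.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values phi_i (Eq. 10) for model f at point x.
    'Removed' features are set to their baseline value."""
    n = len(x)

    def eval_subset(subset):
        # Model output with only the features in `subset` taken from x.
        v = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(v)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for s in combinations(others, size):
                # Permutation weight |s|! (n - |s| - 1)! / n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += w * (eval_subset(set(s) | {i}) - eval_subset(set(s)))
        phi.append(total)
    return phi
```

By the local accuracy property, the resulting values sum to the difference between the model output at x and the baseline prediction.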
As the OL model allows for easy interpretation through its coefficients, this method is only used for the MLP and CART. For the SHAP values, Python's SHAP package is used (Lundberg and Lee, 2017).

2.3. Accuracy Evaluation

Following common practice in the literature (Ozturk et al., 2015; Reusens and Croux, 2017), for each model we determine what percentage of the predictions is exactly right, 1 or 2 notch(es) too high, and 1 or 2 notch(es) too low, where a credit rating prediction is said to be u notch(es) too low (high) if the predicted class is u class(es) below (above) the actual rating class.

Predictions are made using random-split 10-fold cross-validation. That is, the data is split into 10 approximately equal subsets, of which 9 are used to train the model, while the subset that was left out is used for evaluation of the out-of-sample predictive accuracy. By rotating the 10 folds, we obtain the out-of-sample accuracy of the model on the entire data set. We use the averages of 100 replications of this procedure, each time using different 10-fold data splits, thus making sure that results are not dependent on one specific random split.
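The evaluation procedure above can be sketched as follows; the fold construction and notch-based summary follow the description in the text, with illustrative function names of our own.

```python
import random

def random_ten_folds(n_obs, seed=0):
    """Split observation indices into 10 approximately equal random folds."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[k::10] for k in range(10)]

def notch_summary(y_true, y_pred):
    """Share of predictions exactly right, within 1 and 2 notches, plus the MAE."""
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    n = len(diffs)
    return {
        "exact":    sum(d == 0 for d in diffs) / n,
        "within_1": sum(abs(d) <= 1 for d in diffs) / n,
        "within_2": sum(abs(d) <= 2 for d in diffs) / n,
        "mae":      sum(abs(d) for d in diffs) / n,
    }
```

Pooling the out-of-sample predictions over the 10 folds and averaging `notch_summary` over 100 reshuffled replications yields accuracy measures of the kind reported in Table 2.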
3. Data
We use Moody's sovereign credit ratings for a variety of 62 developed and developing countries, among which Brazil, Canada, Morocco and Thailand, from 2001 to 2019. (The ratings are obtained from countryeconomy.com.) A histogram of the ratings with their alphabetical and numerical rating is shown in Figure 1. In this figure, we see that the data set contains a good mixture of the different categories; however, class 17 (Aaa) is, with 286 observations, significantly overrepresented. This is due to the fact that most of the Aaa countries stayed in this category throughout the entire period. There is therefore a trade-off between having enough different countries with an Aaa rating to train on and making sure the share of Aaa ratings does not become too large.

The explanatory variables are: unemployment rate (measured in %, with expected sign –), government balance (% of GDP, +), current account balance (% of GDP, +), inflation as measured by CPI (%, –), GDP per capita ($, +), government debt (% of GDP, –), GDP growth (annual %, +), regulatory quality index (+), and political stability and absence of violence/terrorism index (+). The variables up to and including GDP growth are economic & fiscal indicators, while the latter two are measures of governance. The regulatory quality index measures perceptions of the government's ability to formulate and implement policies and regulations that permit and promote private sector development, while the political stability and absence of violence/terrorism index captures perceptions of the likelihood of political instability and/or politically-motivated violence. Both these variables have values ranging from approximately -2.5 to 2.5, where a higher score indicates better regulatory quality or higher political stability.
The values of all the explanatory variables for year t are used to model Moody's sovereign credit ratings as of January 1 of year t + 1. (The explanatory variables are obtained from the International Monetary Fund and the World Bank.)

This set of explanatory variables represents three main factors in the credit rating process: the strength of the economy, the level of debt, and the willingness to repay. A strong economy is expected to be better capable of repaying its debt and preventing the debt burden from getting out of control. Typical for strong economies are a low unemployment rate, low (though not negative) and stable inflation, a high GDP per capita, and a high GDP growth. The debt position of a country is given by the government debt, and the government balance shows if the total debt sum (in $) is increasing or decreasing. Finally, regulatory quality and political stability can give an indication of the willingness of a country to repay its debt, but also of the economic climate for the private sector in a country.

The descriptive statistics for the explanatory variables are shown in Table 1. This table reports the median, mean, standard deviation, and 1% & 99% percentiles of all variables. We immediately observe that the average country experienced an economic expansion during this period, as can be seen from the positive average GDP growth of 3.2%. Furthermore, we observe that the average unemployment rate for the period is 7.9%, but very low unemployment rates of below 2.0% are also observed, for example for Thailand and Singapore. The average government debt is 55.0% for the period, with a small number of countries, such as Japan and Venezuela, having a debt of over 180%, which can be considered extremely large. Lastly, the gap between the very rich and very poor countries is large, with some of the rich countries (Luxembourg, Norway) having a GDP per capita that is up to 80 times higher than some of the poor countries (Honduras, Pakistan).
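The alignment of year-t explanatory variables with the rating as of January 1 of year t + 1 can be sketched as below. The dictionary-based layout and function name are illustrative assumptions, not the paper's actual data pipeline.

```python
def build_samples(features, ratings):
    """Pair the explanatory variables of year t with the numeric rating
    as of January 1 of year t + 1.

    features: dict mapping (country, year) -> list of explanatory variables
    ratings:  dict mapping (country, year) -> numeric rating (1-17)
    """
    X, y = [], []
    for (country, year), x in sorted(features.items()):
        target = ratings.get((country, year + 1))
        if target is not None:  # drop observations without a next-year rating
            X.append(x)
            y.append(target)
    return X, y
```

Observations for which no rating is available in year t + 1 simply fall out of the sample.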
4. Results
In this section, we present the results obtained in this study. First, we discuss the accuracies of the different models when evaluated using cross-validation. Second, the determining factors for each model are analysed individually and compared to those of the other models.
The accuracies of the MLP, CART, and OL, determined using 100 replications of 10-fold random cross-validation, are shown in Table 2. In this table, for every model, we present the percentage of predictions exactly right, 1 or 2 notch(es) too high or too low, the number of predictions correct within 1 and 2 notch(es), and the Mean Absolute Error (MAE).

The MLP performs best, with an accuracy of 68.3% and 85.7% of predictions correct within 1 notch. MLP significantly outperforms CART and OL, with respective accuracies of 58.6% and 33.1%, at the 99% significance level. CART significantly outperforms the OL model and is, based on performance, much closer to the MLP than to the OL model. These results confirm earlier findings that MLP and CART outperform linear models based on accuracy; see, for example, Bennell et al. (2006); Moor et al. (2018); Ozturk et al. (2015, 2016).

A nice symmetry in over- and underrating is observed for all models. This shows that none of the models has a tendency to consistently rate higher or lower than Moody's. Additional related results, available upon request, show that no country is persistently under- or overrated by MLP and CART. OL, on the other hand, does have that tendency and, for example, continuously underrates France and Belgium, and overrates Bulgaria and Cyprus compared to Moody's.

There are multiple possible causes for the relatively large difference in accuracy between the ML techniques and the OL model. First, ML techniques are able to pick up on non-linear relations, where the OL model with its assumption of linear relations cannot. Research has shown that there are non-linear effects in the sovereign credit rating process, so assuming linear relations is likely to harm performance (Reusens and Croux, 2016). Second, the ML techniques have more modelling freedom to pick up on subjective factors of the CRAs, which Moor et al. (2018) show to be especially large for low-rated countries.
In order to get an insight into the sovereign credit ratings, we analyse the determining factors for every model. We obtain the determining factors of the MLP and CART by using SHAP values, as discussed in Section 2.2, and those of the OL model by looking at each variable's coefficients and their significance.

Table 2: Averages of 100 replications of 10-fold cross-validated predictions for MLP, CART, and OL. All numbers, except for MAE, given in %. The first five columns give the correct-prediction percentage.

        2 below  1 below  Exact  1 above  2 above  Within 1  Within 2  MAE
MLP     3.9      8.4      68.3   9.0      3.6      85.7      93.2      0.64
CART    5.6      8.7      58.6   9.1      5.2      76.4      87.2      1.00
OL      9.8      10.3     33.1   13.2     10.3     56.6      76.7      1.60
Multilayer Perceptron
SHAP values are calculated for every variable used in the MLP to isolate their effects, and are shown in Figure 2. We immediately observe clear patterns for regulatory quality and GDP per capita, the most important and second most important variable respectively. A higher value for either variable is associated with an increase in the credit rating, which is in line with economic theory. The importance of regulatory quality is perhaps surprising, since one would expect financial indicators to be most important in an assessment of credit risk. However, regulatory quality might be the best indicator of the economic climate for the private sector in a country, which in turn might be the most relevant factor in separating creditworthy from non-creditworthy countries. Furthermore, regulatory quality also gives an indication of the willingness to pay. That GDP per capita turns out to be an important factor in the credit rating process is not unexpected, since it is a good measure of the relative size of the economy and wealth of a country, and it has proven to be important in previous studies (Bissoondoyal-Bheenick, 2005; Gaillard, 2009; Afonso et al., 2011).

The next variable, current account balance, shows a positive influence on the credit rating when its value is either relatively low or relatively high, and a negative influence on the credit rating for an average value. This non-linear relation is also visible in the data, as stronger economies are more towards the extremes. The Netherlands and Germany, for example, have a very high current account balance, while that of the United Kingdom and Australia is very low. Current account balance is directly followed by government debt, where a higher debt is associated with a lower rating, which is in line with economic intuition. Political stability and unemployment rate, ranking 5th and 6th, also show a pattern, although less pronounced than the previously discussed variables.
Here, a higher political stability and/or a lower unemployment rate are associated with an increase in the credit rating, and vice versa.

The three least important variables, being government balance, GDP growth, and inflation, show no clear effect. Inflation even seems to have no influence at all. The relative unimportance of these three factors is quite sensible. A negative government balance is generally a bad sign, because it increases the government debt. However, as previously discussed, a higher rating leads to a lower interest rate and therefore less inclination to keep the debt low. Government balance is thus not such a helpful factor in the credit rating process. GDP growth and inflation do not lend themselves very well for distinction between creditworthy and non-creditworthy countries. In the case of GDP growth, we observe that lower rated countries have on average a higher GDP growth, but a lower cumulative GDP growth over long periods, which in the end determines the long-term growth of the economy. While hyperinflation is obviously a bad sign, and should lead to a low rating, inflation offers no clear guidance for the other values.

Fig. 2. SHAP values plot for the MLP; explanatory variables are ranked from highest mean absolute SHAP value (regulatory quality) to lowest (inflation). Individual dots represent data points that have been evaluated, where the color indicates whether the value is relatively high (red) or low (blue) for that explanatory variable. The x-axis shows the impact of the particular feature on the prediction, i.e. the number of notches the prediction deviates from the baseline prediction when that feature is included.

Classification and Regression Trees
The same procedure as for the MLP is repeated to isolate the influence of variables in the CART; the plot containing SHAP values for the CART is shown in Figure 3. Furthermore, to facilitate comparison with the MLP, Table 3 shows the explanatory variables ranked on importance for every model.

Fig. 3. SHAP values plot for the CART; explanatory variables are ranked from highest mean absolute SHAP value (regulatory quality) to lowest (GDP growth). Individual dots represent data points that have been evaluated, where the color indicates whether the value is relatively high (red) or low (blue) for that explanatory variable. The x-axis shows the impact of the particular feature on the prediction, i.e. the number of notches the prediction deviates from the baseline prediction when that feature is included.

Table 3: Ranking of the variables based on influence in the predictions of MLP, CART, and OL. The higher the rank, the more important the variable, with 1 being the most influential variable. As gov. debt is excluded from the OL model, no rank can be assigned to that variable; however, this could be interpreted as ranking last.

                     MLP  CART  OL
Regulatory quality   1    1     1
GDP per capita       2    2     5
Current acc.         3    4     7
Gov. debt            4    5     -
Political stability  5    6     6
Unemployment rate    6    3     2
Gov. balance         7    8     8
GDP growth           8    9     3
Inflation            9    7     4

In the CART, similar to the MLP, regulatory quality and GDP per capita are the most important and second most important variables respectively. As expected, a higher regulatory quality and a higher GDP per capita are both associated with a higher credit rating.

In the CART, just as in the MLP, unemployment rate, current account balance, government debt, and political stability rank 3rd to 6th, although the exact order differs. The unemployment rate shows a clear relation to the credit rating, where a higher unemployment rate is associated with a lower credit rating.
The 4th most important variable, current account balance, shows the same non-linear behaviour as in the MLP, with average values resulting in a lower credit rating, and the extremes in a higher one. However, in this case, the effect is much less pronounced than in the MLP. Government debt proves to have a negative effect on the credit rating, which is in line with economic theory. In contrast to the MLP, political stability shows no clear relation to the credit rating in the CART.

The three least important variables in the CART match those of the MLP. Inflation, government balance, and GDP growth show no distinct relation to the credit rating.

Ordered Logit
Extracting the determining factors of the OL model is relatively simple, as the model only uses linear relations. The significance of the coefficients of the different variables, combined with their sign, tells us how important a variable is and whether the relation to the credit rating is positive or negative. The estimated coefficients, together with their standard errors and p-values, are shown in Table 4. No coefficient for government debt is estimated, because excluding government debt from the model results in a higher cross-validated accuracy. The ranking of importance for all the variables in the OL model is also shown in Table 3, together with those of the other models.

Again, regulatory quality proves to be the most important variable, where the positive sign of the coefficient shows that the relation is positive, as was the case for the MLP and CART. That regulatory quality is most important in all models is a strong indication that it is a very important factor in the credit rating process. The second most important variable in the OL model is unemployment rate, where a higher unemployment rate is associated with a lower rating. The unemployment rate is followed by GDP growth and inflation, which ranked very low in the other two models, and in the case of GDP growth, the sign is counter to expectation. However, as previously discussed, lower rated countries have a higher GDP growth on average, just not a higher cumulative growth. The OL model, with its linear relations, therefore finds a negative relation between GDP growth and the credit rating. GDP per capita ranks 5th in the OL model, where it ranked 2nd in the other two models. The sign does match expectations, with a higher GDP per capita associated with a higher credit rating. The next variable is political stability; here the three models more or less agree on the importance of the factor.
However, the sign of the coefficient in the OL model is counter to economic theory, which could be an effect of the inclusion of many variables in a linear setting. The last two variables, current account balance and government balance, have a positive influence on the credit rating, which is in line with economic theory. That current account balance is relatively unimportant in the OL model compared to the other two makes sense, as the OL model cannot pick up on the non-linear relation that the other models find. The rank of government balance in the OL model is similar to that of the other two models. While the coefficients of current account balance and government balance are insignificant, they do contribute to the cross-validated accuracy and are therefore included in the model.

The only factor that does not contribute to a higher cross-validated accuracy is government debt. This is surprising, since government debt is commonly assumed to be a very important factor in the creditworthiness of a country: countries that already have a lot of debt might be less able to repay new debt. It is nonetheless not a clear distinguishing factor on its own. There are also Aaa rated countries that have a lot of debt, since the low interest rates they pay give them less incentive to keep debt low. Government debt therefore seems to be influential only when taking other factors into account at the same time, and is thus not useful for the OL model. These results confirm earlier findings that government debt is a useful variable to split data on, but not necessarily useful in a regression model, see, for example, Bozic and Magazzino (2013); Reusens and Croux (2016).

Table 4: Coefficients, standard errors and p-values for the Ordered Logit model.

                                Coefficient    S.E.      p-value
    GDP growth (%)                 -0.0012     0.0000    0.0000
    Inflation (%)                  -0.0467     0.0118    0.0001
    Unemployment rate (%)          -0.1082     0.0014    0.0000
    Current acc. (% of GDP)         0.0124     0.0182    0.4951
    Gov. balance (% of GDP)         0.0409     0.1200    0.7331
    Political stability            -0.2623     0.1567    0.0941
    Regulatory quality              3.6001     0.0045    0.0000
    GDP per capita (1000 $)         0.0337     0.0182    0.0642

The large similarities in determining factors, especially between CART and MLP, are surprising, since the modelling techniques are quite different. This makes it more likely that some of the variables found to be important in this study, such as regulatory quality, have a large influence on the credit rating.
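To make the mechanics of the OL model concrete, the sketch below maps the Table 4 coefficients to rating-category probabilities via cumulative logistic functions. The cutpoints are not reported in the paper, so the thresholds used in any call are purely illustrative placeholders, and the variable ordering and units follow Table 4.

```python
import math

# Coefficients from Table 4 (OL model). The cutpoints tau_k are NOT
# reported in the paper and must be supplied as illustrative values.
COEFS = {
    "gdp_growth": -0.0012,           # %
    "inflation": -0.0467,            # %
    "unemployment": -0.1082,         # %
    "current_account": 0.0124,       # % of GDP
    "gov_balance": 0.0409,           # % of GDP
    "political_stability": -0.2623,
    "regulatory_quality": 3.6001,
    "gdp_per_capita": 0.0337,        # 1000 $
}

def linear_predictor(x):
    """Linear index x'beta of the ordered logit model."""
    return sum(COEFS[k] * x[k] for k in COEFS)

def category_probs(x, cutpoints):
    """Category probabilities from cumulative logistic probabilities.

    P(y <= k) = 1 / (1 + exp(-(tau_k - x'beta))); each category
    probability is the difference of consecutive cumulative probabilities.
    `cutpoints` must be in ascending order.
    """
    eta = linear_predictor(x)
    cdf = [1.0 / (1.0 + math.exp(-(tau - eta))) for tau in cutpoints]
    cdf = [0.0] + cdf + [1.0]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]
```

With k cutpoints the model produces k + 1 rating categories, and the probabilities sum to one by construction, which is the sense in which the OL model is restricted to a single monotone linear index.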
5. Conclusion
This paper investigates the use of two Machine Learning techniques, Multilayer Perceptron (MLP) and Classification and Regression Trees (CART), and an Ordered Logit (OL) model, for the prediction of sovereign credit ratings. MLP proves to be most suited for predicting Moody's ratings based on macroeconomic variables. Using random 10-fold cross-validation it reaches an accuracy of 68%, and predicts 86% of ratings correctly within one notch. Thereby, it significantly outperforms CART and OL, with their respective accuracies of 59% and 33%.

Investigation of the determining factors, which has so far not been done for Machine Learning models in the sovereign credit rating setting, shows that there are common influential factors across the models. Regulatory quality and GDP per capita are respectively the most important and second most important factors in the MLP and CART, with, as expected, a positive relation between both variables and the predicted credit rating. This behaviour is also reflected by the signs of the respective coefficients in the OL model. Other, slightly less influential, variables are: current account balance, government debt, political stability and unemployment rate. The behaviour of MLP and CART with respect to most of these variables is similar. A higher government debt and unemployment rate are associated with a lower credit rating, and for both models an average current account balance leads to a lower rating while a relatively low or high value leads to a higher credit rating. The models differ on the interpretation of political stability. In the MLP, a higher value for political stability leads to a higher credit rating, but there is no clear relation in the CART. Most of the previously mentioned effects are also observed in the signs of the OL coefficients.
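The "correct within one notch" measure used above is simply exact accuracy relaxed by a tolerance on the numeric rating scale. A minimal sketch, assuming ratings have already been converted to the paper's 1-17 numeric scale:

```python
def notch_accuracy(y_true, y_pred, notches=0):
    """Share of predictions within `notches` of the true numeric rating.

    notches=0 gives exact accuracy; notches=1 gives the
    'correct within one notch' measure.
    """
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    hits = sum(abs(t - p) <= notches for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

For example, with true ratings [17, 16, 10] and predictions [17, 14, 11], exact accuracy is 1/3 while within-one-notch accuracy is 2/3, since the third prediction misses by only one notch.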
However, the signs of GDP growth and political stability are in contrast to economic theory, with a higher GDP growth and/or political stability associated with a lower credit rating, possibly due to the inclusion of all variables jointly in the restrictive linear setting.

In short, we advise governments wanting to check their rating, or investors deliberating an investment, to use an MLP model, as this model proves to be most accurate. Sovereign credit ratings are heavily influenced by the regulatory quality and GDP per capita of a country. Expected changes in either of these factors could thus result in a credit rating change. Anticipating this possible change can be very valuable, as the credit rating has a major influence on the interest rate at which governments can issue new debt, and thus on government budgets.

We end this paper with a few recommendations for future research. First, the determining factors of other Machine Learning techniques, such as Support Vector Machines (SVM), Naive Bayes (NB), and Bayes Net (BN), in the sovereign credit rating setting could be investigated. Second, the inclusion of more explanatory variables might increase the accuracy of some methods (most likely CART) and might lead to more insights into the relevant variables.
References
Afonso, A., Gomes, P., and Rother, P. (2011). Short- and long-run determinants of sovereign debt credit ratings. International Journal of Finance & Economics, 16(1):1-15.

Baesens, B., Gestel, T. V., Viaene, S., Stepanova, M., Suykens, J., and Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6):627-635.

Bennell, J. A., Crabbe, D., Thomas, S., and ap Gwilym, O. (2006). Modelling sovereign credit ratings: Neural networks versus ordered probit. Expert Systems with Applications, 30(3):415-425.

Bissoondoyal-Bheenick, E. (2005). An analysis of the determinants of sovereign ratings. Global Finance Journal, 15(3):251-280.

Bozic, V. and Magazzino, C. (2013). Credit rating agencies: The importance of fundamentals in the assessment of sovereign ratings. Economic Analysis and Policy, 43(2):157-176.

Cantor, R. and Packer, F. (1996). Determinants and impact of sovereign credit ratings. Economic Policy Review, 2(2).

Chollet, F. et al. (2015). Keras. https://keras.io.

Dimitrakopoulos, S. and Kolossiatis, M. (2016). State dependence and stickiness of sovereign credit ratings: Evidence from a panel of countries. Journal of Applied Econometrics, 31(6):1065-1082.

Elkhoury, M. (2009). Credit rating agencies and their potential impact on developing countries. UNCTAD Compendium on Debt Sustainability, pages 165-180.

Ferri, G., Liu, L.-G., and Stiglitz, J. E. (1999). The procyclical role of rating agencies: Evidence from the East Asian crisis. Economic Notes, 28(3):335-355.

Gaillard, N. (2009). The determinants of Moody's sub-sovereign ratings. International Research Journal of Finance and Economics, 31(1):194-209.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836.

Lessmann, S., Baesens, B., Seow, H.-V., and Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1):124-136.

Luitel, P., Vanpée, R., and Moor, L. D. (2016). Pernicious effects: How the credit rating agencies disadvantage emerging markets. Research in International Business and Finance, 38:286-298.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765-4774. Curran Associates, Inc.

Moor, L. D., Luitel, P., Sercu, P., and Vanpée, R. (2018). Subjectivity in sovereign credit ratings. Journal of Banking & Finance, 88:366-392.

Ozturk, H. (2014). The origin of bias in sovereign credit ratings: Reconciling agency views with institutional quality. The Journal of Developing Areas, 48(4):161-188.

Ozturk, H., Namli, E., and Erdal, H. I. (2015). Reducing overreliance on sovereign credit ratings: Which model serves better? Computational Economics, 48:59-81.

Ozturk, H., Namli, E., and Erdal, H. I. (2016). Modelling sovereign credit ratings: The accuracy of models in a heterogeneous sample. Economic Modelling, 54:469-478.

Panchal, G., Ganatra, A., Kosta, Y., and Panchal, D. (2011). Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers. International Journal of Computer Theory and Engineering, 3(2):332-337.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Pedregosa-Izquierdo, F. (2015). Feature extraction and supervised learning on fMRI: from practice to theory. Theses, Université Pierre et Marie Curie - Paris VI.

Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. CoRR, abs/1710.05941.

Reusens, P. and Croux, C. (2016). Sovereign credit rating determinants: the impact of the European debt crisis. Available at SSRN 2777491.

Reusens, P. and Croux, C. (2017). Sovereign credit rating determinants: A comparison before and after the European debt crisis. Journal of Banking & Finance, 77:108-121.

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307-317.
Appendix A
A.1. List of countries
Countries included in the data set: Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, China, Colombia, Costa Rica, Cyprus, Czech Republic, Denmark, Dominican Republic, El Salvador, Fiji Islands, Finland, France, Germany, Greece, Honduras, Hungary, Iceland, Indonesia, Ireland, Israel, Italy, Japan, Jordan, Korea, Latvia, Lithuania, Luxembourg, Malaysia, Malta, Mauritius, Mexico, Moldova, Morocco, Netherlands, New Zealand, Norway, Pakistan, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Saudi Arabia, Singapore, Slovenia, South Africa, Spain, Sweden, Switzerland, Thailand, Tunisia, United Kingdom, Venezuela.

Table 5: Conversion of Moody's ratings into numeric ratings.

    Moody's rating    Numeric rating
    Aaa               17
    Aa1               16
    Aa2               15
    Aa3               14
    A1                13
    A2                12
    A3                11
    Baa1              10
    Baa2               9
    Baa3               8
    Ba1                7
    Ba2                6
    Ba3                5
    B1                 4
    B2                 3
    B3                 2
    Caa1               1
    Caa2               1
    Caa3               1
    Ca                 1
    C                  1
A.2. Rating transformations
Table 5 shows the transformation of all Moody’s ratings into numerical ratings.
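The Table 5 conversion can be expressed as a simple lookup: the sixteen ratings from Aaa down to B3 map to 17 through 2, and everything from Caa1 downward is pooled into class 1. A minimal sketch:

```python
# Numeric conversion of Moody's ratings following Table 5: Aaa = 17
# down to B3 = 2, with Caa1 and below pooled into class 1.
MOODYS_SCALE = ["Aaa", "Aa1", "Aa2", "Aa3", "A1", "A2", "A3",
                "Baa1", "Baa2", "Baa3", "Ba1", "Ba2", "Ba3",
                "B1", "B2", "B3"]

RATING_TO_NUMERIC = {r: 17 - i for i, r in enumerate(MOODYS_SCALE)}
# The five lowest ratings share a single numeric class.
RATING_TO_NUMERIC.update({r: 1 for r in ["Caa1", "Caa2", "Caa3", "Ca", "C"]})
```

Pooling the lowest ratings into one class keeps every numeric class populated, which matters for classification models trained on a data set of this size.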
A.3. MLP optimization
There are basically two ways of optimizing model hyperparameters: grid search and Bayesian model-based optimization. While the Bayesian methods are more likely to find the optimal setting, they give no insight into the different performance-complexity trade-offs. As that trade-off is important in this study, since interpretation suffers for more complex models, a grid search is used to optimize the MLP.

In this grid search, five hyperparameters are optimized: the number of hidden layers, the number of neurons per hidden layer, the dropout rate, the number of epochs and the batch size. The number of hidden layers and the number of neurons per hidden layer, as previously explained, determine the structure of the MLP. The dropout rate gives the fraction of neurons that is dropped from the model at random. Randomly dropping neurons from the model prevents overfitting, as an overfitted model would perform very poorly when neurons are left out. The number of epochs and the batch size determine how the internal model parameters are estimated. When an MLP is trained, it updates the parameters after working through a number of data points. That is, the internal parameters are not updated after evaluating every individual data point, but after evaluating a certain number of data points: a batch. A larger batch size thus means that the algorithm evaluates more data points before updating the parameters, and vice versa for a small batch size. The number of epochs determines how many times the algorithm goes through the entire data set; especially for smaller data sets this number can be very large, often hundreds or thousands.

Setting up a full grid, where all the different combinations are tested, is computationally extremely expensive, since the number of possible combinations becomes very large. We have therefore opted for two separate grid searches. First, one where the optimal structure is investigated: hidden layers, neurons and dropout rate.
Thereafter, a second grid search in which the estimation settings for the optimal structure found in the first grid search are analysed: epochs and batch size.

In the first grid search the following parameters are considered: hidden layers [1, 2, 3], neurons [8, 16, 32, 64, 128, 256, 512] and dropout rate [0, 0.1, 0.2], using a batch size of 8 and 400 epochs. In general, one hidden layer suffices; only in cases where there are discontinuities in the data is more than one hidden layer required (Panchal et al., 2011). Therefore, the grid search is limited to three hidden layers, to make sure that additional hidden layers do not improve performance. There are rules of thumb for selecting the number of neurons, such as that it should be between the size of the input and the size of the output layer. However, deviating from these rules often results in drastically improved performance. Even though 512 neurons seems excessively large, and is unlikely to improve performance compared to lower numbers, it is still evaluated to verify that no performance increase is obtained. The dropout parameters are set in such a way that we can see if dropout is needed, or if dropping a significant, but not too large, fraction of the neurons improves performance.

Thereafter, for the optimal structure, we investigate the estimation parameters using: batch size [4, 8, 16, 32] and epochs [100, 200, 400, 800]. Keskar et al. (2016) show that the batch size should be much smaller than the total number of data points in the set, and that using a large batch size decreases the ability of the model to generalize. For these reasons, we have decided to set an upper bound of 32 on the batch size. There are no clear guidelines for the optimal number of epochs, as this is highly dependent on the data set. The number of epochs is thus increased until the performance of the MLP stops improving. If optimal performance in this grid is found at 800 epochs, the use of an even higher number of epochs is investigated.
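The two-stage search described above can be sketched as an exhaustive sweep over the Cartesian product of the grids. The `evaluate` callable stands in for the cross-validated accuracy of a fitted MLP; it is a placeholder, not the paper's actual Keras training routine.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Return the best setting and score over the Cartesian product
    of the grid values; `evaluate` maps a setting dict to a score."""
    names = sorted(grid)
    best_setting, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        setting = dict(zip(names, values))
        score = evaluate(setting)
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score

# Stage 1: structure (batch size 8 and 400 epochs held fixed).
structure_grid = {"hidden_layers": [1, 2, 3],
                  "neurons": [8, 16, 32, 64, 128, 256, 512],
                  "dropout": [0.0, 0.1, 0.2]}

# Stage 2: estimation, run only on the structure chosen in stage 1.
estimation_grid = {"batch_size": [4, 8, 16, 32],
                   "epochs": [100, 200, 400, 800]}
```

Splitting the search this way reduces the number of model fits from the full product of all five grids (over a thousand combinations) to 63 structure fits plus 16 estimation fits, which is what makes the grid search tractable here.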
The results of the two grid searches are shown in Tables 6 and 7. The optimal performance-complexity trade-off is in our view given by the MLP with 1 hidden layer, 256 neurons, and a dropout rate of 0.1. Even though the MLP with 2 hidden layers, 256 neurons, and a dropout rate of 0.2 has a slightly higher accuracy, we deem the increase in accuracy too small to justify the addition of a hidden layer. The results of the estimation grid search show that performance increases with more epochs, but levels off at about 200 epochs. Since we would rather be on the safe side, we opted for 400 epochs. There is very little variation in performance between the different batch sizes, although combinations of a low number of epochs with large batches perform poorly. We have therefore decided to use a batch size of 8, for which the MLP structure was optimized.
A.4. CART optimization
There are two ways in which a CART can be restricted: restricted growth and pruning. When restricting the growth of the CART, we limit the growth of the tree a priori, while with pruning we allow the tree to grow unobstructed but cut off branches afterwards.

Limiting the growth of the CART can be done in multiple ways. In this study, we optimize the following settings: maximum depth, minimum samples for a split and minimum impurity decrease. Maximum depth limits the number of splits the tree is allowed to make by stopping after a certain depth is reached. That is, it limits the number of sequential splits the algorithm is allowed to make, counting from the root node. The minimum samples for a split restricts splitting: if the minimum number of samples for a split is not reached, the algorithm is forced to make a leaf there. Lastly, a restriction can be set on the minimum impurity decrease, which means that the algorithm is only allowed to make a further split if that leads to a certain decrease in the Gini impurity (Equation 6).

For the CART, just as for the MLP, we use a grid search instead of Bayesian hyperparameter optimization techniques to get insight into the performance of the CART. Selecting the hyperparameter values to be included in the grid search requires some preliminary investigation, since the restrictions have to be adjusted to the size the tree would grow to when left unrestricted. Being too restrictive compared to unrestricted growth will significantly harm performance, whereas restrictions that do not bind maximum growth have no effect at all. Initial investigation shows that a tree grown unrestrictedly on the full data set ends up with 320 leaves and a maximum depth of 20.
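The minimum-impurity-decrease restriction can be made concrete with a small sketch. The Gini impurity of a set of class labels and the weighted impurity decrease of a candidate split are computed as follows; this is a generic illustration of the criterion, not the paper's implementation (which uses scikit-learn).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted Gini impurity decrease achieved by splitting `parent`
    into `left` and `right`; a split is only allowed if this exceeds
    the configured minimum impurity decrease."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) \
                        - (len(right) / n) * gini(right)
```

A pure node has impurity 0, a perfectly mixed two-class node has impurity 0.5, and a split that fully separates the classes realises the entire parent impurity as its decrease.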
Therefore, we have decided to use the following grid parameters: maximum depth [10-20], minimum samples for a split [2, 3, 4, 5] and minimum impurity decrease [0-0.0002] in steps of 0.00001.

Instead of limiting the growth of the tree, we can also prune it after letting it grow unrestrictedly.

Table 6: MLP model structure optimization with hidden layers [1, 2, 3], neurons [8, 16, 32, 64, 128, 256, 512] and dropout [0, 0.1, 0.2]. All models are estimated using batch size 8 and 400 epochs. The optimal structure, in terms of accuracy, consists of 2 hidden layers with 256 neurons and a dropout rate of 0.2. The best performance-complexity trade-off, underlined in the table, is obtained by the MLP with 1 hidden layer, 256 neurons and a dropout rate of 0.1. All numbers are given in %.

    Batch size 8, Epochs 400     No dropout     Dropout 0.1    Dropout 0.2
    Hidden     Neurons per       Accuracy       Accuracy       Accuracy
    layer(s)   hidden layer      Mean   Std     Mean   Std     Mean   Std
    1          8                 43.9   5.8     41.6   5.1     38.5   5.0
               16                50.9   4.6     49.4   3.5     46.1   4.9
               32                59.2   4.5     56.0   4.2     53.7   6.4
               64                63.7   3.7     64.1   3.3     63.3   6.9
               128               66.2   3.4     68.5   3.7     68.2   5.2
               256               67.1   4.2     69.7   3.7     69.4   4.5
               512               67.1   3.8     68.3   2.2     68.2   4.5
    2          8                 40.8   5.3     38.8   4.7     37.0   5.0
               16                48.7   4.1     43.7   4.3     41.3   4.7
               32                55.6   4.6     56.6   4.4     51.9   5.3
               64                62.1   4.0     64.3   4.2     65.7   6.0
               128               65.5   4.5     68.9   4.4     68.3   4.3
               256               67.7   2.8     69.5   3.1     70.0   4.2
               512               68.8   2.2     68.1   1.9     69.6   2.9
    3          8                 39.5   8.3     38.5   5.9     34.0   4.8
               16                45.6   5.4     45.0   5.8     38.9   4.8
               32                52.8   3.0     54.7   5.2     46.3   5.8
               64                62.8   5.1     64.8   3.8     63.7   4.6
               128               65.5   4.5     67.9   4.0     69.6   4.8
               256               66.4   2.5     68.1   3.9     68.4   4.9
               512               68.1   1.8     68.3   4.1     66.9   4.9

Table 7: MLP model estimation optimization with epochs [20, 50, 100, 200, 400, 800] and batch size [4, 8, 16, 32] on the MLP with 1 hidden layer, 256 neurons and dropout rate 0.1. Optimal estimation is achieved using 200 epochs and a batch size of 8. All numbers are given in %.
    Epochs    Batch size    Mean acc.    Std
    20        4             56.9         4.8
              8             54.3         6.2
              16            52.5         4.8
              32            50.3         4.3
    50        4             65.4         3.7
              8             65.0         3.5
              16            60.9         4.4
              32            56.4         4.4
    100       4             67.7         4.1
              8             67.5         5.9
              16            65.4         4.1
              32            63.2         4.7
    200       4             67.5         3.8
              8             68.9         5.2
              16            67.1         3.7
              32            67.6         3.3
    400       4             68.0         4.8
              8             68.7         5.1
              16            68.0         4.1
              32            67.2         5.6
    800       4             68.3         4.8
              8             68.2         5.1
              16            68.2         5.4
              32            68.2         4.1

Pruning is based on the cost-complexity criterion

    C_α(L) = Σ_{l=1}^{|L|} ( Σ_{u_i ∈ R_l} (y_i − ŷ_l)² ) + α |L|,    (12)

where each l represents a leaf and |L| is the total number of leaves. The set R_l contains all the data points in leaf l, ŷ_l is the prediction for leaf l, and α is the factor punishing for complexity (Hastie et al., 2009). In words, the cost-complexity criterion is the sum of squared errors with an additional factor that punishes for tree complexity in the form of the number of leaves. Thus, optimizing tree complexity is done by optimizing the factor α.
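The cost-complexity criterion can be sketched directly from its definition. The function below mirrors the regression form of the criterion given above: a sum of squared errors over the leaves plus a penalty of α per leaf, with the leaf prediction taken as the leaf mean.

```python
def cost_complexity(leaves, alpha):
    """Cost-complexity criterion: sum of squared errors over all
    leaves plus alpha times the number of leaves.

    `leaves` is a list of lists, each holding the observed y_i of one
    leaf; the leaf prediction y_hat_l is the leaf mean.
    """
    sse = 0.0
    for ys in leaves:
        y_hat = sum(ys) / len(ys)          # leaf prediction
        sse += sum((y - y_hat) ** 2 for y in ys)
    return sse + alpha * len(leaves)
```

For a fixed α, pruning keeps the subtree that minimizes this quantity: a larger α penalizes each additional leaf more heavily and so favours smaller trees, while α = 0 leaves the fully grown tree untouched.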