Checking account activity and credit default risk of enterprises: An application of statistical learning methods
Jinglun YAO∗, Maxime LEVY-CHAPIRA†, Mamikon MARGARYAN‡

July 5, 2017
Abstract
The existence of asymmetric information has always been a major concern for financial institutions. Financial intermediaries such as commercial banks need to study the quality of potential borrowers in order to make their decisions on corporate loans. Classical methods model the default probability by financial ratios using logistic regression. As one of the major commercial banks in France, we have access to the account activities of corporate clients. We show that this transactional data outperforms classical financial ratios in predicting the default event. As the new data reflects the real-time status of cash flow, this result confirms our intuition that liquidity plays an important role in the phenomenon of default. Moreover, the two data sets are supplementary to each other to a certain extent: the merged data has a better prediction power than each individual data set. We have adopted some advanced machine learning methods and analyzed their characteristics. The correct use of these methods helps us to acquire a deeper understanding of the role of central factors in the phenomenon of default, such as credit line violations and cash inflows.
Résumé
The existence of asymmetric information is a major issue for financial institutions. Financial intermediaries, such as commercial banks, must study the quality of potential borrowers in order to make their decisions on commercial loans. Classical methods model the probability of default with financial ratios using logistic regression. Within one of the leading commercial banks in France, we have access to information on the account activity of corporate clients. We show that this transactional data outperforms financial ratios in predicting default. As this new data reflects cash flow in real time, the result confirms our intuition that liquidity plays an essential role in the phenomenon of default. Furthermore, the two data sets are complementary to a certain extent: the merged data set has a better prediction performance than each individual one. We have adopted several advanced statistical learning methods and analyzed their characteristics. The appropriate use of these methods helps us acquire a deep understanding of the role of central factors in the prediction of default, such as violations of the overdraft authorization and cash flows.

∗ Student at Ecole Polytechnique
† Quantitative Risk Project Manager at Société Générale
‡ Head of Credit Risk Modeling at Société Générale

Introduction
As Mishkin and Eakins (2006) point out, asymmetric information is one of the core issues in the existence of financial institutions. Financial intermediaries, such as commercial banks, play an important role in the financial system because they reduce transaction costs, share risk, and solve problems raised by asymmetric information. One of the most important channels for achieving this role is the effective analysis of the quality of potential corporate borrowers. Banks need to distinguish reliable borrowers from unreliable ones in order to make their decisions on corporate loans. From the banks' point of view, this reduces the losses associated with corporate defaults, while it is also beneficial for the whole economy because resources are efficiently attributed to prominent projects.

Altman (1968), Beaver (1966) and Ohlson (1980) are pioneers in using statistical models for the prediction of default. They used financial ratios which are calculated from the balance sheet and the income statement. Their inspiring work has been widely recognized, as shown by the fact that the method has become the standard of credit risk modeling for many financial institutions. One might doubt, however, whether the phenomenon of default can be "explained" by financial ratios. Intuitively, default takes place when the cash flows of a firm are no longer sustainable. The financial structure of a firm might well be the result of an upcoming default instead of being its cause, because the firm might be obliged to sell some of its assets when it is short of cash flows. Leland (2006) distinguishes two kinds of credit risk models: structural models and statistical models (or reduced form models). According to him, the statistical model above is not directly based on a firm's cash flows or values, but empirically estimates a "jump rate" to default. What is more, reduced form models do not allow an integrated analysis of a firm's decision to default or of its optimal financial structure decisions. On the other hand, structural models, such as those proposed by Black and Scholes (1973), Merton (1974) and Longstaff et al. (2005), associate default with the values of corporate securities, as the valuation of corporate securities depends on their future cash flows, which in turn are contingent upon the firm's operational cash flows. The diffusion models of market values of securities allow us to investigate the evolution of cash flows, and thus the default probabilities.

This suggestion is insightful, but does not provide a practical approach for commercial banks vis-à-vis their corporate clients. Most small and medium-sized enterprises do not sell marketed securities. For these firms, using structural models based on corporate securities is simply impossible. Fortunately, however, commercial banks possess information on cash flows in another way. Corporate clients not only borrow from banks but also open checking accounts in these banks. Norden and Weber (2010) demonstrate that credit line usage, limit violations, and cash inflows exhibit abnormal patterns approximately 12 months before default events. Measures of account activity substantially improve default predictions and are especially helpful for monitoring small businesses and individuals.
This is another good example of economies of scale, in which a bank shares information within itself to achieve better global performance. Instead of using a structural model, we choose to use statistical learning methods which improve considerably the prediction performance compared with classical logistic regression. This choice is due to the fact that it is difficult, at a first stage, to construct a structural model which gives a general picture and a good prediction at the same time. There is limited literature which explains default by using checking account information. By using statistical learning methods, we can empirically tell which variables are the most important in default prediction. This can help us construct a structural model at a later stage. On the other hand, if we are only interested in prediction, a reduced form model is sufficient for our concern.

However, we should underline the fact that the application of machine learning methods does not eliminate the necessity of economic understanding. As we will show, the construction of meaningful economic variables is an essential preliminary step for machine learning. What is more, the "important variables" given by machine learning should be taken with a grain of salt. Strobl et al. (1993) show that variable selection in CART (classification and regression trees) is affected by characteristics other than information content, e.g. variables with more categories are preferred. To solve the problem, Strobl et al. (2007) propose an unbiased split selection based on an exact distribution hypothesis. As with all exact procedures, this method is computationally too intensive. Hothorn et al. (2006) propose a more parsimonious algorithm, the conditional classification tree (ctree), which is based on the framework of permutation tests developed by Strasser and Weber (1999). Moreover, an unbiased random forest (conditional random forest, or cforest) is constructed based on ctree. But cforest is still too heavy to be executed on our data. Besides, it is not clear whether unbiasedness in the sense of random forest is still valuable for other machine learning methods. That is to say, it is disputable whether one can find a universally valuable subset of variables which contains the same level of information for any statistical method. Instead of using these computationally expensive methods, we will compare the variables selected by boosting, stepwise selection and lasso. A thorough understanding of these machine learning methods helps to shed light on the interpretation of model selections.

We begin by introducing basic random forest and boosting, as well as some important modifications to accommodate characteristics of our data. Section 3 compares three approaches to treating checking account data, illustrates the importance of economically meaningful variables and shows some particularities of machine learning methods. Section 4 compares the performance of financial ratios and questionnaires with that of account data, and combines the two data sets to achieve better prediction performance. Section 5 performs three model selections, respectively based on AIC, lasso and boosting. We use logistic regression to interpret the marginal effect of the most important variables. Section 6 concludes the article.
Random forest and boosting

For random forest and boosting, the most commonly used basic classifier is the classification tree. Suppose we want to classify a binary variable $Y$ by using two explicative variables $X_1$ and $X_2$. An example of a classification tree is given in Figure 1. The two graphical representations are equivalent, and the tree can be represented in the form

$$\hat{f}(X) = \sum_{m=1}^{M} c_m \, I\{(X_1, X_2) \in R_m\}, \qquad c_m \in \{0, 1\}, \tag{1}$$

where $I$ is the indicator function.

Figure 1 – A simple example of a classification tree (extracted from James et al. (2013))

To grow a tree, the central idea is to choose a loss function and to minimize it with respect to the tree. Friedman et al. (2001) and James et al. (2013) give a full introduction to the most important loss criteria in the context of classification trees. We use the Gini index as the loss function in our research. It should be underlined, however, that it is computationally too expensive to find a globally optimal solution. Instead, in practice one uses a "greedy algorithm" which takes the part of the tree already constructed as given and searches for the optimal solution based on this part. A tree grown in this way is called a CART (classification and regression tree), which was proposed by Breiman et al. (1984) and has become the most popular tree algorithm in machine learning.

The advantage of a tree is obvious: it is intuitive and easy to interpret. Nonetheless, it generally has poor predictive power on the training set and the test set if the model is only mildly fitted. Conversely, overfitting the training set (i.e. an overly reduced bias) is generally not desirable in machine learning. Ensemble methods, such as random forest and boosting, are conceived to solve this dilemma.
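For concreteness, here is a minimal R sketch of growing a single CART with the Gini criterion, as described above; the data frame `accounts` with a binary factor column `default` is a hypothetical placeholder, not the bank's actual data.

```r
# A minimal sketch (not the paper's code): one CART grown with the Gini criterion.
# 'accounts' is a hypothetical data frame whose factor column 'default' is 0/1.
library(rpart)

set.seed(1)
tree <- rpart(default ~ .,
              data    = accounts,
              method  = "class",                      # classification tree
              parms   = list(split = "gini"),         # Gini index as splitting criterion
              control = rpart.control(cp = 0.001, maxdepth = 5))

printcp(tree)                                          # complexity table for pruning
p_default <- predict(tree, accounts, type = "prob")[, 2]   # estimated default probability
```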
Random forest aims at reducing model variance and thus increasing prediction power on the test set. Instead of growing one single tree, we plant a forest. A general description of the algorithm is given in Figure 2. In practice, the optimal value of m is around √p for a classification problem, where p is the total number of variables. We can of course use cross-validation to optimise the value of this parameter. This small value of m looks strange at first sight, but it is in fact the key to random forest. Indeed, for B identically distributed random variables, each with variance σ² and positive pairwise correlation ρ, the variance of their average is

$$\rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2. \tag{2}$$

Even with large B (the number of trees in the case of random forest), we still need to decrease ρ to reduce the variance of the average. The role of a small m is to reduce the correlation ρ across trees, and thus to decrease the model variance.

Figure 2 – Algorithm of random forest (extracted from Friedman et al. (2001))

However, the basic random forest works poorly for our data because it is imbalanced (fewer than 6% of observations defaulted). Several remedies exist for this characteristic, including weight adjustment (Ting (2002)) and stratified sampling (Chen et al. (2004)). We have adopted the stratification method, which is easy to implement and yields satisfying results. Instead of sampling default and non-default observations uniformly for each tree in step 1.(a) (e.g. sampling 2/3 of the observations uniformly), we take 2/3 of the default observations and an equal number of non-default observations. This apparently small modification leads to a tremendous improvement in the confusion matrices. For a given checking account data set with 30 variables, the comparison is shown in Table 2. The test AUCs are respectively 78.72% and 79.…%.
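The stratified sampling just described can be expressed with the `sampsize` and `strata` arguments of the R randomForest package; the following is a hedged sketch in which `X`, `X_test` and the 0/1 factor `y` are assumed names.

```r
# Sketch of the balanced (stratified) random forest; X, X_test and the factor y
# are hypothetical placeholders for the 30 account variables and the default flag.
library(randomForest)

y     <- as.factor(y)
n_def <- floor(2/3 * sum(y == "1"))        # 2/3 of the default observations

set.seed(1)
rf_bal <- randomForest(x = X, y = y,
                       ntree    = 500,
                       mtry     = floor(sqrt(ncol(X))),          # about sqrt(p)
                       strata   = y,                              # stratify sampling by class
                       sampsize = c("0" = n_def, "1" = n_def))    # equal draws from each class

p_test <- predict(rf_bal, newdata = X_test, type = "prob")[, "1"]  # P(default) on the test set
```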
The most commonly used version of boosting is AdaBoost (Freund et al. (1996)). Contrary to random forest, which plants decision trees in parallel, AdaBoost cultivates a series of trees. If an observation is wrongly classified by previous trees, its weight is accentuated in later trees until it is correctly classified. The central idea is intuitive, yet it remained a purely algorithmic notion until Friedman et al. (2000) pointed out the inherent relationship between AdaBoost and the additive logistic regression model:
Theorem 1
The real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$, where the additive logistic regression model is defined as having the following form for a two-class problem:

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \sum_{m=1}^{M} f_m(x). \tag{3}$$

              Training set               Test set
              Imbalanced   Balanced      Imbalanced   Balanced
True value 0  0.057%       25.829%       0.055%       25.588%
True value 1  98.861%      28.599%       99.108%      27.340%
Global        3.830%       25.930%       3.940%       25.660%

Table 2 – Error rates of imbalanced and balanced random forest. False negative rates are extremely high for both the training set and the test set using the imbalanced random forest. In contrast, the error rates using the balanced random forest are much more reasonable.

In the case of boosting trees, the f_m are individual trees adjusted by weights. According to Theorem 1, boosting is by its nature an optimisation process. This insight paves the way for gradient boosting (Friedman (2001)) and its implementation xgboost (Extreme Gradient Boosting), which searches the gradient of the objective function and implements efficiently the basic idea of boosting. Moreover, the intimate relationship between boosting and logistic regression leads to some interesting results which we will discuss later on.

A model is overfitted if it suits the training set well but the test set poorly. In our research, the model performance criterion is the AUC (Area Under the ROC Curve), which measures the discrimination power of a given model. It should be noticed that the AUC is immune to imbalance in the data. Some methods, like random forest, aim at reducing the model variance: by decorrelating the training data and the model, we obtain a model which is less sensitive to data changes. For example, using 30 checking account variables to explain default, we get
AUC = 79.45% for the training set and AUC = 79.85% for the test set with the balanced random forest. Boosting had also been considered to work in this way, but Friedman et al. (2000) point out that boosting seems mainly a bias-reducing procedure. This conclusion is coherent with our experiment. Using the same variables, we get AUC = 87.45% for the training set and AUC = 79.8% for the test set. Boosting has necessarily overfitted the model, but this feature does not undermine its ability to predict the test set.

Additional remarks should be made on the parameters of machine learning methods. While this is not the major concern of this article, it is nonetheless crucial for letting the machine run correctly. One important parameter is related to the complexity of the model, for example the number of candidate variables for each node split in random forest, or the number of learning steps in boosting. Cross-validation is adopted to ensure the appropriate level of complexity and to avoid over-fitting. Appendix C gives an exhaustive explanation of the most important parameters in our models.
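As an illustration of this cross-validation step, the sketch below picks the number of boosting rounds with `xgb.cv` and computes the train/test AUCs with the pROC package; `X_train`, `y_train`, `X_test` and `y_test` are hypothetical objects, and `eta` and `max_depth` follow the values reported in Appendix C.

```r
# Sketch: choosing the number of rounds by cross-validation and measuring the AUCs.
# X_train, y_train, X_test, y_test are hypothetical placeholders.
library(xgboost)
library(pROC)

dtrain <- xgb.DMatrix(as.matrix(X_train), label = y_train)
dtest  <- xgb.DMatrix(as.matrix(X_test),  label = y_test)

params <- list(objective = "binary:logistic",
               eval_metric = "auc",
               eta = 0.01,            # small shrinkage, as in Appendix C
               max_depth = 5)         # tree depth, as in Appendix C

set.seed(1)
cv <- xgb.cv(params = params, data = dtrain, nrounds = 3000,
             nfold = 5, early_stopping_rounds = 50, verbose = 0)

bst <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)

auc_train <- auc(y_train, predict(bst, dtrain))   # in-sample AUC
auc_test  <- auc(y_test,  predict(bst, dtest))    # out-of-sample AUC
```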
Organising checking account data: Three approaches
The current literature does not offer mature approaches for treating checking account data of the kind we can find for financial structure data. In the latter case, corporate finance suggests some particularly useful ratios such as working capital/total assets, retained earnings/total assets, market capitalization/total debt, etc. (Ross et al. (2008)). Defining new features based on checking account data therefore becomes a central issue in our study. We have tried three approaches, detailed below. They will be combined with three different statistical methods (logistic regression, random forest and boosting).
Definition 1

This definition is inspired by Norden and Weber (2010). At the end of each year, which we denote time t, we define the explained variable, default, as the binary indicator of going bankrupt in the following year. The explicative variables are created from monthly account variables over the last two years. These 30 variables are listed in Appendix A and can be classed mathematically into four categories: the difference of a characteristic (e.g. balance, monthly cumulative credits) between the beginning and the end of a period (one or two years); the value of this characteristic at time t; the standard deviation of this characteristic during a certain period; and attributes of the firm (annual sales, sector). The basic idea is to use stock and flow variables for a complete but also concise description of a given characteristic. Moreover, the standard deviation of, for example, monthly cumulative credits allows us to quantify the risk associated with unstable income.

The size of firms may influence the model considerably and in an undesirable way. A firm might have a higher balance than another one only because it is larger: this larger balance does not "reflect" a smaller probability of default. Norden and Weber (2010) used the line of credit as the normalisation variable for the corporate clients of a German universal bank. However, this variable is not available in our research. We thus need to figure out another appropriate normalisation variable. One suggestion is to use information from the balance sheet or the income statement, such as total sales. But larger firms may open accounts in several different commercial banks, so that each account reflects only a fragment of the cash flow information. There thus exists a discrepancy between the size of the account and the size of the firm. In order to capture the account size, we need a variable within the account itself which reflects the account's normal level of vitality. The average of monthly cumulative credits over the last two years responds to these criteria and is used to normalize the variables proportional to account size. Intuitively, monthly cumulative credits are the equivalent of total sales in the context of a checking account, in the sense of total resources.
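To make the construction concrete, here is a small sketch building one evolution variable, one risk variable and the normalisation variable from hypothetical monthly data; the layout of `monthly` (columns `firm`, `month_index` with 0 for month t, and `TCREDIT`) is an assumption, while the time windows follow Appendix A.

```r
# Sketch: building Definition-1 style variables from hypothetical monthly data.
# 'monthly' has columns firm, month_index (0 = month t, -1, ..., -23) and TCREDIT.
library(dplyr)

features <- monthly %>%
  group_by(firm) %>%
  summarise(
    MEAN_TCREDIT_t = mean(TCREDIT[month_index >= -23]),          # normalisation variable over [t-23, t]
    dTCREDIT_t     = TCREDIT[month_index == 0] -
                     TCREDIT[month_index == -12],                 # one-year evolution of credits (var16)
    cv_TCREDIT_t   = sd(TCREDIT[month_index >= -11]) /
                     mean(TCREDIT[month_index >= -11])            # instability of income over [t-11, t]
  ) %>%
  mutate(dTCREDIT_norm = dTCREDIT_t / MEAN_TCREDIT_t)             # normalised by account size
```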
Definition 2

As in Definition 1, we still use account information from the past year to predict default in the coming year. But the explanatory variables used in the statistical methods are built in a much more "computer science" way. Instead of using the economic intuitions above to organise the raw information, we rely on automatic methods to build the model inputs. 50 variables are first summarised from the raw monthly information, and then interact with each other using the four basic arithmetic operations. Together with some raw variables, the data set contains around 5000 variables in total. It should be noticed that these combinations are usually not intuitively interpretable. While it might be possible to give some far-fetched explanation for "average monthly balance/cumulative number of intended violations", it is far more difficult to interpret other variables.

One might argue that simple arithmetic interactions are not capable of exhausting the possible meaningful combinations of raw information, making this approach unrepresentative. However, it should first of all be noticed that boosting with 5000 variables is already computationally expensive for an ordinary computer. In practice, we launch the boosting for each kind of arithmetic interaction and select the most important variables according to their contributions to the Gini index. These variables are then used to run a final and lighter boosting with around 200 variables. Secondly, it is simply computationally impossible to exhaust the most meaningful combinations. Suppose we want to create automatically the 30 variables in Definition 1. These variables are based on more than 10 basic monthly variables (e.g. TCREDIT, monthly number of violations), i.e. more than 120 variables if we take the month into consideration. Var16 is the difference of TCREDIT between time t and t−12 (subtraction of 2 variables), while var9 is the sum of the monthly number of violations during one year (sum of 12 variables). This simple example shows that, for a new variable, there is no a priori limit on the number of participating raw monthly variables. That is to say, any variable among the 120 variables might be included in or excluded from the combination. The number of possible forms of combination is astronomical (of the order of 2^120), even if we allow only one arithmetic operation, for example addition, let alone other forms of operations. Thirdly, there is no reason to delimit a priori a set of reasonable operations. The use of the standard deviation of TCREDIT (var24), for example, is based on the intuition of the stability of revenue. It is not reasonable, however, to include a priori this operation, which is more complicated than sine, cosine or other simple functions, in the set of reasonable operations, if we investigate the question in a purely mathematical way.
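A minimal sketch of this automatic approach: generating all pairwise arithmetic interactions of a hypothetical data frame `base` of roughly 50 summarised variables, which already yields on the order of 5000 candidate features.

```r
# Sketch: pairwise arithmetic interactions of a hypothetical set of base variables.
# 'base' is a data frame of roughly 50 numeric summary variables.
make_interactions <- function(base) {
  vars <- colnames(base)
  out  <- list()
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        a <- base[[i]]
        b <- base[[j]]
        out[[paste(vars[i], "plus",  vars[j], sep = "_")]] <- a + b
        out[[paste(vars[i], "minus", vars[j], sep = "_")]] <- a - b
        out[[paste(vars[i], "times", vars[j], sep = "_")]] <- a * b
        out[[paste(vars[i], "over",  vars[j], sep = "_")]] <- a / ifelse(b == 0, NA, b)
      }
    }
  }
  cbind(base, as.data.frame(out))
}
# With 50 base variables this gives 50*49/2 pairs and 4 operations, i.e. about 4900 new columns.
```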
Definition 3

Similar to Definition 1, this definition is also economically interpretable. In contrast, we create only 5 variables, which are highly discretized. Four of them are binary; the fifth has three categories. These variables are listed in Appendix B.
The performance, measured by test AUC, is given in Table 3. We have selected the 20 best variables in Group 1 and Group 3, respectively by AIC and by variable importance in boosting. The 5 best variables in Group 2 are chosen according to variable importance in boosting. Despite the difference in variable selection methods, all the variables in Group 2 are included in Group 1. Among the 20 variables in Group 3, three variables are not available for most of the observations (> …) and are eliminated for random forest and for logistic regression.

Table 3 – Test AUCs of four groups of account data (3 definitions) in logit, random forest and boosting. The 20 variables in Group 1 and Group 3 are selected respectively by AIC and by variable importance in boosting. The 5 best variables in Group 2 are chosen according to variable importance in boosting. All the variables in Group 2 are included in Group 1. Among the 20 variables in Group 3, three variables are not available for most of the observations (> …) and are eliminated for random forest and for logistic regression.

The discretized variables of Definition 3 perform notably worse in random forest. It seems to us that discretization is the reason for this. While it is a common approach to discretize continuous variables for logistic regression, because this can create a certain kind of non-linearity of a given explanatory variable within the linear framework, it nonetheless reduces the information contained in the variable. The discretization is especially detrimental for the imbalanced random forest: AUC = 46.81% suggests a performance worse than randomly distributed classes and should be considered as a pathology. Even the balanced random forest performs worse than logistic regression. In fact, the individual trees grown in a random forest are usually very deep (depth > …, compared with depth = 5 in our setting). As Friedman et al. (2001) suggest, experience so far indicates that 4 ≤ depth ≤ 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. In any case, it is unlikely that a depth > 10 will be required.
This probably suggests that boosting relies much less heavily on the variables' ability to offer potential splits, making it less sensitive to discrete variables.

In fact, using stumps (depth = 2) is already sufficient for yielding good predictions. Using all the 30 variables in Definition 1, the AUCs are respectively 79.47% for depth = 2 and 79.82% for depth = 5. (It should be remarked, however, that the optimal number of rounds validated by cross-validation is higher in the case depth = 2: 2811 and 997 rounds respectively for depth = 2 and depth = 5, with the other parameters fixed according to Appendix C.) In the case of M stumps, the additive logistic regression model becomes

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \sum_{m=1}^{M} \alpha_m f_m(x_m),$$

where each $f_m$ depends on a single variable. With deeper trees (depth > 2), the approximation is extended to multivariate functions. So the advantage of boosting over logistic regression seems to be the capacity of the former to take non-linearity into consideration. This clearly explains why boosting is mainly a bias-reducing method, as mentioned by Friedman et al. (2001).

Does the out-performance of boosting and balanced random forest also imply their superiority in identifying a rich data set? Comparing Group 1 and Group 3, we can remark that the logit AUC is higher in Group 3, while the boosting AUC is lower. If we trusted logistic regression for prediction, we should conclude that Group 3 contains more information than Group 1 and that machine learning methods such as boosting are not reliable for distinguishing a rich data set from a poorer one. However, looking at Group 2, we can easily reverse this conclusion. The logit AUC in Group 2 is nearly the same as that in Group 3, while Group 2 apparently contains less information than Group 1 because all the variables in Group 2 are included in Group 1. Instead, a plausible explanation for the low logit AUC in Group 1 is the multicollinearity between explanatory variables (James et al. (2013)). Boosting and random forest, in contrast, split each node by individual variable and should not be impacted by this haunting multicollinearity. With fewer variables (Group 2), the prediction accuracy is higher in logit. This phenomenon probably suggests that logit cannot "digest" rich information well because of its restrictive linear form. It is thus more reliable to use the AUCs of machine learning methods as a measure of the information contained in a data set.

The close relationship between boosting and logistic regression explains some results which may seem strange at first sight. The higher logit AUC in Group 3 compared with Group 1 should be interpreted in light of the model selection method: "good variables" in the sense of boosting should generally be "good" in the sense of logit. It is thus not surprising to find that 20 variables selected from about 5000 variables work better in logit than 20 variables selected from 30. On the other hand, the same variables have a lower AUC in balanced random forest than in boosting (76.67% vs 78.…%).
Checking account data versus financial and managerial data

Traditional reduced form methods for default prediction mainly focused on the financial structure of enterprises (Altman (1968), Beaver (1966) and Ohlson (1980)), as the financial structure does reflect to a large extent the solvability of enterprises and is relatively more available than real-time account information. What is more, in a reduced form method, as we merely try to match a pattern to the data (Fayyad et al. (1996)) without worrying much about causality, the problem of endogeneity is not a primary concern. But once we want to obtain some causal interpretation, financial structure data may suffer from endogeneity and should be carefully interpreted as a "cause" of credit default. On the other hand, we should remark the difference between book value and market value, and the accounting principle associated with this difference (Ross et al. (2008)). For small and medium-sized enterprises, market values are simply not available because these firms usually do not sell any marketed securities, while their book values are historic and subject to accounting manipulations.

Commercial banks have both a necessity and an advantage in the analysis of credit default. The possession of corporate account information helps them acquire a more direct and "frank" image of the firm's account. Not only may the information be more reliable, but it is also more real-time. The balance sheet and income statement are produced once a year by firms, while checking account information can theoretically be daily. In practice, we use monthly variables as raw variables for the purpose of simplification. This allows commercial banks to supervise the solvability of corporate borrowers on a more frequent basis. Given the advantage of checking account information, we should expect a better prediction performance based on checking account data. This is represented in the first two columns of Table 4. The AUCs based on account data in balanced random forest and boosting are significantly larger than those based on financial and managerial data. (One might argue that this superiority is simply due to more explicative variables. In fact, with the same number of variables (11), the boosting AUC of account data is 79.…%, the same as that of Group 1 and significantly less than that of Group 6.)

Table 4 – The AUCs of checking account data, financial and managerial data, and merged data in logit, random forest and boosting. Group 6 comes from the fusion of Group 1 and Group 5. This merged data has the best performance in balanced random forest and boosting.

We can thus conclude that the three sources of information are complementary, which corresponds to our intuition about the real functioning of enterprises. First, the checking account information is a reflection of a firm's cash flow, which is most directly related to a firm's solvability. Second, financial ratios illustrate the firm's financial structure and its ability to earn profits. We should remark that the financial ratios we used are primarily concerned with the firm's profitability and expenses (interest expenses, earnings before interest and taxes, etc.) and are more tightly related to cash flow, which is also the case for Atiya (2001). Third, other non-financial factors should be taken into consideration, for example the managerial expertise of executives.

Of course, this is not a complete list of all the factors which are related to credit default. Some macroeconomic factors, for example, can additionally be taken into account.
We have observed a decreasing quarterly default rate during 2013–2014, which might be explained by the decreasing interest rates in Europe during the same period. If we use the data from 2009 to 2012 as the training set, and the data from 2013 to 2014 as the test set, the statistical pattern works less well for defaults at the end of 2013 and at the beginning of 2014.
Model selection and interpretation

Because of the multicollinearity problem among the 30 variables of Definition 1, a variable selection process is needed in order to obtain and interpret the marginal effect of each prominent variable by logistic regression. The list of important variables in Definition 1 is shown in Figure 3. We can see that, according to boosting, the most important variables are especially related to the number of violations (var9, var11, var13) and to the current status (var27, var32, var33, var34). Intuitive as it is, this variable importance in the sense of boosting should be taken with a grain of salt. For example, var10 (the number of intended violations during the preceding year) is ranked 7 in the importance list: does this mean it is individually highly discriminant? In fact, if we draw the two conditional distributions (conditioned on default) of each variable and calculate their individual AUCs, which reflect their individual discriminatory power, the AUC of var10 is only 68.…%: var9 measures the same quantity over the more recent period [t−11, t] and naturally has a better discriminatory power than var10 (AUC = 72.68% vs AUC = 68.…%).

Figure 3 – Variable importance of the 30 variables in Definition 1 according to boosting
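The individual discriminatory power mentioned above can be computed as the AUC of a single variable against the default indicator, for example with the pROC package; the column names `default`, `var9` and `var10` in the hypothetical data frame `d` mirror the notation of the text.

```r
# Sketch: discriminatory power of a single variable, measured by its AUC against default.
# 'd' with columns default, var9 and var10 is a hypothetical layout.
library(pROC)

auc_var9  <- auc(roc(d$default, d$var9,  quiet = TRUE))
auc_var10 <- auc(roc(d$default, d$var10, quiet = TRUE))
```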
In order to be more rigorous about variable selection, we have tried two other methods which are based on logistic regression: stepwise selection and lasso. For stepwise selection, AIC was used as the criterion. Forward and backward selection have generated the same 8 variables, marked in Table 5. In order to compare the different model selection methods, we have adjusted λ in lasso so as to yield exactly 8 non-zero coefficients. These 8 variables are also summarised in Table 5. Remark that 7 among the 8 variables are identical to those selected by AIC. Thus, for our data, there is no apparent difference between stepwise selection and lasso. In contrast, 4 among the 8 variables selected by boosting differ from those of lasso and stepwise selection! These variables all belong to the variables of current status.

It is difficult to confirm by experiment the reason for this difference. Our intuition centers on the multicollinearity between the 4 variables favored by boosting. Table 6 shows the Spearman correlation between these variables. (It would be more appropriate to calculate the Pearson correlation, because we are interested in linear correlation in the case of logistic regression. However, this correlation is not stable with respect to manipulations such as the elimination of missing values or extreme values. The Spearman correlation, on the other hand, seems to be quite stable under data manipulations, which shows an advantage of tree-based methods: they depend on ordinal properties of variables instead of cardinal ones.) It seems to us that, because of its restrictive linear form, logit is incapable of disentangling the interweaving information contained in these variables. On the contrary, boosting seems able to digest this intricate information. This hypothesis is loosely confirmed by the regression results in Tables 7 and 8. All the coefficients of the variables selected by AIC are significantly different from zero at the 0.1% level. The 4 variables favored by boosting (var27, var32, var33, var34), on the other hand, are less significant: var27 and var33 are significant at the 5% level, while var32 and var34 are not significant. We should remark, however, that all the signs of these 4 variables correspond to our intuition and that var32 and var34 are not far from being significant (P values = 20.89% and 11.23% respectively). This is a common symptom of multicollinearity, because it increases the variances of the related estimated coefficients and renders the coefficients insignificantly different from zero.

Returning to the regression in Table 7, several insights can be gained from the marginal effects of these variables. First, var9, var11 and var13 are always among the best variables in any method that we have used (this is also valid for the balanced random forest, whose variable selection we have not presented). These variables concern intended or rejected violations of the credit line. The negative sign of var11 (number of rejected violations) should not be regarded as counter-intuitive, because of the presence of var9 (number of intended violations) and its positive coefficient, which is larger than that of var11 in absolute value. This suggests that a larger number of violations, whether rejected or not, indicates a higher probability of default. We have used the amount of violations instead of the number of violations to construct var13, in order to capture more precisely the confidence granted to each client by the bank advisor. This variable seems to work particularly well, in the sense that the same variable on the previous year, var14, is also included by stepwise selection and by lasso. This suggests that front line staff have acquired important experience and intuition in distinguishing solvent clients from insolvent ones. This experience may be hard to formalise, but it is truly valuable and should be paid attention to. Second, the risk of default is intimately related to the risk of income. As var24 (standard deviation of cumulative monthly credits) shows, the more unstable the income, the more likely the firm is to default. Var31 (cumulative monthly credits at month t) is also related to credits and decreases the default probability through higher income. Credits, rather than debits, should be considered more seriously as the source of default. Norden and Weber (2010) point out that there exists a very strong correlation between debits and credits and that the latter should be considered as the constraint of the former.
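A hedged sketch of the two logistic-regression-based selections: stepwise search driven by AIC via `step()`, and a lasso path with glmnet from which one picks a λ giving exactly 8 non-zero coefficients; `X` (the 30 variables of Definition 1) and the 0/1 response `y` are assumed names.

```r
# Sketch: stepwise selection (AIC) and lasso among the 30 variables of Definition 1.
# X (data frame of the 30 variables) and y (0/1 default) are hypothetical placeholders.
library(glmnet)

dat  <- cbind(X, default = y)
null <- glm(default ~ 1, data = dat, family = binomial)
full <- glm(default ~ ., data = dat, family = binomial)
step_fit <- step(null, scope = list(lower = null, upper = full),
                 direction = "both", trace = 0)          # AIC-driven stepwise search

lasso <- glmnet(as.matrix(X), y, family = "binomial", alpha = 1)
# Pick the largest lambda on the path that keeps exactly 8 non-zero coefficients.
lambda_8 <- lasso$lambda[which(lasso$df == 8)[1]]
coef(lasso, s = lambda_8)
```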
Table 5 – Variable selection among the 30 variables of Definition 1 according to boosting, stepwise selection and lasso. Stepwise selection with the AIC criterion (both forward and backward) has yielded 8 variables. For the purpose of comparison, we have chosen the 8 best variables in boosting. For lasso, we have adjusted the parameter λ so as to yield exactly 8 non-zero coefficients.

Table 6 – Spearman correlation between var27, var32, var33, var34
              Coefficient   Standard deviation   P value    Significance
(Intercept)   -6.670e-01    1.294e-01            2.55e-07   ***
var9          2.060e-03     1.469e-04            < …
…

Table 7 – Logistic regression using variables selected by stepwise selection
              Coefficient   Standard deviation   P value    Significance
(Intercept)   -1.128e+00    7.204e-02            <2e-16     ***
var9          1.433e-03     1.146e-04            < …
…

Table 8 – Logistic regression using variables selected by boosting
Table 9 – Transformation of the sector variable to a numeric variable according to average default rate
An increase in expenses might be the direct reason for default, but a decrease or instability of income may be more fundamental. Third, different economic sectors clearly have different default rates. We have constructed var29 (sector) by using a theorem from Shih (2001); see Appendix D for the details. This theorem allows us to transform a categorical variable into a discrete numeric variable for classification trees. The corresponding numeric values of the sectors are shown in Table 9. Higher values are associated with higher average default rates. This is also validated by the logistic regression in Table 7. Fourth, larger firms are less likely to default, as they are more mature than startups. Commercial banks have reason to be unwilling to lend money to startups, which in some cases might need to seek investment from venture capital funds or angel investors.
Conclusion

We have investigated the relationship between corporate checking accounts and credit default and shown that account information outperforms traditionally used financial ratios in predicting default for our data sets. This result aligns with our understanding of default as a phenomenon of liquidity. Checking account information reflects a more direct and real-time status of the firm's cash flow and is a privilege of commercial banks when the firm's market value is not available. Banks can exploit economies of scale and use information on the firms' checking accounts to make reasonable decisions on corporate loans. Despite the importance of this subject, there is currently little literature except Norden and Weber (2010), Mester et al. (2007) and Jiménez et al. (2009). Inspired by their work, we have investigated a broader range of explicative variables and systematically compared the performance of different data sets using statistical learning methods. We have shown that these methods, together with the AUC criterion, are more accurate and reliable approaches for measuring the information contained in data sets than logistic regression. While the latter often suffers from multicollinearity, machine learning methods such as random forest and boosting make use of these variables separately and are capable of disentangling intricate information. By using random forest and boosting, we have significantly increased the prediction accuracy. Tree-based methods have other advantages, such as being immune to extreme values.

We should remark particularly, however, that a successful statistical learning process is achieved with human expertise. Meaningful economic variables must first of all be created based on the raw checking account information, just as the pioneers of corporate finance created financial ratios based on the balance sheet and income statement. We also need to normalise these variables so as to eliminate the effect of account size. As we have shown, it is technically not possible (and epistemologically unacceptable for some) to create explicative variables which contain the same level of concise information simply by an automated program. The 30 variables created by Definition 1 need to be perfected by eliminating about one half of the less useful variables and adding other potentially important indicators. But even at this early stage, the importance of human expertise in financial study is illustrated.

Financial ratios and the managerial questionnaire are nonetheless still important in predicting credit default. By combining them with checking account data, the model achieves the best prediction performance and outperforms any model based on a single data set. This suggests a certain kind of orthogonality between the information of the different data sets: the financial structure, profitability, and managerial experience should be considered in parallel with checking account information in a reduced form model.

Through careful model selection, we have shown some particularities of boosting in selecting important variables. We have used the 8 most important variables given by stepwise selection to gain intuition about the mechanism of default. Violations of the credit line, whether rejected or not, are particularly good indicators of upcoming default. Moreover, front line advisors seem to have notable experience in distinguishing acceptable violations, which is reflected in the percentage of the permitted amount of violations.
While default is at first sight due to excessive expenses, Norden and Weber (2010) and we have focused on the importance of credits. A low level of income, as well as instability of income, significantly increases the default rate.

Our research has adopted rigorous statistical methods to obtain a well-performing prediction model based on checking accounts and to identify key indicators in this data by an inductive methodology. We have had a thorough discussion of the mechanisms of these methods, which have significant implications for the results. This enriches the scarce literature on this topic and can provide suggestions to banks for their decisions on corporate loans. Further research may try to identify other key factors in checking account information or construct a structural model for the credit default of small and medium-sized enterprises.
References
Edward I Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4):589–609, 1968.

Amir F Atiya. Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4):929–935, 2001.

William H Beaver. Financial ratios as predictors of failure. Journal of Accounting Research, pages 71–111, 1966.

Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, pages 637–654, 1973.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

Chao Chen, Andy Liaw, and Leo Breiman. Using random forest to learn imbalanced data. University of California, Berkeley, pages 1–12, 2004.

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37, 1996.

Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.

Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 6. Springer, 2013.

Gabriel Jiménez, Jose A Lopez, and Jesús Saurina. Empirical analysis of corporate credit lines. Review of Financial Studies, 22(12):5069–5098, 2009.

Hayne E Leland. Structural models in corporate finance. Princeton University Bendheim Lecture Series in Finance, 2006.

Francis A Longstaff, Sanjay Mithal, and Eric Neis. Corporate yield spreads: Default risk or liquidity? New evidence from the credit default swap market. The Journal of Finance, 60(5):2213–2253, 2005.

Robert C Merton. On the pricing of corporate debt: The risk structure of interest rates. The Journal of Finance, 29(2):449–470, 1974.

Loretta J Mester, Leonard I Nakamura, and Micheline Renault. Transactions accounts and loan monitoring. Review of Financial Studies, 20(3):529–556, 2007.

Frederic S Mishkin and Stanley G Eakins. Financial Markets and Institutions. Pearson Education India, 2006.

Lars Norden and Martin Weber. Credit line usage, checking account activity, and default risk of bank borrowers. Review of Financial Studies, 23(10):3665–3699, 2010.

James A Ohlson. Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, pages 109–131, 1980.

Stephen A Ross, Randolph Westerfield, and Bradford D Jordan. Fundamentals of Corporate Finance. Tata McGraw-Hill Education, 2008.

Yu-Shan Shih. Selecting the best categorical split for classification trees. Statistics and Probability Letters, 54:341–345, 2001.

Helmut Strasser and Christian Weber. On the asymptotic theory of permutation statistics. 1999.

Carolin Strobl, Achim Zeileis, Anne-Laure Boulesteix, and Torsten Hothorn. Variable selection bias in classification trees and ensemble methods. In Book of Abstracts, page 159. Citeseer, 1993.

Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501, 2007.

Kai Ming Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659–665, 2002.

Jon Williamson. The philosophy of science and its relation to machine learning. In Scientific Data Mining and Knowledge Discovery, pages 77–89. Springer, 2009.
A Variable Definition 1
For simplicity, variables are abbreviated according to Table 10.

Abbreviation       Explanation
MIN_BAL            monthly minimum account balance
MAX_BAL            monthly maximum account balance
MEAN_BAL           monthly average account balance
MEAN_CRBAL         monthly average credit balance
MEAN_DBBAL         monthly average debit balance
TCREDIT            monthly total credits
TDEBIT             monthly total debits
INT_CNVIOL         cumulative number of intended violations from the beginning of the year
REJ_CNVIOL         cumulative number of rejected violations from the beginning of the year
INT_CAVIOL         cumulative amount of intended violations from the beginning of the year
REJ_CAVIOL         cumulative amount of rejected violations from the beginning of the year
MEAN_TCREDIT_t     mean of TCREDIT during the period [t−23, t], used for normalisation

Table 10 – Variable abbreviations
The 30 variables defined in Definition 1 are built by applying the operations in Table 11. Their definition formulas are shown in Table 12.

Operation      Meaning
X_t            value of X at month t
ΔX_t           X_t − X_{t−…}
ΔΔX_t          X_t − X_{t−…}
mean_t(X)      mean of X during the period [t−11, t]
sd_t(X)        standard deviation of X during the period [t−11, t]

Table 11 – Operations for creating variables

(One might wonder why the variables are not numbered from 1 to 30. This is purely a historical matter: we built a first version of the 30 variables before modifying them to obtain the version presented here.)

Variable category               Numerator                                     Denominator            Use normalisation
Evolutions – Balance            ΔΔ MIN_BAL_t                                  MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_BAL_t                                 MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_CRBAL_t                               MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_DBBAL_t                               MEAN_TCREDIT_t         YES
Evolutions – Violations         Δ INT_CNVIOL_t                                –                      NO
                                Δ INT_CNVIOL_{t−…}                            –                      NO
                                Δ REJ_CNVIOL_t                                –                      NO
                                Δ REJ_CNVIOL_{t−…}                            –                      NO
                                Δ INT_CAVIOL_t − Δ REJ_CAVIOL_t               Δ INT_CAVIOL_t         NO
                                Δ INT_CAVIOL_{t−…} − Δ REJ_CAVIOL_{t−…}       Δ INT_CAVIOL_{t−…}     NO
Evolutions – Balance vitality   Δ (MAX_BAL − MIN_BAL)_t                       MEAN_TCREDIT_t         YES
Evolutions – Credits & Debits   Δ TCREDIT_t                                   MEAN_TCREDIT_t         YES
                                Δ TCREDIT_{t−…}                               MEAN_TCREDIT_t         YES
                                Δ TDEBIT_t                                    MEAN_TCREDIT_t         YES
                                Δ TDEBIT_{t−…}                                MEAN_TCREDIT_t         YES
                                Δ (TCREDIT/TDEBIT)_t                          –                      NO
                                Δ (TCREDIT/TDEBIT)_{t−…}                      –                      NO
Risk – Balance stability        sd_t(MEAN_BAL)                                mean_t(MEAN_BAL)       NO
                                sd_{t−…}(MEAN_BAL)                            mean_{t−…}(MEAN_BAL)   NO
Risk – Credits stability        sd_t(TCREDIT)                                 mean_t(TCREDIT)        NO
                                sd_{t−…}(TCREDIT)                             mean_{t−…}(TCREDIT)    NO
Actual                          MEAN_BAL_t                                    MEAN_TCREDIT_t         YES
                                MEAN_CRBAL_t                                  MEAN_TCREDIT_t         YES
                                MEAN_DBBAL_t                                  MEAN_TCREDIT_t         YES
                                TCREDIT_t                                     MEAN_TCREDIT_t         YES
                                TDEBIT_t                                      MEAN_TCREDIT_t         YES
                                MIN_BAL_t                                     MEAN_TCREDIT_t         YES
                                MAX_BAL_t                                     MEAN_TCREDIT_t         YES
Attributes                      sector                                        –                      NO
                                total sales                                   –                      NO

Table 12 – The 30 variables defined in Definition 1. "Use normalisation" refers to the normalisation by MEAN_TCREDIT_t.

B Variable Definition 3
The 5 discrete variables are defined in Table 13.

Variable before discretization                                                Discrete classes
sum of MEAN_CRBAL during [t−2, t]                                             ≤ a €; > a €
sum of the monthly intended number of violations S1 and of the
monthly rejected number of violations S2, during [t−2, t]                     three classes defined by thresholds on S1 and S2
MEAN_CRBAL_t / MEAN_CRBAL_{t−…}                                               < b; ≥ b
history of the relationship with the bank (years)                             < c years; ≥ c years

Table 13 – The 5 variables defined in Definition 3. The exact values of a, b and c are not presented because of a confidentiality agreement.

C Parameters in Random Forest and Boosting
In our research, the parameters of random forest and boosting are set in the following way:

• Random Forest (R package randomForest)
  – The number of candidate variables for node splitting ("mtry"): for a classification problem, √p is the "standard" choice. We can also use cross-validation to determine the value.
  – Balanced random forest: use the parameter "sampsize" for stratified sampling. If the forest is well balanced, the minimal number of observations in each node ("nodesize") should not greatly influence the prediction power.

• Boosting (R package xgboost)
  – Choose the number of rounds ("nrounds") by cross-validation.
  – Choose a sufficiently small number for the shrinkage parameter ("eta"). For our data, the performance is stable when eta < 0.…; eta = 0.01 is used in our program.
  – Maximum depth of each tree ("max_depth"): between 4 and 8. We have used max_depth = 5.
  – The proportion of observations used for each tree ("subsample") does not greatly influence the prediction performance. We have used subsample = 0.…
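Assembled into code, these settings look roughly as follows; `X`, `y`, `n_def` and `dtrain` are hypothetical objects, and the `subsample` value is only an example since the exact figure is not restated here.

```r
# Sketch: the parameter settings listed above, assembled into R calls.
# X, y, n_def and dtrain are hypothetical objects; subsample = 0.5 is only an example value.
library(randomForest)
library(xgboost)

rf <- randomForest(x = X, y = as.factor(y),
                   mtry     = floor(sqrt(ncol(X))),     # "standard" choice sqrt(p)
                   strata   = as.factor(y),
                   sampsize = c(n_def, n_def))          # balanced, stratified sampling

params <- list(objective = "binary:logistic",
               eta       = 0.01,                        # small shrinkage parameter
               max_depth = 5,                           # between 4 and 8
               subsample = 0.5)                         # example value
cv  <- xgb.cv(params = params, data = dtrain, nrounds = 3000,
              nfold = 5, early_stopping_rounds = 50, verbose = 0)
bst <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)
```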
D Theorem for transforming the sector variable to a discrete numeric variable

Theorem 2
Suppose there are two classes, class 1 and class 2. Let X be a categorical variable taking values in {1, 2, ..., L}, where the categories are ordered in increasing p(1 | X = i) values. If φ is a concave function, then one of the L − 1 splits X ∈ {1, 2, ..., l}, where 1 ≤ l < L, minimizes

$$p_{\text{Left}}\,\varphi(p_{\text{Left}}) + p_{\text{Right}}\,\varphi(p_{\text{Right}}).$$
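A short sketch of the transformation this theorem justifies: order the sectors by their average default rate and replace the categorical variable by the corresponding rank; the data frame `d` with columns `sector` and `default` is a hypothetical layout, and `var29` follows the naming of Definition 1.

```r
# Sketch: numeric recoding of the sector variable, ordered by average default rate.
# 'd' with columns sector (factor) and default (0/1) is a hypothetical layout.
sector_rate <- tapply(d$default, d$sector, mean)      # average default rate per sector
sector_rank <- rank(sector_rate)                       # higher value = higher default rate
d$var29     <- sector_rank[as.character(d$sector)]     # numeric variable used by the trees
```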