Checking account activity and credit default risk of enterprises: An application of statistical learning methods
Jinglun YAO∗, Maxime LEVY-CHAPIRA†, Mamikon MARGARYAN‡

July 5, 2017
Abstract
The existence of asymmetric information has always been a major concern for financial institutions. Financial intermediaries such as commercial banks need to study the quality of potential borrowers in order to make their decisions on corporate loans. Classical methods model the default probability by financial ratios using logistic regression. As one of the major commercial banks in France, we have access to the account activities of corporate clients. We show that this transactional data outperforms classical financial ratios in predicting the default event. As the new data reflects the real-time status of cash flow, this result confirms our intuition that liquidity plays an important role in the phenomenon of default. Moreover, the two data sets are supplementary to each other to a certain extent: the merged data has a better prediction power than each individual data set. We have adopted some advanced machine learning methods and analyzed their characteristics. The correct use of these methods helps us to acquire a deeper understanding of the role of central factors in the phenomenon of default, such as credit line violations and cash inflows.
Résumé
The existence of asymmetric information is a major issue for financial institutions. Financial intermediaries, such as commercial banks, must study the quality of potential borrowers in order to make their decisions on commercial loans. Classical methods model the probability of default with financial ratios using logistic regression. Within one of the leading commercial banks in France, we have access to information on the account activity of corporate clients. We show that this transactional data outperforms financial ratios in predicting default. As this new data reflects cash flow in real time, the result confirms our intuition that liquidity plays an essential role in the phenomenon of default. Furthermore, the two data sets are complementary to a certain extent: the merged data set has a better prediction performance than each individual one. We have adopted several advanced statistical learning methods and analyzed their characteristics. The appropriate use of these methods helps us acquire a deep understanding of the role of central factors in the prediction of default, such as violations of the overdraft authorization and cash flows.

∗ Student at Ecole Polytechnique
† Quantitative Risk Project Manager at Société Générale
‡ Head of Credit Risk Modeling at Société Générale

Introduction
As Mishkin and Eakins (2006) point out, asymmetric information is one of the core issues in the existence of financial institutions. Financial intermediaries, such as commercial banks, play an important role in the financial system because they reduce transaction costs, share risk, and solve problems raised by asymmetric information. One of the most important channels for achieving this role is the effective analysis of the quality of potential corporate borrowers. Banks need to distinguish reliable borrowers from unreliable ones in order to make their decisions on corporate loans. From the banks' point of view, this reduces the losses associated with corporate defaults, while it is also beneficial for the whole economy because resources are efficiently attributed to prominent projects.

Altman (1968), Beaver (1966) and Ohlson (1980) are pioneers in using statistical models for the prediction of default. They used financial ratios which are calculated from the balance sheet and the income statement. Their inspiring work has been widely recognized, as shown by the fact that the method has become the standard of credit risk modeling for many financial institutions. One might doubt, however, whether the phenomenon of default can be "explained" by financial ratios. Intuitively, default takes place when the cash flows of a firm are no longer sustainable. The financial structure of a firm might well be the result of an upcoming default instead of being its cause, because the firm might be obliged to sell some of its assets when it is short of cash flows. Leland (2006) distinguishes two kinds of credit risk models: structural models and statistical models (or reduced form models). According to him, the statistical model above is not directly based on a firm's cash flows or values, but empirically estimates a "jump rate" to default. What is more, reduced form models do not allow an integrated analysis of a firm's decision to default or of its optimal financial structure decisions. On the other hand, structural models, such as those proposed by Black and Scholes (1973), Merton (1974) and Longstaff et al. (2005), associate default with the values of corporate securities, as the valuation of corporate securities depends on their future cash flows, which in turn are contingent upon the firm's operational cash flows. The diffusion models of market values of securities allow us to investigate the evolution of cash flows, and thus the default probabilities.

This suggestion is insightful, but does not provide a practical approach for commercial banks vis-à-vis their corporate clients. Most small and medium-sized enterprises do not sell marketed securities. For these firms, using structural models based on corporate securities is simply impossible. Fortunately, however, commercial banks possess information on cash flows in another way. Corporate clients not only borrow from banks but also open checking accounts in these banks. Norden and Weber (2010) demonstrate that credit line usage, limit violations, and cash inflows exhibit abnormal patterns approximately 12 months before default events. Measures of account activity substantially improve default predictions and are especially helpful for monitoring small businesses and individuals.
This is another good example of economies of scale, in which a bank shares information within itself to achieve better global performance. Instead of using a structural model, we choose to use statistical learning methods which improve considerably the prediction performance compared with classical logistic regression. This choice is due to the fact that it is difficult, at a first stage, to construct a structural model which gives a general picture and a good prediction at the same time. There is limited literature which explains default by using checking account information. By using statistical learning methods, we can empirically tell which variables are the most important in default prediction. This can help us construct a structural model at a later stage. On the other hand, if we are only interested in prediction, a reduced form model is sufficient for our concern.

However, we should underline the fact that the application of machine learning methods does not eliminate the necessity of economic understanding. As we will show, the construction of meaningful economic variables is an essential preliminary step for machine learning. What is more, the "important variables" given by machine learning should be taken with a grain of salt. Strobl et al. (1993) show that variable selection in CART (classification and regression trees) is affected by characteristics other than information content, e.g. variables with more categories are preferred. To solve the problem, Strobl et al. (2007) propose an unbiased split selection based on an exact distribution hypothesis. As with all exact procedures, this method is computationally too intensive. Hothorn et al. (2006) propose a more parsimonious algorithm, the conditional classification tree (ctree), which is based on the framework of permutation tests developed by Strasser and Weber (1999). Moreover, an unbiased random forest (conditional random forest, or cforest) is constructed based on ctree. But cforest is still too heavy to be executed on our data. Besides, it is not clear whether unbiasedness in the sense of random forest is still valuable for other machine learning methods. That is to say, it is disputable whether one can find a universally valuable subset of variables which contains the same level of information for any statistical method. Instead of using these computationally expensive methods, we will compare the variables selected by boosting, stepwise selection and lasso. A thorough understanding of these machine learning methods helps to shed light on the interpretation of model selections.

We begin by introducing basic random forest and boosting, as well as some important modifications to accommodate characteristics of our data. Section 3 compares three approaches to treating checking account data, illustrates the importance of economically meaningful variables and shows some particularities of machine learning methods. Section 4 compares the performance of financial ratios and questionnaires with that of account data, and combines the two data sets to achieve better prediction performance. Section 5 performs three model selections, respectively based on AIC, lasso and boosting. We use logistic regression to interpret the marginal effect of the most important variables. Section 6 concludes the article.
Random forest and boosting

For random forest and boosting, the most commonly used basic classifier is the classification tree. Suppose we want to classify a binary variable $Y$ by using two explicative variables $X_1$ and $X_2$. An example of a classification tree is given in Figure 1. The two graphical representations are equivalent, and the tree can be represented in the form

$$\hat{f}(X) = \sum_{m=1}^{M} c_m \, I\{(X_1, X_2) \in R_m\}, \qquad c_m \in \{0, 1\}, \tag{1}$$

where $I$ is the indicator function.

Figure 1 – A simple example of a classification tree (extracted from James et al. (2013))

To grow a tree, the central idea is to choose a loss function and to minimize it with respect to the tree. Friedman et al. (2001) and James et al. (2013) give a full introduction to the most important loss criteria in the context of classification trees. We use the Gini index as the loss function in our research. It should be underlined, however, that it is computationally too expensive to find a globally optimal solution. Instead, in practice one uses a "greedy algorithm" which takes the part of the tree already constructed as given and searches for the optimal solution based on this part. A tree grown in this way is called a CART (classification and regression tree), which was proposed by Breiman et al. (1984) and has become the most popular tree algorithm in machine learning.

The advantage of a tree is obvious: it is intuitive and easy to interpret. Nonetheless, it generally has poor predictive power on the training set and the test set if the model is only mildly fitted. Conversely, overfitting the training set (i.e. an overly reduced bias) is generally not desirable in machine learning. Ensemble methods, such as random forest and boosting, are conceived to solve this dilemma.
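For concreteness, here is a minimal R sketch of growing a single CART with the Gini criterion, as described above; the data frame `accounts` with a binary factor column `default` is a hypothetical placeholder, not the bank's actual data.

```r
# A minimal sketch (not the paper's code): one CART grown with the Gini criterion.
# 'accounts' is a hypothetical data frame whose factor column 'default' is 0/1.
library(rpart)

set.seed(1)
tree <- rpart(default ~ .,
              data    = accounts,
              method  = "class",                      # classification tree
              parms   = list(split = "gini"),         # Gini index as splitting criterion
              control = rpart.control(cp = 0.001, maxdepth = 5))

printcp(tree)                                          # complexity table for pruning
p_default <- predict(tree, accounts, type = "prob")[, 2]   # estimated default probability
```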
Random forest aims at reducing model variance and thus increasing prediction power on the test set. Instead of growing one single tree, we plant a forest. A general description of the algorithm is given in Figure 2. In practice, the optimal value of m is around √p for a classification problem, where p is the total number of variables. We can of course use cross-validation to optimise the value of this parameter. This small value of m looks strange at first sight, but it is in fact the key to random forest. Indeed, for B identically distributed random variables, each with variance σ² and positive pairwise correlation ρ, the variance of their average is

$$\rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2. \tag{2}$$

Even with large B (the number of trees in the case of random forest), we still need to decrease ρ to reduce the variance of the average. The role of a small m is to reduce the correlation ρ across trees, and thus to decrease the model variance.

Figure 2 – Algorithm of random forest (extracted from Friedman et al. (2001))

However, the basic random forest works poorly for our data because it is imbalanced (fewer than 6% of observations defaulted). Several remedies exist for this characteristic, including weight adjustment (Ting (2002)) and stratified sampling (Chen et al. (2004)). We have adopted the stratification method, which is easy to implement and yields satisfying results. Instead of sampling default and non-default observations uniformly for each tree in step 1.(a) (e.g. sampling 2/3 of the observations uniformly), we take 2/3 of the default observations and an equal number of non-default observations. This apparently small modification leads to a tremendous improvement in the confusion matrices. For a given checking account data set with 30 variables, the comparison is shown in Table 2. The test AUCs are respectively 78.72% and 79.…%.
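The stratified sampling just described can be expressed with the `sampsize` and `strata` arguments of the R randomForest package; the following is a hedged sketch in which `X`, `X_test` and the 0/1 factor `y` are assumed names.

```r
# Sketch of the balanced (stratified) random forest; X, X_test and the factor y
# are hypothetical placeholders for the 30 account variables and the default flag.
library(randomForest)

y     <- as.factor(y)
n_def <- floor(2/3 * sum(y == "1"))        # 2/3 of the default observations

set.seed(1)
rf_bal <- randomForest(x = X, y = y,
                       ntree    = 500,
                       mtry     = floor(sqrt(ncol(X))),          # about sqrt(p)
                       strata   = y,                              # stratify sampling by class
                       sampsize = c("0" = n_def, "1" = n_def))    # equal draws from each class

p_test <- predict(rf_bal, newdata = X_test, type = "prob")[, "1"]  # P(default) on the test set
```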
The most commonly used version of boosting is AdaBoost (Freund et al. (1996)). Contrary to random forest, which plants decision trees in parallel, AdaBoost cultivates a series of trees. If an observation is wrongly classified by previous trees, its weight is accentuated in later trees until it is correctly classified. The central idea is intuitive, yet it remained a purely algorithmic notion until Friedman et al. (2000) pointed out the inherent relationship between AdaBoost and the additive logistic regression model:
Theorem 1
The real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$, where the additive logistic regression model is defined as having the following form for a two-class problem:

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \sum_{m=1}^{M} f_m(x). \tag{3}$$

              Training set               Test set
              Imbalanced   Balanced      Imbalanced   Balanced
True value 0  0.057%       25.829%       0.055%       25.588%
True value 1  98.861%      28.599%       99.108%      27.340%
Global        3.830%       25.930%       3.940%       25.660%

Table 2 – Error rates of imbalanced and balanced random forest. False negative rates are extremely high for both the training set and the test set using the imbalanced random forest. In contrast, the error rates using the balanced random forest are much more reasonable.

In the case of boosting trees, the f_m are individual trees adjusted by weights. According to Theorem 1, boosting is by its nature an optimisation process. This insight paves the way for gradient boosting (Friedman (2001)) and its implementation xgboost (Extreme Gradient Boosting), which searches the gradient of the objective function and implements efficiently the basic idea of boosting. Moreover, the intimate relationship between boosting and logistic regression leads to some interesting results which we will discuss later on.

A model is overfitted if it suits the training set well but the test set poorly. In our research, the model performance criterion is the AUC (Area Under the ROC Curve), which measures the discrimination power of a given model. It should be noticed that the AUC is immune to imbalance in the data. Some methods, like random forest, aim at reducing the model variance: by decorrelating the training data and the model, we obtain a model which is less sensitive to data changes. For example, using 30 checking account variables to explain default, we get
AUC = 79.45% for the training set and AUC = 79.85% for the test set with the balanced random forest. Boosting had also been considered to work in this way, but Friedman et al. (2000) point out that boosting seems mainly a bias-reducing procedure. This conclusion is coherent with our experiment. Using the same variables, we get AUC = 87.45% for the training set and AUC = 79.8% for the test set. Boosting has necessarily overfitted the model, but this feature does not undermine its ability to predict the test set.

Additional remarks should be made on the parameters of machine learning methods. While this is not the major concern of this article, it is nonetheless crucial for letting the machine run correctly. One important parameter is related to the complexity of the model, for example the number of candidate variables for each node split in random forest, or the number of learning steps in boosting. Cross-validation is adopted to ensure the appropriate level of complexity and to avoid over-fitting. Appendix C gives an exhaustive explanation of the most important parameters in our models.
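As an illustration of this cross-validation step, the sketch below picks the number of boosting rounds with `xgb.cv` and computes the train/test AUCs with the pROC package; `X_train`, `y_train`, `X_test` and `y_test` are hypothetical objects, and `eta` and `max_depth` follow the values reported in Appendix C.

```r
# Sketch: choosing the number of rounds by cross-validation and measuring the AUCs.
# X_train, y_train, X_test, y_test are hypothetical placeholders.
library(xgboost)
library(pROC)

dtrain <- xgb.DMatrix(as.matrix(X_train), label = y_train)
dtest  <- xgb.DMatrix(as.matrix(X_test),  label = y_test)

params <- list(objective = "binary:logistic",
               eval_metric = "auc",
               eta = 0.01,            # small shrinkage, as in Appendix C
               max_depth = 5)         # tree depth, as in Appendix C

set.seed(1)
cv <- xgb.cv(params = params, data = dtrain, nrounds = 3000,
             nfold = 5, early_stopping_rounds = 50, verbose = 0)

bst <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)

auc_train <- auc(y_train, predict(bst, dtrain))   # in-sample AUC
auc_test  <- auc(y_test,  predict(bst, dtest))    # out-of-sample AUC
```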
Organising checking account data: Three approaches
The current literature does not offer mature approaches for treating checking account data of the kind we can find for financial structure data. In the latter case, corporate finance suggests some particularly useful ratios such as working capital/total assets, retained earnings/total assets, market capitalization/total debt, etc. (Ross et al. (2008)). Defining new features based on checking account data therefore becomes a central issue in our study. We have tried three approaches, detailed below. They will be combined with three different statistical methods (logistic regression, random forest and boosting).
Definition 1

This definition is inspired by Norden and Weber (2010). At the end of each year, which we denote time t, we define the explained variable, default, as the binary indicator of going bankrupt in the following year. The explicative variables are created from monthly account variables over the last two years. These 30 variables are listed in Appendix A and can be classed mathematically into four categories: the difference of a characteristic (e.g. balance, monthly cumulative credits) between the beginning and the end of a period (one or two years); the value of this characteristic at time t; the standard deviation of this characteristic during a certain period; and attributes of the firm (annual sales, sector). The basic idea is to use stock and flow variables for a complete but also concise description of a given characteristic. Moreover, the standard deviation of, for example, monthly cumulative credits allows us to quantify the risk associated with unstable income.

The size of firms may influence the model considerably and in an undesirable way. A firm might have a higher balance than another one only because it is larger: this larger balance does not "reflect" a smaller probability of default. Norden and Weber (2010) used the line of credit as the normalisation variable for the corporate clients of a German universal bank. However, this variable is not available in our research. We thus need to figure out another appropriate normalisation variable. One suggestion is to use information from the balance sheet or the income statement, such as total sales. But larger firms may open accounts in several different commercial banks, so that each account reflects only a fragment of the cash flow information. There thus exists a discrepancy between the size of the account and the size of the firm. In order to capture the account size, we need a variable within the account itself which reflects the account's normal level of vitality. The average of monthly cumulative credits over the last two years responds to these criteria and is used to normalize the variables proportional to account size. Intuitively, monthly cumulative credits are the equivalent of total sales in the context of a checking account, in the sense of total resources.
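To make the construction concrete, here is a small sketch building one evolution variable, one risk variable and the normalisation variable from hypothetical monthly data; the layout of `monthly` (columns `firm`, `month_index` with 0 for month t, and `TCREDIT`) is an assumption, while the time windows follow Appendix A.

```r
# Sketch: building Definition-1 style variables from hypothetical monthly data.
# 'monthly' has columns firm, month_index (0 = month t, -1, ..., -23) and TCREDIT.
library(dplyr)

features <- monthly %>%
  group_by(firm) %>%
  summarise(
    MEAN_TCREDIT_t = mean(TCREDIT[month_index >= -23]),          # normalisation variable over [t-23, t]
    dTCREDIT_t     = TCREDIT[month_index == 0] -
                     TCREDIT[month_index == -12],                 # one-year evolution of credits (var16)
    cv_TCREDIT_t   = sd(TCREDIT[month_index >= -11]) /
                     mean(TCREDIT[month_index >= -11])            # instability of income over [t-11, t]
  ) %>%
  mutate(dTCREDIT_norm = dTCREDIT_t / MEAN_TCREDIT_t)             # normalised by account size
```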
Definition 2

As in Definition 1, we still use account information from the past year to predict default in the coming year. But the explanatory variables used in the statistical methods are built in a much more "computer science" way. Instead of using the economic intuitions above to organise the raw information, we rely on automatic methods to build the model inputs. 50 variables are first summarised from the raw monthly information, and then interact with each other using the four basic arithmetic operations. Together with some raw variables, the data set contains around 5000 variables in total. It should be noticed that these combinations are usually not intuitively interpretable. While it might be possible to give some far-fetched explanation for "average monthly balance/cumulative number of intended violations", it is far more difficult to interpret other variables.

One might argue that simple arithmetic interactions are not capable of exhausting the possible meaningful combinations of raw information, making this approach unrepresentative. However, it should first of all be noticed that boosting with 5000 variables is already computationally expensive for an ordinary computer. In practice, we launch the boosting for each kind of arithmetic interaction and select the most important variables according to their contributions to the Gini index. These variables are then used to run a final and lighter boosting with around 200 variables. Secondly, it is simply computationally impossible to exhaust the most meaningful combinations. Suppose we want to create automatically the 30 variables in Definition 1. These variables are based on more than 10 basic monthly variables (e.g. TCREDIT, monthly number of violations), i.e. more than 120 variables if we take the month into consideration. Var16 is the difference of TCREDIT between time t and t−12 (subtraction of 2 variables), while var9 is the sum of the monthly number of violations during one year (sum of 12 variables). This simple example shows that, for a new variable, there is no a priori limit on the number of participating raw monthly variables. That is to say, any variable among the 120 variables might be included in or excluded from the combination. The number of possible forms of combination is astronomical (of the order of 2^120), even if we allow only one arithmetic operation, for example addition, let alone other forms of operations. Thirdly, there is no reason to delimit a priori a set of reasonable operations. The use of the standard deviation of TCREDIT (var24), for example, is based on the intuition of the stability of revenue. It is not reasonable, however, to include a priori this operation, which is more complicated than sine, cosine or other simple functions, in the set of reasonable operations, if we investigate the question in a purely mathematical way.
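A minimal sketch of this automatic approach: generating all pairwise arithmetic interactions of a hypothetical data frame `base` of roughly 50 summarised variables, which already yields on the order of 5000 candidate features.

```r
# Sketch: pairwise arithmetic interactions of a hypothetical set of base variables.
# 'base' is a data frame of roughly 50 numeric summary variables.
make_interactions <- function(base) {
  vars <- colnames(base)
  out  <- list()
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        a <- base[[i]]
        b <- base[[j]]
        out[[paste(vars[i], "plus",  vars[j], sep = "_")]] <- a + b
        out[[paste(vars[i], "minus", vars[j], sep = "_")]] <- a - b
        out[[paste(vars[i], "times", vars[j], sep = "_")]] <- a * b
        out[[paste(vars[i], "over",  vars[j], sep = "_")]] <- a / ifelse(b == 0, NA, b)
      }
    }
  }
  cbind(base, as.data.frame(out))
}
# With 50 base variables this gives 50*49/2 pairs and 4 operations, i.e. about 4900 new columns.
```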
Definition 3

Similar to Definition 1, this definition is also economically interpretable. In contrast, we create only 5 variables, which are highly discretized. Four of them are binary; the fifth has three categories. These variables are listed in Appendix B.
The performance, measured by test AUC, is given in Table 3. We have selected the 20 best variables in Group 1 and Group 3, respectively by AIC and by variable importance in boosting. The 5 best variables in Group 2 are chosen according to variable importance in boosting. Despite the difference in variable selection methods, all the variables in Group 2 are included in Group 1. Among the 20 variables in Group 3, three variables are not available for most of the observations (> …) and are eliminated for random forest and for logistic regression.

Table 3 – Test AUCs of four groups of account data (3 definitions) in logit, random forest and boosting. The 20 variables in Group 1 and Group 3 are selected respectively by AIC and by variable importance in boosting. The 5 best variables in Group 2 are chosen according to variable importance in boosting. All the variables in Group 2 are included in Group 1. Among the 20 variables in Group 3, three variables are not available for most of the observations (> …) and are eliminated for random forest and for logistic regression.

The discretized variables of Definition 3 perform notably worse in random forest. It seems to us that discretization is the reason for this. While it is a common approach to discretize continuous variables for logistic regression, because this can create a certain kind of non-linearity of a given explanatory variable within the linear framework, it nonetheless reduces the information contained in the variable. The discretization is especially detrimental for the imbalanced random forest: AUC = 46.81% suggests a performance worse than randomly distributed classes and should be considered as a pathology. Even the balanced random forest performs worse than logistic regression. In fact, the individual trees grown in a random forest are usually very deep (depth > …, compared with depth = 5 in our setting). As Friedman et al. (2001) suggest, experience so far indicates that 4 ≤ depth ≤ 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. In any case, it is unlikely that a depth > 10 will be required.
This probably suggests that boosting relies much less heavily on the variables' ability to offer potential splits, making it less sensitive to discrete variables.

In fact, using stumps (depth = 2) is already sufficient for yielding good predictions. Using all the 30 variables in Definition 1, the AUCs are respectively 79.47% for depth = 2 and 79.82% for depth = 5. (It should be remarked, however, that the optimal number of rounds validated by cross-validation is higher in the case depth = 2: 2811 and 997 rounds respectively for depth = 2 and depth = 5, with the other parameters fixed according to Appendix C.) In the case of M stumps, the additive logistic regression model becomes

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \sum_{m=1}^{M} \alpha_m f_m(x_m),$$

where each $f_m$ depends on a single variable. With deeper trees (depth > 2), the approximation is extended to multivariate functions. So the advantage of boosting over logistic regression seems to be the capacity of the former to take non-linearity into consideration. This clearly explains why boosting is mainly a bias-reducing method, as mentioned by Friedman et al. (2001).

Does the out-performance of boosting and balanced random forest also imply their superiority in identifying a rich data set? Comparing Group 1 and Group 3, we can remark that the logit AUC is higher in Group 3, while the boosting AUC is lower. If we trusted logistic regression for prediction, we should conclude that Group 3 contains more information than Group 1 and that machine learning methods such as boosting are not reliable for distinguishing a rich data set from a poorer one. However, looking at Group 2, we can easily reverse this conclusion. The logit AUC in Group 2 is nearly the same as that in Group 3, while Group 2 apparently contains less information than Group 1 because all the variables in Group 2 are included in Group 1. Instead, a plausible explanation for the low logit AUC in Group 1 is the multicollinearity between explanatory variables (James et al. (2013)). Boosting and random forest, in contrast, split each node by individual variable and should not be impacted by this haunting multicollinearity. With fewer variables (Group 2), the prediction accuracy is higher in logit. This phenomenon probably suggests that logit cannot "digest" rich information well because of its restrictive linear form. It is thus more reliable to use the AUCs of machine learning methods as a measure of the information contained in a data set.

The close relationship between boosting and logistic regression explains some results which may seem strange at first sight. The higher logit AUC in Group 3 compared with Group 1 should be interpreted in light of the model selection method: "good variables" in the sense of boosting should generally be "good" in the sense of logit. It is thus not surprising to find that 20 variables selected from about 5000 variables work better in logit than 20 variables selected from 30. On the other hand, the same variables have a lower AUC in balanced random forest than in boosting (76.67% vs 78.…%).
Checking account data versus financial and managerial data

Traditional reduced form methods for default prediction mainly focused on the financial structure of enterprises (Altman (1968), Beaver (1966) and Ohlson (1980)), as the financial structure does reflect to a large extent the solvability of enterprises and is relatively more available than real-time account information. What is more, in a reduced form method, as we merely try to match a pattern to the data (Fayyad et al. (1996)) without worrying much about causality, the problem of endogeneity is not a primary concern. But once we want to obtain some causal interpretation, financial structure data may suffer from endogeneity and should be carefully interpreted as a "cause" of credit default. On the other hand, we should remark the difference between book value and market value, and the accounting principle associated with this difference (Ross et al. (2008)). For small and medium-sized enterprises, market values are simply not available because these firms usually do not sell any marketed securities, while their book values are historic and subject to accounting manipulations.

Commercial banks have both a necessity and an advantage in the analysis of credit default. The possession of corporate account information helps them acquire a more direct and "frank" image of the firm's account. Not only may the information be more reliable, but it is also more real-time. The balance sheet and income statement are produced once a year by firms, while checking account information can theoretically be daily. In practice, we use monthly variables as raw variables for the purpose of simplification. This allows commercial banks to supervise the solvability of corporate borrowers on a more frequent basis. Given the advantage of checking account information, we should expect a better prediction performance based on checking account data. This is represented in the first two columns of Table 4. The AUCs based on account data in balanced random forest and boosting are significantly larger than those based on financial and managerial data. (One might argue that this superiority is simply due to more explicative variables. In fact, with the same number of variables (11), the boosting AUC of account data is 79.…%, the same as that of Group 1 and significantly less than that of Group 6.)

Table 4 – The AUCs of checking account data, financial and managerial data, and merged data in logit, random forest and boosting. Group 6 comes from the fusion of Group 1 and Group 5. This merged data has the best performance in balanced random forest and boosting.

We can thus conclude that the three sources of information are complementary, which corresponds to our intuition about the real functioning of enterprises. First, the checking account information is a reflection of a firm's cash flow, which is most directly related to a firm's solvability. Second, financial ratios illustrate the firm's financial structure and its ability to earn profits. We should remark that the financial ratios we used are primarily concerned with the firm's profitability and expenses (interest expenses, earnings before interest and taxes, etc.) and are more tightly related to cash flow, which is also the case for Atiya (2001). Third, other non-financial factors should be taken into consideration, for example the managerial expertise of executives.

Of course, this is not a complete list of all the factors which are related to credit default. Some macroeconomic factors, for example, can additionally be taken into account.
We have observed a decreasing quarterly default rate during 2013–2014, which might be explained by the decreasing interest rates in Europe during the same period. If we use the data from 2009 to 2012 as the training set, and the data from 2013 to 2014 as the test set, the statistical pattern works less well for defaults at the end of 2013 and at the beginning of 2014.
Model selection and interpretation

Because of the multicollinearity problem among the 30 variables of Definition 1, a variable selection process is needed in order to obtain and interpret the marginal effect of each prominent variable by logistic regression. The list of important variables in Definition 1 is shown in Figure 3. We can see that, according to boosting, the most important variables are especially related to the number of violations (var9, var11, var13) and to the current status (var27, var32, var33, var34). Intuitive as it is, this variable importance in the sense of boosting should be taken with a grain of salt. For example, var10 (the number of intended violations during the preceding year) is ranked 7 in the importance list: does this mean it is individually highly discriminant? In fact, if we draw the two conditional distributions (conditioned on default) of each variable and calculate their individual AUCs, which reflect their individual discriminatory power, the AUC of var10 is only 68.…%: var9 measures the same quantity over the more recent period [t−11, t] and naturally has a better discriminatory power than var10 (AUC = 72.68% vs AUC = 68.…%).

Figure 3 – Variable importance of the 30 variables in Definition 1 according to boosting
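The individual discriminatory power mentioned above can be computed as the AUC of a single variable against the default indicator, for example with the pROC package; the column names `default`, `var9` and `var10` in the hypothetical data frame `d` mirror the notation of the text.

```r
# Sketch: discriminatory power of a single variable, measured by its AUC against default.
# 'd' with columns default, var9 and var10 is a hypothetical layout.
library(pROC)

auc_var9  <- auc(roc(d$default, d$var9,  quiet = TRUE))
auc_var10 <- auc(roc(d$default, d$var10, quiet = TRUE))
```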
In order to be more rigorous about variable selection, we have tried two other methods which are based on logistic regression: stepwise selection and lasso. For stepwise selection, AIC was used as the criterion. Forward and backward selection have generated the same 8 variables, marked in Table 5. In order to compare the different model selection methods, we have adjusted λ in lasso so as to yield exactly 8 non-zero coefficients. These 8 variables are also summarised in Table 5. Remark that 7 among the 8 variables are identical to those selected by AIC. Thus, for our data, there is no apparent difference between stepwise selection and lasso. In contrast, 4 among the 8 variables selected by boosting differ from those of lasso and stepwise selection! These variables all belong to the variables of current status.

It is difficult to confirm by experiment the reason for this difference. Our intuition centers on the multicollinearity between the 4 variables favored by boosting. Table 6 shows the Spearman correlation between these variables. (It would be more appropriate to calculate the Pearson correlation, because we are interested in linear correlation in the case of logistic regression. However, this correlation is not stable with respect to manipulations such as the elimination of missing values or extreme values. The Spearman correlation, on the other hand, seems to be quite stable under data manipulations, which shows an advantage of tree-based methods: they depend on ordinal properties of variables instead of cardinal ones.) It seems to us that, because of its restrictive linear form, logit is incapable of disentangling the interweaving information contained in these variables. On the contrary, boosting seems able to digest this intricate information. This hypothesis is loosely confirmed by the regression results in Tables 7 and 8. All the coefficients of the variables selected by AIC are significantly different from zero at the 0.1% level. The 4 variables favored by boosting (var27, var32, var33, var34), on the other hand, are less significant: var27 and var33 are significant at the 5% level, while var32 and var34 are not significant. We should remark, however, that all the signs of these 4 variables correspond to our intuition and that var32 and var34 are not far from being significant (P values = 20.89% and 11.23% respectively). This is a common symptom of multicollinearity, because it increases the variances of the related estimated coefficients and renders the coefficients insignificantly different from zero.

Returning to the regression in Table 7, several insights can be gained from the marginal effects of these variables. First, var9, var11 and var13 are always among the best variables in any method that we have used (this is also valid for the balanced random forest, whose variable selection we have not presented). These variables concern intended or rejected violations of the credit line. The negative sign of var11 (number of rejected violations) should not be regarded as counter-intuitive, because of the presence of var9 (number of intended violations) and its positive coefficient, which is larger than that of var11 in absolute value. This suggests that a larger number of violations, whether rejected or not, indicates a higher probability of default. We have used the amount of violations instead of the number of violations to construct var13, in order to capture more precisely the confidence granted to each client by the bank advisor. This variable seems to work particularly well, in the sense that the same variable on the previous year, var14, is also included by stepwise selection and by lasso. This suggests that front line staff have acquired important experience and intuition in distinguishing solvent clients from insolvent ones. This experience may be hard to formalise, but it is truly valuable and should be paid attention to. Second, the risk of default is intimately related to the risk of income. As var24 (standard deviation of cumulative monthly credits) shows, the more unstable the income, the more likely the firm is to default. Var31 (cumulative monthly credits at month t) is also related to credits and decreases the default probability through higher income. Credits, rather than debits, should be considered more seriously as the source of default. Norden and Weber (2010) point out that there exists a very strong correlation between debits and credits and that the latter should be considered as the constraint of the former.
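A hedged sketch of the two logistic-regression-based selections: stepwise search driven by AIC via `step()`, and a lasso path with glmnet from which one picks a λ giving exactly 8 non-zero coefficients; `X` (the 30 variables of Definition 1) and the 0/1 response `y` are assumed names.

```r
# Sketch: stepwise selection (AIC) and lasso among the 30 variables of Definition 1.
# X (data frame of the 30 variables) and y (0/1 default) are hypothetical placeholders.
library(glmnet)

dat  <- cbind(X, default = y)
null <- glm(default ~ 1, data = dat, family = binomial)
full <- glm(default ~ ., data = dat, family = binomial)
step_fit <- step(null, scope = list(lower = null, upper = full),
                 direction = "both", trace = 0)          # AIC-driven stepwise search

lasso <- glmnet(as.matrix(X), y, family = "binomial", alpha = 1)
# Pick the largest lambda on the path that keeps exactly 8 non-zero coefficients.
lambda_8 <- lasso$lambda[which(lasso$df == 8)[1]]
coef(lasso, s = lambda_8)
```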
Table 5 – Variable selection among the 30 variables of Definition 1 according to boosting, stepwise selection and lasso. Stepwise selection with the AIC criterion (both forward and backward) has yielded 8 variables. For the purpose of comparison, we have chosen the 8 best variables in boosting. For lasso, we have adjusted the parameter λ so as to yield exactly 8 non-zero coefficients.

Table 6 – Spearman correlation between var27, var32, var33, var34
              Coefficient   Standard deviation   P value    Significance
(Intercept)   -6.670e-01    1.294e-01            2.55e-07   ***
var9          2.060e-03     1.469e-04            < …
…

Table 7 – Logistic regression using variables selected by stepwise selection
              Coefficient   Standard deviation   P value    Significance
(Intercept)   -1.128e+00    7.204e-02            <2e-16     ***
var9          1.433e-03     1.146e-04            < …
…

Table 8 – Logistic regression using variables selected by boosting
Table 9 – Transformation of the sector variable to a numeric variable according to average default rate
An increase in expenses might be the direct reason for default, but a decrease or instability of income may be more fundamental. Third, different economic sectors clearly have different default rates. We have constructed var29 (sector) by using a theorem from Shih (2001); see Appendix D for the details. This theorem allows us to transform a categorical variable into a discrete numeric variable for classification trees. The corresponding numeric values of the sectors are shown in Table 9. Higher values are associated with higher average default rates. This is also validated by the logistic regression in Table 7. Fourth, larger firms are less likely to default, as they are more mature than startups. Commercial banks have reason to be unwilling to lend money to startups, which in some cases might need to seek investment from venture capital funds or angel investors.
Conclusion

We have investigated the relationship between corporate checking accounts and credit default and shown that account information outperforms traditionally used financial ratios in predicting default for our data sets. This result aligns with our understanding of default as a phenomenon of liquidity. Checking account information reflects a more direct and real-time status of the firm's cash flow and is a privilege of commercial banks when the firm's market value is not available. Banks can exploit economies of scale and use information on the firms' checking accounts to make reasonable decisions on corporate loans. Despite the importance of this subject, there is currently little literature except Norden and Weber (2010), Mester et al. (2007) and Jiménez et al. (2009). Inspired by their work, we have investigated a broader range of explicative variables and systematically compared the performance of different data sets using statistical learning methods. We have shown that these methods, together with the AUC criterion, are more accurate and reliable approaches for measuring the information contained in data sets than logistic regression. While the latter often suffers from multicollinearity, machine learning methods such as random forest and boosting make use of these variables separately and are capable of disentangling intricate information. By using random forest and boosting, we have significantly increased the prediction accuracy. Tree-based methods have other advantages, such as being immune to extreme values.

We should remark particularly, however, that a successful statistical learning process is achieved with human expertise. Meaningful economic variables must first of all be created based on the raw checking account information, just as the pioneers of corporate finance created financial ratios based on the balance sheet and income statement. We also need to normalise these variables so as to eliminate the effect of account size. As we have shown, it is technically not possible (and epistemologically unacceptable for some) to create explicative variables which contain the same level of concise information simply by an automated program. The 30 variables created by Definition 1 need to be perfected by eliminating about one half of the less useful variables and adding other potentially important indicators. But even at this early stage, the importance of human expertise in financial study is illustrated.

Financial ratios and the managerial questionnaire are nonetheless still important in predicting credit default. By combining them with checking account data, the model achieves the best prediction performance and outperforms any model based on a single data set. This suggests a certain kind of orthogonality between the information of the different data sets: the financial structure, profitability, and managerial experience should be considered in parallel with checking account information in a reduced form model.

Through careful model selection, we have shown some particularities of boosting in selecting important variables. We have used the 8 most important variables given by stepwise selection to gain intuition about the mechanism of default. Violations of the credit line, whether rejected or not, are particularly good indicators of upcoming default. Moreover, front line advisors seem to have notable experience in distinguishing acceptable violations, which is reflected in the percentage of the permitted amount of violations.
While default is at first sight due to excessive expenses, Norden and Weber (2010) and we have focused on the importance of credits. A low level of income, as well as instability of income, significantly increases the default rate.

Our research has adopted rigorous statistical methods to obtain a well-performing prediction model based on checking accounts and to identify key indicators in this data by an inductive methodology. We have had a thorough discussion of the mechanisms of these methods, which have significant implications for the results. This enriches the scarce literature on this topic and can provide suggestions to banks for their decisions on corporate loans. Further research may try to identify other key factors in checking account information or construct a structural model for the credit default of small and medium-sized enterprises.
References
Edward I Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4):589–609, 1968.

Amir F Atiya. Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4):929–935, 2001.

William H Beaver. Financial ratios as predictors of failure. Journal of Accounting Research, pages 71–111, 1966.

Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, pages 637–654, 1973.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

Chao Chen, Andy Liaw, and Leo Breiman. Using random forest to learn imbalanced data. University of California, Berkeley, pages 1–12, 2004.

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37, 1996.

Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.

Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 6. Springer, 2013.

Gabriel Jiménez, Jose A Lopez, and Jesús Saurina. Empirical analysis of corporate credit lines. Review of Financial Studies, 22(12):5069–5098, 2009.

Hayne E Leland. Structural models in corporate finance. Princeton University Bendheim Lecture Series in Finance, 2006.

Francis A Longstaff, Sanjay Mithal, and Eric Neis. Corporate yield spreads: Default risk or liquidity? New evidence from the credit default swap market. The Journal of Finance, 60(5):2213–2253, 2005.

Robert C Merton. On the pricing of corporate debt: The risk structure of interest rates. The Journal of Finance, 29(2):449–470, 1974.

Loretta J Mester, Leonard I Nakamura, and Micheline Renault. Transactions accounts and loan monitoring. Review of Financial Studies, 20(3):529–556, 2007.

Frederic S Mishkin and Stanley G Eakins. Financial Markets and Institutions. Pearson Education India, 2006.

Lars Norden and Martin Weber. Credit line usage, checking account activity, and default risk of bank borrowers. Review of Financial Studies, 23(10):3665–3699, 2010.

James A Ohlson. Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, pages 109–131, 1980.

Stephen A Ross, Randolph Westerfield, and Bradford D Jordan. Fundamentals of Corporate Finance. Tata McGraw-Hill Education, 2008.

Yu-Shan Shih. Selecting the best categorical split for classification trees. Statistics and Probability Letters, 54:341–345, 2001.

Helmut Strasser and Christian Weber. On the asymptotic theory of permutation statistics. 1999.

Carolin Strobl, Achim Zeileis, Anne-Laure Boulesteix, and Torsten Hothorn. Variable selection bias in classification trees and ensemble methods. In Book of Abstracts, page 159. Citeseer, 1993.

Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501, 2007.

Kai Ming Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659–665, 2002.

Jon Williamson. The philosophy of science and its relation to machine learning. In Scientific Data Mining and Knowledge Discovery, pages 77–89. Springer, 2009.
A Variable Definition 1
For simplicity, variables are abbreviated according to Table 10.

Abbreviation       Explanation
MIN_BAL            monthly minimum account balance
MAX_BAL            monthly maximum account balance
MEAN_BAL           monthly average account balance
MEAN_CRBAL         monthly average credit balance
MEAN_DBBAL         monthly average debit balance
TCREDIT            monthly total credits
TDEBIT             monthly total debits
INT_CNVIOL         cumulative number of intended violations from the beginning of the year
REJ_CNVIOL         cumulative number of rejected violations from the beginning of the year
INT_CAVIOL         cumulative amount of intended violations from the beginning of the year
REJ_CAVIOL         cumulative amount of rejected violations from the beginning of the year
MEAN_TCREDIT_t     mean of TCREDIT during the period [t−23, t], used for normalisation

Table 10 – Variable abbreviations
The 30 variables defined in Definition 1 are built by applying the operations in Table 11. Their definition formulas are shown in Table 12.

Operation      Meaning
X_t            value of X at month t
ΔX_t           X_t − X_{t−…}
ΔΔX_t          X_t − X_{t−…}
mean_t(X)      mean of X during the period [t−11, t]
sd_t(X)        standard deviation of X during the period [t−11, t]

Table 11 – Operations for creating variables

(One might wonder why the variables are not numbered from 1 to 30. This is purely a historical matter: we built a first version of the 30 variables before modifying them to obtain the version presented here.)

Variable category               Numerator                                     Denominator            Use normalisation
Evolutions – Balance            ΔΔ MIN_BAL_t                                  MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_BAL_t                                 MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_CRBAL_t                               MEAN_TCREDIT_t         YES
                                ΔΔ MEAN_DBBAL_t                               MEAN_TCREDIT_t         YES
Evolutions – Violations         Δ INT_CNVIOL_t                                –                      NO
                                Δ INT_CNVIOL_{t−…}                            –                      NO
                                Δ REJ_CNVIOL_t                                –                      NO
                                Δ REJ_CNVIOL_{t−…}                            –                      NO
                                Δ INT_CAVIOL_t − Δ REJ_CAVIOL_t               Δ INT_CAVIOL_t         NO
                                Δ INT_CAVIOL_{t−…} − Δ REJ_CAVIOL_{t−…}       Δ INT_CAVIOL_{t−…}     NO
Evolutions – Balance vitality   Δ (MAX_BAL − MIN_BAL)_t                       MEAN_TCREDIT_t         YES
Evolutions – Credits & Debits   Δ TCREDIT_t                                   MEAN_TCREDIT_t         YES
                                Δ TCREDIT_{t−…}                               MEAN_TCREDIT_t         YES
                                Δ TDEBIT_t                                    MEAN_TCREDIT_t         YES
                                Δ TDEBIT_{t−…}                                MEAN_TCREDIT_t         YES
                                Δ (TCREDIT/TDEBIT)_t                          –                      NO
                                Δ (TCREDIT/TDEBIT)_{t−…}                      –                      NO
Risk – Balance stability        sd_t(MEAN_BAL)                                mean_t(MEAN_BAL)       NO
                                sd_{t−…}(MEAN_BAL)                            mean_{t−…}(MEAN_BAL)   NO
Risk – Credits stability        sd_t(TCREDIT)                                 mean_t(TCREDIT)        NO
                                sd_{t−…}(TCREDIT)                             mean_{t−…}(TCREDIT)    NO
Actual                          MEAN_BAL_t                                    MEAN_TCREDIT_t         YES
                                MEAN_CRBAL_t                                  MEAN_TCREDIT_t         YES
                                MEAN_DBBAL_t                                  MEAN_TCREDIT_t         YES
                                TCREDIT_t                                     MEAN_TCREDIT_t         YES
                                TDEBIT_t                                      MEAN_TCREDIT_t         YES
                                MIN_BAL_t                                     MEAN_TCREDIT_t         YES
                                MAX_BAL_t                                     MEAN_TCREDIT_t         YES
Attributes                      sector                                        –                      NO
                                total sales                                   –                      NO

Table 12 – The 30 variables defined in Definition 1. "Use normalisation" refers to the normalisation by MEAN_TCREDIT_t.

B Variable Definition 3
The 5 discrete variables are defined in Table 13.

Variable before discretization                                                Discrete classes
sum of MEAN_CRBAL during [t−2, t]                                             ≤ a €; > a €
sum of the monthly intended number of violations S1 and of the
monthly rejected number of violations S2, during [t−2, t]                     three classes defined by thresholds on S1 and S2
MEAN_CRBAL_t / MEAN_CRBAL_{t−…}                                               < b; ≥ b
history of the relationship with the bank (years)                             < c years; ≥ c years

Table 13 – The 5 variables defined in Definition 3. The exact values of a, b and c are not presented because of a confidentiality agreement.

C Parameters in Random Forest and Boosting
In our research, the parameters of random forest and boosting are set in the following way:

• Random Forest (R package randomForest)
  – The number of candidate variables for node splitting ("mtry"): for a classification problem, √p is the "standard" choice. We can also use cross-validation to determine the value.
  – Balanced random forest: use the parameter "sampsize" for stratified sampling. If the forest is well balanced, the minimal number of observations in each node ("nodesize") should not greatly influence the prediction power.

• Boosting (R package xgboost)
  – Choose the number of rounds ("nrounds") by cross-validation.
  – Choose a sufficiently small number for the shrinkage parameter ("eta"). For our data, the performance is stable when eta < 0.…; eta = 0.01 is used in our program.
  – Maximum depth of each tree ("max_depth"): between 4 and 8. We have used max_depth = 5.
  – The proportion of observations used for each tree ("subsample") does not greatly influence the prediction performance. We have used subsample = 0.…
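Assembled into code, these settings look roughly as follows; `X`, `y`, `n_def` and `dtrain` are hypothetical objects, and the `subsample` value is only an example since the exact figure is not restated here.

```r
# Sketch: the parameter settings listed above, assembled into R calls.
# X, y, n_def and dtrain are hypothetical objects; subsample = 0.5 is only an example value.
library(randomForest)
library(xgboost)

rf <- randomForest(x = X, y = as.factor(y),
                   mtry     = floor(sqrt(ncol(X))),     # "standard" choice sqrt(p)
                   strata   = as.factor(y),
                   sampsize = c(n_def, n_def))          # balanced, stratified sampling

params <- list(objective = "binary:logistic",
               eta       = 0.01,                        # small shrinkage parameter
               max_depth = 5,                           # between 4 and 8
               subsample = 0.5)                         # example value
cv  <- xgb.cv(params = params, data = dtrain, nrounds = 3000,
              nfold = 5, early_stopping_rounds = 50, verbose = 0)
bst <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)
```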
D Theorem for transforming the sector variable to a discrete numeric variable

Theorem 2
Suppose there are two classes, class 1 and class 2. Let X be a categorical variable taking values in {1, 2, ..., L}, where the categories are ordered in increasing p(1 | X = i) values. If φ is a concave function, then one of the L − 1 splits X ∈ {1, 2, ..., l}, where 1 ≤ l < L, minimizes

$$p_{\text{Left}}\,\varphi(p_{\text{Left}}) + p_{\text{Right}}\,\varphi(p_{\text{Right}}).$$
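A short sketch of the transformation this theorem justifies: order the sectors by their average default rate and replace the categorical variable by the corresponding rank; the data frame `d` with columns `sector` and `default` is a hypothetical layout, and `var29` follows the naming of Definition 1.

```r
# Sketch: numeric recoding of the sector variable, ordered by average default rate.
# 'd' with columns sector (factor) and default (0/1) is a hypothetical layout.
sector_rate <- tapply(d$default, d$sector, mean)      # average default rate per sector
sector_rank <- rank(sector_rate)                       # higher value = higher default rate
d$var29     <- sector_rank[as.character(d$sector)]     # numeric variable used by the trees
```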