An Automatic Interaction Detection Hybrid Model for Bankcard Response Classification
Yan Wang
Graduate College, Kennesaw State University, Kennesaw, Georgia
Xuelei Sherry Ni
Department of Statistics and Analytical Sciences, Kennesaw State University, Kennesaw, Georgia
Brian Stone
Atlanticus Services Corporation, Atlanta, Georgia
Abstract: Data mining techniques have numerous applications in bankcard response modeling. Logistic regression has been the standard modeling tool in the financial industry because of its consistently good performance and its interpretability. In this paper, we propose a hybrid bankcard response model that integrates decision tree based chi-square automatic interaction detection (CHAID) into logistic regression. In the first stage of the hybrid model, CHAID analysis is used to detect potential variable interactions. In the second stage, these potential interactions serve as additional input variables in logistic regression. The motivation for the proposed hybrid model is that adding variable interactions may improve the performance of logistic regression. Theoretically, all possible interactions could be added to the logistic regression and the significant ones identified by feature selection procedures. However, even stepwise selection is very time-consuming when the number of independent variables is large, and it tends to cause the p >> n problem. Using CHAID analysis to detect variable interactions has the potential to overcome these drawbacks. To demonstrate its effectiveness, the proposed hybrid model is evaluated on a real credit customer response data set. The results show that, by identifying potential interactions among independent variables, the proposed hybrid approach outperforms logistic regression without interaction search in terms of classification accuracy, the area under the receiver operating characteristic (ROC) curve, and the Kolmogorov-Smirnov (KS) statistic. Furthermore, CHAID analysis for interaction detection is much more computationally efficient than the stepwise search mentioned above, and some of the identified interactions are shown to have statistically significant predictive power on the target variable. Finally, the customer profile created from the CHAID tree provides a reasonable interpretation of the interactions, which is required by regulations in the credit industry. Hence, this study provides an alternative for handling bankcard classification tasks.
Keywords: decision tree; CHAID; hybrid; logistic regression; bankcard response modeling; credit risk modeling
I. INTRODUCTION
This work was supported by Atlanticus Services Corporation, Atlanta, GA, USA.
Recently, financial institutions and banks have been experiencing serious competition. They have started to pay close attention to the credit risk and bankcard response of their customers, since inappropriate credit decisions may result in huge losses. When handling credit card applications or bankcard marketing campaigns, financial institutions usually adopt models to evaluate the applicants or to find strategies for targeting consumers. Hence, many statistical methods and machine learning tools, including Bayesian probability models [1], support vector machines and neural networks [2], and classification and regression trees [3], have been used to support credit and bankcard scoring development.
A review of the bankcard response modeling and credit risk scoring literature [4] [5] shows that discriminant analysis and logistic regression are the two most widely used techniques for building bankcard response models and credit risk models. Compared with discriminant analysis, logistic regression has the advantage that it can be applied even if the variables are not normally distributed [6]. Therefore, logistic regression has served as a good alternative to discriminant analysis in bankcard response and credit scoring modeling.
Chi-square automatic interaction detection (CHAID) analysis is an algorithm created by Gordon V. Kass in 1980 that discovers relationships between independent variables and a categorical outcome [7]. Chi-square tests are applied at each stage of building the CHAID tree, and Bonferroni corrections are usually used to account for the multiple testing that takes place [8].
In general, CHAID analysis can be used for prediction and classification as well as for detecting interactions between variables, with applications such as disease classification [9], financial distress prediction [10], and risk assessment [4] [11].
Several studies have used feature selection approaches to improve model performance. For instance, a combined strategy of feature selection approaches, including linear discriminant analysis, rough set theory, decision trees, and a support vector machine classification model, was proposed for credit scoring [12]. An evolutionary feature selection approach was applied in a case study of credit approval data [13]. It is worth mentioning that, when implementing feature selection approaches, much research focuses only on the features provided by the original data sets. Limited research aims at incorporating variable interactions into the feature selection approach. Focusing on variable interactions that could potentially improve model performance, this study first uses an efficient method to detect variable interactions and then incorporates these interactions into feature selection in the subsequent modeling stage.
Considering the above-mentioned research, the current study proposes a hybrid model that integrates decision tree based CHAID analysis into logistic regression. The first stage of the hybrid model is the CHAID analysis, which takes advantage of the CHAID tree to detect potential variable interactions. The second stage incorporates these interactions as additional features in logistic regression, and the most significant features are selected by a stepwise feature selection approach. The proposed hybrid model is expected to perform better than the pure logistic regression model, which does not contain any variable interactions.
The effectiveness and feasibility of the proposed hybrid model are evaluated using the credit customer response data through cross-validation. The performance of the hybrid model is compared with that of the pure logistic model in terms of classification accuracy, area under the curve (AUC), and the Kolmogorov-Smirnov (KS) statistic.
This paper is structured as follows. Since the bankcard response model is used to demonstrate the effectiveness of the proposed hybrid model, we first review related work in Section II. Section III presents the experimental materials and methods, including the data description, data pre-processing, CHAID analysis for the detection of interactions, modeling, and evaluation. The experimental results and discussions are elaborated in Section IV. Finally, Section V is devoted to the conclusions.
II. RELATED WORK
We review the literature on commonly used techniques in bankcard response modeling and credit scoring modeling in this section.
A. Logistic Regression
Logistic regression is a widely used statistical modeling technique for dichotomous outcomes. As a multivariate method, logistic regression can analyze many independent variables that have potential relationships with the dependent variable. In the credit union environment, logistic regression is as efficient and accurate as other techniques such as discriminant analysis and neural networks. Furthermore, logistic regression models can determine the conditional probability of a specific observation belonging to a class, given the information on this observation. Hence, logistic regression provides a better understanding of the distribution of financial risk than discriminant analysis [14]. As a result, logistic regression has been explored widely in building credit scoring models and bankcard response models.
B. Chi-square Automatic Interaction Detection
CHAID analysis is one of the main decision tree techniques and presents its result as a tree structure. The construction of the tree stops whenever no significant chi-square value is found between the dependent variable and the factors. Thus, the nodes with higher chi-square values come first in the tree, whereas the terminal nodes carry the lowest chi-square values [15]. CHAID analysis has been widely used in classification and regression studies such as hazard analysis [16], medical research [17], and market segmentation [18]. It has also been explored in building credit risk models and has obtained promising results in terms of predictive accuracy and type II errors [19].
Since the resulting CHAID tree is based on logical relationships between independent variables, CHAID analysis has also been used for interaction detection in some research [20]. However, most recent research on CHAID analysis focuses on using it as a modeling tool for regression or classification problems rather than as a technique for detecting variable interactions.
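As a minimal sketch of the significance test that drives a CHAID split, the following Python snippet computes the Pearson chi-square statistic for a contingency table of a three-category predictor against a binary response and applies a Bonferroni adjustment. The counts, the number of candidate tests (`n_tests`), and the function names are hypothetical and are not taken from this study. The closed-form p-value used here holds only for two degrees of freedom, which is the case for a 3 x 2 table.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_two_df(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees
    of freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-statistic / 2.0)


# Hypothetical counts: rows are three categories of one predictor,
# columns are (non-responders, responders).
table = [[90, 10], [80, 20], [60, 40]]

statistic = chi_square_statistic(table)
p_value = p_value_two_df(statistic)

# Bonferroni adjustment: multiply by the number of candidate tests
# (n_tests is an assumed value for illustration).
n_tests = 5
adjusted_p = min(1.0, p_value * n_tests)

print(round(statistic, 3), adjusted_p < 0.05)
```

A split would be accepted in this toy case because the adjusted p-value stays below the chosen significance level; with a non-significant result, tree growth at that node would stop.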
C. Ensemble and Hybrid Models
Recently, ensemble models, in which several learning algorithms are employed to solve one problem, have been applied to improve model performance [21]. Motivated by the idea of ensemble learning algorithms, many researchers have employed hybrid multistage models or integrated multiple classifiers into an aggregated model to obtain better classification results in credit scoring modeling. For instance, in [21], a six-stage neural network hybrid learning approach was proposed and its effectiveness was confirmed using two publicly available credit datasets. In [22], a two-stage hybrid model using artificial neural networks and multivariate adaptive regression splines was shown to outperform the traditionally used discriminant analysis and logistic regression in credit scoring modeling. In another study, a hybrid approach integrating a genetic algorithm and a dual scoring model was shown to enhance the performance of the credit scoring model [23].
In our proposed hybrid approach, CHAID analysis is applied in the first stage to detect variable interactions rather than as the modeling tool. In the second stage, the identified variable interactions serve as additional independent variables in the logistic regression model. This hybrid approach, which integrates CHAID analysis into logistic regression, is motivated by, yet different from, the above-mentioned approaches.
III. MATERIALS AND METHODS
In this study, the credit customer response dataset provided by Atlanticus Services Corporation, located in Atlanta, GA, USA, is used to evaluate the reliability and efficiency of the proposed hybrid model combining decision tree based CHAID and logistic regression. The study consists of data pre-processing, decision tree based CHAID analysis, logistic regression modeling, and model evaluation. All analyses were implemented in SAS 9.4. An overview of the study procedure is presented in Fig. 1, and the details are described as follows.
A. Data Description
Briefly, the credit customer response dataset was collected from , customers and mainly represents the customers' credit behaviors. behavior variables were recorded; example factors are the customer's number of bankcard accounts, age of the newest account, total balance closed accounts within last three months, total past due amount, and worst status rating reported within one month. The target variable RESP_FLAG is binary, where 1 denotes that the customer responded (opened the bankcard account) after receiving the credit card offer and 0 denotes the opposite. Of these , customers, around responded after receiving the credit card offer while around showed no response.
B. Data Pre-processing
Customer records with missing values in the target variable RESP_FLAG were removed in the very first data pre-processing stage to avoid biased results. Then, 40% of the resulting data was set aside for comparison (used as the validation data set) and the model was built on the remaining data (training data set). During the splitting procedure, stratified sampling was implemented to preserve the original ratio of the outcome in both the training and validation data sets. As illustrated in the data pre-processing step in Fig. 1, missing value imputation was implemented separately for the training and validation sets. For the training set, variables with more than missing percentage were removed due to the limited information provided. Otherwise, median value imputation was used to fill the missing values. These median values from the training set were recorded and used to impute the missing values in the validation set.
To reduce the data dimensionality and the occurrence of multicollinearity problems, hierarchical variable clustering was then applied to the training set, and the variable with the lowest 1 − R² ratio, as defined in (1), in each cluster was selected. As a result, variables were kept after variable clustering, accounting for about variability from the original dataset. These variables are used as the input features in the subsequent modeling stage.
$$1 - R^2\ \text{ratio} = \frac{1 - R^2_{\text{own cluster}}}{1 - R^2_{\text{next closest cluster}}} \tag{1}$$
C. Complete Stepwise Search for the Detection of Interactions
Intuitively, the most straightforward way to look for potential variable interactions is a complete stepwise search during the modeling procedure. That is, consider all possible pairs of variables as interaction terms, feed them into the model, and use criteria such as the p-value to retain the significant interaction terms through stepwise, backward, or forward feature selection approaches [24]. Considering that 180 variables were kept after data pre-processing, there would be $C_{180}^{2}$ combinations of variables. As a result, by using (2), 16,110 interaction terms would serve as additional independent variables and would go through the feature selection procedure when constructing the models.
$$C_n^k = \frac{n!}{(n-k)!\,k!} \tag{2}$$
There are several problems with the above-mentioned complete search for variable interactions. After adding all 16,110 possible interaction terms to the data set, the resulting number of independent variables (16,110 + 180 = 16,290) would be larger than the number of observations (i.e., , in this study). This causes the "large p small n problem" (p >> n problem), and there would be insufficient degrees of freedom to estimate the model coefficients [25]. Even if no "large p small n problem" occurs, the processing time for the complete stepwise search would be very long when the data is large. In our experiment, to estimate the time consumption, we randomly selected , interaction terms and used them as additional input variables in logistic regression with SAS 9.4 on a computer with a 3.3 GHz Intel Core i7 processor. It took more than ten hours when any of the stepwise, backward, or forward selection methods was applied as the feature selection tool. Therefore, we conclude that, for the data used in this study, even if no "large p small n problem" is caused, the complete stepwise search method is not time-efficient. Furthermore, the complete stepwise search tends to cause multicollinearity problems and thus could offset the advantages of the hierarchical variable clustering step illustrated in Fig. 1. Therefore, a more efficient way to detect possible variable interactions is needed, which motivates the proposed hybrid model.
D. CHAID Analysis for the Detection of Interactions
[Fig. 1: flowchart of the study procedure. Data pre-processing (missing imputation and variable clustering on the training set; missing imputation on the 40% validation set using median values from the training set), decision tree based CHAID analysis (for the Nth numbered pair of variables, starting from N = 0, construct a decision tree; if successful, create an interaction term using the Nth pair of variables; set N = N + 1), and modeling (build and evaluate logistic regression).]
Fig. 1. An overview of the study procedure. It mainly contains three steps: data pre-processing, decision tree based CHAID analysis, and modeling and evaluation.
In the proposed hybrid decision tree based CHAID and logistic regression model, CHAID analysis was first implemented to identify the pairs of variables with potential interactions. In particular, we propose to use the CHAID decision tree to identify potentially significant interactions. In a CHAID tree, a predictor and its best segmentation are identified to split the node (or grow the tree) based on an adjusted significance test according to the chi-square statistic. The tree keeps growing as long as a significant chi-square statistic can be found. When applying this idea to look for potential interactions, we feed only one pair of variables into the CHAID decision tree. If the CHAID tree is successfully built (Fig. 2), we consider that there exist potential interactions in the current pair of variables. That is, the effect of one variable on the target variable
RESP_FLAG depends on its partner. Therefore, the interaction term is created and enters the logistic regression modeling, where the stepwise method is used for feature selection. Otherwise, no interaction is assumed for the current pair and the iteration continues with the next pair of variables. Note that the CHAID tree may be successfully built using only one variable from the pair. In this case, the interaction term is still created so that no potential interaction is missed. If this newly created interaction term does not provide useful information for predicting the target variable, it is removed by the stepwise feature selection procedure in the subsequent logistic regression modeling stage. Furthermore, during the modeling stage, variance inflation factor (VIF) values of the variables are checked to prevent redundant interactions from entering the final model.
Pre-setting the criteria of the CHAID analysis is essential because they affect the tree size and, more importantly, the processing time [26]. To find suitable values for these settings, we ran various experiments covering a series of values for each criterion. The experiments were run on a computer with a 3.3 GHz Intel Core i7 processor. With the purpose of avoiding overly long processing time (we limited the CHAID analysis to five hours in this study), avoiding overly complex decision tree structures, including more potentially significant interactions, and excluding possibly useless interactions, we finally set the criteria of the CHAID analysis as follows. The p-value for the chi-square test, which controls merging or creating a new branch, was set to . . In order to obtain more nodes in the tree, we set the minimum node size for a split to 18 and the minimum leaf size to 10. For each node, the maximum number of branches was set to 3 to avoid overly large tree structures. The maximum depth of the tree was set to 15 through experiments, which avoids both overly long processing time and overly complex tree structures.
E. Properties of CHAID Analysis Used
It is worth mentioning that, unlike many studies that use decision tree based CHAID for classification, this study uses CHAID to identify potential variable interactions. Compared with other decision tree methods such as Classification and Regression Tree (CRT) and Quick, Unbiased, Efficient Statistical Tree (QUEST), the CHAID method has the following major properties, which are the main reasons why it was used for interaction detection in this study:
• CRT and QUEST build binary trees and cannot produce multi-branch trees. In contrast, CHAID builds non-binary trees with two or more branches growing from a single node [27]. This helps identify complex interactions among variables.
• Tree pruning is usually needed in CRT methods, whereas it is not required for the CHAID method [28]. This can largely decrease the computational time, especially when the data set is relatively large.
• CHAID can model both categorical and ordinal data. Continuous data is automatically converted to ordinal during the analysis [29]. Since most of the variables used in this study are continuous, the CHAID method is the first choice.
• The data summarizing performance of CHAID analysis is equivalent to that of stepwise regression models such as logistic regression [15]. However, a customer profile can be created naturally from the CHAID tree, providing a better interpretation of variable interactions (see Table IV). This makes the CHAID method preferable to other interaction detection methodologies in regulated industries.
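The pairwise screening described in Section III-D can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAS procedure used in the study: `root_split_is_significant` is a hypothetical stand-in for "the CHAID tree was successfully built" that tests a single median split with a 2 x 2 chi-square statistic against the 5% critical value for one degree of freedom (3.841), and the toy data and variable names are invented. As in the text, a pair is kept even when only one of its two variables yields a split.

```python
from itertools import combinations

CRITICAL_CHI2_DF1 = 3.841  # 5% critical value for 1 degree of freedom


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def root_split_is_significant(values, target):
    """Hypothetical stand-in for a successful tree: one median split only."""
    cut = sorted(values)[len(values) // 2]
    high = [v >= cut for v in values]
    a = sum(1 for h, t in zip(high, target) if h and t == 1)
    b = sum(1 for h, t in zip(high, target) if h and t == 0)
    c = sum(1 for h, t in zip(high, target) if not h and t == 1)
    d = sum(1 for h, t in zip(high, target) if not h and t == 0)
    return chi_square_2x2(a, b, c, d) > CRITICAL_CHI2_DF1


def screen_pairs(features, target):
    """Create a product term for every pair on which a tree can be grown."""
    interactions = {}
    for name_a, name_b in combinations(sorted(features), 2):
        tree_built = (root_split_is_significant(features[name_a], target)
                      or root_split_is_significant(features[name_b], target))
        if tree_built:
            interactions[f"{name_a}*{name_b}"] = [
                u * v for u, v in zip(features[name_a], features[name_b])
            ]
    return interactions


# Toy data: x1 separates the classes, x2 and x3 do not.
target = [0] * 10 + [1] * 10
features = {
    "x1": list(range(20)),
    "x2": [0, 1] * 10,
    "x3": [0, 0, 1, 1] * 5,
}
created = screen_pairs(features, target)
print(sorted(created))
```

The created terms would then be passed, together with the original variables, to the stepwise selection in the logistic regression stage, which is where uninformative terms are expected to drop out.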
F. Modeling – the Hybrid Model and the Pure Logistic Model
Logistic regression models dichotomous outcomes such as 0 and 1. It builds a statistical model to predict the logit transformation of the occurrence probability of the target variable, RESP_FLAG in this study. The logistic regression model is given in (3), where p denotes the probability of the occurrence of RESP_FLAG, n is the number of independent variables, and $\beta_i$ are the coefficients of the independent variables $x_i$.
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \tag{3}$$
In this study, to show the effectiveness of the interaction terms identified by CHAID analysis, two models were built and compared during the modeling procedure:
• The proposed hybrid decision tree based CHAID analysis and logistic regression model, denoted as the hybrid model in this paper, was built following the steps illustrated in Fig. 1. That is, after CHAID analysis, the newly created interaction terms were used as additional independent variables, and a stepwise feature selection approach was used to select the most important predictors of the target variable RESP_FLAG in the logistic regression. The significance levels for entering and leaving the model were both set to 0.15.
• The logistic regression model without CHAID analysis, denoted as the pure logistic model in this paper, was built following the steps illustrated in Fig. 1 except the decision tree based CHAID analysis stage. That is, no interaction terms were created, while the same stepwise feature selection approach was used in logistic regression.
By comparing the performance of the two models, we aim to demonstrate the effectiveness of the interactions newly created through CHAID analysis and to show the superiority of the proposed hybrid model.
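To make the role of an interaction term in the logistic model concrete, the following Python sketch evaluates the logistic function with two predictors and their product. The coefficient values in `beta` are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def logistic_probability(coefficients, features):
    """p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    linear = coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], features)
    )
    return 1.0 / (1.0 + math.exp(-linear))


def with_interaction(x1, x2):
    """Feature vector of the hybrid model for one pair: x1, x2, and x1 * x2."""
    return [x1, x2, x1 * x2]


# Hypothetical coefficients: intercept, x1, x2, x1 * x2.
beta = [-1.0, 0.5, 0.2, -0.4]

# Effect of raising x1 from 0 to 1 when x2 = 0 and when x2 = 2.
low_x2 = (logistic_probability(beta, with_interaction(0, 0)),
          logistic_probability(beta, with_interaction(1, 0)))
high_x2 = (logistic_probability(beta, with_interaction(0, 2)),
           logistic_probability(beta, with_interaction(1, 2)))

print([round(p, 4) for p in low_x2], [round(p, 4) for p in high_x2])
```

With these assumed coefficients, increasing x1 raises the predicted probability when x2 = 0 but lowers it when x2 = 2, which is the sense in which the effect of one variable depends on its partner. A model without the product term cannot represent such a sign change.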
G. Model Evaluation
The models were first evaluated using the receiver operating characteristic (ROC) method. Since AUC is the most common measure of discrimination for prediction models with binary outcomes, we adopt it in this study [30]. Both training and validation sets were used to measure AUC in order to compare the performance of the proposed hybrid model and the pure logistic regression model when different numbers of variables are kept. For the same number of variables, the more desirable model has the higher AUC value.
The second evaluation measure used in this study is the classification accuracy. The observations are classified into the binary categories (1, 0) using a cutoff value of 0.5: observations with a predicted value larger than 0.5 in logistic regression are classified into category 1, while those with a predicted value no larger than 0.5 belong to category 0. This produces four possible outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The classification accuracy is obtained using (4), and the better model is expected to have the higher accuracy.
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
The last evaluation measure applied is the KS statistic, which quantifies the distance between the empirical distribution functions of two samples. The KS statistic $D_n$ is defined in (5), where $F_n(x)$ and $F_p(x)$ denote the cumulative distribution functions of the classifier scores for negatives and positives, respectively [31]. In general, a larger KS statistic indicates that the model separates the two classes better.
$$D_n = \max_x \left| F_n(x) - F_p(x) \right| \tag{5}$$
IV. RESULTS AND DISCUSSIONS
A. Result of CHAID Analysis for Interaction Detections
As stated in Section III-D, during each iteration of the CHAID analysis, one of the , pairs of variables was used to build the CHAID tree. To illustrate the result of the analysis, take a certain pair of variables (Var , Var ) as an example. Fig. 2 shows the subtree structure starting from node 4 after CHAID analysis on Var and Var . Since the decision tree is successfully built, it is reasonable to assume that the effect of Var on the target variable RESP_FLAG depends on Var . Therefore, the interaction term Var * Var is created and serves as an additional input to the subsequent logistic regression modeling stage. As the final result of the CHAID analysis, , pairs of variables are shown to be successful in the tree construction. Therefore, , interaction terms are finally created, and their predictive power on the target variable RESP_FLAG is determined using the stepwise feature selection procedure in logistic regression. It is worth mentioning that the entire CHAID analysis took only about two hours, which is much shorter than the more than ten hours needed for the interaction search using any of the stepwise, backward, or forward selection methods in logistic regression mentioned in Section III-C. Therefore, on the dataset used in this paper, CHAID analysis saves at least 80% of the processing time and is much more computationally efficient.
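The counting argument of Section III-C and the time saving quoted above can be checked with a few lines of Python. The value `n_variables = 180` is an assumption read from the "+ 180" term that survives in Section III-C; the running times are the approximate figures reported in the text (about two hours versus more than ten hours).

```python
import math

# Assumed number of variables kept after clustering (see lead-in).
n_variables = 180

# Number of unordered pairs, C(n, 2) = n! / ((n - 2)! * 2!).
n_pairs = math.comb(n_variables, 2)

# Predictors in a complete search: all pairwise terms plus the originals.
n_predictors = n_pairs + n_variables

# Approximate running times from the text, in hours.
chaid_hours = 2.0
stepwise_hours = 10.0  # reported as a lower bound ("more than ten hours")
saving = 1.0 - chaid_hours / stepwise_hours

print(n_pairs, n_predictors, f"{saving:.0%}")
```

Because the ten-hour figure is a lower bound on the stepwise search, the computed saving is a lower bound as well, which matches the "at least 80%" wording.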
B. Result of the Hybrid Model and the Pure Logistic Model
Tables I and II show the classification accuracy and AUC for different numbers of variables on both training and validation sets, produced by the proposed hybrid model and the pure logistic model, respectively. When decreasing the number of variables, the Wald chi-square values in logistic regression were consulted and the variables with the lowest Wald chi-square values were removed first. Furthermore, a multicollinearity test was performed to calculate the VIF values of the variables for each model. The test shows that all the variables selected by the models in Tables I and II are not interdependent (VIF <
AUC are very similar on the training and validation sets, there is no evidence of an over-fitting problem. As expected, when the number of selected variables decreases, the classification accuracy and AUC generally show a decreasing trend on both training and validation sets for both the proposed hybrid model and the pure logistic model. Most importantly, when the same number of selected variables is used, the hybrid model generally outperforms the pure logistic model in terms of classification accuracy and AUC on both training and validation sets.
Fig. 2. An illustrative example of the CHAID subtree. The entire structure of the CHAID tree is represented in the top left corner, and the subtree starting from node 4 is shown for illustrative purposes.
TABLE I
RESULT OF THE HYBRID MODEL
Number of selected variables   Accuracy on train (%)   Accuracy on validation (%)   AUC on train (%)   AUC on validation (%)
30                             83.86                   82.40                        83.45              79.46
27                             83.81                   82.26                        83.17              79.21
24                             83.48                   81.86                        82.56              78.85
21                             83.14                   81.64                        82.22              78.49
18                             82.97                   81.70                        81.58              77.90
15                             82.86                   81.30                        80.59              77.08
12                             82.45                   80.90                        79.87              76.32
TABLE II
RESULT OF THE PURE LOGISTIC MODEL
Number of selected variables   Accuracy on train (%)   Accuracy on validation (%)   AUC on train (%)   AUC on validation (%)
30                             83.73                   81.56                        82.48              77.59
27                             83.56                   81.78                        82.18              77.20
24                             82.87                   80.92                        80.80              76.17
21                             82.58                   80.82                        80.34              76.16
18                             82.36                   80.80                        79.69              75.70
15                             81.90                   80.86                        78.99              75.76
12                             81.73                   80.70                        78.53              75.63
[Fig. 3: line chart titled "KS statistics across different number of selected variables on training and validation sets". Series: hybrid model on training data, hybrid model on validation data, logistic model on training data, logistic model on validation data.]
Fig. 3. Results of KS statistics of the hybrid model and the pure logistic model.
C. Model Comparison
Fig. 3, where the y-axis represents the KS statistic and the x-axis represents the number of selected variables, shows the KS statistics of the hybrid model and the pure logistic regression model. The KS statistic does not change much on the validation set as the number of selected variables changes, for both models. However, the hybrid model always produces a higher KS statistic than the pure logistic model on both training and validation sets when the number of selected variables is fixed. Therefore, considering all the results described earlier, the proposed hybrid model outperforms logistic regression without interactions.
Another important consideration is that collecting customers' information is time-consuming and costly in the credit research domain; hence, a practical yet reliable bankcard response model should not contain too many independent variables. Using the criterion that a KS statistic of at least . should be reached on the validation set, the hybrid model with selected variables is considered to be the best model in this study. To demonstrate the effectiveness of the identified interaction terms in the best model, descriptions of the selected variables are summarized in Table III. It is worth mentioning that all the variables listed in Table III have p-values less than . from the Wald chi-square test and VIF values less than . Furthermore, according to Table III, 4 interaction terms enter the best model, indicating the necessity of interaction detection. These statistically significant interaction terms further confirm that the proposed hybrid approach outperforms the pure logistic regression model and hence provides an alternative for handling bankcard response classification.
TABLE III
LIST OF SELECTED VARIABLES IN THE BEST MODEL
Variable   Description
x          Number of inquiries within 1 month
x          Percent balance to high credit open department store accounts with update within 3 months
x          Age of the newest bankcard account
x          Age of newest judgment public record item
x          Dismissed bankruptcy public record within 24 months
x          Total loan amount open mortgage accounts with update within 3 months
x          Total balance open student loan accounts with update within 3 months
x          Age of newest data last activity installment accounts paid as agreed
x          Total balance closed bankcard accounts with update within 3 months
x          Total past due amount installment accounts
x * x      Number open bankcard accounts with update within 3 months with balance ≥ 75% loan amount * Total balance open retail accounts with update within 3 months
x * x      Number department store accounts worst rating 120 to 180 or more days past due within 6 months or major derogatory event within 24 months * Number installment accounts opened within 6 months
x          Age of newest mortgage account
x * x      Number open bankcard accounts with update within 3 months with balance ≥ 75% loan amount * Accounts worst rating ever 90 days past due
x * x      Number open bankcard accounts with update within 3 months with balance ≥ 75% loan amount * Percent utility inquiries within 3 months to inquiries within 24 months
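The three measures used in the comparison above (accuracy at a 0.5 cutoff, AUC, and the KS statistic of Section III-G) can be computed as in the following Python sketch. The labels and scores are invented toy values, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which is an implementation choice made here rather than something stated in the paper.

```python
def accuracy(labels, scores, cutoff=0.5):
    """Share of correct predictions; scores above the cutoff map to class 1."""
    predictions = [1 if s > cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def ks_statistic(labels, scores):
    """Largest gap between the empirical score distributions of the
    negatives, F_n(x), and the positives, F_p(x)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        f_n = sum(s <= threshold for s in negatives) / len(negatives)
        f_p = sum(s <= threshold for s in positives) / len(positives)
        largest_gap = max(largest_gap, abs(f_n - f_p))
    return largest_gap


# Toy example: predicted probabilities for eight customers.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

acc_value = accuracy(labels, scores)
auc_value = auc(labels, scores)
ks_value = ks_statistic(labels, scores)
print(acc_value, auc_value, ks_value)
```

Accuracy depends on the chosen cutoff, whereas AUC and KS are computed from the full score distribution, which is one reason the three measures are reported side by side.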
D. Customer Profile
One thing the financial industry is concerned with when building a bankcard response model or a credit risk model is the customer profile. After detecting the existence of interactions among the variables, an important further issue is to understand how these variables interact. Take the interaction term x * x shown in Table III as an illustrative example. Since this interaction term is statistically significant in the proposed hybrid model, we are interested in the customer response rate (i.e., the percentage of RESP_FLAG = 1) profile when considering these two variables alone. By taking advantage of the CHAID tree produced by the CHAID analysis in the hybrid model, the customer response rate profile in Table IV can be created, and a better interpretation of the interaction between x and x can be provided. As Table IV indicates, customers who have at least two bankcard accounts with update within 3 months with balance ≥ 75% loan amount (i.e., x ≥ 2) present a lower response rate than customers who have only one or zero such bankcard accounts (i.e., x < 2). Furthermore, for customers with x valued less than 2, the response rate does not depend on the other variable x . For customers with x valued at least 2, however, the response rate does depend on the other variable x . As shown in Table IV, increasing x values indicate a lower response rate. Therefore, the created customer profile can provide financial institutions with a thorough understanding of the behaviors of their customers.
TABLE IV
CUSTOMER RESPONSE RATE PROFILE BY x AND x
x   x   Response Rate (%)
< ≥ < ≥ ≥ < ≥ ≥
V. CONCLUSION
Bankcard response models play an important role in helping financial companies make decisions. Logistic regression is one of the most widely used techniques in the credit research domain. This technique focuses on linear relationships among variables, especially between the independent variables and the dependent variable. In the credit card research area, there may exist complex interactions among independent variables; that is, the relationship between one predictor and the target variable depends on the value of another independent variable. Therefore, adding interaction terms during the modeling procedure has the potential to produce better model performance. However, the number of possible interactions increases dramatically as the number of independent variables increases, which can greatly increase the computational time. In this situation, an efficient method for interaction detection is needed.

The main objective of this research is to propose a hybrid data mining approach that integrates decision tree based CHAID analysis into the logistic regression model to improve the performance of bankcard response classification. The rationale is to first use the decision tree based CHAID method, a novel multivariate tool, to identify potential interactions among independent variables. These newly created interactions then serve as additional independent variables in the logistic regression, where they undergo further feature selection through a stepwise procedure.

The effectiveness of the proposed hybrid model is demonstrated using the credit customer response dataset provided by Atlanticus Services Corporation, located in Atlanta, GA, USA. The proposed hybrid model and the pure logistic regression model (without CHAID analysis) were evaluated and compared on credit response classification tasks.
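As a rough illustration of the two-stage idea summarized above, the following Python sketch counts the two-way interactions that an exhaustive search would have to consider and then builds terms only for a short list of pairs. It assumes that an interaction is encoded as the product of the two variables. The variable names x1 to x50, the two pairs in `chaid_pairs`, and both helper functions are hypothetical; in the paper the pairs come from the CHAID tree, which is not reproduced here.

```python
from itertools import combinations
from math import comb


def all_pairwise_interactions(names):
    """Every two-way interaction an exhaustive search would have to consider."""
    return list(combinations(names, 2))


def add_interaction_terms(rows, pairs):
    """Return copies of the rows with one product column per selected pair.

    rows  : list of dicts mapping variable name -> numeric value
    pairs : (name_a, name_b) tuples, e.g. pairs suggested by a CHAID tree
    """
    augmented = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        for a, b in pairs:
            new_row[f"{a}*{b}"] = row[a] * row[b]
        augmented.append(new_row)
    return augmented


# Hypothetical variable names; the real predictors are credit attributes.
names = [f"x{i}" for i in range(1, 51)]
print(len(all_pairwise_interactions(names)))  # 1225, i.e. comb(50, 2)

# Assumption: a tree split on x1 first and on x2 or x7 beneath it.
chaid_pairs = [("x1", "x2"), ("x1", "x7")]
rows = [{"x1": 2, "x2": 3, "x7": 0.5}, {"x1": 0, "x2": 4, "x7": 1.0}]
augmented = add_interaction_terms(rows, chaid_pairs)
print(augmented[0])
```

With 50 predictors an exhaustive search faces 1,225 candidate product terms, whereas the tree-guided list adds only the few pairs that appear together on a branch. The augmented rows would then be passed to the stepwise logistic regression.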
The results show that, by identifying variable interactions using the CHAID method, the hybrid model outperforms the pure logistic regression model in terms of classification accuracy, AUC, and the KS statistic. In the selected model with 15 variables, 4 of the 15 variables are interaction terms identified by CHAID analysis, and all of them are statistically significant. This further confirms the necessity of detecting variable interactions for predicting the outcomes. Furthermore, the CHAID method is more computationally efficient in identifying potential interactions than adding all possible interactions to any of the stepwise, backward, or forward feature selection procedures in logistic regression. Also, the customer profile created from the CHAID tree provides a better understanding of the variable interactions.

As a general conclusion, the advantages of the proposed hybrid decision tree based CHAID analysis and logistic regression model in the current research are:
• Most recent research uses CHAID analysis as a prediction or classification technique. In contrast, decision tree based CHAID analysis is applied here to identify potential variable interactions, an application that has not been widely used in bankcard response modeling.
• More importantly, CHAID analysis for interaction detection outperforms the complete search for significant interactions by any of the stepwise, backward, or forward feature selection methods in logistic regression in terms of processing time. On the dataset used in this study, CHAID analysis for interaction detection saved at least 80% of the running time compared with the complete stepwise search for significant interactions in logistic regression.
• Furthermore, some of the interaction terms identified by CHAID analysis are shown to be statistically significant in the proposed hybrid model and hence enhance the model performance compared with pure logistic regression without interaction terms.
• Finally, by taking advantage of the CHAID tree produced by CHAID analysis, customer profiles can be created and a better interpretation of the variable interactions can be provided.

Therefore, the hybrid model proposed in this study offers a valuable aid for the financial industry in handling bankcard response classification or credit scoring tasks.

ACKNOWLEDGMENT
The authors would like to thank Atlanticus Services Corporation (located in Atlanta, GA, USA) for providing the customer credit response data set.