Augmented Neural Networks for Modelling Consumer Indebtness
Alexandros Ladas, Jonathan M. Garibaldi, Rodrigo Scarpel, Uwe Aickelin
Abstract — Consumer debt has risen to be an important problem of modern societies, generating a lot of research in order to understand the nature of consumer indebtness, which so far has been modelled mainly by statistical models. In this work we show that Computational Intelligence can offer a more holistic approach that is more suitable for the complex relationships an indebtness dataset contains and Linear Regression cannot uncover. In particular, as our results show, Neural Networks achieve the best performance in modelling consumer indebtness, especially when they manage to incorporate the significant and experimentally verified results of the Data Mining process in the model, exploiting the flexibility Neural Networks offer in designing their topology. This novel method forms an elaborate framework to model consumer indebtness that can be extended to other real world applications.
Index Terms — Knowledge Discovery, Neural Networks, Regression, Consumer Debt Analysis
I. INTRODUCTION

Consumer Debt Analysis has recently received a lot of attention from the research community in an effort to explain the "nature" of consumer indebtness that has emerged in the developed countries. Among the three fundamental research questions posed in the analysis of this social problem [17] lies the identification of factors that affect the level of consumer debt. Answering the latter, on-going research revealed a series of diverse factors, economic, demographic and psychological, that are related to how deep a consumer goes in debt [3], [5], [18], [2], providing a deep insight into the "nature" of this problem.

The discovery of these factors was mainly carried out by traditional statistical models like linear regression, which has the ability to reveal linear associations between variables. However, as common as the utilisation of these models in the field of Economics might be, so is their limited ability to deal with the characteristics that data from real world applications possess. Their difficulty in handling non-linearity in the data makes them unable to solve non-linear classification problems [19], while the collinearity between the independent variables can lead to incorrect identification of the most important predictors [22]. These limitations make them inappropriate for modelling consumer indebtness successfully, since socio-economic datasets exhibit strong non-linearity among several other inconsistencies. It also raises questions regarding the validity of the relationships uncovered by these models, as their low predictive accuracy cannot guarantee the identification of the correct predictors. In addition to this, most of the research has been conducted on a limited number of observations, making it hard to consider the findings representative.

As the need to develop fairly accurate quantitative prediction models becomes apparent [1], we argue that the field of Economics can benefit from the variety of techniques and models Computational Intelligence has to offer. One such computational model is the Neural Network, a system of interconnected "neurons" inspired by the functioning of the central nervous system. Neural Networks are capable of machine learning; not only do they manage to achieve remarkable prediction accuracy by successfully handling non-linearity in the data, but their flexibility in the design of their topology also offers a way to incorporate important steps of the Data Mining process into a regression model. The potential of Data Mining is evident in the numerous ways to pre-process the data in order to tackle any inconsistencies they may contain and to explore the relationships in the data, which can be combined in an elaborate process for Knowledge Discovery in any difficult real world problem like consumer indebtness.

Therefore, in order to evaluate the impact Neural Networks can make on modelling consumer debt in a large socio-economic dataset, in this work we compare their performance against Random Forests and Linear Regression. In the same experimental setup we also evaluate the contribution to the performance of these models of a series of Data Mining techniques: the transformations performed on the data in order to deal with the inconsistencies they contain, such as noise, high dimensionality and the presence of outliers, and a classification of debtors identified by clustering. Finally, we take advantage of the ability to design the topology of Neural Networks and introduce a novel way to incorporate into the topology meaningful information that derives from exploratory techniques applied on the data, like Clustering and Factor Analysis, and we assess its performance.

Our results show that the transformations on the data improve to a great extent the accuracy of all three regression models and that Neural Networks achieve the best performance. The contribution of the classification provided by clustering remains debatable when it is used as an extra variable, but proves to be very useful when it is incorporated in an appropriate way in the topology of the Neural Network, which leads to a further improvement in the performance of the model. Therefore, we believe that this work not only serves as a comparison between Neural Networks and other regression models, but also verifies the great potential of Neural Networks, which can be strong predictors and take advantage of significant results from Data Mining methods at the same time, sketching a complete framework for Consumer Debt Analysis that includes the necessary transformations of the data, exploratory models and a reliable regression model, and that may extend to any real world application problem with a dataset that has similar inconsistencies and characteristics.

The rest of the paper is organised as follows. In Section II we discuss the related work on level of debt prediction and on the models we use. In Section III we briefly introduce the CCCS dataset together with the transformations performed on its attributes and the clustering approach that identified classes of debtors. We then present the models in Section IV, whereas in Section V we proceed with the details of the experimental setup. Finally, in Section VI we analyse the results of our experiments, and we conclude our work in Section VII.

Alexandros Ladas, Jon Garibaldi, Rodrigo Scarpel and Uwe Aickelin are with the School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK (email: {psxal2, uwe.aickelin}@nottingham.ac.uk, [email protected], [email protected]).

II. RELATED WORK
Statistical models and linear regression are primarily used for level of debt prediction in the literature. A significant amount of the work is summarised in [5], where the authors also provide a model for separating debtors from non-debtors. However, their suggested logit model suffers from a low R² (33%). In a similar way, the models proposed in [10], [20], which take into account psychological factors as predictors, exhibit even lower R² in their probit models (around 10%). Surprisingly enough, the linear regression model presented in [17] achieves a remarkable 66% R², but, as explained in [5], this big proportion of variance explained is due to the small number of respondents. A linear regression model built for estimating the outstanding credit card balance in [15] exhibits 30% R². Based on these results and the fact that the models are built on a limited number of observations, we are unsure whether to regard these findings as reliable, since the suggested models fail to explain the variance that exists in the data and the small number of instances cannot be considered representative enough. This is further reinforced by the criticism statistical techniques receive in [19], where it is argued that they have reached their limitations in applications with datasets that contain non-linearity, like an indebtness dataset.

On the other hand, Random Forests, a popular machine learning algorithm for Data Mining, have been shown to be able to handle non-linearities in the data [12]. They have received a lot of attention in biostatistics and other fields [12] due to their ability to handle a large number of variables with a relatively small number of observations, and because they provide a way to identify variable importance [12], [21]. They manage to demonstrate exceptional performance with only one parameter, and their regression has been proven not to overfit the data [21].
An interesting application of Random Forests is in [11], where a model measuring the impact of product reviews on sales and perceived usefulness was constructed.

Similarly, Neural Networks exhibit better generalisation than linear regression models [19], [22], allow for extrapolation [22] and can handle non-linearity [19], posing as strong predictors. Their huge learning capacity has led many researchers to believe that they are able to approximate any function encountered in applications [14], [7]. They have been shown to outperform Linear Regression models [19], [22], and in Economics they have been successfully used for stock performance modelling [19] and for credit risk assessment [1]. A very interesting ability they possess is the ability to fully parametrise the topology of the network, introducing a concept of logical structure among the neurons that constitute the network. This has been exploited in [7], where Factor Analysis is utilised in order to define the topology of the network; although their result has been shown not to actually improve the precision of the existing neural network, it manages to speed up the convergence of the algorithm. The same idea has been adopted by us in this work for further experimentation on our dataset, and has been extended in order to include further information that derives from clustering the data. As Neural Networks have not been used so far for the purposes of Consumer Debt Analysis, in this work we exploit the many advantages they offer in order to achieve a better modelling of consumer indebtness than the existing ones, supporting their utilisation in the field of Economics, in applications of which they have already replaced traditional econometric models.

III. CCCS DATASET
A. Description
The CCCS dataset, introduced in [8], is a socioeconomic cross-sectional dataset based on the data provided by the Consumer Credit Counseling Service. Its 58 attributes contain information about approximately 70000 clients who contacted the service between the years 2004 and 2008 in order to seek advice about how they can overcome their debts. The information was gathered through interviews when each client first contacted the service, and it ranges from standard demographics to financial details, aggregated spending in categories and debt details. The attributes of interest for the purpose of Consumer Debt Analysis are limited to Demographics, Expenditure and Financial attributes, as can be seen in Table I together with their description.
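As an illustration only (the CCCS data itself is confidential, so the values below are placeholders), the attribute grouping of Table I can be mirrored when preparing predictors in code. This Python sketch uses a handful of the column names:

```python
import pandas as pd

# Toy stand-in for the CCCS data: column names follow Table I,
# the values are invented placeholders.
cccs = pd.DataFrame({
    "pid":    [1, 2, 3],                   # individual identifier
    "age":    [34, 51, 27],                # demographic attribute
    "income": [1500.0, 2200.0, 900.0],     # financial attribute
    "food":   [210.0, 305.0, 140.0],       # expenditure attribute
    "udebt":  [8200.0, 15400.0, 3100.0],   # response variable used later
})

demographics = ["age"]
financial = ["income"]
expenditure = ["food"]

predictors = cccs[demographics + financial + expenditure]
target = cccs["udebt"]
print(predictors.shape, target.shape)  # (3, 3) (3,)
```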
B. Transformations
Like other real world datasets, CCCS contains noise and outliers, while at the same time it suffers from high dimensionality. In order to tackle the aforementioned difficulties, a series of transformation steps were performed in an earlier work [16] that proved to be beneficial for the unsupervised analysis of this dataset. More precisely, Homogeneity Analysis (Homals) [6] was utilised in order to map the categorical demographic data, significant attributes for Consumer Debt Analysis, into two-dimensional coordinates, together with a Factor Analysis on the financial attributes and a clustering on the correlations of the spending items. These transformations reduced the dimensionality to more compact attributes, removed noise and outliers, provided a sense of interpretability and improved the quality of the clustering. A summary of the transformations can be seen in Fig. 1: the nine new transformed attributes comprise two spatial coordinates that discriminate the demographic variables, three Financial Factors that summarise all the information that lies in the Financial Attributes, and four Behavioural Spending Clusters that characterise spending as Necessity, Household, Excessive and Leisure.

TABLE I: DESCRIPTION OF CCCS ATTRIBUTES

Attribute    Description
pid          individual identifier

Demographics
age          age of person
mstat        marital status
empstat      employment status
male         sex of person
hstatus      housing status
ndep         number of dependants in household
nadults      number of adults in household

Financial Attributes
udebt        total value of unsecured debt
mortdebt     total value of mortgage debt
hvalue       total value of all housing owned
finasset     total value of financial assets
carvalue     resale value of car
income       total monthly income

Expenditure
clothing     total monthly spending on clothing
travel       total monthly spending on travel
food         total monthly spending on food
services     total monthly spending on utilities
housing      total monthly spending on housing
motoring     total monthly spending on motoring
leisure      total monthly spending on leisure
priority     total monthly spending on priority debts
sundries     total monthly spending on sundries
sempspend    total monthly self-employed spending
other        total other spending

Debt Details
ndebtitems   number of debt items

Fig. 1: Transformations of CCCS attributes
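Homals is closely related to multiple correspondence analysis. As a rough, hypothetical stand-in for the procedure of [16] (not the authors' exact method), categorical demographics can be mapped to two-dimensional coordinates by one-hot encoding followed by a rank-2 SVD:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical categorical demographics (stand-ins for mstat, empstat, hstatus)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "mstat":   rng.choice(["single", "married", "divorced"], size=200),
    "empstat": rng.choice(["employed", "unemployed", "retired"], size=200),
    "hstatus": rng.choice(["owner", "renter"], size=200),
})

# One-hot encode, then project to two dimensions (a crude MCA/Homals analogue)
onehot = pd.get_dummies(demo).to_numpy(dtype=float)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(onehot)
print(coords.shape)  # (200, 2): the two spatial coordinates x and y
```

The real Homals solution optimises a homogeneity criterion rather than plain variance, so the coordinates above only illustrate the shape of the output, not its exact values.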
C. Classification of Debtors
Finally, in [16] these transformations proved to be useful for the clustering of a random sample of 10000 debtors from the CCCS dataset, which managed to classify 8370 debtors into seven classes with distinct characteristics.
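The clustering procedure itself is described in [16]; purely for illustration, a generic sketch of partitioning the transformed attributes into seven classes (here with k-means, which may differ from the algorithm actually used) looks as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the nine transformed attributes of the sampled debtors
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))

# [16] reports seven classes of debtors; k-means is only one possible choice
# of clustering algorithm and is used here purely for illustration.
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(sorted(set(labels)))  # [0, 1, 2, 3, 4, 5, 6]
```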
TABLE II: DESCRIPTION OF CLASSES OF DEBTORS (columns: Class, Size, Characterisation)
The characteristics of these classes can be seen in Table II, which also accounts for the 1630 debtors that remained unclassified. Further information regarding the dataset itself, the suggested transformations and the clustering results can be found in [16], as this is not the subject of this work. Our objective is to use the information that derives from the exploratory research conducted in [16], namely the transformed attributes and the classification, in order to evaluate their contribution to level of debt prediction.

IV. MODELS
A. Linear Regression
Linear Regression is the simplest of the statistical models, and it tries to model the relationship between a dependent variable and one or more explanatory variables. As one can infer from the name, Linear Regression assumes a linear relationship between the dependent variable and the explanatory variables and tries to fit a straight line to the data. More formally, Linear Regression is defined as:

Y = β_0 + X_1 β_1 + ... + X_p β_p + ε,   (1)

where β_0, β_1, ..., β_p are the coefficients and X_j, j = 1, ..., p, denote the p regressor variables. Finally, ε denotes the error term, which is assumed to be uncorrelated with the regressors and to have zero mean and constant variance. The model takes the observations as input and tries to fit the straight line by estimating the parameters (coefficients and error term). A widely used algorithm for estimating the parameters is Ordinary Least Squares (OLS), which minimises the sum of squared residuals.

B. Random Forest Regression
Random Forest is an example of ensemble learning that generates many classifiers and aggregates their results [4]. The Random Forest method creates a large number of Decision Trees in the case of classification, or Regression Trees in the case of regression, from different random samples of the data. The samples are drawn using bootstrap techniques that allow resampling of instances. A tree is constructed from each sample, and its accuracy is evaluated on the instances not used to build it. The difference from a common Decision Tree is that when a split on a node is to be decided, only a specific number of the attributes can participate as candidates, not all of them. When the random forest is built, the prediction is made by aggregating the votes of all the trees in the case of classification, and by averaging the results of all the trees in the case of regression. It needs the specification of only two parameters, the size of the forest and the number of predictors that can be candidates for each node split, and its success is based on its simplicity. The notion of randomness it adopts in its process allows the model to be robust against data overfitting.
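The contrast between the two models can be sketched on a toy non-linear target: a linear fit misses the curvature, while a forest of 500 trees with roughly m/3 split candidates (the settings used later in this paper) captures it. This is an illustrative Python/scikit-learn stand-in, not the paper's R code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=600)  # non-linear target

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

linear = LinearRegression().fit(X_train, y_train)
# 500 trees; only m/3 = 1 of the 3 predictors is a candidate at each split
forest = RandomForestRegressor(n_estimators=500, max_features=1,
                               random_state=0).fit(X_train, y_train)

print(round(linear.score(X_test, y_test), 2))  # near-zero R2: the line misses the curvature
print(round(forest.score(X_test, y_test), 2))  # much higher R2 on the same held-out data
```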
C. Neural Networks
A Neural Network is a directed graph consisting of nodes and edges that are organised in layers. As it models a relationship between the predictors and the response variables, the input layer consists of nodes that represent the predictors, and the output layer of nodes that represent the response variables, if there are more than one. One or more hidden layers of an arbitrary number of nodes connect these two layers. Each layer is fully connected with the next layer, and each edge assigns a weight to the value it takes as input and passes it on to the next node. Thus, in each node the weighted sum of all the nodes that belong to the previous layer is calculated, the intercept is added, and the result is fed into an activation function and passed to the next layer. The activation function is usually non-linear, like the sigmoid function or the hyperbolic tangent. The simplest Neural Network (the Perceptron) has n inputs and one output, and it is identical to logistic regression, as it is a non-linear function of the linear aggregation of the input. With this in mind, we can easily conclude that a Neural Network with more than one node in the hidden layer is an extension of the Generalised Linear Models.

A Neural Network takes as parameters the starting weights of the edges, which are usually initialised randomly, and the network topology, meaning the organisation of the nodes in the hidden layers. Then the model tries to find the optimal weights of the edges by using a learning algorithm like Backpropagation on the data. Backpropagation tries to minimise the difference between the predicted value calculated by the model and the actual value. It does that by calculating this difference and then, following the chain rule, moving from the output to the input, adapting all the appropriate weights according to a specific learning rate.
Resilient Backpropagation, which is argued to be more suitable for regression purposes [13], is similar to Backpropagation, but instead of subtracting a ratio of the gradient of the error function as Backpropagation does, it increases the weight if the gradient is negative and reduces it if it is positive. It updates the weights using only the sign of the gradient and some predefined update values. The update value becomes bigger if the gradient keeps the same sign as in the previous update, and smaller if it changes sign; the reduction after a sign change ensures that a local minimum will not be missed by overshooting.

Neural Networks tend to overfit the data, a fact that raises the concern of how they can be properly used. A common technique for avoiding overfitting is to train the model on a subset of the data and validate it on the rest. A very popular technique in Supervised Learning for this is 10-fold cross validation, where the data is divided into ten folds and, for each fold in turn, a model is trained on the remaining nine folds and validated on the held-out fold. This is the way to evaluate the accuracy of the model, and thus to choose the appropriate number of hidden layers and hidden nodes, since this is not known beforehand. Usually different topologies are tested, and the one that minimises the error between the predicted and the actual values on the test set is selected.
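The topology search described above can be sketched as follows. Note that scikit-learn's MLPRegressor trains with L-BFGS or Adam rather than (resilient) backpropagation, so this Python sketch mirrors only the selection procedure, not the exact training algorithm:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.tanh(2 * X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=300)

# Try 1..10 hidden neurons; keep the topology with the lowest
# cross-validated RMSE, as in the procedure described in the text.
best_size, best_rmse = None, np.inf
for size in range(1, 11):
    net = MLPRegressor(hidden_layer_sizes=(size,), activation="tanh",
                       solver="lbfgs", max_iter=2000, random_state=0)
    scores = cross_val_score(net, X, y, cv=10,
                             scoring="neg_root_mean_squared_error")
    rmse = -scores.mean()
    if rmse < best_rmse:
        best_size, best_rmse = size, rmse
print(best_size, round(best_rmse, 3))
```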
D. Topology Defined Neural Network
The flexibility that Neural Networks provide in designing the topology can be exploited to incorporate knowledge extracted by unsupervised learning performed on the data. Thus, in this work we tried to organise the neurons in the hidden layers based on the knowledge extracted by Factor Analysis and Clustering. The idea behind this was based on the striking resemblance Neural Networks have with Latent Factor Models, like Factor Analysis, and on the assumption that the classes of debtors identified by clustering define different relationships between the response variable and the predictors.

Factor Analysis is a common Latent Factor Model that organises the variables of a dataset into a smaller number of hidden factors that still contain most of the information of the initial variables. This way, the neurons in the first hidden layer can be depicted as latent factors that summarise the input. The only difference from Factor Analysis is that the relationship between the input variables and the factors is non-linear. This non-linear relationship is also able to model the linear relationships between the input variables and the factors identified by Factor Analysis. This idea has been implemented with the algorithm proposed in [7].

Clustering, on the other hand, divides the debtors into classes with distinct characteristics. As these classes may model different relationships between the response variable and the explanatory variables, this can be introduced in the neural network as an extra hidden layer with as many neurons as the classes. This creates a different function for each class, and these functions are combined into a more complex relationship in order to produce the final modelling.
The intuition is similar to Clusterwise Regression, but the combination of the different per-class functions is soft rather than hard, since they are combined inside a neural network.

These two ideas form this novel method of using Neural Networks, which we named Topology Defined Neural Network (TopDNN). Our aim is to test TopDNN in the socio-economic context, but its principles can be extended to creating Neural Network models for any real world application.

V. EXPERIMENTAL SETUP
The aim of this work is to evaluate the performance of Neural Networks as a regression model that can predict the amount of unsecured debt (udebt) a debtor in the CCCS has, by using the rest of the variables as predictors. For this reason we compare their performance against different regression models with different characteristics, namely Linear Regression and Random Forest Regression. Furthermore, we check whether the series of transformations we performed in [16] and the classification of debtors we provided in the same work can improve the performance of the regression, so that they can be incorporated in the final Neural Network we aim to develop.

Since these models try to optimise different criteria and are internally validated on different measures when they are fitted to the data, we needed to test all of them under a common framework. So we use 10-fold cross validation as the method to compare the different models, and we selected RMSE and R² as the evaluation criteria. 10-fold cross validation is a standard method for evaluating models in Supervised Learning, and it also allows Neural Networks to avoid data overfitting, providing more representative results in their case.

R² measures the percentage of variance that is explained by the model; it is a standardised measure taking values from 0 to 1, with 1 being a perfect fit. The Root Mean Square Error (RMSE) measures the difference between the values predicted by the model and the actual values. It is defined as:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_obs,i − y_model,i)² ),   (2)

where n is the number of observations, y_obs,i is the observed value of observation i, and y_model,i is the value calculated for observation i. The best model will minimise the RMSE.

For model training we use a random sample of 10000 debtors from the CCCS dataset, a subset that contains no missing values and on which we had already performed the transformations and the division into classes [16].
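Both evaluation criteria translate directly into code; a minimal sketch of Eq. (2) and R²:

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean square error, as in Eq. (2)."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def r_squared(y_obs, y_pred):
    """Proportion of variance explained by the model."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_obs = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]   # every residual is ±0.5
print(rmse(y_obs, y_pred))       # 0.5
print(r_squared(y_obs, y_pred))  # 0.95
```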
All the models are built in R using the caret package. For Linear Regression we calculate the weights using the OLS algorithm; for Random Forests we create 500 trees and initialise the number of potential candidates for a node split as m/3, where m equals the number of predictors. For Neural Networks the initial weights are randomly assigned and one hidden layer is used. In order to choose the optimal number of hidden nodes, we produce ten neural networks for each case, with the number of neurons varying from 1 to 10; 10-fold cross validation is used to evaluate all of them, and the one that minimises the RMSE is selected as the best model. We also use both Backpropagation and Resilient Backpropagation for making the appropriate comparisons. All models are built using both the actual data and the transformed data, and the classification is introduced as an additional categorical variable. For all of the above we had to create four different datasets upon which all the regression models are built. These datasets, necessary in order to test the contribution of the transformations and of the classification provided by clustering together with the performance of the regression models, are summarised in Table III.

Finally, we construct a Neural Network based on our intuition to utilise the clustering classification and Factor Analysis for designing the topology, and we check its performance on the same dataset.

TABLE III: DESCRIPTION OF DATASETS

Dataset  Attributes
A        Original CCCS variables
B        Transformed variables
C        Original CCCS variables and clustering classification
D        Transformed variables and clustering classification

VI. RESULTS
A. Comparison of Models
A quick look at Table IV, which presents the performance of the models built on the four datasets (with the brackets indicating the optimal number of neurons found for the hidden layer), shows that Neural Networks and Random Forests clearly outperform Linear Regression on almost all datasets, with the only exception being the Neural Network model built on dataset C and trained with Backpropagation. In all the remaining cases Neural Networks and Random Forests produce smaller RMSE and bigger R².

In addition to this, we can identify the beneficial nature of the transformations performed on the CCCS attributes, since all four regression models improve their performance when they are built on the transformed data. More specifically, the models built on the datasets containing the transformed attributes (B and D) reduce the RMSE and increase R² when compared with the models built on datasets A and C respectively. Especially in the cases of the Neural Networks trained with Resilient Backpropagation and of the Random Forest regression, the improvement in performance is significantly big, reducing the RMSE to around 0.06 for Random Forests and to around 0.05 for the Neural Networks trained with Resilient Backpropagation. Similarly, R² was raised to around 0.5 for Random Forests and around 0.6 for the Neural Networks trained with Resilient Backpropagation. For Linear Regression and the Neural Networks trained with Backpropagation the improvement was significant but much smaller.

On the other hand, the contribution of the classification provided by clustering remains less clear. It manages to provide a rather small improvement for Linear Regression and the Backpropagation Neural Networks, but it decreases the performance of Random Forests, while for the Resilient Backpropagation Neural Networks it is beneficial only when it is combined with the transformed data.
This can be seen by comparing the models built on datasets C and D, which contain the additional categorical classification variable, with the models built on datasets A and B respectively. Interestingly enough, the Random Forest regression model built on C has an increased RMSE and a bigger proportion of variance explained at the same time.

Looking at the performance of the models, the best performance was achieved by the Resilient Backpropagation models, followed closely by the Random Forest regression, whereas the performances of the Backpropagation Neural Networks and of Linear Regression remained comparable, with the former being better. The model that exhibits the minimum RMSE and the biggest R² is the Resilient Backpropagation Neural Network built on the transformed variables together with the classification of debtors. This verified the argument of [13] that Resilient Backpropagation is more suitable for regression purposes. It also strengthens the argument regarding the potential of using Neural Networks in applications of Economics, traditionally dominated by statistical models. Data Mining and Computational Intelligence in a broader sense introduce a holistic approach to extracting knowledge from data, as they offer a large number of tools to preprocess the data, techniques to explore the relationships with unsupervised learning algorithms like clustering, and accurate models to be used for prediction; combined in a sophisticated framework, they can build models that achieve impressive results. In our case this was verified not only by the better performance of Neural Networks and Random Forests, but also by the beneficial nature of the transformations performed on the data as part of preprocessing, which improved all the models.
Despite the fact that the contribution of the classification of debtors returned by clustering was not beneficial in all the cases tested, it managed to provide a small improvement in most of them, especially when it was combined with the transformations in the Neural Networks.

Proceeding with the examination of the R² achieved by the models, we notice that the best model has the ability to explain approximately twice the proportion of variance explained by the best Linear Regression model. When these models are compared to the ones found in the literature, the Linear Regression performance seems to be comparable to the one presented in [15] and better than the rest of the models, whereas the performance of the Neural Networks trained with Resilient Backpropagation is significantly higher and can only be compared with the Linear Regression model in [17], which, however, was considered not representative enough due to the limited number of observations it was built upon.

TABLE IV: RESULTS OF REGRESSION MODELS

Linear Regression
Dataset  RMSE    R²
A        0.078   0.235
B        0.0731  0.328
C        0.0769  0.257
D        0.0727  0.336

Random Forests
Dataset  RMSE    R²
A        0.0727  0.293
B        0.0592  0.572
C        0.0741  0.311
D        0.0626  0.5

Neural Networks, Backpropagation
Dataset        RMSE    R²
A (4 neurons)  0.0779  0.241
B (2 neurons)  0.0672  0.445
C (4 neurons)  0.0778  0.239
D (2 neurons)  0.0671  0.445

Neural Networks, Resilient Backpropagation
Dataset        RMSE    R²
A (3 neurons)  0.0759  0.314
B (3 neurons)  0.0552  0.619
C (2 neurons)  0.0764  0.26
D (3 neurons)  0.0538  0.632
In fact, a more realistic value of R² for that model, given in [5], was around 30%, meaning that the performance of the best model found here is still significantly higher than those found in the literature.

B. Analysing Linear Regression
The low performance of Linear Regression compared to the Data Mining methods can be explained easily if we take a careful look at the diagnostic plots of the best linear model in Fig. 2. The plot of the residuals against the fitted values indicates that the error terms are not independent and that their variance is not constant, as they are not randomly scattered around zero. Besides this, the normal probability plot reveals that the error terms are not normally distributed, as there is a strong deviation from the line, with two big curves at the beginning and the end of the plot. Furthermore, in Fig. 3, where the partial residuals plot for Housing Factor is depicted, we can identify the non-linear relationship it has with the response variable. Partial residuals are utilised instead of normal residuals because, in a multiple regression, they account for the effect the rest of the independent variables have on this relationship. These observations contradict almost all the assumptions of linear regression, degrading the quality of the linear model. A series of transformations on the response variable or the explanatory variables, following established techniques like power and log transformations, was not able to improve the quality of the model, as the R² remained low and the assumptions were still violated.

C. TopDNN
Since the beneficial nature of the transformed variables is experimentally verified in all cases, we are encouraged to test our novel approach on dataset B using Resilient Backpropagation. Therefore we begin by performing a Factor Analysis on the attributes of dataset B.

Fig. 2. Diagnostic plots of the Linear Regression model built on Dataset D.

Fig. 3. Partial residuals plot of the Housing Factor.

Three was found to be the optimal number of factors for summarising the nine attributes of the dataset, after examining the scree plot of the eigenvalues and performing a parallel analysis. In the scree plot, the eigenvalues of the correlation matrix are plotted in descending order; the last substantial drop in the graph indicates the number of factors. In parallel analysis, the same eigenvalues are compared to eigenvalues derived from random data, and the number of cases in which they are bigger suggests the number of factors in the model. These two methods for determining the number of factors are among the most popular and effective, and they are preferred over others, as dictated in [9]. Interestingly enough, three is also the number of neurons that was found to be optimal when building Neural Networks on C using Resilient Backpropagation, indicating the agreement between two different techniques in designing the network topology of a neural network. The three factors and their loadings can be seen in Table V.

TABLE V
FACTOR ANALYSIS ON TRANSFORMED VARIABLES

Variable             Factor1   Factor2   Factor3
housingfactor         0.298     0.487
financialfactor1      0.385    -0.477
financialfactor2      0.280     0.574     0.766
Necessity.Spending    0.118     0.792
Household.Spending    0.983     0.167
Excessive.Spending    0.728     0.286     0.232
Leisure.Spending      0.217     0.128

Then we train two Neural Networks, one with a single hidden layer of three neurons and one with an additional hidden layer of eight neurons representing the classes of debtors, in order to test the two main ideas of our approach in a stepwise fashion. Again we utilise 10-fold cross-validation with RMSE and R² as evaluation criteria, in order to obtain results comparable with the rest of the experiments. The results can be seen in Table VI.

TABLE VI
RESULTS OF TOPDNN

Model                                     RMSE     R²
NN with factor analysis                   0.055    0.616
NN with factor analysis and clustering    0.0528   0.633

We can see that designing the network topology according to the knowledge extracted by Factor Analysis and Clustering is beneficial for the performance of the model. The inclusion of the hidden layer of three nodes, as dictated by Factor Analysis, improves the performance of the model when compared with the Neural Network built on B, but performs worse than the best model of the previous experiments. The additional layer of eight neurons, on the other hand, achieves the best performance of all the models built in this work, raising the R² to 0.633 and reducing the RMSE to 0.0528. This verified our intuition that the flexibility Neural Networks offer in designing their topology can be exploited properly in order to include knowledge that stems from the unsupervised learning approaches performed on the data. Thus our model manages to achieve the best performance of all the models, indicating the ability of Neural Networks to incorporate in their modelling results from previous steps of the Data Mining process. The plot of the Neural Network built with the TopDNN approach can be seen in Fig. 4.
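The core of the TopDNN recipe, a feed-forward network whose first hidden layer has as many neurons as the retained factors (three) and whose second as many as the discovered debtor classes (eight), trained with resilient backpropagation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the initialisation, the simplified Rprop variant without weight backtracking, and the textbook step factors (1.2 and 0.5) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weight matrix with an extra bias row at the bottom."""
    return rng.normal(scale=0.5, size=(n_in + 1, n_out))

def forward(params, X):
    """Two tanh hidden layers (3 and 8 units) and a linear output unit."""
    W1, W2, W3 = params
    A0 = np.column_stack([X, np.ones(len(X))])
    H1 = np.tanh(A0 @ W1)
    A1 = np.column_stack([H1, np.ones(len(X))])
    H2 = np.tanh(A1 @ W2)
    A2 = np.column_stack([H2, np.ones(len(X))])
    return (A2 @ W3).ravel(), (A0, H1, A1, H2, A2)

def gradients(params, X, y):
    """Backpropagate the mean squared error through both hidden layers."""
    W1, W2, W3 = params
    out, (A0, H1, A1, H2, A2) = forward(params, X)
    d_out = (2.0 / len(y)) * (out - y)[:, None]
    g3 = A2.T @ d_out
    d_H2 = (d_out @ W3.T)[:, :-1] * (1 - H2 ** 2)  # drop the bias column
    g2 = A1.T @ d_H2
    d_H1 = (d_H2 @ W2.T)[:, :-1] * (1 - H1 ** 2)
    g1 = A0.T @ d_H1
    return [g1, g2, g3]

def rprop_train(params, X, y, epochs=200):
    """Rprop: per-weight step sizes adapted only from gradient sign changes."""
    steps = [np.full_like(W, 0.01) for W in params]
    prev = [np.zeros_like(W) for W in params]
    for _ in range(epochs):
        grads = gradients(params, X, y)
        for W, g, s, pg in zip(params, grads, steps, prev):
            sign = np.sign(g * pg)
            s *= np.where(sign > 0, 1.2, np.where(sign < 0, 0.5, 1.0))
            np.clip(s, 1e-6, 0.1, out=s)
            W -= np.sign(g) * s
            pg[...] = g
    return params

# Synthetic stand-in for the 9-attribute dataset with a non-linear target.
X = rng.normal(size=(300, 9))
y = np.tanh(X[:, 0]) + 0.3 * X[:, 1] * X[:, 2]
params = [init_layer(9, 3), init_layer(3, 8), init_layer(8, 1)]
mse_before = np.mean((forward(params, X)[0] - y) ** 2)
rprop_train(params, X, y)
mse_after = np.mean((forward(params, X)[0] - y) ** 2)
```

Because Rprop uses only the sign of each partial derivative, it is insensitive to the badly scaled gradients that plague plain backpropagation, which is consistent with its stronger showing in the results table above.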
The weights of the edges have been omitted for clarity, but the lines have been modified accordingly to depict the magnitude of the weights, with thinner lines representing small or negative weights and thicker lines representing large weights.

Fig. 4. TopDNN with two hidden layers. The first one represents the number of factors and the second one the number of classes of debtors.

We can notice that the interpretation of a Neural Network is not a trivial task, especially when the network is complicated. That is their main drawback compared to Linear Regression and Random Forests, which have mechanisms to assess the variable importance in their models. However, tracing the very thick black lines of the plot, we can immediately detect the strong influence FinancialFactor1 has on the final outcome: it heavily influences the first neuron of the first hidden layer, which in turn strongly influences the sixth neuron of the second hidden layer, one of the four neurons of that layer that moderately affect the final outcome. This relationship between FinancialFactor1 and udebt cannot be quantified or precisely defined, but it can be signified. There are techniques to assess variable importance in Neural Networks, like Sensitivity Analysis, that can provide the desired interpretability that is valuable for the analysis of real world applications, but we leave this for future research.

VII. CONCLUSIONS
In this work we tried to construct an accurate regression model for predicting the level of debt, a significant task for Consumer Debt Analysis, utilising a widely used computational model, Neural Networks. For this reason we compared their performance against Linear Regression and Random Forests. Our results show that Neural Networks clearly outperform Linear Regression. Random Forests achieve comparable performance, but their single tuning parameter does not allow for further improvements. The results also showed that all the regression models can benefit from the necessary data transformations and from the Unsupervised Learning approaches applied to the data, if these are incorporated properly in the modelling. Pursuing the latter, we devised a novel method for designing the topology of Neural Networks utilising information that stems from the Factor Analysis and Clustering performed on the data. TopDNN, as our method was named, improved the performance of the models even further and demonstrated the ability of Neural Networks to adopt in their design results from previous steps of exploratory research conducted on the dataset. Our work forms a complete Computational Intelligence framework, comprising the pre-processing of the data, clustering to uncover important relationships, and the regression model, that is suitable for the purposes of Consumer Debt Analysis. This framework exhibits much better performance than the existing statistical methods that dominate the field of Economics, and it highlights a more sophisticated way to model consumer indebtedness that can be extended to other real world applications.
A. Acknowledgements
We would like to thank John Gathergood, lecturer in the School of Economics at the University of Nottingham, for providing us with the CCCS dataset.

REFERENCES
[1] Atiya, Amir F. "Bankruptcy prediction for credit risk using neural networks: A survey and new results." IEEE Transactions on Neural Networks 12.4 (2001): 929-935.
[2] Kamleitner, Bernadette, and Erich Kirchler. "Consumer credit use: A process model and literature review." Revue Europeenne de Psychologie Appliquee/European Review of Applied Psychology 57.4 (2007): 267-283.
[3] Kamleitner, Bernadette, Erik Hoelzl, and Erich Kirchler. "Credit use: Psychological perspectives on a multifaceted phenomenon." International Journal of Psychology 47.1 (2012): 1-27.
[4] Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
[5] Journal of Economic Psychology 27.4 (2006): 543-556.
[6] De Leeuw, Jan, and Patrick Mair. "Gifi methods for optimal scaling in R: The package homals." Journal of Statistical Software (2009): 1-30.
[7] Ding, Shifei, Weikuan Jia, Xinzheng Xu, and Hong Zhu. "Neural Networks Algorithm Based on Factor Analysis." Advances in Neural Networks (2010): 319-324.
[8] Disney, R., and J. Gathergood. "Understanding consumer over-indebtedness using counselling sector data: Scoping Study." Report to the Department for Business, Innovation and Skills (BIS), University of Nottingham, 2009.
[9] Fabrigar, Leandre R., et al. "Evaluating the use of exploratory factor analysis in psychological research." Psychological Methods 4.3 (1999): 272.
[10] Gathergood, John. "Self-control, financial literacy and consumer over-indebtedness." Journal of Economic Psychology 33.3 (2012): 590-602.
[11] IEEE Transactions on Knowledge and Data Engineering 23.10 (2011): 1498-1512.
[12] Grömping, Ulrike. "Variable importance assessment in regression: linear regression versus random forest." The American Statistician 63.4 (2009): 308-319.
[13] The R Journal 2.1 (2010): 30-37.
[14] Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366.
[15] Financial Counseling and Planning 12.1 (2001): 67-77.
[16] Ladas, A., U. Aickelin, J. Garibaldi, R. Scarpel, and E. Ferguson. "The Impact of Preprocessing on Clustering socio-economic Data: A Step towards Consumer Debt Analysis." Under review.
[17] Livingstone, Sonia M., and Peter K. Lunt. "Predicting personal debt and debt repayment: Psychological, social and economic determinants." Journal of Economic Psychology 13.1 (1992): 111-134.
[18] Journal of Economic Psychology 32.1 (2011): 179-193.
[19] Refenes, Apostolos Nicholas, Achileas Zapranis, and Gavin Francis. "Stock performance modeling using neural networks: a comparative study with regression models." Neural Networks 7.2 (1994): 375-388.
[20] Journal of Economic Psychology 32.5 (2011): 754-761.
[21] Segal, Mark R. "Machine learning benchmarks and random forest regression." (2004).
[22] Sousa, S. I. V., et al. "Multiple linear regression and artificial neural networks based on principal components to predict ozone concentrations." Environmental Modelling & Software 22.1 (2007): 97-103.