Detection of fraudulent users in P2P financial market
Hao Wang*
HC Research, HC Financial Service Group, China
Abstract.
Financial fraud detection is one of the core technological assets of Fintech companies. It saves tens of millions of dollars for Chinese Fintech companies, since the bad loan rate can exceed 10%. HC Financial Service Group is the 3rd largest company in the Chinese P2P financial market. In this paper we illustrate how we tackle the fraud detection problem at HC Financial. We utilize two powerful workhorses of the machine learning field, random forest and gradient boosting decision tree, to detect fraudulent users. We demonstrate that by carefully selecting features and tuning model parameters, we can effectively filter out fraudulent users in the P2P market.

Fintech is one of the most thriving industries in many countries around the world. People have been relying on Fintech to lend and borrow money, detect fraudulent users, and match loans between lenders and borrowers. P2P is a business model in which each lender can distribute his or her investment over multiple borrowers, and each borrower can gather money from a host of lenders. P2P has been extremely popular in China, with an annual interest return rate of 8%-10% that satisfies most P2P sites' users.

Major P2P companies in China have been using a technology called knowledge graph to facilitate their financial processes. A knowledge graph is a graph constructed from data collected from the internet with the user's authorization, including the user's phone log, ID card information, bank account transactions, home address, etc. Each node in the knowledge graph represents an entity such as a person, an ID card number or an address, while each edge represents a relationship such as colleagueship or family membership. The knowledge graph is one of the core assets of P2P companies because it is highly useful in crucial business processes within the company.
For example, the knowledge graph can be used for anti-fraud, which saves tens of millions given that the bad loan rate is close to 50% on most P2P platforms.

HC Financial Service Group is one of the top P2P companies in China. We have a fully built team of nearly 100 staff working on credit risk modeling problems, taking advantage of more than 400 million users' information. Credit risk modeling problems, when solved in machine learning contexts, are conventionally modeled as class imbalance problems. Most of the time, feature engineering and graph pattern analysis play a key role in the problem-solving process. Case-by-case manual inspection of individual nodes and their neighbors in an extremely large knowledge graph is the daily routine of the credit risk modeling team's work.

One advantage of the problem setting in the P2P financial market is that the fraud rate is very high: more than 10%, sometimes a lot higher. The high fraud rate causes a big headache for the companies' operators but saves the day for algorithm engineers, since the class imbalance problem is a lot less severe. In this paper, we demonstrate that using random forest and gradient boosting decision tree, we can obtain evaluation metrics comparable to those of non-class-imbalance problems.

* Corresponding author: [email protected]

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

MATEC Web of Conferences, 06004 (2018) https://doi.org/10.1051/matecconf/201818906004
MEAMT 2018
Fraudulent behavior has existed since the advent of the internet. Special groups and teams at internet companies and educational institutions have been formed to tackle various kinds of fraudulent behavior. Facebook designed an algorithm called CopyCatch [1] to detect lockstep behavior. Later they came up with a new invention called SynchroTrap [2] to detect synchronized attacks. CMU researchers invented the fBox [3] and EigenSpokes [4] algorithms to detect community fraud.

Fraud detection is one of the most extensively researched fields in the Fintech industry. Common fraud detection methods include rule-based approaches, Bayesian networks and machine learning techniques. Fraud detection, when formulated as a classification problem, is inherently a class imbalance problem that is very difficult to solve. Therefore, in practice, expert domain knowledge has become one of the most important ingredients of Fintech fraud detection systems.

Vlasselaer et al. [5] provided a machine learning framework for graph-based financial fraud detection in general. They pointed out that graph pattern mining alone is seldom used as a standalone model for financial fraud detection. Graph patterns usually serve as features for classification models such as random forest or logistic regression.
Feature engineering is a crucial step in credit risk modeling. Selection and processing of the appropriate features is a delicate art. At HC Financial Group, we have both online and offline systems in which we store detailed information about borrowers. For example, at our outlet in Beijing, our sales staff help customers input their information into our systems. Once the information is entered, our online system asks the user to authorize us to acquire other information about the user, such as the credit report from the People's Bank of China. We have hundreds of data items we can use for each user in our systems.

However, not every data item stored in our system is necessary in our credit risk modeling processes. In our problem context, we select the following features as input to our models:

1. Financial information: features in this group include the user's personal income, car rent, house rent, etc.
2. Work information: features in this group include the user's company's income, how long ago the company was founded, etc.
3. Transaction information: features in this group include the amount of money the user borrows in the transaction, whether the user has submitted applications before, etc.
4. Demographic information: features in this group include the number of family members of the user, etc.

We chose these 4 groups of features because they are both heuristically and empirically consistent with our mission. For categorical data in the features, we utilize one-hot encoding to expand each term into several terms consisting of either 1 or 0. There are 97 selected features in total, 33 of which are categorical.

To compare different feature engineering results, we use the original data, data with PCA dimensionality reduction, and data with tanh conversion, among many other feature engineering schemas. We try PCA because after one-hot encoding, the feature space is expanded into hundreds of dimensions.
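The preprocessing schemas above can be sketched as follows. This is a minimal illustration, not our production pipeline: the records and column names are hypothetical stand-ins for the 97 real features, and the component count is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical borrower records; the real system has 97 features, 33 categorical.
df = pd.DataFrame({
    "income": [5000.0, 120000.0, 8000.0, 30.0],
    "loan_amount": [1000.0, 50000.0, 2000.0, 500.0],
    "city": ["beijing", "shanghai", "beijing", "shenzhen"],  # categorical
})

# One-hot encode categorical columns into 0/1 indicator columns.
X = pd.get_dummies(df, columns=["city"]).to_numpy(dtype=float)

# Schema 1: PCA to reduce the dimensionality blown up by one-hot encoding.
X_pca = PCA(n_components=2).fit_transform(X)

# Schema 2: tanh to squash features of sharply different magnitudes into (-1, 1).
X_tanh = np.tanh(X)

print(X.shape, X_pca.shape, X_tanh.shape)
```

Note that applying tanh directly to large raw values (e.g. income) saturates them near 1; the point of the schema is exactly to flatten such magnitude contrasts.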
We explore tanh because of the sharp contrast in the magnitudes of different features.

We use two of the most powerful workhorses in the machine learning field to detect fraudulent users: random forest and gradient boosting decision tree. We choose these two models because they provide easy-to-tune parameters and robust results.

Random forest is an ensemble learning algorithm that aggregates the results from a group of regression or classification trees. The parameters that need to be tuned include the maximum depth of each tree and the total number of trees in the forest.

Gradient boosting decision tree (GBDT) is one of the most successful gradient boosting algorithmic paradigms. The general workflow of a gradient boosting algorithm is described in [6]. GBDT utilizes the general gradient boosting schema with trees as its elementary components. The parameters that need to be tuned include the maximum depth of each boosting tree, the learning rate of the algorithm, the number of trees in the model, etc.

To evaluate our algorithms, we use the AUC metric. We choose AUC because it is insensitive to the class balance ratio. AUC measures the area under the ROC curve. A random classification result is close to 0.5; a perfect classification result is close to 1.0. The higher the AUC value, the better the classification result.

The major toolkits we use to develop and test our algorithms are scikit-learn and xgboost.
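A minimal sketch of the two models and the AUC evaluation described above. The data here is synthetic (our user data is proprietary), and we substitute scikit-learn's GBDT implementation for xgboost to keep the example self-contained; the tunable parameters (max depth, learning rate, number of trees) are the same ones named in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for borrower features; labels mark fraudulent users.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Random forest: the key knobs are max_depth and n_estimators.
rf = RandomForestClassifier(max_depth=4, n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

# GBDT (the paper uses xgboost; sklearn's version keeps this sketch
# dependency-free): tune max_depth, learning_rate and n_estimators.
gbdt = GradientBoostingClassifier(max_depth=3, learning_rate=0.1,
                                  n_estimators=100, random_state=0)
gbdt.fit(X_train, y_train)
auc_gbdt = roc_auc_score(y_test, gbdt.predict_proba(X_test)[:, 1])

# AUC: 0.5 is a random classifier, 1.0 is a perfect one.
print(f"RF AUC={auc_rf:.3f}  GBDT AUC={auc_gbdt:.3f}")
```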
We select two datasets to test our algorithms: one consists of 30K normal users and 30K users with overdue payments (dataset A); the other consists of 50K users, 25K of them fraudulent and 25K normal (dataset B). We split each dataset into a training set, a test set and a validation set with a ratio of 4:1:1.

In our random forest algorithm, we enumerate the values of the maximum depth of each tree and the number of trees in the model. Figure 1 illustrates the scatter plot visualizing the results (AUC) of random forest + PCA on dataset A. Maximum depths of trees are enumerated between 2 and 5, while the number of trees in the forest is enumerated between 5 and 120. As shown in the figure, AUC increases with the maximum depth until it reaches the value of 4. It is comparatively insensitive to the number of trees in the forest.

Figure 2 visualizes the change of parameters with the tanh function applied to features so that they become normalized. Tanh generates better results than PCA alone or PCA and tanh combined, with the average AUC being 0.780 on the test set and 0.797 on the validation set.
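The 4:1:1 split and the parameter enumeration can be sketched as below, again on synthetic data. To keep the sketch fast we walk only a coarse sub-grid of the tree counts (the paper enumerates 5 through 120); the depth range 2-5 matches the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset.
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)

# 4:1:1 split into training, test and validation sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=400, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=200, random_state=0)

# Enumerate (max_depth, n_trees) pairs and record test-set AUC for each.
results = {}
for depth in range(2, 6):                 # depths 2..5, as in the paper
    for n_trees in range(5, 121, 25):     # coarse sub-grid of 5..120
        clf = RandomForestClassifier(max_depth=depth, n_estimators=n_trees,
                                     random_state=0)
        clf.fit(X_train, y_train)
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        results[(depth, n_trees)] = auc

best = max(results, key=results.get)
print("best (depth, n_trees):", best, "AUC:", round(results[best], 3))
```

Plotting `results` as a scatter of AUC against the two parameters reproduces the kind of visualization shown in the figures.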
Fig. 1. Scatter plot of random forest + PCA on dataset A.
Fig. 2. Scatter plot of random forest + tanh on dataset A.
We also test random forest on dataset B. We enumerate the maximum depths and numbers of trees as the parameters of the random forest algorithm. Figure 3 shows the result with the tanh function applied to the features. Similar to our result on dataset A, tanh alone generates better results than PCA or PCA and tanh combined. The average AUC on both the test and validation sets is 0.83.

Figure 4 shows the results of tanh applied to the GBDT input features on dataset B. After eliminating a few AUC outliers, the average AUC on both the test and validation sets is close to 0.88.

On both datasets, PCA generates a more consistent AUC metric per parameter setting compared to the same algorithm on the plain dataset. However, the tanh operation on features yields better scores than PCA alone or PCA and tanh combined on both datasets. We see from our experimental results that, partially due to the nice data structure, simple tweaks after careful feature selection can lead to results usable for our online systems.
Fig. 3. Scatter plot of random forest + tanh on dataset B.
Fig. 4. Scatter plot of GBDT + tanh on dataset B.
Detection of fraudulent and overdue-payment users is a crucial step in a Fintech company's business processes. Effectively filtering out such users can save tens of millions of US dollars for the company. In this paper we propose using feature engineering and the random forest / GBDT algorithms to detect fraudulent users in the P2P financial market. Due to the high fraud rate in the market (more than 10%), it is comparatively easy to generate satisfying results compared to other markets, such as conventional banking systems, where the fraud rate can be as low as 2%. We tested our algorithms on two real-world datasets and visualized the AUC metric together with the selection of model parameters.

In future work, we would like to explore the possibility of taking advantage of deep learning techniques to detect fraudulent and overdue-payment users in our systems. Deep learning has been used effectively and ubiquitously across a whole spectrum of machine learning tasks, and we expect it to yield further improvements in our evaluation metrics.
References
1. A. Beutel, W. Xu, V. Guruswami, C. Palow and C. Faloutsos, "CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks", WWW '13
2. Q. Cao, X. Yang, J. Yu and C. Palow, "Uncovering Large Groups of Active Malicious Accounts in Online Social Networks", CCS '14
3. N. Shah, A. Beutel, B. Gallagher and C. Faloutsos, "Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective", ICDM '14
4. B. Prakash, A. Sridharan, M. Seshadri, S. Machiraju and C. Faloutsos, "EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs", PAKDD '10
5. V. Vlasselaer, T. Eliassi-Rad, L. Akoglu, M. Snoeck and B. Baesens, "GOTCHA! Network-based Fraud Detection for Social Security Fraud", Management Science '14
6. https://en.wikipedia.org/wiki/Gradient_boosting