Information collection for fraud detection in P2P financial market
Hao Wang, Zonghu Wang, Bin Zhang, and Jun Zhou
HC Research, HC Financial Service Group, China
HC DevOps, HC Financial Service Group, China
HC Big Data, HC Financial Service Group, China
Abstract. Fintech companies have long faced challenges from fraudulent behavior. The fraud rate in the Chinese P2P financial market can be as high as 10%. It is crucial to collect sufficient information about the user as input to the anti-fraud process. Data collection frameworks for Fintech companies differ from those of conventional internet firms. With individual-based crawling requests, we need to deal with challenges that are negligible elsewhere. In this paper, we outline how we collect data from the web to facilitate our anti-fraud process, and we review the challenges we face and our solutions to them. Our team at HC Financial Service Group is one of the few in the industry capable of developing full-fledged crawlers on its own.
The fraud rate in the Chinese P2P market is estimated at over 10%. This high fraud rate poses a severe threat to P2P companies compared with the conventional banking system, where the fraud rate is as low as 2%. To counter fraudulent behavior, most P2P companies create a credit-checking mechanism to block out malicious users. The credit-checking measure crawls the user's information from different websites upon his or her authorization. Such information is crucial for the subsequent anti-fraud detection step, which takes the collected user information as input.

Major P2P companies in China have their own information collection teams. However, many small-to-medium enterprises do not have the capacity to crawl data on their own. They rely on third-party companies to provide such services. Well-known companies that provide commercial data collection services include Tong Dun, Mo Jie, and Xin De.

As the 3rd largest P2P company in China, HC Financial Service Group has built a full-fledged technical team specialized in information collection. The team collects all the necessary information about users from the internet and stores it in the data storage platform.

Information collection for P2P products faces several challenges: 1. Websites frequently get updated, so the information collection team needs to constantly monitor websites' online status; 2. Anti-crawling mechanisms, including security ActiveX controls, sometimes make it very difficult for crawlers to collect data; 3. Information collection is very sensitive to failures. A failed attempt to collect user information blocks the loan application process and makes the company lose a potential customer.

In this paper, we describe how information collection works at HC Financial Service Group. We show the workflow of our information collection functionality and provide a brief overview of how we tackle the problems that information collection teams face in general in the P2P market. For commercial privacy reasons, we cannot give a thorough description of our technologies; we hope the overview in the following sections helps researchers and industrial workers better understand the information collection and anti-fraud processes in the industrial world.

* Corresponding author: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
MATEC Web of Conferences, 06006 (2018), MEAMT 2018. https://doi.org/10.1051/matecconf/201818906006
Fraudulent behavior has had a long and dark history since the advent of the internet. A great deal of research has been invested in fighting the malicious users who target valuable online assets. Famous anti-fraud algorithms include the Facebook-invented algorithm CopyCatch [1], which detects lockstep behavior. Later, the Facebook team came up with a new invention called SynchroTrap [2] to detect synchronized attacks. CMU researchers invented the fBox [3] and EigenSpokes [4] algorithms to detect community fraud.

Fraud detection is one of the key business processes of Fintech product cycles. Common anti-fraud techniques include Bayesian networks, logistic regression, and other machine learning-based prediction methods, which rely heavily on feature engineering and domain knowledge. Vlasselaer et al. [5] provided a machine learning framework for graph-based financial fraud detection in general.

Crawlers have been widely adopted by internet companies. Google, Bing, and Baidu all developed their own crawlers to collect data from the world wide web. Some crawling research has focused on traversal strategy [6], while other work focuses on the adaptability of crawlers [7].
Anti-fraud capability is crucial for P2P companies, and it is one of their most valuable technical assets. Anti-fraud algorithms usually take information related to users and apply feature engineering to that information before the fraud detection mechanism runs. The user information gathered for anti-fraud processes includes the following categories:

1. Financial information: Features in this group include the user's personal income, car rent, house rent, etc.
2. Work information: Features in this group include the user's company's income, how long ago the company was founded, etc.
3. Transaction information: Features in this group include the amount of money the user borrows in the transaction, whether the user has submitted applications before, etc.
4. Demographic information: Features in this group include the number of family members of the user, etc.

There are two different sources of this information: user input and data crawling. HC Financial Service Group has both online systems and offline outlets serving customers who want to borrow money. In our online system, users open the app, input the required information in online forms, and authorize the system to crawl data from different websites. At our offline outlets, company sales staff help customers finish the process.

We collect user information from websites such as People's Bank of China and China Mobile. For commercial privacy reasons, we cannot disclose the full list of websites from which we collect user information. Instead, we give the following categories of websites where we collect it:

1. Communication overview: We collect the user's communication information, necessary for our identity authorization and credit analysis processes, from websites such as China Mobile.
2. Credit report: We collect the user's credit report from government agencies such as the People's Bank of China.
3. Bank account information: We also collect the user's bank account information to better understand the financial status of the user.

The above information is needed in further processing of the data before the anti-fraud algorithms get involved. In the following section, we provide the workflow architecture of the information collection system.
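As a rough illustration of how the four feature categories above could be organized before feature engineering, the following sketch groups them into one record and flattens it into "group.field" keys. All field names here are hypothetical; the paper does not disclose the group's actual data schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical schema: field names are illustrative only.
@dataclass
class FinancialInfo:
    personal_income: float   # e.g. monthly income
    car_rent: float
    house_rent: float

@dataclass
class WorkInfo:
    company_income: float
    company_age_years: int   # how long ago the company was founded

@dataclass
class TransactionInfo:
    loan_amount: float
    has_prior_applications: bool

@dataclass
class DemographicInfo:
    family_members: int

@dataclass
class UserFeatures:
    """Collected user information fed to the anti-fraud pipeline."""
    financial: FinancialInfo
    work: WorkInfo
    transaction: TransactionInfo
    demographic: DemographicInfo

    def to_feature_dict(self) -> dict:
        # Flatten nested groups into "group.field" keys for feature engineering.
        flat = {}
        for group, values in asdict(self).items():
            for name, value in values.items():
                flat[f"{group}.{name}"] = value
        return flat
```

A downstream anti-fraud model would consume the flattened dictionary (or a vector derived from it) rather than the nested record.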
Fig. 1. Data collection workflow at HC Financial Service Group.
Fig. 1 illustrates the data collection workflow at HC Financial Service Group. The user's data crawling request is first fed into the producer. The request then flows to a consumer via brokers. The data crawling interface then chooses Java-written or Python-written programs to collect data from the web and returns the results to the user interface.

Our crawling procedures are all individual-based, meaning that we do not do massive data collection similar to what search engines do. Instead, we only invoke the crawling process for each user when requested. We use commonly used crawling toolkits such as Scrapy and PhantomJS for our purposes. Although authorized crawling does not sound technologically sophisticated, it poses many technical challenges to our team. We give an outline of the problems we encounter in the next section.
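The producer-broker-consumer flow above can be sketched as follows. This is a minimal in-process sketch only: an in-memory queue stands in for the real message broker, and the dispatch rule (which sites go to which crawler implementation) is a hypothetical example, not HC's actual routing logic.

```python
import queue
import threading

# In-process stand-in for the real message broker.
broker: "queue.Queue" = queue.Queue()

def crawl_with_python(request: dict) -> dict:
    # Placeholder for a Scrapy/PhantomJS-based crawler.
    return {"user_id": request["user_id"], "site": request["site"], "engine": "python"}

def crawl_with_java(request: dict) -> dict:
    # Placeholder for delegating to a Java-written crawler service.
    return {"user_id": request["user_id"], "site": request["site"], "engine": "java"}

def consumer(results: list) -> None:
    while True:
        request = broker.get()
        if request is None:          # sentinel: shut down the worker
            break
        # Hypothetical dispatch rule: sites guarded by ActiveX controls go to
        # the Java/Windows side, the rest to Python toolkits.
        handler = crawl_with_java if request.get("needs_activex") else crawl_with_python
        results.append(handler(request))

results: list = []
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
# Producer: one crawl request per user, invoked only upon user authorization.
broker.put({"user_id": "u1", "site": "china-mobile", "needs_activex": False})
broker.put({"user_id": "u1", "site": "bank-portal", "needs_activex": True})
broker.put(None)
worker.join()
```

The key property this preserves from the paper's workflow is that crawling is request-driven per individual user, rather than a continuous bulk crawl.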
Since crawling data is an inalienable part of the entire P2P product cycle, it is crucial to maintain its robustness and efficiency. Although we have a team of more than 20 people working on data collection at HC Financial Service Group, we still face severe challenges that we must deal with almost daily to guarantee the proper functioning of our products. To be more specific, we encounter the following problems when crawling data from the web:
1. Major Chinese banks have installed ActiveX controls on their websites as security guards. Commonly used crawling toolkits such as Scrapy and PhantomJS cannot pass through them.
2. Complex JavaScript-written encryption scripts prevent simple crawling techniques from gathering data.
3. The crawling process intermittently gets blocked by the website.
4. Websites are frequently rewritten and updated.
5. There are too many websites from which data needs to be gathered. Each website needs manual code inspection and parsing, which consumes a lot of people and time.
6. Some of our crawling toolkits have technical restrictions. For instance, PhantomJS's memory consumption is high, and it grows fast as the number of threads increases.
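For challenge 3 (intermittent blocking), a standard mitigation is to retry failed requests with exponential backoff and jitter. The sketch below shows this general technique; the retry count and delays are illustrative assumptions, not HC's production settings.

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0):
    """Retry a flaky fetch with exponential backoff and jitter.

    `fetch` is any callable performing one crawl attempt. The parameters
    here are illustrative, not production values.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff plus jitter, so retries do not follow a
            # fixed rhythm that anti-crawling systems can recognize.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

In an individual-based setting a retry budget matters more than in bulk crawling: each request corresponds to a real loan application, so the wrapper re-raises only after exhausting all attempts.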
The ActiveX control has been one of the biggest challenges facing our data collection team. American bank customers are largely unfamiliar with ActiveX controls, since they are almost never used on banks' websites there. In contrast, most major banks in China use ActiveX controls (Fig. 2) as a necessary safeguard against hackers and malicious users.
Fig. 2. Notification from China Merchants Bank that an ActiveX control needs to be installed to enable login.
In our crawling system, we use the success rate to measure robustness. A 90% success rate means that among 100 requests to crawl data, 10 failed. Websites with ActiveX controls enabled cut the success rate of the crawling system by half compared with data collection on other websites.

To solve this problem, we created a Windows ActiveX control service that emulates a real user's online behavior to pass through the ActiveX security guard. By doing so, we are able to achieve a success rate that beats most third-party crawling solution providers.

It is a tedious job to monitor and maintain the status of the different websites. We established a monitoring system in which each crawling failure sends a message notifying the developer. Whenever a website gets updated, it triggers a failure for the crawlers and the developers are notified instantly.
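The monitoring idea above, per-failure notification plus a success-rate health measure, can be sketched as below. The window size, threshold, and alert format are illustrative assumptions; the paper does not describe the actual monitoring implementation.

```python
from collections import deque

class CrawlMonitor:
    """Track crawl success rate over a sliding window and collect alerts.

    Parameters are illustrative, not HC's production settings.
    """

    def __init__(self, window: int = 100, alert_threshold: float = 0.9):
        self.window = deque(maxlen=window)   # recent outcomes, True = success
        self.alert_threshold = alert_threshold
        self.alerts: list = []               # stand-in for developer notifications

    def record(self, site: str, success: bool) -> None:
        self.window.append(success)
        if not success:
            # Every individual failure notifies the developers immediately,
            # since a failed crawl blocks a real loan application.
            self.alerts.append(f"crawl failed: {site}")

    def success_rate(self) -> float:
        if not self.window:
            return 1.0
        return sum(self.window) / len(self.window)

    def is_healthy(self) -> bool:
        return self.success_rate() >= self.alert_threshold
```

A website redesign shows up in this scheme as a sudden cluster of failure alerts for one site, which is exactly the signal that tells developers the site's parser needs updating.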
Fraud detection relies heavily on what users input and on what we can learn about them. The more we learn about a user, the more capable we become of predicting that user's credibility.

Data collection is an inalienable part of the entire fraud detection process. The robustness and efficiency of the crawlers are critical for the entire P2P product cycle. In this paper, we gave an overview of the data collection system at HC Financial Service Group. We also outlined the challenges we face when collecting data and how we managed to solve them. Although data crawling is unavoidable for P2P companies, many small-to-medium-sized firms are not capable of developing data collection systems on their own. We hope this paper can help them, as well as teams in bigger corporations, develop better systems.

Data collection might seem trivial to many experienced industry workers. This is partly because the most discussed crawlers are those used in search engine companies, where a technical failure of the system does not lead to a direct loss of money. Many crucial technical issues hide behind the simple and trivial appearance of the technology. For a financial company like HC Financial Service Group, however, robustness and efficiency are the topmost priorities for the crawling team. Problems that are trivial elsewhere pose great threats in the financial world.

In future work, we would like to keep improving our system. In particular, we would like to build more robust and efficient systems by solving the challenges mentioned in this paper. We hope our crawling system can serve as a model example of how Fintech companies should build their data collection systems.
References
1. A. Beutel, W. Xu, V. Guruswami, C. Palow, C. Faloutsos, "CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks," WWW '13
2. Q. Cao, X. Yang, J. Yu, C. Palow, "Uncovering Large Groups of Active Malicious Accounts in Online Social Networks," CCS '14
3. N. Shah, A. Beutel, B. Gallagher, C. Faloutsos, "Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective," ICDM '14
4. B. Prakash, A. Sridharan, M. Seshadri, S. Machiraju, C. Faloutsos, "EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs," PAKDD '10
5. V. Vlasselaer, T. Eliassi-Rad, L. Akoglu, M. Snoeck, B. Baesens, "GOTCHA! Network-based Fraud Detection for Social Security Fraud," Management Science '14
6. Y. Wang, J.-M. Yang, W. Lai, R. Cui, L. Zhang, W.-Y. Ma, "Exploring Traversal Strategy for Web Forum Crawling," SIGIR '08
7. D. Ahlers, S. Boll, "Adaptive Geospatially Focused Crawling," CIKM '09