Improving Investment Suggestions for Peer-to-Peer (P2P) Lending via Integrating Credit Scoring into Profit Scoring

Abstract

In the peer-to-peer (P2P) lending market, lenders lend the money to the borrowers through a virtual platform and earn the possible profit generated by the interest rate. From the perspective of lenders, they want to maximize the profit while minimizing the risk. Therefore, many studies have used machine learning algorithms to help the lenders identify the "best" loans for making investments. The studies have mainly focused on two categories to guide the lenders' investments: one aims at minimizing the risk of investment (i.e., the credit scoring perspective) while the other aims at maximizing the profit (i.e., the profit scoring perspective). However, they have all focused on one category only and there is seldom research trying to integrate the two categories together. Motivated by this, we propose a two-stage framework that incorporates the credit information into a profit scoring modeling. We conducted the empirical experiment on a real-world P2P lending data from the US P2P market and used the Light Gradient Boosting Machine (lightGBM) algorithm in the two-stage framework. Results show that the proposed two-stage method could identify more profitable loans and thereby provide better investment guidance to the investors compared to the existing one-stage profit scoring alone approach. Therefore, the proposed framework serves as an innovative perspective for making investment decisions in P2P lending.


Improving Investment Suggestions for Peer-to-Peer (P2P) Lending via Integrating Credit Scoring into Profit Scoring

Yan Wang

Kennesaw State University, Kennesaw, GA, [email protected]

Xuelei Sherry Ni

Kennesaw State University, Kennesaw, GA, [email protected]


CCS CONCEPTS

• Computer systems organization → Machine learning; Modeling.

KEYWORDS

P2P Lending; Credit Scoring; Profit Scoring; LightGBM

ACM Reference Format:

Yan Wang and Xuelei Sherry Ni. 2020. Improving Investment Suggestions for Peer-to-Peer (P2P) Lending via Integrating Credit Scoring into Profit Scoring. In

ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3374135.3385272


Peer-to-peer (P2P) lending is the practice of matching anonymous lenders with borrowers through an electronic platform so that lenders can directly invest in (lend to) certain borrowers [1]. In general, lenders can earn higher returns relative to savings and other investment products offered by banks when borrowers pay back their loans as scheduled. However, the loans on the P2P market are unsecured, and investors need to tolerate the risk of losing part or even all of their principal if borrowers default on the loans. To help investors find the safer loans with relatively lower risk, it is beneficial to evaluate each loan from the perspective of “the risk level”, which is typically done by estimating the probability of default (PD). Loans with lower PDs are considered safer than those with higher PDs. The PD for each loan can be predicted from its characteristics, such as the loan amount, the loan purpose, and the assets of the borrower. This approach is known as the credit scoring approach. It poses a classification problem that classifies each loan as either (1) a default case if the predicted PD exceeds a predefined threshold, or (2) a non-default case otherwise. Subsequently, the credit scoring approach recommends that lenders invest in non-default loans or loans with lower predicted PDs because of the potentially lower risk.

In the P2P market, minimizing the risk is one but not the only objective for investors. The profit of a loan matters even more to lenders, making it crucial to evaluate each loan at “the profit level”, which is known as the profit scoring approach. In [16], profit scoring was first proposed as an alternative to credit scoring in P2P lending, and the internal rate of return (IRR) was used as the measure of profit. IRR is a well-known financial formula [2].
For example, suppose there are two borrowers obtaining a $100 loan each, and suppose the first borrower pays back $150 and the second one pays back $95. Then the IRRs for the first and the second borrowers are \( (150 - 100)/100 = 50\% \) and \( (95 - 100)/100 = -5\% \), respectively. [...] Both of them generate an IRR of 50%. However, the ARRs for the first and the second borrowers, each of the form \( (150/100)^{1/Y} \) with the borrower's own repayment duration \( Y \), differ. An investment that gains its profit in a shorter period of time is valued higher by ARR. Considering that in P2P lending the duration of the repayments varies across borrowers, we use ARR as the profit measure in our study.

\[ \mathrm{ARR} = \left( \frac{P_a}{P_r} \right)^{1/Y} \qquad (1) \]

Here \( P_a \) is the total amount paid back, \( P_r \) is the principal, and \( Y \) is the repayment duration in years.

Both the credit scoring approach and the profit scoring approach can be used to evaluate loans and make recommendations to investors. However, they work from different perspectives. As pointed out in [16], the factors determining the profit differ from those determining the PD, although overlapping factors exist. The credit scoring approach helps lenders minimize the potential default risk. It identifies the loans with lower PDs, and these “safe” loans are considered the “good” loans. From the credit scoring perspective, the “safe” loans may lead to a good profit since they have a higher probability of being fully repaid. On the other hand, the profit scoring method identifies the loans with higher predicted profits, conditional on borrowers fully paying off their loans (i.e., on the loans being non-default), and these “more profitable” loans are considered the “good” loans from the profit scoring perspective. Although working from different perspectives, both credit scoring and profit scoring ultimately aim to help investors earn more profit from their investments.

Credit scoring focuses only on PDs and ignores the profit, while profit scoring targets only the profit and ignores the default risk, so neither approach evaluates the loans comprehensively. Intuitively, the higher the PD of a loan, the higher the interest rate associated with it, and thus the higher the profit it may lead to. Therefore, the credit scoring information may add predictive power for the profit, and integrating the two scoring approaches may provide better investment suggestions. Motivated by this conjecture, we design a two-stage framework that integrates the credit scoring information into the profit scoring method in the evaluation of loans. Specifically, in stage 1, each loan's PD is estimated by a classifier.
The predicted PD then serves as an additional predictor in stage 2, where a regressor is used to obtain the predicted profit of each loan. Subsequently, lenders might be able to select loans with a higher predicted profit than those selected through the single-step approach.

To the best of our knowledge, the proposed two-stage framework is the first to incorporate credit scoring and profit scoring together to evaluate loans. To validate the effectiveness of the proposed approach, we conducted an empirical study using real-world data from Lending Club, which represents most of the P2P transactions in the US. The results indicate that the two-stage approach outperforms the existing one-stage profit-scoring-alone approach with respect to the identification of the more profitable loans.

This paper is structured as follows. We first review the related work on credit scoring and profit scoring in the P2P domain in Section 2. Section 3 gives a brief overview of the proposed two-stage modeling approach based on the Light Gradient Boosting Machine (lightGBM) algorithm. The details of the empirical study are presented in Section 4. Section 5 displays the experimental results. Conclusions and discussion are given in Section 6.

In this section, the related work on credit scoring and profit scoring is summarized.

In the P2P market, credit scoring is formulated as a classification problem with a binary outcome: default loans (i.e., more than 150 days past due) and non-default loans (i.e., fully paid). Different classifiers have been used in the credit scoring area, including logistic regression, support vector machine, Naive Bayes, k-nearest neighbors, random forest, and neural network [15]. Logistic regression is considered a natural method for credit scoring because of its relatively strong performance. Furthermore, it was shown that logistic regression could reach the best precision compared to other classifiers, including support vector machine, Naive Bayes, and random forest, on the Lending Club data [10]. In [13], a random forest based classification approach was used to identify the loan status, and the random forest model reached a higher accuracy than support vector machine or logistic regression. In [9], a deep dense convolutional network was created to predict the repayment amount of P2P lending. Tree-based ensemble algorithms, including lightGBM and XGBoost, have been used to evaluate the loans on the Lending Club platform as well [12]. Moreover, some studies have focused on creating hybrid models that aim to further improve the performance of the credit scoring approach. For instance, in [6], a hybrid model combining random forest and neural networks was proposed. Despite the variety of machine learning models proposed in the credit scoring area, all of them focus on targeting the “safest” loans and ignore their profitability.

Recently, some studies have shifted their focus from credit scoring to profit scoring. However, research on profit scoring for P2P lending is still limited. As discussed in Section 1, IRR and ARR have been used as the target for this approach [16][18]. Since both IRR and ARR are continuous, profit scoring is formulated as a regression problem. In [16], multiple linear regression and decision tree models are used for the prediction of IRR. In [18], a cost-sensitive extreme gradient boosting (CSXGBoost) model is used to predict ARR. Regardless of the choice of the profit measure, profit scoring models focus only on finding the most “profitable” loans and ignore their default risk.

As discussed in Section 1, the credit scoring information may be beneficial in the detection of more profitable loans. To incorporate the credit information into profit scoring, an intuitive approach is to use the loan status (i.e., default or non-default) as an additional predictor in the profit scoring approach. Although this works on historical data, it cannot be used in real applications because the loan status is unknown when a loan is initiated, which is exactly when lenders would like to assess its profitability. To overcome this problem, a two-stage method is developed, and its structure is shown in Figure 1. Stage 1 predicts the PD by formulating it as a binary classification problem. The predicted PD generated from stage 1 is then used as an additional feature in stage 2 for the prediction of ARR. The design of the two-stage approach is based on the assumption that the PD is predictive of ARR. We hope that adding the PD as an additional predictor may help avoid loans with extremely high profit but extremely high risk, which is especially helpful for conservative investors.

Figure 1: The Illustrative Structure of the Two Stage Model

As shown in Figure 1, one classifier and one regressor are needed in stage 1 and stage 2, respectively. Theoretically, all kinds of classifiers and regressors could be used in the two-stage modeling process. In this study, we select lightGBM as both the classifier in stage 1 and the regressor in stage 2.

LightGBM originated from Gradient Boosting Decision Tree (GBDT), which is an ensemble learning approach using the decision tree as the base classifier. GBDT can enhance a weak classifier into a strong one by iterative training [22]. It soon became a popular choice in many machine learning tasks, and more than half of the championship programs in the Kaggle competitions used GBDT [12]. XGBoost is one type of GBDT proposed in 2015. In recent years, XGBoost has been frequently applied because of its speed and scalability [4]. LightGBM, designed in 2016, is another type of GBDT and was proposed to solve the problems encountered by XGBoost on large-scale data. Details of the lightGBM theory can be found in [8]. LightGBM supports efficient parallel training, so it can have a lower computational cost while performing better than XGBoost [8]. As a result, LightGBM is increasingly preferred in sorting, classification, and regression tasks [17]. As mentioned in Section 2.1, lightGBM was first introduced into the P2P area for the prediction of loan repayments [12]. However, no research has used lightGBM for the prediction of a loan's profitability. This is the first application of LightGBM in this area.

In summary, we chose lightGBM as both the classifier in stage 1 and the regressor in stage 2 for the following reasons:

• LightGBM can handle both classification and regression problems [12]. Using the same model in stages 1 and 2 simplifies the model structure.
• GBDT is an ensemble method, and its performance is significantly better than that of most conventional machine learning methods, as demonstrated in previous studies [5][19][7]. As one type of GBDT, lightGBM has been shown to have good stability and accuracy [21][11]. It has a relatively small computational cost but provides a good training effect.
• This is the first attempt at using lightGBM to predict the profitability of a loan.

Therefore, in our proposed two-stage lightGBM model, stage 1 is designed as a credit scoring model, which uses lightGBM to get the predicted PD for all the loans. The predicted PD is then used as an additional predictor in stage 2 for the prediction of ARR, which also uses the lightGBM algorithm. The hyper-parameters of the lightGBM model in both stages, including the number of trees, the number of levels for each tree, and the percentage of subsample used during each iteration, are tuned by trial and error with the goal of minimizing the loss on the testing set.
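The two-stage structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`append_column`, `predicted_pd`, `fit_two_stage`, `predict_two_stage`) and the two toy stand-in models are hypothetical, the features are assumed to be a plain list of numeric rows, and the stage-2 training rows use in-sample predicted PDs, a detail the paper does not specify.

```python
from typing import List, Sequence


def append_column(X: Sequence[Sequence[float]], column: Sequence[float]) -> List[List[float]]:
    """Return a copy of X (a list of feature rows) with one extra trailing column."""
    return [list(row) + [float(value)] for row, value in zip(X, column)]


def predicted_pd(classifier, X) -> List[float]:
    """Probability of the positive class (default = 1) for every loan."""
    return [float(row[1]) for row in classifier.predict_proba(X)]


def fit_two_stage(classifier, regressor, X_train, default_train, arr_train):
    # Stage 1: credit scoring, a binary classifier for loan_status.
    classifier.fit(X_train, default_train)
    # Stage 2: profit scoring on the original features plus the predicted PD.
    # Assumption: in-sample PDs are used for the training rows.
    X_stage2 = append_column(X_train, predicted_pd(classifier, X_train))
    regressor.fit(X_stage2, arr_train)
    return classifier, regressor


def predict_two_stage(classifier, regressor, X_new) -> List[float]:
    """Predicted ARR for new loans, using the stage-1 PD as an extra feature."""
    X_stage2 = append_column(X_new, predicted_pd(classifier, X_new))
    return list(regressor.predict(X_stage2))


class ConstantRateClassifier:
    """Toy stand-in for a real classifier: predicts the training default rate."""

    def fit(self, X, y):
        self.rate_ = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        return [[1.0 - self.rate_, self.rate_] for _ in X]


class MeanRegressor:
    """Toy stand-in for a real regressor: predicts the mean training target."""

    def fit(self, X, y):
        self.n_features_ = len(X[0])  # records how many columns stage 2 received
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Hypothetical toy data: two features per loan, a default flag, and a realized ARR.
X = [[6000.0, 14.99], [10000.0, 7.5], [3000.0, 22.0], [8000.0, 11.0]]
default = [0, 0, 1, 0]
arr = [1.12, 1.06, 0.70, 1.09]

clf, reg = fit_two_stage(ConstantRateClassifier(), MeanRegressor(), X, default, arr)
predictions = predict_two_stage(clf, reg, [[5000.0, 12.0]])
```

The helpers only rely on `fit`, `predict_proba`, and `predict`, so with the `lightgbm` package installed the stand-ins could in principle be swapped for `lightgbm.LGBMClassifier()` and `lightgbm.LGBMRegressor()`, which expose that scikit-learn style interface.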

As discussed in Section 1, it is our hope that adding the credit scoring information would be beneficial in the detection of more profitable loans. Thus, in this study, we aim to answer the following research question explicitly based on P2P lending:

Is incorporating the credit information into the profit scoring approach better than the profit scoring alone approach in identifying the “more profitable” loans?

To address the above-mentioned question, we design a comprehensive empirical study, and the details along with the data used are described in the following subsections.

The P2P lending market appeared in the US in February 2006. By June 2012, Lending Club had become the largest P2P platform in the US with respect to issued volume and revenue. Therefore, the transactions on the Lending Club platform are a good representation of the P2P market in the US. Figure 2 shows the homepage of the Lending Club website. Lending Club acts as a third-party platform between investors and borrowers, and a P2P transaction occurs when: (1) a borrower applies for a loan and Lending Club approves the application; and (2) an investor decides to invest in the loan if he/she thinks the borrower meets certain criteria.

Figure 2: The Homepage of the Lending Club Platform

The historical Lending Club data can be acquired from its official website. These data sets contain information on millions of loans issued since 2007. The data is updated over time, and a newly updated data set is made available every quarter. Since most loans [...]

Each loan in the data is identified by a unique ID, and the information of the loan is described by several features. Similar to previous research, these features are grouped into three categories: (1) loan characteristics; (2) creditworthiness; and (3) borrower information [18]. After removing the features with a high percentage of missing values (> 70% missing), we have 29 variables in the Lending Club data. 27 of the 29 variables are used as independent variables in the modeling stage. The definitions of the variables belonging to each category can be found in the appendix. The remaining two variables are the targets: one for the credit scoring approach and the other for the profit scoring approach. We introduce how they are generated in Sections 4.3 and 4.4, respectively.

Among the above-mentioned features, some are very helpful for investors in making decisions. For example, the variable “grade” denotes the grade of each loan that is pre-labeled by Lending Club, ranging from Grade A (the safest) to G (the riskiest). The variable “int_rate” denotes the interest rate that is pre-defined by Lending Club. Figure 3 shows the interest rate across different grades. Grade A loans are considered by Lending Club to have the lowest PD and are thus associated with the lowest interest rate. On the other hand, Grade G loans have relatively higher interest rates due to their high PDs. Conservative investors can select relatively “safer” grades to reduce the investment risk, while aggressive investors may select relatively “riskier” grades to earn a higher profit generated from the higher interest rate.

Based on these features, researchers of the P2P lending market can use statistical approaches or machine learning methods to help make investment decisions from two perspectives: (1) determining the PDs of the loans and recommending “safer” loans to investors; and (2) distinguishing profitable loans and recommending “more profitable” loans to investors [3]. Again, the former approach refers to the credit scoring method while the latter refers to the profit scoring method.

Figure 3: Interest Rates Across Different Grades

The purpose of credit scoring is to evaluate whether or not the borrowers will repay the loans in a timely manner. In Lending Club, each expired loan (i.e., not during repayment) ends in one of two states: (1) fully paid; or (2) charged off. Fully paid means the borrower has made all the repayments, while charged off means the loan is more than 150 days past due. We create a target variable named loan_status using Equation 2. Table 1 shows the frequency of each category of loan_status. About 80.44% of the loans are fully paid while 19.56% of them end up being charged off. The credit scoring approach focuses on minimizing the risk of investment by identifying the loans that are fully paid while avoiding those that are charged off. Note that for the rest of the paper, we use the words default and charged off interchangeably.

\[ \text{loan\_status} = \begin{cases} 0, & \text{if the loan is fully paid} \\ 1, & \text{if the loan is charged off} \end{cases} \qquad (2) \]

Table 1: Distribution of Loan_status

Status        Loan_status   Frequency   Proportion
fully paid    0             904,086     80.44%
charged off   1             219,809     19.56%

Figure 4 shows the stacked bar plot of the default rates across different grades labeled by Lending Club. As expected, grade A has the lowest default rate while grade G has the highest. Therefore, from the perspective of credit scoring, conservative investors should focus on the loans from grade A in order to minimize the default risk.
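As a small illustration of the target construction and of the class balance reported in Table 1, the sketch below encodes the two terminal states and recomputes the proportions from the published frequencies. The function name `encode_loan_status` and the raw status strings are assumptions made for this example; the frequencies are copied from Table 1.

```python
def encode_loan_status(status: str) -> int:
    """Credit scoring target: 0 for a fully paid loan, 1 for a charged-off loan."""
    mapping = {"fully paid": 0, "charged off": 1}  # assumed raw status strings
    key = status.strip().lower()
    if key not in mapping:
        raise ValueError(f"loan has not expired or status is unknown: {status!r}")
    return mapping[key]


# Frequencies from Table 1, keyed by the encoded loan_status.
TABLE_1 = {0: 904_086, 1: 219_809}

total_loans = sum(TABLE_1.values())
proportions = {label: count / total_loans for label, count in TABLE_1.items()}

for label, share in proportions.items():
    print(f"loan_status={label}: {share:.2%}")
```

The roughly 80/20 split means a classifier that always predicts "fully paid" would already be right about four times out of five, which is worth keeping in mind when reading accuracy figures for stage 1.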

Figure 4: Default Rates Across Different Grades

The purpose of profit scoring is to evaluate the profit generated by the investment. In the Lending Club data, no variable directly describes the profit of the loans. As discussed in Section 1, ARR is an appropriate profit measure, and it can be calculated from the existing features. For example, suppose an investor invests $6,000 at a nominal interest rate of 14.99% with 36 scheduled monthly payments. Theoretically, if the borrower pays back the loan as scheduled, the ARR is calculated as \( \left( \frac{6000 + 6000 \ast 0.1499 \ast 3}{6000} \right)^{1/3} \approx 1.13 \). However, in reality the borrower can pay back earlier or later. For example, after 16 months, the borrower may pay back $7,003, including all the principal as well as the interest. In this case, the loan expires with the status of being fully paid, and the real ARR is calculated as \( \left( \frac{7003}{6000} \right)^{12/16} \approx 1.12 \).

Figure 5: Distribution of ARR
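The two ARR calculations in the $6,000 example can be reproduced with a short function that applies Equation 1 directly. This is a minimal sketch: the function name `arr` and its argument names are chosen for this example, and the scheduled case assumes simple interest over three years, which is the reading that matches the reported value of about 1.13.

```python
def arr(paid_back: float, principal: float, years: float) -> float:
    """Annualized rate of return: (amount paid back / principal) ** (1 / years)."""
    if principal <= 0 or years <= 0:
        raise ValueError("principal and years must be positive")
    return (paid_back / principal) ** (1.0 / years)


# Scheduled repayment: $6,000 at 14.99% simple interest over 3 years (assumption).
scheduled = arr(6000 + 6000 * 0.1499 * 3, 6000, 3)

# Early repayment: $7,003 paid back after 16 months.
early = arr(7003, 6000, 16 / 12)

print(round(scheduled, 2), round(early, 2))
```

A loan with zero repayment gives `arr(0, principal, years) == 0.0`, the lower bound of the measure, and any value above 1 means the borrower paid back more than the principal.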

Figure 5 shows the distribution of the created variable ARR, which measures the profitability of each loan. From Equation 1, we can see that the range of ARR is \( [0, \infty) \). The minimum value of ARR is 0, which denotes the worst situation, where the loan gets zero repayment and the investor loses all the principal. An ARR larger than one denotes a profitable loan, indicating that the borrower pays back more than the principal. As shown in Figure 5, the mean and median ARR values are 0.99 and 1.07, respectively. Figure 6 shows the ARRs across different grades. The variation of ARR gradually increases from grade A to G. Some loans from grades C, D, E, and F lead to a very high ARR, sometimes even higher than that of some loans from grades A and B. Clearly, the “safer” loans are not always associated with the “more profitable” result, and the credit scoring approach cannot guarantee a good profit.

Figure 6: ARRs Across Different Grades

As discussed above, credit scoring uses loan_status as the target variable and aims at predicting the PD of loans. Thus, it can evaluate the “safeness” of loans. On the other hand, profit scoring uses ARR as the target variable and aims at predicting the profit of loans. Thus, it can evaluate the “profitability” of loans. Figure 7 shows the cross distribution of loan_status and ARR. Intuitively, there is a strong relationship between ARR and loan_status: a defaulted loan (i.e., loan_status = 1) tends to be associated with a non-profitable ARR and vice versa. This can be confirmed by the cross table between ARR and loan_status shown in Table 2 and by Figure 7. As shown in Figure 7, the variation of ARR of the default loans is much larger than that of the non-default loans, with some default loans resulting in an even higher ARR than non-default loans. Consequently, the loans identified with the lowest PD may not always be the best choice for investors, especially for aggressive lenders whose goal is to reach high profitability. Meanwhile, the default loans with a profitable ARR may be a potential choice for investors, but they should be recommended with caution. Moreover, previous studies showed that the explanatory variables differ in predicting loan_status and profit [16]. Considering all the reasons mentioned above, we conclude that credit scoring and profit scoring measure the loans from different perspectives and one cannot be replaced by the other. A “safe” loan identified by the credit scoring approach is not guaranteed to be a “profitable” loan under a profit scoring approach, while a “profitable” loan identified by the profit scoring approach cannot avoid the default risk. It is critical to integrate credit scoring and profit scoring to provide a comprehensive evaluation of the loans, which may lead to better investment decisions.

Figure 7: Distribution of ARR Across Different Categories of Loan_status

Table 2: Cross Table of Loan_status and ARR (columns: Loan status, ARR, Frequency, Proportion)

As discussed above, we finally kept 1,123,895 loans along with 27 variables in the Lending Club data. The data is then randomly split into a 70% training set (i.e., 786,726 loans) for training and a 30% testing set (i.e., 337,169 loans) for evaluation. For categorical features, one-hot encoding is applied. Take the feature application_type as an example, which has two categories: [...]

The proposed two-stage lightGBM model was first trained on the Lending Club training set and then evaluated on the test set. To confirm that incorporating credit scoring into profit scoring is beneficial in detecting “more profitable” loans, we compared its performance with the single profit scoring approach, which uses no information from credit scoring. Specifically, two models are compared: an existing profit-scoring-alone approach based on lightGBM (the One-stage Model), and the proposed two-stage lightGBM method (the Two-stage Model). In both models, the hyper-parameters are tuned using a trial and error approach with the goal of minimizing the Root Mean Squared Error (RMSE) on the test set. The details of the hyper-parameter settings are shown in Table 3. In both models, the final outcome is the predicted value of ARR. Loans with a higher predicted ARR are recommended. For comparison, we examine the profitability of the top 50 loans recommended by the two models in the testing data.

Figure 8 displays the comparison of the average ARR based on the top loans using the two models, where the x-axis denotes the number of top loans identified by the two models, ranging from 1 to 50. Here the value of 50 is large enough for evaluating the model performance since investors tend to care more about the top several (maybe only 5, 10, etc.) loans. The profitability of the proposed two-stage model is consistently higher than that of the profit-scoring-only method. This result supports our conjecture that incorporating the credit information into profit scoring could be beneficial in identifying the “more profitable” loans. Therefore, the two-stage model would be preferred in guiding investment decisions.

Table 3: Hyper-parameter Settings in LightGBM

Name               Description                                       Value
max_depth          max depth for each tree                           6
num_leaves         max leaves for each tree                          10
feature_fraction   percentage of features used for each tree         0.8
bagging_fraction   percentage of positive samples used for bagging   0.5
learning_rate      shrinkage rate                                    0.01
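For readers who want to reuse the settings in Table 3, they can be written as a plain parameter dictionary together with the RMSE criterion used for tuning. This is a sketch rather than the authors' configuration file: the dictionary name `LGBM_PARAMS` and the helper `rmse` are hypothetical, the keys are the parameter names as printed in Table 3, and the number of trees is omitted because Table 3 does not report it.

```python
import math

# Values copied from Table 3; the keys are the parameter names printed there.
LGBM_PARAMS = {
    "max_depth": 6,           # max depth for each tree
    "num_leaves": 10,         # max leaves for each tree
    "feature_fraction": 0.8,  # share of features used for each tree
    "bagging_fraction": 0.5,  # share of samples used for bagging
    "learning_rate": 0.01,    # shrinkage rate
}


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error between realized and predicted ARR values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical realized and predicted ARRs for two loans.
print(rmse([1.0, 1.2], [1.1, 1.0]))
```

In LightGBM, `bagging_fraction` only takes effect when `bagging_freq` is set to a non-zero value; Table 3 does not list `bagging_freq`, so that setting would have to be chosen by whoever reproduces the experiment.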

Figure 8: Comparison of the One-stage Model and the Two-stage Model in terms of the Average ARR of the Top LoansSelected from the Testing Data
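The quantity plotted in Figure 8 is the average realized ARR of the k loans with the highest predicted ARR, for k from 1 to 50. A minimal sketch of that computation is given below; the function names and the four-loan toy data are hypothetical and serve only to show the ranking logic.

```python
def average_arr_of_top_k(predicted_arr, realized_arr, k: int) -> float:
    """Average realized ARR of the k loans with the highest predicted ARR."""
    if not 1 <= k <= len(predicted_arr):
        raise ValueError("k must be between 1 and the number of loans")
    # Rank loan indices by predicted ARR, highest first.
    order = sorted(range(len(predicted_arr)), key=lambda i: predicted_arr[i], reverse=True)
    top = order[:k]
    return sum(realized_arr[i] for i in top) / k


def top_k_curve(predicted_arr, realized_arr, max_k: int = 50):
    """Average realized ARR for k = 1 .. max_k (capped at the number of loans)."""
    limit = min(max_k, len(predicted_arr))
    return [average_arr_of_top_k(predicted_arr, realized_arr, k) for k in range(1, limit + 1)]


# Hypothetical toy data: the loan ranked first by the model turns out to default.
predicted = [1.10, 1.30, 0.90, 1.20]
realized = [1.05, 0.40, 1.00, 1.25]
curve = top_k_curve(predicted, realized)
print(curve)
```

Computing one such curve per model on the same test loans gives the two lines that the figure compares.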

To further explore why the two-stage model could detect “more profitable” loans, we compare the composition of the top 50 loans identified by the two models; the result is shown in Table 4. Among the top 50 loans identified by the one-stage and the two-stage models, none were assigned Grade A by Lending Club. This is expected, since the safest loans (i.e., those assigned Grade A) can only lead to a very small profit because of the low interest rate. The one-stage model selects 1 loan from Grade B while the two-stage model does not select any loan from Grade B. Most of the loans recommended by the one-stage model come from Grades D and E, while the two-stage model recommends many loans from Grade F. Therefore, we can conclude that the two-stage model is more aggressive in selecting loans: it tends to select “riskier” loans as defined by Lending Club. These risky loans are associated with higher interest rates and thus potentially generate higher profit. In total, both models have 6 default loans among the top 50 selected loans.

Table 4: Top 50 Loans Selected by the Two Models

Grade   One-stage Model (Default Loans)   Two-stage Model (Default Loans)
B       1 (0)                             0 (0)
C       7 (1)                             0 (0)
D       14 (1)                            2 (0)
E       14 (2)                            13 (1)
F       10 (1)                            30 (3)
G       4 (1)                             5 (2)

Table 5 summarizes the average ARR and the default rate of the top 50 loans identified by the two models. It shows that the two-stage model selects loans with higher ARRs than those selected by the existing one-stage model (1.13 versus 1.09 on average), which supports our conjecture that incorporating the credit scoring information is beneficial to the performance of the profit scoring approach. We have another side result based on Table 5. The default rate of the two-stage model is 0.12, equal to that of the one-stage model. Therefore, the two-stage model could identify loans with higher profits without introducing extra default risk for investors.

Table 5: The Average ARR and the Default Rate of the Top 50 Loans Selected by the Two Models

Metric         One-stage Model   Two-stage Model
Average ARR    1.09              1.13
Default rate   0.12              0.12
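The default rates in Table 5 follow directly from the per-grade counts in Table 4, which the short check below makes explicit. The dictionary and function names are hypothetical; the counts are copied from Table 4 as (selected loans, default loans) per grade.

```python
# (selected loans, default loans) per grade, copied from Table 4.
ONE_STAGE = {"B": (1, 0), "C": (7, 1), "D": (14, 1), "E": (14, 2), "F": (10, 1), "G": (4, 1)}
TWO_STAGE = {"B": (0, 0), "C": (0, 0), "D": (2, 0), "E": (13, 1), "F": (30, 3), "G": (5, 2)}


def summarize(counts):
    """Return (number of selected loans, number of defaults, default rate)."""
    selected = sum(s for s, _ in counts.values())
    defaulted = sum(d for _, d in counts.values())
    return selected, defaulted, defaulted / selected


for name, counts in (("one-stage", ONE_STAGE), ("two-stage", TWO_STAGE)):
    selected, defaulted, rate = summarize(counts)
    print(f"{name}: {selected} loans, {defaulted} defaults, default rate {rate:.2f}")
```

Both models select 50 loans with 6 defaults, which gives the default rate of 0.12 reported for each, while the share of Grade F loans rises from 10 of 50 in the one-stage selection to 30 of 50 in the two-stage selection.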

Profit scoring focuses on profit predictions, and it considers the best loans to be those with the highest predicted profit. The biggest disadvantage of profit scoring is that it ignores the fact that default loans can also be profitable. To overcome this disadvantage of the conventional profit scoring approach, we proposed a two-stage framework that incorporates the credit scoring information into the profit scoring method. We used the lightGBM algorithm in both stages of the model since: (1) lightGBM is a highly efficient machine learning method for handling large-scale data [8]; and (2) as one of the state-of-the-art machine learning techniques, lightGBM has not been widely used in the P2P domain, which makes it worth introducing [12][21]. The effectiveness of the proposed two-stage lightGBM is evaluated on real-world P2P data. Results show that, compared to a single-step profit-scoring-only method (i.e., the one-stage lightGBM model), the proposed method can identify more profitable loans without introducing extra default risk to investors. Therefore, the results support that integrating the credit information into profit scoring can provide better investment suggestions to lenders by identifying “more profitable” loans.

Different from previous research, which focuses either only on credit scoring or only on profit scoring, our study is the first to propose a two-stage methodology with the goal of integrating the two scoring approaches. Theoretically, in future studies, many other choices are available for the classifier in stage 1 and the regressor in stage 2, as long as they can identify the non-linear relationships among the variables.

The application of the proposed framework is not limited to the P2P area. It can also be used in other domains that contain two correlated targets. Furthermore, the framework can even be extended to a multi-stage workflow to handle problems with multiple targets [20]. Depending on the data sets and the research requirements, the best algorithm used in stages 1 and 2 may vary. Nevertheless, the proposed framework can be viewed as the first attempt of its kind in the P2P area and has demonstrated promising results. It may serve as an innovative perspective that could better guide investment decisions.

REFERENCES

[1] A. Bachmann, A. Becker, D. Buerckner, M. Hilker, F. Kock, M. Lehmann, P. Tiburtius, and B. Funk. 2011. Online Peer-to-Peer Lending - A Literature Review. Journal of Internet Banking and Commerce 16, 2 (2011), 1.
[2] R. Brealey, S. Myers, F. Allen, and P. Mohanty. 2012. Principles of Corporate Finance. Tata McGraw-Hill Education.
[3] A. Byanjankar, M. Heikkilä, and J. Mezei. 2015. Predicting Credit Risk in Peer-to-Peer Lending: A Neural Network Approach. In . IEEE, Cape Town, South Africa, 719–725.
[4] T. Chen, T. He, M. Benesty, V. Khotilovich, and Y. Tang. 2015. Xgboost: Extreme Gradient Boosting. R package version 0.4-2 (2015), 1–4.
[5] J. Friedman. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics (2001), 1189–1232.
[6] Y. Fu. 2017. Combination of Random Forests and Neural Networks in Social Lending. Journal of Financial Risk Management 6, 4 (2017), 418–426.
[7] M. Jahrer, A. Töscher, and R. Legenstein. 2010. Combining Predictions for Accurate Recommender Systems. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, Washington, DC USA, 693–702.
[8] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems. Long Beach, CA USA, 3146–3154.
[9] J. Kim and S. Cho. 2019. Predicting Repayment of Borrows in Peer-to-Peer Social Lending with Deep Dense Convolutional Network. Expert Systems (2019), e12403.
[10] V. Kumar, S. Natarajan, S. Keerthana, K. Chinmayi, and N. Lakshmi. 2016. Credit Risk Analysis in Peer-to-Peer Lending System. In . IEEE, Singapore, 193–196.
[11] X. Ma, J. Sha, and X. Niu. 2018. An Empirical Study on the Credit Rating of P2P Projects based on LightGBM Algorithm. The Journal of Quantitative & Technical Economics
[12] Electronic Commerce Research and Applications 31 (2018), 24–39.
[13] M. Malekipirbazari and V. Aksakalli. 2015. Risk Assessment in Social Lending via Random Forests. Expert Systems with Applications 42, 10 (2015), 4621–4631.
[14] S. Patro and K. Sahu. 2015. Normalization: A Preprocessing Stage. arXiv preprint arXiv:1503.06462 (2015).
[15] M. Polena and T. Regner. 2018. Determinants of Borrowers Default in P2P Lending under Consideration of the Loan Risk Class. Games 9, 4 (2018), 82.
[16] C. Serrano-Cinca and B. Gutiérrez-Nieto. 2016. The Use of Profit Scoring as an Alternative to Credit Scoring Systems in Peer-to-Peer (P2P) Lending. Decision Support Systems 89 (2016), 113–122.
[17] Y. Song, X. Jiao, Y. Qiao, X. Liu, Y. Qiang, and Z. Liu. 2019. Prediction of Double-High Biochemical Indicators Based on LightGBM and XGBoost. In Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science. ACM, New York, NY USA, 189–193.
[18] Y. Xia, C. Liu, and N. Liu. 2017. Cost-sensitive Boosted Tree for Loan Evaluation in Peer-to-Peer Lending. Electronic Commerce Research and Applications 24 (2017), 30–49.
[19] J. Xie, V. Rojkova, S. Pal, and S. Coggeshall. 2009. A Combination of Boosting and Bagging for KDD Cup 2009-Fast Scoring on a Large Database. In Proceedings of the 2009 International Conference on KDD-Cup 2009-Volume 7. JMLR.org, 35–43.
[20] L. Yu, S. Wang, and K. Lai. 2008. Credit Risk Assessment with a Multistage Neural Network Ensemble Learning Approach. Expert Systems with Applications 34, 2 (2008), 1434–1444.
[21] S. Zhang, Y. Hu, and Z. Tan. 2019. Research on Borrower's Credit Classification of P2P Network Loan based on LightGBM Algorithm. International Journal of Embedded Systems 11, 5 (2019), 602–612.
[22] Y. Zhang and A. Haghani. 2015. A Gradient Boosting Method to Improve Travel Time Prediction. Transportation Research Part C: Emerging Technologies 58 (2015), 308–324.

The definitions of the variables are shown below:

• Loan characteristics:
  – application_type: A categorical variable denoting whether the loan is an individual application or a joint application with two co-borrowers.
  – dti: Debt-to-income ratio. A numeric variable denoting the ratio of the borrower's monthly debt to the monthly income.
  – grade: A categorical variable denoting the grade of the loan assigned by Lending Club. It ranges from A to G, where A is the safest loan and G is the riskiest loan.
  – initial_list_status: A categorical variable denoting the initial listing status of the loan.
  – installment: A numeric variable denoting the monthly payment owed by the borrower.
  – loan_amnt: A numeric variable denoting the total amount of money of a loan.
  – purpose: A categorical variable denoting the purpose of the loan.
  – sub_grade: A categorical variable denoting the subgrade of the loan assigned by Lending Club. It ranges from A1 to G5, where A1 is the safest loan and G5 is the riskiest loan.
  – term: A categorical variable denoting the term of the loan. It can be either 36 months or 60 months.
  – verification_status: A categorical variable denoting whether the income of the borrower was verified or not.

• Credit worthiness:
  – acc_now_delinq: A numeric variable denoting the number of accounts on which the borrower is now delinquent.
  – delinq_2yrs: A numeric variable denoting the number of delinquencies the borrower had in the past two years.
  – cr_line_month: A numeric variable denoting the credit age of the borrower (in months), measured from the earliest credit trade line listed in the credit report to the date of the loan application.
  – fico_range_high: A numeric variable denoting the upper boundary of the borrower's FICO score range when the loan was originated.
  – fico_range_low: A numeric variable denoting the lower boundary of the borrower's FICO score range when the loan was originated.
  – inq_last_6mths: A numeric variable denoting the number of inquiries listed in the borrower's credit report during the past 6 months.
  – open_acc: A numeric variable denoting the number of open trade lines in the borrower's credit report.
  – pub_rec: A numeric variable denoting the number of derogatory public records in the borrower's credit report.
  – revol_bal: A numeric variable denoting the total credit revolving balance.
  – revol_util: A numeric variable denoting the revolving line utilization rate, i.e., the amount of credit the borrower is using relative to the available revolving credit limit.
  – total_acc: A numeric variable denoting the total number of credit accounts on the borrower's credit file.

• Borrower information:
  – addr_state: A categorical variable denoting the state of the address provided by the borrower in the loan application.
  – annual_inc: A numeric variable denoting the annual income provided by the borrower.
  – emp_length: A numeric variable denoting the length of time (in years) the borrower has been employed.
  – emp_title: A categorical variable denoting the job title provided by the borrower when applying for the loan.
  – home_ownership: A categorical variable denoting whether the borrower owns the house.
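Two of the variables defined above, cr_line_month and dti, are simple arithmetic on raw fields, and a short sketch may make the definitions concrete. The function names and argument names below are hypothetical, and counting whole months (rounding down partial months) is an assumption; the definition above does not say how partial months are handled.

```python
from datetime import date


def credit_age_months(earliest_credit_line: date, application_date: date) -> int:
    """Whole months from the earliest credit trade line to the loan application."""
    months = ((application_date.year - earliest_credit_line.year) * 12
              + (application_date.month - earliest_credit_line.month))
    # Assumption: a month only counts once its day-of-month has been reached.
    if application_date.day < earliest_credit_line.day:
        months -= 1
    return max(months, 0)


def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Ratio of the borrower's monthly debt to the monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return monthly_debt / monthly_income


# Example: a credit line opened in March 2005 and a loan applied for in March 2015.
age = credit_age_months(date(2005, 3, 15), date(2015, 3, 15))
ratio = debt_to_income(1500.0, 6000.0)
print(age, ratio)
```

If the raw dates carry only month and year, the day-of-month adjustment has no effect as long as both dates use the same placeholder day.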