Super-App Behavioral Patterns in Credit Risk Models: Financial, Statistical and Regulatory Implications

Abstract

In this paper we present the impact of alternative data that originates from an app-based marketplace, in contrast to traditional bureau data, upon credit scoring models. These alternative data sources have shown themselves to be immensely powerful in predicting borrower behavior in segments traditionally underserved by banks and financial institutions. Our results, validated across two countries, show that these new sources of data are particularly useful for predicting financial behavior in low-wealth and young individuals, who are also the most likely to engage with alternative lenders. Furthermore, using the TreeSHAP method for Stochastic Gradient Boosting interpretation, our results also revealed interesting non-linear trends in the variables originating from the app, which would not normally be available to traditional banks. Our results represent an opportunity for technology companies to disrupt traditional banking by correctly identifying alternative data sources and handling this new information properly. At the same time alternative data must be carefully validated to overcome regulatory hurdles across diverse jurisdictions.


Luisa Roa, Alejandro Correa-Bahnsen†, Gabriel Suarez, Fernando Cortés-Tejada, María A. Luque, and Cristián Bravo

Rappi, Cl. 93
Pontificia Universidad Católica del Perú, Av. Universitaria 1801, San Miguel, Lima, Perú.
Department of Statistical and Actuarial Sciences, The University of Western Ontario, Western Science Centre, 1151 Richmond Street, London, ON, Canada.


Keywords:

Fintech; Super-App; Credit Scoring; Financial Inclusion; Alternative Data

* NOTICE: This is a preprint of a work currently under review since May 8th, 2020. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this version of the document. Changes may have been made to this work since it was submitted for publication.
† Corresponding author, [email protected]

As technology companies have become more ubiquitous, their incursion into traditional business lines has also become inevitable. Super-apps, mobile applications intended for a large number of day-to-day consumer needs, have targeted banking needs along with the delivery of goods, shopping, transport, and a long and diverse list of services that can be enhanced by mobile phone interactions. These super-apps also have the advantage of generating a large amount of diverse data, which has never before been available to traditional financial institutions. Consequently, the scientific questions of how to engineer these variables, how these data sources can be used along with the regulatory implications of doing so and the disruptive potential these companies create in the financial market become ever more relevant as their usage increases.

It is becoming more common for super-apps to expand far beyond their current spheres and become important financial technology (fintech) companies, as they feature products and services that revolutionize not only online commerce but traditional financial services. Super-apps provide an ecosystem of services on one platform, thus allowing their makers to cross-sell and improve user loyalty (Asian Insights Office, 2019). Part of a super-app's diversity is attributable to its mini-program function, which allows it to have the same functionalities as a specialized app directly within the super-app interface. Some examples of this are provided in Figure 1. Each functionality of a super-app generates data from the users' selections, and these data become distinctive attributes of the users' behavior.
Figure 1: Super-apps functionalities and mini-programs

Although entering the financial market represents a great challenge, super-apps boast a competitive advantage over traditional banking as they possess data generated by the users of the platforms as well as the transactional data (once they have launched financial services) common to banking. The super-app companies can also serve as agents of financial inclusion by using their transactional and behavioral user data to assess and create tailored financial services that are targeted at underserved segments. In terms of credit risk, these new data sources are known as alternative data (Siddiqi, 2017) as they are derived from sources other than traditional banking and financial behavior. A considerable amount of evidence has mounted with regard to the potential for financial inclusion for these sources of data (Bravo, Maldonado, & Weber, 2013; Gool, Verbeke, Sercu, & Baesens, 2012; Óskarsdóttir, Bravo, Sarraute, Vanthienen, & Baesens, 2019), particularly in countries with a large proportion of young and/or unbanked individuals where the super-app may achieve deeper market penetration compared with the traditional financial system.

A clear example of successful financial inclusion is Ant Financial, which has taken advantage of big data analytics, machine learning systems and deep learning to develop a wide range of intelligent products and services such as insurance, micro loans, payments, risk management services and others, which focus upon the needs of individuals and small businesses. The world's biggest unicorn began as an Alibaba strategy in 2004 to increase trust in the company among online buyers and sellers and has grown to become a world leader in financial innovation and risk management (Sun, 2017). For credit analysis, Ant Financial provides a score based on personal financial accounts from Ant Financial Services, social network and e-business information from the Alibaba Group platform and public utilities information (W.-Y. Zhang, 2016). The creditworthiness assessment by means of Ant Financial's own scoring allows them to provide financial services to all Alibaba users, including non-users of the traditional financial system.

Similarly, the fintech Lufax from the Ping An financial group offers more than 5,000 financial services to market segments that previously had no access to such services until the group transitioned into a technology company that connects borrowers with investors (Osterwalder, Pigneur, Smith, & Etiemble, 2020). To understand the financial preferences of its users more deeply, Lufax generates models based upon natural language learning and user behavior data to identify and predict the needs of each user. Therefore, in each moment of the user's life cycle the right products are offered, and the matches effected between borrowers and investors are more accurate and efficient (World Artificial Intelligence Conference, 2019).

Alternative data from super-apps seem to promise the additional benefit of enhancing traditional credit score models; hence, we explore this in the paper and attempt to answer the following research questions:

1. Is there an additional predictive value when considering the variables provided by a super-app?
2. Is the value added by the variables of a super-app significant?
3. What new behaviors do these variables reveal and how do they differ from traditional banking resources?
4. What are the consequences of using super-app data for lenders, users and regulators?

The rest of the paper proceeds as follows. Section 2 presents a review of the credit scoring and bank regulation literature related to fintech. Sections 3 and 4 describe the methodology and the experimental setup used within the research. In Section 5, the results are presented along with a discussion of their implications. Conclusions are drawn in Section 6 along with the possibilities for future work on alternative data models for super-apps.

To strategically manage risk, financial institutions assign each customer a credit score according to their estimated individual probability of default. This practice allows companies to define the level of risk at which they are willing to operate and, therefore, minimize the potential losses to which they may be exposed. The objective of this credit score for each client is to classify whether they are more or less likely to default on their financial obligations and to assess whether or not they will be approved for potential credit under the risk levels accepted by the institution (Lawrence & Solomon, 2012). Typically, different financial companies around the world have addressed this classification problem through standard cost-insensitive binary classification algorithms, such as logistic regression, neural networks, discriminant analysis, genetic programming, and decision trees, among others (Lessmann, Baesens, Seow, & Thomas, 2015).

Formally, a credit score is a statistical model that allows the estimation of the probability \( \hat{p}_i = P(y_i = 1 \mid x_i) \) of a customer \( i \) defaulting upon a contracted debt. Additionally, since the objective of credit scoring is to estimate a classifier \( c_i \) to decide whether or not to grant a loan to a customer \( i \), a threshold \( t \) is defined such that if \( \hat{p}_i < t \), then the loan is granted, that is, \( c_i(t) = 0 \), and denied otherwise, that is, \( c_i(t) = 1 \) (Thomas, Crook, & Edelman, 2017).

Credit Bureau Features

Some of the most commonly used variables around the world in the construction of these models are the scores generated by a credit bureau or consumer reporting agency, that is, companies dedicated to collecting data upon individuals throughout their financial lives, which then make this information available to the market through credit reports for a possible lender to purchase (Hurley & Adebayo, 2016). To this end, the credit bureau examines how individuals have behaved with the financial companies with which they have interacted and generates a quantitative score from this information, which is commonly used as an indicator for lending companies to assess the probability of the individual defaulting. In many countries, the bureau scores are synonymous with a credit score, but most modern banks use their own implementations, which only utilize the bureau score as an input. Among other things, the variables that constitute the financial report are the number of credits in history, the type of credits acquired, the use of these credits or how many of them are available, possible debts, payment defaults within a history, and bankruptcies or late payments. These variables can be used as they are, or the score itself can be used as a first variable, as in the case of this paper.
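The threshold rule defined earlier in this section can be illustrated with a minimal sketch. The function name `credit_decisions`, the example probabilities and the threshold value are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List


def credit_decisions(p_hat: List[float], t: float) -> List[int]:
    """Apply the rule c_i(t) = 0 (grant) if p_hat_i < t, else 1 (deny)."""
    return [0 if p < t else 1 for p in p_hat]


# Hypothetical estimated default probabilities for four applicants.
p_hat = [0.02, 0.10, 0.35, 0.07]

# Hypothetical threshold: an applicant exactly at the threshold is denied,
# because the grant condition is the strict inequality p_hat_i < t.
decisions = credit_decisions(p_hat, t=0.10)
print(decisions)  # [0, 1, 1, 0]
```

Lowering the threshold makes the lender more conservative (more denials), while raising it accepts more risk.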

Fintech has assumed great importance in recent years, functioning in the financial sector to provide online solutions for payments, transfers, investments and lending, among other services. Since 2010, more than U.S. $50 billion has been invested in 2,500 fintechs worldwide (Sy, Maino, Massara, Pérez-Saiz, & Sharma, 2019) and it is estimated that by 2025 global fintech market size will be U.S. $

Banking regulation is fundamentally related to credit scoring. The estimation of the probability of default (PD) is a function (usually a segmentation) of the score, adjusted by microeconomic factors (Baesens, Roesch, & Scheule, 2016). This means the development and deployment of credit scores is highly regulated and must pass the stringent controls imposed by local banking regulators. Fintechs challenge the traditional methods used by banks through the design and implementation of machine learning models that seem to have greater predictive and classification power but may lack interpretability (Ribeiro, Singh, & Guestrin, 2016). Furthermore, these complex algorithms may unintentionally incorporate variables that are proxies for sensitive consumer attributes (Hurley & Adebayo, 2016). It is, therefore, mandatory for regulators to mitigate the potential risks of these new approaches, and in this way ensure that the scoring decisions are as accurate as possible but also as unbiased, transparent and fair as possible (Basel Committee on Banking Supervision, 2018). We will comment on the regulatory implications of our findings in Section 5.4.

Methodology

This paper contributes to the literature by investigating the use of transactional data for credit scoring in a fintech context and explaining the implications the new behaviors have for lending. In this section, the methodology proposed for combining and extracting valuable features from a super-app is discussed. Moreover, the financial evaluation measure used in the experiments is presented.

Users interact with super-apps in quite different ways, and every movement of the user in the application generates features that might be useful when creating a credit scorecard. The transactions carried out by users contain a record of the different characteristics of the users and their behavior. Each transaction generates data such as the transaction amount, the type of transaction, the payment method, date, and the type of store, among others. Given that these super-apps fulfill numerous functions, many possible features can be extracted from the most popular and common functionalities they possess, such as being used for food and grocery deliveries, or transportation or financial services engagement. Some examples of the features we collected for this study can be observed in Table 1, where Sum, Pct, Avg, Count and Max stand for, respectively, the aggregate of the specific variable through time, the percentage of consumption of that variable when compared to total consumption, the average value that the variable has had over time, the number of occurrences of that variable and the maximum value the variable has obtained in time. It should be noted that not all of these features must necessarily be included for the formation of the credit scorecard, as the variables must be carefully chosen in order to avoid building discriminatory or subjective scores.
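As a rough illustration of the Sum, Pct, Avg, Count and Max aggregates described above, the following sketch computes them per vertical from one user's order history. The function name `aggregate_by_vertical`, the tuple layout and the sample orders are assumptions made for this example; the paper does not describe its feature pipeline at this level of detail.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_by_vertical(
    orders: List[Tuple[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Compute Sum/Pct/Avg/Count/Max of order value per vertical for one user.

    `orders` is a list of (vertical, amount) pairs (hypothetical layout).
    """
    amounts_by_vertical = defaultdict(list)
    for vertical, amount in orders:
        amounts_by_vertical[vertical].append(amount)

    total = sum(amount for _, amount in orders)
    features = {}
    for vertical, amounts in amounts_by_vertical.items():
        vertical_sum = sum(amounts)
        features[vertical] = {
            "sum": vertical_sum,
            # Share of this vertical in the user's total consumption.
            "pct": vertical_sum / total if total else 0.0,
            "avg": vertical_sum / len(amounts),
            "count": len(amounts),
            "max": max(amounts),
        }
    return features


# Hypothetical order history for a single user.
orders = [
    ("restaurants", 12.0),
    ("groceries", 40.0),
    ("restaurants", 8.0),
    ("pharmacy", 20.0),
]
features = aggregate_by_vertical(orders)
print(features["restaurants"])
```

In practice such aggregates would typically be computed over a fixed observation window before the loan decision, so that no information from after the decision date leaks into the features.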

Generic features: These refer to the demographic qualities of the user. These data may include attributes such as age, gender, place of residence, brand of cell phone as well as social characteristics such as income. These features provide an overview of the type of user and are mainly used to understand segments within the application to adapt offers and campaigns.

Delivery: This functionality includes all the services related to the purchase and delivery of food, groceries, technology, clothing, pharmaceutical products and others. Variables that can be created from this type of service allow an understanding to be formed of consumer consumption patterns, user preferences and how users make use of different types of stores.

Transportation: This considers the data generated by the scooter, bike and ride sharing system operators such as Didi, Uber, Lime, Mobike and others. The features that can be extracted from this functionality provide information about the movements of people in a geographic area such as mobility patterns and the most frequently used transportation method.

Financial services: This last functionality collects financial services or products delivered via technology ranging from e-wallets and digital cards to loan services, on- and offline payments, and money transfers. These financial services allow the features associated with the number of products and the users' financial behaviors to be defined.

Data trails generated by the users of super-apps in their different functionalities become important as a means of supplying behavioral and purchasing patterns. For a user who retains a financial service, the relationship between traditional features, such as a bank history, and super-app features represents an additional value for the credit evaluation. Figure 2 presents the experimental setup.

Feature type: Generic features
Examples:
  1. Gender
  2. Age range and age in the app (tenure)
  3. Country/city of residence
  4. Most used address
  5. Number of different addresses
  6. Preferred payment method
  7. Number of registered credit cards (national/international)
  8. Number of registered credit card brands
  9. Phone brand/operating system
  10. Number of different phones used

Feature type: Delivery by vertical (delivery, groceries, pharmacies and others)
Examples:
  1. Sum/Pct/Avg/Count/Max of total orders
  2. Sum/Pct/Avg/Count/Max of approved orders
  3. Sum/Pct/Avg/Count/Max of orders value
  4. Sum/Pct/Avg/Count/Max of total number of cancelled orders (By user/Payment Error/Fraud)
  5. Sum/Pct/Avg/Count/Max of total refund
  6. Sum/Pct/Avg/Count/Max of payment method used
  7. Sum/Pct/Avg/Count/Max of total discount
  8. Sum/Pct/Avg/Count/Max of consumption in a certain vertical
  9. Sum/Pct/Avg/Count/Max offered tip
  10. Sum/Pct/Avg/Count/Max of value spent in a certain type of store
  11. Avg/Count products per order
  12. Sum/Count of consumption in top store
  13. Count of different stores in which a user purchases
  14. Period of time when orders are placed
  15. Store where a user purchases the most

Feature type: Transportation
Examples:
  1. Count of rides
  2. Sum/Count/Avg/Max of travel time
  3. Count of different departure locations
  4. Count of different destinations
  5. Most frequented destination
  6. Count of sectors in the city within which a user has moved
  7. Favorite transportation vehicle

Feature type: Financial Services
Examples:
  1. Count of financial services
  2. Sum/Pct/Avg/Count/Max of debit transactions
  3. Sum/Pct/Avg/Count/Max of credit transactions
  4. Sum/Pct/Avg/Count/Max of total amount traded on debit cards
  5. Sum/Pct/Avg/Count/Max of total amount traded on credit cards
  6. Sum/Avg/Count/Max amount of transfers
  7. Number of people to whom the user made transfers
  8. Whether the user makes cash withdrawals

Table 1: Feature types with examples

Figure 2: Experimental Setup

Traditional measures to evaluate credit scoring models include the area under the receiver operating characteristic curve (AUC), the Brier score, the Kolmogorov-Smirnov (K-S) statistic, the F1-Score, and the misclassification rate (Lessmann et al., 2015). Nevertheless, none of these measures takes into account the business and financial realities that take place in lending. The costs incurred by the financial institution to acquire customers, or the profit expected from a particular client, are not considered in the evaluation of the different models (Correa Bahnsen, Aouada, & Ottersten, 2015). Recent approaches have included the Expected Maximum Profit measure (Verbraken, Bravo, Weber, & Baesens, 2014) and the example-dependent cost-sensitive approach for credit scoring (Correa Bahnsen, Aouada, & Ottersten, 2014), which we used in this work.

                                   Actual Positive (y_i = 1)      Actual Negative (y_i = 0)
Predicted Positive (c_i = 1)       C_{TP_i} = 0                   C_{FP_i} = r_i + C^a_{FP}
Predicted Negative (c_i = 0)       C_{FN_i} = Cl_i · L_{gd}       C_{TN_i} = 0

Table 2: Credit scoring example-dependent cost matrix

In Table 2, the credit scoring cost matrix is shown. The costs of a correct classification, \( C_{TP_i} \) and \( C_{TN_i} \), are zero for all customers \( i \). \( C_{FN_i} \) reflects the losses incurred if customer \( i \) defaults, which are proportional to their credit line \( Cl_i \). The cost of a false positive, \( C_{FP_i} \), is the sum of two real financial costs, \( r_i \) and \( C^a_{FP} \), where \( r_i \) is the loss in profit through rejecting someone who would have been a good customer (Nayak & Turvey, 1997).

Finally, the cost improvement can be expressed as the cost savings as compared with \( Cost_l \):

\[ Savings = \frac{Cost_l - Cost}{Cost_l}, \]

where \( Cost \) is calculated as

\[ Cost = \sum_{i} \left[ (1 - c_i)\, y_i\, C_{FN_i} + (1 - y_i)\, c_i\, C_{FP_i} \right], \]

and \( Cost_l \) is the cost of the costless class, that is, the lower of the total costs obtained by classifying every customer as positive or every customer as negative (Correa Bahnsen et al., 2015).

Our dataset consisted of the transactional information of users within a super-app for two different Latin American countries, labeled as Country A and B. In the first country, a sample of 50,000 users was studied, while 30,000 users were analyzed for the second. For each user, we had access to all their transactional data within the super-app, which included orders placed to more than 15,000 restaurants and 2,000 grocery stores. In addition, we had access to several observations regarding each of the users, such as the location in which they requested their orders, the device and the operating system through which the user placed the orders and the data regarding the payment method used, including when applicable their credit card information. Moreover, we also had access to data that made it possible to determine consumption patterns and construct variables that characterized their financial behavior.
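The example-dependent cost and savings measures defined earlier in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the credit lines and the false-positive costs are hypothetical, and \( Cost_l \) is taken as the cheaper of the two trivial policies (deny everyone or grant everyone), which is an assumption about how the costless class is evaluated. Only the 75% loss given default comes from the paper's parameter table.

```python
from typing import List


def total_cost(y: List[int], c: List[int],
               cost_fn: List[float], cost_fp: List[float]) -> float:
    """Cost = sum_i [(1 - c_i) * y_i * C_FN_i + (1 - y_i) * c_i * C_FP_i]."""
    return sum(
        (1 - ci) * yi * fn + (1 - yi) * ci * fp
        for yi, ci, fn, fp in zip(y, c, cost_fn, cost_fp)
    )


def savings(y: List[int], c: List[int],
            cost_fn: List[float], cost_fp: List[float]) -> float:
    """Savings = (Cost_l - Cost) / Cost_l, relative to the best trivial policy."""
    n = len(y)
    cost_grant_all = total_cost(y, [0] * n, cost_fn, cost_fp)  # c_i = 0 for all
    cost_deny_all = total_cost(y, [1] * n, cost_fn, cost_fp)   # c_i = 1 for all
    cost_l = min(cost_grant_all, cost_deny_all)
    if cost_l == 0:
        raise ValueError("Cost_l is zero; savings are undefined")
    return (cost_l - total_cost(y, c, cost_fn, cost_fp)) / cost_l


L_GD = 0.75                                   # loss given default (Table 4)
credit_lines = [1000.0, 2000.0, 1500.0, 500.0]  # hypothetical Cl_i
cost_fn = [cl * L_GD for cl in credit_lines]     # C_FN_i = Cl_i * L_gd
cost_fp = [100.0, 200.0, 150.0, 50.0]            # hypothetical r_i + C^a_FP

y = [1, 0, 0, 1]   # 1 = defaulter
c = [1, 0, 1, 1]   # 1 = loan denied

print(total_cost(y, c, cost_fn, cost_fp))
print(savings(y, c, cost_fn, cost_fp))
```

A savings value of 1 corresponds to a model with no costly errors, 0 to a model that does no better than the best trivial policy, and negative values to a model that is more costly than that baseline.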

Seeking to understand whether default prediction can be improved for certain populations, three segmentations were defined, each dividing the population into a sample with a high segment value and another with a low segment value. The first segmentation divided the population by device score, as this is a variable that allows an approximation of the economic potential of an individual (Sundsøy, Bjelland, Reme, M. Iqbal, & Jahani, 2016), while the second was intended to be a more robust approximation of the economic potential and to be associated with the behavior in the super-app, which we named Wealth Score. Finally, the last segmentation separated the population by a super-app user segmentation (Recency, Frequency and Monetary Value, RFM; Fader, Hardie, & Lee, 2005) based on the recency since the user made their last purchase, the frequency with which they placed orders and the average amount spent. This segmentation has proven to be a valuable variable for other models developed internally within the super-app and in many applications. For each proposed segmentation and for the dataset without segmentation, two models were created: one that only contemplated the Bureau score and another that additionally considered the super-app variables. The number of observations and the default rate for each country and segmentation are shown in Table 3.

                       Country A                  Country B
Model                  Size      Default Rate     Size      Default Rate
No Segments            50,000    5.00%            30,000    9.00%
Low Device Score       29,627    5.52%            14,548    8.51%
High Device Score      20,373    4.24%            15,452    9.46%
Low Wealth Score       27,570    5.88%            19,664    9.04%
High Wealth Score      22,430    3.92%            10,336    8.92%
Low RFM                26,479    5.87%            15,998    9.40%
High RFM               23,521    4.02%            14,002    8.54%

Table 3: Dataset information

An XGBoost classifier was implemented as it has demonstrated superior performance over models such as neural networks, decision trees, support vector machines and bagging-NN with regard to structured data (Salvaire, 2019; Xia, Liu, Li, & Liu, 2017). The final model performance was evaluated using a randomized bootstrap of 50 iterations on the databases, with 70% of the data used to train and the remaining 30% used to test in each iteration.

Model performance was measured by using the AUC and the KS measure. The AUC captures the trade-off between true and false positives at various discrimination thresholds, that is, it measures the ability to predict defaulters where all cutoff points contribute equally. The KS statistic measures the degree of separation between two cumulative distributions, specifically the maximum distance for all classification thresholds between the true and false positive rate curves.

In addition, the financial performance of the models was evaluated with the estimated financial savings as described in Section 3.2. The parameters required to estimate the savings are shown in Table 4.

Parameter                          Value
Interest rate (int_r)              40%
Cost of funds (int_cf)             10%
Loss given default (L_gd)          75%

Table 4: Parameters to estimate the financial savings

These measures allowed us to assess the discriminatory ability for defaulters, and the average performance and the maximum performance for the most optimistic case. In addition, statistical tests were performed to establish whether there was a significant difference in the classification performances of any of the models. This was in order to compare the performances of the different segments and identify in which population, with a common characteristic, the super-app variables had a representative contribution.
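To make the two statistical measures concrete, the sketch below computes the AUC and the KS statistic directly from their definitions, using only the standard library. The function names and the toy labels and scores are hypothetical; a production pipeline would more likely rely on an established library, and the quadratic-time AUC here is only meant to mirror the definition.

```python
from typing import List, Sequence, Tuple


def _split_scores(y: Sequence[int],
                  scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Separate the scores of defaulters (y = 1) and non-defaulters (y = 0)."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return pos, neg


def auc(y: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random defaulter scores above a random non-defaulter."""
    pos, neg = _split_scores(y, scores)
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def ks_statistic(y: Sequence[int], scores: Sequence[float]) -> float:
    """Maximum distance between the TPR and FPR curves over all thresholds."""
    pos, neg = _split_scores(y, scores)
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best


# Hypothetical test-set labels (1 = default) and predicted default scores.
y = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(auc(y, scores))
print(ks_statistic(y, scores))
```

In the setup described above, both measures would be recomputed on the 30% test partition of each of the 50 bootstrap iterations, yielding a distribution of values per model rather than a single number.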

In this section we present the experimental results. First, the statistical performance results are described, followed by the financial performance results for each model. Finally, the regulatory implications of the results are discussed.

Regarding the results obtained with the AUC metric, for both countries the model that considered the variables of the super-app always had a higher average performance regardless of the segmentation chosen, as seen in Figures 3a and 3b. For Country A, the Device Score did appear to be a population characteristic that allowed for the better prediction of defaulters given that for those with a high score a higher average performance was gained. However, for Country B this characteristic did not produce a substantial improvement in either the low or high score. Nevertheless, the results associated with the Wealth Score and RFM segments suggest a significant difference in the performance of the users with a specific value for these characteristics for both countries. The super-app information manages to capture information that the bureau score does not consider.

Figure 3: AUC performance by model (Bureau vs. Bureau + Super-App) for each segment. (a) Country A dataset. (b) Country B dataset.

The results obtained with the KS metric were more pronounced, as can be seen in Figures 4a and 4b. For all the segments, and regardless of country, the information from the super-app improved the model performance by at least two percentage points. The maximum increase was obtained with the AUC metric. Country B demonstrated a particularly high improvement with an average increase of approximately ten percentage points. This implies that the super-app information has a significant impact upon the ability of the model to discriminate. The difference is most obvious in the Device Score (low) and Wealth Score (low), which hints at a much higher discrimination capacity for lower income segments. Something similar occurs in the RFM cluster, although with a lower effect. This could indeed occur as there should be a correlation between wealth and RFM, but it is muddled by the engagement of the user with the app.

Figure 4: KS performance by model (Bureau vs. Bureau + Super-App) for each segment. (a) Country A dataset. (b) Country B dataset.

Although the box plots provide an approximation of the distribution of the performance measures, Mann-Whitney non-parametric tests were conducted to more accurately determine which segmentation provided a significant improvement between the super-app data model and the stand-alone bureau model. Table 5 shows the p-values of the test, which imply that a significant difference is obtained by adding the super-app variables in all the models and segmentations for both countries and performance metrics, except for the increase in the K-S metric of the High Wealth model for Country B. We can also see that the models with the highest AUC performance were the High Wealth Score model for Country A and the Low RFM model for Country B, while for the K-S statistic it was the High and Low Wealth Scores, respectively.

                       Country A                  Country B
Segmentation           AUC          KS            AUC          KS
No Segments            1.762e-17    8.648e-18     2.078e-12    3.529e-18
Low Device Score       4.771e-15    6.814e-18     9.204e-05    1.611e-17
High Device Score      2.248e-16    9.237e-11     1.280e-03    2.740e-16
Low Wealth Score       4.295e-12    3.530e-18     8.375e-09    3.531e-18
High Wealth Score      1.137e-13    1.084e-02     3.110e-02
Low RFM                2.766e-15    1.659e-17     1.328e-07    3.980e-18
High RFM               6.439e-14    2.090e-11     2.000e-04    2.102e-17

Table 5: Mann Whitney test P-Value for performance metrics
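For readers unfamiliar with the test behind these p-values, the following sketch implements a two-sided Mann-Whitney U test using the normal approximation, without continuity or tie corrections. It is a simplified stand-in, not the procedure the authors ran: the function name and the sample AUC values are hypothetical, and statistical packages usually apply further corrections or exact methods for small samples.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for sample a, two-sided p-value).

    Uses the normal approximation to the null distribution of U,
    with no continuity correction and no correction for ties.
    """
    n1, n2 = len(a), len(b)
    # U counts the pairs in which the value from `a` exceeds the value from `b`
    # (ties count as one half).
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p_value


# Hypothetical AUC values over bootstrap iterations for two models.
auc_super_app = [0.80, 0.81, 0.82, 0.83, 0.84]
auc_bureau = [0.70, 0.71, 0.72, 0.73, 0.74]

u_stat, p = mann_whitney_u(auc_super_app, auc_bureau)
print(u_stat, p)
```

With 50 bootstrap values per model, as in the experimental setup, complete separation of the two samples yields extremely small p-values, which is consistent in magnitude with the smallest entries reported in Table 5.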

In the previous section it was shown that the statistical performance of the models that consider super-app variables was significantly higher for the proposed segmentations. However, the model that performs the best in terms of statistical measures does not necessarily perform the best in terms of costs and savings. In terms of general financial results, it was found that for both countries there was a higher average saving in all the models that considered the super-app variables, as well as considerable savings when segmenting the population. For Country A, as seen in Figure 5a, the range of average savings across models moved from 20.5%-31.3% to 29.3%-36.0% when using the super-app variables, while for Country B, Figure 5b, there was a slight increase from 25.0%-41.0% to 27.4%-42.0%.

The most accentuated financial differences for Country A were obtained for the (overall) no segment model and the models with High Device Score, Low RFM and High RFM segmentations; the respective average increases after adding the super-app variables were 9.0%, 7.0%, 5.9% and 4.9%. For Country B, the largest average increases were provided by the Low Device Score, Low RFM and no segments models at values of 3.5%, 2.2% and 2.3%, respectively. In addition, although for both countries these were not the models with the best average statistical performances, these results reveal not only the additional statistical value of the super-app variables, but also the financial benefit of considering the costs incurred during the default prediction process.

Overall, Figures 5a and 5b show a greater gain in terms of savings for Country A when adding alternative data. Table 6 shows the p-values of the Mann-Whitney non-parametric test, finding that for Country B, the segments of High Device Score and Low Wealth Score did not significantly enhance savings.

Figure 5: Financial model performance (savings) by model (Bureau vs. Bureau + Super-App) for each segment. (a) Savings from Country A dataset. (b) Savings from Country B dataset.

Segmentation           Country A    Country B
Low Wealth Score       1.898e-04
High Wealth Score      1.791e-03    1.581e-10
Low RFM                4.265e-13    1.883e-02
High RFM               3.533e-18    1.979e-12

Table 6: Mann Whitney test P-Value for financial performance

Considering that a highly complex model was implemented, SHapley Additive exPlanations (SHAP; Lundberg & Lee, 2017) was used for robust feature importance explanation. Since this technique is based on game theory, specifically Shapley's optimal values, it offers a unique way of consistently and precisely assigning importance to features.

The feature importance obtained with the TreeSHAP method for the no segments models is presented in Figures 6a and 6b. For both countries, although the bureau score had the highest predictive power, the super-app delivery and generic features were the most complementary in predicting creditworthiness. The latter super-app set reveals that although most sociodemographic (generic) features should be readily available to bureaus, the tenure and time of engagement are also extremely relevant in predicting default. These variables would relate mostly to each institution (as nothing suggests this variable would be exclusive to super-apps), hinting at the higher predictive power of in-house models over wider ranging bureau ones.

Regarding Country A, the features associated with the payment method behavioral patterns were those that added the most value to the accuracy of the default prediction. For Country B, all the categories of the delivery consumption patterns contributed in similar measures. Similarly, some financial and transportation features added value to the default prediction. Generic Month-on-Books (MOB) for Country B, and delivery payment errors, for both countries, were also relevant features.

Country A's high MOB implies a greater likelihood of no default, as expected. These new sources of data consist of diverse behaviors, which pose the risk that the alternative data patterns could vary as time goes by. Accordingly, proper model validation and follow-up processes become even more relevant for these data sources.

For delivery payment error, the interpretation was the same for both countries: as the number of errors increases, so does the PD.
This would mean there is an early warning of financial difficulties, as declines over time hint at a lower available disposable income.

Most of the other variables revealed interesting behaviors that are not captured by traditional bureau variables. In particular, those variables that involved credit card use or utilization (those ending in CC) were among the most relevant sets. These can be interpreted as bancarization engagement indexes: users with a higher level of engagement are in general better customers; however, if the amount is too high then, ceteris paribus, the borrower has an increased default rate.

Finally, the last interesting result arising from these variable importance plots derives from the magnitude and range of the SHAP values. Considering that the larger the absolute SHAP value, the more impact a variable has on the output, it is clear that extremely low and extremely high bureau scores (shown in blue and red, respectively, in the Bureau Score row of the plot) are the most impactful. This is well known: it is much easier to predict a very good or a very bad borrower than an average one. It is in this segment that the super-app variables really shine. Bureau scores do not have small difference ranges for the defaulters, that is, a range of scores that allows the discrimination of slightly bad borrowers on a sliding scale. However, variables such as Payment Errors and Total Amount do exactly that. This allows the fintech to take calculated risks by accepting slightly riskier borrowers for a temporary income boost (such as a boosted growth phase common in technology companies), with which traditional banks would not be able to compete.

Figure 6: (a) Feature importance with SHAP for Country A dataset. (b) Feature importance with SHAP for Country B dataset.
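The game-theoretic attribution behind the SHAP values discussed above can be illustrated with a brute-force computation of exact Shapley values for a toy scoring function. This is a minimal sketch under stated assumptions, not the TreeSHAP algorithm itself (which obtains the same kind of attribution efficiently for tree ensembles). The function `toy_pd`, the feature names, the baseline and the borrower values are all hypothetical and are not taken from the paper's models or data.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    The cost grows as 2^n, so this only suits a handful of features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for i in names:
        others = [f for f in names if f != i]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                # Classical Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                without_i = {
                    f: (x[f] if f in coalition else baseline[f]) for f in names
                }
                with_i = dict(without_i)
                with_i[i] = x[i]
                total += weight * (model(with_i) - model(without_i))
        phi[i] = total
    return phi


# Hypothetical toy risk score (NOT the paper's model): higher means riskier.
def toy_pd(f):
    risk = 0.30 - 0.0004 * f["bureau_score"] + 0.02 * f["payment_errors"]
    # Assumed interaction: payment errors weigh more for short-tenure users.
    if f["months_on_books"] < 6:
        risk += 0.01 * f["payment_errors"]
    return risk


# Illustrative values only.
baseline = {"bureau_score": 600, "payment_errors": 0, "months_on_books": 12}
borrower = {"bureau_score": 550, "payment_errors": 3, "months_on_books": 2}

phi = shapley_values(toy_pd, borrower, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.4f}")

# Efficiency property: the contributions sum to the change in the prediction.
gap = toy_pd(borrower) - toy_pd(baseline)
print(f"sum of contributions: {sum(phi.values()):+.4f}, gap: {gap:+.4f}")
```

In this toy setting the interaction term is shared equally between `payment_errors` and `months_on_books`, which illustrates why such attributions can expose patterns that a single bureau score does not separate.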

There are many interesting lessons for regulators arising from this work. First, the statistical and financial gains resulting from both segmented and non-segmented models demonstrate that alternative data have a place in the lending sphere and, furthermore, that lenders have a financial incentive to use these variables within their models. In recent reports, the Basel Committee on Banking Supervision has detected this trend (Basel Committee on Banking Supervision, 2018) and has suggested that regulators should treat these fintech companies in a similar manner to banks from a regulatory perspective. Our results suggest they should progress a step further by encouraging this information to become mainstream, which we propose could lead to higher bancarization and broader access to financing.

However, the counterpoint to these gains is that they must come with a clear mandate concerning the interpretability of the variables. Clear arguments have to be presented on exactly what the variables are illustrating and how they relate to financial behaviors. For example, this is the case with a variable such as tipping, which had a slightly negative effect in our model. Experiments have shown that psychometric variables, to which tipping behavior is related, have an impact upon the creditworthiness of borrowers (Arráiz, Bruhn, Ortega, & Stucchi, 2017), but exactly what this variable is showing has to be clearly explained by the lender in order for it to be approved for use in scoring models. In this exploratory study, the variable indicated that high tippers have a significantly higher default rate. Allowing the use of such a variable would provide an incentive for users to stop tipping altogether (although, in contrast, small tips were associated with lower default rates, with a significant but smaller effect), which is an undesirable consequence.
Thus, the regulator should intervene when variables such as this are proposed and potentially forbid their use or control how they are used. In Óskarsdóttir et al. (2019), a suggestion for the use of such variables involved using only the positive part, that is, considering only the segment of the variable that is positive and assigning a neutral score to the segments that are not. Doing so rewards positive behaviors, while eliminating the impact of behaviors whose underlying cause is more dubious. However, further research needs to be conducted to understand the underlying reasoning of these results to arrive at a final recommendation regarding these variables.
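The positive-part idea described above can be sketched in a few lines. This is only an illustration under assumptions, not the procedure of Óskarsdóttir et al. (2019): the function name `positive_part`, the bin labels, the point values, and the convention that positive points reward the borrower while 0 is neutral are all hypothetical.

```python
def positive_part(points_by_bin, neutral=0.0):
    """Keep only the score contributions that reward the borrower.

    points_by_bin maps each bin of one variable to its scorecard points
    (assumed convention: positive points mean lower estimated risk).
    Bins whose points fall below the neutral level get the neutral score.
    """
    return {
        bin_label: max(points, neutral)
        for bin_label, points in points_by_bin.items()
    }


# Hypothetical scorecard points for a tipping variable (illustrative only).
tipping_points = {"no tip": 0.0, "small tip": 4.0, "high tip": -9.0}

adjusted = positive_part(tipping_points)
for bin_label, points in adjusted.items():
    print(f"{bin_label}: {points:+.1f}")
```

Under this convention a high tipper is scored the same as a non-tipper, so the penalty that would discourage tipping disappears while the reward for the favorable segment is retained.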

In this paper, we have tested an alternative dataset arising from a super-app and researched its effectiveness and implications with regard to developing credit risk models. Four research questions were proposed and our study clearly answered each of them.

First, there is clear predictive value both in financial and in statistical terms in using these variables, which answers the first two research questions. The gains were significant across all the studied segments, and these gains were consistent across both the countries in which we tested these variables. Clearly, a financial and statistical incentive exists for lenders to include app-based information in their decision support systems.

Typically, alternative data variables are strong indicators of both the willingness and capacity to repay a loan (Bravo, Thomas, & Weber, 2015). However, the types and varieties of these indicators (almost 20 new variables had significant effects upon the estimation) result in significant gains in both financial and statistical terms. This paper demonstrates that there is a strong financial incentive for financial institutions to use these variables for prediction models. As the financial incentive is high, fintech companies will for a while have an advantage over traditional institutions unless these institutions also begin embracing these new sources of information.

Regarding the third question, in terms of the patterns we observed, the super-app variables show that engagement with financial products provides the strongest signals in terms of the default rate prediction. These patterns did not appear to be readily included in bureau scores, as the super-app variables capture signs of transactionality rather than of debt; hence, there is an opportunity for them to be incorporated in the mainstream.
In the meantime, those institutions with access to these variables have a competitive advantage with regard to designing better decision support systems for this purpose.

Finally, we foresee regulators will have to balance allowing this data to be used with effective supervision over what patterns these variables actually reflect. Given the financial incentives that arise from the use of these variables, it will be necessary to take measures to safeguard a fair and transparent app-based lending system. Nonetheless, our results suggest that these apps are useful contributions to financial inclusion; therefore, regulatory efforts should also proceed in this direction.

Acknowledgements

The last author acknowledges this research was undertaken, in part, thanks to funding from the Canada Research Chairs program.

References

Aitken, R. (2017). All data is credit data: Constituting the unbanked. Competition & Change, (4), 274–300.

Arráiz, I., Bruhn, M., Ortega, C. R., & Stucchi, R. (2017, December). Are Psychometric Tools a Viable Screening Method for Small and Medium-Size Enterprise Lending? Evidence from Peru (Tech. Rep. No. 8276). The World Bank.

Asian Insights Office. (2019, September). Super apps in financial services: Business models and opportunities (Tech. Rep. No. sector briefing 81). DBS Group Research.

Baesens, B., Roesch, D., & Scheule, H. (2016). Credit risk analytics: Measurement techniques, applications, and examples in SAS. Hoboken, New Jersey: Wiley.

Basel Committee on Banking Supervision. (2018). Sound Practices: Implications of fintech developments for banks and bank supervisors (Tech. Rep.). Bank for International Settlements.

Berg, T., Burg, V., Gombović, A., & Puri, M. (2019). On the Rise of FinTechs: Credit Scoring Using Digital Footprints. The Review of Financial Studies, Accepted for publication.

Bravo, C., Maldonado, S., & Weber, R. (2013). Granting and managing loans for micro-entrepreneurs: New developments and practical experiences. European Journal of Operational Research, 358–366.

Bravo, C., Thomas, C. L., & Weber, R. (2015). Improving credit scoring by differentiating defaulter behaviour. Journal of the Operational Research Society, (5), 771–781.

Carroll, P., & Rehmani, S. (2017). Alternative data and the unbanked (Tech. Rep.). Oliver Wyman Financial Services.

Correa Bahnsen, A., Aouada, D., & Ottersten, B. (2014). Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring. In (pp. 263–269). Detroit, USA: IEEE.

Correa Bahnsen, A., Aouada, D., & Ottersten, B. (2015). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications, (19), 6609–6619.

Demirgüç-Kunt, A., Klapper, L., Singer, D., Ansar, S., & Hess, J. (2018). The global findex database 2017: Measuring financial inclusion and the fintech revolution (Tech. Rep. No. 126033). The World Bank. doi: 10.1596/978-1-4648-1259-0

Fader, P. S., Hardie, B. G. S., & Lee, K. L. (2005). RFM and CLV: Using Iso-Value Curves for Customer Base Analysis. Journal of Marketing Research, (4), 415–430.

Gool, J. V., Verbeke, W., Sercu, P., & Baesens, B. (2012). Credit scoring for microfinance: Is it worth it? International Journal of Finance & Economics, (2), 103–123.

Hurley, M., & Adebayo, J. (2016). Credit scoring in the era of big data. Yale Journal of Law & Technology, 149–216.

Lawrence, D., & Solomon, A. (2012). Managing a Consumer Lending Business. Solomon Lawrence Partners.

Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, (1), 124–136.

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In I. Guyon et al. (Eds.), Advances in neural information processing systems 30 (pp. 4765–4774). Curran Associates, Inc.

Nayak, G. N., & Turvey, C. G. (1997). Credit Risk Assessment and the Opportunity Costs of Loan Misclassification. Canadian Journal of Agricultural Economics, (3), 285–299.

Óskarsdóttir, M., Bravo, C., Sarraute, C., Vanthienen, J., & Baesens, B. (2019). The value of big data for credit scoring: Enhancing financial inclusion using mobile phone data and social network analytics. Applied Soft Computing, 26–39.

Osterwalder, A., Pigneur, Y., Smith, A., & Etiemble, F. (2020). The invincible company: How to constantly reinvent your organization with inspiration from the world's best business models. Wiley.

Philippon, T. (2019). On fintech and financial inclusion (Tech. Rep. No. NBER Working Paper No. 26330). National Bureau of Economic Research.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, USA, August 13-17, 2016 (pp. 1135–1144).

Salvaire, P. (2019). Explaining the predictions of a boosted tree algorithm: application to credit scoring (Unpublished master's thesis). Universidade Nova de Lisboa.

Siddiqi, N. (2017). Intelligent credit scoring. John Wiley & Sons, Inc.

Sun, T. (2017, 08). Balancing innovation and risks in digital financial inclusion-experiences of ant financial services group. In (pp. 37–43).

Sundsøy, P., Bjelland, J., Reme, B.-A., M. Iqbal, A., & Jahani, E. (2016). Deep learning applied to mobile phone data for individual income classification. In Artificial intelligence: Technologies and applications (pp. 96–99). Atlantis Press.

Sy, N. A., Maino, R., Massara, A., Pérez-Saiz, H., & Sharma, P. (2019). Fintech in sub-saharan african countries: A game changer? (Tech. Rep. No. 19/04). International Monetary Fund, African Department.

Task Force on Financial Technology. (2019). Examining the use of alternative data in underwriting and credit scoring to expand access to credit: Hearings before the task force on financial technology. (US House of Representatives, 116th Cong.)

Thomas, L., Crook, J., & Edelman, D. (2017). Credit Scoring and its Applications (Second ed.). USA: SIAM.

Valuates Reports. (2019, September). Global FinTech Market Size, Status and Forecast 2018-2025 (Tech. Rep. No. QYRE-Othe-2W194). Valuates.

Verbraken, T., Bravo, C., Weber, R., & Baesens, B. (2014). Development and application of consumer credit scoring models using profit-based classification measures. European Journal of Operational Research, (2), 505–513.

World Artificial Intelligence Conference. (2019, Aug). Lufax CTO Mao Jinliang: AI is reshaping the wealth management industry. (Accessed 2020-04-02)

Xia, Y., Liu, C., Li, Y., & Liu, N. (2017). Boosted decision tree approach using bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 225–241.

Zhang, W.-Y. (2016). Exploring the improved personal credit scoring model of ant financial services in its disruptive innovation process. In 3rd international conference on applied social science research (ICASSR 2015) (pp. 408–410). Atlantis Press.

Zhang, Y., Jia, H., Diao, Y., Hai, M., & Li, H. (2016). Research on credit scoring by fusing social media information in online peer-to-peer lending. Procedia Computer Science, 91