Computer-Assisted Fraud Detection, From Active Learning to Reward Maximization
Christelle Marfaing
Lydia Solutions [email protected]
Alexandre Garcia
LTCI, Télécom ParisTech, Université Paris Saclay [email protected]
Abstract
The automatic detection of frauds in banking transactions has recently been studied as a way to help analysts find fraudulent operations. Due to the availability of human feedback, this task has been studied in the framework of active learning: the fraud predictor is allowed to sequentially call on an oracle. This human intervention is used to label new examples and improve the classification accuracy of the latter. Such a setting is not adapted to the case of fraud detection with financial data in European countries. Actually, since a human verification is mandatory to consider a fraud as really detected, it is not necessary to focus on improving the classifier. We introduce the setting of 'computer-assisted fraud detection', where the goal is to minimize the number of non-fraudulent operations submitted to an oracle. We apply the existing methods to this task and show that a simple meta-algorithm provides competitive results in this scenario on benchmark datasets.
Introduction

The task of automatic fraud detection has mainly been studied under the framework of imbalanced binary classification (Bhattacharyya et al., 2011). Given the description of a transaction x, the goal is to predict a binary label y ∈ {0, 1} indicating whether this transaction is fraudulent or not. The main difficulties arising in fraud detection, highlighted earlier in (Bolton and Hand, 2002), include among others:

• The strong imbalance between the output labels. Indeed, fraudulent behaviors are assumed to be rare and thus harder to find. Previous work has proposed different solutions that help building efficient predictors in the case of imbalanced classification. Such approaches mainly consist in introducing instance reweighting or bootstrap-based schemes (Chawla et al., 2004) in order to transform the imbalanced learning problem into a related balanced problem on which learning can be done with off-the-shelf predictors (see the sketch after this list).

• The large amount of unlabeled data with regard to labeled data, which advocates the use of methods that can scale to large datasets and that generalize well.
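As a minimal illustration of the two rebalancing schemes mentioned in the first bullet (not taken from the paper, and assuming scikit-learn-style array inputs), one can either reweight instances inversely to their class frequency or bootstrap the minority class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_reweighted(X, y):
    # Instance reweighting: weight each class inversely to its frequency,
    # so rare frauds count as much as the abundant legitimate operations.
    return LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

def fit_bootstrap_balanced(X, y, seed=0):
    # Bootstrap-based rebalancing: upsample the minority class until the
    # training set is roughly balanced, then fit an off-the-shelf predictor.
    rng = np.random.RandomState(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    boot = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, boot])
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
```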
In the case of fraud detection in financial transactions, these properties have been highlighted in work involving both supervised (Owen, 2007) and unsupervised (Damez et al., 2012) learning approaches, where the problem of handling large datasets is specifically studied. Whereas in this setting the user assumes that he has enough labeled samples to confidently follow the decision of a learned predictor, another line of work relies on an active learning procedure that consists in optimizing the accuracy of a predictor by iteratively labelling a set of well-chosen samples (Carcillo et al., 2018; Zhang et al., 2018). The main objective of this approach is to minimize the amount of work necessary to build a correctly performing classifier, since obtaining reliable labels is an expensive operation. A common feature of the existing active learning strategies is the selection of examples that keep a balanced pool of labeled samples, whether it is in label or in space (Ertekin et al., 2007). Indeed, training a predictor with imbalanced data is known to affect its performance, while incorporating some scalability issues due to the difficulty of handling in memory the labeled examples of the majority class. This online re-balancing process thus solves the two issues raised above at the same time.

In the standard active learning framework, the true labels are sequentially queried from an oracle, and a good active learning strategy should be able to provide good classification performance on new samples while making as few oracle queries as possible. Whereas previous work in the context of fraud detection has focused on optimizing some metric computed on a test set based on the resulting classifier predictions, we argue that in many practical applications with financial data, this setting is not adapted since it does not rely on the right metric. Actually, due to Article 22 of the European GDPR regulation - automated individual decision-making, including profiling -, engaging legal pursuits and sanctions against a fraudulent user requires a human verification of the corresponding decision (European Parliament and Council, 2016). Since the fitted classifier will never be used without requesting an oracle, it is not necessary for the fitted classifier to perform well on a held-out dataset. It is preferable, in this configuration, to minimize the number of verifications that correspond to non-fraudulent operations and thus maximize the number of discovered and treated frauds over time. This setting differs from the active learning setting in that the goal is not to build the best classifier over a given horizon but instead to recommend as many fraudulent objects as possible to the oracle. In the next sections we present each framework and stress their similarities and differences, as well as the consequences in terms of adapted strategy.

The active learning setting

In this section we assume that we have access to a sample D = {(x, y) ∼ P_{X×Y}} where the x are input feature representations and the y ∈ {0, 1} are binary labels indicating whether the transaction x is a fraud (y = 1) or not. This sample is partitioned into an active set D and a finite testing set D′. The active set is again partitioned into a labeled set D_l and an unlabeled set D_u that evolve over time, since querying the label of an example makes it move from D_u^t to D_l^t at each iteration t. Initially, the labels are only available for a fraction of the data, D_l^0 = {x_i, y_i}_{i ∈ 1,...,n}. We suppose additionally that we have access to an active learning strategy H_t(D_l^t), i.e. a function based on the current labeled sample which returns the next unlabeled point that will be provided to the oracle. Examples of such strategies are detailed below. The active learning procedure can then be divided into the following steps:

1. Based on the current labeled dataset D_l^t, build a predictor g_t.
2. Choose an unlabeled point x based on H_t(D_l^t, g_t) for which we want to obtain the label.

3. Query the corresponding label y from an oracle.

4. Update D_l^{t+1} = D_l^t ∪ {(x, y)} and D_u^{t+1} = D_u^t \ {(x, y)}.
5. Increment t and repeat from (1) until t = T.

The performance of an active learning strategy can be measured through the performance of g_t on the testing set. For a non-negative performance measure m : Y × Y → R+, the goal is to find the strategy that maximizes Σ_{(x,y) ∈ D′} m(g_t(x), y) for all the time steps t ∈ {1, ..., T}.
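The five steps above can be summarized in a short loop. This is a minimal sketch, not the paper's implementation; it assumes `strategy(model, X_u)` returns the index of the next point to query and `oracle(x)` returns its true label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_loop(X_l, y_l, X_u, oracle, strategy, T):
    """Steps 1-5 of the active learning procedure (assumed interfaces)."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    predictors = []
    for t in range(T):
        g_t = RandomForestClassifier().fit(np.array(X_l), np.array(y_l))  # 1. fit g_t
        j = strategy(g_t, np.array(X_u))      # 2. choose an unlabeled point
        x = X_u[j]
        y = oracle(x)                         # 3. query its label
        X_l.append(x); y_l.append(y)          # 4. move (x, y) from D_u to D_l
        del X_u[j]
        predictors.append(g_t)                # 5. g_t is then scored on D'
    return predictors, X_l, y_l
```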
Active learning strategies

A large body of work has focused on designing active learning strategies that take into account some properties of the data or some specificities of the underlying class of predictors to optimize. Thus (Ertekin et al., 2007) focuses on learning on the border using SVM properties, and (Zhang et al., 2016) uses the distance notion introduced by the SVM hyperplane to define a way to query points to label. On the other hand, strategies can be defined without relying on properties of the underlying predictor, taking advantage only of its ability to produce class-wise probability estimates. Such strategies can be grouped into two categories:

• Unitary methods (base methods): Uncertainty Sampling, Random Sampling. This type of method relies on a single hypothesis explaining the insufficient performance of the predictor, from which a sampling method is derived. In the case of Uncertainty Sampling, the hypothesis is that the most important samples are those where the probabilities estimated by the model itself have a high variance (Lewis and Catlett, 1994; Cohn et al., 1995). In practice, the strategy will tend to select samples in zones at the known frontier between two distinct classes (a minimal sketch is given after this list). In the case of Random Sampling, the strategy ignores the learned predictor and makes no hypothesis on the evolution of its performance with respect to the chosen labeled points.

• Adaptive methods: While unitary methods have been designed with the idea of choosing samples that optimize a single criterion, (Hsu and Lin, 2015) proposes a meta-algorithm that chooses the best unitary method to use at each time step in order to maximize a specifically designed reward function (a weighted accuracy computed on the points submitted to the oracle). Note for example that different uncertainty sampling approaches could be built upon different probability estimations of the output labels, and the adaptive approach would choose at each time step which unitary strategy should be used. Similarly, (Konyushkova et al., 2017) fits a model able to predict the expected increase of a test metric. Then, the point picked by the algorithm is the one that has the greatest expected improvement in this metric.
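The two unitary rules translate into a few lines each. This sketch assumes a fitted binary classifier exposing scikit-learn's `predict_proba` interface; both functions plug into the `active_learning_loop` sketch above as the `strategy` argument:

```python
import numpy as np

def uncertainty_sampling(model, X_unlabeled):
    # Pick the point whose two class probabilities are closest, i.e. the
    # point the current predictor is least certain about.
    proba = model.predict_proba(X_unlabeled)          # shape (n, 2)
    return int(np.argmin(np.abs(proba[:, 1] - proba[:, 0])))

def random_sampling(model, X_unlabeled, rng=None):
    # Ignore the predictor entirely and query a point uniformly at random.
    rng = rng or np.random.RandomState(0)
    return int(rng.randint(len(X_unlabeled)))
```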
Now we turn to the presentation of our framework, which differs from the active learning one.

Computer-assisted fraud detection

We now propose a new setting that intends to simulate the real-life constraints more appropriately. The goal is no longer to optimize a metric evaluated on a holdout dataset, but instead to iteratively submit to the oracle only the examples corresponding to the class 1 (fraudulent operations). The available data is thus only partitioned into a labeled set D_t and an unlabeled set D′_t such that D = D_t ∪ D′_t. Suppose that we can build a strategy H_t that returns an unlabeled example. Given a non-negative reward function r : Y × Y → R+, the goal is then to find the strategy that maximizes the cumulated reward:

Σ_{(x,y) ∈ D_t} r(g_t(x), y)   (1)

The reward can take into account the amount of money involved in a fraudulent transaction. When this information is not available, we can simply provide a unitary reward when a fraud is identified:

r(y, f(x)) = 1 if y = f(x) = 1, and 0 otherwise   (2)

At each time step, the optimal strategy H_t would return an element of D′_t among those with the highest expected reward: x* ∈ argmax_x p(y = 1 | x), where p(y | x) is the true conditional distribution of the data. Since the conditional probability is not directly available, it is instead estimated by a function p̂_t taken in a hypothesis class C and learned on the labeled sample D_t:

p̂_t = argmin_{p̂ ∈ C} Σ_{(x,y) ∈ D_t} l(p̂(x), y) + Ω(p̂)   (3)

where l is a loss function penalizing wrong predictions of y and Ω a penalty function enforcing the choice of regular candidates. In the case where we want to compute class probabilities, one can choose the cross-entropy loss function:

p̂_t = argmin_{p̂ ∈ C} Σ_{(x,y) ∈ D_t} −y log(p̂(x)) − (1 − y) log(1 − p̂(x))   (4)

This type of probability estimator is well known and can be parameterized by a linear model (logistic regression) or a non-linear one (neural networks). Different choices of loss and parameterization lead to different classes of predictors that may be used to construct C (Gaussian Processes, Random Forests, boosting-based algorithms). Up to this point we have provided an approximation of p(y | x) based on the sample D_t only. This has two consequences:

• Based on the knowledge of p̂, the x values proposed to the oracle will be the ones with the highest probability of carrying the label 1. For a correctly regularized predictor, these points will be the ones located close to already detected frauds. By analogy with the bandit literature (Audibert et al., 2009), this step can be seen as an exploitation phase, where the strategy relies on its estimation of the expected rewards to pick the arm that will give a gain with the highest probability among all the possible candidates (a sketch of this step is given after this list).

• When there are unlabeled parts of the space X containing some objects labeled 1, or when the ones we have already found have been exhausted, a good strategy needs to quickly explore the space to find new instances labeled 1. During this step, instead of choosing the x that maximizes the expected reward, we try to find the one that gives the most information to p̂. Once again, this is analogous to the exploration phase in the bandit literature.
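The unitary reward of Eq. (2) and the pure exploitation step are straightforward to write down. This is a sketch under the same assumed `predict_proba` interface as above:

```python
import numpy as np

def unitary_reward(y_true):
    # Eq. (2): reward 1 iff the submitted point turns out to be a fraud.
    return 1 if y_true == 1 else 0

def exploitation_step(p_hat, X_unlabeled):
    # x* in argmax_x p_hat(y = 1 | x): submit the unlabeled point with the
    # highest estimated fraud probability.
    return int(np.argmax(p_hat.predict_proba(X_unlabeled)[:, 1]))
```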
The computer-assisted fraud detection algorithm (CAFDA)

The two steps of exploration / exploitation presented previously can be mixed in a simple algorithm that in practice works surprisingly well on benchmark datasets for the task of computer-assisted fraud detection. It is inspired by the EXP4.P bandit algorithm (Beygelzimer et al., 2011), which maintains a probability of picking each of the possible strategies and updates these probabilities according to the rewards received.

Similarly to Beygelzimer et al. (2011) and Hsu and Lin (2015), we suppose that we have access to a set of K active learning algorithms that each provide an advice vector ξ of the size of the unlabeled set, containing the probability of querying each example. We additionally maintain a vector w ∈ [0, 1]^K that indicates the probability of using each strategy, and choose two update parameters K_1, K_2 which control the variation of w depending on the rewards received. We also introduce P_min and P_max, two threshold levels on the probabilities stored in w, used to reduce the time necessary to switch quickly from one current best strategy (of index i with a high w_i value) to another as the number of iterations increases. In order to maximize our custom reward, we propose the following fraud detection algorithm (CAFDA):

Data: Labeled set D_l and unlabeled set D_u
Result: Sequence of rewards (r_t)_{t ∈ {1,...,T}}, final labeled set D_l^T
Initialization: set the initial probability of sampling each strategy to w_i = 1/K;
for t in {1, ..., T} do
    Pick a strategy i ∈ {1, ..., K} according to the distribution w;
    Sample the next point x_j for which we want a query according to ξ_i;
    Query the label y_j from the oracle;
    Receive a reward r_t according to y_j;
    Update the sets: D_l^{t+1} = D_l^t ∪ {(x_j, y_j)} and D_u^{t+1} = D_u^t \ {(x_j, y_j)};
    Update the probabilities w according to the following heuristic:
        if r_t = 1 then w_i = min(K_2 w_i, P_max) else w_i = max(K_1 w_i, P_min);
        ∀ j ≠ i, w_j = max(min(w_j, P_max), P_min);
        w = w / Σ_{i=1}^K w_i;
    Update the strategies using D_l^{t+1};
end

Algorithm 1: Heuristic-based procedure for computer-assisted fraud detection (CAFDA)
The main difference with Hsu and Lin (2015) is the use of the w update heuristic. In the original paper, the reward update scheme is chosen to optimize the accuracy of the resulting predictor on a held-out dataset, which differs from our reward based only on the labels found. Concerning the update, EXP4.P has been designed to achieve optimal regret in a stationary context, which is not the case here. By choosing K_1, K_2, P_min and P_max carefully, CAFDA obtains the competitive results that we detail in the experiments section.
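One iteration of Algorithm 1, including the w update heuristic, could look as follows. This is a sketch, not the authors' code; it assumes each strategy is a function returning the index of the point it wants to query, with K_1 < 1 < K_2 as suggested by the update rule:

```python
import numpy as np

def cafda_step(w, strategies, model, X_u, oracle, K1, K2, P_min, P_max, rng):
    """One iteration of Algorithm 1 (assumed interfaces)."""
    i = rng.choice(len(strategies), p=w)       # pick a strategy according to w
    j = strategies[i](model, X_u)              # it proposes the next point x_j
    y = oracle(X_u[j])                         # query the oracle for y_j
    r = 1 if y == 1 else 0                     # unitary reward of Eq. (2)
    w = w.copy()
    if r == 1:
        w[i] = min(K2 * w[i], P_max)           # reward: boost strategy i (K2 > 1)
    else:
        w[i] = max(K1 * w[i], P_min)           # no reward: damp it (K1 < 1)
    others = np.arange(len(w)) != i
    w[others] = np.clip(w[others], P_min, P_max)  # keep every probability in range
    w = w / w.sum()                            # renormalize to a distribution
    return w, j, y, r
```

The caller then moves (x_j, y_j) into the labeled set and refits the base model and the strategies, exactly as in Algorithm 1.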
Experiments
We simulate the computer-assisted fraud detection framework described above in the following way. Given an imbalanced fraud dataset containing p frauds, we first sample a small fraction of the points that constitute an initial labeled set, and then iteratively select an unlabeled point which is shown to the oracle. If this point is labeled 1, a reward of 1 is gained, and we display the cumulated reward over time. We compare CAFDA against some baselines and state-of-the-art active learning strategies:

• base: use the predictor trained only once on the initial labeled set and perform the exploitation phase only at each time step: x* ∈ argmax_x p̂_t(y = 1 | x).

• base_refit: same as base, but the predictor is retrained on D_l^t at each time step.

• random: the point queried is picked randomly in the unlabeled set D_u^t.

• us (uncertainty sampling): the point queried is the one of maximal uncertainty for the predictor, i.e. min_x |P(y = 1 | x) − P(y = 0 | x)|.

• lal_independent (Learning Active Learning with an independent strategy): the point queried is the one with the maximal expected improvement in a chosen loss. The expected improvement is the prediction of a model fitted on a synthetic dataset. In the independent strategy, a Monte Carlo procedure is simulated to query some points randomly and associate them with an improvement in the loss (Konyushkova et al., 2017).

• lal_iterative (Learning Active Learning with an iterative strategy): this algorithm differs from the previous one only in the way the synthetic dataset is constructed. Actually, the points are queried in order to minimize the selection bias.

• albl (Active Learning By Learning): a multi-armed bandit chooses among multiple active learning strategies at each time step in order to maximize an expected cumulated reward, which is a weighted accuracy on the already queried points D_l^t.

As base strategies for CAFDA, we take the 5 strategies base, base_refit, random, lal_independent and lal_iterative, and exclude albl as it is also a meta-algorithm. For all the scenarios, K_1 = 0.…, K_2 = 1.…, P_min = 0.… and P_max = 0.….

The different methods are compared in two scenarios:

1. The active learning is run during the entire experiment. In this experiment, we empirically show that active learning methods do not maximize the cumulated reward we defined.

2. The active learning algorithm is run for 100 steps, then the resulting classifier is used to select the points with the highest estimated probability of being labeled 1. Here we aim at showing that early exploration using an active learning strategy does not even help in the long run.

For all our experiments, we used a Random Forest classifier as the base probability estimator and selected the hyperparameters by cross-validation on the initially labeled training set. We display results obtained with 3 standard benchmark anomaly detection datasets, since they share the imbalance property of financial fraud detection databases and are freely available.

Table 1: Properties of the datasets

                        shuttle    covtype    credit card
Number of samples       85849      295541     284807
Input dimension         10         55         31
Anomaly proportion      7.2%       4.1%       0.17%

In all experiments, we first sample an initial labeled dataset, representing 1% of the data for all the datasets. In the case of the covtype dataset, instead of using the full dataset, we worked with a subsample of the examples in order to keep a fairly low number of 'frauds' initially observed.
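The simulation protocol can be summarized as below. This is a sketch under assumed interfaces, not the authors' experimental code; `pick_next(labeled, unlabeled)` stands for any of the strategies above wrapped to return a global index:

```python
import numpy as np

def simulate(X, y, pick_next, init_frac=0.01, T=1500, seed=0):
    """Run one strategy for T steps and return its cumulated reward curve."""
    rng = np.random.RandomState(seed)
    n = len(y)
    labeled = list(rng.choice(n, size=max(1, int(init_frac * n)), replace=False))
    unlabeled = [i for i in range(n) if i not in set(labeled)]
    rewards = []
    for _ in range(T):
        j = pick_next(labeled, unlabeled)  # next point shown to the oracle
        unlabeled.remove(j)
        labeled.append(j)
        rewards.append(int(y[j] == 1))     # reward 1 iff a fraud is uncovered
    return np.cumsum(rewards)              # curve plotted in Figures 1 and 2
```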
Figure 1 shows the cumulated reward over time obtained with each strategy.

[Figure 1: cumulated reward curves for albl, base, base_refit, lal_independent, lal_iterative, random, us and CAFDA; panels: (a) creditcard scenario 1, (b) covtype scenario 1, (c) shuttle scenario 1.]
Figure 1: Scenario 1 cumulated rewards.

The best results are obtained with CAFDA, base and base_refit, with a slight advantage on the creditcard dataset for the methods that retrain their underlying model. As expected, the active learning strategies do not specifically try to submit label-1 examples to the oracle, which explains their behavior. We then test experimentally whether an early active-learning-based exploration can provide a benefit in a subsequent exploitation phase.

We now turn to the case where each active learning strategy is used for the first 100 steps. In the following iterations, the points provided to the oracle are chosen by solving x* = argmax_x p̂(y = 1 | x) with the resulting learned predictor. The results presented in Figure 2 show that the active learning strategies do not benefit, even in the long run, from their early exploration. Indeed, CAFDA, base and base_refit remain competitive while being simpler than active learning procedures. Focusing on the first 300 iterations, we observe that the lag in the cumulated reward of the active learning procedures appears at the very beginning and persists until all the label-1 examples have been found.

[Figure 2: cumulated reward curves for the same strategies; panels: (a) covtype scenario 2, (b) covtype scenario 2 first iterations, (c) shuttle scenario 2, (d) shuttle scenario 2 first iterations, (e) credit card scenario 2, (f) credit card scenario 2 first iterations.]

Figure 2: Scenario 2 cumulated rewards, total and first iterations.
Conclusion and perspectives
We presented a new fraud detection framework that differs from the active learning setting: the quality of a strategy is measured by its ability to submit the rare labels to the oracle. We have shown that our algorithm CAFDA, as well as simple baselines, provides better results than state-of-the-art active learning algorithms on these problems. Future work will focus on the statistical properties of the computer-assisted fraud detection problem in order to design theoretically grounded optimal algorithms for the task at hand, and will explore how to adapt other adaptive active learning strategies to our setting.
References
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26.

Bhattacharyya, S., Jha, S., Tharakunnel, K., and Westland, J. C. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3):602-613.

Bolton, R. J. and Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, pages 235-249.

Carcillo, F., Borgne, Y.-A. L., Caelen, O., and Bontempi, G. (2018). Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, pages 1-16.

Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1-6.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1995). Active learning with statistical models. In Advances in Neural Information Processing Systems, pages 705-712.

Damez, M., Lesot, M.-J., and d'Allonnes, A. R. (2012). Dynamic credit-card fraud profiling. In International Conference on Modeling Decisions for Artificial Intelligence, pages 234-245. Springer.

Ertekin, S., Huang, J., Bottou, L., and Giles, L. (2007). Learning on the border: active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 127-136. ACM.

European Parliament and Council (2016). Regulation (EU) 2016/679 of the European Parliament and Council. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679.

Hsu, W.-N. and Lin, H.-T. (2015). Active learning by learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2659-2665. AAAI Press.

Konyushkova, K., Sznitman, R., and Fua, P. (2017). Learning active learning from data. In Advances in Neural Information Processing Systems, pages 4225-4235.

Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148-156. Elsevier.

Owen, A. B. (2007). Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8(Apr):761-773.

Zhang, X., Yang, T., and Srinivasan, P. (2016). Online asymmetric active learning with imbalanced data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2055-2064. ACM.

Zhang, Y., Zhao, P., Cao, J., Ma, W., Huang, J., Wu, Q., and Tan, M. (2018). Online adaptive asymmetric active learning for budgeted imbalanced data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.