Computer-Assisted Fraud Detection, From Active Learning to Reward Maximization
Christelle Marfaing
Lydia Solutions [email protected]
Alexandre Garcia
LTCI, Télécom ParisTech, Université Paris Saclay [email protected]
Abstract
The automatic detection of frauds in banking transactions has recently been studied as a way to help analysts find fraudulent operations. Due to the availability of human feedback, this task has been studied in the framework of active learning: the fraud predictor is allowed to sequentially call on an oracle. This human intervention is used to label new examples and improve the classification accuracy of the latter. Such a setting is not adapted to the case of fraud detection with financial data in European countries. Actually, since a human verification is mandatory to consider a fraud as really detected, it is not necessary to focus on improving the classifier. We introduce the setting of 'computer-assisted fraud detection', where the goal is to minimize the number of non-fraudulent operations submitted to an oracle. We apply the existing methods to this task and show that a simple meta-algorithm provides competitive results in this scenario on benchmark datasets.
Introduction

The task of automatic fraud detection has mainly been studied under the framework of imbalanced binary classification (Bhattacharyya et al., 2011). Given the description of a transaction x, the goal is to predict a binary label y ∈ {0, 1} indicating whether this transaction is fraudulent or not. The main difficulties arising in fraud detection, highlighted earlier in (Bolton and Hand, 2002), include among others:

• The strong imbalance between the output labels. Indeed, fraudulent behaviors are assumed to be rare and thus harder to find. Previous work has proposed different solutions that help building efficient predictors in the case of imbalanced classification. Such approaches mainly consist in introducing instance reweighting or bootstrap-based schemes (Chawla et al., 2004) in order to transform the imbalanced learning problem into a related balanced problem on which learning can be done with off-the-shelf predictors (see the sketch after this list).

• The large amount of unlabeled data with regard to labeled data, which advocates the use of methods that can scale to large datasets and that generalize well.
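As a minimal illustration of the two rebalancing schemes mentioned in the first bullet (not taken from the paper, and assuming scikit-learn-style array inputs), one can either reweight instances inversely to their class frequency or bootstrap the minority class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_reweighted(X, y):
    # Instance reweighting: weight each class inversely to its frequency,
    # so rare frauds count as much as the abundant legitimate operations.
    return LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

def fit_bootstrap_balanced(X, y, seed=0):
    # Bootstrap-based rebalancing: upsample the minority class until the
    # training set is roughly balanced, then fit an off-the-shelf predictor.
    rng = np.random.RandomState(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    boot = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, boot])
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
```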
In the case of fraud detection in financial transactions, these properties have been highlighted in work involving both supervised (Owen, 2007) and unsupervised (Damez et al., 2012) learning approaches, where the problem of handling large datasets is specifically studied. Whereas in this setting the user assumes that he has enough labeled samples to confidently follow the decision of a learned predictor, another line of work relies on an active learning procedure that consists in optimizing the accuracy of a predictor by iteratively labelling a set of well-chosen samples (Carcillo et al., 2018; Zhang et al., 2018). The main objective of this approach is to minimize the amount of work necessary to build a correctly performing classifier, since obtaining reliable labels is an expensive operation. A common feature of the existing active learning strategies is the selection of examples that keep a balanced pool of labeled samples, whether it is in label or in space (Ertekin et al., 2007). Indeed, training a predictor with imbalanced data is known to affect its performance, while incorporating some scalability issues due to the difficulty of handling in memory the labeled examples of the majority class. This online re-balancing process thus solves the two issues raised above at the same time.

In the standard active learning framework, the true labels are sequentially queried from an oracle, and a good active learning strategy should be able to provide good classification performance on new samples while making as few oracle queries as possible. Whereas previous work in the context of fraud detection has focused on optimizing some metric computed on a test set based on the resulting classifier predictions, we argue that in many practical applications with financial data, this setting is not adapted since it does not rely on the right metric. Actually, due to Article 22 of the European GDPR regulation - automated individual decision-making, including profiling -, engaging legal pursuits and sanctions against a fraudulent user requires a human verification of the corresponding decision (European Parliament and Council, 2016). Since the fitted classifier will never be used without requesting an oracle, it is not necessary for the fitted classifier to perform well on a held-out dataset. It is preferable, in this configuration, to minimize the number of verifications that correspond to non-fraudulent operations and thus maximize the number of discovered and treated frauds over time. This setting differs from the active learning setting in that the goal is not to build the best classifier over a given horizon but instead to recommend as many fraudulent objects as possible to the oracle. In the next sections we present each framework and stress their similarities and differences, as well as the consequences in terms of adapted strategy.

The active learning setting

In this section we assume that we have access to a sample D = {(x, y) ∼ P_{X×Y}} where the x are input feature representations and the y ∈ {0, 1} are binary labels indicating whether the transaction x is a fraud (y = 1) or not. This sample is partitioned into an active set D and a finite testing set D′. The active set is again partitioned into a labeled set D_l and an unlabeled set D_u that evolve over time, since querying the label of an example makes it move from D_u^t to D_l^t at each iteration t. Initially, the labels are only available for a fraction of the data, D_l^0 = {x_i, y_i}_{i ∈ 1,...,n}. We suppose additionally that we have access to an active learning strategy H_t(D_l^t), i.e. a function based on the current labeled sample which returns the next unlabeled point that will be provided to the oracle. Examples of such strategies are detailed below. The active learning procedure can then be divided into the following steps:

1. Based on the current labeled dataset D_l^t, build a predictor g_t.
2. Choose an unlabeled point x based on H_t(D_l^t, g_t) for which we want to obtain the label.

3. Query the corresponding label y from an oracle.

4. Update D_l^{t+1} = D_l^t ∪ {(x, y)} and D_u^{t+1} = D_u^t \ {(x, y)}.
5. Increment t and repeat from (1) until t = T.

The performance of an active learning strategy can be measured through the performance of g_t on the testing set. For a non-negative performance measure m : Y × Y → R+, the goal is to find the strategy that maximizes Σ_{(x,y) ∈ D′} m(g_t(x), y) for all the time steps t ∈ {1, ..., T}.
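The five steps above can be summarized in a short loop. This is a minimal sketch, not the paper's implementation; it assumes `strategy(model, X_u)` returns the index of the next point to query and `oracle(x)` returns its true label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_loop(X_l, y_l, X_u, oracle, strategy, T):
    """Steps 1-5 of the active learning procedure (assumed interfaces)."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    predictors = []
    for t in range(T):
        g_t = RandomForestClassifier().fit(np.array(X_l), np.array(y_l))  # 1. fit g_t
        j = strategy(g_t, np.array(X_u))      # 2. choose an unlabeled point
        x = X_u[j]
        y = oracle(x)                         # 3. query its label
        X_l.append(x); y_l.append(y)          # 4. move (x, y) from D_u to D_l
        del X_u[j]
        predictors.append(g_t)                # 5. g_t is then scored on D'
    return predictors, X_l, y_l
```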
Active learning strategies

A large body of work has focused on designing active learning strategies that take into account some properties of the data or some specificities of the underlying class of predictors to optimize. Thus (Ertekin et al., 2007) focuses on learning on the border using SVM properties, and (Zhang et al., 2016) uses the distance notion introduced by the SVM hyperplane to define a way to query points to label. On the other hand, strategies can be defined without relying on properties of the underlying predictor, taking advantage only of its ability to produce class-wise probability estimates. Such strategies can be grouped into two categories:

• Unitary methods (base methods): Uncertainty Sampling, Random Sampling. This type of method relies on a single hypothesis explaining the insufficient performance of the predictor, from which a sampling method is derived. In the case of Uncertainty Sampling, the hypothesis is that the most important samples are those where the probabilities estimated by the model itself have a high variance (Lewis and Catlett, 1994; Cohn et al., 1995). In practice, the strategy will tend to select samples in zones at the known frontier between two distinct classes (a minimal sketch is given after this list). In the case of Random Sampling, the strategy ignores the learned predictor and makes no hypothesis on the evolution of its performance with respect to the chosen labeled points.

• Adaptive methods: While unitary methods have been designed with the idea of choosing samples that optimize a single criterion, (Hsu and Lin, 2015) proposes a meta-algorithm that chooses the best unitary method to use at each time step in order to maximize a specifically designed reward function (a weighted accuracy computed on the points submitted to the oracle). Note for example that different uncertainty sampling approaches could be built upon different probability estimations of the output labels, and the adaptive approach would choose at each time step which unitary strategy should be used. Similarly, (Konyushkova et al., 2017) fits a model able to predict the expected increase of a test metric. Then, the point picked by the algorithm is the one that has the greatest expected improvement in this metric.
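The two unitary rules translate into a few lines each. This sketch assumes a fitted binary classifier exposing scikit-learn's `predict_proba` interface; both functions plug into the `active_learning_loop` sketch above as the `strategy` argument:

```python
import numpy as np

def uncertainty_sampling(model, X_unlabeled):
    # Pick the point whose two class probabilities are closest, i.e. the
    # point the current predictor is least certain about.
    proba = model.predict_proba(X_unlabeled)          # shape (n, 2)
    return int(np.argmin(np.abs(proba[:, 1] - proba[:, 0])))

def random_sampling(model, X_unlabeled, rng=None):
    # Ignore the predictor entirely and query a point uniformly at random.
    rng = rng or np.random.RandomState(0)
    return int(rng.randint(len(X_unlabeled)))
```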
Now we turn to the presentation of our framework, which differs from the active learning one.

Computer-assisted fraud detection

We now propose a new setting that intends to simulate the real-life constraints more appropriately. The goal is no longer to optimize a metric evaluated on a holdout dataset, but instead to iteratively submit to the oracle only the examples corresponding to the class 1 (fraudulent operations). The available data is thus only partitioned into a labeled set D_t and an unlabeled set D′_t such that D = D_t ∪ D′_t. Suppose that we can build a strategy H_t that returns an unlabeled example. Given a non-negative reward function r : Y × Y → R+, the goal is then to find the strategy that maximizes the cumulated reward:

Σ_{(x,y) ∈ D_t} r(g_t(x), y)   (1)

The reward can take into account the amount of money involved in a fraudulent transaction. When this information is not available, we can simply provide a unitary reward when a fraud is identified:

r(y, f(x)) = 1 if y = f(x) = 1, and 0 otherwise   (2)

At each time step, the optimal strategy H_t would return an element of D′_t among those with the highest expected reward: x* ∈ argmax_x p(y = 1 | x), where p(y | x) is the true conditional distribution of the data. Since the conditional probability is not directly available, it is instead estimated by a function p̂_t taken in a hypothesis class C and learned on the labeled sample D_t:

p̂_t = argmin_{p̂ ∈ C} Σ_{(x,y) ∈ D_t} l(p̂(x), y) + Ω(p̂)   (3)

where l is a loss function penalizing wrong predictions of y and Ω a penalty function enforcing the choice of regular candidates. In the case where we want to compute class probabilities, one can choose the cross-entropy loss function:

p̂_t = argmin_{p̂ ∈ C} Σ_{(x,y) ∈ D_t} −y log(p̂(x)) − (1 − y) log(1 − p̂(x))   (4)

This type of probability estimator is well known and can be parameterized by a linear model (logistic regression) or a non-linear one (neural networks). Different choices of loss and parameterization lead to different classes of predictors that may be used to construct C (Gaussian Processes, Random Forests, boosting-based algorithms). Up to this point we have provided an approximation of p(y | x) based on the sample D_t only. This has two consequences:

• Based on the knowledge of p̂, the x values proposed to the oracle will be the ones with the highest probability of carrying the label 1. For a correctly regularized predictor, these points will be the ones located close to already detected frauds. By analogy with the bandit literature (Audibert et al., 2009), this step can be seen as an exploitation phase, where the strategy relies on its estimation of the expected rewards to pick the arm that will give a gain with the highest probability among all the possible candidates (a sketch of this step is given after this list).

• When there are unlabeled parts of the space X containing some objects labeled 1, or when the ones we have already found have been exhausted, a good strategy needs to quickly explore the space to find new instances labeled 1. During this step, instead of choosing the x that maximizes the expected reward, we try to find the one that gives the most information to p̂. Once again, this is analogous to the exploration phase in the bandit literature.
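The unitary reward of Eq. (2) and the pure exploitation step are straightforward to write down. This is a sketch under the same assumed `predict_proba` interface as above:

```python
import numpy as np

def unitary_reward(y_true):
    # Eq. (2): reward 1 iff the submitted point turns out to be a fraud.
    return 1 if y_true == 1 else 0

def exploitation_step(p_hat, X_unlabeled):
    # x* in argmax_x p_hat(y = 1 | x): submit the unlabeled point with the
    # highest estimated fraud probability.
    return int(np.argmax(p_hat.predict_proba(X_unlabeled)[:, 1]))
```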
The computer-assisted fraud detection algorithm (CAFDA)

The two steps of exploration / exploitation presented previously can be mixed in a simple algorithm that in practice works surprisingly well on benchmark datasets for the task of computer-assisted fraud detection. It is inspired by the EXP4.P bandit algorithm (Beygelzimer et al., 2011), which maintains a probability of picking each of the possible strategies and updates these probabilities according to the rewards received.

Similarly to Beygelzimer et al. (2011) and Hsu and Lin (2015), we suppose that we have access to a set of K active learning algorithms that each provide an advice vector ξ of the size of the unlabeled set, containing the probability of querying each example. We additionally maintain a vector w ∈ [0, 1]^K that indicates the probability of using each strategy, and choose two update parameters K_1, K_2 which control the variation of w depending on the rewards received. We also introduce P_min and P_max, two threshold levels on the probabilities stored in w, used to reduce the time necessary to switch quickly from one current best strategy (of index i with a high w_i value) to another as the number of iterations increases. In order to maximize our custom reward, we propose the following fraud detection algorithm (CAFDA):

Data: Labeled set D_l and unlabeled set D_u
Result: Sequence of rewards (r_t)_{t ∈ {1,...,T}}, final labeled set D_l^T
Initialization: set the initial probability of sampling each strategy to w_i = 1/K;
for t in {1, ..., T} do
    Pick a strategy i ∈ {1, ..., K} according to the distribution w;
    Sample the next point x_j for which we want a query according to ξ_i;
    Query the label y_j from the oracle;
    Receive a reward r_t according to y_j;
    Update the sets: D_l^{t+1} = D_l^t ∪ {(x_j, y_j)} and D_u^{t+1} = D_u^t \ {(x_j, y_j)};
    Update the probabilities w according to the following heuristic:
        if r_t = 1 then w_i = min(K_2 w_i, P_max) else w_i = max(K_1 w_i, P_min);
        ∀ j ≠ i, w_j = max(min(w_j, P_max), P_min);
        w = w / Σ_{i=1}^K w_i;
    Update the strategies using D_l^{t+1};
end

Algorithm 1: Heuristic-based procedure for computer-assisted fraud detection (CAFDA)
The main difference with Hsu and Lin (2015) is the use of the w update heuristic. In the original paper, the reward update scheme is chosen to optimize the accuracy of the resulting predictor on a held-out dataset, which differs from our reward based only on the labels found. Concerning the update, EXP4.P has been designed to achieve optimal regret in a stationary context, which is not the case here. By choosing K_1, K_2, P_min and P_max carefully, CAFDA obtains the competitive results that we detail in the experiments section.
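One iteration of Algorithm 1, including the w update heuristic, could look as follows. This is a sketch, not the authors' code; it assumes each strategy is a function returning the index of the point it wants to query, with K_1 < 1 < K_2 as suggested by the update rule:

```python
import numpy as np

def cafda_step(w, strategies, model, X_u, oracle, K1, K2, P_min, P_max, rng):
    """One iteration of Algorithm 1 (assumed interfaces)."""
    i = rng.choice(len(strategies), p=w)       # pick a strategy according to w
    j = strategies[i](model, X_u)              # it proposes the next point x_j
    y = oracle(X_u[j])                         # query the oracle for y_j
    r = 1 if y == 1 else 0                     # unitary reward of Eq. (2)
    w = w.copy()
    if r == 1:
        w[i] = min(K2 * w[i], P_max)           # reward: boost strategy i (K2 > 1)
    else:
        w[i] = max(K1 * w[i], P_min)           # no reward: damp it (K1 < 1)
    others = np.arange(len(w)) != i
    w[others] = np.clip(w[others], P_min, P_max)  # keep every probability in range
    w = w / w.sum()                            # renormalize to a distribution
    return w, j, y, r
```

The caller then moves (x_j, y_j) into the labeled set and refits the base model and the strategies, exactly as in Algorithm 1.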
Experiments
We simulate the computer-assisted fraud detection framework described above in the following way. Given an imbalanced fraud dataset containing p frauds, we first sample a small fraction of the points that constitute an initial labeled set, and then iteratively select an unlabeled point which is shown to the oracle. If this point is labeled 1, a reward of 1 is gained, and we display the cumulated reward over time. We compare CAFDA against some baselines and state-of-the-art active learning strategies:

• base: use the predictor trained only once on the initial labeled set and perform the exploitation phase only at each time step: x* ∈ argmax_x p̂_t(y = 1 | x).

• base_refit: same as base, but the predictor is retrained on D_l^t at each time step.

• random: the point queried is picked randomly in the unlabeled set D_u^t.

• us (uncertainty sampling): the point queried is the one of maximal uncertainty for the predictor, i.e. min_x |P(y = 1 | x) − P(y = 0 | x)|.

• lal_independent (Learning Active Learning with an independent strategy): the point queried is the one with the maximal expected improvement in a chosen loss. The expected improvement is the prediction of a model fitted on a synthetic dataset. In the independent strategy, a Monte Carlo procedure is simulated to query some points randomly and associate them with an improvement in the loss (Konyushkova et al., 2017).

• lal_iterative (Learning Active Learning with an iterative strategy): this algorithm differs from the previous one only in the way the synthetic dataset is constructed. Actually, the points are queried in order to minimize the selection bias.

• albl (Active Learning By Learning): a multi-armed bandit chooses among multiple active learning strategies at each time step in order to maximize an expected cumulated reward, which is a weighted accuracy on the already queried points D_l^t.

As base strategies for CAFDA, we take the 5 strategies base, base_refit, random, lal_independent and lal_iterative, and exclude albl as it is also a meta-algorithm. For all the scenarios, K_1 = 0.…, K_2 = 1.…, P_min = 0.… and P_max = 0.….

The different methods are compared in two scenarios:

1. The active learning is run during the entire experiment. In this experiment, we empirically show that active learning methods do not maximize the cumulated reward we defined.

2. The active learning algorithm is run for 100 steps, then the resulting classifier is used to select the points with the highest estimated probability of being labeled 1. Here we aim at showing that early exploration using an active learning strategy does not even help in the long run.

For all our experiments, we used a Random Forest classifier as the base probability estimator and selected the hyperparameters by cross-validation on the initially labeled training set. We display results obtained with 3 standard benchmark anomaly detection datasets, since they share the imbalance property of financial fraud detection databases and are freely available.

Table 1: Properties of the datasets

                        shuttle    covtype    credit card
Number of samples       85849      295541     284807
Input dimension         10         55         31
Anomaly proportion      7.2%       4.1%       0.17%

In all experiments, we first sample an initial labeled dataset, representing 1% of the data for all the datasets. In the case of the covtype dataset, instead of using the full dataset, we worked with a subsample of the examples in order to keep a fairly low number of 'frauds' initially observed.
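The simulation protocol can be summarized as below. This is a sketch under assumed interfaces, not the authors' experimental code; `pick_next(labeled, unlabeled)` stands for any of the strategies above wrapped to return a global index:

```python
import numpy as np

def simulate(X, y, pick_next, init_frac=0.01, T=1500, seed=0):
    """Run one strategy for T steps and return its cumulated reward curve."""
    rng = np.random.RandomState(seed)
    n = len(y)
    labeled = list(rng.choice(n, size=max(1, int(init_frac * n)), replace=False))
    unlabeled = [i for i in range(n) if i not in set(labeled)]
    rewards = []
    for _ in range(T):
        j = pick_next(labeled, unlabeled)  # next point shown to the oracle
        unlabeled.remove(j)
        labeled.append(j)
        rewards.append(int(y[j] == 1))     # reward 1 iff a fraud is uncovered
    return np.cumsum(rewards)              # curve plotted in Figures 1 and 2
```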
Figure 1 shows the cumulated reward over time obtained with each strategy.

[Figure 1: cumulated reward curves for albl, base, base_refit, lal_independent, lal_iterative, random, us and CAFDA; panels: (a) creditcard scenario 1, (b) covtype scenario 1, (c) shuttle scenario 1.]
Figure 1: Scenario 1 cumulated rewards.

The best results are obtained with CAFDA, base and base_refit, with a slight advantage on the creditcard dataset for the methods that retrain their underlying model. As expected, the active learning strategies do not specifically try to submit label-1 examples to the oracle, which explains their behavior. We then test experimentally whether an early active-learning-based exploration can provide a benefit in a subsequent exploitation phase.

We now turn to the case where each active learning strategy is used for the first 100 steps. In the following iterations, the points provided to the oracle are chosen by solving x* = argmax_x p̂(y = 1 | x) with the resulting learned predictor. The results presented in Figure 2 show that the active learning strategies do not benefit, even in the long run, from their early exploration. Indeed, CAFDA, base and base_refit remain competitive while being simpler than active learning procedures. Focusing on the first 300 iterations, we observe that the lag in the cumulated reward of the active learning procedures appears at the very beginning and persists until all the label-1 examples have been found.

[Figure 2: cumulated reward curves for the same strategies; panels: (a) covtype scenario 2, (b) covtype scenario 2 first iterations, (c) shuttle scenario 2, (d) shuttle scenario 2 first iterations, (e) credit card scenario 2, (f) credit card scenario 2 first iterations.]

Figure 2: Scenario 2 cumulated rewards, total and first iterations.
Conclusion and perspectives
We presented a new fraud detection framework that differs from the active learning setting: the quality of a strategy is measured by its ability to submit the rare labels to the oracle. We have shown that our algorithm CAFDA, as well as simple baselines, provides better results than state-of-the-art active learning algorithms on these problems. Future work will focus on the statistical properties of the computer-assisted fraud detection problem in order to design theoretically grounded optimal algorithms for the task at hand, and will explore how to adapt other adaptive active learning strategies to our setting.
References
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26.

Bhattacharyya, S., Jha, S., Tharakunnel, K., and Westland, J. C. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3):602-613.

Bolton, R. J. and Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, pages 235-249.

Carcillo, F., Borgne, Y.-A. L., Caelen, O., and Bontempi, G. (2018). Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, pages 1-16.

Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1-6.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1995). Active learning with statistical models. In Advances in Neural Information Processing Systems, pages 705-712.

Damez, M., Lesot, M.-J., and d'Allonnes, A. R. (2012). Dynamic credit-card fraud profiling. In International Conference on Modeling Decisions for Artificial Intelligence, pages 234-245. Springer.

Ertekin, S., Huang, J., Bottou, L., and Giles, L. (2007). Learning on the border: active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 127-136. ACM.

European Parliament and Council (2016). Regulation (EU) 2016/679 of the European Parliament and Council. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679.

Hsu, W.-N. and Lin, H.-T. (2015). Active learning by learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2659-2665. AAAI Press.

Konyushkova, K., Sznitman, R., and Fua, P. (2017). Learning active learning from data. In Advances in Neural Information Processing Systems, pages 4225-4235.

Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148-156. Elsevier.

Owen, A. B. (2007). Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8(Apr):761-773.

Zhang, X., Yang, T., and Srinivasan, P. (2016). Online asymmetric active learning with imbalanced data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2055-2064. ACM.

Zhang, Y., Zhao, P., Cao, J., Ma, W., Huang, J., Wu, Q., and Tan, M. (2018). Online adaptive asymmetric active learning for budgeted imbalanced data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.