Towards automated feature engineering for credit card fraud detection using multi-perspective HMMs
Yvan Lucas, Pierre-Edouard Portier, Léa Laporte, Liyun He-Guelton, Olivier Caelen, Michael Granitzer, Sylvie Calabretto
INSA Lyon · Universität Passau · Worldline, Lyon
Abstract
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions.

In this framework, we model a sequence of credit card transactions from three different perspectives, namely (i) the sequence contains or doesn't contain a fraud, (ii) the sequence is obtained by fixing the card-holder or the payment terminal, (iii) it is a sequence of spent amounts or of elapsed times between the current and previous transactions. Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sets of sequences is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection.

Our multiple perspectives HMM-based approach offers automated feature engineering to model temporal correlations so as to improve the effectiveness of the classification task, and allows for an increase in the detection of fraudulent transactions when combined with the state-of-the-art expert-based feature engineering strategy for credit card fraud detection.

In extension to previous works, we show that this approach goes beyond e-commerce transactions and provides a robust feature engineering over different datasets, hyperparameters and classifiers. Moreover, we compare strategies to deal with structural missing values.

Introduction

Credit card fraud detection presents several difficulties. One of them is the fact that the feature set describing a credit card transaction usually ignores detailed sequential information. Typical models only use raw transactional features, such as time, amount, merchant category, etc. Donato et al.
(1999). Bolton and Hand (2001) showed the necessity of using attributes describing the history of the transaction when they used unsupervised methods such as peer group analysis for credit card fraud detection. Consequently, Whitrow et al. (2008) created descriptive statistics as features in order to include historical knowledge. These descriptive features can be, for example, the number of transactions or the total amount spent by the card-holder in the past 24 hours for a given merchant category or country. Bahnsen et al. (2016) considered Whitrow et al.'s (2008) strategy to epitomize the state-of-the-art feature engineering technique for credit card fraud detection.

We identified several weaknesses in the construction of these features that motivated our work. First, descriptive statistics provide an aggregated view over a set of transactions. Such aggregated features do not consider fine-grained temporal dependencies between the transactions. For example, a common fraud pattern starts with low amount transactions for testing the card, followed by a high amount transaction to empty the account. Second, aggregated features are usually calculated over transactions occurring in a fixed time window (e.g. 24 h). In general, transactions from very different card holders do not follow such a time pattern, and the number of transactions made during such a time period can vary a lot for different card-holders. Fixed-size aggregated statistics can't account for that fact. Third, these features consider only the history of the card-holder and do not exploit information of fraudulent transactions for feature engineering. However, a sequence of transactions happening at a fixed terminal can also contain valuable patterns for fraud detection.

In our work we propose to generate history-based features using Hidden Markov Models (HMM).
They quantify the similarity between an observed sequence and the sequences of past fraudulent or genuine transactions observed for the card-holders or the terminals.

We have chosen these three perspectives based on the following assumptions. The features made with only genuine historical transactions should model the common behavior of card holders. This is a classic anomaly detection scheme where the likelihood of a new sequence of transactions is measured against historical honest sequences of transactions Chandola et al. (2012). The other four features are made with sequences of transactions with at least one fraudulent transaction. The rationale is that to have a risk of fraud, it is not enough for a new sequence to be far from the usual transaction behavior; it is also expected to be relatively close to a risky behavior. Thus, these last features should decrease the number of false positives. As stated by Pozzolo et al. (2017), this is a crucial issue since the investigators can only verify a limited number of alerts each day. The second perspective allows the model to take the point of view of the card-holder and of the merchant, which are the two actors involved in credit card transactions. The last perspective takes into account two important features for credit card fraud detection: the amount of a transaction and the elapsed time between two transactions. These features are strong indicators for fraud detection.

By using a real-world credit card transactions dataset provided and labeled by a large European card processing company, we want to assess how much our contributions, namely the addition of the terminal perspectives (among others) in the construction of history features and the construction of multiple perspectives HMM-based features, improve fraud detection. To quantify the impact of the addition of the HMM-based features, we use the Precision-Recall AUC metric.
We fed Random Forest classifiers with transactional data including the state-of-the-art transaction aggregation strategy and measured the increase in PR-AUC due to the HMM-based feature engineering technique we propose. The multiple perspectives property of our HMM-based feature engineering allows for the incorporation of a broad spectrum of sequential information that leads to a significant increase in the detection of fraudulent transactions.

This paper is an extension of a poster Lucas et al. (2019). As in Lucas et al. (2019), we present the concept of multi-perspective HMM-based feature engineering. Moreover, we extend it significantly with additional experiments and evaluations. First of all, the framework is shown to increase the detection of fraudulent transactions not only for e-commerce transactions but also for face-to-face transactions. This result wasn't necessarily a foregone conclusion since e-commerce and face-to-face transactions present very different properties (e.g. merchants not open at night, necessity of a PIN authentication for face-to-face transactions...). Previous work done on the same dataset Jurgovsky et al. (2018) showed that some increase in detection observed on one type of transactions couldn't be extended to all types of transactions. Then, the feature engineering strategy is shown to be relevant for various types of classifiers (random forest, logistic regression and Adaboost) and robust to the hyperparameter choices made for constructing the features: the number of hidden states and the length of the sequence have no strong effect on the quality of the fraud detection. Lastly, the framework suffered from a structural missing value limitation: for some users with few transactions, the HMM-based features couldn't be calculated.
From 20% to 40% of the transactions (depending on the choice of the length of the sequences modeled by the HMM) are not associated with HMM-based features. After comparing several solutions to overcome this limitation, we were able to obtain a steady improvement of the detection over all the transactions of the dataset.

In this paper, we present the state-of-the-art approaches to the problems of credit card fraud detection and sequence classification in section 1. Afterwards, we show how the HMM-based features improve on the limitations of the state-of-the-art techniques, and the way they are created, in section 2. The experimental protocol is described in section 3. In sections 4.1 and 4.2, we present the experimental results obtained on face-to-face and e-commerce transactions with different classifiers. Section 4.3 is dedicated to proving the robustness of the method through a hyperparameter study. Finally, in section 5 we compare different solutions to tackle the issue of structural missing values.

The HMM-based features we propose present interesting assets in the context of credit card fraud detection and, more generally, anomaly detection. This work opens perspectives for feature engineering in any supervised task with sequential data. In order to ensure reproducibility, the source code of the proposed framework can be found at https://gitlab.com/Yvan_Lucas/hmm-ccfd .

A wide range of machine learning approaches have been used in credit card fraud detection. (Maes et al., 2002) evaluated Artificial Neural Networks and Bayesian belief networks with ROC AUC on Europay International's dataset. (Bhattacharyya et al., 2011) compared Support Vector Machine, Random Forest and logistic regression on a real-world dataset using a wide variety of metrics. (Bahnsen et al., 2013) adjusted Bayes Minimum Risk using real financial costs in order to adapt the predictions of Random Forest and Linear Regression classifiers.
(Pozzolo et al., 2014) tested an architecture to take into account temporal concept drift in the credit card transaction data stream with Random Forest, Support Vector Machine and Neural Network. (Mahmoudi and Duman, 2015) applied a modified Fisher discriminant function to take into account the higher false negative cost in credit card fraud detection. More recently, (Jurgovsky et al., 2018) used LSTM for sequence classification on the same real-world dataset that we use in this article. They showed that, in the case of face-to-face transactions only, sequence modelling with Long Short Term Memory networks (LSTM) improves fraud detection when compared to Random Forest with aggregated features.

Since Random Forests have been shown to perform well for credit card fraud detection in the literature (Bhattacharyya et al., 2011) and in preliminary experiments we have done, we chose to consider them in this work for evaluating the impact of our proposed features on the prediction quality. Moreover, Random Forests offer the possibility to calculate the importance of a feature, which is defined as the decrease of Gini impurity through a node weighted by the proportion of elements of the dataset passing through this node (Breiman et al., 1984). This property is interesting for studying the impact of a feature engineering strategy.

Feature engineering is critical for credit card fraud detection. Some authors used only raw features in order to detect fraudulent transactions ((Mahmoudi and Duman, 2015), (Minegishi and Niimi, 2011)). (Bolton and Hand, 2001) showed the necessity of using attributes describing the history of the transactions for unsupervised credit card fraud detection (peer group analysis). Lately, Saia and Carta (2019) used Fourier and wavelet transforms in order to move the transactions to a new domain before applying a machine learning algorithm. This allows outliers to be raised based on a different view of the dataset (a frequential view).
This is related to our approach since our approach consists in creating likelihood scores for a variety of views on the dataset (sequential views).

(Whitrow et al., 2008) proposed a transaction aggregation strategy to create descriptive features containing information about the past behaviour of the card-holder over a certain period of time. These descriptive features can be, for example, the number of transactions or the total amount spent by the card-holder in the past 24 h with the same merchant category or country. They showed a 28% increase in the detection of fraudulent transactions by using these aggregated features with a Random Forest as the learning algorithm. (Jha et al., 2012) applied Whitrow's transaction aggregation strategy to logistic regression on a real-world credit card transactions dataset. (Krivko, 2010) presented a rule-based approach using the difference between the recent amount spent by the card holder and either the average amount spent by this card holder or the average amount spent by all the card holders. (Bahnsen et al., 2016) showed that adding periodic features based on the time of the transaction to Whitrow's aggregated features increases the savings by an average of 13% with Random Forest, Logistic Regression and Bayes Minimum Risk models.

Another feature engineering strategy has been proposed by (Vlasselaer et al., 2015). It consists in using the numbers of transactions between card holders and terminals in the graph of the transactions in order to create a time-dependent suspiciousness score. They showed a 3.4% increase of the ROC AUC by using these network-based features together with Whitrow's aggregated features and Random Forests.

In this work, similar to other contributors, we consider Whitrow's card-holder centric aggregated features as the state-of-the-art baseline for comparison with the HMM-based features we propose.
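For concreteness, the transaction aggregation strategy discussed above can be sketched as follows. This is a toy reconstruction, not the authors' code: the function name, the `(timestamp, amount)` history format and the feature names are ours.

```python
from datetime import datetime, timedelta

def aggregate_features(history, tx_time, window_hours=24):
    """Whitrow-style aggregates for one card-holder: number and total amount
    of their transactions in the `window_hours` preceding the current one.
    `history` is a list of (timestamp, amount) pairs (illustrative format)."""
    start = tx_time - timedelta(hours=window_hours)
    recent = [amount for ts, amount in history if start <= ts < tx_time]
    return {"n_tx_24h": len(recent), "amount_24h": sum(recent)}

# Card-holder history, then the aggregates computed for a new transaction
# on 2015-03-02 08:00 (dates chosen inside the dataset's range):
history = [(datetime(2015, 3, 1, 10), 20.0),
           (datetime(2015, 3, 1, 22), 35.0),
           (datetime(2015, 2, 27, 9), 100.0)]
features = aggregate_features(history, datetime(2015, 3, 2, 8))
```

The same computation restricted to transactions sharing the current merchant category or country yields the per-category and per-country variants mentioned above.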
Sequence classification is one of the main machine-learning research fields. It aims to consider the sequential properties of the data at the algorithmic level in order to improve the classification of sequential data. Dietterich (2002) reviewed sequential classification based on sliding windows (or recurrent sliding windows). However, sliding window methods don't take into account inner dependencies between consecutive events.

Srivastava et al. (2008) tried to overcome this limitation by using generative models such as Hidden Markov Models (HMMs) for credit card fraud detection. They motivated the choice of HMMs by relating the hidden states to the different types of purchase. For this purpose, they created an artificial credit card transactions dataset. In their multinomial HMMs, the transactions were characterized with a symbol ('big amount', 'medium amount', 'small amount') used as the observed variable. After training, the likelihood that the sequence of recent transactions was generated by the HMMs is computed. The decision is taken by comparing the likelihood to a threshold value.

Graves (2012) observed that Long Short Term Memory networks are better than other sequential algorithms such as HMMs for speech recognition and handwriting recognition tasks since they allow learning long-term dependencies in sequences. Wiese and Omlin (2009) compared feed-forward neural networks combined with Support Vector Machines and LSTM for credit card fraud detection and showed that LSTMs are relevant in the credit card fraud detection context because they can model time series of different lengths for each card holder. However, Jurgovsky et al. (2018) showed recently, on the real-world dataset that we use, that LSTM offers a small improvement over Random Forest in the case of face-to-face transactions. There is no improvement for e-commerce transactions.
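The symbol extraction used in such multinomial HMMs can be sketched by discretizing amounts into 'small' / 'medium' / 'large' observed symbols. A minimal sketch; deriving the thresholds from training-set quantiles is our assumption, not a prescription of the cited paper.

```python
import numpy as np

def amount_to_symbol(amounts, edges):
    """Map transaction amounts to discrete symbols:
    0 = 'small', 1 = 'medium', 2 = 'large'."""
    return np.digitize(amounts, edges)

# Thresholds from training-set amount quantiles (toy training amounts):
train_amounts = [5.0, 12.0, 20.0, 45.0, 80.0, 300.0]
edges = np.quantile(train_amounts, [1 / 3, 2 / 3])
symbols = amount_to_symbol([7.5, 30.0, 500.0], edges)
```

The resulting integer symbols are what the multinomial HMM observes; the same binning applies to time-deltas between consecutive transactions.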
The state-of-the-art feature engineering techniques for credit card fraud detection create descriptive features using the history of the card-holder (such as: "amount spent by the card-holder in shops from a given country in the last 24h"; Whitrow et al. (2008), Bahnsen et al. (2016)). These descriptive features present several limits we aim to overcome. First, they do not take into account the history of the seller even though it is clearly identified in most credit card transaction datasets. Moreover, these descriptive features do not consider dependencies between transactions of a same sequence. Therefore we use Hidden Markov Models, which are generative probabilistic models and a common choice for sequence modelling Rabiner and Juang (1991). Finally, the choice of the descriptive features created using the transaction aggregation strategy (Whitrow et al. (2008), Bahnsen et al. (2016)) is guided by expert knowledge. In order not to depend on expert knowledge, we favor automated feature engineering in a supervised context.

In addition to the descriptive aggregated features created by Whitrow et al. (2008), we propose to create eight new HMM-based features. They quantify the similarity between the history of a transaction and eight distributions learned previously on sets of sequences selected in a supervised way in order to model different perspectives.

We model the sequence of transactions from the combinations of three binary perspectives (genuine / fraudulent, card-holder / merchant, amount / timing) and therefore learn eight (2^3) different HMMs. In the end, the set of 8 HMM-based features will provide information about the genuineness and the fraudulence of both terminal and card holder histories.

In particular, we select three perspectives for modelling a sequence of transactions (see figure 1).
A sequence (i) can be made only of genuine historical transactions or can include at least one fraudulent transaction in the history, (ii) can come from a fixed card-holder or from a fixed terminal, and (iii) can consist of amount values or of time-delta values (i.e. the difference in time between the current transaction and the previous one). We optimized the parameters of eight HMMs using all eight possible combinations (i-iii).

To learn the HMM parameters on observed data, we create 4 datasets:

• sequences of transactions from genuine credit cards (without fraudulent transactions in their history),

• sequences of transactions from compromised credit cards (with at least one fraudulent transaction),

• sequences of transactions from genuine terminals (without fraudulent transactions in their history),

• sequences of transactions from compromised terminals (with at least one fraudulent transaction).

Figure 1: Supervised selection of sequences for the training sets of the multiple perspectives Hidden Markov Models (CH = Card-holder, TM = Terminal)

We then extract from these sequences of transactions the symbols that will be the observed variable for the HMMs. In our experiments, the observed variable can be either:

• the amount of a transaction,

• the amount of time elapsed between two consecutive transactions of a card-holder (time-delta).

The main hypothesis of Hidden Markov Models is that behind the observed distribution there is a simpler latent model that rules the sequential distribution. Hypothetically, the complexity of the observed distribution comes partly from additive Gaussian noise corrupting both the state evolution process and the emission process. Hidden Markov Models allow for a compression of the observed sequence in order to generalize the observed behaviour into an abstracted latent behaviour. An HMM comprises two processes represented by matrices (see figure 2):

• The transition matrix describes the evolution of the hidden states. Each row i of the transition matrix is a multinomial distribution of the next state given that the current state is i. The hidden states obey the Markov property (i.e. given the present, the future does not depend on the past).

• The emission matrix describes the conditional distribution of the observed variables given the current hidden state.
Usually, the distribution is considered multinomial for categorical observed variables or Gaussian for continuous observed variables.

Figure 2: Hidden Markov model architecture.

The transition and emission conditional probability matrices of the HMMs are optimized by an iterative Expectation-Maximisation algorithm known as the Baum-Welch algorithm (Baum (1972), Rabiner and Juang (1991)). EM optimization of a model with latent (hidden) parameters consists in (for parameters initialized with a value):
Expectation:
Find the latent state distributions that best correspond to the sequences of observed data. This is done with the forward-backward algorithm, which recursively leverages the Markov property in order to simplify the calculation of the conditional probabilities of observing a sequence of events given the parameters of the transition and emission matrices (Viterbi (1967)).
Maximisation:
Maximise the correspondence between the latent distributions inferred during the expectation step and the parameters of the transition and emission matrices by adjusting the parameters.

The expectation and maximisation steps are repeated until convergence of the graphical model to the observed data. The convergence can be monitored by observing the increase of the likelihood that the set of observed sequences has been generated by the model. This likelihood increases over the iterations until it reaches a ceiling, when the hyperparameters ruling the architecture of the generative model don't allow it to fit the set of observed sequences more closely.

In the end, we obtain 8 trained HMMs modeling 4 types of behaviour (genuine terminal behaviour, fraudulent terminal behaviour, genuine card-holder behaviour and fraudulent card-holder behaviour) for both observed variables (amount and time-delta).
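The EM loop described above can be sketched in numpy for a single discrete-symbol sequence. This is a minimal illustration only, not the paper's implementation (which trains on many sequences, and would use Gaussian emissions for continuous observations); all names are ours. The log-likelihood recorded at each iteration is non-decreasing, which is how convergence is monitored.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=15, seed=0):
    """EM (Baum-Welch) for a discrete HMM on one symbol sequence.
    Returns (pi, A, B, log_likelihoods), one log-likelihood per iteration."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    lls = []
    for _ in range(n_iter):
        # E-step: scaled forward pass (scale[t] keeps alpha from underflowing)
        alpha = np.zeros((T, n_states)); scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        # backward pass with the same scaling constants
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        # state posteriors; per-step normalisation cancels the scaling
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        emit_next = B[:, obs[1:]].T * beta[1:]                       # (T-1, K)
        xi = alpha[:-1, :, None] * A[None, :, :] * emit_next[:, None, :]
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M-step: re-estimate pi, A, B from the posteriors
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.array([gamma[obs == m].sum(axis=0) for m in range(n_symbols)]).T
        B /= B.sum(axis=1, keepdims=True)
        lls.append(np.log(scale).sum())  # log P(obs | params before this update)
    return pi, A, B, lls
```

Monitoring `lls` reproduces the behaviour described above: it rises and then plateaus once the chosen number of hidden states cannot fit the data more closely.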
Algorithm 1 Online: calculate likelihood of sequences of observed events

for tx_i in all transactions do
  for perspective j in perspectives combinations do
    [tx_i, tx_{i-1}, tx_{i-2}] <- user_sequence_{i,j}
    HMM_j <- HMM(user_type, signal_type, sequence_type)
    AnomalyScore_{i,j} <- log(P([tx_i, tx_{i-1}, tx_{i-2}] | HMM_j))
  end for
end for

The HMM-based features proposed in this paper (table 1) are the likelihoods that a sequence made of the current transaction and the two previous ones from this terminal/card holder is generated by each of these models. In order to calculate their value, the most probable sequence of hidden states for each observed sequence has to be computed. This is usually done with the help of the Viterbi algorithm Viterbi (1967), which also leverages the Markov property in order to simplify the calculation of the conditional probabilities of observing a sequence of hidden states given the parameters of the HMM (initial probabilities and transition and emission matrices) and an observed sequence of events.

User          Feature   Genuine   Fraudulent
Card Holder   Amount    HMM1      HMM5
Card Holder   Tdelta    HMM2      HMM6
Terminal      Amount    HMM3      HMM7
Terminal      Tdelta    HMM4      HMM8

Table 1: Set of 8 HMM-based features describing 8 combinations of perspectives
We used a credit card transactions dataset provided by our industrial partner in order to quantify the increase in detection when adding HMM-based features. This dataset contains the anonymized transactions of all the Belgian credit cards between 01.03.2015 and 31.05.2015. The classification task is to predict the class of the transactions (genuine or fraudulent).

Transactions are represented by vectors of continuous, categorical and binary features that characterize the card-holder, the transaction and the terminal. The card-holder is characterized by a unique card-holder ID, their age, their gender, etc. The transaction is characterized by variables like the date-time, the amount and other confidential features. The terminal is characterized by a unique terminal ID, a merchant category code and a country.

In addition to the work already presented as a poster Lucas et al. (2019), we studied the impact of the proposed multiple perspectives HMM-based feature engineering strategy for face-to-face transactions, which present very different properties from e-commerce transactions: the merchant is usually closed at night, the necessity of a PIN authentication decreases significantly the number of fraudulent transactions Ali et al. (2019), etc. For comparison, Jurgovsky et al. (2018) have shown on the same Belgian transactions dataset that some machine learning approaches (LSTM) gave better results on face-to-face transactions than on e-commerce transactions.
In order for the HMM-based features and the aggregated features to be comparable, we calculate terminal-centered aggregated features in addition to (Whitrow et al., 2008) card-holder centered aggregated features (table 2).

Table 2: Aggregated features and their signification (AGGCH1, ...; table body not recoverable from the extraction).
We train Random Forest classifiers using different feature sets in order to compare the efficiency of prediction when we add HMM-based features to the classification task for the face-to-face and e-commerce transactions. The difference in terms of raw AUC between the e-commerce and the face-to-face transactions is due to the difference in imbalance between these datasets: the imbalance is 17 times stronger for the face-to-face transactions than for the e-commerce transactions. For face-to-face, there are 0.2 frauds per 1000 transactions whereas for e-commerce there are 3.7 frauds per 1000 transactions.

Figure 3: Precision-recall curves per feature set for the e-commerce transactions (raw, raw+HMM, raw+aggCH, raw+aggCH+HMM, raw+all_agg, raw+all_agg+HMM). Each color corresponds to a specific feature set; the line style corresponds to the presence or not of HMM-based features. The addition of HMM-based features (bold lines) to each feature set considered, even the most informative ones, allows for an increase in the detection of fraudulent transactions when compared to the same prediction without HMM-based features (thin lines). (The accompanying PR-AUC table is mostly not recoverable; raw = 0.212.)

We tested the addition of our HMM-based features to several feature sets. We refer to the feature set raw+aggCH as the state-of-the-art feature engineering strategy since it contains all the raw features with the addition of Whitrow's aggregated features Whitrow et al. (2008). The feature groups we refer to are: the raw features (raw), the features based on the aggregation of card-holders' transactions (aggCH), the features based on the aggregation of terminal transactions (aggTM), and the proposed HMM-based features (HMM features).

In this section, the HMMs were created with 5 hidden states and the HMM-based features were calculated with a window size of 3 (actual transaction + 2 past transactions of the card-holder and of the terminal).
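The Precision-Recall AUC used for evaluation can be computed as average precision. A minimal numpy sketch (ignoring tie handling), not the evaluation code used in the paper:

```python
import numpy as np

def average_precision(y_true, scores):
    """PR-AUC as average precision: the mean of the precision obtained at
    the rank of each fraudulent (positive) example, with examples ranked
    by decreasing classifier score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return precision[y == 1].mean()

ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike ROC AUC, this metric degrades sharply when false alarms outrank the rare frauds, which is why it suits the strongly imbalanced datasets described above.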
We show in section 4.3 that the HMM hyperparameters (number of hidden states and size of the window considered for the calculation of HMM-based features) do not change significantly the increase in Precision-Recall AUC.

Figure 4: Precision-recall curves per feature set for the face-to-face transactions. (The accompanying PR-AUC table is mostly not recoverable; raw = 0.082.)

Table 3: Random Forest grid search (hyperparameter values not recoverable from the extraction).

Figures 3 and 4 show the precision-recall curves and their AUC obtained by testing the efficiency of Random Forests trained with several feature sets on the transactions of the testing set. The AUC numbers correspond to the average ± standard deviation over 3 different runs. The AUCs are stable over the different runs and the standard deviation numbers are low.

We can observe that the face-to-face results (figure 4) and the e-commerce results (figure 3) both show a significant improvement in precision-recall AUC when adding sequence descriptors such as the proposed HMM-based features or Whitrow's aggregated features to the raw feature set. For the e-commerce transactions, this improvement ranges from 3.6% for the best feature set without HMM-based features (raw+all-agg) to 40.5% for the worst feature set (raw). For the face-to-face transactions, this improvement ranges from 6.0% for the best feature set without HMM-based features (raw+all-agg) to 85.4% for the worst feature set (raw).

By comparing the AUC of the curves raw+aggCH and raw+aggCH+HMM, we observe that adding HMM-based features to the state-of-the-art feature engineering strategy introduced in the work of Whitrow et al.
(2008) leads to an increase of 18.1% of the PR-AUC for the face-to-face transactions and to an increase of 9.3% of the PR-AUC for the e-commerce transactions. The relative increase in PR-AUC when adding terminal-centered aggregated features to the feature set is 16.0% for the face-to-face dataset and 11.7% for the e-commerce transactions.

Overall, we can observe that the addition of features that describe the sequence of transactions, be it HMM-based features or Whitrow's aggregated features, greatly increases the Precision-Recall AUC on both e-commerce and face-to-face transactions. The addition of HMM-based features improves the prediction on both types of transactions and allows the classifiers to reach the best levels of accuracy on both.

We have shown in section 4.1 that the addition of HMM-based features to the existing transaction aggregation strategy allows for a consistent and significant increase in the precision-recall AUC of Random Forest classifiers. In order to make sure that this increase is stable over different types of classifiers, we did the same experiments with Adaboost and logistic regression classifiers.

We tuned the Adaboost and logistic regression hyperparameters (tables 4 and 5) through a grid search that optimizes the Precision-Recall Area under the Curve on the validation set.

Table 4: Adaboost grid search (tree numbers, learning rate, tolerance for stopping, max tree depth; grid values not recoverable from the extraction).

Table 5: Logistic regression grid search (C parameter, penalty, tolerance for stopping; grid values not recoverable from the extraction).

(E-commerce results table: PR-AUC per feature set with and without HMM-based features for the different classifiers; values mostly not recoverable, raw = 0.015.)

By comparing the AUCs reported in tables 5 and 6, we can conclude that the improvement observed when integrating the proposed multiple perspectives HMM-based feature engineering in addition to the state-of-the-art transaction aggregation strategy is significant and reliable over different classifiers and datasets. The relative improvement is bigger for weaker classifiers.

In order to understand if the feature engineering strategy is sensitive to the hyperparameters used for the construction of the HMM-based features, we constructed 9 sets of HMM-based features with different combinations of the HMM-based features hyperparameters. The hyperparameters considered are the number of hidden states in the HMMs and the size of the sequence of past transactions used for the calculation of the likelihood score.

We measure the AUC obtained on the test set when adding different sets of HMM-based features, obtained with different combinations of hyperparameters, to the raw feature set.

(Hyperparameter tables: PR-AUC for window sizes {3, 5, 7} × hidden states {3, 5, 7} on the e-commerce and face-to-face transactions; most values not recoverable.) The combination {window size: 3, hidden states: 5} gives the best AUCs on average over 3 runs. However, the standard deviation values are too high to confidently say that one hyperparameter choice is significantly better than the others.

(Table: share of transactions with sufficient history for all transactions, History >= 3 and History >= 7, on the e-commerce and face-to-face datasets; values not recoverable. History >= 3 means that all the transactions have at least 2 transactions in the past for the card-holder and for the terminal: we can build sequences of 3 transactions for both perspectives.)

With the HMM feature engineering framework, we can calculate sets of HMM features for different window sizes. However, when the transaction history is not big enough, we can't calculate the HMM-based features for it. Because of this limitation we had to dismiss about 20% of the transactions for the experiments described in section 4.1 and around 40% of the transactions for the experiments of the hyperparameter section 4.3.
There is a strong need to tackle the issue of structural missing values caused by the length of users' sequences, in order to integrate transactions with a short history, for which part or all of the HMM-based features could not be calculated.

In this section we consider 9 sets of HMM-based features, obtained with window sizes of 3, 5 and 7 for the card-holder and for the terminal. We therefore have 16 sets of transactions with different history constraints (terminal history: [0, 3, 5, 7] × card-holder history: [0, 3, 5, 7]).

We consider 3 missing-value strategies:

default0: a genuine default-value solution, where the HMM-based features that could not be calculated for the current transaction are replaced by 0.

weighted PR: a weighted sum of the predictions of the Random Forests specialised on the history constraints satisfied by the transaction, among 16 Random Forests each trained on one of the 16 possible history constraints. For example, for a transaction with a terminal history of 6 and a card-holder history of 3, we sum the predictions of the Random Forests [0, 0], [0, 3], [3, 0], [3, 3], [5, 0] and [5, 3]. Each Random Forest is weighted by its efficiency on the validation set; we used the PR-AUC values as weights.

stacked RF: a stacking approach, where a Random Forest classifier is trained on the predictions of the 16 Random Forests specialised on the constraints (with 0 for a missing prediction, when the transaction does not satisfy the constraints of the considered Random Forest). This approach has the additional benefit that we may stack individual classifiers, thereby creating a more accurate new one.

Other approaches to handling missing values advise generating them by modelling the distribution of the corresponding features. We considered that these solutions do not apply in our case, since the missing values appear because there was not enough historical information to calculate the corresponding feature.
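The weighted PR combination can be sketched as follows. The normalisation by the sum of the weights and all scores here are assumptions for illustration: the text specifies PR-AUC weights but not the exact normalisation, and the per-forest scores are synthetic.

```python
import numpy as np

# The 16 history-constraint pairs: (terminal history, card-holder history).
CONSTRAINTS = [(t, c) for t in (0, 3, 5, 7) for c in (0, 3, 5, 7)]

def weighted_pr_prediction(term_hist, card_hist, rf_scores, val_pr_auc):
    """Combine the specialised Random Forests whose constraints the
    transaction satisfies, weighting each by its validation PR-AUC.
    `rf_scores[(t, c)]` is the fraud score of the forest trained under
    constraint (t, c); `val_pr_auc[(t, c)]` its validation PR-AUC."""
    eligible = [k for k in CONSTRAINTS
                if k[0] <= term_hist and k[1] <= card_hist]
    w = np.array([val_pr_auc[k] for k in eligible])
    s = np.array([rf_scores[k] for k in eligible])
    return float(np.dot(w, s) / w.sum())   # PR-AUC-weighted average

# Hypothetical scores and weights for one transaction with terminal
# history 6 and card-holder history 3 (6 eligible forests, as in the
# example above).
rng = np.random.default_rng(0)
rf_scores = {k: float(rng.random()) for k in CONSTRAINTS}
val_pr_auc = {k: 0.2 + 0.4 * rng.random() for k in CONSTRAINTS}
print(weighted_pr_prediction(6, 3, rf_scores, val_pr_auc))
```

For the (6, 3) transaction the eligible forests are exactly the six listed in the example: [0, 0], [0, 3], [3, 0], [3, 3], [5, 0] and [5, 3].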
We thought that replacing the value of a model-based feature with an artificially generated value would be misleading when no underlying sequence exists. (In the history constraints below, the first number describes the terminal history constraint and the second number the card-holder history constraint.)

Table: PR-AUC obtained on the e-commerce and face-to-face test sets for the raw feature set (no HMM-based features) and for each of the three missing-value strategies (default0, weighted PR, stacked RF).
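The stacked RF strategy can be sketched as follows, assuming scikit-learn and synthetic level-1 scores (all names, sizes and data are illustrative): the validation-set scores of the 16 specialised forests become the input features of a level-2 Random Forest, with 0 standing in for structurally missing predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_val, n_base = 1000, 16   # validation transactions, specialised forests

# Synthetic level-1 scores; ~30% are structurally missing and set to 0,
# mimicking transactions that do not satisfy a forest's history constraint.
level1 = rng.random((n_val, n_base))
level1[rng.random((n_val, n_base)) < 0.3] = 0.0
y_val = (rng.random(n_val) < 0.1).astype(int)   # ~10% "fraud" labels

# Level-2 model trained on the level-1 scores.
stacker = RandomForestClassifier(n_estimators=50, random_state=0)
stacker.fit(level1, y_val)

new_tx = rng.random((1, n_base))                # scores for one new transaction
print(stacker.predict_proba(new_tx)[0, 1])      # stacked fraud probability
```

As noted above, the level-2 model could just as well stack heterogeneous classifiers, since it only consumes their scores.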
The genuine default-value solution (default0) allows for the best PR-AUCs with the best stability for both face-to-face and e-commerce transactions. It is also by far the fastest method, since it needs to train only one Random Forest instead of 16 for the weighted PR approach (respectively 17 for the stacked RF approach). The weighted PR solution does not yield satisfying results. The stacked RF solution, however, gives good PR-AUCs and presents interesting properties for combining different types of classifiers.

Finding satisfying solutions to integrate transactions with structural missing values drastically increases the range of application of the proposed framework (see table 8). Moreover, it adds another perspective to the framework: HMM-based features calculated for a small (resp. big) window size characterise the short (resp. long) term history.

Conclusion
In this work, we propose an HMM-based feature engineering strategy that allows us to incorporate sequential knowledge about the transactions in the form of HMM-based features. These HMM-based features enable a non-sequential classifier (Random Forest) to use sequential information for the classification.

The multiple-perspective property of our HMM-based automated feature engineering strategy gives us the possibility to incorporate a broad spectrum of sequential information. In fact, we model the genuine and fraudulent behaviours of the merchants and the card-holders according to two features: the timing and the amount of the transactions. Moreover, the HMM-based features are created in a supervised way and therefore lower the need for expert knowledge in the creation of the fraud detection system. The terminal perspective is usually not used in credit card fraud detection and is shown in this paper to greatly help the detection for face-to-face and e-commerce transactions.

This extension to Lucas et al. (2019) consolidates the claims already made with additional experiments and evaluations. More precisely:

• The feature engineering strategy is shown to perform well for e-commerce and face-to-face credit card fraud detection: the results show an increase in the precision-recall AUC of 18.1% for the face-to-face transactions and 9.3% for the e-commerce ones.

• The feature engineering strategy is shown to be relevant for various types of classifiers (random forest, logistic regression and Adaboost) and robust to the hyperparameter choices made when constructing the features.

• The structural missing values limitation of the framework is examined and several solutions are benchmarked.

The HMM-based feature engineering strategy is a powerful tool that presents interesting properties for fraud detection.
We can imagine building similar HMM-based features in any supervised task that involves a sequential dataset.

To ensure reproducibility, the source code for calculating and evaluating the proposed HMM-based features can be found at https://gitlab.com/Yvan_Lucas/hmm-ccfd.

As future work, it would be interesting to combine the predictions of an LSTM with the predictions of a Random Forest enhanced with HMM-based features, since these classifiers have been shown not to detect the same frauds in face-to-face transactions (Jurgovsky et al., 2018).
Acknowledgements

The work has been funded partially by the Bavarian Ministry of Economic Affairs, Regional Development and Energy in the project "Internetkompetenzzentrum Ostbayern".

References