Anomaly Detection Model for Imbalanced Datasets
Régis Houssou, Stephan Robert-Nicoud

Abstract
This paper proposes a method to detect bank frauds using a mixed approach that combines a stochastic intensity model with the probability of fraud observed on transactions. It is a dynamic unsupervised approach which is able to predict financial frauds. The fraud prediction probability on a financial transaction is derived as a function of the dynamic intensities. In this context, the Kalman filter method is proposed to estimate the dynamic intensities. The application of our methodology to financial datasets shows a better predictive power on highly imbalanced data compared to other intensity-based models.
1. Introduction
Financial fraud is growing exponentially, especially because of the large sums involved. It is an issue with wide consequences for both the finance industry and daily life. Fraud can reduce confidence in industry, destabilise economies, and affect people's cost of living. As a first step, banks and financial institutions have approached the detection of fraud with traditional manual techniques such as auditing, which are inefficient and unreliable given the complexity of the problem. The problem therefore demands the attention of communities such as machine learning and data science, where the solution can be automated and fraud detection can evolve towards methods that use adaptive rules to tighten the mesh of the net.

Machine learning models work with many parameters and are much more efficient at finding subtle correlations in the data, which can be missed by an expert system or by human judgement (Dyzma, 2018). The large volume of transactional data and client data readily available in the financial services industry makes it an ideal setting for the application of complex machine learning algorithms. In addition to learning from known models, machine learning can go further and learn new models without human intervention. This allows models to adapt over time to discover previously unknown patterns or to identify new tactics that can be used by fraudsters. Conventional machine learning algorithms, however, were developed for problems in which the distribution of the data is generally balanced, whereas financial fraud data are not balanced.

School of Management and Engineering Vaud (HEIG-VD), University of Applied Sciences and Arts of Western Switzerland (HES-SO), CH-1400 Yverdon-les-Bains. Correspondence to: Régis Houssou ([email protected]). Preliminary work. Under review. Copyright 2020 by the author(s).
Most standard classifiers, such as decision trees and neural networks, assume that training samples are evenly distributed among the classes. However, in many real-world applications the ratio of the minority class is very small (1:100, 1:1000, or even beyond 1:10000). Because minority-class samples are so scarce, they tend to be misclassified, and the decision boundary ends up far from the correct one. Numerous works in machine learning have been proposed to address the problem of data imbalance: (He & Garcia, 2009), (Galar et al., 2012), (Krawczyk, 2016), (Elrahman & Abraham, 2013), etc. However, most of these algorithms suffer from limitations in real-world applications, such as loss of useful information, classification cost, excessive running time, and the need for adjustments; see (Elrahman & Abraham, 2013).

(Houssou et al., 2019) investigated the problem of fraud detection in imbalanced data using the Poisson process. They defined the fraud times as the jump times of a Poisson process whose intensity describes the instantaneous rate of fraud. They showed how to estimate the intensity function in deterministic form and how to predict fraud events. The comparison of their methodology with other baseline approaches shows a better predictive power, especially on very imbalanced datasets. However, their approach suffers from some limitations:
- The reduced form of the model: fraud detection depends only on the intensity's parameters, and the model does not look into the subtle correlations in the data.
- The deterministic form of the intensity: the intensity is a function of time only, so it is predictable.
In addition, their model is a supervised approach, for which the lack of labelled data constitutes the main constraint in fraud detection.

In this paper, we address these issues by considering a stochastic process for the fraud intensity; in other words, the intensity is a function of time and, for a fixed time, it is a random variable. In contrast to (Houssou et al., 2019), the instantaneous rate of fraud is no longer predictable, which is more realistic. For calibration purposes, we also consider the posterior probabilities of fraud observed on each transaction; we suppose that these probabilities reflect the likelihood of fraud in the dataset and that they take into account the hidden correlations between the features. Our approach is a mixed approach combining the stochastic intensity with the probability of fraud observed on each transaction. For the intensity model, we focus on the Cox-Ingersoll-Ross (CIR) model, assuming that the fraud intensity is mean-reverting and always positive. Another main advantage of this process is that we can derive a closed-form solution for the prediction probability of fraud. As the intensity is an unobservable variable, we propose to estimate its values by the Kalman filter method, where the intensity is updated by the probability of fraud observed on each transaction. Finally, our model is an unsupervised approach in the sense that labelled data with examples of fraud are not needed for detecting fraud events.

A lot of research based on the Kalman filter has been done in finance, for example on interest rate models, volatility models, and the pricing of defaultable bonds; see (Babbs & Nowman, 1999), (Duan & Simonato, 1998), (Racicot & Theoret, 2010), (Vo, 2014), etc.

The rest of the paper is organized as follows. Section 2 focuses on fraud detection in the context of the stochastic intensity; the Cox-Ingersoll-Ross (CIR) intensity model is investigated, the prediction probability of fraud is derived, and the estimation process of the intensity is discussed. In Section 3, the model is applied to financial datasets and the results are presented. The dataset was provided by NetGuardians, a Swiss company which develops solutions for banks to proactively prevent fraud.
2. Fraud detection with stochastic intensity
Consider a financial institution such as a bank, an insurance company, a trading company, etc., and information about its clients. We are interested in the occurrence of fraud in client transactions for such an institution. The fraud event is defined as a rare event occurring at a random time and resulting in significant financial losses for the client and the financial institution. Let $(\Omega, \mathcal{F}, \mathbb{F}, P)$ be the filtered probability space, where $\Omega$ denotes the possible states of the world, $\mathcal{F}$ is the $\sigma$-algebra, $\mathbb{F} = (\mathcal{F}_t)_{t \ge 0}$ is the filtration, with $\mathcal{F}_t$ containing all information up to time $t$ and $\mathcal{F}_T = \mathcal{F}$, and $P$ is the probability measure describing the likelihood of events. We denote by $\lambda$ the intensity, which represents the expected number of fraud events per unit of time. As in (Houssou et al., 2019), we assume that $\lambda$ is a non-negative process. In addition, we consider that the intensity is stochastic and follows the Cox-Ingersoll-Ross (CIR) process

$$d\lambda_t = \kappa(\theta - \lambda_t)\,dt + \sigma\sqrt{\lambda_t}\,dB_t \tag{1}$$

where $\kappa$, $\theta$ and $\sigma$ are positive constants; $\kappa$ represents the rate of mean reversion, $\theta$ is the long-run average, $\sigma$ is the volatility of the intensity and $(B_t)$ is a Brownian motion under the probability $P$. The Cox-Ingersoll-Ross (CIR) model is one of the most popular and commonly used stochastic intensity models in both academic research and practical applications. The process was first developed in (Cox et al., 1985a) to model the term structure of interest rates; it is set up as a single-good, continuous-time economy with a single state variable. Multivariate versions were developed later by (Longstaff & Schwartz, 1992) and (Chen & Scott, 1993). When we impose the condition $2\kappa\theta > \sigma^2$, the intensity $\lambda$ is always positive; otherwise we can only guarantee that it is non-negative (with a positive probability of reaching zero).

NetGuardians: https://netguardians.ch
In fact, when the fraud intensity approaches 0, the volatility $\sigma\sqrt{\lambda_t}$ approaches 0, cancelling the effect of the randomness, so the intensity always remains non-negative. Figure 1 shows simulations of the stochastic fraud intensity following the CIR model with various parameters. All simulations generate dynamic non-negative intensities which tend to move around the long-run mean $\theta$.

Figure 1. Simulation of stochastic intensities with the CIR process, panels (i) to (iv), each plotting the intensity. (i) $\lambda_0 = 0.…$, $\kappa = 0.…$, $\theta = 0.…$, $\sigma = 0.…$. (ii) as for (i) but $\theta = 0.…$. (iii) as for (i) but $\kappa = 2$. (iv) as for (i) but $\sigma = 0.…$.

In (Jafari & Abbasian, 2017), it has been shown that

$$\lambda_t = e^{-\kappa t}\lambda_0 + \theta(1 - e^{-\kappa t}) + \sigma e^{-\kappa t}\int_0^t e^{\kappa s}\sqrt{\lambda_s}\,dB_s \tag{2}$$

with

$$E(\lambda_t) = e^{-\kappa t}\lambda_0 + \theta(1 - e^{-\kappa t}) \tag{3}$$

and

$$V(\lambda_t) = \frac{\sigma^2}{\kappa}\lambda_0\left(e^{-\kappa t} - e^{-2\kappa t}\right) + \frac{\theta\sigma^2}{2\kappa}\left(1 - e^{-\kappa t}\right)^2 \tag{4}$$

There is no general explicit solution for equation (2). However, its calibration is critical for obtaining meaningful results. One of the easiest methods is to discretize equation (2) and then use the data available over small time intervals to estimate the parameters. Let $\Delta t = \frac{T}{N+1}$ and $t_j = j \cdot \Delta t$ for $j = 0, \dots, N+1$. Equation (2) becomes

$$\lambda_{t_i} = e^{-\kappa\Delta t}\lambda_{t_{i-1}} + \theta(1 - e^{-\kappa\Delta t}) + \sigma e^{-\kappa t_i}\int_{t_{i-1}}^{t_i} e^{\kappa s}\sqrt{\lambda_s}\,dB_s \tag{5}$$

From equation (3),

$$E(\lambda_{t_i} \mid \lambda_{t_{i-1}}) = \mu_1 = e^{-\kappa\Delta t}\lambda_{t_{i-1}} + \theta(1 - e^{-\kappa\Delta t})$$

and from equation (4),

$$V(\lambda_{t_i} \mid \lambda_{t_{i-1}}) = \mu_2 = \frac{\theta\sigma^2}{2\kappa}\left(1 - e^{-\kappa\Delta t}\right)^2 + \frac{\sigma^2}{\kappa}\left(e^{-\kappa\Delta t} - e^{-2\kappa\Delta t}\right)\lambda_{t_{i-1}}$$

Equation (5) reduces to

$$\lambda_{t_i} = e^{-\kappa\Delta t}\lambda_{t_{i-1}} + \theta(1 - e^{-\kappa\Delta t}) + \epsilon_{t_i} \tag{6}$$

where $\epsilon_{t_i} = \sigma e^{-\kappa t_i}\int_{t_{i-1}}^{t_i} e^{\kappa s}\sqrt{\lambda_s}\,dB_s$ is an Itô integral with respect to the Brownian motion $(B_t)$. By the zero-mean property, $E(\epsilon_{t_i} \mid \epsilon_{t_{i-1}}) = 0$. From (6), $V(\epsilon_{t_i} \mid \epsilon_{t_{i-1}}) = V(\lambda_{t_i} \mid \lambda_{t_{i-1}}) = \mu_2$. Following (Cox et al., 1985a), $\lambda_{t_i}$ given $\lambda_{t_{i-1}}$ follows a non-central $\chi^2$ distribution with first two moments $\mu_1$ and $\mu_2$. From (Ball & Torous, 1996) and under the assumption of small time intervals, $\lambda_{t_i}$ given $\lambda_{t_{i-1}}$ can be reasonably approximated by a normal distribution with mean $\mu_1$ and variance $\mu_2$.
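The discretization and the normal approximation above can be illustrated with a short simulation. The following is only a minimal sketch in standard-library Python: the function names, the seed handling, and the flooring of each draw at zero are assumptions made for illustration, not part of the method described in the text.

```python
import math
import random


def cir_moments(lam_prev, kappa, theta, sigma, dt):
    """Conditional mean mu_1 and variance mu_2 of the intensity over one step."""
    beta = math.exp(-kappa * dt)
    mu1 = beta * lam_prev + theta * (1.0 - beta)
    mu2 = (theta * sigma**2 / (2.0 * kappa)) * (1.0 - beta) ** 2 \
        + (sigma**2 / kappa) * (beta - beta**2) * lam_prev
    return mu1, mu2


def simulate_cir(lam0, kappa, theta, sigma, dt, n_steps, seed=0):
    """Simulate a path with the Gaussian approximation of the CIR transition.

    Assumption (not in the text): each draw is floored at 0, because the
    Gaussian approximation can produce small negative values.
    """
    rng = random.Random(seed)
    path = [lam0]
    for _ in range(n_steps):
        mu1, mu2 = cir_moments(path[-1], kappa, theta, sigma, dt)
        draw = mu1 + math.sqrt(mu2) * rng.gauss(0.0, 1.0)
        path.append(max(draw, 0.0))
    return path
```

With `sigma = 0` the simulated path collapses onto the mean of equation (3), which gives a quick sanity check of the recursion.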
Then,

$$\lambda_{t_i} = e^{-\kappa\Delta t}\lambda_{t_{i-1}} + \theta(1 - e^{-\kappa\Delta t}) + \epsilon_{t_i}$$

with $\epsilon_{t_i} \sim N\left(0,\ \frac{\theta\sigma^2}{2\kappa}(1 - e^{-\kappa\Delta t})^2 + \frac{\sigma^2}{\kappa}(e^{-\kappa\Delta t} - e^{-2\kappa\Delta t})\lambda_{t_{i-1}}\right)$. Let us define $\alpha = \theta(1 - e^{-\kappa\Delta t})$, $\beta = e^{-\kappa\Delta t}$ and $\eta_{t_i} = \frac{\theta\sigma^2}{2\kappa}(1 - e^{-\kappa\Delta t})^2 + \frac{\sigma^2}{\kappa}(e^{-\kappa\Delta t} - e^{-2\kappa\Delta t})\lambda_{t_{i-1}}$. We can write

$$\lambda_{t_i} = \alpha + \beta\lambda_{t_{i-1}} + \epsilon_{t_i} \tag{7}$$

with $\epsilon_{t_i} \sim N(0, \eta_{t_i})$. According to equation (7), if we suppose that $V(\epsilon_t)$ is constant, the process $\{\lambda_{t_i}\}$ is a stationary AR(1) process.

2.2.1. Prediction of fraud with the CIR intensity
We suppose that all the background information on the financial institution's transactions, except for the times of the fraud events, is expressed by the filtration $\mathbb{G} = (\mathcal{G}_t)_{t \ge 0}$. For example, $\mathcal{G}_t$ can be generated by a d-dimensional driving process $X_t$ which includes information on transaction amounts, transaction dates, country of the receiving bank, client IDs, etc. Suppose further that there is a non-negative process $\lambda_t$, also adapted to $\mathbb{G}$, which plays the role of a stochastic intensity and is generally correlated with the various components of the driving process $X_t$. Next, assume that $\mathbb{H} = (\mathcal{H}_t)_{t \ge 0}$ is the filtration generated by the fraud indicator process $\mathbf{1}_{\{\tau \le t\}}$. The full filtration for the model is obtained as $\mathbb{F} = \mathbb{G} \vee \mathbb{H}$, where $\mathbb{F} = (\mathcal{F}_t)_{t \ge 0}$. Let $N = (N_t)_{t \ge 0}$, where $N_t = \sum_{n \ge 1} \mathbf{1}_{\{\tau_n \le t\}}$, be the counting process of the occurrence of fraud in a client's transactions. We say that $N = (N_t)_{t \ge 0}$ is a doubly-stochastic Poisson process, or Cox process, if, conditioned on the background information $\mathcal{G}_t$ available at time $t$, $N_t$ is an inhomogeneous Poisson process with time-varying intensity $\lambda_s$, $0 \le s \le t$. In other words, each realization of the process $\lambda_t$ determines the local jump probabilities for the process $N_t$. The intuition of the doubly-stochastic assumption is that $\mathcal{G}_t$ contains enough information to reveal the intensity $\lambda_t$, but not enough information to reveal the event times of the counting process $N$. That is why the fraud time $\tau$ is an $\mathbb{F}$-stopping time but not a $\mathbb{G}$-stopping time.

Proposition 1. Consider the filtration $\mathcal{F}_t$ that contains the information about the fraud events up to time $t$. Suppose that a new transaction is in progress at time $s$ ($s > t$). The probability of fraud occurring on the next transaction at time $s$ is given by

$$P(N_s - N_t = 1 \mid \mathcal{F}_t) = 1 - E\left(e^{-\int_t^s \lambda_u\,du} \mid \mathcal{F}_t\right) \tag{8}$$

Proof. Letting $A$ be the event $\{N_s - N_t = 0\}$ of no fraud arrivals, the law of iterated expectations implies that, for $\tau > t$,

$$\begin{aligned}
P(N_s - N_t = 1 \mid \mathcal{F}_t) &= 1 - P(N_s - N_t = 0 \mid \mathcal{F}_t) \\
&= 1 - E(\mathbf{1}_A \mid \mathcal{F}_t) \\
&= 1 - E\left(E(\mathbf{1}_A \mid \mathcal{G}_s \vee \mathcal{H}_t) \mid \mathcal{F}_t\right) \\
&= 1 - E\left(P(N_s - N_t = 0 \mid \mathcal{G}_s \vee \mathcal{H}_t) \mid \mathcal{F}_t\right) \\
&= 1 - E\left(e^{-\int_t^s \lambda_u\,du} \mid \mathcal{F}_t\right)
\end{aligned} \tag{9}$$

The last equality follows from the fact that, under the background information $\mathcal{G}_s$, $N_s$ is an inhomogeneous Poisson process.

Proposition 2. Suppose the intensity follows the stochastic CIR process

$$d\lambda_t = \kappa(\theta - \lambda_t)\,dt + \sigma\sqrt{\lambda_t}\,dB_t$$

Under the assumptions of proposition (1), the probability that the fraud will occur on the next transaction at time $s$ is given by

$$P(N_s - N_t = 1 \mid \mathcal{F}_t) = 1 - e^{A(s-t) - B(s-t)\lambda_t} \tag{10}$$

where

$$\gamma = \sqrt{\kappa^2 + 2\sigma^2}$$

$$B(s-t) = \frac{2\left(e^{\gamma(s-t)} - 1\right)}{2\gamma + (\gamma + \kappa)\left(e^{\gamma(s-t)} - 1\right)}$$

$$A(s-t) = \frac{2\kappa\theta}{\sigma^2}\,\log\left\{\frac{2\gamma\,e^{(\kappa+\gamma)\frac{s-t}{2}}}{2\gamma + (\gamma + \kappa)\left(e^{\gamma(s-t)} - 1\right)}\right\}$$

Proof. Using proposition (1) and following (Cox et al., 1985a) and (Cox et al., 1985b), we obtain (10).

Consequently, the prediction probability of fraud at time $s$ depends on the underlying parameters and on the dynamic intensity $\lambda_t$ with $s \ge t$. In the next section, we focus on the form of the relationship between the intensity and the probability of fraud.

2.2.2. Defining the measurement equation
Suppose we are now interested in the probability that no fraud will occur at time $s$ given the filtration $(\mathcal{F}_t)$ with $s > t$. We denote by $Y_s$ the logarithm of this probability. From proposition (2), $Y_s = \log[P(N_s - N_t = 0 \mid \mathcal{F}_t)] = A(s-t) - B(s-t)\lambda_t$. Let $a = A(s-t)$ and $b = -B(s-t)$; we have

$$Y_s = a + b\lambda_t \tag{11}$$

Equation (11) shows an affine relationship between the logarithm of the predicted probability of no fraud at time $s$ and the intensity at time $t$, with $s > t$. For simplicity and calibration reasons, we take $Y_t$ as a proxy for $Y_s$, where $Y_t$ is the logarithm of the posterior probability of no fraud at time $t$. The main reason for choosing $Y_t$ instead of $Y_s$ is that $Y_t$ is known at time $t$, and it will therefore be useful later in the filtering methods. We introduce noise in equation (11) to take into account the differences between $Y_t$ and $Y_s$, and we assume that this noise is Gaussian white noise. Equation (11) is therefore written

$$Y_t = a + b\lambda_t + \mu_t \tag{12}$$

with $\mu_t \sim N(0, w_t)$. Although equation (12) is affine in the state $\lambda_t$, the functions $a$ and $b$ are non-linear functions of the underlying parameters. Also, for $s > t$, we always have $b < 0$; this implies a negative relationship between the likelihood of no fraud and the intensity of fraud. However, despite the Gaussian assumption on $\epsilon_t$ in the autoregressive equation (7), direct maximum likelihood estimation of the intensity's parameters is not feasible, because the intensity $\lambda_t$ is an unobserved variable and the probability density function is not available in closed form. Likewise, it would be difficult to estimate the parameters $a$ and $b$ of equation (12) by maximum likelihood for the same reason. Thus, filtering methods can be used to track the intensities based on the observed probabilities $Y_t$. The Kalman filter is proposed here to capture the dynamic intensities and to estimate the various parameters. In this context, two equations are required: the measurement equation (12), which concerns the observed probability $Y_t$, and the state equation (7) for the unknown intensity.

2.3.1. Kalman filter in the estimation of the stochastic intensity
Now that the model in (1) has been put in state-space form, leading to equations (7) and (12), the Kalman filter can be used to obtain information about the unobserved intensity $\lambda_t$ using the logarithm of the observed probability of no fraud, $Y_t$, for $t = 0, \dots, n$. Let us recall the measurement and state equations.

Measurement equation: $Y_t = a + b\lambda_t + \mu_t$, with $\mu_t \sim N(0, w_t)$.
State equation: $\lambda_t = \alpha + \beta\lambda_{t-1} + \epsilon_t$, with $\epsilon_t \sim N(0, \eta_t)$,

where $a$, $b$, $\alpha$, $\beta$ and $\eta_t$ are functions of the unknown parameters $(\kappa, \theta, \sigma)$ of the model. The Kalman filter is a recursive algorithm for calculating estimates of unobserved state variables based on observations that depend on these state variables. It was first published in (Kalman, 1960) and is used in areas such as aeronautics, signal processing, and futures trading. A detailed explanation of the Kalman filter can be found in (Harvey, 1989), (Lutkepohl, 1991), (Maybeck, 1979), (Jazwinsky, 1970) and (Heemink, 1986). The principle of the Kalman filter is to use a time series of observable data to estimate the values of state variables. This technique is useful when the observable data depend linearly on the state variables. In our case, we have this linear relation between the logarithm of the probability of no fraud and the fraud intensity. The algorithm first forms an optimal predictor of the unobserved state variable vector given its previous estimated value. This prediction is obtained by using the distribution of the unobserved state variables, conditional on the previous estimated values. These estimates of the unobserved state variables are then updated using the information provided by the observed variables. Although the Kalman filter relies on the normality assumption of the measurement error and initial state vector, one can calculate the likelihood function by decomposing the prediction error.
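To make the dependence of the state-space coefficients on $(\kappa, \theta, \sigma)$ concrete, the sketch below computes $A$, $B$ and the fraud probability of proposition (2), together with $a$, $b$, $\alpha$, $\beta$ and $\eta_t$. It is an illustration under stated assumptions: the function names are hypothetical, and the measurement horizon $s - t$ is taken equal to the time step `dt`, which the text does not specify.

```python
import math


def cir_A_B(tau, kappa, theta, sigma):
    """A(tau) and B(tau) of proposition (2), with tau = s - t > 0."""
    gamma = math.sqrt(kappa**2 + 2.0 * sigma**2)
    growth = math.exp(gamma * tau) - 1.0
    denom = 2.0 * gamma + (gamma + kappa) * growth
    B = 2.0 * growth / denom
    A = (2.0 * kappa * theta / sigma**2) * math.log(
        2.0 * gamma * math.exp((kappa + gamma) * tau / 2.0) / denom
    )
    return A, B


def fraud_probability(lam_t, tau, kappa, theta, sigma):
    """Equation (10): probability of fraud on the next transaction."""
    A, B = cir_A_B(tau, kappa, theta, sigma)
    return 1.0 - math.exp(A - B * lam_t)


def state_space_coefficients(kappa, theta, sigma, dt):
    """Return (a, b) of equation (12) and (alpha, beta) of equation (7).

    Assumption: the measurement horizon s - t equals the time step dt.
    """
    A, B = cir_A_B(dt, kappa, theta, sigma)
    beta = math.exp(-kappa * dt)
    alpha = theta * (1.0 - beta)
    return A, -B, alpha, beta


def state_noise_variance(lam_prev, kappa, theta, sigma, dt):
    """eta_t of equation (7), evaluated at a given previous intensity."""
    beta = math.exp(-kappa * dt)
    return (theta * sigma**2 / (2.0 * kappa)) * (1.0 - beta) ** 2 \
        + (sigma**2 / kappa) * (beta - beta**2) * lam_prev
```

As a consistency check, when `sigma` is very small the probability returned by `fraud_probability` approaches the one obtained from the deterministic mean path of equation (3).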
Let $v_t$ be the variance of $\lambda_t$, $\lambda_{t-1|t-1}$ an unbiased estimate of $\lambda_{t-1}$ at time $t-1$, and $v_{t-1|t-1}$ the variance of $\lambda_{t-1|t-1}$. The initial state $\lambda_0$ at time 0 is a random variable which is not correlated with either the system or the measurement noise processes. At time 0, we must have preliminary values of $\lambda_{0|0}$ and $v_{0|0}$. As these values are unknown, a common choice is to assign a null value to $\lambda_{0|0}$ and a high value to $v_{0|0}$, in order to take into account the uncertainty linked to the estimate of $\lambda_{0|0}$. Let us give the three steps of the procedure followed by the Kalman filter: forecasting, updating and estimation of the parameters. First, we make the following forecasts:

1. $\lambda_{t|t-1} = E_{t-1}(\lambda_t)$, that is, the forecast of $\lambda_t$ conditional on the information set at time $(t-1)$:

$$\lambda_{t|t-1} = \alpha + \beta\lambda_{t-1|t-1} \tag{13}$$

$\lambda_{t|t-1}$ is an unbiased conditional estimate of $\lambda_t$; it is straightforward to check that $E(\lambda_{t|t-1} - \lambda_t) = 0$.

2. $v_{t|t-1}$, the variance of $\lambda_{t|t-1}$, that is, $v_{t|t-1} = E[(\lambda_{t|t-1} - \lambda_t)^2]$:

$$v_{t|t-1} = \beta^2 v_{t-1|t-1} + \eta_t \tag{14}$$

The two forecasts $\lambda_{t|t-1}$ and $v_{t|t-1}$ are used in the next step to update $\lambda_t$ and its variance. The second step is the update. At time $t$, we have a new observation of $Y$, i.e. $Y_t$. We can thus compute the prediction error $e_t$:

$$e_t = Y_t - a - b\lambda_{t|t-1} \tag{15}$$

The variance of $e_t$, denoted by $\psi_t$, is given by

$$\psi_t = w_t + b^2 v_{t|t-1} \tag{16}$$

We use $e_t$ and $\psi_t$ to update $\lambda_{t|t}$ and its variance $v_{t|t}$ as follows:

$$\lambda_{t|t} = \lambda_{t|t-1} + K_t e_t \tag{17}$$

$$v_{t|t} = (1 - K_t b)\,v_{t|t-1} \tag{18}$$

where $K_t$ is the Kalman gain, defined as $K_t = \frac{b\,v_{t|t-1}}{\psi_t}$. The Kalman gain $K_t$ is the most crucial quantity of the filter: it determines how readily the filter adapts to new conditions. In (17), $K_t$ guarantees that $\lambda_{t|t}$ is an unbiased estimator of $\lambda_t$. In (18), it minimizes the variance $v_{t|t}$. Thus, $\lambda_{t|t}$ is a conditionally unbiased and efficient estimator.
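The forecasting and updating steps of equations (13) to (18) can be sketched as a short scalar recursion. This is a minimal illustration rather than the authors' implementation: the function signature, the constant measurement variance `w`, the default starting values, and the choice of evaluating $\eta_t$ at the previous filtered intensity are all assumptions. The sketch also accumulates the Gaussian log-likelihood of the prediction errors, up to an additive constant.

```python
import math


def kalman_filter(y, a, b, alpha, beta, eta_fn, w, lam0=0.0, v0=10.0):
    """Run the scalar filter of equations (13)-(18) over observations y.

    eta_fn(lam_prev) returns the state-noise variance eta_t. Evaluating it
    at the previous filtered intensity is an assumption, since the true
    previous intensity is unobserved. w is a constant measurement variance
    (the text allows w_t to vary with t); lam0 and v0 are illustrative.
    Returns the filtered intensities and the log-likelihood of the
    prediction errors, up to an additive constant.
    """
    lam, v = lam0, v0
    filtered, loglik = [], 0.0
    for y_t in y:
        # Forecast step, equations (13) and (14).
        lam_pred = alpha + beta * lam
        v_pred = beta**2 * v + eta_fn(lam)
        # Prediction error and its variance, equations (15) and (16).
        e = y_t - a - b * lam_pred
        psi = w + b**2 * v_pred
        # Update step with the Kalman gain, equations (17) and (18).
        gain = b * v_pred / psi
        lam = lam_pred + gain * e
        v = (1.0 - gain * b) * v_pred
        filtered.append(lam)
        loglik += -0.5 * math.log(psi) - 0.5 * e**2 / psi
    return filtered, loglik
```

In a calibration loop, the returned log-likelihood would be maximized over $(\kappa, \theta, \sigma)$ and the measurement variance with a numerical optimizer of the reader's choice.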
The Kalman filter is therefore optimal because it is the best estimator in the class of linear estimators. For more details on the derivation of the Kalman gain, see (Hamilton, 1994) and (Welch & Bishop, 2005). The third step concerns the estimation of the parameters. In our study, the parameters to be estimated are $\kappa$, $\theta$, $\sigma$ and the variances of the measurement error at each time step, $w_t$. From (15), the prediction error $e_t$ follows the normal distribution with mean 0 and variance $\psi_t$. Based on the Gaussian distribution of $e_t$, we use the maximum likelihood method. The log-likelihood function can be written as follows:

$$l = -\frac{1}{2}\sum_t \log(\psi_t) - \frac{1}{2}\sum_t \frac{e_t^2}{\psi_t} \tag{19}$$

To complete the procedure, we go to time $(t+1)$ and repeat the three-step procedure up to $n$. As discussed in (Duan & Simonato, 1998), when the state-space model is Gaussian, the Kalman filter provides an optimal solution to predict, update and evaluate the likelihood function. When the state-space model is non-Gaussian, the Kalman filter can still be applied to obtain approximate first and second moments of the model, and the resulting filter is almost optimal. The use of this quasi-optimal filter gives an approximate quasi-likelihood function with which the estimation of the parameters can be performed. Our fraud detection approach is thus unsupervised in the sense that the estimation of the dynamic intensities does not require labels, only the fraud probabilities observed on the transactions. This approach is useful for the detection of fraudulent transactions, for which the main constraint is the lack of labelled data.

2.3.2. Issues with negative estimated values for fraud intensity
From equation (1), the intensity follows a non-central $\chi^2$ distribution, which guarantees that the intensity is always non-negative. However, the intensity is an unobservable variable, and the Kalman filter is proposed to estimate its values. As noted in the previous section, the Kalman filter uses quasi-maximum likelihood to estimate the intensity, since the true distribution of the intensity is not Gaussian. Therefore, there is a non-zero probability of obtaining negative values for the intensity during the calibration process. To deal with possible negative values of the intensity, the following steps are proposed.

Step 1: Intensities Shift
This step consists in translating the intensity values obtained by the Kalman filter ($\lambda_{t|t}$) to positive values ($S_{t|t}$) in order to eliminate negative or near-zero values. The following transformation is proposed:

$$S_{t|t} = \lambda_{t|t} + \alpha, \quad t \in [0, n] \tag{20}$$

where $\alpha$ is a deterministic positive quantity (a shift, distinct from the coefficient $\alpha$ of equation (7)). From the above translation, $dS_t = d\lambda_t$ for any time $t$. There are many values that could be assigned to $\alpha$, but in our study the most appropriate choice is the 99th percentile of the empirical distribution of the intensity. The stochastic differential equation (SDE) of $S_t$ becomes:

$$\begin{aligned}
dS_t = d\lambda_t &= \kappa(\theta - \lambda_t)\,dt + \sigma\sqrt{\lambda_t}\,dB_t \\
&= \kappa(\theta - (S_t - \alpha))\,dt + \sigma\sqrt{S_t - \alpha}\,dB_t \\
&= \kappa(\theta + \alpha - S_t)\,dt + \sigma\sqrt{S_t\left(1 - \frac{\alpha}{S_t}\right)}\,dB_t \\
&= \kappa(\theta + \alpha - S_t)\,dt + \sigma\sqrt{1 - \frac{\alpha}{S_t}}\,\sqrt{S_t}\,dB_t
\end{aligned}$$

$$dS_t = \kappa(\theta^* - S_t)\,dt + \sigma^*_t\sqrt{S_t}\,dB_t \tag{21}$$

with $\theta^* = \theta + \alpha$ and $\sigma^*_t = \sigma\sqrt{1 - \frac{\alpha}{S_t}}$. $S_t$ follows an extended CIR process with stochastic $\sigma^*_t$. $S_t$ is a mean-reverting process, with $\kappa$ the rate of mean reversion, $\theta^*$ the long-run average and $\sigma^*_t$ the volatility. If $S_t$ approaches $\alpha$, $\sigma^*_t = \sigma\sqrt{1 - \frac{\alpha}{S_t}}$ approaches 0, cancelling the effect of randomness, so $S_t \ge \alpha$.

Step 2: Updating parameters
The SDE in (21) does not lead to the analytical expression of proposition (2), because $\sigma^*_t$ is stochastic and not simply time-dependent; see (Boyle et al., 2002). In order to apply (10) to predict the fraud occurrence with the new intensity $S_t$, the SDE of $S_t$ is modified as follows:

$$dS_t = \kappa(\theta^* - S_t)\,dt + \sigma^*\sqrt{S_t}\,dB_t \tag{22}$$

with $\sigma^* = E(\sigma^*_t)$. In this context, the parameters $\kappa$, $\theta^*$ and $\sigma^*$ can be updated by Ordinary Least Squares (OLS). The discretised form of equation (22) is given by

$$S_{t+\Delta t} - S_t = \kappa(\theta^* - S_t)\Delta t + \sigma^*\sqrt{S_t}\,\xi_t \tag{23}$$

where $\xi_t$ is a Gaussian white noise with $E(\xi_t) = 0$ and $V(\xi_t) = \Delta t$. To perform OLS, we transform (23) into

$$\frac{S_{t+\Delta t} - S_t}{\sqrt{S_t}} = \frac{\kappa\theta^*\Delta t}{\sqrt{S_t}} - \kappa\sqrt{S_t}\,\Delta t + \sigma^*\xi_t \tag{24}$$

Then, the drift parameters $\kappa$ and $\theta^*$ are found by minimizing the OLS objective function

$$\sum_{i=1}^{n-1}\left(\frac{S_{t_{i+1}} - S_{t_i}}{\sqrt{S_{t_i}}} - \frac{\kappa\theta^*\Delta t}{\sqrt{S_{t_i}}} + \kappa\sqrt{S_{t_i}}\,\Delta t\right)^2 \tag{25}$$

The diffusion parameter estimate $\hat{\sigma}^*$ is found by dividing the standard deviation of the residuals by $\sqrt{\Delta t}$. So, in the context of negative values for $\lambda_t$, the updated parameters $\kappa$, $\theta^*$ and $\hat{\sigma}^*$ for the new intensity $S_t$ are finally used in proposition (2) for fraud prediction.

Table 1. Summary statistics of the Risk-Scores by transaction and of the fraud proportion by client in the full dataset. The clients with no fraud events and the clients with a fraud proportion of 100% are removed from this full dataset.
Columns: Min, Max, Mean, Median, StDev. Rows: Risk-Score, Fraud proportion.

Table 2. Distribution of the number of clients and the number of distinct transactions in the 7 subsets.
Columns: Groups, Nb of clients, Nb of distinct transactions. Rows: the seven fraud-proportion bands (P ≤ …, then … < P ≤ … for each following band) and Total.
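The two steps of section 2.3.2 (the shift of equation (20) and the OLS update of equations (24) and (25)) can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the quantile uses a simple order-statistic convention, and the two-regressor least-squares problem is solved through its normal equations, which is one of several equivalent ways to minimize (25).

```python
import math


def shift_intensities(lam, q=0.99):
    """Step 1: add the empirical q-quantile of the intensities to every value.

    Assumption: a simple order-statistic quantile; other percentile
    conventions would give slightly different shifts.
    """
    ordered = sorted(lam)
    shift = ordered[min(int(q * len(ordered)), len(ordered) - 1)]
    return [x + shift for x in lam], shift


def ols_cir(S, dt):
    """Step 2: OLS estimates of kappa, theta* and sigma* from equation (24).

    Regresses y_i = (S_{i+1} - S_i) / sqrt(S_i) on x1_i = dt / sqrt(S_i)
    and x2_i = sqrt(S_i) * dt without intercept, so that
    y = c1 * x1 + c2 * x2 with c1 = kappa * theta* and c2 = -kappa.
    Requires every S_i to be strictly positive.
    """
    y = [(S[i + 1] - S[i]) / math.sqrt(S[i]) for i in range(len(S) - 1)]
    x1 = [dt / math.sqrt(s) for s in S[:-1]]
    x2 = [math.sqrt(s) * dt for s in S[:-1]]
    # Normal equations of the two-regressor least-squares problem.
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    c1 = (s1y * s22 - s2y * s12) / det
    c2 = (s2y * s11 - s1y * s12) / det
    kappa = -c2
    theta_star = c1 / kappa
    # Diffusion estimate: standard deviation of residuals over sqrt(dt).
    resid = [t - c1 * u - c2 * v for t, u, v in zip(y, x1, x2)]
    mean_r = sum(resid) / len(resid)
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in resid) / len(resid))
    return kappa, theta_star, std_r / math.sqrt(dt)
```

On a noise-free path generated from equation (23) with $\xi_t = 0$, the function recovers $\kappa$ and $\theta^*$ and returns a diffusion estimate close to zero.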
3. Datasets
The data provided by NetGuardians is a simulated banking transactions dataset created by NetGuardians from anonymized real-world banking datasets. It covers a period of 2 years and contains a total of more than 15 million transactions made by more than 120'000 clients. The dataset includes features such as the transaction dates, transaction amounts, transaction sender IDs, the account numbers of transaction recipients, the countries of the banks receiving the transactions, etc. It is important to mention that there is no fraud labelling in the dataset.

Figure 2. Histograms of the fraud Risk-Score by transaction and of the fraud proportion by client in the full dataset (panels: Dist. of Risk-Score by transaction; Dist. of Fraud Proportions; vertical axis: Count). The clients with no fraud events and the clients with a fraud proportion of 100% are removed from this full dataset.
The model with the Kalman filter is an unsupervised learning method in the sense that no label is required to detect the fraud event. Instead, information about the likelihood of fraud for each transaction is required to define the measurement equation. Many machine learning or statistical methods, such as dimensionality reduction, logistic regression, Z-Score, etc., can be used to estimate the fraud probability. However, the performance of the model will strongly depend on the accuracy of such a method in estimating the fraud probability. In this study, for the sake of simplicity, we focus on the fraud Risk-Score provided by NetGuardians to define the measurement equation. The fraud Risk-Score is a metric that gives an estimate of the fraud proportion for each transaction based on recent information. For reasons of confidentiality, the methodology for calculating the Risk-Score is not described.
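As an illustration of how a per-transaction fraud score could feed the measurement equation (12), the sketch below converts scores into observations $Y_t$, the logarithm of the probability of no fraud. It rests on assumptions that the text does not confirm: the scores are treated as probabilities in [0, 1], and a small clipping constant `eps` (hypothetical) keeps the logarithm finite.

```python
import math


def no_fraud_log_probability(risk_scores, eps=1e-6):
    """Map fraud scores to observations Y_t = log(1 - score).

    Assumptions: scores are probabilities in [0, 1]; they are clipped to
    [eps, 1 - eps] so that the logarithm stays finite.
    """
    clipped = [min(max(r, eps), 1.0 - eps) for r in risk_scores]
    return [math.log(1.0 - r) for r in clipped]
```

The resulting values are negative and decrease as the score increases, which is consistent with the negative coefficient $b$ of the measurement equation.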
Table 3. Process of fraud prediction in the test set: the intensity is updated by the Kalman filter and the fraud is predicted on the next transaction.
At time $t$: given the starting $\lambda_{t|t}$ at time $t$, $P(N_{t+1} - N_t = 1 \mid \mathcal{F}_t)$ = fraud prediction for time $t+1$.
At time $t+1$: the Risk-Score at time $t+1$ is provided; $\lambda_{t+1|t+1}$ is updated by the Kalman filter; $P(N_{t+2} - N_{t+1} = 1 \mid \mathcal{F}_{t+1})$ = fraud prediction for time $t+2$.
At time $t+2$: repeat the process, and so on.

To complete our study, we also need to generate artificial fraud labels in our dataset. The main reason is that we want to compare the performance of the Kalman filter model with other intensity-based models, such as the homogeneous and non-homogeneous Poisson processes, which are supervised methods and have been investigated in (Houssou et al., 2019). There are several possibilities for creating artificial labels, depending on the chosen criterion. In our study, artificial labelling is based on the following criterion: transactions for which the banks receiving the money are located outside Switzerland are flagged as fraudulent. The main reason for using this criterion is that the resulting labels are highly correlated with the fraud Risk-Score, as measured by the point-biserial correlation between the artificial labels and the Risk-Score. The proportion of fraud, which is the number of fraudulent transactions over the total number of transactions, is calculated for each client. With this labelling methodology, we found some clients with a fraud rate of 100%. These are the clients for whom the institutions receiving the money are all located outside Switzerland. To be realistic, we remove these clients from our analysis. The clients with no fraud events in the full dataset are also removed, because the datasets of these clients contain only one class and the classification problem is not defined. Table 1 and figure 2 show the descriptive statistics of the fraud Risk-Scores and of the clients' fraud proportions in the cleaned dataset.
We remark that the two distributions are skewed and that the dataset is unbalanced, since most of the clients have a small proportion of fraud. The mean and the median of the fraud proportions are reported in table 1.

However, it is important to note that, with this labelling criterion, the above distribution of the fraud proportions is not representative of the true fraud distribution, because in practice the majority of fraud proportions are much smaller. To carry out our analysis in a general framework of imbalanced data, we focus on the clients whose fraud proportion lies below a fixed cut-off; the size of the resulting sample (clients and distinct transactions) is given in the Total row of table 2. Next, we divide this sample into seven subsets containing various fraud profiles. The subsets are defined by consecutive bands of the fraud proportion: the first subset groups the clients with the smallest proportions of fraud, and each following subset covers the next band, up to the cut-off.

The boundaries of the subsets are chosen to ensure a minimum of 1000 clients in each subset. Table 2 shows the distribution of the number of clients and the number of transactions in each group. Among the subsets, the first group contains the smallest number of clients and the fourth subset the largest. Figure 3 shows the boxplots of the proportion of fraud in each group. In each subset, we randomly select a fixed number N of clients, and we train and test our model on the transactions of each client. We take N = 1391, which is the number of clients in the first group (the smallest group). The training set consists of the first part of each client's transactions in chronological order, on which the intensity parameters are estimated.
The test set consists of the last , and the fraud events are predicted with the estimated parameters. We compare our model with other intensity-based models, namely the homogeneous and the inhomogeneous Poisson processes. For the inhomogeneous Poisson process, we focussed on $\lambda(t) = a + bt$ and $\lambda(t) = a + bt + ct^2$, as in (Houssou et al., 2019). We consider two further models: the baseline model and the Risk-Score model. The baseline model consists of calculating the proportion of fraud in the training set and using this probability to predict fraud in the test set. The Risk-Score model consists of using the Risk-Score of the transaction at time $t$ as the fraud prediction for the client's next transaction. All these models are compared to our Kalman filter model. Finally, the predictive performance is summarized in each subset using two performance measures: the ROC AUC and the Average Precision (AP) score.

Table 4. Results for AUC medians for the various models in each group. Rows: HomoPoisson, LinearPoisson, QuadraticPoisson, NaiveApproach, ScoreApproach, KFApproach. Columns: Group 1 to Group 7.

Table 5. Results for A/P medians for the various models in each group. Rows: HomoPoisson, LinearPoisson, QuadraticPoisson, NaiveApproach, ScoreApproach, KFApproach. Columns: Group 1 to Group 7.
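For readers who want to reproduce the two performance measures, the following is a minimal pure-Python sketch. It is not the implementation used for the paper. The AUC is computed as the probability that a fraudulent transaction receives a higher predicted probability than an authentic one, and the AP as the mean precision at the rank of each fraudulent transaction. Tied scores are ranked in input order in `average_precision`, which is a simplifying assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    # Rank transactions by decreasing score; sorted() is stable, so ties
    # keep their input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is needed")
    return total / hits
```

A predictor that assigns the same probability to every transaction, like the baseline model, obtains an AUC of exactly 0.5 under this definition.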
4. Results
Our model is denoted by KFApproach. In the training set of each client, equations (7) and (12) are estimated by the Kalman filtering (KF) process as described above. We set the starting value of the intensity λ | = 0 and the variance v | = 10 . As explained above, a high value is given to the variance in order to take into account the uncertainty in the estimation of the starting value λ | .
Figure 3. Boxplots for the proportions of fraud in the seven subsets. Clients with the fraud proportion P ≤ are grouped in seven subsets representing different fraud profiles.

Thus, in the training set, we generate with the KF the dynamic intensities related to the transactions of the client. In the test set, the prediction process described in Table 3 works as follows. At time $t$, given the updated intensity $\lambda_{t|t}$ on the transaction, equation (10) in Proposition 2 is used to predict the fraud probability on the client's next transaction at time $t+1$. At time $t+1$, using the information provided by the Risk-Score on the transaction, the intensity $\lambda_{t+1|t+1}$ is updated by the KF. With $\lambda_{t+1|t+1}$, the fraud probability on the transaction at time $t+2$ is predicted, and the process is repeated until the last transaction in the test set.

There is a total of five models to compare with the Kalman filter model:
1. The first model is the homogeneous Poisson process ($\lambda(t) = \lambda$). The constant intensity $\lambda$ is estimated in the training set and is used to predict the fraud events in the whole test set. We designate this model HomoPoisson.
2. The second model is the non-homogeneous Poisson process whose intensity is a linear function of time ($\lambda(t) = a + bt$). The intensity parameters are estimated in the training set and are used for the prediction of fraud in the whole test set. We designate this model LinearPoisson.
3. The third model is the non-homogeneous Poisson process whose intensity is a quadratic function of time ($\lambda(t) = a + bt + ct^2$). The procedure is the same as for LinearPoisson. We designate this model QuadraticPoisson.
4. The fourth model is the baseline model, which consists of estimating the probability of fraud in the training set and using the same probability for the prediction in the test set. Thus, the predicted probabilities are the same for all transactions in the test set.
This is equivalent to a random classifier, because the model does not have the capacity to discriminate between an authentic transaction and a fraudulent transaction. We designate this model NaiveApproach.
5. The fifth model is based on the Risk-Score of the transactions and consists of predicting the Risk-Score using a random walk process. It supposes that the Risk-Scores follow the process

$X_t = X_{t-1} + u_t$ (26)

where $X_t$ is the Risk-Score of the transaction at time $t$ and $u_t$ is a white noise. In this context, $X_t$ is not stationary and the best prediction of the fraud proportion on the transaction at time $t+1$ is the fraud proportion at time $t$. We designate this approach ScoreApproach.
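Three of the comparison models listed above are simple enough to sketch directly. The code below is an illustration, not the authors' implementation. For HomoPoisson it assumes that time is measured in unit steps between transactions, that the rate is estimated as the number of fraud events per step, and that the predicted quantity is the probability of exactly one event in a unit step, $\lambda e^{-\lambda}$. The function names are hypothetical.

```python
import math


def naive_predictions(train_labels, n_test):
    """NaiveApproach: the training fraud proportion, repeated for every test transaction."""
    p = sum(train_labels) / len(train_labels)
    return [p] * n_test


def score_predictions(last_train_score, test_scores):
    """ScoreApproach: random-walk forecast, i.e. the previous Risk-Score.

    The prediction for the first test transaction uses the last training score.
    """
    return [last_train_score] + list(test_scores[:-1])


def homo_poisson_predictions(train_labels, n_test):
    """HomoPoisson under the assumptions stated in the lead-in."""
    lam = sum(train_labels) / len(train_labels)  # events per unit time step
    p_one_event = lam * math.exp(-lam)  # P(exactly one event in a unit step)
    return [p_one_event] * n_test
```

Because `naive_predictions` and `homo_poisson_predictions` return a constant, they cannot rank transactions within a client, whereas `score_predictions` changes with every new Risk-Score.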
Figure 4. Performance for AUC medians for the various models in each group. The plot shows the performance of each model against the degree of imbalance of the dataset. Models shown: HomoPoisson, LinearPoisson, QuadraticPoisson, NaiveApproach, ScoreApproach, KFApproach.
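The sequential predict-and-update loop of KFApproach can be illustrated with a generic scalar Kalman filter. This sketch makes several assumptions that are not taken from the paper: a linear state equation with hypothetical coefficients `c`, `phi`, `q`, a linear measurement equation with hypothetical coefficients `d`, `h`, `r`, a starting variance of `1e6` standing in for the "high value", and the mapping `1 - exp(-lam)` standing in for the closed-form prediction of equation (10). Negative intensity estimates are truncated at zero, which is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalarKF:
    c: float    # state intercept (hypothetical)
    phi: float  # state autoregressive coefficient (hypothetical)
    q: float    # state noise variance (hypothetical)
    d: float    # measurement intercept (hypothetical)
    h: float    # measurement slope (hypothetical)
    r: float    # measurement noise variance (hypothetical)

    def predict(self, lam, var):
        """One-step-ahead intensity and its variance."""
        return self.c + self.phi * lam, self.phi ** 2 * var + self.q

    def update(self, lam_pred, var_pred, risk_score):
        """Correct the predicted intensity with the observed Risk-Score."""
        innovation = risk_score - (self.d + self.h * lam_pred)
        innovation_var = self.h ** 2 * var_pred + self.r
        gain = var_pred * self.h / innovation_var
        lam = max(lam_pred + gain * innovation, 0.0)  # keep the intensity non-negative
        var = (1.0 - gain * self.h) * var_pred
        return lam, var


def fraud_probability(lam):
    """Stand-in for the paper's closed-form prediction (equation (10))."""
    return 1.0 - math.exp(-lam)


def run_filter(kf, risk_scores, lam0=0.0, var0=1e6):
    """After each observed Risk-Score, emit the forecast for the next transaction."""
    lam, var = lam0, var0
    forecasts = []
    for score in risk_scores:
        lam_pred, var_pred = kf.predict(lam, var)
        lam, var = kf.update(lam_pred, var_pred, score)
        forecasts.append(fraud_probability(lam))
    return forecasts
```

With a very large starting variance the first gain is close to one, so the first updated intensity is driven almost entirely by the first observed Risk-Score rather than by the arbitrary starting value.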
Below, we present the performance of the models based on the predicted probabilities and the artificial labels in the test set. For comparison purposes, the performances calculated in each group are summarized using the median. Table 4 shows the results for the AUC (Area Under the Curve)-ROC (Receiver Operating Characteristic) for the different models in each group. The results show that KFApproach performs better than the other models in group to group , that is, when the probability of fraud is less than . It is followed by ScoreApproach. When the fraud probability is greater than , all Poisson models outperform KFApproach and ScoreApproach. We also remark that NaiveApproach performs worse than the other models. It is important to note that KFApproach outperforms ScoreApproach in all groups; this can be attributed to the fact that the Kalman filter combines information on the Risk-Scores of the transactions with additional information given by the instantaneous fraud rate of the client. In fact, the Risk-Score information is described by the measurement equation and the information on the instantaneous rate of fraud is described by the state equation.

Figure 4 plots the AUC medians for the different models in each group. As mentioned above, the Kalman filter model, followed by ScoreApproach, significantly outperforms the Poisson models on highly imbalanced data (P ≤ 1%), and this performance decreases as the fraud probability of the groups increases. We observe the opposite effect for the Poisson models, in the sense that their performance increases from P ≥ 1% and becomes relatively stable up to P ≤ 15%.
We conclude that the stochastic approach for the intensity is better suited to fraud prediction in highly imbalanced datasets. Finally, among the three Poisson models, LinearPoisson is the best one, followed by QuadraticPoisson; for more details, see (Houssou et al., 2019). We complete our analysis by focussing on the Precision-Recall performance. The Average Precision (A/P), which is an estimate of the area under the precision-recall curve, is calculated and the results are summarized in Table 5. When P ≤ 0. , we notice that KFApproach, followed by ScoreApproach, outperforms the Poisson models and the baseline approach. When P > . , all Poisson models except HomoPoisson in group perform better than KFApproach. The baseline approach still performs worse than the other models. Figure 5 shows the evolution of the A/P median with the probability of fraud, and we observe that the A/P of all models tends to increase as the dataset becomes more balanced. It is important to note that, with the A/P, ScoreApproach outperforms KFApproach when P > .

Finally, we summarize the important results:
1. KFApproach is a mixed approach combining the dynamic intensities with the Risk-Scores. The ROC-AUC shows that KFApproach always outperforms ScoreApproach; this shows that the prediction of the fraud probability by the Kalman filter is better than the prediction of the random walk process on the Risk-Scores.
2. KFApproach, followed by ScoreApproach, works better than the other models on highly imbalanced datasets. The analyses of the ROC-AUC and the A/P confirm this result when P ≤ 1% and P ≤ 0. respectively. In fact, in a very imbalanced dataset there is less fraud information, and the intensity-based approach alone is not enough for the prediction of fraud events. KFApproach uses additional information on the Risk-Score of the transactions, and this explains why it outperforms the rest of the models. So, the contribution of the Risk-Scores to KFApproach is more significant in highly imbalanced datasets. Therefore, KFApproach would be an interesting approach for detecting fraud in highly imbalanced datasets.
3. Analysis of the AUCs shows that the performance of KFApproach, as well as that of ScoreApproach, decreases when the fraud probability of the dataset increases. On the other hand, the A/P curves of the two approaches tend to slope upwards.
4. Analysis of the AUCs and the A/P shows that the Poisson models perform better than KFApproach and ScoreApproach in more balanced datasets, because more information on fraud events is available for estimating the intensity alone. The A/P shows that ScoreApproach outperforms KFApproach when P > , which is contrary to the analysis of the ROC-AUC. Finally, all models perform better than the baseline model.
Figure 5. Performance for A/P medians for the various models in each group. The plot shows the performance of each model against the degree of imbalance of the dataset. Models shown: HomoPoisson, LinearPoisson, QuadraticPoisson, NaiveApproach, ScoreApproach, KFApproach.
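The way per-client scores are condensed into the group medians reported in Tables 4 and 5 can be sketched as below. The record layout `(group, model, value)` and the function names are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import median


def median_by_group(records):
    """Median metric per (group, model) from per-client records.

    records: iterable of (group, model, value) tuples, one per client.
    """
    buckets = defaultdict(list)
    for group, model, value in records:
        buckets[(group, model)].append(value)
    return {key: median(values) for key, values in buckets.items()}


def best_model_per_group(medians):
    """For each group, the model with the highest median metric."""
    best = {}
    for (group, model), value in medians.items():
        if group not in best or value > best[group][1]:
            best[group] = (model, value)
    return best
```

The median is less sensitive than the mean to the few clients on which a model performs unusually well or badly, which makes it a reasonable summary for skewed per-client scores.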
5. Conclusion
An unsupervised approach based on a stochastic intensity model is investigated to detect fraud in imbalanced datasets. The Cox-Ingersoll-Ross (CIR) process is proposed, with the advantage of guaranteeing a positive value for the fraud intensity. In this context, a closed-form solution for the prediction probability of fraud is derived. Using the probability of fraud observed on the transactions, we have shown how to estimate the dynamic intensities with the Kalman filter method. Our methodology is applied to financial datasets. To evaluate the performance of our model, we consider other intensity-based models for predicting fraud. These include the homogeneous Poisson process, the linear and quadratic inhomogeneous Poisson processes, a baseline approach and a random walk approach. All these models are compared to our model. We found that our Kalman filter approach outperforms the other approaches on the more imbalanced datasets. When the fraud probability of the dataset increases, the performance of our model decreases. In this context, the linear intensity model is the best one, followed by the quadratic and the homogeneous Poisson processes. Finally, all the models perform better than the baseline model.

The main contributions of this paper are:
1. Our model is the first unsupervised approach for fraud detection using a stochastic intensity. It would be useful for datasets for which fraud labels are not available.
2. The Cox-Ingersoll-Ross (CIR) process leads to closed-form solutions with few parameters, and this greatly reduces the computational cost and the risk of over-fitting.
3. Instead of using the observed fraud probability to estimate the intensity by the Kalman filter, the model could also be challenged by applying deep machine learning algorithms.
4. Our model is complete in the sense that it combines the information on the instantaneous rate of fraud with the fraud causality for the prediction of fraud events. So, the question of why and how the fraud occurs is investigated.
References
Babbs, S. H. and Nowman, K. B. Kalman filtering of generalized Vasicek term structure models. Journal of Financial and Quantitative Analysis, 34(1):115–130, 1999.
Ball, C. A. and Torous, W. N. Unit roots and estimation of interest rate dynamics. Journal of Empirical Finance, 3(2):215–238, 1996.
Boyle, P. P., Tian, W., and Guan, F. The Riccati equation in mathematical finance. J. Symbolic Computation, 33:343–355, 2002.
Chen, R. and Scott, L. Maximum likelihood estimation for a multifactor equilibrium model of the term structure of interest rates. Journal of Fixed Income, pp. 14–31, 1993.
Cox, J., Ingersoll, J., and Ross, S. A theory of the term structure of interest rates. Econometrica, 53(2):385–407, 1985a.
Cox, J., Ingersoll, J., and Ross, S. An intertemporal general equilibrium model of asset prices. Econometrica, 53(2):363–384, 1985b.
Duan, J. C. and Simonato, J. G. Estimating and testing exponential affine term structure models by the Kalman filter. Review of Quantitative Finance and Accounting, 1998.
Dyzma, M. Fraud detection with machine learning: How banks and financial institutions leverage AI. 2018.
Elrahman, S. M. A. and Abraham, A. A review of class imbalance problem. J. Netw. Innov. Comput., 1:332–340, 2013.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42(4):463–484, 2012.
Hamilton, J. D. Time Series Analysis. Princeton University Press, 1994.
Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.
He, H. and Garcia, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21:1263–1284, 2009.
Heemink, A. W. Storm Surge Prediction Using Kalman Filtering. PhD thesis, Twente University of Technology, 1986.
Houssou, R., Bovay, J., and Robert, S. Adaptive financial fraud detection in imbalanced data with time-varying Poisson processes. Journal of Financial Risk Management, 8(4):286–304, 2019.
Jafari, M. A. and Abbasian, S. The moments for solution of the Cox-Ingersoll-Ross interest rate model. Journal of Finance and Economics, 5(1):34–37, 2017.
Jazwinski, A. H. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
Kalman, R. E. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, pp. 35–45, 1960.
Krawczyk, B. Learning from imbalanced data. Prog Artif Intell, pp. 1–12, 2016.
Longstaff, F. A. and Schwartz, E. S. Interest rate volatility and the term structure: A two-factor general equilibrium model. Journal of Finance, XLVII, pp. 1259–1282, 1992.
Lutkepohl, H. Introduction to Multiple Time Series Analysis. Springer, 1991.
Maybeck, P. S. Stochastic models, estimation and control. Academic Press, New York, 1979.
Racicot, F. E. and Theoret, R. Forecasting stochastic volatility using the Kalman filter: An application to Canadian interest rates and price-earning ratio. AESTIMATIO, the IEB International Journal of Finance, 1:28–47, 2010.
Vo, L. H. Application of Kalman filter on modelling interest rates. Journal of Management Sciences, 1(1):1–15, 2014.
Welch, G. and Bishop, G. An introduction to the Kalman filter. Technical Report TR-95-041.