MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning
Jeremy Charlier¹,², Gaston Ormazabal², Radu State¹, and Jean Hilger³
¹ University of Luxembourg, L-1855 Luxembourg, Luxembourg, name.surname@uni.lu
² Columbia University, New York, NY 10027, USA, {jjc2292,gso7}@columbia.edu
³ BCEE, L-1160 Luxembourg, Luxembourg, [email protected]
Abstract.
Reinforcement learning has become one of the best approaches to train a computer game emulator capable of human level performance. In a reinforcement learning approach, an optimal value function is learned across a set of actions, or decisions, that leads to a set of states giving different rewards, with the objective to maximize the overall reward. A policy assigns to each state-action pair an expected return. We call an optimal policy a policy for which the value function is optimal. QLBS, Q-Learner in the Black-Scholes(-Merton) Worlds, applies the reinforcement learning concepts, and noticeably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. It is, however, specifically optimized for the geometric Brownian motion and vanilla options. Its range of application is, therefore, limited to vanilla option pricing within financial markets. We propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement learning approach that determines the optimal policy of money management based on the aggregated financial transactions of the clients. It unlocks new frontiers to establish personalized credit card limits or to fulfill bank loan applications, targeting the retail banking industry. MQLV extends the simulation to mean reverting stochastic diffusion processes and it uses a digital function, a Heaviside step function expressed in its discrete form, to estimate the probability of a future event such as a payment default. In our experiments, we first show the similarities between a set of historical financial transactions and Vasicek generated transactions and, then, we underline the potential of MQLV on generated Monte Carlo simulations. Finally, MQLV is the first Q-learning Vasicek-based methodology addressing transparent decision making processes in retail banking.
Keywords:
Q-Learning · Monte Carlo · Payment Transactions.
A major goal of the reinforcement learning (RL) and Machine Learning (ML) community is to build efficient representations of the current environment to solve complex tasks. In RL, an agent relies on multiple sensory inputs and past experience to derive a set of plausible actions to solve a new situation [1]. While the initial idea around RL is not new [2-4], significant progress has been achieved recently by combining neural networks and Deep Learning (DL) with RL. The progress of DL [5,6] has allowed the development of a novel agent combining RL with a class of deep artificial neural networks [1,7], resulting in the Deep Q-Network (DQN). The Q refers to the Q-learning algorithm introduced in [8]. It is an incremental method that successively improves its evaluations of the quality of the state-action pairs. The DQN approach achieves human level performance on Atari video games using unprocessed pixels as inputs. In [9], deep RL with double Q-learning was proposed to challenge the DQN approach while trying to reduce the overestimation of the action values, a well-known drawback of the Q-learning and DQN methodologies. The extension of the DQN approach from a discrete to a continuous action domain, directly from raw pixel inputs, was successfully achieved for various simulated tasks [10].

Nonetheless, most of the proposed models focused on gaming theory and computer game simulation, and very few addressed the financial world. In QLBS [11], a RL approach is applied to the Black, Scholes and Merton (BSM) financial framework for derivatives [12,13], a cornerstone of modern quantitative finance. In the BSM model, the dynamic of a stock market is defined as following a Geometric Brownian Motion (GBM) to estimate the price of a vanilla option on a stock [14]. A vanilla option is an option that gives the holder the right to buy or sell the underlying asset, a stock, at maturity for a certain price, the strike price. QLBS is one of the first approaches to propose a complete RL framework for finance. As mentioned by the author, a certain number of topics are, however, not covered in the approach. For instance, it is specifically designed for vanilla options and it fails to address any other type of financial applications. Additionally, the initial generated paths rely on the popular GBM, but there exist a significant number of other popular stochastic models depending on the market dynamics [15].

In this work, we describe a RL approach tailored for personal recommendation in retail banking regarding money management, to be used for loan applications or credit card limits. The method is part of a banking strategy trying to reduce the customer churn in a context of a competitive retail banking market. We rely on the Q-learning algorithm and on a mean reverting diffusion process to address this topic. It leads ultimately to a fitted Q-iteration update and a model-free and off-policy setting. The diffusion process reflects the time series observed in retail banking such as transaction payments or credit card transactions. Such data is, however, strictly confidential and protected by the regulators, and therefore, it cannot be released publicly. We furthermore introduce a new terminal digital function, Π, defined as a Heaviside step function in its discrete form for a discrete variable n ∈ R.
The digital function is at the core of our approach for retail banking since it can evaluate the future probability of an event including, for instance, the future default probability of a client based on his spendings. Our method converges to an optimal policy, and to optimal sets of actions and states, respectively the spendings and the available money. The retail banks can, consequently, determine the optimal policy of money management based on the aggregated financial transactions of the clients. The banks are able to compare the difference between the MQLV's optimal policy and the individual policy of each client. It contributes to an unbiased decision making process while offering transparency to the client. Our main contributions are summarized below:

– A new RL framework called MQLV, Modified Q-Learning for Vasicek, extending the initial QLBS framework [11]. MQLV uses the theoretical foundations of RL and Q-learning to build a financial RL framework based on a mean reverting diffusion process, the Vasicek model [16], to simulate data, in order to reach ultimately a model-free and off-policy RL setting.
– The definition of a digital function to estimate the future probability of an event. The aim is to widen the application perspectives of MQLV by using a characteristic terminal function that is usable for a decision making process in retail banking such as the estimation of the default probability of a client.
– The first application of Q-learning to determine the clients' optimal policy of money management in retail banking. MQLV leverages the clients' aggregated financial transactions to define the optimal policy of money management, targeting the risk estimation of bank loan applications or credit cards.

The paper is structured as follows. We review QLBS and the Q-learning formulations derived by Halperin in [11] in the context of the Black, Scholes and Merton model in section 2. We describe MQLV according to the Q-learning algorithm that leads to a model-free and off-policy setting in section 3. We highlight experimental results in section 4. We discuss related works in section 5 and we conclude in section 6 by addressing promising directions for future work.
We define A_t ∈ \mathcal{A} as the action taken at time t for a given state X_t ∈ \mathcal{X} and the immediate reward by R_{t+1}. The ongoing state is denoted by X_t ∈ \mathcal{X} and the stochastic diffusion process by S_t ∈ \mathcal{S} at time t. The discount factor that trades off the importance of immediate and later rewards is expressed by γ ∈ [0, 1]. We recall that a policy is a mapping from states to probabilities of selecting each possible action [17]. By following the notations of [11], the policy π such that

\pi : \{0, \dots, T-1\} \times \mathcal{X} \to \mathcal{A}    (1)

maps at time t the current state X_t = x_t into the action a_t ∈ \mathcal{A}:

a_t = \pi(t, x_t)    (2)
The value of a state x under a policy π, denoted by v_π(x) when starting in x and following π thereafter, is called the state-value function for policy π:

v_\pi(x) = \mathbb{E}_\pi \Big[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, X_t = x \Big]    (3)

The action-value function q_π(x, a) for policy π defines the value of taking action a in state x under a policy π as the expected return starting from x, taking the action a, and thereafter following policy π:

q_\pi(x, a) = \mathbb{E}_\pi \Big[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, X_t = x, A_t = a \Big]    (4)

The optimal policy π*_t is the policy that maximizes the state-value function:

\pi^*_t(X_t) = \arg\max_\pi V^\pi_t(X_t)    (5)

The optimal state-value function V*_t satisfies the Bellman optimality equation such that

V^*_t(X_t) = \mathbb{E}_{\pi^*_t} \big[ R_t(X_t, u_t = \pi^*_t(X_t), X_{t+1}) + \gamma V^*_{t+1}(X_{t+1}) \big].    (6)

The Bellman equation for the action-value function, the Q-function, is defined as

Q^\pi_t(x, a) = \mathbb{E}_t \big[ R_t(X_t, a_t, X_{t+1}) \,|\, X_t = x, a_t = a \big] + \gamma \mathbb{E}^\pi_t \big[ V^\pi_{t+1}(X_{t+1}) \,|\, X_t = x \big].    (7)

The optimal action-value function Q*_t is obtained for the optimal policy with

\pi^*_t = \arg\max_\pi Q^\pi_t(x, a).    (8)

The optimal state-value and action-value functions are connected by the following system of equations:

\begin{cases} V^*_t = \max_a Q^*_t(x, a) \\ Q^*_t = \mathbb{E}_t \big[ R_t(X_t, a, X_{t+1}) \big] + \gamma \mathbb{E}_t \big[ V^*_{t+1}(X_{t+1}) \,|\, X_t = x \big] \end{cases}    (9)

Therefore, we can obtain the Bellman optimality equation:

Q^*_t(x, a) = \mathbb{E}_t \Big[ R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*_{t+1}(X_{t+1}, a_{t+1}) \,\Big|\, X_t = x, a_t = a \Big]    (10)

Using the Robbins-Monro update [18], the update rule for the optimal Q-function with on-line Q-learning on the data point (X_t^{(n)}, a_t^{(n)}, R_t^{(n)}, X_{t+1}^{(n)}) is expressed by the following equation, with α_k a step-size parameter:

Q^{*,k+1}_t(X_t, a_t) = (1 - \alpha_k)\, Q^{*,k}_t(X_t, a_t) + \alpha_k \Big[ R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^{*,k}_{t+1}(X_{t+1}, a_{t+1}) \Big]    (11)

We describe, in this section, how to derive a general recursive formulation for the optimal action. It is equivalent to an optimal hedge under a financial framework such as, for instance, portfolio or personal finance optimization. We additionally present the formulation of the action-value function, the Q-function. Both the optimal hedge and the Q-function follow the assumption of a continuous space scenario generated by the Vasicek model with Monte Carlo simulation. By relying on the financial framework established in [11], we consider a mean reverting diffusion process, also known as the Vasicek model [16]:

dS_t = \kappa (b - S_t)\, dt + \sigma\, dB_t    (12)

The term κ is the speed of reversion, b the long term mean level, σ the volatility and B_t the Brownian motion. The solution of the stochastic equation is equal to

S_t = S_0 e^{-\kappa t} + b (1 - e^{-\kappa t}) + \sigma e^{-\kappa t} \int_0^t e^{\kappa s}\, dB_s.    (13)

Therefore, we define a new time-uniform state variable, i.e. without a drift, as

\begin{cases} S_t = X_t + S_0 e^{-\kappa t} + b (1 - e^{-\kappa t}) \\ X_t = \sigma e^{-\kappa t} \int_0^t e^{\kappa s}\, dB_s = S_t - \big[ S_0 e^{-\kappa t} + b (1 - e^{-\kappa t}) \big]. \end{cases}    (14)

Instead of estimating the price of a vanilla option as proposed in [11], we are interested in estimating the future probability of an event using the Q-learning algorithm and a digital function.
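To make the data generation concrete, a minimal Monte Carlo sketch of equations (12)-(13) in Python is given below. The function name, the use of the exact Ornstein-Uhlenbeck transition for the time discretization, and the parameter values are our own illustrative choices and are not prescribed by the paper.

import numpy as np

def vasicek_paths(s0, kappa, b, sigma, T, n_steps, n_paths, seed=0):
    # Simulate dS = kappa*(b - S) dt + sigma dB with the exact one-step
    # Gaussian transition of the Ornstein-Uhlenbeck process.
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    decay = np.exp(-kappa * dt)
    std = sigma * np.sqrt((1.0 - np.exp(-2.0 * kappa * dt)) / (2.0 * kappa))
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = s0
    for t in range(n_steps):
        z = rng.standard_normal(n_paths)
        paths[:, t + 1] = paths[:, t] * decay + b * (1.0 - decay) + std * z
    return paths

# illustrative parameter values only
S = vasicek_paths(s0=1.0, kappa=0.01, b=1.0, sigma=0.15, T=0.5, n_steps=5, n_paths=20000)
t_grid = np.linspace(0.0, 0.5, 6)
drift = 1.0 * np.exp(-0.01 * t_grid) + 1.0 * (1.0 - np.exp(-0.01 * t_grid))  # S_0 e^{-kt} + b(1 - e^{-kt})
X = S - drift  # drift-removed state in the spirit of eq. (14)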
First, we define the terminal condition reflecting this objective with the following equation:

Q^*_T(X_T, a_T = 0) = -\Pi_T - \lambda \mathrm{Var}\big[\Pi_T(X_T)\big]    (15)

where Π_T is the digital function at time t = T, defined such that

\Pi_T = \mathbf{1}_{S_T \geq K} = \begin{cases} 1 & \text{if } S_T \geq K \\ 0 & \text{otherwise} \end{cases}    (16)

and λ Var[Π_T(X_T)] is a regularization term with λ ∈ ℝ^+, λ ≪ 1. For t = T-1, ..., 0, the quantity Π_t follows the recursion

\Pi_t = \gamma \left( \Pi_{t+1} - a_t \Delta S_t \right) \quad \text{with} \quad \Delta S_t = S_{t+1} - \frac{S_t}{\gamma} = S_{t+1} - e^{r \Delta t} S_t    (17)
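As a rough illustration of the digital terminal function (16) and the backward recursion (17), the snippet below evaluates them path-wise. The array layout and function names are assumptions of ours, not the paper's code.

import numpy as np

def terminal_digital(S_T, K):
    # Digital (Heaviside) terminal function of eq. (16): 1 if S_T >= K else 0.
    return (S_T >= K).astype(float)

def backward_pi(S, a, r, dt, K):
    # Backward recursion of eq. (17) along Monte Carlo paths.
    # S: (n_paths, n_steps+1) simulated paths, a: actions (n_paths, n_steps).
    gamma = np.exp(-r * dt)
    n_paths, n_steps = a.shape
    Pi = np.zeros((n_paths, n_steps + 1))
    Pi[:, -1] = terminal_digital(S[:, -1], K)
    for t in range(n_steps - 1, -1, -1):
        dS = S[:, t + 1] - np.exp(r * dt) * S[:, t]   # Delta S_t of eq. (17)
        Pi[:, t] = gamma * (Pi[:, t + 1] - a[:, t] * dS)
    return Pi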
Following the definition of equations (6) and (17), we express the one-step time dependent random reward with respect to the cross-sectional information \mathcal{F}_t as follows:

R_t(X_t, a_t, X_{t+1}) = \gamma a_t \Delta S_t(X_t, X_{t+1}) - \lambda \mathrm{Var}[\Pi_t | \mathcal{F}_t]
\quad \text{with} \quad \mathrm{Var}[\Pi_t | \mathcal{F}_t] = \gamma^2 \mathbb{E}_t \Big[ \hat{\Pi}_{t+1}^2 - 2 a_t \Delta \hat{S}_t \hat{\Pi}_{t+1} + a_t^2 \big(\Delta \hat{S}_t\big)^2 \Big]    (18)

The term \Delta \bar{S}_t is defined such that \Delta \bar{S}_t = \frac{1}{N} \sum_{k=1}^{N} \Delta S_t^k, \Delta \hat{S}_t = \Delta S_t - \Delta \bar{S}_t and \hat{\Pi}_{t+1} = \Pi_{t+1} - \bar{\Pi}_{t+1} with \bar{\Pi}_{t+1} = \frac{1}{N} \sum_{k=1}^{N} \Pi_{t+1}^k. Because of the regularizer term, the expected reward R_t is quadratic in a_t and has a finite solution. We therefore inject the one-step time dependent random reward equation (18) into the Bellman optimality equation (10) to obtain the following Q-learning update, Q*, and the optimal action, a*, to be solved within a backward loop ∀ t = T-1, ..., 0:

Q^*_t(X_t, a_t) = \gamma \mathbb{E}_t \big[ Q^*_{t+1}(X_{t+1}, a^*_{t+1}) + a_t \Delta S_t \big] - \lambda \mathrm{Var}[\Pi_t | \mathcal{F}_t]
\qquad
a^*_t(X_t) = \mathbb{E}_t \Big[ \Delta \hat{S}_t \hat{\Pi}_{t+1} + \frac{1}{2 \lambda \gamma} \Delta S_t \Big] \Big[ \mathbb{E}_t \big[ (\Delta \hat{S}_t)^2 \big] \Big]^{-1}    (19)

We refer to [11] for further details about the analytical solution a* of the Q-learning update (19). Our approach uses the N Monte Carlo paths simultaneously to determine the optimal action a* and the optimal action-value function Q* to learn the policy π*. We thus do not need an explicit conditioning of X_t at time t. We assume a set of basis functions {Φ_n(x)} on which the optimal action a*_t(X_t) and the optimal action-value function Q*_t(X_t, a*_t) can be expanded:

a^*_t(X_t) = \sum_{n}^{M} \phi_{nt} \Phi_n(X_t) \quad \text{and} \quad Q^*_t(X_t, a^*_t) = \sum_{n}^{M} \omega_{nt} \Phi_n(X_t)    (20)

The coefficients φ and ω are computed recursively backward in time ∀ t = T-1, . . . , 0.
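The concrete choice of the basis {Φ_n} is not specified in this excerpt; as a placeholder, a simple monomial basis on a standardized state could be used, as sketched below. The function name and the standardization step are our own choices, and a spline basis would serve equally well.

import numpy as np

def poly_basis(x, M=4):
    # Toy basis {Phi_n} for the expansion of eq. (20): monomials of a
    # standardized state, returned as an (N, M) matrix.
    z = (x - x.mean()) / (x.std() + 1e-12)
    return np.vander(z, N=M, increasing=True)  # columns 1, z, z^2, ..., z^{M-1}

Any such matrix Φ of shape (N, M) can then be plugged into the linear systems that follow.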
We subsequently define the minimization problem to evaluate the coefficients φ_{nt}:

G_t(\phi) = \sum_{k=1}^{N} \bigg[ - \sum_{n}^{M} \phi_{nt} \Phi_n(X_t^k) \Delta S_t^k + \gamma \lambda \bigg( \hat{\Pi}_{t+1}^k - \sum_{n}^{M} \phi_{nt} \Phi_n(X_t^k) \Delta \hat{S}_t^k \bigg)^2 \bigg]    (21)

Equation (21) leads to the following set of linear equations ∀ n = 1, ..., M:

A_{nm}^{(t)} = \sum_{k=1}^{N} \Phi_n(X_t^k) \Phi_m(X_t^k) \big( \Delta \hat{S}_t^k \big)^2, \qquad
B_n^{(t)} = \sum_{k=1}^{N} \Phi_n(X_t^k) \Big[ \hat{\Pi}_{t+1}^k \Delta \hat{S}_t^k + \frac{1}{2 \gamma \lambda} \Delta S_t^k \Big],
\quad \text{with} \quad \sum_{m}^{M} A_{nm}^{(t)} \phi_{mt} = B_n^{(t)}    (22)

Therefore, the coefficients of the optimal action a*_t(X_t) are determined by

\phi^*_t = A_t^{-1} B_t.    (23)

We hereinafter use the Fitted Q Iteration (FQI) [19, 20] to evaluate the coefficients ω. The optimal action-value function Q*_t(X_t, a_t) is represented in its matrix form according to the basis function expansion of equation (20):

Q^*_t(X_t, a_t) = \big( 1, \; a_t, \; \tfrac{1}{2} a_t^2 \big)
\begin{pmatrix} W_{11}(t) & W_{12}(t) & \dots & W_{1M}(t) \\ W_{21}(t) & W_{22}(t) & \dots & W_{2M}(t) \\ W_{31}(t) & W_{32}(t) & \dots & W_{3M}(t) \end{pmatrix}
\begin{pmatrix} \Phi_1(X_t) \\ \vdots \\ \Phi_M(X_t) \end{pmatrix}
= \mathbf{A}_t^{\top} W_t \Phi(X_t) = \mathbf{A}_t^{\top} U_W(t, X_t)    (24)

Based on the least-square optimization problem, the coefficients W_t are determined using backpropagation ∀ t = T-1, ..., 0:

L_t(W_t) = \sum_{k=1}^{N} \bigg( R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*_{t+1}(X_{t+1}, a_{t+1}) - W_t \Psi_t(X_t, a_t) \bigg)^2
\quad \text{with} \quad W_t \Psi(X_t, a_t) + \epsilon \xrightarrow[\epsilon \to 0]{} R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*_{t+1}(X_{t+1}, a_{t+1})    (25)

for which we derive the following set of linear equations:

S_{nm}^{(t)} = \sum_{k=1}^{N} \Psi_n(X_t^k, a_t^k) \Psi_m(X_t^k, a_t^k), \qquad
M_n^{(t)} = \sum_{k=1}^{N} \Psi_n(X_t^k, a_t^k) \Big[ \eta \Big( R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*_{t+1}(X_{t+1}, a_{t+1}) \Big) \Big]
\quad \text{with} \quad \eta \sim B(N, p)    (26)

The term B(N, p) represents the binomial distribution for N samples with probability p. It plays the role of a dropout function when evaluating the matrix M_t, to compensate for the well-known drawback of the Q-learning algorithm, the overestimation of the Q-function values. We finally reach the definition of the optimal weights to determine the optimal action a*:

W^*_t = S_t^{-1} M_t    (27)

The proposed model does not require any assumption on the dynamics of the time series, neither transition probabilities nor policy or reward functions. It is an off-policy, model-free approach. The computation of the optimal policy, the optimal action and the optimal Q-function that leads to the future event probabilities is summed up in Algorithm 1.
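To make the two linear solves concrete before Algorithm 1, a minimal NumPy sketch follows. It assumes that the basis matrix Φ (for instance the poly_basis sketched above) and the de-meaned quantities have already been evaluated on the N Monte Carlo paths; the function names, array shapes and the treatment of η as a per-path Bernoulli (dropout) mask are our own illustrative choices rather than the paper's implementation.

import numpy as np

def joint_basis(Phi, a):
    # Psi(X_t, a_t): the (1, a, a^2/2) x Phi structure implied by eq. (24), flattened.
    return np.hstack([Phi, a[:, None] * Phi, 0.5 * (a[:, None] ** 2) * Phi])

def solve_phi(Phi, dS, dS_hat, Pi_hat_next, gamma, lam):
    # Optimal-action coefficients phi*_t from the normal equations (22)-(23).
    # Phi: (N, M) basis values; dS, dS_hat, Pi_hat_next: (N,) arrays.
    A = (Phi * (dS_hat ** 2)[:, None]).T @ Phi                      # A_nm of eq. (22)
    B = Phi.T @ (Pi_hat_next * dS_hat + dS / (2.0 * gamma * lam))   # B_n of eq. (22)
    return np.linalg.solve(A, B)                                    # phi* = A^{-1} B, eq. (23)

def solve_W(Psi, bellman_target, p=0.5, rng=None):
    # FQI weights W*_t from eqs. (26)-(27); eta acts as a dropout mask on the targets.
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.binomial(1, p, size=bellman_target.shape)             # eta ~ B(., p) per path
    S = Psi.T @ Psi                                                 # S_nm (normal equations)
    M = Psi.T @ (eta * bellman_target)                              # M_n of eq. (26)
    return np.linalg.solve(S, M)                                    # W* = S^{-1} M, eq. (27)

With phi* in hand, the optimal action on each path is simply Phi @ phi*, and Psi @ W* gives the fitted Q-values that serve as targets at the previous time step of the backward loop.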
Algorithm 1:
Q-learning to evaluate the optimal policy of money management
Data: time series of maturity T, either from generated or true data
Result: optimal Q-function Q*, optimal action a*, value of the digital function Π_0
begin
    /* Condition at T */
    a*_T(X_T) = 0
    Q_T(X_T, a_T) = -Π_T = -1_{S_T ≥ K} using equation (16)
    Q*_T(X_T, a*_T) = Q_T(X_T, a_T)
    /* Backward loop */
    for t ← T-1 to 0 do
        /* Evaluate the coefficients φ */
        compute A_t, B_t using equation (22)
        φ*_t ← A_t^{-1} B_t
        /* Evaluate the coefficients ω */
        compute S_t, M_t using equation (26)
        W*_t ← S_t^{-1} M_t
        a*_t(X_t) = Σ_n^M φ*_{nt} Φ_n(X_t)
        Q*_t(X_t, a_t) = A_t^⊤ W*_t Φ(X_t)
    /* Compute the digital function value to estimate the event probability at t = 0 */
    print( Π_0 = mean(Q*_0) )
    return

We empirically evaluate the performance of MQLV. We initially highlight the similarities between historical payment transactions and Vasicek generated transactions. We then underline MQLV's capability to learn the optimal policy of money management, based on the estimation of future event probabilities, in comparison to the closed formula of [12,13], hereinafter denoted by the BSM's closed formula. We rely on synthetic data sets because of the privacy and confidentiality issues of the retail banking data sets.
Data Availability and Data Description
One of our contributions is to bring a RL framework designed for retail banking. However, none of the data sets can be released publicly because of the highly sensitive information they contain. We therefore show the similarities between a small sample of anonymized transactions and Vasicek generated transactions [16]. We then use the Vasicek mean reverting stochastic diffusion process to generate larger synthetic data sets similar to the original retail banking data sets. The mean reverting dynamic is particularly interesting since it reflects a wide range of retail banking transactions including the credit card transactions, the savings history or the clients' spendings. Three different data sets were generated to avoid any bias that could have been introduced by using only one data set. We choose to differentiate the number of Monte Carlo paths between the data sets to assess the influence of the sampling size on the results. The first, second and third data sets contain respectively 20,000, 30,000 and 40,000 paths. We release publicly the data sets to ensure the reproducibility of the experiments.

Experimental Setup and Code Availability
In our experiments, we generate synthetic data sets using the Vasicek model with an initial value S_0 = 1.0 at t = 0, a maturity of six months T = 0.5, a speed of reversion a = 0.01, a long term mean b = 1 and a volatility σ = 0.15. The numbers were fixed such that any limitation of the methodology would be quickly observed, because the choice of the parameters of the Vasicek model does not have any influence on the results of the Q-learning approach. The number of time steps is fixed equal to 5. We additionally use different strike values for the experiments, explicitly mentioned in the Results and Discussions subsection. The simulations were performed on a computer with 16GB of RAM, an Intel i7 CPU and a Tesla K80 GPU accelerator. To ensure the reproducibility of the experiments, the code and the data sets are available at https://github.com/dagrate/MQLV.

Results and Discussions about MQLV
As aforementioned, we cannot release publicly an anonymized transactions data set because of privacy, confidentiality and regulatory issues. We consequently highlight the similarities between the dynamic of a small sample of anonymized transactions and Vasicek generated transactions for one client [21] in figure 1. The financial transactions in retail banking are periodic and often fluctuate around a long term mean, reflecting the frequency and the amounts of the spending habits of the clients. The akin dynamic of the original and the generated transactions is highlighted by the small RMSE of 0.03. We also performed a least squares calibration of the Vasicek parameters to assess the model's plausibility. We can observe in table 1 that the Vasicek parameters have the same magnitude and, therefore, it supports the hypothesis that the Vasicek model could be used to generate synthetic transactions.

We present the core of our contribution in the following experiment. We aim at learning the optimal policy of money management. It is particularly interesting for bank loan applications where the differences between a client's spending policy and the optimal policy can be compared. We show that MQLV is capable of evaluating accurately the probability of a default event using a digital function, which highlights the learning of the optimal policy of money management. Effectively, if the MQLV's learned policy is different from the optimal policy, then the probabilities of default events are not accurate. The estimation of future event probabilities for different strike values is represented in figure 2.

Fig. 1.
Samples of original and Vasicek generated transactions for one client. The two samples oscillate around a long term mean of 1 and have a similar pattern, highlighted by the small RMSE of 0.03 in table 1.
Table 1.
RMSE error between the samples of original transactions and generated Vasicek transactions of figure 1. We also calibrated the Vasicek parameters according to the original transactions to validate the model's plausibility.
Description                       Value
RMSE                              0.0335
Vasicek speed reversion a
Vasicek long term mean b
Vasicek volatility σ

We rely on the BSM's closed formula for vanilla option pricing [12,13] to approximate the digital function values [15]. We used, therefore, the BSM's values as reference values to cross-validate the MQLV's values. MQLV achieves a close representation of the event probabilities for the different strike values in figure 2. The curves of both the MQLV and the BSM's approaches are similar, with a RMSE of 1.5016. This result highlights that the learned Q-learning policy of MQLV is sufficiently close to the optimal policy to compute event probabilities almost identical to the probabilities of the BSM's formula approximation.

We gathered quantitative results in table 2 for a deeper analysis of the MQLV's results. The event probability values are listed for the three data sets. We chose a set of parameters for the Vasicek model such that our configuration is free of any time-dependency. We therefore expect a probability value of 50% at a threshold of 1, because the standard deviation of the generated data sets is only induced by the standard deviation of the normal distribution used to simulate the Brownian motion. Surprisingly, the MQLV values at a strike of 1 are closer to 50% than the BSM's values for all the data sets. We can conclude, subsequently, that, for our configuration, MQLV is capable of learning the optimal policy of money management, which is reflected by the accurate evaluation of the event probabilities.

We chose to generate three new data sets with new Vasicek parameters a and σ to underline the potential of MQLV and the universality of the results. In table 3, we computed the event probabilities for different strikes for the newly generated data sets. The parameter b remains unchanged since we want to keep a configuration free of any time-dependency. We notice that MQLV is capable of estimating a probability of 50% for a strike of 1, which can only be obtained if MQLV is able to learn the optimal policy. We also observe that the BSM's approximation leads to a lower accuracy. We showed in this experiment that our model-free and off-policy RL approach, MQLV, is able to learn the optimal policy, reflected by the accurate probability values, independently of the data sets considered and of the Vasicek parameters.
Fig. 2.
Event probability values calculated by MQLV and the BSM's closed formula approximation for 40,000 Monte Carlo paths with Vasicek parameters a = 0.01, b = 1 and σ = 0.15. The BSM's closed formula approximation values are used as reference values. The event probabilities of MQLV are close to the BSM's values with a total RMSE of 1.502. It illustrates that MQLV is able to learn the optimal policy, leading to accurate event probabilities.
Table 2.
Valuation differences of the digital values for event probabilities according to different strikes between the BSM's closed formula approximation and MQLV. Given our time-uniform configuration, the event probability values should be close to 50% for a strike value of 1. The MQLV values are close to the theoretical target of 50% at a strike of 1, highlighting the MQLV's capabilities to learn the optimal policy. The BSM's closed formula approximation slightly underestimates the probability values.
Data Set   Number of Paths   Strike Values   BSM's Approx. Values (%)   MQLV Values (%)   Absolute Difference
1          20,000            0.92            76.810
Table 3.
Event probabilities for data sets generated with different Vasicek parameters a and σ. The parameter b remains unchanged to keep a configuration free of any time-dependency, to facilitate the explainability of the results. We can deduce that MQLV is able to learn the optimal policy because the MQLV's probabilities are close to the theoretical target of 50% at a strike of 1. MQLV is also more accurate than the BSM's formula in this configuration.

Parameters a; b; σ   Number of Paths   Strike Values   BSM's App. Values (%)   MQLV Values (%)   Absolute Difference
0.01; 1; 0.10        50,000            0.98            59.856

Limitations of the BSM's closed formula used for cross validation
In our experiments, we observed, surprisingly, that the BSM's closed formula approximation underestimates the event probability values. The volatility is the only parameter playing a significant role in the generation of the time series and, therefore, the event probability should be equal to the probability mass on either side of the mean of the distribution used to generate the random numbers. The Brownian motion is simulated with a standard normal distribution, for which this probability is 0.5. The BSM's closed formula did not, however, lead to a probability of 0.5 but to slightly smaller values because of the limits of its theoretical framework [12,13]. We hence observed that MQLV was more accurate than the BSM's closed formula in our configuration.
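For completeness, one standard way to obtain such a reference value in a Black-Scholes-Merton setting is the cash-or-nothing digital price e^{-rT} N(d2), or equivalently a tight call spread built from vanilla prices. Whether the paper uses exactly this form is not specified in the text, so the snippet below is only indicative of the kind of baseline involved; the function names are ours.

import numpy as np
from scipy.stats import norm

def bsm_digital(s0, K, r, sigma, T):
    # Cash-or-nothing digital call under Black-Scholes-Merton: e^{-rT} N(d2),
    # used here as an indicative reference for the probability that S_T >= K.
    d2 = (np.log(s0 / K) + (r - 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    return np.exp(-r * T) * norm.cdf(d2)

def call_spread_digital(vanilla_price, K, eps=1e-3, **kw):
    # Alternative: approximate the digital payoff as a tight call spread,
    # (C(K - eps) - C(K + eps)) / (2 * eps), for any vanilla pricer.
    return (vanilla_price(K=K - eps, **kw) - vanilla_price(K=K + eps, **kw)) / (2 * eps)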
The foundations of modern reinforcement learning described in [2,4] established the theoretical framework to learn good policies for sequential decision problems by proposing a formulation of the cumulative future reward signal. The Q-learning algorithm introduced in [3] is one of the cornerstones of all recent reinforcement learning publications. However, the convergence of the Q-learning algorithm was only addressed several years later. It was shown that the Q-learning algorithm with non-linear function approximators [22] and with off-policy learning [23] could provoke a divergence of the Q-network. The reinforcement learning community therefore focused on linear function approximators [22] to ensure convergence.

The emergence of neural networks and deep learning [24] contributed to address the use of reinforcement learning with neural networks. At an early stage, deep auto-encoders were used to extract feature spaces to solve reinforcement learning tasks [25]. Thanks to the release of the Atari 2600 emulator [26], a public data set became available, answering the needs of the RL community for larger simulations. The Atari emulator allowed a proper performance benchmark of the different reinforcement learning algorithms and offered the possibility to test various architectures. The Atari games were used to introduce the concept of deep reinforcement learning [1,7]. The authors used a convolutional neural network trained with a variant of Q-learning to successfully learn control policies directly from high dimensional sensory inputs. They reached human-level performance on many of the Atari games. Shortly after, deep reinforcement learning was challenged by double Q-learning within a deep reinforcement learning framework [9]. The double Q-learning algorithm was initially introduced in [19] in a tabular setting. The double deep Q-learning gave more accurate estimates and led to much higher scores than the ones observed in [1,7]. An ongoing work is consequently to further improve the results of the double deep Q-learning algorithms through different variants. The authors of [27] used a quantile regression to approximate the full quantile function of the state-action return distribution, leading to a large class of risk-sensitive policies. It allowed them to further improve the scores on the Atari 2600 games simulator. Similarly, a new algorithm called C51, which applies the Bellman equation to the learning of the approximate value distribution, was designed in [28]. They showed state-of-the-art results on the Atari 2600 emulator.

Other publications meanwhile focused on model-free policies and the actor-critic framework. Stochastic policies were trained in [29] with a replay buffer to avoid divergence. It was shown in [30] that deterministic policy gradients (DPG) exist, even in a model-free environment. The DPG approach was subsequently extended in [31] using a deviator network. Continuous control policies were learned using backpropagation, introducing the Stochastic Value Gradients SVG(0) and SVG(1) in [32]. Recently, Deep Deterministic Policy Gradient (DDPG) was presented in [10] to learn competitive policies using an actor-critic model-free algorithm based on the DPG that operates over continuous action spaces.
We introduced Modified Q-Learning for Vasicek, or MQLV, a new model-free and off-policy reinforcement learning approach capable of evaluating an optimal policy of money management based on the aggregated transactions of the clients.
MQLV is part of a banking strategy that looks to minimize the customer churn by including more transparency and more personalization in the decision process related to bank loan applications or credit card limits. It relies on a digital function, a Heaviside step function expressed in its discrete form, to estimate the future probability of an event such as a payment default. We discussed its relation with the Bellman optimality equation and the Q-learning update. We conducted experiments on synthetic data sets because of the privacy and confidentiality issues related to the retail banking data sets. The generated data sets followed a mean reverting stochastic diffusion process, the Vasicek model, simulating retail banking data sets such as transaction payments. Our experiments showed the performance of MQLV with respect to the BSM's closed formula for vanilla options. We also highlighted that MQLV is able to determine an optimal policy, an optimal Q-function, the optimal actions and the optimal states, reflected by accurate probabilities. Surprisingly, we observed that MQLV led to more accurate event probabilities than the popular BSM's formula in our configuration.

Future work will address the creation of a fully anonymized data set illustrating the retail banking daily transactions, in compliance with privacy, confidentiality and regulatory constraints. We will also evaluate MQLV's performance on data sets that violate the Vasicek assumptions. We furthermore observed that the Q-learning update could misestimate the real probability values for simulations involving a small temporal discretization such as ∆t = 200. Preliminary results showed it is provoked by the basis function approximator error. We will address this point in future research. We will finally extend the Q-learning update to other schemes for improved accuracy and incorporate a deep learning framework.

References
1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
2. Sutton, R.S.: Temporal credit assignment in reinforcement learning (1984)
3. Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge (1989)
4. Williams, R.: A class of gradient-estimation algorithms for reinforcement learning in neural networks. In: Proceedings of the International Conference on Neural Networks. pp. II-601 (1987)
5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097-1105 (2012)
6. Sermanet, P., Kavukcuoglu, K., Chintala, S., LeCun, Y.: Pedestrian detection with unsupervised multi-stage feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3626-3633 (2013)
7. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)
8. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8(3-4), 279-292 (1992)
9. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI. vol. 2, p. 5. Phoenix, AZ (2016)
10. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
11. Halperin, I.: QLBS: Q-learner in the Black-Scholes(-Merton) worlds. arXiv preprint arXiv:1712.04609 (2017)
12. Black, F., Scholes, M.: The pricing of options and corporate liabilities. Journal of Political Economy 81(3), 637-654 (1973)
13. Merton, R.C.: Theory of rational option pricing. The Bell Journal of Economics and Management Science 4, 141-183 (1973)
14. Wilmott, P.: Paul Wilmott on Quantitative Finance. John Wiley & Sons (2013)
15. Hull, J.C.: Options, Futures and Other Derivatives. Pearson Education India (2003)
16. Vasicek, O.: An equilibrium characterization of the term structure. Journal of Financial Economics 5(2), 177-188 (1977)
17. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (2018)
18. Robbins, H., Monro, S.: A stochastic approximation method. In: Herbert Robbins Selected Papers, pp. 102-109. Springer (1985)
19. Hasselt, H.V.: Double Q-learning. In: Advances in Neural Information Processing Systems. pp. 2613-2621 (2010)
20. Murphy, S.A.: A generalization error for Q-learning. Journal of Machine Learning Research 6(Jul), 1073-1097 (2005)
21. Santander: Santander product recommendation (2016)
22. Tsitsiklis, J.N., Van Roy, B.: Analysis of temporal-difference learning with function approximation. In: Advances in Neural Information Processing Systems. pp. 1075-1081 (1997)
23. Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In: Machine Learning Proceedings 1995, pp. 30-37. Elsevier (1995)
24. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)
25. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement learning. In: The 2010 International Joint Conference on Neural Networks (IJCNN). pp. 1-8. IEEE (2010)
26. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253-279 (2013)
27. Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923 (2018)
28. Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887 (2017)
29. Wawrzyński, P., Tanwani, A.K.: Autonomous reinforcement learning with experience replay. Neural Networks 41