Adaptive trading strategies across liquidity pools

Abstract

In this article, we provide a flexible framework for optimal trading in an asset listed on different venues. We take into account the dependencies between the imbalance and spread of the venues, and allow for partial execution of limit orders at different limits as well as market orders. We present a Bayesian update of the model parameters to take into account possibly changing market conditions and propose extensions to include short/long trading signals, market impact or hidden liquidity. To solve the stochastic control problem of the trader we apply the finite difference method and also develop a deep reinforcement learning algorithm allowing to consider more complex settings.


Bastien Baldacci†, Iuliia Manziuk‡

August 19, 2020


Keywords: cross-platform trading, optimal trading, Bayesian learning, adaptive trading strategies, deep reinforcement learning, stochastic control.

A vast majority of quantitative trading strategies are based on cross-platform arbitrage. These strategies involve cross-listed stocks, that is, assets traded on two or more liquidity venues. In [21], the authors investigate the prices of cross-listed stocks in different venues and provide evidence of price deviations for the majority of the 600 cross-listed stocks they studied. In [3], the authors highlight the mispricing that can exist between a domestic stock and its ADR (American Depositary Receipt) counterpart. The study conducted in [26] for US-UK cross-listed stocks shows that markets for cross-listed securities are among the most heavily arbitraged. In particular, a higher arbitrage potential can be exploited for cross-listed stocks from emerging markets, see [24].

Usually, the trader builds an execution curve targeting, for example, an Implementation Shortfall or a volume-weighted average price (VWAP). Then, he buys or sells shares of the asset following the […]

∗ This work benefits from the financial support of the Chaires Analytics and Models for Regulation, Financial Risk and Finance and Sustainable Development. Bastien Baldacci gratefully acknowledges the financial support of the ERC Grant 679836 Staqamof. The authors would like to thank Joffrey Derchu (Ecole Polytechnique), Mathieu Rosenbaum (Ecole Polytechnique) and Olivier Guéant (Université Paris-1 Panthéon-Sorbonne) for numerous fruitful discussions. In particular, Mathieu Rosenbaum deserves warm thanks for his careful reading of the paper and his many suggestions to improve its quality.
† École Polytechnique, CMAP, 91128, Palaiseau, France.
‡ École Polytechnique, CMAP, 91128, Palaiseau, France.

The model presented in this section is a generalization of the classic optimal trading framework, developed notably in [5, 17, 19, 25] and in the reference books [12, 16], to the case of several liquidity venues.

We consider a trader acting on $N$ liquidity platforms operating with limit order books over the time interval $[0, T]$. He trades continuously on each venue by sending limit and market orders. For $n \in \{1, \dots, N\}$, the $n$-th venue is characterized by the following continuous-time Markov chains:

• the bid-ask spread process $(\psi^n_t)_{t \in [0,T]}$ taking values in the state space $\psi^n = \{\delta^n, \dots, J\delta^n\}$,
• the imbalance process $(I^n_t)_{t \in [0,T]}$ taking values in the state space $I^n = \{I^n_1, \dots, I^n_K\}$,

where $J, K \in \mathbb{N}$ denote the number of possible spreads and imbalances respectively and $\delta^n$ stands for the tick size of the $n$-th venue. We define the sets $\Psi = \{\Psi_1, \dots, \Psi_{\#\Psi}\}$ and $\mathcal{I} = \{\mathcal{I}_1, \dots, \mathcal{I}_{\#\mathcal{I}}\}$ of disjoint intervals, representing different market regimes of interest in terms of spreads and imbalances.

Example 2.1. Assume for all $n \in \{1, \dots, N\}$ that $\delta^n = \delta$. The set $\Psi = \big\{\delta, \{2\delta, 3\delta\}, \{4\delta, 5\delta\}\big\}$ denotes three spread regimes: low (one tick), medium (two or three ticks), and high (four or five ticks).

Example 2.2. Assume for all $n \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$ that $I^n_k = I_k$. In this case the set $\mathcal{I}$ consists of five disjoint intervals denoting five regimes of imbalance: low, medium on the ask (resp. bid), and high on the ask (resp. bid).

Whenever the spread and the imbalance of each venue enter the state $\mathbf{k} = (k^\psi, k^I) \in \mathcal{K}$, where $\mathcal{K} = \prod_{n=1}^N \psi^n \times \prod_{n=1}^N I^n$, they remain in this state for an exponentially distributed time with rate $\nu_{\mathbf{k}}$ (that is, with mean $1/\nu_{\mathbf{k}}$, consistently with the generator below). We define a transition matrix $P = (p_{\mathbf{k}\mathbf{k}'})$, $(\mathbf{k}, \mathbf{k}') \in \mathcal{K}^2$, and the corresponding intensity vector $\nu = (\nu_{\mathbf{k}})^T_{\mathbf{k} \in \mathcal{K}}$. We assume that $p_{\mathbf{k}\mathbf{k}} = 0$, meaning that we cannot come to the same state twice in a row. The infinitesimal generator of the processes can be written as

$$r_{\mathbf{k}\mathbf{k}'} = \nu_{\mathbf{k}}\, p_{\mathbf{k}\mathbf{k}'} \ \text{ if } \mathbf{k} \neq \mathbf{k}', \qquad r_{\mathbf{k}\mathbf{k}} = -\sum_{\mathbf{k}' \neq \mathbf{k}} r_{\mathbf{k}\mathbf{k}'} = -\nu_{\mathbf{k}} \ \text{ otherwise.}$$

Remark 2.3.

This general formulation allows for a full coupling between the spread and imbalance of all venues. If one wants a more parsimonious model, the following simplifications can be made. When the spread (resp. imbalance) of the $n$-th venue enters the state $k$, it remains there for an exponentially distributed time with rate $\nu^{n,\psi}_k$ (resp. $\nu^{n,I}_k$). Therefore, we define a transition matrix $P^{n,\psi} = (p^{n,\psi}_{kk'})$, $n \in \{1, \dots, N\}$, $(k, k') \in \psi^n \times \psi^n$, such that $p^{n,\psi}_{kk} = 0$, and the corresponding intensity vectors $\nu^{n,\psi} = (\nu^{n,\psi}_1, \dots, \nu^{n,\psi}_J)^T$. Similarly we define a transition matrix $P^{n,I}$ for the imbalance. Then, the infinitesimal generator of the processes can be written as

$$r^{n,\psi}_{kk'} = \nu^{n,\psi}_k\, p^{n,\psi}_{kk'} \ \text{ if } k \neq k', \qquad r^{n,\psi}_{kk} = -\sum_{k' \neq k} r^{n,\psi}_{kk'} = -\nu^{n,\psi}_k \ \text{ otherwise.}$$

This framework will be used in Section 5, where we present the numerical results.
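As an illustration of the generator construction used in this section, the following sketch assembles the matrix $r$ from a jump matrix $P$ and a vector of intensities $\nu$. It is a minimal example under stated assumptions, not the authors' code: the function name `generator_from_jump_chain` and the three-state numbers are hypothetical, and $\nu_k$ is treated as the exit rate of state $k$, as the generator formula implies.

```python
import numpy as np

def generator_from_jump_chain(P, nu):
    """Build r with r[k, k'] = nu[k] * P[k, k'] for k != k' and r[k, k] = -nu[k]."""
    P = np.asarray(P, dtype=float)
    nu = np.asarray(nu, dtype=float)
    if not np.allclose(np.diag(P), 0.0):
        raise ValueError("p_kk must be 0: the chain cannot jump to its current state")
    if not np.allclose(P.sum(axis=1), 1.0):
        raise ValueError("each row of P must sum to 1")
    R = nu[:, None] * P          # off-diagonal entries nu_k * p_kk'
    np.fill_diagonal(R, -nu)     # diagonal entries -nu_k
    return R

# Hypothetical three-state example (values chosen for illustration only).
P_example = np.array([[0.0, 0.7, 0.3],
                      [0.5, 0.0, 0.5],
                      [0.2, 0.8, 0.0]])
nu_example = np.array([2.0, 1.0, 4.0])
R_example = generator_from_jump_chain(P_example, nu_example)
```

Each row of the resulting matrix sums to zero, which is a quick sanity check when the matrix is later re-estimated from data.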

In what follows, the trader designs his strategy on the ask side of the market (optimal liquidation problem). The extension to trading on both sides of the market is straightforward and does not increase the problem's dimensionality.

The number of (possibly partially) filled ask orders in the venue $n$ is modeled by a Cox process denoted by $N^n$, $n \in \{1, \dots, N\}$, with intensity $\lambda^n(\psi_t, I_t, p^n_t, \ell_t)$, where $p^n_t \in Q^n_\psi$ represents the limit at which the trader sends a limit order of size $\ell^n_t$, and

$$Q^n_\psi = \begin{cases} \{0, 1\} & \text{if } \psi^n = \delta^n, \\ \{-1, 0, 1\} & \text{otherwise,} \end{cases}$$

$$\mathcal{A} = \Big\{ (\ell_t)_{t \in [0,T]},\ \mathcal{F}\text{-predictable, s.t. for all } t \in [0,T],\ 0 \le \sum_{n=1}^N \ell^n_t \le q_t \Big\},$$

where $(q_t)_{t \in [0,T]}$ is defined in Equation (2.1). In practice, for $n \in \{1, \dots, N\}$, when the spread is equal to the tick size, the trader can post at the first best limit ($p^n = 0$) or at the second best limit ($p^n = 1$). When the spread is equal to two ticks or more, the trader can also create a new best limit ($p^n = -1$). The arrival intensity of counterpart market orders at time $t$ on the venue $n \in \{1, \dots, N\}$ at the limit $p \in Q^n_\psi$, given a couple $(\psi_t, I_t) = \mathbf{m}$ of spread and imbalance on each venue, is equal to $\lambda^{n,\mathbf{m},p} > 0$. When the trader posts limit orders of volume $\ell^n_t$ on the $n$-th venue for $n \in \{1, \dots, N\}$, the probability that it is executed is equal to $f^\lambda(\ell_t)$, where $f^\lambda(\cdot) \in [0, 1]$ is a continuously differentiable function, decreasing with respect to each of its coordinates.

Therefore, the arrival intensity of a market order filling the limit order of the trader on the $n$-th venue at the limit $p^n_t$, given spread and imbalance $(\psi_t, I_t)$, is a multi-regime function defined by

$$\lambda^n(\psi_t, I_t, p^n_t, \ell_t) = f^\lambda(\ell_t) \sum_{\mathbf{m} \in \mathcal{M},\, p \in Q^n_\psi} \lambda^{n,\mathbf{m},p}\, \mathbf{1}_{\{(\psi_t, I_t) \in \mathbf{m},\ p^n_t = p\}},$$

where $\mathcal{M} = \Psi^N \times \mathcal{I}^N$. Moreover, we allow for partial execution, which we represent by random variables $\epsilon^n_t \in [0, 1]$ whose distribution depends on the spread and imbalance of the $N$ venues, as well as on the volume and the limit of the order chosen by the trader. We assume a categorical distribution with $R > 0$ possible values $\omega_r$, $r \in \{1, \dots, R\}$, for each venue, with $\mathbb{P}(\epsilon^n_t = \omega_r) = \rho^{n,r}(\psi_t, I_t, p^n_t, \ell_t)$, where

$$\rho^{n,r}(\psi_t, I_t, p^n_t, \ell_t) = f^\rho(\ell_t) \sum_{\mathbf{m} \in \mathcal{M},\, p \in Q^n_\psi} \rho^{n,\mathbf{m},p,r}\, \mathbf{1}_{\{(\psi_t, I_t) \in \mathbf{m},\ p^n_t = p\}},$$

where $f^\rho(\cdot)$ is a continuously differentiable function, decreasing with respect to each of its coordinates.

Remark 2.4.

The estimation of this kind of parameters for executed proportions can be quite intricate in practice. To simplify, one can assume that $\rho^{n,r}(\psi_t, I_t, p^n_t, \ell_t) = \rho^{n,r} \in [0, 1]$. In practice, this means that each venue has its own execution proportion probabilities, depending on its toxicity.

Finally, we allow for the execution of market orders (denoted by a point process $(J^n_t)_{t \in [0,T]}$) on each venue, of size $(m^n_t)_{t \in [0,T]} \in [0, \overline{m}]$ where $\overline{m} > 0$; when a market order is sent on venue $n$, $J^n_t = J^n_{t-} + 1$. We assume that market orders are always fully executed.

The cash process of the trader at time $t \in [0, T]$ is

$$dX_t = \sum_{n=1}^N \Big( \ell^n_t \big( S_t + \tfrac{\psi^n_t}{2} + p^n_t \delta^n \big)\, \epsilon^n_t\, dN^n_t + m^n_t \big( S_t - \tfrac{\psi^n_t}{2} \big)\, dJ^n_t \Big),$$

where

$$dS_t = \mu\, dt + \sigma\, dW_t, \qquad (\mu, \sigma) \in \mathbb{R} \times \mathbb{R}_+,$$

is the dynamics of the mid-price process. The inventory process of the trader at time $t \in [0, T]$ is defined by

$$q_t = q_0 - \sum_{n=1}^N \Big( \int_0^t \ell^n_u\, \epsilon^n_u\, dN^n_u + \int_0^t m^n_u\, dJ^n_u \Big). \tag{2.1}$$

We also assume that the trader has a pre-computed trading curve $q^\star$ that he wants to follow (Almgren-Chriss trading curve or VWAP strategy, for example).

Then the trader's optimization problem is

$$\sup_{(p, \ell, m) \in Q_\psi \times \mathcal{A} \times [0, \overline{m}]^N} \mathbb{E}\Big[ X_T + q_T S_T - \int_0^T g(q_t - q^\star_t)\, dt \Big], \tag{2.2}$$

where the function $g$ penalizes deviation from the pre-computed optimal trading curve.

2.2 The Hamilton-Jacobi-Bellman quasi-variational inequality

The HJBQVI associated with the optimization problem of the trader (2.2) is the following:

$$0 = \min\Big\{ -\partial_t u(t, x, q, S, \psi, I) + g(q - q^\star_t) - \mu\, \partial_S u - \tfrac{1}{2}\sigma^2\, \partial_{SS} u - \sum_{\mathbf{k} \in \mathcal{K}} r_{(\psi, I), (k^\psi, k^I)} \big( u(t, x, q, S, k^\psi, k^I) - u(t, x, q, S, \psi, I) \big)$$
$$\qquad - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ u\big(t, x + \epsilon^n \ell^n (S + \tfrac{\psi^n}{2} + p^n \delta^n), q - \ell^n \epsilon^n, S, \psi, I\big) - u(t, x, q, S, \psi, I) \Big]\ ;$$
$$\qquad \sum_{n=1}^N \Big( u(t, x, q, S, \psi, I) - \sup_{m^n \in [0, \overline{m}]} u\big(t, x + m^n (S - \tfrac{\psi^n}{2}), q - m^n, S, \psi, I\big) \Big) \Big\}, \tag{2.3}$$

with terminal condition $u(T, x, q, S, \psi, I) = x + qS$, where $\psi = (\psi^1, \dots, \psi^N)$, $I = (I^1, \dots, I^N)$. The expectation in (2.3) is taken over the variables $\epsilon^n$, $n \in \{1, \dots, N\}$. We prove the following theorem in Appendix A:

Theorem 1.

There exists a unique viscosity solution to the HJBQVI (2.3), which coincides with the value function of the control problem of the trader (2.2).

The proof of existence and uniqueness of the viscosity solution relies mainly on adaptations of the theory of second-order viscosity solutions with jumps, see [9], for example.

The value function has to be linear with respect to the cash process and the mark-to-market value of the trader's inventory due to the form of the terminal condition. Therefore we use the following ansatz:

$$u(t, x, q, S, \psi, I) = x + qS + v(t, q, \psi, I).$$

The HJBQVI then becomes a system of ODEs with $2N + 1$ state variables:

$$0 = \min\Big\{ -\partial_t v(t, q, \psi, I) + g(q - q^\star_t) - \mu q - \sum_{\mathbf{k} \in \mathcal{K}} r_{(\psi, I), (k^\psi, k^I)} \big( v(t, q, k^\psi, k^I) - v(t, q, \psi, I) \big)$$
$$\qquad - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n \big(\tfrac{\psi^n}{2} + p^n \delta^n\big) + v\big(t, q - \ell^n \epsilon^n, \psi, I\big) - v(t, q, \psi, I) \Big]\ ;$$
$$\qquad \sum_{n=1}^N \Big( v(t, q, \psi, I) - \sup_{m^n \in [0, \overline{m}]} \big( -m^n \tfrac{\psi^n}{2} + v(t, q - m^n, \psi, I) \big) \Big) \Big\}, \tag{2.4}$$

with terminal condition $v(T, q, \psi, I) = 0$.

Conditionally on the market parameters, such as the transition matrix of both the spread and the imbalance processes, the drift and volatility of the underlying asset and the execution proportion probabilities, Equation (2.4) can be solved using simple finite difference schemes, and the optimal splitting of volumes as well as the optimal limits can be computed in advance.

If one wants to incorporate Bayesian learning of the parameters directly in the control problem, the result is a very high number of state variables, which makes the problem intractable in practice. For example, if we want to update continuously the value of the processes $\lambda^n$ for $n \in \{1, \dots, N\}$, we need to add the counting processes $(N^n_t)_{t \in [0,T]}$ to the state variables, which increases the dimension of the HJBQVI (2.4) by $N$.

What we propose in the following section is a practical way to update the market parameters according to the trader's observations in a Bayesian way. This method, which is performed separately from the optimization procedure, allows the trader to update the trading strategy at the end of a slice, according to changing market conditions. The framework presented in the above section allows one to choose generic parametric forms for the prior distributions of the state variables (transition matrix of spreads and imbalances, intensities of order arrivals on each venue) that are suitable for conjugate Bayesian updates.

In this section, we present the conjugate Bayesian update of the market parameters and how to choose the prior distributions.

Let us recall the form of the intensities for the arrival of counterpart market orders:

$$\lambda^n(\psi_t, I_t, p^n_t, \ell_t) = f^\lambda(\ell_t) \sum_{\mathbf{m} \in \mathcal{M},\, p \in Q^n_\psi} \lambda^{n,\mathbf{m},p}\, \mathbf{1}_{\{(\psi_t, I_t) \in \mathbf{m},\ p^n_t = p\}}.$$

In the vast majority of optimal liquidation models, the probability of execution $\lambda^{n,\mathbf{m},p}$ is estimated empirically. We propose to put a prior law $\Gamma(\alpha^{n,\mathbf{m},p}, \beta^{n,\mathbf{m},p})$ on the arrival rate, and to update this prior belief at the end of each slice of execution. The parameters $\alpha^{n,\mathbf{m},p}, \beta^{n,\mathbf{m},p}$ are chosen by the trader according to his vision of the market before he starts to trade. Up to time $t \in [0, T]$ the trader observes the processes

$$N^{n,\mathbf{m},p}_t = \int_0^t \mathbf{1}_{\{(\psi_s, I_s) \in \mathbf{m},\ p^n_s = p\}}\, dN^n_s,$$

which represent the number of executed orders on each venue for every spread-imbalance zone $\mathbf{m}$. The posterior distribution of $\lambda^{n,\mathbf{m},p}$ for $n \in \{1, \dots, N\}$ is then given by

$$\lambda^{n,\mathbf{m},p} \mid N^{n,\mathbf{m},p}_t \sim \Gamma\Big( \alpha^{n,\mathbf{m},p} + N^{n,\mathbf{m},p}_t,\ \beta^{n,\mathbf{m},p} + \int_0^t f^\lambda(\ell_s)\, ds \Big),$$

so that, at time $t$, our best estimate of the filling ratio becomes

$$\lambda^{n,\mathbf{m},p}(t, N^{n,\mathbf{m},p}_t, \ell_t) = \mathbb{E}\big[ \lambda^{n,\mathbf{m},p} \mid N^{n,\mathbf{m},p}_t \big] = \frac{\alpha^{n,\mathbf{m},p} + N^{n,\mathbf{m},p}_t}{\beta^{n,\mathbf{m},p} + \int_0^t f^\lambda(\ell_s)\, ds}.$$

The posterior estimate of the intensity $\lambda^n(\psi_t, I_t, p^n_t, \ell_t)$ becomes

$$\hat{\lambda}^n(\psi_t, I_t, p^n_t, \ell_t) = f^\lambda(\ell_t) \sum_{\mathbf{m} \in \mathcal{M},\, p \in Q^n_\psi} \frac{\alpha^{n,\mathbf{m},p} + N^{n,\mathbf{m},p}_t}{\beta^{n,\mathbf{m},p} + \int_0^t f^\lambda(\ell_s)\, ds}\, \mathbf{1}_{\{(\psi_t, I_t) \in \mathbf{m},\ p^n_t = p\}}.$$

As the convergence of the prior parameters toward the true market specification follows from the central limit theorem, the convergence rate is of order $1/\sqrt{o_{\mathbf{m}}}$, where $o_{\mathbf{m}}$ is the number of observations of filled limit orders in the spread-imbalance zone $\mathbf{m}$. Even for a quite parsimonious model, for example two venues, two regimes of spread and three regimes of imbalance, we have $\#\mathcal{M} = 36$ different zones. This means that we need a sufficiently large number of observations (a large number of executed orders) to get an accurate approximation of the market behavior.

If the trader anticipates that the number of observations he will have is not adequate to obtain a suitable approximation of the "true" market parameters (in the case of a mid- to low-frequency strategy with only a few trades throughout the day), he might choose at the beginning the couples $(\alpha^{n,\mathbf{m},p}, \beta^{n,\mathbf{m},p})$ such that

$$\frac{\alpha^{n,\mathbf{m},p}}{\beta^{n,\mathbf{m},p}} \gg \frac{N^{n,\mathbf{m},p}_t}{\int_0^t f^\lambda(\ell_s)\, ds}.$$

That way, his estimate will not be sensitive to a small number of observations, while with a sufficient number of observations the prior will have less influence and the estimation will be less biased.

We propose to use the Dirichlet prior distribution on the executed proportion parameters, so that $\rho^{n,\mathbf{m},p} \sim \text{Dirichlet}(\alpha^{\epsilon,n,\mathbf{m},p})$, where $\alpha^{\epsilon,n,\mathbf{m},p} = (\alpha^{\epsilon,n,\mathbf{m},p,1}, \dots, \alpha^{\epsilon,n,\mathbf{m},p,R})$ for all $(n, \mathbf{m}, p, r) \in \{1, \dots, N\} \times \mathcal{M} \times Q_\psi \times \{1, \dots, R\}$. Given observations of $\epsilon^n_t$, the executed proportion parameters have the Dirichlet posterior distribution

$$\rho^{n,\mathbf{m},p} \sim \text{Dirichlet}(\alpha^{\epsilon,n,\mathbf{m},p} + c^{n,\mathbf{m},p}_t),$$

where $c^{n,\mathbf{m},p}_t = (c^{n,\mathbf{m},p,1}_t, \dots, c^{n,\mathbf{m},p,R}_t)$ and

$$c^{n,\mathbf{m},p,r}_t = \sum_{s \le t} \mathbf{1}_{\{\epsilon^n_s = \omega_r,\ (\psi_s, I_s) \in \mathbf{m},\ p^n_s = p,\ N^n_s - N^n_{s-} = 1\}}$$

is the number of observations before time $t$ in zone $\mathbf{m}$ for a limit $p$ in the venue $n$. Therefore, the $\epsilon^n_t$ have the following posterior distribution:

$$\hat{\rho}^{n,r}(\psi_t, I_t, p^n_t, \ell_t) = f^\rho(\ell_t) \sum_{\mathbf{m} \in \mathcal{M},\, p \in Q_\psi} \frac{\alpha^{\epsilon,n,\mathbf{m},p,r} + c^{n,\mathbf{m},p,r}_t}{\sum_{r'=1}^R \big( \alpha^{\epsilon,n,\mathbf{m},p,r'} + c^{n,\mathbf{m},p,r'}_t \big)}\, \mathbf{1}_{\{(\psi_t, I_t) \in \mathbf{m},\ p^n_t = p\}}.$$

This Bayesian update is linked to the filling of limit orders of the trader: the proportion executed is updated only if the limit order is (partially) executed.
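The two conjugate updates above (Gamma for the fill intensity in one zone, Dirichlet for the executed proportions) can be sketched as follows. This is an illustrative example rather than the authors' implementation: the function names, the prior values and the toy slice are hypothetical, and the exposure integral $\int_0^t f^\lambda(\ell_s)\,ds$ is approximated by a Riemann sum with the exponential form of $f^\lambda$ used in the numerical section.

```python
import numpy as np

def f_lambda(total_volume, kappa):
    """Execution sensitivity exp(-kappa * total posted volume)."""
    return np.exp(-kappa * total_volume)

def gamma_posterior(alpha, beta, n_fills, exposure):
    """Gamma(alpha, beta) prior -> posterior shape, rate and mean of the intensity."""
    shape = alpha + n_fills
    rate = beta + exposure
    return shape, rate, shape / rate

def dirichlet_posterior_mean(alpha_eps, counts):
    """Posterior mean of the executed-proportion probabilities."""
    post = np.asarray(alpha_eps, dtype=float) + np.asarray(counts, dtype=float)
    return post / post.sum()

# Hypothetical slice: 10 steps of length 0.1 with a constant posted volume.
dt = 0.1
posted = np.full(10, 100.0)
exposure = float(np.sum(f_lambda(posted, kappa=1e-3)) * dt)  # Riemann sum

# Three fills observed in the zone; prior Gamma(2, 1) is an arbitrary choice.
shape, rate, lam_hat = gamma_posterior(alpha=2.0, beta=1.0, n_fills=3, exposure=exposure)

# One half execution and three full executions observed; flat Dirichlet prior.
rho_hat = dirichlet_posterior_mean([1.0, 1.0], [1, 3])
```

A larger prior pair with the same ratio $\alpha/\beta$ would make `lam_hat` move more slowly away from the prior mean, which is the mechanism described in the text for low-frequency strategies.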
If one chooses a parametrization independent of the spread-imbalance zones and the order volume, that is, if the execution proportion depends only on the venue, the speed of convergence is much faster, as the same amount of gathered information is used to update a much smaller number of parameters. Using this more parsimonious parametrization, the trader can rely on the observations more than on his prior.

3.1.3 Update of the characteristics of the venues

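We observe the states of the Markov chains $\psi_{t_d}, I_{t_d}$, $d \in \{0, \dots, D\}$, and the times $t_d$ of the $D > 0$ observed transitions. The likelihood reads

$$L(P, \nu \mid \psi_{t \le t_D}, I_{t \le t_D}) = \prod_{d=1}^D \nu_{t_{d-1}} \exp\big( -\nu_{t_{d-1}} (t_d - t_{d-1}) \big)\, p_{(\psi_{t_{d-1}}, I_{t_{d-1}})(\psi_{t_d}, I_{t_d})} \propto \prod_{\mathbf{k} \in \mathcal{K}} (\nu_{\mathbf{k}})^{n_{\mathbf{k}\cdot}} \exp(-\nu_{\mathbf{k}} T_{\mathbf{k}}) \prod_{\mathbf{k}' \in \mathcal{K}} (p_{\mathbf{k}\mathbf{k}'})^{n_{\mathbf{k}\mathbf{k}'}},$$

where $n_{\mathbf{k}\mathbf{k}'}$ is the number of observed transitions from state $\mathbf{k}$ to $\mathbf{k}'$ for $(\mathbf{k}, \mathbf{k}') \in \mathcal{K}^2$, $T_{\mathbf{k}}$ is the total time spent in state $\mathbf{k}$, and $n_{\mathbf{k}\cdot} = \sum_{\mathbf{k}' \in \mathcal{K}} n_{\mathbf{k}\mathbf{k}'}$ is the total number of transitions out of state $\mathbf{k}$.

Given independent prior distributions for $P, \nu$, the posterior distributions will also be independent. We can carry out Bayesian inference separately on the probability matrix and the intensity vectors of the Markov chains. We assume the following priors:

$$\nu_{\mathbf{k}} \sim \Gamma(a_{\mathbf{k}}, b_{\mathbf{k}}), \qquad p_{\mathbf{k}} = (p_{\mathbf{k}\mathbf{k}'})_{\mathbf{k}' \in \mathcal{K}} \sim \text{Dirichlet}(\alpha_{\mathbf{k}}),$$

where $\alpha_{\mathbf{k}} = (\alpha_{\mathbf{k}\mathbf{k}'})_{\mathbf{k}' \in \mathcal{K}}$. Given these conjugate priors, our best estimators of $\nu_{\mathbf{k}}, p_{\mathbf{k}}$ are

$$\hat{\nu}_{\mathbf{k}} = \frac{a_{\mathbf{k}} + n_{\mathbf{k}\cdot} - 1}{b_{\mathbf{k}} + T_{\mathbf{k}}}, \qquad \hat{p}_{\mathbf{k}\mathbf{k}'} = \frac{\alpha_{\mathbf{k}\mathbf{k}'} + n_{\mathbf{k}\mathbf{k}'}}{\sum_{\mathbf{l} \neq \mathbf{k}} (\alpha_{\mathbf{k}\mathbf{l}} + n_{\mathbf{k}\mathbf{l}})}.$$

Then the posterior transition matrix is

$$\hat{r}_{\mathbf{k}\mathbf{k}'} = \hat{\nu}_{\mathbf{k}}\, \hat{p}_{\mathbf{k}\mathbf{k}'}, \ \mathbf{k} \neq \mathbf{k}', \qquad \hat{r}_{\mathbf{k}\mathbf{k}} = -\hat{\nu}_{\mathbf{k}}.$$

This update aims at finding the "true" behavior of the imbalance and spread processes of each venue. This is of particular importance if an event (for instance, an announcement or news) happens in the market. More specifically, if an event occurs on a particular platform (if a metaorder is executed on one specific platform, for example), this helps to discriminate one venue from the others and to redirect the orders to the less toxic liquidity platforms. Given the large number of observations (transitions from one state of imbalance or spread to another occur fast), the trader does not necessarily need to be confident about his prior distributions.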
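A minimal sketch of this update, assuming the observed path is given as a list of state indices and jump times, could look as follows. The function name, the prior values and the toy path are hypothetical. The rate estimate is taken as the posterior mode $(a + n - 1)/(b + T)$ of the Gamma distribution, which is how we read the estimator above; the posterior mean $(a + n)/(b + T)$ would be an equally simple alternative.

```python
import numpy as np

def estimate_generator(states, times, a, b, alpha):
    """Posterior estimates (nu_hat, p_hat, r_hat) from one observed path.

    states[d] is the state index entered at times[d]; consecutive states differ.
    a, b are Gamma prior parameters per state; alpha is the Dirichlet prior matrix.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    K = len(a)
    n = np.zeros((K, K))   # n[k, k'] : number of transitions k -> k'
    T = np.zeros(K)        # T[k]     : total time spent in k before a jump
    for d in range(1, len(states)):
        k_prev, k_next = states[d - 1], states[d]
        n[k_prev, k_next] += 1.0
        T[k_prev] += times[d] - times[d - 1]
    n_out = n.sum(axis=1)
    nu_hat = (a + n_out - 1.0) / (b + T)       # Gamma posterior mode
    weights = alpha + n
    np.fill_diagonal(weights, 0.0)             # p_kk = 0 by assumption
    p_hat = weights / weights.sum(axis=1, keepdims=True)
    r_hat = nu_hat[:, None] * p_hat
    np.fill_diagonal(r_hat, -nu_hat)
    return nu_hat, p_hat, r_hat

# Toy path over three states (illustrative numbers only).
states = [0, 1, 0, 2]
times = [0.0, 1.0, 3.0, 4.0]
nu_hat, p_hat, r_hat = estimate_generator(
    states, times, a=2.0 * np.ones(3), b=np.ones(3), alpha=np.ones((3, 3))
)
```

In this sketch the time spent in the last observed state is ignored because no jump out of it has been seen yet.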

Remark 3.1.

If one wants to use a more parsimonious model as in Remark 2.3, the same methodology applies. In particular for $k \in \psi^n$, we assume the following prior:

$$\nu^{n,\psi}_k \sim \Gamma(a^{n,\psi}_k, b^{n,\psi}_k), \qquad p^{n,\psi}_k = (p^{n,\psi}_{kk'})_{k' \in \psi^n} \sim \text{Dirichlet}(\alpha^{n,\psi}_k),$$

where $\alpha^{n,\psi}_k = (\alpha^{n,\psi}_{kk'})_{k' \in \psi^n}$. Given these conjugate priors, our best estimators of $\nu^{n,\psi}_k, p^{n,\psi}_k$ are

$$\hat{\nu}^{n,\psi}_k = \frac{a^{n,\psi}_k + n^{n,\psi}_{k\cdot} - 1}{b^{n,\psi}_k + T^{n,\psi}_k}, \qquad \hat{p}^{n,\psi}_{kk'} = \frac{\alpha^{n,\psi}_{kk'} + n^{n,\psi}_{kk'}}{\sum_{l \neq k} (\alpha^{n,\psi}_{kl} + n^{n,\psi}_{kl})}.$$

The posterior transition matrix is given by

$$\hat{r}^{n,\psi}_{kk'} = \hat{\nu}^{n,\psi}_k\, \hat{p}^{n,\psi}_{kk'}, \ k \neq k', \qquad \hat{r}^{n,\psi}_{kk} = -\hat{\nu}^{n,\psi}_k.$$

Similar formulae apply for $\nu^{n,I}_k, p^{n,I}_k$.

We recall that the price process has the following dynamics:

$$dS_t = \mu\, dt + \sigma\, dW_t,$$

so that $(S_t - S_0 \mid \mu, \sigma) \sim \mathcal{N}(\mu t, \sigma^2 t)$. We assume that the couple $(\mu, \sigma^2)$ follows a Normal-Inverse-Gamma prior distribution $\mathcal{NIG}(\mu_0, \nu, \alpha_s, \beta_s)$, where $(\mu_0, \nu, \alpha_s, \beta_s) \in \mathbb{R} \times \mathbb{R}^3_+$. Therefore the posterior distribution has the following form:

$$(\mu, \sigma^2 \mid S_t - S_0) \sim \mathcal{NIG}\Big( \frac{(S_t - S_0) + \mu_0 \nu}{\nu + t},\ \nu + t,\ \alpha_s + \frac{t}{2},\ \beta_s + \frac{t\nu}{\nu + t} \frac{\big( \frac{S_t - S_0}{t} - \mu_0 \big)^2}{2} \Big).$$

Given our observations of the stock price up to time $t$, the best approximations of the drift and volatility are given by

$$\mu(t, S_t) = \mathbb{E}[\mu \mid S_t - S_0] = \frac{(S_t - S_0) + \mu_0 \nu}{\nu + t}, \qquad \sigma^2(t, S_t) = \mathbb{E}[\sigma^2 \mid S_t - S_0] = \frac{\beta_s + \frac{t\nu}{\nu + t} \frac{\big( \frac{S_t - S_0}{t} - \mu_0 \big)^2}{2}}{\alpha_s + \frac{t}{2} - 1}.$$

The volatility $\sigma$ does not appear explicitly in the reduced HJBQVI (2.4). However, it is taken into account when the trader computes his trading curve $q^\star$.

In the case where the trader is confident in his estimation of $\sigma$, one can use a Normal prior distribution on $\mu$ such that $\mu \sim \mathcal{N}(\mu_0, \nu^2)$. Then, the best approximation of the drift is given by

$$\mu(t, S_t) = \mathbb{E}[\mu \mid S_t - S_0] = \frac{\mu_0 \sigma^2 + \nu^2 (S_t - S_0)}{\sigma^2 + \nu^2 t}. \tag{3.1}$$

If the trader firmly believes in the a priori parameter estimation, he can set $\nu$ close to 0 so that he mostly relies on his prior. On the contrary, if he sets $\nu$ high enough, his estimation comes mostly from market information.
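The two limiting behaviors of the drift estimate with a Normal prior and known volatility can be checked numerically with a short sketch. The function name and the numerical values are hypothetical; the formula is the posterior mean of $\mu$ under the prior $\mathcal{N}(\mu_0, \nu^2)$ when $S_t - S_0 \sim \mathcal{N}(\mu t, \sigma^2 t)$.

```python
def drift_posterior_mean(mu0, nu, sigma, s0, st, t):
    """Posterior mean of the drift for a N(mu0, nu^2) prior and known sigma."""
    return (mu0 * sigma**2 + nu**2 * (st - s0)) / (sigma**2 + nu**2 * t)

# nu = 0: the trader fully trusts the prior, the estimate stays at mu0.
prior_only = drift_posterior_mean(mu0=0.01, nu=0.0, sigma=0.05, s0=100.0, st=100.2, t=2.0)

# Large nu: the estimate is driven by the data and approaches (S_t - S_0) / t.
data_driven = drift_posterior_mean(mu0=0.01, nu=1e3, sigma=0.05, s0=100.0, st=100.2, t=2.0)
```

Intermediate values of $\nu$ interpolate between the prior mean and the empirical drift $(S_t - S_0)/t$.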
Given the large amount of data coming from the market (each time step corresponding to one new observation), convergence to the real value of the drift is fast.

Remark 3.2.

One can argue about the use of a frequentist estimator of the model parameters, which would actually lead to quite similar formulae. However, the original problem, that is, the continuous update of market parameters in the control problem, is of Bayesian nature. Moreover, in our approach, the formulae for the posterior distributions of market parameters are as explicit as in the frequentist approach.

3.2 Algorithm description

We now present the use of the Bayesian updates in order to obtain adaptive trading strategies in practice. We emphasize that the procedure is decoupled from the optimization problem (2.4), so that we do not perform Bayesian optimization but rather a Bayesian update of the parameters of an optimization problem.

The number of time steps is an important parameter of the optimization problem because its choice is a trade-off between computation time and computation precision. To address this problem, we use the trading algorithm with fixed market parameters over a short period of time (a couple of seconds up to a few minutes), which we call a slice. Let us consider $V > 0$ slices $\mathcal{T}_v = [T_v, T_{v+1}]$, $v = 0, \dots, V - 1$, such that $T_0 = 0$, $T_V = T$. We define for each slice $v \in \{0, \dots, V - 1\}$ a set of market parameters

$$\theta^m_v = \big( r, \rho^n, \lambda^{n,\mathbf{m},p}, \mu, \sigma \big)_{\{n \in \{1, \dots, N\},\ \mathbf{m} \in \mathcal{M},\ p \in Q_\psi\}}.$$

At each time slice $v \in \{0, \dots, V - 1\}$, starting from $v = 0$, we perform the following algorithm:

1. Take the best estimation of market parameters $\theta^m_v$ from the prior distribution for the current slice $v$.
2. Compute the optimal trading strategy on $\mathcal{T}_v$ using the set of parameters $\theta^m_v$.
3. Observe market events during the current slice (executions, changes of the state).
4. At $T_{v+1}$, update the parameters $\theta^m_{v+1}$ following the Bayes rules described in Section 3.

To summarize, we use the output of the control model (the optimal volumes and limits in each venue) over a slice of execution and then run the model again with the updated market parameters.
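The four steps above can be sketched on a deliberately reduced example: a single venue and a single zone, where the only parameter being learned is the fill intensity. Everything here is an assumption made for illustration: `solve_control_problem` is a hypothetical placeholder standing in for the HJBQVI solver, fills are simulated from a Poisson law with a "true" intensity unknown to the trader, and $f^\lambda$ is taken equal to 1 so that the exposure of a slice is just its length.

```python
import numpy as np

def solve_control_problem(params):
    """Hypothetical placeholder for the solver of (2.4); it only echoes its input."""
    return {"lambda_used": params["lam"]}

def run_slices(n_slices, slice_len, lam_true, alpha, beta, seed=0):
    """Slice-by-slice loop: estimate, solve, observe, update (Gamma prior)."""
    rng = np.random.default_rng(seed)
    history = []
    for v in range(n_slices):
        params = {"lam": alpha / beta}                # 1. best current estimate
        strategy = solve_control_problem(params)      # 2. strategy for the slice
        n_fills = rng.poisson(lam_true * slice_len)   # 3. observe executions
        alpha += n_fills                              # 4. conjugate update at T_{v+1}
        beta += slice_len
        history.append(strategy["lambda_used"])
    return history, alpha, beta

history, alpha_T, beta_T = run_slices(
    n_slices=10, slice_len=1.0, lam_true=2.0, alpha=1.0, beta=1.0
)
```

In a full implementation the same loop would carry all components of $\theta^m_v$, and step 2 would call the finite difference or neural network solver.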
This method, which is clearly time inconsistent, is common practice when one applies optimal control with online parameter estimation, see for example [7]. We now present some possible extensions of the presented model.

In this section we describe different potential model extensions and their impact on the problem's dimensionality.

The two main sources of signals at the microstructural level are the imbalance and the bid-ask spread. Therefore, one can assume a parametric dependence $f^{short}(\psi_t, I_t)$ of the price process on these two sources, such that the price process becomes

$$dS_t = \big( \mu + f^{short}(\psi_t, I_t) \big)\, dt + \sigma\, dW_t.$$

In the modified stochastic control problem the term $\mu q$ in the HJBQVI is replaced by $\big( \mu + f^{short}(\psi, I) \big) q$, which causes no increase in the dimensionality of the state process.

4.1.2 Mid/Long term and path-dependent price signals

When trading on a longer time horizon, one can incorporate mid- or long-term signals such as Bollinger bands, moving average or cointegration ratio. For example, consider a signal taking into account the moving average and the maximum of the price process $S_t$, that is

$$\bar{S}_t = \frac{1}{t} \int_0^t S_s\, ds, \qquad S^\star_t = \max_{s \le t} S_s.$$

The triplet $(S_t, S^\star_t, \bar{S}_t)$ is Markovian. Therefore, we can add a long term signal $f^{long}(S_t, \bar{S}_t, S^\star_t)$ into the asset's drift:

$$dS_t = \big( \mu + f^{long}(S_t, \bar{S}_t, S^\star_t) \big)\, dt + \sigma\, dW_t.$$

The HJBQVI then becomes:

$$0 = \min\Big\{ -\partial_t u(t, q, S, \bar{S}, S^\star, \psi, I) + g(q - q^\star_t) - \big( \mu + f^{long}(S, \bar{S}, S^\star) \big)\, \partial_S u - \frac{S - \bar{S}}{t}\, \partial_{\bar{S}} u - \tfrac{1}{2}\sigma^2\, \partial_{SS} u$$
$$\qquad - \sum_{\mathbf{k} \in \mathcal{K}} r_{(\psi, I), (k^\psi, k^I)} \big( u(t, q, S, \bar{S}, S^\star, k^\psi, k^I) - u(t, q, S, \bar{S}, S^\star, \psi, I) \big)$$
$$\qquad - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n \big(S + \tfrac{\psi^n}{2} + p^n \delta^n\big) + u\big(t, q - \ell^n \epsilon^n, S, \bar{S}, S^\star, \psi, I\big) - u(t, q, S, \bar{S}, S^\star, \psi, I) \Big]\ ;$$
$$\qquad \sum_{n=1}^N \Big( u(t, q, S, \bar{S}, S^\star, \psi, I) - \sup_{m^n \in [0, \overline{m}]} \big( m^n (S - \tfrac{\psi^n}{2}) + u(t, q - m^n, S, \bar{S}, S^\star, \psi, I) \big) \Big) \Big\},$$

for $S \le S^\star$, with $\partial_{S^\star} u = 0$ for $S = S^\star$. To obtain this equation we just use the change of variable $v(t, x, q, S, \bar{S}, S^\star, \psi, I) = x + u(t, q, S, \bar{S}, S^\star, \psi, I)$, linear with respect to the cash process $X_t$.
We end up with a $2N + 4$ dimensional HJBQVI, which we can still solve using our deep reinforcement learning algorithm (but hardly with finite differences).

More generally, adding a path-dependent state variable that gives information on the price trend adds one dimension to the HJBQVI (in the example above, $S_t$, $S^\star_t$ and $\bar{S}_t$ add one dimension each).

So far we assumed no market impact on the price process. It is well known that the cost of market impact can wipe out a large proportion of a trading strategy's profit. Therefore, we can use a simple permanent-temporary market impact model, inspired by [2].

The impacted mid-price process can be modeled as follows:

$$dS_t = \big( \mu + h(\ell_t) \big)\, dt + \sigma\, dW_t + \sum_{n=1}^N \big( \xi^{n,l}(t, \ell^n_t)\, dN^n_t + \xi^{n,m}(t, \ell^n_t)\, dJ^n_t \big),$$

where the functions $h, \xi^{n,l}, \xi^{n,m}$ are the permanent and temporary market impact functions. Following [15], we assume linear permanent market impact, that is

$$h(\ell_t) = \sum_{n=1}^N \kappa^{n,per}\, \ell^n_t, \qquad \kappa^{n,per} > 0 \ \text{ for } n \in \{1, \dots, N\}.$$

The temporary impact functions are taken of the form

$$\xi^{n,l}(t, \ell^n_t) = \kappa^{n,l} (\ell^n_t)^{\gamma^{n,l}}, \qquad \xi^{n,m}(t, \ell^n_t) = \kappa^{n,m} (\ell^n_t)^{\gamma^{n,m}},$$

where $\kappa^{n,l}, \kappa^{n,m}, \gamma^{n,l}, \gamma^{n,m} > 0$, with typically $\gamma^{n,l}, \gamma^{n,m} \approx 1/2$. On the other hand, in order to take into account the transient part of the impact, we can set the following form for $S_t$:

$$S_t = S_0 + \int_0^t \big( \mu + h(\ell_s) \big)\, ds + \sigma W_t + \sum_{n=1}^N \int_0^t \Big( \xi^{n,l}(t - s)\, \tilde{\xi}^{n,l}(\ell^n_s)\, dN^n_s + \xi^{n,m}(t - s)\, \tilde{\xi}^{n,m}(\ell^n_s)\, dJ^n_s \Big), \tag{4.1}$$

where $\xi^{n,l}, \xi^{n,m}$ are decreasing kernels, and $\tilde{\xi}^{n,l}, \tilde{\xi}^{n,m}$ are decreasing functions of the posted volume. It is well known that by taking an exponentially decreasing kernel, Equation (4.1) admits a Markovian representation, as the couples $\big( N^n_t, \int_0^t \xi^{n,\{l,m\}}(t - s)\, dN^n_s \big)_{t \in [0,T]}$ are Markovian. Practically, this will add $2N$ dimensions to the HJBQVI.

The functions $h, \xi^{n,l}, \xi^{n,m}$ could also be approximated by neural networks. The determination of a cross-impact function between liquidity pools can lead to possible arbitrage detection across liquidity venues.

Hidden liquidity represents a large proportion of the liquidity, especially in the US markets, see for example [21]. Therefore, if one wants to design trading tactics for assets cross-listed in a European and an American market, taking into account the hidden part of the liquidity is crucial.

Assume that the $n$-th venue is a US liquidity pool. Borrowing the notations of [4], we denote by $H^n$ the hidden liquidity of the $n$-th venue at the first limit of the order book. Therefore, the corresponding imbalance process represented by the continuous-time Markov chain $I^n$ can be rewritten as

$$\frac{N^{n,a,m}_t - N^{n,b,m}_t}{N^{n,a,m}_t + N^{n,b,m}_t + 2H^n},$$

where $N^{n,b,m}_t, N^{n,a,m}_t$ are the bid and ask market order flow processes on the $n$-th venue. Empirical estimation of the prior parameters for the transition matrix of $I^n$ has to take into account this additional term in the imbalance processes. Furthermore, incorporating the imbalance process with hidden liquidity into trading signals makes it possible to detect arbitrage opportunities between different venues. This does not increase the dimensionality of Equation (2.3).
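As a small worked example of the hidden-liquidity adjustment, the sketch below computes the imbalance with and without the term $2H^n$ in the denominator. The function name and the numbers are hypothetical; the formula is the one stated above.

```python
def imbalance_with_hidden_liquidity(ask_flow, bid_flow, hidden):
    """Imbalance (N^a - N^b) / (N^a + N^b + 2 H) for one venue."""
    denom = ask_flow + bid_flow + 2.0 * hidden
    if denom <= 0:
        raise ValueError("total (visible + hidden) liquidity must be positive")
    return (ask_flow - bid_flow) / denom

# Same visible flows, with and without hidden liquidity at the first limit.
visible_only = imbalance_with_hidden_liquidity(300.0, 100.0, hidden=0.0)
with_hidden = imbalance_with_hidden_liquidity(300.0, 100.0, hidden=200.0)
```

Hidden liquidity shrinks the measured imbalance toward zero, so the same visible flows may fall into a different imbalance regime once $H^n$ is accounted for.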
We take the example of a trader acting on a stock cross-listed on 2 different venues ($N = 2$), with the following global parameters:

• $\psi^n = \{\delta, 2\delta\}$: the processes $(\psi^n_t)_{t \in [0,T]}$ can take two values, which correspond to a low or a high spread regime, and the tick size is $\delta = 0.[…]$.
• $I^n$: the processes $(I^n_t)_{t \in [0,T]}$ can take three values, which correspond to a negative, neutral or positive imbalance regime.
• $R = 2$, $(\omega_1, \omega_2) = (0.5, 1)$: the processes $(\epsilon^n_t)_{t \in [0,T]}$ can take two values, which correspond to a total or half-execution of the posted volume $(\ell^n_t)_{t \in [0,T]}$.
• $q_0 = 5 \times 10^{[…]}$: initial inventory of the trader.
• $\mathcal{T}_v = [v, v + \Delta_v]$, where $\Delta_v = 1$ min, which means that each slice lasts one minute, with $V = 10$ slices and $T = 10$ min.
• $\Delta t = 0.1$: we take 10 time steps in each slice, that is, the agent takes 10 trading decisions during each slice.

The pre-computed trading curve is borrowed from an implementation shortfall execution using market orders, that is:

$$q^\star_t = q_0\, \frac{\sinh\Big( \sqrt{\frac{\gamma \sigma^2 V}{2\eta}}\, (T - t) \Big)}{\sinh\Big( \sqrt{\frac{\gamma \sigma^2 V}{2\eta}}\, T \Big)},$$

with the following set of parameters:

• $\eta = 0.1$: coefficient of quadratic costs.
• $V = 1 \times 10^{[…]}$: average market volume.
• $\gamma = 1 \times 10^{-[…]}$: risk aversion of the trader using a CARA utility function.
• $\sigma = 0.05$: volatility of the asset.
• $f^\lambda(\ell_t) = \exp\big( -\kappa \sum_{n=1}^N \ell^n_t \big)$ with $\kappa = 2.[…] \times 10^{-[…]}$: sensitivity of the execution with respect to the total volume posted.
• $f^\rho(\ell_t) = 1$: no sensitivity of the executed proportion with respect to the total volume posted.

For the sake of clarity, in this numerical experiment we consider a trader sending only limit orders. To find the optimal strategy for limit orders we consider the following equation:

$$0 = -\partial_t v(t, q, \psi, I) + g(q - q^\star_t) - \mu q - \sum_{n=1}^N \sum_{j=1}^J r^{n,\psi}_{\psi^n, j\delta} \big( v(t, q, \psi^{-n}_{j\delta}, I) - v(t, q, \psi, I) \big) - \sum_{n=1}^N \sum_{k=1}^K r^{n,I}_{I^n, I_k} \big( v(t, q, \psi, I^{-n}_{I_k}) - v(t, q, \psi, I) \big)$$
$$\qquad - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n \big(\tfrac{\psi^n}{2} + p^n \delta^n\big) + v\big(t, q - \ell^n \epsilon^n, \psi, I\big) - v(t, q, \psi, I) \Big],$$

where

$$\psi^{-n}_{j\delta} = (\psi^1, \dots, \psi^{n-1}, j\delta, \psi^{n+1}, \dots, \psi^N), \qquad I^{-n}_{I_k} = (I^1, \dots, I^{n-1}, I_k, I^{n+1}, \dots, I^N).$$

In order to apply the finite difference method we introduce a discretization of time and state space. For inventories we have $\mathcal{Q} = \{q^0 = 0 < \dots < q^{\#\mathcal{Q}} = q_0\}$. The time discretization in the slice is $\mathcal{T} = \{t_0 = 0 < t_1 = t_0 + \Delta t < \dots < t_{\#\mathcal{T}} = \Delta_v\}$. We also discretize the order volumes the trader can send: $\mathcal{L} = \{l^0 = 0 < \dots < l^{\#\mathcal{L}} = q_0\}$.

Using a first-order difference for the time derivative of the value function and stepping backward from the terminal condition, we can rewrite the above equation as follows: for all $i \in \{0, \dots, \#\mathcal{T} - 1\}$, $q \in \mathcal{Q}$, $(\psi, I) \in \mathcal{M}$,

$$v(t_i, q, \psi, I) = v(t_{i+1}, q, \psi, I) - \Delta t \Big( g(q - q^\star_t) - \mu q - \sum_{n=1}^N \sum_{j=1}^J r^{n,\psi}_{\psi^n, j\delta} \big( v(t_{i+1}, q, \psi^{-n}_{j\delta}, I) - v(t_{i+1}, q, \psi, I) \big) - \sum_{n=1}^N \sum_{k=1}^K r^{n,I}_{I^n, I_k} \big( v(t_{i+1}, q, \psi, I^{-n}_{I_k}) - v(t_{i+1}, q, \psi, I) \big)$$
$$\qquad - \sup_{p \in \{-1, 0, 1\}^N,\, \ell \in \mathcal{L}^N} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n \big(\tfrac{\psi^n}{2} + p^n \delta^n\big) + v\big(t_{i+1}, q - \ell^n \epsilon^n, \psi, I\big) - v(t_{i+1}, q, \psi, I) \Big] \Big),$$

with terminal condition $v(T, q, \psi, I) = 0$.

In terms of computations, the most demanding part is the search for the supremum, which has to be performed over a set of size $3^N \times \#\mathcal{Q} \times (\#\mathcal{L})^N \times \#\mathcal{M}$ for each time step. It follows that finite differences can be applied to solve the problem of optimal order posting for a stock cross-listed on $N = 2$ venues with reasonable precision and computation time. However, if we introduce more venues, finite differences are no longer efficient because the complexity grows exponentially.

For our numerical example, we used the discretization with $\#\mathcal{Q} = 101$ and $\#\mathcal{L} = 51$, which gives a computation time (on a simple PC) of around 1 min for the whole slice.

In this section, we briefly introduce the method using neural networks to solve HJB equations. In this paper, we use a method that can be referred to as an Actor-Critic method to approximate the optimal controls and the corresponding value function of the problem. This approach has proven fruitful, especially for equations in high dimension; a more elaborate description of the method can be found for example in [6, 8, 18, 20].

The core of this approach is to represent the strategy of the trader with a neural network, as well as the corresponding value function. Then one needs to formalize the target functions for both neural networks and to perform gradient descent on the parameters (weights) of these networks.
Thisprocedure needs to be done for every time step, and so one ends up with 2 T networks.15et us start from the description of the value function approximation. We consider the neural networkstaking as an input the spreads and the imbalances in the venues of interest and the inventory of thetrader giving as an output the value function at this point. As in the ﬁnite diﬀerence method we solveour problem backward, starting from t T − = ∆ v − ∆ t , because the value function at the end of the sliceis known from the terminal condition. To calculate the value function at time t i , ∀ i ∈ { , . . . , T − } we use the minimization of the mean-squared error between values given by the neural network andthe target values calculated with the use of the value function approximation for time t i +1 and thenetwork for the controls at the current step. Let us assume that we have the controls ℓ ∗ , p ∗ (obtainedvia neural networks, for example) for time t i , then the target for the value function can be found as v target ( t i − , q, ψ,I ) = v [ θ vi ]( t i , q, ψ, I ) + ∆ t g ( q − q ⋆t ) − µq − N X n =1 J X j =1 r n,ψψ,jδ (cid:16) v [ θ vi ]( t, q, ψ − njδ , I ) − v [ θ vi ]( t, q, ψ, I ) (cid:17) − N X n =1 K X k =1 r n,II,I k (cid:16) v [ θ vi ]( t, q, ψ, I − nI k ) − v [ θ vi ]( t, q, ψ, I ) (cid:17) − N X n =1 λ n ( ψ, I, p ∗ n , ℓ ∗ ) E (cid:20) ǫ n ℓ ∗ n ( ψ n p ∗ n δ n ) + v [ θ vi ] (cid:16) t, q − ℓ ∗ n ǫ n , ψ, I (cid:17) − v [ θ vi ]( t, q, ψ, I ) (cid:21)! , with v [ θ v T ]( t T , q, ψ, I ) = 0 and where [ θ vi ] stands for the weights of the neural network for the valuefunction at time t i .The trader’s inventory is of continuous nature, however, spread and imbalance are categorical, so weneed to verify if we should use some special techniques to ensure better ﬁtting in this case. Target for value function for random spreads and imbalances target value
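The backward step shared by the finite difference scheme and the value function target can be illustrated on a deliberately reduced problem. The sketch below is not the implementation used for the experiments: it assumes a single venue, a single market state (no spread or imbalance jumps), full execution ($\epsilon = 1$), integer inventories, a hypothetical quadratic penalty $g(x) = \phi x^2$ and a hypothetical fill intensity that decreases with the limit and the posted volume. All parameter values are placeholders.

```python
import math

LIMITS = (-1, 0, 1)  # new best limit, best limit, second best limit


def backward_step(v_next, dt, q_target, mu=0.0, phi=1.0, psi=0.02,
                  delta=0.01, base_rate=1.0, kappa=0.05, max_volume=5):
    """Compute v(t_i, .) from v(t_{i+1}, .) on the inventory grid 0..Q."""
    v_now = []
    for q in range(len(v_next)):
        best = 0.0  # posting a zero volume contributes nothing
        for p in LIMITS:
            for vol in range(1, min(q, max_volume) + 1):
                # Hypothetical intensity: lower limits and smaller volumes fill faster.
                rate = base_rate * math.exp(-p) * math.exp(-kappa * vol)
                # Profit from the fill plus the change in continuation value.
                gain = vol * (psi / 2 + p * delta) + v_next[q - vol] - v_next[q]
                best = max(best, rate * gain)
        running = phi * (q - q_target) ** 2 - mu * q
        # Explicit step: v(t_i) = v(t_{i+1}) - dt * (g - mu*q - sup(...)).
        v_now.append(v_next[q] - dt * (running - best))
    return v_now


def solve(q_max=10, n_steps=10, dt=0.1, q_target=0):
    """Iterate backward from the terminal condition v(T, q) = 0."""
    v = [0.0] * (q_max + 1)
    for _ in range(n_steps):
        v = backward_step(v, dt, q_target)
    return v
```

In the full problem the inner loops run over all venues, limits and volumes jointly, which is where the $3^N \times L^N$ factor in the cost comes from.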

Figure 1: Target value function for increasing inventory and random market states.

Let us first look, in Figure 1, at an example of the target value function of the trader for $q \in [0, \bar q]$ at different spreads and imbalances. We see considerable changes in the level of the value function depending on the market state, which we would like our approximation to capture.

Now, let us compare the fit of the value function parametrization taking raw spread and imbalance values as inputs with the parametrization working with encoded values of the spread and the imbalance. Here we use the so-called one-hot encoding for categorical variables, which represents the different values of a variable by one-hot vectors: $e^i_\psi \in \{0,1\}^J$ for the spread and $e^i_I \in \{0,1\}^K$ for the imbalance, where $J$ and $K$ are the numbers of possible spread and imbalance values. The vectors $e^i$ (both $e^i_\psi$ and $e^i_I$) are such that $e^i_j = 0$ for all $j \neq i$ and $e^i_i = 1$.
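As a minimal sketch of the encoding described above, the function below builds a network input from a normalized inventory and one-hot encoded spreads and imbalances. The category lists, the function names and the normalization constant are illustrative assumptions, not values taken from the paper.

```python
def one_hot(value, categories):
    """Return the one-hot vector of `value` within the ordered `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]


def encode_state(inventory, q_max, spreads, imbalances,
                 spread_values, imbalance_values):
    """Concatenate the normalized inventory with one-hot market states."""
    features = [inventory / q_max]
    for s in spreads:          # one block per venue
        features += one_hot(s, spread_values)
    for im in imbalances:      # one block per venue
        features += one_hot(im, imbalance_values)
    return features


# Two venues, spreads in ticks, three hypothetical imbalance levels.
example = encode_state(25000, 50000, [1, 2], [-0.5, 0.5],
                       spread_values=[1, 2],
                       imbalance_values=[-0.5, 0.0, 0.5])
```

With this layout the network never has to interpolate between a spread of one tick and a spread of two ticks, which is the property the comparison in Figures 2 and 3 relies on.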

Figure 2: Comparison of the target value with the approximation continuous in spread and imbalance.

Figure 3: Comparison of the target value with the approximation discrete in spread and imbalance.

In Figures 2 and 3 we compare the values predicted by the two parametrizations with the target values for the same number of learning epochs. There is a considerable gain in precision when the parametrization takes into account the categorical nature of the market states. Therefore we apply it to both the value function network and the strategy network.

Now, let us describe the learning procedure for the strategy. First of all, the inputs of the strategy network are the same as for the value function network, i.e. the trader's inventory, spreads and imbalances for both venues. As an output, we need the volumes of the orders and the limits at which the trader sends them. Volumes sent to each venue are bounded by the current inventory because we do not want the trader to execute more shares than he possesses. Limits should equal −1, 0, or 1, but since we want to use the tools of automatic differentiation, we need to represent them by a differentiable function. The softmax activation function serves this purpose well, so we represent the limits for each venue by the probabilities of sending an order to each limit. In practice, the trader can choose the limit with the highest of the three probabilities.

The optimization criterion used for the strategy neural network is the function under the supremum in the HJBQVI (2.4), with the limit probabilities taken into account. Denoting these probabilities by $\mathbb{P}(p = a)$ for $a \in \{-1,0,1\}$, we need to maximize with respect to $\theta^\ell_i$, $i \in \{0,\dots,T-1\}$:

$$
\sum_{n=1}^N \sum_{a \in \{-1,0,1\}} \mathbb{P}[\theta^\ell_i](p^n = a)\, \lambda^n\big(\psi,I,a,\ell[\theta^\ell_i]\big)\,
\mathbb{E}\Big[\epsilon^n \ell[\theta^\ell_i]^n \Big(\frac{\psi^n}{2} + a\delta^n\Big) + v[\theta^v_{i+1}]\big(t_{i+1}, q - \ell[\theta^\ell_i]^n \epsilon^n, \psi, I\big) - v[\theta^v_{i+1}](t_{i+1},q,\psi,I)\Big],
$$

where $\theta^\ell_i$ stands for the weights of the neural network of the controls at time $t_i$ and $\theta^v_{i+1}$ for the weights of the value function network at time $t_{i+1}$. We want to maximize this function for all possible values of market states and inventories. To avoid the dimensionality trap, we optimize it on a subset of possible values, which we draw randomly.

When optimizing neural network approximations, it is important to normalize the data in order to have, if possible, a universal set of hyperparameters. First of all, the inventory entering as an input of the value function network and of the strategy network is normalized by $\bar q$ so that it always stays in $[0, 1]$, which considerably reduces the order of magnitude of the values. For the strategy network, we learn the proportion of the inventory to be sent and not the volume itself. Finally, we notice that for high inventories the difference between value functions in the supremum (which are quadratic in the inventory) can become much larger than the profit of the trader coming from the tick (which is at most linear in the inventory). This can prevent us from finding optimal values for the limit at which the trader should send his order, especially for small inventories. We therefore normalize the values of the optimized function across inventories, multiplying all values by a factor that decreases with $q$, to give more weight to small inventories. However, this latter normalization is used only when we optimize the part of the strategy responsible for the limits, leaving volume updates untouched.

To summarize, Figures 4 and 5 present the structures of the neural networks used to represent the approximators of the strategy and the value function. Another feature worth mentioning is the separation of market state and inventory inputs for some layers, both for the strategy and the value function. This allows capturing features of the market state independently of the inventory. We also separated some layers preceding the outputs of the strategy network in order to perform the learning process with different learning rates for the volumes and the limits of the limit orders.

[Figure 4 diagram, layers: spreads (InputLayer), imbalances (InputLayer), inventory (InputLayer), spreads_and_imbalances (Concatenate), all_inputs (Concatenate), hidden_layer_1.1 and hidden_layer_1.2 (Dense), hidden_layer_2.1 and hidden_layer_2.2 (Dense), outputs_2.1_and_2.2 (Concatenate), hidden_layer_3.1 and hidden_layer_3.2 (Dense), hidden_layer_4.1 (Dense), volumes (Dense), limits_venue_1 (Dense), limits_venue_2 (Dense)]

Figure 4: Neural network structure for the trader's strategy.

[Figure 5 diagram, layers: spreads (InputLayer), imbalances (InputLayer), inventory (InputLayer), spreads_and_imbalances (Concatenate), all_inputs (Concatenate), hidden_layer_1.1 and hidden_layer_1.2 (Dense), outputs_1.1_and_1.2 (Concatenate), hidden_layer_2 (Dense), value_function (Dense)]

Figure 5: Neural network structure for the value function.

While finite difference schemes must recompute the values on the whole grid every time the trader wants to adapt his strategy to updated market parameters, neural networks can be adapted progressively, starting from some pre-trained strategy, for example the one corresponding to the previous parameters. In practice, a pre-trained model can be reused for different problem settings thanks to normalization. Therefore a long and elaborate training procedure needs to be done only once. The resulting model can then be improved by small adjustment trainings, which take only 1 minute on the simplest instance of the AWS platform (2 CPUs, no GPU) and have great speed-up potential when performed on more complex infrastructures.
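The softmax representation of the limits used by the strategy network can be sketched as follows. This is an illustrative stand-alone version in plain Python, not the network code: `logits` stands for the three raw outputs of a hypothetical limit head for one venue, and `gains` for the values of the bracketed term of the criterion at each limit, both supplied by the caller.

```python
import math

LIMITS = (-1, 0, 1)  # candidate limits for one venue


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_criterion(logits, gains):
    """Probability-weighted objective: sum over a of P(p = a) * gain(a)."""
    probs = softmax(logits)
    return sum(pr * g for pr, g in zip(probs, gains))


def chosen_limit(logits):
    """Limit actually used in practice: the most probable one."""
    probs = softmax(logits)
    best_index = max(range(len(LIMITS)), key=lambda k: probs[k])
    return LIMITS[best_index]
```

The weighted sum is differentiable in the logits, which is what allows gradient-based training, while `chosen_limit` recovers a discrete decision for execution and for plotting.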

We assume that the trader is confident about his estimation of $\sigma$. Therefore he uses the Bayesian update only on the drift $\mu$ of the asset. The venues share identical parameters, which will be inferred by the trader through time.

We first plot in Figures 6 and 7 the evolution through time of the value function of the trader in the state $\psi^1 = \psi^2 = 1$ and $I^1 = I^2 = 0$ during a slice of execution, obtained through the finite difference method.

Figure 6: Value function with respect to the inventory between $t = 0$ and $t = 0.4$.

Figure 7: Evolution of the value function $v$ between $t = 0.5$ and $t = 0.9$.

[…] $g(q - q^\star_t)$ in (2.3). The maximum value indicates the optimal inventory for the next step in the slice. When $t$ increases, the maximum shifts toward zero, which means that the trader wants to finish the execution at the end of the slice.

We plot in Figures 8 and 9 the value function of (2.3) obtained using neural networks. We can see that the neural networks approximate the value function accurately.

Figure 8: Evolution of the value function $v$ between $t = 0$ and $t = 0.4$.

Figure 9: Evolution of the value function $v$ between $t = 0.5$ and $t = 0.9$.

In Figures 10 and 11, we plot the limits at which the trader posts his limit orders in the two venues, given equal spread and imbalance processes.

Figure 10: Limit strategy in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

Figure 11: Limit strategy in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

As the trader has the same prior distribution in the two venues, his strategy is the same in both venues. At the beginning of the slice, i.e. at $t = 0$, the maximum of the value function is near $q = 32000$. Therefore, if the trader has a lower inventory, he does not post any orders and waits for the next time step. If he has a higher inventory, he tries to reach an inventory of $q = 32000$. For $q \in [32000,$ […] $q \in [34000,$ […] $q >$ […] $M$ market states, the trader faces a similar trade-off), we considered the same set of controls for the limit at which the trader can send his order. For this reason, we see that even for a spread equal to $\delta$ the trader can submit an order to the limit $p = -1$, which in practice can be treated as $p = 0$ due to the piecewise monotone nature of the optimal limit strategy (which is, in fact, monotone, though this cannot be reflected by finite differences when the optimal volume equals 0).

When the trader is near the end of the slice, he starts posting limit orders earlier (this can be seen if both volumes and limits are considered). For example, if $t = 0.6$, he begins to trade at the second best limit when $q \in [8000,$ […] $q \in [11000,$ […] $q \in [19000,$ […] $t = 0.9$, the trader does not rush to liquidate his inventory completely. This comes from the absence of a terminal penalty, often used in optimal liquidation problems to guarantee the complete execution of the inventory. It makes it possible, in some sense, to "relax" the optimal execution framework on a slice, as the part of the inventory that has not been executed during one slice is split between the remaining ones.

We plot in Figures 12 and 13 the volumes posted in both venues, for the same spread and imbalance. We see that, at the beginning of the slice, the trader begins to post a nonzero volume only when $q >$ […]

Figure 12: Volume sent to the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

Figure 13: Volume sent to the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

When the second venue has a higher spread, we plot the strategy of the trader in both venues in Figures 14 and 15.

Figure 14: Limit strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$.

Figure 15: Volume strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$.

For $t = 0.5$, we see in Figure 14 that the trader starts to post at the second best limit in the second venue when $q = 10000$ and in the first when $q = 11000$. For $q \in [18000,$ […] $q \in [30000,$ […] $t = 0.5$, he starts to trade at $q = 10000$ for the second venue and at $q = 11000$ for the first one. The volume posted in the first venue increases almost linearly with respect to the inventory. In contrast, the volume posted in the second venue increases until an inventory of $q = 22000$, then stays constant until $q = 30000$ and decreases to zero afterward. This means that for $q \in [10000,$ […] $q >$ […] $q \in [12000,$ […] $q \in [15000,$ […] $q \in [20000,$ […] $q \in [12000,$ […] $q \in [22000,$ […]

Figure 16: Limit order strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

Figure 17: Volume strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

We plot in Figures 18 and 19 the strategies on the limits used by the trader. Since the limits are represented by the probabilities of sending an order to each limit, for the graphical representation we plot the limit corresponding to the highest of the three probabilities. We see that the choice of the limits is in line with that of Figures 10 and 11, up to the states where the optimal order volume is 0 (in this case limit values are indistinguishable for finite differences). When the trader is at the beginning of the slice, for a small inventory, he prefers to collect a higher spread by being executed at the second best limit. When he is near the end of the slice, he prefers to be filled at a less favorable price, at the best or new best limit, in order to lower his execution risk. We can also see that neural networks preserve the monotonicity of the optimal limit function.

Figure 18: Limit order strategy in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 19: Limit order strategy in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

In Figures 20 and 21, we plot the volumes posted by the trader in both venues for the same spread and imbalance. We see that the strategy is a smoothed approximation of the one obtained using finite differences in Figures 12 and 13. At the very beginning of the slice, the trader does not trade if his inventory is already small enough. The strategy in both venues is the same up to some negligible numerical effects.

Figure 20: Volume posted in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 21: Volume posted in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

If the spread of the second venue is higher, we see in Figure 22 that the strategy on the limits is the same as in Figure 14. It is interesting to note in Figure 23 that, unlike in Figure 15, the trader does not stop posting in the second venue, again because of the approximation coming from neural networks. However, this behavior makes it possible to explore the venue parameters. For example, if the trader follows the strategy given by finite differences in Figure 15, he posts a volume equal to 0 in the second venue when $q >$ […] $t = 0.5$. However, if the trader underestimates the prior on the filling probability in the second venue $\hat\lambda$, he will keep sending orders to the first venue, neglecting the possibility of splitting his orders, which could improve his execution. Moreover, Figures 8 and 9 show that this slight difference in the obtained controls does not drastically change the performance of the trader in terms of the value function.

Figure 22: Limit order strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 23: Volume strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$, using neural networks.

The same comments apply to Figures 24 and 25, where we see that the trader posts a small but nonzero volume in the first venue, which has a less favorable imbalance. This potentially allows him to explore this venue and improve parameter estimates faster.

Figure 24: Limit order strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

Figure 25: Volume strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

In this section, we analyze the behavior of a trader who believes that the first venue is better than the second venue in terms of filling rate. We compare the solutions obtained via finite difference schemes and neural networks.

We show in Figures 26 and 27 the evolution of the value function of the trader during a slice of execution, obtained through the finite difference method.

Figure 26: Evolution of the value function $v$ between $t = 0$ and $t = 0.4$.

Figure 27: Evolution of the value function $v$ between $t = 0.5$ and $t = 0.9$.

[…] $v$ at $t = 0.$ […] $q = 50000$ is […]

Figure 28: Evolution of the value function $v$ between $t = 0$ and $t = 0.4$.

Figure 29: Evolution of the value function $v$ between $t = 0.5$ and $t = 0.9$.

In Figures 30 and 31, we show the limit order strategy of the trader in the two venues for the same spreads and imbalances. As the second venue is less favorable for execution, the trader prefers to create a new best limit for smaller inventories. For example, when $t = 0.6$, he posts an order at the new best limit in the first venue starting from $q = 19000$, whereas in the second venue he prefers to create a new limit starting from $q = 18000$. Generally, either at the beginning or at the end of the slice, the trader prefers to post at a lower limit in the second venue in order to increase his execution rate there, sacrificing the spread that could have been collected.

Figure 30: Limit order strategy in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

Figure 31: Limit order strategy in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

The strategy of the trader differs drastically in terms of order volumes. In Figures 32 and 33, we see that the trader posts the majority of his volume in the first venue. Especially when at $t = 0.$ […]

Figure 32: Volume posted in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

Figure 33: Volume posted in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$.

In Figures 34 and 35, we see the limits and the volumes recommended to the trader when the second venue has a higher spread and the imbalances are equal. The trader posts an even smaller volume in the second venue, compared to Figure 15. As the filling rate is lower in the second venue, the trader decreases his liquidity consumption in this venue, because of the smaller probability of collecting a higher spread.

The strategy on the limits in Figure 34 is also different from the one in Figure 14. When $t = 0.$ […] $q \in [11000,$ […] $q \in [13000,$ […] $q \in [18000,$ […] $q \in [10000,$ […] $q \in [12000,$ […] $q \in [17000,$ […]

Figure 34: Limit order strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$.

Figure 35: Volume strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$.

If the imbalance is more favorable in the second venue, we see in Figures 36 and 37 that the strategy is very different from the one in Figures 16 and 17, where the two venues shared the same characteristics. As the second venue has a more favorable imbalance, the trader posts a higher volume in it. However, he still posts a nonzero volume in the first venue, because of its overall better filling ratio. This contrasts with Figure 17 where, at sufficiently high inventories, the trader stops sending orders to the first venue. Due to the trade-off between an overall higher filling ratio in the first venue and a more favorable imbalance in the second venue, the trader splits his liquidity consumption between the two venues.

The strategy on the limits in Figure 36 also differs from the one with two identical venues in Figure 16. For $t = 0.$ […] $q \in [10000,$ […] $q \in [13000,$ […] $q \in [20000,$ […] $q \in [10000,$ […] $q \in [12000,$ […] $q >$ […]

Figure 36: Limit order strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

Figure 37: Volume strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

We observe in Figures 38 and 39 that the strategy of the trader on the limits is in line with the one in Figures 30 and 31, up to the states where the optimal volume of the order equals 0.

Figure 38: Limit order strategy in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 39: Limit order strategy in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

In Figures 40 and 41, we see that the strategy of the trader on the posted volumes is well approximated and smoothed by neural networks.

Figure 40: Volume posted in the first venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 41: Volume posted in the second venue, $\psi^1 = \psi^2 = \delta$, $I^1 = I^2 = 0$, using neural networks.

In Figures 42 and 43, we see that in the case of a higher spread in the second venue, because of the neural network parametrization of the strategy, the trader posts a nonzero volume in the second venue, leaving the possibility to better explore the filling ratios. Results are in line with the ones in Figures 34 and 35: the trader posts the majority of his volume in the first venue because of a lower spread and a more favorable filling ratio.

Figure 42: Limit order strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$, using neural networks.

Figure 43: Volume strategy, $\psi^1 = \delta$, $\psi^2 = 2\delta$, $I^1 = I^2 = 0$, using neural networks.

Finally, we show in Figures 44 and 45 a behavior similar to that of the finite difference schemes in Figures 36 and 37: the trader posts a higher volume in the second venue due to a more favorable imbalance, and keeps posting in the first venue due to an overall more favorable filling ratio.

Figure 44: Limit order strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

Figure 45: Volume strategy, $\psi^1 = \psi^2 = \delta$, $I^1 = -0.5$, $I^2 = 0.5$.

In this section, we analyze the effectiveness of the Bayesian update framework through several execution slices.

We first show an example of a market simulation over one slice and the corresponding trading strategy, both illustrated in Figure 46.

At $t = 0.2$, the spread in both venues is equal to $\delta$, with an unfavorable imbalance in both venues. In that case, as the two venues share the same characteristics and the inventory is sufficiently close to the optimal one for the next step, the trader sends the same quantity to both venues, which is close to zero.

When $t = 0.5$, the first venue has an unfavorable imbalance, and the second venue has a higher spread. In this configuration, the trader sends a higher volume to the first venue, in order to get a better filling rate due to a lower spread.

Finally, at $t = 0.7$, the first venue has a higher spread and a more favorable imbalance compared to the second venue. This leads to a higher volume in the first venue at the second best limit and a lower volume in the second venue at the first best limit. The favorable imbalance in the first venue indicates a higher probability of execution for an order at a higher limit, because the price may move in this direction. Therefore, even if the spread is equal to two ticks, the trader posts in this venue in order to be executed at a more favorable price. As the spread in the second venue is lower but the imbalance is less favorable, he posts at the first best limit to benefit from the trade-off between execution and profit through collecting the spread.

Figure 46: Market simulation: spreads (upper left), imbalances (upper right), volumes (lower left) and limits (lower right) in both venues.

The corresponding execution trajectory is shown in Figure 47, where we can see the typical Implementation Shortfall execution shape, coming from the pre-computed trading curve $q^\star$.

Figure 47: Evolution of the inventory of the trader on a slice of execution.

Update of the execution proportion

We show how the trader updates the market parameters through observations and trading. The update of the execution proportion is quite fast: as can be seen in Figures 48 and 49, a good estimate can be achieved after completing 1-2 slices. In this example we started from the correct prior for the second venue and an inaccurate one for the first.

[Prior vectors $\rho^1$ and $\rho^2$: numerical entries lost in extraction.]

Figure 48: Bayesian update of the executed proportion in the first venue.

Figure 49: Bayesian update of the executed proportion in the second venue.
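This passage does not restate the update rule, so the following is only a sketch under an explicit assumption: the executed proportion takes values in a finite set, and the trader holds a Dirichlet prior over the probabilities of these values. Under that assumption each observed fill simply increments the corresponding pseudo-count. The function names and the prior counts are hypothetical.

```python
def update_dirichlet(alpha, observed_index):
    """Return the posterior pseudo-counts after one observed outcome."""
    posterior = list(alpha)
    posterior[observed_index] += 1.0
    return posterior


def posterior_mean(alpha):
    """Posterior mean of the category probabilities."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical prior with two possible executed proportions.
alpha = [1.0, 1.0]
for outcome in (1, 1, 1):  # three fills observed at the second proportion
    alpha = update_dirichlet(alpha, outcome)
rho_estimate = posterior_mean(alpha)
```

Because every fill is an observation, estimates of this kind move quickly, which is consistent with the fast convergence reported in Figures 48 and 49.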

We plot the convergence of the estimated transition matrices $r^{1,\psi}$, $r^{2,\psi}$ in Figures 50 and 51. We observe quite fast convergence to a good approximation of the spread dynamics parameters.

[Prior values of the $2 \times 2$ matrices $r^{1,\psi}$ and $r^{2,\psi}$: numerical entries lost in extraction.]

Figure 50: Bayesian update of the transition matrix $r^{1,\psi}$ (panels: $r^\psi_{\delta,\delta}$, $r^\psi_{\delta,2\delta}$, $r^\psi_{2\delta,\delta}$, $r^\psi_{2\delta,2\delta}$; estimation against real value).

Figure 51: Bayesian update of the transition matrix $r^{2,\psi}$ (same panels; estimation against real value).

We perform the same study for the transition matrices $r^{1,I}$, $r^{2,I}$ of the imbalance processes in Figures 52 and 53. We see that just a couple of slices are needed to obtain a fairly good approximation, and only a dozen slices (fewer for more granular slices) to reach the right estimate.

Figure 52: Bayesian update of the transition matrix $r^{1,I}$.

Figure 53: Bayesian update of the transition matrix $r^{2,I}$.

In the examples in Figures 52 and 53 we started from the following prior parameters:

[Prior values of the $3 \times 3$ matrices $r^{1,I}$ and $r^{2,I}$: numerical entries lost in extraction.]

As we observe the increments of the price process $S_t$ continuously, it is easy to converge toward the real market drift. In the example of Figure 54, we find $\mu^{\text{true}} = -0.5$, starting from a prior of $\mu = 0.1$. It took 20 slices to find the real value, even though in this example we assumed high confidence in our prior estimate ($\nu = 0.02$), which turned out to be incorrect.

Figure 54: Bayesian update of the drift of the asset.
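The behavior of the drift estimate can be illustrated with the standard conjugate Gaussian update. The sketch assumes a prior $\mu \sim \mathcal{N}(\mu_0, \nu^2)$, a known volatility $\sigma$, and a price change over an elapsed time $t$ distributed as $\mathcal{N}(\mu t, \sigma^2 t)$. The function name and the value of $\sigma$ are placeholders, and the paper's exact parametrization may differ.

```python
def update_drift(prior_mean, prior_std, sigma, elapsed, price_change):
    """Posterior mean and standard deviation of the drift.

    Uses the conjugate update for price_change ~ N(mu * elapsed, sigma^2 * elapsed).
    """
    prior_precision = 1.0 / prior_std ** 2
    data_precision = elapsed / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + price_change / sigma ** 2)
    return post_mean, post_var ** 0.5


# A confident but wrong prior (0.1 with std 0.02) against data drifting at -0.5.
early_mean, _ = update_drift(0.1, 0.02, sigma=0.05, elapsed=0.01,
                             price_change=-0.005)
late_mean, _ = update_drift(0.1, 0.02, sigma=0.05, elapsed=1e6,
                            price_change=-0.5e6)
```

A small prior standard deviation gives the prior a large precision, so the estimate initially stays close to the prior mean and only converges once enough price increments have accumulated. This matches the observation that a confident but incorrect prior delays convergence.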

The hardest parameter to update quickly is the intensity of filling, which depends on the states of both venues. In our numerical setting we have 32 possible states, so during one slice of 10 time steps it is not even possible to visit all the states. The convergence results for the parameters $\lambda$ can be found in Figures 55 and 56. We see that full convergence requires a lot of observations; however, we should keep in mind that a strategy close to the optimal one does not require excessive precision.

In this example, we started from the same priors for both venues, whereas the real parameters are different. The priors are:

[Prior values of the $3 \times 3$ matrices $\lambda^1_{\delta,\delta} = \lambda^2_{\delta,\delta}$, $\lambda^1_{\delta,2\delta} = \lambda^2_{\delta,2\delta}$, $\lambda^1_{2\delta,\delta} = \lambda^2_{2\delta,\delta}$ and $\lambda^1_{2\delta,2\delta} = \lambda^2_{2\delta,2\delta}$: numerical entries only partially preserved in extraction.]

Figure 55: Bayesian update of the intensity of limit orders in the first venue (one panel per intensity parameter $\lambda$; estimation against real value over 20000 observations).

Figure 56: Bayesian update of the intensity of limit orders in the second venue (one panel per intensity parameter $\lambda$; estimation against real value over 20000 observations).

Appendix A Proof of Theorem 2.3

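It can be shown with the dynamic programming principle that the HJBQVI (2.3) does not depend on the cash variable $x$. We set $(\tilde q, \tilde\psi, \tilde I) \in \mathcal{D} = \mathcal{Q} \times \mathcal{K}$ and $(t_i, S_i) \in [0,T) \times \mathbb{R}$ such that, as $i \to +\infty$,

$$t_i \to \hat t, \qquad S_i \to \hat S, \qquad v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \to v_\star(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I).$$

We begin with $\hat t = T$. By taking $\ell^n = 0$ for all $n \in \{1, \dots, N\}$ we get

$$v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \ge \mathbb{E}_{t_i, \tilde q, S_i, \tilde\psi, \tilde I}\left[ Q_T S_T - \int_0^T g(q^\star_t - q_t)\, dt \right].$$

By dominated convergence, we get $v_\star(T, \tilde q, \hat S, \tilde\psi, \tilde I) \ge \tilde q \hat S$.

Assume now that $\hat t < T$ and that the minimum in the HJBQVI is given by the first term. We let $\varphi : [0,T) \times \mathbb{R} \times \mathcal{D} \to \mathbb{R}$ be $C^1$ in time, $C^2$ in $S$ and such that

$$0 = \min_{[0,T] \times \mathbb{R} \times \mathcal{D}} (v_\star - \varphi) = (v_\star - \varphi)(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I).$$

If there exists $\eta > 0$ such that

$$\begin{aligned}
\eta < {} & \partial_t \varphi(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - g(q - q^\star_t) + \mu \partial_S \varphi + \frac{1}{2}\sigma^2 \partial_{SS} \varphi \\
& + \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \varphi(\hat t, \tilde q, \hat S, k_\psi, k_I) - \varphi(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big) \\
& + \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n (\hat S + \tilde\psi^n p^n \delta^n) + \varphi(\hat t, \tilde q - \ell^n \epsilon^n, \hat S, \tilde\psi, \tilde I) - \varphi(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big],
\end{aligned}$$

we should have

$$\begin{aligned}
0 \le {} & \partial_t \varphi(t, \tilde q, S, \tilde\psi, \tilde I) - g(q - q^\star_t) + \mu \partial_S \varphi + \frac{1}{2}\sigma^2 \partial_{SS} \varphi \\
& + \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \varphi(t, \tilde q, S, k_\psi, k_I) - \varphi(t, \tilde q, S, \tilde\psi, \tilde I) \Big) \\
& + \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n (S + \tilde\psi^n p^n \delta^n) + \varphi(t, \tilde q - \ell^n \epsilon^n, S, \tilde\psi, \tilde I) - \varphi(t, \tilde q, S, \tilde\psi, \tilde I) \Big],
\end{aligned}$$

for all $(t, S) \in B = \big( (\hat t - r, \hat t + r) \cap [0,T) \big) \times (\hat S - r, \hat S + r)$ for a given $r \in (0, T - \hat t)$.

We can assume without loss of generality that $B$ contains the sequence $(t_i, S_i)_i$ and, by taking $\eta$ arbitrarily small, that

$$\varphi(t, \tilde q, S, \tilde\psi, \tilde I) + \eta \le v_\star(t, \tilde q, S, \tilde\psi, \tilde I) \le v(t, \tilde q, S, \tilde\psi, \tilde I)$$

on the boundary of $B$, denoted by $\partial_p B$. Without loss of generality we can also assume that

$$\varphi(t, q, S, \psi, I) + \eta \le v_\star(t, q, S, \psi, I) \le v(t, q, S, \psi, I)$$

for $(t, q, S, \psi, I) \in \tilde B$, where

$$\begin{aligned}
\tilde B = \Big\{ (t, q, S, \psi, I) : {} & (t, S) \in B,\ q \in \{ \tilde q - \min_n \epsilon^n,\ \tilde q,\ \tilde q + \min_n \epsilon^n \}, \\
& \psi \in \prod_{n=1}^N \{ \tilde\psi^n - \delta^n,\ \tilde\psi^n,\ \tilde\psi^n + \delta^n \},\ I \in \prod_{n=1}^N \{ \tilde I^n - 1,\ \tilde I^n,\ \tilde I^n + 1 \},\ (q, \psi, I) \ne (\tilde q, \tilde\psi, \tilde I) \Big\}.
\end{aligned}$$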

We introduce the set

$$B_{\mathcal{D}} = \big\{ (t, \tilde q, S, \tilde\psi, \tilde I) : (t, S) \in B \big\},$$

and denote by $\tau_i$ the first exit time of $(t, q_t, S_t, \psi_t, I_t)_{t \ge t_i}$ from $B_{\mathcal{D}}$, with $q_{t_i} = \tilde q$, $S_{t_i} = S_i$, $\psi_{t_i} = \tilde\psi$, $I_{t_i} = \tilde I$, where the processes are controlled by the optimal controls $(\ell^n, p^n)_{n \in \{1, \dots, N\}} \in \mathcal{A} \times Q_\psi$. By Itô's formula, we get

$$\begin{aligned}
\varphi(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) = {} & \varphi(t_i, q_{t_i}, S_{t_i}, \psi_{t_i}, I_{t_i}) + \int_{t_i}^{\tau_i} \Big( \partial_t \varphi(s, q_s, S_s, \psi_s, I_s) + \mu \partial_S \varphi + \frac{1}{2}\sigma^2 \partial_{SS} \varphi \\
& + \sum_{k \in \mathcal{K}} r_{(\psi_s, I_s),(k_\psi, k_I)} \big( \varphi(s, q_s, S_s, k_\psi, k_I) - \varphi(s, q_s, S_s, \psi_s, I_s) \big) \\
& + \sum_{n=1}^N \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \varphi(s, q_s - \ell^n_s \epsilon^n_s, S_s, \psi_s, I_s) - \varphi(s, q_s, S_s, \psi_s, I_s) \big] \Big)\, ds + M(\tau_i, t_i),
\end{aligned}$$

where $M(\tau_i, t_i)$ is a martingale. This can be rewritten as

$$\begin{aligned}
\varphi(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) = {} & \varphi(t_i, q_{t_i}, S_{t_i}, \psi_{t_i}, I_{t_i}) + \int_{t_i}^{\tau_i} \Big( \partial_t \varphi(s, q_s, S_s, \psi_s, I_s) + \mu \partial_S \varphi - g(q_s - q^\star(s)) + \frac{1}{2}\sigma^2 \partial_{SS} \varphi \\
& + \sum_{k \in \mathcal{K}} r_{(\psi_s, I_s),(k_\psi, k_I)} \big( \varphi(s, q_s, S_s, k_\psi, k_I) - \varphi(s, q_s, S_s, \psi_s, I_s) \big) \\
& + \sum_{n=1}^N \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) + \varphi(s, q_s - \ell^n_s \epsilon^n_s, S_s, \psi_s, I_s) - \varphi(s, q_s, S_s, \psi_s, I_s) \big] \Big)\, ds \\
& + M(\tau_i, t_i) - \sum_{n=1}^N \int_{t_i}^{\tau_i} \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) \big]\, ds + \int_{t_i}^{\tau_i} g(q_s - q^\star(s))\, ds.
\end{aligned}$$

Since the first integrand is nonnegative on $B$, we derive

$$\varphi(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) \ge \varphi(t_i, q_{t_i}, S_{t_i}, \psi_{t_i}, I_{t_i}) + M(\tau_i, t_i) - \sum_{n=1}^N \int_{t_i}^{\tau_i} \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) \big]\, ds + \int_{t_i}^{\tau_i} g(q_s - q^\star(s))\, ds.$$

As the martingale term vanishes with the expectation, we get

$$\varphi(t_i, q_{t_i}, S_{t_i}, \psi_{t_i}, I_{t_i}) \le \mathbb{E}\left[ \varphi(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) + \sum_{n=1}^N \int_{t_i}^{\tau_i} \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) \big]\, ds - \int_{t_i}^{\tau_i} g(q_s - q^\star(s))\, ds \right],$$

and thus

$$\varphi(t_i, q_{t_i}, S_{t_i}, \psi_{t_i}, I_{t_i}) \le -\eta + \mathbb{E}\left[ v(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) + \sum_{n=1}^N \int_{t_i}^{\tau_i} \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) \big]\, ds - \int_{t_i}^{\tau_i} g(q_s - q^\star(s))\, ds \right].$$

Since $v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) - \varphi(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \to 0$, for $i$ sufficiently large we deduce

$$v(t_i, \tilde q, S_{t_i}, \tilde\psi, \tilde I) \le -\frac{\eta}{2} + \mathbb{E}\left[ v(\tau_i, q_{\tau_i}, S_{\tau_i}, \psi_{\tau_i}, I_{\tau_i}) + \sum_{n=1}^N \int_{t_i}^{\tau_i} \lambda^n(\psi_s, I_s, p^n_s, \ell_s)\, \mathbb{E}\big[ \epsilon^n_s \ell^n_s (S_s + \psi^n_s p^n_s \delta^n) \big]\, ds - \int_{t_i}^{\tau_i} g(q_s - q^\star(s))\, ds \right],$$

which contradicts the dynamic programming principle. In conclusion, we necessarily have

$$\begin{aligned}
0 \ge {} & \partial_t v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - g(q - q^\star_t) + \mu \partial_S v + \frac{1}{2}\sigma^2 \partial_{SS} v \\
& + \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( v(\hat t, \tilde q, \hat S, k_\psi, k_I) - v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big) \\
& + \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n (\hat S + \tilde\psi^n p^n \delta^n) + v(\hat t, \tilde q - \ell^n \epsilon^n, \hat S, \tilde\psi, \tilde I) - v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big].
\end{aligned}$$

The second part of the HJBQVI being straightforward, we have proved that $v$ is a viscosity supersolution of the HJBQVI on $[0,T) \times \mathbb{R} \times \mathcal{D}$. The proof for the subsolution is identical, except that we need to prove

$$\sum_{n=1}^N \sup_{m^n \in [0, m]} \left[ m^n \Big( S - \frac{\psi^n}{2} \Big) + v(t, q - m^n, S, \psi, I) \right] - v(t, q, S, \psi, I) \ge 0,$$

which is direct by choosing the constant controls $m^n = 0$ for all $n \in \{1, \dots, N\}$.

For the proof of the uniqueness, we recall the definition of subjet and superjet.

Definition Appendix A.1.

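Let $v : [0,T) \times \mathbb{R} \times \mathcal{D} \to \mathbb{R}$ be l.s.c. (resp. u.s.c.) with respect to $(\hat t, \hat S)$. For $(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \in [0,T) \times \mathbb{R} \times \mathcal{D}$ we say that $(y, p, A) \in \mathbb{R}^3$ is in the subjet $\mathcal{P}^- v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$ (resp. the superjet $\mathcal{P}^+ v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$) if

$$v(t, \tilde q, S, \tilde\psi, \tilde I) \ge v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) + y(t - \hat t) + p(S - \hat S) + \frac{1}{2} A (S - \hat S)^2 + o\big( |t - \hat t| + |S - \hat S|^2 \big)$$

(resp.

$$v(t, \tilde q, S, \tilde\psi, \tilde I) \le v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) + y(t - \hat t) + p(S - \hat S) + \frac{1}{2} A (S - \hat S)^2 + o\big( |t - \hat t| + |S - \hat S|^2 \big)$$

), for all $(t, S)$ such that $(t, \tilde q, S, \tilde\psi, \tilde I) \in [0,T) \times \mathbb{R} \times \mathcal{D}$. We also define $\bar{\mathcal{P}}^- v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$ as the set of points $(y, p, A) \in \mathbb{R}^3$ such that there exists a sequence $(t_i, \tilde q, S_i, \tilde\psi, \tilde I, y_i, p_i, A_i)$ with $(y_i, p_i, A_i) \in \mathcal{P}^- v(t_i, \tilde q, S_i, \tilde\psi, \tilde I)$ satisfying

$$(t_i, \tilde q, S_i, \tilde\psi, \tilde I, y_i, p_i, A_i) \underset{i \to +\infty}{\longrightarrow} (\hat t, \tilde q, \hat S, \tilde\psi, \tilde I, y, p, A).$$

The set $\bar{\mathcal{P}}^+ v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$ is defined similarly.

We now introduce an analogue of Ishii's lemma, whose proof can be found in [11].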

Lemma Appendix A.2.

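A l.s.c. (resp. u.s.c.) function $v$ is a supersolution (resp. subsolution) of the HJBQVI on $[0,T) \times \mathbb{R} \times \mathcal{D}$ if and only if for all $(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \in [0,T) \times \mathbb{R} \times \mathcal{D}$ and all $(\hat y, \hat p, \hat A) \in \bar{\mathcal{P}}^- v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$ (resp. $\bar{\mathcal{P}}^+ v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I)$), we have

$$\begin{aligned}
0 \le \min \Bigg\{ & -\hat y + g(\tilde q - q^\star(\hat t)) - \mu \hat p - \frac{1}{2}\sigma^2 \hat A - \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( v(\hat t, \tilde q, \hat S, k_\psi, k_I) - v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big) \\
& - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n (\hat S + \tilde\psi^n p^n \delta^n) + v(\hat t, \tilde q - \ell^n \epsilon^n, \hat S, \tilde\psi, \tilde I) - v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \Big] \ ; \\
& \sum_{n=1}^N \Big( v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \sup_{m^n \in [0, m]} \Big[ m^n \Big( \hat S - \frac{\tilde\psi^n}{2} \Big) + v(\hat t, \tilde q - m^n, \hat S, \tilde\psi, \tilde I) \Big] \Big) \Bigg\}
\end{aligned}$$

(resp. the same minimum is $\le 0$).

We now prove the following comparison principle: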

Proposition Appendix A.3.

Let $u$ (resp. $v$) be a l.s.c. supersolution (resp. u.s.c. subsolution) with polynomial growth of the HJBQVI on $[0,T) \times \mathbb{R} \times \mathcal{D}$. If $u \ge v$ on $\{T\} \times \mathbb{R} \times \mathcal{D}$, then $u \ge v$ on $[0,T) \times \mathbb{R} \times \mathcal{D}$.

Proof. For $\rho > 0$, we introduce

$$\tilde u(t, q, S, \psi, I) = e^{\rho t} u(t, q, S, \psi, I), \qquad \tilde v(t, q, S, \psi, I) = e^{\rho t} v(t, q, S, \psi, I).$$

Then, $\tilde u$ and $\tilde v$ are respectively supersolution and subsolution of the following equation:

$$\begin{aligned}
0 = \min \Bigg\{ & -\partial_t w(t, q, S, \psi, I) + \rho w(t, q, S, \psi, I) + g(q - q^\star_t) - \mu \partial_S w - \frac{1}{2}\sigma^2 \partial_{SS} w \\
& - \sum_{k \in \mathcal{K}} r_{(\psi, I),(k_\psi, k_I)} \Big( w(t, q, S, k_\psi, k_I) - w(t, q, S, \psi, I) \Big) \\
& - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\psi, I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n e^{\rho t} (S + \psi^n p^n \delta^n) + w(t, q - \ell^n \epsilon^n, S, \psi, I) - w(t, q, S, \psi, I) \Big] \ ; \\
& \sum_{n=1}^N \Big( w(t, q, S, \psi, I) - \sup_{m^n \in [0, m]} \Big[ m^n e^{\rho t} \Big( S - \frac{\psi^n}{2} \Big) + w(t, q - m^n, S, \psi, I) \Big] \Big) \Bigg\}
\end{aligned}$$

on $[0,T) \times \mathbb{R} \times \mathcal{D}$, with $\tilde u \ge \tilde v$ on $\{T\} \times \mathbb{R} \times \mathcal{D}$. In order to prove the proposition, we only have to show that $\tilde u \ge \tilde v$ on $[0,T) \times \mathbb{R} \times \mathcal{D}$. Assume that the minimum is given by the first term and that $\sup_{[0,T) \times \mathbb{R} \times \mathcal{D}} (\tilde v - \tilde u) > 0$. We fix $p \in \mathbb{N}^\star$ such that

$$\lim_{\|S\| \to +\infty} \ \sup_{t \in [0,T],\ (q, \psi, I) \in \mathcal{D}} \frac{|\tilde u(t, q, S, \psi, I)| + |\tilde v(t, q, S, \psi, I)|}{\|S\|^p} = 0.$$

Then, there exists $(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \in [0,T] \times \mathbb{R} \times \mathcal{D}$ such that

$$0 < \tilde v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \tilde u(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \phi(\hat t, \hat S, \hat S) = \max_{(t, q, S, \psi, I)} \Big( \tilde v(t, q, S, \psi, I) - \tilde u(t, q, S, \psi, I) - \phi(t, S, S) \Big),$$

where $\epsilon > 0$ and

$$\phi(t, S, R) = \epsilon \exp(-\tilde\kappa t) \big( \|S\|^p + \|R\|^p \big), \qquad \tilde\kappa > 0.$$

Since $\tilde u \ge \tilde v$ on $\{T\} \times \mathbb{R} \times \mathcal{D}$, we directly have $\hat t < T$.

For all $i \in \mathbb{N}$, we can find $(t_i, S_i, R_i)$ such that

$$\begin{aligned}
0 < {} & \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) - \phi(t_i, S_i, R_i) - i |S_i - R_i|^2 - \big( |t_i - \hat t|^2 + |S_i - \hat S|^4 \big) \\
= {} & \max_{(t, S, R)} \Big( \tilde v(t, \tilde q, S, \tilde\psi, \tilde I) - \tilde u(t, \tilde q, R, \tilde\psi, \tilde I) - \phi(t, S, R) - i |S - R|^2 - \big( |t - \hat t|^2 + |S - \hat S|^4 \big) \Big).
\end{aligned}$$

Then we have $(t_i, S_i, R_i) \to (\hat t, \hat S, \hat S)$ as $i \to +\infty$, up to a subsequence, and

$$\tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) - \phi(t_i, S_i, R_i) - i |S_i - R_i|^2 - \big( |t_i - \hat t|^2 + |S_i - \hat S|^4 \big) \underset{i \to +\infty}{\longrightarrow} \tilde v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \tilde u(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \phi(\hat t, \hat S, \hat S).$$

Let us then denote, for $i \in \mathbb{N}^\star$,

$$\varphi_i(t, S, R) = \phi(t, S, R) + i |S - R|^2 + |t - \hat t|^2 + |S - \hat S|^4 \qquad \forall (t, S, R) \in [0,T] \times \mathbb{R}^2.$$

Then Ishii's Lemma (see [9, 14]) guarantees that for all $\eta > 0$, we can find $(y^1_i, p^1_i, A^1_i) \in \bar{\mathcal{P}}^+ \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I)$ and $(y^2_i, p^2_i, A^2_i) \in \bar{\mathcal{P}}^- \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I)$ such that

$$y^1_i - y^2_i = \partial_t \varphi_i(t_i, S_i, R_i), \qquad (p^1_i, p^2_i) = (\partial_S \varphi_i, -\partial_R \varphi_i)(t_i, S_i, R_i),$$

and

$$\begin{pmatrix} A^1_i & 0 \\ 0 & -A^2_i \end{pmatrix} \le H_{SR} \varphi_i(t_i, S_i, R_i) + \eta \big( H_{SR} \varphi_i(t_i, S_i, R_i) \big)^2,$$

where $H_{SR} \varphi_i(t_i, \cdot, \cdot)$ denotes the Hessian matrix of $\varphi_i(t_i, \cdot, \cdot)$.
Applying Lemma Appendix A.2, we get

$$\begin{aligned}
\rho \big( \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \big) \le {} & y^1_i - y^2_i + \frac{1}{2}\sigma^2 (A^1_i - A^2_i) + \mu (p^1_i - p^2_i) \\
& + \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \tilde v(t_i, \tilde q, S_i, k_\psi, k_I) - \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \Big) \\
& + \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n e^{\rho t_i} (S_i + \tilde\psi^n p^n \delta^n) + \tilde v(t_i, \tilde q - \ell^n \epsilon^n, S_i, \tilde\psi, \tilde I) - \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \Big] \\
& - \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \tilde u(t_i, \tilde q, R_i, k_\psi, k_I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \Big) \\
& - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n e^{\rho t_i} (R_i + \tilde\psi^n p^n \delta^n) + \tilde u(t_i, \tilde q - \ell^n \epsilon^n, R_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \Big].
\end{aligned}$$

Moreover, we have

$$H_{SR} \varphi_i(t_i, S_i, R_i) = \begin{pmatrix} \partial_{SS} \phi(t_i, S_i, R_i) + 2i + 12 (S_i - \hat S)^2 & \partial_{SR} \phi(t_i, S_i, R_i) - 2i \\ \partial_{SR} \phi(t_i, S_i, R_i) - 2i & \partial_{RR} \phi(t_i, S_i, R_i) + 2i \end{pmatrix},$$

and

$$\partial_S \varphi_i(t_i, S_i, R_i) = \partial_S \phi(t_i, S_i, R_i) + 2i (S_i - R_i) + 4 (S_i - \hat S)^3, \qquad \partial_R \varphi_i(t_i, S_i, R_i) = \partial_R \phi(t_i, S_i, R_i) - 2i (S_i - R_i),$$

so from what precedes we can write

$$\begin{aligned}
\rho \big( \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \big) \le {} & \partial_t \phi(t_i, S_i, R_i) + 2 (t_i - \hat t) + \mu \Big( \partial_S \phi(t_i, S_i, R_i) + \partial_R \phi(t_i, S_i, R_i) + 4 (S_i - \hat S)^3 \Big) \\
& + \frac{1}{2}\sigma^2 \Big( \partial_{SS} \phi(t_i, S_i, R_i) + \partial_{RR} \phi(t_i, S_i, R_i) + 2 \partial_{SR} \phi(t_i, S_i, R_i) + 12 (S_i - \hat S)^2 \Big) + \eta C_i \\
& + \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \tilde v(t_i, \tilde q, S_i, k_\psi, k_I) - \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \Big) \\
& + \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n e^{\rho t_i} (S_i + \tilde\psi^n p^n \delta^n) + \tilde v(t_i, \tilde q - \ell^n \epsilon^n, S_i, \tilde\psi, \tilde I) - \tilde v(t_i, \tilde q, S_i, \tilde\psi, \tilde I) \Big] \\
& - \sum_{k \in \mathcal{K}} r_{(\tilde\psi, \tilde I),(k_\psi, k_I)} \Big( \tilde u(t_i, \tilde q, R_i, k_\psi, k_I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \Big) \\
& - \sup_{p \in Q_\psi,\, \ell \in \mathcal{A}} \sum_{n=1}^N \lambda^n(\tilde\psi, \tilde I, p^n, \ell)\, \mathbb{E}\Big[ \epsilon^n \ell^n e^{\rho t_i} (R_i + \tilde\psi^n p^n \delta^n) + \tilde u(t_i, \tilde q - \ell^n \epsilon^n, R_i, \tilde\psi, \tilde I) - \tilde u(t_i, \tilde q, R_i, \tilde\psi, \tilde I) \Big],
\end{aligned}$$

where $C_i$ does not depend on $\eta$. As $\tilde v$ is u.s.c., $\tilde u$ is l.s.c. and $(t_i, S_i, R_i)_i$ is convergent, when $\eta \to 0$ and $i \to +\infty$, for a certain constant $M$ we get

$$\rho \big( \tilde v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \tilde u(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) \big) \le \partial_t \phi(\hat t, \hat S, \hat S) + \mu \Big( \partial_S \phi(\hat t, \hat S, \hat S) + \partial_R \phi(\hat t, \hat S, \hat S) \Big) + \frac{1}{2}\sigma^2 \Big( \partial_{SS} \phi(\hat t, \hat S, \hat S) + \partial_{RR} \phi(\hat t, \hat S, \hat S) + 2 \partial_{SR} \phi(\hat t, \hat S, \hat S) \Big) + M.$$

For $\tilde\kappa > 0$ and $\rho > 0$ large enough, we obtain

$$\tilde v(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) - \tilde u(\hat t, \tilde q, \hat S, \tilde\psi, \tilde I) < 0,$$

which yields a contradiction. The proof for the other part of the HJBQVI is direct.

With the two above propositions, it is easy to conclude the proof of the theorem.
Indeed, as $v_\star$ is a supersolution such that $v_\star \ge v$ on $\{T\} \times \mathbb{R} \times \mathcal{D}$, and $v^\star$ is a subsolution such that $v^\star \le v$ on $\{T\} \times \mathbb{R} \times \mathcal{D}$, we can apply the maximum principle to get $v_\star \ge v^\star$ on $[0,T] \times \mathbb{R} \times \mathcal{D}$. But by definition of $v_\star$ and $v^\star$, we must have $v_\star \le v \le v^\star$ on $[0,T] \times \mathbb{R} \times \mathcal{D}$, which proves that $v_\star = v = v^\star$ and that $v$ is continuous. The maximum principle implies that if two continuous viscosity solutions of the HJBQVI satisfy the same terminal condition, they are equal on $[0,T] \times \mathbb{R} \times \mathcal{D}$, hence the uniqueness.

Appendix B Application to OTC market making

Appendix B.1 Framework

The model we present in this article is designed for trading in cross-listed stocks in limit order books. However, it can be adapted straightforwardly to handle the problem of an OTC market maker, who often deals with a large number of assets driven by a few factors. We borrow here the factorial method market making framework of [10] (we also keep their notation, for this section only). We consider a market maker who is in charge of providing bid and ask quotes on $d$ assets, whose dynamics are

$$dS^i_t = \mu^i dt + \sigma^i dW^i_t, \qquad i \in \{1, \dots, d\},$$

where $\mu^i$ is the drift of the $i$-th asset, $\sigma^i$ is its volatility and $(W^1_t, \dots, W^d_t)$ is a $d$-dimensional Brownian motion. We consider a non-singular variance-covariance matrix $\Sigma = (\rho^{i,j} \sigma^i \sigma^j)_{i,j \in \{1, \dots, d\}}$ for the vector of assets $(S^1_t, \dots, S^d_t)$. The market maker sets bid and ask prices on every asset:

$$S^{i,b}(t, z) = S^i_t - \delta^{i,b}(t, z), \qquad S^{i,a}(t, z) = S^i_t + \delta^{i,a}(t, z), \qquad z \in \mathbb{R},$$

where $\delta = (\delta^{i,a}, \delta^{i,b})_{i \in \{1, \dots, d\}}$ are the (predictable and uniformly lower bounded) bid and ask quotes around the mid-price of each asset. The volumes of transactions on the bid and ask sides are modeled by marked point processes $N^{i,b}(dt, dz)$, $N^{i,a}(dt, dz)$ of intensity $\nu^{i,b}_t(dz)$, $\nu^{i,a}_t(dz)$ defined by

$$\nu^{i,j}_t(dz) = \Lambda^{i,j}\big( \delta^{i,j}(t, z) \big)\, \eta^{i,j}(dz), \qquad i \in \{1, \dots, d\},$$

where $\Lambda^{i,j}$ is a sufficiently regular function (exponential, logistic, SU Johnson, etc.) modeling the probability to trade on the asset $i$, on the side $j$, for a given spread $\delta^{i,j}(t, z)$ and a size $z$. The functions $\eta^{i,j}(dz)$ are probability densities over $\mathbb{R}_+$ modeling the distribution of a trade size. The market maker manages his inventory process $q_t = (q^1_t, \dots, q^d_t)$, whose dynamics are given by

$$dq^i_t = \int_{\mathbb{R}_+} z N^{i,b}(dt, dz) - \int_{\mathbb{R}_+} z N^{i,a}(dt, dz), \qquad i \in \{1, \dots, d\}.$$
The market maker also manages his cash process, whose dynamics are given by

$$dX_t = \sum_{i=1}^d \left( \int_{\mathbb{R}_+} z S^{i,a}(t, z) N^{i,a}(dt, dz) - \int_{\mathbb{R}_+} z S^{i,b}(t, z) N^{i,b}(dt, dz) \right).$$

His optimization problem is defined as

$$\sup_\delta \ \mathbb{E}\left[ X_T + \sum_{i=1}^d q^i_T S^i_T - \int_0^T \phi(q_t)\, dt \right],$$

where $\phi$ is a running penalty that discourages overly large positions and $\sum_{i=1}^d q^i_T S^i_T$ is the mark-to-market value of the market maker's portfolio at time $T$. The corresponding HJB equation is given by

$$0 = \partial_t v(t, q) + \sum_{i=1}^d q^i \mu^i - \phi(q) + \sum_{i=1}^d \int_{\mathbb{R}_+} z H^{i,b}\left( \frac{v(t, q) - v(t, q + z e^i)}{z} \right) \eta^{i,b}(dz) + \sum_{i=1}^d \int_{\mathbb{R}_+} z H^{i,a}\left( \frac{v(t, q) - v(t, q - z e^i)}{z} \right) \eta^{i,a}(dz),$$

with terminal condition $v(T, q) = 0$, where $H^{i,j}(p) = \sup_\delta \Lambda^{i,j}(\delta)(\delta - p)$ and $(e^1, \dots, e^d)$ is the canonical basis of $\mathbb{R}^d$.

Appendix B.2 Bayesian update for OTC market makers

Usually, the functions $\Lambda^{i,j}$ are of the form

$$\Lambda^{i,j}\big( \delta^{i,j}(t, z) \big) = \lambda^{i,j}_{RFQ}\, f\big( \delta^{i,j}(t, z) \big),$$

where $\lambda^{i,j}_{RFQ}$ is the constant intensity of arrival of requests for quote, and $f\big( \delta^{i,j}(t, z) \big)$ gives the probability that a request will result in a transaction given the quote $\delta$ proposed by the market maker. The estimation of the quantity $\lambda^{i,j}_{RFQ}$ is of particular importance for the market maker, so that he can adjust his quotes depending on his view on the number of requests for a certain asset and a certain side. In the same spirit as in Section 3.1.1, we assume the following prior distribution:

$$\lambda^{i,j}_{RFQ} \sim \Gamma(\alpha^{i,j}, \beta^{i,j}), \qquad \alpha^{i,j}, \beta^{i,j} > 0.$$

For an asset $i \in \{1, \dots, d\}$ on the side $j \in \{a, b\}$, this corresponds to an average intensity of $\frac{\alpha^{i,j}}{\beta^{i,j}}$, with variance equal to $\frac{\alpha^{i,j}}{(\beta^{i,j})^2}$. If the market maker is confident in his estimation of the intensity $\lambda^{i,j}_{RFQ}$, he can choose a large $\beta^{i,j}$ so that the variance of his Bayesian estimator is small.
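To make the role of $(\alpha^{i,j}, \beta^{i,j})$ concrete, here is a minimal Python sketch for a single asset and side. It is an illustration under stated assumptions, not the authors' implementation: the class name `GammaPrior`, the method `update` and all numerical values are hypothetical. It assumes trades arrive as a Poisson process with intensity $\lambda_{RFQ} f(\delta)$, in which case the standard Gamma-Poisson conjugacy adds the number of observed trades to the shape and the acceptance-weighted observation time (the time integral of $f$) to the rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaPrior:
    """Gamma(alpha, beta) belief on an intensity (shape alpha, rate beta)."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Average intensity alpha / beta.
        return self.alpha / self.beta

    @property
    def variance(self) -> float:
        # Variance alpha / beta^2: a larger beta means a tighter belief.
        return self.alpha / self.beta ** 2

    def update(self, n_trades: int, exposure: float) -> "GammaPrior":
        """Standard Gamma-Poisson conjugate update.

        n_trades: number of trades observed so far.
        exposure: integral over time of the acceptance probability f(delta);
                  with a constant quote this is simply f(delta) * elapsed_time.
        """
        return GammaPrior(self.alpha + n_trades, self.beta + exposure)


if __name__ == "__main__":
    # Two hypothetical priors with the same mean but different confidence.
    vague = GammaPrior(alpha=4.0, beta=2.0)
    confident = GammaPrior(alpha=40.0, beta=20.0)
    print(vague.mean, vague.variance)
    print(confident.mean, confident.variance)

    # Hypothetical observation: 30 trades over an exposure of 10 time units.
    print(vague.update(30, 10.0).mean)
    print(confident.update(30, 10.0).mean)
```

With the same data, the vague prior moves much closer to the empirical rate than the confident one, which is the trade-off the choice of $\beta^{i,j}$ controls.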
Given all the information accumulated up to time $t$, his best estimate of the quantity $\lambda^{i,j}_{RFQ}$ is given by

$$\mathbb{E}\big[ \lambda^{i,j}_{RFQ} \,\big|\, N(t, dz) \big] = \frac{\alpha^{i,j} + \int_{\mathbb{R}_+} N(t, dz)}{\beta^{i,j} + \int_{\mathbb{R}_+} \int_0^t f\big( \delta^{i,j}(s, z) \big)\, ds\, \eta^{i,j}(dz)}. \qquad \text{(Appendix B.1)}$$

By the law of large numbers, when the market maker has accumulated a sufficiently large number of observations, his best estimate of $\lambda^{i,j}_{RFQ}$ converges to the "real" intensity of the market. As time passes, the prior parameters $(\alpha^{i,j}, \beta^{i,j})$ of the market maker become less important, as the estimation relies mostly on the observations.

Another important parameter of the model is the size of transactions, which impacts the quotes of the market maker as well as his inventory risk. In [10], the authors choose in their numerical experiments a $\Gamma(a^{i,j}, b^{i,j})$ distribution for $\eta^{i,j}$. The trader can choose between Bayesian updates (revise only $a^{i,j}$, only $b^{i,j}$, or both), depending on his confidence in the parameters' estimation. If he is confident with respect to the shape parameter $a^{i,j}$, that is, he knows approximately the average size of a request but not the standard deviation, he sets $b^{i,j} \sim \Gamma(a^{i,j}_0, b^{i,j}_0)$. Given $n$ observations of size $z_1, \dots, z_n$, the best Bayesian estimate of $b^{i,j}$ (the scale parameter of the size of the request) is

$$\mathbb{E}\big[ b^{i,j} \,\big|\, (z_1, \dots, z_n) \big] = \frac{a^{i,j}_0 + n a^{i,j}}{b^{i,j}_0 + \sum_{i=1}^n z_i}. \qquad \text{(Appendix B.2)}$$

Different prior distributions can be used in the same way to take into account the uncertainty on the shape parameter $a^{i,j}$ (if $b^{i,j}$ is known) or on both $(a^{i,j}, b^{i,j})$.

Another sensitive parameter, especially for multi-asset market making, is the variance-covariance matrix $\Sigma$. This quantity is usually estimated over a long period, but the parameters can change abruptly. For example, let us assume that the market maker is in charge of $d$ assets in 2 different sectors (for instance, technology and aerospace).
Following the factorial approach, the market making problem's dimension will be reduced from $d$ to 3. The three factors mainly correspond to the three highest eigenvalues of the variance-covariance matrix $\Sigma$, and will drive the quotes of the market maker. However, in case of a sectorial tail event, for example the bankruptcy of one of the companies of the tech sector, it is likely that all the correlations between the assets of this sector will rise to one. This will impact the eigenvalue related to the technology sector, and change the quotes of the market maker, as he has to avoid long inventory positions on assets whose values are decreasing. To design an adaptive market making strategy based on a Bayesian update of the correlation matrix and the drift of the assets, we define the Normal-Inverse-Wishart prior $(\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, \kappa_0, \nu_0, \psi)$, where $(\mu_0, \kappa_0, \nu_0, \psi) \in \mathbb{R}^d \times \mathbb{R}^\star_+ \times (d - 1, +\infty) \times \mathcal{M}_d(\mathbb{R})$. This distribution is built as follows:

$$\mu \,|\, (\mu_0, \kappa_0, \Sigma) \sim \mathcal{N}\Big( \mu_0, \frac{1}{\kappa_0} \Sigma \Big), \qquad \Sigma \,|\, (\psi, \nu_0) \sim \mathcal{W}^{-1}(\psi, \nu_0), \qquad \text{then } (\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, \kappa_0, \nu_0, \psi),$$

where $\mathcal{W}^{-1}$ is the standard inverse Wishart distribution. In other words, the drift vector $\mu$ of the assets follows a multivariate Gaussian distribution whereas the variance-covariance matrix $\Sigma$ follows a standard inverse Wishart distribution. At time $t$, if we denote by $S_t = (S^1_t, \dots, S^d_t)$ the prices observed up to time $t$, the Bayesian update of $(\mu, \Sigma)$ is

$$(\mu, \Sigma \,|\, S_t - S_0) \sim \mathrm{NIW}\left( \frac{\kappa_0 \mu_0 + (S_t - S_0)}{\kappa_0 + t},\ \kappa_0 + t,\ \nu_0 + t,\ \psi + \Big( S_t - \frac{S_t}{t} \Big)\Big( S_t - \frac{S_t}{t} \Big)^{T} + \frac{\kappa_0 t}{\kappa_0 + t} \Big( \mu_0 - \frac{S_t}{t} \Big)\Big( \mu_0 - \frac{S_t}{t} \Big)^{T} \right).$$

Following the law of large numbers, as $t \to +\infty$ we accumulate more information and converge toward the drift and variance-covariance of the market maker's portfolio. Therefore, the market maker can recompute his factors derived from the updated variance-covariance matrix and adjust his quotes.

This extension deserves several remarks.
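Before turning to those remarks, the Normal-Inverse-Wishart update can be illustrated with a minimal Python sketch. It is written under explicit assumptions and is not the authors' implementation: it uses the textbook discrete-observation form of the conjugate update, where the $n$ observations (here hypothetical price increments over unit time steps) play the role of the elapsed time $t$ in the formula above. The function names `niw_update` and `inverse_wishart_mean` are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def niw_update(mu0: Vector, kappa0: float, nu0: float, psi0: Matrix,
               xs: List[Vector]) -> Tuple[Vector, float, float, Matrix]:
    """Textbook Normal-Inverse-Wishart posterior after observing the rows of xs."""
    n = len(xs)
    d = len(mu0)
    # Sample mean of the observations.
    xbar = [sum(x[k] for x in xs) / n for k in range(d)]
    # Scatter matrix: sum of outer products of centred observations.
    scatter = [[sum((x[a] - xbar[a]) * (x[b] - xbar[b]) for x in xs)
                for b in range(d)] for a in range(d)]
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # Posterior location: precision-weighted average of prior mean and sample mean.
    mu_n = [(kappa0 * mu0[k] + n * xbar[k]) / kappa_n for k in range(d)]
    # Extra term penalising disagreement between prior mean and sample mean.
    shrink = kappa0 * n / kappa_n
    psi_n = [[psi0[a][b] + scatter[a][b]
              + shrink * (xbar[a] - mu0[a]) * (xbar[b] - mu0[b])
              for b in range(d)] for a in range(d)]
    return mu_n, kappa_n, nu_n, psi_n


def inverse_wishart_mean(psi: Matrix, nu: float) -> Matrix:
    """Mean of an inverse Wishart distribution, defined for nu > d + 1."""
    d = len(psi)
    if nu <= d + 1:
        raise ValueError("the inverse Wishart mean requires nu > d + 1")
    return [[psi[a][b] / (nu - d - 1) for b in range(d)] for a in range(d)]


if __name__ == "__main__":
    # Hypothetical two-asset example with two observed increments.
    mu_n, kappa_n, nu_n, psi_n = niw_update(
        mu0=[0.0, 0.0], kappa0=1.0, nu0=4.0,
        psi0=[[1.0, 0.0], [0.0, 1.0]],
        xs=[[1.0, 2.0], [3.0, 4.0]],
    )
    print(mu_n, kappa_n, nu_n)
    print(inverse_wishart_mean(psi_n, nu_n))
```

The posterior covariance estimate returned by `inverse_wishart_mean` is the kind of matrix from which the market maker could recompute his factors, for example by an eigendecomposition.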
First, the problems encountered by an OTC market maker are quite different from those of a high-mid frequency trader in an order book. The model is more parsimonious, especially for the intensity functions. Therefore, the convergence toward the "true" market parameters will be faster than in the order book model. The objective of the Bayesian update on the quantities $\lambda^{i,j}_{RFQ}$ is to determine the average behavior of the counterparts of the market maker. If he observes a large number of requests on the ask (resp. bid) side of the asset $i$, the Bayesian update (Appendix B.1) enables the market maker to adjust his quotes to set a higher ask (resp. bid) price for this asset. If the market maker observes a higher discrepancy than expected for the transaction sizes, the Bayesian update (Appendix B.2) helps him adjust his quotes. Finally, the Bayesian learning on the drift and covariance of the assets makes it possible to update the factors from which the market maker chooses his quotes.

References

[1] R. Almgren and B. Harts. A dynamic algorithm for smart order routing. White paper StreamBase, 2008.
[2] R. Almgren, C. Thum, E. Hauptmann, and H. Li. Equity market impact. Risk, July, pages 58–62, 2005.
[3] H. Alsayed and F. McGroarty. Arbitrage and the law of one price in the market for American depository receipts. Journal of International Financial Markets, Institutions and Money, 22(5):1258–1276, 2012.
[4] M. Avellaneda, J. Reed, and S. Stoikov. Forecasting prices from level-I quotes in the presence of hidden liquidity. Algorithmic Finance, 1(1):35–43, 2011.
[5] M. Avellaneda and S. Stoikov. High-frequency trading in a limit order book. Quantitative Finance, 8(3):217–224, 2008.
[6] A. Bachouch, C. Huré, N. Langrené, and H. Pham. Deep neural networks algorithms for stochastic control problems on finite horizon, part 2: Numerical applications, 2018.
[7] B. Baldacci, P. Bergault, and O. Guéant. Algorithmic market making: the case of equity derivatives. arXiv preprint arXiv:1907.12433, 2019.
[8] B. Baldacci, I. Manziuk, T. Mastrolia, and M. Rosenbaum. Market making and incentives design in the presence of a dark pool: a deep reinforcement learning approach. arXiv preprint arXiv:1912.01129, 2019.
[9] G. Barles and C. Imbert. Second-order elliptic integro-differential equations: viscosity solutions' theory revisited. In Annales de l'Institut Henri Poincare/Analyse non lineaire, volume 3, pages 567–585, 2008.
[10] P. Bergault and O. Guéant. Size matters for OTC market makers: viscosity approach and dimensionality reduction technique. arXiv preprint arXiv:1907.01225, 2019.
[11] B. Bouchard. Introduction to stochastic control of mixed diffusion processes, viscosity solutions and applications in finance and insurance. Lecture Notes Preprint, 2007.
[12] Á. Cartea, S. Jaimungal, and J. Penalva. Algorithmic and high-frequency trading. Cambridge University Press, 2015.
[13] R. Cont and A. Kukanov. Optimal order placement in limit order markets. Quantitative Finance, 17(1):21–39, 2017.
[14] M. G. Crandall, H. Ishii, and P.-L. Lions. User's guide to viscosity solutions of second order partial differential equations. Bulletin of the American Mathematical Society, 27(1):1–67, 1992.
[15] J. Gatheral. No-dynamic-arbitrage and market impact. Quantitative Finance, 10(7):749–759, 2010.
[16] O. Guéant. The Financial Mathematics of Market Liquidity: From optimal execution to market making, volume 33. CRC Press, 2016.
[17] O. Guéant, C.-A. Lehalle, and J. Fernandez-Tapia. Dealing with the inventory risk: a solution to the market making problem. Mathematics and Financial Economics, 7(4):477–507, 2013.
[18] O. Guéant and I. Manziuk. Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality. arXiv preprint arXiv:1910.13205, 2019.
[19] F. Guilbaud and H. Pham. Optimal high-frequency trading with limit and market orders. Quantitative Finance, 13(1):79–94, 2013.
[20] C. Huré, H. Pham, A. Bachouch, and N. Langrené. Deep neural networks algorithms for stochastic control problems on finite horizon, part I: convergence analysis. arXiv preprint arXiv:1812.04300, 2018.
[21] A. Jain and C. Jain. Hidden liquidity on the US stock exchanges. The Journal of Trading, 12(3):30–36, 2017.
[22] S. Laruelle, C.-A. Lehalle, and G. Pagès. Optimal split of orders across liquidity pools: a stochastic algorithm approach. SIAM Journal on Financial Mathematics, 2(1):1042–1076, 2011.
[23] S. Laruelle, C.-A. Lehalle, and G. Pagès. Optimal posting price of limit orders: learning by trading. Mathematics and Financial Economics, 7(3):359–403, 2013.
[24] R. Rabinovitch, A. C. Silva, and R. Susmel. Returns on ADRs and arbitrage in emerging markets. Emerging Markets Review, 4(3):225–247, 2003.
[25] S. Stoikov and M. Sağlam. Option market making under inventory risk. Review of Derivatives Research, 12(1):55–79, 2009.
[26] I. M. Werner and A. W. Kleidon. UK and US trading of British cross-listed stocks: An intraday analysis of market integration.