Reinforced Deep Markov Models With Applications in Automatic Trading
Tadeu A. Ferreira
Department of Statistical Sciences, University of Toronto
[email protected]
November 10, 2020

Abstract
Inspired by developments in deep generative models, we propose a model-based RL approach, coined Reinforced Deep Markov Model (RDMM), designed to integrate desirable properties of a reinforcement learning algorithm acting as an automatic trading system. The network architecture allows for the possibility that market dynamics are only partially visible and are potentially modified by the agent's actions. The RDMM filters incomplete and noisy data to create better-behaved input data for RL planning. The policy search optimisation also properly accounts for state uncertainty. Due to the complexity of the RDMM architecture, we performed ablation studies to better understand the contributions of its individual components. To test the financial performance of the RDMM we implement policies using variants of the Q-learning, DynaQ-ARIMA and DynaQ-LSTM algorithms. The experiments show that the RDMM is data-efficient and provides financial gains compared to the benchmarks on the optimal execution problem. The performance improvement becomes more pronounced when price dynamics are more complex, as demonstrated using real data sets from the limit order books of Facebook, Intel, Vodafone and Microsoft.

Keywords: Batch Reinforcement Learning · Algorithmic Trading · Deep Markov Model · Variational Auto-Encoder · Deep Learning
Many developments in reinforcement learning have been introduced to improve the shortcomings of value iteration algorithms such as Q-learning [Watkins, 1989, 1992]. Due to slow convergence and difficulties with large state-action spaces, classical reinforcement learning algorithms exhibit poor performance in extracting information from the environment. In Q-learning, the changes in a state $x_t$ do not directly back-propagate to previous states $x_{t-1}, x_{t-2}, \ldots$ [Wiering and van Otterlo, 2011]. This issue is known as exploration overhead. Model-based reinforcement learning methods were introduced to model the dynamics of the transitions and rewards to expedite convergence and make the algorithm more data-efficient. However, the inability to reproduce the environment accurately due to model bias has been reported as a common issue in model-based RL, and it can hinder the benefits of learning the environment dynamics. Incorporating different sources of uncertainty into planning has shown good results in many recent studies aimed at mitigating model bias and improving model robustness and data efficiency, as in Deisenroth and Rasmussen [2011] and McAllister and Rasmussen [2016]. Deisenroth and Rasmussen [2011] employ Gaussian processes in their probabilistic inference for learning control (PILCO) algorithm to model the Markov decision process dynamics, which allows them to incorporate the uncertainty of the model parameters into planning and control. McAllister and Rasmussen [2016] took a step further, also including the uncertainty in the agent's belief distribution.

A model's ability to handle noisy data is also important; McAllister and Rasmussen [2016] point out that noise produces a substantial variation in the controller output, causing instability in the cart-pole problem. To cope with noisy observations, they extended PILCO [Deisenroth and Rasmussen, 2011] into a partially observable Markov decision process (POMDP) framework by filtering the observations and using predictions with respect to the filtering process. The policy is formed using this smoother version of the observed values.

Our proposed architecture combines these successful approaches with advances in deep generative models such as variational auto-encoders (VAEs) by Rezende et al. [2014] and Kingma and Welling [2013], and deep Markov models (DMMs) by Krishnan et al. [2015, 2016]. We modify and build upon the model structure of the DMM and add a gradient-based policy learning framework aiming to obtain the desirable properties mentioned before, such as fast convergence, data efficiency, and the ability to handle noisy and incomplete observations. To achieve this, we extend the basic DMM structure presented in Krishnan et al. [2015, 2016] into a more appropriate state-space model (SSM) that accounts for actions and rewards, and reframe it into a model-based reinforcement learning framework inspired by some of the ideas in McAllister and Rasmussen [2016], where the policy is optimised with respect to the filtering process instead of the observations themselves. State uncertainty is incorporated into the policy search process, which helps with data efficiency. The policy $\pi$ is optimised to maximise the expected return over a finite number of steps. We refer to our approach hereafter as a reinforced deep Markov model (RDMM).

We begin with a description of POMDPs and the intuition behind our model architecture in Section 2.
After explaining the RDMM motivation, we give the complete model specifications in Section 3. In Section 4, the objective functions are defined along with the parameters that need to be optimised. The implementation of the learning algorithm is outlined in Section 5. A description of the optimal liquidation problem, the problem that we are trying to solve, is given in Section 6. We conduct tests in a simulated environment using synthetic price dynamics and present the results in Section 7. The environment's complexity is raised by replacing the artificial price dynamics with real prices taken from the limit order books of Intel, Facebook, Microsoft and Vodafone in Section 8. Finally, we end this article with a discussion and conclusions about the results in Section 9.

In a POMDP with a finite and discrete number of states $Z = (z_1, \ldots, z_N)$, the agent only perceives a set of observable states $\Omega = (o_1, \ldots, o_N)$ instead of observing subsets of $Z$ directly. An observation function $O$ is defined in the context of an imperfect sensor; it assigns a probability to each 3-tuple $O(s', a, o)$, representing the probability of observing $o$ at state $s'$ after executing $a$. Introducing limitations on the agent's perception of the environment is more complete and reflective of reality than the classical MDP.

Using the POMDP framework, we assume a discrete and finite sequence of observable states $\Omega = (o_1, \ldots, o_T)$. The sequence of the corresponding real states is represented by the sequence of latent states $Z = (z_1, \ldots, z_T)$. In this design, the observable state $o_t$ contains all the relevant information available to the agent at time $t$. We choose policies $\pi$ as functions of the approximated distribution of $Z$. A sequence of actions $A = (a_0, \ldots, a_{T-1})$ given by the policy $\pi$ is sent to the environment, which in turn yields a sequence of rewards $R = (r_1, \ldots, r_T)$.

We aim to develop a model with the ability to represent a wide variety of functions responsible for the interaction between actions $A$, latent states $Z$, observable states $\Omega$ and rewards $R$. To achieve this goal we rely on artificial neural networks acting as universal approximators [Cybenko, 1989, Csaji, 2001, Lu et al., 2017]. Before delving into the specifics of the model, we develop a general intuition about how the RDMM is conceptualised. The complete specifications are given in the next section.

Figure 1 shows a graphical representation of the RDMM approach, where the sequence of events can be divided into two interconnected sections: reality and controller. To explain the rationale of the architecture represented by Figure 1, we start by narrating the sequence of events in the reality section at instant $t$. At this point, we take the action $a_{t-1}$ as observed; the description of how the actions are created appears in the controller section. This action is sent to the environment, where all the underlying dynamics of the system occur and are represented by the latent state $z_t$. The environment is perturbed by that action, which in turn emits an observable state $o_t$ and yields a reward $r_t$. The visible state and reward are also considered to be directly influenced by the action $a_{t-1}$. All these processes are modelled by artificial neural networks and are represented in Figure 1 by the blue arrows.
This reality section is also considered the emission or decoding part of a variational auto-encoder (VAE).

Once the action $a_{t-1}$, the state $o_t$ and the reward $r_t$ are observed, the RDMM transitions to the controller section, where $a_{t-1}$, $o_t$ and $r_t$ update an LSTM with a new hidden state $h_t$, which serves as a summary of all past information up to instant $t$. With $h_t$ in hand, the RDMM approximates the posterior distribution of the latent states, concluding the recognition portion of a VAE. The parameters of this approximated distribution, $\mu^\phi_t$ and $\Sigma^\phi_t$, are the inputs of a deterministic function that outputs the next action $a_t$. The parameters in the controller section are also modelled by artificial neural networks and are represented in Figure 1 by the green arrows. Once the action $a_t$ is formed, the system transitions from the state $z_t$ to the next state $z_{t+1}$ and the process repeats itself.

The presence of edges from the actions $\vec{a}$ to the latent states $\vec{z}$ might be crucial in certain contexts. In our application we consider the possibility of the agent's actions affecting the environment; for instance, a trader buying or selling a significant amount of an asset, compared with the other agents' positions in the market, may alter other agents' perception of the asset's price.

Figure 1: Graphical representation of the RDMM. The reality sub-diagram represents the interaction of the actions $a_{t-1}$, the environment $z_t$, and the emission of observable states $o_t$ and rewards $r_t$. After the emissions and actions are integrated into a summary of past information $h_t$ by an LSTM, the algorithm uses an approximation of the posterior to construct a new action $a_t$ as a deterministic function of the mean $\mu^\phi_t$ and covariance $\Sigma^\phi_t$ of the approximated posterior distribution.

It should be noted that the "reality" part of the RDMM differs from the deep Kalman filters (DKF) presented in Krishnan et al. [2015] and the deep Markov models (DMM) in Krishnan et al. [2016], where the authors consider only edges from actions to latent states, in order to make counterfactual inference for medical data. In the RDMM we also incorporate the reward nodes $\vec{r}$ and include edges from actions to observable states as well as from actions to rewards. Since the actions are assumed to be market orders, these edges are relevant because the orders cause an immediate impact on the rewards and on the latent and observable states. These adaptations result in a more appropriate non-linear state-space model (SSM), to which a gradient-based policy learning is added in the "controller" part of the RDMM. In this part, the actions stem from the parameters of the approximate distribution of the belief states (the approximation of the posterior).

In the next section, we use deep neural networks to further specify the architecture depicted in Figure 1.
The definitions in this section rely heavily on the conditional independence properties of directed graphical models. Henceforth, observed states $o_t$ and latent states $z_t$ are considered elements of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, while rewards $r_t$ and actions $a_t$ are scalars.

The latent state transitions are modelled as conditionally normal. Specifically,

Transitions:  $p_\lambda(z_t \mid z_{t-1}, a_{t-1}) = \mathcal{N}\big(\mu^\lambda_t, \Sigma^\lambda_t\big)$.   (1)

As in the DKF [Krishnan et al., 2015], the mean $\mu^\lambda_t$ is an interpolation between a linear transition $L_t$ and a non-linear transition $NL_t$, controlled by a gated unit $g_t$ (further justification is provided below), and is explicitly given by

$\mu^\lambda_t = g_t \odot NL_t + (1 - g_t) \odot L_t$,

where $A \odot B$ denotes the Hadamard (entry-wise) product between $A$ and $B$, and

$NL_t = W_{NL_2}\big[\mathrm{ReLU}\big(W_{NL_1}(z_{t-1}, a_{t-1})^T + b_{NL_1}\big)\big] + b_{NL_2}$   (non-linear),   (2)
$L_t = W_L (z_{t-1}, a_{t-1})^T + b_L$   (linear),   (3)
$g_t = \mathrm{sigmoid}\big(W_{g_2}\big[\mathrm{ReLU}\big(W_{g_1}(z_{t-1}, a_{t-1})^T + b_{g_1}\big)\big] + b_{g_2}\big)$   (gated unit).   (4)

The covariance is modelled as $\Sigma^\lambda_t = \mathrm{softplus}\big(W_{\Sigma_\lambda}\,\mathrm{ReLU}(NL_t) + b_{\Sigma_\lambda}\big)$.

The motivation for interpolating the mean $\mu^\lambda_t$ between linear and non-linear components is that some datasets achieve better approximations, in terms of held-out likelihood, with linear functions, while others obtain better results with non-linear functions, as noted in Krishnan et al. [2016]. The interpolation is inspired by the update gate of gated recurrent units (GRUs) [Cho et al., 2014] and gives the model the freedom to seek the best combination of the linear and non-linear components.

Observed states $o_t$ and rewards $r_t$ conditioned on $z_t$ and $a_{t-1}$ are assumed normally distributed and parametrised by multi-layer perceptrons (MLPs). Explicitly,

Emissions:  $p_\theta(o_t \mid z_t, a_{t-1}) = \mathcal{N}\big(\mu^\theta_t, \Sigma^\theta_t\big)$  and  $p_\eta(r_t \mid z_t, a_{t-1}) = \mathcal{N}\big(\mu^\eta_t, \Sigma^\eta_t\big)$,   (5)

where

$\mu^\theta_t = \mathrm{ReLU}\big(W_{\mu_\theta} h^\theta_t + b_{\mu_\theta}\big)$,   $\mu^\eta_t = W_{\mu_\eta} h^\eta_t + b_{\mu_\eta}$,   (6)
$\Sigma^\theta_t = \mathrm{softplus}\big(W_{\Sigma_\theta} h^\theta_t + b_{\Sigma_\theta}\big)$,   $\Sigma^\eta_t = \mathrm{softplus}\big(W_{\Sigma_\eta} h^\eta_t + b_{\Sigma_\eta}\big)$,   (7)
$h^\theta_t = \tanh\big(W_\theta (z_t, a_{t-1})^T + b_\theta\big)$,   $h^\eta_t = \tanh\big(W_\eta (z_t, a_{t-1})^T + b_\eta\big)$.   (8)

For $\Sigma^\theta_t$ and $\Sigma^\eta_t$ we use a softplus activation, whereas $\mu^\theta_t$ and $\mu^\eta_t$ are parametrised by MLPs with rectifier and linear output activations, respectively; we do not use the rectifier for the reward mean because we must allow negative rewards. Since the posterior $p_\lambda(z \mid o, r, a)$ is intractable [Krishnan et al., 2015], we use an approximation instead.
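To make the parametrisation above concrete, the following is a minimal sketch of the gated transition (Equations (1)-(4)) and of the emission heads (Equations (5)-(8)), written in PyTorch rather than the paper's Theano implementation. The layer widths, the single hidden layer per block and the diagonal covariances are illustrative assumptions, not the exact architecture used in the experiments.

```python
# Minimal PyTorch sketch of the gated transition (Eqs. 1-4) and emission heads
# (Eqs. 5-8). Layer widths and the single hidden layer per block are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTransition(nn.Module):
    """p_lambda(z_t | z_{t-1}, a_{t-1}) with mu_t = g_t * NL_t + (1 - g_t) * L_t."""
    def __init__(self, z_dim, a_dim=1, hidden=64):
        super().__init__()
        inp = z_dim + a_dim
        self.lin = nn.Linear(inp, z_dim)                       # L_t, Eq. (3)
        self.nonlin = nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                                    nn.Linear(hidden, z_dim))  # NL_t, Eq. (2)
        self.gate = nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                                  nn.Linear(hidden, z_dim), nn.Sigmoid())  # g_t, Eq. (4)
        self.sigma = nn.Linear(z_dim, z_dim)                   # diagonal Sigma_t

    def forward(self, z_prev, a_prev):
        x = torch.cat([z_prev, a_prev], dim=-1)
        nl, g = self.nonlin(x), self.gate(x)
        mu = g * nl + (1.0 - g) * self.lin(x)                  # Hadamard interpolation
        sigma = F.softplus(self.sigma(F.relu(nl)))             # positive scale
        return mu, sigma

class Emission(nn.Module):
    """p_theta(o_t | z_t, a_{t-1}); the reward head p_eta is identical except
    that its mean uses a linear (not ReLU) output, to allow negative rewards."""
    def __init__(self, z_dim, out_dim, a_dim=1, hidden=64, positive_mean=True):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(z_dim + a_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, out_dim)
        self.sigma = nn.Linear(hidden, out_dim)
        self.positive_mean = positive_mean

    def forward(self, z, a_prev):
        h = self.h(torch.cat([z, a_prev], dim=-1))
        mu = F.relu(self.mu(h)) if self.positive_mean else self.mu(h)
        return mu, F.softplus(self.sigma(h))
```

A single emission class with a `positive_mean` flag is used here only to highlight that the price and reward heads in (6)-(8) differ solely in the activation of their means.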
As in VAEs (see Appendix C), the posterior distribution over latent states given observations is intractable, so we introduce a variational approximation to the posterior. Our model uses the variational approximation

$q_\phi(z_{t+1} \mid z_t, h_t) = \mathcal{N}\big(\mu^\phi_{t+1}, \Sigma^\phi_{t+1}\big)$,   (9)

where

$\mu^\phi_{t+1} = W_{\mu_\phi} h + b_{\mu_\phi}$,   (10)
$\Sigma^\phi_{t+1} = \mathrm{softplus}\big(W_{\Sigma_\phi} h + b_{\Sigma_\phi}\big)$,   (11)
$h = \tfrac{1}{2}\big(\tanh(W_\phi z_t^T + b_\phi) + h_t\big)$,   (12)

and $h_t$ is the hidden state of a long short-term memory RNN (LSTM) [Hochreiter and Schmidhuber, 1997] summarising past information up to instant $t$, i.e., $(o_t, r_t, a_{t-1})$ (see Appendix B.1). The factor $\tfrac{1}{2}$ on the right-hand side of (12) ensures that the combined hidden layer $h$ is confined to the interval $[-1, 1]$, since both $\tanh$ and $h_t$ take values in $[-1, 1]$. The equations defining the LSTM cell are

$i_t = \mathrm{sigmoid}(W_i X_t + U_i h_{t-1} + b_i)$   (input gate),   (13)
$f_t = \mathrm{sigmoid}(W_f X_t + U_f h_{t-1} + b_f)$   (forget gate),   (14)
$o_t = \mathrm{sigmoid}(W_o X_t + U_o h_{t-1} + b_o)$   (output gate),   (15)
$\tilde{c}_t = \tanh(W_c X_t + U_c h_{t-1} + b_c)$   (candidate cell),   (16)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$   (cell state),   (17)
$h_t = o_t \circ \tanh(c_t)$   (output/hidden state),   (18)

where $X_t = (o_t, r_t, a_{t-1})$ is the input array containing prices, inventory, rewards and actions, and $\circ$ again denotes the Hadamard (entry-wise) product.
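The inference side can be sketched along the same lines: an LSTM consumes $X_t = (o_t, r_t, a_{t-1})$ and a combiner averages its hidden state with a tanh-transformed previous latent state, as in Equations (9)-(12). This is a hedged PyTorch stand-in rather than the paper's Theano code; feeding a whole sequence of previously sampled latents at once is a simplification of the sequential filtering loop, and the layer sizes are assumptions.

```python
# Hedged PyTorch sketch of the inference network (Eqs. 9-12).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferenceNet(nn.Module):
    def __init__(self, o_dim, z_dim, a_dim=1, r_dim=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(o_dim + r_dim + a_dim, hidden, batch_first=True)
        self.z_to_h = nn.Linear(z_dim, hidden)
        self.mu = nn.Linear(hidden, z_dim)
        self.sigma = nn.Linear(hidden, z_dim)

    def forward(self, obs, rew, act_prev, z_prev):
        # obs: (B, T, o_dim); rew, act_prev: (B, T, 1); z_prev: (B, T, z_dim)
        x = torch.cat([obs, rew, act_prev], dim=-1)
        h_t, _ = self.lstm(x)                                  # summary of the past
        h = 0.5 * (torch.tanh(self.z_to_h(z_prev)) + h_t)      # combiner, Eq. (12)
        return self.mu(h), F.softplus(self.sigma(h))           # mu_phi, Sigma_phi
```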
The policy $\pi$ is defined deterministically as an MLP with rectifier activation whose inputs are the mean and variance of the approximate posterior distribution over latent states $q_\phi(z_t \mid z_{t-1}, h_{t-1})$. Specifically, the agent's actions are given by

$a_t = \pi\big(\mu^\phi_t, \Sigma^\phi_t, q_t \mid \psi\big)$,   (19)

where $q_t$ is the inventory available at instant $t$ and

$\pi\big(\mu^\phi_t, \Sigma^\phi_t, q_t \mid \psi\big) = \min\big(\mathrm{ReLU}(W_\psi h^\psi + b_\psi),\, q_t\big)$,   (20)
$h^\psi = \tanh\big(W_{\psi_I} (\mu^\phi_t, \Sigma^\phi_t)^T + b_{\psi_I}\big)$,   (21)

with $\mu^\phi_t$ and $\Sigma^\phi_t$ as defined in Equations (10) and (11). It is worth mentioning that, similarly to Krizhevsky [2010], the function in Equation (20) is a rectified linear unit (ReLU) capped at $q_t$ (see Figure 2).

Figure 2: Rectified linear unit (ReLU) with a cap at $q_t$.

Applying the conditional independence implied by $d$-separation to the graphical representation, we can factor the model in a convenient way, which helps with the optimisation problem. The graphical model representation of the RDMM depicted in Figure 1 has two objectives to maximise: (i) the conditional likelihood given a sequence of observed actions $\vec{a} = (a_0, \ldots, a_{T-1})$, and (ii) the unconditional expected reward for policy search. The first objective fits the dynamical model parameters and the variational approximation of the posterior while the action parameters are frozen. The second objective optimises the actions while the dynamical model parameters are held fixed.

We aim to maximise $p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$, which is equivalent to maximising $\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$. First notice that

$p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) = \dfrac{p_\lambda(\vec{z}, \vec{o}, \vec{r}, \vec{a})}{p_\lambda(\vec{o}, \vec{r}, \vec{a})} = \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{p_\lambda(\vec{o}, \vec{r} \mid \vec{a})}$.   (22)

Therefore,

$p_\lambda(\vec{o}, \vec{r} \mid \vec{a}) = \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}$.   (23)

From variational inference theory and the identity in Equation (23), we decompose $\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$ as

$\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a}) = \log \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}$   (24)

$\quad = \log \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})} - \log \dfrac{p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}$   (25)

$\quad = \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}\, d\vec{z} - \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}\, d\vec{z}$   (26)

$\quad = \mathcal{L}(q) + \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})\big]$,   (27)

where

$\mathcal{L}(q) := \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, p_\lambda(\vec{z} \mid \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}\, d\vec{z}$.   (28)
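A minimal sketch of the capped-ReLU policy in Equations (19)-(21), again in PyTorch; the hidden width is an assumption, and `inventory` is expected as a tensor of shape (batch, 1) so that the cap can be applied element-wise.

```python
# Minimal sketch of the deterministic policy of Eqs. (19)-(21): an MLP over the
# posterior mean and covariance whose output is a ReLU capped at the inventory q_t.
import torch
import torch.nn as nn

class CappedReluPolicy(nn.Module):
    def __init__(self, z_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(2 * z_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1), nn.ReLU())

    def forward(self, mu_phi, sigma_phi, inventory):
        raw = self.body(torch.cat([mu_phi, sigma_phi], dim=-1))
        return torch.minimum(raw, inventory)    # capped ReLU of Figure 2
```

The cap guarantees that the agent never sells more shares than it currently holds, which is exactly the role of $q_t$ in Equation (20).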
Since $\mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})\big] \ge 0$, we have

$\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a}) \ge \mathcal{L}(q)$.   (29)

For this reason, the term $\mathcal{L}$ is called the evidence lower bound (ELBO) on $\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$, and it is the objective function that we maximise during the learning process in order to maximise $\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$. Another consequence of the ELBO maximisation is a good approximation of the intractable posterior $p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})$ by $q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})$. More explicitly, $\mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})\big] = 0$ if and only if $q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})$ equals the posterior $p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})$. Consequently, by Equation (27), maximising $\mathcal{L}$ implies minimising $\mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})\big]$, since the model evidence $p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$ does not depend on $q_\phi$.

In the theorem below we further characterise the ELBO for the RDMM architecture summarised in Figure 1.

Theorem.
The evidence lower bound $\mathcal{L}$ of the conditional log-likelihood is given by

$\mathcal{L} = \mathbb{E}_{z_1 \sim q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a})}\big[\log p_\theta(o_1 \mid z_1, a_0)\big] + \mathbb{E}_{z_1 \sim q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a})}\big[\log p_\eta(r_1 \mid z_1, a_0)\big]$
$\quad + \sum_{t=2}^{T} \mathbb{E}_{z_t \sim q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\big[\log p_\theta(o_t \mid z_t, a_{t-1})\big] + \sum_{t=2}^{T} \mathbb{E}_{z_t \sim q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\big[\log p_\eta(r_t \mid z_t, a_{t-1})\big]$
$\quad - \mathrm{KL}\big(q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(z_1 \mid \vec{a})\big) - \sum_{t=2}^{T} \mathbb{E}_{z_{t-1} \sim q_\phi(z_{t-1} \mid z_{t-2}, \vec{o}, \vec{r}, \vec{a})}\big[\mathrm{KL}\big(q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(z_t \mid z_{t-1}, \vec{a})\big)\big]$.   (30)

Proof:
First, by definition,

$\mathcal{L}(q) = \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, d\vec{z} + \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(\vec{z} \mid \vec{a})}{q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}\, d\vec{z}$
$\quad = \displaystyle\int q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \log p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\, d\vec{z} - \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$
$\quad = \mathbb{E}_{\vec{z} \sim q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})}\big[\log p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a})\big] - \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$.
Applying $d$-separation and the conditional independence of $o_t$ and $r_t$ given $z_t$ and $a_{t-1}$ (see (5)), we have

$p_\lambda(\vec{o}, \vec{r} \mid \vec{z}, \vec{a}) = \prod_{t=1}^{T} p_\theta(o_t \mid z_t, a_{t-1})\, p_\eta(r_t \mid z_t, a_{t-1})$.

Similarly, $q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a})$ can be factorised as

$q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) = q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})$.

Combining these two equalities we obtain

$\mathcal{L} = \displaystyle\int \cdots \int q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \left[\sum_{t=1}^{T} \log p_\theta(o_t \mid z_t, a_{t-1}) + \log p_\eta(r_t \mid z_t, a_{t-1})\right] dz_1 \cdots dz_T - \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$

$\quad = \displaystyle\int q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \big[\log p_\theta(o_1 \mid z_1, a_0) + \log p_\eta(r_1 \mid z_1, a_0)\big]\, dz_1 + \sum_{t=2}^{T} \displaystyle\int q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \big[\log p_\theta(o_t \mid z_t, a_{t-1}) + \log p_\eta(r_t \mid z_t, a_{t-1})\big]\, dz_t - \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$

$\quad = \mathbb{E}_{z_1 \sim q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a})}\big[\log p_\theta(o_1 \mid z_1, a_0)\big] + \mathbb{E}_{z_1 \sim q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a})}\big[\log p_\eta(r_1 \mid z_1, a_0)\big] + \sum_{t=2}^{T} \mathbb{E}_{z_t \sim q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\big[\log p_\theta(o_t \mid z_t, a_{t-1})\big] + \sum_{t=2}^{T} \mathbb{E}_{z_t \sim q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\big[\log p_\eta(r_t \mid z_t, a_{t-1})\big] - \mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$.   (31)

Finally, a similar factorisation can be applied to $\mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big]$ to obtain

$\mathrm{KL}\big[q_\phi(\vec{z} \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(\vec{z} \mid \vec{a})\big] = -\displaystyle\int \cdots \int q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \log\left[\dfrac{p_\lambda(z_1 \mid \vec{a}) \prod_{t=2}^{T} p_\lambda(z_t \mid z_{t-1}, \vec{a})}{q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\right] dz_1 \cdots dz_T$

$\quad = -\displaystyle\int q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(z_1 \mid \vec{a})}{q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a})}\, dz_1 - \sum_{t=2}^{T} \displaystyle\int\!\!\int q_\phi(z_{t-1} \mid z_{t-2}, \vec{o}, \vec{r}, \vec{a})\, q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \log \dfrac{p_\lambda(z_t \mid z_{t-1}, \vec{a})}{q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a})}\, dz_{t-1}\, dz_t$

$\quad = \mathrm{KL}\big(q_\phi(z_1 \mid \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(z_1 \mid \vec{a})\big) + \sum_{t=2}^{T} \mathbb{E}_{z_{t-1} \sim q_\phi(z_{t-1} \mid z_{t-2}, \vec{o}, \vec{r}, \vec{a})}\big[\mathrm{KL}\big(q_\phi(z_t \mid z_{t-1}, \vec{o}, \vec{r}, \vec{a}) \,\|\, p_\lambda(z_t \mid z_{t-1}, \vec{a})\big)\big]$.   (32)

Combining the results in Equations (31) and (32) we obtain Equation (30).

In (30) the expectations are with respect to the latent states $z_t$ and are approximated using the reparametrisation trick (see Appendix C).
The KL-divergence terms, on the other hand, have closed-form expressions, since the prior $p_\lambda$ and the posterior $q_\phi$ are normally distributed.

As a consequence of Equation (30), the gradients of $\mathcal{L}$ with respect to the parameters propagate through the entire architecture via reverse-mode automatic differentiation (see Appendix D). Indeed, in the emission terms of the ELBO in Equation (30), the parameters $\theta$ and $\eta$ of the emission functions (5) represent the weights and biases of their respective MLPs, i.e., $\theta$ through $\{W_{\mu_\theta}, W_{\Sigma_\theta}, W_\theta\}$ and $\eta$ through $\{W_{\mu_\eta}, W_{\Sigma_\eta}, W_\eta\}$. Likewise, for the KL-divergence terms in Equation (30), the transition/prior distribution $p_\lambda$ in (1) contains the parameter set $\lambda$ that collects all the weights and biases of its neural networks: $\lambda = \{W_{NL_1}, W_{NL_2}, W_L, W_{g_1}, W_{g_2}, W_{\Sigma_\lambda}, b_{NL_1}, b_{NL_2}, b_L, b_{g_1}, b_{g_2}, b_{\Sigma_\lambda}\}$. Similarly, $\phi$ represents all the parameters of the neural networks in the posterior/approximated distribution, including the LSTM, i.e., $\phi$ contains $\{W_{\mu_\phi}, W_{\Sigma_\phi}, W_\phi, W_i, W_f, W_o, W_c, U_i, U_f, U_o, U_c, b_{\mu_\phi}, b_{\Sigma_\phi}, b_\phi, b_i, b_f, b_o, b_c\}$. For the objective function defined by Equation (30), the reverse-mode AD time for computing the gradient of $\mathcal{L}$ is similar to the time needed to compute the ELBO itself.

Recall that the actions $\vec{a}$ in $\mathcal{L}$ are considered observed and the corresponding network parameters are held fixed. The role of the actions with regard to the optimisation of $\mathcal{L}$ is to allow the model to "learn" the interplay between the actions and the emissions of prices and rewards through Equation (30).

In this section, the agent's actions are the outputs of an MLP whose inputs are the parameters of the approximate posterior distribution, as defined in Equation (19), and the policy parameters are chosen to maximise the unconditional expected reward

$J = \mathbb{E}_{p(r)}[R]$,   (33)

where $R = \sum_{t=1}^{T} r_t$ and $p(r) = p(r_1, \ldots, r_T)$ is the joint distribution of the rewards. Through iterated expectations we have

$J = \mathbb{E}_{p(r)}\left[\sum_{t=1}^{T} r_t\right] = \sum_{t=1}^{T} \mathbb{E}_{p(z_t, a_{t-1})}\big[\mathbb{E}_{p(r_t \mid z_t, a_{t-1})}[r_t]\big] = \sum_{t=1}^{T} \mathbb{E}_{p(z_t, a_{t-1})}\big[\mu^\eta_t\big] = \sum_{t=1}^{T} \mathbb{E}_{p(z_t, a_{t-1})}\big[\mathrm{MLP}_{\mu^\eta_t}(z_t, a_{t-1})\big] = \sum_{t=1}^{T} \mathbb{E}_{p(z_t, a_{t-1})}\big[\mathrm{MLP}_{\mu^\eta_t}\big(z_t, \pi(\mu^\phi_{t-1}, \Sigma^\phi_{t-1} \mid \psi)\big)\big]$,   (34)

where $\mathrm{MLP}_{\mu^\eta_t}(z_t, a_{t-1})$ represents the multi-layer perceptron parametrising $\mu^\eta_t$ as defined in Equation (5).

The unconditional expected reward $J$ in (34) is approximated by

$J \approx \sum_{t=1}^{T} \mathrm{MLP}_{\mu^\eta_t}\big(z_t, \pi(\mu^\phi_{t-1}, \Sigma^\phi_{t-1} \mid \psi)\big)$.   (35)

Therefore, our goal is to find $\psi^*$ such that

$\psi^* = \underset{\psi}{\mathrm{argmax}} \sum_{t=1}^{T} \mathrm{MLP}_{\mu^\eta_t}\big(z_t, \pi(\mu^\phi_{t-1}, \Sigma^\phi_{t-1} \mid \psi)\big)$.   (36)

As in the ELBO maximisation, we use reverse-mode AD to compute the gradient of $J$ with respect to $\psi := \{W_\psi, W_{\psi_I}, b_\psi, b_{\psi_I}\}$, while the parameter sets $\theta$, $\eta$, $\lambda$ and $\phi$ are kept fixed.
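For reference, the two numerical ingredients used when evaluating Equations (30) and (35), reparametrised sampling of the latent states and the closed-form KL divergence between diagonal Gaussians, can be sketched as follows. The sketch assumes the network outputs are standard deviations; if they are interpreted as variances, the formulas change accordingly.

```python
# Hedged sketch of the reparametrisation trick and the closed-form diagonal-Gaussian KL.
import torch

def reparam_sample(mu, sigma):
    """z = mu + eps * sigma, eps ~ N(0, I); gradients flow through mu and sigma."""
    return mu + torch.randn_like(sigma) * sigma

def diag_gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dims."""
    var_q, var_p = sigma_q.pow(2), sigma_p.pow(2)
    kl = 0.5 * (torch.log(var_p / var_q) + (var_q + (mu_q - mu_p).pow(2)) / var_p - 1.0)
    return kl.sum(dim=-1)
```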
The learning algorithm has two goals. The first goal is to obtain a good model representation of the observed data given the actions taken. In this phase we maximise the lower bound in Equation (30) in order to maximise the conditional log-likelihood $\log p_\lambda(\vec{o}, \vec{r} \mid \vec{a})$. The second goal is to perform a policy search by maximising the unconditional expected reward in Equation (34), extracting the best policy with a gradient-based method.

Algorithm 1 summarises the steps in our approach, which can be broken into three main parts. The first part defines the functions of our RDMM model. It is important to follow the correct order of construction of the graphical architecture in Figure 1; in our implementation, Theano compiles this graph into native machine instructions wrapped in a callable object. The second part is the ELBO optimisation through the application of Equation (30), where the gradients and the parameters of the compiled graph are updated using the Adam optimisation algorithm [Kingma and Ba, 2015]. The third part of the algorithm is executed after ELBO convergence is achieved; in this part we perform the policy search to maximise the approximation of the unconditional expected reward $J$ in Equation (35).

/* PART 1: RDMM graph architecture */
Define the LSTM for the hidden state $h_t$ summarising past information (18);
Define the MLPs for $\mu^\phi_t$ and $\Sigma^\phi_t$ of the inference layer $q_\phi(z_t \mid z_{t-1}, h_{t-1})$ (9);
Define the MLPs $\mu^\theta_t$, $\Sigma^\theta_t$, $\mu^\eta_t$ and $\Sigma^\eta_t$ of the emission functions (5);
Define the MLPs $\mu^\lambda_t$ and $\Sigma^\lambda_t$ for the transitions (1);
Build the MLP of the actions $a_t$ (19);
Initialise all parameters;

/* PART 2: ELBO optimisation */
for $i \in$ epochs do
  for all minibatches do
    Compute $h_t$, $\mu^\phi_t$ and $\Sigma^\phi_t$ of PART 1 for all $t$ in the minibatch;
    Sample $z_t = \mu^\phi_t + \epsilon\, \Sigma^\phi_t$ with $\epsilon \sim \mathcal{N}(0, 1)$ for all $t$ in the minibatch;
    Compute $\mu^\theta_t$, $\Sigma^\theta_t$, $\mu^\eta_t$, $\Sigma^\eta_t$, $\mu^\lambda_t$ and $\Sigma^\lambda_t$ of PART 1 for all $t$ in the minibatch;
    Compute $\mathcal{L}$ using (30) and check convergence;
    Compute the gradients of $\mathcal{L}$ using reverse-mode AD (Theano);
    Update the parameters $\lambda$, $\eta$, $\theta$ and $\phi$ related to the $\mathcal{L}$ optimisation using ADAM;
  end
end

/* PART 3: Policy search optimisation */
for $i \in$ epochs do
  for all minibatches do
    Compute the actions $a_t$ using (19);
    Compute $J$ using (35);
    Compute the gradients of $J$ using reverse-mode AD (Theano);
    Update the parameters $\psi$ for the policy search using ADAM;
  end
end

Algorithm 1: Overview of the RDMM learning
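As a rough guide to how the two optimisation phases of Algorithm 1 fit together, the following is a hedged sketch in a modern autodiff framework (PyTorch here, instead of the paper's compiled Theano graph). The methods `model.elbo` and `model.expected_reward`, and the convention that policy parameters are prefixed with "policy", are illustrative assumptions rather than the paper's interface.

```python
# Hedged sketch of PART 2 (ELBO) and PART 3 (policy search) of Algorithm 1.
import torch

def train_rdmm(model, batches, elbo_epochs=200, policy_epochs=100,
               lr_elbo=1e-3, lr_policy=1e-4):
    # PART 2: fit the generative model and the posterior approximation (ELBO).
    dyn_params = [p for n, p in model.named_parameters() if not n.startswith("policy")]
    opt = torch.optim.Adam(dyn_params, lr=lr_elbo)
    for _ in range(elbo_epochs):
        for batch in batches:
            loss = -model.elbo(batch)              # maximise L  <=>  minimise -L
            opt.zero_grad(); loss.backward(); opt.step()

    # PART 3: policy search with the dynamical parameters frozen.
    pol_params = [p for n, p in model.named_parameters() if n.startswith("policy")]
    opt = torch.optim.Adam(pol_params, lr=lr_policy)
    for _ in range(policy_epochs):
        for batch in batches:
            loss = -model.expected_reward(batch)   # maximise J of Eq. (35)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```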
To test the algorithms presented in this article, we evaluate their ability to perform a liquidation task. By liquidation we mean selling all the stock in an inventory, where the agent's goal is to optimally sell a certain number of shares whose prices follow some model dynamics. This situation can easily be converted into an acquisition problem. For a more formal treatment of these financial problems, we recommend Casgrain and Jaimungal [2017].

Let us represent the inventory and the trader's action at time $t$ by $q_t$ and $a_t$, respectively. In the liquidation problem, we assume that the trader starts with a positive initial inventory of $q_0$ shares of an asset. The mid-price at instant $t$ is represented by $x_t$. In our reinforcement learning setting, the pair $s_t = (x_t, q_t)$, comprised of price and inventory, represents the agent's state. Most of the derivations in this article could readily be extended so that $x_t$ represents a matrix containing all the information available in the limit order book at instant $t$. For our purposes $x_t$, $a_t$ and $q_t \in \mathbb{R}_+$.
When a certain number of shares $a_t$ is liquidated at instant $t$, it is reasonable to assume a permanent price impact due to the volume traded; in financial jargon, the executed order "walks the book", causing a decrease in the mid-price. As shown in Cartea and Jaimungal [2016], the permanent price impact can be satisfactorily approximated by a linear model. For simplicity, we represent the permanent impact by $c_1 a_t$, with $c_1$ a scalar. We may also consider a temporary price impact due to transaction costs, which can also be approximated by a linear model, as shown in Frei and Westray [2015]. Thus, we represent the temporary price impact by a scalar multiplied by the number of shares sold, i.e., $c_2 a_t$. Consequently, the return obtained from liquidating $a_t$ shares at step $t$ can be represented as

$r_t = (x_t - c_2 a_t)\, a_t$.   (37)

We conduct simulations to solve the liquidation problem in which the stock mid-price $x_t$ is mean-reverting [Cartea and Jaimungal, 2017, Casgrain and Jaimungal, 2017], i.e., prices revert back towards an average price. The mean-reverting mid-price dynamics can be formulated as

$x_{t+1} = -c_1 a_t + \theta + e^{-\kappa \Delta t}(x_t - \theta) + \mathrm{vol} \times \epsilon_t$, with $\epsilon_t \sim \mathcal{N}(0, 1)$ i.i.d.,   (38)

where $x_t$ represents the mid-price of the stock at instant $t$, the constant $\theta$ is the mean price, the term $c_1 a_t$ represents the permanent price impact, $\kappa$ is the mean-reversion rate, and

$\mathrm{vol} = \sigma \sqrt{\big(1 - \exp(-2\kappa \Delta t)\big) / (2\kappa)} \simeq \sigma \sqrt{\Delta t}$   (39)

is the asset's volatility term that makes the price deviate from the mean. The reward is defined by

$r_t = (x_t - c_2 a_t)\, a_t - c_3 q_t^2$,   (40)

which corresponds to the proceeds from trading, a penalty on the speed of trading as a proxy for liquidity costs, and a penalty on holding inventory as a proxy for urgency or quadratic variation. In our simulations we used $\theta = 10$, $\Delta t = 1$, and fixed values of $\kappa$, $\sigma$ and $c_1 = c_2 = c_3$.

The training set for the RDMM model has 400 trajectories of length $T = 500$s (200,000 data points in total), with prices following the mean-reverting dynamics. Each trajectory starts with the asset price equal to \$10 and an inventory of 0 units. After $t = 40$s, a sequence of fictional inventory, actions and the respective rewards is added up to $t = 400$s. The details of this construction are presented in Algorithm 2.

$x_0 = 10$; $q_0 = 0$; $i = 20$;
for $t \in \{1, \ldots, 200{,}000\}$ do
  if $t$ mod $i = 0$ then $i = i + n$, where $n \sim U_d\{\cdot,\cdot\}$;
  if $i$ lies inside the active trading window (between $t = 40$ and $t = 400$ of each trajectory of length 500) then $q_t = u$, where $u \sim U_d\{\cdot,\cdot\}$; else $q_t = 0$;
  if $t$ mod $500 = 0$ then $x_t = 10$; $q_t = 0$;
  $a_t = a$, where $a \sim U_d\{0, q_t\}$;
  $r_t = x_t a_t - c_2 a_t^2 - c_3 q_t^2$;
  $x_{t+1} = f(x_t, c_1, \kappa, \theta, \sigma, a_t, \epsilon_t)$, where $f$ is the mean-reversion map in Equation (38);
  $q_{t+1} = q_t - a_t$;
end

Algorithm 2: Simulation training data set creation. Here $U_d\{a, b\}$ stands for a discrete uniform distribution over $\{a, a+1, \ldots, b\}$.
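For reference, the following NumPy sketch combines the mean-reverting simulator of Equations (38)-(40) with a simplified version of the random inventory/action generation of Algorithm 2. All parameter values are placeholders, not the paper's settings, and the activity window is simplified.

```python
# Hedged NumPy sketch of Eqs. (38)-(40) plus a simplified Algorithm 2.
import numpy as np

def simulate_episode(T=500, theta=10.0, kappa=0.5, sigma=0.05,
                     c1=0.01, c2=0.01, c3=0.01, dt=1.0, seed=0):
    rng = np.random.default_rng(seed)
    vol = sigma * np.sqrt((1.0 - np.exp(-2.0 * kappa * dt)) / (2.0 * kappa))  # Eq. (39)
    x = np.empty(T); x[0] = theta                  # price starts at the mean
    q = np.zeros(T); a = np.zeros(T); r = np.zeros(T)
    for t in range(T - 1):
        if 40 <= t < 400 and q[t] == 0:            # occasionally receive inventory
            q[t] = rng.integers(1, 11)
        a[t] = rng.integers(0, int(q[t]) + 1)      # random exploratory action
        r[t] = x[t] * a[t] - c2 * a[t] ** 2 - c3 * q[t] ** 2       # Eq. (40)
        eps = rng.standard_normal()
        x[t + 1] = -c1 * a[t] + theta + np.exp(-kappa * dt) * (x[t] - theta) + vol * eps
        q[t + 1] = q[t] - a[t]
    return x, q, a, r
```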
Each trajectory of length 500 has zero activity in the first 40 timestamps and the last 100. In this manner we constrain the actions so that they do not take place near the extremes of the time series. The reason for this choice is that we do not want the terminal inventory to become a major concern for the agents and outweigh the other considerations involved in the liquidation process.

Similarly, the choice of the intervals of the discrete uniform distributions is not completely arbitrary. Adding inventory at randomly spaced intervals is more complex than adding it at equally spaced intervals, and if these intervals were increased substantially, more than 400 trajectories would be needed to train the model satisfactorily.

During the first part of the training phase, the RDMM model learns not only the price dynamics but also the interplay among prices, inventory, actions and rewards. A good approximation of the training set is vital for the policy search driven by Equation (34) to provide good results.

We compare the RDMM model against three benchmarks: Q-learning, DynaQ with an autoregressive integrated moving average model (DynaQ-ARIMA), and DynaQ with a long short-term memory network (DynaQ-LSTM). The details of the benchmark implementations may be found in Appendix F, Algorithms 6 (Q-learning) and 7 (DynaQ and variations).

After training, we test the policies on 10,000 simulated time series with trading horizon $T = 500$ generated by Equation (38). We use Algorithm 2, as for the training set, to create 10,000 batches of size 500 for the test set.

Figure 3 shows the price approximations (see Equation (6)) for a single training batch with horizon $T = 500$ after 25 epochs (left panel) and 225 epochs (right panel). The green shaded area corresponds to $\mu^\theta \pm \sigma^\theta$ (see Equation (5)).

Figure 3: Price approximations of a sampled single training batch with horizon T = 500. Model approximation at 25 (left) and 225 (right) epochs.

In Figure 4 we plot the reward approximations (see Equation (6)) for the same training batch as in Figure 3.

Figure 4: Reward approximation of a sampled single training batch with horizon T = 500. Model approximation at 25 (left) and 225 (right) epochs.

The convergence of the negative evidence lower bound (ELBO) of the conditional log-likelihood $\mathcal{L}$ (see Equation (30)) during the training phase is shown in Figure 5. We used ADAM with a fixed learning rate for the stochastic optimisation.
Figure 5: Convergence of the negative ELBO during training.

The convergence of the negative of the approximate unconditional expected reward (see Equation (34)) is shown in Figure 6. As for the ELBO, we use ADAM with a fixed learning rate for the stochastic optimisation.

Figure 6: Convergence of the unconditional expected reward approximation in Equation (34) under ADAM optimisation.

To compare the RDMM method against the benchmarks, we record all actions taken by the algorithms, the asset prices and the rewards. To better visualise and understand the policies behind each algorithm, we create heatmaps of the actions across states, i.e., across prices and inventory. The heatmaps use shades of blue to represent the average action taken (the number of shares executed) by the algorithm in question across the 10,000 time-series batches created for the test set. Figures 7, 8 and 9 contain the policy heatmaps for Q-learning, DynaQ-ARIMA and DynaQ-LSTM, respectively.

Figure 7: Q-learning policy heatmap across price and inventory. Darker shades represent more stocks sold.
Figure 8: DynaQ-ARIMA policy heatmap across price and inventory. Darker shades represent more stocks sold.

The benchmarks (Q-learning and the DynaQ algorithms) use an $\epsilon$-greedy exploration strategy, in which a random action is taken $(100\epsilon)\%$ of the time, and their actions vary only with price and inventory. In other words, the benchmarks are Markov in $s_t = (x_t, q_t)$. The RDMM, however, is not Markov in $(x_t, q_t)$ alone, as each action depends on the approximated distribution, which in turn depends on the entire path. As shown before, the model architecture summarises past information through an LSTM hidden state, as shown in the graphical representation in Figure 1 and in more detail in Equations (19) and (12).

Figure 9: DynaQ-LSTM policy heatmap across price and inventory. Darker shades represent more stocks sold.

The heatmaps provide an accurate representation of policies based on value functions, such as Q-learning. In the RDMM, however, an LSTM summarises past information: the policy derived from the RDMM architecture does not depend only on the current price and inventory, but also on all historical data, i.e., past prices, inventory, rewards and actions. For that reason, Figure 10 shows a heatmap of the average action taken by the RDMM approach for every (price, inventory) pair available in the training set.

Figure 10: Policy heatmap showing the average of the actions taken by the RDMM approach on the training set. Darker shades represent more stocks sold.
Figure 10 shows that the RDMM generates a policy with far smoother transitions between states than any of the Q-learning approaches. To visualise the policy better, we plot a cropped version of Figure 10 in Figure 11, with inventories ranging from 4 to 10 units, to showcase the policy's sensitivity to price and inventory. The heatmap exhibits a pattern in which the policy, all else being equal, executes larger volumes at higher prices and inventories.

Figure 11:
RDMM policy snippet extracted from Figure 10, with prices ranging from \$9.5 to \$10.5 and inventories from 4 to 10 units. Darker shades represent more stocks sold.

An important measure for comparing two execution strategies is the relative savings in basis points, defined as

$RS = \dfrac{R_{\mathrm{RDMM}} - R_{\mathrm{benchmark}}}{R_{\mathrm{benchmark}}} \times 10^4$,   (41)

where $R_{\mathrm{RDMM}} := \sum_{t}^{T} r_{\mathrm{RDMM},t}$ and similarly for $R_{\mathrm{benchmark}}$.

In Figure 12 we plot the histograms of the RS over the 10,000 simulated time series. The red area highlights the part of the histogram where the RDMM underperformed the benchmark (negative savings), while the blue area highlights the part where the RDMM outperformed the benchmark (positive savings). As we can see, the model bias introduced by the LSTM has a negative impact on the DynaQ-LSTM results compared to the other benchmarks. Model bias, as Deisenroth and Rasmussen [2010] pointed out, is a known problem in model-based RL: the model (in this example, the LSTM) fails to provide an accurate representation of the environment. As a consequence, the policies obtained from model-based RL tend to exploit the shortcomings of the model and its misrepresented environment, leading to poor results.

Figure 12: Histogram of relative savings in basis points of total reward. The red area highlights the part of the histogram where the RDMM underperformed the benchmark, while the blue area highlights the part where the RDMM outperformed it.
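A one-line computation of the relative-savings measure in Equation (41), included only to make the basis-point scaling explicit (1 bp = 1/10,000).

```python
# Relative savings of Eq. (41), expressed in basis points.
import numpy as np

def relative_savings_bp(rewards_rdmm, rewards_benchmark):
    r_rdmm, r_bench = np.sum(rewards_rdmm), np.sum(rewards_benchmark)
    return (r_rdmm - r_bench) / r_bench * 1e4
```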
Table 1 provides a summary of the performance of each strategy and Table 2 provides a pairwise comparison between the methods.

Algorithm       Time series (training)   E(Σ r_t)   sd(Σ r_t)
Q-learning      1 million                2104.90    217.76
DynaQ-ARIMA     1 million                2105.08    217.79
DynaQ-LSTM      1 million                2104.89    218.11
RDMM            400                      2105.30    217.76

Table 1: Average reward on the test set.
Comparison                Mean of the differences   t-statistic   p-value
RDMM vs Q-learning        $0.38                     6.71          0.000
RDMM vs DynaQ-ARIMA       $0.21                     4.47          0.000
RDMM vs DynaQ-LSTM        $0.40                     4.52          0.000

Table 2: Paired t-test results for the mean reward differences.

These results show that our method slightly outperforms the benchmarks in terms of average reward.
In this section, we repeat the experiments from Section 7 with real stock price data from Intel (INTC), Microsoft (MSFT) and Facebook (FB), traded between January 2018 and March 2018, and Vodafone (VOD), traded in 2017. To do so, we extract all order executions from the limit order book tick-by-tick data and sample the mid-price every second. For training and testing we select the first 400,000 data points for INTC, MSFT and FB, and the first 200,000 data points for VOD. The remainder of each time series is used as a validation set for parameter tuning. Figure 13 shows the time-series plots for the first 400,000 data points (200,000 for VOD).
Figure 13: Price time series, in US dollars, for (a) Facebook, (b) Microsoft, (c) Intel and (d) Vodafone, calculated from the order executions in the limit order book tick-by-tick data, with mid-prices sampled every second.
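The preprocessing just described, reducing executed trades from tick-by-tick limit-order-book data to a mid-price series sampled once per second, can be sketched in pandas as follows. The column names and input layout are assumptions about the data format, not the paper's actual schema.

```python
# Hedged pandas sketch: one mid-price observation per second from trade data.
import pandas as pd

def sample_mid_prices(trades: pd.DataFrame) -> pd.Series:
    df = trades.copy()
    df.index = pd.to_datetime(df["timestamp"])
    mid = (df["bid"] + df["ask"]) / 2.0
    # keep the last observed mid price in every one-second bucket, filling gaps forward
    return mid.resample("1s").last().ffill()
```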
We use the same reward as defined in Equation (40):

$r_t = (x_t - c_2 a_t)\, a_t - c_3 q_t^2$.   (42)

Here, however, the values of the constants $c_2$ and $c_3$ are set proportionally to the magnitude of the average price of the stock considered; in real-time execution, they may be replaced by realised execution costs. We adopt $c_2 = c_3 = 0.017$ for FB, $0.008$ for MSFT, $0.004$ for INTC and $0.002$ for VOD (see Table 3). Prices $(x_1, x_2, x_3, \ldots)$ are taken sequentially from the real data set.

$q_0 = 0$; $i = 20$;
for $t \in \{1, \ldots\}$ do
  if $t$ mod $i = 0$ then $i = i + n$, where $n \sim U_d\{\cdot,\cdot\}$;
  if $i$ lies inside the active trading window then $q_t = u$, where $u \sim U_d\{\cdot,\cdot\}$; else $q_t = 0$;
  $a_t = a$, where $a \sim U_d\{0, q_t\}$;
  $r_t = x_t a_t - c_2 a_t^2 - c_3 q_t^2$;
  $q_{t+1} = q_t - a_t$;
end
Algorithm 3: Real training data set; random inventory, actions and rewards creation.

The first 200,000 data points from the FB, MSFT and INTC stock prices are split into batches of 500 observations and used as the training set for the RDMM model. For VOD we use only the first 150,000 observations for the training set, since less data is available. As in the experiments with mean-reverting dynamics, we use Algorithm 3 to create a sequence of fictional inventory, actions and their respective rewards.

For the real-data experiment, we compare the RDMM model against seven benchmarks: Q-learning, DynaQ with an autoregressive integrated moving average model (DynaQ-ARIMA), DynaQ with a long short-term memory network (DynaQ-LSTM), a time-weighted average price strategy with three seconds for execution (TWAP3), an RDMM without the state uncertainty $\Sigma^\phi_t$ as input to the deterministic policy in Equation (19) (RDMM-NoU), a DynaQ using a deep Markov model (DynaQ-DMM) as the model $M$ for the simulated experience in the DynaQ Algorithm 4, and an RDMM whose policy search inputs are augmented with extra simulated price trajectories drawn from the generative part of the RDMM (RDMMx). The architecture of the DMM used in the DynaQ-DMM approach is the RDMM architecture without the feedback of actions.

We include the TWAP approach because it is a well-known strategy in finance for mitigating the adverse effects of executing a large number of shares at once. The choice of three seconds of execution for TWAP was motivated by the average execution time observed for the other strategies on the same task.

The RDMM can be seen as a combination of deep learning structures arranged to provide desirable properties for handling sequential data as a trading system. Gauging which structure or guideline provides a significant increment in financial performance is a challenge. We therefore follow the rationale of Mnih et al. [2015], where the authors assess the importance of the constituent parts of an RL approach by disabling individual core components of the deep Q-network agent and showing the detrimental effects on the agent's performance. This procedure is known as an ablation study.

For this reason, we add two models to further investigate the strengths of the RDMM. The RDMM-NoU model (an RDMM without the state uncertainty $\Sigma^\phi_t$ as input to the deterministic policy) allows us to determine the contribution of the state uncertainty in the RDMM model. In the DynaQ-DMM approach, we borrow the predictions made by the DMM part of the RDMM and use them for the simulated experience in a DynaQ algorithm. This benchmark informs us whether the DMM alone can provide a significant improvement compared to other model-based RL methods such as DynaQ-ARIMA and DynaQ-LSTM.

Figure 13 shows that the price range of the real time series varies considerably more than the price range in our mean-reversion simulation. The wider the price range, the larger the number of possible states that the Q-learning and DynaQ algorithms have to visit to generate their policies. To cope with this issue, we modify Algorithm 7 in Appendix F, making the number of visits proportional to the time-series price range.
Algorithm 4 contains the general form of the Q-learning and DynaQ models used in this section. Notice that in the first two lines we compute the number of possible states by multiplying the price range over the 400,000 seconds (200,000 for Vodafone) by the number of one-cent bins and by the maximum number of shares provided to the agent (10 units in our experiments). Next, we multiply the estimated number of states by 200 to obtain the number of visits.

nstates = [max(TS) − min(TS)] × 100 × 10  (price range × 100 cents × 10 inventory levels);
simN = nstates × 200  (number of states × number of visits);
Initialise Q(x, q, a), $\epsilon$ and $\alpha$;
for $i \leftarrow 1$ to simN do
  if $i$ mod (simN/30) = 0 then decay $\epsilon$ and $\alpha$;
  $x \sim$ real mid-price time series, $q \sim U(0, 10)$;
  $x = \mathrm{round}(x, 2)$; $q = \mathrm{round}(q, 0)$;
  while $q > 0$ do
    /* choose action using $\epsilon$-greedy */
    if $U(0, 1) < \epsilon$ then $a \sim U(0, q)$; else $a = \mathrm{argmax}_{a'} Q(x, q, a')$;
    $a = \mathrm{round}(a, 0)$;
    $r = x a - c_2 a^2 - c_3 q^2$;
    $x' = f(x, a)$; $q' = q - a$;
    $Q(x, q, a) \leftarrow Q(x, q, a) + \alpha\big(r + \max_{a'} Q(x', q', a') - Q(x, q, a)\big)$;
    $q = q'$; $x = x'$;
    /* Simulated experience (DynaQ only) */
    for a fixed number of planning steps do
      $x, q \leftarrow$ random previously observed state;
      $a \leftarrow$ random action previously taken in state $(x, q)$;
      $r, x', q' = M(x, q, a)$  (model);
      $Q(x, q, a) \leftarrow Q(x, q, a) + \alpha\big(r + \max_{a'} Q(x', q', a') - Q(x, q, a)\big)$;
    end
  end
end

Algorithm 4: Q-learning/DynaQ, version 2, adapted from Sutton [1998].
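A hedged Python sketch of the tabular update at the core of Algorithm 4: prices discretised to one-cent bins, inventory to integer units, and an $\epsilon$-greedy choice between exploring and exploiting. The `env_step` and `price_sampler` callables, the episode count, and the fixed $\epsilon$ and $\alpha$ (Algorithm 4 decays both over the run) are placeholders, not the paper's implementation.

```python
# Hedged sketch of tabular Q-learning with one-cent price discretisation.
import numpy as np
from collections import defaultdict

def tabular_q_learning(env_step, price_sampler, n_episodes,
                       max_inventory=10, alpha=0.1, epsilon=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(float)                        # key: (price_in_cents, inventory, action)

    def greedy(x, q):
        return max(range(q + 1), key=lambda act: Q[(x, q, act)])

    for _ in range(n_episodes):
        x = int(round(price_sampler() * 100))     # discretise price to one-cent bins
        q = int(rng.integers(1, max_inventory + 1))
        while q > 0:
            if rng.random() < epsilon:
                a = int(rng.integers(0, q + 1))   # explore
            else:
                a = greedy(x, q)                  # exploit
            r, x_next = env_step(x, q, a)         # reward and next discretised price
            q_next = q - a
            best_next = max(Q[(x_next, q_next, ap)] for ap in range(q_next + 1))
            Q[(x, q, a)] += alpha * (r + best_next - Q[(x, q, a)])
            x, q = x_next, q_next
    return Q
```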
During policy learning, we give the benchmark RL agents the advantage of accessing the entire data set represented in Figure 13 during the training phase, whereas the RDMM has to complete the same task using only the first half of the data set (i.e., 200,000 points for INTC, FB and MSFT, and 150,000 points for VOD).

Table 3 summarises some facts about our data sets. The column "simN" refers to the number of trajectories sampled for Algorithm 4. The column "$c_2$, $c_3$" contains the constants adopted for the reward function (42). The last two columns indicate the sizes of the training and test sets for the RDMM model.
Stock   Range   Min      Max      simN        c₂, c₃   Train set   Test set
FB      28.09   167.21   195.30   5,618,000   0.017    200,000     200,000
INTC    11.72   42.05    53.77    2,344,000   0.004    200,000     200,000
MSFT    12.18   83.88    96.06    2,436,000   0.008    200,000     200,000
VOD     8.43    24.32    32.75    1,686,000   0.002    150,000     50,000
Table 3: Data set summary.
The 200,000 data points (50,000 for VOD) in the test set are split into batches of 500 observations. We also adopt the same inventory and action generation as described in Algorithm 3.

In the RDMMx, we exploit the generative nature of the variational auto-encoder inside the RDMM, creating sample price trajectories drawn from the combination of the transition distribution in Equation (1) and the price emission distribution in Equation (5). These price trajectories are generated after the ELBO optimisation of Algorithm 1 and, analogously to how the real prices of the training set are treated, are combined with a sequence of fictional inventory, actions and their respective rewards using Algorithm 3. The resulting simulated paths are used to augment the training set for the policy search optimisation part of Algorithm 1. More specifically, for every real data batch $b$ of size 500, $(x^b_1, x^b_2, \ldots, x^b_{500})$, an extra batch of the same size is simulated, taking the last latent state $z^b_{500}$ of the inference part of the RDMM to be the initial state of the simulated trajectory (see Figure 14). The subsequent states are computed using the transitions in Equation (1) and the emissions in Equation (5). The size of the training set for the policy search in the RDMMx is therefore double the size of the training set for the policy search in the regular RDMM. The goal of the RDMMx is to investigate whether artificially increasing the number of trajectories enhances the policy search.

Figure 14: Samples of price trajectories. Real prices in blue and a generated trajectory in red.
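A hedged sketch of the RDMMx rollout described above, reusing the transition and emission sketches from the model-specification section: starting from the last inferred latent state of a real batch, latents and price emissions are sampled forward to produce a synthetic trajectory. Holding the action at zero during the rollout is purely illustrative.

```python
# Hedged sketch of a generative price rollout from the learned transition/emission.
import torch

@torch.no_grad()
def generate_trajectory(transition, emission, z_last, steps=500, a_dim=1):
    z = z_last                                    # (B, z_dim), last inferred latent
    a_prev = torch.zeros(z.shape[0], a_dim)
    prices = []
    for _ in range(steps):
        mu_z, sigma_z = transition(z, a_prev)
        z = mu_z + torch.randn_like(sigma_z) * sigma_z             # sample Eq. (1)
        mu_o, sigma_o = emission(z, a_prev)
        prices.append(mu_o + torch.randn_like(sigma_o) * sigma_o)  # sample Eq. (5)
    return torch.stack(prices, dim=1)             # (B, steps, o_dim)
```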
The heatmap representations of the policies generated by Q-learning, DynaQ-ARIMA, DynaQ-LSTM and DynaQ-DMM are presented in Figures 15, 16, 17 and 18. The colour bar to the right of each plot represents the number of shares sold, which increases with darker shades.

In the simulated price dynamics, we observed that higher prices trigger larger order executions. For the real price dynamics, we observe the same characteristic in the heatmaps to a certain extent, but it is not as pronounced as in the simulations. There is, however, the same dependence on inventory as before.
Figure 15: Facebook policy heatmaps for (a) Q-learning, (b) DynaQ-ARIMA, (c) DynaQ-LSTM and (d) DynaQ-DMM, for states where the stock price ranges from \$167.21 to \$195.30 and the inventory ranges from 0 to 10 units. The actions, i.e., the number of shares to be sold by the agent, correspond to the tonality indicated by the colour bar on the right side of each heatmap.
In the Facebook policy heatmaps of Figure 15, we notice that DynaQ-ARIMA has the smoothest shade transitions between contiguous states, followed by DynaQ-DMM and Q-learning. In general, we observe in all cases that prices that appear less frequently in the time series, such as those in the neighbourhood of the minimum and maximum values, tend to result in more abrupt transitions between adjacent states. In the DynaQ-LSTM heatmap we observe clusters of states with diminished actions, proportional to the inventory, in the price ranges of \$165 to \$175 and \$185 to \$190. From the heatmap, we suspect that the policy search is exploiting the shortcomings of a possible model bias introduced by the LSTM: the DynaQ-LSTM outputs an over-optimistic policy on those clusters of states, meaning that the agent decides to hold the shares in expectation of a price reversion in the near future, as a consequence of the model bias discussed previously.

In the Intel policy heatmaps of Figure 16, we notice that DynaQ-ARIMA has the smoothest shade transitions between adjacent states, followed by Q-learning and DynaQ-DMM. Compared with the Facebook policies, the abrupt transitions between contiguous states are less noticeable in the neighbourhood of the minimum and maximum prices. This may be explained by the time series plotted in Figure 13, where we notice that the price trajectory of the Intel stock approaches its maximum and minimum values more frequently than the Facebook time series does. In the DynaQ-LSTM heatmap we see a more aggressive policy for the darker cluster of states with prices around \$50. As in the Facebook case, we suspect that this is a negative consequence of the model bias introduced by the LSTM.
Figure 16: Intel policy heatmaps for (a) Q-learning, (b) DynaQ-ARIMA, (c) DynaQ-LSTM and (d) DynaQ-DMM, for states where the stock price ranges from \$42.05 to \$53.77 and the inventory ranges from 0 to 10 units. The actions, i.e., the number of shares to be sold by the agent, correspond to the tonality indicated by the colour bar on the right side of each heatmap.
Figure 17: Microsoft policy heatmaps for (a) Q-learning, (b) DynaQ-ARIMA, (c) DynaQ-LSTM and (d) DynaQ-DMM, for states where the stock price ranges from \$83.88 to \$96.06 and the inventory ranges from 0 to 10 units. The actions, i.e., the number of shares to be sold by the agent, correspond to the tonality indicated by the colour bar on the right side of each heatmap.
The Microsoft policy heatmaps displayed in Figure 17 are more similar to each other than the Facebook and Intel policies. We observe the smoothest shade transitions between contiguous states in DynaQ-DMM, followed closely by DynaQ-ARIMA. Similar to the Intel stock, we notice in the time series plotted in Figure 13 that the Microsoft series approaches the neighbourhood of its maximum value more than once; we refer to the two peaks around t = 130,000 and t = 325,000. This seems to result in less abrupt transitions between adjacent states in the top part of the heatmaps, i.e., near the maximum price. The high contrast between contiguous states in the bottom part of the heatmaps is consistent with the fact that the price trajectory takes a sharp dive towards the minimum price around t = 220,000. The presence of distinct clusters is less noticeable in the Microsoft DynaQ-LSTM heatmap compared with the previous cases.
Figure 18: Vodafone policy heatmaps for (a) Q-learning, (b) DynaQ-ARIMA, (c) DynaQ-LSTM and (d) DynaQ-DMM, for states where the stock price ranges from \$24.32 to \$32.75 and the inventory ranges from 0 to 10 units. The actions, i.e., the number of shares to be sold by the agent, correspond to the tonality indicated by the colour bar on the right side of each heatmap.
In the Vodafone policy heatmaps of Figure 18, we observe a white band (no shares executed) for all cases, located between \$27 and \$28. The reason for this particular pattern is that, in the VOD dataset, there is a 12-cent gap between the prices \$27.27 and \$27.39; therefore, the initial value (Q = 0) of the Q-table does not change, because states whose price lies between \$27.27 and \$27.39 cannot be sampled. For the very same reason, this white band does not affect the performance of the benchmarks, since the agents are not required to act on those states in our experiments. We also notice that the Q-learning, DynaQ-LSTM and DynaQ-DMM policy heatmaps displayed in Figure 18 appear similar. DynaQ-ARIMA seems to have the smoothest shade transitions between contiguous states. As in the Microsoft case, the presence of uncommon clusters is less noticeable in the DynaQ-LSTM heatmap compared with the Facebook and Intel DynaQ-LSTM heatmaps.

To achieve reasonable results during the learning phase, the RDMM requires an increase in the size of the neural networks for the real data set compared to the mean-reversion case. To facilitate our investigation, an additional strategy is adopted during model optimisation: instead of completing the learning in a single session, where the model approximation of the VAE (ELBO optimisation) is followed by the RL optimisation (policy search), we introduce the possibility of an intermission between the ELBO optimisation and the policy search. The goal of this modification is three-fold: i) to allow the execution of multiple learning sessions during the ELBO optimisation, with different learning rates for each session, until we achieve the desired approximation; ii) to facilitate parameter fine-tuning and avoid overfitting; and iii) to start the RL optimisation (or continue training the model approximation) directly from a set of previously saved parameters.

In addition to the three goals mentioned above, we should consider that this work has, at the present moment, experimental and academic inclinations. These ideas were influenced by the guidelines described in Smith [2017a]. The strategy of breaking the training phase into segmented sessions would be replaced by an additional module implemented to control the learning process in cases where the RDMM is used for commercial purposes. From our experience of training and testing the RDMM on the simulated mean-reversion and real data sets, this module should coordinate the interplay between i) the complexity of the neural networks used and the volatility of the dataset; and ii) the number of epochs adopted for the ELBO and policy search optimisations, along with intelligent management of the learning rates. We revisit this topic at the end of this subsection.

With the above remark in mind, we provide a summary of the RDMM learning for the four stocks mentioned previously (INTC, FB, VOD and MSFT). The summary contains a few plots of the learning phase for each stock, to give the reader a brief notion of some aspects of the model training relative to the intensity of the learning rates, the number of epochs used for the stochastic gradient descent, and their relation to the convergence of the objective functions. We start with Table 4, which provides information regarding the number of optimisation sessions used.
                 ELBO (L)                                  Uncond. Exp. Reward (J)
Stock    sessions    learning rates    epochs     sessions    learning rates    epochs
FB          3                                        1
MSFT        3                                        1
INTC        1                                        1
VOD         2                                        1
Table 4: RDMM model training summary - real data

In the last three columns of Table 4 we see that a single session was enough to achieve good convergence of the approximated unconditional expected reward (see Equation (35)). During the ELBO optimisation, we divided the learning phase into three sessions for the Facebook and Microsoft data, where the intensity of the learning rates decreases as we progress through the learning sessions. Intel, on the other hand, achieved satisfactory results in only one session. We use two short sessions for Vodafone to achieve ELBO convergence.
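To make the session-based strategy concrete, the sketch below shows how the ELBO and policy-search optimisations could be split into checkpointed sessions; the function names (`elbo_step`, `policy_step`) and the learning rates and epoch counts in the comments are illustrative placeholders rather than the exact settings of Table 4.

```python
# Minimal sketch of the session-based optimisation described above; elbo_step,
# policy_step, and the rates/epochs in the comments are illustrative placeholders.
import pickle

def run_sessions(step_fn, params, sessions, save_path):
    """Run several optimisation sessions, each with its own learning rate and
    number of epochs, saving the parameters after every session."""
    for lr, n_epochs in sessions:
        for epoch in range(n_epochs):
            params, loss = step_fn(params, lr)   # one stochastic gradient pass
        with open(save_path, "wb") as f:         # checkpoint, so a later session
            pickle.dump(params, f)               # or the policy search can resume
    return params

# Stage 1: ELBO optimisation split into sessions with decreasing learning rates.
# params = run_sessions(elbo_step, params, [(1e-3, 500), (1e-4, 500)], "elbo.pkl")
# Intermission: inspect convergence, then resume from the saved checkpoint.
# with open("elbo.pkl", "rb") as f:
#     params = pickle.load(f)
# Stage 2: policy search (expected-reward optimisation) in a single session.
# params = run_sessions(policy_step, params, [(1e-4, 125)], "policy.pkl")
```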
Figure 19: RDMM optimisation for the MSFT dataset. (a) Price approximation of a sampled batch from the training set. (b) ELBO convergence on its second training session.
In Figure 19a we have plotted the real prices of Microsoft stocks in blue and the model's approximation in orange for a sampled batch from the training set. For each price approximation, we add and subtract its estimated standard deviation (shaded area in green). The RDMM seems capable of providing a smoother approximation of the original time series. In Figure 19b we observe two spikes in the ELBO convergence (see Equation (30)) during the second session (from a total of three sessions), which uses 500 epochs and the learning rate reported in Table 4. Since learning is not destabilized too frequently, we decided not to use any technique to address high losses on bad batches, such as gradient clipping or adaptive learning rate clipping as in Ede and Beanland [2019].
Figure 20: RDMM optimisation for the MSFT dataset. (a) Price approximation of a sampled batch from the test set. (b) Negative total expected reward approximation convergence.
Figure 20a presents the predicted prices for a sampled batch of the test set of Microsoft stocks. Figure 20b shows the approximation of the negative unconditional expected total reward (see Equation (35)), performed in a single session of 125 epochs with the learning rate reported in Table 4.

Figure 21: RDMM optimisation for the FB dataset. (a) Price approximation of a sampled batch from the training set. (b) ELBO convergence on its second training session.
Similarly, in Figure 21a we have plotted the real prices of Facebook stocks and the model's approximation for a sampled batch from the training set, with the estimated variance shown by the shaded green area. As before, the RDMM seems to reduce the noise in the observed data. Figure 21b presents the ELBO convergence (see Equation (30)) during the second session (from a total of three sessions), which uses 500 epochs (see Table 4) and seems to provide a smooth decrease in the objective function.

Figure 22: RDMM optimisation for the FB dataset. (a) Price approximation of a sampled batch from the test set. (b) Negative total expected reward approximation convergence.
Figure 22a presents the predicted prices for a sampled batch of the test set of Facebook stock prices. Figure 22b shows the convergence of the approximated negative unconditional expected total reward (see Equation (35)), performed in a single session using 150 epochs and the learning rate reported in Table 4. We notice that for this learning rate the cost function of the policy search decreases as the number of epochs increases, but oscillates considerably more than in the MSFT case, which uses a smaller learning rate.

Figure 23: RDMM optimisation for the INTC dataset. (a) Price approximation of a sampled batch from the training set. (b) ELBO convergence on its first training session.
Intel stock prices and the RDMM approximation for a sampled batch of the training set are shown in Figure 23a. The approximation seems to reduce noise fairly well. In Figure 23b we observe an abrupt drop in the ELBO convergence during the first training session (from a total of two sessions), whose learning rate is reported in Table 4.

Figure 24: RDMM optimisation for the Intel dataset. (a) Price approximation of a sampled batch from the test set. (b) Negative total expected reward approximation convergence.
Figure 24a presents the predicted prices for a sampled batch of the Intel test set. Figure 24b shows the approximation of the negative unconditional expected total reward (see Equation (35)), performed in a single session of 100 epochs with the learning rate reported in Table 4.

Figure 25: RDMM optimisation for the VOD dataset. (a) Price approximation of a sampled batch from the training set. (b) ELBO convergence on its first training session.
In Figure 25a we have, for a sampled batch from the training set, plotted in blue the real prices of Vodafone stocks, whereas the model's approximation of those prices is plotted in orange. For each price approximation, we add and subtract its estimated standard deviation (shaded area in green). Figure 25b presents the ELBO convergence during the first training session (from a total of two sessions), which uses 150 epochs and the learning rate reported in Table 4.

Figure 26: RDMM optimisation for the VOD dataset. (a) Price approximation of a sampled batch from the test set. (b) Negative total expected reward approximation convergence.
Figure 26a presents the predicted prices for a sampled batch of the test set of Vodafone stock prices. Figure 26b shows the convergence of the approximated negative unconditional expected total reward (see Equation (35)), which we chose to perform in a single session with a large learning rate (see Table 4), achieving good results with only 20 epochs.

The challenge of finding adequate values for hyperparameters is nearly ubiquitous in machine learning, and the RDMM is no exception. In fact, as we suggested at the beginning of Subsection 8.1, it would be advantageous for commercial use to implement an extra module in the algorithm to manage and automate the selection of the size and complexity of the neural networks, the number of epochs, and the learning rates. This automation should take into account that the ideal size of the neural networks seems to be associated with the volatility of the dataset. The number of epochs adopted and the learning rates are closely associated with the policy search part of the RDMM algorithm, suggesting that the optimisation could be facilitated by the approach proposed by Smith [2017b], where the author recommends allowing the learning rates to vary cyclically within a range of values, rather than using an exponential decay or an adaptive learning rate such as ADAM, used in our project. As Dauphin et al. [2015] state, saddle points are the main obstacles in optimising large, deep neural networks with non-convex objective functions. Smith [2017b] suggests that a cyclical approach, where the learning rate is periodically increased, allows the algorithm to traverse saddle-point plateaus. Consequently, this could be a better technique for dealing with two conflicting situations, where a small learning rate makes the objective converge slowly while a large learning rate can destabilize the optimisation.
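As an illustration of the cyclical schedule advocated by Smith [2017b], the sketch below implements a triangular learning-rate cycle; `base_lr`, `max_lr` and `step_size` are illustrative values, not hyperparameters tuned for the RDMM.

```python
def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-3, step_size=200):
    """Triangular cyclical learning rate as in Smith [2017b]: the rate ramps
    linearly from base_lr up to max_lr and back over 2 * step_size iterations."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Example: the rate peaks in the middle of each cycle and returns to base_lr at its end.
rates = [triangular_lr(i) for i in range(0, 801, 100)]
```

The periodic increase is what lets the optimiser escape the flat regions around saddle points, at the cost of temporarily larger steps.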
To evaluate the performance of the RDMM approach, we compare our method against the six benchmarks mentioned before: Q-learning, DynaQ-ARIMA, DynaQ-LSTM, DynaQ-DMM, RDMM-NoU, and TWAP3, using the test data set, i.e., the last 200,000 data points (50,000 for VOD) of the time series. The financial performance is evaluated by calculating the relative savings $RS_R$ in basis points, as in the simulated price dynamics section, where for every batch $b$ of 500 data points we compute

$$RS_R(b) = \frac{R^{(b)}_{\mathrm{RDMM}} - R^{(b)}_{\mathrm{benchmark}}}{R^{(b)}_{\mathrm{benchmark}}} \times 10^4, \qquad (43)$$

with

$$R^{(b)} = \sum_{t=1}^{500} r^{(b)}_t, \qquad (44)$$

where $r^{(b)}_t$ is given by

$$r^{(b)}_t = \left(x^{(b)}_t - c_1 a^{(b)}_t\right) a^{(b)}_t - c_2 q^{(b)}_t, \qquad (45)$$

which is the reward for instant $t$, defined in Equation (42), applied to batch $b$.
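A minimal sketch of how Equations (43)-(45) can be evaluated is given below; the array names and the synthetic data are illustrative, and the per-step rewards of Equation (45) are assumed to have been computed already.

```python
import numpy as np

def relative_savings_bps(rewards_rdmm, rewards_bench):
    """Relative savings RS_R in basis points, Equations (43)-(44).

    Both inputs are arrays of shape (n_batches, 500) holding the per-step
    rewards r_t^(b) of Equation (45)."""
    R_rdmm = rewards_rdmm.sum(axis=1)            # total reward per batch, Eq. (44)
    R_bench = rewards_bench.sum(axis=1)
    return (R_rdmm - R_bench) / R_bench * 1e4    # Eq. (43), expressed in basis points

# Example with synthetic per-step rewards for 400 batches of 500 steps.
rng = np.random.default_rng(0)
rs = relative_savings_bps(rng.normal(38.5, 0.1, (400, 500)),
                          rng.normal(38.5, 0.1, (400, 500)))
print(np.quantile(rs, [0.1, 0.25, 0.5, 0.75, 0.9]), rs.mean())
```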
The histograms of the $RS_R$ values for the 400 time series (100 for VOD) are plotted in Figures 27, 28, 29, and 30. The batches where the RDMM outperforms the benchmark (i.e. $RS_R > 0$) are represented in blue, and the batches where the RDMM is outperformed by the benchmark (i.e. $RS_R \leq 0$) are represented in red. Due to the asymmetric nature of the distribution of the relative savings in basis points of the total reward, we add a table with the quantiles and the mean of each distribution.

Figure 27: MSFT histogram of the relative savings in basis points of the total reward. Notice that the RDMMx panel is on a different scale.

Approach        10%     25%     Median  75%     90%     Mean
Q-learning     -0.07   -0.01    0.18    0.54    0.90    0.34
DynaQ-ARIMA    -0.09   -0.06    0.02    0.21    0.57    0.16
DynaQ-LSTM     -0.04    0.02    0.12    0.25    0.44    0.68
DynaQ-DMM      -0.04   -0.01    0.04    0.11    0.29    0.12
RDMM-NoU        0.42    0.51    0.58    0.64    0.69    0.57
RDMMx           0.00    0.00    0.00    0.00    0.00    0.00
TWAP3           1.43    1.51    1.60    1.74    1.88    1.63
Table 5: MSFT quantiles and mean of the relative savings in basis points of the total reward
In Figure 27 we observe that the RDMM outperforms the benchmarks in almost all batches of the MSFT test set. DynaQ-LSTM produces inferior results in a few samples due to model bias, as we suspected during the discussion of the policies obtained, where the DynaQ-LSTM seems to generate a policy with no actions assigned to some states (see Figure 17c). One possible explanation is that, for those states, the model bias introduces some form of overly optimistic expectation that the stock prices will rise in the near future. This issue resulted in three out of 400 batches (0.75%) where the inventory was not fully executed at the terminal state, leaving a small remainder. For those batches, we see a large $RS_R$ in favour of the RDMM. In the RDMMx histogram, we observe that all relative savings are equal to zero, i.e., there is no difference in performance compared to the RDMM. After a closer inspection, we notice that RDMMx and RDMM performed the same actions on the MSFT dataset. Table 5 confirms that the majority of the distributions are skewed to the right.

Figure 28: INTC histogram of the relative savings in basis points of the total reward. Notice that the RDMMx panel is on a different scale.
Approach        10%     25%     Median  75%     90%     Mean
Q-learning      0.02    0.09    0.22    0.49    0.85    0.34
DynaQ-ARIMA    -0.03    0.01    0.10    0.29    0.66    0.22
DynaQ-LSTM      0.09    0.60    1.31    1.69    2.67    1.69
DynaQ-DMM      -0.03    0.05    0.11    0.21    0.46    0.19
RDMM-NoU        0.42    0.50    0.56    0.62    0.69    0.55
RDMMx          -0.09   -0.02    0.05    0.10    0.14    0.04
TWAP3           1.24    1.35    1.47    1.61    1.70    1.47
Table 6: INTC quantiles and mean of the relative savings in basis points of the total reward

In Figure 28 we, once again, observe that the RDMM outperforms the benchmarks in almost all batches of the Intel stocks test set. As with the MSFT stocks, the DynaQ-LSTM seems to suffer from the same model bias problem discussed previously and results in three out of 400 batches (0.75%) where the inventory is not fully executed, generating a large $RS_R$ in favour of the RDMM. For the INTC dataset, the RDMMx histogram shows a slightly better performance in favour of the RDMM approach, which is corroborated by Table 6.
Figure 29: VOD histogram of the relative savings in basis points of the total reward. Notice that the RDMMx panel is on a different scale.

Approach        10%     25%     Median  75%     90%     Mean
Q-learning      0.07    0.36    0.63    0.89    1.26    0.68
DynaQ-ARIMA     0.05    0.13    0.26    0.40    0.58    0.28
DynaQ-LSTM      0.22    0.58    0.86    1.18    2.63    2.07
DynaQ-DMM       0.16    0.44    0.67    0.96    1.31    0.81
RDMM-NoU        0.34    0.37    0.44    0.56    0.82    0.50
RDMMx           0.00    0.00    0.00    0.00    0.00   -0.00
TWAP3           0.92    1.03    1.19    1.30    1.43    1.18
Table 7: VOD quantiles and mean of the relative savings in basis points of the total reward

Figure 29 shows the dominance of the RDMM on the Vodafone stocks test set. As in the other test sets, DynaQ-LSTM yielded poor results in some batches, but the inventory is fully executed in 100% of the cases, leaving no remainder at the terminal states for all batches of the test set. In the top left and top centre panels we see that Q-learning and DynaQ-ARIMA produce better results than the RDMM in only a few batches. Similarly to the MSFT case, the RDMMx histogram for the VOD dataset shows almost no difference in performance compared to the RDMM. In Table 7, the small advantage of the RDMMx approach (mean RS equal to -0.00067) is suppressed by rounding to 2 decimal places. After a closer inspection, we noticed that the RDMM and RDMMx actions differ only in a small fraction (16/8077) of the total number of actions executed.
Approach        10%     25%     Median  75%     90%     Mean
Q-learning     -0.02    0.03    0.14    0.39    0.72    0.30
DynaQ-ARIMA    -0.04   -0.01    0.02    0.08    0.15    0.11
DynaQ-LSTM     -0.03    0.02    0.07    0.14    0.95    1.44
DynaQ-DMM      -0.02    0.03    0.16    0.38    0.72    0.30
RDMM-NoU        0.92    1.06    1.21    1.36    1.54    1.22
RDMMx          -0.00    0.01    0.02    0.03    0.04    0.02
TWAP3           1.54    1.67    1.79    1.93    2.05    1.80
Table 8: FB quantiles and mean of the relative savings in basis points of the total reward
Figure 30: FB histogram of the relative savings in basis points of the total reward. Notice that the RDMMx panel is on a different scale.

In Figure 30, the superior performance of the RDMM is validated one more time, as it significantly outperforms the benchmarks in almost all batches of the Facebook stocks test set. DynaQ-LSTM leaves a small remainder in the final state in eight out of 400 batches (2%), where the inventory was not fully executed. The shape of the histograms appears, in general, more positively skewed for the Facebook dataset compared to the other stocks, indicating higher financial gains for the RDMM compared to the other methods. According to the RDMMx histogram, we see a slightly better performance by the original RDMM on the Facebook dataset.

To better assess the financial gain produced by the RDMM, we compute the accumulated reward for every batch $b$ of the test set (size 500) with Equation (44). Next we estimate the mean batch reward $\bar{R}$ of the test set with

$$\bar{R} = \frac{1}{B} \sum_{b=1}^{B} R^{(b)}, \qquad (46)$$

where $B$ is the total number of batches in the test set, i.e., $B = 400$ ($B = 100$ for VOD). For every algorithm, the resulting estimate of Equation (46) applied to the Intel, Microsoft, Vodafone and Facebook stocks can be found in Table 9.
Approach        INTC        MSFT        VOD        FB
Q-learning      10534.60    19241.95    6550.63    37588.15
DynaQ-ARIMA     10534.73    19242.30    6550.89    37588.90
DynaQ-LSTM      10533.16    19241.40    6549.62    37584.17
DynaQ-DMM       10534.75    19242.37    6550.54    37588.18
RDMM-NoU        10534.37    19241.50    6550.76    37584.70
TWAP3           10533.40    19239.44    6550.31    37582.52
RDMMx           10534.91    19242.59    6551.09    37589.24
RDMM            10534.96    19242.59    6551.08    37589.30
Table 9: Mean batch reward $\bar{R}$ in US dollars on all test sets, given by Equation (46)

Table 9 shows that the RDMM approach generated the highest reward in almost all test sets considered. The exception is the RDMMx approach, which, compared to the RDMM, yielded the same mean batch reward for the MSFT dataset and a slightly superior mean batch reward for the VOD dataset. For the Intel and Microsoft test sets, DynaQ-DMM took third place, followed by DynaQ-ARIMA. For the Facebook and Vodafone test sets, DynaQ-DMM and DynaQ-ARIMA switch places in terms of their performance. Overall, TWAP3 and DynaQ-LSTM delivered inferior results compared with the other methods.

One might also consider the total reward accumulated $R_{acc}$ in the test set, which is computed by

$$R_{acc} = \sum_{b=1}^{B} R^{(b)}. \qquad (47)$$

The resulting total accumulated rewards (47) from our experiments are shown in Table 10.
Approach        INTC            MSFT            VOD           FB
Q-learning      4,213,840.38    7,696,781.21    655,062.69    15,035,261.72
DynaQ-ARIMA     4,213,890.65    7,696,920.73    655,089.19    15,035,559.59
DynaQ-LSTM      4,213,264.39    7,696,561.65    654,962.30    15,033,668.45
DynaQ-DMM       4,213,901.02    7,696,949.70    655,054.11    15,035,270.46
RDMM-NoU        4,213,749.96    7,696,599.59    655,076.06    15,033,878.16
TWAP3           4,213,359.78    7,695,776.42    655,031.33    15,033,006.60
RDMMx           4,213,965.91    7,697,035.90    655,108.51    15,035,694.30
RDMM            4,213,982.71    7,697,035.90    655,108.47    15,035,720.19
Table 10: Total reward accumulated $R_{acc}$ in US dollars for all test sets

The total accumulated rewards in Table 10 naturally follow the same narrative seen in Table 9, where the RDMM and the RDMMx outperform all other baseline models, and DynaQ-DMM and DynaQ-ARIMA alternate between the third and fourth places.

For every batch we compute the difference between the total reward of the regular RDMM and the benchmark,

$$d_b = R^{(b)}_{\mathrm{RDMM}} - R^{(b)}_{\mathrm{benchmark}}, \qquad b = 1, 2, 3, \ldots \qquad (48)$$

where $R^{(b)}$ is the total reward of batch $b$ as defined in Equation (44), and we conduct a paired t-test of the mean difference of total reward between the RDMM approach and every benchmark. The null hypothesis is that the true mean difference is equal to zero, with the alternative that the difference is positive in favour of the RDMM.
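A minimal sketch of this test, assuming arrays holding the per-batch total rewards of Equation (44) and using SciPy's one-sided paired t-test in place of whichever routine was used originally:

```python
import numpy as np
from scipy import stats

def paired_test(R_rdmm, R_bench):
    """Mean difference d-bar of Equation (48) and a one-sided paired t-test
    (H0: mean difference equals zero, H1: positive in favour of the RDMM)."""
    d = np.asarray(R_rdmm) - np.asarray(R_bench)                        # d_b, Eq. (48)
    t_stat, p_value = stats.ttest_rel(R_rdmm, R_bench, alternative="greater")
    return d.mean(), p_value

# Example with synthetic per-batch total rewards for 400 batches.
rng = np.random.default_rng(0)
R_bench = rng.normal(19242.0, 5.0, 400)
d_bar, p = paired_test(R_bench + 0.5, R_bench)
```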
The mean differences, with the p-values resulting from the statistical test indicated in parentheses, can be found in Table 11.

Approach        INTC          MSFT          VOD           FB
Q-learning      0.3558        0.6367        0.4578        1.1462
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
DynaQ-ARIMA     0.2301        0.2879        0.1928        0.4015
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
DynaQ-LSTM      1.7958        1.1856        1.4617        5.1294
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
DynaQ-DMM       0.2042        0.2155        0.5436        1.1243
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
RDMM-NoU        0.5819        1.0908        0.3241        4.6051
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
TWAP3           1.5573        3.1487        0.7714        6.7840
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
RDMMx           0.0420        0.0000       -0.0004        0.0647
                (· × 10^− )   (· × 10^− )   (· × 10^− )   (· × 10^− )
Table 11: Mean difference statistic $\bar{d} = \frac{1}{B}\sum_b d_b$ of the total reward for the paired t-test, with the associated p-values in parentheses.
From Table 11 we see that the estimated mean difference is positive in almost all cases; the exceptions arise when the RDMM is compared against the RDMMx model on the MSFT and VOD datasets, where the mean differences are not statistically significant. For the other baseline models, we have enough evidence from our experiments to assert that the financial gain of the RDMM approach is statistically significantly higher than the gain achieved with these benchmark models (excluding the RDMMx) when looking at the financial metrics defined by Equations (46) and (47).

As mentioned before, it is a challenge to gauge which part or idea behind the RDMM is more relevant for producing good results. The RDMM-NoU and DynaQ-DMM approaches were an attempt at grasping the benefits of using a filtered process instead of the raw observations, or of using the state uncertainty as an input to the deterministic policy in the RDMM model. From the results in Tables 9, 10 and 11, we see that, on some occasions, these approaches perform well, but they are always outperformed by the complete model.

Despite our efforts to determine the significance of the individual components of the RDMM via ablation studies, we should also consider that the performance of the DMM in the DynaQ-DMM approach could be partially bounded by a Q-learning type framework. In the RDMM-NoU approach, the challenge lies in defining the best practice regarding the sizes of the neural networks involved in the policy search process, once part of the inputs (i.e., the state uncertainty) of the deterministic function has been eliminated.

With the RDMMx we find limited evidence that artificially increasing the number of trajectories enhances the policy search: it produces an equal and a slightly superior mean reward for the MSFT and VOD datasets, respectively, compared to the standard RDMM, while in the INTC and FB cases the RDMMx obtains inferior results.
We demonstrated that the proposed RDMM architecture outperforms classical approaches like Q-learning and variations of DynaQ in a simple setting like the mean reversion problem. The performance improvement becomes more pronounced when price dynamics are more complex, as demonstrated using real data sets from the limit order book of Facebook, Intel, Vodafone and Microsoft. Since our approach requires fewer training examples (we allowed the benchmarks access to the entire dataset for training while the RDMM used only half of it), we conclude that the method is very data efficient. Additionally, as stated in the method's core assumptions, this approach can handle noisy and incomplete observations since the observations are processed in a POMDP framework, where the RL policy search is performed with respect to the filtered process provided by a modified DMM. Another important feature of the RDMM is that it has been formulated while taking into account the impact caused by the agent on the environment.

It should be highlighted that all experiments were conducted using a small range for inventory and prices. In real-life applications, where prices vary considerably more or the agent is required to deal with large amounts of shares to sell, implementing procedures like Q-learning and DynaQ might become infeasible due to an unmanageable number of states to visit. An attempt to circumvent this issue would be to use larger price intervals in the Q-tables to reduce the number of states (binning), but it is reasonable to conjecture that this might result in loss of information and inferior results.

Although the RDMM model has reasonable assumptions, we need to verify to what extent its architecture helps us understand the environment and generate profits. An experiment should be conducted to evaluate the importance of direct connections between actions and the latent variables representing the environment, and the degree to which the agent's actions can deteriorate the quality of the policies obtained by the benchmarks and the RDMM.

A challenge posed by the RDMM is the visualization of its policies. A heatmap gives an accurate representation for a simplistic approach based only on value functions, like Q-learning. However, it does not provide an adequate representation of the rationale behind the actions taken by the RDMM, given the complexity of the neural networks involved, where an LSTM summarising past information is used. This was identified in the graphical representation in Figure 1 and, in more detail, in Equations (19) and (12). In other words, the policy derived from the RDMM architecture does not depend only on the current price and inventory; it also depends on all historical data such as prices, inventory, rewards and actions.

Finally, the RDMM was formulated and implemented to handle much more information than prices and inventory, such as high-frequency snapshots of the LOB containing the type of orders, prices and depths, as well as auxiliary information coming from different sources not restricted to the LOB. Unfortunately, due to time constraints we could not fully explore the RDMM's ability to handle different kinds of inputs, where we believe this method's advantages might become more evident.
Acknowledgments
We would like to thank Sebastian Jaimungal and David Duvenaud for their guidance, and other members of the Department of Statistical Sciences of the University of Toronto for helpful discussions.
Appendices
A Activation Functions
In this section we define the activation functions used in this article. Figure 31 presents a plot of the curves described.
A.1 Rectified linear unit (ReLU)

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \qquad (49)$$

A.2 Hyperbolic tangent (tanh)

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (50)$$

A.3 Softplus

$$f(x) = \ln(1 + e^{x}) \qquad (51)$$
Figure 31: Activation functions
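For reference, Equations (49)-(51) translate directly into code (the softplus is written in a numerically stable form):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, Equation (49)."""
    return np.maximum(0.0, x)

def tanh(x):
    """Hyperbolic tangent, Equation (50)."""
    return np.tanh(x)

def softplus(x):
    """Softplus, Equation (51), computed as log(1 + exp(x)) without overflow."""
    return np.logaddexp(0.0, x)

x = np.linspace(-4, 4, 9)
print(relu(x), tanh(x), softplus(x))
```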
B Neural Network Architectures for Sequential Data
B.1 LSTM
A long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997] is a recurrent neural network (RNN) where each building unit is composed of a memory cell and three gates: an input gate, an output gate and a forget gate:

$$
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c) && \text{memory cell} \\
h_t &= o_t \circ \sigma_h(c_t)
\end{aligned}
$$

Here, $x_t$ is the input to the memory cell layer at time $t$, the $W$ and $U$ are weight matrices, and the $b$ are bias vectors. A graphical representation of an unrolled LSTM is shown in Figure 32.

Figure 32: Graphical representation of an unrolled LSTM. Image courtesy of Christopher Olah, used with permission.
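A direct, illustrative transcription of these update equations (the weights are random placeholders; in practice one would use a library LSTM layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U and b are dicts keyed by 'f', 'i', 'o', 'c' holding
    the weight matrices and bias vectors of the equations above."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])     # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])     # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])     # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    h_t = o_t * np.tanh(c_t)                                   # hidden state
    return h_t, c_t

# Toy usage: input size 3, hidden size 4, random placeholder weights.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in 'fioc'}
U = {k: rng.normal(size=(4, 4)) for k in 'fioc'}
b = {k: np.zeros(4) for k in 'fioc'}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
```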
B.2 GRU
Gated recurrent units (GRU) [Cho et al., 2014] are a simplified version of the LSTM with fewer parameters (the U and W matrices are smaller) but with performance comparable to LSTMs [Chung et al., 2014]:

$$
\begin{aligned}
z_t &= \sigma_g(W_z x_t + U_z h_{t-1} + b_z) && \text{update gate} \\
r_t &= \sigma_g(W_r x_t + U_r h_{t-1} + b_r) && \text{reset gate} \\
h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \sigma_h(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h) && \text{output vector}
\end{aligned}
$$

Here, $x_t$ is the input vector, the $W$ and $U$ are weight matrices, and the $b$ are bias vectors.
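The corresponding GRU update, again as an illustrative sketch with placeholder weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step following the equations above; W, U and b are dicts keyed by
    'z' (update gate), 'r' (reset gate) and 'h' (candidate state)."""
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])               # update gate
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])               # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r_t * h_prev) + b['h'])   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                          # output vector h_t

# Toy usage: input size 3, hidden size 4, random placeholder weights.
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(4, 3)) for k in 'zrh'}
U = {k: rng.normal(size=(4, 4)) for k in 'zrh'}
b = {k: np.zeros(4) for k in 'zrh'}
h = gru_step(rng.normal(size=3), np.zeros(4), W, U, b)
```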
C Variational Auto-encoder

Kingma and Welling [2013] and Rezende et al. [2014] independently introduced a powerful approach to treat the intractable posteriors of directed probabilistic models with continuous latent variables. It is assumed that $X = \{x^{(1)}, \ldots, x^{(n)}\}$ is a random sample generated by a conditional distribution $p_\theta(x|z)$, where $z$ is an unobserved random variable which, in turn, is generated by some prior distribution $p_\theta(z)$. Briefly, we want to write $p_\theta(x|z)$ as a normal distribution parametrized by an MLP:

$$\log p_\theta(x|z) = \log \mathcal{N}(\mu_\theta, \sigma^2_\theta I) \qquad (52)$$

where

$$
\begin{aligned}
h &= \tanh(W^{(\theta)}_1 z + b^{(\theta)}_1) \\
\mu_\theta &= W^{(\theta)}_2 h + b^{(\theta)}_2 \\
\log \sigma^2_\theta &= W^{(\theta)}_3 h + b^{(\theta)}_3,
\end{aligned}
$$

which makes the posterior $p_\theta(z|x)$ intractable. In the variational autoencoder approach, Kingma and Welling [2013] and Rezende et al. [2014] replace the intractable posterior $p_\theta(z|x)$ by an approximation $q_\phi(z|x)$ parametrized by neural networks called recognition models (Figure 33), and they introduce a method to learn the parameters $\phi$ and $\theta$ (Algorithm 5):

$$\log q_\phi(z|x) = \log \mathcal{N}(\mu_\phi, \sigma^2_\phi I) \qquad (53)$$

where

$$
\begin{aligned}
h &= \tanh(W^{(\phi)}_1 x + b^{(\phi)}_1) \\
\mu_\phi &= W^{(\phi)}_2 h + b^{(\phi)}_2 \\
\log \sigma^2_\phi &= W^{(\phi)}_3 h + b^{(\phi)}_3
\end{aligned}
$$

Figure 33: Variational autoencoder diagram

The learning is driven by the maximisation of the variational lower bound:

$$\log p_\theta(x) = KL\left[q_\phi(z|x) \,\|\, p_\theta(z|x)\right] + \mathcal{L}(\theta, \phi, x) \qquad (54)$$

where

$$\mathcal{L}(\theta, \phi, x) = -KL\left[q_\phi(z|x) \,\|\, p_\theta(z)\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \qquad (55)$$

is maximised making use of the reparametrization trick:

$$\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x|z^{(l)}) \qquad (56)$$

with

$$z^{(l)} = \mu_\phi + \sigma_\phi \odot \epsilon^{(l)}, \qquad \epsilon^{(l)} \sim \mathcal{N}(0, I). \qquad (57)$$

Initialize θ and φ;
repeat
    Sample a mini-batch x^M;
    ε ∼ N(0, I), z = μ_φ + σ_φ ⊙ ε as in Equation (53);
    Compute ∇_{φ,θ} L^M(θ, φ, x^M, ε) as in Equation (55);
    Update θ and φ using stochastic gradient descent
until convergence of θ and φ is reached;
Algorithm 5: VAE learning
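The sketch below mirrors the reparametrized ELBO estimate of Equations (55)-(57) for a Gaussian encoder and a toy linear-Gaussian decoder; it is a simplified illustration (additive constants are dropped from the log-likelihood, and the gradient step of Algorithm 5 is omitted), not the Theano implementation used for the RDMM.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_sample(x, mu_phi, log_var_phi, decode):
    """Single-sample ELBO estimate of Equation (55) for a Gaussian encoder with a
    standard-normal prior: the KL term is available in closed form, and the
    reconstruction term uses one reparametrized sample (Equations (56)-(57))."""
    eps = rng.standard_normal(mu_phi.shape)
    z = mu_phi + np.exp(0.5 * log_var_phi) * eps                        # Eq. (57)
    kl = -0.5 * np.sum(1 + log_var_phi - mu_phi**2 - np.exp(log_var_phi))
    return -kl + decode(x, z)                                            # Eq. (55)

# Toy decoder: log p(x|z) of a unit-variance Gaussian with mean A z (constants dropped).
A = rng.normal(size=(5, 2))
log_px_given_z = lambda x, z: -0.5 * np.sum((x - A @ z) ** 2)
x = rng.normal(size=5)
print(elbo_sample(x, mu_phi=np.zeros(2), log_var_phi=np.zeros(2), decode=log_px_given_z))
```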
D Automatic Differentiation
Algorithmic differentiation, or automatic differentiation (AD), is a family of techniques designed to compute derivatives without the need to derive them explicitly by hand. These techniques were developed long before deep learning was established. There are many variants of AD, such as forward-mode [Wengert, 1964] and reverse-mode [Linnainmaa, 1970], among many others. The appropriateness of each type depends on the type of function whose derivatives are computed. In our application, reverse-mode AD is the most appropriate, and we provide a brief review limited to the computation of gradients, used extensively by deep learning packages like Theano, PyTorch or Tensorflow. For a general and comprehensive analysis of AD, we recommend the book by Griewank and Walther [2008], where the authors do not confine AD to machine learning applications. Baydin et al. [2018], on the other hand, provide an up-to-date survey specific to machine learning applications.

Reverse-mode AD is most appealing when the function being differentiated, $f: \mathbb{R}^n \to \mathbb{R}^m$, has a codomain dimension that is significantly smaller than that of the domain, i.e., $m \ll n$. In this case only $m$ sweeps are required to compute the derivative, in contrast to the forward mode, which requires $n$ sweeps. For this reason, reverse-mode AD has become a cornerstone in the development of deep learning models, where gradient-based optimisation is widespread.

We present reverse-mode AD through an example. Consider the following function, $f: \mathbb{R}^2 \to \mathbb{R}$:

$$f(x_1, x_2) = \sin(x_1) + x_1 \exp(x_2). \qquad (58)$$

The first step of reverse-mode AD is to convert the target function into a sequence of primitive operations (add, multiply, log, exp, etc.). This sequence is called the Wengert list, which in this example is

$$
\begin{aligned}
w_1 &= x_1, \\
w_2 &= x_2, \\
w_3 &= \exp(w_2), \\
w_4 &= w_1 w_3, \\
w_5 &= \sin(w_1), \\
w_6 &= w_5 + w_4.
\end{aligned}
$$

This sequence of operations forms a computational graph representing the inter-relations of the series of operations, as displayed in Figure 34.

Figure 34: Computational graph associated with the Wengert list above

The next step is to compute all the derivatives in the reverse order using the following formula (essentially an application of the chain rule for a sequence of composite functions):

$$\bar{w}_i = \sum_{j \in \pi(i)} \bar{w}_j \frac{\partial w_j}{\partial w_i}, \qquad i = N, N-1, N-2, \ldots \quad \text{and} \quad \bar{w}_N = 1, \qquad (59)$$

where the $\bar{w}_i$ are called adjoints, $\pi(i)$ represents all parent indexes of $i$ in the computational graph in Figure 34, and $N$ is the total number of elements in the Wengert list. For example, $\pi(1) = \{4, 5\}$ and $\pi(4) = \{6\}$. Therefore, the list of adjoints is:

$$\bar{w}_6 = 1, \quad \bar{w}_5 = 1, \quad \bar{w}_4 = 1, \quad \bar{w}_3 = w_1, \quad \bar{w}_2 = w_1 \exp(w_2), \quad \bar{w}_1 = w_3 + \cos(w_1).$$

For a real-valued function $f: \mathbb{R}^n \to \mathbb{R}$, the full gradient of $f$ is given by $\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right) = (\bar{w}_1, \ldots, \bar{w}_n)$, which can be computed in just one sweep. In our particular example, the full gradient is

$$\nabla f = (\bar{w}_1, \bar{w}_2) = \left(w_3 + \cos(w_1),\; w_1 \exp(w_2)\right) = \left(\cos(x_1) + \exp(x_2),\; x_1 \exp(x_2)\right).$$
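The forward and backward sweeps for this example can be checked with a few lines of code; this is a hand-written sweep for Equation (58), not a general AD implementation:

```python
import numpy as np

def grad_f(x1, x2):
    """Reverse-mode sweep for f(x1, x2) = sin(x1) + x1 * exp(x2),
    following the Wengert list and adjoints derived above."""
    # Forward sweep: evaluate the Wengert list.
    w1, w2 = x1, x2
    w3 = np.exp(w2)
    w4 = w1 * w3
    w5 = np.sin(w1)
    w6 = w5 + w4
    # Backward sweep: accumulate adjoints as in Equation (59).
    w6_bar = 1.0
    w5_bar = w6_bar * 1.0
    w4_bar = w6_bar * 1.0
    w3_bar = w4_bar * w1
    w2_bar = w3_bar * np.exp(w2)
    w1_bar = w5_bar * np.cos(w1) + w4_bar * w3
    return w1_bar, w2_bar          # (df/dx1, df/dx2)

# Check against the closed-form gradient (cos(x1) + exp(x2), x1 * exp(x2)).
print(grad_f(0.3, 0.7), (np.cos(0.3) + np.exp(0.7), 0.3 * np.exp(0.7)))
```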
E Deep Learning Packages

Deep learning packages like Theano, PyTorch or Tensorflow use reverse-mode AD extensively. In addition, these libraries implement several optimisations, including GPU computation, arithmetic simplification, merging of similar subgraphs and improvements to numerical stability, to name a few. In our experiments we use Theano [Theano Development Team, 2016] to implement the RDMM model.
F Q-learning and DynaQ algorithms
In Algorithm 6 we present the Q-learning procedure used in this article, while the base algorithm for the DynaQ variants used in our experiments is presented in Algorithm 7. In both cases Q(x, q, a) represents the action-value function, x ∼ U(a, b) and q ∼ U(a, b) represent the price x and the inventory q being drawn from a uniform distribution on the interval (a, b), and round(a, b) represents rounding a to b digits.

Initialise Q(x, q, a); set ε and α;
for i ← 1 to the total number of iterations do
    if i mod 100,000 == 0 then
        ε ← ε × decay factor; α ← α × decay factor;
    end
    sample x ∼ U(·, ·), q ∼ U(0, ·); x = round(x, ·), q = round(q, ·);
    while q > 0 do
        if U(0, 1) < ε then a = argmax_a' Q(x, q, a');
        else a ∼ U(0, q);
        a = round(a, ·);
        r = (x − c₁a) a − c₂q;
        x' = f(x, a); q' = q − a;
        Q(x, q, a) ← Q(x, q, a) + α (r + max_a' Q(x', q', a') − Q(x, q, a));
        q = q', x = x';
    end
end
Algorithm 6: Q-Learning - Adapted from Sutton [1998]
DynaQ planning is conducted in an online manner, i.e., we combine real experiences sampled from the environment with simulated experiences sampled from the model:
Initialise Q(x, q, a); set ε and α;
for i ← 1 to the total number of iterations do
    if i mod 100,000 == 0 then
        ε ← ε × decay factor; α ← α × decay factor;
    end
    sample x ∼ U(·, ·), q ∼ U(0, ·); x = round(x, ·), q = round(q, ·);
    while q > 0 do
        /* choose action using ε-greedy */
        if U(0, 1) < ε then a = argmax_a' Q(x, q, a');
        else a ∼ U(0, q);
        a = round(a, ·);
        r = (x − c₁a) a − c₂q;
        x' = f(x, a); q' = q − a;
        Q(x, q, a) ← Q(x, q, a) + α (r + max_a' Q(x', q', a') − Q(x, q, a));
        q = q', x = x';
        /* Simulated experience */
        for j ← 1 to n do
            x, q ← random previously observed state;
            a ← random action previously taken in state x, q;
            r, x', q' = M(x, q, a) (model);
            Q(x, q, a) ← Q(x, q, a) + α (r + max_a' Q(x', q', a') − Q(x, q, a));
        end
    end
end
Algorithm 7: DynaQ - Adapted from Sutton [1998]
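A compact sketch of the inner DynaQ update of Algorithm 7 is given below; `model` stands for any one-step model M (the ARIMA sketch in the next subsection fits this argument), and the learning rate and planning budget are illustrative placeholders.

```python
import random
from collections import defaultdict

Q = defaultdict(float)        # tabular action-value function keyed by (x, q, a)
alpha = 0.05                  # placeholder learning rate
n_planning = 10               # placeholder number of simulated updates per real step

def q_update(x, q, a, r, x_next, q_next, actions):
    """Tabular Q-learning backup, used for both real and simulated experience."""
    target = r + max(Q[(x_next, q_next, b)] for b in actions)
    Q[(x, q, a)] += alpha * (target - Q[(x, q, a)])

def dyna_q_step(x, q, a, r, x_next, q_next, actions, model, visited):
    """One DynaQ step: a real backup followed by n_planning simulated backups,
    where model returns a (reward, price, inventory) prediction."""
    q_update(x, q, a, r, x_next, q_next, actions)
    visited.append((x, q, a))
    for _ in range(n_planning):
        xs, qs, a_s = random.choice(visited)     # previously observed state-action pair
        rs, xs2, qs2 = model(xs, qs, a_s)        # simulated transition from the model M
        q_update(xs, qs, a_s, rs, xs2, qs2, actions)
```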
F.1 DynaQ-ARIMA
In the DynaQ-ARIMA benchmark, for a state $s_t = (x_t, q_t)$ and an action $a_t$ obtained from the simulated experience in Algorithm 7, the predictions of the model $M$ are:

$$
M(x_t, q_t, a_t): \quad
\begin{aligned}
\hat{x}_{t+1} &= \mu + x_t + \phi\,(x_t - x_{t-1}) \\
\hat{r}_{t+1} &= (x_t - c_1 a_t)\, a_t - c_2 q_t \\
\hat{q}_{t+1} &= q_t - a_t
\end{aligned}
\qquad (60)
$$

with $c_1 = c_2 = 0.$
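The one-step model M of Equation (60) can be written directly; the values of `mu`, `phi`, `c1` and `c2` below are illustrative placeholders, not the constants used in our experiments.

```python
def arima_model(x_t, q_t, a_t, mu=0.0, phi=0.1, c1=0.0, c2=0.0, x_prev=None):
    """One-step DynaQ-ARIMA model M of Equation (60); mu, phi, c1 and c2 are
    illustrative defaults only."""
    x_prev = x_t if x_prev is None else x_prev
    x_next = mu + x_t + phi * (x_t - x_prev)      # price prediction
    r_next = (x_t - c1 * a_t) * a_t - c2 * q_t    # predicted reward
    q_next = q_t - a_t                            # inventory update
    return r_next, x_next, q_next
```

The returned triple matches the (r, x', q') interface expected by the simulated-experience loop of Algorithm 7.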
F.2 DynaQ-LSTM

We used a cross-validated long short-term memory (LSTM) network (see Appendix B.1) as the model $M$ in Algorithm 7 for the DynaQ-LSTM benchmark. Therefore, for a state $s_t = (x_t, q_t)$ and an action $a_t$ obtained from the simulated experience in Algorithm 7, the predictions of the model $M$ are:

$$
M(x_t, q_t, a_t): \quad
\begin{aligned}
\hat{x}_{t+1} &= \mathrm{LSTM} \\
\hat{r}_{t+1} &= (x_t - c_1 a_t)\, a_t - c_2 q_t \\
\hat{q}_{t+1} &= q_t - a_t
\end{aligned}
\qquad (61)
$$

with $\phi = 0.$ and $c_1 = c_2 = 0.$
References
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43, 2018. URL http://jmlr.org/papers/v18/17-468.html.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Álvaro Cartea and Sebastian Jaimungal. Incorporating order-flow into optimal execution. Mathematics and Financial Economics, 2016.
Álvaro Cartea, Ryan Donnelly, and Sebastian Jaimungal. Algorithmic trading with model uncertainty. SIAM Journal on Financial Mathematics, 8(1):635-671, 2017.
P. Casgrain and Sebastian Jaimungal. Trading algorithms with learning in latent alpha models. SSRN, 2017.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. 2014.
Balazs Csanad Csaji. Approximation with artificial neural networks. Master's thesis, Eötvös Loránd University (ELTE), 2001.
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, Dec 1989. ISSN 1435-568X. doi: 10.1007/BF02551274. URL https://doi.org/10.1007/BF02551274.
Yann N. Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. CoRR, abs/1502.04390, 2015. URL http://arxiv.org/abs/1502.04390.
Marc Deisenroth and Carl Rasmussen. Reducing model bias in reinforcement learning. 12 2010.
Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on Machine Learning, 2011.
Jeffrey M. Ede and Richard Beanland. Adaptive learning rate clipping stabilizes learning. CoRR, abs/1906.09060, 2019. URL http://arxiv.org/abs/1906.09060.
Christoph Frei and Nicholas Westray. Optimal execution of a VWAP order: A stochastic control approach. Mathematical Finance, 25(3):612-639, 2015.
Andreas Griewank and Andrea Walther. Evaluating Derivatives. SIAM, 2008.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997. doi: 10.1162/neco.1997.9.8.1735.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. 2013.
Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. 2015.
Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. 2016.
Alex Krizhevsky. Convolutional deep belief networks on CIFAR-10, 2010.
Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (in Finnish). Master's thesis, University of Helsinki, 1970.
Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width, 2017.
Rowan McAllister and Carl Edward Rasmussen. Data-efficient reinforcement learning in continuous-state POMDPs. arXiv:1602.02523 [stat.ML], 2016.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, February 2015. ISSN 0028-0836. URL http://dx.doi.org/10.1038/nature14236.
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. 2014.
Leslie N. Smith. Best practices for applying deep learning to novel applications. CoRR, abs/1704.01568, 2017a. URL http://arxiv.org/abs/1704.01568.
Leslie N. Smith. Cyclical learning rates for training neural networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464-472, 2017b.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016. URL http://arxiv.org/abs/1605.02688.
Walter Rudin. Principles of Mathematical Analysis. McGraw Hill, 1976.
Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.
C.J.C.H. Watkins and P. Dayan. Q-learning. Kluwer Academic Publishers, 1992.
Robert Edwin Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463-464, 1964.
Marco Wiering and Martijn van Otterlo.