Deep Reinforcement Learning with Function Properties in Mean Reversion Strategies
Sophia Gu∗

∗Department of Mathematics, Courant Institute of Mathematical Sciences, New York University. E-mail: [email protected]. Supervised by Gordon Ritter, Department of Mathematics, Courant Institute of Mathematical Sciences, New York University. E-mail: [email protected].

Abstract.
With the recent advancement of Deep Reinforcement Learning (DRL) in the gaming industry, we are curious whether the same technology would work as well for common quantitative financial problems. In this paper, we investigate whether an off-the-shelf library developed by OpenAI can be easily adapted to a common trading strategy, the mean reversion strategy. Moreover, we design and test whether we can get better performance by narrowing the function space that the agent needs to search. We achieve this by augmenting the reward function with a carefully picked penalty term.
Key words.
Deep Reinforcement Learning, Model-free Reinforcement Learning, Proximal Policy Optimization,Markov Decision Process, Bayesian Statistics, Quantitative Finance, Mean Reversion, Time Series
1. Introduction.
Mean reversion strategies have been studied for decades. [5] used tabular Q-learning for simple mean reversion problems approximated with an Ornstein-Uhlenbeck (OU) driven price process and achieved an average Sharpe ratio close to 2.07. And this year, [3] developed a closed-form solution using heat potentials for the same price process together with a simple cost function. In this paper, although we will be evaluating a similar set of problems using an arbitrary transaction cost model, we aim at making the problem setting more general by evaluating our model on both an OU process and an Auto Regressive Moving Average (ARMA) process. One advantage of using DRL over the prior approaches is that we can make the state and action spaces continuous; discretizing them is one of the limiting factors that prevents a learned model from reaching its theoretical expectation. Another advantage is that, although due to time limitations we could only evaluate our model on mean reversion problems, the same model can be easily reconfigured for other similar trading strategies (e.g., a single-factor Arbitrage Pricing Theory model has also been briefly studied and gave similarly promising performance to the mean reversion model).

That said, the landscape of a DRL policy function is complex. A reinforcement learning (RL) agent, without knowing what it is actually searching for, can easily get stuck in a small puddle. But in reality, we may often know some important properties of the target function that we are looking for. For example, a price process with a mean reversion property implies a near arbitrage in the system: when the price is too far out of equilibrium, a trade betting that it returns to the equilibrium has a slim chance of loss. So although we don't know the exact nature of the policy function, for mean reversion we know at least that at a higher price the trade has to be smaller (or more negative) than at a lower price. In other words, the action suggested by the policy network has to be monotonically decreasing in the price. Such domain knowledge about the target function, if incorporated properly into the agent's learning procedure, can effectively reduce the function space that the agent needs to search. Thus, in this paper, using Bayesian statistics, we derive the function property penalty presented below, and we show that the resulting reward function yields a dramatic improvement in the agents' performance.

The paper is organized as follows: the notations are defined in section 2, our main results are in section 3, the algorithm used is in section 4, our experimental results are in section 5, and the conclusions follow in section 6. You can find all the code used for this paper at https://github.com/sophiagu/RLF.
2. Notations.
Suppose a rational investor invests in a stock over a finite period 1, 2, ..., T. She chooses actions to maximize the expected utility of terminal wealth:

E[u(w_T)] = E[ u( w_0 + Σ_{t=1}^{T} δw_t ) ]

where w_0 is the initial wealth, δw_t = w_t − w_{t−1} is the change in wealth, and u : ℝ → ℝ denotes the utility function; it is a mapping from wealth to a real number with dimensionless units. Assume the investor is risk-averse; then u is concave.
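As a concrete example of such a utility (a standard one, not specific to this paper), the exponential (CARA) utility is increasing and concave for any risk-aversion parameter γ > 0:

```latex
% Exponential (CARA) utility: increasing and concave for any gamma > 0,
% since u'(w) = gamma e^{-gamma w} > 0 and u''(w) = -gamma^2 e^{-gamma w} < 0.
u(w) = -e^{-\gamma w}, \qquad \gamma > 0
```

With normally distributed terminal wealth, maximizing E[u(w_T)] for this utility reduces to maximizing E[w_T] minus a multiple of V[w_T], which makes it the canonical example of the mean-variance equivalent setting defined next.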
Definition 2.1 (Mean-Variance Equivalent Distribution). The underlying asset return random variable r is said to follow a mean-variance equivalent distribution if it has a density p(r), has first and second moments, and for any increasing utility function u there exists a constant κ > 0 such that the policy which maximizes E[u(w_T)] is also optimal for the simpler problem

max_π { E[w_T] − (κ/2) V[w_T] }.

It follows that, by writing w_T = w_0 + Σ_{t=1}^{T} δw_t and treating the single-period wealth increments as uncorrelated (so that the variance of the sum is the sum of the variances), our expected utility above becomes

max_π Σ_t ( E[δw_t] − (κ/2) V[δw_t] ).
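As a quick numerical illustration (our own sketch, not from the paper), this per-period criterion can be estimated from a sample of single-period wealth increments generated by running a fixed policy; the numbers below are placeholders:

```python
import numpy as np

def mean_variance_criterion(delta_w: np.ndarray, kappa: float) -> float:
    """Sample estimate of E[dw] - (kappa/2) * V[dw] for a single period,
    given a vector of simulated single-period wealth increments dw."""
    return delta_w.mean() - 0.5 * kappa * delta_w.var()

# Example: 10,000 simulated single-period P&Ls produced by some fixed policy.
rng = np.random.default_rng(0)
delta_w = rng.normal(loc=5.0, scale=50.0, size=10_000)
print(mean_variance_criterion(delta_w, kappa=1e-4))
```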
3. Main results.

3.1. Deep reinforcement learning setup.
Many of our choices for the DRL setup, including the reward function and the transaction cost model, are based on Ritter's prior work [5], which is a good read if you are interested in more detailed derivations. Here, we briefly state the main results and discuss some differences. First, RL algorithms are specified in terms of a state space, an action space, a Markov decision process (MDP), a reward function, and so on, so we will break the setup down into those pieces.
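To fix ideas, here is a minimal Gym-style skeleton showing how those pieces might be organized in code. This is only our own sketch: the class name `MeanReversionEnv`, the `next_price` simulator hook, and the parameter values are hypothetical, and the state and reward it uses are the ones described in the rest of this section.

```python
import numpy as np
import gym
from gym import spaces


class MeanReversionEnv(gym.Env):
    """Sketch of the trading MDP with continuous state and action spaces.

    State  : (current holding h_t, current price p_t, previous price p_{t-1})
    Action : the trade, i.e. the change in holding
    Reward : per-step mean-variance reward (derived later in this section); the
             function penalty introduced below can be subtracted from it as well.
    """

    def __init__(self, next_price, p0=10.0, kappa=1e-4, max_trade=100.0):
        self.next_price = next_price      # next_price(p) -> simulated next price (e.g. one OU step)
        self.p0, self.kappa, self.max_trade = p0, kappa, max_trade
        self.action_space = spaces.Box(-max_trade, max_trade, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
        self.reset()

    def _obs(self):
        return np.array([self.h, self.p, self.p_prev], dtype=np.float32)

    def reset(self):
        self.h, self.p_prev = 0.0, self.p0
        self.p = self.next_price(self.p0)
        return self._obs()

    def step(self, action):
        trade = float(np.clip(action[0], -self.max_trade, self.max_trade))
        self.h += trade
        p_next = self.next_price(self.p)
        delta_w = self.h * (p_next - self.p)   # P&L; a transaction cost on `trade` would be subtracted here
        reward = delta_w - 0.5 * self.kappa * delta_w ** 2
        self.p_prev, self.p = self.p, p_next
        # Episode termination (e.g. a fixed horizon T) is left to a time-limit wrapper.
        return self._obs(), reward, False, {}
```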
The term state, in RL problems, usually refers to the state of the environment. We let s_t denote the state of the environment at time t; the state is a data structure consisting of all the information that the agent needs to decide upon the action. For mean reversion, clearly we need the current holding of the stock, h_t, and the current price of the stock, p_t. Different from the previous work, we also include the previous price of the stock, p_{t−1}. Underlying every RL problem is an MDP, which means the agent should be able to make a decision at timestep t based only on the information in s_t, not on the earlier states s_1, ..., s_{t−1}.

We will use the annualized Sharpe ratio to measure how good a trained agent is:

annualized Sharpe ratio = (260 × mean of daily P&Ls) / (√260 × standard deviation of daily P&Ls) = √260 × (mean of daily P&Ls) / (standard deviation of daily P&Ls) ≈ 16 × (mean of daily P&Ls) / (standard deviation of daily P&Ls).

Inspired by [8], which introduced a monotonic hint for classification problems, we introduce a general framework for incorporating a particular function property in an RL setting. In particular, we will apply it to mean reversion price processes.

When searching for a mean reversion strategy using a neural network, we are effectively looking for an optimal function in a space parametrized by

P = { all distributions on ℝ that can be expressed by the neural network }.

This space is huge, but Stochastic Gradient Descent (SGD) inherently works only in a low-dimensional subspace and cannot explore the whole space of the parameters. One possible solution is to reduce the function space by ruling out undesirable subsets of the parameter space. To do so, we introduce a function penalty:

Definition 3.1 (Function Penalty). Let p_1, ..., p_T ∈ ℝ_+ be the set of stock prices generated for one epoch of training and consider their order statistics p_(1), ..., p_(T) in increasing order, i.e., p_(1) ≤ p_(2) ≤ ... ≤ p_(T). Let i, j be two integers sampled uniformly from 1 to T such that i < j. Given an order-specific functional f(a(p_(i)), a(p_(j))) that outputs 0 if (a(p_(i)), a(p_(j))) obeys the property and 1 otherwise, where a is a policy function, the function penalty p_err is a 0-1 loss such that:

when taking two parameters, p_err(i, j) = f(a(p_(i)), a(p_(j)));
when taking a single parameter, p_err(j) = ( Σ_{i=1}^{j−1} p_err(i, j) ) / (j − 1);
when taking no parameter, p_err = ( Σ_{j=2}^{T} p_err(j) ) / (T − 1).

Recall that when a stock price has a mean reversion nature, optimal trades should be monotonically decreasing w.r.t. the price. Using this definition, we can encode that property as

f(i, j) = 1{ a(p_(i)) < a(p_(j)) },

i.e., a pair of ordered prices incurs a penalty whenever the trade at the lower price is smaller than the trade at the higher price. Then, given a positive constant c_1 indicating how strong our belief is, we assign to a proposed policy function a the prior probability

P(a) ∝ exp( −c_1 × p_err ).

This distribution represents the a priori probability density assigned to a candidate function a with a given level of function penalty: the probability that a is the best possible approximation to the optimal function decreases exponentially with the function penalty. Note that p_err ranges from 0 to 1, with 0 meaning no violation at all and 1 meaning all points violate the function property.
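To make the penalty concrete, here is a small NumPy sketch (our own illustration, not the paper's implementation) that estimates p_err for the monotone-decreasing property by sampling price pairs from one epoch:

```python
import numpy as np

def monotone_decreasing_penalty(prices, actions, n_pairs=1_000, rng=None):
    """Monte Carlo estimate of the function penalty p_err in [0, 1].

    prices  : prices p_1, ..., p_T observed during one training epoch
    actions : the policy's trades a(p_1), ..., a(p_T) at those prices
    A sampled pair (i, j) with p_(i) <= p_(j) is a violation (f = 1) whenever
    a(p_(i)) < a(p_(j)), i.e. the trade fails to decrease as the price rises.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(prices)                  # order statistics p_(1) <= ... <= p_(T)
    a_sorted = np.asarray(actions)[order]
    T = len(a_sorted)
    i = rng.integers(0, T - 1, size=n_pairs)    # i < j by construction
    j = rng.integers(i + 1, T)
    violations = a_sorted[i] < a_sorted[j]
    return violations.mean()
```

In training, the same count can be accumulated per step to give the single-parameter form p_err(t) used in the augmented reward below.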
So we can pick c_1, for example c_1 = 10, so that when all points violate the function property the probability of this candidate function is approximately zero, while if we make one mistake out of a hundred there is still a good chance that the function is feasible.

In RL, we use the reward as an indicator of how probable a learned function is. Let us therefore write the likelihood function, for a positive constant c_2, as

P(reward | function) ∝ exp( c_2 × (reward − optimal reward) ).

In other words, a higher reward means the function is closer to the optimal function. Now assume the optimal function can achieve the maximum of the mean-variance reward at each timestep t:

E[δw_t] − (κ/2) E[δw_t]².

To find the optimal reward, use the first-order optimality condition w.r.t. E[δw_t]:

1 − κ E[δw_t] = 0  ⇒  E[δw_t]* = 1/κ.

Substituting E[δw_t]* back into the mean-variance reward function, we obtain

optimal reward_t = 1/(2κ).

If we act conservatively, it is reasonable to assume that any reward less than the optimal reward by 1/κ should correspond to a function with approximately zero probability; in other words, we need exp(−c_2/κ) ≈ 0. A reasonable choice of c_2 is then simply 2κ.

Combining the prior probability density of a candidate function with the likelihood, for a fixed problem, we get the posterior density using Bayes' theorem:

P(function | reward, problem) ∝ P(reward | function, problem) × P(function | problem).

Taking logs and, for convenience, letting c = c_1 / c_2, we get

log P(function | reward, problem) ∝ log P(reward | function, problem) + log P(function | problem)
= c_2 × (reward − optimal reward) − c_1 × p_err
∝ c_2 × reward − c_1 × p_err
∝ reward − c × p_err.

Note that c can either be tuned using cross validation or hand-picked. For the latter, if we follow the previous choices, c_1 = 10 and c_2 = 2κ, we get

log P(function | reward, problem) ∝ reward − (5/κ) × p_err.

Putting everything together, we obtain a new reward function:

R_t ≈ δw_t − (κ/2)(δw_t)² − (5/κ) p_err(t).

(This is a soft-constraint form of the function property rather than a hard constraint: the generated training data also contain a lot of noise, so we do not want the policy to follow the proposed function property strictly, which can often result in overfitting.)

4. Algorithm.

4.1. Deep reinforcement learning.

The RL problem consists of an environment and an agent. At each iteration, the agent observes the current state of the environment and proposes an action based on a policy. After each interaction, the agent receives a reward from the environment and the environment updates its state. DRL differs from RL in that it trains a neural network to learn that policy. Fig. 1 compares RL and DRL.

Figure 1: One iteration of the RL (top) and DRL (bottom) procedure.

In choosing a specific RL algorithm, we focused on model-free algorithms, as they have been studied more extensively than model-based algorithms in the past decade. Within the scope of model-free RL algorithms, the two big branches are Q-learning and policy optimization. While we have experimented with both approaches, policy optimization yielded more promising results given the same amount of training time. Besides its built-in support for continuous state and action spaces, policy optimization directly improves the policy, since we follow gradients w.r.t. the policy itself, whereas Q-learning improves the estimates of the value function, which only implicitly improves the policy. (Q-learning also tends to be less stable than policy optimization algorithms; see [12], [11], and chapter 11 of [10].)
As a result, we settled on a policy-gradient-based DRL algorithm called Proximal Policy Optimization (PPO). PPO is one of the Actor-Critic algorithms that keep two neural networks, one estimating the policy (actor) function and the other the value (critic) function. The basic policy gradient loss function is

L^{PG}(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ].

Here, θ represents the parameters (weights) of a neural net, π_θ(a_t | s_t) is the probability of choosing action a_t given state s_t, and Â_t is an estimate of the advantage function, i.e., the relative value of the selected action compared to the base action. To fully understand this equation and its subsequent variations, we refer to the paper on PPO [7]. Intuitively, this loss function tells the agent to put more weight on a good policy, or more precisely, to raise the probabilities of actions that lead to higher critic values and vice versa.

Another thing to keep in mind is that PPO combines ideas from A2C (having multiple workers) and TRPO (using a trust region to improve the actor). The main idea is that after an update, the new policy should not be too far from the old policy; for that, PPO uses clipping to avoid too large an update. It is helpful to look at the training loop to understand how the agent learns:

Algorithm 4.1 PPO, Actor-Critic style
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_{θ_old} in the environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize the surrogate L w.r.t. θ, with K epochs and minibatch size M ≤ NT
    θ_old := θ
end for

We use OpenAI's improved version of its original implementation of PPO, Stable Baselines (https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html, https://github.com/hill-a/stable-baselines), to train our agents. This release of OpenAI Baselines includes scalable, parallel implementations of PPO that use MPI for data passing (https://openai.com/blog/openai-baselines-ppo/).

For both the value and policy networks, we use a 64 × 64 feedforward neural net with ReLU activation functions followed by an LSTM layer with 256 cells. (The network is designed to be slightly bigger than the size of our problem setting for ease of optimization; as pointed out by [6]: "Add a few more connections creates extra dimensions in weight-space and these dimensions create paths around the barriers that create poor local minima in the lower dimensional subspaces.") Both training and hyperparameter tuning use an Adam optimizer with a learning rate of 1e-5 and early stopping. We counter the problem of potential overfitting by adding both ℓ1 and ℓ2 regularization.
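As a rough illustration of what training and evaluation might look like with this setup, here is a sketch using Stable Baselines' PPO2. It relies on the hypothetical `MeanReversionEnv` sketched in section 3.1 together with a toy OU price simulator defined inline, and the `policy_kwargs` shown are only an approximation of the 64 × 64 + LSTM(256) architecture described above; exact settings may differ from the paper's.

```python
import numpy as np
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Hypothetical OU price simulator: one Euler step of dp = lam * (mu - p) dt + sigma dW.
def ou_step(p, mu=10.0, lam=0.5, sigma=0.3, dt=1.0, rng=np.random.default_rng(0)):
    return p + lam * (mu - p) * dt + sigma * np.sqrt(dt) * rng.normal()

env = DummyVecEnv([lambda: MeanReversionEnv(next_price=ou_step)])  # sketched in section 3.1

model = PPO2(
    "MlpLstmPolicy",                 # feedforward layers followed by an LSTM
    env,
    learning_rate=1e-5,
    nminibatches=1,                  # recurrent policies need n_envs to be a multiple of nminibatches
    policy_kwargs=dict(n_lstm=256),
    verbose=1,
)
model.learn(total_timesteps=10_000)

# Out-of-sample evaluation on one simulated path: annualized Sharpe ratio of per-step rewards.
obs, state = env.reset(), None
pnl = []
for _ in range(260):
    action, state = model.predict(obs, state=state)
    obs, reward, done, info = env.step(action)
    pnl.append(float(reward))        # a fuller version would track raw P&L separately from the reward
sharpe = np.sqrt(260) * np.mean(pnl) / np.std(pnl)
print("annualized Sharpe (one path):", sharpe)
```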
5. Experimental results.

We ran 10,000 Monte Carlo simulations to evaluate the trained agents' out-of-sample performance, measured by the annualized Sharpe ratio, for both an OU process and an ARMA(2,1) process. Table 1 and Table 2 show their summary statistics, respectively. We denote by Model A the agents trained using the original mean-variance reward function and by Model B the agents trained using the new augmented reward function with the monotonically decreasing function penalty. For the OU process, we also compare our agents to the tabular Q-learning model from [5].

Table 1: Statistics of the OU process

  statistic                  Q-learning (benchmark)   Model A   Model B
  mean                       2.07                     2.10      2.78
  std                        NA                       0.375     0.329
  timesteps to convergence   1000k                    7k        4k

Table 2: Statistics of the ARMA(2,1) process

  statistic                  Model A   Model B
  mean                       2.46      3.22
  std                        0.479     0.268
  timesteps to convergence   4k        10k

Fig. 2 and Fig. 3 display the kernel density estimates of the Sharpe ratios of all paths. The idea of a kernel density estimate is to plot the observed samples on a line and smooth them so that they look like a density.

Figure 2: Kernel density estimates of the Sharpe ratios from 10,000 out-of-sample simulations of the OU process: Model A (top), Model B (bottom).

Figure 3: Kernel density estimates of the Sharpe ratios from 10,000 out-of-sample simulations of the ARMA process: Model A (top), Model B (bottom).

For both processes, we observe a noticeable increase in the average Sharpe ratio and a decrease in variance when including the function penalty term. Moreover, for each process, we performed a two-sample t-test to determine whether we have high confidence in the improvement in performance. The differences in Sharpe ratios are indeed highly statistically significant, with t-statistics of 135 and 139, respectively. Another interesting observation is that with DRL the agents were able to converge within 10 epochs, which is equivalent to 10,000 timesteps, and the fastest took only 4,000 steps, much faster than the tabular Q-learning agent, which took about one million training steps.

6. Conclusions.

The main contribution of this paper is that we show how to apply DRL to quantitative financial problems and how to incorporate domain knowledge to assist the training in finding a better optimum. We provide a proof of concept in a controlled numerical simulation which permits an approximate arbitrage, and we verify that the DRL agent finds and exploits this arbitrage.

Again, note that although we only evaluated our results on two specific price processes and used a specific cost model, the DRL agent did not know that there was mean reversion in asset prices, nor did it know anything about the cost of trading. Therefore it could indeed learn other price and cost models with little extra tuning. You can find more code samples and other function properties being explored in the GitHub repository.

Before concluding, we leave open two avenues to look into further:
1) With simple function properties, we did not notice any increase in training time, but for more convoluted function properties we expect the training time to grow significantly, which may hinder the whole training process. One potential solution is to build the function property directly into the network structure (we already have some attempts at https://github.com/sophiagu/stable-baselines-tf2/blob/master/common/policies.py).
2) Right now we only train the agent in a purely simulated environment; it would be interesting to see how it performs in the real market. One potential challenge is that we will not have as much real-world data as simulated data, but we can train on simulated data first to obtain a good initialization of the model weights and then continue training on the limited market data.

Acknowledgments. We are grateful to Oriol Vinyals for discussions about DRL, and to Petter Kolm for sponsoring NYU HPC.
REFERENCES

[1] C. Y. Huang, Financial trading as a game: A deep reinforcement learning approach, arXiv:1807.02787, 2018.
[2] P. N. Kolm and G. Ritter, Modern perspectives on reinforcement learning in finance, SSRN, 2019.
[3] A. Lipton and M. L. de Prado, A closed-form solution for optimal mean-reverting trading strategies, SSRN, 2020.
[4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, arXiv:1602.01783, 2016.
[5] G. Ritter, Machine learning for trading, SSRN, 2017.
[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, 1986.
[7] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv:1707.06347, 2017.
[8] J. Sill and Y. Abu-Mostafa, in Advances in Neural Information Processing Systems, vol. 9, MIT Press, 1997.
[9] R. Sutton, Learning to predict by the method of temporal differences, Machine Learning, 1988, https://doi.org/10.1007/BF00115009.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2nd ed., 2018.
[11] C. Szepesvari, Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, 2009.
[12]