ISSN 1064-2269, Journal of Communications Technology and Electronics, 2019, Vol. 64, No. 12, pp. 1450–1457. © Pleiades Publishing, Inc., 2019. Russian Text © The Author(s), 2019, published in Informatsionnye Protsessy, 2019, Vol. 19, No. 2, pp. 122–131.
Using Reinforcement Learning in the Algorithmic Trading Problem
E. S. Ponomarev a,*, I. V. Oseledets a,b, and A. S. Cichocki a
a Skolkovo Institute of Science and Technology, Moscow, Russia
b Marchuk Institute of Numerical Mathematics, Russian Academy of Sciences, Moscow, Russia
*e-mail: [email protected]
Received June 10, 2019; revised June 10, 2019; accepted June 26, 2019
Abstract—The development of reinforcement learning methods has extended their application to many areas, including algorithmic trading. In this paper, trading on the stock exchange is interpreted as a game with a Markov property consisting of states, actions, and rewards. A system for trading a fixed volume of a financial instrument is proposed and experimentally tested; it is based on the asynchronous advantage actor-critic method and uses several neural network architectures. The application of recurrent layers in this approach is investigated. The experiments were performed on real anonymized data. The best architecture demonstrated a trading strategy for the RTS Index futures (MOEX:RTSI) with a profitability of 66% per annum accounting for commission. The project source code is available at http://github.com/evgps/a3c_trading.
Keywords: algorithmic trading, reinforcement learning, neural network, recurrent neural networks
DOI:
1. INTRODUCTION

The algorithmic trading considered in this paper consists in designing a control system capable of buying or selling a fixed volume of a financial instrument on the stock exchange. The algorithm is intended to maximize the cost of the total portfolio or, in other words, the profit. As the financial instrument, we considered RTS Index futures in our project; the data for the experimental part were obtained from a large Russian exchange. Commission is taken into account according to the prices of this exchange for futures trading.

The design of the algorithm is based on reinforcement learning, since this is most appropriate for problems with delayed reward. In contrast to supervised learning, it does not require creating rules under which a certain action must be considered true with a certain weight, and it allows using metrics calculated for each strategy over long time intervals, for instance, the Sharpe ratio [4]. We use a modified asynchronous advantage actor-critic algorithm [12] in our work. As the approximation function, we studied several artificial neural network (ANN) architectures. We show the dependence of the results on variation in the network depth and the number of parameters (neurons) and on the introduction of a recurrent layer, namely, long short-term memory (LSTM) [2]. Data in the form of anonymized bids were aggregated empirically into a vector of attributes. An action is chosen once per sixty seconds.

Note that the applicability of all developed methods was confirmed experimentally on real data. We detail the principal results in the next sections.

Today, there are multiple reinforcement learning algorithms [5], and some of them have been applied in algorithmic trading, for instance, Q-learning [6], deep Q-learning [1, 7], recurrent reinforcement learning and policy gradient methods [6, 8, 9], REINFORCE [10], and other actor-critic methods [5, 11].
However, this research area is rapidly developing, and new algorithms continue to appear.

In this work we construct an environment [3] typical for a reinforcement learning problem, which consists of states, a set of possible actions, and a reward function. In addition, we assume that this process has the Markov property. As the learning method we applied the asynchronous advantage actor-critic (A3C) algorithm [12], which has shown efficiency on many datasets, including Atari 2600 [13]. We did not find an application of this algorithm to the exchange trading problem in the literature. In the process of improving the algorithm, we searched for the artificial neural network architecture lying at the basis of the method, including the study of recurrent neural networks and several other optimizations. The algorithm was trained and tested on real anonymized data on the trading of the RTS Index at the Moscow Exchange (MOEX:RTSI).

Notable results of the work:

(1) We studied the application of a reinforcement learning algorithm based on a deep neural network.

(2) The modern asynchronous advantage actor-critic (A3C) algorithm was applied to these tasks.

(3) We searched for the artificial neural network architecture.
MATHEMATICAL MODELS AND COMPUTATIONAL METHODS
(4) The different architectures of the method were experimentally probed and compared. The best demonstrated a stable winning strategy with a profitability of 66% per annum over six months of testing (accounting for exchange commission).

2. PRELIMINARY INFORMATION
Consider the exchange trading process in more detail. Let us fix a financial instrument (futures on the RTS Index) and its possible volume, and let us agree to make a decision to buy or sell once per time quantum, the minute. Thus, we obtain a system where at each step we need to choose one of three actions, the desired position. The position may be long, when we possess the futures; neutral, when we turn all assets into money; or short, when we borrow a fixed volume under the obligation to buy it back later at the future price. In addition, new data become available at each step, which may be aggregated in the form of a state vector.

We prescribe the environment as the Markov decision process [14]: M = {S, A, P, γ, R}, where

(i) S ⊂ R^m is the space of observed states. At each step the exchange–agent system is in a state s_t ∈ S. In this work s_t is constructively represented either as an aggregation of current bids or as an internal state of the LSTM memory cell. In the latter case, we may hope to procure larger informative value.

(ii) A = {−1, 0, 1} is the space of actions. In the trading problem, the action is the desired position: long (keep a unit of the instrument, 1), neutral (cash out, 0), or short (borrow a unit of the instrument, −1). At each step we choose the action a_t ∈ A from the developed policy π(a|s), that is, the probability of choosing a in state s.

(iii) P(s'|s, a) is the transition probability of the assumed Markov process.

(iv) R(s, a) is the reward function. At each step the agent receives a reward that depends not only on the current but also on the previous actions: r_t = R(s_t, a_t).

(v) γ ∈ [0, 1] is the decay multiplier with which future rewards are summed into the total reward for an action: R_t = Σ_{i=0}^{T} γ^i r_{t+i}.
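The decision process above can be sketched as a minimal environment. This is an illustrative stand-in, not the paper's code: the class name `TradingEnv`, the price-series input, and the placeholder state are our assumptions.

```python
class TradingEnv:
    """Minimal sketch of the trading MDP: one step per minute,
    actions A = {-1, 0, 1} are the desired position."""

    def __init__(self, prices, fee=2.5):
        self.prices = prices   # minute prices of the instrument
        self.fee = fee         # commission paid when the position changes
        self.t = 0
        self.position = 0      # start neutral (all assets in cash)

    def step(self, action):
        """Apply the desired position; return (state, reward, done)."""
        assert action in (-1, 0, 1)
        # commission is charged only when the position changes
        reward = -self.fee if action != self.position else 0.0
        self.position = action
        # portfolio change over the next minute for the held position
        reward += self.position * (self.prices[self.t + 1] - self.prices[self.t])
        self.t += 1
        done = self.t >= len(self.prices) - 1
        state = (self.prices[self.t], self.position)  # placeholder attribute vector
        return state, reward, done
```

For example, entering a long position before a 10-point rise yields a reward of 10 − 2.5 = 7.5 after the commission.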
Statement 1. The task of the algorithm is to find the strategy π: S → A that maximizes the mathematical expectation of reward ρ_π:

ρ_π = ρ(θ) = E_{τ∼π_θ}[R(τ)] = ∫ p(τ|π_θ) R(τ) dτ → max,

where the track τ = {s_t, a_t}_{t=0}^{T} is the realization of one game/episode (Fig. 1) and R(τ) is the total reward.

As the reward we use its estimate, the function R(s, a) of action a and state s:

R(s, a) ≈ V^π(s) + A^π(s, a),

where A^π(s, a) is the advantage function, the value characterizing how much the chosen action a is better than the average estimate of the utility of state s:

A^π(s, a) = R(s, a) − V^π(s),

and V^π(s) is the estimate of the cost function of state s for the policy π: S → A:

V^π(s) = E_π[R | s].

Thus, the architecture of the actor determines which action a = π(s) to choose. The loss function for this subnetwork is linearly proportional to the advantage function A(s_t, a_t) = [r_t + γV^π(s_{t+1})] − V(s_t). The gradient of the loss function for this part of the network equals (the derivation may be found in [12])

Actor: ∇_θ J(θ) ≈ E_π[∇_θ log π_θ(a|s) A^π(s, a)].

The architecture of the critic predicts which reward is anticipated in this state without dividing the estimates into actions: V^π(s_t) = E_π[R_t(π)]. The loss function for the critic is the error in the prediction of V^π(s_t):

Critic: L_v = Σ_{i=0}^{T} (V_i^target − V_i)^2, where V_i^target = Σ_{k=i}^{T} γ^{k−i} r_k.

In the problem considered, P and R are unknown, and the instantaneous reward function reflects the variation in the portfolio cost and takes into account the commission from the buy/sell of an instrument. This means that, if c_t is the cost of all assets of the agent at step t (the amount of money from selling everything without regard for commission), then

r_t = c_t − c_{t−1} − fee · I[a_t ≠ a_{t−1}].

The entire network is trained with the error backpropagation method. The loss function for the entire network is common and is a linear combination of the loss functions for the actor and the critic (parameter α ∈ [0, 1]):

L = α Σ_i (V_i^target − V_i)^2 − Σ_i log π(a_i|s_i) A_i.    (1)

As a result, the algorithm attempts to maximize the cumulative discounted reward

R_t = Σ_{i=0}^{T} γ^i r_{t+i},

where r_{t+i} is the variation in the portfolio for a step. The number of steps is T ≈ 200, because the majority of terms make no considerable contribution to the estimate (γ^T → 0).

Fig. 1. τ is the red track, a is the action, s is the state, and r is the reward.

Fig. 2. Artificial neural network determining the cost function V(s) = V(s|θ) and the policy (actor) function π(a|s) = π(a|s, θ).

Fig. 3. Scheme of the training and testing system.
architecture is a key task and is discussed in detail in Section 4.

The training of the neural network is performed every N_steps = 200 steps. N_worker is the number of parallel processes, which was fixed in this work at 10. Each epoch of training was conducted on the three-month data containing 50000 one-minute steps. Convergence requires approximately 1000 epochs for a step size of approximately 10^−3.

To accelerate system testing, we implemented a testing subsystem without training and without writing to the history. In testing, we replace the probabilistic approach with the argmax function. We also switch off dropout in testing.

4. NUMERICAL RESULTS

We performed numerical experiments on real historical data (the complete log of the anonymized bids for the RTSI). As the training set we used the bids from September 15, 2015 to December 15, 2015. We carried out the test over the following six months: from December 15, 2015 to June 15, 2016.

The commission for a buy or sell was fixed at 1.25 rubles per operation, that is, 2.50 rubles for each transaction. It is clear that the strategy must be more profitable than 2.50 rubles per trade. To increase the profitability of each transaction, we artificially increased the commission in the reward model. In addition, we changed the reward function to avoid falling into the buy-and-hold trap (Fig. 4), which means no active trading. To do this, we introduced a penalty for the long repetition of an action.

The development of the neural network architecture is important. As the initial point we chose the simplest artificial neural network with a single common hidden layer, a linear function V, and a linear layer with softmax activation for choosing the action π(s). As hypotheses for improving the method quality, we made the following assumptions:

Assumption 1. Using a different reward function.

Assumption 2. Introducing a recurrent layer (LSTM).

Assumption 3. Adding a dropout layer.

Assumption 4. Increasing the number of neurons in the hidden layers.

Assumption 5. Using a more complicated architecture of the cost function.

Assumption 6. Combining the attributes for several minutes into a common vector.

Following these assumptions, we designed several architectures from combinations of the same layers, but with different parameters, including the absence of a layer (Table 1 and Fig. 2). In Table 1 we used the following denotations:

(1) Depth is the number of serially connected vectors of attributes used as the input vector.

(2) Dense is the number of neurons in the fully connected first layer (for instance, 128) or the absence of this layer (–).

(3) Dropout is the probability of dropout (for instance, 0.5) or its absence (–).

(4) LSTM is the number of neurons in the LSTM layer connected with the first layer (for instance, 64) or the absence of this layer (–).

(5) Dense V is the number of neurons in the fully connected layer preceding the output linear critic layer (V(s)).

Fig. 4.
Training curve of the algorithm. Falling into the extremum corresponding to the buy-and-hold strategy is marked.
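The buy-and-hold extremum marked in Fig. 4 was countered by penalizing long repetition of an action. The paper does not give the exact penalty, so the shape and the constants below (`penalty`, `max_repeat`) are purely illustrative assumptions:

```python
def shaped_reward(base_reward, action, prev_actions, penalty=0.5, max_repeat=30):
    """Subtract a fixed penalty once the same action has been repeated
    for max_repeat consecutive steps (illustrative penalty shape)."""
    run = 0
    for a in reversed(prev_actions):
        if a != action:
            break
        run += 1
    return base_reward - penalty if run >= max_repeat else base_reward
```

A different penalty shape (for example, one growing with the run length) would serve the same purpose of pushing the agent out of the no-trading extremum.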
Table 1. Architectures

Name     | Depth | Dense | Dropout | LSTM | Dense V | Dense A
5        | 6     | –     | 0.5     | 64   | –       | –
8        | 6     | –     | 0.5     | 128  | –       | –
5coolV   | 6     | –     | 0.5     | 64   | 32      | –
9        | 1     | –     | 0.5     | 64   | 32      | –
12       | 1     | –     | 0.5     | 64   | 32      | 32
5noLSTM  | 20    | –     | –       | –    | –       | –
6        | 6     | 128   | –       | 128  | –       | –
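The layer sizes in Table 1 translate into weight counts by the standard formulas for dense and LSTM layers. The input width m of the attribute vector below is an assumed value for illustration:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, n_hidden):
    """Four gates, each with input and recurrent weights and a bias."""
    return 4 * ((n_in + n_hidden) * n_hidden + n_hidden)

m = 32  # assumed width of the attribute vector
print(lstm_params(m, 64))    # LSTM layer of architecture 5: 24832
print(lstm_params(m, 128))   # LSTM layer of architecture 8: 82432
```

Doubling the LSTM width from 64 to 128 roughly triples the parameter count here, which is part of what the 5-versus-8 comparison probes.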
(6) Dense A is the number of neurons in the fully connected layer preceding the output softmax actor layer (π(a|s)).

To study how the results depend on the presence of the dropout layer, we chose the two architectures named 6 and 8. From the test results (Table 3), dropout significantly improves the situation. To compare the dependence of the result on the number of neurons in the LSTM layer, we considered the 8 and 5 architectures, and, to compare the dependence on the complexity of the cost function approximator, we used the 5 and 5coolV architectures. To check how the results depend
Fig. 5. Comparison of the 8 and 5coolV models: effect of linear approximation of the cost function of state V(s) (left) against a two-layer network with tanh activation at the first layer (right). The remaining parameters are identical (Table 1).

Table 2. Result of execution on three-month test data

Name | Profit % per annum | Profit % per annum (commiss.) | Sharpe ratio | Fraction of winning transactions | Average transaction, rubles
on the number of serially connected vectors of attributes used as the input vector, we considered the 5coolV and 9 architectures (the training and testing curves are depicted in Figs. 5–7).

The main difficulty in the experimental optimization of the architecture is the training time of a single model. This varies depending on the server workload and in general amounts to tens of hours. The choice of the training speed is also important for the optimization and convergence of the algorithm [17].

Below, we present the tables with the economic metrics important for decision making on the
Fig. 6. Comparison of the 5 and 8 models: effect of 64 neurons in the LSTM layer (left) against 128 neurons (right). The remaining parameters are identical (Table 1).

Table 3. Result of execution on six-month test data

Name | Profit % per annum | Profit % per annum (commiss.) | Sharpe ratio | Fraction of winning transactions | Average transaction, rubles
5    | 8.8                | 5.2                           | 0.29
Fig. 7. Comparison of the 5coolV and 9 models: effect of the combination of attributes for six steps into a common vector (left) against the use of attributes for a single step (right). The remaining parameters are identical (Table 1).

Table 4. Result of execution on three-month training data

Name | Profit % per annum | Profit % per annum (commiss.) | Sharpe ratio | Fraction of winning transactions | Average transaction, rubles

investment attractiveness of the algorithm, reflecting profitability and risk. In Table 2 we use the following denotations:

(1) Profit % per annum is the profitability in percent per annum, that is, (profit/begin_price) · (365/number_of_days), where number_of_days = 90 for three months and 180 for six months, and begin_price is the cost of a financial instrument at the beginning of the trading.

(2) Profit % per annum (commiss.) is the profitability in percent per annum accounting for the commission, that is, ((profit − n_trades · fee)/begin_price) · (365/number_of_days), where fee = 2.5 rubles.
(3) Sharpe ratio is the Sharpe ratio E(profit)/σ(profit), which is the ratio of profitability to variability in both directions.

(4) Average transaction, rubles is the average profit per transaction. This criterion is crucial in consideration of commissions.

5. CONCLUSIONS

This work has resulted mainly in the creation of an exchange trading algorithm based on the advantage actor-critic method, which is potentially profitable and attractive for investment from an economic point of view. Thus, the best architecture achieves a profitability of 110% per annum not accounting for commission, or 66% per annum accounting for a commission of 2.5 rubles per transaction on the RTS futures (computed on six months of historical 2016 data).

During algorithm optimization, we experimentally verified several hypotheses, which allowed significant improvement of the method characteristics and gives a view on the applicability of several ideas:

(i) The use of another reward function is disputable. On the one hand, it helps avoid locking into the local minimum of absence of trading; on the other hand, the goal function of the trader is then not optimized directly.

(ii) The unification of attributes for several minutes into a common vector is wrong.

(iii) The addition of a recurrent layer (LSTM) is correct.

(iv) The addition of a dropout layer is correct.

(v) The increase in the number of neurons in the hidden layers is disputable.

(vi) The use of a neural network with several layers for approximating the cost function is correct.

As a result of the work, we also implemented a convenient environment for future experiments with exchange trading sustained in the same style. We believe that this will allow easy experimentation with various methods for solving this problem. We think that subsequent development of the work must be geared toward optimizing the architecture and applying it to a real trading system.

REFERENCES
1. Y. Deng, Y. Kong, F. Bao, and Q. Dai, "Sparse coding-inspired optimal trading system for HFT industry," IEEE Trans. Ind. Inf., 467−475 (2015).
2. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., 1735−1780 (1997).
3. R. S. Sutton and A. G. Barto, "Reinforcement learning: an introduction," IEEE Trans. Neural Networks, 1054 (1998).
4. W. F. Sharpe, "The Sharpe ratio," J. Portfolio Manag., 49−58 (1994).
5. Y. Li, "Deep reinforcement learning: an overview," Computing Research Repository (CoRR) abs/1701, 1−30 (2017).
6. J. Moody and M. Saffell, "Learning to trade via direct reinforcement," IEEE Trans. Neural Networks, 875−889 (2001).
7. Y. Deng, F. Bao, Y. Kong, Z. Ren, and D. Q. Dai, "Direct reinforcement learning for financial signal representation and trading," IEEE Trans. Neural Networks Learn. Syst., 653−664 (2017).
8. J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," J. Forecasting (56), 441−470 (1998).
9. X. Du, J. Zhai, and K. Lv, "Algorithm trading using Q-learning and recurrent reinforcement learning," 1 (2009).
10. R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learn., 229−256 (1992).
11. S. D. Bekiros, "Heterogeneous trading strategies with adaptive fuzzy actor-critic reinforcement learning: a behavioral approach," J. Econ. Dyn. Control, 1153−1170 (2010).
12. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. on Machine Learning, New York City, USA, June 19−24, 2016, pp. 1928−1937.
13. Y. Zhan, H. B. Ammar, and M. E. Taylor, "Theoretically-grounded policy advice from multiple teachers in reinforcement learning settings with applications to negative transfer," in Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI), New York, NY, USA, July 9−15 (IJCAI, 2016), pp. 2315−2321.
14. A. A. Markov, "The theory of algorithms," J. Symbolic Logic, 340−341 (1953).
15. R. Bellman, "A Markovian decision process," J. Math. Mech., 679−684 (1957).
16. M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: the RPROP algorithm," in Proc. IEEE Int. Conf. on Neural Networks, San Francisco, Mar. 28−Apr. 1 (IEEE, New York, 1993), pp. 586−591.
17. G. E. Hinton, N. Srivastava, and K. Swersky, "RMSProp: Lecture 6a, overview of mini-batch gradient descent," COURSERA: Neural Networks for Machine Learning (2012).
18. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. on Operating Systems Design and Implementation (OSDI'16), pp. 265−284.
19. W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," in Proc. Int. Conf. on Learning Representations (ICLR), Banff, Canada, April 14−16 (ICLR, 2014), pp. 1−8.