A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules
Mehran Taghian, Ahmad Asadi, Reza Safabakhsh∗

Abstract
A wide variety of deep reinforcement learning (DRL) models have recently been proposed to learn profitable investment strategies. The rules learned by these models outperform previous strategies, especially in high frequency trading environments. However, it has been shown that the quality of the features extracted from a long-term sequence of raw instrument prices greatly affects the performance of the trading rules learned by these models. Employing a neural encoder-decoder structure to extract informative features from complex input time-series has proved very effective in other popular tasks like neural machine translation and video captioning, in which the models face a similar problem. The encoder-decoder framework extracts highly informative features from a long sequence of prices while learning how to generate outputs based on the extracted features. In this paper, a novel end-to-end model based on the neural encoder-decoder framework combined with DRL is proposed to learn single-instrument trading strategies from a long sequence of raw prices of the instrument. The proposed model consists of an encoder, which is a neural structure responsible for learning informative features from the input sequence, and a decoder, which is a DRL model responsible for learning profitable strategies based on the features extracted by the encoder. The parameters of the encoder and the decoder structures are learned jointly, which enables the encoder to extract features fitted to the task of the decoder DRL. In addition, the effects of different structures for the encoder and various forms of the input sequences on the performance of the learned strategies are investigated. Experimental results show that the proposed model outperforms other state-of-the-art models in highly dynamic environments.

∗ Corresponding author
Email addresses: [email protected] (Mehran Taghian), [email protected] (Ahmad Asadi), [email protected] (Reza Safabakhsh)
Keywords:
Deep reinforcement learning, Deep Q-learning, Single stock trading, Trading strategy, Encoder-decoder framework
1. Introduction
Forming profitable trading strategies fitted to either a single financial instrument or a set of instruments in a specific market, based on vast historical data, is a critical problem for investors. Since the introduction of algorithmic trading [1] and the monitoring of the trading process by computers, especially at high frequency [2], there has been widespread interest in designing powerful models to learn profitable investment strategies.

In recent years, machine learning (ML) models and deep neural networks (DNNs) have been widely used for learning profitable investment strategies in both single asset trading and portfolio management problems [3]. Among the techniques used for learning asset-specific trading rules, genetic programming (GP) and deep reinforcement learning (DRL) methods have been of particular interest to the research community [4].

Genetic programming was widely used to learn technical trading rules for different indices like the S&P 500 index ([5]), to learn appropriate trading rules to benefit from short-term price fluctuations ([6]), to learn noise-tolerant rules based on a large number of technical indicators ([7]), and to learn trading rules based on popular technical indicators like MACD ([8]).

Since genetic programming and its modifications are not able to evolve after task execution, reinforcement learning techniques have been widely combined with evolutionary algorithms to cover this weakness. [9] combined the SARSA algorithm with genetic programming to enable the model to change programs during task execution. [10] modified the GNP-SARSA model proposed by [9] by augmenting new nodes called subroutines to make an appropriate trade-off between the efficiency and compactness of the model.

Considering the great performance of deep reinforcement learning (DRL) models (deep neural networks trained with reinforcement learning techniques) in forming investment strategies, the proposed methods for portfolio management are mainly based on DNN structures and DRL techniques. [11] proposed a very-long short-term memory (VLSTM) network to deal with extremely long sequences in financial markets and explored the importance of VLSTM in the context of high frequency trading. [12] proposed a DNN structure to forecast the next one-minute average price of an instrument given its current time and n lagged one-minute pseudo-returns, building a trading strategy that buys (sells) when the next predicted average price is above (below) the last closing price. [13] proposed a DNN model to learn the spatio-temporal structure of the input and developed a classification rule to predict short-term futures market prices using order book depth.

[14] proposed a DRL technique for forecasting the short-term trend in the currency FOREX (FOReign EXchange) market to maximize the return on investment in an HFT algorithm. Jiang et al. [15] proposed a financial model-free reinforcement learning framework to provide a deep learning solution to the portfolio management problem. For single stock trading, Wang et al. [16] employed deep Q-learning to build an end-to-end deep Q-trading system for learning trading strategies.

[4] studied DRL performance in learning single asset-specific trading rules and concluded that: 1) the quality of the features extracted from the input can greatly affect the performance of the strategy learned by DRL models, and 2) a good feature extractor applied to a long-term historical price sequence would clearly improve the profitability of the resulting trading strategy.
Considering the results from [4], a model that learns good features from a long-term price sequence would effectively improve the performance of DRL models. Accordingly, we combined the encoder-decoder framework with DRL techniques and propose an end-to-end model that learns informative features from a long-term price sequence of a specific financial instrument and learns a profitable trading strategy from them. The encoder-decoder framework is one of the state-of-the-art neural structures applied in tasks requiring complex feature representations, especially when the input is presented in the form of a long-term time-series [17].

In this paper, we first develop a DRL agent based on a deep Q-learning algorithm to generate trading signals given a sequence of OHLC prices of each instrument. Then, we design and implement an encoder-decoder based model to improve the agent's feature extraction performance. In addition, we examine the performance of different DNN structures for the encoder module. The time-series of candlesticks and the raw OHLC input types are evaluated, and the performance of the models is tested on various stocks with different behavior. Furthermore, the influence of the window size on the agent's performance for the windowed input type is studied. Experimental results show that our model outperforms state-of-the-art methods.

In the next section of this paper, we briefly review the related work on learning financial asset-specific trading strategies and discuss the advantages and disadvantages of different categories of the proposed methods. Section 3 discusses the model proposed in this paper. The model consists of an encoder part for feature extraction and a decoder part for decision making; the details of the architecture of both parts are given in Section 3. Section 4 provides the experimental results, and the conclusions are presented in Section 5.
2. Related Work

Many researchers have proposed methods based on reinforcement learning for determining trading strategies. [18] first applied RL in portfolio management and proposed a method based on recurrent reinforcement learning for financial transactions. [19] proposed a method based on the RL framework which incorporates stock selection and asset management. [20] combined time-series prediction with the decision making power of RL: they first predict future prices using a CNN, and then feed the output to a policy gradient model along with historical data to support trading decisions.

[21] first incorporated deep neural networks to learn a policy directly from high dimensional sensory inputs. The proposed method, termed deep Q-learning, successfully played seven different Atari games, outperforming the human level in three of them. Considering the successful performance of deep reinforcement learning in playing Atari games, researchers have carried out many studies to apply DRL methods to the stock market environment.

[22] applied two different CNN based function approximators with an actor-critic RL algorithm called "Deep Deterministic Policy Gradient" (DDPG) to find the optimal policy. The proposed DDPG has two different convolutional neural network (CNN) function approximators. The input state to the model is 18 different technical indicators converted to multiple channels of 1D images fed to the CNN model.

[23] proposed a novel RL based investment strategy consisting of three phases: 1) extracting asset representations from multiple time-series using a Long Short Term Memory with History Attention (LSTM-HA) network, 2) modeling the interrelationships among assets as well as the asset price rising prior using a Cross-Asset Attention Network (CAAN), and 3) generating the portfolio and giving the investment proportion of each asset according to the output winner scores of the attention network. The three components are optimized end-to-end using Sharpe ratio oriented RL.

[24] explored the ability of the Deep Deterministic Policy Gradient to learn stock trading strategies. [25] proposed a method using the Q-learning algorithm to find the optimal dynamic trading strategy. They introduced two models that differ in their representation of the environment: the first represents environment states using a finite set of clusters, while the second uses the candlesticks themselves as the states of the environment.

[26] presented a solution to the algorithmic trading problem of generating a trading strategy for a single stock based on the DQN algorithm in a Sharpe ratio oriented manner. [27] used co-integrated stock market prices and incorporated DQN to generate a pairs trading strategy.

Following the methods attempting to apply RL and DRL techniques to generate stock market trading strategies and portfolio management, some methods tried to improve various parts of the agent and the environment to improve the RL agent's performance.

[28] used the McShane-Whitney extension of the Lipschitz function ([29]) to forecast the reward function based on its previous values. In addition, to support the extension of the reward function, all previous states, along with some artificially generated states called dream states, were combined to enrich learning.

[30] proposed an approach based on deep Q-learning for deriving a multi-asset portfolio trading strategy.
Instead of using a discrete action space, which might lead to infeasible actions, they introduced a mapping function to map unreasonable and infeasible actions to similar valuable actions, making the trading strategy more reasonable in the practical action space. In addition, the dimensionality problem of the action space is handled by using deep Q-learning.

Considering the temporal nature of stock market data, some studies have combined the temporal feature extraction power of recurrent neural networks with DRL's decision-making ability. [31] applied the Gated Recurrent Unit (GRU) to exploit informative features from raw financial data along with technical indicators to represent stock market conditions more robustly. They then designed a risk-adjusted reward function using the Sortino ratio proposed by [32]. Based on the designed state, action, and reward functions, they proposed Deep Q-Learning and Deep Deterministic Policy Gradient for quantitative stock trading.

[33] applied DRL to portfolio management and, in order to distinguish the critical times when the price changes, proposed a three-dimensional attention gating network that gives higher weights to rising moments and assets. Moreover, they applied the XGBoost method to quantify the importance of features and fed the three most relevant features from the historical data to the model: close price, high price, and low price.

[34] compared different RL agents for trading financial indices in a personal retirement portfolio. The comparison included on-policy SARSA(λ) and off-policy Q(λ) with discrete state and discrete action settings that maximize either total return or differential Sharpe ratios, on-policy temporal difference learning, and TD(λ) with discrete state and continuous action settings. They showed that an adaptive continuous action agent has the best performance in predicting next period portfolio allocations.

[35] incorporated a particle swarm optimization algorithm to optimize the portfolio. In addition, they developed a recurrent reinforcement learning based method for portfolio allocation and trading.

In our former work ([4]), we studied the performance of strategies based on candlestick patterns, the SARSA(λ) algorithm, and deep Q-learning, and concluded that methods based on deep reinforcement learning can generate more adaptive trading strategies specific to each asset.

The encoder-decoder framework is one of the most popular neural structures used to solve complex problems in an end-to-end manner ([17]). This architecture was first proposed for neural machine translation by [36]. It consists of two sequential parts: 1) an encoder which is responsible for learning a good feature extraction from the input, and 2) a decoder which is responsible for generating the appropriate output based on the extracted feature vector.

Since employing the encoder-decoder framework in neural machine translation has significantly improved the performance of the models, it is widely used in other complex tasks like image captioning (see [37] and [38] for more details) and video captioning (see [39] for more details).

In this work, we focus on the following issues:

1. Proposing a model based on the encoder-decoder architecture where the decoder is a DRL agent trained to generate a trading strategy based on the representation of the market produced by the encoder. The encoder is a neural network structure responsible for extracting features from the raw input time-series data and generating the feature vector.
2. Showing the importance of extracting the time dependencies of the input prices, and proposing different encoder structures to improve the quality of the extracted time dependencies.

3. Investigating the impact of the window size on extracting important features and generating a proper representation of the market.

The rest of this paper is structured as follows: first, we introduce the details of the proposed method, the deep Q-learning method, the model architecture, and the DNN models used as encoders. Then, the performance of the different encoder models is evaluated using the methods explained in detail in Section 4. Finally, the results of the experiments are analyzed.
3. Proposed Method
In financial markets, a candlestick is used to represent the price fluctuations in a short time period, originating from Japanese rice traders and merchants who used candlesticks to track market prices [40]. A candlestick consists of 4 price elements, namely high (the highest stock price during a period, e.g., a day), low (the lowest price), open (the stock price at the beginning of the period), and close (the stock price at the end of the period), abbreviated to OHLC. A candlestick's color can be either green/white, representing a bullish candle (the open price is lower than the close price), or red/black, representing a bearish candle (the close price is lower than the open price). Figure 1 shows a sample candlestick. Equation (1) shows the vector representing a candlestick, consisting of the Open, High, Low, and Close prices:

c_t = (p_{open}, p_{high}, p_{low}, p_{close})   (1)

Figure 1: A candlestick representing the price behavior of an asset during a specific time window [4]

A candlestick chart is used to demonstrate the behavior of the asset price. Patterns in this chart reflect the buyers' and sellers' behavior and their influence on the market. Thus, these patterns can be used to analyze price fluctuations and to devise trading strategies for a financial asset.

The proposed model is based on the encoder-decoder framework, which consists of the following modules:

1. Encoder: the encoder is the first neural structure; it takes the input of the model and learns a mapping from the input space to the feature space which minimizes the decoder's loss function.

2. Decoder: the decoder is the second neural structure in the encoder-decoder framework; it takes the features extracted by the encoder for each input record and generates the appropriate output based on the input feature vector. The gradients of the decoder are back-propagated to the encoder, training its weights along with the weights of the decoder during the training phase.

In the proposed model, the decoder part is the target/policy network used in the deep Q-learning based model proposed by [4] to learn trading strategies. The encoder part is a deep neural network applied to extract deep features from the candlestick chart representations. These features are categorized into two groups:

1. Features directly learned from candlestick representations or raw OHLC data.

2. Features representing the temporal relationships among a sequence of candlesticks inside a time window.

For each category, neural network models exist that can efficiently extract those features according to the policy network's performance in trading. The encoder part extracts features from the input and provides a state vector (feature vector) for the decoder module, a deep Q-learning agent that uses this state vector as the state of the environment to produce trading signals. Based on the rewards given to the DRL agent, the model is optimized towards producing higher profits. This optimization is done in an end-to-end fashion, back-propagating the error from the decoder part to the encoder module. As a result, the encoder extracts features based on the trading performance of the DRL agent. The model architecture is shown in Figure 2.
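As a concrete illustration of the candle representation in (1), the following minimal Python sketch (the class and field names are ours, not the authors') builds an OHLC vector and classifies a candle as bullish or bearish:

```python
from dataclasses import dataclass

@dataclass
class Candle:
    """One OHLC candlestick for a single period (e.g., one day); see Eq. (1)."""
    open: float
    high: float
    low: float
    close: float

    def as_vector(self):
        # c_t = (p_open, p_high, p_low, p_close)
        return [self.open, self.high, self.low, self.close]

    @property
    def bullish(self) -> bool:
        # green/white candle: close above open; red/black otherwise
        return self.close > self.open

# Example: a bullish daily candle
c_t = Candle(open=100.0, high=104.5, low=99.2, close=103.1)
print(c_t.as_vector(), c_t.bullish)   # [100.0, 104.5, 99.2, 103.1] True
```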
The decoder module is the trading agent of our model and learns to produce trading signals. Its architecture is based on the deep Q-learning algorithm proposed by [21] to play Atari games. The module's input is a state vector containing the features of the market at time step t. These features can be either the vanilla input of OHLC prices or the feature vector produced by a feature extractor model. We denote the vector space of input candles by C = {c_1, c_2, ..., c_T}, where T is the final time step and c_i is the candle representation for a given time interval (here daily). The feature extractor module φ takes the vector space of candles as input and generates the vector space of states S = {s_1, s_2, ..., s_T}:

φ(C) = S   (2)

Figure 2: In this architecture, the state is given by the environment at each time step, and the agent takes an action according to the state and receives the reward and the next state. The quadruples (CurrentState, Action, Reward, NextState) are saved in the replay memory. For the optimization step in each iteration, a batch of quadruples (CurrentState, Action, Reward, NextState) is sampled for training. The encoder part's output is shared between the policy and target networks and is optimized in each iteration. The replay memory has a fixed capacity; once it is full, a random quadruple is replaced by each new quadruple.

Equation (2) defines φ, the feature extraction function. This function can be either an identity function (the state space is the candlesticks themselves) or a neural network that extracts deep features from the input vector space of candlesticks and produces a vector space of features. Each vector s_t denotes the state of the environment at time step t. This state is given to the DQN agent to use in taking an action. The action space of the DQN agent is A = {'buy', 'sell', 'noop'}, which are the signals of the trading strategy at each time step. After taking an action, the agent is given a reward based on the signal produced. The rewards of buying and selling differ slightly. Equation (3) shows the reward function used by the environment, where TC is the transaction cost and the ownShare flag, used alongside the 'noop' action, indicates whether the money is already invested in the market or not:

R_t = ((1 − TC) × p^{close}_{t+1} / p^{close}_t − 1) × 100,   if action = 'buy' or (action = 'noop' and ownShare = True)
R_t = ((1 − TC) × p^{close}_t / p^{close}_{t+1} − 1) × 100,   if action = 'sell' or (action = 'noop' and ownShare = False)   (3)

Reinforcement learning is a framework for learning sequential decision tasks. In general, the RL agent interacts with the environment, observes the state, takes an action according to its policy and the observed state, and gets a reward. Over a sequence of decisions, the agent learns a policy π from the actions taken and the rewards earned in each episode. Afterwards, the agent optimizes its policy to maximize the cumulative reward per episode. For this purpose, we use deep Q-learning, a critic-based reinforcement learning algorithm, which uses the action-value function Q(s, a) denoting the expected cumulative reward in state s when action a is taken. More formally, we use a multi-layer perceptron to approximate the optimal action-value function given in (4).
Q*(s, a) = max_π E[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s, a_t = a, π]   (4)

where γ is the discount factor, r_t is the reward at time step t, π is the behavior policy, s is the observed state, and a is the action taken. The optimal action-value function obeys the Bellman equation:

Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]   (5)

In order to reduce the mean squared error in the Bellman equation, we use two sets of parameters (neural networks). The target values are approximated using the target network weights (from previous iterations), denoted θ⁻_i at iteration i. The policy network, which is trained in each iteration to reduce the mean squared error, approximates the Q function using parameters θ_i at iteration i. Thus, we have a sequence of loss functions L_i(θ_i) that changes at each iteration:

L_i(θ_i) = E_{s,a,r}[(E_{s'}[y | s, a] − Q(s, a; θ_i))²] = E_{s,a,r,s'}[(y − Q(s, a; θ_i))²] + E_{s,a,r}[V_{s'}[y]]   (6)

To further stabilize the deep Q-learning algorithm, we use the Huber loss proposed by [41] instead of the mean squared error, which treats large and small errors more evenly:

Huber(e) = ½ e²  for |e| ≤ 1,  and  |e| − ½  otherwise   (7)

Furthermore, our agent stores the last c experiences in the replay memory. The agent's experience vector e_t = (s_t, a_t, r_t, s_{t+1}) is saved in the experience replay memory D_t = {e_1, e_2, ..., e_c} (where c is the capacity of the replay memory) and used as a batch to optimize the policy network. When the model is updated, it samples a batch of experiences uniformly at random from D. The steps of the deep Q-learning algorithm used in our work are given in Algorithm 1.

Algorithm 1: Deep Q-learning algorithm used for training the agent
    Initialize replay memory D to capacity N
    Initialize action-value function Q with random weights θ
    Initialize target action-value function Q̂ with weights θ⁻ = θ
    for episode = 1 to M do
        Initialize sequence s_1 and preprocessed sequence φ_1 = φ(s_1)
        for t = 1 to T do
            With probability ε select a random action a_t,
            otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
            Execute action a_t and observe reward r_t and state s_{t+1}
            Preprocess φ_{t+1} = φ(s_{t+1})
            Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
            Sample a random mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
            Set y_j = r_j if the episode terminates at step j + 1,
            and y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻) otherwise
            Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
            Every C steps reset Q̂ = Q
        end for
    end for

So far, we have discussed the input data representation, the different parts of the trading agent, and the deep Q-learning algorithm used by the agent to optimize its policy and learn profitable strategies. However, an essential part of any RL algorithm is the representation of the environment. As mentioned before, the environment tells the RL agent which state it is currently in; based on this state, the agent takes actions and receives rewards. A proper state representation can significantly improve the performance of the RL agent. Therefore, in this section, we concentrate on our model's feature extraction and state representation, i.e., the encoder module.

We introduced the φ function which, given the input candlesticks, extracts features and outputs the state space that is then fed to the decoder module (the DQN model).
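The following PyTorch sketch illustrates one update step of the deep Q-learning procedure of Algorithm 1 with the Huber loss of (7). It is a simplified illustration under our own assumptions (network sizes, helper names, the discount factor, and the omission of terminal-state handling and ε-greedy exploration); it is not the authors' exact implementation.

```python
import random
import torch
import torch.nn as nn

GAMMA = 0.99       # assumed discount factor (not stated in the text)
BATCH_SIZE = 10    # mini-batch size used in the experiments
N_ACTIONS = 3      # buy / sell / noop

def build_q_network(state_dim: int) -> nn.Module:
    # Simple MLP approximator for Q(s, .) over the three trading actions.
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))

def dqn_update(policy_net, target_net, optimizer, replay_memory):
    """One gradient step of deep Q-learning with the Huber loss (Eqs. (5)-(7)).

    replay_memory is a list of (state, action, reward, next_state) tensors, with
    actions stored as long tensors; terminal states are ignored for brevity.
    """
    batch = random.sample(replay_memory, BATCH_SIZE)
    states, actions, rewards, next_states = (torch.stack(x) for x in zip(*batch))
    q_sa = policy_net(states).gather(1, actions.view(-1, 1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                                  # target from theta^-
        targets = rewards + GAMMA * target_net(next_states).max(dim=1).values
    loss = nn.SmoothL1Loss()(q_sa, targets)                                # Huber loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```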
The φ function can be either an identity function or a deep neural network. In the first case there is no feature extraction, and the candlesticks are fed directly to the DQN model. The DNNs we use as feature extractors are a Multi-Layer Perceptron (MLP), a Gated Recurrent Unit (GRU) [36], a 1-dimensional convolution in the direction of time (CNN) [42], and a GRU combined with a 1-dimensional convolution in the direction of price (CNN-GRU).

The MLP model extracts features from candlesticks without considering the temporal relationship among candles. In contrast, the other three models not only pay attention to the structure of each candlestick, but also consider the temporal relationships. In this section, we explain the detailed architecture of each feature extractor; their performance as feature extractors for the DQN is compared later.

Before describing each model, we need to explain the different inputs to these models. Our inputs fall into two categories:

1. Vanilla: the OHLC prices without any change. This kind of input contains only the representation of the candle c_t at time step t.

2. Windowed: a series of candlesticks of size w grouped together to form a window of candles W = {c_{t−w}, c_{t−w+1}, ..., c_t} at time step t.

All models use the windowed input type, but the raw OHLC prices are used only for the MLP encoder and for the DQN without an encoder.

The MLP model is a neural network with only one hidden layer. In order to regularize the outputs of the layers, we use Batch Normalization after the hidden layer. The dimensions of the layers are
InputSize × 128, BatchNormalization(128), 128 × FeatureVectorSize. The MLP architecture accepts both input types: if the raw OHLC input is used, then InputSize = 4, and if the input type is windowed, then InputSize is equal to the size of the window.
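A minimal PyTorch sketch of such an MLP encoder is shown below. The hidden size of 128 follows the dimensions above, while the feature vector size and the ReLU activation are illustrative assumptions of ours.

```python
import torch
import torch.nn as nn

class MLPEncoder(nn.Module):
    """One-hidden-layer encoder: InputSize -> 128 -> BatchNorm -> FeatureVectorSize."""
    def __init__(self, input_size: int, feature_size: int = 64):  # feature_size is illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, feature_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_size); vanilla input has input_size = 4 (OHLC),
        # windowed input flattens a window of candles into one vector.
        return self.net(x)

encoder = MLPEncoder(input_size=4)
state = encoder(torch.randn(8, 4))   # batch of 8 vanilla OHLC candles -> (8, 64)
```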
The Gated Recurrent Unit is a recurrent neural network well suited to extracting features from time-series data. This model's input type is the windowed input, which contains a sequence of candles at each time step. The role of the GRU here is to extract features from each candlestick while considering the history within each window. The architecture of the GRU model is shown in Figure 3a.
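A minimal sketch of a GRU encoder of this kind is given below; the hidden (feature) size and the use of the last hidden state as the market state are our assumptions.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Encodes a window of candles; the last hidden state is used as the market state."""
    def __init__(self, candle_dim: int = 4, feature_size: int = 64):  # sizes are illustrative
        super().__init__()
        self.gru = nn.GRU(input_size=candle_dim, hidden_size=feature_size, batch_first=True)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, w, 4) -- a sequence of w OHLC candles
        _, h_n = self.gru(window)     # h_n: (1, batch, feature_size)
        return h_n.squeeze(0)         # (batch, feature_size) fed to the DQN decoder

encoder = GRUEncoder()
state = encoder(torch.randn(8, 15, 4))   # e.g., window size w = 15
```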
Convolutional neural networks have been widely used in image processing to extract deep features from images, and are also applied in signal processing for analyzing signals. A CNN has a kernel which can move in one, two, or more directions and extract features from multi-dimensional data. Here, our input is the OHLC prices windowed to form time-series data. Thus, we have a 2-dimensional input: the first dimension is the price (the candles) and the second is time. The input channel size is the size of each candle vector (i.e., 4 for OHLC), and the kernel, of size 3, moves in the direction of time (i.e., along the window of size w). This architecture is shown in Figure 3b.
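The following sketch illustrates this idea with a single 1D convolution over the time axis; the number of output channels and the final linear projection are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """1D convolution over the time axis; the input channels are the 4 OHLC prices."""
    def __init__(self, window_size: int, feature_size: int = 64):  # feature_size is illustrative
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=3)
        self.fc = nn.Linear(16 * (window_size - 2), feature_size)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, w, 4) -> (batch, 4, w) so the kernel of size 3 slides along time
        x = torch.relu(self.conv(window.transpose(1, 2)))
        return self.fc(x.flatten(start_dim=1))

encoder = CNNEncoder(window_size=15)
state = encoder(torch.randn(8, 15, 4))   # (8, 64)
```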
The combination of the CNN and GRU models is proposed here, where the CNN kernel moves in the direction of the candle prices and extracts candlestick features, outputting a sequence of features from the windowed input to the GRU model. The GRU is responsible for the temporal behavior of the input: it takes the candlestick features in sequence from the CNN and extracts temporal features. The details of this architecture are shown in Figure 3c.
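A minimal sketch of such a CNN-GRU encoder is shown below; the per-candle convolution settings and feature sizes are illustrative assumptions of ours.

```python
import torch
import torch.nn as nn

class CNNGRUEncoder(nn.Module):
    """Convolution over each candle's prices, then a GRU over the resulting sequence."""
    def __init__(self, feature_size: int = 64):  # channel/feature sizes are illustrative
        super().__init__()
        # the kernel slides along the 4 OHLC prices of a single candle
        self.conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=2)
        self.gru = nn.GRU(input_size=8 * 3, hidden_size=feature_size, batch_first=True)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        b, w, p = window.shape                       # (batch, w, 4)
        x = self.conv(window.reshape(b * w, 1, p))   # per-candle features: (b*w, 8, 3)
        x = x.reshape(b, w, -1)                      # sequence of candle features
        _, h_n = self.gru(x)
        return h_n.squeeze(0)                        # (batch, feature_size)

state = CNNGRUEncoder()(torch.randn(8, 15, 4))       # (8, 64)
```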
Figure 3: Architecture of the models proposed as the encoder: (a) GRU, (b) CNN, (c) CNN-GRU.

4. Experimental Results

All the models are tested on real-world financial data, including stocks and a crypto-currency. The data are chosen to cover different behaviors, such as bullish trends, bearish trends, and sideways markets. The data lengths are 20 years with the last five years as test (trading) data, ten years with the last two years as test data, and six years (BTC/USD) with the last two years as test data. The candlestick interval in all data is daily. All the data used in this work are available on
Yahoo Finance and
Google Finance. A summary of the datasets is given in Table 1.
Table 1: Data used along with train-test split dates
Data      Begin Date   Split Point   End Date
GOOGL     2010/01/01   2018/01/01    2020/08/25
AAPL      2010/01/01   2018/01/01    2020/08/25
AAL       2010/01/01   2018/01/01    2020/08/25
BTC-USD   2014/09/17   2018/01/01    2020/08/26
KSS       1999/01/01   2018/01/01    2020/08/24
GE        2000/01/01   2015/01/01    2020/08/24
HSI       2000/01/01   2015/01/01    2020/08/24

Figure 4 shows the condition of each dataset in different periods. The AAL data is bullish on the training set and bearish on the test set; GE is bearish on both the training and test sets; AAPL and GOOGL are both bullish; KSS and HSI are examples of volatile markets; and BTC/USD is sideways on the test set. These datasets are selected to measure the flexibility of the different models under different market conditions. A robust model can generalize its performance to provide a strategy that behaves profitably on the test set.
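For illustration, the GOOGL split in Table 1 could be reproduced as sketched below with the yfinance package; this tooling is an assumption on our part, since the paper only states that the data come from Yahoo Finance and Google Finance.

```python
import yfinance as yf  # assumption: not necessarily the tool used by the authors

# Reproduce the GOOGL row of Table 1: train on 2010-2017, trade on 2018-2020.
data = yf.download("GOOGL", start="2010-01-01", end="2020-08-25")
ohlc = data[["Open", "High", "Low", "Close"]]

train = ohlc.loc[:"2017-12-31"]   # used to fit the encoder-decoder agent
test = ohlc.loc["2018-01-01":]    # held-out trading (test) period
print(len(train), len(test))
```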
Figure 4: Price histories used to test the models. The blue sections are used for training and the red parts are used as testing sets.

4.2. Evaluation Metrics

The trading strategy proposed by each model is evaluated from three perspectives:

1. How profitable the proposed strategy is;
2. What the risk of the proposed strategy is;
3. The effect of hyper-parameters (e.g., window size) on the strategy proposed for each asset.

The metrics are described in detail as follows.
4.2.1. Profit Curve

This is a quantitative metric showing the percentage of profit with respect to the initial investment. At each point in time t, if the current wealth is W_t and the initial investment is W_0, then the percentage of profit at that time step is calculated using (8):

%Rate_t = (W_t − W_0) / W_0 × 100   (8)

The profit curve compares the %Rate of profit of each model at different time steps.
4.2.2. Arithmetic Return

This metric is the sum of the rates of increase or decrease in the current investment due to the decisions made by the model (buy, sell, none). The rate of wealth change at the current time step, given that the model has already invested the money (has not sold before), is given in (9):

AR_t = (W_t − W_{t−1}) / W_{t−1}   (9)

Using (9), we can calculate the arithmetic return as in (10):

AR = Σ_{t=1}^{T} AR_t   (10)

which gives the cumulative return at each time step.

4.2.3. Time Weighted Return

The returns in different periods are not independent of each other. In other words, when the loss is significant at one time, the capital available to invest afterwards is significantly lower. For this purpose, we use the
Time Weighted Return (TWR), which is calculated in (11):
TWR = (∏_{i=1}^{n} (x_i + 1))^{1/n} − 1   (11)

To avoid negative values, we add 1 to all the return values and then subtract 1 from the result.

4.2.4. Return Variance

This metric is the variance of the daily arithmetic returns:

RV = Σ_{t=1}^{T} (AR_t − \bar{AR})² / (T − 1)   (12)

where \bar{AR} is the average arithmetic return and AR_t is the arithmetic return at time t.

4.2.5. Total Return

This is the percentage increase of the capital over the trading period. The total return is calculated in (13), where W_0 and W_T are the initial and final wealth, respectively:

TR = (W_T − W_0) / W_0   (13)

4.2.6. Value at Risk

The value at risk (
VaR) is a metric that measures the level of financial risk within a portfolio during a specific period of time.
VaR is typically measured with a confidence level 1 − α (e.g., a confidence level of 95%, where α = 5%) and gives the maximum loss in the worst case, with confidence 1 − α, over the corresponding time period. The higher the value of VaR_α (i.e., the absolute value of VaR_α) for a fixed α, the higher the portfolio's level of financial risk.

There are two main approaches to computing VaR_α: 1) the closed-form approach, which assumes that the probability distribution of the daily returns of the portfolio follows a standard Normal distribution, and 2) the historical estimation method, which is non-parametric and assumes no prior knowledge about the portfolio's daily returns. In this paper, we used the closed-form approach.

To calculate VaR_α, we used Monte Carlo simulation by developing a model of future stock price returns and running multiple hypothetical trials through the model. The mean µ and standard deviation σ of the returns are calculated, then 1000 simulations are run to generate random outputs with a normal distribution N(µ, σ). The α percent lowest value of the outputs is selected and reported as VaR_α.

4.2.7. Volatility

The volatility of the daily returns evaluates the risk level of the trading rules by calculating the standard deviation of the daily returns. This metric is calculated for each strategy using (14), where \bar{AR} is the average daily arithmetic return and AR_i is the daily arithmetic return:

σ_p = sqrt( Σ_{i=1}^{T} (AR_i − \bar{AR})² / (T − 1) )   (14)

4.2.8. Sharpe Ratio

The Sharpe ratio (SR) was first proposed by Sharpe et al. [43] to measure the reward-to-variability ratio of mutual funds. This metric gives the average return earned in excess of the risk-free rate per unit of total risk, and is computed here by (15), in which R_f is the return of the risk-free asset and E{R_p} is the expected value of the portfolio return. Here we assume R_f = 0.

SR = (E{R_p} − R_f) / σ_p   (15)

4.2.9. Window Size Heat-map

This diagram illustrates the impact of the window size on extracting appropriate patterns from the input candlesticks for each asset, reflected as the total profit earned by the agent for each window size.
4.2.10. Trading Signal Chart

In this chart, the trading signals for each asset are plotted over that asset's raw price curve. This chart gives insight into the quality of the decision making of each model on each financial asset.
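The following sketch computes several of the metrics above from a daily wealth (portfolio value) series. It is a simplified illustration under our own assumptions, not the authors' evaluation code.

```python
import numpy as np

def evaluation_metrics(wealth, alpha=0.05, n_sims=1000, seed=0):
    """Profit/risk metrics of Section 4.2 computed from a daily wealth series."""
    wealth = np.asarray(wealth, dtype=float)
    ar = np.diff(wealth) / wealth[:-1]                            # daily arithmetic returns, Eq. (9)
    total_return = (wealth[-1] - wealth[0]) / wealth[0]           # Eq. (13)
    twr = np.prod(1.0 + ar) ** (1.0 / len(ar)) - 1.0              # Eq. (11)
    volatility = ar.std(ddof=1)                                   # Eq. (14)
    sharpe = ar.mean() / volatility                               # Eq. (15) with R_f = 0
    # Monte Carlo VaR: simulate returns from N(mu, sigma) and take the alpha-quantile.
    rng = np.random.default_rng(seed)
    simulated = rng.normal(ar.mean(), volatility, size=n_sims)
    var_alpha = np.quantile(simulated, alpha)
    return dict(total_return=total_return, twr=twr, sharpe=sharpe,
                volatility=volatility, var=var_alpha)

print(evaluation_metrics([1000, 1010, 990, 1030, 1055]))
```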
All the models are implemented using the PyTorch library in Python. The models are optimized with the Adam optimizer. Mini-batch training is conducted with a batch size of 10, and the replay memory size is set to 20. The only regularization used in the policy and target networks is Batch Normalization. The transaction cost is set to zero during the training process; however, it may be non-zero during evaluation.
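A hypothetical configuration reflecting these settings might look as follows; the encoder and Q-network definitions are simple stand-ins for the sketches above, and the single Adam optimizer over both modules reflects the end-to-end training described in Section 3.

```python
import torch
import torch.nn as nn

# Hyper-parameters stated in the text.
BATCH_SIZE = 10          # mini-batch size
REPLAY_CAPACITY = 20     # replay memory capacity
TRANSACTION_COST = 0.0   # zero during training; may be non-zero at evaluation

# Illustrative encoder/decoder stand-ins; the real modules are sketched earlier.
encoder = nn.GRU(input_size=4, hidden_size=64, batch_first=True)
policy_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3))
target_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3))
target_net.load_state_dict(policy_net.state_dict())

# One Adam optimizer over encoder + policy network trains the model end-to-end.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(policy_net.parameters()))
```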
In this section, the overall performance of the different models, along with the different input types for the MLP and DQN models (i.e., raw OHLC and windowed candles), is compared using profit curves and the other risk and profit evaluation metrics.

Figure 5 illustrates the profit curves of the models on the test set for the different datasets. DQN-vanilla is the DQN model without any encoder and with raw OHLC input. DQN-windowed is the DQN model with a window of candles as input. MLP-vanilla and MLP-windowed are the same as DQN-vanilla and DQN-windowed except that they contain an encoder part, which is an MLP model. CNN, GRU, and CNN-GRU are models whose encoder parts are described in Sections 3.4.3, 3.4.2, and 3.4.4, respectively, with a window of candles (time-series) as the input type.

The general conclusions we reached from the experiments, illustrated in Figure 5, are:

• Stocks can be divided into two kinds: those in which the sequence of candlesticks has effective temporal relationships, and those with few meaningful time dependencies.

• The most profitable trading strategies for data with a high level of time dependency in their price history are generated by the windowed-input models. BTC/USD, GOOGL, AAPL, and GE are of this kind. GRU and CNN have the best performance on BTC/USD; CNN, DQN-windowed, and MLP-windowed have the best performance on GE; GRU, CNN-GRU, and MLP-windowed have the best performance on GOOGL; and MLP-windowed, GRU, DQN-windowed, and CNN provide the most profitable strategies for AAPL.

• On the other hand, for data with a low level of time dependency among candlesticks, the models with raw OHLC input perform better. AAL, HSI, and KSS are of this type. By a low level of time dependency we do not mean that models with time-series inputs perform poorly; their performance is very good, but slightly worse than that of models with raw OHLC input.

Table 2 reports the details of the experiments with regard to both profit and risk. One crucial point that can be inferred from the results is that as the models' total return increases, the Sharpe ratio increases correspondingly. This means that the models devise strategies in a risk-adjusted way. However, looking at specific cases, on some data the models with the highest profitability acted riskily. For example, on AAL, the best model in terms of total return is MLP-vanilla, but the Sharpe ratio and VaR (here we consider the absolute value of VaR) of its strategy are, respectively, lower and higher than those of the strategies proposed by the windowed-input models. The same is true of GRU on BTC/USD, where GRU has the highest total return but CNN provides a more risk-adjusted strategy with respect to VaR and the Sharpe ratio.

Another important conclusion drawn from comparing the data diagrams and the models' performance is that stocks with highly volatile prices, such as KSS, AAL, and HSI, are best processed by models with raw OHLC inputs, whereas data with more stable prices are best analyzed with windowed-input models. Therefore, in selecting a feature extractor, we should pay careful attention to the type of input data. The best feature extractors for stable stocks are the windowed-input models, whereas for highly volatile stocks, models with raw OHLC input can decide and change their behavior more quickly, since they only pay attention to the current candlestick, not the history of candles.
Now that we have examined the performance of the different feature extractor models, we look more closely at temporal feature extraction. Feature extractors with windowed inputs perform better on data with more stable price movements (rather than highly volatile data). Moreover, considering Table 2 and Figure 5, each windowed-input model has its best performance on different data. We therefore inspect the impact of the window size for each model using the data on which that model has its best performance: we test GRU and CNN-GRU on GOOGL, and CNN, MLP-windowed, and DQN-windowed on GE. Figure 6 shows a heat-map of the relationship between the window size and the normalized total profit earned by the models with windowed input. Window sizes vary from 3 to 75, and the total profit is normalized to lie between 0 and 1. Lighter blocks earned a higher profit than darker ones. As is evident from the heat-map, lighter colors are concentrated in the interval of 10 to 20 more than at other window sizes. Therefore, the best feature extraction from a sequence of candlesticks is achieved with window sizes between 10 and 20.
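The heat-map values can be produced by a simple sweep over window sizes with min-max normalization of the total profit, as sketched below; train_and_evaluate is a hypothetical helper standing in for a full training run.

```python
import numpy as np

def normalize(profits):
    """Min-max normalize total profits to [0, 1], as used for the heat-map of Figure 6."""
    p = np.asarray(profits, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

window_sizes = list(range(3, 76))   # window sizes 3 to 75
# profits = [train_and_evaluate(w) for w in window_sizes]   # hypothetical training helper
# heat_row = normalize(profits)     # one row of the heat-map for a given (model, dataset) pair
```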
For each dataset, the trading strategy of the most profitable model is illustrated in Figure 7, based on the decisions made at each time step.
Figure 5: Profit curves of the models, compared by encoder part and input type, on AAPL, GOOGL, GE, AAL, KSS, HSI, and BTC/USD.

Figure 6: Heat-map showing the impact of different window sizes on feature extraction, measured by the total return obtained for each window size.

The green, red, and blue points represent the 'buy', 'sell', and 'none' signals, respectively. When the agent generates a signal, it influences the next day's investment; in other words, when the agent decides to buy a share, the purchase is actually made the next day. As mentioned before, we use a parameter OwnShare, which indicates whether the agent has already bought the share. Thus, when the agent buys a share at time step t, the OwnShare parameter becomes true, and if the next action is 'none', the agent's money remains invested. We begin with an initial investment, and at each time step, when the agent decides to buy or sell, all the money is invested or withdrawn.

As shown in Figure 7, the agents generate signals properly at positions where the trend of the market changes. In order to show the strategy behavior for each dataset, we select a period of 100 intervals. The strategies for stable markets such as GE, GOOGL, and AAPL contain more 'none' signals than those for volatile markets such as HSI and KSS. This can be explained by the fact that in stable markets the trend changes less rapidly than in highly volatile markets; therefore, the agents can produce more 'none' actions in their strategies.

Wherever possible, the models proposed in this paper are compared with state-of-the-art models for learning single asset trading rules. Since most of these models' implementations are not accessible, comparison with each baseline model is carried out only in cases where the model's performance metrics are reported. The baselines are:
Figure 7: The strategies generated by the best model on each dataset (AAL, GE, GOOGL, AAPL, KSS, HSI, and BTC/USD) for a period of length 100.
i) Buy and Hold (B&H): B&H is one of the most widely used benchmark strategies for comparing the performance of a model. In this strategy, the investor selects an asset and buys it at the first time step of the investment. The purchased asset is held until the end of the period regardless of its price fluctuations.

ii)
GDQN: proposed by Wu et al. [31], this model uses the concatenation of technical indicators and the raw OHLC price data of the last nine time steps as the input, a two-layer stacked GRU structure as the feature extractor, and a DQN as the decision-making module.

iii)
DQT: proposed by Wang et al. [16], this model implements an online Q-learning algorithm to maximize the long-term profit of the investment using rules learned on a single financial asset. The reward function is formed by computing the accumulated wealth over the last n days.

iv) DDPG: proposed by Xiong et al. [24], this model uses the Deep Deterministic Policy Gradient (DDPG) as the deep reinforcement learning approach to obtain an adaptive trading strategy. The model's performance is evaluated and compared with the Dow Jones Industrial Average and the traditional min-variance portfolio allocation strategy.

Tables 3, 4, and 5 report our models' performance along with that of the state-of-the-art models using the profit metrics. According to the results reported in Table 3, the performance of the model with the MLP encoder and raw OHLC input is significantly better than DQT and RRL, proposed by Wang et al. [16], on HSI and S&P500. On HSI, the time-series models achieve a performance close to MLP-vanilla, but they behave poorly on S&P500.

Table 4 reports the Rate of Return (%) of our models with different encoders and of the models proposed by Wu et al. [31]. Wu et al.'s best model performance is on the AAPL stock, with a Rate of Return of 77.7%, whereas our GRU model achieves a Rate of Return of 438%, which is significantly better. Moreover, wherever the models proposed by Wu et al. obtained a negative return, our models obtain a highly positive return. Consider the GE stock, where the maximum return value in Wu et al. is -6.39%, while the best strategy, proposed by the MLP with vanilla input, reaches 130.4%. Examining the returns obtained on IBM, the return values of the time-series models are better than those of the models with raw OHLC input, and the GRU encoder obtains the highest return of 174%; this indicates the existence of a temporal relationship in the IBM stock in that specific period.

Table 5 shows the performance of DDPG, the model presented by Xiong et al. [24]. The final portfolio value of our models is better than that of DDPG, starting from an initial portfolio value of 10000. CNN-GRU has the best performance with a final portfolio value of 21984, while the DDPG model's final portfolio value is 19791.

As the results indicate, our models, ranging from time-series models to those with raw OHLC input, perform significantly better than similar models in profitability. As previously mentioned, the code for these papers was not available, and we had to compare the performance according to the common metrics.
5. Conclusion
In this work, we proposed a method based on the encoder-decoder framework, where the encoder is a DNN that extracts essential features from the raw financial data, and the decoder is a DRL agent that makes a decision at each time step and generates trading signals. The model is trained end-to-end, and the encoder's feature extraction function is optimized toward the policy improvement of the DRL agent.

The DRL agent is based on the deep Q-learning algorithm and consists of a policy and a target network, both of which are multi-layer perceptron networks. For the encoder part, the feature extraction performance of various DNNs is evaluated and compared. The proposed models are categorized into two types according to their input: 1) raw OHLC input, which receives the candle OHLC prices directly, and 2) time-series input, which concatenates a window of consecutive candles and receives the window as input.

Based on the experimental results, the performance of the models depends on the market behavior. When the market is highly volatile, meaning that the rate of price fluctuation is high, the DQN and MLP models with raw OHLC input have the best performance, since they make decisions based only on the current input representation, disregarding the historical changes of the market. On the other hand, in more stable markets, models with time-series input can devise more profitable trading strategies, because the market behavior enables them to extract efficient features from the financial data history. The impact of the window size was further studied, and we concluded that window sizes in the interval of 10 to 20 give the best feature extraction performance. From the trading strategies generated for each dataset, we observe that agents trained on stable stocks generate the 'none' signal more frequently than agents trained on highly volatile markets.

Future work can be viewed from different perspectives:

• If we could predict the next state of the environment from the current state and feed the predicted next state to the DRL model, the performance could increase significantly.

• Actor and actor-critic based DRL methods can be tested and compared with the critic-based deep Q-learning algorithm used here.

• A metric could be developed to describe the behavior of the market, based on which we could specify whether time-series models will work efficiently for rule extraction, i.e., to distinguish where to apply models with raw OHLC input and where to apply time-series input.
Table 2: Performance of different models on BTC/USD, GOOGL, AAPL, KSS, GE, HSI, and AAL.
(AR = Arithmetic Return, ADR = Average Daily Return, DRV = Daily Return Variance, TWR = Time Weighted Return, TR = Total Return (%), SR = Sharpe Ratio, VaR = Value at Risk, Vol = Volatility, Init = Initial Investment, Final = Final Portfolio Value)

Agent           AR    ADR   DRV    TWR    TR    SR     VaR    Vol    Init  Final
BTC/USD
DQN-vanilla     262   0.27  12.64  0.002  629   0.076  -5.58  110.6  1000  7287.2
DQN-windowed    334   0.34  8.69   0.003  1757  0.117  -4.51  91.7   1000  18567.1
MLP-vanilla     324   0.33  12.01  0.003  1296  0.097  -5.37  107.8  1000  13959.8
MLP-windowed    320   0.33  10.19  0.003  1402  0.104  -4.93  99.3   1000  15021.8
GRU             359   0.37  9.89   0.003  2158  0.118  -4.81  97.8   1000  22577.1
CNN             353   0.36  9.42   0.003  2069  0.119  -4.69  95.5   1000  21693.6
CNN-GRU         338   0.35  9.47   0.003  1770  0.114  -4.72  95.7   1000  18701.5
GOOGL
DQN-vanilla     138   0.21  2.59   0.002  263   0.128  -2.45  41.6   1000  3631.0
DQN-windowed    134   0.20  2.25   0.002  255   0.134  -2.27  38.7   1000  3546.3
MLP-vanilla     135   0.20  2.60   0.002  252   0.125  -2.45  41.6   1000  3520.7
MLP-windowed    163   0.24  2.29   0.002  371   0.162  -2.25  39.0   1000  4714.4
GRU             180   0.27  1.56   0.003  475   0.217  -1.79  32.2   1000  5752.0
CNN             139   0.21  2.73   0.002  268   0.127  -2.51  42.6   1000  3678.4
CNN-GRU         163   0.25  1.75   0.002  382   0.185  -1.93  34.1   1000  4819.9
AAPL
DQN-vanilla     166   0.25  3.07   0.002  372   0.142  -2.63  45.2   1000  4722.5
DQN-windowed    190   0.29  3.22   0.003  500   0.159  -2.67  46.3   1000  5997.1
MLP-vanilla     165   0.25  3.16   0.002  366   0.139  -2.68  45.9   1000  4657.1
MLP-windowed    200   0.30  3.03   0.003  566   0.172  -2.57  44.9   1000  6658.3
GRU             191   0.29  2.99   0.003  511   0.166  -2.56  44.6   1000  6112.8
CNN             181   0.27  2.07   0.003  469   0.189  -2.10  37.1   1000  5688.3
CNN-GRU         170   0.26  4.03   0.002  379   0.127  -3.05  51.8   1000  4786.2
KSS
DQN-vanilla     272   0.41  9.25   0.004  1024  0.134  -4.60  78.5   1000  11236.7
DQN-windowed    251   0.38  9.38   0.003  809   0.123  -4.67  79.0   1000  9088.3
MLP-vanilla     287   0.43  9.14   0.004  1205  0.143  -4.55  78.0   1000  13048.2
MLP-windowed    242   0.36  8.93   0.003  747   0.122  -4.56  77.1   1000  8467.0
GRU             248   0.37  8.84   0.003  801   0.125  -4.52  76.7   1000  9005.2
CNN             250   0.37  9.10   0.003  806   0.124  -4.59  77.8   1000  9055.3
CNN-GRU         242   0.36  9.19   0.003  737   0.120  -4.63  78.3   1000  8369.1
GE
DQN-vanilla     260   0.18  3.25   0.002  967   0.101  -2.79  68.0   1000  10673.4
DQN-windowed    333   0.23  2.87   0.002  2179  0.138  -2.56  63.9   1000  22788.3
MLP-vanilla     264   0.19  2.87   0.002  1044  0.110  -2.61  63.9   1000  11442.2
MLP-windowed    317   0.22  2.89   0.002  1848  0.131  -2.58  64.1   1000  19482.2
GRU             304   0.21  3.12   0.002  1580  0.121  -2.69  66.5   1000  16795.4
CNN             335   0.24  2.74   0.002  2242  0.142  -2.49  62.3   1000  23416.3
CNN-GRU         283   0.20  2.91   0.002  1278  0.117  -2.61  64.3   1000  13779.3
HSI
DQN-vanilla     224   0.16  0.78   0.002  792   0.182  -1.30  33.0   1000  8915.1
DQN-windowed    199   0.14  0.72   0.001  593   0.169  -1.25  31.6   1000  6930.6
MLP-vanilla     230   0.17  0.76   0.002  841   0.189  -1.27  32.6   1000  9412.9
MLP-windowed    190   0.14  0.82   0.001  531   0.151  -1.35  33.7   1000  6306.6
GRU             207   0.15  0.81   0.001  651   0.166  -1.33  33.6   1000  7510.0
CNN             196   0.14  0.80   0.001  570   0.157  -1.33  33.4   1000  6696.0
CNN-GRU         193   0.14  0.72   0.001  556   0.164  -1.26  31.6   1000  6560.5
AAL
DQN-vanilla     280   0.42  12.07  0.004  1030  0.121  -5.30  89.7   1000  11299.0
DQN-windowed    281   0.42  9.14   0.004  1145  0.140  -4.56  78.0   1000  12447.2
MLP-vanilla     303   0.45  11.45  0.004  1352  0.134  -5.12  87.3   1000  14521.4
MLP-windowed    261   0.39  7.03   0.004  985   0.148  -3.98  68.4   1000  10845.0
GRU             268   0.40  8.77   0.004  999   0.136  -4.48  76.4   1000  10989.6
CNN             253   0.38  7.19   0.003  900   0.142  -4.04  69.2   1000  9999.2
CNN-GRU         212   0.32  9.36   0.003  520   0.104  -4.72  79.0   1000  6198.9
Table 3: Comparison of profitability performance with Wang et al. [16] based on Accumulated Return (%)
Agents         HSI       S&P500
MLP-vanilla    13231.2   5032.3
DQN-vanilla    5016      2524
MLP-windowed   7227      4118
DQN-windowed   7576      4289
GRU            10911     3918
CNN            10575     3859
CNN-GRU        12566     2573
B&H            153.5     168.6
B&H            154       169
DQT            350       214
RRL            174       141

Table 4: Comparison of profitability performance with Wu et al. [31] based on Rate of Return (%)
Agents         AAPL    GE      AXP    CSCO    IBM
DQN-vanilla    336     129     183    182     144
MLP-vanilla    262.3   130.4   260.2  259.9   149.2
DQN-windowed   425     70      252    241     118
MLP-windowed   402     74      280    299     165
GRU            438     129     262    233     174
CNN            290     84      189    251     153
CNN-GRU        411     78      284    227     152
GDQN           77.7    -10.8   20.0   20.6    4.63
GDPG           82.0    -6.39   24.3   13.6    2.55
Turtle         69.5    -17.0   25.6   -1.41   -11.7