An Application of Deep Reinforcement Learning to Algorithmic Trading
Thibaut Théate*, Damien Ernst
Montefiore Institute, University of Liège (Allée de la découverte 10, 4000 Liège, Belgium)
*Corresponding author. Email addresses: [email protected] (Thibaut Théate), [email protected] (Damien Ernst)
Abstract
This scientific research paper presents an innovative approach based on deep reinforcement learning (DRL) to solve the algorithmic trading problem of determining the optimal trading position at any point in time during a trading activity in stock markets. It proposes a novel DRL trading strategy so as to maximise the resulting Sharpe ratio performance indicator on a broad range of stock markets. Denominated the Trading Deep Q-Network algorithm (TDQN), this new trading strategy is inspired by the popular DQN algorithm and significantly adapted to the specific algorithmic trading problem at hand. The training of the resulting reinforcement learning (RL) agent is entirely based on the generation of artificial trajectories from a limited set of stock market historical data. In order to objectively assess the performance of trading strategies, the research paper also proposes a novel, more rigorous performance assessment methodology. Following this new performance assessment approach, promising results are reported for the TDQN strategy.
Keywords:
Artificial intelligence, deep reinforcement learning, algorithmic trading, trading strategy.
1. Introduction
For the past few years, the interest in artificial intelligence (AI) has grown at a very fast pace, with numerous research papers published every year. A key element for this growing interest is related to the impressive successes of deep learning (DL) techniques, which are based on deep neural networks (DNN), mathematical models directly inspired by the structure of the human brain. These techniques are nowadays the state of the art in many applications such as speech recognition, image classification or natural language processing. In parallel to DL, another field of research has recently gained much more attention from the research community: deep reinforcement learning (DRL). This family of techniques is concerned with the learning process of an intelligent agent (i) interacting in a sequential manner with an unknown environment, (ii) aiming to maximise its cumulative rewards and (iii) using DL techniques to generalise the information acquired from the interaction with the environment. The many recent successes of DRL techniques highlight their ability to solve complex sequential decision-making problems.

Nowadays, an emerging industry which is growing extremely fast is the financial technology industry, generally referred to by the abbreviation FinTech. The objective of FinTech is pretty simple: to extensively take advantage of technology in order to innovate and improve activities in finance. In the coming years, the FinTech industry is expected to revolutionise the way many decision-making problems related to the financial sector are addressed, including the problems related to trading, investment, risk management, portfolio management, fraud detection and financial advising, to cite a few. Such decision-making problems are extremely complex to solve, as they generally have a sequential nature and are highly stochastic, with an environment that is partially observable and potentially adversarial. In particular, algorithmic trading, which is a key sector of the FinTech industry, presents particularly interesting challenges. Also called quantitative trading, algorithmic trading is the methodology to trade using computers and a specific set of mathematical rules. The objective of the present research paper is to develop and analyse novel DRL solutions to solve the algorithmic trading problem of determining the optimal trading position (long or short) at any point in time during a trading activity.

The scientific research paper is structured as follows. First of all, a brief review of the scientific literature around the algorithmic trading field and its main AI-based contributions is presented in Section 2. Afterwards, Section 3 introduces and rigorously formalises the particular algorithmic trading problem considered. Additionally, this section makes the link with the reinforcement learning (RL) approach. Then, Section 4 covers the complete design of the TDQN trading strategy based on DRL concepts. Subsequently, Section 5 proposes a novel methodology to objectively assess the performance of trading strategies. Section 6 is concerned with the presentation and discussion of the results achieved by the TDQN trading strategy. To end this research paper, Section 7 discusses interesting leads as future work and draws meaningful conclusions.
2. Literature review

To begin this brief literature review, two facts have to be emphasised. Firstly, it is important to be aware that many sound scientific works in the field of algorithmic trading are not publicly available. As explained in [1], due to the huge amount of money at stake, private FinTech firms are very unlikely to make their latest research results public. Secondly, it should be acknowledged that making a fair comparison between trading strategies is a challenging task, due to the lack of a common, well-established framework to properly evaluate their performance. Instead, the authors generally define their own framework with their evident bias. Another major problem is related to the trading costs, which are variously defined or even omitted.

First of all, most of the works in algorithmic trading are techniques developed by mathematicians, economists and traders who do not exploit AI. Typical examples of classical trading strategies are the trend following and mean reversion strategies, which are covered in detail in [2], [3] and [4]. Then, the majority of works applying machine learning (ML) techniques in the algorithmic trading field focus on forecasting. If the financial market evolution is known in advance with a reasonable level of confidence, the optimal trading decisions can easily be computed. Following this approach, DL techniques have already been investigated with good results, see e.g. [5] introducing a trading strategy based on a DNN, and especially [6] using wavelet transforms, stacked autoencoders and long short-term memory (LSTM). Alternatively, several authors have already investigated RL techniques to solve this algorithmic trading problem. For instance, [7] introduced a recurrent RL algorithm for discovering new investment policies without the need to build forecasting models, and [8] used adaptive RL to trade in foreign exchange markets. More recently, a few works investigated DRL techniques in a scientifically sound way to solve this particular algorithmic trading problem. For instance, one can first mention [9], which introduced the fuzzy recurrent deep neural network structure to obtain a technical-indicator-free trading system taking advantage of fuzzy learning to reduce the time series uncertainty. One can also mention [10], which studied the application of the deep Q-learning algorithm for trading in foreign exchange markets. Finally, there exist a few interesting works studying the application of DRL techniques to algorithmic trading in specific markets, such as in the field of energy, see e.g. the article [11].

To finish with this short literature review, a sensitive problem in the scientific literature is the tendency to prioritise the communication of good results or findings, sometimes at the cost of a proper scientific approach with objective criticism. Going even further, [12] even states that most published research findings in certain sensitive fields are probably false. Such concern appears to be all the more relevant in the field of financial sciences, especially when the subject directly relates to trading activities. Indeed, [13] claims that many scientific publications in finance suffer from a lack of a proper scientific approach, instead getting closer to pseudo-mathematics and financial charlatanism than rigorous sciences. Aware of these concerning tendencies, the present research paper intends to deliver an unbiased scientific evaluation of the novel DRL algorithm proposed.
3. Algorithmic trading problem formalisation
In this section, the sequential decision-making algorithmic trading problem studied in this research paper is presented in detail. Moreover, a rigorous formalisation of this particular problem is performed. Additionally, the link with the RL formalism is highlighted.
Algorithmic trading, also called quantitative trading, is a subfield of finance which can be viewed as the approach of automatically making trading decisions based on a set of mathematical rules computed by a machine. This commonly accepted definition is adopted in this research paper, although other definitions exist in the literature. Indeed, several authors differentiate the trading decisions (quantitative trading) from the actual trading execution (algorithmic trading). For the sake of generality, algorithmic trading and quantitative trading are considered synonyms in this research paper, defining the entire automated trading process. Algorithmic trading has already proven to be very beneficial to markets, the main benefit being the significant improvement in liquidity, as discussed in [14]. For more information about this specific field, please refer to [15] and [16].

There are many different markets suitable for algorithmic trading strategies. Stocks and shares can be traded in the stock markets, FOREX trading is concerned with foreign currencies, or a trader could invest in commodity futures, to only cite a few. The recent rise of cryptocurrencies, such as the Bitcoin, offers new interesting possibilities as well. Ideally, the DRL algorithms developed in this research paper should be applicable to multiple markets. However, the focus will be set on stock markets for now, with an extension to various other markets planned in the future.

In fact, a trading activity can be viewed as the management of a portfolio, which is a set of assets including diverse stocks, bonds, commodities, currencies, etc. In the scope of this research paper, the portfolio considered consists of one single stock together with the agent cash. The portfolio value v_t is then composed of the trading agent cash value v^c_t and the share value v^s_t, which continuously evolves over time t. Buying and selling operations are simply cash and share exchanges. The trading agent interacts with the stock market through an order book, which contains the entire set of buying orders (bids) and selling orders (asks). An example of a simple order book is depicted in Table 1. An order represents the willingness of a market participant to trade and is composed of a price p, a quantity q and a side s (bid or ask). For a trade to occur, a match between bid and ask orders is required, an event which can only happen if p^{bid}_{max} >= p^{ask}_{min}, with p^{bid}_{max} (p^{ask}_{min}) being the maximum (minimum) price of a bid (ask) order. Then, a trading agent faces a very difficult task in order to generate profit: what, when, how, at which price and which quantity to trade. This is the complex sequential decision-making problem of algorithmic trading studied in this scientific research paper.

Table 1: Example of a simple order book
Side s | Quantity q | Price p
Ask    | 3000       | 107
Ask    | 1500       | 106
Ask    | 500        | 105
Bid    | 1000       | 95
Bid    | 2000       | 94
Bid    | 4000       | 93
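As an illustration of this matching condition, the following minimal Python sketch (an addition, not part of the original paper) checks whether a trade can occur on the toy order book of Table 1.

    # Toy order book from Table 1, as (price, quantity) pairs.
    bids = [(95, 1000), (94, 2000), (93, 4000)]
    asks = [(105, 500), (106, 1500), (107, 3000)]

    p_bid_max = max(price for price, quantity in bids)  # best bid
    p_ask_min = min(price for price, quantity in asks)  # best ask

    # A trade can only occur when the best bid reaches the best ask.
    trade_possible = p_bid_max >= p_ask_min
    print(p_bid_max, p_ask_min, trade_possible)  # 95 105 False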
Since trading decisions can be issued at any time, the trading activity is a continuous process. In order to study the algorithmic trading problem described in this research paper, a discretisation operation of the continuous timeline is performed. The trading timeline is discretised into a high number of discrete trading time steps t of constant duration Δt. In this research paper, for the sake of clarity, the increment (decrement) operations t + 1 (t − 1) are used to model the discrete transition from time step t to time step t + Δt (t − Δt).

The duration Δt is closely linked to the trading frequency targeted by the trading agent (very high trading frequency, intraday, daily, monthly, etc.). Such a discretisation operation inevitably imposes a constraint with respect to this trading frequency. Indeed, because the duration Δt between two time steps cannot be chosen as small as desired due to technical constraints, the maximum trading frequency achievable, equal to 1/Δt, is limited. In the scope of this research paper, this constraint is met as the trading frequency targeted is daily, meaning that the trading agent makes a new decision once every day.

The algorithmic trading approach is rule based, meaning that the trading decisions are made according to a set of rules: a trading strategy. In technical terms, a trading strategy can be viewed as a programmed policy π(a_t | i_t), either deterministic or stochastic, which outputs a trading action a_t according to the information available to the trading agent i_t at time step t. Additionally, a key characteristic of a trading strategy is its sequential aspect, as illustrated in Figure 1. An agent executing its trading strategy sequentially applies the following steps:

1. Update of the available market information i_t.
2. Execution of the policy π(a_t | i_t) to get action a_t.
3. Application of the designated trading action a_t.
4. Next time step t → t + 1, loop back to step 1.

Figure 1: Illustration of a trading strategy execution
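This sequential execution can be sketched as the following Python loop, where update_information, policy and execute are hypothetical helpers standing for steps 1 to 3 above (a minimal sketch, not the authors' implementation).

    def run_trading_strategy(update_information, policy, execute, n_steps):
        """Minimal sketch of the four-step trading strategy execution loop."""
        for t in range(n_steps):
            information = update_information(t)   # 1. update market information i_t
            action = policy(information)          # 2. query the policy pi(a_t | i_t)
            execute(action)                       # 3. apply the trading action a_t
            # 4. the loop moves on to the next time step t + 1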
In the following subsection, the algorithmic trading sequential decision-making problem, which shares similarities with other problems successfully tackled by the RL community, is cast as an RL problem.
As illustrated in Figure 2, reinforcement learning is concerned with the sequential interaction of an agent with its environment. At each time step t, the RL agent firstly observes the RL environment of internal state s_t and retrieves an observation o_t. It then executes the action a_t resulting from its RL policy π(a_t | h_t), where h_t is the RL agent history, and receives a reward r_t as a consequence of its action. In this RL context, the agent history can be expressed as h_t = {(o_τ, a_τ, r_τ) | τ = 0, 1, ..., t}.

Figure 2: Reinforcement learning core building blocks
Reinforcement learning techniques are concerned with the design of policies π maximising an optimality criterion, which directly depends on the immediate rewards r_t observed over a certain time horizon. The most popular optimality criterion is the expected discounted sum of rewards over an infinite time horizon. Mathematically, the resulting optimal policy π* is expressed as the following:

\pi^* = \arg\max_{\pi} E[R \mid \pi] \quad (1)

R = \sum_{t=0}^{\infty} \gamma^t r_t \quad (2)

The parameter γ is the discount factor (γ ∈ [0, 1]). When γ = 0, the RL agent is said to be myopic as it only considers the current reward and totally discards the future rewards. When the discount factor increases, the RL agent tends to become more long-term oriented. In the extreme case where γ = 1, the RL agent considers each reward equally. This key parameter should be tuned according to the desired behaviour.

In the scope of this algorithmic trading problem, the RL environment is the entire complex trading world gravitating around the trading agent. In fact, this trading environment can be viewed as an abstraction including the trading mechanisms together with every single piece of information capable of having an effect on the trading activity of the agent (both qualitative and quantitative information). In order to reach its objective, the RL agent has to figure out the key phenomena encompassed underneath such an extremely complex environment.

At each trading time step t, the RL agent observes the stock market whose internal state is s_t ∈ S. The limited information collected by the agent on this complex trading environment is denoted by o_t ∈ O. Ideally, this observation space O should encompass all the information capable of influencing the market prices. Typically, the RL agent observations can be expressed as the following:

o_t = \{S(t), D(t), T(t), I(t), M(t), N(t), E(t)\} \quad (3)

where:

• S(t) represents the state information of the RL agent at time step t (current trading position, number of shares owned by the agent, available cash).

• D(t) is the information gathered by the agent at time step t concerning the OHLCV (Open-High-Low-Close-Volume) data characterising the stock market. More precisely, D(t) can be expressed as follows:

D(t) = \{p^O_t, p^H_t, p^L_t, p^C_t, V_t\} \quad (4)

where:
  – p^O_t is the stock market price at the opening of the time period [t − Δt, t[.
  – p^H_t is the highest stock market price over the time period [t − Δt, t[.
  – p^L_t is the lowest stock market price over the time period [t − Δt, t[.
  – p^C_t is the stock market price at the closing of the time period [t − Δt, t[.
  – V_t is the total volume of shares exchanged over the time period [t − Δt, t[.

• T(t) is the agent information regarding the trading time step t (date, weekday, time).

• I(t) is the agent information regarding multiple technical indicators about the stock market targeted at time step t. There exist many technical indicators providing extra insights about diverse financial phenomena, such as moving average convergence divergence (MACD), relative strength index (RSI) or average directional index (ADX), to only cite a few.

• M(t) gathers the macroeconomic information at the disposal of the agent at time step t. There are many interesting macroeconomic indicators which could potentially be useful to forecast markets' evolution, such as the interest rate or the exchange rate.

• N(t) represents the news information gathered by the agent at time step t. These news data can be extracted from various sources such as social media (Twitter, Facebook, LinkedIn), the newspapers, specific journals, etc. The benefits of such information have already been demonstrated by several authors, see e.g. [17], [18] and [19].

• E(t) is any extra useful information at the disposal of the trading agent at time step t, such as other market participants' trading strategies, companies' confidential information, similar stock market behaviours, rumours, experts' advice, etc.

A major challenge of this algorithmic trading problem is the extremely poor observability of the environment. Indeed, a significant amount of information is simply hidden from the trading agent, ranging from some companies' confidential information to the other market participants' strategies. In fact, the information available to the RL agent is extremely limited compared to the complexity of the environment. Moreover, this information can take various forms, both quantitative and qualitative. Finally, as previously hinted, there are significant time correlation complexities to deal with. This is why the observations should be considered as a series rather than individually.

Observation space reduction:
In the scope of this research paper, it is assumed that the only information considered by the RL agent is the classical OHLCV data D(t) together with the state information S(t). More precisely, the reduced observation space O encompasses the current trading position together with a series of the previous daily open-high-low-close prices and daily traded volumes. With such an assumption, the reduced RL observation o_t can be expressed as the following:

o_t = \{p^O_t, p^H_t, p^L_t, p^C_t, V_t, P_t\} \quad (5)

with P_t being the trading position of the RL agent at time step t (either long or short, as explained in the next subsection of this research paper).

3.4.2. RL actions

At each time step t, the RL agent executes a trading action a_t ∈ A resulting from its policy π(a_t | h_t). In fact, the trading agent has to answer several questions: whether, how and how much to trade? Such decisions can be modelled by the quantity of shares bought by the trading agent at time step t, represented by Q_t ∈ Z. Therefore, the RL actions can be expressed as the following:

a_t = Q_t \quad (6)

Three cases can occur depending on the value of Q_t:

• Q_t > 0: The RL agent buys shares on the stock market, by posting new bid orders on the order book.

• Q_t < 0: The RL agent sells shares on the stock market, by posting new ask orders on the order book.

• Q_t = 0: The RL agent holds, meaning that it does not buy nor sell any shares on the stock market.

Actually, the real actions occurring in the scope of a trading activity are the orders posted on the order book. The RL agent is assumed to communicate with an external module responsible for the synthesis of these true actions according to the value of Q_t: the trading execution system. Despite being out of the scope of this paper, it should be mentioned that multiple execution strategies can be considered depending on the general trading purpose.

The trading actions have an impact on the two components of the portfolio value, namely the cash and share values. Assuming that the trading actions occur close to the market closure at price p_t ≃ p^C_t, the updates of these components are governed by the following equations:

v^c_{t+1} = v^c_t - Q_t \, p_t \quad (7)

v^s_{t+1} = \underbrace{(n_t + Q_t)}_{n_{t+1}} \, p_{t+1} \quad (8)

with n_t ∈ Z being the number of shares owned by the trading agent at time step t. In the scope of this research paper, negative values are allowed for this quantity. Despite being surprising at first glance, a negative number of shares simply corresponds to shares borrowed and sold, with the obligation to repay the lender in shares in the future. Such a mechanism is particularly interesting as it introduces new possibilities for the trading agent.

Two important constraints are assumed concerning the quantity of traded shares Q_t. Firstly, contrarily to the share value v^s_t, which can be both positive and negative, the cash value v^c_t has to remain positive for every trading time step t. This constraint imposes an upper bound on the number of shares that the trading agent is capable of purchasing, this volume of shares being easily derived from Equation 7. Secondly, there exists a risk associated with the impossibility to repay the share lender if the agent suffers significant losses. To prevent such a situation from happening, the cash value v^c_t is constrained to be sufficiently large when a negative number of shares is owned, in order to be able to get back to a neutral position (n_t = 0). A maximum relative change in prices, expressed in % and denoted ε ∈ R+, is assumed by the RL agent prior to the trading activity. This parameter corresponds to the maximum market daily evolution supposed by the agent over the entire trading horizon, so that the trading agent should always be capable of paying back the share lender as long as the market variation remains below this value. Therefore, the constraints acting upon the RL actions at time step t can be mathematically expressed as follows:

v^c_{t+1} \geq 0 \quad (9)

v^c_{t+1} \geq -n_{t+1} \, p_t \, (1 + \epsilon) \quad (10)

with the following condition assumed to be satisfied:

\left| \frac{p_{t+1} - p_t}{p_t} \right| \leq \epsilon \quad (11)
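The following Python sketch applies one trading action to the portfolio in the cost-free setting of Equations 7 and 8 (a minimal illustration with a hypothetical helper; trading costs are introduced below).

    def update_portfolio(v_c, n, q, p_now, p_next):
        """Trade q shares at price p_now, then revalue the position at p_next.

        Sketch of Equations 7 and 8, without trading costs.
        """
        v_c_next = v_c - q * p_now   # Equation 7: cash update
        n_next = n + q               # updated number of shares n_{t+1}
        v_s_next = n_next * p_next   # Equation 8: share value update
        return v_c_next, n_next, v_s_next

    # Example: buy 10 shares at a price of 100 with a cash value of 100000,
    # the price then moving to 102.
    print(update_portfolio(100000.0, 0, 10, 100.0, 102.0))  # (99000.0, 10, 1020.0)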
Trading costs consideration:

Actually, the modelling represented by Equation 7 is inaccurate and will inevitably lead to unrealistic results. Indeed, whenever simulating trading activities, the trading costs should not be neglected. Such an omission is generally misleading, as a trading strategy that is highly profitable in simulations may be likely to generate large losses in real trading situations due to these trading costs, especially when the trading frequency is high. The trading costs can be subdivided into two categories. On the one hand, there are explicit costs, which are induced by transaction costs and taxes. On the other hand, there are implicit costs, called slippage costs, which are composed of three main elements and are associated with some of the dynamics of the trading environment. The different slippage costs are detailed hereafter:

• Spread costs: These costs are related to the difference between the minimum ask price p^{ask}_{min} and the maximum bid price p^{bid}_{max}, called the spread. Because the complete state of the order book is generally too complex to efficiently process or even not available, the trading decisions are mostly based on the middle price p^{mid} = (p^{bid}_{max} + p^{ask}_{min})/2. However, a buying (selling) trade issued at p^{mid} inevitably occurs at a price p >= p^{ask}_{min} (p <= p^{bid}_{max}). Such costs are all the more significant as the stock market liquidity is low compared to the volume of shares traded.

• Market impact costs: These costs are induced by the impact of the trader's actions on the market. Each trade (both buying and selling orders) is potentially capable of influencing the price. This phenomenon is all the more important as the stock market liquidity is low with respect to the volume of shares traded.

• Timing costs: These costs are related to the time required for a trade to physically happen once the trading decision is made, knowing that the market price is continuously evolving. The first cause is the inevitable latency which delays the posting of the orders on the market order book. The second cause is the intentional delays generated by the trading execution system. For instance, a large trade could be split into multiple smaller trades spread over time in order to limit the market impact costs.

An accurate modelling of the trading costs is required to realistically reproduce the dynamics of the real trading environment. While explicit costs are relatively easy to take into account, the valid modelling of slippage costs is a truly complex task. In this research paper, the integration of both costs into the RL environment is performed through a heuristic: when a trade is executed, a certain amount of capital equivalent to a percentage C of the amount of money invested is lost. This parameter was chosen equal to 0.2% in the forthcoming simulations.

Practically, these trading costs are directly withdrawn from the trading agent cash. Following the heuristic previously introduced, Equation 7 can be re-expressed with a corrective term modelling the trading costs:

v^c_{t+1} = v^c_t - Q_t \, p_t - \underbrace{C \, |Q_t| \, p_t}_{\text{Trading costs}} \quad (12)

Moreover, the trading costs have to be properly considered in the constraint expressed in Equation 10. Indeed, the cash value v^c_t should be sufficiently large to get back to a neutral position (n_t = 0) when the maximum market variation ε occurs, the trading costs being included. Consequently, Equation 10 is re-expressed as follows:

v^c_{t+1} \geq -n_{t+1} \, p_t \, (1 + \epsilon)(1 + C) \quad (13)

Eventually, the RL action space A can be defined as the discrete set of acceptable values for the quantity of traded shares Q_t. Derived in detail in Appendix A, the RL action space A is mathematically expressed as the following:

A = \{Q_t \in \mathbb{Z} \cap [\underline{Q}_t, \overline{Q}_t]\} \quad (14)

where:

• \overline{Q}_t = \frac{v^c_t}{p_t (1 + C)}

• \underline{Q}_t = \begin{cases} \frac{\Delta_t}{p_t \, \epsilon \, (1 + C)} & \text{if } \Delta_t \geq 0 \\ \frac{\Delta_t}{p_t \, (2C + \epsilon(1 + C))} & \text{if } \Delta_t < 0 \end{cases} \quad \text{with } \Delta_t = -v^c_t - n_t \, p_t \, (1 + \epsilon)(1 + C).

Action space reduction:
In the scope of this scientific research paper, the action space A is reduced in order to lower the complexity of the algorithmic trading problem. The reduced action space is composed of only two RL actions, which can be mathematically expressed as the following:

a_t = Q_t \in \{Q^{Long}_t, Q^{Short}_t\} \quad (15)

The first action Q^{Long}_t maximises the number of shares owned by the trading agent, by converting as much cash value v^c_t as possible into share value v^s_t. It can be mathematically expressed as follows:

Q^{Long}_t = \left\lfloor \frac{v^c_t}{p_t (1 + C)} \right\rfloor \quad (16)

The action Q^{Long}_t is always valid as it is obviously included into the original action space A described in Equation 14. As a result of this action, the trading agent owns a number of shares N^{Long}_t = n_t + Q^{Long}_t. On the contrary, the second RL action, designated by Q^{Short}_t, converts share value v^s_t into cash value v^c_t, such that the RL agent owns a number of shares equal to −N^{Long}_t. It can be mathematically expressed as the following:

Q^{Short}_t = -n_t - \left\lfloor \frac{v^c_t}{p_t (1 + C)} \right\rfloor \quad (17)

However, the action Q^{Short}_t may violate the lower bound \underline{Q}_t of the action space A when the parameter ε is sufficiently large. For this action to always be valid, Q^{Short}_t is re-expressed as follows:

Q^{Short}_t = \max \left\{ -n_t - \left\lfloor \frac{v^c_t}{p_t (1 + C)} \right\rfloor, \ \underline{Q}_t \right\} \quad (18)

To conclude this subsection, it should be mentioned that the two reduced RL actions are actually related to the next trading position of the agent, designated as P_{t+1}. Indeed, the first action Q^{Long}_t induces a long trading position because the resulting number of owned shares is positive. On the contrary, the second action Q^{Short}_t always results in a negative number of shares, which is generally referred to as a short trading position in finance.
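A minimal Python sketch of these two reduced actions follows, with C denoting the trading cost rate and q_lower the lower bound of the action space from Equation 14 (both assumed to be provided by the caller).

    import math

    def q_long(v_c, p, C):
        """Equation 16: convert as much cash value as possible into shares."""
        return math.floor(v_c / (p * (1 + C)))

    def q_short(v_c, n, p, C, q_lower):
        """Equations 17 and 18: mirror the long position, clipped at the
        lower bound of the action space so that the action is always valid."""
        return max(-n - math.floor(v_c / (p * (1 + C))), q_lower)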
3.4.3. RL rewards

For this algorithmic trading problem, a natural choice for the RL rewards is the strategy daily returns. Intuitively, it makes sense to favour positive returns, which are evidence of a profitable strategy. Moreover, such a quantity has the advantage of being independent of the number of shares n_t currently owned by the agent. This choice is also motivated by the fact that it allows to avoid a sparse reward setup, which is more complex to deal with. The RL rewards can be mathematically expressed as the following:

r_t = \frac{v_{t+1} - v_t}{v_t} \quad (19)
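In code, this reward is simply the daily return of the portfolio value, as in the following one-line sketch of Equation 19.

    def reward(v_t, v_t_next):
        """Equation 19: the RL reward is the daily portfolio return."""
        return (v_t_next - v_t) / v_t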
3.5. Objective

Objectively assessing the performance of a trading strategy is a tricky task, due to the numerous quantitative and qualitative factors to consider. Indeed, a well-performing trading strategy is not simply expected to generate profit, but also to efficiently mitigate the risk associated with the trading activity. The balance between these two goals varies depending on the trading agent profile and its willingness to take extra risks. Although intuitively convenient, maximising the profit generated by a trading strategy is a necessary but not sufficient objective. Instead, the core objective of a trading strategy is the maximisation of the Sharpe ratio, a performance indicator widely used in the fields of finance and algorithmic trading. It is particularly well suited for the performance assessment task as it considers both the generated profit and the risk associated with the trading activity. Mathematically, the Sharpe ratio S_r is expressed as the following:

S_r = \frac{E[R_s - R_f]}{\sigma_r} = \frac{E[R_s - R_f]}{\sqrt{\text{var}[R_s - R_f]}} \simeq \frac{E[R_s]}{\sqrt{\text{var}[R_s]}} \quad (20)

where:

• R_s is the trading strategy return over a certain time period, modelling its profitability.

• R_f is the risk-free return, the expected return from a totally safe investment (negligible).

• σ_r is the standard deviation of the trading strategy excess return R_s − R_f, modelling its riskiness.

In order to compute the Sharpe ratio S_r in practice, the daily returns achieved by the trading strategy are firstly computed using the formula ρ_t = (v_t − v_{t−1})/v_{t−1}. Then, the ratio between the mean and the standard deviation of these returns is evaluated. Finally, the annualised Sharpe ratio is obtained by multiplying this value by the square root of the number of trading days in a year (252).

Moreover, a well-performing trading strategy should ideally be capable of achieving acceptable performance on diverse markets presenting very different patterns. For instance, the trading strategy should properly handle both bull and bear markets (respectively strong increasing and decreasing price trends), with different levels of volatility. Therefore, the research paper's core objective is the development of a novel trading strategy based on DRL techniques to maximise the average Sharpe ratio computed on the entire set of existing stock markets.

Despite the fact that the real objective is the maximisation of the Sharpe ratio, the RL algorithm adopted in this scientific paper actually maximises the discounted sum of rewards over an infinite time horizon, an optimisation criterion that can in fact be seen as a relaxation of the Sharpe ratio criterion. A future interesting research direction would be to narrow the gap between these two objectives.
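The practical computation just described can be sketched as follows in Python, with the risk-free return neglected as in Equation 20 (a minimal illustration, not the authors' implementation).

    import numpy as np

    def annualised_sharpe_ratio(portfolio_values):
        """Annualised Sharpe ratio from a series of daily portfolio values v_t."""
        v = np.asarray(portfolio_values, dtype=float)
        daily_returns = (v[1:] - v[:-1]) / v[:-1]     # rho_t
        ratio = daily_returns.mean() / daily_returns.std()
        return np.sqrt(252) * ratio                   # 252 trading days per year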
4. Deep reinforcement learning algorithm design
In this section, a novel DRL algorithm is designed to solve the algorithmic trading problem previously introduced. The resulting trading strategy, denominated the Trading Deep Q-Network algorithm (TDQN), is inspired by the successful DQN algorithm presented in [20] and is significantly adapted to the specific decision-making problem at hand. Concerning the training of the RL agent, artificial trajectories are generated from a limited set of stock market historical data.
The Deep Q-Network algorithm, generally referred to as DQN, is a DRL algorithm capable of successfully learning control policies from high-dimensional sensory inputs. It is in a way the successor of the popular Q-learning algorithm introduced in [21]. This DRL algorithm is said to be model-free, meaning that a complete model of the environment is not required and that trajectories are sufficient. Belonging to the Q-learning family of algorithms, it is based on the learning of an approximation of the state-action value function, which is represented by a DNN. In such a context, learning the Q-function amounts to learning the parameters θ of this DNN. Finally, the DQN algorithm is said to be off-policy as it exploits in batch mode previous experiences e_t = (s_t, a_t, r_t, s_{t+1}) collected at any point during training.

For the sake of brevity, the DQN algorithm is illustrated in Figure 3, but is not extensively presented in this paper. Besides the original publications ([20] and [22]), there exists a great scientific literature around this algorithm, see for instance [23], [24], [25], [26], [27] and [28]. Concerning DL techniques, interesting resources are [29], [30] and [31]. For more information about RL, the reader can refer to the following textbooks and surveys: [32], [33], [34], [35] and [36].

In the scope of the algorithmic trading problem, a complete model of the environment E is not available. The training of the TDQN algorithm is entirely based on the generation of artificial trajectories from a limited set of stock market historical daily OHLCV data. A trajectory τ is defined as a sequence of observations o_t ∈ O, actions a_t ∈ A and rewards r_t from an RL agent for a certain number T of trading time steps t:

\tau = \left( \{o_0, a_0, r_0\}, \{o_1, a_1, r_1\}, ..., \{o_T, a_T, r_T\} \right)

Figure 3: Illustration of the DQN algorithm

Initially, the RL agent disposes of one single real trajectory, corresponding to the historical behaviour of the stock market, i.e. the particular case of the trading agent being inactive. For this algorithmic trading problem, new fictive trajectories are artificially generated from the interaction of the RL agent with its environment E. The historical stock market behaviour is simply considered unaffected by the new actions performed by the trading agent. Then, the artificial trajectories generated are simply composed of the sequence of historical real observations associated with various sequences of trading actions from the RL agent. For such a practice to be scientifically acceptable and lead to realistic simulations, the trading agent should not be able to influence the stock market behaviour. This assumption generally holds when the amount of money invested by the trading agent is low with respect to the liquidity of the stock market.

In addition to the generation of artificial trajectories just described, a trick is employed to slightly improve the exploration of the RL agent. It relies on the fact that the reduced action space A is composed of only two actions: long (Q^{Long}_t) and short (Q^{Short}_t). At each trading time step t, the chosen action a_t is executed on the trading environment E and the opposite action a^-_t is executed on a copy of this environment E^-. Although this trick does not completely solve the challenging exploration/exploitation trade-off, it enables the RL agent to continuously explore at a small extra computational cost.
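This exploration trick can be sketched as follows, assuming a hypothetical environment object exposing a step method that returns the next observation and the reward (a sketch under these assumptions, not the authors' code).

    import copy

    def step_with_opposite_action(env, obs, action, opposite_action, replay_memory):
        """Play a_t on E and the opposite action on a copy E-, storing both
        experiences in the replay memory (sketch of the trick described above)."""
        shadow_env = copy.deepcopy(env)             # environment copy E-
        next_obs, r = env.step(action)              # chosen action a_t on E
        next_obs_opp, r_opp = shadow_env.step(opposite_action)
        replay_memory.append((obs, action, r, next_obs))
        replay_memory.append((obs, opposite_action, r_opp, next_obs_opp))
        return next_obs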
The DQN algorithm was chosen as the starting point for the novel DRL trading strategy developed, but was significantly adapted to the specific algorithmic trading decision-making problem at hand. The diverse modifications and improvements, which are mainly based on the numerous simulations performed, are summarised hereafter:

• Deep neural network architecture: The first difference with respect to the classical DQN algorithm is the architecture of the DNN approximating the action-value function Q(s, a). Due to the different nature of the input (time series instead of raw images), the convolutional neural network (CNN) has been replaced by a classical feedforward DNN with some leaky rectified linear unit (Leaky ReLU) activation functions.
• Double DQN: The DQN algorithm suffers from substantial overestimations, this overoptimism harming the algorithm performance. In order to reduce the impact of this undesired phenomenon, the article [23] presents the double DQN algorithm, which is based on the decomposition of the target max operation into both action selection and action evaluation.
• ADAM optimiser: The classical DQN algorithm implements the RMSProp optimiser. However, the ADAM optimiser, introduced in [37], experimentally proves to improve both the training stability and the convergence speed of the DRL algorithm.
• Huber loss: While the classical DQN algorithm implements a mean squared error (MSE) loss, the Huber loss experimentally improves the stability of the training phase. Such an observation is explained by the fact that the MSE loss significantly penalises large errors, which is generally desired but has a negative side effect for the DQN algorithm because the DNN is supposed to predict values that depend on its own input. This DNN should not radically change in a single training update because this would also lead to a significant change in the target, which could actually result in a larger error. Ideally, the update of the DNN should be performed in a slower and more stable manner. On the other hand, the mean absolute error (MAE) has the drawback of not being differentiable at 0. A good trade-off between these two losses is the Huber loss H:

H(x) = \begin{cases} \frac{1}{2} x^2 & \text{if } |x| \leq 1, \\ |x| - \frac{1}{2} & \text{otherwise.} \end{cases} \quad (21)

Figure 4: Comparison of the MSE, MAE and Huber losses
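In Python, the Huber loss of Equation 21 (with a threshold of 1) can be sketched as follows; it behaves quadratically near zero like the MSE and linearly for large errors like the MAE.

    def huber_loss(x):
        """Huber loss of Equation 21, applied to the TD error x."""
        if abs(x) <= 1.0:
            return 0.5 * x * x    # quadratic regime, like the MSE
        return abs(x) - 0.5       # linear regime, like the MAE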
Algorithm 1: TDQN algorithm

Initialise the experience replay memory M of capacity C.
Initialise the main DNN weights θ (Xavier initialisation).
Initialise the target DNN weights θ⁻ = θ.
for episode = 1 to N do
    Acquire the initial observation o_0 from the environment E and preprocess it.
    for t = 1 to T do
        With probability ε, select a random action a_t from A; otherwise, select a_t = argmax_{a ∈ A} Q(o_t, a; θ).
        Copy the environment E⁻ = E.
        Interact with the environment E (action a_t) and get the new observation o_{t+1} and reward r_t.
        Perform the same operation on E⁻ with the opposite action a_t⁻, getting o⁻_{t+1} and r⁻_t.
        Preprocess both new observations o_{t+1} and o⁻_{t+1}.
        Store both experiences e_t = (o_t, a_t, r_t, o_{t+1}) and e⁻_t = (o_t, a⁻_t, r⁻_t, o⁻_{t+1}) in M.
        if t mod T′ = 0 then
            Randomly sample from M a minibatch of N_e experiences e_i = (o_i, a_i, r_i, o_{i+1}).
            Set y_i = r_i if the state s_{i+1} is terminal, and y_i = r_i + γ Q(o_{i+1}, argmax_{a ∈ A} Q(o_{i+1}, a; θ); θ⁻) otherwise.
            Compute and clip the gradients based on the Huber loss H(y_i, Q(o_i, a_i; θ)).
            Optimise the main DNN parameters θ based on these clipped gradients.
            Update the target DNN parameters θ⁻ = θ every N⁻ steps.
        end if
        Anneal the ε-greedy exploration parameter ε.
    end for
end for
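The target y_i computed in Algorithm 1 follows the double DQN decomposition: the main network selects the action and the target network evaluates it. A minimal sketch, assuming q_main and q_target are callables mapping an observation to the vector of action values:

    import numpy as np

    def double_dqn_target(r, next_obs, terminal, q_main, q_target, gamma):
        """Double DQN target y_i used in Algorithm 1."""
        if terminal:
            return r
        best_action = int(np.argmax(q_main(next_obs)))       # action selection (theta)
        return r + gamma * q_target(next_obs)[best_action]   # action evaluation (theta-)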
• Gradient clipping: The gradient clipping technique is implemented in the TDQN algorithm to solve the exploding gradient problem, which induces significant instabilities during the training of the DNN.
• Xavier initialisation: While the classical DQN algorithm simply initialises the DNN weights randomly, the Xavier initialisation is implemented to improve the algorithm convergence. The idea is to set the initial weights so that the variance of the gradients remains constant across the DNN layers.
• Batch normalisation layers: This DL technique, introduced by [38], consists in normalising the input layer by adjusting and scaling the activation functions. It brings many benefits, including a faster and more robust training phase as well as an improved generalisation.
• Regularisation techniques: Because a strong tendency to overfit was observed during the first experiments with the DRL trading strategy, three regularisation techniques are implemented: Dropout, L2 regularisation and Early Stopping.
• Preprocessing and normalisation: The training loop of the TDQN algorithm is preceded by both a preprocessing and a normalisation operation of the RL observations o_t. Firstly, because the high-frequency noise present in the trading data was experimentally observed to lower the algorithm generalisation, a low-pass filtering operation is executed. However, such a preprocessing operation has a cost, as it modifies or even destroys some potentially useful trading patterns and introduces a non-negligible lag. Secondly, the resulting data are transformed in order to convey more meaningful information about market movements. Typically, the daily evolution of prices is considered rather than the raw prices. Thirdly, the remaining data are normalised.
• Data augmentation techniques: A key challenge of this algorithmic trading problem is the limited amount of available data, which are in addition generally of poor quality. As a counter to this major problem, several data augmentation techniques are implemented: signal shifting, signal filtering and artificial noise addition (see the sketch below). The application of such data augmentation techniques artificially generates new trading data which are slightly different but which result in the same financial phenomena.

Finally, the algorithm underneath the TDQN trading strategy is depicted in detail in Algorithm 1.
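A minimal sketch of these three augmentation techniques on a one-dimensional price series is given below; the window, smoothing and noise parameters are illustrative assumptions, not the values used by the authors.

    import numpy as np

    def augment(prices, shift=1, alpha=0.9, noise_std=0.01, seed=0):
        """Generate slightly different copies of a price series:
        signal shifting, signal filtering and artificial noise addition."""
        prices = np.asarray(prices, dtype=float)
        rng = np.random.default_rng(seed)
        shifted = np.roll(prices, shift)               # signal shifting (wraps around)
        filtered = np.empty_like(prices)               # signal filtering (exponential smoothing)
        filtered[0] = prices[0]
        for i in range(1, len(prices)):
            filtered[i] = alpha * prices[i] + (1 - alpha) * filtered[i - 1]
        noisy = prices * (1 + rng.normal(0.0, noise_std, len(prices)))
        return shifted, filtered, noisy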
5. Performance assessment
An accurate performance evaluation approach is capital in order to produce meaningful results. As previously hinted, this procedure is all the more critical because there has been a real lack of a proper performance assessment methodology in the algorithmic trading field. In this section, a novel, more reliable methodology is presented to objectively assess the performance of algorithmic trading strategies, including the TDQN algorithm.

5.1. Testbench
In the literature, the performance of a trading strategy is generally assessed on a single instrument (stock market or others) for a certain period of time. Nevertheless, the analysis resulting from such a basic approach should not be entirely trusted, as the trading data could have been specifically selected so that a trading strategy looks profitable, even though this is not the case in general. To eliminate such bias, the performance should ideally be assessed on multiple instruments presenting diverse patterns. Aiming to produce trustful conclusions, this research paper proposes a testbench composed of 30 stocks presenting diverse characteristics (sectors, regions, volatility, liquidity, etc.). The testbench is depicted in Table 2. To avoid any confusion, the official reference for each stock (ticker) is specified in parentheses. To avoid any ambiguities concerning the training and evaluation protocols, it should be mentioned that a new trading strategy is trained for each stock included in the testbench. Nevertheless, for the sake of generality, all the algorithm hyperparameters remain unchanged over the entire testbench.

Regarding the trading horizon, the eight years preceding the publication year of the research paper are selected to be representative of the current market conditions. Such a short time period could be criticised because it may be too limited to be representative of the entire set of financial phenomena. For instance, the financial crisis of 2008 is rejected, even though it could be interesting to assess the robustness of trading strategies with respect to such an extraordinary event. However, this choice was motivated by the fact that a shorter trading horizon is less likely to contain significant market regime shifts which would seriously harm the training stability of the trading strategies. Finally, the trading horizon of eight years is divided into both training and test sets as follows:

• Training set: 01/01/2012 → 31/12/2017.
• Test set: 01/01/2018 → 31/12/2019.

The trained DNN parameters θ are fixed during the execution of the trading strategy on the entire test set, meaning that the new experiences acquired are not valued for extra training. Nevertheless, such a practice constitutes an interesting future research direction.

In order to properly assess the strengths and weaknesses of the TDQN algorithm, some benchmark algorithmic trading strategies were selected for comparison purposes. Only the classical trading strategies commonly used in practice were considered, excluding for instance strategies based on DL techniques or other advanced approaches. Despite the fact that the TDQN algorithm is an active trading strategy, both passive and active strategies are taken into consideration. For the sake of fairness, the strategies share the same input and output spaces presented in Section 3.4.2 (O and A). The following list summarises the benchmark strategies selected:

• Buy and hold (B&H).
• Sell and hold (S&H).
• Trend following with moving averages (TF).
• Mean reversion with moving averages (MR).

For the sake of brevity, a detailed description of each strategy is not provided in this research paper. The reader can refer to [2], [3] or [4] for more information. The first two benchmark trading strategies (B&H and S&H) are said to be passive, as there are no changes in trading position over the trading horizon. On the contrary, the other two benchmark strategies (TF and MR) are active trading strategies, issuing multiple changes in trading positions over the trading horizon.
On the one hand, a trend following strategy is concerned with the identification and the follow-up of significant market trends, as depicted in Figure 5. On the other hand, a mean reversion strategy, illustrated in Figure 6, is based on the tendency of a stock market to get back to its previous average price in the absence of clear trends. By design, a trend following strategy generally makes a profit when a mean reversion strategy does not, the opposite being true as well. This is due to the fact that these two families of trading strategies adopt opposite positions: a mean reversion strategy always denies and goes against the trends, while a trend following strategy follows the movements.
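To make the contrast concrete, the following Python sketch derives both signals from a pair of moving averages; the window lengths are illustrative assumptions, as the paper does not specify the benchmark parameters here.

    import numpy as np

    def moving_average(prices, window):
        return np.convolve(prices, np.ones(window) / window, mode="valid")

    def trading_position(prices, strategy, short_window=5, long_window=20):
        """Long or short signal for a trend following (TF) or mean reversion (MR) rule.

        Assumes len(prices) >= long_window.
        """
        short_ma = moving_average(prices, short_window)[-1]
        long_ma = moving_average(prices, long_window)[-1]
        trend_up = short_ma > long_ma
        if strategy == "TF":
            return "long" if trend_up else "short"    # follow the trend
        return "short" if trend_up else "long"        # MR: go against the trend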
Figure 5: Illustration of a typical trend following trading strategy

Figure 6: Illustration of a typical mean reversion trading strategy

Table 2: Performance assessment testbench

Sector             | American (region)                                                                             | European (region)                                | Asian (region)
Trading index      | Dow Jones (DIA), S&P 500 (SPY), NASDAQ (QQQ)                                                  | FTSE 100 (EZU)                                   | Nikkei 225 (EWJ)
Technology         | Apple (AAPL), Google (GOOGL), Amazon (AMZN), Facebook (FB), Microsoft (MSFT), Twitter (TWTR) | Nokia (NOK), Philips (PHIA.AS), Siemens (SIE.DE) | Sony (6758.T), Baidu (BIDU), Tencent (0700.HK), Alibaba (BABA)
Financial services | JPMorgan Chase (JPM)                                                                          | HSBC (HSBC)                                      | CCB (0939.HK)
Energy             | ExxonMobil (XOM)                                                                              | Shell (RDSA.AS)                                  | PetroChina (PTR)
Automotive         | Tesla (TSLA)                                                                                  | Volkswagen (VOW3.DE)                             | Toyota (7203.T)
Food               | Coca Cola (KO)                                                                                | AB InBev (ABI.BR)                                | Kirin (2503.T)
The quantitative performance assessment consists in defining one or more performance indicators to numerically quantify the performance of an algorithmic trading strategy. Because the core objective of a trading strategy is to be profitable, its performance should be linked to the amount of money earned. However, such reasoning omits to consider the risk associated with the trading activity, which should be efficiently mitigated. Generally, a trading strategy achieving a small but stable profit is preferred to a trading strategy achieving a huge profit in a very unstable way after suffering from multiple losses. It eventually depends on the investor profile and the willingness to take extra risks to potentially earn more.

Multiple performance indicators were selected to accurately assess the performance of a trading strategy. As previously introduced in Section 3.5, the most important one is certainly the Sharpe ratio. This performance indicator, widely used in the field of algorithmic trading, is particularly informative as it combines both profitability and risk. Besides the Sharpe ratio, this research paper considers multiple other performance indicators to provide extra insights. Table 3 presents the entire set of performance indicators employed to quantify the performance of a trading strategy.

Complementarily to the computation of these numerous performance indicators, it is interesting to graphically represent the trading strategy behaviour. Plotting both the stock market price p_t and portfolio value v_t evolutions together with the trading actions a_t issued by the trading strategy seems appropriate to accurately analyse the trading policy. Moreover, such a visualisation could also provide extra insights about the performance, strengths and weaknesses of the strategy analysed.
6. Results and discussion
In this section, the TDQN trading strategy is evaluated following the performance assessment methodology previously described. Firstly, a detailed analysis is performed for both a case that gives good results and a case for which the results were mixed at best. This highlights the strengths, weaknesses and limitations of the TDQN algorithm. Secondly, the performance achieved by the DRL trading strategy on the entire testbench is summarised and analysed. Finally, some additional discussions about the discount factor parameter, the trading costs influence and the main challenges faced by the TDQN algorithm are provided.
The first detailed analysis concerns the execution of the TDQN trading strategy on the Apple stock, resulting in promising results. Similar to many DRL algorithms, the TDQN algorithm is subject to a non-negligible variance. Multiple training experiments with the exact same initial conditions will inevitably lead to slightly different trading strategies of varying performance. As a consequence, both a typical run of the TDQN algorithm and its expected performance are presented hereafter.
Typical run:
Firstly, Table 4 presents the performance achieved by each trading strategy considered, the initial amount of money being equal to $100,000. The TDQN algorithm achieves good results from both an earnings and a risk mitigation point of view, clearly outperforming all the benchmark active and passive trading strategies. Secondly, Figure 7 plots both the stock market price p_t and RL agent portfolio value v_t evolutions, together with the actions a_t outputted by the TDQN algorithm. It can be observed that the DRL trading strategy is capable of accurately detecting and benefiting from major trends, while being more hesitant during market behavioural shifts when the volatility increases. It can also be seen that the trading agent generally lags slightly behind the market trends, meaning that the TDQN algorithm is more of a reactive trading strategy rather than a proactive one, which is logical with such a limited observation space O.

Table 3: Quantitative performance assessment indicators

Performance indicator     | Description
Sharpe ratio              | Return of the trading activity compared to its riskiness.
Profit & loss             | Money gained or lost at the end of the trading activity.
Annualised return         | Annualised return generated during the trading activity.
Annualised volatility     | Modelling of the risk associated with the trading activity.
Profitability ratio       | Percentage of winning trades made during the trading activity.
Profit and loss ratio     | Ratio between the average profit and the average loss of the trades.
Sortino ratio             | Similar to the Sharpe ratio, with only the negative risk penalised.
Maximum drawdown          | Largest loss from a peak to a trough during the trading activity.
Maximum drawdown duration | Time duration of the trading activity maximum drawdown.

Figure 7: TDQN algorithm execution on the Apple stock (test set)

Expected performance:
In order to estimate the expected performance of the TDQN algorithm, multiple RL trading agents are trained independently. Figure 8 plots the averaged (over 50 iterations) performance of the TDQN algorithm for both the training and test sets with respect to the number of training episodes. Although this expected performance is slightly lower than the results achieved during the typical run of the algorithm, it remains very encouraging. It can also be noticed that the overfitting tendency of the RL agent seems to be properly handled for this specific market. Please note that the test set performance being temporarily superior to the training set performance is not a mistake. It simply indicates an easier to trade and more profitable market during the test set trading period for the Apple stock. This example perfectly illustrates a major difficulty of the algorithmic trading problem: the training and test sets do not share the same distributions. Indeed, the distribution of the daily returns is continuously changing, which complicates both the training of the DRL trading strategy and its performance evaluation.

Figure 8: TDQN algorithm expected performance on the Apple stock
The same detailed analysis is performed on the Tesla stock, which presents very different characteristics compared to the Apple stock, such as a pronounced volatility. In contrast to the promising performance achieved on the previous stock, this case was specifically selected to highlight the limitations of the TDQN algorithm.
Typical run:
Similar to the previous analysis, Table 5 presents the performance achieved by every trading strategy considered, the initial amount of money being equal to $100,000. The poor results achieved by the benchmark active strategies suggest that the Tesla stock is quite difficult to trade, which is partly due to its significant volatility. Although the TDQN algorithm can be ranked second behind the buy and hold passive trading strategy, it is not profitable. Moreover, the risk level associated with its trading activity cannot really be considered acceptable. Figure 9, which plots both the stock market price p_t and RL agent portfolio value v_t evolutions together with the actions a_t outputted by the TDQN algorithm, confirms this conclusion. Moreover, it can be clearly observed that the pronounced volatility of the Tesla stock induces a higher trading frequency (changes in trading positions, which correspond to the situation where a_t ≠ a_{t−1}) despite the non-negligible trading costs, which increases the riskiness of the DRL trading strategy.

Figure 9: TDQN algorithm execution on the Tesla stock (test set)

Expected performance:
Figure 10 plots the expected performance of the TDQN algorithm for both the training and test sets as a function of the number of training episodes (over 50 iterations). The Sharpe ratio expected value is similar to the one achieved by the typical run of the DRL trading strategy. Nevertheless, the extraordinarily high performance achieved on the training set suggests that the DRL algorithm is subject to overfitting in this specific case, despite the multiple regularisation techniques implemented. This overfitting phenomenon can be partially explained by the observation space O, which is too limited to efficiently apprehend the Tesla stock.

Figure 10: TDQN algorithm expected performance on the Tesla stock

As previously suggested in this research paper, the TDQN algorithm is evaluated on the testbench introduced in Section 5.1, in order to draw more robust and trustful conclusions. Table 6 presents the expected Sharpe ratio (averaged over 50 training iterations) achieved by both the TDQN and benchmark trading strategies on the entire set of stocks included in this testbench.

Regarding the performance achieved by the benchmark trading strategies, it is important to differentiate the passive strategies (B&H and S&H) from the active ones (TF and MR). Indeed, this second family of trading strategies has more potential at the cost of an extra non-negligible risk: continuous speculation.
Because the stock markets were mostly bullish (price p_t mainly increasing over time) with some instabilities during the test set trading period, it is not surprising to see the buy and hold strategy outperforming the other benchmark trading strategies. In fact, neither the trend following nor the mean reversion strategy managed to generate satisfying results on average on this testbench. It clearly indicates that there is a major difficulty to actively trade in such market conditions. This poorer performance can also be explained by the fact that such strategies are generally well suited to exploit specific financial patterns, but they lack versatility and thus often fail to achieve good average performance on a large set of stocks presenting diverse characteristics. Moreover, such strategies are generally more impacted by the trading costs due to their higher trading frequency (for relatively short moving average durations, as is the case in this research paper).

Concerning the innovative trading strategy, the TDQN algorithm achieves promising results on the testbench, outperforming the benchmark active trading strategies on average. Nevertheless, the DRL trading strategy only barely surpasses the buy and hold strategy on these particular bullish markets which are so favourable to this simple passive strategy. Interestingly, it should be noted that the performance of the TDQN algorithm is identical or very close to the performance of the passive trading strategies (B&H and S&H) for multiple stocks. This is explained by the fact that the DRL strategy efficiently learns to tend toward a passive trading strategy when the uncertainty associated with active trading increases. It should also be emphasised that the TDQN algorithm is neither a trend following nor a mean reversion trading strategy, as both financial patterns can be efficiently handled in practice. Thus, the main advantage of the DRL trading strategy is certainly its versatility and its ability to efficiently handle various markets presenting diverse characteristics.

Table 4: Performance assessment on the Apple stock

Performance indicator     | B&H    | S&H     | TF     | MR     | TDQN
Table 4: Performance assessment on the Apple stock

Performance indicator        B&H      S&H       TF       MR       TDQN
Sharpe ratio                 1.237    -1.593    1.074    -0.716   1.473
Profit & loss [$]            79557    -79956    60259    -38381   101604
Annualised return [%]        28.79    -100.00   23.89    -22.85   33.14
Annualised volatility [%]    26.59    44.30     24.87    28.28    26.18
Profitability ratio [%]      100.00   0.00      42.31    56.67    51.61
Profit and loss ratio        ∞
Table 5: Performance assessment on the Tesla stock

Performance indicator        B&H      S&H      TF        MR       TDQN
Sharpe ratio                 0.506    -0.155   -1.047    0.025    0.186
Profit & loss [$]            29471    -29869   -74818    -26995   -8289
Annualised return [%]        24.01    -7.43    -100.00   1.49     9.51
Annualised volatility [%]    53.04    46.04    52.54     59.02    53.54
Profitability ratio [%]      100.00   0.00     34.38     60.87    47.50
Profit and loss ratio        ∞
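For reference, the indicators reported in Tables 4 and 5 can be approximated from a series of daily portfolio values $v_t$ as sketched below, assuming 252 trading days per year and a zero risk-free rate; the exact conventions used in this research paper may differ slightly:

```python
import numpy as np

def performance_indicators(portfolio_values: np.ndarray) -> dict:
    """Approximate performance indicators from daily portfolio values v_t."""
    daily_returns = np.diff(portfolio_values) / portfolio_values[:-1]
    annual_return = 252 * daily_returns.mean()
    annual_volatility = np.sqrt(252) * daily_returns.std()
    return {
        "sharpe_ratio": annual_return / annual_volatility if annual_volatility > 0 else 0.0,
        "profit_and_loss": float(portfolio_values[-1] - portfolio_values[0]),
        "annualised_return_pct": 100 * annual_return,
        "annualised_volatility_pct": 100 * annual_volatility,
    }
```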
Table 6: Performance assessment on the entire testbench (Sharpe ratio)

Stock                      B&H      S&H      TF       MR       TDQN
Dow Jones (DIA)            0.681    -0.639   -0.463   0.234    0.681
S&P 500 (SPY)              0.830    -0.836   -0.458   -0.582   0.830
NASDAQ 100 (QQQ)           0.842    -0.808   0.028    -0.103   0.842
FTSE 100 (EZU)             0.085    0.022    -1.053   -0.344   0.085
Nikkei 225 (EWJ)           0.124    -0.029   -1.110   0.134    0.124
Google (GOOGL)             0.568    -0.372   0.037    0.497    0.527
Apple (AAPL)               1.237    -1.593   1.074    -0.716   1.299
Facebook (FB)              0.369    -0.079   -0.020   -0.201   0.444
Amazon (AMZN)              0.557    -0.188   0.068    -1.474   0.409
Microsoft (MSFT)           1.362    -1.391   -0.121   -0.654   1.455
Twitter (TWTR)             0.188    0.313    -0.389   -0.494   -0.187
Nokia (NOK)                -0.409   0.563    0.961    1.210    -0.124
Philips (PHIA.AS)          1.083    -0.711   0.029    -0.737   1.128
Siemens (SIE.DE)           0.443    -0.329   0.606    0.223    0.463
Baidu (BIDU)               -0.700   0.864    -1.336   -0.153   -0.520
Alibaba (BABA)             0.356    -0.141   -0.180   0.259    0.223
Tencent (0700.HK)          -0.014   0.307    0.053    -0.622   0.098
Sony (6758.T)              0.792    -0.656   -0.489   0.101    0.824
JPMorgan Chase (JPM)       0.710    -0.746   -1.524   -0.115   0.710
HSBC (HSBC)                -0.521   0.721    -0.254   0.531    -0.111
CCB (0939.HK)              0.038    0.163    -1.170   -0.591   0.102
ExxonMobil (XOM)           0.052    0.130    -0.557   -0.792   0.052
Shell (RDSA.AS)            0.525    -0.282   -0.118   0.416    0.525
PetroChina (PTR)           -0.378   0.512    -1.053   -0.315   -0.156
Tesla (TSLA)               0.506    -0.155   -1.047   0.025    0.241
Volkswagen (VOW3.DE)       0.430    -0.278   -0.554   0.130    0.489
Toyota (7203.T)            0.394    -0.292   -1.246   0.526    0.394
Coca Cola (KO)             1.028    -0.873   0.012    -0.536   1.028
AB InBev (ABI.BR)          -0.006   0.225    0.592    -1.539   -0.087
Kirin (2503.T)             0.102    0.153    -1.576   0.115    0.102
Average                    0.376    -0.214   -0.375   -0.186   0.396

6.4. Discount factor discussion

As previously explained in Section 3.4, the discount factor γ is concerned with the importance of future rewards. In the scope of this algorithmic trading problem, the proper tuning of this parameter is not trivial due to the significant uncertainty of the future. On the one hand, the desired trading policy should be long-term oriented (γ → 1). On the other hand, the significant uncertainty associated with the future evolution of stock prices pleads for a stronger discounting of future rewards (γ → 0), so that a sound trade-off has to be found for γ. Indeed, it was observed that there is an optimal value for the discount factor, which is neither too small nor too large. All the simulations performed in this research paper adopt this optimal value for the discount factor: γ = 0.5.
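As a reminder of where γ enters the learning process, the following minimal sketch shows the standard DQN target with γ = 0.5, the value adopted in this research paper; tensor names and shapes are illustrative assumptions rather than the paper's code:

```python
import torch

def dqn_target(rewards: torch.Tensor, next_q_values: torch.Tensor,
               dones: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Standard DQN target: r_t + gamma * max_a Q(s_{t+1}, a).

    A small gamma heavily discounts future rewards, so the immediate cost of
    changing the trading position weighs more in the target, and the learned
    policy tends to trade less frequently.
    """
    max_next_q = next_q_values.max(dim=1).values  # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q
```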
Additionally, these experiments highlighted the hidden link between the discount factor and the trading frequency, due to the trading costs. From the point of view of the RL agent, these costs represent an obstacle to overcome for a change in trading position to occur, due to the immediately reduced (and often negative) reward received. This models the fact that the trading agent should be sufficiently confident about the future in order to overcome the extra risk associated with the trading costs. Since the discount factor determines the importance assigned to the future, a small value of the parameter γ will inevitably reduce the tendency of the RL agent to change its trading position, which decreases the trading frequency of the TDQN algorithm.

6.5. Trading costs discussion

The analysis of the influence of trading costs on a trading strategy's behaviour and performance is essential, as such costs represent an extra risk to mitigate. A major motivation for studying DRL solutions rather than pure prediction techniques, which could also be based on DL architectures, is related to the trading costs. As previously explained in Section 3, the RL formalism enables these additional costs to be considered directly in the decision-making process: the optimal policy is learned according to the trading costs value. On the contrary, a purely predictive approach would only output predictions about the future market direction or prices, without any indication regarding an appropriate trading strategy taking the trading costs into account. Although this last approach offers more flexibility and could certainly lead to well-performing trading strategies, it is less efficient by design.

In order to illustrate the ability of the TDQN algorithm to automatically and efficiently adapt to different trading costs, Figure 11 presents the behaviour of the DRL trading strategy for three different cost values, all other parameters remaining unchanged. It can clearly be observed that the TDQN algorithm effectively reduces its trading frequency when the trading costs increase, as expected. When these costs become too high, the DRL algorithm simply stops actively trading and adopts a passive approach (buy and hold or sell and hold strategy).
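For clarity, the way the proportional trading costs C enter the cash update of Appendix A is sketched below; the function name is hypothetical:

```python
def cash_after_trade(cash: float, quantity: float, price: float, cost_rate: float) -> float:
    """Cash update when trading Q_t shares at price p_t with proportional costs C:
    v^c_{t+1} = v^c_t - Q_t * p_t - C * |Q_t| * p_t.
    """
    return cash - quantity * price - cost_rate * abs(quantity) * price
```

Because the cost term is paid on every change of trading position, the reward immediately decreases whenever the agent trades, which is precisely the obstacle discussed above.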
Nowadays, the main DRL solutions successfully applied to real-life problems concern specific environments with particular properties, such as games (see e.g. the famous AlphaGo algorithm developed by Google DeepMind [39]). In this research paper, an entirely different environment, characterised by a significant complexity and a considerable uncertainty, is studied through the algorithmic trading problem. Obviously, multiple challenges were faced during the research around the TDQN algorithm, the major ones being summarised hereafter.

Firstly, the extremely poor observability of the trading environment is a characteristic that significantly limits the performance of the TDQN algorithm. Indeed, the amount of information at the disposal of the RL agent is insufficient to accurately explain the financial phenomena occurring during training, which is necessary to efficiently learn to trade. Secondly, although the distribution of the daily returns is continuously changing, the past is required to be representative enough of the future for the TDQN algorithm to achieve good results. This makes the DRL trading strategy particularly sensitive to significant market regime shifts. Thirdly, the overfitting tendency of the TDQN algorithm has to be properly handled in order to obtain a reliable trading strategy. As suggested in [40], more rigorous evaluation protocols are required in RL due to the strong tendency of common DRL techniques to overfit. More research on this particular topic is required for DRL techniques to fit a broader range of real-life applications.
7. Conclusion
This scientific research paper presents the Trading Deep Q-Network algorithm (TDQN), a deep reinforcement learning (DRL) solution to the algorithmic trading problem of determining the optimal trading position at any point in time during a trading activity in stock markets. Following a rigorous performance assessment, this innovative trading strategy achieves promising results, surpassing on average the benchmark trading strategies. Moreover, the TDQN algorithm demonstrates multiple benefits compared to more classical approaches, such as an appreciable versatility and a remarkable robustness to diverse trading costs. Additionally, such a data-driven approach presents the major advantage of suppressing the complex task of defining explicit rules suited to the particular financial markets considered.
Figure 11: Impact of the trading costs on the TDQN algorithm, for the Apple stock: (a) trading costs of 0%, (b) trading costs of 0.2%, (c) trading costs of 0.4%
Nevertheless, the performance of the TDQN algorithm could still be improved, from both a generalisation and a reproducibility point of view, among other aspects. Several research directions are suggested to upgrade the DRL solution, such as the use of LSTM layers in the deep neural network, which should help to better process the financial time series data, see e.g. [41]. Another example is the consideration of the numerous improvements implemented in the Rainbow algorithm, which are detailed in [32], [23], [24], [25], [26], [27] and [28]. Another interesting research direction is the comparison of the TDQN algorithm with policy optimisation DRL algorithms such as the Proximal Policy Optimisation (PPO) algorithm [42].

The last major research direction suggested concerns the formalisation of the algorithmic trading problem as a reinforcement learning one. Firstly, the observation space O should be extended to enhance the observability of the trading environment. Similarly, some constraints on the action space A could be relaxed in order to enable new trading possibilities. Secondly, advanced RL reward engineering should be performed to narrow the gap between the RL objective and the Sharpe ratio maximisation objective. Finally, an interesting and promising research direction is the consideration of distributions instead of expected values in the TDQN algorithm, in order to encompass the notion of risk and to better handle uncertainty.

Acknowledgements
Thibaut Théate is a Research Fellow of the F.R.S.-FNRS, of which he acknowledges the financial support.
References

[1] Y. Li, Deep Reinforcement Learning: An Overview, CoRR abs/1701.07274 (2017).
[2] E. P. Chan, Quantitative Trading: How to Build Your Own Algorithmic Trading Business, Wiley, 2009.
[3] E. P. Chan, Algorithmic Trading: Winning Strategies and Their Rationale, Wiley, 2013.
[4] R. K. Narang, Inside the Black Box, Wiley, 2009.
[5] A. Arévalo, J. Niño, G. Hernández, J. Sandoval, High-Frequency Trading Strategy Based on Deep Neural Networks, ICIC (2016).
[6] W. N. Bao, J. Yue, Y. Rao, A Deep Learning Framework for Financial Time Series using Stacked Autoencoders and Long-Short Term Memory, PloS one 12 (2017).
[7] J. E. Moody, M. Saffell, Learning to Trade via Direct Reinforcement, IEEE Transactions on Neural Networks 12 (4) (2001) 875–889.
[8] M. A. H. Dempster, V. Leemans, An Automated FX Trading System using Adaptive Reinforcement Learning, Expert Syst. Appl. 30 (2006) 543–552.
[9] Y. Deng, F. Bao, Y. Kong, Z. Ren, Q. Dai, Deep Direct Reinforcement Learning for Financial Signal Representation and Trading, IEEE Transactions on Neural Networks and Learning Systems 28 (2017) 653–664.
[10] J. Carapuço, R. F. Neves, N. Horta, Reinforcement Learning applied to Forex Trading, Appl. Soft Comput. 73 (2018) 783–794.
[11] I. Boukas, D. Ernst, T. Théate, A. Bolland, A. Huynen, M. Buchwald, C. Wynants, B. Cornélusse, A Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding, ArXiv abs/2004.05940 (2020).
[12] J. P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Med 2 (2005) 124.
[13] D. H. Bailey, J. M. Borwein, M. L. de Prado, Q. J. Zhu, Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance, Notices of the American Mathematical Society (2014) 458–471.
[14] T. Hendershott, C. M. Jones, A. J. Menkveld, Does Algorithmic Trading Improve Liquidity?, Journal of Finance 66 (2011) 1–33.
[15] P. C. Treleaven, M. Galas, V. Lalchand, Algorithmic Trading Review, Commun. ACM 56 (2013) 76–85.
[16] G. Nuti, M. Mirghaemi, P. C. Treleaven, C. Yingsaeree, Algorithmic Trading, Computer 44 (2011) 61–69.
[17] D. Leinweber, J. Sisk, Event-Driven Trading and the New News, The Journal of Portfolio Management 38 (2011) 110–124.
[18] J. Bollen, H. Mao, X. Zeng, Twitter Mood Predicts the Stock Market, J. Comput. Science 2 (2011) 1–8.
[19] W. Nuij, V. Milea, F. Hogenboom, F. Frasincar, U. Kaymak, An Automated Framework for Incorporating News into Stock Trading Strategies, IEEE Transactions on Knowledge and Data Engineering 26 (2014) 823–835.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. A. Riedmiller, Playing Atari with Deep Reinforcement Learning, CoRR abs/1312.5602 (2013).
[21] C. J. C. H. Watkins, P. Dayan, Technical Note: Q-Learning, Machine Learning 8 (1992) 279–292.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-Level Control through Deep Reinforcement Learning, Nature 518 (2015) 529–533.
[23] H. P. van Hasselt, A. Guez, D. Silver, Deep Reinforcement Learning with Double Q-Learning, CoRR abs/1509.06461 (2015).
[24] Z. Wang, N. de Freitas, M. Lanctot, Dueling Network Architectures for Deep Reinforcement Learning, CoRR abs/1511.06581 (2015).
[25] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized Experience Replay, CoRR abs/1511.05952 (2016).
[26] M. G. Bellemare, W. Dabney, R. Munos, A Distributional Perspective on Reinforcement Learning, CoRR abs/1707.06887 (2017).
[27] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, S. Legg, Noisy Networks for Exploration, CoRR abs/1706.10295 (2018).
[28] M. Hessel, J. Modayil, H. P. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, D. Silver, Rainbow: Combining Improvements in Deep Reinforcement Learning, CoRR abs/1710.02298 (2017).
[29] Y. LeCun, Y. Bengio, G. Hinton, Deep Learning, Nature 521 (2015).
[30] I. J. Goodfellow, Y. Bengio, A. C. Courville, Deep Learning, Nature 521 (2015) 436–444.
[31] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[32] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, The MIT Press, 2018.
[33] C. Szepesvári, Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, 2010.
[34] L. Busoniu, R. Babuska, B. De Schutter, D. Ernst, Reinforcement Learning and Dynamic Programming using Function Approximators, CRC Press, 2010.
[35] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning, CoRR abs/1708.05866 (2017).
[36] K. Shao, Z. Tang, Y. Zhu, N. Li, D. Zhao, A Survey of Deep Reinforcement Learning in Video Games, ArXiv abs/1912.10944 (2019).
[37] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, CoRR abs/1412.6980 (2015).
[38] S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, CoRR abs/1502.03167 (2015).
[39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 529 (2016) 484–489.
[40] C. Zhang, O. Vinyals, R. Munos, S. Bengio, A Study on Overfitting in Deep Reinforcement Learning, CoRR abs/1804.06893 (2018).
[41] M. J. Hausknecht, P. Stone, Deep Recurrent Q-Learning for Partially Observable MDPs, CoRR abs/1507.06527 (2015).
[42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal Policy Optimization Algorithms, CoRR abs/1707.06347 (2017).

Appendix A. Derivation of action space A
Theorem 1. The RL action space A admits an upper bound $\overline{Q}_t$ such that:

$$\overline{Q}_t = \frac{v^c_t}{p_t\,(1 + C)}$$

Proof. The upper bound of the RL action space A is derived from the fact that the cash value $v^c_t$ has to remain positive over the entire trading horizon (Equation 9). Making the hypothesis that $v^c_t \geq 0$, the number of shares $Q_t$ traded by the RL agent at time step $t$ has to be set such that:

$$v^c_{t+1} = v^c_t - Q_t\, p_t - C\, |Q_t|\, p_t \geq 0$$

Two cases arise depending on the sign of $Q_t$:

Case of $Q_t < 0$: the previous expression becomes $v^c_t - Q_t p_t + C Q_t p_t \geq 0 \Leftrightarrow Q_t \leq \frac{v^c_t}{p_t (1 - C)}$. The expression on the right side of the inequality is always positive due to the hypothesis that $v^c_t \geq 0$. Because $Q_t$ is negative in this case, the condition is always satisfied.

Case of $Q_t \geq 0$: the previous expression becomes $v^c_t - Q_t p_t - C Q_t p_t \geq 0 \Leftrightarrow Q_t \leq \frac{v^c_t}{p_t (1 + C)}$. This condition represents the (positive) upper bound of the RL action space A.

Theorem 2. The RL action space A admits a lower bound $\underline{Q}_t$ such that:

$$\underline{Q}_t = \begin{cases} \dfrac{\Delta_t}{p_t\, \epsilon\, (1 + C)} & \text{if } \Delta_t \geq 0, \\[6pt] \dfrac{\Delta_t}{p_t\, (2C + \epsilon\,(1 + C))} & \text{if } \Delta_t < 0, \end{cases} \qquad \text{with } \Delta_t = -v^c_t - n_t\, p_t\, (1 + \epsilon)(1 + C).$$

Proof. The lower bound of the RL action space A is derived from the fact that the cash value $v^c_t$ has to be sufficient to get back to a neutral position ($n_t = 0$) over the entire trading horizon (Equation 13). Making the hypothesis that this condition is satisfied at time step $t$, the number of shares $Q_t$ traded by the RL agent should be such that this condition remains true at the next time step $t + 1$. Introducing this constraint into Equation 12, the following inequality is obtained:

$$v^c_t - Q_t\, p_t - C\, |Q_t|\, p_t \geq -(n_t + Q_t)\, p_t\, (1 + C)(1 + \epsilon)$$

Two cases arise depending on the value of $Q_t$:

Case of $Q_t \geq 0$: the previous expression becomes
$$v^c_t - Q_t p_t - C Q_t p_t \geq -(n_t + Q_t) p_t (1 + C)(1 + \epsilon) \Leftrightarrow v^c_t \geq -n_t p_t (1 + C)(1 + \epsilon) - Q_t p_t\, \epsilon\, (1 + C) \Leftrightarrow Q_t \geq \frac{-v^c_t - n_t p_t (1 + C)(1 + \epsilon)}{p_t\, \epsilon\, (1 + C)}$$
The expression on the right side of the inequality represents the first lower bound for the RL action space A.

Case of $Q_t < 0$: the previous expression becomes
$$v^c_t - Q_t p_t + C Q_t p_t \geq -(n_t + Q_t) p_t (1 + C)(1 + \epsilon) \Leftrightarrow v^c_t \geq -n_t p_t (1 + C)(1 + \epsilon) - Q_t p_t (2C + \epsilon + \epsilon C) \Leftrightarrow Q_t \geq \frac{-v^c_t - n_t p_t (1 + C)(1 + \epsilon)}{p_t (2C + \epsilon (1 + C))}$$
The expression on the right side of the inequality represents the second lower bound for the RL action space A.

Both lower bounds previously derived have the same numerator, which is denoted $\Delta_t$ from now on. This quantity represents the difference between the maximum assumed cost of getting back to a neutral position at the next time step $t + 1$ and the current cash value of the agent $v^c_t$. The expression tests whether or not the agent could pay its debt in the worst assumed case at the next time step, if nothing is done at the current time step ($Q_t = 0$). Two cases arise depending on the sign of the quantity $\Delta_t$:

Case of $\Delta_t < 0$: the trading agent has no problem paying its debt in the situation previously described. This is always true when the agent owns a positive number of shares ($n_t \geq 0$), and it also holds when the agent owns a negative number of shares ($n_t < 0$) and the price decreases ($p_t < p_{t-1}$), due to the hypothesis that Equation 13 was verified at time step $t$. In this case, the most constraining lower bound of the two is the following:
$$\underline{Q}_t = \frac{\Delta_t}{p_t\, (2C + \epsilon\,(1 + C))}$$

Case of $\Delta_t \geq 0$: in this case, the most constraining lower bound of the two is the first one, which concludes the derivation of the lower bound of the RL action space A:
$$\underline{Q}_t = \frac{\Delta_t}{p_t\, \epsilon\, (1 + C)}$$
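For convenience, Theorems 1 and 2 translate directly into a small helper computing the admissible range of $Q_t$; this is a sketch with illustrative variable names, not the paper's implementation:

```python
def action_space_bounds(cash: float, shares: float, price: float,
                        cost_rate: float, epsilon: float) -> tuple[float, float]:
    """Admissible range [lower, upper] for the number of shares Q_t traded at time t.

    cost_rate corresponds to C, and epsilon to the worst-case relative price
    variation parameter of Equation 13.
    """
    # Theorem 1: the cash value must remain positive after the trade.
    upper = cash / (price * (1.0 + cost_rate))
    # Theorem 2: enough cash must remain to return to a neutral position (n_t = 0).
    delta = -cash - shares * price * (1.0 + epsilon) * (1.0 + cost_rate)
    if delta >= 0:
        lower = delta / (price * epsilon * (1.0 + cost_rate))
    else:
        lower = delta / (price * (2.0 * cost_rate + epsilon * (1.0 + cost_rate)))
    return lower, upper
```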