AAMDRL: Augmented Asset Management with Deep Reinforcement Learning
Eric Benhamou, David Saltiel, Sandrine Ungari, Abhishek Mukhopadhyay, Jamal Atif
Eric Benhamou, David Saltiel: AI Square Connect, France, {eric.benhamou,david.saltiel}@aisquareconnect.com
Eric Benhamou, Jamal Atif: MILES, LAMSADE, Dauphine University, France, eric.benhamou, [email protected]
David Saltiel: LISIC, ULCO, France, [email protected]
Sandrine Ungari: Societe Generale, Cross Asset Quantitative Research, UK
Abhishek Mukhopadhyay: Societe Generale, Cross Asset Quantitative Research, France
{sandrine.ungari,abhishek.mukhopadhyay}@sgcib.com

Abstract
Can an agent learn efficiently in a noisy and self-adapting environment with sequential, non-stationary and non-homogeneous observations? Through trading bots, we illustrate how Deep Reinforcement Learning (DRL) can tackle this challenge. Our contributions are threefold: (i) the use of contextual information, also referred to as an augmented state, in DRL, (ii) the impact of a one-period lag between observations and actions, which is more realistic for an asset management environment, (iii) the implementation of a new repetitive train-test method called walk-forward analysis, similar in spirit to cross-validation for time series. Although our experiment is on trading bots, it can easily be translated to other bot environments that operate in a sequential environment with regime changes and noisy data. Our experiment for an augmented asset manager interested in finding the best portfolio for hedging strategies shows that AAMDRL achieves superior returns and lower risk.
Introduction
Can a bot learn efficiently in a noisy and self-adapting environment with sequential, non-stationary and non-homogeneous observations? By noisy and non-homogeneous, we mean that data have different statistical properties across time. By sequential observations, we mean that chronological order matters and that observations are completely modified if we change their order. To answer this question, we use trading bots, which offer a perfect example of noisy data and strongly sequential observations subject to regime changes. We aim at creating an augmented asset manager bot for the asset management industry.

Asset management is a well-suited industry in which to apply robotic machine learning: a large amount of data is available thanks to electronic trading, and there is strategic interest in bots as they are immune to the emotional and behavioral biases largely described in Kahneman (2011) that can cause an asset manager's ruin. However, machine learning is hardly used to make investment decisions. Because of the complexity of the learning environment, asset managers still largely rely on traditional methods based on human decisions. This is in sharp contrast with recent advances of deep reinforcement learning (DRL) on challenging tasks like games
(Atari games from raw pixel inputs Mnih et al. (2013, 2015), Go Silver et al. (2016), StarCraft II Vinyals et al. (2019)), but also more robotic learning like advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. (2015, 2016), Schulman et al. (2015a,b, 2017), Lillicrap et al. (2015)), autonomous driving (Wang, Jia, and Weng (2018)) and general bot learning (Gu et al. (2017)).

We investigate whether deep reinforcement learning can help create an augmented asset manager when solving a classical portfolio allocation problem: finding hedging strategies for an existing portfolio, the MSCI World index in our example. The hedging strategies are different strategies operated by standard bots that have different logics and perform well in different market conditions. Knowing when to add and remove them, and when to decrease or increase their assets under management, is a fundamental but challenging question for an augmented asset manager.
Related works
At first, reinforcement learning was not used in portfolio allocation. Initial works focused on trying to use deep networks to forecast next-period prices, as presented in Freitas, De Souza, and Almeida (2009), Niaki and Hoseinzade (2013), Heaton, Polson, and Witte (2017). These models solved a supervised learning task akin to a regression: they tried to predict prices using past information only and computed portfolio allocations based on the forecast. For asset managers, this initial usage of machine learning contains multiple problems. First, it does not ensure that the prediction is reliable in the near future: financial markets are well known to be non-stationary and to present regime changes, as illustrated in Salhi et al. (2015), Dias, Vermunt, and Ramos (2015), Zheng, Li, and Xu (2019). Second, this approach does not address the question of finding the optimal portfolio based on some reward metrics. Third, it does not adapt to a changing environment and does not easily incorporate transaction costs. A second stream of research around deep reinforcement learning has emerged to address those points: Jiang and Liang (2016); Zhengyao et al. (2017); Liang et al. (2018); Yu et al. (2019); Wang and Zhou (2019); Liu et al. (2020); Ye et al. (2020); Li et al. (2019); Xiong et al. (2019). The dynamic nature of reinforcement learning makes it an obvious candidate for a changing environment (Jiang and Liang 2016).

Contributions
Our contributions are threefold:

• The addition of contextual information. Using just past information is not sufficient for bot learning in a noisy and fast-changing environment. The addition of contextual information improves results significantly. Technically, we create two sub-networks: one fed with direct observations (past prices and standard deviation) and another one with contextual information (level of risk aversion in financial markets, early warning indicators for future recession, corporate earnings...).

• One-day lag between price observation and action. We assume that prices are observed at time t but action only occurs at time t + 1, to be consistent with reality. This one-day lag makes the RL problem more realistic but also more challenging.

• The walk-forward procedure. Because of the non-stationary nature of time-dependent data, and especially financial data, it is crucial to test DRL models' stability. We present a new methodology in DRL model evaluation referred to as walk-forward analysis, which iteratively trains and tests the model on an extending data set. This can be seen as the analogue of cross-validation for time series. It allows us to validate that the selected hyperparameters work well over time and that the resulting models are stable over time.
Background and mathematical formulation
In standard bot reinforcement learning, models are based on the Markov Decision Process (MDP) as in Sutton and Barto (2018). An MDP assumes that the bot knows all the states of the environment and has all the information to make the optimal decision in every state. The Markov property in addition implies that knowing the current state is sufficient. Yet the traditional MDP framework is inappropriate here: noise may arise in financial market data due to unpredictable external events. We prefer to use the Partially Observable Markov Decision Process (POMDP), as presented initially in Astrom (1969). In a POMDP, only a subset of the information of a given state is available. The partially informed agent cannot behave optimally. It uses a window of past observations to replace states as in a traditional MDP.

Mathematically, a POMDP is a generalization of an MDP. Recall that an MDP assumes a 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$ where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the state-action-to-next-state transition probability function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and $R$ is the immediate reward. The goal of the agent is to learn a policy $\mu : \mathcal{S} \to \mathcal{A}$ that maps states to the optimal action and maximizes the expected discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$. A POMDP adds two more variables to the tuple, $\mathcal{O}$ and $Z$, where $\mathcal{O}$ is the set of observations and $Z$ is the observation transition function $Z : \mathcal{S} \times \mathcal{A} \times \mathcal{O} \to [0, 1]$. At each time, the agent is asked to take an action $a_t \in \mathcal{A}$ in a particular environment state $s_t \in \mathcal{S}$, which is followed by the next state $s_{t+1}$ with transition probability $P(s_{t+1} \mid s_t, a_t)$. The next state $s_{t+1}$ is not observed by the agent. It rather receives an observation $o_{t+1} \in \mathcal{O}$ on the state $s_{t+1}$ with probability $Z(o_{t+1} \mid s_{t+1}, a_t)$.

From a practical standpoint, the general RL setting is modified by taking a pseudo-state formed with a set of past observations $(o_{t-n}, o_{t-n-1}, \ldots, o_{t-1}, o_t)$. In practice, to avoid large dimensions and the curse of dimensionality, it is useful to reduce this set and take only a subset of $j < n$ past observations, such that $0 < i_1 < \ldots < i_j$ with integers $i_k \in \mathbb{N}$ for $1 \leq k \leq j$. The set $\delta = (0, i_1, \ldots, i_j)$ is called the observation lags. In our experiment we typically use lag periods like (0, 1, 2, 3, 4, 20, 60) for daily data, where the tuple (0, 1, 2, 3, 4) covers the last week's observations, 20 is the one-month-ago observation (as there are approximately 20 business days in a month) and 60 the three-month-ago observation.

Observations
There are two types of observations: regular and contextual information. Regular observations are data directly linked to the problem to solve.

Regular observations

For a standard bot, regular observations come from its environment, like the position of an arm, its angle, etc. In the case of a trading bot, regular observations are past prices observed over a lag period $\delta = (0 < i_1 < \ldots < i_j)$. To re-normalize data, we rather use past returns computed as $r^k_t = \frac{p^k_t}{p^k_{t-1}} - 1$, where $p^k_t$ is the price at time $t$ of asset $k$. For a financial asset $k$, to give information about regime changes, our trading bot also receives the empirical standard deviation computed over a sliding estimation window of length $d$, $\sigma^k_t = \sqrt{\frac{1}{d} \sum_{u=t-d+1}^{t} (r^k_u - \mu^k)^2}$, where the empirical mean $\mu^k$ is computed as $\mu^k = \frac{1}{d} \sum_{u=t-d+1}^{t} r^k_u$. Hence our regular observations form a three-dimensional tensor $A_t = [A^1_t, A^2_t]$ with a returns layer and a volatility layer:

$A^1_t = \begin{pmatrix} r^1_{t-i_j} & \ldots & r^1_t \\ \vdots & \ddots & \vdots \\ r^m_{t-i_j} & \ldots & r^m_t \end{pmatrix}, \qquad A^2_t = \begin{pmatrix} \sigma^1_{t-i_j} & \ldots & \sigma^1_t \\ \vdots & \ddots & \vdots \\ \sigma^m_{t-i_j} & \ldots & \sigma^m_t \end{pmatrix}$

This setting with two layers (past returns and past volatilities) is quite different from the one presented in Jiang and Liang (2016); Zhengyao et al. (2017); Liang et al. (2018), which uses different layers representing open, high, low and close prices. There are various remarks to be made. First, high/low information does not make sense for portfolio strategies that are only evaluated daily, which is the case for all the funds. Second, open, high and low prices tend to be highly correlated, creating some noise in the inputs. Third, the concept of volatility is crucial to detect regime changes and is surprisingly absent from these works, as well as from other works like Yu et al. (2019); Wang and Zhou (2019); Liu et al. (2020); Ye et al. (2020); Li et al. (2019); Xiong et al. (2019).
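To make the construction concrete, here is a minimal numpy sketch of the tensor $A_t$ built from a price history. It is our own illustration, not the authors' code: the function name is hypothetical, while the lag set and window length follow the values quoted above.

import numpy as np

LAGS = np.array([60, 20, 4, 3, 2, 1, 0])   # observation lags delta, oldest first
VOL_WINDOW = 20                             # sliding window d for the standard deviation

def regular_observation(prices: np.ndarray, t: int) -> np.ndarray:
    """prices: (T, m) daily prices of m assets; returns the (2, m, n_lags) tensor A_t."""
    returns = prices[1:] / prices[:-1] - 1.0            # r^k_t = p^k_t / p^k_{t-1} - 1
    # rolling standard deviation of the last VOL_WINDOW returns, one row per date
    vols = np.array([returns[u - VOL_WINDOW + 1:u + 1].std(axis=0)
                     for u in range(VOL_WINDOW - 1, len(returns))])
    r_slice = returns[t - LAGS].T                       # (m, n_lags) lagged returns
    v_slice = vols[t - (VOL_WINDOW - 1) - LAGS].T       # vols[i] ends at return index i + VOL_WINDOW - 1
    return np.stack([r_slice, v_slice])                 # A_t = [A^1_t, A^2_t]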
Contextual observations
Contextual observations are additional information that provides intuition about the current context. For our asset manager bot, they are other financial data not directly linked to its portfolio, assumed to have some predictive power for portfolio assets. In the case of a financial portfolio, contextual information is typically modelled by a large range of features:

• the level of risk aversion in financial markets, or market sentiment, measured as an indicator varying between 0 for maximum risk aversion and 1 for maximum risk appetite,

• the bond/equity historical correlation, a classical ex-post measure of the diversification benefits of a duration hedge, measured on 1-month, 3-month and 1-year rolling windows,

• the credit spreads of global corporates - investment grade, high yield, in Europe and in the US - known to be an early indicator of potential economic tensions,

• the equity implied volatility, a measure of the 'fear factor' in financial markets,

• the spread between the yield of Italian government bonds and the German government bond, a measure of potential tensions in the European Union,

• the US Treasury slope, a classical early indicator of US recession,

• and some more financial variables, often used as a gauge for global trade and activity: the dollar, the level of rates in the US, the estimated earnings per share (EPS).

On top of these observations, we also include the maximum and minimum portfolio strategies' returns and the maximum portfolio strategies' volatility. The latter information, as for regular observations, is motivated by the stylized fact that standard deviations are useful features to detect crises. Contextual observations are stored in a 2D matrix denoted by $C_t$ that stacks the past values of $p$ individual contextual observations:

$C_t = \begin{pmatrix} c^1_t & \ldots & c^1_{t-i_k} \\ \vdots & \ddots & \vdots \\ c^p_t & \ldots & c^p_{t-i_k} \end{pmatrix}$

The matrix nature of the contextual state $C_t$ implies in particular that we will use 1D convolutions should we use convolutional layers. All in all, the augmented observations write as $O_t = [A_t, C_t]$, with $A_t = [A^1_t, A^2_t]$, and feed the two sub-networks of our global network as presented in figure 1.

Figure 1: Network architecture

Action
In our deep reinforcement learning setting, the augmented asset manager trading bot needs to decide at each period in which hedging strategies it invests. The augmented asset manager can invest in $l$ strategies, which can be simple strategies or strategies that are themselves run by asset management bots. To cope with reality, the bot will only be able to act after one period. This is because asset managers have a one-day turnaround to change their positions. We will see in the experiments that this one-day turnaround lag makes a big difference in the results. As the bot has access to $l$ potential hedging strategies, the output is an $l$-dimensional vector that provides how much it invests in each hedging strategy. For our deep network, this means that the last layer is a softmax layer to ensure that portfolio weights are between 0 and 1 and sum to 1.
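The sketch below illustrates this architecture in PyTorch. Layer sizes and names are our own assumptions, not the paper's exact hyperparameters: one branch convolves the asset tensor $A_t$, the other applies 1D convolutions to the contextual matrix $C_t$, and the merged features end in a softmax so that the $l$ portfolio weights lie in [0, 1] and sum to 1.

import torch
import torch.nn as nn

class AugmentedAllocator(nn.Module):
    """Hypothetical two-sub-network allocator in the spirit of figure 1."""
    def __init__(self, m_assets: int, n_lags: int, p_context: int, l_strategies: int):
        super().__init__()
        self.asset_net = nn.Sequential(               # sub-network 1: A_t (2 input layers)
            nn.Conv2d(2, 5, kernel_size=(1, 2)), nn.ReLU(), nn.Flatten())
        self.context_net = nn.Sequential(             # sub-network 2: C_t (1D convolution)
            nn.Conv1d(p_context, 2, kernel_size=2), nn.ReLU(), nn.Flatten())
        asset_dim = 5 * m_assets * (n_lags - 1)       # same lag length assumed for both inputs
        context_dim = 2 * (n_lags - 1)
        self.head = nn.Sequential(
            nn.Linear(asset_dim + context_dim, 64), nn.ReLU(),
            nn.Linear(64, l_strategies), nn.Softmax(dim=-1))   # weights in [0, 1], sum to 1

    def forward(self, assets: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.asset_net(assets), self.context_net(context)], dim=-1)
        return self.head(feats)

# e.g. AugmentedAllocator(m_assets=4, n_lags=7, p_context=8, l_strategies=4)(A, C)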
Reward

There are multiple choices for our reward. A straightforward reward function is the final net performance of our portfolio, computed as the value of the portfolio at the last training date $t_T$ over its initial value at time $t_0$, minus one: $P_{t_T}/P_{t_0} - 1$. Another natural reward function is the Sortino ratio, a variation of the Sharpe ratio where risk is measured by the downside standard deviation (instead of the regular standard deviation), that is, the standard deviation computed only on the negative daily returns, denoted by $(\tilde{r}_t)_{t=0..T}$. The downside standard deviation $\mathrm{StdDev}[(\tilde{r}_t)_{t=0..T}]$ is then annualized by multiplying by the square root of the number of trading days per year.

Adversarial Policy Gradient

A policy is a mapping from the observation space to the action space, $\pi : \mathcal{O} \to \mathcal{A}$. To achieve this, a policy is specified by a deep network with a set of parameters $\vec{\theta}$. The action is a vector function of the observation given the parameters: $\vec{a}_t = \pi_{\vec{\theta}}(o_t)$. The performance metric of $\pi_{\vec{\theta}}$ for the time interval $[0, t]$ is defined as the corresponding total reward function of the interval, $J_{[0,t]}(\pi_{\vec{\theta}}) = R(\vec{o}_1, \pi_{\vec{\theta}}(o_1), \cdots, \vec{o}_t, \pi_{\vec{\theta}}(o_t), \vec{o}_{t+1})$. After random initialization, the parameters are continuously updated along the gradient direction with a learning rate $\lambda$: $\vec{\theta} \longrightarrow \vec{\theta} + \lambda \nabla_{\vec{\theta}} J_{[0,t]}(\pi_{\vec{\theta}})$. The gradient ascent optimization is done with the standard Adam (short for Adaptive Moment Estimation) optimizer to have the benefit of adaptive gradient descent with root mean square propagation (Kingma and Ba 2014). The whole process is summarized in algorithm 1, called adversarial policy gradient as we introduce randomization both in the observations and in the action (to have standard exploration-exploitation). This two-step randomization ensures more robust training, as we will see in the experiments. Noise in observations has already been suggested to improve training in Liang et al. (2018).
Algorithm 1: Adversarial Policy Gradient

Input: initial policy parameters $\theta$, empty replay buffer $D$
repeat
    reset replay buffer
    while not terminal do
        Observe observation $o$ and select action $a = \pi_{\theta}(o)$ with probability $p$, and a random action with probability $1 - p$
        Execute $a$ in the environment
        Observe next observation $o'$, reward $r$, and done signal $d$ to indicate whether $o'$ is terminal
        Apply noise to next observation $o'$
        Store $(o, a, o')$ in replay buffer $D$
        if terminal then
            for however many updates in $D$ do
                compute final reward $R$
            end for
            Update network parameters with Adam gradient ascent: $\vec{\theta} \longrightarrow \vec{\theta} + \lambda \nabla_{\vec{\theta}} J_{[0,t]}(\pi_{\vec{\theta}})$
        end if
    end while
until convergence
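A compact way to read algorithm 1 is the following Python sketch. The environment interface (a gym-like reset()/step() returning augmented observations and the strategies' price relatives) and all names are our own illustrative assumptions, not the authors' implementation; the episode reward J is written as a sum of portfolio log-returns (a net-profit-style reward) so that it stays differentiable in the weights and can be ascended directly with Adam.

import torch

def train_adversarial_pg(policy, env, episodes=500, lr=1e-2,
                         explore_p=0.9, obs_noise=0.002):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)        # Adam gradient ascent
    for _ in range(episodes):
        (assets, context), done = env.reset(), False
        log_returns = []
        while not done:
            weights = policy(assets, context)                 # a = pi_theta(o)
            if torch.rand(1).item() > explore_p:              # random action w.p. 1 - p
                weights = torch.softmax(torch.randn_like(weights), dim=-1)
            (assets, context), rel, done = env.step(weights)  # rel: next price relatives
            assets = assets + obs_noise * torch.randn_like(assets)  # adversarial observation noise
            log_returns.append(torch.log((weights * rel).sum()))    # differentiable reward piece
        J = torch.stack(log_returns).sum()                    # episode reward J_[0,t]
        opt.zero_grad(); (-J).backward(); opt.step()          # ascend J by minimizing -J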
Walk forward analysis

In machine learning, the standard approach is to do $k$-fold cross-validation, as shown in figure 2. White rectangles represent training periods while grey rectangles represent testing periods. This approach breaks the chronology of data and potentially uses past data in the test set. Rather, we can take a sliding test set and use past data as training data. We can either take a training data set with a fixed starting point and grow it by adding more and more data, which is what we call extending walk forward, as shown in figure 3, or always take the same amount of data and slide the training window (figure 4), hence the name sliding walk forward. Extending walk forward tends to produce more stable models, as we incrementally add new data at each training step and share all past data. The downside is that it adapts slowly to new information. Conversely, sliding walk forward leads to more rapidly changing models, as we progressively drop old data and hence give more weight to recent data. In our experience, because we do not have much data to train our DRL model, it is better to use extending walk forward. Last but not least, as the test set always comes after the train set, walk-forward analysis gives fewer steps compared to cross-validation. In practice, for our data set, we train our models from 2000 to the end of 2006 (to have at least seven years of data) and use an extending test period of one year.
Figure 2: k-fold cross validation
Figure 3: Extending Walk Forward
Figure 4: Sliding Walk Forward
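For concreteness, both schemes can be expressed as simple index generators over n daily observations; the sketch below is our own illustration (function names and window sizes are assumptions, not the paper's).

def extending_walk_forward(n, initial_train, test_size):
    """Fixed start, growing train set; the test set always follows the train set."""
    start_test = initial_train
    while start_test + test_size <= n:
        yield range(0, start_test), range(start_test, start_test + test_size)
        start_test += test_size

def sliding_walk_forward(n, train_size, test_size):
    """Constant-size train window that slides forward with the test set."""
    start = 0
    while start + train_size + test_size <= n:
        yield (range(start, start + train_size),
               range(start + train_size, start + train_size + test_size))
        start += test_size

For instance, extending_walk_forward(n, initial_train=7 * 252, test_size=252) mirrors the seven-year initial training window and one-year extending test periods used here.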
Experiments
Goal of the experiment
Data-set description
Systematic strategies are asset management bots that invest in financial markets according to adaptive and pre-defined trading rules. Here, we use 4 SG CIB proprietary 'hedging strategies' that tend to perform when stock markets are down:

• Directional hedges - react to small negative returns in equities,

• Gap risk hedges - perform well in sudden market crashes,

• Proxy hedges - tend to perform in some market configurations, for example when highly indebted stocks under-perform other stocks,

• Duration hedges - invest in the bond market, a classical diversifier to equity risk in finance.

The underlying financial instruments vary from put options, listed futures and single stocks to government bonds. Some of those strategies are akin to an insurance contract and bear a negative cost over the long run. The challenge consists in balancing cost versus benefit. In practice, asset managers have to decide how much of these hedging strategies is needed on top of an existing portfolio to achieve a better risk-reward profile. The decision-making process is often based on contextual information, such as the economic and geopolitical environment, the level of risk aversion among investors and other correlation regimes. A cross-validation step selects the most relevant contextual features. In the present case, the first three features are selected. The rebalancing of strategies in the portfolio comes with transaction costs, which can be quite high since hedges use options. Transaction costs are like frictions in physical systems. They are taken into account dynamically to penalize solutions with a high turnover rate.
Evaluation metrics
Asset managers use a wide range of metrics to gauge the success of their investment decisions. For a thorough review of those metrics, see for example Cogneau and Hübner (2009). To keep things simple, we use the following metrics:

• annualized return, defined as the average annualized compounded return,

• annualized daily-based Sharpe ratio, defined as the ratio of the annualized return over the annualized daily-based volatility, $\mu / \sigma$,

• Sortino ratio, computed as the ratio of the annualized return over the downside standard deviation,

• maximum drawdown, denoted by max DD in table 4.

Let $P_T$ be the final value of the portfolio at time $T$ and $P_0$ its initial value at time $t = 0$. Let $\tau$ be the year fraction of the final time $T$. The annualized return is defined as $\mu = (P_T / P_0)^{1/\tau} - 1$. The maximum drawdown is computed as the maximum of all daily drawdowns. The daily drawdown is the ratio of the difference between the running maximum of the portfolio value ($RM_T = \max_{t=0..T}(P_t)$) and the portfolio value, over the running maximum of the portfolio value. Hence $DD_T = (RM_T - P_T) / RM_T$ and $MDD_T = \max_{t=0..T}(DD_t)$.
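These metrics translate directly into code. The following numpy sketch is our own illustrative implementation, assuming a 252-trading-day year for annualization; the function name is hypothetical.

import numpy as np

TRADING_DAYS = 252   # assumed annualization factor

def evaluation_metrics(daily_returns: np.ndarray, year_fraction: float) -> dict:
    """Computes the four metrics above from a series of daily portfolio returns."""
    values = np.cumprod(1.0 + daily_returns)               # portfolio value path, P_0 = 1
    mu = values[-1] ** (1.0 / year_fraction) - 1.0         # annualized return
    sigma = daily_returns.std() * np.sqrt(TRADING_DAYS)    # annualized volatility
    downside = daily_returns[daily_returns < 0].std() * np.sqrt(TRADING_DAYS)
    running_max = np.maximum.accumulate(values)            # RM_t
    max_dd = ((running_max - values) / running_max).max()  # MDD_T
    return {"return": mu, "sharpe": mu / sigma, "sortino": mu / downside, "maxDD": max_dd}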
Baseline

Pure risky asset
This first evaluation compares our portfolio composed only of the risky asset (in our case, the MSCI World index) with the one augmented by the trading bot and composed of the risky asset and the hedging overlay. If our bot is successful in identifying good hedging strategies, it should improve the overall portfolio and achieve a better performance than the risky asset alone.
Markowitz theory
The standard approach for portfolio allocation in finance is the Markowitz model (Markowitz (1952)). It computes the portfolio with minimum variance given an expected return, which in our experiment is taken to be the average return of the hedging strategies over the last year. The intuition in Markowitz (or mean-variance) portfolio theory is that an investor wants the lowest risk for a given return. In practice, we solve a quadratic program that finds the minimum portfolio variance under the constraint that the expected return is greater than or equal to the minimum return. In our baseline, the Markowitz portfolio is recomputed every 6 months to have something dynamic that copes with regime changes.
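A minimal sketch of this quadratic program, assuming scipy is available (our own illustration with long-only weights; the function name is hypothetical):

import numpy as np
from scipy.optimize import minimize

def markowitz_weights(returns: np.ndarray, min_return: float) -> np.ndarray:
    """returns: (T, l) daily strategy returns; output: l long-only portfolio weights."""
    mu, cov = returns.mean(axis=0), np.cov(returns.T)
    n = returns.shape[1]
    res = minimize(
        fun=lambda w: w @ cov @ w,                    # portfolio variance to minimize
        x0=np.full(n, 1.0 / n),                       # start from equal weights
        bounds=[(0.0, 1.0)] * n,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0},          # fully invested
                     {"type": "ineq", "fun": lambda w: w @ mu - min_return}]) # return floor
    return res.x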
Follow the winner
This simple strategy consists in selecting the hedging strategy that was the best performer in the past year. If there is some persistence of the hedging strategies' performance over time, this simple methodology should work well. It replicates the standard behavior of investors, who tend to select strategies that performed well in the past.
Follow the loser
Follow the loser is the opposite of follow the winner. It assumes that there is some mean reversion in strategies' performance, meaning that strategies tend to perform equally well over the long term and mean-revert around their trend. Hence if a strategy did not perform well in the past, and if there is mean reversion, there is a good chance that this strategy will catch up with its peers.
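Both baselines reduce to a one-line ranking rule, as in this hypothetical sketch (a 252-day lookback is our assumption for "the past year"):

import numpy as np

def follow_the(returns: np.ndarray, winner: bool = True, lookback: int = 252) -> np.ndarray:
    """returns: (T, l) daily strategy returns; allocates 100% to a single strategy."""
    past_perf = (1.0 + returns[-lookback:]).prod(axis=0)   # cumulative 1-year performance
    pick = past_perf.argmax() if winner else past_perf.argmin()
    weights = np.zeros(returns.shape[1])
    weights[pick] = 1.0
    return weights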
Results and discussion
We compare the performance of the following 5 models: the DRL model based on convolutional networks with contextual states (sentiment indicator, 6-month correlation between equities and bonds, and the main credit index), the same DRL model without contextual states, follow the winner, follow the loser and the Markowitz portfolio. The resulting graphics are displayed in figure 5, with the risky-asset position alone in blue and the models in orange. Out of these 5 models, only DRL and follow the winner are able to provide a significant net performance increase thanks to an efficient hedging strategy over the 2007 to 2020 period. The DRL model is in addition able to better adapt to the Covid crisis and achieves better efficiency in net return, but also in Sharpe and Sortino ratios, over 3 and 5 years, as shown in table 4. In terms of the smallest maximum drawdown, the follow-the-loser model is able to significantly reduce the maximum drawdown, but at the price of lower return, Sharpe and Sortino ratios. Removing contextual information deteriorates model performance significantly, as illustrated by the difference in terms of return, Sharpe and Sortino ratios and maximum drawdown between the DRL and the DRL-no-context models. Last but not least, the Markowitz model is not able to adapt to the regime change from 2015 onwards despite its good performance from 2007 to 2015. It is the worst performer over the last 3 and 5 years because of this lack of adaptation. For all models, we use the walk-forward analysis described in the corresponding section. Hence, we start by training the models from 2000 to the end of 2006 and use the best model on the test set in 2007. We then train the model from 2000 to the end of 2007 and use the best model on the test set in 2008. In total, we do 14 trainings (from 2007 to 2020). This process ensures that we detect models that are unstable over time and is similar in spirit to delayed online training.
Impact of context
In table 1, we provide a list of 32 models based on the following choices: network architecture (LSTM or CNN), adversarial training with noise in data or not, use of contextual states or not, reward function (net profit or Sortino), and use of a day lag between observations and actions or not. We see that the best DRL model under the day-lag turnover constraint is the one using convolutional networks, adversarial training, contextual states and the net profit reward function. These 4 parameters are meaningful for our DRL model and change model performance substantially, as illustrated by the table. We also compare the same model with and without contextual states and see in table 3 that the use of contextual states improves model performance substantially. This is quite intuitive, as we provide more meaningful data to the model.
Impact of one day lag
Recalling that asset managers cannot immediately change their positions at the close of the financial markets, modeling the one-day turnover lag is also significant, as shown in table 2. It is not surprising that a delayed action after an observation makes the learning process more challenging for the DRL agent, as the influence of variables tends to decrease with time. Surprisingly, this salient modeling characteristic is ignored in the existing literature: Zhengyao et al. (2017); Liang et al. (2018); Yu et al. (2019); Wang and Zhou (2019); Liu et al. (2020); Ye et al. (2020); Li et al. (2019).
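In a backtest, the lag amounts to shifting the weight matrix by one extra day before applying it to realized returns, as in this small illustrative sketch (names and interface are our own):

import numpy as np

def backtest_pnl(weights: np.ndarray, returns: np.ndarray, lag: int = 1) -> np.ndarray:
    """weights[t] is decided from information up to the close of day t; with lag = 1
    the position is only in place one day later than in the idealized setting."""
    shift = 1 + lag                     # lag = 0 recovers the no-lag setting
    return (weights[:-shift] * returns[shift:]).sum(axis=1)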
Table 1: Model comparison based on reward function, network (CNN or LSTM units), adversarial training (noise in data) and use of contextual states

reward | network | adversarial training | contextual states | performance with 1-day lag | performance with 0-day lag
Net Profit | CNN | Yes | Yes | 81.8% | 123.8%
Net Profit | CNN | No | Yes | 75.2% | 112.3%
Net Profit | LSTM | Yes | Yes | 65.9% | 98.8%
Net Profit | LSTM | No | Yes | 64.5% | 98.5%
Sortino | LSTM | No | Yes | 61.8% | 87.4%
Net Profit | LSTM | No | No | 56.6% | 59.8%
Sortino | LSTM | No | No | 48.5% | 51.4%
Net Profit | LSTM | Yes | No | 47.5% | 50.8%
Sortino | LSTM | Yes | Yes | 29.6% | 47.6%
Sortino | LSTM | Yes | No | 28.4% | 47.0%
Sortino | CNN | No | Yes | 26.5% | 45.3%
Sortino | CNN | Yes | Yes | 26.3% | 29.3%
Sortino | CNN | Yes | No | -16.7% | 16.9%
Net Profit | CNN | Yes | No | -29.5% | 13.9%
Sortino | CNN | No | No | -45.0% | 10.6%
Net Profit | CNN | No | No | -47.7% | 8.6%
Table 2: Impact of day lag

reward | network | adversarial training | contextual states | day lag impact
Net Profit | CNN | Yes | Yes | -42.0%
Net Profit | CNN | No | Yes | -37.2%
Net Profit | LSTM | Yes | Yes | -32.9%
Net Profit | LSTM | No | Yes | -34.0%
Sortino | LSTM | No | Yes | -25.6%
Net Profit | LSTM | No | No | -3.2%
Sortino | LSTM | No | No | -2.9%
Net Profit | LSTM | Yes | No | -3.3%
Sortino | LSTM | Yes | Yes | -18.0%
Sortino | LSTM | Yes | No | -18.7%
Sortino | CNN | No | Yes | -18.8%
Sortino | CNN | Yes | Yes | -3.0%
Sortino | CNN | Yes | No | -33.6%
Net Profit | CNN | Yes | No | -43.4%
Sortino | CNN | No | No | -55.6%
Net Profit | CNN | No | No | -56.3%
Table 3: Impact of contextual state

reward | network | adversarial training | contextual states impact
Net Profit | CNN | Yes | 111.4%
Net Profit | CNN | No | 122.9%
Net Profit | LSTM | Yes | 18.5%
Net Profit | LSTM | No | 7.9%
Sortino | LSTM | No | 13.3%
Sortino | LSTM | Yes | 1.2%
Sortino | CNN | No | 71.5%
Sortino | CNN | Yes | 43.0%

Table 4: Model performance over 3 and 5 years

3 Years | return | Sortino | Sharpe | max DD
DRL | | | | -0.27
Winner | 13.19% | 0.66 | 0.72 | -0.35
Loser | 9.30% | 0.89 | 0.89 | -0.15
DRL no context | 8.11% | 0.42 | 0.47 | -0.34
Markowitz | -0.31% | -0.01 | -0.01 | -0.41

5 Years | return | Sortino | Sharpe | max DD
Risky asset | 9.16% | 0.54 | 0.57 | -0.34
DRL | | | | -0.27
Winner | 10.84% | 0.65 | 0.68 | -0.35
Loser | 7.04% | 0.78 | 0.76 | -0.15
DRL no context | 6.87% | 0.44 | 0.47 | -0.34
Markowitz | -0.07% | -0.00 | -0.00 | -0.41
Table 5: Hyperparameters used

hyperparameter | value | description
batch size | 50 | mini-batch size during training
regularization coefficient | 1e-8 | L2 regularization coefficient applied to network training
learning rate | 0.01 | step size parameter in Adam
standard deviation period | 20 | days period for the standard deviation in asset states
commission | 30 bps | commission rate
stride | 2,1 | stride in convolution networks
conv number 1 | 5,10 | number of convolutions in sub-network 1
conv number 2 | 2 | number of convolutions in sub-network 2
lag period 1 | [60, 20, 4, 3, 2, 1, 0] | lag period for asset states
lag period 2 | [60, 20, 4, 3, 2, 1, 0] | lag period for contextual states
noise | 0.002 | adversarial Gaussian standard deviation
max iterations* | 500 | maximum number of iterations
early stop iterations* | 50 | early stop criterion
random seed | 12345 | random seed

*: the number of iterations is at maximum 500, provided we do not stop because of early stop detection. We do early stop if there is no improvement on the train set over the last 50 iterations.

Conclusion

In this paper, we address the challenging task of learning in a noisy and self-adapting environment with sequential, non-stationary and non-homogeneous observations, for a bot and more specifically for a trading bot. Our approach is based on deep reinforcement learning using contextual information thanks to a second sub-network. We also show that the additional constraint of a delayed action following observations has a substantial impact that should not be overlooked. We introduce the novel concept of walk-forward analysis to test the robustness of the deep RL model. This is very important for a regime-changing environment that cannot be evaluated with a simple train-validation-test procedure, nor with a k-fold cross-validation that ignores the strongly chronological nature of observations. For our trading bots, we take not only past performances of portfolio strategies over different rolling periods, but also standard deviations, to provide predictive variables for regime changes. Augmented states with contextual information make a big difference in the model and help the bot learn more efficiently in a noisy environment. In our experiments, the contextual-based approach over-performs baseline methods like Markowitz or the naive follow-the-winner and follow-the-loser strategies. Last but not least, it is quite important to fine-tune the numerous hyperparameters of the contextual-based DRL model, namely the various lags (lag periods for the sub-network fed by the portfolio strategies' past returns, lag periods for the common contextual features referred to as the common features in the paper), the standard deviation period, the learning rate, etc.

Despite the efficiency of contextual-based DRL models, there is room for improvement. Other information like news could be incorporated to keep increasing model performance. For large stocks, like tech stocks, sentiment information based on social media activity could also be relevant.

References

Astrom, K. 1969. Optimal control of Markov processes with incomplete state-information II. The convexity of the loss-function. Journal of Mathematical Analysis and Applications.

Cogneau, P.; and Hübner, G. 2009. The 101 Ways to Measure Portfolio Performance. SSRN Electronic Journal.

Dias, J.; Vermunt, J.; and Ramos, S. 2015. Clustering financial time series: New insights from an extended hidden Markov model. European Journal of Operational Research.

Freitas, F. D.; De Souza, A. F.; and Almeida, A. R. 2009. Prediction-based portfolio optimization model using neural networks. Neurocomputing 72: 2155–2170.

Gu, S.; Holly, E.; Lillicrap, T.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), 3389–3396.

Heaton, J. B.; Polson, N. G.; and Witte, J. H. 2017. Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry.

Jiang, Z.; and Liang, J. 2016. Cryptocurrency Portfolio Management with Deep Reinforcement Learning. arXiv e-prints.

Kahneman, D. 2011. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux.

Kingma, D.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization.

Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2015. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research.

Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. The International Journal of Robotics Research.

Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic Bull or Pessimistic Bear: Adaptive Deep Reinforcement Learning for Stock Portfolio Allocation. In ICML.

Liang et al. 2018. Adversarial Deep Reinforcement Learning in Portfolio Management.

Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR.

Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; and Liu, C. 2020. Adaptive Quantitative Trading: an Imitative Deep Reinforcement Learning Approach. In AAAI.

Markowitz, H. 1952. Portfolio Selection. The Journal of Finance.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature.

Niaki, S. T. A.; and Hoseinzade, S. 2013. Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International.

Salhi, K.; Deaconu, M.; Lejay, A.; Champagnat, N.; and Navet, N. 2015. Regime switching model for financial data: empirical risk analysis. Physica A: Statistical Mechanics and its Applications.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015a. Trust Region Policy Optimization. In ICML.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR.

Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, second edition.

Théate, T.; and Ernst, D. 2020. Application of deep reinforcement learning in stock trading strategies and stock forecasting.

Vinyals, O.; Babuschkin, I.; Czarnecki, W.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.; Powell, R.; Ewalds, T.; Georgiev, P.; Oh, J.; Horgan, D.; Kroiss, M.; Danihelka, I.; Huang, A.; Sifre, L.; Cai, T.; Agapiou, J.; Jaderberg, M.; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.

Wang, H.; and Zhou, X. Y. 2019. Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework. arXiv e-prints.

Wang, S.; Jia, D.; and Weng, X. 2018. Deep Reinforcement Learning for Autonomous Driving. ArXiv abs/1811.11329.

Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; and Fujita, H. 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences.

Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and Li, B. 2020. Reinforcement-Learning Based Portfolio Management with Augmented Asset Movement Prediction States. In Advancement of Artificial Intelligence conference proceedings (AAAI).

Yu, P.; Lee, J. S.; Kulyatin, I.; Shi, Z.; and Dasgupta, S. 2019. Model-based Deep Reinforcement Learning for Financial Portfolio Optimization. RWSDM Workshop, ICML 2019.

Zhang, Z.; Zohren, S.; and Roberts, S. 2019. Deep Reinforcement Learning for Trading.

Zheng, K.; Li, Y.; and Xu, W. 2019. Regime switching model estimation: spectral clustering hidden Markov model. Annals of Operations Research.

Zhengyao et al. 2017. Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv.