Time your hedge with Deep Reinforcement Learning
Eric Benhamou, David Saltiel, Sandrine Ungari, Abhishek Mukhopadhyay
MILES, LAMSADE, Dauphine University, France, [email protected]
AI Square Connect, France, [email protected]
LISIC, ULCO, France
Societe Generale, Cross Asset Quantitative Research, UK
Societe Generale, Cross Asset Quantitative Research, France
{sandrine.ungari, abhishek.mukhopadhyay}@sgcib.com

Abstract
Can an asset manager plan the optimal timing for her/his hedging strategies given market conditions? The standard approach based on Markowitz or other more or less sophisticated financial rules aims to find the best portfolio allocation thanks to forecasted expected returns and risk, but fails to fully relate market conditions to hedging decisions. In contrast, Deep Reinforcement Learning (DRL) can tackle this challenge by creating a dynamic dependency between market information and hedging strategy allocation decisions. In this paper, we present a realistic and augmented DRL framework that: (i) uses additional contextual information to decide an action, (ii) has a one period lag between observations and actions to account for the one day lag turnover of common asset managers to rebalance their hedge, (iii) is fully tested in terms of stability and robustness thanks to a repetitive train-test method called anchored walk forward training, similar in spirit to k-fold cross validation for time series, and (iv) allows managing the leverage of our hedging strategy. Our experiment for an augmented asset manager interested in sizing and timing his hedges shows that our approach achieves superior returns and lower risk.
Introduction
From an external point of view, the asset management (buy side) industry is a well-suited industry in which to apply machine learning, as large amounts of data are available thanks to the revolution of electronic trading and the methodical collection of data by asset managers or their acquisition from data providers. In addition, machine based decisions can help reduce emotional bias and lead to rational and systematic investment choices (Kahneman 2011). However, to date, the buy side industry still largely relies on old and traditional methods to make investment decisions and in particular to choose portfolio allocation and hedging strategies. It is hardly using machine learning in investment decisions. This is in sharp contrast with the ubiquitous usage of deep reinforcement learning (DRL) in other industries and in particular its use for solving challenging tasks like autonomous driving (Wang, Jia, and Weng 2018), learning advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. 2015; Levine et al. 2016; Schulman et al. 2015; Schulman et al. 2017; Lillicrap et al. 2015), or, on a more conceptual side, reaching supra-human level in popular games like Atari (Mnih et al. 2013), Go (Silver et al. 2016; Silver et al. 2017) and StarCraft II (Vinyals et al. 2019). It therefore makes sense to investigate whether DRL can help in financial planning and in particular in creating augmented asset managers. To narrow down our problem, we are specifically interested in finding hedging strategies for a risky asset. To make things concrete and more illustrative, we represent this risky asset in our experiment with the MSCI World index, which captures large and mid cap securities across 23 developed financial markets. The targeted hedging strategies are on purpose different in nature and spirit. They are appropriate under distinctive market conditions. Financial planning is therefore critical for deciding the appropriate timing when to add and remove these hedging strategies.
Related works
At first, reinforcement learning was not used in portfolio allocation. Initial works focused on trying to make decisions using deep networks to forecast next period prices (Freitas, De Souza, and Almeida 2009; Niaki and Hoseinzade 2013; Heaton, Polson, and Witte 2017). Armed with the forecast, an augmented asset manager could solve its financial planning problem to decide the optimal portfolio allocations. However, this initial usage of machine learning contains multiple caveats. First, there is no guarantee that the forecast is reliable in the near future. On the contrary, it is a stylized fact that financial markets are non-stationary and exhibit regime changes (Salhi et al. 2015; Dias, Vermunt, and Ramos 2015; Zheng, Li, and Xu 2019), making the prediction exercise quite difficult and unreliable. Second, it does not target specifically the financial planning question of finding the optimal portfolio based on some reward metrics. Third, there is no consideration of online learning to adapt to a changing environment, nor of the incorporation of transaction costs.

A second stream of research around deep reinforcement learning has emerged to address these points (Jiang and Liang 2016; Jiang, Xu, and Liang 2017; Liang et al. 2018; Yu et al. 2019; Wang and Zhou 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019; Benhamou et al. 2020). The dynamic nature of reinforcement learning makes it an obvious candidate for a changing environment (Jiang and Liang 2016; Jiang, Xu, and Liang 2017; Liang et al. 2018). Transaction costs can be easily included in the rules (Liang et al. 2018; Yu et al. 2019; Wang and Zhou 2019; Liu et al. 2020; Ye et al. 2020). However, these works, except (Ye et al. 2020) and (Benhamou et al. 2020), rely only on time series of open, high, low and close prices, which are known to be very noisy. Secondly, they all assume an immediate action after observing prices, which is quite different from reality. Most asset managers need a one day turnaround to manage their new portfolio positions. Thirdly, except (Benhamou et al. 2020), they rely on a single reward function and do not measure the impact of the reward function. Last but not least, they only use one train and one test period, never testing for model stability.
Contributions

Our contributions are fourfold:
• The addition of contextual information.
Using only past information is not sufficient to learn in a noisy and fast changing environment. The addition of contextual information improves results significantly. Technically, we create two sub-networks: one fed with direct observations (past prices and standard deviations) and another one with contextual information (level of risk aversion in financial markets, early warning indicators for future recession, corporate earnings, etc.).
• One day lag between price observation and action.
We assume that prices are observed at time t but that the action only occurs at time t + 1, to be consistent with reality. This one day lag makes the RL problem more realistic but also more challenging.
• The walk-forward procedure.
Because of the non-stationary nature of time dependent data, and especially financial data, it is crucial to test DRL model stability. We present a methodology for DRL model evaluation referred to as walk forward analysis, which iteratively trains and tests the model on an extending data-set. This can be seen as the analogue of cross validation for time series. It allows validating that the selected hyper-parameters work well over time and that the resulting model is stable over time.
• Model leverage. Not only do we use a multi-input network, we also use a multi-output network to compute at the same time the percentage in each hedging strategy and the overall leverage. This is a nice feature of this DRL model as it incorporates a leverage mechanism by design. To make sure the leverage is in line with the asset manager's objective, we cap it at the maximum authorized leverage, which in our case is 3. This byproduct of the method is another key difference with standard financial methods like Markowitz, which do not care about leverage and only give a percentage for the hedging portfolio allocation.
Background and mathematical formulation
In standard reinforcement learning, models are based on a Markov Decision Process (MDP) (Sutton and Barto 2018). A Markov decision process is defined as a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, p, r)$ where:
• $\mathcal{X}$ is the state space,
• $\mathcal{A}$ is the action space,
• $p(y \mid x, a)$ is the transition probability such that $p(y \mid x, a) = \mathbb{P}(x_{t+1} = y \mid x_t = x, a_t = a)$,
• $r(x, a, y)$ is the reward of the transition $(x, a, y)$.
The MDP framework assumes that we know all the states of the environment and have all the information to make the optimal decision in every state. The Markov property in addition implies that knowing the current state is sufficient. From a practical standpoint, the general RL setting is modified by taking a pseudo state formed with a set of past observations $(o_{t-n}, o_{t-n-1}, \dots, o_{t-1}, o_t)$. In practice, to avoid large dimensions and the curse of dimensionality, it is useful to reduce this set and take only a subset of $j < n$ past observations, such that $0 < i_1 < \dots < i_j$ with $i_k \in \mathbb{N}$. The set $\delta = (0, i_1, \dots, i_j)$ is called the observation lags. In our experiment we typically use lag periods like (0, 1, 2, 3, 4, 20, 60) for daily data, where (0, 1, 2, 3, 4) provides the last week of observations, 20 is the one-month-ago observation (as there are approximately 20 business days in a month) and 60 the three-month-ago observation.
Observations

Regular observations
There are two types of observations: regular and contextual. Regular observations are data directly linked to the problem to solve. In the case of a trading framework, regular observations are past prices observed over a lag period $\delta = (0 < i_1 < \dots < i_j)$. To renormalize the data, we rather use past returns computed as $r^k_t = \frac{p^k_t}{p^k_{t-1}} - 1$, where $p^k_t$ is the price at time $t$ of asset $k$. To give information about regime changes, our trading agent also receives the empirical standard deviation computed over a sliding estimation window of length $d$ as $\sigma^k_t = \sqrt{\frac{1}{d}\sum_{u=t-d+1}^{t}(r^k_u - \mu)^2}$, where the empirical mean $\mu$ is computed as $\mu = \frac{1}{d}\sum_{u=t-d+1}^{t} r^k_u$. Hence our regular observation is a three dimensional tensor $A_t = [A^1_t, A^2_t]$ with
$A^1_t = \begin{pmatrix} r^1_{t-i_j} & \dots & r^1_t \\ \vdots & \ddots & \vdots \\ r^m_{t-i_j} & \dots & r^m_t \end{pmatrix}$ and $A^2_t = \begin{pmatrix} \sigma^1_{t-i_j} & \dots & \sigma^1_t \\ \vdots & \ddots & \vdots \\ \sigma^m_{t-i_j} & \dots & \sigma^m_t \end{pmatrix}$.
This setting with two layers (past returns and past volatilities) is quite different from the one presented in (Jiang and Liang 2016; Jiang, Xu, and Liang 2017; Liang et al. 2018), which uses different layers representing open, high, low and close prices. There are various remarks to be made. First, high-low information does not make sense for portfolio strategies that are only evaluated daily, which is the case of all the funds. Secondly, open, high and low prices tend to be highly correlated, creating some noise in the inputs. Third, the concept of volatility is crucial to detect regime changes and is surprisingly absent from these works as well as from other works like (Yu et al. 2019; Wang and Zhou 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019).
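To make this construction concrete, the following sketch builds the two-layer regular observation for the latest date from a price history, using the lag set (0, 1, 2, 3, 4, 20, 60) and a window of length d. The helper name and array conventions are our own illustration, not the authors' code.

```python
import numpy as np

def build_regular_observation(prices, lags=(0, 1, 2, 3, 4, 20, 60), d=20):
    """Sketch: two-layer regular observation [returns, rolling volatility].

    prices: array of shape (T, m), one column per strategy/asset.
    Returns an array of shape (2, m, len(lags)) for the latest date.
    """
    returns = prices[1:] / prices[:-1] - 1.0              # r_t = p_t / p_{t-1} - 1
    # empirical standard deviation over a sliding window of length d
    vols = np.array([returns[max(0, t - d + 1): t + 1].std(axis=0)
                     for t in range(len(returns))])
    t = len(returns) - 1                                   # latest observation date
    idx = [t - lag for lag in lags]                        # lagged rows (0, 1, ..., 60)
    ret_layer = returns[idx].T                             # shape (m, len(lags))
    vol_layer = vols[idx].T
    return np.stack([ret_layer, vol_layer])                # shape (2, m, len(lags))
```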
Context observation
Contextual observations are additional information that provide intuition about the current context. For our asset manager, they are other financial data, not directly linked to its portfolio, assumed to have some predictive power for the portfolio assets. Contextual observations are stored in a 2D matrix denoted by $C_t$ with the past values of the $p$ individual contextual observations stacked. Among these observations, we have the maximum and minimum portfolio strategies' returns and the maximum portfolio strategies' volatility. The latter information is, as for regular observations, motivated by the stylized fact that standard deviations are useful features to detect crises. The contextual state writes $C_t = \begin{pmatrix} c^1_t & \dots & c^1_{t-i_k} \\ \vdots & & \vdots \\ c^p_t & \dots & c^p_{t-i_k} \end{pmatrix}$. The matrix nature of the contextual states $C_t$ implies in particular that we will use 1D convolutions should we use convolutional layers. All in all, the augmented observations write $O_t = [A_t, C_t]$, with $A_t = [A^1_t, A^2_t]$, and will feed the two sub-networks of our global network.
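A similar sketch, again with an illustrative helper name, shows how the contextual matrix $C_t$ could be assembled so that it is directly suited to 1-D convolutions:

```python
import numpy as np

def build_context_observation(features, lags=(0, 1, 2, 3, 4, 20, 60)):
    """Sketch: stack p contextual features over the lag set into a 2-D matrix C_t.

    features: array of shape (T, p), e.g. risk-aversion indicator,
    bond/equity correlation, credit spreads, and so on.
    """
    t = len(features) - 1
    idx = [t - lag for lag in lags]
    return features[idx].T        # shape (p, len(lags))
```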
Action

In our deep reinforcement learning setting, the augmented asset manager trading agent needs to decide at each period in which hedging strategies it invests. The augmented asset manager can invest in $l$ strategies, which can be simple strategies or strategies themselves run by an asset management agent. To cope with reality, the agent is only able to act after one period, because asset managers have a one day turnaround to change their positions. We will see in the experiments that this one day turnaround lag makes a big difference in the results. As the agent has access to $l$ potential hedging strategies, the output is an $l$-dimensional vector that provides how much to invest in each hedging strategy. For our deep network, this means that the last layer is a softmax layer to ensure that portfolio weights are between 0 and 1 and sum to 1, denoted by $(p^1_t, \dots, p^l_t)$. In addition, to include leverage, our deep network has a second output which is the overall leverage, between 0 and a maximum leverage value (3 in our experiment), denoted by $lvg_t$. Hence the final allocation is given by $lvg_t \times (p^1_t, \dots, p^l_t)$.
Reward

There are multiple choices for our reward, and it is a key point for the asset manager to decide on the reward corresponding to his or her risk profile.
• A straightforward reward function is the final net performance of our portfolio, computed as the value of the portfolio at the last train date $t_T$ over its initial value at $t_0$ minus one: $P_{t_T}/P_{t_0} - 1$.
• Another natural reward function is the Sharpe ratio. There are various ways to compute Sharpe ratios and we take explicitly the annualized Sharpe ratio, computed from daily data and defined as the ratio of the annualized return over the annualized volatility, $\mu/\sigma$. The intuition of the Sharpe ratio is to account for risk when comparing returns, with risk represented by volatility.
• The last reward we are interested in is the Sortino ratio. This metric is a variation of the Sharpe ratio where the risk is computed by the downside standard deviation, defined as the standard deviation computed only on the negative daily returns $(\tilde r_t)_{t=0..T}$. The downside standard deviation is then the annualized standard deviation of $(\tilde r_t)_{t=0..T}$.
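These three candidate rewards translate directly into code. The sketch below is illustrative only and assumes 252 trading days per year for annualization, a constant of our choosing that the paper does not state.

```python
import numpy as np

ANNUALIZATION = 252  # assumed number of trading days per year

def net_profit(daily_returns):
    """Final net performance P_T / P_0 - 1."""
    return np.prod(1.0 + daily_returns) - 1.0

def sharpe_ratio(daily_returns):
    """Annualized return over annualized volatility."""
    mu = daily_returns.mean() * ANNUALIZATION
    sigma = daily_returns.std() * np.sqrt(ANNUALIZATION)
    return mu / sigma

def sortino_ratio(daily_returns):
    """Annualized return over the annualized downside standard deviation."""
    mu = daily_returns.mean() * ANNUALIZATION
    downside = daily_returns[daily_returns < 0]
    downside_std = downside.std() * np.sqrt(ANNUALIZATION)
    return mu / downside_std
```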
Convolutional network

The similarities with image recognition (where pixels are stored in 3 different matrices representing the red, green and blue channels) enable us to use convolutional networks for our deep neural network. The analogy goes even further, as it is well known in image recognition that convolutional networks achieve strong performance thanks to their capacity to extract meaningful features while having very limited parameters, hence avoiding over-fitting. Indeed, convolution allows us to extract features by blindly weighting the variables locally over the tensor. There is however something to notice. We use a convolution layer with a convolution window, or kernel, with a single row and a resulting vertical stride of 1. This particularity enables us to avoid mixing data from different strategies. We only mix data of the same strategy but for different observation dates. Recall that in a convolutional network, the stride parameter controls how the filter convolves around the input, while the size of the window, also referred to as the kernel size, controls how the filter applies to the data. Thus, a kernel with a row of 1 and a stride with a row of 1 allows us to detect the vertical (temporal) relation for each strategy by shifting one unit at a time, without mixing any data from different strategies. This concept is illustrated in figure 1. Because of this peculiarity, we can interpret our 2-D convolution as an iteration over a 1-D convolution network for each variable.

Figure 1: 2-D Convolution with a stride of 1
Multi inputs and outputs

We display in figure 2 the architecture of our network. Because we feed our network with both the data from the strategies to select and the contextual information, our network is a multiple inputs network.

Figure 2: network architecture obtained via the tensorflow plot_model function. Our network is very different from standard DRL networks that have single inputs and outputs. Contextual information introduces a second input while the leverage adds a second output.

Additionally, as we want these inputs to provide not only the percentage in each hedging strategy (with a softmax activation of a dense layer) but also the overall leverage (with a dense layer with a single output neuron), we also have a multi outputs network. Additional hyper-parameters used in the network include L2 regularization with a coefficient of 1e-8.
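A minimal sketch of such a multi-input, multi-output architecture, assuming the shapes introduced above and a maximum leverage of 3, could look as follows in Keras. The layer sizes, filter counts and the sigmoid-plus-scaling of the leverage head are illustrative choices of ours, not the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

m, n_lags, p = 4, 7, 3          # strategies, lags, contextual features (illustrative)
max_leverage = 3.0

# sub-network 1: regular observations (2 layers: returns and volatilities)
asset_in = layers.Input(shape=(2, m, n_lags), name="asset_states")
x = layers.Permute((2, 3, 1))(asset_in)                    # -> (m, n_lags, 2 channels)
x = layers.Conv2D(8, kernel_size=(1, 3), strides=(1, 1), activation="relu")(x)
x = layers.Flatten()(x)

# sub-network 2: contextual observations (p features x lags), 1-D convolution over lags
context_in = layers.Input(shape=(p, n_lags), name="context_states")
y = layers.Permute((2, 1))(context_in)                     # -> (n_lags, p)
y = layers.Conv1D(8, kernel_size=3, activation="relu")(y)
y = layers.Flatten()(y)

z = layers.Concatenate()([x, y])
z = layers.Dense(32, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-8))(z)

# output 1: portfolio weights across the l = m hedging strategies
weights = layers.Dense(m, activation="softmax", name="weights")(z)
# output 2: overall leverage constrained to [0, max_leverage]
leverage = layers.Dense(1, activation="sigmoid")(z)
leverage = layers.Lambda(lambda v: max_leverage * v, name="leverage")(leverage)

model = Model(inputs=[asset_in, context_in], outputs=[weights, leverage])
```

The final allocation would then be obtained outside the network as the product of the leverage output and the weight vector.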
Adversarial Policy Gradient
To learn the parameters of our network depicted in figure 2, we use a modified policy gradient algorithm called adversarial, as we introduce noise in the data, as suggested in (Liang et al. 2018). The idea of introducing noise in the data is to have some randomness in each training to make it more robust. This is somewhat similar to dropout in deep networks, where we randomly perturb the network by removing some neurons to make it more robust and less prone to overfitting. A policy is a mapping from the observation space to the action space, $\pi: \mathcal{O} \to \mathcal{A}$. To achieve this, a policy is specified by a deep network with a set of parameters $\vec\theta$. The action is a vector function of the observation given the parameters: $\vec a_t = \pi_{\vec\theta}(o_t)$. The performance metric of $\pi_{\vec\theta}$ for the time interval $[0, t]$ is defined as the corresponding total reward function of the interval, $J_{[0,t]}(\pi_{\vec\theta}) = R\left(\vec o_1, \pi_{\vec\theta}(o_1), \dots, \vec o_t, \pi_{\vec\theta}(o_t), \vec o_{t+1}\right)$. After random initialization, the parameters are continuously updated along the gradient direction with a learning rate $\lambda$: $\vec\theta \leftarrow \vec\theta + \lambda \nabla_{\vec\theta} J_{[0,t]}(\pi_{\vec\theta})$. The gradient ascent optimization is done with the standard Adam (short for Adaptive Moment Estimation) optimizer to have the benefit of adaptive gradient descent with root mean square propagation (Kingma and Ba 2014). The whole process is summarized in Algorithm 1.
Algorithm 1: Adversarial Policy Gradient
Input: initial policy parameters θ, empty replay buffer D
repeat
    reset replay buffer
    while not terminal do
        Observe observation o and select action a = π_θ(o) with probability p and a random action with probability 1 − p
        Execute a in the environment
        Observe next observation o', reward r, and done signal d to indicate whether o' is terminal
        Apply noise to next observation o'
        Store (o, a, o') in replay buffer D
        if terminal then
            for however many updates in D do
                compute final reward R
            end for
            update network parameters with Adam gradient ascent: θ ← θ + λ∇_θ J_{[0,t]}(π_θ)
        end if
    end while
until convergence

In our gradient ascent, we use a learning rate of 0.01 and adversarial Gaussian noise with a standard deviation of 0.002. We run up to 500 iterations with an early stop condition if there is no improvement on the train set over the last 50 iterations.
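As an illustration only, a compressed rendering of Algorithm 1 could look as follows. It omits the replay buffer and the epsilon-exploration, assumes a hypothetical environment API (episode_observations, episode_reward), and relies on the reward being a differentiable function of the allocations, as it is in the portfolio setting.

```python
import numpy as np
import tensorflow as tf

def train_adversarial_pg(model, env, iterations=500, lr=0.01, noise_std=0.002):
    """Sketch of the adversarial policy gradient loop (simplified, hypothetical env API)."""
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(iterations):
        observations = env.episode_observations()          # hypothetical: one episode of inputs
        # adversarial step: perturb each observation with Gaussian noise
        noisy = [tuple(o + np.random.normal(0.0, noise_std, o.shape) for o in obs)
                 for obs in observations]
        with tf.GradientTape() as tape:
            allocations = [model([o[None] for o in obs]) for obs in noisy]
            reward = env.episode_reward(allocations)        # hypothetical, differentiable reward
            loss = -reward                                  # minimize -J, i.e. ascend the reward
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```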
Walk forward analysis

In machine learning, the standard approach is to do k-fold cross validation, as shown in figure 3. This approach breaks the chronology of the data and potentially uses past data in the test set. Rather, we can take a sliding test set and use past data as training data, as shown in figure 4. To ensure some stability, we favor adding new data incrementally to the training set at each new step. This method is sometimes referred to as anchored walk forward, as the start of the training data is anchored. The negative effect of using an extending training data set is that the model adapts slowly to new information. In our experience, because we do not have that much data to train our DRL model, we use anchored walk forward to make sure we have enough training data. Last but not least, as the test set always comes after the train set, walk forward analysis gives fewer steps than cross validation. In practice, for our data set, we train our models from 2000 to the end of 2006 (to have at least seven years of data) and use a repetitive test period of one year.
Figure 3: k-fold cross validation
Figure 4: anchored walk forward
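The anchored walk-forward split can be written as a simple generator. The helper name and the hard-coded years below are illustrative and follow the train/test scheme described in the experiments.

```python
def anchored_walk_forward(first_train_end=2006, last_test_year=2020):
    """Yield (train_years, test_year) pairs for anchored walk-forward analysis."""
    for test_year in range(first_train_end + 1, last_test_year + 1):
        train_years = list(range(2000, test_year))   # anchored: training always starts in 2000
        yield train_years, test_year

# usage: 14 train/test steps, testing 2007, 2008, ..., 2020
for train_years, test_year in anchored_walk_forward():
    pass  # train on train_years, evaluate on test_year
```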
Experiments
Goal of the experiment
We are interested in planning a hedging strategy for a risky asset. The experiment uses daily data from 01/05/2000 to 19/06/2020. The risky asset is the MSCI World index. We choose this index because it is a good proxy for a wide range of asset manager portfolios. The hedging strategies are 4 SG CIB proprietary systematic strategies further described below.
Data-set description
Systematic strategies are similar to asset managers that invest in financial markets according to an adaptive, pre-defined trading rule. Here, we use 4 SG CIB proprietary 'hedging strategies' that tend to perform when stock markets are down:
• Directional hedges - react to small negative returns in equities,
• Gap risk hedges - perform well in sudden market crashes,
• Proxy hedges - tend to perform in some market configurations, for example when highly indebted stocks under-perform other stocks,
• Duration hedges - invest in the bond market, a classical diversifier of equity risk in finance.
The underlying financial instruments vary from put options, listed futures and single stocks to government bonds. Some of those strategies are akin to an insurance contract and bear a negative cost over the long run. The challenge consists in balancing cost versus benefit. In practice, asset managers have to decide how much of these hedging strategies is needed on top of an existing portfolio to achieve a better risk reward. The decision making process is often based on contextual information, such as the economic and geopolitical environment, the level of risk aversion among investors and other correlation regimes. The contextual information is modelled by a large range of features:
• the level of risk aversion in financial markets, or market sentiment, measured as an indicator varying between 0 for maximum risk aversion and 1 for maximum risk appetite,
• the bond/equity historical correlation, a classical ex-post measure of the diversification benefit of a duration hedge, measured on 1-month, 3-month and 1-year rolling windows,
• the credit spreads of global corporate bonds - investment grade and high yield, in Europe and in the US - known to be early indicators of potential economic tensions,
• the equity implied volatility, a measure of the 'fear factor' in financial markets,
• the spread between the yield of Italian government bonds and German government bonds, a measure of potential tensions in the European Union,
• the US Treasury slope, a classical early indicator of US recession,
• and some more financial variables, often used as a gauge for global trade and activity: the dollar, the level of rates in the US, the estimated earnings per share (EPS).
A cross validation step selects the most relevant features. In the present case, the first three features are selected. The rebalancing of strategies in the portfolio comes with transaction costs, which can be quite high since hedges use options. Transaction costs are like frictions in physical systems. They are taken into account dynamically to penalise solutions with a high turnover rate.
Evaluation metrics
Asset managers use a wide range of metrics to evaluate the success of their investment decisions. For a thorough review of those metrics, see for example (Cogneau and Hübner 2009). The metrics we are interested in for our hedging problem are listed below:
• the annualized return, defined as the average annualized compounded return,
• the annualized daily-based Sharpe ratio, defined as the ratio of the annualized return over the annualized daily-based volatility $\mu/\sigma$,
• the Sortino ratio, computed as the ratio of the annualized return over the downside standard deviation,
• the maximum drawdown (max DD), computed as the maximum of all daily drawdowns. The daily drawdown is the ratio of the difference between the running maximum of the portfolio value ($RM_T = \max_{t=0..T}(P_t)$) and the portfolio value, over the running maximum of the portfolio value. Hence $DD_T = (RM_T - P_T)/RM_T$ and $MDD_T = \max_{t=0..T}(DD_t)$. It is the maximum loss in return that an investor would incur if she/he invested at the worst time (at the peak).
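These metrics translate into a few lines of code. The sketch below (helper names are ours, with 252 trading days per year assumed for annualization) computes the annualized return and the maximum drawdown from a series of daily portfolio values; the Sharpe and Sortino ratios can reuse the reward helpers sketched earlier.

```python
import numpy as np

def max_drawdown(portfolio_values):
    """Maximum of all daily drawdowns: DD_t = (RM_t - P_t) / RM_t."""
    running_max = np.maximum.accumulate(portfolio_values)
    drawdowns = (running_max - portfolio_values) / running_max
    return drawdowns.max()

def annualized_return(portfolio_values, days_per_year=252):
    """Average annualized compounded return."""
    total = portfolio_values[-1] / portfolio_values[0]
    years = (len(portfolio_values) - 1) / days_per_year
    return total ** (1.0 / years) - 1.0
```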
Baseline

Pure risky asset
This first evaluation compares our portfolio composed only of the risky asset (in our case, the MSCI World index) with the one augmented by the trading agent, composed of the risky asset and the hedging overlay. If our agent is successful in identifying good hedging strategies, it should improve the overall portfolio and deliver a better performance than the risky asset alone.
Markowitz
In Markowitz theory (Markowitz 1952), risk is represented by the variance of the portfolio. Hence the Markowitz portfolio consists in maximizing the expected return for a given level of risk, represented by a given variance. Using dual optimization, this is also equivalent to minimizing the variance for a given expected return, which is solved by standard quadratic programming. Recall that we have $l$ possible strategies and we want to find the best allocation according to the Sharpe ratio. Let $w = (w_1, \dots, w_l)$ be the allocation weights with $0 \le w_i \le 1$ for $i = 1 \dots l$, summarized as $0 \le w \le 1$, with the additional constraint that these weights sum to 1: $\sum_{i=1}^{l} w_i = 1$. Let $\mu = (\mu_1, \dots, \mu_l)^T$ be the expected returns of our $l$ strategies and $\Sigma$ the covariance matrix of the $l$ strategies' returns. Let $r_{min}$ be the minimum expected return. The Markowitz optimization problem, solved by standard quadratic programming, is the following:

Minimize $w^T \Sigma w$ subject to $\mu^T w \ge r_{min}$, $\sum_{i=1}^{l} w_i = 1$, $w \ge 0$.

The Markowitz portfolio is a good benchmark, very often used in portfolio theory, as it allows investors to construct more efficient portfolios by controlling the variance of their strategies. One famous criticism of this theory is that it controls the variance (and hence the standard deviation) of the portfolio but does not allow controlling a better risk indicator, the downside standard deviation (representing the potential loss that may arise from risk compared to a minimum acceptable return). Another limitation of this theory relies on the assumption that investors are risk-averse. In other words, an investor prefers a portfolio with less risk for a given level of return and will only take on high-risk investments if he can expect a larger reward.
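For reference, this quadratic program can be solved in a few lines. The sketch below uses cvxpy, which is our solver choice for illustration, not necessarily the one used by the authors.

```python
import cvxpy as cp
import numpy as np

def markowitz_weights(mu, sigma, r_min):
    """Minimize w' Sigma w  s.t.  mu' w >= r_min, sum(w) = 1, w >= 0.

    mu: expected returns (length l), sigma: covariance matrix (l x l, PSD).
    """
    l = len(mu)
    w = cp.Variable(l)
    objective = cp.Minimize(cp.quad_form(w, sigma))
    constraints = [mu @ w >= r_min, cp.sum(w) == 1, w >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value
```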
Follow the winner
This is a simple strategy that consists in selecting the hedging strategy that was the best performer over the past year. If there is some persistence over time in the hedging strategies' performance, this simple methodology works well. It replicates the standard investor behavior of selecting strategies that performed well in the past.
Follow the loser
As its name suggests, follow the loser is exactly the opposite of follow the winner. It assumes that there is some mean reversion in strategies' performance, meaning that strategies tend to perform equally well over the long term and mean-revert around their trend. Hence, if a strategy did not perform well in the past, and if there is mean reversion, there is a good chance that this strategy will catch up with its peers.
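Both baselines reduce to a one-line selection rule on trailing one-year performance. The sketch below uses an assumed 252-day lookback and illustrative helper names.

```python
import numpy as np

def follow_winner(strategy_returns, lookback=252):
    """Allocate 100% to the strategy with the best trailing one-year return."""
    past = strategy_returns[-lookback:]                  # shape (lookback, n_strategies)
    cumulative = np.prod(1.0 + past, axis=0) - 1.0
    weights = np.zeros(past.shape[1])
    weights[np.argmax(cumulative)] = 1.0
    return weights

def follow_loser(strategy_returns, lookback=252):
    """Opposite rule: allocate 100% to the worst trailing performer."""
    past = strategy_returns[-lookback:]
    cumulative = np.prod(1.0 + past, axis=0) - 1.0
    weights = np.zeros(past.shape[1])
    weights[np.argmin(cumulative)] = 1.0
    return weights
```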
Results and discussion
Table 1: Models comparison over 3 and 5 years

3 Years          return   Sortino   Sharpe   max DD
Risky asset         –        –        –        –
DRL                 –        –        –      -0.27
Winner           13.19%    0.66     0.72     -0.35
Loser             9.30%    0.89     0.89     -0.15
DRL no context    8.11%    0.42     0.47     -0.34
Markowitz        -0.31%   -0.01    -0.01     -0.41

5 Years          return   Sortino   Sharpe   max DD
Risky asset       9.16%    0.54     0.57     -0.34
DRL                 –        –        –      -0.27
Winner           10.84%    0.65     0.68     -0.35
Loser             7.04%    0.78     0.76     -0.15
DRL no context    6.87%    0.44     0.47     -0.34
Markowitz        -0.07%   -0.00    -0.00     -0.41
We compare the performance of the following 5 models: the DRL model based on convolutional networks with contextual states (sentiment indicator, 6-month correlation between equity and bonds, and the credit main index), the same DRL model without contextual states, follow the winner, follow the loser and the Markowitz portfolio. The resulting graphics are displayed in figure 5, with the risky asset position alone in blue and the other models in orange, green and red. To make the figures readable, we first show the two DRL models together with the risky asset, and clearly see the impact of contextual information as the DRL model (in orange) is well above the green curve (the same model without contextual information) and also well above the risky asset position alone (the blue curve). We then plot the more traditional models: Markowitz, follow the winner (labelled Winner for space reasons) and follow the loser (labelled Loser for the same reason). We finally plot the two best performers, the DRL and the follow the winner models, emphasizing that the difference between DRL and follow the winner arises mostly in the years 2018 to 2020 that exhibit regime changes, with in particular the recent Covid crisis.

Figure 5: performance of all models
Figure 6: DRL weights
Figure 7: Follow the winner weights
Figure 8: Markowitz weights

Out of these 5 models, only DRL and follow the winner are able to provide a significant net performance increase compared to the risky asset alone, thanks to an efficient hedging strategy over the 2007 to 2020 period. The DRL model is in addition able to better adapt to the Covid crisis and to achieve better net return as well as better Sharpe and Sortino ratios over 3 and 5 years, as shown in table 1. In addition, on the last graphic of figure 5, we can remark that the DRL model has a tendency to move away from the blue curve (the risky asset) continuously and increasingly, whereas the follow the winner model moved away from the blue curve in 2015 and 2016 and tends to remain parallel to it after this period, indicating that there is no continuous improvement of the model. The growing divergence of the DRL model from the blue curve is a positive sign of its regular performance, which is illustrated in numbers in table 1.

Moreover, when comparing the weights obtained by the different models (figures 6, 7 and 8), we see that the bad performance of Markowitz can be a consequence of its diversification, as it takes each year a non-null position in the four hedging strategies and tends to change this allocation quite frequently. The rapid change of allocation is a sign of instability of this method (a well-known drawback of Markowitz). In contrast, the DRL and follow the winner models tend to choose only one or two strategies, in a stock-picking manner. The DRL model tends to choose mostly the duration hedge, and it is able to dynamically adapt its behavior over the last 3 years and to better manage the Covid crisis with a mixed allocation between duration and proxy hedges.

In terms of the smallest maximum drawdown, the follow the loser model is able to significantly reduce the maximum drawdown, but at the price of lower return, Sharpe and Sortino ratios. Removing contextual information deteriorates model performance significantly, as illustrated by the difference in terms of return, Sharpe ratio, Sortino ratio and maximum drawdown between the DRL and the DRL no context models. Last but not least, the Markowitz model is not able to adapt to the new regime change from 2015 onwards despite its good performance from 2007 to 2015.
It is the worst performer over the last 3 and 5 years because of this lack of adaptation.

For all models, we use the walk forward analysis described earlier. Hence, we start training the models from 2000 to the end of 2006 and use the best model on the test set in 2007. We then train the models from 2000 to the end of 2007 and use the best model on the test set in 2008, and so on. In total, we do 14 training runs (from 2007 to 2020). This process ensures that we detect models that are unstable over time and is similar in spirit to delayed online training. We also provide in table 2 different configurations (adversarial training, use of context, and use of a day lag), which leads to a total of 16 models. The first 8 models are the ones with a day lag, sorted in order of decreasing performance. The best model is the one with a net profit reward, adversarial training and use of context information, with a total performance of 81.8%. We also provide the corresponding models with no day lag (models 9 to 16). These models are theoretical and not considered further as they do not cope with reality.
Impact of context
In table 2, we provide a list of 16 models based on the following choices: the choice of the reward function (net profit or Sortino), the use of adversarial training with noise in the data or not, the use of contextual states, and the use of a day lag between observations and actions.

Table 2: Model comparison based on reward function, adversarial training (noise in data) and use of contextual state

We see that the best DRL model with the day-lag turnover constraint is the one using convolutional networks, adversarial training, contextual states and the net profit reward function. These 4 parameters are meaningful for our DRL model and change model performance substantially, as illustrated by the table with a difference between model 1 (the best model) and model 8 (the worst model) of 129.5% (= 81.8% - (-47.7%)). To measure the impact of the contextual information for our best model, we can simply take the difference between model 1 and model 6 (as they are the same model except for the presence or absence of contextual information). We find a significant impact as it accounts for 111.4% (= 81.8% - (-29.5%)). It is quite intuitive that adding context should improve the model, as we provide more meaningful information to the model.
Impact of one day lag
Our model accounts for the fact that asset managers cannot immediately change their positions at the close of the financial markets. It is easy to measure the impact of the one day lag as we simply need to take the difference in performance between model 9 and model 1. We find an impact of the one day lag of 112% (= 193.8% - 81.8%). Like the contextual information, this is substantial. It is not surprising that a delayed action (with a one period lag) after the observation makes the learning process more challenging for the DRL agent, as the influence of variables tends to decrease with time. Surprisingly, this salient modeling characteristic is ignored in the existing literature (Jiang, Xu, and Liang 2017; Liang et al. 2018; Yu et al. 2019; Wang and Zhou 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019).
Future work
While this work is encouraging, there is room for improvement, as we have only tested a few possible hyper-parameters for our convolutional networks. We could experiment with more layers, other design choices such as combinations of max pooling layers (as in image recognition), and ways to create more predictive contextual information.
Conclusion
In this paper, we address the challenging task of financial planning in a noisy and self-adapting environment with sequential, non-stationary and non-homogeneous observations. Our approach is based on deep reinforcement learning using contextual information thanks to a second sub-network. We also show that the additional constraint of a delayed action following observations has a substantial impact that should not be overlooked. We introduce the walk forward analysis procedure to test the robustness of the deep RL model. This is very important for regime-changing environments that cannot be evaluated with a simple train/validation/test procedure, nor with a k-fold cross validation that ignores the strong chronological nature of the observations. For our trading agent, we take not only past performances of portfolio strategies over different rolling periods, but also standard deviations, to provide predictive variables for regime changes. Augmented states with contextual information make a big difference in the model and help the agent learn more efficiently in a noisy environment. In our experiments, the contextual-based approach outperforms baseline methods like Markowitz or the naive follow the winner and follow the loser. Last but not least, it is quite important to fine-tune the numerous hyper-parameters of the contextual-based DRL model, namely the various lags (lag periods for the sub-network fed by the portfolio strategies' past returns, lag periods for the common contextual features referred to as the common features in the paper), the standard deviation period, the learning rate, etc. Despite the efficiency of contextual-based DRL models, there is room for improvement. Other information, such as news, could be incorporated to continue increasing model performance. For large stocks, like tech stocks, sentiment information based on social media activity could also be relevant.
Acknowledgments

We would like to thank Beatrice Guez and Marc Pantic for meaningful remarks while working on this project. The views contained in this document are those of the authors and do not necessarily reflect those of SG CIB.
References

[Benhamou et al. 2020] Benhamou, E.; Saltiel, D.; Ohana, J.-J.; and Atif, J. 2020. Detecting and adapting to crisis pattern with context based deep reinforcement learning. arXiv.
[Cogneau and Hübner 2009] Cogneau, P., and Hübner, G. 2009. The 101 ways to measure portfolio performance. SSRN Electronic Journal.
[Dias, Vermunt, and Ramos 2015] Dias, J.; Vermunt, J.; and Ramos, S. 2015. Clustering financial time series: New insights from an extended hidden markov model. European Journal of Operational Research.
[Jiang, Xu, and Liang 2017] Jiang, Z.; Xu, D.; and Liang, J. 2017. Reinforcement learning framework for the financial portfolio management problem. arXiv.
[Kahneman 2011] Kahneman, D. 2011. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux.
[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization.
[Levine et al. 2015] Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2015. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research.
[Li et al. 2019] Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation. In ICML.
[Liang et al. 2018] Liang et al. 2018. Adversarial deep reinforcement learning in portfolio management.
[Lillicrap et al. 2015] Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR.
[Liu et al. 2020] Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; and Liu, C. 2020. Adaptive quantitative trading: an imitative deep reinforcement learning approach. In AAAI.
[Markowitz 1952] Markowitz, H. 1952. Portfolio selection. The Journal of Finance.
[Mnih et al. 2013] Mnih, V., et al. 2013. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
[Niaki and Hoseinzade 2013] Niaki, S., and Hoseinzade, S. 2013. Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International.
[Schulman et al. 2015] Schulman, J., et al. 2015. Trust region policy optimization. In ICML.
[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR.
[Silver et al. 2016] Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of go with deep neural networks and tree search. Nature.
[Silver et al. 2017] Silver, D., et al. 2017. Mastering the game of go without human knowledge. Nature.
[Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, second edition.
[Vinyals et al. 2019] Vinyals, O.; Babuschkin, I.; Czarnecki, W.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.; Powell, R.; Ewalds, T.; Georgiev, P.; Oh, J.; Horgan, D.; Kroiss, M.; Danihelka, I.; Huang, A.; Sifre, L.; Cai, T.; Agapiou, J.; Jaderberg, M.; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.
[Wang, Jia, and Weng 2018] Wang, S.; Jia, D.; and Weng, X. 2018. Deep reinforcement learning for autonomous driving. ArXiv abs/1811.11329.
[Xiong et al. 2019] Xiong, Z.; Liu, X.-Y.; Zhong, S.; Yang, H.; and Walid, A. 2019. Practical deep reinforcement learning approach for stock trading.
[Ye et al. 2020] Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and Li, B. 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In AAAI.
[Yu et al. 2019] Yu, P.; Lee, J. S.; Kulyatin, I.; Shi, Z.; and Dasgupta, S. 2019. Model-based deep reinforcement learning for financial portfolio optimization. RWSDM Workshop, ICML 2019.
[Zheng, Li, and Xu 2019] Zheng, K.; Li, Y.; and Xu, W. 2019. Regime switching model estimation: spectral clustering hidden markov model.