Detecting and adapting to crisis pattern with context based Deep Reinforcement Learning
Eric Benhamou, David Saltiel, Jean-Jacques Ohana, Jamal Atif
MILES, Machine Learning Group, LAMSADE, Dauphine, France; Ai Square Connect, Research group, France; LISIC, ULCO, France; Multi Assets Solutions, Homa Capital, France
Email: {eric.benhamou, jamal.atif}@lamsade.dauphine.fr, [email protected], [email protected]
Abstract—Deep reinforcement learning (DRL) has reached superhuman levels in complex tasks like game solving (Go [1], StarCraft II [2]) and autonomous driving [3]. However, it remains an open question whether DRL can reach human level in applications to financial problems, in particular in detecting crisis patterns and consequently disinvesting. In this paper, we present an innovative DRL framework consisting of two sub-networks fed respectively with the portfolio strategies' past performances and standard deviations, and with additional contextual features. The second sub-network plays an important role as it captures dependencies on common financial indicators such as risk aversion, the economic surprise index and correlations between assets, allowing the agent to take context-based information into account. We compare different network architectures, either using convolutional layers to reduce the network's complexity or LSTM blocks to capture time dependency, and examine whether previous allocations matter in the modeling. We also use adversarial training to make the final model more robust. Results on the test set show that this approach substantially outperforms traditional portfolio optimization methods like Markowitz and is able to detect and anticipate crises like the current Covid one.
I. INTRODUCTION
Being able to adapt portfolio allocation to a crisis environment like the current Covid crisis is a major concern for the financial industry. Indeed, the current Covid crisis took the industry by surprise twice. First, when stock markets plunged at an unprecedented speed in March 2020, with the S&P 500 falling by 13%, asset managers were slow to react and to cut risk exposure. And second, when stock markets bounced back up at an equally rapid pace, with a rise of 13% for the S&P 500 in May 2020, asset managers were again caught off guard. In contrast, the previous 2008 crisis was very slow both in its fall and its recovery. Hence, adapting portfolio allocation to a crisis environment is a very important matter and has attracted growing attention from the financial scientific community.

The standard approach for portfolio allocation, which serves as a baseline for our research, relies on determining portfolio weights according to a risk-return criterion. The so-called Markowitz portfolio [4] finds the optimal allocation by determining the portfolio with minimum variance given a target return or, equivalently, the portfolio with maximum return given a targeted level of variance (the dual optimization). However, this approach suffers from a major flaw because of unreliable estimations of the individual portfolio strategies' excess returns and covariances. This leads not only to unstable allocations, but also to slow reactions to changing environments [5]. If we want a more dynamic allocation method, deep reinforcement learning is an appealing approach. It reformulates the portfolio optimization problem as a continuous control problem with delayed rewards. The rules are simple. Each trading day, the dynamic virtual asset manager agent has the right to modify the portfolio allocation. When it modifies the portfolio weights, it incurs transaction costs. The agent can only allocate between 0 and 100% to each of the portfolio assets. It cannot short any asset, hence weights are always positive and never above 100%. Nor can it borrow to fund leveraged positions, hence the sum of all allocations is strictly equal to 100%. To make decisions, the dynamic agent has access not only to past performances but also to some financial contextual information that helps it make an informed decision. The agent receives as feedback a financial reward that orientates its decisions. Compared to traditional financial methods, this approach has the major advantage of adapting to changing market conditions and of being somewhat more model-free, as we connect portfolio allocations directly to financial data and not to specific risk factors that may factor in some cognitive bias. This stream of research is also highly motivated by the recent major progress of deep reinforcement learning methods that have reached superhuman levels in complex tasks like game solving (historically Atari games [6], Go [1], StarCraft II [2]) and autonomous driving [3]. Nonetheless, it still remains an open question whether DRL can reach human level in applications to financial problems and in particular in detecting crisis patterns and consequently disinvesting.
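To make these allocation rules concrete, here is a minimal sketch (our own illustration, not the paper's code) of how the long-only, fully-invested constraint can be enforced by normalizing raw scores, together with a simple proportional transaction cost (10 bps, the commission rate of table IV):

```python
import numpy as np

COMMISSION = 0.001  # 10 basis points, as in table IV

def to_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax maps arbitrary scores to valid long-only weights:
    each weight lies in [0, 1] and they sum to 1 (100% invested)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def transaction_cost(w_new: np.ndarray, w_old: np.ndarray) -> float:
    """Proportional cost on the total turnover of the reallocation."""
    return COMMISSION * np.abs(w_new - w_old).sum()

w_old = np.array([0.25, 0.25, 0.25, 0.25])
w_new = to_weights(np.array([0.1, 0.4, 1.2, 0.0]))
print(w_new, transaction_cost(w_new, w_old))
```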
A. Related Work
Initially, many of the machine learning, and in particular deep network, applications to financial markets tried to predict price movements or trends [7], [8], [9]. The logic was to take historical prices of assets as inputs and use deep neural networks to predict asset prices for the next period. Armed with the forecast, a trading agent can act and decide the best allocation. The problem to solve is a standard supervised learning task, and more precisely a regression problem, and it is straightforward to implement. Yet the efficiency of the method relies heavily on the accuracy of the prediction, which makes it quite fragile and questionable, as future market prices are well known to be difficult to predict. Furthermore, this approach tends to substantially reduce portfolio diversification and cannot easily cope with transaction costs. In contrast, DRL can easily tackle these issues, as it does not aim at predicting prices but rather at finding the optimal action, or for our matter the optimal allocation.

The idea of applying DRL to portfolio allocation has recently taken off in the machine learning community, with some recent works on cryptocurrencies [10], [11], [12], [13] and [14]. Compared to traditional approaches to financial time series, which aim at taking decisions based on forecast estimates, [11] and [12] showed that deep reinforcement learning with a Convolutional Neural Network (CNN) architecture tends to perform better, for cryptocurrencies and Chinese stock markets, than deep learning architectures that rely on time series forecasts like LSTM. However, when there is a very rapid crisis, like what happened during the Covid crisis, using just past performances may lead the DRL agent to react too slowly. To make an analogy, it is as if the agent were self-driving on the highway and, very brutally, an obstacle arose. Using past performances only is like looking in the rear-view mirrors to infer what will happen next. Adding a context is like lifting up our eyes and looking further forward. Context-based reinforcement learning has recently emerged as a strong tool to increase the performance of dynamic reinforcement learning agents [15], [16]. More specifically, context-based reinforcement learning (RL) with high-capacity function approximators, such as deep neural networks (DNNs), has in the last two years attracted growing attention and been the subject of many publications in major machine learning conferences, as it efficiently solved a variety of sequential decision-making problems, including board games (e.g., Go and Chess [17]), video games (e.g., Atari games [18]), and complex robotic control tasks [19], [20], [21]. Theoretically, it has also been advocated that the usage of a context enables achieving superior data-efficiency compared to model-free RL methods in general [22], [23].

In this work, we therefore extend previous DRL works by precisely using a context-based approach. This is done by integrating common financial states in our deep network, having at least two sub-networks and potentially three if we also incorporate the previous allocations in the states. Experiments show that this approach is able to pick the best portfolio allocation out of sample using financial features used by asset managers (a risk aversion index, the correlation between equities and bonds, the Citi economic surprise index) and to accommodate for crises by reducing risk exposure.
We provide out-of-sample performances and test various configurations to emphasize that using a CNN works much better than more predictive architectures like LSTM, confirming previous works.

B. Contributions
Our contributions are twofold:
• First, we explain why a context-based deep reinforcement learning approach is closer to human thinking and leads to better results, with a novel deep network architecture consisting of two sub-networks: one (network 1) that takes as inputs past performances (and standard deviations) of the portfolio strategies, and another (network 2) that takes as inputs financial contextual information related to the performances of the portfolio strategies and thought to have some predictive power regarding their future performances.
• Second, we summarize many empirical findings. The reward function is critical: a Sharpe ratio reward leads to different results than a straight final net performance reward function. CNN performs better than LSTM and captures implicit features. Using adversarial training, by adding noise to the data, improves the model. Last but not least, dependency on previous allocations does not improve the model.
II. MATHEMATICAL FORMULATION
As summarized by figure 1, an asset manager robot has several strategies that it wants to allocate to optimally, with a performance objective on the overall portfolio. Not only does it have access to historical daily performance (the middle rectangle in figure 1), but it can also leverage additional information (the rectangle on the left in figure 1) that provides some contextual information about market conditions. These are other price data points but also unstructured data like some macro-economic data. To gauge the performance of its decisions, it has an objective that can be either the net performance of the portfolio or some risk-return criterion (the third rectangle of figure 1, on the right).
[Figure 1 diagram: asset states (historical strategy returns, historical standard deviations) and contextual information (other asset price data, other predictive data such as correlations, unstructured data such as the economic surprise and risk aversion indexes) feed the optimal portfolio allocation, which is judged against the manager's objective (net performance, Sharpe ratio).]
Fig. 1. Portfolio allocation problem
The question of asset allocation can be reformulated as a standard reinforcement learning problem, thanks to the Markov Decision Process (MDP) formalism. The learning agent interacts with an environment $E$ to decide rational or optimal actions and receives in return some rewards. These rewards are not necessarily positive and are given only at the end of the financial episode. They act as feedback for finding the best action. Using the established formalism of Markov decision processes, we assume that there exists a discrete-time stochastic control process represented by a 4-tuple $(\mathcal{S}, \mathcal{A}, P_a, R_a)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ the transition probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at the next period $t+1$, and finally $R_a(s, a)$ the immediate reward received after state $s$ and action $a$.

The requirement of a Markovian state, which guarantees that a solution exists (hence satisfying the Bellman optimality principle [24]), is a strong assumption that is hard to verify in practice. It is somewhat alleviated in practice by stacking enough observations to enforce that the Markov property is satisfied. Hence it is useful, following [25] or [26], to introduce the concept of observations and pile them up to coin states. In this setting, the agent perceives at time $t$ an observation $o_t$ along with a reward $r_t$.

In our setting, time is divided into trading periods of equal length $\tau$. In the rest of the paper, $\tau$ represents one trading day, but the setting can be applied to shorter time periods, like 30 minutes, to deal with intraday trading decisions. At the beginning of each trading period, a trading robot decides to potentially reallocate the funds among $m$ assets. The trading robot has access to an environment that provides, at each time $t$, a state $s_t$ composed of past observations that are rich enough to assume Markovianity. Intuitively, it is important for the agent to observe not only the last returns but also some previous returns (like the returns over 2, 3 and 4 business days, but also over a week and potentially a month) to make a decision. Mathematically, we denote by $\delta_1$ the lag operator applied to each observation. To make this concrete, the lag operator $\delta_1$'s outputs are the last portfolio strategy returns at time $t$ but also at $t-1$, $t-2$, $t-3$ and so on. There is here some trade-off. We obviously need enough observations to mimic a Markovian setting and ensure the problem is well posed. But we also need to limit the number of observations to avoid facing the curse of dimensionality. We will discuss this point in our experiments, but practically we take returns at time $t-60$, representing returns 3 months ago, $t-20$, one month ago, and $t-4$, $t-3$, $t-2$, $t-1$ and $t$, the latter five providing returns over the last trading week. By abuse of language, we can represent the lag operator $\delta_1$ by a vector of lagging periods $\delta_1 = [0, 1, 2, 3, 4, 20, 60]$ (as there is a one-to-one mapping between the operator and the lagging periods) and retrieve the corresponding returns for asset $i$ as follows: $\left[ r^i_t, r^i_{t-1}, r^i_{t-2}, r^i_{t-3}, r^i_{t-4}, r^i_{t-20}, r^i_{t-60} \right]$. The inputs that we call asset states, as they directly relate to the portfolio's assets, are not only past returns lagged over the $\delta_1$ periods but also standard deviations. The intuition behind consuming the returns' standard deviation, or equivalently their volatility, is that volatility is a good predictor of crisis.
Indeed, it is a stylized fact in the financial literature that volatility is a good predictor of risk [27], [28], and that an increase of volatility comes swiftly after a market crash [29], [30]. The period over which volatility is computed is yet another hyper-parameter to fine-tune and is arbitrarily taken as 20 periods to represent a month of data (there are approximately 60 trading days in a quarter and 20 in a month). To summarize, the asset states $A_t$ are given by two matrices $A_t = \left[ A^1_t, A^2_t \right]$, with the first matrix $A^1_t$ being the matrix of returns:

$$A^1_t = \begin{pmatrix} r_1(t) & \ldots & r_1(t-60) \\ \vdots & \ddots & \vdots \\ r_m(t) & \ldots & r_m(t-60) \end{pmatrix},$$

while the second matrix $A^2_t$ contains the standard deviations:

$$A^2_t = \begin{pmatrix} \sigma_1(t) & \ldots & \sigma_1(t-60) \\ \vdots & \ddots & \vdots \\ \sigma_m(t) & \ldots & \sigma_m(t-60) \end{pmatrix}.$$

The asset states are stored in a 3-D tensor as shown in figure 2. Its similarity with an image, where pixels are stored in three different matrices representing the red, green and blue channels, enables us to use a 2-dimensional convolutional network for our deep network. The analogy goes even further, as it is well known in image recognition that convolutional networks achieve strong performance thanks to their capacity to extract meaningful features while having very few parameters, hence avoiding over-fitting.
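As an illustration, the following Python sketch (our own code, not the authors'; the helper name and the 20-day rolling window are assumptions based on table IV) builds this image-like tensor:

```python
import numpy as np

LAGS = [0, 1, 2, 3, 4, 20, 60]   # the delta_1 lag periods
VOL_WINDOW = 20                  # 20-day window for volatility (table IV)

def asset_states(returns: np.ndarray, t: int) -> np.ndarray:
    """Return a (2, m, len(LAGS)) tensor from a (T, m) array of daily
    returns: channel 0 = lagged returns, channel 1 = lagged rolling
    standard deviations, mirroring A_t = [A^1_t, A^2_t]."""
    m = returns.shape[1]
    # rolling 20-day volatility per strategy, aligned on the time index
    vol = np.stack([
        np.array([returns[max(0, s - VOL_WINDOW + 1): s + 1, i].std()
                  for s in range(returns.shape[0])])
        for i in range(m)
    ], axis=1)
    lagged_r = np.stack([returns[t - lag] for lag in LAGS], axis=1)  # (m, 7)
    lagged_v = np.stack([vol[t - lag] for lag in LAGS], axis=1)      # (m, 7)
    return np.stack([lagged_r, lagged_v])  # (2, m, 7), image-like

# Example: 4 strategies, 300 days of synthetic data
rng = np.random.default_rng(0)
A_t = asset_states(rng.normal(0, 0.01, size=(300, 4)), t=299)
print(A_t.shape)  # (2, 4, 7)
```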
Fig. 2. 3-dimensional tensor: the asset states (returns $A^1_t$ and volatility $A^2_t$)
To introduce context-based information, the asset manager robot also observes additional important features, denoted by $C_t$, that provide insights about the future evolution of the portfolio strategies. Using market knowledge from Homa Capital's multi assets solutions, we add three features (referred to as contextual features): the correlation between equities and bonds, denoted by $c^1_t$, the Citigroup global economic surprise index, denoted by $c^2_t$, and a risk aversion index, denoted by $c^3_t$. These features are not taken at random but are well known, or at least assumed, to have some predictive power for our portfolio strategies, as these strategies incorporate a mix of equities and bonds and are highly sensitive to economic surprise and risk aversion levels. Again, to ensure some Markovianity and to include in the current knowledge of the virtual agent more than the last observation of these features, we introduce a second lag operator $\delta_2$ that operates on the contextual features. To keep things simple in our experiments, we take the same vector of lagging periods for this second lag operator, although the method can be fine-tuned with two different lags for the asset and contextual states. In our setting, $\delta_2 = [0, 1, 2, 3, 4, 20, 60]$ and the contextual states, represented by $C_t$, write as follows:

$$C_t = \begin{pmatrix} c_1(t) & \ldots & c_1(t-60) \\ \vdots & \ddots & \vdots \\ c_3(t) & \ldots & c_3(t-60) \end{pmatrix}.$$

In contrast to asset states, contextual states $C_t$ are represented by a two-dimensional tensor, or equivalently a matrix. If we want to use convolutional networks, we therefore need to use 1D (one-dimensional) and not 2D (two-dimensional) convolutions. In addition, we add to these common contextual features the maximum portfolio strategy return and the maximum and minimum portfolio strategy volatilities. The latter two are, as for asset states, motivated by the stylized fact that standard deviations are useful features to detect crises.

Last but not least, our state $s_t$ can also incorporate the previous portfolio allocation. Hence our state can take the following three inputs:
• the previous portfolio strategy returns lagged by $\delta_1$, called the asset states $A_t$;
• the contextual features observed lagged by $\delta_2$, called the common states $C_t$;
• the previous weight allocation $w_{t-1}$;
or mathematically, $S_t = \{A_t, C_t, w_{t-1}\}$.

Our optimal control problem is to find the optimal policy $\pi^*(S_t)$ that maximizes the total reward, denoted by $R(T)$, for one episode. Under very strong theoretical assumptions, this optimal policy always exists and is unique. In practice, we are far from the theoretical framework and may only find locally optimal policies thanks to gradient ascent! The policy is represented by a deep network whose parameters are given by $\theta$ and composed of three sub-networks, as illustrated in figure 3 and further described in II-A. Hence the optimal control problem writes as

$$\max_{\theta} \; \mathbb{E}_{\pi_\theta(\cdot)} \left[ R(T) \right],$$

where $\mathbb{E}_{\pi_\theta}$ represents the expectation under the assumption that our policy $\pi_\theta(s_t)$ is precisely represented by our deep network with parameters $\theta$ for a state $s_t$ at time $t$. The total reward $R(T)$ can either be the net performance of the portfolio or some risk-return criterion like the Sharpe ratio, computed as the ratio of the average return over its standard deviation.
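Continuing the sketch above, the contextual matrix $C_t$ and the full state can be assembled as follows (again our own illustrative code, not the paper's; `asset_states` is the helper sketched earlier):

```python
import numpy as np

LAGS = [0, 1, 2, 3, 4, 20, 60]  # delta_2, taken equal to delta_1

def contextual_states(contextual: np.ndarray, t: int) -> np.ndarray:
    """Return the (3, len(LAGS)) matrix C_t, where `contextual` is a
    (T, 3) array holding the equity/bond correlation, the economic
    surprise index and the risk aversion index."""
    return np.stack([contextual[t - lag] for lag in LAGS], axis=1)

def make_state(returns, contextual, w_prev, t):
    """Assemble S_t = {A_t, C_t, w_{t-1}}."""
    return {
        "A_t": asset_states(returns, t),          # (2, m, 7) tensor
        "C_t": contextual_states(contextual, t),  # (3, 7) matrix
        "w_prev": w_prev,                         # (m,) previous allocation
    }
```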
A. Network Architecture

Our network (as described in figure 3) uses three types of inputs:
Fig. 3. Possible DRL network architecture

• sub-network 1: the portfolio returns and standard deviations observed over the lag $\delta_1$ array (the asset states $A_t$);
• sub-network 2: the contextual information given by the correlation between equities and bonds, the Citigroup economic surprise and risk aversion indexes observed over the lag $\delta_2$ array, and other additional common features like the maximum portfolio strategy return and the maximum and minimum portfolio strategy volatilities (the context states $C_t$);
• and potentially sub-network 3: the previous portfolio allocation $w_{t-1}$.

A code sketch of this two-stream architecture is given at the end of this subsection.

Fig. 4. Deep RL portfolio optimisation result
Fig. 5. Corresponding optimal allocation

These architecture variants represent 32 models whose results are given in table II. In our experiment, $m = 4$, with the first three assets representing real strategies while the fourth one is just cash, whose value does not change over time. To represent the performance of each of the 3 strategies, we plot portfolio 1, which consists in taking only strategy 1 (in blue in figure 4), and respectively portfolios 2 and 3, taking only strategies 2 and 3 (in orange and green).

It is worth noticing that portfolio 3, consisting of 100% in strategy 3, has a strong tendency to outperform the other two strategies (portfolios 1 and 2). Hence we expect the deep RL agent to allocate mostly to strategy 3 and, when anticipating a crisis, to allocate to cash. This is exactly what it does, as illustrated in figure 5. It is also interesting to notice that the trained deep RL agent is mostly invested in strategy 3 and from time to time swaps this allocation to a pure cash allocation. The anticipated crisis in 2018 enables the agent to slightly outperform portfolio 3 from 2018 to the end of 2019. The agent, however, is not almighty and makes mistakes, as illustrated by the ill-timed spike in the cash allocation at the end of 2019. It is nonetheless able to adapt to the Covid crisis and to brutally swap the allocation from strategy 3 to cash and back as markets bounced back at the end of March.
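To make the architecture concrete, here is a minimal PyTorch sketch of the two main sub-networks (our own illustration, not the authors' code; we read the "5,10" and "2" entries of table IV as channel counts, and everything else is an assumption):

```python
import torch
import torch.nn as nn

class ContextDRLPolicy(nn.Module):
    """Sub-network 1: 2D convolutions on the (2, m, 7) asset states.
    Sub-network 2: a 1D convolution on the (3, 7) contextual states.
    Both streams merge into a softmax head producing the weights."""
    def __init__(self, m_assets=4, n_lags=7, n_context=3):
        super().__init__()
        self.assets = nn.Sequential(               # sub-network 1: 2 conv layers
            nn.Conv2d(2, 5, kernel_size=(1, 3)), nn.ReLU(),
            nn.Conv2d(5, 10, kernel_size=(1, 3)), nn.ReLU(),
            nn.Flatten(),
        )
        self.context = nn.Sequential(              # sub-network 2: 1 conv layer
            nn.Conv1d(n_context, 2, kernel_size=3), nn.ReLU(),
            nn.Flatten(),
        )
        a_dim = 10 * m_assets * (n_lags - 4)       # after two width-3 convs
        c_dim = 2 * (n_lags - 2)
        self.head = nn.Sequential(
            nn.Linear(a_dim + c_dim, 32), nn.ReLU(),
            nn.Linear(32, m_assets),
            nn.Softmax(dim=-1),                    # long-only, fully invested
        )

    def forward(self, A_t, C_t):
        z = torch.cat([self.assets(A_t), self.context(C_t)], dim=-1)
        return self.head(z)                        # portfolio weights w_t

policy = ContextDRLPolicy()
w = policy(torch.randn(1, 2, 4, 7), torch.randn(1, 3, 7))
print(w.sum().item())  # ≈ 1.0
```

Note that the softmax output enforces by construction the long-only, fully-invested constraint described in the introduction.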
B. DRL algorithm

To find the optimal action $\pi^*(S_t)$ (in terms of portfolio allocation), we use a deep policy gradient method with non-linear (ReLU) activations. We use a replay buffer to memorize all marginal rewards, so that we can start batch gradient descent once we have reached the final time step. We use the traditional Adam optimizer so that we have the benefit of adaptive gradient descent with root mean square propagation [31].
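In pseudo-Python, one training episode therefore looks as follows (a hedged sketch: `env` and its interface are our own illustrative names, and the marginal rewards are assumed to be differentiable tensors):

```python
import torch

def run_episode(policy, env, optimizer):
    """Roll out one episode, storing marginal rewards in a buffer,
    then perform a single batch gradient step at episode end."""
    buffer = []                                 # replay of marginal rewards
    state, done = env.reset(), False
    while not done:
        w = policy(state["A_t"], state["C_t"])  # action = new allocation
        state, marginal_reward, done = env.step(w)
        buffer.append(marginal_reward)
    total_reward = torch.stack(buffer).sum()    # episodic reward R(T)
    optimizer.zero_grad()
    (-total_reward).backward()                  # ascend on R(T)
    optimizer.step()
```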
C. Results

Performance results are given below in table I, and the best performing models are highlighted in yellow in table II. Returns are computed annually: for a total performance of 21% (as shown in figure 4) over the period from January 1st 2018 to March 31st 2020, the corresponding annual return is 8.8%. Overall, out of the 32 models available, many DRL models are able to outperform not only traditional methods like static Markowitz but also the best portfolio (sometimes referred to as the naive winner strategy) in terms of net performance and Sharpe ratio, with a final annual net return of 8.8% when using the best net profit reward model, or 8.6% when using the best Sharpe ratio reward model, compared to 3.9% for the naive winner method. The dynamic Markowitz method consists in recomputing the Markowitz optimal allocation every 3 months. The naive winner method consists in simply selecting the best strategy over the training data set, which is strategy 3.
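As a quick check of this annualization (our own arithmetic, using geometric compounding over the 2.25 years from January 2018 to March 2020):

$$(1 + 0.21)^{1/2.25} - 1 \approx 0.088 = 8.8\%.$$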
TABLE I
PERFORMANCE RESULTS

                  Portfolio 1  Portfolio 2  Portfolio 3  Dynamic Markowitz  Deep RL Net profit  Deep RL Sharpe  Naive winner
Net Performance      -6.3%        -2.1%        3.9%            0.7%               8.8%               8.6%           3.9%
III. LEARNING OF THE NETWORK PARAMETERS
The agent's objective is to maximize its total reward $R$, given at the episode's end. This reward can be the net portfolio performance or the Sharpe ratio, computed as the portfolio's mean return over its standard deviation. Because we somehow play and replay the same scenario with the same reward function, the current framework has two important distinctions from many other RL problems. One is that the domain knowledge of the environment is well mastered and can be fully exploited by the agent. This exact expressiveness is a direct consequence of the fact that the agent's actions have no influence on future prices, which is clearly the case for small transactions or liquid assets. This isolation of actions from the external environment also allows one to use the same segment of market history to evaluate different sequences of actions. The second distinction is that the final reward depends on all episodic actions. In other words, all episodic actions are important, justifying the full-exploitation approach.

Fig. 6. Crisis portfolio reaction. It is worth noticing that the best DRL agent has a stable performance as it rapidly disinvests and puts assets in cash during the Covid crisis.
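The two episodic reward choices can be sketched as follows (a minimal illustration in our own notation, with the Sharpe ratio left un-annualized for simplicity):

```python
import numpy as np

def net_performance(daily_returns: np.ndarray) -> float:
    """Compounded net performance of the portfolio over the episode."""
    return float(np.prod(1.0 + daily_returns) - 1.0)

def sharpe_ratio(daily_returns: np.ndarray) -> float:
    """Mean return over its standard deviation."""
    return float(daily_returns.mean() / daily_returns.std())

episode = np.random.default_rng(1).normal(0.0004, 0.01, size=252)
print(net_performance(episode), sharpe_ratio(episode))
```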
A. Deterministic Policy Gradient

A policy is a mapping from the state space to the action space, $\pi : \mathcal{S} \to \mathcal{A}$. With full exploitation in the current framework, an action is deterministically produced by the policy from a state. The optimal policy is obtained using a gradient ascent algorithm. To achieve this, a policy is specified by a set of parameters $\theta$, and $a_t = \pi_\theta(s_t)$. The performance metric of $\pi_\theta$ for the time interval $[0, t]$ is defined as the corresponding reward function of the interval,

$$J_{[0,t]}(\pi_\theta) = R\left(s_1, \pi_\theta(s_1), \cdots, s_t, \pi_\theta(s_t), s_{t+1}\right). \quad (1)$$

After random initialization, the parameters are continuously updated along the gradient direction with a learning rate $\lambda$,

$$\theta \longrightarrow \theta + \lambda \nabla_\theta J_{[0,t]}(\pi_\theta). \quad (2)$$

To perform the gradient ascent optimization, we use the standard Adam (short for Adaptive Moment Estimation) optimizer to have the benefit of adaptive gradient descent with root mean square propagation [31].
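Read literally, equation (2) is the following update (a stripped-down sketch with plain gradient ascent, to make the update direction explicit; the paper actually uses Adam, and 0.01 is the learning rate of table IV):

```python
import torch

LR = 0.01  # lambda, the learning rate

def vanilla_ascent_step(policy: torch.nn.Module, J: torch.Tensor):
    """One update theta <- theta + lambda * grad_theta J."""
    policy.zero_grad()
    J.backward()
    with torch.no_grad():
        for p in policy.parameters():
            if p.grad is not None:
                p += LR * p.grad   # ascent: + sign, not the usual - sign
```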
B. Crisis adaptation

It is remarkable that the DRL approach is able to handle the Covid crisis smoothly, as displayed in figure 6. If we zoom in on the out-of-sample period from December 2019 to March 2020, we can see that the DRL agent is able to rapidly reduce exposure to strategy 3 and allocate to cash, detecting, thanks to contextual information, that a crisis is imminent, as shown in figure 7. Interestingly, the DRL agent reallocates to portfolio 3 in March, picking up the market rebound.

The best networks, as illustrated in table II, are mostly convolutional networks. We found that for sub-network 1 it is optimal to have 2 convolutional layers, but only 1 convolutional layer for sub-network 2. We illustrate sub-network 1 in figure 8.
C. Impact of contextual information
Logically, networks with contextual information perform better as they have more information. For each network configuration, we compute the difference between the versions with and without contextual information. We summarize these results in table III, with the best results highlighted in yellow. For all configurations, the version with contextual information achieves higher annual returns. This is almost always the case for the Sharpe ratio as well, but there are exceptions. If we remove, for each criterion, the two largest differences, classified as outliers, we find that context-based models increase annual returns by 2.45% and the Sharpe ratio by 0.29 on average.

Fig. 7. Allocation during the Covid crisis. It is worth noticing that the best DRL agent allocates almost all assets either to cash or to strategy 3 and rarely mixes the two. It does not invest at all in strategies 1 or 2, indicating that the optimal choice is to saturate the allocation constraints, as we only permit the dynamic agent to allocate between 0 and 100%.

Fig. 8. Convolutional sub-network 1
IV. FURTHER WORK
In our experiments, we see that the context-based approach outperforms baseline methods like Markowitz. We also found that the CNN architecture performs much better than LSTM units, as it reduces the number of parameters to train and shares parameters across portfolio strategies. Adversarial training also makes the training more robust by providing a more challenging environment; a sketch of this noise-based augmentation is given at the end of this section. Last but not least, it is quite important to fine-tune the numerous hyper-parameters of the context-based DRL model, namely the various lags (the lag periods for the sub-network fed with the portfolio strategies' past returns, and the lag periods for the common contextual features referred to as the common features in this paper), the standard deviation period, the learning rate, etc. It is compelling that the suggested framework scales linearly with the portfolio size and can accommodate contextual information. Our findings suggest that modeling the state with the previous weight allocation deteriorates training and does not help, suggesting that the artifact of introducing previous weights so that an action has a direct impact on the state is artificial, and that in reality, under the assumption of small market impact, it is more efficient to assume that the portfolio allocation does not influence the future state. The memory mechanism is quite beneficial, as it allows computing the final reward on each episode and hence avoids the gradient vanishing problem faced by many deep networks. Moreover, thanks to this memory mechanism, it is not challenging to create an online learning mechanism that can continuously digest incoming market information to improve the dynamic agent. The profitability of this framework surpasses traditional portfolio-selection methods, as demonstrated in the paper, by a non-negligible factor, as it outperforms dynamic Markowitz by more than 8% and the best strategy by 4%.

This better performance should be mitigated by the fact that the dynamic DRL agent was able to adapt well to the Covid crisis, and hence benefited from an exceptional and almost unique condition in financial history. Consequently, these numbers should not be taken literally but rather as a sign of the capacity of deep RL methods to achieve human performance in portfolio allocation and to detect and adapt to crisis patterns. Despite the efficiency of context-based DRL models in our experiments, these models can be improved in future works. Their main weakness is the number of hyper-parameters that need to be estimated on the validation set. Their second major weakness lies in the fact that in finance, each experience is somehow unique, and one may not be able to draw conclusions from a single test set. Drawing a general conclusion is premature and beyond reason at this stage. The approach should be tested on more financial markets and on more outcomes. It may also be tested in terms of stability and capacity to adapt to further crisis patterns.
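The adversarial training mentioned above can be sketched as simple Gaussian data augmentation (our own reading, using the 0.002 noise standard deviation of table IV; all names are illustrative):

```python
import torch

NOISE_STD = 0.002  # adversarial Gaussian standard deviation (table IV)

def perturb(state: dict) -> dict:
    """Add Gaussian noise to every tensor input of the state."""
    return {k: v + NOISE_STD * torch.randn_like(v) if torch.is_tensor(v) else v
            for k, v in state.items()}
```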
V. CONCLUSION
In this paper, we address the challenging task of detecting crises and adapting portfolio allocation to crisis environments. Our approach is based on deep reinforcement learning using contextual information thanks to a second sub-network. The model takes not only the past performances of portfolio strategies over different rolling periods, but also the portfolio strategies' standard deviations, as well as contextual information like risk aversion, the Citigroup economic surprise index and the correlation between equities and bonds over a rolling period, to make the best allocation decision. The additional contextual information makes the learning of the dynamic asset manager agent more robust to crisis environments, as the agent reacts more rapidly to changing environments. In addition, the usage of the standard deviation of portfolio strategies provides a good hint of future crises. The model achieves better performance than standard financial models. There is room for further improvement, as this model constitutes only a first attempt to find a reasonable DRL solution for adapting to crisis situations and to answer positively whether DRL can reach human level in applications to financial problems, in particular in detecting crisis patterns.
TABLE II
RESULTS OF THE VARIOUS MODELS

Reward     Adversarial  Network  Previous  Context?  Annual  Sharpe
           training?             weight?             return
NetProfit  No           Conv2D   No        Yes        8.8%    –
NetProfit  No           Conv2D   Yes       Yes        8.5%    2.03
Sharpe     No           Conv2D   No        Yes        8.4%    2.01
NetProfit  Yes          Conv2D   No        Yes        8.0%    1.35
NetProfit  Yes          Conv2D   No        No         7.7%    1.94
Sharpe     No           Conv2D   No        No         6.4%    1.31
NetProfit  No           LSTM     No        Yes        6.2%    1.49
NetProfit  No           Conv2D   No        No         5.4%    0.97
Sharpe     Yes          LSTM     No        Yes        5.4%    1.23
NetProfit  Yes          LSTM     No        Yes        5.1%    0.93
NetProfit  Yes          Conv2D   Yes       Yes        4.3%    0.63
Sharpe     Yes          Conv2D   No        No         4.2%    0.69
NetProfit  No           LSTM     Yes       Yes        3.8%    0.52
Sharpe     No           Conv2D   Yes       No         3.8%    0.52
NetProfit  No           Conv2D   Yes       Yes        3.8%    0.52
Sharpe     Yes          LSTM     Yes       Yes        3.8%    0.52
Sharpe     Yes          Conv2D   Yes       Yes        3.7%    0.51
NetProfit  Yes          Conv2D   Yes       No         3.7%    0.51
NetProfit  No           LSTM     Yes       No         3.6%    0.49
NetProfit  Yes          LSTM     Yes       Yes        3.5%    0.48
NetProfit  Yes          LSTM     No        No         3.4%    1.24
Sharpe     No           LSTM     Yes       Yes        3.4%    0.48
NetProfit  No           LSTM     No        No         3.4%    0.47
Sharpe     Yes          Conv2D   Yes       No         3.4%    0.51
NetProfit  Yes          LSTM     Yes       No         2.3%    0.97
Sharpe     Yes          LSTM     No        No         2.3%    0.32
Sharpe     No           LSTM     No        Yes        1.5%    0.22
Sharpe     No           Conv2D   Yes       No         0.9%    0.13
Sharpe     Yes          LSTM     Yes       No        -5.1%    na
Sharpe     No           LSTM     No        No        -5.1%    na
Sharpe     No           LSTM     Yes       No        -5.1%    na

TABLE III
DIFFERENCE IN RETURNS AND SHARPE BETWEEN MODELS WITH AND WITHOUT CONTEXTUAL INFORMATION

Reward     Adversarial  Network  Previous  Annual return  Sharpe
           training?             weight?   difference     difference
NetProfit  No           Conv2D   No         3.43%          0.99
NetProfit  No           Conv2D   Yes        4.68%          –
NetProfit  No           LSTM     No         2.77%          1.03
NetProfit  No           LSTM     Yes        0.27%          0.03
NetProfit  Yes          Conv2D   No         0.32%         -0.59
NetProfit  Yes          Conv2D   Yes        0.58%          0.12
NetProfit  Yes          LSTM     No         1.70%         -0.32
NetProfit  Yes          LSTM     Yes        1.16%         -0.48
Sharpe     No           Conv2D   No         2.02%          0.70
Sharpe     No           Conv2D   Yes        2.94%          0.40
Sharpe     No           LSTM     No         6.56%          0.22
Sharpe     No           LSTM     Yes        8.52%          0.48
Sharpe     Yes          Conv2D   No         4.39%          1.39
Sharpe     Yes          Conv2D   Yes        0.36%          0.01
Sharpe     Yes          LSTM     No         3.05%          0.91
Sharpe     Yes          LSTM     Yes        –              –

TABLE IV
HYPER-PARAMETERS USED

hyper-parameter              value                     description
batch size                   50                        size of mini-batch during training
regularization coefficient   1e-8                      L2 regularization coefficient applied to network training
learning rate                0.01                      step size parameter in Adam
standard deviation period    20 days                   period for standard deviation in asset states
commission                   10 bps                    commission rate
stride                       2,1                       stride used in convolution networks
conv number 1                5,10                      number of convolutions in sub-network 1
conv number 2                2                         number of convolutions in sub-network 2
lag period 1                 [60, 20, 4, 3, 2, 1, 0]   lag periods for asset states
lag period 2                 [60, 20, 4, 3, 2, 1, 0]   lag periods for contextual states
noise                        0.002                     adversarial Gaussian standard deviation

REFERENCES

[1] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, Oct. 2017.
[2] O. Vinyals, I. Babuschkin, W. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. Agapiou, M. Jaderberg, and D. Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, Nov. 2019.
[3] S. Wang, D. Jia, and X. Weng, "Deep reinforcement learning for autonomous driving," ArXiv, vol. abs/1811.11329, 2018.
[4] H. Markowitz, "Portfolio selection," Journal of Finance, vol. 7, pp. 77–91, 1952.
[5] F. Black and R. Litterman, Global portfolio optimization. Financial Analysts, 1992.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," NIPS Deep Learning Workshop, 2013.
[7] F. Freitas, A. De Souza, and A. Almeida, "Prediction-based portfolio optimization model using neural networks," Neurocomputing, vol. 72, pp. 2155–2170, Jun. 2009.
[8] S. Niaki and S. Hoseinzade, "Forecasting S&P 500 index using artificial neural networks and design of experiments," Journal of Industrial Engineering International, vol. 9, Feb. 2013.
[9] J. B. Heaton, N. G. Polson, and J. H. Witte, "Deep learning for finance: deep portfolios," Applied Stochastic Models in Business and Industry, vol. 33, no. 1, pp. 3–12, 2017.
[10] Z. Jiang and J. Liang, "Cryptocurrency portfolio management with deep reinforcement learning," arXiv e-prints, Dec. 2016.
[11] Z. Jiang et al., "Reinforcement learning framework for the financial portfolio management problem," arXiv, 2017.
[12] Liang et al., "Adversarial deep reinforcement learning in portfolio management," 2018.
[13] P. Yu, J. S. Lee, I. Kulyatin, Z. Shi, and S. Dasgupta, "Model-based deep reinforcement learning for financial portfolio optimization," RWSDM Workshop, ICML 2019, Jan. 2019.
[14] H. Wang and X. Y. Zhou, "Continuous-time mean-variance portfolio selection: A reinforcement learning framework," arXiv e-prints, Apr. 2019.
[15] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, "Meta-reinforcement learning of structured exploration strategies," arXiv e-prints, Feb. 2018.
[16] K. Lee, Y. Seo, S. Lee, H. Lee, and J. Shin, "Context-aware dynamics model for generalization in model-based reinforcement learning," arXiv e-prints, May 2020.
[17] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, "Mastering Atari, Go, Chess and Shogi by planning with a learned model," arXiv e-prints, Nov. 2019.
[18] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski, "Model-based reinforcement learning for Atari," arXiv e-prints and ICLR 2020, Mar. 2019.
[19] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine, "SOLAR: Deep structured representations for model-based reinforcement learning," arXiv e-prints and ICML 2019, arXiv:1808.09105, Aug. 2018.
[20] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, "Learning to adapt in dynamic, real-world environments through meta-reinforcement learning," arXiv e-prints and ICLR 2019, Mar. 2018.
[21] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, "Dream to control: Learning behaviors by latent imagination," arXiv e-prints and ICLR 2020, Dec. 2019.
[22] M. P. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the International Conference on Machine Learning, 2011.
[23] S. Levine and P. Abbeel, "Learning neural network policies with guided policy search under unknown dynamics," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 1071–1079.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.
[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in ICML, vol. 48. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 1928–1937.
[26] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, "Reinforcement learning with unsupervised auxiliary tasks," CoRR, vol. abs/1611.05397, 2016.
[27] S. A. Ross, "The arbitrage theory of capital asset pricing," Journal of Economic Theory, vol. 13, no. 3, pp. 341–360, 1976.
[28] D. Harmon, B. Stacey, Y. Bar-Yam, and Y. Bar-Yam, "Networks of economic market interdependence and systemic risk," arXiv e-prints, Nov. 2010.
[29] F. Black, "Studies of stock price volatility changes," Proceedings of the 1976 Meetings of the American Statistical Association, Business and Economical Statistics Section, 1976.
[30] G. Wu, "The determinants of asymmetric volatility,"