Adversarial Deep Reinforcement Learning in Portfolio Management
Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, Yanran Li
Likelihood Technology; Sun Yat-sen University
{liangzhp, chenhao, zhujh, jiangkk, liyr}@mail.sysu.edu.cn

Abstract—In this paper, we implement three state-of-the-art continuous reinforcement learning algorithms, Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO) and Policy Gradient (PG), in portfolio management. All of them are widely used in game playing and robot control. Moreover, PPO has appealing theoretical properties which make it a promising candidate for portfolio management. We present their performance under different settings, including different learning rates, objective functions and feature combinations, in order to provide insights for parameter tuning, feature selection and data preparation. We also conduct intensive experiments on the China stock market and show that PG is more suitable for this financial market than DDPG and PPO, although the latter two are more advanced. Furthermore, we propose a so-called Adversarial Training method and show that it can greatly improve training efficiency and significantly increase the average daily return and Sharpe ratio in back-testing. With this modification, our experimental results show that our Policy Gradient agent can outperform the uniform constant rebalanced portfolio (UCRP).
Index Terms—Reinforcement Learning; Portfolio Management; Deep Learning; Policy Gradient; Deep Deterministic Policy Gradient; Proximal Policy Optimization
I. INTRODUCTION
Utilizing deep reinforcement learning in portfolio management is gaining popularity in the area of algorithmic trading. However, deep learning is notorious for its sensitivity to neural network structure, feature engineering and so on. Therefore, in our experiments, we explore the influence of different optimizers and network structures on trading agents built with three deep reinforcement learning algorithms: deep deterministic policy gradient (DDPG), proximal policy optimization (PPO) and policy gradient (PG). Our experiments were conducted on datasets from the China stock market. Our code is available on GitHub: https://github.com/qq303067814/Reinforcement-learning-in-portfolio-management-

II. SUMMARY
This paper is mainly composed of three parts. First, portfolio management concerns optimal asset allocation over time to achieve high return as well as low risk. Several major categories of portfolio management approaches have been proposed, including "Follow-the-Winner", "Follow-the-Loser", "Pattern-Matching" and "Meta-Learning Algorithms". Deep reinforcement learning is in fact a combination of "Pattern-Matching" and "Meta-Learning" [1].

Reinforcement learning is a way to learn by interacting with an environment and gradually improving performance by trial and error, and it has been proposed as a candidate for a portfolio management strategy. Xin Du et al. applied Q-learning and policy gradient in reinforcement learning and found that the direct reinforcement algorithm (policy search) enables a simpler problem representation than value-function-based search algorithms [2]. Saud Almahdi et al. extended recurrent reinforcement learning and built an optimal variable-weight portfolio allocation under expected maximum drawdown [3]. Xiu Gao et al. used absolute profit and relative risk-adjusted profit as performance functions to train the system and employed a committee of two networks, which was found to generate appreciable profits from trading in the foreign exchange markets [4].

Thanks to the development of deep learning, well known for its ability to detect complex features in speech recognition and image identification, the combination of reinforcement learning and deep learning, so-called deep reinforcement learning, has achieved great performance in robot control and game playing with little feature engineering and can be implemented end to end [5]. Function approximation has long been an approach to solving large-scale dynamic programming problems [6].
Deep Q-Learning, which uses a neural network as an approximator of the Q-value function and a replay buffer for learning, achieves remarkable performance in playing different games without changing the network structure and hyperparameters [7]. Deep Deterministic Policy Gradient (DDPG), one of the algorithms we choose for our experiments, uses an actor-critic framework to stabilize the training process and achieve higher sampling efficiency [8]. Another algorithm, Proximal Policy Optimization (PPO), aims to derive monotone improvement of the policy [9].

Due to the complicated, nonlinear patterns and low signal-to-noise ratio of financial market data, deep reinforcement learning is believed to have potential in this domain. Zhengyao Jiang et al. proposed a framework for deep reinforcement learning in portfolio management and demonstrated that it can outperform conventional portfolio strategies [10]. Yifeng Guo et al. refined the log-optimal strategy and combined it with reinforcement learning [12]. Lili Tang proposed a model-based actor-critic algorithm under an uncertain environment, where the optimal value function is obtained by iteration on the basis of a constrained risk range and a limited number of funds [13]. David W. Lu implemented Long Short Term Memory (LSTM) recurrent structures with Reinforcement Learning or Evolution Strategies acting as agents; the robustness and feasibility of the system was verified on GBPUSD trading [14]. Steve Y. Yang et al. proposed an investor-sentiment-reward-based trading system aimed at extracting only signals that generate either negative or positive market responses [15].
Hans Buehler presented a framework for hedging a portfolio of derivatives in the presence of market frictions such as transaction costs, market impact, liquidity constraints or risk limits using modern deep reinforcement learning methods [16].

However, most previous works use stock data from America, which provides little insight into the more volatile China stock market. Moreover, few works investigate the influence of the scale of the portfolio or of combinations of different features. To take a closer look at the true performance and uncover pitfalls of reinforcement learning in portfolio management, we choose the mainstream algorithms DDPG, PPO and PG and run intensive experiments with different hyperparameters, optimizers and so on.

The paper is organized as follows. In the second section we formally model the portfolio management problem. We show that the existence of transaction cost turns the problem from a pure prediction problem, whose globally optimal policy can be obtained by a greedy algorithm, into a computationally expensive dynamic programming problem. Most reinforcement learning algorithms focus on game playing and robot control, while we show that some key characteristics of portfolio management require modifications of the algorithms, and we propose a novel modification that we call Adversarial Training. In the third part we describe our experimental setup, including our data processing, our algorithms and our investigation into the effects of different hyperparameters on the accumulated portfolio value. In the fourth part we present our experimental results. In the fifth part we draw our conclusions and discuss future work on deep reinforcement learning in portfolio management.

III. PROBLEM DEFINITION
Given a period, e.g. one year, a stock trader invests in a set of assets and is allowed to reallocate them in order to maximize his profit. In our experiments, we assume that the market is continuous, in other words, each closing price equals the opening price of the next day. Each day the trading agent observes the stock market by analyzing data and then reallocates his portfolio. In addition, we assume that the agent conducts reallocation at the end of trading days, which means all reallocations can be executed at the closing prices. Transaction cost, measured as a fraction of the transaction amount, is also taken into consideration in our experiments.

Formally, the portfolio consists of $m+1$ assets: $m$ risky assets and one risk-free asset. Without depreciation, we choose cash as the risk-free asset. The closing price of the $i$-th asset after period $t$ is $v^{close}_{i,t}$; the closing prices of all assets form the price vector $v^{close}_t$ for period $t$. We model the problem as a Markov decision process, in which the next state depends only on the current state and action. The tuple $(S, A, P, r, \rho_0, \gamma)$ describes the entire portfolio management problem, where $S$ is the set of states, $A$ is the set of actions, $P: S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r: S \to \mathbb{R}$ is the reward function, $\rho_0: S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1]$ is the discount factor.

It is worth noting that in a Markov decision process, most objective functions take the discounted form $R = \sum_{t=1}^{T} \gamma^t r(s_t, a_t)$. In portfolio management, however, the wealth accumulated by time $t$ is reallocated at time $t+1$, so the wealth at time $T$, $P_T = P_0 \prod_{t=1}^{T} r_t$, takes a continued-product form rather than a summation.
A slight modification is therefore needed: taking the logarithm of the return transforms the continued product into a summation.

To clarify each item of the Markov decision process, we introduce some notation. Define $y_t = \frac{v_t}{v_{t-1}} = (1, \frac{v_{1,t}}{v_{1,t-1}}, \ldots, \frac{v_{m,t}}{v_{m,t-1}})^T$ as the price fluctuation vector. $w_{t-1} = (w_{0,t-1}, w_{1,t-1}, \ldots, w_{m,t-1})^T$ represents the reallocated weights at the end of time $t-1$, with the constraint $\sum_i w_{i,t-1} = 1$. We assume the initial wealth is $P_0$. The definitions of state, action and reward in portfolio management are as follows.

• State ($s$): a state includes the previous open, closing, high and low prices, volume and possibly other financial indexes over a fixed window.

• Action ($a$): the desired allocation weights $a_{t-1} = (a_{0,t-1}, a_{1,t-1}, \ldots, a_{m,t-1})^T$, the allocation vector at period $t-1$, subject to the constraint $\sum_{i=0}^{m} a_{i,t-1} = 1$. Due to the price movement within a day, the weight vector $a_{t-1}$ at the beginning of the day evolves into $w_{t-1}$ at the end of the day:

$$w_{t-1} = \frac{y_{t-1} \odot a_{t-1}}{y_{t-1} \cdot a_{t-1}}$$

Fig. 1. The evolution of the weight vector

• Reward ($r$): the fluctuation of wealth minus the transaction cost. The fluctuation of wealth is $a_{t-1}^T \cdot y_{t-1}$, and the transaction cost, which equals $\mu \sum_{i=1}^{m} |a_{i,t-1} - w_{i,t-1}|$, is subtracted from it. This expression reflects that only transactions in stocks incur transaction cost; we set $\mu$ to a small fixed fraction in our experiments. In conclusion, the immediate reward at time $t-1$ is:

$$r_t(s_{t-1}, a_{t-1}) = \log\Big(a_{t-1} \cdot y_{t-1} - \mu \sum_{i=1}^{m} |a_{i,t-1} - w_{i,t-1}|\Big)$$

The introduction of transaction cost is a nightmare for some traditional trading strategies, such as follow-the-winner and follow-the-loser. Even if we could precisely predict all future stock prices, deriving the optimal strategy is still intractable when the period is long or the portfolio is large. Without transaction cost, a greedy algorithm achieves optimal profits.
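As a concrete sketch of the definitions above (all prices and weights are hypothetical, and the transaction-cost fraction `mu` is an arbitrary illustrative value, not the one used in our experiments), the weight evolution and the log-reward can be computed as:

```python
import numpy as np

def evolve_weights(a, y):
    """Weights drift during the day as prices move: w = (y * a) / (y . a)."""
    return (y * a) / np.dot(y, a)

def reward(a, w_prev, y, mu=0.0025):
    """Log-return of the portfolio minus transaction cost on the risky assets.
    a: target weights, w_prev: weights inherited from the previous day,
    y: price relatives (first entry is the risk-free asset, always 1)."""
    cost = mu * np.sum(np.abs(a[1:] - w_prev[1:]))
    return np.log(np.dot(a, y) - cost)

# toy portfolio: cash plus two stocks
a = np.array([0.2, 0.5, 0.3])        # desired allocation a_{t-1}
w_prev = np.array([0.2, 0.4, 0.4])   # weights at the end of the previous day
y = np.array([1.0, 1.02, 0.99])      # price relatives v_t / v_{t-1}
r = reward(a, w_prev, y)
w = evolve_weights(a, y)             # weights at the end of the day
```

Note that the evolved weights always sum to one by construction, and the transaction cost strictly reduces the log-return relative to a cost-free reallocation.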
In such a naive setting, the optimal policy is simply to allocate all wealth into the asset with the highest expected growth rate. However, the existence of transaction cost can turn an action that deviates too much from the previous weight vector into a suboptimal one, whenever the transaction cost outweighs the immediate return.

Although a rich literature has discussed Markov decision processes, portfolio management is still challenging due to its particular properties. First and foremost, the abundant noise in stock data leads to distorted prices: observations of stock prices and financial indexes can hardly reflect the underlying states, and providing inefficient state representations to the algorithm leads to disastrous performance. Moreover, the transition probabilities between states are unknown; we must learn the environment before attempting to solve such a complex dynamic programming problem. Although stocks must be bought and sold in discrete units, we still adopt a continuity assumption; when wealth is much larger than the prices of individual stocks, this simplification does not lose much generality.

IV. DEEP REINFORCEMENT LEARNING
Reinforcement learning, especially combined with state-of-the-art deep learning methods, is therefore thought to be a good candidate for solving the portfolio problem. Reinforcement learning is a learning method by which the agent interacts with the environment with little prior information, learning from the environment by trial and error while refining its strategy at the same time. Its low requirements on modeling and feature engineering make it suitable for dealing with complex financial markets. Moreover, deep learning has made rapid progress in speech recognition and image identification; its advantage over conventional methods has proven its capability to capture complex, non-linear patterns. In fact, various methods using neural networks to design trading algorithms have been proposed.

Compared with using deep learning or reinforcement learning alone in portfolio management, deep reinforcement learning has three main strengths.

First, with the market's information as its input and the allocation vector as its output, deep reinforcement learning is a fully automated trading method: it avoids hand-made strategies built on predictions of future stock prices and can be entirely self-improving.

Second, deep reinforcement learning does not explicitly involve predictions of stock performance, which has been proven very hard. Therefore, fewer obstacles hinder the improvement of its performance.
Third, compared with conventional reinforcement learning, deep reinforcement learning approximates the strategy or value function with a neural network, which not only offers the flexibility of designing specific network structures but also prevents the so-called "curse of dimensionality", enabling large-scale portfolio management.

Several continuous reinforcement learning methods have been proposed, such as policy gradient, dual DQN, Deep Deterministic Policy Gradient and Proximal Policy Optimization. We conduct experiments with the latter two algorithms to test their potential in portfolio management.
A. Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) is a combination of Q-learning and policy gradient, and it succeeds in using neural networks as function approximators, building on the Deterministic Policy Gradient algorithms [18]. To illustrate its idea, we briefly introduce Q-learning and policy gradient, and then come to DDPG.

Q-learning is a reinforcement learning method based on the Q-value function. Specifically, the Q-value function gives the expected accumulated reward when executing action $a$ in state $s$ and following policy $\pi$ afterwards:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E, a_{i > t} \sim \pi}[R_t \mid s_t, a_t]$$

The Bellman equation allows us to compute it by recursion:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q^{\pi}(s_{t+1}, a_{t+1})]\big]$$

For a deterministic policy, which is a function $\mu: S \to A$, the above equation can be written as:

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\big]$$

Specifically, Q-learning adopts the greedy policy $\mu(s) = \arg\max_a Q(s, a)$.

Deep reinforcement learning uses a neural network as the Q-function approximator, and several techniques, including the replay buffer, have been proposed to improve convergence to the optimal policy. Instead of deriving the Q-value function by iteration, the function approximator, parameterized by $\theta^Q$, is obtained by minimizing the loss:

$$L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^{\beta}, a_t \sim \beta, r_t \sim E}\big[(Q(s_t, a_t \mid \theta^Q) - y_t)^2\big]$$

where

$$y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$$

It is worth noting that $y_t$ is calculated by a separate target network, which is softly updated from the online network. This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist.
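A minimal tabular sketch of the TD target and soft target-network update described above (a toy Q-table stands in for the neural network, and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, tau = 4, 3, 0.99, 0.01

Q_online = rng.normal(size=(n_states, n_actions))  # stands in for Q(s, a | theta^Q)
Q_target = Q_online.copy()                         # separate, softly updated copy

def td_targets(r, s_next, Q_target, gamma):
    """y = r + gamma * max_a Q'(s', a): the greedy action of the target
    network stands in for the target actor in this discrete sketch."""
    return r + gamma * Q_target[s_next].max(axis=1)

def critic_loss(s, a, y, Q_online):
    """Mean squared error between TD targets and online Q-values."""
    return np.mean((y - Q_online[s, a]) ** 2)

def soft_update(Q_target, Q_online, tau):
    """theta' <- tau * theta + (1 - tau) * theta': slow tracking of the online net."""
    return tau * Q_online + (1 - tau) * Q_target

# a toy minibatch of two transitions (s, a, r, s')
s = np.array([0, 1]); a = np.array([2, 0])
r = np.array([0.5, -0.1]); s_next = np.array([1, 3])
y = td_targets(r, s_next, Q_target, gamma)
loss = critic_loss(s, a, y, Q_online)
Q_target = soft_update(Q_target, Q_online, tau)
```

The soft update with small `tau` makes the target drift slowly towards the online parameters, which is what stabilizes the regression targets.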
The target network is thus another mechanism that improves convergence.

When dealing with a continuous action space, naively implementing Q-learning is intractable because of the "curse of dimensionality". Moreover, determining the globally optimal action of an arbitrary Q-value function may be infeasible without good properties, such as convexity, being guaranteed.

DDPG addresses the continuous control problem by adopting policy gradient: it contains an actor which directly outputs a continuous action. The policy is then evaluated and improved according to a critic, which is in fact a Q-value function approximator representing the objective function. Recall the goal of a Markov decision process: derive the optimal policy which maximizes the objective function. With the policy parameterized by $\theta$, we can formally write it as:

$$\tau = (s_1, a_1, s_2, a_2, \ldots)$$

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t \gamma^t r(s_t, a_t)\Big]$$

$$\pi_{\theta^*} = \arg\max_{\pi_\theta} J(\pi_\theta) = \arg\max_{\pi_\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t \gamma^t r(s_t, a_t)\Big] = \arg\max_{\pi_\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \arg\max_{\pi_\theta} \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

In deep reinforcement learning, gradient descent is the most common method to optimize a given objective function, which is usually non-convex and high-dimensional. Taking the derivative of the objective function requires taking the derivative of the policy. Assuming the time horizon is finite, we can write the policy in product form:

$$\pi_\theta(\tau) = \pi_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

However, this form is difficult to differentiate with respect to $\theta$.
To make it more tractable, a transformation turns it into summation form:

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \Big(\log p(s_1) + \sum_{t=1}^{T} \big(\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big)\Big) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Therefore, we can rewrite the differentiation of the objective function in terms of the logarithm of the policy:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\Big)\Big]$$

In deep deterministic policy gradient, four networks are required: online actor, online critic, target actor and target critic. Combining Q-learning and policy gradient, the actor is the function $\mu$ and the critic is the Q-value function. The agent observes a state, and the actor provides an "optimal" action in the continuous action space. The online critic then evaluates the actor's proposed action and is used to update the online actor, while the target actor and target critic are used to update the online critic.

Formally, the update scheme of DDPG is as follows. For the online actor:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a \mid \theta^Q)\big|_{s = s_t, a = \mu(s_t \mid \theta^\mu)}\big] = \mathbb{E}_{s_t \sim \rho^\beta}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_t, a = \mu(s_t)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_t}\big]$$

For the online critic, the update rule is similar. The target actor and target critic are softly updated from the online actor and online critic. We leave the details to the presentation of the algorithm:
Algorithm 1
DDPG

Randomly initialize actor $\mu(s \mid \theta^\mu)$ and critic $Q(s, a \mid \theta^Q)$
Create targets $Q'$ and $\mu'$ with $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for $i = 1$ to $M$ do
    Initialize an Ornstein-Uhlenbeck process $N$
    Receive initial observation state $s_1$
    for $t = 1$ to $T$ do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + N_t$
        Execute action $a_t$ and observe $r_t$ and $s_{t+1}$
        Save transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
        Update the actor policy by the policy gradient:
            $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i, a = \mu(s_i \mid \theta^\mu)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
            $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
    end for
end for

B. Proximal Policy Optimization
Most algorithms for policy optimization can be classified into three broad categories: (1) policy iteration methods, (2) policy gradient methods and (3) derivative-free optimization methods. Proximal Policy Optimization (PPO) falls into the second category. Since PPO is based on Trust Region Policy Optimization (TRPO) [19], we introduce TRPO first and then PPO.

TRPO finds a lower bound for policy improvement, so that policy optimization can work with a surrogate objective function. This guarantees monotone improvement of the policies. Formally, let $\pi$ denote a stochastic policy $\pi: S \times A \to [0, 1]$, which derives a distribution over the continuous action space in a given state to represent the fitness of every action. Let

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \quad s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$$

We follow the standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$ and the advantage function $A_\pi$:

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]$$

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$

The expected return of another policy $\tilde{\pi}$ over $\pi$ can be expressed in terms of the advantage accumulated over timesteps:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]$$

The above equation can be rewritten in terms of states:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) = \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

where $\rho_{\tilde{\pi}}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$ denotes the discounted visitation frequency of state $s$ under policy $\tilde{\pi}$. However, the dependence on policy $\tilde{\pi}$ makes the equation difficult to compute.
Instead, TRPO proposes the following local approximation:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

The lower bound on policy improvement, one of the key results of TRPO, provides a theoretical guarantee of monotonic policy improvement:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, \alpha^2$$

where

$$\epsilon = \max_{s,a} |A_\pi(s, a)|, \quad \alpha = D_{TV}^{max}(\pi_{old}, \pi_{new}) = \max_s D_{TV}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s))$$

and $D_{TV}(p \,\|\, q) = \frac{1}{2} \sum_i |p_i - q_i|$ is the total variation divergence between two discrete probability distributions. Since $D_{KL}(p \,\|\, q) \ge D_{TV}(p \,\|\, q)^2$, we can derive the following inequality, which is used in the construction of the algorithm:

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\, D_{KL}^{max}(\pi, \tilde{\pi})$$

where

$$C = \frac{4\epsilon\gamma}{(1-\gamma)^2}, \quad D_{KL}^{max}(\pi, \tilde{\pi}) = \max_s D_{KL}(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s))$$

The proofs of the above results are available in [19]. Going further into the details, let $M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{max}(\pi_i, \pi)$. Two properties follow without much difficulty:

$$\eta(\pi_i) = M_i(\pi_i), \qquad \eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$$

Therefore, the lower bound of the policy improvement is:

$$\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$

Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. For parameterized policies $\pi_{\theta_i}$, the policy optimization becomes:

$$\max_{\pi_{\theta_i}} \big[L_{\pi_{\theta_{i-1}}}(\pi_{\theta_i}) - C\, D_{KL}^{max}(\pi_{\theta_{i-1}}, \pi_{\theta_i})\big]$$

However, the penalty coefficient $C$ from the theoretical result yields policy updates with too small step sizes. In the final TRPO algorithm, an alternative optimization problem is therefore proposed after careful consideration of the structure of the objective function:

$$\max_{\pi_{\theta_i}} L_{\pi_{\theta_{i-1}}}(\pi_{\theta_i}) \quad \text{s.t.} \quad \bar{D}_{KL}^{\rho_{\pi_{\theta_{i-1}}}}(\pi_{\theta_{i-1}}, \pi_{\theta_i}) \le \delta$$

where

$$\bar{D}_{KL}^{\rho}(\pi_{\theta_1}, \pi_{\theta_2}) = \mathbb{E}_{s \sim \rho}\big[D_{KL}(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s))\big]$$

Further approximations are proposed to make the optimization tractable.
Recall that the original optimization problem can be written as:

$$\max_{\pi_{\theta_i}} \sum_s \rho_{\pi_{\theta_{i-1}}}(s) \sum_a \pi_{\theta_i}(a \mid s)\, A_{\pi_{\theta_{i-1}}}(s, a)$$

After some approximations, including importance sampling, the final optimization becomes:

$$\max_{\pi_{\theta_i}} \mathbb{E}_{s \sim \rho_{\pi_{\theta_{i-1}}}, a \sim q}\Big[\frac{\pi_{\theta_i}(a \mid s)}{q(a \mid s)}\, A_{\pi_{\theta_{i-1}}}(s, a)\Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta_{i-1}}}}\big[D_{KL}(\pi_{\theta_{i-1}}(\cdot \mid s) \,\|\, \pi_{\theta_i}(\cdot \mid s))\big] \le \delta$$

Here comes PPO [9]: it proposes new surrogate objectives that simplify TRPO. One of them is the clipped surrogate objective, which we choose in our experiments. Denote $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}$. The clipped surrogate objective can be written as:

$$L^{CLIP}(\theta) = \mathbb{E}\big[\min\big(r(\theta) A,\; \text{clip}(r(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, A\big)\big]$$

This new surrogate objective constrains the update step in a much simpler manner, and experiments show it outperforms the original objective function in terms of sample complexity.
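The clipped surrogate objective is straightforward to compute; a small sketch with hypothetical probability ratios and advantages:

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """L^CLIP = E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]: the clip removes
    the incentive to push the ratio pi_theta / pi_old outside [1-eps, 1+eps]."""
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))

ratio = np.array([0.5, 1.0, 1.5])   # pi_theta(a|s) / pi_old(a|s), hypothetical
adv = np.array([1.0, 1.0, 1.0])     # positive advantages
val = clipped_surrogate(ratio, adv)
# the third ratio is clipped to 1.2, so the terms are 0.5, 1.0, 1.2 -> mean 0.9
```

With positive advantages, a ratio above $1+\epsilon$ contributes only its clipped value, so there is no gradient incentive to move further away from the old policy.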
Algorithm 2
PPO

Initialize actor $\mu: S \to \mathbb{R}^{m+1}$ and $\sigma: S \to diag(\sigma_1, \sigma_2, \cdots, \sigma_{m+1})$
for $i = 1$ to $M$ do
    Run policy $\pi_\theta \sim N(\mu(s), \sigma(s))$ for $T$ timesteps and collect $(s_t, a_t, r_t)$
    Estimate advantages $\hat{A}_t = \sum_{t' > t} \gamma^{t' - t} r_{t'} - V(s_t)$
    Update old policy $\pi_{old} \leftarrow \pi_\theta$
    for $j = 1$ to $N$ do
        Update the actor policy by the policy gradient: $\sum_i \nabla_\theta L_i^{CLIP}(\theta)$
        Update the critic by: $\nabla L(\phi) = -\sum_{t=1}^{T} \nabla \hat{A}_t$
    end for
end for

V. ADVERSARIAL LEARNING
Although deep reinforcement learning has potential in portfolio management, given its competence in capturing nonlinear features, its low prior assumptions and its similarity to human investing, several characteristics deserve attention:

• The financial market is highly volatile and non-stationary, which is totally different from game playing or robot control.

• Conventional reinforcement learning is designed for infinite-horizon MDPs, while portfolio management seeks to maximize absolute portfolio value or another objective in finite time.

• In game playing or robot control there is no need to split a training set and a testing set, while in the financial market a satisfying performance in back-testing is essential for evaluating strategies.

• The stock market has an explicit expression for portfolio value, which does not exist in game playing and robot control. Therefore, approximating the value function is unnecessary and can even deteriorate the agent's performance due to the difficulty of and error in the approximation.

Therefore, some modifications are needed in order to apply this method to portfolio management. Adopting average return instead of discounted return can mitigate the contradiction between infinite and finite horizons. In our experiments, we found that DDPG and PPO both perform unsatisfactorily during training, indicating that they cannot figure out the optimal policy even on the training set.

To meet the higher robustness and risk-sensitivity requirements of deep reinforcement learning in portfolio management, we propose so-called adversarial training. In fact, risk-sensitive MDPs and robust MDPs are both preferable approaches in portfolio management. L.A. Prashanth et al. devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, while establishing the convergence of the algorithms to locally risk-sensitive optimal policies [17].
Motivated by A. Pattanaik et al., who train two reinforcement learning agents playing adversarially to enhance the main player's robustness [20], and by $L^\infty$ control, we propose adversarial training, which adds random noise to the market prices. In our experiments, we add zero-mean Gaussian noise to the data. However, based on Conditional Value at Risk (CVaR), a distribution with non-zero expectation could also be adopted to make the agent more conservative:

$$CVaR = \frac{1}{1 - c} \int_{-\infty}^{VaR} x\, p(x)\, dx$$

Our revised Policy Gradient, used in the following experiments, is given in Algorithm 3.
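The noise-injection step of adversarial training can be sketched as follows (the noise scale `sigma` is a placeholder for illustration, not the variance used in our experiments):

```python
import numpy as np

def adversarial_prices(prices, sigma=0.002, rng=None):
    """Perturb price relatives with zero-mean Gaussian noise so the agent
    never trains twice on exactly the same price path."""
    rng = rng or np.random.default_rng()
    return prices * (1.0 + rng.normal(0.0, sigma, size=prices.shape))

prices = np.ones((1200, 5))   # toy price-relative matrix: 1200 days, 5 assets
noisy = adversarial_prices(prices, rng=np.random.default_rng(42))
```

Because the noise is resampled every episode, the agent effectively sees a fresh perturbation of the training data each time, which discourages overfitting to one historical path.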
Algorithm 3
Adversarial PG

Randomly initialize actor $\mu(s \mid \theta^\mu)$
Initialize replay buffer $R$
for $i = 1$ to $M$ do
    Receive initial observation state $s_1$
    Add noise to the price data
    for $t = 1$ to $T$ do
        Select action $\omega_t = \mu(s_t \mid \theta^\mu)$
        Execute action $\omega_t$ and observe $r_t$, $s_{t+1}$ and $\omega'_t$
        Save transition $(s_t, \omega_t, \omega'_t)$ in $R$
    end for
    Update the actor policy by the policy gradient:
        $\nabla_{\theta^\mu} J = \nabla_{\theta^\mu} \frac{1}{N} \sum_{t=1}^{T} \log\Big(\omega_t \cdot y_t - \mu \sum_{i=1}^{m} |\omega_{i,t} - \omega'_{i,t}|\Big)$
end for

VI. EXPERIMENTS
A. Data preparation
Our experiments are conducted on China stock data from investing.com and Wind. A fixed number of assets (5 in our experiments) are randomly chosen from the asset pool. To ensure enough data is provided for learning, after a portfolio is formed we check the intersection of the assets' available trading histories, and we only run our agent on the portfolio if this intersection is longer than our pre-set threshold (1200 days in our experiments).

In order to derive a general agent that is robust across different stocks, we normalize the price data. To be specific, we divide the opening price, closing price, high price and low price by the closing price on the last day of the period. For data missing during weekends and holidays, in order to maintain time series consistency, we fill the empty price data with the closing price of the previous day and set the volume to 0 to indicate that the market is closed on that day.

B. Network structure
Motivated by Jiang et al., we use so-called Identical Independent Evaluators (IIE). IIE means that the network flows independently for the m+1 assets while the network parameters are shared among these streams. The network evaluates one stock at a time and outputs a scalar to represent its preference for investing in this asset. The m+1 scalars are then normalized by a softmax function and compressed into a weight vector as the next period's action. IIE has some crucial advantages over an integrated network, including scalability in portfolio size, data-usage efficiency and plasticity to the asset collection. The explanation can be found in [10], and we do not repeat it here.

We find that in other works on deep learning in portfolio management, CNN outperforms RNN and LSTM in most cases. However, different from Jiang et al., we replace the CNN with a Deep Residual Network. The depth of a neural network plays an important role in its performance, but a conventional CNN is prevented from going deeper by gradient vanishing and gradient explosion as its depth increases. A deep residual network solves this problem by adding shortcuts that let layers connect directly to deeper layers, which prevents the network from deteriorating as depth is added. Deep Residual Networks have achieved remarkable performance in image recognition and greatly contributed to the development of deep learning [11]. As for the structure of our PG network, we adopt settings similar to Jiang's and do not go into detail here.

TABLE I
HYPERPARAMETERS IN OUR EXPERIMENTS
C. Results

1) Learning rate:
The learning rate plays an essential role in neural network training; however, it is also very subtle. A high learning rate makes the training loss decrease fast at the beginning, but it may drop into a local minimum or oscillate around the optimal solution without reaching it. A low learning rate makes the training loss decrease very slowly, even after a large number of epochs. Only a proper learning rate can help the network achieve a satisfactory result. Therefore, we implement DDPG and test it with different learning rates. The results show that learning rates have a significant effect on the critic loss, even though the actor's learning rate does not directly control the critic's training. We find that when the actor learns new patterns, the critic loss jumps, which indicates that the critic does not have sufficient generalization ability towards new states. Only when the actor becomes stable can the critic loss decrease.
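The learning-rate trade-off described above can be illustrated on a toy quadratic loss (all rates are hypothetical and unrelated to the ones used in our experiments):

```python
import numpy as np

def descend(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x.
    Returns the distance from the optimum after the given number of steps."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

slow = descend(0.01)       # converges, but slowly: still far from 0 after 50 steps
good = descend(0.3)        # a proper rate converges quickly
diverging = descend(1.1)   # overshoots the minimum and |x| grows every step
```

Even on this trivial loss, a rate that is too large makes every step overshoot and the iterate diverges, while a rate that is too small leaves the loss high after many steps; neural network training exhibits the same qualitative behavior.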
2) Risk:
Fig. 4. PPO network structure in our experiments
Fig. 5. Critic loss under different actor learning rates

Due to the limitation of training data, our reinforcement learning agent may underestimate risk when trained in a bull market, which can cause a disastrous deterioration of its performance in a real trading environment. Different approaches from finance can help evaluate the current portfolio risk to alleviate the effect of biased training data. Inspired by Almahdi et al., whose objective function is risk-adjusted, and by Jacobsen et al., who show that volatility clusters over time, we modify our objective function as follows:

$$R = \sum_{t=1}^{T} \gamma^{t} \left( r(s_t, a_t) - \beta \sigma_t \right)$$

where

$$\sigma_t = \frac{1}{L} \sum_{t'=t-L+1}^{t} \sum_{i=1}^{m+1} \left( y_{i,t'} - \bar{y}_{i,t} \right)^2 w_{i,t}, \qquad \bar{y}_{i,t} = \frac{1}{L} \sum_{t'=t-L+1}^{t} y_{i,t'}$$

measures the volatility of the returns of asset i over the last L days. The objective is thus constrained by reducing the profit earned from investing in highly volatile assets, which would otherwise expose our portfolio to excessive danger.

Unfortunately, the results do not seem to support our modification. We also train our agent with an objective function in the form of the Sharpe ratio, but it fails as well. In fact, reward engineering is one of the core topics in designing reinforcement learning algorithms [22]; it seems that our modification makes the objective function too complex.

Fig. 6. Critic loss under different critic learning rates
Fig. 7. Comparison of portfolio value with different risk penalties (β)
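The risk-adjusted objective above can be computed as follows. This is a sketch under our own reading of the formula (the function name and toy data are hypothetical; the volatility term is taken as the weighted variance of price relatives over the last L steps):

```python
import numpy as np

def risk_adjusted_return(rewards, y, w, beta=0.1, gamma=0.99, L=5):
    """Discounted reward with a rolling-volatility penalty:
    R = sum_t gamma^t * (r_t - beta * sigma_t).
    y[t, i] : price relative of asset i at step t
    w[t, i] : portfolio weight of asset i at step t
    sigma_t : weighted variance of price relatives over the last L steps
    """
    R = 0.0
    for t in range(len(rewards)):
        lo = max(0, t - L + 1)
        window = y[lo:t + 1]                   # last (up to) L price relatives
        dev = window - window.mean(axis=0)     # deviation from window mean
        sigma = float((dev ** 2).mean(axis=0) @ w[t])
        R += gamma ** t * (rewards[t] - beta * sigma)
    return R

# Toy data: 20 steps, 4 assets, equal weights.
rng = np.random.default_rng(1)
T, m = 20, 4
y = 1 + 0.02 * rng.standard_normal((T, m))
w = np.full((T, m), 1 / m)
r = np.log(y @ w[0])                           # toy per-step log return
print(risk_adjusted_return(r, y, w))
```

Since the penalty term is non-negative, any positive β can only lower the objective relative to the unpenalized return, which is exactly the constraining effect described above.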
3) Feature combinations:
As far as we know, few works discuss combinations of features in reinforcement learning. Unlike end-to-end game playing or robot control, where the input is raw pixels, in portfolio management abundant features can be taken into consideration. Common features include the closing, open, high and low prices and the volume. Moreover, financial indexes for long-term analysis such as the price-to-earnings ratio (PE) and the price-to-book ratio (PB) can also provide insight into market movements. However, adding irrelevant features introduces noise and deteriorates training; this trade-off is the subject of feature selection. We therefore conduct experiments under different combinations of features: 1) closing prices only, 2) closing and high prices, 3) closing and open prices, 4) closing and low prices. The results show that feature combinations matter in the training process: selecting closing and high prices helps the agent achieve the best performance in our experiments.
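The four input variants above can be assembled like this (an illustrative sketch with synthetic OHLC data; the combination names and normalization are our own, chosen to mimic the price-relative inputs common in this line of work):

```python
import numpy as np

# Toy OHLC data for T days and m assets (values are synthetic).
T, m = 30, 3
rng = np.random.default_rng(2)
close = np.cumprod(1 + 0.01 * rng.standard_normal((T, m)), axis=0)
high  = close * (1 + 0.01 * rng.random((T, m)))
low   = close * (1 - 0.01 * rng.random((T, m)))
open_ = close * (1 + 0.005 * rng.standard_normal((T, m)))

COMBOS = {
    "close":      [close],          # 1) closing prices only
    "close+high": [close, high],    # 2) closing and high
    "close+open": [close, open_],   # 3) closing and open
    "close+low":  [close, low],     # 4) closing and low
}

def make_input(name):
    """Stack the chosen features into a (features, T, assets) tensor,
    normalized by the latest closing price to make the input scale-free."""
    feats = np.stack(COMBOS[name])  # shape (F, T, m)
    return feats / close[-1]

for name in COMBOS:
    print(name, make_input(name).shape)
```

Keeping the feature axis first makes it easy to feed each combination into the same convolutional network, so only the input channel count changes between experiments.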
4) Training and Testing:
Fig. 8. Comparison of critic loss with different feature combinations
Fig. 9. Comparison of reward with different feature combinations

After the experiments mentioned above, we derive a satisfactory set of hyperparameters and feature combinations. Under this setting, we train for 1000 epochs on both the China and USA stock markets. The results show that training increases the accumulated portfolio value (APV) while reducing the volatility of the returns on the training data.
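For reference, the accumulated portfolio value used above is simply the product of the per-period portfolio returns; a minimal computation (with made-up numbers) looks like:

```python
import numpy as np

def apv(price_relatives, weights):
    """Accumulated portfolio value: the product over t of y_t . w_t,
    where y_t are price relatives (p_t / p_{t-1}) and w_t the weights."""
    return float(np.prod(np.sum(price_relatives * weights, axis=1)))

y = np.array([[1.01, 0.99],
              [1.02, 1.00],
              [0.98, 1.03]])            # 3 days, 2 assets
w = np.full((3, 2), 0.5)                # equal-weight portfolio each day
print(apv(y, w))                        # 1.0 * 1.01 * 1.005
```

Starting from an initial value of 1, an APV above 1 means the strategy made money over the period.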
5) Noise:
We present the training process with and without adversarial learning [12]. We conduct 50 groups of experiments and show 25 of them below. After adding noise to the stock price data, remarkable progress can be seen in the training process: the agent successfully avoids getting stuck at saddle points and reaches a significantly higher APV after 100 epochs of training. We also conduct a statistical test on the backtest results. Our first null hypothesis is

H0: ARR(adversarial) ≤ ARR(baseline).

The p-value of the t-test is 0.0076, so we have at least 99% confidence that this adversarial training process improves the average daily return.
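The noise-injection step itself is simple; a sketch of the kind of perturbation described above (the function name, noise level and toy series are our own assumptions, not the paper's exact scheme):

```python
import numpy as np

def perturb_prices(prices, noise_level=0.002, rng=None):
    """Adversarial-training-style data augmentation: multiply each price
    by (1 + eps) with small Gaussian eps, drawn fresh each training epoch."""
    rng = rng or np.random.default_rng()
    eps = noise_level * rng.standard_normal(prices.shape)
    return prices * (1.0 + eps)

prices = np.linspace(10, 12, 50).reshape(-1, 1)   # toy price series
noisy = perturb_prices(prices, rng=np.random.default_rng(3))
print(float(np.abs(noisy / prices - 1).max()))    # perturbation stays small
```

Because the perturbation is redrawn every epoch, the agent never sees exactly the same price path twice, which is one plausible reason it escapes saddle points more easily.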
Fig. 10. Comparison of portfolio value before and after learning on China stock market training data, by DDPG
Fig. 11. Comparison of portfolio value before and after learning on China stock market training data, by PG
What’s more, we want to investigate whether this modifica-tion can improve sharpe ratio in the backtest. Our second nullhypothesis is : H : SharpeRatio < SharpeRatio The p value of the t test is 0.0338 and we can have at least95% confidence to believe this modifications indeed promotethe sharpe ratio.Finnaly, we want to investigate whether this modificationwould make our agent more volatile. Our third null hypothesisis : H : M axM arkDown < M axM arkDown The p value of the t test is 2.73e-8 and we can have at least99.9% confidence to believe this modifications indeed makeour agent more volatile.Then we back test our agent on China data. The unsatisfyingperformance of PPO algorithm uncovers the considerable ig. 12. Comparison of portfolio value in training process with and withoutadversarial learningADR(%) Sharpe MMD ADR(%) Sharpe MMD1 0.416 1.171 0.416 0.226 0.678 0.222 0.242 0.647 0.417 0.31 0.885 03 0.242 0.724 0.224 0.249 0.753 0.134 0.298 0.859 0.349 0.304 0.921 0.1195 0.262 0.765 0.45 0.254 0.802 06 0.413 1.142 0.305 0.323 0.903 07 0.213 0.668 0.449 0.202 0.667 0.2318 0.187 0.554 0.347 0.276 0.836 09 0.471 1.107 0.649 0.308 0.873 0.27710 0.32 0.795 0.546 0.297 0.812 0.27911 0.312 0.837 0.195 0.338 0.924 012 0.17 0.741 0.242 0.202 0.715 0.1913 0.313 0.825 0.341 0.26 0.691 0.3414 0.345 0.931 0.263 0.307 0.892 0.09615 0.573 1.499 0.609 0.354 0.993 0.27216 0.493 1.337 0.42 0.328 0.91 017 0.348 0.911 0.307 0.364 1.002 0.2318 0.198 0.601 0.251 0.244 0.756 0.08619 0.306 0.813 0.46 0.295 0.863 020 0.377 1.099 0.419 0.313 0.949 0.16521 0.325 0.876 0.23 0.308 0.828 0.23722 0.301 0.918 0.123 0.269 0.824 023 0.373 1.176 0.514 0.245 0.819 0.13824 0.487 1.257 0.461 0.33 0.914 025 0.426 1.14 0.386 0.415 1.127 0.408TABLE IIP
ERFORMANCES OF A DVERSARIAL AND NOT - ADVERSARIAL L EARNING gap between game playing or robot control and portfoliomanagement. Random policy seems unsuitable in such anunstationary, low signal noise ratio financial market althoughits theoretical properties are appealing, including monotoneimprovement of policy and higher sample efficiency. All thedetail experiments results, including the portfolio stocks codes,training time period, testing time period and average dailyreturn, sharpe ratio and max drawdown of PG agent, URCPagent, Follow-the-winner and Follow-the-loser agent can beviewed in . https://github.com/qq303067814/Reinforcement-learning-in-portfolio-management-/tree/master/Experiments%20Result Fig. 13. Backtest on China Stock Market H : ARR < ARR The p value of the t test is 0.039 and we can have at least95% confidence to believe PG agent can outperform URCPon average daily return. H : SharpeRatio < SharpeRatio The p value of the t test is 0.013 and we can have at least95% confidence to believe PG agent can outperform URCPon Sharpe Ratio. H : M axM arkDown < M axM arkDown The p value of the t test is 1e-11 and we can have at least99.9% confidence to believe PG agent’s max markdown ishigher than URCP. VII. F
UTURE WORK
Thanks to the characteristics of portfolio management, there are still many interesting topics to explore in combination with deep reinforcement learning. In future research, we will try other indicators to measure the risk of our asset allocation, and work on combinations with conventional models in finance so as to take advantage of previous finance research. Specifically, we believe model-based reinforcement learning is a good candidate for portfolio management instead of model-free methods [23][24]. In model-based reinforcement learning, a model of the dynamics is used to make predictions that drive action selection. Let f_θ(s_t, a_t) denote a learned discrete-time dynamics function, parameterized by θ, that takes the current state s_t and action a_t and outputs an estimate of the next state at time t + Δt. We can then choose actions by solving the following optimization problem:

$$(a_t, \ldots, a_{t+H-1}) = \arg\max_{a_t, \ldots, a_{t+H-1}} \sum_{t'=t}^{t+H-1} \gamma^{t'-t} \, r(s_{t'}, a_{t'})$$

What's more, since neural networks are sensitive to data quality, traditional financial noise-reduction approaches can be utilized, such as wavelet analysis [25] and the Kalman filter [26]. A different approach to data pre-processing is to combine an HMM with reinforcement learning, i.e., to extract the states beneath the fluctuating prices and learn directly from them [27].

Modifications of the objective function can also be taken into consideration. One direction is to adopt risk-adjusted return. Another direction comes from experience in designing RL agents for game playing, where the reward function is simple: in Flappy Bird, for example, the agent receives reward 1 when passing a pillar and reward -1 when it drops to the ground. A complex objective function can hinder the agent from achieving desirable performance.
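The model-based planning objective discussed above can be approximated by random-shooting: sample candidate action sequences, roll each out through the learned dynamics, and execute the first action of the best sequence. Below is a sketch with a toy stand-in for f_θ and a toy reward (all names and dynamics are hypothetical, for illustration only):

```python
import numpy as np

def plan(f, s, horizon=5, n_candidates=256, rng=None):
    """Random-shooting planning with a dynamics model f(s, a):
    sample action sequences, roll each out through f, and return the
    first action of the sequence with the highest total reward."""
    rng = rng or np.random.default_rng()
    best_ret, best_a0 = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=horizon)
        state, ret = s, 0.0
        for a in actions:
            state = f(state, a)
            ret += -state ** 2          # toy reward: keep the state near zero
        if ret > best_ret:
            best_ret, best_a0 = ret, actions[0]
    return best_a0

toy_dynamics = lambda s, a: 0.9 * s + 0.1 * a   # stand-in for f_theta
a0 = plan(toy_dynamics, s=1.0, rng=np.random.default_rng(4))
print(a0)   # a first action that pushes the state toward zero
```

In practice the planner is re-run at every step with the newly observed state (model-predictive control), so modeling errors do not compound over the full horizon.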
We have tried a naive variant of accumulated portfolio value as the objective function, taking the win rate instead of the absolute return, but we did not obtain a satisfying improvement.

VIII. CONCLUSION
This paper applies deep reinforcement learning algorithms with continuous action spaces to asset allocation. We compare the performances of the DDPG, PPO and PG algorithms under different hyperparameters. Compared with previous work on portfolio management using reinforcement learning, we test our agents with risk-adjusted accumulated portfolio value as the objective function and with different feature combinations as input. The experiments show that the strategy obtained by the PG algorithm outperforms UCRP in asset allocation. We find that deep reinforcement learning can, to some extent, capture patterns of market movements and improve its own performance, even though it observes only limited data and features.

However, reinforcement learning has so far not achieved as remarkable a performance in portfolio management as in game playing or robot control. We offer a few explanations. First, second-order differentiability of the output strategy with respect to the network parameters, and of the expectation in the Q-value, is necessary for convergence of the algorithm. As a consequence, we can only search for an optimal policy within the set of second-order-differentiable strategy functions rather than the whole policy function set, which may prevent us from finding the globally optimal strategy. Second, the algorithm requires stationary transitions; due to market irregularities and government intervention, state transitions in the stock market may be time-varying. In our experiments, deep reinforcement learning is highly sensitive, and its performance is unstable. Moreover, the degeneration of our reinforcement learning agent, which often tends to buy only one asset at a time, indicates that more modifications are needed to design promising algorithms.

ACKNOWLEDGMENT