Application of Deep Q-Network in Portfolio Management
Ziming Gao, Yuan Gao, Yi Hu, Zhengyong Jiang†, Jionglong Su†
Department of Mathematics, Xi'an Jiaotong-Liverpool University, Suzhou, P.R. China
[email protected], [email protected]
Abstract—Machine Learning algorithms and Neural Networks are widely applied to many different areas such as stock market prediction, face recognition and population analysis. This paper introduces a strategy based on the classic Deep Reinforcement Learning algorithm, Deep Q-Network (DQN), for portfolio management in the stock market. DQN is a type of deep neural network optimized by Q Learning. To make the DQN adapt to the financial market, we first discretize the action space, defined as the weights of the portfolio in different assets, so that portfolio management becomes a problem that Deep Q-Network can solve. Next, we combine a Convolutional Neural Network with a dueling Q-net to enhance the recognition ability of the algorithm. Experimentally, we chose five weakly correlated American stocks to test the model. The result demonstrates that the DQN-based strategy outperforms ten other traditional strategies; the profit of the DQN algorithm is 30% higher than that of the other strategies. Moreover, the Sharpe ratio together with the Maximum Drawdown demonstrates that the risk of the policy made with DQN is the lowest.
Keywords-Q Learning; Convolutional Neural Network (CNN); Portfolio Management

I. INTRODUCTION
An optimized stock trading strategy is a process of making decisions about the allocation of capital into different stocks in order to maximize performance measures such as expected return and Sharpe ratio. Traditionally, portfolio trading strategies may be broadly classified into four categories, namely "Follow-the-Winner", "Follow-the-Loser", "Pattern-Matching", and "Meta-Learning" [1], [7]. However, in financial environments there exist correlations between prices and other factors, as well as substantial noise, which limits the usefulness of the traditional methods. In view of this, deep machine-learning approaches are now applied to financial market trading [23]. Nevertheless, many of them tend to predict price movements: they feed historical asset prices into a neural network, output a prediction of asset prices in the next trading period, and let the trading agent take action based on these predictions [1], [8], [9]. This idea seems reasonable, but the performance of such algorithms is highly dependent on the prediction accuracy of future market prices. Therefore, many studies [24], [25] solve the problem by Reinforcement Learning without predicting future prices. Previous studies indicated good performance in each setting; however, limitations remain, such as not being adaptable to multi-asset portfolios [1]. The goal of Reinforcement Learning is to learn good policies that bring more profit by optimizing an accumulative future reward signal in sequential decision-making problems, with Q Learning being one of the most popular Reinforcement Learning algorithms [2]. More recently, the Deep Q-Network algorithm, which combines Q Learning with a deep neural network, has been introduced and applied to many areas. In this paper, we propose a DQN framework specifically designed for a multi-asset portfolio trading strategy.
This framework allows the DQN agent to optimize trading strategies by learning from its experience in the financial environment so that it can adapt to the real financial market. In our study, a discrete action space is defined to increase the practicality of strategies, and algorithm performance is further improved by taking advantage of double DQN [2] and dueling DQN [5]. The rest of this paper is organized as follows. Section II defines the portfolio management problem that this project aims to solve, and all the assumptions of this study are listed in Section III. Section IV introduces the discrete actions, and asset preselection and the input price tensor are described in Section V. Sections VI and VII present the DQN algorithm, and Section VIII shows the network topology. The experiment and results are stated in Section IX. Section X contains conclusions and future work.

II. PROBLEM STATEMENT
In classical portfolio management theory, portfolio management aims to find the best investment policy that gives the maximum overall portfolio value in a given period. In practice, the investor modifies the weights of the portfolio in different assets according to the prices of these assets and the previous distribution of the portfolio, and this process is approximately described as a Markov Decision Process (MDP) [6]. Essentially, an MDP is a mathematical model used to formulate an optimized policy, consisting of a tuple (S_t, a_t, P_t, R_t). The meaning of each element of the tuple is listed below:
• S_t - the state at time t
• a_t - the action taken at time t
• P_t - the probability of transitioning from state S_t to S_{t+1}
• R_t - the reward at time t

We can construct a model for the portfolio management problem by defining the state at time t, S_t, to be the prices of the invested assets, and the action at time t as

a_t ≜ w_{t+1} − w_t (1)

where w_t and w_{t+1} are the portfolio weight vectors at times t and t+1. The reward is

R_t ≜ p_{t+1} − p_t (2)

where p_t and p_{t+1} are the portfolio values at times t and t+1. In (2), we only consider the reward at the present time t, but under a given policy π, the state at time t affects all subsequent states, which means the value of S_t includes not only R_t but also the rewards of the following time periods. Therefore, with policy π, the value function G^π of S_t is defined as

G^π(S_t) ≜ Σ_{k=t}^{T} γ^{k−t} R_k (3)

in which T is the last trading period and γ ∈ (0, 1] is a discount factor. In general, G^π(S_t) cannot simply be obtained using (3), so we compute its estimated value by taking the expectation of G^π. Since the policy π is determined uniquely by the action a_t, we define the value function Q^π of S_t and a_t as

Q^π(S_t, a_t) ≜ E[G^π(S_t)] (4)

Based on (4), we may estimate the value of a state and of an action taken in this state, which is the basic principle of Q Learning. Deep Q-Network is an improvement over classic Q Learning [3]. In Q Learning, we take the state S_t and the action a_t at time t as input and compute the Q-value Q^π(S_t, a_t) as output. By searching all possible combinations of states and actions, we can build a Q-table whose rows are states and whose columns are actions. From this table, we know the best action for each state, so the optimized policy may be determined. However, since Q Learning requires all possible combinations of states and actions to be explicitly known, it is not suitable for problems with an infinite state space, such as portfolio management in the stock market. Therefore, we use Deep Q-Network as our basic model, which has no special requirement on the state space, i.e., it can handle either a finite or an infinite state space. According to [3], DQN receives a state S_t as input and outputs the Q-value Q(S_t, a) of each action a ∈ A, where A is a finite set.
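As a minimal illustration of the tabular update underlying (3)-(4) (our own sketch with hypothetical states and a made-up learning rate, not part of the paper's method), one Q-learning step looks like:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
    Q is a dict keyed by (state, action); unseen pairs default to 0."""
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```

The Q-table this builds is exactly what becomes intractable for an infinite state space, which is why DQN replaces the table with a network.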
Based on Deep Q-Network theory, this reinforcement learning algorithm uses neural networks to obtain the value of each action, and it can therefore be applied to portfolio management problems with infinite state spaces. Nevertheless, it still requires a finite action space. Since the actions that can be taken in the stock market are infinite, we shall define a new discrete action space so that DQN may be applied to this financial market.

III. ASSUMPTIONS
Before introducing the model, we make several assumptions to simplify our problem:
• Assumption 1: The action taken by the agent will not affect the financial market.
• Assumption 2: The portfolio remains unchanged between the end of the previous trading period and the beginning of the next trading period.
• Assumption 3: There are no other assets that can be chosen besides the selected assets.
• Assumption 4: Since the proportion of the commission fee is only 0.0025, we approximate it as zero, which means the portfolio value is not reduced when it is reallocated.
• Assumption 5: The volume of each stock is large enough, so the agent can buy or sell each of them on any trading day.

IV. ACTION DISCRETIZATION
Traditionally, the action in portfolio management is defined as in (1), the difference between w_{t+1} and w_t. However, this may lead to a very complicated or even continuous action space. Therefore, we redefine the action as exactly the weight vector

a_t ≜ w_t (5)

Based on this definition, we create an environment for the DQN algorithm so that it can adapt to the stock market, starting by discretizing the action space. To begin with, we regard the initial portfolio as 1 and divide it equally into N parts, obtaining the smallest unit of the portfolio, 1/N. Then, for a total of M+1 assets (including cash), we can enumerate each action. This process is equivalent to placing N balls into M+1 boxes:

(N/N, 0, 0, 0, …, 0)
((N−1)/N, 1/N, 0, 0, …, 0)
((N−1)/N, 0, 1/N, 0, …, 0)
⋮
((N−1)/N, 0, 0, 0, …, 1/N)
⋮
(0, 0, 0, 0, …, N/N) (6)

Algorithm 1. Action Discretization Algorithm
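As a concrete companion to Algorithm 1 (our own illustrative implementation using the stars-and-bars correspondence; the paper's exact code may differ), the discretized action set can be enumerated as:

```python
from itertools import combinations

def discrete_actions(M, N):
    """Enumerate all weight vectors over M+1 assets (cash plus M stocks)
    whose entries are multiples of 1/N and sum to 1.
    Equivalent to placing N balls into M+1 boxes: choose M divider
    positions among N+M slots; the gap sizes are the ball counts."""
    actions = []
    for dividers in combinations(range(N + M), M):
        bars = (-1,) + dividers + (N + M,)
        counts = [bars[i + 1] - bars[i] - 1 for i in range(M + 1)]
        actions.append(tuple(c / N for c in counts))
    return actions
```

For example, M = 1 stock and N = 2 units yields the three weight vectors (0, 1), (0.5, 0.5) and (1, 0), matching the count C(M+N, M).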
In Algorithm 1, combinations(seq, M) refers to the set of combinations of M elements from seq. So far, based on the smallest unit 1/N and the combination algorithm mentioned above, we can discretize the action space, and the total number of actions is the binomial coefficient C(M+N, M). Since M and N are both positive integers, the action space is finite.

V. MATHEMATICAL FORMALISM

A. Data Processing
The data processing method is inspired by [1]. The state received by our Deep Q-Network consists of the portfolio weight vector at the beginning of the last trade period, w_{t−1}, and the price tensor P_t, which includes the closing, opening, highest and lowest prices of the assets over the previous n days. That is, the state at trade period t is defined as the 2-tuple

S_t ≜ (P_t, w_{t−1})
s.t. w_{t−1} = [w_{t−1,0}, w_{t−1,1}, …, w_{t−1,M}], P_t = [P_t^o, P_t^c, P_t^h, P_t^l] (7)

where w_{t−1,i} denotes the proportion of the i-th asset (the 0-th asset is cash) at the beginning of the last trade period and Σ_{i=0}^{M} w_{t−1,i} = 1. Meanwhile, according to [1], we set the initial weight vector as

w = [1, 0, 0, …, 0] (8)

The price tensor P_t is transformed from the original price tensor P_t*, which consists of P_t^o, P_t^c, P_t^h, P_t^l, the normalized price matrices of opening, closing, highest and lowest prices, denoted as

P_t^o = [p_{t−n+1}^o ⊘ p_t^c | … | p_t^o ⊘ p_t^c] (9)
P_t^c = [p_{t−n+1}^c ⊘ p_t^c | … | p_t^c ⊘ p_t^c] (10)
P_t^h = [p_{t−n+1}^h ⊘ p_t^c | … | p_t^h ⊘ p_t^c] (11)
P_t^l = [p_{t−n+1}^l ⊘ p_t^c | … | p_t^l ⊘ p_t^c] (12)

where ⊘ is elementwise division. Here p_t^o, p_t^c, p_t^h, p_t^l represent the price vectors of opening, closing, highest and lowest prices of all assets in trade period t, respectively. In other words, their i-th elements, p_{t,i}^o, p_{t,i}^c, p_{t,i}^h, p_{t,i}^l, are the relative technical indicators of the i-th asset in the t-th period. Therefore, if there are M assets (excluding cash) in the portfolio, the original price tensor P_t* is an (M × n × 4)-dimensional tensor, as shown in Fig. 1.

Figure 1. Original Price Tensor P_t*

We note that simply normalizing the original data may cause a recognition problem, i.e., the data may not provide enough information for the network to distinguish. Considering that, we subtract 1 from each element of P_t* and multiply it by an expansion coefficient α.
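This subtract-and-scale normalization can be sketched as follows (an illustrative NumPy version; the array layout and the default value of α are our assumptions):

```python
import numpy as np

def price_tensor(opens, closes, highs, lows, t, n, alpha=100.0):
    """Build the normalized price tensor alpha * (P* - 1):
    each feature matrix holds the last n days of prices divided
    elementwise by the current closing prices.
    opens/closes/highs/lows have shape (T, M) for M assets over T days."""
    window = slice(t - n + 1, t + 1)
    cur_close = closes[t]                                  # shape (M,)
    feats = (opens, closes, highs, lows)
    # stack into shape (4, M, n): 4 features, M assets, n days
    p_star = np.stack([(f[window] / cur_close).T for f in feats])
    return alpha * (p_star - 1.0)
```

Dividing by the current close centers the last column of the closing-price matrix at 1, so after subtracting 1 the most recent relative close is exactly 0.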
The final price tensor is thus defined as

P_t ≜ α(P_t* − 1) (13)

where 1 is an (M × n × 4)-dimensional tensor with all elements equal to 1.

B. Interaction with Environment
With the previous definitions of state and action, we can define the transitions in the financial market environment. Since the opening price in period t is not equal to the closing price in period t−1, we denote the original price relative vector of the t-th trading period, μ_t*, as

μ_t* = p_t^c ⊘ p_t^o = (p_{t,1}^c/p_{t,1}^o, p_{t,2}^c/p_{t,2}^o, …, p_{t,m}^c/p_{t,m}^o) (14)

Since the total portfolio value includes cash, we need to add the cash price to μ_t*. As the cash price remains unchanged, μ_t takes the form

μ_t = (1, p_{t,1}^c/p_{t,1}^o, p_{t,2}^c/p_{t,2}^o, …, p_{t,m}^c/p_{t,m}^o) (15)

Therefore y_t′, the portfolio value at the end of period t, can be computed by

y_t′ = y_t w_t · μ_t (16)

where y_t denotes the portfolio value at the beginning of the t-th trade period and w_t is the asset weight vector at the beginning of the current trading period. Furthermore, since the commission fee is zero (Section III, Assumption 4), y_{t−1}′ equals y_t, and thus we define the rate of return ρ_t as

ρ_t ≜ y_t′/y_{t−1}′ − 1 = y_t′/y_t − 1 = w_t · μ_t − 1 (17)

Defining the rate of return in this way is reasonable, but our primary goal is to maximize the overall portfolio. So the reward r_t is defined as

r_t ≜ ln(y_t′/y_{t−1}′) = ln(w_t · μ_t) (18)

We can thus compute the total portfolio growth at the end of the trading time by summing r_t, t = 1, 2, …, T, and taking the exponential:

exp(Σ_{t=1}^{T} r_t) = exp(Σ_{t=1}^{T} ln(y_t′/y_{t−1}′)) = y_T′/y_0 (19)

The result obtained in (19) is a ratio, so to compute the final portfolio y_T′ we multiply the result by the initial portfolio y_0:

y_T′ = (y_T′/y_0) · y_0 = exp(Σ_{t=1}^{T} r_t) · y_0 (20)

In summary, the transition of the stock environment is demonstrated in Fig. 2, and the goal of our algorithm is to maximize the final portfolio (20) over a given trading time.

VI. PRIORITIZED SAMPLING
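Before moving on, the transition and reward defined in (14)-(18) can be sketched as a single environment step (an illustrative version under the zero-commission assumption; the function and variable names are our own):

```python
import numpy as np

def step(y_t, w_t, open_prices, close_prices):
    """One trading period: build the price relative vector mu_t
    (a leading 1 for cash), grow the portfolio by w_t . mu_t,
    and return the end value, rate of return and log reward."""
    mu = np.concatenate(([1.0], close_prices / open_prices))
    growth = float(np.dot(w_t, mu))
    y_end = y_t * growth            # portfolio value at end of period
    rho = growth - 1.0              # rate of return
    r = np.log(growth)              # log reward
    return y_end, rho, r
```

Summing the log rewards over all periods and exponentiating recovers the total growth y_T′/y_0, as in (19)-(20).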
Traditional DQN takes samples from its memory pool uniformly at random, which is inefficient. More importantly, if good samples are very difficult to obtain, i.e., samples worth learning from are rare in the memory pool, the agent will hardly learn anything useful. In the financial market, such valuable memories are hard to obtain due to the complexity of market variation. Therefore, we use a more efficient sampling method, Prioritized Experience Replay [4], to enhance the learning ability of the agent. According to [4], Prioritized Experience Replay samples from the memory pool by TD-error, defined as

TD-error ≜ |Q_real − Q_eval| (21)

where Q_eval is the output of the evaluation network (Section VII) and Q_real is defined in (29). With this sampling method, memories with a larger TD-error are more likely to be chosen, which means the agent first learns those experiences whose difference between predicted and real value is large. By doing this, the agent learns more valuable experience from the training process and also improves its adaptation to extreme cases, such as sudden price increases or decreases. However, considering the large memory size and the difficulty of searching for a proper sample, we introduce a structure called SumTree [4], whose basic structure is shown in Fig. 3. Note that this tree structure only stores the TD-error of each memory and their sums; the memories themselves are stored in the memory pool, which is a separate container.
Figure 2. Trading Process
Figure 3. SumTree
From this structure, we can easily compute the total TD-error of all samples in the memory pool by adding the TD-errors in the bottom nodes together, and its tree shape is very helpful for searching. The searching algorithm is shown in Algorithm 2.
Algorithm 2. SumTree Searching Algorithm
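As an illustrative companion to Algorithm 2 (a minimal version of our own; the paper's implementation follows [4] and may differ in detail), a SumTree with proportional sampling can be written as:

```python
class SumTree:
    """Minimal SumTree: leaves hold per-sample priorities (TD-errors);
    each internal node stores the sum of its children, so descending
    the tree with u ~ uniform(0, total) picks leaf i with probability
    p_i / total."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # array-backed binary tree

    def update(self, leaf, priority):
        """Set the priority of a leaf and propagate the change upward."""
        idx = leaf + self.capacity - 1
        delta = priority - self.tree[idx]
        self.tree[idx] += delta
        while idx > 0:
            idx = (idx - 1) // 2
            self.tree[idx] += delta

    def sample(self, u):
        """Descend from the root: go left if u fits in the left subtree,
        otherwise subtract the left sum and go right."""
        idx = 0
        while idx < self.capacity - 1:           # stop at a leaf
            left = 2 * idx + 1
            if u <= self.tree[left]:
                idx = left
            else:
                u -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)         # leaf index into the pool
```

In Algorithm 2, u is drawn from uniform(0, total TD-error); here sample takes u directly so the descent is deterministic.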
In Algorithm 2, uniform(a, b) refers to sampling from a uniform distribution on (a, b), SumTree is the structure shown in Fig. 3, and Memory is the memory pool that stores the memories (S_t, a_t, r_t, S_{t+1}); the capacity of the memory pool equals the number of nodes at the bottom level of the SumTree. Additionally, in the searching process we also need to calculate a weight for each sample, which will be used in the loss function (32). The weight of sample i is given by

weight(i) = (p(i)/p_min)^{−β}, i = 1, 2, …, n (22)

where p(i) and p_min are the TD-error of sample i and the minimum TD-error over all experiences in the memory pool, respectively, and β ∈ (0, 1] is a constant. Therefore, by Prioritized Experience Replay, we obtain a set of valuable samples and a weight vector K defined as

K = [weight(1), …, weight(n)] (23)

where n is the batch size.

VII. TRAINING PROCESS
Since the basic principle of DQN is to approximate the real Q-function, there are two deep Q-networks, the evaluation network Q_eval and the target network Q_target, which have exactly the same structure but different parameters. The parameters of Q_eval are continuously updated, while the parameters of Q_target are fixed until they are replaced by those of Q_eval. In the training process, the sampler takes a batch of memories

{(S_{t_1}, a_{t_1}, r_{t_1}, S_{t_1+1}), …, (S_{t_n}, a_{t_n}, r_{t_n}, S_{t_n+1})} (24)

from the memory pool according to Prioritized Experience Replay (Section VI). Q_eval receives the present state S_t as input and returns the estimated Q-values Q_eval(S_t, a) for each action a ∈ A. Simultaneously, Q_target receives the next state S_{t+1} and returns the Q-values Q_target(S_{t+1}, a) for each action a ∈ A. Then, following the theory of double DQN, we select Q-values from Q_target and Q_eval by

Q*_target(i) = Q_target(S_{t_i+1}, argmax_a Q_eval(S_{t_i+1}, a)) (25)
Q*_eval(i) = Q_eval(S_{t_i}, argmax_a Q_eval(S_{t_i}, a)) (26)

where i = 1, 2, …, n, and we form the vector of target Q-values Q*_target and the vector of estimated Q-values Q*_eval as

Q*_target = [Q*_target(1), Q*_target(2), …, Q*_target(n)] (27)
Q*_eval = [Q*_eval(1), Q*_eval(2), …, Q*_eval(n)] (28)

We then obtain the real Q-values and their vector by

Q_real(i) = r_{t_i} + γ Q*_target(i), i = 1, 2, …, n (29)
Q_real = [Q_real(1), Q_real(2), …, Q_real(n)] (30)

where γ ∈ (0, 1]. We define the original loss as

l* = (Q*_eval − Q_real) ⊙ (Q*_eval − Q_real) (31)

where ⊙ is the elementwise product. Since we use Prioritized Experience Replay (Section VI), each sample in the minibatch has a different weight, as defined in (22).
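A NumPy sketch of the batched double-DQN target and the importance-weighted loss (illustrative only; unlike the argmax form printed in (26), we score Q_eval at the actions actually taken, which is the usual double-DQN choice):

```python
import numpy as np

def weighted_double_dqn_loss(q_eval_t, q_eval_next, q_target_next,
                             actions, rewards, K, gamma=0.99):
    """q_eval_t, q_eval_next, q_target_next: (n, |A|) Q-value batches.
    The evaluation net chooses the argmax action at S_{t+1}; the target
    net supplies its value; squared TD-errors are weighted by the
    importance weights K from prioritized replay."""
    rows = np.arange(len(actions))
    a_star = np.argmax(q_eval_next, axis=1)                  # action selection
    q_real = rewards + gamma * q_target_next[rows, a_star]   # target values
    q_est = q_eval_t[rows, actions]                          # taken actions
    td = q_est - q_real
    return float(np.dot(K, td * td))                         # weighted loss
```

Decoupling action selection (evaluation net) from action evaluation (target net) is what reduces the overestimation bias of vanilla Q-learning targets.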
Therefore, the final loss takes the form

l = K · l* (32)

where K is the weight vector defined in (23). Once the loss function is defined, we can train the network by minimizing it.

VIII. NETWORK TOPOLOGY
This network is based on a Convolutional Neural Network (CNN), and its specific topological structure is inspired by [1]. The input of the network is the 2-tuple

I = (P_t, w_{t−1}) (33)

which means the agent receives the price data of trade period t, P_t (defined in (13)), and the portfolio weights of trade period t−1. Note that although the weight vector of the previous trading period is fed into the network, it does not pass through the first two convolution layers. Moreover, we choose SeLU as the activation function of the convolutional layers, because the processed data defined in (13) contain many negative numbers, and ReLU would map all negative numbers to zero, causing serious neuron death.
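This difference matters on the mostly negative inputs produced by (13); a quick illustration of the two activations (standard definitions, not code from the paper):

```python
import numpy as np

def relu(x):
    """ReLU zeroes every negative input, discarding its information."""
    return np.maximum(x, 0.0)

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SeLU keeps a nonzero, saturating response for negative inputs."""
    x = np.asarray(x, dtype=float)
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

A negative feature passed through ReLU contributes nothing to later layers, while SeLU preserves a graded negative response.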
After the network receives the input, the price tensor P_t passes through the first convolution layer, which outputs 32 feature maps of size m × 5. The second convolution layer outputs 64 feature maps of size m × 1. Then the weights of the previous trading period w_{t−1} are inserted into the feature maps, and these 65 features form the input of the next convolution layer (here, the first element of w_{t−1}, the weight of cash, is removed). The third convolution layer produces 128 feature maps, which are then flattened, and a cash bias is added. So far, we obtain the state of trading period t, which can be expressed as

S_t = c_3 ∘ c_2 ∘ c_1(P_t, w_{t−1}) (34)

In (34), we regard each convolutional layer as a function, c_i represents the i-th convolutional layer, and ∘ is the composition of functions. The part of the network after the convolutional layers can be regarded as a Q-net, which receives the state S_t and outputs Q-values. The structure of the second fully connected layer deserves a detailed explanation: here we use the dueling Q-net introduced by [5]. With this structure, the network evaluates the value of the state and the value of each action separately, which helps the agent assess the current situation fully and also act wisely. Finally, the Q-value of S_t is given by

Q(S_t, a) = Q_s + (Q_a − E[Q_a]) (35)

where Q_s is the output of the state layer and Q_a is the output of the action layer.

IX. EXPERIMENT

A. Experimental Setting
In our experiment, five weakly correlated US stocks from Yahoo Finance were chosen as risky assets; their ticker symbols are CAH, CAT, CCE, CCL and DIS. Together with cash as the risk-free asset, there were 6 investment products to be managed. We set the trade period to two days in order to increase the difference between tensors. Meanwhile, we selected the past 3 years as the training set and back-testing period.

B. Performance Metrics
Three different metrics were used to evaluate the performance of the trading strategies. The first metric is the accumulative rate of return [1], defined as

ARR = exp(Σ_{t=1}^{T} r_t) − 1 (36)

where T denotes the total number of trading periods and r_t is the reward defined in (18). The ARR metric assesses the profitability of the algorithm. The second metric is the
Sharpe ratio, defined by [11] as

SR = E_t[ρ_t − ρ_RF] / √(var(ρ_t − ρ_RF)) (37)

where ρ_t is the rate of return defined in (17) and ρ_RF represents the rate of return of the risk-free asset. Since we select cash as the risk-free asset, ρ_RF is zero in the experiment. The Sharpe ratio mainly represents the risk-adjusted return of a strategy.

Figure 4. Network Topology

In order to assess the risk resistance of an investment strategy more completely, we introduce Maximum Drawdown [12] as the third metric. The formula of Maximum Drawdown (MDD) is
MDD = max_{β>t} (y_t − y_β)/y_t (38)

This metric denotes the maximum portfolio value loss from a peak to a bottom.

C. Result and Analysis
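Before turning to the comparison, the three metrics above can be computed on a back-test record as follows (an illustrative sketch; with cash as the risk-free asset, ρ_RF is zero):

```python
import numpy as np

def evaluate(rewards, returns, values):
    """rewards: log rewards r_t; returns: rates of return rho_t;
    values: portfolio value curve. Returns (ARR, Sharpe ratio, MDD)."""
    arr = np.exp(np.sum(rewards)) - 1.0                 # accumulative return
    sharpe = np.mean(returns) / np.std(returns)         # risk-free rate = 0
    peaks = np.maximum.accumulate(values)               # running peak
    mdd = np.max((peaks - values) / peaks)              # worst peak-to-trough
    return arr, sharpe, mdd
```

The running-peak formulation is equivalent to taking the maximum relative loss over all peak/trough pairs in the value curve.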
The performance of our trading strategy is compared with the following strategies:
• Robust Median Reversion (RMR) [14]
• Uniform Buy and Hold (BAH), a portfolio management approach that simply spreads the total fund equally over the preselected assets and holds them, without any purchase or sale, until the end [7]
• Universal Portfolios (UP) [15]
• Exponential Gradient (EG) [16]
• Online Newton Step (ONS) [17]
• Anticor (ANTICOR) [18]
• Passive Aggressive Mean Reversion (PAMR) [19]
• Online Moving Average Reversion (OLMAR) [20]
• Confidence Weighted Mean Reversion (CWMR) [21]
• Uniform Constant Rebalanced Portfolios (CRP) [15][22]
Since we set zero commission fee for the DQN algorithm, all of the strategies mentioned above are tested without commission fees.
Figure 5. Trading Performance
Fig. 5 illustrates the accumulative return over the investment horizon of the test period.
TABLE I. COMPARISON OF DIFFERENT ALGORITHMS

X. CONCLUSIONS AND FUTURE WORK
In this paper, we propose a deep reinforcement learning algorithm for portfolio management based on a discrete action space. We define a method to discretize the market action and combine this method with the DQN algorithm. Five weakly correlated US stocks are selected as experimental data, and we use the accumulative rate of return, Sharpe ratio and Maximum Drawdown to compare the performance of our algorithm with 10 traditional strategies on the back-testing set. The results show that this deep reinforcement learning algorithm is more profitable than all the surveyed traditional strategies, and it is also the least risky investment method on the back-testing set we chose. The limitations of our model are as follows. First, we set the transaction cost to 0 (Section III, Assumption 4), so the profitability may be affected once the transaction cost is taken into consideration. Second, we assume that the volumes of the stocks are large enough (Section III, Assumption 5) that each stock is available on any trading day. However, a stock might sometimes be unavailable, which will influence the profit as well. For future work, we shall look into a DQN model with transaction cost. Trading in financial markets incurs a small transaction cost which may outweigh the profit of some transactions. In order to reduce the impact of transaction fees on the agent's portfolio, we will try to increase the number of portfolio divisions (Section IV), which means the smallest trading unit of the portfolio will shrink and transaction fees will be reduced to some extent. However, a smaller trading unit leads to a larger action space, which requires the agent to be able to explore and learn a large-scale discrete action space effectively. Therefore, we will try to improve the exploration method of the agent, for example with Information-Directed Exploration [13].

REFERENCES
[1] Z. Jiang, D. Xu, and J. Liang, "A deep reinforcement learning framework for the financial portfolio management problem," arXiv, 2017, arXiv:1706.10059.
[2] H. V. Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-Learning," National Conference on Artificial Intelligence, pp. 2094-2100, 2016.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv, 2013.
[4] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," International Conference on Learning Representations, 2016.
[5] Z. Wang, T. Schaul, M. Hessel, H. V. Hasselt, M. Lanctot, and N. D. Freitas, "Dueling network architectures for deep reinforcement learning," International Conference on Machine Learning, 2016.
[6] R. Neuneier, "Enhancing Q-Learning for Optimal Asset Allocation," Neural Information Processing Systems, 1997.
[7] B. Li and S. C. H. Hoi, "Online portfolio selection: A survey," ACM Computing Surveys, vol. 46, pp. 35, 2014.
[8] J. B. Heaton, N. G. Polson, and J. Witte, "Deep Learning in Finance," arXiv, 2016, arXiv:1602.06561.
[9] S. T. A. Niaki and S. Hoseinzade, "Forecasting S&P 500 index using artificial neural networks and design of experiments," Journal of Industrial Engineering, International, vol. 9, pp. 1-9, 2013.
[10] M. A. H. Dempster and V. Leemans, "An automated FX trading system using adaptive reinforcement learning," Expert Systems With Applications, vol. 30, pp. 543-552, 2006.
[11] W. F. Sharpe, "The Sharpe ratio," The Journal of Portfolio Management, vol. 21, no. 1, pp. 49-58, 1994, doi:10.3905/jpm.1994.409501.
[12] M. Magdon-Ismail and A. F. Atiya, "Maximum drawdown," Risk Magazine, vol. 17, no. 10, pp. 99-102, 2004.
[13] W. R. Clements, B. M. Robaglia, B. Van Delft, R. B. Slaoui, and S. Toth, "Estimating Risk and Uncertainty in Deep Reinforcement Learning," arXiv, 2019, arXiv:1905.09638.
[14] D. Huang, J. Zhou, B. Li, S. C. H. Hoi, and S. Zhou, "Robust median reversion strategy for on-line portfolio selection," Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 2006-2012, 2013.
[15] T. M. Cover, "Universal portfolios," Mathematical Finance, vol. 1, no. 1, pp. 1, 1991.
[16] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth, "On-line portfolio selection using multiplicative updates," Mathematical Finance, vol. 8, no. 4, pp. 325–.
[17] A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire, "Algorithms for portfolio management based on the newton method," Proceedings of the 23rd International Conference on Machine Learning, pp. 9-16, 2006.
[18] A. Borodin, R. El-Yaniv, and V. Gogan, "Can we learn to beat the best stock," Journal of Artificial Intelligence Research, vol. 21, pp. 579, 2004.
[19] B. Li, P. Zhao, S. C. H. Hoi, and V. Gopalkrishnan, "PAMR: Passive Aggressive Mean Reversion Strategy for Portfolio Selection," Machine Learning, vol. 87, no. 2, pp. 221, 2012.
[20] B. Li, S. C. Hoi, D. Sahoo, and Z. Liu, "Moving average reversion strategy for on-line portfolio selection," Artificial Intelligence, vol. 222, pp. 104-123, 2015, doi:10.1016/j.artint.2015.01.006.
[21] B. Li, S. C. H. Hoi, P. Zhao, and V. Gopalkrishnan, "Confidence weighted mean reversion strategy for online portfolio selection," ACM Transactions on Knowledge Discovery from Data, vol. 7, no. 1, pp. 4, 2013, doi:10.1145/2435209.2435213.
[22] J. L. Kelly, "A new interpretation of information rate," The Bell System Technical Journal, vol. 35, no. 4, pp. 917-926, 1956, doi:10.1002/j.1538-7305.1956.tb03809.x.
[23] Park, H. Song, and S. Lee, "Linear programming models for portfolio optimization using a benchmark," The European Journal of Finance, vol. 25, no. 5, pp. 435-457, 2019.
[24]