An intelligent financial portfolio trading strategy using deep Q-learning
Hyungjun Park
Department of Industrial and Management Engineering, Pohang University of Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673, Rep. of Korea
Min Kyu Sim
Department of Industrial and Systems Engineering, Seoul National University of Science and Technology, 232, Gongneung-Ro, Nowon-Gu, Seoul, 01811, Rep. of Korea
Dong Gu Choi ∗
Department of Industrial and Management Engineering, Pohang University of Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673, Rep. of Korea
Abstract
Portfolio traders strive to identify dynamic portfolio allocation schemes so that their total budgets are efficiently allocated through the investment horizon. This study proposes a novel portfolio trading strategy in which an intelligent agent is trained to identify an optimal trading action by using deep Q-learning. We formulate a Markov decision process model for the portfolio trading process, and the model adopts a discrete combinatorial action space, determining the trading direction at a prespecified trading size for each asset, to ensure practical applicability. Our novel portfolio trading strategy takes advantage of three features to outperform in real-world trading. First, a mapping function is devised to handle and transform an initially found but infeasible action into a feasible action closest to the originally proposed ideal action. Second, by overcoming the dimensionality problem, this study establishes models of the agent and the Q-network for deriving a multi-asset trading strategy in the predefined action space. Last, this study introduces a technique that has the advantage of deriving a well-fitted multi-asset trading strategy by designing an agent to simulate all feasible actions in each state. To validate our approach, we conduct backtests for two representative portfolios and demonstrate superior results over the benchmark strategies.

∗ Corresponding author
Email address: [email protected] (Dong Gu Choi)

Preprint submitted to Elsevier, December 2, 2019
Keywords: Portfolio trading, Reinforcement learning, Deep Q-learning, Deep neural network, Markov decision process
1. Introduction
A goal of financial portfolio trading is maximizing the trader's monetary wealth by allocating capital to a basket of assets in a portfolio over the periods of the investment horizon. Thus, portfolio trading is the most important investment practice in the buy-side financial industry. Portfolio traders strive to establish trading strategies that can properly allocate capital to financial assets in response to time-varying market conditions. Typical objective functions for trading strategy optimization include expected returns and the Sharpe ratio (i.e., risk-adjusted returns). In addition to optimizing an objective function, a trading strategy should achieve a reasonable turnover rate so that it is applicable to real-world financial trading. If the turnover rate is not reasonable, transaction costs hurt overall trading performance.

Portfolio trading is an optimization problem that involves a sequential decision-making process across multiple rebalancing periods. In this process, the stochastic components of time-varying market variables should be considered. Thus, the problem of deriving an optimal portfolio trading strategy has traditionally been formulated as a stochastic optimization problem [10, 15, 19]. To handle these stochastic components over multiple periods, most related studies have developed heuristic methods [5, 7, 8, 13, 21, 29, 38, 39]. In very recent years, reinforcement learning (RL) has become another popular approach for financial portfolio trading problems. In RL methods, a learning agent can understand a complex financial environment by attempting various trading actions and revising its trading action policy, and then optimize its trading strategy based on these experiences. In addition, these methods have the important advantage that learning agents can update their trading strategies based on their experiences on future trading days. Instead of simply maintaining trading strategies derived from historical data, learning agents can adapt their strategies using their observed experiences on each real trading day [36]. With these advantages and the increasing popularity of RL algorithms, many previous studies have applied RL algorithms to various portfolio trading problem settings [1, 2, 3, 6, 11, 12, 14, 17, 18, 24, 25, 26, 27, 28, 30, 38].

Following the recent evolution of RL methods, some researchers [12, 17, 18] have started to use deep RL (DRL) methods, which combine RL and deep neural networks (DNN), to overcome the unstable performance of previous RL methods. Nevertheless, we believe that this line of study is yet to mature in terms of practical applicability for two reasons. First, most studies based on DRL methods focus on single-asset trading [12, 17]. Because most traders generally hold multiple securities, additional decision-making steps are necessary even when single-asset trading rules are derived. Second, even when a study deals with multi-asset portfolio trading, the actions determine portfolio weights [18]. Action spaces that determine portfolio weights cannot be followed directly because they require additional decisions on how to reach the target weights. We explain this issue in more detail in Section 2.3.

To overcome these limitations, this study proposes a new approach for deriving a multi-asset portfolio trading strategy using deep Q-learning, one of the most popular DRL methods.
In this study, we focus on a multi-asset trading strategy and define an intuitive trading action set that can be interpreted as direct investment guides to traders. In the action space used in this study, each action includes trading directions corresponding to each asset in a portfolio, and each trading direction comprises either holding, buying, or selling each asset at a prespecified trading size. Although a recent study [30] argues that optimizing a trading strategy based on a discrete action space has a negative effect, we find that our discrete action space modeling allows for a lower turnover rate and is more practical than continuous action space modeling.

To develop a practical multi-asset trading strategy, this study tackles a few challenging aspects. First, setting a discrete combinatorial action space may lead to infeasible actions, and, thus, we may derive an unreasonable trading strategy (i.e., a strategy with frequent and pointless portfolio weight changes that only lead to more transaction costs). To address this issue, we introduce a mapping function that enables the agent to prevent the selection of unreasonable actions by mapping infeasible actions onto similar and valuable actions. By applying this mapping function, we can derive a reasonable trading strategy in the practical action space. Second, an action space that determines a trading direction for each asset in the portfolio has a dimensionality problem [24]. As the number of assets in the portfolio increases, the size of the action space grows exponentially because a trading agent must determine a combination of trading directions for all assets in the portfolio. This is why previous studies related to this action space have only considered trading a single asset or a single risky asset with a risk-free asset. In this study, we overcome this limitation and conduct the first study of multi-asset trading in the practical action space using DQL. Third, although we use years of financial data, these data may not provide enough training data for the DRL agent to learn a multi-asset trading strategy in the financial environment, because learning a strategy that maps a joint state to a joint action requires a large amount of data. Given a fixed amount of data, we need the agent to gain more experience within the training data and learn as much as possible. Thus, we achieve sufficient learning by simulating all feasible actions in each state and then updating the agent's trading strategy using the learning experiences from the simulation results. This technique allows the agent to gain and learn enough experience to derive a well-fitted multi-asset trading strategy.

The rest of this paper is organized as follows. In Section 2, we review the related literature and present the differences between our study and previous studies. Section 3 describes the definition of our problem, and Section 4 introduces our approach for deriving an intelligent trading strategy. In Section 5, we provide experimental results to validate the advantages of our approach. Finally, we conclude in Section 6 by providing relevant implications and identifying directions for future research.
2. Literature Review
Portfolio trading is an optimization problem that involves a sequential decision-making process over multiple rebalancing periods. In addition, the stochastic components of market variables should be considered in this process. Thus, traditionally, the derivation of portfolio trading strategies has been formulated as a stochastic programming problem to find an optimal trading strategy. Recently, much effort has been made to solve this stochastic optimization problem using a learning-based approach, RL. To formulate this stochastic optimization problem, it is necessary to determine how to measure the features of the stochastic components corresponding to changes in the financial market. Utilizing technical indicators is more common than utilizing the fundamental indexes of securities in daily-frequency portfolio trading, as in our study.

This section reviews how previous studies have attempted to model stochastic market components to formulate the portfolio trading problem and derive an optimal trading strategy. Section 2.1 provides a brief description of previous studies that formulate the stochastic components of the financial market. Section 2.2 reviews previous studies that discuss heuristic methods for deriving an optimal trading strategy. Section 2.3 reviews previous studies that address the stochastic optimization problem to derive an optimal trading strategy using RL.
Early studies on portfolio trading and, sometimes, management used stochastic programming-based models. Stochastic programming models formulate a sequence of investment decisions over time that can maximize a portfolio manager's expected utility up to the end of the investment horizon. Golub et al. [15] modeled an interest rate series as a binomial lattice scenario using Monte Carlo procedures to solve a money management problem with stochastic programming. Kouwenberg [19] solved an asset-liability management problem using the event tree method to generate random stochastic programming coefficients. Consigli and Dempster [10] used scenario-based stochastic dynamic programming to solve an asset-liability management problem. However, stochastic programming-based models have the limitation of needing to generate numerous scenarios to solve a complex problem, such as understanding a financial environment, resulting in a large computational burden.
Because of this limitation of stochastic programming-based models, many studies have devised heuristic methods (i.e., trading heuristics). One of the most famous such methods is technical analysis for asset trading. This method provides a simple and sophisticated way to identify hidden relationships between market features and asset returns through the study of historical data. Using these identified relationships, investments are made in assets by taking appropriate positions. Brock et al. [5] conducted backtests with real and artificial data using moving average and trading range strategies. Zhu and Zhou [39] considered theoretical rationales for using technical analysis and suggested a practical moving average strategy to determine a portion of investments. Chourmouziadis and Chatzoglou [9] suggested an intelligent stock-trading fuzzy system based on rarely used technical indicators for short-term portfolio trading. Another popular heuristic method is the pattern matching (i.e., charting heuristics) method, which detects critical market situations by comparing the current series of market features to meaningful patterns in the past. Leigh et al. [21] developed a trading strategy using two types of bull flag pattern matching. Chen and Chen [8] proposed an intelligent pattern-matching model based on two novel methods in the pattern identification process. The other well-known heuristic method is a metaheuristic algorithm that can find a near-optimal solution in acceptable computation time. Derigs and Nickel [13] developed a decision support system generator for portfolio management using simulated annealing, and Potvin et al. [32] applied genetic programming to generate trading rules automatically. Chen and Yu [7] used a genetic algorithm to group stocks with similar price series to support investors in making more efficient investment decisions. However, these heuristic methods have limited ability to fully search a very large feasible solution space because they are inflexible. Thus, we need to be careful about the reliability of obtaining an optimal trading strategy using these methods.
A recent research direction is optimizing a trading strategy using RL such that a learning agent develops a policy while interacting with the financial environment. Using RL, a learning-based method, the learning agent can search for an optimal trading strategy flexibly in a high-dimensional environment. Unlike supervised learning, RL allows learning from experience, leading to training the agent with unlabeled data obtained from interactions with the environment.

In the earliest such studies, Neuneier [26, 27] optimized multi-asset portfolio trading using Q-learning, a model-free and value-based RL method. In other early studies, Moody et al. [25] and Moody and Saffell [24] used direct RL with recurrent RL as a base algorithm and derived a multi-asset long-short portfolio trading strategy and a single-asset trading rule, respectively. Direct RL is policy-based RL, which optimizes an objective function by adjusting policy parameters, and recurrent RL is an RL algorithm in which the last action is received as an input. These studies introduced several measures, such as the Sharpe ratio and the Sterling ratio, as objective functions and compared the trading strategies derived using different objectives. Casqueiro and Rodrigues [6] derived a single-asset trading strategy using Q-learning, which can maximize the Sharpe ratio. Dempster and Leemans [11] developed an automated foreign exchange trading system using an adaptive learning system with recurrent RL as a base algorithm by dynamically adjusting a hyper-parameter depending on the market situation. O et al. [28] proposed a Q-learning-based local trading system that categorized an asset price series into four patterns and applied different trading rules. Bertoluzzo and Corazza [3] suggested a single-asset trading system using Q-learning with linear and kernel function approximations. Eilers et al. [14] developed a trading rule for an asset with a seasonal price trend using Q-learning. Zhang et al. [38] derived a trading rule generator using extended classifier systems combined with RL and a genetic algorithm. Almahdi and Yang [1] suggested a recurrent RL-based trading decision system that enabled multi-asset portfolio trading and compared the performance of the system when several different objective functions were adopted. Pendharkar and Cusatis [30] suggested an indices trading rule derived using two different RL methods, on-policy (SARSA) and off-policy (Q-learning), compared the performance of these two methods, and also compared the performances of discrete and continuous agent action space modeling. Almahdi and Yang [2] used a hybrid method that combined recurrent RL and particle swarm optimization to derive a portfolio trading strategy that considers real-world constraints.

More recently, DRL, which combines deep learning and RL algorithms, was developed, and, thus, studies have suggested using DRL-based methods to derive portfolio trading strategies. DRL methods enable an agent to understand a complex financial environment through deep learning and to learn a trading strategy by automatically applying an RL algorithm. Jiang et al. [18] used a deep deterministic policy gradient (DDPG), an advanced method combining policy-based and value-based RL, and introduced various DNN structures and techniques to trade a portfolio consisting of cash and several cryptocurrencies. Deng et al. [12] derived an asset trading strategy using a recurrent RL-based algorithm and introduced a fuzzy deep recurrent neural network that used fuzzy representation to reduce uncertainty in noisy asset prices and used a deep recurrent neural network to consider the previous action and utilize high-dimensional nonlinear features. Jeong and Kim [17] derived an asset trading rule that determined actions for assets and the number of shares for the actions taken. To learn this trading rule, Jeong and Kim [17] used a deep Q-network (DQN) with a novel DNN structure consisting of two branches, one of which learned action values while the other learned the number of shares to take to maximize the objective function.

The above studies used various RL-based methods in different problem settings. All of the methods performed well in each setting, but some issues limit the applicability of these methods to the real world. First, some problem settings did not consider transaction costs [3, 14, 17, 28, 30]. A trading strategy developed without assuming transaction costs is likely to be impractical for application to the real world. The second issue is that some strategies consider trading for only one asset [1, 3, 6, 11, 14, 12, 17, 24, 38]. A trading strategy of investing in only one risky asset may have high risk exposure because it has no risk diversification effect. Finally, in previous studies deriving multi-asset portfolio trading strategies using RL, the agent's action space was defined as the portfolio weights in the next period [1, 2, 18, 25]. The action spaces of these studies do not provide portfolio traders with a direct guide that is applicable to a real-world trading scenario that includes transaction costs, because there are many different ways to transition from the current portfolio weights to the next portfolio weights. Thus, previous studies using portfolio weights as the action space required finding a way to minimize transaction costs at each rebalancing moment. Rebalancing in a way that reduces both transaction costs and dispersion from the next target portfolio is not an easily solved problem [16]. In addition, a portfolio trading strategy derived based on the action spaces of the previous studies may be difficult to apply to real-world trading because the turnover rate is likely to be high. An action space that determines portfolio weights can result in frequent asset switching because the amount of asset changes has no upper bound. Thus, we contribute to the literature by deriving a portfolio trading strategy that has no such issues.
3. Problem definition
In this study, we consider a portfolio consisting of cash and several risky assets. All assets in the portfolio are bought using cash, and the value gained from selling assets is held in cash. That is, the agent cannot buy an asset without holding cash and cannot sell an asset without holding the asset. This type of portfolio is called a long-only portfolio, which does not allow short selling. Our problem setting also has a multiplicative profit structure in that the portfolio value accumulates based on the profits and losses in previous periods. We consider proportional transaction costs that are charged according to a fixed proportion of the amount traded in transactions involving buying or selling. In addition, we allow the agent to partially buy or sell assets (e.g., the agent can buy or sell half of a share of an asset).

We set up some assumptions in our problem setting. First, transactions can only be carried out once a day, and all transactions in a day are made at the closing price in the market at the end of that day. Second, the liquidity of the market is high enough that each transaction can be carried out immediately for all assets. Third, the trading volume of the agent is very small compared to the size of the whole market, so the agent's trades do not affect the state transition of the market environment.

To apply RL to solve our problem, we need a model of the financial environment that reflects the financial market mechanism. Using the notations summarized in Table 1, we formulate a Markov decision process (MDP) model that maximizes the portfolio return rate in each period by selecting sequential trading actions for the individual assets in the portfolio according to time-varying market features (Table 2).
The state space of the agent is defined as the weight vector of the current portfolio before the agent selects an action and the tensor that contains the market features (technical indicators) for the assets in the portfolio. This type of state space is similar to that used in a previous study [18]. That is, the state in period t can be represented as below (Equations (1)-(3)):

s_t = (X_t, w'_t),    (1)

w'_t = (w'_{t,0}, w'_{t,1}, w'_{t,2}, ..., w'_{t,I})^T,    (2)

Table 1: Summary of notations

Decision variables
a_t = (a_{t,1}, a_{t,2}, ..., a_{t,I})   agent's action at the end of period t, {a_t ∈ Z^I : a_{t,i} ∈ {−1, 0, 1} ∀i}

Sets and indices
i = 0, 1, 2, ..., I   portfolio asset index (i = 0 represents cash)
t                     time period index
S^−(a_t)              set of indices of assets sold when the agent takes action a_t (i.e., {i ∈ Z | 0 < i ≤ I, a_{t,i} = −1})
S^+(a_t)              set of indices of assets bought when the agent takes action a_t (i.e., {i ∈ Z | 0 < i ≤ I, a_{t,i} = 1})

Parameters
n          size of the time window containing recent previous market features
P_t        portfolio value changed by the action at the end of period t
P'_t       portfolio value before the agent takes an action at the end of period t
P^s_t      portfolio value at the end of period t when the agent takes no action at the end of the previous period t−1
w_{t,i}    proportion of asset i changed by the action at the end of period t
w'_{t,i}   proportion of asset i before the agent takes an action at the end of period t
ŵ'_{t,i}   auxiliary parameter used to derive w_{t,i}
c_t        decay rate of transaction costs at the end of period t
c^−        transaction cost rate for selling
c^+        transaction cost rate for buying
δ          trading size for selling or buying (0 < δ < P'_t / I)
ρ_t        return rate of the portfolio in period t (= (P_t − P_{t−1}) / P_{t−1})
o_{t,i}    opening price of asset i in period t
p_{t,i}    closing price of asset i in period t
h_{t,i}    highest price of asset i in period t
l_{t,i}    lowest price of asset i in period t
v_{t,i}    volume of asset i in period t

Table 2: Summary of market features

k^c_{t,i}   rate of change of the closing price of asset i in period t (= (p_{t,i} − p_{t−1,i}) / p_{t−1,i})
k^o_{t,i}   ratio of the opening price in period t to the closing price in period t−1 of asset i (= (o_{t,i} − p_{t−1,i}) / p_{t−1,i})
k^h_{t,i}   ratio of the closing price to the highest price of asset i in period t (= (p_{t,i} − h_{t,i}) / h_{t,i})
k^l_{t,i}   ratio of the closing price to the lowest price of asset i in period t (= (p_{t,i} − l_{t,i}) / l_{t,i})
k^v_{t,i}   rate of change of the volume of asset i in period t (= (v_{t,i} − v_{t−1,i}) / v_{t−1,i})

X_t = [K^c_t, K^o_t, K^h_t, K^l_t, K^v_t],    (3)

where w'_t denotes the weight vector of the current portfolio and X_t represents the technical indicator tensor for the assets in the portfolio. For this tensor, we use five technical indicators for the assets in the portfolio, as below (Equation (4)):

k^x_t = (k^x_{t,1}, k^x_{t,2}, ..., k^x_{t,I})^T   ∀x ∈ {c, o, h, l, v},    (4)

Every set of five technical indicators can be expressed as a matrix (Equation (5)), where the rows represent each asset in the portfolio and the columns represent the series of recent technical indicators in the time window. Here, if we set a time window of size n (considering n-lag autocorrelation) and a portfolio of I assets, the technical indicator tensor X_t is an (I, n, 5) tensor (Figure 1).

K^x_t = [k^x_{t−n+1} | k^x_{t−n+2} | ... | k^x_t]   ∀x ∈ {c, o, h, l, v},    (5)

Figure 1: Market feature tensor (X_t)
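As an illustration of Equations (4)-(5), the following is a minimal sketch, not taken from the paper, of how the (I, n, 5) market feature tensor X_t could be assembled from raw OHLCV series using NumPy; the array layout and function names are our own assumptions.

import numpy as np

def market_features(o, p, h, l, v):
    """Per-period features of Table 2 for all assets.
    o, p, h, l, v: arrays of shape (T, I) with open, close, high, low, volume."""
    kc = (p[1:] - p[:-1]) / p[:-1]          # closing-price change rate
    ko = (o[1:] - p[:-1]) / p[:-1]          # open vs. previous close
    kh = (p[1:] - h[1:]) / h[1:]            # close vs. intraday high
    kl = (p[1:] - l[1:]) / l[1:]            # close vs. intraday low
    kv = (v[1:] - v[:-1]) / v[:-1]          # volume change rate
    return np.stack([kc, ko, kh, kl, kv], axis=-1)   # shape (T-1, I, 5)

def state_tensor(features, t, n):
    """Market feature tensor X_t: the last n feature vectors for every asset,
    arranged as an (I, n, 5) tensor as in Equation (5)."""
    window = features[t - n + 1 : t + 1]     # shape (n, I, 5)
    return np.transpose(window, (1, 0, 2))   # shape (I, n, 5)

The state s_t of Equation (1) would then pair this tensor with the current weight vector w'_t.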
We define the action space to overcome the limitations of the action spaces in previous studies. Agent actions determine which assets to hold and which assets to sell or buy at a prespecified constant trading size. For example, if a portfolio includes two assets and the trading size is 10,000 USD, then the agent can select the action of buying 10,000 USD of asset 1 and selling 10,000 USD of asset 2. The action space includes the trading directions of buying, selling, or holding each asset in the portfolio, so the action space contains 3^I different actions. These actions are expressed in a vector form that includes trading directions for each asset in a portfolio. In addition, each trading direction (sell, hold, buy) is encoded as (−1, 0, 1), so, for example, the action of selling asset 1 and buying asset 2 can be encoded into the vector (−1, 1).

With the state space and action space defined in the previous subsections, we can define the MDP model as follows. The financial market environment operates according to this model during the investment horizon. To define the transitions in the financial market environment (i.e., the system dynamics in the MDP model), we need to define the following parameters and equations:

w_t = (w_{t,0}, w_{t,1}, w_{t,2}, ..., w_{t,I})^T,    (6)

w_t · \vec{1} = w'_t · \vec{1} = 1   ∀t,    (7)

P'_t = P_{t−1} (w_{t−1} · φ(k^c_t))   ∀t,    (8)

w'_t = (w_{t−1} ⊙ φ(k^c_t)) / (w_{t−1} · φ(k^c_t))   ∀t,    (9)

where w_t denotes the portfolio weight after the agent takes an action at the end of period t (Equation (6)). Equation (7) provides the constraint that the portfolio weight elements sum to one in all periods. Equations (8) and (9) represent the change in the portfolio value and the change in the proportions of the assets in the portfolio given the changes in the value of each asset in the portfolio, respectively. Here, ⊙ represents the elementwise product of two vectors, and \vec{1} is a vector in R^{I+1} with all elements equal to one. φ(·) is an operator that not only increases a vector's dimension by positioning zero as the first element but also adds the all-ones vector to it (i.e., φ : (e_1, e_2, ..., e_I)^T → (1, e_1 + 1, e_2 + 1, ..., e_I + 1)^T).

Now, we can define the state changes after the agent takes an action as follows:

c_t = (δ / P'_t) (c^− |S^−(a_t)| + c^+ |S^+(a_t)|)   ∀t,    (10)

P_t = P'_t (1 − c_t)   ∀t,    (11)

ŵ'_t = (ŵ'_{t,0}, ŵ'_{t,1}, ŵ'_{t,2}, ..., ŵ'_{t,I})^T,    (12)

ŵ'_{t,i} = w'_{t,i} − δ/P'_t   if i ∈ S^−(a_t),
ŵ'_{t,i} = w'_{t,i} + δ/P'_t   if i ∈ S^+(a_t),
ŵ'_{t,i} = w'_{t,i}            otherwise,   ∀i = 1, ..., I,    (13)

ŵ'_{t,0} = w'_{t,0} + (δ / P'_t) ((1 − c^−) |S^−(a_t)| − (1 + c^+) |S^+(a_t)|),    (14)

w_t = ŵ'_t / (ŵ'_t · \vec{1}),    (15)

where |S| is the size of set S. ŵ'_{t,i} denotes the auxiliary weight of the portfolio that is needed to connect the change in the portfolio weights before and after the agent takes an action at the end of period t (Equation (12)). The procedure by which the action selected by the agent is handled for trading in the financial environment is as follows. The auxiliary weight of an asset in the portfolio increases (or decreases) by the proportion of the trading size when buying (or selling) the asset. On the contrary, the auxiliary weights of the assets do not change when the agent holds the assets (Equation (13)). As a result of selling an asset, the proportion of cash increases by the proportion of the trading size discounted by the selling transaction cost rate. As a result of buying an asset, the proportion of cash decreases by the proportion of the trading size increased by the buying transaction cost rate (Equation (14)). To ensure that the sum of the portfolio weight elements equals one after the agent takes an action, a process for adjusting the auxiliary weights is required (Equation (15)).
In summary, the financial market environment transition is illustrated in Figure 2.

Figure 2: Financial environment transition

Last, the reward in the MDP model should reflect the contribution of the agent's action to the portfolio return. This reward could simply be defined as the portfolio return. However, if the portfolio return is defined only as a reward, then different reward criteria can be given depending on the market trend. For example, when the market trend is sufficiently improving, then no matter how poor the agent's action is, a positive reward is provided to the agent. In contrast, if the market trend is sufficiently negative, then no matter how helpful the agent's action is, a negative reward is provided to the agent. Thus, the reward must be defined as the rate of change in the portfolio value from which the market trend is removed. Therefore, we define the reward as the change in the portfolio value at the end of the next period relative to the static portfolio value (Equation (16)). The static portfolio value is the next portfolio value when the agent takes no action at the end of the current period (Equation (17)).

r_t = (P'_{t+1} − P^s_{t+1}) / P^s_{t+1},    (16)

P^s_{t+1} = P'_t (w'_t · φ(k^c_{t+1})),    (17)
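To make the transition and reward dynamics concrete, the following is a minimal sketch, under our own naming assumptions, of one environment step implementing Equations (8)-(17); it illustrates the MDP dynamics described above and is not the authors' code.

import numpy as np

def step(P_prev, w_prev, a, kc_t, kc_next, delta, c_minus, c_plus):
    """One transition of the portfolio MDP.
    P_prev, w_prev : portfolio value and weights (cash first) after the action in period t-1
    a              : action vector in {-1, 0, 1}^I (sell / hold / buy each asset)
    kc_t, kc_next  : closing-price change rates of the I assets in periods t and t+1
    delta          : trading size; c_minus, c_plus: selling / buying cost rates."""
    a = np.asarray(a)
    phi_t = np.concatenate(([1.0], 1.0 + kc_t))           # operator phi(.)
    P_pre = P_prev * np.dot(w_prev, phi_t)                 # Eq. (8): value before acting
    w_pre = w_prev * phi_t / np.dot(w_prev, phi_t)         # Eq. (9): weights before acting

    sell, buy = (a == -1), (a == 1)
    c_t = (delta / P_pre) * (c_minus * sell.sum() + c_plus * buy.sum())   # Eq. (10)
    P_t = P_pre * (1.0 - c_t)                                             # Eq. (11)

    w_aux = w_pre.copy()                                   # Eqs. (12)-(14)
    w_aux[1:][sell] -= delta / P_pre
    w_aux[1:][buy]  += delta / P_pre
    w_aux[0] += (delta / P_pre) * ((1 - c_minus) * sell.sum() - (1 + c_plus) * buy.sum())
    w_t = w_aux / w_aux.sum()                              # Eq. (15)

    phi_next = np.concatenate(([1.0], 1.0 + kc_next))
    P_pre_next = P_t * np.dot(w_t, phi_next)               # P'_{t+1}
    P_static = P_pre * np.dot(w_pre, phi_next)             # Eq. (17): no-action value
    reward = (P_pre_next - P_static) / P_static            # Eq. (16)
    return P_t, w_t, reward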
4. Methodology
In this section, we introduce our proposed approach for deriving the portfolio trading strategy using DQL. In our action space, some issues may prohibit a DQL agent from deriving an intelligent trading strategy. We first explain how to resolve these issues by introducing some techniques and applying existing methodologies. Then, we describe our DQL algorithm with these techniques.
In our action space, some actions are infeasible in some states (e.g., the agent cannot buy assets because of a cash shortage or cannot sell assets because of a shortage of held assets). To handle infeasible actions, we first set the action values (i.e., Q-values) of infeasible actions to be very low to mask these actions [20]. Thus, we need to define a rule for selecting the appropriate action from the remaining actions when infeasible actions are excluded. For this rule, a simple approach in which the agent selects the action with the largest Q-value among the remaining actions can be considered [37]. However, this simple rule can result in an unreasonable trading strategy. For example, when an agent's strategy selects the action of selling both asset 1 and asset 2 but this action is infeasible owing to a lack of asset 2, the action of buying both asset 1 and asset 2, which is the largest Q-value action in the remaining action space, may be selected. Because learning the similarity between actions is difficult for an RL agent, the agent will take this action without any doubt even though this selected action is the opposite of the original action determined by the agent's strategy. This issue leads to the selection of unreasonable actions, which degrades the trading performance. A mapping rule is required to map infeasible actions to similar and valuable actions in the feasible action set. Thus, we resolve this issue by introducing a mapping function that contains several mapping rules.

The mapping function is a type of constraint on the action space in RL that allows the agent to derive a reasonable trading strategy. Pham et al. [31] and Bhatia et al. [4] handled constrained action spaces by adding an optimization layer, so-called OptLayer, that solves a quadratic program at the last layer of the agent's policy network, determining an action that minimizes the difference from the output of the previous layer while satisfying the constraints. However, this method cannot be applied directly to our situation because it can only deal with continuous action spaces. Although OptLayer could be applied to our constrained action space by revising it into an integer quadratic programming (IQP) version, the IQP solution suffers from a multiple-choice issue in this situation because there are tied actions that have the smallest distance from the original action. To overcome this limitation of OptLayer, we devise a mapping function based on a heuristic search method. The function maps to the one feasible action that has the largest Q-value among the tied actions that have the smallest distance from the original action. Moreover, the computation costs of the mapping function are lower than those of OptLayer. Therefore, the mapping function is an efficient extension of this line of study on handling infeasible actions using the concept of distance between actions, such as OptLayer.

The mapping function contains two mapping rules, each of which is required for mapping infeasible actions in one of two cases. In the first case, the amount of cash is not sufficient to take an action that involves buying assets. In this case, a similar action set is derived by holding rather than buying a subset of the asset group to be bought in the original action. Thereafter, infeasible actions are mapped to the most valuable feasible actions in the similar action set. For example, if the action of buying both asset 1 and asset 2 is infeasible owing to a cash shortage, this action is mapped to the most valuable feasible action within the set of similar actions, which includes the action of buying asset 1 and holding asset 2, the action of holding asset 1 and buying asset 2, and the action of holding both asset 1 and asset 2. In the second case, an action that involves selling assets is infeasible because of a shortage of the assets. In this case, the original action is simply mapped to an action in which the assets that are not sufficient to sell are held. These examples are illustrated in Figure 3. We provide the details of the two mapping rules and the mapping function in the pseudocode in Algorithm 1. In the algorithm, the last part (i.e., the final fallback to Rule1) of the mapping rule for the second case (Rule2) is necessary because, in the second case, converting the original action of selling assets that cannot be sold into an action that holds those assets removes the cash that would have been gained from selling them, which can cause the first infeasible-action case to arise. Furthermore, this part of the code can handle the special case in which an asset shortage and a cash shortage occur simultaneously. Next, the RL flow chart with the mapping function technique is shown in Figure 4.

Algorithm 1 Mapping function
s_t: state of the agent
a_t: infeasible action in state s_t
Q(s_t, a_t): Q-value for the state-action pair (s_t, a_t)
procedure Map(s_t, a_t)
    if asset shortage for action a_t in state s_t then
        a_map ← Rule2(s_t, a_t)
    else if cash shortage for action a_t in state s_t then
        a_map ← Rule1(s_t, a_t)
    return a_map
procedure Rule1(s_t, a_t)
    MAXQ ← −inf
    subsets of the buying asset indices: S = {C_1, C_2, ...}
    for each subset C in S do
        replicate the action: â_t ← a_t
        for each asset j in C do
            â_{t,j} ← 0
        if the converted action â_t is feasible in state s_t then
            if Q(s_t, â_t) > MAXQ then
                MAXQ ← Q(s_t, â_t)
                a_best ← â_t
    return a_best
procedure Rule2(s_t, a_t)
    for asset i = 1, 2, ... do
        if action a_{t,i} for asset i in state s_t is infeasible then
            a_{t,i} ← 0
    if the converted action a_t is infeasible in state s_t then
        a_t ← Rule1(s_t, a_t)
    return a_t
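The following is a minimal Python sketch of this mapping logic. The helper callables (q_of, cash_ok) and the per-asset sellable mask are our own assumptions introduced for illustration; this is not the paper's implementation.

from itertools import combinations
import numpy as np

def map_action(a, q_of, sellable, cash_ok):
    """Map an infeasible action to a similar, valuable feasible action (Algorithm 1 sketch).
    a        : proposed action vector in {-1, 0, 1}^I
    q_of     : callable, q_of(action) -> Q-value of that action in the current state
    sellable : boolean array, True where the portfolio holds enough of asset i to sell
    cash_ok  : callable, cash_ok(action) -> True if cash covers all buys in the action."""
    a = np.array(a, copy=True)

    # Rule 2: hold the assets that cannot be sold.
    a[(a == -1) & ~sellable] = 0

    # Rule 1: if cash is still short, hold a subset of the assets to be bought,
    # picking the feasible candidate with the largest Q-value.
    if cash_ok(a):
        return a
    buy_idx = np.flatnonzero(a == 1)
    best, best_q = a.copy(), -np.inf
    for k in range(1, len(buy_idx) + 1):
        for subset in combinations(buy_idx, k):
            cand = a.copy()
            cand[list(subset)] = 0          # buy -> hold for this subset
            if cash_ok(cand):
                q = q_of(cand)
                if q > best_q:
                    best, best_q = cand, q
    return best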
Figure 3: Mapping examples of (a) a cash shortage and (b) an asset shortage
Figure 4: RL flow chart with the mapping function

We optimize the multi-asset portfolio trading strategy by applying the DQN algorithm. DQN is the primary algorithm for DQL. Mnih et al. [22] developed the DQN algorithm, and Mnih et al. [23] later introduced additional techniques and completed this algorithm. The base algorithm for DQN, Q-learning, is value-based RL, which is a method that approximates an action value (i.e., a Q-value) in each state. Further, Q-learning is a model-free method such that even if the agent does not have knowledge of the environment, the agent can develop a policy from repeated experience by exploring. In addition, Q-learning is an off-policy algorithm; that is, the action policy for selecting the agent's action is not the same as the update policy for selecting an action on the target value. An algorithm based on Q-learning that approximates the Q-function using a DNN is the basis of DQN [22]. To prevent the DNN from learning only through the experience of a specific situation, experience replay was introduced to sample a general experience batch from memory. Additionally, the DQN algorithm uses two separate networks: a Q-network that approximates the Q-function and a target network that approximates the target value needed for the Q-network update to follow a fixed target [23]. Based on this algorithm, we introduce several techniques to support the derivation of an intelligent trading strategy.

The existing DQN algorithm updates the Q-network with experience by allowing the agent to take only one action in each stage. Because the agent has no information about the environment, only one action is taken before proceeding to the next state. Thus, it is impossible to take multiple actions in the existing DQN. However, for this problem, we use historical technical indicator data of the assets in the portfolio as training data. Thus, our agent can take multiple actions in one state at each stage and observe all of the experiences resulting from those actions. To utilize this advantage, we introduce a technique that simulates all feasible actions in one state at each stage and updates the trading strategy by using the resulting experiences from these simulations.

Motivated by Tan et al. [35], we utilize a simulation technique that virtually takes all feasible actions to force the agent to learn from many experiences efficiently and thereby derive a fully searched multi-asset trading strategy. Thus, this technique can relax the data shortage issue that arises when deriving a multi-asset trading strategy.
Although simulating all feasible actions can result in a huge computational burden, using multi-core parallel computing can prevent this computational burden from greatly increasing. Moreover, even if the agent takes multiple actions in the current state, the next state only depends on the action selected by the action policy (epsilon-greedy) with the mapping rule. The application of this technique requires a change in the data structure of the element in replay memory for storing a list of experiences in a state. The concepts related to this technique are illustrated in Figure 5. In this figure, a^j_t means that the j-th action of the agent is taken at the end of period t, r^j_t is the reward obtained by taking action a^j_t, and s^j_{t+1} is the next state that results from taking action a^j_t.

Figure 5: (a) Simulating feasible actions, (b) Data structure for the experience list
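A minimal sketch of this experience-list construction is given below; the simulate callable, which replays the historical transition of Equations (8)-(16) for a virtual action, is an assumption introduced for illustration.

def experience_list(state, feasible_actions, simulate):
    """Simulate every feasible action in one state and collect the experiences
    that form a single replay-memory element (Figure 5)."""
    experiences = []
    for a in feasible_actions:
        r, s_next = simulate(state, a)      # virtual transition on historical data
        experiences.append((state, a, r, s_next))
    return experiences

# One element of the replay memory is then the whole list, e.g.:
# replay_memory.append(experience_list(s_t, F(s_t), simulate))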
In DQN, a multiple-output neural network is commonly adopted as the Q-network structure. In this network structure, the input of the neural network is the state, and the output is the Q-value of each action. Using the above technique, we can approximate the Q-values of all feasible actions by updating this multiple-output Q-network in parallel with the experience list. To maintain the Q-values of infeasible actions, the current Q-value of an infeasible state-action pair is assigned to the target value of the Q-network output of the corresponding infeasible action so that its temporal difference error is zero. Furthermore, as in DQN, several experience lists are sampled from replay memory, and the Q-network is updated using the experience list batch. A detailed description of the process for updating the Q-network is shown in Figure 6.
Figure 6: Updating a multiple output Q-network using an experience list
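A minimal sketch of this target construction is shown below. It assumes the Q-network and target network return a vector with one Q-value per action and that each experience stores the action's index; for brevity, the maximization is restricted to the feasible actions of the next state, which simplifies the mapping step used in Algorithm 2. None of this is the paper's code.

import numpy as np

def q_targets(q_net, target_net, experiences, feasible_next, gamma=0.9):
    """Build the target vector for one experience list (Figure 6 sketch).
    Infeasible actions keep their current Q-values so their TD error is zero."""
    state = experiences[0][0]
    target = q_net(state).copy()                 # default: no learning signal
    for (s, a_idx, r, s_next) in experiences:    # one entry per feasible action
        q_next = target_net(s_next)
        best_next = max(q_next[j] for j in feasible_next(s_next))
        target[a_idx] = r + gamma * best_next    # standard DQN target
    return target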
In addition, to apply RL, learning episodes must be defined for the agent to explore and experience the environment. Rather than defining all of the training data, which cover several years, as one episode, we divide the training data into several episodes. If we define a training episode much longer than the investment horizon of the test data that will be used to test the trading strategy, this difference in the lengths of the training and test data can produce negative results. For example, in our experiment, the training and testing processes begin with the same portfolio weights. In this case, the farther the agent is from the beginning of the long training episode, the farther the agent is from the initial portfolio weights. Thus, it is difficult for the agent to utilize the critical experience obtained from the latter half of the long episode in the early testing process. Therefore, we divide the training data into sets of the same length as the investment horizon of the test data (i.e., one year, as the investment horizon of the test data is a year in our experiment). Thus, the criteria for dividing the training data are defined in yearly units so that the episodes do not overlap (e.g., episode 1 contains data from 2016, and episode 2 contains data from 2015). In each training epoch, the agent explores and learns in an episode sampled from the training data.

It is well known that more recent historical data have more explanatory power for predicting future data than less recent historical data have. Thus, it is reasonable to assign higher sampling probabilities to episodes that are closer to the test data period [18]. We use a truncated geometric distribution to assign higher sampling probabilities to episodes that are closer to the test period. This truncated geometric sampling distribution is expressed in Equation (18). Here, y is the year of the episode, y_v is the year of the test data, and N is the number of total training episodes. β is a parameter for this sampling distribution that ranges from zero to one. If this parameter is closer to one, episodes closer to the test period are sampled more frequently.

g_β(y) = β (1 − β)^{y_v − y − 1} / (1 − (1 − β)^N),    (18)

To implement DQN, we need to model the neural network structure for approximating the Q-function of an agent's state and action. We construct a hybrid encoder LSTM-DNN neural network that enables us to approximate the Q-value of an agent's action in our predefined state and action space. First, we train an LSTM autoencoder, an unsupervised learning method for compressing sequence data, to identify latent variables of the historical sequence of the predefined technical indicators of the assets in the portfolio [34]. Once the autoencoder is fitted to the historical patterns of the technical indicators of the assets, the decoder is removed and the encoder is kept as a standalone model that encodes the technical indicator sequences of the assets into lower-dimensional latent variables. Each asset in the portfolio shares the same encoder LSTM for the sequence encoding procedure because it is known that a single deep learning model is more effective for learning the feature patterns of different assets than multiple deep learning models that learn individual assets [33]. Then, the encoded outputs for each asset are concatenated to create the intermediate output, and this intermediate output is combined with the current portfolio weights to form the input to the DNN. Through these DNN layers, we can obtain the Q-value of each action of the agent. Because these DNN layers extract meaningful features through nonlinear mapping using a multi-layer neural network and conduct a regression for the Q-value, we refer to these layers as the DNN regressor. The overall Q-network structure is shown in Figure 7.

Figure 7: Q-network structure

In summary, the overall DQN algorithm for our approach for deriving the portfolio trading strategy is given in Algorithm 2. In offline learning, we use this algorithm to build an initial trading strategy fitted on historical data, and in online learning, we adapt the trading strategy by updating it based on the daily observed data in the trading process.

Algorithm 2 DQN algorithm for portfolio trading
F(s): feasible action set in state s
Initialize replay memory D
Initialize the weights θ of the Q-network randomly
Initialize the weights of the target network θ′ ← θ
for each episode y sampled by the sampling distribution y ← g_β(·) do
    Initialize state s_0
    for period t = 0 ... T in episode y do
        With probability ε select a random a_t ∈ F(s_t);
        otherwise, a_t = argmax_a Q(s_t, a; θ) if argmax_a Q(s_t, a; θ) ∈ F(s_t),
        and a_t = Map(s_t, argmax_a Q(s_t, a; θ)) otherwise
        Take action a_t and then observe reward r_t and next state s_{t+1}
        Simulate all actions a ∈ F(s_t), then observe the experience list L
        Store L in replay memory D
        Sample a random batch K of experience lists from D
        For each element (s_t, a_t, r_t, s_{t+1}) of an experience list in the batch, update the Q-network from the current prediction Q(s_t, a_t; θ) to the target
            z_t = r_t + γ max_{a′} Q̂(s_{t+1}, a′; θ′) if argmax_{a′} Q̂(s_{t+1}, a′; θ′) ∈ F(s_{t+1}),
            z_t = r_t + γ Q̂(s_{t+1}, Map(s_{t+1}, argmax_{a′} Q̂(s_{t+1}, a′; θ′)); θ′) otherwise
        Update θ by minimizing the loss: L(θ) = (1/|K|) Σ_{L∈K} Σ_{j∈L} (z_j − Q(s_j, a_j; θ))^2
        θ′ ← θ
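The following is a minimal PyTorch sketch of the hybrid encoder LSTM-DNN Q-network of Figure 7: a shared LSTM encoder compresses each asset's indicator sequence, and the concatenated codes plus the current portfolio weights feed a DNN regressor with one output per action. The layer sizes, the omission of the unsupervised autoencoder pre-training step, and all names are our own illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the hybrid encoder LSTM-DNN Q-network."""

    def __init__(self, n_assets, n_features=5, enc_hidden=128, enc_out=20, dnn_hidden=(64, 32)):
        super().__init__()
        self.encoder = nn.LSTM(input_size=n_features, hidden_size=enc_hidden, batch_first=True)
        self.enc_head = nn.Linear(enc_hidden, enc_out)        # latent code per asset
        in_dim = n_assets * enc_out + (n_assets + 1)          # + portfolio weights (cash incl.)
        self.regressor = nn.Sequential(
            nn.Linear(in_dim, dnn_hidden[0]), nn.ReLU(),
            nn.Linear(dnn_hidden[0], dnn_hidden[1]), nn.ReLU(),
            nn.Linear(dnn_hidden[1], 3 ** n_assets),          # one Q-value per action
        )

    def forward(self, x, weights):
        # x: (batch, I, n, 5) market feature tensor; weights: (batch, I + 1)
        batch, n_assets, n_window, n_feat = x.shape
        seq = x.reshape(batch * n_assets, n_window, n_feat)   # one shared encoder for all assets
        _, (h, _) = self.encoder(seq)                         # h: (1, batch*I, enc_hidden)
        codes = self.enc_head(h[-1]).reshape(batch, -1)       # (batch, I*enc_out)
        return self.regressor(torch.cat([codes, weights], dim=1))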
5. Experimental results
In this section, we demonstrate that the DQN strategy (i.e., the trading strategy derived using our proposed DQN algorithm for portfolio trading) can perform well in real-world trading. We conduct a trading simulation for two different portfolio cases using both our DQN strategy and traditional trading strategies as benchmarks, and we verify that the DQN strategy is superior to the benchmark strategies based on several common performance measures.
We use three different output performance measures to evaluate trading strategies. The first measure is the cumulative return, based on the increase in the portfolio value at the end of the investment horizon relative to the initial portfolio value, as defined in Equation (19):

CR = (P_{t_f} − P_0) / P_0 × 100,    (19)

where t_f is the final date of the investment horizon and P_0 is the initial portfolio value.

The second measure is the Sharpe ratio, as defined in Equation (20):

SR = E[ρ_t − ρ_f] / std(ρ_t) × √D,    (20)

where std(ρ_t) is the standard deviation of the daily return rate, ρ_f is the daily risk-free rate (assumed to be 0.01%), and the annualization term √D (the square root of the number D of annual trading days) is multiplied. This ratio is a common measure of the risk-adjusted return, and it is used to evaluate not only how high the risk premium is but also how small the variation in the return rate is.

For the last measure, we use the customized average turnover rate defined in Equation (21):

AT = 1/(2 t_f) Σ_{t=0}^{t_f} Σ_{i=1}^{I} |ŵ'_{t,i} − w'_{t,i}| × 100,    (21)
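A minimal sketch of these three measures is given below; the 252-day annualization constant and the array conventions are our own assumptions for illustration.

import numpy as np

def cumulative_return(values):
    """CR, Eq. (19): growth of the portfolio value over the horizon, in percent."""
    return (values[-1] - values[0]) / values[0] * 100

def sharpe_ratio(daily_returns, daily_rf=0.0001, trading_days=252):
    """SR, Eq. (20): annualized ratio of mean excess return to return volatility."""
    r = np.asarray(daily_returns)
    return (r - daily_rf).mean() / r.std() * np.sqrt(trading_days)

def average_turnover(w_aux, w_pre):
    """AT, Eq. (21): average absolute weight change caused by trading, in percent.
    w_aux and w_pre are (t_f + 1, I) arrays of risky-asset weights after and before
    each day's action (the hat-w' and w' series of Section 3)."""
    t_f = len(w_aux) - 1
    return np.abs(np.asarray(w_aux) - np.asarray(w_pre)).sum() / (2 * t_f) * 100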
We experiment with two different three-asset portfolios. The first consists of three exchange-traded funds (ETFs) in the US market that track the S&P 500 index, the Russell 1000 index, and the Russell Microcap index. This type of portfolio was tested in a previous study [1]. The second portfolio is a Korean portfolio consisting of the KOSPI 100 index, the KOSPI midcap index, and the KOSPI microcap index. More information on these test portfolios is provided in Table 3.

Table 3: Test portfolios

US Portfolio (US-ETF):
  Asset 1: SPDR S&P 500 (ETF tracking the S&P 500 index)
  Asset 2: iShares Russell 1000 Value (ETF tracking the Russell 1000 index of mid- and large-cap US stocks)
  Asset 3: iShares Microcap (ETF tracking the Russell Microcap index)

Korean Portfolio (KOR-IDX):
  Asset 1: KOSPI 100 index
  Asset 2: KOSPI midcap index
  Asset 3: KOSPI microcap index
We obtain data on the three US ETFs from Yahoo Finance and data on the Korean indices from Investing.com. Both cases are tested in 2017. The trading strategy for the US portfolio is derived by training on data from 2010 to 2016, and the trading strategy for the Korean portfolio is derived by training on data from 2012 to 2016.
Through several rounds of tuning, we derive appropriate hyper-parameters. In particular, the time window size (n) is the most important hyper-parameter, and we adopt the value of 20 among the candidates (5, 20, 60, 120). This time window size and the other tuned hyper-parameters are summarized in Table 4.

time window size (n): 20             replay memory size: 2000
learning rate (α): 1e-7              number of epochs: 500
distribution parameter (β): 0.3      discount factor (γ): 0.9
DNN input dimension: 64              batch size: 32
DNN layers: 2                        encoder LSTM layers: 1
DNN 1st layer dimension: 64          hidden dim of encoder LSTM: 128
DNN 2nd layer dimension: 32          output dim of encoder LSTM: 20

Table 4: Hyper-parameter summary
In the experiment, we also need to set trading parameters, such as the initial portfolio value and the trading size. We set the initial portfolio value as one million in both portfolio cases (i.e., 1M USD for the US portfolio and 1M KRW for the Korean portfolio). Similarly, we set the trading size as ten thousand in both portfolio cases (i.e., a 10K USD trading size for the US portfolio case and 10K KRW for the Korean portfolio case). We set the transaction cost rate for buying and selling in both the US and Korean markets as 0.25%. In both cases, the initial portfolio is set up as an equally weighted portfolio, in which every asset and cash has the same proportion.
To evaluate our DQN strategy, we compare it to some traditional portfolio trading strategies. The first strategy is a buy-and-hold strategy (B&H) that does not take any action but rather holds the initial portfolio until the end of the investment horizon. The second strategy is a randomly selected strategy (RN) that takes an action randomly within the feasible action space in each state. The third strategy is a momentum strategy (MO). This strategy buys assets whose values increased in the previous period and sells assets whose values decreased in the previous period. However, if it cannot buy all assets with increased values, it gives buying priority to the assets whose values increased more. If it is unable to sell assets whose values decreased, it simply holds the assets. The last strategy is a reversion strategy (RV), which is the opposite of the momentum strategy. This strategy sells assets whose values increased in the previous period and buys assets whose values decreased in the previous period. However, if it cannot buy all of the assets whose values decreased, it gives buying priority to the assets whose values decreased more. If it is unable to sell the assets whose values increased, then it simply holds the assets.

We derive a trading strategy for both portfolio cases using DQN. For both cases, we identify the increase in the cumulative return over the investment horizon of the test period as episode learning continues. Figure 8 shows the trend in the cumulative return performance over the learning episodes in both cases.

Table 5 shows the number of changes in the phase of the trading direction (i.e., the number of changes from selling to buying or vice versa, except for holding) before and after applying the mapping function. A trading strategy that has frequent changes in the phase of the trading action implies an unreasonable trading strategy because these changes deteriorate trading performance by incurring meaningless transaction costs. Through the experimental results, we identify that when the mapping function is applied to the DQN strategy, the number of changes in the phase of the trading direction decreases relative to when it is not applied. Following the decrease in the number of changes in the phase of the trading direction, the cumulative return of the trading strategy increases by 6.68% relative to the strategy without the mapping function in the US portfolio case. Likewise, in the Korean portfolio case, the cumulative return of the trading strategy increases by 10.83% relative to the strategy without the mapping function. Therefore, we demonstrate that the mapping function contributes to deriving a reasonable trading strategy.

Figure 8: Cumulative return rate as the learning episode continues: (a) US portfolio, (b) Korean portfolio
Table 5: The number of changes in the phase of trading direction for the DQN strategy without/with the mapping function in the two test portfolio cases

                       US portfolio                      Korean portfolio
mapping function   asset 1   asset 2   asset 3       asset 1   asset 2   asset 3
without                 83       107        64            74       128        63
with                    54        40        37            42        68        40
difference          -34.9%    -62.6%    -42.1%        -43.2%    -46.8%    -36.5%
Figure 9 shows the portfolio value trend when applying the DQN strategy and the benchmark strategies in the US and Korean portfolio cases. In the US portfolio case, we observe that the DQN strategy outperforms the benchmark strategies for most of the test period. The final portfolio value of the DQN strategy is 15.69% higher than that of the B&H strategy, 33.74% higher than that of the RN strategy, 21.81% higher than that of the MO strategy, and 114.47% higher than that of the RV strategy. Likewise, in the Korean portfolio case, we observe that the DQN strategy outperforms the benchmark strategies for most of the test period. The final portfolio value of the DQN strategy is 25.52% higher than that of the B&H strategy, 34.99% higher than that of the RN strategy, 13.22% higher than that of the MO strategy, and 247.91% higher than that of the RV strategy.

Table 6 summarizes the output performance measure results when using DQN and the benchmark strategies in both portfolio cases. This table shows that the DQN strategy has the best cumulative return and Sharpe ratio performances for the US portfolio, and this strategy has the lowest turnover rate except for the B&H strategy, which has no turnover. In the Korean portfolio case, the DQN strategy also has the best cumulative return and Sharpe ratio performances. Moreover, the DQN strategy has the lowest turnover rate except for the B&H strategy. Given that the B&H strategy does not incur any transaction costs during the investment horizon, it is a remarkable achievement that the DQN strategy outperforms the B&H strategy in terms of the cumulative return and Sharpe ratio.

Figure 9: Comparative portfolio value results for the DQN and benchmark strategies for (a) the US portfolio, (b) the Korean portfolio
Table 6: Output performance measure values for our DQN strategy and the benchmark strategies in the two test portfolio cases

                  US portfolio                     Korean portfolio
strategy      CR        SR       AT           CR        SR       AT
B&H
RN*        9.446%    1.143    1.029%       7.358%    0.381    1.032%
MO
RV
DQN       12.634%    1.382    0.954%       9.933%    0.946    0.989%

* The performance of RN is the average over 30 samples.
6. Conclusion
The main contribution of our study is applying the DQN algorithm to derive a portfolio trading strategy in a practical action space. However, applying DQN to portfolio trading has some challenges. To overcome these challenges, we devise a DQL model for trading and several techniques. First, we introduce a mapping function for handling infeasible actions to derive a reasonable trading strategy. Trading strategies derived from RL agents can be unreasonable to apply in the real world. Thus, we apply a domain knowledge rule to develop a trading strategy with an infeasible-action mapping constraint. As a result, this function works well, and we can derive a reasonable trading strategy. Second, we design a DQL agent and Q-network that consider multi-asset features and derive a multi-asset trading strategy in the practical action space, determining the trading direction of each asset, by overcoming the dimensionality problem. Third, we relax the data shortage issue for deriving well-fitted multi-asset trading strategies by introducing a technique that simulates all feasible actions and then updates the trading strategy based on the experiences of these simulated actions.

The experimental results show that our proposed DQN strategy is superior to the benchmark trading strategies. Based on the results for the cumulative return and the Sharpe ratio, the DQN strategy is more profitable with lower risk than the other benchmark strategies. In addition, based on the results for the average turnover rate, the DQN strategy is more suitable for application in real-world trading than the benchmark strategies.

However, our proposed methodology still has limited scalability, as the dimensionality problem arises if the number of assets in the portfolio is very large. Furthermore, in our study, the reward of the MDP model is optimized only for returns and not for risk. Nevertheless, the contributions of this study are still valuable because of the novel techniques for ensuring the practical applicability of the portfolio trading strategy. We will try to devise methods for resolving these limitations in future research.
References [1] S. Almahdi and S. Y. Yang. An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning withexpected maximum drawdown.
Expert Systems With Applications , 87:267–279, 2017.[2] S. Almahdi and S. Y. Yang. A constrained portfolio trading system us-ing particle swarm algorithm and recurrent reinforcement learning.
ExpertSystems With Applications , 130:145–156, 2019.[3] F. Bertoluzzo and M. Corazza. Testing different Reinforcement Learningconfigurations for financial trading: Introduction and applications.
ProcediaEconomics and Finance , 3:68–77, 2012.[4] A. Bhatia, P. Varakantham, and A. Kumar. Resource constrained deepreinforcement learning.
Proceedings of the International Conference on Au-tomated Planning and Scheduling , 29:610–620, 2018.[5] W. Brock, J. Lakonishok, and B. Lebaron. Simple Technical Trading Rulesand the Stochastic Properties of Stock Returns.
The Journal of Finance, 47:1731–1764, 1992.
[6] P. X. Casqueiro and A. J. L. Rodrigues. Neuro-dynamic trading methods.
European Journal of Operational Research , 175:1400–1412, 2006.[7] C. H. Chen and H. Y. Yu. A series based group stock portfolio optimizationapproach using the grouping genetic algorithm with symbolic aggregateApproximations.
Knowledge-Based Systems , 125:146–163, 2017.[8] T. Chen and F. Chen. An intelligent pattern recognition model for sup-porting investment decisions in stock market.
Information Sciences , pages261–274, 2016.[9] K. Chourmouziadis and P. D. Chatzoglou. An intelligent short term stocktrading fuzzy system for assisting investors in portfolio management.
ExpertSystems With Applications , 43:298–311, 2016.[10] G. Consigli and M. A. H. Dempster. Dynamic stochastic programmingfor assetliability management.
Annals of Operations Research , 81:131–161,1998.[11] M. A. H. Dempster and V. Leemans. An automated FX trading systemusing adaptive reinforcement learning.
Expert Systems With Applications ,30:543–552, 2006.[12] Y. Deng, B. Feng, Y. Kong, Z. Ren, and Q. Dai. Deep Direct Reinforce-ment Learning for Financial Signal Representation and Trading.
IEEETransactions on Neural Networks and Learning Systems , 28:653–664, 2016.[13] U. Derigs and N. H. Nickel. Meta-heuristic based decision support forportfolio optimization with a case study on tracking error minimization inpassive portfolio management.
OR Spectrum , 25:345–378, 2003.[14] D. Eilers, C. L. Dunis, H. J. Mettenheim, and M. H. Breitner. Intelli-gent trading of seasonal effects: A decision support algorithm based onreinforcement learning.
Decision Support Systems, 64:100–108, 2014.
[15] B. Golub, M. Holmer, R. Mckendall, L. Polhlman, and S. A. Zenios. A stochastic programming model for money management.
European Journalof Operations Research , 85:282–296, 1995.[16] R. C. Grinold and R. N. Khan. Active portfolio management: A quantita-tive approach for producing superior returns and controlling risk.
McGraw-Hill , 2, 2000.[17] G. Jeong and H. Y. Kim. Improving financial trading decisions using deepQ-learning: Predicting the number of shares, action strategies, and transferlearning.
Expert system with applications , 117:125–138, 2019.[18] Z. Jiang, D. Xu, and J. Liang. A Deep Reinforcement Learning Frame-work for the Financial Portfolio Management Problem. arXiv preprintarXiv:1706.10059 , 2017.[19] R. Kouwenberg. Scenario generation and stochastic programming modelsfor asset liability management.
European Journal of Operational Research ,134:279–292, 2001.[20] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. P´erolat,D. Silver, and T. Graepel. A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning. arXiv preprint arXiv:1711.00832 , 2017.[21] W. Leigh, N. Modani, R. Purvis, Q. Wu, and T. Robert. Stock markettrading rule discovery using technical charting heuristics.
Expert Systemswith Applications , 23:155–159, 2002.[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra,and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. arXivpreprint arXiv:1312.5602 , 2013.[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare,A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, and et al. Human-level control through deep reinforcement learning.
Nature, 518:529–533, 2015.
[24] J. Moody and M. Saffell. Learning to Trade via Direct Reinforcement.
IEEE TRANSACTIONS ON NEURAL NETWORKS , 12:875–889, 2001.[25] J. Moody, L. WU, Y. Liao, and M. Saffell. Performance Functions andReinforcement Learning for Trading Systems and Portfolios.
Journal ofForecasting , 17:441–470, 1998.[26] R. Neuneier. Optimal Asset Allocation using Adaptive Dynamic Program-ming.
Advances in Neural Information Processing Systems , pages 952–958,1996.[27] R. Neuneier. Enhancing Q-Learning for Optimal Asset Allocation.
Ad-vances in Neural Information Processing Systems , pages 936–942, 1998.[28] J. O, J. Lee, J. W. Lee, and B. T. Zhang. Adaptive stock trading with dy-namic asset allocation using reinforcement learning.
Information Sciences ,176:2121–2147, 2006.[29] F. Papailias and D. D. Thomakos. An improved moving average technicaltrading rule.
Physica A , 428:458–469, 2015.[30] P. C. Pendharkar and P. Cusatis. Trading financial indices with reinforce-ment learning agents.
Expert Systems with Applications , 103:1–13, 2018.[31] T. H. Pham, G. D. Magistris, and R. Tachibana. Optlayer-practical con-strained optimization for deep reinforcement learning in the real world. ,pages 6236–6243, 2018.[32] J. Y. Potvin, P. Soriano, and M. Vallee. Generating trading rules on thestock markets with genetic programming.
Computers & Operations Research, 31:1033–1047, 2004.
[33] J. Sirignano and R. Cont. Universal features of price formation in financial markets: Perspectives from Deep Learning. arXiv preprint arXiv:1803.06917, 2018.
[34] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised Learning of Video Representations using LSTMs.
International conference onmachine learning , pages 843–852, 2015.[35] Y. Tan, W. Liu, and Q. Qiu. Adaptive Power Management Using Rein-forcement Learning.
ICCAD , pages 461–467, 2009.[36] Y. Wang, D. Wang, S. Zhang, Y. Feng, S. Li, and Q. Zhou. Deep Q-trading. http://cslt.riit.tsinghua.edu.cn , 2016.[37] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang,J. Liu, and H. Liu. Parametrized Deep Q-Networks Learning: Rein-forcement Learning with Discrete-Continuous Hybrid Action Space. arXivpreprint arXiv:1810.06394 , 2018.[38] X. Zhang, Y. Hu, K. Xie, W. Zhang, L. Su, and M. Liu. An evolutionarytrend reversion model for stock trading rule discovery.
Knowledge-Based Systems, 79:27–35, 2015.
[39] Y. Zhu and G. Zhou. Technical analysis: An asset allocation perspective on the use of moving averages.