Portfolio Optimization with 2D Relative-Attentional Gated Transformer
Tae Wan Kim
School of Computer Science, University of Sydney
NSW, Australia
[email protected]

Matloob Khushi
School of Computer Science, University of Sydney
NSW, Australia
[email protected]
Abstract—Portfolio optimization is one of the most actively researched applications of machine learning. Many researchers have attempted to solve this problem with deep reinforcement learning, whose sequential decision-making nature suits the properties of financial markets. However, most of this work can hardly be applied to real-world trading, since it ignores or extremely simplifies the realistic constraints of transaction costs, which have a significantly negative impact on portfolio profitability. In our research, a conservative level of transaction fees and slippage is considered for a realistic experiment. To enhance performance under these constraints, we propose a novel Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. By applying learnable relative positional embeddings to both the time and assets axes, the model better understands the peculiar structure of financial data in the portfolio optimization domain. Gating layers and layer reordering are also employed for the stable convergence of Transformers in reinforcement learning. In our experiment using 20 years of U.S. stock market data, our model outperformed baseline models, demonstrating its effectiveness.
Keywords—portfolio optimization, reinforcement learning, deep deterministic policy gradient, transformer, relative positional encoding

I. INTRODUCTION
Portfolio optimization aims to allocate resources optimally across various financial assets to maximize returns while reducing risk. Since it was theoretically pioneered by [1], many researchers have attempted to solve this problem using various machine learning approaches. In particular, reinforcement learning is a type of machine learning suitable for sequential decision making such as online portfolio rebalancing. In reinforcement learning, the agent improves the policy that decides its actions by repeatedly trying various actions in the environment and maximizing the expected cumulative reward. This can be implemented with two elements of an agent: the actor, which decides the action, and the critic, which assesses the value of the action with an estimate of the expected cumulative reward. Deep reinforcement learning, which utilizes deep neural networks in the actor and critic, is known to be efficient in handling financial problems. However, most research that has adopted it for portfolio optimization shows a lack of consideration of realistic constraints, which affects the performance of the models. Moreover, the data in the portfolio optimization domain has intractable characteristics: continuous action space, partial observability, and high dimensionality. This research proposes the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model to tackle these issues. In the ablation study, the model demonstrated profitability as well as stability and outperformed baseline models.

II. RELATED WORK
Various machine learning approaches have been examined for portfolio optimization. The majority of them, including recent research [2-4], focused on predicting future prices and built portfolios based on the predictions. However, this two-stage approach can be suboptimal in that minimizing the prediction error can differ from the objective of optimizing portfolios, and relevant data can be lost by using the predicted price alone [5]. Moreover, the performance of the approach is highly dependent on the prediction accuracy, and a high degree of accuracy can hardly be achieved with financial data. For these reasons, researchers have employed reinforcement learning as an alternative approach that does not predict future prices. Several researchers [6, 7] used reinforcement learning with a discrete action space, simplifying trading actions to buying, selling, or holding a single asset. A limited number of positions in trading was also used in [8], but it is difficult to generalize this approach to large-scale portfolios, since expanding the number of assets results in exponential growth of the action space. To address the continuous action space problem, [9] used a policy-based reinforcement learning framework that employs deep neural networks as its approximation functions and returns deterministic continuous action values directly from the policy network.

General reinforcement learning approaches are based on the Markov Decision Process, which assumes that the current state depends on the previous state only. However, financial markets are only partially observable [10] from the price and volume at a specific point in time. To address the partial observability of the state, reinforcement learning with recurrent neural networks was proposed in [8, 11, 12], using a time series of observations instead of a single observation to represent a state. However, RNNs, including LSTM, still suffer from long-term dependency problems and show lower performance on longer data [13]. The attention mechanism [14] was proposed to tackle this problem. In particular, the advent of the Transformer, which uses multi-head attention [15], produced state-of-the-art performance in natural language processing and computer vision, but the financial field has not yet benefited from it. Moreover, in [16, 17] the Transformer failed to solve a simple Markov Decision Process problem or was merely comparable to a random model, which suggests that it is extremely difficult to optimize a Transformer in a reinforcement learning setting.

There have been many studies on portfolio optimization, but most of them are based on unrealistic assumptions about transaction fees and slippage, which have a significant impact on portfolio profitability [18]. Ignoring or extremely simplifying these constraints makes it difficult to apply the algorithms to real-world asset trading. In this regard, this research proposes policy-based deep reinforcement learning using a variation of the Transformer to address realistic constraints as well as profitability.

III. METHODOLOGY

A. Problem Statement

1) State
Since a financial state is only partially observable, this study employs an observation set of historical prices and trading volumes up to time $t$ to represent the state at time $t$. A single observation set $F_t$ is a three-dimensional tensor that consists of five features: the opening, high, low, and closing prices (OHLC) and the trading volumes of the assets at time $t$.

$$F_t = (O_t, H_t, L_t, C_t, V_t) \qquad (1)$$

Each feature is a $t \times (m+1)$ matrix, where the rows represent the time axis and the columns represent the assets axis consisting of cash and $m$ assets. The opening prices $O_t$ at time $t$ are

$$O_t = \begin{pmatrix} o_t^0 & o_t^1 & o_t^2 & \cdots & o_t^m \\ o_{t-1}^0 & o_{t-1}^1 & o_{t-1}^2 & \cdots & o_{t-1}^m \\ o_{t-2}^0 & o_{t-2}^1 & o_{t-2}^2 & \cdots & o_{t-2}^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ o_1^0 & o_1^1 & o_1^2 & \cdots & o_1^m \end{pmatrix} \qquad (2)$$

and the high prices $H_t$ (3), low prices $L_t$ (4), closing prices $C_t$ (5), and trading volumes $V_t$ (6) are defined analogously, with elements $h$, $l$, $c$, and $v$ in place of $o$. The subscript of each element stands for the time and the superscript stands for the asset. The elements with superscript 0 in the first column stand for cash and are uniformly set at one.
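As a concrete illustration of how such a state can be assembled, the following minimal sketch (our assumption, not the authors' code) builds $F_t$ from raw per-asset OHLCV arrays; the array layout, the window length, and the prepended cash column of ones are choices made here to match (1)-(6) and the 50-day window of Section IV:

```python
import numpy as np

def build_observation(ohlcv, t, n=50):
    """Assemble the observation tensor F_t of (1), shape (5, n, m+1).

    ohlcv: array of shape (5, T, m) holding the open, high, low, close,
    and volume series for m assets; t is the index of the current day.
    """
    window = ohlcv[:, t - n + 1 : t + 1, :]        # most recent n days, (5, n, m)
    window = window[:, ::-1, :]                    # newest row first, as in (2)-(6)
    cash = np.ones((5, n, 1))                      # superscript-0 column: cash, fixed at one
    return np.concatenate([cash, window], axis=2)  # (5, n, m+1)
```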
2) Action

The action defined in this research is the proportion of asset investment to be rebalanced at time $t$, since actions represented as continuous values can be applied to large-scale portfolios much better than discrete actions. The action $a_t$ is a portfolio vector at time $t$ whose elements are the weights of resource allocation to cash and the $m$ assets, and the weights sum to one.

$$a_t = (a_t^0, a_t^1, a_t^2, \ldots, a_t^m), \qquad \sum_{i=0}^{m} a_t^i = 1 \qquad (7)$$
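One common way to guarantee that an actor network's raw outputs satisfy the simplex constraint in (7) is a softmax over the output layer. The paper does not spell out this step, so the snippet below is only a hedged sketch of one plausible choice:

```python
import numpy as np

def to_portfolio_vector(logits):
    """Map m+1 unconstrained actor outputs to non-negative weights summing to 1."""
    z = logits - logits.max()   # subtract the max to stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

# e.g. raw actor outputs for cash + 3 assets
print(to_portfolio_vector(np.array([0.1, 1.2, -0.3, 0.5])))
```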
3) Reward

The reward is the risk-adjusted return of the action with transaction costs applied. The previous portfolio value $p_{t-1}$ is a scalar calculated as the inner product of the current closing prices of the assets and the previously held shares.

$$p_{t-1} = c_t \cdot s_{t-1} \qquad (8)$$

where $c_t$ is the vector of current closing prices from $c_t^0$ to $c_t^m$ (the first row of the closing price feature $C_t$) and $s_{t-1}$ is the vector of previous shares from $s_{t-1}^0$ to $s_{t-1}^m$. The weighted portfolio values $p_{t-1} a_t$ are converted into the integer numbers of rebalanced shares $s_t$ according to the current closing prices as follows:

$$s_t = p_{t-1} a_t \mathbin{//} c_t \qquad (9)$$

where $//$ is the element-wise floor division operator that returns the integer quotients of element-wise division.

The transaction fee rate is assumed conservatively at 20 basis points (0.2%), and the slippage rate is set at half of the proportional bid-ask spread. Since bid-ask spread data is difficult to acquire, the estimate of the proportional bid-ask spread $S_t^i$ [19] is used for a single asset $i$ at time $t$:

$$S_t^i = 2\sqrt{E\!\left[\left(\log(c_t^i) - \eta_t^i\right)\left(\log(c_t^i) - \eta_{t+1}^i\right)\right]} \qquad (10)$$

where $\log(c_t^i)$ is the daily closing log-price at time $t$ and $\eta_t^i$ is the average of the daily high and low log-prices at time $t$. The rebalancing cost for a single asset is the transaction fee and slippage proportional to the current closing price and the change in shares of the asset. Thus, the total rebalancing cost $x_t$ is calculated as follows:

$$x_t = \sum_{i=1}^{m} c_t^i \left| s_t^i - s_{t-1}^i \right| \left(0.2\% + 0.5\,S_t^i\right) \qquad (11)$$

The rebalanced share for cash $s_t^0$ is the remainder of the previous portfolio value after deducting the $m$ assets' rebalanced portfolio value and the rebalancing cost.

$$s_t^0 = p_{t-1} - \sum_{i=1}^{m} c_t^i s_t^i - x_t \qquad (12)$$

Now, the total rebalanced portfolio value $p_t$ can be calculated as follows:

$$p_t = c_t \cdot s_t \qquad (13)$$

The return $r_t$ is the log return of the portfolio value.

$$r_t = \log(p_t) - \log(p_{t-1}) \qquad (14)$$

Since the return itself does not reflect risk, the reward function used here is the Sortino ratio [20], a variation of the Sharpe ratio [21]. The Sharpe ratio is defined as the average of the historical returns up to $r_t$ divided by the standard deviation of all those returns, whereas the Sortino ratio is the expected return divided by the standard deviation of the negative returns only. The denominators of both ratios represent the risk of the portfolio.
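The reward computation in (8)-(14) is mostly arithmetic, so a compact sketch may help. This is a minimal illustration under stated assumptions, not the authors' implementation: the fee constant, the downside deviation taken over the episode's return history, and the epsilon guard are choices made here.

```python
import numpy as np

FEE = 0.002  # 20 bp transaction fee rate (Section III)

def rebalance(prev_shares, weights, close, spread):
    """One rebalancing step following (8)-(14).

    Index 0 of every vector is cash, whose price close[0] is fixed at 1.
    Returns the new shares, the portfolio value, and the log return.
    """
    p_prev = close @ prev_shares                           # (8) previous portfolio value
    shares = np.floor(p_prev * weights / close)            # (9) integer shares
    cost = np.sum(close[1:] * np.abs(shares[1:] - prev_shares[1:])
                  * (FEE + 0.5 * spread[1:]))              # (11) fee + slippage, assets only
    shares[0] = p_prev - close[1:] @ shares[1:] - cost     # (12) cash remainder
    p_new = close @ shares                                 # (13) rebalanced value
    return shares, p_new, np.log(p_new) - np.log(p_prev)   # (14) log return

def sortino(returns, eps=1e-8):
    """Reward: mean return over downside deviation (epsilon guard is ours)."""
    returns = np.asarray(returns)
    downside = returns[returns < 0]
    return returns.mean() / (downside.std() + eps)
```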
B. Model Architecture

In this section, we propose the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. The overall architecture is shown in Fig. 1 and is designed around the characteristics of portfolio optimization data: continuous action space, partial observability, and high dimensionality. The agent follows the structure of Deep Deterministic Policy Gradient [9] for the continuous action space, utilizing Transformer encoders whose structure is robust to the long-term dependencies arising from partial observability. Specifically, a variation of the Transformer called the 2D Relative-attentional Gated Transformer (RG-Transformer) is used as the core of its actor, target actor, critic, and target critic networks to deal with high-dimensional portfolio data.

Fig. 1. The overall architecture of the DPGRGT model

1) Deep Deterministic Policy Gradient
The agent uses a deep neural network as a policy approximator that returns actions with continuous values in a deterministic way [9]. In addition to the actor $\mu$ and critic $Q$ with weights $\theta^{\mu}$ and $\theta^{Q}$, respectively, a separate pair of a target actor $\mu'$ and a target critic $Q'$ with respective weights $\theta^{\mu'}$ and $\theta^{Q'}$ is introduced to ensure stable learning. The target return $G_i$ for the $i$-th sample from the replay buffer is as follows:

$$G_i = r_i + \gamma\,Q'\!\left(s_i',\ \mu'(s_i' \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \qquad (15)$$

where $s_i$, $a_i$, $r_i$, $s_i'$, and $\gamma$ denote the state, action, reward, next state, and discount factor, respectively. The critic weights $\theta^{Q}$ are updated by minimizing the loss $L$ from the temporal difference error between $G_i$ and $Q(s_i, a_i \mid \theta^{Q})$:

$$L = \frac{1}{N}\sum_i \left(G_i - Q(s_i, a_i \mid \theta^{Q})\right)^2 \qquad (16)$$

Also, the policy gradient used to update the actor weights $\theta^{\mu}$ is calculated with the chain rule as follows:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_{\theta^{\mu}} Q\!\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right)\Big|_{s=s_i} = \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i} \qquad (17)$$

Finally, with a predefined update rate $\tau$, the target actor weights $\theta^{\mu'}$ and target critic weights $\theta^{Q'}$ are updated to $\tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$ and $\tau\theta^{Q} + (1-\tau)\theta^{Q'}$, respectively. During training, the actor adds the action space noise of [22] for temporally correlated exploration to avoid local optima as follows:

$$a = \mu(s \mid \theta^{\mu}) + dx_t = \mu(s \mid \theta^{\mu}) + \theta(\mu - x_{t-1})\,dt + \sigma\sqrt{dt}\,\mathcal{N}(0, 1) \qquad (18)$$

where $x_t$ is the noise at time $t$ with fixed parameters $\theta$, $\mu$, and $\sigma$ for the noise generation.

To accelerate the training procedure, an asynchronous episodic training method [23] was adopted. The episodic training is processed in parallel by multiple simulators that accumulate each group of experience data $s$, $a$, $r$, and $s'$ in the experience replay buffer. To encourage the agent to learn highly rewarded trajectories more often, HMemory saves an episode only when it renews the highest episodic reward, and the saved data are sampled with probability $p$.
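A compressed sketch of one critic/actor update under (15)-(18) follows. It assumes hypothetical `actor`, `critic`, `target_actor`, and `target_critic` Keras models (the critic taking state and action as two inputs) and a sampled mini-batch; it mirrors standard DDPG rather than reproducing the authors' exact code.

```python
import tensorflow as tf

GAMMA, TAU = 0.9, 0.15  # discount factor and soft-update rate from Section IV

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, s, a, r, s_next):
    # (15) target return from the target networks
    g = r + GAMMA * target_critic([s_next, target_actor(s_next)])

    # (16) critic loss: mean squared temporal-difference error
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(g - critic([s, a])))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # (17) deterministic policy gradient: ascend Q(s, mu(s))
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # soft update of both target networks (assumes matching variable order)
    for t_var, var in zip(target_actor.variables + target_critic.variables,
                          actor.variables + critic.variables):
        t_var.assign(TAU * var + (1.0 - TAU) * t_var)

def ou_noise(x_prev, theta=0.13, mu=0.0, sigma=0.2, dt=1.0):
    """(18) Ornstein-Uhlenbeck exploration noise; parameters from Section IV."""
    return (x_prev + theta * (mu - x_prev) * dt
            + sigma * (dt ** 0.5) * tf.random.normal(tf.shape(x_prev)))
```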
2) 2D Relative-attentional Gated Transformer
The input for portfolio optimization is historical data of partially observable states and is a tensor of three dimensions: financial features, historical time, and assets. To address the high dimensionality, the agent incorporates into the main part of its actors and critics a variation of the Transformer capable of identifying positions along both time and assets. Fig. 2 shows the structure of the actor network and its 2D relative positional multi-head attention (2D relative attention), where the length dimension stands for periods and the height dimension represents assets. As seen in Fig. 3, the critic network is similar to the actor network, except that it takes action data as an additional input.
Fig. 2. The actor network and its 2D relative positional multi-head attention of the DPGRGT model
Fig. 3. The critic network of the DPGRGT model
Since the Transformer does not have any recurrent or convolutional structure, it requires additional position information. Beyond the sinusoidal encoding used in the original Transformer, [24, 25] showed the effectiveness of incorporating relative positional information into the self-attention for machine translation and music generation. For $L \times D$ dimensional data $X$, the relative attention $A_h$ for each head is as follows:

$$A_h = \mathrm{softmax}\!\left(\frac{XW_h^Q\left(XW_h^K + R_h\right)^{T}}{\sqrt{D_h}}\right) XW_h^V = \mathrm{softmax}\!\left(\frac{XW_h^Q\left(XW_h^K\right)^{T} + XW_h^Q\left(R_h\right)^{T}}{\sqrt{D_h}}\right) XW_h^V \qquad (19)$$

where $D_h$ is $D$ divided by the number of heads $h$. $XW_h^Q$, $XW_h^K$, and $XW_h^V$ are the evenly split $L \times D_h$ matrices for each head and serve as the query, key, and value of the attention, respectively. $R_h$ is a matrix that represents the relative positions between every pair of the $L$ elements and is gathered from the $L \times D_h$ dimensional relative position embedding $E_h$ learned separately for each head. After multiplying the two expanded tensors $XW^Q$ of shape $(h, L, D_h)$ and $E^{T}$ of shape $(h, D_h, L)$, "skewing" [24] the result gives rise to the direct calculation of $XW_h^Q (R_h)^{T}$ for each head with efficient use of memory.

To implement 2D relative attention in our model, two relative position embeddings are used for $L \times H \times D$ dimensional financial data $X'$, where the additional dimension $H$ stands for the height. The embeddings $E^{l}$ and $E^{h}$ learn the relative positional representations for each pair of data elements in the $L$ (time) dimension and the $H$ (assets) dimension, respectively. While $X'W_h^Q$, $X'W_h^K$, and $X'W_h^V$ are flattened into $(h, L \cdot H, D_h)$ tensors for the matrix multiplications other than the second term in (19), $X'W_h^Q$ with its original shape $(h, L, H, D_h)$ is multiplied by the height embedding $(E^{h})^{T}$ of shape $(h, D_h, H)$, and $X'W_h^Q (E^{h})^{T}$ is flattened into an $(h \cdot L, H, H)$ tensor for skewing. Similarly, after multiplying the permuted $X'W_h^Q$ of shape $(h, H, L, D_h)$ and $(E^{l})^{T}$ of shape $(h, D_h, L)$, the result is flattened into an $(h \cdot H, L, L)$ tensor for skewing. After skewing, both calculation outputs are expanded into tensors of shape $(h, H \cdot L, H \cdot L)$ and are added to the first term in place of the original second term in (19), representing the relative positions along both the height and length dimensions of the data.

Lastly, the layer normalization is placed before the multi-head attention, and a gating layer replaces the residual connection to enhance the stability of Transformers in reinforcement learning [17]. The gating layer uses the structure of a GRU [13] cell as its gating function $g$:

$$\begin{aligned} r &= \sigma(W_r y + U_r x) \\ z &= \sigma(W_z y + U_z x - b_g) \\ \hat{h} &= \tanh\!\left(W_g y + U_g (r \odot x)\right) \\ g(x, y) &= (1 - z) \odot x + z \odot \hat{h} \end{aligned} \qquad (20)$$

where $y$ is the input from the previous layer, $x$ is the residual value, and $\odot$ denotes element-wise multiplication.
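To make the skewing step concrete, the sketch below shows the memory-efficient computation of $XW_h^Q(R_h)^{T}$ for one axis, following the pad-reshape-slice trick of [24], and applies it once per axis for the 2D case. The tensor names, static-shape assumption, and broadcasting details are our assumptions, not the authors' code; the returned logits would replace the second term of (19) before scaling and softmax.

```python
import tensorflow as tf

def skew(qe):
    """Memory-efficient Q.R^T from Q.E^T of shape (batch, L, L) [24]:
    pad one column on the left, reshape so each row shifts by one,
    then drop the spurious first row."""
    l = qe.shape[-1]
    padded = tf.pad(qe, [[0, 0], [0, 0], [1, 0]])         # (batch, L, L+1)
    return tf.reshape(padded, [-1, l + 1, l])[:, 1:, :]   # (batch, L, L)

def relative_logits_2d(q, e_len, e_height):
    """2D relative-position logits added to Q.K^T before the softmax.

    q:        (h, L, H, Dh) per-head queries on the time-by-assets grid
    e_len:    (h, L, Dh) relative embedding for the time axis
    e_height: (h, H, Dh) relative embedding for the assets axis
    returns:  (h, L*H, L*H); assumes statically known shapes
    """
    h, L, H, Dh = q.shape

    # assets axis: logits depend on the asset offset; broadcast over key time
    rel_h = tf.einsum('hlxd,hyd->hlxy', q, e_height)       # (h, L, H, H)
    rel_h = skew(tf.reshape(rel_h, [h * L, H, H]))
    rel_h = tf.reshape(rel_h, [h, L, H, 1, H])
    rel_h = tf.broadcast_to(rel_h, [h, L, H, L, H])

    # time axis: permute assets forward, skew, broadcast over key asset
    q_t = tf.transpose(q, [0, 2, 1, 3])                    # (h, H, L, Dh)
    rel_l = tf.einsum('hxld,hmd->hxlm', q_t, e_len)        # (h, H, L, L)
    rel_l = skew(tf.reshape(rel_l, [h * H, L, L]))
    rel_l = tf.reshape(rel_l, [h, H, L, L])
    rel_l = tf.transpose(rel_l, [0, 2, 1, 3])[:, :, :, :, None]
    rel_l = tf.broadcast_to(rel_l, [h, L, H, L, H])

    return tf.reshape(rel_h + rel_l, [h, L * H, L * H])
```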
IV. EXPERIMENTS

A. Experimental Setup

The model is trained and evaluated with assets of nine Dow Jones companies representing each sector: industrials (MMM), financials (JPM), consumer services (PG), technology (AAPL), health care (UNH), consumer goods (WMT), oil & gas (XOM), basic materials (DD), and telecommunications (VZ). The OHLC prices and trading volume data of the assets are collected from Yahoo Finance. The data for the 18 years from 2000 to 2017 are used for training, and the data from 2018 to April 2020 are used for evaluation. Each daily observation set consists of historical data for the most recent 50 days, and all the data is log-differenced for the stationarity of the time series.

The maximum length of a single episode is set at 50 days, and the initial investment at 100,000 USD. Two separate Adam optimizers with mini-batch size 32 are used to train the actor and critic. The learning rate of both the actor and critic is 1e-4, and the update rate $\tau$ for the target actor and target critic is 0.15. The discount factor $\gamma$ is 0.9, the HMemory sampling rate $p$ is 0.2, and the number of threads used for parallel training is 5. The action space noise parameters $\theta$, $\mu$, and $\sigma$ are 0.13, 0, and 0.2, respectively. For the Transformer, three encoder layers, eight heads, 128-dimensional vectors for the attention hidden layers, and 512-dimensional vectors for the feed-forward network layers are used. All programs were implemented with Python 3 and TensorFlow 2 using Google Colab.

As baseline models, two traditional portfolio models implementing MPT (Markowitz's Modern Portfolio Theory) and the Uniform Constant Rebalanced Portfolio (UCRP) strategy [26] are employed. For the ablation study, a simple Deep Deterministic Policy Gradient (DDPG), DDPG with Transformer (DDPG_TF), DDPG with 2D Relative-attentional Transformer (DDPG_RP_TF), and DDPG with Gated Transformer (DDPG_GL_TF) are tested under the same conditions. Both the cumulative return and the annualized Sharpe ratio are evaluated to verify the robustness of the models to risk as well as their performance.
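For reference, the two evaluation metrics can be computed from the series of daily portfolio values as below. This is a standard formulation (252 trading days per year, zero risk-free rate), assumed here since the paper does not state its exact annualization convention:

```python
import numpy as np

def cumulative_return(values):
    """Cumulative return (%) from a series of daily portfolio values."""
    return (values[-1] / values[0] - 1.0) * 100.0

def annualized_sharpe(values, periods_per_year=252):
    """Annualized Sharpe ratio of daily log returns, risk-free rate = 0."""
    r = np.diff(np.log(values))
    return np.sqrt(periods_per_year) * r.mean() / r.std()
```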
B. Results

Fig. 4 shows the changes in portfolio values for all the models tested on the asset data over the 28-month evaluation period. The steep drop around March 2020 comes from the outbreak of COVID-19, which had a huge negative impact on the portfolio values. The portfolio value of the DPGRGT model keeps increasing until it meets the outbreak, and despite the drop, the model outperformed all the other models, proving its resilience and high profitability. DDPG and DDPG with Transformer are slightly better than MPT but show poor performance. DDPG with 2D Relative-attentional Transformer is relatively better than DDPG and DDPG with Transformer, but still worse than the UCRP baseline. With the exception of DPGRGT, DDPG with Gated Transformer is the only model more profitable than UCRP, which points to the effectiveness of the gating layer of the Transformer in reinforcement learning.

Fig. 4. The portfolio values of the experiment

Table I shows the cumulative returns and the annualized Sharpe ratios of the models. While DPGRGT and DDPG with Gated Transformer are profitable, three other models lost over 40% of their initial value. The MPT model, which attempts to maximize its Sharpe ratio at every rebalancing opportunity, is the worst, whereas UCRP, which rebalances its shares of assets to a constant investment proportion, is the second best. This result suggests that transaction fees and slippage have a significant influence on rebalancing performance. Under these constraints, the effects of using either the 2D Relative-attentional Transformer or the Gated Transformer alone are also limited. Although the pandemic outbreak undermined the overall performance of the models, DPGRGT, which uses both 2D relative attention and the Gated Transformer, ultimately demonstrated stability and strong performance.
TABLE I. THE PERFORMANCE COMPARISON OF MODELS

Model                  Cumulative Return (%)   Annualized Sharpe Ratio
DPGRGT (Our Model)
DDPG_GL_TF
DDPG_RP_TF                     -5.45                  -0.1343
DDPG_TF                       -41.71                  -0.8191
DDPG                          -42.91                  -1.2194
UCRP
MPT                           -50.07                  -1.5840

V. CONCLUSION AND FUTURE WORK
This paper proposes a portfolio optimization algorithm based on reinforcement learning using 2D Relative-attentional Gated Transformers. To the best of our knowledge, this is the first research that applies Transformers to reinforcement learning for portfolio optimization. The experiment shows that the 2D relative attention and the gating layers improve the performance of Transformers, and that combining them creates a synergy effect and produces the best results for portfolio optimization. Since even highly profitable models cannot be applied to real trades without stability and realistic constraints taken into consideration, risk is accounted for by incorporating the Sortino ratio into the reward function, and the transaction costs are set at a conservative level to ensure a more practical experiment. In a further study, the relations between periods and assets could be analyzed through the multi-head attention weights of the Transformer to make the model more interpretable and trustworthy. Furthermore, experiments utilizing a multi-PU environment with a wide range of asset types could be considered for the generalization of the model.
REFERENCES

[1] H. Markowitz, "Portfolio Selection," The Journal of Finance, vol. 7, no. 1, pp. 77-91, 1952.
[2] S. Obeidat, D. Shapiro, M. Lemay, M. K. MacPherson, and M. Bolic, "Adaptive Portfolio Asset Allocation Optimization with Deep Learning," International Journal on Advances in Intelligent Systems, vol. 11, no. 1&2, pp. 25-34, 2018.
[3] M. M. Solin, A. Alamsyah, B. Rikumahu, and M. A. A. Saputra, "Forecasting Portfolio Optimization using Artificial Neural Network and Genetic Algorithm," in 2019 7th International Conference on Information and Communication Technology (ICoICT), 24-26 July 2019, pp. 1-7, doi: 10.1109/ICoICT.2019.8835344.
[4] H. Yun, M. Lee, Y. S. Kang, and J. Seok, "Portfolio management via two-stage deep learning with a joint cost," Expert Syst. Appl., vol. 143, 2020, doi: 10.1016/j.eswa.2019.113041.
[5] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," Journal of Forecasting, vol. 17, no. 5