Portfolio Optimization with 2D Relative-Attentional Gated Transformer
Tae Wan Kim
School of Computer Science, University of Sydney
NSW, Australia
[email protected]

Matloob Khushi
School of Computer Science, University of Sydney
NSW, Australia
[email protected]
Abstract—Portfolio optimization is one of the most actively researched applications of machine learning. Many researchers have attempted to solve this problem with deep reinforcement learning, whose sequential decision-making nature suits the properties of financial markets. However, most of this work can hardly be applied to real-world trading, since it ignores or extremely simplifies the realistic constraints of transaction costs, which have a significantly negative impact on portfolio profitability. In our research, a conservative level of transaction fees and slippage is considered for a realistic experiment. To enhance performance under these constraints, we propose a novel Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. By applying learnable relative positional embeddings to both the time and assets axes, the model better understands the peculiar structure of financial data in the portfolio optimization domain. Gating layers and layer reordering are also employed for the stable convergence of Transformers in reinforcement learning. In our experiment using 20 years of U.S. stock market data, our model outperformed baseline models, demonstrating its effectiveness.
Keywords—portfolio optimization, reinforcement learning, deep deterministic policy gradient, transformer, relative positional encoding

I. INTRODUCTION
Portfolio optimization aims to allocate resources optimally across various financial assets to maximize returns while reducing risk. Since it was theoretically pioneered by [1], many researchers have attempted to solve this problem using various machine learning approaches. In particular, reinforcement learning is a type of machine learning suitable for sequential decision making such as online portfolio rebalancing. In reinforcement learning, the agent improves the policy that decides its actions by repeatedly trying various actions in the environment and maximizing the expected cumulative reward. This can be implemented with two elements of an agent: the actor, which decides the action, and the critic, which assesses the value of the action with an estimate of the expected cumulative reward. Deep reinforcement learning, which utilizes deep neural networks in the actor and critic, is known to be efficient in handling financial problems. However, most research that has adopted it for portfolio optimization shows a lack of consideration of realistic constraints, which affects the performance of the models. Moreover, the data in the portfolio optimization domain has intractable characteristics: continuous action space, partial observability, and high dimensionality. This research proposes the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model to tackle these issues. In the ablation study, the model demonstrated profitability as well as stability and outperformed baseline models.

II. RELATED WORK
Various machine learning approaches have been examined for portfolio optimization. The majority of them, including recent research [2-4], focused on predicting future prices and built portfolios based on the predictions. However, this two-stage approach can be suboptimal in that minimizing the prediction error can differ from the objective of optimizing portfolios, and relevant data can be lost by using the predicted price alone [5]. Moreover, the performance of the approach is highly dependent on the prediction accuracy, and a high degree of accuracy can hardly be achieved with financial data. For these reasons, researchers have employed reinforcement learning as an alternative approach that does not predict future prices. Several researchers [6, 7] used reinforcement learning with a discrete action space, simplifying trading actions to buying, selling, or holding a single asset. A limited number of positions in trading was also used in [8], but it is difficult to generalize this approach to large-scale portfolios, since expanding the number of assets results in exponential growth of the action space. To address the continuous action space problem, [9] used a policy-based reinforcement learning framework that employs deep neural networks as its approximation functions and returns deterministic continuous action values directly from the policy network.

General reinforcement learning approaches are based on the Markov Decision Process, which assumes that the current state depends on the previous state only. However, financial markets are only partially observable [10] from the price and volume at a specific point in time. To address the partial observability of the state, reinforcement learning with recurrent neural networks was proposed in [8, 11, 12], using a time series of observations instead of a single observation to represent a state. However, RNNs, including LSTM, still suffer from long-term dependency problems and show lower performance on longer data [13]. The attention mechanism [14] was proposed to tackle this problem. In particular, the advent of the Transformer, which uses multi-head attention [15], produced state-of-the-art performance in natural language processing and computer vision, but the financial field has not yet benefited from it. Moreover, in [16, 17] the Transformer failed to solve a simple Markov Decision Process problem or was merely comparable to a random model, which suggests that it is extremely difficult to optimize a Transformer in a reinforcement learning setting.

There have been many studies on portfolio optimization, but most of them are based on unrealistic assumptions about transaction fees and slippage, which have a significant impact on portfolio profitability [18]. Ignoring or extremely simplifying these constraints makes it difficult to apply the algorithms to real-world asset trading. In this regard, this research proposes policy-based deep reinforcement learning using a variation of the Transformer to address realistic constraints as well as profitability.

III. METHODOLOGY

A. Problem Statement

1) State
Since a financial state is only partially observable, this study employs an observation set of historical prices and trading volumes up to time $t$ to represent the state at time $t$. A single observation set $F_t$ is a three-dimensional tensor that consists of five features: the opening, high, low, and closing prices (OHLC) and the trading volumes of the assets at time $t$.

$$F_t = (O_t, H_t, L_t, C_t, V_t) \qquad (1)$$

Each feature is a $t \times (m+1)$ matrix, where the rows represent the time axis and the columns represent the assets axis consisting of cash and $m$ assets. The opening prices $O_t$ at time $t$ are

$$O_t = \begin{pmatrix} o_t^0 & o_t^1 & o_t^2 & \cdots & o_t^m \\ o_{t-1}^0 & o_{t-1}^1 & o_{t-1}^2 & \cdots & o_{t-1}^m \\ o_{t-2}^0 & o_{t-2}^1 & o_{t-2}^2 & \cdots & o_{t-2}^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ o_1^0 & o_1^1 & o_1^2 & \cdots & o_1^m \end{pmatrix} \qquad (2)$$

and the high prices $H_t$ (3), low prices $L_t$ (4), closing prices $C_t$ (5), and trading volumes $V_t$ (6) are defined analogously, with elements $h$, $l$, $c$, and $v$ in place of $o$. The subscript of each element stands for the time and the superscript stands for the asset. The elements with superscript 0 in the first column stand for cash and are uniformly set at one.
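As a concrete illustration of how such a state can be assembled, the following minimal sketch (our assumption, not the authors' code) builds $F_t$ from raw per-asset OHLCV arrays; the array layout, the window length, and the prepended cash column of ones are choices made here to match (1)-(6) and the 50-day window of Section IV:

```python
import numpy as np

def build_observation(ohlcv, t, n=50):
    """Assemble the observation tensor F_t of (1), shape (5, n, m+1).

    ohlcv: array of shape (5, T, m) holding the open, high, low, close,
    and volume series for m assets; t is the index of the current day.
    """
    window = ohlcv[:, t - n + 1 : t + 1, :]        # most recent n days, (5, n, m)
    window = window[:, ::-1, :]                    # newest row first, as in (2)-(6)
    cash = np.ones((5, n, 1))                      # superscript-0 column: cash, fixed at one
    return np.concatenate([cash, window], axis=2)  # (5, n, m+1)
```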
2) Action

The action defined in this research is the proportion of asset investment to be rebalanced at time $t$, since actions represented as continuous values can be applied to large-scale portfolios much better than discrete actions. The action $a_t$ is a portfolio vector at time $t$ whose elements are the weights of resource allocation to cash and the $m$ assets, and the weights sum to one.

$$a_t = (a_t^0, a_t^1, a_t^2, \ldots, a_t^m), \qquad \sum_{i=0}^{m} a_t^i = 1 \qquad (7)$$
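One common way to guarantee that an actor network's raw outputs satisfy the simplex constraint in (7) is a softmax over the output layer. The paper does not spell out this step, so the snippet below is only a hedged sketch of one plausible choice:

```python
import numpy as np

def to_portfolio_vector(logits):
    """Map m+1 unconstrained actor outputs to non-negative weights summing to 1."""
    z = logits - logits.max()   # subtract the max to stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

# e.g. raw actor outputs for cash + 3 assets
print(to_portfolio_vector(np.array([0.1, 1.2, -0.3, 0.5])))
```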
3) Reward

The reward is the risk-adjusted return of the action with transaction costs applied. The previous portfolio value $p_{t-1}$ is a scalar calculated as the inner product of the current closing prices of the assets and the previously held shares.

$$p_{t-1} = c_t \cdot s_{t-1} \qquad (8)$$

where $c_t$ is the vector of current closing prices from $c_t^0$ to $c_t^m$ (the first row of the closing price feature $C_t$) and $s_{t-1}$ is the vector of previous shares from $s_{t-1}^0$ to $s_{t-1}^m$. The weighted portfolio values $p_{t-1} a_t$ are converted into the integer numbers of rebalanced shares $s_t$ according to the current closing prices as follows:

$$s_t = p_{t-1} a_t \mathbin{//} c_t \qquad (9)$$

where $//$ is the element-wise floor division operator that returns the integer quotients of element-wise division.

The transaction fee rate is assumed conservatively at 20 basis points (0.2%), and the slippage rate is set at half of the proportional bid-ask spread. Since bid-ask spread data is difficult to acquire, the estimate of the proportional bid-ask spread $S_t^i$ [19] is used for a single asset $i$ at time $t$:

$$S_t^i = 2\sqrt{E\!\left[\left(\log(c_t^i) - \eta_t^i\right)\left(\log(c_t^i) - \eta_{t+1}^i\right)\right]} \qquad (10)$$

where $\log(c_t^i)$ is the daily closing log-price at time $t$ and $\eta_t^i$ is the average of the daily high and low log-prices at time $t$. The rebalancing cost for a single asset is the transaction fee and slippage proportional to the current closing price and the change in shares of the asset. Thus, the total rebalancing cost $x_t$ is calculated as follows:

$$x_t = \sum_{i=1}^{m} c_t^i \left| s_t^i - s_{t-1}^i \right| \left(0.2\% + 0.5\,S_t^i\right) \qquad (11)$$

The rebalanced share for cash $s_t^0$ is the remainder of the previous portfolio value after deducting the $m$ assets' rebalanced portfolio value and the rebalancing cost.

$$s_t^0 = p_{t-1} - \sum_{i=1}^{m} c_t^i s_t^i - x_t \qquad (12)$$

Now, the total rebalanced portfolio value $p_t$ can be calculated as follows:

$$p_t = c_t \cdot s_t \qquad (13)$$

The return $r_t$ is the log return of the portfolio value.

$$r_t = \log(p_t) - \log(p_{t-1}) \qquad (14)$$

Since the return itself does not reflect risk, the reward function used here is the Sortino ratio [20], a variation of the Sharpe ratio [21]. The Sharpe ratio is defined as the average of the historical returns up to $r_t$ divided by the standard deviation of all those returns, whereas the Sortino ratio is the expected return divided by the standard deviation of the negative returns only. The denominators of both ratios represent the risk of the portfolio.
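The reward computation in (8)-(14) is mostly arithmetic, so a compact sketch may help. This is a minimal illustration under stated assumptions, not the authors' implementation: the fee constant, the downside deviation taken over the episode's return history, and the epsilon guard are choices made here.

```python
import numpy as np

FEE = 0.002  # 20 bp transaction fee rate (Section III)

def rebalance(prev_shares, weights, close, spread):
    """One rebalancing step following (8)-(14).

    Index 0 of every vector is cash, whose price close[0] is fixed at 1.
    Returns the new shares, the portfolio value, and the log return.
    """
    p_prev = close @ prev_shares                           # (8) previous portfolio value
    shares = np.floor(p_prev * weights / close)            # (9) integer shares
    cost = np.sum(close[1:] * np.abs(shares[1:] - prev_shares[1:])
                  * (FEE + 0.5 * spread[1:]))              # (11) fee + slippage, assets only
    shares[0] = p_prev - close[1:] @ shares[1:] - cost     # (12) cash remainder
    p_new = close @ shares                                 # (13) rebalanced value
    return shares, p_new, np.log(p_new) - np.log(p_prev)   # (14) log return

def sortino(returns, eps=1e-8):
    """Reward: mean return over downside deviation (epsilon guard is ours)."""
    returns = np.asarray(returns)
    downside = returns[returns < 0]
    return returns.mean() / (downside.std() + eps)
```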
B. Model Architecture

In this section, we propose the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. The overall architecture is shown in Fig. 1 and is designed around the characteristics of portfolio optimization data: continuous action space, partial observability, and high dimensionality. The agent follows the structure of Deep Deterministic Policy Gradient [9] for the continuous action space, utilizing Transformer encoders whose structure is robust to the long-term dependencies arising from partial observability. Specifically, a variation of the Transformer called the 2D Relative-attentional Gated Transformer (RG-Transformer) is used as the core of its actor, target actor, critic, and target critic networks to deal with high-dimensional portfolio data.

Fig. 1. The overall architecture of the DPGRGT model

1) Deep Deterministic Policy Gradient
The agent uses a deep neural network as a policy approximator that returns actions with continuous values in a deterministic way [9]. In addition to the actor $\mu$ and critic $Q$ with weights $\theta^{\mu}$ and $\theta^{Q}$, respectively, a separate pair of a target actor $\mu'$ and a target critic $Q'$ with respective weights $\theta^{\mu'}$ and $\theta^{Q'}$ is introduced to ensure stable learning. The target return $G_i$ for the $i$-th sample from the replay buffer is as follows:

$$G_i = r_i + \gamma\,Q'\!\left(s_i',\ \mu'(s_i' \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \qquad (15)$$

where $s_i$, $a_i$, $r_i$, $s_i'$, and $\gamma$ denote the state, action, reward, next state, and discount factor, respectively. The critic weights $\theta^{Q}$ are updated by minimizing the loss $L$ from the temporal difference error between $G_i$ and $Q(s_i, a_i \mid \theta^{Q})$:

$$L = \frac{1}{N}\sum_i \left(G_i - Q(s_i, a_i \mid \theta^{Q})\right)^2 \qquad (16)$$

Also, the policy gradient used to update the actor weights $\theta^{\mu}$ is calculated with the chain rule as follows:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_{\theta^{\mu}} Q\!\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right)\Big|_{s=s_i} = \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i} \qquad (17)$$

Finally, with a predefined update rate $\tau$, the target actor weights $\theta^{\mu'}$ and target critic weights $\theta^{Q'}$ are updated to $\tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$ and $\tau\theta^{Q} + (1-\tau)\theta^{Q'}$, respectively. During training, the actor adds the action space noise of [22] for temporally correlated exploration to avoid local optima as follows:

$$a = \mu(s \mid \theta^{\mu}) + dx_t = \mu(s \mid \theta^{\mu}) + \theta(\mu - x_{t-1})\,dt + \sigma\sqrt{dt}\,\mathcal{N}(0, 1) \qquad (18)$$

where $x_t$ is the noise at time $t$ with fixed parameters $\theta$, $\mu$, and $\sigma$ for the noise generation.

To accelerate the training procedure, an asynchronous episodic training method [23] was adopted. The episodic training is processed in parallel by multiple simulators that accumulate each group of experience data $s$, $a$, $r$, and $s'$ in the experience replay buffer. To encourage the agent to learn highly rewarded trajectories more often, HMemory saves an episode only when it renews the highest episodic reward, and the saved data are sampled with probability $p$.
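A compressed sketch of one critic/actor update under (15)-(18) follows. It assumes hypothetical `actor`, `critic`, `target_actor`, and `target_critic` Keras models (the critic taking state and action as two inputs) and a sampled mini-batch; it mirrors standard DDPG rather than reproducing the authors' exact code.

```python
import tensorflow as tf

GAMMA, TAU = 0.9, 0.15  # discount factor and soft-update rate from Section IV

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, s, a, r, s_next):
    # (15) target return from the target networks
    g = r + GAMMA * target_critic([s_next, target_actor(s_next)])

    # (16) critic loss: mean squared temporal-difference error
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(g - critic([s, a])))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # (17) deterministic policy gradient: ascend Q(s, mu(s))
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # soft update of both target networks (assumes matching variable order)
    for t_var, var in zip(target_actor.variables + target_critic.variables,
                          actor.variables + critic.variables):
        t_var.assign(TAU * var + (1.0 - TAU) * t_var)

def ou_noise(x_prev, theta=0.13, mu=0.0, sigma=0.2, dt=1.0):
    """(18) Ornstein-Uhlenbeck exploration noise; parameters from Section IV."""
    return (x_prev + theta * (mu - x_prev) * dt
            + sigma * (dt ** 0.5) * tf.random.normal(tf.shape(x_prev)))
```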
2) 2D Relative-attentional Gated Transformer
The input for portfolio optimization is historical data of partially observable states and is a tensor of three dimensions: financial features, historical time, and assets. To address the high dimensionality, the agent incorporates into the main part of its actors and critics a variation of the Transformer capable of identifying positions along both time and assets. Fig. 2 shows the structure of the actor network and its 2D relative positional multi-head attention (2D relative attention), where the length dimension stands for periods and the height dimension represents assets. As seen in Fig. 3, the critic network is similar to the actor network, except that it takes action data as an additional input.
Fig. 2. The actor network and its 2D relative positional multi-head attention of the DPGRGT model
Fig. 3. The critic network of the DPGRGT model
Since the Transformer does not have any recurrent or convolutional structure, it requires additional position information. Beyond the sinusoidal encoding used in the original Transformer, [24, 25] showed the effectiveness of incorporating relative positional information into the self-attention for machine translation and music generation. For $L \times D$ dimensional data $X$, the relative attention $A_h$ for each head is as follows:

$$A_h = \mathrm{softmax}\!\left(\frac{XW_h^Q\left(XW_h^K + R_h\right)^{T}}{\sqrt{D_h}}\right) XW_h^V = \mathrm{softmax}\!\left(\frac{XW_h^Q\left(XW_h^K\right)^{T} + XW_h^Q\left(R_h\right)^{T}}{\sqrt{D_h}}\right) XW_h^V \qquad (19)$$

where $D_h$ is $D$ divided by the number of heads $h$. $XW_h^Q$, $XW_h^K$, and $XW_h^V$ are the evenly split $L \times D_h$ matrices for each head and serve as the query, key, and value of the attention, respectively. $R_h$ is a matrix that represents the relative positions between every pair of the $L$ elements and is gathered from the $L \times D_h$ dimensional relative position embedding $E_h$ learned separately for each head. After multiplying the two expanded tensors $XW^Q$ of shape $(h, L, D_h)$ and $E^{T}$ of shape $(h, D_h, L)$, "skewing" [24] the result gives rise to the direct calculation of $XW_h^Q (R_h)^{T}$ for each head with efficient use of memory.

To implement 2D relative attention in our model, two relative position embeddings are used for $L \times H \times D$ dimensional financial data $X'$, where the additional dimension $H$ stands for the height. The embeddings $E^{l}$ and $E^{h}$ learn the relative positional representations for each pair of data elements in the $L$ (time) dimension and the $H$ (assets) dimension, respectively. While $X'W_h^Q$, $X'W_h^K$, and $X'W_h^V$ are flattened into $(h, L \cdot H, D_h)$ tensors for the matrix multiplications other than the second term in (19), $X'W_h^Q$ with its original shape $(h, L, H, D_h)$ is multiplied by the height embedding $(E^{h})^{T}$ of shape $(h, D_h, H)$, and $X'W_h^Q (E^{h})^{T}$ is flattened into an $(h \cdot L, H, H)$ tensor for skewing. Similarly, after multiplying the permuted $X'W_h^Q$ of shape $(h, H, L, D_h)$ and $(E^{l})^{T}$ of shape $(h, D_h, L)$, the result is flattened into an $(h \cdot H, L, L)$ tensor for skewing. After skewing, both calculation outputs are expanded into tensors of shape $(h, H \cdot L, H \cdot L)$ and are added to the first term in place of the original second term in (19), representing the relative positions along both the height and length dimensions of the data.

Lastly, the layer normalization is placed before the multi-head attention, and a gating layer replaces the residual connection to enhance the stability of Transformers in reinforcement learning [17]. The gating layer uses the structure of a GRU [13] cell as its gating function $g$:

$$\begin{aligned} r &= \sigma(W_r y + U_r x) \\ z &= \sigma(W_z y + U_z x - b_g) \\ \hat{h} &= \tanh\!\left(W_g y + U_g (r \odot x)\right) \\ g(x, y) &= (1 - z) \odot x + z \odot \hat{h} \end{aligned} \qquad (20)$$

where $y$ is the input from the previous layer, $x$ is the residual value, and $\odot$ denotes element-wise multiplication.
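To make the skewing step concrete, the sketch below shows the memory-efficient computation of $XW_h^Q(R_h)^{T}$ for one axis, following the pad-reshape-slice trick of [24], and applies it once per axis for the 2D case. The tensor names, static-shape assumption, and broadcasting details are our assumptions, not the authors' code; the returned logits would replace the second term of (19) before scaling and softmax.

```python
import tensorflow as tf

def skew(qe):
    """Memory-efficient Q.R^T from Q.E^T of shape (batch, L, L) [24]:
    pad one column on the left, reshape so each row shifts by one,
    then drop the spurious first row."""
    l = qe.shape[-1]
    padded = tf.pad(qe, [[0, 0], [0, 0], [1, 0]])         # (batch, L, L+1)
    return tf.reshape(padded, [-1, l + 1, l])[:, 1:, :]   # (batch, L, L)

def relative_logits_2d(q, e_len, e_height):
    """2D relative-position logits added to Q.K^T before the softmax.

    q:        (h, L, H, Dh) per-head queries on the time-by-assets grid
    e_len:    (h, L, Dh) relative embedding for the time axis
    e_height: (h, H, Dh) relative embedding for the assets axis
    returns:  (h, L*H, L*H); assumes statically known shapes
    """
    h, L, H, Dh = q.shape

    # assets axis: logits depend on the asset offset; broadcast over key time
    rel_h = tf.einsum('hlxd,hyd->hlxy', q, e_height)       # (h, L, H, H)
    rel_h = skew(tf.reshape(rel_h, [h * L, H, H]))
    rel_h = tf.reshape(rel_h, [h, L, H, 1, H])
    rel_h = tf.broadcast_to(rel_h, [h, L, H, L, H])

    # time axis: permute assets forward, skew, broadcast over key asset
    q_t = tf.transpose(q, [0, 2, 1, 3])                    # (h, H, L, Dh)
    rel_l = tf.einsum('hxld,hmd->hxlm', q_t, e_len)        # (h, H, L, L)
    rel_l = skew(tf.reshape(rel_l, [h * H, L, L]))
    rel_l = tf.reshape(rel_l, [h, H, L, L])
    rel_l = tf.transpose(rel_l, [0, 2, 1, 3])[:, :, :, :, None]
    rel_l = tf.broadcast_to(rel_l, [h, L, H, L, H])

    return tf.reshape(rel_h + rel_l, [h, L * H, L * H])
```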
IV. EXPERIMENTS

A. Experimental Setup

The model is trained and evaluated with assets of nine Dow Jones companies representing each sector: industrials (MMM), financials (JPM), consumer services (PG), technology (AAPL), health care (UNH), consumer goods (WMT), oil & gas (XOM), basic materials (DD), and telecommunications (VZ). The OHLC prices and trading volume data of the assets are collected from Yahoo Finance. The data for the 18 years from 2000 to 2017 are used for training, and the data from 2018 to April 2020 are used for evaluation. Each daily observation set consists of historical data for the most recent 50 days, and all the data is log-differenced for the stationarity of the time series.

The maximum length of a single episode is set at 50 days, and the initial investment at 100,000 USD. Two separate Adam optimizers with mini-batch size 32 are used to train the actor and critic. The learning rate of both the actor and critic is 1e-4, and the update rate $\tau$ for the target actor and target critic is 0.15. The discount factor $\gamma$ is 0.9, the HMemory sampling rate $p$ is 0.2, and the number of threads used for parallel training is 5. The action space noise parameters $\theta$, $\mu$, and $\sigma$ are 0.13, 0, and 0.2, respectively. For the Transformer, three encoder layers, eight heads, 128-dimensional vectors for the attention hidden layers, and 512-dimensional vectors for the feed-forward network layers are used. All programs were implemented with Python 3 and TensorFlow 2 using Google Colab.

As baseline models, two traditional portfolio models implementing MPT (Markowitz's Modern Portfolio Theory) and the Uniform Constant Rebalanced Portfolio (UCRP) strategy [26] are employed. For the ablation study, a simple Deep Deterministic Policy Gradient (DDPG), DDPG with Transformer (DDPG_TF), DDPG with 2D Relative-attentional Transformer (DDPG_RP_TF), and DDPG with Gated Transformer (DDPG_GL_TF) are tested under the same conditions. Both the cumulative return and the annualized Sharpe ratio are evaluated to verify the robustness of the models to risk as well as their performance.
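For reference, the two evaluation metrics can be computed from the series of daily portfolio values as below. This is a standard formulation (252 trading days per year, zero risk-free rate), assumed here since the paper does not state its exact annualization convention:

```python
import numpy as np

def cumulative_return(values):
    """Cumulative return (%) from a series of daily portfolio values."""
    return (values[-1] / values[0] - 1.0) * 100.0

def annualized_sharpe(values, periods_per_year=252):
    """Annualized Sharpe ratio of daily log returns, risk-free rate = 0."""
    r = np.diff(np.log(values))
    return np.sqrt(periods_per_year) * r.mean() / r.std()
```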
B. Results

Fig. 4 shows the changes in portfolio values for all the models tested on the asset data over the 28-month evaluation period. The steep drop around March 2020 comes from the outbreak of COVID-19, which had a huge negative impact on the portfolio values. The portfolio value of the DPGRGT model keeps increasing until it meets the outbreak, and despite the drop, the model outperformed all the other models, proving its resilience and high profitability. DDPG and DDPG with Transformer are slightly better than MPT but show poor performance. DDPG with 2D Relative-attentional Transformer is relatively better than DDPG and DDPG with Transformer, but still worse than the UCRP baseline. With the exception of DPGRGT, DDPG with Gated Transformer is the only model more profitable than UCRP, which points to the effectiveness of the gating layer of the Transformer in reinforcement learning.

Fig. 4. The portfolio values of the experiment

Table I shows the cumulative returns and the annualized Sharpe ratios of the models. While DPGRGT and DDPG with Gated Transformer are profitable, three other models lost over 40% of their initial value. The MPT model, which attempts to maximize its Sharpe ratio at every rebalancing opportunity, is the worst, whereas UCRP, which rebalances its shares of assets to a constant investment proportion, is the second best. This result suggests that transaction fees and slippage have a significant influence on rebalancing performance. Under these constraints, the effects of using either the 2D Relative-attentional Transformer or the Gated Transformer alone are also limited. Although the pandemic outbreak undermined the overall performance of the models, DPGRGT, which uses both 2D relative attention and the Gated Transformer, ultimately demonstrated stability and strong performance.
TABLE I. THE PERFORMANCE COMPARISON OF MODELS

Model                  Cumulative Return (%)   Annualized Sharpe Ratio
DPGRGT (Our Model)
DDPG_GL_TF
DDPG_RP_TF                     -5.45                  -0.1343
DDPG_TF                       -41.71                  -0.8191
DDPG                          -42.91                  -1.2194
UCRP
MPT                           -50.07                  -1.5840

V. CONCLUSION AND FUTURE WORK
This paper proposes a portfolio optimization algorithm based on reinforcement learning using 2D Relative-attentional Gated Transformers. To the best of our knowledge, this is the first research that applies Transformers to reinforcement learning for portfolio optimization. The experiment shows that the 2D relative attention and the gating layers improve the performance of Transformers, and that combining them creates a synergy effect and produces the best results for portfolio optimization. Since even highly profitable models cannot be applied to real trades without stability and realistic constraints taken into consideration, risk is accounted for by incorporating the Sortino ratio into the reward function, and the transaction costs are set at a conservative level to ensure a more practical experiment. In a further study, the relations between periods and assets could be analyzed through the multi-head attention weights of the Transformer to make the model more interpretable and trustworthy. Furthermore, experiments utilizing a multi-PU environment with a wide range of asset types could be considered for the generalization of the model.
REFERENCES

[1] H. Markowitz, "Portfolio Selection," The Journal of Finance, vol. 7, no. 1, pp. 77-91, 1952.
[2] S. Obeidat, D. Shapiro, M. Lemay, M. K. MacPherson, and M. Bolic, "Adaptive Portfolio Asset Allocation Optimization with Deep Learning," International Journal on Advances in Intelligent Systems, vol. 11, no. 1&2, pp. 25-34, 2018.
[3] M. M. Solin, A. Alamsyah, B. Rikumahu, and M. A. A. Saputra, "Forecasting Portfolio Optimization using Artificial Neural Network and Genetic Algorithm," in 2019 7th International Conference on Information and Communication Technology (ICoICT), 24-26 July 2019, pp. 1-7, doi: 10.1109/ICoICT.2019.8835344.
[4] H. Yun, M. Lee, Y. S. Kang, and J. Seok, "Portfolio management via two-stage deep learning with a joint cost," Expert Syst. Appl., vol. 143, 2020, doi: 10.1016/j.eswa.2019.113041.
[5] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," Journal of Forecasting, vol. 17, no. 5