Deep Reinforcement Learning for Portfolio Optimization using Latent Feature State Space (LFSS) Module
Kumar Yashaswi

Abstract— Dynamic portfolio optimization is the process of distributing and rebalancing a fund across different financial assets, such as stocks and cryptocurrencies, in consecutive trading periods to maximize accumulated profits or minimize risk over a time horizon. This field has seen huge developments in recent years, owing to increased computational power and growing research in sequential decision making through control theory. Recently, Reinforcement Learning (RL) has become an important tool in the development of sequential and dynamic portfolio optimization theory. In this paper, we design a Deep Reinforcement Learning (DRL) framework as an autonomous portfolio optimization agent, consisting of a Latent Feature State Space (LFSS) Module for filtering and feature extraction of financial data, which is used as the state space of the deep RL model. We develop an extensive RL agent with efficiency and performance advantages over several benchmarks and model-free RL agents used in prior work. The noisy and non-stationary behaviour of daily asset prices in the financial market is addressed through a Kalman Filter. Autoencoders, ZoomSVD and restricted Boltzmann machines were the models used and compared in the module to extract relevant time-series features as state space. We simulate weekly data, with practical constraints and transaction costs, on a portfolio of S&P 500 stocks. We introduce a new benchmark based on the technical indicator Kd-Index and the Mean-Variance Model, as compared to the equal-weighted portfolio used in most of the prior work. The study confirms that the proposed RL portfolio agent, with a state space function in the form of the LFSS module, gives robust results with an attractive performance profile over baseline RL agents and the given benchmarks.

Keywords— Portfolio Optimization; Reinforcement Learning; Deep Learning; Kalman Filter; Autoencoders; ZoomSVD; RBM; Markowitz Model; KD-Index
I. INTRODUCTION

The portfolio optimization/management problem is to optimize the allocation of capital across various financial assets, such as bonds, stocks or derivatives, to optimize a preferred performance metric, e.g. maximizing expected returns or minimizing risk. Dynamic portfolio optimization involves sequential decision making, continuously reallocating funds (a problem with roots in control theory) into assets over consecutive rebalancing periods, based on real-time financial information, to achieve the desired performance. In financial markets, an investor's success heavily relies on maintaining a well-balanced portfolio. Engineering methods like signal processing [1], control theory [2], data mining [3] and advanced machine learning [4,5] are routinely used in financial market applications. Researchers are constantly working to bring the machine learning techniques that have proven so successful in computer vision, NLP or beating humans at chess into the domain of dynamic financial market environments.

(This work was supported by the Department of Mathematics, Indian Institute of Technology Kharagpur. K. Yashaswi, Department of Mathematics, Indian Institute of Technology Kharagpur, [email protected])

Before the advent of machine learning, most portfolio models were based on variations of Modern Portfolio Theory (MPT) by Markowitz [6]. These had the drawbacks of being static and linear in computation. The dynamic methods used for this problem, like dynamic programming and convex optimization, required models with discrete action spaces and were thus not efficient in capturing market information [7,8]. To address this issue we work with a deep reinforcement learning model. Reinforcement Learning (RL) is an area of artificial intelligence which focuses on how software agents take actions in a dynamic environment to maximize a cumulative performance metric or reward [9]. Reinforcement learning is appropriate for dynamical systems requiring optimal control, like robotics [10], self-driving cars [11] and gaming [12], with performance exceeding other models.

With the breakthrough of deep learning, the combination of RL and neural networks (DRL) has enabled algorithms to give optimal results on more complex tasks, overcoming many constraints posed by traditional models. In portfolio optimization, deep reinforcement learning helps in sequentially rebalancing the portfolio throughout the trading period, and has a continuous action space approximated by a neural network framework, which circumvents the problem of discrete action spaces.

RL has been widely used in the financial domain [13,14], e.g. in algorithmic trading and execution algorithms, though it has not been used to the same extent for portfolio optimization. Some major works [15,16,17,18,19] gave state-of-the-art performance on their inception and most modern works have been inspired by this research. While [18,19] considered discrete action spaces using RL, [15,16,17] leveraged deep learning for continuous action spaces. They considered a model-free RL approach which modelled the dynamics of the market through exploration. The methods proposed in [15,16,20,21,17,9] considered constraints like transaction cost and a suitable reward function, though they suffered from a major drawback in state space modelling: they did not take into account the risk and dimensionality issues caused by the volatile, noisy and non-stationary market environment. Asset prices are highly fluctuating and time-varying.
Sudden fluctuations in asset price, due to many factors like market sentiment or internal company conflict, cause prices to deviate from their true value based on the actual fundamentals of the asset. Many other factors, like economic conditions and correlation to the market, have an adverse effect on prices but, due to limited data and model complexity, cannot be incorporated in the modelling phase.

In this paper, we propose a model-free deep RL agent with a modified state space function over previous works, which filters the fluctuations in asset prices and derives a compressed time-series representation containing the relevant information for modelling. We introduce the Latent Feature State Space (LFSS) Module as the state space of our deep RL architecture. This gives a compressed latent state space representation of the filtered time series to our trading agent. It consists of 2 units:
• Filtering Unit
• Latent Feature Extractor Unit
The Filtering Unit first takes the raw, unfiltered signal of asset prices and outputs a filtered signal using a Kalman Filter, as described in [22]. After the filtered signal is obtained, the Latent Feature Extractor Unit extracts a lower-dimensional latent feature space using 3 models proposed by us:
• Autoencoders
• ZoomSVD
• Restricted Boltzmann Machine
All 3 models are applied independently and their performance is compared to judge which is the most suitable for obtaining a compressed latent feature space of the price signal. Autoencoders and restricted Boltzmann machines are self-supervised deep learning architectures widely used in dimensionality reduction [23,24] to learn a lower-dimensional representation of an input space by training the network to ignore input noise. SVD (Singular Value Decomposition) is a matrix factorization procedure used widely in machine learning and signal processing to obtain a linear compressed representation of a time series. We employ the ZoomSVD procedure, a fast and memory-efficient method for extracting the SVD over an arbitrary query time range, proposed recently in [25].

Asset price data is usually processed in OHLC (open, high, low, close) form, and RL state spaces are built upon the same. In the deep RL models used in [15,16,20,21,9], returns or price ratios (e.g. high/close) are used in unprocessed form as the state space, with a backward time step, under the hypothesis that the portfolio weights are determined by the assets' previous prices. Although most RL agents applied in research use an OHLC state space, there are still several challenges that the LFSS Module solves:
• Financial data is considered very noisy, with distorted observations due to turbulence caused by daily published information, along with sudden market behaviours that change asset prices. These intraday noises are not representative of the actual asset price, and the real underlying state based on the asset's fundamentals is corrupted by noise. Filtering techniques help approximate the real state from the noisy state and improve the quality of the data.
• The OHLC data space is highly dimensional and, as such, models that try to extract relevant patterns from raw price data can suffer from the so-called curse of dimensionality [26]. We explore the potential of the LFSS module to extract relevant features from dynamic asset price information in a lower-dimensional feature space.
This latent space can hypothetically incorporate features such as economic conditions or asset fundamentals.

The rest of the framework is roughly similar to [1] in terms of action space, reward function, etc. We use the Deep Deterministic Policy Gradient (DDPG) algorithm [27] as our RL model, with a convolutional neural network (CNN) as the deep learning architecture for the prediction framework.

Our dataset was stocks constituting the S&P 500, with data from 2007-2015 used for training and 2016-2019 used for testing. We design a new benchmark based on the technical indicator Stochastic Kd-Index and the Mean-Variance model, which is a variation of the equal-weighted portfolio benchmark and performs much better than the same. This benchmark is based on the works in [28].

To the best of our knowledge, this is the first work that leverages the use of latent features as a state space and integrates it with an already known deep RL framework in the portfolio optimization domain. The main aim of our work is to investigate the effectiveness of the Latent Feature State Space (LFSS) module added to our RL agent on the process of portfolio optimization, to obtain improved results over existing deep RL frameworks.
A. Related Work
With the increased complexity of deep RL models, their application in portfolio optimization has widened over the past years. As discussed in the last section, [15,16,17,18] were among the breakthrough papers in this domain. Our work borrows many methods from [15,16], which had the advantage of using a continuous action space. They used state-of-the-art RL algorithms like DQN, DDPG, PPO, etc., on various markets. [20] compares a model-based RL agent with a model-free deep RL agent, showing the dominance and robustness of the model-free approach. [15] showed that a CNN architecture performed much better than LSTM networks for the task of dynamic optimization of cryptocurrency portfolios, even though LSTM is considered more suitable for time-series data. [9] used a modified reward function that prevented large portfolio weights in a single asset.

[26,29] were the first to use the concept of an added module to improve the performance of already existing architectures. [29] added a State Augmented Reinforcement Learning (SARL) module to augment the asset information, which leveraged the power of Natural Language Processing (NLP) for price movement prediction. They used financial news data to represent the external information to be encoded and augmented to the final state. [26] used a combination of three modules: an infused prediction module (IPM), a generative adversarial data augmentation module (DAM) and a behaviour cloning module (BCM). IPM forecasts the future price movements of each asset using historical data. DAM addresses the limited availability of large financial datasets by increasing the dataset size using GANs, and BCM uses a greedy strategy to reduce volatility in changes of portfolio weights, hence reducing transaction cost [26].

II. BACKGROUND
A. Portfolio Optimization
A portfolio is a collection of multiple financial assets, and is characterized by its:
• Components: the n assets that comprise it; in our case n = 15.
• Portfolio vector $w_t$: the i-th index is the proportion of funds allocated to the i-th asset, $w_t = [w_{1,t}, w_{2,t}, \ldots, w_{n,t}] \in \mathbb{R}^n$. Having $w_{i,t} < 0$ for any i implies short selling is allowed.

We add a risk-free asset to our portfolio for the case where all the wealth is to be allocated to a risk-free asset. The weight vector is modified to

$$w_t = [w_{0,t}, w_{1,t}, w_{2,t}, \ldots, w_{n,t}] \quad (1)$$

The closing price of asset i at time t is denoted $v_{i,t}$. The price vector $v_t = [v_{0,t}, v_{1,t}, v_{2,t}, \ldots, v_{n,t}]$ consists of the market prices of the n assets, taken here as the closing price of the day, together with the constant price of the risk-free asset $v_{0,t}$. Similarly, $v^{hi}_t$ and $v^{lo}_t$ denote the highest-price and lowest-price vectors at time step t, respectively. Asset prices change over time, so we denote by $y_t$ the relative price vector

$$y_t = \frac{v_{t+1}}{v_t} = \left(1, \frac{v_{1,t+1}}{v_{1,t}}, \frac{v_{2,t+1}}{v_{2,t}}, \ldots, \frac{v_{n,t+1}}{v_{n,t}}\right)^T$$

To reduce risk, portfolios with many assets are preferable over holding single assets. We assume readers have knowledge of important portfolio concepts, namely portfolio value, asset returns, portfolio variance, Sharpe ratio, the mean-variance model, etc., as we will not go into detail on these topics. They are well described in the thesis reports [20,9].

B. Deep RL Model as a Markov Decision Process
We briefly review the concepts of DRL and introduce the mathematics of the RL agent. Reinforcement learning is a self-learning method in which the agent interacts with the environment with no defined model and little prior information, learning from the environment by exploration while, at the same time, optimally updating its strategy to maximize a performance metric. RL consists of an agent that receives the controlled state of the system and a reward associated with the last state transition. It then calculates an action, which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated, as described in Fig. 1. The goal is to learn the set of actions taken in each state (the policy) so as to maximize the cumulative reward in a dynamic environment, which in our case is the market. RL is modelled on the concepts of a Markov Decision Process (MDP), a stochastic model for decision making in discrete time [9], which works on the principle of Markov chains.
Fig. 1. Reinforcement Learning Setting [30]
An MDP is defined as a 5-tuple $(S, A, P, r, \gamma)$, described below, with t being the time index over the portfolio period:
• $S = \bigcup_t S_t$ is a collection of finite-dimensional continuous state spaces of the RL model.
• $A = \bigcup_t A_t$ is the finite-dimensional continuous action space, as we consider the market environment an infinite Markov Decision Process (IMDP).
• $P : S \times A \times S \to [0, 1]$ is the state transition probability function.
• $r : S \times A \to \mathbb{R}$ is the instant (or expected instant) reward obtained at each time index for taking an action in a particular state.
• $\gamma \in (0, 1]$ is a discount factor. When $\gamma = 1$, a reward maintains its full value at each future time index, independent of the relative time from the present time step. As $\gamma$ decreases, the effect of future rewards declines exponentially in $\gamma$.

C. Policy Function
The policy function ($\pi$) indicates the behaviour of an RL agent in a given state. $\pi$ is a "state to action" mapping function, $\pi : S \to A$. A policy is either deterministic, $A_{t+1} = \pi(S_t)$, or stochastic, $\pi(A \mid S)$. We take the mapping to be deterministic. Since we are dealing with an infinite Markov decision process (IMDP), the policy is modelled as a parameterized function with respect to a parameter $\theta$, and the policy mapping becomes $A_t = \pi_\theta(S_t)$. The policy function parameterized by $\theta$ is solved for using DRL models like DDPG, DQN, etc., which derive from the concepts of Q-learning described in the next subsection.

D. Q-Learning
Q-learning is an approach in reinforcement learning which helps in learning the optimal policy function using the concept of a Q-value function. The Q-value function is defined as the expected total reward when executing action A in state S and henceforth following a policy $\pi$ for future time steps:

$$Q^\pi(S_t, A_t) = E_\pi(r_t \mid S_t, A_t) \quad (2)$$

Since we are using a deterministic policy approach, we can express the Q-function recursively using the Bellman equation:

$$Q^\pi(S_t, A_t) = E_\pi(R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t, A_t) \quad (3)$$

The optimal policy for $Q^\pi$ is the one which gives the maximum Q-function over all policies:

$$\pi^*(S) = \arg\max_\pi Q^\pi(S, A) \quad (4)$$

To speed up convergence to the optimal policy, the concept of a replay buffer is used, and a target network is used to move the relatively unstable problem of learning the Q-function closer to the case of supervised learning [16].

E. Deep Deterministic Policy Gradient (DDPG)
DDPG is a model-free algorithm combining Deep Q-Network (DQN) with the Deterministic Policy Gradient (DPG). DQN stabilizes Q-function learning through the replay buffer and the frozen target network [9]. In the case of DPG, the deterministic target policy function is constructed by a deep learning model and the optimal policy is reached using a gradient-based algorithm. The action is deterministically output by the policy network from the given state. DDPG works on an actor-critic framework, where the actor network outputs a continuous action, and the actor's performance is improved according to the critic framework, which consists of an appropriate objective function.

The return for the agent at time step t is defined as $r_t = \sum_{k=t}^{\infty} \gamma^{k-t} r(S_k, A_k)$, where r is the reward function, $\gamma$ is the discount factor, and the return $r_t$ is the total discounted reward from time step t to the final time period [15]. The performance metric of $\pi_\theta$ for the time interval $[0, t_{final}]$ is defined as the corresponding reward function:

$$J_{[0,t_{final}]}(\pi_\theta) = E(r, S \mid \pi_\theta) = \sum_{t=1}^{T} \gamma^t r(S_t, \pi_\theta(S_t)) \quad (5)$$

The network is assigned random weights initially. With the gradient ascent update (Eq. 6), the weights are continually updated to give the best expected performance metric:

$$\theta \leftarrow \theta + \lambda \nabla_\theta J_{[0,t_{final}]}(\pi_\theta) \quad (6)$$

where $\lambda$ is the learning rate. For the portfolio optimization task, we modify our reward function to be similar to that described in [15]; we describe it in Section IV. For the training methodology we also borrow from [15], using the concepts of Portfolio Vector Memory (PVM) and stochastic batch optimization. We have only worked with, and briefly described, the DDPG algorithm, but other algorithms like Proximal Policy Optimization (PPO) are also widely used.
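To make the actor-critic update concrete, the following minimal PyTorch sketch implements one DDPG step consistent with Eq. 3 and Eq. 6. The network sizes, learning rate and dimensions are illustrative only, and the replay buffer and frozen target networks discussed above are omitted for brevity; this is a sketch of the update rule under those assumptions, not the exact networks used in this work.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; not the architecture used in this paper.
state_dim, action_dim, gamma, lr = 32, 16, 0.99, 1e-3

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=lr)
critic_opt = torch.optim.Adam(critic.parameters(), lr=lr)

def ddpg_step(s, a, r, s_next):
    # Critic: regress Q(s, a) toward the one-step Bellman target (Eq. 3).
    # A full DDPG implementation would use frozen target networks here.
    with torch.no_grad():
        a_next = actor(s_next)
        target = r + gamma * critic(torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on the critic's estimate of the actor's
    # chosen action, i.e. the parameter update of Eq. 6.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```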
III. LATENT FEATURE STATE SPACE MODULE

Price and returns are usually derived from Open, High, Low, Close (OHLC) data. These are unprocessed and contain a high level of noise. Using an OHLC-based state space [15,16,20,21,9] also leads to high dimensionality and the use of extra, irrelevant data. The LFSS module helps in tackling such problems by using a Filtering Unit to reduce noise in the data, and a Latent Feature Extractor Unit which obtains a set of compressed latent features that represent the asset prices in a more suitable and relevant form. The Filtering Unit uses a Kalman Filter to obtain a linear hidden state of the returns, with the noise considered Gaussian. The Latent Feature Extractor Unit compares 3 models, autoencoders, ZoomSVD and RBM, each used independently with the Filtering Unit to extract the latent feature space which forms the state space of our deep RL architecture. The LFSS module structure can be visualized in figure 2.

Fig. 2. LFSS Module Structure: (i) The stock price series is passed through a filtering layer which uses a Kalman Filter to clean the signal of any Gaussian noise. (ii) The filtered signal is passed through a Latent Feature Extractor Unit, which consists of one of 3 feature extractor methods, a) Autoencoders, b) ZoomSVD, c) RBM, each applied individually and independently to obtain a lower-dimensional signal
A. Kalman Filter
The Kalman filter is an algorithm built on the framework of recursive Bayesian estimation that, given a series of temporal measurements containing statistical noise, outputs estimates of hidden underlying states that tend to be a more accurate representation of the underlying process than those based on the observed measurements alone [31]. The theory of Bayesian filtering is based on the hidden Markov model. The Kalman filter assumes linear dynamics and Gaussian errors (Eq. 7, 8). It belongs to the class of Linear-Quadratic Gaussian (LQG) estimators, which have closed-form solutions, as compared to other filtering methods with approximate solutions (e.g. particle filtering) [32]. We used the Kalman filter to clean our return time series of any Gaussian noise present due to market dynamics. The advantages of the Kalman filter are:
• It gives simple but effective state estimates [18] and is computationally inexpensive.
• Markets are highly fluctuating and non-stationary, but the Kalman filter gives good results despite being an LQG estimator.
• It is dynamic and works sequentially in real time, making predictions using only the current measurement and the previously estimated state, under predefined model dynamics.

Though markets may not always be modelled in a linear fashion and noises may not be Gaussian, the Kalman filter is widely used and various experiments have proven it effective for state estimation [31,22]. The mathematical formulation of the filter is given in Eq. 7-15 [22]:

$$x_{t+1} = F_t x_t + w_t, \quad w_t \sim N(0, Q) \quad (7)$$
$$y_t = G_t x_t + v_t, \quad v_t \sim N(0, R) \quad (8)$$

where $x_t$ and $y_t$ are the state estimate and the measurement at time index t, respectively. $F_t$ is the state transition matrix and $G_t$ is the measurement function. The white noises $w_t$ (state noise) and $v_t$ (process noise) are independent, normally distributed with zero mean and constant covariance matrices Q and R, respectively.

$$x^{ML}_{t+1|t+1} = F_t x^{ML}_{t|t} + K_{t+1} \varepsilon_{t+1} \quad (9)$$
$$\varepsilon_{t+1} = y_{t+1} - G_t x^{ML}_{t+1|t} \quad (10)$$
$$K_{t+1} = P_{t+1|t} G_t^T \left[ G_t P_{t+1|t} G_t^T + R \right]^{-1} \quad (11)$$
$$P_{t+1|t} = F_t P_{t|t} F_t^T + Q \quad (12)$$
$$P_{t+1|t+1} = [I - K_{t+1} G_t] P_{t+1|t} \quad (13)$$
$$x^{ML}_{0|0} = \mu_0 \quad (14)$$
$$P_{0|0} = P_0 \quad (15)$$

Eq. 9-15 give the closed-form solution of the Kalman filter using maximum likelihood (ML) estimation. [32] gives a detailed derivation of the closed-form solution using ML and maximum a posteriori (MAP) estimates. $\varepsilon_t$ is the error estimate defined in Eq. 10 and $K_t$ is the Kalman gain defined in Eq. 11. Eq. 14-15 give the initial state assumptions of the process. In the experiments section, we apply the Kalman filter to our asset price signal and input the filtered signal to the Latent Feature Extractor Unit of the LFSS module.
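As an illustration of the recursion in Eq. 9-15, the following NumPy sketch applies a scalar Kalman filter to a single return series, assuming a random-walk state ($F_t = G_t = 1$); the noise variances q and r are assumed values for illustration, not parameters fitted in our experiments.

```python
import numpy as np

def kalman_filter(y, q=1e-5, r=1e-2, x0=0.0, p0=1.0):
    """Scalar Kalman filter (Eq. 9-15) with F = G = 1 (random-walk state).

    y    : observed (noisy) return series
    q, r : state and measurement noise variances (illustrative values)
    """
    x, p = x0, p0                          # Eq. 14-15: initial state
    filtered = np.empty(len(y), dtype=float)
    for t, obs in enumerate(y):
        p_pred = p + q                     # Eq. 12: predicted covariance
        k = p_pred / (p_pred + r)          # Eq. 11: Kalman gain
        x = x + k * (obs - x)              # Eq. 9-10: innovation update
        p = (1.0 - k) * p_pred             # Eq. 13: posterior covariance
        filtered[t] = x
    return filtered

# Example: denoise one asset's daily returns before feature extraction.
noisy_returns = np.random.normal(0.0005, 0.02, size=250)
smooth_returns = kalman_filter(noisy_returns)
```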
B. Autoencoders

The autoencoder is a state-of-the-art model for non-linear feature extraction of time series, leveraging deep learning. An autoencoder is a neural network architecture in which the output data is the same as the input data during training; it thus learns a compact representation of the input, with no need for labels [23]. Since it requires no human intervention such as data labelling, it belongs to the domain of self-supervised learning [33]. The architecture of an autoencoder uses two parts in this transformation [34]:
• Encoder, which transforms the high-dimensional input into a lower-dimensional space while keeping the most important features:
$$\phi(X) \to E \quad (16)$$
• Decoder, which tries to reconstruct the original input from the output of the encoder:
$$\Omega(E) \to X' \quad (17)$$

The output of the encoder is the latent-space representation, which is of interest to us. It is a compressed form of the input data in which the most influential and significant features are kept. The key objective of an autoencoder is to not directly replicate the input into the output [34], i.e. not to learn an identity function. The output is the original input with a certain information loss. The encoder should map the input to a lower dimension than the input, as a larger dimension may lead to learning an identity function. The encoder-decoder weights are learnt using backpropagation, similar to the supervised learning approach but with the target output equal to the input. The mean squared norm loss is used as the loss function L (Eq. 18, 19):

$$\phi, \Omega = \arg\min_{\phi, \Omega} \|X - (\Omega \circ \phi)X\|^2 \quad (18)$$
$$L = \|X - X'\|^2 \quad (19)$$

For our methodology we compare CNN autoencoders (fig. 3), LSTM autoencoders and DNN autoencoders, and select the most suitable.

Fig. 3. Convolutional Autoencoder Explained: (i) First the encoder, consisting of convolution and pooling layers, compresses the input time series to a lower dimension. (ii) The decoder reconstructs the input space using upsampling and convolution layers [35]
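The sketch below shows a minimal DNN autoencoder in PyTorch trained with the reconstruction loss of Eq. 18-19 on 60-step return windows; the layer widths and the latent size of 10 are assumptions for illustration, not the topologies compared in Section V.

```python
import torch
import torch.nn as nn

window = 60          # input: one 60-day return window
latent = 10          # assumed latent dimension for illustration

encoder = nn.Sequential(nn.Linear(window, 32), nn.ReLU(), nn.Linear(32, latent))
decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, window))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

def train_step(x):
    """One reconstruction step minimizing ||X - (Omega o phi)X||^2 (Eq. 18-19)."""
    z = encoder(x)                 # phi(X) -> E : compressed latent features
    x_hat = decoder(z)             # Omega(E) -> X': reconstruction
    loss = nn.functional.mse_loss(x_hat, x)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.randn(128, window)       # batch of normalized return windows
loss = train_step(x)
# After training, encoder(x) is the latent state-space representation.
```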
C. ZoomSVD: Fast and Memory-Efficient Method for Extracting the SVD of a Time Series in an Arbitrary Time Range
Based on the works in [25], this method proposes an efficient way of calculating the Singular Value Decomposition (SVD) of a time series in any particular range. The SVD of a matrix is defined in Eq. 20:

$$A = U \Sigma V^T \quad (20)$$

If A has dimensions $M \times N$ (for the time-series case, M is the length of the time period and N is the number of assets), then for compact SVD (the form of SVD used here) U is $M \times R$ and V is $N \times R$, orthogonal matrices such that $U^T U = I_{R \times R}$ and $V^T V = I_{R \times R}$, where $R \le \min[M, N]$ is the rank of A. U and V are called the left-singular and right-singular vectors of A. $\Sigma = \mathrm{diag}(d_1, d_2, \ldots, d_R)$ is a square diagonal matrix of size $R \times R$ with $d_1 \ge d_2 \ge \ldots \ge d_R \ge 0$; the $d_i$ are called the singular values of A. For time-series data, $A = [A_1; A_2; A_3; \ldots; A_N]$, where $A_i$ is the column matrix of time-series values within a range and A is the vertical concatenation of the $A_i$. The SVD of a matrix can be calculated by many numerical methods known in linear algebra; some are described in [36].

SVD is a widely used numerical method to discover hidden/latent factors in multiple time series [25], and has many other applications including principal component analysis and signal processing. While autoencoders extract non-linear features from high-dimensional time series, SVD is much simpler and finds linear patterns in the time series, thus leading to less overfitting in many cases. Zoom-SVD incrementally and sequentially compresses the time-series matrix block by block to reduce the space cost in the storage phase and, for a given time-range query, computes the singular value decomposition in the query phase by methodically stitching stored SVD results [25]. ZoomSVD is divided into 2 phases:
• Storage phase of Zoom-SVD
  – Block matrix
  – Incremental SVD
• Query phase of Zoom-SVD
  – Partial-SVD
  – Stitched-SVD

Figure 4 explains the original structure proposed in [25]. In the storage phase, the time-series matrix A is divided into blocks of a size b decided beforehand. The SVD of each block is computed sequentially and stored, discarding the original matrix A. In the query phase, an input query range $[t_i, t_f]$ is passed and, using the block structure, the blocks containing the range are considered. Among the selected blocks, the initial and final blocks contain partial time ranges. Partial-SVD is used to compute the SVD for the initial and final blocks, which is merged with the SVD of the complete blocks using Stitched-SVD to give the final SVD for the query range $[t_i, t_f]$. Further explanation of the mathematical formulation of the storage and query phases is given in [25].

Fig. 4. ZoomSVD representation from [25]: (i) In the storage phase the original asset-price series matrix is divided into blocks and compressed to lower-dimensional SVD form using incremental SVD. (ii) In the query phase, the SVD is reconstructed from the compressed block structure for a particular time query $[t_i, t_f]$ using Partial-SVD and Stitched-SVD

ZoomSVD solves the problem of expensive computational cost and large storage space in some cases. In comparison to traditional SVD methods, ZoomSVD computes time-range queries substantially faster and requires substantially less space than other methods [25]. For the portfolio optimization problem, we use $\Sigma V^T$ as our state space. The right-singular vectors represent the characteristics of the time-series matrix A, and the singular values represent the strength of the corresponding right-singular vectors [37]. U is not used because of its high dimensionality, due to the large time horizon required by the RL agent.
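ZoomSVD's efficiency comes from the block storage and stitching machinery of [25], which we do not reproduce here; the NumPy sketch below only shows the compact SVD of a query window and the $\Sigma V^T$ features used as state space, so it matches the output of a query, not its speed.

```python
import numpy as np

def svd_state_features(prices, t_start, t_end):
    """Compact SVD of a time-range query, A = U Sigma V^T (Eq. 20).

    Plain NumPy SVD for clarity; ZoomSVD [25] computes the same factors
    from pre-stored per-block SVDs without touching the raw matrix.
    """
    A = prices[t_start:t_end]                  # M x N: time window x assets
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.diag(s) @ Vt                     # Sigma V^T: R x N state features

# Toy random-walk price data for 15 assets over 500 steps.
prices = np.cumsum(np.random.normal(0, 1, size=(500, 15)), axis=0)
features = svd_state_features(prices, 240, 300)   # query range [t_i, t_f]
```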
D. Restricted Boltzmann Machine: Probabilistic Representation of the Training Data

The restricted Boltzmann machine is based on the works of [38] and has similar utility to the above-mentioned methods in fields like collaborative filtering, dimensionality reduction, etc. A restricted Boltzmann machine (RBM) is a form of artificial neural network that is trained to learn a probability distribution over its set of inputs and is made up of two layers: the visible and the hidden layer. The visible layer represents the input data, and the hidden layer tries to learn a feature space from the visible layer, aiming to represent a probabilistic distribution of the data [39]. It is an energy-based model, implying that the probability distribution over the variables v (visible layer) and h (hidden layer) is defined by an energy function, given in vector form in Eq. 21:

$$E(v, h) = -h^T W v - a^T v - b^T h \quad (21)$$

where W is the weight matrix, and a and b are the biases added.

RBMs have mainly been used for binary data, but some works [39,40] present new variations for dealing with continuous data using a modification of the RBM called the Gaussian-Bernoulli RBM (GBRBM). We further add an extension of the RBM, the conditional RBM (cRBM) [24,41]. The cRBM has autoregressive weights that model short-term temporal dependencies, and hidden-layer units that model the long-term temporal structure in time-series data [24]. A cRBM is similar to an RBM except that the bias vectors for both layers are dynamic and depend on previous visible layers [24]. We will further refer to the RBM as a cRBM in this paper. Restricted Boltzmann machines use the contrastive divergence (CD) algorithm to train the network to maximize the input probability. We refer to [39,24] for further explanation of the mathematics of RBMs and cRBMs.

Fig. 5. RBM structure consisting of 3 visible units and 2 hidden units with weights w and biases a, b, forming a bipartite graph structure [39]
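For illustration, the following NumPy sketch performs one contrastive-divergence (CD-1) update for a plain Bernoulli RBM; the Gaussian visible units and dynamic autoregressive biases of the GBRBM/cRBM used in this work are omitted, so the sketch shows only the core training mechanics under those simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 60, 30, 0.01       # 60-day window -> 30 hidden units
W = rng.normal(0, 0.01, (n_hidden, n_visible))
a = np.zeros(n_visible)                      # visible bias (Eq. 21)
b = np.zeros(n_hidden)                       # hidden bias (Eq. 21)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM.

    The paper's cRBM additionally uses Gaussian visible units and
    autoregressive (dynamic) biases; this keeps only the CD mechanics.
    """
    global W, a, b
    ph0 = sigmoid(W @ v0 + b)                    # P(h = 1 | v0)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    v1 = sigmoid(W.T @ h0 + a)                   # mean-field reconstruction
    ph1 = sigmoid(W @ v1 + b)
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)

# After training, sigmoid(W @ v + b) is the compressed hidden feature vector.
```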
IV. PROBLEM FORMULATION

In Section II, we described the RL agent formulation in terms of a Markov decision process defined using the 5-tuple $(S, A, P, r, \gamma)$. In this section we describe each element of the tuple with respect to our portfolio optimization problem, along with the dataset used and the network architecture used as our policy network.
In a market environment, some assumptions are considered close to reality if the trading volume of the asset in a market is high. Since we are dealing with stocks from the S&P 500, the trading volume is significantly high (high liquidity), allowing 2 major assumptions:
• Zero slippage: due to high liquidity, each trade can be carried out immediately at the last price when an order is placed.
• Zero market impact: the wealth invested by our agent is insignificant and does not influence the market.
B. Dataset
The S&P 500 is an American stock market index based on the market capitalization of 500 large companies. There are about 500 tradable stocks constituting the S&P 500 index; however, we will only be using a subset of 15 randomly selected stocks for our portfolio. The stocks were Apple, Amex, Citibank, Gilead, Berkshire Hathaway, Honeywell, Intuit, JPMC, Nike, NVIDIA, Oracle, Procter & Gamble, Walmart, Exxon Mobil and United Airlines Holdings. We took stocks from various sectors to diversify our portfolio. We took data from January 2007 to December 2019, which made up a dataset with a total size of 3271 rows. 70% of the data (2266 prices, 2007-2015) was taken as the training set and 30% as the testing set (1005 prices, 2016-2019). This large span of time subjects our RL agent to different types of market scenarios, so it learns optimal strategies in both bearish and bullish markets. For our risk-free asset, to invest in when investing in stocks is risky, we use a bond with a constant return of 0.05% annually.
To solve the dynamic asset allocation task, the trading agent must, at every time step t, be able to regulate the portfolio weights $w_t$. The action $a_t$ at time t is the portfolio vector $w_t$ at time t:

$$a_t \equiv w_t = [w_{0,t}, w_{1,t}, w_{2,t}, \ldots, w_{n,t}] \quad (22)$$

The action space A is thus a subset of the continuous real n-dimensional space $\mathbb{R}^n$:

$$a_t \in A \subseteq \mathbb{R}^n, \quad \sum_{i=1}^{n} a_{i,t} = 1, \quad \forall t \ge 0 \quad (23)$$

The action space is continuous (infinite), and we therefore consider the stock market an infinite decision-making process MDP (IMDP) [9].
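In practice the policy network emits one unconstrained score per asset, and a softmax output layer (Section IV-H) maps these scores to a valid action satisfying Eq. 23; a minimal sketch of that mapping:

```python
import numpy as np

def to_portfolio_vector(scores):
    """Map raw network scores to a valid action: weights sum to 1 (Eq. 23).

    A softmax also keeps every weight positive, i.e. a long-only portfolio.
    """
    z = np.exp(scores - scores.max())        # shift for numerical stability
    return z / z.sum()

scores = np.random.normal(size=16)           # 15 stocks + 1 risk-free asset
a_t = to_portfolio_vector(scores)            # a_t == w_t, sums to 1.0
```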
D. OHLC-Based State Space

Before we describe the novel state space structure used by our RL agent, we describe the state space based on OHLC data used in previous works. In later sections, we showcase the superior performance of the RL agent based on our novel state space. Using the data in OHLC form, [15] considered closing, high and low prices to form the input tensor for the state space of the RL agent. The dimension of the input is (n, m, 3) (some works may define the dimensions in (3, m, n) form), where n is the number of assets, m is the window size and 3 denotes the number of price forms. All the prices are normalized by the closing price $v_t$ at time t. The individual price tensors can be written as:

$$V_t = [v_{t-m+1} \oslash v_t \mid v_{t-m+2} \oslash v_t \mid v_{t-m+3} \oslash v_t \mid \ldots \mid v_t \oslash v_t] \quad (24)$$
$$V^{hi}_t = [v^{hi}_{t-m+1} \oslash v_t \mid v^{hi}_{t-m+2} \oslash v_t \mid v^{hi}_{t-m+3} \oslash v_t \mid \ldots \mid v^{hi}_t \oslash v_t] \quad (25)$$
$$V^{lo}_t = [v^{lo}_{t-m+1} \oslash v_t \mid v^{lo}_{t-m+2} \oslash v_t \mid v^{lo}_{t-m+3} \oslash v_t \mid \ldots \mid v^{lo}_t \oslash v_t] \quad (26)$$

where $v_t$, $v^{hi}_t$, $v^{lo}_t$ represent the asset close, high and low prices at time t, $\oslash$ represents the element-wise division operator and $\mid$ represents horizontal concatenation. $V_t$ is defined over the individual assets as

$$V_t = [V_{1,t}; V_{2,t}; V_{3,t}; \ldots; V_{n,t}] \quad (27)$$

where $V_{i,t}$ is the close price vector for asset i and ; represents vertical concatenation. $V^{hi}_t$ and $V^{lo}_t$ are defined similarly. $X_t$ is the final state space, formed by stacking the layers $V_t$, $V^{hi}_t$, $V^{lo}_t$, as seen in figure 6. In order to incorporate constraints like transaction cost, the state space $X_t$ at time t is combined with the previous time step's portfolio weights $w_{t-1}$ to form the final state space $S_t = (X_t, w_{t-1})$. The addition of $w_{t-1}$ to the state space has been widely used in other works [15,16,20,21,9] to model the transaction cost of the portfolio.

Fig. 6. OHLC state space in 3-dimensional input structure [15]
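A minimal NumPy sketch of how the (n, m, 3) tensor of Eq. 24-27 can be assembled from close, high and low price arrays; the toy data and window choice are illustrative only.

```python
import numpy as np

def ohlc_state(close, high, low, t, m=60):
    """Build the (n, m, 3) price tensor of Eq. 24-27 at time t.

    close/high/low: (T, n) arrays of per-asset prices; every window is
    normalized element-wise by the latest closing price v_t.
    """
    v_t = close[t]                                    # (n,) latest close
    win = slice(t - m + 1, t + 1)
    V = (close[win] / v_t).T                          # (n, m) close ratios
    V_hi = (high[win] / v_t).T                        # (n, m) high ratios
    V_lo = (low[win] / v_t).T                         # (n, m) low ratios
    return np.stack([V, V_hi, V_lo], axis=-1)         # (n, m, 3)

T, n = 300, 15
close = np.cumsum(np.random.normal(0, 1, (T, n)), axis=0) + 100  # toy prices
X_t = ohlc_state(close, close * 1.01, close * 0.99, t=200)
```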
E. LFSS Module Based State Space

The LFSS module based state space consists of two asset features that are processed individually through different neural networks and merged after processing. The state space is based on time-series features of the asset returns for the 15 assets in our portfolio. The asset features extracted were:
• Covariance matrix: the symmetric covariance matrix for the 15 assets. This part of the state space signifies the risk/volatility of each asset and its dependency on the others. 3 matrices, for close, high and low prices, were computed, giving a 15 × 15 × 3 input matrix.
• LFSS module features: we described the Filtering Unit used to clean the time series, followed by the 3 methods, autoencoder, ZoomSVD and RBM, in the Latent Feature Extractor Unit used to extract latent features. Close, high and low prices were individually processed by the LFSS module to form a 3-dimensional structure similar to figure 6, but in a lower-dimensional feature space.

Similar to the technique mentioned in the last subsection, the previous time step's portfolio weights $w_{t-1}$ are added to form the final state space $S_t = (X_t, w_{t-1})$, where $X_t$ is the merged network result.
A key challenge in portfolio optimization is controlling transaction costs (brokers' commissions, tax, spreads, etc.), a practical constraint that must be considered to avoid bias in estimating returns. Whenever the portfolio is rebalanced, the corresponding transaction cost is deducted from the portfolio return. Many trading strategies work on the basic assumption of neglecting transaction costs; these are not suitable for real-life trading and only obtain optimality through a greedy algorithm (e.g. allocating all the wealth to the asset with the most promising expected growth rate) without considering transaction cost. Going by the rule of thumb, the transaction commission is taken as a fixed rate for every buy and sell rebalancing activity, i.e. $c = c_b = c_s$. We have to consider a reward function which takes into account practical constraints and near-optimality with respect to portfolio performance, and which is easily differentiable. Applying the reward function used in [15], which satisfies all of the above conditions, the reward at time t is:

$$r_t = r(s_t, a_t) = \ln\left(a_t \cdot y_t - c \sum_{i=1}^{n} |a_{i,t} - w_{i,t}|\right) \quad (28)$$

Following Eq. 5, the performance objective function is defined as:

$$J_{[0,t_{final}]}(\pi_\theta) = \frac{1}{T} \sum_{t=1}^{T} \gamma^t r(S_t, \pi_\theta(S_t)) \quad (29)$$

where $\gamma$ is the discount factor and $r(S_t, \pi_\theta(S_t))$ is the immediate reward at time t under the policy function $\pi_\theta(S_t)$. We modify Eq. 5 by dividing by T, which normalizes the objective function across time periods of different lengths. The main objective is to maximize the performance objective with respect to the policy function $\pi_\theta(S_t)$ parameterized by $\theta$. The network is assigned random weights initially. With the gradient ascent update (Eq. 6), the weights are continually updated to give the best expected performance metric, based on the actor-critic network:

$$\theta \leftarrow \theta + \lambda \nabla_\theta J_{[0,t_{final}]}(\pi_\theta) \quad (30)$$
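A direct transcription of Eq. 28 into NumPy is shown below. The commission rate c = 0.0025 is an assumed placeholder, since the rule-of-thumb rate is fixed but its exact value is not restated here, and $w_t$ denotes the weights carried over from the previous period.

```python
import numpy as np

def reward(a_t, y_t, w_t, c=0.0025):
    """Log-return reward net of transaction costs (Eq. 28).

    a_t : new portfolio vector (the action)
    y_t : relative price vector v_{t+1} / v_t
    w_t : weights carried over from the previous period
    c   : assumed commission rate (c = c_b = c_s)
    """
    turnover_cost = c * np.abs(a_t - w_t).sum()
    return np.log(a_t @ y_t - turnover_cost)

a_t = np.full(16, 1 / 16)                      # equal weights: 15 stocks + bond
y_t = 1 + np.random.normal(0.0005, 0.01, 16)   # toy relative prices
w_t = np.full(16, 1 / 16)
r_t = reward(a_t, y_t, w_t)
```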
G. Input Dimensions

The LFSS Module considered two deep learning techniques, autoencoders and the cRBM, for feature extraction from the time series. While ZoomSVD has a static method of representation and we kept the cRBM structure at a fixed size, we compared different network topologies for training the autoencoders. 3 types of neural network, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and Deep Neural Network (DNN), were compared, with the corresponding results shown in the experiments section. Our input signal used a window size of 60 days, roughly 2 months of price data, giving sufficient information at each time step t. The window size is adjustable, as this value is experiment-based and there is no theoretical backing for choosing a 60-day window. Since we use 15 assets in our portfolio, the state space concept used by [15] based on OHLC data has an input size of 15 × 60 × 3, where 60 is the window size, 15 is the number of assets and 3 represents the 3 price series: close, high and low. We added the asset covariance matrix to our state space, along with the LFSS module features; each covariance channel has dimensions 15 × 15 × 1. The LFSS module feature space for the individual methods had dimensions as follows:
• Autoencoder: 15 × k × 3
• ZoomSVD: 15 × k × 3
• cRBM: 15 × 30 × 3
where k denotes the latent dimension of the respective extractor and 3 again indexes the close, high and low series.

Fig. 7. The framework of our proposed LFSS Module based RL agent. Asset prices are structured into 3-dimensional close, high and low price tensors; the states $S_{1,t}$ and $S_{2,t}$ are extracted from the covariance and LFSS modules, and a combined state $X_t$ is formed by a merged neural network. Combined with the previous time step's action $a_{t-1}$, a new portfolio vector for time t is formed through the policy network

H. Network Architecture
Our deterministic target policy function is constructed by the neural network architecture used by the agent. Several variations of architecture are used for building policy functions, revolving around CNN, LSTM, etc. To establish the performance of the Latent Feature State Space compared to the OHLC state space, we work with a single model architecture, for a better judgement of the performance of the various state spaces. Considering the results in [15,20,21], CNN-based networks outperformed RNN and LSTM for approximating the policy function, due to their ease in handling large amounts of multidimensional data, though this result is empirical, as sequential neural networks like RNN and LSTM should model price data better than CNN [15].

Motivated by [15], we used similar techniques, namely Identical Independent Evaluators (IIE) and Portfolio-Vector Memory (PVM), for our trading agent. In IIE, the policy network propagates independently for the n+1 assets, with common network parameters shared between the flows. The scalar output of each identical stream is normalized by the softmax layer and compressed into a weight vector as the next period's action. PVM allows simultaneous mini-batch training of the network, enormously improving training efficiency [15].

Our state space consists of 3 parts: the LFSS module features, the covariance matrix and the portfolio vector from the previous rebalancing step ($w_{t-1}$). The LFSS features and covariance matrix are individually processed through CNN architectures and concatenated into a single vector $X_t$ after individual layers of processing. The previous action, $w_{t-1}$, is concatenated to this vector to form the final-stage state $S_t = (X_t, w_{t-1})$. This is passed through a deep neural network, with the IIE and PVM setup, to output the portfolio vector $w_t$, with the training process as described in the previous subsection. The basic layout of our RL agent is given in figure 7.

I. Benchmarks
• Equal Weight Baseline (EW): a trivial baseline which assigns equal weight to all assets in the portfolio.
• S&P 500: the stock-market index, which depicts the macro-level movement of the market as a whole.
• OHLC State-Space RL Agent (Baseline-RL): uses unprocessed asset prices as the state space and trains the agent with a CNN-approximated policy function. It is the most widely used RL agent structure for dynamic portfolio optimization.
• Weighted Moving Average Mean Reversion (WMAMR): a method which captures information from past-period price relatives using a moving-averaged loss function to achieve an optimal portfolio.
• Integrated Mean-Variance Kd-Index Baseline (IMVK): derived from [28], this works by selecting a subset of the given portfolio assets based on the technical indicator Kd-Index and the mean-variance model [6], and allocates equal weights to the subset at each time step. To calculate the Kd-Index, first a raw stochastic value (RSV) is calculated as follows:
$$RSV_t = \frac{v_t - v_{min}}{v_{max} - v_{min}} \times 100 \quad (31)$$

where $v_t$, $v_{max}$, $v_{min}$ represent the closing price at time t, the highest closing price and the lowest closing price, respectively. The Kd-Index is computed as:

$$K_t = RSV_t \times \frac{1}{3} + K_{t-1} \times \frac{2}{3} \quad (32)$$

$$D_t = K_t \times \frac{1}{3} + D_{t-1} \times \frac{2}{3} \quad (33)$$

where $K_0 = D_0 = 50$. Whenever $K_{t-1} \le D_{t-1}$ and $K_t > D_t$, a buy signal is created. Two strategies, moderate and aggressive, are formed upon the Kd-Index. The moderate strategy selects an asset only if the mean-variance model weight for the asset is positive and the Kd-Index indicated a buy signal at that time step. The aggressive strategy selects an asset based only on the Kd-Index indicating a buy signal. Wealth is distributed equally over the selected subset.
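A sketch of the Kd-Index computation of Eq. 31-33 with the $K_0 = D_0 = 50$ initialization and the buy-signal rule follows; the 9-period RSV lookback is an assumed convention, since the window used for $v_{max}$ and $v_{min}$ is not restated here.

```python
import numpy as np

def kd_buy_signals(close, lookback=9):
    """Stochastic Kd-Index (Eq. 31-33) with K_0 = D_0 = 50.

    The RSV lookback window is an assumption; returns one boolean
    buy signal (K crossing above D) per evaluated time step.
    """
    K, D = 50.0, 50.0
    signals = []
    for t in range(lookback, len(close)):
        window = close[t - lookback:t + 1]
        rsv = 100.0 * (close[t] - window.min()) / (window.max() - window.min())
        K_prev, D_prev = K, D
        K = rsv / 3.0 + K_prev * 2.0 / 3.0      # Eq. 32
        D = K / 3.0 + D_prev * 2.0 / 3.0        # Eq. 33
        # Buy signal: K_{t-1} <= D_{t-1} and K_t > D_t.
        signals.append(K_prev <= D_prev and K > D)
    return signals

prices = np.cumsum(np.random.normal(0, 1, 300)) + 100   # toy closing prices
buy = kd_buy_signals(prices)
```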
V. EXPERIMENTAL RESULTS

A. Filtering Unit
As described before, the LFSS Module has an added Filtering Unit consisting of a Kalman filter to filter out noise from the price data of each asset. Deep learning models are sensitive to the quality of the data used; many complex deep learning models may overfit the data, learning the noise in the data as well and thus not generalizing well to different datasets. Each of the asset prices was filtered before being passed to the Latent Feature Extractor Unit.
Fig. 8. Filtering Unit: filtered Berkshire Hathaway price and returns signal
B. Latent Feature Extractor Unit
Autoencoder space: We built a time-series price encoder based on 3 encoder-decoder models: LSTM-, CNN- and DNN-based autoencoders. We used a window size of 60, and the input consisted of price windows for each asset. For better scalability of the data, log returns were normalized by a min-max scaler. The data was divided into an 80-20% training-testing split. The training process summary is given in Table I, for both training and testing data. The number of epochs was decided by the convergence of the mean-squared error of the autoencoder, i.e. $L = \|X - X'\|^2$.

TABLE I: Autoencoder Training Summary

                    LSTM     CNN      DNN
Training set MSE    0.0610   0.0567   0.0593
Test set MSE        0.0592   0.0531   0.0554
Epochs              125      100      100

ZoomSVD space: The storage phase of ZoomSVD required dividing the input price data into blocks and storing the initial asset-price matrix in compressed SVD form. The block parameter b was taken as 60, but would usually be decided by the storage capacity of the system for large datasets. The SVD of each block is stored via incremental SVD. As a time-range query is passed by our RL agent, the SVD-based state space is computed by Partial-SVD and Stitched-SVD.

cRBM space: We use a cRBM network stacked of 2 layers (1 visible layer, 1 hidden layer), with the visible units corresponding to the asset price window (60) and 30 neurons in the hidden layer for the extracted feature space.
Fig. 9. Extracted features from an input price window. Features 1, 2 and 3 correspond to the CNN-autoencoder, ZoomSVD and cRBM extracted spaces, respectively
C. Training Step
For each individual RL agent, the number of episodes was decided by the convergence of the percentage distance from equal-weighted portfolio returns. A stable learning rate of 0.01 with the Adam optimizer was used, with a batch size of 32. The exploration probability is 20% initially and decreases after every episode.
D. Comparison of LFSS Module Based RL Agent with Baseline-RL Agent
After setting up the LFSS module and integrating it into the policy network, we backtested the results on the given stock portfolio. Table II gives a comparison summary of the different feature extraction methods with one another and with the baseline-RL agent. The initial portfolio investment is $10,000. As shown in Table II, adding the cRBM- and CNN-autoencoder-based LFSS module to the agent leads to a significant increase in the different performance metrics, while the DNN-autoencoder-based LFSS module gives slightly better performance than baseline-RL. The ZoomSVD and LSTM-autoencoder state spaces perform roughly on par with baseline-RL. The individual performance metrics are described below:
• Portfolio Value (PV): In terms of portfolio value over the 4-year test set, an initial investment of $10,000 reached roughly $39,000 for the CNN-autoencoder and cRBM RL agents, which performed significantly better than the baseline RL agent. The addition of the autoencoder- and RBM-based LFSS module models the asset-market environment better than raw price data does for the RL agent. Deep-learning-based feature selection methods can model non-linear features in asset-price data, as compared to the linear space of ZoomSVD. Though the cRBM RL agent did not give high returns in the initial years, it had a sharp increase in portfolio value in later years, performing much better than the other methods during this period.
• Sharpe Ratio: Most of the models performed similarly in terms of volatility, with returns varying. There was no statistically significant difference in volatilities. This implies the added LFSS module is not learning anything new to minimize risk in the model. This may be due to the simplicity of the reward function used, and more work will be required to incorporate a risk-adjusted reward to handle volatility, which fits in with the DDPG algorithm.
• Sortino Ratio: The addition of the LFSS module reduces the downside deviation of the RL agent to some extent, but overall a more effective reward function could minimize downside deviation further.
Fig. 10. Comparison of various RL agents
E. Performance Measure w.r.t. Benchmarks
We used 5 non-RL benchmarks for comparison of our agent with standard portfolio optimization techniques: EW, S&P 500, WMAMR, and IMVK (moderate and aggressive strategies), as explained in the last section. We compare these benchmarks to one another and to our RL agents. The performance summary for all the methods is given in Table III.
TABLE II: Performance metrics for different RL agents

Method                Portfolio Value ($)   Annual Returns   Annual Volatility   Sharpe Ratio   Sortino Ratio
Baseline RL           29440                 0.486            0.254               1.91           2.12
DNN-Autoencoder RL    32960                 0.575            0.252               2.28           2.51
cRBM RL               37368                 0.684            0.256               2.67           2.92
ZoomSVD RL            28343                 0.458            0.257               1.78           2.03
CNN-Autoencoder RL    38864                 0.722            0.254               2.84           3.04
LSTM-Autoencoder RL   30121                 0.503            0.255               1.97           2.13

• Portfolio Value: In terms of portfolio value, our RL agents and the baseline-RL perform much better than all the benchmarks used. This shows the superiority of deep RL based methods for the task of portfolio optimization. On the test set, the equal-weighted portfolio gave better results than all the other benchmarks. The IMVK-based strategies performed poorly on the test set, mostly due to large changes in weights leading to high transaction costs. If we consider the overall time period from 2007-2019 (fig. 13), the IMVK-based strategies perform much better than the equal-weighted portfolio until 2017, followed by a large drop in portfolio performance.
• Sharpe Ratio: In line with the high returns given by the RL agents, the deep RL based approaches gave a much better Sharpe ratio than the benchmarks. Though there was a large difference in volatility, the risk was compensated by the high returns.
• Sortino Ratio: Similar to the Sharpe ratio, the RL agents performed much better than all the other benchmarks.
Fig. 11. Comparison of RL agent with various benchmarks

TABLE III: Performance metrics for all the proposed methods

Method                Portfolio Value ($)   Annual Returns   Annual Volatility   Sharpe Ratio   Sortino Ratio
Baseline RL           29440                 0.486            0.254               1.91           2.12
DNN-Autoencoder RL    32960                 0.575            0.252               2.28           2.51
cRBM RL               37368                 0.684            0.256               2.67           2.92
ZoomSVD RL            28343                 0.458            0.257               1.78           2.03
CNN-Autoencoder RL    38864                 0.722            0.254               2.84           3.04
LSTM-Autoencoder RL   30121                 0.503            0.255               1.97           2.13
EW                    19456                 0.235            0.144               1.63           1.84
Aggressive Strategy   13604                 0.0953           0.162               0.59           0.78
Moderate Strategy     13144                 0.0785           0.154               0.51           0.69
S&P 500               15970                 0.149            0.128               1.16           1.37
WMAMR                 15524                 0.152            0.144               1.06           1.26
Fig. 12. Backtest Results for all methods proposed
VI. CONCLUSIONS

In this paper, we propose an RL agent for portfolio optimization with an added LFSS Module to obtain a novel state space. We compared the performance of our RL agent with already existing approaches on a portfolio of 15 S&P 500 stocks. Our agent performed much better than the baseline-RL agent and the other benchmarks in terms of different metrics like portfolio value, Sharpe ratio and Sortino ratio. The main catalyst for the improvement in returns was the addition of the LFSS module, consisting of a time-series filtering unit and a latent feature extractor unit, which provides a compressed latent feature state space for the policy network. We further compared different feature-space extraction methods: autoencoders, ZoomSVD and cRBM. Deep learning based techniques like the CNN-autoencoder and cRBM were effective in obtaining a state space for our RL agent that gave high returns while limiting risk. Additionally, we introduced a new benchmark based on the Kd-Index technical indicator, which gave comparable performance to the equal-weighted portfolio and can be further used as a benchmark for other portfolio optimization problems.

The work done in this paper provides a flexible modelling framework for various state spaces with more sophisticated deep learning models, and with advancements in RL algorithms our RL agent can be further improved upon.

Fig. 13. Comparison of Aggressive and Moderate Strategy IMVK with Equal-Weighted Portfolio

VII. FUTURE WORK
Due to the flexible approach to state space modelling, there are many ways to improve upon the performance of the RL agent. To improve the quality of data through the filtering layer, non-linear filters like the EKF, particle filter, etc. [32] can be applied, which take into account non-linearity in the data, or an adaptive filter [42] can be used which, at each time step, applies the most appropriate filter from a set of filters. For extracting latent features, the cRBM can be extended to deep belief networks [43], which go deeper than the cRBM and are trained using a similar approach. Since RL is highly sensitive to the data used, our agent should be trained on more datasets, more volatile than American stock exchange data, as this would give a more robust performance. Additionally, a data augmentation module [26] could be added to generate synthetic data, exposing the agent to more data. Text-based embeddings could also be added to the state space [29], leveraging NLP techniques.

In terms of deep RL algorithms, PPO was not used in this work; the performance of the LFSS module with the PPO algorithm can be further explored. The framework can also be integrated with an appropriate risk-adjusted reward function that minimizes the risk of our asset allocation.

VIII. ACKNOWLEDGEMENT