Reinforcement Learning for Portfolio Management
MEng Dissertation
Angelos Filos
CID: 00943119
June 20, 2018

Supervisor: Professor Danilo Mandic
Second Marker: Professor Pier Luigi Dragotti
Advisors: Bruno Scalzo Dees, Gregory Sidier
Department of Electrical and Electronic Engineering, Imperial College London

Acknowledgements

I would like to thank Professor Danilo Mandic for agreeing to supervise this self-proposed project, despite the uncertainty about the viability of the topic. His support and guidance contributed to the delivery of a challenging project.
I would also like to take this opportunity to thank Bruno Scalzo Dees for his helpful comments, suggestions and enlightening discussions, which have been instrumental in the progress of the project.
Lastly, I would like to thank Gregory Sidier for spending time with me, out of his working hours. His experience as a practitioner in Quantitative Finance helped me demystify and engage with topics essential to the project.

Abstract
The challenges of modelling the behaviour of financial markets, such as non-stationarity, poor predictive behaviour, and weak historical coupling, have attracted the attention of the scientific community over the last 50 years, and have sparked a persistent drive to employ engineering methods to address and overcome these challenges. Traditionally, mathematical formulations of dynamical systems in the context of Signal Processing and Control Theory have been a lynchpin of today's Financial Engineering. More recently, advances in sequential decision making, mainly through the concept of Reinforcement Learning, have been instrumental in the development of multistage stochastic optimization, a key component in sequential portfolio optimization (asset allocation) strategies. In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based approach) as well as on context-independent agents (model-free approach). The analysis provides conclusive support for the ability of model-free reinforcement learning methods to act as universal trading agents, which are not only capable of reducing the computational and memory complexity (owing to their linear scaling with the size of the universe), but also serve as generalizing strategies across assets and markets, regardless of the trading universe on which they have been trained. The relatively low volume of daily returns in financial market data is addressed via data augmentation (a generative approach) and a choice of pre-training strategies, both of which are validated against current state-of-the-art models. For rigour, a risk-sensitive framework which includes transaction costs is considered, and its performance advantages are demonstrated in a variety of scenarios, from synthetic time-series (sinusoidal, sawtooth and chirp waves), simulated market series (surrogate data based), through to real market data (S&P 500 and EURO STOXX 50). The analysis and simulations confirm the superiority of universal model-free reinforcement learning agents over current portfolio management models in asset allocation strategies, with an achieved performance advantage of as much as 9.2% in annualized cumulative returns and 13.4% in annualized Sharpe Ratio.
Contents

I Background
II Innovation
III Experiments
10 Conclusion
Bibliography

Chapter 1
Introduction
Engineering methods and systems are routinely used in financial market applications, including signal processing, control theory and advanced statistical methods. The computerization of the markets (Schinckus, 2017) encourages automation and algorithmic solutions, which are now well understood and addressed by the engineering communities. Moreover, the recent success of Machine Learning has attracted the interest of the financial community, which constantly seeks successful techniques from other areas, such as computer vision and natural language processing, to enhance the modelling of financial markets. In this thesis, we explore how the asset allocation problem can be addressed by Reinforcement Learning, a branch of Machine Learning that optimally solves sequential decision making problems via direct interaction with the environment in an episodic manner.
In this introductory chapter, we define the objective of the thesis and highlight the research and application domains from which we draw inspiration.
The aim of this report is to investigate the effectiveness of Reinforcement Learning agents on asset allocation. A finite universe of financial instruments (assets), such as stocks, is selected and the role of an agent is to construct an internal representation (model) of the market, allowing it to determine how to optimally allocate funds of a finite budget to those assets. The agent is trained on both synthetic and real market data. Its performance is then compared with standard portfolio management algorithms on an out-of-sample dataset: data that the agent has not been trained on (i.e., the test set). The terms Asset Allocation and Portfolio Management are used interchangeably throughout the report.
From the IBM TD-Gammon (Tesauro, 1995) and the IBM Deep Blue (Campbell et al., 2002) to the Google DeepMind Atari (Mnih et al., 2015) and the Google DeepMind AlphaGo (Silver and Hassabis, 2016), reinforcement learning is well known for its effectiveness in board and video games. Nonetheless, reinforcement learning applies to many more domains, including Robotics, Medicine and Finance, applications of which align with the mathematical formulation of portfolio management. Motivated by the success of some of these applications, an attempt is made to improve and adjust the underlying methods, such that they are applicable to the asset allocation problem setting. In particular, special attention is given to:
• Adaptive Signal Processing, where beamforming has been successfully addressed via reinforcement learning by Almeida et al. (2015);
• Medicine, where a data-driven medication dosing system (Nemati et al., 2016) has been made possible thanks to model-free reinforcement learning agents;
• Algorithmic Trading, where automated execution (Noonan, 2017) and market making (Spooner et al., 2018) have recently been revolutionized by reinforcement learning agents.
Without claiming equivalence of portfolio management with any of the above applications, their relatively similar optimization problem formulations encourage the endeavour to develop reinforcement learning agents for asset allocation.
The report is organized in three parts: the Background (Part I), the Innovation (Part II) and the Experiments (Part III). Readers are advised to follow the sequence of the parts as presented; however, if comfortable with Modern Portfolio Theory and Reinforcement Learning, they can focus on the last two parts, following the provided references to background material when necessary. A brief outline of the project structure and chapters is provided below:
Chapter 2: Financial Signal Processing
The objective of this chapter is to introduce essential financial terms and concepts for understanding the methods developed later in the report.
Chapter 3: Portfolio Optimization
Building on the basics of Financial Signal Processing, this chapter proceeds with the mathematical formulation of static Portfolio Management, motivating the use of Reinforcement Learning to address sequential Asset Allocation via multi-stage decision making.
Chapter 4: Reinforcement Learning
This chapter serves as an important step toward demystifying Reinforcement Learning concepts, by highlighting their analogies to Optimal Control and Systems Theory. The concepts developed in this chapter are essential to the understanding of the trading algorithms and agents developed later in the report.
Chapter 5: Financial Market as Discrete-Time Stochastic Dynamical System
This chapter parallels Chapters 3 and 4, introducing a unified, versatile framework for training agents and investment strategies.
Chapter 6: Trading Agents
The objectives of this chapter are to: (1) introduce traditional model-based (i.e., system identification) reinforcement learning trading agents; (2) develop model-free reinforcement learning trading agents; (3) suggest a flexible universal trading agent architecture that enables pragmatic applications of Reinforcement Learning for Portfolio Management; (4) assess the performance of the developed trading agents in a small-scale experiment (i.e., a 12-asset S&P 500 market).
Chapter 7: Pre-Training
In this chapter, a pre-training strategy is suggested, which addresses the local optimality of the Policy Gradient agents when only a limited number of financial market data samples is available.
Chapter 8: Synthetic Data
In this chapter, the effectiveness of the trading agents of Chapter 6 is assessed on synthetic data, from deterministic time-series (sinusoidal, sawtooth and chirp waves) to simulated market series (surrogate data based). The superiority of model-based or model-free agents is highlighted in each scenario.
Chapter 9: Market Data
This chapter parallels Chapter 8, evaluating the performance of the trading agents of Chapter 6 on real market data, from two distinct universes: (1) the underlying U.S. stocks of the S&P 500 and (2) the underlying European stocks of the EURO STOXX 50.

Part I
Background

Chapter 2
Financial Signal Processing
Financial applications usually involve the manipulation and analysis of sequences of observations, indexed by time order, also known as time-series. Signal Processing, on the other hand, provides a rich toolbox for systematic time-series analysis, modelling and forecasting (Mandic and Chambers, 2001). Consequently, signal processing methods can be employed to mathematically formulate and address fundamental economics and business problems. In addition, Control Theory studies discrete dynamical systems, which form the basis of Reinforcement Learning, the set of algorithms used in this report to solve the asset allocation problem. The links between signal processing algorithms, systems and control theory motivate their integration with finance, to which we refer as Financial Signal Processing or Financial Engineering.
In this chapter, the overlap of signal processing and control theory with finance is explored, attempting to bridge their gaps and highlight their similarities. Firstly, in Section 2.1, essential financial terms and concepts are introduced, while in Section 2.2 the time-series in the context of finance are formalized. In Section 2.3 the evaluation criteria used throughout the report to assess the performance of the different algorithms and strategies are explained, while in Section 2.4 signal processing methods for modelling sequential data are studied.
In order to better communicate ideas and gain insight into the economic problems, basic terms are defined and explained in this section. Useful definitions are also provided by Johnston and Djurić (2011).
Figure 2.1: Financial Engineering relative to Signal Processing and Control Theory.

An asset is an item of economic value. Examples of assets are cash (in hand or in a bank), stocks, loans and advances, accrued incomes, etc. Our main focus in this report is on cash and stocks, but the general principles apply to all kinds of assets.

Assumption 2.1 The assets under consideration are liquid, hence they can be converted into cash quickly, with little or no loss in value. Moreover, the selected assets have available historical data in order to enable analysis.

A portfolio is a collection of multiple financial assets, and is characterized by its:
• Constituents: the M assets of which it consists;
• Portfolio vector, w_t: its i-th component represents the ratio of the total budget invested in the i-th asset, such that:

\mathbf{w}_t = \begin{bmatrix} w_{1,t}, w_{2,t}, \dots, w_{M,t} \end{bmatrix}^T \in \mathbb{R}^{M} \quad \text{and} \quad \sum_{i=1}^{M} w_{i,t} = 1 \quad (2.1)

Given its constituents and portfolio vector w_t, a portfolio can be treated as a single master asset. Therefore, the analysis of single simple assets can be applied to portfolios upon determination of the constituents and the corresponding portfolio vector.
Portfolios are a more powerful, general representation of financial assets, since the single-asset case can be represented by a portfolio; the j-th asset is equivalent to the portfolio with vector e^(j), where the j-th term is equal to unity and the rest are zero. Portfolios are also preferred over single assets in order to minimize risk, as illustrated in Figure 2.2.
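As a minimal illustration of the portfolio vector definition (2.1), the sketch below builds a hypothetical three-asset portfolio vector, verifies the budget constraint, and expresses the single-asset case as a one-hot portfolio e^(j); the asset names and weights are made up for this example and are not taken from the report's experiments.

```python
import numpy as np

# Hypothetical three-asset universe (names and weights are illustrative only).
assets = ["AAPL", "GE", "BA"]

# Portfolio vector w_t: fraction of the budget allocated to each asset, eq. (2.1).
w_t = np.array([0.5, 0.2, 0.3])

# Budget constraint: the components of w_t must sum to one.
assert np.isclose(w_t.sum(), 1.0)

# The single-asset case is a degenerate portfolio: a one-hot vector e^(j).
j = assets.index("GE")
e_j = np.eye(len(assets))[j]          # [0., 1., 0.]
assert np.isclose(e_j.sum(), 1.0)

print(dict(zip(assets, w_t)), e_j)
```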
Figure 2.2: Risk for a single asset and a number of uncorrelated portfolios. Risk is represented by the standard deviation, or the width, of the distribution curves, illustrating that a large portfolio (e.g., M = 100) is less risky than a single asset (M = 1).

Sometimes it is possible to sell an asset that we do not own. This process is called short selling or shorting (Luenberger, 1997). The exact shorting mechanism varies between markets, but it can be generally summarized as:
1. Borrowing an asset i from someone who owns it at time t;
2. Selling it immediately to someone else at price p_{i,t};
3. Buying back the asset at time (t + k), where k > 0, at price p_{i,t+k};
4. Returning the asset to the lender.
Therefore, if one unit of the asset is shorted, the overall absolute return is p_{i,t} − p_{i,t+k}, and as a result short selling is profitable only if the asset price declines between time t and t + k, that is, p_{i,t+k} < p_{i,t}. Nonetheless, note that the potential loss of short selling is unbounded, since asset prices are not bounded from above (0 ≤ p_{i,t+k} < ∞).

Remark 2.2
If short selling is allowed, then the portfolio vector satisfies (2.1), but w_i can be negative if the i-th asset is shorted. As a consequence, w_j can be greater than 1, such that ∑_{i=1}^M w_i = 1 still holds. For instance, in the case of a two-asset portfolio, the portfolio vector w_t = [−0.5, 1.5]^T is valid and can be interpreted as: 50% of the budget is short sold on the first asset (w_{1,t} = −0.5 < 0), while the proceeds are added to the long position on the second asset (w_{2,t} = 1.5 > 1). The terms long and short position in an asset are used to refer to investments where we buy or short sell the asset, respectively.

The dynamic nature of the economy, as a result of the non-static supply and demand balance, causes prices to evolve over time. This encourages us to treat market dynamics as time-series and to employ technical methods and tools for their analysis and modelling. In this section, asset prices are introduced, whose definition immediately reflects our intuition, as well as other time-series derived to ease analysis and evaluation.
Let p_t ∈ R_+ be the price of an asset at discrete time index t (Feng and Palomar, 2016); then the sequence p_1, p_2, . . . , p_T is a univariate time-series. The equivalent notations p_{i,t} and p_{asset i,t} are also used to distinguish between the prices of the different assets. Hence, the T-samples price time-series of an asset i is the column vector \vec{p}_{i,1:T}, such that:

\vec{p}_{i,1:T} = \begin{bmatrix} p_{i,1} \\ p_{i,2} \\ \vdots \\ p_{i,T} \end{bmatrix} \in \mathbb{R}_{+}^{T} \quad (2.2)

where the arrow highlights the fact that it is a time-series. For convenience of portfolio analysis, we define the price vector p_t, such that:

\mathbf{p}_t = \begin{bmatrix} p_{1,t}, p_{2,t}, \dots, p_{M,t} \end{bmatrix} \in \mathbb{R}_{+}^{M} \quad (2.3)

where the i-th element is the asset price of the i-th asset in the portfolio at time t. Extending the single-asset time-series notation to the multivariate case, we form the asset price matrix \vec{P}_T by stacking column-wise the T-samples price time-series of the M assets of the portfolio:

\vec{P}_T = \begin{bmatrix} \vec{p}_{1,1:T}, \vec{p}_{2,1:T}, \dots, \vec{p}_{M,1:T} \end{bmatrix} = \begin{bmatrix} p_{1,1} & p_{2,1} & \cdots & p_{M,1} \\ p_{1,2} & p_{2,2} & \cdots & p_{M,2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{1,T} & p_{2,T} & \cdots & p_{M,T} \end{bmatrix} \in \mathbb{R}_{+}^{T \times M} \quad (2.4)

This formulation enables cross-asset analysis and consideration of the interdependencies between the different assets. We usually relax notation by omitting subscripts when they can be easily inferred from context.
Figure 2.3 illustrates examples of asset price time-series and the corresponding distribution plots. At a first glance, note the highly non-stationary nature of asset prices and hence the difficulty of interpreting their distribution plots. Moreover, we highlight the unequal scaling between prices, where, for example, the GE (General Electric) average price at 23.14$ and the BA (Boeing Company) average price at 132.23$ are of different orders and difficult to compare.

Absolute asset prices are not directly useful for an investor. On the other hand, price changes over time are of great importance, since they reflect the investment profit and loss, or more compactly, its return.
Figure 2.3: Asset prices time-series (left) and distributions (right) for AAPL (Apple), GE (General Electric) and BA (Boeing Company).

Gross Return
The gross return R_t of an asset represents the scaling factor of an investment in the asset at time (t − 1) (Feng and Palomar, 2016). For example, a B dollars investment in an asset at time (t − 1) will be worth B R_t dollars at time t. It is given by the ratio of its prices at times t and (t − 1), such that:

R_t \triangleq \frac{p_t}{p_{t-1}} \in \mathbb{R} \quad (2.5)

Figure 2.4 illustrates the benefit of using gross returns over asset prices.

Remark 2.3 The asset gross returns are concentrated around unity and their behaviour does not vary over time for all stocks, making them attractive candidates for stationary autoregressive (AR) processes (Mandic, 2018a).
Figure 2.4: Asset gross returns time-series (left) and distributions (right).
Simple Return
A more commonly used term is the simple return, r_t, which represents the percentage change in asset price from time (t − 1) to time t, such that:

r_t \triangleq \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1 \overset{(2.5)}{=} R_t - 1 \in \mathbb{R} \quad (2.6)

The gross and simple returns are straightforwardly connected, but the latter is more interpretable, and thus more frequently used.
Figure 2.5 depicts the example assets' simple returns time-series and their corresponding distributions. Unsurprisingly, simple returns possess the representation benefits of gross returns, such as stationarity and normalization. Therefore, we can use simple returns as a comparable metric for all assets, thus enabling the evaluation of analytic relationships among them, despite originating from asset prices of different scales.
Figure 2.5: Single Assets Simple Returns

The T-samples simple returns time-series of the i-th asset is given by the column vector \vec{r}_{i,1:T}, such that:

\vec{r}_{i,1:T} = \begin{bmatrix} r_{i,1} \\ r_{i,2} \\ \vdots \\ r_{i,T} \end{bmatrix} \in \mathbb{R}^{T} \quad (2.7)

while the simple returns vector r_t is:

\mathbf{r}_t = \begin{bmatrix} r_{1,t} \\ r_{2,t} \\ \vdots \\ r_{M,t} \end{bmatrix} \in \mathbb{R}^{M} \quad (2.8)

where r_{i,t} is the simple return of the i-th asset at time index t.

Remark 2.4 Exploiting the representation advantage of the portfolio over single assets, we define the portfolio simple return as the linear combination of the simple returns of its constituents, weighted by the portfolio vector.
Hence, at time index t, we obtain:

r_t \triangleq \sum_{i=1}^{M} w_{i,t}\, r_{i,t} = \mathbf{w}_t^T \mathbf{r}_t \in \mathbb{R} \quad (2.9)

Combining the price matrix in (2.4) and the definition of the simple return (2.6), we construct the simple return matrix \vec{R}_T by stacking column-wise the T-samples simple returns time-series of the M assets of the portfolio, to give:

\vec{R}_T = \begin{bmatrix} \vec{r}_{1,1:T}, \vec{r}_{2,1:T}, \dots, \vec{r}_{M,1:T} \end{bmatrix} = \begin{bmatrix} r_{1,1} & r_{2,1} & \cdots & r_{M,1} \\ r_{1,2} & r_{2,2} & \cdots & r_{M,2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1,T} & r_{2,T} & \cdots & r_{M,T} \end{bmatrix} \in \mathbb{R}^{T \times M} \quad (2.10)

Collecting the portfolio (column) vectors for the time interval t ∈ [1, T] into a portfolio weights matrix \vec{W}_T, we obtain the portfolio returns time-series by multiplication of \vec{R}_T with \vec{W}_T and extraction of the T diagonal elements of the product, such that:

\mathbf{r}_{1:T} = \mathrm{diag}\left( \vec{R}_T \vec{W}_T \right) \in \mathbb{R}^{T} \quad (2.11)
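As a minimal numpy sketch of equations (2.6) and (2.9), simple returns can be computed from a price matrix and combined into a portfolio return series; the price values and the constant weight vector below are made up purely for illustration.

```python
import numpy as np

# Price matrix P in R^{T x M}: rows are time steps, columns are assets (made-up values).
P = np.array([[100.0, 20.0, 130.0],
              [101.0, 19.5, 131.0],
              [ 99.5, 19.8, 133.5],
              [102.0, 20.1, 132.0]])

# Simple returns, eq. (2.6): r_t = p_t / p_{t-1} - 1, computed column-wise.
R = P[1:] / P[:-1] - 1.0              # shape (T-1, M)

# Constant portfolio vector (eq. 2.1); in general w_t may change every period.
w = np.array([0.5, 0.2, 0.3])

# Portfolio simple returns, eq. (2.9): r_t = w^T r_t for every period.
portfolio_returns = R @ w             # shape (T-1,)

print(R.round(4), portfolio_returns.round(4))
```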
Log Return

Despite the interpretability of the simple return as the percentage change in asset price over one period, it is asymmetric, and therefore practitioners tend to use log returns instead (Kennedy, 2016), in order to preserve interpretation and to yield a symmetric measure. Using the example in Table 2.1, a 15% increase in price followed by a 15% decline does not result in the initial price of the asset. On the contrary, a 15% log-increase in price followed by a 15% log-decline returns to the initial asset price, reflecting the symmetric behaviour of log returns.
Let the log return ρ_t at time t be:

\rho_t \triangleq \ln\left( \frac{p_t}{p_{t-1}} \right) \overset{(2.5)}{=} \ln(R_t) \in \mathbb{R} \quad (2.12)

Note the very close connection of the gross return to the log return. Moreover, since the gross return is centered around unity, the logarithmic operator makes log returns concentrated around zero, as clearly observed in Figure 2.6.

time t | simple return | price ($) | log return
0      | -             | 100       | -
1      | +0.15         | 110       | +0.13
2      | -0.15         | 99        | -0.16
3      | +0.01         | 100       | +0.01
4      | -0.14         | 86        | -0.15
5      | +0.16         | 100       | +0.15

Table 2.1: Simple Return Asymmetry & Log Return Symmetry
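The asymmetry described above can be checked in a couple of lines; the sketch below applies a 15% simple-return rise and fall, and then a 15% log-return rise and fall, to a starting price of 100 (a worked example in the spirit of Table 2.1, not a reproduction of its exact figures).

```python
import numpy as np

p0 = 100.0

# Simple returns: +15% followed by -15% does not recover the initial price.
p_simple = p0 * (1 + 0.15) * (1 - 0.15)       # 97.75

# Log returns: +0.15 followed by -0.15 returns exactly to the initial price.
p_log = p0 * np.exp(0.15) * np.exp(-0.15)     # 100.0

print(p_simple, p_log)
```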
AAPLGE
BA 0.15 0.10 0.05 0.00 0.05 0.10Asset Log Returns, t F r e q u e n c y D e n s i t y Asset Log Returns: Distributions
AAPLGEBA
Figure 2.6: Single Assets Log Returns

Comparing the definitions of simple and log returns in (2.6) and (2.12), respectively, we obtain the relationship:

\rho_t = \ln(1 + r_t) \quad (2.13)

hence we can define all time-series and convenient portfolio representations of log returns by substituting simple returns into (2.13). For example, the portfolio log return is given by substitution of (2.9) into (2.13), such that:

\rho_t \triangleq \ln\left(1 + \mathbf{w}_t^T \mathbf{r}_t\right) \in \mathbb{R} \quad (2.14)

The end goal is the construction of portfolios, linear combinations of individual assets, whose properties (e.g., returns, risk) are optimal under provided conditions and constraints. As a consequence, a set of evaluation criteria and metrics is necessary in order to evaluate the performance of the generated portfolios. Due to the uncertainty of the future dynamics of the financial markets, we study the statistical properties of the asset returns, as well as other risk metrics, motivated by signal processing.

Future prices, and hence returns, are inherently unknown and uncertain (Kennedy, 2016). To mathematically capture and manipulate this stochasticity, we treat future market dynamics (i.e., prices, cross-asset dependencies) as random variables and study their properties. Qualitative visual analysis of probability density functions is a labour-intensive process and thus impractical, especially when high-dimensional distributions (i.e., 4D and higher) are under consideration. On the other hand, quantitative measures, such as moments, provide a systematic way to analyze (joint) distributions (Meucci, 2009).
Mean, Median & Mode
Suppose that we need to summarize all the information regarding a random variable X in only one number: the one value that best represents the whole range of possible outcomes. We are looking for a location parameter that provides a fair indication of where on the real axis the random variable X will end up taking its value.
An immediate choice for the location parameter is the center of mass of the distribution, i.e., the weighted average of each possible outcome, where the weight of each outcome is provided by its respective probability. This corresponds to computing the expected value or mean of the random variable:

\mathbb{E}[X] = \mu_X \triangleq \int_{-\infty}^{+\infty} x f_X(x)\, \mathrm{d}x \in \mathbb{R} \quad (2.15)

Note that the mean is also the first-order statistical moment of the distribution f_X. When a finite number of observations T is available, and there is no closed-form expression for the probability density function f_X, the sample mean or empirical mean is used as an unbiased estimate of the expected value, according to:

\mathbb{E}[X] \approx \frac{1}{T} \sum_{t=1}^{T} x_t \quad (2.16)

Investor Advice 2.1 (Greedy Criterion)
For the same level of risk, choose the portfolio that maximizes the expected returns (Wilmott, 2007).
Figure 2.7 illustrates two cases where the Greedy Criterion is considered; in the left sub-figure the two assets carry equal risk and μ_blue = 1 < μ_red = 4, hence the red asset is selected.

Figure 2.7: Greedy criterion for equally risky assets (left) and unequally risky assets (right).
On the other hand, when the assets have unequal risk levels (i.e., right sub-figure), the criterion does not apply and we cannot draw any conclusions without employing other metrics as well.
The definition of the mean value is extended to the multivariate case as the juxtaposition of the mean values (2.15) of the marginal distribution of each entry:

\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}_X \triangleq \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_M] \end{bmatrix} \in \mathbb{R}^{M} \quad (2.17)

A portfolio with vector w_t and single-asset mean simple returns μ_r has expected simple returns:

\mu_r = \mathbf{w}_t^T \boldsymbol{\mu}_r \quad (2.18)

An alternative choice for the location parameter is the median, which is the quantile relative to the specific cumulative probability p = 1/2:

\mathrm{Med}[X] \triangleq Q_X\!\left(\tfrac{1}{2}\right) \in \mathbb{R} \quad (2.19)

The juxtaposition of the medians, or any other quantiles, of each entry of a random variable does not satisfy the affine equivariance property (Meucci, 2009), i.e., in general Med[a + BX] ≠ a + B Med[X], and therefore it does not define a suitable location parameter.
A third parameter of location is the mode, which refers to the shape of the probability density function f_X. Indeed, the mode is defined as the point that corresponds to the highest peak of the density function:

\mathrm{Mod}[X] \triangleq \underset{x \in \mathbb{R}}{\mathrm{argmax}}\, f_X(x) \in \mathbb{R} \quad (2.20)

Intuitively, the mode is the most frequently occurring data point in the distribution. It is trivially extended to multivariate distributions, namely as the highest peak of the joint probability density function:

\mathrm{Mod}[\mathbf{X}] \triangleq \underset{\mathbf{x} \in \mathbb{R}^{M}}{\mathrm{argmax}}\, f_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{M} \quad (2.21)

Note that the relative positions of the location parameters provide qualitative information about the symmetry, the tails and the concentration of the distribution. Higher-order moments quantify these properties.
Figure 2.8 illustrates the distribution of the prices and the corresponding simple returns of the asset BA (Boeing Company), along with their location parameters. In the case of the simple returns, we highlight that the mean, the median and the mode are very close to each other, reflecting the symmetry and the concentration of the distribution, properties that motivated the selection of returns over raw asset prices.
Figure 2.8: First order moments for BA (Boeing Company) prices (left) and simple returns (right).

Volatility & Covariance
The dilemma we faced in selecting between assets in Figure 2.7 motivates the introduction of a metric that quantifies the risk level. In other words, we are looking for a dispersion parameter that yields an indication of the extent to which the location parameter (i.e., mean, median) might be wrong in guessing the outcome of the random variable X.
The variance is the benchmark dispersion parameter, measuring how far the random variable X is spread out from its mean, given by:

\mathrm{Var}[X] = \sigma_X^2 \triangleq \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right] \in \mathbb{R} \quad (2.22)

The square root of the variance, σ_X, namely the standard deviation, or volatility in finance, is a more physically interpretable parameter, since it has the same units as the random variable under consideration (i.e., prices, simple returns).
Note that the variance is also the second-order statistical central moment of the distribution f_X. When a finite number of observations T is available, and there is no closed-form expression for the probability density function f_X, Bessel's correction formula (Tsay, 2005) is used as an unbiased estimate of the variance, according to:

\mathrm{Var}[X] \approx \frac{1}{T-1} \sum_{t=1}^{T} (x_t - \mu_X)^2 \quad (2.23)

Investor Advice 2.2 (Risk-Aversion Criterion)
For the same expected returns, choose the portfolio that minimizes the volatility (Wilmott, 2007).
Figure 2.9: Risk-aversion criterion for equal returns (left) and unequal returns (right).

According to the Risk-Aversion Criterion, in Figure 2.9, for the same returns level (i.e., left sub-figure) we choose the less risky asset, the blue one, since the red one is more spread out (σ²_blue = 1 < σ²_red = 9). The extension of the dispersion parameter to the multivariate case is the covariance, which measures the joint variability of two variables, given by:

\mathrm{Cov}[\mathbf{X}] = \boldsymbol{\Sigma}_X \triangleq \mathbb{E}\left[ (\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^T \right] \in \mathbb{R}^{M \times M} \quad (2.24)

or component-wise:

\mathrm{Cov}[X_m, X_n] = \left[ \mathrm{Cov}[\mathbf{X}] \right]_{mn} = \Sigma_{mn} \triangleq \mathbb{E}\left[ (X_m - \mathbb{E}[X_m])(X_n - \mathbb{E}[X_n]) \right] \in \mathbb{R} \quad (2.25)

By direct comparison of (2.22) and (2.25), we note that:

\mathrm{Var}[X_m] = \mathrm{Cov}[X_m, X_m] = \left[ \mathrm{Cov}[\mathbf{X}] \right]_{mm} = \Sigma_{mm} \quad (2.26)

hence the m-th diagonal element of the covariance matrix, Σ_mm, is the variance of the m-th component of the multivariate random variable X, while the off-diagonal terms Σ_mn represent the joint variability of the m-th with the n-th component of X. Note that, by definition (2.24), the covariance is a symmetric, real and positive semi-definite matrix (Mandic, 2018b).
Empirically, we estimate the covariance matrix entries using again Bessel's correction formula (Tsay, 2005), in order to obtain an unbiased estimate:

\mathrm{Cov}[X_m, X_n] \approx \frac{1}{T-1} \sum_{t=1}^{T} (x_{m,t} - \mu_{X_m})(x_{n,t} - \mu_{X_n}) \quad (2.27)

A portfolio with vector w_t and covariance matrix of asset simple returns Σ has variance:

\sigma_r^2 = \mathbf{w}_t^T \boldsymbol{\Sigma}\, \mathbf{w}_t \quad (2.28)

The correlation coefficient is also frequently used to quantify the linear dependency between random variables. It takes values in the range [−1, 1] and hence it is a normalized way to compare dependencies, while covariances are highly influenced by the scale of the random variables' variances. The correlation coefficient is given by:

\mathrm{corr}[X_m, X_n] = \left[ \mathrm{corr}[\mathbf{X}] \right]_{mn} = \rho_{mn} \triangleq \frac{\mathrm{Cov}[X_m, X_n]}{\sigma_{X_m} \sigma_{X_n}} \in [-1, 1] \subset \mathbb{R} \quad (2.29)
Figure 2.10: Covariance and correlation matrices for assets simple returns.
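The location and dispersion estimates of this section, eqs. (2.16), (2.18) and (2.27)–(2.29), reduce to a handful of numpy calls; the sketch below uses a synthetic, randomly generated returns matrix purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 250, 5
R = 0.0005 + 0.01 * rng.standard_normal((T, M))   # synthetic daily simple returns

mu = R.mean(axis=0)                  # sample means, eq. (2.16)
Sigma = np.cov(R, rowvar=False)      # covariance with Bessel's correction, eq. (2.27)
corr = np.corrcoef(R, rowvar=False)  # correlation matrix, eq. (2.29)

w = np.full(M, 1.0 / M)              # equally weighted portfolio
mu_p = w @ mu                        # portfolio expected return, eq. (2.18)
var_p = w @ Sigma @ w                # portfolio variance, eq. (2.28)

print(mu.round(5), mu_p.round(5), var_p.round(7))
print(corr.round(2))
```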
Skewness
The standard measure of the symmetry of a distribution is the skewness, which is the third central moment normalized by the standard deviation, in such a way as to make it scale-independent:

\mathrm{skew}[X] \triangleq \frac{\mathbb{E}\left[ (X - \mathbb{E}[X])^3 \right]}{\sigma_X^3} \quad (2.30)

In particular, a distribution whose probability density function is symmetric around its expected value has null skewness. If the skewness is positive (negative), occurrences larger than the expected value are less (more) likely than occurrences smaller than the expected value.

Investor Advice 2.3 (Negatively Skewed Criterion)
Choose negatively skewed returns, rather than positively skewed (Wilmott, 2007).
Kurtosis
The fourth moment provides a measure of the relative weight of the tails with respect to the central body of a distribution. The standard quantity to evaluate this balance is the kurtosis, defined as the normalized fourth central moment:

\mathrm{kurt}[X] \triangleq \frac{\mathbb{E}\left[ (X - \mathbb{E}[X])^4 \right]}{\sigma_X^4} \quad (2.31)

The kurtosis gives an indication of how likely it is to observe a measurement far in the tails of the distribution: a large kurtosis implies that the distribution displays "fat tails".

Despite the insight into the statistical properties we gain by studying moments of the returns distribution, we can combine them in such ways as to fully capture the behaviour of our strategies and better assess them. Inspired by standard metrics used in signal processing (e.g., the signal-to-noise ratio) and in sequential decision making, we introduce the following performance evaluators: cumulative returns, Sharpe ratio, drawdown and value at risk.
Cumulative Returns
In subsection 2.2.2 we defined returns relative to the change in asset prices over one time period. Nonetheless, we usually get involved in a multi-period investment, hence we extend the definition of vanilla returns to the cumulative returns, which represent the change in asset prices over larger time horizons.
Based on (2.5), the cumulative gross return R_{t→T} between time indexes t and T is given by:

R_{t \to T} \triangleq \frac{p_T}{p_t} = \left( \frac{p_T}{p_{T-1}} \right) \left( \frac{p_{T-1}}{p_{T-2}} \right) \cdots \left( \frac{p_{t+1}}{p_t} \right) \overset{(2.5)}{=} R_T R_{T-1} \cdots R_{t+1} = \prod_{i=t+1}^{T} R_i \in \mathbb{R} \quad (2.32)

The cumulative gross return is usually also termed Profit & Loss (PnL), since it represents the wealth level of the investment. If R_{t→T} > 1 (< 1) the investment was profitable (lossy).
Investor Advice 2.4 (Profitability Criterion)
Aim to maximize the profitability of the investment.
Moreover, the cumulative simple return r_{t→T} is given by:

r_{t \to T} \triangleq \frac{p_T}{p_t} - 1 \overset{(2.32)}{=} \prod_{i=t+1}^{T} R_i - 1 \overset{(2.6)}{=} \prod_{i=t+1}^{T} (1 + r_i) - 1 \in \mathbb{R} \quad (2.33)

while the cumulative log return ρ_{t→T} is:

\rho_{t \to T} \triangleq \ln\left( \frac{p_T}{p_t} \right) \overset{(2.32)}{=} \ln\left( \prod_{i=t+1}^{T} R_i \right) = \sum_{i=t+1}^{T} \ln(R_i) \overset{(2.12)}{=} \sum_{i=t+1}^{T} \rho_i \in \mathbb{R} \quad (2.34)

Figure 2.11 demonstrates the interpretation power of cumulative returns over simple returns. Simple visual inspection of simple returns is inadequate for comparing the performance of the different assets. On the other hand, the cumulative simple returns clearly show that BA's (Boeing Company) price appreciated over the period, while GE's (General Electric) price declined.
Figure 2.11: Assets cumulative simple returns.
Sharpe Ratio
Remark 2.5
The criteria 2.1 and 2.2 can sufficiently distinguish and prioritize investments which either have the same risk level or the same returns level, respectively. Nonetheless, they fail in all other cases, when risk or return levels are unequal.
The failure of the greedy criterion and the risk-aversion criterion is demonstrated in both examples in Figures 2.7 and 2.9, where it can be observed that the more risky asset, the red one, has higher expected returns (i.e., the red distribution is wider, hence has larger variance, but it is centered around a larger value, compared to the blue distribution). Consequently, none of the criteria applies and the comparison is inconclusive.
In order to address this issue, and motivated by the Signal-to-Noise Ratio (SNR) (Zhang and Wang, 2017; Feng and Palomar, 2016), we define the Sharpe Ratio (SR) as the ratio of the expected returns (i.e., signal power) to their standard deviation (i.e., noise power), adjusted by a scaling factor:

\mathrm{SR}_T \triangleq \sqrt{T}\, \frac{\mathbb{E}[r_{1:T}]}{\sqrt{\mathrm{Var}[r_{1:T}]}} \in \mathbb{R} \quad (2.35)
where T is the number of samples considered in the calculation of the empirical mean and standard deviation. (The variance of the noise is equal to the noise power; the standard deviation is used in the definition of the SR to provide a unit-less metric.)

Investor Advice 2.5 (Sharpe Ratio Criterion)
Aim to maximize the Sharpe Ratio of the investment.
Considering now the example in Figures 2.7 and 2.9, we can quantitatively compare the two returns streams and select the one that maximizes the Sharpe Ratio:

\mathrm{SR}_{\mathrm{blue}} = \sqrt{T}\, \frac{\mu_{\mathrm{blue}}}{\sigma_{\mathrm{blue}}} = \sqrt{T}\, \frac{1}{1} = \sqrt{T} \quad (2.36)

\mathrm{SR}_{\mathrm{red}} = \sqrt{T}\, \frac{\mu_{\mathrm{red}}}{\sigma_{\mathrm{red}}} = \sqrt{T}\, \frac{4}{3} \quad (2.37)

\mathrm{SR}_{\mathrm{blue}} < \mathrm{SR}_{\mathrm{red}} \Rightarrow \text{choose red} \quad (2.38)
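A direct implementation of the Sharpe Ratio (2.35) is sketched below; the empirical standard deviation uses Bessel's correction, and the two returns streams are synthetic draws that merely mimic the blue (μ = 1, σ = 1) and red (μ = 4, σ = 3) assets of the example above.

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray) -> float:
    """Sharpe Ratio of a simple-returns stream, eq. (2.35): sqrt(T) * mean / std."""
    T = len(returns)
    return np.sqrt(T) * returns.mean() / returns.std(ddof=1)

# Synthetic streams mimicking the blue (mu=1, sigma=1) and red (mu=4, sigma=3) assets.
rng = np.random.default_rng(1)
blue = rng.normal(1.0, 1.0, size=252)
red = rng.normal(4.0, 3.0, size=252)

print(sharpe_ratio(blue), sharpe_ratio(red))   # the red stream typically scores higher, as in (2.38)
```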
Drawdown

The drawdown (DD) is a measure of the decline from a historical peak in cumulative returns (Luenberger, 1997). A drawdown is usually quoted as the percentage between the peak and the subsequent trough and is defined as:

\mathrm{DD}(t) = -\max\left\{ \left[ \max_{\tau \in (0,t)} r_{0 \to \tau} \right] - r_{0 \to t},\ 0 \right\} \quad (2.39)

The maximum drawdown (MDD) up to time t is the maximum of the drawdown over the history of the cumulative returns, such that:

\mathrm{MDD}(t) = -\max_{T \in (0,t)} \left\{ \left[ \max_{\tau \in (0,T)} r_{0 \to \tau} \right] - r_{0 \to T} \right\} \quad (2.40)

The drawdown and maximum drawdown plots are provided in Figure 2.12, along with the cumulative returns of the assets GE and BA. Interestingly, the decline of GE's cumulative returns starting in early 2017 is perfectly reflected by the (maximum) drawdown curve.
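The drawdown and maximum drawdown of (2.39)–(2.40) follow from a running maximum over the cumulative returns; a minimal sketch under the same sign convention (drawdowns reported as non-positive numbers), applied to synthetic daily returns, is given below.

```python
import numpy as np

def drawdown(simple_returns: np.ndarray) -> tuple[np.ndarray, float]:
    """Drawdown series and maximum drawdown of a simple-returns stream, eqs. (2.39)-(2.40)."""
    cumulative = np.cumprod(1.0 + simple_returns) - 1.0   # cumulative simple returns, eq. (2.33)
    running_peak = np.maximum.accumulate(cumulative)
    dd = -(running_peak - cumulative)                     # <= 0 by construction
    return dd, dd.min()

rng = np.random.default_rng(2)
r = rng.normal(0.0005, 0.01, size=252)                    # synthetic daily simple returns
dd, mdd = drawdown(r)
print(mdd)
```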
Figure 2.12: (Maximum) drawdown and cumulative returns for GE and BA.

Value at Risk

The value at risk (VaR) is another commonly used metric to assess the performance of a returns time-series (i.e., stream). Given daily simple returns r_t and a cut-off c ∈ (0, 1), the value at risk is defined as the c quantile of their distribution, representing the worst 100c% case scenario:

\mathrm{VaR}(c) \triangleq Q_r(c) \in \mathbb{R} \quad (2.41)

Figure 2.13 depicts the value at risk of GE and BA at c = 0.05.
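Since (2.41) defines VaR as an empirical quantile of the daily returns, it maps to a single numpy call; the sketch below uses synthetic returns and the same c = 0.05 cut-off as Figure 2.13.

```python
import numpy as np

def value_at_risk(simple_returns: np.ndarray, c: float = 0.05) -> float:
    """Value at risk, eq. (2.41): the c-quantile of the daily simple returns."""
    return np.quantile(simple_returns, c)

rng = np.random.default_rng(3)
r = rng.normal(0.0005, 0.012, size=1000)   # synthetic daily simple returns
print(value_at_risk(r, c=0.05))            # worst 5% case scenario threshold
```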
Figure 2.13: Illustration of the 5% value at risk (VaR) of the GE and BA stocks.

Time-series analysis is of major importance in a vast range of research topics and many engineering applications. It relates to analyzing time-series data in order to estimate meaningful statistics and identify patterns in sequential data. Financial time-series analysis deals with the extraction of underlying features to analyze and predict the temporal dynamics of financial assets (Navon and Keller, 2017). Due to the inherent uncertainty and non-analytic structure of financial markets (Tsay, 2005), the task has proven challenging, and both classical linear statistical methods, such as the VAR model, and statistical machine learning models have been widely applied (Ahmed et al., 2010). In order to efficiently capture the non-linear nature of financial time-series, advanced non-linear function approximators, such as RNN models (Mandic and Chambers, 2001) and Gaussian Processes (Roberts et al., 2013), are also extensively used.
In this section, we introduce the VAR and RNN models, which comprise the basis for the model-based approach developed in Section 6.1.

Autoregressive (AR) processes can model univariate time-series and specify that future values of the series depend linearly on the past realizations of the series (Mandic, 2018a). In particular, a p-order autoregressive process AR(p) satisfies:

x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + \varepsilon_t = \sum_{i=1}^{p} a_i x_{t-i} + \varepsilon_t = \mathbf{a}^T \vec{x}_{t-p:t-1} + \varepsilon_t \in \mathbb{R} \quad (2.42)

where ε_t is a stochastic term (an imperfectly predictable term), which is usually treated as white noise, and a = [a_1, a_2, · · · , a_p]^T are the p model parameters/coefficients.
Extending the AR model to multivariate time-series, we obtain the vector autoregressive (VAR) process, which enables us to capture the cross-dependencies between series. For the general case of an M-dimensional p-order vector autoregressive process VAR_M(p), it follows that:

\begin{bmatrix} x_{1,t} \\ x_{2,t} \\ \vdots \\ x_{M,t} \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} + \begin{bmatrix} a^{(1)}_{1,1} & a^{(1)}_{1,2} & \cdots & a^{(1)}_{1,M} \\ a^{(1)}_{2,1} & a^{(1)}_{2,2} & \cdots & a^{(1)}_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(1)}_{M,1} & a^{(1)}_{M,2} & \cdots & a^{(1)}_{M,M} \end{bmatrix} \begin{bmatrix} x_{1,t-1} \\ x_{2,t-1} \\ \vdots \\ x_{M,t-1} \end{bmatrix} + \cdots + \begin{bmatrix} a^{(p)}_{1,1} & a^{(p)}_{1,2} & \cdots & a^{(p)}_{1,M} \\ a^{(p)}_{2,1} & a^{(p)}_{2,2} & \cdots & a^{(p)}_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(p)}_{M,1} & a^{(p)}_{M,2} & \cdots & a^{(p)}_{M,M} \end{bmatrix} \begin{bmatrix} x_{1,t-p} \\ x_{2,t-p} \\ \vdots \\ x_{M,t-p} \end{bmatrix} + \begin{bmatrix} e_{1,t} \\ e_{2,t} \\ \vdots \\ e_{M,t} \end{bmatrix} \quad (2.43)

or equivalently, in compact form:

\mathbf{x}_t = \mathbf{c} + \mathbf{A}_1 \mathbf{x}_{t-1} + \mathbf{A}_2 \mathbf{x}_{t-2} + \cdots + \mathbf{A}_p \mathbf{x}_{t-p} + \mathbf{e}_t = \mathbf{c} + \sum_{i=1}^{p} \mathbf{A}_i \mathbf{x}_{t-i} + \mathbf{e}_t \in \mathbb{R}^{M} \quad (2.44)

where c ∈ R^M is a vector of constants (intercepts), A_i ∈ R^{M×M}, for i = 1, 2, . . . , p, are the p parameter matrices, and e_t ∈ R^M is a stochastic term (noise).
Hence, VAR processes can adequately capture the dynamics of linear systems, under the assumption that they follow a Markov process of finite order, at most p (Murphy, 2012). In other words, the effectiveness of a p-th order VAR process relies on the assumption that the last p observations carry all the sufficient statistics and information needed to predict and describe the future realizations of the process. As a result, we enforce a memory mechanism, keeping the p last values, \vec{X}_{t-p:t-1}, of the multivariate time-series and making predictions according to (2.44). Increasing the order of the model p results in increased computational and memory complexity, as well as a tendency to overfit the noise of the observed data. A VAR_M(p) process has

|P|_{\mathrm{VAR}_M(p)} = M \times M \times p + M \quad (2.45)

parameters, hence they increase linearly with the model order. The systematic selection of the model order p can be achieved by minimizing an information criterion, such as the Akaike Information Criterion (AIC) (Mandic, 2018a), given by:

p_{\mathrm{AIC}} = \min_{p \in \mathbb{N}} \left[ \ln(\mathrm{MSE}) + \frac{p}{N} \right] \quad (2.46)

where MSE is the mean squared error of the model and N is the number of samples.
After careful investigation of equation (2.44), we note that the target \vec{x}_t is given by an ensemble (i.e., linear combination) of p linear regressions, where the i-th regressor has (trainable) weights A_i and features x_{t-i} (one of the regressors also has a bias vector that corresponds to c). This interpretation of a VAR model allows us to interpret its strengths and weaknesses on a common basis with the neural network architectures covered in subsequent parts. Moreover, this enables adaptive training (e.g., via the Least-Mean-Square filter (Mandic, 2004)), which will prove useful in online learning, covered in Section 6.1.
Figure 2.14 illustrates a fitted VAR(12) process, where the order p = p_AIC = 12 is selected according to (2.46) and the corresponding number of parameters follows from (2.45).
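Since, as noted above, a VAR_M(p) model is an ensemble of linear regressions, it can be fitted with ordinary least squares; the sketch below stacks the p lagged observations into a feature matrix and recovers c and A_1, ..., A_p for synthetic data (a dedicated package could equally be used, but a plain least-squares fit keeps the construction explicit; the data, the order p = 2 and the helper names are illustrative assumptions).

```python
import numpy as np

def fit_var(X: np.ndarray, p: int) -> tuple[np.ndarray, np.ndarray]:
    """Least-squares fit of a VAR_M(p) model, eq. (2.44). X has shape (T, M)."""
    T, M = X.shape
    # Features: [1, x_{t-1}, ..., x_{t-p}] for every target x_t, t = p, ..., T-1.
    rows = [np.concatenate(([1.0], X[t - p:t][::-1].ravel())) for t in range(p, T)]
    Z = np.array(rows)                       # shape (T-p, 1 + M*p)
    Y = X[p:]                                # targets, shape (T-p, M)
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    c = B[0]                                 # intercept vector c in eq. (2.44)
    A = B[1:].T.reshape(M, p, M).transpose(1, 0, 2)   # A_i matrices, i = 1..p
    return c, A

def predict_next(X: np.ndarray, c: np.ndarray, A: np.ndarray) -> np.ndarray:
    """One-step prediction: x_hat_t = c + sum_i A_i x_{t-i}."""
    p = A.shape[0]
    return c + sum(A[i] @ X[-(i + 1)] for i in range(p))

rng = np.random.default_rng(4)
X = rng.normal(0.0, 0.01, size=(500, 4))     # synthetic simple returns for M = 4 assets
c, A = fit_var(X, p=2)
print(predict_next(X, c, A))
```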
Figure 2.14: Vector autoregressive (VAR) time-series predictive model for the assets' simple returns. One-step prediction is performed, where the realized observations are used as they come.

Recurrent neural networks (RNN) (Mandic and Chambers, 2001) are a family of neural networks with feedback loops which are very successful in processing sequential data. Most recurrent networks can also process sequences of variable length (Goodfellow et al., 2016).
Consider the classical form of a dynamical system:

\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) \quad (2.47)

where s_t and x_t are the system state and input signal at time step t, respectively, while f is a function, parametrized by θ, that maps the previous state and the input signal to the new state. Unfolding the recursive definition in (2.47) for a finite value of t:

\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) = f(f(\mathbf{s}_{t-2}, \mathbf{x}_{t-1}; \boldsymbol{\theta}), \mathbf{x}_t; \boldsymbol{\theta}) = f(f(f(\cdots f(\cdots), \mathbf{x}_{t-1}; \boldsymbol{\theta}), \mathbf{x}_t; \boldsymbol{\theta})) \quad (2.48)

In general, f can be a highly non-linear function. Interestingly, a composite function of nested applications of f is responsible for generating the next state.
Many recurrent neural networks use equation (2.49), or a similar equation, to define the values of their hidden units. To indicate that the state corresponds to the hidden units of the network, we now rewrite equation (2.47) using the variable h to represent the state:

\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) \quad (2.49)

Then the hidden state h_t can be used to obtain the output signal y_t (i.e., observation) at time index t, assuming a non-linear relationship, described by a function g that is parametrized by ϕ:

\hat{\mathbf{y}}_t = g(\mathbf{h}_t; \boldsymbol{\phi}) \quad (2.50)

The computational graph corresponding to (2.49) and (2.50) is provided in Figure 2.15. It can be shown that recurrent neural networks (and any neural network with certain non-linear activation functions, in general) are universal function approximators (Cybenko, 1989), which means that if there is a relationship between past states and the current input with the next states, RNNs have the capacity to model it.
Another important aspect of RNNs is parameter sharing. Note in (2.48) that θ are the only parameters, shared between time steps. Consequently, the number of parameters of the model decreases significantly, enabling faster training and limiting model overfitting (Graves, 2012), compared to feedforward neural networks (i.e., multi-layer perceptrons), which do not allow loops or any recursive connection. Feedforward networks can also be used with sequential data when memory is brute-forced (similar to the VAR process memory mechanism), leading to very large and deep architectures, requiring a lot more time to train and effort to avoid overfitting in order to achieve results similar to those of smaller RNNs (Mandic and Chambers, 2001).

Figure 2.15: A generic recurrent network computational graph. This recurrent network processes information from the input x by incorporating it into the state h that is passed forward through time, which in turn is used to predict the target variable ŷ. (Left) Circuit diagram; the black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance (Goodfellow et al., 2016).

Simple RNNs, such as the one implementing equation (2.48), are not used because of the vanishing gradient problem (Hochreiter, 1998; Pascanu et al., 2012); instead, variants such as the
Gated Recurrent Unit (GRU) are preferred, thanks to their effectiveness and simple architecture (compared to the LSTM (Ortiz-Fuentes and Forcada, 1997)). Most variants introduce some filter/forget mechanism, responsible for selectively filtering out (or "forgetting") past states, shaping the hidden state h in a highly non-linear fashion, but without accumulating the effects from all the history of observations, alleviating the vanishing gradients problem. Consulting the schematic in Figure 2.15, this filtering operation is captured by m (in red), which stands for selective "memory".
Given a loss function L and historic data D, the process of training involves the minimization of L conditioned on D, where the parameters θ and ϕ are the decision variables of the optimization problem:

\underset{\boldsymbol{\theta}, \boldsymbol{\phi}}{\text{minimize}} \quad \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathcal{D}) \quad (2.51)

which is usually addressed by adaptive variants of Stochastic Gradient Descent (SGD), such as Adam (Kingma and Ba, 2014), to ensure faster convergence and avoid saddle points. All these optimization algorithms rely on (estimates of) descent directions, obtained by the gradients of the loss function L with respect to the network parameters (θ, ϕ), namely ∇_θ(L) and ∇_ϕ(L). Due to the parameter sharing mechanics of RNNs, obtaining the gradients is non-trivial and thus the Backpropagation Through Time (BPTT) algorithm (Werbos, 1990) is used, which efficiently calculates the contribution of each parameter to the loss function across all time steps.
The selection of the hyperparameters, such as the size of the hidden state h and the activation functions, can be performed using cross-validation or other empirical methods. Nonetheless, we choose to allow excess degrees of freedom in our model but regularize it using weight decay, in the form of L1 and L2 norms, as well as dropout, according to the Gal and Ghahramani (2016) guidelines.
As with any neural network layer, recurrent layers can be stacked together or connected with other layers (i.e., affine or convolutional layers), forming deep architectures capable of dealing with complex datasets.
For comparison with the VAR(12) process in Section 2.4.1, we train an RNN comprised of two layers, one GRU layer followed by an affine layer, where the size of the hidden state is 3, i.e., h ∈ R^3. The number of model parameters is |P|_{GRU-RNN(4→3→4)} = 88, but it significantly outperforms the VAR model, as suggested by Figure 2.16 and the summary in Table 2.2.
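A sketch of a GRU-RNN predictor of this kind, assuming the PyTorch library: a single GRU layer with a 3-dimensional hidden state followed by an affine (linear) output layer, trained with Adam on a one-step-ahead squared-error loss. The data are synthetic and the hyperparameters are illustrative, so this is only an approximation of the model described above (PyTorch's GRU also carries two bias vectors, so its parameter count differs slightly from 88).

```python
import torch
import torch.nn as nn

class GRUPredictor(nn.Module):
    """One GRU layer (hidden size 3) followed by an affine layer, as in Section 2.4.2."""
    def __init__(self, n_assets: int, hidden_size: int = 3):
        super().__init__()
        self.gru = nn.GRU(input_size=n_assets, hidden_size=hidden_size, batch_first=True)
        self.affine = nn.Linear(hidden_size, n_assets)

    def forward(self, x):                      # x: (batch, time, n_assets)
        h, _ = self.gru(x)                     # hidden states, eq. (2.49)
        return self.affine(h)                  # one-step-ahead predictions, eq. (2.50)

# Synthetic simple returns for M = 4 assets.
torch.manual_seed(0)
returns = 0.01 * torch.randn(1, 500, 4)
inputs, targets = returns[:, :-1], returns[:, 1:]   # predict x_{t+1} from x_t

model = GRUPredictor(n_assets=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                        # minimization of the loss, eq. (2.51), via BPTT
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

print(sum(p.numel() for p in model.parameters()), loss.item())   # parameter count and final loss
```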
Figure 2.16: Gated recurrent unit recurrent neural network (GRU-RNN) time-series predictive model for the assets' simple returns. One-step prediction is performed, where the realized observations are used as they come (i.e., the observation y_{t-1} is set equal to x_t, rather than to the predicted value ŷ_{t-1}).

Table 2.2: Training and testing errors of the VAR and GRU-RNN predictive models for the AAPL, BA, GE and XOM simple returns.

Chapter 3
Portfolio Optimization
The notion of a portfolio has already been introduced in subsection 2.1.2, as a master asset, highlighting its representation advantage over single assets. Nonetheless, portfolios also allow investors to combine properties of individual assets in order to "amplify" the positive aspects of the market, while "attenuating" its negative impacts on the investment.
Figure 3.1 illustrates three randomly generated portfolios (with fixed portfolio weights over time, given in Table 3.2), while Table 3.1 summarizes their performance. Importantly, we note the significant differences between the random portfolios, highlighting the importance that portfolio construction and asset allocation play in the success of an investment. As Table 3.2 implies, short-selling is allowed in the generation of the random portfolios, owing to the negative portfolio weights (e.g., the AAPL and MMM entries). Regardless, the portfolio vector definition (2.1) is satisfied in all cases, since the portfolio vectors' elements sum to one (column-wise addition).

Figure 3.1: Simple returns of randomly allocated portfolios.
Randomly Allocated Portfolios Performance Summary

Performance Metrics            | p(0)      | p(1)     | p(2)
Mean Returns (%)               |           |          |
Cumulative Returns (%)         | -83.8688  | 297.869  | 167.605
Volatility (%)                 |           |          |
Sharpe Ratio                   |           |          |
Max Drawdown (%)               |           |          |
Average Drawdown Time (days)   | 70        | 7        | 24
Skewness                       |           |          |
Kurtosis                       |           |          |
Value at Risk (%)              | -14.0138  | -1.52798 | -2.88773
Conditional Value at Risk (%)  | -20.371   | -2.18944 | -4.44684
Hit Ratio (%)                  |           |          |
Average Win to Average Loss    |           |          |

Table 3.1: Portfolio 0 (p(0)) is out-performed by portfolio 1 (p(1)) in all metrics, motivating the introduction of portfolio optimization.

     | p(0)      | p(1)      | p(2)
AAPL | -2.833049 | 0.172436  | 0.329105
GE   | -2.604941 | -0.061177 | 0.467233
BA   |           |           |
JPM  |           |           |
MMM  | -3.949186 | 0.341040  | -1.373778

Table 3.2: Random Portfolio Vectors of Figure 3.1.

Portfolio Optimization aims to address the allocation problem in a systematic way, where an objective function reflecting the investor's preferences is constructed and optimized with respect to the portfolio vector.
In this chapter, we introduce the Markowitz Model (Section 3.1), the first attempt to mathematically formalize and suggest an optimization method to address portfolio management. Moreover, we extend this framework to generic utility and objective functions (Section 3.2), by taking transaction costs into account (Section 3.3). The shortcomings of the methods discussed here encourage the development of context-agnostic agents, which is the focus of this thesis (Section 6.2). However, the simplicity and robustness of the traditional portfolio optimization methods have motivated the supervised pre-training (Chapter 7) of the agents with Markowitz-like models as ground truths.

Markowitz Model
The Markowitz model (Markowitz, 1952; Kroll et al., 1984) mathematically formulates the portfolio allocation problem, namely finding a portfolio vector w in a universe of M assets, according to the investment Greedy (Investor Advice 2.1) and Risk-Aversion (Investor Advice 2.2) criteria, constrained on the portfolio vector definition (2.1). Hence, the Markowitz model gives the optimal portfolio vector w_* which minimizes volatility for a given returns level, such that:

\sum_{i=1}^{M} w_{*,i} = 1, \quad \mathbf{w}_* \in \mathbb{R}^{M} \quad (3.1)

For a trading universe of M assets, provided historical data, we obtain empirical estimates of the expected returns µ = [µ_1, µ_2, . . . , µ_M]^T ∈ R^M, where µ_i is the sample mean (2.16) of the i-th asset, and of the covariance Σ ∈ R^{M×M}, such that Σ_ij is the empirical covariance (2.27) of the i-th and the j-th assets.
For a given target expected return µ̄_target, determine the portfolio vector w ∈ R^M such that:

\underset{\mathbf{w}}{\text{minimize}} \quad \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} \quad (3.2)
\text{subject to} \quad \mathbf{w}^T \boldsymbol{\mu} = \bar{\mu}_{\text{target}} \quad (3.3)
\text{and} \quad \mathbb{1}_M^T \mathbf{w} = 1 \quad (3.4)

where σ² = w^T Σ w is the portfolio variance, µ = w^T µ is the portfolio expected return, and the M-dimensional column vector of ones is denoted by 1_M.
For Lagrange multipliers λ, κ ∈ R, we form the Lagrangian function L (Papadimitriou and Steiglitz, 1998) such that:

\mathcal{L}(\mathbf{w}, \lambda, \kappa) = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} - \lambda \left( \mathbf{w}^T \boldsymbol{\mu} - \bar{\mu}_{\text{target}} \right) - \kappa \left( \mathbb{1}_M^T \mathbf{w} - 1 \right) \quad (3.5)

We differentiate the Lagrangian function L and apply the first-order necessary condition of optimality (the covariance matrix Σ is by definition (2.24) symmetric, so ∂(w^T Σ w)/∂w = 2Σw):

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\boldsymbol{\Sigma}\mathbf{w} - \lambda \boldsymbol{\mu} - \kappa \mathbb{1}_M = \mathbf{0} \quad (3.6)

Combining (3.6) with the constraints (3.3) and (3.4), we obtain the linear system:

\begin{bmatrix} 2\boldsymbol{\Sigma} & \boldsymbol{\mu} & \mathbb{1}_M \\ \boldsymbol{\mu}^T & 0 & 0 \\ \mathbb{1}_M^T & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{w} \\ -\lambda \\ -\kappa \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \bar{\mu}_{\text{target}} \\ 1 \end{bmatrix} \quad (3.7)

Under the assumption that Σ is full rank and µ is not a multiple of 1_M, equation (3.7) is solvable by matrix inversion (Boyd and Vandenberghe, 2004). The resulting portfolio vector w_MVP defines the mean-variance optimal portfolio:

\begin{bmatrix} \mathbf{w}_{\mathrm{MVP}} \\ -\lambda \\ -\kappa \end{bmatrix} = \begin{bmatrix} 2\boldsymbol{\Sigma} & \boldsymbol{\mu} & \mathbb{1}_M \\ \boldsymbol{\mu}^T & 0 & 0 \\ \mathbb{1}_M^T & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{0} \\ \bar{\mu}_{\text{target}} \\ 1 \end{bmatrix} \quad (3.8)

Note that mean-variance optimization is also used in signal processing and wireless communications in order to determine the optimal beamformer (Almeida et al., 2015), using, for example, Minimum Variance Distortionless Response filters (Xia and Mandic, 2013).

Notice that the constraint (3.4) suggests that short-selling is allowed. This simplifies the formulation of the problem by relaxing conditions, enabling a closed-form solution (3.8) as a set of linear equations. If short sales are prohibited, then an additional constraint should be added, and in this case the optimization problem becomes:

\underset{\mathbf{w}}{\text{minimize}} \quad \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \boldsymbol{\mu} = \bar{\mu}_{\text{target}}, \quad \mathbb{1}_M^T \mathbf{w} = 1 \quad \text{and} \quad \mathbf{w} \succeq \mathbf{0}

where ⪰ designates an element-wise inequality operator. This problem cannot be reduced to the solution of a set of linear equations. It is termed a quadratic program and it is solved numerically using gradient-based algorithms (Gill et al., 1981). Optimization problems with quadratic objective functions and linear constraints fall into this framework.
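A numerical sketch of the closed-form solution (3.7)–(3.8) with short selling allowed is given below; the expected-returns vector, the covariance matrix and the target return are made up for illustration, and the no-short-sale variant would instead require a quadratic-programming solver.

```python
import numpy as np

def markowitz_weights(mu: np.ndarray, Sigma: np.ndarray, mu_target: float) -> np.ndarray:
    """Mean-variance optimal portfolio with short selling, via the linear system (3.7)-(3.8)."""
    M = len(mu)
    ones = np.ones(M)
    KKT = np.block([[2.0 * Sigma,   mu[:, None],       ones[:, None]],
                    [mu[None, :],   np.zeros((1, 1)),  np.zeros((1, 1))],
                    [ones[None, :], np.zeros((1, 1)),  np.zeros((1, 1))]])
    rhs = np.concatenate([np.zeros(M), [mu_target, 1.0]])
    solution = np.linalg.solve(KKT, rhs)
    return solution[:M]                       # w_MVP; the last two entries are -lambda, -kappa

# Illustrative three-asset universe.
mu = np.array([0.10, 0.05, 0.12])
Sigma = np.array([[0.040, 0.006, 0.010],
                  [0.006, 0.020, 0.004],
                  [0.010, 0.004, 0.060]])
w = markowitz_weights(mu, Sigma, mu_target=0.08)
print(w, w.sum(), w @ mu)                     # weights sum to 1 and hit the target return
```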
Remark 3.1 Figure 3.2 illustrates the volatility to expected returns dependency of an example trading universe (i.e., yellow scatter points), along with all optimal portfolio solutions with and without short selling, in the left and right subfigures, respectively. The blue part of the solid curve is termed the Efficient Frontier (Luenberger, 1997), and the corresponding portfolios are called efficient. These portfolios are obtained by the solution of (3.2).
Interestingly, despite the inferior performance of the individual assets, appropriate linear combinations of them result in less volatile and more profitable master assets, demonstrating once again the power of portfolio optimization. Moreover, we note that the red part of the solid curve is inefficient, since for the same risk level (i.e., standard deviation) there are portfolios with higher expected returns, which aligns with the Greedy Criterion 2.1. Finally, we highlight that in the case of short-selling (left subfigure) there are feasible portfolios which have higher expected returns than any asset in the universe. This is possible since the low-performing assets can be shorted and the proceeds invested in high-performing assets, amplifying their high returns. On the other hand, when short-selling is not allowed, the expected returns of any portfolio are restricted to the interval defined by the lowest and the highest empirical returns of the assets in the universe.
[Figure 3.2, left and right panels: "Markowitz Model: with short sales" and "Markowitz Model: without short sales"; each panel shows the efficient frontier and the individual assets (KO, JPM, AAPL, MMM) on the standard deviation versus expected mean returns plane.]
Figure 3.2: Efficient frontier for the Markowitz model with (Left) and without (Right) short-selling. The yellow points are the projections of single assets' historic performance on the σ-µ (i.e., volatility-returns) plane. The solid lines are the portfolios obtained by solving the Markowitz model optimization problem (3.2) for different values of µ̄_target. Note that the red points are rejected and only the blue locus is efficient.

In Section 2.3 we introduced various evaluation metrics that reflect our investment preferences. Extending the vanilla Markowitz model, which minimizes risk (i.e., variance) constrained on a predetermined profitability level (i.e., expected returns), we can select any metric as the optimization objective function, constrained on the budget (3.4) and any other criteria we favour, as long as they fit in the quadratic programming framework. More complex objectives may be solvable with special non-linear optimizers, but there is no general principle. In this section we show how to translate evaluation metrics into objective functions suitable for quadratic programming. Any of the functions presented can be used with or without short-selling, so in order to address the more difficult of the two cases, we will assume that only long positions are allowed. The other case is trivially obtained by dropping the constraint on the sign of the weights.
Motivated by the Lagrangian formulation of the Markowitz model (3.5), we define the Risk-Aversion portfolio, named after the risk-aversion coefficient α ∈ R_+, given by the solution of the program:

maximize_w   w^T µ − α w^T Σ w      (3.10)
subject to   1_M^T w = 1,   w ⪰ 0

where α is a model hyperparameter which reflects the trade-off between portfolio expected returns (w^T µ) and risk level (w^T Σ w) (Wilmott, 2007). For α → 0, the investor is risk-neutral and maximizes expected returns regardless of the risk taken, while for α → ∞, the investor is infinitely risk-averse and selects the least risky portfolio, regardless of its returns performance. Any positive value of α results in a portfolio which balances the two objectives, weighted by the risk-aversion coefficient. Figure 3.3 illustrates the volatility-returns plane for the same trading universe as in Figure 3.2, but the curve is obtained by the solution of (3.10). As expected, high values of α result in less volatile portfolios, while a low risk-aversion coefficient leads to higher returns.
[Figure 3.3, left and right panels: "Risk-Aversion: with short sales" and "Risk-Aversion: without short sales"; each panel shows the efficient frontier and the individual assets (including KO and AAPL) on the standard deviation versus expected mean returns plane, with a colorbar for the risk-aversion coefficient.]
Figure 3.3: Risk-Aversion optimization efficient frontier with (Right) and without (Left) short-selling. The yellow points are the projections of single assets' historic performance on the σ-µ (i.e., volatility-returns) plane. The blue points are the efficient portfolios, or equivalently the solutions of the risk-aversion optimization problem (3.10) for different values of α, designated by the opacity of the blue colour (see colorbar).

Both objective functions so far require hyperparameter tuning (µ̄_target or α), hence either cross-validation or hand-picked selection is required (Kennedy, 2016). On the other hand, in Section 2.3 we motivated the use of the Sharpe Ratio (2.35) as a Signal-to-Noise Ratio equivalent for finance, which is not parametric. Considering the Sharpe Ratio as the objective function, we obtain the program:

maximize_w   (w^T µ) / √(w^T Σ w)      (3.11)
subject to   1_M^T w = 1,   w ⪰ 0

3.3 Transaction Costs

In real stock exchanges, such as the NYSE (Wikipedia, 2018c), NASDAQ (Wikipedia, 2018b) and LSE (Wikipedia, 2018a), trading activities (buying or selling) are accompanied by expenses, including brokers' commissions and spreads (Investopedia, 2018g; Quantopian, 2017), usually referred to as transaction costs. Therefore, every time a new portfolio vector is determined (portfolio re-balancing), the corresponding transaction costs should be subtracted from the budget.

In order to simplify the analysis around transaction costs, we will use the rule of thumb (Quantopian, 2017) of charging 0.2% for every trading activity. For example, if three shares of stock A are bought at a price of 100$ each, then the transaction costs will be 0.6$. If the price of stock A rises to 107$ and we decide to sell all three shares, then the transaction costs will be 0.642$.

Given any objective function, J, the transaction costs are subtracted from the returns term in order to adjust the profit & loss, accounting for the expenses of trading activities. Therefore, we solve the optimization program:

maximize_w   J − β ‖w − w′‖_1      (3.12)
subject to   1_M^T w = 1,   w ⪰ 0

where β ∈ R is the transaction cost rate (i.e., 0.002 for the standard 0.2% commission) and w′ ∈ R^M is the portfolio held since the last re-balancing. All the parameters of the model (3.12) are given (not considering the objective function J's own parameters), since β is market-specific and w′ is the current position. Additionally, the transaction cost term can be seen as a regularization term which penalizes excessive trading and restricts large trades (i.e., large ‖w − w′‖_1).

The objective function J can be any function that can be optimized according to the framework developed in Section 3.2. For example, the risk-aversion optimization program with transaction costs is given by:

maximize_w   w^T µ − α w^T Σ w − β ‖w − w′‖_1      (3.13)

while the Sharpe Ratio optimization with transaction costs is:

maximize_w   (w^T µ − β ‖w − w′‖_1) / √(w^T Σ w)      (3.14)

We highlight that in (3.13) the transaction cost term is subtracted from J directly, since all terms have the same units (the parameter α is not dimensionless), while in (3.14) the transaction cost term is subtracted from the numerator (i.e., the expected returns) rather than from the ratio itself.
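As an illustration of how a cost-adjusted objective such as (3.13) can be solved numerically, the sketch below uses the cvxpy library for the long-only risk-aversion objective with an L1 transaction-cost penalty. The function name and the default values of α and β are assumptions for this example, not part of the original formulation.

```python
import cvxpy as cp

def rebalance(mu, Sigma, w_prev, alpha=1.0, beta=0.002):
    """Long-only risk-aversion objective with transaction costs, cf. (3.13)."""
    M = len(mu)
    w = cp.Variable(M)
    objective = cp.Maximize(mu @ w
                            - alpha * cp.quad_form(w, Sigma)   # risk penalty w^T Sigma w
                            - beta * cp.norm1(w - w_prev))     # 0.2% commission per unit of turnover
    constraints = [cp.sum(w) == 1, w >= 0]                     # budget and no short selling
    cp.Problem(objective, constraints).solve()
    return w.value
```

Here mu and Sigma are NumPy arrays holding the (PSD) sample estimates, and w_prev is the portfolio held since the last re-balancing.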
The involvement of transaction costs makes portfolio management a Multi-Stage Decision Problem (Neuneier, 1996), which in simple terms means that two sequences of states with the same start and end states can have different values. For instance, imagine two investors, both of whom have an initial budget of 100$. On Day 1, the first investor uses all of his budget to construct a portfolio according to his preferred objective function, paying 0.2$ in transaction costs, according to Quantopian (2017). By Day 3, the market prices have changed but the value of the first investor's portfolio has not, and he decides to liquidate all of his investment, paying another 0.2$ for selling his portfolio. On Day 5 both investors (re-)enter the market, make identical investments and hence pay the same commission fees. Obviously, the two investors have the same start and end states, but their intermediate trajectories lead to different reward streams.

From (3.12) it is obvious that w′ affects the optimal allocation w. In a sequential portfolio optimization setting, the past decisions (i.e., w′) have a direct impact on the optimality of the future decisions (i.e., w); therefore, apart from the maximization of immediate rewards, we should also focus on eliminating the negative effects on future decisions. As a consequence, sequential asset allocation is a multi-stage decision problem where myopically optimal actions can lead to sub-optimal cumulative rewards. This setting encourages the use of reinforcement learning agents, which aim to maximize long-term rewards, even if that means acting sub-optimally in the near future from the traditional portfolio optimization point of view.

Chapter 4
Reinforcement Learning
Reinforcement learning (RL) refers to both a learning problem and a subfield of machine learning (Goodfellow et al., 2016). As a learning problem (Szepesvári, 2010), it refers to learning to control a system (environment) so as to maximize some numerical value, which represents a long-term objective (discounted cumulative reward signal). Recalling the analysis in Section 3.3.2, sequential portfolio management is exactly such a problem.

In this chapter we introduce the necessary tools to analyze stochastic dynamical systems (Section 4.1). Moreover, we review the major components of a reinforcement learning algorithm (Section 4.2), as well as extensions of the formalization of dynamical systems (Section 4.4), enabling us to reuse some of these tools on more general, otherwise intractable, problems.

4.1 Dynamical Systems
Reinforcement learning is suitable for optimally controlling dynamical systems, such as the general one illustrated in Figure 4.1: a controller (agent) receives the controlled state of the system and a reward associated with the last state transition. It then calculates a control signal (action) which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated. The goal is to learn a way of controlling the system (policy) so as to maximize the total reward. The focus of this report is on discrete-time dynamical systems, though most of the notions developed extend to continuous-time systems.
[Figure 4.1: High-level agent-environment interaction loop, showing the agent, the environment and the reward r_t.]
The term agent is used to refer to the controller, while environment is used interchangeably with the term system. The goal of a reinforcement learning algorithm is the development (training) of an agent capable of successfully interacting with the environment, such that it maximizes some scalar objective over time.
An action a_t ∈ A is the control signal that the agent sends back to the system at time index t. It is the only way that the agent can influence the environment state and, as a result, lead to different reward signal sequences. The action space A refers to the set of actions that the agent is allowed to take, and it can be:
• Discrete: A = {a_1, a_2, ..., a_M};
• Continuous: A ⊆ [c, d]^M.

A reward r_t ∈ B ⊆ R is a scalar feedback signal, which indicates how well the agent is doing at discrete time step t. The agent aims to maximize cumulative reward over a sequence of steps.

Reinforcement learning addresses sequential decision making tasks (Silver, 2015b) by training agents that optimize delayed rewards and can evaluate the long-term consequences of their actions, being able to sacrifice immediate reward to gain more long-term reward. This special property of reinforcement learning agents is very attractive for financial applications, where investment horizons range from a few days and weeks to years or decades. In the latter cases, myopic agents can perform very poorly, since evaluation of long-term rewards is essential in order to succeed (Mnih et al., 2016). However, the applicability of reinforcement learning depends vitally on the following hypothesis:

Hypothesis 4.1 (Reward Hypothesis)
All goals can be described by the maximization of expected cumulative reward.
Consequently, the selection of the appropriate reward signal for each application is crucial. It influences the strategies the agent learns, since it reflects the agent's goals. In Section 5.4, a justification for the selected reward signal is provided, along with an empirical comparison against other metrics mentioned in Section 2.3.
The state, s_t ∈ S, is also a fundamental element of reinforcement learning, but the term is usually used to refer to both the environment state and the agent state. The agent does not always have direct access to the state; at every time step, t, it receives an observation, o_t ∈ O.

Environment State
The environment state s_t^e is the internal representation of the system, used in order to determine the next observation o_{t+1} and reward r_{t+1}. The environment state is usually invisible to the agent and, even if it is visible, it may contain irrelevant information (Sutton and Barto, 1998).

Agent State
The history h⃗_t at time t is the sequence of observations, actions and rewards up to time step t, such that:

h⃗_t = (o_1, a_1, r_1, o_2, a_2, r_2, ..., o_t, a_t, r_t)      (4.1)

The agent state (a.k.a. state) s_t^a is the internal representation of the agent about the environment, used in order to select the next action a_{t+1}, and it can be any function of the history:

s_t^a = f(h⃗_t)      (4.2)

The term state space S is used to refer to the set of possible states the agent can observe or construct. Similar to the action space, it can be:
• Discrete: S = {s_1, s_2, ..., s_N};
• Continuous: S ⊆ R^N.

Observability
Fully observable environments allow the agent to directly observe the environment state, hence:

o_t = s_t^e = s_t^a      (4.3)

Partially observable environments offer only indirect access to the environment state, therefore the agent has to construct its own state representation s_t^a (Graves, 2012), using either:
• the complete history: s_t^a ≡ h⃗_t; or
• a recurrent neural network: s_t^a ≡ f(s_{t-1}^a, o_t; θ).

Upon modifying the basic dynamical system in Figure 4.1 in order to take partial observability into account, we obtain the schematic in Figure 4.2. Note that f is a function unknown to the agent, which has access to the observation o_t but not to the environment state s_t. Moreover, R_s^a and P_{ss'}^a are the reward generating function and the transition probability matrix (function) of the MDP, respectively. Treating the system as a probabilistic graphical model, the state s_t is a latent variable that either deterministically or stochastically (depending on the nature of f) determines the observation o_t. In a partially observable environment, the agent needs to reconstruct the environment state, either by using the complete history h⃗_t or a stateful sequential model (i.e., a recurrent neural network, see (2.49)).

Figure 4.2: High-level stochastic partially observable dynamical system schematic.

4.2 Major Components of Reinforcement Learning

Reinforcement learning agents may include one or more of the following components (Silver, 2015b):
• Policy: the agent's behaviour function;
• Value function: how good each state, or state-action pair, is;
• Model: the agent's representation of the environment.

In this section, we discuss these components and highlight their importance and impact on algorithm design.
Let γ be the discount factor of future rewards; then the return G_t (also known as the future discounted reward) at time index t is given by:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = ∑_{k=0}^{∞} γ^k r_{t+k+1},   γ ∈ [0, 1]      (4.4)

Policy, π, refers to the behaviour of an agent. It is a mapping function from state to action (Witten, 1977), such that:

π : S → A      (4.5)

where S and A are respectively the state space and the action space. A policy function can be:
• Deterministic: a_{t+1} = π(s_t);
• Stochastic: π(a|s) = P[a_t = a | s_t = s].

The state-value function, v_π, is the expected return, G_t, starting from state s and then following policy π (Szepesvári, 2010), that is:

v_π : S → B,   v_π(s) = E_π[G_t | s_t = s]      (4.6)

where S and B are respectively the state space and the rewards set (B ⊆ R).

The action-value function, q_π, is the expected return, G_t, starting from state s, upon taking action a, and then following policy π (Silver, 2015b):

q_π : S × A → B,   q_π(s, a) = E_π[G_t | s_t = s, a_t = a]      (4.7)

where S, A and B are the state space, the action space and the reward set, respectively.

A model predicts the next state of the environment, s_{t+1}, and the corresponding reward signal, r_{t+1}, given the current state, s_t, and the action taken, a_t, at time step t. It can be represented by a state transition probability matrix P, given by:

P_{ss'}^a : S × A → S,   P_{ss'}^a = P[s_{t+1} = s' | s_t = s, a_t = a]      (4.8)

and a reward generating function R:

R : S × A → B,   R_s^a = E[r_{t+1} | s_t = s, a_t = a]      (4.9)

where S, A and B are the state space, the action space and the reward set, respectively.
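As a quick illustration of the return definition (4.4), the following sketch accumulates a finite reward sequence backwards; the function name and the toy rewards are purely illustrative.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence, cf. (4.4)."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```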
4.3 Markov Decision Process

A special type of discrete-time stochastic dynamical system is the Markov Decision Process (MDP). MDPs possess strong properties that guarantee convergence to the globally optimal policy (i.e., strategy), while, by relaxing some of their assumptions, they can describe any dynamical system, providing a powerful representation framework and a common way of controlling dynamical systems.

A state s_t (Silver, 2015c) satisfies the Markov property if and only if (iff):

P[s_{t+1} | s_t, s_{t-1}, ..., s_1] = P[s_{t+1} | s_t]      (4.10)

This implies that the current state, s_t, is a sufficient statistic for predicting the future, and therefore the longer-term history, h⃗_t, can be discarded.

Any fully observable environment which satisfies equation (4.3) can be modelled as a
Markov Decision Process (MDP). A Markov Decision Process (Poole and Mackworth, 2010) is an object (i.e., a 5-tuple) ⟨S, A, P, R, γ⟩ where:
• S is a finite set of states (state space), such that they satisfy the Markov property, as in definition (4.10);
• A is a finite set of actions (action space);
• P is a state transition probability matrix;
• R is a reward generating function;
• γ is a discount factor.

Apart from their expressiveness, MDPs can be solved optimally, making them very attractive.
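As an illustration of the 5-tuple above, a finite MDP can be stored directly as tabular arrays and sampled from; the class and field names below are hypothetical and chosen only for this sketch.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """A finite MDP <S, A, P, R, gamma> with tabular dynamics."""
    P: np.ndarray      # transition probabilities, shape (|A|, |S|, |S|); P[a, s, s']
    R: np.ndarray      # expected rewards, shape (|A|, |S|); R[a, s]
    gamma: float       # discount factor

    def step(self, s, a, rng=None):
        """Sample the next state s' ~ P[a, s, .] and return it with the expected reward."""
        rng = rng or np.random.default_rng()
        s_next = rng.choice(self.P.shape[-1], p=self.P[a, s])
        return s_next, self.R[a, s]
```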
Value Function
The optimal state-value function, v*, is the maximum state-value function over all policies:

v*(s) = max_π v_π(s),   ∀ s ∈ S      (4.11)

The optimal action-value function, q*, is the maximum action-value function over all policies:

q*(s, a) = max_π q_π(s, a),   ∀ s ∈ S, a ∈ A      (4.12)

Policy
Define a partial ordering over policies (Silver, 2015c):

π ≤ π′  ⟺  v_π(s) ≤ v_{π′}(s),   ∀ s ∈ S      (4.13)

For an MDP the following theorems hold:

Theorem 4.2 (Policy Optimality)
There exists an optimal policy, π*, that is better than or equal to all other policies, such that π* ≥ π, ∀ π. (The proofs are based on the contraction property of the Bellman operator (Poole and Mackworth, 2010).)
Theorem 4.3 (State-Value Function Optimality)
All optimal policies achieve the optimal state-value function, such that v_{π*}(s) = v*(s), ∀ s ∈ S.

Theorem 4.4 (Action-Value Function Optimality)
All optimal policies achieve the optimal action-value function, such that q_{π*}(s, a) = q*(s, a), ∀ s ∈ S, a ∈ A.

Given a Markov Decision Process ⟨S, A, P, R, γ⟩, because the states in S satisfy the Markov property (4.10):

• The policy π is a distribution over actions given states:

π(a|s) = P[a_t = a | s_t = s]      (4.14)

Without loss of generality we assume that the policy π is stochastic, because of the state transition probability matrix P (Sutton and Barto, 1998). Owing to the Markov property, MDP policies depend only on the current state and are time-independent (stationary) (Silver, 2015c), such that a_t ∼ π(·|s_t), ∀ t > 0.

• The state-value function v_π can be decomposed into two parts: the immediate reward r_{t+1} and the discounted value of the successor state, γ v_π(s_{t+1}):

v_π(s) = E_π[G_t | s_t = s]
       = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s]          by (4.4)
       = E_π[r_{t+1} + γ (r_{t+2} + γ r_{t+3} + ...) | s_t = s]
       = E_π[r_{t+1} + γ G_{t+1} | s_t = s]                              by (4.4)
       = E_π[r_{t+1} + γ v_π(s_{t+1}) | s_t = s]                         by (4.10)      (4.16)

• The action-value function q_π can be similarly decomposed:

q_π(s, a) = E_π[r_{t+1} + γ q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]      (4.17)

Equations (4.16) and (4.17) are the Bellman Expectation Equations for Markov Decision Processes, formulated by Bellman (1957).

Search, or seeking a goal under uncertainty, is a ubiquitous requirement of life (Hills et al., 2015); metaphorically speaking, an agent "lives" in the environment. Not only machines but also humans and animals usually face the trade-off between exploiting known opportunities and exploring for better opportunities elsewhere. This is a fundamental dilemma in reinforcement learning, where the agent may need to act "sub-optimally" in order to explore new possibilities, which may lead it to better strategies. Every reinforcement learning algorithm takes this trade-off into account, trying to balance the search for new opportunities (exploration) with secure actions (exploitation); here "secure" does not refer to any risk-sensitive metric or strategy, but describes actions that have been tried in the past and whose outcomes are predictable to some extent. From an optimization point of view, if an algorithm is greedy and only exploits, it may converge fast, but it runs the risk of getting stuck in a local minimum. Exploration may at first slow down convergence, but it can lead to previously unexplored regions of the search space, resulting in an improved solution. Most algorithms perform exploration either by artificially adding noise to the actions, which is attenuated while the agent gains experience, or by modelling the uncertainty of each action (Gal, 2016) in the Bayesian optimization framework.

Markov Decision Processes can be solved optimally by reinforcement learning agents (Sutton and Barto, 1998; Szepesvári, 2010). Nonetheless, most real-life applications do not satisfy one or more of the conditions stated in Section 4.3.2. As a consequence, modifications of these conditions lead to other types of processes, such as the Infinite MDP and the Partially Observable MDP, which in turn can realistically fit many application domains.
In the case of either the state space S, or the action space A, or both being infinite (countably infinite or continuous), the environment can be modelled as an Infinite Markov Decision Process (IMDP). Therefore, in order to implement the policy π or/and the action-value function q_π on a computer, a differentiable function approximation method must be used (Sutton et al., 2000a), such as a least-squares model or a neural network.

The state-action dynamics of an IMDP are described by a transition probability function P_{ss'}^a and not a matrix, since the state or/and the action spaces are continuous.

If s_t^e ≠ s_t^a, then the environment is partially observable and it can be modelled as a Partially Observable Markov Decision Process (POMDP). A POMDP is a tuple ⟨S, A, O, P, R, Z, γ⟩ where:
• O is a finite set of observations (observation space);
• Z is an observation function, Z_{s'o}^a = P[o_{t+1} = o | s_{t+1} = s', a_t = a].

It is important to note that any dynamical system can be viewed as a POMDP, and all the algorithms used for MDPs remain applicable, though without convergence guarantees.
Part II
Innovation

Chapter 5
Financial Market as Discrete-Time Stochastic Dynamical System
In Chapter 3, the task of static asset allocation, as well as traditional methods for addressing it, were introduced. Our interest in dynamically (i.e., sequentially) constructing portfolios led to studying the basic components and concepts of reinforcement learning in Chapter 4, which suggest a framework for dealing with sequential decision making tasks. However, in order to leverage the reinforcement learning tools, it is necessary to translate the problem (i.e., asset allocation) into a discrete-time stochastic dynamical system and, in particular, into a Markov Decision Process (MDP). Note that not all of the strong assumptions of an MDP (Section 4.3.2) can be satisfied, hence we resort to relaxing some of the assumptions and considering the MDP extensions discussed in Section 4.4. However, the convergence and optimality guarantees are obviously not applicable under this formalization.

In this chapter, we mathematically formalize financial markets as discrete-time stochastic dynamical systems. Firstly, we consider the necessary assumptions for this formalization (Section 5.1), followed by the framework (Sections 5.2, 5.3, 5.4) which enables reinforcement learning agents to interact with the financial market in order to optimally address portfolio management.
Only back-test trading is considered, where the trading agent pretends to be back in time at a point in the market history, not knowing any "future" market information, and does paper trading from then onward (Jiang et al., 2017). As a requirement for the back-test experiments, the following three assumptions must apply: sufficient liquidity, zero slippage and zero market impact, all of which are realistic if the traded assets' volume in a market is high enough (Wilmott, 2007).

An asset is termed liquid if it can be converted into cash quickly, with little or no loss in value (Investopedia, 2018c).
Assumption 5.1 (Sufficient Liquidity)
All market assets are liquid and every transaction can be executed under the same conditions.
Slippage refers to the difference between the expected price of a trade and the price at which the trade is actually executed (Investopedia, 2018e).
Assumption 5.2 (Zero Slippage)
The liquidity of all market assets is high enough that each trade can be carried out immediately, at the last price, when an order is placed.
Asset prices are determined by the
Law of Supply and Demand (Investopedia, 2018b); therefore any trade impacts the balance between them, and hence affects the price of the asset being traded.
Assumption 5.3 (Zero Market Impact)
The capital invested by the trading agent is so insignificant that it has no influence on the market.
In order to solve the asset allocation task, the trading agent should be able to determine the portfolio vector w_t at every time step t; therefore the action a_t at time t is the portfolio vector w_{t+1} at time t+1:

a_t ≡ w_{t+1} ≜ [w_{1,t+1}, w_{2,t+1}, ..., w_{M,t+1}]^T      (5.1)

hence the action space A is a subset of the continuous M-dimensional real space R^M:

a_t ∈ A ⊆ R^M,  ∀ t ≥ 0,  with  ∑_{i=1}^{M} a_{i,t} = 1      (5.2)

or, when short selling is not allowed,

a_t ∈ A ⊆ [0, 1]^M,  ∀ t ≥ 0,  with  ∑_{i=1}^{M} a_{i,t} = 1      (5.3)

5.3 State & Observation Space

At any time step t, we can only observe asset prices, thus the price vector p_t (2.3) is the observation o_t, or equivalently:

o_t ≡ p_t ≜ [p_{1,t}, p_{2,t}, ..., p_{M,t}]^T      (5.4)

hence the observation space O is a subset of the continuous M-dimensional positive real space R^M_+, since prices are non-negative real values:

o_t ∈ O ⊆ R^M_+,  ∀ t ≥ 0      (5.5)

Since the observations o_t are not sufficient statistics of the environment state, financial markets are partially observable (Silver, 2015b). As a consequence, equation (4.3) is not satisfied and we should construct the agent's state s_t^a by processing the observations o_t ∈ O. In Section 4.1.4, two alternatives to deal with partial observability were suggested, considering:
1. Complete history: s_t^a ≡ h⃗_t, as in (4.1);
2. Recurrent neural network: s_t^a ≡ f(s_{t-1}^a, o_t; θ);

(If prices were a VAR(1) process (Mandic, 2018a), financial markets would be pure MDPs.)
where in both cases we assume that the agent state approximates the environment state, s_t^a = ŝ_t^e ≈ s_t^e. While the first option may contain all the environment information up to time t, it does not scale well, since the memory and computational load grow linearly with time t. A GRU-RNN (see Section 2.4.2), on the other hand, can store and process the historic observations efficiently, in an adaptive manner as they arrive, filtering out any uninformative observations. We will be referring to this recurrent layer as the state manager, since it is responsible for constructing (i.e., managing) the agent state. This layer can be part of any neural network architecture, enabling end-to-end differentiability and training.

Figure 5.1 illustrates example financial market observations o_t and the corresponding actions a_t of a random agent.
Figure 5.1: Example universe of assets as a dynamical system, including AAPL (Apple), GE (General Electric), BA (Boeing Company) and XOM (Exxon Mobil Corporation). (Left) Financial market asset prices, observations o_t. (Right) Portfolio manager (agent) portfolio vectors, actions a_t; the portfolio coefficients are illustrated in a stacked bar chart, where at each time step they sum to unity according to equation (2.1).

In order to assist and speed up the training of the state manager, we process the raw observations o_t, obtaining ŝ_t. In particular, thanks to the representational and statistical superiority of log returns over asset prices and simple returns (see 2.2.2), we use the log returns matrix ρ⃗_{t-T:t} (2.10) of fixed window size T. (Expanding windows would be more appropriate if the RNN state manager were replaced by the complete history h⃗_t.) We also exploit another important property of log returns, which suits the nature of the operations performed by neural networks, the function approximators used for building agents. Neural network layers apply non-linearities to weighted sums of input features, hence the features do not interact (multiplicatively) with one another, but only with the layer weights (i.e., parameters). (This is the issue that multiplicative neural networks (Salinas and Abbott, 1996) try to address: consider two scalar feature variables x_1 and x_2 and the target scalar variable y such that f(x_1, x_2) ≜ y = x_1 · x_2, with x_1, x_2 ∈ R (5.6). It is very hard for a neural network to learn this function, but a logarithmic transformation of the features turns the problem into a very simple sum of logarithms, using the property log(x_1) + log(x_2) = log(x_1 · x_2) = log(y) (5.7).) Nonetheless, by summing log returns we equivalently multiply the corresponding gross returns, hence the networks learn non-linear functions of products of returns (i.e., asset-wise and cross-asset), which are the building blocks of the covariances between assets. Therefore, by simply using log returns, we enable cross-asset dependencies to be easily captured.

Moreover, transaction costs are taken into account, and since (3.12) suggests that the previous time step's portfolio vector w_{t-1} affects transaction costs, we also append w_t, or equivalently a_{t-1} by (5.1), to the agent state, obtaining the 2-tuple:

ŝ_t ≜ ⟨ w_t, ρ⃗_{t-T:t} ⟩ = ⟨ [w_{1,t}, w_{2,t}, ..., w_{M,t}]^T ,
      [ ρ_{1,t-T}  ρ_{1,t-T+1}  ···  ρ_{1,t} ]
      [ ρ_{2,t-T}  ρ_{2,t-T+1}  ···  ρ_{2,t} ]
      [    ...        ...       ...    ...   ]
      [ ρ_{M,t-T}  ρ_{M,t-T+1}  ···  ρ_{M,t} ]  ⟩      (5.8)

where ρ_{i,(t-τ)→t} denotes the cumulative log return of asset i over the time interval [t-τ, t]; most of these terms can be stored or pre-calculated. Overall, the agent state is given by:

s_t^a ≡ f(s_{t-1}^a, ŝ_t; θ)      (5.9)

where f is the state manager's non-linear mapping function. When an observation arrives, we calculate ŝ_t and feed it to the state manager (GRU-RNN). Therefore, the state space S is a subset of the continuous K-dimensional real space R^K, where K is the size of the hidden state of the GRU-RNN state manager:
s_t^a ∈ S ⊆ R^K,  ∀ t ≥ 0      (5.10)

Figure 5.2 illustrates examples of the processed observation 2-tuples ŝ_t. The agent uses this representation as input to determine its internal state, which in turn drives its policy. To the naked eye these inputs look meaningless and impossible to generalize from; nonetheless, in Chapter 6 we demonstrate the effectiveness of this representation, especially thanks to the combination of convolutional and recurrent layers.

Overall, the financial market should be modelled as an Infinite Partially Observable Markov Decision Process (IPOMDP), since:
• The action space is continuous (infinite), A ⊆ R^M;
• The observations o_t are not sufficient statistics (partial observability) of the environment state;
• The state space is continuous (infinite), S ⊆ R^K.

Figure 5.2: Examples of processed observation 2-tuples for two randomly selected time steps.
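A minimal sketch of how the 2-tuple (5.8) could be assembled from a window of raw prices; the function name and array layout are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import numpy as np

def build_agent_observation(prices, w_current, T):
    """Assemble s_hat_t = <w_t, log-return window>, cf. (5.8).

    prices    : array of shape (T + 1, M), the last T + 1 asset prices
    w_current : current portfolio vector w_t, shape (M,)
    T         : look-back window length
    """
    log_returns = np.diff(np.log(prices), axis=0)   # shape (T, M)
    return w_current, log_returns.T                 # shapes (M,) and (M, T)
```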
5.4 Reward Signal

The determination of the reward signal is usually the most challenging step in the design of a reinforcement learning problem. According to the Reward Hypothesis 4.1, the reward is a scalar value which fully specifies the goals of the agent, and the maximization of the expected cumulative reward leads to the optimal solution of the task. Specifying the optimal reward generating function is the field of study of Inverse Reinforcement Learning (Ng and Russell, 2000) and Inverse Optimal Control (Moylan and Anderson, 1973).

In our case, we develop a generic, modular framework which enables comparison of various reward generating functions, including log returns, (negative) volatility and the Sharpe Ratio. Section 2.3 motivates a few reward function candidates, most of which are implemented and tested in Chapter 9; transaction costs are included in all cases. It is worth highlighting that reinforcement learning methods, by default, aim to maximize the expected cumulative reward signal, hence the optimization problem that the agent (parametrized by θ) solves is given by:

maximize_θ   ∑_{t=1}^{T} E[γ^t r_t]      (5.11)

For instance, when we refer to log returns (with transaction costs) as the reward generating function, the agent solves the optimization problem:

maximize_θ   ∑_{t=1}^{T} E[ γ^t ln(1 + w_t^T r_t − β ‖w_{t-1} − w_t‖_1) ]      (5.12)

where the argument of the logarithm is the gross return at time index t, adjusted for transaction costs (see (2.12) and (3.12)).
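The per-step reward appearing inside (5.12) can be computed as in the sketch below, assuming the 0.2% commission rule of thumb; the function name is hypothetical.

```python
import numpy as np

def log_return_reward(w_prev, w_new, asset_returns, beta=0.002):
    """Log gross portfolio return net of transaction costs, cf. (5.12)."""
    cost = beta * np.abs(w_new - w_prev).sum()       # commission proportional to turnover
    return np.log(1.0 + w_new @ asset_returns - cost)
```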
Chapter 6
Trading Agents

Current state-of-the-art algorithmic portfolio management methods:
• Address the decision making task of asset allocation by solving a prediction problem, heavily relying on the accuracy of predictive models for financial time-series (Aldridge, 2013; Heaton et al., 2017), which are usually unsuccessful due to the stochasticity of financial markets;
• Make unrealistic assumptions about the second and higher-order statistical moments of the financial signals (Necchi, 2016; Jiang et al., 2017);
• Deal with binary trading signals (i.e., BUY, SELL, HOLD) (Neuneier, 1996; Deng et al., 2017), instead of assigning portfolio weights to each asset, hence limiting the scope of their applications.

On the other hand, the representation of the financial market as a discrete-time stochastic dynamical system, as derived in Chapter 5, enables the development of a unified framework for training reinforcement learning trading agents. In this chapter, this framework is exploited by:
• Model-based reinforcement learning agents, as in Section 6.1, where vector autoregressive processes (VAR) and recurrent neural networks (RNN) are fitted to the environment dynamics, while the derived agents perform planning and control (Silver, 2015a). Similar to (Aldridge, 2013; Heaton et al., 2017), these agents are based on a predictive model of the environment, which is in turn used for decision making. Their performance is similar to known algorithms and thus they are used as baseline models for comparison;
• Model-free reinforcement learning agents, as in Section 6.2, which directly address the decision making task of sequential and multi-step optimization. Modifications to state-of-the-art reinforcement learning algorithms, such as the Deep Q-Network (DQN) (Mnih et al., 2015) and the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), enable their incorporation into the trading agent training framework.

Algorithm 1 provides the general setup for the reinforcement learning algorithms discussed in this chapter, based on which experiments on a small universe of real market data are carried out, for testing their efficacy and for illustration purposes. In Part III, all the different agents are compared on a larger universe of assets with different reward functions, a more realistic and practical setting.
Algorithm 1: General setup for trading agents.
  inputs : trading universe of M assets;
           initial portfolio vector w_1 = a_0;
           initial asset prices p_1 = o_1;
           objective function J
  output : optimal agent parameters θ*, φ*
  repeat
      for t = 1, 2, ..., T do
          observe 2-tuple ⟨o_t, r_t⟩
          calculate gradients ∇_θ J(r_t) and ∇_φ J(r_t)                     // BPTT
          update agent parameters θ, φ using adaptive gradient optimizers   // ADAM
          get estimate of agent state: s_t ≈ f(·, o_t)                      // (5.9)
          sample and take action: a_t ∼ π(·|s_t; φ)                         // portfolio rebalance
      end
  until convergence
  set θ*, φ* ← θ, φ

6.1 Model-Based Reinforcement Learning

Upon a revision of the schematic of a generic partially observable environment (i.e., dynamical system) as in Figure 4.2, it is noted that, given the transition probability function P_{ss'}^a of the system, the reinforcement learning task reduces to planning (Atkeson and Santamaria, 1997): simulate future states by recursively calling P_{ss'}^a L times and choose the roll-outs (i.e., trajectories) which maximize cumulative reward, via dynamic programming (Bertsekas et al., 1995):

s_t --P_{ss'}^a--> s_{t+1} --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L}      (6.1)

a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.2)

Note that, due to the assumptions made in Section 5.1, and especially the Zero Market Impact Assumption 5.3, the agent actions do not affect the environment state transitions, or equivalently the financial market is an open-loop system (Feng and Palomar, 2016), where the agent actions do not modify the system state, but only the received reward:

p(s_{t+1} | s, a) = p(s_{t+1} | s)  ⟹  P_{ss'}^a = P_{ss'}      (6.3)

Moreover, the reward generating function is known, as explained in Section 5.4, hence a model of the environment is obtained by learning only the transition probability function P_{ss'}.

In the area of Signal Processing and Control Theory, the task under consideration is usually termed
System Identification (SI), where an approximation of the environment, the "model", is fitted such that it captures the environment dynamics:

P̂_{ss'} ≈ P_{ss'}   (model ≈ environment)      (6.4)

Figure 6.1 illustrates schematically the system identification wiring of the circuit, where the model, represented by P̂_{ss'}, is compared against the true transition probability function P_{ss'}, and the loss function L (i.e., mean squared error) is minimized by optimizing with respect to the model parameters θ. It is worth highlighting that the transition probability function is by definition stochastic (4.8), hence the candidate fitted models should ideally be able to capture and incorporate this uncertainty. As a result, model-based reinforcement learning usually (Deisenroth and Rasmussen, 2011; Levine et al., 2016; Gal et al., 2016) relies on probabilistic graphical models, such as Gaussian Processes (Rasmussen, 2004) or Bayesian Networks (Ghahramani, 2001), which are non-parametric models that do not output point estimates, but learn the generating process of the data p_data, and hence enable sampling from the posterior distribution. Sampling from the posterior distribution allows us to obtain stochastic predictions that respect the model dynamics.

Figure 6.1: General setup for System Identification (SI) (i.e., model-based reinforcement learning) for solving a discrete-time stochastic partially observable dynamical system.

In this section we shall focus, nonetheless, only on vector autoregressive processes (VAR) and recurrent neural networks (RNN) for modelling P_{ss'}, trained in an adaptive fashion (Mandic and Chambers, 2001), as given by Algorithm 2. An extension of vanilla RNNs to Bayesian RNNs could also be tried, using the MC-dropout trick of Gal and Ghahramani (2016).

Following on from the introduction of vector autoregressive processes (VAR) in Section 2.4.1, and using the fact that the transition probability model P̂_{ss'} is a one-step time-series predictive model, we investigate the effectiveness of VAR processes as time-series predictors.

Agent Model
Vector autoregressive processes (VAR) regress past values of a multivariate time-series against its future values (see equation (2.44)). In order to satisfy the covariance stationarity assumption (Mandic, 2018a), we fit a VAR process on the log returns ρ_t, and not on the raw observations o_t (i.e., price vectors), since the latter are known to be highly non-stationary (in the wide sense (Bollerslev, 1986)). The model is pre-trained on historic data (i.e., batch supervised learning (Murphy, 2012)) and is updated online, following the gradient ∇_θ L(ρ̂_t, ρ_t), as described in Algorithm 2.
Algorithm 2: General setup for adaptive model-based trading agents.
  inputs : trading universe of M assets;
           initial portfolio vector w_1 = a_0;
           initial asset prices p_1 = o_1;
           loss function L;
           historic dataset D
  output : optimal model parameters θ*
  batch training on D: θ ← argmax_θ p(θ | D)                              // MLE
  repeat
      for t = 1, 2, ..., T do
          predict next state ŝ_t                                           // via P̂_ss'
          observe tuple ⟨o_t, r_t⟩
          get estimate of agent state: s_t ≈ f(·, o_t)                     // (5.9)
          calculate gradients: ∇_θ L(ŝ_t, s_t)                             // backprop
          update model parameters θ using adaptive gradient optimizers     // ADAM
          plan and take action a_t                                         // portfolio rebalance
      end
  until convergence
  set θ* ← θ

The model takes the form:

P_{ss'} :  ρ_t ≈ c + ∑_{i=1}^{p} A_i ρ_{t-i}   (cf. (2.44))      (6.5)

planning :  ( s_t --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L} )  ⟹ by (6.2)  a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.6)
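A self-contained least-squares sketch of fitting the VAR(p) model (6.5) on a window of log returns and producing the one-step prediction used for planning; the helper names are illustrative and this is not the exact training loop of Algorithm 2.

```python
import numpy as np

def fit_var(rho, p):
    """Least-squares fit of a VAR(p) on log returns rho of shape (N, M), cf. (6.5)."""
    N, M = rho.shape
    # regressors for each t = p..N-1: [1, rho_{t-1}, ..., rho_{t-p}]
    X = np.hstack([np.ones((N - p, 1))] +
                  [rho[p - i - 1:N - i - 1] for i in range(p)])
    Y = rho[p:]                                        # targets rho_t
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)      # shape (1 + p*M, M)
    return theta

def var_predict(theta, recent_rho, p):
    """One-step-ahead prediction rho_hat_{t+1} from the last p observations (rows in time order)."""
    x = np.concatenate([[1.0]] + [recent_rho[-i - 1] for i in range(p)])
    return x @ theta
```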
Related Work

Vector autoregressive processes have been widely used for modelling financial time-series, especially returns, due to their pseudo-stationary nature (Tsay, 2005). There are no results in the open literature on the use of VAR processes in the context of model-based reinforcement learning; nonetheless, control engineering applications (Akaike, 1998) have extensively used autoregressive models to deal with dynamical systems.
Evaluation
Figure 6.2 illustrates the cumulative rewards and the prediction error power. Observe that the agent is highly correlated with the market (i.e., S&P 500) and overall collects lower cumulative returns. Moreover, note that the market crash in 2009 (Farmer, 2012) affects the agent significantly, leading to a 179.6% decline (drawdown), from which it took from 2008-09-12 until 2015-10-22 to recover.

Figure 6.2: Order-eight vector autoregressive model-based reinforcement learning agent on a 12-asset universe (i.e., VAR(8)), pre-trained on historic data between 2000 and 2005 and trained online onward. (Left) Cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY). (Right) Mean squared prediction error for single-step predictions.
Weaknesses
The order-p VAR model, VAR(p), assumes that the underlying generating process:
1. Is covariance stationary;
2. Satisfies the order-p Markov property;
3. Is linear, conditioned on past samples.
Unsurprisingly, most of these assumptions are not realistic for real market data, which is reflected in the poor performance of the method illustrated in Figure 6.2.
Expected Properties 6.1 (Non-Stationary Dynamics)
The agent should be able to capture non-stationary dynamics.
Expected Properties 6.2 (Long-Term Memory)
The agent should be able to selectively remember past events without brute-force memory mechanisms (e.g., using lagged values as features).
Expected Properties 6.3 (Non-Linear Model)
The agent should be able to learn non-linear dependencies between features.
The limitations of the vector autoregressive processes regarding the stationarity, linearity and finite memory assumptions are overcome by the recurrent neural network (RNN) environment model. Inspired by the effectiveness of recurrent neural networks in time-series prediction (Gers et al., 1999; Mandic and Chambers, 2001; Hénaff et al., 2011) and the encouraging results obtained from the initial one-step predictive GRU-RNN model in Figure 2.16, we investigate the suitability of RNNs in the context of model-based reinforcement learning, used as environment predictors.
Agent Model
Revisiting Algorithm 2, along with the formulation of RNNs in Section 2.4.2, we highlight the steps:

state manager :  s_t = f(s_{t-1}, ρ_t)   (cf. (4.2))      (6.7)
prediction    :  ρ̂_{t+1} ≈ V σ(s_t) + b      (6.8)
planning      :  ( s_t --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L} )  ⟹ by (6.2)  a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.9)

where V and b are the weights matrix and bias vector of the output affine layer of the network, and σ is a non-linearity (e.g., rectified linear unit (Nair and Hinton, 2010), hyperbolic tangent, or sigmoid function). Schematically, the network is depicted in Figure 6.3.

(In the Deep Learning literature (Goodfellow et al., 2016), the term affine refers to a neural network layer with parameters W and b that maps an input matrix X to an output ŷ according to f_affine(X; W, b) = ŷ ≜ XW + b (6.10). The terms affine, fully-connected (FC) and dense refer to the same layer configuration.)
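A minimal PyTorch sketch of the two-layer GRU state manager with an affine prediction head, mirroring (6.7)-(6.8); the class name, hidden size and the choice of ReLU for σ are assumptions made for illustration rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class GRUPredictor(nn.Module):
    """Two-layer GRU state manager with an affine output head, cf. (6.7)-(6.8)."""
    def __init__(self, n_assets, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(n_assets, hidden_size, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_size, n_assets)        # affine layer: V sigma(s_t) + b

    def forward(self, log_returns, state=None):
        # log_returns: (batch, T, n_assets) window of past log returns
        out, state = self.gru(log_returns, state)            # builds the agent state s_t
        return self.head(torch.relu(out[:, -1])), state      # one-step prediction rho_hat_{t+1}

model = GRUPredictor(n_assets=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # ADAM adaptive optimizer
loss_fn = nn.MSELoss()                                        # mean squared prediction error
```

An online update would then call the model on the latest window, compare the prediction against the realized log returns with the loss function, back-propagate and step the optimizer, as in Algorithm 2.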
Related Work

Despite the fact that RNNs were first used decades ago (Hopfield, 1982; Hochreiter and Schmidhuber, 1997; Mandic and Chambers, 2001), recent advances in adaptive optimizers (e.g., RMSProp (Tieleman and Hinton, 2012), ADAM (Kingma and Ba, 2014)) and deep learning have enabled the development of deep recurrent neural networks for sequential data modelling (e.g., time-series, text). Since financial markets are dominated by dynamic structures and time-series, RNNs have been extensively used for modelling dynamic financial systems (Tino et al., 2001; Chen et al., 2015; Heaton et al., 2016; Bao et al., 2017). In most cases, feature engineering is performed prior to training, so that meaningful financial signals are combined instead of raw series. Our approach is rather context-free, performing pure technical analysis of the series, without involving manual extraction and validation of higher-order features.
Figure 6.3: Two-layer gated recurrent unit recurrent neural network (GRU-RNN) model-based reinforcement learning agent; it receives log returns ρ_t as input, builds an internal state s_t and estimates the future log returns ρ̂_{t+1}. Regularized mean squared error is used as the loss function, optimized with the ADAM (Kingma and Ba, 2014) adaptive optimizer.

Evaluation
Figure 6.4 illustrates the performance of a two-layer gated recurrent unit recurrent neural network, which does not outperform the vector autoregressive predictor as much as was expected. Again, we note the strong correlation with the market (i.e., S&P 500). The 2008 market crash affects the RNN agent as well, but it manages to recover faster than the VAR agent, leading to an overall 221.1% cumulative return, slightly higher than the market index.
Figure 6.4: Two-layer gated recurrent unit recurrent neural network (GRU-RNN) model-based reinforcement learning agent on a 12-asset universe, pre-trained on historic data between 2000 and 2005 and trained online onward. (Left) Cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY). (Right) Mean squared prediction error for single-step predictions.

Having developed both vector autoregressive and recurrent neural network model-based reinforcement learning agents, we conclude that, despite the architectural simplicity of system identification, both agents under-perform. The inherent randomness (i.e., due to uncertainty) of financial time-series (e.g., prices, returns) affects the model training and degrades predictability.

In spite of the promising results in Figures 2.14 and 2.16, where one-step predictions are considered, control and planning (6.2) are only effective when accurate multi-step predictions are available. Therefore, the process of first fitting a model (e.g., VAR, RNN or Gaussian Process) and then using an external optimization step results in two sources of approximation error, where the error propagates over time and reduces performance.

A potential improvement of these methods would be to manually feature-engineer the state space, for example by extracting meaningful econometric signals (Greene, 2003) (e.g., volatility regime shifts, earnings or dividend announcements, fundamentals) which are in turn used for predicting the returns. This is, in a nutshell, the traditional approach that quantitative analysts (LeBaron, 2001) have been using for the past decades. Computational power has improved radically over the years, which permits larger (i.e., deeper and wider) models to be fitted, while elaborate algorithms, such as variational inference (Archer et al., 2015), have made previously intractable tasks possible. Nonetheless, this approach involves a lot of tweaking and human intervention, which are the main aspects we aim to attenuate.
6.2 Model-Free Reinforcement Learning

In the previous section, we assumed that solving a system identification problem (i.e., explicitly inferring the environment dynamics) is easier than addressing directly the initial objective: the maximization of a cumulative reward signal (e.g., log returns, negative volatility, Sharpe Ratio). Nonetheless, accurately predicting the evolution of the market proved challenging, resulting in ill-performing agents.

In this section, we adopt an orthogonal approach, where we do not rely on an explicit model of the environment, but parametrize the agent value function or/and policy directly. At first glance, it may seem counter-intuitive how skipping the modelling of the environment can lead to a meaningful agent at all, but consider the following example from daily life. Humans are able to easily handle objects or move them around. Unarguably, this is a consequence of experience that we have gained over time; however, if we are asked to explain the environment model that justifies our actions, it is much more challenging, especially for a one-year-old child, who can successfully play with toys but fails to explain this task using Newtonian physics.

Another motivating example comes from the portfolio management and trading field: pairs trading is a simple trading strategy (Gatev et al., 2006) which relies on the assumption that historically correlated assets will preserve this relationship over time. Hence, when the two assets start deviating from one another, this is considered an arbitrage opportunity (Wilmott, 2007), since they are expected to return to a correlated state. This opportunity is exploited by taking a long position in the under-performing (falling) stock and a short position in the out-performing (rising) one. (Pairs trading is selected for educational purposes only; we are not claiming that it is optimal in any sense.) If we wanted to train a model-based reinforcement learning agent to perform pairs trading, it would be almost impossible or too unstable, regardless of the algorithm's simplicity. On the other hand, a value-based or policy gradient agent could perform this task with minimal effort, replicating the strategy, because pairs trading does not rely on future value prediction, but on much simpler statistical analysis (i.e., cross-correlation), which, in the case of model-based approaches, would have to be translated into an optimization problem with an unrelated objective: the prediction error. Overall, using model-free reinforcement learning improves efficiency (i.e., only one fitting stage) and allows finer control over the policy, whereas a model-based approach limits the policy to be only as good as the learned model. More importantly, for the task under consideration (i.e., asset allocation), it is shown to be easier to
represent a good policy than to learn an accurate model.

The model-free reinforcement learning agents are summarized in Algorithm 1, where different objective functions and agent parametrizations lead to different approaches and hence strategies. We classify these algorithms as:
• Value-based: learn a state-value function v (4.16), or an action-value function q (4.17), and use it with an implicit policy (e.g., ε-greedy (Sutton and Barto, 1998));
• Policy-based: learn a policy directly, by using the reward signal to guide adaptation.

In this chapter, we will first focus on value-based algorithms, which exploit the state (action) value function, as defined in equations (4.6) and (4.7), as an estimate of expected cumulative rewards. Then policy gradient methods will be covered, which parametrize the policy of the agent directly and perform gradient ascent to optimize performance. Finally, a universal agent will be introduced, which reduces complexity (i.e., computational and memory) and generalizes strategies across assets, regardless of the training universe, based on parameter sharing (Bengio et al., 2003) and transfer learning (Pan and Yang, 2010a) principles.
A wide range of value-based reinforcement learning algorithms (Sutton and Barto, 1998; Silver, 2015d; Szepesvári, 2010) have been suggested and used over time. Q-Learning is one of the simplest and best-performing ones (Liang et al., 2016), which motivates us to extend it to continuous action spaces to fit our system formulation.
Q-Learning
Q-Learning is a simple but very effective value-based model-free reinforcement learning algorithm (Watkins, 1989). It works by successively improving its evaluations of the action-value function q, hence the name. Let q̂ be the estimate of the true action-value function; then q̂ is updated online (at every time step) according to:

q̂(s_t, a_t) ← q̂(s_t, a_t) + α [ r_{t+1} + γ max_{a' ∈ A} q̂(s_{t+1}, a') − q̂(s_t, a_t) ]      (6.11)

where α ≥ 0 is the learning rate and γ ∈ [0, 1] the discount factor. In the literature, the term in the square brackets is usually called the Temporal Difference Error (TD error), δ_{t+1} (Sutton and Barto, 1998).
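A single tabular Q-Learning step (6.11) can be written as follows; the function name and default hyperparameters are illustrative only.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update, cf. (6.11); Q has shape (|S|, |A|)."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]   # temporal difference error, delta
    Q[s, a] += alpha * td_error
    return td_error
```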
Theorem 6.1
For a Markov Decision Process, Q-Learning converges to the optimum action-values with probability 1, as long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
The proof of Theorem 6.1 is provided by Watkins (1989) and relies on the contraction property of the Bellman operator (Sutton and Barto, 1998), showing that:

q̂(s, a) → q*(s, a)      (6.13)

(The Bellman operator B is an a-contraction with respect to some norm ‖·‖, since it can be shown that (Rust, 1997): ‖Bs − Bs̄‖ ≤ a ‖s − s̄‖ (6.12). Therefore it follows that (i) the sequence s, Bs, B²s, ... converges for every s, and (ii) B has a unique fixed point s*, which satisfies Bs* = s*, and all sequences s, Bs, B²s, ... converge to this unique fixed point s*.)

Note that equation (6.11) is practical only in the cases where:
1. The state space is discrete, and hence the action-value function can be stored in a digital computer as a grid of scalars;
2. The action space is discrete, and hence at each iteration the maximization over actions a ∈ A is tractable.

The Q-Learning steps are summarized in Algorithm 3.

Related Work
Due to the success of the Q-Learning algorithm, early attempts to modify it to fit the asset allocation task were made by Neuneier (1996), who attempted to use a differentiable function approximator (i.e., a neural network) to represent the action-value function, and hence enabled the use of Q-Learning in continuous state spaces. Nonetheless, he was restricted to discrete action spaces and thus acted on buy and sell signals only. Moody et al. (1998) used a similar approach but introduced new reward signals, namely general utility functions and the differential Sharpe Ratio, which are considered in Chapter 9.

Recent advances in deep learning (Goodfellow et al., 2016) and stochastic optimization methods (Boyd and Vandenberghe, 2004) led to the first practical use case of Q-Learning in high-dimensional state spaces (Mnih et al., 2015).
69. T rading A gents Algorithm 3:
Q-Learning with greedy policy. inputs : trading universe of M -assetsinitial portfolio vector w = a initial asset prices p = o initial output : optimal action-value function q ∗ initialize q-table: ˆ q ( s , a ) ← ∀ s ∈ S , a ∈ A while convergence do for t =
0, 1, . . . T do select greedy action: a t = max a (cid:48) ∈ A ˆ q ( s t , a (cid:48) ) observe tuple (cid:104) s t + , r t (cid:105) update q-table:ˆ q ( s t , a t ) ← ˆ q ( s t , a t ) + α (cid:2) r t + γ max a (cid:48) ∈ A ˆ q ( s t + , a (cid:48) ) − ˆ q ( s t , a t ) (cid:3) end end use case of Q-Learning in high-dimensional state spaces (Mnih et al. , 2015).Earlier work was limited to low-dimensional applications, where shallowneural network architectures were effective. Mnih et al. (2015) used a fewalgorithmic tricks and heuristics, and managed to stabilize the training pro-cess of the Deep Q-Network (DQN). The first demonstration was performedon Atari video games, where trained agents outperformed human playersin most games.Later, Hausknecht and Stone (2015) published a modified version of theDQN for partially observable environments, using a recurrent layer to con-struct the agent state, giving rise to the
Deep Recurrent Q-Network (DRQN).Nonetheless, similar to the vanilla Q-Learning and enhanced DQN algo-rithms, the action space was always discrete.
Agent Model
Inspired by the breakthroughs in DQN and DRQN, we suggest a modification to the last layers to handle pseudo-continuous action spaces, as required for the portfolio management task. The current implementation, termed the
Deep Soft Recurrent Q-Network (DSRQN), relies on a fixed, implicit policy (i.e., exponential normalization, or softmax (McCullagh, 1984)), while the action-value function q is adaptively fitted.

A neural network architecture as in Figure 6.5 is used to estimate the action-value function. The two 2D-convolution (2D-CONV) layers, followed by a max-pooling (MAX) layer, aim to extract non-linear features from the raw historic log returns (LeCun and Bengio, 1995). The feature map is then fed to the gated recurrent unit (GRU), which is the state manager (see Section 5.3.2), responsible for reconstructing a meaningful agent state from a partially observable environment. The generated agent state, s_t, is then regressed along with the past action, a_t (i.e., the current portfolio vector), in order to produce the action-value estimates. Those estimates are used with the realized reward r_{t+1} to calculate the TD error δ_{t+1} (6.11) and train the DSRQN as in Algorithm 4. The action-value estimates q̂_{t+1} are passed to a softmax layer, which produces the agent action a_{t+1}. We select the softmax function because it provides the favourable property of forcing all the components (i.e., portfolio weights) to sum to unity (see Section 2.1). Analytically, the actions are given by:

$$\forall i \in \{1, 2, \ldots, M\}: \quad a_i = \frac{e^{\hat{q}_i}}{\sum_{j=1}^{M} e^{\hat{q}_j}} \;\Longrightarrow\; \sum_{i=1}^{M} a_i = 1$$

The softmax layer has no trainable parameters, which means that it can be replaced by any function (i.e., deterministic or stochastic), even by a quadratic programming step, since differentiability is not required. For this experiment we did not consider more advanced policies, but any mapping is acceptable as long as constraint (2.1) is satisfied. Moreover, in contrast to the original DQN (Mnih et al., 2015), no experience replay is performed, in order to avoid resetting the GRU hidden state for each batch, which would lead to an unused latent state and hence a poor state manager.
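For clarity, a minimal numerical sketch of the softmax mapping from action-value estimates to long-only portfolio weights follows; the surrounding network layers are omitted.

```python
import numpy as np

def softmax_weights(q_values):
    """Map action-value estimates to long-only portfolio weights.

    A numerically stable softmax; by construction the output is
    non-negative and sums to one, as required by the portfolio constraint.
    """
    z = np.asarray(q_values, dtype=float)
    z = z - z.max()                 # stabilise the exponentials
    w = np.exp(z)
    return w / w.sum()

# Example: three-asset action values -> valid portfolio vector
w = softmax_weights([0.2, -0.1, 0.4])
assert np.isclose(w.sum(), 1.0) and (w >= 0).all()
```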
Algorithm 4: Deep Soft Recurrent Q-Learning.
inputs: trading universe of M assets; initial portfolio vector w_1 = a_1; initial asset prices p_1 = o_1; objective function J; initial agent weights θ
output: optimal agent parameters θ*
repeat
    for t = 1, 2, ... T do
        observe tuple ⟨o_t, r_t⟩
        calculate TD error δ_{t+1}                                      // (6.11)
        calculate gradients ∇_{θ_i} L(θ_i) = δ_{t+1} ∇_{θ_i} q(s, a; θ)  // BPTT
        update agent parameters θ using adaptive gradient optimizers     // ADAM
        get estimate of value function q_t ≈ NN(ρ_{t−T→t})               // (6.11)
        take action a_t = softmax(q_t)                                    // portfolio rebalance
    end
until convergence
set θ* ← θ

Evaluation

Figure 6.7 illustrates the results obtained on a small-scale experiment with 12 assets from the S&P 500 market using the DSRQN. The agent is trained on historic data between 2000-2005 for 5000 episodes, and tested on 2005-2018. The performance of the agent at different episodes e is highlighted. For e = 1 and e = 10, the agent acted almost completely randomly, leading to poor performance. For e = 100 and e = 1000, the agent learned increasingly profitable strategies, in contrast to the model-based agents, which were highly impacted by the 2008 market crash, while the DSRQN was almost unaffected (e.g., 85.6% maximum drawdown; cf. Table 6.1).

In Figure 6.6, we illustrate the out-of-sample cumulative return of the DSRQN agent per training episode, which flattens (saturates) after a few thousand episodes.
Weaknesses

Despite the improved performance of the DSRQN compared to the model-based agents, its architecture has severe weaknesses.

Firstly, the selection of the policy (e.g., the softmax layer) is a manual process that can only be verified via empirical means, for example, cross-validation. This complicates the training process, without guaranteeing any global optimality of the selected policy.
Figure 6.5: Deep Soft Recurrent Q-Network (DSRQN) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are passed through two 2D-convolution layers, which generate a feature map that is, in turn, processed by the GRU state manager. The agent state produced is combined (via matrix flattening and vector concatenation) with the past action (i.e., current portfolio positions) to estimate the action-values q_1, q_2, ..., q_M. The action values are used both for calculating the TD error (6.11), which appears in the gradient calculation, and for determining the agent's actions, after being passed through a softmax activation layer.

Figure 6.6: Out-of-sample cumulative returns per episode during the training phase for DSRQN. Performance improvement saturates after a few thousand episodes. (Left) Neural network optimized with the adaptive optimization algorithm ADAM (Kingma and Ba, 2014). (Right) Neural network optimized with Stochastic Gradient Descent (SGD) (Mandic, 2004).
Expected Properties 6.4 (End-to-End Differentiable Architecture)
The agent policy should be part of the trainable architecture, so that it adapts to a (locally) optimal strategy via gradient optimization during training.
Secondly, the DSRQN is a Many-Input-Many-Output (MIMO) model, whose number of parameters, and hence its training complexity, grows polynomially as a function of the universe size (i.e., the number of assets M). Moreover, under this setting, the learned strategy is universe-specific, which means that the same trained network does not generalize to other universes. It even fails to work on permutations of the original universe; for example, if we interchange the order of the assets in the processed observation ŝ_t after training, then the DSRQN breaks down.

Figure 6.7: Deep Soft Recurrent Q-Network (DSRQN) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 10, 100, 1000}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).

Expected Properties 6.5 (Linear Scaling)
Model should scale linearly (i.e., in computation and memory) with respect to the universe size.
Expected Properties 6.6 (Universal Architecture)
Model should be universe-agnostic and replicate its strategy regardless of the underlying assets.
In order to address the first weakness of the DSRQN (i.e., the manual selection of the policy), we consider policy gradient algorithms, which directly address the learning of an agent policy, without intermediate action-value approximations, resulting in an end-to-end differentiable model.
Policy Gradient Theorem
In Section 4.2.2 we defined the policy of an agent, π, as:

$$\pi : \mathcal{S} \rightarrow \mathcal{A} \qquad (4.14)$$

In policy gradient algorithms, we parametrize the policy with parameters θ, as π_θ, and optimize them according to a long-term objective function J, such as the average reward per time-step, given by:

$$J(\theta) \triangleq \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, R_s^a \qquad (6.15)$$

Note that any differentiable parametrization of the policy is valid (e.g., a neural network or a linear model). Moreover, the freedom of choosing the reward generating function, R_s^a, is still available.

In order to optimize the parameters θ, the gradient ∇_θ J should be calculated at each iteration. Firstly, we consider a one-step Markov Decision Process:

$$J(\theta) = \mathbb{E}_{\pi_\theta}[R_s^a] \overset{(6.15)}{=} \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, R_s^a \qquad (6.16)$$

$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s, a)\, R_s^a \\
&= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}\, R_s^a \\
&= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, \nabla_\theta \log[\pi_\theta(s, a)]\, R_s^a \\
&= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, R_s^a\big]
\end{aligned} \qquad (6.17)$$

The policy gradient calculation is extended to multi-step MDPs by replacing the instantaneous reward R_s^a with the long-term (action) value q_π(s, a).

Theorem 6.2 (Policy Gradient Theorem) For any differentiable policy π_θ(s, a), and for J the average discounted future reward per step, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, q_\pi(s, a)\big] \qquad (6.18)$$

The proof is provided by Sutton et al. (2000b). The theorem also applies to continuous settings (i.e., infinite MDPs) (Sutton and Barto, 1998), where the summations are replaced by integrals.
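As a numerical illustration of the log-derivative trick underlying (6.17)-(6.18), the following sketch compares the exact gradient of a one-step objective with its score-function (Monte-Carlo) estimate, for a softmax policy over two actions with fixed, hypothetical rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 2.0])                  # fixed rewards of two actions (illustrative)
theta = np.array([0.3, -0.1])             # softmax policy parameters

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Exact gradient of J(theta) = E_pi[R_a] for the softmax policy
p = pi(theta)
exact = p * (R - p @ R)                   # d/dtheta_i of sum_a pi_a R_a

# Score-function (Monte-Carlo) estimate: E[ grad log pi(a) * R_a ], cf. (6.17)
N = 200_000
actions = rng.choice(2, size=N, p=p)
grad_log = np.eye(2)[actions] - p         # grad_theta log softmax at sampled actions
estimate = (grad_log * R[actions, None]).mean(axis=0)
print(exact, estimate)                    # the two should closely agree
```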
Related Work

Policy gradient methods have gained momentum over the past years due to their better convergence properties (Sutton and Barto, 1998), their effectiveness in high-dimensional or continuous action spaces, and their natural fit to stochastic environments (Silver, 2015e). Apart from their extensive application in robotics (Smart and Kaelbling, 2002; Kohl and Stone, 2004; Kober and Peters, 2009), policy gradient methods have also been used in financial markets. Necchi (2016) develops a general framework for training policy gradient agents to solve the asset allocation task, but only successful back-tests on synthetic market data are provided.
Agent Model
From equation (6.18), we note two challenges:
1. calculation of an expectation over the (stochastic) policy, E_{π_θ}, leading to integration over unknown quantities;
2. estimation of the unknown action-value, q_π(s, a).

The simplest successful algorithm to address both of these challenges is Monte-Carlo Policy Gradient, also known as REINFORCE. In particular, different trajectories are generated following the policy π_θ, which are then used to estimate the expectation and the discounted future rewards; the latter are an unbiased estimate of q_π(s, a) (the empirical mean is an unbiased estimate of the expected value (Mandic, 2018b)). Hence we obtain the Monte-Carlo estimates:

$$q_\pi(s, a) \approx G_t \triangleq \sum_{i=1}^{t} r_i \qquad (6.19)$$

$$\mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, q_\pi(s, a)\big] \approx \frac{1}{T} \sum_{i=1}^{T} \nabla_\theta \log[\pi_\theta(s_i, a_i)]\, G_i \qquad (6.20)$$

where the gradient of the log term, ∇_θ log[π_θ(s, a)], is obtained by Backpropagation Through Time (Werbos, 1990).

We choose to parametrize the policy using a neural network architecture, illustrated in Figure 6.8. The configuration is very similar to that of the DSRQN, but the network outputs the agent action directly, without intermediate action-value estimates.
Algorithm 5: Monte-Carlo Policy Gradient (REINFORCE).
inputs: trading universe of M assets; initial portfolio vector w_1 = a_1; initial asset prices p_1 = o_1; objective function J; initial agent weights θ
output: optimal agent policy parameters θ*
initialize buffers: G, Δθ_c ← 0
repeat
    for t = 1, 2, ... T do
        observe tuple ⟨o_t, r_t⟩
        sample and take action: a_t ∼ π_θ(·|s_t; θ)                 // portfolio rebalance
        cache rewards: G ← G + r_t                                   // (6.19)
        cache log gradients: Δθ_c ← Δθ_c + ∇_θ log[π_θ(s, a)] G      // (6.20)
    end
    update policy parameters θ using the buffered Monte-Carlo estimates via adaptive optimization   // (6.18), ADAM
    empty buffers: G, Δθ_c ← 0
until convergence
set θ* ← θ
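For illustration, a self-contained toy version of Algorithm 5 is sketched below. It uses a state-independent softmax policy that selects a single asset per step from synthetic returns; this discrete, stateless setting is a deliberate simplification of the network-based agent described above, and the mean-return values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy REINFORCE sketch: a state-independent softmax policy picks one of M
# assets per step; synthetic Gaussian returns stand in for the market.
M, T, episodes, lr, theta = 3, 50, 2000, 0.05, np.zeros(3)
true_mu = np.array([0.001, 0.004, -0.002])       # hypothetical mean daily returns

for _ in range(episodes):
    grads, rewards = [], []
    for _ in range(T):
        p = softmax(theta)
        a = rng.choice(M, p=p)
        r = rng.normal(true_mu[a], 0.01)         # reward = realized return
        grads.append(np.eye(M)[a] - p)           # grad_theta log pi(a)
        rewards.append(r)
    G = np.cumsum(rewards)                       # cumulative reward buffer, as in (6.19)
    # Monte-Carlo policy gradient step, cf. (6.20)
    theta += lr * np.mean([g * G[t] for t, g in enumerate(grads)], axis=0)

print(softmax(theta))   # policy mass shifts towards the highest-mean asset
```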
Evaluation

In Figure 6.10, we present the results from an experiment performed on a small universe comprising 12 assets from the S&P 500 market, using the REINFORCE agent. The agent is trained on historic data between 2000-2005 for 5000 episodes, and tested on 2005-2018. Similar to the DSRQN, at early episodes the REINFORCE agent performs poorly, but it learns a profitable strategy after a few thousand episodes.

Figure 6.8: Monte-Carlo Policy Gradient (REINFORCE) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are passed through two 2D-convolution layers, which generate a feature map that is, in turn, processed by the GRU state manager. The agent state produced and the past action (i.e., current portfolio positions) are non-linearly regressed and exponentially normalized by the affine and the softmax layer, respectively, to generate the agent actions.

Figure 6.9: Out-of-sample cumulative returns per episode during the training phase for REINFORCE. Performance improvement saturates after a few thousand episodes. (Left) Neural network optimized with ADAM. (Right) Neural network optimized with Stochastic Gradient Descent (SGD).

In Figure 6.9, we also illustrate the out-of-sample cumulative return of the REINFORCE agent per training episode, which flattens as training saturates.

Weaknesses
Policy gradient addressed only the end-to-end differentiability weakness of the DSRQN architecture, leading to significant improvements. Nonetheless, the polynomial scaling and the universe-specific nature of the model still restrict the applicability and generalization of the learned strategies. Moreover, the intractability of the policy gradient calculation given by (6.18) leads to the compromise of using Monte-Carlo estimates obtained by running numerous simulations, increasing the number of episodes required for convergence. Most importantly, the empirical estimate of the state action-value q_{π_θ}(s, a) in (6.19) has high variance (see the returns in Figure 6.9) (Sutton and Barto, 1998).

Figure 6.10: Monte-Carlo Policy Gradient (REINFORCE) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 100, 1000, 7500}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).

Expected Properties 6.7 (Low Variance Estimators)
Model should rely on low-variance estimators.
In this subsection, we introduce the
Mixture of Score Machines (MSM) model, with the aim of providing a universal model that reduces the agent model complexity and generalizes strategies across assets, regardless of the training universe. These properties are obtained by virtue of the principles of parameter sharing (Bengio et al., 2003) and transfer learning (Pan and Yang, 2010a).

Related Work
Jiang et al. (2017) suggested a universal policy gradient architecture, the
Ensemble of Identical Independent Evaluators (EIIE), which significantly reduces the model complexity and hence enables larger-scale applications. Nonetheless, it operates only on independent (e.g., uncorrelated) time-series, which is a highly unrealistic assumption for real financial markets.
Agent Model
As a generalization of the universal model of Jiang et al. (2017), we introduce the Score Machine (SM), an estimator of statistical moments of stochastic multivariate time-series:

• A First-Order Score Machine, SM(1), operates on univariate time-series, generating a score that summarizes the characteristics of the location parameters of the time-series (e.g., mode, median, mean). An M-component multivariate series has (M choose 1) = M first-order scores, one for each component;

• A Second-Order Score Machine, SM(2), operates on bivariate time-series, generating a score that summarizes the characteristics of the dispersion parameters of the joint series (e.g., covariance, modal dispersion (Meucci, 2009)). An M-component multivariate series has (M choose 2) = M!/(2!(M−2)!) second-order scores, one for each distinct pair;

• An N-th Order Score Machine, SM(N), operates on N-component multivariate series and extracts information about the N-th order statistics (i.e., statistical moments) of the joint series. An M-component multivariate series, for M ≥ N, has (M choose N) = M!/(N!(M−N)!) N-th order scores, one for each distinct N-combination of components.

Note that the extracted scores are not necessarily equal to the statistical (central) moments of the series, but rather a compressed and informative representation of the statistics of the time-series. The universality of the score machine is based on parameter sharing (Bengio et al., 2003) across assets.

Figure 6.11 illustrates the first- and second-order score machines, where the transformations are approximated by neural networks. Higher-order score machines can be used in order to capture higher-order moments.

Figure 6.11: Score Machine (SM) neural network architecture. Convolutional layers followed by non-linearities (e.g., ReLU) and Max-Pooling (Giusti et al., 2013) construct a feature map, which is selectively stored and filtered by the Gated Recurrent Unit (GRU) layer. Finally, a linear layer combines the GRU output components into a single scalar value, the score. (Left) First-Order Score Machine SM(1); given a univariate time-series x_t of T samples, it produces a scalar score value s(1)_t. (Right) Second-Order Score Machine SM(2); given a bivariate time-series x_{i&j,t}, with components x_{i,t} and x_{j,t}, of T samples each, it generates a scalar value s(2)_t.

By combining the score machines, we construct the Mixture of Score Machines (MSM), whose architecture is illustrated in Figure 6.12. We identify three main building blocks:
1. SM(1): a first-order score machine that processes all single-asset log returns, generating the first-order scores;
2. SM(2): a second-order score machine that processes all pairs of asset log-returns, generating the second-order scores;
3. Mixture Network: an aggregation mechanism that accesses the scores from SM(1) and SM(2) and infers the action-values.

Figure 6.12: Mixture of Score Machines (MSM) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are processed by the score machines SM(1) and SM(2), which assign scores to each asset (v(1)_t) and to each pair of assets (v(2)_t), respectively. The scores are concatenated and passed to the mixture network, which combines them with the past action (i.e., current portfolio vector) and the generated agent state (i.e., state manager hidden state) to determine the next optimal action.

We emphasize that there is only one SM for each order, shared across the network; hence, during backpropagation the gradient of the loss function with respect to the parameters of each SM is given by the sum over all the paths that contribute to the loss (Goodfellow et al., 2016) and pass through that particular SM. The mixture network, inspired by the Mixtures of Expert Networks of Jacobs et al. (1991), gathers all the extracted information from the first- and second-order statistical moments and combines it with the past action (i.e., current portfolio vector) and the generated agent state (i.e., state manager hidden state) to determine the next optimal action.

Neural networks, and especially deep architectures, are very data-hungry (Murphy, 2012), requiring thousands (or even millions) of data points to converge to meaningful strategies. Using daily market data (i.e., daily prices), only about 252 data points are collected every year, which means that very few samples are available for training. Thanks to parameter sharing, nonetheless, the SM networks are trained on orders of magnitude more data: for example, for a 12-asset universe with a 5-year history, SM(1) is trained on 12 univariate series and SM(2) on (12 choose 2) = 66 bivariate series drawn from the same dataset.

While the score machines scale to any universe size without modification, by stacking more copies of the same machine with shared parameters, the mixture network is universe-specific. As a result, a different mixture network has to be trained for different universes. For an M-asset market, the mixture network has the interface:

$$N^{\text{inputs}}_{\text{mixture-network}} = M + \binom{M}{2}, \qquad N^{\text{outputs}}_{\text{mixture-network}} = M \qquad (6.21)$$

Consequently, selecting a different number of assets would break the interface of the mixture network. Nonetheless, the score machines can be trained with different mixture networks; hence, when a new universe of assets is given, we freeze the training of the score machines and train only the mixture network. This operation is cheap, since the mixture network comprises only a small fraction of the total number of trainable parameters of the MSM.

In practice, the score machines are trained on large historic datasets and kept fixed, while transfer learning is performed on the mixture network. The score machines can therefore be viewed as rich, universal feature extractors, while the mixture network is the small (i.e., in size and capacity) mechanism that maps from the abstract space of scores to the feasible action space, capable of preserving asset-specific information as well.
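A minimal sketch of the parameter-sharing wiring is given below; the placeholder scorers stand in for the CONV/GRU networks of Figure 6.11, and only the fan-out over assets and asset pairs, and the resulting interface size (6.21), is illustrated.

```python
import numpy as np
from itertools import combinations

# Sketch of the MSM scoring pass under parameter sharing: ONE first-order
# scorer is applied to every asset and ONE second-order scorer to every
# asset pair. The scorers below are simple placeholders, not the thesis's
# CONV/GRU networks; only the wiring is illustrated.
def sm1(x):                       # x: (T,) univariate window -> scalar score
    return np.tanh(x.mean() / (x.std() + 1e-8))

def sm2(xi, xj):                  # two aligned (T,) windows -> scalar score
    return np.tanh(np.corrcoef(xi, xj)[0, 1])

def msm_scores(log_returns):      # log_returns: (M, T) look-back window
    M = log_returns.shape[0]
    first = np.array([sm1(log_returns[i]) for i in range(M)])
    second = np.array([sm2(log_returns[i], log_returns[j])
                       for i, j in combinations(range(M), 2)])
    # The mixture network receives M + C(M, 2) scores, cf. (6.21).
    return np.concatenate([first, second])

scores = msm_scores(np.random.default_rng(0).normal(0, 0.01, size=(12, 60)))
print(scores.shape)               # (12 + 66,) = (78,)
```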
Evaluation

Figure 6.14 shows the results from an experiment performed on a small universe comprising 12 assets from the S&P 500 market, using the MSM agent. The agent is trained on historic data between 2000-2005 for 10000 episodes, and tested on 2005-2018. Conforming with our analysis, the agent underperformed early in training, but after 9000 episodes it became profitable and its performance saturated after 10000 episodes, with total cumulative returns of 283.9% and a 68.5% maximum drawdown. It scored slightly worse than the REINFORCE agent, but in Part III it is shown that in larger-scale experiments the MSM is both more effective and more efficient (i.e., computationally and memory-wise).

Weaknesses
The Mixture of Score Machines (MSM) is an architecture that addresses the weaknesses of all the aforementioned approaches (i.e., VAR, LSTM, DSRQN and REINFORCE). However, it could be improved by incorporating the expected properties 6.8 and 6.9:
Expected Properties 6.8 (Short Sales)
Model should also be able to output negative portfolio weights, corresponding to short selling.
Expected Properties 6.9 (Optimality Guarantees)
Explore possible re-interpretations of the framework, which would allow a proof of optimality (if applicable).
Figure 6.13: Out-of-sample cumulative returns per episode during the training phase for MSM. Performance improvement saturates after approximately 10000 episodes. (Left) Neural network optimized with the adaptive optimization algorithm ADAM. (Right) Neural network optimized with Stochastic Gradient Descent (SGD).

Figure 6.14: Mixture of Score Machines (MSM) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 1000, 5000, 10000}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).
Trading Agents Comparison Matrix: 12 assets of S&P 500

                               Model-Based          Model-Free
                               VAR      RNN      DSRQN   REINFORCE   MSM
Metrics
  Cumulative Returns (%)       185.1    221.0    256.7   325.9       283.9
  Sharpe Ratio (SNR)           1.53     1.62     2.40    3.02        2.72
  Max Drawdown (%)             179.6    198.4    85.6    63.5        68.5
Expected Properties
  Non-Stationary Dynamics      ×        ✓        ✓       ✓           ✓
  Long-Term Memory             ×        ✓        ✓       ✓           ✓
  Non-Linear Model             ×        ✓        ✓       ✓           ✓
  End-to-End                   ×        ×        ×       ✓           ✓
  Linear Scaling               ×        ×        ×       ×           ✓
  Universality                 ×        ×        ×       ×           ✓
  Low Variance Estimators      ×        ×        ×       ×           ×
  Short Sales                  ✓        ✓        ×       ×           ×

Table 6.1: Comprehensive comparison of evaluation metrics and weaknesses (i.e., expected properties) of the trading algorithms addressed in this chapter. Model-based agents (i.e., VAR and RNN) underperform, while the best-performing agent is the REINFORCE. As desired, the MSM agent also scores well above the index (i.e., the baseline) and satisfies most of the desired properties.

Chapter 7
Pre-Training
In Chapter 6, model-based and model-free reinforcement learning agents were introduced to address the asset allocation task. It was demonstrated (see comparison Table 6.1) that model-based agents (i.e., VAR and RNN) and value-based model-free agents (i.e., DSRQN) are outperformed by the policy gradient agents (i.e., REINFORCE and MSM). However, policy gradient algorithms usually converge to local optima (Sutton and Barto, 1998).

Inspired by the approach taken by the authors of the original DeepMind AlphaGo paper (Silver and Hassabis, 2016), the local optimality of the policy gradient agents is addressed by pre-training the policies to replicate the strategies of baseline models. It is shown that any one-step optimization method discussed in Chapter 3 that reduces to a quadratic program can be reproduced by the policy gradient networks (i.e., REINFORCE and MSM), when the networks are trained to approximate the quadratic program solution.

Because of the highly non-convex policy search space (Szepesvári, 2010), randomly initialised agents (i.e., agnostic agents) tend either to get stuck in vastly sub-optimal local minima or to need many more episodes and samples to converge to meaningful strategies. Therefore, the limited number of available samples (e.g., 10 years of market data is equivalent to approximately 2500 samples) motivates pre-training, which is expected to improve convergence speed and performance, assuming that the baseline model is sub-optimal but a proxy to the optimal strategy. Moreover, the pre-trained models can be viewed as priors on the policies, which episodic training with reinforcement learning then steers towards updated strategies in a data-driven and data-efficient manner, as in the context of Bayesian inference.
In Chapter 3, the traditional one-step (i.e., static) portfolio optimization methods were described, derived from the Markowitz model (Markowitz, 1952). Despite the assumptions of covariance stationarity (i.e., time-invariance of the first and second statistical moments) and the myopic nature of those methods, they usually form the basis of other, more complicated and effective strategies. As a result, we attempt to replicate those strategies with the REINFORCE (see Subsection 6.2.2) and the Mixture of Score Machines (MSM) (see Subsection 6.2.3) agents. In both cases, the architectures of the agents (i.e., the underlying neural networks) are treated as black boxes, represented by a set of parameters which, thanks to their end-to-end differentiability, can be trained via backpropagation. Figure 7.1 summarizes the pipeline used to (pre-)train the policy gradient agents.
Figure 7.1: Interfacing with policy gradient agents as black boxes, with inputs (1) the historic log returns ρ_{t−T→t} and (2) the past action (i.e., current portfolio vector) a_t, and output the next agent action a_{t+1}. The black box is parametrized by θ, which can be updated and optimized.

The one-step optimal portfolio for given commission rates (i.e., transaction cost coefficient β) and hyperparameters (e.g., risk-aversion coefficient) is obtained by solving the optimization task in (3.13) or (3.14) via quadratic programming. The Sharpe Ratio with transaction costs objective function is selected as the baseline for pre-training, since it has no hyperparameter to tune and inherently balances the profit-risk trade-off.

Without being explicitly given the mean vector μ, the covariance matrix Σ, and the transaction cost coefficient β, the black-box agents should be able to solve the optimization task (3.14), or equivalently:

$$\underset{a_{t+1} \in \mathcal{A}}{\text{maximize}} \quad \frac{a_{t+1}^{T} \mu - \beta\, \mathbf{1}_M^{T} |a_t - a_{t+1}|}{\sqrt{a_{t+1}^{T} \Sigma\, a_{t+1}}} \quad \text{subject to} \quad \mathbf{1}_M^{T} a_{t+1} = 1, \; a_{t+1} \succeq 0$$

Since there is a closed-form relationship connecting the black-box agents' inputs ρ_{t−T→t} and a_t with the terms in the optimization problem (3.14), N supervised pairs {(X_i, y_i)}_{i=1}^{N} are generated by solving the optimization for N distinct cases, such that:

$$X_i = [\rho_{t_i - T \rightarrow t_i},\, a_{t_i}] \qquad (7.1)$$
$$y_i = a_{t_i + 1} \qquad (7.2)$$

Interestingly, a myriad of examples (i.e., (X_i, y_i) pairs) can be produced to enrich the dataset and allow convergence. This is a very rare situation where the generating process of the data is known and can be used to produce valid samples, which respect the dynamics of the target model. The data generation process is given in Algorithm 6.

The parameters of the black-box agents are steered in the gradient direction that minimizes the
Mean Square Error (MSE) between the predicted portfolio weights, ŷ_{t_i}, and the baseline model's target portfolio weights, y_{t_i}. An L2-norm weight-decay (regularization) term is also considered to avoid overfitting, giving the loss function:

$$\mathcal{L}(\theta) = \|y_{t_i} - \hat{y}_{t_i}(\theta)\|_2^2 + \lambda \|\theta\|_2^2 \qquad (7.3)$$

Algorithm 6: Pre-training supervised dataset generation.
inputs: number of pairs to generate N; number of assets in portfolio M; look-back window size T; transaction cost coefficient β
output: dataset {(X_i, y_i)}_{i=1}^{N}
for i = 1, 2, ... N do
    sample a valid random initial portfolio vector w_{t_i}
    sample a random lower-triangular matrix L ∈ R^{M×M}                    // Cholesky decomposition
    sample randomly distributed log returns: ρ_{t_i−T→t_i} ∼ N(0, LL^T)
    calculate the empirical mean vector of the log returns: μ = E[ρ_{t_i−T→t_i}]
    calculate the empirical covariance matrix of the log returns: Σ = Cov[ρ_{t_i−T→t_i}]
    determine a_{t_i+1} by solving the quadratic program (3.14)
    set X_i = [ρ_{t_i−T→t_i}, a_{t_i}] and y_i = a_{t_i+1}
end

The parameters are adaptively optimized by ADAM (Kingma and Ba, 2014), while the gradients of the network parameters are obtained via Backpropagation Through Time (Werbos, 1990).
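A sketch of one iteration of Algorithm 6 is given below. It assumes the cvxpy library for the optimization step and, for convexity, substitutes a mean-variance objective with an L1 transaction-cost penalty (closer to the risk-aversion form (3.13)) for the Sharpe-Ratio-with-costs program (3.14); the parameter values are illustrative.

```python
import numpy as np
import cvxpy as cp

def generate_pair(M=12, T=50, beta=0.0025, risk_aversion=1.0,
                  rng=np.random.default_rng(0)):
    """One supervised (X_i, y_i) pair in the spirit of Algorithm 6."""
    w_prev = rng.dirichlet(np.ones(M))                    # current portfolio on the simplex
    L = np.tril(rng.normal(scale=0.01, size=(M, M)))      # random Cholesky factor
    L += 0.01 * np.eye(M)                                 # keep L L^T positive definite
    rho = rng.multivariate_normal(np.zeros(M), L @ L.T, size=T)   # simulated log returns (T x M)

    mu = rho.mean(axis=0)                                 # empirical mean vector
    Sigma = np.cov(rho, rowvar=False)                     # empirical covariance matrix
    Sigma = 0.5 * (Sigma + Sigma.T) + 1e-8 * np.eye(M)    # enforce symmetry / PSD

    a = cp.Variable(M)
    objective = cp.Maximize(a @ mu
                            - risk_aversion * cp.quad_form(a, Sigma)
                            - beta * cp.norm1(a - w_prev))
    cp.Problem(objective, [cp.sum(a) == 1, a >= 0]).solve()

    X = (rho, w_prev)        # network inputs: look-back window and current weights
    y = a.value              # target: baseline-optimal next portfolio vector
    return X, y
```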
Figure 7.2 depicts the learning curves, in-sample and out-of-sample, of the supervised pre-training process. Both the REINFORCE and the MSM agents converge after approximately 400 epochs (i.e., iterations).

As suggested by Figure 7.3, pre-training improves the cumulative returns and the Sharpe Ratio of the policy gradient agents by up to 21.02% and 13.61%, respectively.

Figure 7.2: Mean square error (MSE) of the Monte-Carlo Policy Gradient (REINFORCE) and Mixture of Score Machines (MSM) agents during pre-training. After approximately 150 epochs the gap between the training (in-sample) and testing (out-of-sample) errors is eliminated, and the error curve plateaus after approximately 400 epochs, when training is terminated.

Figure 7.3: Performance evaluation of trading with reinforcement learning (RL) versus reinforcement learning with pre-training (RL & PT). The Mixture of Score Machines (MSM) improves cumulative returns by 21.02% and Sharpe Ratio by 13.61%. The model-based (i.e., RNN and VAR) and the model-free value-based (i.e., DSRQN) agents are not end-to-end differentiable and hence cannot be pre-trained.

Part III

Experiments

Chapter 8

Synthetic Data
It has been shown that the trading agents of Chapter 6, and especially REINFORCE and MSM, outperform the market index (i.e., the S&P 500) when tested on a small universe of 12 assets (see Table 6.1). For rigour, the validity and effectiveness of the developed reinforcement learning agents is investigated via a series of experiments on:

• Deterministic series, including sine, sawtooth and chirp waves, as in Section 8.1;

• Simulated series, using data surrogate methods, such as AAFT, as in Section 8.2.

As expected, it is demonstrated that model-based agents (i.e., VAR and RNN) excel in deterministic environments. This is attributed to the fact that, given enough capacity, they have the predictive power to accurately forecast future values, based on which they can act optimally via planning. On the other hand, on simulated (i.e., surrogate) time-series, it is shown that model-free agents score higher, especially after the pre-training process of Chapter 7, which contributes up to a 21% improvement in Sharpe Ratio and up to a 40% reduction in the number of episodic runs.
To begin with, via interaction with the environment (i.e., paper trading), the agents construct either an explicit (i.e., model-based reinforcement learning) or an implicit (i.e., model-free reinforcement learning) model of the environment. In Section 6.1, it was demonstrated that explicit modelling of financial time-series is very challenging, due to the stochasticity of the involved time-series, and, as a result, model-based methods underperform. On the other hand, were the market series sufficiently predictable, these methods would be expected to optimally allocate the assets of the portfolio via dynamic programming and planning. In this section, we investigate the correctness of this hypothesis by generating a universe of deterministic time-series.
A set of 100 sinusoidal waves with constant parameters (i.e., amplitude, circular frequency and initial phase) is generated; example series are provided in Figure 8.1. Note the dominant performance of the model-based recurrent neural network (RNN) agent, which exploits its accurate predictions of future realizations and scores over three times better than the best-scoring model-free agent, the Mixture of Score Machines (MSM).

Figure 8.1: Synthetic universe of deterministic sinusoidal waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents (VAR, RNN, DSRQN, REINFORCE, MSM).

For illustration purposes, and in order to gain a finer insight into the learned trading strategies, a universe of only two sinusoids is generated and the RNN agent is trained on binary trading of the two assets; at each time step the agent puts all of its budget on a single asset. As shown in Figure 8.2, the RNN agent learns the theoretically optimal strategy (a numerical illustration is given below):

$$w_{i,t} = \begin{cases} 1, & \text{if } i = \arg\max\{r_t\} \\ 0, & \text{otherwise} \end{cases} \qquad (8.1)$$

or equivalently, the return of the constructed portfolio is the maximum of the single-asset returns at each time step. Note that transaction costs are not considered in this experiment; had they been, we would expect a time-shifted version of the current strategy, so that it offsets the fees.
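A short numerical check of the optimal strategy (8.1) on two toy sinusoidal return series:

```python
import numpy as np

# Two deterministic assets (phase-shifted sinusoidal returns); the optimal
# binary strategy (8.1) holds, at each step, the asset with the larger return,
# so the realized portfolio return equals the element-wise maximum.
t = np.arange(200)
r = np.stack([0.01 * np.sin(0.2 * t), 0.01 * np.sin(0.2 * t + np.pi)])  # (2, T)
holdings = np.argmax(r, axis=0)                 # which asset to hold at each step
portfolio_returns = r[holdings, np.arange(t.size)]
assert np.allclose(portfolio_returns, r.max(axis=0))
```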
Figure 8.2: Recurrent neural network (RNN) model-based reinforcement learning agent trained on binary trading between two sinusoidal waves. The triangular trading signals (i.e., BUY or SELL) refer to asset 1 (red); the opposite actions are taken for asset 2, but are not illustrated.
A set of 100 deterministic sawtooth waves is generated next, with examples illustrated in Figure 8.3. Similar to the sinusoidal waves universe, the RNN agent outperforms the rest of the agents. Interestingly, it can be observed in the cumulative returns time-series (right panel of Figure 8.3) that all strategies have a low-frequency component, which corresponds to the highest-amplitude sawtooth wave (i.e., the yellow series).

Figure 8.3: Synthetic universe of deterministic sawtooth waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents.
Last but not least, the experiment is repeated with a set of 100 deterministic chirp waves (i.e., sinusoidal waves with linearly modulated frequency). Three example series are plotted in Figure 8.4, along with the cumulative returns of each trading agent. Note that the RNN agent is only 8.28% better than the second-best agent, the MSM, compared to the much larger margins observed in the sinusoidal and sawtooth universes.

Figure 8.4: Synthetic universe of deterministic chirp waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents.
Remark 8.1
Overall, in a deterministic financial market, all trading agents learn profitable strategies and solve the asset allocation task. As expected, model-based agents, and especially the RNN, significantly outperform in the case of well-behaved, easy-to-model deterministic series (e.g., sinusoidal, sawtooth). On the other hand, in more complicated settings (e.g., the chirp-wave universe), the model-free agents perform almost as well as the model-based agents.
Having asserted the success of the reinforcement learning trading agents in deterministic universes, their effectiveness is challenged in stochastic universes in this section. Instead of randomly selecting families of stochastic processes and corresponding parameters for them, real market data is used to learn the parameters of candidate generating processes that explain the data. The purpose of this approach is two-fold:
1. there is no need for hyperparameter tuning;
2. the training dataset is expanded, via data augmentation, giving the agents the opportunity to gain more experience and further explore the joint state-action space.

It is worth highlighting that data augmentation improves overall performance, especially when strategies learned in the simulated environment are transferred to and polished on real market data, via Transfer Learning (Pan and Yang, 2010b).

The simulated universe is generated using surrogates with random Fourier phases (Raeth and Monetti, 2009). In particular, the
Amplitude Adjusted Fourier Transform (AAFT) method (Prichard and Theiler, 1994) is used, as explained in Algorithm 7. Given a real univariate time-series, the AAFT algorithm operates in the Fourier (i.e., frequency) domain, where it preserves the amplitude spectrum of the series but randomizes the phase, leading to a new realized signal.

AAFT can be explained by the Wiener-Khinchin-Einstein Theorem (Cohen, 1998), which states that the autocorrelation function of a wide-sense-stationary random process has a spectral decomposition given by the power spectrum of that process. In other words, the first- and second-order statistical moments (i.e., due to autocorrelation) of the signal are encoded in its power spectrum, which depends purely on the amplitude spectrum. Consequently, the randomization of the phase does not affect the first- and second-order moments of the series, and hence the surrogates share the statistical properties of the original signal.

Since the original time-series (i.e., asset returns) are real-valued signals, their Fourier Transform after randomization of the phase should preserve conjugate symmetry, or equivalently, the randomly generated phase component should be an odd function of frequency. Then the Inverse Fourier Transform (IFT) returns real-valued surrogates.
Algorithm 7: Amplitude Adjusted Fourier Transform (AAFT).
inputs: M-variate original time-series X
output: M-variate synthetic time-series X̂
for i = 1, 2, ... M do
    calculate the Fourier Transform of the univariate series F[X_{:i}]
    randomize the phase component                                   // preserve odd symmetry of the phase
    calculate the Inverse Fourier Transform of the unchanged amplitude and randomized phase, giving X̂_{:i}
end

Remark 8.2 Importantly, the AAFT algorithm works on univariate series; therefore, the first two statistical moments of each single asset are preserved, but the cross-asset dependencies (i.e., cross-correlation, covariance) are modified due to the data augmentation.
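A minimal phase-randomization surrogate in Python is sketched below; it preserves the amplitude spectrum and randomizes the phase, while the amplitude-adjustment (rank remapping) step of the full AAFT is omitted for brevity.

```python
import numpy as np

def phase_randomized_surrogate(x, rng=np.random.default_rng()):
    """Fourier surrogate of a real univariate series.

    Keeps the amplitude spectrum (hence the power spectrum and, by the
    Wiener-Khinchin theorem, the autocorrelation) and randomizes the phase.
    Conjugate symmetry is handled implicitly by irfft.
    """
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=X.shape)
    phases[0] = 0.0                      # keep the DC component real
    if x.size % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist component real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=x.size)

returns = np.random.default_rng(1).normal(0, 0.01, size=1260)   # toy daily returns
fake = phase_randomized_surrogate(returns)
assert np.allclose(np.abs(np.fft.rfft(fake)), np.abs(np.fft.rfft(returns)), atol=1e-8)
```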
Operating on the same 12-asset universe used in the experiments of Chapter 6, examples of AAFT surrogates are given in Figure 8.5, along with the cumulative returns of each trading agent on this simulated universe. As expected, the model-free agents outperform the model-based agents, corroborating the results obtained in Chapter 6.

Figure 8.5: Synthetic, simulated universe of 12 assets from the S&P 500 via the Amplitude Adjusted Fourier Transform (AAFT). (Left) Example series from the universe (AAPL, GE, BA). (Right) Cumulative returns of the reinforcement learning trading agents.

Chapter 9
Market Data
Having verified the applicability of the trading agents in synthetic environments (i.e., deterministic and stochastic) in Chapter 8, their effectiveness is challenged in real financial markets, namely on the underlying stocks of the Standard & Poor's 500 (Investopedia, 2018f) and the EURO STOXX 50 (Investopedia, 2018a) indices. In detail, in this chapter:

• candidate reward generating functions are explored, in Section 9.1;

• paper trading experiments are carried out on the most liquid U.S. and European assets (see Sufficient Liquidity Assumption 5.1), in Sections 9.2 and 9.3, respectively;

• comparison matrices and insights into the learned agent strategies are obtained.
Reinforcement learning relies fundamentally on the hypothesis that the goal of the agent can be fully described by the maximization of cumulative reward over time, as suggested by the Reward Hypothesis 4.1. Consequently, the selection of the reward generating function can significantly affect the learned strategies, and hence the performance of the agents. Motivated by the returns-risk trade-off arising in investments (see Chapter 3), two reward functions are implemented and tested: the log returns and the Differential Sharpe Ratio (Moody et al., 1998).
The agent at time step t observes the asset prices o_t ≡ p_t and computes the log returns, given by:

$$\rho_t = \log(p_t \oslash p_{t-1}) \qquad (2.12)$$

where ⊘ designates element-wise division and the log function is also applied element-wise, or equivalently:

$$\rho_t \triangleq \begin{bmatrix} \rho_{1,t} \\ \rho_{2,t} \\ \vdots \\ \rho_{M,t} \end{bmatrix} = \begin{bmatrix} \log(p_{1,t}/p_{1,t-1}) \\ \log(p_{2,t}/p_{2,t-1}) \\ \vdots \\ \log(p_{M,t}/p_{M,t-1}) \end{bmatrix} \in \mathbb{R}^M \qquad (9.1)$$

Using the one-step log returns as the reward results in the multi-step maximization of cumulative log returns, which focuses only on profit, without considering any risk (i.e., variance) metric. Therefore, agents are expected to be highly volatile when trained with this reward function.

In Section 2.3, the Sharpe Ratio (Sharpe and Sharpe, 1970) was introduced, motivated by the Signal-to-Noise Ratio (SNR), and is given by:

$$SR_t \triangleq \sqrt{T}\,\frac{\mathbb{E}[r_t]}{\sqrt{\mathrm{Var}[r_t]}} \in \mathbb{R} \qquad (2.35)$$

where T is the number of samples considered in the calculation of the empirical mean and standard deviation. Therefore, empirical estimates of the mean and the variance of the portfolio returns are used in the calculation, making the Sharpe Ratio an inappropriate metric for online (i.e., adaptive) episodic learning. Nonetheless, the Differential Sharpe Ratio (DSR), introduced by Moody et al. (1998), is a suitable reward function. The DSR is obtained by:
1. considering exponential moving averages of the returns and of the standard deviation of returns in (2.35);
2. expanding to first order in the decay rate η:

$$SR_t \approx SR_{t-1} + \eta\,\frac{\partial SR_t}{\partial \eta}\bigg|_{\eta=0} + O(\eta^2) \qquad (9.2)$$

Noting that only the first-order term in the expansion (9.2) depends upon the return, r_t, at time step t, the differential Sharpe Ratio, D_t, is defined as:

$$D_t \triangleq \frac{\partial SR_t}{\partial \eta} = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}} \qquad (9.3)$$

where A_t and B_t are exponential moving estimates of the first and second moments of r_t, respectively, given by:

$$A_t = A_{t-1} + \eta\,\Delta A_t = A_{t-1} + \eta\,(r_t - A_{t-1}) \qquad (9.4)$$
$$B_t = B_{t-1} + \eta\,\Delta B_t = B_{t-1} + \eta\,(r_t^2 - B_{t-1}) \qquad (9.5)$$

Using the differential Sharpe Ratio as the reward results in the multi-step maximization of the Sharpe Ratio, which balances risk and profit, and hence is expected to lead to better strategies than the log returns.
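For concreteness, an online computation of the Differential Sharpe Ratio rewards, following (9.3)-(9.5), is sketched below with illustrative parameter values.

```python
import numpy as np

def differential_sharpe_ratio(returns, eta=0.01, eps=1e-12):
    """Online Differential Sharpe Ratio rewards, following (9.3)-(9.5).

    `returns` is a 1-D array of realized portfolio returns r_t; `eta` is the
    decay rate of the exponential moving moment estimates. `eps` guards the
    first steps, where the moving estimates are still zero.
    """
    A, B = 0.0, 0.0                       # moving estimates of E[r] and E[r^2]
    rewards = []
    for r in returns:
        dA, dB = r - A, r**2 - B
        denom = (B - A**2) ** 1.5
        D = (B * dA - 0.5 * A * dB) / (denom + eps)   # (9.3), using previous A, B
        rewards.append(D)
        A, B = A + eta * dA, B + eta * dB             # (9.4), (9.5)
    return np.asarray(rewards)

r = np.random.default_rng(0).normal(5e-4, 1e-2, size=252)    # toy daily returns
print(differential_sharpe_ratio(r)[:5])
```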
Publicly traded companies are usually also compared in terms of their
Market Value or Market Capitalization (Market Cap), given by multiplying the number of their outstanding shares by the current share price (Investopedia, 2018d), or equivalently:

Market Cap_{asset i} = Volume_{asset i} × Share Price_{asset i}   (9.6)

The Standard & Poor's 500 Index (S&P 500) is a market-capitalization-weighted index of the 500 largest U.S. publicly traded companies by market value (Investopedia, 2018d). According to the Capital Asset Pricing Model (CAPM) (Luenberger, 1997) and the Efficient Market Hypothesis (EMH) (Fama, 1970), the market index, the S&P 500, is efficient, and portfolios derived from its constituent assets cannot perform better (in the context of Section 3.1.2). Nonetheless, CAPM and EMH are not exactly satisfied, and trading opportunities can be exploited via proper strategies.

In order to compare the different trading agents introduced in Chapter 6, as well as the variants in Chapter 7 and Section 8.2, all agents are trained on the constituents of the S&P 500 (i.e., 500 U.S. assets) and their performance results are provided in Figure 9.1 and Table 9.1. As expected, the differential Sharpe Ratio (DSR) is more stable than log returns, yielding higher Sharpe Ratio strategies, up to 2.77 for the pre-trained and experience-transferred Mixture of Score Machines (MSM) agent.

Figure 9.1: Comparison of reinforcement learning trading agents on cumulative returns and Sharpe Ratio, trained with: (RL) Reinforcement Learning; (RL & PT) Reinforcement Learning and Pre-Training; (RL & PT & TL) Reinforcement Learning, Pre-Training and Transfer Learning from simulated data.
Remark 9.1
The simulations confirm the superiority of the universal model-free reinforcement learning agent, the Mixture of Score Machines (MSM), in asset allocation, with a considerable achieved performance gain in both cumulative returns and Sharpe Ratio, compared to the most recent models (Jiang et al., 2017) on the same universe.
Similar to the S&P 500, the EURO STOXX 50 (SX5E) is a benchmark for the 50 largest publicly traded companies by market value in the countries of the Eurozone.

Table 9.1: Trading Agents Comparison Matrix on the S&P 500: cumulative returns (%) and Sharpe Ratio under the two reward generating functions (differential Sharpe Ratio and log returns), for the SPY benchmark, the VAR, RNN, DSRQN, REINFORCE and MSM agents, and the pre-trained (PT) and transfer-learned (PT & TL) variants of REINFORCE and MSM.

A universal baseline agent is developed, based on the Sharpe Ratio with transaction costs extension of the Markowitz model (see optimization problem (3.14)) from Section 3.3. A Sequential Markowitz Model (SMM) agent is thus derived by iteratively applying the one-step optimization program solver at each time step t. The Markowitz model is obviously a universal portfolio optimizer, since it does not make assumptions about the universe (i.e., the underlying assets) it is applied to.

Given the EURO STOXX 50 market, transfer learning is performed for the
MSM agent trained on the S&P 500 (i.e., only the Mixture network is replaced and trained, while the parameters of the Score Machine networks are frozen). Figure 9.2 illustrates the cumulative returns of the market index (SX5E), the Sequential Markowitz Model (SMM) agent, and the Mixture of Score Machines (MSM) agent.
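A minimal PyTorch-style sketch of this transfer step is given below; the module names (`score_machines`, `mixture`) and layer sizes are illustrative placeholders, not the exact architecture used in this work.

```python
import torch
from torch import nn

# Hypothetical MSM skeleton: `score_machines` holds the shared SM(1)/SM(2)
# networks, `mixture` is the universe-specific head (sizes are illustrative).
class MSM(nn.Module):
    def __init__(self, n_assets: int, hidden: int = 32):
        super().__init__()
        n_scores = n_assets + n_assets * (n_assets - 1) // 2
        self.score_machines = nn.ModuleDict({
            "sm1": nn.GRU(input_size=1, hidden_size=hidden, batch_first=True),
            "sm2": nn.GRU(input_size=2, hidden_size=hidden, batch_first=True),
        })
        self.mixture = nn.Sequential(nn.Linear(n_scores + n_assets, hidden),
                                     nn.ReLU(), nn.Linear(hidden, n_assets))

# Transfer to a new universe: freeze the score machines, retrain only the mixture.
msm = MSM(n_assets=50)                      # e.g., the EURO STOXX 50 constituents
for p in msm.score_machines.parameters():
    p.requires_grad = False                 # keep the universal feature extractors fixed
optimizer = torch.optim.Adam(msm.mixture.parameters(), lr=1e-3)
```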
Remark 9.2
As desired, the MSM agent outperformed both the market index (SX5E) and the SMM agent, reflecting the universality of the MSM learned strategies, which are successful in both the S&P 500 and the EURO STOXX 50 markets.
It is also worth noting that the cumulative returns of the MSM and the SMM agents were correlated; however, the MSM performed better, especially after 2009, when the SMM followed the declining market while the MSM became profitable. This fact can be attributed to the pre-training stage of the MSM agent: during this stage, the policy gradient network converges to the Markowitz model, effectively mimicking the SMM strategies. The reinforcement episodic training then allows the MSM to improve upon, and outperform, its initial strategy, the SMM.
Figure 9.2: Cumulative returns of the Mixture of Score Machines (MSM) agent, trained on the S&P 500 market with experience transferred to the EURO STOXX 50 market (SX5E), along with the traditional Sequential Markowitz Model (SMM) and the index.

Chapter 10
Conclusion
The main objective of this report was to investigate the effectiveness of Reinforcement Learning agents on Sequential Portfolio Management. To achieve this, many concepts from the fields of Signal Processing, Control Theory, Machine Intelligence and Finance have been explored, extended and combined. In this chapter, the contributions and achievements of the project are summarized, along with possible axes for future research.
To enable episodic reinforcement learning, a mathematical formulation of financial markets as discrete-time stochastic dynamical systems is provided, giving rise to a unified, versatile framework for training agents and investment strategies.

A comprehensive account of reinforcement learning agents has been developed, including traditional, baseline agents from system identification (i.e., model-based methods) as well as context-agnostic agents (i.e., model-free methods). A universal model-free reinforcement learning family of agents has been introduced, which is able to reduce the model's computational and memory complexity (i.e., linear scaling with universe size) and to generalize strategies across assets and markets, regardless of the training universe. It also outperformed all trading agents found in the open literature on the S&P 500 and the EURO STOXX 50 markets.

Lastly, model pre-training, data augmentation and simulations enabled robust training of deep neural network architectures, even with a limited amount of available real market data.

Future Work
Despite the performance gain of the developed strategies, the lack of interpretability (Rico-Martinez et al., 1994) and the inability to exhaustively test the deep architectures used (i.e., Deep Neural Networks) discourage practitioners from adopting these solutions. As a consequence, it is worth investigating and interpreting the learned strategies by opening the deep "black box" and being able to reason about its decisions.

In addition, exploiting the large number of degrees of freedom given by the framework formulation of financial markets, further research on reward generating functions and state representations could improve the convergence properties and overall performance of the agents. A valid approach would be to construct the indicators typically used in the technical analysis of financial instruments (King and Levine, 1992). These measures would embed the expert knowledge acquired by financial analysts over decades of activity and could help guide the agent towards better decisions (Wilmott, 2007).

Furthermore, the flexible architecture of the Mixture of Score Machines (MSM) agent, the best-scoring universal trading agent, allows experimentation with both the Score Machine (SM) networks and their model order, as well as with the Mixture network, which, ideally, should be usable universally, without transfer learning.

The trading agents in this report are based on point estimates, provided by a deep neural network. However, due to the uncertainty of the financial signals, it would be appropriate to also model this uncertainty, incorporating it into the decision-making process. Bayesian Inference, or more tractable variants of it, including Variational Inference (Titsias and Lawrence, 2010), can be used to train probabilistic models, capable of capturing the environment uncertainty (Vlassis et al., 2012).

Last but not least, motivated by the recent publication by Fellows et al. (2018) on the exact calculation of the policy gradient by operating in the Fourier domain, employing an exact policy gradient method could eliminate the estimator variance and accelerate training. (At the time this report is submitted, that paper has not yet been presented, but it has been accepted at the International Conference on Machine Learning (ICML) 2018.)

Bibliography
Bibliography

Markowitz, Harry (1952). "Portfolio selection". The Journal of Finance.
Dynamic programming. Courier Corporation.
Fama, Eugene F (1970). "Efficient capital markets: A review of theory and empirical work". The Journal of Finance.
Portfolio theory and capital markets. Vol. 217. McGraw-Hill New York.
Moylan, P and B Anderson (1973). "Nonlinear regulator theory and an inverse optimal control problem". IEEE Transactions on Automatic Control.
Information and Control.
Proceedings of the National Academy of Sciences.
The Journal of Finance.
European Journal of Operational Research.
Journal of Econometrics.
Mathematics of Control, Signals and Systems.
Proceedings of the IEEE.
et al. (1991). "Adaptive mixtures of local experts". Neural Computation.
Financial indicators and growth in a cross section of countries. Vol. 819. World Bank Publications.
Prichard, Dean and James Theiler (1994). "Generating surrogate data for time series with several simultaneously measured variables". Physical Review Letters.
Neural Networks for Signal Processing [1994] IV. Proceedings of the 1994 IEEE Workshop. IEEE, pp. 596–605.
Bertsekas, Dimitri P et al. (1995). Dynamic programming and optimal control. Vol. 1. 2. Athena Scientific, Belmont, MA.
LeCun, Yann, Yoshua Bengio, et al. (1995). "Convolutional networks for images, speech, and time series". The Handbook of Brain Theory and Neural Networks.
Communications of the ACM.
Advances in Neural Information Processing Systems, pp. 952–958.
Salinas, Emilio and LF Abbott (1996). "A model of multiplicative neural responses in parietal cortex". Proceedings of the National Academy of Sciences.
Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on. Vol. 4. IEEE, pp. 3557–3564.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation.
et al. (1997). "Investment science". OUP Catalogue.
Ortiz-Fuentes, Jorge D and Mikel L Forcada (1997). "A comparison between recurrent neural network architectures for digital equalization". Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. Vol. 4. IEEE, pp. 3281–3284.
Rust, John (1997). "Using randomization to break the curse of dimensionality". Econometrica: Journal of the Econometric Society, pp. 487–516.
Akaike, Hirotugu (1998). "Markovian representation of stochastic processes and its application to the analysis of autoregressive moving average processes". Selected Papers of Hirotugu Akaike. Springer, pp. 223–247.
Cohen, Leon (1998). "The generalization of the Wiener-Khinchin theorem". Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. Vol. 3. IEEE, pp. 1577–1580.
Hochreiter, Sepp (1998). "The vanishing gradient problem during learning recurrent neural nets and problem solutions". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.
et al. (1998). "Reinforcement learning for trading systems and portfolios: Immediate vs future rewards". Decision Technologies for Computational Finance. Springer, pp. 129–140.
Papadimitriou, Christos H and Kenneth Steiglitz (1998). Combinatorial optimization: Algorithms and complexity. Courier Corporation.
Sutton, Richard S and Andrew G Barto (1998). Introduction to reinforcement learning. Vol. 135. MIT Press Cambridge.
Gers, Felix A, Jürgen Schmidhuber, and Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM".
Ng, Andrew Y, Stuart J Russell, et al. (2000). "Algorithms for inverse reinforcement learning". ICML, pp. 663–670.
Sutton, Richard S et al. (2000a). "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems, pp. 1057–1063.
— (2000b). "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems, pp. 1057–1063.
Ghahramani, Zoubin (2001). "An introduction to hidden Markov models and Bayesian networks". International Journal of Pattern Recognition and Artificial Intelligence.
et al. (2001). "A builder's guide to agent-based financial markets". Quantitative Finance.
et al. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. Wiley Online Library.
Tino, Peter, Christian Schittenkopf, and Georg Dorffner (2001). "Financial volatility trading using recurrent neural networks". IEEE Transactions on Neural Networks.
Artificial Intelligence.
Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on. Vol. 4. IEEE, pp. 3404–3410.
Bengio, Yoshua et al. (2003). "A neural probabilistic language model". Journal of Machine Learning Research.
Econometric analysis. Pearson Education India.
Boyd, Stephen and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
Kohl, Nate and Peter Stone (2004). "Policy gradient reinforcement learning for fast quadrupedal locomotion". Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 3. IEEE, pp. 2619–2624.
Mandic, Danilo P (2004). "A generalized normalized gradient descent algorithm". IEEE Signal Processing Letters.
Advanced lectures on machine learning. Springer, pp. 63–71.
Tsay, Ruey S (2005). Analysis of financial time series. Vol. 543. John Wiley & Sons.
Gatev, Evan, William N Goetzmann, and K Geert Rouwenhorst (2006). "Pairs trading: Performance of a relative-value arbitrage rule". The Review of Financial Studies.
Journal of Electronic Imaging.
Paul Wilmott introduces quantitative finance. John Wiley & Sons.
Kober, Jens and Jan R Peters (2009). "Policy search for motor primitives in robotics". Advances in Neural Information Processing Systems, pp. 849–856.
Meucci, Attilio (2009). Risk and asset allocation. Springer Science & Business Media.
Raeth, Christoph and R Monetti (2009). "Surrogates with random Fourier phases". Topics On Chaotic Systems: Selected Papers from CHAOS 2008 International Conference. World Scientific, pp. 274–285.
Ahmed, Nesreen K et al. (2010). "An empirical comparison of machine learning models for time series forecasting". Econometric Reviews.
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Pan, Sinno Jialin and Qiang Yang (2010a). "A survey on transfer learning". IEEE Transactions on Knowledge and Data Engineering.
Pan, Sinno Jialin and Qiang Yang (2010b). "A survey on transfer learning". IEEE Transactions on Knowledge and Data Engineering.
Artificial intelligence: Foundations of computational agents. Cambridge University Press.
Szepesvári, Csaba (2010). "Algorithms for reinforcement learning". Synthesis Lectures on Artificial Intelligence and Machine Learning.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 844–851.
Deisenroth, Marc and Carl E Rasmussen (2011). "PILCO: A model-based and data-efficient approach to policy search". Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.
Hénaff, Patrick et al. (2011). "Real time implementation of CTRNN and BPTT algorithm to learn on-line biped robot balance: Experiments on the standing posture". Control Engineering Practice.
IEEE Signal Processing Magazine.
Journal of Economic Dynamics and Control.
Machine Learning: A Probabilistic Perspective. The MIT Press. ISBN: 0262018020, 9780262018029.
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2012). "Understanding the exploding gradient problem". CoRR, abs/1211.5063.
Tieleman, Tijmen and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude". COURSERA: Neural Networks for Machine Learning.
et al. (2012). "Bayesian reinforcement learning". Reinforcement Learning. Springer, pp. 359–386.
Aldridge, Irene (2013). High-frequency trading: A practical guide to algorithmic strategies and trading systems. Vol. 604. John Wiley & Sons.
Giusti, Alessandro et al. (2013). "Fast image scanning with deep max-pooling convolutional neural networks". Image Processing (ICIP), 2013 20th IEEE International Conference on. IEEE, pp. 4034–4038.
Michalski, Ryszard S, Jaime G Carbonell, and Tom M Mitchell (2013). Machine learning: An artificial intelligence approach. Springer Science & Business Media.
Roberts, Stephen et al. (2013). "Gaussian processes for time-series modelling". Phil. Trans. R. Soc. A.
IEEE Transactions on Instrumentation and Measurement.
arXiv preprint arXiv:1412.6980.
Almeida, Náthalee C, Marcelo AC Fernandes, and Adrião DD Neto (2015). "Beamforming and power control in sensor arrays using reinforcement learning". Sensors.
et al. (2015). "Black box variational inference for state space models". arXiv preprint arXiv:1511.07367.
Chen, Kai, Yi Zhou, and Fangyan Dai (2015). "A LSTM-based method for stock returns prediction: A case study of China stock market". Big Data (Big Data), 2015 IEEE International Conference on. IEEE, pp. 2823–2824.
Hausknecht, Matthew and Peter Stone (2015). "Deep recurrent Q-learning for partially observable MDPs". CoRR, abs/1507.06527.
Hills, Thomas T et al. (2015). "Exploration versus exploitation in space, mind, and society". Trends in Cognitive Sciences.
et al. (2015). "Continuous control with deep reinforcement learning". arXiv preprint arXiv:1509.02971.
Mnih, Volodymyr et al. (2015). "Human-level control through deep reinforcement learning". Nature.
Integrating learning and planning. URL: .
— (2015b). Introduction to reinforcement learning. URL: .
— (2015c). Markov decision processes. URL: .
— (2015d). Model-free control. URL: .
— (2015e). Policy gradient. URL: .
Feng, Yiyong, Daniel P Palomar, et al. (2016). "A signal processing perspective on financial engineering". Foundations and Trends® in Signal Processing.
University of Cambridge.
Gal, Yarin and Zoubin Ghahramani (2016). "A theoretically grounded application of dropout in recurrent neural networks". Advances in Neural Information Processing Systems, pp. 1019–1027.
Gal, Yarin, Rowan McAllister, and Carl Edward Rasmussen (2016). "Improving PILCO with Bayesian neural network dynamics models". Data-Efficient Machine Learning Workshop, ICML.
Goodfellow, Ian et al. (2016). Deep learning. Vol. 1. MIT Press Cambridge.
Heaton, JB, NG Polson, and Jan Hendrik Witte (2016). "Deep learning in finance". arXiv preprint arXiv:1602.06561.
Kennedy, Douglas (2016). Stochastic financial models. Chapman and Hall/CRC.
Levine, Sergey et al. (2016). "End-to-end training of deep visuomotor policies". The Journal of Machine Learning Research.
et al. (2016). "State of the art control of Atari games using shallow reinforcement learning". Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, pp. 485–493.
Mnih, Volodymyr et al. (2016). "Asynchronous methods for deep reinforcement learning". International Conference on Machine Learning, pp. 1928–1937.
Necchi, Pierpaolo (2016). Policy gradient algorithms for asset allocation problem. URL: https://github.com/pnecchi/Thesis/blob/master/MS_Thesis_Pierpaolo_Necchi.pdf.
Nemati, Shamim, Mohammad M Ghassemi, and Gari D Clifford (2016). "Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach". Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the. IEEE, pp. 2978–2981.
Silver, David and Demis Hassabis (2016). "AlphaGo: Mastering the ancient game of Go with Machine Learning". Research Blog.
Bao, Wei, Jun Yue, and Yulei Rao (2017). "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". PloS One.
et al. (2017). "Deep direct reinforcement learning for financial signal representation and trading". IEEE Transactions on Neural Networks and Learning Systems.
Applied Stochastic Models in Business and Industry.
arXiv preprint arXiv:1706.10059.
Navon, Ariel and Yosi Keller (2017). "Financial time series prediction using deep learning". arXiv preprint arXiv:1711.04174.
Noonan, Laura (2017). JPMorgan develops robot to execute trades. URL: .
Quantopian (2017). Commission models. URL: .
Schinckus, Christophe (2017). "An essay on financial information in the era of computerization". Journal of Information Technology, pp. 1–10.
Zhang, Xiao-Ping Steven and Fang Wang (2017). "Signal processing for finance, economics, and marketing: Concepts, framework, and big data applications". IEEE Signal Processing Magazine.
CoRR.
Investopedia (2018a).