Reinforcement Learning for Portfolio Management
MEng Dissertation
Angelos Filos
CID: 00943119
June 20, 2018

Supervisor: Professor Danilo Mandic
Second Marker: Professor Pier Luigi Dragotti
Advisors: Bruno Scalzo Dees, Gregory Sidier
Department of Electrical and Electronic Engineering, Imperial College London

Acknowledgements

I would like to thank Professor Danilo Mandic for agreeing to supervise this self-proposed project, despite the uncertainty about the viability of the topic. His support and guidance contributed to the delivery of a challenging project.
I would also like to take this opportunity to thank Bruno Scalzo Dees for his helpful comments, suggestions and enlightening discussions, which have been instrumental in the progress of the project.
Lastly, I would like to thank Gregory Sidier for spending time with me, out of his working hours. His experience as a practitioner in Quantitative Finance helped me demystify and engage with topics essential to the project.

Abstract
The challenges of modelling the behaviour of financial markets, such as non-stationarity, poor predictive behaviour, and weak historical coupling, have attracted the attention of the scientific community over the last 50 years, and have sparked a persistent drive to employ engineering methods to address and overcome these challenges. Traditionally, mathematical formulations of dynamical systems in the context of Signal Processing and Control Theory have been a lynchpin of today's Financial Engineering. More recently, advances in sequential decision making, mainly through the concept of Reinforcement Learning, have been instrumental in the development of multistage stochastic optimization, a key component in sequential portfolio optimization (asset allocation) strategies. In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based approach) as well as on context-independent agents (model-free approach). The analysis provides conclusive support for the ability of model-free reinforcement learning methods to act as universal trading agents, which are not only capable of reducing the computational and memory complexity (owing to their linear scaling with the size of the universe), but also serve as generalizing strategies across assets and markets, regardless of the trading universe on which they have been trained. The relatively low volume of daily returns in financial market data is addressed via data augmentation (a generative approach) and a choice of pre-training strategies, both of which are validated against current state-of-the-art models. For rigour, a risk-sensitive framework which includes transaction costs is considered, and its performance advantages are demonstrated in a variety of scenarios, from synthetic time-series (sinusoidal, sawtooth and chirp waves), simulated market series (surrogate data based), through to real market data (S&P 500 and EURO STOXX 50). The analysis and simulations confirm the superiority of universal model-free reinforcement learning agents over current portfolio management models in asset allocation strategies, with an achieved performance advantage of as much as 9.2% in annualized cumulative returns and 13.4% in annualized Sharpe Ratio.
Contents

I Background
II Innovation
III Experiments
10 Conclusion
Bibliography

Chapter 1
Introduction
Engineering methods and systems are routinely used in financial market applications, including signal processing, control theory and advanced statistical methods. The computerization of the markets (Schinckus, 2017) encourages automation and algorithmic solutions, which are now well understood and addressed by the engineering communities. Moreover, the recent success of Machine Learning has attracted the interest of the financial community, which constantly seeks successful techniques from other areas, such as computer vision and natural language processing, to enhance the modelling of financial markets. In this thesis, we explore how the asset allocation problem can be addressed by Reinforcement Learning, a branch of Machine Learning that optimally solves sequential decision making problems via direct interaction with the environment in an episodic manner.
In this introductory chapter, we define the objective of the thesis and highlight the research and application domains from which we draw inspiration.
The aim of this report is to investigate the effectiveness of Reinforcement Learning agents on asset allocation. A finite universe of financial instruments (assets), such as stocks, is selected and the role of an agent is to construct an internal representation (model) of the market, allowing it to determine how to optimally allocate funds of a finite budget to those assets. The agent is trained on both synthetic and real market data. Its performance is then compared with standard portfolio management algorithms on an out-of-sample dataset: data that the agent has not been trained on (i.e., the test set). The terms Asset Allocation and Portfolio Management are used interchangeably throughout the report.
From the IBM TD-Gammon (Tesauro, 1995) and the IBM Deep Blue (Campbell et al., 2002) to the Google DeepMind Atari (Mnih et al., 2015) and the Google DeepMind AlphaGo (Silver and Hassabis, 2016), reinforcement learning is well known for its effectiveness in board and video games. Nonetheless, reinforcement learning applies to many more domains, including Robotics, Medicine and Finance, applications of which align with the mathematical formulation of portfolio management. Motivated by the success of some of these applications, an attempt is made to improve and adjust the underlying methods, such that they are applicable to the asset allocation problem setting. In particular, special attention is given to:
• Adaptive Signal Processing, where beamforming has been successfully addressed via reinforcement learning by Almeida et al. (2015);
• Medicine, where a data-driven medication dosing system (Nemati et al., 2016) has been made possible thanks to model-free reinforcement learning agents;
• Algorithmic Trading, where automated execution (Noonan, 2017) and market making (Spooner et al., 2018) have recently been revolutionized by reinforcement learning agents.
Without claiming equivalence of portfolio management with any of the above applications, their relatively similar optimization problem formulations encourage the endeavour to develop reinforcement learning agents for asset allocation.
The report is organized in three parts: the Background (Part I), the Innovation (Part II) and the Experiments (Part III). Readers are advised to follow the sequence of the parts as presented; however, if comfortable with Modern Portfolio Theory and Reinforcement Learning, they can focus on the last two parts, following the provided references to background material when necessary. A brief outline of the project structure and chapters is provided below:
Chapter 2: Financial Signal Processing
The objective of this chapter is to introduce essential financial terms and concepts for understanding the methods developed later in the report.
Chapter 3: Portfolio Optimization
Building on the basics of Financial Signal Processing, this chapter proceeds with the mathematical formulation of static Portfolio Management, motivating the use of Reinforcement Learning to address sequential Asset Allocation via multi-stage decision making.
Chapter 4: Reinforcement Learning
This chapter serves as an important step toward demystifying Reinforcement Learning concepts, by highlighting their analogies to Optimal Control and Systems Theory. The concepts developed in this chapter are essential to the understanding of the trading algorithms and agents developed later in the report.
Chapter 5: Financial Market as Discrete-Time Stochastic Dynamical System
This chapter parallels Chapters 3 and 4, introducing a unified, versatile framework for training agents and investment strategies.
Chapter 6: Trading Agents
The objectives of this chapter are to: (1) introduce traditional model-based (i.e., system identification) reinforcement learning trading agents; (2) develop model-free reinforcement learning trading agents; (3) suggest a flexible universal trading agent architecture that enables pragmatic applications of Reinforcement Learning for Portfolio Management; (4) assess the performance of the developed trading agents in a small-scale experiment (i.e., a 12-asset S&P 500 market).
Chapter 7: Pre-Training
In this chapter, a pre-training strategy is suggested, which addresses the local optimality of the Policy Gradient agents when only a limited number of financial market data samples is available.
Chapter 8: Synthetic Data
In this chapter, the effectiveness of the trading agents of Chapter 6 is assessed on synthetic data, from deterministic time-series (sinusoidal, sawtooth and chirp waves) to simulated market series (surrogate data based). The superiority of model-based or model-free agents is highlighted in each scenario.
Chapter 9: Market Data
This chapter parallels Chapter 8, evaluating the performance of the trading agents of Chapter 6 on real market data, from two distinct universes: (1) the underlying U.S. stocks of the S&P 500 and (2) the underlying European stocks of the EURO STOXX 50.

Part I
Background

Chapter 2
Financial Signal Processing
Financial applications usually involve the manipulation and analysis of sequences of observations, indexed by time order, also known as time-series. Signal Processing, on the other hand, provides a rich toolbox for systematic time-series analysis, modelling and forecasting (Mandic and Chambers, 2001). Consequently, signal processing methods can be employed to mathematically formulate and address fundamental economics and business problems. In addition, Control Theory studies discrete dynamical systems, which form the basis of Reinforcement Learning, the set of algorithms used in this report to solve the asset allocation problem. The links between signal processing algorithms, systems and control theory motivate their integration with finance, to which we refer as Financial Signal Processing or Financial Engineering.
In this chapter, the overlap of signal processing and control theory with finance is explored, attempting to bridge their gaps and highlight their similarities. Firstly, in Section 2.1, essential financial terms and concepts are introduced, while in Section 2.2 the time-series in the context of finance are formalized. In Section 2.3 the evaluation criteria used throughout the report to assess the performance of the different algorithms and strategies are explained, while in Section 2.4 signal processing methods for modelling sequential data are studied.
In order to better communicate ideas and gain insight into the economic problems, basic terms are defined and explained in this section. Useful definitions are also provided by Johnston and Djurić (2011).
Figure 2.1: Financial Engineering relative to Signal Processing and Control Theory.

An asset is an item of economic value. Examples of assets are cash (in hand or in a bank), stocks, loans and advances, accrued incomes, etc. Our main focus in this report is on cash and stocks, but the general principles apply to all kinds of assets.

Assumption 2.1 The assets under consideration are liquid, hence they can be converted into cash quickly, with little or no loss in value. Moreover, the selected assets have available historical data in order to enable analysis.

A portfolio is a collection of multiple financial assets, and is characterized by its:
• Constituents: the M assets of which it consists;
• Portfolio vector, w_t: its i-th component represents the ratio of the total budget invested in the i-th asset, such that:

\mathbf{w}_t = \begin{bmatrix} w_{1,t}, w_{2,t}, \dots, w_{M,t} \end{bmatrix}^T \in \mathbb{R}^{M} \quad \text{and} \quad \sum_{i=1}^{M} w_{i,t} = 1 \quad (2.1)

Given its constituents and portfolio vector w_t, a portfolio can be treated as a single master asset. Therefore, the analysis of single simple assets can be applied to portfolios upon determination of the constituents and the corresponding portfolio vector.
Portfolios are a more powerful, general representation of financial assets, since the single-asset case can be represented by a portfolio; the j-th asset is equivalent to the portfolio with vector e^(j), where the j-th term is equal to unity and the rest are zero. Portfolios are also preferred over single assets in order to minimize risk, as illustrated in Figure 2.2.
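As a minimal illustration of the portfolio vector definition (2.1), the sketch below builds a hypothetical three-asset portfolio vector, verifies the budget constraint, and expresses the single-asset case as a one-hot portfolio e^(j); the asset names and weights are made up for this example and are not taken from the report's experiments.

```python
import numpy as np

# Hypothetical three-asset universe (names and weights are illustrative only).
assets = ["AAPL", "GE", "BA"]

# Portfolio vector w_t: fraction of the budget allocated to each asset, eq. (2.1).
w_t = np.array([0.5, 0.2, 0.3])

# Budget constraint: the components of w_t must sum to one.
assert np.isclose(w_t.sum(), 1.0)

# The single-asset case is a degenerate portfolio: a one-hot vector e^(j).
j = assets.index("GE")
e_j = np.eye(len(assets))[j]          # [0., 1., 0.]
assert np.isclose(e_j.sum(), 1.0)

print(dict(zip(assets, w_t)), e_j)
```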
Figure 2.2: Risk for a single asset and a number of uncorrelated portfolios. Risk is represented by the standard deviation, or the width, of the distribution curves, illustrating that a large portfolio (e.g., M = 100) is less risky than a single asset (M = 1).

Sometimes it is possible to sell an asset that we do not own. This process is called short selling or shorting (Luenberger, 1997). The exact shorting mechanism varies between markets, but it can be generally summarized as:
1. Borrowing an asset i from someone who owns it at time t;
2. Selling it immediately to someone else at price p_{i,t};
3. Buying back the asset at time (t + k), where k > 0, at price p_{i,t+k};
4. Returning the asset to the lender.
Therefore, if one unit of the asset is shorted, the overall absolute return is p_{i,t} − p_{i,t+k}, and as a result short selling is profitable only if the asset price declines between time t and t + k, that is, p_{i,t+k} < p_{i,t}. Nonetheless, note that the potential loss of short selling is unbounded, since asset prices are not bounded from above (0 ≤ p_{i,t+k} < ∞).

Remark 2.2
If short selling is allowed, then the portfolio vector satisfies (2.1), but w_i can be negative if the i-th asset is shorted. As a consequence, w_j can be greater than 1, such that ∑_{i=1}^M w_i = 1 still holds. For instance, in the case of a two-asset portfolio, the portfolio vector w_t = [−0.5, 1.5]^T is valid and can be interpreted as: 50% of the budget is short sold on the first asset (w_{1,t} = −0.5 < 0), while the proceeds are added to the long position on the second asset (w_{2,t} = 1.5 > 1). The terms long and short position in an asset are used to refer to investments where we buy or short sell the asset, respectively.

The dynamic nature of the economy, as a result of the non-static supply and demand balance, causes prices to evolve over time. This encourages us to treat market dynamics as time-series and to employ technical methods and tools for their analysis and modelling. In this section, asset prices are introduced, whose definition immediately reflects our intuition, as well as other time-series derived to ease analysis and evaluation.
Let p_t ∈ R_+ be the price of an asset at discrete time index t (Feng and Palomar, 2016); then the sequence p_1, p_2, . . . , p_T is a univariate time-series. The equivalent notations p_{i,t} and p_{asset i,t} are also used to distinguish between the prices of the different assets. Hence, the T-samples price time-series of an asset i is the column vector \vec{p}_{i,1:T}, such that:

\vec{p}_{i,1:T} = \begin{bmatrix} p_{i,1} \\ p_{i,2} \\ \vdots \\ p_{i,T} \end{bmatrix} \in \mathbb{R}_{+}^{T} \quad (2.2)

where the arrow highlights the fact that it is a time-series. For convenience of portfolio analysis, we define the price vector p_t, such that:

\mathbf{p}_t = \begin{bmatrix} p_{1,t}, p_{2,t}, \dots, p_{M,t} \end{bmatrix} \in \mathbb{R}_{+}^{M} \quad (2.3)

where the i-th element is the asset price of the i-th asset in the portfolio at time t. Extending the single-asset time-series notation to the multivariate case, we form the asset price matrix \vec{P}_T by stacking column-wise the T-samples price time-series of the M assets of the portfolio:

\vec{P}_T = \begin{bmatrix} \vec{p}_{1,1:T}, \vec{p}_{2,1:T}, \dots, \vec{p}_{M,1:T} \end{bmatrix} = \begin{bmatrix} p_{1,1} & p_{2,1} & \cdots & p_{M,1} \\ p_{1,2} & p_{2,2} & \cdots & p_{M,2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{1,T} & p_{2,T} & \cdots & p_{M,T} \end{bmatrix} \in \mathbb{R}_{+}^{T \times M} \quad (2.4)

This formulation enables cross-asset analysis and consideration of the interdependencies between the different assets. We usually relax notation by omitting subscripts when they can be easily inferred from context.
Figure 2.3 illustrates examples of asset price time-series and the corresponding distribution plots. At a first glance, note the highly non-stationary nature of asset prices and hence the difficulty of interpreting their distribution plots. Moreover, we highlight the unequal scaling between prices, where, for example, the GE (General Electric) average price at 23.14$ and the BA (Boeing Company) average price at 132.23$ are of different orders and difficult to compare.

Absolute asset prices are not directly useful for an investor. On the other hand, price changes over time are of great importance, since they reflect the investment profit and loss, or more compactly, its return.
Figure 2.3: Asset prices time-series (left) and distributions (right) for AAPL (Apple), GE (General Electric) and BA (Boeing Company).

Gross Return
The gross return R_t of an asset represents the scaling factor of an investment in the asset at time (t − 1) (Feng and Palomar, 2016). For example, a B dollars investment in an asset at time (t − 1) will be worth B R_t dollars at time t. It is given by the ratio of its prices at times t and (t − 1), such that:

R_t \triangleq \frac{p_t}{p_{t-1}} \in \mathbb{R} \quad (2.5)

Figure 2.4 illustrates the benefit of using gross returns over asset prices.

Remark 2.3 The asset gross returns are concentrated around unity and their behaviour does not vary over time for all stocks, making them attractive candidates for stationary autoregressive (AR) processes (Mandic, 2018a).
Figure 2.4: Asset gross returns time-series (left) and distributions (right).
Simple Return
A more commonly used term is the simple return, r_t, which represents the percentage change in asset price from time (t − 1) to time t, such that:

r_t \triangleq \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1 \overset{(2.5)}{=} R_t - 1 \in \mathbb{R} \quad (2.6)

The gross and simple returns are straightforwardly connected, but the latter is more interpretable, and thus more frequently used.
Figure 2.5 depicts the example assets' simple returns time-series and their corresponding distributions. Unsurprisingly, simple returns possess the representation benefits of gross returns, such as stationarity and normalization. Therefore, we can use simple returns as a comparable metric for all assets, thus enabling the evaluation of analytic relationships among them, despite originating from asset prices of different scales.
Figure 2.5: Single Assets Simple Returns

The T-samples simple returns time-series of the i-th asset is given by the column vector \vec{r}_{i,1:T}, such that:

\vec{r}_{i,1:T} = \begin{bmatrix} r_{i,1} \\ r_{i,2} \\ \vdots \\ r_{i,T} \end{bmatrix} \in \mathbb{R}^{T} \quad (2.7)

while the simple returns vector r_t is:

\mathbf{r}_t = \begin{bmatrix} r_{1,t} \\ r_{2,t} \\ \vdots \\ r_{M,t} \end{bmatrix} \in \mathbb{R}^{M} \quad (2.8)

where r_{i,t} is the simple return of the i-th asset at time index t.

Remark 2.4 Exploiting the representation advantage of the portfolio over single assets, we define the portfolio simple return as the linear combination of the simple returns of its constituents, weighted by the portfolio vector.
Hence, at time index t, we obtain:

r_t \triangleq \sum_{i=1}^{M} w_{i,t}\, r_{i,t} = \mathbf{w}_t^T \mathbf{r}_t \in \mathbb{R} \quad (2.9)

Combining the price matrix in (2.4) and the definition of the simple return (2.6), we construct the simple return matrix \vec{R}_T by stacking column-wise the T-samples simple returns time-series of the M assets of the portfolio, to give:

\vec{R}_T = \begin{bmatrix} \vec{r}_{1,1:T}, \vec{r}_{2,1:T}, \dots, \vec{r}_{M,1:T} \end{bmatrix} = \begin{bmatrix} r_{1,1} & r_{2,1} & \cdots & r_{M,1} \\ r_{1,2} & r_{2,2} & \cdots & r_{M,2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1,T} & r_{2,T} & \cdots & r_{M,T} \end{bmatrix} \in \mathbb{R}^{T \times M} \quad (2.10)

Collecting the portfolio (column) vectors for the time interval t ∈ [1, T] into a portfolio weights matrix \vec{W}_T, we obtain the portfolio returns time-series by multiplication of \vec{R}_T with \vec{W}_T and extraction of the T diagonal elements of the product, such that:

\mathbf{r}_{1:T} = \mathrm{diag}\left( \vec{R}_T \vec{W}_T \right) \in \mathbb{R}^{T} \quad (2.11)
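As a minimal numpy sketch of equations (2.6) and (2.9), simple returns can be computed from a price matrix and combined into a portfolio return series; the price values and the constant weight vector below are made up purely for illustration.

```python
import numpy as np

# Price matrix P in R^{T x M}: rows are time steps, columns are assets (made-up values).
P = np.array([[100.0, 20.0, 130.0],
              [101.0, 19.5, 131.0],
              [ 99.5, 19.8, 133.5],
              [102.0, 20.1, 132.0]])

# Simple returns, eq. (2.6): r_t = p_t / p_{t-1} - 1, computed column-wise.
R = P[1:] / P[:-1] - 1.0              # shape (T-1, M)

# Constant portfolio vector (eq. 2.1); in general w_t may change every period.
w = np.array([0.5, 0.2, 0.3])

# Portfolio simple returns, eq. (2.9): r_t = w^T r_t for every period.
portfolio_returns = R @ w             # shape (T-1,)

print(R.round(4), portfolio_returns.round(4))
```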
Log Return

Despite the interpretability of the simple return as the percentage change in asset price over one period, it is asymmetric, and therefore practitioners tend to use log returns instead (Kennedy, 2016), in order to preserve interpretation and to yield a symmetric measure. Using the example in Table 2.1, a 15% increase in price followed by a 15% decline does not result in the initial price of the asset. On the contrary, a 15% log-increase in price followed by a 15% log-decline returns to the initial asset price, reflecting the symmetric behaviour of log returns.
Let the log return ρ_t at time t be:

\rho_t \triangleq \ln\left( \frac{p_t}{p_{t-1}} \right) \overset{(2.5)}{=} \ln(R_t) \in \mathbb{R} \quad (2.12)

Note the very close connection of the gross return to the log return. Moreover, since the gross return is centered around unity, the logarithmic operator makes log returns concentrated around zero, as clearly observed in Figure 2.6.

time t | simple return | price ($) | log return
0      | -             | 100       | -
1      | +0.15         | 110       | +0.13
2      | -0.15         | 99        | -0.16
3      | +0.01         | 100       | +0.01
4      | -0.14         | 86        | -0.15
5      | +0.16         | 100       | +0.15

Table 2.1: Simple Return Asymmetry & Log Return Symmetry
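The asymmetry described above can be checked in a couple of lines; the sketch below applies a 15% simple-return rise and fall, and then a 15% log-return rise and fall, to a starting price of 100 (a worked example in the spirit of Table 2.1, not a reproduction of its exact figures).

```python
import numpy as np

p0 = 100.0

# Simple returns: +15% followed by -15% does not recover the initial price.
p_simple = p0 * (1 + 0.15) * (1 - 0.15)       # 97.75

# Log returns: +0.15 followed by -0.15 returns exactly to the initial price.
p_log = p0 * np.exp(0.15) * np.exp(-0.15)     # 100.0

print(p_simple, p_log)
```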
AAPLGE
BA 0.15 0.10 0.05 0.00 0.05 0.10Asset Log Returns, t F r e q u e n c y D e n s i t y Asset Log Returns: Distributions
AAPLGEBA
Figure 2.6: Single Assets Log Returns

Comparing the definitions of simple and log returns in (2.6) and (2.12), respectively, we obtain the relationship:

\rho_t = \ln(1 + r_t) \quad (2.13)

hence we can define all time-series and convenient portfolio representations of log returns by substituting simple returns into (2.13). For example, the portfolio log return is given by substitution of (2.9) into (2.13), such that:

\rho_t \triangleq \ln\left(1 + \mathbf{w}_t^T \mathbf{r}_t\right) \in \mathbb{R} \quad (2.14)

The end goal is the construction of portfolios, linear combinations of individual assets, whose properties (e.g., returns, risk) are optimal under provided conditions and constraints. As a consequence, a set of evaluation criteria and metrics is necessary in order to evaluate the performance of the generated portfolios. Due to the uncertainty of the future dynamics of the financial markets, we study the statistical properties of the asset returns, as well as other risk metrics, motivated by signal processing.

Future prices, and hence returns, are inherently unknown and uncertain (Kennedy, 2016). To mathematically capture and manipulate this stochasticity, we treat future market dynamics (i.e., prices, cross-asset dependencies) as random variables and study their properties. Qualitative visual analysis of probability density functions is a labour-intensive process and thus impractical, especially when high-dimensional distributions (i.e., 4D and higher) are under consideration. On the other hand, quantitative measures, such as moments, provide a systematic way to analyze (joint) distributions (Meucci, 2009).
Mean, Median & Mode
Suppose that we need to summarize all the information regarding a random variable X in only one number: the one value that best represents the whole range of possible outcomes. We are looking for a location parameter that provides a fair indication of where on the real axis the random variable X will end up taking its value.
An immediate choice for the location parameter is the center of mass of the distribution, i.e., the weighted average of each possible outcome, where the weight of each outcome is provided by its respective probability. This corresponds to computing the expected value or mean of the random variable:

\mathbb{E}[X] = \mu_X \triangleq \int_{-\infty}^{+\infty} x f_X(x)\, \mathrm{d}x \in \mathbb{R} \quad (2.15)

Note that the mean is also the first-order statistical moment of the distribution f_X. When a finite number of observations T is available, and there is no closed-form expression for the probability density function f_X, the sample mean or empirical mean is used as an unbiased estimate of the expected value, according to:

\mathbb{E}[X] \approx \frac{1}{T} \sum_{t=1}^{T} x_t \quad (2.16)

Investor Advice 2.1 (Greedy Criterion)
For the same level of risk, choose the portfolio that maximizes the expected returns (Wilmott, 2007).
Figure 2.7 illustrates two cases where the Greedy Criterion is considered; in the left sub-figure the two assets carry equal risk and μ_blue = 1 < μ_red = 4, hence the red asset is selected.

Figure 2.7: Greedy criterion for equally risky assets (left) and unequally risky assets (right).
On the other hand, when the assets have unequal risk levels (i.e., right sub-figure), the criterion does not apply and we cannot draw any conclusions without employing other metrics as well.
The definition of the mean value is extended to the multivariate case as the juxtaposition of the mean values (2.15) of the marginal distribution of each entry:

\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}_X \triangleq \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_M] \end{bmatrix} \in \mathbb{R}^{M} \quad (2.17)

A portfolio with vector w_t and single-asset mean simple returns μ_r has expected simple returns:

\mu_r = \mathbf{w}_t^T \boldsymbol{\mu}_r \quad (2.18)

An alternative choice for the location parameter is the median, which is the quantile relative to the specific cumulative probability p = 1/2:

\mathrm{Med}[X] \triangleq Q_X\!\left(\tfrac{1}{2}\right) \in \mathbb{R} \quad (2.19)

The juxtaposition of the medians, or any other quantiles, of each entry of a random variable does not satisfy the affine equivariance property (Meucci, 2009), i.e., in general Med[a + BX] ≠ a + B Med[X], and therefore it does not define a suitable location parameter.
A third parameter of location is the mode, which refers to the shape of the probability density function f_X. Indeed, the mode is defined as the point that corresponds to the highest peak of the density function:

\mathrm{Mod}[X] \triangleq \underset{x \in \mathbb{R}}{\mathrm{argmax}}\, f_X(x) \in \mathbb{R} \quad (2.20)

Intuitively, the mode is the most frequently occurring data point in the distribution. It is trivially extended to multivariate distributions, namely as the highest peak of the joint probability density function:

\mathrm{Mod}[\mathbf{X}] \triangleq \underset{\mathbf{x} \in \mathbb{R}^{M}}{\mathrm{argmax}}\, f_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{M} \quad (2.21)

Note that the relative positions of the location parameters provide qualitative information about the symmetry, the tails and the concentration of the distribution. Higher-order moments quantify these properties.
Figure 2.8 illustrates the distribution of the prices and the corresponding simple returns of the asset BA (Boeing Company), along with their location parameters. In the case of the simple returns, we highlight that the mean, the median and the mode are very close to each other, reflecting the symmetry and the concentration of the distribution, properties that motivated the selection of returns over raw asset prices.
Figure 2.8: First order moments for BA (Boeing Company) prices (left) and simple returns (right).

Volatility & Covariance
The dilemma we faced in selecting between assets in Figure 2.7 motivates the introduction of a metric that quantifies the risk level. In other words, we are looking for a dispersion parameter that yields an indication of the extent to which the location parameter (i.e., mean, median) might be wrong in guessing the outcome of the random variable X.
The variance is the benchmark dispersion parameter, measuring how far the random variable X is spread out from its mean, given by:

\mathrm{Var}[X] = \sigma_X^2 \triangleq \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right] \in \mathbb{R} \quad (2.22)

The square root of the variance, σ_X, namely the standard deviation, or volatility in finance, is a more physically interpretable parameter, since it has the same units as the random variable under consideration (i.e., prices, simple returns).
Note that the variance is also the second-order statistical central moment of the distribution f_X. When a finite number of observations T is available, and there is no closed-form expression for the probability density function f_X, Bessel's correction formula (Tsay, 2005) is used as an unbiased estimate of the variance, according to:

\mathrm{Var}[X] \approx \frac{1}{T-1} \sum_{t=1}^{T} (x_t - \mu_X)^2 \quad (2.23)

Investor Advice 2.2 (Risk-Aversion Criterion)
For the same expected returns, choose the portfolio that minimizes the volatility (Wilmott, 2007).
Figure 2.9: Risk-aversion criterion for equal returns (left) and unequal returns (right).

According to the Risk-Aversion Criterion, in Figure 2.9, for the same returns level (i.e., left sub-figure) we choose the less risky asset, the blue one, since the red one is more spread out (σ²_blue = 1 < σ²_red = 9). The extension of the dispersion parameter to the multivariate case is the covariance, which measures the joint variability of two variables, given by:

\mathrm{Cov}[\mathbf{X}] = \boldsymbol{\Sigma}_X \triangleq \mathbb{E}\left[ (\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^T \right] \in \mathbb{R}^{M \times M} \quad (2.24)

or component-wise:

\mathrm{Cov}[X_m, X_n] = \left[ \mathrm{Cov}[\mathbf{X}] \right]_{mn} = \Sigma_{mn} \triangleq \mathbb{E}\left[ (X_m - \mathbb{E}[X_m])(X_n - \mathbb{E}[X_n]) \right] \in \mathbb{R} \quad (2.25)

By direct comparison of (2.22) and (2.25), we note that:

\mathrm{Var}[X_m] = \mathrm{Cov}[X_m, X_m] = \left[ \mathrm{Cov}[\mathbf{X}] \right]_{mm} = \Sigma_{mm} \quad (2.26)

hence the m-th diagonal element of the covariance matrix, Σ_mm, is the variance of the m-th component of the multivariate random variable X, while the off-diagonal terms Σ_mn represent the joint variability of the m-th with the n-th component of X. Note that, by definition (2.24), the covariance is a symmetric, real and positive semi-definite matrix (Mandic, 2018b).
Empirically, we estimate the covariance matrix entries using again Bessel's correction formula (Tsay, 2005), in order to obtain an unbiased estimate:

\mathrm{Cov}[X_m, X_n] \approx \frac{1}{T-1} \sum_{t=1}^{T} (x_{m,t} - \mu_{X_m})(x_{n,t} - \mu_{X_n}) \quad (2.27)

A portfolio with vector w_t and covariance matrix of asset simple returns Σ has variance:

\sigma_r^2 = \mathbf{w}_t^T \boldsymbol{\Sigma}\, \mathbf{w}_t \quad (2.28)

The correlation coefficient is also frequently used to quantify the linear dependency between random variables. It takes values in the range [−1, 1] and hence it is a normalized way to compare dependencies, while covariances are highly influenced by the scale of the random variables' variances. The correlation coefficient is given by:

\mathrm{corr}[X_m, X_n] = \left[ \mathrm{corr}[\mathbf{X}] \right]_{mn} = \rho_{mn} \triangleq \frac{\mathrm{Cov}[X_m, X_n]}{\sigma_{X_m} \sigma_{X_n}} \in [-1, 1] \subset \mathbb{R} \quad (2.29)
Figure 2.10: Covariance and correlation matrices for assets simple returns.
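The location and dispersion estimates of this section, eqs. (2.16), (2.18) and (2.27)–(2.29), reduce to a handful of numpy calls; the sketch below uses a synthetic, randomly generated returns matrix purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 250, 5
R = 0.0005 + 0.01 * rng.standard_normal((T, M))   # synthetic daily simple returns

mu = R.mean(axis=0)                  # sample means, eq. (2.16)
Sigma = np.cov(R, rowvar=False)      # covariance with Bessel's correction, eq. (2.27)
corr = np.corrcoef(R, rowvar=False)  # correlation matrix, eq. (2.29)

w = np.full(M, 1.0 / M)              # equally weighted portfolio
mu_p = w @ mu                        # portfolio expected return, eq. (2.18)
var_p = w @ Sigma @ w                # portfolio variance, eq. (2.28)

print(mu.round(5), mu_p.round(5), var_p.round(7))
print(corr.round(2))
```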
Skewness
The standard measure of the symmetry of a distribution is the skewness, which is the third central moment normalized by the standard deviation, in such a way as to make it scale-independent:

\mathrm{skew}[X] \triangleq \frac{\mathbb{E}\left[ (X - \mathbb{E}[X])^3 \right]}{\sigma_X^3} \quad (2.30)

In particular, a distribution whose probability density function is symmetric around its expected value has null skewness. If the skewness is positive (negative), occurrences larger than the expected value are less (more) likely than occurrences smaller than the expected value.

Investor Advice 2.3 (Negatively Skewed Criterion)
Choose negatively skewed returns, rather than positively skewed (Wilmott, 2007).
Kurtosis
The fourth moment provides a measure of the relative weight of the tails with respect to the central body of a distribution. The standard quantity to evaluate this balance is the kurtosis, defined as the normalized fourth central moment:

\mathrm{kurt}[X] \triangleq \frac{\mathbb{E}\left[ (X - \mathbb{E}[X])^4 \right]}{\sigma_X^4} \quad (2.31)

The kurtosis gives an indication of how likely it is to observe a measurement far in the tails of the distribution: a large kurtosis implies that the distribution displays "fat tails".

Despite the insight into the statistical properties we gain by studying moments of the returns distribution, we can combine them in such ways as to fully capture the behaviour of our strategies and better assess them. Inspired by standard metrics used in signal processing (e.g., the signal-to-noise ratio) and in sequential decision making, we introduce the following performance evaluators: cumulative returns, Sharpe ratio, drawdown and value at risk.
Cumulative Returns
In subsection 2.2.2 we defined returns relative to the change in asset prices over one time period. Nonetheless, we usually get involved in a multi-period investment, hence we extend the definition of vanilla returns to the cumulative returns, which represent the change in asset prices over larger time horizons.
Based on (2.5), the cumulative gross return R_{t→T} between time indexes t and T is given by:

R_{t \to T} \triangleq \frac{p_T}{p_t} = \left( \frac{p_T}{p_{T-1}} \right) \left( \frac{p_{T-1}}{p_{T-2}} \right) \cdots \left( \frac{p_{t+1}}{p_t} \right) \overset{(2.5)}{=} R_T R_{T-1} \cdots R_{t+1} = \prod_{i=t+1}^{T} R_i \in \mathbb{R} \quad (2.32)

The cumulative gross return is usually also termed Profit & Loss (PnL), since it represents the wealth level of the investment. If R_{t→T} > 1 (< 1) the investment was profitable (lossy).
Investor Advice 2.4 (Profitability Criterion)
Aim to maximize the profitability of the investment.
Moreover, the cumulative simple return r_{t→T} is given by:

r_{t \to T} \triangleq \frac{p_T}{p_t} - 1 \overset{(2.32)}{=} \prod_{i=t+1}^{T} R_i - 1 \overset{(2.6)}{=} \prod_{i=t+1}^{T} (1 + r_i) - 1 \in \mathbb{R} \quad (2.33)

while the cumulative log return ρ_{t→T} is:

\rho_{t \to T} \triangleq \ln\left( \frac{p_T}{p_t} \right) \overset{(2.32)}{=} \ln\left( \prod_{i=t+1}^{T} R_i \right) = \sum_{i=t+1}^{T} \ln(R_i) \overset{(2.12)}{=} \sum_{i=t+1}^{T} \rho_i \in \mathbb{R} \quad (2.34)

Figure 2.11 demonstrates the interpretation power of cumulative returns over simple returns. Simple visual inspection of simple returns is inadequate for comparing the performance of the different assets. On the other hand, the cumulative simple returns clearly show that BA's (Boeing Company) price appreciated over the period, while GE's (General Electric) price declined.
Figure 2.11: Assets cumulative simple returns.
Sharpe Ratio
Remark 2.5
The criteria 2.1 and 2.2 can sufficiently distinguish and prioritize investments which either have the same risk level or the same returns level, respectively. Nonetheless, they fail in all other cases, when risk or return levels are unequal.
The failure of the greedy criterion and the risk-aversion criterion is demonstrated in both examples in Figures 2.7 and 2.9, where it can be observed that the more risky asset, the red one, has higher expected returns (i.e., the red distribution is wider, hence has larger variance, but it is centered around a larger value, compared to the blue distribution). Consequently, none of the criteria applies and the comparison is inconclusive.
In order to address this issue, and motivated by the Signal-to-Noise Ratio (SNR) (Zhang and Wang, 2017; Feng and Palomar, 2016), we define the Sharpe Ratio (SR) as the ratio of the expected returns (i.e., signal power) to their standard deviation (i.e., noise power), adjusted by a scaling factor:

\mathrm{SR}_T \triangleq \sqrt{T}\, \frac{\mathbb{E}[r_{1:T}]}{\sqrt{\mathrm{Var}[r_{1:T}]}} \in \mathbb{R} \quad (2.35)
where T is the number of samples considered in the calculation of the empirical mean and standard deviation. (The variance of the noise is equal to the noise power; the standard deviation is used in the definition of the SR to provide a unit-less metric.)

Investor Advice 2.5 (Sharpe Ratio Criterion)
Aim to maximize the Sharpe Ratio of the investment.
Considering now the example in Figures 2.7 and 2.9, we can quantitatively compare the two returns streams and select the one that maximizes the Sharpe Ratio:

\mathrm{SR}_{\mathrm{blue}} = \sqrt{T}\, \frac{\mu_{\mathrm{blue}}}{\sigma_{\mathrm{blue}}} = \sqrt{T}\, \frac{1}{1} = \sqrt{T} \quad (2.36)

\mathrm{SR}_{\mathrm{red}} = \sqrt{T}\, \frac{\mu_{\mathrm{red}}}{\sigma_{\mathrm{red}}} = \sqrt{T}\, \frac{4}{3} \quad (2.37)

\mathrm{SR}_{\mathrm{blue}} < \mathrm{SR}_{\mathrm{red}} \Rightarrow \text{choose red} \quad (2.38)
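A direct implementation of the Sharpe Ratio (2.35) is sketched below; the empirical standard deviation uses Bessel's correction, and the two returns streams are synthetic draws that merely mimic the blue (μ = 1, σ = 1) and red (μ = 4, σ = 3) assets of the example above.

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray) -> float:
    """Sharpe Ratio of a simple-returns stream, eq. (2.35): sqrt(T) * mean / std."""
    T = len(returns)
    return np.sqrt(T) * returns.mean() / returns.std(ddof=1)

# Synthetic streams mimicking the blue (mu=1, sigma=1) and red (mu=4, sigma=3) assets.
rng = np.random.default_rng(1)
blue = rng.normal(1.0, 1.0, size=252)
red = rng.normal(4.0, 3.0, size=252)

print(sharpe_ratio(blue), sharpe_ratio(red))   # the red stream typically scores higher, as in (2.38)
```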
Drawdown

The drawdown (DD) is a measure of the decline from a historical peak in cumulative returns (Luenberger, 1997). A drawdown is usually quoted as the percentage between the peak and the subsequent trough and is defined as:

\mathrm{DD}(t) = -\max\left\{ \left[ \max_{\tau \in (0,t)} r_{0 \to \tau} \right] - r_{0 \to t},\ 0 \right\} \quad (2.39)

The maximum drawdown (MDD) up to time t is the maximum of the drawdown over the history of the cumulative returns, such that:

\mathrm{MDD}(t) = -\max_{T \in (0,t)} \left\{ \left[ \max_{\tau \in (0,T)} r_{0 \to \tau} \right] - r_{0 \to T} \right\} \quad (2.40)

The drawdown and maximum drawdown plots are provided in Figure 2.12, along with the cumulative returns of the assets GE and BA. Interestingly, the decline of GE's cumulative returns starting in early 2017 is perfectly reflected by the (maximum) drawdown curve.
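The drawdown and maximum drawdown of (2.39)–(2.40) follow from a running maximum over the cumulative returns; a minimal sketch under the same sign convention (drawdowns reported as non-positive numbers), applied to synthetic daily returns, is given below.

```python
import numpy as np

def drawdown(simple_returns: np.ndarray) -> tuple[np.ndarray, float]:
    """Drawdown series and maximum drawdown of a simple-returns stream, eqs. (2.39)-(2.40)."""
    cumulative = np.cumprod(1.0 + simple_returns) - 1.0   # cumulative simple returns, eq. (2.33)
    running_peak = np.maximum.accumulate(cumulative)
    dd = -(running_peak - cumulative)                     # <= 0 by construction
    return dd, dd.min()

rng = np.random.default_rng(2)
r = rng.normal(0.0005, 0.01, size=252)                    # synthetic daily simple returns
dd, mdd = drawdown(r)
print(mdd)
```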
Figure 2.12: (Maximum) drawdown and cumulative returns for GE and BA.

Value at Risk

The value at risk (VaR) is another commonly used metric to assess the performance of a returns time-series (i.e., stream). Given daily simple returns r_t and a cut-off c ∈ (0, 1), the value at risk is defined as the c quantile of their distribution, representing the worst 100c% case scenario:

\mathrm{VaR}(c) \triangleq Q_r(c) \in \mathbb{R} \quad (2.41)

Figure 2.13 depicts the value at risk of GE and BA at c = 0.05.
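Since (2.41) defines VaR as an empirical quantile of the daily returns, it maps to a single numpy call; the sketch below uses synthetic returns and the same c = 0.05 cut-off as Figure 2.13.

```python
import numpy as np

def value_at_risk(simple_returns: np.ndarray, c: float = 0.05) -> float:
    """Value at risk, eq. (2.41): the c-quantile of the daily simple returns."""
    return np.quantile(simple_returns, c)

rng = np.random.default_rng(3)
r = rng.normal(0.0005, 0.012, size=1000)   # synthetic daily simple returns
print(value_at_risk(r, c=0.05))            # worst 5% case scenario threshold
```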
Figure 2.13: Illustration of the 5% value at risk (VaR) of the GE and BA stocks.

Time-series analysis is of major importance in a vast range of research topics and many engineering applications. It relates to analyzing time-series data in order to estimate meaningful statistics and identify patterns in sequential data. Financial time-series analysis deals with the extraction of underlying features to analyze and predict the temporal dynamics of financial assets (Navon and Keller, 2017). Due to the inherent uncertainty and non-analytic structure of financial markets (Tsay, 2005), the task has proven challenging, and both classical linear statistical methods, such as the VAR model, and statistical machine learning models have been widely applied (Ahmed et al., 2010). In order to efficiently capture the non-linear nature of financial time-series, advanced non-linear function approximators, such as RNN models (Mandic and Chambers, 2001) and Gaussian Processes (Roberts et al., 2013), are also extensively used.
In this section, we introduce the VAR and RNN models, which comprise the basis for the model-based approach developed in Section 6.1.

Autoregressive (AR) processes can model univariate time-series and specify that future values of the series depend linearly on the past realizations of the series (Mandic, 2018a). In particular, a p-order autoregressive process AR(p) satisfies:

x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + \varepsilon_t = \sum_{i=1}^{p} a_i x_{t-i} + \varepsilon_t = \mathbf{a}^T \vec{x}_{t-p:t-1} + \varepsilon_t \in \mathbb{R} \quad (2.42)

where ε_t is a stochastic term (an imperfectly predictable term), which is usually treated as white noise, and a = [a_1, a_2, · · · , a_p]^T are the p model parameters/coefficients.
Extending the AR model to multivariate time-series, we obtain the vector autoregressive (VAR) process, which enables us to capture the cross-dependencies between series. For the general case of an M-dimensional p-order vector autoregressive process VAR_M(p), it follows that:

\begin{bmatrix} x_{1,t} \\ x_{2,t} \\ \vdots \\ x_{M,t} \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} + \begin{bmatrix} a^{(1)}_{1,1} & a^{(1)}_{1,2} & \cdots & a^{(1)}_{1,M} \\ a^{(1)}_{2,1} & a^{(1)}_{2,2} & \cdots & a^{(1)}_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(1)}_{M,1} & a^{(1)}_{M,2} & \cdots & a^{(1)}_{M,M} \end{bmatrix} \begin{bmatrix} x_{1,t-1} \\ x_{2,t-1} \\ \vdots \\ x_{M,t-1} \end{bmatrix} + \cdots + \begin{bmatrix} a^{(p)}_{1,1} & a^{(p)}_{1,2} & \cdots & a^{(p)}_{1,M} \\ a^{(p)}_{2,1} & a^{(p)}_{2,2} & \cdots & a^{(p)}_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(p)}_{M,1} & a^{(p)}_{M,2} & \cdots & a^{(p)}_{M,M} \end{bmatrix} \begin{bmatrix} x_{1,t-p} \\ x_{2,t-p} \\ \vdots \\ x_{M,t-p} \end{bmatrix} + \begin{bmatrix} e_{1,t} \\ e_{2,t} \\ \vdots \\ e_{M,t} \end{bmatrix} \quad (2.43)

or equivalently, in compact form:

\mathbf{x}_t = \mathbf{c} + \mathbf{A}_1 \mathbf{x}_{t-1} + \mathbf{A}_2 \mathbf{x}_{t-2} + \cdots + \mathbf{A}_p \mathbf{x}_{t-p} + \mathbf{e}_t = \mathbf{c} + \sum_{i=1}^{p} \mathbf{A}_i \mathbf{x}_{t-i} + \mathbf{e}_t \in \mathbb{R}^{M} \quad (2.44)

where c ∈ R^M is a vector of constants (intercepts), A_i ∈ R^{M×M}, for i = 1, 2, . . . , p, are the p parameter matrices, and e_t ∈ R^M is a stochastic term (noise).
Hence, VAR processes can adequately capture the dynamics of linear systems, under the assumption that they follow a Markov process of finite order, at most p (Murphy, 2012). In other words, the effectiveness of a p-th order VAR process relies on the assumption that the last p observations carry all the sufficient statistics and information needed to predict and describe the future realizations of the process. As a result, we enforce a memory mechanism, keeping the p last values, \vec{X}_{t-p:t-1}, of the multivariate time-series and making predictions according to (2.44). Increasing the order of the model p results in increased computational and memory complexity, as well as a tendency to overfit the noise of the observed data. A VAR_M(p) process has

|P|_{\mathrm{VAR}_M(p)} = M \times M \times p + M \quad (2.45)

parameters, hence they increase linearly with the model order. The systematic selection of the model order p can be achieved by minimizing an information criterion, such as the Akaike Information Criterion (AIC) (Mandic, 2018a), given by:

p_{\mathrm{AIC}} = \min_{p \in \mathbb{N}} \left[ \ln(\mathrm{MSE}) + \frac{p}{N} \right] \quad (2.46)

where MSE is the mean squared error of the model and N is the number of samples.
After careful investigation of equation (2.44), we note that the target \vec{x}_t is given by an ensemble (i.e., linear combination) of p linear regressions, where the i-th regressor has (trainable) weights A_i and features x_{t-i} (one of the regressors also has a bias vector that corresponds to c). This interpretation of a VAR model allows us to interpret its strengths and weaknesses on a common basis with the neural network architectures covered in subsequent parts. Moreover, this enables adaptive training (e.g., via the Least-Mean-Square filter (Mandic, 2004)), which will prove useful in online learning, covered in Section 6.1.
Figure 2.14 illustrates a fitted VAR(12) process, where the order p = p_AIC = 12 is selected according to (2.46) and the corresponding number of parameters follows from (2.45).
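Since, as noted above, a VAR_M(p) model is an ensemble of linear regressions, it can be fitted with ordinary least squares; the sketch below stacks the p lagged observations into a feature matrix and recovers c and A_1, ..., A_p for synthetic data (a dedicated package could equally be used, but a plain least-squares fit keeps the construction explicit; the data, the order p = 2 and the helper names are illustrative assumptions).

```python
import numpy as np

def fit_var(X: np.ndarray, p: int) -> tuple[np.ndarray, np.ndarray]:
    """Least-squares fit of a VAR_M(p) model, eq. (2.44). X has shape (T, M)."""
    T, M = X.shape
    # Features: [1, x_{t-1}, ..., x_{t-p}] for every target x_t, t = p, ..., T-1.
    rows = [np.concatenate(([1.0], X[t - p:t][::-1].ravel())) for t in range(p, T)]
    Z = np.array(rows)                       # shape (T-p, 1 + M*p)
    Y = X[p:]                                # targets, shape (T-p, M)
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    c = B[0]                                 # intercept vector c in eq. (2.44)
    A = B[1:].T.reshape(M, p, M).transpose(1, 0, 2)   # A_i matrices, i = 1..p
    return c, A

def predict_next(X: np.ndarray, c: np.ndarray, A: np.ndarray) -> np.ndarray:
    """One-step prediction: x_hat_t = c + sum_i A_i x_{t-i}."""
    p = A.shape[0]
    return c + sum(A[i] @ X[-(i + 1)] for i in range(p))

rng = np.random.default_rng(4)
X = rng.normal(0.0, 0.01, size=(500, 4))     # synthetic simple returns for M = 4 assets
c, A = fit_var(X, p=2)
print(predict_next(X, c, A))
```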
Figure 2.14: Vector autoregressive (VAR) time-series predictive model for the assets' simple returns. One-step prediction is performed, where the realized observations are used as they come.

Recurrent neural networks (RNN) (Mandic and Chambers, 2001) are a family of neural networks with feedback loops which are very successful in processing sequential data. Most recurrent networks can also process sequences of variable length (Goodfellow et al., 2016).
Consider the classical form of a dynamical system:

\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) \quad (2.47)

where s_t and x_t are the system state and input signal at time step t, respectively, while f is a function, parametrized by θ, that maps the previous state and the input signal to the new state. Unfolding the recursive definition in (2.47) for a finite value of t:

\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) = f(f(\mathbf{s}_{t-2}, \mathbf{x}_{t-1}; \boldsymbol{\theta}), \mathbf{x}_t; \boldsymbol{\theta}) = f(f(f(\cdots f(\cdots), \mathbf{x}_{t-1}; \boldsymbol{\theta}), \mathbf{x}_t; \boldsymbol{\theta})) \quad (2.48)

In general, f can be a highly non-linear function. Interestingly, a composite function of nested applications of f is responsible for generating the next state.
Many recurrent neural networks use equation (2.49), or a similar equation, to define the values of their hidden units. To indicate that the state corresponds to the hidden units of the network, we now rewrite equation (2.47) using the variable h to represent the state:

\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \boldsymbol{\theta}) \quad (2.49)

Then the hidden state h_t can be used to obtain the output signal y_t (i.e., observation) at time index t, assuming a non-linear relationship, described by a function g that is parametrized by ϕ:

\hat{\mathbf{y}}_t = g(\mathbf{h}_t; \boldsymbol{\phi}) \quad (2.50)

The computational graph corresponding to (2.49) and (2.50) is provided in Figure 2.15. It can be shown that recurrent neural networks (and any neural network with certain non-linear activation functions, in general) are universal function approximators (Cybenko, 1989), which means that if there is a relationship between past states and the current input with the next states, RNNs have the capacity to model it.
Another important aspect of RNNs is parameter sharing. Note in (2.48) that θ are the only parameters, shared between time steps. Consequently, the number of parameters of the model decreases significantly, enabling faster training and limiting model overfitting (Graves, 2012), compared to feedforward neural networks (i.e., multi-layer perceptrons), which do not allow loops or any recursive connection. Feedforward networks can also be used with sequential data when memory is brute-forced (similar to the VAR process memory mechanism), leading to very large and deep architectures, requiring a lot more time to train and effort to avoid overfitting in order to achieve results similar to those of smaller RNNs (Mandic and Chambers, 2001).

Figure 2.15: A generic recurrent network computational graph. This recurrent network processes information from the input x by incorporating it into the state h that is passed forward through time, which in turn is used to predict the target variable ŷ. (Left) Circuit diagram; the black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance (Goodfellow et al., 2016).

Simple RNNs, such as the one implementing equation (2.48), are not used because of the vanishing gradient problem (Hochreiter, 1998; Pascanu et al., 2012); instead, variants such as the
Gated Recurrent Unit (GRU) are preferred, thanks to their effectiveness and simple architecture (compared to the LSTM (Ortiz-Fuentes and Forcada, 1997)). Most variants introduce some filter/forget mechanism, responsible for selectively filtering out (or "forgetting") past states, shaping the hidden state h in a highly non-linear fashion, but without accumulating the effects from all the history of observations, alleviating the vanishing gradients problem. Consulting the schematic in Figure 2.15, this filtering operation is captured by m (in red), which stands for selective "memory".
Given a loss function L and historic data D, the process of training involves the minimization of L conditioned on D, where the parameters θ and ϕ are the decision variables of the optimization problem:

\underset{\boldsymbol{\theta}, \boldsymbol{\phi}}{\text{minimize}} \quad \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}; \mathcal{D}) \quad (2.51)

which is usually addressed by adaptive variants of Stochastic Gradient Descent (SGD), such as Adam (Kingma and Ba, 2014), to ensure faster convergence and avoid saddle points. All these optimization algorithms rely on (estimates of) descent directions, obtained by the gradients of the loss function L with respect to the network parameters (θ, ϕ), namely ∇_θ(L) and ∇_ϕ(L). Due to the parameter sharing mechanics of RNNs, obtaining the gradients is non-trivial and thus the Backpropagation Through Time (BPTT) algorithm (Werbos, 1990) is used, which efficiently calculates the contribution of each parameter to the loss function across all time steps.
The selection of the hyperparameters, such as the size of the hidden state h and the activation functions, can be performed using cross-validation or other empirical methods. Nonetheless, we choose to allow excess degrees of freedom in our model but regularize it using weight decay, in the form of L1 and L2 norms, as well as dropout, according to the Gal and Ghahramani (2016) guidelines.
As with any neural network layer, recurrent layers can be stacked together or connected with other layers (i.e., affine or convolutional layers), forming deep architectures capable of dealing with complex datasets.
For comparison with the VAR(12) process in Section 2.4.1, we train an RNN comprised of two layers, one GRU layer followed by an affine layer, where the size of the hidden state is 3, i.e., h ∈ R^3. The number of model parameters is |P|_{GRU-RNN(4→3→4)} = 88, but it significantly outperforms the VAR model, as suggested by Figure 2.16 and the summary in Table 2.2.
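A sketch of a GRU-RNN predictor of this kind, assuming the PyTorch library: a single GRU layer with a 3-dimensional hidden state followed by an affine (linear) output layer, trained with Adam on a one-step-ahead squared-error loss. The data are synthetic and the hyperparameters are illustrative, so this is only an approximation of the model described above (PyTorch's GRU also carries two bias vectors, so its parameter count differs slightly from 88).

```python
import torch
import torch.nn as nn

class GRUPredictor(nn.Module):
    """One GRU layer (hidden size 3) followed by an affine layer, as in Section 2.4.2."""
    def __init__(self, n_assets: int, hidden_size: int = 3):
        super().__init__()
        self.gru = nn.GRU(input_size=n_assets, hidden_size=hidden_size, batch_first=True)
        self.affine = nn.Linear(hidden_size, n_assets)

    def forward(self, x):                      # x: (batch, time, n_assets)
        h, _ = self.gru(x)                     # hidden states, eq. (2.49)
        return self.affine(h)                  # one-step-ahead predictions, eq. (2.50)

# Synthetic simple returns for M = 4 assets.
torch.manual_seed(0)
returns = 0.01 * torch.randn(1, 500, 4)
inputs, targets = returns[:, :-1], returns[:, 1:]   # predict x_{t+1} from x_t

model = GRUPredictor(n_assets=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                        # minimization of the loss, eq. (2.51), via BPTT
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

print(sum(p.numel() for p in model.parameters()), loss.item())   # parameter count and final loss
```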
Figure 2.16: Gated recurrent unit recurrent neural network (GRU-RNN) time-series predictive model for the assets' simple returns. One-step prediction is performed, where the realized observations are used as they come (i.e., the observation y_{t-1} is set equal to x_t, rather than to the predicted value ŷ_{t-1}).

Table 2.2: Training and testing errors of the VAR and GRU-RNN predictive models for the AAPL, BA, GE and XOM simple returns.

Chapter 3
Portfolio Optimization
The notion of a portfolio has already been introduced in subsection 2.1.2, as a master asset, highlighting its representation advantage over single assets. Nonetheless, portfolios also allow investors to combine properties of individual assets in order to "amplify" the positive aspects of the market, while "attenuating" its negative impacts on the investment.
Figure 3.1 illustrates three randomly generated portfolios (with fixed portfolio weights over time, given in Table 3.2), while Table 3.1 summarizes their performance. Importantly, we note the significant differences between the random portfolios, highlighting the importance that portfolio construction and asset allocation play in the success of an investment. As Table 3.2 implies, short-selling is allowed in the generation of the random portfolios, owing to the negative portfolio weights (e.g., the AAPL and MMM entries). Regardless, the portfolio vector definition (2.1) is satisfied in all cases, since the portfolio vectors' elements sum to one (column-wise addition).

Figure 3.1: Simple returns of randomly allocated portfolios.
Randomly Allocated Portfolios Performance Summary

Performance Metrics            | p(0)      | p(1)     | p(2)
Mean Returns (%)               |           |          |
Cumulative Returns (%)         | -83.8688  | 297.869  | 167.605
Volatility (%)                 |           |          |
Sharpe Ratio                   |           |          |
Max Drawdown (%)               |           |          |
Average Drawdown Time (days)   | 70        | 7        | 24
Skewness                       |           |          |
Kurtosis                       |           |          |
Value at Risk (%)              | -14.0138  | -1.52798 | -2.88773
Conditional Value at Risk (%)  | -20.371   | -2.18944 | -4.44684
Hit Ratio (%)                  |           |          |
Average Win to Average Loss    |           |          |

Table 3.1: Portfolio 0 (p(0)) is out-performed by portfolio 1 (p(1)) in all metrics, motivating the introduction of portfolio optimization.

     | p(0)      | p(1)      | p(2)
AAPL | -2.833049 | 0.172436  | 0.329105
GE   | -2.604941 | -0.061177 | 0.467233
BA   |           |           |
JPM  |           |           |
MMM  | -3.949186 | 0.341040  | -1.373778

Table 3.2: Random Portfolio Vectors of Figure 3.1.

Portfolio Optimization aims to address the allocation problem in a systematic way, where an objective function reflecting the investor's preferences is constructed and optimized with respect to the portfolio vector.
In this chapter, we introduce the Markowitz Model (Section 3.1), the first attempt to mathematically formalize and suggest an optimization method to address portfolio management. Moreover, we extend this framework to generic utility and objective functions (Section 3.2), by taking transaction costs into account (Section 3.3). The shortcomings of the methods discussed here encourage the development of context-agnostic agents, which is the focus of this thesis (Section 6.2). However, the simplicity and robustness of the traditional portfolio optimization methods have motivated the supervised pre-training (Chapter 7) of the agents with Markowitz-like models as ground truths.

Markowitz Model
The Markowitz model (Markowitz, 1952; Kroll et al., 1984) mathematically formulates the portfolio allocation problem, namely finding a portfolio vector w in a universe of M assets, according to the investment Greedy (Investor Advice 2.1) and Risk-Aversion (Investor Advice 2.2) criteria, constrained on the portfolio vector definition (2.1). Hence, the Markowitz model gives the optimal portfolio vector w_* which minimizes volatility for a given returns level, such that:

\sum_{i=1}^{M} w_{*,i} = 1, \quad \mathbf{w}_* \in \mathbb{R}^{M} \quad (3.1)

For a trading universe of M assets, provided historical data, we obtain empirical estimates of the expected returns µ = [µ_1, µ_2, . . . , µ_M]^T ∈ R^M, where µ_i is the sample mean (2.16) of the i-th asset, and of the covariance Σ ∈ R^{M×M}, such that Σ_ij is the empirical covariance (2.27) of the i-th and the j-th assets.
For a given target expected return µ̄_target, determine the portfolio vector w ∈ R^M such that:

\underset{\mathbf{w}}{\text{minimize}} \quad \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} \quad (3.2)
\text{subject to} \quad \mathbf{w}^T \boldsymbol{\mu} = \bar{\mu}_{\text{target}} \quad (3.3)
\text{and} \quad \mathbb{1}_M^T \mathbf{w} = 1 \quad (3.4)

where σ² = w^T Σ w is the portfolio variance, µ = w^T µ is the portfolio expected return, and the M-dimensional column vector of ones is denoted by 1_M.
For Lagrange multipliers λ, κ ∈ R, we form the Lagrangian function L (Papadimitriou and Steiglitz, 1998) such that:

\mathcal{L}(\mathbf{w}, \lambda, \kappa) = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} - \lambda \left( \mathbf{w}^T \boldsymbol{\mu} - \bar{\mu}_{\text{target}} \right) - \kappa \left( \mathbb{1}_M^T \mathbf{w} - 1 \right) \quad (3.5)

We differentiate the Lagrangian function L and apply the first-order necessary condition of optimality (the covariance matrix Σ is by definition (2.24) symmetric, so ∂(w^T Σ w)/∂w = 2Σw):

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\boldsymbol{\Sigma}\mathbf{w} - \lambda \boldsymbol{\mu} - \kappa \mathbb{1}_M = \mathbf{0} \quad (3.6)

Combining (3.6) with the constraints (3.3) and (3.4), we obtain the linear system:

\begin{bmatrix} 2\boldsymbol{\Sigma} & \boldsymbol{\mu} & \mathbb{1}_M \\ \boldsymbol{\mu}^T & 0 & 0 \\ \mathbb{1}_M^T & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{w} \\ -\lambda \\ -\kappa \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \bar{\mu}_{\text{target}} \\ 1 \end{bmatrix} \quad (3.7)

Under the assumption that Σ is full rank and µ is not a multiple of 1_M, equation (3.7) is solvable by matrix inversion (Boyd and Vandenberghe, 2004). The resulting portfolio vector w_MVP defines the mean-variance optimal portfolio:

\begin{bmatrix} \mathbf{w}_{\mathrm{MVP}} \\ -\lambda \\ -\kappa \end{bmatrix} = \begin{bmatrix} 2\boldsymbol{\Sigma} & \boldsymbol{\mu} & \mathbb{1}_M \\ \boldsymbol{\mu}^T & 0 & 0 \\ \mathbb{1}_M^T & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{0} \\ \bar{\mu}_{\text{target}} \\ 1 \end{bmatrix} \quad (3.8)

Note that mean-variance optimization is also used in signal processing and wireless communications in order to determine the optimal beamformer (Almeida et al., 2015), using, for example, Minimum Variance Distortionless Response filters (Xia and Mandic, 2013).

Notice that the constraint (3.4) suggests that short-selling is allowed. This simplifies the formulation of the problem by relaxing conditions, enabling a closed-form solution (3.8) as a set of linear equations. If short sales are prohibited, then an additional constraint should be added, and in this case the optimization problem becomes:

\underset{\mathbf{w}}{\text{minimize}} \quad \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \boldsymbol{\mu} = \bar{\mu}_{\text{target}}, \quad \mathbb{1}_M^T \mathbf{w} = 1 \quad \text{and} \quad \mathbf{w} \succeq \mathbf{0}

where ⪰ designates an element-wise inequality operator. This problem cannot be reduced to the solution of a set of linear equations. It is termed a quadratic program and it is solved numerically using gradient-based algorithms (Gill et al., 1981). Optimization problems with quadratic objective functions and linear constraints fall into this framework.
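A numerical sketch of the closed-form solution (3.7)–(3.8) with short selling allowed is given below; the expected-returns vector, the covariance matrix and the target return are made up for illustration, and the no-short-sale variant would instead require a quadratic-programming solver.

```python
import numpy as np

def markowitz_weights(mu: np.ndarray, Sigma: np.ndarray, mu_target: float) -> np.ndarray:
    """Mean-variance optimal portfolio with short selling, via the linear system (3.7)-(3.8)."""
    M = len(mu)
    ones = np.ones(M)
    KKT = np.block([[2.0 * Sigma,   mu[:, None],       ones[:, None]],
                    [mu[None, :],   np.zeros((1, 1)),  np.zeros((1, 1))],
                    [ones[None, :], np.zeros((1, 1)),  np.zeros((1, 1))]])
    rhs = np.concatenate([np.zeros(M), [mu_target, 1.0]])
    solution = np.linalg.solve(KKT, rhs)
    return solution[:M]                       # w_MVP; the last two entries are -lambda, -kappa

# Illustrative three-asset universe.
mu = np.array([0.10, 0.05, 0.12])
Sigma = np.array([[0.040, 0.006, 0.010],
                  [0.006, 0.020, 0.004],
                  [0.010, 0.004, 0.060]])
w = markowitz_weights(mu, Sigma, mu_target=0.08)
print(w, w.sum(), w @ mu)                     # weights sum to 1 and hit the target return
```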
Remark 3.1 Figure 3.2 illustrates the volatility to expected returns dependency of an example trading universe (i.e., yellow scatter points), along with all optimal portfolio solutions with and without short selling, in the left and right subfigures, respectively. The blue part of the solid curve is termed the Efficient Frontier (Luenberger, 1997), and the corresponding portfolios are called efficient. These portfolios are obtained by the solution of (3.2).
Interestingly, despite the inferior performance of the individual assets, appropriate linear combinations of them result in less volatile and more profitable master assets, demonstrating once again the power of portfolio optimization. Moreover, we note that the red part of the solid curve is inefficient, since for the same risk level (i.e., standard deviation) there are portfolios with higher expected returns, which aligns with the Greedy Criterion 2.1. Finally, we highlight that in the case of short-selling (left subfigure) there are feasible portfolios which have higher expected returns than any asset in the universe. This is possible since the low-performing assets can be shorted and the proceeds invested in high-performing assets, amplifying their high returns. On the other hand, when short-selling is not allowed, the expected returns of any portfolio are restricted to the interval defined by the lowest and the highest empirical returns of the assets in the universe.
[Figure 3.2, left and right panels: "Markowitz Model: with short sales" and "Markowitz Model: without short sales"; each panel shows the efficient frontier and the individual assets (KO, JPM, AAPL, MMM) on the standard deviation versus expected mean returns plane.]
Figure 3.2: Efficient frontier for the Markowitz model with (Left) and without (Right) short-selling. The yellow points are the projections of single assets' historic performance on the σ-µ (i.e., volatility-returns) plane. The solid lines are the portfolios obtained by solving the Markowitz model optimization problem (3.2) for different values of µ̄_target. Note that the red points are rejected and only the blue locus is efficient.

In Section 2.3 we introduced various evaluation metrics that reflect our investment preferences. Extending the vanilla Markowitz model, which minimizes risk (i.e., variance) constrained on a predetermined profitability level (i.e., expected returns), we can select any metric as the optimization objective function, constrained on the budget (3.4) and any other criteria we favour, as long as they fit in the quadratic programming framework. More complex objectives may be solvable with special non-linear optimizers, but there is no general principle. In this section we show how to translate evaluation metrics into objective functions suitable for quadratic programming. Any of the functions presented can be used with or without short-selling, so in order to address the more difficult of the two cases, we will assume that only long positions are allowed. The other case is trivially obtained by dropping the constraint on the sign of the weights.
Motivated by the Lagrangian formulation of the Markowitz model (3.5), we define the Risk-Aversion portfolio, named after the risk-aversion coefficient α ∈ R_+, given by the solution of the program:

maximize_w   w^T µ − α w^T Σ w      (3.10)
subject to   1_M^T w = 1,   w ⪰ 0

where α is a model hyperparameter which reflects the trade-off between portfolio expected returns (w^T µ) and risk level (w^T Σ w) (Wilmott, 2007). For α → 0, the investor is risk-neutral and maximizes expected returns regardless of the risk taken, while for α → ∞, the investor is infinitely risk-averse and selects the least risky portfolio, regardless of its returns performance. Any positive value of α results in a portfolio which balances the two objectives, weighted by the risk-aversion coefficient. Figure 3.3 illustrates the volatility-returns plane for the same trading universe as in Figure 3.2, but the curve is obtained by the solution of (3.10). As expected, high values of α result in less volatile portfolios, while a low risk-aversion coefficient leads to higher returns.
[Figure 3.3, left and right panels: "Risk-Aversion: with short sales" and "Risk-Aversion: without short sales"; each panel shows the efficient frontier and the individual assets (including KO and AAPL) on the standard deviation versus expected mean returns plane, with a colorbar for the risk-aversion coefficient.]
Figure 3.3: Risk-Aversion optimization efficient frontier with (Right) and without (Left) short-selling. The yellow points are the projections of single assets' historic performance on the σ-µ (i.e., volatility-returns) plane. The blue points are the efficient portfolios, or equivalently the solutions of the risk-aversion optimization problem (3.10) for different values of α, designated by the opacity of the blue colour (see colorbar).

Both objective functions so far require hyperparameter tuning (µ̄_target or α), hence either cross-validation or hand-picked selection is required (Kennedy, 2016). On the other hand, in Section 2.3 we motivated the use of the Sharpe Ratio (2.35) as a Signal-to-Noise Ratio equivalent for finance, which is not parametric. Considering the Sharpe Ratio as the objective function, we obtain the program:

maximize_w   (w^T µ) / √(w^T Σ w)      (3.11)
subject to   1_M^T w = 1,   w ⪰ 0

3.3 Transaction Costs

In real stock exchanges, such as the NYSE (Wikipedia, 2018c), NASDAQ (Wikipedia, 2018b) and LSE (Wikipedia, 2018a), trading activities (buying or selling) are accompanied by expenses, including brokers' commissions and spreads (Investopedia, 2018g; Quantopian, 2017), usually referred to as transaction costs. Therefore, every time a new portfolio vector is determined (portfolio re-balancing), the corresponding transaction costs should be subtracted from the budget.

In order to simplify the analysis around transaction costs, we will use the rule of thumb (Quantopian, 2017) of charging 0.2% for every trading activity. For example, if three shares of stock A are bought at a price of 100$ each, then the transaction costs will be 0.6$. If the price of stock A rises to 107$ and we decide to sell all three shares, then the transaction costs will be 0.642$.

Given any objective function, J, the transaction costs are subtracted from the returns term in order to adjust the profit & loss, accounting for the expenses of trading activities. Therefore, we solve the optimization program:

maximize_w   J − β ‖w − w′‖_1      (3.12)
subject to   1_M^T w = 1,   w ⪰ 0

where β ∈ R is the transaction cost rate (i.e., 0.002 for the standard 0.2% commission) and w′ ∈ R^M is the portfolio held since the last re-balancing. All the parameters of the model (3.12) are given (not considering the objective function J's own parameters), since β is market-specific and w′ is the current position. Additionally, the transaction cost term can be seen as a regularization term which penalizes excessive trading and restricts large trades (i.e., large ‖w − w′‖_1).

The objective function J can be any function that can be optimized according to the framework developed in Section 3.2. For example, the risk-aversion optimization program with transaction costs is given by:

maximize_w   w^T µ − α w^T Σ w − β ‖w − w′‖_1      (3.13)

while the Sharpe Ratio optimization with transaction costs is:

maximize_w   (w^T µ − β ‖w − w′‖_1) / √(w^T Σ w)      (3.14)

We highlight that in (3.13) the transaction cost term is subtracted from J directly, since all terms have the same units (the parameter α is not dimensionless), while in (3.14) the transaction cost term is subtracted from the numerator (i.e., the expected returns) rather than from the ratio itself.
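As an illustration of how a cost-adjusted objective such as (3.13) can be solved numerically, the sketch below uses the cvxpy library for the long-only risk-aversion objective with an L1 transaction-cost penalty. The function name and the default values of α and β are assumptions for this example, not part of the original formulation.

```python
import cvxpy as cp

def rebalance(mu, Sigma, w_prev, alpha=1.0, beta=0.002):
    """Long-only risk-aversion objective with transaction costs, cf. (3.13)."""
    M = len(mu)
    w = cp.Variable(M)
    objective = cp.Maximize(mu @ w
                            - alpha * cp.quad_form(w, Sigma)   # risk penalty w^T Sigma w
                            - beta * cp.norm1(w - w_prev))     # 0.2% commission per unit of turnover
    constraints = [cp.sum(w) == 1, w >= 0]                     # budget and no short selling
    cp.Problem(objective, constraints).solve()
    return w.value
```

Here mu and Sigma are NumPy arrays holding the (PSD) sample estimates, and w_prev is the portfolio held since the last re-balancing.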
The involvement of transaction costs makes portfolio management a Multi-Stage Decision Problem (Neuneier, 1996), which in simple terms means that two sequences of states with the same start and end states can have different values. For instance, imagine two investors, both of whom have an initial budget of 100$. On Day 1, the first investor uses all of his budget to construct a portfolio according to his preferred objective function, paying 0.2$ in transaction costs, according to Quantopian (2017). By Day 3, the market prices have changed but the value of the first investor's portfolio has not, and he decides to liquidate all of his investment, paying another 0.2$ for selling his portfolio. On Day 5 both investors (re-)enter the market, make identical investments and hence pay the same commission fees. Obviously, the two investors have the same start and end states, but their intermediate trajectories lead to different reward streams.

From (3.12) it is obvious that w′ affects the optimal allocation w. In a sequential portfolio optimization setting, the past decisions (i.e., w′) have a direct impact on the optimality of the future decisions (i.e., w); therefore, apart from the maximization of immediate rewards, we should also focus on eliminating the negative effects on future decisions. As a consequence, sequential asset allocation is a multi-stage decision problem where myopically optimal actions can lead to sub-optimal cumulative rewards. This setting encourages the use of reinforcement learning agents, which aim to maximize long-term rewards, even if that means acting sub-optimally in the near future from the traditional portfolio optimization point of view.

Chapter 4
Reinforcement Learning
Reinforcement learning (RL) refers to both a learning problem and a subfield of machine learning (Goodfellow et al., 2016). As a learning problem (Szepesvári, 2010), it refers to learning to control a system (environment) so as to maximize some numerical value, which represents a long-term objective (discounted cumulative reward signal). Recalling the analysis in Section 3.3.2, sequential portfolio management is exactly such a problem.

In this chapter we introduce the necessary tools to analyze stochastic dynamical systems (Section 4.1). Moreover, we review the major components of a reinforcement learning algorithm (Section 4.2), as well as extensions of the formalization of dynamical systems (Section 4.4), enabling us to reuse some of these tools on more general, otherwise intractable, problems.

4.1 Dynamical Systems
Reinforcement learning is suitable for optimally controlling dynamical systems, such as the general one illustrated in Figure 4.1: a controller (agent) receives the controlled state of the system and a reward associated with the last state transition. It then calculates a control signal (action) which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated. The goal is to learn a way of controlling the system (policy) so as to maximize the total reward. The focus of this report is on discrete-time dynamical systems, though most of the notions developed extend to continuous-time systems.
[Figure 4.1: High-level agent-environment interaction loop, showing the agent, the environment and the reward r_t.]
The term agent is used to refer to the controller, while environment is used interchangeably with the term system. The goal of a reinforcement learning algorithm is the development (training) of an agent capable of successfully interacting with the environment, such that it maximizes some scalar objective over time.
An action a_t ∈ A is the control signal that the agent sends back to the system at time index t. It is the only way that the agent can influence the environment state and, as a result, lead to different reward signal sequences. The action space A refers to the set of actions that the agent is allowed to take, and it can be:
• Discrete: A = {a_1, a_2, ..., a_M};
• Continuous: A ⊆ [c, d]^M.

A reward r_t ∈ B ⊆ R is a scalar feedback signal, which indicates how well the agent is doing at discrete time step t. The agent aims to maximize cumulative reward over a sequence of steps.

Reinforcement learning addresses sequential decision making tasks (Silver, 2015b) by training agents that optimize delayed rewards and can evaluate the long-term consequences of their actions, being able to sacrifice immediate reward to gain more long-term reward. This special property of reinforcement learning agents is very attractive for financial applications, where investment horizons range from a few days and weeks to years or decades. In the latter cases, myopic agents can perform very poorly, since evaluation of long-term rewards is essential in order to succeed (Mnih et al., 2016). However, the applicability of reinforcement learning depends vitally on the following hypothesis:

Hypothesis 4.1 (Reward Hypothesis)
All goals can be described by the maximization of expected cumulative reward.
Consequently, the selection of the appropriate reward signal for each application is crucial. It influences the strategies the agent learns, since it reflects the agent's goals. In Section 5.4, a justification for the selected reward signal is provided, along with an empirical comparison against other metrics mentioned in Section 2.3.
The state, s_t ∈ S, is also a fundamental element of reinforcement learning, but the term is usually used to refer to both the environment state and the agent state. The agent does not always have direct access to the state; at every time step, t, it receives an observation, o_t ∈ O.

Environment State
The environment state s_t^e is the internal representation of the system, used in order to determine the next observation o_{t+1} and reward r_{t+1}. The environment state is usually invisible to the agent and, even if it is visible, it may contain irrelevant information (Sutton and Barto, 1998).

Agent State
The history h⃗_t at time t is the sequence of observations, actions and rewards up to time step t, such that:

h⃗_t = (o_1, a_1, r_1, o_2, a_2, r_2, ..., o_t, a_t, r_t)      (4.1)

The agent state (a.k.a. state) s_t^a is the internal representation of the agent about the environment, used in order to select the next action a_{t+1}, and it can be any function of the history:

s_t^a = f(h⃗_t)      (4.2)

The term state space S is used to refer to the set of possible states the agent can observe or construct. Similar to the action space, it can be:
• Discrete: S = {s_1, s_2, ..., s_N};
• Continuous: S ⊆ R^N.

Observability
Fully observable environments allow the agent to directly observe the environment state, hence:

o_t = s_t^e = s_t^a      (4.3)

Partially observable environments offer only indirect access to the environment state, therefore the agent has to construct its own state representation s_t^a (Graves, 2012), using either:
• the complete history: s_t^a ≡ h⃗_t; or
• a recurrent neural network: s_t^a ≡ f(s_{t-1}^a, o_t; θ).

Upon modifying the basic dynamical system in Figure 4.1 in order to take partial observability into account, we obtain the schematic in Figure 4.2. Note that f is a function unknown to the agent, which has access to the observation o_t but not to the environment state s_t. Moreover, R_s^a and P_{ss'}^a are the reward generating function and the transition probability matrix (function) of the MDP, respectively. Treating the system as a probabilistic graphical model, the state s_t is a latent variable that either deterministically or stochastically (depending on the nature of f) determines the observation o_t. In a partially observable environment, the agent needs to reconstruct the environment state, either by using the complete history h⃗_t or a stateful sequential model (i.e., a recurrent neural network, see (2.49)).

Figure 4.2: High-level stochastic partially observable dynamical system schematic.

4.2 Major Components of Reinforcement Learning

Reinforcement learning agents may include one or more of the following components (Silver, 2015b):
• Policy: the agent's behaviour function;
• Value function: how good each state, or state-action pair, is;
• Model: the agent's representation of the environment.

In this section, we discuss these components and highlight their importance and impact on algorithm design.
Let γ be the discount factor of future rewards; then the return G_t (also known as the future discounted reward) at time index t is given by:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = ∑_{k=0}^{∞} γ^k r_{t+k+1},   γ ∈ [0, 1]      (4.4)

Policy, π, refers to the behaviour of an agent. It is a mapping function from state to action (Witten, 1977), such that:

π : S → A      (4.5)

where S and A are respectively the state space and the action space. A policy function can be:
• Deterministic: a_{t+1} = π(s_t);
• Stochastic: π(a|s) = P[a_t = a | s_t = s].

The state-value function, v_π, is the expected return, G_t, starting from state s and then following policy π (Szepesvári, 2010), that is:

v_π : S → B,   v_π(s) = E_π[G_t | s_t = s]      (4.6)

where S and B are respectively the state space and the rewards set (B ⊆ R).

The action-value function, q_π, is the expected return, G_t, starting from state s, upon taking action a, and then following policy π (Silver, 2015b):

q_π : S × A → B,   q_π(s, a) = E_π[G_t | s_t = s, a_t = a]      (4.7)

where S, A and B are the state space, the action space and the reward set, respectively.

A model predicts the next state of the environment, s_{t+1}, and the corresponding reward signal, r_{t+1}, given the current state, s_t, and the action taken, a_t, at time step t. It can be represented by a state transition probability matrix P, given by:

P_{ss'}^a : S × A → S,   P_{ss'}^a = P[s_{t+1} = s' | s_t = s, a_t = a]      (4.8)

and a reward generating function R:

R : S × A → B,   R_s^a = E[r_{t+1} | s_t = s, a_t = a]      (4.9)

where S, A and B are the state space, the action space and the reward set, respectively.
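As a quick illustration of the return definition (4.4), the following sketch accumulates a finite reward sequence backwards; the function name and the toy rewards are purely illustrative.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence, cf. (4.4)."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```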
4.3 Markov Decision Process

A special type of discrete-time stochastic dynamical system is the Markov Decision Process (MDP). MDPs possess strong properties that guarantee convergence to the globally optimal policy (i.e., strategy), while, by relaxing some of their assumptions, they can describe any dynamical system, providing a powerful representation framework and a common way of controlling dynamical systems.

A state s_t (Silver, 2015c) satisfies the Markov property if and only if (iff):

P[s_{t+1} | s_t, s_{t-1}, ..., s_1] = P[s_{t+1} | s_t]      (4.10)

This implies that the current state, s_t, is a sufficient statistic for predicting the future, and therefore the longer-term history, h⃗_t, can be discarded.

Any fully observable environment which satisfies equation (4.3) can be modelled as a
Markov Decision Process (MDP). A Markov Decision Process (Poole and Mackworth, 2010) is an object (i.e., a 5-tuple) ⟨S, A, P, R, γ⟩ where:
• S is a finite set of states (state space), such that they satisfy the Markov property, as in definition (4.10);
• A is a finite set of actions (action space);
• P is a state transition probability matrix;
• R is a reward generating function;
• γ is a discount factor.

Apart from their expressiveness, MDPs can be solved optimally, making them very attractive.
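As an illustration of the 5-tuple above, a finite MDP can be stored directly as tabular arrays and sampled from; the class and field names below are hypothetical and chosen only for this sketch.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """A finite MDP <S, A, P, R, gamma> with tabular dynamics."""
    P: np.ndarray      # transition probabilities, shape (|A|, |S|, |S|); P[a, s, s']
    R: np.ndarray      # expected rewards, shape (|A|, |S|); R[a, s]
    gamma: float       # discount factor

    def step(self, s, a, rng=None):
        """Sample the next state s' ~ P[a, s, .] and return it with the expected reward."""
        rng = rng or np.random.default_rng()
        s_next = rng.choice(self.P.shape[-1], p=self.P[a, s])
        return s_next, self.R[a, s]
```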
Value Function
The optimal state-value function, v*, is the maximum state-value function over all policies:

v*(s) = max_π v_π(s),   ∀ s ∈ S      (4.11)

The optimal action-value function, q*, is the maximum action-value function over all policies:

q*(s, a) = max_π q_π(s, a),   ∀ s ∈ S, a ∈ A      (4.12)

Policy
Define a partial ordering over policies (Silver, 2015c):

π ≤ π′  ⟺  v_π(s) ≤ v_{π′}(s),   ∀ s ∈ S      (4.13)

For an MDP the following theorems hold:

Theorem 4.2 (Policy Optimality)
There exists an optimal policy, π*, that is better than or equal to all other policies, such that π* ≥ π, ∀ π. (The proofs are based on the contraction property of the Bellman operator (Poole and Mackworth, 2010).)
Theorem 4.3 (State-Value Function Optimality)
All optimal policies achieve the optimal state-value function, such that v_{π*}(s) = v*(s), ∀ s ∈ S.

Theorem 4.4 (Action-Value Function Optimality)
All optimal policies achieve the optimal action-value function, such that q_{π*}(s, a) = q*(s, a), ∀ s ∈ S, a ∈ A.

Given a Markov Decision Process ⟨S, A, P, R, γ⟩, because the states in S satisfy the Markov property (4.10):

• The policy π is a distribution over actions given states:

π(a|s) = P[a_t = a | s_t = s]      (4.14)

Without loss of generality we assume that the policy π is stochastic, because of the state transition probability matrix P (Sutton and Barto, 1998). Owing to the Markov property, MDP policies depend only on the current state and are time-independent (stationary) (Silver, 2015c), such that a_t ∼ π(·|s_t), ∀ t > 0.

• The state-value function v_π can be decomposed into two parts: the immediate reward r_{t+1} and the discounted value of the successor state, γ v_π(s_{t+1}):

v_π(s) = E_π[G_t | s_t = s]
       = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s]          by (4.4)
       = E_π[r_{t+1} + γ (r_{t+2} + γ r_{t+3} + ...) | s_t = s]
       = E_π[r_{t+1} + γ G_{t+1} | s_t = s]                              by (4.4)
       = E_π[r_{t+1} + γ v_π(s_{t+1}) | s_t = s]                         by (4.10)      (4.16)

• The action-value function q_π can be similarly decomposed:

q_π(s, a) = E_π[r_{t+1} + γ q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]      (4.17)

Equations (4.16) and (4.17) are the Bellman Expectation Equations for Markov Decision Processes, formulated by Bellman (1957).

Search, or seeking a goal under uncertainty, is a ubiquitous requirement of life (Hills et al., 2015); metaphorically speaking, an agent "lives" in the environment. Not only machines but also humans and animals usually face the trade-off between exploiting known opportunities and exploring for better opportunities elsewhere. This is a fundamental dilemma in reinforcement learning, where the agent may need to act "sub-optimally" in order to explore new possibilities, which may lead it to better strategies. Every reinforcement learning algorithm takes this trade-off into account, trying to balance the search for new opportunities (exploration) with secure actions (exploitation); here "secure" does not refer to any risk-sensitive metric or strategy, but describes actions that have been tried in the past and whose outcomes are predictable to some extent. From an optimization point of view, if an algorithm is greedy and only exploits, it may converge fast, but it runs the risk of getting stuck in a local minimum. Exploration may at first slow down convergence, but it can lead to previously unexplored regions of the search space, resulting in an improved solution. Most algorithms perform exploration either by artificially adding noise to the actions, which is attenuated while the agent gains experience, or by modelling the uncertainty of each action (Gal, 2016) in the Bayesian optimization framework.

Markov Decision Processes can be solved optimally by reinforcement learning agents (Sutton and Barto, 1998; Szepesvári, 2010). Nonetheless, most real-life applications do not satisfy one or more of the conditions stated in Section 4.3.2. As a consequence, modifications of these conditions lead to other types of processes, such as the Infinite MDP and the Partially Observable MDP, which in turn can realistically fit many application domains.
In the case of either the state space S, or the action space A, or both being infinite (countably infinite or continuous), the environment can be modelled as an Infinite Markov Decision Process (IMDP). Therefore, in order to implement the policy π or/and the action-value function q_π on a computer, a differentiable function approximation method must be used (Sutton et al., 2000a), such as a least-squares model or a neural network.

The state-action dynamics of an IMDP are described by a transition probability function P_{ss'}^a and not a matrix, since the state or/and the action spaces are continuous.

If s_t^e ≠ s_t^a, then the environment is partially observable and it can be modelled as a Partially Observable Markov Decision Process (POMDP). A POMDP is a tuple ⟨S, A, O, P, R, Z, γ⟩ where:
• O is a finite set of observations (observation space);
• Z is an observation function, Z_{s'o}^a = P[o_{t+1} = o | s_{t+1} = s', a_t = a].

It is important to note that any dynamical system can be viewed as a POMDP, and all the algorithms used for MDPs remain applicable, though without convergence guarantees.
Part II
Innovation

Chapter 5
Financial Market as Discrete-Time Stochastic Dynamical System
In Chapter 3, the task of static asset allocation, as well as traditional methods for addressing it, were introduced. Our interest in dynamically (i.e., sequentially) constructing portfolios led to studying the basic components and concepts of reinforcement learning in Chapter 4, which suggest a framework for dealing with sequential decision making tasks. However, in order to leverage the reinforcement learning tools, it is necessary to translate the problem (i.e., asset allocation) into a discrete-time stochastic dynamical system and, in particular, into a Markov Decision Process (MDP). Note that not all of the strong assumptions of an MDP (Section 4.3.2) can be satisfied, hence we resort to relaxing some of the assumptions and considering the MDP extensions discussed in Section 4.4. However, the convergence and optimality guarantees are obviously not applicable under this formalization.

In this chapter, we mathematically formalize financial markets as discrete-time stochastic dynamical systems. Firstly, we consider the necessary assumptions for this formalization (Section 5.1), followed by the framework (Sections 5.2, 5.3, 5.4) which enables reinforcement learning agents to interact with the financial market in order to optimally address portfolio management.
Only back-test trading is considered, where the trading agent pretends to be back in time at a point in the market history, not knowing any "future" market information, and does paper trading from then onward (Jiang et al., 2017). As a requirement for the back-test experiments, the following three assumptions must apply: sufficient liquidity, zero slippage and zero market impact, all of which are realistic if the traded assets' volume in a market is high enough (Wilmott, 2007).

An asset is termed liquid if it can be converted into cash quickly, with little or no loss in value (Investopedia, 2018c).
Assumption 5.1 (Sufficient Liquidity)
All market assets are liquid and every transaction can be executed under the same conditions.
Slippage refers to the difference between the expected price of a trade and the price at which the trade is actually executed (Investopedia, 2018e).
Assumption 5.2 (Zero Slippage)
The liquidity of all market assets is high enough that each trade can be carried out immediately, at the last price, when an order is placed.
Asset prices are determined by the
Law of Supply and Demand (Investopedia, 2018b); therefore any trade impacts the balance between them, and hence affects the price of the asset being traded.
Assumption 5.3 (Zero Market Impact)
The capital invested by the trading agent is so insignificant that it has no influence on the market.
In order to solve the asset allocation task, the trading agent should be able to determine the portfolio vector w_t at every time step t; therefore the action a_t at time t is the portfolio vector w_{t+1} at time t+1:

a_t ≡ w_{t+1} ≜ [w_{1,t+1}, w_{2,t+1}, ..., w_{M,t+1}]^T      (5.1)

hence the action space A is a subset of the continuous M-dimensional real space R^M:

a_t ∈ A ⊆ R^M,  ∀ t ≥ 0,  with  ∑_{i=1}^{M} a_{i,t} = 1      (5.2)

or, when short selling is not allowed,

a_t ∈ A ⊆ [0, 1]^M,  ∀ t ≥ 0,  with  ∑_{i=1}^{M} a_{i,t} = 1      (5.3)

5.3 State & Observation Space

At any time step t, we can only observe asset prices, thus the price vector p_t (2.3) is the observation o_t, or equivalently:

o_t ≡ p_t ≜ [p_{1,t}, p_{2,t}, ..., p_{M,t}]^T      (5.4)

hence the observation space O is a subset of the continuous M-dimensional positive real space R^M_+, since prices are non-negative real values:

o_t ∈ O ⊆ R^M_+,  ∀ t ≥ 0      (5.5)

Since the observations o_t are not sufficient statistics of the environment state, financial markets are partially observable (Silver, 2015b). As a consequence, equation (4.3) is not satisfied and we should construct the agent's state s_t^a by processing the observations o_t ∈ O. In Section 4.1.4, two alternatives to deal with partial observability were suggested, considering:
1. Complete history: s_t^a ≡ h⃗_t, as in (4.1);
2. Recurrent neural network: s_t^a ≡ f(s_{t-1}^a, o_t; θ);

(If prices were a VAR(1) process (Mandic, 2018a), financial markets would be pure MDPs.)
where in both cases we assume that the agent state approximates the environment state, s_t^a = ŝ_t^e ≈ s_t^e. While the first option may contain all the environment information up to time t, it does not scale well, since the memory and computational load grow linearly with time t. A GRU-RNN (see Section 2.4.2), on the other hand, can store and process the historic observations efficiently, in an adaptive manner as they arrive, filtering out any uninformative observations. We will be referring to this recurrent layer as the state manager, since it is responsible for constructing (i.e., managing) the agent state. This layer can be part of any neural network architecture, enabling end-to-end differentiability and training.

Figure 5.1 illustrates example financial market observations o_t and the corresponding actions a_t of a random agent.
Figure 5.1: Example universe of assets as a dynamical system, including AAPL (Apple), GE (General Electric), BA (Boeing Company) and XOM (Exxon Mobil Corporation). (Left) Financial market asset prices, observations o_t. (Right) Portfolio manager (agent) portfolio vectors, actions a_t; the portfolio coefficients are illustrated in a stacked bar chart, where at each time step they sum to unity according to equation (2.1).

In order to assist and speed up the training of the state manager, we process the raw observations o_t, obtaining ŝ_t. In particular, thanks to the representational and statistical superiority of log returns over asset prices and simple returns (see 2.2.2), we use the log returns matrix ρ⃗_{t-T:t} (2.10) of fixed window size T. (Expanding windows would be more appropriate if the RNN state manager were replaced by the complete history h⃗_t.) We also exploit another important property of log returns, which suits the nature of the operations performed by neural networks, the function approximators used for building agents. Neural network layers apply non-linearities to weighted sums of input features, hence the features do not interact (multiplicatively) with one another, but only with the layer weights (i.e., parameters). (This is the issue that multiplicative neural networks (Salinas and Abbott, 1996) try to address: consider two scalar feature variables x_1 and x_2 and the target scalar variable y such that f(x_1, x_2) ≜ y = x_1 · x_2, with x_1, x_2 ∈ R (5.6). It is very hard for a neural network to learn this function, but a logarithmic transformation of the features turns the problem into a very simple sum of logarithms, using the property log(x_1) + log(x_2) = log(x_1 · x_2) = log(y) (5.7).) Nonetheless, by summing log returns we equivalently multiply the corresponding gross returns, hence the networks learn non-linear functions of products of returns (i.e., asset-wise and cross-asset), which are the building blocks of the covariances between assets. Therefore, by simply using log returns, we enable cross-asset dependencies to be easily captured.

Moreover, transaction costs are taken into account, and since (3.12) suggests that the previous time step's portfolio vector w_{t-1} affects transaction costs, we also append w_t, or equivalently a_{t-1} by (5.1), to the agent state, obtaining the 2-tuple:

ŝ_t ≜ ⟨ w_t, ρ⃗_{t-T:t} ⟩ = ⟨ [w_{1,t}, w_{2,t}, ..., w_{M,t}]^T ,
      [ ρ_{1,t-T}  ρ_{1,t-T+1}  ···  ρ_{1,t} ]
      [ ρ_{2,t-T}  ρ_{2,t-T+1}  ···  ρ_{2,t} ]
      [    ...        ...       ...    ...   ]
      [ ρ_{M,t-T}  ρ_{M,t-T+1}  ···  ρ_{M,t} ]  ⟩      (5.8)

where ρ_{i,(t-τ)→t} denotes the cumulative log return of asset i over the time interval [t-τ, t]; most of these terms can be stored or pre-calculated. Overall, the agent state is given by:

s_t^a ≡ f(s_{t-1}^a, ŝ_t; θ)      (5.9)

where f is the state manager's non-linear mapping function. When an observation arrives, we calculate ŝ_t and feed it to the state manager (GRU-RNN). Therefore, the state space S is a subset of the continuous K-dimensional real space R^K, where K is the size of the hidden state of the GRU-RNN state manager:
s_t^a ∈ S ⊆ R^K,  ∀ t ≥ 0      (5.10)

Figure 5.2 illustrates examples of the processed observation 2-tuples ŝ_t. The agent uses this representation as input to determine its internal state, which in turn drives its policy. To the naked eye these inputs look meaningless and impossible to generalize from; nonetheless, in Chapter 6 we demonstrate the effectiveness of this representation, especially thanks to the combination of convolutional and recurrent layers.

Overall, the financial market should be modelled as an Infinite Partially Observable Markov Decision Process (IPOMDP), since:
• The action space is continuous (infinite), A ⊆ R^M;
• The observations o_t are not sufficient statistics (partial observability) of the environment state;
• The state space is continuous (infinite), S ⊆ R^K.

Figure 5.2: Examples of processed observation 2-tuples for two randomly selected time steps.
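A minimal sketch of how the 2-tuple (5.8) could be assembled from a window of raw prices; the function name and array layout are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import numpy as np

def build_agent_observation(prices, w_current, T):
    """Assemble s_hat_t = <w_t, log-return window>, cf. (5.8).

    prices    : array of shape (T + 1, M), the last T + 1 asset prices
    w_current : current portfolio vector w_t, shape (M,)
    T         : look-back window length
    """
    log_returns = np.diff(np.log(prices), axis=0)   # shape (T, M)
    return w_current, log_returns.T                 # shapes (M,) and (M, T)
```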
5.4 Reward Signal

The determination of the reward signal is usually the most challenging step in the design of a reinforcement learning problem. According to the Reward Hypothesis 4.1, the reward is a scalar value which fully specifies the goals of the agent, and the maximization of the expected cumulative reward leads to the optimal solution of the task. Specifying the optimal reward generating function is the field of study of Inverse Reinforcement Learning (Ng and Russell, 2000) and Inverse Optimal Control (Moylan and Anderson, 1973).

In our case, we develop a generic, modular framework which enables comparison of various reward generating functions, including log returns, (negative) volatility and the Sharpe Ratio. Section 2.3 motivates a few reward function candidates, most of which are implemented and tested in Chapter 9; transaction costs are included in all cases. It is worth highlighting that reinforcement learning methods, by default, aim to maximize the expected cumulative reward signal, hence the optimization problem that the agent (parametrized by θ) solves is given by:

maximize_θ   ∑_{t=1}^{T} E[γ^t r_t]      (5.11)

For instance, when we refer to log returns (with transaction costs) as the reward generating function, the agent solves the optimization problem:

maximize_θ   ∑_{t=1}^{T} E[ γ^t ln(1 + w_t^T r_t − β ‖w_{t-1} − w_t‖_1) ]      (5.12)

where the argument of the logarithm is the gross return at time index t, adjusted for transaction costs (see (2.12) and (3.12)).
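The per-step reward appearing inside (5.12) can be computed as in the sketch below, assuming the 0.2% commission rule of thumb; the function name is hypothetical.

```python
import numpy as np

def log_return_reward(w_prev, w_new, asset_returns, beta=0.002):
    """Log gross portfolio return net of transaction costs, cf. (5.12)."""
    cost = beta * np.abs(w_new - w_prev).sum()       # commission proportional to turnover
    return np.log(1.0 + w_new @ asset_returns - cost)
```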
Chapter 6
Trading Agents

Current state-of-the-art algorithmic portfolio management methods:
• Address the decision making task of asset allocation by solving a prediction problem, heavily relying on the accuracy of predictive models for financial time-series (Aldridge, 2013; Heaton et al., 2017), which are usually unsuccessful due to the stochasticity of financial markets;
• Make unrealistic assumptions about the second and higher-order statistical moments of the financial signals (Necchi, 2016; Jiang et al., 2017);
• Deal with binary trading signals (i.e., BUY, SELL, HOLD) (Neuneier, 1996; Deng et al., 2017), instead of assigning portfolio weights to each asset, hence limiting the scope of their applications.

On the other hand, the representation of the financial market as a discrete-time stochastic dynamical system, as derived in Chapter 5, enables the development of a unified framework for training reinforcement learning trading agents. In this chapter, this framework is exploited by:
• Model-based reinforcement learning agents, as in Section 6.1, where vector autoregressive processes (VAR) and recurrent neural networks (RNN) are fitted to the environment dynamics, while the derived agents perform planning and control (Silver, 2015a). Similar to (Aldridge, 2013; Heaton et al., 2017), these agents are based on a predictive model of the environment, which is in turn used for decision making. Their performance is similar to known algorithms and thus they are used as baseline models for comparison;
• Model-free reinforcement learning agents, as in Section 6.2, which directly address the decision making task of sequential and multi-step optimization. Modifications to state-of-the-art reinforcement learning algorithms, such as the Deep Q-Network (DQN) (Mnih et al., 2015) and the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), enable their incorporation into the trading agent training framework.

Algorithm 1 provides the general setup for the reinforcement learning algorithms discussed in this chapter, based on which experiments on a small universe of real market data are carried out, for testing their efficacy and for illustration purposes. In Part III, all the different agents are compared on a larger universe of assets with different reward functions, a more realistic and practical setting.
Algorithm 1: General setup for trading agents.
  inputs : trading universe of M assets;
           initial portfolio vector w_1 = a_0;
           initial asset prices p_1 = o_1;
           objective function J
  output : optimal agent parameters θ*, φ*
  repeat
      for t = 1, 2, ..., T do
          observe 2-tuple ⟨o_t, r_t⟩
          calculate gradients ∇_θ J(r_t) and ∇_φ J(r_t)                     // BPTT
          update agent parameters θ, φ using adaptive gradient optimizers   // ADAM
          get estimate of agent state: s_t ≈ f(·, o_t)                      // (5.9)
          sample and take action: a_t ∼ π(·|s_t; φ)                         // portfolio rebalance
      end
  until convergence
  set θ*, φ* ← θ, φ

6.1 Model-Based Reinforcement Learning

Upon a revision of the schematic of a generic partially observable environment (i.e., dynamical system) as in Figure 4.2, it is noted that, given the transition probability function P_{ss'}^a of the system, the reinforcement learning task reduces to planning (Atkeson and Santamaria, 1997): simulate future states by recursively calling P_{ss'}^a L times and choose the roll-outs (i.e., trajectories) which maximize cumulative reward, via dynamic programming (Bertsekas et al., 1995):

s_t --P_{ss'}^a--> s_{t+1} --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L}      (6.1)

a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.2)

Note that, due to the assumptions made in Section 5.1, and especially the Zero Market Impact Assumption 5.3, the agent actions do not affect the environment state transitions, or equivalently the financial market is an open-loop system (Feng and Palomar, 2016), where the agent actions do not modify the system state, but only the received reward:

p(s_{t+1} | s, a) = p(s_{t+1} | s)  ⟹  P_{ss'}^a = P_{ss'}      (6.3)

Moreover, the reward generating function is known, as explained in Section 5.4, hence a model of the environment is obtained by learning only the transition probability function P_{ss'}.

In the area of Signal Processing and Control Theory, the task under consideration is usually termed
System Identification (SI), where an approximation of the environment, the "model", is fitted such that it captures the environment dynamics:

P̂_{ss'} ≈ P_{ss'}   (model ≈ environment)      (6.4)

Figure 6.1 illustrates schematically the system identification wiring of the circuit, where the model, represented by P̂_{ss'}, is compared against the true transition probability function P_{ss'}, and the loss function L (i.e., mean squared error) is minimized by optimizing with respect to the model parameters θ. It is worth highlighting that the transition probability function is by definition stochastic (4.8), hence the candidate fitted models should ideally be able to capture and incorporate this uncertainty. As a result, model-based reinforcement learning usually (Deisenroth and Rasmussen, 2011; Levine et al., 2016; Gal et al., 2016) relies on probabilistic graphical models, such as Gaussian Processes (Rasmussen, 2004) or Bayesian Networks (Ghahramani, 2001), which are non-parametric models that do not output point estimates, but learn the generating process of the data p_data, and hence enable sampling from the posterior distribution. Sampling from the posterior distribution allows us to obtain stochastic predictions that respect the model dynamics.

Figure 6.1: General setup for System Identification (SI) (i.e., model-based reinforcement learning) for solving a discrete-time stochastic partially observable dynamical system.

In this section we shall focus, nonetheless, only on vector autoregressive processes (VAR) and recurrent neural networks (RNN) for modelling P_{ss'}, trained in an adaptive fashion (Mandic and Chambers, 2001), as given by Algorithm 2. An extension of vanilla RNNs to Bayesian RNNs could also be tried, using the MC-dropout trick of Gal and Ghahramani (2016).

Following on from the introduction of vector autoregressive processes (VAR) in Section 2.4.1, and using the fact that the transition probability model P̂_{ss'} is a one-step time-series predictive model, we investigate the effectiveness of VAR processes as time-series predictors.

Agent Model
Vector autoregressive processes (VAR) regress past values of a multivariate time-series against its future values (see equation (2.44)). In order to satisfy the covariance stationarity assumption (Mandic, 2018a), we fit a VAR process on the log returns ρ_t, and not on the raw observations o_t (i.e., price vectors), since the latter are known to be highly non-stationary (in the wide sense (Bollerslev, 1986)). The model is pre-trained on historic data (i.e., batch supervised learning (Murphy, 2012)) and is updated online, following the gradient ∇_θ L(ρ̂_t, ρ_t), as described in Algorithm 2.
Algorithm 2: General setup for adaptive model-based trading agents.
  inputs : trading universe of M assets;
           initial portfolio vector w_1 = a_0;
           initial asset prices p_1 = o_1;
           loss function L;
           historic dataset D
  output : optimal model parameters θ*
  batch training on D: θ ← argmax_θ p(θ | D)                              // MLE
  repeat
      for t = 1, 2, ..., T do
          predict next state ŝ_t                                           // via P̂_ss'
          observe tuple ⟨o_t, r_t⟩
          get estimate of agent state: s_t ≈ f(·, o_t)                     // (5.9)
          calculate gradients: ∇_θ L(ŝ_t, s_t)                             // backprop
          update model parameters θ using adaptive gradient optimizers     // ADAM
          plan and take action a_t                                         // portfolio rebalance
      end
  until convergence
  set θ* ← θ

The model takes the form:

P_{ss'} :  ρ_t ≈ c + ∑_{i=1}^{p} A_i ρ_{t-i}   (cf. (2.44))      (6.5)

planning :  ( s_t --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L} )  ⟹ by (6.2)  a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.6)
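A self-contained least-squares sketch of fitting the VAR(p) model (6.5) on a window of log returns and producing the one-step prediction used for planning; the helper names are illustrative and this is not the exact training loop of Algorithm 2.

```python
import numpy as np

def fit_var(rho, p):
    """Least-squares fit of a VAR(p) on log returns rho of shape (N, M), cf. (6.5)."""
    N, M = rho.shape
    # regressors for each t = p..N-1: [1, rho_{t-1}, ..., rho_{t-p}]
    X = np.hstack([np.ones((N - p, 1))] +
                  [rho[p - i - 1:N - i - 1] for i in range(p)])
    Y = rho[p:]                                        # targets rho_t
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)      # shape (1 + p*M, M)
    return theta

def var_predict(theta, recent_rho, p):
    """One-step-ahead prediction rho_hat_{t+1} from the last p observations (rows in time order)."""
    x = np.concatenate([[1.0]] + [recent_rho[-i - 1] for i in range(p)])
    return x @ theta
```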
Related Work

Vector autoregressive processes have been widely used for modelling financial time-series, especially returns, due to their pseudo-stationary nature (Tsay, 2005). There are no results in the open literature on the use of VAR processes in the context of model-based reinforcement learning; nonetheless, control engineering applications (Akaike, 1998) have extensively used autoregressive models to deal with dynamical systems.
Evaluation
Figure 6.2 illustrates the cumulative rewards and the prediction error power. Observe that the agent is highly correlated with the market (i.e., S&P 500) and overall collects lower cumulative returns. Moreover, note that the market crash in 2009 (Farmer, 2012) affects the agent significantly, leading to a 179.6% decline (drawdown), from which it took from 2008-09-12 until 2015-10-22 to recover.

Figure 6.2: Order-eight vector autoregressive model-based reinforcement learning agent on a 12-asset universe (i.e., VAR(8)), pre-trained on historic data between 2000 and 2005 and trained online onward. (Left) Cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY). (Right) Mean squared prediction error for single-step predictions.
Weaknesses
The order-p VAR model, VAR(p), assumes that the underlying generating process:
1. Is covariance stationary;
2. Satisfies the order-p Markov property;
3. Is linear, conditioned on past samples.
Unsurprisingly, most of these assumptions are not realistic for real market data, which is reflected in the poor performance of the method illustrated in Figure 6.2.
Expected Properties 6.1 (Non-Stationary Dynamics)
The agent should be able to capture non-stationary dynamics.
Expected Properties 6.2 (Long-Term Memory)
The agent should be able to selectively remember past events without brute-force memory mechanisms (e.g., using lagged values as features).
Expected Properties 6.3 (Non-Linear Model)
The agent should be able to learn non-linear dependencies between features.
The limitations of the vector autoregressive processes regarding the stationarity, linearity and finite memory assumptions are overcome by the recurrent neural network (RNN) environment model. Inspired by the effectiveness of recurrent neural networks in time-series prediction (Gers et al., 1999; Mandic and Chambers, 2001; Hénaff et al., 2011) and the encouraging results obtained from the initial one-step predictive GRU-RNN model in Figure 2.16, we investigate the suitability of RNNs in the context of model-based reinforcement learning, used as environment predictors.
Agent Model
Revisiting Algorithm 2, along with the formulation of RNNs in Section 2.4.2, we highlight the steps:

state manager :  s_t = f(s_{t-1}, ρ_t)   (cf. (4.2))      (6.7)
prediction    :  ρ̂_{t+1} ≈ V σ(s_t) + b      (6.8)
planning      :  ( s_t --P_{ss'}^a--> ... --P_{ss'}^a--> s_{t+L} )  ⟹ by (6.2)  a_{t+1} ≡ argmax_{a ∈ A} J(a | a_t, s_t, ..., s_{t+L})      (6.9)

where V and b are the weights matrix and bias vector of the output affine layer of the network, and σ is a non-linearity (e.g., rectified linear unit (Nair and Hinton, 2010), hyperbolic tangent, or sigmoid function). Schematically, the network is depicted in Figure 6.3.

(In the Deep Learning literature (Goodfellow et al., 2016), the term affine refers to a neural network layer with parameters W and b that maps an input matrix X to an output ŷ according to f_affine(X; W, b) = ŷ ≜ XW + b (6.10). The terms affine, fully-connected (FC) and dense refer to the same layer configuration.)
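A minimal PyTorch sketch of the two-layer GRU state manager with an affine prediction head, mirroring (6.7)-(6.8); the class name, hidden size and the choice of ReLU for σ are assumptions made for illustration rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class GRUPredictor(nn.Module):
    """Two-layer GRU state manager with an affine output head, cf. (6.7)-(6.8)."""
    def __init__(self, n_assets, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(n_assets, hidden_size, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_size, n_assets)        # affine layer: V sigma(s_t) + b

    def forward(self, log_returns, state=None):
        # log_returns: (batch, T, n_assets) window of past log returns
        out, state = self.gru(log_returns, state)            # builds the agent state s_t
        return self.head(torch.relu(out[:, -1])), state      # one-step prediction rho_hat_{t+1}

model = GRUPredictor(n_assets=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # ADAM adaptive optimizer
loss_fn = nn.MSELoss()                                        # mean squared prediction error
```

An online update would then call the model on the latest window, compare the prediction against the realized log returns with the loss function, back-propagate and step the optimizer, as in Algorithm 2.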
Related Work

Despite the fact that RNNs were first used decades ago (Hopfield, 1982; Hochreiter and Schmidhuber, 1997; Mandic and Chambers, 2001), recent advances in adaptive optimizers (e.g., RMSProp (Tieleman and Hinton, 2012), ADAM (Kingma and Ba, 2014)) and deep learning have enabled the development of deep recurrent neural networks for sequential data modelling (e.g., time-series, text). Since financial markets are dominated by dynamic structures and time-series, RNNs have been extensively used for modelling dynamic financial systems (Tino et al., 2001; Chen et al., 2015; Heaton et al., 2016; Bao et al., 2017). In most cases, feature engineering is performed prior to training, so that meaningful financial signals are combined instead of raw series. Our approach is rather context-free, performing pure technical analysis of the series, without involving manual extraction and validation of higher-order features.
Figure 6.3: Two-layer gated recurrent unit recurrent neural network (GRU-RNN) model-based reinforcement learning agent; it receives log returns ρ_t as input, builds an internal state s_t and estimates the future log returns ρ̂_{t+1}. Regularized mean squared error is used as the loss function, optimized with the ADAM (Kingma and Ba, 2014) adaptive optimizer.

Evaluation
Figure 6.4 illustrates the performance of a two-layer gated recurrent unit recurrent neural network, which does not outperform the vector autoregressive predictor as much as was expected. Again, we note the strong correlation with the market (i.e., S&P 500). The 2008 market crash affects the RNN agent as well, but it manages to recover faster than the VAR agent, leading to an overall 221.1% cumulative return, slightly higher than the market index.
Figure 6.4: Two-layer gated recurrent unit recurrent neural network (GRU-RNN) model-based reinforcement learning agent on a 12-asset universe, pre-trained on historic data between 2000 and 2005 and trained online onward. (Left) Cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY). (Right) Mean squared prediction error for single-step predictions.

Having developed both vector autoregressive and recurrent neural network model-based reinforcement learning agents, we conclude that, despite the architectural simplicity of system identification, both agents under-perform. The inherent randomness (i.e., due to uncertainty) of financial time-series (e.g., prices, returns) affects the model training and degrades predictability.

In spite of the promising results in Figures 2.14 and 2.16, where one-step predictions are considered, control and planning (6.2) are only effective when accurate multi-step predictions are available. Therefore, the process of first fitting a model (e.g., VAR, RNN or Gaussian Process) and then using an external optimization step results in two sources of approximation error, where the error propagates over time and reduces performance.

A potential improvement of these methods would be to manually feature-engineer the state space, for example by extracting meaningful econometric signals (Greene, 2003) (e.g., volatility regime shifts, earnings or dividend announcements, fundamentals) which are in turn used for predicting the returns. This is, in a nutshell, the traditional approach that quantitative analysts (LeBaron, 2001) have been using for the past decades. Computational power has improved radically over the years, which permits larger (i.e., deeper and wider) models to be fitted, while elaborate algorithms, such as variational inference (Archer et al., 2015), have made previously intractable tasks possible. Nonetheless, this approach involves a lot of tweaking and human intervention, which are the main aspects we aim to attenuate.
6.2 Model-Free Reinforcement Learning

In the previous section, we assumed that solving a system identification problem (i.e., explicitly inferring the environment dynamics) is easier than addressing directly the initial objective: the maximization of a cumulative reward signal (e.g., log returns, negative volatility, Sharpe Ratio). Nonetheless, accurately predicting the evolution of the market proved challenging, resulting in ill-performing agents.

In this section, we adopt an orthogonal approach, where we do not rely on an explicit model of the environment, but parametrize the agent value function or/and policy directly. At first glance, it may seem counter-intuitive how skipping the modelling of the environment can lead to a meaningful agent at all, but consider the following example from daily life. Humans are able to easily handle objects or move them around. Unarguably, this is a consequence of experience that we have gained over time; however, if we are asked to explain the environment model that justifies our actions, it is much more challenging, especially for a one-year-old child, who can successfully play with toys but fails to explain this task using Newtonian physics.

Another motivating example comes from the portfolio management and trading field: pairs trading is a simple trading strategy (Gatev et al., 2006) which relies on the assumption that historically correlated assets will preserve this relationship over time. Hence, when the two assets start deviating from one another, this is considered an arbitrage opportunity (Wilmott, 2007), since they are expected to return to a correlated state. This opportunity is exploited by taking a long position in the under-performing (falling) stock and a short position in the out-performing (rising) one. (Pairs trading is selected for educational purposes only; we are not claiming that it is optimal in any sense.) If we wanted to train a model-based reinforcement learning agent to perform pairs trading, it would be almost impossible or too unstable, regardless of the algorithm's simplicity. On the other hand, a value-based or policy gradient agent could perform this task with minimal effort, replicating the strategy, because pairs trading does not rely on future value prediction, but on much simpler statistical analysis (i.e., cross-correlation), which, in the case of model-based approaches, would have to be translated into an optimization problem with an unrelated objective: the prediction error. Overall, using model-free reinforcement learning improves efficiency (i.e., only one fitting stage) and allows finer control over the policy, whereas a model-based approach limits the policy to be only as good as the learned model. More importantly, for the task under consideration (i.e., asset allocation), it is shown to be easier to
represent a good policy than to learn an accurate model.

The model-free reinforcement learning agents are summarized in Algorithm 1, where different objective functions and agent parametrizations lead to different approaches and hence strategies. We classify these algorithms as:
• Value-based: learn a state-value function v (4.16), or an action-value function q (4.17), and use it with an implicit policy (e.g., ε-greedy (Sutton and Barto, 1998));
• Policy-based: learn a policy directly, by using the reward signal to guide adaptation.

In this chapter, we will first focus on value-based algorithms, which exploit the state (action) value function, as defined in equations (4.6) and (4.7), as an estimate of expected cumulative rewards. Then policy gradient methods will be covered, which parametrize the policy of the agent directly and perform gradient ascent to optimize performance. Finally, a universal agent will be introduced, which reduces complexity (i.e., computational and memory) and generalizes strategies across assets, regardless of the training universe, based on parameter sharing (Bengio et al., 2003) and transfer learning (Pan and Yang, 2010a) principles.
A wide range of value-based reinforcement learning algorithms (Sutton and Barto, 1998; Silver, 2015d; Szepesvári, 2010) have been suggested and used over time. Q-Learning is one of the simplest and best-performing ones (Liang et al., 2016), which motivates us to extend it to continuous action spaces to fit our system formulation.
Q-Learning
Q-Learning is a simple but very effective value-based model-free reinforcement learning algorithm (Watkins, 1989). It works by successively improving its evaluations of the action-value function q, hence the name. Let q̂ be the estimate of the true action-value function; then q̂ is updated online (at every time step) according to:

q̂(s_t, a_t) ← q̂(s_t, a_t) + α [ r_{t+1} + γ max_{a' ∈ A} q̂(s_{t+1}, a') − q̂(s_t, a_t) ]      (6.11)

where α ≥ 0 is the learning rate and γ ∈ [0, 1] the discount factor. In the literature, the term in the square brackets is usually called the Temporal Difference Error (TD error), δ_{t+1} (Sutton and Barto, 1998).
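A single tabular Q-Learning step (6.11) can be written as follows; the function name and default hyperparameters are illustrative only.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update, cf. (6.11); Q has shape (|S|, |A|)."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]   # temporal difference error, delta
    Q[s, a] += alpha * td_error
    return td_error
```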
Theorem 6.1
For a Markov Decision Process, Q-Learning converges to the optimum action-values with probability 1, as long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
The proof of Theorem 6.1 is provided by Watkins (1989) and relies on the contraction property of the Bellman operator (Sutton and Barto, 1998), showing that:

q̂(s, a) → q*(s, a)      (6.13)

(The Bellman operator B is an a-contraction with respect to some norm ‖·‖, since it can be shown that (Rust, 1997): ‖Bs − Bs̄‖ ≤ a ‖s − s̄‖ (6.12). Therefore it follows that (i) the sequence s, Bs, B²s, ... converges for every s, and (ii) B has a unique fixed point s*, which satisfies Bs* = s*, and all sequences s, Bs, B²s, ... converge to this unique fixed point s*.)

Note that equation (6.11) is practical only in the cases where:
1. The state space is discrete, and hence the action-value function can be stored in a digital computer as a grid of scalars;
2. The action space is discrete, and hence at each iteration the maximization over actions a ∈ A is tractable.

The Q-Learning steps are summarized in Algorithm 3.

Related Work
Due to the success of the Q-Learning algorithm, early attempts to modify it to fit the asset allocation task were made by Neuneier (1996), who attempted to use a differentiable function approximator (i.e., a neural network) to represent the action-value function, and hence enabled the use of Q-Learning in continuous state spaces. Nonetheless, he was restricted to discrete action spaces and thus acted on buy and sell signals only. Moody et al. (1998) used a similar approach but introduced new reward signals, namely general utility functions and the differential Sharpe Ratio, which are considered in Chapter 9.

Recent advances in deep learning (Goodfellow et al., 2016) and stochastic optimization methods (Boyd and Vandenberghe, 2004) led to the first practical use case of Q-Learning in high-dimensional state spaces (Mnih et al., 2015).
69. T rading A gents Algorithm 3:
Q-Learning with greedy policy. inputs : trading universe of M -assetsinitial portfolio vector w = a initial asset prices p = o initial output : optimal action-value function q ∗ initialize q-table: ˆ q ( s , a ) ← ∀ s ∈ S , a ∈ A while convergence do for t =
0, 1, . . . T do select greedy action: a t = max a (cid:48) ∈ A ˆ q ( s t , a (cid:48) ) observe tuple (cid:104) s t + , r t (cid:105) update q-table:ˆ q ( s t , a t ) ← ˆ q ( s t , a t ) + α (cid:2) r t + γ max a (cid:48) ∈ A ˆ q ( s t + , a (cid:48) ) − ˆ q ( s t , a t ) (cid:3) end end use case of Q-Learning in high-dimensional state spaces (Mnih et al. , 2015).Earlier work was limited to low-dimensional applications, where shallowneural network architectures were effective. Mnih et al. (2015) used a fewalgorithmic tricks and heuristics, and managed to stabilize the training pro-cess of the Deep Q-Network (DQN). The first demonstration was performedon Atari video games, where trained agents outperformed human playersin most games.Later, Hausknecht and Stone (2015) published a modified version of theDQN for partially observable environments, using a recurrent layer to con-struct the agent state, giving rise to the
Deep Recurrent Q-Network (DRQN).Nonetheless, similar to the vanilla Q-Learning and enhanced DQN algo-rithms, the action space was always discrete.
Agent Model
Inspired by the breakthroughs in DQN and DRQN, we suggest a modification to the last layers to handle pseudo-continuous action spaces, as required for the portfolio management task. The current implementation, termed the
Deep Soft Recurrent Q-Network (DSRQN), relies on a fixed, implicit policy (i.e., exponential normalization, or softmax (McCullagh, 1984)), while the action-value function q is adaptively fitted.

A neural network architecture as in Figure 6.5 is used to estimate the action-value function. The two 2D-convolution (2D-CONV) layers, followed by a max-pooling (MAX) layer, aim to extract non-linear features from the raw historic log returns (LeCun and Bengio, 1995). The feature map is then fed to the gated recurrent unit (GRU), which is the state manager (see Section 5.3.2), responsible for reconstructing a meaningful agent state from a partially observable environment. The generated agent state, s_t, is then regressed along with the past action, a_t (i.e., the current portfolio vector), in order to produce the action-value estimates. Those estimates are used with the realized reward r_{t+1} to calculate the TD error δ_{t+1} (6.11) and train the DSRQN as in Algorithm 4. The action-value estimates q̂_{t+1} are passed to a softmax layer, which produces the agent action a_{t+1}. We select the softmax function because it provides the favourable property of forcing all the components (i.e., portfolio weights) to sum to unity (see Section 2.1). Analytically, the actions are given by:

$$\forall i \in \{1, 2, \ldots, M\}: \quad a_i = \frac{e^{\hat{q}_i}}{\sum_{j=1}^{M} e^{\hat{q}_j}} \;\Longrightarrow\; \sum_{i=1}^{M} a_i = 1$$

The softmax layer has no trainable parameters, which means that it can be replaced by any function (i.e., deterministic or stochastic), even by a quadratic programming step, since differentiability is not required. For this experiment we did not consider more advanced policies, but any mapping is acceptable as long as constraint (2.1) is satisfied. Moreover, in contrast to the original DQN (Mnih et al., 2015), no experience replay is performed, in order to avoid resetting the GRU hidden state for each batch, which would lead to an unused latent state and hence a poor state manager.
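For clarity, a minimal numerical sketch of the softmax mapping from action-value estimates to long-only portfolio weights follows; the surrounding network layers are omitted.

```python
import numpy as np

def softmax_weights(q_values):
    """Map action-value estimates to long-only portfolio weights.

    A numerically stable softmax; by construction the output is
    non-negative and sums to one, as required by the portfolio constraint.
    """
    z = np.asarray(q_values, dtype=float)
    z = z - z.max()                 # stabilise the exponentials
    w = np.exp(z)
    return w / w.sum()

# Example: three-asset action values -> valid portfolio vector
w = softmax_weights([0.2, -0.1, 0.4])
assert np.isclose(w.sum(), 1.0) and (w >= 0).all()
```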
Algorithm 4: Deep Soft Recurrent Q-Learning.
inputs: trading universe of M assets; initial portfolio vector w_1 = a_1; initial asset prices p_1 = o_1; objective function J; initial agent weights θ
output: optimal agent parameters θ*
repeat
    for t = 1, 2, ... T do
        observe tuple ⟨o_t, r_t⟩
        calculate TD error δ_{t+1}                                      // (6.11)
        calculate gradients ∇_{θ_i} L(θ_i) = δ_{t+1} ∇_{θ_i} q(s, a; θ)  // BPTT
        update agent parameters θ using adaptive gradient optimizers     // ADAM
        get estimate of value function q_t ≈ NN(ρ_{t−T→t})               // (6.11)
        take action a_t = softmax(q_t)                                    // portfolio rebalance
    end
until convergence
set θ* ← θ

Evaluation

Figure 6.7 illustrates the results obtained on a small-scale experiment with 12 assets from the S&P 500 market using the DSRQN. The agent is trained on historic data between 2000-2005 for 5000 episodes, and tested on 2005-2018. The performance of the agent at different episodes e is highlighted. For e = 1 and e = 10, the agent acted almost completely randomly, leading to poor performance. For e = 100 and e = 1000, the agent learned increasingly profitable strategies, in contrast to the model-based agents, which were highly impacted by the 2008 market crash, while the DSRQN was almost unaffected (e.g., 85.6% maximum drawdown; cf. Table 6.1).

In Figure 6.6, we illustrate the out-of-sample cumulative return of the DSRQN agent per training episode, which flattens (saturates) after a few thousand episodes.
Weaknesses

Despite the improved performance of the DSRQN compared to the model-based agents, its architecture has severe weaknesses.

Firstly, the selection of the policy (e.g., the softmax layer) is a manual process that can only be verified via empirical means, for example, cross-validation. This complicates the training process, without guaranteeing any global optimality of the selected policy.
Figure 6.5: Deep Soft Recurrent Q-Network (DSRQN) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are passed through two 2D-convolution layers, which generate a feature map that is, in turn, processed by the GRU state manager. The agent state produced is combined (via matrix flattening and vector concatenation) with the past action (i.e., current portfolio positions) to estimate the action-values q_1, q_2, ..., q_M. The action values are used both for calculating the TD error (6.11), which appears in the gradient calculation, and for determining the agent's actions, after being passed through a softmax activation layer.

Figure 6.6: Out-of-sample cumulative returns per episode during the training phase for DSRQN. Performance improvement saturates after a few thousand episodes. (Left) Neural network optimized with the adaptive optimization algorithm ADAM (Kingma and Ba, 2014). (Right) Neural network optimized with Stochastic Gradient Descent (SGD) (Mandic, 2004).
Expected Properties 6.4 (End-to-End Differentiable Architecture)
The agent policy should be part of the trainable architecture, so that it adapts to a (locally) optimal strategy via gradient optimization during training.
Secondly, the DSRQN is a Many-Input-Many-Output (MIMO) model, whose number of parameters, and hence its training complexity, grows polynomially as a function of the universe size (i.e., the number of assets M). Moreover, under this setting, the learned strategy is universe-specific, which means that the same trained network does not generalize to other universes. It even fails to work on permutations of the original universe; for example, if we interchange the order of the assets in the processed observation ŝ_t after training, then the DSRQN breaks down.

Figure 6.7: Deep Soft Recurrent Q-Network (DSRQN) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 10, 100, 1000}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).

Expected Properties 6.5 (Linear Scaling)
Model should scale linearly (i.e., in computation and memory) with respect to the universe size.
Expected Properties 6.6 (Universal Architecture)
Model should be universe-agnostic and replicate its strategy regardless of the underlying assets.
In order to address the first weakness of the DSRQN (i.e., the manual selection of the policy), we consider policy gradient algorithms, which directly address the learning of an agent policy, without intermediate action-value approximations, resulting in an end-to-end differentiable model.
Policy Gradient Theorem
In Section 4.2.2 we defined the policy of an agent, π, as:

$$\pi : \mathcal{S} \rightarrow \mathcal{A} \qquad (4.14)$$

In policy gradient algorithms, we parametrize the policy with parameters θ, as π_θ, and optimize them according to a long-term objective function J, such as the average reward per time-step, given by:

$$J(\theta) \triangleq \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, R_s^a \qquad (6.15)$$

Note that any differentiable parametrization of the policy is valid (e.g., a neural network or a linear model). Moreover, the freedom of choosing the reward generating function, R_s^a, is still available.

In order to optimize the parameters θ, the gradient ∇_θ J should be calculated at each iteration. Firstly, we consider a one-step Markov Decision Process:

$$J(\theta) = \mathbb{E}_{\pi_\theta}[R_s^a] \overset{(6.15)}{=} \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, R_s^a \qquad (6.16)$$

$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s, a)\, R_s^a \\
&= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}\, R_s^a \\
&= \sum_{s \in \mathcal{S}} P^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a)\, \nabla_\theta \log[\pi_\theta(s, a)]\, R_s^a \\
&= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, R_s^a\big]
\end{aligned} \qquad (6.17)$$

The policy gradient calculation is extended to multi-step MDPs by replacing the instantaneous reward R_s^a with the long-term (action) value q_π(s, a).

Theorem 6.2 (Policy Gradient Theorem) For any differentiable policy π_θ(s, a), and for J the average discounted future reward per step, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, q_\pi(s, a)\big] \qquad (6.18)$$

The proof is provided by Sutton et al. (2000b). The theorem also applies to continuous settings (i.e., infinite MDPs) (Sutton and Barto, 1998), where the summations are replaced by integrals.
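As a numerical illustration of the log-derivative trick underlying (6.17)-(6.18), the following sketch compares the exact gradient of a one-step objective with its score-function (Monte-Carlo) estimate, for a softmax policy over two actions with fixed, hypothetical rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 2.0])                  # fixed rewards of two actions (illustrative)
theta = np.array([0.3, -0.1])             # softmax policy parameters

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Exact gradient of J(theta) = E_pi[R_a] for the softmax policy
p = pi(theta)
exact = p * (R - p @ R)                   # d/dtheta_i of sum_a pi_a R_a

# Score-function (Monte-Carlo) estimate: E[ grad log pi(a) * R_a ], cf. (6.17)
N = 200_000
actions = rng.choice(2, size=N, p=p)
grad_log = np.eye(2)[actions] - p         # grad_theta log softmax at sampled actions
estimate = (grad_log * R[actions, None]).mean(axis=0)
print(exact, estimate)                    # the two should closely agree
```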
Related Work

Policy gradient methods have gained momentum over the past years due to their better convergence properties (Sutton and Barto, 1998), their effectiveness in high-dimensional or continuous action spaces, and their natural fit to stochastic environments (Silver, 2015e). Apart from their extensive application in robotics (Smart and Kaelbling, 2002; Kohl and Stone, 2004; Kober and Peters, 2009), policy gradient methods have also been used in financial markets. Necchi (2016) develops a general framework for training policy gradient agents to solve the asset allocation task, but only successful back-tests on synthetic market data are provided.
Agent Model
From equation (6.18), we note two challenges:
1. calculation of an expectation over the (stochastic) policy, E_{π_θ}, leading to integration over unknown quantities;
2. estimation of the unknown action-value, q_π(s, a).

The simplest successful algorithm to address both of these challenges is Monte-Carlo Policy Gradient, also known as REINFORCE. In particular, different trajectories are generated following the policy π_θ, which are then used to estimate the expectation and the discounted future rewards; the latter are an unbiased estimate of q_π(s, a) (the empirical mean is an unbiased estimate of the expected value (Mandic, 2018b)). Hence we obtain the Monte-Carlo estimates:

$$q_\pi(s, a) \approx G_t \triangleq \sum_{i=1}^{t} r_i \qquad (6.19)$$

$$\mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log[\pi_\theta(s, a)]\, q_\pi(s, a)\big] \approx \frac{1}{T} \sum_{i=1}^{T} \nabla_\theta \log[\pi_\theta(s_i, a_i)]\, G_i \qquad (6.20)$$

where the gradient of the log term, ∇_θ log[π_θ(s, a)], is obtained by Backpropagation Through Time (Werbos, 1990).

We choose to parametrize the policy using a neural network architecture, illustrated in Figure 6.8. The configuration is very similar to that of the DSRQN, but the network outputs the agent action directly, without intermediate action-value estimates.
Algorithm 5: Monte-Carlo Policy Gradient (REINFORCE).
inputs: trading universe of M assets; initial portfolio vector w_1 = a_1; initial asset prices p_1 = o_1; objective function J; initial agent weights θ
output: optimal agent policy parameters θ*
initialize buffers: G, Δθ_c ← 0
repeat
    for t = 1, 2, ... T do
        observe tuple ⟨o_t, r_t⟩
        sample and take action: a_t ∼ π_θ(·|s_t; θ)                 // portfolio rebalance
        cache rewards: G ← G + r_t                                   // (6.19)
        cache log gradients: Δθ_c ← Δθ_c + ∇_θ log[π_θ(s, a)] G      // (6.20)
    end
    update policy parameters θ using the buffered Monte-Carlo estimates via adaptive optimization   // (6.18), ADAM
    empty buffers: G, Δθ_c ← 0
until convergence
set θ* ← θ
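For illustration, a self-contained toy version of Algorithm 5 is sketched below. It uses a state-independent softmax policy that selects a single asset per step from synthetic returns; this discrete, stateless setting is a deliberate simplification of the network-based agent described above, and the mean-return values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy REINFORCE sketch: a state-independent softmax policy picks one of M
# assets per step; synthetic Gaussian returns stand in for the market.
M, T, episodes, lr, theta = 3, 50, 2000, 0.05, np.zeros(3)
true_mu = np.array([0.001, 0.004, -0.002])       # hypothetical mean daily returns

for _ in range(episodes):
    grads, rewards = [], []
    for _ in range(T):
        p = softmax(theta)
        a = rng.choice(M, p=p)
        r = rng.normal(true_mu[a], 0.01)         # reward = realized return
        grads.append(np.eye(M)[a] - p)           # grad_theta log pi(a)
        rewards.append(r)
    G = np.cumsum(rewards)                       # cumulative reward buffer, as in (6.19)
    # Monte-Carlo policy gradient step, cf. (6.20)
    theta += lr * np.mean([g * G[t] for t, g in enumerate(grads)], axis=0)

print(softmax(theta))   # policy mass shifts towards the highest-mean asset
```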
Evaluation

In Figure 6.10, we present the results from an experiment performed on a small universe comprising 12 assets from the S&P 500 market, using the REINFORCE agent. The agent is trained on historic data between 2000-2005 for 5000 episodes, and tested on 2005-2018. Similar to the DSRQN, at early episodes the REINFORCE agent performs poorly, but it learns a profitable strategy after a few thousand episodes.

Figure 6.8: Monte-Carlo Policy Gradient (REINFORCE) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are passed through two 2D-convolution layers, which generate a feature map that is, in turn, processed by the GRU state manager. The agent state produced and the past action (i.e., current portfolio positions) are non-linearly regressed and exponentially normalized by the affine and the softmax layer, respectively, to generate the agent actions.

Figure 6.9: Out-of-sample cumulative returns per episode during the training phase for REINFORCE. Performance improvement saturates after a few thousand episodes. (Left) Neural network optimized with ADAM. (Right) Neural network optimized with Stochastic Gradient Descent (SGD).

In Figure 6.9, we also illustrate the out-of-sample cumulative return of the REINFORCE agent per training episode, which flattens as training saturates.

Weaknesses
Policy gradient addressed only the end-to-end differentiability weakness of the DSRQN architecture, leading to significant improvements. Nonetheless, the polynomial scaling and the universe-specific nature of the model still restrict the applicability and generalization of the learned strategies. Moreover, the intractability of the policy gradient calculation given by (6.18) leads to the compromise of using Monte-Carlo estimates obtained by running numerous simulations, increasing the number of episodes required for convergence. Most importantly, the empirical estimate of the state action-value q_{π_θ}(s, a) in (6.19) has high variance (see the returns in Figure 6.9) (Sutton and Barto, 1998).

Figure 6.10: Monte-Carlo Policy Gradient (REINFORCE) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 100, 1000, 7500}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).

Expected Properties 6.7 (Low Variance Estimators)
Model should rely on low-variance estimators.
In this subsection, we introduce the
Mixture of Score Machines (MSM) model, with the aim of providing a universal model that reduces the agent model complexity and generalizes strategies across assets, regardless of the training universe. These properties are obtained by virtue of the principles of parameter sharing (Bengio et al., 2003) and transfer learning (Pan and Yang, 2010a).

Related Work
Jiang et al. (2017) suggested a universal policy gradient architecture, the
Ensemble of Identical Independent Evaluators (EIIE), which significantly reduces the model complexity and hence enables larger-scale applications. Nonetheless, it operates only on independent (e.g., uncorrelated) time-series, which is a highly unrealistic assumption for real financial markets.
Agent Model
As a generalization of the universal model of Jiang et al. (2017), we introduce the Score Machine (SM), an estimator of statistical moments of stochastic multivariate time-series:

• A First-Order Score Machine, SM(1), operates on univariate time-series, generating a score that summarizes the characteristics of the location parameters of the time-series (e.g., mode, median, mean). An M-component multivariate series has (M choose 1) = M first-order scores, one for each component;

• A Second-Order Score Machine, SM(2), operates on bivariate time-series, generating a score that summarizes the characteristics of the dispersion parameters of the joint series (e.g., covariance, modal dispersion (Meucci, 2009)). An M-component multivariate series has (M choose 2) = M!/(2!(M−2)!) second-order scores, one for each distinct pair;

• An N-th Order Score Machine, SM(N), operates on N-component multivariate series and extracts information about the N-th order statistics (i.e., statistical moments) of the joint series. An M-component multivariate series, for M ≥ N, has (M choose N) = M!/(N!(M−N)!) N-th order scores, one for each distinct N-combination of components.

Note that the extracted scores are not necessarily equal to the statistical (central) moments of the series, but rather a compressed and informative representation of the statistics of the time-series. The universality of the score machine is based on parameter sharing (Bengio et al., 2003) across assets.

Figure 6.11 illustrates the first- and second-order score machines, where the transformations are approximated by neural networks. Higher-order score machines can be used in order to capture higher-order moments.

Figure 6.11: Score Machine (SM) neural network architecture. Convolutional layers followed by non-linearities (e.g., ReLU) and Max-Pooling (Giusti et al., 2013) construct a feature map, which is selectively stored and filtered by the Gated Recurrent Unit (GRU) layer. Finally, a linear layer combines the GRU output components into a single scalar value, the score. (Left) First-Order Score Machine SM(1); given a univariate time-series x_t of T samples, it produces a scalar score value s(1)_t. (Right) Second-Order Score Machine SM(2); given a bivariate time-series x_{i&j,t}, with components x_{i,t} and x_{j,t}, of T samples each, it generates a scalar value s(2)_t.

By combining the score machines, we construct the Mixture of Score Machines (MSM), whose architecture is illustrated in Figure 6.12. We identify three main building blocks:
1. SM(1): a first-order score machine that processes all single-asset log returns, generating the first-order scores;
2. SM(2): a second-order score machine that processes all pairs of asset log-returns, generating the second-order scores;
3. Mixture Network: an aggregation mechanism that accesses the scores from SM(1) and SM(2) and infers the action-values.

Figure 6.12: Mixture of Score Machines (MSM) architecture. The historic log returns ρ_{t−T→t} ∈ R^{M×T} are processed by the score machines SM(1) and SM(2), which assign scores to each asset (v(1)_t) and to each pair of assets (v(2)_t), respectively. The scores are concatenated and passed to the mixture network, which combines them with the past action (i.e., current portfolio vector) and the generated agent state (i.e., state manager hidden state) to determine the next optimal action.

We emphasize that there is only one SM for each order, shared across the network; hence, during backpropagation the gradient of the loss function with respect to the parameters of each SM is given by the sum over all the paths that contribute to the loss (Goodfellow et al., 2016) and pass through that particular SM. The mixture network, inspired by the Mixtures of Expert Networks of Jacobs et al. (1991), gathers all the extracted information from the first- and second-order statistical moments and combines it with the past action (i.e., current portfolio vector) and the generated agent state (i.e., state manager hidden state) to determine the next optimal action.

Neural networks, and especially deep architectures, are very data-hungry (Murphy, 2012), requiring thousands (or even millions) of data points to converge to meaningful strategies. Using daily market data (i.e., daily prices), only about 252 data points are collected every year, which means that very few samples are available for training. Thanks to parameter sharing, nonetheless, the SM networks are trained on orders of magnitude more data: for example, for a 12-asset universe with a 5-year history, SM(1) is trained on 12 univariate series and SM(2) on (12 choose 2) = 66 bivariate series drawn from the same dataset.

While the score machines scale to any universe size without modification, by stacking more copies of the same machine with shared parameters, the mixture network is universe-specific. As a result, a different mixture network has to be trained for different universes. For an M-asset market, the mixture network has the interface:

$$N^{\text{inputs}}_{\text{mixture-network}} = M + \binom{M}{2}, \qquad N^{\text{outputs}}_{\text{mixture-network}} = M \qquad (6.21)$$

Consequently, selecting a different number of assets would break the interface of the mixture network. Nonetheless, the score machines can be trained with different mixture networks; hence, when a new universe of assets is given, we freeze the training of the score machines and train only the mixture network. This operation is cheap, since the mixture network comprises only a small fraction of the total number of trainable parameters of the MSM.

In practice, the score machines are trained on large historic datasets and kept fixed, while transfer learning is performed on the mixture network. The score machines can therefore be viewed as rich, universal feature extractors, while the mixture network is the small (i.e., in size and capacity) mechanism that maps from the abstract space of scores to the feasible action space, capable of preserving asset-specific information as well.
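A minimal sketch of the parameter-sharing wiring is given below; the placeholder scorers stand in for the CONV/GRU networks of Figure 6.11, and only the fan-out over assets and asset pairs, and the resulting interface size (6.21), is illustrated.

```python
import numpy as np
from itertools import combinations

# Sketch of the MSM scoring pass under parameter sharing: ONE first-order
# scorer is applied to every asset and ONE second-order scorer to every
# asset pair. The scorers below are simple placeholders, not the thesis's
# CONV/GRU networks; only the wiring is illustrated.
def sm1(x):                       # x: (T,) univariate window -> scalar score
    return np.tanh(x.mean() / (x.std() + 1e-8))

def sm2(xi, xj):                  # two aligned (T,) windows -> scalar score
    return np.tanh(np.corrcoef(xi, xj)[0, 1])

def msm_scores(log_returns):      # log_returns: (M, T) look-back window
    M = log_returns.shape[0]
    first = np.array([sm1(log_returns[i]) for i in range(M)])
    second = np.array([sm2(log_returns[i], log_returns[j])
                       for i, j in combinations(range(M), 2)])
    # The mixture network receives M + C(M, 2) scores, cf. (6.21).
    return np.concatenate([first, second])

scores = msm_scores(np.random.default_rng(0).normal(0, 0.01, size=(12, 60)))
print(scores.shape)               # (12 + 66,) = (78,)
```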
Evaluation

Figure 6.14 shows the results from an experiment performed on a small universe comprising 12 assets from the S&P 500 market, using the MSM agent. The agent is trained on historic data between 2000-2005 for 10000 episodes, and tested on 2005-2018. Conforming with our analysis, the agent underperformed early in training, but after 9000 episodes it became profitable and its performance saturated after 10000 episodes, with total cumulative returns of 283.9% and a 68.5% maximum drawdown. It scored slightly worse than the REINFORCE agent, but in Part III it is shown that in larger-scale experiments the MSM is both more effective and more efficient (i.e., computationally and memory-wise).

Weaknesses
The Mixture of Score Machines (MSM) is an architecture that addresses the weaknesses of all the aforementioned approaches (i.e., VAR, LSTM, DSRQN and REINFORCE). However, it could be improved by incorporating the expected properties 6.8 and 6.9:
Expected Properties 6.8 (Short Sales)
Model should also be able to output negative portfolio weights, corresponding to short selling.
Expected Properties 6.9 (Optimality Guarantees)
Explore possible re-interpretations of the framework, which would allow a proof of optimality (if applicable).
Figure 6.13: Out-of-sample cumulative returns per episode during the training phase for MSM. Performance improvement saturates after approximately 10000 episodes. (Left) Neural network optimized with the adaptive optimization algorithm ADAM. (Right) Neural network optimized with Stochastic Gradient Descent (SGD).

Figure 6.14: Mixture of Score Machines (MSM) model-free reinforcement learning agent on a 12-asset universe, trained on historic data between 2000-2005 and tested onward, for different numbers of episodes e = {1, 1000, 5000, 10000}. Visualization of cumulative rewards and (maximum) drawdown of the learned strategy, against the S&P 500 index (traded as SPY).
Trading Agents Comparison Matrix: 12 assets of S&P 500

                               Model-Based          Model-Free
                               VAR      RNN      DSRQN   REINFORCE   MSM
Metrics
  Cumulative Returns (%)       185.1    221.0    256.7   325.9       283.9
  Sharpe Ratio (SNR)           1.53     1.62     2.40    3.02        2.72
  Max Drawdown (%)             179.6    198.4    85.6    63.5        68.5
Expected Properties
  Non-Stationary Dynamics      ×        ✓        ✓       ✓           ✓
  Long-Term Memory             ×        ✓        ✓       ✓           ✓
  Non-Linear Model             ×        ✓        ✓       ✓           ✓
  End-to-End                   ×        ×        ×       ✓           ✓
  Linear Scaling               ×        ×        ×       ×           ✓
  Universality                 ×        ×        ×       ×           ✓
  Low Variance Estimators      ×        ×        ×       ×           ×
  Short Sales                  ✓        ✓        ×       ×           ×

Table 6.1: Comprehensive comparison of evaluation metrics and weaknesses (i.e., expected properties) of the trading algorithms addressed in this chapter. Model-based agents (i.e., VAR and RNN) underperform, while the best-performing agent is the REINFORCE. As desired, the MSM agent also scores well above the index (i.e., the baseline) and satisfies most of the desired properties.

Chapter 7
Pre-Training
In Chapter 6, model-based and model-free reinforcement learning agents were introduced to address the asset allocation task. It was demonstrated (see comparison Table 6.1) that model-based agents (i.e., VAR and RNN) and value-based model-free agents (i.e., DSRQN) are outperformed by the policy gradient agents (i.e., REINFORCE and MSM). However, policy gradient algorithms usually converge to local optima (Sutton and Barto, 1998).

Inspired by the approach taken by the authors of the original DeepMind AlphaGo paper (Silver and Hassabis, 2016), the local optimality of the policy gradient agents is addressed by pre-training the policies to replicate the strategies of baseline models. It is shown that any one-step optimization method discussed in Chapter 3 that reduces to a quadratic program can be reproduced by the policy gradient networks (i.e., REINFORCE and MSM), when the networks are trained to approximate the quadratic program solution.

Because of the highly non-convex policy search space (Szepesvári, 2010), randomly initialised agents (i.e., agnostic agents) tend either to get stuck in vastly sub-optimal local minima or to need many more episodes and samples to converge to meaningful strategies. Therefore, the limited number of available samples (e.g., 10 years of market data is equivalent to approximately 2500 samples) motivates pre-training, which is expected to improve convergence speed and performance, assuming that the baseline model is sub-optimal but a proxy to the optimal strategy. Moreover, the pre-trained models can be viewed as priors on the policies, which episodic training with reinforcement learning then steers towards updated strategies in a data-driven and data-efficient manner, as in the context of Bayesian inference.
In Chapter 3, the traditional one-step (i.e., static) portfolio optimization methods were described, derived from the Markowitz model (Markowitz, 1952). Despite the assumptions of covariance stationarity (i.e., time-invariance of the first and second statistical moments) and the myopic nature of those methods, they usually form the basis of other, more complicated and effective strategies. As a result, we attempt to replicate those strategies with the REINFORCE (see Subsection 6.2.2) and the Mixture of Score Machines (MSM) (see Subsection 6.2.3) agents. In both cases, the architectures of the agents (i.e., the underlying neural networks) are treated as black boxes, represented by a set of parameters which, thanks to their end-to-end differentiability, can be trained via backpropagation. Figure 7.1 summarizes the pipeline used to (pre-)train the policy gradient agents.
Figure 7.1: Interfacing with policy gradient agents as black boxes, with inputs (1) the historic log returns ρ_{t−T→t} and (2) the past action (i.e., current portfolio vector) a_t, and output the next agent action a_{t+1}. The black box is parametrized by θ, which can be updated and optimized.

The one-step optimal portfolio for given commission rates (i.e., transaction cost coefficient β) and hyperparameters (e.g., risk-aversion coefficient) is obtained by solving the optimization task in (3.13) or (3.14) via quadratic programming. The Sharpe Ratio with transaction costs objective function is selected as the baseline for pre-training, since it has no hyperparameter to tune and inherently balances the profit-risk trade-off.

Without being explicitly given the mean vector μ, the covariance matrix Σ, and the transaction cost coefficient β, the black-box agents should be able to solve the optimization task (3.14), or equivalently:

$$\underset{a_{t+1} \in \mathcal{A}}{\text{maximize}} \quad \frac{a_{t+1}^{T} \mu - \beta\, \mathbf{1}_M^{T} |a_t - a_{t+1}|}{\sqrt{a_{t+1}^{T} \Sigma\, a_{t+1}}} \quad \text{subject to} \quad \mathbf{1}_M^{T} a_{t+1} = 1, \; a_{t+1} \succeq 0$$

Since there is a closed-form relationship connecting the black-box agents' inputs ρ_{t−T→t} and a_t with the terms in the optimization problem (3.14), N supervised pairs {(X_i, y_i)}_{i=1}^{N} are generated by solving the optimization for N distinct cases, such that:

$$X_i = [\rho_{t_i - T \rightarrow t_i},\, a_{t_i}] \qquad (7.1)$$
$$y_i = a_{t_i + 1} \qquad (7.2)$$

Interestingly, a myriad of examples (i.e., (X_i, y_i) pairs) can be produced to enrich the dataset and allow convergence. This is a very rare situation where the generating process of the data is known and can be used to produce valid samples, which respect the dynamics of the target model. The data generation process is given in Algorithm 6.

The parameters of the black-box agents are steered in the gradient direction that minimizes the
Mean Square Error (MSE) between the predicted portfolio weights, ŷ_{t_i}, and the baseline model's target portfolio weights, y_{t_i}. An L2-norm weight-decay (regularization) term is also considered to avoid overfitting, giving the loss function:

$$\mathcal{L}(\theta) = \|y_{t_i} - \hat{y}_{t_i}(\theta)\|_2^2 + \lambda \|\theta\|_2^2 \qquad (7.3)$$

Algorithm 6: Pre-training supervised dataset generation.
inputs: number of pairs to generate N; number of assets in portfolio M; look-back window size T; transaction cost coefficient β
output: dataset {(X_i, y_i)}_{i=1}^{N}
for i = 1, 2, ... N do
    sample a valid random initial portfolio vector w_{t_i}
    sample a random lower-triangular matrix L ∈ R^{M×M}                    // Cholesky decomposition
    sample randomly distributed log returns: ρ_{t_i−T→t_i} ∼ N(0, LL^T)
    calculate the empirical mean vector of the log returns: μ = E[ρ_{t_i−T→t_i}]
    calculate the empirical covariance matrix of the log returns: Σ = Cov[ρ_{t_i−T→t_i}]
    determine a_{t_i+1} by solving the quadratic program (3.14)
    set X_i = [ρ_{t_i−T→t_i}, a_{t_i}] and y_i = a_{t_i+1}
end

The parameters are adaptively optimized by ADAM (Kingma and Ba, 2014), while the gradients of the network parameters are obtained via Backpropagation Through Time (Werbos, 1990).
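A sketch of one iteration of Algorithm 6 is given below. It assumes the cvxpy library for the optimization step and, for convexity, substitutes a mean-variance objective with an L1 transaction-cost penalty (closer to the risk-aversion form (3.13)) for the Sharpe-Ratio-with-costs program (3.14); the parameter values are illustrative.

```python
import numpy as np
import cvxpy as cp

def generate_pair(M=12, T=50, beta=0.0025, risk_aversion=1.0,
                  rng=np.random.default_rng(0)):
    """One supervised (X_i, y_i) pair in the spirit of Algorithm 6."""
    w_prev = rng.dirichlet(np.ones(M))                    # current portfolio on the simplex
    L = np.tril(rng.normal(scale=0.01, size=(M, M)))      # random Cholesky factor
    L += 0.01 * np.eye(M)                                 # keep L L^T positive definite
    rho = rng.multivariate_normal(np.zeros(M), L @ L.T, size=T)   # simulated log returns (T x M)

    mu = rho.mean(axis=0)                                 # empirical mean vector
    Sigma = np.cov(rho, rowvar=False)                     # empirical covariance matrix
    Sigma = 0.5 * (Sigma + Sigma.T) + 1e-8 * np.eye(M)    # enforce symmetry / PSD

    a = cp.Variable(M)
    objective = cp.Maximize(a @ mu
                            - risk_aversion * cp.quad_form(a, Sigma)
                            - beta * cp.norm1(a - w_prev))
    cp.Problem(objective, [cp.sum(a) == 1, a >= 0]).solve()

    X = (rho, w_prev)        # network inputs: look-back window and current weights
    y = a.value              # target: baseline-optimal next portfolio vector
    return X, y
```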
Figure 7.2 depicts the learning curves, in-sample and out-of-sample, of the supervised pre-training process. Both the REINFORCE and the MSM agents converge after approximately 400 epochs (i.e., iterations).

As suggested by Figure 7.3, pre-training improves the cumulative returns and the Sharpe Ratio of the policy gradient agents by up to 21.02% and 13.61%, respectively.

Figure 7.2: Mean square error (MSE) of the Monte-Carlo Policy Gradient (REINFORCE) and Mixture of Score Machines (MSM) agents during pre-training. After approximately 150 epochs the gap between the training (in-sample) and testing (out-of-sample) errors is eliminated, and the error curve plateaus after approximately 400 epochs, when training is terminated.

Figure 7.3: Performance evaluation of trading with reinforcement learning (RL) versus reinforcement learning with pre-training (RL & PT). The Mixture of Score Machines (MSM) improves cumulative returns by 21.02% and Sharpe Ratio by 13.61%. The model-based (i.e., RNN and VAR) and the model-free value-based (i.e., DSRQN) agents are not end-to-end differentiable and hence cannot be pre-trained.

Part III

Experiments

Chapter 8

Synthetic Data
It has been shown that the trading agents of Chapter 6, and especially REINFORCE and MSM, outperform the market index (i.e., the S&P 500) when tested on a small universe of 12 assets (see Table 6.1). For rigour, the validity and effectiveness of the developed reinforcement learning agents is investigated via a series of experiments on:

• Deterministic series, including sine, sawtooth and chirp waves, as in Section 8.1;

• Simulated series, using data surrogate methods, such as AAFT, as in Section 8.2.

As expected, it is demonstrated that model-based agents (i.e., VAR and RNN) excel in deterministic environments. This is attributed to the fact that, given enough capacity, they have the predictive power to accurately forecast future values, based on which they can act optimally via planning. On the other hand, on simulated (i.e., surrogate) time-series, it is shown that model-free agents score higher, especially after the pre-training process of Chapter 7, which contributes up to a 21% improvement in Sharpe Ratio and up to a 40% reduction in the number of episodic runs.
To begin with, via interaction with the environment (i.e., paper trading), the agents construct either an explicit (i.e., model-based reinforcement learning) or an implicit (i.e., model-free reinforcement learning) model of the environment. In Section 6.1, it was demonstrated that explicit modelling of financial time-series is very challenging, due to the stochasticity of the involved time-series, and, as a result, model-based methods underperform. On the other hand, were the market series sufficiently predictable, these methods would be expected to optimally allocate the assets of the portfolio via dynamic programming and planning. In this section, we investigate the correctness of this hypothesis by generating a universe of deterministic time-series.
A set of 100 sinusoidal waves with constant parameters (i.e., amplitude, circular frequency and initial phase) is generated; example series are provided in Figure 8.1. Note the dominant performance of the model-based recurrent neural network (RNN) agent, which exploits its accurate predictions of future realizations and scores over three times better than the best-scoring model-free agent, the Mixture of Score Machines (MSM).

Figure 8.1: Synthetic universe of deterministic sinusoidal waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents (VAR, RNN, DSRQN, REINFORCE, MSM).

For illustration purposes, and in order to gain a finer insight into the learned trading strategies, a universe of only two sinusoids is generated and the RNN agent is trained on binary trading of the two assets; at each time step the agent puts all of its budget on a single asset. As shown in Figure 8.2, the RNN agent learns the theoretically optimal strategy (a numerical illustration is given below):

$$w_{i,t} = \begin{cases} 1, & \text{if } i = \arg\max\{r_t\} \\ 0, & \text{otherwise} \end{cases} \qquad (8.1)$$

or equivalently, the return of the constructed portfolio is the maximum of the single-asset returns at each time step. Note that transaction costs are not considered in this experiment; had they been, we would expect a time-shifted version of the current strategy, so that it offsets the fees.
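A short numerical check of the optimal strategy (8.1) on two toy sinusoidal return series:

```python
import numpy as np

# Two deterministic assets (phase-shifted sinusoidal returns); the optimal
# binary strategy (8.1) holds, at each step, the asset with the larger return,
# so the realized portfolio return equals the element-wise maximum.
t = np.arange(200)
r = np.stack([0.01 * np.sin(0.2 * t), 0.01 * np.sin(0.2 * t + np.pi)])  # (2, T)
holdings = np.argmax(r, axis=0)                 # which asset to hold at each step
portfolio_returns = r[holdings, np.arange(t.size)]
assert np.allclose(portfolio_returns, r.max(axis=0))
```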
Figure 8.2: Recurrent neural network (RNN) model-based reinforcement learning agent trained on binary trading between two sinusoidal waves. The triangular trading signals (i.e., BUY or SELL) refer to asset 1 (red); the opposite actions are taken for asset 2, but are not illustrated.
A set of 100 deterministic sawtooth waves is generated next, with examples illustrated in Figure 8.3. Similar to the sinusoidal waves universe, the RNN agent outperforms the rest of the agents. Interestingly, it can be observed in the cumulative returns time-series (right panel of Figure 8.3) that all strategies have a low-frequency component, which corresponds to the highest-amplitude sawtooth wave (i.e., the yellow series).

Figure 8.3: Synthetic universe of deterministic sawtooth waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents.
Last but not least, the experiment is repeated with a set of 100 deterministic chirp waves (i.e., sinusoidal waves with linearly modulated frequency). Three example series are plotted in Figure 8.4, along with the cumulative returns of each trading agent. Note that the RNN agent is only 8.28% better than the second-best agent, the MSM, compared to the much larger margins observed in the sinusoidal and sawtooth universes.

Figure 8.4: Synthetic universe of deterministic chirp waves. (Left) Example series from the universe. (Right) Cumulative returns of the reinforcement learning trading agents.
Remark 8.1
Overall, in a deterministic financial market, all trading agents learn profitable strategies and solve the asset allocation task. As expected, model-based agents, and especially the RNN, significantly outperform in the case of well-behaved, easy-to-model deterministic series (e.g., sinusoidal, sawtooth). On the other hand, in more complicated settings (e.g., the chirp-wave universe), the model-free agents perform almost as well as the model-based agents.
Having asserted the success of the reinforcement learning trading agents in deterministic universes, their effectiveness is challenged in stochastic universes in this section. Instead of randomly selecting families of stochastic processes and corresponding parameters for them, real market data is used to learn the parameters of candidate generating processes that explain the data. The purpose of this approach is two-fold:
1. there is no need for hyperparameter tuning;
2. the training dataset is expanded, via data augmentation, giving the agents the opportunity to gain more experience and further explore the joint state-action space.

It is worth highlighting that data augmentation improves overall performance, especially when strategies learned in the simulated environment are transferred to and polished on real market data, via Transfer Learning (Pan and Yang, 2010b).

The simulated universe is generated using surrogates with random Fourier phases (Raeth and Monetti, 2009). In particular, the
Amplitude Adjusted Fourier Transform (AAFT) method (Prichard and Theiler, 1994) is used, as explained in Algorithm 7. Given a real univariate time-series, the AAFT algorithm operates in the Fourier (i.e., frequency) domain, where it preserves the amplitude spectrum of the series but randomizes the phase, leading to a new realized signal.

AAFT can be explained by the Wiener-Khinchin-Einstein Theorem (Cohen, 1998), which states that the autocorrelation function of a wide-sense-stationary random process has a spectral decomposition given by the power spectrum of that process. In other words, the first- and second-order statistical moments (i.e., due to autocorrelation) of the signal are encoded in its power spectrum, which depends purely on the amplitude spectrum. Consequently, the randomization of the phase does not affect the first- and second-order moments of the series, and hence the surrogates share the statistical properties of the original signal.

Since the original time-series (i.e., asset returns) are real-valued signals, their Fourier Transform after randomization of the phase should preserve conjugate symmetry, or equivalently, the randomly generated phase component should be an odd function of frequency. Then the Inverse Fourier Transform (IFT) returns real-valued surrogates.
Algorithm 7: Amplitude Adjusted Fourier Transform (AAFT).
inputs: M-variate original time-series X
output: M-variate synthetic time-series X̂
for i = 1, 2, ... M do
    calculate the Fourier Transform of the univariate series F[X_{:i}]
    randomize the phase component                                   // preserve odd symmetry of the phase
    calculate the Inverse Fourier Transform of the unchanged amplitude and randomized phase, giving X̂_{:i}
end

Remark 8.2 Importantly, the AAFT algorithm works on univariate series; therefore, the first two statistical moments of each single asset are preserved, but the cross-asset dependencies (i.e., cross-correlation, covariance) are modified due to the data augmentation.
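A minimal phase-randomization surrogate in Python is sketched below; it preserves the amplitude spectrum and randomizes the phase, while the amplitude-adjustment (rank remapping) step of the full AAFT is omitted for brevity.

```python
import numpy as np

def phase_randomized_surrogate(x, rng=np.random.default_rng()):
    """Fourier surrogate of a real univariate series.

    Keeps the amplitude spectrum (hence the power spectrum and, by the
    Wiener-Khinchin theorem, the autocorrelation) and randomizes the phase.
    Conjugate symmetry is handled implicitly by irfft.
    """
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=X.shape)
    phases[0] = 0.0                      # keep the DC component real
    if x.size % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist component real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=x.size)

returns = np.random.default_rng(1).normal(0, 0.01, size=1260)   # toy daily returns
fake = phase_randomized_surrogate(returns)
assert np.allclose(np.abs(np.fft.rfft(fake)), np.abs(np.fft.rfft(returns)), atol=1e-8)
```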
Operating on the same 12-asset universe used in the experiments of Chapter 6, examples of AAFT surrogates are given in Figure 8.5, along with the cumulative returns of each trading agent on this simulated universe. As expected, the model-free agents outperform the model-based agents, corroborating the results obtained in Chapter 6.

Figure 8.5: Synthetic, simulated universe of 12 assets from the S&P 500 via the Amplitude Adjusted Fourier Transform (AAFT). (Left) Example series from the universe (AAPL, GE, BA). (Right) Cumulative returns of the reinforcement learning trading agents.

Chapter 9
Market Data
Having verified the applicability of the trading agents in synthetic environments (i.e., deterministic and stochastic) in Chapter 8, their effectiveness is challenged in real financial markets, namely on the underlying stocks of the Standard & Poor's 500 (Investopedia, 2018f) and the EURO STOXX 50 (Investopedia, 2018a) indices. In detail, in this chapter:

• candidate reward generating functions are explored, in Section 9.1;

• paper trading experiments are carried out on the most liquid U.S. and European assets (see Sufficient Liquidity Assumption 5.1), in Sections 9.2 and 9.3, respectively;

• comparison matrices and insights into the learned agent strategies are obtained.
Reinforcement learning relies fundamentally on the hypothesis that the goal of the agent can be fully described by the maximization of cumulative reward over time, as suggested by the Reward Hypothesis 4.1. Consequently, the selection of the reward generating function can significantly affect the learned strategies, and hence the performance of the agents. Motivated by the returns-risk trade-off arising in investments (see Chapter 3), two reward functions are implemented and tested: the log returns and the Differential Sharpe Ratio (Moody et al., 1998).
The agent at time step t observes the asset prices o_t ≡ p_t and computes the log returns, given by:

$$\rho_t = \log(p_t \oslash p_{t-1}) \qquad (2.12)$$

where ⊘ designates element-wise division and the log function is also applied element-wise, or equivalently:

$$\rho_t \triangleq \begin{bmatrix} \rho_{1,t} \\ \rho_{2,t} \\ \vdots \\ \rho_{M,t} \end{bmatrix} = \begin{bmatrix} \log(p_{1,t}/p_{1,t-1}) \\ \log(p_{2,t}/p_{2,t-1}) \\ \vdots \\ \log(p_{M,t}/p_{M,t-1}) \end{bmatrix} \in \mathbb{R}^M \qquad (9.1)$$

Using the one-step log returns as the reward results in the multi-step maximization of cumulative log returns, which focuses only on profit, without considering any risk (i.e., variance) metric. Therefore, agents are expected to be highly volatile when trained with this reward function.

In Section 2.3, the Sharpe Ratio (Sharpe and Sharpe, 1970) was introduced, motivated by the Signal-to-Noise Ratio (SNR), and is given by:

$$SR_t \triangleq \sqrt{T}\,\frac{\mathbb{E}[r_t]}{\sqrt{\mathrm{Var}[r_t]}} \in \mathbb{R} \qquad (2.35)$$

where T is the number of samples considered in the calculation of the empirical mean and standard deviation. Therefore, empirical estimates of the mean and the variance of the portfolio returns are used in the calculation, making the Sharpe Ratio an inappropriate metric for online (i.e., adaptive) episodic learning. Nonetheless, the Differential Sharpe Ratio (DSR), introduced by Moody et al. (1998), is a suitable reward function. The DSR is obtained by:
1. considering exponential moving averages of the returns and of the standard deviation of returns in (2.35);
2. expanding to first order in the decay rate η:

$$SR_t \approx SR_{t-1} + \eta\,\frac{\partial SR_t}{\partial \eta}\bigg|_{\eta=0} + O(\eta^2) \qquad (9.2)$$

Noting that only the first-order term in the expansion (9.2) depends upon the return, r_t, at time step t, the differential Sharpe Ratio, D_t, is defined as:

$$D_t \triangleq \frac{\partial SR_t}{\partial \eta} = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}} \qquad (9.3)$$

where A_t and B_t are exponential moving estimates of the first and second moments of r_t, respectively, given by:

$$A_t = A_{t-1} + \eta\,\Delta A_t = A_{t-1} + \eta\,(r_t - A_{t-1}) \qquad (9.4)$$
$$B_t = B_{t-1} + \eta\,\Delta B_t = B_{t-1} + \eta\,(r_t^2 - B_{t-1}) \qquad (9.5)$$

Using the differential Sharpe Ratio as the reward results in the multi-step maximization of the Sharpe Ratio, which balances risk and profit, and hence is expected to lead to better strategies than the log returns.
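For concreteness, an online computation of the Differential Sharpe Ratio rewards, following (9.3)-(9.5), is sketched below with illustrative parameter values.

```python
import numpy as np

def differential_sharpe_ratio(returns, eta=0.01, eps=1e-12):
    """Online Differential Sharpe Ratio rewards, following (9.3)-(9.5).

    `returns` is a 1-D array of realized portfolio returns r_t; `eta` is the
    decay rate of the exponential moving moment estimates. `eps` guards the
    first steps, where the moving estimates are still zero.
    """
    A, B = 0.0, 0.0                       # moving estimates of E[r] and E[r^2]
    rewards = []
    for r in returns:
        dA, dB = r - A, r**2 - B
        denom = (B - A**2) ** 1.5
        D = (B * dA - 0.5 * A * dB) / (denom + eps)   # (9.3), using previous A, B
        rewards.append(D)
        A, B = A + eta * dA, B + eta * dB             # (9.4), (9.5)
    return np.asarray(rewards)

r = np.random.default_rng(0).normal(5e-4, 1e-2, size=252)    # toy daily returns
print(differential_sharpe_ratio(r)[:5])
```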
Publicly traded companies are usually also compared in terms of their
Market Value or Market Capitalization (Market Cap), given by multiplying the number of their outstanding shares by the current share price (Investopedia, 2018d), or equivalently:

Market Cap_{asset i} = Volume_{asset i} × Share Price_{asset i}   (9.6)

The Standard & Poor's 500 Index (S&P 500) is a market-capitalization-weighted index of the 500 largest U.S. publicly traded companies by market value (Investopedia, 2018d). According to the Capital Asset Pricing Model (CAPM) (Luenberger, 1997) and the Efficient Market Hypothesis (EMH) (Fama, 1970), the market index, the S&P 500, is efficient, and portfolios derived from its constituent assets cannot perform better (in the context of Section 3.1.2). Nonetheless, CAPM and EMH are not exactly satisfied, and trading opportunities can be exploited via proper strategies.

In order to compare the different trading agents introduced in Chapter 6, as well as the variants in Chapter 7 and Section 8.2, all agents are trained on the constituents of the S&P 500 (i.e., 500 U.S. assets) and their performance results are provided in Figure 9.1 and Table 9.1. As expected, the differential Sharpe Ratio (DSR) is more stable than log returns, yielding higher Sharpe Ratio strategies, up to 2.77 for the pre-trained and experience-transferred Mixture of Score Machines (MSM) agent.

Figure 9.1: Comparison of reinforcement learning trading agents on cumulative returns and Sharpe Ratio, trained with: (RL) Reinforcement Learning; (RL & PT) Reinforcement Learning and Pre-Training; (RL & PT & TL) Reinforcement Learning, Pre-Training and Transfer Learning from simulated data.
Remark 9.1
The simulations confirm the superiority of the universal model-free reinforcement learning agent, the Mixture of Score Machines (MSM), in asset allocation, with a considerable achieved performance gain in both cumulative returns and Sharpe Ratio, compared to the most recent models (Jiang et al., 2017) on the same universe.
Similar to the S&P 500, the EURO STOXX 50 (SX5E) is a benchmark for the 50 largest publicly traded companies by market value in the countries of the Eurozone.

Table 9.1: Trading Agents Comparison Matrix on the S&P 500: cumulative returns (%) and Sharpe Ratio under the two reward generating functions (differential Sharpe Ratio and log returns), for the SPY benchmark, the VAR, RNN, DSRQN, REINFORCE and MSM agents, and the pre-trained (PT) and transfer-learned (PT & TL) variants of REINFORCE and MSM.

A universal baseline agent is developed, based on the Sharpe Ratio with transaction costs extension of the Markowitz model (see optimization problem (3.14)) from Section 3.3. A Sequential Markowitz Model (SMM) agent is thus derived by iteratively applying the one-step optimization program solver at each time step t. The Markowitz model is obviously a universal portfolio optimizer, since it does not make assumptions about the universe (i.e., the underlying assets) it is applied to.

Given the EURO STOXX 50 market, transfer learning is performed for the
MSM agent trained on the S&P 500 (i.e., only the Mixture network is replaced and trained, while the parameters of the Score Machine networks are frozen). Figure 9.2 illustrates the cumulative returns of the market index (SX5E), the Sequential Markowitz Model (SMM) agent, and the Mixture of Score Machines (MSM) agent.
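A minimal PyTorch-style sketch of this transfer step is given below; the module names (`score_machines`, `mixture`) and layer sizes are illustrative placeholders, not the exact architecture used in this work.

```python
import torch
from torch import nn

# Hypothetical MSM skeleton: `score_machines` holds the shared SM(1)/SM(2)
# networks, `mixture` is the universe-specific head (sizes are illustrative).
class MSM(nn.Module):
    def __init__(self, n_assets: int, hidden: int = 32):
        super().__init__()
        n_scores = n_assets + n_assets * (n_assets - 1) // 2
        self.score_machines = nn.ModuleDict({
            "sm1": nn.GRU(input_size=1, hidden_size=hidden, batch_first=True),
            "sm2": nn.GRU(input_size=2, hidden_size=hidden, batch_first=True),
        })
        self.mixture = nn.Sequential(nn.Linear(n_scores + n_assets, hidden),
                                     nn.ReLU(), nn.Linear(hidden, n_assets))

# Transfer to a new universe: freeze the score machines, retrain only the mixture.
msm = MSM(n_assets=50)                      # e.g., the EURO STOXX 50 constituents
for p in msm.score_machines.parameters():
    p.requires_grad = False                 # keep the universal feature extractors fixed
optimizer = torch.optim.Adam(msm.mixture.parameters(), lr=1e-3)
```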
Remark 9.2
As desired, the MSM agent outperformed both the market index (SX5E) and the SMM agent, reflecting the universality of the MSM learned strategies, which are successful in both the S&P 500 and the EURO STOXX 50 markets.
It is also worth noting that the cumulative returns of the MSM and the SMM agents were correlated; however, the MSM performed better, especially after 2009, when the SMM followed the declining market while the MSM became profitable. This fact can be attributed to the pre-training stage of the MSM agent: during this stage, the policy gradient network converges to the Markowitz model, effectively mimicking the SMM strategies. The reinforcement episodic training then allows the MSM to improve upon, and outperform, its initial strategy, the SMM.
Figure 9.2: Cumulative returns of the Mixture of Score Machines (MSM) agent, trained on the S&P 500 market with experience transferred to the EURO STOXX 50 market (SX5E), along with the traditional Sequential Markowitz Model (SMM) and the index.

Chapter 10
Conclusion
The main objective of this report was to investigate the effectiveness of Reinforcement Learning agents on Sequential Portfolio Management. To achieve this, many concepts from the fields of Signal Processing, Control Theory, Machine Intelligence and Finance have been explored, extended and combined. In this chapter, the contributions and achievements of the project are summarized, along with possible axes for future research.
To enable episodic reinforcement learning, a mathematical formulation of financial markets as discrete-time stochastic dynamical systems is provided, giving rise to a unified, versatile framework for training agents and investment strategies.

A comprehensive account of reinforcement learning agents has been developed, including traditional, baseline agents from system identification (i.e., model-based methods) as well as context-agnostic agents (i.e., model-free methods). A universal model-free reinforcement learning family of agents has been introduced, which is able to reduce the model's computational and memory complexity (i.e., linear scaling with universe size) and to generalize strategies across assets and markets, regardless of the training universe. It also outperformed all trading agents found in the open literature on the S&P 500 and the EURO STOXX 50 markets.

Lastly, model pre-training, data augmentation and simulations enabled robust training of deep neural network architectures, even with a limited amount of available real market data.

Future Work
Despite the performance gain of the developed strategies, the lack of interpretability (Rico-Martinez et al., 1994) and the inability to exhaustively test the deep architectures used (i.e., Deep Neural Networks) discourage practitioners from adopting these solutions. As a consequence, it is worth investigating and interpreting the learned strategies by opening the deep "black box" and being able to reason about its decisions.

In addition, exploiting the large number of degrees of freedom given by the framework formulation of financial markets, further research on reward generating functions and state representations could improve the convergence properties and overall performance of the agents. A valid approach would be to construct the indicators typically used in the technical analysis of financial instruments (King and Levine, 1992). These measures would embed the expert knowledge acquired by financial analysts over decades of activity and could help guide the agent towards better decisions (Wilmott, 2007).

Furthermore, the flexible architecture of the Mixture of Score Machines (MSM) agent, the best-scoring universal trading agent, allows experimentation with both the Score Machine (SM) networks and their model order, as well as with the Mixture network, which, ideally, should be usable universally, without transfer learning.

The trading agents in this report are based on point estimates, provided by a deep neural network. However, due to the uncertainty of the financial signals, it would be appropriate to also model this uncertainty, incorporating it into the decision-making process. Bayesian Inference, or more tractable variants of it, including Variational Inference (Titsias and Lawrence, 2010), can be used to train probabilistic models, capable of capturing the environment uncertainty (Vlassis et al., 2012).

Last but not least, motivated by the recent publication by Fellows et al. (2018) on the exact calculation of the policy gradient by operating in the Fourier domain, employing an exact policy gradient method could eliminate the estimator variance and accelerate training. (At the time this report is submitted, that paper has not yet been presented, but it has been accepted at the International Conference on Machine Learning (ICML) 2018.)

Bibliography
Bibliography

Markowitz, Harry (1952). "Portfolio selection". The Journal of Finance.
Dynamic programming. Courier Corporation.
Fama, Eugene F (1970). "Efficient capital markets: A review of theory and empirical work". The Journal of Finance.
Portfolio theory and capital markets. Vol. 217. McGraw-Hill New York.
Moylan, P and B Anderson (1973). "Nonlinear regulator theory and an inverse optimal control problem". IEEE Transactions on Automatic Control.
Information and Control.
Proceedings of the National Academy of Sciences.
The Journal of Finance.
European Journal of Operational Research.
Journal of Econometrics.
Mathematics of Control, Signals and Systems.
Proceedings of the IEEE.
et al. (1991). "Adaptive mixtures of local experts". Neural Computation.
Financial indicators and growth in a cross section of countries. Vol. 819. World Bank Publications.
Prichard, Dean and James Theiler (1994). "Generating surrogate data for time series with several simultaneously measured variables". Physical Review Letters.
Neural Networks for Signal Processing [1994] IV. Proceedings of the 1994 IEEE Workshop. IEEE, pp. 596–605.
Bertsekas, Dimitri P et al. (1995). Dynamic programming and optimal control. Vol. 1. 2. Athena Scientific, Belmont, MA.
LeCun, Yann, Yoshua Bengio, et al. (1995). "Convolutional networks for images, speech, and time series". The Handbook of Brain Theory and Neural Networks.
Communications of the ACM.
Advances in Neural Information Processing Systems, pp. 952–958.
Salinas, Emilio and LF Abbott (1996). "A model of multiplicative neural responses in parietal cortex". Proceedings of the National Academy of Sciences.
Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on. Vol. 4. IEEE, pp. 3557–3564.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation.
et al. (1997). "Investment science". OUP Catalogue.
Ortiz-Fuentes, Jorge D and Mikel L Forcada (1997). "A comparison between recurrent neural network architectures for digital equalization". Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. Vol. 4. IEEE, pp. 3281–3284.
Rust, John (1997). "Using randomization to break the curse of dimensionality". Econometrica: Journal of the Econometric Society, pp. 487–516.
Akaike, Hirotugu (1998). "Markovian representation of stochastic processes and its application to the analysis of autoregressive moving average processes". Selected Papers of Hirotugu Akaike. Springer, pp. 223–247.
Cohen, Leon (1998). "The generalization of the Wiener-Khinchin theorem". Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. Vol. 3. IEEE, pp. 1577–1580.
Hochreiter, Sepp (1998). "The vanishing gradient problem during learning recurrent neural nets and problem solutions". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.
et al. (1998). "Reinforcement learning for trading systems and portfolios: Immediate vs future rewards". Decision Technologies for Computational Finance. Springer, pp. 129–140.
Papadimitriou, Christos H and Kenneth Steiglitz (1998). Combinatorial optimization: Algorithms and complexity. Courier Corporation.
Sutton, Richard S and Andrew G Barto (1998). Introduction to reinforcement learning. Vol. 135. MIT Press Cambridge.
Gers, Felix A, Jürgen Schmidhuber, and Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM".
Ng, Andrew Y, Stuart J Russell, et al. (2000). "Algorithms for inverse reinforcement learning". ICML, pp. 663–670.
Sutton, Richard S et al. (2000a). "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems, pp. 1057–1063.
— (2000b). "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems, pp. 1057–1063.
Ghahramani, Zoubin (2001). "An introduction to hidden Markov models and Bayesian networks". International Journal of Pattern Recognition and Artificial Intelligence.
et al. (2001). "A builder's guide to agent-based financial markets". Quantitative Finance.
et al. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. Wiley Online Library.
Tino, Peter, Christian Schittenkopf, and Georg Dorffner (2001). "Financial volatility trading using recurrent neural networks". IEEE Transactions on Neural Networks.
Artificial Intelligence.
Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on. Vol. 4. IEEE, pp. 3404–3410.
Bengio, Yoshua et al. (2003). "A neural probabilistic language model". Journal of Machine Learning Research.
Econometric analysis. Pearson Education India.
Boyd, Stephen and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.
Kohl, Nate and Peter Stone (2004). "Policy gradient reinforcement learning for fast quadrupedal locomotion". Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 3. IEEE, pp. 2619–2624.
Mandic, Danilo P (2004). "A generalized normalized gradient descent algorithm". IEEE Signal Processing Letters.
Advanced lectures on machine learning. Springer, pp. 63–71.
Tsay, Ruey S (2005). Analysis of financial time series. Vol. 543. John Wiley & Sons.
Gatev, Evan, William N Goetzmann, and K Geert Rouwenhorst (2006). "Pairs trading: Performance of a relative-value arbitrage rule". The Review of Financial Studies.
Journal of Electronic Imaging.
Paul Wilmott introduces quantitative finance. John Wiley & Sons.
Kober, Jens and Jan R Peters (2009). "Policy search for motor primitives in robotics". Advances in Neural Information Processing Systems, pp. 849–856.
Meucci, Attilio (2009). Risk and asset allocation. Springer Science & Business Media.
Raeth, Christoph and R Monetti (2009). "Surrogates with random Fourier phases". Topics On Chaotic Systems: Selected Papers from CHAOS 2008 International Conference. World Scientific, pp. 274–285.
Ahmed, Nesreen K et al. (2010). "An empirical comparison of machine learning models for time series forecasting". Econometric Reviews.
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Pan, Sinno Jialin and Qiang Yang (2010a). "A survey on transfer learning". IEEE Transactions on Knowledge and Data Engineering.
Pan, Sinno Jialin and Qiang Yang (2010b). "A survey on transfer learning". IEEE Transactions on Knowledge and Data Engineering.
Artificial intelligence: Foundations of computational agents. Cambridge University Press.
Szepesvári, Csaba (2010). "Algorithms for reinforcement learning". Synthesis Lectures on Artificial Intelligence and Machine Learning.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 844–851.
Deisenroth, Marc and Carl E Rasmussen (2011). "PILCO: A model-based and data-efficient approach to policy search". Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.
Hénaff, Patrick et al. (2011). "Real time implementation of CTRNN and BPTT algorithm to learn on-line biped robot balance: Experiments on the standing posture". Control Engineering Practice.
IEEE Signal Processing Magazine.
Journal of Economic Dynamics and Control.
Machine Learning: A Probabilistic Perspective. The MIT Press. ISBN: 0262018020, 9780262018029.
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2012). "Understanding the exploding gradient problem". CoRR, abs/1211.5063.
Tieleman, Tijmen and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude". COURSERA: Neural Networks for Machine Learning.
et al. (2012). "Bayesian reinforcement learning". Reinforcement Learning. Springer, pp. 359–386.
Aldridge, Irene (2013). High-frequency trading: A practical guide to algorithmic strategies and trading systems. Vol. 604. John Wiley & Sons.
Giusti, Alessandro et al. (2013). "Fast image scanning with deep max-pooling convolutional neural networks". Image Processing (ICIP), 2013 20th IEEE International Conference on. IEEE, pp. 4034–4038.
Michalski, Ryszard S, Jaime G Carbonell, and Tom M Mitchell (2013). Machine learning: An artificial intelligence approach. Springer Science & Business Media.
Roberts, Stephen et al. (2013). "Gaussian processes for time-series modelling". Phil. Trans. R. Soc. A.
IEEE Transactions on Instrumentation and Measurement.
arXiv preprint arXiv:1412.6980.
Almeida, Náthalee C, Marcelo AC Fernandes, and Adrião DD Neto (2015). "Beamforming and power control in sensor arrays using reinforcement learning". Sensors.
et al. (2015). "Black box variational inference for state space models". arXiv preprint arXiv:1511.07367.
Chen, Kai, Yi Zhou, and Fangyan Dai (2015). "A LSTM-based method for stock returns prediction: A case study of China stock market". Big Data (Big Data), 2015 IEEE International Conference on. IEEE, pp. 2823–2824.
Hausknecht, Matthew and Peter Stone (2015). "Deep recurrent Q-learning for partially observable MDPs". CoRR, abs/1507.06527.
Hills, Thomas T et al. (2015). "Exploration versus exploitation in space, mind, and society". Trends in Cognitive Sciences.
et al. (2015). "Continuous control with deep reinforcement learning". arXiv preprint arXiv:1509.02971.
Mnih, Volodymyr et al. (2015). "Human-level control through deep reinforcement learning". Nature.
Integrating learning and planning. URL: .
— (2015b). Introduction to reinforcement learning. URL: .
— (2015c). Markov decision processes. URL: .
— (2015d). Model-free control. URL: .
— (2015e). Policy gradient. URL: .
Feng, Yiyong, Daniel P Palomar, et al. (2016). "A signal processing perspective on financial engineering". Foundations and Trends® in Signal Processing.
University of Cambridge.
Gal, Yarin and Zoubin Ghahramani (2016). "A theoretically grounded application of dropout in recurrent neural networks". Advances in Neural Information Processing Systems, pp. 1019–1027.
Gal, Yarin, Rowan McAllister, and Carl Edward Rasmussen (2016). "Improving PILCO with Bayesian neural network dynamics models". Data-Efficient Machine Learning Workshop, ICML.
Goodfellow, Ian et al. (2016). Deep learning. Vol. 1. MIT Press Cambridge.
Heaton, JB, NG Polson, and Jan Hendrik Witte (2016). "Deep learning in finance". arXiv preprint arXiv:1602.06561.
Kennedy, Douglas (2016). Stochastic financial models. Chapman and Hall/CRC.
Levine, Sergey et al. (2016). "End-to-end training of deep visuomotor policies". The Journal of Machine Learning Research.
et al. (2016). "State of the art control of Atari games using shallow reinforcement learning". Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, pp. 485–493.
Mnih, Volodymyr et al. (2016). "Asynchronous methods for deep reinforcement learning". International Conference on Machine Learning, pp. 1928–1937.
Necchi, Pierpaolo (2016). Policy gradient algorithms for asset allocation problem. URL: https://github.com/pnecchi/Thesis/blob/master/MS_Thesis_Pierpaolo_Necchi.pdf.
Nemati, Shamim, Mohammad M Ghassemi, and Gari D Clifford (2016). "Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach". Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the. IEEE, pp. 2978–2981.
Silver, David and Demis Hassabis (2016). "AlphaGo: Mastering the ancient game of Go with Machine Learning". Research Blog.
Bao, Wei, Jun Yue, and Yulei Rao (2017). "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". PloS One.
et al. (2017). "Deep direct reinforcement learning for financial signal representation and trading". IEEE Transactions on Neural Networks and Learning Systems.
Applied Stochastic Models in Business and Industry.
arXiv preprint arXiv:1706.10059.
Navon, Ariel and Yosi Keller (2017). "Financial time series prediction using deep learning". arXiv preprint arXiv:1711.04174.
Noonan, Laura (2017). JPMorgan develops robot to execute trades. URL: .
Quantopian (2017). Commission models. URL: .
Schinckus, Christophe (2017). "An essay on financial information in the era of computerization". Journal of Information Technology, pp. 1–10.
Zhang, Xiao-Ping Steven and Fang Wang (2017). "Signal processing for finance, economics, and marketing: Concepts, framework, and big data applications". IEEE Signal Processing Magazine.
CoRR.
Investopedia (2018a).