A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem
Zhengyao Jiang [email protected]
Dixing Xu [email protected]
Department of Computer Sciences and Software Engineering
Jinjun Liang [email protected]
Department of Mathematical Sciences
Xi’an Jiaotong-Liverpool University
Suzhou, SU 215123, P. R. China
Editor:
Abstract
Financial portfolio management is the process of constant redistribution of a fund into different financial products. This paper presents a financial-model-free Reinforcement Learning framework to provide a deep machine learning solution to the portfolio management problem. The framework consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. This framework is realized in three instances in this work with a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM). They are, along with a number of recently reviewed or published portfolio-selection strategies, examined in three back-test experiments with a trading period of 30 minutes in a cryptocurrency market. Cryptocurrencies are electronic and decentralized alternatives to government-issued money, with Bitcoin as the best-known example. All three instances of the framework monopolize the top three positions in all experiments, outdistancing the other compared trading algorithms. Even with a high commission rate of 0.25% in the back-tests, the framework is able to achieve at least 4-fold returns in 50 days.
Keywords: Machine Learning; Convolutional Neural Networks; Recurrent Neural Networks; Long Short-Term Memory; Reinforcement Learning; Deep Learning; Cryptocurrency; Bitcoin; Algorithmic Trading; Portfolio Management; Quantitative Finance
1. Introduction
Portfolio management is the decision-making process of continuously reallocating an amount of fund into a number of different financial investment products, aiming to maximize the return while restraining the risk (Haugen, 1986; Markowitz, 1968). Traditional portfolio management methods can be classified into four categories: "Follow-the-Winner", "Follow-the-Loser", "Pattern-Matching", and "Meta-Learning" (Li and Hoi, 2014). The first two categories are based on prior-constructed financial models, though they may also be assisted by some machine learning techniques for parameter determination (Li et al., 2012; Cover, 1996). The performance of these methods depends on the validity of the models on different markets. "Pattern-Matching" algorithms predict the next market distribution based on a sample of historical data and explicitly optimize the portfolio based on the sampled distribution (Györfi et al., 2006). The last class, "Meta-Learning", combines multiple strategies of the other categories to attain more consistent performance (Vovk and Watkins, 1998; Das and Banerjee, 2011).

There are existing deep machine-learning approaches to financial market trading. However, many of them try to predict price movements or trends (Heaton et al., 2016; Niaki and Hoseinzade, 2013; Freitas et al., 2009). With historical prices of all assets as its input, a neural network can output a predicted vector of asset prices for the next period, and the trading agent can then act upon this prediction. This idea is straightforward to implement, because it is a supervised learning problem, or more specifically a regression problem. The performance of these price-prediction-based algorithms, however, depends highly on the degree of prediction accuracy, and future market prices turn out to be difficult to predict. Furthermore, price predictions are not market actions; converting them into actions requires an additional layer of logic.
If this layer is hand-coded, then the whole approach is not fully machine learning, and thus is not very extensible or adaptable. For example, it is difficult for a prediction-based network to consider transaction cost as a risk factor.

Previous successful attempts at model-free and fully machine-learning schemes for the algorithmic trading problem, without predicting future prices, treat the problem as a Reinforcement Learning (RL) one. These include Moody and Saffell (2001), Dempster and Leemans (2006), Cumming (2015), and the recent deep RL utilization by Deng et al. (2017). These RL algorithms output discrete trading signals on an asset. Being limited to single-asset trading, they are not applicable to general portfolio management problems, where trading agents manage multiple assets.

Deep RL is lately drawing much attention due to its remarkable achievements in playing video games (Mnih et al., 2015) and board games (Silver et al., 2016). These are RL problems with discrete action spaces, and their methods cannot be directly applied to portfolio selection problems, where actions are continuous. Although market actions could be discretized, discretization is considered a major drawback, because discrete actions come with unknown risks. For instance, one extreme discrete action may be defined as investing all the capital into one asset, without spreading the risk to the rest of the market. In addition, discretization scales badly. Market factors, like the number of total assets, vary from market to market. In order to take full advantage of the adaptability of machine learning over different markets, trading algorithms have to be scalable. A general-purpose continuous deep RL framework, the actor-critic Deterministic Policy Gradient algorithm, was recently introduced (Silver et al., 2014; Lillicrap et al., 2016).
The continuous output in these actor-critic algorithms is achieved by a neural-network-approximated action policy function, while a second network is trained as the reward function estimator. Training two neural networks, however, is found to be difficult, and sometimes even unstable.

This paper proposes an RL framework specially designed for the task of portfolio management. The core of the framework is the Ensemble of Identical Independent Evaluators (EIIE) topology. An IIE is a neural network whose job is to inspect the history of an asset and evaluate its potential growth for the immediate future. The evaluation score of each asset is discounted by the size of its intentional weight change in the portfolio and is presented to a softmax layer, whose outcome will be the new portfolio weights for the coming trading period. The portfolio weights define the market action of the RL agent. An asset with an increased target weight will be bought in with additional amount, and one with a decreased weight will be sold. Apart from the market history, the portfolio weights from the previous trading period are also input to the EIIE, so that the RL agent can consider the effect of transaction cost on its wealth. For this purpose, the portfolio weights of each period are recorded in a Portfolio-Vector Memory (PVM). The EIIE is trained with an Online Stochastic Batch Learning (OSBL) scheme, which is compatible with both pre-trade training and online training during back-tests or online trading. The reward function of the RL framework is the explicit average of the periodic logarithmic returns. Having an explicit reward function, the EIIE evolves, under training, along the gradient-ascending direction of the function.
Three different species of IIEs are tested in this work: a Convolutional Neural Network (CNN) (Fukushima, 1980; Krizhevsky et al., 2012; Sermanet et al., 2012), a basic Recurrent Neural Network (RNN) (Werbos, 1988), and a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997).

Being a fully machine-learning approach, the framework is not restricted to any particular market. To examine its validity and profitability, the framework is tested on a cryptocurrency (virtual money, with Bitcoin as the most famous example) exchange market, Poloniex.com. A set of coins is preselected by their rankings in trading volume over a time interval just before an experiment. Three back-test experiments over well-separated time spans are performed with a trading period of 30 minutes. The performance of the three EIIEs is compared with some recently published or reviewed portfolio selection strategies (Li et al., 2015a; Li and Hoi, 2014). The EIIEs significantly beat all other strategies in all three experiments.

Cryptographic currencies, or simply cryptocurrencies, are electronic and decentralized alternatives to government-issued moneys (Nakamoto, 2008; Grinberg, 2012). While the best-known example of a cryptocurrency is Bitcoin, there are more than 100 other tradable cryptocurrencies competing with each other and with Bitcoin (Bonneau et al., 2015). The motive behind this competition is that there are a number of design flaws in Bitcoin, and people are trying to invent new coins to overcome these defects, hoping their inventions will eventually replace Bitcoin (Bentov et al., 2014; Duffield and Hagan, 2014). There are, however, more and more cryptocurrencies being created without aiming to beat Bitcoin, but with the purpose of using the blockchain technology behind it to develop decentralized applications. As of June 2017, the total market capital of all cryptocurrencies is 102 billion USD, 41 billion of which is of Bitcoin.
Therefore, regardless of its design faults, Bitcoin is still the dominant cryptocurrency in markets. As a result, many other currencies cannot be bought with fiat currencies, but can only be traded against Bitcoin.

Two natures of cryptocurrencies differentiate them from traditional financial assets, making their market the best test-ground for algorithmic portfolio management experiments. These natures are decentralization and openness, and the former implies the latter. Without a central regulating party, anyone can participate in cryptocurrency trading with low entrance requirements. One direct consequence is an abundance of small-volume currencies. Affecting the prices of these penny markets requires a smaller amount of investment, compared to traditional markets. This will eventually allow trading machines to learn and
take advantage of the impacts of their own market actions. Openness also means the markets are more accessible. Most cryptocurrency exchanges have application programming interfaces for obtaining market data and carrying out trading actions, and most exchanges are open 24/7 without restricting the frequency of trading. These non-stop markets are ideal for machines to learn in the real world in shorter time-frames.

The paper is organized as follows. Section 2 defines the portfolio management problem that this project is aiming to solve. Section 3 introduces asset preselection and the reasoning behind it, the input price tensor, and a way to deal with missing data in the market history. The portfolio management problem is re-described in the language of RL in Section 4. Section 5 presents the EIIE meta topology, the PVM, and the OSBL scheme. The results of the three experiments are staged in Section 6.

1. For example, Ethereum is a decentralized platform that runs smart contracts, and Siacoin is the currency for buying and selling storage service on the decentralized cloud Sia.
2. Crypto-currency market capitalizations, http://coinmarketcap.com/, accessed: 2017-06-30.
2. Problem Definition
Portfolio management is the action of continuous reallocation of capital into a number of financial assets. For an automatic trading robot, these investment decisions and actions are made periodically. This section provides a mathematical setting of the portfolio management problem.
2.1 Trading Period

In this work, trading algorithms are time-driven, where time is divided into periods of equal length T. At the beginning of each period, the trading agent reallocates the fund among the assets. T = 30 minutes in all experiments of this paper. The price of an asset goes up and down within a period, but four important price points characterize the overall movement of a period, namely the opening, highest, lowest and closing prices (Rogers and Satchell, 1991). For continuous markets, the opening price of a financial instrument in a period is the closing price from the previous period. It is assumed in the back-test experiments that at the beginning of each period assets can be bought or sold at the opening price of that period. The justification of such an assumption is given in Section 2.4.

2.2 Mathematical Formalism

The portfolio consists of m + 1 assets. The closing prices of all assets comprise the price vector for Period t, v_t. In other words, the i-th element of v_t, v_{i,t}, is the closing price of the i-th asset in the t-th period. Similarly, v_t^{(hi)} and v_t^{(lo)} denote the highest and lowest prices of the period. The first asset in the portfolio is special, in that it is the quoted currency, referred to as the cash for the rest of the article. Since the prices of all assets are quoted in cash, the first elements of v_t, v_t^{(hi)} and v_t^{(lo)} are always one, that is v_{0,t}^{(hi)} = v_{0,t}^{(lo)} = v_{0,t} = 1, ∀t. In the experiments of this paper, the cash is Bitcoin.

For continuous markets, elements of v_t are the opening prices for Period t + 1 as well as the closing prices for Period t. The price relative vector of the t-th trading period, y_t, is defined as the element-wise division of v_t by v_{t-1}:

    y_t := v_t ⊘ v_{t-1} = ( 1, v_{1,t}/v_{1,t-1}, v_{2,t}/v_{2,t-1}, ..., v_{m,t}/v_{m,t-1} )^⊺.    (1)

The elements of y_t are the quotients of closing prices and opening prices for the individual assets in the period.
The price relative vector can be used to calculate the change in total portfolio value in a period. If p_{t-1} is the portfolio value at the beginning of Period t, ignoring transaction cost,

    p_t = p_{t-1} y_t · w_{t-1},    (2)

where w_{t-1} is the portfolio weight vector (referred to as the portfolio vector from now on) at the beginning of Period t, whose i-th element, w_{t-1,i}, is the proportion of asset i in the portfolio after capital reallocation. The elements of w_t always sum up to one by definition, Σ_i w_{t,i} = 1, ∀t. The rate of return for Period t is then

    ρ_t := p_t / p_{t-1} - 1 = y_t · w_{t-1} - 1,    (3)

and the corresponding logarithmic rate of return is

    r_t := ln(p_t / p_{t-1}) = ln(y_t · w_{t-1}).    (4)

In a typical portfolio management problem, the initial portfolio weight vector w_0 is chosen to be the first basis vector in the Euclidean space,

    w_0 = (1, 0, ..., 0)^⊺,    (5)

indicating all the capital is in the trading currency before entering the market. If there is no transaction cost, the final portfolio value will be

    p_f = p_0 exp( Σ_{t=1}^{t_f+1} r_t ) = p_0 Π_{t=1}^{t_f+1} y_t · w_{t-1},    (6)

where p_0 is the initial investment amount. The job of a portfolio manager is to maximize p_f for a given time frame.

2.3 Transaction Cost

In a real-world scenario, buying or selling assets in a market is not free. The cost normally comes from the commission fee. Assuming a constant commission rate, this section will re-calculate the final portfolio value in Equation (6), using a recursive formula extending a work by Ormos and Urbán (2013).

The portfolio vector at the beginning of Period t is w_{t-1}. Due to price movements in the market, at the end of the same period, the weights evolve into

    w'_t = (y_t ⊙ w_{t-1}) / (y_t · w_{t-1}),    (7)

where ⊙ is the element-wise multiplication. The mission of the portfolio manager now at the end of Period t is to reallocate the portfolio vector from w'_t to w_t by selling and buying relevant assets. Paying all commission fees, this reallocation action shrinks the portfolio value by a factor μ_t.
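As a concrete illustration, Equations (2)-(4) and (7) can be stepped through in a few lines of code. The following sketch is illustrative only; the function name and the toy numbers are not from the paper:

```python
import math

def step_period(p_prev, w_prev, y):
    """One trading period without transaction cost.

    p_prev: portfolio value at the beginning of the period
    w_prev: portfolio vector w_{t-1} (asset 0 is the cash, so y[0] == 1)
    y:      price-relative vector y_t of Equation (1)
    Returns (p_t, rho_t, r_t, w_evolved) per Equations (2), (3), (4), (7).
    """
    growth = sum(yi * wi for yi, wi in zip(y, w_prev))   # y_t . w_{t-1}
    p_t = p_prev * growth                                # Eq. (2)
    rho_t = growth - 1.0                                 # Eq. (3)
    r_t = math.log(growth)                               # Eq. (4)
    w_evolved = [yi * wi / growth for yi, wi in zip(y, w_prev)]  # Eq. (7)
    return p_t, rho_t, r_t, w_evolved

# Toy portfolio: cash plus two coins; the second coin gains 10% in the period.
y = [1.0, 0.98, 1.10]
w_prev = [0.2, 0.3, 0.5]
p_t, rho_t, r_t, w_prime = step_period(100.0, w_prev, y)
# p_t = 100 * (0.2 + 0.294 + 0.55) = 104.4, a 4.4% gain for the period.
```

Note that the evolved weights w' sum to one automatically, because the normalization in Equation (7) divides by the same growth factor that drives the portfolio value.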
μ_t ∈ (0, 1] will be called the transaction remainder factor from now on; it is to be determined below.

Figure 1: Illustration of the effect of the transaction remainder factor μ_t. The market movement during Period t, represented by the price-relative vector y_t, drives the portfolio value and portfolio weights from p_{t-1} and w_{t-1} to p'_t and w'_t. The asset selling and purchasing actions at time t redistribute the fund into w_t. As a side-effect, these transactions shrink the portfolio to p_t by a factor of μ_t. The rate of return for Period t is calculated using portfolio values at the beginning of the two consecutive periods in Equation (9).

Denoting p_{t-1} as the portfolio value at the beginning of Period t and p'_t at the end,

    p_t = μ_t p'_t.    (8)

The rate of return (3) and logarithmic rate of return (4) are now

    ρ_t = p_t / p_{t-1} - 1 = μ_t p'_t / p_{t-1} - 1 = μ_t y_t · w_{t-1} - 1,    (9)

    r_t = ln(p_t / p_{t-1}) = ln(μ_t y_t · w_{t-1}),    (10)

and the final portfolio value in Equation (6) becomes

    p_f = p_0 exp( Σ_{t=1}^{t_f+1} r_t ) = p_0 Π_{t=1}^{t_f+1} μ_t y_t · w_{t-1}.    (11)

Different from Equations (2) and (4) where transaction cost is not considered, in Equations (10) and (11), p'_t ≠ p_t, and the difference between the two values is where the transaction remainder factor comes into play. Figure 1 demonstrates the relationship among the portfolio vectors and values and their dynamics on a time axis.

The remaining problem is to determine this transaction remainder factor μ_t. During the portfolio reallocation from w'_t to w_t, some or all amount of asset i needs to be sold, if p'_t w'_{t,i} > p_t w_{t,i}, or equivalently w'_{t,i} > μ_t w_{t,i}. The total amount of cash obtained by all selling is

    (1 - c_s) p'_t Σ_{i=1}^{m} (w'_{t,i} - μ_t w_{t,i})^+,    (12)

where 0 ≤ c_s < 1 is the commission rate for selling, and (v)^+ = ReLU(v) is the element-wise rectified linear function, (x)^+ = x if x >
0, (x)^+ = 0 otherwise. This money, together with the original cash reserve p'_t w'_{t,0}, minus the new reserve μ_t p'_t w_{t,0}, will be used to buy new assets,

    (1 - c_p) [ w'_{t,0} + (1 - c_s) Σ_{i=1}^{m} (w'_{t,i} - μ_t w_{t,i})^+ - μ_t w_{t,0} ] = Σ_{i=1}^{m} (μ_t w_{t,i} - w'_{t,i})^+,    (13)

where 0 ≤ c_p < 1 is the commission rate for purchasing, and p'_t has been canceled out on both sides. Using the identity (a - b)^+ - (b - a)^+ = a - b, and the fact that w'_{t,0} + Σ_{i=1}^{m} w'_{t,i} = 1 = w_{t,0} + Σ_{i=1}^{m} w_{t,i}, Equation (13) is simplified to

    μ_t = [ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) Σ_{i=1}^{m} (w'_{t,i} - μ_t w_{t,i})^+ ] / (1 - c_p w_{t,0}).    (14)

The presence of μ_t inside a linear rectifier means μ_t is not solvable analytically; it can only be solved iteratively.

Theorem 1
Denoting

    f(μ) := [ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) Σ_{i=1}^{m} (w'_{t,i} - μ w_{t,i})^+ ] / (1 - c_p w_{t,0}),

the sequence {μ̃_t^{(k)}}, defined as

    { μ̃_t^{(k)} | μ̃_t^{(0)} = μ_⊙ and μ̃_t^{(k)} = f(μ̃_t^{(k-1)}), k ∈ ℕ },    (15)

converges to μ_t, the solution to Equation (14), for any μ_⊙ ∈ [0, 1].

While this convergence is not stated in Ormos and Urbán (2013), its proof will be given in Appendix A. This theorem provides a way to approximate the transaction remainder factor μ_t to an arbitrary accuracy. The speed of the convergence depends on the error of the initial guess μ_⊙: the smaller |μ_t - μ_⊙| is, the quicker Sequence (15) converges to μ_t. When c_p = c_s = c, there is a practice (Moody et al., 1998) to approximate μ_t with 1 - c Σ_{i=1}^{m} |w'_{t,i} - w_{t,i}|. Therefore, in this work, μ_⊙ will use this as the first value of the sequence,

    μ_⊙ = 1 - c Σ_{i=1}^{m} |w'_{t,i} - w_{t,i}|.    (16)

In the training of the neural networks, μ̃_t^{(k)} with a fixed k in (15) is used. In the back-test experiments, a tolerated error δ dynamically determines k; that is, the first k such that |μ̃_t^{(k)} - μ̃_t^{(k-1)}| < δ is used for μ̃_t^{(k)} to approximate μ_t. In general, μ_t and its approximations are functions of the portfolio vectors of the two recent periods and the price relative vector,

    μ_t = μ_t(w_{t-1}, w_t, y_t).    (17)

Throughout this work, a single constant commission rate for both selling and purchasing for all non-cash assets is used, c_s = c_p = 0.25%. The purpose of the portfolio manager is then to generate a time sequence of portfolio vectors {w_1, w_2, ..., w_t, ...} in order to maximize the accumulative capital in (11), taking transaction cost into account.

2.4 Two Hypotheses

In this work, only back-test tradings are considered, where the trading agent pretends to be back at a point in the market history, not knowing any "future" market information, and does paper trading from then onward.
As a requirement for the back-test experiments, the following two assumptions are imposed:

1. Zero slippage: The liquidity of all market assets is high enough that each trade can be carried out immediately at the last price when an order is placed.

2. Zero market impact: The capital invested by the software trading agent is so insignificant that it has no influence on the market.

In a real-world trading environment, if the trading volume in a market is high enough, these two assumptions are close to reality.
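Before moving on, the fixed-point iteration (15) for the transaction remainder factor, started from the initial guess (16), can be sketched in code. The function name and the tolerance handling below are illustrative assumptions, not the paper's implementation:

```python
def transaction_remainder(w_prime, w, c_s=0.0025, c_p=0.0025, delta=1e-12):
    """Iterate mu <- f(mu) of Theorem 1 until successive values differ
    by less than delta. Asset 0 is the cash; the commission rates
    c_s, c_p follow Equation (14)."""
    relu = lambda x: x if x > 0.0 else 0.0
    def f(mu):
        # Sum over non-cash assets that must be partially sold.
        shrink = sum(relu(wp - mu * wi) for wp, wi in zip(w_prime[1:], w[1:]))
        return (1.0 - c_p * w_prime[0]
                - (c_s + c_p - c_s * c_p) * shrink) / (1.0 - c_p * w[0])
    # Initial guess (16), assuming c_s = c_p = c.
    mu = 1.0 - c_s * sum(abs(wp - wi) for wp, wi in zip(w_prime, w))
    while True:
        mu_next = f(mu)
        if abs(mu_next - mu) < delta:
            return mu_next
        mu = mu_next

# No reallocation: the portfolio is untouched, so no commission is paid.
mu_idle = transaction_remainder([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
# A real reallocation shrinks the portfolio to slightly below 1.
mu_trade = transaction_remainder([0.1, 0.6, 0.3], [0.2, 0.3, 0.5])
```

With identical weight vectors the iteration returns exactly 1, as expected: no trade, no cost.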
3. Data Treatments
The trading experiments are done on the exchange Poloniex, where there are about 80 tradable cryptocurrency pairs with about 65 available cryptocurrencies. However, for the reasons given below, only a subset of coins is considered by the trading robot in one period. Apart from the coin selection scheme, this section also gives a description of the data structure that the neural networks take as their input, a normalization pre-process, and a scheme to deal with missing data.

3.1 Asset Preselection

In the experiments of this paper, the 11 most-volumed non-cash assets are preselected for the portfolio. Together with the cash, Bitcoin, the size of the portfolio, m + 1, is 12. This number is chosen by experience and can be adjusted in future experiments. For markets with large volumes, like the foreign exchange market, m can be as big as the total number of available assets.

One reason for selecting top-volumed cryptocurrencies (simply called coins below) is that bigger volume implies better market liquidity of an asset, which in turn means the market condition is closer to Hypothesis 1 set in Section 2.4. Higher volumes also suggest that the investment can have less influence on the market, establishing an environment closer to Hypothesis 2. Considering the relatively high trading frequency (30 minutes) compared to some daily trading algorithms, liquidity and market size are particularly important in the current setting. In addition, the market of cryptocurrency is not stable. Some previously rarely- or popularly-traded coins can have a sudden boost or drop in volume in a short period of time. Therefore, the volume for asset preselection is of a longer time-frame, relative to the trading period. In these experiments, volumes of 30 days are used.

However, using top volumes for coin selection in back-test experiments can give rise to a survival bias. The trading volume of an asset is correlated to its popularity, which in turn is governed by its historic performance.
Giving future volume rankings to a back-test will inevitably and indirectly pass future price information to the experiment, causing unreliable positive results. For this reason, volume information from just before the beginning of the back-tests is used for preselection to avoid survival bias.

3. As of May 23, 2017.

3.2 Price Tensor

Historical price data is fed into a neural network to generate the output of a portfolio vector. This subsection describes the structure of the input tensor, its normalization scheme, and how missing data is dealt with.

The input to the neural networks at the end of Period t is a tensor, X_t, of rank 3 with shape (f, n, m), where m is the number of preselected non-cash assets, n is the number of input periods before t, and f = 3 is the feature number. Since prices further back in the history have much less correlation to the current moment than recent ones, n = 50 (a day and an hour) in the experiments. The criterion for choosing the m assets was given in Section 3.1. The features for asset i in Period t are its closing, highest, and lowest prices in the interval. Using the notation of Section 2.2, these are v_{i,t}, v_{i,t}^{(hi)}, and v_{i,t}^{(lo)}. However, these absolute price values are not directly fed to the networks. Since only the changes in prices determine the performance of the portfolio management (Equation (10)), all prices in the input tensor are normalized by the latest closing prices. Therefore, X_t is the stacking of the three normalized price matrices,

    X_t = ( V_t^{(lo)}, V_t^{(hi)}, V_t ),    (18)

where V_t, V_t^{(hi)}, and V_t^{(lo)} are the normalized price matrices,

    V_t = [ v_{t-n+1} ⊘ v_t | v_{t-n+2} ⊘ v_t | ⋯ | v_{t-1} ⊘ v_t | 1 ],

    V_t^{(hi)} = [ v_{t-n+1}^{(hi)} ⊘ v_t | v_{t-n+2}^{(hi)} ⊘ v_t | ⋯ | v_{t-1}^{(hi)} ⊘ v_t | v_t^{(hi)} ⊘ v_t ],

    V_t^{(lo)} = [ v_{t-n+1}^{(lo)} ⊘ v_t | v_{t-n+2}^{(lo)} ⊘ v_t | ⋯ | v_{t-1}^{(lo)} ⊘ v_t | v_t^{(lo)} ⊘ v_t ],

with 1 = (1, 1, ⋯, 1)^⊺, and ⊘ being the element-wise division operator.

At the end of Period t, the portfolio manager comes up with a portfolio vector w_t using merely the information from the price tensor X_t and the previous portfolio vector w_{t-1}, according to some policy π. In other words, w_t = π(X_t, w_{t-1}). At the end of Period t + 1, the logarithmic rate of return for the period due to decision w_t can be calculated with the additional information of the price change vector y_{t+1}, using Equation (10): r_{t+1} = ln(μ_{t+1} y_{t+1} · w_t). In the language of RL, r_{t+1} is the immediate reward to the portfolio management agent for its action w_t under the environment condition X_t.

Some of the selected coins lack part of the history. This absence of data is due to the fact that these coins appeared relatively recently. Data points before the existence of a coin are marked as Not A Numbers (NANs) from the exchange.
NANs only appear in the training set, because the coin selection criterion is the volume ranking of the last 30 days before the back-tests, meaning all preselected assets must have existed before that. As the input of a neural network must be real numbers, these NANs have to be replaced. In a previous work of the authors (Jiang and Liang, 2017), the missing data was filled with fake decreasing price series with a decay rate of 0.01, in order for the neural networks to avoid picking these absent assets in the training process. However, it turned out that the networks deeply memorized these particular assets and avoided them even when they were in very promising up-climbing trends in the back-test experiments. For this reason, in the current work, flat fake price-movements (0 decay rate) are used to fill the missing data points. In addition, under the novel EIIE structure, the networks are unable to reveal the identity of individual assets, preventing them from making decisions based on the long-past bad records of particular assets.
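The data treatments of this section can be made concrete with a small sketch: the leading NANs of a young coin are flat-filled (0 decay rate), and closing prices are then normalized by the latest close, as in the matrix V_t of Equation (18). The function names and the (asset, period) list layout are illustrative assumptions:

```python
def flat_fill(prices):
    """Replace NANs before a coin's first listed price with that first
    real price, i.e., a flat fake price movement (0 decay rate)."""
    first_real = next(p for p in prices if p == p)      # NaN != NaN
    return [first_real if p != p else p for p in prices]

def normalized_closes(close_rows):
    """Build the close-price part of Eq. (18): each asset's last n
    closing prices divided by its latest close, so the final column
    of the matrix is all ones."""
    return [[p / row[-1] for p in row] for row in close_rows]

nan = float("nan")
raw = [
    [10.0, 10.5, 10.0, 10.4],   # an established coin
    [nan, nan, 250.0, 260.0],   # a recently listed coin, missing history
]
filled = [flat_fill(row) for row in raw]
V = normalized_closes(filled)
```

Because the fake prices are flat, the filled entries normalize to a constant, carrying no spurious trend signal into the network.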
4. Reinforcement Learning
With the problem defined in Section 2 in mind, this section presents a reinforcement-learning (RL) solution framework using a deterministic policy gradient algorithm. The explicit reward function is also given under this framework.
In the problem of algorithmic portfolio management, the agent is the software portfolio manager performing trading actions in the environment of a financial market. This environment comprises all available assets in the markets and the expectations of all market participants towards them.

It is impossible for the agent to get total information of a state of such a large and complex environment. Nonetheless, all relevant information is believed, in the philosophy of technical traders (Charles et al., 2006; Lo et al., 2000), to be reflected in the prices of the assets, which are publicly available to the agent. Under this point of view, an environmental state can be roughly represented by the prices of all orders throughout the market's history up to the moment of the state. Although the full order history is in the public domain for many financial markets, it is too huge a task for the software agent to practically process this information. As a consequence, sub-sampling schemes for the order-history information are employed to further simplify the state representation of the market environment. These schemes include the asset preselection described in Section 3.1, periodic feature extraction, and history cut-off. Periodic feature extraction discretizes the time into periods, and then extracts the highest, lowest, and closing prices in each period. History cut-off simply takes the price features of only a recent number of periods to represent the current state of the environment. The resultant representation is the price tensor X_t described in Section 3.2.

Under Hypothesis 2 in Section 2.4, the trading action of the agent will not influence the future price states of the market.
However, the action made at the beginning of Period t will affect the reward of Period t + 1, and as a result will affect the agent's subsequent decisions. The agent's buying and selling transactions made at the beginning of Period t + 1, aiming to redistribute the wealth among the assets, are determined by the difference between the portfolio weights w'_t and w_t. w'_t is defined in terms of w_{t-1} in Equation (7), which also plays a role in the action of the last period. Since w_{t-1} has already been determined in the last period, the action of the agent at time t can be represented solely by the portfolio vector w_t,

    a_t = w_t.    (19)

Therefore a previous action does have influence on the decision of the current one, through the dependency of r_{t+1} and μ_{t+1} on w_t (17). In the current framework, this influence is encapsulated by considering w_{t-1} as a part of the environment and inputting it into the agent's action-making policy, so the state at t is represented as the pair of X_t and w_{t-1},

    s_t = (X_t, w_{t-1}),    (20)

where w_0 is predetermined in (5). The state s_t consists of two parts: the external state represented by the price tensor, X_t, and the internal state represented by the portfolio vector from the last period, w_{t-1}. Because, under Hypothesis 2 of Section 2.4, the portfolio amount is negligible compared to the total trading volume of the market, p_t is not included in the internal state.

It is the job of the agent to maximize the final portfolio value p_f of Equation (11) at the end of Period t_f + 1. As the agent does not have control over the choice of the initial investment, p_0, or the length of the whole portfolio management process, t_f, this job is equivalent to maximizing the average logarithmic cumulated return R,

    R(s_1, a_1, ⋯, s_{t_f}, a_{t_f}, s_{t_f+1}) := (1/t_f) ln(p_f / p_0) = (1/t_f) Σ_{t=1}^{t_f+1} ln(μ_t y_t · w_{t-1})    (21)

    = (1/t_f) Σ_{t=1}^{t_f+1} r_t.    (22)

On the right-hand side of (21), w_{t-1} is given by the action a_{t-1}, y_t is part of the price tensor X_t from the state variable s_t, and μ_t is a function of w_{t-1}, w_t and y_t, as stated in (17). In the language of RL, R is the cumulated reward, and r_t / t_f is the immediate reward of an individual episode. Different from a reward function using accumulated portfolio value (Moody et al., 1998), the denominator t_f guarantees the fairness of the reward function between runs of different lengths, enabling the trading policy to be trained in mini-batches.

With this reward function, the current framework has two important distinctions from many other RL problems. One is that both the episodic and cumulated rewards are exactly expressed. In other words, the domain knowledge of the environment is well mastered, and can be fully exploited by the agent. This exact expressiveness is based upon Hypothesis 2 of Section 2.4, that an action has no influence on the external part of future states, the price tensor. This isolation of action and external environment also allows one to use the same segment of market history to evaluate different sequences of actions. This feature of the framework is considered a major advantage, because a completely new trial in a trading game is both time-consuming and expensive.

The second distinction is that all episodic rewards are equally important to the final return. This distinction, together with the zero-market-impact assumption, allows r_t / t_f to be regarded as the action-value function of action w_t with a discount factor of 0, taking no consideration of the future influence of the action. Having a definite action-value function further justifies the full-exploitation approach, since exploration in other RL problems is mainly for trying out different classes of action-value functions. Without exploration, on the other hand, local optima can be avoided by random initialisation of the policy parameters, which will be discussed below.
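As a sanity check on the reward definition, the sketch below computes the immediate rewards (10) of a toy episode and their average (21)-(22); here the divisor is taken as the number of summed periods, up to the paper's indexing convention. All numbers and names are illustrative:

```python
import math

def immediate_reward(y, w_prev, mu):
    """r_t of Equation (10): ln(mu_t * y_t . w_{t-1})."""
    return math.log(mu * sum(yi * wi for yi, wi in zip(y, w_prev)))

def average_log_return(episode):
    """R of Equations (21)-(22), averaging the immediate rewards."""
    rs = [immediate_reward(y, w_prev, mu) for y, w_prev, mu in episode]
    return sum(rs) / len(rs)

# Toy episode of three periods: (y_t, w_{t-1}, mu_t) triples.
episode = [
    ([1.0, 1.02, 0.99], [0.0, 0.5, 0.5], 0.9975),
    ([1.0, 0.99, 1.03], [0.0, 0.4, 0.6], 0.9990),
    ([1.0, 1.01, 1.00], [0.2, 0.4, 0.4], 1.0000),
]
R = average_log_return(episode)
# Exponentiating t_f * R recovers the total portfolio growth p_f / p_0.
growth = math.exp(len(episode) * R)
```

Because the rewards are logarithmic, their sum telescopes back into the product of per-period growth factors of Equation (11), which is what makes the per-period terms directly usable as mini-batch training signals.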
A policy is a mapping from the state space to the action space, \pi : S \to A. With full exploitation in the current framework, an action is deterministically produced by the policy from a state. The optimal policy is obtained using a gradient ascent algorithm. To achieve this, a policy is specified by a set of parameters \theta, and a_t = \pi_\theta(s_t). The performance metric of \pi_\theta for the time interval [0, t_f] is defined as the corresponding reward function (21) of the interval,

J_{[0, t_f]}(\pi_\theta) = R(s_1, \pi_\theta(s_1), \cdots, s_{t_f}, \pi_\theta(s_{t_f}), s_{t_f+1}).   (23)

After random initialisation, the parameters are continuously updated along the gradient direction with a learning rate \lambda,

\theta \longrightarrow \theta + \lambda \nabla_\theta J_{[0, t_f]}(\pi_\theta).   (24)

To improve training efficiency and avoid machine-precision errors, \theta will be updated upon mini-batches instead of the whole training market history. If the time range of a mini-batch is [t_{b_1}, t_{b_2}], the updating rule for the batch is

\theta \longrightarrow \theta + \lambda \nabla_\theta J_{[t_{b_1}, t_{b_2}]}(\pi_\theta),   (25)

with the denominator in the corresponding R defined in (21) replaced by t_{b_2} - t_{b_1}. This mini-batch approach of gradient ascent also allows online learning, which is important in online trading, where new market history keeps coming to the agent. Details of the online learning and mini-batch training will be discussed in Section 5.3.
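The update rules (24)–(25) amount to plain gradient ascent on J. The toy sketch below maximizes the zero-cost version of (21) for a deliberately simplified policy, a constant portfolio vector parameterized through a softmax, using a finite-difference gradient instead of backpropagation; everything here (function names, the constant policy, \mu_t = 1) is an illustrative assumption, not the paper's network training.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(theta, y):
    """Reward (21) for a constant policy w = softmax(theta), with
    transaction costs ignored (mu_t = 1) -- a deliberately tiny toy."""
    w = softmax(theta)
    return np.mean(np.log(y @ w))

def ascend(theta, y, lam=0.1, steps=200, eps=1e-6):
    """Gradient ascent (24), with a forward-difference gradient estimate."""
    n = len(theta)
    for _ in range(steps):
        grad = np.array([(J(theta + eps * np.eye(n)[i], y) - J(theta, y)) / eps
                         for i in range(n)])
        theta = theta + lam * grad          # theta <- theta + lambda * grad J
    return theta

rng = np.random.default_rng(0)
y = 1 + 0.01 * rng.standard_normal((100, 3))  # synthetic price relatives
theta_opt = ascend(np.zeros(3), y)
```

In the paper's setting the same ascent is performed with analytic gradients through the policy network, batch by batch, which is what makes online updates cheap.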
5. Policy Networks
The policy functions \pi_\theta will be constructed using three different deep neural networks. The neural networks in this paper differ from a previous version (Jiang and Liang, 2017) by three important innovations: the mini-machine topology invented to target the portfolio management problem, the portfolio-vector memory, and a stochastic mini-batch online learning scheme.

The three incarnations of neural networks to build up the policy functions are a CNN, a basic RNN, and an LSTM. Figure 2 shows the topology of a CNN designed for solving the current portfolio management problem, while Figure 3 portrays the structure of a basic RNN or LSTM network for the same problem. In all cases, the input to the networks is the price tensor X_t defined in (18), and the output is the portfolio vector w_t. In both figures, a hypothetical example of an output portfolio vector is used, while the dimension of the price tensor, and thus the number of assets, are the actual values deployed in the experiments. The last hidden layers are the voting scores for all non-cash assets. The softmax outcomes of these scores and a cash bias become the actual corresponding portfolio weights. In order for the neural network to consider transaction cost, the portfolio vector from the last period, w_{t-1}, is inserted into the networks just before the voting layer. The actual mechanism of storing and retrieving portfolio vectors in a parallel manner is presented in Section 5.2.

A vital common feature in all three networks is that the networks flow independently for the m assets, while network parameters are shared among these streams. These streams are like independent but identical networks of smaller scopes, separately observing and assessing individual non-cash assets. They only interconnect at the softmax function, just to make sure their output weights are non-negative and sum up to unity.
We call these streams mini-machines or, more formally, Identical Independent Evaluators (IIE), and this topology feature the Ensemble of IIE (EIIE), nicknamed the mini-machine approach, to distinguish it from the integrated approach in an earlier attempt (Jiang and Liang, 2017). EIIE is realized differently in Figures 2 and 3. An IIE in Figure 2 is just a chain of convolutions with kernels of height 1, while in Figure 3 it is either an LSTM or a basic RNN taking the price history of a single asset as input.

EIIE greatly improves the performance of the portfolio management. Remembering the historic performance of individual assets, an integrated network in the previous version is more reluctant to invest money in a historically unfavorable asset, even if the asset has a much more promising future. On the other hand, without being designed to reveal the identity of the assigned asset, an IIE is able to judge its potential rise and fall merely based on more recent events.

From a practical point of view, EIIE has three other crucial advantages over an integrated network. The first is scalability in asset number. Having the mini-machines all identical with shared parameters, the training time of an ensemble scales roughly linearly with m. The second advantage is data-usage efficiency. For an interval of price history, a mini-machine can be trained m times across different assets. Asset-assessing experience of the IIEs is then shared and accumulated in both the time and asset dimensions. The final advantage is plasticity to the asset collection. Since an IIE's asset-assessing ability is universal, without being restricted to any particular assets, an EIIE can update its choice of assets and/or the size of the portfolio in real time, without having to train the network again from scratch.
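Schematically, an EIIE applies one shared evaluator to every asset row and only joins the streams at the softmax with a cash bias. The sketch below uses a hypothetical linear evaluator standing in for the convolutional or recurrent mini-machines; the names and numbers are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def eiie_forward(X, score, cash_bias=0.0):
    """Ensemble of Identical Independent Evaluators, schematically.

    X     : (m, n) price history, one row per non-cash asset
    score : a single evaluator applied to every row (shared parameters)
    Returns the (m+1,) portfolio vector, cash weight first.
    """
    votes = np.array([score(X[i]) for i in range(X.shape[0])])  # identical IIEs
    # Streams only meet here: softmax makes the weights non-negative, summing to 1
    return softmax(np.concatenate(([cash_bias], votes)))

# Hypothetical linear evaluator shared across all assets
w_shared = np.array([0.2, 0.3, 0.5])
score = lambda row: row @ w_shared

X = np.array([[1.01, 1.02, 0.99],
              [0.98, 1.00, 1.01]])
w = eiie_forward(X, score)
```

Because `score` is the same object for every row, adding or removing an asset only changes the number of rows, not the parameters, which is the plasticity property described above.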
Figure 2: CNN Implementation of the EIIE: This is a realization of the Ensemble of Identical Independent Evaluators (EIIE), a fully convolutional network. The first dimensions of all the local receptive fields in all feature maps are 1, making all rows isolated from each other until the softmax activation. Apart from weight-sharing among receptive fields in a feature map, which is a usual CNN characteristic, parameters are also shared between rows in an EIIE configuration. Each row of the entire network is assigned a particular asset, and is responsible for submitting a voting score to the softmax on the growing potential of the asset in the coming trading period. The input to the network is a 3 × m × n price tensor, comprising the highest, closing, and lowest prices of m non-cash assets over the past n periods. The outputs are the new portfolio weights. The previous portfolio weights are inserted as an extra feature map before the scoring layer, for the agent to minimize transaction cost.

In order for the portfolio management agent to minimize transaction cost by restraining itself from large changes between consecutive portfolio vectors, the output of portfolio weights from the previous trading period is input to the networks. One way to achieve this is to rely on the remembering ability of an RNN, but with this approach the price-normalization scheme proposed in (18) has to be abandoned, and this normalization scheme empirically performs better than others. Another possible solution is the Recurrent Reinforcement (RR) approach introduced by Moody and Saffell (2001). However, both RR and RNN memory suffer from the gradient vanishing problem.
More importantly, RR and RNN require serialization of the training process, and are unable to utilize parallel training within mini-batches.

In this work, inspired by the idea of experience replay memory (Mnih et al., 2016), a dedicated Portfolio-Vector Memory (PVM) is introduced to store the network outputs. As shown in Figure 4, the PVM is a stack of portfolio vectors in chronological order. Before any network training, the PVM is initialized with uniform weights. In each training step, a policy network loads the portfolio vector of the previous period from the memory location at t − 1, and overwrites the memory at t with its output. As the parameters of the policy networks converge through many training epochs, the values in the memory also converge.

Sharing a single memory stack allows a network to be trained simultaneously against data points within a mini-batch, enormously improving training efficiency.
Figure 3: RNN (Basic RNN or LSTM) Implementation of the EIIE: This is a recurrent realization of the Ensemble of Identical Independent Evaluators (EIIE). In this version, the price inputs of individual assets are taken by small recurrent subnets. These subnets are identical LSTMs or basic RNNs. The structure of the ensemble network after the recurrent subnets is the same as the second half of the CNN in Figure 2.

(a) Mini-Batch Viewpoint    (b) Network Viewpoint
Figure 4: A Read/Write Cycle of the Portfolio-Vector Memory: In both graphs, a small vertical strip on the time axis represents a portion of the memory containing the portfolio weights at the beginning of a period. Red memories are being read by the policy network, while blue ones are being overwritten by the network. The two colored rectangles in (a), each consisting of four strips, are examples of two consecutive mini-batches. While (a) exhibits a complete read-and-write cycle for a mini-batch, (b) shows a cycle within a network (omitting the CNN or RNN part of the network).

In the case of the RNN versions of the networks, inserting the last outputs after the recurrent blocks (Figure 3) avoids passing the gradients back to the deep RNN structures, circumventing the gradient vanishing problem.
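The read/write cycle of the PVM can be sketched as a plain array of portfolio vectors, read at t − 1 and overwritten at t; the class name and sizes below are illustrative assumptions, not the paper's code.

```python
import numpy as np

class PortfolioVectorMemory:
    """A minimal sketch of the PVM: a chronological stack of portfolio
    vectors, initialised uniformly, read at t-1 and overwritten at t."""

    def __init__(self, n_periods, n_assets):
        # uniform weights over cash + non-cash assets before any training
        self.memory = np.full((n_periods, n_assets), 1.0 / n_assets)

    def read(self, t):
        return self.memory[t - 1].copy()   # w_{t-1}, fed into the network

    def write(self, t, w):
        self.memory[t] = w                 # overwrite with the network output

pvm = PortfolioVectorMemory(n_periods=10, n_assets=12)
w_prev = pvm.read(5)                       # load w_4
pvm.write(5, np.eye(12)[0])                # store the new w_5
```

Because every training step only reads a stored vector instead of re-running the network on the whole preceding history, all data points of a mini-batch can be evaluated in parallel.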
With the introduction of the network output memory, mini-batch training becomes plausible, although the learning framework requires sequential inputs. Unlike supervised learning, where data points are unordered and mini-batches are random disjoint subsets of the training sample space, in this training scheme the data points within a batch have to be in their time order. In addition, since the data sets are time series, mini-batches starting with different periods are considered valid and distinctive, even if they have a significantly overlapping interval. For example, if the uniform batch size is n_b, data sets covering [t_b, t_b + n_b) and [t_b + 1, t_b + n_b + 1) are two validly different batches.

The ever-ongoing nature of financial markets means new data keeps pouring into the agent, and as a consequence the size of the training sample explodes indefinitely. Fortunately, it is believed that the correlation between two market price events decays exponentially with the temporal distance between them (Holt, 2004; Charles et al., 2006). With this belief, an Online Stochastic Batch Learning (OSBL) scheme is proposed here.

At the end of the t-th period, the price movement of this period will be added to the training set. After the agent has completed its orders for period t + 1, the policy network will be trained against N_b randomly chosen mini-batches from this set. A batch starting with period t_b \leq t - n_b is picked with a geometrically distributed probability P_\beta(t_b),

P_\beta(t_b) = \beta (1 - \beta)^{t - t_b - n_b},   (26)

where \beta \in (0, 1) is the probability-decaying rate determining the shape of the probability distribution and how important recent market events are, and n_b is the number of periods in a mini-batch.
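Sampling a batch start according to Equation (26) reduces to drawing a geometric random variable: if k takes values 1, 2, ... with probability \beta(1-\beta)^{k-1}, then t_b = t - n_b - (k - 1) has exactly the law (26). The sketch below makes that concrete; the truncation at the start of recorded history, the function name, and the illustratively large \beta are assumptions for the demo.

```python
import numpy as np

def sample_batch_start(t, n_b, beta, rng):
    """Sample a mini-batch start t_b with the geometric law of Equation (26):
    P(t_b) = beta * (1 - beta)**(t - t_b - n_b), favouring recent periods."""
    k = rng.geometric(beta)       # k = 1, 2, ... with prob beta * (1-beta)**(k-1)
    t_b = t - n_b - (k - 1)
    return max(t_b, 0)            # truncate at the start of recorded history

rng = np.random.default_rng(42)
# beta here is made large so the demo visibly clusters near the present;
# the framework's actual beta is a hyper-parameter (Table B.1)
starts = [sample_batch_start(t=1000, n_b=50, beta=0.05, rng=rng) for _ in range(100)]
```

Each online step then trains the network on N_b such batches, so recent periods are revisited most often while the distant past is still occasionally sampled.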
6. Experiments
The tools developed up to this point in the article are examined in three back-test experiments of different time frames, with all three policy networks, on the cryptocurrency exchange Poloniex. Results are compared with many well-established and recently published portfolio-selection strategies. The main financial metric compared is the portfolio value, along with the maximum drawdown and the Sharpe ratio.
Details of the time ranges for the back-test experiments and their corresponding training sets are presented in Table 1. A cross-validation set is used for the determination of the hyper-parameters, and its time range is also listed. All times in the table are in Coordinated Universal Time (UTC). All training sets start at 0 o'clock. For example, the training set for Back-Test 1 starts from 00:00 on November 1st, 2014. All price data is accessed with Poloniex's official Application Programming Interface (API).
4. https://poloniex.com/support/api/

Data Purpose  | Data Range                           | Training Data Set
CV            | 2016-05-07 04:00 to 2016-06-27 08:00 | 2014-07-01 to 2016-05-07 04:00
Back-Test 1   | 2016-09-07 04:00 to 2016-10-28 08:00 | 2014-11-01 to 2016-09-07 04:00
Back-Test 2   | 2016-12-08 04:00 to 2017-01-28 08:00 | 2015-02-01 to 2016-12-08 04:00
Back-Test 3   | 2017-03-07 04:00 to 2017-04-27 08:00 | 2015-05-01 to 2017-03-07 04:00

Table 1: Price data ranges for hyperparameter selection (cross-validation, CV) and back-test experiments. Prices are accessed in periods of 30 minutes. Closing prices are used for cross-validation and back-tests, while highest, lowest, and closing prices in the periods are used for training. The hours of the starting points for the training sets are not given, since they begin at midnight of the days. All times are in UTC.
Different metrics are used to measure the performance of a particular portfolio-selection strategy. The most direct measurement of how successful a portfolio management is over a timespan is the accumulative portfolio value (APV), p_t. It is unfair, however, to compare the APVs of two managements starting with different initial values. Therefore, APVs here are measured in units of their initial values, or equivalently p_0 = 1 and thus

p_t = p_t / p_0.   (27)

In this unit, APV is closely related to the accumulated return, and in fact it only differs from the latter by 1. Under the same unit, the final APV (fAPV) is the APV at the end of a back-test experiment, p_f = p_f / p_0 = p_{t_f+1} / p_0.

A major disadvantage of APV is that it does not measure the risk factors, since it merely sums up all the periodic returns without considering fluctuation in these returns. A second metric, the Sharpe ratio (SR) (Sharpe, 1964, 1994), is used to take risk into account. The ratio is a risk-adjusted mean return, defined as the average of the excess returns divided by their standard deviation,

S = \frac{\mathbb{E}_t[\rho_t - \rho_F]}{\sqrt{\mathrm{var}_t(\rho_t - \rho_F)}},   (28)

where \rho_t are the periodic returns defined in (9), and \rho_F is the rate of return of a risk-free asset. In these experiments, the risk-free asset is Bitcoin. Because the quoted currency is also Bitcoin, the risk-free return is zero, \rho_F = 0, here.

Although the SR considers the volatility of the portfolio values, it treats upward and downward movements equally. In reality, upward volatility contributes to positive returns, but downward volatility to losses. In order to highlight downward deviation, the Maximum Drawdown (MDD) (Magdon-Ismail and Atiya, 2004) is also considered. MDD is the biggest loss from a peak to a trough, and mathematically

D = \max_{t,\ \tau > t} \frac{p_t - p_\tau}{p_t}.   (29)
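The three metrics (27)–(29) are straightforward to compute from a portfolio-value series; the sketch below is a minimal implementation with a toy series (function names and numbers are illustrative assumptions).

```python
import numpy as np

def fapv(p):
    """Final accumulated portfolio value in units of the initial value (27)."""
    return p[-1] / p[0]

def sharpe(rho, rho_f=0.0):
    """Sharpe ratio (28) on periodic returns rho; rho_f = 0 here because the
    quote currency (Bitcoin) is itself taken as the risk-free asset."""
    excess = rho - rho_f
    return excess.mean() / excess.std()

def mdd(p):
    """Maximum drawdown (29): the biggest relative peak-to-trough loss."""
    peaks = np.maximum.accumulate(p)        # running peak up to each period
    return ((peaks - p) / peaks).max()

p = np.array([1.0, 1.2, 0.9, 1.5, 1.3, 2.0])  # toy portfolio-value series
rho = p[1:] / p[:-1] - 1                      # periodic returns, as in (9)
```

On this toy series the drawdown comes from the 1.2 → 0.9 dip, i.e. a loss of 0.25 of the running peak, while the final value is twice the initial one.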
Table 2: Performances of the three EIIE (Ensemble of Identical Independent Evaluators) neural networks, an integrated network, and some traditional portfolio-selection strategies in three different back-test experiments (in UTC, detailed time ranges listed in Table 1) on the cryptocurrency exchange Poloniex. The performance metrics are the Maximum Drawdown (MDD), the final Accumulated Portfolio Value (fAPV) in units of the initial portfolio amount (p_f / p_0), and the Sharpe ratio (SR). The bold algorithms are the EIIE networks introduced in this paper, named after the underlying structures of their IIEs. For example, bRNN is the EIIE of Figure 3 using basic RNN evaluators. Three benchmarks (italic), the integrated CNN (iCNN) previously proposed by the authors (Jiang and Liang, 2017), and some recently reviewed strategies^a (Li et al., 2015a; Li and Hoi, 2014) are also tested. The algorithms in the table are divided into five categories: the model-free neural networks, the benchmarks, follow-the-loser strategies, follow-the-winner strategies, and pattern-matching or other strategies. The best performance in each column is highlighted with boldface. All three EIIEs significantly outperform all other algorithms in the fAPV and SR columns, showing the profitability and reliability of the EIIE machine-learning solution to the portfolio management problem.

a. The exceptions are RMR of Huang et al. (2013) and WMAMR of Gao and Zhang (2013).

The performances of all three EIIE policy networks proposed in the current paper will be compared to those of the integrated CNN (iCNN) (Jiang and Liang, 2017), several well-known or recently published model-based strategies, and three benchmarks.
The three benchmarks are the Best Stock, the asset with the highest fAPV over the back-test interval; the Uniform Buy and Hold (UBAH), a portfolio-management approach simply spreading the total fund equally into the preselected assets and holding them, without making any purchases or sales until the end (Li and Hoi, 2014); and the Uniform Constant Rebalanced Portfolios (UCRP) (Kelly, 1956; Cover, 1991).

Most of the strategies to be compared in this work were surveyed by Li and Hoi (2014), including Anticor (Borodin et al., 2004), Online Moving Average Reversion (OLMAR) (Li et al., 2015b), Passive Aggressive Mean Reversion (PAMR) (Li et al., 2012), Confidence Weighted Mean Reversion (CWMR) (Li et al., 2013), Online Newton Step (ONS) (Agarwal et al., 2006), Universal Portfolios (UP) (Cover, 1991), Exponential Gradient (EG) (Helmbold et al., 1998), the Nonparametric Kernel-Based Log-Optimal Strategy (B^K) (Györfi et al., 2006), the Correlation-driven Nonparametric Learning Strategy (CORN) (Li et al., 2011), and M0 (Borodin et al., 2000), except Weighted Moving Average Mean Reversion (WMAMR) (Gao and Zhang, 2013) and Robust Median Reversion (RMR) (Huang et al., 2013).

Table 2 shows the performance scores fAPV, SR, and MDD of the EIIE policy networks as well as of the compared strategies for the three back-test intervals listed in Table 1. In terms of fAPV or SR, the best-performing algorithm in Back-Tests 1 and 2 is the CNN EIIE, whose final wealth is more than twice that of the runner-up in the first experiment. The top three positions in these two measures in all back-tests are occupied by the three EIIE networks, which lose only in the MDD measure. This result demonstrates the powerful profitability and consistency of the current EIIE machine-learning framework.

When only considering fAPV, all three EIIEs outperform the best assets in all three back-tests, while the only model-based algorithm that does so is RMR, on the single occasion of Back-Test 3. Because of the high commission rate of 0.
25% and the relatively high half-hourly trading frequency, many traditional strategies perform badly. Especially in Back-Test 1, all model-based strategies have negative returns, with fAPVs less than 1, or equivalently negative SRs. On the other hand, the EIIEs are able to achieve at least 4-fold returns in 50 days under different market conditions.

Figures 5, 6 and 7 plot the APV against time in the three back-tests, respectively, for the CNN and bRNN EIIE networks, two selected benchmarks, and two model-based strategies. The benchmarks Best Stock and UCRP are two good representatives of the market. In all three experiments, both the CNN and bRNN EIIEs beat the market throughout the entirety of the back-tests, while the traditional strategies are only able to achieve that in the second half of Back-Test 3 and very briefly elsewhere.
7. Conclusion
This article proposed an extensible reinforcement-learning framework solving the general financial portfolio management problem. Invented to cope with multi-channel market inputs and to directly output portfolio weights as the market actions, the framework can be fitted with different deep neural networks, and is linearly scalable with the portfolio size. This scalability and extensibility are the results of the EIIE meta topology, which is able to accommodate many types of weight-sharing neural-net structures in the lower level.

Figure 5: Back-Test 1: 2016-09-07 04:00 to 2016-10-28 08:00 (UTC). Accumulated portfolio values (APV, p_t / p_0) over the interval of Back-Test 1 for the CNN and basic RNN EIIEs, the Best Stock, the UCRP, RMR, and ONS are plotted in log-10 scale here. The two EIIEs are leading throughout the entire time span, growing consistently, with only a few drawdown incidents.

To take transaction cost into account when training the policy networks, the framework includes a portfolio-weight memory, the PVM, allowing the portfolio-management agent to learn to restrain from oversized adjustments between consecutive actions, while avoiding the gradient vanishing problem faced by many recurrent networks. The PVM also allows parallel training within mini-batches, beating recurrent approaches in learning efficiency on the transaction-cost problem. Moreover, the OSBL scheme governs the online learning process, so that the agent can continuously digest constantly incoming market information while trading. Finally, the agent was trained using a fully exploiting deterministic policy-gradient method, aiming to maximize the accumulated wealth as the reinforcement reward function.

The profitability of the framework surpasses all surveyed traditional portfolio-selection methods, as demonstrated in the paper by the outcomes of three back-test experiments over different periods in a cryptocurrency market.
In these experiments, the framework was realized using three different underlying networks: a CNN, a basic RNN and an LSTM. All three versions performed better in final accumulated portfolio value than the other trading algorithms in the comparison. The EIIE networks also monopolized the top three positions in the risk-adjusted score in all three tests, indicating the consistency of the framework in its performance. Another deep reinforcement-learning solution, previously introduced by the authors, was assessed and compared as well under the same settings; it too lost to the EIIE networks, proving that the EIIE framework is a major improvement over its more primitive cousin.

Figure 6: Back-Test 2: 2016-12-08 04:00 to 2017-01-28 08:00 (UTC), log-scale accumulated wealth. This is the worst of the three back-test experiments for the EIIEs. However, they are able to steadily climb up until the end of the test.

Among the three EIIE networks, the LSTM had much lower scores than the CNN and the basic RNN. The significant gap in performance between the two RNN species under the same framework might be an indicator of the well-known secret of financial markets, that history repeats itself. Not being designed to forget its input history, a vanilla RNN is more able than an LSTM to exploit repetitive patterns in price movements for higher yields. The gap might also be due to a lack of fine-tuning of the hyper-parameters for the LSTM; in the experiments, the same set of structural hyper-parameters was used for both the basic RNN and the LSTM.

Despite the success of the EIIE framework in the back-tests, there is room for improvement in future works. The main weakness of the current work is the assumptions of zero market impact and zero slippage.
In order to consider market impact and slippage, a large amount of well-documented real-world trading examples will be needed as training data. Some protocol will have to be invented for documenting trade actions and market reactions. If that is accomplished, live trading experiments of the auto-trading agent in its current version can be recorded, for a future version to learn the principles behind market impact and slippage from this recorded history. Another shortcoming of the work is that the framework has only been tested in one market. To test its adaptability, the current and later versions will need to be examined in back-tests and live trading in more traditional financial markets. In addition, the current reward function will have to be amended, if not abandoned, for the reinforcement-learning agent to include awareness of longer-term market reactions. This may be achieved by a critic network. However, the backbone of the current framework, including the EIIE meta topology, the PVM, and the OSBL scheme, will continue to play important roles in future versions.

Figure 7: Back-Test 3: 2017-03-07 04:00 to 2017-04-27 08:00 (UTC), log-scale accumulated wealth. All algorithms struggle and consolidate at the beginning of this experiment, and both of the EIIEs experience two major dips, on March 15 and April 9. These dives contribute to their high Maximum Drawdowns (Table 2). Nevertheless, this is the best month for both EIIEs in terms of final wealth.
Appendix A. Proof of Theorem 1
In order to prove Theorem 1, it is handy to have the following five lemmas.
Lemma A.1  The function f(\mu) in Theorem 1 is monotonically increasing. In other words, f(\mu_1) > f(\mu_2) if \mu_1 > \mu_2.

Proof  Recall from Section 2.3 that

f(\mu) := \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i} - \mu w_{t,i})^+ \right].

The linear rectifier (x)^+ ((x)^+ = x if x > 0, (x)^+ = 0 otherwise) is monotonically increasing, so each term (w'_{t,i} - \mu w_{t,i})^+ is non-increasing in \mu (since w_{t,i} \geq 0); these terms enter f with a negative coefficient, which readily implies that f(\mu) is monotonically increasing.

Lemma A.2  f(0) > 0.

Proof  Using the fact that w_{t,0}, w'_{t,0} \in [0, 1],

f(0) = \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i})^+ \right]
     = \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p)(1 - w'_{t,0}) \right]
     \geq 1 - c_p - c_s + c_s c_p > 0,

for c_p, c_s < 1. Hence f(0) > 0.

Lemma A.3  f(1) \leq 1.

Proof
The proof is split into two cases. The fact that c_s, c_p \in [0, 1) implies (c_s + c_p - c_s c_p) > 0.

Case 1: w'_{t,0} \geq w_{t,0}. Since 1 - c_p w_{t,0} > 0,

f(1) = \frac{1 - c_p w'_{t,0}}{1 - c_p w_{t,0}} - \frac{c_s + c_p - c_s c_p}{1 - c_p w_{t,0}} \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+ \leq \frac{1 - c_p w'_{t,0}}{1 - c_p w_{t,0}} \leq 1.

Case 2: w'_{t,0} < w_{t,0}. This will be proved by contradiction. Assuming f(1) > 1,

1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+ > 1 - c_p w_{t,0}.

Bringing the two w's together,

c_p (w_{t,0} - w'_{t,0}) > (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+.   (A.1)

Noting that w'_{t,0} + \sum_{i=1}^m w'_{t,i} = 1 = w_{t,0} + \sum_{i=1}^m w_{t,i},

w_{t,0} - w'_{t,0} = \left( 1 - \sum_{i=1}^m w_{t,i} \right) - \left( 1 - \sum_{i=1}^m w'_{t,i} \right) = \sum_{i=1}^m \left( w'_{t,i} - w_{t,i} \right).

Using the identity (a - b)^+ - (b - a)^+ = a - b, (A.1) becomes

c_p \left[ \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+ - \sum_{i=1}^m (w_{t,i} - w'_{t,i})^+ \right] > (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+.

Moving the (w'_{t,i} - w_{t,i})^+ terms to the right-hand side,

- c_p \sum_{i=1}^m (w_{t,i} - w'_{t,i})^+ > c_s (1 - c_p) \sum_{i=1}^m (w'_{t,i} - w_{t,i})^+.   (A.2)

The left-hand side of (A.2) is a non-positive number, and the right-hand side is a non-negative number. The former cannot be greater than the latter, arriving at a contradiction. Therefore, f(1) \leq 1.

Lemma A.4  The sequence \{\tilde\mu_t^{(k)}\}, defined as

\left\{ \tilde\mu_t^{(k)} \,\middle|\, \tilde\mu_t^{(0)} = 0 \text{ and } \tilde\mu_t^{(k)} = f\!\left( \tilde\mu_t^{(k-1)} \right),\ k \in \mathbb{N} \right\},

converges to \mu_t.

Proof
This is a special case of the final goal, Theorem 1, when \mu_\odot = 0. The convergence is proved by the Monotone Convergence Theorem (MCT) (Rudin, 1976, Chapter 5). The monotonicity of f by Lemma A.1, with mathematical induction, establishes an upper bound for \{\tilde\mu_t^{(k)}\}:

\tilde\mu_t^{(0)} = 0 < \mu_t; \quad \text{if } \tilde\mu_t^{(k-1)} \leq \mu_t, \text{ then } \tilde\mu_t^{(k)} = f\!\left( \tilde\mu_t^{(k-1)} \right) \leq f(\mu_t) = \mu_t \ \Longrightarrow\ \tilde\mu_t^{(k)} \leq \mu_t, \ \forall k.

Note that by definition \mu_t is the transaction remainder factor, and 0 < \mu_t \leq 1. The monotonicity of the sequence \{\tilde\mu_t^{(k)}\} itself can also be proved by mathematical induction and Lemma A.2:

\tilde\mu_t^{(1)} = f(0) > 0 = \tilde\mu_t^{(0)}; \quad \text{if } \tilde\mu_t^{(k-1)} > \tilde\mu_t^{(k-2)}, \text{ then } \tilde\mu_t^{(k)} = f\!\left( \tilde\mu_t^{(k-1)} \right) > f\!\left( \tilde\mu_t^{(k-2)} \right) = \tilde\mu_t^{(k-1)} \ \Longrightarrow\ \tilde\mu_t^{(k)} > \tilde\mu_t^{(k-1)}, \ \forall k.

If \tilde\mu_t^{(k)} = \tilde\mu_t^{(k-1)} for some k, then \tilde\mu_t^{(k)} is the solution of Equation (14), and the proof ends here. Otherwise, the sequence \{\tilde\mu_t^{(k)}\} is strictly increasing and bounded above by \mu_t. In that case, by the MCT, \lim_{k \to \infty} \tilde\mu_t^{(k)} = \mu^*, where \mu^* is the least upper bound of \{\tilde\mu_t^{(k)}\}. As a result,

0 = \lim_{k \to \infty} \left( \tilde\mu_t^{(k+1)} - \tilde\mu_t^{(k)} \right) = \lim_{k \to \infty} \left( f\!\left( \tilde\mu_t^{(k)} \right) - \tilde\mu_t^{(k)} \right) = f(\mu^*) - \mu^*.

Therefore, \mu^* is the solution to Equation (14), and hence \lim_{k \to \infty} \tilde\mu_t^{(k)} = \mu_t.

Lemma A.5  The sequence \{\tilde\mu_t^{(k)}\}, defined as

\left\{ \tilde\mu_t^{(k)} \,\middle|\, \tilde\mu_t^{(0)} = 1 \text{ and } \tilde\mu_t^{(k)} = f\!\left( \tilde\mu_t^{(k-1)} \right),\ k \in \mathbb{N} \right\},

converges to \mu_t.

Proof
The proof is similar to that of Lemma A.4, using mathematical induction and the MCT. The sequence is monotonically decreasing and bounded below by \mu_t. The monotonicity is a result of Lemma A.3, and the boundedness is by Lemma A.1.

With the previous two lemmas, we are in a good position to prove the general convergence theorem. Recall Theorem 1 from Section 2.3:

Theorem 1  Denoting

f(\mu) := \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^m (w'_{t,i} - \mu w_{t,i})^+ \right],

the sequence \{\tilde\mu_t^{(k)}\}, defined as

\left\{ \tilde\mu_t^{(k)} \,\middle|\, \tilde\mu_t^{(0)} = \mu_\odot \text{ and } \tilde\mu_t^{(k)} = f\!\left( \tilde\mu_t^{(k-1)} \right),\ k \in \mathbb{N} \right\},   (15)

converges to \mu_t, the solution to Equation (14), for any \mu_\odot \in [0, 1].

Proof
It is proved in three cases:
Case 1: \mu_\odot = \mu_t, 0, or 1. This case is trivial, as when \mu_\odot = \mu_t it is the solution of \mu = f(\mu), and the sequence will obviously converge. The convergence to \mu_t for the other two values of \mu_\odot is guaranteed by Lemmas A.4 and A.5.

Case 2: 0 < \mu_\odot < \mu_t. Construct a sequence \{\hat\mu_t^{(k)}\} using \mu_\odot = 0,

\left\{ \hat\mu_t^{(k)} \,\middle|\, \hat\mu_t^{(0)} = 0 \text{ and } \hat\mu_t^{(k)} = f\!\left( \hat\mu_t^{(k-1)} \right),\ k \in \mathbb{N} \right\}.

By the proof of Lemma A.4, \{\hat\mu_t^{(k)}\} is strictly increasing and bounded above by \mu_t, so there is a j \in \mathbb{N} such that

\hat\mu^{(j)} \leq \mu_\odot \leq \hat\mu^{(j+1)}.

If either of the above equalities holds, the two sequences coincide from j + 1 onward, converging to \mu_t, and the proof ends here. Otherwise,

\hat\mu^{(j)} < \mu_\odot < \hat\mu^{(j+1)}.

Using the monotonicity of f(\mu) by Lemma A.1, these inequalities become

\hat\mu^{(j+1)} = f\!\left( \hat\mu^{(j)} \right) < f(\mu_\odot) = \tilde\mu^{(1)} < f\!\left( \hat\mu^{(j+1)} \right) = \hat\mu^{(j+2)}.

Again, if one of the equalities holds, the proof ends here. This chain can go on indefinitely if no equality holds,

\hat\mu^{(j+k+1)} < \tilde\mu^{(k)} < \hat\mu^{(j+k+2)}.

By the Squeeze Theorem (Leithold, 1996),

\lim_{k \to \infty} \tilde\mu^{(k)} = \lim_{k \to \infty} \hat\mu^{(k)} = \mu_t.

Case 3: 1 > \mu_\odot > \mu_t. This case is proved in a similar way to Case 2, by constructing a sequence with \mu_\odot = 1 and making use of Lemma A.5.

In conclusion, for any initial value \mu_\odot \in [0, 1], \tilde\mu^{(k)} converges to \mu_t.

Appendix B. Hyper-Parameters

hyper-parameter            | value     | description
batch size                 | 50        | Size of mini-batch during training. (Section 5.3)
window size                | 50        | Number of columns (number of trading periods) in each input price matrix. (Section 3.2)
number of assets           | 12        | Total number of preselected assets (including the cash, Bitcoin). (Section 3.1)
trading period (second)    | 1800      | Time interval between two portfolio redistributions. (Section 2.1)
total steps                | 2 × 10^…  | Total number of steps for pre-training in the training set.
regularization coefficient | 10^−…     | The L2 regularization coefficient applied to network training.
learning rate              | 3 × 10^−… | Parameter α (i.e. the step size) of the Adam optimization (Kingma and Ba, 2014).
volume observation (day)   | 30        | The length of time during which trading volumes are used to preselect the portfolio assets. (Section 3.1)
commission rate            | 0.25%     | Rate of commission fee applied to each transaction. (Section 2.3)
rolling steps              | 30        | Number of online training steps for each period during cross-validation and back-tests.
sample bias                | 5 × 10^−… | Parameter of the geometric distribution used when selecting online training sample batches. (The β in Equation 26 of Section 5.3)

Table B.1: Hyper-parameters of the reinforcement-learning framework. They are chosen based on the networks' scores in the cross-validation set described in Table 1 of Section 6.1. Although these are the values used in the experiments of the paper, they are all adjustable in the framework. (The exponents of some values were lost in extraction and are marked with "…".)

The hyper-parameters and their values used in the back-test experiments of the paper are listed in Table B.1. These numbers are selected to maximize the network scores in the cross-validation time range (see Section 6.1). In order to avoid over-fitting, the cross-validation range and the back-tests do not overlap.

Different topologies of the IIEs were also tried on the cross-validation set, and it turned out that deeper network structures than those presented in Figures 2 and 3 did not improve scores on the set.

Bibliography
Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, pages 9–16. ACM, 2006.

Iddo Bentov, Charles Lee, Alex Mizrahi, and Meni Rosenfeld. Proof of activity: Extending Bitcoin's proof of work via proof of stake [extended abstract]. ACM SIGMETRICS Performance Evaluation Review, 42(3):34–37, 2014.

Joseph Bonneau, Andrew Miller, Jeremy Clark, Arvind Narayanan, Joshua A. Kroll, and Edward W. Felten. SoK: Research perspectives and challenges for Bitcoin and cryptocurrencies. In , pages 104–121. IEEE, 2015.

Allan Borodin, Ran El-Yaniv, and Vincent Gogan. On the competitive theory and practice of portfolio selection. In Latin American Symposium on Theoretical Informatics, pages 173–196. Springer, 2000.

Allan Borodin, Ran El-Yaniv, and Vincent Gogan. Can we learn to beat the best stock. Journal of Artificial Intelligence Research (JAIR), 21:579–594, 2004.

Charles D. Kirkpatrick II and Julie R. Dahlquist. Technical Analysis: The Complete Resource for Financial Market Technicians. ISBN-13: 978-0137059447, 2006.

Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

Thomas M. Cover. Universal data compression and portfolio selection. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, pages 534–538. IEEE, 1996.

James Cumming. An investigation into the use of reinforcement learning techniques within the algorithmic trading domain. Master's thesis, Imperial College London, United Kingdom, 2015.

Puja Das and Arindam Banerjee. Meta optimization and its application to portfolio selection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), page 1163, 2011. doi: 10.1145/2020408.2020588. URL http://dl.acm.org/citation.cfm?doid=2020408.2020588.

M. A. H. Dempster and V. Leemans. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30(3):543–552, 2006. ISSN 0957-4174. doi: 10.1016/j.eswa.2005.10.012. Intelligent Information Systems for Financial Engineering.

Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664, 2017.

Evan Duffield and Kyle Hagan. Darkcoin: Peer to peer cryptocurrency with anonymous blockchain transactions and an improved proof-of-work system. bitpaper.info, 2014.

Fabio D. Freitas, Alberto F. De Souza, and Ailson R. de Almeida. Prediction-based portfolio optimization model using neural networks. Neurocomputing, 72(10):2155–2170, 2009.

Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

Li Gao and Weiguo Zhang. Weighted moving average passive aggressive algorithm for online portfolio selection. In 2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), volume 1, pages 327–330. IEEE, 2013.

Reuben Grinberg. Bitcoin: An innovative alternative digital currency. Hastings Sci. & Tech. LJ, 4:159, 2012.

László Györfi, Gábor Lugosi, and Frederic Udina. Nonparametric kernel-based sequential investment strategies. Mathematical Finance, 16(2):337–357, 2006.

Robert A. Haugen. Modern Investment Theory. Prentice Hall, 1986.

J. B. Heaton, N. G. Polson, and Jan Hendrik Witte. Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry, 2016. ISSN 1526-4025. doi: 10.1002/ASMB.2209.

David P. Helmbold, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Charles C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5–10, 2004.

Dingjiang Huang, Junlong Zhou, Bin Li, Steven C. H. Hoi, and Shuigeng Zhou. Robust median reversion strategy for on-line portfolio selection. In IJCAI, pages 2006–2012, 2013.

Zhengyao Jiang and Jinjun Liang. Cryptocurrency portfolio management with deep reinforcement learning. In Proceedings of 2017 Intelligent Systems Conference. SAI Conferences, 2017. Preprint: arXiv:1612.01277 [cs.LG].

J. L. Kelly. A new interpretation of information rate. The Bell System Technical Journal, 35(4):917–926, July 1956. ISSN 0005-8580. doi: 10.1002/j.1538-7305.1956.tb03809.x.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, pages 1–13, December 2014. URL http://arxiv.org/abs/1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Louis Leithold. The Calculus 7. HarperCollins College Publishing, 1996.

B. Li, D. Sahoo, and S. C. H. Hoi. OLPS: A toolbox for online portfolio selection. Journal of Machine Learning Research (JMLR), 2015a.

Bin Li and Steven C. H. Hoi. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3):35, 2014.

Bin Li, Steven C. H. Hoi, and Vivekanand Gopalkrishnan. CORN: Correlation-driven nonparametric learning approach for portfolio selection. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):21, 2011.

Bin Li, Peilin Zhao, Steven C. H. Hoi, and Vivekanand Gopalkrishnan. PAMR: Passive aggressive mean reversion strategy for portfolio selection. Machine Learning, 87(2):221–258, May 2012. ISSN 0885-6125. doi: 10.1007/s10994-012-5281-z. URL http://link.springer.com/10.1007/s10994-012-5281-z.

Bin Li, Steven C. H. Hoi, Peilin Zhao, and Vivekanand Gopalkrishnan. Confidence weighted mean reversion strategy for online portfolio selection. ACM Transactions on Knowledge Discovery from Data (TKDD), 7(1):4, 2013.

Bin Li, Steven C. H. Hoi, Doyen Sahoo, and Zhi-Yong Liu. Moving average reversion strategy for on-line portfolio selection. Artificial Intelligence, 222:104–123, 2015b.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv, 2016. ISSN 1935-8237. doi: 10.1561/2200000006.

Andrew W. Lo, Harry Mamaysky, and Jiang Wang. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. The Journal of Finance, 55(4):1705–1770, 2000.

Malik Magdon-Ismail and Amir F. Atiya. Maximum drawdown. Risk Magazine, 17(1):99–102, 2004.

Harry M. Markowitz. Portfolio Selection: Efficient Diversification of Investments, volume 16. Yale University Press, 1968.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836. doi: 10.1038/nature14236.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, July 2001. ISSN 1045-9227. doi: 10.1109/72.935097.

John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470, 1998.

Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008.

Seyed Taghi Akhavan Niaki and Saeid Hoseinzade. Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International, 9(1):1, 2013. ISSN 2251-712X. doi: 10.1186/2251-712X-9-1.

Mihály Ormos and András Urbán. Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13(10):1587–1597, 2013.

L. Christopher G. Rogers and Stephen E. Satchell. Estimating variance from high, low and closing prices. The Annals of Applied Probability, pages 504–512, 1991.

Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 3rd edition, 1976. ISBN 9780070856134.

Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In 2012 21st International Conference on Pattern Recognition (ICPR), pages 3288–3291. IEEE, 2012.

William F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance, 19(3):425–442, 1964.

William F. Sharpe. The Sharpe ratio. The Journal of Portfolio Management, 21(1):49–58, 1994.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387–395, 2014.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Volodya Vovk and Chris Watkins. Universal portfolio selection. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 12–23. ACM, 1998.

Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.