AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks
Jingyuan Wang, Yang Zhang, Ke Tang, Junjie Wu*, Zhang Xiong
1. MOE Engineering Research Center of Advanced Computer Application Technology, School of Computer Science & Engineering, Beihang University, Beijing, China
2. Institute of Economics, School of Social Sciences, Tsinghua University, Beijing, China
3. Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, School of Economics and Management, Beihang University, Beijing, China
4. Beijing Advanced Innovation Center for BDBC, Beihang University, Beijing, China
* Corresponding author.
ABSTRACT
Recent years have witnessed the successful marriage of finance innovations and AI techniques in various finance applications including quantitative trading (QT). Despite great research efforts devoted to leveraging deep learning (DL) methods for building better QT strategies, existing studies still face serious challenges, especially from the finance side, such as balancing risk and return, resisting extreme loss, and interpreting strategies, which limit the application of DL-based strategies in real-life financial markets. In this work, we propose
AlphaStock, a novel reinforcement learning (RL) based investment strategy enhanced by interpretable deep attention networks, to address the above challenges. Our main contributions are summarized as follows: i) we integrate deep attention networks with a Sharpe ratio-oriented reinforcement learning framework to achieve a risk-return balanced investment strategy; ii) we suggest modeling interrelationships among assets to avoid selection bias and develop a cross-asset attention mechanism; iii) to the best of our knowledge, this work is among the first to offer an interpretable investment strategy using deep reinforcement learning models. Experiments on long-period U.S. and Chinese markets demonstrate the effectiveness and robustness of AlphaStock over diverse market states. It turns out that AlphaStock tends to select as winners the stocks with high long-term growth, low volatility, high intrinsic value, and that have been recently undervalued.

CCS CONCEPTS
• Applied computing → Economics; • Computing methodologies → Reinforcement learning; Neural networks.

KEYWORDS
Investment Strategy, Reinforcement Learning, Deep Learning, Interpretable Prediction
KDD '19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08...$15.00
https://doi.org/10.1145/3292500.3330647
ACM Reference Format:
Jingyuan Wang, Yang Zhang, Ke Tang, Junjie Wu, and Zhang Xiong. 2019. AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks. In The 25th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330647
1 INTRODUCTION
Given their ability to handle large-scale transactions and to offer rational decision-making, quantitative trading (QT) strategies have long been adopted in financial institutions and hedge funds, and have achieved spectacular successes. Traditional QT strategies are usually based on specific financial logics. For instance, the momentum phenomenon found by Jegadeesh and Titman in the stock market [14] was used to build momentum strategies. The mean reversion [20] proposed by Poterba and Summers believes that asset prices tend to move toward their averages over time, so the bias of asset prices from their means can be used to select investment targets. The multi-factor strategy [7] uses factor-based asset valuations to select assets. Most of these traditional QT strategies, though equipped with solid financial theories, can only leverage some specific characteristic of financial markets, and therefore might be vulnerable to complex markets with diverse states.
In recent years, deep learning (DL) has emerged as an effective way to extract multi-aspect characteristics from complex financial signals. Many supervised deep neural networks have been proposed in the literature to predict asset prices using various factors, such as frequency of prices [11], economic news [12], social media [27], and financial events [4, 5]. Deep neural networks are also adopted in reinforcement learning (RL) frameworks to enhance traditional shallow investment strategies [3, 6, 16]. Despite the rich studies above, applying DL to real-life financial markets still faces several challenges:
Challenge 1: Balancing return and risk.
Most existing supervised deep learning models in finance focus on price prediction without risk awareness, which is not in line with fundamental investment principles and may lead to suboptimal performance [8]. While some RL-based strategies [8, 17] have considered this problem, how to adopt state-of-the-art DL approaches into risk-return-balanced RL frameworks is not yet well studied.

Challenge 2: Modeling interrelationships among assets.
Many financial tools in the market can be used to derive risk-aware profits from the interrelationships among assets, such as hedging, arbitrage, and the BWSL strategy used in this work. However, existing DL/RL-based investment strategies have paid little attention to this important information.
Challenge 3: Interpreting investment strategies.
There is a long-standing argument that DL-based systems are "unexplainable black boxes" and therefore cannot be used in crucial applications like medicine, investment, and the military [9]. RL-based strategies with deep structures make it even worse. How to extract interpretable rules from DL-enabled strategies remains an open problem. In this paper, we propose
AlphaStock, a novel reinforcement learning based strategy using deep attention networks, to overcome the above challenges. AlphaStock is essentially a buying-winners-and-selling-losers (BWSL) strategy for stock assets. It consists of three components. The first is a
Long Short-Term Memory with History state Attention (LSTM-HA) network, which is used to extract asset representations from multiple time series. The second component is a
Cross-Asset Attention Network (CAAN), which fully models the interrelationships among assets as well as the asset price rising prior. The third is a portfolio generator, which gives the investment proportion of each asset according to the winner scores output by the attention networks. We use an RL framework to optimize our model towards a return-risk-balanced objective, i.e., maximizing the Sharpe ratio. In this way, the merit of representation learning via deep attention models and the merit of risk-return balance via Sharpe-ratio-targeted reinforcement learning are integrated naturally. Moreover, to gain interpretability for AlphaStock, we propose a sensitivity analysis method to unveil how our model selects an asset to invest in according to its multi-aspect features. Extensive experiments on long-period U.S. stock markets demonstrate that our AlphaStock strategy outperforms several state-of-the-art competitors in terms of a variety of evaluation measures. In particular, AlphaStock shows excellent adaptability to diverse market states (enabled by RL and the Sharpe ratio) and an exceptional ability for extreme loss control (enabled by CAAN). Extended experiments on Chinese stock markets further confirm the superiority of AlphaStock and its robustness. Interestingly, the interpretation analysis reveals that AlphaStock selects assets by following the principle of "selecting as winners the stocks with high long-term growth, low volatility, high intrinsic value, and that have been recently undervalued".
2 PRELIMINARIES
In this section, we first introduce the financial concepts used throughout this paper, and then formally define our problem.
Definition 1 (Holding Period).
A holding period is a minimum time unit for investing in an asset. We divide the time axis into sequential holding periods with a fixed length, such as one day or one month. We call the starting time of the $t$-th holding period the time $t$.

Definition 2 (Sequential Investment).
A sequential investment is a sequence of holding periods. For the $t$-th holding period, a strategy uses original capital to invest in assets at time $t$, and gets profits (which could be negative) at time $t+1$. The capital plus profits of the $t$-th holding period are used as the original capital of the $(t+1)$-th holding period.

Definition 3 (Asset Price).
The price of an asset is defined as a time series $p^{(i)} = \{p_1^{(i)}, p_2^{(i)}, \ldots, p_t^{(i)}, \ldots\}$, where $p_t^{(i)}$ denotes the price of asset $i$ at time $t$. In this work, we use a stock as the asset to describe our model, which could be extended to other types of assets by taking asset specificities and transaction rules into consideration.

Definition 4 (Long Position).
The long position is the trading operation that buys an asset at time $t$ and then sells it at $t+1$. The profit of a long position during the period from $t$ to $t+1$ for asset $i$ is $u_i \, (p_{t+1}^{(i)} - p_t^{(i)})$, where $u_i$ is the buying volume of asset $i$. In the long position, traders expect an asset to rise in price, so they buy the asset first and wait for the price to rise to earn profits.

Definition 5 (Short Position).
A short position is the trading operation that sells an asset at $t$ first and then buys it back at $t+1$. The profit of a short position during the period from $t$ to $t+1$ for asset $i$ is $u_i \, (p_t^{(i)} - p_{t+1}^{(i)})$, where $u_i$ is the selling volume of asset $i$. The short position is the reverse operation of the long position. Traders' expectation in a short position is that the price will drop, so they sell at a price higher than the price at which they buy it back later. In the stock market, a short position trader borrows stocks from a broker and sells them at $t$. At $t+1$, the trader buys the sold stocks back and returns them to the broker.

Definition 6 (Portfolio). Given an asset pool with $I$ assets, a portfolio is defined as a vector $\mathbf{b} = (b^{(1)}, \ldots, b^{(i)}, \ldots, b^{(I)})^\top$, where $b^{(i)}$ is the proportion of the investment on asset $i$, with $\sum_{i=1}^{I} b^{(i)} = 1$.

Definition 7 (Zero-Investment Portfolio). Assume we have a collection of portfolios $\{\mathbf{b}_{(1)}, \ldots, \mathbf{b}_{(j)}, \ldots, \mathbf{b}_{(J)}\}$. The investment on portfolio $\mathbf{b}_{(j)}$ is $M_{(j)}$, which is positive for a long position and negative for a short position. A zero-investment portfolio is a collection of portfolios that has a net total investment of zero when the portfolios are assembled. That is, for a zero-investment portfolio containing $J$ portfolios, the total investment $\sum_{j=1}^{J} M_{(j)} = 0$.

For instance, an investor may borrow $1,000 worth of stocks in one set of companies and sell them as a short position, and then use the proceeds of short selling to purchase $1,000 worth of stocks in another set of companies as a long position. The assembly of the long and short positions is a zero-investment portfolio. Note that while the name is "zero-investment", there still exists a budget constraint to limit the overall worth of stocks that can be borrowed from the broker. Also, we ignore real-world transaction costs for simplicity.
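As a concrete illustration of Definitions 4–7, the toy sketch below (with hypothetical prices and volumes, not data from the paper) computes the long and short profits and checks that the assembled portfolio has zero net investment:

```python
# Toy illustration (hypothetical numbers): long/short profits and the
# zero-net-investment property of an assembled long-short portfolio.
p_t  = {"A": 10.0, "B": 20.0}   # prices at time t
p_t1 = {"A": 11.0, "B": 19.0}   # prices at time t+1

# Short stock B: borrow and sell 50 shares at t -> proceeds 50 * 20 = 1000.
u_short = 50.0
proceeds = u_short * p_t["B"]

# Long stock A: spend the short-selling proceeds on shares of A at t.
u_long = proceeds / p_t["A"]    # 100 shares

# Net investment at t is zero: all long capital comes from short selling.
net_investment = u_long * p_t["A"] - proceeds
print(net_investment)  # 0.0

# Profits at t+1 (Definitions 4 and 5).
long_profit = u_long * (p_t1["A"] - p_t["A"])    # 100 * (11 - 10) = 100.0
short_profit = u_short * (p_t["B"] - p_t1["B"])  # 50 * (20 - 19) = 50.0
print(long_profit + short_profit)  # 150.0
```

Both legs are profitable here because A rose and B fell; as discussed below for the BWSL strategy, only the relative price movement of the two legs matters.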
In this paper, we adopt the buying-winners-and-selling-losers (BWSL) strategy for stock trading [14], the key of which is to buy the assets with high price rising rates (winners) and sell those with low price rising rates (losers). We execute the BWSL strategy as a zero-investment portfolio consisting of two portfolios: a long portfolio for buying winners and a short portfolio for selling losers. Given a sequential investment with $T$ periods, we denote the short portfolio for the $t$-th period as $\mathbf{b}_t^-$ and the long portfolio as $\mathbf{b}_t^+$, $t = 1, \ldots, T$. At time $t$, given a budget constraint $\tilde{M}$, we borrow the "loser" stocks from brokers according to the investment proportions in $\mathbf{b}_t^-$. The volume of stock $i$ that we can borrow is
$$u_t^{-(i)} = \tilde{M} \cdot b_t^{-(i)} / p_t^{(i)}, \quad (1)$$
where $b_t^{-(i)}$ is the proportion of stock $i$ in $\mathbf{b}_t^-$. Next, we sell the "loser" stocks we borrowed and get the money $\tilde{M}$. After that, we use $\tilde{M}$ to buy the "winner" stocks according to the long portfolio $\mathbf{b}_t^+$. The volume of stock $i$ that we can buy at time $t$ is
$$u_t^{+(i)} = \tilde{M} \cdot b_t^{+(i)} / p_t^{(i)}. \quad (2)$$
The money $\tilde{M}$ we use to buy winner stocks is the proceeds of short selling, so the net investment on the portfolio $\{\mathbf{b}_t^+, \mathbf{b}_t^-\}$ is zero. At the end of the $t$-th holding period, we sell the stocks in the long portfolio. The money we get is the proceeds of selling these stocks at the new prices at $t+1$, i.e.,
$$M_t^+ = \sum_{i=1}^{I} u_t^{+(i)} p_{t+1}^{(i)} = \sum_{i=1}^{I} \tilde{M} \cdot b_t^{+(i)} \frac{p_{t+1}^{(i)}}{p_t^{(i)}}. \quad (3)$$
Next, we buy the stocks in the short portfolio back and return them to the broker. The money we spend on buying the short stocks is
$$M_t^- = \sum_{i=1}^{I} u_t^{-(i)} p_{t+1}^{(i)} = \sum_{i=1}^{I} \tilde{M} \cdot b_t^{-(i)} \frac{p_{t+1}^{(i)}}{p_t^{(i)}}. \quad (4)$$
The ensemble profit earned by the long and short portfolios is $M_t = M_t^+ - M_t^-$. Let $z_t^{(i)} = p_{t+1}^{(i)} / p_t^{(i)}$ denote the price rising rate of stock $i$ in the $t$-th holding period.
Then, the rate of return of the ensemble portfolio is calculated as
$$R_t = \frac{M_t}{\tilde{M}} = \sum_{i=1}^{I} b_t^{+(i)} z_t^{(i)} - \sum_{i=1}^{I} b_t^{-(i)} z_t^{(i)}. \quad (5)$$

Insight I.
As shown in Eq. (5), a positive profit, i.e., $R_t > 0$, means the average price rising rate of the stocks in the long portfolio is higher than that in the short portfolio, i.e.,
$$\sum_{i=1}^{I} b_t^{+(i)} z_t^{(i)} > \sum_{i=1}^{I} b_t^{-(i)} z_t^{(i)}. \quad (6)$$
A profitable BWSL strategy must ensure that the stocks in the portfolio $\mathbf{b}^+$ have a higher average price rising rate than the stocks in $\mathbf{b}^-$. That is to say, even if the prices of all stocks in the market are falling, as long as we can ensure that the prices of the stocks in $\mathbf{b}^+$ fall more slowly than those in $\mathbf{b}^-$, we can still make profits. On the contrary, even if the prices of all stocks are rising, if the stocks in $\mathbf{b}^-$ rise faster than those in $\mathbf{b}^+$, our strategy still loses money. This characteristic implies that the absolute price rising or falling of stocks is not the main concern of our strategy; rather, the relative price relations among stocks are much more important. As a consequence, we must design a mechanism to describe the interrelationships of stock prices in our model for the BWSL strategy. In order to ensure that our strategy considers both the return and risk of an investment, we adopt the
Sharpe ratio, a risk-adjusted return measure developed by the Nobel laureate William F. Sharpe [21] in 1994, to measure the performance of our strategy.

[Figure 1: The framework of the AlphaStock model. The history states of each stock are encoded by a shared LSTM-HA network; the resulting representations feed the CAAN, whose winner scores are turned into portfolios by the portfolio generator.]

Definition 8 (Sharpe Ratio).
The Sharpe ratio is the average return in excess of the risk-free return per unit of volatility. Given a sequential investment that contains $T$ holding periods, its Sharpe ratio is calculated as
$$H_T = \frac{A_T - \Theta}{V_T}, \quad (7)$$
where $A_T$ is the average rate of return per period of the investment, $V_T$ is the volatility that is used to measure the risk of the investment, and $\Theta$ is a risk-free return rate, such as the return rate of a bank deposit. Given a sequential investment with $T$ holding periods, $A_T$ is calculated as
$$A_T = \frac{1}{T} \sum_{t=1}^{T} \left( R_t - TC_t \right), \quad (8)$$
where $TC_t$ is the transaction cost in the $t$-th period. The volatility $V_T$ in Eq. (7) is defined as
$$V_T = \sqrt{\frac{\sum_{t=1}^{T} (R_t - \bar{R}_t)^2}{T}}, \quad (9)$$
where $\bar{R}_t = \sum_{t=1}^{T} R_t / T$ is the average of $R_t$. For a $T$-period investment, the optimization objective of our strategy is to generate the long and short portfolio sequences $B^+ = \{\mathbf{b}_1^+, \ldots, \mathbf{b}_T^+\}$ and $B^- = \{\mathbf{b}_1^-, \ldots, \mathbf{b}_T^-\}$ that maximize the Sharpe ratio of the investment:
$$\arg\max_{\{B^+, B^-\}} H_T\left(B^+, B^-\right). \quad (10)$$

Insight II.
The Sharpe ratio evaluates the performance of a strategy from both profit and risk perspectives. This profit-risk balance characteristic requires our model not only to focus on maximizing the return rate $R_t$ of each period, but also to consider the long-term volatility of $R_t$ across all periods of an investment. In other words, designing a far-sighted, steady investment strategy is more valuable than a short-sighted strategy with short-term high profits.

3 THE ALPHASTOCK MODEL
In this section, we propose a reinforcement learning (RL) based model called
AlphaStock to implement the BWSL strategy with the Sharpe ratio defined in Eq. (7) as the optimization objective. As shown in Fig. 1, AlphaStock contains three components. The first component is an LSTM with History state Attention network (LSTM-HA). For each stock $i$, we use the LSTM-HA model to extract a stock representation $r^{(i)}$ from its history states $X^{(i)}$. The second component is a Cross-Asset Attention Network (CAAN) that describes the interrelationships among the stocks. The CAAN takes as input the representations $(r^{(i)})$ of all stocks and estimates a winner score $s^{(i)}$ for every stock, indicating the degree to which stock $i$ is a winner. The third component is a portfolio generator, which calculates the investment proportions in $\mathbf{b}^+$ and $\mathbf{b}^-$ according to the scores $(s^{(i)})$ of all stocks. We use reinforcement learning to optimize the three components end-to-end as a whole, where the Sharpe ratio of a sequential investment is maximized in a far-sighted way.

The stock features used in our model fall into two categories. The first category is the trading features, which describe the trading information of a stock. At time $t$, the trading features include:
• Price Rising Rate (PR): the price rising rate of a stock during the last holding period, defined as $p_t^{(i)} / p_{t-1}^{(i)}$ for stock $i$.
• Fine-grained Volatility (VOL): a holding period can be further divided into many sub-periods. We set one month as a holding period in our experiments, so a sub-period can be a trading day. VOL is defined as the standard deviation of the prices over all sub-periods from $t-1$ to $t$.
• Trade Volume (TV): the total quantity of the stock traded from $t-1$ to $t$. It reflects the market activity of a stock.
The second category is the company features, which describe the financial condition of the company that issues a stock.
At time $t$, the company features include:
• Market Capitalization (MC): for stock $i$, the product of the price $p_t^{(i)}$ and the outstanding shares of the stock.
• Price-earnings Ratio (PE): the ratio of the market capitalization of a company to its annual earnings.
• Book-to-market Ratio (BM): the ratio of the book value of a company to its market value.
• Dividend (Div): the reward from the company's earnings to stock holders during the $(t-1)$-th holding period.
Since the values of these features are not on the same scale, we standardize them into Z-scores. The performance of a stock has close relations with its history states. In the AlphaStock model, we propose a
Long Short-Term Memory with History state Attention (LSTM-HA) model to learn the representation of a stock from its history features.
The sequential representation.
In the LSTM-HA network, we use the vector $\tilde{x}_t$ to denote the history state of a stock at time $t$, which consists of the stock features given in Section 3.1. We name the last $K$ historical holding periods at time $t$, i.e., the periods from time $t-K$ to time $t$, as the look-back window of $t$. The history states of a stock in the look-back window are denoted as a sequence $X = \{x_1, \ldots, x_k, \ldots, x_K\}$, where $x_k = \tilde{x}_{t-K+k}$. (We also use $X$ to denote the matrix $(x_k)$; the two notations are interchangeable.) Our model uses a Long Short-Term Memory (LSTM) network [10] to recursively encode $X$ into a vector as
$$h_k = \mathrm{LSTM}\left(h_{k-1}, x_k\right), \quad k \in [1, K], \quad (11)$$
where $h_k$ is the hidden state encoded by the LSTM at step $k$. The $h_K$ at the last step is used as a representation of the stock. It contains the sequential dependence among the elements of $X$.

The history state attention.
The $h_K$ can fully exploit the sequential dependence of the elements in $X$, but the global and long-range dependences within $X$ are not effectively modeled. Therefore, we adopt a history state attention mechanism to enhance $h_K$ using all intermediate hidden states $h_k$. Specifically, following the standard attention mechanism [22], the history-state-attention enhanced representation, denoted as $r$, is calculated as
$$r = \sum_{k=1}^{K} \mathrm{ATT}(h_K, h_k)\, h_k, \quad (12)$$
where $\mathrm{ATT}(\cdot, \cdot)$ is an attention function defined as
$$\mathrm{ATT}(h_K, h_k) = \frac{\exp(\alpha_k)}{\sum_{k'=1}^{K} \exp(\alpha_{k'})}, \qquad \alpha_k = w^\top \cdot \tanh\left(W^{(1)} h_k + W^{(2)} h_K\right). \quad (13)$$
Here, $w$, $W^{(1)}$ and $W^{(2)}$ are parameters to learn. For the $i$-th stock at time $t$, the history-state-attention enhanced representation is denoted as $r_t^{(i)}$. It contains both the sequential and global dependences of stock $i$'s history states from time $t-K+1$ to $t$. In our model, the representation vectors of all stocks are extracted by the same LSTM-HA network; the parameters $w$, $W^{(1)}$, $W^{(2)}$ and those of the LSTM network in Eq. (11) are shared by all stocks. In this way, the representations extracted by LSTM-HA are relatively stable and general for all stocks rather than fit to a particular one.

Remark.
A major advantage of LSTM-HA is that it can learn both the sequential and global dependences from stock history states. Compared with existing studies that only use a recurrent neural network to extract the sequential dependence in history states [3, 17], or directly stack history states as an input vector of an MLP [16] to learn the global dependence, our model describes stock histories more comprehensively. It is worth mentioning that LSTM-HA is also an open framework: representations learned from other types of information sources, such as news, events and social media [4, 12, 27], could also be concatenated or attended with $r_t^{(i)}$.

In traditional RL-based strategy models, the investment portfolio is often directly generated from the stock representations through a softmax normalization [3, 6, 16]. The drawback of this type of method is that it does not fully exploit the interrelationships among stocks, which, however, are very important for the BWSL strategy, as analyzed in
Insight I of Section 2.2. In light of this, we propose a
Cross-Asset Attention Network (CAAN) to describe the interrelationships among stocks.
The basic CAAN model.
The CAAN model adopts the self-attention mechanism proposed in Ref. [24] to model the interrelationships among stocks. Specifically, given the stock representation $r^{(i)}$ (we omit time $t$ without loss of generality), we calculate a query vector $q^{(i)}$, a key vector $k^{(i)}$ and a value vector $v^{(i)}$ for stock $i$ as
$$q^{(i)} = W^{(Q)} r^{(i)}, \quad k^{(i)} = W^{(K)} r^{(i)}, \quad v^{(i)} = W^{(V)} r^{(i)}, \quad (14)$$
where $W^{(Q)}$, $W^{(K)}$ and $W^{(V)}$ are parameters to learn. The interrelationship of stock $j$ to stock $i$ is modeled by using the query $q^{(i)}$ of stock $i$ to query the key $k^{(j)}$ of stock $j$, i.e.,
$$\beta_{ij} = \frac{q^{(i)\top} \cdot k^{(j)}}{\sqrt{D_k}}, \quad (15)$$
where $D_k$ is a re-scaling parameter set following Ref. [24]. Then, we use the normalized interrelationships $\{\beta_{ij}\}$ as weights to sum the values $\{v^{(j)}\}$ of the other stocks into an attention score:
$$a^{(i)} = \sum_{j=1}^{I} \mathrm{SATT}\left(q^{(i)}, k^{(j)}\right) \cdot v^{(j)}, \quad (16)$$
where the self-attention function $\mathrm{SATT}(\cdot, \cdot)$ is a softmax normalization of the interrelationships $\beta_{ij}$, i.e.,
$$\mathrm{SATT}\left(q^{(i)}, k^{(j)}\right) = \frac{\exp\left(\beta_{ij}\right)}{\sum_{j'=1}^{I} \exp\left(\beta_{ij'}\right)}. \quad (17)$$
We use a fully connected layer to transform the attention vector $a^{(i)}$ into a winner score as
$$s^{(i)} = \mathrm{sigmoid}\left(w^{(s)\top} \cdot a^{(i)} + e^{(s)}\right), \quad (18)$$
where $w^{(s)}$ and $e^{(s)}$ are the connection weights and the bias to learn. The winner score $s_t^{(i)}$ indicates the degree to which stock $i$ is a winner in the $t$-th holding period. A stock with a higher score is more likely to be a winner.

Incorporating price rising rank prior.
In the basic CAAN, the interrelationships modeled by Eq. (15) are directly learned from data. In fact, we can use prior knowledge to help our model learn the stock interrelationships. We use $c_{t-1}^{(i)}$ to denote the rank of the price rising rate of stock $i$ in the last holding period (from $t-1$ to $t$). Inspired by methods for modeling positional information in the NLP field, we use the relative positions of stocks on the coordinate axis of $c_{t-1}^{(i)}$ as prior knowledge of the stock interrelationships. Specifically, given two stocks $i$ and $j$, we calculate their discrete relative distance on the coordinate axis of $c_{t-1}^{(i)}$ as
$$d_{ij} = \left\lfloor \left| c_{t-1}^{(i)} - c_{t-1}^{(j)} \right| \Big/ Q \right\rfloor, \quad (19)$$
where $Q$ is a preset quantization coefficient. We use a lookup matrix $L = (l_1, \ldots, l_L)$ to represent each discretized value of $d_{ij}$. Using $d_{ij}$ as the index, the corresponding column vector $l_{d_{ij}}$ is an embedding vector of the relative distance $d_{ij}$. For a pair of stocks $i$ and $j$, we calculate a prior relation coefficient $\psi_{ij}$ using $l_{d_{ij}}$ as
$$\psi_{ij} = \mathrm{sigmoid}\left(w^{(L)\top} l_{d_{ij}}\right), \quad (20)$$
where $w^{(L)}$ is a learnable parameter. The relationship between $i$ and $j$ estimated by Eq. (15) is then rewritten as
$$\beta_{ij} = \frac{\psi_{ij} \left(q^{(i)\top} \cdot k^{(j)}\right)}{\sqrt{D_k}}. \quad (21)$$
In this way, the relative positions of stocks in the price rising rate rank are introduced as weights to enhance or weaken the attention coefficients. Stocks with similar historical price rising rates will have stronger interrelationships in the attention and hence similar winner scores.
Remark.
As shown in Eq. (16), for each stock $i$, the winner score $s^{(i)}$ is calculated according to the attention over all other stocks. In this way, the interrelationships among all stocks are involved in CAAN. This special attention mechanism meets the model design requirement of Insight I in Section 2.2.
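The basic CAAN of Eqs. (14)–(18) can be sketched in a few lines of numpy. This is a minimal illustration, not the trained model: the weight matrices below are random stand-ins for the learned parameters $W^{(Q)}$, $W^{(K)}$, $W^{(V)}$, $w^{(s)}$, and the representations stand in for LSTM-HA outputs.

```python
import numpy as np

def caan_winner_scores(R, Wq, Wk, Wv, ws, es):
    """Basic CAAN: map stock representations R (I x D) to winner scores.

    Implements Eqs. (14)-(18): per-stock query/key/value projections,
    scaled dot-product self-attention across stocks, and a sigmoid
    output layer producing one winner score per stock.
    """
    Q, K, V = R @ Wq.T, R @ Wk.T, R @ Wv.T               # Eq. (14)
    beta = (Q @ K.T) / np.sqrt(K.shape[1])               # Eq. (15)
    satt = np.exp(beta) / np.exp(beta).sum(axis=1, keepdims=True)  # Eq. (17)
    A = satt @ V                                         # Eq. (16): attention vectors
    return 1.0 / (1.0 + np.exp(-(A @ ws + es)))          # Eq. (18): sigmoid scores

rng = np.random.default_rng(0)
I, D, Dk = 5, 8, 4                         # 5 stocks, 8-dim representations
R = rng.normal(size=(I, D))                # stand-in LSTM-HA representations
Wq = rng.normal(size=(Dk, D))
Wk = rng.normal(size=(Dk, D))
Wv = rng.normal(size=(Dk, D))
ws, es = rng.normal(size=Dk), 0.0
s = caan_winner_scores(R, Wq, Wk, Wv, ws, es)
print(s.shape)  # (5,) -- one winner score in (0, 1) per stock
```

Every stock's score depends on the representations of all other stocks through the attention sum, which is exactly the cross-asset coupling that Insight I calls for.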
Given the winner scores $\{s^{(1)}, \ldots, s^{(i)}, \ldots, s^{(I)}\}$ of the $I$ stocks, our AlphaStock model generally buys the stocks with high winner scores and sells those with low winner scores. Specifically, we first sort the stocks in descending order of their winner scores and obtain the sequence number $o^{(i)}$ of each stock $i$. Let $G$ denote the preset size of the portfolios $\mathbf{b}^+$ and $\mathbf{b}^-$. If $o^{(i)} \in [1, G]$, stock $i$ enters the portfolio $\mathbf{b}^+$, with the investment proportion calculated as
$$b^{+(i)} = \frac{\exp\left(s^{(i)}\right)}{\sum_{o^{(i')} \in [1, G]} \exp\left(s^{(i')}\right)}. \quad (22)$$
If $o^{(i)} \in (I-G, I]$, stock $i$ enters $\mathbf{b}^-$ with the proportion
$$b^{-(i)} = \frac{\exp\left(-s^{(i)}\right)}{\sum_{o^{(i')} \in (I-G, I]} \exp\left(-s^{(i')}\right)}. \quad (23)$$
The remaining stocks are not selected, for the lack of clear buy/sell signals. For simplicity, we can use one vector to record all the information of the two portfolios. That is, we form the vector $\mathbf{b}^c$ of length $I$, with $b^{c(i)} = b^{+(i)}$ if $o^{(i)} \in [1, G]$, $b^{c(i)} = b^{-(i)}$ if $o^{(i)} \in (I-G, I]$, and 0 otherwise, $i = 1, \ldots, I$. In what follows, we use $\mathbf{b}^c$ and $\{\mathbf{b}^+, \mathbf{b}^-\}$ interchangeably as the output of our AlphaStock model for clarity.

We frame the AlphaStock strategy as an RL game with discrete agent actions to optimize the model parameters, where a $T$-period investment is modeled as a state-action-reward trajectory $\pi$ of an RL agent, i.e., $\pi = \{state_1, action_1, reward_1, \ldots, state_t, action_t, reward_t, \ldots, state_T, action_T, reward_T\}$. The $state_t$ is the history market state observed at $t$, which is expressed as $\mathcal{X}_t = (X_t^{(i)})$.
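The portfolio construction of Eqs. (22)–(23) can be sketched as follows; this is a minimal numpy version, and the winner scores used in the example are hypothetical.

```python
import numpy as np

def generate_portfolios(s, G):
    """Portfolio generator of Eqs. (22)-(23): softmax over the winner scores
    of the top-G stocks for the long portfolio b+, softmax over the negated
    scores of the bottom-G stocks for the short portfolio b-. Both are
    recorded in one vector b_c; unselected stocks stay at zero."""
    I = len(s)
    order = np.argsort(-s)             # stock indices by descending winner score
    top, bottom = order[:G], order[I - G:]
    b_c = np.zeros(I)
    b_c[top] = np.exp(s[top]) / np.exp(s[top]).sum()              # Eq. (22)
    b_c[bottom] = np.exp(-s[bottom]) / np.exp(-s[bottom]).sum()   # Eq. (23)
    return b_c

s = np.array([0.9, 0.2, 0.6, 0.1, 0.5, 0.8])   # hypothetical winner scores
b_c = generate_portfolios(s, G=2)
print(np.round(b_c, 3))
```

By construction the long proportions sum to 1 over the $G$ highest-score stocks and the short proportions sum to 1 over the $G$ lowest-score stocks, matching the zero-investment setup of Section 2.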
The $action_t$ is an $I$-dimensional binary vector, of which the element $action_t^{(i)} = 1$ if the agent invests in stock $i$ at $t$, and 0 otherwise. According to $state_t$, the agent invests in stock $i$ with probability $\Pr(action_t^{(i)} = 1)$, which is determined by AlphaStock as
$$\Pr\left(action_t^{(i)} = 1 \,\middle|\, \mathcal{X}_t, \theta\right) = \mathcal{G}^{(i)}\left(\mathcal{X}_t, \theta\right) = b_t^{c(i)}, \quad (24)$$
where $\mathcal{G}^{(i)}(\mathcal{X}_t, \theta)$ is the part of AlphaStock that generates $b_t^{c(i)}$, $\theta$ denotes the model parameters, and $\sum_{i=1}^{I} \Pr(action_t^{(i)} = 1) =$
1. Let $H_\pi$ denote the Sharpe ratio of $\pi$; then $reward_t$ is the contribution of $action_t$ to $H_\pi$, with $\sum_{t=1}^{T} reward_t = H_\pi$. Over all possible $\pi$, the average reward of the RL agent is
$$J(\theta) = \int_\pi H_\pi \Pr(\pi \mid \theta) \, d\pi, \quad (25)$$
where $\Pr(\pi \mid \theta)$ is the probability of generating $\pi$ given $\theta$. (In the RL game, the actions of the agent are discrete, with the probability $b_t^{c(i)}$ indicating whether to invest in stock $i$. In a real investment, we allocate capital to stock $i$ according to the continuous proportion $b_t^{c(i)}$. This approximation is for the sake of problem solving.) The objective of the RL model optimization is to find the optimal parameters $\theta^* = \arg\max_\theta J(\theta)$. We use the gradient ascent approach to iteratively optimize $\theta$ at round $\tau$ as $\theta_\tau = \theta_{\tau-1} + \eta \nabla J(\theta)\big|_{\theta = \theta_{\tau-1}}$, where $\eta$ is a learning rate. Given a training dataset that contains $N$ trajectories $\{\pi_1, \ldots, \pi_n, \ldots, \pi_N\}$, $\nabla J(\theta)$ can be approximately calculated as [23]
$$\nabla J(\theta) = \int_\pi H_\pi \Pr(\pi \mid \theta) \nabla \log \Pr(\pi \mid \theta) \, d\pi \approx \frac{1}{N} \sum_{n=1}^{N} \left( H_{\pi_n} \sum_{t=1}^{T_n} \sum_{i=1}^{I} \nabla_\theta \log \Pr\left(action_t^{(i)} = 1 \,\middle|\, \mathcal{X}_t^{(n)}, \theta\right) \right), \quad (26)$$
where the gradient $\nabla_\theta \log \Pr(action_t^{(i)} = 1 \mid \mathcal{X}_t^{(n)}, \theta) = \nabla_\theta \log \mathcal{G}^{(i)}(\mathcal{X}_t^{(n)}, \theta)$ is calculated by the back-propagation algorithm. In order to ensure that the proposed model can beat the market, we introduce the threshold method [23] into our reinforcement learning. The gradient $\nabla J(\theta)$ in Eq. (26) is then rewritten as
$$\nabla J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left( \left(H_{\pi_n} - H\right) \sum_{t=1}^{T_n} \sum_{i=1}^{I} \nabla_\theta \log \mathcal{G}^{(i)}\left(\mathcal{X}_t^{(n)}, \theta\right) \right), \quad (27)$$
where the threshold $H$ is set as the Sharpe ratio of the overall market. In this way, the gradient ascent only encourages parameters that can outperform the market.

Remark.
Eq. (27) uses $(H_{\pi_n} - H)$ to integrally weight the gradients $\nabla_\theta \log \mathcal{G}$ of all holding periods in $\pi_n$. The reward is not directly given to any isolated step in $\pi_n$ but to all steps in $\pi_n$. This feature of our model meets the far-sightedness requirement of Insight II in Section 2.2.
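A toy sketch of the update of Eq. (27): each trajectory's Sharpe ratio, offset by the market threshold $H$, weights that trajectory's accumulated score-function gradients. The scalar "gradients" and per-period returns below are hypothetical stand-ins for the quantities that back-propagation through the full network would produce.

```python
import numpy as np

def sharpe(returns, theta_rf=0.0):
    """Sharpe ratio of one trajectory of per-period returns R_t (Eqs. (7)-(9)),
    ignoring transaction costs for this toy sketch."""
    A = returns.mean() - theta_rf
    V = returns.std()
    return A / V if V > 0 else 0.0

def reinforce_update(log_prob_grads, trajectory_returns, H_market, lr=0.01):
    """One gradient-ascent step following Eq. (27).

    log_prob_grads[n] -- stand-in for sum_{t,i} d log G_i(X_t, theta)/d theta
                         of trajectory n (a scalar toy parameter here).
    trajectory_returns[n] -- per-period returns R_t of trajectory n.
    Only trajectories whose Sharpe ratio beats H_market push theta upward.
    """
    grad = np.mean([
        (sharpe(R) - H_market) * g
        for g, R in zip(log_prob_grads, trajectory_returns)
    ])
    return lr * grad   # parameter increment: theta += lr * grad

# Toy data: two 12-period trajectories and hypothetical score gradients.
rng = np.random.default_rng(1)
trajs = [rng.normal(0.01, 0.05, 12), rng.normal(0.02, 0.05, 12)]
grads = [0.3, -0.2]
step = reinforce_update(grads, trajs, H_market=0.1)
print(step)
```

The whole-trajectory weighting is what makes the update far-sighted: a single high-return period cannot dominate the step unless the full 12-period Sharpe ratio beats the market.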
4 INTERPRETING ALPHASTOCK
In the AlphaStock model, the LSTM-HA and CAAN networks cast the raw stock features into winner scores, and the final investment portfolios are directly generated from the winner scores. A natural follow-up question is: what kind of stocks would be selected as winners by AlphaStock? To answer this question, we propose a sensitivity analysis method [1, 25, 26] to interpret how the history features of a stock influence its winner score in our model. We use $s = F(X)$ to express the function from the history features $X$ of a stock to its winner score $s$. In our model, $F$ is the combined network of LSTM-HA and CAAN. We use $x_q$ to denote an element of $X$, i.e., the value of one feature (defined in Section 3.1) at a particular time period of the look-back window, e.g., the price rising rate of a stock three months ago. Given the history state $X$ of a stock, the influence of $x_q$ on its winner score $s$, i.e., the sensitivity of $s$ to $x_q$, is expressed as
$$\delta_{x_q}(X) = \lim_{\Delta x_q \to 0} \frac{F(X) - F\left(x_q + \Delta x_q, X_{\neg x_q}\right)}{x_q - \left(x_q + \Delta x_q\right)} = \frac{\partial F(X)}{\partial x_q}, \quad (28)$$
where $X_{\neg x_q}$ denotes the elements of $X$ other than $x_q$. Over all possible stock states in a market, the average influence of the stock state feature $x_q$ on the winner score $s$ is
$$\bar{\delta}_{x_q} = \int_{D_X} \Pr(X)\, \delta_{x_q}(X) \, d\sigma, \quad (29)$$
where $\Pr(X)$ is the probability density function of $X$, and $\int_{D_X} \cdot \, d\sigma$ is an integral over all possible values of $X$. According to the Law of Large Numbers, given a dataset that contains the history states of $I$ stocks over $N$ holding periods, $\bar{\delta}_{x_q}$ is approximated as
$$\bar{\delta}_{x_q} = \frac{1}{I \times N} \sum_{n=1}^{N} \sum_{i=1}^{I} \delta_{x_q}\left(X_n^{(i)} \,\middle|\, X_n^{(\neg i)}\right), \quad (30)$$
where $X_n^{(i)}$ is the history state of the $i$-th stock in the $n$-th holding period, and $X_n^{(\neg i)}$ denotes the concurrent history states of the other stocks. We use $\bar{\delta}_{x_q}$ to measure the overall influence of a stock feature $x_q$ on the winner score.
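The averaged sensitivity of Eqs. (28)–(30) can be approximated with finite differences. The sketch below uses a hand-made differentiable score function F as a stand-in for the trained LSTM-HA + CAAN network, so the sensitivities it recovers are those of the stand-in, not of AlphaStock itself.

```python
import numpy as np

def sensitivity(F, X, q, eps=1e-6):
    """Finite-difference estimate of dF/dx_q at history state X (Eq. (28))."""
    Xp = X.copy()
    Xp.flat[q] += eps
    return (F(Xp) - F(X)) / eps

# Stand-in winner-score function for illustration (not the trained network):
# it rewards feature 0 and penalizes feature 1 of the flattened history state.
F = lambda X: 1.0 / (1.0 + np.exp(-(X.flat[0] - 0.5 * X.flat[1])))

rng = np.random.default_rng(2)
states = rng.normal(size=(100, 4, 3))   # 100 stock-period states, K=4, 3 features

# Dataset-average influences, as in Eq. (30).
delta_bar_0 = np.mean([sensitivity(F, X, 0) for X in states])
delta_bar_1 = np.mean([sensitivity(F, X, 1) for X in states])
print(delta_bar_0 > 0, delta_bar_1 < 0)  # True True
```

The recovered signs match the construction of F: a positive average sensitivity marks a feature the model rewards, a negative one marks a feature it penalizes, which is exactly how the interpretation to follow reads $\bar{\delta}_{x_q}$.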
A positive value of $\bar{\delta}_{x_q}$ indicates that our model tends to take a stock as a winner when $x_q$ is large, and vice versa. For example, in the experiment to follow, we obtain a negative $\bar{\delta}$ for the volatility features, indicating that high-volatility stocks tend not to be selected as winners.

In this section, we empirically evaluate our AlphaStock model using data from the U.S. markets. Data from the Chinese stock markets are also used for a robustness check.
The data of the U.S. stock market used in our experiments are obtained from Wharton Research Data Services (WRDS, https://wrds-web.wharton.upenn.edu/wrds/). The time range of the data is from Jan. 1970 to Dec. 2016. This long time range covers several well-known market events, such as the dot-com bubble from 1995 to 2000 and the subprime mortgage crisis from 2007 to 2009, which enables evaluation over diverse market states. The stocks are from four markets: NYSE, NYSE American, NASDAQ, and NYSE Arca. The number of valid stocks is more than 1000 per year. We use the data from Jan. 1970 to Jan. 1990 as the training and validation set, and the rest as the test set.

In the experiment, the holding period is set to one month, and the number of holding periods T in an investment is set to 12, i.e., the Sharpe ratio reward is calculated every 12 months for RL. The look-back window size K is set to 12, i.e., we look back on the 12-month history states of stocks. The size G of the portfolios is set to 1/4 of the number of all stocks.

AlphaStock is compared with a number of baselines including:
• Market: the uniform Buy-And-Hold strategy [13];
• Cross Sectional Momentum (CSM) [15] and Time Series Momentum (TSM) [18]: two classic momentum strategies;
• Robust Median Reversion (RMR): a newly reported reversion strategy [13];
• Fuzzy Deep Direct Reinforcement (FDDR): a newly reported RL-based BWSL strategy [3];
• AlphaStock-NC (AS-NC): the AlphaStock model without the CAAN, where the outputs of LSTM-HA are directly used as the inputs of the portfolio generator;
• AlphaStock-NP (AS-NP): the AlphaStock model without the price rising rank prior, where we use the basic CAAN in our model.

The baselines TSM/CSM/RMR represent the traditional financial strategies: TSM and CSM are based on the momentum logic, and RMR is based on the reversion logic. FDDR represents the state-of-the-art RL-based BWSL strategy. AS-NC and AS-NP are used as a contrast to verify the effectiveness of the CAAN and the price rising rank prior. The Market is used to indicate states of the market.
The most standard evaluation measure for investment strategies is Cumulative Wealth, which is defined as

$$CW_T = \prod_{t=1}^{T} \left(1 + R_t - TC\right), \qquad (31)$$

where $R_t$ is the rate of return defined in Eq. (5) and the transaction cost $TC$ is set to 0.1% in our experiments according to Ref. [3].

The preferences of different investors are varied. Therefore, we also use some other evaluation measures, including:

1) Annualized Percentage Rate (APR) is an annualized average of the return rate. It is defined as $APR_T = A_T \times N_Y$, where $A_T$ is the average return rate per holding period and $N_Y$ is the number of holding periods in a year.

2) Annualized Volatility (AVOL) is an annualized average of volatility. It is defined as $AVOL_T = V_T \times \sqrt{N_Y}$ and is used to measure the average risk of a strategy during a unit time period.

3) Annualized Sharpe Ratio (ASR) is the risk-adjusted annualized return based on APR and AVOL. The formalized definition of ASR is $ASR_T = APR_T / AVOL_T$.

4) Maximum DrawDown (MDD) is the maximum loss from a peak to a trough of a portfolio, before a new peak is attained. It is another way to measure the investment risk. The formalized definition of MDD is

$$MDD_T = \max_{\tau \in [1, T]} \left( \max_{t \in [1, \tau]} \frac{APR_t - APR_\tau}{APR_t} \right). \qquad (32)$$

5) Calmar Ratio (CR) is the risk-adjusted APR based on Maximum DrawDown. It is calculated as $CR_T = APR_T / MDD_T$.

6) Downside Deviation Ratio (DDR) measures the downside risk of a strategy as the average of returns when it falls below a minimum acceptable return (MAR). It is the risk-adjusted APR based on the Downside Deviation. The formalized definition of DDR is given as

$$DDR_T = \frac{APR_T}{\text{Downside Deviation}} = \frac{APR_T}{\sqrt{\mathbb{E}\left[\min\left(R_t - MAR, 0\right)^2\right]}}, \quad t \in [1, T]. \qquad (33)$$

In our experiment, the MAR is set to zero.
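The measures above can be sketched in code as follows, under the settings stated in the text (monthly periods, so N_Y = 12; TC = 0.1%; MAR = 0). One assumption to note: the drawdown here is computed on the cumulative wealth curve, a common practical variant of Eq. (32).

```python
import numpy as np

def evaluate(R, N_Y=12, TC=0.001, MAR=0.0):
    """Compute the evaluation measures from a series of per-period returns R_t."""
    cw = np.prod(1.0 + R - TC)                      # Eq. (31), cumulative wealth
    apr = np.mean(R) * N_Y                          # annualized return rate
    avol = np.std(R) * np.sqrt(N_Y)                 # annualized volatility
    asr = apr / avol if avol > 0 else np.inf        # annualized Sharpe ratio
    wealth = np.cumprod(1.0 + R - TC)               # wealth curve for drawdown
    peaks = np.maximum.accumulate(wealth)
    mdd = np.max((peaks - wealth) / peaks)          # maximum drawdown
    cr = apr / mdd if mdd > 0 else np.inf           # Calmar ratio
    dd = np.sqrt(np.mean(np.minimum(R - MAR, 0.0) ** 2))  # downside deviation
    ddr = apr / dd if dd > 0 else np.inf            # downside deviation ratio
    return dict(CW=cw, APR=apr, AVOL=avol, ASR=asr, MDD=mdd, CR=cr, DDR=ddr)

# Example on a toy 6-month return series.
R = np.array([0.02, -0.01, 0.03, 0.00, -0.02, 0.01])
print(evaluate(R))
```

For AVOL and MDD a lower value is better; for the remaining measures a higher value is better, matching the discussion of Table 1 below.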
Fig. 2 is a cumulative wealth comparison of AlphaStock and the baselines. In general, the performance of AlphaStock (AS) is much better than that of the baselines, which verifies the effectiveness of our model. Some interesting observations are highlighted as follows:

1) The performance of AlphaStock is better than that of AlphaStock-NP, and the performance of AlphaStock-NP is better than that of AlphaStock-NC, which indicates that the stock rank priors and the interrelationships modeled by CAAN are very helpful for the BWSL strategy.

2) FDDR is also a kind of deep RL investment strategy, which extracts fuzzy representations of stocks using a recurrent deep neural network. In our experiment, the performance of AlphaStock-NC is better than that of FDDR, indicating the advantage of our LSTM-HA network in stock representation learning.
(Figure 2 axes: Year vs. Cumulative Wealth; curves: AS, AS-NP, AS-NC, FDDR, RMR, CSM, TSM, Market.)
Figure 2: The Cumulative Wealth in U.S. markets.
Table 1: Performance comparison on U.S. markets.
          APR     AVOL    ASR     MDD     CR      DDR
Market    –       –       –       –       –       –
TSM       –       –       –       –       –       –
CSM       –       –       –       –       –       –
RMR       –       –       –       –       –       –
FDDR      –       –       –       –       –       –
AS-NC     –       –       –       –       –       –
AS-NP     –       –       –       –       –       –
AS        0.143   –       –       –       –       –
(Numeric entries other than the APR of AS were lost in extraction.)
3) The TSM strategy performs well in the bull market but very poorly in the bear market (the financial crises in 2003 and 2008), while RMR shows the opposite behavior. This implies that the traditional financial strategies can only adapt to a certain type of market state without an effective forward-looking mechanism. This defect is greatly addressed by the RL strategies, including AlphaStock and FDDR, which perform much more stably across different market states.

The performances evaluated by the other measures are listed in Table 1. For the underlined measures (AVOL, MDD), a lower value indicates better performance, while the opposite holds for the other measures. As shown in Table 1, the performances of AlphaStock, AlphaStock-NP and AlphaStock-NC are better than those of the other baselines on all measures, confirming the effectiveness and robustness of our strategy. The performances of AlphaStock, AlphaStock-NP and AlphaStock-NC are close in terms of ASR, which might be because all of these models are optimized to maximize the Sharpe ratio. The profits of AlphaStock and AlphaStock-NP measured by APR are higher than that of AlphaStock-NC, at the cost of slightly higher volatility.

More interestingly, the performance of AlphaStock measured by MDD, CR and DDR is much better than that of AlphaStock-NP. Similar results can be observed by comparing the MDD, CR and DDR of AlphaStock-NP and AlphaStock-NC. These three measures indicate the extreme loss in an investment, i.e., the maximum drawdown and the returns below the minimum acceptable threshold. The results suggest that the extreme loss control abilities of the three models rank as AlphaStock > AlphaStock-NP > AlphaStock-NC, which highlights the contribution of the CAAN component and the price rising rank prior. Indeed, CAAN with the price rising rank prior fully exploits the ranking relationship among stocks.
This mechanism can protect our strategy from the error of "buying losers and selling winners", and therefore can greatly reduce extreme losses in investments. In summary, AlphaStock is a very competitive strategy for investors with different types of preferences.
(Figure 3 panels: (a) Price Rising, (b) Trade Volume, and (c) Fine-grained Volatility, with vertical axes "Influence of PR/TV/VOL to WS" and horizontal axis "History Months" from -12 to -1; (d) Company Features: MC, PE, BM, DIV.)
Figure 3: Influence of history trading features to winner scores.
Table 2: Performance comparison on Chinese markets.
          APR     AVOL    ASR     MDD     CR      DDR
Market    –       –       –       –       –       –
TSM       –       –       –       –       –       –
CSM       –       –       –       –       –       –
RMR       –       –       –       –       –       –
FDDR      –       –       –       –       –       –
AS-NC     –       –       –       –       –       –
AS-NP     –       –       –       –       –       –
AS        0.125   0.103   1.220   0.135   0.296   1.704
(Entries for the baselines were lost in extraction; only the AS row survived.)
In order to further verify the robustness of our model, we run the back-test experiments of our model and the baselines over the Chinese stock markets, which contain two exchanges: the Shanghai Stock Exchange (SSE) and the Shenzhen Stock Exchange (SZSE). The data are obtained from the WIND database. The stocks are the RMB-priced ordinary shares (A-shares), and the total number of stocks used for the experiment is 1,131. The time range of our data is from Jun. 2005 to Dec. 2018, with the period from Jun. 2005 to Dec. 2011 used as the training/validation set and the rest as the test set. Since the Chinese markets do not allow short selling, we only use the b+ portfolio in the experiment.

The experimental results are given in Table 2. From the table we can see that the performances of AlphaStock, AlphaStock-NP and AlphaStock-NC are again better than those of the other baselines. This verifies the effectiveness of our model over the Chinese markets. By further comparing Table 2 with Table 1, it turns out that the risk of our model measured by AVOL and MDD in the Chinese markets is higher than that in the U.S. markets. This might be attributable to the market faultiness of emerging countries like China, with more speculative capital but less effective governance. The lack of a short selling mechanism also contributes to the imbalance of market forces. The AVOL and MDD of the Market and the other baselines in the Chinese markets are also higher than those in the U.S. markets. Compared with these baselines, the risk control ability of our model is still competitive. To sum up, the experimental results in Table 2 indicate the robustness of our model over emerging markets.

Here, we try to interpret the underlying investment strategies of AlphaStock, which is crucial for practitioners to better understand this model. To this end, we use $\bar{\delta}_{x_q}$ in Eq. (30) to measure the influence of the stock features defined in Section 3.1 on AlphaStock's winner selection. Figures 3(a)-3(c) plot the influences from the trading features.
The vertical axis denotes the influence strengths indicated by $\bar{\delta}_{x_q}$, and the horizontal axis denotes how many months before the trading time. For example, the bar indexed by "-12" on the horizontal axis in Fig. 3(a) denotes the influence of the stock price rising rate (PR) at the time of twelve months ago.

As shown in Fig. 3(a), the influence of the history price rising rate is heterogeneous along the time axis: the PR of long-term months has a positive influence on the winner score, while the PR of recent months has a negative influence, which is consistent with selecting stocks that have high long-term growth but are undervalued recently. As shown in Fig. 3(b), the influence of trading volumes (TV) has a similar tendency to that of the price rising rate (PR). Finally, as shown in Fig. 3(c), the volatilities (VOL) have a negative influence on the winner scores for all history months. It means that our model tends to select low-volatility stocks as winners, which indeed explains why AlphaStock can adapt to diverse market states.

Fig. 3(d) further exhibits the average influences of different company features on the winner score, i.e., $\bar{\delta}_{x_q}$ averaged over all history months. It turns out that Market Capitalization (MC), Price-earnings Ratio (PE), and Book-to-market Ratio (BM) have positive influences. These three features are important valuation factors for a listed company, which indicates that AlphaStock tends to select companies with sound fundamental values. In contrast, dividends mean that a part of the company's value is returned to shareholders, which could reduce the intrinsic value of a stock. That is why the influence of Dividends (DIV) is negative in our model.

To sum up, while AlphaStock is an AI-enabled investment strategy, the interpretation analysis proposed in Section 4 can help to extract investment logics from AlphaStock. Specifically, AlphaStock suggests selecting as winners the stocks with high long-term growth, low volatility, high intrinsic value, and being undervalued recently.

Our work is related to the following research directions.
Financial Investment Strategy:
Classic financial investment strategies include Momentum, Mean Reversion, and Multi-factor models. In the first work on BWSL [14], Jegadeesh and Titman found that "momentum" could be used to select winners and losers. The momentum strategy buys assets that have had high returns over a past period as winners, and sells those that have had poor returns over the same period. Classic momentum strategies include Cross Sectional Momentum (CSM) [15] and Time Series Momentum (TSM) [18]. The mean reversion strategy [20] assumes asset prices always return to their mean over a past period, so it buys assets with a price under their historical mean and sells those above the historical mean. The multi-factor model [7] uses factors to compute a valuation for each asset and buys/sells those assets with prices under/above their valuations. Most of these financial investment strategies can only exploit a certain factor of financial markets and thus might fail in complex market environments.
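As an illustration of the momentum logic described above, a cross-sectional momentum selection can be sketched as follows: rank assets by their look-back returns, buy the top fraction as winners, and sell the bottom fraction as losers. This is a simplified toy sketch, not the exact formulation of [15].

```python
import numpy as np

def csm_portfolio(past_returns, frac=0.25):
    """past_returns: look-back return per asset.
    Returns (winners, losers) as index arrays of the top and
    bottom `frac` of assets by past return."""
    order = np.argsort(past_returns)            # ascending: worst performers first
    g = max(1, int(len(past_returns) * frac))   # portfolio size
    return order[-g:], order[:g]                # top-g winners, bottom-g losers

# Toy example with four assets.
winners, losers = csm_portfolio(np.array([0.10, -0.05, 0.02, 0.20]), frac=0.25)
print(winners, losers)  # [3] [1]
```

A BWSL strategy then goes long the winners and short the losers over the next holding period.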
Deep Learning in Finance:
In recent years, deep learning approaches have begun to be applied in financial areas. In the literature, L. Zhang et al. proposed to exploit frequency information to predict stock prices [11]. News and social media were used for price prediction in Refs. [12, 27]. Information about events and corporation relationships was used to predict stock prices in Refs. [2, 4]. Most of these works focus on price prediction rather than end-to-end investment portfolio generation like ours.
Reinforcement Learning in Finance:
The RL approaches used in investment strategies fall into two categories: value-based and policy-based [8]. The value-based approaches learn a critic to describe the expected outcomes of markets in response to trading actions. Typical value-based approaches in investment strategies include Q-learning [19] and deep Q-learning [16]. A defect of value-based approaches is that the market environment is too complex to be approximated by a critic. Therefore, policy-based approaches are considered more suitable for financial markets [8]. The AlphaStock model also belongs to this category. A classic policy-based RL algorithm for investment strategies is Recurrent Reinforcement Learning (RRL) [17]. The FDDR model [3] extends the RRL framework using deep neural networks. In the Investor-Imitator model [6], a policy-based deep RL framework was proposed to imitate the behaviors of different types of investors. Compared with RRL and its deep learning extensions, which focus on exploiting sequential dependence in financial signals, our AlphaStock model pays more attention to the interrelationships among assets. Moreover, deep RL approaches are often hard to deploy in real-life applications due to their unexplainable deep network structures. The interpretation tools offered by our model can mitigate this problem.
In this paper, we proposed an RL-based deep attention network to design a BWSL strategy called AlphaStock. We also designed a sensitivity analysis method to interpret the investment logics of our model. Compared with existing RL-based investment strategies, AlphaStock fully exploits the interrelationships among stocks, and opens a door for solving the "black box" problem of using deep learning models in financial markets. The back-testing and simulation experiments over the U.S. and Chinese stock markets showed that AlphaStock performs much better than other competing strategies. Interestingly, AlphaStock suggests buying stocks with high long-term growth, low volatility, high intrinsic value, and being undervalued recently.
ACKNOWLEDGMENTS
J. Wang's work was partially supported by the National Natural Science Foundation of China (NSFC) (61572059, 61202426), the Science and Technology Project of Beijing (Z181100003518001), and the CETC Union Fund (6141B08080401). Y. Zhang's work was partially supported by the National Key Research and Development Program of China under Grant (2017YFC0820405) and the Fundamental Research Funds for the Central Universities. K. Tang's work was partially supported by the National Social Sciences Foundation of China (No. 14BJL028). J. Wu's work was partially supported by NSFC (71725002, 71531001, U1636210).
REFERENCES
[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In NIPS'18. 9525–9536.
[2] Yingmei Chen, Zhongyu Wei, and Xuanjing Huang. 2018. Incorporating Corporation Relationship via Graph Convolutional Neural Networks for Stock Price Prediction. In CIKM'18. ACM, 1655–1658.
[3] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2017. Deep direct reinforcement learning for financial signal representation and trading. IEEE TNNLS 28, 3 (2017), 653–664.
[4] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In IJCAI'15. 2327–2333.
[5] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction. In COLING'16. 2133–2142.
[6] Yi Ding, Weiqing Liu, Jiang Bian, Daoqiang Zhang, and Tie-Yan Liu. 2018. Investor-Imitator: A Framework for Trading Knowledge Extraction. In KDD'18. ACM, 1310–1319.
[7] Eugene F Fama and Kenneth R French. 1996. Multifactor explanations of asset pricing anomalies. J. Finance 51, 1 (1996), 55–84.
[8] Thomas G Fischer. 2018. Reinforcement learning in financial markets - a survey. Technical Report. FAU Discussion Papers in Economics.
[9] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51, 5 (2018), 93.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] Hao Hu and Guo-Jun Qi. 2017. State-Frequency Memory Recurrent Neural Networks. In ICML'17. 1568–1577.
[12] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In WSDM'18. ACM, 261–269.
[13] Dingjiang Huang, Junlong Zhou, Bin Li, Steven CH Hoi, and Shuigeng Zhou. 2016. Robust median reversion strategy for online portfolio selection. IEEE TKDE 28, 9 (2016), 2480–2493.
[14] Narasimhan Jegadeesh and Sheridan Titman. 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. J. Finance 48, 1 (1993), 65–91.
[15] Narasimhan Jegadeesh and Sheridan Titman. 2002. Cross-sectional and time-series determinants of momentum returns. RFS 15, 1 (2002), 143–157.
[16] Olivier Jin and Hamza El-Saawy. 2016. Portfolio Management using Reinforcement Learning. Technical Report. Stanford University.
[17] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting 17, 5-6 (1998), 441–470.
[18] Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. 2012. Time series momentum. J. Financial Economics.
[19] In NIPS'95.
[20] James M Poterba and Lawrence H Summers. 1988. Mean reversion in stock prices: Evidence and implications. J. Financial Economics 22, 1 (1988), 27–59.
[21] William F Sharpe. 1994. The Sharpe ratio. JPM 21, 1 (1994), 49–58.
[22] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. NIPS'14 (2014), 3104–3112.
[23] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS'17. 5998–6008.
[25] Jingyuan Wang, Qian Gu, Junjie Wu, Guannan Liu, and Zhang Xiong. 2016. Traffic speed prediction and congestion source exploration: A deep learning method. In ICDM'16. IEEE, 499–508.
[26] Jingyuan Wang, Ze Wang, Jianfeng Li, and Junjie Wu. 2018. Multilevel wavelet decomposition network for interpretable time series analysis. In KDD'18. ACM, 2437–2446.
[27] Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In ACL'18.