Deep Learning, Predictability, and Optimal Portfolio Returns
Mykola Babiak† Jozef Baruník‡

First draft: July 2020. This draft: November 24, 2020
Abstract
We study the dynamic portfolio choice of a long-horizon investor who uses deep learning methods to predict equity returns when forming optimal portfolios. Our results show statistically and economically significant benefits from using deep learning to form optimal portfolios through certainty equivalent returns and Sharpe ratios. Return predictability via deep learning also generates substantially improved portfolio performance across different subsamples, particularly during recessionary periods. These gains are robust to including transaction costs, short-selling and borrowing constraints.
Keywords:
Return Predictability, Portfolio Allocation, Machine Learning, Neural Networks, Empirical Asset Pricing
JEL codes:
C45, C53, E37, G11, G17

* Jozef Barunik gratefully acknowledges support from the Czech Science Foundation under the EXPRO GX19-28231X project.
† Department of Accounting and Finance, Lancaster University Management School, Lancaster, LA1 4YX, UK. E-mail: [email protected]
Web: sites.google.com/site/mykolababiak
‡ Institute of Economic Studies, Charles University, Opletalova 26, 110 00, Prague, CR and Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodarenskou Vezi 4, 18200, Prague, Czech Republic. E-mail: [email protected]
Web: barunik.github.io

Introduction
An extensive empirical asset pricing literature has documented supportive evidence for equity return predictability. With an ever increasing number of potential predictors, the practice of applying machine learning methods to make the most accurate predictions using large datasets is gaining further traction. This new literature demonstrates the superior performance of machine learning approaches relative to the linear regression analysis researchers have tended to favor. However, it is unclear whether the sound statistical performance of machine learning leads to portfolio gains for an investor who applies these models of return predictability when forming optimal portfolios. Indeed, the existing evidence on using linear models indicates that an ensemble of additional features is required to improve the portfolio performance that stems from linear predictive regressions. This raises the question whether exploiting predictability via machine learning generates any benefits to agents.

In this paper, we examine the economic value of non-linear machine learning methods, such as neural networks (NNs), for an investor forming optimal portfolios. We study the asset allocation of a long-horizon investor with power utility choosing between a market portfolio and a risk-free asset. Our optimal portfolio design exercise follows Johannes et al. (2014) and our forecasting comparison follows Gu et al. (2020). Methodologically, we consider univariate and multivariate linear regressions and a variety of machine learning architectures including shallow and deep NNs, as well as long-short-term-memory (LSTM) recurrent NNs.
An LSTM is a specialized form of a neural network, which is capable of learning extremely complex long-term temporal dynamics that a vanilla NN is unable to capture.

See, for example, Campbell (1987); Campbell and Shiller (1988); Fama and French (1988, 1989); Ferson and Harvey (1991); Pesaran and Timmermann (1995); Lettau and Ludvigson (2001); Lewellen (2004) and Ang and Bekaert (2007), among many others.

See, for example, Rapach et al. (2010); Kelly and Pruitt (2013, 2015); Sirignano et al. (2016); Giannone et al. (2017); Giglio and Xiu (2017); Heaton et al. (2017); Messmer (2017); Feng et al. (2018); Fuster et al. (2018); Chen et al. (2019); Feng et al. (2019); Kelly et al. (2019); Bianchi et al. (2020); Freyberger et al. (2020); Gu et al. (2020); Kozak et al. (2020).

Goyal and Welch (2008) use around 20 financial and macroeconomic variables for the aggregate market returns. Green et al. (2013) list more than 330 return predictive signals used by the existing literature over the 1970-2010 period. Harvey et al. (2016) report 316 “factors” useful for predicting stock returns.

Additional ingredients include learning about predictability with informative priors (Wachter and Warusawitharana, 2009) and an ensemble of estimation risk and time-varying volatility (Johannes et al., 2014).
Evaluating Predictability via Portfolio Performance
The standard approach used to forecast excess equity returns is a linear model of the form

$$r_{t+1} = \alpha + \beta x_t + \varepsilon_{r,t+1}, \qquad (1)$$

where $r_{t+1}$ are monthly log excess returns, $\alpha$ and $\beta$ are coefficients to be estimated, $x_t = (x_{1t}, \ldots, x_{nt})$ is a set of predictor variables, and $\varepsilon_{r,t+1}$ is a normal error term. A large strand of empirical literature has examined linear regression models with multiple predictors, including prominent variables such as the dividend yield, valuation ratios, and various interest rates and spreads, among others. Although researchers have proposed numerous variables for predicting stock market returns, empirical evidence on the degree of predictability is mixed at best. Goyal and Welch (2008) find that most linear specifications with multiple predictors perform poorly and remain insignificant even in-sample. They further show that an investor using linear models to forecast equity returns would not be able to improve portfolio performance compared to the no-predictability benchmark.

There are several reasons for the lack of robust evidence on the predictability of equity returns and its benefits for portfolio construction. The specification defined by Eq. (1) assumes a linear and time-invariant relationship between log excess returns and predictors, which is at odds with the theoretical and empirical evidence. Bayesian learning about uncertain parameters in the linear regression has been proposed as a way to introduce a time-varying relationship between the returns and predictor variables. However, sequential parameter learning leads to significant portfolio benefits only in the presence of an ensemble of additional features.

See, for example, Shiller (1981); Hodrick (1992); Stambaugh (1999); Avramov (2002); Cremers (2002); Ferson et al. (2003); Lewellen (2004); Torous et al. (2004); Campbell and Yogo (2006); Ang and Bekaert (2007); Campbell and Thompson (2008); Cochrane (2008); Lettau and Van Nieuwerburgh (2008); Pástor and Stambaugh (2009).

Leading examples of this literature include Menzly et al. (2004); Paye and Timmermann (2006); Santos and Veronesi (2006); Lettau and Van Nieuwerburgh (2008); Henkel et al. (2011); Dangl and Halling (2012).

Specifically, we apply neural networks to approximate the functional association between the set of predictors and returns for optimal portfolio construction. In doing so, we do not impose a known form of this relationship, but instead allow for flexible identification of potentially nonlinear interactions from the data. Our choice of neural networks over other machine learning methods (for instance, tree-based approaches) is motivated by the fact that they deliver the most accurate statistical performance, as documented by the existing literature. The aim of this paper is to revisit the evidence documented by Goyal and Welch (2008) and to show that, unlike linear predictive regressions, the sound statistical performance of neural networks indeed translates into substantial portfolio improvements for an investor using these novel methods when dynamically forming an optimal portfolio.
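For concreteness, the benchmark in Eq. (1) can be estimated by OLS and used to form a one-step-ahead forecast. The sketch below is a minimal NumPy illustration with simulated data; the function name and inputs are hypothetical, not the paper's code.

```python
import numpy as np

def fit_predictive_regression(r, X):
    """OLS fit of r_{t+1} = alpha + beta' x_t + eps (Eq. 1).

    r : (T,) log excess returns; X : (T, n) predictors.
    X[t] explains r[t+1], so the last predictor row is reserved
    for the out-of-sample forecast of r_{T+1}.
    """
    Z = np.column_stack([np.ones(len(X) - 1), X[:-1]])  # regressors [1, x_t]
    coef, *_ = np.linalg.lstsq(Z, r[1:], rcond=None)    # [alpha, beta_1..n]
    alpha, beta = coef[0], coef[1:]
    forecast = alpha + X[-1] @ beta                     # r-hat for T+1
    return alpha, beta, forecast

# Simulated example: 20 years of monthly data, 3 predictors.
rng = np.random.default_rng(0)
T, n = 240, 3
X = rng.normal(size=(T, n))
true_beta = np.array([0.5, 0.0, -0.2])
r = np.empty(T)
r[0] = 0.0
r[1:] = 0.1 + X[:-1] @ true_beta + 0.05 * rng.normal(size=T - 1)

alpha, beta, forecast = fit_predictive_regression(r, X)
```

With low noise the recovered coefficients sit close to `true_beta`; with realistic return noise they would be far less precise, which is exactly the weak-predictability problem discussed above.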
Machine learning has a long history in economics and finance (Hutchinson et al., 1994; Kuan and White, 1994; Racine, 2001; Baillie and Kapetanios, 2007). At its core, one may perceive machine learning as a general statistical analysis that economists can use to capture complex relationships that are hidden when using simple linear methods. Breiman et al. (2001) emphasize that maximizing prediction accuracy in the face of an unknown model differentiates machine learning from the more traditional statistical objective of estimating a model assuming a data generating process. Building on this, machine learning seeks to choose the most preferable model from an unknown pool of models using innovative optimization techniques. As opposed to traditional measures of in-sample fit, machine learning evaluates models by their out-of-sample predictive accuracy.

Leading studies include Giglio and Xiu (2017); Heaton et al. (2017); Feng et al. (2018, 2019); Chen et al. (2019); Kelly et al. (2019); Freyberger et al. (2020); Gu et al. (2020); Kozak et al. (2020).
Deep feedforward networks, also often called feed-forward neural networks or multilayer perceptrons, lie at the heart of deep learning models and are universal approximators that can learn any functional relationship between input and output variables with sufficient data.

A feedforward network is a form of supervised machine learning that uses hierarchical layers to represent high-dimensional non-linear predictors in order to predict an output variable. Figure 1 illustrates how $\ell \in \{1, \ldots, L\}$ hidden layers transform input data $x_t = (x_{1t}, \ldots, x_{nt})$ in a chain using a collection of non-linear activation functions $f^{(1)}, \ldots, f^{(L)}$.

Figure 1. (Deep) Feedforward Network. This figure illustrates a deep neural network model $r_{t+1} = f_{W,b}(x_t) + \varepsilon_{r,t+1}$ that predicts the output return $r_{t+1}$ using a set of predictor variables $x_t = (x_{1t}, \ldots, x_{nt})$. The network is deep, with a large number of hidden layers $L$.

More formally, we can define our prediction problem by characterizing excess equity returns as

$$r_{t+1} = f_{W,b}(x_t) + \varepsilon_{r,t+1}, \qquad (2)$$

where $x_t = (x_{1t}, \ldots, x_{nt})$ is a set of predictor variables that enter an input layer, $\varepsilon_{r,t+1}$ is an i.i.d. error term, and $f_{W,b}$ is a neural network with $L$ hidden layers such that

$$\hat{r}_{t+1} := f_{W,b}(x_t) = f^{(L)}_{W^{(L)},b^{(L)}} \circ \ldots \circ f^{(1)}_{W^{(1)},b^{(1)}}(x_t), \qquad (3)$$

where $W = \big(W^{(1)}, \ldots, W^{(L)}\big)$ and $b = \big(b^{(1)}, \ldots, b^{(L)}\big)$ are weight matrices and bias vectors. A weight matrix $W^{(\ell)} \in \mathbb{R}^{m \times n}$ contains $m$ neurons as $n$ column vectors $W^{(\ell)} = [w^{(\ell)}_{\cdot,1}, \ldots, w^{(\ell)}_{\cdot,n}]$, and $b^{(\ell)}$ is a threshold or activation level that contributes to the output of a hidden layer, allowing the function to be shifted. Commonly used activation functions

$$f^{(\ell)}_{W^{(\ell)},b^{(\ell)}} := f_{\ell}\big(W^{(\ell)} x_t + b^{(\ell)}\big) = f_{\ell}\Big(\sum_{i=1}^{m} W^{(\ell)}_{i} x_t + b^{(\ell)}_{i}\Big) \qquad (4)$$

are sigmoidal (e.g., $f_{\ell}(z) = (1+\exp(-z))^{-1}$ or $f_{\ell}(z) = \tanh(z)$) or rectified linear units (ReLU, $f_{\ell}(z) = \max\{z, 0\}$). Note that if the functions $f$ are linear, $f_{W,b}(x_t)$ is a simple linear regression, regardless of the number of layers $L$, and the hidden layers are redundant. For example, with $L = 2$ the model becomes a reparametrized simple linear regression: $\hat{r}_{t+1} = W^{(2)}\big(W^{(1)} x_t + b^{(1)}\big) + b^{(2)} = \beta x_t + \alpha$. If $f_{W,b}(x_t)$ is non-linear, network complexity grows with increasing $m$, and with an increasing number of hidden layers $L$, or growing deepness of the network, we have a deep neural network.

Many predictors used in finance are non-i.i.d. and dynamically evolve in time, and hence traditional neural networks assuming independence of the data may not approximate relationships sufficiently well. Instead, a Recurrent Neural Network (RNN) that takes into account time series behavior may help in the prediction task. In addition, the Long-Short-Term-Memory (LSTM) network is designed to find hidden state processes allowing for lags of unknown and potentially long time dynamics in the time series. Figure 2 illustrates how the network structure additionally uses lagged information.

More formally, RNNs are a family of neural networks used for processing sequences of data. They transform a sequence of input predictors to another output sequence, introducing lagged hidden states as

$$h_t = f\big(W_h h_{t-1} + W_x x_t + b\big). \qquad (5)$$

Intuitively, an RNN is a non-linear generalization of an autoregressive process in which the lagged variables are transformations of the lagged observed variables. Figure 2 depicts $W_h$ using dashed lines and $W_x$ using solid lines. Nevertheless, this structure is only useful when the immediate past is relevant. When the time series dynamics are driven by events that lie further back in the past, the addition of complex LSTMs is required.

Long-Short-Term-Memory (LSTM). An LSTM is a particular form of recurrent network, which provides a solution to the short memory problem by incorporating memory units (Hochreiter and Schmidhuber, 1997).
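The layer composition in Eqs. (2)-(4), and the observation that linear activations collapse the network to a single linear regression, can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's Flux.jl code; all names and dimensions are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activations):
    """Forward pass f_{W,b}(x) = f^(L) o ... o f^(1)(x) (Eq. 3),
    where each layer applies f_l(W^(l) h + b^(l)) as in Eq. (4)."""
    h = x
    for W, b, f in zip(weights, biases, activations):
        h = f(W @ h + b)
    return h

rng = np.random.default_rng(1)
n = 4                                                 # number of predictors
W1, b1 = rng.normal(size=(3, n)), rng.normal(size=3)  # hidden layer, 3 neurons
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)  # output layer
x = rng.normal(size=n)

# Non-linear network: ReLU hidden layer, linear output.
r_hat = forward(x, [W1, W2], [b1, b2], [relu, lambda z: z])

# With identity activations the two-layer network collapses to a
# linear regression with beta = W2 W1 and alpha = W2 b1 + b2.
lin = forward(x, [W1, W2], [b1, b2], [lambda z: z, lambda z: z])
assert np.allclose(lin, (W2 @ W1) @ x + (W2 @ b1 + b2))
```

The final assertion verifies the $L = 2$ reparametrization claim numerically: the hidden layer adds nothing unless the activations are non-linear.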
Memory units allow the network to learn when to forget previous hidden states and when to update hidden states given new information. Specifically, in addition to a hidden state, an LSTM includes an input gate, a forget gate, an input modulation gate, and a memory cell. The memory cell unit combines the previous memory cell unit, which is modulated by the forget and input modulation gates, together with the previous hidden state, modulated by the input gate. These additional cells enable an LSTM to learn extremely complex long-term temporal dynamics that a vanilla RNN is not capable of. Such structures can be viewed as a flexible hidden state space model for a large dimensional system. Additional depth can be added to an LSTM by stacking the layers on top of each other, using the hidden state of the LSTM as the input to the next layer.

More formally, at each step a new memory cell $c_t$ is created with the current input $x_t$ and previous hidden state $h_{t-1}$, and it is then combined with a forget gate controlling the amount of information stored in the hidden state as

$$h_t = \underbrace{\sigma\big(W^{(o)}_h h_{t-1} + W^{(o)}_x x_t + b^{(o)}\big)}_{\text{output gate}} \circ \tanh(c_t) \qquad (6)$$

$$c_t = \underbrace{\sigma\big(W^{(g)}_h h_{t-1} + W^{(g)}_x x_t + b^{(g)}\big)}_{\text{forget gate}} \circ c_{t-1} + \underbrace{\sigma\big(W^{(i)}_h h_{t-1} + W^{(i)}_x x_t + b^{(i)}\big)}_{\text{input gate}} \circ \tanh(k_t). \qquad (7)$$

The term $\sigma(\cdot) \circ c_{t-1}$ introduces the long-range dependence, and $k_t$ is the new information flow to the current cell. The states of the forget and input gates control the weights of past memory and new information. In Figure 2, $c_t$ is the memory passed through multiple hidden states in the recurrent network.

Figure 2. (Deep) Recurrent Network. This figure illustrates a deep recurrent neural network model.

Due to the high dimensionality and non-linearity of the problem, estimation of a deep neural network is a complex task. Here, we provide a detailed summary of the model architectures and their estimation. We work with a variety of deep learning structures and compare them with a recurrent LSTM network and regularized OLS. We consider NN1, NN2 and NN3 models that contain 16, 32–16 and 32–16–8 neurons in the one, two, and three hidden layer structures, respectively, and an LSTM model, which is an NN with 3 recurrent layers with 32-16-8 neurons in each and LSTM cells introduced into the last layer.

To prevent the model from over-fitting and to reduce the large number of parameters, we use dropout, which is a common form of regularization that generally performs better than traditional $\ell_1$ or $\ell_2$ regularization. The term dropout refers to dropping out units in neural networks, and it can be shown to be a form of ridge regularization. To fit the networks, we adopt the popular and robust adaptive moment estimation algorithm (Adam) with weight decay regularization introduced by Kingma and Ba (2014), and we use the Huber loss function in the estimation.

Further, we follow the most common approach in the literature and select tuning parameters adaptively from the data in a validation sample. We split the data into training and validation samples that maintain the temporal ordering of the data and tune hyperparameters with respect to the statistical and economic criteria. We search for the optimal models in a grid of 100 randomly chosen combinations of the following hyperparameters: learning rate, decay regularization ∈ [0, 0.001], dropout of weights, and activation function ∈ {sigmoid, ReLU}, with 1000 epochs and early stopping. Since the sample at each window is rather small, and final models can depend on initial values in the optimization, we use ensemble averaging of five models with randomly chosen initial values.

We consider a portfolio choice problem of an agent with an investment horizon of $T$ periods who maximizes her expected utility over the cumulative portfolio return. There are two assets: a one-period Treasury bill and a stock index. If $\omega_{t+\tau}$ is the allocation to the stock index at time $t+\tau$, the investor solves the following optimization problem at time $t$:

$$\max_{\omega} \; \mathbb{E}_t\big[U(r_{p,t+T})\big] \qquad (8)$$

in which the end-of-horizon portfolio return $r_{p,t+T}$ is defined as

$$r_{p,t+T} = \prod_{\tau=1}^{T}\Big[(1-\omega_{t+\tau-1})\exp(r_{f,t+\tau}) + \omega_{t+\tau-1}\exp(r_{f,t+\tau} + r_{t+\tau})\Big], \qquad (9)$$

where $r_{f,t+\tau}$ denotes a zero-coupon default-free log bond yield between $t+\tau-1$ and $t+\tau$.

We have estimated our models on two servers with a 48-core Intel® Xeon® Gold 6126 CPU @ 2.60GHz and a 24-core Intel® Xeon® CPU E5-2643 v4 @ 3.40GHz, 768GB of memory, and two NVIDIA GeForce RTX 2080 Ti GPUs. We have used Flux.jl with Julia.

Extending our analysis to multiple assets is straightforward; however, we consider a portfolio choice problem with two assets as in Barberis (2000) and, more recently, Johannes et al. (2014) and Rossi (2018) to make our results directly comparable to other studies.

Following Johannes et al. (2014), we consider various choices of horizons $T$ to assess the impact of the length of the investment period. Specifically, we report the results for the two cases of six months and two years. Furthermore, we allow the investor to rebalance portfolio weights with different frequencies. The allocations between a Treasury bill and a stock index are updated every three months, or once per year, for the shorter or longer investment horizons, respectively. These choices of horizons and rebalancing periods allow us to compare two investment strategies. The former reflects a more actively managed portfolio with frequent changes in the allocations, whereas the latter corresponds to a relatively passive investment portfolio with less frequent rebalancing. We further winsorize the weights for the stock index. The investor has power utility

$$U(r_{p,t+\tau}) = \frac{r_{p,t+\tau}^{1-\gamma}}{1-\gamma},$$

where $\gamma$ is the coefficient of risk aversion. The expected utility is defined by the predictive distribution of cumulative portfolio returns $r_{p,t+\tau}$ given by Eq. (9), which in turn depends on the corresponding model used to predict future excess returns $r_{t+\tau}$ and the law of motion of the predictor variables $x_t$. For $x_t$, we adopt a parsimonious AR(1) framework, that is, each variable $x_{it}$ satisfies

$$x_{it} = \alpha_{x_i} + \beta_{x_i} x_{it-1} + \varepsilon_{x_i,t},$$

where $\alpha_{x_i}$ and $\beta_{x_i}$ are coefficients, and $\varepsilon_{x_i,t}$ are normal error terms. To proxy for the joint variance-covariance matrix of the error terms $\varepsilon_t = (\varepsilon_{rt}, \varepsilon_{xt})$, we employ a sample variance estimator $\hat{\Sigma}_t = \hat{\varepsilon}_t \hat{\varepsilon}_t'$, where $\hat{\varepsilon}_t$ are forecast errors.
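A single step of the LSTM recursions in Eqs. (6)-(7) above can be sketched directly. The paper's models are estimated with Flux.jl in Julia; the NumPy sketch below, with its parameter names and dimensions, is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the LSTM recursions in Eqs. (6)-(7).

    p holds the weight matrices W_h, W_x and biases b for the forget (g),
    input (i) and output (o) gates, plus those producing the new
    information flow k_t entering the current cell.
    """
    f_gate = sigmoid(p["Whg"] @ h_prev + p["Wxg"] @ x_t + p["bg"])  # forget gate
    i_gate = sigmoid(p["Whi"] @ h_prev + p["Wxi"] @ x_t + p["bi"])  # input gate
    o_gate = sigmoid(p["Who"] @ h_prev + p["Wxo"] @ x_t + p["bo"])  # output gate
    k_t = p["Whk"] @ h_prev + p["Wxk"] @ x_t + p["bk"]              # new information
    c_t = f_gate * c_prev + i_gate * np.tanh(k_t)                   # memory cell, Eq. (7)
    h_t = o_gate * np.tanh(c_t)                                     # hidden state, Eq. (6)
    return h_t, c_t

rng = np.random.default_rng(2)
n, m = 3, 5                    # predictors, hidden units (illustrative sizes)
p = {k: rng.normal(scale=0.3, size=(m, m)) for k in ["Whg", "Whi", "Who", "Whk"]}
p.update({k: rng.normal(scale=0.3, size=(m, n)) for k in ["Wxg", "Wxi", "Wxo", "Wxk"]})
p.update({k: np.zeros(m) for k in ["bg", "bi", "bo", "bk"]})

h, c = np.zeros(m), np.zeros(m)
for t in range(12):            # process a year of monthly predictor vectors
    h, c = lstm_step(rng.normal(size=n), h, c, p)
```

Because the forget gate multiplies the previous cell state, the memory cell $c_t$ can carry information across many steps, which is the long-range dependence the paper exploits.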
Finally, we set the risk aversion parameter $\gamma$. The linear regressions are estimated either on an expanding window or over a 10-year rolling window, as in Johannes et al. (2014). The univariate models with the expanding and rolling windows are denoted OLS1 and OLS2, and the multivariate versions are OLS3 and OLS4. We further consider a set of machine learning architectures including neural networks with 1 layer of 16 neurons (NN1), 2 layers of 32-16 neurons (NN2), 3 layers of 32-16-8 neurons (NN3), and an LSTM model with 3 recurrent layers and 32-16-8 neurons and LSTM cells introduced in the last layer. All NNs use a “kitchen sink” approach by utilizing all available data to predict log excess returns and are trained on a 10-year rolling window to account for a time-varying relationship between the predictors and returns.

There are many dimensions that can be used to generalize our modelling approach. More general specifications could add additional predictor variables (McCracken and Ng, 2016), parameter uncertainty (Wachter and Warusawitharana, 2009; Johannes et al., 2014; Bianchi and Tamoni, 2020), economic restrictions (Van Binsbergen and Koijen, 2010), or consider a larger set of investable assets and alternative preferences (Dangl and Weissensteiner, 2020), among other extensions. Most notably, modelling stochastic volatility via a parsimonious mean-reverting process (Johannes et al., 2014) or more complex GARCH- and MIDAS-type volatility estimators (Rossi, 2018) would certainly improve the performance of our strategies. Instead, we consider all specifications with a constant volatility setting to solely evaluate the impact of neural networks on the performance of dynamic allocation strategies. Our aim is to demonstrate out-of-sample portfolio gains from using deep learning in the most restrictive setting.

In our analysis, we employ a number of metrics measuring the statistical accuracy of the methods considered and their economic gains for the investor.
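For intuition, a one-period version of the expected-utility problem in Eqs. (8)-(9) can be solved by a grid search over the stock weight, given simulated draws from a model's predictive return distribution. This is an illustrative sketch with hypothetical inputs; the paper optimizes over a multi-period horizon with intermediate rebalancing and weight winsorization.

```python
import numpy as np

def crra_utility(gross_return, gamma):
    """Power utility U(r_p) = r_p^(1-gamma) / (1-gamma) over the gross return."""
    return gross_return ** (1.0 - gamma) / (1.0 - gamma)

def optimal_weight(r_f, r_sims, gamma, grid=np.linspace(0.0, 1.5, 151)):
    """Maximize expected utility (Eq. 8) of the one-period portfolio
    return (Eq. 9) over a grid of stock weights omega.

    r_f    : log risk-free rate for the period
    r_sims : simulated log excess returns from the predictive distribution
    """
    best_w, best_u = grid[0], -np.inf
    for w in grid:
        gross = (1 - w) * np.exp(r_f) + w * np.exp(r_f + r_sims)  # Eq. (9), T = 1
        u = crra_utility(gross, gamma).mean()                     # Monte Carlo E_t[U]
        if u > best_u:
            best_w, best_u = w, u
    return best_w

# Hypothetical predictive draws: 0.4% mean monthly excess return, 4% volatility.
rng = np.random.default_rng(3)
sims = 0.004 + 0.04 * rng.normal(size=100_000)
w_star = optimal_weight(r_f=0.002, r_sims=sims, gamma=5.0)
```

With these inputs the optimal weight lands near the familiar mean-variance approximation $(\mu + \sigma^2/2)/(\gamma\sigma^2) \approx 0.6$, illustrating how a higher predicted equity premium shifts the allocation toward the stock index.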
With respect to the statistical performance, we first consider a common measure of mean squared prediction error (MSPE) defined as
$$\text{MSPE} = \frac{1}{\overline{T} - \underline{t} + 1}\sum_{t=\underline{t}}^{\overline{T}} \big(r_t - \hat{r}^{M_s}_t\big)^2, \qquad (10)$$

where $r_t$ denotes the observed excess log return, $\hat{r}^{M_s}_t$ is the return predicted by a particular framework $M_s$, and $\underline{t}$ and $\overline{T}$ are the months of the first and last predictions. Notice that the investor rebalances her allocations at varying frequency. Thus, we compute the prediction errors only in those periods when she reoptimizes her portfolio.

As in Campbell and Thompson (2008), we compute the out-of-sample predictive $R^2_{oos}$:

$$R^2_{oos} = 1 - \frac{\sum_{t=\underline{t}}^{\overline{T}}\big(r_t - \hat{r}^{M_s}_t\big)^2}{\sum_{t=\underline{t}}^{\overline{T}}\big(r_t - \bar{r}_t\big)^2},$$

where $\bar{r}_t$ is the historical mean of returns. By construction, the $R^2_{oos}$ statistic compares the out-of-sample performance of the chosen model $M_s$ relative to the historical average forecast. Notice that we compute the historical mean over the same sample used to estimate $M_s$, which corresponds to either an expanding sample or a 10-year rolling window. A positive value of $R^2_{oos}$ indicates that the model-implied forecast has a smaller mean squared predictive error than the error implied by the historical average forecast. Thus, we perform a formal test of the null hypothesis $R^2_{oos} \le 0$ against the alternative that $R^2_{oos}$ is positive.

After we compare different models in terms of the statistical accuracy of their predictions, we assess whether superior statistical fit translates into economic gains. It is worth noting that this relationship is non-trivial. Indeed, Campbell and Thompson (2008) and Rapach et al. (2010) note that seemingly small improvements in $R^2_{oos}$ could generate large benefits in practice. We start our investigation of the size of the improvements by calculating the average Sharpe ratio of portfolio returns as a common measure of portfolio performance used in finance. The drawback of this metric is that it does not take tail behaviour into account. Consequently, we follow Fleming et al. (2001) and compute the certainty equivalent return (CER) by equating the utility from the CER to the average utility implied by an alternative model. Finally, we visualize the performance of all specifications by plotting the cumulative log portfolio returns over the sample period considered. This allows us to clearly see the time intervals in which the investor benefits the most from using different frameworks.

To evaluate the statistical significance of portfolio gains, we follow Bianchi et al. (2020) and implement a test à la Diebold and Mariano (2002). Specifically, we perform a pairwise comparison between the CERs generated by each framework under consideration and those yielded by the EH specification. For each model $M_s$, we estimate the regression

$$U^{M_s}_{t+T} - U^{EH}_{t+T} = \alpha^{M_s} + \varepsilon_{t+T},$$

where $U^{X}_{t+T} = \big(r^X_{p,t+T}\big)^{1-\gamma}/(1-\gamma)$ and $r^X_{p,t+T}$ is the cumulative portfolio return over the horizon $T$. Testing for the difference in the CERs boils down to a test for the significance of $\alpha^{M_s}$.

For the significance of SRs, we would first need to simulate artificial returns under a null model of no predictability, that is, a model with constant mean and constant volatility. For each simulation, we would need to obtain the forecasts for all models considered and construct optimal portfolios. Since a complete exercise of hyperparameter tuning takes around 2 days on the supercomputer cluster, repeating it, say, 500 times would increase cluster computing time proportionally. This makes the task computationally infeasible given the current computing capacity, unless more resources for parallel computing become available.

Empirical Results
Our empirical analysis of S&P 500 excess return predictability is based on the application of a variety of linear models and non-linear machine learning methods, as discussed in Section 2.3. We use a set of economic predictor variables considered by Goyal and Welch (2008) to make our results directly comparable to the literature. Specifically, we focus on the monthly historical data of twelve predictors including the dividend yield, log earnings-price ratio, dividend payout ratio, book-to-market ratio, net equity expansion, Treasury bill rates, term spread, default yield spread, default return spread, cross-sectional premium, inflation growth, and monthly stock variance.

Table 1 reports the statistical accuracy of the models considered. Panels A and B show the MSPEs and $R^2_{oos}$ based on those periods when quarterly and annual rebalancing occurs. As shown in Panel A, all linear regressions generate larger MSPEs compared to the constant mean and constant volatility model, while neural networks provide the best fit with the data.

A multivariate linear regression does not necessarily outperform a univariate model. Indeed, a linear regression estimated on the rolling window (OLS3) is noisier and generates a larger MSPE than the regression using only the dividend yield (OLS1), whereas the “kitchen sink” linear regression with an expanding window estimation (OLS4) slightly outperforms a single-predictor model (OLS2). Furthermore, consistent with Goyal and Welch (2008), none of the linear regressions can beat the simple historical mean, as indicated by the negative $R^2_{oos}$. In contrast, we find that the deep learning methods achieve positive $R^2_{oos}$, indicating the statistical benefits of accounting for the nonlinear relationship between stock market returns and predictors, similarly to Feng et al. (2018) and Rossi (2018). A formal test confirms that expected return predictability generated by NNs is statistically different from a naive historical mean forecast. In unreported results, we verify that the performance of all machine learning methods is statistically the same. Panel B also shows the results in favor of NNs in a setting with less frequent rebalancing.

Table 1. Statistical Accuracy of Excess Return Forecasts. This table reports the mean squared prediction error and out-of-sample $R^2_{oos}$ obtained from using different methodologies to predict future S&P 500 excess returns, as outlined in Section 2.3. We compute the out-of-sample $R^2_{oos}$ in comparison to the expectations hypothesis using the historical mean to predict returns. Panel A shows the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panel B demonstrates the results for a 2-year horizon and annual rebalancing. We compute statistical accuracy measures in those periods when the investor reevaluates her allocations with quarterly or annual frequency. We also report a p-value (in parentheses) of the null hypothesis $R^2_{oos} \le 0$ against the alternative that $R^2_{oos}$ is positive. The forecast starts in February 1955. The sample period spans from January 1945 to December 2018. Columns: EH, OLS1, OLS2, OLS3, OLS4, NN1, NN2, NN3, LSTM.

Table 2 provides a summary of annualized CERs and monthly SRs of portfolio returns for each model assuming a 6-month (Panel A) and 2-year (Panel B) investment horizon. The summary statistics in each panel are computed for the whole sample and for recession and expansion periods as defined by the NBER recession indicator. The improvements in $R^2_{oos}$ obtained using machine learning methods directly translate into economic gains for an investor. Specifically, the best-performing NN – the LSTM model – generates more than two- and three-fold increases in the annual CER (around 10% vs 4.7%) and monthly SR (0.175 vs 0.049) relative to the model ignoring expected return predictability.
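The out-of-sample $R^2_{oos}$ comparison against the historical-mean forecast described in Section 2 can be sketched as follows. The data below are simulated for illustration only; a forecast that captures half of each deviation from the mean yields $R^2_{oos} = 0.75$ by construction, while a pure-noise forecast loses to the historical mean.

```python
import numpy as np

def r2_oos(r, r_hat, r_bar):
    """Campbell-Thompson out-of-sample R^2: 1 - SSE(model) / SSE(historical mean)."""
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum((r - r_bar) ** 2)

rng = np.random.default_rng(4)
r = 0.005 + 0.04 * rng.normal(size=240)     # realized excess returns (simulated)
r_bar = np.full_like(r, 0.005)              # historical-mean forecast
good = r_bar + 0.5 * (r - r_bar)            # captures half of each deviation
bad = r_bar + 0.08 * rng.normal(size=240)   # pure noise added to the mean

assert r2_oos(r, good, r_bar) > 0           # beats the historical mean
assert r2_oos(r, bad, r_bar) < 0            # loses to the historical mean
```

The sign convention matches the paper: a positive $R^2_{oos}$ means the model forecast has a smaller mean squared error than the historical average forecast.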
The LSTM model, which is a three-layernetwork, is directly comparable to NN3 in terms of its structure complexity. Nevertheless,LSTM dominates a standard network, emphasizing the importance of learning complexlong-term temporal dynamics in addition to non-linear predictive relationships. In gen-eral, comparing NN1 through NN3, we observe that increasing the complexity of NNsdoes not necessarily improve portfolio performance, although all machine learning struc-tures remain statistically equivalent to each other. A formal one-sided test confirms that,except for NN3, the portfolio performance of NNs is significantly better than the perfor-mance generated by the EH model. Further, a comparison of the results in Panels A andB demonstrates that the investor benefits more from using NNs when she manages herportfolio more actively. Overall, these results indicate that expected return predictabil-ity generated by applying nonlinear methods provides valuable information for portfolioconstruction.We dissect this superior performance by looking at portfolio return statistics in periodsof expansion and recession. Table 2 shows that economic gains generated by NNs are largeduring both regimes and are especially pronounced in recessions. For instance, the annu-alized CER generated by the LSTM is, on average, around 8% in good times, which is morethan the 5% predicted by the EH model. In bad times, the difference in performance isextremely large, with around 26% and 3% CERs in the LSTM and EH models, respectively.A pairwise test confirms that the improvement of LSTM over EH is statistically significantduring both expansions and recessions. In contrast, the portfolio returns of NN1 through19 able 2. Certainty Equivalent Returns and Sharpe Ratios This table reports the annualized certainty equivalent returns and monthly Sharpe ratios for different modelsoutlined in Section 2.3. 
Panel A shows the results when the investor maximizes a 6-month portfolio returnand changes the allocations quarterly. Panel B shows the results for a 2-year horizon and annual rebalancing.Each panel computes the statistics for the whole sample, with expansion and recession periods as defined byNBER. For the statistical significance of CERs, we report a one-sided p-value (in parentheses) of the test á laDiebold and Mariano (2002). In particular, we regress the difference in utilities for each model M s and EH U M s t + T − U EHt + T = α M s + ε t + T ,where U Xt + T = (cid:16) r Xp , t + T (cid:17) − γ − γ and r Xp , t + T is the cumulative portfolio return with the horizon T . Testing for thedifference in the CERs boils down to a test for the significance in α M s . We flag in bold font CER values thatare significant at the 10% confidence level. The portfolio construction starts in February 1955. The sampleperiod spans from January 1945 to December 2018.EH OLS1 OLS2 OLS3 OLS4 NN1 NN2 NN3 LSTMPanel A: 6-month horizon and quarterly rebalancing1955-2018CER 4.737 2.643 -0.030 2.781 2.491 p-value (1.000) (1.000) (0.935) (0.954) (0.027) (0.032) (0.292) (0.000)SR 0.049 0.046 0.062 0.088 0.095 0.166 0.157 0.144 0.175ExpansionsCER 4.948 3.073 -0.173 4.598 2.045 5.873 5.280 5.304 p-value (0.998) (1.000) (0.654) (0.982) (0.258) (0.398) (0.403) (0.000)SR 0.100 0.077 0.048 0.092 0.108 0.149 0.143 0.135 0.149RecessionsCER 3.311 -0.274 1.401 -9.079 6.752 p-value (0.995) (0.648) (0.944) (0.221) (0.000) (0.000) (0.200) (0.000)SR -0.193 -0.182 0.154 0.091 0.036 0.284 0.255 0.204 0.358Panel B: 2-year horizon and annual rebalancing1955-2018CER 4.542 1.068 0.040 0.923 -0.067 p-value (1.000) (1.000) (0.999) (0.997) (0.000) (0.000) (0.000) (0.012)SR 0.048 0.044 0.046 0.083 0.081 0.138 0.136 0.129 0.118ExpansionsCER 4.448 0.826 0.514 0.321 0.231 p-value (1.000) (1.000) (0.999) (0.987) (0.002) (0.000) (0.000) (0.011)SR 0.100 0.076 0.037 0.097 0.137 0.136 0.149 0.132 0.112RecessionsCER 
5.235 2.866 -2.924 5.930 -2.138

Table 3. Portfolio Return Statistics
This table reports the mean, standard deviation, skewness, and kurtosis of optimal portfolio returns for different models as outlined in Section 2.3. All statistics are expressed in monthly terms. Panel A shows the results for the case when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panel B shows the results for a 2-year horizon and annual rebalancing. The portfolio construction starts in February 1955. The sample period spans from January 1945 to December 2018.

         EH      OLS1    OLS2    OLS3    OLS4    NN1     NN2     NN3     LSTM
Panel A: 6-month horizon and quarterly rebalancing
Mean     0.937   2.213   4.138   4.676   6.762   10.728  9.605   9.533   11.715
St.dev.  5.504   13.871  19.122  15.353  20.502  18.601  17.641  19.121  19.343
Skew     -0.472  -0.615  -0.893  -0.332  -0.881  -0.844  -0.816  -0.786  -0.046
Kurt     4.353   8.609   10.400  7.631   9.182   11.707  12.237  11.172  4.860
Panel B: 2-year horizon and annual rebalancing
Mean     0.978   2.184   3.058   5.093   4.908   7.333   5.634   4.445   7.722
St.dev.  5.849   14.443  19.189  17.655  17.512  15.331  11.937  9.916   18.870
Skew     -0.452  -0.492  -1.058  -0.989  -0.787  -0.275  0.469   0.799   -0.013
Kurt     4.386   7.104   10.566  13.550  10.737  9.113   10.416  14.775  6.433
NN3 are indistinguishable from EH in expansions, while shallower networks exhibit significantly better performance in recessions.

The investor who ignores expected return predictability experiences, on average, around −19% Sharpe ratios in recessions. In contrast, the LSTM model helps generate significant portfolio gains of around 36% SRs, with the other NNs generating at least 20% SRs on a monthly basis. Further, all NNs outperform linear regressions across good and bad times. The existing evidence for equities (Rapach et al., 2010; Dangl and Halling, 2012) indicates that return predictability is concentrated in bad times. Our findings extend the existing literature by showing that, unlike linear models, NNs help the investor to effectively convert predictive variation in stock market returns into substantial economic gains across different business cycle conditions.

Table 3 presents additional statistics of portfolio returns for different methodologies. The models using NNs generate out-of-sample returns with significantly larger means. Intuitively, this occurs because machine learning methods specifically excel in risk premium prediction, that is, the conditional expectation of returns. The linear regressions and

Gargano et al. (2019) report a similar result for bond returns. Recently, Bianchi et al. (2020) show that bond return predictability is also present in expansions when machine learning methods are employed.

Figure 3. Cumulative Returns
This figure illustrates the cumulative log returns of optimal portfolio strategies from different models outlined in Section 2.3. The left panel shows the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. The right panel shows the results for a 2-year horizon and annual rebalancing. The shaded areas denote recession periods as defined by NBER. The portfolio construction starts in February 1955.
The sample period spans from January 1945 to December 2018. (a) 6-month horizon and quarterly rebalancing. (b) 2-year horizon and annual rebalancing.

vanilla NNs do not take the time-varying volatility of returns into account, and hence these models predict negative skewness and excess kurtosis (since they ignore a fat-tailed return distribution). Interestingly, although an LSTM network does not consider time variation in return volatility, it is able to identify periods of high return variance using the long-term memory of its cells (including realized return variance as one of the predictors also helps). This results in better skewness and lower excess kurtosis. The statistics for the longer-horizon portfolio improve for the standard neural networks, while the properties remain largely the same or slightly deteriorate for the other models.

We visually summarize the previous results in Figure 3, which shows the cumulative sum of log portfolio returns. The left panel shows that NNs outperform other models by a large margin. The LSTM dominates the remaining networks by the end of the period considered, with a particularly pronounced difference in the second half of the sample. In relation to specific historical events, all NNs produce steady positive portfolio performance during the 2007-2008 Financial Crisis. Interestingly, the LSTM network additionally avoids a largely unexpected stock market crash, Black Monday, on October 19, 1987. Figure 3 also shows that the weaker statistical performance of the passive strategy with annual rebalancing leads to lower cumulative returns across all models.
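To make the memory mechanism discussed above concrete, the forward pass of a single LSTM cell can be sketched in a few lines of NumPy. This is a generic textbook gate layout, not the paper's architecture; the dimensions, weights, and input sequence are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. The cell state c is the additive 'long-term memory':
    the forget gate f decides how much of it to keep, the input gate i
    how much new information to write into it."""
    n = h.size
    z = W @ x + U @ h + b           # stacked gate pre-activations
    i = sigmoid(z[0*n:1*n])         # input gate
    f = sigmoid(z[1*n:2*n])         # forget gate
    g = np.tanh(z[2*n:3*n])         # candidate cell update
    o = sigmoid(z[3*n:4*n])         # output gate
    c_new = f * c + i * g           # memory carried across time steps
    h_new = o * np.tanh(c_new)      # hidden state exposed to the next layer
    return h_new, c_new

# Run a toy sequence of predictor vectors through the cell.
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3                  # hypothetical sizes
W = rng.normal(0, 0.1, (4 * n_hid, n_in))
U = rng.normal(0, 0.1, (4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(0, 1, (12, n_in)):   # 12 hypothetical monthly inputs
    h, c = lstm_step(x, h, c, W, U, b)
```

Because the cell state is updated additively, gradients (and hence information such as a persistent high-variance regime) can survive over many steps, which is the property the discussion above appeals to.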
This section dissects the performance of the portfolio returns constructed in the previous section across the seven decades of the post-WWII period considered. Further, this section connects the economic gains of the best-performing model to common drivers of asset prices. Finally, it also establishes the robustness of our conclusions to alternative measures of portfolio performance, transaction costs, borrowing and short-selling constraints, and a larger rolling window used to train NNs.
We start by examining whether the superior portfolio performance implied by NNs varies over subsamples other than expansions and recessions. Table 4 shows the certainty equivalent yields and Sharpe ratios computed separately for each decade in our sample. For the CERs, we extend the main finding of the paper: NNs, particularly LSTM, outperform the expectations hypothesis model in most cases. Specifically, the table shows that, except for the last decade, the LSTM network generates certainty equivalent values above those implied by the no-predictability framework. Interestingly, the formal test indicates that the improvement of LSTM over EH is significant during the first three decades, while the higher CERs in later periods are statistically equivalent to those from the EH model.

The linear models perform well across the 1990s and 2010s, during which the stock market grew steadily. Also, the rolling-window linear regressions tend to perform better than those using the expanding-window estimation, emphasizing the role of time-varying betas and changing information sets. For instance, Goyal and Welch (2008) show that the dividend yield exhibited strong predictive power for stock market returns from 1970 to the mid-1990s, with a weaker but mostly positive out-of-sample performance during the first two decades after World War II. In contrast, it produced large prediction errors during 1995-2000 and the 2000s. As a result, Table 4 shows that the OLS3 model generates high CERs from 1955 to 1989, exhibiting statistically better performance than EH in some cases, but the model is weaker in later years, when the forecast based on the dividend yield underperformed strongly.

Turning to the SRs, NNs provide the investor with substantially higher Sharpe ratios, with the exception of the 1990s and 2010s, when they perform slightly worse. These results are consistent with our previous findings. Indeed, the U.S. stock market was strongly bullish in these two decades, which are marked by prolonged stock market expansions.
In contrast, the Black Monday crash occurred in 1987 and the S&P 500 index recovered slowly, only by the end of the 1980s. Further, the beginning of the new millennium experienced two major crashes driven by the burst of the dot-com bubble and the subprime mortgage crisis. Table 4 shows that NNs perform significantly better than other specifications during decades with major stock bear markets and provide statistically equal results during bull markets, which is consistent with our previous results across expansions and recessions.
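The certainty equivalent returns and the one-sided utility-difference test used in Tables 2 and 4 can be sketched as follows. The risk-aversion value and the simulated return series are illustrative assumptions, not the paper's calibration, and the constant-only regression is implemented as the equivalent one-sample t-test on the utility differential.

```python
import numpy as np
from scipy import stats

def crra_utility(gross_ret, gamma):
    """Power utility U(r) = r^(1-gamma) / (1-gamma) of a gross return."""
    return gross_ret ** (1.0 - gamma) / (1.0 - gamma)

def certainty_equivalent(gross_ret, gamma):
    """Certainty equivalent rate solving U(1 + CER) = E[U(r)]."""
    eu = crra_utility(np.asarray(gross_ret), gamma).mean()
    return ((1.0 - gamma) * eu) ** (1.0 / (1.0 - gamma)) - 1.0

def cer_test(gross_model, gross_bench, gamma):
    """One-sided test: regressing the utility differential on a constant
    and testing alpha > 0 reduces to a t-test on the mean differential."""
    d = (crra_utility(np.asarray(gross_model), gamma)
         - crra_utility(np.asarray(gross_bench), gamma))
    t, p_two = stats.ttest_1samp(d, 0.0)
    return t, (p_two / 2.0 if t > 0 else 1.0 - p_two / 2.0)

# Illustration with hypothetical cumulative gross portfolio returns.
rng = np.random.default_rng(0)
model = 1.0 + rng.normal(0.04, 0.08, 256)   # stand-in for an NN portfolio
bench = 1.0 + rng.normal(0.02, 0.08, 256)   # stand-in for the EH portfolio
cer = certainty_equivalent(model, gamma=5.0)
t_stat, p_one = cer_test(model, bench, gamma=5.0)
```

As a sanity check, a portfolio that earns a constant 5% per period has a certainty equivalent of exactly 5% for any risk aversion, which the `certainty_equivalent` function reproduces.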
This section explores the link between the economic gains implied by the best-performing machine learning framework and prominent drivers of asset prices. In particular, we focus on the portfolio choice problem of the investor with a 6-month investment horizon and quarterly rebalancing who uses the LSTM to forecast future stock market returns. Formally, we establish this link by running a set of univariate regressions of the investor's utility from future portfolio returns on a set of structural determinants of risk premia. Our choice of variables is motivated by existing studies. For instance, a large strand of the literature (see Buraschi and Jiltsov (2007) and Dumas et al. (2009), among others) emphasizes the importance of disagreement for asset prices. In our analysis, we employ the Survey of Professional Forecasters to proxy for real disagreement (
DiB(g)) and nominal disagreement (DiB(π)), which are constructed as the interquartile range of 6-month-ahead

Table 4. Portfolio Performance across Subsamples
This table reports the annualized certainty equivalent returns and monthly Sharpe ratios for different models outlined in Section 2.3. The table shows the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. The table computes the statistics for each of the seven decades since WWII. For the statistical significance of CERs, we report a one-sided p-value (in parentheses) of the test à la Diebold and Mariano (2002). In particular, we regress the difference in utilities for each model $M_s$ and EH,
$$U^{M_s}_{t+T} - U^{EH}_{t+T} = \alpha^{M_s} + \varepsilon_{t+T},$$
where $U^{X}_{t+T} = \left(r^{X}_{p,t+T}\right)^{1-\gamma}/(1-\gamma)$ and $r^{X}_{p,t+T}$ is the cumulative portfolio return with the horizon $T$. Testing for the difference in the CERs boils down to a test for the significance of $\alpha^{M_s}$. We flag in bold font CER values that are significant at the 10% confidence level. The portfolio construction starts in February 1955. The sample period spans from January 1945 to December 2018.

EH OLS1 OLS2 OLS3 OLS4 NN1 NN2 NN3 LSTM
1955-1959
CER 5.467 4.376 3.219 9.495 5.545
p-value (0.631) (0.707) (0.149) (0.490) (0.000) (0.017) (0.000) (0.001)
SR 0.225 0.188 0.209 0.258 0.154 0.319 0.278 0.431 0.341
1960-1969
CER 4.197 0.580 -4.193
p-value (0.959) (0.991) (0.024) (0.931) (0.094) (0.042) (0.360) (0.000)
SR 0.062 0.064 0.030 0.157 0.067 0.164 0.148 0.181 0.241
1970-1979
CER 3.312 0.599 0.223
p-value (1.000) (0.847) (0.015) (0.026) (0.000) (0.000) (0.000) (0.000)
SR -0.107 -0.097 0.005 0.149 0.180 0.274 0.248 0.224 0.309
1980-1989
CER 9.215 7.243 -3.130 11.276 -1.450 3.315 1.216 2.603 10.241
p-value (0.992) (0.983) (0.148) (0.969) (0.820) (0.906) (0.864) (0.282)
SR 0.048 -0.005 0.072 0.139 0.055 0.166 0.123 0.142 0.112
1990-1999
CER 8.101 -4.393

Table 5.
Drivers of Portfolio Performance
This table reports the regression estimates, Newey-West p-values (in parentheses), and the R² of economic gains regressed on a set of selected variables determining risk premia. Economic gains are computed for the portfolio returns of the best-performing model, which employs the LSTM prediction of stock market returns. The independent variables proxy for real disagreement (DiB(g)), nominal disagreement (DiB(π)), economic uncertainty (UNbex), risk aversion via consumption growth (−Surplus) or financial variables (RAbex), the VIX index (VIX), and realized stock market volatility (σ). The variables on the left- and right-hand sides are standardized. We flag in bold font regression estimates that are significant at the 10% confidence level.

DiB(g) DiB(π) UNbex −Surplus RAbex VIX σ R²(%)
(i)

forecasts of GDP and CPI growth. Motivated by the well-established link between asset prices and uncertainty, we employ a novel measure of economic uncertainty (UNbex) constructed from financial variables at high frequencies (Bekaert et al., 2019).

We next examine the relationship between portfolio gains and the time-varying risk aversion of investors. Following Wachter (2006), we approximate risk aversion via the negative weighted average of consumption growth rates over a moving window of 10 years (−Surplus). We compare the results to an alternative measure of risk aversion extracted from financial variables (Bekaert et al., 2019). Finally, we relate portfolio utilities to stock market volatility by using the risk-neutral volatility (
VIX) as measured by the VIX index and by using the realized volatility (σ) as measured by the square root of the intra-month sum of squared daily S&P 500 returns.

Table 5 presents the regression results. Overall, the relationship between future realized portfolio gains and most structural risk factors is rather weak. Indeed, we document that only dispersion in beliefs about real or nominal growth is positively and statistically significantly linked to the investor's utilities. Intuitively, this result is expected since machine learning methods significantly outperform competing models during recessionary periods, when uncertainty and disagreement in forecasts are large. The third panel in Table 5 further confirms this positive association between economic uncertainty and portfolio gains; however, the link is statistically weaker compared to the disagreement measures. Except for realized stock market volatility, we obtain positive coefficients on the remaining risk factors.

Although certainty equivalent yields and Sharpe ratios are common measures of portfolio performance considered in the literature, the investor may use alternative statistics to evaluate their investment strategies, including maximum drawdown, maximum one-month loss, and average monthly turnover. For each model $M_s$, we define the maximum drawdown
$$\text{MaxDD} = \max_{t_0 \le t_1 \le t_2 \le T} \left[ \hat{r}^{M_s}_{t_0,t_1} - \hat{r}^{M_s}_{t_0,t_2} \right], \qquad (11)$$
in which $\hat{r}^{M_s}_{t_0,t}$ denotes the cumulative portfolio return from time $t_0$ through $t$, while $t_0$ and $T$ are the months of the first and last predictions. The maximum one-month loss measures the largest portfolio decline during the period considered. The average monthly turnover is defined as
$$\text{Turnover} = \frac{1}{T - t_0} \sum_{t = t_0 + 1}^{T} \left| \omega_t - \omega_{t-1} \cdot \hat{r}^{M_s}_{t-1} \right|, \qquad (12)$$
where $\omega_{t-1}$ is the weight of the stock index.

Table 6 shows the results for alternative performance statistics.
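A direct implementation of equations (11) and (12) might look as follows. The input series are hypothetical, and we read the turnover adjustment $\omega_{t-1}\cdot\hat{r}^{M_s}_{t-1}$ as scaling the previous weight by the realized gross return; this is our reading of the notation, not code from the paper.

```python
import numpy as np

def max_drawdown(cum_ret):
    """Equation (11): the largest gap between a running peak of the
    cumulative portfolio return and any later value."""
    cum_ret = np.asarray(cum_ret, dtype=float)
    running_peak = np.maximum.accumulate(cum_ret)
    return float(np.max(running_peak - cum_ret))

def avg_turnover(weights, gross_ret):
    """Equation (12): mean absolute change in the stock-index weight,
    with the previous weight scaled by the realized gross return."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(gross_ret, dtype=float)
    return float(np.mean(np.abs(w[1:] - w[:-1] * r[:-1])))

# Toy illustration with a short hypothetical history.
cum = [0.00, 0.04, 0.02, 0.07, -0.01, 0.05]   # peak 0.07, trough -0.01
dd = max_drawdown(cum)
to = avg_turnover([0.6, 0.8, 0.5, 0.7, 0.4, 0.6],
                  [1.02, 0.99, 1.01, 0.97, 1.03, 1.00])
```

A buy-and-hold-style strategy whose weight never deviates from its return-drifted value has zero turnover under this definition, which is the sense in which the EH turnover in Table 6 is small.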
We first focus on actively managed portfolios with quarterly rebalancing and then move to more passive investment strategies with annual rebalancing. The maximum drawdown experienced by NN1 through NN3 is between 68% and 83% on a monthly basis. The linear models predict comparable or even larger drawdowns, whereas the constant mean and constant volatility model delivers a mild loss of around 23%. In contrast, the maximum drawdown

Table 6. Drawdowns, Maximum Loss, and Turnover
This table reports alternative out-of-sample performance measures (maximum drawdown, maximum 1-month loss, and turnover) of optimal portfolio returns for the different methodologies used to predict future S&P 500 excess returns as outlined in Section 2.3. All statistics are expressed in percentages. Panel A shows the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panel B demonstrates the results for a 2-year horizon and annual rebalancing. The portfolio construction starts in February 1955. The sample period spans from January 1945 to December 2018.

             EH      OLS1    OLS2    OLS3     OLS4     NN1     NN2     NN3     LSTM
Panel A: 6-month horizon and quarterly rebalancing
Max DD       22.795  76.236  74.251  144.760  100.956  74.572  68.248  82.720  45.995
Max 1M Loss  7.795   33.325  57.974  31.756   57.974   57.974  57.974  57.974  35.011
Turnover     0.506   4.286   10.968  17.890   34.531   23.008  23.407  29.616  32.814
Panel B: 2-year horizon and annual rebalancing
Max DD       25.002  90.896  83.370  123.279  145.663  74.972  34.562  25.136  64.433
Max 1M Loss  8.036   29.431  57.974  57.974   38.862   34.375  24.419  25.136  35.011
Turnover     0.584   4.289   8.804   10.773   11.329   11.352  8.725   6.771   17.459

for LSTM is around 46%, the mildest decline among the predictive models. Panel A further shows a similar picture for the maximum one-month loss of the portfolio: linear models and NNs tend to generate the worst one-period performance, while the LSTM strategy experiences a milder loss.
Thus, the LSTM specification is the most successful in avoiding large losses over short- and long-term periods, even though this comes at the expense of higher turnover.

Panel B in Table 6 shows that the investor engaging in less frequent portfolio rebalancing is generally less efficient in forming the optimal portfolio if he relies on the linear regressions. Interestingly, the benefits of deep learning methods remain similar and even improve in some cases. For instance, the maximum one-month and drawdown losses tend to increase from 83% to more than 140% for the linear models, while the largest NN declines run from 25% to 35% per month. Furthermore, as the portfolio weights are kept unchanged for longer investment periods, the turnover is reduced. Thus, the passive investor who is mainly interested in reducing his short- and long-term tail risks would still find NNs useful, while he does not benefit from linear predictive models.

In sum, exploiting expected return predictability via NNs for portfolio construction leads to riskier investments. It also generates increased turnover, especially for the best-performing model using the LSTM network. A natural question arises whether these benefits are offset by the large transaction costs implied by more aggressive buying and selling of stocks.
This subsection extends the main analysis by accounting for the effect of transaction costs. Specifically, we consider low and high transaction costs that are equal to the percentage paid by the investor for the change in value traded. Let τ denote a transaction cost parameter. Then the transaction-cost-adjusted returns are defined as
$$\hat{r}^{\tau, M_s}_t = \hat{r}^{M_s}_t - \tau \left| \omega_t - \omega_{t-1} \cdot \hat{r}^{M_s}_{t-1} \right|,$$
where τ can attain one of two possible values, $\tau_l$ or $\tau_h$.

Table 7. Portfolio Performance with Transaction Costs
This table reports the annualized certainty equivalent returns and Sharpe ratios for different models outlined in Section 2.3. The top and bottom sections of the table compute optimal returns with low and high transaction costs, respectively. Panels A and C show the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panels B and D show the results for a 2-year horizon and annual rebalancing. Each panel computes the statistics for the whole sample. For the statistical significance of CERs, we report a one-sided p-value (in parentheses) of the test à la Diebold and Mariano (2002). In particular, we regress the difference in utilities for each model $M_s$ and EH,
$$U^{M_s}_{t+T} - U^{EH}_{t+T} = \alpha^{M_s} + \varepsilon_{t+T},$$
where $U^{X}_{t+T} = \left(r^{X}_{p,t+T}\right)^{1-\gamma}/(1-\gamma)$ and $r^{X}_{p,t+T}$ is the cumulative portfolio return with the horizon $T$. Testing for the difference in the CERs boils down to a test for the significance of $\alpha^{M_s}$. We flag in bold font those CER values that are significant at the 10% confidence level.
The portfolio construction starts in February 1955. The sample period spans from January 1945 to December 2018.

EH OLS1 OLS2 OLS3 OLS4 NN1 NN2 NN3 LSTM
Low Transaction Costs
Panel A: 6-month horizon and quarterly rebalancing
CER 4.731 2.589 -0.171 2.548 2.036
p-value (1.000) (1.000) (0.953) (0.978) (0.045) (0.054) (0.391) (0.000)
SR 0.049 0.045 0.060 0.084 0.089 0.162 0.153 0.139 0.169
Panel B: 2-year horizon and annual rebalancing
CER 4.535 0.998 -0.085 0.765 -0.241
p-value (1.000) (1.000) (0.999) (0.998) (0.001) (0.000) (0.000) (0.033)
SR 0.048 0.043 0.044 0.081 0.079 0.135 0.134 0.127 0.115
High Transaction Costs
Panel C: 6-month horizon and quarterly rebalancing
CER 4.706 2.370 -0.736 1.609 0.193 5.791 5.501 3.592
p-value (1.000) (1.000) (0.990) (0.999) (0.214) (0.263) (0.784) (0.000)
SR 0.048 0.041 0.053 0.068 0.066 0.145 0.134 0.117 0.145
Panel D: 2-year horizon and annual rebalancing
CER 4.506 0.717 -0.586 0.129 -0.943

We consider an additional robustness check of alternative assumptions about the portfolio weights. The main analysis allows the investor to borrow money or to short-sell the stock by considering weights in the interval − ≤ ω_t ≤
2. In this subsection, we perform a two-step analysis: we first impose borrowing constraints by restricting the optimal weight on the risk-free investment to be non-negative, and we then additionally impose short-selling constraints with the weights 0 ≤ ω_t ≤

The subperiod analysis presented in Table 4 reveals a slightly declining performance of NNs by the end of the sample. In particular, the LSTM generates higher CERs than the

Table 8. Portfolio Performance with Borrowing and Short-Selling Constraints
This table reports the annualized certainty equivalent returns and Sharpe ratios for different models outlined in Section 2.3. The top section of the table imposes borrowing constraints, while the bottom section additionally assumes short-selling constraints. Panels A and C show the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panels B and D show the results for a 2-year horizon and annual rebalancing. For the statistical significance of CERs, we report a one-sided p-value (in parentheses) of the test à la Diebold and Mariano (2002). In particular, we regress the difference in utilities for each model $M_s$ and EH,
$$U^{M_s}_{t+T} - U^{EH}_{t+T} = \alpha^{M_s} + \varepsilon_{t+T},$$
where $U^{X}_{t+T} = \left(r^{X}_{p,t+T}\right)^{1-\gamma}/(1-\gamma)$ and $r^{X}_{p,t+T}$ is the cumulative portfolio return with the horizon $T$. Testing for the difference in the CERs boils down to a test for the significance of $\alpha^{M_s}$. We flag in bold font CER values that are significant at the 10% confidence level. The portfolio construction starts in February 1955.
The sample period spans from January 1945 to December 2018.

EH OLS1 OLS2 OLS3 OLS4 NN1 NN2 NN3 LSTM
Borrowing Constraint
Panel A: 6-month horizon and quarterly rebalancing
CER 4.737 3.662 3.371 4.149 4.560
p-value (0.999) (0.958) (0.770) (0.591) (0.000) (0.000) (0.003) (0.000)
SR 0.049 0.046 0.061 0.074 0.080 0.176 0.146 0.135 0.157
Panel B: 2-year horizon and annual rebalancing
CER 4.542 2.780 2.936 2.974 1.560 4.964
p-value (1.000) (1.000) (1.000) (1.000) (0.147) (0.007) (0.008) (0.033)
SR 0.048 0.044 0.051 0.049 0.013 0.101 0.094 0.087 0.100
Borrowing and Short-Selling Constraints
Panel C: 6-month horizon and quarterly rebalancing
CER 4.737 3.704 4.707 5.353 5.708
p-value (0.998) (0.528) (0.080) (0.004) (0.000) (0.000) (0.000) (0.000)
SR 0.049 0.047 0.066 0.093 0.093 0.146 0.129 0.138 0.150
Panel D: 2-year horizon and annual rebalancing
CER 4.542 2.780 3.757 4.261
p-value (1.000) (1.000) (0.859) (0.004) (0.000) (0.000) (0.000) (0.000)
SR 0.048 0.044 0.054 0.054 0.071 0.107 0.109 0.098 0.107
EH model; however, the difference proves to be statistically indistinguishable over the last four decades. This raises the question of whether the evidence in this paper holds for more recent data. This subsection demonstrates that the main conclusions of this paper indeed remain intact.

Table 9 reports summary statistics of the out-of-sample portfolio returns, which are obtained for the subperiod from February 1969 to December 2018, as in Rossi (2018). In relation to the models using the rolling-window estimation, we assume a 20-year horizon to assess the impact of a longer history on the performance of the different methodologies, particularly the machine learning methods that are assumed to work better with larger samples. Notice that the quantitative predictions of this exercise are not directly comparable to the previous results due to differences in the historical data. In particular, the period from February 1969 to December 2018 is characterized by slightly weaker market performance, which ultimately translates into a less favorable opportunity set for the investor. The return statistics in Table 9 are consistent with this intuition. The average Sharpe ratio implied by the model with no predictability shrinks to half the size of that in the benchmark analysis. The linear models experience comparable deterioration in results.

For NNs with quarterly rebalancing, we document several interesting observations. First, despite the weaker performance of the stock market during the period considered, the monthly Sharpe ratios implied by NNs decrease only marginally, with the drop approximately equal to 0.01 to 0.03 relative to the main results. Second, comparing NN1 through NN3 in terms of certainty equivalent returns, the NNs yield statistically the same results.
Although deeper networks generate slightly lower CERs than those predicted by shallower networks, the p-values indicate that these model-based values remain in the same equivalence class. Third, the LSTM still produces the most significant economic gains. Specifically, the annualized certainty equivalent yield is above 7% and monthly Sharpe ratios remain as high as 0.165. Finally, unlike the weak statistical evidence of the main results with recent data, the formal test of the results in this subsection demonstrates strong statistical evidence in favor of NNs. The reason is that NNs use a 20-year rolling window for hyperparameter tuning, which helps them to better learn non-linear relationships as well as short- and long-term dependencies (in the case of LSTM) from the data.

Table 9. Portfolio Performance from February 1969 to December 2018: 20-year rolling window
This table reports the annualized certainty equivalent returns and Sharpe ratios for different models outlined in Section 2.3. The rolling-window estimation uses 20 years of recent data. Panel A shows the results when the investor maximizes a 6-month portfolio return and changes the allocations quarterly. Panel B shows the results for a 2-year horizon and annual rebalancing. Each panel computes the statistics for the whole sample, with expansion and recession periods as defined by the NBER. For the statistical significance of CERs, we report a one-sided p-value (in parentheses) of the test à la Diebold and Mariano (2002). In particular, we regress the difference in utilities for each model $M_s$ and EH,
$$U^{M_s}_{t+T} - U^{EH}_{t+T} = \alpha^{M_s} + \varepsilon_{t+T},$$
where $U^{X}_{t+T} = \left(r^{X}_{p,t+T}\right)^{1-\gamma}/(1-\gamma)$ and $r^{X}_{p,t+T}$ is the cumulative portfolio return with the horizon $T$. Testing for the difference in the CERs boils down to a test for the significance of $\alpha^{M_s}$. We flag in bold font CER values that are significant at the 10% confidence level.
The portfolio construction starts in February 1969.

EH OLS1 OLS2 OLS3 OLS4 NN1 NN2 NN3 LSTM
Panel A: 6-month horizon and quarterly rebalancing
1969-2018
CER 4.600 1.763 0.791 1.025 3.707
p-value (1.000) (1.000) (0.984) (0.811) (0.018) (0.053) (0.061) (0.016)
SR 0.025 0.010 0.018 0.059 0.057 0.135 0.140 0.132 0.165
Expansions
CER 5.038 3.158 2.479 4.551 5.846
p-value (0.999) (0.997) (0.694) (0.158) (0.008) (0.022) (0.335) (0.039)
SR 0.090 0.059 0.045 0.070 0.068 0.139 0.138 0.141 0.161
Recessions
CER 1.846 -6.771 -9.496 -18.688 -8.560 3.819 2.347 4.423
p-value (1.000) (0.999) (0.985) (0.980) (0.345) (0.465) (0.287) (0.002)
SR -0.251 -0.253 -0.123 0.034 0.023 0.123 0.156 0.061 0.226
Panel B: 2-year horizon and annual rebalancing
1969-2018
CER 4.530 0.508 -2.573 -2.558 2.080
p-value (1.000) (1.000) (1.000) (1.000) (0.026) (0.068) (0.000) (0.000)
SR 0.023 0.008 -0.002 0.008 0.025 0.126 0.084 0.117 0.135
Expansions
CER 4.448 0.246 -2.160 -2.174 3.324
p-value (1.000) (1.000) (1.000) (0.992) (0.076) (0.297) (0.000) (0.001)
SR 0.089 0.059 0.035 0.008 0.021 0.108 0.054 0.100 0.127
Recessions
CER 5.089 2.275 -5.061 -4.914 -4.333
p-value (0.999) (1.000) (1.000) (1.000) (0.023) (0.004) (0.003) (0.009)
SR -0.248 -0.259 -0.194 0.010 0.045 0.216 0.211 0.210 0.181

Conclusion
In this paper, we evaluate the economic gains of using deep learning methods for the construction of optimal portfolios. We study the portfolio allocation of a long-horizon investor who uses neural networks to predict future returns when choosing an optimal allocation between a market portfolio and a risk-free asset. We propose and compare various architectures of neural networks, including shallow and deep NNs as well as the LSTM specification, which is capable of learning long-term relationships. Three key findings emerge from our investigation.

First, we demonstrate that the sound statistical performance of non-linear machine learning methods, such as neural networks, leads to large and significant out-of-sample portfolio gains. These gains are robust to a variety of portfolio performance measures, the inclusion of transaction costs, and borrowing and short-selling constraints. Second, we find that employing the forecasts of deeper networks does not necessarily translate into larger economic gains. In order to identify and benefit from a complex non-linear predictive relationship, the investor needs to harvest more data, while shallower NNs might be a better option in a setting with small samples. In terms of NNs, we further show that the novel LSTM is the best-performing specification. This emphasizes the critical role of short- and long-term order dependencies in predicting stock returns, in addition to approximating the non-linear relationship. Finally, we document that NNs perform well even in the absence of additional ingredients, such as time-varying return volatility, which are commonly proposed by the literature studying linear predictive regressions. Our results show that NNs are capable of identifying these complex features from the data in a non-parametric way and without any specific modelling assumptions.

Our analysis can be extended in a number of ways. It would be interesting to examine the interaction between NNs and alternative preference specifications.
In particular, it is not clear whether an investor with a tail-sensitive utility function or a preference for early resolution of uncertainty would be able to generate comparable economic gains. Van Binsbergen and Koijen (2010) present evidence that additional economic restrictions can actually improve a model's performance. Our results point to a negative impact of restricting portfolio weights on the gains of the NNs. It would be interesting to examine whether our evidence holds in a setting with other restrictions, in particular those proposed by Van Binsbergen and Koijen (2010). Finally, extending our analysis to multiple assets is a straightforward exercise, which would shed light on the economic significance of forecasting returns of different asset classes via NNs.

References
Ang, A. and Bekaert, G. (2007). Stock return predictability: Is it there? The Review of Financial Studies, 20(3):651–707.
Avramov, D. (2002). Stock return predictability and model uncertainty. Journal of Financial Economics, 64(3):423–458.
Baillie, R. T. and Kapetanios, G. (2007). Testing for neglected nonlinearity in long-memory models. Journal of Business & Economic Statistics, 25(4):447–461.
Barberis, N. (2000). Investing for the long run when returns are predictable. The Journal of Finance, 55(1):225–264.
Bekaert, G., Engstrom, E. C., and Xu, N. R. (2019). The time variation in risk appetite and uncertainty. Technical report, National Bureau of Economic Research.
Bianchi, D., Büchner, M., and Tamoni, A. (2020). Bond risk premia with machine learning. Review of Financial Studies, (forthcoming).
Bianchi, D. and Tamoni, A. (2020). Sparse predictive regressions: Statistical performance and economic significance. Machine Learning for Asset Management: New Developments and Financial Applications, pages 75–113.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231.
Bryzgalova, S., Pelger, M., and Zhu, J. (2019). Forest through the trees: Building cross-sections of stock returns. Available at SSRN 3493458.
Buraschi, A. and Jiltsov, A. (2007). Habit formation and macroeconomic models of the term structure of interest rates. The Journal of Finance, 62(6):3009–3063.
Campbell, J. Y. (1987). Stock returns and the term structure. Journal of Financial Economics, 18(2):373–399.
Campbell, J. Y. and Shiller, R. J. (1988). The dividend-price ratio and expectations of future dividends and discount factors. The Review of Financial Studies, 1(3):195–228.
Campbell, J. Y. and Thompson, S. B. (2008). Predicting excess stock returns out of sample: Can anything beat the historical average? The Review of Financial Studies, 21(4):1509–1531.
Campbell, J. Y. and Yogo, M. (2006). Efficient tests of stock return predictability. Journal of Financial Economics, 81(1):27–60.
Chen, L., Pelger, M., and Zhu, J. (2019). Deep learning in asset pricing. Available at SSRN 3350138.
Chen, L., Pelger, M., and Zhu, J. (2020). Deep learning in asset pricing. Available at SSRN 3350138.
Clark, T. and West, K. (2007). Approximately normal tests for equal predictive accuracy in nested models. Journal of Econometrics, 138(1):291–311.
Cochrane, J. H. (2008). The dog that did not bark: A defense of return predictability. The Review of Financial Studies, 21(4):1533–1575.
Cremers, K. M. (2002). Stock return predictability: A Bayesian model selection perspective. The Review of Financial Studies, 15(4):1223–1249.
Dangl, T. and Halling, M. (2012). Predictive regressions with time-varying coefficients. Journal of Financial Economics, 106(1):157–181.
Dangl, T. and Weissensteiner, A. (2020). Optimal portfolios under time-varying investment opportunities, parameter uncertainty, and ambiguity aversion. Journal of Financial and Quantitative Analysis, 55(4):1163–1198.
Diebold, F. X. and Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1):134–144.
Dumas, B., Kurshev, A., and Uppal, R. (2009). Equilibrium portfolio strategies in the presence of sentiment risk and excess volatility. The Journal of Finance, 64(2):579–629.
Fama, E. F. and French, K. R. (1988). Dividend yields and expected stock returns. Journal of Financial Economics, 22(1):3–25.
Fama, E. F. and French, K. R. (1989). Business conditions and expected returns on stocks and bonds. Journal of Financial Economics, 25(1):23–49.
Feng, G., He, J., and Polson, N. G. (2018). Deep learning for predicting asset returns. arXiv preprint arXiv:1804.09314.
Feng, G., Polson, N., and Xu, J. (2019). Deep learning in characteristics-sorted factor models. Available at SSRN 3243683.
Ferson, W. E. and Harvey, C. R. (1991). The variation of economic risk premiums.
Journalof Political Economy , 99(2):385–415.Ferson, W. E., Sarkissian, S., and Simin, T. T. (2003). Spurious regressions in financialeconomics?
The Journal of Finance , 58(4):1393–1413.38leming, J., Kirby, C., and Ostdiek, B. (2001). The economic value of volatility timing.
TheJournal of Finance , 56(1):329–352.Freyberger, J., Neuhierl, A., and Weber, M. (2020). Dissecting characteristics nonparamet-rically.
The Review of Financial Studies , 33(5):2326–2377.Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., and Walther, A. (2018). Predictably un-equal? the effects of machine learning on credit markets.
The Effects of Machine Learningon Credit Markets (November 6, 2018) .Gargano, A., Pettenuzzo, D., and Timmermann, A. (2019). Bond return predictability:Economic value and links to the macroeconomy.
Management Science , 65(2):508–540.Giannone, D., Lenza, M., and Primiceri, G. E. (2017). Economic predictions with big data:The illusion of sparsity.Giglio, S. and Xiu, D. (2017). Inference on risk premia in the presence of omitted factors.Technical report, National Bureau of Economic Research.Goyal, A. and Welch, I. (2008). A comprehensive look at the empirical performance ofequity premium prediction.
The Review of Financial Studies , 21(4):1455–1508.Green, J., Hand, J. R., and Zhang, X. F. (2013). The supraview of return predictive signals.
Review of Accounting Studies , 18(3):692–730.Gu, S., Kelly, B., and Xiu, D. (2020). Empirical asset pricing via machine learning.
TheReview of Financial Studies , 33(5):2223–2273.Harvey, C. R., Liu, Y., and Zhu, H. (2016). ... and the cross-section of expected returns.
TheReview of Financial Studies , 29(1):5–68.Heaton, J. B., Polson, N. G., and Witte, J. H. (2017). Deep learning for finance: deepportfolios.
Applied Stochastic Models in Business and Industry , 33(1):3–12.Henkel, S. J., Martin, J. S., and Nardari, F. (2011). Time-varying short-horizon predictability.
Journal of financial economics , 99(3):560–580.Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory.
Neural computation ,9(8):1735–1780.Hodrick, R. J. (1992). Dividend yields and expected stock returns: Alternative proceduresfor inference and measurement.
The Review of Financial Studies , 5(3):357–386.39utchinson, J. M., Lo, A. W., and Poggio, T. (1994). A nonparametric approach to pricingand hedging derivative securities via learning networks.
The Journal of Finance , 49(3):851–889.Israel, R., Kelly, B. T., and Moskowitz, T. J. (2020). Can machines ’learn’ finance?
Availableat SSRN 3624052 .Johannes, M., Korteweg, A., and Polson, N. (2014). Sequential learning, predictability, andoptimal portfolio returns.
The Journal of Finance , 69(2):611–644.Kelly, B. and Pruitt, S. (2013). Market expectations in the cross-section of present values.
The Journal of Finance , 68(5):1721–1756.Kelly, B. and Pruitt, S. (2015). The three-pass regression filter: A new approach to forecast-ing using many predictors.
Journal of Econometrics , 186(2):294–316.Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are covariances: A unified modelof risk and return.
Journal of Financial Economics , 134(3):501–524.Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 .Kozak, S., Nagel, S., and Santosh, S. (2020). Shrinking the cross-section.
Journal of FinancialEconomics , 135(2):271–292.Kuan, C.-M. and White, H. (1994). Artificial neural networks: An econometric perspective.
Econometric reviews , 13(1):1–91.Lettau, M. and Ludvigson, S. (2001). Consumption, aggregate wealth, and expected stockreturns. the Journal of Finance , 56(3):815–849.Lettau, M. and Van Nieuwerburgh, S. (2008). Reconciling the return predictability evi-dence: The review of financial studies: Reconciling the return predictability evidence.
The Review of Financial Studies , 21(4):1607–1652.Lewellen, J. (2004). Predicting returns with financial ratios.
Journal of Financial Economics ,74(2):209–235.Lopez de Prado, M. (2019). Beyond econometrics: A roadmap towards financial machinelearning.
Available at SSRN 3365282 .McCracken, M. W. and Ng, S. (2016). Fred-md: A monthly database for macroeconomicresearch.
Journal of Business & Economic Statistics , 34(4):574–589.40enzly, L., Santos, T., and Veronesi, P. (2004). Understanding predictability.
Journal ofPolitical Economy , 112(1):1–47.Messmer, M. (2017). Deep learning and the cross-section of expected returns.
Available atSSRN 3081555 .Pástor, L. and Stambaugh, R. F. (2009). Predictive systems: Living with imperfect predic-tors.
The Journal of Finance , 64(4):1583–1628.Paye, B. S. and Timmermann, A. (2006). Instability of return prediction models.
Journal ofEmpirical Finance , 13(3):274–315.Pesaran, M. H. and Timmermann, A. (1995). Predictability of stock returns: Robustnessand economic significance.
The Journal of Finance , 50(4):1201–1228.Racine, J. (2001). On the nonlinear predictability of stock returns using financial and eco-nomic variables.
Journal of Business & Economic Statistics , 19(3):380–382.Rapach, D., Strauss, J., and Zhou, G. (2010). Out-of-sample equity premium predic-tion: Combination forecasts and links to the real economy.
Review of Financial Studies ,23(2):821–862.Rossi, A. G. (2018). Predicting stock market returns with machine learning. Technicalreport, Working paper.Santos, T. and Veronesi, P. (2006). Labor income and predictable stock returns.
The Reviewof Financial Studies , 19(1):1–44.Shiller, R. (1981). Do stock prices move too much to be justified by subsequent changes individends?
American Economic Review , 71:421–436.Sirignano, J., Sadhwani, A., and Giesecke, K. (2016). Deep learning for mortgage risk. arXivpreprint arXiv:1607.02470 .Stambaugh, R. F. (1999). Predictive regressions.
Journal of Financial Economics , 54(3):375–421.Tobek, O. and Hronec, M. (2020). Does it pay to follow anomalies research? machinelearning approach with international evidence.
Journal of Financial Markets , page 100588.Torous, W., Valkanov, R., and Yan, S. (2004). On predicting stock returns with nearlyintegrated explanatory variables.
The Journal of Business , 77(4):937–966.Van Binsbergen, J. H. and Koijen, R. S. (2010). Predictive regressions: A present-valueapproach.
The Journal of Finance , 65(4):1439–1471.41achter, J. A. (2006). A consumption-based model of the term structure of interest rates.
Journal of Financial economics , 79(2):365–399.Wachter, J. A. and Warusawitharana, M. (2009). Predictable returns and asset allocation:Should a skeptical investor time the market?
Journal of Econometrics , 148(2):162–178.Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep learning for portfolio optimisation. arXiv preprint arXiv:2005.13665arXiv preprint arXiv:2005.13665