Time-varying neural network for stock return prediction

Steven Y. K. Wong∗, Jennifer S. K. Chan†, Lamiae Azizi†, Richard Y. D. Xu∗

January 25, 2021
Abstract
We consider the problem of neural network training in a time-varying context. Machine learning algorithms have excelled in problems that do not change over time. However, problems encountered in financial markets are often time-varying. We propose the online early stopping algorithm and show that a neural network trained using this algorithm can track a function changing with unknown dynamics. We compare the proposed algorithm to current approaches on predicting monthly U.S. stock returns and show its superiority. We also show that prominent factors (such as the size and momentum effects) and industry indicators exhibit time-varying predictive power on stock returns. We find that during market distress, industry indicators experience an increase in importance at the expense of firm-level features. This indicates that industries play a role in explaining stock returns during periods of heightened risk.
Keywords— return prediction, deep learning, online learning, time-varying

∗ Steven Wong and Richard Xu are with School of Electrical and Data Engineering, University of Technology Sydney, Australia. Corresponding email: [email protected].
† Jennifer Chan and Lamiae Azizi are with School of Mathematics and Statistics, University of Sydney, Australia.

1 Introduction
The motivating application of this work is in predicting cross-sectional stock returns in a portfolio context. At every interval, an investor forecasts expected returns of assets and performs security selection. A closely related problem is asset pricing, a fundamental problem in financial theory. Asset pricing has been well studied. A survey by Harvey et al. (2016) documented over 300 cross-sectional factors published in journals. However, the literature has also documented evidence of time-variability of the true asset pricing model (also known as concept drift in machine learning; Gama et al., 2014). Pesaran and Timmermann (1995) performed linear regressions with permutations of regressors on U.S. stocks and compared both statistical and financial measures for model selection. Both predictability and regression coefficients of the selected model changed over time. Bossaerts and Hillion (1999) reported similar findings in international stocks. So why do relationships change over time? Changes in the macroeconomic environment are one possibility. Other explanations offered by McLean and Pontiff (2016) relate to statistical bias (also called data mining bias in machine learning) and effects of arbitrage by investors (which the authors referred to as publication-informed trading). Thus, it is unsatisfactory for a practitioner to learn a static model, as out-of-sample performance can vary.

Recently, deep learning has made significant advances across a wide range of applications, such as achieving human-like accuracy in image recognition tasks (Schroff et al., 2015) and translating text (Sutskever et al., 2014). By contrast, machine learning in financial markets is still in its infancy. Weigand (2019) provided a recent survey of machine learning applied to empirical finance and noted that machine learning algorithms show promise in addressing shortcomings of conventional models (such as the inability to model non-linearity or to handle a large number of covariates).
Recent works have applied neural networks to the problem of cross-sectional stock return prediction (see Messmer, 2017; Abe and Nakayama, 2018; Gu et al., 2020). (This paper is concerned with improving tools for stock return prediction, enabling practitioners to better select securities; asset pricing, in an academic context, is more concerned with explaining the drivers of returns.) Gu et al. (2020) modelled potential time-variability driven by macroeconomic conditions by interacting firm-level features with macroeconomic indicators. However, they do not consider all possible avenues of time-variability of asset pricing models, such as effects of trading as highlighted by McLean and Pontiff (2016). For instance, Lev and Srivastava (2019) noted that the prominent value factor (Rosenberg et al., 1985; Fama and French, 1992) has been unprofitable for almost 30 years, a period that included multiple business cycles. In fact, the authors noted that returns to the value factor have been negative since 2007, suggesting a change in the underlying relationship.

To address this, we propose the online early stopping algorithm (henceforth, OES) for training neural networks that can adapt to a time-varying function. Our problem is characterized by information release over time and iterative decision making. Optimization in this context is called online, as decisions are made with knowledge of past information but not of the future. In conventional neural network training, one of the hyperparameters is the number of optimization iterations τ. In OES, we propose to treat τ as a learnable parameter that varies over time (t), as τ_t, and is recursively estimated over time. We give τ_t a new meaning: a regularization parameter that controls the amount of update the neural network weights receive as new observations are revealed. Thus, if consecutive cross-sectional observations are very different, we would expect τ_t to be small, and the neural network is prevented from overfitting to any one period. Conversely, a slowly changing function will have a high degree of continuity, and we would expect the network to fit more tightly to each new observation. Using this training algorithm, a neural network can adapt to changes in the data generation process over time. For practitioners, we show that a neural network trained with OES can be a powerful prediction model and a useful tool for understanding the time-varying drivers of returns. (Deep learning is a subfield of machine learning; an overview is provided in Section 2.2.)

Neural network training is an optimization problem. We draw on concepts in online optimization to provide a performance bound that is related to the variability of each period. We do not assume any time-varying dynamics of the underlying function, a typical approach in online optimization. The benefit of this approach is that it can track any source of variability in the underlying function, including macroeconomic, arbitrage-induced, market condition-induced or other unknown sources. For instance, Lev and Srivastava (2019) suggested that the negative return to the value factor was related to the diminishing relevance of book equity as an accounting measure. Such drivers would not have been captured by the macroeconomic approach in Gu et al. (2020). Nonetheless, we acknowledge that a limitation of our approach is the inability to explain the source of variability.

We provide two evaluations of OES: 1) a simulation study based on a data set simulated from a non-linear function evolving under a random walk; 2) an empirical study of U.S. stock returns. The empirical study is based on Gu et al. (2020), who compared several machine learning algorithms for predicting monthly returns of all U.S. stocks. The majority of the data set was made available to the public and is used in this work. We note that the setup in Gu et al. (2020) is suboptimal for our portfolio selection problem for three reasons. Firstly, (raw) monthly stock returns contain characteristics that complicate the forecasting problem, such as outliers, heavy tails and volatility clustering (Cont, 2001). These characteristics are likely to impede a predictor's ability to learn. Secondly, the data set in Gu et al. (2020) contains stocks that have very low market capitalization, are illiquid, and are unlikely to be accessible to institutional investors. Thirdly, at the individual stock level, forecasting stocks' excess returns over the risk-free rate also encompasses forecasting market excess returns. As practitioners are typically concerned with relative performance between stocks (in the simplest form, a long-only investor will hold a portfolio of the top-ranked stocks, and a long-short investor will buy top-ranked stocks and sell short bottom-ranked stocks), the market return component adds unnecessary noise to the problem of relative performance forecasting. Thus, in addition to a comparison with Gu et al. (2020), we also present results based on a more likely use case by practitioners, excluding stocks with very low capitalization and forecasting cross-sectionally standardized excess returns. We show that forecasting performance improves significantly under this re-formulation. We propose to measure performance using the information coefficient (henceforth, IC), a widely applied performance measure in investment management (Ambachtsheer, 1974; Grinold and Kahn, 1999; Fabozzi et al., 2011). OES achieves an IC of 4.58% on the U.S. equities data set, compared to 3.82% under an expanding window approach in Gu et al. (2020).

A summary of our contributions in this paper is as follows:

• We propose the OES algorithm, which allows a neural network to track a time-varying function. OES can be applied to an existing network architecture and requires significantly less time to train than the expanding window approach in Gu et al. (2020). In our tests, OES took 1/7 the time to train and predict as the expanding window approach of DNN. This has a practical implication, as practitioners wishing to employ deep learning models have limited time between market close and the next day's open to generate features and train new models, which is made worse if an ensemble is required.

• We show that firm features exhibit time-varying importance and that the model changes over time. We find that some prominent features, such as market capitalization (the size effect), display declining importance over time, which is consistent with the findings of McLean and Pontiff (2016). This highlights the importance of having a time-varying model.

• We find that firm features, in aggregate, experience a fall in importance in predicting cross-sectional returns during market distress (e.g. the Dot-com bubble in 2000-01). Importance of sector dummy variables (e.g. technology and oil stocks) rose over the same period, suggesting that the importance of sectors is also time-varying. Our analysis indicates that sectors have an important role in predicting stock returns during market distress. We expect this to be especially true if market stress impacts certain sectors more than others, such as travel and leisure stocks during a pandemic.
• Using a subuniverse that is more accessible to institutional investors (excluding microcap stocks), we show that OES exhibits superior predictive performance. We find that the mean correlation between predictions of OES and DNN is only about 35%.

• We find that an ensemble formed by averaging the standardized predictions of the two models exhibits the highest IC, decile spread and Sharpe ratio. Thus, practitioners may choose to deploy both models in a complementary manner.

In the rest of this paper, we denote the algorithm of Gu et al. (2020) as DNN (Deep Neural Network) and our proposed Online Early Stopping as OES. This paper is organized as follows. Section 2 defines our cross-disciplinary problem and provides overviews of neural networks and online optimization. Section 3 outlines the main contribution of this paper, the proposed OES algorithm, which introduces time-variation to the neural network. Simulation results are presented in Section 4, which demonstrates the effectiveness of OES in tracking a time-varying function. An empirical study on U.S. stock returns is outlined in Section 5. Finally, Section 6 discusses the empirical finance problem and concludes the paper with some remarks.
Similar to a classical online learning setup, a player iteratively makes portfolio selection decisions at each time period. We call this iterative process per interval training. There are n stocks in the market, each with m features, forming input matrix X_t ∈ R^{n×m} at time t = 1, ..., T. The i-th row of X_t is the feature vector x_{t,i} of stock i. To simplify notation, we define the return of stock i as the return over the next period, i.e., r_{t,i} = (p_{t+1,i} + d_{t+1,i})/p_{t,i} − 1, where p_{t,i} is the price at time t and d_{t,i} is the dividend at t if a dividend is paid, and zero otherwise. The player predicts stock returns r̂_t ∈ R^n by choosing θ_t ∈ Θ, which parameterizes the prediction function F: R^{n×m} → R^n. The market reveals r_t and, for regression purposes, the investor incurs the squared loss,

J_t(θ_t) = (1/n) Σ_{i=1}^n (r_{t,i} − F_i)²,

where F_i is the i-th element of the vector F(X_t; θ_t). The true function φ_t: R^{n×m} → R^n drifts over time and is approximated by F with time-varying θ_t. The investor's objective is to minimize the loss incurred by choosing the best θ_t at time t using the observed history up to t − 1. Both the functional form and the time-varying dynamics of φ_t are unknown. Hence, a neural network is used to model the cross-sectional relationship at each t, and the time-variability is formulated as a network weights tracking problem. The loss function J satisfies the same assumptions adopted in Aydore et al. (2019), which are:

• J_t is bounded: |J_t| ≤ D, with D > 0;
• J_t is L-Lipschitz: |J_t(a) − J_t(b)| ≤ L‖a − b‖, with L > 0;
• J_t is β-smooth: ‖∇J_t(a) − ∇J_t(b)‖ ≤ β‖a − b‖, with β > 0.

We denote the gradient of J_t at θ_t as ∇J_t(θ_t) and the stochastic gradient as ∇̂J_t(θ_t), with ∇J_t(θ_t) = E[∇̂J_t(θ_t)], or, where the context is obvious, ∇_t and ∇̂_t respectively.

As performance measure, Gu et al. (2020) used pooled R² without mean adjustment in the denominator,

R²_oos = 1 − Σ_{(t,i)∈D_oos} (r_{t,i} − r̂_{t,i})² / Σ_{(t,i)∈D_oos} r_{t,i}²,

where D_oos is the pooled out-of-sample data set, covering January 1987 to December 2016 in the empirical study. There are several shortcomings with this performance measure. The number of stocks in the U.S. equities data set starts from 1,060 in March 1957, peaks at over 9,100 in 1997, and falls to 5,708 at the end of 2016. A pooled performance metric will place more weight on periods with a higher number of stocks. An investor making iterative portfolio allocation decisions would be concerned with accuracy on average over time. Moreover, asset returns are known to exhibit non-Gaussian characteristics (Cont, 2001). Summary statistics of monthly U.S. stock returns are provided in Table 2 (in Section 5), which clearly confirm the existence of considerable skewness and time-varying variance. Therefore, we provide three additional metrics. The first metric is the information coefficient (IC), defined as the cross-sectional Pearson's correlation between stock returns and predictions. (Rank IC, which uses Spearman's rank correlation instead of Pearson's, is also used in practice.)
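The IC described here reduces to one cross-sectional Pearson correlation per period, averaged over time. A minimal sketch (the function name and the (T, n) array layout are our own choices):

```python
import numpy as np

def information_coefficient(returns, predictions):
    """Mean cross-sectional Pearson correlation between realized returns
    and predictions, one correlation per time period.

    returns, predictions: arrays of shape (T, n) -- T periods, n stocks.
    """
    ics = []
    for r_t, p_t in zip(returns, predictions):
        r_c = r_t - r_t.mean()          # demean within the cross-section
        p_c = p_t - p_t.mean()
        ics.append((r_c @ p_c) / (np.linalg.norm(r_c) * np.linalg.norm(p_c)))
    return float(np.mean(ics))
```

A perfect forecaster scores 1, a perfectly inverted one scores −1; in practice, single-digit percentage ICs are typical for monthly stock returns.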
The time series of correlations is then averaged to give the final score. IC was first proposed by Ambachtsheer (1974) and is widely applied in investment management for measuring the predictive power of a forecaster or an investment strategy (Grinold and Kahn, 1999; Fabozzi et al., 2011). The second metric is the annualized Sharpe ratio, calculated as

SR = √12 × E[P_t] / √Var[P_t], t ∈ D_oos,

where P_t is the difference between the average realized monthly returns of decile 10 and decile 1 at t, sorted on predicted returns. The third metric is the average monthly R², where the denominator is adjusted by the cross-sectional mean, as a conventional complement to R²_oos.

An overview of neural networks is provided in this section. Interested readers are referred to Goodfellow et al. (2016) for a comprehensive review. Neural networks are a broad class of high-capacity models which were inspired by the biological brain and can theoretically learn any function (known as the Universal Approximation Theorem; see Hornik et al., 1989; Cybenko, 1989; Goodfellow et al., 2016). A common form, the feedforward network, also known as the multilayer perceptron (MLP), is a subset of neural networks which forms a finite acyclic graph (Goodfellow et al., 2016). There are no loop connections, and values are fed forward, from the input layer to hidden layers, and to the output layer. The word 'deep' is prefixed to the name (e.g. deep feedforward network or deep neural network) to signify a network with many hidden layers, as illustrated in Figure 1. A feedforward network is also called a fully connected network if every node has every node in the preceding layer connected to it. The output of each layer acts as input to the next layer, and loss is 'backpropagated' by taking the partial derivative of the loss with respect to the weights at each layer. Each layer consists of an activation function f (e.g. the rectified linear unit, defined as f(x) = max(x, 0)), weights W, bias b, and output f(x; W, b) = f(xᵀW + b). The ℓ-th layer of the network is denoted f^(ℓ). For brevity, we drop the layer designation and denote the entire network as F and the weight vector set as θ = ∪_{ℓ=1}^L {W^(ℓ), b^(ℓ)}, where L is the number of layers. The network is trained with stochastic gradient descent (or variants) at time t (dropping the subscript t for simplicity as the context is clear),

θ_{k+1} = θ_k − η ∇̂J(θ_k),

where θ_k is the weight vector at optimization iteration k (also called epochs) and η is the step size.

[Figure 1: a deep feedforward network. n_ℓ refers to the number of nodes in the ℓ-th layer; arrows indicate the direction of flow for the output value of the respective node.]

At time t, τ_t denotes the number of optimization iterations used to train the network and is found by monitoring loss on a validation set. This procedure is called early stopping (Morgan and Bourlard, 1990; Reed, 1993; Prechelt, 1998; Mahsereci et al., 2017).
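The layer computation described above (ReLU activations feeding a linear output layer) can be sketched in a few lines of numpy; the function names and the list-of-(W, b) representation are our own choices:

```python
import numpy as np

def relu(x):
    # rectified linear unit: f(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Fully connected feedforward pass.

    layers: list of (W, b) pairs; each hidden layer computes
    relu(x @ W + b); the final layer is linear (regression output).
    """
    for W, b in layers[:-1]:
        x = relu(x @ W + b)
    W, b = layers[-1]
    return x @ W + b
```

Backpropagation (computing partial derivatives of the loss with respect to each W and b) is what deep learning frameworks automate; this sketch only shows the forward direction of flow.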
Training is stopped when the validation loss decreases by less than a predefined amount, called the tolerance. Early stopping can be seen as a regularization technique which limits the optimizer to searching the parameter space near the starting parameters (Sjöberg and Ljung, 1995; Goodfellow et al., 2016). In particular, given optimization steps τ, the product ητ can be interpreted as the effective capacity which bounds the reachable parameter space from θ_0, thus behaving like L² regularization (Goodfellow et al., 2016).

For time series problems where chronological ordering is important, popular approaches include the expanding window (each new time slice is added to the panel data set) and the rolling window (the oldest time slice is removed as a new time slice is added; Rossi and Inoue, 2012). Instead of randomly splitting training and test sets, the out-of-sample procedure can be used, where the end of the series is withheld for evaluation (as described in Bergmeir et al., 2018). This is unsatisfactory in the context of stock return prediction for two reasons. First, each time period is drawn from a different data distribution (hereon denoted as D_t for the data set drawn at time t, or D_oos for all periods in the out-of-sample data set). A pooled regression with window size w effectively assumes that data at t + 1 is drawn from the average of the past w observations. Secondly, if data is scarce in terms of time periods, estimates of the optimal optimization steps τ̂_t can have large stochastic error. For instance, with monthly data, a window size of 12 months and a 3:1 training-validation split, τ̂ is estimated using only 3 months of data. To the best of our knowledge, there is no procedure for adapting early stopping to an online context with time-varying dynamics.

Optimizing network weights to track a function evolving under unknown dynamics is an online optimization problem. A discussion of relevant concepts in online optimization is provided below.
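The expanding and rolling window schemes discussed above can be expressed as simple index generators. This sketch (our own naming) yields (train indices, test index) pairs for a panel of T periods:

```python
def expanding_windows(T, initial=3):
    """Each step trains on all periods seen so far and tests on the next."""
    for t in range(initial, T):
        yield list(range(0, t)), t

def rolling_windows(T, w=3):
    """Each step trains on only the most recent w periods."""
    for t in range(w, T):
        yield list(range(t - w, t)), t
```

The expanding scheme pools an ever-growing history (implicitly averaging over old regimes), while the rolling scheme forgets old periods at a fixed rate; neither adapts the amount of fitting to how fast the underlying function is actually changing.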
Interested readers are encouraged to read Shalev-Shwartz (2012) for an introduction. In the online optimization literature, the iterate is often denoted as x_t and the loss function as f_t. We have used θ_t as the iterate, to be consistent with our parameter of interest, and J_t as the loss function, to avoid conflict with our use of f as the activation function.

Online optimization and its related topics have been well researched. Applications of online optimization in finance first came in the form of the Universal Portfolios of Cover (1991). However, most of the early works in online optimization are on the convex case and assume each draw of the loss function J_t is from the same distribution (in other words, J_t is stationary; non-stationarity in the online optimization literature refers to time-variability of the loss function J_t). These assumptions are not consistent with our problem. Recently, Hazan et al. (2017) extended online convex optimization to the non-convex and stationary case. This was further extended by Aydore et al. (2019) to the non-convex and non-stationary case, with the proposed Dynamic Exponentially Time-Smoothed Stochastic Gradient Descent (DTS-SGD) algorithm. Non-convex optimization is NP-Hard (in computer science, NP-Hard refers to a class of problems for which no known polynomial run-time algorithm exists). Therefore, existing non-convex optimization algorithms focus on finding local minima (Hazan et al., 2017). For this reason, one difference between online convex optimization and online non-convex optimization is that the former focuses on minimizing the sum of losses relative to a benchmark (for instance, the minimizer over all time intervals, θ* = argmin_{θ∈Θ} Σ_t J_t(θ), is one of the most basic benchmarks), and the latter focuses on minimizing the sum of gradients (e.g. Σ_t ∇J_t(θ_t)). This optimization objective is called regret. Readers familiar with time-series analysis might be taken aback by the lack of parameters in a typical online optimization algorithm. This is due to the game-theoretic approach of online optimization and the focus on worst-case performance guarantees, as opposed to the average-case performance in statistical learning. Regret bounds are typically functions of properties of the loss function (e.g. convexity and smoothness) and are dependent on environmental assumptions.

At each interval t, DTS-SGD updates network weights using a time-weighted sum of past observed gradients. Time weighting is controlled by a forget factor α. In analyzing DTS-SGD, we note two potential weaknesses. Firstly, neural networks are notoriously difficult to train. The geometry of the loss function is plagued by an abundance of local minima and saddle points (see Chapter 8.2 of Goodfellow et al., 2016). Momentum and learning rate decay strategies (for instance, Sutskever et al., 2013; Kingma and Ba, 2015) have been introduced which require multiple passes over the training data, adjusting the learning rate each time to better traverse the loss surface. DTS-SGD is a single weight update at each time period, which may have difficulty traversing highly non-convex loss surfaces. Secondly, during our simulation tests, we observed that the loss can increase after a weight update. One possibility is that a past gradient is taking the weights further away from the current local minima. This is particularly problematic for our problem, as stock returns are very noisy.

We start by providing an informal discussion of the algorithm. Neural networks are universal approximators; that is, they can approximate any function up to arbitrary accuracy. Thus, given a network structure and a time-varying function, network weights trained with data from a single time interval (i.e., a cross-sectional slice of time) neatly summarize the function at that interval, and the Euclidean distance between consecutive sets of weights can be interpreted as the amount of variation in the latent function expressed in weight space. Simply using θ_{t−1} to predict at t will lead to an overfitted result.
To illustrate, suppose θ_t ∈ R, θ_0 = 0, and the optimum alternates in the sequence {1, −1, 1, −1, ...}. Then it is clear that using θ_1 = 1 to predict at t = 2 will lead to a worse outcome than using θ = 0. In this scenario, the optimal strategy is to never update the weights (or to scale updates by zero). Generally, the optimal policy is to regularize updates such that the network is not overfitted to any single period.

In the rest of this section, we present our main theoretical results. Formally, our goal is to track the unobserved minimizer of J_t, a proxy for the true asset pricing model, as closely as possible. In regret analysis, it is desirable to have regret that scales sub-linearly with T, which leads to asymptotic convergence to the optimal solution. Hazan et al. (2017) demonstrated that in the non-convex case, a sequence of adversarially chosen loss functions can force any algorithm to suffer regret that scales with T as Ω(T/w) (in computer science, Ω notation refers to a lower bound on complexity). Locally smoothed gradients (over a rolling window of w loss functions) were used to improve smoothed regret, with a larger w advocated by Hazan et al. (2017). Aydore et al. (2019) extended this to use a rolling weighted average of past gradients, which gives recent gradients a higher weight in order to track a dynamic function. Inevitably, smoothing will track a time-varying minimizer with a tracking error that is proportionate to w and the forget factor.

To address this, we propose a restricted optimum (denoted by θ*_t at time t) as the tracking target of our algorithm. At time t, the online player selects θ_t based on the observed {∇_1, ..., ∇_{t−1}}. As the network is trained using gradient descent, we propose to restrict the admissible weight set to the path formed from θ*_{t−1} and extending along the gradient vector −∇_{t−1} (in other words, the path traversed by gradient descent). The point θ′ along this path with the minimum ‖∇J_t(θ′)‖ is the restricted optimum.
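The alternating-optimum illustration above can be checked numerically. This toy sketch is our own construction, using a squared tracking loss of our own choosing:

```python
# Toy illustration (our own construction): the latent optimum alternates
# between +1 and -1.  Carrying the previous period's optimum forward
# ("overfitting" to the last period) is worse than never moving from 0.
theta_star = [1.0, -1.0] * 10           # alternating optima, 20 periods

def sq_loss(theta, target):
    return (theta - target) ** 2

# Strategy A: predict each period with the previous period's optimum.
loss_follow = sum(sq_loss(theta_star[t - 1], theta_star[t])
                  for t in range(1, len(theta_star)))

# Strategy B: never update; always predict 0.
loss_zero = sum(sq_loss(0.0, theta_star[t]) for t in range(1, len(theta_star)))

assert loss_zero < loss_follow          # 19.0 vs 76.0: staying at zero wins
```

Each "follow" prediction misses by 2 (cost 4 per period), while staying at zero misses by 1 (cost 1 per period), which is exactly why updates should be regularized when consecutive periods disagree.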
We argue that the trade-off between restricting the admissible weight space and solving the simplified problem is justified, as other points in the weight space are not attainable via gradient descent, and it is thus unnecessary to consider all possible weight sets in Θ. Without assuming any time-varying dynamics, updating weights using an average of past gradients (similar to Hazan et al., 2017) will induce a tracking error with respect to the time-varying function. To illustrate the restricted optimum concept, let θ′ = θ*_{t−1} be our starting point of optimization, g = −∇J_{t−1}(θ′) and g′ = −∇J_t(θ′). The possible scenarios during training are (also illustrated in Figure 2):

1. If |cos⁻¹(⟨g, g′⟩ / (‖g‖‖g′‖))| < π/2, then moving along g will also improve J_t(θ′), until g is perpendicular to g′ or θ′ has reached a local minimum of J_{t−1}.

2. If |cos⁻¹(⟨g, g′⟩ / (‖g‖‖g′‖))| ≥ π/2, then following g will not improve J_t(θ′), and training should terminate.

[Figure 2: starting point θ′ with gradient directions −∇J_t(θ′) and −∇J_{t−1}(θ′). On the left, optimization should continue until −∇J_t(θ′) is perpendicular to −∇J_{t−1}(θ′). On the right, optimization should terminate.]

This observation motivates our online early stopping algorithm. In this section, we will use θ*_t to denote the restricted optimal weights at t and θ_t to denote the online player's choice of weights. Suppose θ*_t evolves under the dynamics

θ*_t = θ*_{t−1} − v_{t−1} ∇J_{t−1}(θ*_{t−1}),

where v_{t−1} is sampled from an unknown distribution. v_{t−1} can be interpreted as a regularizer which provides the optimal prediction weights on J_t if we are restricted to travelling along the direction of −∇J_{t−1}(θ*_{t−1}). In this context, ‖∇J_t(θ*_t)‖ is the minimum gradient suffered by the player. Next, let τ*_t be the optimal number of optimization steps at time t and τ_t be the estimated number of optimization steps. At iteration t, we solve for the optimal optimization steps τ*_{t−2},

τ*_{t−2} = argmin_{τ′≥0} J_{t−1}[ θ*_{t−2} − η Σ_{k=1}^{τ′} ∇J_{t−2}(θ*_{t−2,k}) ].   (1)

We start from t − 2, as solving τ*_{t−1} requires J_t, which we are yet to observe. This leads to the optimal weights (the restricted optimum) trained on J_{t−2} for prediction on J_{t−1},

θ*_{t−1} = θ*_{t−2} − η Σ_{k=1}^{τ*_{t−2}} ∇J_{t−2}(θ*_{t−2,k}),   (2)

which can be approximated by

θ*_{t−2} − η Σ_{k=1}^{τ*_{t−2}} ∇J_{t−2}(θ*_{t−2,k}) ≈ θ*_{t−2} − ητ*_{t−2} ∇J_{t−2}(θ*_{t−2}),

which implies v_{t−2} ≈ ητ*_{t−2}. To predict r̂_t, we choose τ_{t−1} = (1/(t−2)) Σ_{q=2}^{t−1} τ*_{t−q} and train prediction weights on J_{t−1} by substituting in ⌊τ_{t−1} + 0.5⌋ (the estimate of optimization steps, rounded to the nearest integer),

θ_t = θ*_{t−1} − η Σ_{k=1}^{⌊τ_{t−1}+0.5⌋} ∇J_{t−1}(θ*_{t−1,k}) ≈ θ*_{t−1} − ητ_{t−1} ∇J_{t−1}(θ*_{t−1}).   (3)

As η is a constant chosen by hyperparameter search, τ_{t−1} can be interpreted as a proxy for the regularizer v_{t−1}. Using our β-smoothness assumption (in Section 2.1) and substituting in the definitions of θ_t and θ*_t (in Equation 3), we obtain

‖∇J_t(θ_t) − ∇J_t(θ*_t)‖ ≤ β‖θ_t − θ*_t‖,

Σ_{t=2}^T ‖∇J_t(θ_t) − ∇J_t(θ*_t)‖ ≤ Σ_{t=2}^T β‖θ_t − θ*_t‖
    ≤ Σ_{t=2}^T β‖ητ*_{t−1}∇J_{t−1}(θ*_{t−1}) − ητ_{t−1}∇J_{t−1}(θ*_{t−1})‖,   (4)

where we start from t = 2, as our algorithm requires at least 2 cross-sectional observations. The elegance of Equation 4 is that it conforms with the conventional notion of regret, with the cumulative gradient deficit against an optimal outcome in place of cumulative loss. As τ_{t−1} is the unbiased estimator of τ*_{t−1}, Equation 4 indicates that the cumulative deficit is asymptotically bounded by the variance of τ*_{t−1}. This concept is illustrated in Figure 3. If τ*_{t−1} is constant, then τ_{t−1} will converge to τ*_{t−1} and the optimal weights are achieved. Conversely, if τ*_{t−1} has high variance, then the player will suffer a larger cumulative gradient deficit.

[Figure 3: illustration of E[‖θ*_t − θ*_{t−1}‖]. Suppose θ*_t = [θ*_{1,t} θ*_{2,t}] is a row vector with two elements. Twenty-one random θ*_t vectors were drawn, with each (θ*_{t−1}, θ*_t) pair represented as an arrow. The circle has radius equal to the average of ‖θ*_t − θ*_{t−1}‖. θ_t is regularized by limiting how far it can travel from θ*_{t−1}, which is E[‖θ*_t − θ*_{t−1}‖].]

Our strategy is to modify the early stopping algorithm to recursively estimate τ_t. An outline is provided below as an introduction to the pseudocode in Algorithm 1:

1. At t, solve τ*_{t−2} (Equation 1) and θ*_{t−1} (Equation 2) by training on J_{t−2} and validating against J_{t−1} (line 3 of Algorithm 1).

2.
Recursively estimate τ t − as the mean of observed { τ ∗ , ..., τ ∗ t − } (line 4).3. Start from θ ∗ t − and perform gradient descent for (cid:98) τ t − + 0 . (cid:99) iterations(Equation 3). The new weights are θ t (line 5–9).4. Predict using θ t (line 11). arlyStopping on line 3 is outlined in Algorithm 2. In our implementation ofthe algorithm, we have used stochastic gradient ˆ ∇ t − instead of the full gradient ∇ t − . Algorithm 2 contains the schematics of an early stopping algorithm withone modification adapted from Algorithm 7.1 and Algorithm 7.2 in Goodfellowet al. (2016). Validation is performed before the first training step to allow for thecase where τ best = 0 (i.e., we start from the optimal weights). Algorithm 1
General framework for online early stopping. The outer looprecursively estimates τ t − . Require: data X t , r t ∼ p t at interval t ; θ ∗ initialized randomly τ (cid:48) ← for t = 2 , ..., T do τ (cid:48) , θ ∗ t − ← EarlyStopping ( θ ∗ t − , X t − , r t − , X t − , r t − ) τ ← τ ( t − τ (cid:48) t − θ ← θ ∗ t − for i = 1 , ..., (cid:98) τ + 0 . (cid:99) do θ ← θ − η ˆ ∇ t − ( θ ) end for θ t ← θ Receive input X t Predict ˆ r t ← F ( X t ; θ t ) Receive output r t end for lgorithm 2 Early stopping procedure. Training stops when validation lossdoes not improve by at least ε for Q iterations. Require:
Maximum iterations {T ∈ ℕ | T > 0}; tolerance {ε ∈ ℝ | ε > 0}; patience {Q ∈ ℕ | Q > 0}; step size {η ∈ ℝ | η > 0}
1:  function EarlyStopping(θ, X_train, r_train, X_test, r_test)
2:      θ_best ← θ
3:      τ_best ← 0
4:      q ← 0
5:      J_best ← J(r_test, F(X_test; θ))         ▷ Validate before the first training step
6:      for k = 1, ..., T do
7:          θ ← θ − η∇̂J(r_train, F(X_train; θ))
8:          J′ ← J(r_test, F(X_test; θ))
9:          if J′ < J_best then
10:             τ_best ← k
11:             θ_best ← θ
12:             J_best ← J′
13:         end if
14:         if J′ did not improve by at least ε then
15:             q ← q + 1
16:             if q ≥ Q then
17:                 break                         ▷ Assume convergence
18:             end if
19:         else
20:             q ← 0
21:         end if
22:     end for
23:     return τ_best, θ_best
24: end function

In the next two sections, we conduct two empirical studies. The first is based on simulated data and highlights the use of online early stopping; the second, presented in Section 5, predicts U.S. stock returns based on the data set in Gu et al. (2020).
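For concreteness, Algorithms 1 and 2 can be condensed into the short sketch below. It is illustrative only: it uses a linear model with full-batch squared-loss gradients in place of the paper's neural network and stochastic gradients, and the function names and default constants are our own, not the paper's.

```python
import numpy as np

def mse(X, r, theta):
    """Squared loss J of a linear model (a stand-in for the paper's network)."""
    return np.mean((X @ theta - r) ** 2)

def grad(X, r, theta):
    """Full gradient of the squared loss (the paper uses stochastic gradients)."""
    return 2.0 * X.T @ (X @ theta - r) / len(r)

def early_stopping(theta, X_tr, r_tr, X_va, r_va, eta=0.05, max_iter=500, eps=1e-8, patience=5):
    """Algorithm 2: gradient descent on the training loss, stopped on validation loss.
    Validation happens before the first step, so tau_best = 0 is possible."""
    tau_best, theta_best, q = 0, theta.copy(), 0
    J_best = mse(X_va, r_va, theta)
    for k in range(1, max_iter + 1):
        theta = theta - eta * grad(X_tr, r_tr, theta)
        J_new = mse(X_va, r_va, theta)
        improved = J_new < J_best - eps              # improvement of at least eps
        if J_new < J_best:
            tau_best, theta_best, J_best = k, theta.copy(), J_new
        if improved:
            q = 0
        else:
            q += 1
            if q >= patience:
                break                                # assume convergence
    return tau_best, theta_best

def online_early_stopping(X, r, eta=0.05):
    """Algorithm 1: recursively average the observed early stopping times and take
    round(tau_bar) extra gradient steps on the latest cross-section before predicting."""
    theta_star = np.zeros(X[0].shape[1])
    tau_bar, preds = 0.0, {}
    for t in range(2, len(X)):                       # needs two past cross-sections
        tau, theta_star = early_stopping(theta_star, X[t - 2], r[t - 2], X[t - 1], r[t - 1], eta=eta)
        tau_bar = (tau_bar * (t - 2) + tau) / (t - 1)  # recursive mean of observed taus
        theta_t = theta_star.copy()
        for _ in range(int(np.floor(tau_bar + 0.5))):  # round to the nearest iteration count
            theta_t = theta_t - eta * grad(X[t - 1], r[t - 1], theta_t)
        preds[t] = X[t] @ theta_t                    # predict r_t with theta_t
    return preds
```

Note that the early-stopped weights θ*_{t−1}, not the post-adjustment weights θ_t, seed the next interval's EarlyStopping call, mirroring the pseudocode.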
For the simulation study, we create the following synthetic data set:

• T = 180 months; each month consists of n = 200 stocks.

• Each stock has m = 100 features, forming the input matrix X ∈ R^(T×n×m) and output vector r ∈ R^(T×n).

• Let x_{t,i,j} be the value of feature j of stock i at time t. Each feature value is randomly set to x_{t,i,j} ∼ N(0, ·).

• Each feature is associated with a latent factor ψ_{t,j} = 0.·ψ_{t−1,j} + 0.·δ_{t,j}, where δ_{t,j} ∼ N(0, 1) and ψ_{0,j} ∼ N(0, ·). Thus ψ_{t,j} follows a Wiener process and drifts over time.

• Each output value is r_{t,i} = Σ_{j=1}^m tanh(x_{t,i,j}ψ_{t,j}) + ε_{t,i}, where ε_{t,i} ∼ N(0, ·). Thus r_t is non-linear with respect to X_t, and the relationship changes over time.

We use the same network setup and hyperparameter ranges as in the empirical study on U.S. equities (outlined in Table 3), but with a batch size of 50. DNN has the same setup but is re-fitted every 10th time interval. The data set is split into three 60-interval blocks. Hyperparameters for OES are chosen using a grid search, a procedure called hyperparameter tuning. For each hyperparameter combination, the network is trained on the first 60 intervals and validated on the next 60 intervals. The hyperparameters with the minimum MSE on the validation set are then used on the remaining 60 intervals, which serve as out-of-sample data. Performance metrics are calculated on the out-of-sample set. DTS-SGD follows the same training scheme as OES, with additional hyperparameters: window period w and forget factor α, each chosen from a three-value grid.

Our synthetic data requires the network to adapt to time-varying dynamics. Table 1 records the results of the simulation. DNN struggles to learn the time-varying relationships, with mean R² of −0.26% and mean rank correlation of −0.07%. This is expected, as the expanding-window approach used in DNN assumes that the relationships at t are best approximated by the average relationships in the observed past. OES significantly outperforms the other two methods in this simple simulation, achieving mean R² of 49.64% and mean rank correlation of 69.63%. This demonstrates OES's ability to track a non-linear, time-varying function reasonably closely. There is a preference for higher L1 regularization and learning rate. In Aydore et al. (2019), the authors reported issues of exploding gradients with the static time-smoothed stochastic gradient descent of Hazan et al. (2017), and noted that DTS-SGD provided greater stability. In our simulation test, we observed gradient instability with DTS-SGD as well: during training, the loss can increase after a weight update. We hypothesize that a past gradient can take the network weights away from the direction of the current local minima, which could be an issue with this general class of optimizers. Lastly, we find that mean R² tends to be slightly lower than R²_oos (which is reasonable, given the smaller denominator of a negative term).

Table 1: Simulation results and selected hyperparameters from the hyperparameter search, averaged over time and ensemble networks. Values are in percentages unless specified (w, number of periods).

                     DNN      OES      DTS-SGD
Metrics
Pooled R²_oos       -7.12    50.22      0.13
Mean R²             -7.77    49.64     -0.33
IC                  -4.21    71.24      6.29
Hyperparameters
Mean L1 penalty      0.01     0.09      0.04
Mean η                 ·        ·         ·
w (periods)            –        –        14
Mean α                 –        –         ·

The U.S. equities data set in Gu et al. (2020) consists of all stocks listed on the NYSE, AMEX and NASDAQ from March 1957 to December 2016. The average number of stocks exceeds 5,200. Excess returns over the risk-free rate are calculated as forward one-month stock returns over Treasury-bill rates. As noted in Section 2.1, stock returns exhibit non-Gaussian characteristics. Table 2 presents descriptive statistics of excess returns. Monthly excess returns are positively skewed and contain possible outliers which may influence the regression. We follow Gu et al. (2020) in using MSE, but note that MSE is not robust against outliers. As noted in Section 1, we also provide an alternative setup which excludes microcap stocks.
The alternative setup and empirical results are presented in Section 5.4.

The feature set includes 94 firm level features, 74 industry dummy variables (based on the first two digits of the Standard Industrial Classification code, henceforth SIC) and interaction terms with 8 macroeconomic indicators.

Table 2: Descriptive statistics of monthly excess returns (%) by decade.

%         1957-1966  1967-1976  1977-1986  1987-1996  1997-2006  2007-2016
Mean          0.95       0.25       0.95       0.64       0.90       0.50
Std Dev       9.98      14.89      15.84      18.44      19.93      16.26
Skew        212.44     184.21     365.98    1059.88     502.41     783.70
Min         -76.38     -91.88     -90.14     -99.13     -98.30     -99.90
1%          -20.27     -31.41     -33.82     -40.39     -44.61     -38.96
10%          -9.26     -14.99     -14.38     -15.61     -17.08     -14.25
25%          -4.42      -7.78      -6.54      -6.64      -6.91      -5.76
50%          -0.10      -0.65      -0.52      -0.41       0.00       0.24
75%           5.14       6.21       6.67       6.18       6.67       5.84
90%          11.62      16.23      16.43      16.11      17.57      14.06
99%          33.04      49.60      51.99      56.92      65.43      48.08
Max         255.29     432.89    1019.47    2399.66    1266.36    1598.45

The firm features and macroeconomic indicators used in Gu et al. (2020) are based on Green et al. (2017) and Welch and Goyal (2008), respectively. Firm level features include share price based measures, valuation metrics and accounting ratios. The purpose of interacting firm level features with macroeconomic indicators is to capture any time-varying dynamics that are related to (common across all stocks) macroeconomic indicators. For instance, suppose valuation metrics have a stronger relationship with stock returns during periods of high inflation; this information will then be encoded in the interaction term. The aggregated data set therefore contains 94 × (8 + 1) + 74 = 920 features. Each feature has been appropriately lagged to avoid look-forward bias, and is cross-sectionally ranked and scaled to [−1, 1]. We use the subset of data which contains the 94 firm level characteristics and 74 industry classifications. Our main result uses these 94 + 74 = 168 features, but results with the full 920 features are also provided as a comparison.
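The cross-sectional rank-and-scale preprocessing described above can be sketched as follows (the helper name is ours, not the paper's):

```python
import numpy as np

def rank_scale(values):
    """Rank one cross-section of a raw feature and scale the ranks to [-1, 1].
    Missing values are filled with the cross-sectional median (zero if the median
    is itself unavailable), mirroring the preprocessing described above."""
    v = np.asarray(values, dtype=float)
    med = np.nanmedian(v) if not np.all(np.isnan(v)) else 0.0
    v = np.where(np.isnan(v), med, v)
    ranks = v.argsort().argsort()            # dense ranks 0 .. n-1, ties broken by position
    n = len(v)
    return 2.0 * ranks / (n - 1) - 1.0 if n > 1 else np.zeros(n)
```

Ranking makes the features robust to outliers in raw firm characteristics, which is one reason this transformation is common in cross-sectional return prediction.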
At this point, it is useful to remind readers that our goal is to track a time-varying function when the time-varying dynamics are unknown. In other words, we assume that the time-varying dynamics between stock returns and features are not well understood or are unobservable. As such, the subset of data without interaction terms is sufficient for our problem. If macroeconomic indicators do encode time-varying dynamics, our network will track changing macroeconomic conditions automatically. (The data set is available from Dacheng Xiu's website, https://dachxiu.chicagobooth.edu/.)

Data is divided into 18 years of training (1957-1974), 12 years of validation (1975-1986), and 30 years of out-of-sample tests (1987-2016). We use monthly total returns of individual stocks from CRSP. Where the stock price is unavailable at the end of the month, we use the last available price during the month. Table 3 records the test configurations as outlined in Gu et al. (2020) and in our replication. A total of six hyperparameter combinations (L1 penalty and η in Table 3) are tested. We use the same training scheme as Gu et al. (2020) to train DNN. Once hyperparameters are tuned, the same network is used to make predictions in the out-of-sample set for 12 months. Training and validation sets are rolled forward by 12 months at the end of every December and the model is re-fitted. An ensemble of 10 networks is used, where each prediction r̂_{t,i} is the average prediction of the 10 networks.

Table 3: Disclosed model parameters in Gu et al. (2020) and in our replication. We fill missing values with the cross-sectional median, or zero if the median is unavailable. 'H' is hidden layer activation; 'O' is output layer activation. ADAM is the optimizer proposed by Kingma and Ba (2015).

Parameter            Gu et al. (2020)             This paper
Preprocessing        Rank [-1, 1]; Fill median    Rank [-1, 1]; Fill median/0
Hidden layers        32-16-8                      32-16-8
Activation           H: ReLU / O: Linear          H: ReLU / O: Linear
Batch size           10,000                       DNN 10,000 / OES 1,000
Batch normalization  Yes                          Yes
L1 penalty           [10^−·, 10^−·]               {10^−·, 10^−·, 10^−·}
Early stopping       Patience 5                   Patience 5 / Tolerance 0.001
Learning rate η      [0.·, 0.·]                   {0.·, 0.·}
Optimizer            ADAM                         ADAM
Loss function        MSE                          MSE
Ensemble             Average over 10              Average over 10
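As a concreteness check on Table 3, the 32-16-8 architecture amounts to the forward pass below. This is a bare numpy sketch: batch normalization, ADAM and the L1 penalty are omitted, and the initialization scheme is our own assumption, not a disclosed detail.

```python
import numpy as np

def init_params(n_features=168, widths=(32, 16, 8), seed=0):
    """He-style initial weights for the 32-16-8 network with a linear output head.
    The initialization scale is an assumption; Table 3 does not specify one."""
    rng = np.random.default_rng(seed)
    dims = [n_features, *widths, 1]
    return [(rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
            for fan_in, fan_out in zip(dims[:-1], dims[1:])]

def forward(params, X):
    """ReLU hidden layers, linear output: one predicted return per stock."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # hidden layer with ReLU activation
    W, b = params[-1]
    return (h @ W + b).ravel()               # linear output layer
```

The pyramid of widths (32, 16, 8) keeps the parameter count modest relative to the 168 inputs, which suits the very low signal-to-noise ratio of monthly stock returns.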
To train OES, we keep the first 18 years (to 1974) as training data and the next 12 years (to 1986) as validation data. For each permutation of the hyperparameter set, we train an online learner up to 1986. Hyperparameter tuning is only performed once, on this period, as opposed to every year in Gu et al. (2020). As the algorithm does not depend on a separate set of data for validation, we simply take the hyperparameter set with the lowest monthly average MSE over 1975-1986 as the best configuration for the rest of the data set. The batch size of 1,000 for OES was chosen arbitrarily.

In this section, we present our U.S. stock return prediction results. DTS-SGD did not complete training with a reasonable range of hyperparameters due to exploding gradients, and is omitted from this section. As an overarching comment, R² for both DNN and OES on U.S. stock returns is very low, consistent with the findings of Gu et al. (2020). First, results with and without interaction terms are presented in Table 4, keeping in mind that our method should be compared against DNN without interaction terms. Without interaction terms, OES and DNN achieve IC of 4.53% and 3.82%, respectively. The relatively high correlation of OES (compared to DNN) indicates that it is better at differentiating the relative performance of stocks. This is particularly important in our use case, as practitioners build portfolios based on the expected relative performance of stocks. For instance, a long-short investor will buy top ranked stocks, short sell bottom ranked stocks, and earn the difference in relative return between the two baskets of stocks. Mean R² is −12.17% and −9.68% for OES and DNN, respectively. Note that the denominator of mean R² is adjusted by the cross-sectional mean of excess returns; the negative mean R² of both OES and DNN therefore indicates that neither method is able to accurately predict the magnitude of cross-sectional returns. Finally, OES scores −0.48% on R²_oos and DNN scores 0.22%. The low values for both methods underscore the difficulty of return forecasting. DNN achieves a higher Sharpe ratio than OES, at 1.63 versus 0.83. As we will point out in Section 5.4, the high Sharpe ratio of DNN is driven by microcap stocks. Despite the very low R², both methods are able to generate economically meaningful returns. This underscores our argument that R² is not the best measure of performance, and verifies practitioners' choice of correlation as the preferred measure. We observed similar performance with interaction terms, suggesting that the 8 macroeconomic time series have little interaction effect with the 94 features. In the subsequent results in this section, we only report statistics without interaction terms.

Table 4: Predictive performance with and without interaction terms. Pooled R²_oos is calculated across the entire out-of-sample period as a whole. Mean R² and IC are calculated cross-sectionally for each month, then averaged across time. P10-1 is the average monthly spread between the top and bottom deciles. The Sharpe ratio is based on the P10-1 return spread and annualized. Mean hyperparameters are calculated over the ensemble of 10 networks and across all periods. 'As reported' are the results in Gu et al. (2020).

                               With Interactions     W/O Interactions
%                As reported     DNN       OES         DNN       OES
Metrics
Pooled R²_oos                                          0.22     -0.48
Mean R²                        -9.89    -11.93        -9.68    -12.17
IC                              3.51      4.22         3.82      4.53
P10-1                3.27       1.83      2.10         2.39      2.41
Sharpe ratio         2.36       0.94      0.72         1.63      0.83
Hyperparameters
Mean L1 penalty                0.0012    0.0154       0.0024    0.0028
Mean η

So why do IC and R²_oos diverge? The answer lies in Table 5 and Figure 4, where we form decile portfolios based on predicted returns over the next month and track their respective realized returns. OES predicted values span a wider range than DNN's. This contributes to a lower R², even though OES is better able to differentiate the relative performance of stocks. DNN uses a pooled data set, which averages out time-varying effects; as a result, the average gradient is likely smaller in magnitude. This is evident from the lower mean L1 penalty and higher learning rate η chosen by validation. By contrast, OES trains on each time period individually, and the norm of the gradient presented to the network at each period is likely to be larger, which led to a lower learning rate being chosen by validation. Hence, the variance of OES predicted values is higher, and potentially requires higher or different forms of regularization.

Table 5: Predicted and realized monthly excess returns of decile portfolios. P1 is the mean excess return of the first decile (bottom 0-10% of ranked stocks) and P10-1 is P10 less P1, showing the return spread of the best decile relative to the worst. 'As reported' are the original results from Table A.9 in Gu et al. (2020).

          As reported              DNN                   OES
%       Predicted Realized   Predicted Realized   Predicted Realized
P1        -0.31    -0.92       -0.59    -0.47       -3.53    -0.50
P2         0.22     0.16        0.09     0.15       -1.96     0.03
P3         0.45     0.44        0.37     0.54       -1.07     0.27
P4         0.60     0.66        0.55     0.64       -0.34     0.48
P5         0.73     0.77        0.70     0.73        0.30     0.67
P6         0.85     0.81        0.84     0.78        0.88     0.85
P7         0.97     0.86        0.99     0.85        1.46     1.04
P8         1.12     0.93        1.17     0.96        2.10     1.18
P9         1.38     1.18        1.43     1.26        2.89     1.42
P10        2.28     2.35        2.33     1.92        4.25     1.91
P10-1      2.58     3.27        2.92     2.39        7.78     2.41

In Table 5 and Figure 4, we observe that the prediction performance of DNN is concentrated at the extremities, namely P1 and P10, with realized mean returns of −0.47% and 1.92%, respectively. Stocks between P3 and P7 are not well differentiated. By contrast, OES is better at ranking stocks across the entire spectrum. Realized mean returns of OES are more evenly spread across the deciles, resulting in a higher correlation than DNN. P10-1 realized portfolio returns are similar across DNN and OES, at 2.39% and 2.41%, respectively. However, the difference in mean return spread increases when calculated on a quintile basis (mean return of the top 20% of stocks minus the bottom 20%), to 1.75% and 1.90% for DNN and OES, respectively. This reflects the better predictiveness of OES in the middle of the spectrum. An investor holding a well diversified portfolio is more likely to utilize predictions closer to the center of the distribution, and to experience relative returns reminiscent of the quintile (or even tertile) spreads rather than the decile spreads.
So far, our tests are predicated on time-varying relationships between features and stock returns. How does the importance of features change over time? To examine this, at every time period we train the OES model and make a baseline prediction. For each feature j = 1, ..., m, all values of feature j are set to zero and a new prediction is made. A new R² is calculated between the new prediction and the baseline prediction, denoted R²_{t,j}. The importance of feature j at time t is calculated as FI_{t,j} = 1 − R²_{t,j}. Our measure tracks the features that the network is using. This is different to the procedure in Gu et al. (2020), where R² is calculated against actual stock returns rather than a baseline prediction.

To illustrate the inadequacy of a non-time-varying model, we first track feature importance over January 1987 to December 2016. The top 10 features with the highest feature importance are (in order of decreasing importance): idiovol (CAPM residual volatility), mvel1 (log market capitalization), dolvol (monthly traded value), retvol (return volatility), beta (CAPM beta), mom12m (12-month minus 1-month price momentum), betasq (CAPM beta squared), mom6m (6-month minus 1-month price momentum), ill (illiquidity), and maxret (30-day max daily return). Rolling 12-month averages were calculated to provide a more discernible trend, with the top 5 shown in Figure 5. Feature importance exhibits strong time-variability. Rolling 12-month average feature importance fell from 14-16% at the start of the out-of-sample period to a trough of 2-6% before rebounding. This indicates that the network would have changed considerably over time. Rapid falls in feature importance can be seen in Figure 5 over 1990-91, 2000-01 and 2008-09. These periods correspond to the U.S. recession of the early 1990s, the Dot-com bubble and the Global Financial Crisis, respectively. Thus, market distress may explain rapid changes in feature importance.
Figure 5: Top 5 features based on rolling 12-month average feature importance over 1987-2016. Three rapid falls can be seen, coinciding with the 1990-91 U.S. recession, the Dot-com bubble (2000-03) and the Global Financial Crisis (2007-09). These periods are shaded for reference.
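The zero-out importance measure described above can be sketched as follows, with predict_fn standing in for the trained OES network (a hypothetical callable, not the paper's code):

```python
import numpy as np

def feature_importance(predict_fn, X):
    """FI_j = 1 - R^2 between the prediction with feature j zeroed out and the
    baseline prediction; a large FI_j means the network leans on feature j."""
    base = predict_fn(X)
    ss_tot = np.sum((base - base.mean()) ** 2)
    fi = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_zero = X.copy()
        X_zero[:, j] = 0.0                   # knock out feature j only
        r2 = 1.0 - np.sum((predict_fn(X_zero) - base) ** 2) / ss_tot
        fi[j] = 1.0 - r2
    return fi
```

Comparing against the baseline prediction rather than realized returns isolates what the network is using, independently of how accurate it happens to be that month.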
Next, we examine changes in importance for all features on a yearly basis. Figure 6 displays considerable year-to-year variation in feature importance. As there are just a few clusters of features with relatively high feature importance, the network's predictions can be attributed to a small set of features. This is likely due to the use of L1 regularization, which encourages sparsity. There is an overall trend towards lower importance over time, consistent with the publication-informed trading hypothesis of McLean and Pontiff (2016). For instance, the importance of market capitalization (mvel1) has decreased over time, as documented in Horowitz et al. (2000). There are periods of visibly lower importance for all features, over 2000-02 and 2008-09, and to a lesser extent 1990 and 1997 (the Asian financial crisis). If all features have lower importance during market distress, then what explains stock returns during these periods? To answer this question, we turn to the importance of sectors, using SIC 13 (Oil and Gas), 60 (Depository Institutions) and 73 (Business Services) as proxies for oil companies, banks and technology companies, respectively. Figure 7 records the rolling 12-month average R² to baseline prediction of banks, oil and technology companies. The peak importance of SIC 73 overlaps with the Dot-com bubble, and the peak of SIC 60 occurs just after the Global Financial Crisis (which started as a sub-prime mortgage crisis). The importance of SIC 13 peaked in 2016, coinciding with the 2014-16 oil glut which saw oil prices fall from over US$100 per barrel to below US$30 per barrel. This is an example of how an exogenous event confined to a specific industry impacts the predictability of stock returns. Thus, a plausible explanation for the observed results is that firm features explain less of cross-sectional returns during market shocks, which become increasingly explained by industry groups. This is particularly true if the market shock is industry related: for instance, technology companies during the Dot-com bubble, oil companies during an oil crisis, and lodging companies during a pandemic. This underscores the importance of having a dynamic model that adapts to changes in the true model.

Figure 6: Yearly feature importance, measured as 1 − R² to baseline predictions (in decimal). The OES network appeared to use only a handful of features. Shades of feature importance are distinctly lighter over 2000-02 and 2008-09, and to a lesser extent in 1990 and 1997. The importance of some features has eroded over time (e.g. dolvol, maxret and turn).

Figure 7: Rolling 12-month average R² to baseline prediction of SIC codes 13, 60 and 73, as proxies for oil & gas companies, banks and technology companies, respectively. R² of technology companies peaks over 2001-02, banks over 2008-10, and oil companies over 2015-16. The durations of the 1990-91 U.S. recession, the Dot-com bubble, the Global Financial Crisis and the 2014-16 oil glut are shaded in grey.

As noted in Section 1, the data set in Gu et al. (2020) contains many stocks that are small and illiquid. The U.S. Securities and Exchange Commission (2013) defines "microcap" stocks as companies with market capitalization below US$300 million, and "nanocap" stocks as those below US$50 million. At the end of 2016, there are over 1,300 stocks with market capitalization below US$50 million, and over 1,800 stocks with market capitalization between US$50 million and US$300 million. Together, microcap and nanocap stocks constitute close to half of the data set as at 2016. Thus, we also provide results excluding these difficult-to-trade stocks. At the end of every June, we calculate a breakpoint based on the 5th percentile of NYSE-listed stocks and exclude stocks with market capitalization below this value. Once rebalanced, the same set of stocks is carried forward until the next rebalance (unless the stock ceases to exist). This cutoff is chosen to approximately include the larger half of U.S.-listed stocks, with the average number of stocks exceeding 2,600. We label this data set the investable set. To mitigate the impact of outliers, we also winsorize excess returns at 1% and 99% for each month (separately). Winsorized returns are then standardized by subtracting the cross-sectional mean and dividing by the cross-sectional standard deviation. Standardization is a common procedure in machine learning and can assist in network training (LeCun et al., 2012). Predicting a dependent variable with zero mean also removes the need to predict market returns, which is embedded in stocks' excess returns (over the risk-free rate). This transformation allows the neural network to more easily learn the relationships between relative returns and firm characteristics. Results based on this investable set are presented in Table 6.

Table 6: Predictive performance on the investable set. 'Ensemble' is the average of the standardized predictions of the two methods. Pooled R²_oos is calculated across the entire out-of-sample period as a whole. Mean R² and IC are calculated cross-sectionally for each month, then averaged across time. P10-1 is the average monthly spread between the top and bottom deciles. The Sharpe ratio is based on the P10-1 return spread and annualized. Mean hyperparameters are calculated over the ensemble of 10 networks and across all periods.

%                  DNN       OES      Ensemble
Metrics
Pooled R²_oos
Mean R²
IC                 5.74      6.05
P10-1              1.69      2.41
Sharpe ratio       0.69      0.82
Hyperparameters
Mean L1 penalty   0.0211    0.0046       –
Mean η
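The per-month transformation applied to the investable set can be sketched as (helper name ours):

```python
import numpy as np

def winsorize_standardize(excess_returns):
    """Clip one month's cross-section of excess returns at its 1st and 99th
    percentiles, then subtract the cross-sectional mean and divide by the
    cross-sectional standard deviation."""
    lo, hi = np.percentile(excess_returns, [1.0, 99.0])
    clipped = np.clip(excess_returns, lo, hi)
    return (clipped - clipped.mean()) / clipped.std()
```

Clipping before standardizing matters: without it, a single extreme return inflates the standard deviation and compresses every other observation towards zero.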
Both R²_oos and IC improved once microcaps are excluded, with OES scoring 6.05% on IC and DNN 5.74%. However, DNN experienced a significant drop in mean decile spread (to 1.69% per month) and Sharpe ratio (0.69), suggesting that microcaps are significant contributors to its results on the full data set. By contrast, the mean decile spread and Sharpe ratio remain stable for OES, at 2.41% and 0.82, respectively. This indicates that the predictive performance of OES was not driven by microcap stocks. We believe this is a meaningful result for practitioners, as this subset represents a relatively accessible segment of the market for institutional investors. An ensemble based on the average of the cross-sectionally standardized predictions of the two models achieved the best IC, decile spread and Sharpe ratio relative to OES and DNN. The mean monthly correlation between OES and DNN is only about 35%, reflecting the difference in the variance of the predictions. Monthly correlations between the two models are presented in Figure 8. We observe that correlation tends to be lowest immediately after a recession or crisis. We hypothesise that OES is quicker to react to an economic recovery.

Turning to the cumulative decile returns presented in Figure 9, we observe significant drawdowns for DNN during the recovery phases of the Dot-com bubble and the Global Financial Crisis. P1 of DNN bounced back sharply during these episodes, causing sharp drops in decile spreads, consistent with momentum crashes (Daniel and Moskowitz, 2016). By contrast, the decile spreads of OES appear to react to the recovery more quickly. Consistent with prior findings, the spreads between deciles 3 to 7 are also better under OES than DNN in the investable set. Given these favorable characteristics, practitioners are likely to find OES a useful tool to add to the armory of prediction models.

Figure 8: Monthly and rolling 12-month correlation between the predictions of OES and DNN. The durations of the 1990-91 U.S. recession, the Dot-com bubble, the Global Financial Crisis and the 2014-16 oil glut are shaded in grey.
Stock return prediction is an arduous task. The true model is noisy, complex and time-varying. Mainstream deep learning research has focused on problems that do not vary over time and, arguably, time-varying applications have seen fewer advancements. In this work, we propose an online early stopping algorithm that is easy to implement and can be applied to an existing network setup. We show that a network trained with OES can track a time-varying function and achieve superior performance to DTS-SGD, a recently proposed online non-convex optimization technique. Our method is also significantly faster, as only two periods of training data are required at each iteration, compared to the pooled method used in Gu et al. (2020), which re-trains the network on the entire data set annually. In our tests, the pooled method took over 5 hours to train. By contrast, our method took 44.25 mins for a single pass over the entire data set (an ensemble of ten networks took approximately 7.4 hours). (Tests were performed on AMD Ryzen™.)

We find that the network relies on only a small set of features, likely due to the L1 regularization which encourages sparsity. We also find evidence of time-varying feature importance. In particular, features such as log market capitalization (the size effect) and 12-month minus 1-month momentum have seen a gradual decrease in their importance towards the end of our test period, consistent with the publication-informed trading hypothesis of McLean and Pontiff (2016). We find that sectors can also exhibit time-varying importance (for instance, technology stocks during the Dot-com bubble). These results have strong implications for practitioners forecasting stock returns using well known asset pricing anomalies. Excluding microcaps, we find that OES offers superior predictive performance in a subuniverse that is accessible to institutional investors. We find that the correlation between OES and DNN is at its lowest after a recession or crisis. We argue that this is driven by the faster reaction of OES in tracking the recovery. An ensemble based on the average prediction of the two models achieves the best IC and Sharpe ratio, suggesting that the two methods may be complementary.

From an academic perspective, recent advances in deep learning such as dropout and residual connections (He et al., 2016) may allow deeper networks to be trained, enabling more expressive asset pricing models. Given the higher variance of predictions produced by OES, future work should explore alternative methods of regularization, including dropout, an L2 penalty, or a mixture of regularization techniques. Lastly, we believe time-varying neural networks are a relatively less explored domain of machine learning with applications in both the prediction and the analysis of asset returns.

References
Abe, M. and Nakayama, H. (2018). Deep learning for forecasting stock returns in the cross-section. arXiv preprint.

Ambachtsheer, K. (1974). Profit potential in an almost efficient market. Journal of Portfolio Management, 1(1):84.

Aydore, S., Zhu, T., and Foster, D. P. (2019). Dynamic local regret for non-convex online forecasting. In Advances in Neural Information Processing Systems 32, pages 7980–7989. Curran Associates, Inc.

Bergmeir, C., Hyndman, R., and Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83.

Bossaerts, P. and Hillion, P. (1999). Implementing statistical criteria to select return forecasting models: What do we learn? The Review of Financial Studies, 12(2):405–428.

Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1:223–236.

Cover, T. M. (1991). Universal portfolios. Mathematical Finance, 1(1):1–29.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Daniel, K. and Moskowitz, T. J. (2016). Momentum crashes. Journal of Financial Economics, 122(2):221–247.

Fabozzi, F. J., Grant, J. L., and Vardharaj, R. (2011). Common Stock Portfolio Management Strategies, volume 1, chapter 9, pages 229–270. John Wiley & Sons, Ltd.

Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47(2):427–465.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Green, J., Hand, J. R. M., and Zhang, X. F. (2017). The characteristics that provide independent information about average U.S. monthly stock returns. The Review of Financial Studies, 30(12):4389–4436.

Grinold, R. and Kahn, R. (1999). Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk. McGraw-Hill Education.

Gu, S., Kelly, B., and Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5):2223–2273.

Harvey, C. R., Liu, Y., and Zhu, H. (2016). ... and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68.

Hazan, E., Singh, K., and Zhang, C. (2017). Efficient regret minimization in non-convex games. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70, pages 1433–1441, Sydney, Australia. PMLR.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA. IEEE.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Horowitz, J. L., Loughran, T., and Savin, N. (2000). The disappearing size effect. Research in Economics, 54(1):83–100.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient BackProp, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg.

Lev, B. and Srivastava, A. (2019). Explaining the recent failure of value investing. SSRN working paper.

Mahsereci, M., Balles, L., Lassner, C., and Hennig, P. (2017). Early stopping without a validation set. CoRR, abs/1703.09580.

McLean, R. D. and Pontiff, J. (2016). Does academic research destroy stock return predictability? The Journal of Finance, 71(1):5–32.

Messmer, M. (2017). Deep learning and the cross-section of expected returns. SSRN working paper.

Morgan, N. and Bourlard, H. A. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems 2, pages 630–637. Morgan-Kaufmann.

Pesaran, M. H. and Timmermann, A. (1995). Predictability of stock returns: Robustness and economic significance. Journal of Finance, 50:1201–1228.

Prechelt, L. (1998). Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69, London, UK. Springer-Verlag.

Reed, R. D. (1993). Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740–747.

Rosenberg, B., Reid, K., and Lanstein, R. (1985). Persuasive evidence of market inefficiency. Journal of Portfolio Management, 11(3):9–16.

Rossi, B. and Inoue, A. (2012). Out-of-sample forecast tests robust to the choice of window size. Journal of Business & Economic Statistics, 30(3):432–453.

Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, Boston, MA, USA. IEEE.

Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6):1391–1407.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), volume 28, pages 1139–1147, Atlanta, GA, USA. PMLR.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

U.S. Securities and Exchange Commission (2013). Microcap stock: A guide for investors. Accessed: 2021-01-03.

Weigand, A. (2019). Machine learning in empirical asset pricing. Financial Markets and Portfolio Management, 33:93–104.

Welch, I. and Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies, 21(4):1455–1508.