Predicting Stock Returns with Batched AROW
Rachid Guennouni Hassani♣, Alexis Gilles♦, Emmanuel Lassalle♥ and Arthur Dénouveaux♠

École Polytechnique, Machina Capital

March 23, 2020
Abstract
We extend the AROW regression algorithm developed by Vaits and Crammer in [VC11] to handle synchronous mini-batch updates and apply it to stock return prediction. By design, the model should be more robust to noise and adapt better to non-stationarity compared to a simple rolling regression. We empirically show that the new model outperforms more classical approaches by backtesting a strategy on S&P 500 stocks.

♣ [email protected]   ♦ [email protected]   ♥ [email protected]   ♠ [email protected]

1 Introduction
Financial markets exhibit highly non-stationary behaviors, making it difficult to build predictive signals that do not decay too rapidly (see [SCSG13, Con01] for empirical studies of return time series). A standard method for capturing these changes in time series data consists in using a rolling regression, that is, a linear regression model trained on a rolling window and kept as a static model during a prediction period. However, the size of the historical training data as well as the duration of the prediction period have a direct impact on the performance of the resulting model: using too much training data would result in a model that does not react quickly enough to sudden changes, while short training and prediction windows would make the model unstable (see for instance [IJR17]).

Online learning algorithms are suited to situations where data arrives sequentially. New information is taken into account by updating the model parameters in a supervised fashion. More precisely, an online learning algorithm repeats the following steps indefinitely: receive a new instance $x_t$, make a prediction $\hat{y}_t$, receive the correct label $y_t$ for the instance, and update the model accordingly (a generic sketch of this loop is given at the end of this introduction).

In the particular case of regression, online models are also good candidates to handle the non-stationarity inherent in financial time series while keeping a certain memory of what has been learnt from the beginning. The recursive least squares (RLS) algorithm is a well-known approach to online linear regression problems (e.g. [SD91]), yet it updates the model parameters using one sample at a time. However, when building predictive models on stock markets, one should take into account the very low signal-to-noise ratio, and one way to do so is to fit a single model for predicting all the stock returns of a trading universe (hence fitting on much more data). It allows us to have more training data covering shorter periods, but it comes with the difficulty of updating the model parameters synchronously with all the data available at a given time.

In this paper we extend AROW for regression ([VC11]), an algorithm similar to RLS, in order to take into account a batch of instances for the online update instead of doing one update per sample. Indeed, the latter approach would introduce a spurious order in information that actually occurs synchronously and should be captured as such. Like RLS (see [Hay96]), AROW suffers logarithmic regret in the stationary case, but also comes with a bound on regret in the general case. We test it on the universe of S&P 500 stocks with a daily strategy and show that it outperforms the rolling regression on the same set of features in backtest.
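The online learning protocol described above can be summarized by a short loop. This is a generic sketch only: the `model` object and its `predict`/`update` methods are placeholders, not an API from the paper.

```python
def run_online(model, stream):
    """Generic online-learning protocol: predict, then observe and update."""
    for x_t, y_t in stream:          # instances arrive sequentially
        y_hat = model.predict(x_t)   # prediction made before the label is revealed
        model.update(x_t, y_t)       # incorporate the revealed label into the model
        yield y_hat
```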
2 Adaptive Weight Regularization

When training online models, we try to find the right balance between reactivity to new information and accumulation of predictive power over time. For instance, the family of models introduced in [CDK+06] aggressively updates its parameters by guaranteeing a certain performance on new data (expressed in terms of geometric margin). But even regularized versions of these algorithms do not take into account the fact that some features might be less noisy than others. In other words, they are not suited for cases where we would like updates to be more aggressive on particular parameters than on the rest of the weight vector.

Subsequent online learning algorithms introduced in [DCP08] and [CKD13] maintain a Gaussian distribution on weight vectors representing the confidence the model has in its parameters. From this point of view, the mean (resp. the variance) of the distribution represents the knowledge (resp. the confidence in the parameters). In that framework, when receiving a new feature vector and target $(x_t, y_t)$, we would like the Gaussian parameters to be updated by solving the following optimization problem:

$$(\mu_t, \Sigma_t) = \arg\min_{\mu, \Sigma} D_{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})\big) \quad \text{s.t.} \quad \mathbb{P}_{w \sim \mathcal{N}(\mu, \Sigma)}\big(\ell(y_t, w^\top x_t) \leqslant \epsilon\big) \geqslant \eta,$$

where $D_{KL}$ is the Kullback-Leibler divergence and $\ell$ is the classical mean square loss. The idea behind this optimization is to find the minimal changes in knowledge and confidence such that a minimum regression performance is achieved with a given probability threshold $\eta$.

Although that formulation is tractable (in particular there is an explicit formula for the KL divergence between two Gaussian distributions), it is not convex in $\mu$ and $\Sigma$. With AROW, Vaits and Crammer use a simpler but convex objective ([VC11]):

$$C(\mu, \Sigma) = D_{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})\big) + \lambda_1\, \ell(y_t, \mu^\top x_t) + \lambda_2\, x_t^\top \Sigma\, x_t, \qquad (1)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters to adjust the tradeoff between the three components of the objective. These three components should be understood as follows:

1. the parameters should not change much per update;
2. the new model should perform well on the current instance;
3. the uncertainty about the parameters should reduce as we get additional data.

As such, the model is well suited for non-stationary regression. However, the updates only take into account a single instance at a time. Because we want a single model for all the stocks here, we need to extend this approach to synchronously update the parameters on all cross-sectional information available at a given time, which is the purpose of the next section.
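For concreteness, minimizing objective (1) with the squared loss has a closed-form solution; the following is a minimal numpy sketch with our own function name, assuming $\lambda_1 = \lambda_2 = \frac{1}{2r}$ as in the batch derivation below, of which this is the $K = 1$ special case.

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r):
    """One AROW regression update on a single instance (x, y).

    Closed-form minimizer of objective (1) with squared loss and
    lambda_1 = lambda_2 = 1/(2r); the K = 1 case of the BAROW update below.
    """
    Sx = Sigma @ x                                # Sigma_{t-1} x_t
    denom = r + x @ Sx                            # r + x_t^T Sigma_{t-1} x_t (a scalar)
    mu_new = mu + Sx * (y - mu @ x) / denom       # mean moves towards the new target
    Sigma_new = Sigma - np.outer(Sx, Sx) / denom  # confidence increases along x_t
    return mu_new, Sigma_new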
3 Batched AROW

We now assume that between the updates at times $t-1$ and $t$, we have $K$ synchronous instances $(x_t^k, y_t^k)$, $k = 1, \dots, K$, of the features and targets. They correspond to the observation of a complete universe of stocks at a given time. Applying one AROW update per instance would introduce a fake order between the instances and potentially hurt the performance (see Section 5 for more details). In order to take into account all the new information at once in a single batch, we extend the cost function from equation (1) as follows:

$$C(\mu, \Sigma) = D_{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})\big) + \frac{\lambda_1}{K} \sum_{k=1}^{K} \ell\big(y_t^k, \mu^\top x_t^k\big) + \frac{\lambda_2}{K} \sum_{k=1}^{K} (x_t^k)^\top \Sigma\, x_t^k.$$

For simplicity of presentation, we set $\lambda_1 = \lambda_2 = \frac{1}{2r}$ where $r > 0$, and write $R = rK$. We use the explicit formula

$$D_{KL}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big) = \frac{1}{2}\left(\log\frac{\det \Sigma_1}{\det \Sigma_0} + \mathrm{Tr}\big(\Sigma_1^{-1} \Sigma_0\big) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d\right),$$

and set $X_t = [x_t^1 \cdots x_t^K]^\top$ and $Y_t = [y_t^1 \cdots y_t^K]^\top$.
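Spelling out a step the derivation uses implicitly: with the squared loss, the two per-instance sums collapse into the matrix notation just introduced,

$$\sum_{k=1}^{K} \ell\big(y_t^k, \mu^\top x_t^k\big) = \big\| Y_t - X_t \mu \big\|^2, \qquad \sum_{k=1}^{K} (x_t^k)^\top \Sigma\, x_t^k = \mathrm{Tr}\big(X_t \Sigma X_t^\top\big).$$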
Plugging these into the objective and using the fact that $\ell$ is the mean square loss, we get

$$C(\mu, \Sigma) = \frac{1}{2}\log\frac{\det \Sigma_{t-1}}{\det \Sigma} + \frac{1}{2}\Big(\mathrm{Tr}\big(\Sigma_{t-1}^{-1} \Sigma\big) + (\mu_{t-1} - \mu)^\top \Sigma_{t-1}^{-1} (\mu_{t-1} - \mu) - d\Big) + \frac{1}{2R}\Big(\big\| Y_t - X_t \mu \big\|^2 + \mathrm{Tr}\big(X_t \Sigma X_t^\top\big)\Big).$$

The main result of this article is the following: minimizing the cost function $C$ has an explicit solution given by

$$\Sigma_t = \Sigma_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t \Sigma_{t-1}, \qquad (2a)$$
$$\mu_t = \mu_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} (X_t \mu_{t-1} - Y_t). \qquad (2b)$$

From now on we refer to this batch version of AROW as BAROW, and detail the steps to the solution in the next section.

4 Derivation of the Updates

As for AROW, BAROW's cost function $C$ is convex. To prove (2a) and (2b) it is thus enough to look at the critical points of $C$. We start by computing $\partial C / \partial \Sigma$ and, using formulas from [PP12], chapter 2, we find

$$\frac{\partial C}{\partial \Sigma}(\mu, \Sigma) = -\frac{1}{2}\left(\Sigma^{-1} - \Sigma_{t-1}^{-1} - \frac{1}{R}\, X_t^\top X_t\right).$$

The Kailath variant of the Woodbury identity (see [PP12], chapter 3) yields

$$\left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} = \Sigma_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t \Sigma_{t-1}$$

and allows us to deduce that $\partial C / \partial \Sigma$ vanishes for

$$\Sigma = \Sigma_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t \Sigma_{t-1},$$

which is formula (2a). Similarly, one finds

$$\frac{\partial C}{\partial \mu}(\mu, \Sigma) = \Sigma_{t-1}^{-1} (\mu - \mu_{t-1}) + \frac{1}{R}\big(X_t^\top X_t \mu - X_t^\top Y_t\big),$$

so that $\partial C / \partial \mu$ vanishes for

$$\mu = \left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} \left(\Sigma_{t-1}^{-1} \mu_{t-1} + \frac{1}{R}\, X_t^\top Y_t\right) = \left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} \Sigma_{t-1}^{-1} \mu_{t-1} + \frac{1}{R} \left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} X_t^\top Y_t.$$

We deal with the two terms separately. For the first one we use again the Kailath variant of the Woodbury identity and get

$$\left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} \Sigma_{t-1}^{-1} \mu_{t-1} = \left(\mathrm{Id} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t\right) \mu_{t-1} = \mu_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t \mu_{t-1}.$$

For the second term, first notice that

$$\left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right) \Sigma_{t-1} X_t^\top = X_t^\top + \frac{1}{R}\, X_t^\top X_t \Sigma_{t-1} X_t^\top = \frac{1}{R}\, X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big),$$

so that $\left(\Sigma_{t-1}^{-1} + \frac{1}{R} X_t^\top X_t\right) \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} Y_t$ is equal to

$$\frac{1}{R}\, X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big) \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} Y_t = \frac{1}{R}\, X_t^\top Y_t.$$

Simplifying this way and multiplying by the inverse on the left, we get

$$\frac{1}{R} \left(\Sigma_{t-1}^{-1} + \frac{1}{R}\, X_t^\top X_t\right)^{-1} X_t^\top Y_t = \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} Y_t.$$

Finally, combining the two terms, we get

$$\mu = \mu_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} X_t \mu_{t-1} + \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} Y_t = \mu_{t-1} - \Sigma_{t-1} X_t^\top \big(R\, \mathrm{Id}_K + X_t \Sigma_{t-1} X_t^\top\big)^{-1} (X_t \mu_{t-1} - Y_t),$$

which was the claimed formula.
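The update (2a)-(2b) is straightforward to implement. Below is a minimal numpy sketch; the function name and argument layout are our own, and the final re-symmetrization is a practical safeguard rather than part of the formulas.

```python
import numpy as np

def barow_update(mu, Sigma, X, Y, R):
    """One BAROW update implementing equations (2a) and (2b).

    mu    : (d,)   mean of the weight distribution (the regression weights)
    Sigma : (d, d) covariance (confidence) matrix
    X     : (K, d) matrix stacking the K synchronous feature vectors
    Y     : (K,)   vector of the K targets
    R     : float  regularization scalar, R = r * K
    """
    K = X.shape[0]
    SXt = Sigma @ X.T                            # Sigma_{t-1} X_t^T, shape (d, K)
    G = R * np.eye(K) + X @ SXt                  # R Id_K + X_t Sigma_{t-1} X_t^T
    gain = np.linalg.solve(G, SXt.T).T           # Sigma_{t-1} X_t^T G^{-1} (G is symmetric)
    mu_new = mu - gain @ (X @ mu - Y)            # equation (2b)
    Sigma_new = Sigma - gain @ SXt.T             # equation (2a): gain times X_t Sigma_{t-1}
    Sigma_new = 0.5 * (Sigma_new + Sigma_new.T)  # re-symmetrize against round-off error
    return mu_new, Sigma_new
```

Taking $K = 1$ recovers the single-instance update of Section 2 with $R = r$; at prediction time one simply uses $\hat{Y}_t = X_t \mu_{t-1}$.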
5 Backtesting a Strategy using BAROW

BAROW combines the adaptability of AROW to non-stationary data and the advantage of taking into account all new information synchronously. We tested the model against two baselines as a return predictor for a trading strategy on the S&P 500 universe.

We backtested a long-short strategy taking daily positions proportional to the predictions generated by a regression model on stock returns and showed that it outperforms the following baselines:

1. A rolling regression updated daily and using the past 12 months of data for training.
2. AROW regression with single-instance updates (500 updates per day).

Figure 1 – Estimated Return of each model

We ran the backtest on 2,250 days from 2011 to 2019, allowing a burn-in period of 12 months for AROW and BAROW before starting to use them as return predictors. We also tuned the $R$ hyperparameter on 2010 data for AROW and BAROW. To avoid the strategy having too high a sensitivity to market moves, we neutralized daily returns using a multi-factor model (including beta, volatility and a variety of other indicators; see [KL15] for a detailed discussion of risk models). These neutralized returns are the regression targets $y_t$. We used features based on MACD indicators such as those described in [CNL14].

We estimated the daily returns of a strategy taking positions proportional to the predictions as the cross-sectional correlation between the predictions and the realized returns, multiplied by the cross-sectional standard deviation of the returns (a short sketch of this computation is given at the end of this section).

As expected, we can see that the baseline using AROW updates stock per stock is not suited to learn a generic predictor for the universe. Its cumulative return appears random compared to the two other models. That empirically justifies the batch update computed in BAROW.

If we now compare BAROW to the rolling linear regression, we directly see that their performances drift away from each other. BAROW significantly outperforms the baselines, illustrating the benefits of online learning models for non-stationary data.

model    Return   Sharpe   MaxDD     Calmar
AROW     0.5%     0.1      -3.02%    0.18
Linear   31.3%    4.0      -0.58%    -0.61%

Table 1 – Performance statistics of each model

We provide some usual performance statistics in Table 1. Specifically, we report the total return over the period, the Sharpe ratio of the expected return (Sharpe), the maximum drawdown (MaxDD), and the Calmar ratio (Calmar, defined as the return over the period divided by MaxDD). One should be aware that using such short-horizon targets induces a very high turnover in the portfolio, resulting in lower net performance after transaction costs are applied.
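As a concrete reading of the return-estimation convention described above, here is a minimal numpy sketch (our own helper function, not code from the paper): for each day it computes the cross-sectional correlation between predictions and realized neutralized returns, scaled by the cross-sectional standard deviation of those returns.

```python
import numpy as np

def estimated_daily_return(preds, rets):
    """Proxy for the daily P&L of a strategy with positions
    proportional to the predictions.

    preds : (K,) cross-section of predictions for one day
    rets  : (K,) realized (neutralized) returns for the same stocks
    """
    corr = np.corrcoef(preds, rets)[0, 1]  # cross-sectional correlation
    return corr * rets.std()               # scaled by the dispersion of returns

# Cumulating over a backtest of several days:
# cum_return = sum(estimated_daily_return(p, r) for p, r in zip(all_preds, all_rets))
```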
6 Discussion

Our BAROW algorithm outperformed its baselines in a backtest, yet the question of how $\Sigma$ impacts the model updates remains. The model forces $\Sigma$ to converge, causing the model to become less and less able to adapt quickly. That trade-off between adaptability and robustness might not be the right one in volatile periods, when one could require more dynamic updates. One way to address it is to schedule resets of the covariance matrix, as experimented with in [VC11]. In our case it might be beneficial to reset the covariance matrix conditionally on any market event we consider a trigger for more dynamic updates.

The outperformance of BAROW over its baselines and the adaptability/robustness trade-off mentioned above could both be further analyzed by increasing the backtest frequency (e.g. 1-minute bars). A higher frequency would likely enable shorter training and prediction windows, thereby enhancing the potential outperformance of BAROW over the linear baseline.

Also, the backtest results described above are a mere sum of single-stock returns and do not include any form of risk management at the overall portfolio level, which could further enhance the risk-adjusted performance of each approach.

Finally, it is worth pointing out that BAROW results are significantly impacted by the method used to neutralize the returns that serve as the regression target. Usual standardization methods have exhibited periods of higher correlation of residuals in recent years. These likely stem from stakeholders relying on similar risk models and an ever-spreading use of machine-learning-type strategies.
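As an illustration of the conditional-reset idea, here is a sketch under our own assumptions: [VC11] studies scheduled resets, while the volatility trigger below is hypothetical.

```python
import numpy as np

def maybe_reset_covariance(Sigma, realized_vol, vol_threshold, scale=1.0):
    """Reset the confidence matrix when a market event fires.

    The trigger here is a (hypothetical) realized-volatility threshold;
    resetting Sigma to a scaled identity re-opens the model to fast updates.
    """
    if realized_vol > vol_threshold:
        d = Sigma.shape[0]
        return scale * np.eye(d)  # forget the accumulated confidence
    return Sigma
```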
References

[CDK+06] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, 2006.

[CKD13] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. Machine Learning, 91(2):155–187, May 2013.

[CNL14] Terence Chong, Wing-Kam Ng, and Venus Liew. Revisiting the performance of MACD and RSI oscillators. Journal of Risk and Financial Management, 7(1):1–12, Feb 2014.

[Con01] Rama Cont. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1:223–236, 2001.

[DCP08] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 264–271, Helsinki, Finland, 2008. ACM Press.

[Hay96] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, Inc., USA, 1st edition, 1996.

[IJR17] Atsushi Inoue, Lu Jin, and Barbara Rossi. Rolling window selection for out-of-sample forecasting with time-varying parameters. Journal of Econometrics, 196(1):55–67, 2017.

[KL15] Zura Kakushadze and Jim Liew. Custom v. standardized risk models. Risks, 3(2):112–138, May 2015.

[PP12] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, Nov 2012. Version 20121115.

[SCSG13] Thilo A. Schmitt, Desislava Chetalova, Rudi Schäfer, and Thomas Guhr. Non-stationarity in financial time series: Generic features and tail behavior. EPL (Europhysics Letters), 103(5):58003, Sep 2013.

[SD91] L. L. Scharf and C. Demeure. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison-Wesley Series in Electrical and Computer Engineering. Addison-Wesley Publishing Company, 1991.

[VC11] Nina Vaits and Koby Crammer. Re-adapting the regularization of weights for non-stationary regression. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT 2011), 2011.