Deep Portfolio Optimization via Distributional Prediction of Residual Factors
Kentaro Imajo, Kentaro Minami, Katsuya Ito, Kei Nakagawa
Preferred Networks, Inc.; Nomura Asset Management Co., Ltd.
{imos, minami, katsuya1ito}@preferred.jp, [email protected]

Abstract
Recent developments in deep learning techniques have motivated intensive research in machine learning-aided stock trading strategies. However, since the financial market has a highly non-stationary nature hindering the application of typical data-hungry machine learning methods, leveraging financial inductive biases is important to ensure better sample efficiency and robustness. In this study, we propose a novel method of constructing a portfolio based on predicting the distribution of a financial quantity called residual factors, which is known to be generally useful for hedging the risk exposure to common market factors. The key technical ingredients are twofold. First, we introduce a computationally efficient extraction method for the residual information, which can be easily combined with various prediction algorithms. Second, we propose a novel neural network architecture that allows us to incorporate widely acknowledged financial inductive biases such as amplitude invariance and time-scale invariance. We demonstrate the efficacy of our method on U.S. and Japanese stock market data. Through ablation experiments, we also verify that each individual technique contributes to improving the performance of trading strategies. We anticipate our techniques may have wide applications in various financial problems.
Developing a profitable trading strategy is a central problem in the financial industry. Over the past decade, machine learning and deep learning techniques have driven significant advances across many application areas (Devlin et al. 2019; Graves, Mohamed, and Hinton 2013), which inspired investors and financial institutions to develop machine learning-aided trading strategies (Wang et al. 2019; Choudhry and Garg 2008; Shah 2007). However, it is believed that forecasting financial time series is an essentially difficult task (Krauss, Do, and Huck 2017). In particular, the well-known efficient market hypothesis (Malkiel and Fama 1970) claims that no single trading strategy can be permanently profitable because the dynamics of the market change quickly. In this respect, the financial market is significantly different from the stationary environments typically assumed by most machine learning and deep learning methods.
Generally speaking, a good way for deep learning methods to adapt quickly to a given environment is to introduce a network architecture that reflects a good inductive bias for that environment. The most prominent examples of such architectures include convolutional neural networks (CNNs) (Krizhevsky, Sutskever, and Hinton 2012) for image data and long short-term memories (LSTMs) (Hochreiter and Schmidhuber 1997) for general time series data. Therefore, a natural question to ask is what architecture is effective for processing financial time series.

In finance, researchers have proposed various trading strategies and empirically studied their effectiveness. Hence, it is reasonable to seek architectures inspired by empirical findings in financial studies. In particular, we consider the following three features of the stock market.
Many empirical studies on stock returns are described in terms of factor models (e.g., (Fama and French 1992, 2015)). These factor models express the return of a certain stock i at time t as a linear combination of K factors plus a residual term:

r_{i,t} = ∑_{k=1}^{K} β_i^{(k)} f_t^{(k)} + ε_{i,t}.   (1)

Here, f_t^{(1)}, ..., f_t^{(K)} are the common factors shared by multiple stocks i ∈ {1, ..., S}, and the residual factor ε_{i,t} is specific to each individual stock i. Therefore, the common factors correspond to the dynamics of the entire stock market or industries, whereas the residual factors convey firm-specific information.

In general, if the return of an asset has a strong correlation to the market factors, the asset exhibits a large exposure to the risk of the market. For example, it is known that a classical strategy based on the momentum phenomenon (Jegadeesh and Titman 1993) is correlated to the Fama-French factors (Fama and French 1992, 2015), which exhibited negative returns around the credit crisis of 2008 (Calomiris, Love, and Peria 2010; Szado 2009). On the other hand, researchers found that trading strategies based only on the residual factors can be robustly profitable, because such strategies can hedge out the time-varying risk exposure to the market factor (Blitz, Huij, and Martens 2011; Blitz et al. 2013).

When we address a certain prediction task using a neural network-based approach, an effective choice of neural network architecture typically hinges on patterns or invariances in the data. For example, CNNs (LeCun et al. 1999) take into account the shift-invariant structure that commonly appears in image-like data. From this perspective, it is important to find invariant structures that are useful for processing financial time series. As candidates for such structures, there are two types of invariances known in the financial literature.
First, it is known that a phenomenon called volatility clustering is commonly observed in financial time series (Lux and Marchesi 2000), which suggests an invariance structure of a sequence with respect to its volatility (i.e., amplitude). Second, there is a hypothesis that sequences of stock prices have a certain time-scale invariance property known as the fractal structure (Peters 1994). We hypothesize that incorporating such invariances into the network architecture is effective at accelerating learning from financial time series data.

Another important problem is how to convert a given prediction of the returns into an actual trading strategy. In finance, there are several well-known trading strategies. To name a few, the momentum phenomenon (Jegadeesh and Titman 1993) suggests a strategy that bets on the current market trend, while the mean reversion (Poterba and Summers 1988) suggests another strategy that assumes that stock returns move toward the opposite of the current direction. However, as suggested by their construction, the momentum and reversal strategies are negatively correlated with each other, and it is generally unclear which strategy is effective for a particular market. On the other hand, modern portfolio theory (Markowitz 1952) provides a framework to determine a portfolio from distributional properties of asset prices (typically means and variances of returns). The resulting portfolio is unique in the sense that it has an optimal trade-off of returns and risks under some predefined conditions. From this perspective, distributional prediction of returns can be useful to construct trading strategies that can automatically adapt to the market.

• We propose a novel method to extract residual information, which we call the spectral residuals. The spectral residuals can be calculated much faster than the classical factor analysis-based method without losing the ability to hedge out exposure to the market factors.
Moreover, the spectral residuals can easily be combined with any prediction algorithm.
• We propose a new system for distributional prediction of stock prices based on deep neural networks. Our system involves two novel neural network architectures inspired by well-known invariance hypotheses on financial time series. Predicting the distributional information of returns allows us to utilize the optimal portfolio criteria offered by modern portfolio theory.
• We demonstrate the effectiveness of our proposed methods on real market data.

In the supplementary material, we also include appendices which contain detailed mathematical formulations and experimental settings, theoretical analysis, and additional experiments.
Our problem is to construct a time-dependent portfolio based on sequential observations of stock prices. Suppose that there are S stocks indexed by symbol i. The observations are given as a discrete time series of stock prices p^{(i)} = (p_1^{(i)}, p_2^{(i)}, ..., p_t^{(i)}, ...). Here, p_t^{(i)} is the price of stock i at time t. We mainly consider the return of stocks instead of their raw prices. The return of stock i at time t is defined as r_t^{(i)} = p_{t+1}^{(i)} / p_t^{(i)} - 1.

A portfolio is a (time-dependent) vector of weights over the stocks b_t = (b_t^{(1)}, ..., b_t^{(i)}, ..., b_t^{(S)}), where b_t^{(i)} is the volume of the investment in stock i at time t satisfying ∑_{i=1}^{S} |b_t^{(i)}| = 1. A portfolio b_t is understood as a particular trading strategy; that is, b_t^{(i)} > 0 implies that the investor takes a long position on stock i with amount |b_t^{(i)}| at time t, and b_t^{(i)} < 0 means a short position on the stock. Given a portfolio b_t, its overall return R_t at time t is given as R_t := ∑_{i=1}^{S} b_t^{(i)} r_t^{(i)}. Then, given the past observations of individual stock returns, our task is to determine the value of b_t that optimizes the future returns.

An important class of portfolios is the zero-investment portfolio defined as follows.

Definition 1 (Zero-Investment Portfolio). A zero-investment portfolio is a portfolio whose buying position and selling position are evenly balanced, i.e., ∑_{i=1}^{S} b_t^{(i)} = 0.

In this paper, we restrict our attention to trading strategies that output zero-investment portfolios. This assumption is sensible because a zero-investment portfolio requires no equity and thus encourages a fair comparison between different strategies.

Figure 1: Overview of the proposed system. Our system consists of three parts: (i) the extraction layer of residual factors, (ii) a neural network-based distribution predictor, and (iii) transformation to the optimal portfolio.

In practice, there can be delays between the observations of the returns and the actual execution of the trading. To account for this delay, we also adopt the delay parameter d in our experiments. When we trade with a d-day delay, the overall return should be R_t := R_t^d = ∑_{i=1}^{S} b_t^{(i)} r_{t+d}^{(i)}.

According to modern portfolio theory (Markowitz 1952), investors construct portfolios to maximize expected return under a specified level of acceptable risk. The standard deviation is commonly used to quantify the risk or variability of investment outcomes, which measures the degree to which a stock's annual return deviates from its long-term historical average (Kintzel 2007).

The Sharpe ratio (Sharpe 1994) is one of the most referenced risk/return measures in finance. It is the average return earned in excess of the risk-free rate per unit of volatility. The Sharpe ratio is calculated as (R_p - R_f)/σ_p, where R_p is the return of the portfolio, σ_p is the standard deviation of the portfolio's excess return, and R_f is the return of a risk-free asset (e.g., a government bond). For a zero-investment portfolio, we can always omit R_f since it requires no equity (Mitra 2009).

In this paper, we adopt the Sharpe ratio as the objective for our portfolio construction problem. Since we cannot always obtain an estimate of the total risk beforehand, we often consider sequential maximization of the Sharpe ratio of the next period.
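The definitions above can be illustrated with a minimal sketch (the function names and the toy numbers are ours, not from the paper):

```python
import numpy as np

def portfolio_return(weights, returns, delay=0):
    """Overall return R_t = sum_i b_t^(i) * r_{t+d}^(i) for each period t.

    weights: (T, S) array of portfolio weights b_t.
    returns: (T + delay, S) array of individual stock returns r_t.
    """
    T = weights.shape[0]
    return (weights * returns[delay:delay + T]).sum(axis=1)

def sharpe_ratio(R, risk_free=0.0):
    """(R_p - R_f) / sigma_p; R_f is omitted (zero) for zero-investment portfolios."""
    excess = R - risk_free
    return excess.mean() / excess.std()

# A zero-investment portfolio on S = 2 stocks, held for T = 3 periods,
# executed with a one-day delay (d = 1).
b = np.array([[0.5, -0.5], [0.5, -0.5], [0.5, -0.5]])
r = np.array([[0.02, -0.02], [0.01, 0.03], [0.00, 0.02], [0.01, -0.01]])
R = portfolio_return(b, r, delay=1)
```

Note that each row of `b` sums to zero, so this is a zero-investment portfolio in the sense of Definition 1.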
Once we predict the mean vector and the covariance matrix of the population of future returns, the optimal portfolio b* can be solved as b* = λ^{-1} Σ^{-1} μ, where λ is a predefined parameter representing the relative risk aversion, Σ is the estimated covariance matrix, and μ is the estimated mean vector (Kan and Zhou 2007). Therefore, predicting the mean and the covariance is essential to construct risk-averse portfolios.
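A minimal sketch of this closed form (variable names and the toy estimates are ours):

```python
import numpy as np

def optimal_portfolio(mu, Sigma, risk_aversion):
    """Mean-variance optimal weights b* = lambda^{-1} Sigma^{-1} mu."""
    return np.linalg.solve(Sigma, mu) / risk_aversion

mu = np.array([0.10, 0.05])        # estimated mean returns
Sigma = np.diag([0.04, 0.01])      # estimated covariance (diagonal here)
b_star = optimal_portfolio(mu, Sigma, risk_aversion=2.0)
```

With a diagonal covariance, each weight reduces to μ_j / (λ σ_j²), which is the per-asset rule used later in the portfolio construction step.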
In this section, we present the details of our proposed system, which is outlined in Figure 1. Our system consists of three parts. In the first part (i), the system extracts residual information to hedge out the effects of common market factors. To this end, in Section 3.1, we introduce the spectral residual, a novel method based on spectral decomposition. In the second part (ii), the system predicts future distributions of the spectral residuals using a neural network-based predictor. In the third part (iii), the predicted distributional information is leveraged for constructing optimal portfolios. We outline these procedures in Section 3.2. Additionally, we introduce a novel network architecture that incorporates well-known financial inductive biases, which we explain in Section 3.3.

(Note that b* is derived as the maximizer of b^⊤μ - (λ/2) b^⊤Σb, where b^⊤μ and b^⊤Σb are the return and the risk of the portfolio b, respectively.)

As mentioned in the introduction, we focus on developing a trading strategy based on the residual factors, i.e., the information remaining after hedging out the common market factors. Here, we introduce a novel method to extract the residual information, which we call the spectral residual.

Definition of the spectral residuals
First, we introduce some notions from portfolio theory. Let r be a random vector with zero mean and covariance Σ ∈ R^{S×S}, which represents the returns of S stocks over the given investment horizon. Since Σ is symmetric, we have a decomposition Σ = V Λ V^⊤, where V = [v_1, ..., v_S] is an orthogonal matrix and Λ = diag(λ_1, ..., λ_S) is a diagonal matrix of the eigenvalues. Then, we can create a new random vector r̂ = V^⊤ r such that the coordinate variables r̂_i = v_i^⊤ r are mutually uncorrelated. In portfolio theory, the r̂_i are called principal portfolios (Partovi and Caputo 2004). Principal portfolios have been utilized in "risk parity" approaches to diversify the exposure to the intrinsic sources of risk in the market (Meucci 2009).

Since the volatility (i.e., the standard deviation) of the i-th principal portfolio is given as √λ_i, the raw return sequence has large exposure to the principal portfolios with large eigenvalues. For example, the first principal portfolio can be seen as the factor that corresponds to the overall market (Meucci 2009). Therefore, to hedge out common market factors, a natural idea is to discard the principal portfolios with the largest eigenvalues. Formally, we define the spectral residual as follows.

Definition 2.
Let C (< S) be a given positive integer. We define the spectral residual ε̃ as the vector obtained by projecting the raw return vector r onto the space spanned by the principal portfolios with the smallest S - C eigenvalues.

In practice, we calculate the empirical version of the spectral residuals as follows. Given a time window H > 0, we define a windowed signal X_t as X_t := [r_{t-H}, ..., r_{t-1}]. We also denote by X̃_t the matrix obtained by subtracting the empirical means of the row vectors from X_t. By the singular value decomposition (SVD), X̃_t can be decomposed as

X̃_t = V_t diag(σ_1, ..., σ_S) U_t^⊤,

where the columns of V_t correspond to the principal portfolios, V_t is an S × S orthogonal matrix, U_t is an S × H matrix whose rows are mutually orthogonal unit vectors, and σ_1 ≥ ··· ≥ σ_S are the singular values. Note that V_t^⊤ X̃_t can be seen as the realized returns of the principal portfolios. Then, the (empirical) spectral residual at time s (t - H ≤ s ≤ t - 1) is computed as

ε̃_s := A_t r_s,   (2)

where A_t is the projection matrix defined as A_t := V_t diag(0, ..., 0, 1, ..., 1) V_t^⊤, with C zeros followed by S - C ones on the diagonal.

Relationship to factor models
Although we defined the spectral residuals through PCA, they are also related to the generative model (1), and thus convey information about "residual factors" in the original sense.

In the finance literature, it has been pointed out that trading strategies depending only on the residual factor ε_t in (1) can be robust to structural changes in the overall market (Blitz, Huij, and Martens 2011; Blitz et al. 2013). While estimating the parameters in the linear factor model (1) is typically done by factor analysis (FA) methods (Bartholomew, Knott, and Moustaki 2011), the spectral residual ε̃_t is not exactly the same as the residual factor obtained from the FA. Despite this, we can show the following result: the spectral residuals can hedge out the common market factors under a suitable condition.

Proposition 1.
Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ² > 0.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Then, we have the following.
(i) The spectral residual ε̃ defined in (2) is uncorrelated with the common factor f.
(ii) The covariance matrix of ε̃ is given as σ² A_res, where A_res denotes the corresponding projection matrix. Under a suitable assumption, this can be approximated by a diagonal matrix, which means the coordinate variables ε̃_i (i ∈ {1, ..., S}) of the spectral residual are almost uncorrelated. See Appendix B for a precise statement.
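The extraction in (2) and statement (i) can be checked numerically. The following simulation is our own illustration (not the paper's formal proof): we generate r = Bf + ε with dominant common factors and verify that the empirical spectral residuals are almost uncorrelated with f, while the raw returns are not.

```python
import numpy as np

def spectral_residual(X, C):
    """Empirical spectral residual (2): project each column of the (S, H)
    return window X onto the S - C smallest principal portfolios."""
    Xc = X - X.mean(axis=1, keepdims=True)           # subtract row-wise means
    V, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = np.eye(X.shape[0]) - V[:, :C] @ V[:, :C].T   # projection matrix A_t
    return A @ X

rng = np.random.default_rng(0)
S, C, H = 8, 2, 50_000                    # long window for a tight check

B = 3.0 * rng.normal(size=(S, C))         # large factor loadings
f = rng.normal(size=(C, H))               # common factors, unit variance
eps = 0.5 * rng.normal(size=(S, H))       # isotropic residual factors
r = B @ f + eps                           # linear factor model (1)

resid = spectral_residual(r, C)
cross_cov = resid @ f.T / H               # ~0 by Proposition 1-(i)
```

Here `np.eye(S) - V[:, :C] @ V[:, :C].T` equals V diag(0, ..., 0, 1, ..., 1) V^⊤, i.e., the projection A_t of (2).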
In the above proposition, the first statement (i) claims that the spectral residual can eliminate the common factors without knowing the exact residual factors. The latter statement (ii) justifies the diagonal approximation of the predicted covariance, which will be utilized in the next subsection. For completeness, we provide the proof in Appendix B. Moreover, the assumption that ε is isotropic can be relaxed in the following sense. If we assume that the residual factor ε is "almost isotropic" and the common factors Bf have larger volatility contributions than ε, we can show that the linear transformation used in the spectral residual is close to the projection matrix eliminating the market factors. Since the formal statement is somewhat involved, we give the details in Appendix B.

Besides, the spectral residual can be computed significantly faster than the FA-based methods. This is because the FA typically requires iterative executions of the SVD to solve a non-convex optimization problem, while the spectral residual requires only a single SVD. Section 4.2 gives an experimental comparison of running times.

Our next goal is to construct a portfolio based on the extracted information. To this end, we introduce a method to forecast future distributions of the spectral residuals, and explain how we can convert the distributional features into executable portfolios.
Distributional prediction
Given a sequence of past realizations of residual factors ε̃_{i,t-H}, ..., ε̃_{i,t-1}, consider the problem of predicting the distribution of a future observation ε̃_{i,t}. Our approach is to learn a functional predictor for the conditional distribution p(ε̃_{i,t} | ε̃_{i,t-H}, ..., ε̃_{i,t-1}). Since our final goal is to construct the portfolio, we only use predicted means and covariances, and we do not need full information about the conditional distribution. Despite this, fitting symmetric models such as Gaussian distributions can be problematic because the distributions of returns are known to be often skewed (Cont 2000; Lin and Liu 2018). To circumvent this, we utilize quantile regression (Koenker 2005), an off-the-shelf nonparametric method to estimate conditional quantiles. Intuitively, if we obtain a sufficiently large number of quantiles of the target variable, we can reconstruct any distributional property of that variable. We train a function ψ that predicts several conditional quantile values, and convert its output into estimators of conditional means and variances. The overall procedure can be made differentiable, so we can incorporate it into modern deep learning frameworks.

Here, we provide the details of the aforementioned procedure. First, we give an overview of the quantile regression objective. Let Y be a scalar-valued random variable, and let X be another random variable. For α ∈ (0, 1), the α-th conditional quantile of Y given X = x is defined as

y(x; α) := inf{ y' : P(Y ≤ y' | X = x) ≥ α }.

It is known that y(x; α) can be found by solving the following minimization problem:

y(x; α) = argmin_{y'∈R} E[ℓ_α(Y, y') | X = x],

where ℓ_α(y, y') is the pinball loss defined as

ℓ_α(y, y') := max{ (α - 1)(y - y'), α(y - y') }.
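A minimal sketch of the pinball loss, together with the conversion of a vector of predicted quantiles into mean and variance estimates described later in this section (function names are ours):

```python
import numpy as np

def pinball_loss(y, y_pred, alpha):
    """l_alpha(y, y') = max{(alpha - 1)(y - y'), alpha * (y - y')}."""
    diff = y - y_pred
    return np.maximum((alpha - 1.0) * diff, alpha * diff)

def moments_from_quantiles(y_tilde):
    """Mean and sample variance of the Q - 1 predicted quantiles y_tilde."""
    mu = y_tilde.mean()
    var = ((y_tilde - mu) ** 2).sum() / (len(y_tilde) - 1)
    return mu, var

# Example: a (hypothetical) quantile grid produced by some predictor psi(x).
y_tilde = np.array([-0.012, -0.003, 0.001, 0.004, 0.011])
mu_hat, var_hat = moments_from_quantiles(y_tilde)
```

For α = 0.9, under-prediction (y above y') is penalized with weight 0.9 and over-prediction with weight 0.1, which is what pushes the minimizer toward the 0.9-quantile.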
For our situation, the target variable is y_t = ε̃_{i,t} and the explanatory variable is x_t = (ε̃_{i,t-H}, ..., ε̃_{i,t-1})^⊤. We want to construct a function ψ : R^H → R that estimates the conditional α-quantile of y_t. To this end, quantile regression solves the following minimization problem:

min_ψ Ê_{y_t, x_t}[ℓ_α(y_t, ψ(x_t))].

Here, Ê_{y_t, x_t} is understood as taking the empirical expectation with respect to y_t and x_t across t. We note that a similar application of quantile regression to forecasting conditional quantiles of time series has been considered in (Biau and Patra 2011).

Next, let Q > 0 be a given integer, and let α_j = j/Q (j = 1, ..., Q - 1) be an equispaced grid of quantiles. We consider the problem of simultaneously estimating the α_j-quantiles by a function ψ : R^H → R^{Q-1}. To do this, we define a loss function as

L_Q(y_t, ψ(x_t)) := ∑_{j=1}^{Q-1} ℓ_{α_j}(y_t, ψ_j(x_t)),

where ψ_j(x_t) is the j-th coordinate of ψ(x_t).

Once we obtain the estimated Q - 1 quantiles ỹ_t^{(j)} = ψ_j(x_t) (j = 1, ..., Q - 1), we can estimate the future mean of the target variable y_t as

μ̂_t := μ̂(ỹ_t) = (1/(Q - 1)) ∑_{j=1}^{Q-1} ỹ_t^{(j)}.   (3)

Similarly, we can estimate the future variance by the sample variance of the ỹ_t^{(j)},

σ̂_t² := σ̂²(ỹ_t) = (1/(Q - 2)) ∑_{j=1}^{Q-1} (ỹ_t^{(j)} - μ̂_t)²,   (4)

or a robust counterpart such as the median absolute deviation (MAD).

Portfolio construction
Given the estimated means and variances of the future spectral residuals, we finally construct a portfolio based on the optimality criteria offered by modern portfolio theory (Markowitz 1952). As mentioned in Section 2.2, the formula for the optimal portfolio requires the means and the covariances of the returns. Thanks to Proposition 1-(ii), we can approximate the covariance matrix of the spectral residual by a diagonal matrix. Precisely, once we calculate the predicted mean μ̂_{t,j} and the variance σ̂²_{t,j} of the spectral residual at time t, the weight for the j-th asset is given as b̂_j := λ^{-1} μ̂_{t,j} / σ̂²_{t,j}.

In the experiments in Section 4, we compare the performances of zero-investment portfolios. For trading strategies that do not output zero-investment portfolios, we apply a common transformation so that portfolios are centered and normalized. As a result, the eventual portfolio does not depend on the risk aversion parameter λ. See Appendix A.1 for details.

For the model of the quantile predictor ψ, we introduce two neural network architectures that take into account scale invariances studied in finance.

Volatility invariance
First, we consider an invariance property on the amplitudes of financial time series. It is known that financial time series exhibit a property called volatility clustering (Mandelbrot 1997). Roughly speaking, volatility clustering describes the phenomenon that large changes in a financial time series tend to be followed by large changes, while small changes tend to be followed by small changes. As a result, if we observe a certain signal as a financial time series, the signal obtained by positive scalar multiplication can be regarded as another plausible realization of a financial time series.

To incorporate such an amplitude invariance property into the model architecture, we leverage the class of positive homogeneous functions. Here, a function f : R^n → R^m is said to be positive homogeneous if f(ax) = a f(x) holds for any x ∈ R^n and a > 0. For example, any linear function and any ReLU neural network with no bias terms are positive homogeneous. More generally, we can model the class of positive homogeneous functions as follows. Let ψ̃ : S^{H-1} → R^{Q-1} be any function defined on the (H-1)-dimensional sphere S^{H-1} = {x ∈ R^H : ‖x‖ = 1}. Then, we obtain a positive homogeneous function as

ψ(x) = ‖x‖ ψ̃(x / ‖x‖).   (5)

Thus, we can convert any function class on the sphere into a model of amplitude-invariant predictors.

Time-scale invariance
Second, we consider an invariance property with respect to time-scale. There is a well-known hypothesis that time series of stock prices have fractal structures (Peters 1994). The fractal structure refers to a self-similarity property of a sequence; that is, if we observe a single sequence at several different sampling rates, we cannot infer the underlying sampling rates from the shapes of the downsampled sequences. The fractal structure has been observed in several real markets (Cao, Cao, and Xu 2013; Mensi et al. 2018; Lee et al. 2018). See Remark 1 in Appendix A.2 for further discussion of this property.

To take advantage of the fractal structure, we propose a novel network architecture that we call fractal networks. The key idea is that we can effectively exploit the self-similarity by applying a single common operation to multiple subsequences with different resolutions. By doing so, we expect that we can increase sample efficiency and reduce the number of parameters to train.

Here, we give a brief overview of the proposed architecture; a more detailed explanation is given in Appendix A.2. Our model consists of (a) a resampling mechanism and (b) two neural networks ψ₁ and ψ₂. The input-output relation of our model is described by the following procedure. First, given a single sequence x of stock returns, the resampling mechanism Resample(x, τ) generates a sequence that corresponds to the sampling rate specified by a scale parameter 0 < τ ≤ 1. We apply the Resample procedure for L different parameters τ₁ < ... < τ_L = 1 and generate L sequences. Next, we apply a common non-linear transformation ψ₁ modeled by a neural network. Finally, taking the empirical mean of these sequences aggregates the information at different sampling rates, to which we apply another network ψ₂. To sum up, the overall procedure can be written as the single equation

ψ(x) = ψ₂( (1/L) ∑_{i=1}^{L} ψ₁(Resample(x, τ_i)) ).   (6)
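The pipeline of (5) and (6) can be sketched as follows. The linear-interpolation `resample` and the toy stand-ins for ψ₁ and ψ₂ are our illustrative assumptions (the paper's exact Resample mechanism is defined in its Appendix A.2), not the authors' implementation:

```python
import numpy as np

def homogeneous(psi_tilde):
    """Amplitude invariance (5): psi(x) = ||x|| * psi_tilde(x / ||x||)."""
    def psi(x):
        norm = np.linalg.norm(x)
        return norm * psi_tilde(x / norm) if norm > 0 else 0.0 * psi_tilde(x)
    return psi

def resample(x, tau):
    """Illustrative Resample(x, tau): linearly interpolate the most recent
    fraction tau of the length-H window back onto a length-H grid."""
    H = len(x)
    grid = np.linspace((1.0 - tau) * (H - 1), H - 1, H)
    return np.interp(grid, np.arange(H), x)

def fractal_forward(x, psi1, psi2, taus):
    """Fractal network (6): psi(x) = psi2((1/L) * sum_i psi1(Resample(x, tau_i)))."""
    hidden = sum(psi1(resample(x, tau)) for tau in taus) / len(taus)
    return psi2(hidden)

# Toy stand-ins for the two networks; psi1 is made positive homogeneous via (5).
psi1 = homogeneous(lambda u: np.tanh(u))   # elementwise map on the unit sphere
psi2 = lambda h: h.mean()

x = np.sin(np.linspace(0.0, 6.0, 64))      # a synthetic return window
out = fractal_forward(x, psi1, psi2, taus=[0.25, 0.5, 1.0])
```

By construction, `psi1` satisfies ψ₁(ax) = aψ₁(x) for a > 0, and `resample(x, 1.0)` returns the original window.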
We conducted a series of experiments to demonstrate the effectiveness of our methods on real market data. In Section 4.1, we describe the details of the dataset and some common experimental settings used throughout this section. In Section 4.2, we test the validity of the spectral residual by a preliminary experiment. In Section 4.3, we evaluate the performance of our proposed system by experiments on U.S. market data. We also conducted similar experiments on Japanese market data and obtained consistent results. Due to space limitations, we provide the full results for the Japanese market data in Appendix E.
U.S. market data
For U.S. market data, we used the daily prices of stocks listed in the S&P 500 from January 2000 to April 2020. We used data before January 2008 for training and validation and the remainder for testing. We obtained the data from Alpha Vantage.

We used opening prices for the following reasons. First, the trading volume at the opening session is larger than that at the closing session (Amihud and Mendelson 1987), which means that trading at opening prices is practically easier than trading at closing prices. Moreover, a financial institution cannot trade a large amount of stocks during the closing period because doing so can be considered an illegal action known as "banging the close".
We adopted the delay parameter d = 1 (i.e., one-day delay) for updating portfolios. We set the look-back window size to H = 256, i.e., all prediction models can access the historical stock prices of up to H = 256 preceding business days. For the other parameters used in the experiments, see Appendix C.3.
We list the evaluation metrics used throughout our experiments.
• Cumulative Wealth (CW) is the total return yielded by the trading strategy: CW_T := ∏_{t=1}^{T} (1 + R_t).
Figure 2: Cumulative returns of reversal strategies over the raw returns, the FA residuals, and the spectral residuals. The reversal-based strategies are more robust against financial crises.
• Annualized Return (AR) is an annualized return rate defined as AR_T := (T_Y / T) ∑_{t=1}^{T} R_t, where T_Y is the average number of holding periods in a year.
• Annualized Volatility (AVOL) is the annualized risk defined as AVOL_T := ((T_Y / T) ∑_{t=1}^{T} R_t²)^{1/2}.
• Annualized Sharpe Ratio (ASR) is an annualized risk-adjusted return (Sharpe 1994), defined as ASR_T := AR_T / AVOL_T.

As mentioned in Section 2.2, we are mainly interested in ASR as the primary evaluation metric. AR and AVOL are auxiliary metrics for calculating ASR. While CW represents the actual profits, it often ignores the existence of large risk values. In addition to these, we also calculated some evaluation criteria commonly used in finance: Maximum DrawDown (MDD), Calmar Ratio (CR), and Downside Deviation Ratio (DDR). For completeness, we provide precise definitions in Appendix C.1.

As suggested in Section 3.1, the spectral residuals can be useful to hedge out the undesirable exposure to the market factors. To verify this, we compared the performances of trading strategies over (i) the raw returns, (ii) the residual factors extracted by factor analysis (FA), and (iii) the spectral residuals. For the FA, we fit the factor model (1) with K = 30 by the maximum likelihood method (Bartholomew, Knott, and Moustaki 2011) and extracted the residual factors as the remaining part. For the spectral residual, we obtained the residual sequence by subtracting C = 30 principal components from the raw returns. We applied both methods to windowed data of length H = 256.

In order to be agnostic to the choice of training algorithms, we used a simple reversal strategy. Precisely, for the raw return sequence r_t, we used a deterministic strategy obtained simply by normalizing the negation of the previous observation -r_{t-1} to be a zero-investment portfolio (see Appendix C.2 for the precise formula). We defined reversal strategies over the residual sequences in similar ways.

Figure 2 shows the cumulative returns of the reversal strategies performed on the Japanese market data. We see that the
reversal strategy based on the raw returns is significantly affected by several well-known financial crises, including the dot-com bubble crash in the early 2000s, the 2008 subprime mortgage crisis, and the 2020 stock market crash. On the other hand, the two residual-based strategies appear more robust against these financial crises. The spectral residual performed similarly to the FA residual in cumulative returns. Moreover, in terms of the Sharpe ratio, the spectral residuals performed better than the FA residuals.

Remarkably, the spectral residuals were calculated much faster than the FA residuals. In particular, we calculated both residuals using the entire dataset, which contains the records of all the stock prices. For the PCA and the FA, we used the implementations in the scikit-learn package (Pedregosa et al. 2011), and all the computations were run on 18 CPU cores of an Intel Xeon Gold 6254 processor (3.1 GHz). Extracting the spectral residuals took approximately 10 minutes, while the FA took approximately 13 hours.

Figure 3: The Cumulative Wealth in the U.S. market.
Figure 4: The Cumulative Wealth in the U.S. market without spectral residual extraction.

We evaluated the performance of our proposed system described in Section 3 on U.S. market data. The corresponding results for Japanese market data are provided in Appendix E.
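For reference, the evaluation metrics of Section 4.1 can be sketched as follows (function names are ours; the root-mean form of AVOL reflects our reading of the definition above, and T_Y defaults to 252 business days per year as an illustrative assumption):

```python
import numpy as np

def cumulative_wealth(R):
    """CW_T = prod_t (1 + R_t)."""
    return np.prod(1.0 + R)

def annualized_return(R, T_Y=252):
    """AR_T = (T_Y / T) * sum_t R_t."""
    return T_Y / len(R) * R.sum()

def annualized_volatility(R, T_Y=252):
    """AVOL_T = ((T_Y / T) * sum_t R_t^2)^(1/2)."""
    return np.sqrt(T_Y / len(R) * (R ** 2).sum())

def annualized_sharpe_ratio(R, T_Y=252):
    """ASR_T = AR_T / AVOL_T."""
    return annualized_return(R, T_Y) / annualized_volatility(R, T_Y)
```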
Baseline Methods
We compare our system with the following baselines: (i)
Market is the uniform buy-and-hold strategy. (ii)
AR(1) is the
AR(1) model with all coefficients being −1; this can be seen as the simple reversal strategy. (iii) Linear predicts returns by ordinary linear regression based on the previous H raw returns. (iv) MLP predicts returns by a multi-layer perceptron with batch normalization and dropout (Pal and Mitra 1992; Ioffe and Szegedy 2015; Srivastava et al. 2014). (v)
SFM is one of the state-of-the-art stock price prediction algorithms, based on the State Frequency Memory RNNs (Zhang, Aggarwal, and Qi 2017). Additionally, we compare our proposed system (
DPO) with some ablation models, which are similar to DPO except for the following points.
• DPO with No Quantile Prediction (DPO-NQ) does not use the information of the full distributional prediction; instead, it outputs conditional means trained by the ℓ2 loss.
• DPO with No Fractal Network (DPO-NF) uses a simple multi-layer perceptron instead of the fractal network.
• DPO with No Volatility Normalization (DPO-NV) does not use the normalization (5) in the fractal network.

Table 1: Performance comparison on U.S. market. All the methods except for Market are applied to the spectral residuals (SRes).

                  ASR↑    AR↑     AVOL↓   DDR↑    CR↑     MDD↓
Market            +0.607  +0.130  —       —       —       —
AR(1) on SRes     +0.858  +0.021  0.025   +1.470  +0.295  0.072
Linear on SRes    +0.724  +0.017  0.024   +1.262  +0.298  0.059
MLP on SRes       +0.728  +0.022  0.030   +1.280  +0.283  0.077
SFM on SRes       +0.709  +0.019  0.026   +1.211  +0.323  0.058
DPO-NQ            +1.237  +0.032  0.026   +2.169  +0.499  0.063
DPO-NF            +1.284  +0.027  —       +2.347  +0.627  —
DPO-NV            +1.154  +0.030  0.026   +2.105  +0.562  0.053
DPO (Proposed)    +1.393  +0.030  —       —       —       —
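For context on the DPO-NQ ablation: the full model predicts conditional quantiles, for which the standard training objective in quantile regression is the pinball loss (Koenker). The paper's exact objective is not restated in this section, so the following is only a generic sketch; the function name and array layout are our own:

```python
import numpy as np

def pinball_loss(y, q_pred, alphas):
    """Average pinball (quantile) loss over n samples and Q quantile levels.

    y       : (n,) realized returns
    q_pred  : (n, Q) predicted quantiles, one column per level
    alphas  : (Q,) target quantile levels in (0, 1)
    """
    u = y[:, None] - q_pred                           # forecast error per level
    return float(np.mean(np.maximum(alphas * u, (alphas - 1.0) * u)))
```

Minimizing this loss for levels spread over (0, 1) yields a discretized predictive distribution, from which mean and variance estimates can then be derived; DPO-NQ instead fits only the conditional mean.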
Performance on real market data
Figure 3 shows the cumulative wealth (CW) achieved in the U.S. market. Table 1 shows the results for the other evaluation metrics presented in Section 4.1. For the parameter C of the spectral residual, we used C = 10, which we determined solely from the training data (see Appendix D.2 for details). Overall, our proposed method DPO outperformed the baseline methods in multiple evaluation metrics. Regarding the comparison against the three ablation models, we make the following observations.
1.
Effect of the distributional prediction. We found that introducing the distributional prediction significantly improved the ASR. While
DPO-NQ achieved the best CW,
DPO performed better in the ASR. This suggests that, without the variance prediction,
DPO-NQ tends to pursue returns without regard to the risks taken. Generally, we observed that
DPO reduced the AVOL while not losing the AR.
2. Effect of the fractal network. Introducing the fractal network architecture also improved the performance in multiple evaluation metrics. In both markets, we observed that the fractal network contributed to increasing the AR while keeping the AVOL, which is suggestive of the effectiveness of leveraging the financial inductive bias on the return sequence.

Table 2: Performance comparison on U.S. market without spectral residual extraction.

          ASR↑    AR↑     AVOL↓   DDR↑    CR↑     MDD↓
AR(1)     +0.212  +0.011  0.051   +0.355  +0.067  0.160
Linear    +0.304  +0.016  0.052   +0.485  +0.127  0.125
MLP       +0.261  +0.013  0.048   +0.424  +0.103  0.122
SFM       +0.264  +0.014  0.051   +0.428  +0.079  0.171
DPO-NQ    +0.405  +0.020  0.048   +0.655  +0.172  0.114
DPO-NF    +0.854  +0.034  —       —       —       —
DPO-NV    +0.542  +0.029  0.054   +0.922  +0.238  0.123
DPO       +0.874  +0.032  —       —       —       —

3.
Effect of the normalization. We also saw the effectiveness of the normalization (5). Comparing
DPO and
DPO-NV, the normalization affected both the AR and the AVOL, resulting in an improvement in the ASR. This may occur because the normalization improves the sample efficiency by reducing the degrees of freedom of the model.

To see the effect of the spectral residuals, we also evaluated our proposed method and the baseline methods on the raw stock returns. Figure 4 and Table 2 show the results. Compared to the corresponding results with the spectral residuals, we found that the spectral residual extraction consistently improved the performance for every method. Some further intriguing observations are summarized as follows.
1. With the spectral residuals,
AR(1) achieved the best ASR among the baseline methods (Table 1), which was not observed on the raw return sequence (Table 2). This suggests that the spectral residuals encourage the reversal phenomenon (Poterba and Summers 1988) by suppressing the common market factors. Interestingly, without extracting the spectral residuals, the CWs cross repeatedly during the test period, and no single baseline method consistently beats the others (Figure 4). A possible reason is that the strong correlation between the raw stock returns increases the exposure to the common market risks.
2. We found that our network architectures are still effective on the raw sequences. In particular,
DPO outperformed all the other methods in multiple evaluation metrics.
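The factor-hedging property discussed in item 1 (and analyzed formally in Appendix B) is easy to check numerically on synthetic data. This is a hedged sketch under a toy linear factor model; the sizes, seed, and variable names are arbitrary illustrations:

```python
import numpy as np

# Toy linear factor model r = B f + eps with isotropic residual factors.
rng = np.random.default_rng(0)
S, C, n = 30, 3, 100_000
B = rng.normal(size=(S, C))
f = rng.normal(size=(n, C))             # common factors, Var(f_i) = 1
eps = 0.1 * rng.normal(size=(n, S))     # residual factors, isotropic
r = f @ B.T + eps

# Spectral residual: project onto the eigenvectors of the covariance
# matrix corresponding to the S - C smallest eigenvalues.
Sigma = np.cov(r, rowvar=False)
_, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
A_res = V[:, :S - C] @ V[:, :S - C].T
eps_tilde = r @ A_res                   # A_res is symmetric

# Cross-covariance between spectral residuals and factors: near zero.
cross_cov = eps_tilde.T @ f / n
print(np.abs(cross_cov).max())          # small, shrinking as n grows
```

Running the same cross-covariance on the raw returns r instead of eps_tilde gives entries on the order of the factor loadings, which is exactly the market exposure the extraction removes.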
Trading based on factor models is one of the popular strategies for quantitative portfolio management (e.g., (Nakagawa, Uchida, and Aoshima 2018)). One of the best-known factor models is that of Fama and French (Fama and French 1992, 1993), who put forward a model explaining returns in the US equity market with three factors: the market return factor, the size (market capitalization) factor, and the value (book-to-market) factor. Historically, residual factors were treated as errors in factor models (Sharpe 1964). However, (Blitz, Huij, and Martens 2011; Blitz et al. 2013) suggested that there exists predictability in residual factors. In modern portfolio theory, lower correlation among investment returns enables earning larger risk-adjusted returns (Markowitz 1952), and residual factors are by nature less correlated than the raw stock returns. Consequently, (Blitz, Huij, and Martens 2011; Blitz et al. 2013) demonstrated that residual factors enable earning larger risk-adjusted returns.
With the recent advances in deep learning, various deep neural networks have been applied to stock price prediction (Chen et al. 2019). Some deep neural networks for time series have also been applied to stock price prediction (Fischer and Krauss 2018). Compared to other classical machine learning methods, deep learning enables learning with fewer a priori representational assumptions, provided that sufficient amounts of data and computational resources are available. Even when data is insufficient, introducing inductive biases into a network architecture can still facilitate deep learning (Battaglia et al. 2018). Technical indicators are often used for stock prediction (e.g., (Metghalchi, Marcucci, and Chang 2012; Neely et al. 2014)), and (Li et al. 2019) used technical indicators as inductive biases of a neural network. (Zhang, Aggarwal, and Qi 2017) used a recurrent model that can analyze frequency domains so as to distinguish trading patterns of various frequencies.
We proposed a system for constructing portfolios. The key technical ingredients are (i) a spectral decomposition-based method to hedge out common market factors and (ii) a distributional prediction method based on a novel neural network architecture incorporating financial inductive biases. Through empirical evaluations on real market data, we demonstrated that our proposed method can significantly improve the performance of portfolios on multiple evaluation metrics. Moreover, we verified that each of our proposed techniques is effective on its own. We believe that our techniques may have wide applications in various financial problems.
Acknowledgment
We thank the anonymous reviewers for their constructive suggestions and comments. We also thank Masaya Abe, Shuhei Noma, Prabhat Nagarajan, and Takuya Shimada for helpful discussions.
References
Amihud, Y.; and Mendelson, H. 1987. Trading mechanisms and stock returns: An empirical investigation.
The Journal of Finance
Bartholomew, D. J.; Knott, M.; and Moustaki, I. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, 3rd edition. Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Biau, G.; and Patra, B. 2011. Sequential Quantile Prediction of Time Series.
IEEE Transactions on Information Theory
Journal of Financial Markets
Journal of Empirical Finance
Physica A: Statistical Mechanics and its Applications
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2376–2384. Choudhry, R.; and Garg, K. 2008. A hybrid machine learning system for stock market forecasting.
World Academy of Science, Engineering and Technology
Quantitative Finance
1: 223–236. Davis, C.; and Kahan, W. M. 1970. The Rotation of Eigenvectors by a Perturbation. III.
SIAM Journal on Numerical Analysis
NAACL-HLT. Fama, E. F.; and French, K. R. 1992. The Cross-Section of Expected Stock Returns.
The Journal of Finance
Journal of Financial Economics. Fama, E. F.; and French, K. R. 2015. A five-factor asset pricing model.
Journal of Financial Economics
European Journal of Operational Research, 6645–6649. ISSN 2379-190X. Grossman, S. J.; and Zhou, Z. 1993. Optimal investment strategies for controlling drawdowns.
Mathematical finance
Neural Computation. arXiv preprint arXiv:1502.03167. Jegadeesh, N.; and Titman, S. 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.
The Journal of Finance
Journal of Financial and Quantitative Analysis
International Conference for Learning Representations (ICLR). Kintzel, D. 2007. Portfolio theory, life-cycle investing, and retirement income.
Social Security Administration Policy Brief
Quantile Regression. Econometric Society Monographs. Cambridge University Press. doi:10.1017/CBO9780511754098. Krauss, C.; Do, X. A.; and Huck, N. 2017. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500.
European Journal of Operational Research
The European Physical Journal B
Advances in Neural Information Processing Systems 25, 1097–1105. Curran Associates, Inc. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. LeCun, Y.; Haffner, P.; Bottou, L.; and Bengio, Y. 1999. Object Recognition with Gradient-Based Learning. In Shape, Contour and Grouping in Computer Vision, 319–345. Springer Berlin Heidelberg. Lee, M.; Song, J.; Kim, S.; and Chang, W. 2018. Asymmetric market efficiency using the index-based asymmetric-MFDFA.
Physica A: Statistical Mechanics and its Applications
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 894–902. Lin, T.-C.; and Liu, X. 2018. Skewness, individual investor preference, and the cross-section of stock returns.
Review of Finance
International Journal of Theoretical and Applied Finance
The Journal of Finance
Fractals and scaling in finance, 371–418. Springer. Mandelbrot, B. B.; and Ness, J. W. V. 1968. Fractional Brownian Motions, Fractional Noises and Applications.
SIAM Review
The Journal of Finance
Physica A: Statistical Mechanics and its Applications
Applied Economics
Risk
International business research
ECML PKDD 2018 Workshops, 37–50. Springer. Neely, C. J.; Rapach, D. E.; Tu, J.; and Zhou, G. 2014. Forecasting the equity risk premium: the role of technical indicators.
Management science
IEEE Transactions on Neural Networks
Economics Bulletin
International Journal of Pure andApplied Mathematics
Journal of MachineLearning Research
12: 2825–2830. Peters, E. E. 1994.
Fractal market analysis: applying chaos theory to investment and economics, volume 24. John Wiley & Sons. Poterba, J. M.; and Summers, L. H. 1988. Mean reversion in stock prices: Evidence and Implications.
Journal of Financial Economics
Foundations of Machine Learning— Spring
The journal of finance
Journal of Portfolio Management
The Journal of Investing
Journal of MachineLearning Research
TheJournal of Alternative Investments
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1900–1908. Young, T. W. 1991. Calmar ratio: A smoother tool.
Futures
Biometrika
Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2141–2149.

Appendix
This is the supplementary material for the paper entitled“Deep Portfolio Optimization via Distributional Predictionof Residual Factors”.
A Technical details of the proposed method
In this section, we provide some technical details for ourproposed method in Section 3.
A.1 Detailed calculation of portfolio
Let µ_t be the expected return vector at time t, and Σ_t be the covariance matrix of returns. As we explained in Section 2.2, the (ideal) optimal portfolio can be written as b*_t = λ^{-1} Σ_t^{-1} µ_t, where λ > 0 is a predefined risk aversion parameter. Therefore, having estimators µ̂_t and Σ̂_t for these population quantities, a natural construction of the portfolio is the following plug-in rule:

  b̂_t := (1/λ) Σ̂_t^{-1} µ̂_t.

In our proposed method, we construct a portfolio over the spectral residual ε̃_t. For the estimators of the mean and the covariance, we use the quantile regression-based estimators explained in the previous subsection. In particular, we approximate the covariance matrix by a diagonal matrix diag(σ̂²_{t,1}, ..., σ̂²_{t,S}). See also Appendix B.2 for empirical and theoretical justifications of this diagonal approximation. As a result, the weight for the j-th residual factor is given as

  b^res_{t,j} := µ̂_{t,j} / (λ σ̂²_{t,j}),

where µ̂_{t,j} and σ̂²_{t,j} are the predicted means and variances given as (3) and (4), respectively.

We now have a portfolio b^res_t defined for the spectral residuals. Recall that the spectral residual is obtained as a linear transformation of the (centered) raw returns as ε̃_t = A_t r_t (see Section 3.1). To obtain a portfolio for the raw return r_t, we use the relation b_t = A_t^T b^res_t.

In our experiment, we apply a common transformation to the output of any trading strategy so that it becomes a zero-investment portfolio. Here, we explain the detail of this transformation. Let b_t be a given portfolio. Assume that b_t is not proportional to the all-one vector 1 = (1, ..., 1)^T. This also implies that b_t is not proportional to the uniform buy-and-hold strategy. To convert b_t to a zero-investment portfolio, we subtract the average b̄_t = (1/S) Σ_{j=1}^S b_{t,j} from every coordinate, and then renormalize the portfolio so that the sum of the absolute values is unity.
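A minimal numpy sketch of the plug-in rule and the zero-investment normalization described in this subsection (the function name and the diagonal-variance convention are our own; the risk aversion λ is omitted since it cancels after normalization):

```python
import numpy as np

def portfolio_weights(mu_hat, var_hat, A):
    """Plug-in portfolio on the spectral residuals, mapped back to raw returns.

    mu_hat  : (S,) predicted means of the residual factors
    var_hat : (S,) predicted variances (diagonal covariance approximation)
    A       : (S, S) spectral-residual projection, eps_tilde = A r
    """
    b_res = mu_hat / var_hat        # diagonal Sigma^{-1} mu; lambda dropped
    b = A.T @ b_res                 # back to raw returns: b = A^T b_res
    b = b - b.mean()                # zero-investment: coordinates sum to zero
    return b / np.abs(b).sum()      # absolute weights sum to one
```

The final two lines implement the normalization b ↦ (b − b̄1)/‖b − b̄1‖₁, which removes the dependence on λ (assuming b is not proportional to the all-one vector).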
The resulting portfolio is given as

  (b_t − b̄_t 1) / ‖b_t − b̄_t 1‖_1.

Note that the normalized portfolio defined in this way does not depend on the parameter λ.

Figure 5: Illustration of the fractal network.

A.2 Fractal networks
Here, we explain the detailed structure of the fractal network introduced in Section 3.3. Figure 5 illustrates the structure of the fractal network. Let x denote the input of the network. First, the fractal network applies the resampling mechanism Resample(x, τ_i) for several scale parameters τ_1 > τ_2 > ··· > τ_L > 0 to generate multiple views of x with different sampling rates. To be precise, the resampling mechanism outputs a sequence by the following four procedures:
(i) Cumulation. First, given an input x = (x_1, ..., x_H), we calculate the cumulative sum z = (x_1, x_1 + x_2, ..., x_1 + ··· + x_H). Since the input x corresponds to the residual factors of the returns, each coordinate variable x_s is understood as the increment, i.e., the difference of stock prices at two adjacent time periods. Hence, we use the cumulative sum so that it corresponds to (logarithmic) stock prices and exhibits the fractal structure.
(ii) Resampling. Second, given 0 < τ ≤ 1 and z, we generate a shorter sequence z_τ of a fixed length H′ (< H) by downsampling from (z_{⌊(1−τ)H⌋}, ..., z_H). In other words, the resulting sequence is a subsequence of length ⌈τH⌉ located at the end of the original sequence.
(iii) Differentiation. Third, we again take the first difference of the downsampled sequence z_τ to obtain the corresponding return sequence.
(iv) Rescaling. Finally, we normalize the output by multiplying the entire sequence by τ^{−1/2}. On the choice of the multiplicative factor, see also Remark 1 below.
Next, we apply a common non-linear transformation ψ_1 to these views. Finally, we take the average of the results and apply another transformation ψ_2. The overall procedure is summarized in the following equation:

  ψ(x) = ψ_2( (1/L) Σ_{i=1}^L ψ_1(Resample(x, τ_i)) ).   (7)

To incorporate the volatility invariance (Section 3.3) into the fractal network, we want to make the overall transformation x ↦ ψ(x) positively homogeneous. To do this, it suffices to ensure that the non-linear transformations ψ_1 and ψ_2 are positively homogeneous. This can be proved by combining the following claims.

Lemma 1.
Let f_1, f_2, ..., f_L be any positively homogeneous functions.
(i) Any linear transformation is positively homogeneous.
(ii) Suppose the composition f_1 ∘ f_2 can be defined. Then, f_1 ∘ f_2 is positively homogeneous.
(iii) The concatenation x ↦ (f_1(x)^T, ..., f_L(x)^T)^T is positively homogeneous.
(iv) The average x ↦ (1/L) Σ_{i=1}^L f_i(x) is positively homogeneous.
(v) The resampling mechanism x ↦ Resample(x, τ) is positively homogeneous for any 0 < τ ≤ 1.

Proof. Since (i), (ii), and (iii) are almost obvious, we omit their proofs. (iv) is derived from (i), (ii), and (iii). As for (v), it is easy to see that the four operations in the resampling mechanism (i.e., cumulation, resampling, differentiation, and rescaling) are all linear, so the resampling mechanism is a linear transformation and thus positively homogeneous.
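As a concrete illustration of Lemma 1 and Equation (7), the sketch below implements the resampling mechanism and a bias-free ReLU version of ψ_1 and ψ_2, then checks that the overall map is positively homogeneous. The layer sizes, the evenly spaced downsampling rule, and the scale set are our own assumptions; the paper's actual networks also use dropout and batch normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
H, H_OUT = 256, 64

def resample(x, tau):
    """Resample(x, tau): cumulate, keep the last ~tau*H prices, downsample,
    differentiate, and rescale by tau^{-1/2}. All four steps are linear."""
    z = np.cumsum(x)                                  # (i) cumulation
    tail = z[int(np.floor((1.0 - tau) * len(x))):]    # (ii) resampling ...
    idx = np.round(np.linspace(0, len(tail) - 1, H_OUT + 1)).astype(int)
    d = np.diff(tail[idx])                            # (iii) differentiation
    return d * tau ** -0.5                            # (iv) rescaling

# Bias-free ReLU networks are positively homogeneous (Lemma 1 (i)-(ii)).
W1 = rng.normal(size=(32, H_OUT))
W2 = rng.normal(size=(8, 32))

def psi1(v):
    return np.maximum(W1 @ v, 0.0)

def psi2(v):
    return W2 @ np.maximum(v, 0.0)

def psi(x, taus=(1.0, 0.5, 0.25)):
    """Overall fractal network, Equation (7): average per-scale features."""
    return psi2(np.mean([psi1(resample(x, t)) for t in taus], axis=0))

x = rng.normal(size=H)
for a in (0.5, 2.0, 7.3):
    assert np.allclose(psi(a * x), a * psi(x))  # psi(a x) = a psi(x), a > 0
```

Because every component is positively homogeneous, scaling the input volatility by any a > 0 scales the output by exactly a, which is the volatility invariance the architecture is designed to encode.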
Remark 1.
In the rescaling phase, the choice of the multiplicative factor τ^{−1/2} is sensible for the following reason. If the underlying law is the fractional Brownian motion with Hurst index H, the appropriate scaling factor determined by its self-similarity is τ^{−H} (Mandelbrot and Ness 1968). It has been reported in (Kristoufek and Vosvrda 2014) that, in many real-world markets, estimated Hurst indices are approximately H ≈ 0.5, which implies that stock prices exhibit similar self-similarity properties as the standard Brownian motion.

Remark 2.
In some papers, the term "fractal property" stands for the behavior of stochastic processes captured by the fractional Brownian motion with H ≠ 1/2, which can lead to inefficiency of the market. In this paper, however, we focus on the self-similarity of processes to design the network architecture. The self-similarity can be observed even in efficient markets with H ≈ 0.5.

B Theoretical analysis
In this section, we provide theoretical analyses of the spectral residuals, justifying that they can hedge out the market factors. In Section B.1, we prove Proposition 1, which shows that the spectral residuals are actually uncorrelated with the market factors when the true residual factors are isotropic. In practice, the covariance of the residual factors can be approximated by diagonal matrices; in Section B.2, we explain why this approximation is justified. In Section B.3, we investigate a more general situation where the true residual factors are anisotropic, i.e., the variances of the coordinate variables differ.

Before proceeding, let us recall the definition of the spectral residual. Let r be a zero-mean random vector with a covariance matrix Σ ∈ R^{S×S}. Let Σ = V Λ V^T be an eigendecomposition of Σ, where V = [v_1, ..., v_S] is an orthogonal matrix and Λ = diag(λ_1, ..., λ_S) is a diagonal matrix of the eigenvalues. Assume that λ_1, ..., λ_S are sorted in descending order, i.e., λ_1 ≥ ··· ≥ λ_S. Given an integer 1 ≤ C < S, let V_res = [v_{C+1}, ..., v_S] be the S × (S − C) matrix that consists of the eigenvectors that correspond to the smallest S − C eigenvalues. We also define a matrix A_res as

  A_res = V_res V_res^T.

Note that A_res is the orthogonal projection onto the space spanned by the principal portfolios with the smallest S − C eigenvalues. Then, we define the spectral residual as

  ε̃ = A_res r.   (8)

B.1 Isotropic residuals
First, we prove the following result, which corresponds toProposition 1 in the main body.
Proposition 2 (Proposition 1 in the main body, restated). Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix of full column rank, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ² > 0.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Then, we have the following.
(i) The spectral residual ε̃ defined in (8) is uncorrelated with the common factor f.
(ii) The covariance matrix of ε̃ is given as σ² A_res. Under a suitable assumption, this can be approximated as a diagonal matrix.

Proof.
Let BB^T = V_mar Λ V_mar^T be the eigendecomposition of the market factor covariance BB^T, where V_mar = [v_1, ..., v_S] is an orthogonal matrix and Λ is a diagonal matrix of the eigenvalues. Since rank BB^T ≤ C, we can write Λ = diag(λ_1, ..., λ_C, 0, ..., 0) with S − C trailing zeros. Note that λ_i > 0 for all i ∈ {1, ..., C} since BB^T is positive semi-definite and has rank C. (In general, V_res is not unique because there can be multiple eigenvectors having the same eigenvalues; for simplicity, we assume that λ_C > λ_{C+1} so that V_res is uniquely determined.) Since the coordinate variables in f and ε are uncorrelated, the covariance matrix of r is calculated as

  Σ = BB^T + σ²I = V_mar(Λ + σ²I) V_mar^T.

This gives the eigendecomposition of Σ, and its eigenvalues are λ_1 + σ², ..., λ_C + σ² (the first C), followed by σ², ..., σ² (the remaining S − C). In particular, the projection matrix for the spectral residual is given as

  A_res = [v_{C+1}, ..., v_S][v_{C+1}, ..., v_S]^T.   (9)

Consequently, we have ε̃ = A_res(Bf + ε) = A_res ε, which is uncorrelated with Bf (thus we proved (i)). Moreover, the covariance matrix of A_res ε is given as σ² A_res A_res^T = σ² A_res. Regarding the diagonal approximation of this matrix, see the next subsection for details.

B.2 On the near-orthogonality of spectral residuals
When we construct a portfolio over the spectral residuals, we use a diagonal approximation of the covariance matrix (see Section 3.2 and Appendix A.1). As mentioned in the previous subsection, this covariance matrix is proportional to A_res. Hence, to justify the diagonal approximation, we should show that A_res becomes nearly diagonal. Figure 6 shows examples of A_res calculated empirically from U.S. market data. For several choices of the parameter C, these matrices are reasonably close to diagonal matrices.

But why does this happen? Generally speaking, a projection matrix constructed as (9) is not necessarily approximated well by a diagonal matrix. However, in financial markets, we may assume that the following properties hold for the first several principal portfolios, which may justify the diagonal approximation.
• Well-spreading factors: In a financial market, a large proportion of a stock return is often explained by a small number of common market factors. A market factor may be almost equally shared by all the stocks in the market. For example, the first principal portfolio can be seen as the factor of the overall market, which may be close to the "equal weighting index" of stocks. In this case, removing the market factor changes all the elements of the covariance matrix only very slightly.
•
Spiked factors: Suppose that, in an investment horizon, a certain company's stock price is affected by a factor that is independent of any other market factors. In this situation, there can be a principal portfolio that arises only from a single stock. Removing this factor is just equivalent to removing the corresponding stock, which does not affect the off-diagonal elements of the covariance matrix.
To formalize the above idea, we introduce the following two notions.
Definition 3.
Let v ∈ R^S be a unit vector (i.e., ‖v‖ = 1). Let δ > 0 be a positive number.
• We say that v is δ-spreading if |v_i| ≤ √(δ/S) for all i ∈ {1, ..., S}.
• We say that v is δ-spiked if there exists i* ∈ {1, ..., S} such that |v_i| ≤ δ/S for all i ≠ i*.
The above properties ensure that the off-diagonal elements of vv^T are sufficiently small.

Proposition 3.
Suppose that a unit vector v ∈ R^S is either δ-spreading or δ-spiked for some δ > 0. Then, for all i ≠ j, the (i, j) element of P = vv^T is bounded as |P_{ij}| = |v_i v_j| ≤ δ/S.

Proof.
The conclusion is obvious if v is δ-spreading. If v is δ-spiked, we have |v_i v_j| ≤ 1 · δ/S = δ/S for any i ≠ j.

Now, let us consider the linear factor model r = Bf + ε. We assume similar conditions as in Proposition 2. In particular, we assume that ε has an isotropic covariance matrix σ²I. Our goal is to show that the spectral residual A_res r has a nearly isotropic covariance matrix under a suitable condition. Let BB^T = V_mar Λ V_mar^T be an eigendecomposition of the market factor, where V_mar is an orthogonal matrix and Λ = diag(λ_1, ..., λ_C, 0, ..., 0) with λ_1 ≥ ··· ≥ λ_C > 0. Then, as in Proposition 2, we have

  A_res = [v_{C+1}, ..., v_S][v_{C+1}, ..., v_S]^T = I − P_mar,

and the spectral residual is given as A_res r = A_res ε. Here,

  P_mar = [v_1, ..., v_C][v_1, ..., v_C]^T

is the projection matrix onto the space of the market factors. Thus, the covariance matrix of the spectral residual is given as σ² A_res. The following proposition explains that this covariance matrix is nearly diagonal if the top-C principal portfolios are either "well-spreading" or "nearly spiked".

Proposition 4.
We use similar notations and assumptions as above. Suppose that each v_i (i ∈ {1, ..., C}) is either δ-spreading or δ-spiked. Then, the off-diagonal elements of A_res are bounded as |[A_res]_{i,j}| ≤ Cδ/S for any i ≠ j.

Proof.
Combining the equality P_mar = Σ_{i=1}^C v_i v_i^T with the fact that the absolute values of the off-diagonal elements of each v_i v_i^T are not larger than δ/S, we have |[A_res]_{i,j}| = |[P_mar]_{i,j}| ≤ Cδ/S for any i ≠ j. Hence the result.

B.3 Anisotropic residuals
Proposition 2 does not hold when the true residual factors ε_1, ..., ε_S are anisotropic. However, we can show that the spectral residual is almost uncorrelated with the market factors if
• the volatilities of the market factors are sufficiently larger than those of the residual factors, and
• the residual factors are almost isotropic.

Figure 6: The absolute values of the projection matrix A_res calculated from U.S. market data. Three panels show results for (a) C = 10, (b) C = 50, and (c) C = 200, respectively. In all cases, the projection matrices are reasonably close to diagonal matrices.

Here, we provide a quantitative justification of this claim by using matrix perturbation theory (Davis and Kahan 1970; Yu, Wang, and Samworth 2014). Before describing the result, we introduce some notation. For any S × S matrix A, ‖A‖_F denotes the Frobenius norm defined as ‖A‖²_F = Σ_{i,j} a²_{ij}. For any orthogonal projection matrix P, let V(P) denote the corresponding linear subspace (i.e., the largest linear subspace that is invariant under P). Let P_1 and P_2 be any two orthogonal projection matrices having the same rank, and let V_i = V(P_i) (i = 1, 2). Then, the quantity

  ‖sin Θ(V_1, V_2)‖_F := ‖P_1(I − P_2)‖_F = ‖P_2(I − P_1)‖_F

measures the "principal angles" between the two subspaces V_1 and V_2 (Davis and Kahan 1970). In particular, this quantity becomes zero if and only if the two subspaces agree.

Proposition 5.
Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix of full rank, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ²_k > 0 for k ∈ {1, ..., S}.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Let P_mar be the orthogonal projection matrix onto the linear space spanned by B, and let A_res be the projection matrix of the spectral residuals. Let λ_min be the smallest positive eigenvalue of BB^T. Then, we have

  ‖P_mar A_res‖_F ≤ 2√S (max_i σ²_i − min_i σ²_i) / λ_min.   (10)

Here, we give an interpretation of (10). If ‖P_mar A_res‖_F is shown to be small, the spectral residual A_res r is nearly orthogonal to any market factors. The right-hand side of (10) can be small when either of the following conditions is satisfied: (i) if the residual factors are nearly isotropic, then max_i σ²_i − min_i σ²_i is small; (ii) if the smallest variance of the market factor λ_min is much larger than √S (max_i σ²_i − min_i σ²_i), the right-hand side becomes small.

Proof.
Since f and ε are uncorrelated, the covariance matrix of the generative model r = Bf + ε can be written as Σ = BB^T + Q, where Q = diag(σ²_1, ..., σ²_S) is the covariance of the residual factors. Let s̄ = (1/S) Σ_{i=1}^S σ²_i be the averaged variance of the residual factors. As such, s̄I is the closest isotropic covariance matrix to Q in the following sense:

  s̄I ∈ argmin_{sI : s ≥ 0} ‖Q − sI‖_F.

Let Q̂ = s̄I − Q. Define ∆_iso ≥ 0 as

  ∆_iso := ‖Q̂‖_F = min_{sI : s ≥ 0} ‖Q − sI‖_F = (Σ_{i=1}^S (σ²_i − s̄)²)^{1/2},

which quantifies how far Q is from being isotropic. As in the proof of Proposition 2, let BB^T = V_mar Λ V_mar^T be the eigendecomposition of BB^T. Define Σ̂ as

  Σ̂ = Σ + Q̂ = BB^T + s̄I.

Since BB^T and s̄I are simultaneously diagonalizable by V_mar, we can write Σ̂ = V_mar(Λ + s̄I) V_mar^T. In particular, the eigenvalues of Σ̂ are λ_1 + s̄, ..., λ_C + s̄ (the first C), followed by s̄ (the remaining S − C). Let P_mar be the orthogonal projection matrix onto the linear subspace spanned by the column vectors of B (i.e., the market factors). Note that such a linear subspace is spanned by v_1, ..., v_C, and P_mar is computed as P_mar = [v_1, ..., v_C][v_1, ..., v_C]^T. Also, let Σ = U M U^T be the eigendecomposition of Σ, where U = [u_1, ..., u_S] is an orthogonal matrix, and M = diag(µ_1, ..., µ_S) is a diagonal matrix with µ_1 ≥ ··· ≥ µ_C > µ_{C+1} ≥ ··· ≥ µ_S > 0. Then, the projection matrix for the spectral residual is given as A_res = [u_{C+1}, ..., u_S][u_{C+1}, ..., u_S]^T. Now, P_mar and I − A_res are the projection matrices that correspond to the eigenvectors with the largest C eigenvalues of Σ̂ and Σ, respectively.
From the Davis–Kahan sin Θ theorem (Davis and Kahan 1970; Yu, Wang, and Samworth 2014), we conclude

  ‖P_mar A_res‖_F ≤ 2‖Σ̂ − Σ‖_F / min_{1≤i≤C} λ_i = 2∆_iso / min_{1≤i≤C} λ_i.

We also obtain the weaker inequality (10) since ∆_iso ≤ √S (max_i σ²_i − min_i σ²_i).

C Details for experimental settings
In this section, we provide some more details on the experi-ments in Section 4.
C.1 Definitions of additional evaluation metrics

• Maximum DrawDown (MDD) is the maximum loss from a peak to a trough (Grossman and Zhou 1993), which can measure one aspect of downside risk:
$$\mathrm{MDD}_T = \max_{t \in \{1, \ldots, T\}} \; \max_{s \in \{1, \ldots, t\}} \left( \frac{\mathrm{CW}_s - \mathrm{CW}_t}{\mathrm{CW}_s} \right). \quad (11)$$

• Calmar Ratio (CR) is a risk-adjusted return based on the maximum drawdown (Young 1991): $\mathrm{CR}_T := \mathrm{AR}_T / \mathrm{MDD}_T$.

• Downside Deviation Ratio (DDR) (a.k.a. Sortino Ratio) is a variation of the Sharpe ratio (Sortino and Price 1994). While the Sharpe ratio regards overall volatility as risk, the DDR regards only the volatility caused by negative returns as harmful risk:
$$\mathrm{DDR}_T := \frac{\mathrm{AR}_T}{\sqrt{\frac{1}{T} \sum_{t=1}^{T} \min(0, R_t)^2}}. \quad (12)$$

C.2 Definitions of some baseline methods

• Reversal strategy. In Section 4.2 and Section 4.3, we used a simple reversal strategy (AR(1)) as a benchmark method, which is defined as follows. Let $r_t$ ($t = 1, 2, \ldots$) be either a return sequence or a transformed return sequence (e.g., the spectral residuals). Then, the reversal strategy $b_t$ is defined by renormalizing $-r_{t-1}$ to be a zero-investment portfolio, that is,
$$b_t = - \frac{r_{t-1} - \bar{r}_{t-1}}{\| r_{t-1} - \bar{r}_{t-1} \|}, \quad \text{where } \bar{r}_{t-1} = \frac{1}{S} \sum_{i=1}^{S} r_{i,t-1}.$$

[Footnote to the proof in Appendix B: the assumption $\mu_C > \mu_{C+1}$ holds in many cases where the market factors are much larger than the residual factors. For example, let $\delta = \| \widehat{Q} \|_{\mathrm{op}} = \max_j | \sigma_j^2 - \bar{s} |$, and suppose that $\lambda_C = \min_i \lambda_i > 2\delta$. Then, from the well-known Weyl's inequality on eigenvalues, we have $\mu_C \ge \lambda_C + \bar{s} - \delta > \bar{s} + \delta \ge \mu_{C+1}$.]

• MLP-based prediction. In Section 4.2, we also used a neural network-based prediction of returns (MLP). We used a fully connected neural network whose hidden layers each have 512 nodes, 50% dropout, and batch normalization.

C.3 Details of network architectures and hyperparameters
Here, we explain the details of the architecture that we used for the distributional prediction. For each coordinate $i \in \{1, \ldots, S\}$ of the spectral residual, we applied a common non-linear function $\psi : \mathbb{R}^H \to \mathbb{R}^{Q-1}$ that predicts $Q - 1$ quantiles based on the past $H$ observations. We designed $\psi$ by the fractal network introduced in Section 3.3. The detailed specifications are as follows.

• We estimate $Q - 1$ quantiles for each stock. In particular, the $j$-th coordinate of $\psi$ corresponds to the $j$-th quantile of the future distribution.

• The resampling mechanism $\mathrm{Resample}(x, \tau)$ outputs sequences of a fixed length $H' = 64$. For the scale parameters, we used $\tau_j := 4^{-j/m}$ with $j \in \{0, \ldots, J\}$.

• For the function $\psi_1 : \mathbb{R}^{H'} \to \mathbb{R}^{K}$, we used a fully connected neural network with 3 hidden layers. Each layer has 256 nodes, 50% dropout, and batch normalization.

• For the function $\psi_2 : \mathbb{R}^{K} \to \mathbb{R}^{Q-1}$, we used a fully connected neural network with 8 hidden layers. Each layer has 128 nodes, 50% dropout, and batch normalization.

• For training, we used the Adam optimizer (Kingma and Ba 2015).

D Supplementary experiments for spectral residuals
In this section, we provide additional experimental results on the spectral residual. In Section D.1, we conduct a simple experiment to show that the short-term behavior of the (empirical) spectral residual is reasonably stable. In Section D.2, we discuss a simple way to determine the parameter C from the training data.

D.1 Local stability of spectral residuals
We check whether the spectral residuals are locally stable. This is not trivial because (i) the spectral residual is calculated locally on time windows, and (ii) the long-term behavior of the financial time series is highly non-stationary. To this end, we here investigate the ability of the spectral residuals to reduce the volatility of sequences.

For each time t, we calculated the projection matrix (A_t defined in Section 3.1) and the spectral residuals ε̃_s for t − H ≤ s ≤ t − 1. We also generated the spectral residuals for the unseen duration t ≤ s < t + H by fixing A_t and extrapolating the relation ε̃_s = A_t r_s. Then, we calculated the volatility Vol(ε̃_s) of the spectral residual at each s. Throughout, we fixed H = 256, and we varied the number of principal components to be removed as C ∈ {0, 1, 10, 20, 50, 100}.

[Figure 7: Relative volatility to the raw stock returns for various choices of parameter C (axes: relative volatility vs. delay ∆). The ability to reduce the volatility seems to continue in the unseen duration (∆ ≥ 0), which may suggest the local stability of spectral residuals. See the text for details.]

Table 3: Performance comparison of reversal returns on different numbers of principal components (PCs) to be eliminated (U.S. market).

          ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
C = 0     +0.759  +0.076
C = 1     +0.753  +0.054  0.041   +1.343  +0.309  0.132
C = 10    +1.426  +0.035  0.049   +2.541  +1.206
C = 20    +1.317  +0.028  0.037   +2.288  +1.000  0.037
C = 50    +1.264  +0.019  0.024   +2.275  +0.990
C = 100   +1.089  +0.013          +1.877  +0.391  0.035

Figure 7 shows the result averaged over t, which illustrates the proportion of the volatility of spectral residuals (C ≥ 1) to the volatility of raw stock returns (C = 0). The horizontal axis is the delay parameter ∆ := s − t ∈ {−H, . . . , −1, 0, . . . , H − 1}. Here, ∆ < 0 corresponds to the "observed" duration used for calculating the projection matrix A_t, and ∆ ≥ 0 corresponds to the "unseen" duration generated by extrapolation.
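For concreteness, the extrapolation experiment described above can be sketched as follows. This is a minimal NumPy sketch on synthetic one-factor data: the helper name `spectral_residual_projector`, the data-generating model, and all sizes are assumptions, and the paper's exact preprocessing may differ. It fits the projection matrix A_t on an observed window and applies it, unchanged, to the subsequent unseen window.

```python
import numpy as np

def spectral_residual_projector(returns_window, C):
    """Projection matrix A_t removing the top-C principal components.

    returns_window: array of shape (H, S), the past H days of returns
    for S stocks. (Hypothetical helper; the paper's exact normalization
    may differ.)
    """
    X = returns_window - returns_window.mean(axis=0, keepdims=True)
    cov = X.T @ X / len(X)                    # (S, S) sample covariance
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    V_res = eigvecs[:, : X.shape[1] - C]      # bottom S - C eigenvectors
    return V_res @ V_res.T                    # A_t = V_res V_res^T

# Toy one-factor market: a shared factor plus idiosyncratic noise.
rng = np.random.default_rng(1)
H, S, C = 256, 20, 1
market = rng.normal(scale=0.02, size=(2 * H, 1))     # common market factor
beta = rng.uniform(0.5, 1.5, size=(1, S))            # per-stock exposures
r = market @ beta + rng.normal(scale=0.005, size=(2 * H, S))

A_t = spectral_residual_projector(r[:H], C)   # fit on the "observed" window
resid_unseen = r[H:] @ A_t                    # extrapolate: A_t is symmetric

vol_raw, vol_res = r[H:].std(), resid_unseen.std()
print(vol_res < vol_raw)  # True: the market component stays removed out of sample
```

Because the dominant eigenvector tracks the stable market exposure, the projector keeps reducing volatility on the unseen window, mirroring the behavior in Figure 7.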
From this, we observe the following:

• Monotonicity. Increasing the number C of eliminated principal components can reduce the volatility of the spectral residuals even for the unseen duration.

• Local stability. Regarding the volatility proportion, the difference between the observed duration and the unseen duration increases with increasing C. Remarkably, the first principal component (a.k.a. the market factor) is quite stable, and the volatility proportion for C = 1 is well extrapolated to the unseen duration.

The above observations suggest that the projection matrix A_t can be locally stable if C is not too large. Hence, we can extract meaningful residual information in the subsequent time window by extrapolating the projection matrix.

D.2 Choosing the number of eliminated factors
Next, we consider the appropriate choice of the parameter C to make trading strategies robustly profitable. Figure 8 and Table 3 show the performance of the reversal strategy over the spectral residuals on the U.S. market.

From the result, we observe the following. With an increasing number C of eliminated principal components, both AR and AVOL tend to decrease. This is because eliminating more principal components reduces the volatility, while it becomes more difficult to earn large returns from the remaining information. Eliminating the first 10 principal components (C = 10) attained the best trade-off between the return and the risk, and thus had the best ASR value. We also conducted a similar experiment on Japanese market data, in which C = 20 achieved the best ASR value. See Figure 9 and Table 4 for the corresponding results.

Table 4: Performance comparison of reversal returns on different numbers of principal components (PCs) to be eliminated (Japanese market).

          ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
C = 0     +1.259  +0.082  0.065   +2.344  +0.700  0.117
C = 1     +1.715  +0.087
C = 10    +2.960  +0.074  0.025   +5.555  +2.037  0.036
C = 20    +3.291  +0.071  0.022   +6.306  +4.215
C = 50    +3.036  +0.047  0.016   +5.938  +3.258
C = 100   +2.175  +0.023          +4.114  +1.152  0.020

Table 5: Performance comparison on the Japanese market.

                ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
Market          +0.819  +0.158
AR(1)           +1.511  +0.058  0.038   +2.684  +1.094  0.053
AR(1) on SRes   +1.835  +0.034  0.019   +3.237  +1.112  0.031
Linear on SRes  +1.380  +0.023          +2.469  +0.952
MLP on SRes     +1.802  +0.030          +3.280  +0.963  0.032
SFM on SRes     +0.250  +0.005  0.019   +0.442  +0.116  0.042
DPO-NQ          +1.369  +0.024  0.018   +2.379  +0.758  0.032
DPO-NF          +1.770  +0.030          +3.159  +1.165  0.025
DPO-NV          +1.979  +0.039  0.020   +3.516  +1.484
DPO             +2.171  +0.036                  +1.460  0.025

E Performance evaluation on Japanese market data
Here, we evaluated the performance of our proposed system (DPO) on Japanese market data. We used evaluation metrics and baseline methods similar to those presented in Section 4.3.
E.1 Japanese market data
For Japanese market data, we used daily prices of stocks listed in the TOPIX 500 from January 2005 to December 2018. We used data before January 2012 for training and validation, and the remainder for testing. We obtained the data from Japan Exchange Group (JPX).

[Figure 8: The Cumulative Wealth of the reversal strategy with different choices of the number C of principal components to be eliminated (U.S. market).]

[Figure 9: The Cumulative Wealth of the reversal strategy with different choices of the number C of principal components to be eliminated (Japanese market).]

[Figure 10: The Cumulative Wealth in the Japanese market (axes: cumulative wealth vs. date).]
E.2 Results
Table 5 and Figure 10 show the results. Overall, we obtained results consistent with those on the U.S. market. Our proposed method (DPO) outperformed the other baseline methods and ablation models in terms of the ASR. For the other evaluation metrics, DPO achieved performance comparable to that of the best-performing baselines.
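For reference, the risk metrics defined in Appendix C.1 can be sketched in NumPy as follows. This is a minimal sketch: the function names are hypothetical, and the annualized return AR_T is passed in precomputed, since its annualization constant is defined in the main text rather than in this appendix.

```python
import numpy as np

def max_drawdown(cw):
    """MDD of a cumulative-wealth series: the largest peak-to-trough loss."""
    cw = np.asarray(cw, dtype=float)
    peaks = np.maximum.accumulate(cw)          # running peak CW_s for s <= t
    return float(np.max((peaks - cw) / peaks))

def calmar_ratio(ar, cw):
    """CR_T := AR_T / MDD_T."""
    return ar / max_drawdown(cw)

def downside_deviation_ratio(ar, returns):
    """DDR_T := AR_T / sqrt((1/T) * sum_t min(0, R_t)^2)."""
    r = np.asarray(returns, dtype=float)
    return ar / np.sqrt(np.mean(np.minimum(0.0, r) ** 2))

cw = np.array([1.0, 1.2, 0.9, 1.1, 1.5, 1.3])
print(round(max_drawdown(cw), 4))  # 0.25: the drop from the peak 1.2 to 0.9
```

Note that only negative returns enter the DDR denominator, which is exactly why a strategy with rare, shallow losses can score a high DDR despite moderate overall volatility.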