Deep Portfolio Optimization via Distributional Prediction of Residual Factors
Kentaro Imajo, Kentaro Minami, Katsuya Ito, Kei Nakagawa
Preferred Networks, Inc.; Nomura Asset Management Co., Ltd.
{imos, minami, katsuya1ito}@preferred.jp, [email protected]

Abstract
Recent developments in deep learning techniques have motivated intensive research in machine learning-aided stock trading strategies. However, since the financial market has a highly non-stationary nature hindering the application of typical data-hungry machine learning methods, leveraging financial inductive biases is important to ensure better sample efficiency and robustness. In this study, we propose a novel method of constructing a portfolio based on predicting the distribution of a financial quantity called residual factors, which is known to be generally useful for hedging the risk exposure to common market factors. The key technical ingredients are twofold. First, we introduce a computationally efficient extraction method for the residual information, which can be easily combined with various prediction algorithms. Second, we propose a novel neural network architecture that allows us to incorporate widely acknowledged financial inductive biases such as amplitude invariance and time-scale invariance. We demonstrate the efficacy of our method on U.S. and Japanese stock market data. Through ablation experiments, we also verify that each individual technique contributes to improving the performance of trading strategies. We anticipate our techniques may have wide applications in various financial problems.
Developing a profitable trading strategy is a central problem in the financial industry. Over the past decade, machine learning and deep learning techniques have driven significant advances across many application areas (Devlin et al. 2019; Graves, Mohamed, and Hinton 2013), which inspired investors and financial institutions to develop machine learning-aided trading strategies (Wang et al. 2019; Choudhry and Garg 2008; Shah 2007). However, it is believed that forecasting financial time series is an essentially difficult task (Krauss, Do, and Huck 2017). In particular, the well-known efficient market hypothesis (Malkiel and Fama 1970) claims that no single trading strategy can be permanently profitable because the dynamics of the market change quickly. In this respect, the financial market is significantly different from the stationary environments typically assumed by most machine learning and deep learning methods.
Generally speaking, a good way for deep learning methods to adapt quickly to a given environment is to introduce a network architecture that reflects a good inductive bias for that environment. The most prominent examples of such architectures include convolutional neural networks (CNNs) (Krizhevsky, Sutskever, and Hinton 2012) for image data and long short-term memories (LSTMs) (Hochreiter and Schmidhuber 1997) for general time series data. Therefore, a natural question to ask is what architecture is effective for processing financial time series.

In finance, researchers have proposed various trading strategies and empirically studied their effectiveness. Hence, it is reasonable to seek architectures inspired by empirical findings in financial studies. In particular, we consider the following three features of the stock market.
Many empirical studies on stock returns are described in terms of factor models (e.g., (Fama and French 1992, 2015)). These factor models express the return of a certain stock i at time t as a linear combination of K factors plus a residual term:

r_{i,t} = ∑_{k=1}^{K} β_i^{(k)} f_t^{(k)} + ε_{i,t}.   (1)

Here, f_t^{(1)}, ..., f_t^{(K)} are the common factors shared by multiple stocks i ∈ {1, ..., S}, and the residual factor ε_{i,t} is specific to each individual stock i. Therefore, the common factors correspond to the dynamics of the entire stock market or industries, whereas the residual factors convey firm-specific information.

In general, if the return of an asset has a strong correlation to the market factors, the asset exhibits a large exposure to the risk of the market. For example, it is known that a classical strategy based on the momentum phenomenon (Jegadeesh and Titman 1993) is correlated to the Fama-French factors (Fama and French 1992, 2015), which exhibited negative returns around the credit crisis of 2008 (Calomiris, Love, and Peria 2010; Szado 2009). On the other hand, researchers found that trading strategies based only on the residual factors can be robustly profitable, because such strategies can hedge out the time-varying risk exposure to the market factor (Blitz, Huij, and Martens 2011; Blitz et al. 2013).

When we address a certain prediction task using a neural network-based approach, an effective choice of neural network architecture typically hinges on patterns or invariances in the data. For example, CNNs (LeCun et al. 1999) take into account the shift-invariant structure that commonly appears in image-like data. From this perspective, it is important to find invariant structures that are useful for processing financial time series. As candidates for such structures, there are two types of invariances known in the financial literature.
First, it is known that a phenomenon called volatility clustering is commonly observed in financial time series (Lux and Marchesi 2000), which suggests an invariance structure of a sequence with respect to its volatility (i.e., amplitude). Second, there is a hypothesis that sequences of stock prices have a certain time-scale invariance property known as the fractal structure (Peters 1994). We hypothesize that incorporating such invariances into the network architecture is effective at accelerating learning from financial time series data.

Another important problem is how to convert a given prediction of the returns into an actual trading strategy. In finance, there are several well-known trading strategies. To name a few, the momentum phenomenon (Jegadeesh and Titman 1993) suggests a strategy that bets on the current market trend, while the mean reversion (Poterba and Summers 1988) suggests another strategy that assumes that stock returns move toward the opposite of the current direction. However, as suggested by their construction, the momentum and reversal strategies are negatively correlated with each other, and it is generally unclear which strategy is effective for a particular market. On the other hand, modern portfolio theory (Markowitz 1952) provides a framework to determine a portfolio from distributional properties of asset prices (typically means and variances of returns). The resulting portfolio is unique in the sense that it has an optimal trade-off of returns and risks under some predefined conditions. From this perspective, distributional prediction of returns can be useful to construct trading strategies that can automatically adapt to the market.

• We propose a novel method to extract residual information, which we call the spectral residuals. The spectral residuals can be calculated much faster than the classical factor analysis-based method without losing the ability to hedge out exposure to the market factors.
Moreover, the spectral residuals can easily be combined with any prediction algorithm.
• We propose a new system for distributional prediction of stock prices based on deep neural networks. Our system involves two novel neural network architectures inspired by well-known invariance hypotheses on financial time series. Predicting the distributional information of returns allows us to utilize the optimal portfolio criteria offered by modern portfolio theory.
• We demonstrate the effectiveness of our proposed methods on real market data.

In the supplementary material, we also include appendices which contain detailed mathematical formulations and experimental settings, theoretical analysis, and additional experiments.
Our problem is to construct a time-dependent portfolio based on sequential observations of stock prices. Suppose that there are S stocks indexed by symbol i. The observations are given as a discrete time series of stock prices p^{(i)} = (p_1^{(i)}, p_2^{(i)}, ..., p_t^{(i)}, ...). Here, p_t^{(i)} is the price of stock i at time t. We mainly consider the return of stocks instead of their raw prices. The return of stock i at time t is defined as r_t^{(i)} = p_{t+1}^{(i)} / p_t^{(i)} - 1.

A portfolio is a (time-dependent) vector of weights over the stocks b_t = (b_t^{(1)}, ..., b_t^{(i)}, ..., b_t^{(S)}), where b_t^{(i)} is the volume of the investment in stock i at time t satisfying ∑_{i=1}^{S} |b_t^{(i)}| = 1. A portfolio b_t is understood as a particular trading strategy; that is, b_t^{(i)} > 0 implies that the investor takes a long position on stock i with amount |b_t^{(i)}| at time t, and b_t^{(i)} < 0 means a short position on the stock. Given a portfolio b_t, its overall return R_t at time t is given as R_t := ∑_{i=1}^{S} b_t^{(i)} r_t^{(i)}. Then, given the past observations of individual stock returns, our task is to determine the value of b_t that optimizes the future returns.

An important class of portfolios is the zero-investment portfolio defined as follows.

Definition 1 (Zero-Investment Portfolio). A zero-investment portfolio is a portfolio whose buying position and selling position are evenly balanced, i.e., ∑_{i=1}^{S} b_t^{(i)} = 0.

In this paper, we restrict our attention to trading strategies that output zero-investment portfolios. This assumption is sensible because a zero-investment portfolio requires no equity and thus encourages a fair comparison between different strategies.

Figure 1: Overview of the proposed system. Our system consists of three parts: (i) the extraction layer of residual factors, (ii) a neural network-based distribution predictor, and (iii) transformation to the optimal portfolio.

In practice, there can be delays between the observations of the returns and the actual execution of the trading. To account for this delay, we also adopt the delay parameter d in our experiments. When we trade with a d-day delay, the overall return should be R_t := R_t^d = ∑_{i=1}^{S} b_t^{(i)} r_{t+d}^{(i)}.

According to modern portfolio theory (Markowitz 1952), investors construct portfolios to maximize expected return under a specified level of acceptable risk. The standard deviation is commonly used to quantify the risk or variability of investment outcomes, which measures the degree to which a stock's annual return deviates from its long-term historical average (Kintzel 2007).

The Sharpe ratio (Sharpe 1994) is one of the most referenced risk/return measures in finance. It is the average return earned in excess of the risk-free rate per unit of volatility. The Sharpe ratio is calculated as (R_p - R_f)/σ_p, where R_p is the return of the portfolio, σ_p is the standard deviation of the portfolio's excess return, and R_f is the return of a risk-free asset (e.g., a government bond). For a zero-investment portfolio, we can always omit R_f since it requires no equity (Mitra 2009).

In this paper, we adopt the Sharpe ratio as the objective for our portfolio construction problem. Since we cannot always obtain an estimate of the total risk beforehand, we often consider sequential maximization of the Sharpe ratio of the next period.
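The definitions above can be illustrated with a minimal sketch (the function names and the toy numbers are ours, not from the paper):

```python
import numpy as np

def portfolio_return(weights, returns, delay=0):
    """Overall return R_t = sum_i b_t^(i) * r_{t+d}^(i) for each period t.

    weights: (T, S) array of portfolio weights b_t.
    returns: (T + delay, S) array of individual stock returns r_t.
    """
    T = weights.shape[0]
    return (weights * returns[delay:delay + T]).sum(axis=1)

def sharpe_ratio(R, risk_free=0.0):
    """(R_p - R_f) / sigma_p; R_f is omitted (zero) for zero-investment portfolios."""
    excess = R - risk_free
    return excess.mean() / excess.std()

# A zero-investment portfolio on S = 2 stocks, held for T = 3 periods,
# executed with a one-day delay (d = 1).
b = np.array([[0.5, -0.5], [0.5, -0.5], [0.5, -0.5]])
r = np.array([[0.02, -0.02], [0.01, 0.03], [0.00, 0.02], [0.01, -0.01]])
R = portfolio_return(b, r, delay=1)
```

Note that each row of `b` sums to zero, so this is a zero-investment portfolio in the sense of Definition 1.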
Once we predict the mean vector and the covariance matrix of the population of future returns, the optimal portfolio b* can be solved as b* = λ^{-1} Σ^{-1} μ, where λ is a predefined parameter representing the relative risk aversion, Σ is the estimated covariance matrix, and μ is the estimated mean vector (Kan and Zhou 2007). Therefore, predicting the mean and the covariance is essential to construct risk-averse portfolios.
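A minimal sketch of this closed form (variable names and the toy estimates are ours):

```python
import numpy as np

def optimal_portfolio(mu, Sigma, risk_aversion):
    """Mean-variance optimal weights b* = lambda^{-1} Sigma^{-1} mu."""
    return np.linalg.solve(Sigma, mu) / risk_aversion

mu = np.array([0.10, 0.05])        # estimated mean returns
Sigma = np.diag([0.04, 0.01])      # estimated covariance (diagonal here)
b_star = optimal_portfolio(mu, Sigma, risk_aversion=2.0)
```

With a diagonal covariance, each weight reduces to μ_j / (λ σ_j²), which is the per-asset rule used later in the portfolio construction step.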
In this section, we present the details of our proposed system, which is outlined in Figure 1. Our system consists of three parts. In the first part (i), the system extracts residual information to hedge out the effects of common market factors. To this end, in Section 3.1, we introduce the spectral residual, a novel method based on spectral decomposition. In the second part (ii), the system predicts future distributions of the spectral residuals using a neural network-based predictor. In the third part (iii), the predicted distributional information is leveraged for constructing optimal portfolios. We outline these procedures in Section 3.2. Additionally, we introduce a novel network architecture that incorporates well-known financial inductive biases, which we explain in Section 3.3.

(Note that b* is derived as the maximizer of b^⊤μ - (λ/2) b^⊤Σb, where b^⊤μ and b^⊤Σb are the return and the risk of the portfolio b, respectively.)

As mentioned in the introduction, we focus on developing a trading strategy based on the residual factors, i.e., the information remaining after hedging out the common market factors. Here, we introduce a novel method to extract the residual information, which we call the spectral residual.

Definition of the spectral residuals
First, we introduce some notions from portfolio theory. Let r be a random vector with zero mean and covariance Σ ∈ R^{S×S}, which represents the returns of S stocks over the given investment horizon. Since Σ is symmetric, we have a decomposition Σ = V Λ V^⊤, where V = [v_1, ..., v_S] is an orthogonal matrix and Λ = diag(λ_1, ..., λ_S) is a diagonal matrix of the eigenvalues. Then, we can create a new random vector r̂ = V^⊤ r such that the coordinate variables r̂_i = v_i^⊤ r are mutually uncorrelated. In portfolio theory, the r̂_i are called principal portfolios (Partovi and Caputo 2004). Principal portfolios have been utilized in "risk parity" approaches to diversify the exposure to the intrinsic sources of risk in the market (Meucci 2009).

Since the volatility (i.e., the standard deviation) of the i-th principal portfolio is given as √λ_i, the raw return sequence has large exposure to the principal portfolios with large eigenvalues. For example, the first principal portfolio can be seen as the factor that corresponds to the overall market (Meucci 2009). Therefore, to hedge out common market factors, a natural idea is to discard the principal portfolios with the largest eigenvalues. Formally, we define the spectral residual as follows.

Definition 2.
Let C (< S) be a given positive integer. We define the spectral residual ε̃ as the vector obtained by projecting the raw return vector r onto the space spanned by the principal portfolios with the smallest S - C eigenvalues.

In practice, we calculate the empirical version of the spectral residuals as follows. Given a time window H > 0, we define a windowed signal X_t as X_t := [r_{t-H}, ..., r_{t-1}]. We also denote by X̃_t the matrix obtained by subtracting the empirical means of the row vectors from X_t. By the singular value decomposition (SVD), X̃_t can be decomposed as

X̃_t = V_t diag(σ_1, ..., σ_S) U_t^⊤,

where the columns of V_t correspond to the principal portfolios, V_t is an S × S orthogonal matrix, U_t is an S × H matrix whose rows are mutually orthogonal unit vectors, and σ_1 ≥ ··· ≥ σ_S are the singular values. Note that V_t^⊤ X̃_t can be seen as the realized returns of the principal portfolios. Then, the (empirical) spectral residual at time s (t - H ≤ s ≤ t - 1) is computed as

ε̃_s := A_t r_s,   (2)

where A_t is the projection matrix defined as A_t := V_t diag(0, ..., 0, 1, ..., 1) V_t^⊤, with C zeros followed by S - C ones on the diagonal.

Relationship to factor models
Although we defined the spectral residuals through PCA, they are also related to the generative model (1), and thus convey information about "residual factors" in the original sense.

In the finance literature, it has been pointed out that trading strategies depending only on the residual factor ε_t in (1) can be robust to structural changes in the overall market (Blitz, Huij, and Martens 2011; Blitz et al. 2013). While estimating the parameters in the linear factor model (1) is typically done by factor analysis (FA) methods (Bartholomew, Knott, and Moustaki 2011), the spectral residual ε̃_t is not exactly the same as the residual factor obtained from the FA. Despite this, we can show the following result: the spectral residuals can hedge out the common market factors under a suitable condition.

Proposition 1.
Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ² > 0.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Then, we have the following.
(i) The spectral residual ε̃ defined in (2) is uncorrelated with the common factor f.
(ii) The covariance matrix of ε̃ is given as σ² A_res, where A_res denotes the corresponding projection matrix. Under a suitable assumption, this can be approximated by a diagonal matrix, which means the coordinate variables ε̃_i (i ∈ {1, ..., S}) of the spectral residual are almost uncorrelated. See Appendix B for a precise statement.
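The extraction in (2) and statement (i) can be checked numerically. The following simulation is our own illustration (not the paper's formal proof): we generate r = Bf + ε with dominant common factors and verify that the empirical spectral residuals are almost uncorrelated with f, while the raw returns are not.

```python
import numpy as np

def spectral_residual(X, C):
    """Empirical spectral residual (2): project each column of the (S, H)
    return window X onto the S - C smallest principal portfolios."""
    Xc = X - X.mean(axis=1, keepdims=True)           # subtract row-wise means
    V, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = np.eye(X.shape[0]) - V[:, :C] @ V[:, :C].T   # projection matrix A_t
    return A @ X

rng = np.random.default_rng(0)
S, C, H = 8, 2, 50_000                    # long window for a tight check

B = 3.0 * rng.normal(size=(S, C))         # large factor loadings
f = rng.normal(size=(C, H))               # common factors, unit variance
eps = 0.5 * rng.normal(size=(S, H))       # isotropic residual factors
r = B @ f + eps                           # linear factor model (1)

resid = spectral_residual(r, C)
cross_cov = resid @ f.T / H               # ~0 by Proposition 1-(i)
```

Here `np.eye(S) - V[:, :C] @ V[:, :C].T` equals V diag(0, ..., 0, 1, ..., 1) V^⊤, i.e., the projection A_t of (2).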
In the above proposition, the first statement (i) claims that the spectral residual can eliminate the common factors without knowing the exact residual factors. The latter statement (ii) justifies the diagonal approximation of the predicted covariance, which will be utilized in the next subsection. For completeness, we provide the proof in Appendix B. Moreover, the assumption that ε is isotropic can be relaxed in the following sense. If we assume that the residual factor ε is "almost isotropic" and the common factors Bf have larger volatility contributions than ε, we can show that the linear transformation used in the spectral residual is close to the projection matrix eliminating the market factors. Since the formal statement is somewhat involved, we give the details in Appendix B.

Besides, the spectral residual can be computed significantly faster than the FA-based methods. This is because the FA typically requires iterative executions of the SVD to solve a non-convex optimization problem, while the spectral residual requires only a single SVD. Section 4.2 gives an experimental comparison of running times.

Our next goal is to construct a portfolio based on the extracted information. To this end, we introduce a method to forecast future distributions of the spectral residuals, and explain how we can convert the distributional features into executable portfolios.
Distributional prediction
Given a sequence of past realizations of residual factors ε̃_{i,t-H}, ..., ε̃_{i,t-1}, consider the problem of predicting the distribution of a future observation ε̃_{i,t}. Our approach is to learn a functional predictor for the conditional distribution p(ε̃_{i,t} | ε̃_{i,t-H}, ..., ε̃_{i,t-1}). Since our final goal is to construct the portfolio, we only use predicted means and covariances, and we do not need full information about the conditional distribution. Despite this, fitting symmetric models such as Gaussian distributions can be problematic because the distributions of returns are known to be often skewed (Cont 2000; Lin and Liu 2018). To circumvent this, we utilize quantile regression (Koenker 2005), an off-the-shelf nonparametric method to estimate conditional quantiles. Intuitively, if we obtain a sufficiently large number of quantiles of the target variable, we can reconstruct any distributional property of that variable. We train a function ψ that predicts several conditional quantile values, and convert its output into estimators of conditional means and variances. The overall procedure can be made differentiable, so we can incorporate it into modern deep learning frameworks.

Here, we provide the details of the aforementioned procedure. First, we give an overview of the quantile regression objective. Let Y be a scalar-valued random variable, and let X be another random variable. For α ∈ (0, 1), the α-th conditional quantile of Y given X = x is defined as

y(x; α) := inf{ y' : P(Y ≤ y' | X = x) ≥ α }.

It is known that y(x; α) can be found by solving the following minimization problem:

y(x; α) = argmin_{y'∈R} E[ℓ_α(Y, y') | X = x],

where ℓ_α(y, y') is the pinball loss defined as

ℓ_α(y, y') := max{ (α - 1)(y - y'), α(y - y') }.
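A minimal sketch of the pinball loss, together with the conversion of a vector of predicted quantiles into mean and variance estimates described later in this section (function names are ours):

```python
import numpy as np

def pinball_loss(y, y_pred, alpha):
    """l_alpha(y, y') = max{(alpha - 1)(y - y'), alpha * (y - y')}."""
    diff = y - y_pred
    return np.maximum((alpha - 1.0) * diff, alpha * diff)

def moments_from_quantiles(y_tilde):
    """Mean and sample variance of the Q - 1 predicted quantiles y_tilde."""
    mu = y_tilde.mean()
    var = ((y_tilde - mu) ** 2).sum() / (len(y_tilde) - 1)
    return mu, var

# Example: a (hypothetical) quantile grid produced by some predictor psi(x).
y_tilde = np.array([-0.012, -0.003, 0.001, 0.004, 0.011])
mu_hat, var_hat = moments_from_quantiles(y_tilde)
```

For α = 0.9, under-prediction (y above y') is penalized with weight 0.9 and over-prediction with weight 0.1, which is what pushes the minimizer toward the 0.9-quantile.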
For our situation, the target variable is y_t = ε̃_{i,t} and the explanatory variable is x_t = (ε̃_{i,t-H}, ..., ε̃_{i,t-1})^⊤. We want to construct a function ψ : R^H → R that estimates the conditional α-quantile of y_t. To this end, quantile regression solves the following minimization problem:

min_ψ Ê_{y_t, x_t}[ℓ_α(y_t, ψ(x_t))].

Here, Ê_{y_t, x_t} is understood as taking the empirical expectation with respect to y_t and x_t across t. We note that a similar application of quantile regression to forecasting conditional quantiles of time series has been considered in (Biau and Patra 2011).

Next, let Q > 0 be a given integer, and let α_j = j/Q (j = 1, ..., Q - 1) be an equispaced grid of quantiles. We consider the problem of simultaneously estimating the α_j-quantiles by a function ψ : R^H → R^{Q-1}. To do this, we define a loss function as

L_Q(y_t, ψ(x_t)) := ∑_{j=1}^{Q-1} ℓ_{α_j}(y_t, ψ_j(x_t)),

where ψ_j(x_t) is the j-th coordinate of ψ(x_t).

Once we obtain the estimated Q - 1 quantiles ỹ_t^{(j)} = ψ_j(x_t) (j = 1, ..., Q - 1), we can estimate the future mean of the target variable y_t as

μ̂_t := μ̂(ỹ_t) = (1/(Q - 1)) ∑_{j=1}^{Q-1} ỹ_t^{(j)}.   (3)

Similarly, we can estimate the future variance by the sample variance of the ỹ_t^{(j)},

σ̂_t² := σ̂²(ỹ_t) = (1/(Q - 2)) ∑_{j=1}^{Q-1} (ỹ_t^{(j)} - μ̂_t)²,   (4)

or a robust counterpart such as the median absolute deviation (MAD).

Portfolio construction
Given the estimated means and variances of the future spectral residuals, we finally construct a portfolio based on the optimality criteria offered by modern portfolio theory (Markowitz 1952). As mentioned in Section 2.2, the formula for the optimal portfolio requires the means and the covariances of the returns. Thanks to Proposition 1-(ii), we can approximate the covariance matrix of the spectral residual by a diagonal matrix. Precisely, once we calculate the predicted mean μ̂_{t,j} and the variance σ̂²_{t,j} of the spectral residual at time t, the weight for the j-th asset is given as b̂_j := λ^{-1} μ̂_{t,j} / σ̂²_{t,j}.

In the experiments in Section 4, we compare the performances of zero-investment portfolios. For trading strategies that do not output zero-investment portfolios, we apply a common transformation so that portfolios are centered and normalized. As a result, the eventual portfolio does not depend on the risk aversion parameter λ. See Appendix A.1 for details.

For the model of the quantile predictor ψ, we introduce two neural network architectures that take into account scale invariances studied in finance.

Volatility invariance
First, we consider an invariance property on the amplitudes of financial time series. It is known that financial time series exhibit a property called volatility clustering (Mandelbrot 1997). Roughly speaking, volatility clustering describes the phenomenon that large changes in a financial time series tend to be followed by large changes, while small changes tend to be followed by small changes. As a result, if we observe a certain signal as a financial time series, the signal obtained by positive scalar multiplication can be regarded as another plausible realization of a financial time series.

To incorporate such an amplitude invariance property into the model architecture, we leverage the class of positive homogeneous functions. Here, a function f : R^n → R^m is said to be positive homogeneous if f(ax) = a f(x) holds for any x ∈ R^n and a > 0. For example, any linear function and any ReLU neural network with no bias terms are positive homogeneous. More generally, we can model the class of positive homogeneous functions as follows. Let ψ̃ : S^{H-1} → R^{Q-1} be any function defined on the (H-1)-dimensional sphere S^{H-1} = {x ∈ R^H : ‖x‖ = 1}. Then, we obtain a positive homogeneous function as

ψ(x) = ‖x‖ ψ̃(x / ‖x‖).   (5)

Thus, we can convert any function class on the sphere into a model of amplitude-invariant predictors.

Time-scale invariance
Second, we consider an invariance property with respect to time-scale. There is a well-known hypothesis that time series of stock prices have fractal structures (Peters 1994). The fractal structure refers to a self-similarity property of a sequence; that is, if we observe a single sequence at several different sampling rates, we cannot infer the underlying sampling rates from the shapes of the downsampled sequences. The fractal structure has been observed in several real markets (Cao, Cao, and Xu 2013; Mensi et al. 2018; Lee et al. 2018). See Remark 1 in Appendix A.2 for further discussion of this property.

To take advantage of the fractal structure, we propose a novel network architecture that we call fractal networks. The key idea is that we can effectively exploit the self-similarity by applying a single common operation to multiple subsequences with different resolutions. By doing so, we expect that we can increase sample efficiency and reduce the number of parameters to train.

Here, we give a brief overview of the proposed architecture; a more detailed explanation is given in Appendix A.2. Our model consists of (a) a resampling mechanism and (b) two neural networks ψ₁ and ψ₂. The input-output relation of our model is described by the following procedure. First, given a single sequence x of stock returns, the resampling mechanism Resample(x, τ) generates a sequence that corresponds to the sampling rate specified by a scale parameter 0 < τ ≤ 1. We apply the Resample procedure for L different parameters τ₁ < ... < τ_L = 1 and generate L sequences. Next, we apply a common non-linear transformation ψ₁ modeled by a neural network. Finally, taking the empirical mean of these sequences aggregates the information at different sampling rates, to which we apply another network ψ₂. To sum up, the overall procedure can be written as the single equation

ψ(x) = ψ₂( (1/L) ∑_{i=1}^{L} ψ₁(Resample(x, τ_i)) ).   (6)
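The pipeline of (5) and (6) can be sketched as follows. The linear-interpolation `resample` and the toy stand-ins for ψ₁ and ψ₂ are our illustrative assumptions (the paper's exact Resample mechanism is defined in its Appendix A.2), not the authors' implementation:

```python
import numpy as np

def homogeneous(psi_tilde):
    """Amplitude invariance (5): psi(x) = ||x|| * psi_tilde(x / ||x||)."""
    def psi(x):
        norm = np.linalg.norm(x)
        return norm * psi_tilde(x / norm) if norm > 0 else 0.0 * psi_tilde(x)
    return psi

def resample(x, tau):
    """Illustrative Resample(x, tau): linearly interpolate the most recent
    fraction tau of the length-H window back onto a length-H grid."""
    H = len(x)
    grid = np.linspace((1.0 - tau) * (H - 1), H - 1, H)
    return np.interp(grid, np.arange(H), x)

def fractal_forward(x, psi1, psi2, taus):
    """Fractal network (6): psi(x) = psi2((1/L) * sum_i psi1(Resample(x, tau_i)))."""
    hidden = sum(psi1(resample(x, tau)) for tau in taus) / len(taus)
    return psi2(hidden)

# Toy stand-ins for the two networks; psi1 is made positive homogeneous via (5).
psi1 = homogeneous(lambda u: np.tanh(u))   # elementwise map on the unit sphere
psi2 = lambda h: h.mean()

x = np.sin(np.linspace(0.0, 6.0, 64))      # a synthetic return window
out = fractal_forward(x, psi1, psi2, taus=[0.25, 0.5, 1.0])
```

By construction, `psi1` satisfies ψ₁(ax) = aψ₁(x) for a > 0, and `resample(x, 1.0)` returns the original window.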
We conducted a series of experiments to demonstrate the effectiveness of our methods on real market data. In Section 4.1, we describe the details of the dataset and some common experimental settings used throughout this section. In Section 4.2, we test the validity of the spectral residual by a preliminary experiment. In Section 4.3, we evaluate the performance of our proposed system by experiments on U.S. market data. We also conducted similar experiments on Japanese market data and obtained consistent results. Due to space limitations, we provide the full results for the Japanese market data in Appendix E.
U.S. market data
For U.S. market data, we used the daily prices of stocks listed in the S&P 500 from January 2000 to April 2020. We used data before January 2008 for training and validation and the remainder for testing. We obtained the data from Alpha Vantage.

We used opening prices for the following reasons. First, the trading volume at the opening session is larger than that at the closing session (Amihud and Mendelson 1987), which means that trading at opening prices is practically easier than trading at closing prices. Moreover, a financial institution cannot trade a large amount of stocks during the closing period because doing so can be considered an illegal action known as "banging the close".
We adopted the delay parameter d = 1 (i.e., one-day delay) for updating portfolios. We set the look-back window size to H = 256, i.e., all prediction models can access the historical stock prices of up to H = 256 preceding business days. For the other parameters used in the experiments, see Appendix C.3.
We list the evaluation metrics used throughout our experiments.
• Cumulative Wealth (CW) is the total return yielded by the trading strategy: CW_T := ∏_{t=1}^{T} (1 + R_t).
Figure 2: Cumulative returns of reversal strategies over the raw returns, the FA residuals, and the spectral residuals. The reversal-based strategies are more robust against financial crises.
• Annualized Return (AR) is an annualized return rate defined as AR_T := (T_Y / T) ∑_{t=1}^{T} R_t, where T_Y is the average number of holding periods in a year.
• Annualized Volatility (AVOL) is the annualized risk defined as AVOL_T := ((T_Y / T) ∑_{t=1}^{T} R_t²)^{1/2}.
• Annualized Sharpe Ratio (ASR) is an annualized risk-adjusted return (Sharpe 1994), defined as ASR_T := AR_T / AVOL_T.

As mentioned in Section 2.2, we are mainly interested in ASR as the primary evaluation metric. AR and AVOL are auxiliary metrics for calculating ASR. While CW represents the actual profits, it often ignores the existence of large risk values. In addition to these, we also calculated some evaluation criteria commonly used in finance: Maximum DrawDown (MDD), Calmar Ratio (CR), and Downside Deviation Ratio (DDR). For completeness, we provide precise definitions in Appendix C.1.

As suggested in Section 3.1, the spectral residuals can be useful to hedge out the undesirable exposure to the market factors. To verify this, we compared the performances of trading strategies over (i) the raw returns, (ii) the residual factors extracted by factor analysis (FA), and (iii) the spectral residuals. For the FA, we fit the factor model (1) with K = 30 by the maximum likelihood method (Bartholomew, Knott, and Moustaki 2011) and extracted the residual factors as the remaining part. For the spectral residual, we obtained the residual sequence by subtracting C = 30 principal components from the raw returns. We applied both methods to windowed data of length H = 256.

In order to be agnostic to the choice of training algorithms, we used a simple reversal strategy. Precisely, for the raw return sequence r_t, we used a deterministic strategy obtained simply by normalizing the negation of the previous observation -r_{t-1} to be a zero-investment portfolio (see Appendix C.2 for the precise formula). We defined reversal strategies over the residual sequences in similar ways.

Figure 2 shows the cumulative returns of the reversal strategies performed on the Japanese market data. We see that the
reversal strategy based on the raw returns is significantly affected by several well-known financial crises, including the dot-com bubble crash in the early 2000s, the 2008 subprime mortgage crisis, and the 2020 stock market crash. On the other hand, the two residual-based strategies appear more robust against these financial crises. The spectral residual performed similarly to the FA residual in cumulative returns. Moreover, in terms of the Sharpe ratio, the spectral residuals performed better than the FA residuals.

Remarkably, the spectral residuals were calculated much faster than the FA residuals. In particular, we calculated both residuals using the entire dataset, which contains the records of all the stock prices. For the PCA and the FA, we used the implementations in the scikit-learn package (Pedregosa et al. 2011), and all the computations were run on 18 CPU cores of an Intel Xeon Gold 6254 processor (3.1 GHz). Extracting the spectral residuals took approximately 10 minutes, while the FA took approximately 13 hours.

Figure 3: The Cumulative Wealth in the U.S. market.
Figure 4: The Cumulative Wealth in the U.S. market without spectral residual extraction.

We evaluated the performance of our proposed system described in Section 3 on U.S. market data. The corresponding results for Japanese market data are provided in Appendix E.
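For reference, the evaluation metrics of Section 4.1 can be sketched as follows (function names are ours; the root-mean form of AVOL reflects our reading of the definition above, and T_Y defaults to 252 business days per year as an illustrative assumption):

```python
import numpy as np

def cumulative_wealth(R):
    """CW_T = prod_t (1 + R_t)."""
    return np.prod(1.0 + R)

def annualized_return(R, T_Y=252):
    """AR_T = (T_Y / T) * sum_t R_t."""
    return T_Y / len(R) * R.sum()

def annualized_volatility(R, T_Y=252):
    """AVOL_T = ((T_Y / T) * sum_t R_t^2)^(1/2)."""
    return np.sqrt(T_Y / len(R) * (R ** 2).sum())

def annualized_sharpe_ratio(R, T_Y=252):
    """ASR_T = AR_T / AVOL_T."""
    return annualized_return(R, T_Y) / annualized_volatility(R, T_Y)
```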
Baseline Methods
We compare our system with the following baselines: (i)
Market is the uniform buy-and-hold strategy. (ii)
AR(1) is the
AR(1) model with all coefficients being −1; this can be seen as the simple reversal strategy. (iii) Linear predicts returns by ordinary linear regression based on the previous H raw returns. (iv) MLP predicts returns by a multi-layer perceptron with batch normalization and dropout (Pal and Mitra 1992; Ioffe and Szegedy 2015; Srivastava et al. 2014). (v)
SFM is one of the state-of-the-art stock price prediction algorithms, based on the State Frequency Memory RNNs (Zhang, Aggarwal, and Qi 2017). Additionally, we compare our proposed system (
DPO) with some ablation models, which are similar to DPO except for the following points.
• DPO with No Quantile Prediction (DPO-NQ) does not use the information of the full distributional prediction; instead, it outputs conditional means trained by the ℓ2 loss.
• DPO with No Fractal Network (DPO-NF) uses a simple multi-layer perceptron instead of the fractal network.
• DPO with No Volatility Normalization (DPO-NV) does not use the normalization (5) in the fractal network.

Table 1: Performance comparison on U.S. market. All the methods except for Market are applied to the spectral residuals (SRes).

                  ASR↑    AR↑     AVOL↓   DDR↑    CR↑     MDD↓
Market            +0.607  +0.130  —       —       —       —
AR(1) on SRes     +0.858  +0.021  0.025   +1.470  +0.295  0.072
Linear on SRes    +0.724  +0.017  0.024   +1.262  +0.298  0.059
MLP on SRes       +0.728  +0.022  0.030   +1.280  +0.283  0.077
SFM on SRes       +0.709  +0.019  0.026   +1.211  +0.323  0.058
DPO-NQ            +1.237  +0.032  0.026   +2.169  +0.499  0.063
DPO-NF            +1.284  +0.027  —       +2.347  +0.627  —
DPO-NV            +1.154  +0.030  0.026   +2.105  +0.562  0.053
DPO (Proposed)    +1.393  +0.030  —       —       —       —
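For context on the DPO-NQ ablation: the full model predicts conditional quantiles, for which the standard training objective in quantile regression is the pinball loss (Koenker). The paper's exact objective is not restated in this section, so the following is only a generic sketch; the function name and array layout are our own:

```python
import numpy as np

def pinball_loss(y, q_pred, alphas):
    """Average pinball (quantile) loss over n samples and Q quantile levels.

    y       : (n,) realized returns
    q_pred  : (n, Q) predicted quantiles, one column per level
    alphas  : (Q,) target quantile levels in (0, 1)
    """
    u = y[:, None] - q_pred                           # forecast error per level
    return float(np.mean(np.maximum(alphas * u, (alphas - 1.0) * u)))
```

Minimizing this loss for levels spread over (0, 1) yields a discretized predictive distribution, from which mean and variance estimates can then be derived; DPO-NQ instead fits only the conditional mean.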
Performance on real market data
Figure 3 shows the cumulative wealth (CW) achieved in the U.S. market. Table 1 shows the results for the other evaluation metrics presented in Section 4.1. For the parameter C of the spectral residual, we used C = 10, which we determined solely from the training data (see Appendix D.2 for details). Overall, our proposed method DPO outperformed the baseline methods in multiple evaluation metrics. Regarding the comparison against the three ablation models, we make the following observations.
1.
Effect of the distributional prediction. We found that introducing the distributional prediction significantly improved the ASR. While
DPO-NQ achieved the best CW,
DPO performed better in the ASR. This suggests that, without the variance prediction,
DPO-NQ tends to pursue returns without regard to the risks taken. Generally, we observed that
DPO reduced the AVOL while not losing the AR.
2. Effect of the fractal network. Introducing the fractal network architecture also improved the performance in multiple evaluation metrics. In both markets, we observed that the fractal network contributed to increasing the AR while keeping the AVOL, which is suggestive of the effectiveness of leveraging the financial inductive bias on the return sequence.

Table 2: Performance comparison on U.S. market without spectral residual extraction.

          ASR↑    AR↑     AVOL↓   DDR↑    CR↑     MDD↓
AR(1)     +0.212  +0.011  0.051   +0.355  +0.067  0.160
Linear    +0.304  +0.016  0.052   +0.485  +0.127  0.125
MLP       +0.261  +0.013  0.048   +0.424  +0.103  0.122
SFM       +0.264  +0.014  0.051   +0.428  +0.079  0.171
DPO-NQ    +0.405  +0.020  0.048   +0.655  +0.172  0.114
DPO-NF    +0.854  +0.034  —       —       —       —
DPO-NV    +0.542  +0.029  0.054   +0.922  +0.238  0.123
DPO       +0.874  +0.032  —       —       —       —

3.
Effect of the normalization. We also saw the effectiveness of the normalization (5). Comparing
DPO and
DPO-NV, the normalization affected both the AR and the AVOL, resulting in an improvement in the ASR. This may occur because the normalization improves the sample efficiency by reducing the degrees of freedom of the model.

To see the effect of the spectral residuals, we also evaluated our proposed method and the baseline methods on the raw stock returns. Figure 4 and Table 2 show the results. Compared to the corresponding results with the spectral residuals, we found that the spectral residual extraction consistently improved the performance for every method. Some further intriguing observations are summarized as follows.
1. With the spectral residuals,
AR(1) achieved the best ASR among the baseline methods (Table 1), which was not observed on the raw return sequence (Table 2). This suggests that the spectral residuals encourage the reversal phenomenon (Poterba and Summers 1988) by suppressing the common market factors. Interestingly, without extracting the spectral residuals, the CWs cross repeatedly during the test period, and no single baseline method consistently beats the others (Figure 4). A possible reason is that the strong correlation between the raw stock returns increases the exposure to the common market risks.
2. We found that our network architectures are still effective on the raw sequences. In particular,
DPO outperformed all the other methods in multiple evaluation metrics.
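The factor-hedging property discussed in item 1 (and analyzed formally in Appendix B) is easy to check numerically on synthetic data. This is a hedged sketch under a toy linear factor model; the sizes, seed, and variable names are arbitrary illustrations:

```python
import numpy as np

# Toy linear factor model r = B f + eps with isotropic residual factors.
rng = np.random.default_rng(0)
S, C, n = 30, 3, 100_000
B = rng.normal(size=(S, C))
f = rng.normal(size=(n, C))             # common factors, Var(f_i) = 1
eps = 0.1 * rng.normal(size=(n, S))     # residual factors, isotropic
r = f @ B.T + eps

# Spectral residual: project onto the eigenvectors of the covariance
# matrix corresponding to the S - C smallest eigenvalues.
Sigma = np.cov(r, rowvar=False)
_, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
A_res = V[:, :S - C] @ V[:, :S - C].T
eps_tilde = r @ A_res                   # A_res is symmetric

# Cross-covariance between spectral residuals and factors: near zero.
cross_cov = eps_tilde.T @ f / n
print(np.abs(cross_cov).max())          # small, shrinking as n grows
```

Running the same cross-covariance on the raw returns r instead of eps_tilde gives entries on the order of the factor loadings, which is exactly the market exposure the extraction removes.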
Trading based on factor models is one of the popular strategies for quantitative portfolio management (e.g., (Nakagawa, Uchida, and Aoshima 2018)). One of the best-known factor models is that of Fama and French (Fama and French 1992, 1993), who put forward a model explaining returns in the US equity market with three factors: the market return factor, the size (market capitalization) factor, and the value (book-to-market) factor. Historically, residual factors were treated as errors in factor models (Sharpe 1964). However, (Blitz, Huij, and Martens 2011; Blitz et al. 2013) suggested that there exists predictability in residual factors. In modern portfolio theory, lower correlation among investment returns enables earning larger risk-adjusted returns (Markowitz 1952), and residual factors are by nature less correlated than the raw stock returns. Consequently, (Blitz, Huij, and Martens 2011; Blitz et al. 2013) demonstrated that residual factors enable earning larger risk-adjusted returns.
With the recent advances in deep learning, various deep neural networks have been applied to stock price prediction (Chen et al. 2019). Some deep neural networks for time series have also been applied to stock price prediction (Fischer and Krauss 2018). Compared to other classical machine learning methods, deep learning enables learning with fewer a priori representational assumptions, provided that sufficient amounts of data and computational resources are available. Even when data is insufficient, introducing inductive biases into a network architecture can still facilitate deep learning (Battaglia et al. 2018). Technical indicators are often used for stock prediction (e.g., (Metghalchi, Marcucci, and Chang 2012; Neely et al. 2014)), and (Li et al. 2019) used technical indicators as inductive biases of a neural network. (Zhang, Aggarwal, and Qi 2017) used a recurrent model that can analyze frequency domains so as to distinguish trading patterns of various frequencies.
We proposed a system for constructing portfolios. The key technical ingredients are (i) a spectral decomposition-based method to hedge out common market factors and (ii) a distributional prediction method based on a novel neural network architecture incorporating financial inductive biases. Through empirical evaluations on real market data, we demonstrated that our proposed method can significantly improve the performance of portfolios on multiple evaluation metrics. Moreover, we verified that each of our proposed techniques is effective on its own. We believe that our techniques may have wide applications in various financial problems.
Acknowledgment
We thank the anonymous reviewers for their constructive suggestions and comments. We also thank Masaya Abe, Shuhei Noma, Prabhat Nagarajan, and Takuya Shimada for helpful discussions.
References
Amihud, Y.; and Mendelson, H. 1987. Trading mechanisms and stock returns: An empirical investigation.
The Journal of Finance
Bartholomew, D. J.; Knott, M.; and Moustaki, I. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, 3rd edition. Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Biau, G.; and Patra, B. 2011. Sequential Quantile Prediction of Time Series.
IEEE Transactions on Information Theory
Journal of Financial Markets
Journal of Empirical Finance
Physica A: Statistical Mechanics and its Applications
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2376–2384. Choudhry, R.; and Garg, K. 2008. A hybrid machine learning system for stock market forecasting.
World Academy of Science, Engineering and Technology
Quantitative Finance
1: 223–236. Davis, C.; and Kahan, W. M. 1970. The Rotation of Eigenvectors by a Perturbation. III.
SIAM Journal on Numerical Analysis
NAACL-HLT. Fama, E. F.; and French, K. R. 1992. The Cross-Section of Expected Stock Returns.
The Journal of Finance
Journal of Financial Economics. Fama, E. F.; and French, K. R. 2015. A five-factor asset pricing model.
Journal of Financial Economics
European Journal of Operational Research, 6645–6649. ISSN 2379-190X. Grossman, S. J.; and Zhou, Z. 1993. Optimal investment strategies for controlling drawdowns.
Mathematical finance
Neural Computation. arXiv preprint arXiv:1502.03167. Jegadeesh, N.; and Titman, S. 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.
The Journal of Finance
Journal of Financial and Quantitative Analysis
International Conference for Learning Representations (ICLR). Kintzel, D. 2007. Portfolio theory, life-cycle investing, and retirement income.
Social Security Administration Policy Brief
Quantile Regression. Econometric Society Monographs. Cambridge University Press. doi:10.1017/CBO9780511754098. Krauss, C.; Do, X. A.; and Huck, N. 2017. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500.
European Journal of Operational Research
The European Physical Journal B
Advances in Neural Information Processing Systems 25, 1097–1105. Curran Associates, Inc. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. LeCun, Y.; Haffner, P.; Bottou, L.; and Bengio, Y. 1999. Object Recognition with Gradient-Based Learning. In Shape, Contour and Grouping in Computer Vision, 319–345. Springer Berlin Heidelberg. Lee, M.; Song, J.; Kim, S.; and Chang, W. 2018. Asymmetric market efficiency using the index-based asymmetric-MFDFA.
Physica A: Statistical Mechanics and its Applications
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 894–902. Lin, T.-C.; and Liu, X. 2018. Skewness, individual investor preference, and the cross-section of stock returns.
Review of Finance
International Journal of Theoretical and Applied Finance
The Journal of Finance
Fractals and scaling in finance, 371–418. Springer. Mandelbrot, B. B.; and Ness, J. W. V. 1968. Fractional Brownian Motions, Fractional Noises and Applications.
SIAM Review
The Journal of Finance
Physica A: Statistical Mechanics and its Applications
Applied Economics
Risk
International business research
ECML PKDD 2018 Workshops, 37–50. Springer. Neely, C. J.; Rapach, D. E.; Tu, J.; and Zhou, G. 2014. Forecasting the equity risk premium: the role of technical indicators.
Management science
IEEE Transactions on Neural Networks
Economics Bulletin
International Journal of Pure andApplied Mathematics
Journal of MachineLearning Research
12: 2825–2830. Peters, E. E. 1994.
Fractal market analysis: applying chaos theory to investment and economics, volume 24. John Wiley & Sons. Poterba, J. M.; and Summers, L. H. 1988. Mean reversion in stock prices: Evidence and Implications.
Journal of Financial Economics
Foundations of Machine Learning— Spring
The journal of finance
Journal of Portfolio Management
The Journal of Investing
Journal of MachineLearning Research
TheJournal of Alternative Investments
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1900–1908. Young, T. W. 1991. Calmar ratio: A smoother tool.
Futures
Biometrika
Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2141–2149.

Appendix
This is the supplementary material for the paper entitled“Deep Portfolio Optimization via Distributional Predictionof Residual Factors”.
A Technical details of the proposed method
In this section, we provide some technical details for ourproposed method in Section 3.
A.1 Detailed calculation of portfolio
Let µ_t be the expected return vector at time t, and Σ_t be the covariance matrix of returns. As we explained in Section 2.2, the (ideal) optimal portfolio can be written as b*_t = λ^{-1} Σ_t^{-1} µ_t, where λ > 0 is a predefined risk aversion parameter. Therefore, having estimators µ̂_t and Σ̂_t for these population quantities, a natural construction of the portfolio is the following plug-in rule:

  b̂_t := (1/λ) Σ̂_t^{-1} µ̂_t.

In our proposed method, we construct a portfolio over the spectral residual ε̃_t. For the estimators of the mean and the covariance, we use the quantile regression-based estimators explained in the previous subsection. In particular, we approximate the covariance matrix by a diagonal matrix diag(σ̂²_{t,1}, ..., σ̂²_{t,S}). See also Appendix B.2 for empirical and theoretical justifications of this diagonal approximation. As a result, the weight for the j-th residual factor is given as

  b^res_{t,j} := µ̂_{t,j} / (λ σ̂²_{t,j}),

where µ̂_{t,j} and σ̂²_{t,j} are the predicted means and variances given as (3) and (4), respectively.

We now have a portfolio b^res_t defined for the spectral residuals. Recall that the spectral residual is obtained as a linear transformation of the (centered) raw returns as ε̃_t = A_t r_t (see Section 3.1). To obtain a portfolio for the raw return r_t, we use the relation b_t = A_t^T b^res_t.

In our experiment, we apply a common transformation to the output of any trading strategy so that it becomes a zero-investment portfolio. Here, we explain the detail of this transformation. Let b_t be a given portfolio. Assume that b_t is not proportional to the all-one vector 1 = (1, ..., 1)^T. This also implies that b_t is not proportional to the uniform buy-and-hold strategy. To convert b_t to a zero-investment portfolio, we subtract the average b̄_t = (1/S) Σ_{j=1}^S b_{t,j} from every coordinate, and then renormalize the portfolio so that the sum of the absolute values is unity.
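A minimal numpy sketch of the plug-in rule and the zero-investment normalization described in this subsection (the function name and the diagonal-variance convention are our own; the risk aversion λ is omitted since it cancels after normalization):

```python
import numpy as np

def portfolio_weights(mu_hat, var_hat, A):
    """Plug-in portfolio on the spectral residuals, mapped back to raw returns.

    mu_hat  : (S,) predicted means of the residual factors
    var_hat : (S,) predicted variances (diagonal covariance approximation)
    A       : (S, S) spectral-residual projection, eps_tilde = A r
    """
    b_res = mu_hat / var_hat        # diagonal Sigma^{-1} mu; lambda dropped
    b = A.T @ b_res                 # back to raw returns: b = A^T b_res
    b = b - b.mean()                # zero-investment: coordinates sum to zero
    return b / np.abs(b).sum()      # absolute weights sum to one
```

The final two lines implement the normalization b ↦ (b − b̄1)/‖b − b̄1‖₁, which removes the dependence on λ (assuming b is not proportional to the all-one vector).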
The resulting portfolio is given as

  (b_t − b̄_t 1) / ‖b_t − b̄_t 1‖_1.

Note that the normalized portfolio defined in this way does not depend on the parameter λ.

Figure 5: Illustration of the fractal network.

A.2 Fractal networks
Here, we explain the detailed structure of the fractal network introduced in Section 3.3. Figure 5 illustrates the structure of the fractal network. Let x denote the input of the network. First, the fractal network applies the resampling mechanism Resample(x, τ_i) for several scale parameters τ_1 > τ_2 > ··· > τ_L > 0 to generate multiple views of x with different sampling rates. To be precise, the resampling mechanism outputs a sequence by the following four procedures:
(i) Cumulation. First, given an input x = (x_1, ..., x_H), we calculate the cumulative sum z = (x_1, x_1 + x_2, ..., x_1 + ··· + x_H). Since the input x corresponds to the residual factors of the returns, each coordinate variable x_s is understood as the increment, i.e., the difference of stock prices at two adjacent time periods. Hence, we use the cumulative sum so that it corresponds to (logarithmic) stock prices and exhibits the fractal structure.
(ii) Resampling. Second, given 0 < τ ≤ 1 and z, we generate a shorter sequence z_τ of a fixed length H′ (< H) by downsampling from (z_{⌊(1−τ)H⌋}, ..., z_H). In other words, the resulting sequence is a subsequence of length ⌈τH⌉ located at the end of the original sequence.
(iii) Differentiation. Third, we again take the first difference of the downsampled sequence z_τ to obtain the corresponding return sequence.
(iv) Rescaling. Finally, we normalize the output by multiplying the entire sequence by τ^{−1/2}. On the choice of the multiplicative factor, see also Remark 1 below.
Next, we apply a common non-linear transformation ψ_1 to these views. Finally, we take the average of the results and apply another transformation ψ_2. The overall procedure is summarized in the following equation:

  ψ(x) = ψ_2( (1/L) Σ_{i=1}^L ψ_1(Resample(x, τ_i)) ).   (7)

To incorporate the volatility invariance (Section 3.3) into the fractal network, we want to make the overall transformation x ↦ ψ(x) positively homogeneous. To do this, it suffices to ensure that the non-linear transformations ψ_1 and ψ_2 are positively homogeneous. This can be proved by combining the following claims.

Lemma 1.
Let f_1, f_2, ..., f_L be any positively homogeneous functions.
(i) Any linear transformation is positively homogeneous.
(ii) Suppose the composition f_1 ∘ f_2 can be defined. Then, f_1 ∘ f_2 is positively homogeneous.
(iii) The concatenation x ↦ (f_1(x)^T, ..., f_L(x)^T)^T is positively homogeneous.
(iv) The average x ↦ (1/L) Σ_{i=1}^L f_i(x) is positively homogeneous.
(v) The resampling mechanism x ↦ Resample(x, τ) is positively homogeneous for any 0 < τ ≤ 1.

Proof. Since (i), (ii), and (iii) are almost obvious, we omit their proofs. (iv) is derived from (i), (ii), and (iii). As for (v), it is easy to see that the four operations in the resampling mechanism (i.e., cumulation, resampling, differentiation, and rescaling) are all linear, so the resampling mechanism is a linear transformation and thus positively homogeneous.
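As a concrete illustration of Lemma 1 and Equation (7), the sketch below implements the resampling mechanism and a bias-free ReLU version of ψ_1 and ψ_2, then checks that the overall map is positively homogeneous. The layer sizes, the evenly spaced downsampling rule, and the scale set are our own assumptions; the paper's actual networks also use dropout and batch normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
H, H_OUT = 256, 64

def resample(x, tau):
    """Resample(x, tau): cumulate, keep the last ~tau*H prices, downsample,
    differentiate, and rescale by tau^{-1/2}. All four steps are linear."""
    z = np.cumsum(x)                                  # (i) cumulation
    tail = z[int(np.floor((1.0 - tau) * len(x))):]    # (ii) resampling ...
    idx = np.round(np.linspace(0, len(tail) - 1, H_OUT + 1)).astype(int)
    d = np.diff(tail[idx])                            # (iii) differentiation
    return d * tau ** -0.5                            # (iv) rescaling

# Bias-free ReLU networks are positively homogeneous (Lemma 1 (i)-(ii)).
W1 = rng.normal(size=(32, H_OUT))
W2 = rng.normal(size=(8, 32))

def psi1(v):
    return np.maximum(W1 @ v, 0.0)

def psi2(v):
    return W2 @ np.maximum(v, 0.0)

def psi(x, taus=(1.0, 0.5, 0.25)):
    """Overall fractal network, Equation (7): average per-scale features."""
    return psi2(np.mean([psi1(resample(x, t)) for t in taus], axis=0))

x = rng.normal(size=H)
for a in (0.5, 2.0, 7.3):
    assert np.allclose(psi(a * x), a * psi(x))  # psi(a x) = a psi(x), a > 0
```

Because every component is positively homogeneous, scaling the input volatility by any a > 0 scales the output by exactly a, which is the volatility invariance the architecture is designed to encode.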
Remark 1.
In the rescaling phase, the choice of the multiplicative factor τ^{−1/2} is sensible for the following reason. If the underlying law is the fractional Brownian motion with Hurst index H, the appropriate scaling factor determined by its self-similarity is τ^{−H} (Mandelbrot and Ness 1968). It has been reported in (Kristoufek and Vosvrda 2014) that, in many real-world markets, estimated Hurst indices are approximately H ≈ 0.5, which implies that stock prices exhibit similar self-similarity properties as the standard Brownian motion.

Remark 2.
In some papers, the term "fractal property" stands for the behavior of stochastic processes captured by the fractional Brownian motion with H ≠ 1/2, which can lead to inefficiency of the market. In this paper, however, we focus on the self-similarity of processes to design the network architecture. The self-similarity can be observed even in efficient markets with H ≈ 0.5.

B Theoretical analysis
In this section, we provide theoretical analyses of the spectral residuals, justifying that they can hedge out the market factors. In Section B.1, we prove Proposition 1, which shows that the spectral residuals are actually uncorrelated with the market factors when the true residual factors are isotropic. In practice, the covariance of the residual factors can be approximated by diagonal matrices; in Section B.2, we explain why this approximation is justified. In Section B.3, we investigate a more general situation where the true residual factors are anisotropic, i.e., the variances of the coordinate variables differ.

Before proceeding, let us recall the definition of the spectral residual. Let r be a zero-mean random vector with a covariance matrix Σ ∈ R^{S×S}. Let Σ = V Λ V^T be an eigendecomposition of Σ, where V = [v_1, ..., v_S] is an orthogonal matrix and Λ = diag(λ_1, ..., λ_S) is a diagonal matrix of the eigenvalues. Assume that λ_1, ..., λ_S are sorted in descending order, i.e., λ_1 ≥ ··· ≥ λ_S. Given an integer 1 ≤ C < S, let V_res = [v_{C+1}, ..., v_S] be the S × (S − C) matrix that consists of the eigenvectors that correspond to the smallest S − C eigenvalues. We also define a matrix A_res as

  A_res = V_res V_res^T.

Note that A_res is the orthogonal projection onto the space spanned by the principal portfolios with the smallest S − C eigenvalues. Then, we define the spectral residual as

  ε̃ = A_res r.   (8)

B.1 Isotropic residuals
First, we prove the following result, which corresponds toProposition 1 in the main body.
Proposition 2 (Proposition 1 in the main body, restated). Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix of full column rank, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ² > 0.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Then, we have the following.
(i) The spectral residual ε̃ defined in (8) is uncorrelated with the common factor f.
(ii) The covariance matrix of ε̃ is given as σ² A_res. Under a suitable assumption, this can be approximated as a diagonal matrix.

Proof.
Let BB^T = V_mar Λ V_mar^T be the eigendecomposition of the market factor covariance BB^T, where V_mar = [v_1, ..., v_S] is an orthogonal matrix and Λ is a diagonal matrix of the eigenvalues. Since rank BB^T ≤ C, we can write Λ = diag(λ_1, ..., λ_C, 0, ..., 0) with S − C trailing zeros. Note that λ_i > 0 for all i ∈ {1, ..., C} since BB^T is positive semi-definite and has rank C. (In general, V_res is not unique because there can be multiple eigenvectors having the same eigenvalues; for simplicity, we assume that λ_C > λ_{C+1} so that V_res is uniquely determined.) Since the coordinate variables in f and ε are uncorrelated, the covariance matrix of r is calculated as

  Σ = BB^T + σ²I = V_mar(Λ + σ²I) V_mar^T.

This gives the eigendecomposition of Σ, and its eigenvalues are λ_1 + σ², ..., λ_C + σ² (the first C), followed by σ², ..., σ² (the remaining S − C). In particular, the projection matrix for the spectral residual is given as

  A_res = [v_{C+1}, ..., v_S][v_{C+1}, ..., v_S]^T.   (9)

Consequently, we have ε̃ = A_res(Bf + ε) = A_res ε, which is uncorrelated with Bf (thus we proved (i)). Moreover, the covariance matrix of A_res ε is given as σ² A_res A_res^T = σ² A_res. Regarding the diagonal approximation of this matrix, see the next subsection for details.

B.2 On the near-orthogonality of spectral residuals
When we construct a portfolio over the spectral residuals, we use a diagonal approximation of the covariance matrix (see Section 3.2 and Appendix A.1). As mentioned in the previous subsection, this covariance matrix is proportional to A_res. Hence, to justify the diagonal approximation, we should show that A_res becomes nearly diagonal. Figure 6 shows examples of A_res calculated empirically from U.S. market data. For several choices of the parameter C, these matrices are reasonably close to diagonal matrices.

But why does this happen? Generally speaking, a projection matrix constructed as (9) is not necessarily approximated well by a diagonal matrix. However, in financial markets, we may assume that the following properties hold for the first several principal portfolios, which may justify the diagonal approximation.
• Well-spreading factors: In a financial market, a large proportion of a stock return is often explained by a small number of common market factors. A market factor may be almost equally shared by all the stocks in the market. For example, the first principal portfolio can be seen as the factor of the overall market, which may be close to the "equal weighting index" of stocks. In this case, removing the market factor changes all the elements of the covariance matrix only very slightly.
•
Spiked factors: Suppose that, in an investment horizon, a certain company's stock price is affected by a factor that is independent of any other market factors. In this situation, there can be a principal portfolio that arises only from a single stock. Removing this factor is just equivalent to removing the corresponding stock, which does not affect the off-diagonal elements of the covariance matrix.
To formalize the above idea, we introduce the following two notions.
Definition 3.
Let v ∈ R^S be a unit vector (i.e., ‖v‖ = 1). Let δ > 0 be a positive number.
• We say that v is δ-spreading if |v_i| ≤ √(δ/S) for all i ∈ {1, ..., S}.
• We say that v is δ-spiked if there exists i* ∈ {1, ..., S} such that |v_i| ≤ δ/S for all i ≠ i*.
The above properties ensure that the off-diagonal elements of vv^T are sufficiently small.

Proposition 3.
Suppose that a unit vector v ∈ R^S is either δ-spreading or δ-spiked for some δ > 0. Then, for all i ≠ j, the (i, j) element of P = vv^T is bounded as |P_{ij}| = |v_i v_j| ≤ δ/S.

Proof.
The conclusion is obvious if v is δ-spreading. If v is δ-spiked, we have |v_i v_j| ≤ 1 · δ/S = δ/S for any i ≠ j.

Now, let us consider the linear factor model r = Bf + ε. We assume similar conditions as in Proposition 2. In particular, we assume that ε has an isotropic covariance matrix σ²I. Our goal is to show that the spectral residual A_res r has a nearly isotropic covariance matrix under a suitable condition. Let BB^T = V_mar Λ V_mar^T be an eigendecomposition of the market factor, where V_mar is an orthogonal matrix and Λ = diag(λ_1, ..., λ_C, 0, ..., 0) with λ_1 ≥ ··· ≥ λ_C > 0. Then, as in Proposition 2, we have

  A_res = [v_{C+1}, ..., v_S][v_{C+1}, ..., v_S]^T = I − P_mar,

and the spectral residual is given as A_res r = A_res ε. Here,

  P_mar = [v_1, ..., v_C][v_1, ..., v_C]^T

is the projection matrix onto the space of the market factors. Thus, the covariance matrix of the spectral residual is given as σ² A_res. The following proposition explains that this covariance matrix is nearly diagonal if the top-C principal portfolios are either "well-spreading" or "nearly spiked".

Proposition 4.
We use similar notations and assumptions as above. Suppose that each v_i (i ∈ {1, ..., C}) is either δ-spreading or δ-spiked. Then, the off-diagonal elements of A_res are bounded as |[A_res]_{i,j}| ≤ Cδ/S for any i ≠ j.

Proof.
Combining the equality P_mar = Σ_{i=1}^C v_i v_i^T with the fact that the absolute values of the off-diagonal elements of each v_i v_i^T are not larger than δ/S, we have |[A_res]_{i,j}| = |[P_mar]_{i,j}| ≤ Cδ/S for any i ≠ j. Hence the result.

B.3 Anisotropic residuals
Proposition 2 does not hold when the true residual factors ε_1, ..., ε_S are anisotropic. However, we can show that the spectral residual is almost uncorrelated with the market factors if
• the volatilities of the market factors are sufficiently larger than those of the residual factors, and
• the residual factors are almost isotropic.

Figure 6: The absolute values of the projection matrix A_res calculated from U.S. market data. Three panels show results for (a) C = 10, (b) C = 50, and (c) C = 200, respectively. In all cases, the projection matrices are reasonably close to diagonal matrices.

Here, we provide a quantitative justification of this claim by using matrix perturbation theory (Davis and Kahan 1970; Yu, Wang, and Samworth 2014). Before describing the result, we introduce some notation. For any S × S matrix A, ‖A‖_F denotes the Frobenius norm defined as ‖A‖²_F = Σ_{i,j} a²_{ij}. For any orthogonal projection matrix P, let V(P) denote the corresponding linear subspace (i.e., the largest linear subspace that is invariant under P). Let P_1 and P_2 be any two orthogonal projection matrices having the same rank, and let V_i = V(P_i) (i = 1, 2). Then, the quantity

  ‖sin Θ(V_1, V_2)‖_F := ‖P_1(I − P_2)‖_F = ‖P_2(I − P_1)‖_F

measures the "principal angles" between the two subspaces V_1 and V_2 (Davis and Kahan 1970). In particular, this quantity becomes zero if and only if the two subspaces agree.

Proposition 5.
Let r be a random vector in R^S generated according to a linear model r = Bf + ε, where B is an S × C matrix of full rank, and f ∈ R^C and ε ∈ R^S are zero-mean random vectors. Assume that the following conditions hold:
• Var(f_i) = 1 and Var(ε_k) = σ²_k > 0 for k ∈ {1, ..., S}.
• The coordinate variables in f and ε are uncorrelated, that is, E[f_i f_j] = 0, E[ε_k ε_ℓ] = 0, and E[f_i ε_k] = 0 hold for any i ≠ j and k ≠ ℓ.
Let P_mar be the orthogonal projection matrix onto the linear space spanned by B, and let A_res be the projection matrix of the spectral residuals. Let λ_min be the smallest positive eigenvalue of BB^T. Then, we have

  ‖P_mar A_res‖_F ≤ 2√S (max_i σ²_i − min_i σ²_i) / λ_min.   (10)

Here, we give an interpretation of (10). If ‖P_mar A_res‖_F is shown to be small, the spectral residual A_res r is nearly orthogonal to any market factors. The right-hand side of (10) can be small when either of the following conditions is satisfied: (i) if the residual factors are nearly isotropic, then max_i σ²_i − min_i σ²_i is small; (ii) if the smallest variance of the market factor λ_min is much larger than √S (max_i σ²_i − min_i σ²_i), the right-hand side becomes small.

Proof.
Since f and ε are uncorrelated, the covariance matrix of the generative model r = Bf + ε can be written as Σ = BB^T + Q, where Q = diag(σ²_1, ..., σ²_S) is the covariance of the residual factors. Let s̄ = (1/S) Σ_{i=1}^S σ²_i be the averaged variance of the residual factors. As such, s̄I is the closest isotropic covariance matrix to Q in the following sense:

  s̄I ∈ argmin_{sI : s ≥ 0} ‖Q − sI‖_F.

Let Q̂ = s̄I − Q. Define ∆_iso ≥ 0 as

  ∆_iso := ‖Q̂‖_F = min_{sI : s ≥ 0} ‖Q − sI‖_F = (Σ_{i=1}^S (σ²_i − s̄)²)^{1/2},

which quantifies how far Q is from being isotropic. As in the proof of Proposition 2, let BB^T = V_mar Λ V_mar^T be the eigendecomposition of BB^T. Define Σ̂ as

  Σ̂ = Σ + Q̂ = BB^T + s̄I.

Since BB^T and s̄I are simultaneously diagonalizable by V_mar, we can write Σ̂ = V_mar(Λ + s̄I) V_mar^T. In particular, the eigenvalues of Σ̂ are λ_1 + s̄, ..., λ_C + s̄ (the first C), followed by s̄ (the remaining S − C). Let P_mar be the orthogonal projection matrix onto the linear subspace spanned by the column vectors of B (i.e., the market factors). Note that such a linear subspace is spanned by v_1, ..., v_C, and P_mar is computed as P_mar = [v_1, ..., v_C][v_1, ..., v_C]^T. Also, let Σ = U M U^T be the eigendecomposition of Σ, where U = [u_1, ..., u_S] is an orthogonal matrix, and M = diag(µ_1, ..., µ_S) is a diagonal matrix with µ_1 ≥ ··· ≥ µ_C > µ_{C+1} ≥ ··· ≥ µ_S > 0. Then, the projection matrix for the spectral residual is given as A_res = [u_{C+1}, ..., u_S][u_{C+1}, ..., u_S]^T. Now, P_mar and I − A_res are the projection matrices that correspond to the eigenvectors with the largest C eigenvalues of Σ̂ and Σ, respectively.
From the Davis–Kahan sin Θ theorem (Davis and Kahan 1970; Yu, Wang, and Samworth 2014), we conclude

  ‖P_mar A_res‖_F ≤ 2‖Σ̂ − Σ‖_F / min_{1≤i≤C} λ_i = 2∆_iso / min_{1≤i≤C} λ_i.

We also obtain the weaker inequality (10) since ∆_iso ≤ √S (max_i σ²_i − min_i σ²_i).

C Details for experimental settings
In this section, we provide some more details on the experi-ments in Section 4.
C.1 Definitions of additional evaluation metrics

• Maximum DrawDown (MDD) is the maximum loss from a peak to a trough (Grossman and Zhou 1993), which can measure one aspect of downside risk:
$$\mathrm{MDD}_T = \max_{t \in \{1, \ldots, T\}} \; \max_{s \in \{1, \ldots, t\}} \left( \frac{\mathrm{CW}_s - \mathrm{CW}_t}{\mathrm{CW}_s} \right). \quad (11)$$

• Calmar Ratio (CR) is a risk-adjusted return based on the maximum drawdown (Young 1991): $\mathrm{CR}_T := \mathrm{AR}_T / \mathrm{MDD}_T$.

• Downside Deviation Ratio (DDR) (a.k.a. Sortino Ratio) is a variation of the Sharpe ratio (Sortino and Price 1994). While the Sharpe ratio regards overall volatility as risk, the DDR regards only the volatility caused by negative returns as harmful risk:
$$\mathrm{DDR}_T := \frac{\mathrm{AR}_T}{\sqrt{\frac{1}{T} \sum_{t=1}^{T} \min(0, R_t)^2}}. \quad (12)$$

C.2 Definitions of some baseline methods

• Reversal strategy. In Section 4.2 and Section 4.3, we used a simple reversal strategy (AR(1)) as a benchmark method, which is defined as follows. Let $r_t$ ($t = 1, 2, \ldots$) be either a return sequence or a transformed return sequence (e.g., the spectral residuals). Then, the reversal strategy $b_t$ is defined by renormalizing $-r_{t-1}$ to be a zero-investment portfolio, that is,
$$b_t = - \frac{r_{t-1} - \bar{r}_{t-1}}{\| r_{t-1} - \bar{r}_{t-1} \|}, \quad \text{where } \bar{r}_{t-1} = \frac{1}{S} \sum_{i=1}^{S} r_{i,t-1}.$$

[Footnote to the proof in Appendix B: the assumption $\mu_C > \mu_{C+1}$ holds in many cases where the market factors are much larger than the residual factors. For example, let $\delta = \| \widehat{Q} \|_{\mathrm{op}} = \max_j | \sigma_j^2 - \bar{s} |$, and suppose that $\lambda_C = \min_i \lambda_i > 2\delta$. Then, from the well-known Weyl's inequality on eigenvalues, we have $\mu_C \ge \lambda_C + \bar{s} - \delta > \bar{s} + \delta \ge \mu_{C+1}$.]

• MLP-based prediction. In Section 4.2, we also used a neural network-based prediction of returns (MLP). We used a fully connected neural network whose hidden layers each have 512 nodes, 50% dropout, and batch normalization.

C.3 Details of network architectures and hyperparameters
Here, we explain the details of the architecture that we used for the distributional prediction. For each coordinate $i \in \{1, \ldots, S\}$ of the spectral residual, we applied a common non-linear function $\psi : \mathbb{R}^H \to \mathbb{R}^{Q-1}$ that predicts $Q - 1$ quantiles based on the past $H$ observations. We designed $\psi$ by the fractal network introduced in Section 3.3. The detailed specifications are as follows.

• We estimate $Q - 1$ quantiles for each stock. In particular, the $j$-th coordinate of $\psi$ corresponds to the $j$-th quantile of the future distribution.

• The resampling mechanism $\mathrm{Resample}(x, \tau)$ outputs sequences of a fixed length $H' = 64$. For the scale parameters, we used $\tau_j := 4^{-j/m}$ with $j \in \{0, \ldots, J\}$.

• For the function $\psi_1 : \mathbb{R}^{H'} \to \mathbb{R}^{K}$, we used a fully connected neural network with 3 hidden layers. Each layer has 256 nodes, 50% dropout, and batch normalization.

• For the function $\psi_2 : \mathbb{R}^{K} \to \mathbb{R}^{Q-1}$, we used a fully connected neural network with 8 hidden layers. Each layer has 128 nodes, 50% dropout, and batch normalization.

• For training, we used the Adam optimizer (Kingma and Ba 2015).

D Supplementary experiments for spectral residuals
In this section, we provide additional experimental results on the spectral residual. In Section D.1, we conduct a simple experiment to show that the short-term behavior of the (empirical) spectral residual is reasonably stable. In Section D.2, we discuss a simple way to determine the parameter C from the training data.

D.1 Local stability of spectral residuals
We check whether the spectral residuals are locally stable. This is not trivial because (i) the spectral residual is calculated locally on time windows, and (ii) the long-term behavior of the financial time series is highly non-stationary. To this end, we here investigate the ability of the spectral residuals to reduce the volatility of sequences.

For each time t, we calculated the projection matrix (A_t defined in Section 3.1) and the spectral residuals ε̃_s for t − H ≤ s ≤ t − 1. We also generated the spectral residuals for the unseen duration t ≤ s < t + H by fixing A_t and extrapolating the relation ε̃_s = A_t r_s. Then, we calculated the volatility Vol(ε̃_s) of the spectral residual at each s. Throughout, we fixed H = 256, and we varied the number of principal components to be removed as C ∈ {0, 1, 10, 20, 50, 100}.

[Figure 7: Relative volatility to the raw stock returns for various choices of parameter C (axes: relative volatility vs. delay ∆). The ability to reduce the volatility seems to continue in the unseen duration (∆ ≥ 0), which may suggest the local stability of spectral residuals. See the text for details.]

Table 3: Performance comparison of reversal returns on different numbers of principal components (PCs) to be eliminated (U.S. market).

          ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
C = 0     +0.759  +0.076
C = 1     +0.753  +0.054  0.041   +1.343  +0.309  0.132
C = 10    +1.426  +0.035  0.049   +2.541  +1.206
C = 20    +1.317  +0.028  0.037   +2.288  +1.000  0.037
C = 50    +1.264  +0.019  0.024   +2.275  +0.990
C = 100   +1.089  +0.013          +1.877  +0.391  0.035

Figure 7 shows the result averaged over t, which illustrates the proportion of the volatility of spectral residuals (C ≥ 1) to the volatility of raw stock returns (C = 0). The horizontal axis is the delay parameter ∆ := s − t ∈ {−H, . . . , −1, 0, . . . , H − 1}. Here, ∆ < 0 corresponds to the "observed" duration used for calculating the projection matrix A_t, and ∆ ≥ 0 corresponds to the "unseen" duration generated by extrapolation.
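For concreteness, the extrapolation experiment described above can be sketched as follows. This is a minimal NumPy sketch on synthetic one-factor data: the helper name `spectral_residual_projector`, the data-generating model, and all sizes are assumptions, and the paper's exact preprocessing may differ. It fits the projection matrix A_t on an observed window and applies it, unchanged, to the subsequent unseen window.

```python
import numpy as np

def spectral_residual_projector(returns_window, C):
    """Projection matrix A_t removing the top-C principal components.

    returns_window: array of shape (H, S), the past H days of returns
    for S stocks. (Hypothetical helper; the paper's exact normalization
    may differ.)
    """
    X = returns_window - returns_window.mean(axis=0, keepdims=True)
    cov = X.T @ X / len(X)                    # (S, S) sample covariance
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    V_res = eigvecs[:, : X.shape[1] - C]      # bottom S - C eigenvectors
    return V_res @ V_res.T                    # A_t = V_res V_res^T

# Toy one-factor market: a shared factor plus idiosyncratic noise.
rng = np.random.default_rng(1)
H, S, C = 256, 20, 1
market = rng.normal(scale=0.02, size=(2 * H, 1))     # common market factor
beta = rng.uniform(0.5, 1.5, size=(1, S))            # per-stock exposures
r = market @ beta + rng.normal(scale=0.005, size=(2 * H, S))

A_t = spectral_residual_projector(r[:H], C)   # fit on the "observed" window
resid_unseen = r[H:] @ A_t                    # extrapolate: A_t is symmetric

vol_raw, vol_res = r[H:].std(), resid_unseen.std()
print(vol_res < vol_raw)  # True: the market component stays removed out of sample
```

Because the dominant eigenvector tracks the stable market exposure, the projector keeps reducing volatility on the unseen window, mirroring the behavior in Figure 7.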
From this, we observe the following:

• Monotonicity. Increasing the number C of eliminated principal components can reduce the volatility of the spectral residuals even for the unseen duration.

• Local stability. Regarding the volatility proportion, the difference between the observed duration and the unseen duration increases with increasing C. Remarkably, the first principal component (a.k.a. the market factor) is quite stable, and the volatility proportion for C = 1 is well extrapolated to the unseen duration.

The above observations suggest that the projection matrix A_t can be locally stable if C is not too large. Hence, we can extract meaningful residual information in the subsequent time window by extrapolating the projection matrix.

D.2 Choosing the number of eliminated factors
Next, we consider the appropriate choice of the parameter C to make trading strategies robustly profitable. Figure 8 and Table 3 show the performance of the reversal strategy over the spectral residuals on the U.S. market.

From the result, we observe the following. With an increasing number C of eliminated principal components, both AR and AVOL tend to decrease. This is because eliminating more principal components reduces the volatility, while it becomes more difficult to earn large returns from the remaining information. Eliminating the first 10 principal components (C = 10) attained the best trade-off between the return and the risk, and thus had the best ASR value. We also conducted a similar experiment on Japanese market data, in which C = 20 achieved the best ASR value. See Figure 9 and Table 4 for the corresponding results.

Table 4: Performance comparison of reversal returns on different numbers of principal components (PCs) to be eliminated (Japanese market).

          ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
C = 0     +1.259  +0.082  0.065   +2.344  +0.700  0.117
C = 1     +1.715  +0.087
C = 10    +2.960  +0.074  0.025   +5.555  +2.037  0.036
C = 20    +3.291  +0.071  0.022   +6.306  +4.215
C = 50    +3.036  +0.047  0.016   +5.938  +3.258
C = 100   +2.175  +0.023          +4.114  +1.152  0.020

Table 5: Performance comparison on the Japanese market.

                ASR ↑   AR ↑    AVOL ↓  DDR ↑   CR ↑    MDD ↓
Market          +0.819  +0.158
AR(1)           +1.511  +0.058  0.038   +2.684  +1.094  0.053
AR(1) on SRes   +1.835  +0.034  0.019   +3.237  +1.112  0.031
Linear on SRes  +1.380  +0.023          +2.469  +0.952
MLP on SRes     +1.802  +0.030          +3.280  +0.963  0.032
SFM on SRes     +0.250  +0.005  0.019   +0.442  +0.116  0.042
DPO-NQ          +1.369  +0.024  0.018   +2.379  +0.758  0.032
DPO-NF          +1.770  +0.030          +3.159  +1.165  0.025
DPO-NV          +1.979  +0.039  0.020   +3.516  +1.484
DPO             +2.171  +0.036                  +1.460  0.025

E Performance evaluation on Japanese market data
Here, we evaluated the performance of our proposed system (DPO) on Japanese market data. We used evaluation metrics and baseline methods similar to those presented in Section 4.3.
E.1 Japanese market data
For Japanese market data, we used daily prices of stocks listed in the TOPIX 500 from January 2005 to December 2018. We used data before January 2012 for training and validation, and the remainder for testing. We obtained the data from Japan Exchange Group (JPX).

[Figure 8: The Cumulative Wealth of the reversal strategy with different choices of the number C of principal components to be eliminated (U.S. market).]

[Figure 9: The Cumulative Wealth of the reversal strategy with different choices of the number C of principal components to be eliminated (Japanese market).]

[Figure 10: The Cumulative Wealth in the Japanese market (axes: cumulative wealth vs. date).]
E.2 Results
Table 5 and Figure 10 show the results. Overall, we obtained results consistent with those on the U.S. market. Our proposed method (DPO) outperformed the other baseline methods and ablation models in terms of the ASR. For the other evaluation metrics, DPO achieved performance comparable to that of the best-performing baselines.
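For reference, the risk metrics defined in Appendix C.1 can be sketched in NumPy as follows. This is a minimal sketch: the function names are hypothetical, and the annualized return AR_T is passed in precomputed, since its annualization constant is defined in the main text rather than in this appendix.

```python
import numpy as np

def max_drawdown(cw):
    """MDD of a cumulative-wealth series: the largest peak-to-trough loss."""
    cw = np.asarray(cw, dtype=float)
    peaks = np.maximum.accumulate(cw)          # running peak CW_s for s <= t
    return float(np.max((peaks - cw) / peaks))

def calmar_ratio(ar, cw):
    """CR_T := AR_T / MDD_T."""
    return ar / max_drawdown(cw)

def downside_deviation_ratio(ar, returns):
    """DDR_T := AR_T / sqrt((1/T) * sum_t min(0, R_t)^2)."""
    r = np.asarray(returns, dtype=float)
    return ar / np.sqrt(np.mean(np.minimum(0.0, r) ** 2))

cw = np.array([1.0, 1.2, 0.9, 1.1, 1.5, 1.3])
print(round(max_drawdown(cw), 4))  # 0.25: the drop from the peak 1.2 to 0.9
```

Note that only negative returns enter the DDR denominator, which is exactly why a strategy with rare, shallow losses can score a high DDR despite moderate overall volatility.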