Design and analysis of momentum trading strategies
Richard J. Martin ∗ January 5, 2021
Abstract
We give a complete description of the third-moment (skewness) characteristics of both linear and nonlinear momentum trading strategies, the latter being understood as transformations of a normalised moving-average filter (EMA). We explain in detail why the skewness is generally positive and has a term structure.

This paper is a synthesis of two papers published by the author in RISK in 2012, with some updates and comments.
Introduction
Trend-following, or momentum, strategies have the attractive property of generating trading returns with a positively skewed statistical distribution. Consequently, they tend to hold on to their profits and are unlikely to have severe ‘drawdowns’. They are very scalable and are employed in most asset classes (most traditionally in futures, where they are a favourite strategy among CTA (‘commodity trading advisor’) firms, but also in OTC markets) and by both buy- and sell-side practitioners.

The basic premise behind momentum is to buy what has been going up recently, and sell what has been going down. In other words, if recent returns have been positive then future ones are more likely to be positive, and similarly with negative. Systematic strategies formalise this notion by (i) measuring momentum, essentially by smoothing out recent returns to obtain a signal that is not too rapidly-varying, and (ii) having a law that turns this signal into a trading position, i.e. how many contracts or what notional to have on. Put this way, the ideas that only the finest minds can understand CTA strategies, or that the theory of statistics is of central importance to their construction, or that one needs to have been steeped in managed futures for many years to build a workable strategy, are seen to be self-serving and pretentious: a conclusion implicitly arrived at, even if not thus expressed, by other authors.

Before talking about skewness we may as well deal with the first moment, that is to say the expected return. It is important to understand that this is an entirely separate matter. Any statement about this depends on markets exhibiting momentum, i.e. serially-correlated returns. This can be ascribed to the way information is disseminated into markets or to behavioural characteristics of market participants. However, it is entirely subjective and is a matter of believing that markets will continue to behave in the way that they have done in the past. In contrast, as we analyse in detail here, even if market returns exhibit no serial correlation, during which period the strategy will produce no average return, the trading returns of a momentum strategy will still have positive third moment. What is interesting is that the skewness characteristic is a product of the design of […] when they have already made money, as opposed to reversion strategies that follow the opposite principle. In statistical language the full distribution of P&L is a mixture of distributions of different mean and variance: the components with a higher variance have positive mean and that is a recipe for positive skewness.

∗ Dept. of Mathematics, Imperial College London, South Kensington, London SW1 2AZ, UK. Email: [email protected]

Studies on the subject have generally been empirical (for a good overview see e.g. [21] and references therein, and [2] for a general introduction to technical trading). However, there is a decent literature on quantitative aspects. The first work was by Acar [1] who derived a variety of results in discrete time using different forecasting models and also ascertained that the distribution of momentum trading returns has positive skewness. Potters & Bouchaud [20] consider a particular type of momentum strategy and derive rigorous results about its performance. Bruder et al. [4], and an even longer extension by Jusselin et al. [11], devote considerable effort to deriving moving-average filters from an underlying model. In practice, however, this seems to create more problems than it solves because the assumed model may be wrong: better, we think, is to use a convenient definition of moving-average filter, here the exponentially-weighted moving average (EMA) as it is easily calculated by recursion, and design a strategy using those. Then in 2012 two papers were published in RISK by the author, which considerably broadened the scope of the subject. Both focused on the third-moment characteristics of momentum models.
The first [19] dealt with linear models, by which we mean a signal proportional to a momentum signal obtained by applying exponential smoothing to the market returns. The second [15] showed how to deal with strategies defined as nonlinear transformations of momentum signals, within the same framework. This paper is a synthesis of these two. More recently Dao [5] focuses on the connection between convexity, option-like characteristics, and momentum strategies.

In building momentum strategies, two main considerations are important. The first relates to backtesting, in other words finding what worked best in the past and assuming that it will continue to do so. Part of the problem with this is that it is too reliant on historical data: if left unchecked, it wastes an inordinate amount of time in fitting and overfitting, mainly because different models typically produce almost identical historical performance. The second relates to design: that is to say, without regard to the past, force the strategy to have certain statistical properties when the market behaves in predefined ways. One idea, which we consider in depth, is when markets are not trending (and so market returns are uncorrelated). Another is when the market does trend in a way that it did in a particular historical scenario, such as gold in the first decade of this century, or in the first 18 months from January 2019; some designs behave differently from others. There is to an extent a trade-off between these considerations: better positive skewness and better performance in certain trending scenarios may be obtained at the expense of worse average historical performance and vice versa. This necessitates subjective decision, and despite the great effort of systematic trading firms to claim that there is no discretion in the implementation of their systems (nowadays assisted by the smoke-screens of statistical theory and ‘machine learning’) inevitably there must be.
An incidental conclusion from reading [5] is that the SG CTA index is very easily replicated, giving the lie to the contentions, blithely trotted out by the CTA industry, that barriers to entry are so high, that the subject can only be understood by those with years of experience, that a cohort of PhDs are required to build strategies, and that proprietary execution algorithms are important; the last of these is clearly nonsense given the low speed at which the replicating strategy in [5] trades (see Figure 7 in that paper). In fact, rather than clever trade execution being important for momentum strategies, it is the reverse that is true: momentum is an important ingredient in trade execution, as over short time scales many financial time series exhibit momentum. Further, the relatively poor performance of the SG CTA index since the end of the Global Financial Crisis, together with the underlying simplicity of momentum trading, should make investors question whether CTAs’ management fees are justified, as well as how big an asset allocation they should receive by comparison with standard investments in equities and fixed income.

A consequence of positive skewness is that the proportion of winning trades may well be less than one-half [20]. Small trading losses are common, but occasional big gains are produced when the strategy levers itself into a trend. The longevity of trend-following funds suggests that this characteristic has served them well over the years, pointing to the conclusion that the oft-asked question “What is your fraction of winning trades?” is misleading. The link between moments and proportion of winning trades can be formalised with the Gram-Charlier expansion, which estimates, for a random variable $Y$,
$$P(Y > E[Y]) \approx \frac{1}{2} - \frac{\kappa_3}{6\sqrt{2\pi}}$$
where $\kappa_3$ is the coefficient of skewness of $Y$ (see footnote).

We show that the skewness of the returns depends strongly on their period, so that even if the one-day returns have no skewness, longer-period returns may be skewed.
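The Gram-Charlier estimate quoted above is easy to check numerically. The following is a minimal sketch, not part of the original papers: the function name `prob_exceed_mean` is ours, and the exponential distribution (skewness 2, exact exceedance probability $e^{-1}$) is used purely as a test case.

```python
import math


def prob_exceed_mean(skewness):
    """Gram-Charlier estimate of P(Y > E[Y]) for a variable with the given skewness."""
    return 0.5 - skewness / (6.0 * math.sqrt(2.0 * math.pi))


# Test case: the unit exponential distribution has skewness 2 and
# exceeds its mean with probability exp(-1).
approx = prob_exceed_mean(2.0)
exact = math.exp(-1.0)
print(f"Gram-Charlier: {approx:.4f}, exact: {exact:.4f}")
```

For a symmetric distribution the estimate reduces to one-half, and any positive skewness pushes the fraction of above-mean outcomes below one-half, which is the point being made about winning trades.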
This may at first seem curious, and arises because successive daily trading returns are not independent. Thus it is possible to obtain a skewed distribution by adding non-skewed random variables, if those variables are appropriately dependent.

This paper is arranged as follows. We begin with linear strategies (§2) and give a complete exposition of ‘skew theory’ for them. We show how to calculate this skewness as a function of the return period, by simple application of residue calculus. The skewness of the $M$-period trading return depends on $M$: it rises to a maximum at a period proportional to the typical response-time of the trending indicator, and then drops as $M^{-1/2}$ (eqs. 15, 17, 18/19). We then test on some real data and find reasonable correspondence. Finally we analyse a particular hybrid linear strategy that is not pure momentum and derive a simple condition that ensures positive long-term skewness of returns.

An interlude on the option-like nature of trend-following ensues (§3), which is a natural consequence of §2. This has since been treated in an excellent paper by Dao et al. [5] who point out that in effect one is buying long-dated options and selling short-dated ones. This explains neatly why momentum strategies suffer badly from whipsawing, when short-dated volatility is high, and perform well when the market moves steadily in one direction. The effect is important because it causes momentum strategies to hold on to most of their previous profits during the periods where they are not making money. As pointed out by Till & Eagleeye [21], this “long-option behaviour” distinguishes them from other strategies that tend to have higher Sharpe ratio, the implication being that the higher Sharpe is a form of remuneration for negative skewness. By the same token, as pointed out in a different context in [17], positive skewness can result in longer drawdown times than for strategies with zero skew, depending on what period of history is being considered, so time spent in drawdown is not necessarily a good measure of strategy performance.

Footnote: Take for example the exponential distribution: the exact probability of exceeding the mean is $e^{-1} \approx 0.368$, and the approximation gives $\frac{1}{2} - 2/(6\sqrt{2\pi}) \approx 0.367$.

Footnote: By the $M$-period trading return we mean the gain in P&L between date $n$ and date $n+M$. This is to be distinguished very carefully from the market return, which simply means the change in price of the traded asset over that period. We are assuming a discrete time model with the time increment being 1 day.

We then move on to nonlinear strategies in §4. An arbitrary nonlinear function of several momentum factors (of different speeds) would be very difficult to analyse, so we opt for nonlinearly transforming each momentum factor first, and then the position is a weighted sum of the transformed factors. We extend the work on skewness of trading returns, studying the effect of the nonlinear transformation. This analysis is primarily a matter of algebra, and we derive new results (27; 28, 29) for the term structure of the skewness of trading returns. For some useful instances of the model (30, 31, 32) we can evaluate these expressions in closed form, making for easy computation. It turns out that the nature of the transformation is very important and can cause the positive skewness to disappear or even become negative. For example, one simple transformation is the binary construction with a position of $+1$ or $-1$ […]

We have already mentioned returns and now formalise this notion. Simply, the return is either the change in price, or the relative change in price, the former being $X_{n+1} - X_n$ and the latter $(X_{n+1} - X_n)/X_n$, which is approximately the difference in $\ln X$ between the two time points. The latter can only be applied when prices are positive and is therefore inapplicable to asset classes such as interest-rate swaps, but it is the most natural definition for equities, bond futures and most commodities. We say ‘most’ commodities because a major upset occurred in the front WTI oil contract in April 2020 when it went negative [16]. On the other hand the former definition is most natural for interest-rate futures. We use the former definition in the ensuing algebra but this choice is not critical to the theoretical development.

We define $U_{n+1}$ to be the return per unit volatility for the asset $X$ between time $n$ and $n+1$, i.e.
$$U_{n+1} = \frac{X_{n+1} - X_n}{\hat\sigma_n}, \qquad \hat\sigma_n = E_n\big[(X_{n+1} - X_n)^2\big]^{1/2} \qquad (1)$$
with $E_n$ denoting an expectation conditional on $\mathcal{F}_n$, the information known up to and including time $n$.
The reason for dividing by $\hat\sigma$ is that we wish $U_n$ to be nondimensional and appropriately normalised.

Following the general principle that one should bet a number of contracts (or contract notional) inversely proportional to the contract volatility, we define the position in the asset to be $\varphi_n / \hat\sigma_n$ at time $n$, where $\varphi_n$ is to be a function of any or all of the $U$’s up to and including $U_n$. One must have $\varphi_n \in \mathcal{F}_n$, otherwise the strategy can cheat by looking ahead. Clearly the P&L arising from the period between time $n$ and time $n+1$ is $\varphi_n U_{n+1}$.

Footnote: As a swap at inception has zero PV. If one wants to consider the momentum of the underlying swap rate, rather than the contract PV, then this is quite sensible, though not quite the same, as the contract PV includes the effects of carry and rolldown, not just the changes in par swap rate. Nonetheless one should still use absolute changes for the obvious reason that interest rates can go negative; also the absolute variation in EUR and USD rates as examples has not been strongly spot-dependent over the last 25 years or so.

Footnote: Also known as the risk-adjusted return.

Footnote: Essentially to keep a reasonably constant level of risk on: if the same position is held while the volatility rises substantially, one is likely to break one’s market risk limits. More formally this follows from stochastic control theory, see e.g. [3, § …]: with $\mu$ for the drift and $\sigma$ for the volatility of the traded asset, the optimal position always emerges as $\mu/\sigma^2$ multiplied by a few factors that pertain to the exact setup (utility, etc). If the dimensionless trading signal $S$ is representative of the risk-adjusted return $\mu/\sigma$, the position is $S/\sigma$.

Figure 1: Performance, or ‘total return’, and skewness of market returns, for three indices: SG CTA index; US Treasuries (7–10y bucket); equities (SPX). Data source for top plot: Bloomberg. [Lower panel: skewness against return period /days; series CTA, UST, SPX.]

For much of this paper we assume that the risk-adjusted returns $U_n$ are i.i.d. and of zero mean. Thus, we are studying the behaviour of the strategy under the assumption that it is not generating any expected return (Potters & Bouchaud do the same). We also assume that $U_n$ has zero third moment, so that its first three moments are 0, 1, 0. Thus although $U_n$ has no skewness, we are about to show that the trading return may have skewness. We make no further distributional assumptions about the $(U_n)$.

Footnote: The raw returns need not be, because of stochastic volatility, which is another reason for dividing off by an estimate of the stdev of the asset return.

The $M$-period trading return is defined as
$$Y^{(M)}_n = \sum_{k=0}^{M-1} \varphi_{n+k} U_{n+k+1}. \qquad (2)$$
The first moment of the trading return is clearly zero. The second moment is given by
$$\big\langle (\varphi_0 U_1 + \cdots + \varphi_{M-1} U_M)^2 \big\rangle.$$
Let us consider this expectation on expansion as a product. The cross-terms all vanish because each contains a $U_{n+1}$ term multiplied by a term in $\mathcal{F}_n$. This leaves the squared terms, which give simply $M \langle \varphi^2 \rangle$ (as $\langle U^2 \rangle = 1$). The proportionality in $M$ is a consequence of the trading returns being uncorrelated (note that we have not said ‘independent’).

The third moment of the trading return is given by
$$\big\langle (\varphi_0 U_1 + \cdots + \varphi_{M-1} U_M)^3 \big\rangle.$$
Expanding this as a product, we obtain four types of terms:
(i) $\varphi_{n_1} U_{n_1+1}\, \varphi_{n_2} U_{n_2+1}\, \varphi_{n_3} U_{n_3+1}$ with $n_1 < n_2 < n_3$;
(ii) $\varphi_n^2 U_{n+1}^2\, \varphi_m U_{m+1}$ with $n < m$;
(iii) $\varphi_n^2 U_{n+1}^2\, \varphi_m U_{m+1}$ with $n > m$;
(iv) $\varphi_n^3 U_{n+1}^3$.
The independence of the $U$’s and the assumptions about their moments show that (i), (ii) and (iv) all vanish.
We are therefore left with (iii), which can be written
$$3 \sum_{0 \le m < n \le M-1} \big\langle \varphi_n^2\, \varphi_m U_{m+1} \big\rangle. \qquad (3)$$
[…] 0. The terms listed as type (iv) above will then cause the skewness of the short-term trading returns to be negative. In fact this is visible in Figure 1.

Footnote: $\langle \cdot \rangle$ denotes realisation average.

Linear strategies
We now specialise these results to linear strategies, by which we mean
$$\varphi_n = \sum_{j=0}^{\infty} a_j U_{n-j}. \qquad (4)$$
Linear strategies have several advantages. They are easily constructed, for example through EMAs ($a_j \propto \alpha^j$) which can be implemented recursively. They are also easily added, so that one can combine momentum of different periods (or even have negative weights on momentum of certain periods, so that one may capture counter-trending behaviour). Finally, analysis is reasonably straightforward, and the moments of the trading returns can be captured using the coefficients $(a_j)$ alone.

We shall need the autocovariance function of the impulse response:
$$R^a_k = \sum_{j=0}^{\infty} a_j a_{j+k}, \qquad k \ge 0, \qquad (5)$$
and the $z$-transform of the weights:
$$A(z) = \sum_{j=0}^{\infty} a_j z^{-j}, \qquad z \in \mathbb{C}. \qquad (6)$$
This is bounded for $|z| \ge 1$. A linear combination of EMAs always has a rational system function and its poles are usually a key part of the design and analysis. For a general account, refer to [9].

The simplest example is the single-EMA case, which we will call ‘EMA1’: $a_j = \alpha^{j+1}$, and $A(z) = \alpha / (1 - \alpha z^{-1})$. This arises as the difference between the spot price and an EMA of past prices, and is used by Potters & Bouchaud in [20]. The decay-factor $\alpha$ is linked to the effective period of the EMA, $N$, by $\alpha = 1 - N^{-1}$, so that the EMA becomes progressively slower, or more highly smoothed, as $\alpha \to 1$. The next case, which we will call ‘EMA2’, has $a_j = (\alpha^{j+1} - \beta^{j+1}) / (\alpha - \beta)$ and $A(z) = 1 \big/ \big( (1 - \alpha z^{-1})(1 - \beta z^{-1}) \big)$. This arises as the difference between two EMAs of prices, a common device in technical analysis. It has less day-to-day variation than EMA1, on account of being the difference of two smoothed prices, or equivalently a double (rather than single) EMA of the returns.

Footnote: See for example [10, §9] and also many online articles on tea-leaf reading, e.g. […].

It is convenient to define a class of models that we call SPRZ (‘simple poles, regular at zero’). The precise conditions are: $A(z)$ bounded in $|z| > 1 - \varepsilon$ for some $\varepsilon > 0$; the only singularities to be simple poles; and $A(z)$ regular at the origin. These should be thought of as mild analytical conditions that enable the ready application of residue calculus; models with multiple poles can be understood as limiting cases of models with simple poles as the poles coalesce.

Footnote: This subsection may be omitted at a first reading.

In continuous time, for a process $X_t$ we can define any time-invariant linear system as
$$K[X]_t = \int_{-\infty}^{t} K(t-s)\, dX_s$$
where $K$ is commonly known as the kernel. If $X_t$ is a unit Brownian motion then the variance of $K[X]_t$ is
$$\|K\|^2 = \int_0^{\infty} K(t)^2\, dt,$$
which we call the square-norm.

An EMA1 is then the difference between $X$ and its exponentially-weighted moving average, which is an exponential smooth of the returns:
$$X_t - \int_{-\infty}^{t} \dot\alpha\, e^{\dot\alpha (s-t)} X_s\, ds = \int_{-\infty}^{t} e^{\dot\alpha (s-t)}\, dX_s \qquad (7)$$
and its square-norm is $1/(2\dot\alpha)$ (of course $\dot\alpha > 0$). The EMA2 is the difference of two such terms, with $\dot\beta > \dot\alpha$:
$$\int_{-\infty}^{t} \big( e^{\dot\alpha (s-t)} - e^{\dot\beta (s-t)} \big)\, dX_s \qquad (8)$$
and its square-norm is $(\dot\alpha - \dot\beta)^2 \big/ \big( 2 \dot\alpha \dot\beta (\dot\alpha + \dot\beta) \big)$. The limit $\dot\beta \to \dot\alpha$ obviously only makes sense if we divide by $\dot\beta - \dot\alpha$ first, giving
$$\int_{-\infty}^{t} (t-s)\, e^{\dot\alpha (s-t)}\, dX_s \qquad (9)$$
and its square-norm is $1/(4\dot\alpha^3)$. This can also be written as the EMA of $X$ minus the double-EMA (EMA of the EMA) of $X$. We call this ‘EMA2=’.

An important practical aspect of continuous-time signals is the notion of path-length, defined for a process $Y_t$ to be
$$\frac{1}{T} \int_0^T |dY_t|.$$
This is a concern because it relates to the rate at which money is lost in proportional transaction costs. Infinite path length results in an infinite rate of loss, unless the problem is obviated. For a Brownian motion the path length is infinite as $|dY_t| = O(\sqrt{dt})$. Now if $X_t$ is a Brownian motion and we pass it through a linear system of kernel $K$, then what is the path length of $K[X]_t$? We see that
$$\int_{-\infty}^{t+dt} K(t+dt-s)\, dX_s - \int_{-\infty}^{t} K(t-s)\, dX_s = K(0)\, dX_t + dt \cdot \int_{-\infty}^{t} K'(t-s)\, dX_s$$
and the first term will generate infinite path length unless $K(0) = 0$. The second term (without the $dt$) is Normally distributed of zero mean and variance $\|K'\|^2$, so its expected absolute value is $(2\|K'\|^2/\pi)^{1/2}$. The conclusion is that the path length is finite for EMA2, infinite for EMA1.
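The path-length argument can be illustrated by simulation. The sketch below is ours and rests on stated assumptions: a Brownian motion sampled on a grid of step `dt`, exponential kernels updated with their exact one-step decay, and illustrative rates $\dot\alpha = 1$, $\dot\beta = 2$ (none of these values come from the paper). On refining the grid, the EMA1 path length per unit time should grow like $dt^{-1/2}$, whereas the EMA2 figure should settle near $(2\|K'\|^2/\pi)^{1/2}$ with $\|K'\|^2 = (\dot\alpha - \dot\beta)^2 / (2(\dot\alpha + \dot\beta))$.

```python
import math

import numpy as np


def simulate_path_lengths(rate_a, rate_b, dt, horizon, seed=0):
    """Return (EMA1, EMA2) path length per unit time on a grid of step dt."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / dt))
    increments = rng.normal(0.0, math.sqrt(dt), size=n_steps).tolist()
    decay_a = math.exp(-rate_a * dt)
    decay_b = math.exp(-rate_b * dt)
    smooth_a = smooth_b = 0.0          # exponential smooths of the increments
    prev_ema1 = prev_ema2 = 0.0
    length1 = length2 = 0.0
    for dx in increments:
        smooth_a = decay_a * smooth_a + dx
        smooth_b = decay_b * smooth_b + dx
        ema1 = smooth_a                # kernel exp(-a t), so K(0) = 1
        ema2 = smooth_a - smooth_b     # kernel exp(-a t) - exp(-b t), so K(0) = 0
        length1 += abs(ema1 - prev_ema1)
        length2 += abs(ema2 - prev_ema2)
        prev_ema1, prev_ema2 = ema1, ema2
    return length1 / horizon, length2 / horizon


def ema2_limit(rate_a, rate_b):
    """Predicted EMA2 path length per unit time, (2 ||K'||^2 / pi)^(1/2)."""
    norm_sq = (rate_a - rate_b) ** 2 / (2.0 * (rate_a + rate_b))
    return math.sqrt(2.0 * norm_sq / math.pi)


coarse = simulate_path_lengths(1.0, 2.0, dt=0.01, horizon=400.0)
fine = simulate_path_lengths(1.0, 2.0, dt=0.0025, horizon=400.0)
print("EMA1 path length (coarse, fine):", coarse[0], fine[0])
print("EMA2 path length (coarse, fine):", coarse[1], fine[1])
print("EMA2 prediction:", ema2_limit(1.0, 2.0))
```

Quartering the step should roughly double the EMA1 figure while leaving the EMA2 figure essentially unchanged, up to sampling noise.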
This does not rule out the use of EMA1, as the theory of trading under proportional transaction costs is reasonably well-established (see e.g. [18, 13, 14] and references therein), but it does suggest that in trading systems EMA2 is preferable.

Now it may be advantageous to minimise the path-length, subject to two conditions: (i) the variance of the output is to be unity (as otherwise the solution would be $K \equiv 0$) […]
$$\int_0^{\infty} K'(t)^2\, dt \quad \text{s.t.} \quad \int_0^{\infty} K(t)^2\, dt = 1 \ \text{ and } \ \int_0^{\infty} t^{-1} K(t)^2\, dt = 1/\tau \qquad (10)$$
where $\tau$, of units time, is a given parameter. One boundary condition is $K(0) = 0$, and we require sensible behaviour at $\infty$. This is a standard type of variational calculus problem and gives rise to the ODE
$$K''(t) + (\lambda + \mu t^{-1}) K(t) = 0 \qquad (11)$$
where $\lambda, \mu$ are Lagrange multipliers. It is an easy exercise to see that $t e^{-\dot\alpha t}$ is one solution of this (with $\lambda = -\dot\alpha^2$, $\mu = 2\dot\alpha$), and so in this particular sense EMA2= is an optimal choice of momentum filter.

It is immediate that the second moment of the $M$-period trading return is $M R^a_0$. For the third moment, we have to find the $U_{m+1} U_j$ term ($j \ne m+1$) in $\varphi_n^2$ in the expression (3). This is
$$2 \sum_{k=0,\ k \ne n-m-1}^{\infty} a_{n-m-1}\, a_k\, U_{m+1} U_{n-k}.$$
This now has to be multiplied by $3 \varphi_m U_{m+1}$ and the expectation taken. Thus it is necessary to look for any overlap between $U_{n-k}$ and $\varphi_m$, and so in the $k$-summation we only need terms with $k \ge n-m$, and exclude the others. The resulting expression emerges as
$$3 \sum_{0 \le m < n \le M-1} 2\, a_{n-m-1} R^a_{n-m} = 6 \sum_{k=1}^{M-1} (M-k)\, a_{k-1} R^a_k. \qquad (12)$$
[…] $M R^a_0$, and
$$R^a_0 = \frac{1}{2\pi i} \oint_{|z|=1} A(z) A(z^{-1})\, z^{-1}\, dz = A(0) A(\infty) + \sum_j \rho_j \alpha_j^{-1} A(\alpha_j^{-1}), \qquad (14)$$
where the $\alpha_j$ are the poles of $A$ and the $\rho_j$ the corresponding residues. Collecting the results together, we deduce that the skewness of $M$-period trading returns, for large $M$, is
$$\kappa^{(M)}_3 \sim \frac{6 \sum_j \rho_j A(\alpha_j^{-1})^2}{\Big( A(0) A(\infty) + \sum_j \rho_j \alpha_j^{-1} A(\alpha_j^{-1}) \Big)^{3/2} M^{1/2}}. \qquad (15)$$
In the EMA1 case we immediately obtain
$$\kappa^{(M)}_3 \sim \frac{6\alpha}{(1 - \alpha^2)^{1/2} M^{1/2}} \sim 3\sqrt{2} \left( \frac{N}{M} \right)^{1/2}$$
where the right-hand expression is obtained by assuming that $N = (1 - \alpha)^{-1}$ is not small.
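The EMA1 asymptotic can be compared with a direct evaluation of the third moment. The sketch below is ours, not the author's code. It assumes the index convention $\varphi_n = \sum_{j \ge 0} a_j U_{n-j}$ of (4), under which the sum over pairs in (3) becomes $6 \sum_{k=1}^{M-1} (M-k)\, a_{k-1} R^a_k$; the weights are truncated at a finite length, and the function names are hypothetical.

```python
import numpy as np


def ema1_weights(alpha, n_terms):
    """Truncated EMA1 weights a_j = alpha**(j+1), j = 0..n_terms-1."""
    return alpha ** np.arange(1, n_terms + 1)


def autocovariance(a, k_max):
    """R_k = sum_j a_j a_{j+k} for k = 0..k_max (requires k_max < len(a))."""
    return np.array([np.dot(a[: len(a) - k], a[k:]) for k in range(k_max + 1)])


def trading_return_skewness(a, m):
    """Skewness of the m-period trading return when U is i.i.d. with moments 0, 1, 0."""
    r = autocovariance(a, max(m - 1, 0))
    k = np.arange(1, m)
    third = 6.0 * np.sum((m - k) * a[k - 1] * r[k])   # third moment
    second = m * r[0]                                 # second moment M * R_0
    return third / second ** 1.5


def asymptotic_skewness(alpha, m):
    """Large-M EMA1 result 6 alpha / ((1 - alpha^2)^(1/2) M^(1/2))."""
    return 6.0 * alpha / np.sqrt((1.0 - alpha ** 2) * m)


alpha = 1.0 - 1.0 / 20.0          # N = 20, an illustrative choice
a = ema1_weights(alpha, 5000)
for m in (1, 24, 200, 2000):
    print(m, trading_return_skewness(a, m), asymptotic_skewness(alpha, m))
```

The one-period skewness is zero, the exact figure rises to a peak at a period comparable to $N$, and the ratio of exact to asymptotic skewness tends to one as $M$ grows.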
In the EMA2 case, the poles are at $\alpha$, $\beta$ and are of residue $\alpha^2/(\alpha - \beta)$, $-\beta^2/(\alpha - \beta)$, and the result is, after a little algebra,
$$\kappa^{(M)}_3 \sim \frac{6 (\alpha + \beta)(1 - \alpha\beta)^{1/2}}{(1 - \alpha^2)^{1/2} (1 - \beta^2)^{1/2} (1 + \alpha\beta)^{1/2} M^{1/2}} \sim 3\sqrt{2} \left( \frac{N_\alpha + N_\beta}{M} \right)^{1/2}.$$

We can also return to (12) to get the exact third moment, not just the long-term asymptotic. To do this, we write (12) in terms of $A(z)$, as
$$\frac{6}{(2\pi i)^3} \sum_{k=1}^{M-1} (M-k) \sum_{j=0}^{\infty} \oint\!\!\oint\!\!\oint A(y)\, y^{j-1}\, A(z)\, z^{j+k-1}\, A(w)\, w^{k-2}\, dw\, dy\, dz$$
in which the contours for the $w$- and $y$-integrals are of radius $1 - \varepsilon$ and the contour for $z$ is just $|z| = 1$ (the need for this will presently become apparent). The $j$-summation and the $y$-integral can be done immediately (the placement of the contours causes $|yz| < 1$, which is necessary for convergence of the sum; in doing the $y$-integral, expand the contour out to $\infty$ and pick up the residue at $y = 1/z$ on the way). Next do the $k$-summation using the identity
$$\sum_{k=1}^{m-1} (m-k)\, r^{k-1} \equiv \frac{r^m - 1 + m(1 - r)}{(1 - r)^2}$$
to arrive at
$$\frac{6}{(2\pi i)^2} \oint\!\!\oint A(z^{-1}) A(z) A(w)\, \frac{(wz)^M - 1 + \overbrace{M(1 - wz)}^{*}}{(1 - wz)^2}\, w^{-1}\, dw\, dz.$$
The term marked $*$ gives the large-$M$ result we have already obtained, once the $w$-integral is done (again, by expanding the contour out to $\infty$ and picking up the residue at $w = 1/z$ on the way). The remaining part can be calculated by collapsing the $w$-contour around all the singularities inside the unit circle (note that no singularity arises from the $(1 - wz)^2$ term in the denominator, as $|wz| < 1$), and then the $z$-contour. In the SPRZ case, we finally obtain the third moment as
$$6M \sum_j \rho_j A(\alpha_j^{-1})^2 - 6 A(0) \sum_j \rho_j A(\alpha_j^{-1}) - 6 \sum_{j,k} \rho_j \rho_k\, \alpha_j^{-1} A(\alpha_k^{-1})\, \frac{1 - \alpha_j^M \alpha_k^M}{(1 - \alpha_j \alpha_k)^2}. \qquad (16)$$
For EMA1, the exact expression for the skewness is therefore
$$\kappa^{(M)}_3 = \frac{6\alpha}{(1 - \alpha^2)^{1/2} M^{1/2}} \left( 1 - \frac{1 - \alpha^{2M}}{1 - \alpha^2}\, M^{-1} \right). \qquad (17)$$
This rises from zero to a peak and then rolls off as $O(M^{-1/2})$ (see Figure 2a). The maximum skew is roughly 2.1–2.4, and occurs for period $M \approx$ […] $N$ (recall $\alpha = 1 - N^{-1}$).

Footnote: Unless $N$ is small we can approximate (17) as $\big( 3/\sqrt{2x^3} \big) \big( e^{-2x} - 1 + 2x \big)$, with $x = M/N$; the maximum of this function is $\approx 2.41$ and occurs at $x \approx 1.08$.

For EMA2, the exact expressions for the second and third moments are
$$\mu^{(M)}_2 = \frac{M (1 + \alpha\beta)}{(1 - \alpha\beta)(1 - \alpha^2)(1 - \beta^2)} \qquad (18)$$
$$\mu^{(M)}_3 = \frac{6M (\alpha + \beta)(1 + \alpha\beta)}{(1 - \alpha\beta)(1 - \alpha^2)^2 (1 - \beta^2)^2} - \frac{6 \alpha^3 (1 - \alpha^{2M})}{(\alpha - \beta)^2 (1 - \alpha^2)^3 (1 - \alpha\beta)} - \frac{6 \beta^3 (1 - \beta^{2M})}{(\alpha - \beta)^2 (1 - \beta^2)^3 (1 - \alpha\beta)} + \frac{6 \alpha \beta^2 (1 - \alpha^M \beta^M)}{(\alpha - \beta)^2 (1 - \beta^2)(1 - \alpha\beta)^3} + \frac{6 \alpha^2 \beta (1 - \alpha^M \beta^M)}{(\alpha - \beta)^2 (1 - \alpha^2)(1 - \alpha\beta)^3} \qquad (19)$$
and $\kappa^{(M)}_3 = \mu^{(M)}_3 \big/ \big( \mu^{(M)}_2 \big)^{3/2}$ as usual. (The signs in (19) are as obtained by evaluating (16) at the two poles; with them the third moment vanishes at $M = 1$, as it must.) This is qualitatively similar to EMA1 (see Figure 2b). The maximum skew is around 2.1, and occurs at period $M \approx$ […] $(N_\alpha + N_\beta)$, provided $N_\alpha$ and $N_\beta$ are not too far apart. In the extreme case where either of the $N$’s is equal to 1, we recover EMA1. The limit $\beta \to \alpha$ is well-behaved, but algebraically messy and omitted here.

In essence, (16) telescopes the various geometric series that are implicit in the calculation of (12), and allows it to be done with an amount of computational effort independent of $M$.

For a demonstration using real data we use two datasets: the CHFUSD futures and the S&P500 futures. For risk-adjusting the returns we use a 20-day EMA of squared price changes to estimate the volatility ($\hat\sigma_n$ in the definition of $U_n$). We are using an EMA2 with $N = 20, 40$.

Footnote: Bloomberg: SF1 Curncy and SP1 Index. These are the front contracts, rolled 10 days before expiry to create generic series. Data range: 01-Jan-90 to 31-Dec-09.

[…] (i) […] $(U_n)$; (ii) symmetry of their distribution up to the third moment. In practice the first clearly does not hold, because it implies that momentum strategies do not generate positive expected return, whereas the evidence is that on average they do. That means that when we examine real data, the observed skewness of returns may well not equal the theoretical result,
by virtue of the mean being different. We therefore plot the central skew (third central moment divided by the 3/2 power of the second central moment, which is the usual definition) and also the ‘non-central’ skew (third moment about zero divided by the 3/2 power of the second moment about zero). If the effect of trending is to generate a slightly positive expected return but keep the other moments roughly equal, then the non-central skew will be fractionally higher than the central skew. As to (ii), we know that equity markets occasionally have very negative returns.

Footnote: Central moment = moment about the mean.

Figure 2: Skewness of trading returns, as a function of period, for (a) EMA1 type model, (b) EMA2 type model. Note the characteristic shape. [Axes: skewness against return period /days; curves (a) N=10, N=20, N=40; (b) N=5,10, N=10,20, N=20,40.]

Figure 3 shows the results for the two markets, superimposing also the theoretical result from Figure 2b. In spite of the deficiencies in the modelling assumptions the agreement is not bad and the general shape is right. The short-term skewness for the equity market is nonzero because of the asymmetry of the market returns; the higher long-term skewness is best ascribed to the particularly good trending behaviour in the mid-1990s generating high trading returns. The skewness of the trading returns is far higher than that of the underlying markets (i.e. of the $U_n$’s): the latter is (to within 0.1) typically about 0.0 for CHFUSD and $-$[…]

Figure 3: Skewness of trading returns, as a function of period, for (a) CHFUSD, (b) S&P500 futures; theoretical result also shown. $N = 20, 40$. [Axes: skewness against return period /days; curves: central skew, non-central skew, theoretical.]

[…] $\Phi(r) - \kappa_3 / (6\sqrt{2\pi})$ in the presence of nonzero first cumulant (expected return); here $r = \kappa_1 / \kappa_2^{1/2}$ is the Sharpe ratio and $\Phi$ is the Normal c.d.f. For horizons $M$ in the range 100–200 days the Sharpe ratio of each is roughly +0.2 and the skewness is around 2, so this gives the probability of exceeding zero as about 0.45, which corresponds well with the empirical value; note that it is less than one-half.

Suppose that a strategy has a trend-following and a counter-trending characteristic, as would happen if its weights were obtained from a linear combination of EMA2’s, with opposite signs. It may be desirable to ensure that the long-term skewness remains positive, as this is associated with longevity of the strategy. There are two situations in which this arises. In what is basically a trend-following strategy, it is desired:
(i) to make small bets on short-term reversion without this upsetting the behaviour if a longer-term trend occurs;
(ii) to make a small bet against very long-term trends on the supposition that what goes up must eventually come down (or vice versa), provided this bet is not too large.
In the first case the weights on most recent returns will be negative; in the second, it is the weights on the distant past that will be negative. The idea is to make sure that they are not too negative, in a sense to be made precise.

Figure 4: Skewness of trading returns, as a function of period, for hybrid model with both trending and counter-trending behaviour, in two cases. [Axes: skewness against return period /days; curves: weights −1, 1 and weights 1.476, −1.]

We have a model of the form
$$A(z) = \frac{\lambda_F (\alpha_F - \beta_F)}{(1 - \alpha_F z^{-1})(1 - \beta_F z^{-1})} + \frac{\lambda_S (\alpha_S - \beta_S)}{(1 - \alpha_S z^{-1})(1 - \beta_S z^{-1})}$$
where $\lambda_F$ and $\lambda_S$ are the multipliers on the fast and slow components. Positive asymptotic skewness is ensured by (15):
$$\sum_j \rho_j A(\alpha_j^{-1})^2 > 0 \qquad (20)$$
with $\alpha_1 = \alpha_F$, $\alpha_2 = \beta_F$, $\alpha_3 = \alpha_S$, $\alpha_4 = \beta_S$. Thus $\rho_1 = \alpha_F^2 \lambda_F$,
$$A(\alpha_1^{-1}) = \frac{\lambda_F (\alpha_F - \beta_F)}{(1 - \alpha_F^2)(1 - \beta_F \alpha_F)} + \frac{\lambda_S (\alpha_S - \beta_S)}{(1 - \alpha_S \alpha_F)(1 - \beta_S \alpha_F)}$$
and similarly for the other three. The LHS of (20) is a homogeneous cubic in $\lambda_F$, $\lambda_S$, which will factorise as
$$P(\lambda_F, \lambda_S) = (\lambda_F - \zeta_1 \lambda_S)(\lambda_F - \zeta_2 \lambda_S)(\lambda_F - \zeta_3 \lambda_S)$$
where the $\zeta$’s are functions of the four poles. It is possible to identify the coefficients of $\lambda_F^3$, $\lambda_F^2 \lambda_S$, $\lambda_F \lambda_S^2$, $\lambda_S^3$ as functions of the poles, then evaluate them and factorise the cubic by the Cardano-Tartaglia formula. However for practical purposes one might just as well write a numerical routine for LHS(20) and find the roots $\zeta_i$ numerically. One root $\zeta_1$ has to be real, and the other two are likely to be complex because we expect $P$ to be strictly increasing in $\lambda_F$ and in $\lambda_S$: raising either weight should enhance the trending behaviour and hence the asymptotic skewness.

As a particular example, let $N_{\alpha_F} = 5$, $N_{\beta_F} = 10$, $N_{\alpha_S} = 20$, $N_{\beta_S} = 40$. Then $\zeta_1 \approx -1.476$ […] $\lambda_F + 1.476\, \lambda_S > 0$. This being so, it is easily incorporated into an optimisation as a ‘style constraint’. Figure 4 shows the results for two examples, (i) $\lambda_F = -1$, $\lambda_S = 1$, so the short-term behaviour is counter-trending and generates negative skewness; (ii) the critical case $\lambda_F = 1.476$, $\lambda_S = -1$, where now just enough long-term counter-trending behaviour is added to make the asymptotic skew zero at leading order. These exemplify the cases (i), (ii) discussed above. The results were obtained using (16) again, which is not laborious despite there being four poles (so that the double summation has sixteen terms): it is preferable to Monte Carlo simulation, which even with a few hundred thousand simulations generates noticeable uncertainty.

As pointed out in [21], trend-following strategies are often thought to have a long-option-type payoff on account of the positive skewness. For linear strategies this can be formalised as follows.
The $M$-period trading return is
$$Y^{(M)}_n = u'\Gamma u, \qquad u = \begin{bmatrix} U_{n+M} & U_{n+M-1} & \cdots \end{bmatrix}'$$
where the symmetric matrix $\Gamma$ has zero diagonal and off-diagonal entries
$$\Gamma_{ij} = \tfrac12\, a_{|i-j|-1} \quad \text{if } \min(i,j) \le M, \qquad \Gamma_{ij} = 0 \quad \text{otherwise,} \tag{21}$$
so that only the first $M$ rows (and, by symmetry, the first $M$ columns) are populated. The moments of $Y^{(M)}_n$ relate to the spectrum of $\Gamma$, and direct calculation reveals
$$\bigl\langle Y^{(M)}_n \bigr\rangle = \operatorname{tr}(\Gamma) = 0, \qquad \bigl\langle (Y^{(M)}_n)^2 \bigr\rangle = 2\operatorname{tr}(\Gamma^2), \qquad \bigl\langle (Y^{(M)}_n)^3 \bigr\rangle = 8\operatorname{tr}(\Gamma^3).$$
Writing $\Gamma$ in terms of its eigenvalues $\gamma_j$ and normalised eigenvectors $e_j$, we have an expression that is a weighted sum (weights adding to zero) of 'orthogonal quadratic bets', i.e. squared linear combinations of returns, which are like straddle payoffs but have constant convexity:
$$Y^{(M)}_n = \sum_j \gamma_j (e_j \cdot u)^2. \tag{22}$$
We have $\operatorname{tr}(\Gamma^r) = \sum_j \gamma_j^r$, so the moments of $Y^{(M)}_n$ relate to the moments of the eigenvalue distribution. It is easy to see the rank of $\Gamma$ is $\le 2M$ (and is $M + 1$ in the EMA1 case, as then the rows after the $M$th are linear multiples of each other), which limits the number of nonzero eigenvalues to $2M$. The interpretation of all this is that a positively skewed strategy has a small number of large positive eigenvalues and a larger number of smaller negative ones. This generates a small number of large positive-convexity bets and a larger number of smaller negative-convexity bets, which is where the positive skewness comes from. Dao et al. [5] make the same point, but emphasise that the positive-convexity bets are long-dated options and the negative-convexity bets short-dated ones. Thus in situations where the long-term volatility is elevated and the short-term volatility is low, momentum strategies work well.

Footnote (on the factorisation of $P$): Because the other two terms in $P$ multiply to give a quadratic that is always positive, and hence of no consequence.

This subsection may be omitted at a first reading.
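Before moving on, the quadratic-form representation (21) and the trace formulas for the second and third moments can be checked with a small numerical sketch. It is not the author's code. It assumes unit-variance returns, truncates the infinite matrix after `K` lags, and uses EMA1-type weights $a_j = \alpha^j$ purely as an illustrative choice; the function names are hypothetical.

```python
def build_gamma(a, M):
    """Truncated Gamma of (21): entries a[|i-j|-1]/2 where min(i,j) < M (0-based), else 0.

    `a` holds the weights a_0, ..., a_{K-1}; truncating at K lags is an assumption
    made only so that the matrix is finite.
    """
    K = len(a)
    n = M + K
    G = [[0.0] * n for _ in range(n)]
    for i in range(M):                 # only the first M rows/columns are populated
        for k in range(i + 1, n):
            lag = k - i - 1
            if lag < K:
                G[i][k] = G[k][i] = 0.5 * a[lag]
    return G


def trace_powers(G):
    """Return tr(G), tr(G^2), tr(G^3) for a symmetric matrix given as nested lists."""
    n = len(G)
    tr1 = sum(G[i][i] for i in range(n))
    tr2 = sum(G[i][j] ** 2 for i in range(n) for j in range(n))
    G2 = [[sum(G[i][k] * G[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    tr3 = sum(G2[i][j] * G[j][i] for i in range(n) for j in range(n))
    return tr1, tr2, tr3


def skewness(a, M):
    """Skewness of the M-period trading return: 8 tr(G^3) / (2 tr(G^2))^(3/2)."""
    _, tr2, tr3 = trace_powers(build_gamma(a, M))
    return 8.0 * tr3 / (2.0 * tr2) ** 1.5


if __name__ == "__main__":
    alpha = 0.9                               # illustrative decay, not from the paper
    a = [alpha ** j for j in range(60)]       # positive (trend-following) weights
    for M in (5, 20, 40):
        print(M, round(skewness(a, M), 3))
```

With all weights positive, every term in $\operatorname{tr}(\Gamma^3)$ is non-negative, so the sketch returns positive skewness for any $M \ge 2$, in line with the eigenvalue interpretation above.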
If we want to know the full distribution of trading returns, we need to make an assumption about the full distribution of the market returns, whereas until now we have only used the first three moments.

We can use the ideas of the previous section to compute the full distribution of trading returns, exactly as is done by Acar [1, Ch.3] using generating functions. The moment-generating function of the $M$-period return is
$$F_M(s) := \bigl\langle \exp(sY^{(M)}) \bigr\rangle = \bigl\langle \exp\bigl(s(\varphi_0 U_1 + \cdots + \varphi_{M-1} U_M)\bigr) \bigr\rangle.$$
Let us assume that the $(U_n)$ are Normally distributed. Then for a linear model
$$F_M(s) = [\det(I - 2s\Gamma)]^{-1/2}$$
with $\Gamma$ as above. (See e.g. [7] for details on quadratic transformations of Normal variables.)

Before proceeding further we should note that the above expression is not very helpful because it requires the manipulation of $\Gamma$, which is an infinite matrix. Let us therefore evaluate ab initio the expression
$$\bigl\langle \exp\bigl(s(\varphi_0 U_1 + \cdots + \varphi_{M-1} U_M)\bigr) \bigr\rangle.$$
Conditioning on $(U_1, \ldots, U_M)$, effectively fixing those values, we have (inside the exponential) a linear combination of $U_0, U_{-1}, \ldots$, added to another expression that is a function of $(U_j)_{j=1}^M$ only. In the first part the coefficients are
$$U_0:\ s(a_0 U_1 + a_1 U_2 + \cdots + a_{M-1} U_M), \qquad U_{-1}:\ s(a_1 U_1 + a_2 U_2 + \cdots + a_M U_M)$$
and so on. These variables can then be integrated out to give the expression
$$\exp\Biggl\{ \frac{s^2}{2} \sum_{l=-1}^{\infty} \Biggl( \sum_{k=1}^{M} a_{k+l} U_k \Biggr)^{2} \Biggr\} = \exp\Biggl\{ \frac{s^2}{2} \sum_{l=-1}^{\infty} \sum_{j,k=1}^{M} a_{j+l}\, a_{k+l}\, U_j U_k \Biggr\}.$$
The other part of the expression depends on the $U$'s through pairwise products, i.e. $U_1U_2$, etc., and is easily seen to be
$$\exp\Biggl( s \sum_{1 \le j < k \le M} a_{k-j-1} U_j U_k \Biggr).$$
[...]

[Figure 5 (caption only partly recoverable): term structure of skewness for (a,b) sigmoids; (c,d) reverting sigmoids; (e,f) double-step; EMA2 with $N = 20, 40$.]

Simple sigmoid, $\psi(z) = c_\lambda \cdot \bigl(2\Phi(\lambda z) - 1\bigr)$

In effect, this caps the position when the magnitude of the momentum signal is large.
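As a hedged illustration, the simple sigmoid just defined, together with the reverting sigmoid and double-step treated in the rest of this section, can be coded directly from their definitions and normalisation constants. The sketch below is not the author's implementation. It assumes the input $z$ is a standard Normal momentum signal, and checks by crude numerical integration that each function is scaled so that $\langle \psi(Z)^2 \rangle = 1$; the helper names and the parameter values are illustrative only.

```python
import math


def norm_cdf(x):
    """Standard Normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def sigmoid(z, lam):
    """Simple sigmoid c * (2*Phi(lam*z) - 1): caps the position for large |z|."""
    c = (2.0 / math.pi * math.atan(lam ** 2 / math.sqrt(1.0 + 2.0 * lam ** 2))) ** -0.5
    return c * (2.0 * norm_cdf(lam * z) - 1.0)


def reverting_sigmoid(z, lam):
    """Reverting sigmoid c * z * exp(-lam^2 z^2 / 2): largest position at z = +/- 1/lam."""
    c = (1.0 + 2.0 * lam ** 2) ** 0.75
    return c * z * math.exp(-0.5 * lam ** 2 * z ** 2)


def double_step(z, eps):
    """Double-step c * (1[z > eps] - 1[z < -eps]) with a dead zone of half-width eps."""
    c = (2.0 * norm_cdf(-eps)) ** -0.5
    if z > eps:
        return c
    if z < -eps:
        return -c
    return 0.0


def second_moment(psi, lim=10.0, n=20001):
    """Trapezoidal estimate of E[psi(Z)^2] for Z ~ N(0,1) on [-lim, lim]."""
    h = 2.0 * lim / (n - 1)
    total = 0.0
    for i in range(n):
        z = -lim + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * psi(z) ** 2 * norm_pdf(z)
    return total * h


if __name__ == "__main__":
    # Illustrative parameter values, chosen only for the check.
    print(second_moment(lambda z: sigmoid(z, 0.5)))
    print(second_moment(lambda z: reverting_sigmoid(z, 0.5)))
    print(second_moment(lambda z: double_step(z, 0.6)))
```

Each printed value should be close to one; the double-step figure is slightly less accurate because the integrand is discontinuous at $\pm\varepsilon$.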
We have

H k = 2 a k − c λ (2/π) / λ √ λ arctan( λ ρ / √ λ √( λ + 2(1 − ρ ) λ ) ) | ρ = R ak    (30)

and for normalisation we require $c_\lambda = \bigl(\tfrac{2}{\pi}\arctan\tfrac{\lambda^2}{\sqrt{1+2\lambda^2}}\bigr)^{-1/2}$. We obtain in the limit $\lambda \to 0$

H k = 2 a k − R ak

which is, as expected, the linear result.

Reverting sigmoid, $\psi(z) = c_\lambda \cdot z e^{-\lambda^2 z^2/2}$

The behaviour of this one is more nonlinear in the sense that it begins reducing the position when the momentum gets too high, ultimately to zero if the momentum is strong enough. The rationale for this is that a very strong trend might be more susceptible to reversing (market overbought/oversold), justifying a reduction in position. The maximum positions are held when $z = \pm\lambda^{-1}$. We have

H k = 2 a k − c λ ρ ( − (1 − ρ ) λ ) ( λ + 2(1 − ρ ) λ ) / | ρ = R ak    (31)

and for normalisation we require $c_\lambda = (1 + 2\lambda^2)^{3/4}$. Notice that $H_k$ becomes negative if $\lambda$ is high enough: this is not surprising because $\psi'(Z)$ is negative for $|Z| > \lambda^{-1}$, wherein the model is betting against the trend. As expected $\lambda \to$ [...]

Double-step, $\psi(z) = c_\lambda \cdot (\mathbf{1}_{z > \varepsilon} - \mathbf{1}_{z < -\varepsilon})$

This is +1 if the momentum is positive enough, − [...] $\varepsilon > 0$. This is similar to the one considered by Potters & Bouchaud. We have

H k = 2 a k − c λ φ( ε ) ( Φ( ε √( ρ − ρ ) ) − Φ( ε √( − ρ ρ ) ) ) | ρ = R ak    (32)

and for normalisation we require $c_\lambda = \bigl(2\Phi(-\varepsilon)\bigr)^{-1/2}$.

From this it can be seen that as $\varepsilon \to 0$, creating a binary response $\psi(z) = \operatorname{sgn}(z)$, the skewness vanishes. One way of understanding this is to see that as the position is always of magnitude 1, it does not have the characteristic, associated with positive momentum, of increasing the position when the P&L is positive. A less precise explanation is that during periods in which the market is not trending, the strategy loses money rather quickly because it is buying and selling the same size of position as it holds when a trend has been detected.
By contrast, the other two (sigmoidal) functions that we have just examined only trade a small size until a trend is established, resulting in a P&L distribution with more, but smaller, losses, and fewer, but bigger, gains: that is where their positive skewness comes from. A recent piece on FX strategy [6] makes this point somewhat [...]

Footnote: Do not use these results for $\varepsilon <$ [...]

Footnote: This can also be seen from the sigmoidal case when $\lambda \to \infty$, as the argument in the arctan() term goes to zero. The result can also be seen directly from (28), because $\psi(Z_1)^2 = 1$ a.s. and $Z_2 - \rho Z_1$ is independent of $Z_1$, so the expectation decouples into a product of two expectations each of which is zero. Technical point: The behaviour as $\rho \to \pm$ [...] $\rho = 1$; similarly for $-1$. However, these two values of $\rho$ cannot occur in our problem.

[...] $\varepsilon$ is raised, notice that the skewness rises without limit, which seems rather good: however, if $\varepsilon$ is too high then the algorithm hardly ever trades, so practicalities dictate $\varepsilon \lesssim$ [...].

[Figure 6: Sharpe ratio and skewness as a function of the width parameter $\varepsilon$ for double-step type activation function, showing performance dropoff for $\varepsilon >$ [...].]

[Figure 7 (table): Sharpe ratio and skewness as a function of $\lambda$ and the ratio $w_R/w_S$ for compound sigmoidal activation function. In each pair of numbers, the first is the SR, and the second is the skewness.]

We show in Figure 5 the term structure of skewness for the different examples given above: sigmoid (a,b), reverting sigmoid (c,d), double-step (e,f). The linear result is overlaid for comparison. The precise choice of momentum crossover does not affect the main conclusion, and we have used EMA2 with $N = 20, 40$ throughout. Using a faster or slower momentum measure simply stretches or compresses the graph in a horizontal direction, as it did in the linear models.

[Figure 8: Sketch of the activation function highlighted in Figure 7; raw signal $z$ against transformed signal $\psi(z)$. Parameters: $w_R = 0.$[...], $w_S = 0.$[...], $\lambda = 0.$[...]]
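The compound activation function of Figures 7 and 8 is a weighted sum of the simple and reverting sigmoids, with the weights tied together by the elliptical constraint $w_S^2 + 2w_Sw_R\cos\delta + w_R^2 = 1$, where $\cos\delta$ is the correlation between the two transformed signals. A minimal sketch follows, under the assumption that the raw signal is standard Normal. The function names, the ratio $w_R/w_S = 1$ and $\lambda = 0.5$ are illustrative choices, not the parameters of the highlighted example.

```python
import math


def norm_cdf(x):
    """Standard Normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cos_delta(lam):
    """Correlation between the sigmoid (S) and reverting-sigmoid (R) signals."""
    atan_term = math.atan(lam ** 2 / math.sqrt(1.0 + 2.0 * lam ** 2))
    return lam * (1.0 + 2.0 * lam ** 2) ** 0.25 / (1.0 + lam ** 2) / math.sqrt(atan_term)


def weights(ratio, lam):
    """Return (w_S, w_R) with w_R / w_S = ratio, satisfying the elliptical constraint."""
    w_s = 1.0 / math.sqrt(1.0 + 2.0 * ratio * cos_delta(lam) + ratio ** 2)
    return w_s, ratio * w_s


def compound(z, w_s, w_r, lam):
    """Weighted sum of the two normalised sigmoidal activation functions."""
    c_s = (2.0 / math.pi * math.atan(lam ** 2 / math.sqrt(1.0 + 2.0 * lam ** 2))) ** -0.5
    c_r = (1.0 + 2.0 * lam ** 2) ** 0.75
    s_part = c_s * (2.0 * norm_cdf(lam * z) - 1.0)
    r_part = c_r * z * math.exp(-0.5 * lam ** 2 * z ** 2)
    return w_s * s_part + w_r * r_part


def second_moment(psi, lim=10.0, n=20001):
    """Trapezoidal estimate of E[psi(Z)^2] for Z ~ N(0,1) on [-lim, lim]."""
    h = 2.0 * lim / (n - 1)
    total = 0.0
    for i in range(n):
        z = -lim + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * psi(z) ** 2 * norm_pdf(z)
    return total * h


if __name__ == "__main__":
    lam, ratio = 0.5, 1.0                      # hypothetical values
    w_s, w_r = weights(ratio, lam)
    print(w_s, w_r, cos_delta(lam))
    print(second_moment(lambda z: compound(z, w_s, w_r, lam)))  # should be close to 1
```

Fixing the ratio $w_R/w_S$ and solving the constraint for $w_S$ leaves two free parameters, $\lambda$ and the ratio, which is how the table of Figure 7 is indexed.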
[Figure 9: Trend-following oil over a decade, using the sigmoid and reverting sigmoid for activation function ($\lambda = 0.$[...], $N = 10, 20$). Series shown: oil price, with P&L (sigmoid) and P&L (reverting sigmoid) on the right-hand axis. The reverting sigmoid fails to capitalise on the full selloff in late 2014.]

It is apparent from the results that as the activation function becomes progressively less linear, the main effect is to compress the graph in a vertical direction, so that the maximum skewness is reduced. With the reverting sigmoid, the graph can be affected much more, to the extent of becoming negative when $\lambda$ is high enough: we predicted this earlier when remarking that $H_k$ could become negative as a result of the activation function being decreasing over much of its domain, so the model spends a lot of time incrementally trading against the trend rather than with it. (In fact for $\psi(z) = ze^{-\lambda^2z^2/2}$ the critical $\lambda$, above which the skewness is no longer everywhere positive, is around 1.3. An explanation is in the Appendix.)

The general conclusion so far is that any capping effect in the activation function will cause the trading returns to be less positively skewed, and any reverting effect will exacerbate this reduction in skewness. From the perspective of skewness alone, these effects should be avoided as much as possible. However, they may well be justified by reason of risk management and/or expected return, so we consider these next.

Analysis of the expected return is a totally different proposition because there are no theoretical guidelines at all. One can only adopt an empirical approach, seeing what has worked in the past, and relying on it continuing to do so.

We need to decide what objective function is to be maximised, and the most natural thing to do is to maximise the Sharpe ratio (SR) of the trading strategy, i.e. use an objective function that directly relates to trading model performance.
As the SR is the expected return divided by the volatility, we will be penalising any effect that increases volatility without generating enough extra return. Taking a range of futures contracts across different asset classes (stocks, bonds, FX, commodities) and a range of EMA2 periods (5 vs 10 days, 10 vs 20 days, etc.), we have run trading simulations over the available history, which is typically 20 years or more, and calculated the Sharpe ratio; this gives a list of Sharpe ratios, one for each contract and speed. For simplicity we are going to use the same activation function across all contracts and speeds. We then average the list of Sharpe ratios and use this as our performance indicator, to be maximised.

We first examine the double-step activation function. Here there is only one parameter to adjust, namely $\varepsilon$, the half-width of the 'dead zone'. Figure 6 shows the performance as a function of $\varepsilon$. It is not surprising that the SR drops off as $\varepsilon$ becomes large, because the strategy hardly ever has a position on and can never make any money. What is interesting is that the performance for $\varepsilon <$ [...] $\varepsilon <$ [...] $M$) throughout: we choose $M = 100$ for convenience, this being the top of the curve for a linear activation function when the EMA periods are 20, 40 (see Figure 5(b,d,f)). The skewness is also shown in Figure 6, and clearly it increases with increasing $\varepsilon$, so from that perspective alone we prefer $\varepsilon$ as high as possible. If we can push $\varepsilon$ up to about 0.6 without decreasing the SR, and in doing so can have positive skewness as well, then we should do just that. So this is our first conclusion about design of nonlinear momentum strategies: the blue line in Figure 5(e), $\varepsilon = 0.6$, is a good construction.

Next we turn to the sigmoidal functions that we introduced earlier, and take a linear combination of them:
$$\psi(z) = w_S \Bigl(\tfrac{2}{\pi}\arctan\tfrac{\lambda^2}{\sqrt{1+2\lambda^2}}\Bigr)^{-1/2} \bigl(2\Phi(\lambda z) - 1\bigr) + w_R\, (1 + 2\lambda^2)^{3/4}\, z e^{-\lambda^2 z^2/2}$$
with weights $w_{R,S} > 0$.
To normalise the weighted function we enforce the elliptical constraint
$$w_S^2 + 2 w_S w_R \cos\delta + w_R^2 = 1,$$
where $\cos\delta$, the correlation between the R- and S- signals, is given by
$$\cos\delta = \frac{\lambda (1 + 2\lambda^2)^{1/4}}{1 + \lambda^2} \Bigl(\arctan\frac{\lambda^2}{\sqrt{1 + 2\lambda^2}}\Bigr)^{-1/2}.$$
The effective number of parameters is now two: the horizontal scaling $\lambda$ and the ratio $w_R/w_S$. The results are shown numerically in the table of Figure 7.

Footnote: The degree of temporal and cross-asset-class diversification that can be obtained is governed by what is commonly known as 'breadth' and explained in detail by Grinold [8].

The general picture is that the performance surface is rather flat. Provided one avoids the far left (where the function is too linear and suffers from putting on too much risk when momentum is high) or the top right (where it reverts to zero too quickly when momentum is high), any of the pairs $(w_R/w_S, \lambda)$ would do reasonably well.

Again, we overlay the conclusions about skewness, simplifying as before by using the skewness of $M = 100$-day returns. We know that higher skewness arises from a low value of $\lambda$ and from $w_R/w_S$ small, i.e. little reversive behaviour. This means going as far as possible to the left of the table, and steering well clear of the top right. Going to the left does lose performance (SR), so some trade-off is required. One particular example is highlighted ($w_R = 0.$[...], $w_S = 0.$[...], $\lambda = 0.$[...]) [...] $ze^{-\lambda^2z^2/2}$ does (top row of table), and is likely to be preferable on account of its better third-moment characteristics.

We now return to the discussion at the outset about designing strategies that perform well in specific scenarios. Figure 9 shows the results for the sigmoid and reverting sigmoid, with $\lambda = 0.$[...], $N = 10, 20$, over ten years, for the oil market. It is clear that the reverting sigmoid does substantially less well. This is because the trend is strong and persists for a long time, so the reverting behaviour of the activation function causes severe underperformance.
Yet the third column of Figure 7 suggests that the reverting sigmoid (top row) on average performs better than the sigmoid (bottom row). Part of the selling-point of CTA strategies is their ability to produce 'alpha' in scenarios such as the selloff in oil (and other commodities, and associated equities) in late 2014. It follows that one should make sure that the strategy does well in such scenarios, rather than simply relying on what has produced the best historical SR. This example also corroborates our earlier remark about calibration being sensitive to data history and therefore subjective. Were the oil selloff in late 2014 absent from the calibration, one would arrive at different conclusions about the optimal model.

As a final comment, we see that the smoother activation functions offer only a small improvement in Sharpe ratio over the double-step. However, we have not considered transaction costs, and models that generate sudden large trades can be more difficult to run a large amount of money on. Models that take positions more gradually are therefore easier to handle in trading. They are also easier to handle in backtesting, because a slight change in the definition of the momentum oscillator, which is the input to the activation function, can for a discontinuous function make a huge difference to the simulated position: even minor changes to the strategy can produce unpredictable results.

That said, the characteristic of the double-step function, that it waits until the momentum is above a certain level before trading, may be worthy of further investigation. Thus one aims for a function that is zero for $0 < |x| < \varepsilon$, then rises smoothly until a maximum is reached, and then rolls off slowly and asymptotes to a level above zero.

Conclusions and final remarks

We have shown how to analyse the behaviour of a variety of trend-following models by particular reference to the skewness of the distribution of trading returns.
To do this we have needed only the first three moments of the market returns, thus keeping the modelling quite general. As regards linear models the most important formulae are (14,16), giving the second and third moments of the trading returns in an elegant application of residue calculus. Pure momentum (trending) strategies generate positive skewness even though the market returns might be totally symmetrical. The skewness depends on the return period and has a characteristic term structure which we have derived, illustrated and verified with real data. Hybrid strategies, with trending and counter-trending behaviour, may exhibit a more complex term structure of skewness, and we have shown how to analyse a general linear model.

We have investigated 'nonlinear momentum strategies' from different angles, understanding the Sharpe ratio and the skewness (in essence, the first and third moments) of their trading returns. The former was investigated empirically, and the latter mathematically. We have also pointed out that it may be wise to consider the behaviour of a strategy in specific scenarios, especially if they are a raison d'être of momentum trading. Specific conclusions about optimal design are given in the text, but two salient ones are repeated here.

First, the common practice of forming a momentum signal from moving averages and then making a 'binary bet' on it, +1 or −1 [...] −1. This of course makes the skewness negative, but we do not want it to be as negative as possible. Therefore the optimal design will not simply be the reverse of what we have done here. Instead, something like the reverting sigmoid, now written as $\psi(x) = -c_\lambda x e^{-\lambda^2x^2/2}$, is likely to be a good idea. When the market deviates from the reversion level a long way, some risk is taken off, which is likely to be beneficial.

A common explanation for the positive skewness is that it arises from the strategy having positive convexity, as mentioned for example in [5].
This is partly true, and we have explained its origin in §3, but in fact there is more to it than that, as will be discussed in forthcoming work.

Formulary

A.1 Expectation formulae

If $(Z_1, Z_2) \sim N_2(\rho)$ (the bivariate Normal distribution with $N(0,1)$ marginals and correlation $\rho$) then
$$E\bigl[f(Z_1, Z_2)\, e^{-a_1^2 Z_1^2/2}\, e^{-a_2^2 Z_2^2/2}\bigr] = \frac{1}{\sqrt{D}}\, \hat{E}\bigl[f\bigl(\sqrt{D_2/D}\, Z_1, \sqrt{D_1/D}\, Z_2\bigr)\bigr]$$
where under $\hat{E}$, $(Z_1, Z_2) \sim N_2(\hat\rho)$, with
$$\hat\rho = \frac{\rho}{\sqrt{D_1 D_2}}, \qquad D_i = (1 - \rho^2)a_i^2 + 1, \qquad D = (1 - \rho^2)a_1^2a_2^2 + a_1^2 + a_2^2 + 1.$$
The following results are of use in obtaining (30–32). For $Z \sim N(0,1)$,
$$\bigl\langle Z^{2n} e^{-b^2Z^2/2} \bigr\rangle = (2n-1)!!\,(1 + b^2)^{-(2n+1)/2}$$
$$\langle \phi(a + bZ) \rangle = \frac{1}{\sqrt{1 + b^2}}\, \phi\Bigl(\frac{a}{\sqrt{1 + b^2}}\Bigr)$$
$$\langle \Phi(a + bZ) \rangle = \Phi\Bigl(\frac{a}{\sqrt{1 + b^2}}\Bigr)$$
$$\langle Z\phi(a + bZ) \rangle = \frac{-ab}{(1 + b^2)^{3/2}}\, \phi\Bigl(\frac{a}{\sqrt{1 + b^2}}\Bigr)$$
$$\langle Z\Phi(a + bZ) \rangle = \frac{b}{\sqrt{1 + b^2}}\, \phi\Bigl(\frac{a}{\sqrt{1 + b^2}}\Bigr)$$
$$\langle \Phi(aZ)\Phi(bZ) \rangle = \frac{1}{2\pi} \arctan\Bigl(\frac{ab}{\sqrt{1 + a^2 + b^2}}\Bigr) + \frac14$$
The last result follows from differentiating both sides w.r.t. $a$, and using integration by parts on $\langle Z\phi(aZ)\Phi(bZ) \rangle$.

A.2 Skewness of reverting sigmoid activation function

We justify why the skewness is always positive for $|\lambda| \lesssim 1.3$. It was derived for linear models by means of $z$-transforms that for large $M$,
$$\bigl\langle (Y^{(M)}_n)^3 \bigr\rangle \sim M \sum_{k=1}^{\infty} H_k.$$
We are going to calculate this infinite sum, at least approximately, in the EMA1 case. By (31) the condition for positivity of the above expression is

Σ_{k=1}^∞ α k α k ( − (1 − α k ) λ ) ( λ + 2(1 − α k ) λ ) / > [...]

Write $\alpha^k = u$ and approximate the sum as an integral over $u$, to give

∫ u ( − (1 − u ) λ ) ( λ + 2(1 − u ) λ ) / du/u > [...]

Upon doing the integral and tidying up, one ends up with

2 + 9 λ + 7 λ − λ > [...] λ < [...].65.

This does not explain rigorously what goes on in the pre-asymptotic region when $M$ is not large, and it uses EMA1 rather than EMA2, but the above analysis seems sufficient.

References

[1] E. Acar. Economic Evaluation of Financial Forecasting.
PhD thesis, City University, London, 1992.
[2] E. Acar and S. Satchell. Advanced Trading Rules. Butterworth, 2002.
[3] T. Björk. Arbitrage Theory in Continuous Time. Oxford University Press, 1998.
[4] B. Bruder, T.-L. Dao, J.-C. Richard, and T. Roncalli. Trend filtering methods for momentum strategies. 2011.
[5] T.-L. Dao, T.-T. Nguyen, C. Deremble, Y. Lempérière, J.-P. Bouchaud, and M. Potters. Tail protection for long investors: Trend convexity at work. arXiv:1512.08037, 2016.
[6] D. Bloom et al. Momentum strategies in FX. Technical report, HSBC Global Research, 27th Feb 2012.
[7] A. Feuerverger and A. C. M. Wong. Computation of value-at-risk for nonlinear portfolios. J. of Risk, 3(1):37–55, 2000.
[8] R. C. Grinold and R. N. Kahn. Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk. McGraw-Hill, New Jersey, 1999.
[9] S. Haykin. Modern Filters. Macmillan, 1989.
[10] A. F. Herbst. Analyzing and Forecasting Futures Prices. Wiley, 1992.
[11] P. Jusselin, E. Lezmi, H. Malongo, C. Masselin, T. Roncalli, and T.-L. Dao. Understanding the momentum risk premium: An in-depth journey through trend-following strategies. 2017.
[12] R. J. Martin. Saddlepoint methods in portfolio theory. In A. Lipton and A. Rennie, editors, The Oxford Handbook of Credit Derivatives, chapter 15. Oxford University Press, 2011. arXiv:1201.0106.
[13] R. J. Martin. Optimal multifactor trading under proportional transaction costs. arXiv:1204.6488, 2012.
[14] R. J. Martin. Universal trading under proportional transaction costs. RISK, 27(8):54–59, 2014. arXiv:1603.06558.
[15] R. J. Martin and A. Bana. Nonlinear momentum strategies. RISK, 25(11):60–65, 2012.
[16] R. J. Martin and A. Birchall. Black to Negative: Embedded optionalities in commodities markets. arXiv:2006.06076v2, 2020.
[17] R. J. Martin and M. J. Kearney. Time since maximum of Brownian motion and asymmetric Lévy processes. J. Phys. A: Math. Theor., 51:275001, 2018.
[18] R. J. Martin and T.
Schöneborn. Mean reversion pays, but costs. RISK, 24(2):84–89, 2011. Full version at arXiv:1103.4934.
[19] R. J. Martin and D. Zou. Momentum trading: 'skews me. RISK, 25(8):40–45, 2012.
[20] M. Potters and J.-P. Bouchaud. Trend followers lose more often than they gain. Wilmott Magazine, pages 58–63, Nov/Dec 2005. Also at arXiv:0508104.
[21] H. Till and J. Eagleeye. A Hedge Fund Investor's Guide to Understanding Managed Futures. Technical report, EDHEC-Risk Institute, 2011.