Constructing trading strategy ensembles by classifying market states
Michal Balcerak* and Thomas Schmelzer†
Institute for Theoretical Physics, Georg-August-Universität Göttingen, Germany
Faculty of Business and Economics (HEC Lausanne), University of Lausanne, Switzerland
Rather than directly predicting future prices or returns, we follow a more recent trend in asset management and classify the state of a market based on labels. We use numerous standard labels and even construct our own. The labels rely on future data to be calculated and can be used as a target for training a market state classifier using an appropriate set of market features, e.g. moving averages. The construction of those features relies on their label separation power: only a set of reasonably distinct features can approximate the labels.

For each label we use a specific neural network to classify the state using the market features from our feature space. Each classifier gives a probability to buy or to sell, and combining all their recommendations (here only done in a linear way) results in what we call a trading strategy. There are many such strategies, and some of them are somewhat dubious and misleading. We construct our own metric based on past returns, but penalizing for a low number of transactions or small capital involvement. Only the top trading strategies, score-performance-wise, end up in the final ensembles.

Using the Bitcoin market we show that the strategy ensembles outperform both in returns and risk-adjusted returns in the out-of-sample period. Even more so, we demonstrate that there is a clear correlation between the success achieved in the past (if measured in our custom metric) and the future.

* [email protected], [email protected]
† [email protected], [email protected]
Introduction

Using neural networks to predict financial time series data is today widely regarded as the old unfulfilled dream of quantitative finance. One idea would be to apply supervised learning and train a neural network with sub-windows of a time series to predict the next data point(s). So instead of using images of dogs and cats, we use at some time $t$ the last $n$ points of a time series to predict a point following at some time $t' > t$. Given the non-stationary nature of time series market data and low signal-to-noise ratios, this is a rather ambitious problem.

For instance, rather than using $n$ prices (or returns), we reduce the dimensionality of the problem by using $m \ll n$ features based on the very same $n$ points, i.e. an optimal combination of $m$ moving averages. Such questions have typically been addressed by linear regression. However, linear regression fails to exploit any non-linear effects between the features.

We do not stop at modifying the input - we also alter the goals of our predictions. Rather than aiming for a (noisy) price trajectory, we ask simpler questions more suitable for the machinery of machine learning. Our goal is to quantify the probability $p$ of a market being in a class or category $c$, or moving into one within the next hours or minutes. This could be the probability of a trend reversion or of a spike in volatility or volume. We rely on labels as recently made popular by López de Prado [2] but also create some of our own. The flexibility of labels allows us to design a strategy by emphasizing the effects we try to cover.

For each label we ask for an optimal set of $m$ features to approximate it. These features, through a classifier, induce a probability for the market to be in a particular label-class. We then ask for an optimal linear combination of those probabilities to execute trades.
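The dimensionality reduction mentioned above - summarizing $n$ raw prices by $m \ll n$ moving averages - can be sketched in a few lines of numpy. The window lengths below are illustrative, not the paper's:

```python
import numpy as np

def moving_average(prices, window):
    """Trailing simple moving average over the last `window` points."""
    kernel = np.ones(window) / window
    return np.convolve(prices, kernel, mode="valid")

def feature_vector(prices, windows=(5, 20, 60)):
    """Reduce n prices to m = len(windows) features: the most recent
    moving-average value for each window length."""
    return np.array([moving_average(prices, w)[-1] for w in windows])

rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(0.001 * rng.standard_normal(500)))  # toy price path
x = feature_vector(prices)  # m = 3 features instead of n = 500 prices
```

A non-linear classifier then acts on the small vector `x` rather than the raw series.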
Rather than looking at a Sharpe ratio in an out-of-sample period, we construct robust variations of this concept and penalize for a lack of trading activity, etc. Although we do not aim for them directly, we observe high Sharpe ratios and attractive returns as a collateral side effect.

Labeling

We describe a market by a time series of data points $p_{t_0}, p_{t_1}, \dots$. Predicting unseen price data is a hard problem, often resulting in the notorious estimate that the next price is just the last observed price.

Rather than aiming for the next price, we argue that the market is currently in a particular label-class, which we ultimately want to identify without using any unseen future data.

Throughout this work we distinguish three such label-classes and identify them with the actions we intend to take:

• Buy. The market may start or continue to rise over the next few periods.
• Sell. The market may drop over the next few periods, the volume may drop significantly, or there is a spike in volatility.
• Neutral. We do nothing.

Obviously, identifying buy opportunities is trivial with a good level of hindsight. Looking at a historic time series we can identify numerous buy opportunities. This process is subject to some constraints we set, e.g. we may argue that the price at time $t$ was a buy opportunity if we see a significant rise over the next minutes following $t$.

We call this process of classifying over time the labeling of a time series. A particular label is thus a time series mapping $p_t$ to one of the three classes. Numerous such labels can and should be used. We use the popular threshold label to discuss the concept. We define the return over the period $t_i$ to $t_{i+1}$ as

$$r_{i,i+1} = \frac{p_{t_{i+1}}}{p_{t_i}} - 1.$$

We introduce a threshold $\tau$ and use

$$y_i = \begin{cases} \operatorname{sgn}(r_{i,i+1}) & \text{if } |r_{i,i+1}| > \tau, \\ 0 & \text{otherwise,} \end{cases}$$

so $y_i = 1$ (buy), $y_i = -1$ (sell) or $y_i = 0$ (neutral). To obtain a continuous variant of the label, if $|r_{i,i+1}| \leq \tau$ we use, instead of $y_i = 0$, a nonlinear function of the scaled return $r_{i,i+1}/\tau$. We could use a simple linear term; however, in our experiments this particular choice gave better results.

So, given a historic time series with all its price jumps and chaotic behaviour, we reduce it to a time series merely oscillating between three label-classes. Obviously we lose some information in this process, but one could also argue that we emphasize the information we really care about. And we can always combine multiple labels.

Identifying the moments we have missed to make a profit can help to evaluate the quality of a strategy; however, its inherent delay renders it of limited use in a live trading setup.

The idea is to approximate the labels with market features (i.e. technical trading indicators) that do not use any future data. Once in live trading, we can update the indicators live and can therefore talk about label-class predictions. The threshold is often made dynamic using estimates of the current volatility. We use a variation of this idea where, rather than $p_{t_{i+1}}$ in the definition of the return, we use a moving average of the prices following $t_i$.

The construction of such labels is an exercise done only during the training phase of the strategies. Running a backtest based on the actions induced by the labels over this training period would be a severe mistake. Although it would be possible to base labels on all sorts of financial data, e.g. volume, we use here exclusively labels based on price data.

Market representation

The central idea of this paper is to approximate the labels with a set of functions, also referred to as market features. The functions we use are standard technical indicators. The art is to resolve the labels in a small set of such parametrized functions. Those parameters are chosen in a way that maximizes the label separation power of those functions.

Although an arsenal of orthogonal functions, e.g. a set of sine waves, is generally a great choice for approximations, we believe it is not suitable to capture market dynamics. A Fourier transform of the label would learn everything about the seasonality of this label but would generalize very poorly in an out-of-sample period.

We present our ideas using a toy example of only two functions with one free parameter each. In Appendix: feature space we give a complete list of the features we have used.

Feature space

The set of feature functions is what we identify as the feature space. The parameters are not completely free: they are integer numbers from intervals we define. Hence we can pick, for each label, from a finite set of such features.
To illustrate a feature space on a relatively simple example, let us define it as 2 indicators with some possible parameters:

Example feature space:

• feature 1: A[X] = VWAP − SMA(X), where X ∈ [2, 10] [minutes]
• feature 2: B[Y] = VWAP − SMA(Y), where Y ∈ [30, 60] [minutes]

where VWAP stands for the volume-weighted average price of a given minute and SMA(Z) stands for a moving average of the last Z minutes. This set of two features has one feature looking at a relatively short time horizon and one with a relatively long one. We normalise their values to [−1, 1] using local scaling by the standard deviation and the arctangent function.

Let us define a label as one of the threshold labels: 1.5% price change in a 5-minute window. We now face a dilemma - which features from the feature space should we use? There are 279 candidates (9 different choices for feature 1 and 31 for feature 2). Let us fix the parameters to acquire 2 possible feature sets from the feature space and resolve the dilemma there:

• feature set 1: A[5], B[50]   (2)
• feature set 2: A[10], B[30]   (3)

An approximator sees market states only through their market representation. It is essential that the features used in the market representation differ in value when they encounter different classes of our label of choice. To measure these differences we use the L1 distance between the corresponding vectors of feature values. The market representations are illustrated in Figures 1 and 2.
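A minimal numpy sketch of this two-feature representation, assuming a per-minute VWAP series is already available. The exact form of the "[−1, 1] local scaling by standard deviation and arctangent" normalisation is our assumption, not the paper's recipe:

```python
import numpy as np

def sma(x, window):
    """Simple moving average over the last `window` values, per bar."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def feature(vwap_series, window):
    """A[X] resp. B[Y]: per-minute VWAP minus its `window`-minute SMA."""
    return vwap_series[window - 1:] - sma(vwap_series, window)

def normalise(f, scale_window=100):
    """Map raw feature values into [-1, 1]: scale by a local standard
    deviation, then squash with arctan (one plausible reading of the
    paper's normalisation; an assumption)."""
    sd = np.std(f[-scale_window:])
    return np.arctan(f / (sd if sd > 0 else 1.0)) * (2 / np.pi)

rng = np.random.default_rng(1)
vwap = 100 + np.cumsum(rng.standard_normal(300))  # toy per-minute VWAP series
f1 = normalise(feature(vwap, 5))    # feature set 1's A[5]
f2 = normalise(feature(vwap, 50))   # feature set 1's B[50]
```

Each minute's market state is then the pair (f1, f2) at that minute.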
BTC/USDT market and its representation

Figure 1: Market VWAP, threshold label with 1.2% price change and 5-minute time horizon, and its Eq. 2 market representation. See Fig. 2 for a zoomed-in view.
BTC/USDT market and its representation - zoomed in.
Figure 2: Fig. 1 zoomed in. The red and the blue star indicate two market states that differ label-class-wise but have similar feature values in the Eq. 2 market representation. The two features fail to resolve the cross-label-class pair, which is the central problem of market representation through feature selection.

Frequent low values of cross-label-class distances in a given market representation may cause severe problems for an approximator trying to correctly classify different market states as different label-classes. Let us look at feature-wise market representation distances across a time period through a histogram of cross-label-class distances - Fig. 3.

Cross-label-class distances histogram.
Figure 3: Histogram of distances in the Eq. 2 (feature set 1) and Eq. 3 (feature set 2) market representations. Both representations contain a lot of cross-label-class pairs with distances close to zero; however, feature set 1 should be slightly better for an approximator than feature set 2 because its distance histogram is skewed to the right. 3000 points from each class, so 9 million point pairs per histogram.

Based on Fig. 3 we conclude this subsection by saying that, for the threshold label with 1.2% price change and 5-minute time horizon, the Eq. 2 market representation is better than the Eq. 3 market representation. It is important to point out that both representations contain a lot of cross-label-class pairs with distances close to zero, so one should either search for a different feature set from the feature space or change the feature space altogether.
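The cross-label-class distance computation behind histograms like Fig. 3 can be sketched as follows, assuming each market state is a small vector of normalised feature values with a known label-class:

```python
import numpy as np

def cross_label_class_distances(features, labels, class_a, class_b):
    """L1 distances between every pair of market states where one state
    is in class_a and the other in class_b.
    features: (n_states, n_features) array; labels: (n_states,) array."""
    fa = features[labels == class_a]
    fb = features[labels == class_b]
    # pairwise |fa_i - fb_j|, summed over the feature axis
    return np.abs(fa[:, None, :] - fb[None, :, :]).sum(axis=-1).ravel()

rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(300, 2))  # toy 2-feature representation
labels = rng.integers(0, 3, size=300)         # 0 = buy, 1 = neutral, 2 = sell
d = cross_label_class_distances(features, labels, 0, 2)
hist, edges = np.histogram(d, bins=50)        # the histogram plotted in Fig. 3
```

Many near-zero entries in `d` indicate market states the features cannot tell apart.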
Fixing feature space parameters

We need a way to quantify how good a particular feature set is at representing a market for a particular label across a time period. Let us define the following metric for it:

Label separation power of a feature set: the inverse of the area under a cross-label-class distance histogram (as in Fig. 3), weighted by a function that only selects values relatively close to zero. The choice of the weighting function depends on the label and the number of features in the feature space.

Choosing a particular feature set from a feature space for a given label is done by maximising its label separation power with Bayesian Optimization [5] and HyperBand [1].
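A minimal sketch of this metric. The weighting function below (an exponential decay emphasising near-zero distances) is an illustrative assumption - the paper does not specify its exact form:

```python
import numpy as np

def label_separation_power(distances, scale=1.0, bins=50):
    """Inverse of the weighted area under the cross-label-class distance
    histogram. The weight exp(-d/scale) selects distances close to zero,
    so many near-zero distances => large weighted area => low power."""
    hist, edges = np.histogram(distances, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    weighted_area = np.sum(hist * np.exp(-centers / scale) * widths)
    return 1.0 / weighted_area

# A feature set whose cross-label-class distances are far from zero
# separates the classes better and should score higher:
well_separated = np.random.default_rng(0).uniform(5, 10, 10_000)
poorly_separated = np.random.default_rng(0).uniform(0, 2, 10_000)
assert label_separation_power(well_separated) > label_separation_power(poorly_separated)
```

Maximising this score over the integer feature parameters is what Bayesian Optimization and HyperBand are used for.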
Chosen market representation

The feature space used in later parts of the paper contains 28 standard price and volume indicators and is formally defined in Appendix: feature space.

The threshold label with 1.2% price change and 5-minute time horizon is one of the labels which we used for the analysis. The selected feature set has the following cross-label-class distance histogram:
Cross-label-class distances histogram.
Figure 4: Histogram of distances of the selected features for the threshold label with 1.2% price change and 5-minute time horizon. 3000 points from each class, so 9 million point pairs per histogram. Note the fundamental differences between Fig. 3 and this one: only a small fraction of the cross-label-class distances is below 3 on the same dataset as the Fig. 3 histograms.

Cross-label-class distances histogram - log scale.
Figure 5: Fig. 4 with a log-scaled Y axis. Class0 (buy) and class2 (sell) are the easiest to separate - this is in agreement with our intuition.
Label approximation

There are eight labels to approximate (defined in Appendix: selected labels) using eight different market representations (from the feature space defined in Appendix: feature space). Instead of approximating the continuous labels, we choose to classify the discrete label-classes. This way we can focus on identifying the three most important regimes of the labels. In addition, this approach gives us a probabilistic way to determine the confidence of our predictions: each discrete label-class is assigned a probability of occurring at a given time. The process of training a label classifier using historical market data is illustrated in Fig. 6, whereas obtaining the label approximation is illustrated in Fig. 7.

Process of label classifier preparation (training)
Figure 6: Illustration of training a label classifier using historical market data. Calculating the features requires knowledge of the recent past; however, calculating the true labels also requires knowledge of the near future.
Process of live label classification
Figure 7: Illustration of label classification using current market data. There is no need to know the near future in this process, which makes it possible to perform live.

Chosen algorithm
Based on Fig. 4 we see that an accurate approximator for this label and this feature set is possible to build but has to be non-linear. We conclude the same for the other seven labels. Because of the high number of datapoints in our training dataset (exact numbers in Experimental setup) and the required non-linear behaviour, we decided to use a neural network classifier and a supervised learning algorithm. For hyper-parameter optimisation we used the previously mentioned Bayesian Optimization [5] and HyperBand [1].

The loss calculator, which appears in Fig. 6, is built on a concept called loss scaling, which scales the loss based on the continuous labels. The central idea is to make class0 (buy) and class2 (sell) prediction accuracy more significant than class1 (do nothing) in the feedback loop to the label classifier during training. This is an essential step because of the heavy class imbalance in the labels we have chosen. We construct the scaling in such a way that the sum of the loss scale factors associated with class0 (buy) and class2 (sell) is equal to the sum of the loss scale factors for class1 (do nothing). In addition, we reduce the loss scaling between the continuous label values 0-1 and 1-2 to make the training focus on clearer buy/do nothing/sell signals.

Figure 8: Continuous-label-based loss scaling factors used during the training process (Fig. 6). The sum of the loss scale factors associated with class0 (buy) and class2 (sell) is equal to the sum of the loss scale factors for class1 (do nothing). We reduce the loss scaling between the continuous label values 0-1 and 1-2 to make the training focus on clearer buy/do nothing/sell signals (0, 1, 2).

A trained classifier acting on unseen data is illustrated in Fig. 9. Apart from industry-standard metrics like generalisation and confusion matrix coefficients, we also study our classifiers through Shapley values [3][4][6]. This approach enables us to see the impact of a particular feature on the model output.
If, at this point, the data had not complied with our intuitions, we would not have chosen this particular feature space and this algorithm for label approximation in further experiments.
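The loss scaling described above can be sketched as per-sample weights for a cross-entropy loss. The balancing constraint (buy + sell weight mass equal to the do-nothing mass) follows the text; the simple per-class weight profile below is an assumption, not the exact curve of Fig. 8:

```python
import numpy as np

def loss_scale_factors(discrete_labels, buy=0, neutral=1, sell=2):
    """Per-sample loss weights: rescale buy/sell samples so their total
    weight mass equals the do-nothing mass, making the rare buy/sell
    classes as influential as the dominant neutral class."""
    labels = np.asarray(discrete_labels)
    w = np.ones_like(labels, dtype=float)
    n_neutral = (labels == neutral).sum()
    n_active = ((labels == buy) | (labels == sell)).sum()
    w[labels != neutral] = n_neutral / n_active  # balance active vs neutral
    return w

def weighted_cross_entropy(probs, labels, weights):
    """Mean of weight * negative log-likelihood of the true class."""
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float(np.mean(weights * nll))

labels = np.array([1] * 90 + [0] * 6 + [2] * 4)  # heavy class imbalance
w = loss_scale_factors(labels)
assert np.isclose(w[labels != 1].sum(), w[labels == 1].sum())
```

A further down-weighting of samples whose continuous label sits between 0-1 or 1-2 would then be multiplied into `w`.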
Neural network output
Figure 9: Prices, the corresponding true labels, and the neural network model output predictions. Class0 (buy) in red, class1 (do nothing) in green, and class2 (sell) in blue, for the threshold label with 1.2% price change and 5-minute time horizon. Out-of-sample data fragment.
Trading strategy

Our strategies are based on linear combinations of 16 model outputs - class0 (buy) and class2 (sell) of each of the 8 labels. We map the combination to [0, 1] using the map

$$\varphi(x) = \begin{cases} (x+1)/2 & \text{if } |x| < 1, \\ (\operatorname{sgn}(x)+1)/2 & \text{otherwise,} \end{cases} \quad (4)$$

to acquire a trading signal $y$:

$$y = \varphi(W^{T} X) \quad (5)$$

where $W$ is a column of weights and $X$ is a column of the 16 model outputs - class0 (buy) and class2 (sell) of each of the 8 labels. We interpret the output value $y$ as the desired long position in the asset. To execute trades, we use three thresholds: $y_{buy}$ to buy, $y_{sell}$ to sell, and $y_{width}$ to prevent the execution of relatively small transactions - 19 constrained parameters in total. The parameter space is formally defined in Appendix: strategy space. Fig. 10 illustrates threshold-based trade execution: we increase (decrease) the long position if the trading signal $y$ is above (below) $y_{buy}$ ($y_{sell}$) and the distance between our previous position and the desired position is greater than $y_{width}$.

Figure 10: Illustration of the way our trading strategy changes the long position size based on a trading signal $y$. Thresholds for the illustration: $y_{buy} = 0.75$, $y_{sell} = 0.25$, $y_{width} = 0.10$.

We ran an experiment backtesting the trading strategy with 20 thousand different weight columns. The goal was to check performance on the past dataset and generalisation on the future dataset. Exact definitions of the datasets used are in the Experimental setup section. To acquire the weights we ran Bayesian Optimization [5] and HyperBand [1] with the task of producing weights with high performance on the past dataset. Because of the random nature of the HyperBand algorithm, we acquired a full spectrum of strategies - from bad to good performance-wise.
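The signal map and the threshold-based execution rule can be sketched as follows; the out-of-range branch of $\varphi$ is assumed to clamp the signal to 0 or 1:

```python
import numpy as np

def phi(x):
    """Map a linear combination of model outputs to [0, 1].
    Assumption: values with |x| >= 1 are clamped to 0 or 1."""
    return np.clip((x + 1) / 2, 0.0, 1.0)

def desired_position(weights, model_outputs):
    """Trading signal y = phi(W^T X), read as the desired long position."""
    return phi(weights @ model_outputs)

def execute(position, y, y_buy=0.75, y_sell=0.25, y_width=0.10):
    """Move to the desired position only when the signal crosses a
    buy/sell threshold and the change is larger than y_width."""
    if y > y_buy and abs(y - position) > y_width:
        return y       # increase the long position
    if y < y_sell and abs(y - position) > y_width:
        return y       # decrease the long position
    return position    # otherwise hold

w = np.full(16, 1 / 16)                            # toy equal weights
x = np.random.default_rng(0).uniform(0, 1, 16)     # 16 class probabilities
pos = execute(position=0.0, y=desired_position(w, x))
```

The backtests below then vary only the 16 weights and the 3 thresholds.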
Experimental setup

We based our experiment on BTC/USDT tick-by-tick transaction data recorded on the Binance exchange. We divided our dataset into three parts:

• train & evaluate dataset
• past dataset
• future dataset (out-of-sample)

Gaps in between the datasets were designed to prevent look-ahead bias. We aggregate the data into 1-minute datapoints. Transactions in our backtesting use the next open price to execute orders and carry a flat transaction fee - a realistic estimate of the transaction cost achievable on an exchange.

Results - single strategy performances

We have to find a selection metric for our strategies. First, let us consider cross-dataset returns, where we put past dataset strategy returns against the corresponding future dataset strategy returns. The comparison is illustrated in Fig. 11.

Figure 11: Cross-dataset returns of strategies. Returns on the past and future datasets are on the X and Y axis respectively. The statistically insignificant future red label corresponds to strategies that resulted in fewer than 5 trades per month on the future dataset.

As illustrated in Fig. 11, past dataset return cannot be used as a metric to select strategies with promising results in the future. The positive correlation breaks down around 5% monthly return on the past dataset, and the region with the highest future returns is filled with statistically insignificant future performances.
We need a better strategy selection method. Let us introduce a score function as follows:

$$S = MR - TP - MCIP \quad (6)$$

where:

• S: strategy score
• MR: monthly return
• TP: transaction penalty - if the number of transactions per month on the past dataset is lower than 30, it is equal to the number of missing transactions per month
• MCIP: mean capital involvement penalty - if the mean capital involvement (mean long position) is lower than 25%, it is equal to half of the missing percentage points

The goal of the score function is to map the problematic low-return or statistically insignificant regions (illustrated in Fig. 11) to low scores while preserving the positive performance correlation structure. The parameter values of the score function were chosen intuitively, before the cross-dataset studies. The number 30 in the transaction penalty was chosen simply because, for Bitcoin, it is the number of trading days in a month, and all of the used features can change drastically intra-day. An idea for making the score function less accidental is presented in Future work.

Cross-score performances are illustrated in Fig. 12 below:

Figure 12: Cross-dataset performances of strategies. Scores on the past dataset and returns on the future dataset are on the X and Y axis respectively. The statistically insignificant future red label corresponds to strategies that resulted in fewer than 5 trades per month on the future dataset.

The higher the past dataset score, the higher (on average) the future dataset monthly returns. The statistically insignificant regions were mapped out of the region illustrated in Fig. 12. The score can now be used as a selection metric for our trading strategies.
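The score of Eq. 6 translates directly into code; expressing returns and capital involvement in percentage points is our unit assumption:

```python
def strategy_score(monthly_return_pct, trades_per_month,
                   mean_capital_involvement_pct,
                   min_trades=30, min_involvement_pct=25.0):
    """S = MR - TP - MCIP (Eq. 6): penalise too few transactions and low
    mean capital involvement on the past dataset."""
    tp = max(0.0, min_trades - trades_per_month)        # missing trades/month
    mcip = 0.5 * max(0.0, min_involvement_pct - mean_capital_involvement_pct)
    return monthly_return_pct - tp - mcip

# A strategy returning 8%/month with 20 trades and 15% mean involvement:
# TP = 10 missing trades, MCIP = 0.5 * 10 = 5, so S = 8 - 10 - 5 = -7.
assert strategy_score(8.0, 20, 15.0) == -7.0
# An active, fully involved strategy is scored by its return alone:
assert strategy_score(8.0, 40, 30.0) == 8.0
```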
Results - strategy ensembles

Strategy ensemble backtests using the Top100, Top20, Top10, and Top5 models, past-score-performance-wise, are illustrated in Fig. 13, Fig. 14, Fig. 15, and Fig. 16 respectively.

Figure 13: Performance of the top 100 score-wise strategies and the corresponding strategy ensemble on the future dataset. Risk adjustment is performed by scaling the strategy ensemble returns by $\sigma_{BTC/USDT} / \sigma_{ensemble}$, where $\sigma$ is the standard deviation of returns.

Figure 14: Performance of the top 20 score-wise strategies and the corresponding strategy ensemble on the future dataset. Risk adjustment as in Fig. 13.

Figure 15: Performance of the top 10 score-wise strategies and the corresponding strategy ensemble on the future dataset. Risk adjustment as in Fig. 13.

Figure 16: Performance of the top 5 score-wise strategies and the corresponding strategy ensemble on the future dataset. Risk adjustment as in Fig. 13.

Both the returns and the risk-adjusted returns increase as the average past score-performance increases.

Conclusions

The performance of the created strategies increases, in terms of return and risk-adjusted return on the out-of-sample future dataset, as the past score-performance increases. Using the top strategies, score-performance-wise, we achieved exceptional market-beating results. As of right now, using the framework we have described can lead to further improvements in the capital allocations of institutional investors with access to market data and computational power.
Future work

Making transaction rates in the score function definition dataset-dependent, because static rates lead to ruling out potentially high-performance strategies if they do not comply with the dataset's dynamics. The optimal transaction rate should be based on characteristics of the past dataset - e.g. average volatility - and in general should not be hard-coded.
Changing strategies on the fly should further increase performance by swapping under-performing single-model strategies with more promising substitutes. In this approach, the strategies are ranked for selection based on their up-to-date past performances. The effective past dataset would change periodically.
A more sophisticated feature space would potentially lead to better classifiers and enable detection of sub-minute movements.
More sophisticated trading strategies - our linear combination was selected to reduce complexity. We are now looking into more complex solutions which are still relatively easy to interpret.
Running computations for longer to find strategies with higher past score-performance should further increase out-of-sample performance.
Acknowledgement
We would like to thank David Klemm for his support, discussions, and access to his computational grid.
A Appendix: selected labels
The 8 chosen labels can be categorised into 2 subcategories: threshold labels (see Labeling) and local extrema labels.
Threshold labels, short descriptions:

• 1.2% price change in the next 5 minutes
• 1.5% price change in the next 5 minutes
• …
• 3% price change in the next 5 minutes
• 3% price change in the next 60 minutes

The remaining 3 local extrema labels are a custom construct and are material for a separate paper. Visualise and explore them all through our repository: GitHub Labels
B Appendix: feature space
The feature space (see Market representation) we use consists of 28 functions and is illustrated in Fig. 17 below:
Definition of our feature space.

Figure 17: Definition of each of the 28 features and their corresponding parameter ranges that were used during our search for optimal market representations in Market representation. Parameters are integer-only and represent minutes. The ranges of possible parameters and the types of indicators are based on our domain knowledge.
C Appendix: strategy space
The strategy space (see Trading strategy) consists of 19 parameters: 3 thresholds and 16 weights.

Strategy thresholds: $y_{buy} \in [0.…, …]$, $y_{sell} \in [0, …]$, $y_{width} \in [0.…, …]$. Weights: $w_n \in [−…, …]$.

References

[1] Lisha Li et al. "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". In: J. Mach. Learn. Res. 18 (Jan. 2017), pp. 6765-6816. issn: 1532-4435.
[2] Marcos López de Prado. Advances in Financial Machine Learning. Wiley Publishing, 2018.
[3] Marcos López de Prado. Interpretable Machine Learning: Shapley Values (Seminar Slides). Available at SSRN, June 27, 2020. doi: 10.2139/ssrn.3637020. url: http://dx.doi.org/10.2139/ssrn.3637020.
[4] L.S. Shapley. "A Value for n-Person Games". In: Contributions to the Theory of Games, vol. II. Princeton: Princeton University Press, 1953.
[5] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian Optimization of Machine Learning Algorithms". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 2951-2959. url: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf.
[6] Erik Štrumbelj and Igor Kononenko. "Explaining prediction models and individual predictions with feature contributions". In: Knowledge and Information Systems. issn: 0219-3116. doi: 10.1007/s10115-013-0679-x. url: https://doi.org/10.1007/s10115-013-0679-x.