Learning low-frequency temporal patterns for quantitative trading
Joel Da Costa
Department of Statistical Sciences, University of Cape Town
Cape Town, South Africa
[email protected]
https://orcid.org/0000-0001-7821-6635
Tim Gebbie
Department of Statistical Sciences, University of Cape Town
Cape Town, South Africa
[email protected]
https://orcid.org/0000-0002-4061-2621
Abstract—We consider the viability of a modularised mechanistic online machine learning framework to learn signals in low-frequency financial time series data. The framework is proven on daily sampled closing time-series data from JSE equity markets. The input patterns are vectors of pre-processed sequences of daily, weekly and monthly or quarterly sampled feature changes. The data processing is split into a batch processed step where features are learnt using a stacked autoencoder via unsupervised learning, and then both batch and online supervised learning are carried out using these learnt features, with the output being a point prediction of measured time-series feature fluctuations. Weight initializations are implemented with restricted Boltzmann machine pre-training, and variance based initializations. Historical simulations are then run using an online feedforward neural network initialised with the weights from the batch training and validation step. The validity of results is considered under a rigorous assessment of backtest overfitting using both combinatorially symmetrical cross validation and probabilistic and deflated Sharpe ratios. Results are used to develop a view on the phenomenology of financial markets and the value of complex historical data-analysis for trading under the unstable adaptive dynamics that characterise financial markets.
Index Terms—online learning, feature selection, pattern prediction, backtest overfitting
I. INTRODUCTION
A. Technical Analysis
Technical analysis is a financial analytical practice that makes use of past price data in order to identify market states and forecast future price movements based on past movements. The techniques typically rely on past market data (price and volume), rather than company assessments using fundamental analysis. We explore the idea that technical analysis has merit in exposing market inefficiencies when they are signified by repeated feature time-series patterns [1], [2].

Financial markets have been shown to be complex and adaptive systems where the effects of interaction between participants can be highly non-linear [3], but they may also have combinations of top-down and bottom-up sources of information and interaction that mix in vast numbers of interactions mediated by numerous flocks of heterogeneous strategic agents that constitute modern financial markets [4]. Complex and dynamic systems such as these may often exist at multiple 'order-disorder borders', and they will then generate certain non-random patterns and internal organisation on different averaging scales. Two key price generation processes have emerged: the low-frequency domain (the result of sequences of closing auctions generating prices), and the high-frequency intra-day domain driven by order-flow itself. Here we consider low-frequency daily sampled data that is the result of the price discovery from closing auctions.

Even at low frequency, identifying patterns and structure is simultaneously reasonable and notoriously difficult. While it is often clear in hindsight that patterns exist, the amount of noise and non-linearity in the system can make prediction challenging.
Fittingly, neural networks are a popular choice for modelling within financial markets because of their ability to perform well as universal approximators [5].

Practical approaches to money management within the realities of adapting and changing market systems increasingly favour online methods. In particular, [6] explored the application of online learning models in this space in the South African market to show that direct (but simplistic) online pattern-learning is able to identify and potentially exploit trading opportunities on the JSE through the assessment of Open High Low Close (OHLC) data. This was extended by [7] to more directly explore the use of online learning applied to optimizing parameters for traditional technical trading indicators as applied to maximising wealth trading zero-cost portfolio strategies.

The work presented here fits into the growing body of work which considers mechanistic and brute-force approaches of applying machine learning models to financial market data. The complexity, non-linearity, noise and stability of financial markets are highlighted through both the successes and challenges found in training these models. These difficult dynamics, and their notable difference when compared to other popular areas of ML research, which are often built around Independently and Identically Distributed (IID) datasets, present fundamental problems to be explored, both in terms of prediction efficacy as well as validation. We present a framework using batch offline and online learning on JSE closing data, feature extraction and robust non-parametric validation techniques.

B. Backtesting and Model Validation

Financial academic literature is currently facing a problem in terms of validation and verification of results. Trading strategy profitability has typically been proven using historical simulations, or "backtests".
However, the recent advances in technology and algorithms available to construct these strategies have resulted in researchers being able to test increasing numbers of variations of factors. This has made it increasingly difficult to control for spurious results. The problem is so extensive that some meta-research papers suggest that most published research findings are false [8].

The standard way of implementing backtests is to split the data into two portions: an In Sample (IS) portion which is used to train the model, and an Out of Sample (OOS) portion which is used to test the model and validate results. If vast numbers of different model configurations are tested, then it is only a matter of time before false positives occur with high performance both IS and OOS (i.e. overfitting) [9], [10]. The nature of financial data makes it difficult to resolve these issues effectively. There is a low signal-to-noise ratio in a dynamic and adaptive system with only one true data sequence. Traditional hypothesis testing frameworks (e.g. Neyman-Pearson) are not sufficient in this context, making more sophisticated techniques necessary.

The problem of overfitting is not novel. However, in a machine learning context, frameworks are often not suited to trading schedules with a random frequency structure. They do not account for overfitting outside of the output parameters nor take into consideration the number of trials attempted. The common 'hold-out' strategy is where a certain portion of the dataset is reserved for testing true OOS performance. Numerous problems have been pointed out with this approach. The data is often used regardless, and awareness of the movements in the data may influence strategy and test design [11]. For small samples, a hold-out strategy may be too short to be conclusive [12]. Even for large samples, it results in the most recent data (which is arguably the most pertinent) not being used for model selection [9], [13].
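The selection-bias problem can be illustrated with a toy simulation (ours, not from the paper): generate many strategies of pure noise, pick the one with the best IS score, and its apparent edge largely evaporates OOS.

```python
import random, statistics

def sharpe_like(returns):
    """Mean over stdev of a return sequence (a crude, annualisation-free score)."""
    return statistics.mean(returns) / statistics.pstdev(returns)

random.seed(7)
n_strategies, n_days = 500, 500
# Every "strategy" is pure noise: zero-mean daily returns, no real edge.
strategies = [[random.gauss(0.0, 0.01) for _ in range(n_days)]
              for _ in range(n_strategies)]

split = n_days // 2
is_scores = [sharpe_like(s[:split]) for s in strategies]
best = max(range(n_strategies), key=lambda k: is_scores[k])

best_is = is_scores[best]                       # looks impressive purely by selection
best_oos = sharpe_like(strategies[best][split:])  # reverts toward zero out of sample
print(f"best IS score: {best_is:.3f}, same strategy OOS: {best_oos:.3f}")
```

With enough trials the best IS score is substantial even though every strategy is noise, which is exactly why IS/OOS splits alone cannot control for the number of configurations tested.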
We present a novel application of existing sophisticated validation methods (see Section III) to a machine learning framework.

II. FRAMEWORK IMPLEMENTATION
A. Full Framework Process
The framework implementation brings several ideas together: i.) SAE based feature selection, ii.) Deep Learning with pre-training and weight initialization, and iii.) Online Learning and Backtest Overfitting Validation. The learning part of the framework consists of two phases:

I. Batch learning phase: IS data is used to train Stacked Autoencoder (SAE) networks which in turn are used to perform feature reduction for Feedforward Neural Networks (FNNs) learning the price fluctuation predictions in the IS data. Both are trained using Stochastic Gradient Descent (SGD).

II. Online learning phase: The batch trained FNN networks are used to predict price fluctuations on OOS data through Online Gradient Descent (OGD).

These online predictions are then sequentially used to simulate trading in a Money Management System (MMS), which in turn generates simulated returns. Finally, the MMS returns are used as input for the Probability of Backtest Overfitting (PBO) and Deflated Sharpe Ratio (DSR) techniques in order to validate the legitimacy of the framework.

The process has two key principles: first, implementing a generalised version of a system which could offer exploration of more complex techniques; second, ensuring an effective modularisation of steps such that the process can be reconfigured accordingly while maintaining its integrity. In doing so, a separable system is created which brings together all key concepts. We aim to deliver the simplest implementation of a complex framework such that the effects of individual components can be properly assessed and developed. The full process flow can be seen in Figure 7 in the Appendix.
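As a structural sketch only (the function names are ours, not from the accompanying library, and the paper's code is in Julia rather than Python), the modular flow can be expressed as a chain of separable steps:

```python
def run_framework(is_data, oos_data, train_sae, train_fnn, ogd_predict, mms, validate):
    """Compose the framework's phases; each argument is a swappable module."""
    encoder = train_sae(is_data)               # Phase I: unsupervised feature learning (batch)
    fnn = train_fnn(encoder(is_data))          # Phase I: supervised batch training, encoded IS data
    predictions = ogd_predict(fnn, encoder(oos_data))  # Phase II: online learning on OOS data
    returns = mms(predictions)                 # simulate trading from the prediction signal
    return validate(returns)                   # e.g. PBO / DSR assessment of the returns
```

The point of the separability is that swapping the MMS or the validator does not require retraining the networks, and any number of feature-extraction configurations can be tried without touching the later stages.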
B. Data Processing
Datasets are transformed into log feature differences and aggregated to include the changes over rolling window periods. The log feature fluctuations are processed for each asset's closing price at each time point i. Log fluctuations take compounding into account in a systematic way and are symmetric in terms of gains and losses; the log transformation also provides an ergodic time series. The rolling window summations are calculated over the past for input data, e.g. (1, 5, 20) days, and into the future for the predicted output, 5 days in this paper. These are calculated as summations of the log differences \Delta p_i such that for d days at timepoint t:

p_{-}(d, t) = \sum_{i=t-d}^{t} \Delta p_i, \quad \text{and} \quad p_{+}(t, d) = \sum_{i=t+1}^{t+d} \Delta p_i. (1)

The aggregations are scaled using a modified normalization, where the min and max values are determined by the Training portion of the dataset, but applied to both the Training and Prediction portions. This emulates a production implementation where future data is unknown. The log-differenced, aggregated and scaled data is then used as the input for the neural network models. Predicted outputs have the scaling and log differencing reversed in order to reconstruct the actual price point predictions for performance assessment.

C. Network Weight Initialization
The problem of vanishing and exploding gradients has been one of the primary barriers to deep learning with neural networks. The approach of greedy layer-wise pre-training for SAEs was suggested by [14], which allowed much deeper layered networks to be trained than previously possible [15]. Once the SAE is trained, a final output layer is added and the entire network can then be fine-tuned through back-propagation without suffering such performance degradation from vanishing or exploding gradients, enabling training of both SAE and FNN networks [16], [17].

In Section V we see that batch training on historical data has a limited benefit, which gives primacy to weight initialization techniques for machine learning of financial time series. Initial results found that RBM pre-training for sigmoid SAE networks (as described by [17]) had detrimental effects on network performance. This suggests that the IID assumptions and the different loss functions result in the financial time series data used being pathological for RBM pre-training, and is discussed further in [18]. It has also been shown that pre-training may largely act as a prior which may not be necessary if large enough datasets are available [19], [20]. In the context of financial time series this prior can explain the poor performance. For these reasons we focus on variance based weight initialisations developed by [21] and [22]. These have simpler implementations and faster computation, and enable initialization for non-probabilistic activation functions, such as ReLU.

Concretely, we use a modified He initialization: "He-Adjusted". This initialization uses a mean of the input and output layers to scale the weight variance. For networks with constant layer sizes, the initialization is the same as He [22]. For SAE networks, where layer size changes by definition, the He-Adjusted initialization results in better sized weights.
For n, the number of nodes in a layer, we initialize using:

w_{ij} \sim U(-r, r), \quad \text{with} \quad r = \sqrt{12/(n_i + n_j)}. (2)

D. Unsupervised Learning: SAE Training
The benefit of the modularised system is emphasised here, as the SAE training will not suffer from limitations due to backtesting considerations: any amount of configurations can be tested for feature extraction without concern. The best chosen SAE networks (based on a minimum Mean Squared Error (MSE) score) are used to reprocess both the Training and Prediction datasets such that the input is encoded, and the output is as before. These encoded datasets can then be used for the subsequent steps in the framework. We did not implement a step to update the SAE, but results detailed in Section V suggest that this would be an important inclusion in a production system.
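A minimal autoencoder sketch (ours, in Python rather than the paper's Julia; a single sigmoid encoder layer with a linear decoder stands in for the stacked networks used here):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyAutoencoder:
    """Sigmoid encoder, linear decoder, trained by plain SGD on squared error."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        r = math.sqrt(12.0 / (n_in + n_hidden))  # variance-based uniform bound, cf. Eq. (2)
        self.W1 = [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_hidden)]
        self.W2 = [[rng.uniform(-r, r) for _ in range(n_hidden)] for _ in range(n_in)]

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.W1]

    def reconstruct(self, x):
        h = self.encode(x)
        return [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]

    def train_step(self, x, lr=0.1):
        h = self.encode(x)
        y = [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]
        err = [yi - xi for yi, xi in zip(y, x)]
        # backprop through the sigmoid hidden layer (using the pre-update decoder weights)
        dh = [sum(self.W2[i][j] * err[i] for i in range(len(err))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
        for i, row in enumerate(self.W2):        # linear decoder gradient: err_i * h_j
            for j in range(len(row)):
                row[j] -= lr * err[i] * h[j]
        for j, row in enumerate(self.W1):        # encoder gradient: dh_j * x_k
            for k in range(len(row)):
                row[k] -= lr * dh[j] * x[k]
        return sum(e * e for e in err) / len(err)  # reconstruction MSE
```

Once trained to an acceptable MSE, only `encode` is kept: the framework feeds the encoded (reduced) inputs to the prediction networks in the next step.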
E. Supervised Learning: Prediction Network Training
Once the predictive network is trained on IS data using SGD, the OGD process is run through the encoded Prediction dataset in order to generate the predictions for the asset prices that the model produces, thus emulating what would have occurred in a live environment.
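The online phase reduces to a predict-then-update loop. A sketch with a linear model standing in for the FNN (names are ours; OGD here is simply SGD applied once per arriving observation):

```python
def ogd_run(weights, stream, lr=0.05):
    """Online gradient descent: predict each point, then update on its observed target."""
    predictions = []
    for x, target in stream:                  # data arrives sequentially, as in live trading
        pred = sum(w * xi for w, xi in zip(weights, x))
        predictions.append(pred)              # the prediction is made BEFORE seeing the target
        err = pred - target                   # squared-error gradient: d/dw (err^2)/2 = err * x
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return predictions, weights
```

In the framework the starting `weights` come from the batch SGD phase; the loop itself is identical regardless of how the network was initialised.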
F. Money Management Strategy
The MMS follows an arithmetic long strategy of buying any asset for which the predicted price is above the current price, and selling the stock at the prediction horizon regardless. Trading costs were included at 10% capital costs per annum for borrowing to purchase, and 0.45% for transaction costs as per [6], without taking liquidity effects into account. The naive approach is taken purposefully so as not to bias the perspective of the system as a whole by the effects of an impactful trading strategy. It is important, in the interest of effective optimization, that the pattern prediction of the system is not tightly coupled with making it profitable. Thus, the modularity of the system is continued with a separation between the prediction signal and the MMS implementation.
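A sketch of that rule (our own simplification for a single asset; the per-trade cost numbers follow the text, while the pro-rata application of the annual capital rate over the holding period is our assumption):

```python
def mms_long_only(prices, predictions, horizon=5,
                  txn_cost=0.0045, capital_rate=0.10, days_per_year=252):
    """Buy when the predicted price exceeds the current price; sell at the horizon regardless."""
    pnl = 0.0
    for t, (price, pred) in enumerate(zip(prices, predictions)):
        if t + horizon >= len(prices) or pred <= price:
            continue                                     # no signal, or not enough future data
        exit_price = prices[t + horizon]
        gross = (exit_price - price) / price             # arithmetic return on the trade
        costs = 2 * txn_cost                             # transaction cost on entry and exit
        costs += capital_rate * horizon / days_per_year  # borrowing cost over the holding period
        pnl += gross - costs
    return pnl
```

Because the rule consumes only the prediction series, any prediction network can be plugged in unchanged, which is the decoupling the section describes.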
G. Validation
Validation is implemented with Combinatorially Symmetric Cross-Validation (CSCV) [9] that uses the IS and OOS returns from the MMS, which in turn uses the prices from the prediction network; which is a somewhat novel application. Conceptually, the whole system comes into place here, as the results from the CSCV process are now indicative of not only backtest overfitting in the trading strategy, but also in the prediction network, and without having to consider the impact of the many configuration tests for feature extraction. A modified version of ONC was run, with reduced cluster exploration. As noted in [18], this did not appear to affect results.

III. ASSESSMENT METHODOLOGIES
A. Probability of Backtest Overfitting
CSCV was developed by [9] as a robust approach to assessing backtest overfitting. Their research defines backtest overfitting as having occurred when the strategy selection which maximizes IS performance systematically underperforms the median OOS performance in comparison to the remaining configurations. They use this definition to develop a framework which measures the probability of such an event occurring, where the sample space is the combined pairs of IS and OOS trading performance measures. The PBO is then established as the likelihood of a configuration underperforming the median OOS while outperforming IS.

The CSCV methodology is generic, model-free and non-parametric, allowing it to arguably be used in any model case. By recombining the slices of available data, both the training and testing sets are of equal size (advantageous for comparing performance statistics). The symmetry of the set combinations in CSCV ensures that performance degradation is only a result of overfitting, and not arbitrary differences in data sets. There is no requirement of a hold-out set, which removes potential credibility issues regarding whether the holdout set was treated appropriately or not. The logit distribution developed through the assessment offers a useful view on the robustness of the strategies used and the nature of the PBO score.

The PBO can be estimated using the CSCV method results, which provide an estimate of the rate at which the best IS strategies underperform the median of OOS trials. [9] extend this to show that with models overfitting to backtest data noise, there comes a point where seeking increased IS performance is detrimental to the goal of improving OOS performance.
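A condensed sketch of the CSCV procedure of [9] (our own reduction: performance is a simple mean return per configuration, where the original works on a full performance matrix and typically uses the Sharpe ratio):

```python
import itertools, math, statistics

def cscv_pbo(returns_by_config, n_splits=4):
    """Estimate the Probability of Backtest Overfitting via CSCV.

    returns_by_config: list of equal-length return series, one per configuration.
    The series are cut into n_splits blocks; every half-sized combination of
    blocks forms an IS set and its complement the OOS set.
    """
    n_obs = len(returns_by_config[0])
    block = n_obs // n_splits
    blocks = [range(b * block, (b + 1) * block) for b in range(n_splits)]

    def perf(series, idx):
        return statistics.mean(series[i] for i in idx)

    logits = []
    for is_blocks in itertools.combinations(range(n_splits), n_splits // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_splits) if b not in is_blocks for i in blocks[b]]
        # pick the configuration that looks best in sample ...
        best = max(range(len(returns_by_config)),
                   key=lambda c: perf(returns_by_config[c], is_idx))
        # ... and find its relative rank among all configurations out of sample
        oos_perf = [perf(s, oos_idx) for s in returns_by_config]
        rank = sum(p <= oos_perf[best] for p in oos_perf) / (len(oos_perf) + 1)
        logits.append(math.log(rank / (1 - rank)))
    # PBO: fraction of splits where the IS winner falls below the OOS median
    return sum(l < 0 for l in logits) / len(logits)
```

A genuinely strong configuration keeps its rank out of sample (logits above zero, PBO near 0), while a configuration that only ever wins on its own IS blocks drops below the OOS median (PBO near 1).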
B. Deflated Sharpe Ratio
The Sharpe Ratio (SR) is based on the assumption that the returns used are the result of a single trial, as is the case for most standard performance measures. In consideration of the issues laid out in I-B, it then becomes a misrepresentative performance measure. [23] developed the Probabilistic Sharpe Ratio (PSR), which estimates the likelihood that an observed best estimate \widehat{SR} exceeds a provided benchmark SR^* (which might be expected from variance in the trials). It is worth emphasising the distinction in investment strategies between a Family Wise Error Rate (FWER), which is the probability that one or more false positives occur, and a False Discovery Rate (FDR), which is the ratio of false positives to predicted positives. Investment strategy generation will tend to rely on the single best approach produced, and so must consider FWER. [24] further developed the False Strategy Theorem (FST) with this in mind, allowing the assessment of whether a presented strategy is a false positive or not.

The DSR calculates the likelihood that the true SR is positive under consideration of numerous trials being tested [24]. The DSR can be estimated using the PSR methodology as \widehat{PSR}[SR^*], where the benchmark Sharpe ratio, SR^*, is calculated using the False Strategy Theorem. The calculation of SR^* requires both the variance of trial SR values and the number of independent trials, which are not typically considered and where determining independence is challenging. [25] aim to resolve this with the Optimal Number of Clusters (ONC) algorithm, a modified K-means methodology of clustering strategies and trial results. This clustering then allows an estimation of both the variance and number of trials, which in turn allows the DSR to be calculated. With this as a confidence level, one can accept or reject the notion that the observed \widehat{SR} is positive.

IV. EXPERIMENT PROCESS
A. Data & Software
Datasets were constructed using JSE closing price relative data for 2003-2018 [26], with a 60:40 split on the Training:Prediction subsets. The closing price dataset consisted of 10 assets from the JSE Top 40: AGL, BIL, IMP, FSR, SBK, REM, INP, SNH, MTN, DDT (coming from a variety of sectors). More source information, data snapshots and price charts are available in [18].

The software libraries, written in Julia, were produced for all the training, experimentation and recording of results. These are discussed extensively in [18], and made available online [27].
B. Parameter Space Exploration
The parameter space is explored using a phased grid search approach. For each stage, the relevant parameters are each specified as a set of values, and all sets are then used to generate the full combinatorial space, such that each possible combination of the specified parameters is tested.

1) Stage 1: The data configuration (i.e. data windows, prediction point, scaling, data split points) as well as the SAE configuration (network size, learning rates, learning optimization parameters, SGD epochs) are set in Stage 1 and used to train the SAE networks.

2) Stage 2: The preferred SAEs are chosen from Stage 1, and determine the data configuration used for Stage 2. These are then used to encode the datasets, which will be used for FNN training. The FNN SGD and OGD parameters are set in this stage (network size, learning rates, SGD epochs etc.), and will be combined combinatorially with the SAEs that were chosen for testing as well.

V. FINDINGS
A. Value of Historical Data and Training
We expected that the IS batch training using SGD for the predictive network would improve OOS P&L performance. Theoretically, the training on historical data might prime the network for predicting future data. However, we found that the effects of IS training had limited benefit. We ran experimental trials to test the hypothesis that the amount of historical IS data available is of limited benefit. We found the P&L results validate this idea, as seen in Figures 1 and 2. We saw that extensive training on past data may be akin to pre-training network weights at best, and counterproductive in overfitting to dynamics that no longer exist at worst. This highlights the complexity and dynamic nature of financial time series, where past relations and behaviours are not necessarily indicative of present state. It follows that the primary determinants of OOS P&L are those present in the OGD (OOS) learning phase: the OGD learning rate, the data horizon aggregations, and the SAE feature selection.

This fits well with research showing that online algorithms typically perform as fast as batch algorithms during the 'search' phase of parameter optimization, but that 'final' phase convergence tended to fluctuate around the optima due to the noise present in single sample gradients [28], [29]. [30] showed that it is actually more practical to consider the convergence towards the parameters of the optima, rather than the optima itself (as defined by the cost function): the difference between learning speed and optimization speed, respectively. Online learning methods are thus well suited to financial market modelling using neural networks. They allow effective and efficient incremental updates as more recent (and relevant) data becomes available. Further, the increased learning speed over optima convergence makes them a fitting choice when data is non-IID and constantly changing.
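The intuition that online updates beat a frozen batch fit when the data-generating process drifts can be shown with a toy regression (entirely our own illustration, not one of the trials above): the true coefficient changes over time, the batch model is fit once and frozen, and the online model keeps adapting.

```python
import random

random.seed(3)

# Non-stationary process: y = a_t * x, with the coefficient a_t drifting from 1 to 3.
n = 400
xs = [random.uniform(0.5, 1.5) for _ in range(n)]
ys = [(1.0 + 2.0 * t / n) * x + random.gauss(0.0, 0.05) for t, x in enumerate(xs)]

# "Batch" model: least-squares fit on the first half, then frozen.
half = n // 2
w_batch = (sum(x * y for x, y in zip(xs[:half], ys[:half]))
           / sum(x * x for x in xs[:half]))

# "Online" model: one OGD step per observation over the whole series.
w, lr = 0.0, 0.1
online_err = batch_err = 0.0
for t, (x, y) in enumerate(zip(xs, ys)):
    if t >= half:                        # score both models on the second half only
        online_err += (w * x - y) ** 2
        batch_err += (w_batch * x - y) ** 2
    w -= lr * (w * x - y) * x            # the online update happens regardless

print(f"frozen batch SSE: {batch_err:.2f}, online SSE: {online_err:.2f}")
```

The frozen fit encodes an average of past dynamics that no longer hold, while the online model tracks the drift with a short lag, mirroring the limited benefit of extensive IS training observed above.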
B. Primary Determinants of P&L
Input data was scaled to 3 different configurations to assess the effects of shorter and longer data horizons, using SAE MSE and predictive OOS P&L to assess performance. The configurations tested (in trading day window periods) were: 1.) horizon-tuple 1 with [1, 5, 20], 2.) horizon-tuple 2 with [5, 20, 60], and 3.) horizon-tuple 3 with [10, 20, 60]. The SAE networks were only trained on IS data, and not updated afterwards. As noted below, an effective SAE feature selection in this process is an optimization that may be limited to a certain time period and may not generalise well OOS. We also found that lower variance, in the shorter horizon aggregations,
OOS P&L by IS Training Epochs
Fig. 1. These results show OOS P&L grouped by the number of epochs in the SGD IS training phase. Here 100 Epochs offered the best overall performance, and further training to 500 or 1000 epochs degraded performance due to the network overfitting on the IS data. The results show that the benefit of historical data is limited: having networks become better at learning return relationships from 10 years in the past did not lead to increased OOS P&L for more current data. The small difference in the upper half of observations between 10 and 100 Epochs further emphasises this point.
OOS P&L by IS Training Dataset Size
Fig. 2. To further explore the effect of IS training on historical data, configurations were run with a percentage of the usual training data excluded, with the P&L results grouped above. The exclusion of up to 80% of the IS training data resulted in only a 2.2% drop in median OOS P&L for those networks. The training in these instances was not adjusted to increase the number of epochs according to the size of IS data, and so the configurations with more data excluded were also in essence trained less.

resulted in easier replication; while longer horizons are more difficult (as indicated through MSE scores). The reproduction differences are discussed in [18].

We observed strong interactions between the SAE feature sizes, FNN OGD learning rates and data horizon aggregation configurations. The performance differences seen further emphasise the unstable nature of financial systems. Generally, the FNN OGD learning rate had the largest impact on OOS performance, and demonstrates the benefits in being able to adapt quickly to new information, as seen in Figure 3. As the SAE feature size decreased from 25 to 10, SAEs learnt longer term features (as they were increasingly unable to represent short term fluctuations). For FNN networks with larger learning rates, which could otherwise adapt quickly, the increased focus on long term features caused P&L performance degradation. For FNN networks with smaller learning rates, poorly able to adapt quickly to new information either way, there was a benefit from SAE features with an increasing representation of long term trends. The relationship is emphasised dramatically in the 10-feature SAEs, to the point that lower learning rates can be more effective in generating OOS P&L.
The P&L performance suggests that the 10-feature SAEs overfit to long term IS features, and became pathological for short term adaptation OOS.

The most noteworthy results were the 5-feature SAEs, where performance was often on par with or better than the higher feature sizes, or no SAE at all, as seen in Figure 3. It is possible that the small encoding layer acts as a form of regularization, forcing the SAE to learn more consistently generalisable features. The performance of the 5-feature networks, with an 83% reduction in input data, is clear evidence of the efficacy and potential of feature selection in financial time series.

The effect of data horizon aggregations is as expected: short term horizons (i.e. [1, 5, 10]) outperformed in configurations with more SAE features and higher learning rates; longer term horizons (i.e. [10, 20, 60]) outperformed in low learning rate and low feature configurations. The differentiation between these groups is seen more robustly in Section V-F, where data horizon aggregations are determined to be the primary clustering attribute for trade correlations. Strategies focusing on short term predictive strategies (aggregations of [1, 5, 10]) had higher variance in returns than the longer data horizon strategies, though also the highest P&L and Sharpe ratios. This again shows the benefits in focusing on recent cross-sectional data in financial markets. The differentiation between the groups is discussed more in [18].

C. Money Management Strategy Results
The benchmark is an upper bound on performance, representing MMS returns based on perfect knowledge of future prices. The benchmark full return rate is 2.4% with trading costs, over a period of 1555 trading days. So while the strategies' proximity to the benchmark does represent a framework success, they are not necessarily representative of a feasible market solution. Ultimately, this enforces the notion that the MMS implementation is of exceeding importance in a live trading process, and predictive accuracy is only able to achieve so much.

Figure 4 shows the distributions of OOS P&L with trading costs being accounted for. There were a significant number of configurations within 20%-30% of the benchmark. The trials with 0 P&L are networks which suffered from either exploding or vanishing gradients, and were not able to make sufficient predictions. Input data was 10 assets with 3 horizon aggregations each, resulting in an input size of 30 at each timestep.
OOS P&L by Feature Size and OGD Learning Rate
Fig. 3. This figure shows that the lower learning rates (0.005, 0.01) performed best with strategies using long term trend pricing. The 10 feature encoding appeared to optimise specifically for this perspective. The optimisation caused outperformance at the lower learning rates and detrimental performance at higher learning rates (which perform best with short term fluctuation strategies). The 15 to 25 feature encodings showed a better association to the short term strategies. Here higher encodings and higher learning rates offer the best performance. The 5 feature encoding offered consistent performance across learning rates and shows the learning of generalisable features.
MMS OOS P&L Distributions, with Costs Applied
Fig. 4. The distributions of all OOS P&L values, with the benchmark P&L indicated in orange, show an encouraging view of the results. There is a significant negative skew, with a proportionally small number of strategies resulting in negative returns, even with capital and trading costs applied. There was a large proportion of strategies near the OOS upper bound, which is within 28% of the benchmark.
D. Probability of Backtest Overfitting

1) Applying PBO in Mechanistic Machine Learning:
While the methodology is a model free approach to assessing overfitting, the application in a machine learning context is novel and has dynamics worth considering. The use of offline batch learning parameters, online learning parameters and adaptive network weights makes the concept of model parameters less distinct. If a model performs well OOS due to effective learning, this can be due to the model's strength rather than overfitting.

It is noted that the logit metric, which the CSCV method relies on, has its basis in an ordinal ranking, indicating whether the best strategy in the IS set is higher than the median in the OOS set. This means that poor performing configurations can artificially bolster an ordinal position past the median point and so bias PBO results. An honest, wide exploration of the parameter solution space in a mechanistic machine learning framework is likely to result in "poor" configurations being tested (as visible in the '0' P&L configurations in Figure 4). As a result, the methodology shifts the onus onto the researcher in both handling and reporting these dynamics responsibly.

Further to this, the parameter space search methodology (Section IV-B) may also result in a lower likelihood of PBO due to the way of combining parameters across IS and OOS stages. By way of example, any configuration which performs well IS will have all possible OOS parameters tested in combination with it. While some of these combinations may result in poor performance, there will always be a combination of the best IS and best OOS parameter choices. This makes it unlikely that the best configurations will be past the median point for the logit calculation, resulting in a systematically low PBO regardless of how many configurations are attempted.

Lastly, the CSCV algorithm requires a parameter choice of how many windows to split the data into. While not inherently problematic, this choice can have a significant impact on results which is not visible in the reported PBO value. We discuss this further in [18].
2) PBO Results:
We ran the CSCV algorithm on the majority of the configurations tested, which resulted in a final PBO value of 1.7%. A subset of networks were excluded on account of 'null' predictions, resulting in a sample size of 21653 (out of a total of 22248 configurations). The CSCV algorithm was run with a split value of 16. There were 15 years of data, making 16 a reasonable choice as the split parameter (which needs to be even). Ideally, the splits would represent shorter periods, but the exponential increase in computational time made this impractical. The full logit distribution can be seen below in Figure 5.

We found interesting dynamics around the calculation of PBO, and the configurations contributing to the figure. The configuration process went through 2 primary phases: an extremely broad combinatorial grid search, consisting of 20736 configurations; and a second much narrower search of 1512 configurations. Assessing only the configurations from the second phase resulted in a PBO score of 6.3%, which was significantly higher than the overall PBO score. The effect here highlights important aspects of the PBO calculation. The PBO score was much higher for the configurations which were picked more specifically after having already seen a large number of results, which is correctly indicative of an increased likelihood to overfit. However, the PBO score is not monotonically increasing with N, as one would expect; this counterintuitive behaviour is in line with the concerns regarding the effects of increasing configuration sample size.

Logit Distribution for All Configurations
Fig. 5. The CSCV logit distribution for the 21,653 configurations run, with a calculated PBO of 1.7%. The strong negative skew is indicative of IS and OOS strategy returns being closely matched in rank, and results in a low PBO score. This is a favourable assessment of the efficacy of the full framework presented here, and shows that training was able to occur without much risk of backtest overfitting.
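The computational impracticality of finer splits follows from the combinatorics: for S windows, CSCV must evaluate every choice of S/2 windows as the IS set, i.e. C(S, S/2) partitions, which grows roughly as 2^S. A quick illustration (Python, for convenience):

```python
from math import comb

# Number of IS/OOS partitions CSCV must evaluate for S windows.
# S = 16 already requires 12,870 passes over the full configuration set.
for s in (8, 12, 16, 20):
    print(s, comb(s, s // 2))  # 8 -> 70, 12 -> 924, 16 -> 12870, 20 -> 184756
```

Moving from 16 to 20 windows would have multiplied the work by more than a factor of fourteen, which is why shorter split periods were impractical.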
E. Optimal Number of Clusters
The ONC algorithm produced three clusters: one consisting mostly of the negative Sharpe ratio configurations, and two containing the remaining configurations partitioned by their data-horizon aggregation configurations. The distributions of all three clusters' Sharpe ratios can be seen below in Figure 6. If we consider the two primary clusters, we see that Cluster One contained all the trials with horizon aggregations of [5, 20, 60] and [10, 20, 60], and Cluster Two contained all trials with horizon aggregations of [1, 5, 10]. The nature of the experimentation process, with its combinatorial parameter space exploration (as detailed in Section IV-B), is such that the other parameters were mostly evenly split across the two clusters (e.g. OGD learning rate, network sizes, initializations and so on).

The clusters here indicate that the networks adapted to at least two different general strategies for predicting prices: one which was more influenced by long-term fluctuations, and a second which was more influenced by short-term fluctuations. The results presented in Section V-B are then indicative of the networks' ability to execute these overarching strategies effectively. The best Sharpe ratio value (with trading costs applied) was 0.64 and was part of Cluster Two, with the [1, 5, 10] price fluctuation horizon aggregations. The distributions seen in Figure 6 indicate that, at a general level, Cluster One has more consistent performance; Cluster Two, on the other hand, has higher variance, with more strategies at both the lower and higher ranges of Sharpe ratios. The absence of further clusters was probed manually: further subclustering led to a worse cost function score.
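As a rough illustration of the clustering step, the ONC approach of [25] operates on a correlation-derived distance between strategy return series and selects the clustering that scores best under a silhouette-style quality measure. The sketch below (Python, with a plain mean-silhouette score standing in for the exact ONC cost function, as an assumption for illustration) shows the two ingredients; the k-means search over candidate cluster counts, and the recursion into clusters, are omitted.

```python
import numpy as np

def corr_to_dist(rho):
    """Map a correlation matrix to the metric d = sqrt((1 - rho) / 2),
    so perfectly correlated series are at distance 0, anti-correlated at 1."""
    return np.sqrt(0.5 * np.clip(1.0 - np.asarray(rho), 0.0, 2.0))

def silhouette(dist, labels):
    """Mean silhouette score of a labelling on a precomputed distance
    matrix (requires at least two clusters)."""
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        a = dist[i, same].mean() if same.any() else 0.0   # intra-cluster
        b = min(dist[i, labels == k].mean()               # nearest other cluster
                for k in set(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Grouping strategies whose returns co-move yields a higher score than a mixed labelling; ONC searches over the number of clusters to maximise exactly this kind of quality measure, which is how the data-horizon groupings above emerge without being imposed by hand.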
F. DSR and PSR Results
Using the clusters produced by the ONC algorithm (Section V-E), the DSR could be determined. The aggregate cluster time-series returns were calculated and annualized to allow their variance estimates to be used to calculate SR*, the maximum expected observed Sharpe ratio due to variance under the null hypothesis H0 : SR = 0. Using SR* as the benchmark, the PSR calculation PSR[SR*] can then be used to determine whether the observed SR is a false positive. This gives the DSR as a confidence level for having observed a positive best SR. The benchmark SR* calculated was 0.211245, and the best SR observed was 0.642632, leading to a PSR[SR*] of 1.0, indicating that the trials almost certainly contain a strategy with a positive SR. This seems a reasonable conclusion, considering the SR distributions in Figure 6.

Fig. 6. The distributions of all Sharpe ratios, grouped by the clusters produced by the ONC algorithm, with an indication of the best Sharpe ratio (which falls in Cluster Two). Cluster One has more consistent values, with a higher mean and stronger negative skewness; Cluster Two has higher variance, with more strategies at both the lower and higher ranges of Sharpe ratios, including the highest Sharpe ratio from all trials.

VI. CONCLUSIONS
Mechanistic machine learning approaches to financial market data hold some promise for enhancing the performance of low-frequency quantitative trading. To investigate this potential we provide a novel framework that we show to be effective in both training and validation. The framework is configurable, based on decoupled modules, and uses several well-understood techniques: deep-learning neural networks for stock price fluctuation prediction, stacked autoencoders for the purpose of feature selection, both CSCV and PBO to assess the returns from MMS and the likelihood that backtest overfitting has taken place, and the DSR in order to assess the likelihood of a positive Sharpe ratio having been observed.

While machine learning models are expected to excel in big-data environments, in financial markets there is in fact a lack of data with relevant information and signal, both for training and, even more so, for validation. We show that IS training on historical data had a limited benefit. This is not a surprising empirical insight and was confirmed by the negligible impact of increasing training time, as well as the small impact of large reductions in the training data sizes. Increased performance on IS data was not significantly linked to OOS performance. This emphasises the idea that the changing dynamics of financial markets over time need careful attention. Learning optimizations for IS training, such as regularization and learning rate schedules, were shown to have IS benefits but little impact on OOS performance [18]. This gives weight to the importance of online learning methods for financial applications, and in turn highlights the importance of network initialization. The results showed better performance for the He-adjusted initialization and poor performance for RBM-based pre-training. The primary determinants of OOS P&L were shown to be those which affect the online learning data and the model's ability to adapt to it.
We found that SAE encoding layer sizes influenced the nature of the features learnt, with smaller encodings generally leading to the learning of longer-term features. This relationship continued from layer sizes 25 down to 10, with increasing effect. The results from the 10-feature SAEs suggest these SAEs were overfit to long-term IS features and pathological for OOS adaptation. The relationship changed at 5-feature SAEs, which learnt far more generalisable features. The 5-feature SAEs had very competitive performance and show that feature selection in financial time series is both possible and beneficial despite the complexity present. Predictive strategies focusing on long-term changes were present in configurations with longer data horizons, fewer features and lower learning rates (slow adaptation). Short-term strategies presented with shorter data horizons, more features and larger learning rates (quick adaptation). The data horizon was the primary separating attribute in the ONC clusters, emphasising these groupings. The short-term strategies had higher variance, but also the highest returns. This again highlights both the increased value of recent information in financial markets, and the difficulty of using it due to the amount of noise present.

The most challenging aspect of a mechanistic approach to learning is avoiding backtest overfitting (see Section I-B). Probing and validation of the generalisation error was done using the PBO methodology in conjunction with the DSR [25]. The results discussed in Section V show a low likelihood of the models having overfit. The CSCV and PBO methodologies were able to add value to our novel implementation of machine learning models while providing a robust assessment of results. The results were further validated using the ONC and DSR algorithms to detect positive Sharpe ratios. The phenomenological view of financial markets suggested by our experimental results is of a very limited benefit to training on long-term historical financial time series.
A cross-sectional view of the data has far more weight in delivering OOS returns; this is noteworthy in the context of neural networks. Based on our simulation work, we speculate that money management strategies can be more important determinants of OOS profitability than signal generation, and that these should also be learnt.

ACKNOWLEDGMENT
The authors would like to thank Sebnem Er, Patrick Chang and Turgay Celik for valuable comments on the project.

REFERENCES

[1] J. Murphy, "Technical analysis of financial markets," 01 1999.
[2] A. W. Lo and J. Hasanhodzic, The Heretics of Finance. Bloomberg Press, 2009.
[3] B. Arthur, "Complexity in economics and financial markets," Complexity, vol. 1, pp. 20–25, 10 1995.
[4] D. Wilcox and T. Gebbie, "Hierarchical causality in financial economics," Available at SSRN: https://ssrn.com/abstract=2544327, 2014.
[5] K. Hornik, "Multilayer feed-forward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[6] F. Loonat and T. Gebbie, "Learning zero-cost portfolio selection with pattern matching," PLOS ONE, vol. 13, 05 2016.
[7] N. Murphy and T. Gebbie, "Learning the population dynamics of technical trading strategies," arXiv:1903.02228 [q-fin.CP], 03 2019.
[8] J. Ioannidis, "Why most published research findings are false," CHANCE, vol. 32, pp. 4–13, 01 2019.
[9] D. Bailey, J. Borwein, M. López de Prado, and Q. J. Zhu, "The probability of backtest overfitting," Journal of Computational Finance, vol. 20, pp. 39–69, 4 2017.
[10] R. McLean and J. Pontiff, "Does academic research destroy stock return predictability?," The Journal of Finance, vol. 71, 05 2013.
[11] F. Schorfheide and K. Wolpin, "On the use of holdout samples for model selection," American Economic Review, vol. 102, 05 2012.
[12] S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1991.
[13] D. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, pp. 1–12, 05 2004.
[14] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 08 2006.
[15] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Adv. Neural Inf. Process. Syst., vol. 19, pp. 153–160, 01 2007.
[16] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," 01 2006.
[17] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 08 2006.
[18] J. Da Costa, "Online non-linear prediction of financial time series patterns," Master's thesis, University of Cape Town, Cape Town, ZA, Dec 2020.
[19] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, 08 2013.
[20] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?," Journal of Machine Learning Research, vol. 11, pp. 625–660, 02 2010.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 249–256, 01 2010.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," IEEE International Conference on Computer Vision (ICCV 2015), 02 2015.
[23] D. Bailey and M. López de Prado, "The Sharpe ratio efficient frontier," The Journal of Risk, vol. 15, pp. 3–44, 12 2012.
[24] D. Bailey and M. López de Prado, "The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality," The Journal of Portfolio Management, vol. 40, pp. 94–107, 09 2014.
[25] M. López de Prado and M. Lewis, "Detection of false investment strategies using unsupervised learning methods," Quantitative Finance, vol. 19, pp. 1–11, 07 2019.
[26] J. Da Costa, "JSE price relative dataset from 2003-2018." https://zivahub.uct.ac.za/articles/dataset/JSE_Top40_Closing_Price_Relative_Data_for_2003-2018/11897628, 2020. Accessed: 2020-06-30.
[27] J. Da Costa, "Julia libraries for online non-linear prediction of financial time series patterns." https://github.com/joel11/Masters, 2020. Accessed: 2020-06-30.
[28] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, LNCS vol. 1524, 01 1998.
[29] L. Bottou and N. Murata, "Stochastic approximations and efficient learning," in The Handbook of Brain Theory and Neural Networks. The MIT Press, 2nd ed., 2019.
[30] L. Bottou and Y. LeCun, "Large scale online learning,"