Learning low-frequency temporal patterns for quantitative trading
Joel Da Costa
Department of Statistical Sciences, University of Cape Town
Cape Town, South Africa
[email protected]
https://orcid.org/0000-0001-7821-6635
Tim Gebbie
Department of Statistical Sciences, University of Cape Town
Cape Town, South Africa
[email protected]
https://orcid.org/0000-0002-4061-2621
Abstract—We consider the viability of a modularised mechanistic online machine learning framework to learn signals in low-frequency financial time series data. The framework is proven on daily sampled closing time-series data from JSE equity markets. The input patterns are vectors of pre-processed sequences of daily, weekly and monthly or quarterly sampled feature changes. The data processing is split into a batch processed step where features are learnt using a stacked autoencoder via unsupervised learning, and then both batch and online supervised learning are carried out using these learnt features, with the output being a point prediction of measured time-series feature fluctuations. Weight initializations are implemented with restricted Boltzmann machine pre-training, and variance based initializations. Historical simulations are then run using an online feedforward neural network initialised with the weights from the batch training and validation step. The validity of results is considered under a rigorous assessment of backtest overfitting using both combinatorially symmetrical cross validation and probabilistic and deflated Sharpe ratios. Results are used to develop a view on the phenomenology of financial markets and the value of complex historical data-analysis for trading under the unstable adaptive dynamics that characterise financial markets.
Index Terms—online learning, feature selection, pattern prediction, backtest overfitting
I. INTRODUCTION
A. Technical Analysis
Technical analysis is a financial analytical practice that makes use of past price data in order to identify market states and forecast future price movements based on past movements. The techniques typically rely on past market data (price and volume), rather than company assessments using fundamental analysis. We explore the idea that technical analysis has merit in exposing market inefficiencies when they are signified by repeated feature time-series patterns [1], [2].

Financial markets have been shown to be complex and adaptive systems where the effects of interaction between participants can be highly non-linear [3], but they may also have combinations of top-down and bottom-up sources of information and interaction that mix in vast numbers of interactions mediated by numerous flocks of heterogeneous strategic agents that constitute modern financial markets [4]. Complex and dynamic systems such as these may often exist at multiple 'order-disorder borders', and they will then generate certain non-random patterns and internal organisation on different averaging scales. Two key price generation processes have emerged: the low-frequency domain (the result of sequences of closing auctions generating prices), and the high-frequency intra-day domain driven by order-flow itself. Here we consider low-frequency daily sampled data that is the result of the price discovery from closing auctions.

Even at low frequency, identifying patterns and structure is simultaneously reasonable and notoriously difficult. While it is often clear in hindsight that patterns exist, the amount of noise and non-linearity in the system can make prediction challenging.
Fittingly, neural networks are a popular choice for modelling within financial markets because of their ability to perform well as universal approximators [5].

Practical approaches to money management within the realities of adapting and changing market systems increasingly favour online methods. In particular, [6] explored the application of online learning models in this space in the South African market to show that direct (but simplistic) online pattern-learning is able to identify and potentially exploit trading opportunities on the JSE through the assessment of Open High Low Close (OHLC) data. This was extended by [7] to more directly explore the use of online learning applied to optimizing parameters for traditional technical trading indicators as applied to maximising wealth trading zero-cost portfolio strategies.

The work presented here fits into the growing body of work which considers mechanistic and brute-force approaches of applying machine learning models to financial market data. The complexity, non-linearity, noise and stability of financial markets are highlighted through both the successes and challenges found in training these models. These difficult dynamics, and their notable difference when compared to other popular areas of ML research, which are often built around Independently and Identically Distributed (IID) datasets, present fundamental problems to be explored, both in terms of prediction efficacy as well as validation. We present a framework using batch offline and online learning on JSE closing data, feature extraction and robust non-parametric validation techniques.

B. Backtesting and Model Validation

Financial academic literature is currently facing a problem in terms of validation and verification of results. Trading strategy profitability has typically been proven using historical simulations, or "backtests".
However, the recent advances in technology and algorithms available to construct these strategies have resulted in researchers being able to test increasing numbers of variations of factors. This has made it increasingly difficult to control for spurious results. The problem is so extensive that some meta-research papers suggest that most published research findings are false [8].

The standard way of implementing backtests is to split the data into two portions: an In Sample (IS) portion which is used to train the model, and an Out of Sample (OOS) portion which is used to test the model and validate results. If vast numbers of different model configurations are tested, then it is only a matter of time before false positives occur with high performance both IS and OOS (i.e. overfitting) [9], [10]. The nature of financial data makes it difficult to resolve these issues effectively. There is a low signal-to-noise ratio in a dynamic and adaptive system with only one true data sequence. Traditional hypothesis testing frameworks (e.g. Neyman-Pearson) are not sufficient in this context, making more sophisticated techniques necessary.

The problem of overfitting is not novel. However, in a machine learning context, frameworks are often not suited to trading schedules with a random frequency structure. They do not account for overfitting outside of the output parameters nor take into consideration the number of trials attempted. The common 'hold-out' strategy is where a certain portion of the dataset is reserved for testing true OOS performance. Numerous problems have been pointed out with this approach. The data is often used regardless, and awareness of the movements in the data may influence strategy and test design [11]. For small samples, a hold-out strategy may be too short to be conclusive [12]. Even for large samples, it results in the most recent data (which is arguably the most pertinent) not being used for model selection [9], [13].
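The selection-bias problem can be illustrated with a toy simulation (ours, not from the paper): generate many strategies of pure noise, pick the one with the best IS score, and its apparent edge largely evaporates OOS.

```python
import random, statistics

def sharpe_like(returns):
    """Mean over stdev of a return sequence (a crude, annualisation-free score)."""
    return statistics.mean(returns) / statistics.pstdev(returns)

random.seed(7)
n_strategies, n_days = 500, 500
# Every "strategy" is pure noise: zero-mean daily returns, no real edge.
strategies = [[random.gauss(0.0, 0.01) for _ in range(n_days)]
              for _ in range(n_strategies)]

split = n_days // 2
is_scores = [sharpe_like(s[:split]) for s in strategies]
best = max(range(n_strategies), key=lambda k: is_scores[k])

best_is = is_scores[best]                       # looks impressive purely by selection
best_oos = sharpe_like(strategies[best][split:])  # reverts toward zero out of sample
print(f"best IS score: {best_is:.3f}, same strategy OOS: {best_oos:.3f}")
```

With enough trials the best IS score is substantial even though every strategy is noise, which is exactly why IS/OOS splits alone cannot control for the number of configurations tested.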
We present a novel application of existing sophisticated validation methods (see Section III) to a machine learning framework.

II. FRAMEWORK IMPLEMENTATION
A. Full Framework Process
The framework implementation brings several ideas together: i.) SAE based feature selection, ii.) Deep Learning with pre-training and weight initialization, and iii.) Online Learning and Backtest Overfitting Validation. The learning part of the framework consists of two phases:

I. Batch learning phase: IS data is used to train Stacked Autoencoder (SAE) networks which in turn are used to perform feature reduction for Feedforward Neural Networks (FNNs) learning the price fluctuation predictions in the IS data. Both are trained using Stochastic Gradient Descent (SGD).

II. Online learning phase: The batch trained FNN networks are used to predict price fluctuations on OOS data through Online Gradient Descent (OGD).

These online predictions are then sequentially used to simulate trading in a Money Management System (MMS), which in turn generates simulated returns. Finally, the MMS returns are used as input for the Probability of Backtest Overfitting (PBO) and Deflated Sharpe Ratio (DSR) techniques in order to validate the legitimacy of the framework.

The process has two key principles: first, implementing a generalised version of a system which could offer exploration of more complex techniques; second, ensuring an effective modularisation of steps such that the process can be reconfigured accordingly while maintaining its integrity. In doing so, a separable system is created which brings together all key concepts. We aim to deliver the simplest implementation of a complex framework such that the effects of individual components can be properly assessed and developed. The full process flow can be seen in Figure 7 in the Appendix.
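As a structural sketch only (the function names are ours, not from the accompanying library, and the paper's code is in Julia rather than Python), the modular flow can be expressed as a chain of separable steps:

```python
def run_framework(is_data, oos_data, train_sae, train_fnn, ogd_predict, mms, validate):
    """Compose the framework's phases; each argument is a swappable module."""
    encoder = train_sae(is_data)               # Phase I: unsupervised feature learning (batch)
    fnn = train_fnn(encoder(is_data))          # Phase I: supervised batch training, encoded IS data
    predictions = ogd_predict(fnn, encoder(oos_data))  # Phase II: online learning on OOS data
    returns = mms(predictions)                 # simulate trading from the prediction signal
    return validate(returns)                   # e.g. PBO / DSR assessment of the returns
```

The point of the separability is that swapping the MMS or the validator does not require retraining the networks, and any number of feature-extraction configurations can be tried without touching the later stages.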
B. Data Processing
Datasets are transformed into log feature differences and aggregated to include the changes over rolling window periods. The log feature fluctuations are processed for each asset's closing price at each time point i. Log fluctuations take compounding into account in a systematic way and are symmetric in terms of gains and losses; the log transformation also provides an ergodic time series. The rolling window summations are calculated over the past for input data, e.g. (1, 5, 20) days, and into the future for the predicted output, 5 days in this paper. These are calculated as summations of the log differences \Delta p_i such that for d days at timepoint t:

p_{-}(d, t) = \sum_{i=t-d}^{t} \Delta p_i, \quad \text{and} \quad p_{+}(t, d) = \sum_{i=t+1}^{t+d} \Delta p_i. (1)

The aggregations are scaled using a modified normalization, where the min and max values are determined by the Training portion of the dataset, but applied to both the Training and Prediction portions. This emulates a production implementation where future data is unknown. The log-differenced, aggregated and scaled data is then used as the input for the neural network models. Predicted outputs have the scaling and log differencing reversed in order to reconstruct the actual price point predictions for performance assessment.

C. Network Weight Initialization
The problem of vanishing and exploding gradients has been one of the primary barriers to deep learning with neural networks. The approach of greedy layer-wise pre-training for SAEs was suggested by [14], which allowed much deeper layered networks to be trained than previously possible [15]. Once the SAE is trained, a final output layer is added and the entire network can then be fine-tuned through back-propagation without suffering such performance degradation from vanishing or exploding gradients, enabling training of both SAE and FNN networks [16], [17].

In Section V we see that batch training on historical data has a limited benefit, which gives primacy to weight initialization techniques for machine learning of financial time series. Initial results found that RBM pre-training for sigmoid SAE networks (as described by [17]) had detrimental effects on network performance. This suggests that the IID assumptions and the different loss functions result in the financial time series data used being pathological for RBM pre-training, and is discussed further in [18]. It has also been shown that pre-training may largely act as a prior which may not be necessary if large enough datasets are available [19], [20]. In the context of financial time series this prior can explain the poor performance. For these reasons we focus on variance based weight initialisations developed by [21] and [22]. These have simpler implementations and faster computation, and enable initialization for non-probabilistic activation functions, such as ReLU.

Concretely, we use a modified He initialization: "He-Adjusted". This initialization uses a mean of the input and output layers to scale the weight variance. For networks with constant layer sizes, the initialization is the same as He [22]. For SAE networks, where layer size changes by definition, the He-Adjusted initialization results in better sized weights.
For n, the number of nodes in a layer, we initialize using:

w_{ij} \sim U(-r, r), \quad \text{with} \quad r = \sqrt{12/(n_i + n_j)}. (2)

D. Unsupervised Learning: SAE Training
The benefit of the modularised system is emphasised here, as the SAE training will not suffer from limitations due to backtesting considerations: any amount of configurations can be tested for feature extraction without concern. The best chosen SAE networks (based on a minimum Mean Squared Error (MSE) score) are used to reprocess both the Training and Prediction datasets such that the input is encoded, and the output is as before. These encoded datasets can then be used for the subsequent steps in the framework. We did not implement a step to update the SAE, but results detailed in Section V suggest that this would be an important inclusion in a production system.
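A minimal autoencoder sketch (ours, in Python rather than the paper's Julia; a single sigmoid encoder layer with a linear decoder stands in for the stacked networks used here):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyAutoencoder:
    """Sigmoid encoder, linear decoder, trained by plain SGD on squared error."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        r = math.sqrt(12.0 / (n_in + n_hidden))  # variance-based uniform bound, cf. Eq. (2)
        self.W1 = [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_hidden)]
        self.W2 = [[rng.uniform(-r, r) for _ in range(n_hidden)] for _ in range(n_in)]

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.W1]

    def reconstruct(self, x):
        h = self.encode(x)
        return [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]

    def train_step(self, x, lr=0.1):
        h = self.encode(x)
        y = [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]
        err = [yi - xi for yi, xi in zip(y, x)]
        # backprop through the sigmoid hidden layer (using the pre-update decoder weights)
        dh = [sum(self.W2[i][j] * err[i] for i in range(len(err))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
        for i, row in enumerate(self.W2):        # linear decoder gradient: err_i * h_j
            for j in range(len(row)):
                row[j] -= lr * err[i] * h[j]
        for j, row in enumerate(self.W1):        # encoder gradient: dh_j * x_k
            for k in range(len(row)):
                row[k] -= lr * dh[j] * x[k]
        return sum(e * e for e in err) / len(err)  # reconstruction MSE
```

Once trained to an acceptable MSE, only `encode` is kept: the framework feeds the encoded (reduced) inputs to the prediction networks in the next step.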
E. Supervised Learning: Prediction Network Training
Once the predictive network is trained on IS data using SGD, the OGD process is run through the encoded Prediction dataset in order to generate the predictions for the asset prices that the model produces, thus emulating what would have occurred in a live environment.
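The online phase reduces to a predict-then-update loop. A sketch with a linear model standing in for the FNN (names are ours; OGD here is simply SGD applied once per arriving observation):

```python
def ogd_run(weights, stream, lr=0.05):
    """Online gradient descent: predict each point, then update on its observed target."""
    predictions = []
    for x, target in stream:                  # data arrives sequentially, as in live trading
        pred = sum(w * xi for w, xi in zip(weights, x))
        predictions.append(pred)              # the prediction is made BEFORE seeing the target
        err = pred - target                   # squared-error gradient: d/dw (err^2)/2 = err * x
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return predictions, weights
```

In the framework the starting `weights` come from the batch SGD phase; the loop itself is identical regardless of how the network was initialised.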
F. Money Management Strategy
The MMS follows an arithmetic long strategy of buying any asset for which the predicted price is above the current price, and selling the stock at the prediction horizon regardless. Trading costs were included at 10% capital costs per annum for borrowing to purchase, and 0.45% for transaction costs as per [6], without taking liquidity effects into account. The naive approach is taken purposefully so as not to bias the perspective of the system as a whole by the effects of an impactful trading strategy. It is important, in the interest of effective optimization, that the pattern prediction of the system is not tightly coupled with making it profitable. Thus, the modularity of the system is continued with a separation between the prediction signal and the MMS implementation.
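A sketch of that rule (our own simplification for a single asset; the per-trade cost numbers follow the text, while the pro-rata application of the annual capital rate over the holding period is our assumption):

```python
def mms_long_only(prices, predictions, horizon=5,
                  txn_cost=0.0045, capital_rate=0.10, days_per_year=252):
    """Buy when the predicted price exceeds the current price; sell at the horizon regardless."""
    pnl = 0.0
    for t, (price, pred) in enumerate(zip(prices, predictions)):
        if t + horizon >= len(prices) or pred <= price:
            continue                                     # no signal, or not enough future data
        exit_price = prices[t + horizon]
        gross = (exit_price - price) / price             # arithmetic return on the trade
        costs = 2 * txn_cost                             # transaction cost on entry and exit
        costs += capital_rate * horizon / days_per_year  # borrowing cost over the holding period
        pnl += gross - costs
    return pnl
```

Because the rule consumes only the prediction series, any prediction network can be plugged in unchanged, which is the decoupling the section describes.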
G. Validation
Validation is implemented with Combinatorially Symmetric Cross-Validation (CSCV) [9] that uses the IS and OOS returns from the MMS, which in turn uses the prices from the prediction network; which is a somewhat novel application. Conceptually, the whole system comes into place here, as the results from the CSCV process are now indicative of not only backtest overfitting in the trading strategy, but also in the prediction network, and without having to consider the impact of the many configuration tests for feature extraction. A modified version of ONC was run, with reduced cluster exploration. As noted in [18], this did not appear to affect results.

III. ASSESSMENT METHODOLOGIES
A. Probability of Backtest Overfitting
CSCV was developed by [9] as a robust approach to assessing backtest overfitting. Their research defines backtest overfitting as having occurred when the strategy selection which maximizes IS performance systematically underperforms the median OOS performance in comparison to the remaining configurations. They use this definition to develop a framework which measures the probability of such an event occurring, where the sample space is the combined pairs of IS and OOS trading performance measures. The PBO is then established as the likelihood of a configuration underperforming the median OOS while outperforming IS.

The CSCV methodology is generic, model-free and non-parametric, allowing it to arguably be used in any model case. By recombining the slices of available data, both the training and testing sets are of equal size (advantageous for comparing performance statistics). The symmetry of the set combinations in CSCV ensures that performance degradation is only a result of overfitting, and not arbitrary differences in data sets. There is no requirement of a hold-out set, which removes potential credibility issues regarding whether the holdout set was treated appropriately or not. The logit distribution developed through the assessment offers a useful view on the robustness of the strategies used and the nature of the PBO score.

The PBO can be estimated using the CSCV method results, which provide an estimate of the rate at which the best IS strategies underperform the median of OOS trials. [9] extend this to show that with models overfitting to backtest data noise, there comes a point where seeking increased IS performance is detrimental to the goal of improving OOS performance.
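A condensed sketch of the CSCV procedure of [9] (our own reduction: performance is a simple mean return per configuration, where the original works on a full performance matrix and typically uses the Sharpe ratio):

```python
import itertools, math, statistics

def cscv_pbo(returns_by_config, n_splits=4):
    """Estimate the Probability of Backtest Overfitting via CSCV.

    returns_by_config: list of equal-length return series, one per configuration.
    The series are cut into n_splits blocks; every half-sized combination of
    blocks forms an IS set and its complement the OOS set.
    """
    n_obs = len(returns_by_config[0])
    block = n_obs // n_splits
    blocks = [range(b * block, (b + 1) * block) for b in range(n_splits)]

    def perf(series, idx):
        return statistics.mean(series[i] for i in idx)

    logits = []
    for is_blocks in itertools.combinations(range(n_splits), n_splits // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_splits) if b not in is_blocks for i in blocks[b]]
        # pick the configuration that looks best in sample ...
        best = max(range(len(returns_by_config)),
                   key=lambda c: perf(returns_by_config[c], is_idx))
        # ... and find its relative rank among all configurations out of sample
        oos_perf = [perf(s, oos_idx) for s in returns_by_config]
        rank = sum(p <= oos_perf[best] for p in oos_perf) / (len(oos_perf) + 1)
        logits.append(math.log(rank / (1 - rank)))
    # PBO: fraction of splits where the IS winner falls below the OOS median
    return sum(l < 0 for l in logits) / len(logits)
```

A genuinely strong configuration keeps its rank out of sample (logits above zero, PBO near 0), while a configuration that only ever wins on its own IS blocks drops below the OOS median (PBO near 1).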
B. Deflated Sharpe Ratio
The Sharpe Ratio (SR) is based on the assumption that the returns used are the result of a single trial, as is the case for most standard performance measures. In consideration of the issues laid out in I-B, it then becomes a misrepresentative performance measure. [23] developed the Probabilistic Sharpe Ratio (PSR), which estimates the likelihood that an observed best estimate \widehat{SR} exceeds a provided benchmark SR^* (which might be expected from variance in the trials). It is worth emphasising the distinction in investment strategies between a Family Wise Error Rate (FWER), which is the probability that one or more false positives occur, and a False Discovery Rate (FDR), which is the ratio of false positives to predicted positives. Investment strategy generation will tend to rely on the single best approach produced, and so must consider FWER. [24] further developed the False Strategy Theorem (FST) with this in mind, allowing the assessment of whether a presented strategy is a false positive or not.

The DSR calculates the likelihood that the true SR is positive under consideration of numerous trials being tested [24]. The DSR can be estimated using the PSR methodology as \widehat{PSR}[SR^*], where the benchmark Sharpe ratio, SR^*, is calculated using the False Strategy Theorem. The calculation of SR^* requires both the variance of trial SR values and the number of independent trials, which are not typically considered and where determining independence is challenging. [25] aim to resolve this with the Optimal Number of Clusters (ONC) algorithm, a modified K-means methodology of clustering strategies and trial results. This clustering then allows an estimation of both the variance and number of trials, which in turn allows the DSR to be calculated. With this as a confidence level, one can accept or reject the notion that the observed \widehat{SR} is positive.

IV. EXPERIMENT PROCESS
A. Data & Software
Datasets were constructed using JSE closing price relative data for 2003-2018 [26], with a 60:40 split on the Training:Prediction subsets. The closing price dataset consisted of 10 assets from the JSE Top 40: AGL, BIL, IMP, FSR, SBK, REM, INP, SNH, MTN, DDT (coming from a variety of sectors). More source information, data snapshots and price charts are available in [18].

The software libraries, written in Julia, were produced for all the training, experimentation and recording of results. These are discussed extensively in [18], and made available online [27].
B. Parameter Space Exploration
The parameter space is explored using a phased grid search approach. For each stage, the relevant parameters are each specified as a set of values, and all sets are then used to generate the full combinatorial space, such that each possible combination of the specified parameters is tested.

1) Stage 1: The data configuration (i.e. data windows, prediction point, scaling, data split points) as well as the SAE configuration (network size, learning rates, learning optimization parameters, SGD epochs) are set in Stage 1 and used to train the SAE networks.

2) Stage 2: The preferred SAEs are chosen from Stage 1, and determine the data configuration used for Stage 2. These are then used to encode the datasets, which will be used for FNN training. The FNN SGD and OGD parameters are set in this stage (network size, learning rates, SGD epochs etc.), and will be combined combinatorially with the SAEs that were chosen for testing as well.

V. FINDINGS
A. Value of Historical Data and Training
We expected that the IS batch training using SGD for the predictive network would improve OOS P&L performance. Theoretically, the training on historical data might prime the network for predicting future data. However, we found that the effects of IS training had limited benefit. We ran experimental trials to test the hypothesis that the amount of historical IS data available is of limited benefit. We found the P&L results validate this idea, as seen in Figures 1 and 2. We saw that extensive training on past data may be akin to pre-training network weights at best, and counterproductive in overfitting to dynamics that no longer exist at worst. This highlights the complexity and dynamic nature of financial time series, where past relations and behaviours are not necessarily indicative of present state. It follows that the primary determinants of OOS P&L are those present in the OGD (OOS) learning phase: the OGD learning rate, the data horizon aggregations, and the SAE feature selection.

This fits well with research showing that online algorithms typically perform as fast as batch algorithms during the 'search' phase of parameter optimization, but that 'final' phase convergence tended to fluctuate around the optima due to the noise present in single sample gradients [28], [29]. [30] showed that it is actually more practical to consider the convergence towards the parameters of the optima, rather than the optima itself (as defined by the cost function): the difference between learning speed and optimization speed, respectively. Online learning methods are thus well suited to financial market modelling using neural networks. They allow effective and efficient incremental updates as more recent (and relevant) data becomes available. Further, the increased learning speed over optima convergence makes them a fitting choice when data is non-IID and constantly changing.
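The intuition that online updates beat a frozen batch fit when the data-generating process drifts can be shown with a toy regression (entirely our own illustration, not one of the trials above): the true coefficient changes over time, the batch model is fit once and frozen, and the online model keeps adapting.

```python
import random

random.seed(3)

# Non-stationary process: y = a_t * x, with the coefficient a_t drifting from 1 to 3.
n = 400
xs = [random.uniform(0.5, 1.5) for _ in range(n)]
ys = [(1.0 + 2.0 * t / n) * x + random.gauss(0.0, 0.05) for t, x in enumerate(xs)]

# "Batch" model: least-squares fit on the first half, then frozen.
half = n // 2
w_batch = (sum(x * y for x, y in zip(xs[:half], ys[:half]))
           / sum(x * x for x in xs[:half]))

# "Online" model: one OGD step per observation over the whole series.
w, lr = 0.0, 0.1
online_err = batch_err = 0.0
for t, (x, y) in enumerate(zip(xs, ys)):
    if t >= half:                        # score both models on the second half only
        online_err += (w * x - y) ** 2
        batch_err += (w_batch * x - y) ** 2
    w -= lr * (w * x - y) * x            # the online update happens regardless

print(f"frozen batch SSE: {batch_err:.2f}, online SSE: {online_err:.2f}")
```

The frozen fit encodes an average of past dynamics that no longer hold, while the online model tracks the drift with a short lag, mirroring the limited benefit of extensive IS training observed above.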
B. Primary Determinants of P&L
Input data was scaled to 3 different configurations to assess the effects of shorter and longer data horizons, using SAE MSE and predictive OOS P&L to assess performance. The configurations tested (in trading day window periods) were: 1.) horizon-tuple 1 with [1, 5, 20], 2.) horizon-tuple 2 with [5, 20, 60], and 3.) horizon-tuple 3 with [10, 20, 60]. The SAE networks were only trained on IS data, and not updated afterwards. As noted below, an effective SAE feature selection in this process is an optimization that may be limited to a certain time period and may not generalise well OOS. We also found that lower variance, in the shorter horizon aggregations,
OOS P&L by IS Training Epochs
Fig. 1. These results show OOS P&L grouped by the number of epochs in the SGD IS training phase. Here 100 Epochs offered the best overall performance, and further training to 500 or 1000 epochs degraded performance due to the network overfitting on the IS data. The results show that the benefit of historical data is limited: having networks become better at learning return relationships from 10 years in the past did not lead to increased OOS P&L for more current data. The small difference in the upper half of observations between 10 and 100 Epochs further emphasises this point.
OOS P&L by IS Training Dataset Size
Fig. 2. To further explore the effect of IS training on historical data, configurations were run with a percentage of the usual training data excluded, with the P&L results grouped above. The exclusion of up to 80% of the IS training data resulted in only a 2.2% drop in median OOS P&L for those networks. The training in these instances was not adjusted to increase the number of epochs according to the size of IS data, and so the configurations with more data excluded were also in essence trained less.

resulted in easier replication; while longer horizons are more difficult (as indicated through MSE scores). The reproduction differences are discussed in [18].

We observed strong interactions between the SAE feature sizes, FNN OGD learning rates and data horizon aggregation configurations. The performance differences seen further emphasise the unstable nature of financial systems. Generally, the FNN OGD learning rate had the largest impact on OOS performance, and demonstrates the benefits in being able to adapt quickly to new information, as seen in Figure 3. As the SAE feature size decreased from 25 to 10, SAEs learnt longer term features (as they were increasingly unable to represent short term fluctuations). For FNN networks with larger learning rates, which could otherwise adapt quickly, the increased focus on long term features caused P&L performance degradation. For FNN networks with smaller learning rates, poorly able to adapt quickly to new information either way, there was a benefit from SAE features with an increasing representation of long term trends. The relationship is emphasised dramatically in the 10-feature SAEs, to the point that lower learning rates can be more effective in generating OOS P&L.
The P&L performance suggests that the 10-feature SAEs overfit to long term IS features, and became pathological for short term adaptation OOS.

The most noteworthy results were the 5-feature SAEs, where performance was often on par with or better than the higher feature sizes, or no SAE at all, as seen in Figure 3. It is possible that the small encoding layer acts as a form of regularization, forcing the SAE to learn more consistently generalisable features. The performance of the 5-feature networks, with an 83% reduction in input data, is clear evidence of the efficacy and potential of feature selection in financial time series.

The effect of data horizon aggregations is as expected: short term horizons (i.e. [1, 5, 10]) outperformed in configurations with more SAE features and higher learning rates; longer term horizons (i.e. [10, 20, 60]) outperformed in low learning rate and low feature configurations. The differentiation between these groups is seen more robustly in Section V-F, where data horizon aggregations are determined to be the primary clustering attribute for trade correlations. Strategies focusing on short term predictive strategies (aggregations of [1, 5, 10]) had higher variance in returns than the longer data horizon strategies, though also the highest P&L and Sharpe ratios. This again shows the benefits in focusing on recent cross-sectional data in financial markets. The differentiation between the groups is discussed more in [18].

C. Money Management Strategy Results
The benchmark is an upper bound on performance, representing MMS returns based on perfect knowledge of future prices. The benchmark full return rate is 2.4% with trading costs, over a period of 1555 trading days. So while the strategies' proximity to the benchmark does represent a framework success, they are not necessarily representative of a feasible market solution. Ultimately, this enforces the notion that the MMS implementation is of exceeding importance in a live trading process, and predictive accuracy is only able to achieve so much.

Figure 4 shows the distributions of OOS P&L with trading costs being accounted for. There were a significant number of configurations within 20%-30% of the benchmark. The trials with 0 P&L are networks which suffered from either exploding or vanishing gradients, and were not able to make sufficient predictions. Input data was 10 assets with 3 horizon aggregations each, resulting in an input size of 30 at each timestep.
OOS P&L by Feature Size and OGD Learning Rate
Fig. 3. This figure shows that the lower learning rates (0.005, 0.01) performed best with strategies using long term trend pricing. The 10 feature encoding appeared to optimise specifically for this perspective. The optimisation caused outperformance at the lower learning rates and detrimental performance at higher learning rates (which perform best with short term fluctuation strategies). The 15 to 25 feature encodings showed a better association to the short term strategies. Here higher encodings and higher learning rates offer the best performance. The 5 feature encoding offered consistent performance across learning rates and shows the learning of generalisable features.
MMS OOS P&L Distributions, with Costs Applied
Fig. 4. The distributions of all OOS P&L values, with the benchmark P&L indicated in orange, show an encouraging view of the results. There is a significant negative skew, with a proportionally small number of strategies resulting in negative returns, even with capital and trading costs applied. There was a large proportion of strategies near the OOS upper bound, which is within 28% of the benchmark.
D. Probability of Backtest Overfitting

1) Applying PBO in Mechanistic Machine Learning:
While the methodology is a model free approach to assessing overfitting, the application in a machine learning context is novel and has dynamics worth considering. The use of offline batch learning parameters, online learning parameters and adaptive network weights makes the concept of model parameters less distinct. If a model performs well OOS due to effective learning, this can be due to the model's strength rather than overfitting.

It is noted that the logit metric, which the CSCV method relies on, has its basis in an ordinal ranking, indicating whether the best strategy in the IS set is higher than the median in the OOS set. This means that poor performing configurations can artificially bolster an ordinal position past the median point and so bias PBO results. An honest, wide exploration of the parameter solution space in a mechanistic machine learning framework is likely to result in "poor" configurations being tested (as visible in the '0' P&L configurations in Figure 4). As a result, the methodology shifts the onus onto the researcher in both handling and reporting these dynamics responsibly.

Further to this, the parameter space search methodology (Section IV-B) may also result in a lower likelihood of PBO due to the way of combining parameters across IS and OOS stages. By way of example, any configuration which performs well IS will have all possible OOS parameters tested in combination with it. While some of these combinations may result in poor performance, there will always be a combination of the best IS and best OOS parameter choices. This makes it unlikely that the best configurations will be past the median point for the logit calculation, resulting in a systematically low PBO regardless of how many configurations are attempted.

Lastly, the CSCV algorithm requires a parameter choice of how many windows to split the data into. While not inherently problematic, this choice can have a significant impact on results which is not visible in the reported PBO value. We discuss this further in [18].
2) PBO Results:
We ran the CSCV algorithm on the majority of the configurations tested, which resulted in a final PBO value of 1.7%. A subset of networks were excluded on account of 'null' predictions, resulting in a sample size of 21653 (out of a total of 22248 configurations). The CSCV algorithm was run with a split value of 16. There were 15 years of data, making 16 a reasonable choice as the split parameter (which needs to be even). Ideally, the splits would represent shorter periods, but the exponential increase in computational time made this impractical. The full logit distribution can be seen below in Figure 5.

We found interesting dynamics around the calculation of PBO, and the configurations contributing to the figure. The configuration process went through 2 primary phases: an extremely broad combinatorial grid search, consisting of 20736 configurations; and a second much narrower search of 1512 configurations. Assessing only the configurations from the second phase resulted in a PBO score of 6.3%, which was significantly higher than the overall PBO score. The effect here highlights important aspects of the PBO calculation. The PBO score was much higher for the configurations which were picked more specifically after having already seen a large number of results, which is correctly indicative of an increased likelihood to overfit. However, the PBO score is not monotonically increasing with N, as one would expect; this counterintuitive behaviour is in line with the concerns regarding the effects of increasing configuration sample size.

Logit Distribution for All Configurations
Fig. 5. The CSCV logit distribution for the 21,653 configurations run, with a calculated PBO of 1.7%. The strong negative skew is indicative of IS and OOS strategy returns being closely matched in rank, and results in a low PBO score. This is a favourable assessment of the efficacy of the full framework presented here, and shows that training was able to occur without much risk of backtest overfitting.
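The computational impracticality of finer splits follows from the combinatorics: for S windows, CSCV must evaluate every choice of S/2 windows as the IS set, i.e. C(S, S/2) partitions, which grows roughly as 2^S. A quick illustration (Python, for convenience):

```python
from math import comb

# Number of IS/OOS partitions CSCV must evaluate for S windows.
# S = 16 already requires 12,870 passes over the full configuration set.
for s in (8, 12, 16, 20):
    print(s, comb(s, s // 2))  # 8 -> 70, 12 -> 924, 16 -> 12870, 20 -> 184756
```

Moving from 16 to 20 windows would have multiplied the work by more than a factor of fourteen, which is why shorter split periods were impractical.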
E. Optimal Number of Clusters
The ONC algorithm produced three clusters: one consisting mostly of the negative Sharpe ratio configurations, and two containing the remaining configurations partitioned by their data-horizon aggregation configurations. The distributions of all three clusters' Sharpe ratios can be seen below in Figure 6. If we consider the two primary clusters, we see that Cluster One contained all the trials with horizon aggregations of [5, 20, 60] and [10, 20, 60], and Cluster Two contained all trials with horizon aggregations of [1, 5, 10]. The nature of the experimentation process, with its combinatorial parameter space exploration (as detailed in Section IV-B), is such that the other parameters were mostly evenly split across the two clusters (e.g. OGD learning rate, network sizes, initializations and so on).

The clusters here indicate that the networks adapted to at least two different general strategies for predicting prices: one which was more influenced by long-term fluctuations, and a second which was more influenced by short-term fluctuations. The results presented in Section V-B are then indicative of the networks' ability to execute these overarching strategies effectively. The best Sharpe ratio value (with trading costs applied) was 0.64 and was part of Cluster Two, with the [1, 5, 10] price fluctuation horizon aggregations. The distributions seen in Figure 6 indicate that, at a general level, Cluster One has more consistent performance; Cluster Two, on the other hand, has higher variance, with more strategies at both the lower and higher ranges of Sharpe ratios. The absence of further clusters was probed manually: further subclustering led to a worse cost function score.
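As a rough illustration of the clustering step, the ONC approach of [25] operates on a correlation-derived distance between strategy return series and selects the clustering that scores best under a silhouette-style quality measure. The sketch below (Python, with a plain mean-silhouette score standing in for the exact ONC cost function, as an assumption for illustration) shows the two ingredients; the k-means search over candidate cluster counts, and the recursion into clusters, are omitted.

```python
import numpy as np

def corr_to_dist(rho):
    """Map a correlation matrix to the metric d = sqrt((1 - rho) / 2),
    so perfectly correlated series are at distance 0, anti-correlated at 1."""
    return np.sqrt(0.5 * np.clip(1.0 - np.asarray(rho), 0.0, 2.0))

def silhouette(dist, labels):
    """Mean silhouette score of a labelling on a precomputed distance
    matrix (requires at least two clusters)."""
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        a = dist[i, same].mean() if same.any() else 0.0   # intra-cluster
        b = min(dist[i, labels == k].mean()               # nearest other cluster
                for k in set(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Grouping strategies whose returns co-move yields a higher score than a mixed labelling; ONC searches over the number of clusters to maximise exactly this kind of quality measure, which is how the data-horizon groupings above emerge without being imposed by hand.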
F. DSR and PSR Results
Using the clusters produced by the ONC algorithm (Section V-E), the DSR could be determined. The aggregate cluster time-series returns were calculated and annualized to allow their variance estimates to be used to calculate SR*, the maximum expected observed Sharpe ratio due to variance under the null hypothesis H0 : SR = 0. Using SR* as the benchmark, the PSR calculation PSR[SR*] can then be used to determine whether the observed SR is a false positive. This gives the DSR as a confidence level for having observed a positive best SR. The benchmark SR* calculated was 0.211245, and the best SR observed was 0.642632, leading to a PSR[SR*] of 1.0, indicating that the trials almost certainly contain a strategy with a positive SR. This seems a reasonable conclusion, considering the SR distributions in Figure 6.

Fig. 6. The distributions of all Sharpe ratios, grouped by the clusters produced by the ONC algorithm, with an indication of the best Sharpe ratio (which falls in Cluster Two). Cluster One has more consistent values, with a higher mean and stronger negative skewness; Cluster Two has higher variance, with more strategies at both the lower and higher ranges of Sharpe ratios, including the highest Sharpe ratio from all trials.

VI. CONCLUSIONS
Mechanistic machine learning approaches to financial market data hold some promise for enhancing the performance of low-frequency quantitative trading. To investigate this potential we provide a novel framework that we show to be effective in both training and validation. The framework is configurable, based on decoupled modules, and uses several well-understood techniques: deep-learning neural networks for stock price fluctuation prediction, stacked autoencoders for the purpose of feature selection, both CSCV and PBO to assess the returns from MMS and the likelihood that backtest overfitting has taken place, and the DSR in order to assess the likelihood of a positive Sharpe ratio having been observed.

While machine learning models are expected to excel in big-data environments, in financial markets there is in fact a lack of data with relevant information and signal, both for training and, even more so, for validation. We show that IS training on historical data had a limited benefit. This is not a surprising empirical insight and was confirmed by the negligible impact of increasing training time, as well as the small impact of large reductions in the training data sizes. Increased performance on IS data was not significantly linked to OOS performance. This emphasises the idea that the changing dynamics of financial markets over time need careful attention. Learning optimizations for IS training, such as regularization and learning rate schedules, were shown to have IS benefits but little impact on OOS performance [18]. This gives weight to the importance of online learning methods for financial applications, and in turn highlights the importance of network initialization. The results showed better performance for the He-adjusted initialization and poor performance for RBM-based pre-training. The primary determinants of OOS P&L were shown to be those which affect the online learning data and the model's ability to adapt to it.
We found that SAE encoding layer sizes influenced the nature of the features learnt, with smaller encodings generally leading to the learning of longer-term features. This relationship continued from layer sizes 25 down to 10, with increasing effect. The results from the 10-feature SAEs suggest these SAEs were overfit to long-term IS features and pathological for OOS adaptation. The relationship changed at 5-feature SAEs, which learnt far more generalisable features. The 5-feature SAEs had very competitive performance and show that feature selection in financial time series is both possible and beneficial despite the complexity present. Predictive strategies focusing on long-term changes were present in configurations with longer data horizons, fewer features and lower learning rates (slow adaptation). Short-term strategies presented with shorter data horizons, more features and larger learning rates (quick adaptation). The data horizon was the primary separating attribute in the ONC clusters, emphasising these groupings. The short-term strategies had higher variance, but also the highest returns. This again highlights both the increased value of recent information in financial markets, and the difficulty of using it due to the amount of noise present.

The most challenging aspect of a mechanistic approach to learning is avoiding backtest overfitting (see Section I-B). Probing and validation of the generalisation error was done using the PBO methodology in conjunction with the DSR [25]. The results discussed in Section V show a low likelihood of the models having overfit. The CSCV and PBO methodologies were able to add value to our novel implementation of machine learning models while providing a robust assessment of results. The results were further validated using the ONC and DSR algorithms to detect positive Sharpe ratios. The phenomenological view of financial markets suggested by our experimental results is of a very limited benefit to training on long-term historical financial time series.
A cross-sectional view of the data has far more weight in delivering OOS returns; this is noteworthy in the context of neural networks. Based on our simulation work, we speculate that money management strategies can be more important determinants of OOS profitability than signal generation, and that these should also be learnt.

ACKNOWLEDGMENT
The authors would like to thank Sebnem Er, Patrick Chang and Turgay Celik for valuable comments on the project.

REFERENCES

[1] J. Murphy, "Technical analysis of financial markets," 01 1999.
[2] A. W. Lo and J. Hasanhodzic, The Heretics of Finance. Bloomberg Press, 2009.
[3] B. Arthur, "Complexity in economics and financial markets," Complexity, vol. 1, pp. 20–25, 10 1995.
[4] D. Wilcox and T. Gebbie, "Hierarchical causality in financial economics," Available at SSRN: https://ssrn.com/abstract=2544327, 2014.
[5] K. Hornik, "Multilayer feed-forward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[6] F. Loonat and T. Gebbie, "Learning zero-cost portfolio selection with pattern matching," PLOS ONE, vol. 13, 05 2016.
[7] N. Murphy and T. Gebbie, "Learning the population dynamics of technical trading strategies," arXiv:1903.02228 [q-fin.CP], 03 2019.
[8] J. Ioannidis, "Why most published research findings are false," CHANCE, vol. 32, pp. 4–13, 01 2019.
[9] D. Bailey, J. Borwein, M. López de Prado, and Q. J. Zhu, "The probability of backtest overfitting," Journal of Computational Finance, vol. 20, pp. 39–69, 4 2017.
[10] R. McLean and J. Pontiff, "Does academic research destroy stock return predictability?," The Journal of Finance, vol. 71, 05 2013.
[11] F. Schorfheide and K. Wolpin, "On the use of holdout samples for model selection," American Economic Review, vol. 102, 05 2012.
[12] S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1991.
[13] D. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, pp. 1–12, 05 2004.
[14] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 08 2006.
[15] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Adv. Neural Inf. Process. Syst., vol. 19, pp. 153–160, 01 2007.
[16] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," 01 2006.
[17] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 08 2006.
[18] J. Da Costa, "Online non-linear prediction of financial time series patterns," Master's thesis, University of Cape Town, Cape Town, ZA, Dec 2020.
[19] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, 08 2013.
[20] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?," Journal of Machine Learning Research, vol. 11, pp. 625–660, 02 2010.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 249–256, 01 2010.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," IEEE International Conference on Computer Vision (ICCV 2015), 02 2015.
[23] D. Bailey and M. López de Prado, "The Sharpe ratio efficient frontier," The Journal of Risk, vol. 15, pp. 3–44, 12 2012.
[24] D. Bailey and M. López de Prado, "The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality," The Journal of Portfolio Management, vol. 40, pp. 94–107, 09 2014.
[25] M. López de Prado and M. Lewis, "Detection of false investment strategies using unsupervised learning methods," Quantitative Finance, vol. 19, pp. 1–11, 07 2019.
[26] J. Da Costa, "JSE price relative dataset from 2003-2018." https://zivahub.uct.ac.za/articles/dataset/JSE_Top40_Closing_Price_Relative_Data_for_2003-2018/11897628, 2020. Accessed: 2020-06-30.
[27] J. Da Costa, "Julia libraries for online non-linear prediction of financial time series patterns." https://github.com/joel11/Masters, 2020. Accessed: 2020-06-30.
[28] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, LNCS vol. 1524, 01 1998.
[29] L. Bottou and N. Murata, "Stochastic approximations and efficient learning," in The Handbook of Brain Theory and Neural Networks. The MIT Press, 2nd ed., 2019.
[30] L. Bottou and Y. LeCun, "Large scale online learning,"