Deep Learning modeling of Limit Order Book: a comparative perspective
Antonio Briola
Department of Computer Science, University of Milano-Bicocca, Milano, Italy
[email protected]
Jeremy Turiel
Department of Computer Science, UCL, London, United Kingdom
[email protected]
Tomaso Aste
Department of Computer Science, UCL, London, United Kingdom
and Systemic Risk Centre, London School of Economics, London, United Kingdom
[email protected]
August 14, 2020

Abstract
The present work addresses theoretical and practical questions in the domain of Deep Learning for High Frequency Trading, with a thorough review and analysis of the literature and state-of-the-art models. Random models, Logistic Regressions, LSTMs, LSTMs equipped with an Attention mask, CNN-LSTMs and MLPs are compared on the same tasks, feature space, and dataset and clustered according to pairwise similarity and performance metrics. The underlying dimensions of the modeling techniques are hence investigated to understand whether these are intrinsic to the Limit Order Book's dynamics. It is possible to observe that the Multilayer Perceptron performs comparably to or better than state-of-the-art CNN-LSTM architectures, indicating that dynamic spatial and temporal dimensions are a good approximation of the LOB's dynamics, but not necessarily the true underlying dimensions.

Keywords Artificial Intelligence · Deep Learning · Machine Learning · Market Microstructure · Econophysics · Financial Markets
1 Introduction

Recent years have seen the growth and spreading of Deep Learning methods across several domains. In particular, Deep Learning is increasingly applied to the domain of Financial Markets as well, but these activities are mostly performed in industry and the academic literature to date is scarce. The present work builds upon the more general Deep Learning literature to offer a comparison between models applied to High Frequency markets. Insights about Market Microstructure are then derived from the features and performance of the models.

The Limit Order Book (LOB) represents the venue where buyers and sellers interact in an order-driven market. It summarises a collection of intentions to buy or sell integer multiples of a base unit volume v (lot size) at price p. The set of available prices {p_1, ..., p_n} is discrete with a basic unit step θ (tick size). The LOB is a self-organising complex process where a transaction price emerges from the interaction of a multitude of agents in the market. These agents interact through the submission of a range of order types in the market. Figure 1 provides a visual representation of the LOB, its components and features.

Figure 1: Schematic representation of the LOB structure. It is possible to distinguish between the bid side (left) and the ask side (right), where both are organised into levels. The first level contains the best bid-price and the best ask-price, respectively. Since the market's goal is to facilitate the matching of intentions from buyers and sellers, the best bid-price is defined as the maximum proposed bid-price, while the best ask-price is defined as the minimum proposed ask-price. The distance between best bid-price and best ask-price is commonly referred to as the bid-ask spread. The mid-price is defined as the mean between best bid-price and best ask-price. The lower (higher) the bid-price (ask-price) at which limit orders are submitted, the deeper the level at which they are placed. The cumulative volume of buy and sell limit orders determines the market depth. In order-driven markets, the priority of orders to be matched at each price level depends upon the arrival time, according to a FIFO (First In, First Out) rule [1].

Three main categories of orders exist: market orders, limit orders and cancellation orders. Market orders are executed at arrival by paying a higher execution cost which originates from crossing the bid-ask spread. Limit orders make up the liquidity of the LOB at different price levels and constitute an expression of the intent to buy or sell a given quantity v_p at a specific price p. These entail lower transaction costs, with the risk of the order not being fulfilled. Cancellation orders are used to partially or fully remove limit orders which have not been filled yet.

The study of order arrival and of the dynamics of the Limit Order Book and of order-driven markets has seen a growing interest in the academic literature as well as in industry. This sparked from the almost simultaneous spreading of electronic trading and high frequency trading (HFT) activity throughout global markets. The resulting increase in the frequency of trading activity has generated a growing amount of trading data, thereby creating the critical mass for Big Data applications. The availability of Big Data from High Frequency Trading has then made it possible to apply the data hungry Machine Learning and Deep Learning methods to financial markets.
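To fix ideas about the quantities defined above (best quotes, bid-ask spread, mid-price), the following toy Python snapshot may help; all numbers are invented and the dictionaries are a deliberately simplified stand-in for a real book with FIFO queues at each level:

```python
# Toy LOB snapshot (invented numbers): each side maps price -> resting volume.
bids = {100.02: 300, 100.01: 500, 100.00: 200}   # buy intentions, 3 levels
asks = {100.04: 250, 100.05: 400, 100.06: 150}   # sell intentions, 3 levels

best_bid = max(bids)                  # highest proposed bid-price
best_ask = min(asks)                  # lowest proposed ask-price
spread = best_ask - best_bid          # bid-ask spread
mid_price = (best_bid + best_ask) / 2.0

print(f"spread = {spread:.2f}, mid-price = {mid_price:.3f}")
```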
Machine Learning methods were initially adopted by hedge funds towards the end of the last century, while now it is possible to see large quantitative firms and leading investment banks openly applying AI methods. Building upon this growing interest, an increasing number of papers and theses exploring Machine Learning and Deep Learning methods applied to financial markets are being written. This is part of the modern trend where large companies lead research fields in AI due to the availability of computational and monetary resources [16, 29]. This often results in a literature dominated by increasingly complex and task-specific model designs, often conceived with an applicative approach and without an in-depth analysis of the theoretical implications of the obtained results.

In light of this, for the present work, the relevant literature has been screened in search of state-of-the-art models for price movement forecasting in high frequency trading. Increasingly complex models from this literature are presented and characterised, and their results are compared on the same training and test sets. The theoretical implications of these results are investigated and the models are compared for statistically validated similarity. The purpose of this analysis is to reason about why certain models should or should not be used for this application, as well as to verify whether more complex architectures are truly necessary. One example consists in the study of temporal and spatial dimensions (implied by Recurrent Neural Network (RNN) [30], Attention [38] and Convolutional Neural Network (CNN) [28] models, respectively) and whether they are unique and optimal representations of the Limit Order Book. The alternative hypothesis consists in a Multilayer Perceptron (MLP) which does not explicitly model any dynamic.

The Deep Learning models described in Section 2 incorporate assumptions about the structure and relations in the data, as well as about how it should be interpreted. As reported above, this is the case of CNNs, which exploit the relations between neighbouring elements in their grid-like input. Analogous considerations can be made for RNN-like models, which are augmented by edges between consecutive observation embeddings. These types of structures aim to carry information across a series, hence implying sequential relations between inputs. These tailored architectures are used to test hypotheses about the existence and informativeness of corresponding dimensional relations in the Limit Order Book.

The diffusion of flexible and extensible frameworks such as Weka [40] and Keras [9], and the success of automated Machine Learning tools such as AutoML [20] and H2O AutoML [7], is facilitating the abovementioned industry-driven applicative approach. Unfortunately, much less attention is given to methods for model comparison able to assess the real improvement brought by these new models [12]. In order to make the work presented here more reliable and to promote a more thorough analysis of published works, a statistical comparison between models is provided through the state-of-the-art statistical approaches described in [4]. It is crucial for scientifically robust results to validate the significance of model performance and of the potential performance improvements brought by novel architectures and methodologies. The main classes of methods for significance testing are Frequentist and Bayesian.
Null model-based Frequentist methods are not ideal for this application, as the detected statistical significance might have no practical impact on performance. The need to answer questions about the likelihood of one model performing significantly better than another, based on experiments, requires the use of posterior probabilities, which can be obtained from the Bayesian methods in [15, 13, 5].

This paper is organised in the following sections: Section 2 presents a review of the relevant literature which motivates the study performed in the current work and presents the assumptions upon which it is built. Section 3 briefly describes the data used throughout the experiments. Section 4 provides an exhaustive description of the experiments conducted. Section 5 presents and analyses the results and Section 7 concludes the work with ideas for further research efforts.

2 Literature Review

The review by Bouchaud et al. [6] offers a thorough introduction to Limit Order Books, to which we refer the interested reader. As discussed in Section 1, the growth of electronic trading has sparked interest in Deep Learning applications to order-driven markets and Limit Order Books. The work by Kearns and Nevmyvaka [25] presents an overview of Machine Learning and Reinforcement Learning applications to market microstructure data and tasks, including return prediction from Limit Order Book states. The first attempt to produce an extensive analysis of Deep Learning-based methods for stock price prediction based upon the Limit Order Book was made by Tsantekidis et al. [37]. Starting from a horse racing-type comparison between classical Machine Learning approaches (e.g. Support Vector Machines) and more structured Deep Learning ones, they then considered the possibility of applying CNNs to detect anomalous events in financial markets and take profitable positions. In the last two years, a few works applying a variety of Deep Learning models to LOB-based return prediction were published by the research group of Stephen Roberts. The first one, to the best of our knowledge, applied Bayesian (Deep Learning) Networks to the Limit Order Book [41], followed by an augmentation of the labelling system as quantiles of returns and an adaptation of the modeling technique to this [43]. The most recent work introduces the current state-of-the-art modeling architecture combining CNNs and LSTMs to delve deeper into the complex structure of the Limit Order Book. The work by Sirignano and Cont [35] provides a more theoretical approach: it tests and compares multiple Deep Learning architectures for return forecasting based on order book data and shows how these models are able to capture and generalise to universal price formation mechanisms throughout assets.

The models used in the above works were originally defined in the Machine and Deep Learning literature and are summarised below. Multinomial Logistic Regression is used as a baseline for this work and consists in a linear combination of the inputs mapped through a logit activation function, as defined in [18]. Feedforward Neural Networks (or Multilayer Perceptrons) are defined in [21] and constitute the general framework to represent non-linear function mappings between a set of input variables and a set of output variables. Recurrent Neural Networks (RNNs) [30] are then considered in the form of Long Short-Term Memory models (LSTMs) [22]. RNNs constitute an evolution of MLPs. They introduce the concept of sequentiality into the model by including edges which span adjacent time steps.
RNNs suffer from the issue of vanishing gradients when carrying information over a large number of time steps. LSTMs solve this issue by replacing nodes in the hidden layers with self-connected memory cells of unit edge weight, which allow information to be carried on without vanishing or exploding gradients. LSTMs hence owe their name to the ability to retain information through a long sequence. The addition of Attention mechanisms [38] to MLPs helps the model to focus on the more relevant regions of the input data in order to make predictions. Self-Attention extends the parametric flexibility of global Attention mechanisms by introducing an Attention mask that is no longer fixed, but a function of the input. The last kind of Deep Learning unit considered are Convolutional Neural Networks (CNNs), designed to process data with a grid-like topology. These units serve as feature extractors, thus learning feature representations of their inputs [28].

A considerable body of literature about the comparison of different models has been produced, despite not being vastly applied by the Machine Learning community. The first attempts at formalisation were made by Dietterich [14] and Salzberg [34], and refined by Nadeau and Bengio [32] and Alpaydın [2]. A comprehensive review of all these methods and of classical statistical tests for Machine Learning is presented in [24]. A crucial point of view is provided by the work in [12]. More recently, starting from the work by Corani and Benavoli [11], a Bayesian approach to statistical testing was proposed to replace classical approaches based on the null hypothesis. The proposed new ideas found a complete definition in [4].
3 Data

All the experiments presented in this paper are based on the LOBSTER [23] dataset, which provides a highly detailed, event-by-event description of all micro-scale market activities for each stock listed on the NASDAQ exchange. LOBSTER is one of the data providers featured in major publications and journals in this field, such as Quantitative Finance. LOB datasets are provided for each security on the NASDAQ. The dataset lists every market order arrival, limit order arrival and cancellation that occurs on the NASDAQ platform between 09:30 am and 04:00 pm on each trading day. Trading does not occur on weekends or public holidays, so these days are excluded from all the analyses performed. A tick size of θ = $0.01 is adopted. Depending on the type of the submitted order, executions can occur at a lower cost than crossing the full bid-ask spread. This is the case of hidden orders which, when revealed, appear at a price equal to the notional mid-price at the time of execution.

LOBSTER [23] data are structured into two different files:
• The message file lists every market order arrival, limit order arrival and cancellation that occurs.
• The orderbook file describes the market state (i.e. the total volume of buy or sell orders at each price) immediately after the corresponding event occurs.

Experiments described in the next few sections are performed only using the orderbook files. The training dataset consists of Intel Corporation's
(INTC) LOB data from 04-02-2019 to 31-05-2019, corresponding to a total of 82 files, while the test dataset consists of Intel Corporation's LOB data from 03-06-2019 to 28-06-2019, obtained from 20 other files. All the experiments presented in the current work are conducted on snapshots of the LOB with a depth (number of tick size-separated limit order levels per side of the Order Book) of 10. This means that each row in the orderbook files corresponds to a vector of length 40. Each row is structured as

$[(p, v)_a^{(1)}, (p, v)_b^{(1)}, (p, v)_a^{(2)}, (p, v)_b^{(2)}, \dots, (p, v)_a^{(10)}, (p, v)_b^{(10)}]$,    (1)

where (p, v) represents the price level and corresponding liquidity tuple, and {a, b} distinguish ask and bid levels progressively further away from the best ask and best bid.
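As an illustration of this layout, a minimal NumPy sketch (not the authors' code; the row values are placeholders) unpacks one such 40-dimensional vector:

```python
import numpy as np

# One orderbook-file row (Equation 1): 40 values = 10 levels x 2 sides x (price, volume).
row = np.arange(40, dtype=float)      # stand-in for one LOBSTER orderbook row
levels = row.reshape(10, 2, 2)        # axes: (level, {ask, bid}, [price, volume])

ask_prices, ask_volumes = levels[:, 0, 0], levels[:, 0, 1]
bid_prices, bid_volumes = levels[:, 1, 0], levels[:, 1, 1]
mid_price = (ask_prices[0] + bid_prices[0]) / 2.0   # mean of the best quotes
```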
4 Methods

4.1 Labelling

Price log-returns for the target labels are defined at three distinct time horizons H_Δτ. In order to account for price volatility and discount long periods of stable and noisy order flow, the time delay Δτ between the LOB observation (input) and the target label return is defined as follows. Given a series of mid-prices at consecutive ticks

$p_{m,0}, p_{m,1}, \dots, p_{m,n}$,    (2)

where the mid-price is defined as the mean between the best bid and best ask price, the series of log-returns is

$r_{m,0}, r_{m,1}, \dots, r_{m,n-1}$,    (3)

where

$r_{m,k} = \log p_{m,k+1} - \log p_{m,k}$.    (4)

The number of non-zero log-returns in the series is hence counted as

$\Delta\tau = \sum_{k=0}^{n-1} \Theta(|r_{m,k}|)$,    (5)

where Θ is the Heaviside step function defined below:

$\Theta(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$.    (6)

4.2 Preprocessing

The data described in Section 3 are preprocessed as follows:
• The target labels for the prediction task aim to categorise the return at three different time horizons H_Δτ | Δτ ∈ {10, 50, 100}. In order to perform the mapping from continuous variables into discrete classes, the quantile levels (0.00, 0.25, 0.75, 1.00) are computed on the returns distribution of the training set and then applied to the test set. These quantiles are mapped onto classes, denoted with (q_−1, q_0, q_+1), as reported in Figure 2.

Figure 2: Visual representation of the mapping between quantiles and corresponding classes. The quantiles' edges (i.e. (0.00, 0.25, 0.75, 1.00)) define three different intervals. Each specific class (i.e. q_−1, q_0, q_+1) corresponds to a specific interval.

• The training set input data (LOB states) are scaled within a (0, 1) interval with the min-max scaling algorithm [33]. The scaler's training phase is conducted by chunks to optimise the computational effort. The trained scaler is then applied to the test data.

Figure 3 reports the training and test set quantile distributions per horizon H_Δτ.

Figure 3: Training and test set quantile (q_−1, q_0, q_+1) distributions per horizon H_Δτ | Δτ ∈ {10, 50, 100} at the end of the preprocessing and labelling phase. The tables' entries, for both the training and the test set, report the exact number of samples per horizon, for each considered quantile.

It is possible to notice moderately balanced classes for both plots in Figure 3. Indeed, all classes lie within the same order of magnitude for all horizons H_Δτ | Δτ ∈ {10, 50, 100}, with the q_0 class being the most represented and q_+1 the least.

4.3 Random Model

The benchmark null model for this work is a generic random model, which does not handle any dynamics. For each sample in the test set and each horizon H_Δτ, the quantile label q_r is sampled from the uniform distribution over r ∈ {−1, 0, +1}. The SciPy [39] randint generator is used for this task.
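A minimal sketch of the labelling logic of Section 4.1 and of the random benchmark of Section 4.3 is given below; mids and train_returns are hypothetical array names, and this is an illustration rather than the authors' code:

```python
import numpy as np
from scipy.stats import randint

def quantile_labels(mids, train_returns):
    """Map log-returns to the classes (q_-1, q_0, q_+1)."""
    r = np.diff(np.log(mids))                     # log-returns, Equation (4)
    delta_tau = int(np.sum(np.abs(r) > 0))        # non-zero returns, Equations (5)-(6)
    lo, hi = np.quantile(train_returns, [0.25, 0.75])  # training-set quantile edges
    labels = np.where(r < lo, -1, np.where(r > hi, 1, 0))
    return labels, delta_tau

# The random benchmark: labels drawn uniformly from {-1, 0, +1}.
random_labels = randint.rvs(-1, 2, size=1000)
```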
4.4 Multinomial Logistic Regression

The baseline model is the multinomial Logistic Regression which, like the Random Model, does not explicitly model any dynamics in the data. Like binary Logistic Regression, the multinomial one adopts maximum likelihood estimation to evaluate the probability of categorical membership. It is also known as Softmax Regression and can be interpreted as a classical ANN, as per the definition in Table 1, with the input layer directly connected to the output layer through a softmax activation function:

$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.    (7)

The input is represented by the ten most recent LOB states, as per the definition in Equation 1 in Section 3. The Scikit-Learn [33] implementation is used and, in order to guarantee a fair comparison with the Deep Learning models in the next sections, the following parameters are set:
• max_iter (i.e. the maximum number of iterations taken for the solvers to converge) = 20.
• tol (i.e. the tolerance for stopping criteria).
• solver (i.e. the algorithm to use in the optimization problem) = sag, with the default L2 penalty.
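A minimal configuration sketch along the lines above follows; the toy training arrays are invented placeholders and the stopping tolerance is left at the library default:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins: ten flattened LOB states (400 features) per sample.
X_train = np.random.rand(1000, 400)
y_train = np.random.randint(-1, 2, size=1000)   # classes q_-1, q_0, q_+1

# Min-max scaling of the inputs into (0, 1), as in Section 4.2.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)

clf = LogisticRegression(
    multi_class="multinomial",  # softmax over the three quantile classes
    solver="sag",               # with the default L2 penalty
    max_iter=20,
)
clf.fit(X_scaled, y_train)
```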
Table 1: Multinomial Logistic Regression architectural scheme.

Logistic Regression
Input @ [1 × 400]
Dense @ 3 Units (activation softmax)

4.5 Multilayer Perceptron

The first Deep Learning model is a generic Multilayer Perceptron (MLP), which does not explicitly model temporal or spatial properties of the input data, but has the ability to model not explicitly defined dimensions through its hidden layers. Similarly to the two previously mentioned models, it does not explicitly handle any specific dimension. The MLP can be considered the most general form of universal approximator and it represents the ideal model to confirm or reject any hypothesis about the presence of a specific leading dimension in LOBs. The input is represented by the ten most recent LOB states (Equation 1, Section 3), with the ten states concatenated in a flattened shape. The MLP model is architecturally defined in Table 2.

Table 2: Multilayer Perceptron architectural scheme.
Multilayer Perceptron
Input @ [1 × 400]
Dense @ 512 Units
Dense @ 1024 Units
Dense @ 1024 Units
Dense @ 64 Units
Dense @ 3 Units (activation softmax)

4.6 Shallow LSTM

In order to explicitly handle the temporal dynamics of the system, a shallow LSTM model is tested [22]. The LSTM architecture explicitly models temporal and sequential dynamics in the data, hence providing insight on the temporal dimension of the data. As in all other RNN models, the structure of LSTMs enables the network to capture temporal dynamics by performing sequential predictions. The current state directly depends on the previous ones, meaning that the hidden states represent the memory of the network. Differently from classic RNN models, LSTMs are explicitly designed to overcome the vanishing gradient problem as well as to capture the effect of long-term dependencies. The input is here represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1. The LSTM layer consists of 20 units with a tanh activation function:

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)}$.    (8)

It is observed that the addition of LSTM units beyond the chosen level does not yield statistically significant performance improvements. Hence, the chosen number of LSTM units can be considered optimal and the least computationally costly. The model is architecturally defined in Table 3.

Table 3: Shallow LSTM architectural scheme.

Shallow LSTM
Input @ [10 × 40]
LSTM @ 20 Units
Dense @ 3 Units (activation softmax)
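As an illustration, a minimal Keras sketch of the MLP in Table 2 and of the shallow LSTM in Table 3 could look as follows; the ReLU activations in the MLP hidden layers are an assumption, as the tables do not specify them:

```python
from tensorflow.keras import layers, models

# Shallow LSTM of Table 3: 10 consecutive LOB states of 40 features each.
lstm = models.Sequential([
    layers.Input(shape=(10, 40)),
    layers.LSTM(20, activation="tanh"),        # Equation (8)
    layers.Dense(3, activation="softmax"),     # classes (q_-1, q_0, q_+1)
])

# MLP of Table 2: the ten LOB states flattened into a single vector.
mlp = models.Sequential([
    layers.Input(shape=(400,)),
    layers.Dense(512, activation="relu"),      # hidden activations assumed
    layers.Dense(1024, activation="relu"),
    layers.Dense(1024, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
```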
4.7 Self-Attention LSTM

As a point of contact between architectures which model temporal dynamics (LSTMs) and spatial modeling ones (CNNs), the LSTM described in Section 4.6 is enhanced by the introduction of a Self-Attention module [38]. By default, the Attention layer statically considers the whole context while computing the relevance of individual entries. Differently from what is described in Section 4.8, the input is not subject to any spatial transformation (e.g. convolutions). This difference implies a static nature of the detected behaviours over multiple timescales. The Self-Attention LSTM is architecturally defined in Table 4.

Table 4: Self-Attention LSTM architectural scheme.
Self-Attention LSTM
Input @ [10 × 40]
LSTM @ 40 Units
Self-Attention Module
Dense @ 3 Units (activation softmax)

The input to this model is represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1.
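Since the exact Attention implementation is not specified here, the following hedged Keras sketch uses the built-in dot-product Attention layer as a stand-in for the Self-Attention module of Table 4:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(10, 40))
seq = layers.LSTM(40, return_sequences=True)(inp)  # keep the full 10-step sequence
att = layers.Attention()([seq, seq])               # self-attention: query = value = seq
vec = layers.GlobalAveragePooling1D()(att)         # collapse the time axis
out = layers.Dense(3, activation="softmax")(vec)
self_attention_lstm = models.Model(inp, out)
```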
4.8 CNN-LSTM

The nature of the Deep Learning architectures in Sections 4.6 and 4.7 mainly focuses on modeling the temporal dimension of the inputs. Recent developments [42, 36] highlight the potential of the spatial dimension in LOB-based forecasting as a structural module allowing to capture dynamic behaviours over multiple timescales. In order to study the effectiveness of such an augmentation, the architecture described in [42] is reproduced as in Table 5 and adapted to the application domain described in the current work. The input is represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1. This model represents the state-of-the-art in terms of prediction potential at the time of writing.

Table 5: CNN-LSTM architectural scheme.
CNN-LSTM
Input @ [10 × 40]
Conv 1 × 2 @ 16 (stride = 1 × 2)
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Conv 1 × 2 @ 16 (stride = 1 × 2)
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Conv 1 × 10 @ 16
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Inception @ 32
LSTM @ 64 Units
Dense @ 3 Units (activation softmax)
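A hedged Keras sketch of Table 5 follows; it mirrors the DeepLOB-style layout of [42], but the padding, the activation choices and the simplified two-path Inception block are assumptions rather than the authors' exact code:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(10, 40, 1))                        # 10 states x 40 features
x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(inp)  # pair (p, v)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(x)    # pair bid/ask
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (1, 10), activation="relu")(x)         # span the 10 remaining columns
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)

# Simplified Inception-style block @ 32 filters: parallel 1x1 and 3x1 paths.
a = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(x)
b = layers.Conv2D(32, (3, 1), padding="same", activation="relu")(x)
x = layers.Concatenate(axis=-1)([a, b])

x = layers.Reshape((10, -1))(x)                              # back to a 10-step sequence
x = layers.LSTM(64)(x)
out = layers.Dense(3, activation="softmax")(x)
cnn_lstm = models.Model(inp, out)
```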
4.9 Training scheme

For each of the Deep Learning models, the following training (and testing, see Section 4.10) procedure is applied:
• Training batches of 1024 samples are produced, with each sample made of 10 consecutive LOB states. LOB states are defined in Section 3 and in Equation 1.
• For each training epoch, batches are randomly chosen so that the total number of training samples is equivalent to one month of data. This sampling procedure ensures a good coverage of the entire dataset and allows to operate with a reduced amount of computational resources.
• Class labels are converted to their one-hot representation.
• The selected optimizer is Adam [27]. Its Keras [9] implementation is chosen and the default values of its hyperparameters are kept (lr = 0.001). The categorical cross-entropy loss function is chosen due to its suitability for multi-class classification tasks [8].
• From manual hyperparameter exploration it is observed that 30 training epochs are optimal when accounting for constraints on computational resources. It has been empirically observed that slight variations do not produce any significant improvement.
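A minimal sketch of this training scheme, reusing e.g. the lstm model sketched in Section 4.6 (the data arrays are invented placeholders):

```python
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# One toy batch of 1024 windows of 10 consecutive LOB states each.
X = np.random.rand(1024, 10, 40)
y = to_categorical(np.random.randint(0, 3, size=1024), num_classes=3)  # one-hot labels

lstm.compile(
    optimizer=Adam(learning_rate=0.001),   # Keras default hyperparameters
    loss="categorical_crossentropy",       # multi-class classification loss
    metrics=["accuracy"],
)
lstm.fit(X, y, batch_size=1024, epochs=30)
```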
4.10 Testing scheme

At the end of the training phase, the inducer for each model is queried on the test set as follows:
• The Keras [9] Time Series Generator is used to rearrange the test set, creating batches of test samples. Each sample is made of 10 consecutive LOB states (Equation 1).
• For each model and test fold, balanced Accuracy [31, 26, 19], weighted Precision, weighted Recall and weighted F-measure are computed. These metrics are weighted in order to correct for class imbalance and obtain unbiased indicators. The following individual class metrics are considered too: Precision, Recall and F-measure. Two multi-class correlation metrics between labels are also computed: the Matthews Correlation Coefficient (MCC) [17] and Cohen's Kappa [10, 3]. A computation sketch for these metrics follows the list.
• Performance metrics for each test fold are statistically compared through the Bayesian correlated t-test [11, 4]. One should note that the region of practical equivalence (rope), determining the negligible difference between performance metrics in different models, is arbitrarily set to a sensible value, due to the lack of examples in the literature. A sketch of this test is reported after Table 6.
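A minimal sketch of the per-fold metric computation with Scikit-Learn [33]; y_true and y_pred are invented stand-ins for one test fold:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, precision_recall_fscore_support)

y_true = np.random.randint(-1, 2, size=1000)   # ground-truth quantile labels
y_pred = np.random.randint(-1, 2, size=1000)   # one model's predictions

bal_acc = balanced_accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)  # class-imbalance corrected
mcc = matthews_corrcoef(y_true, y_pred)        # multi-class MCC [17]
kappa = cohen_kappa_score(y_true, y_pred)      # Cohen's Kappa [10]
```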
Table 6: A summary of the Deep Learning models and related dynamics.

Model                              Dimension
Random Model                       None
Multinomial Logistic Regression    None
Multilayer Perceptron              Not explicitly defined
Shallow LSTM                       Temporal
Self-Attention LSTM                Temporal + Spatial (static)
CNN-LSTM                           Temporal + Spatial (dynamic)
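For the statistical comparison itself, the following sketch implements the posterior of the Bayesian correlated t-test of Corani and Benavoli [11] as a Student-t with correlation-corrected variance; the correlation heuristic rho = n_test / (n_test + n_train) and the rope value below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def bayesian_correlated_ttest(diffs, rho, rope=0.01):
    """Posterior probabilities that model A is worse, practically equivalent,
    or better than model B, given per-fold metric differences diffs = A - B."""
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)
    # Student-t posterior with variance inflated by the fold correlation rho.
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diffs.var(ddof=1))
    post = stats.t(df=n - 1, loc=diffs.mean(), scale=scale)
    p_left = post.cdf(-rope)                   # P(B practically better)
    p_rope = post.cdf(rope) - post.cdf(-rope)  # P(practical equivalence)
    p_right = 1.0 - post.cdf(rope)             # P(A practically better)
    return p_left, p_rope, p_right
```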
5 Results

Multinomial Logistic Regressions, MLPs, LSTMs, LSTMs with Attention and CNN-LSTMs are trained to predict the return quantile q at the different horizons H_Δτ | Δτ ∈ {10, 50, 100}. The dataset used for all models is defined in Section 3 and the metrics used to evaluate and compare out-of-sample model performances are introduced in Section 4.10. Out-of-sample performance metrics are reported in Table 7. We define three clusters that group models according to their performances. Specifically, in each one of these clusters it is possible to locate models that perform statistically equivalently throughout horizons H_Δτ, based on the MCC and weighted F-measure metrics described in Section 4.10. A representation of model clustering and ordering is presented in Figure 4. We tested similarities between the models' performances by means of the Bayesian correlated t-test and were able to assign the models to the relative clusters, or intersections of those, as per the representation in Figure 4.

Figure 4: Schematic clustering solution for the models presented in Section 4. Similarities between models' performances are tested using the Bayesian correlated t-test described by Benavoli et al. [4]. Models contained in the same cluster component perform statistically equivalently based on MCC and weighted F-measure metrics. A higher intensity in cluster shading indicates an increase in model performances.

Table 7: Performance metrics for horizons H_Δτ computed on the test folds. The column labels H10, H50 and H100 refer to H_Δτ | Δτ = 10, 50 and 100, respectively. Columns span the Random Model, Logistic Regression, Shallow LSTM, Self-Attention LSTM, CNN-LSTM and Multilayer Perceptron; rows report Balanced Accuracy, weighted Precision, weighted Recall, weighted F-measure, per-class Precision, Recall and F-measure for the quantiles [0, 0.25], [0.25, 0.75] and [0.75, 1], the MCC and Cohen's Kappa.
6 Discussion

This section builds upon the results in Section 5 and delves deeper into the analysis of model similarities and of individual dimension-based model performances. A better Deep Learning-based understanding of LOB dynamics and modeling applications emerges from this.

The first cluster component in Figure 4 is characterised by the Random Model, but it comprises the Logistic Regression and the Self-Attention LSTM at intersections with other clusters. Looking at Table 7, low values for weighted Precision and weighted Recall are observed for the Random Model. This means that the system, for each horizon H_Δτ, yields balanced predictions throughout classes and, by chance, a few of them are correct. Most of the correctly classified labels belong to the central quantile q_0, as shown by the higher values of its class-specific Precision. The second most correctly predicted quantile is the lower one, q_−1, while the worst overall performance is associated with the upper quantile q_+1. Given that these predictions are picked randomly from a uniform distribution, the obtained results reflect the test set class distribution. The MCC and Cohen's Kappa are both equal to zero, thereby confirming that the model performs in a completely random fashion.

It is difficult to assign the multinomial Logistic Regression to a cluster, as the results in Table 7 show that the model is solid throughout horizons H_Δτ but lacks the ability to produce complex features which could improve its performance, due to the absence of non-linearities and hidden layers. Because of this behaviour, it is placed at the overlap between the first two clusters. Looking at Tables 8 and 9, it is possible to note how the multinomial Logistic Regression consistently outperforms the Random Model. For longer horizons (i.e. H_Δτ | Δτ ∈ {50, 100}), the Logistic Regression outperforms more structured Deep Learning models designed to handle specific dynamics. Despite this result, the model is systematically unable to decode the signal related to the upper quantile, perhaps due to the slight class imbalance in the training set.

The second cluster is represented by the shallow LSTM model. In this case, weighted Precision and weighted Recall values for all three horizons H_Δτ are higher than for the Logistic Regression and the Random Model. This means that the proposed model is able to correctly predict a considerably higher number of samples from different classes, when compared to the previously considered architectures. Similarly to what is described in the previous paragraph, the model's predictions are well balanced between quantiles for horizons H_Δτ | Δτ ∈ {10, 50}. This observation is confirmed by higher values of the MCC and Cohen's Kappa for H_Δτ | Δτ ∈ {10, 50}, which indicate that an increasing number of predictions match the ground truth. Performance metrics collapse for the longer horizon H_Δτ | Δτ = 100 in the upper quantile q_+1.

The results achieved by the Attention-LSTM make assigning the model to one of the clusters in Figure 4 extremely difficult. As stated for class-specific performances in the shallow LSTM model, there is a strong relation between metrics and the considered horizon H_Δτ. For H_Δτ | Δτ = 10, higher values of weighted Precision are accompanied by higher values of weighted Recall. These imply that the model is able to correctly predict a considerable number of samples from different classes. The only exception is represented by the lower quantile q_−1, which has a lower class-specific Recall.
Looking at Table 7, it is possible to note higher MCC and Cohen's Kappa values for the considered model, suggesting a relatively structured agreement between predictions and the ground truth, as described in Table 8. This makes the Attention-LSTM model statistically equivalent to the state-of-the-art models (namely the CNN-LSTM and MLP). Our analysis changes significantly when considering H_Δτ | Δτ = 50. A marked decrease in all performance metrics is observed. The greatest impact is on the upper quantile, where the model is less capable of performing correct predictions. All these considerations strongly impact the Matthews Correlation Coefficient, which has a value lower than the one for H_Δτ | Δτ = 10. The last analysis concerns the results obtained for H_Δτ | Δτ = 100. For this horizon, the model yields more balanced performances in terms of correctly predicted samples for the extreme quantiles q_−1 and q_+1, while the greatest impact in terms of performance is on q_0. It is relevant to highlight the high number of misclassifications of the central quantile q_0 in favour of the upper one, q_+1. Analysing the results in Tables 8 and 9, it is possible to note that there is no reason to place the current experiment in the same cluster as the Shallow LSTM, as they are never statistically equivalent. It is clear that for the different horizons H_Δτ | Δτ ∈ {10, 50, 100} the model shows completely different behaviours. For horizon H_Δτ | Δτ = 10 it is statistically equivalent to the state-of-the-art methods which are described in the next paragraph, but for H_Δτ | Δτ ∈ {50, 100} there is no similarity to these models. This is the reason why the Attention-LSTM model is placed at the intersection of all clusters.

General performances for the CNN-LSTM and MLP models are comparable to the ones for horizon H_Δτ | Δτ = 10 in the Shallow LSTM model. The difference with the Shallow LSTM experiment, making the CNN-LSTM and MLP models state-of-the-art, must be sought in their ability to maintain stable, high performances throughout horizons H_Δτ. Here too, the higher values of weighted Precision and Recall for both the considered models indicate the ability to correctly classify a significant number of samples associated with different target classes. Such an ability not only concerns higher-level performance metrics, but is reflected in fine-grained per-class performance metrics as well. For the different horizons H_Δτ | Δτ ∈ {10, 50, 100}, homogeneous class-specific Precision and Recall are observed. The greater ability of these two models to correctly predict test samples is also shown by their highest values of the Matthews Correlation Coefficient and Cohen's Kappa, indicating a better overlap between predictions and the ground truth. A statistical equivalence between these models throughout metrics and horizons arises from the results presented in Tables 8 and 9.

Table 8: Ranking representation of results from the Bayesian correlated t-test [4] based on the MCC performance metric. Models on the same line indicate statistical equivalence and models in lower rows perform worse (statistically significantly) than the ones in the upper rows.
H_Δτ | Δτ = 10:
1. Multilayer Perceptron, CNN-LSTM
2. Self-Attention LSTM
3. Shallow LSTM, Multinomial Logistic Regression
4. Random Model

H_Δτ | Δτ = 50:
1. Multilayer Perceptron, CNN-LSTM
2. Shallow LSTM, Multinomial Logistic Regression
3. Self-Attention LSTM
4. Random Model

H_Δτ | Δτ = 100:
1. Multilayer Perceptron, CNN-LSTM
2. Multinomial Logistic Regression
3. Self-Attention LSTM
4. Shallow LSTM, Random Model
Table 9: Ranking representation of results from the Bayesian correlated t-test [4] based on the weighted F-measure performance metric. Models on the same line indicate statistical equivalence and models in lower rows perform worse (statistically significantly) than the ones in the upper rows.

H_Δτ | Δτ = 10:
1. Multilayer Perceptron, CNN-LSTM, Self-Attention LSTM
2. Shallow LSTM, Multinomial Logistic Regression
3. Random Model

H_Δτ | Δτ = 50:
1. Multilayer Perceptron, CNN-LSTM
2. Shallow LSTM
3. Multinomial Logistic Regression
4. Self-Attention LSTM
5. Random Model

H_Δτ | Δτ = 100:
1. Multilayer Perceptron
2. CNN-LSTM, Multinomial Logistic Regression
3. Shallow LSTM, Self-Attention LSTM
4. Random Model
7 Conclusions

In the present work, different Deep Learning models from the literature are applied to the task of price return forecasting in financial markets based on the Limit Order Book. LOBSTER data is used to train and test the Deep Learning models, which are then analysed in terms of results, similarities between the models, performance ranking and dynamics-based model embedding. Hypotheses regarding the nature of the Limit Order Book and its dynamics are then discussed on the basis of model performances and similarities. The three main contributions of the present work are summarised hereafter and directions for future work are suggested.

The Multinomial Logistic Regression model is solid throughout horizons H_Δτ, but lacks the ability to produce complex features which could improve its performance, as well as any explicit dynamics modeling. Not all complex architectures are superior, though: the shallow LSTM model, which incorporates the temporal dimension alone, does not significantly outperform the Logistic Regression and yields a decrease in predictive power for longer horizons (i.e. H_Δτ | Δτ ∈ {50, 100}). The time dimension, upon which recurrent models are based, is also exploited by the Self-Attention LSTM model, which is augmented by the Self-Attention module. This allows the model to consider the whole context while calculating the relevance of specific components. This shrewdness guarantees state-of-the-art performances (in line with the CNN-LSTM and MLP) for short-range horizons (i.e. H_Δτ | Δτ = 10).

This result then leads to the second consideration. It is clear that multiple levels of complexity in terms of return prediction exist in an LOB. There are at least two levels of complexity. The first one relates to short-range predictions (i.e. horizon H_Δτ | Δτ = 10). It is time dependent and can be well predicted by statically considering spatial dynamics which can be found in the immediate history (the context) related to the LOB state at tick time t. The second level of complexity is related to longer-range predictions (i.e. horizons H_Δτ | Δτ ∈ {50, 100}), and multiple dimensions (temporal and spatial) must be taken into account to interpret it. The CNN-LSTM model, which explicitly models both dynamics, seems to penetrate deeper LOB levels of complexity, guaranteeing stable and more accurate predictions for longer horizons too. This finding, combined with the results previously discussed, would lead to the assertion that space and time could be building blocks of the LOB's inner structure. This hypothesis is, though, partially contradicted by the statistically equivalent performance of the Multilayer Perceptron.

The last consideration hence follows from this. We observe that a simple Multilayer Perceptron, which does not explicitly model temporal or spatial dynamics, yields statistically equivalent results to the CNN-LSTM model, the current state-of-the-art. According to these results, it is possible to conclude that both time and space are good approximations of the underlying LOB's dimensions for the different prediction horizons H_Δτ, but they should not be considered the real, necessary underlying dimensions ruling this entity and hence the market.

The present work has demonstrated how Deep Learning can serve a theoretical and interpretative purpose in financial markets. Future works should further explore Deep Learning-based theoretical investigations of financial markets and trading dynamics.
The upcoming paper by the authors of this work presents an extension of the concepts in this work to Deep Reinforcement Learning and calls for further theoretical agent-based work in the field of high frequency trading and market making.

Acknowledgements

The authors acknowledge Riccardo Marcaccioli for the useful discussions and suggestions. AB acknowledges support from the European Erasmus+ Traineeship Program. TA and JT acknowledge support from the EC Horizon 2020 FIN-Tech project and EPSRC (EP/L015129/1). TA acknowledges support from ESRC (ES/K002309/1), EPSRC (EP/P031730/1) and EC (H2020-ICT-2018-2 825215).
References

[1] F. Abergel, M. Anane, A. Chakraborti, A. Jedidi, and I. Toke. Limit Order Books. Physics of Society: Econophysics and Sociophysics. Cambridge University Press, 2016.
[2] E. Alpaydın. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8):1885–1892, 1999.
[3] R. Artstein and M. Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, 2008.
[4] A. Benavoli, G. Corani, J. Demšar, and M. Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. The Journal of Machine Learning Research, 18(1):2653–2688, 2017.
[5] J. O. Berger and T. Sellke. Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397):112–122, 1987.
[6] J.-P. Bouchaud, J. D. Farmer, and F. Lillo. How markets slowly digest changes in supply and demand. In Handbook of Financial Markets: Dynamics and Evolution, pages 57–160. Elsevier, 2009.
[7] A. Candel, V. Parmar, E. LeDell, and A. Arora. Deep learning with H2O. H2O.ai Inc., 2016.
[8] F. Chollet. Deep Learning with Python. Manning, 2017.
[9] F. Chollet et al. Keras. https://keras.io, 2015.
[10] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[11] G. Corani and A. Benavoli. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Machine Learning, 100(2-3):285–304, 2015.
[12] J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65, 2008.
[13] J. Dickey. Scientific reporting and personal probabilities: Student's hypothesis. Journal of the Royal Statistical Society: Series B (Methodological), 35(2):285–305, 1973.
[14] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[15] W. Edwards, H. Lindman, and L. J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70(3):193, 1963.
[16] S. Ganesh, N. Vadori, M. Xu, H. Zheng, P. Reddy, and M. Veloso. Reinforcement learning for market making in a multi-agent dealer market. arXiv preprint arXiv:1911.05892, 2019.
[17] J. Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Computational Biology and Chemistry, 28(5-6):367–374, 2004.
[18] W. H. Greene. Econometric Analysis. Pearson Education India, 2003.
[19] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, et al. Design of the 2015 ChaLearn AutoML challenge. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2015.
[20] I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag, A. Statnikov, W. Tu, and E. Viegas. Analysis of the AutoML challenge series 2015-2018. In AutoML, Springer series on Challenges in Machine Learning, 2019.
[21] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2004.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[23] R. Huang and T. Polak. LOBSTER: Limit order book reconstruction system. Available at SSRN 1977207, 2011.
[24] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.
[25] M. Kearns and Y. Nevmyvaka. Machine learning for market microstructure and high frequency trading. High Frequency Trading: New Realities for Traders, Markets, and Regulators, 2013.
[26] J. D. Kelleher, B. Mac Namee, and A. D'Arcy. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press, 2015.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
[29] X. Li, J. Saude, P. Reddy, and M. Veloso. Classifying and understanding financial data using graph neural network. 2020.
[30] Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
[31] L. Mosley. A balanced approach to the multi-class imbalance problem. 2013.
[32] C. Nadeau and Y. Bengio. Inference for the generalization error. In Advances in Neural Information Processing Systems, pages 307–313, 2000.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[34] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328, 1997.
[35] J. Sirignano and R. Cont. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019.
[36] J. A. Sirignano. Deep learning for limit order books. Quantitative Finance, 19(4):549–570, 2019.
[37] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), volume 1, pages 7–12. IEEE, 2017.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[39] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[40] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 2016.
[41] Z. Zhang, S. Zohren, and S. Roberts. BDLOB: Bayesian deep convolutional neural networks for limit order books. arXiv preprint arXiv:1811.10041, 2018.
[42] Z. Zhang, S. Zohren, and S. Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, June 2019.
[43] Z. Zhang, S. Zohren, and S. Roberts. Extending deep learning models for limit order books to quantile regression. arXiv preprint arXiv:1906.04404, 2019.