Deep Learning modeling of Limit Order Book: a comparative perspective
Antonio Briola
Department of Computer Science, University of Milano-Bicocca, Milano, Italy
[email protected]
Jeremy Turiel
Department of Computer Science, UCL, London, United Kingdom
[email protected]
Tomaso Aste
Department of Computer Science, UCL, London, United Kingdom
and Systemic Risk Centre, London School of Economics, London, United Kingdom
[email protected]
August 14, 2020

Abstract
The present work addresses theoretical and practical questions in the domain of Deep Learning for High Frequency Trading, with a thorough review and analysis of the literature and state-of-the-art models. Random models, Logistic Regressions, LSTMs, LSTMs equipped with an Attention mask, CNN-LSTMs and MLPs are compared on the same tasks, feature space, and dataset and clustered according to pairwise similarity and performance metrics. The underlying dimensions of the modeling techniques are hence investigated to understand whether these are intrinsic to the Limit Order Book's dynamics. It is possible to observe that the Multilayer Perceptron performs comparably to or better than state-of-the-art CNN-LSTM architectures, indicating that dynamic spatial and temporal dimensions are a good approximation of the LOB's dynamics, but not necessarily the true underlying dimensions.

Keywords Artificial Intelligence · Deep Learning · Machine Learning · Market Microstructure · Econophysics · Financial Markets
1 Introduction

Recent years have seen the growth and spreading of Deep Learning methods across several domains. In particular, Deep Learning is increasingly applied to the domain of Financial Markets as well, but these activities are mostly performed in industry and the academic literature to date is scarce. The present work builds upon the more general Deep Learning literature to offer a comparison between models applied to High Frequency markets. Insights about Market Microstructure are then derived from the features and performance of the models.

The Limit Order Book (LOB) represents the venue where buyers and sellers interact in an order-driven market. It summarises a collection of intentions to buy or sell integer multiples of a base unit volume v (lot size) at price p. The set of available prices {p_1, ..., p_n} is discrete with a basic unit step θ (tick size). The LOB is a self-organising complex process where a transaction price emerges from the interaction of a multitude of agents in the market. These agents interact through the submission of a range of order types in the market. Figure 1 provides a visual representation of the LOB, its components and features.

Figure 1: Schematic representation of the LOB structure. It is possible to distinguish between the bid side (left) and the ask side (right), where both are organised into levels. The first level contains the best bid-price and the best ask-price, respectively. Since the market's goal is to facilitate the matching of intentions from buyers and sellers, the best bid-price is defined as the maximum proposed bid-price, while the best ask-price is defined as the minimum proposed ask-price. The distance between best bid-price and best ask-price is commonly referred to as the bid-ask spread. The mid-price is defined as the mean between best bid-price and best ask-price. The lower (higher) the bid-price (ask-price) at which limit orders are submitted, the deeper the level at which they are placed. The cumulative volume of buy and sell limit orders determines the market depth. In order-driven markets, the priority of orders to be matched at each price level depends upon the arrival time, according to a FIFO (First In, First Out) rule [1].

Three main categories of orders exist: market orders, limit orders and cancellation orders. Market orders are executed at arrival by paying a higher execution cost which originates from crossing the bid-ask spread. Limit orders make up the liquidity of the LOB at different price levels and constitute an expression of the intent to buy or sell a given quantity v_p at a specific price p. These entail lower transaction costs, with the risk of the order not being fulfilled. Cancellation orders are used to partially or fully remove limit orders which have not been filled yet.

The study of order arrival and of the dynamics of the Limit Order Book and of order-driven markets has seen a growing interest in the academic literature as well as in industry. This sparked from the almost simultaneous spreading of electronic trading and high frequency trading (HFT) activity throughout global markets. The resulting increase in the frequency of trading activity has generated a growing amount of trading data, thereby creating the critical mass for Big Data applications. The availability of Big Data from High Frequency Trading has then made it possible to apply the data hungry Machine Learning and Deep Learning methods to financial markets.
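To fix ideas about the quantities defined above (best quotes, bid-ask spread, mid-price), the following toy Python snapshot may help; all numbers are invented and the dictionaries are a deliberately simplified stand-in for a real book with FIFO queues at each level:

```python
# Toy LOB snapshot (invented numbers): each side maps price -> resting volume.
bids = {100.02: 300, 100.01: 500, 100.00: 200}   # buy intentions, 3 levels
asks = {100.04: 250, 100.05: 400, 100.06: 150}   # sell intentions, 3 levels

best_bid = max(bids)                  # highest proposed bid-price
best_ask = min(asks)                  # lowest proposed ask-price
spread = best_ask - best_bid          # bid-ask spread
mid_price = (best_bid + best_ask) / 2.0

print(f"spread = {spread:.2f}, mid-price = {mid_price:.3f}")
```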
Machine Learning methods were initially adopted by hedge funds towards the end of the last century, while now it is possible to see large quantitative firms and leading investment banks openly applying AI methods. Building upon this growing interest, an increasing number of papers and theses exploring Machine Learning and Deep Learning methods applied to financial markets are being written. This is part of the modern trend where large companies lead research fields in AI due to the availability of computational and monetary resources [16, 29]. This often results in a literature dominated by increasingly complex and task-specific model designs, often conceived with an applicative approach and without an in-depth analysis of the theoretical implications of the obtained results.

In light of this, for the present work, the relevant literature has been screened in search of state-of-the-art models for price movement forecasting in high frequency trading. Increasingly complex models from this literature are presented and characterised, and their results are compared on the same training and test sets. The theoretical implications of these results are investigated and the models are compared for statistically validated similarity. The purpose of this analysis is to reason about why certain models should or should not be used for this application, as well as to verify whether more complex architectures are truly necessary. One example consists in the study of temporal and spatial dimensions (implied by Recurrent Neural Network (RNN) [30], Attention [38] and Convolutional Neural Network (CNN) [28] models, respectively) and whether they are unique and optimal representations of the Limit Order Book. The alternative hypothesis consists in a Multilayer Perceptron (MLP) which does not explicitly model any dynamic.

The Deep Learning models described in Section 2 incorporate assumptions about the structure and relations in the data, as well as about how it should be interpreted. As reported above, this is the case of CNNs, which exploit the relations between neighbouring elements in their grid-like input. Analogous considerations can be made for RNN-like models, which are augmented by edges between consecutive observation embeddings. These types of structures aim to carry information across a series, hence implying sequential relations between inputs. These tailored architectures are used to test hypotheses about the existence and informativeness of corresponding dimensional relations in the Limit Order Book.

The diffusion of flexible and extensible frameworks such as Weka [40] and Keras [9], and the success of automated Machine Learning tools such as AutoML [20] and H2O AutoML [7], is facilitating the abovementioned industry-driven applicative approach. Unfortunately, much less attention is given to methods for model comparison able to assess the real improvement brought by these new models [12]. In order to make the work presented here more reliable and to promote a more thorough analysis of published works, a statistical comparison between models is provided through the state-of-the-art statistical approaches described in [4]. It is crucial for scientifically robust results to validate the significance of model performance and of the potential performance improvements brought by novel architectures and methodologies. The main classes of methods for significance testing are Frequentist and Bayesian.
Null model-based Frequentist methods are not ideal for this application, as the detected statistical significance might have no practical impact on performance. The need to answer questions about the likelihood of one model performing significantly better than another, based on experiments, requires the use of posterior probabilities, which can be obtained from the Bayesian methods in [15, 13, 5].

This paper is organised in the following sections: Section 2 presents a review of the relevant literature which motivates the study performed in the current work and presents the assumptions upon which it is built. Section 3 briefly describes the data used throughout the experiments. Section 4 provides an exhaustive description of the experiments conducted. Section 5 presents and analyses the results and Section 7 concludes the work with ideas for further research efforts.

2 Literature Review

The review by Bouchaud et al. [6] offers a thorough introduction to Limit Order Books, to which we refer the interested reader. As discussed in Section 1, the growth of electronic trading has sparked interest in Deep Learning applications to order-driven markets and Limit Order Books. The work by Kearns and Nevmyvaka [25] presents an overview of Machine Learning and Reinforcement Learning applications to market microstructure data and tasks, including return prediction from Limit Order Book states. The first attempt to produce an extensive analysis of Deep Learning-based methods for stock price prediction based upon the Limit Order Book was made by Tsantekidis et al. [37]. Starting from a horse racing-type comparison between classical Machine Learning approaches (e.g. Support Vector Machines) and more structured Deep Learning ones, they then considered the possibility of applying CNNs to detect anomalous events in financial markets and take profitable positions. In the last two years, a few works applying a variety of Deep Learning models to LOB-based return prediction were published by the research group of Stephen Roberts. The first one, to the best of our knowledge, applied Bayesian (Deep Learning) Networks to the Limit Order Book [41], followed by an augmentation of the labelling system as quantiles of returns and an adaptation of the modeling technique to this [43]. The most recent work introduces the current state-of-the-art modeling architecture combining CNNs and LSTMs to delve deeper into the complex structure of the Limit Order Book. The work by Sirignano and Cont [35] provides a more theoretical approach: it tests and compares multiple Deep Learning architectures for return forecasting based on order book data and shows how these models are able to capture and generalise to universal price formation mechanisms throughout assets.

The models used in the above works were originally defined in the Machine and Deep Learning literature and are summarised below. Multinomial Logistic Regression is used as a baseline for this work and consists in a linear combination of the inputs mapped through a logit activation function, as defined in [18]. Feedforward Neural Networks (or Multilayer Perceptrons) are defined in [21] and constitute the general framework to represent non-linear function mappings between a set of input variables and a set of output variables. Recurrent Neural Networks (RNNs) [30] are then considered in the form of Long Short-Term Memory models (LSTMs) [22]. RNNs constitute an evolution of MLPs. They introduce the concept of sequentiality into the model by including edges which span adjacent time steps.
RNNs suffer from the issue of vanishing gradients when carrying information over a large number of time steps. LSTMs solve this issue by replacing nodes in the hidden layers with self-connected memory cells of unit edge weight, which allow information to be carried on without vanishing or exploding gradients. LSTMs hence owe their name to the ability to retain information through a long sequence. The addition of Attention mechanisms [38] to MLPs helps the model to focus on the more relevant regions of the input data in order to make predictions. Self-Attention extends the parametric flexibility of global Attention mechanisms by introducing an Attention mask that is no longer fixed, but a function of the input. The last kind of Deep Learning unit considered are Convolutional Neural Networks (CNNs), designed to process data with a grid-like topology. These units serve as feature extractors, thus learning feature representations of their inputs [28].

A considerable body of literature about the comparison of different models has been produced, despite not being vastly applied by the Machine Learning community. The first attempts at formalisation were made by Dietterich [14] and Salzberg [34], and refined by Nadeau and Bengio [32] and Alpaydın [2]. A comprehensive review of all these methods and of classical statistical tests for Machine Learning is presented in [24]. A crucial point of view is provided by the work in [12]. More recently, starting from the work by Corani and Benavoli [11], a Bayesian approach to statistical testing was proposed to replace classical approaches based on the null hypothesis. The proposed new ideas found a complete definition in [4].
3 Data

All the experiments presented in this paper are based on the LOBSTER [23] dataset, which provides a highly detailed, event-by-event description of all micro-scale market activities for each stock listed on the NASDAQ exchange. LOBSTER is one of the data providers featured in major publications and journals in this field, such as Quantitative Finance. LOB datasets are provided for each security on the NASDAQ. The dataset lists every market order arrival, limit order arrival and cancellation that occurs on the NASDAQ platform between 09:30 am and 04:00 pm on each trading day. Trading does not occur on weekends or public holidays, so these days are excluded from all the analyses performed. A tick size of θ = $0.01 is adopted. Depending on the type of the submitted order, executions can occur at a lower cost than crossing the full bid-ask spread. This is the case of hidden orders which, when revealed, appear at a price equal to the notional mid-price at the time of execution.

LOBSTER [23] data are structured into two different files:
• The message file lists every market order arrival, limit order arrival and cancellation that occurs.
• The orderbook file describes the market state (i.e. the total volume of buy or sell orders at each price) immediately after the corresponding event occurs.

Experiments described in the next few sections are performed only using the orderbook files. The training dataset consists of Intel Corporation's
(INTC) LOB data from 04-02-2019 to 31-05-2019, corresponding to a total of 82 files, while the test dataset consists of Intel Corporation's LOB data from 03-06-2019 to 28-06-2019, obtained from 20 other files. All the experiments presented in the current work are conducted on snapshots of the LOB with a depth (number of tick size-separated limit order levels per side of the Order Book) of 10. This means that each row in the orderbook files corresponds to a vector of length 40. Each row is structured as

$[(p, v)_a^{(1)}, (p, v)_b^{(1)}, (p, v)_a^{(2)}, (p, v)_b^{(2)}, \dots, (p, v)_a^{(10)}, (p, v)_b^{(10)}]$,    (1)

where (p, v) represents the price level and corresponding liquidity tuple, and {a, b} distinguish ask and bid levels progressively further away from the best ask and best bid.
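As an illustration of this layout, a minimal NumPy sketch (not the authors' code; the row values are placeholders) unpacks one such 40-dimensional vector:

```python
import numpy as np

# One orderbook-file row (Equation 1): 40 values = 10 levels x 2 sides x (price, volume).
row = np.arange(40, dtype=float)      # stand-in for one LOBSTER orderbook row
levels = row.reshape(10, 2, 2)        # axes: (level, {ask, bid}, [price, volume])

ask_prices, ask_volumes = levels[:, 0, 0], levels[:, 0, 1]
bid_prices, bid_volumes = levels[:, 1, 0], levels[:, 1, 1]
mid_price = (ask_prices[0] + bid_prices[0]) / 2.0   # mean of the best quotes
```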
4 Methods

4.1 Labelling

Price log-returns for the target labels are defined at three distinct time horizons H_Δτ. In order to account for price volatility and discount long periods of stable and noisy order flow, the time delay Δτ between the LOB observation (input) and the target label return is defined as follows. Given a series of mid-prices at consecutive ticks

$p_{m,0}, p_{m,1}, \dots, p_{m,n}$,    (2)

where the mid-price is defined as the mean between the best bid and best ask price, the series of log-returns is

$r_{m,0}, r_{m,1}, \dots, r_{m,n-1}$,    (3)

where

$r_{m,k} = \log p_{m,k+1} - \log p_{m,k}$.    (4)

The number of non-zero log-returns in the series is hence counted as

$\Delta\tau = \sum_{k=0}^{n-1} \Theta(|r_{m,k}|)$,    (5)

where Θ is the Heaviside step function defined below:

$\Theta(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$.    (6)

4.2 Preprocessing

The data described in Section 3 are preprocessed as follows:
• The target labels for the prediction task aim to categorise the return at three different time horizons H_Δτ | Δτ ∈ {10, 50, 100}. In order to perform the mapping from continuous variables into discrete classes, the quantile levels (0.00, 0.25, 0.75, 1.00) are computed on the returns distribution of the training set and then applied to the test set. These quantiles are mapped onto classes, denoted with (q_−1, q_0, q_+1), as reported in Figure 2.

Figure 2: Visual representation of the mapping between quantiles and corresponding classes. The quantiles' edges (i.e. (0.00, 0.25, 0.75, 1.00)) define three different intervals. Each specific class (i.e. q_−1, q_0, q_+1) corresponds to a specific interval.

• The training set input data (LOB states) are scaled within a (0, 1) interval with the min-max scaling algorithm [33]. The scaler's training phase is conducted by chunks to optimise the computational effort. The trained scaler is then applied to the test data.

Figure 3 reports the training and test set quantile distributions per horizon H_Δτ.

Figure 3: Training and test set quantile (q_−1, q_0, q_+1) distributions per horizon H_Δτ | Δτ ∈ {10, 50, 100} at the end of the preprocessing and labelling phase. The tables' entries, for both the training and the test set, report the exact number of samples per horizon, for each considered quantile.

It is possible to notice moderately balanced classes for both plots in Figure 3. Indeed, all classes lie within the same order of magnitude for all horizons H_Δτ | Δτ ∈ {10, 50, 100}, with the q_0 class being the most represented and q_+1 the least.

4.3 Random Model

The benchmark null model for this work is a generic random model, which does not handle any dynamics. For each sample in the test set and each horizon H_Δτ, the quantile label q_r is sampled from the uniform distribution over r ∈ {−1, 0, +1}. The SciPy [39] randint generator is used for this task.
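A minimal sketch of the labelling logic of Section 4.1 and of the random benchmark of Section 4.3 is given below; mids and train_returns are hypothetical array names, and this is an illustration rather than the authors' code:

```python
import numpy as np
from scipy.stats import randint

def quantile_labels(mids, train_returns):
    """Map log-returns to the classes (q_-1, q_0, q_+1)."""
    r = np.diff(np.log(mids))                     # log-returns, Equation (4)
    delta_tau = int(np.sum(np.abs(r) > 0))        # non-zero returns, Equations (5)-(6)
    lo, hi = np.quantile(train_returns, [0.25, 0.75])  # training-set quantile edges
    labels = np.where(r < lo, -1, np.where(r > hi, 1, 0))
    return labels, delta_tau

# The random benchmark: labels drawn uniformly from {-1, 0, +1}.
random_labels = randint.rvs(-1, 2, size=1000)
```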
4.4 Multinomial Logistic Regression

The baseline model is the multinomial Logistic Regression which, like the Random Model, does not explicitly model any dynamics in the data. Like binary Logistic Regression, the multinomial one adopts maximum likelihood estimation to evaluate the probability of categorical membership. It is also known as Softmax Regression and can be interpreted as a classical ANN, as per the definition in Table 1, with the input layer directly connected to the output layer through a softmax activation function:

$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.    (7)

The input is represented by the ten most recent LOB states, as per the definition in Equation 1 in Section 3. The Scikit-Learn [33] implementation is used and, in order to guarantee a fair comparison with the Deep Learning models in the next sections, the following parameters are set:
• max_iter (i.e. the maximum number of iterations taken for the solvers to converge) = 20.
• tol (i.e. the tolerance for stopping criteria).
• solver (i.e. the algorithm to use in the optimization problem) = sag, with the default L2 penalty.
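A minimal configuration sketch along the lines above follows; the toy training arrays are invented placeholders and the stopping tolerance is left at the library default:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins: ten flattened LOB states (400 features) per sample.
X_train = np.random.rand(1000, 400)
y_train = np.random.randint(-1, 2, size=1000)   # classes q_-1, q_0, q_+1

# Min-max scaling of the inputs into (0, 1), as in Section 4.2.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)

clf = LogisticRegression(
    multi_class="multinomial",  # softmax over the three quantile classes
    solver="sag",               # with the default L2 penalty
    max_iter=20,
)
clf.fit(X_scaled, y_train)
```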
Table 1: Multinomial Logistic Regression architectural scheme.

Logistic Regression
Input @ [1 × 400]
Dense @ 3 Units (activation softmax)

4.5 Multilayer Perceptron

The first Deep Learning model is a generic Multilayer Perceptron (MLP), which does not explicitly model temporal or spatial properties of the input data, but has the ability to model not explicitly defined dimensions through its hidden layers. Similarly to the two previously mentioned models, it does not explicitly handle any specific dimension. The MLP can be considered the most general form of universal approximator and it represents the ideal model to confirm or reject any hypothesis about the presence of a specific leading dimension in LOBs. The input is represented by the ten most recent LOB states (Equation 1, Section 3), with the ten states concatenated in a flattened shape. The MLP model is architecturally defined in Table 2.

Table 2: Multilayer Perceptron architectural scheme.
Multilayer Perceptron
Input @ [1 × 400]
Dense @ 512 Units
Dense @ 1024 Units
Dense @ 1024 Units
Dense @ 64 Units
Dense @ 3 Units (activation softmax)

4.6 Shallow LSTM

In order to explicitly handle the temporal dynamics of the system, a shallow LSTM model is tested [22]. The LSTM architecture explicitly models temporal and sequential dynamics in the data, hence providing insight on the temporal dimension of the data. As in all other RNN models, the structure of LSTMs enables the network to capture temporal dynamics by performing sequential predictions. The current state directly depends on the previous ones, meaning that the hidden states represent the memory of the network. Differently from classic RNN models, LSTMs are explicitly designed to overcome the vanishing gradient problem as well as to capture the effect of long-term dependencies. The input is here represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1. The LSTM layer consists of 20 units with a tanh activation function:

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)}$.    (8)

It is observed that the addition of LSTM units beyond the chosen level does not yield statistically significant performance improvements. Hence, the chosen number of LSTM units can be considered optimal and the least computationally costly. The model is architecturally defined in Table 3.

Table 3: Shallow LSTM architectural scheme.

Shallow LSTM
Input @ [10 × 40]
LSTM @ 20 Units
Dense @ 3 Units (activation softmax)
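As an illustration, a minimal Keras sketch of the MLP in Table 2 and of the shallow LSTM in Table 3 could look as follows; the ReLU activations in the MLP hidden layers are an assumption, as the tables do not specify them:

```python
from tensorflow.keras import layers, models

# Shallow LSTM of Table 3: 10 consecutive LOB states of 40 features each.
lstm = models.Sequential([
    layers.Input(shape=(10, 40)),
    layers.LSTM(20, activation="tanh"),        # Equation (8)
    layers.Dense(3, activation="softmax"),     # classes (q_-1, q_0, q_+1)
])

# MLP of Table 2: the ten LOB states flattened into a single vector.
mlp = models.Sequential([
    layers.Input(shape=(400,)),
    layers.Dense(512, activation="relu"),      # hidden activations assumed
    layers.Dense(1024, activation="relu"),
    layers.Dense(1024, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
```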
4.7 Self-Attention LSTM

As a point of contact between architectures which model temporal dynamics (LSTMs) and spatial modeling ones (CNNs), the LSTM described in Section 4.6 is enhanced by the introduction of a Self-Attention module [38]. By default, the Attention layer statically considers the whole context while computing the relevance of individual entries. Differently from what is described in Section 4.8, the input is not subject to any spatial transformation (e.g. convolutions). This difference implies a static nature of the detected behaviours over multiple timescales. The Self-Attention LSTM is architecturally defined in Table 4.

Table 4: Self-Attention LSTM architectural scheme.
Self-Attention LSTM
Input @ [10 × 40]
LSTM @ 40 Units
Self-Attention Module
Dense @ 3 Units (activation softmax)

The input to this model is represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1.
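Since the exact Attention implementation is not specified here, the following hedged Keras sketch uses the built-in dot-product Attention layer as a stand-in for the Self-Attention module of Table 4:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(10, 40))
seq = layers.LSTM(40, return_sequences=True)(inp)  # keep the full 10-step sequence
att = layers.Attention()([seq, seq])               # self-attention: query = value = seq
vec = layers.GlobalAveragePooling1D()(att)         # collapse the time axis
out = layers.Dense(3, activation="softmax")(vec)
self_attention_lstm = models.Model(inp, out)
```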
4.8 CNN-LSTM

The nature of the Deep Learning architectures in Sections 4.6 and 4.7 mainly focuses on modeling the temporal dimension of the inputs. Recent developments [42, 36] highlight the potential of the spatial dimension in LOB-based forecasting as a structural module allowing to capture dynamic behaviours over multiple timescales. In order to study the effectiveness of such an augmentation, the architecture described in [42] is reproduced as in Table 5 and adapted to the application domain described in the current work. The input is represented by a [10 × 40] matrix, where 10 is the number of consecutive history ticks and 40 is the shape of the LOB state defined in Equation 1. This model represents the state-of-the-art in terms of prediction potential at the time of writing.

Table 5: CNN-LSTM architectural scheme.
CNN-LSTM
Input @ [10 × 40]
Conv 1 × 2 @ 16 (stride = 1 × 2)
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Conv 1 × 2 @ 16 (stride = 1 × 2)
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Conv 1 × 10 @ 16
Conv 4 × 1 @ 16
Conv 4 × 1 @ 16
Inception @ 32
LSTM @ 64 Units
Dense @ 3 Units (activation softmax)
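A hedged Keras sketch of Table 5 follows; it mirrors the DeepLOB-style layout of [42], but the padding, the activation choices and the simplified two-path Inception block are assumptions rather than the authors' exact code:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(10, 40, 1))                        # 10 states x 40 features
x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(inp)  # pair (p, v)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(x)    # pair bid/ask
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (1, 10), activation="relu")(x)         # span the 10 remaining columns
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)

# Simplified Inception-style block @ 32 filters: parallel 1x1 and 3x1 paths.
a = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(x)
b = layers.Conv2D(32, (3, 1), padding="same", activation="relu")(x)
x = layers.Concatenate(axis=-1)([a, b])

x = layers.Reshape((10, -1))(x)                              # back to a 10-step sequence
x = layers.LSTM(64)(x)
out = layers.Dense(3, activation="softmax")(x)
cnn_lstm = models.Model(inp, out)
```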
4.9 Training scheme

For each of the Deep Learning models, the following training (and testing, see Section 4.10) procedure is applied:
• Training batches of 1024 samples are produced, with each sample made of 10 consecutive LOB states. LOB states are defined in Section 3 and in Equation 1.
• For each training epoch, batches are randomly chosen so that the total number of training samples is equivalent to one month of data. This sampling procedure ensures a good coverage of the entire dataset and allows to operate with a reduced amount of computational resources.
• Class labels are converted to their one-hot representation.
• The selected optimizer is Adam [27]. Its Keras [9] implementation is chosen and the default values of its hyperparameters are kept (lr = 0.001). The categorical cross-entropy loss function is chosen due to its suitability for multi-class classification tasks [8].
• From manual hyperparameter exploration it is observed that 30 training epochs are optimal when accounting for constraints on computational resources. It has been empirically observed that slight variations do not produce any significant improvement.
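A minimal sketch of this training scheme, reusing e.g. the lstm model sketched in Section 4.6 (the data arrays are invented placeholders):

```python
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# One toy batch of 1024 windows of 10 consecutive LOB states each.
X = np.random.rand(1024, 10, 40)
y = to_categorical(np.random.randint(0, 3, size=1024), num_classes=3)  # one-hot labels

lstm.compile(
    optimizer=Adam(learning_rate=0.001),   # Keras default hyperparameters
    loss="categorical_crossentropy",       # multi-class classification loss
    metrics=["accuracy"],
)
lstm.fit(X, y, batch_size=1024, epochs=30)
```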
4.10 Testing scheme

At the end of the training phase, the inducer for each model is queried on the test set as follows:
• The Keras [9] Time Series Generator is used to rearrange the test set, creating batches of test samples. Each sample is made of 10 consecutive LOB states (Equation 1).
• For each model and test fold, balanced Accuracy [31, 26, 19], weighted Precision, weighted Recall and weighted F-measure are computed. These metrics are weighted in order to correct for class imbalance and obtain unbiased indicators. The following individual class metrics are considered too: Precision, Recall and F-measure. Two multi-class correlation metrics between labels are also computed: the Matthews Correlation Coefficient (MCC) [17] and Cohen's Kappa [10, 3]. A computation sketch for these metrics follows the list.
• Performance metrics for each test fold are statistically compared through the Bayesian correlated t-test [11, 4]. One should note that the region of practical equivalence (rope), determining the negligible difference between performance metrics in different models, is arbitrarily set to a sensible value, due to the lack of examples in the literature. A sketch of this test is reported after Table 6.
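A minimal sketch of the per-fold metric computation with Scikit-Learn [33]; y_true and y_pred are invented stand-ins for one test fold:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, precision_recall_fscore_support)

y_true = np.random.randint(-1, 2, size=1000)   # ground-truth quantile labels
y_pred = np.random.randint(-1, 2, size=1000)   # one model's predictions

bal_acc = balanced_accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)  # class-imbalance corrected
mcc = matthews_corrcoef(y_true, y_pred)        # multi-class MCC [17]
kappa = cohen_kappa_score(y_true, y_pred)      # Cohen's Kappa [10]
```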
Table 6: A summary of the Deep Learning models and related dynamics.

Model                              Dimension
Random Model                       None
Multinomial Logistic Regression    None
Multilayer Perceptron              Not explicitly defined
Shallow LSTM                       Temporal
Self-Attention LSTM                Temporal + Spatial (static)
CNN-LSTM                           Temporal + Spatial (dynamic)
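For the statistical comparison itself, the following sketch implements the posterior of the Bayesian correlated t-test of Corani and Benavoli [11] as a Student-t with correlation-corrected variance; the correlation heuristic rho = n_test / (n_test + n_train) and the rope value below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def bayesian_correlated_ttest(diffs, rho, rope=0.01):
    """Posterior probabilities that model A is worse, practically equivalent,
    or better than model B, given per-fold metric differences diffs = A - B."""
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)
    # Student-t posterior with variance inflated by the fold correlation rho.
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diffs.var(ddof=1))
    post = stats.t(df=n - 1, loc=diffs.mean(), scale=scale)
    p_left = post.cdf(-rope)                   # P(B practically better)
    p_rope = post.cdf(rope) - post.cdf(-rope)  # P(practical equivalence)
    p_right = 1.0 - post.cdf(rope)             # P(A practically better)
    return p_left, p_rope, p_right
```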
5 Results

Multinomial Logistic Regressions, MLPs, LSTMs, LSTMs with Attention and CNN-LSTMs are trained to predict the return quantile q at the different horizons H_Δτ | Δτ ∈ {10, 50, 100}. The dataset used for all models is defined in Section 3 and the metrics used to evaluate and compare out-of-sample model performances are introduced in Section 4.10. Out-of-sample performance metrics are reported in Table 7. We define three clusters that group models according to their performances. Specifically, in each one of these clusters it is possible to locate models that perform statistically equivalently throughout horizons H_Δτ, based on the MCC and weighted F-measure metrics described in Section 4.10. A representation of model clustering and ordering is presented in Figure 4. We tested similarities between the models' performances by means of the Bayesian correlated t-test and were able to assign the models to the relative clusters, or intersections of those, as per the representation in Figure 4.

Figure 4: Schematic clustering solution for the models presented in Section 4. Similarities between models' performances are tested using the Bayesian correlated t-test described by Benavoli et al. [4]. Models contained in the same cluster component perform statistically equivalently based on MCC and weighted F-measure metrics. A higher intensity in cluster shading indicates an increase in model performances.

Table 7: Performance metrics for horizons H_Δτ computed on the test folds. The column labels H10, H50 and H100 refer to H_Δτ | Δτ = 10, 50 and 100, respectively. Columns span the Random Model, Logistic Regression, Shallow LSTM, Self-Attention LSTM, CNN-LSTM and Multilayer Perceptron; rows report Balanced Accuracy, weighted Precision, weighted Recall, weighted F-measure, per-class Precision, Recall and F-measure for the quantiles [0, 0.25], [0.25, 0.75] and [0.75, 1], the MCC and Cohen's Kappa.
6 Discussion

This section builds upon the results in Section 5 and delves deeper into the analysis of model similarities and of individual dimension-based model performances. A better Deep Learning-based understanding of LOB dynamics and modeling applications emerges from this.

The first cluster component in Figure 4 is characterised by the Random Model, but it comprises the Logistic Regression and the Self-Attention LSTM at intersections with other clusters. Looking at Table 7, low values for weighted Precision and weighted Recall are observed for the Random Model. This means that the system, for each horizon H_Δτ, yields balanced predictions throughout classes and, by chance, a few of them are correct. Most of the correctly classified labels belong to the central quantile q_0, as shown by the higher values of its class-specific Precision. The second most correctly predicted quantile is the lower one, q_−1, while the worst overall performance is associated with the upper quantile q_+1. Given that these predictions are picked randomly from a uniform distribution, the obtained results reflect the test set class distribution. The MCC and Cohen's Kappa are both equal to zero, thereby confirming that the model performs in a completely random fashion.

It is difficult to assign the multinomial Logistic Regression to a cluster, as the results in Table 7 show that the model is solid throughout horizons H_Δτ but lacks the ability to produce complex features which could improve its performance, due to the absence of non-linearities and hidden layers. Because of this behaviour, it is placed at the overlap between the first two clusters. Looking at Tables 8 and 9, it is possible to note how the multinomial Logistic Regression consistently outperforms the Random Model. For longer horizons (i.e. H_Δτ | Δτ ∈ {50, 100}), the Logistic Regression outperforms more structured Deep Learning models designed to handle specific dynamics. Despite this result, the model is systematically unable to decode the signal related to the upper quantile, perhaps due to the slight class imbalance in the training set.

The second cluster is represented by the shallow LSTM model. In this case, weighted Precision and weighted Recall values for all three horizons H_Δτ are higher than for the Logistic Regression and the Random Model. This means that the proposed model is able to correctly predict a considerably higher number of samples from different classes, when compared to the previously considered architectures. Similarly to what is described in the previous paragraph, the model's predictions are well balanced between quantiles for horizons H_Δτ | Δτ ∈ {10, 50}. This observation is confirmed by higher values of the MCC and Cohen's Kappa for H_Δτ | Δτ ∈ {10, 50}, which indicate that an increasing number of predictions match the ground truth. Performance metrics collapse for the longer horizon H_Δτ | Δτ = 100 in the upper quantile q_+1.

The results achieved by the Attention-LSTM make assigning the model to one of the clusters in Figure 4 extremely difficult. As stated for class-specific performances in the shallow LSTM model, there is a strong relation between metrics and the considered horizon H_Δτ. For H_Δτ | Δτ = 10, higher values of weighted Precision are accompanied by higher values of weighted Recall. These imply that the model is able to correctly predict a considerable number of samples from different classes. The only exception is represented by the lower quantile q_−1, which has a lower class-specific Recall.
Looking at Table 7, it is possible to note higher MCC and Cohen's Kappa values for the considered model, suggesting a relatively structured agreement between predictions and the ground truth, as described in Table 8. This makes the Attention-LSTM model statistically equivalent to the state-of-the-art models (namely the CNN-LSTM and MLP). Our analysis changes significantly when considering H_Δτ | Δτ = 50. A marked decrease in all performance metrics is observed. The greatest impact is on the upper quantile, where the model is less capable of performing correct predictions. All these considerations strongly impact the Matthews Correlation Coefficient, which has a value lower than the one for H_Δτ | Δτ = 10. The last analysis concerns the results obtained for H_Δτ | Δτ = 100. For this horizon, the model yields more balanced performances in terms of correctly predicted samples for the extreme quantiles q_−1 and q_+1, while the greatest impact in terms of performance is on q_0. It is relevant to highlight the high number of misclassifications of the central quantile q_0 in favour of the upper one, q_+1. Analysing the results in Tables 8 and 9, it is possible to note that there is no reason to place the current experiment in the same cluster as the Shallow LSTM, as they are never statistically equivalent. It is clear that for the different horizons H_Δτ | Δτ ∈ {10, 50, 100} the model shows completely different behaviours. For horizon H_Δτ | Δτ = 10 it is statistically equivalent to the state-of-the-art methods which are described in the next paragraph, but for H_Δτ | Δτ ∈ {50, 100} there is no similarity to these models. This is the reason why the Attention-LSTM model is placed at the intersection of all clusters.

General performances for the CNN-LSTM and MLP models are comparable to the ones for horizon H_Δτ | Δτ = 10 in the Shallow LSTM model. The difference with the Shallow LSTM experiment, making the CNN-LSTM and MLP models state-of-the-art, must be sought in their ability to maintain stable, high performances throughout horizons H_Δτ. Here too, the higher values of weighted Precision and Recall for both the considered models indicate the ability to correctly classify a significant number of samples associated with different target classes. Such an ability not only concerns higher-level performance metrics, but is reflected in fine-grained per-class performance metrics as well. For the different horizons H_Δτ | Δτ ∈ {10, 50, 100}, homogeneous class-specific Precision and Recall are observed. The greater ability of these two models to correctly predict test samples is also shown by their highest values of the Matthews Correlation Coefficient and Cohen's Kappa, indicating a better overlap between predictions and the ground truth. A statistical equivalence between these models throughout metrics and horizons arises from the results presented in Tables 8 and 9.

Table 8: Ranking representation of results from the Bayesian correlated t-test [4] based on the MCC performance metric. Models on the same line indicate statistical equivalence and models in lower rows perform worse (statistically significantly) than the ones in the upper rows.
H_Δτ | Δτ = 10:
1. Multilayer Perceptron, CNN-LSTM
2. Self-Attention LSTM
3. Shallow LSTM, Multinomial Logistic Regression
4. Random Model

H_Δτ | Δτ = 50:
1. Multilayer Perceptron, CNN-LSTM
2. Shallow LSTM, Multinomial Logistic Regression
3. Self-Attention LSTM
4. Random Model

H_Δτ | Δτ = 100:
1. Multilayer Perceptron, CNN-LSTM
2. Multinomial Logistic Regression
3. Self-Attention LSTM
4. Shallow LSTM, Random Model
Table 9: Ranking representation of results from the Bayesian correlated t-test [4] based on the weighted F-measure performance metric. Models on the same line indicate statistical equivalence and models in lower rows perform worse (statistically significantly) than the ones in the upper rows.

H_Δτ | Δτ = 10:
1. Multilayer Perceptron, CNN-LSTM, Self-Attention LSTM
2. Shallow LSTM, Multinomial Logistic Regression
3. Random Model

H_Δτ | Δτ = 50:
1. Multilayer Perceptron, CNN-LSTM
2. Shallow LSTM
3. Multinomial Logistic Regression
4. Self-Attention LSTM
5. Random Model

H_Δτ | Δτ = 100:
1. Multilayer Perceptron
2. CNN-LSTM, Multinomial Logistic Regression
3. Shallow LSTM, Self-Attention LSTM
4. Random Model
7 Conclusions

In the present work, different Deep Learning models from the literature are applied to the task of price return forecasting in financial markets based on the Limit Order Book. LOBSTER data is used to train and test the Deep Learning models, which are then analysed in terms of results, similarities between the models, performance ranking and dynamics-based model embedding. Hypotheses regarding the nature of the Limit Order Book and its dynamics are then discussed on the basis of model performances and similarities. The three main contributions of the present work are summarised hereafter and directions for future work are suggested.

The Multinomial Logistic Regression model is solid throughout horizons H_Δτ, but lacks the ability to produce complex features which could improve its performance, as well as any explicit dynamics modeling. Not all complex architectures are superior, though: the shallow LSTM model, which incorporates the temporal dimension alone, does not significantly outperform the Logistic Regression and yields a decrease in predictive power for longer horizons (i.e. H_Δτ | Δτ ∈ {50, 100}). The time dimension, upon which recurrent models are based, is also exploited by the Self-Attention LSTM model, which is augmented by the Self-Attention module. This allows the model to consider the whole context while calculating the relevance of specific components. This shrewdness guarantees state-of-the-art performances (in line with the CNN-LSTM and MLP) for short-range horizons (i.e. H_Δτ | Δτ = 10).

This result then leads to the second consideration. It is clear that multiple levels of complexity in terms of return prediction exist in an LOB. There are at least two levels of complexity. The first one relates to short-range predictions (i.e. horizon H_Δτ | Δτ = 10). It is time dependent and can be well predicted by statically considering spatial dynamics which can be found in the immediate history (the context) related to the LOB state at tick time t. The second level of complexity is related to longer-range predictions (i.e. horizons H_Δτ | Δτ ∈ {50, 100}), and multiple dimensions (temporal and spatial) must be taken into account to interpret it. The CNN-LSTM model, which explicitly models both dynamics, seems to penetrate deeper LOB levels of complexity, guaranteeing stable and more accurate predictions for longer horizons too. This finding, combined with the results previously discussed, would lead to the assertion that space and time could be building blocks of the LOB's inner structure. This hypothesis is, though, partially contradicted by the statistically equivalent performance of the Multilayer Perceptron.

The last consideration hence follows from this. We observe that a simple Multilayer Perceptron, which does not explicitly model temporal or spatial dynamics, yields statistically equivalent results to the CNN-LSTM model, the current state-of-the-art. According to these results, it is possible to conclude that both time and space are good approximations of the underlying LOB's dimensions for the different prediction horizons H_Δτ, but they should not be considered the real, necessary underlying dimensions ruling this entity and hence the market.

The present work has demonstrated how Deep Learning can serve a theoretical and interpretative purpose in financial markets. Future works should further explore Deep Learning-based theoretical investigations of financial markets and trading dynamics.
The upcoming paper by the authors of this work presents an extension of the concepts in this work to Deep Reinforcement Learning and calls for further theoretical agent-based work in the field of high frequency trading and market making.

Acknowledgements

The authors acknowledge Riccardo Marcaccioli for the useful discussions and suggestions. AB acknowledges support from the European Erasmus+ Traineeship Program. TA and JT acknowledge support from the EC Horizon 2020 FIN-Tech project and EPSRC (EP/L015129/1). TA acknowledges support from ESRC (ES/K002309/1), EPSRC (EP/P031730/1) and EC (H2020-ICT-2018-2 825215).
References

[1] F. Abergel, M. Anane, A. Chakraborti, A. Jedidi, and I. Toke. Limit Order Books. Physics of Society: Econophysics and Sociophysics. Cambridge University Press, 2016.
[2] E. Alpaydın. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8):1885–1892, 1999.
[3] R. Artstein and M. Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, 2008.
[4] A. Benavoli, G. Corani, J. Demšar, and M. Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. The Journal of Machine Learning Research, 18(1):2653–2688, 2017.
[5] J. O. Berger and T. Sellke. Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397):112–122, 1987.
[6] J.-P. Bouchaud, J. D. Farmer, and F. Lillo. How markets slowly digest changes in supply and demand. In Handbook of Financial Markets: Dynamics and Evolution, pages 57–160. Elsevier, 2009.
[7] A. Candel, V. Parmar, E. LeDell, and A. Arora. Deep learning with H2O. H2O.ai Inc., 2016.
[8] F. Chollet. Deep Learning with Python. Manning, 2017.
[9] F. Chollet et al. Keras. https://keras.io, 2015.
[10] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[11] G. Corani and A. Benavoli. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Machine Learning, 100(2-3):285–304, 2015.
[12] J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65, 2008.
[13] J. Dickey. Scientific reporting and personal probabilities: Student's hypothesis. Journal of the Royal Statistical Society: Series B (Methodological), 35(2):285–305, 1973.
[14] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[15] W. Edwards, H. Lindman, and L. J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70(3):193, 1963.
[16] S. Ganesh, N. Vadori, M. Xu, H. Zheng, P. Reddy, and M. Veloso. Reinforcement learning for market making in a multi-agent dealer market. arXiv preprint arXiv:1911.05892, 2019.
[17] J. Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Computational Biology and Chemistry, 28(5-6):367–374, 2004.
[18] W. H. Greene. Econometric Analysis. Pearson Education India, 2003.
[19] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, et al. Design of the 2015 ChaLearn AutoML challenge. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2015.
[20] I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag, A. Statnikov, W. Tu, and E. Viegas. Analysis of the AutoML challenge series 2015-2018. In AutoML, Springer series on Challenges in Machine Learning, 2019.
[21] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2004.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[23] R. Huang and T. Polak. LOBSTER: Limit order book reconstruction system. Available at SSRN 1977207, 2011.
[24] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.
[25] M. Kearns and Y. Nevmyvaka. Machine learning for market microstructure and high frequency trading. High Frequency Trading: New Realities for Traders, Markets, and Regulators, 2013.
[26] J. D. Kelleher, B. Mac Namee, and A. D'Arcy. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press, 2015.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
[29] X. Li, J. Saude, P. Reddy, and M. Veloso. Classifying and understanding financial data using graph neural network. 2020.
[30] Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
[31] L. Mosley. A balanced approach to the multi-class imbalance problem. 2013.
[32] C. Nadeau and Y. Bengio. Inference for the generalization error. In Advances in Neural Information Processing Systems, pages 307–313, 2000.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[34] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328, 1997.
[35] J. Sirignano and R. Cont. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019.
[36] J. A. Sirignano. Deep learning for limit order books. Quantitative Finance, 19(4):549–570, 2019.
[37] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), volume 1, pages 7–12. IEEE, 2017.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[39] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[40] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 2016.
[41] Z. Zhang, S. Zohren, and S. Roberts. BDLOB: Bayesian deep convolutional neural networks for limit order books. arXiv preprint arXiv:1811.10041, 2018.
[42] Z. Zhang, S. Zohren, and S. Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, June 2019.
[43] Z. Zhang, S. Zohren, and S. Roberts. Extending deep learning models for limit order books to quantile regression. arXiv preprint arXiv:1906.04404, 2019.