Transformers for limit order books
James Wallbridge∗
March 3, 2020

∗ Correspondence to [email protected]
Abstract
We introduce a new deep learning architecture for predicting price movements from limit order books. This architecture uses a causal convolutional network for feature extraction in combination with masked self-attention to update features based on relevant contextual information. This architecture is shown to significantly outperform existing architectures such as those using convolutional networks (CNN) and Long Short-Term Memory (LSTM), establishing a new state-of-the-art benchmark for the FI-2010 dataset.
Introduction
Understanding high-frequency market micro-structure in time-series data such as limit order books (LOB) is complicated by a large number of factors including high-dimensionality, trends based on supply and demand, order creation and deletion around price jumps and the overwhelming relative percentage of order cancellations. It makes sense in this inherently noisy environment to take an agnostic approach to the underlying mechanisms inducing this behavior and to construct a network which learns to uncover the relevant features from raw data. This removes the bias contained in models using hand-crafted features and other market assumptions, such as those in the autoregressive models VAR [42] and ARIMA [1].

Arguably the most successful architecture used to extract features is the convolutional neural network [20], which makes use of translation equivariance, present in many domains including time-series applications. For time-series, however, further inductive biases prove to be beneficial. Convolutional neural networks with a causal temporal bias were introduced in [24] to encode long-range temporal dependencies in raw audio signals. There, convolutions are replaced by dilated causal convolutions controlled by a dilation rate. The dilation rate is the number of input values skipped by the filter, thereby allowing the network to act with a larger receptive field. In this work, features from our architecture will come from the output of multiple such dilated causal convolutional layers connected in series.

Once we have a collection of features, we would like to do computations with these learned representations to enable context-dependent updates. Historically, attention networks were introduced in [3] to improve existing long short-term memory (LSTM) [16, 15] models for neural machine translation by implementing a "soft search" over neighboring words, enabling the system to focus only on words relevant to the generation of the next target word. This early work combined attention with RNNs. Shortly after, CNNs were combined with attention in [39] and [6] for image captioning and question-answering tasks respectively.

In [37], self-attention was introduced as a stand-alone replacement for LSTMs on a wide range of natural language processing tasks, leading to state-of-the-art results [10, 26] which included masked word prediction. Introducing self-attention can be thought of as incorporating an inductive bias into the learning architecture to exploit relational structure in the task environment. This amounts to learning over a graph neural network [28, 5] where nodes are entities given by the learned features, which are then updated through a message passing procedure along edges. Results in various applications show that self-attention can better capture long-range dependencies than LSTMs [9]. More precisely, [37] introduced the transformer architecture, which consists of an encoder and a decoder for language translation. Both the encoder and the decoder contain the repetition of modules which we refer to as transformer blocks. Each transformer block consists of a multi-head self-attention layer followed by normalization, feedforward and residual connections. This is described in detail in the model architecture section below.

Combining transformer blocks with convolutional layers for feature extraction has proved powerful across a range of tasks.
In particular, for complex reasoning tasks in various strategic game environments, the addition of these transformer modules significantly enhanced performance and sample efficiency compared with existing non-relational baselines [40, 38, 8]. In this work we combine the causal convolutional architecture of [24] with multiple transformer blocks. Moreover, our transformer blocks contain masked multi-head self-attention layers. By applying a mask to our self-attention functions, we ensure that the ordering of events in our time-series is never violated at each step, i.e. entities can only attend to entities in their causal past.

We train and test our model on the publicly available FI-2010 dataset, which is a LOB of five instruments from the Nasdaq Nordic stock market over a ten day period [23]. We show that our algorithm outperforms other common and previously state-of-the-art architectures using standard model validation techniques.

In summary, inspired by the WaveNet architecture of [24] where dilated causal convolutions were used to encode long-range temporal dependencies, we use these causal convolutions to build a feature map for our transformer blocks to act on. We refer to our specific architecture as TransLOB. It is a composition of differentiable functions that process and integrate both local and global information from the LOB in a dynamic relational way whilst respecting the causal structure.

There are a number of advantages to our architecture beyond the significant increases in performance. Firstly, in spite of the O(N²) complexity of the self-attention component, our architecture is substantially more sample efficient than existing LSTM architectures for this task. Secondly, the ability to analyse attention distributions provides a clearer picture of internal computations within the model compared with these other methods, leading to better interpretability.

Related work
There is now a substantial literature applying deep neural networks to time-series applications, and in particular, limit order books (LOB). Convolutional neural networks (CNN) have been explored in LOB applications in [12, 34]. To capture long-range dependencies in temporal behavior, CNNs have been combined with recurrent neural networks (RNN), typically long short-term memory (LSTM), which improve on earlier results [36, 41]. Some modifications to the standard convolutional layer have been used in attempts to infer local interactions over different time horizons. For example, [41] uses an inception module [32] after the standard convolutional layers for this inference, followed by an LSTM to encode relational dynamics. Stand-alone RNNs have been used extensively in market prediction [11, 13, 4] and have been shown to outperform models based on standard multi-layer perceptrons, random forests and SVMs [35].

For time-series applications, recent work uses attention on its own [33, 25] and in combination with CNNs [18, 22, 29]. However, there are relatively few references which combine CNNs with transformers to analyse time-series data. We mention [30], which uses a CNN plus multi-head self-attention to analyse clinical time-series behaviour, and [21], which combines convolutions with self-attention for time-series forecasting.

Limit order books
A limit order book (LOB) at time t is the set of all active orders in a market at time t. These orders consist of two sides: the bid-side and the ask-side. The bid-side consists of buy orders and the ask-side consists of sell orders, both containing price and volume for each order. Our experiments will use the LOB from the publicly available FI-2010 dataset, the "MNIST" for limit order books. A general introduction to LOBs can be found in [14].

Let {p_a^i(t), v_a^i(t)} denote the price (resp. volume) of sell orders at time t at level i in the LOB. Likewise, let {p_b^i(t), v_b^i(t)} denote the price (resp. volume) of buy orders at time t at level i in the LOB. The bid price p_b(t) at time t is the highest stated price among active buy orders at time t. The ask price p_a(t) at time t is the lowest stated price among active sell orders at time t. A buy order is executed if p_b(t) > p_a(t) for the entire volume of the order. Similarly, a sell order is executed if p_a(t) < p_b(t) for the entire volume of the order.

The FI-2010 dataset is made up of 10 days of 5 stocks from the Helsinki Stock Exchange, operated by Nasdaq Nordic, consisting of 10 orders on each side of the LOB. Event types can be executions, order submissions, and order cancellations, and are non-uniform in time. We restrict to normal trading hours (no auction). The general structure of the LOB is contained in Table 1.

Event t       (p_a^1(t), v_a^1(t)) ... (p_a^10(t), v_a^10(t))   (p_b^1(t), v_b^1(t)) ... (p_b^10(t), v_b^10(t))
Event t+1     (p_a^1(t+1), v_a^1(t+1)) ... (p_a^10(t+1), v_a^10(t+1))   (p_b^1(t+1), v_b^1(t+1)) ... (p_b^10(t+1), v_b^10(t+1))
...           ...                                                        ...
Event t+10    (p_a^1(t+10), v_a^1(t+10)) ... (p_a^10(t+10), v_a^10(t+10))   (p_b^1(t+10), v_b^1(t+10)) ... (p_b^10(t+10), v_b^10(t+10))

Table 1: Structure of the limit order book.

The data is split into 7 days of training data and 3 days of test data. Preprocessing consists of normalizing the data x according to the z-score

    x̄_t = (x_t − ȳ) / σ_y

where ȳ (resp. σ_y) is the mean (resp. standard deviation) of the previous day's data. The dataset is available at https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649. Since the aim of this work is to extract as much of the latent information contained in the LOB as possible, we do not include any of the hand-crafted features contained in the FI-2010 dataset. For a detailed description of this dataset we refer the reader to [23].
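For concreteness, the short Python sketch below illustrates this day-wise z-score normalization. It is an illustrative reconstruction, not the paper's preprocessing code; the variable `days`, holding one array of raw LOB states per trading day, is a hypothetical input.

```python
import numpy as np

def zscore_by_previous_day(days):
    """Normalize each day's LOB states with the mean and standard deviation
    of the previous day's data, as in the z-score formula above."""
    normalized = []
    for prev_day, curr_day in zip(days[:-1], days[1:]):
        mean = prev_day.mean(axis=0)
        std = prev_day.std(axis=0)
        normalized.append((curr_day - mean) / std)
    return normalized  # the first day has no predecessor and is dropped here

# Hypothetical usage: ten days, each an [events, 40] array of prices and volumes.
days = [np.random.rand(1000, 40) * 100 for _ in range(10)]
normalized_days = zscore_by_previous_day(days)
```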
We aim to predict future movements of the (virtual) mid-price. The price direction of the data is calculated using the following smoothed version of the mid-price, which amounts to adjusting for the average volatility of each instrument. The virtual mid-price is the mean

    p(t) = (p_a(t) + p_b(t)) / 2

of the bid-price and the ask-price. The mean of the next k mid-prices is then

    m_{+k}(t) = (1/k) Σ_{n=0}^{k} p(t+n).

The direction of price movement for the FI-2010 dataset is calculated using the percentage change of the virtual mid-price according to

    r_k(t) = (m_{+k}(t) − p(t)) / p(t).

There exist other, more sophisticated methods to determine the direction of price movement at a given time. However, for fair comparison to other work, we utilize this definition and leave other methods for future work. The direction is up (+1) if r_k(t) > α, down (−1) if r_k(t) < −α and neutral (0) otherwise, according to a chosen threshold α. For the FI-2010 dataset this threshold is fixed, and k ∈ {10, 20, 50, 100} is used for the denoising horizon window. The 100 most recent events are used as input to our model.
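As an illustration of this labelling scheme, here is a minimal Python sketch that computes the three-class direction labels from a series of best bid and ask prices. It is not the paper's code; the input arrays and loop structure are assumptions made for exposition.

```python
import numpy as np

def direction_labels(best_ask, best_bid, k, alpha):
    """Three-class labels from the smoothed mid-price change r_k(t):
    +1 (up), 0 (neutral), -1 (down)."""
    p = (best_ask + best_bid) / 2.0                 # virtual mid-price p(t)
    T = len(p) - k                                  # last t with k future events available
    labels = np.zeros(T, dtype=int)
    for t in range(T):
        m_plus = p[t:t + k + 1].sum() / k           # m_{+k}(t) = (1/k) * sum_{n=0}^{k} p(t+n)
        r = (m_plus - p[t]) / p[t]                  # percentage change r_k(t)
        labels[t] = 1 if r > alpha else (-1 if r < -alpha else 0)
    return labels

# Hypothetical usage for horizon k = 10 and a small threshold alpha.
best_ask = 100.0 + np.cumsum(0.01 * np.random.randn(1000))
best_bid = best_ask - 0.05
labels = direction_labels(best_ask, best_bid, k=10, alpha=0.002)
```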
Model architecture

In this section we give a detailed account of our architecture. The two main components are a convolutional module and a transformer module. They contain multiple iterations of dilated causal convolutional layers and transformer blocks respectively. A transformer block consists of a specific combination of multi-head self-attention, residual connections, layer normalization and feedforward layers. We take the causal nature of the problem seriously by implementing causality both in the convolutional module and in the transformer module, through masked self-attention, to accurately capture temporal information in the LOB. Our resulting architecture will be referred to as TransLOB.

Since each order consists of a price and volume, a state x_t = {p_a^i(t), v_a^i(t), p_b^i(t), v_b^i(t)}_{i=1}^{10} at time t is a vector x_t ∈ R^40. Events are irregularly spaced in time and the 100 most recent events are used as input, resulting in a normalized input X ∈ R^{100×40}.

We apply five one-dimensional convolutional layers to the input X, regarded as a tensor of shape [100, 40] (i.e. an element of R^100 ⊗ R^40). All layers are dilated causal convolutional layers with 14 features, kernel size 2 and dilation rates starting at 1 and increasing across the five layers. The output is layer normalized and augmented with a positional encoding, giving a tensor X of shape [100, 15], with N = 100 the number of entities and d = 15 the model dimension. We denote these entities by e_i, 1 ≤ i ≤ N, where e_i ∈ E = R^d. These entities are then updated through learning in a number of steps.

First we introduce an inner product space H = R^d with dot product pairing ⟨h, h'⟩ = h · h'. We employ a multi-head version of self-attention with C channels. Therefore, we choose a decomposition H = H_1 ⊕ ... ⊕ H_C and apply a linear transformation

    T = T_1 ⊕ ... ⊕ T_C : E → H_1^{⊕3} ⊕ ... ⊕ H_C^{⊕3}

with each H_a of dimension d/C. The vectors (q_{i,(a)}, k_{i,(a)}, v_{i,(a)}) = T_a(e_i) are referred to as query, key and value vectors respectively. We arrange these vectors into matrices Q_a, K_a and V_a respectively, each with N rows and d/C columns. In other words, Q_a = X W_a^Q, K_a = X W_a^K and V_a = X W_a^V for weight matrices W_a^Q, W_a^K and W_a^V in R^{d × d/C}.

Next we apply the masked scaled dot-product self-attention function

    head_a = V'_a = Softmax( Mask( Q_a K_a^T / √d ) ) V_a

resulting in a matrix of refined value vectors for each entity. Here Mask replaces entries in the upper-right triangle of its argument with large negative values, which forces each query to attend only to keys in its causal past once the softmax is applied. The heads are then concatenated and a final learnt linear transformation is applied, leading to the multi-head self-attention operation

    MultiHead(X) = (head_1 ⊕ ... ⊕ head_C) W^O

where W^O ∈ R^{d×d}. We next add a residual connection and apply layer normalization, resulting in

    Z = LayerNorm(MultiHead(X) + X).

This is followed by a feedforward network MLP consisting of a ReLU activation between two affine transformations, applied identically to each position, i.e. individually to each row of Z. The inner layer is of dimension 4 × d = 60. Finally, a further residual connection and a final layer normalization are applied to arrive at our updated matrix of entities

    TransformerBlock(X) = LayerNorm(MLP(Z) + Z).

The output of the transformer block has the same shape [N, d] as the input.
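The following NumPy sketch mirrors the masked multi-head self-attention update above, with N = 100 entities, model dimension d = 15 and C = 3 heads. It is a minimal illustration with random weights rather than the trained model; the residual, layer normalization and MLP steps of the full transformer block are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: [N, d] entities. W_q, W_k, W_v: lists of C matrices of shape [d, d/C].
    W_o: [d, d] output projection. Returns MultiHead(X) of shape [N, d]."""
    N, d = X.shape
    causal_future = np.triu(np.ones((N, N), dtype=bool), k=1)  # entries with key index j > query index i
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each [N, d/C]
        scores = Q @ K.T / np.sqrt(d)                    # [N, N]
        scores = np.where(causal_future, -1e9, scores)   # Mask: hide the causal future
        heads.append(softmax(scores) @ V)                # head_a, shape [N, d/C]
    return np.concatenate(heads, axis=-1) @ W_o          # [N, d]

# Toy usage with random weights.
rng = np.random.default_rng(0)
N, d, C = 100, 15, 3
X = rng.normal(size=(N, d))
W_q = [rng.normal(size=(d, d // C)) for _ in range(C)]
W_k = [rng.normal(size=(d, d // C)) for _ in range(C)]
W_v = [rng.normal(size=(d, d // C)) for _ in range(C)]
W_o = rng.normal(size=(d, d))
out = masked_multi_head_self_attention(X, W_q, W_k, W_v, W_o)  # shape (100, 15)
```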
Our updated entities are e'_i ∈ R^d, 1 ≤ i ≤ N. After multiple iterations of the transformer block, the output is then flattened and passed through a feedforward layer of dimension 64 with ReLU activation and L2 regularization. Finally, we apply dropout followed by a softmax layer to obtain the final output probabilities. A schematic of the TransLOB architecture is given in Figure 1.

Figure 1: Architecture schematic with enclosed convolutional and transformer modules (input → dilated conv 1–5 → layer normalization → position encoding → transformer block 1–2 → MLP → dropout → linear → softmax → output).

For the FI-2010 dataset, we employ two transformer blocks with three heads each, with the weights shared between iterations of the transformer block. The hyperparameters are contained in Table 2. No dropout was used inside the transformer blocks.
Hyperparameter       Value
Batch size           32
Optimizer            Adam
Number of heads      3
Number of blocks     2
MLP activations      ReLU
Dropout rate         0.1
Table 2: Hyperparameters for the FI-2010 experiments.
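To make the pipeline concrete, the sketch below assembles a TransLOB-like model in Keras using the hyperparameters of Table 2. It is an illustrative reconstruction under stated assumptions rather than the author's implementation: the dilation schedule, the single concatenated positional channel, the L2 strength and the three-class output head are assumptions, and the causal mask relies on Keras' built-in `use_causal_mask` option rather than an explicit Mask function.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def make_transformer_block(d_model=15, num_heads=3, ff_dim=60):
    """One transformer block; the returned closure reuses the same layer
    instances, so calling it twice shares weights between the two blocks."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
    norm1 = layers.LayerNormalization()
    ff1 = layers.Dense(ff_dim, activation="relu")
    ff2 = layers.Dense(d_model)
    norm2 = layers.LayerNormalization()

    def block(x):
        z = norm1(attn(x, x, use_causal_mask=True) + x)  # masked self-attention + residual
        return norm2(ff2(ff1(z)) + z)                    # position-wise MLP + residual
    return block

def build_translob_like(seq_len=100, n_features=40):
    inp = layers.Input(shape=(seq_len, n_features))
    x = inp
    for rate in [1, 2, 4, 8, 16]:                        # assumed dilation schedule
        x = layers.Conv1D(14, kernel_size=2, dilation_rate=rate,
                          padding="causal", activation="relu")(x)
    x = layers.LayerNormalization()(x)

    def add_position_channel(t):                         # one extra channel so d = 14 + 1 = 15
        pos = tf.linspace(-1.0, 1.0, seq_len)
        pos = tf.tile(pos[None, :, None], [tf.shape(t)[0], 1, 1])
        return tf.concat([t, pos], axis=-1)
    x = layers.Lambda(add_position_channel)(x)

    block = make_transformer_block()
    x = block(block(x))                                  # two blocks with shared weights

    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)  # L2 strength assumed
    x = layers.Dropout(0.1)(x)
    out = layers.Dense(3, activation="softmax")(x)       # up / neutral / down
    return tf.keras.Model(inp, out)

model = build_translob_like()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Reusing the same layer instances for both calls of `block` mirrors the weight sharing between the two transformer blocks noted above; training would then proceed with mini-batches of size 32 and the Adam optimizer as described in the experiments below.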
Experiments

Here we record our experimental results for the FI-2010 dataset. The first 7 days were used to train the model and the last 3 days were used as test data. Training was done with mini-batches of size 32. Our metrics include accuracy, precision, recall and F1. All training was done using one K80 GPU on Google Colab.

To be consistent with earlier works using the same dataset, we train and test our model on the horizons k ∈ {10, 20, 50, 100}. All models were trained for 150 epochs, although convergence was achieved significantly earlier. See Figure 3 of Appendix A for an example.

The following models were used as comparison. An LSTM was utilized and compared to a support vector machine (SVM) and multi-layer perceptron (MLP) in [35] with favourable results. Results using a stand-alone CNN were reported in [34]. This model was reproduced and trained for use as our baseline for the horizon k = 100. The baseline training and test curves are shown in Figure 4 of Appendix A. In [36] a CNN was combined with an LSTM, resulting in the architecture denoted CNN-LSTM. An improvement over the CNN-LSTM architecture, named DeepLOB, was achieved in [41] by using an inception module between the CNN and LSTM together with a different choice of convolution filters, stride and pooling. Finally, the architecture C(TABL) refers to the best performing implementation of the temporal attention augmented bilinear network of [33].

Our results are shown in Table 3, Table 4, Table 5 and Table 6 for each of the horizon choices respectively. The training and test curves with respect to accuracy for k = 100 are shown in Figure 3 of Appendix A.

For inspection of our model, we plot the attention distributions for all three heads in the first transformer block. A random sample input was chosen from the horizon k = 10 test set. Pixel intensity has been scaled for ease of visualization. The vertical axes represent the query index 0 ≤ i ≤ 100 and the horizontal axes represent the key index 0 ≤ j ≤ 100.
Model           Accuracy   Precision   Recall   F1
SVM [35]        -          39.62       44.92    35.88
MLP [35]        -          47.81       60.78    48.27
CNN [34]        -          50.98       65.54    55.21
LSTM [35]       -          60.77       75.92    66.33
CNN-LSTM [36]   -          56.00       45.00    44.00
C(TABL) [33]    84.70      76.95       78.44    77.63
DeepLOB [41]    84.47      84.00       84.47    83.40
TransLOB

Table 3: Prediction horizon k = 10.

Model           Accuracy   Precision   Recall   F1
SVM [35]        -          45.08       47.77    43.20
MLP [35]        -          51.33       65.20    51.12
CNN [34]        -          54.79       67.38    59.17
LSTM [35]       -          59.60       70.52    62.37
CNN-LSTM [36]   -          -           -        -
C(TABL) [33]    73.74      67.18       66.94    66.93
DeepLOB [41]    74.85      74.06       74.85    72.82
TransLOB
Table 4: Prediction horizon k = 20.

Model           Accuracy   Precision   Recall   F1
SVM [35]        -          46.05       60.30    49.42
MLP [35]        -          55.21       67.14    55.95
CNN [34]        -          55.58       67.12    59.44
LSTM [35]       -          60.03       68.58    61.43
CNN-LSTM [36]   -          56.00       47.00    47.00
C(TABL) [33]    79.87      79.05       77.04    78.44
DeepLOB [41]    80.51      80.38       80.51    80.35
TransLOB
Table 5: Prediction horizon k = 50.

Model           Accuracy   Precision   Recall   F1
CNN [34]        63.06      63.29       63.06    62.97
TransLOB
Table 6: Prediction horizon k = 100.

The heads learn to attend to different properties of the temporal dynamics. A majority of the queries pay special attention to the most recent keys, which is sensible for predicting the next price movement. This is particularly clear in heads two and three.
Figure 2: First head of the first transformer block.
Conclusion

We have shown that the limit order book contains sufficient information to enable price movement prediction using deep neural networks with a causal and relational inductive bias. This was shown by introducing the architecture TransLOB, which contains both a dilated causal convolutional module and a masked transformer module. This architecture was tested on the publicly available FI-2010 dataset, achieving state-of-the-art results. We expect further improvements using more sophisticated proprietary additions such as the inclusion of sentiment information from news, social media and other sources. However, this work was developed to exploit only the information contained in the LOB and serves as a very strong baseline to which additional tools can be added.

Due to the limited nature of the FI-2010 dataset, significant time was spent tuning hyperparameters of our model to negate overfitting. In particular, our architecture was notably sensitive to the initialization. However, due to the very strong performance of the model, together with the flexibility and sensible inductive biases of the architecture, we expect robust results on larger LOB datasets. This is an important second step and will be addressed in future work. In particular, this will allow us to explore the generalization capabilities of the model together with the optimization of important parameters such as the horizon k and threshold α. Nevertheless, based on these initial results we argue that further investigation of transformer based models for financial time-series prediction tasks is warranted.

The efficiency of our algorithm is another important property which makes it amenable to training on larger datasets and LOB data with larger event windows. In spite of the O(N²) complexity of the self-attention component, our architecture is significantly more sample efficient than existing LSTM architectures for this task, such as [35, 36, 41]. However, moving far beyond the window size of 100, to the territory of LOB datasets on the scale of months or years, it would be interesting to explore sparse and compressed representations in the transformer blocks. Implementations of sparsity and compression can be found in [7, 31, 19, 21] and [17, 27] respectively.

Looking forward, similar to recent advances in natural language processing, the next generation of financial time-series models should implement self-supervision as pretraining [10, 26]. Finally, it would be interesting to consider the influence of higher-order self-attention [8] in LOB and other financial time-series applications.

Acknowledgements
The author would like to thank Andrew Royal and Zihao Zhang for correspondence relatedto this project.
References

[1] A. A. Ariyo, A. O. Adewumi, and C. K. Ayo. Stock price prediction using the ARIMA model. Pages 106–112. IEEE, 2014.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[4] W. Bao, J. Yue, and Y. Rao. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE, 12(7), 2017.
[5] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[6] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[7] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[8] J. Clift, D. Doryn, D. Murfet, and J. Wallbridge. Logic and the 2-simplicial transformer. In Proceedings of the International Conference on Learning Representations, 2020.
[9] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] M. Dixon. Sequence classification of the limit order book using recurrent neural networks. Journal of Computational Science, 24:277–286, 2018.
[12] J. Doering, M. Fairbank, and S. Markose. Convolutional neural networks applied to high-frequency market microstructure forecasting. Pages 31–36. IEEE, 2017.
[13] T. Fischer and C. Krauss. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669, 2018.
[14] M. D. Gould, M. A. Porter, S. Williams, M. McDonald, D. J. Fenn, and S. D. Howison. Limit order books. Quantitative Finance, 13(11):1709–1742, 2013.
[15] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2016.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In Proceedings of the International Conference on Learning Representations, 2020.
[18] G. Lai, W.-C. Chang, Y. Yang, and H. Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
[19] G. Lample, A. Sablayrolles, M. Ranzato, L. Denoyer, and H. Jégou. Large memory layers with product keys. In Advances in Neural Information Processing Systems, pages 8546–8557, 2019.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[21] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, pages 5244–5254, 2019.
[22] Y. Mäkinen, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Forecasting jump arrivals in stock prices: new attention-based network architecture using limit order book data. Quantitative Finance, 19(12):2033–2050, 2019.
[23] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting, 37(8):852–866, 2018.
[24] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[25] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017.
[26] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
[27] J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. In Proceedings of the International Conference on Learning Representations, 2020.
[28] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[29] S.-Y. Shih, F.-K. Sun, and H.-y. Lee. Temporal pattern attention for multivariate time series forecasting. Machine Learning, 108(8-9):1421–1441, 2019.
[30] H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[31] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799, 2019.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[33] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj. Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE Transactions on Neural Networks and Learning Systems, 30(5):1407–1418, 2018.
[34] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. Volume 1, pages 7–12. IEEE, 2017.
[35] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Using deep learning to detect price change indications in financial markets. Pages 2511–2515. IEEE, 2017.
[36] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Using deep learning for price prediction by exploiting stationary limit order book features. arXiv preprint arXiv:1810.09965, 2018.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[38] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[40] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. Battaglia. Deep reinforcement learning with relational inductive biases. In Proceedings of the International Conference on Learning Representations, 2019.
[41] Z. Zhang, S. Zohren, and S. Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, 2019.
[42] E. Zivot and J. Wang. Vector autoregressive models for multivariate time series. Modeling Financial Time Series with S-PLUS, pages 385–429, 2006.
A Training curves
We plot the training and validation history with respect to accuracy for both our TransLOB architecture in Figure 3 and the baseline CNN architecture of [34] in Figure 4.

Figure 3: Training and validation accuracy for TransLOB for k = 100.

Figure 4: Training and validation accuracy for baseline CNN for k = 100.

B Attention distributions
We include here the remaining visualizations of the attention output of our learned model in the first transformer block. The input is a random sample for the horizon k = 10.
Figure 5: Second head of the first transformer block.

Figure 6: Third head of the first transformer block.