Improved ACD-based financial trade durations prediction leveraging LSTM networks and Attention Mechanism
Yong Shi 1,2,3, Wei Dai 1,2,3, Wen Long 1,2,3,*, Bo Li 1,2,3

1 School of Economics and Management, University of Chinese Academy of Sciences, No. 80 of Zhongguancun East Street, Haidian District, Beijing, 100190, P.R. China
2 Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, No. 80 of Zhongguancun East Street, Haidian District, Beijing, 100190, P.R. China
3 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, No. 80 of Zhongguancun East Street, Haidian District, Beijing, 100190, P.R. China
* Correspondence: [email protected]
Abstract:
The liquidity risk factor of the security market plays an important role in the formulation of trading strategies. A more liquid stock market means that securities can be bought or sold more easily. As a sound indicator of market liquidity, the transaction duration is the focus of this study. We concentrate on estimating the probability density function $f(\Delta t_{i+1} \mid G_i)$, where $\Delta t_{i+1}$ represents the duration of the $(i+1)$-th transaction and $G_i$ represents the historical information available when the $(i+1)$-th transaction occurs. In this paper, we propose a new ultra-high-frequency (UHF) duration modelling framework that utilizes long short-term memory (LSTM) networks to extend the conditional mean equation of the classic autoregressive conditional duration (ACD) model while retaining its probabilistic inference ability. The attention mechanism is then leveraged to unveil the internal mechanism of the constructed model. In order to minimize the impact of manual parameter tuning, we adopt fixed hyperparameters during the training process. Experiments on a large-scale dataset demonstrate the superiority of the proposed hybrid models. The temporal positions in the input sequence that are more important for predicting the next duration can be efficiently highlighted via the added attention mechanism layer.

Keywords: Duration Prediction, Deep Learning, LSTM, ACD, Hybrid Model

1. Introduction
Market liquidity refers to the degree to which an asset can be bought and sold easily at a fair price [1]. In other words, market liquidity can be regarded as the speed at which transactions can be concluded while maintaining a basically stable price [1]. Therefore, market liquidity risk is one of the most common factors considered by security investors, especially high-frequency traders, when building a trading strategy. With the rapid development of computer storage technology, transaction-by-transaction financial trading data has become accessible to researchers. Let $t_i$ stand for the time at which the $i$-th trade occurs, so that the duration between the $(i+1)$-th and $i$-th trades is $\Delta t_{i+1} = t_{i+1} - t_i$, which directly measures the transaction speed of financial trading.

The autoregressive conditional duration (ACD) model proposed by Engle and Russell has been the primary framework for analyzing trading durations of ultra-high-frequency (UHF) data, which are irregularly time-spaced and convey meaningful information [2]. In ACD models, the transaction duration is decomposed into the multiplicative product of two components: the conditional (expected) duration and the unexpected duration. The expected component is the portion of the transaction duration that is linearly conditional on past durations, whereas the unexpected duration is the fraction of the duration beyond what can be predicted from past durations, and is usually characterized by an exponential distribution. Building on the work of Engle and Russell [2], many studies have tried to improve the ability to capture the relation between the conditional duration and the lagged durations: the logarithmic version of the ACD model was provided in [3], the threshold autoregressive conditional duration model was proposed in [4], the asymmetric autoregressive conditional duration model was put forward in [5], and the smooth transition ACD model and the time-varying ACD model were introduced in [6]. Many other works focus on choosing a suitable distribution to characterize the unexpected duration. The distributions that have been applied to ACD models include the generalized Gamma distribution [7], the generalized F distribution [8], the mixture of two exponential distributions [9], the regime-switching Pareto distribution [10], and the mixture of an exponential and a generalized beta of type 2 (GB2) distribution [11]. Like many other statistical models, the ACD family requires strong assumptions that are difficult to satisfy in realistic situations [12].

In recent years, machine learning methods have been widely applied to image identification and natural language processing problems. Compared with traditional statistical models, machine learning methods have looser model assumptions and better generalization ability. The artificial neural network (ANN), inspired by the biological neural network, is one of the most widely used machine learning methods. According to the Universal Approximation Theorem [13], feedforward neural networks can approximate a Borel measurable function to any desired degree of accuracy if sufficiently many hidden units with arbitrary squashing functions are provided. Recurrent neural networks (RNNs) are a family of specially designed artificial neural networks capable of extracting temporal information via their cyclic architecture [14]. With the development of optimization techniques and computation hardware, RNNs have recently been widely used in many different domains [15].
To solve the vanishing/exploding gradient problem of simple RNNs, Hochreiter proposed the LSTM networks, which make it possible to utilize a longer sequence of historical information [16]. Although LSTMs have the merit of strong fitting ability, they cannot provide probabilistic output as ACD family models do. Inspired by the work of Kristjanpoller and Minutolo [17], we propose a new architecture called LSTM-ACD to predict UHF transaction durations by combining ANN networks with the ACD framework. We take a fully data-driven approach to extend the mean equation of the classic ACD model while retaining the probabilistic inference ability. In addition, an attention layer is added to our model to visualize the proposed network and to improve its interpretability. The proposed architecture is applied to real-world stock duration datasets. The results show that the proposed model produces more accurate estimation and prediction, outperforming the classic ACD model on mean absolute error and quantile estimation. The remainder of this paper is organized as follows: Section 2 introduces the methodology in detail, Section 3 contains the experiment design and the corresponding results, and Section 4 concludes the paper and points out possible directions for future research.
2. Methodology
In Section 2, the ACD framework is integrated with LSTM networks to propose a new LSTM-ACD model for predicting the trading durations of UHF data. This section is organized as follows: Section 2.1 introduces the classic ACD model; Section 2.2 describes the proposed LSTM-ACD architecture in detail; in Section 2.3, an attention mechanism layer is utilized to unveil the internal mechanism of the proposed model.
2.1 The classic ACD model

A classic ACD model assumes that the durations are conditionally exponentially distributed with a mean that follows an ARMA process [2]. As shown in formula (1), the duration $\Delta t_i$ between the $i$-th and $(i-1)$-th trades is the multiplicative product of $\psi_i$ and $\varepsilon_i$, which represent the expected and unexpected portions of the transaction duration respectively. In the conditional mean equation (2), $\psi_i$ depends linearly on the lagged durations and on its own lagged terms; $p$ and $q$ denote the lag orders.

$\Delta t_i = \psi_i \varepsilon_i$   (1)

$\psi_i = \omega + \sum_{j=1}^{p} \alpha_j \Delta t_{i-j} + \sum_{j=1}^{q} \beta_j \psi_{i-j}$   (2)

A major limitation of the classic ACD model is the assumption that the variables in the conditional mean equation behave with strict stationarity and linearity, whereas duration sequences are usually non-linear or non-stationary. Hence, this paper extends the linear conditional mean equation to the nonlinear case with LSTM networks, exploiting the strong fitting ability of deep learning techniques.
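To make the recursion of formulas (1)-(2) concrete, the following sketch computes the expected durations of an ACD(1,1) model given its parameters; the initialization by the sample mean and the function name are our own assumptions, not details specified by the paper.

```python
import numpy as np

def acd_conditional_durations(durations, omega, alpha, beta):
    """Expected durations psi_i of an ACD(1,1) model, eqs. (1)-(2).

    A minimal sketch: (omega, alpha, beta) would normally be estimated by
    maximum likelihood; here they are assumed given.
    """
    psi = np.empty_like(durations, dtype=float)
    psi[0] = durations.mean()  # common initialization choice (assumption)
    for i in range(1, len(durations)):
        psi[i] = omega + alpha * durations[i - 1] + beta * psi[i - 1]
    return psi

# usage: psi = acd_conditional_durations(np.array([1.2, 0.4, 2.1]), 0.1, 0.2, 0.7)
```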
2.2 The LSTM-ACD model

It is generally known that the LSTM cell is able to store information over a longer time range than simple RNNs. As depicted in Figure 1, the information flow propagating across time steps is controlled by three LSTM gates: the forget gate, the input gate and the output gate.
Fig 1. Structure of the LSTM cell
Assume that $W_f$, $W_i$, $W_o$, $W_C$ represent the LSTM weight matrices and $b_f$, $b_i$, $b_o$, $b_C$ represent the bias vectors. The input vector, output vector and cell state vector at time $t$ are denoted as $x_t$, $h_t$ and $C_t$ respectively [18]. The operating process of an LSTM cell can be mathematically described as follows:
Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$   (3)

Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$   (4)

Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$   (5)

Candidate state values: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$   (6)

$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$   (7)

$h_t = o_t \times \tanh(C_t)$   (8)

As a type of RNN specially designed to avoid the exponentially fast decaying factor, LSTM networks can effectively prevent the gradient vanishing/explosion problem. Due to their ability to learn long-term dependencies, LSTMs are particularly suitable for financial prediction problems.
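For illustration, a single forward step of the cell described by equations (3)-(8) can be written in NumPy as below; this is a didactic sketch, not the TensorFlow implementation used in the experiments, and the dictionary-based parameter layout is our own choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step implementing eqs. (3)-(8).

    W and b are dicts holding the four weight matrices and bias vectors
    keyed by "f", "i", "o", "C" (a layout assumed for this sketch).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, eq. (3)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, eq. (5)
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate state, eq. (6)
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update, eq. (7)
    h_t = o_t * np.tanh(C_t)                 # hidden state, eq. (8)
    return h_t, C_t
```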
Hence, we conjecture that extending the linear mean equation to an LSTM network will improve the ability to extract long-term dependencies from the duration sequence. To verify this hypothesis, we take $\Delta t_{i-1}$ and $\ln \hat{\psi}_{i-1}$ as the input of the LSTM cell at the time point of the $i$-th transaction, where $\Delta t_{i-1}$ is the duration of the last transaction and $\ln \hat{\psi}_{i-1}$ is the logarithm of the output of the proposed LSTM-ACD model at time $i-1$. To retain the ability of probabilistic inference, the objective function is still the log likelihood function of $\Delta t_i = \psi_i \varepsilon_i$, where $\varepsilon_i$ follows an exponential distribution. The log likelihood function can be mathematically described as follows:

$L = \sum_i \ln \left[ \frac{1}{\hat{\psi}_i} \exp\left( -\frac{\Delta t_i}{\hat{\psi}_i} \right) \right]$   (9)

$\ln \hat{\psi}_i = \varphi(\Delta t_{i-1}, \ln \hat{\psi}_{i-1}, h_{i-1})$   (10)

where $\varphi$ represents the mapping from $\Delta t_{i-1}$, $\ln \hat{\psi}_{i-1}$ and $h_{i-1}$ to $\ln \hat{\psi}_i$ realized by an LSTM cell.

2.3 The Attention-LSTM-ACD model

The attention mechanism was first proposed to improve image processing accuracy by mimicking the perceptual system of human beings [19]. In [20], the attention mechanism was introduced to extend the basic encoder-decoder architecture and to enhance interpretability on the task of machine translation. Unlike the sequence-to-sequence modelling in sentence translation, the problem we focus on in this paper is predicting the financial duration one step ahead. The attention weights, which help automatically search for the important hidden states of the sequence-to-one LSTM architecture, can be calculated by the following formulas:

$e_{i-j} = v_\alpha \tanh(w_\alpha h_{i-j})$   (11)

$\alpha_{i-j} = \frac{\exp(e_{i-j})}{\sum_{j=1}^{T} \exp(e_{i-j})}$   (12)

where $h_{i-j}$ represents the hidden state lagged $j$ time steps and $\alpha_{i-j}$ represents the attention weight of $h_{i-j}$; $w_\alpha$ and $v_\alpha$ are parameter matrices of the attention mechanism. By allocating different attention weights to the different hidden states, a new vector $c_i$ is produced as the input of a feedforward network $f$ for predicting the target variable $y_i$:

$c_i = \sum_{j=1}^{T} \alpha_{i-j} h_{i-j}$   (13)

$y_i = f(c_i)$   (14)

In this study, the attention layer is integrated with the LSTM to characterize the dynamics of $\ln \psi_i$ in the above-mentioned mean equation of the ACD model. The proposed Attention-LSTM-ACD model can be described by the following equations:

$h'_{i-j} = LSTM(h'_{i-j-1}, C'_{i-j-1}, \Delta t_{i-j-1}, \ln \hat{\psi}_{i-j-1})$   (15)

$c'_i = \sum_{j=1}^{T'} \alpha'_{i-j} h'_{i-j}$   (16)

$\ln \hat{\psi}_i = f'(c'_i)$   (17)

where $C'_{i-j-1}$ represents the cell state of the LSTM lagged $j+1$ time steps.
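As a concrete illustration, the following NumPy sketch implements the exponential log-likelihood objective of formula (9), written as a loss to minimize, together with the attention pooling of formulas (11)-(13); the array shapes and function names are our own assumptions.

```python
import numpy as np

def exp_negative_log_likelihood(durations, psi_hat):
    """Negative of the exponential log likelihood in eq. (9); minimizing this
    is equivalent to maximizing the likelihood (objective sketch only)."""
    return -np.sum(-np.log(psi_hat) - durations / psi_hat)

def attention_pool(hidden_states, w_a, v_a):
    """Attention weights and context vector, eqs. (11)-(13).

    hidden_states: array of shape (T, d) holding h_{i-T}, ..., h_{i-1};
    w_a of shape (k, d) and v_a of shape (k,) are the attention parameters
    (shapes assumed for this sketch).
    """
    e = v_a @ np.tanh(w_a @ hidden_states.T)   # scores e_{i-j}, eq. (11)
    alpha = np.exp(e) / np.exp(e).sum()        # softmax weights, eq. (12)
    c = alpha @ hidden_states                  # context vector c_i, eq. (13)
    return alpha, c
```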
Figure 2 shows the Attention-LSTM-ACD model in more detail.

Fig 2. Architecture of the Attention-LSTM-ACD model (a chain of LSTM cells whose hidden states feed an attention layer followed by a fully connected layer that outputs $\ln \hat{\psi}_i$)

3. Experiment

Since one constituent stock has no transactions during 2017, we have in total 99 constituent stocks of the SZSE 100 Index (as listed on December 31st, 2016) as our research dataset, which sums to 9,900,000 transactions.
As the box plots in Figure 3 demonstrate, the transaction durations of each constituent stock of the SZSE 100 Index reveal a very long tail compared with the inter-quartile range. The large amount of data located in the tail indicates the existence of liquidity risk.
Fig 3. Box plots of durations of the 99 stocks from SZSE 100 (minimum time unit: millisecond); the 99 stocks are listed on the x-axis and the y-axis represents the duration dimension.

To further investigate the dynamic characteristics of the duration sequences, the averaged autocorrelation function (acf) and partial autocorrelation function (pacf) coefficients are plotted. As shown in Figure 4, the duration time series exhibits long memory, in that both the acf and pacf coefficients decay very slowly as the lag order increases. Hence, the high complexity of the UHF duration data requires a forecasting algorithm with strong fitting ability.
Fig 4. Averaged acf and pacf coefficients
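For reproducibility, the averaged correlograms of Figure 4 can be computed with statsmodels; a sketch under the assumption that each stock's durations are held in a separate array, with the lag count of 50 chosen to match the model input length (our assumption).

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

def averaged_correlograms(duration_series, nlags=50):
    """Average acf/pacf coefficients across stocks, as plotted in Fig 4.

    duration_series: list of 1-D duration arrays, one per stock.
    """
    acfs = np.mean([acf(x, nlags=nlags) for x in duration_series], axis=0)
    pacfs = np.mean([pacf(x, nlags=nlags) for x in duration_series], axis=0)
    return acfs, pacfs
```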
As one of the most widely used metrics, the mean absolute error (MAE) is used to evaluate the performance of duration prediction directly and can be calculated by the following formula:

$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| Duration_i^{predict} - Duration_i^{real} \right|$   (18)

A smaller MAE means that we have a more precise forecast of the transaction duration.
To evaluate the forecasting performance of quantile points, we utilize the loss function of the quantile regression minimization problem [21]. Let $R = \{Dur_{i,\alpha}: i = T+1, \dots, T+h\}$ be the predicted quantile points at confidence level $\alpha$, let $x_i$ be the realized duration of the $i$-th transaction, and let $I(\cdot)$ denote the indicator function. The performance measure $QL_{\alpha}$ can be calculated as follows:

$QL_{\alpha} = \sum_{i=T+1}^{T+h} (x_i - Dur_{i,\alpha}) \left[ \alpha - I(x_i \le Dur_{i,\alpha}) \right]$   (19)

In Section 2, we have created a new framework for one-step-ahead prediction. The sequence of 50 lagged durations (1 feature, 50 timesteps) is first chosen as the input data, and we construct the LSTM-ACD model and the Attention-LSTM-ACD model presented in Sections 2.2 and 2.3 respectively. The only difference between the two models is the attention layer. To further utilize the information of transaction-by-transaction data, the one-dimensional duration feature is extended to a multi-dimensional feature vector by adding the transaction volume and transaction type information, and two further models are constructed, named the Attention-LSTM-ACD (M) model and the LSTM-ACD (M) model. The experiments are performed with the following five models: the classic ACD model, the LSTM-ACD model, the Attention-LSTM-ACD model, the Attention-LSTM-ACD (M) model and the LSTM-ACD (M) model.
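The two evaluation metrics of formulas (18) and (19) translate directly into code; the sketch below also derives an $\alpha$-quantile forecast from the predicted conditional mean under the exponential assumption of formula (9), which is our reading of how quantile points would be obtained rather than an equation stated in the paper.

```python
import numpy as np

def mae(pred, real):
    """Mean absolute error of duration forecasts, eq. (18)."""
    return np.mean(np.abs(pred - real))

def quantile_loss(real, q_pred, alpha):
    """Quantile loss QL_alpha of eq. (19) over the test window."""
    return np.sum((real - q_pred) * (alpha - (real <= q_pred)))

def exp_quantile(psi_hat, alpha):
    """alpha-quantile of an exponential with mean psi_hat (our derivation):
    F(x) = 1 - exp(-x/psi) implies q_alpha = -psi * ln(1 - alpha)."""
    return -psi_hat * np.log(1.0 - alpha)
```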
During the training process, configurations are determined with as few exogenous inputs as possible because of the various drawbacks of manual tuning. We adopt fixed hyperparameters including learning rate, number of neurons of each layer, batch size, time steps, etc. for each constituent stock of SZSE 100.
As mentioned above, the sample used in this study consists of the 100,000 durations in 2017 for each stock collected from the SZSE 100. We select the last 30% of the data as the test set, while the remaining data is divided into training and validation sets at a ratio of 8:2.
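Because durations are a time series, the split must be chronological rather than shuffled; a minimal sketch of the 70/30 then 8:2 partition described above (the helper name is ours):

```python
import numpy as np

def chronological_split(durations, test_frac=0.3, train_frac_of_rest=0.8):
    """Chronological train/validation/test split: the last 30% is the test
    set, the remainder is split 8:2 into training and validation sets."""
    n = len(durations)
    test_start = int(n * (1 - test_frac))
    train_end = int(test_start * train_frac_of_rest)
    return (durations[:train_end],
            durations[train_end:test_start],
            durations[test_start:])

# usage: train, val, test = chronological_split(np.arange(100_000))
```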
During the experiment, a fixed hyperparameter combination is selected for each model based on the LSTM-ACD framework. Table 1 lists the hyperparameters used in our experiment. The attention size represents the height of the tensor $w_\alpha$ in formula (11). The initial learning rate is 0.5 and it is reduced by 50% after every 1000 training steps. Beyond the selected hyperparameter combination, the remaining parameters of the proposed hybrid models are learned by taking advantage of the early-stopping technique to avoid over-fitting. We evaluate model performance on the validation set every 100 training steps, and the early stopping patience represents the number of consecutive evaluations without improvement in the log likelihood function calculated on the validation set.

Table 1. The hyperparameters of each model
Hyperparameter | Attention-LSTM-ACD (M) & LSTM-ACD (M) | Attention-LSTM-ACD & LSTM-ACD
Input Layer | 3 features, 50 timesteps | 1 feature, 50 timesteps
LSTM Layer | 5 hidden neurons | 5 hidden neurons
Attention Size | 2 (for Attention-LSTM-ACD (M)) | 2 (for Attention-LSTM-ACD)
Fully Connected Layer | 2 hidden neurons | 2 hidden neurons
Batch Size | 300 | 300
Start Learning Rate | 0.5 | 0.5
Decay Steps | 1000 | 1000
Decay Rate | 50% | 50%
Early Stopping Patience | 10 | 10

In this paper, the four models based on the LSTM-ACD framework are coded in TensorFlow 1.0 and the classic ACD model is estimated with the ACDm package in R.
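A sketch of how the Table 1 schedule maps onto the TensorFlow 1.x toolchain named above; the optimizer choice and the validation-evaluation helper are our own assumptions, only the learning-rate schedule and patience come from Table 1.

```python
import tensorflow as tf  # TensorFlow 1.x, matching the paper's toolchain

# Learning-rate schedule from Table 1: start at 0.5, halve every 1000 steps.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step,
    decay_steps=1000, decay_rate=0.5, staircase=True)

# `loss` would be the negative exponential log likelihood of eq. (9); the
# optimizer below is an assumption, not stated in the paper:
# train_op = tf.train.AdamOptimizer(learning_rate).minimize(
#     loss, global_step=global_step)

# Early stopping with patience 10: evaluate every 100 steps and stop after
# 10 consecutive evaluations without validation-likelihood improvement.
best_ll, patience, PATIENCE_LIMIT = -float("inf"), 0, 10
# inside the training loop (sketch; evaluate_validation_log_likelihood is
# a hypothetical helper):
#     if step % 100 == 0:
#         val_ll = evaluate_validation_log_likelihood()
#         if val_ll > best_ll:
#             best_ll, patience = val_ll, 0
#         else:
#             patience += 1
#             if patience >= PATIENCE_LIMIT:
#                 break
```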
The out-of-sample forecasting errors of the five experiment models are calculated. Table 2 reports the average MAE on the test sets when the five models are applied to the SZSE 100 Index constituent stocks. The average MAE of the LSTM-ACD (M) model is smaller than that of the classic ACD model, while the remaining three models all perform slightly worse than the classic ACD model. We can also see from Figure 5 that the LSTM-ACD (M) and LSTM-ACD models are both superior to the classic ACD model on more stocks in terms of MAE. As mentioned above, a uniform hyperparameter combination is chosen when applying the hybrid models; if the hyperparameters were tuned separately for each stock, the performance of these hybrid models would likely improve. In addition, we calculate the MAE of each model against the durations lagged one step by the following formula (20):

$MAE_{lagged} = \frac{1}{N} \sum_{i=1}^{N} \left| Duration_i^{predict} - Duration_{i-1}^{real} \right|$   (20)

The results in the third column of Table 2 show that the average $MAE_{lagged}$ of the ACD model is significantly smaller than its average MAE. This suggests that the ACD predictions stay close to the naive one-step-lagged durations, whereas the predictions of the other four models based on the LSTM-ACD framework convey somewhat more meaningful information.
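The comparison behind this argument can be computed side by side; a small sketch contrasting the forecast MAE of formula (18) with the lagged MAE of formula (20), where a small second value signals near-naive forecasts (the function name is ours):

```python
import numpy as np

def mae_vs_naive(pred, real):
    """Forecast MAE (eq. 18) versus MAE against the one-step-lagged realized
    durations (eq. 20), i.e. distance from the persistence benchmark."""
    forecast_mae = np.mean(np.abs(pred[1:] - real[1:]))
    lagged_mae = np.mean(np.abs(pred[1:] - real[:-1]))
    return forecast_mae, lagged_mae
```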
Fig 5. The contrasts of two pairs of models (the left subfigure is the contrast between the LSTM-ACD (M) and ACD models, the right subfigure is the contrast between the LSTM-ACD and ACD models; the proportion of each slice in a pie chart represents the number of stocks on which the corresponding model performs better in terms of MAE).

Table 2. The average MAE on SZSE 100 Index constituent stocks of each model

Model | Average MAE | Average MAE for durations lagged one step | Difference
Attention-LSTM-ACD (M) | | |
LSTM-ACD (M) | | |
Attention-LSTM-ACD | | |
LSTM-ACD | | |
ACD | | |
Table 3 lists the quantile forecast measure QL at different probability levels $\alpha$ for the four hybrid models. It can be seen that the Attention-LSTM-ACD (M) model is the best-performing model at all three confidence levels. The Attention-LSTM-ACD model also provides better quantile forecasts than the LSTM-ACD model at all levels. These results indicate that the attention layer can improve the accuracy of conditional distribution forecasting.

Table 3. Quantile loss of the four models at different $\alpha$ levels

$\alpha$ | Attention-LSTM-ACD (M) | LSTM-ACD (M) | Attention-LSTM-ACD | LSTM-ACD
0.1 | | | |
0.05 | | | |
0.01 | | | |

This section makes a visualization of the Attention-LSTM-ACD model and the Attention-LSTM-ACD (M) model. As can be seen in Table 4 and Figure 6, the weights learned by the attention layer in both models decrease roughly exponentially as the lag order increases. This means that more recent transactions have a more important effect on the current duration, which is consistent with our intuition.
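The per-lag averages reported in Table 4 and drawn as the red lines of Figure 6 amount to a simple post-processing step over the learned weights; a sketch under the assumption that the weights are collected into one array per stock:

```python
import numpy as np

def average_attention_by_lag(weights_per_stock):
    """Average learned attention weights per lag across stocks.

    weights_per_stock: array of shape (n_stocks, 50) where column j holds
    the weight of lag j+1 for each stock (layout assumed).
    """
    return np.asarray(weights_per_stock).mean(axis=0)
```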
Table 4. Average attention weights of the Attention-LSTM-ACD (M) model and the Attention-LSTM-ACD model on SZSE 100 Index constituent stocks

Lag order | Attention-LSTM-ACD (M) weight | Attention-LSTM-ACD weight
lag 1 | |
lag 2 | 0.028296111 | 0.034143126
lag 3 | 0.024825037 | 0.027711418
lag 4 | 0.023690568 | 0.025372255
lag 5 | 0.022575405 | 0.023460835
lag 6 | 0.022012244 | 0.022699354
lag 7 | 0.02148028 | 0.021109585
lag 8 | 0.020997523 | 0.020365109
lag 9 | 0.020678233 | 0.02006378
lag 10 | 0.020491562 | 0.01944402
…… | …… | ……
lag 41 | 0.018665664 | 0.017536173
lag 42 | 0.018792729 | 0.017302261
lag 43 | 0.018596823 | 0.017404814
lag 44 | 0.018732648 | 0.017530387
lag 45 | 0.018695344 | 0.017503974
lag 46 | 0.018753901 | 0.017434779
lag 47 | 0.018770904 | 0.017377114
lag 48 | 0.01898069 | 0.017801802
lag 49 | 0.018924108 | 0.017549126
lag 50 | 0.018977175 | 0.01779628
Fig 6. Attention weights of the Attention-LSTM-ACD (M) model and the Attention-LSTM-ACD model at different lags on SZSE 100 Index constituent stocks (each blue line represents the attention weight sequence of one stock for the corresponding model; each red line represents the average attention weights over the SZSE 100 Index constituent stocks for the corresponding model).
4. Conclusion

In this paper, we review the studies of transaction duration modelling based on the ACD framework and find that they fall into two categories: (a) proposing a new non-linear equation form to describe the dynamics of the conditional (expected) duration; (b) choosing a more flexible distribution for the unexpected portion of the duration. This study constructs a new framework for transaction duration modelling from the perspective of extending the mean equation of the ACD model with machine learning methods. First, we build the LSTM-ACD model by combining LSTM networks with the classic ACD model to characterize the complexity of the conditional mean process while retaining the advantage of providing probabilistic output. Then, an attention layer is added to construct the Attention-LSTM-ACD model, which can unveil the importance of each hidden state in the LSTM networks.

Our proposed framework is applied to a large-scale dataset. Fixed hyperparameters are chosen for all constituent stocks of the SZSE 100 Index to reduce the impact of manual tuning, and the parameters (and consequently the underlying distributions) are learned by maximizing the log-likelihood function. The results show that the LSTM-ACD (M) model achieves the highest forecasting accuracy on the real-world financial datasets among all the presented models. Although the Attention-LSTM-ACD and Attention-LSTM-ACD (M) models do not provide more accurate performance in the MAE metric, the attention layer vividly depicts the importance of the different temporal points of the input sequence, and these models outperform the corresponding LSTM-ACD and LSTM-ACD (M) models in the QL metric. In addition, the average $MAE_{lagged}$ of the ACD model is significantly smaller than its average MAE, which means that the predictions of the LSTM-ACD framework models to some extent convey more meaningful information. As a suitably chosen residual distribution does matter, the exponential distribution used in our framework can be extended to more flexible distributions in future research.

References
2. Engle, R. F.; Russell, J. R. Autoregressive Conditional Duration: A New Model for Irregularly Spaced Transaction Data. Econometrica, (5), 1127. https://doi.org/10.2307/2999632.
3. Bauwens, L.; Giot, P. The Logarithmic ACD Model: An Application to the Bid-Ask Quote Process of Three NYSE Stocks. Ann. d'Économie Stat., No. 60, 117. https://doi.org/10.2307/20076257.
4. Zhang, M. Y.; Russell, J. R.; Tsay, R. S. A Nonlinear Autoregressive Conditional Duration Model with Applications to Financial Transaction Data. J. Econom., (1), 179–207. https://doi.org/10.1016/S0304-4076(01)00063-X.
5. Bauwens, L.; Giot, P. Asymmetric ACD Models: Introducing Price Information in ACD Models. Empir. Econ., (4), 709–731. https://doi.org/10.1007/s00181-003-0155-7.
6. Meitz, M.; Teräsvirta, T. Evaluating Models of Autoregressive Conditional Duration. J. Bus. Econ. Stat., (1), 104–124. https://doi.org/10.1198/073500105000000081.
7. Lunde, A. A Generalized Gamma Autoregressive Conditional Duration Model.
8. Hautsch, N. The Generalized F ACD Model. Mimeo, University of Konstanz.
9. De Luca, G.; Gallo, G. M. Mixture Processes for Financial Intradaily Durations. Stud. Nonlinear Dyn. Econom., (2). https://doi.org/10.2202/1558-3708.1223.
10. De Luca, G.; Zuccolotto, P. Regime-Switching Pareto Distributions for ACD Models. Comput. Stat. Data Anal., (4), 2179–2191. https://doi.org/10.1016/j.csda.2006.08.019.
11. Yatigammana, R. P.; Chan, J. S. K.; Gerlach, R. H. Forecasting Trade Durations via ACD Models with Mixture Distributions. Quant. Finance, (12), 2051–2067. https://doi.org/10.1080/14697688.2019.1618896.
12. Luo, R.; Zhang, W.; Xu, X.; Wang, J. A Neural Stochastic Volatility Model. arXiv:1712.00504 [cs, q-fin, stat].
13. Hornik, K.; Stinchcombe, M.; White, H. Multilayer Feedforward Networks Are Universal Approximators. Neural Netw., (5), 359–366. https://doi.org/10.1016/0893-6080(89)90020-8.
14. Lipton, Z. C.; Berkowitz, J.; Elkan, C. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv:1506.00019 [cs].
15. Tran, D. T.; Iosifidis, A.; Kanniainen, J.; Gabbouj, M. Temporal Attention-Augmented Bilinear Network for Financial Time-Series Data Analysis. IEEE Trans. Neural Netw. Learn. Syst., (5), 1407–1418. https://doi.org/10.1109/TNNLS.2018.2869225.
16. Hochreiter, S. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., (02), 107–116. https://doi.org/10.1142/S0218488598000094.
17. Kristjanpoller, W.; Minutolo, M. C. Forecasting Volatility of Oil Price Using an Artificial Neural Network-GARCH Model. Expert Syst. Appl., 233–241. https://doi.org/10.1016/j.eswa.2016.08.045.
18. Rundo, F. Deep LSTM with Reinforcement Learning Layer for Financial Trend Prediction in FX High Frequency Trading Systems. Appl. Sci., (20), 4460. https://doi.org/10.3390/app9204460.
19. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. arXiv:1406.6247 [cs, stat].
20. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat].
21. Koenker, R.; Bassett, G. Regression Quantiles. Econometrica, 33–50.
22. Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. In Breakthroughs in Statistics; Kotz, S., Johnson, N. L., Eds.; Springer Series in Statistics; Springer New York: New York, NY, 1992; pp 610–624. https://doi.org/10.1007/978-1-4612-0919-5_38.
23. Chou, R. Y.-T. Forecasting Financial Volatilities with Extreme Values: The Conditional Autoregressive Range (CARR) Model. J. Money Credit Bank., (3), 561–582. https://doi.org/10.1353/mcb.2005.0027.
24. Donaldson, R. G.; Kamstra, M. An Artificial Neural Network-GARCH Model for International Stock Return Volatility. J. Empir. Finance, (1), 17–46. https://doi.org/10.1016/S0927-5398(96)00011-4.
25. Engle, R. F. The Econometrics of Ultra-High-Frequency Data. Econometrica, (1), 1–22. https://doi.org/10.1111/1468-0262.00091.
26. Gers, F. A.; Schmidhuber, E. LSTM Recurrent Networks Learn Simple Context-Free and Context-Sensitive Languages. IEEE Trans. Neural Netw., (6), 1333–1340. https://doi.org/10.1109/72.963769.
27. Ghysels, E.; Gouriéroux, C.; Jasiak, J. Stochastic Volatility Duration Models. J. Econom., (2), 413–433. https://doi.org/10.1016/S0304-4076(03)00202-1.
28. Baruník, J.; Křehlík, T. Combining High Frequency Data with Non-Linear Models for Forecasting Energy Market Volatility. Expert Syst. Appl., 222–242. https://doi.org/10.1016/j.eswa.2016.02.008.
29. Kristjanpoller, W.; Fadic, A.; Minutolo, M. C. Volatility Forecast Using Hybrid Neural Network Models. Expert Syst. Appl., (5), 2437–2442. https://doi.org/10.1016/j.eswa.2013.09.043.
30. Yatigammana, R. P.; Choy, S. T. B.; Chan, J. S. K. Autoregressive Conditional Duration Model with an Extended Weibull Error Distribution. In Causal Inference in Econometrics; Huynh, V.-N., Kreinovich, V., Sriboonchitta, S., Eds.; Studies in Computational Intelligence; Springer International Publishing: Cham, 2016; Vol. 622, pp 83–107. https://doi.org/10.1007/978-3-319-27284-9_5.
31. Wei, Y.; Wang, Y.; Huang, D. Forecasting Crude Oil Market Volatility: Further Evidence Using GARCH-Class Models. Energy Econ., (6), 1477–1484. https://doi.org/10.1016/j.eneco.2010.07.009.
32. Wang, Y.-H. Nonlinear Neural Network Forecasting Model for Stock Index Option Price: Hybrid GJR–GARCH Approach. Expert Syst. Appl., (1), 564–570. https://doi.org/10.1016/j.eswa.2007.09.056.
33. Tseng, C.-H.; Cheng, S.-T.; Wang, Y.-H.; Peng, J.-T. Artificial Neural Network Model of the Hybrid EGARCH Volatility of the Taiwan Stock Index Option Prices. Phys. Stat. Mech. Its Appl., (13), 3192–3200. https://doi.org/10.1016/j.physa.2008.01.074.
34. Mohammadi, H.; Su, L. International Evidence on Crude Oil Price Dynamics: Applications of ARIMA-GARCH Models. Energy Econ., (5), 1001–1008. https://doi.org/10.1016/j.eneco.2010.04.009.
35. De Luca, G.; Gallo, G. M. Time-Varying Mixing Weights in Mixture Autoregressive Conditional Duration Models. Econom. Rev., (1–3), 102–120. https://doi.org/10.1080/07474930802387944.
36. Fischer, T.; Krauss, C. Deep Learning with Long Short-Term Memory Networks for Financial Market Predictions. Eur. J. Oper. Res., 270 (2), 654–669. https://doi.org/10.1016/j.ejor.2017.11.054.