Temporal mixture ensemble models for intraday volume forecasting in cryptocurrency exchange markets
Nino Antulov-Fantulin* · Tian Guo* · Fabrizio Lillo
Abstract
We study the problem of intraday short-term volume forecasting in cryptocurrency exchange markets. The predictions are built by using transaction and order book data from different markets where the exchange takes place. Methodologically, we propose a temporal mixture ensemble model, capable of adaptively exploiting, for the forecasting, different sources of data and providing a volume point estimate, as well as its uncertainty. We provide evidence of the outperformance of our model by comparing its outcomes with those obtained with different time series and machine learning methods. Finally, we discuss the difficulty of volume forecasting when large quantities are abruptly traded.
*Shared first authorship.
N. Antulov-Fantulin, ETH Zurich; Aisot GmbH, Zurich, Switzerland
T. Guo, RAM Active Investments, Switzerland (work done while at ETH Zurich)
F. Lillo, Department of Mathematics, University of Bologna, Italy. E-mail: [email protected]

1 Introduction

Cryptocurrencies have recently attracted massive attention from researchers in several disciplines such as finance, economics, and computer science. They originated from a decentralized peer-to-peer payment platform operating through the Internet. When new transactions are announced on this network, they have to be verified by network nodes and recorded in a public distributed ledger called the blockchain [1]. Cryptocurrencies are created as a reward in the verification competition in which users offer their computing power to verify and record transactions into the blockchain. Bitcoin is one of the most prominent decentralized digital cryptocurrencies and it is the focus of this paper, although the model developed below can easily be adapted to other cryptocurrencies, as well as to other "ordinary" assets (equities, futures, FX rates, etc.).

The exchange of Bitcoins with other fiat or cryptocurrencies takes place on exchange markets, which share some similarities with the foreign exchange markets [2]. These markets typically work through a continuous double auction, implemented with a limit order book mechanism, where no designated dealer or market maker is present and limit and market orders to buy and sell arrive continuously. Moreover, as observed for traditional assets, the market is fragmented, i.e. there are several exchanges where the trading of the same asset, in our case the exchange of a cryptocurrency with a fiat currency, can simultaneously take place.

The automation of the (cryptocurrency) exchanges has led to increased use of automated trading [3, 4] via different trading algorithms.
An important input for these algorithms is the prediction of future trading volume. This is important for several reasons. First, trading volume is a proxy for liquidity, which in turn is important to quantify transaction costs. Trading algorithms aim at minimizing these costs by splitting orders in order to find a better execution price [5, 6], and the crucial part is the decision of when to execute the orders in such a way as to minimize market impact or to achieve certain trading benchmarks (e.g. VWAP) [7-9]. Second, when different market venues are available, the algorithm must decide where to post the order, and the choice is likely the market where more volume is predicted to be available. Third, volume is also used to model the time-varying price volatility process, a relation known as the "Mixture of Distribution Hypothesis" [10].

In this paper, we study the problem of intraday short-term volume prediction on cryptocurrency markets, intending to obtain not only a point estimate but also an interval of uncertainty [11-14]. Moreover, conventional volume prediction focuses on using data or features from the same market. Since cryptocurrencies are traded on several markets simultaneously, it is reasonable to use cross-market data not only to enhance the predictive power, but also to help understand the interaction between markets. In particular, we investigate the exchange rate of Bitcoin (BTC) with a fiat currency (USD) on two liquid markets: Bitfinex and Bitstamp. The first market is more liquid than the second, since its traded volume in the investigated period is considerably larger. Thus one expects an asymmetric role of the past volume (or other market variables) of one market in the prediction of volume in the other market. We propose a class of models, termed temporal mixture ensemble models, to build predictions of volume, and we compare out-of-sample forecasts with those obtained with some traditional time-series approaches and with a machine learning benchmark (gradient boosting).
Recently, there have been a few reports showing fake reported volume for certain Bitcoin exchange markets. In this study, we work with Bitcoin exchange markets that have been independently verified to report true values [15].
Specifically, the contributions of this paper can be summarized as follows:

– We formulate cross-market volume prediction as a supervised multi-source learning problem. We use multi-source data, i.e. transactions and limit order books from different markets, to predict the volume of the target market.
– We propose the temporal mixture ensemble model, which models each individual source's relation to the target and adaptively adjusts the contribution of each source to the target prediction.
– By equipping it with modern ensemble techniques, the proposed model can further quantify the predictive uncertainty, consisting of an epistemic and an aleatoric component, for both volume and source contributions.
– As main benchmarks for volume dynamics, we use different time-series and machine learning models (with the same regressors/features used in our model). We observe that our dynamic mixture ensemble often has superior out-of-sample performance on conventional prediction error metrics, e.g. root mean square error (RMSE) and mean absolute error (MAE). More importantly, it presents much better calibrated results, as evaluated by metrics that take predictive uncertainty into account, i.e. normalized negative log-likelihood (NNLL), uncertainty interval coverage (IC) and width (IW).
– We show that the main difficulty in predicting volume (for all models) is related to very large and unexpected volumes. Outside these situations, our model strongly outperforms the other benchmarks.

The paper is organized as follows: in Section 2 we present the investigated markets, the data, and the variables used in the modeling. In Section 3 we present our models and benchmarks. In Section 4 we present our empirical investigations on the cryptocurrency markets for the prediction of intraday market volume. Finally, Section 5 presents some conclusions and an outlook for future work.
2 Data

Our empirical analyses are performed on a sample of data from two exchanges, Bitfinex and Bitstamp, where Bitcoins can be exchanged with US dollars. These markets work through a limit order book, as many conventional exchanges do. The investigated period is June-November 2018. The main analyses are performed on five-minute intervals, thus the length of the investigated time series is about 34,000 points. In the Appendix we also show some analyses performed at one-minute resolution, raising the number of data points to 171k. For each of the two markets we consider two types of data: transaction data and limit order book data.

Fig. 1: The intraday average 1-min transaction volume plus 1 std of the BTC/USD rate in the Bitfinex market over the period from June 2018 to November 2018.

From transaction data we extract the following features for each 5-min interval:

– Buy volume - in BTC units of executed transactions
– Sell volume - in BTC units of executed transactions
– Volume imbalance - absolute difference between buy and sell volume
– Buy transactions - number of executed transactions on the buy side
– Sell transactions - number of executed transactions on the sell side
– Transaction imbalance - absolute difference between the numbers of buy and sell transactions

From limit order book data we extract the following features [16, 17], obtained by averaging the one-minute variables in each 5-min interval:

– Spread - the difference between the highest price that a buyer is willing to pay for a BTC (bid) and the lowest price that a seller is willing to accept (ask).
– Ask volume - the number of BTCs on the ask side of the order book.
– Bid volume - the number of BTCs on the bid side of the order book.
– Imbalance - the absolute difference between ask and bid volume.
– Ask/bid slope - estimated as the volume until a δ price offset from the best ask/bid price. δ is estimated by the bid price at the order that has at least 1%, 5% and 10% of orders with the highest bid price.
– Slope imbalance - the absolute difference between ask and bid slope at the different values of price associated to δ.
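As an illustration of the transaction features above, the following sketch computes them for one 5-minute interval from a list of trades. The trade record fields (`ts`, `side`, `size`) and the sample values are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical trade records: timestamp in seconds, side, and size in BTC.
trades = [
    {"ts": 10, "side": "buy", "size": 0.5},
    {"ts": 40, "side": "sell", "size": 1.2},
    {"ts": 90, "side": "buy", "size": 0.3},
]

def transaction_features(trades, t0, t1):
    """Six per-interval transaction features for trades with t0 <= ts < t1."""
    window = [t for t in trades if t0 <= t["ts"] < t1]
    buy_vol = sum(t["size"] for t in window if t["side"] == "buy")
    sell_vol = sum(t["size"] for t in window if t["side"] == "sell")
    n_buy = sum(1 for t in window if t["side"] == "buy")
    n_sell = sum(1 for t in window if t["side"] == "sell")
    return {
        "buy_volume": buy_vol,
        "sell_volume": sell_vol,
        "volume_imbalance": abs(buy_vol - sell_vol),
        "buy_transactions": n_buy,
        "sell_transactions": n_sell,
        "transaction_imbalance": abs(n_buy - n_sell),
    }

feats = transaction_features(trades, 0, 300)  # one 5-minute (300 s) window
# The target variable y_t is the total traded volume in the interval:
y_t = feats["buy_volume"] + feats["sell_volume"]
```

The target variable defined in the text, the sum of buy and sell volume of the target market, is the last line of the sketch.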
The target variable y_t that we aim at forecasting is the sum of the first two features of the transaction data of the target market, i.e. the sum of buy and sell volume.

In the proposed modeling approaches (described in Section 3) we consider different sources that at each time can affect the probability distribution of trading volume in the next time interval in a given market. Given the setting presented above, in our analysis there are S = 4 sources, namely one for transaction data and one for limit order book data for each of the two markets. We indicate with x_{s,t} ∈ R^{d_s} the data from source s at time step t, where d_s is the dimensionality of source s. Given the list of variables presented above, we have d_s = 6 when the source is transaction data in any market, while d_s = 13 for order book data.

Figure 1 shows the average 1-min transaction volume as a function of the time of the day. We observe the lack of a strong intra-daily "U-shape" component, which is instead observed in other asset classes [18] and used in intraday volume modeling.

3 Forecasting models

Econometric modeling of intra-daily trading volume relies on a set of empirical regularities [7-9] of volume dynamics. These include fat tails, strong persistence and an intra-daily clustering around the "U"-shaped periodic component. Brownlees et al. [7] proposed the Component Multiplicative Error Model (CMEM), which is an extension of the Multiplicative Error Model (MEM) [19]. The CMEM volume model has a connection to the component-GARCH [20] and the periodic P-GARCH [21]. Satish et al. [8] proposed a four-component volume forecast model composed of: (i) a rolling historical volume average, (ii) a daily ARMA for serial correlation across daily volumes, (iii) a deseasonalized intra-day ARMA volume model and (iv) a dynamic weighted combination of the previous models. Chen et al. [9] simplify the multiplicative volume model [7] into an additive one by modeling the logarithm of intraday volume with the Kalman filter.

Given the lack of intraday periodicity (see Fig. 1) in the investigated cryptocurrency markets and the need of adding external multi-source regressors, our econometric benchmarks are AR-GARCH type models, described in the next subsection. The machine learning benchmark is the gradient boosting method, while our main model is the temporal mixture ensemble model, which naturally allows using multi-source data.

When temporal data come from different sources, x_{s,t} indicates the data from source s at time step t, and x_{s,t} can be multi-dimensional, i.e. x_{s,t} ∈ R^{d_s}, where d_s is the data dimensionality of source s. With these multi-source data, we aim at predicting the future value of the target variable y_t.

In this paper, the target variable is the 5-min trading volume in one of the two cryptocurrency markets, predicted using data from both markets. Regarding the multi-source data, on one hand, it includes the feature time series from the target market. This data is believed to be directly correlated with the target variable. On the other hand, there is an alternative market, which could interact with the target market. Together with the target market, the feature time series of this alternative market constitute the multi-source data. For each market, we use features from both transaction and order book data. The experiment section 4 provides more details about the markets, transactions, order book data, and features.

3.1 Econometric benchmarks

As mentioned, our benchmarks belong to the AR-GARCH class with external regressors.
More specifically, the volume process y_t is modelled with the following autoregressive process (AR(p)) with external regressors:

y_t = μ + Σ_{i=1}^{p} φ_i y_{t−i} + Σ_{s=1}^{S} Σ_{j=1}^{d_s} ψ_{s,j} x_{s,t−1}(j) + ε_t,   (1)

where x_{s,t−1}(j) denotes the j-th feature of the external feature vector x_{s,t−1} from source s at time t−1. The total number of sources is S = 4, which includes the transaction and limit order book data of the two markets. Since volume exhibits time clustering, we assume that the residuals ε_t are modelled by a GARCH process [7-9]:

ε_t = σ_t e_t,   e_t ∼ N(0, 1)   (2)

σ_t^λ = ω + α ε_{t−1}^λ + γ |ε_{t−1}|^λ 1[ε_{t−1} < 0] + β σ_{t−1}^λ   (3)

For λ = 2, γ = 0, we get the standard GARCH(1,1) model [22]. In the case of λ = 2, γ = 1, we get the GJR-GARCH model [23], which captures asymmetry in positive and negative shocks. (Note that we have also evaluated all ARX-GARCH models using autoregressive external feature terms {x_{s,t−i}(j)}_{i=1}^{p}, but the results were not better, and training time and convergence became problematic. We also tested the case of λ = 1, γ = 1, corresponding to the threshold heteroskedastic models [24], but that model sometimes displays convergence problems, so we decided not to present it; anyhow, its forecasting ability is comparable to that of the other GARCH models.)

3.2 Machine learning benchmark

We take the gradient boosting machine [25] as a machine learning baseline. Gradient boosting approximates the volume ŷ_t = F(x_t) with a function that has the following additive expansion (similar to other functional approximation methods like radial basis functions, neural networks, wavelets, etc.):

ŷ_t = F(x_t) = Σ_{m=0}^{M} β_m h(x_t; a_m),   (4)

where x_t denotes the feature vector, constructed as the concatenation of the different sources, x_t = (x_{1,t}, x_{2,t}, x_{3,t}, x_{4,t}). The functions h(x_t; a_m) are also called "base learners"; in our case they are regression trees with parameters a_m, and β_m is a scalar. Each base learner h(x_t; a_m) partitions the feature space x_t ∈ X into L disjoint regions {R_{l,m}}_{l=1}^{L} and predicts a separate constant value in each:

h(x_t; {R_{l,m}}_{l=1}^{L}) = Σ_{l=1}^{L} ȳ_{l,m} 1(x_t ∈ R_{l,m}),   (5)

where ȳ_{l,m} is inferred during the learning phase along with the expansion coefficients {β_m} and the parameters a_m of the regression trees. The learning procedure starts by defining the loss function Ψ(y_t, F(x_t)), e.g.
the squared loss Σ_t (y_t − F(x_t))², and an initial regression tree F_0(x_t). Then, for each m = 1, ..., M, we solve the optimization problem

(β_m, a_m) = arg min_{β,a} Σ_{t=1}^{T} Ψ(y_t, F_{m−1}(x_t) + β h(x_t; a))   (6)

and set

F_m(x_t) = F_{m−1}(x_t) + β_m h(x_t; a_m).   (7)

See the Appendix for more details. (In the concatenation of sources above we omitted the transpose operators, since concatenation is a simple operation, and to avoid confusion with the time index.) Furthermore, note that different variants of tree boosting have been empirically proven to be state-of-the-art methods in predictive tasks across different machine learning challenges [26, 27] and, more recently, in finance [28, 29].

3.3 Temporal mixture ensemble

In this paper, we construct an intra-daily dynamic mixture ensemble model, belonging to the class of mixture models [30-34], that takes previous transaction and limit order book data [16, 17] from multiple markets simultaneously into account. Though mixture models have been widely used in machine learning and deep learning [35-37], they have hardly been explored for prediction tasks in cryptocurrency markets. Moreover, our proposed temporal mixture ensemble can provide the predictive uncertainty of the target volume through the use of Stochastic Gradient Descent (SGD) based ensemble techniques [38-40]. Predictive uncertainty reflects the confidence of the model in the prediction; it is valuable extra information for model interpretability and reliability.

In principle, the temporal mixture ensemble exploits latent variables to capture the contributions of different sources of data to the future evolution of the target variable. The source contributing at a certain time depends on the history of all the sources.
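To make Eqs. (4)-(7) concrete, here is a minimal from-scratch sketch of squared-loss gradient boosting with depth-1 regression trees (stumps) as base learners, on made-up data. It is a toy illustration, not the implementation or configuration used in the paper; in particular it uses a fixed shrinkage factor in place of the fitted coefficients β_m.

```python
def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error of residuals r."""
    best = None
    n, d = len(X), len(X[0])
    for j in range(d):
        for thr in sorted({row[j] for row in X}):
            left = [r[i] for i in range(n) if X[i][j] <= thr]
            right = [r[i] for i in range(n) if X[i][j] > thr]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    _, j, thr, ml, mr = best
    # The stump is a two-region version of Eq. (5).
    return lambda x, j=j, thr=thr, ml=ml, mr=mr: ml if x[j] <= thr else mr

def fit_gbm(X, y, M=50, beta=0.1):
    """Additive expansion of Eq. (4): F_m = F_{m-1} + beta * h_m (Eq. (7)),
    with each h_m fit to the current residuals (squared loss)."""
    f0 = sum(y) / len(y)          # initial constant model
    pred = [f0] * len(y)
    stumps = []
    for _ in range(M):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(X, residuals)
        stumps.append(h)
        pred = [pi + beta * h(xi) for pi, xi in zip(pred, X)]
    return lambda x: f0 + beta * sum(h(x) for h in stumps)

# Toy data: the target depends on the first feature only.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [0.0, 1.0, 2.0, 3.0]
F = fit_gbm(X, y)
```

In practice one would use an optimized library implementation; the sketch only shows how the residual-fitting loop realizes the additive expansion.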
More quantitatively, the generative process of the time series of the target variable, conditional on the multi-source data {x_{1,t}, ..., x_{S,t}}_{t=0}^{T}, is formulated as the following probabilistic mixture process:

p(y_1, ..., y_T | {x_{1,t}, ..., x_{S,t}}_{t=0}^{T})
= Σ_{z_1} ··· Σ_{z_T} p(y_1, ..., y_T, z_1, ..., z_T | {x_{1,t}, ..., x_{S,t}}_{t=0}^{T})
= Π_t Σ_{z_t=1}^{S} p_θ(y_t | z_t = s, x_{s,<t}) P_ω(z_t = s | x_{1,<t}, ..., x_{S,<t}),

where the latent variable z_t selects the source that drives the target at time t.

The typical prediction of the target y_τ is the expected value, i.e. the conditional mean. In our temporal mixture model it is derived as:

E[y_τ | {x_{1,<τ}, ..., x_{S,<τ}}, D]
= ∫ y · p(y | {x_{1,<τ}, ..., x_{S,<τ}}, D) dy
= ∫∫ y · p_Θ(y | {x_{1,<τ}, ..., x_{S,<τ}}) · p(Θ | D) dy dΘ
≈ (1/M) Σ_{m=1}^{M} E[y_τ | x_{1,<τ}, ..., x_{S,<τ}, Θ_m],  where Θ_m ∼ p(Θ | D).   (11)

In Eq. 11 we use Monte Carlo methods to obtain unbiased estimates of the integral over the model parameters. E[y_τ | x_{1,<τ}, ..., x_{S,<τ}, Θ_m] is the conditional mean given one realization Θ_m of the model parameters. In the context of temporal mixture models, it is derived as:

E[y_τ | x_{1,<τ}, ..., x_{S,<τ}, Θ_m] = Σ_{s=1}^{S} P_{ω_m}(z_τ = s | x_{1,<τ}, ..., x_{S,<τ}) · E[y_τ | x_{s,<τ}, θ_m].   (12)

Eq. 12 shows that the mixture mean is the weighted sum of the means derived from the individual data sources.

Predictive aleatoric and epistemic uncertainty. Apart from the mean, the predictive uncertainty of the target is of great interest as well, since it allows computing confidence intervals on the predictions and facilitates decision making based on volume predictions. Meanwhile, by jointly modeling the predictive mean and uncertainty, our mixture ensemble provides well-calibrated predictions, as will be demonstrated in the experiment section.

In a Bayesian setting, there are two main types of uncertainty one can model.

– Aleatoric uncertainty captures the underlying noise inherent in the observations.
For instance, in financial markets a widely used aleatoric uncertainty is the volatility of stock returns, which reflects the price fluctuation over time. It can be estimated either by the empirical variance or by GARCH family models.

– Epistemic uncertainty is the uncertainty in the model, which captures what our model does not know due to lack of training data. It can be explained away with increased training data.

In the following, by deriving the conditional variance of the target in the Bayesian mixture manner, we demonstrate that the total variance decomposes into aleatoric and epistemic uncertainties, which reflect different aspects of the variance of the target y_τ:

Var(y_τ | {x_{1,<τ}, ..., x_{S,<τ}}, D)
= ∫ y² p(y | {x_{1,<τ}, ..., x_{S,<τ}}, D) dy − E[y_τ | {x_{1,<τ}, ..., x_{S,<τ}}, D]²
= (1/M) Σ_{m=1}^{M} Σ_{s=1}^{S} P_{ω_m}(z_τ = s | ·) Var(y | z_τ = s, x_{s,<τ}, Θ_m)   [Aleatoric Uncertainty]
+ (1/M) Σ_{m=1}^{M} Σ_{s=1}^{S} P_{ω_m}(z_τ = s | ·) ( E[y | z_τ = s, x_{s,<τ}, Θ_m] − E[y | {x_{1,<τ}, ..., x_{S,<τ}}, D] )²,   [Epistemic Uncertainty]   (13)

where ω_m is part of the sample Θ_m. Eq. 13 bridges the aleatoric and epistemic uncertainty through the derivation of the total variance of y_τ in the Bayesian mixture setting. The decomposition in Eq. 13 also theoretically demonstrates the relation between the total variance and the aleatoric and epistemic uncertainty in Bayesian modeling, namely that the total variance is composed of the inherent noise and the model uncertainty on the target.

The aleatoric part in Eq. 13 stems from the variance induced by the multi-source data. It captures the noise inherent to the target, which can depend on x_{s,<τ}. As a comparison, a classical aleatoric uncertainty (or volatility) estimation model, the GARCH, is typically used to estimate the volatility solely from the target time series.
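The decomposition in Eq. 13 can be checked numerically. The following sketch uses made-up gate weights, per-source means and variances for M = 3 ensemble members and S = 2 sources; all numbers are illustrative, not fitted values from the paper.

```python
# Illustrative ensemble of M = 3 parameter samples over S = 2 sources.
gates = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]            # P_w(z = s | .)
means = [[50.0, 40.0], [52.0, 38.0], [49.0, 41.0]]      # E[y | z = s, ...]
varis = [[90.0, 120.0], [100.0, 110.0], [95.0, 130.0]]  # Var(y | z = s, ...)
M = len(gates)

# Eqs. (11)-(12): ensemble mean = average of per-sample mixture means.
mix_means = [sum(g * mu for g, mu in zip(gates[m], means[m])) for m in range(M)]
ens_mean = sum(mix_means) / M

# Eq. (13): total variance = aleatoric part + epistemic part.
aleatoric = sum(g * v for m in range(M)
                for g, v in zip(gates[m], varis[m])) / M
epistemic = sum(g * (mu - ens_mean) ** 2 for m in range(M)
                for g, mu in zip(gates[m], means[m])) / M
total_var = aleatoric + epistemic
```

Here `total_var` coincides with the variance computed directly from the mixture's first two moments, confirming that the decomposition is exact.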
It has no mechanism to capture the evolving relevance of multi-source data to the aleatoric uncertainty of the target.

The epistemic uncertainty on the mean in Eq. 13 accounts for uncertainty in the model parameters, i.e. uncertainty which captures our ignorance about which model generated our collected data. This uncertainty can be reduced when enough data are available, and is often referred to as model uncertainty.

Model specification. We now specify in detail the mathematical formulation of each component in the temporal mixture model. The inference process presented so far does not rely on any specific formulation of the model and is thus flexible with respect to different specifications. Without loss of generality, we present the following model specification for the cryptocurrency data of this paper's interest. To specify the model, we need to define the predictive density function of the individual sources, i.e. p_θ(y_t | z_t = s, x_{s,<t}). [...]

The empirical 2σ̂ interval coverage (IC) should be close to IC* = 95.45% for well-calibrated models [46]. The closer the empirical value is to IC*, the better calibrated the model is. Therefore, we define a simpler measure, the 2σ̂ coverage error (ICE), which measures the absolute difference to the ground value IC*:

ICE = |IC − 0.9545|.   (25)

Finally, for a prediction interval [y_i^−, y_i^+], we calculate the mean prediction interval width (IW) as:

IW = (1/C) Σ_{i=1}^{M} (y_i^+ − y_i^−) = (1/C) Σ_{i=1}^{M} 4σ̂_i,   (26)

where y_i^+ = ŷ_i + 2σ̂_i, y_i^− = ŷ_i − 2σ̂_i, and C is the number of true target values falling within the prediction interval, C = Σ_{i=1}^{M} 1[ŷ_i − 2σ̂_i ≤ y_i ≤ ŷ_i + 2σ̂_i]. The IW measure should be minimized: high-quality prediction intervals should be as narrow as possible, while capturing a specified portion of the data, without assumptions on the distribution [46].

4.2 Results

In the first set of experiments we concentrate on 5-min ahead predictions for both markets.
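Under the 2σ̂-interval convention of Section 4.1, the coverage error ICE (Eq. 25) and mean interval width IW (Eq. 26) can be computed as in the following sketch; the arrays are made-up toy values, not the paper's predictions.

```python
def interval_metrics(y, y_hat, sigma):
    """Empirical 2-sigma coverage IC, coverage error ICE (Eq. 25),
    and mean interval width IW (Eq. 26, normalized by the number C
    of covered points)."""
    lo = [m - 2 * s for m, s in zip(y_hat, sigma)]
    hi = [m + 2 * s for m, s in zip(y_hat, sigma)]
    covered = [l <= t <= h for t, l, h in zip(y, lo, hi)]
    ic = sum(covered) / len(y)
    ice = abs(ic - 0.9545)            # distance to the Gaussian IC* = 95.45%
    c = max(sum(covered), 1)          # guard against an empty coverage set
    iw = sum(h - l for h, l in zip(hi, lo)) / c   # = (1/C) * sum_i 4*sigma_i
    return ic, ice, iw

# Toy values: the third point mimics an unexpected volume burst.
y     = [10.0, 12.0, 50.0, 11.0]
y_hat = [11.0, 11.5, 12.0, 10.5]
sigma = [1.0, 1.0, 1.0, 1.0]
ic, ice, iw = interval_metrics(y, y_hat, sigma)
```

Note that, following Eq. 26, the summed widths are normalized by the number of covered points C rather than by the sample size.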
For all the processes in the AR-GARCH family the autoregressive order was fixed to p = 10, which is the maximum lag for which the partial autocorrelation function is still significant.

Table 1 and Table 2 show the out-of-sample prediction metrics for the two markets. Notice that the 5-min mean volume on the test set is 49.67 for Bitfinex and 21.47 for Bitstamp. By comparing these numbers with the MAE, we observe that the level of noise is quite high, since the two values are comparable. This can also be seen by computing the average relative forecasting error: for the AR-GARCH model, E[(y_t − ŷ_t)/y_t] is 3.35 for Bitfinex and 3.99 for Bitstamp, while the linear correlation ρ(y_t, ŷ_t) is about 0.53 and 0.48, respectively (see the CORR column in the tables).

Table 1: 5-min ahead prediction metrics for the Bitfinex market. Out-of-sample performance of 5-min ahead predictions for the period June 2018 - November 2018 (70% train, 10% validation and 20% test). The arrow symbols in the first line indicate the direction of the metrics for better models.

MODEL               RMSE ↓   MAE ↓    NNLL ↓   CORR ↑   ICE ↓   IW ↓
AR-GARCH            63.952   38.229   5.4866   0.5268   0.012   284.598
AR-GJR-GARCH        63.78    38.838   5.4799   0.5306   0.012   282.843
ARX-GARCH           63.677   38.086   5.482    0.531    0.011   282.934
ARX-GJR-GARCH       63.597   37.208   5.4788   0.4225   0.010   285.323
Gradient boosting   62.529   37.768   NA       0.562    NA      NA
Mixture ensemble    63.68    33.72    4.82     0.55     0.002   184.54

Table 2: 5-min ahead prediction metrics for the Bitstamp market. Out-of-sample performance of 5-min ahead predictions for the period June 2018 - November 2018 (70% train, 10% validation and 20% test). The arrow symbols in the first line indicate the direction of the metrics for better models.

MODEL               RMSE ↓   MAE ↓    NNLL ↓   CORR ↑   ICE ↓   IW ↓
AR-GARCH            38.118   17.089   4.9026   0.4799   0.022   158.496
AR-GJR-GARCH        38.051   17.328   4.8952   0.4819   0.022   162.632
ARX-GARCH           39.13    20.356   4.931    0.4534   0.023   157.959
ARX-GJR-GARCH       39.131   20.324   4.9251   0.453    0.023   158.16
Gradient boosting   37.764   15.966   NA       0.5177   NA      NA
Mixture ensemble    38.90    15.54    3.89     0.51     0.005   91.69

We now focus on the forecasts of the temporal mixture ensemble. In Fig. 2 and Fig. 3 we show the volume and uncertainty predictions at the 5-min level for the two markets in a random interval of two hundred 5-min intervals.
In both markets, the 95% confidence interval covers the actual values quite well. We notice in both plots the presence of large spikes, corresponding to 5-min intervals where, unexpectedly, a large volume is traded; clearly the model is unable to forecast them. We believe that these volume bursts are responsible for the large difference between MAE and RMSE, and for the fact that all models have comparable RMSE (but different MAE). Below we provide more evidence of this.

Fig. 2: Sample time series of 5-min trading volumes in Bitfinex (black line). The blue line is the 5-min ahead prediction with the temporal mixture ensemble and the light blue area represents its 95% confidence interval.

The temporal mixture ensemble is able to quantify, at each time step, the contribution of each source to the target forecast. In Fig. 4 we show the dynamical contributions of the S = 4 sources for a random sample of 5-min ahead predictions for the Bitfinex market. We notice that the relative contributions vary with time, and we observe that the external order book source from the less liquid market does not contribute much to the predictions. On the contrary, in Fig. 5, where the data for Bitstamp are shown, the external order book and external transaction features from the more liquid market (Bitfinex) play a more dominant role.

Fig. 3: Sample time series of 5-min trading volumes in Bitstamp (black line). The blue line is the 5-min ahead prediction with the temporal mixture ensemble and the light blue area represents its 95% confidence interval.

Fig. 4: Data source contributions for a time series sample of 5-min trading volume in Bitfinex. The contributions are obtained with the temporal mixture ensemble.

Fig. 5: Data source contributions for a time series sample of 5-min trading volume in Bitstamp.
The contributions are obtained with the temporal mixture ensemble.

In order to understand in a more quantitative way the role of large volumes in the forecasting ability of the different models, we compute the RMSE and MAE conditional on the quartile of the true value of the volume of the target market. Table 3 reports the results. First of all, we notice that, for all the methods, both error measures change by almost an order of magnitude when moving from the lowest to the largest quartile. This is a strong indication that the main problems in forecasting derive from large and unexpected volume bursts. We finally notice that the temporal mixture ensemble outperforms the other models, both considering RMSE and MAE, when the Bitfinex volume is in Q1-Q3. For the Bitstamp market the results are less clear, but in general the machine learning methods work better than the benchmark econometric models.

Table 3: 5-min ahead prediction metrics for both markets conditional on the quartile of the target volume in the period June 2018 - November 2018.
BITFINEX MARKET     RMSE Q1   RMSE Q2   RMSE Q3   RMSE Q4
AR-GARCH            25.4712   32.5289   41.3908   116.2046
ARX-GARCH           29.0931   31.2223   34.6301   114.9813
Gradient Boosting   30.9203   29.4742   28.4582   112.6362
Mixture ensemble    22.82     21.06     25.74     120.93

                    MAE Q1    MAE Q2    MAE Q3    MAE Q4
AR-GARCH            18.7562   21.1582   27.2724   80.4797
ARX-GARCH           24.9695   23.1059   22.5028   77.9657
Gradient Boosting   28.0851   24.4661   20.2829   74.4320
Mixture ensemble    20.90     15.14     16.94     84.49

BITSTAMP MARKET     RMSE Q1   RMSE Q2   RMSE Q3   RMSE Q4
AR-GARCH            11.5399   14.5152   17.962    73.2483
ARX-GARCH           18.9325   19.0446   18.1409   71.2284
Gradient Boosting   11.4270   11.0809   10.5455   72.6233
Mixture ensemble    11.73     11.34     11.10     74.87

                    MAE Q1    MAE Q2    MAE Q3    MAE Q4
AR-GARCH            7.6147    8.7595    10.5267   40.2782
ARX-GARCH           17.3215   15.8453   12.4302   35.6983
Gradient Boosting   10.0851   8.7940    7.1503    37.4343
Mixture ensemble    11.15     8.55      5.77      39.13

We have also repeated these experiments for the data at 1-min resolution. The results are collected in the figures and tables in the Appendix. Since the volume distribution at this smaller time scale is more leptokurtic than the one at 5 min, large volume bursts are more frequent and tend to significantly deteriorate the forecasting performance of all the models. This can be understood by considering that the average relative error of the AR-GARCH model is 447 and 119 for the two markets, to be compared with the values 3.35 and 3.99 observed at 5-min resolution. Looking at Table 6, where the analysis conditional on quartiles is presented, it is again clear that the machine learning methods outperform the econometric benchmarks (except in the fourth quartile, as expected). Finally, the temporal mixture ensemble provides confidence intervals which are significantly more accurate than those obtained with the other models.

5 Conclusions

In this paper, we analyzed the problem of predicting trading volume and its uncertainty in cryptocurrency exchange markets.
The main innovations proposed in this paper are (i) the use of transaction and order book data from different markets and (ii) the use of a class of models able to identify, at each time step, the set of data locally more useful for predictions.

By investigating data from BTC/USD exchange markets, we found that time series models of the AR-GARCH family do provide fair basic predictions for volume and its uncertainty, but when external data (e.g. from the order book and/or from other markets) are added, the prediction performance does not improve significantly. Our analysis suggests that this might be due to the fact that the contribution of these data to the prediction may not be constant over time, but depend on the "market state". The temporal mixture ensemble model is designed precisely to account for such variability. Indeed, we find that this method outperforms time series models both in point and in interval predictions of trading volume. Moreover, especially when compared to other machine learning methods, the temporal mixture approach is significantly more interpretable, allowing the inference of the dynamical contributions from different data sources as a core part of the learning procedure. This has important potential implications for decision making in economics and finance.

One of the critical outcomes of the forecasting exercise is that the predictability significantly depends on the size of the volume to be forecast. We found that our method works better than the benchmarks when the volume is not in the top quartile, while in this extreme case all the methods perform poorly. This is likely due to the presence of unexpected bursts of volume, which are very challenging to forecast.
As a consequence, the prediction is significantly less accurate when the time interval of the series is too short, since in this case extreme fluctuations are more frequent.

Finally, although the method has been proposed and tested for cryptocurrency volume in two specific exchanges, we argue that it can be successfully applied (in future work) to other cryptocurrencies and to more traditional financial assets.

Acknowledgement This work has been funded by the European Program scheme 'INFRAIA-01-2018-2019: Research and Innovation action', grant agreement

References

1. S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system," 2008. [Online]. Available: http://bitcoin.org/bitcoin.pdf
2. E. Baumöhl, "Are cryptocurrencies connected to forex? A quantile cross-spectral approach," Finance Research Letters.
3. The Journal of Finance, vol. 69, no. 5, pp. 2045–2084, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/jofi.12186
4. T. Hendershott, C. Jones, and A. Menkveld, "Does algorithmic trading improve liquidity?" The Journal of Finance, vol. 66, no. 1, pp. 1–33, 2011. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.2010.01624.x
5. C. Frei and N. Westray, "Optimal execution of a VWAP order: A stochastic control approach," Mathematical Finance, vol. 25, no. 3, pp. 612–639, 2015. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/mafi.12048
6. A. Barzykin and F. Lillo, "Optimal VWAP execution under transient price impact," Available at https://arxiv.org/abs/1901.02327, 2018.
7. C. T. Brownlees, F. Cipollini, and G. M. Gallo, "Intra-daily volume modeling and prediction for algorithmic trading," Journal of Financial Econometrics, vol. 9, no. 3, pp. 489–518, 2010.
8. V. Satish, A. Saxena, and M. Palmer, "Predicting intraday trading volume and volume percentages," The Journal of Trading, vol. 9, no. 3, pp. 15–25, 2014.
9. R. Chen, Y. Feng, and D.
Palomar, “Forecasting intraday trading volume: A kalmanfilter approach,” Available at SSRN 3101695 , 2016.10. T. G. Andersen, “Return volatility and trading volume: An information flow interpre-tation of stochastic volatility,” The Journal of Finance , vol. 51, no. 1, pp. 169–204,1996.11. J. Chu, S. Nadarajah, and S. Chan, “Statistical analysis of the exchange rate of bitcoin,” PLOS ONE , vol. 10, no. 7, pp. 1–27, 2015.12. A. Urquhart, “The inefficiency of bitcoin,” Economics Letters , vol. 148, pp. 80–82, 2016.13. P. Katsiampa, “Volatility estimation for bitcoin: A comparison of GARCH models,” Economics Letters , vol. 158, pp. 3–6, 2017.14. M. Balcilar, E. Bouri, R. Gupta, and D. Roubaud, “Can volume predict bitcoin returnsand volatility? a quantiles-based approach,” Economic Modelling , vol. 64, pp. 74–81,2017.15. M. Hougan, H. Kim, M. Lerner, and B. A. Management, “Economic and non-economictrading in bitcoin: Exploring the real spot market for the worlds first digital commodity,” Bitwise Asset Management , 2019.16. M. D. Gould, M. A. Porter, S. Williams, M. McDonald, D. J. Fenn, and S. D. Howison,“Limit order books,” Quantitative Finance , vol. 13, no. 11, pp. 1709–1742, 2013.17. M. Rambaldi, E. Bacry, and F. Lillo, “The role of volume in order book dynamics: amultivariate hawkes process analysis,” Quantitative Finance , vol. 17, no. 7, pp. 999–1020, 2016.18. T. G. Andersen and T. Bollerslev, “Intraday periodicity and volatility persistence infinancial markets,” Journal of empirical finance , vol. 4, no. 2-3, pp. 115–158, 1997.19. R. Engle, “New frontiers for arch models,” Journal of Applied Econometrics , vol. 17,no. 5, pp. 425–446, 2002.20. R. F. Engle and M. E. Sokalska, “Forecasting intraday volatility in the us equity market.multiplicative component garch,” Journal of Financial Econometrics , vol. 10, no. 1, pp.54–83, 2012.21. T. Bollerslev and E. 
Ghysels, “Periodic autoregressive conditional heteroscedasticity,” Journal of Business & Economic Statistics , vol. 14, no. 2, pp. 139–151, 1996.22. T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” Journal ofeconometrics , vol. 31, no. 3, pp. 307–327, 1986.23. L. R. Glosten, R. Jagannathan, and D. E. Runkle, “On the relation between the expectedvalue and the volatility of the nominal excess return on stocks,” The journal of finance ,vol. 48, no. 5, pp. 1779–1801, 1993.24. J.-M. Zakoian, “Threshold heteroskedastic models,” Journal of Economic Dynamicsand control , vol. 18, no. 5, pp. 931–955, 1994.25. J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annalsof statistics , pp. 1189–1232, 2001.26. D. Nielsen, “Tree boosting with xgboost-why does xgboost win” every” machine learningcompetition?” Master’s thesis, NTNU, 2016.0 Nino Antulov-Fantulin* et al.27. S. B. Taieb and R. J. Hyndman, “A gradient boosting approach to the kaggle loadforecasting competition,” International journal of forecasting , vol. 30, no. 2, pp. 382–394, 2014.28. N. Zhou, W. Cheng, Y. Qin, and Z. Yin, “Evolution of high-frequency systematic trad-ing: a performance-driven gradient boosting model,” Quantitative Finance , vol. 15,no. 8, pp. 1387–1403, 2015.29. X. Sun, M. Liu, and Z. Sima, “A novel cryptocurrency price trend forecasting modelbased on lightgbm,” Finance Research Letters , 2018.30. S. R. Waterhouse, D. MacKay, and A. J. Robinson, “Bayesian methods for mixtures ofexperts,” in Advances in neural information processing systems , 1996, pp. 351–357.31. S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture of experts,” IEEE transactions on neural networks and learning systems , vol. 23, pp. 1177–1193,2012.32. X. Wei, J. Sun, and X. Wang, “Dynamic mixture models for multiple time-series.” in IJCAI , vol. 7, 2007, pp. 2909–2914.33. L. Bazzani, H. Larochelle, and L. 
Torresani, “Recurrent mixture density network forspatiotemporal visual attention,” arXiv preprint arXiv:1603.08199 , 2016.34. T. Guo, T. Lin, and N. Antulov-Fantulin, “Exploring interpretable lstm neural networksover multi-variable data,” in International Conference on Machine Learning , 2019, pp.2494–2504.35. T. Guo, A. Bifet, and N. Antulov-Fantulin, “Bitcoin volatility forecasting with a glimpseinto buy and sell orders,” in . IEEE, 2018, pp. 989–994.36. P. Schwab, D. Miladinovic, and W. Karlen, “Granger-causal attentive mixtures of ex-perts: Learning important features with neural networks,” in Proceedings of the AAAIConference on Artificial Intelligence , vol. 33, 2019, pp. 4846–4853.37. R. Kurle, S. G¨unnemann, and P. van der Smagt, “Multi-source neural variational infer-ence,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019,pp. 4114–4121.38. B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictiveuncertainty estimation using deep ensembles,” in Advances in neural information pro-cessing systems , 2017, pp. 6402–6413.39. W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson, “A simplebaseline for bayesian uncertainty in deep learning,” in Advances in Neural InformationProcessing Systems , 2019, pp. 13 132–13 143.40. J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon,J. Ren, and Z. Nado, “Can you trust your model’s uncertainty? evaluating predictive un-certainty under dataset shift,” in Advances in Neural Information Processing Systems ,2019, pp. 13 969–13 980.41. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat,G. Irving, M. Isard et al. , “Tensorflow: A system for large-scale machine learning,”in { USENIX } Symposium on Operating Systems Design and Implementation( { OSDI } , 2016, pp. 265–283.42. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,L. Antiga, and A. 
Lerer, “Automatic differentiation in pytorch,” 2017.43. S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprintarXiv:1609.04747 , 2016.44. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” InternationalConference on Learning Representations , 2015.45. Y. Wu, J. M. Hern´andez-Lobato, and Z. Ghahramani, “Gaussian process volatilitymodel,” in NIPS , 2014, pp. 1044–1052.46. T. Pearce, A. Brintrup, M. Zaki, and A. Neely, “High-quality prediction intervals fordeep learning: A distribution-free, ensembled approach,” in International Conferenceon Machine Learning , 2018, pp. 4075–4084.47. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research , vol. 12, pp. 2825–2830, 2011.ixture ensembles for cryptocurrency intraday volume forecasting 2148. S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximatebayesian inference,” The Journal of Machine Learning Research , vol. 18, no. 1, pp.4873–4907, 2017.49. G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” arXiv preprint arXiv:1812.04754 , 2018.2 Nino Antulov-Fantulin* et al. In this Appendix we report, for the sake of completeness, the results obtainedwith 1 min data. As we have mentioned in the main text, the high burstinessof the volume data on this time scale lead to a significant deterioration of theforecasting ability of all the models.First of all, we note that for AR-GARCH model (and similarly for theother models) the relative mean absolute error for the two markets are hugebeing equal to AV G [( y t − ˆ y t ) /y t ] = 447 . AV G [( y t − ˆ y t ) /y t ] = 119 . 
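As a concrete illustration of the error metric quoted above, a minimal sketch of the relative mean absolute error $\mathrm{AVG}[(y_t - \hat{y}_t)/y_t]$. The function name and the use of absolute values are our assumptions, not taken from the paper's code:

```python
def relative_mae(y_true, y_pred):
    """Relative mean absolute error AVG[(y_t - yhat_t) / y_t].

    Illustrative sketch: absolute values are taken so that over- and
    under-predictions do not cancel; assumes strictly positive volumes y_t.
    """
    return sum(abs(y - f) / y for y, f in zip(y_true, y_pred)) / len(y_true)


# Example: 50% errors on both points give a relative MAE of 0.5.
print(relative_mae([10.0, 20.0], [5.0, 30.0]))  # -> 0.5
```

Large values of this metric, as reported above, indicate that typical prediction errors are many times larger than the traded volume itself, which happens when small volumes alternate with sudden bursts.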
Table 4: 1-min ahead prediction metrics for the Bitfinex market.

MODEL              RMSE ↓   MAE ↓   NNLL ↓   CORR ↑   ICE ↓   IW ↓
AR-GARCH           20.113   9.936   4.2542   0.447    0.016   86.523
AR-GJR-GARCH       20.086   9.976   4.2522   0.4471   0.016   86.366
ARX-GARCH          20.132   9.703   4.2445   0.4451   0.015   85.423
ARX-GJR-GARCH      20.103   9.867   4.2494   0.4453   0.016   86.603
Gradient boosting  19.985   8.87    NA       0.483    NA      NA
Mixture ensemble   20.00    9.42    2.90     0.46     0.001   61.89

Table 5: 1-min ahead prediction metrics for the Bitstamp market. Out-of-sample performance of 1-min ahead predictions for the period June 2018 - November 2018 (70% train, 10% validation and 20% test). The arrow symbols in the first line indicate the direction of the metrics for better models.

MODEL              RMSE ↓   MAE ↓   NNLL ↓   CORR ↑   ICE ↓   IW ↓
AR-GARCH           11.196   4.248   3.5866   0.4774   0.023   40.115
AR-GJR-GARCH       11.193   4.243   3.5841   0.4762   0.023   42.34
ARX-GARCH          11.252   4.504   3.5895   0.4701   0.023   39.794
ARX-GJR-GARCH      11.21    4.188   3.5862   0.4739   0.023   41.598
Gradient boosting  11.182   3.73    NA       0.5317   NA      NA
Mixture ensemble   11.38    4.058   2.03     0.49     0.004   27.67

Fig. 6: Sample time series of 1-min trading volumes in Bitfinex (black line). The blue line is the 1-min ahead prediction with the temporal mixture ensemble model and the light blue area represents its 95% confidence interval.

Fig. 7: Sample time series of 1-min trading volumes in Bitstamp (black line). The blue line is the 1-min ahead prediction with the temporal mixture ensemble model and the light blue area represents its 95% confidence interval.

Fig. 8: Data source contribution for a time series sample of 1-min trading volume in Bitfinex. The contributions are obtained with the temporal mixture ensemble model.

Fig. 9: Data source contribution for a time series sample of 5-min trading volume in Bitstamp. The contributions are obtained with the temporal mixture ensemble model.

Table 6: 1-min ahead prediction metrics for both markets, conditional on the quartile of the target volume, in the period June 2018 - November 2018.

BITFINEX MARKET
MODEL              RMSE Q1   RMSE Q2   RMSE Q3   RMSE Q4
AR-GARCH           7.3131    9.163     10.3165   37.0707
ARX-GARCH          7.0584    9.0724    10.1451   37.2288
Gradient Boosting  5.2867    5.9321    6.0074    38.7101
Mixture ensemble   6.59      7.56      8.086     37.86

MODEL              MAE Q1    MAE Q2    MAE Q3    MAE Q4
AR-GARCH           5.8234    6.3874    5.9997    21.5342
ARX-GARCH          5.2314    5.9161    5.8702    21.7944
Gradient Boosting  4.6802    4.78      3.8955    22.1228
Mixture ensemble   6.07      6.16      4.84      21.70

BITSTAMP MARKET
MODEL              RMSE Q1   RMSE Q2   RMSE Q3   RMSE Q4
AR-GARCH           3.3807    3.9941    4.376     21.3262
ARX-GARCH          3.7292    4.3312    4.6574    21.2603
Gradient Boosting  2.0817    2.2466    2.1172    22.0518
Mixture ensemble   2.71      2.89      2.64      22.19

MODEL              MAE Q1    MAE Q2    MAE Q3    MAE Q4
AR-GARCH           2.4622    2.5202    2.2675    9.7414
ARX-GARCH          2.8769    2.9354    2.6175    9.5851
Gradient Boosting  1.7587    1.7614    1.3628    10.0345
Mixture ensemble   2.46      2.40      1.55      9.90

For a given training sample $\{y_t, \mathbf{x}_t\}_{t=1}^{T}$, our goal is to find a function $F^*(\mathbf{x})$ such that the expected value of the loss function $\Psi(y, F(\mathbf{x}))$ is minimized over the joint distribution of $\{y, \mathbf{x}\}$:

$$F^*(\mathbf{x}) = \arg\min_{F(\mathbf{x})} \mathbb{E}_{y,\mathbf{x}}\, \Psi(y, F(\mathbf{x})). \tag{27}$$

Under the additive expansion $F(\mathbf{x}) = \sum_{m=0}^{M} \beta_m h(\mathbf{x}; \mathbf{a}_m)$ with parameterized functions $h(\mathbf{x}; \mathbf{a}_m)$, we proceed with the minimization of the data estimate of the expected loss [25]:

$$\{\beta_m, \mathbf{a}_m\}_1^M = \arg\min_{\beta'_m, \mathbf{a}'_m} \sum_{t=1}^{T} \Psi\Big(y_t, \sum_{m=0}^{M} \beta'_m h(\mathbf{x}_t; \mathbf{a}'_m)\Big). \tag{28}$$

However, for practical purposes, we first make the initial guess $F_0(\mathbf{x}) = \arg\min_{c} \sum_{t=1}^{T} \Psi(y_t, c)$ and then the parameters are fit jointly in a forward incremental way, for $m = 1, \dots, M$:

$$(\beta_m, \mathbf{a}_m) = \arg\min_{\beta, \mathbf{a}} \sum_{t=1}^{T} \Psi(y_t, F_{m-1}(\mathbf{x}_t) + \beta h(\mathbf{x}_t; \mathbf{a})) \tag{29}$$

and

$$F_m(\mathbf{x}_t) = F_{m-1}(\mathbf{x}_t) + \beta_m h(\mathbf{x}_t; \mathbf{a}_m). \tag{30}$$

First, the function $h(\mathbf{x}_t; \mathbf{a})$ is fit by least squares to the pseudo-residuals $\tilde{y}_{t,m}$:

$$\mathbf{a}_m = \arg\min_{\mathbf{a}, \rho} \sum_{t=1}^{T} \left[\tilde{y}_{t,m} - \rho h(\mathbf{x}_t; \mathbf{a})\right]^2, \tag{31}$$

which for the squared loss $\Psi(y_t, F(\mathbf{x}_t)) = (y_t - F(\mathbf{x}_t))^2$ at stage $m$ is the residual $\tilde{y}_{t,m} = y_t - F_{m-1}(\mathbf{x}_t)$. For a general loss $\Psi$, we have

$$\tilde{y}_{t,m} = -\left[\frac{\partial \Psi(y_t, F(\mathbf{x}_t))}{\partial F(\mathbf{x}_t)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}. \tag{32}$$

Then, we find the coefficient $\beta_m$ for the expansion as

$$\beta_m = \arg\min_{\beta} \sum_{t=1}^{T} \Psi(y_t, F_{m-1}(\mathbf{x}_t) + \beta h(\mathbf{x}_t; \mathbf{a}_m)). \tag{33}$$

Each base learner $h(\mathbf{x}_t; \mathbf{a}_m)$ partitions the feature space $\mathbf{x}_t \in X$ into $L$ disjoint regions $\{R_{l,m}\}_1^L$ and predicts a separate constant value in each:

$$h(\mathbf{x}_t; \{R_{l,m}\}_1^L) = \sum_{l=1}^{L} \bar{y}_{l,m}\, \mathbb{1}(\mathbf{x}_t \in R_{l,m}), \tag{34}$$

where $\bar{y}_{l,m}$ is the mean value of the pseudo-residuals (Eq. 32) in each region $R_{l,m}$:

$$\bar{y}_{l,m} = \frac{\sum_{t=1}^{T} \tilde{y}_{t,m}\, \mathbb{1}[\mathbf{x}_t \in R_{l,m}]}{\sum_{t=1}^{T} \mathbb{1}[\mathbf{x}_t \in R_{l,m}]}. \tag{35}$$

We have used the GBM implementation from the Scikit-learn library [47] for all our experiments. Within this library, for hyper-parameter optimization, we take into account the following regression tree hyper-parameters: n_estimators, max_features, min_samples_leaf, max_depth, and the following learning hyper-parameters: learning_rate and loss.

In stochastic gradient descent (SGD) based optimization, stochasticity comes from two places:

- SGD trajectory. The iterates $\{\Theta_0, \cdots, \Theta_I\}$ form an exploratory trajectory of the posterior space $\log p(\Theta \mid \mathcal{D})$, as $\Theta_i$ is updated using the randomly sampled data $\mathcal{D}_i$. Recent works [48, 49] studied the connection of the trajectory iterates to an approximate Markov chain Monte Carlo sampler by analyzing the dynamics of SGD.
- Model initialization. Different initializations of the model parameters, i.e. $\Theta_0$, lead to distinct trajectories.
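The forward stagewise boosting procedure of Eqs. (28)-(35) can be sketched in a few lines for the special case of squared loss, a single feature, and depth-one trees ("stumps", i.e. L = 2 regions). This is purely illustrative: the experiments in the paper use Scikit-learn's gradient boosting implementation [47], the helper names below are our own, and the fixed shrinkage factor `lr` stands in for the line-searched coefficient $\beta_m$:

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Least-squares depth-1 tree on one feature: two regions (Eqs. 34-35).

    Returns (threshold, left_value, right_value), where the two values are
    the region means of the pseudo-residuals.
    """
    best = None
    for thr in sorted(set(xs))[:-1]:  # last value would leave the right region empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    _, thr, lv, rv = best
    return thr, lv, rv

def gbm_fit(xs, ys, n_rounds=50, lr=0.1):
    """Forward stagewise fitting (Eqs. 28-30) with squared loss, where the
    pseudo-residuals (Eq. 32) reduce to plain residuals y_t - F_{m-1}(x_t)."""
    f0 = mean(ys)  # initial guess F_0 = argmin_c sum_t Psi(y_t, c)
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        stumps.append((thr, lv, rv))
        preds = [p + lr * (lv if x <= thr else rv) for x, p in zip(xs, preds)]
    return f0, stumps

def gbm_predict(model, x, lr=0.1):
    """Evaluate the additive expansion F_M(x); lr must match gbm_fit."""
    f0, stumps = model
    return f0 + sum(lr * (lv if x <= thr else rv) for thr, lv, rv in stumps)
```

On a step-shaped target such as `ys = [1, 1, 1, 1, 5, 5, 5, 5]` over `xs = 0..7`, the fitted model converges towards the two plateau values as rounds accumulate, which is exactly the residual-shrinking recursion of Eq. (30).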
It has been shown that ensembles of independently initialized and trained models empirically often provide performance comparable to sampling- and variational-inference-based methods in prediction and uncertainty quantification, even though they do not rely on conventional Bayesian grounding [38, 40].

In this paper, we adopt a hybrid approach that uses both sources of stochasticity to obtain approximate samples $\{\Theta_m\} \sim p(\Theta \mid \mathcal{D})$ as follows:

$$\{\Theta_m\} \approx \bigcup_j \{\Theta^j_i, \cdots, \Theta^j_I\}. \tag{36}$$

Eq. (36) indicates that, from each independently trained SGD trajectory (indexed by $j$), we collect the iterates $\{\Theta^j_i, \cdots, \Theta^j_I\}$ as approximate posterior samples.