A Deep Learning Framework for Predicting Digital Asset Price Movement from Trade-by-trade Data
Qi Zhao
New York University, Leonard N. Stern School of Business
[email protected]
Abstract—This paper presents a deep learning framework based on the Long Short-Term Memory network (LSTM) that predicts the price movement of cryptocurrencies from a trailing window of trade-by-trade data. It is the first to use trade-by-trade data to predict short-term price changes over fixed time horizons. Through careful feature design and a detailed search for the best hyper-parameters, the model is trained to achieve high performance on nearly a year of trade-by-trade data. The optimal model delivers stable, high performance (over 60% accuracy) on out-of-sample test periods. In a realistic trading simulation setting, the predictions made by the model can easily be monetized. Moreover, this study shows that the LSTM model can extract universal features from trade-by-trade data, as the learned parameters maintain their high performance on other cryptocurrency instruments that were not included in the training data. This study exceeds existing research in the scale and precision of the data used, as well as in the prediction accuracy achieved.
Index Terms—LSTM, Asset Price Prediction, Time Series Forecasting, Bitcoin, Cryptocurrency
I. INTRODUCTION
Since the introduction of Bitcoin in 2008, numerous blockchain-backed cryptocurrencies have been invented, and they have attracted investors through their volatile price movements. Like the prices of financial assets traded on secondary markets, the prices of these digital assets behave stochastically, but are in some sense more volatile. Unlike stocks or commodities, which have regular trading hours and holidays, cryptocurrencies trade on a 24-hour basis and transactions occur very frequently. The result is a dense transaction record and, at the same time, a highly continuous time series. This continuity, on the other hand, makes trading cryptocurrencies much more challenging: manually monitoring the price movement and keeping up with sudden price changes becomes ineffective and nearly impossible. An automated system that interprets scenarios in real time and predicts the direction of price movement would help investors navigate this volatile environment and profit from smart trades. This paper presents a deep learning architecture that uses the Long Short-Term Memory network (LSTM) to draw information from the time series data and predict future price movement. The trained model is capable of making accurate predictions, thereby empowering users to gain instant insight into this volatile market.
A. The Problem and Current Research
Views differ on the predictability of asset prices and the degree of market efficiency [1]. It remains an open question whether one can generate "excess return", the additional return relative to that generated by the market, by predicting future price movement. The Efficient Market Hypothesis assumes that prices behave as a random walk and that all information is "priced in" [2]. That is, information has already been built into the current price, so knowing the price at any given point does not help predict the price in the future. However, some studies have shown that asset prices are predictable to some extent [3]. Much research has been done on predicting future stock prices or future price movements; in general, there are the traditional statistical parametric approach and the emerging machine learning approach. Given the success of the LSTM network in processing sequential data, it is natural to apply it to financial time series. Some researchers have attempted to predict prices numerically and achieve low prediction error, while others have developed classifiers to predict up/down price movement and achieve high accuracy. Similar efforts have been made in forecasting the prices of digital assets. Substantial work has been done in all the fields mentioned above, and it is discussed in detail in a later section. Unlike research on the stock market, where performance is compared on benchmark data sets, research in the cryptocurrency market has no firm agreement on data sources or baseline performance. Beyond the limitations imposed by inadequate data samples, other challenges in developing a high-performance model for forecasting digital asset prices include making predictions on data with a highly skewed class distribution, which may produce high but deceptive accuracy. Moreover, the constantly changing price regime raises obstacles to successfully transferring the learned result to an out-of-sample data set and then delivering steady performance in a real trading environment. Last but not least, price data alone provide inadequate information for predicting future prices, as prices are affected by numerous factors.
B. Contributions
In order to address the issues mentioned above, this study builds models on the largest data set so far, with 119,712,824 trade-by-trade data points ranging from 2019/01/01 to 2019/11/30.

C. Research Flow
Rather than focusing on predicting prices numerically, this study chooses to classify the directional movement of future prices and aims for high accuracy. This choice is made because a numerically close price prediction does not always translate into profitability.¹ However, if one can successfully predict the direction of price movement, then one can profit from a simple up/down strategy: long the asset if the predicted direction is up, and short it otherwise. In contrast to some previous work that makes predictions on an "event-based horizon", essentially predicting the price change over the next few "events" (i.e., order book changes or trades), this study emphasizes making predictions over a fixed time horizon. That is, given input prior to time t, the model predicts the price movement in the next few minutes or seconds relative to time t. The benefits of predicting over a fixed time horizon are considerable. First, such predictions allow more room for latency than event-based predictions, which require the local system to be in the same state as the exchange server to ensure the least slippage; this is beneficial in trading, as a later section will show. Another benefit is that the model is not subject to the intensity of market activity. During intensive trading, more than a thousand trades can take place in one minute, whereas in quiet periods fewer than one hundred trades happen per minute. Predicting the price after a fixed number of trades or order book changes therefore does not deliver a steady monitoring window, because trades do not occur regularly in time. Since no one knows the number of future trades, such predictions can hardly be utilized by human traders. By delivering accurate predictions on fixed time horizons, the model presented in this study provides users with an intuitive understanding of the market, beyond a signal that could only be captured by a machine. In the course of the research, the author also identified the optimal hyper-parameters, such as the prediction horizon and the look-back period, for achieving the highest prediction accuracy.

¹ For example, if a trading strategy is to buy the asset when the predicted price is higher and sell when it is lower, then for a current price of 99, a predicted price of 99.2 with an actual price of 98.9 means buying and losing money. With a predicted price of 98.1, one would make money while having a much larger prediction error of 0.9 instead of 0.3.
Outline: The rest of the paper is organized as follows. Section 2 discusses relevant literature and presents its workflows in detail. Section 3 discusses the data and the preprocessing measures used in this study. Section 4 presents the model and the process of selecting optimal hyper-parameters. Section 5 contains the overall results on test periods, a trading simulation, as well as the experiment on other cryptocurrency trading pairs.

II. RELATED WORK
The Long Short-Term Memory (LSTM) network [6] is an improvement on recurrent neural networks that solves the vanishing gradient problem. LSTM has been successfully applied to areas such as language modeling [7], speech recognition [8], and many others that deal with sequential data. In recent years, many researchers have focused their work on using the LSTM to analyze financial data, and many more have applied it to stock price time series. Among them, [9], [10] use LSTM to analyze limit order book (LOB) data and achieve high prediction accuracy on the benchmark data set FI-2010. Others [11] apply the LSTM combined with auto-encoders to stock price series in order to predict future prices. Besides predicting the price itself, there are also studies [12], [13], [14] that build neural networks to predict the future direction of stock prices and aim for high classification accuracy.

Compared to the substantial work on predicting stock prices, studies using machine learning methods to forecast the prices of digital assets are fewer in number and lower in data precision. Still, some researchers have produced promising results. There has not been much research on predicting the direction of price movement of digital assets. [15] uses the LSTM to predict Bitcoin price movement from daily price data and achieves 52.78% accuracy. [16] reported high accuracy, over 65 percent, on out-of-sample test periods. They used several months of bid-ask prices of Bitcoin sampled at five-second intervals, ranging from 2014 to 2016, and focused on predicting the latest price, essentially the price in the next five seconds. Although their achievement is remarkable, the prediction horizon is too short for the price movement to be large enough to make a profit. In fact, the market prior to 2017 was far less liquid than it is today, and a less liquid market is easier to predict than a liquid one. The high accuracy could therefore be the result of a skewed class distribution: in a less liquid market, most price changes tend to be zero because there is no trade. Nevertheless, their work made a remarkable contribution in showing that the price of Bitcoin is to some extent predictable. There is more work on predicting the numerical price of digital assets, mainly Bitcoin. [17] performed feature selection and applied LSTM to daily price data to predict Bitcoin prices. [18] found that the gated recurrent unit (GRU) model outperforms other models in predicting Bitcoin prices from daily price data. [19] applied LSTM to daily and minute price data to develop an anomaly detection system. Machine learning methods have also been used on other data sources to forecast Bitcoin prices. [20] forecasts the volatility of Bitcoin using LSTM. [21] uses Bayesian neural networks to analyze blockchain data and tries to predict future numerical prices.

Although many works report remarkable results, there is unfortunately no benchmark data set or consensus on research methods or metrics. It is therefore not meaningful to compare accuracy with other works when the prediction horizon and data precision are completely different. To the best of the author's knowledge, there is no work that uses trade-by-trade data at a larger scale to predict short-term price movements of cryptocurrencies on fixed time horizons. This study is the first to extensively explore the process of modeling trade-by-trade data with an LSTM network in order to achieve predictability of price movements. In particular, this study adopts a new practice for creating the training and validation separation, and uses a new method to reduce the number of redundant input examples while keeping high coverage of the data. Both are essential in evaluating the performance achieved on training data and making sure it transfers to an out-of-sample test set.

III. DATA, FEATURE ENGINEERING

TABLE I: A glimpse at trade-by-trade data

TradeID | Timestamp | Price | Amount | IsBuyerMaker
A. Trade-by-trade Data Overview
The trade-by-trade data for training, presented in TABLE I, spans all trades on the BTC-USDT trading pair from 2019/01/01 to 2019/11/30, UTC time. There are 119,712,824 trades in total. Each data point in the trade-by-trade data represents one trade that occurred on the exchange. The details of each trade are captured in four important features. The first is the timestamp ($t$) at which the trade took place, with millisecond precision. The second is the amount ($a$) of the asset transacted in the particular trade; for the BTC-USDT trading pair, it is the amount of Bitcoin that changed hands. The third is the price ($p$) in the quote asset, in this case USDT, at which the transaction settled. The last is a Boolean value, maker ($m$), where "true" indicates that the buyer of the trade placed his or her order on the limit order book. Notice that since there is always a buyer and a seller when an asset changes hands, this Boolean value also indicates whether the trade is the result of active selling or active buying: when it is true, a trader actively sold the asset to a buyer whose order was on the limit order book, and vice versa. Specifically, each data point can be represented as a row vector

$d_u = [t^{(u)}, p^{(u)}, a^{(u)}, m^{(u)}] \in \mathbb{R}^{1 \times 4}$

with $u$ representing the index of the data point in the data set.
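To make the data layout concrete, the following minimal sketch represents a few trade records in the $d_u$ schema above. The column names and the sample values are illustrative assumptions, not rows from the actual data set.

```python
import pandas as pd

# Hypothetical trade records in the d_u = [t, p, a, m] layout described above;
# the values are made up for illustration.
trades = pd.DataFrame(
    [
        (1546300800123, 3701.25, 0.004, True),   # m=True: buyer was the maker (active sell)
        (1546300800456, 3701.30, 0.120, False),  # m=False: buyer was the taker (active buy)
    ],
    columns=["t", "p", "a", "m"],  # timestamp (ms), price (USDT), amount (BTC), maker flag
)
```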
B. Trade-by-trade Data Representation in Time Intervals

Since this study is interested in predicting the price movement over a fixed time horizon, it is critical to resample trade-by-trade data into fixed time intervals. Furthermore, a fixed-time-interval representation allows us to track a steady trailing window over time. Otherwise, if we chose 10,000 trades as the input size, the input could cover the last ten minutes during normal times or the last thirty seconds in some extreme cases. This study adopts the following measures to resample trade-by-trade data into fixed time intervals of length $l$, measured in milliseconds. Start by grouping all the trades into subsets by the time interval they belong to. Specifically, one can obtain a group number $i$ by the operation

$i = \lfloor t^{(u)} / l \rfloor$

The $k$-th subset is $S_k = \{ d_u^{(i)} \mid i = k \}$. Within each subset $S_k$, the trade index is reset so that $u \in \{1, 2, \ldots, n\}$. After putting trades into subsets, the subsets are sorted in the order of their group numbers. Then, in each subset we have:

• Number of Trades is simply the count of individual trades in the subset, $n$.

• Volume is the sum of the amounts in the subset:

$\sum_{i=1}^{n} a_i \quad (1)$

• Active Buy Volume is the sum of the amounts for which maker equals false:

$\sum_{i=1}^{n} a_i \times (1 - m_i) \quad (2)$

• Amplitude is the highest price minus the lowest price:

$\max\{p_i\} - \min\{p_i\} \quad (3)$

• Price Change is the price of the last trade minus the price of the first trade:

$p_n - p_1 \quad (4)$

• Volume-weighted average price is the sum of the notional values of all trades divided by the total volume, where the notional value of one trade is the price of the trade times the amount of the trade:

$price = \frac{\sum_{i=1}^{n} p_i \times a_i}{\sum_{i=1}^{n} a_i} \quad (5)$

• Taker ratio is the sum of the amounts for which maker equals false divided by the total sum of amounts:

$\frac{\sum_{i=1}^{n} a_i \times (1 - m_i)}{\sum_{i=1}^{n} a_i} \quad (6)$

After restructuring individual trades into fixed time intervals, the intervals are sorted according to their group number $i$. Each data point $d'_i$ in the new data structure now represents a fixed time interval with 10 features, and the entire data set is $D' = [d'_1, d'_2, \ldots, d'_i, \ldots, d'_n]^T \in \mathbb{R}^{n \times 10}$. This study focuses on two values of $l$, 60000 and 300000, corresponding to one-minute and five-minute intervals respectively. For one-minute intervals the total number of intervals $n$ is 477,680, and for five-minute intervals it is 95,538. It is worthwhile to clarify how the features were found and selected. The author follows three principles in selecting them: i) represent the dimensions of the original trade-by-trade data, ii) use relative measures to produce stationary features, iii) keep little correlation with other features.

In addition to the input features, change in next m intervals at time $t$, with $t$ representing a time interval, is a non-input feature that represents the price movement over a prediction horizon. It is defined as the price at $t+m$ divided by the price at $t$, minus one. Specifically,

$C_t^{(m)} = \frac{price(t+m)}{price(t)} - 1 \quad (7)$

By varying the parameter $m$, one can obtain the price movement over different prediction horizons. This feature is computed before generating input data based on trailing windows, so that it is available for all data points later during labeling. This paper focuses on m = 15, 30 for the one-minute time interval and m = 6, 24 for the five-minute time interval.
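A minimal sketch of this resampling is given below, assuming the trade records sit in a pandas DataFrame with columns t, p, a, m as above. It re-expresses the described procedure in code; it is not the original implementation.

```python
import pandas as pd

def resample_trades(trades: pd.DataFrame, l_ms: int) -> pd.DataFrame:
    """Aggregate trade-by-trade rows (sorted by t) into intervals of length
    l_ms, computing the features defined in eqs. (1)-(6)."""
    df = trades.copy()
    df["i"] = df["t"] // l_ms                      # group number i = floor(t / l)
    df["taker_amt"] = df["a"] * (~df["m"])         # amount where maker == False
    df["notional"] = df["p"] * df["a"]
    g = df.groupby("i")
    out = pd.DataFrame({
        "n_trades": g.size(),                            # number of trades
        "volume": g["a"].sum(),                          # eq. (1)
        "active_buy_volume": g["taker_amt"].sum(),       # eq. (2)
        "amplitude": g["p"].max() - g["p"].min(),        # eq. (3)
        "price_change": g["p"].last() - g["p"].first(),  # eq. (4)
        "price": g["notional"].sum() / g["a"].sum(),     # VWAP, eq. (5)
    })
    out["taker_ratio"] = out["active_buy_volume"] / out["volume"]  # eq. (6)
    return out.sort_index()

def change_in_next_m(price: pd.Series, m: int) -> pd.Series:
    """C_t^(m) = price(t+m) / price(t) - 1, eq. (7)."""
    return price.shift(-m) / price - 1.0
```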
C. Stationarity Test

Since time series data on asset prices are sensitive to regime changes, it is critical to check whether the restructured data are stationary before feeding them as input to the neural network. A piece of time series data is said to be non-stationary if it has unit roots, indicating that its values are time dependent. This study evaluates whether all features generated for the fixed-time-interval representation are stationary through the frequently used Augmented Dickey-Fuller (ADF) test. The null hypothesis (H0) of the ADF test is that the time series has a unit root; therefore, if the test rejects the null hypothesis, the data are considered stationary. After running the ADF test on all the feature series, the author found all features to be stationary except for price, which has a test statistic of -1.454 and a p-value of 0.55. That test statistic is larger than the critical value at the 10% confidence level, which is -2.566. Therefore, the price feature is adjusted by taking first differences. Specifically, the new value of the feature is

$price'(t) = price(t) - price(t-1) \quad (8)$

After taking differences of the price series, the new p-value is 0.0, showing that it is now stationary.
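A sketch of this check using the statsmodels implementation of the ADF test is shown below; the helper name and the 10%-level decision rule follow the description above.

```python
from statsmodels.tsa.stattools import adfuller

def adjust_if_nonstationary(series):
    """Run the ADF test (H0: the series has a unit root). If H0 cannot be
    rejected at the 10% level, difference the series as in eq. (8) and re-test."""
    stat, pvalue = adfuller(series.dropna())[:2]
    if pvalue > 0.10:
        series = series.diff()                    # price'(t) = price(t) - price(t-1)
        stat, pvalue = adfuller(series.dropna())[:2]
    return series, stat, pvalue
```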
D. Normalization and Labeling

An input example is generated as follows. For prediction time $t > T$, a trailing window of length $T$ is the collection of data from $d_{t-T}$ to $d_t$. An input $X$ can then be defined as $X = [x_1, x_2, x_3, \ldots, x_t, \ldots, x_T]^T \in \mathbb{R}^{T \times 10}$. A min-max normalization is then applied to all columns of the input data. As defined earlier, the author uses change in next m intervals to label the classes of price movement. Based on a threshold ($\epsilon$), a label is assigned as the row vector [1, 0] if the change is above the threshold and [0, 1] if it is below. The threshold is slightly adjusted for each prediction horizon in order to achieve a well-balanced class distribution. It is worth emphasizing the importance of a balanced class distribution: it not only ensures a fair baseline when interpreting test results and comparing against random guesses, but also allows the model to weight each class equally while learning its feature distribution. As we can see, the training data is well balanced, and the baseline performance of a random guess should be about 50% accuracy. The detailed class distribution is presented in TABLE II.

TABLE II: Threshold value and class distribution (l=60000 setups)

m  | ε     | prev-dist. | post-dist.
15 | 0.000 | 50.65%     | 50.65%
30 | 0.000 | 50.80%     | 50.80%
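The per-window normalization and threshold labeling can be sketched as follows. The function is an illustrative reconstruction of the procedure just described, with a guard for constant columns that the text does not specify.

```python
import numpy as np

def make_example(window: np.ndarray, c_m: float, eps: float):
    """Min-max normalize each column of a (T x 10) trailing window and
    build a one-hot label from the forward return C_t^(m)."""
    lo, hi = window.min(axis=0), window.max(axis=0)
    span = np.where(hi - lo == 0.0, 1.0, hi - lo)      # avoid division by zero
    x = (window - lo) / span
    y = np.array([1.0, 0.0]) if c_m > eps else np.array([0.0, 1.0])  # [1,0]=up
    return x, y
```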
E. Training/Testing Allocation

1) Separation between Training and Validation: The data is split into a training set and a validation set as follows. Randomly pick $p$ disconnected time intervals with length $q$ greater than the trailing window length $T$ as validation periods. The rest of the data then consists of $p+1$ disconnected time intervals, which are used as training periods; any training period shorter than $T$ is discarded. This ensures that the validation set and the training set are generated from different time periods and have no intersection with one another. That is, the model never peeks at the data used for validation or testing while learning its parameters. If one fails to achieve this separation, the model will show similar performance on the validation set and the training set because of the overlap in the data, making it likely to overfit the training set and fail to achieve satisfying performance on out-of-sample data. This way of forming the training-validation separation has an important advantage: with a large $p$ and small $q$, it preserves the class distribution in the split data. The class distributions in the training and validation periods are very close to that of the whole training data.
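A sketch of this period-splitting scheme follows. Merging overlapping random picks into disjoint validation periods is an implementation assumption, since the text only requires the periods to be disconnected.

```python
import numpy as np

def split_periods(n_intervals: int, p: int, q: int, T: int, seed: int = 0):
    """Pick p validation periods of length q; the gaps between them become
    training periods, and any period shorter than T is dropped."""
    rng = np.random.default_rng(seed)
    starts = np.sort(rng.choice(n_intervals - q, size=p, replace=False))
    val = []
    for s in starts:
        if val and s <= val[-1][1]:                    # merge overlapping picks
            val[-1] = (val[-1][0], max(val[-1][1], s + q))
        else:
            val.append((s, s + q))
    train, prev = [], 0
    for s, e in val:                                   # complements of val periods
        if s - prev >= T:
            train.append((prev, s))
        prev = e
    if n_intervals - prev >= T:
        train.append((prev, n_intervals))
    return train, val
```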
2) Redundancy vs Quantity Trade-off:
After splitting the training and validation periods, input examples can be generated identically within each period. For a time period of length $q$, the maximum number of input examples that can be generated with intersection is $q - T$. These inputs overlap a great deal. For example, an input that makes a prediction at time $t$ covers data from $t-T$ to $t$, and an input at time $t-1$ covers data from $t-1-T$ to $t-1$; then $T-1$ of the vectors in these two inputs are identical. To reduce such redundancy, one may wish to generate only inputs that have no intersection with other inputs, in which case the maximum number of inputs is $\lfloor q/T \rfloor$. However, this leaves the model too few input examples from which to learn parameters. Therefore, it is crucial to balance the trade-off between generating too many redundant training examples and having too few input examples.
3) Generate Input Examples by Offset Values:
The author approaches this problem by applying a number of offsets when generating non-intersecting inputs within the same time period. With no offset, the first input is generated at $t = T$, so that its starting point is at $t = 0$; the following inputs are generated at $t + n \times T$. An example illustrates this. For a time period of length q=14, the eligible non-intersecting inputs for a trailing window T=4 are at t=4, 8, 12 with no offset. The maximum offset value is T-1=3. Applying an offset value of 2 yields inputs at t=6, 10, 14. In this way, one can expand the number of input examples very quickly with less overlapped data. In this study, the author randomly selects 10 to 50 percent of all eligible offset values to generate input examples in each time period, and the offset value of (period length mod T) is always selected to ensure full coverage of the data. After iterating through all the training and validation periods, all input examples are allocated to the training set or the validation set, respectively. One could of course generate all eligible input data given sufficient computational power; intuitively, having more data would not hurt and might improve the model's performance. However, sacrificing some heavily overlapping inputs does not necessarily prevent the model from learning a good set of parameters. The author chooses to use only part of all the eligible offset values purely to significantly reduce the workload during training while preserving full coverage of the data.
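The offset scheme can be sketched as follows; running it on the worked example above (q=14, T=4) reproduces the stated prediction times.

```python
def windows_with_offsets(period_len: int, T: int, offsets):
    """Non-overlapping trailing windows of length T inside one period, repeated
    for each chosen offset value; windows are (start, end) index pairs with the
    prediction made at `end` (cf. Fig. 1)."""
    out = []
    for off in offsets:
        t = T + off
        while t <= period_len:
            out.append((t - T, t))
            t += T
    return out

print(windows_with_offsets(14, 4, offsets=[0]))  # predictions at t = 4, 8, 12
print(windows_with_offsets(14, 4, offsets=[2]))  # predictions at t = 6, 10, 14
```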
Fig. 1: Illustration of creating examples using offset values.

IV. MODEL ARCHITECTURE

The model proposed in this study is a slightly altered version of a standard long short-term memory (LSTM) network. The first part is an input layer, which takes in the normalized data. The next part is a layer of LSTM cells of length $T$, connected to a drop-out layer with a dropout rate of 0.5. The drop-out layer randomly selects half of the output from the LSTM layer and passes it to the fully connected layer. The fully connected layer, with a softmax activation function, outputs the prediction result at $t = T$.
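A minimal Keras sketch of this architecture follows. The layer choices mirror the description above, while the default argument values are assumptions drawn from the optimal settings reported later in TABLE III.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(T: int, n_features: int = 10, n_units: int = 16) -> keras.Model:
    """Input -> LSTM layer -> dropout(0.5) -> dense softmax over two classes."""
    return keras.Sequential([
        layers.LSTM(n_units, input_shape=(T, n_features)),  # one cell per time step
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),              # [up, down] probabilities
    ])
```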
A. Long Short-Term Memory Network (LSTM)

Recurrent neural networks have proven useful in analyzing time series data, and the LSTM network helps solve the vanishing gradient problem in training a recurrent neural network. The time series is split into vectors $x_t$ and passed sequentially into LSTM cells, one per time step. There are three types of gates in an LSTM cell: the update (input) gate, the output gate, and the forget gate. They are governed by sets of trained parameters and maintain a cell state $c_t$, which is passed between LSTM cells in parallel with the output of each cell, $a_t$. Keeping a cell state allows the network to preserve information that is far from the current time step and draw connections across long time dependencies. To generate the current cell state, the update gate and forget gate decide whether each piece of input information is used. The cell begins by stacking the output of the previous cell $a_{t-1}$ and the raw data input of the current cell $x_t$ to form a combined input $A_t$. The update gate, governed by a set of parameters $W_u$ and a bias term $b_u$, takes the combined input $A_t$ and forms the update gate values $\Gamma_u$ using the sigmoid activation function. Specifically,

$\Gamma_u = \sigma[W_u \times A_t + b_u] \quad (9)$

The same procedure generates the forget gate values $\Gamma_f$; the only difference is that it uses parameters $W_f$ and $b_f$. Specifically,

$\Gamma_f = \sigma[W_f \times A_t + b_f] \quad (10)$

Then a separate set of parameters $W_c$ and a bias parameter $b_c$ evaluate the combined input to form the candidate state $\tilde{c}_t$, using a tanh activation function:

$\tilde{c}_t = \tanh[W_c \times A_t + b_c] \quad (11)$

With the update gate values gating $\tilde{c}_t$ and the forget gate values gating the previous cell state $c_{t-1}$, one obtains the current cell state $c_t$. Specifically,

$c_t = \Gamma_u \times \tilde{c}_t + \Gamma_f \times c_{t-1} \quad (12)$

The output gate determines what information from the cell state is used when forming the output. The output gate values $\Gamma_o$ are obtained by the same process as the update gate and forget gate values described above, with parameters $W_o$ and $b_o$. Specifically,

$\Gamma_o = \sigma[W_o \times A_t + b_o] \quad (13)$

The output is then obtained as the element-wise product of $\Gamma_o$ and the tanh-activated cell state $c_t$. Specifically,

$a_t = \Gamma_o \times \tanh[c_t] \quad (14)$

Fig. 2: Structure of an LSTM cell.

Clearly, the length of the LSTM layer equals the size of the input data, which is the length of the trailing window $T$. Intuitively, the length of the trailing window determines how far back the model can look to predict the future price movement at the current time. It is a hyper-parameter worth tuning because it affects both the training time of a model and its predictive capability. One would assume that a longer trailing window requires more training time, because the input contains more data and the model has more parameters; on the other hand, the model should make better predictions since it is given more information. Another important hyper-parameter is the number of units (N) of the LSTM cell. This affects the dimensionality of the LSTM cell output as well as the total parameter size of the model. As we will see later, more parameters can more easily cause overfitting. Grid searches are performed to find the optimal values of these two hyper-parameters. For different prediction horizons and time interval lengths, the optimal choices of these two hyper-parameters differ; detailed results and some interesting trends are discussed shortly.
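For concreteness, a NumPy sketch of one cell step implementing eqs. (9)-(14) is given below; the dictionary-of-weights layout is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, a_prev, c_prev, W, b):
    """One LSTM step. W and b hold the four parameter sets keyed by
    'u' (update), 'f' (forget), 'c' (candidate), and 'o' (output)."""
    A_t = np.concatenate([a_prev, x_t])            # stacked combined input A_t
    gamma_u = sigmoid(W["u"] @ A_t + b["u"])       # update gate, eq. (9)
    gamma_f = sigmoid(W["f"] @ A_t + b["f"])       # forget gate, eq. (10)
    c_tilde = np.tanh(W["c"] @ A_t + b["c"])       # candidate state, eq. (11)
    c_t = gamma_u * c_tilde + gamma_f * c_prev     # new cell state, eq. (12)
    gamma_o = sigmoid(W["o"] @ A_t + b["o"])       # output gate, eq. (13)
    a_t = gamma_o * np.tanh(c_t)                   # cell output, eq. (14)
    return a_t, c_t
```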
B. Training

Before moving on to the results of the grid search, it is worthwhile to describe the learning-related setup and parameter choices. In this study, the model learns parameters by minimizing the categorical cross-entropy loss with the Adam optimizer, with an initial learning rate of 0.001 that decays by 0.0003 every 15 epochs until it reaches 0.0001. The mini-batch size ranges from 32 to 128, depending on the number of training and validation examples for each setup; a larger mini-batch is gradually decreased to a smaller size during training. This choice is influenced by the finding in [22] that the loss obtained when training with a small batch size is more likely to converge to a flat, broad minimum. Another reason is that small batch sizes tend to have a regularization effect, according to [23]. Training is stopped if the validation loss does not improve for 20 or more epochs, or if it increases by more than 5% from its minimum value; in the latter case, the model is considered to be overfitting the training set. Although lower loss values usually translate into higher accuracy, this correlation does not hold for every epoch trained. Throughout this study, minimizing loss is prioritized over maximizing accuracy. The reason is that loss represents the general predictive capability of a model, which outputs numerical values close to the assigned labels, while prediction accuracy is a result of that capability and can be manipulated by setting different cut-off values in activation functions. Nevertheless, both metrics are reported later. Moreover, the author chooses a softmax activation in the output layer to avoid tuning cut-off values. Training time ranges from a few hours for models with small input sizes and fewer examples to more than a day for large input sizes and more examples. All models are constructed using Keras with the TensorFlow backend and trained on a single NVIDIA GeForce GTX 1070 GPU.
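The setup above might be wired together roughly as follows. This is a sketch: the step-wise learning-rate decay and the 20-epoch early stopping are as described, while the fixed batch size and epoch count stand in for the gradually decreasing batch size and the 5%-above-minimum stopping rule, which would need custom callbacks.

```python
from tensorflow import keras

def lr_schedule(epoch, lr):
    # start at 1e-3, decay by 3e-4 every 15 epochs, floor at 1e-4
    return max(1e-4, 1e-3 - 3e-4 * (epoch // 15))

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
callbacks = [
    keras.callbacks.LearningRateScheduler(lr_schedule),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                  restore_best_weights=True),
]
# x_train, y_train, x_val, y_val come from the pipeline sketched earlier
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=128, epochs=200, callbacks=callbacks)
```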
C. Hyper-parameter Selection

The grid search is performed separately for the two time interval setups over the two hyper-parameters T and N. For the one-minute time interval, the value range of T is {100, 300, 1000, 2000}; for the five-minute time interval, it is {60, 300, 500, 1000}. In both cases the value range of N is {16, 32, 64, 128}. For simplicity, the author reports only the best hyper-parameter pairs, in terms of achieving the lowest validation loss, for each prediction horizon and time interval length in TABLE III. Fig. 3 plots the learning curves for the optimal hyper-parameters. Some interesting observations from the results: i) the optimal number of units is 16 or 32; greater numbers of units tend to overfit the data earlier. ii) Prediction accuracy correlates positively with the length of the trailing window; however, accuracy usually stops improving beyond an optimal value, suggesting that extra information from the distant past does not help make better predictions. iii) Smaller m values and larger l values tend to lead to better performance. By deduction, when predicting the same time horizon (notice that with l=60000, m=30 and with l=300000, m=6, the model is predicting a thirty-minute horizon), the larger l value, which corresponds to the smaller m value, leads to better performance. One potential reason is that a larger l value compresses information to a higher degree; noise is consolidated in this process, leaving clearer signals.
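The grid search itself is a pair of nested loops. The sketch below assumes a hypothetical fit_and_validate helper wrapping the training loop above, and records the best validation loss per (T, N) pair.

```python
import itertools

# one-minute interval grid; the five-minute grid swaps in T in {60, 300, 500, 1000}
results = {}
for T, N in itertools.product([100, 300, 1000, 2000], [16, 32, 64, 128]):
    model = build_model(T=T, n_units=N)
    history = fit_and_validate(model, T)       # hypothetical helper around model.fit
    results[(T, N)] = min(history.history["val_loss"])

best_T, best_N = min(results, key=results.get)
```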
V. EXPERIMENTAL RESULTS

This study conducts experiments on three out-of-sample data sets. The experiments can be broken down into two categories, time interval length l=60000 and l=300000. For each time interval length, the selected models described in the previous section are used to make predictions on different setups in terms of future return horizons. As defined in Section 3, labels are based on the next-m-intervals return: for l=60000, m=15 and 30; for l=300000, m=6 and 24.

Fig. 3: Learning curves for the optimal hyper-parameters. Blue: training loss; Red: validation loss; Yellow: training accuracy; Green: validation accuracy.

TABLE III: Optimal hyper-parameters for each setup

         | l=60000         | l=300000
         | m=15   | m=30   | m=6    | m=24
N        | 16     | 32     | 16     | 16
T        | 300    | 1000   | 300    | 500
Val loss | 0.6812 | 0.6741 | 0.6610 | 0.6791
Val acc  | 57.66% | 58.51% | 62.09% | 58.04%
A. Experiment on BTC-USDT data set
The BTC-USDT out-of-sample data set consists of all trades from 2019/12/01 to 2020/03/01. The same measures described in Section 3 are used to aggregate these trades into fixed time intervals and to label them. In contrast to the practice used in training the models, a threshold of 0 is chosen when producing all the labels. This means that no directional view is taken on the out-of-sample data and no effort is made to produce a balanced class distribution for the testing data. In fact, in a real trading environment, the model very often deals with a skewed distribution. Unlike the training policy, which uses a random subset of all valid input data, the model is tested on all valid input data in chronological order. Doing so not only evaluates accuracy comprehensively on all available test data, but also allows one to track a rolling accuracy, a metric that measures a model's prediction accuracy within a rolling window of time. This metric gives insight into whether performance is stationary in time and points out the regimes where performance is particularly unsatisfying, allowing for further improvement. Ideally, one would wish performance to be steady across the timeline, which would suggest the model is capable of constantly producing robust signals in a real trading environment without any knowledge of the current regime. The overall test results are shown in TABLE IV and the distributions of daily rolling accuracy in Fig. 4. In general, the high performance learned on the training data translates well to the testing data; setups with greater m values tend to lose more accuracy on the testing data.

Fig. 4: Rolling accuracy distribution.

TABLE IV: Results on out-of-sample test periods

         | l=60000         | l=300000
         | m=15   | m=30   | m=6    | m=24
loss     | 0.6845 | 0.6799 | 0.6720 | 0.6812
accuracy | 57.18% | 57.65% | 61.12% | 57.08%
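Rolling accuracy over chronologically ordered test predictions can be computed as below; the one-day window of 288 five-minute intervals is an assumption consistent with the daily rolling accuracy reported in Fig. 4.

```python
import pandas as pd

def rolling_accuracy(y_true, y_pred, window: int = 288):
    """Accuracy over a rolling window of predictions ordered in time;
    288 five-minute intervals correspond to one day (an assumed window)."""
    hits = pd.Series((y_true == y_pred).astype(float))
    return hits.rolling(window).mean()
```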
B. Experiment in a Trading Simulation
Even though the focus of this paper is not to introduce a profitable trading strategy but to show the capability of the deep learning model in delivering robust price movement predictions, a trading simulation is presented below to show that the predictions can easily be monetized. There are three assumptions. The first is that trades in the simulation have no market impact, meaning a trade made by the model does not affect trades from the data. One can achieve this by using a small order size relative to the existing orders on the limit order book. The second assumption is that there is no execution latency. This implies that as soon as the model makes a prediction, it is able to execute the trade at the last price of that time interval. This condition is hardly achievable in real life, because there is friction and trade executions are always subject to the bid-ask spread at a given time, although that spread is usually very small. Finally, a transaction cost of 0.0003 percent per order is applied to all trades executed; such a transaction cost is realistic for some instruments offered by exchanges.

The trading strategy is simply to take a long position when the model predicts upward price movement and a short position when it predicts the other direction. Note that long positions and short positions are kept separate: taking a short position while already holding a long position does not reduce the long position. Positions are closed only at the end of the prediction horizon, which means that when predicting a five-minute horizon, every position is held for five minutes. In reality, one can easily achieve this by using an exchange or broker that allows trading with isolated margin. For simplicity, the author runs only the best-performing model from the test period in the trading simulation: trade-by-trade data resampled into five-minute time intervals, with a look-back period of 300, predicting the price movement over the next 6 intervals. The result is illustrated in Fig. 5.

Fig. 5: Trading simulation result.

As we can see, the trading strategy outperformed Bitcoin in terms of net return. However, it does show significant downturns in revenue. A possible reason is that the strategy trades in every interval, and when the market is less active, the price movements in those periods are so small that they cannot cover the transaction cost. Therefore, even when the strategy makes the right prediction, it can still lose money. Reducing the number of trades to save transaction costs is indeed a direction for developing a more sophisticated strategy that utilizes these predictions well.
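A sketch of the simulation's accounting is given below, under the three assumptions above; position sizing and margin mechanics are simplified to unit notional per trade.

```python
import numpy as np

def simulate(prices: np.ndarray, preds: np.ndarray, horizon: int,
             fee: float = 0.0003 / 100):
    """Go long on an 'up' prediction (1) and short on 'down' (0); hold each
    position for exactly `horizon` intervals, charging `fee` per order
    (0.0003 percent, as in the text). Positions are independent."""
    pnl = 0.0
    for t in range(len(prices) - horizon):
        ret = prices[t + horizon] / prices[t] - 1.0
        side = 1.0 if preds[t] == 1 else -1.0
        pnl += side * ret - 2.0 * fee          # one order to open, one to close
    return pnl
```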
TABLE V: Results on out-of-sample test periods

Instrument | ETH    | BCH    | LTC    | EOS
accuracy   | 60.48% | 60.17% | 59.96% | 60.03%
Fig. 6: Accuracy Distribution on Other Digital Assets
C. Experiment on other cryptocurrencies
In this experiment, the model trained on BTC-USDT data is applied to other cryptocurrency trading pairs in order to examine whether, without additional training, the parameters learned from one instrument remain useful in predicting the price movement of another. In other words, this experiment tells us whether there exist universal features in trade-by-trade data that the LSTM can extract to infer the direction of future prices. Among all digital assets, Ethereum (ETH), Bitcoin Cash (BCH), Litecoin (LTC), and EOS are at the top in terms of market capitalization and trading volume. Similar to the experiment on BTC-USDT, trade-by-trade data for the trading pairs of these cryptocurrencies against USDT are acquired for the range from 2019/12/01 to 2020/02/28. The same experimental setup is adopted as for the BTC-USDT data. For simplicity, only the best setup on BTC-USDT, with l=300000 and m=6, is used in this experiment. The results are shown in TABLE V and Fig. 6. From this experiment, we can see that the prediction accuracy translates very well to these trading pairs. This indicates that there are universal features in trade-by-trade data across instruments that the deep learning model can extract and utilize to make accurate predictions.

VI. CONCLUSION AND FUTURE WORKS
This study is the first to explore the application of a long short-term memory deep learning network to Bitcoin trade-by-trade time series in order to predict price movement over different horizons. With the most extensive and continuous price data, this research unveils the relations between hyper-parameters and learning outcomes when applying an LSTM network to price time series. This study also finds how model performance correlates with the length of the resampling time interval, as well as with the prediction horizon: models with longer time intervals and smaller prediction horizons tend to perform better. In addition, novel research practices are adopted in this article. First, the method for separating training data and validation data helps preserve similar class distributions in the training set and the validation set; it also ensures that the learned parameters transfer well to out-of-sample data. Second, the technique of using random offset values to generate input examples significantly reduces redundancy and training workload while keeping full coverage and high utilization of the data. Finally, using carefully designed features, optimal hyper-parameter choices, and helpful training practices, the model produces satisfying results and successfully transfers its high prediction performance to an out-of-sample test period of over three months. Furthermore, the model maintains its good performance when tested on other instruments that were not part of the training data. This shows that there exist universal features that can be extracted by the deep learning framework presented in this study, and it suggests that the prices of digital assets are to some degree predictable.

Although the models introduced in this study are capable of delivering consistently high-accuracy predictions of Bitcoin price movement, they provide no information about the magnitude of such movement. In fact, the magnitude is actually important in real trading. Therefore, the author will explore new models that are able to advise on both the magnitude and the direction of price movement. In addition, the author would also like to investigate developing a sophisticated trading strategy that fully utilizes this model's predictions, through means such as optimizing the order size or determining the best holding time for each position.
REFERENCES

[1] A. Ang and G. Bekaert, "Stock return predictability: Is it there?" The Review of Financial Studies, vol. 20, no. 3, pp. 651–707, 2006. [Online]. Available: https://doi.org/10.1093/rfs/hhl021
[2] M. A. Ferreira and P. Santa-Clara, "Forecasting stock market returns: The sum of the parts is more than the whole," Journal of Financial Economics, vol. 100, no. 3, pp. 514–537, 2011.
[3] T. Bollerslev, J. Marrone, L. Xu, and H. Zhou, "Stock return predictability and variance risk premia: Statistical inference and international evidence," Journal of Financial and Quantitative Analysis, vol. 49, no. 3, pp. 633–661, 2014.
[4] W. Wei, "Liquidity and market efficiency in cryptocurrencies," Economics Letters, vol. 168, pp. 21–24, 2018.
[5] L. T. Hoang and D. G. Baur, "How stable are stablecoins?" 2020. [Online]. Available: https://ssrn.com/abstract=3519225
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.
[8] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 273–278.
[9] Z. Zhang, S. Zohren, and S. Roberts, "DeepLOB: Deep convolutional neural networks for limit order books," IEEE Transactions on Signal Processing, vol. 67, no. 11, pp. 3001–3012, 2019.
[10] J. Sirignano and R. Cont, "Universal features of price formation in financial markets: perspectives from deep learning," arXiv preprint arXiv:1803.06917, 2018.
[11] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, 2017.
[12] M. Sethi, P. Treleaven, and S. Del Bano Rollin, "Beating the S&P 500 index — a successful neural network approach," 2014, pp. 3074–3077.
[13] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
[14] W.-C. Chiang, D. Enke, T. Wu, and R. Wang, "An adaptive stock index trading decision support system," Expert Systems with Applications, vol. 59, pp. 195–207, 2016.
[15] S. McNally, J. Roche, and S. Caton, "Predicting the price of bitcoin using machine learning," 2018, pp. 339–343.
[16] M. Amjad and D. Shah, "Trading bitcoin and online time series prediction," in Proceedings of the Time Series Workshop at NIPS 2016, ser. Proceedings of Machine Learning Research, O. Anava, A. Khaleghi, M. Cuturi, V. Kuznetsov, and A. Rakhlin, Eds., vol. 55. Barcelona, Spain: PMLR, 2017, pp. 1–15. [Online]. Available: http://proceedings.mlr.press/v55/amjad16.html
[17] C. Wu, C. Lu, Y. Ma, and R. Lu, "A new forecasting framework for bitcoin price with LSTM," 2018, pp. 168–175.
[18] A. Dutta, S. Kumar, and M. Basu, "A gated recurrent unit approach to bitcoin price prediction," Journal of Risk and Financial Management, vol. 13, no. 2, 2019.
[19] S. Chhem, A. Anjum, and B. Arshad, "Intelligent price alert system for digital assets — cryptocurrencies," in UCC '19 Companion: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion, 2019, pp. 109–115.
[20] Y. Liu, "Novel volatility forecasting using deep learning–long short term memory recurrent neural networks," Expert Systems with Applications.
[21] H. Jang and J. Lee, "An empirical study on modeling and prediction of bitcoin prices with Bayesian neural networks based on blockchain information," IEEE Access, vol. 6, pp. 5427–5437, 2018.
[22] N. S. Keskar, J. Nocedal, P. T. P. Tang, D. Mudigere, and M. Smelyanskiy, "On large-batch training for deep learning: Generalization gap and sharp minima," in International Conference on Learning Representations (ICLR), 2017.
[23] D. Wilson and T. Martinez, "The general inefficiency of batch training for gradient descent learning," Neural Networks, vol. 16, no. 10, pp. 1429–1451, 2003.