Are Bitcoins price predictable? Evidence from machine learning techniques using technical indicators
Samuel Asante Gyamerah
Pan African University, Institute for Basic Sciences, Technology, and Innovation, Kenya
Abstract
The uncertainties in future Bitcoin prices make it difficult to accurately predict the price of Bitcoin. Accurately predicting the price of Bitcoin is therefore important for the decision-making of investors and market players in the cryptocurrency market. Using historical data from 01/01/2012 to 16/08/2019, machine learning techniques (generalized linear model via penalized maximum likelihood, random forest, support vector regression with linear kernel, and a stacking ensemble) were used to forecast the price of Bitcoin. The prediction models employed key, high-dimensional technical indicators as the predictors. The performance of these techniques was evaluated using mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R-squared). The performance metrics revealed that the stacking ensemble model with two base learners (random forest and generalized linear model via penalized maximum likelihood) and support vector regression with linear kernel as meta-learner was the optimal model for forecasting Bitcoin price, attaining the best MAPE, RMSE, MAE, and R-squared values among the models considered.

Correspondence: [email protected]

Preprint September 4, 2019

Keywords:
Bitcoin volatility, machine learning, stacking ensemble, Bitcoin price forecasting, technical indicators
1. Introduction
Bitcoin is considered the world's largest digital currency by market capitalisation, estimated at $182,675,714,614 [1]. Bitcoin has generated substantial returns for market players and investors alike. Nevertheless, there are strong fluctuations in the price of Bitcoin [2], leading to price uncertainties; a situation that threatens its potential to function as a currency. Bitcoin is therefore seen as a highly volatile currency. Market players and analysts have attributed the high price volatility of Bitcoin to different factors. Among these factors are: a relatively small market compared to traditional assets such as fiat currencies, bonds, and stocks; low liquidity, which increases price fluctuations; regulation problems and failures; news events; shifting sentiments; and high speculation. The volatile nature of Bitcoin makes price prediction very difficult for most investors and market players. Hence, we develop machine learning models that can accurately forecast the price of Bitcoin to help investors and market players.
2. Machine Learning Forecasting Techniques
In this study, we use machine learning as a tool for forecasting the price of Bitcoin. The choice of an optimal machine learning algorithm is a major factor to consider in any forecasting problem. For this reason, the chosen machine learning technique should be able to forecast the price of Bitcoin with a small margin of error.
2.1. Support vector regression (SVR)

A generalized version of the support vector machine (SVM), called support vector regression (SVR), was proposed by [8] in 1996. The output model of SVR relies solely on a subsample of the training data, because the cost function for constructing the SVR model ignores any training data that is close to the model prediction. SVR also uses kernels and has proven to be a functional and versatile tool in most real-valued function estimation problems. The following steps can be used to implement SVR.
Step 1. Given a training dataset {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} ⊂ K × R, where K is a high-dimensional space of the input patterns (K = R^d).

Step 2. A nonlinear (NL) regression problem can be changed into a linear regression problem in K by making use of a linear function called the SVR function,

h(x) = v^T · τ(x) + b,   v ∈ K, b ∈ R,   (1)

where h(x) is the forecasted Bitcoin price and the coefficients v and b can be tuned.

Step 3. The observed risk R(h) can be determined as

R(h) = (1/N) Σ_{i=1}^{N} ψ_ε(y_i, h(x_i)),   (2)

where ψ_ε(y_i, h(x_i)) represents the ε-insensitive loss function defined as

ψ_ε(y_i, h(x_i)) = |h(x_i) − y_i| − ε  if |h(x_i) − y_i| ≥ ε, and 0 otherwise.   (3)

The purpose of the ε-insensitive loss function is to restrict the way the model generalizes.

Step 4. Using a quadratic optimization problem with inequality constraints, the errors between the training data and the ε-insensitive loss function can be estimated:

minimize   (1/2)‖v‖² + λ Σ_{i=1}^{N} (ϑ_i + ϑ*_i)
subject to   y_i − ⟨v, x_i⟩ − b ≤ ε + ϑ_i,   i = 1, 2, ..., N,
             ⟨v, x_i⟩ + b − y_i ≤ ε + ϑ*_i,   i = 1, 2, ..., N,
             ϑ_i, ϑ*_i ≥ 0,   i = 1, 2, ..., N.   (4)

λ > 0 determines the trade-off between the training error and the flatness of h. While the first part of the objective function penalizes large weights, regularizes the size of the weights, and preserves the flatness of the regression function, the second part penalizes the training errors associated with h(x) and y. Some errors can be allowed by introducing the slack variables ϑ_i, ϑ*_i to deal with otherwise infeasible constraints.

Step 5. By solving equation 4, v can be estimated as

v = Σ_{i=1}^{N} (α*_i − α_i) τ(x_i),   (5)

where α*_i, α_i are the Lagrange multipliers.

Step 6.
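As a concrete illustration, the ε-insensitive loss function and the observed risk above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the code used in the study; the function names are ours:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Epsilon-insensitive loss (equation 3): deviations smaller than
    eps cost nothing; larger deviations are penalised linearly."""
    r = abs(y_pred - y_true)
    return r - eps if r >= eps else 0.0

def observed_risk(y, h, eps=0.1):
    """Observed risk R(h) (equation 2): average loss over the sample."""
    return sum(eps_insensitive_loss(yi, hi, eps) for yi, hi in zip(y, h)) / len(y)
```

For example, with eps = 0.1 a prediction error of 0.05 contributes nothing to the risk, while an error of 0.5 contributes 0.4.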
The SVR function is set up as

h(x) = Σ_{i=1}^{N} (α*_i − α_i) K(x_i, x_j) + b,   K(x_i, x_j) = e^{−κ‖x_i − x_j‖},   κ > 0,   (6)

where K(·,·) is a kernel function.

Generally, the performance of SVR depends on the settings of three global parameters: the cost C, which controls the trade-off between model complexity and the extent to which deviations larger than ε are tolerated; ε, which controls the width of the insensitive zone; and the kernel function K. Selecting optimal values for these parameters is complicated, since the performance of SVR depends on all three.

2.2. Random forest (RF)

Random forest is an ensemble approach based on the idea that an ensemble of weak learners (decision trees), when combined, results in a strong learner [9, 10]. Using Breiman's bagger, each of the variables is considered in every split. Due to the Strong Law of Large Numbers, over-fitting is not a problem in random forest; for this reason, RF always converges. The strength of each single-tree classifier and a measure of their dependencies contribute to the accuracy of random forest. For the implementation of the random forest algorithm, the interested reader should see [10].

For optimal performance of a random forest model, the number of trees (ntree) and the number of variables sampled as candidates at each split (mtry) must be carefully selected. For regression problems, mtry = n (where n is the number of features used for the prediction). The fraction of the training data that is randomly selected to propose the next tree in the expansion is called the subsampling fraction or bag.fraction. The default value of bag.fraction is 0.5; however, this value can be increased if the training sample is small.

2.3. Generalized linear model via penalized maximum likelihood (GLMNET)

Generalized linear model via penalized maximum likelihood is a highly robust method for fitting the entire lasso or elastic-net regularization path for linear regression [11].
GLMNET can take advantage of sparsity in the features. It can fit linear, multi-response linear, multinomial, logistic, and Poisson regression models, and different predictions can be obtained from the fitted regression models. GLMNET solves the following problem,

min_{α_0, α}   (1/N) Σ_{i=1}^{N} w_i L(y_i, α_0 + α^T x_i) + λ [(1 − γ)‖α‖²₂/2 + γ‖α‖₁],   (7)

over a grid of values of λ covering the full path. L(y_i, ϑ) is the negative log-likelihood contribution for data point i. The elastic-net penalty is controlled by γ, which bridges the gap between the lasso (γ = 1) and ridge (γ = 0) penalties. The tuning parameter λ regulates the overall strength of the penalty. The ridge penalty shrinks the coefficients of correlated features towards each other, while the lasso penalty tends to pick one of the features and drop the remaining ones. The elastic-net penalty combines the ridge and lasso penalties; if features are correlated in groups, a γ close to 0.5 tends to select or drop the groups together.

2.4. Stacking ensemble

Ensemble learning is a family of machine learning meta-algorithms in which "weak learners" are trained and combined into one predictive model to reduce bias (boosting), reduce variance (bagging), or increase the accuracy of predictions (stacking). The concept behind ensemble methods is that when weak learners are rightly combined, the resulting model is more robust than the individual weak learners. Stacking is less widely used than boosting and bagging [12]. In contrast to boosting and bagging, stacking may be used to combine models of different types. In stacking, a new model, the meta-regressor, learns how to optimally combine the predictions of other existing models, the weak learners. That is, the base-level weak models (made up of different learning algorithms) are trained on the training dataset, and a meta-model is trained using the outputs of the base-level models as features. Hence, stacking ensemble learning can be considered a "heterogeneous ensemble model".
The literature [4, 13] suggests that predictive models based on stacking ensembles are usually better than individual models. Figure 1 is a visual diagram of the stacking ensemble scheme.

Figure 1: A graphical representation of the stacking ensemble scheme.
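To make the idea concrete, the sketch below shows a minimal stacking scheme in Python. It is only an illustration of the concept, not the study's implementation: the base-learner predictions here are hypothetical stand-ins, and the meta-learner is a closed-form linear combiner rather than the SVR meta-learner used in this paper.

```python
def meta_weight(y, p1, p2):
    """Fit the meta-learner: find the weight w minimising the squared
    error of the combined prediction w*p1 + (1-w)*p2 on the targets y."""
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den

def stacked_predict(w, p1, p2):
    """Combine the base-learner predictions with the fitted weight."""
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]

# Hypothetical base-learner predictions on a small training set:
y  = [1.0, 2.0, 3.0]   # actual prices
p1 = [1.1, 2.1, 3.1]   # base learner 1 (e.g. RF)
p2 = [0.5, 1.5, 2.5]   # base learner 2 (e.g. GLMNET)
w = meta_weight(y, p1, p2)
combined = stacked_predict(w, p1, p2)
```

Here the combiner recovers the targets exactly because the two base errors happen to cancel at the fitted weight; on real data the meta-learner only reduces, rather than eliminates, the error.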
3. Methodology
Daily data on Bitcoin prices and other indicators (High, Low, Open, Volume, SMA5, SMA13, SMA20, SMA30, SMA50, EMA5, EMA12, EMA26, EMA50, MACDLine, MACDSignalLine, MACDHistogram, SMABollBands5, BBands5Up, BBands5Down, SMABollBands13, BBands13Up, BBands13Down, SMABollBands20, BBands20Up, BBands20Down, Volatility) were taken from the CryptoCompare website. The daily dataset spans from 01/01/2012 to 16/08/2019.

Figure 2: Bitcoin closing price

Figure 3: Volatility of Bitcoin

Table 1: Descriptive statistics of the Bitcoin dataset from 01/01/2012 to 16/08/2019

        Min     Max        Mean     Std. Dev.
Close   4.22    19345.49   2259.77  3422.389
High    4.40    19870.62   2330.46  3547.90
Low     3.88    18750.91   2173.56  3264.26
Open    4.22    19346.60   2256.02  3419.13
3.1. Technical indicators

Technical analysis of a cryptocurrency is founded on the assumption that all the important information about a specific cryptocurrency is incorporated in its price and/or other market data, such as the volume traded. That is, the dynamics of the historical price and other market data drive the decisions of market players and investors in the cryptocurrency market. Technical indicators are important tools that can be used to transform price patterns into actionable trading plans; they can therefore be used as features to predict future prices. By applying simple but relevant rules to historical price data, different technical indicators can be generated. The objective of a technical indicator in a cryptocurrency market is to analyze trends in the price of a cryptocurrency in order to forecast its future price. Below are the technical indicators that were calculated from the extracted Bitcoin time series data. These indicators are transformed into features for the forecasting models.

Simple Moving Average (SMA): A type of moving average that computes the arithmetic average price over a specific period,
SMA_w = (1/w) Σ_{i=0}^{w−1} C_{d−i},   (8)

where C_d is the closing price on day d and w is the window size. Simple moving averages of order 5 (SMA5), order 13 (SMA13), order 20 (SMA20), order 30 (SMA30), and order 50 (SMA50) were extracted from the CryptoCompare website.

Exponential Moving Average (EMA): A moving average in which the weights of historical prices decrease exponentially. It calculates an exponentially weighted mean, giving more weight to recent observations,

EMA_w(d) = k · C_d + (1 − k) · EMA_w(d − 1),   k = 2/(w + 1).   (9)

EMAs of order 5, 12, 26, and 50, denoted EMA5, EMA12, EMA26, and EMA50, were extracted from the CryptoCompare website.

Weighted Moving Average (WMA): WMA is similar to an EMA, but with linear weighting if the length of the weights is equal to w.

Average True Range (ATR): It measures the volatility of a High-Low-Close series,

ATR_w = EMA_w( max(H_d − L_d, |H_d − C_{d−1}|, |L_d − C_{d−1}|) ),   (10)

where H_d, L_d, and C_d are the price high, price low, and closing price on day d, respectively.

Chaikin Accumulation/Distribution line (AD): It measures the money flowing into or out of the Bitcoin market,

AD_d = AD_{d−1} + (((C_d − L_d) − (H_d − C_d)) / (H_d − L_d)) · V_d,   (11)

where V_d is the volume traded on day d.

Commodity Channel Index (CCI): It identifies cyclical turns in the Bitcoin price. CCI can be used to evaluate whether Bitcoin is overbought or oversold,

CCI_w = (Σ_d − SMA_w(Σ_d)) / (0.015 · (1/w) Σ_{i=1}^{w} |Σ_{d−i+1} − SMA_w(Σ_d)|),   with   Σ_d = (H_d + L_d + C_d)/3.   (12)

Rate of Change (ROC): It calculates the rate of change of the Bitcoin closing price over a period of time,
ROC_w = (C_d − C_{d−w}) / C_{d−w}.   (13)

Momentum (MOM): It measures the change in price relative to the actual price level,

MOM_w = C_d − C_{d−w}.   (14)

Moving Average Convergence Divergence (MACD): It is one of the most popular and widely used technical indicators. It uses moving averages to determine the momentum of a cryptocurrency. The three components of MACD were calculated: the MACD line, the signal line, and the histogram. The MACD line is the difference between the 12-period EMA and the 26-period EMA, the MACD signal line is a 9-period EMA of the MACD line, and the MACD histogram is the difference between the MACD line and the MACD signal line. MACDLine, MACDSignalLine, and MACDHistogram were obtained from the CryptoCompare website.

MACD line = 12-period EMA − 26-period EMA
MACD signal line = 9-period EMA of the MACD line
MACD histogram = MACD line − signal line   (15)

Bollinger Bands (BBands/BollBands): A method used to compare a cryptocurrency's volatility and price levels over a period of time. The upper (Up) and lower (Down) BBands were also calculated; they are computed as standard deviations above and below the moving average.

Stochastic Oscillator (stochOSC): A momentum indicator that relates the location of each day's closing price to the high/low range over the past w periods,

stochOSC = (C_d − LL_w) / (HH_w − LL_w),   (16)

where LL_w and HH_w are, respectively, the lowest low and highest high prices over the previous w days.

3.2. Data pre-processing

To make the data more relevant for the machine learning forecasting models, the time series data was pre-processed.
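The moving-average indicators above can be sketched as follows. This is only an illustrative Python re-implementation (the study obtained these series from CryptoCompare and R), and the smoothing constant k = 2/(w+1) with a first-price seed is one common EMA convention, assumed here:

```python
def sma(prices, w):
    """Simple moving average (equation 8): arithmetic mean of the last
    w closing prices, for each day from day w onwards."""
    return [sum(prices[i - w + 1:i + 1]) / w for i in range(w - 1, len(prices))]

def ema(prices, w):
    """Exponential moving average: exponentially decaying weights,
    seeded with the first price (one common convention)."""
    k = 2 / (w + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(p * k + out[-1] * (1 - k))
    return out

def macd_line(prices):
    """MACD line (equation 15): 12-period EMA minus 26-period EMA."""
    return [a - b for a, b in zip(ema(prices, 12), ema(prices, 26))]
```

For a constant price series both EMAs coincide, so the MACD line is zero, reflecting the absence of momentum.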
The Bitcoin time series data (Close, High, Low, and Volume) were transformed into a set of ten (10) additional technical indicators, which differ from the technical indicators extracted from the CryptoCompare website. These technical indicators are widely used in the financial market literature and help in price forecasting.
The Bitcoin time series data were converted to the same scale without changing the differences in the range of the price values. The minimum-maximum formula (see equation 17) was used to normalize the dataset into the range [0, 1]:

x_normalize = (x − minimum(x)) / (maximum(x) − minimum(x))   (17)

x = x_normalize · (maximum(x) − minimum(x)) + minimum(x)   (18)

where maximum(x), minimum(x), and x_normalize are the maximum value, the minimum value, and the normalized value of the inputs, respectively. The R statistical software was used to implement the data normalization.

Feature selection is an important step in the Bitcoin forecasting problem. The Boruta algorithm was used to select the most important features for the forecasting models. Boruta is a feature ranking and selection machine learning algorithm that uses a wrapper approach built on the RF algorithm. It iteratively eliminates the features that are less important than random probes. The Boruta package in R [14] was used to select the most important features.
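A minimal sketch of the min-max transform in equations 17 and 18 (the study used R; this Python version is only for illustration, and the function names are ours):

```python
def minmax_fit(xs):
    """Record the minimum and maximum of the raw inputs."""
    return min(xs), max(xs)

def minmax_normalize(xs, lo, hi):
    """Scale inputs into [0, 1] (equation 17)."""
    return [(x - lo) / (hi - lo) for x in xs]

def minmax_invert(zs, lo, hi):
    """Map normalized values back to the original price scale (equation 18)."""
    return [z * (hi - lo) + lo for z in zs]
```

Forecasts produced on the normalized scale must be passed through the inverse transform (equation 18) before computing USD-denominated errors.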
Equations 19–22 are the metrics used in evaluating the performance of the forecasting models.

Root mean squared error (RMSE),
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (A_i − F_i)² )   (19)

Mean absolute error (MAE),

MAE = (1/N) Σ_{i=1}^{N} |A_i − F_i|,   (20)

Mean absolute percentage error (MAPE),

MAPE = (1/N) Σ_{i=1}^{N} (|A_i − F_i| / |A_i|) × 100,   (21)

Coefficient of determination/R-squared (R²),

R² = 1 − Σ_{i=1}^{N} (A_i − F_i)² / Σ_{i=1}^{N} (A_i − Ā)²,   (22)

where A_i, Ā, and F_i are the actual, mean actual, and forecasted Bitcoin prices, respectively. In comparing the techniques, the model that gives lower RMSE, MAE, and MAPE values is considered the best model with respect to these metrics, while the model with the larger R-squared value is considered the best model when using R-squared as the performance metric. The RMSE and MAE measures range from 0 to ∞, the MAPE measure (equation 21) ranges from 0 to 100%, and R-squared measures the degree of relationship between the forecasted and the real price data and ranges from 0 to 1. In all the machine learning techniques, the testing data was used to evaluate and validate the performance of the model.
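The four evaluation metrics in equations 19–22 translate directly into code (an illustrative Python version; A is the list of actual prices and F the list of forecasts):

```python
import math

def rmse(A, F):
    """Root mean squared error (equation 19)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(A, F)) / len(A))

def mae(A, F):
    """Mean absolute error (equation 20)."""
    return sum(abs(a - f) for a, f in zip(A, F)) / len(A)

def mape(A, F):
    """Mean absolute percentage error (equation 21)."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(A, F)) / len(A)

def r_squared(A, F):
    """Coefficient of determination (equation 22)."""
    a_bar = sum(A) / len(A)
    ss_res = sum((a - f) ** 2 for a, f in zip(A, F))
    ss_tot = sum((a - a_bar) ** 2 for a in A)
    return 1 - ss_res / ss_tot
```

A perfect forecast gives RMSE = MAE = MAPE = 0 and R² = 1, which is the direction of improvement used to rank the models.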
4. Experimental results and discussion
4.1. Feature selection

Boruta performed 99 iterations in 43.2388 minutes, and 34 attributes were confirmed important. One attribute (volume from (volumeF)) was considered unimportant, and two attributes (average true range (atr) and volume to (volume)) were considered tentative. Figure 4 displays the Boruta result plot for the technical indicators. The plot shows the importance of each of the technical indicators. The columns in green are the 'confirmed' technical indicators, and the column in red is the rejected one. The two tentative attributes are shown as yellow columns. The blue bars (shadowMin, shadowMax) are not technical indicators but are used by the Boruta algorithm to determine whether an indicator is important. Table 2 presents the mean importance of the technical indicators from the Boruta algorithm.

Figure 4: Boruta result plot for technical indicators.

Table 2: Selected features using the Boruta algorithm
Feature          meanImp   Feature          meanImp   Feature          meanImp
Low              8.6621    EMA12            5.5459    BBands5Down      4.5395
stochOSC         8.5909    SMABollBands13   5.5384    MACDline         4.4343
High             8.5353    SMA13            5.5015    BBands13Down     4.2226
cci              8.0096    SMA30            5.3326    ad               3.9548
WMA5             7.3011    EMA26            5.3188    Volatility       3.5332
EMA5             7.3638    SMABollBands20   5.2633    roc              3.4091
Open             6.8225    SMA20            5.2340    MACDSignalLine   3.3965
SMABollBands5    6.0390    BBands5Up        5.1564    mom              2.7701
meanMW           6.0287    BBands20Up       5.0809
SMA5             5.9783    BBands13Up       5.0423
WMA50            5.9061    BBands20Down     5.0072
medianMW         5.8782    EMA50            4.9579
SMA50            5.7759    MACDHistogram    4.5757

4.2. Forecasting with ML techniques

Using the training data, the four ML techniques were fine-tuned to select the optimal parameter values for the forecasting models. For the generalized linear model via penalized maximum likelihood, resampling was done with 10-fold cross validation repeated 6 times. The smallest root mean square error value was used to select the best model; the final parameter values used to construct the model were alpha = 1 (pure lasso regression) and a small lambda value. For the random forest algorithm, resampling was done with 12-fold cross validation repeated 8 times. Using the smallest root mean square error value, the best random forest model was selected for the training model; the final parameter values were ntree = 2500, mtry = 13, and bag.fraction = 0.75. Support vector regression was tuned using 10-fold cross validation. The final parameters after fine-tuning the model were: svm type = epsilon-regression, svm kernel = linear kernel, cost = 0.07, epsilon = 0.1, and number of support vectors = 18.

Table 3: Evaluation metric values for the GLMNET, RF, SVR with linear kernel, and stacking ensemble algorithms

           MAPE (%)            RMSE (USD)          MAE (USD)           R-Squared
           Testing  Training   Testing  Training   Testing  Training   Testing  Training
GLMNET
RF
SVR
Stacking   0.0191
Figure 5: Real and predicted Bitcoin price for the testing and training datasets using the generalized linear model via penalized maximum likelihood

Figure 6: Real and predicted Bitcoin price for the testing and training datasets using random forest

Figure 7: Real and predicted Bitcoin price for the testing and training datasets using support vector regression with linear kernel

Figure 8: Real and predicted Bitcoin price for the testing and training datasets using the stacking ensemble

Figure 9: Forecasting error of the stacked ensemble model

5. Conclusion

In the presence of high Bitcoin price volatility, accurate and reliable forecasting models for the Bitcoin price are very important for investors and market players. Three machine learning models (generalized linear model via penalized maximum likelihood, random forest, and support vector regression with linear kernel) were used to predict the price of Bitcoin in the midst of price uncertainties. The construction of a stacking ensemble model, using the generalized linear model via penalized maximum likelihood and random forest as the base learners and support vector regression with linear kernel as the meta-learner, reduced the prediction error of the three individual machine learning models, which was already low to begin with. Clearly, the stacking ensemble was effective in fine-tuning a model to attain a nearly perfect prediction. The performance metrics (mean absolute percentage error, root mean square error, mean absolute error, and coefficient of determination) showed that the stacking ensemble model was the optimal model for predicting the testing data. However, this result does not lead to the conclusion that the stacking ensemble model is superior to the other models in general; the performance of a model under different conditions should be studied and understood. By employing machine learning techniques, the closing price of Bitcoin has been forecasted. Even though the price of Bitcoin is very volatile, the machine learning models were able to accurately forecast it.

Conflict of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.
Data Availability
Data for this work are available from the author upon request.
References

[1] CoinMarketCap, Top 100 cryptocurrencies by market capitalization, https://coinmarketcap.com/, 2019 (accessed 27 August 2019).

[2] S. A. Wolla, Bitcoin: Money or financial investment?, Page One Economics®