A Time Series Analysis-Based Stock Price Prediction Using Machine Learning and Deep Learning Models
Sidra Mehtab & Jaydip Sen
School of Computing and Analytics, NSHM Knowledge Campus Kolkata – 700053, INDIA
Email: [email protected]
Abstract
Prediction of the future movement of stock prices has always been a challenging task for researchers. While the advocates of the efficient market hypothesis (EMH) believe that it is impossible to design any predictive framework that can accurately predict the movement of stock prices, there is seminal work in the literature that has clearly demonstrated that the seemingly random movement patterns in the time series of a stock price can be predicted with a high level of accuracy. The design of such predictive models requires a choice of appropriate variables, the right transformation methods for the variables, and tuning of the parameters of the models. In this work, we present a very robust and accurate framework of stock price prediction that consists of an agglomeration of statistical, machine learning and deep learning models. We use the daily stock price data, collected at five-minute intervals, of a very well-known company that is listed on the National Stock Exchange (NSE) of India. The granular data is aggregated into three slots in a day, and the aggregated data is used for building and training the forecasting models. We contend that the agglomerative approach of model building, which uses a combination of statistical, machine learning, and deep learning approaches, can very effectively learn from the volatile and random movement patterns in stock price data. This effective learning will lead to robustly trained models that can be deployed for short-term forecasting of stock prices and prediction of stock movement patterns. We build eight classification and eight regression models based on statistical and machine learning approaches. In addition to these models, a deep learning regression model using a long-and-short-term memory (LSTM) network is also built. Extensive results have been presented on the performance of these models, and the results are critically analyzed.
Keywords:
Stock Price Prediction, Multivariate Regression, Logistic Regression, Decision Tree, K-Nearest Neighbor, Artificial Neural Networks, Random Forest, Bagging, Boosting, Support Vector Machines, Long and Short-Term Memory Networks, Multivariate Adaptive Regression Splines.
JEL Classification:
G 11, G 14, G 17, C 63
Introduction
Prediction of future movement patterns of stock prices has been a widely researched area in the literature. While there are proponents of the efficient market hypothesis who believe that it is impossible to predict stock prices, there are also propositions demonstrating that, if correctly formulated and modeled, prediction of stock prices can be done with a fairly high level of accuracy. The latter school of thought has focused on the construction of robust statistical, econometric and machine learning models based on the careful choice of variables and appropriate functional forms or models of forecasting. There are propositions in the literature that are based on time series analysis and decomposition for forecasting future values of stocks. In this regard, Sen and Datta Chaudhuri presented an approach to stock price forecasting based on time series decomposition that yielded a high level of accuracy in forecasting (Sen & Datta Chaudhuri, 2018a; Sen & Datta Chaudhuri, 2018b; Sen & Datta Chaudhuri, 2017a; Sen & Datta Chaudhuri, 2017b; Sen & Datta Chaudhuri, 2017c; Sen & Datta Chaudhuri, 2017d; Sen & Datta Chaudhuri, 2016). There is also an extensive literature that deals with the technical analysis of stock price movements. Propositions also exist for mining stock price patterns using various important indicators like Bollinger Bands, moving average convergence divergence (MACD), relative strength index (RSI), moving average (MA), stochastic momentum index (SMI), etc. There are also well-known patterns like the head and shoulders pattern, inverse head and shoulders pattern, triangle, flag, Fibonacci fan, Andrew's Pitchfork, etc., which are exploited by traders for investing intelligently in the stock market. These approaches provide the user with visual manifestations of the indicators, which help ordinary investors understand which way stock prices are more likely to move in the near future.
In this paper, we propose a granular approach to forecasting stock price and the price movement pattern by combining several statistical, machine learning and deep learning methods of prediction on technical analysis of stock prices. We present several approaches for short-term stock price movement forecasting using various classification and regression techniques and compare their performance in the prediction of stock price movement. We believe this approach will provide useful information to investors in the stock market who are particularly interested in short-term investments for profit. This work is a modified and extended version of our previous work (Mehtab & Sen, 2019). In the present work, we present a predictive framework that aggregates eight classification and eight regression models including a long-and-short-term memory (LSTM)-based advanced deep learning model. The objective of our work is to take stock price data at five-minute intervals from the National Stock Exchange (NSE) of India and develop a robust forecasting framework for the stock price movement. Our contention is that such a granular approach can model the inherent dynamics and can be fine-tuned for immediate forecasting of stock price or stock price movement. Here, we are not addressing the problem of forecasting the long-term movement of the stock price. Rather, our framework will be more relevant to a trade-oriented framework. The rest of the paper is organized as follows. Section 2 presents a comprehensive review of the literature on stock price movement modelling and prediction. In Section 3, we present a detailed discussion of the methodology that we have followed in this work. Section 4 provides a brief discussion of the working principles of the classification and the regression models in machine learning that we have used in this work.
In Section 5, we provide a summary of the LSTM-based deep learning model for regression that we have also used in our predictive model. Section 6 presents a detailed discussion of the performance of the machine learning and deep learning models. A comparative analysis of the performances of the models is also presented in this section. Finally, Section 7 concludes the paper.
Related Work
The literature attempting to prove or disprove the efficient market hypothesis can be classified into three strands, according to the choice of variables and techniques of estimation and forecasting. The first strand consists of studies using simple regression techniques on cross-sectional data (Basu, 1983; Jaffe et al., 1989; Rosenberg et al., 1985; Fama & French, 1995; Chui & Wei, 1998). The second strand of the literature has used time series models and econometric tools like autoregressive integrated moving average (ARIMA), the Granger causality test, autoregressive distributed lag (ARDL) and quantile regression (QR) to forecast stock prices and returns (Jarrett & Kyper, 2011; Adebiyi et al., 2014; Mondal et al., 2014; Mishra, 2016). The third strand includes work using machine learning tools for the prediction of stock returns (Mostafa, 2010; Dutta et al., 2006; Wu et al., 2008; Siddiqui & Abdullah, 2015; Jaruszewicz & Mandziuk, 2004). Among the recent propositions in the literature on stock price prediction, Mehtab and Sen have demonstrated how machine learning and long- and short-term memory (LSTM)-based deep learning networks can be used for accurately forecasting NIFTY 50 stock price movements in the National Stock Exchange (NSE) of India (Mehtab & Sen, 2019). The authors used the daily stock prices for three years, from January 2015 till December 2017, for building the predictive models. The forecast accuracies of the models were then evaluated based on their ability to predict the movement patterns of the close value of the NIFTY index on a time horizon of one week. For the purpose of testing, the authors used NIFTY 50 index values for the period of January 2018 till June 2019. To further improve the predictive power of the models, the authors incorporated a sentiment analysis module for analyzing the public sentiments on Twitter on NIFTY 50 stocks.
The output of the sentiment analysis module is fed into the predictive model in addition to the past NIFTY 50 index values for building a very robust and accurate forecasting model. The sentiment analysis module uses a self-organizing fuzzy neural network (SOFNN) for handling non-linearity in a multivariate predictive environment. Mehtab and Sen recently proposed another approach to stock price and movement prediction using convolutional neural networks (CNN) on a multivariate time series (Mehtab & Sen, 2020). The predictive model proposed by the authors exploits the learning ability of a CNN with walk-forward validation so as to realize a high level of accuracy in forecasting the future NIFTY index values and their movement patterns. Three different architectures of CNN are proposed by the authors that differ in the number of variables used in forecasting, the number of sub-models used in the overall system, and the size of the input data for training the models. The experimental results clearly indicated that the CNN-based multivariate forecasting model was highly accurate in predicting the movement of NIFTY index values with a weekly forecast horizon. The design of efficient predictive models and algorithms for accurately forecasting the movement patterns of stock prices and stock returns has attracted considerable attention and effort from the research community over a considerably long period of time. Many such propositions involve the application of various types of neural networks. Neural networks have the ability to model nonlinearity in data, and this property is proven to be extremely effective in mining
[...] probabilistic neural network (PNN) using historical stock market data. The forecasted output of the model was applied to form various index trading strategies, and the effectiveness of those strategies was compared with those generated by the buy and hold strategy, the investment strategies formed using the output of a random walk model, and the parametric generalized method of moments (GMM) with a Kalman filter. The results showed that the investment strategies made using the output of the PNN yielded the highest return on investment in the long run. de Faria et al. illustrated a predictive model using a neural network and an adaptive exponential smoothing method for forecasting the movements of the principal index of the Brazilian stock market (de Faria et al., 2009). The authors compared the forecasting performance of both the neural network and the exponential smoothing models with a particular focus on the sign of the market returns. While the simulation results showed that both methods were equally efficient in predicting the index returns, the neural network model was found to be more accurate in predicting the market movement than the adaptive exponential smoothing method. Leigh et al. proposed the use of linear regression and simple neural network models for forecasting the stock market indices in the New York Stock Exchange during the period 1981-1999 (Leigh et al., 2005). The scheme proposed by the authors used a template matching mechanism based on statistical pattern recognition that efficiently and accurately identified spikes in the trading volumes. A threshold limit for the spike in volume was identified, and the days on which the traded volume exhibited significant spikes were identified.
A linear regression model was applied to forecast the future change in price based on the historical price, traded volume, and the prime interest rate. Shen et al. proposed a novel scheme that was based on a tapped delay neural network (TDNN) with an ability of adaptive learning and pruning for forecasting on a non-linear time series of stock price values (Shen et al., 2007). The TDNN model was trained by a recursive least square (RLS) technique that involved a tunable learning-rate parameter that enabled faster network convergence. The trained neural network model was optimized using a pruning algorithm that reduced the possibility of overfitting of the model. The experimental results in a simulated environment clearly showed that the pruned model had a reduced complexity, a faster execution, and an improved prediction accuracy. Ning et al. proposed a scheme of stock index prediction that was based on a chaotic neural network (Ning et al., 2009). Data from a Chinese stock market and a Shenzhen stock market were used for building the model. The non-linear, stochastic, and chaotic patterns in the stock market indices were learned by the chaotic neural network, and the learnings of the chaotic neural network were gainfully applied in forecasting future index values of the stock markets.
Hanias et al. conducted a study to predict the daily stock exchange price index of the Athens Stock Exchange (ASE) using a neural network with backpropagation (Hanias et al., 2012). The neural network was used to make multistep forecasts for nine days and yielded a very low mean square error (MSE) value of 0.0024. Wu et al. proposed an ensemble model of prediction using support vector machines (SVM) and artificial neural networks (ANN) for predicting stock prices (Wu et al., 2008). The forecasting performance of the ensemble model was compared with those of the SVM model and the ANN model. It was observed by the authors that the ensemble approach produced more accurate results than the other two models. Liao et al. carried out a study on the stock market investment issues on the Taiwan stock market (Liao et al., 2008). The scheme involved two phases. In the first phase, the apriori algorithm was used to identify the association rules and knowledge patterns about stock category association and possible stock category investment collections. After the association rules were successfully mined, in the second phase, the k-means clustering algorithm was used to identify the various clusters of stocks based on their association patterns. The authors also proposed several possible stock market portfolio alternatives under various clusters of stocks. Zhu et al. hypothesized that there is a significant bidirectional nonlinear causality between stock returns and trading volumes (Zhu et al., 2008). The authors proposed the use of a neural network-based scheme for forecasting stock index movements. The model was further enriched by the inclusion of different combinations of indices and component stocks' trading volumes as inputs. NASDAQ, DJIA and STI data of stock prices and volume of transactions were used in training the neural network. The experimental results demonstrated that the neural networks augmented with trading volumes led to improvements in forecasting performance under different terms of the forecasting horizon. Bentes et al.
presented a study on the long memory and volatility clustering for the S&P 500, NASDAQ 100 and Stoxx 50 indexes in order to compare the US and European markets (Bentes et al., 2008). The authors compared the performance of two different approaches. The first approach was based on the traditional generalized autoregressive conditional heteroscedasticity (GARCH) models, namely GARCH(1, 1), IGARCH(1, 1) and FIGARCH(1, d, 1), while the second approach exploited the concept of entropy in econophysics. In the second approach, three different measures were considered by the authors in the study. The three measures were the Shannon, Renyi, and Tsallis measures. The results obtained using both the approaches elicited the existence of nonlinearity and volatility of the S&P 500, NASDAQ 100 and Stoxx 50 indexes. Chen et al. demonstrated how the random and chaotic behavior of stock price movements can be very effectively modeled using a local linear wavelet neural network (LLWNN) technique (Chen et al., 2005). The proposed wavelet-based model was further optimized using a novel algorithm, which the authors referred to as the estimation of distribution algorithm (EDA). The purpose of the model was to accurately predict the share price for the following trade day given the opening, closing and maximum values of the stock price for a particular day. The study revealed an interesting observation: even for a time series that exhibited an extremely high level of random fluctuations in its values, the model could extract some very important features from the opening, closing and the maximum values of the stock index that enabled an accurate prediction of its future behavior. Hutchinson et al. proposed a non-parametric method for estimating the pricing formula of a derivative that applied the principles of learning networks (Hutchinson et al., 1994). The variables that were used as the input to the model were: the present fundamental asset price, the strike price, the time to maturity, etc.
These variables had a direct influence on the derivative price. The learning network mapped the input values to its output values. For training the model, the authors used a dataset consisting of the daily closing prices of S&P 500 futures and the options prices for the 5-year period from January 1987 to December 1991. For the purpose of understanding the efficacy and the efficiency of various models, the authors compared the performance of four models: (i) ordinary least squares, (ii) radial basis function networks, (iii) multilayer feed-forward neural networks, and (iv) the projection pursuit. The simulation results showed that among the four models, the non-parametric model proposed by the authors yielded the most accurate forecasts on the derivative prices. Dutta et al. illustrated how ANN models could be applied in forecasting the Bombay Stock Exchange's SENSEX weekly closing values for the period of January 2002 to December 2003 (Dutta et al., 2006). The approach proposed by the authors involved building two neural networks, each consisting of three hidden layers in addition to the input and the output layers. The input values to the first neural network were: (i) the weekly closing values, (ii) the 52-week moving average of the weekly closing SENSEX values, (iii) the 5-week moving average of the closing values, and (iv) the 10-week oscillator values for the past 200 weeks. On the other hand, the second network was provided with the following input values: (i) the weekly closing value of SENSEX, (ii) the moving average of the weekly closing values computed on the 52-week historical data, (iii) the moving average of the closing values computed on the 5-week historical data, and (iv) the volatility
[...] root mean square error (RMSE) and mean absolute error (MAE) values on the test data. For the purpose of testing the networks, the weekly closing SENSEX values for the period of January 2002 to December 2003 were used. Hammad et al. demonstrated that an artificial neural network (ANN) model can be trained to converge to an optimal solution while it maintains a very high level of precision in the forecasting of stock prices (Hammad et al., 2009). The proposed scheme was based on a multi-layer feedforward neural network model that used the back-propagation algorithm. The model was used for forecasting the Jordanian stock prices. The authors demonstrated simulations using MATLAB that were carried out on seven Jordanian companies from the service and manufacturing sectors. The accuracy of the model in forecasting stock price movement was found to be very high. Tsai and Wang conducted a study to illustrate how Bayesian Network-based approaches could produce better forecasting results than traditional regression and neural network-based approaches (Tsai & Wang, 2009). The authors proposed a hybrid predictive model for stock price forecasting that combined a neural network-based model with a decision tree. The experimental results demonstrated that the hybrid model had a higher predictive power than the single ANN and the single decision tree-based approach. Tseng et al. utilized various approaches including the traditional time series decomposition (TSD) model, Holt-Winters (H/W) exponential smoothing with trend and seasonality models, Box-Jenkins (B/J) models using autocorrelation and partial autocorrelation, and neural network-based models (Tseng et al., 2012). The authors trained the models on the stock price data of 50 randomly chosen stocks during the period: September 1, 1998 - December 31, 2010.
For the purpose of training the models, 3105 observations based on the closing prices of the stocks were used. The testing of the model was carried out on data over 60 trading days. The study showed that the forecasting accuracies were higher for the B/J, H/W and normalized neural network models. The errors associated with the time series decomposition-based model and the non-normalized neural network models were found to be higher. Senol and Ozturan illustrated that ANN can be used to predict stock prices and their direction of change (Senol & Ozturan, 2008). The result was promising, with a forecast accuracy of 81% on average. In the literature, a substantial number of contributions exist that are based on the application of time series and fuzzy time series approaches for forecasting stock price movements. Thenmozhi investigated the applicability of chaos theory in modeling the nonlinear behavior of the Bombay Stock Exchange (BSE) time series (Thenmozhi, 2006). The author used the return values of the BSE SENSEX time series data during the period August 1980 to September 1997 and showed that the time series of the daily and the weekly return values exhibited nonlinearity and weakly chaotic properties. Fu et al. presented an approach that represented the data points in a financial time series according to their importance (Fu et al., 2007). Using the data points ranked by importance, a tree was constructed that enabled incremental updating of data in the time series. The scheme facilitated representation of a large-sized time series in different levels of detail, and also enabled multi-resolution dimensionality reduction. The authors presented several evaluation methods of data point importance, a novel method of updating a time series, and two dimensionality reduction approaches. Extensive experimental results were also presented demonstrating the effectiveness of all propositions. Phua et al.
presented a predictive model using neural networks with genetic algorithms for forecasting stock price movements on the Singapore Stock Exchange (Phua et al., 2000). The forecasting accuracy of the predictive model was found to be 81% on the test dataset, indicating that the model was moderately effective in its forecasting job. Moshiri and Cameron described a back propagation-based neural network and a set of econometric models to forecast inflation levels (Moshiri & Cameron, 2010). The set of econometric models proposed by the authors included the following: (i) Box-Jenkins autoregressive integrated moving average (ARIMA) model, (ii) vector autoregression (VAR) model, and (iii) Bayesian vector autoregression (BVAR) model. The forecasting accuracies of the three models were compared with the hybrid back propagation network (BPN) model proposed by the authors. For the purpose of testing the models, three different values of the forecasting horizon were used: one month, two months, and twelve months. With the root mean square error (RMSE) and the mean absolute error (MAE) as the two metrics, the authors observed that the performance of the hybrid BPN was superior to the other econometric models. The major drawback of the existing propositions in the literature for stock price prediction is their inability to predict stock price movement in a short-term interval. The current work attempts to address this shortcoming by exploiting the learning ability of a gamut of machine learning models and a deep neural network in stock price movement modeling and prediction.
Methodology
In Section 1, we mentioned that the goal of this work is to develop a robust forecasting framework for the short-term price movement of stocks. We use the Metastock tool for collecting data on the short-term price movement of stocks (Metastock). In particular, we collected the stock data for the company Godrej Consumer Products Ltd. The data is collected at 5-minute intervals, for all the days on which the National Stock Exchange (NSE) was operating during the years 2013 and 2014. The raw data for the stock consisted of the following variables: (i) date, (ii) time, (iii) open value of the stock, (iv) high value of the stock, (v) low value of the stock, (vi) close value of the stock, and (vii) the volume of the stock traded in a given interval. The variable time refers to the time instant at which the stock values are noted. As each record is collected at a 5-minute interval, the time between two successive records in the raw data is 5 minutes. The raw data in this format is collected for the stock Godrej Consumer Products Ltd. for two years. In addition to the seven variables in the raw data that we have mentioned, we also collected the NIFTY index at 5-minute intervals for the same period of two years, in order to capture the overall market sentiment at each time instant, so that more accurate and robust forecasting can be made using the combined information of historical stock prices and the market sentiment index. Therefore, the raw data now consists of eight variables. As a 5-minute interval is too granular, we aggregate the raw data. We break the total time interval in a day into three slots as follows: (1) a morning slot that covers the time interval 9:00 AM till 11:30 AM, (2) an afternoon slot that covers the time interval 11:35 AM till 1:30 PM, and (3) an evening slot that covers the time interval 1:35 PM till the time of closure of the NSE on a given day. Hence, the daily stock information now consists of three records, each record containing stock price information for a time slot. Using the eight variables in the raw data and incorporating the aggregation of data using the time slots, we design eleven derived variables and compute their values. These derived variables are used as the input variables for building the predictive models for forecasting the stock price and the stock movement.
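The slot assignment and aggregation described above can be sketched as follows. This is a minimal illustration in Python and not the authors' implementation: the record layout (`ts`, `open`, `high`, `low`, `volume`, `nifty`), the function names `slot_of` and `aggregate_slots`, the sample values, and the particular per-slot summaries kept here are all assumptions made for the example.

```python
from datetime import datetime, time
from statistics import mean

# Slot boundaries as stated in the text (hypothetical constant names).
MORNING_END = time(11, 30)    # morning slot: 9:00 AM till 11:30 AM
AFTERNOON_END = time(13, 30)  # afternoon slot: 11:35 AM till 1:30 PM


def slot_of(ts):
    """Return 1 (morning), 2 (afternoon) or 3 (evening) for a timestamp."""
    t = ts.time()
    if t <= MORNING_END:
        return 1
    if t <= AFTERNOON_END:
        return 2
    return 3


def aggregate_slots(records):
    """Group 5-minute records into one summary record per (date, slot)."""
    buckets = {}
    for rec in sorted(records, key=lambda r: r["ts"]):
        key = (rec["ts"].date(), slot_of(rec["ts"]))
        buckets.setdefault(key, []).append(rec)

    slots = []
    for (day, slot), recs in buckets.items():
        slots.append({
            "date": day,
            "slot": slot,
            "open": recs[0]["open"],                  # open of the first record
            "high": max(r["high"] for r in recs),     # slot high
            "low": min(r["low"] for r in recs),       # slot low
            "low_mean": mean(r["low"] for r in recs),
            "volume_mean": mean(r["volume"] for r in recs),
            "nifty_mean": mean(r["nifty"] for r in recs),
        })
    return slots


# Made-up sample records for a single trading day.
sample = [
    {"ts": datetime(2013, 5, 22, 9, 15), "open": 850.0, "high": 852.0,
     "low": 849.0, "volume": 1200, "nifty": 6100.0},
    {"ts": datetime(2013, 5, 22, 9, 20), "open": 851.0, "high": 853.5,
     "low": 850.0, "volume": 800, "nifty": 6102.0},
    {"ts": datetime(2013, 5, 22, 11, 35), "open": 853.0, "high": 854.0,
     "low": 851.5, "volume": 500, "nifty": 6095.0},
    {"ts": datetime(2013, 5, 22, 13, 35), "open": 852.0, "high": 852.5,
     "low": 848.0, "volume": 900, "nifty": 6080.0},
]
slots = aggregate_slots(sample)
```

Sorting by timestamp before bucketing keeps the slot records in chronological order, which matters because every derived variable compares a slot with the one immediately before it.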
We followed two approaches to stock price forecasting: regression and classification. The difference between these two approaches lies in the way the response variable open_perc is used in the model building process. This point will be described in detail later in this section. Following are the eleven derived variables:
month: This is a numeric variable that refers to the month for a given stock price record. The twelve months are assigned numeric codes of 1 through 12, with the month of January being coded as 1, and the month of December assigned a code of 12.
day_month: This numeric variable denotes the particular day of a given month to which a stock price record corresponds. The value of this variable lies in the interval [1, 31]. For instance, if the date for a stock price record is 22nd May 2013, then the day_month variable for that record will be assigned a value of 22.
day_week: This is a numeric variable that corresponds to the day of the week for a given stock price record. The five days in a week on which the stock market remains open are assigned numeric codes of 1 through 5, with Monday being coded as 1 and Friday assigned a code of 5.
time: This numeric variable refers to the time slot to which a stock price record belongs. There are three time slots in a day: morning, afternoon and evening. The slots are assigned codes 1, 2, and 3 respectively. For example, if a stock price record refers to the time point 3:45 PM, the variable time will be assigned a value of 3 for the stock price record.
open_perc: This is a numeric variable that is computed as the percentage change in the value of the open price of the stock over two successive time slots. The computation of the variable is done as follows. Suppose we have two successive slots $S_1$ and $S_2$. Both of them consist of several records at five-minute intervals. Let the open price of the stock for the first record of $S_1$ be $X_1$ and that for $S_2$ be $X_2$. The open_perc for the slot $S_2$ is computed as $(X_2 - X_1)/X_1$, expressed as a percentage.
high_perc: This is a numeric value that is computed as the difference between the high values of two successive slots. The computation is identical to that of open_perc except for the fact that high values are used in this case instead of the open values.
low_perc: This is a numeric value that is computed as the difference between the low values of two successive slots. For two successive slots $S_1$ and $S_2$, we first compute the mean of all low values of the records in each slot. If $L_1$ and $L_2$ refer to the means of the low values for $S_1$ and $S_2$ respectively, then low_perc for $S_2$ is computed as $(L_2 - L_1)$.
close_perc: This is a numeric value that is computed as the difference between the close values of two successive slots. Its computation is similar to that of the open_perc variable, except for the fact that we use the close values in the slots instead of the open values.
vol_perc: This is a numeric value that is computed as the difference between the volume values of two successive slots. For two successive slots $S_1$ and $S_2$, we compute the mean values of volume for both the slots, say $V_1$ and $V_2$ respectively. The vol_perc for $S_2$ is then computed as $(V_2 - V_1)$.
nifty_perc: This is a numeric variable that is computed as the percentage change in the NIFTY index over two successive time slots. The computation of the variable is done as follows. We compute the means of the NIFTY index values for two successive time slots $S_1$ and $S_2$. Let us assume the means are $M_1$ and $M_2$ respectively. Then the nifty_perc for the slot $S_2$ is computed as $(M_2 - M_1)/M_1$, expressed as a percentage.
range_diff: The value of this numeric variable is obtained by computing the difference in the range values of two consecutive time slots. The range value for a given slot is the difference between its high and low values. If $S_1$ and $S_2$ denote two consecutive slots, and if $H_1$, $H_2$, $L_1$ and $L_2$ represent the high and the low values of the slots $S_1$ and $S_2$ respectively, then the range value for $S_1$ is $R_1 = (H_1 - L_1)$ and for $S_2$ is $R_2 = (H_2 - L_2)$. The range_diff for the slot $S_2$ is computed as $(R_2 - R_1)$.
After we compute the values of the above eleven variables for each slot for the time frame of two years (i.e., 2013 and 2014), we develop the forecasting framework. As mentioned earlier, we followed two broad approaches in the forecasting of the stock movements: regression and classification.
In the regression approach, based on the historical movement of the stock prices, we predict the stock price in the next slot. We use open_perc as the response variable, which is a continuous numeric variable. The objective of the regression technique is to predict the open_perc value of the next slot given the stock movement pattern and the values of the predictors up to the previous slot. In other words, if the current time slot is $S_1$, the regression techniques will attempt to predict open_perc for the next slot $S_2$. If the predicted open_perc is positive, it indicates an expected rise in the stock price in $S_2$, while a negative open_perc indicates a fall in the stock price in the next slot. Based on the predicted values, a potential investor can form his/her investment strategy in stocks.
In the classification approach, the response variable open_perc is a discrete variable belonging to one of two classes. For developing the classification-based forecasting approaches, we converted open_perc into a categorical variable that takes one of the two values 0 and 1, with 0 indicating negative open_perc values and 1 indicating positive open_perc values. Hence, if the current slot is $S_1$ and the forecast model expects a rise in the open_perc value in the next slot $S_2$, then the open_perc value for $S_2$ will be 1.
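To make the percentage-change and difference variables concrete, the following sketch computes open_perc, nifty_perc, range_diff and the binary class label for one pair of successive slots. It is only an illustration under stated assumptions: the dictionary keys, the helper names `pct_change` and `derive`, and the sample numbers are hypothetical, and the text does not say how an open_perc of exactly zero is labelled, so the sketch arbitrarily maps it to 0.

```python
def pct_change(prev, curr):
    """Percentage change from prev to curr: (curr - prev) / prev * 100."""
    return (curr - prev) / prev * 100.0


def derive(prev_slot, curr_slot):
    """Derived variables for curr_slot (S_2) relative to prev_slot (S_1)."""
    open_perc = pct_change(prev_slot["open"], curr_slot["open"])
    nifty_perc = pct_change(prev_slot["nifty_mean"], curr_slot["nifty_mean"])

    # Range of a slot is its high minus its low.
    prev_range = prev_slot["high"] - prev_slot["low"]
    curr_range = curr_slot["high"] - curr_slot["low"]

    return {
        "open_perc": open_perc,                # regression response
        "nifty_perc": nifty_perc,
        "range_diff": curr_range - prev_range,
        # Classification response: 1 for a positive open_perc, else 0.
        # Mapping an exact zero to 0 is an assumption of this sketch.
        "label": 1 if open_perc > 0 else 0,
    }


# Made-up slot summaries for S_1 and S_2.
s1 = {"open": 100.0, "high": 102.0, "low": 99.0, "nifty_mean": 6000.0}
s2 = {"open": 101.0, "high": 104.0, "low": 100.0, "nifty_mean": 5970.0}
features = derive(s1, s2)
```

In this made-up pair the stock opens higher in the second slot while the index mean falls, so open_perc is positive, nifty_perc is negative, and the class label is 1.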
An expected negative value of the open_perc in the next slot will be indicated by a 0 value for the response variable. For both classification and regression approaches, we experimented with three cases which are described below. Case I:
We used the data for the year 2013, which consisted of 19,385 records at five-minute intervals. These records were aggregated into 745 time slot records for building the predictive model. We used the same dataset for testing the forecast accuracy of the models for both the stocks and made a comparative analysis of all the models.
Case II : We used the data for the year 2014, which consisted of 18,972 records at five-minute intervals. These granular data were aggregated into 745 time slot records for building the predictive model. We used the same dataset for testing the forecast accuracy of the models for both the stocks and carried out an analysis of the performance of the predictive models.
Case III : We used the data for 2013 as the training dataset for building the models and tested the models using the data for the year 2014 as the test dataset. We again carried out an analysis of the performance of the different models in this approach.
We have built eight classification models and nine regression models for developing our forecasting framework. The classification models are: (i) logistic regression, (ii) k-nearest neighbor, (iii) decision tree, (iv) bagging, (v) boosting, (vi) random forest, (vii) artificial neural network, and (viii) support vector machines. For measuring accuracy and effectiveness in these approaches, we use several metrics: sensitivity, specificity, positive predictive value, negative predictive value, and classification accuracy. Sensitivity and positive predictive value are also known as recall and precision respectively.
The nine regression methods that we built are: (i) multivariate regression, (ii) multivariate adaptive regression spline, (iii) decision tree, (iv) bagging, (v) boosting, (vi) random forest, (vii) artificial neural network, (viii) support vector machine, and (ix) long- and short-term memory network. While all the classification techniques are machine learning-based approaches, one regression technique - the Long- and Short-Term Memory Network - is a deep learning method. For comparing the performance of the regression methods, we use metrics such as the root mean square error (RMSE) and the correlation coefficient between the actual and predicted values of the response variable, e.g., open_perc.
Machine Learning Methods
The eight classification models that we built are now discussed in detail.
Logistic Regression : Since this is a classification technique, we transformed the response variable open_perc from a continuous domain to a discrete domain. In other words, we transformed the response variable into a categorical variable that can assume the values 0 or 1. We converted all negative or zero values of open_perc to the class 0 and all positive values to the class 1. We used the function glm in R for building the logistic regression model, with three parameters passed to the function: (i) the first parameter is the formula, open_perc ~ ., which includes open_perc as the response variable and all the remaining variables as the predictors, (ii) the second parameter is "family = binomial", indicating that the model is a binary logistic regression that involves two classes, and (iii) the third parameter is the R data object containing the training data set. We used the predict function in R to compute the probability of the test records belonging to each of the two classes. We used a probability threshold of 0.5: when the probability of a record belonging to a class exceeds 0.5, we assume that the record belongs to that class.
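The thresholding step described above can be illustrated with a small Python sketch. It is only an approximation of the idea: the model is fitted with plain batch gradient descent on a single hypothetical predictor, whereas R's glm fits the model by iteratively reweighted least squares on all predictors. The function names, learning rate, epoch count and toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals between predicted probabilities and the 0/1 labels.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict_class(w, b, x, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return 1 if sigmoid(w * x + b) > threshold else 0


# Toy data: one predictor, class 1 whenever the predictor is positive.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
preds = [predict_class(w, b, x) for x in xs]
print(preds)
```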
K-Nearest Neighbor : The k-nearest neighbor (KNN) method is an example of instance-based learning. The classification for a new unclassified record is found simply by comparing it to the most similar records in the training set. The value of k determines how many of the closest records in the training data set are considered for classifying a test data set record. We have used the R function knn defined in the library class to carry out KNN classification on the stock price data. The data is normalized using min-max normalization before applying the knn function so that all predictors are scaled into the same range of values. Different values of k were tried out for building the models and the value k = 3 was finally chosen. This value of k was found to produce the best performance of the models with the minimum probability of model overfitting.
Decision Tree : The classification and regression tree (CART) algorithm produces decision trees that are strictly binary, so that there are exactly two branches for each node. The algorithm recursively partitions the records in the training data set into subsets of records with similar values for the target attribute. The trees are constructed by carrying out an exhaustive search at each node over all available variables and all possible splitting values, and selecting the optimal split based on some "goodness of split" criterion. We used the tree function defined in the tree library of R for classification of the stock records.
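The exhaustive split search at a single tree node can be sketched in Python as follows. Gini impurity is used here as one common "goodness of split" criterion; this is an assumption, since the text does not state which criterion was used, and the function names and toy records are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(rows, labels):
    """Search every variable and every candidate threshold for the best binary split.

    Returns (feature_index, threshold, weighted_gini_of_children).
    """
    best = (None, None, float("inf"))
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy records with two predictors; only the first one separates the classes.
rows = [[0.2, 5.0], [0.4, 1.0], [0.6, 4.0], [0.8, 2.0]]
labels = [0, 0, 1, 1]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree is obtained by applying the same search recursively to the two child subsets until a stopping rule is met.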
Bagging : Bootstrap Aggregation (Bagging) is an ensemble technique. It works as follows. Given a set D of d tuples, for iteration i, a training set D_i of d tuples is sampled with replacement from the original set of tuples D. Each training set represents a bootstrap sample. Since the samples are simple random samples with replacement, it is possible that some records (i.e., tuples) in D may not get a chance to be included in D_i, while some tuples may be included more than once. A classifier model M_i is trained on the information contained in each training set D_i. For classifying an unknown tuple X in the out-of-sample set (i.e., in the test dataset), each classifier M_i is asked to return its class prediction. The classification result of each trained classifier is considered as one vote. The bagging classifier counts the votes and finally assigns the class with the maximum number of votes to the tuple X. For carrying out classification on stock price data, we used the bagging function defined in the ipred library of R. The value of the parameter nbagg, which specifies the number of bootstrap samples, was taken as 25.
Boosting : Unlike bagging, boosting assigns weights to each tuple in the training dataset. Based on the training dataset, k classification models are built iteratively. However, the classifiers are not all given equal importance in the final classification decision. Unlike bagging, which uses simple majority voting among the classifiers, boosting uses a weighted majority voting mechanism. After a classifier M_i is constructed, the weights assigned to the tuples are updated before building the subsequent classifier M_{i+1}: the tuples that were misclassified in the current round are assigned higher weights, so that the next classifier pays more attention to them. After the completion of the final round, the boosted classifier model combines the weighted votes of the individual classifiers, where the weights are computed as a function of the classification accuracy of each classifier. Adaptive Boosting (AdaBoost) is a very popular variant of boosting for classification. We used the boosting function of the adabag library in R for the classification of stock price data.
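The bagging procedure described above (bootstrap sampling followed by majority voting) can be sketched in Python as follows. The base learner is a deliberately simple one-feature threshold rule, chosen only to keep the example short; it is an assumption and not the tree-based learner used by ipred. The function names, the seed and the toy data are likewise hypothetical.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-feature threshold rule: predict 1 when x > threshold."""
    xs = sorted(set(x for x, _ in sample))
    candidates = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] or [xs[0]]

    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in sample)

    return min(candidates, key=errors)


def bagging_fit(data, n_bags=25, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        # Bootstrap sample: d tuples drawn with replacement from the d originals.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models


def bagging_predict(models, x):
    """Each model casts one vote; the class with the most votes wins."""
    votes = Counter(1 if x > t else 0 for t in models)
    return votes.most_common(1)[0][0]


# Toy (x, class) tuples.
data = [(-3.0, 0), (-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1), (3.0, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 2.5), bagging_predict(models, -2.5))
```

Boosting differs from this sketch in two places: each round reweights the training tuples instead of resampling them uniformly, and the final vote is weighted by each classifier's accuracy.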
Random Forest : Random forest is an ensemble machine learning approach. The algorithm first builds a large number of decision tree classifiers separately so that the collection of the classifiers is a forest. The individual decision tree classifier models are built based on a random selection of attributes at each node. The splitting at each node is done by randomly selecting the feature and the feature value for splitting to introduce as much randomness as possible. In other words, each decision tree depends on the values of a random vector sampled independently, and with the same distribution for all trees in the forest. The objective of introducing so much randomness in building the decision tree models is to avoid overfitting of the models during the training phase. During the classification phase, each tree votes and the most popular class is returned. We have used the randomForest function defined in the randomForest library in R for classification purposes of the stock price data.
Artificial Neural Network : An artificial neural network (ANN) is a connectionist network that consists of nodes and their interconnecting links, where the nodes are arranged in several layers - an input layer, one or more hidden layers, and an output layer. The nodes in the input layer correspond to the predictor variables (i.e., attributes) in the training dataset. The inputs are fed simultaneously into the units making up the input layer. The input values pass through the respective nodes in the input layer, are then weighted using the weights associated with the links connecting the nodes, and are fed simultaneously to the second layer of nodes, known as the hidden layer nodes. The outputs of the nodes in the first hidden layer are weighted again using the corresponding link weights, and the resultant values are provided as the inputs to a possible second hidden layer, and so on. The weighted outputs of the last hidden layer are input to the units making up the output layer, which produces the network's prediction for the given tuples. We used the neuralnet function defined in the neuralnet library in R for classifying the stock price data. The raw data is normalized using the min-max normalization approach. Only the predictors are normalized; the response variable open_perc is kept unchanged. The parameter hidden of the function neuralnet is changed to realize different numbers of hidden layers in the network. The parameter stepmax is set to the maximum value of 10 so that the maximum number of iterations of the neuralnet function can be utilized. In order to carry out the classification exercise, the parameter linear.output is set to FALSE in the neuralnet function.
Support Vector Machine : A support vector machine (SVM) is a machine learning model for both classification and regression. When applied for classification, it can classify both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new higher dimension, it searches for the optimal linear hyperplane that separates the two classes. SVM finds this hyperplane using support vectors, which are the essential, discriminating training tuples that separate the two classes. We have used the ksvm function defined in the kernlab library in R for carrying out the classification of the stock price data. The function ksvm has an optional parameter called kernel, which is set to vanilladot in our implementation.
We now briefly discuss the regression models.
Multivariate Regression : In this regression approach, we used open_perc as the response variable and the remaining ten variables as the predictors to build predictive models for the three cases mentioned earlier in Section 3. In all these cases, we use the programming language R for data management, model construction, testing of models and visualization of results.
Case I:
We use the 2013 data as the training data set for building the model and test the model using the same data set. For both the stocks, we used two approaches of multivariate regression - (i) backward deletion and (ii) forward addition of variables. Both approaches yielded the same results for the stock price data. For the year 2013, we applied the vif function in the faraway library to detect the collinear variables in order to get rid of the multicollinearity problem. The variance inflation factor (VIF) values of the variables were found to be as follows: month = 1.003, day_month = 1.008, day_week = 1.002, time = 1.095, high_perc = 4372.547, low_perc = 4369.694, close_perc = 165.436, vol_perc = 1.072, nifty_perc = 1.046, range_diff = 156.198. Hence, it was clear that high_perc, low_perc, close_perc and range_diff exhibited multicollinearity. We retained low_perc and range_diff for the model construction and removed the other two variables, since low_perc had a smaller VIF value than high_perc and range_diff had a smaller VIF value than close_perc. Using the drop1 function in the case of the backward deletion technique, and the add1 function in the case of the forward addition technique, we identified the variables that were not significant in the model and did not contribute to the information content of the model. For identifying the variables that contributed least to the information contained in the model at each iteration, we used the Akaike Information Criterion (AIC): in the backward deletion process, the variable that had the least AIC value and a non-significant p-value at each iteration was removed from the model. On the other hand, in the forward addition technique, the variable that had the lowest AIC and a significant p-value was added to the model at each iteration. It was found that low_perc and range_diff were the two predictors that finally remained in the regression model.
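For reference, the variance inflation factor of predictor j is VIF_j = 1/(1 - R_j^2), where R_j^2 is the coefficient of determination obtained by regressing predictor j on all the other predictors. The Python sketch below illustrates this definition with a small pure-Python least-squares solver; it is a stand-in for the vif function of the faraway library, and the helper names and the three toy columns are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the given columns plus an intercept."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bv * v for bv, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns, j):
    """Variance inflation factor of column j with respect to the other columns."""
    others = [c for i, c in enumerate(columns) if i != j]
    return 1.0 / (1.0 - r_squared(others, columns[j]))


# Toy predictors: the first two are nearly collinear, the third is unrelated.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 1.9, 3.2, 3.8, 5.1, 5.9],
    [2.0, -1.0, 0.0, 3.0, -2.0, 1.0],
]
print([round(vif(columns, j), 2) for j in range(3)])
```

The two nearly collinear columns receive very large VIF values, while the unrelated column stays close to 1, which mirrors the pattern in the VIF values reported above.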
Case II : For the year 2014, the VIF values for the predictors were found to be as follows: month = 1.007, day_month = 1.004, day_week = 1.007, time = 1.057, high_perc = 1161.446, low_perc = 1331.035, close_perc = 115.161, vol_perc = 1.022, range_diff = 92.092, nifty_perc = 1.073. The variables high_perc , low_perc , close_perc , and range_diff exhibited multicollinearity. As in Case I, we retained low_perc and range_diff as their VIF values were smaller compared with the other two. Use of backward deletion and forward addition methods both yielded the same regression models as in Case I with low_perc and range_diff as the predictors and open_perc as the response variable.
Case III : In this case, the model is identical to that in Case I. However, the model is tested on the data for the year 2014. Therefore, the performance results of the model are expected to be different. The performance results and their critical analysis are presented in Section 6.
Multivariate Adaptive Regression Spline : Multivariate Adaptive Regression Spline (MARS) is a machine learning approach for building robust regression models. MARS works by splitting input variables into multiple basis functions and then fitting a linear regression model to those basis functions. The basis functions used by MARS are designed in pairs:
$$f(x) = \begin{cases} x - t, & \text{if } x > t \\ 0, & \text{otherwise} \end{cases} \qquad g(x) = \begin{cases} t - x, & \text{if } x < t \\ 0, & \text{otherwise} \end{cases}$$
The main characteristic of these basis functions is that they are piecewise linear functions. The value $t$ at which the two functions meet is called a knot. The working principles of MARS are very similar to those of CART. Like CART, MARS first builds a complex model involving a large number of basis functions, which are separated from each other by a large number of knots. This phase of the algorithm execution is called the forward pass of the model building. In the subsequent phase, known as the backward pass, the algorithm prunes back unimportant terms (i.e., basis functions) that do not contribute significantly to the generalized $R^2$ value of the model. This phase essentially enables MARS to avoid a possible overfitting of the model during the training phase. During the execution of the backward pass, the algorithm computes the generalized cross-validation (GCV) values to determine how well the model fits the data while avoiding any possible overfitting. Finally, the algorithm returns the model with the best cost/benefit ratio. To fit a model using MARS in R, we use the function earth in the library earth.
Decision Tree : For building a regression model, we have used the same tree function in the tree library in R as we did in building the decision tree-based classification model. However, in this case, the response variable was kept as numeric and not converted to a factor variable, unlike in the classification techniques. The predict function is used to predict the values of the response variable. The function cor and the function rmse defined in the library Metrics are used to compute the correlation coefficient and the RMSE value for determining the prediction accuracy of the models.
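The two regression metrics used throughout this section can be sketched in a few lines of Python. The function names and the toy values are illustrative assumptions; the example is chosen to show why both metrics are reported, since a forecast with a constant bias is perfectly correlated with the actual values and yet has a non-zero RMSE.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Toy example: every prediction is too high by exactly 0.5.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

error = rmse(actual, predicted)
corr = pearson(actual, predicted)
print(error, corr)
```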
Bagging : For carrying out regression on stock price data, we use the bagging function defined in the ipred library of R. The value of the parameter nbagg, which specifies the number of bootstrap samples, is taken as 100. We use the predict function in the ipred library to predict the response variable values and the rmse function in the Metrics library to compute the RMSE values of the predicted values. The cor function in R is used to compute the correlation between the original and the predicted values of the response variable.
Boosting : We use the blackboost function defined in the mboost library in R for building regression models on the stock price data unlike the boosting function of the adabag library in R for classification of stock price data. As in other cases of regression, the predict and rmse functions are used to compute the predicted values and the RMSE values in the regression model.
Random Forest : We use the randomForest function defined in the randomForest library in R for regression purposes. The response variable open_perc is kept as a numeric variable and not converted to a factor variable as it was done in case of random forest classification. The same predict and rmse functions are used as in other regression methods.
Artificial Neural Network : As in the case of classification, we use the neuralnet function defined in the neuralnet library in R for regression on the stock price data. The predictors are normalized using min-max normalization before building the model. The compute function defined in the neuralnet library is used for computing the predicted values, while the parameter hidden is used to change the number of nodes in the hidden layer. The value of the parameter stepmax is set to 10 so as to exploit the maximum number of iterations executed by the neuralnet function. The parameter linear.output is set to TRUE by default, and hence it is not altered. For the Godrej dataset, we needed only one node in the hidden layer for all the three cases for building ANN regression models.
Support Vector Machine : For building the regression model using SVM, we use the svm function defined in the e1071 library in R. The predict function is used for predicting the response variable values using the regression model, and the rmse function is used to compute the RMSE values for the predicted quantities.
Deep Learning Method
In this Section, we discuss the Long- and Short-Term Memory (LSTM) network for regression.
Long- and Short-Term Memory Network : LSTM is a variant of recurrent neural networks (RNNs) - neural networks with feedback loops (Geron, 2019). In such networks, the output at the current time slot depends on the current inputs as well as the previous state of the network. However, RNNs cannot capture long-term dependencies because of vanishing or exploding gradients during backpropagation while learning the weights of the links (Geron, 2019). LSTM networks overcome such problems, and hence they are quite effective in forecasting multivariate time series. LSTM networks consist of memory cells that can maintain their states over time using memory and gating units that regulate the information flow into and out of the memory. There are different variants of gates. The forget gates control what information to throw away from the memory. The input gates control the new information that is added to the cell state from the current input. The cell state vector aggregates the two components - the old memory from the forget gate and the new memory from the input gate. At the end, the output gates conditionally decide what to output from the memory cells. The architecture of an LSTM network, along with the backpropagation through time (BPTT) algorithm for learning, gives such networks a very powerful ability to learn and forecast in a multivariate time series framework.
We use the Python programming language and the Tensorflow deep learning framework for implementing LSTM networks, and we utilize those networks to predict the stock prices of Godrej Consumer Products as a multivariate time series. For this purpose, we use the open price of the stock as the response variable, and the predictors chosen are high, low, close, volume and the NIFTY index values. However, unlike for the machine learning techniques, we do not compute the differences between successive slots. Rather, we forecast the open value of the next slot based on the predictor values in the previous slots. We used the mean absolute error (MAE) for evaluating the model performance and adaptive moment estimation (ADAM) as the optimizer in all the three cases. ADAM computes adaptive learning rates for each parameter in the gradient descent algorithm. In addition to storing an exponentially decaying average of the past squared gradients, ADAM also keeps track of the exponentially decaying average of the past gradients, which serves as the momentum in the learning process. Whereas momentum behaves like a ball running down a steep slope, ADAM behaves like a heavy ball with a rough outer surface. This high level of friction results in ADAM's preference for a flat minimum in the error surface. Due to its ability to integrate adaptive learning rates with momentum, ADAM is found to perform very efficiently in optimizing large-scale networks. This was the reason for our choice of ADAM as the optimizer in our LSTM modelling. We trained the deep learning networks using different epoch values and different batch sizes for the three cases and determined the optimum performance of the network under those parameter values. The Sequential constructor in the Tensorflow framework has been used to build the LSTM model. The performance results of the LSTM regression models are presented in Section 6.
Performance Results and Analysis
In this Section, we provide a detailed discussion on the forecasting techniques that we have used and the results obtained using those techniques. We first discuss the classification techniques and then the regression techniques. For both the stocks and for all the three cases, we computed the prediction accuracy of the classification models using several metrics. We define the metrics below.
Sensitivity : It is the ratio of the number of true positives to the total number of positives in the test dataset, expressed as a percentage. Here, positive refers to the cases that belong to the target group (i.e., the class “1”). The term true positives refers to the number of positive cases that the model correctly identified. Sensitivity is also sometimes referred to as recall.
$$\text{Sensitivity} = \frac{\text{Number of true positives}}{\text{Total number of positives}} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false negatives}} \qquad (1)$$
Specificity : It is the ratio of the number of true negatives to the total number of negatives in the test dataset, expressed as a percentage. Here, negative refers to the cases that belong to the non-target group (i.e., the class “0”). The term true negatives refers to the number of negative cases that the model correctly identified.
$$\text{Specificity} = \frac{\text{Number of true negatives}}{\text{Total number of negatives}} = \frac{\text{Number of true negatives}}{\text{Number of true negatives} + \text{Number of false positives}} \qquad (2)$$
Positive Predictive Value : Positive predictive value (PPV), also sometimes referred to as precision, refers to the accuracy of the model in classifying the target group cases among the total number of target group cases identified by it. It is computed as the ratio of the number of correctly identified target group cases to the total number of target group cases as identified by the model. Since the total number of target group cases identified by the model is the sum of the number of true positive cases and the number of false positive cases, PPV is the ratio of the number of true positive cases to the sum of the number of true positive cases and the number of false positive cases, expressed as a percentage. The complement of PPV is also called the false discovery rate (FDR).
$$\text{PPV} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false positives}} = \frac{\text{Number of true positives}}{\text{Number of positive calls}} \qquad (3)$$
$$\text{FDR} = 1 - \text{PPV} = \frac{\text{Number of false positives}}{\text{Number of true positives} + \text{Number of false positives}} = \frac{\text{Number of false positives}}{\text{Number of positive calls}} \qquad (4)$$
Negative Predictive Value : Negative predictive value (NPV) refers to the accuracy of the model in classifying the non-target group cases among the total number of non-target group cases identified by it. It is computed as the ratio of the number of correctly identified non-target group cases to the total number of non-target group cases as identified by the model. Since the total number of non-target group cases identified by the model is the sum of the number of true negative cases and the number of false negative cases, NPV is the ratio of the number of true negative cases to the sum of the number of true negative cases and the number of false negative cases, expressed as a percentage. The complement of NPV is also called the false omission rate (FOR).
$$\text{NPV} = \frac{\text{Number of true negatives}}{\text{Number of true negatives} + \text{Number of false negatives}} = \frac{\text{Number of true negatives}}{\text{Number of negative calls}} \qquad (5)$$
$$\text{FOR} = 1 - \text{NPV} = \frac{\text{Number of false negatives}}{\text{Number of true negatives} + \text{Number of false negatives}} = \frac{\text{Number of false negatives}}{\text{Number of negative calls}} \qquad (6)$$
Classification Accuracy (CA) : It is the ratio of the total number of cases that are correctly classified to the total number of cases in the dataset, expressed as a percentage.
$$\text{CA} = \frac{\text{Number of true positives} + \text{Number of true negatives}}{\text{Total number of positives} + \text{Total number of negatives}} \qquad (7)$$
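All of the metrics defined so far follow from the four counts of a confusion matrix. The short Python sketch below computes them; the function name is an assumption, and the counts are those reported for Case III of the logistic regression model later in this section (329 actual “1” cases with 26 misclassified, and 396 actual “0” cases with 42 misclassified).

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV, NPV and classification accuracy, in percent."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
        "ca": 100.0 * (tp + tn) / (tp + fn + tn + fp),
    }


# Case III, logistic regression: 303 of 329 ones and 354 of 396 zeros are correct.
metrics = classification_metrics(tp=303, fn=26, tn=354, fp=42)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the values agree with the Case III column of Table 1.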
F1 Score:
If the test data set is highly imbalanced, with the cases belonging to the non-target group far outnumbering the target cases, sensitivity is found to be very poor even with a very high classification accuracy. Hence, classification accuracy is not considered a very robust and reliable metric. F1 score, which is computed as the harmonic mean of the sensitivity and PPV, is found to be a very robust metric, however.
$$F1 = \frac{2 \times \text{Sensitivity} \times \text{PPV}}{\text{Sensitivity} + \text{PPV}} \qquad (8)$$
Classification Methods :
Logistic Regression : We used the glm function in the R programming language to build logistic regression-based classification models for all the three cases. The response variable was converted into a categorical type by using the function as.factor before we built the models. The parameter family was set to binomial in order to build a binary logistic regression model. The predict function was used to predict the class of the test data records. We also built the lift curve and the receiver operating characteristic (ROC) curve of the model for each case. The output of the performance function defined in the ROCR library was plotted to illustrate the ROC curve of the model. The area under the curve (AUC) for each ROC curve is computed using the auc function defined in the pROC library in R. Table 1 presents the performance results of the logistic regression classification method. For Case I, out of 419 actual “0” cases, only 10 cases were misclassified as “1”, while among 326 actual “1” cases, 17 cases were wrongly classified as “0”. The value of AUC for the ROC curve for Case I was 0.9934. For Case II, 16 cases out of a total of 396 actual “0” cases were misclassified as “1”, and 17 out of 329 cases which were actually “1” were wrongly classified as “0”. The AUC value for this case was found to be 0.9891. Case III yielded 42 cases which were actually “0” but misclassified as “1” out of a total of 396 cases, while among 329 cases which were actually “1”, 26 cases were misclassified as “0”. The AUC value for Case III was found to be 0.9587.
Fig 1(a), 1(b), and 1(c) present the classification performance, the lift curve and the ROC curve of the logistic regression-based classification model for Case I. In Fig 1(a), the y-axis represents the actual classes of the records (either “0” or “1”) and the x-axis denotes the probability that a case will belong to the class “1”. The threshold value along the x-axis is by convention taken to be 0.5. Hence, all the cases which lie on the level “0” along the y-axis and are situated to the right of the threshold value of 0.5 along the x-axis are misclassified. Similarly, all the points which are on the level “1” along the y-axis and are situated to the left of the threshold value of 0.5 along the x-axis are also misclassified. It is evident from Fig 1(a) that the number of misclassified cases in Case I was very low. Fig 1(b) shows that the lift curve is pulled up from the baseline, indicating that the model was very accurate in discriminating between the two classes. Fig 1(c) depicts the ROC curve for the logistic regression model in Case I. The steepness of the curve makes it evident that the model has been able to very effectively optimize the values of the true positive rate (TPR) and the false positive rate (FPR). In Fig 1(c), the red line segment represents the class “1” cases which are correctly classified, while the blue line segment denotes the correctly classified cases which belong to the class “0”. The yellow portion of the ROC curve represents those cases which actually belong to the class “0” but were wrongly classified by the model into the class “1”. The green portion of the ROC curve depicts those cases which were misclassified into the class “0” while they actually belong to the class “1”.
Table 1: Logistic regression classification results

| Metrics | Case I (Training Accuracy 2013) | Case II (Training Accuracy 2014) | Case III (Test Accuracy 2014) |
|---|---|---|---|
| Sensitivity | 94.79 | 94.83 | 92.10 |
| Specificity | 97.61 | 95.96 | 89.39 |
| PPV | 96.87 | 95.12 | 87.83 |
| NPV | 96.01 | 95.72 | 93.16 |
| CA | 96.38 | 95.45 | 90.62 |
| F1 Score | 95.82 | 94.97 | 89.91 |

Fig 1(a): Logistic Regression - actual vs predicted probabilities of open_perc (Case I)
Fig 1(b): Logistic Regression - lift curve (Case I)
Fig 1(c): Logistic Regression for classification – ROC curve (Case I)

Fig 2(a), Fig 2(b) and Fig 2(c) depict respectively the classification performance, the lift curve and the ROC curve of the logistic regression model for Case II. The performance of the model in this case is similar to that in Case I. However, the AUC value yielded by the model in this case was just marginally smaller than the corresponding value in Case I.

Fig 2(a): Logistic Regression - actual vs predicted probabilities of open_perc (Case II)
Fig 2(b): Logistic Regression – lift curve (Case II)
Fig 2(c): Logistic Regression for classification – ROC curve (Case II)

Fig 3(a): Logistic Regression – actual vs predicted probabilities of open_perc (Case III)
Fig 3(b): Logistic Regression – lift curve (Case III)
Fig 3(c): Logistic Regression for classification – ROC curve (Case III)

Fig 3(a), Fig 3(b), and Fig 3(c) show the classification accuracy, the lift curve and the ROC curve for the logistic regression model in Case III. It is evident from Fig 3(c) that the classification model committed more classification errors in this case than in Case I and Case II. This case also yielded a lower AUC value of 0.9587.
KNN Classification: Table 2 presents the performance results of the KNN classification method. For Case I, with k = 1, 3, 5, 7, and 9, the classification accuracy values were found to be 100, 93.42, 91.68, 92.35, and 92.08 respectively. We chose k = 3 in order to avoid the overfitted model with k = 1. In this case, 419 cases were actual 0s and 326 cases were actual 1s. 15 cases of actual 0s were misclassified as 1s, and 34 cases of actual 1s were misclassified as 0s. In Case II, for k = 1, 3, 5, 7, and 9, the classification accuracy values were 100, 90.21, 85.10, 83.22, and 84.16 respectively. Again, k = 3 was chosen to avoid model overfitting. 28 cases of actual 0s were misclassified as 1s, while 43 cases of actual 1s were misclassified as 0s. For Case III, the classification accuracy values were found to be 65.24, 65.10, 67.17, 68.69, and 67.44 for k = 1, 3, 5, 7, and 9 respectively. We chose k = 3, for which 202 cases that were actually 0s were misclassified as 1s, while 51 cases of actual 1s were misclassified as 0s.
Table 2:
KNN classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   89.57                    86.93                    84.50
Specificity   96.42                    92.93                    48.99
PPV           95.11                    91.08                    57.92
NPV           92.24                    89.54                    79.18
CA            93.42                    90.21                    65.10
F1 Score      92.26                    88.96                    68.73
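The six metrics reported in the classification tables follow directly from the confusion-matrix counts. The Python sketch below (not the R code used in this work; the function name is an illustrative assumption) takes the class “1” as the positive class and applies the standard definitions to the KNN counts stated above for Case I with k = 3.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the six metrics, as percentages, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    ppv = tp / (tp + fp)                         # positive predictive value
    npv = tn / (tn + fn)                         # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # classification accuracy (CA)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    values = {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "CA": accuracy,
        "F1 Score": f1,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# KNN, Case I, k = 3: 419 actual 0s (15 misclassified as 1s)
# and 326 actual 1s (34 misclassified as 0s).
fp, fn = 15, 34
tn, tp = 419 - fp, 326 - fn
metrics = classification_metrics(tp, tn, fp, fn)
```

For these counts the sketch reproduces the Case I column of Table 2.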
Decision Tree Classification: We used the tree function defined in the tree library in the R programming language for building decision tree-based classification models in all the three cases. The response variable open_perc was converted into a categorical type using the as.factor function for the purpose of classification. The predict function in the tree library was used for predicting the classes of the response variable open_perc for the records in the test dataset. For Case I and Case III, the models were identical as they were trained on 2013 data. However, while the model in Case I was tested on 2013 data, 2014 data was used for testing the model in Case III. For all these cases, we found that high_perc, low_perc, and close_perc were the three predictor variables used by the models to construct the decision trees. In Case I, the predictor used for splitting at the root node was close_perc, indicating that close_perc was the most important predictor for classification in the 2013 dataset. For the 2014 dataset, however, high_perc was found to be the most discriminating predictor, as it was used by the model for splitting at the root node. In Case I, the decision tree classifier misclassified 8 cases out of a total of 419 cases that actually belonged to the class “0”, while 16 cases were wrongly classified out of a total of 326 cases that actually belonged to the class “1”. In Case II, the model failed to correctly classify 17 cases out of a total of 396 cases that were actually “0” class members, while 25 cases were misclassified out of a total of 329 cases that actually belonged to the class “1”. In Case III, the model had a more difficult task at hand. It could not correctly classify 30 cases out of a total of 396 cases that actually belonged to the class “0”, while 33 cases were misclassified out of a total of 329 cases that actually belonged to the class “1”. Table 3 presents the performance results of the decision tree classification models under the three cases. Fig 4(a), Fig 4(b), and Fig 4(c) depict the decision tree classifiers for Case I, Case II, and Case III respectively.
Table 3:
Decision Tree classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   95.09                    92.40                    89.97
Specificity   98.09                    95.71                    92.42
PPV           97.48                    94.70                    90.80
NPV           96.25                    93.81                    91.73
CA            96.78                    94.21                    91.31
F1 Score      96.07                    93.54                    90.38
Bagging Classification: We used the bagging function defined in the ipred library in the R programming language for building the bagging classification models for all the three cases. We set the value of the parameter nbagg to 25 so that 25 decision trees were created from random bootstrap samples and a simple majority voting mechanism was applied in constructing the classifier. In Case I, we found that the model failed to correctly classify 8 cases out of a total of 419 cases that actually belonged to the class “0”, while 16 cases out of a total of 326 cases that actually belonged to the class “1” were also misclassified. In Case II, the model could not correctly classify 14 cases out of a total of 396 cases of the actual class “0”, while 15 cases out of a total of 329 cases that belonged to the class “1” were misclassified. In Case III, 30 cases out of 396 actual “0” class cases were incorrectly classified by the model, while 33 cases out of a total of 329 cases of the class “1” were also misclassified. The performance results of the bagging classification model for all three cases are presented in Table 4. Fig 5(a), Fig 5(b), and Fig 5(c) depict the classification accuracy of the model in Case I, Case II, and Case III respectively. In all these three figures, the y-axis represents the actual class labels, while the values along the x-axis show the predicted class probabilities for the records. The cases that lie on the label “0” on the y-axis and have probability values greater than 0.5 along the x-axis are the misclassified cases. Similarly, the cases that lie on the label “1” along the y-axis and have probability values less than 0.5 along the x-axis are also misclassified.
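The bootstrap-and-vote mechanism behind the bagging classifier can be sketched as follows. This is a minimal Python illustration and not the ipred implementation used in this work: one-split "stumps" on a single predictor stand in for full decision trees, and the data, the function names, and the seed are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split classifier on (x, label) pairs: predict 1 when x > cut."""
    best_cut, best_errors = None, None
    for cut in sorted({x for x, _ in rows}):
        errors = sum(1 for x, y in rows if (1 if x > cut else 0) != y)
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def bagging_fit(rows, n_models=25, seed=0):
    """Fit n_models stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]


def bagging_predict(cuts, x):
    """Combine the votes of the individual stumps by simple majority."""
    votes = Counter(1 if x > cut else 0 for cut in cuts)
    return votes.most_common(1)[0][0]


# Toy, hypothetical data: a single predictor and a 0/1 label.
train = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (-0.1, 0),
         (0.2, 1), (0.8, 1), (1.4, 1), (2.1, 1)]
model = bagging_fit(train)   # 25 models, mirroring the setting described above
```

Using an odd number of models, such as 25, avoids tied votes in a two-class problem.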
Table 4:
Bagging classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   95.09                    95.44                    89.97
Specificity   98.09                    96.46                    92.42
PPV           97.48                    95.73                    90.78
NPV           96.25                    96.22                    91.73
CA            96.78                    96.00                    91.31
F1 Score      96.07                    95.58                    90.37
Fig 4(a): Decision Tree for classification (Case I)
Fig 4(b): Decision Tree for classification (Case II)
Fig 4(c): Decision Tree for classification (Case III)
Fig 5(c): Bagging for classification - actual vs predicted classes of open_perc (Case III)
Boosting Classification: We used the boosting function defined in the adabag library in the R programming language for building the boosting models for classification under all the three cases. The response variable open_perc was transformed into the categorical type using the as.factor function so as to satisfy the requirement of a classification model. The predict function was used for predicting the class of the response variable in the test data records. For both Case I and Case II, the boosting classification models yielded 100% on all the classification metrics, as presented in Table 5. This is not surprising, as in both the cases the models were built and tested using the same dataset, and the ensemble of weighted majority voting over a large number of random decision tree classifiers fitted the training data very closely. However, the model faced more challenges in Case III, in which the ensemble model was built using the 2013 data and the testing was done using the 2014 data. In Case III, we found that the model misclassified 26 cases out of a total of 396 cases that actually belonged to the class “0”, while among the 329 cases that were actually of the class “1”, 26 cases were incorrectly classified. Table 5 presents the performance results of the boosting classification models for all the three cases. Fig 6(a), Fig 6(b), and Fig 6(c) depict the performance of the boosting classifier for Case I, Case II, and Case III respectively. In these three figures, the actual classes are plotted along the y-axis, with two actual class levels, “0” and “1”. The x-axis presents the predicted probability that a case belongs to the class “1”. Hence, the data points that lie to the left of the threshold value of 0.5 along the x-axis and on the level “1” along the y-axis are the misclassified cases. Similarly, the points that lie to the right of the threshold value of 0.5 and on the level “0” along the y-axis are also misclassified cases. It is evident from Fig 6(a), Fig 6(b), and Fig 6(c) that the boosting classifiers performed very well in all the three cases.
Table 5:
Boosting classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   100                      100                      92.10
Specificity   100                      100                      93.43
PPV           100                      100                      92.10
NPV           100                      100                      93.43
CA            100                      100                      92.83
F1 Score      100                      100                      92.10
Fig 5(a): Bagging for classification - actual vs predicted classes of open_perc (Case I)
Fig 5(b): Bagging for classification - actual vs predicted classes of open_perc (Case II)
Fig 6(c): Boosting for classification - actual vs predicted classes of open_perc (Case III)
Random Forest Classification: We used the randomForest function defined in the randomForest library in the R programming language for building random forest-based classification models. In all three cases, the random forest algorithm created 500 decision trees, using three predictors at each node for carrying out the splitting task. In Case I, the model wrongly classified 10 cases as class “1” out of a total of 419 cases that actually belonged to the class “0”. On the other hand, 18 cases were misclassified into the class “0” out of a total of 326 cases that were actually of the class “1”. The out-of-bag (OOB) estimate of the error rate of the model in this case was 3.76%. In Case II, the model could not correctly classify 23 cases out of a total of 396 cases that belonged to the actual class “0”. On the other hand, 23 cases out of a total of 329 cases that actually belonged to the class “1” were also misclassified. The OOB estimate of the error rate of the classification model in this case was 6.34%. In Case III, the random forest classification model was identical to that in Case I. However, the model was tested on 2014 data, unlike the model in Case I, which was tested on 2013 data. We found that in Case III, the model misclassified 28 cases out of a total of 396 cases that actually belonged to the class “0”. On the other hand, 29 cases were wrongly classified out of a total of 329 actual “1” cases. The performance results of the random forest classification model for all three cases are presented in Table 6.
Table 6:
Random Forest classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   94.48                    93.01                    91.19
Specificity   97.61                    94.19                    92.93
PPV           96.86                    93.01                    91.46
NPV           98.08                    94.19                    92.70
CA            96.24                    93.66                    92.14
F1 Score      95.66                    93.01                    91.32
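The out-of-bag (OOB) error estimate reported for the random forest models rests on the fact that each tree is grown on a bootstrap sample, so some training records are left out of that tree's sample and can serve as a held-out set for it. The Python sketch below illustrates only this in-bag and out-of-bag split, not the randomForest implementation itself; the function name and the seed are assumptions made for illustration.

```python
import random


def bootstrap_split(n_records, rng):
    """Draw a bootstrap sample of record indices; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    out_of_bag = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)   # fixed seed, chosen arbitrarily
n_records = 745           # number of records in the 2013 data (Case I)
in_bag, out_of_bag = bootstrap_split(n_records, rng)

# On average about 1/e (roughly 37%) of the records are out of bag for a tree.
oob_fraction = len(out_of_bag) / n_records
```

An OOB error rate is then obtained by predicting each record using only the trees for which that record was out of bag and comparing the aggregated prediction with the actual class.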
Fig 6(a): Boosting for classification - actual vs predicted classes of open_perc (Case I)
Fig 6(b): Boosting for classification - actual vs predicted classes of open_perc (Case II)
ANN Classification: We used the neuralnet function defined in the neuralnet library in the R programming language to build ANN classification models for all the three cases. The parameter linear.output was set to false, and the response variable open_perc was converted into a categorical variable type by using the function as.factor before the classification models were built. We found that only one node in the hidden layer was sufficient to model the data; hence, we passed the value of the parameter hidden as 1 in the neuralnet function. In order to avoid any possible scenario in which the backpropagation algorithm fails to converge, we set the parameter stepmax to its maximum possible value of 10 . In Case I, the ANN classification model misclassified 10 cases out of a total of 419 cases as “1” while they actually belonged to the class “0”. On the other hand, 15 cases that were actually “1” were wrongly classified as “0” out of a total of 326 cases. The ANN classification model for Case I is presented in Fig 7(a), and its performance in the classification task is presented in Fig 7(b). Fig 7(b) plots the actual classes along the y-axis and the predicted class probabilities along the x-axis. The points lying on the actual class label “0” along the y-axis with predicted class probabilities greater than 0.5 (i.e., the points on the “0” label lying to the right of the threshold value of 0.5 along the x-axis) represent the misclassified cases. Similarly, the points on the label “1” along the y-axis with probabilities smaller than 0.5 (i.e., the points on the “1” label lying to the left of the threshold value of 0.5 along the x-axis) are also misclassified points. In Case II, the ANN classification model misclassified 18 cases as class “1” out of 396 cases that actually belonged to the class “0”. On the other hand, 24 cases were misclassified as class “0” out of 329 cases that were actually class “1” cases. Fig 8(a) and Fig 8(b) present the ANN classification model in Case II and its performance in the classification task respectively. In Case III, the model was built using 2013 data; hence, it was identical to the model used in Case I. However, since the model was tested on 2014 data, unlike in Case I, in which the model was tested on 2013 data, the performance results of the model in Case III were very different. In fact, the model in Case III faced a much bigger challenge, as there was a difference in the characteristics of the data in 2013 and 2014. We found that in Case III, the model wrongly classified 258 cases as class “1” out of 396 cases that actually belonged to the class “0”. On the other hand, only 1 case out of 329 cases that were actually of the class “1” was misclassified as a class “0” case. It is evident that the model failed badly in classifying the class “0” cases, which resulted in a very poor value of its specificity. The specificity in Case III was found to be only 34.85%, while for Case I and Case II, the specificity values were 97.61% and 95.45% respectively. This clearly indicated that the ANN classification model generalized poorly from the training phase on the 2013 data, possibly because of model overfitting. This overfitted model failed to correctly classify the majority of the “0” cases in the test data of 2014, which resulted in a very low specificity value. Fig 9(a) and Fig 9(b) present the ANN classification model for Case III and its classification performance respectively. Table 7 presents the performance of the ANN classification models in all the three cases.
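To make the structure of such a small network concrete, the following Python sketch computes the forward pass of a network with a single hidden node in which a logistic activation is applied at both the hidden node and the output node (the effect of not using a linear output). It is not the neuralnet implementation, and the weights below are hypothetical values chosen only for illustration; a fitted model would learn them from the training data.

```python
import math


def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_bias, output_weight, output_bias):
    """One hidden node, logistic activation at the hidden and output layers."""
    hidden = logistic(
        sum(w * x for w, x in zip(hidden_weights, inputs)) + hidden_bias
    )
    return logistic(output_weight * hidden + output_bias)


# Hypothetical weights for three predictors.
p = forward([0.5, -0.2, 0.1], [1.2, -0.7, 0.4], 0.0, 6.0, -3.0)
predicted_class = 1 if p >= 0.5 else 0   # same 0.5 threshold as in the figures
```

Because the output passes through a logistic function, it always lies between 0 and 1 and can be read as a predicted probability of the class “1”.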
Table 7:
ANN classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   95.40                    92.71                    99.70
Specificity   97.61                    95.45                    34.85
PPV           96.88                    94.43                    55.97
NPV           96.24                    94.03                    99.28
CA            96.64                    94.21                    64.28
F1 Score      96.13                    93.56                    71.69
Fig 7(a): ANN classification model (Case I)
Fig 7(b): ANN classification - actual vs predicted classes of open_perc (Case I)
Fig 8(a): ANN classification model (Case II)
Fig 8(b): ANN classification - actual vs predicted classes of open_perc (Case II)
Fig 9(a): ANN classification model (Case III)
Fig 9(b): ANN classification - actual vs predicted classes of open_perc (Case III)
SVM Classification: We used the ksvm function defined in the kernlab library in the R programming language for building the SVM-based classification models. The function ksvm was used with the parameter kernel set to vanilladot, which implies that a linear kernel was used for building the SVM classification models. For Case I, the model found 120 support vectors. We found that out of a total of 430 cases that were actually 0 class records, 19 cases were misclassified as 1. On the other hand, 8 cases were wrongly classified as 0 out of a total of 315 cases that were actually 1. The training error for Case I was found to be 3.62%. For Case II, the model found 156 support vectors in order to classify all the 725 records. Among 406 cases that actually belonged to the class 0, 27 cases were misclassified as 1. On the other hand, 8 cases were wrongly classified as 0 out of a total of 315 cases that were actually 1. The training error for Case II was found to be 6.07%. The SVM classification model found 120 support vectors in Case III. The model misclassified 19 cases as 1 out of a total of 430 cases that were actually 0. On the other hand, out of a total of 315 cases that were actually 1, 8 cases were misclassified as 0. Table 8 presents the results of the SVM classification models for all the three cases.
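A linear kernel is simply the inner product of two input vectors, and a fitted SVM classifies a new record from a weighted sum of kernel values between that record and the support vectors. The Python sketch below illustrates this decision rule only; it is not the kernlab implementation, and the support vectors, coefficients, and bias are hypothetical.

```python
def linear_kernel(u, v):
    """Plain dot product, which is what a linear kernel computes."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, coefficients, bias):
    """Evaluate f(x) = sum_i coef_i * K(sv_i, x) + bias; return (class, score)."""
    score = sum(
        c * linear_kernel(sv, x) for sv, c in zip(support_vectors, coefficients)
    ) + bias
    return (1 if score >= 0 else 0), score


# Hypothetical support vectors and signed coefficients (alpha_i * y_i).
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
coefficients = [0.5, -0.5]
label, score = svm_decision([2.0, 0.5], support_vectors, coefficients, bias=0.0)
```

Only the support vectors enter the sum, which is why the number of support vectors found by the model is reported alongside its error rate.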
Table 8:
SVM classification results
Metrics       Case I                   Case II                  Case III
              Training Accuracy 2013   Training Accuracy 2014   Test Accuracy 2014
Sensitivity   94.46                    94.67                    97.46
Specificity   95.58                    93.35                    95.58
PPV           94.17                    91.79                    94.17
NPV           98.09                    95.71                    98.09
CA            96.38                    93.93                    96.37
F1 Score      94.31                    93.21                    95.79
Regression Methods: We now present the performance results of the regression models.
Multivariate Regression: We have already mentioned in Section 4 that the predictors finally included in the multivariate regression models in all the three cases (Case I, Case II, and Case III) were low_perc and range_diff.
For Case I, the regression model yielded a value of 0.9919 for the adjusted R^2 and an F-statistic value of 4.58*10 with an associated p-value of 2.2*10^-16. This indicated that the regression model was able to establish a linear relationship between the response variable open_perc and the predictor variables low_perc and range_diff. The RMSE value yielded by the regression model for this case was found to be 0.0853, and the mean of the absolute values of the actual open_perc was 0.6402. The ratio of the RMSE to the mean of the absolute values of the actual open_perc, expressed as a percentage, was found to be 13.317. 14 cases out of a total of 745 cases exhibited a sign mismatch between the predicted and the actual values of open_perc. The correlation test produced a correlation coefficient value of 0.99 with the p-value of the t-statistic as 2.2*10^-16. This indicated that there is a strong linear relationship between the predicted and the actual values of open_perc. The Breusch-Pagan test for heteroscedasticity yielded a test statistic value of 10.239 with a p-value of 0.005978. Hence, it was evident that the residuals are not homoscedastic. However, the Durbin-Watson test of autocorrelation produced a test statistic value of 3.023 with an associated p-value of 1. Hence, the null hypothesis of no autocorrelation among the residuals could not be rejected. We conclude that the residuals do not exhibit any significant autocorrelation.
For Case II, the regression model yielded an adjusted R^2 value of 0.9827 with the value of the F-statistic as 2.052*10 . The p-value of the F-statistic was found to be less than 2.2*10^-16, indicating a highly significant F-statistic and a very good model fit. The RMSE value for Case II was found to be 0.1749, with the mean of the absolute values of the actual open_perc as 0.9286. The ratio of the RMSE to the mean of the absolute values of the actual open_perc, expressed as a percentage, was 18.84. 39 cases out of a total of 725 cases were found to have a sign mismatch between the predicted and the actual open_perc values. The correlation test for this case yielded a correlation coefficient of 0.99 with the value of the t-statistic as 202.74. The p-value of the t-statistic was 2.2*10^-16, indicating a very strong linear relationship between the predicted and the actual open_perc values. The Breusch-Pagan test yielded a test statistic value of 3.1877 with an associated p-value of 0.203. It was thus evident that the residuals did not exhibit significant heteroscedasticity. The Durbin-Watson test of autocorrelation produced a test statistic value of 2.9005. The p-value of the Durbin-Watson test was found to be 1, so the null hypothesis of no significant autocorrelation among the residuals could not be rejected. Hence, we conclude that the residuals of the regression model in Case II did not exhibit any significant autocorrelation.
The model in Case III was the same as that in Case I. However, its performance results were different as it was tested on 2014 data, unlike the model in Case I, which was tested on 2013 data. The RMSE for Case III was found to be 0.1753, with the mean of the absolute values of the actual open_perc equal to 0.9286. Thus, the ratio of the RMSE to the mean of the absolute values of the actual open_perc values, expressed as a percentage, was found to be 18.88. We found that 39 cases out of a total of 725 cases exhibited a sign mismatch between the predicted and the actual values of open_perc. The correlation test yielded a correlation coefficient of 0.99 with the value of the t-statistic as 202.53 and an associated p-value of 2.2*10^-16. This indicated that the predicted and the actual values of open_perc exhibited a strong linear relationship. The Breusch-Pagan test yielded a test statistic of 3.1877 with an associated p-value of 0.2031, thereby indicating that the residuals were not heteroscedastic. The test statistic value yielded by the Durbin-Watson test was found to be 2.9005 with an associated p-value of 1. Hence, the null hypothesis of no autocorrelation among the residuals could not be rejected, and we concluded that the residuals did not exhibit any significant autocorrelation. Table 9 presents the results of the multivariate regression models for all the three cases.
Fig 10(a), (b), and (c) present some performance results of the multivariate regression model for Case I. Fig 10(a) shows that the predicted values very closely followed the pattern of the actual open_perc values, while Fig 10(b) shows that there is a very strong linear relationship between the predicted and the actual values of open_perc. The residuals of the model were found to be scattered and random and exhibited no significant autocorrelation, as depicted in Fig 10(c). The performance results of Case II are presented in Fig 11(a), (b), and (c). The predicted and the actual values of open_perc exhibited almost identical movement patterns in this case, as in Case I. The residuals did not show any significant autocorrelation. Fig 12(a) shows how closely the predicted values of open_perc followed the patterns exhibited by its actual values in Case III, while Fig 12(b) exhibits a strong linear relationship between them. Fig 12(c) depicts that the residuals of the regression model for Case III were random and did not exhibit any autocorrelation.
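For reference, the Durbin-Watson statistic used in the residual diagnostics above is the ratio of the sum of squared successive differences of the residuals to the sum of squared residuals. It lies between 0 and 4; values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The Python sketch below computes only the statistic, not the p-value of the test, and the residual series in it are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips at every step
persistent = [1.0, 1.0, 1.0, 1.0]                 # no change between steps

d_alternating = durbin_watson(alternating)
d_persistent = durbin_watson(persistent)
```

The alternating series gives a value above 2 and the persistent series gives 0, the two extremes of the behaviour described above.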
Table 9:
Multivariate Regression results
Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.99            0.99            0.99
RMSE/Mean of Absolute Values of Actuals   13.32           18.84           18.88
Percentage of Mismatched Cases            18.67           5.38            5.24
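The regression metrics used in this section can be computed as in the following Python sketch. The function names and the toy values are illustrative assumptions, and counting a sign mismatch only when the actual and the predicted values have strictly opposite signs is one possible convention, since the treatment of zero values is not specified here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def rmse_to_mean_abs_percent(actual, predicted):
    """RMSE as a percentage of the mean of the absolute actual values."""
    mean_abs = sum(abs(a) for a in actual) / len(actual)
    return 100 * rmse(actual, predicted) / mean_abs


def sign_mismatches(actual, predicted):
    """Count records whose predicted value has the opposite sign to the actual."""
    return sum(1 for a, p in zip(actual, predicted) if a * p < 0)


# Toy, hypothetical open_perc values.
actual = [0.5, -1.0, 0.2, -0.3]
predicted = [0.4, -0.9, -0.1, -0.2]

error = rmse(actual, predicted)
ratio = rmse_to_mean_abs_percent(actual, predicted)
mismatches = sign_mismatches(actual, predicted)
```

With the rounded Case I values reported for the multivariate regression model, 100 * 0.0853 / 0.6402 is approximately 13.32, which agrees with the first entry of the ratio row in Table 9.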
Fig 10(a): Multivariate Regression - time-varying actual and predicted values of open_perc (Case I)
Fig 10(b): Multivariate Regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 10(c): Multivariate Regression - time-varying residuals (Case I)
Fig 11(a): Multivariate Regression - time-varying actual and predicted values of open_perc (Case II)
Fig 11(b): Multivariate Regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 11(c): Multivariate Regression - time-varying residuals (Case II)
Fig 12(a): Multivariate Regression - time-varying actual and predicted values of open_perc (Case III)
Fig 12(b): Multivariate Regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 12(c): Multivariate Regression - time-varying residuals (Case III)
MARS: We used the earth function defined in the earth library in the R programming language for building MARS regression models in all the three cases.
In Case I, in the forward pass of the algorithm, 7 terms were used in the model building, as after the inclusion of the 8th term the change in the value of R^2 was found to be only 5*10^-5, which was less than the threshold value of 0.001. After the completion of the forward pass, both the generalized R-square (GRSq) and R^2 converged to a common value of 0.993. During the backward pass, the algorithm could not prune any term, and all the 7 terms used in the forward pass were finally retained in the model. In Case I, the model retained 3 predictors out of a total of 10 predictors. The selected predictors, in decreasing order of their importance in the model, were found to be: close_perc, high_perc, and low_perc. The predictors that the algorithm did not use were: month, day_month, day_week, time, vol_perc, nifty_perc, and range_diff. At the completion of the execution of the algorithm, the values of some of the important metrics were as follows: generalized cross validation (GCV): 0.0065, residual sum of squares (RSS): 4.7006, GRSq: 0.9928, and R^2: 0.9930. The 7 terms that the MARS algorithm used in Case I were found to be as follows: (i) the intercept of the model, (ii) h(-0.83682 - high_perc), (iii) h(high_perc - 0.83682), (iv) h(-0.692841 - low_perc), (v) h(low_perc - 0.692841), (vi) h(-2.11268 - close_perc), and (vii) h(close_perc - 2.11268). In Case I, the MARS regression model yielded 9 cases out of a total of 745 cases that exhibited a mismatch in signs between the predicted and the actual values of open_perc. The RMSE value for this case was 0.0794, while the mean of the absolute values of the actual open_perc was 0.6402. Hence, the ratio of the RMSE to the mean of the absolute values of the actual open_perc, expressed as a percentage, was 12.4065. The correlation test yielded a correlation coefficient of 0.99 with a t-statistic value of 325.41 and an associated p-value of 2.2*10^-16. This indicated that there is a strong linear relationship between the predicted and the actual values of open_perc in Case I.
In Case II, the algorithm used 9 terms during its forward pass, since the change in the R^2 value at the end of the 9th term was found to be only 0.0002, which was less than the threshold value of 0.001. After the completion of the forward pass, the values of GRSq and R^2 were found to be 0.985 and 0.986 respectively. During the backward pass, the algorithm could prune one term out of the 9 terms included in the forward pass. Hence, the algorithm used 8 terms in constructing the regression model. We also observed that the algorithm retained 4 predictors out of a total of 10 predictors available initially. The 4 predictors retained in the model, in decreasing order of their importance, were found to be: low_perc, close_perc, range_diff, and high_perc. At the end of the backward pass of the algorithm, the following metric values were noted: GCV: 0.0262, RSS: 18.2512, GRSq: 0.9852, and R^2: 0.9858. The 8 terms that the algorithm used in building the regression model in Case II were: (i) the intercept of the model, (ii) h(0.3675 - high_perc), (iii) h(high_perc - 0.3675), (iv) h(-2.6685 - low_perc), (v) h(low_perc - 2.6685), (vi) h(0.3996 - close_perc), (vii) h(-1.8 - range_diff), and (viii) h(range_diff - -1.8). In Case II, we found that 31 cases out of a total of 725 cases exhibited mismatched signs between the predicted and the actual values of open_perc. With an RMSE value of 0.1587 and the mean of the absolute values of the actual open_perc as 0.9286, their ratio, expressed as a percentage, was found to be 17.09. The correlation test yielded a correlation coefficient of 0.99, with the value of the t-statistic as 223.87 and an associated p-value of 2.2*10^-16. The high value of the correlation coefficient and the negligible support for the null hypothesis, in the form of a very low p-value, indicated that there was a very strong linear relationship between the predicted and the actual values of open_perc in Case II.
In Case III, the MARS regression model was identical to that of Case I. The model was, however, tested on 2014 data. We observed that in Case III, the MARS model yielded 46 cases out of a total of 725 cases that exhibited a sign mismatch between the predicted and the actual open_perc values. The RMSE for this case was found to be 0.1894, while the mean of the absolute values of the actual open_perc was 0.9286. The ratio of the RMSE to the mean value, expressed as a percentage, was found to be 20.40. The correlation test on the predicted and the actual values of open_perc yielded a correlation coefficient value of 0.99 with the value of the t-statistic as 187.13 and an associated p-value of 2.2*10^-16. The results indicated that, as in Case I and Case II, the predicted and the actual values of open_perc exhibited a strong linear relationship in Case III as well.
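The terms of the form h(.) listed above are hinge functions, h(z) = max(0, z), and a MARS model is a weighted sum of such terms. The Python sketch below shows how one mirrored pair of hinge terms at a single knot yields a piecewise-linear fit. It is not the earth implementation, and the knot and the coefficients are hypothetical values chosen only for illustration.

```python
def hinge(z):
    """MARS hinge function h(z) = max(0, z)."""
    return max(0.0, z)


def mars_predict(x, intercept, knot, coef_left, coef_right):
    """Piecewise-linear prediction from one mirrored hinge pair at a knot."""
    return intercept + coef_left * hinge(knot - x) + coef_right * hinge(x - knot)


# Hypothetical knot and coefficients, for illustration only.
y_left = mars_predict(-1.0, intercept=0.1, knot=0.5, coef_left=-0.8, coef_right=1.2)
y_right = mars_predict(2.0, intercept=0.1, knot=0.5, coef_left=-0.8, coef_right=1.2)
```

For any input, only one term of a mirrored pair is non-zero, so the slope of the fit can differ on the two sides of the knot.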
Table 10:
MARS regression results
Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.99            0.99            0.99
RMSE/Mean of Absolute Values of Actuals   12.41           17.09           20.40
Percentage of Mismatched Cases            1.21            4.28            6.34
Fig 13(a): MARS - time-varying actual and predicted values of open_perc (Case I)
Fig 13(b): MARS - relationship between the actual and the predicted values of open_perc (Case I)
Fig 13(c): MARS - time-varying residuals (Case I)
Fig 14(a): MARS - time-varying actual and predicted values of open_perc (Case II)
Fig 14(b): MARS - relationship between the actual and the predicted values of open_perc (Case II)
Fig 14(c): MARS - time-varying residuals (Case II)
Fig 15(a): MARS - time-varying actual and predicted values of open_perc (Case III)
Fig 15(b): MARS - relationship between the actual and the predicted values of open_perc (Case III)
Fig 15(c): MARS - time-varying residuals (Case III)
Decision Tree Regression: We used the tree function defined in the tree library in the R programming language to build a decision tree-based regression model. For Case I, close_perc turned out to be the splitting variable at the root node. Other important variables that led to splitting at nodes were high_perc and low_perc. Fig 16(a) depicts the decision tree model. The RMSE for this case was 0.2263, and the mean of the absolute values of the actual open_perc was found to be 0.6402. Among the total of 745 cases, 100 cases exhibited a sign mismatch between the predicted and the actual values of open_perc. The correlation coefficient between the predicted and the actual open_perc values turned out to be 0.97. The t-statistic for the correlation test yielded a value of 111.35 with a p-value of 2.2*10^-16, which indicated that there was a strong linear relationship between the predicted and the actual open_perc values. Fig 16(b), (c), and (d) depict different performance characteristics of the decision tree-based regression model for Case I. Fig 16(b) shows that, except for a few instances, the predicted values of open_perc very closely followed the pattern exhibited by its actual values. Fig 16(c) shows that with the increase in the actual open_perc values, its predicted values also exhibited a stepwise upward trend. Fig 16(d) shows that the residuals were randomly scattered and did not exhibit any autocorrelation. Fig 17(a) presents the decision tree regression model for Case II. In this case too, close_perc was the splitting variable at the root node, and the other two variables used for splitting at subsequent nodes were high_perc and low_perc.
27 This case yielded an RMSE value of 0.3440, and the mean of the absolute values of the actual open_perc values was 0.9286. 126 cases out of a total of 725 cases exhibited sign mismatch between their predicted and actual open_perc values. The correlation coefficient between the actual and the predicted values of open_perc was found to be 0.96 with a t -statistics value of the correlation test as 100.47, and its associated p -values as 2.2*10 -16 . The correlation test indicated that the predicted and the actual open_perc values were highly correlated. Fig 17 (b), (c), (d) show that the regression model was effective in establishing a linear relationship between the response variable – open_perc – and all other predictor variables. The decision tree regression model for Case III was the same as that in
Case I . The decision tree model is presented in Fig 18(a). However, the performance of the model yielded different results as it was tested on 2014 data unlike in
Case I , in which the model was tested on 2013 data. The correlation coefficient between the predicted and the actual values of open_perc for this was found to be 0.96 with the t -statistics value of the correlation test as 100.47 and its associated p -value as 2.2*10 -16 . However, as expected the RMSE for this case was higher than those in the previous two cases. The RMSE was found to be 0.5232 with the mean of the absolute values of the actual open_perc as 0.9286. This led to a significantly high value of 56.34 their ratio. 126 cases out a total of 725 cases exhibited mismatch in sign between the predicted and the actual values of open_perc . Fig 18(b), (c), and (d) present the performance of the model in Case III. While the behavior of the model was almost identical to that in the other tow case, in Fig 18(b) shows clearly that there was more deviation between the patterns exhibited by the actual values and the predicted values of open_perc. This led to a significantly higher RMSE in this case as compared to Case I and
Case II . Table 11:
Decision Tree regression results
Metrics Case I
Case II
Case III
Training 2013 Training 2014 Test 2014
Correlation Coefficient 0.97 0.97 0.96 RMSE/Mean of Absolute Values of Actuals 35.34 37.04 56.34 Percentage of Mismatched Cases 13.42 17.38 17.38
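The metrics reported in the regression result tables can be reproduced from a pair of actual and predicted series. The following is a minimal sketch in plain Python, not the authors' R code. The function name and the dictionary keys are hypothetical, a sign mismatch is assumed to mean that the product of the actual and the predicted value is negative, and the t-statistic uses the standard formula for a Pearson correlation test, t = r*sqrt(n-2)/sqrt(1-r^2).

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, RMSE as a percentage of mean |actual|, the percentage of
    sign mismatches, Pearson's r, and the t-statistic of the correlation test.
    Assumes both series have the same length and are not constant."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    mean_abs = sum(abs(a) for a in actual) / n

    # Assumption: a case is "mismatched" when the two values have opposite signs.
    mismatches = sum(1 for a, p in zip(actual, predicted) if a * p < 0)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(ss_a * ss_p)

    # t-statistic with n - 2 degrees of freedom; infinite for a perfect fit.
    t = math.inf if r * r >= 1.0 else r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

    return {
        "rmse": rmse,
        "rmse_to_mean_pct": 100.0 * rmse / mean_abs,
        "mismatch_pct": 100.0 * mismatches / n,
        "correlation": r,
        "t_statistic": t,
    }

# Arithmetic check against the Case I decision tree figures quoted in the text.
case1_ratio = 100.0 * 0.2263 / 0.6402      # RMSE relative to mean |open_perc|
case1_mismatch = 100.0 * 100 / 745         # 100 mismatched cases out of 745
```

The two check values agree with the Case I entries of Table 11 (35.34 and 13.42) up to rounding of the published RMSE and mean.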
Fig 16(a): Decision Tree regression model (Case I)
Fig 16(b): Decision Tree regression - time-varying actual and predicted values of open_perc (Case I)
Fig 16(c): Decision Tree regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 16(d): Decision Tree regression - time-varying residuals (Case I)
Fig 17(a): Decision Tree regression model (Case II)
Fig 17(b): Decision Tree regression - time-varying actual and predicted values of open_perc (Case II)
Fig 17(c): Decision Tree regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 17(d): Decision Tree regression - time-varying residuals (Case II)
Fig 18(a): Decision Tree regression model (Case III)
Fig 18(b): Decision Tree regression - time-varying actual and predicted values of open_perc (Case III)
Fig 18(c): Decision Tree regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 18(d): Decision Tree regression - time-varying residuals (Case III)

Bagging Regression: The bagging function defined in the ipred library of the R programming language was used in building the bagging regression model. In Case I, the RMSE was found to be 0.2141 with the mean of the absolute values of the actual open_perc as 0.6402. Among the 745 cases, 22 exhibited a sign mismatch between their predicted and the corresponding actual values of open_perc. Case II yielded an RMSE of 0.2386 with the mean of the absolute values of the actual open_perc as 0.9286. Of the 725 cases, 37 yielded a mismatch in sign between the predicted and the actual values of open_perc. The RMSE for Case III was found to be 0.3242. The mean of the absolute values of the actual open_perc was 0.9286. Of the 725 cases, 67 showed a mismatch in sign between the predicted and the corresponding actual values of open_perc.

Table 12: Bagging regression results

Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.97            0.98            0.97
RMSE/Mean of Absolute Values of Actuals   33.43           25.70           34.91
Percentage of Mismatched Cases            2.95            5.10            9.24
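To illustrate the idea behind bagging regression, the following is a minimal sketch in plain Python. It is not the ipred implementation: the base learner here is a one-split regression stump on a single predictor, swapped in for brevity, and all function names and default parameters are hypothetical. Each base learner is fitted on a bootstrap sample drawn with replacement, and the ensemble prediction is the average of the individual predictions.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold minimizing the SSE."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a split must leave cases on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All predictor values identical: fall back to predicting the mean.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Fit n_models stumps, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the predictions of all base learners."""
    return sum(predict_stump(m, x) for m in models) / len(models)
```

Averaging over bootstrap replicates reduces the variance of an unstable base learner such as a tree, which is consistent with the lower percentage of mismatched cases in Table 12 compared with the single decision tree in Table 11.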
Fig 19(a): Bagging regression - time-varying actual and predicted values of open_perc (Case I)
Fig 19(b): Bagging regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 19(c): Bagging regression - time-varying residuals (Case I)
Fig 20(a): Bagging regression - time-varying actual and predicted values of open_perc (Case II)
Fig 20(b): Bagging regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 20(c): Bagging regression - time-varying residuals (Case II)
Fig 21(a): Bagging regression - time-varying actual and predicted values of open_perc (Case III)
Fig 21(b): Bagging regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 21(c): Bagging regression - time-varying residuals (Case III)

Boosting Regression: We used the blackboost function defined in the mboost library of the R programming language. In Case I, 6 out of 745 cases exhibited mismatched signs between the predicted and the corresponding actual open_perc values. The RMSE for this case was found to be 0.1498, while the mean of the absolute values of the actual open_perc was 0.6402. For Case II, 34 out of 725 cases yielded mismatched signs between the actual and their corresponding predicted values of open_perc. The RMSE for this case was 0.1596, and the mean of the absolute values of the actual open_perc was 0.9286. Case III yielded an RMSE of 0.3855 with the mean of the absolute values of the actual open_perc as 0.9286. In this case, 50 out of 725 cases exhibited mismatched signs between the predicted and their corresponding actual values of open_perc.

Table 13: Boosting regression results

Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.99            0.99            0.97
RMSE/Mean of Absolute Values of Actuals   23.40           17.19           41.51
Percentage of Mismatched Cases            0.81            4.69            6.90

Fig 22(a): Boosting regression - time-varying actual and predicted values of open_perc (Case I)
Fig 22(b): Boosting regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 22(c): Boosting regression - time-varying residuals (Case I)
Fig 23(a): Boosting regression - time-varying actual and predicted values of open_perc (Case II)
Fig 23(b): Boosting regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 23(c): Boosting regression - time-varying residuals (Case II)
Fig 24(a): Boosting regression - time-varying actual and predicted values of open_perc (Case III)
Fig 24(b): Boosting regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 24(c): Boosting regression - time-varying residuals (Case III)

Random Forest Regression: We used the randomForest function defined in the randomForest library of the R programming language for building the random forest regression model. For all the three cases, the algorithm tried 3 variables at each split of the associated decision tree. The number of regression trees constructed in each case was 500. The mean squared residuals were found to be 0.0441, 0.0512, and 0.0441 for Case I, Case II, and Case III respectively. In Case I, the percentage of variance explained by the model was 95.13. None of the 745 cases exhibited any sign mismatch between their predicted and actual values of open_perc. While the RMSE for this case was 0.1005, the mean of the absolute values of the actual open_perc was 0.6402. For Case II, the model could explain 97.11% of the variance, and 19 out of 725 cases exhibited mismatched signs between the predicted and the actual values of open_perc. The RMSE for this case was 0.1005, while the mean of the absolute values of the actual open_perc was 0.9286. In Case III, 95.13% of the variance was explained by the model, and 47 out of 725 cases exhibited mismatched signs between the predicted and the actual open_perc values. The RMSE was 0.2973 with the mean of the absolute values of the actual open_perc values as 0.9286. Table 14 presents the results of the random forest regression model.

Table 14: Random Forest regression results

Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.99            0.99            0.97
RMSE/Mean of Absolute Values of Actuals   16.26           10.82           32.02
Percentage of Mismatched Cases            0.00            2.62            6.48
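The "mean squared residuals" and "percentage of variance explained" quoted for the random forest models are related by a pseudo R-squared. The sketch below, in plain Python, assumes the usual definition, 100 * (1 - MSE / Var(y)), where the variance is taken with divisor n and the predictions would be the out-of-bag predictions of the forest. The function name is hypothetical, and the exact computation inside the randomForest package is not stated in the text.

```python
def variance_explained(actual, predicted):
    """Return (mean squared residual, percentage of variance explained).
    Assumption: pseudo R-squared = 1 - MSE / Var(actual), variance with divisor n."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mean = sum(actual) / n
    variance = sum((a - mean) ** 2 for a in actual) / n
    return mse, 100.0 * (1.0 - mse / variance)
```

Under this definition, a model that always predicts the mean of the response explains 0% of the variance, and a perfect model explains 100%.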
Fig 25(a) depicts how the predicted open_perc values are superimposed on their corresponding actual values for each of the 745 time slots in Case I. The linear relationship between the predicted and the actual open_perc values is presented in Fig 25(b). The residuals of the random forest regression model are depicted in Fig 25(c). These three graphs, along with the numeric metrics presented under Case I in Table 14, clearly indicate that random forest regression very effectively modeled Case I of the Godrej Consumer data.

Fig 26(a), (b), and (c) present various visual performance metrics of the random forest regression model for Case II. It is evident from these figures that the predicted values of open_perc very closely followed the patterns of the actual values. Moreover, the residuals of the regression model exhibited randomness, and no significant autocorrelations were observed among them.

It is also evident from Fig 27(a), (b), and (c) that random forest regression was very effective in modeling Case III. Fig 27(b) indicates that there are some deviations from linearity at the head and the tail of the segment that otherwise exhibited a linear relationship between the actual and the predicted values of open_perc. This manifested in the form of a marginally higher value of the ratio of the RMSE to the mean of the absolute values of open_perc for Case III in random forest regression.

Fig 25(a): Random Forest regression - time-varying actual and predicted values of open_perc (Case I)
Fig 25(b): Random Forest regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 25(c): Random Forest regression - time-varying residuals (Case I)
Fig 26(a): Random Forest regression - time-varying actual and predicted values of open_perc (Case II)
Fig 26(b): Random Forest regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 26(c): Random Forest regression - time-varying residuals (Case II)
Fig 27(c): Random Forest regression - time-varying residuals (Case III)
Fig 27(a): Random Forest regression - time-varying actual and predicted values of open_perc (Case III)
Fig 27(b): Random Forest regression - relationship between the actual and the predicted values of open_perc (Case III)

ANN Regression: We used the neuralnet function defined in the neuralnet library of the R programming language for designing the ANN regression model on the Godrej data. For Case I, 8 out of 745 records were found to have mismatched signs between their actual and predicted open_perc values. The RMSE in this case was found to be 0.0960, while the mean of the absolute values of the actual open_perc values was 0.6402. In Case II, 66 out of 725 cases were found to have mismatched signs between their actual and predicted values of open_perc. The RMSE of the model for Case II was found to be 0.3389. In Case III, we found that 75 out of 725 cases had mismatched signs between their actual and predicted open_perc values. The RMSE for this case was found to be 0.3278. We also computed the product moment correlation coefficient of the predicted and actual open_perc values. The results for the ANN regression model are presented in Table 14.

Table 14: ANN regression results

Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.99            0.97            0.98
RMSE/Mean of Absolute Values of Actuals   16.96           36.49           35.29
Percentage of Mismatched Cases            1.07            15.31           9.10

Fig 28(a) depicts the ANN regression model for Case I. Only one node is used in the hidden layer, as additional nodes in this layer would have led to an overfitted model. The link weights are written in black, while the bias values associated with the hidden layer node and the output layer node are written in blue. The input layer contains one node for each input variable. Fig 28(b) shows how the predicted values of open_perc followed the variational patterns of its actual values. Fig 28(c) exhibits the linear relationship between the predicted and the actual values of open_perc. From both these figures, it is evident that Case I was modeled very well by ANN regression. Fig 28(d) shows that the residuals are random and do not exhibit any autocorrelation. The correlation for this case was found to be 0.99, and the percentage of cases that exhibited mismatched signs between the predicted and the actual open_perc was only 1.07.
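The single-hidden-node architecture described for the ANN regression model can be written out as a short forward pass. This is a minimal sketch in plain Python under stated assumptions: a logistic activation at the hidden node and a linear output node, which we believe are the defaults of the neuralnet function but which the text does not confirm. All weights and biases passed to the function are hypothetical placeholders, not the fitted values shown in Fig 28(a).

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def ann_predict(inputs, hidden_weights, hidden_bias, output_weight, output_bias):
    """Forward pass of a network with one hidden node and a linear output node."""
    # Weighted sum of the input variables plus the hidden-node bias.
    z = hidden_bias + sum(w * x for w, x in zip(hidden_weights, inputs))
    hidden_activation = logistic(z)
    # Linear output: no activation is applied at the output node.
    return output_bias + output_weight * hidden_activation
```

With a single hidden node the network output is a scaled and shifted logistic function of one linear combination of the predictors, so the model has very few parameters, which is in line with the stated aim of avoiding overfitting.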
Fig 28(a): ANN regression model (Case I)
Fig 28(b): ANN regression - time-varying actual and predicted values of open_perc (Case I)
Fig 28(c): ANN regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 28(d): ANN regression - time-varying residuals (Case I)

Fig 29(a) depicts the ANN regression model built for modeling Case II. Fig 29(b) and Fig 29(c) clearly show that the predicted series of open_perc very closely followed the patterns of its corresponding actual values. The linearity of the relationship between the predicted and actual values of open_perc is depicted in Fig 29(c). Fig 29(d) shows that the residuals of the regression model did not exhibit any autocorrelation. Fig 30(a), (b), (c), and (d) depict the ANN regression model for Case III, the behavior of the predicted values of open_perc with respect to its actual values, and the residuals of the regression model. These figures and the numerical metrics, namely the correlation coefficient, the ratio of the RMSE to the mean of the absolute values of the actual open_perc, and the number of cases in which the predicted values had different signs from the actual values, all showed that the model was very accurate.

Fig 29(a): ANN regression model (Case II)
Fig 29(b): ANN regression - time-varying actual and predicted values of open_perc (Case II)
Fig 29(c): ANN regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 29(d): ANN regression - time-varying residuals (Case II)
Fig 30(a): ANN regression model (Case III)
Fig 30(b): ANN regression - time-varying actual and predicted values of open_perc (Case III)
Fig 30(c): ANN regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 30(d): ANN regression - time-varying residuals (Case III)

SVM Regression: For SVM regression, we used the svm function defined in the e1071 library of the R programming language. For all the three cases, the regression type used by R was eps-regression and the SVM kernel was radial. The values of the parameters gamma and epsilon were both found to be 0.1. The algorithm found the number of support vectors to be 248, 265, and 246 for Case I, Case II, and Case III respectively. The RMSE values for the three cases were found to be 0.3450, 0.2593, and 0.7703 respectively. The mean of the absolute values of open_perc was 0.6402 for Case I and 0.9286 for Case II and Case III. We computed the ratio of the RMSE to the mean of the absolute values of open_perc for all the three cases to get an idea of the magnitude of the RMSE relative to the mean of the actual open_perc values. We also identified the cases that exhibited a difference in sign between the actual and predicted values of open_perc. These are the cases in which the regression model failed to predict the direction of movement of the actual open_perc values. For Case I, 2 out of 745 cases were found to have a sign mismatch between the actual and the predicted values of open_perc. In Case II, 32 out of 725 cases yielded a sign mismatch. In Case III, the model faced more challenges in prediction, and 95 out of 725 cases mismatched in sign between their actual and predicted values of open_perc. The product moment correlation coefficients between the actual and the predicted values of open_perc were also computed. The SVM regression results are presented in Table 15. For all the three cases, SVM regression was found to have yielded quite encouraging results.
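The terms "eps-regression", "radial kernel", gamma, and epsilon used above can be made concrete with a short sketch in plain Python. It shows the radial basis function kernel, the epsilon-insensitive loss that defines eps-regression, and how a fitted model would form a prediction from its support vectors. This is an illustration only, not the e1071 implementation: the function names, the support vectors, the dual coefficients, and the intercept are hypothetical, and only gamma = 0.1 and epsilon = 0.1 are taken from the text.

```python
import math

def rbf_kernel(x, z, gamma=0.1):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost linearly."""
    return max(0.0, abs(actual - predicted) - epsilon)

def svr_predict(x, support_vectors, dual_coefs, intercept, gamma=0.1):
    """Prediction as a kernel-weighted sum over the support vectors."""
    return intercept + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
```

Only the training cases whose errors reach or exceed the epsilon tube become support vectors, which is why the text reports a few hundred support vectors rather than one per training record.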
Table 15: SVM regression results

Metrics                                   Case I          Case II         Case III
                                          Training 2013   Training 2014   Test 2014
Correlation Coefficient                   0.93            0.98            0.83
RMSE/Mean of Absolute Values of Actuals   53.88           27.92           82.96
Percentage of Mismatched Cases            0.27            4.41            13.10

Fig 31(a) presents the variation of the actual open_perc and its predicted values at 745 time slots for Case I. It is clear that, in most of the cases, the predicted series has been able to accurately predict the movement of the actual open_perc time series. In Fig 31(b), we have plotted the predicted values of open_perc as a function of its actual values. It can be easily observed that, except for some points at the tail and the head, most of the points exhibit a strong linear relationship between the actual and the predicted values of open_perc for Case I. The residual plot in Fig 31(c) also shows that most of the residuals are random within a small range, with very few residuals exhibiting large positive or negative values.

Fig 32(a), (b), and (c) depict almost the same patterns as Fig 31(a), (b), and (c) respectively, indicating an almost identical performance of SVM regression in Case II as in Case I. In fact, if we closely observe the pattern of variation in Fig 32(a), we can see that the predicted open_perc series follows the actual open_perc series even more closely in this case. This can be verified by checking the ratio of the RMSE to the mean of the absolute values of the actual open_perc values, which was much lower in Case II than it was in Case I.

However, Fig 33(a), (b), and (c) clearly show that Case III proved to be much more challenging for the SVM regression model. The correlation coefficient between the actual and the predicted values of open_perc was found to be much lower in this case, which can be easily verified in Fig 33(a) and Fig 33(b). While Fig 33(a) shows that the predicted time series failed to follow the pattern exhibited by the actual open_perc time series at many time instances, Fig 33(b) exhibits substantial nonlinearity between the predicted and the actual open_perc values. Fig 33(c), however, shows that the residuals were randomly scattered and did not exhibit any significant autocorrelation.

Fig 31(a): SVM regression - time-varying actual and predicted values of open_perc (Case I)
Fig 31(b): SVM regression - relationship between the actual and the predicted values of open_perc (Case I)
Fig 31(c): SVM regression - time-varying residuals (Case I)
Fig 32(a): SVM regression - time-varying actual and predicted values of open_perc (Case II)
Fig 32(b): SVM regression - relationship between the actual and the predicted values of open_perc (Case II)
Fig 32(c): SVM regression - time-varying residuals (Case II)
Fig 33(a): SVM regression - time-varying actual and predicted values of open_perc (Case III)
Fig 33(b): SVM regression - relationship between the actual and the predicted values of open_perc (Case III)
Fig 33(c): SVM regression - time-varying residuals (Case III)

LSTM Regression:
In Section 5, we briefly discussed some major points on LSTM networks in deep learning. In the following, we present, in detail, the results related to the forecasting performance of the LSTM-based regression models used in the three cases. For all the three cases, we followed these steps in building the LSTM models: (i) reading the raw data, (ii) normalizing the data, (iii) converting the normalized data into a time series, (iv) converting the time series into a supervised learning problem, (v) creating a deep learning model using the Tensorflow and Keras frameworks, (vi) training and validating the model, (vii) visualizing the training and validation performance, and (viii) evaluating the prediction accuracy of the model on the test data. For all the three cases, the raw data consisted of the following attributes: (i) year, (ii) month, (iii) day, (iv) hour (i.e., the time slot), (v) open, (vi) high, (vii) low, (viii) close, (ix) volume, and (x) the NIFTY index. Using Python, we combined the attributes (i) through (iv) into a single attribute so that the resultant dataset consisted of seven attributes. We provide the details of the three cases in the following.

For Case I, we first plot the open, high, low, close, volume, and NIFTY time series. In this case, there were 746 records in total. Fig 34(a) depicts the time series for each of the attributes in Case I. All these six attributes (leaving out the time attribute) are then normalized using the MinMaxScaler class defined in the sklearn.preprocessing module in Python. Out of the 746 records, the first 500 records are used for training and the remaining 246 for validation. The Sequential function defined in Keras is used for building the LSTM model, and the model is compiled using MAE as the loss function and ADAM as the optimizer. The behavior of the training and the validation loss values is studied for different numbers of epochs and batch sizes. With a batch size of 72 and 100 epochs, the training and validation losses are found to have converged to a very low value. Fig 34(b) presents the behavioral patterns of the training and the validation losses in Case I. At the completion of the final epoch, the RMSE was 8.812 and Pearson's product moment correlation coefficient between the actual and the predicted open values was 0.983. The training and validation loss values were 0.0194 and 0.0252 respectively.
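The normalization, supervised-learning conversion, and train/validation split described above can be sketched in plain Python. This illustrates the data preparation only and is not the authors' code, which uses MinMaxScaler and Keras. The function names are hypothetical, and the choice of a single lagged time step as the input window is an assumption, since the text does not state the lag length.

```python
def min_max_scale(column):
    """Scale a list of values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

def to_supervised(rows, target_index=0, n_lags=1):
    """Turn a multivariate series into (input window, next target) pairs.
    Each input is the flattened features of the previous n_lags time steps;
    the target is the value of column target_index at the current step."""
    samples = []
    for t in range(n_lags, len(rows)):
        window = [value for row in rows[t - n_lags:t] for value in row]
        samples.append((window, rows[t][target_index]))
    return samples

def train_validation_split(samples, n_train):
    """Keep the time order: the first n_train samples train, the rest validate."""
    return samples[:n_train], samples[n_train:]

# Tiny illustration with two hypothetical attributes over three time slots.
open_scaled = min_max_scale([10.0, 20.0, 30.0])
close_scaled = min_max_scale([30.0, 20.0, 10.0])
rows = [list(pair) for pair in zip(open_scaled, close_scaled)]
pairs = to_supervised(rows, target_index=0, n_lags=1)
train, validation = train_validation_split(pairs, n_train=1)
```

The split is chronological rather than random so that the validation records always follow the training records in time, as in the 500/246 split used for Case I.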
Fig 34(a): LSTM regression - stock data representation (Case I)
Fig 34(b): LSTM regression - training and validation error (Case I)

Case II involved the stock prices for the entire year 2014 and consisted of 725 tuples. As in Case I, the six attributes are first plotted for all the 725 records. Fig 35(a) depicts the plots for the attributes open, high, low, close, volume, and the NIFTY index. Similar to Case I, the raw values of these six attributes are normalized using MinMaxScaler. The first 500 records are used for model construction, and the remaining 225 records are used in validating the model. The validation loss converged with the training loss at around 40 epochs. However, it started increasing again as the number of epochs increased. The validation loss finally converged with the training loss at 100 epochs with a batch size of 72. The RMSE of the model was found to be 15.002 with a correlation of 0.982 between the actual and the predicted open values. The training and the validation losses were 0.0134 and 0.0301 respectively after the completion of the last epoch. Fig 35(b) depicts the pattern of variation of the training and the validation losses with the number of epochs in Case II.

Fig 35(a): LSTM regression - stock data representation (Case II)
Fig 35(b): LSTM regression - training and validation error (Case II)

In Case III, the LSTM model was built using the records of the year 2013 and then tested on the records of the year 2014. The raw dataset, in this case, consisted of 1471 records in total, of which 746 records (those belonging to the year 2013) were used in building the model, and the remaining 725 records (those belonging to the year 2014) were used for testing the model. Fig 36(a) presents the plots of the open, high, low, close, volume, and NIFTY time series for this case with 1471 records. The training and the test losses were found to have converged at 60 epochs with a batch size of 72. The RMSE and the correlation for this case were found to be 13.477 and 0.996 respectively. The training and the test losses were 0.0116 and 0.0258 respectively. Fig 36(b) depicts the patterns exhibited by the training and the test losses with the number of epochs.

Fig 36(a): LSTM regression - stock data representation (Case III)
Fig 36(b): LSTM regression - training and testing error (Case III)

Overall Performance:
Finally, we summarize the performance of the different predictive models that we have built, validated, and tested on the stock price data of Godrej Consumer Products for the period January 2013 till December 2014. Tables 16 to 18 present the performance of the classification models under Case I, Case II, and Case III respectively. For each case and for each metric, the model that exhibited the best performance has been marked in bold font. We observe that, for both Case I and Case II and on all the metrics, boosting performed the best among all the classification models. However, since Case I and Case II reflect only the training accuracies, the performance in Case III should be considered the most critical, as it demonstrates the test accuracy of a model. From Table 18, we find that ANN performed the best on sensitivity and NPV, while boosting outperformed all other models on specificity, PPV, and classification accuracy. However, SVM was found to have performed best on the F1 score, which is usually considered to be the most important metric in classification. In Tables 16 to 21, the following abbreviations are used in the column names: LR - Logistic Regression, KNN - K-Nearest Neighbor, DT - Decision Tree, BAG - Bagging, BOOST - Boosting, RF - Random Forest, ANN - Artificial Neural Networks, SVM - Support Vector Machines, LSTM - Long and Short Term Memory.

Table 16: Summary of the performance of the classification models in Case I
LR KNN DT BAG BOOST RF ANN SVM
Sensitivity 94.79 89.57 95.09 95.09

Table 17: Summary of the performance of the classification models in Case II
LR KNN DT BAG BOOST RF ANN SVM
Sensitivity 94.83 86.93 92.40 95.44

Table 18: Summary of the performance of the classification models in Case III
LR KNN DT BAG BOOST RF ANN SVM
Sensitivity 92.10 84.50 89.97 89.97 92.10 91.19
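The classification metrics named in this summary (sensitivity, specificity, PPV, NPV, classification accuracy, and F1 score) all derive from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python using the standard definitions. The function name and dictionary keys are hypothetical, and which direction of open_perc movement is treated as the positive class is an assumption that this fragment does not settle.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the six metrics, as percentages, from confusion-matrix counts.
    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives."""
    sensitivity = tp / (tp + fn)          # recall of the positive class
    specificity = tn / (tn + fp)          # recall of the negative class
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {
        "sensitivity": 100.0 * sensitivity,
        "specificity": 100.0 * specificity,
        "ppv": 100.0 * ppv,
        "npv": 100.0 * npv,
        "accuracy": 100.0 * accuracy,
        "f1": 100.0 * f1,
    }
```

Because the F1 score is the harmonic mean of PPV and sensitivity, a model can lead on F1 without leading on either component, which is how SVM can rank first on F1 in Case III while ANN and boosting lead on the individual metrics.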
Tables 19 to 21 present the performance of the regression models, including the LSTM-based deep learning model. Since the LSTM model outperformed the machine learning models on all metrics and for all the three cases, we have also noted the best performing machine learning model on each metric. In Case I, multivariate regression, MARS, boosting, random forest, and ANN all yielded the highest correlation coefficient of 0.99 among the machine learning models. However, the correlation coefficient was found to be 1.00 in the case of LSTM. For the ratio of the RMSE to the mean of the absolute values of open_perc, MARS yielded the lowest value of 12.41 among the machine learning models, while the corresponding value for LSTM was 7.94. Both random forest and LSTM yielded no sign mismatch between the predicted and the actual values of open_perc.

In Case II, the highest value of the correlation coefficient among the machine learning models was achieved by multivariate regression, MARS, boosting, and random forest. LSTM outperformed all the machine learning models on this metric by attaining a value of 1.00. The RMSE to mean ratio of 10.82 for random forest was the lowest among the machine learning models. However, the corresponding value yielded by LSTM was 4.04. Random forest produced only 2.62 percent of cases with mismatched signs between the actual and predicted open_perc values, whereas for LSTM, all the cases had the same sign for the actual and the predicted open_perc values.

For Case III, while LSTM exhibited the best performance on all metrics, multivariate regression and MARS yielded the same value for the correlation coefficient. For the RMSE to mean ratio and the percentage of mismatched cases, multivariate regression and MARS produced the best results among the machine learning models.
Table 19: Summary of the performance of the regression models in Case I
MV MARS DT BAG BOOST RF ANN SVM LSTM
Correlation
RMSE/Mean 13.32
Mismatched Cases 18.67 1.21 13.42 2.95 0.81

Table 20: Summary of the performance of the regression models in Case II
MV MARS DT BAG BOOST RF ANN SVM LSTM
Correlation
RMSE/Mean 18.84 17.09 37.04 25.70 17.19
Mismatched Cases 51.31 4.28 17.38 5.10 4.69

Table 21: Summary of the performance of the regression models in Case III
MV MARS DT BAG BOOST RF ANN SVM LSTM
Correlation
RMSE/Mean
Mismatched Cases 51.31

Conclusion
In this work, we have proposed a robust forecasting framework for stock price and stock price movement pattern prediction with a very high level of accuracy. The predictive model consists of eight classification and eight regression models based on several machine learning approaches. In addition to that, the framework also includes a deep learning model of regression using an LSTM network. All these models work on a short-term time horizon, and they have the ability to forecast stock price movement and stock price on the basis of three-time slots on a given day. We constructed the models, trained, validated and finally tested them using the historical stock prices of a company – Godrej Consumer Products Ltd. The data is taken from the listed values of the stock in the National Stock Exchange (NSE) of India during the period of two years – January 2013 till December 2014. The stock price data were extracted from the NSE database at five minutes interval of time using the Metastock tool. After its collection, the raw data were pre-processed, appropriate transformation (i.e., normalization, standardization, NA removal, etc.) done, and a number of derived predictor variables are created using the rich features of the stock data. While a number of newly derived predictors were used in building the model, the used the percentage change in the open values of the stock, called open_perc , as the response variables. The five minutes interval granular data are also aggregated into three slots on a given day so that the predictive models can be utilized to forecast the value of open_perc in the next slot given stock price data till the current slot. While the classification-based models are used to predict the movement pattern of open_perc values, the objective of the regression models is to accurately predict the value of the open_perc . 
In addition to exploiting machine learning algorithms for building the eight classification and eight regression models, we used the TensorFlow and Keras frameworks to build a deep learning regression model based on an LSTM network. The machine learning models were built in the R programming language, while the LSTM-based regression model was built in Python. The models were trained, validated, and tested on the stock data, and extensive results were produced and critically analyzed. The results yielded an interesting observation: while no single machine learning model performed best on all the classification and regression metrics, the LSTM-based deep learning model outperformed all the other regression models on every metric that we considered. In another recent work, we studied the efficacy and accuracy of a CNN-based deep learning regression model in time series forecasting. Deep learning models are now widely regarded as having a much higher capability for extracting and learning features from time series data than their machine learning counterparts. However, exploiting the power of deep learning models requires a very large volume of data. As future work, we would explore the use of generative adversarial networks (GANs) in forecasting price movements and values. We believe that an integrated approach to building deep learning models that combines the power of LSTM, CNN, and GAN is a promising direction for further work.
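As a rough illustration of how the three regression metrics named in the summary tables (correlation, RMSE/Mean, and mismatched cases) could be computed, consider the following Python sketch. The exact definitions are assumptions here: RMSE is normalized by the mean absolute actual value, and a case counts as mismatched when the predicted and actual open_perc values differ in sign. The function names are hypothetical.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rmse_over_mean(actual, predicted):
    """RMSE divided by the mean absolute actual value (assumed normalization)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(abs(a) for a in actual) / n)


def mismatched_cases(actual, predicted):
    """Percentage of cases whose predicted sign differs from the actual sign."""
    wrong = sum(1 for a, p in zip(actual, predicted) if (a > 0) != (p > 0))
    return 100.0 * wrong / len(actual)
```

Under these assumed definitions, a better regression model shows a higher correlation, a lower RMSE/Mean ratio, and a lower percentage of mismatched cases.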
References
Adebiyi, A., Adewumi, O., & Ayo, C.K. (2014). Stock price prediction using the ARIMA model. Proceedings of the International Conference on Computer Modelling and Simulation, 105–111, Cambridge, UK.
Basalto, N., Bellotti, R., De Carlo, F., Facchi, P., & Pascazio, S. (2005). Clustering stock market companies via chaotic map synchronization. Physica A, 345, 196-206.
Basu, S. (1983). The relationship between earnings yield, market value and return for NYSE common stocks: further evidence. Journal of Economics, 12(1), 129-156.
Bentes, S. R., Menezes, R., & Mendes, D. A. (2008). Long memory and volatility clustering: is the empirical evidence consistent across stock markets? Physica A: Statistical Mechanics and its Applications, 387(15), 3826-3830.
Chen, A.-S., Leung, M. T. & Daouk, H. (2003). Application of Neural Networks to an Emerging Financial Market: Forecasting and Trading the Taiwan Stock Index. Operations Research in Emerging Economics, 30(6), 901–923. DOI: 10.1016/S0305-0548(02)00037-0.
Chen, Y., Dong, X. & Zhao, Y. (2005). Stock Index Modeling Using EDA Based Local Linear Wavelet Neural Network. Proceedings of International Conference on Neural Networks and Brain, 1646–1650. DOI: 10.1109/ICNNB.2005.1614946.
Chui, A. & Wei, K. (1998). Book-to-market firm size, and the turn of the year effect: evidence from Pacific basin emerging markets. Pacific Basin Finance Journal, 6(3-4), 275-293.
de Faria, E., Albuquerque, M. P., Gonzalez, J., Cavalcante, J. & Albuquerque, M. P. (2009). Predicting the Brazilian Stock Market through Neural Networks and Adaptive Exponential Smoothing Methods. Expert Systems with Applications, 36(10), 12506-12509. DOI: 10.1016/j.eswa.2009.04.032.
Dutta, G., Jha, P., Laha, A. & Mohan, N. (2006). Artificial neural network models for forecasting stock price index in the Bombay Stock Exchange. Journal of Emerging Market Finance, 5(3), 283-295. DOI:
Journal of Finance, 50(1), 131-155.
Fu, T-C., Chung, F-L., Luk, R., & Ng, C-M. (2008). Representing financial time series based on data point importance. Engineering Applications of Artificial Intelligence, 21(2), 277-300.
Geron, A. (2019). Hands-on Machine Learning with Scikit-Learn Keras & Tensorflow. O’Reilly Publications, USA.
Hanias, M., Curtis, P. & Thalassinos, J. (2012). Time Series Prediction with Neural Networks for the Athens Stock Exchange Indicator. European Research Studies, 15(2), 23-31.
Hornik, K. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A Nonparametric Approach to Pricing and Hedging Derivative Securities via Learning Networks. Journal of Finance, 49(3), 851-889. DOI: 10.3386/w4718.
Jaffe, J., Keim, D. B., & Westerfield, R. (1989). Earnings, yields, market values and stock returns. Journal of Finance, 44(1), 135-148.
Jarrett, J.E. & Kyper, E. (2011). ARIMA modeling with intervention to forecast and analyze Chinese stock prices. International Journal of Engineering Business Management, 3(3), 53-58.
Jaruszewicz, M. & Mandziuk, J. (2004). One day prediction of Nikkei index considering information from other stock markets. Proceedings of the International Conference on Artificial Intelligence and Soft Computing, 1130–1135, Zakopane, Poland.
Kimoto, T., Asakawa, K., Yoda, M. & Takeoka, M. (1990). Stock Market Prediction System with Modular Neural Networks. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 1-16. DOI: 10.1109/IJCNN.1990.137535
Leigh, W., Hightower, R. & Modani, N. (2005). Forecasting the New York Stock Exchange Composite Index with Past Price and Interest Rate on Condition of Volume Spike. Expert Systems with Applications, 28(1), 1-8. DOI: 10.1016/j.eswa.2004.08.001
Liao, S-H., Ho, H-H., Lin, H-W. (2008). Mining stock category association and cluster on Taiwan stock market. Expert Systems with Applications, 35(2008), 19-29.
Mehtab, S. & Sen, J. (2020). Stock price prediction using convolutional neural networks on a multivariate time series. Proceedings of the 3rd National Conference on Machine Learning and Artificial Intelligence (NCMLAI’20), New Delhi, India, February 1, 2020.
Mehtab, S. & Sen, J. (2019). A robust predictive model for stock price prediction using deep learning and natural language processing. Proceedings of the 7th International Conference on Business Analytics and Intelligence (BAICONF’19)
IUP Journal of Financial Risk Management, 13(1), 7-27.
Mondal, P., Shit, L., & Goswami, S. (2014). Study of effectiveness of time series modeling (ARMA) in forecasting stock prices. International Journal of Computer Science, Engineering and Applications, 4, 13-29.
Moshiri, S. & Cameron, N. (2010). Neural network versus econometric models in forecasting inflation. Journal of Forecasting, 19(3), 201-217.
Mostafa, M. (2010). Forecasting stock exchange movements using neural networks: empirical evidence from Kuwait. Expert Systems with Applications, 37(9), 6302-6309. DOI: 10.1016/j.eswa.2010.02.091.
Phua, P. K. H., Ming, D., & Lin, W. (2000). Neural network with genetic algorithms for stock prediction. th Conference of the Association of Asian-Pacific Operations Research Societies, Singapore.
Rosenberg, B., Reid, K., & Lanstein, R. (1985). Persuasive evidence of market inefficiency. Journal of Portfolio Management, 11(1), 9-17.
Sen, J. & Datta Chaudhuri, T. (2018a). Understanding the sectors of Indian economy for portfolio choice. International Journal of Business Forecasting and Marketing Intelligence, 4(2), 178-222. DOI: 10.1504/IJBFMI.2018.090914.
Sen, J. & Datta Chaudhuri, T. (2018b). Stock price prediction using machine learning and deep learning frameworks. Proceedings of the 6th International Conference on Business Analytics and Intelligence (ICBAI’18), Bangalore, India, December 20-22, 2018.
Sen, J. & Datta Chaudhuri, T. (2017a). A time series analysis-based forecasting framework for the Indian healthcare sector. Journal of Insurance and Financial Management, 3(1), 66-94.
Sen, J. & Datta Chaudhuri, T. (2017b). A predictive analysis of the Indian FMCG sector using time series decomposition-based approach. Journal of Economics Library, 4(2), 206-226. DOI: http://dx.doi.org/10.1453/jel.v4i2.1282.
Sen, J. & Datta Chaudhuri, T. (2017c). A time series analysis-based forecasting approach for the Indian realty sector. International Journal of Applied Economic Studies, 5(4), 8–27.
Sen, J. & Datta Chaudhuri, T. (2017d). A robust predictive model for stock price forecasting. Proceedings of the 5th International Conference on Business Analytics and Intelligence, Bangalore, India, December 11-13, 2017.
Sen, J. & Datta Chaudhuri, T. (2016). An alternative framework for time series decomposition and forecasting and its relevance for portfolio choice – a comparative study of the Indian consumer durable and small-cap sector. Journal of Economics Library, 3(2), 303-326.
Senol, D. & Ozturan, M. (2008). Stock price direction prediction using artificial neural network approach: the case of Turkey. Journal of Artificial Intelligence, 1, 70-77. DOI: 10.3923/jai.2008.70.77
Shen, J., Fan, H. & Chang, S. (2007). Stock index prediction based on adaptive training and pruning algorithm. Advances in Neural Networks, Lecture Notes in Computer Science, Springer-Verlag, 4492, 457–464. DOI: 10.1007/978-3-540-72393-6_55.
Siddiqui, T.A., Abdullah, Y. (2015). Developing a nonlinear model to predict stock prices in India: an artificial neural networks approach. IUP Journal of Applied Finance, 21(3), 36-39.
Thenmozhi, M. (2006). Forecasting stock index numbers using neural networks. Delhi Business Review
Proceedings of International MultiConference of Engineers and Computer Scientists, 1.
Tseng, K-C., Kwon, O., & Tjung, L. C. (2012). Time series and neural network forecast of daily stock prices. Investment Management and Financial Innovations, 9(1), 32-54.
Wu, Q., Chen, Y. & Liu, Z. (2008). Ensemble model of intelligent paradigms for stock market forecasting. Proceedings of the IEEE 1st International Workshop on Knowledge Discovery and Data Mining, 205–208, Washington, DC, USA. DOI: 10.1109/WKDD.2008.54
Zhang, D., Jiang, Q., & Li, X. (2007). Application of neural networks in financial data mining. International Journal of Computer, Electrical, Automation, and Information Engineering, 1(1), 225-228, World Academy of Science, Engineering and Technology.
Zhu, X., Wang, H., Xu, L. & Li, H. (2008). Predicting stock index increments by neural networks: the role of trading volume under different horizons.