Stock Volatility Prediction Using Recurrent Neural Networks with Sentiment Analysis
arXiv preprint [cs.SI]
Yifan Liu, Zengchang Qin*, Pengyu Li, and Tao Wan*
Intelligent Computing and Machine Learning Lab, School of ASEE, Beihang University, Beijing 100191, China
School of Mechanical Engineering and Automation, Beihang University, Beijing 100191, China
School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
* {taowan,zcqin}@buaa.edu.cn

Abstract.
In this paper, we propose a model to analyze the sentiment of an online stock forum and use this information to predict stock volatility in the Chinese market. We have labeled the sentiment of the online financial posts and made the dataset publicly available for research. By generating a sentimental dictionary based on financial terms, we develop a model to compute the sentimental score of each online post related to a particular stock. This sentimental information is represented by two sentiment indicators, which are fused with market data for stock volatility prediction using Recurrent Neural Networks (RNNs). An empirical study shows that, compared with using an RNN alone, the model performs significantly better with the sentimental indicators.
Keywords: Natural language processing, Stock volatility prediction, Sentimental analysis, Sentimental score
In time-series data mining, the stock market is notoriously difficult to analyze, and it is even considered entirely unpredictable under the famous Efficient Market Hypothesis (EMH). As early as 1900, Bachelier [2] applied statistical methods to stock data and found that the mathematical expectation of the stock fluctuation tends to be zero. In the 1970s, Fama [7] formally put forward the EMH, which states that in a market with complete information, investors cannot gain more than fifty percent of the profits using past prices alone; put simply, no one can "beat the market" consistently. The Random Walk Theory (RWT) proposed by Osborne [14] reached the same conclusion that stock prices are unpredictable. However, all these theories rest on the same assumption: investors are rational and complete market information is available.

Paul Hawtin, the founder of Derwent Capital Markets and one of the early pioneers in the use of social media sentiment analysis to trade financial derivatives, once said: "For years, investors have widely accepted that financial markets are driven by fear and greed." In the actual market, investors cannot be completely rational. They may be influenced by their emotions and make impulsive decisions [16]. Therefore, the basic assumption of the EMH and the RWT is not unassailable. Many researchers have studied the correlation between sentiment and stock market volatility in order to challenge the classical theories. For example, researchers found that factors such as weather and sports games can affect public emotion and also the stock market: sunny weather and a rising stock index were correlated to some degree [10], and a significant market decline tended to follow a soccer loss [5]. In recent years, the rapid development of social networks (Facebook, Twitter, Weibo) has opened a new door to measuring public emotion. Bollen et al.
[3] analyzed the text content of daily Twitter feeds using two mood tracking tools: OpinionFinder (OF) and Google-Profile of Mood States (GPOMS). The authors then used a self-organizing fuzzy neural network (SOFNN) to predict the volatility of the Dow Jones Industrial Average (DJIA). By considering the sentimental information from Twitter, the prediction accuracy was raised by 13%. The results were very encouraging, and this direction was followed by other similar research. Zhang et al. [19] found that a burst of public emotion, whether positive or negative, heralded a fall of the index. There was also some research on individual stocks: Si et al. [17] proposed a technique that leverages topic-based sentiment from Twitter to help predict the stock price, while O'Connor [13] found that the popularity of a brand was closely related to the tweets about it and to its stock price.

However, some problems remain. First, Twitter users are predominantly English speakers and, even worse, the investors of a particular market may not use Twitter to discuss their finances [3]. Second, popular sentiment analysis dictionaries cannot fully measure the emotion of stock investors [11]. Sprenger et al. [18] selected tweets that mentioned companies in the Standard & Poor's 100 index and labeled them with buy, hold or sell signals. With the labeled training data, they used a Naive Bayes classifier to extract the signals from the tweets automatically and calculated the bullishness from these signals. Finally, they found that a strategy based on bullishness signals could earn substantial abnormal returns.

Overall, a great deal of attention has been paid to the correlation between investors' sentiment and the U.S. stock market. However, owing to the complexity of Chinese expressions, little attention has been paid to the corresponding research on the Chinese stock market.
According to the World Federation of Exchanges database, the Chinese market ranked second in the world by market capitalization in 2015. That is why we focus our study on the Chinese stock market. In this paper, building on Sprenger's [18] approach, we propose a model to study the Chinese stock market. The sentiment of Chinese investors is taken from the East Money Forum (http://guba.eastmoney.com/), which is one of the biggest specialized stock forums in China; unlike Facebook or Twitter, it is not a general public forum. Each stock has its own sub-forum, which ensures that most posts in a sub-forum are published by investors who hold or sell that particular stock. In order to avoid the problem that an ordinary dictionary often misreads investors' sentiment, we use a machine learning method to generate our own dictionary and then automatically calculate the sentiment score of the posts based on this dictionary. To study the correlation between the Chinese stock market and Chinese investor sentiment, we propose sentiment indicators for the stock volatility prediction model and use Recurrent Neural Networks (RNNs) to obtain better performance.

Bollen et al. [3] proposed a dictionary-based method for sentiment analysis of financial contexts. However, Loughran and McDonald [11] found that three-fourths of the words identified as negative by the Harvard Dictionary are not typically considered negative in financial contexts. The same problem also occurs in Chinese sentiment analysis. We find that Chinese posts in stock forums contain special expressions that carry strong emotions, but these expressions rarely appear in common sentiment analysis dictionaries. In this research, we therefore first need a practical dataset from which we can obtain a dictionary of financial words; we also develop a simple but effective tool to generate sentimental weights for the words.
East Money Forum is one of the most influential Internet financial media in China. It has more than 3000 sub-forums, one for each individual stock. We randomly select 10 stocks and collect the posts in their sub-forums from 25 Sept. 2015 to 30 Sept. 2016 with the web crawler "Bazhuayu" (meaning "Octopus"). Nearly 96000 posts are obtained, and most of them are short and colloquial. They do not follow any strict syntax but contain strong sentiment. We randomly sampled 3427 posts from the 10 stocks for manual annotation (http://dsd.future-lab.cn/members/2016/LiuYFProject/data.xlsx). If a post expresses an optimistic attitude towards the stock market and suggests buying, we label it as positive; otherwise, we label it as negative. We have manually annotated 2067 negative posts and 1360 positive posts. The original Chinese texts need to be preprocessed by segmentation, and a classical Chinese text segmentation tool for Python called "Jieba" (Chinese for "to stutter", https://github.com/fxsjy/jieba) is chosen for this.

The polarity model of sentiment is trained on a collection of texts labelled only as positive or negative. Emotional words are extracted and each one has an associated sentiment weight. The weights can be learned from the labeled training dataset [9]. The sum of the weighted sentiment scores of all terms determines the sentiment polarity (positive or negative) of the post: if the sum is greater than 0, the post is positive, and vice versa. The sentiment score $h_w(x)$ of a given post is computed as follows:

$$h_w(x) = f\left(\sum_{i=1}^{N} w^{(i)} x^{(i)}\right) = f(w^T x); \quad 1 \le i \le N \tag{1}$$

where $N$ is the number of terms in the corpus (a term can be a unigram or a bigram), $w^{(i)}$ is the sentimental weight of term $t^{(i)}$, and $x^{(i)}$ is the term frequency or tf-idf value of term $t^{(i)}$. The function $f(\cdot)$ is a sigmoid function that compresses the linear combination of sentimental weights into the interval (0, 1) and makes it smooth.
$$f(z) = \frac{1}{1 + e^{-z}} \tag{2}$$

In logistic regression, the target label $y$ is 1 for positive posts and 0 for negative ones, so that $h_w(x)$ represents the probability of a post being positive. If we take the threshold value as 0.5, the prediction of the sentiment is:

$$y = \begin{cases} 1 & h > 0.5 \\ 0 & h \le 0.5 \end{cases} \tag{3}$$

Given $M$ text posts, $x^{(k)}$ denotes the feature vector of the $k$-th post ($1 \le k \le M$). We can derive the cost function and its log-likelihood function based on maximum likelihood estimation. The loss function $J(\cdot)$ is:

$$J(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases} \tag{4}$$

The average loss for the entire data set is (for $1 \le k \le M$):

$$J(w) = \frac{1}{M} \sum_{k=1}^{M} J(h_w(x^{(k)}), y^{(k)}) = -\frac{1}{M} \left[ \sum_{k=1}^{M} y^{(k)} \log(h_w(x^{(k)})) + (1 - y^{(k)}) \log(1 - h_w(x^{(k)})) \right] \tag{5}$$

In order to minimize $J(w)$, we can update $w$ using the Gradient Descent algorithm with a learning rate $\alpha$: $w_{j+1} = w_j - \alpha (h^{(k)} - y^{(k)}) x^{(k)}$, where $j$ indexes the update step and $h^{(k)} = h_w(x^{(k)})$. In this way the values of the sentimental weights are obtained. A term with a higher sentimental weight indicates a stronger positive sentiment, and vice versa. We use the weights and the corresponding terms to build a sentimental dictionary. The sentiment score of a post can then be calculated as the weighted sum of the weights of all its terms based on Eq. (1).

Emotion Model for Stock Prediction
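The emotion model in this section takes per-post sentiment scores as its input. As a reminder of how such scores can be produced, the following is a minimal sketch of the polarity model of Eqs. (1)-(5), not the authors' implementation. The toy English terms, the learning rate `alpha`, the number of epochs and all function names are assumptions made for illustration; the real model works on segmented Chinese posts.

```python
import math

def sigmoid(z):
    """Eq. (2): squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Eq. (1): h_w(x) = f(w^T x); x maps each term to its frequency."""
    return sigmoid(sum(weights.get(t, 0.0) * v for t, v in x.items()))

def train(posts, labels, alpha=0.5, epochs=200):
    """Per-post gradient descent on the cross-entropy loss of Eq. (5).

    alpha and epochs are illustrative values, not taken from the paper.
    """
    weights = {}
    for _ in range(epochs):
        for x, y in zip(posts, labels):
            err = score(weights, x) - y              # h^(k) - y^(k)
            for t, v in x.items():
                weights[t] = weights.get(t, 0.0) - alpha * err * v
    return weights

def average_loss(weights, posts, labels):
    """Eq. (5): mean cross-entropy over the labelled posts."""
    total = 0.0
    for x, y in zip(posts, labels):
        h = score(weights, x)
        total -= y * math.log(h) + (1 - y) * math.log(1 - h)
    return total / len(posts)

# Hypothetical toy data: term-frequency dictionaries with 1/0 labels.
posts = [{"buy": 1, "rise": 2}, {"sell": 1, "drop": 1},
         {"rise": 1}, {"drop": 2, "sell": 1}]
labels = [1, 0, 1, 0]

dictionary = train(posts, labels)                    # term -> sentimental weight
predictions = [1 if score(dictionary, x) > 0.5 else 0 for x in posts]  # Eq. (3)
```

The learned `dictionary` plays the role of the sentimental dictionary: terms with positive weights push a post towards the positive class, and terms with negative weights push it towards the negative class.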
Some studies in finance [18] suggest that individual investors have a herd mentality when they make decisions. For example, if they find that most people are not optimistic about the outlook of the stock price, they will trade on this advice and move the price. Moreover, a larger number of posts on a forum indicates a larger amount of attention, which may lead to more severe price volatility. Therefore, we propose an emotion model (EMM) based on the following two assumptions: (1) increased bullishness of stock posts is associated with a higher stock price; (2) increased post volume suggests a more substantial volatility. The index of bullishness of online posts can be defined on a daily basis according to [1]:

$$B_t = \ln \frac{1 + N_t^p}{1 + N_t^n} \tag{6}$$

where $N_t^p$ ($N_t^n$) represents the number of positive (negative) posts on day $t$. This indicator reflects both the expectation of a rise in price and the total number of posts. When the posts have a continuous sentimental score instead of a binary label, the index of bullishness becomes

$$B_t = \ln \frac{\varepsilon + S_t^p}{\varepsilon + |S_t^n|} = \ln \frac{\varepsilon + \sum_{k=1}^{N_t^p} h_w(x)^{(k)}}{\varepsilon + \left| \sum_{i=1}^{N_t^n} h_w(x)^{(i)} \right|} \tag{7}$$

where $S_t^p$ ($S_t^n$) represents the sum of the positive (negative) sentimental scores of the posts on day $t$, and $h_w(x)^{(k)}$ ($h_w(x)^{(i)}$) represents the positive (negative) sentimental score of the $k$-th ($i$-th) post on day $t$. $\varepsilon$ ($\varepsilon > 0$) is a tiny number for smoothing, set to a small fixed value. The total number of posts $N_t$ on day $t$ is $N_t = N_t^p + N_t^n$. To enable a fair comparison of $B_t$ and $N_t$, we use the z-score to normalize the data based on the mean and standard deviation within a sliding window of length $l$ (we average the data of the $l$ days before and after the current date $t$). The z-scores for $B_t$ and $N_t$ are:

$$Z_B(t) = \frac{B_t - \mu(B_{t \pm l})}{\sigma(B_{t \pm l})} \tag{8}$$

$$Z_N(t) = \frac{N_t - \mu(N_{t \pm l})}{\sigma(N_{t \pm l})} \tag{9}$$

where $\mu(B_{t \pm l})$ and $\mu(N_{t \pm l})$ are the means and $\sigma(B_{t \pm l})$ and $\sigma(N_{t \pm l})$ are the standard deviations over the $2l$ days around the current day $t$. The correlations between $Z_B$ and the stock price (Fig. 1-(a)) and between $Z_N$ and the stock volatility (Fig. 1-(b)) for a particular stock are shown in Fig. 1. We can see that they are positively correlated and satisfy the two assumptions on sentimental indicators given above.

Fig. 1: Relationship between sentimental indicators and the stock information. (a) Relationship between bullishness and the stock price (x-axis: date; y-axis: normalized value of B and P). (b) Relationship between the number of posts and the absolute value of the stock volatility (x-axis: date; y-axis: normalized value of N and V).
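The two sentimental indicators of Eqs. (6)-(9) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the smoothing constant `eps=1e-3`, the use of the population standard deviation, the inclusion of day t itself in the window, the truncation of the window at the ends of the series, and the daily counts are all hypothetical choices.

```python
import math
from statistics import mean, pstdev

def bullishness_counts(n_pos, n_neg):
    """Eq. (6): log ratio of positive to negative post counts on one day."""
    return math.log((1 + n_pos) / (1 + n_neg))

def bullishness_scores(pos_scores, neg_scores, eps=1e-3):
    """Eq. (7): the same ratio with summed continuous sentiment scores.

    eps is an assumed smoothing constant (eps > 0).
    """
    return math.log((eps + sum(pos_scores)) / (eps + abs(sum(neg_scores))))

def sliding_zscore(series, l):
    """Eqs. (8)-(9): z-score of each day against the days within l of it."""
    out = []
    for t, value in enumerate(series):
        window = series[max(0, t - l): t + l + 1]    # truncated at the edges
        mu, sigma = mean(window), pstdev(window)
        out.append((value - mu) / sigma if sigma > 0 else 0.0)
    return out

# Hypothetical daily (positive, negative) post counts for one stock.
daily_counts = [(12, 5), (3, 9), (7, 7), (20, 4), (2, 2)]
B = [bullishness_counts(p, n) for p, n in daily_counts]
N = [float(p + n) for p, n in daily_counts]          # N_t = N_t^p + N_t^n
Z_B = sliding_zscore(B, l=2)
Z_N = sliding_zscore(N, l=2)
```

A day with equal numbers of positive and negative posts gets `B_t = 0`, and the z-scores put `B_t` and `N_t` on a comparable scale before they are fed to the prediction model.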
The stock price for a day is a weighted average price of transactions; in the Chinese market it is calculated over the last minute of the trading day, so it is also referred to as the closing price [12]. In the actual stock market, profit-driven investors care only about the volatility of a stock rather than its exact price. The stock volatility $V_t$ is defined from the closing price $P_t$ on the current day $t$ and the closing price $P_{t-1}$ on the previous day $t-1$:

$$V_t = \frac{P_t - P_{t-1}}{P_{t-1}}; \quad V_t \in [-0.1, 0.1] \tag{10}$$

In order to protect the stock market from malicious manipulation, trading in any stock whose volatility exceeds 10% is halted for that trading day; therefore, $V_t$ always lies in the range $[-0.1, 0.1]$, which we rescale to $[0, 1]$. At the same time, we set 0.5 as the threshold in order to obtain a binary label (0 for the price going down, and 1 for the price rising):

$$F_t = \begin{cases} 1 & V_t > 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

A stock market is highly complex and under the control of "invisible hands". Nevertheless, there is a large body of research on statistical modeling and machine learning approaches that learn from historical data. The key to predicting the stock market is to fit a latent nonlinear relation between the historical data and the future stock volatility. The traditional statistical models used for financial forecasting were simple and suffered from several shortcomings. Machine learning methods such as the Multi-Layer Perceptron (MLP), Recurrent Neural Networks and the Support Vector Machine (SVM) [15] are increasingly popular in this area.

An RNN is used as our fundamental prediction model because it is well suited to time series problems. The context layer stores the outputs of the state neurons from the previous time step and passes them to the next time step for computation. In this paper, we employ the Elman Network [6] in the following experiments. If we denote the output of the hidden layer at time $t$ by $H^{(t)}$, the final prediction is made by:

$$V^{(t+1)} = f(H^{(t)} W + B) \tag{12}$$

$$H^{(t)} = f([V, Z] W + H^{(t-1)} W + B) \tag{13}$$

where $f(\cdot)$ is the activation function, each $W$ is the weight matrix of the corresponding connection (input to hidden, context to hidden, hidden to output), and each $B$ is the corresponding bias. The structure of our proposed model is shown in Fig. 2.

Fig. 2: The structure of the RNN model with sentimental indicators. The input values are the stock volatility ($V$) and the sentimental indicators ($Z$); we use the inputs of the previous $k$ trading days to predict the stock volatility of the next trading day ($V^{(t+1)}$). There are 25 hidden nodes in our model.
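To make the data flow of Eqs. (10)-(13) concrete, the following sketch computes the volatility and the binary label and runs one forward pass of an Elman-style network with 25 hidden nodes. It is only an illustration: the min-max rescaling of the volatility onto [0, 1], the sigmoid activation, the random untrained weights, the prices and the input values are all assumptions, and training is left out.

```python
import math
import random

def volatility(prices):
    """Eq. (10): relative day-over-day change of the closing price."""
    return [(p - q) / q for q, p in zip(prices, prices[1:])]

def rescale(v, limit=0.1):
    """Map V_t from [-limit, limit] onto [0, 1] (assumed min-max scaling)."""
    return (v + limit) / (2 * limit)

def label(v_scaled, threshold=0.5):
    """Eq. (11): 1 if the price rose, 0 otherwise."""
    return 1 if v_scaled > threshold else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_forward(inputs, w_in, w_ctx, b_h, w_out, b_out):
    """Eqs. (12)-(13): feed k lagged days through one Elman hidden layer."""
    hidden = [0.0] * len(b_h)                        # context starts at zero
    for x in inputs:                                 # x = [V, Z_B, Z_N] of one day
        hidden = [
            sigmoid(sum(xi * w_in[i][j] for i, xi in enumerate(x))
                    + sum(hi * w_ctx[i][j] for i, hi in enumerate(hidden))
                    + b_h[j])
            for j in range(len(b_h))
        ]
    return sigmoid(sum(h * w for h, w in zip(hidden, w_out)) + b_out)

# Hypothetical closing prices and the labels derived from them.
prices = [10.0, 10.5, 10.29, 10.29]
labels = [label(rescale(v)) for v in volatility(prices)]

# Untrained random weights: 3 inputs, 25 hidden nodes, 1 output.
rng = random.Random(0)
n_in, n_hidden = 3, 25
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
w_ctx = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
b_h = [0.0] * n_hidden
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

# k = 3 lagged days of [rescaled V, Z_B, Z_N] (illustrative numbers).
window = [[0.75, 0.3, -1.2], [0.4, -0.5, 0.1], [0.5, 1.1, 0.8]]
prediction = elman_forward(window, w_in, w_ctx, b_h, w_out, b_out=0.0)
```

Dropping the two sentimental columns from each input row (and the matching rows of `w_in`) gives the plain RNN baseline, so the same function covers both settings compared in the experiments.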
In order to verify the effectiveness of the newly proposed model, we test it on the stock data introduced in Section 2.1 (GitHub link: https://github.com/irfanICMLL/EMM-for-stock-prediction). The stock data of 250 consecutive trading days is downloaded from the DaZhiHui (DZH) software. In order to evaluate the quality of the model, we define accuracy based on the binary label $F$. $F^*$ is the predicted label of the test data, while $F$ is the real label. Let $counter$ be the number of days for which $F^*_t = F_t$; the accuracy is then $Acc = \frac{counter}{\|F\|}$.

Fig. 3: Comparison results with different k (x-axis: lagged time window k; y-axis: accuracy; curves: RNN+EMM and RNN).

We choose one stock (000573) as an example: we extract the volatility data and run the RNN on it, and then compare it with the RNN with sentimental indicators $Z_B(t)$ and $Z_N(t)$ (RNN+EMM). We vary $k$ from 3 to 15 to find the best length of history for predicting the future. The experiments are replicated 50 times. Comparison results are shown in Fig. 3. We can see that the sentimental indicators improve the accuracy significantly and that the parameter $k$ affects the prediction accuracy; the optimal length is around 10, depending on the data set. For the stock 000573, the best accuracy of the EMM with RNN is 69.85% ($k = 12$), while the best accuracy without sentimental information is 57.33% ($k = 13$), and both are significantly better than 0.5. Another 9 stocks are selected randomly to test the model, which increases the credibility of the conclusion.

Across the individual stocks, we obtain better performance on 8 of the 10 datasets; the detailed comparisons are shown in Table 1. To make this more intuitive, we draw the histogram in Fig. 4. From the results, we can see that the stock 000573 shows the largest improvement. The reason may be that most of the training posts of the emotion classifier come from its sub-forum during the chosen period. In other words, if more accurate sentimental indicators are obtained, the accuracy of the model can be better.

Table 1: Accuracy and the best k for RNN+EMM and RNN

Stock number  RNN+EMM  RNN     k (RNN+EMM)  k (RNN)
000573        0.6985   0.5733  12           13
000733        0.6187   0.524   13           5
000703        0.6757   0.605   11           5
300017        0.6927   0.6455  14           14
600605        0.7355   0.6982  3            14
300333        0.6491   0.6232  4            6
000909        0.6154   0.6029  11           10
601668        0.5626   0.5543  15           3
000788        0.5917   0.6017  13           13
600362        0.7092   0.7344  11           6

Table 2 shows the comparison results of four learning models: MLP [8], SVM [4], RNN and RNN+EMM; the baseline is a random guesser (RAND).
On the 10 datasets with online discussions, the accuracy of the RNN is higher than that of the MLP and the SVM, because it contains information about the previous states. When the sentimental indicators are considered, the prediction performance improves by nearly 4% on average, which supports the assumptions about the sentimental indicators. We test only 10 stocks because the posts are not easy to obtain and label: we need roughly 20 hours to collect one sub-forum, and then have to extract the sentimental indicators through the method introduced in Section 2. In order to make the results more credible, we repeat the experiments under various initial parameters, so that the improved accuracy is most likely neither the result of chance nor of selecting a particularly favorable test period.

Fig. 4: Accuracy for RNN+EMM and RNN on the 10-stock dataset (x-axis: stock number; y-axis: accuracy).

Table 2: Performance comparisons on 10 stocks from the Chinese market.

Method  RAND      MLP       SVM       RNN      RNN+EMM
MEAN    0.500168  0.559257  0.602339  0.61623  0.6549
STD     0.003846  0.028681  0.111318  0.06347  0.05640
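As a quick consistency check on the reported numbers, the accuracy measure and the per-stock figures of Table 1 can be put together in a few lines. The function name `accuracy` and the variable names are illustrative; the accuracy values are copied from Table 1.

```python
def accuracy(predicted, actual):
    """Acc = counter / |F|: share of days whose predicted label is correct."""
    counter = sum(1 for p, a in zip(predicted, actual) if p == a)
    return counter / len(actual)

# Per-stock best accuracies copied from Table 1 (same stock order).
rnn_emm = [0.6985, 0.6187, 0.6757, 0.6927, 0.7355,
           0.6491, 0.6154, 0.5626, 0.5917, 0.7092]
rnn = [0.5733, 0.524, 0.605, 0.6455, 0.6982,
       0.6232, 0.6029, 0.5543, 0.6017, 0.7344]

mean_emm = sum(rnn_emm) / len(rnn_emm)               # compare with Table 2 MEAN
mean_rnn = sum(rnn) / len(rnn)
wins = sum(1 for a, b in zip(rnn_emm, rnn) if a > b) # stocks where EMM helps
gain = mean_emm - mean_rnn                           # average improvement
```

The means agree with the MEAN row of Table 2 up to rounding, `wins` counts the datasets on which RNN+EMM is better, and `gain` corresponds to the improvement of nearly 4% quoted in the text.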
In this research, we investigated the relationship between stock volatility and the sentimental information obtained from an online stock forum. We employed an RNN model that takes sentimental information into account, and the experimental results show that the new model can boost the prediction accuracy. The main contributions of our research are as follows: (1) we generate a sentimental weight dictionary of Chinese stock posts; (2) we propose sentimental indicators and investigate the relationship between the stock volatility and the information from the stock forums; (3) we build an RNN model that considers sentimental information for stock prediction and verify that the information from forums can help to predict the Chinese stock market; (4) we construct a benchmark dataset of labeled financial posts and make it publicly available for comparison studies. Finally, it is worth mentioning that our analysis does not take many factors into account. The posts from the forums may contain many fake messages that confuse the public. We will consider this in our future work.

Acknowledgement
This work is supported by the National Science Foundation of China Nos.61401012 and 61305047.