Artificial intelligence prediction of stock prices using social media
A PREPRINT
Kavyashree Ranawat
School of Engineering and Computing Sciences, Durham University, Lower Mountjoy, South Rd, Durham DH1 3LE, United Kingdom
Stefano Giani
School of Engineering and Computing Sciences, Durham University, Lower Mountjoy, South Rd, Durham DH1 3LE, United Kingdom
January 25, 2021
Abstract
The primary objective of this work is to develop a Neural Network based on LSTM to predict stock market movements using tweets. Word embeddings, used in the LSTM network, are initialised using Stanford's GloVe embeddings, pretrained specifically on 2 billion tweets. To overcome the limited size of the dataset, an augmentation strategy is proposed to split each input sequence into 150 subsets. To achieve further improvements in the original configuration, hyperparameter optimisation is performed. The effects of variation in hyperparameters such as dropout rate, batch size, and LSTM hidden state output size are assessed individually. Furthermore, an exhaustive set of parameter combinations is examined to determine the optimal model configuration. The best performance on the validation dataset is achieved by hyperparameter combination 0.4, 8, 100 for the dropout, batch size, and hidden units respectively. The final testing accuracy of the model is 76.14%.
Keywords: LSTM · Twitter · Stock Prediction · APPLE · Neural Networks · VADER
Introduction
Twitter is a microblogging and social media platform that allows users to communicate via short messages (280 characters) known as tweets [1, 2, 3]. It enables millions of users to express their opinions on a daily basis on a variety of topics, ranging from reviews of products and services to users' political and religious views, making Twitter a potent tool for gauging public sentiment [4]. It follows that Twitter data can be regarded as a corpus on which predictions can be based, and researchers have indeed exploited this fact to seek trends by performing numerous and varied analyses.

A characteristic feature of the stock market is volatility, and there is no general equation for predicting stock prices, which are a complex function of a range of different factors. The methods of stock market prediction can be broadly classified into Technical Analysis and Fundamental Analysis [5]. The latter involves the consideration of macroeconomic factors as well as industry-specific news and events to guide investment strategies [5]. The analysis of public sentiment via tweets performed in this project can be regarded as an aspect of Fundamental Analysis. Although the prediction of stock prices is highly nuanced, the Efficient Market Hypothesis (EMH), propounded by Eugene Fama in the 1960s, suggested a relation between public opinion and stock prices [6]. The semi-strong form of the EMH implies that current events and new public information have a significant bearing on market trends [1, 6]. This view is supported by Nofsinger, who draws upon evidence from several studies in the field of Behavioural Finance to reinforce that changes in aggregate stock price, as well as the high degree of market volatility, can be attributed in part to public emotion [7]. Numerous studies have reported evidence of a relationship between public mood gathered from social media and stock market trends [7, 8, 9, 10, 11, 12, 13, 14, 15].

The vast majority of ML (Machine Learning) techniques applied in this sector have integrated NLP (Natural Language Processing) to extract and quantify the sentiment of public opinion expressed via social media. SVM (Support Vector Machine) [8], Random Forest [9], and KNN (K-Nearest Neighbour) [11] classifiers have yielded impressive classification accuracies in this application. The only caveat is that most studies consider the compound effect of historic prices and public sentiment, thereby discounting the exclusive impact of sentiment.

Some studies have gone beyond the classic ML approach, employing deep learning methods such as ANNs (Artificial Neural Networks) in one form or another for the purposes of making predictions [12, 13, 14, 15, 16, 17]. Bollen et al. have established that collective public mood is predictive of DJIA (Dow Jones Industrial Average) closing values by making use of Granger Causality Analysis and SOFNN (Self-Organized Fuzzy Neural Networks). Many researchers have built on this work, and others have explored alternative deep learning models such as MLP (Multi Layer Perceptron) and CNN (Convolutional Neural Network) + LSTM (Long Short Term Memory) for market prediction.

The primary focus of this work was on the development of a variant of RNN (Recurrent Neural Network), known as LSTM, capable of predicting short-term price movements.
Owing to the volatile and unpredictable nature of the stock market, it is plausible that the relationship between societal mood and economic indicators is more complex and nuanced than a linear one. Deep learning methods are well suited to this application in that hidden layers can exploit the inherent relational complexity and can potentially extract these implicit relationships. It is for this reason that an LSTM structure was selected as the principal model in this work. The popularity of RNNs in NLP and stock prediction tasks is attributed to the fact that they consider the temporal effect of events, which is a significant advantage over other NNs (Neural Networks). With the aid of a popular sentiment analysis tool, known as VADER, the degree of correlation between the sentiment expressed via tweets and stock price direction was also investigated for the purposes of comparison with the results from the LSTM architecture.
VADER is a state-of-the-art technique employed by researchers in sentiment analysis tasks. One aspect of this work involves using VADER to explore the degree of correlation between public opinion and sentiment expressed via Twitter and stock market direction. VADER is a gold-standard lexicon and rule-based tool for sentiment analysis [18, 19]. Developed and empirically validated by Hutto and Gilbert, the VADER lexicon is specifically attuned to text segments in the social media domain [20, 21]. Unlike other lexicon approaches, VADER takes into account that microblog text often contains slang, emoticons, and abbreviated text [21]. It not only provides the semantic orientation of words but also quantifies sentiment intensity by considering generalisable heuristics such as word order, capitalisation, and degree modifiers [21]. In this application, VADER was used to generate the polarity scores of tweets, including a compound score (normalised between -1 and 1) which reflects the combined effect of the degree of positivity, negativity, and neutrality expressed in a tweet.
The aim is to investigate the correlation between two variables: VADER scores and stock market trajectory. Firstly, tweets containing the APPLE stock ticker symbol were cleaned, using the algorithm described in the next section. Stock data associated with APPLE, among the Big Four technology companies, is deemed a suitable choice upon which to perform analysis for several reasons; a detailed rationale is provided in a subsequent section. The compound score for each tweet was generated using VADER. Subsequently, the average of the scores for all tweets in a single day was taken. To obtain values for the second variable, i.e. stock market movement, the direction of stock price movement was quantified. If the next-day close price of the security is greater than that of the current day, the value for that day is defined as 1, else it is defined as 0.

After generating the values for both variables, a special case of the Pearson coefficient, known as the point-biserial correlation coefficient, was applied to the data to determine the correlation. This metric is commonly used when one variable is continuous and the other is categorical, as is the case in this application, where the VADER scores are continuous whereas the stock price change data is dichotomous (binary) [22]. The point-biserial correlation method, however, requires the continuous variable to be normally distributed [22]. Therefore, the distribution of the VADER scores was plotted and a roughly normal distribution was obtained (as shown in figure 1), allowing the use of the point-biserial correlation method.
Figure 1: Distribution of VADER scores (continuous variable).
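The labelling rule and the point-biserial coefficient described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `direction_labels` and `point_biserial`, the closing prices, and the daily scores are all hypothetical, and the coefficient is computed with the population standard deviation, which makes it numerically equal to the Pearson coefficient on 0/1 labels.

```python
import math


def direction_labels(closes, delay=1):
    """Label day t as 1 if the close `delay` days later is higher, else 0."""
    return [1 if closes[t + delay] > closes[t] else 0
            for t in range(len(closes) - delay)]


def point_biserial(scores, labels):
    """Point-biserial correlation between continuous scores and 0/1 labels."""
    n = len(scores)
    ones = [s for s, y in zip(scores, labels) if y == 1]
    zeros = [s for s, y in zip(scores, labels) if y == 0]
    if not ones or not zeros:
        raise ValueError("both classes must be present")
    mean_all = sum(scores) / n
    # Population standard deviation of all scores.
    s_n = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_1 = sum(ones) / len(ones)
    mean_0 = sum(zeros) / len(zeros)
    return (mean_1 - mean_0) / s_n * math.sqrt(len(ones) * len(zeros) / n ** 2)


# Hypothetical data: six closing prices and five daily mean compound scores.
closes = [100.0, 101.5, 101.0, 102.2, 103.0, 102.5]
daily_scores = [0.12, -0.05, 0.20, 0.08, -0.10]

labels = direction_labels(closes, delay=1)
r = point_biserial(daily_scores, labels)
```

Passing a larger `delay` to `direction_labels` shortens the label list by the same amount, so the score series has to be truncated to match before the coefficient is computed.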
Modifications were made to the stock price change data by redefining the classification strategy. Previously, it was based on a delay of 1 day. It is plausible, however, that the effect of information and opinions presented in tweets may take longer to manifest and be reflected in the asset price. To explore this theory, delay sizes in the range of 1 to 7 days were examined. For example, a delay size of 2 indicates that if the asset price at the close of the trading day 2 days hence is higher, it is classed as 1, else it is classed as 0.

Figure 2: Correlation coefficient values for different delay periods.

Another configuration was assessed after making modifications to the VADER scores. If a particular tweet has more retweets than others, i.e. it has been frequently shared by users, it could indicate that the information contained in that tweet has a notable influence on other users. These tweets could potentially play a greater role in driving an incremental change within the stock market. By taking a simple average of the compound scores of tweets to arrive at a single value for a specific day, it is not possible to gauge the contribution of important tweets. Thus, a weighted average of the compound scores, based on the number of retweets per tweet, was taken so that influential tweets are represented in proportion to their reach. The correlation was then determined using the modified scores.
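The difference between the simple and the retweet-weighted daily score can be illustrated as follows. The source does not state the exact weighting formula, so the weight `retweets + 1` used here is an assumption (it keeps tweets with zero retweets in the average); the function name `daily_score` and the sample values are hypothetical.

```python
def daily_score(compound, retweets=None):
    """Aggregate per-tweet compound scores into one value for a day.

    With `retweets=None` a simple mean is returned. Otherwise each tweet is
    weighted by (retweets + 1), an assumed scheme chosen so that tweets
    without retweets still contribute.
    """
    if retweets is None:
        return sum(compound) / len(compound)
    weights = [r + 1 for r in retweets]
    return sum(w * c for w, c in zip(weights, compound)) / sum(weights)


# Hypothetical day with three tweets; the first one was retweeted nine times.
compound = [0.5, -0.2, 0.1]
retweets = [9, 0, 0]

simple = daily_score(compound)
weighted = daily_score(compound, retweets)
```

In this toy case the heavily retweeted positive tweet pulls the weighted score well above the simple mean, which is the effect the weighted configuration is meant to capture.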
The correlation obtained for variable delay configurations is shown graphically in figure 2. The correlation for the original configuration, using a delay of 1 day, is 0.0812 (or 8.12%), suggesting a notably poor linear relationship between the sentiment expressed in tweets and market movement. Using delays higher than 1 generally tends to improve the degree of correlation, with a delay of 4 days yielding the highest correlation (30.37%). This result supports the assumption that there exists a lag between the release of public opinion and its consequent reflection in stock price. A decreasing trend is observed as the delay size is increased beyond 4 days. A potential implication of this could be that the information contained on a particular day is irrelevant after large delay periods and no longer has any bearing on stock market values.

The retweets-based weighted average configuration results in a small negative correlation value of -8.87%, which is in complete contrast to the original configuration. The underlying cause of this result is ambiguous. It is possible that the inclusion of retweets has no impact in moulding future stock values. Alternatively, it could be that the limited number of data points is masking the true contribution of the retweets.

It can be concluded from the results that VADER is not able to present any strong association between public sentiment and market trajectory. The low correlation values could be attributed to the use of an insufficient number of samples for both variables. The final VADER scores are derived by taking an average of all compound scores, which is a simplistic and naive approach. The inability of the lexicon-based tool to identify any notable relationships between the variables could also be due to the inadequacies inherent in the development of VADER sentiment. However, as mentioned earlier, there is a high possibility that the relationship between the variables is not linear and is perhaps more complex and nuanced. This could be one of the reasons why a metric such as correlation, which assesses the strength of a linear relationship, is not able to detect any significant associations.
ANNs, inspired by the behaviour of neurons in biological systems, are a dense interconnection of nodes or pre-processing units connected in layers, which have the ability to discover complex relationships between inputs and outputs [23]. There are many varying implementations of ANNs, which differ in terms of network architecture, properties and complexity. Considering the sequential nature of tweets, it is essential to use a network which considers the temporal effect of an input sequence. RNNs satisfy this criterion and are indeed suitable for application in tasks of this nature. However, the main drawback of RNNs is their inability to capture long-term dependencies in an input sequence, i.e. the Vanishing Gradient problem [24, 25]. For example, when an input sentence is fed into the network, the error must be backpropagated through the network in order to update the weights. If the input is long, the gradients diminish exponentially during backpropagation, resulting in virtually no contribution from the state in earlier time steps. It is particularly problematic when using the sigmoid activation function, as its derivative lies in the range (0, 0.25], resulting in highly diminished gradient values after repeated multiplication. A variant of the vanilla RNN network, the LSTM, can overcome this limitation [24, 25], albeit not entirely, and is used in this application.

Figure 3 shows a high-level abstraction of the LSTM model architecture, consisting of one stacked LSTM layer. The input side shows the vectors, concatenated in the Embedding Layer, being input to the LSTM cells in the stacked layer. The output side depicts the hidden state outputs of the LSTM layer being taken as inputs by a sigmoid-activated node to make output predictions.
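The effect of repeated multiplication by the sigmoid derivative can be made concrete with a short calculation. The sketch below is a simplified upper bound, not a full backpropagation-through-time derivation: it assumes the recurrent weight factors have magnitude at most 1, so that only the activation derivative limits the gradient. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at z = 0 with value 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_bound(steps):
    """Upper bound on a gradient factor after `steps` sigmoid derivatives.

    Assumes each recurrent weight factor has magnitude <= 1, so the product
    is bounded by 0.25 ** steps.
    """
    return sigmoid_grad(0.0) ** steps


bound_10 = gradient_bound(10)   # 0.25 ** 10, already below 1e-6
bound_50 = gradient_bound(50)   # 0.25 ** 50 = 2 ** -100
```

Even under this favourable assumption the bound after 50 time steps is far below the resolution at which a weight update would have any practical effect, which is why early time steps contribute almost nothing in a vanilla RNN.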
Raw tweets contain a great deal of noise, which needs to be eliminated in order to extract relevant information and improve the predictive performance of the applied algorithm. Tweets contain Twitter handles, URLs, numeric characters, and punctuation, which do not contribute meaningfully to the analysis. A preprocessing algorithm was applied to remove these elements, convert words to lower case, and tokenize the words present in a tweet. The cleaned tweets were concatenated so as to form one input sequence for a given day. Ordinal (integer) encoding was used to map all the words in the vocabulary to an integer value, resulting in an input vector $\{w_1, w_2, w_3, \ldots, w_n\}$, where $w_t$ corresponds to a unique integer index representing a feature (or word) in the vocabulary. Although LSTMs can take inputs of variable length [26], post-padding was applied, as only vectors with homogeneous dimensionality can be used with the Keras Embedding Layer. For each input vector of length $t \in [1, n]$, $n - t$ zeros (dummy features) must be appended, where $n$ is the length of the longest encoded vector. This produces $m$ padded vectors of equal length $n$, where $m$ is the total number of samples, i.e. the number of unique phrases or tweets fed to the network.

NLP tasks for textual representation and feature extraction commonly use BoW (Bag of Words) models owing to their flexibility and simplicity. The traditional BoW has two variants: N-gram BoW and TF-IDF (Term Frequency-Inverse Document Frequency). The former reduces the dimensionality of the feature set by extracting phrases comprising N words, while the latter weights words by their frequency whilst accounting for the effect of rare words. The main drawback is that BoW fails to take into account word order and context and results in sparse representations. This work explores the use of GloVe (Global Vectors for Word Representation), which has gained momentum in text classification problems [27, 28]. GloVe overcomes the sparsity problems associated with the BoW model by generating dense vector representations and projecting the vectors to a markedly lower-dimensional space. It is able to capture the semantic and syntactic relationships that are present between words, where words with a similar meaning are locally clustered in the vector space. Stanford's GloVe embeddings, trained specifically on 2 billion tweets, were used to project each feature as a 200-dimensional vector [27, 28]. The weights of the feature vectors were initialised using the pre-trained embeddings but adjusted as training progressed to improve classification performance. Each integer-encoded feature, $w_t$, corresponds to an embedding vector, $x_t$, where $x_t \in \mathbb{R}^p$. Since each feature is represented as a 200-dimensional vector, $p = 200$. A padded input feature vector is thus represented in the embedding layer as $X \in \mathbb{R}^{n \times p}$, formed by concatenating the $n$ vector embeddings. In figure 3, although the embedding layer is not explicitly shown, the vector embeddings which constitute it can be seen as inputs to the LSTM layer at the corresponding time steps.

Figure 3: General schematic of the LSTM Network. The input side shows the vectors, concatenated in the Embedding Layer, being input to the LSTM cells in the stacked layer. The output side depicts the hidden state outputs of the LSTM layer being taken as inputs by a sigmoid-activated node to make output predictions.

The main distinction between a neuron in the vanilla RNN and an LSTM cell lies in the presence of a cell state vector, whose contents at each time step are maintained and modified via an LSTM gating mechanism [29, 30, 31]. The information flow in an LSTM memory cell is regulated by three primary gates, viz. the forget gate, input gate, and output gate [29, 30].
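Before turning to the cell internals, the input pipeline described earlier in this section (cleaning, integer encoding, and post-padding) can be sketched as follows. This is a minimal stand-in using only the standard library, not the preprocessing code used in the study: the regular expressions, the function names, and the sample tweet are assumptions, and index 0 is reserved for padding as is conventional with embedding layers.

```python
import re


def clean_tweet(text):
    """Strip URLs, handles, digits and punctuation; lower-case and tokenize."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # Twitter handles
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # digits and punctuation
    return text.lower().split()


def build_vocab(token_lists):
    """Map each distinct word to an integer index, starting at 1."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)  # 0 is kept for padding
    return vocab


def encode_and_pad(token_lists, vocab):
    """Integer-encode each sequence and post-pad with zeros to length n."""
    encoded = [[vocab[t] for t in tokens] for tokens in token_lists]
    n = max(len(seq) for seq in encoded)
    return [seq + [0] * (n - len(seq)) for seq in encoded]


# Hypothetical example: one raw tweet and two already-tokenized sequences.
tokens = clean_tweet("@user $AAPL up 3% today! http://t.co/x")
days = [["aapl", "up", "today"], ["aapl", "down"]]
vocab = build_vocab(days)
padded = encode_and_pad(days, vocab)
```

Each padded row would then be passed through the embedding layer, where every non-zero index is replaced by its 200-dimensional GloVe vector.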
Figure 4 shows the schematic of an LSTM cell, including the gating mechanisms used to achieve its functionality.

At a certain time step, $k$, the vector embedding, $x_k$, along with the previous hidden output, $h_{k-1}$, is used by the input and forget gates to update the internal state of the cell [29, 32]. The output gate combines the inputs and the current cell state, $c_k$, to determine the information to be carried over to the next cell in the repeating structure [29]. In this fashion, LSTMs can control the contribution of those words and word relationships that have a higher impact on prediction, whilst penalising those that are less significant. Equation 1 is a matrix representation [32] of the outputs of the gates. Equations 2 and 3 show the outputs of the cell, where the output vector is obtained by an element-wise multiplication process [29, 32].

Figure 4: Structure of an LSTM cell, showing the role of the primary gates and flow of information within the cell to form the current memory state and hidden output state.

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} x_k \\ h_{k-1} \end{pmatrix} \tag{1}$$

$$c_k = f \odot c_{k-1} + i \odot g \tag{2}$$

$$h_k = o \odot \tanh(c_k) \tag{3}$$

where $i$, $f$, and $o$ are the outputs of the input gate, forget gate, and output gate respectively, and $g$ is the output of an additional gate which aids in updating cell memory. $W$ is the weights matrix, and $\sigma$ and $\tanh$ represent the sigmoid and tanh non-linearities. Note that the system of equations in (1) also contains a bias term for each gate output.

Dropout, a regularization method, is utilized to prevent the model from overfitting [25]. Overfitting occurs when the model learns the statistical noise present in the dataset, capturing unnecessarily complex relationships and thus resulting in decreased generalisability. During training, it is possible for neighbouring neurons to become co-dependent, inhibiting the effectiveness of individual neurons. Dropout causes a proportion of the nodes or outputs in the layer to become inactive, thereby forcing the model to become more robust. This results in an increase in the network weights, which, in the classical formulation, must be scaled down by the retention probability (one minus the dropout rate) after completion of training.

A single output node with a sigmoid activation function, presented in equation 4, was used for the purpose of classifying the trend.

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{4}$$

where $z$ is the activation of the output node. The estimated probability returned by the node was compared against a threshold probability in order to perform binary classification. If the output probability for a given input sequence met or exceeded this threshold, the input was labelled as 1, predicting an increase in asset price for the following trading day. If the output probability did not reach this threshold, the input was labelled as 0, indicating either no change or a decrease in next-day price.
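Equations 1 to 3 translate directly into a single-step update. The following sketch uses plain Python lists instead of a tensor library so that every operation is visible; the function name `lstm_step`, the row ordering of `W` (input, forget, output, then candidate gate), and the toy dimensions are assumptions made for illustration, and the bias vector `b` corresponds to the bias terms noted for equation 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_k, h_prev, c_prev, W, b):
    """One LSTM time step following equations (1)-(3).

    W has 4*H rows of length len(x_k) + H, stacked in the assumed order
    i, f, o, g; b holds one bias per row. H is the hidden state size.
    """
    H = len(h_prev)
    v = x_k + h_prev                                  # stacked (x_k, h_{k-1})
    pre = [sum(w * a for w, a in zip(row, v)) + b_r   # W v + b
           for row, b_r in zip(W, b)]
    i = [sigmoid(z) for z in pre[0:H]]                # input gate
    f = [sigmoid(z) for z in pre[H:2 * H]]            # forget gate
    o = [sigmoid(z) for z in pre[2 * H:3 * H]]        # output gate
    g = [math.tanh(z) for z in pre[3 * H:4 * H]]      # candidate memory
    c_k = [f_j * c_j + i_j * g_j                      # equation (2)
           for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h_k = [o_j * math.tanh(c_j)                       # equation (3)
           for o_j, c_j in zip(o, c_k)]
    return h_k, c_k


# Toy check with all-zero parameters: every sigmoid gate outputs 0.5, g is 0.
H, P = 2, 3
W = [[0.0] * (P + H) for _ in range(4 * H)]
b = [0.0] * (4 * H)
h_k, c_k = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], W, b)
```

With zero parameters the forget gate halves the previous cell state and nothing new is written, which gives a quick sanity check of the gating arithmetic.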
A binary cross-entropy cost function, $J_{bce}$, was used, as given by equation 5 [33].

$$J_{bce} = -\frac{1}{m} \sum_{j=1}^{m} \left[ y_j \log(\sigma(z_j)) + (1 - y_j) \log(1 - \sigma(z_j)) \right] \tag{5}$$

where $y_j$ is the $j$-th target variable, or the actual class label, from a set of $m$ training samples. The cost or error function is representative of how accurately the model predicts target values, using a given set of network parameters. The main aim is to minimise the cost function, updating the weights and biases of connections in the network as a result. Mini-batch Stochastic Gradient Descent was used during backpropagation to allow the model to converge towards a cost minimum (ideally the global one) [25]. This optimum state represents a model configuration where the error between the actual and predicted values is at a minimum and the network can successfully detect patterns between word embeddings

Figure 5: Bar graphs representing an even class distribution (≈

Twitter data was obtained from followthehashtag [34], an online resource containing a readily available corpus of tweets. Approximately 167,000 tweets mentioning or associated with APPLE stocks were used for analysis. APPLE is among the companies currently dominating the technology sector and is regarded as a suitable choice upon which to base analysis. Owing to its popularity and the fact that it has the largest market capitalisation out of all NASDAQ 100 companies, it is fair to assume that Twitter contains sufficient information relating to its stocks. The stock price data was sourced from Yahoo Finance [35]. The granularity of stock data considered is 1 day, i.e. daily changes in stock price were computed to capture the essence of short-term price fluctuation. The input tweets were labelled according to the scheme described previously. Concatenation of tweets leads to an aggregate of 48 input samples, corresponding to 48 unique days. Due to the limited number of samples, each input was divided into 100 subsets, whilst keeping the labelling of the subsets consistent with that of the original day. Consequently, 4800 input samples for the network were obtained. For the initial experimentation, a naive model configuration was used. This model forms the basis on which further improvements in performance can be achieved. The next section discusses the effects of using different network types, hyperparameter optimisation, and varying split values (for tuning the number of input samples) on model performance. For the initial configuration, consisting of 4800 samples, the training/testing/validation split was 70/20/10, i.e. training was performed on 3360 samples, testing was performed on 960 samples, and validation was performed on the remaining 480 samples. The validation set is used to configure the model so as to obtain the hyperparameters which give the best performance.
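The augmentation and partitioning arithmetic above can be reproduced with a short sketch. The source does not specify how each day's sequence was divided, so splitting into contiguous, near-equal chunks is an assumption, as are the function names `split_day` and `split_counts`.

```python
def split_day(tokens, label, n_subsets):
    """Split one day's token sequence into contiguous chunks.

    Every chunk inherits the label of the original day. Contiguous,
    near-equal chunks are an assumed splitting scheme.
    """
    size, extra = divmod(len(tokens), n_subsets)
    chunks, start = [], 0
    for j in range(n_subsets):
        end = start + size + (1 if j < extra else 0)
        chunks.append((tokens[start:end], label))
        start = end
    return chunks


def split_counts(n_samples, fractions=(0.7, 0.2, 0.1)):
    """Return (train, test, validation) sample counts for the given split."""
    train = round(n_samples * fractions[0])
    test = round(n_samples * fractions[1])
    return train, test, n_samples - train - test


# Toy day of ten tokens labelled 1, split three ways.
chunks = split_day(list(range(10)), 1, 3)

# 48 days x 100 subsets per day, partitioned 70/20/10.
n_samples = 48 * 100
counts = split_counts(n_samples)
```

The same helper with `n_subsets=150` would give 7200 samples, which is the setting adopted later once the effect of split size has been examined.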
The testing data is only used once, after the network has been configured, to give an unbiased evaluation of model performance. A single LSTM stacked layer was utilised with a dropout value of 0.2 and 100 hidden units (used for determining the dimension of the LSTM outputs). The gradients and weights are updated according to a batch size of 32.

The performance of the initial configuration is reported in the next section. The primary metric used to assess performance is accuracy. Another commonly used metric is the F1 score, which is the harmonic mean of the precision and recall [36]. Precision refers to the percentage of instances correctly predicted as positive with respect to all instances classified as positive by the model, whereas recall refers to the percentage correctly classified as positive out of all actually positive instances [36]. The F1 score is also calculated for the varying implementations discussed in the next section. However, it is only needed when there is a greater cost associated with either the false positives or the false negatives. As this is not applicable for this task and the class distribution is even (as shown in figure 5), it is only computed to ensure consistency in results and is thus not reported for all configurations.
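The metrics defined above follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `binary_metrics` and the label vectors are hypothetical and are not results from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

When the classes are balanced and false positives and false negatives are equally costly, accuracy and F1 tend to move together, which is why accuracy is treated as the primary metric here.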
Figure 6: Confusion Matrix showing classifier performance at the deep functional level.
The initial model configuration, described in the previous section, gives an impressive classification (testing) accuracy. A confusion matrix, displayed in figure 6, summarizes the classification performance of the model. It indicates that the model is able to correctly identify the 0 class and the 1 class with an accuracy of 76% and 73% respectively. The F1 score for this configuration is 71.76%, reinforcing model performance as reflected in the testing accuracy result. The achieved accuracy is far superior to the random guessing threshold of 50%. This result indicates the effectiveness of NNs in this task, which is in contrast with the results of the correlation analysis performed previously. This output reinforces the claim that NNs have the ability to detect nuanced patterns and produce complex mappings between the input and output.
In the original model, the concatenated tweets, resulting in 48 samples, were split into 100 subsets per input sample to augment the dataset. To investigate the effect of modifying the number of subsets per sample on overall performance, values for the input splits were selected in the range [25, 450]. Figure 7 shows the dependence of classification accuracy on split size.

In general, selecting large values of split size leads to performance degradation. Splitting all tweets on a particular day into a larger number of subsets can leave insufficient information within an individual sample, deteriorating prediction capability. It is reasonable to assume that on any given day, some tweets point in the opposite direction to that observed in the stock market; however, the aggregate impact of the other tweets outweighs this effect. As a result, higher subdivisions do not accurately capture the true nature of the task. To determine whether this trend continues, an extreme case was considered, i.e. the maximum logical split value. Using all 60,233 filtered tweets as individual inputs to the network, the observed accuracy was 62.37%, consistent with the trend observed in the graph.

On the other end of the spectrum, using the concatenated tweets in their unaltered form reflects all the necessary information on a given day for the model to make predictions. However, in this work, this entails using a considerably limited number of training instances, hindering the network's ability to learn effectively and leading to erroneous outputs. The training results of this configuration confirmed this intuition, as it gave the worst performance in comparison with the other subset values. The training time for this configuration, i.e. using no splits, was also significantly higher than for any other value tested. As the input sequence length is maximum in this case, a significant number of LSTM cells is required.
This discernibly increases processing time and degrades performance due to the emergence of vanishing gradients. Therefore, there exists a trade-off between loss of information and creating a reasonably sized dataset. In light of this, a split size of 150 per day was selected for subsequent analyses, as it achieves a satisfactory balance between these two factors.
Figure 7: Variation of classification accuracy with the split size per input sample.

Table 1: Effect of varying dropout rates

Value   Unidirectional        Bidirectional
        Accuracy   Epoch      Accuracy   Epoch
0.2     74.58      6,9        75.63      7
0.3

To improve the results of the original model configuration, the hyperparameters employed in the NN were optimised. In particular, the dropout rate, batch size, and number of LSTM output hidden units were varied to investigate their impact on classification results. Each hyperparameter was considered in isolation, with all other model variables remaining unchanged, to determine its exclusive impact. The same experimental procedure was also applied to a different network configuration known as the bidirectional LSTM. In strict terms, the standard LSTM structure used in this work is called a unidirectional LSTM network, which differs from the bidirectional LSTM structure. The bidirectional network is a variant of the classical LSTM, where information flows in both directions between LSTM cells such that the cell, at every time step, is able to maintain previous and future input information [31]. This is in contrast to the unidirectional structure, where each cell contains only past information.

Table 1 shows the classification accuracies achieved by varying the network dropout percentage. It is apparent that the unidirectional structure tends to perform well on dropout values higher than 0.2. However, the bidirectional structure shows no notable trends and, as such, no conclusive inference can be drawn. The best accuracy (78.12%) is achieved by the unidirectional architecture, using a dropout value of 0.3. The implication of utilising higher values of dropout is that a higher proportion of neurons become inactive, increasing the robustness of the model and resulting in better testing accuracies.

Table 2: Effect of varying batch size

Value   Unidirectional        Bidirectional
        Accuracy   Epoch      Accuracy   Epoch
8
16      78.12      2          75.00      1
32      74.58      6,9        75.63      7
64      74.79      6          72.29      6
128     73.54      8          73.12      3

Table 3: Effect of varying LSTM output hidden units

Value   Unidirectional        Bidirectional
        Accuracy   Epoch      Accuracy   Epoch
100     74.58      6,9        75.63      7
128     74.38      6
Table 2, which highlights the effect of variable batch sizes, indicates that there is a decline in the accuracy level with increasing batch size for the unidirectional structure. Mini-batch sizes of 8 and 16 yield comparable results, with batch size 8 achieving a remarkable accuracy of 78.33%. As in the dropout case, there are no identifiable trends for the bidirectional structure; however, the lowest batch size (8) performs the best for this variant of LSTM as well. The batch size determines the number of training instances after which the gradients and weights of the network are updated. The impressive performance of the lower batch sizes can be attributed to a more robust convergence as well as the network's ability to circumvent local minima.

The final parameter used for optimisation is the number of hidden state units of the LSTM cells. The hidden state units determine the dimensionality of the output space of the LSTM layer, i.e. the dimensionality of the LSTM output vectors $H = \{h_1, h_2, h_3, \ldots, h_n\}$. As presented in Table 3, the accuracy shows an overall increase with an increase in the output dimensionality for the unidirectional framework. Although performance gains are observed upon using larger values of hidden state size, this comes at the cost of a rapid (roughly quadratic in the hidden state size) increase in the number of parameters and a corresponding increase in processing time. Values greater than 512 are not used in this study, as the resulting models will be prone to overfitting owing to their marked complexity [37]. The bidirectional variant performs better when 128 hidden state units are used; however, it suffers a performance degradation for higher values. Due to the inherent characteristics of the bidirectional model, the dimensionality of the LSTM output is double that of its unidirectional counterpart. This ultimately leads to an inordinately complex model that is more prone to overfitting.

Some neural network structures are known to achieve satisfactory results by employing two hidden layers.
Therefore, two identical stacked LSTM layers with the same hyperparameter values as the original configuration were integrated in the network. The classification accuracy achieved by this configuration was 75.10%, which is lower than the value (76.67%) achieved using a single hidden layer implementation. Hence, this architecture is not pursued further.

A notable observation, based on the hyperparameter tuning results, is the performance of the bidirectional structure. It has the ability to preserve past and future values in each cell, thereby allowing the network to gain a fuller context of the information present in the input tweets. In theory, this should lead to improvements in the overall predictions made by the model. However, the bidirectional implementation produces only comparable results overall, and occasionally generates lower accuracies than the unidirectional model.

In order to discover the optimal model configuration, experiments were conducted using an exhaustive set of combinations of the hyperparameters. Different combinations of the dropout rate, mini-batch size, and hidden state size were deployed, with the range of hyperparameter values limited to those outlined in Table 1, Table 2, and Table 3. The combination 0.4 (dropout), 8 (batch size), and 100 (hidden units) produces the best results, obtaining a validation accuracy of 81.04%. Improvements in accuracy cannot be attained by merely using those hyperparameter values which give the highest accuracy when altered independently, i.e. using the combination 0.3,8,512. Therefore, it can be argued that there exists some degree of interaction between the variables when varied simultaneously.

The optimal model configuration is thus given by the combination 0.4,8,100. To perform an unbiased evaluation of the model, the testing data (960 samples) was used. The model produces a testing accuracy of 76.14% when presented with the unseen testing data.
Although the model exhibits a remarkable performance in absolute terms, its results are specific to APPLE and are not generalisable to other technology companies. There could be different patterns and inherent complexities within the Twitter datasets of other companies which could lead to similar or contrasting results to those observed in this analysis.
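To make the cost of larger hidden state sizes concrete, the following minimal sketch counts the trainable parameters of a single LSTM layer under the standard formulation (four gates, each with input weights, recurrent weights, and one bias vector). The embedding dimension of 200 and the helper name `lstm_param_count` are assumptions made purely for illustration and are not taken from the configuration used in this work; some frameworks also keep two bias vectors per gate, which changes the totals slightly.

```python
def lstm_param_count(input_dim: int, hidden_units: int, bidirectional: bool = False) -> int:
    """Trainable parameters of one standard LSTM layer (no peephole connections)."""
    # Four gates (input, forget, cell candidate, output); each gate has
    # input weights (hidden_units x input_dim), recurrent weights
    # (hidden_units x hidden_units), and one bias vector (hidden_units).
    per_direction = 4 * (hidden_units * (input_dim + hidden_units) + hidden_units)
    # A bidirectional layer holds two independent LSTMs (forward and backward).
    return 2 * per_direction if bidirectional else per_direction


# ASSUMPTION: embedding dimension chosen for illustration only.
EMBEDDING_DIM = 200

for units in (100, 128, 512):
    uni = lstm_param_count(EMBEDDING_DIM, units)
    bi = lstm_param_count(EMBEDDING_DIM, units, bidirectional=True)
    print(f"{units:>4} hidden units: unidirectional={uni:,}  bidirectional={bi:,}")
```

Under these assumptions the count grows roughly quadratically with the hidden state size, and the bidirectional variant carries twice the parameters of its unidirectional counterpart at the same size, which is in line with the overfitting concern raised for the larger configurations.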
The objective of this work is to develop a model capable of predicting the direction of next-day stock market fluctuations using Twitter messages. Tweets associated with APPLE, regarded as one of the Big Four technology companies, are used as the basis for this analysis. The primary focus of this work is the development, configuration, and deployment of an LSTM structure. A correlation analysis is briefly explored to determine the relationship between VADER scores, quantifying the sentiment of tweets, and stock market movement.

Using the point-biserial correlation coefficient as the measurement metric, a low correlation value of 0.0812 is obtained. Alternative configurations are considered, based on the time taken for information contained in tweets to manifest in market movement and on a retweet-weighted average. A delay size of 4 days results in the highest correlation value (0.3037). However, it is apparent from these results that VADER is not able to extract any strong relations between societal sentiment and market direction.

To initialise the word-embedding weights of the LSTM network, GloVe embeddings, pre-trained on a sizeable Twitter corpus, are utilised. Due to the limited number of training samples, the effect of splitting the dataset for augmentation is analysed. A value of 150 is selected for splitting the input sequence for each day into subsets as it provides a satisfactory trade-off between information loss and a suitable representation of dataset size. Hyperparameter tuning is performed using the validation set and independently varying the dropout rate, batch size, and hidden unit size to further optimise model performance. To determine the optimum configuration, an exhaustive set of varying combinations of selected parameters is tested. A combination of 0.4,8,100 performs the best on the validation set, achieving a testing accuracy of 76.14%.

Despite the level of accuracy being an impressive standalone result, Twitter datasets from other technology companies need to be analysed and contrasted with results from this study.
This relative comparison will allow the LSTM network performance to be gauged more accurately and in a broader context, enabling the formation of generalisable results. The application of technical indicators such as historical price data can also be explored in conjunction with the components of Fundamental Analysis used in this task to provide more input information vital to classification.

Word embeddings are effective in projecting words/features occurring in similar contexts to neighbouring regions of the vector space. However, it is common for tweets to contain words expressing opposite sentiments that are collocated. This leads to erroneous representations of these fundamentally different words as similar vectors, reducing the discriminative ability required for classification [38]. Further work should be undertaken to incorporate linguistic lexicons such as SentiNet to capture the effect of word similarity and sentiment [27]. An alternative approach is to use SSWE (Sentiment-Specific Word Embeddings), which inject sentiment information into the loss function of neural networks [27]. This could potentially boost performance through the enhancement of the quality of word vectors.
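As a minimal sketch of the correlation analysis summarised in this section, the point-biserial coefficient between a binary market direction and a real-valued daily sentiment score can be computed with the textbook formula and evaluated for several delay sizes. The function names, the interpretation of the delay (sentiment on day t paired with the direction on day t + delay), and the data values are hypothetical and serve only as an illustration; they are not the dataset or the exact procedure used in this work.

```python
from math import sqrt
from statistics import fmean, pstdev


def point_biserial(labels, scores):
    """Point-biserial correlation between binary labels (0/1) and real-valued scores."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ones = [s for l, s in zip(labels, scores) if l == 1]
    zeros = [s for l, s in zip(labels, scores) if l == 0]
    if not ones or not zeros:
        raise ValueError("both classes must be present")
    p = len(ones) / len(scores)  # proportion of label 1
    # r_pb = (M1 - M0) / s_n * sqrt(p * (1 - p)), with s_n the population std.
    return (fmean(ones) - fmean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


def lagged_point_biserial(sentiment, direction, delay):
    """Correlate sentiment on day t with market direction on day t + delay."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if delay:
        sentiment, direction = sentiment[:-delay], direction[delay:]
    return point_biserial(direction, sentiment)


# HYPOTHETICAL data: daily mean sentiment scores and next-day direction (1 = up).
sentiment = [0.12, -0.30, 0.45, 0.05, -0.10, 0.33, -0.25, 0.20, 0.02, -0.40]
direction = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

for delay in range(5):
    r = lagged_point_biserial(sentiment, direction, delay)
    print(f"delay={delay} days: r_pb={r:+.4f}")
```

Because the point-biserial coefficient is numerically the Pearson correlation with one binary variable, the same values could be cross-checked with any Pearson implementation.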
References
[1] Anuradha Yenkikar, Manish Bali, and Narendra Babu. Emp-sa: Ensemble model based market prediction using sentiment analysis. International Journal of Recent Technology and Engineering (IJRTE), 8(2), 2019.
[2] Selena Larson. Welcome to a world with 280-character tweets. https://money.cnn.com/2017/11/07/technology/twitter-280-character-limit/index.html. Accessed: 07 November 2019.
[3] Twitter. How to tweet. https://help.twitter.com/en/using-twitter/how-to-tweet. Accessed: 03 March 2020.
[4] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC, volume 10, January 2010.
[5] N.B. GKumar and S. Mohapatra. The Use of Technical and Fundamental Analysis in the Stock Market in Emerging and Developed Economies, chapter Introduction. Emerald Group Publishing Limited, 2015.
[6] Eugene F. Fama. The behavior of stock-market prices. The Journal of Business, 38(1):34–105, 1965.
[7] John R. Nofsinger. Social mood and financial economics. Journal of Behavioral Finance, 6(3):144–160, 2005.
[8] John Kordonis, Symeon Symeonidis, and Avi Arampatzis. Stock price forecasting via sentiment analysis on twitter. In Proceedings of the 20th Pan-Hellenic Conference on Informatics, PCI ’16, New York, NY, USA, 2016. Association for Computing Machinery.
[9] V. S. Pagolu, K. N. Reddy, G. Panda, and B. Majhi. Sentiment analysis of twitter data for predicting stock market movements. In , pages 1345–1350, 2016.
[10] V. Kalyanaraman, S. Kazi, R. Tondulkar, and S. Oswal. Sentiment analysis on news articles for stocks. In , pages 10–15, 2014.
[11] Ayman Khedr, S.E. Salama, and Nagwa Yaseen. Predicting stock market behavior using data mining technique and news sentiment analysis. International Journal of Intelligent Systems and Applications, 9:22–30, July 2017.
[12] Johan Bollen, Huina Mao, and Xiao-Jun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2, October 2010.
[13] Franco Valencia, Alfonso Gómez-Espinosa, and Benjamin Valdes. Price movement prediction of cryptocurrencies using sentiment analysis and machine learning. Entropy, 21:1–12, June 2019.
[14] Zhigang Jin, Yang Yang, and Yuhong Liu. Stock closing price prediction based on sentiment analysis and lstm. Neural Computing and Applications, September 2019.
[15] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. Using structured events to predict stock price movement: An empirical investigation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1415–1425, Doha, Qatar, October 2014. Association for Computational Linguistics.
[16] Kyoung jae Kim, Kichun Lee, and Hyunchul Ahn. Predicting corporate financial sustainability using novel business analytics. Sustainability, 11(1):1–17, December 2018.
[17] Evita Stenqvist and Jacob Lönnö. Predicting bitcoin price fluctuation with twitter sentiment analysis. Master’s thesis, School of Computer Science and Communication, 2017.
[18] Sangeeta Oswal, Ravikumar Soni, Omkar Narvekar, and Abhijit Pradha. Named entity recognition and aspect based sentiment analysis. International Journal of Computer Applications, 178(46):18–23, September 2019.
[19] Venkateswarlu Bonta, Nandhini Kumaresh, and N. Janardhan. A comprehensive study on lexicon based approaches for sentiment analysis. Asian Journal of Computer Science and Technology, 8:1–6, 2019.
[20] C. W. Park and D. R. Seo. Sentiment analysis of twitter corpus related to artificial intelligence assistants. In , pages 495–498, April 2018.
[21] C.J. Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014, January 2015.
[22] Diana Kornbrot. Point biserial correlation. In Brian S. Everitt and David Howell, editors, The Encyclopedia of Statistics in Behavioral Science, volume 1, page 2352. Wiley, 1 edition, October 2005.
[23] Snezana Kustrin and Rosemary Beresford. Basic concepts of artificial neural network (ann) modeling and its application in pharmaceutical research. Journal of pharmaceutical and biomedical analysis, 22:717–27, June 2000.
[24] Rajalingappaa Shanmugamani. Deep Learning for Computer Vision. Packt Publishing Ltd., 2018.
[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[26] Mahidhar Dwarampudi and N. V. Subba Reddy. Effects of padding on lstms and cnns. ArXiv, abs/1903.07288, 2019.
[27] Erion Çano and Maurizio Morisio. Word embeddings for sentiment analysis: A comprehensive empirical survey. ArXiv, 2019.
[28] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, January 2014.
[29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, December 1997.
[30] Ralf C. Staudemeyer and Eric Rothstein Morris. Understanding lstm - a tutorial into long short-term memory recurrent neural networks. ArXiv, abs/1909.09586, 2019.
[31] Klaus Greff, Rupesh Srivastava, Jan Koutník, Bas Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28, March 2015.
[32] Fei-Fei Li, Justin Johnson, and Serena Yeung. Recurrent neural network. http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf. Accessed: 03 March 2020.
[33] Yaoshiang Ho and Samuel Wookey. The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, PP:1–1, December 2019.
[34] Followthehashtag. One hundred nasdaq 100 companies – free twitter datasets. http://followthehashtag.com/datasets/nasdaq-100-companies-free-twitter-dataset/. Accessed: 18 October 2019.
[35] Yahoo Finance. Apple inc. (aapl), nasdaqgs real-time price. currency in usd. https://uk.finance.yahoo.com/quote/AAPL/history?p=AAPL. Accessed: 18 October 2019.
[36] Gavin Hackeling. Mastering Machine Learning with scikit-learn. Apply effective learning algorithms to real-world problems using scikit-learn. Packt Publishing, August 2014.
[37] G.P. Zhang. Neural Networks in Business Forecasting. Idea Group Publishing, 2004.
[38] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference