Unleashing the Power of Hashtags in Tweet Analytics with Distributed Framework on Apache Storm
Vibhuti Gupta and Rattikorn Hewett
Department of Computer Science, Texas Tech University, Lubbock, TX 79415
Emails: [email protected], [email protected]
Abstract: Twitter is a popular social network platform where users can interact and post texts of up to 280 characters called tweets. Hashtags, hyperlinked words in tweets, have increasingly become crucial for tweet retrieval and search. Using hashtags for tweet topic classification is a challenging problem because of context dependence among words, slang, abbreviations and emoticons in a short tweet, along with the evolving use of hashtags. Since Twitter generates millions of tweets daily, tweet analytics is a fundamental Big data stream problem that often requires real-time distributed processing. This paper proposes a distributed online approach to tweet topic classification with hashtags. Implemented on Apache Storm, a distributed real-time framework, our approach incrementally identifies and updates a set of strong predictors in the Naïve Bayes model for classifying each incoming tweet instance. Preliminary experiments show promising results, with up to 97% accuracy and a 37% increase in throughput on eight processors.
Keywords: Twitter; Hashtags; Social Media; Big Data Stream; Ontology; Apache Storm
I. INTRODUCTION
The proliferation of social media networks in the last few years has produced enormous volumes of data and has become a common source of Big data. Twitter is one of the most popular social media platforms, where users post short text messages of up to 280 characters, known as tweets, for communication. On average, 6000 tweets are generated per second and 500 million tweets per day. Since Twitter generates a huge, unstoppable, fast-growing and unstructured Big data stream of tweets daily, tweet analytics is a fundamental Big data stream problem that often requires real-time distributed processing. Hashtags, user-defined hyperlinked words of typical topics, facilitate efficient information sharing in tweets [14]. Hashtags begin with a hash symbol representing various subjects, for examples, hybrid hashtag approach to cope with the challenges of tweet topic classification using hashtags. Hybrid hashtags consist of two types of hashtags: 1) those that are extracted from input tweet data and 2) those derived from a knowledge base of topic (or class) concepts (or topic ontology) by using Hashtagify [18], a tool that generates "similar" hashtags from a given term (see more details in [7]). We evaluated the effectiveness of this semi-automated approach using a batch analysis with the Naïve Bayes algorithm. Applying this approach to a real Big data stream of tweets requires an online and distributed approach to deal with the fast and dynamic arrival rates of tweets. Thus, real-time processing with minimum latency is desirable. This paper differs from our previous work [7] in that it presents a fully automated, online and distributed system for tweet topic classification using
Hybrid Hashtags as opposed to finding the most effective way to use hashtags for tweet classification in a non-distributed environment. Our contribution in this paper is twofold. First, we propose an online approach (for both data pre-processing and analytics) that analyzes each tweet to identify appropriate hybrid hashtags and incrementally updates an accumulated set of hybrid hashtags. Our approach is completely online since it updates the hybrid hashtags and the classification model incrementally with each incoming tweet instance. Second, we empirically illustrate that the proposed approach is scalable and thus suitable for a Big data stream environment, giving a lower execution time and a higher throughput in the Apache Storm framework. The rest of the paper is organized as follows. Section II discusses related work. Section III describes our proposed approach, followed by experimental results in Section IV, and Section V concludes the paper.
II. RELATED WORK
Recent research in tweet analytics has studied how hashtags can be applied to tweet classification [2, 3, 12, 13, 14], tweet retrieval/search [5], and hashtag recommendation [6, 17]. Work on hashtag recommendation for a given tweet finds tweets similar to the given tweet and ranks the hashtags in those tweets for recommendation based on how closely relevant they are to the content of the given tweet, using similarity measures, statistical models or probabilistic machine learning models [6, 17]. The hashtag research most related to our work is on tweet classification [3, 12, 13, 14], the majority of which deals with sentiment analysis. There are some large-scale implementations for Twitter sentiment analysis [8, 9]. [8] uses all hashtags and emoticons as sentiment labels and classifies tweets using MapReduce and Apache Spark, while [9] builds a large-scale sentiment lexicon and classifies using MapReduce.
Previous work in [10] uses an ontology to determine the sentiment of Twitter posts by assigning sentiment scores to each tweet instance, and [16] uses an ontology of keywords to classify documents in the economic field. The above work differs from ours not just in the domain but also in the characteristics of the classes to classify. Our previous work in [7] deals with tweet topic classification using domain-specific knowledge, and this paper extends it to Big data streams using Apache Storm [23]. The main distinction between our distributed processing and the previous approaches is that our processing is completely online and implemented on the Storm framework, as opposed to MapReduce [4], which is a distributed batch processing framework. Although both Apache Storm and Apache Spark are data stream processing frameworks, Apache Storm is an online framework while Apache Spark [22] processes data in batches and is therefore not applicable to our work. However, a combination of both can be applied for better computation on stream data.
III. PROPOSED APPROACH
The key distinction of this work is an online and distributed approach to tweet analytics using hybrid hashtags, as opposed to constructing a hybrid hashtag technique for non-distributed computing on a single processor, as proposed in our previous approach in [7]. In particular, we use the Naïve Bayes [19] algorithm for classification. We now describe how we select and construct appropriate hybrid hashtags [7], followed by our proposed online approach. Our hybrid hashtag construction approach starts by building a domain-specific knowledge base describing concepts relevant to the class/topic to be classified. The knowledge base is a graphical representation of concepts and the relationships between them (also referred to as an ontology). Each concept at the bottom-level nodes is fed into an automated tool, Hashtagify [18], to retrieve a set of hashtags relevant to the input concept, called concept-based hashtags.
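As a minimal sketch of this first step, the ontology can be held as a mapping from each concept to its child concepts, and the bottom-level concepts collected by a depth-first walk. The ontology contents, the function names, and the `lookup` callable are all hypothetical; in particular, `lookup` is only a stand-in for querying a tool such as Hashtagify and does not reflect that tool's actual interface.

```python
def leaf_concepts(ontology, root):
    """Return the bottom-level concepts reachable from root (depth-first order)."""
    leaves, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = ontology.get(node, [])
        if children:
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(children))
        else:
            leaves.append(node)
    return leaves


def concept_based_hashtags(ontology, root, lookup):
    """Map each bottom-level concept to the hashtags returned by lookup(concept)."""
    return {concept: lookup(concept) for concept in leaf_concepts(ontology, root)}


# Hypothetical toy ontology and a stubbed lookup table (illustration only).
toy_ontology = {
    "health": ["disease", "treatment"],
    "disease": ["flu", "diabetes"],
    "treatment": ["vaccine"],
}
stub_table = {"flu": ["#flu", "#influenza"], "vaccine": ["#vaccine"]}
result = concept_based_hashtags(
    toy_ontology, "health", lambda concept: stub_table.get(concept, [])
)
print(result)
```

Passing the lookup as a parameter keeps the traversal independent of whichever hashtag source is actually used.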
Each of these concept-based hashtags is ranked according to its correlation with the given concept and a specified k. The correlation score between the hashtag h and concept c can be computed by using Equation 1.

$$\mathrm{corr}(c, h) = \frac{\sum_{i=1}^{n} (c_i - \bar{c})(h_i - \bar{h})}{(n-1)\, S_c S_h} \qquad (1)$$

where c and h are vectors of length n representing the frequency of occurrence of concept c and hashtag h in the Hashtagify data, while $\bar{c}$, $\bar{h}$ and $S_c$, $S_h$ are the means and standard deviations of their values, respectively. We extract the top k hashtags correlated with the concept (called k-correlated hashtags). This process repeats until we retrieve all the k-correlated hashtags of all selected concepts of a given class concept (or topic). These are the ontology-driven hashtags. This set of hashtags is combined with the k-correlated tweet-based hashtags (i.e., the top k hashtags in the tweet data that are correlated with the class topic) to get the set of hybrid hashtags. The hybrid hashtag approach is shown in Figure 1. These hybrid hashtags are used for tweet topic classification with various classification algorithms (i.e., Naïve Bayes, SVM, k-NN) to evaluate their performance; more details can be found in [7].
This paper presents an online and distributed approach to tweet topic classification using hybrid hashtags. The hybrid hashtag construction is online in the sense that the set of hybrid hashtags is updated incrementally for each new incoming tweet instance. The approach starts with an initial
set of ontology-driven hashtags as developed in [7]. For each incoming tweet instance, we compute the correlation of each hashtag in the tweet with the topic concepts and keep only the top k tweet-based hashtags. Thus, we obtain an accumulated set of k-correlated tweet-based hashtags, which grows as more tweets are analyzed. Combining this tweet-based hashtag set with the ontology-driven hashtags yields a set of hybrid hashtags for each tweet instance. These hybrid hashtags are used to classify the tweet using an online Naïve Bayes classifier. The same process repeats for the next tweet instance, computing a new set of hybrid hashtags from the current as well as previous tweet instances. In this way, the hybrid hashtag set updates itself with each new incoming tweet instance until the tweet stream ends. The classification model is updated with this new set of hybrid hashtags, so the classifier is incrementally updated and improves the classification results at each new tweet instance until the stream ends.

Algorithm Hybrid-hashtags(H_C, H_T, k)
Inputs: H_C, a set of ontology-driven hashtags corresponding to class concepts C; H_T, a set of hashtags extracted from tweet data sets; k, the set size of hashtags correlated with a given concept
Output: H, a set of hybrid hashtags (from H_C and H_T) of k-correlation with C
  H' ← H_C ∩ H_T   // potentially strong predictive hashtags
  H'' ← ∅
  for each h ∈ H' do
    H_h ← Select(H_T − H_C, h, k)   // add new tweet hashtags that are k-correlated with h
    H'' ← H'' ∪ H_h
  end for
  return H' ∪ H''
Fig. 1. Combining tweet-extracted with ontology-driven hashtags.

The proposed approach has been implemented in Apache Storm [23], a distributed real-time processing framework for Big data streams. The Storm framework processes data in real time using spouts and bolts as components that make up a topology.
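The per-tweet logic that such a topology would carry out (correlation scoring as in Equation 1, top-k selection, combination with the ontology-driven hashtags, and incremental Naïve Bayes updating) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation and not Storm code: the frequency vectors are assumed to be per-time-window counts, class labels are assumed to be available for updating the model, Laplace smoothing with alpha = 1 is assumed, and all names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def pearson(c, h):
    """Equation (1): sample Pearson correlation of two frequency vectors."""
    n = len(c)
    if n != len(h) or n < 2:
        return 0.0
    mean_c, mean_h = sum(c) / n, sum(h) / n
    s_c = math.sqrt(sum((x - mean_c) ** 2 for x in c) / (n - 1))
    s_h = math.sqrt(sum((y - mean_h) ** 2 for y in h) / (n - 1))
    if s_c == 0.0 or s_h == 0.0:
        return 0.0  # correlation is undefined for a constant vector
    cov = sum((x - mean_c) * (y - mean_h) for x, y in zip(c, h))
    return cov / ((n - 1) * s_c * s_h)


def top_k_correlated(concept_freq, hashtag_freqs, k):
    """Return the k hashtags whose frequency vectors correlate best with the concept."""
    ranked = sorted(
        hashtag_freqs,
        key=lambda tag: pearson(concept_freq, hashtag_freqs[tag]),
        reverse=True,
    )
    return set(ranked[:k])


class OnlineNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated one instance at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocabulary = set()

    def update(self, features, label):
        self.class_counts[label] += 1
        for feature in features:
            self.feature_counts[label][feature] += 1
            self.vocabulary.add(feature)

    def predict(self, features):
        total = sum(self.class_counts.values())
        if total == 0:
            return None  # nothing learned yet
        best_label, best_score = None, float("-inf")
        vocab_size = max(len(self.vocabulary), 1)
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(count / total)
            for feature in features:
                score += math.log((counts.get(feature, 0) + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_and_update(model, hybrid, tweet_hashtags, label=None):
    """Predict from the tweet's hybrid hashtags, then update if a label is known."""
    features = [tag for tag in tweet_hashtags if tag in hybrid]
    predicted = model.predict(features)
    if label is not None:
        model.update(features, label)
    return predicted


# Hypothetical toy data: per-window counts for a concept and three hashtags.
concept_freq = [1, 2, 3, 4]
hashtag_freqs = {"#flu": [2, 4, 6, 8], "#doc": [1, 3, 2, 4], "#cat": [4, 3, 2, 1]}
ontology_tags = {"#health"}
hybrid = ontology_tags | top_k_correlated(concept_freq, hashtag_freqs, k=2)

model = OnlineNaiveBayes()
classify_and_update(model, hybrid, ["#flu", "#doc"], label="health")
classify_and_update(model, hybrid, ["#flu", "#health"], label="health")
classify_and_update(model, hybrid, ["#doc"], label="other")
print(classify_and_update(model, hybrid, ["#flu"]))
```

The sketch forms the hybrid set as a plain union of the ontology-driven and top-k tweet-based hashtags, following the prose description; the selection in Fig. 1 is more specific and is not reproduced here.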
Spouts are the sources of stream data, which are processed by bolts to produce results. The topology is submitted to a master node known as Nimbus, which distributes the computation among the worker nodes in a cluster. Each worker node has multiple worker processes, each of which runs in its own JVM and executes a subset of a specific topology using executors. Each worker process runs several executors, and each executor runs one or more tasks that perform the actual data processing. The coordination between the master node and the worker nodes is maintained by Zookeeper. Storm has been successfully applied in many data stream applications. A detailed description of Storm can be found in [23].
IV. EXPERIMENTS AND RESULTS
This section presents the experiments and results used to evaluate the effectiveness of the proposed approach in a Big data stream infrastructure. The Storm cluster is composed of a varying number of virtual machines (VMs or processors) (i.e., 1, 2, 4, and 8) on a system with an Intel Core i7-8550U 2 GHz CPU, 8 cores, 16 GB of RAM and a 1 TB hard disk. Each virtual machine is configured with 4 vCPUs and 4 GB of RAM. We installed the 64-bit Ubuntu 14.04.05 OS on each VM along with JDK/JRE v1.8. All nodes run Apache Storm except the one running Zookeeper and Nimbus [23]. The Apache Storm version used is 0.9.7 with zookeeper
Table I : Comparisons of Tweet Stream Classification Results.
Approaches    Batch Analysis    Online Analysis
Words Only    74.0%
Table I compares the average accuracy obtained by the previous batch analysis approach [7] with that obtained by the proposed online analysis approach using the Naïve Bayes classification algorithm. The batch analysis attempts to find the best way to exploit hashtags in tweets for tweet classification. Thus, we explored the analysis in batches by investigating the power of hashtags compared to using only words in tweets. As shown in Table I, as in the batch analysis, hybrid hashtags give the best result (97% accuracy) compared to online approaches using other sets of features (i.e., words, words & tweet hashtags, etc.). Furthermore, for each set of features, the online preprocessing and classifier perform slightly better than the batch analysis, with a 2% increase in accuracy for hybrid hashtags. This could be because the set of features for the online approach "gradually" adapts by learning from the growing input tweets, as opposed to the fixed set of features pre-determined during the training of the batch analysis.
Fig. 2. Throughput for hybrid hashtags with increasing number of processors
We also measure the throughput and processing time of the hybrid hashtags approach to evaluate its scalability as the number of processors increases. Throughput is the total number of tweet instances processed per unit time (seconds in our case), and processing time is the average time taken to completely process a tweet instance in the Storm architecture. We ran the experiment for a session of 5 minutes in each case (i.e., with 1, 2, 4, and 8 processors) and observed the throughput and execution time values. Figure 2 compares the throughput as the number of processors increases in the distributed environment. As shown in Figure 2, throughput improves from 2,975 tweets/sec to 4,065 tweets/sec when we increase from a single processor to eight processors.
Table II: % Increase in throughput with increasing number of processors
Number of Processors    % Increase in throughput
2    8.9%
Table II shows the percentage increase in throughput with an increasing number of processors, compared to a single processor. The increase is slight when the number of processors is doubled and quadrupled, but reaches its highest value of 37% when the number of processors is eight. The number of processed tweets depends on the speed of execution, so the faster the tweets are processed, the higher the throughput. Increasing the number of processors raises the rate at which tweets are processed, which improves the throughput of the classification.
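The 37% figure can be checked directly from the two throughput values reported for Figure 2 (2,975 and 4,065 tweets/sec). The short calculation below is a sketch of that arithmetic; the speedup and per-processor efficiency it also derives are standard scalability measures added here for illustration and are not reported in the paper, and the function name is hypothetical.

```python
def percent_increase(baseline, value):
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


single_processor = 2975.0   # tweets/sec on one processor (reported for Figure 2)
eight_processors = 4065.0   # tweets/sec on eight processors (reported for Figure 2)

gain = percent_increase(single_processor, eight_processors)
speedup = eight_processors / single_processor
efficiency = speedup / 8    # speedup per processor (not reported in the paper)

print(f"{gain:.1f}%")  # 36.6%, which rounds to the reported 37%
```

A speedup well below the processor count, as here, is consistent with the paper's later observation that the gains are not linear in the number of processors.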
Fig. 3. Processing time for hybrid hashtags with various processors
Figure 3 compares the execution times as the number of processors increases. It shows that the processing time decreases with an increased number of processors, since the computational analysis of each tweet can now be distributed to multiple processors. Thus, the average execution time of each processor decreases. Table III shows the % reduction in execution time with an increased number of processors, compared to a single processor. As shown in Table III, there is only a slight reduction in execution time when the number of processors is 2 and 4, but the reduction increases drastically with 8 processors.
Table III: % Reduction in execution time with various numbers of processors
Number of Processors    % Reduction in Execution time (ms)
2    1.4%
It is interesting to see that the % reduction in time is not necessarily linear in the number of processors used. Further experiments are required.
Fig. 4. Total
Figure 4 shows the total number of tweets processed with various numbers of processors. As the number of processors increases, the number of tweets processed increases, as expected. Thus, the proposed online approach appears to perform as well as expected. This shows promising applicability to Big data tweet analytics in terms of not only classification accuracy but also efficiency in data processing and analysis.
V. CONCLUSION
This paper presents an online, automated and distributed approach to using hybrid hashtags for tweet classification in Apache Storm. The approach is general in that it can be applied to any class concept in any domain. The experimental results show that the proposed approach benefits from distributed processing capabilities by reducing execution time, improving scalability and providing real-time data processing. Future work includes more experiments on different domains and applications of this online approach to other kinds of tweet analytics. Additional research using different windowing techniques is required to help improve tweet classification and other tweet analytic problems.
REFERENCES
[1] Baccianella, S., A. Esuli, & F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining", Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Vol. 10, 2010.
[2] Belainine, B., A. Fonseca, & F. Sadat, "Named Entity Recognition and Hashtag Decomposition to Improve the Classification of Tweets", Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), 2016.
[3] Davidov, D., O. Tsur, & A. Rappoport, "Enhanced sentiment learning using twitter hashtags and smileys", Proceedings of the 23rd International conference on computational linguistics: posters, Association for Computational Linguistics, 2010.
[4] Dean, J., & Ghemawat, S., "MapReduce: simplified data processing on large clusters", Communications of the ACM, 2008, 51(1), 107-113.
[5] Efron, M., "Hashtag retrieval in a microblogging environment", Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, ACM, 2010.
[6] Feng, W., and J. Wang, "We can learn your
[7] Gupta, V., & Hewett, R., "Harnessing the power of hashtags in tweet analytics", In Big Data (Big Data), 2017 IEEE International Conference on, IEEE, 2017, 2390-2395.
[8] Kanavos, A., et al., "Large scale implementations for twitter sentiment classification", Algorithms, 2017, 10(1), 33.
[9] Khuc, V.N., et al., "Towards Building Large-Scale Distributed Systems for Twitter Sentiment Analysis", In Proceedings of the Annual ACM Symposium on Applied Computing, 2012, 459–464.
[10] Kontopoulos, E., et al., "Ontology-based sentiment analysis of twitter posts", Expert systems with applications, 2013, 40(10), 4065-4074.
[11] Li, Quanzhi, et al., "Discovering Relevant Hashtags for Health Concepts: A Case Study of Twitter", AAAI Workshop: WWW and Population Health Intelligence, 2016.
[12] Mohammad, Saif M., and S. Kiritchenko, "Using hashtags to capture fine emotion categories from tweets", Computational Intelligence, 31.2 (2015), 301-326.
[13] Simeon, C., H. J. Hamilton, & R. J. Hilderman, "Word Segmentation Algorithms with Lexical Resources for Hashtag Classification", Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on, IEEE, 2016.
[14] Simeon, C., and R. Hilderman, "Evaluating the Effectiveness of Hashtags as Predictors of the Sentiment of Tweets", International Conference on Discovery Science, Springer, Cham, 2015.
[15] Song, S., and Y. Meng, "Classifying and ranking microblogging hashtags with news categories", Research Challenges in Information Science (RCIS), 2015 IEEE 9th Inter. Conference on, IEEE, 2015.
[16] Vogrinčič, S., and Z. Bosnić, "Ontology-based multi-label classification of economic articles", Computer Science and Information Systems 8.1 (2011), 101-119.
[17] Zangerle, E., W. Gassler, and G. Specht, "Recommending
[18] Hashtagify, Accessed October 2017, URL: http://hashtagify.me/
[19] Naive Bayes, Accessed October 2017, URL: https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
[20] Twitter API, Accessed October 2017, URL: https://dev.twitter.com
[21] Porter, M., "The Porter Stemming Algorithm", Accessed August 2017, URL: https://tartarus.org/martin/PorterStemmer/
[22] Apache Spark, Accessed October 2018, URL: http://spark.apache.org/
[23]