Improving Distributed Representations of Tweets - Present and Future
Ganesh J
Information Retrieval and Extraction Laboratory, IIIT Hyderabad, Telangana, India
[email protected]
Abstract
Unsupervised representation learning for tweets is an important research field which helps in solving several business applications such as sentiment analysis, hashtag prediction, paraphrase detection and microblog ranking. A good tweet representation learning model must handle the idiosyncratic nature of tweets, which poses several challenges such as short length, informal words, unusual grammar and misspellings. However, there is a lack of prior work surveying representation learning models with a focus on tweets. In this work, we organize the models based on the objective function they optimize, which aids the understanding of the literature. We also provide interesting future directions, which we believe are fruitful in advancing this field by building high-quality tweet representation learning models.
1 Introduction

Twitter is a widely used microblogging platform, where users post and interact with messages, "tweets". Understanding the semantic representation of tweets can benefit a plethora of applications such as sentiment analysis (Ren et al., 2016; Giachanou and Crestani, 2016), hashtag prediction (Dhingra et al., 2016), paraphrase detection (Vosoughi et al., 2016) and microblog ranking (Huang et al., 2013; Shen et al., 2014). However, tweets are difficult to model as they pose several challenges such as short length, informal words, unusual grammar and misspellings. Recently, researchers have focused on leveraging unsupervised representation learning methods based on neural networks to solve this problem. Once these representations are learned, we can use off-the-shelf predictors that take the representation as input to solve the downstream task (Bengio, 2013a; Bengio et al., 2013b). These methods enjoy several advantages: (1) they are cheaper to train, as they work with unlabelled data, (2) they reduce the dependence on domain-level experts, and (3) they are highly effective across multiple applications in practice.

Despite this, there is a lack of prior work surveying tweet-specific unsupervised representation learning models. In this work, we attempt to fill this gap by investigating the models in an organized fashion. Specifically, we group the models based on the objective function they optimize. We believe this work can aid the understanding of the existing literature. We conclude the paper by presenting interesting future research directions, which we believe are fruitful in advancing this field by building high-quality tweet representation learning models.
2 Unsupervised Tweet Representation Models

There are various models, spanning different architectures and objective functions, that compute tweet representations in an unsupervised fashion. These models work in a semi-supervised way: the representations generated by the model are fed to an off-the-shelf predictor like Support Vector Machines (SVM) to solve a particular downstream task. The models span a wide variety of neural-network-based architectures, including averages of word vectors, convolutional models, recurrent models and so on. We believe that the performance of these models is highly dependent on the objective function they optimize: predicting an adjacent word (within-tweet relationships), an adjacent tweet (inter-tweet relationships), the tweet itself (autoencoder), modeling from structured resources like paraphrase databases, or modeling using weak supervision. In this section, we provide the first survey of its kind of recent tweet-specific unsupervised models, organized to aid the understanding of the literature. Specifically, we categorize each model based on the optimized objective function, as shown in Figure 1. Next, we study each category one by one.

Figure 1: Unsupervised Tweet Representation Models Hierarchy based on Optimized Objective Function. [Figure content: modeling within-tweet relationships - Paragraph2Vec (ICML'14), Conceptual SE (ACL'16); modeling inter-tweet relationships - Skip-thought (NIPS'15), FastSent (NAACL'16), Siamese CBOW (ACL'16), Ganesh et al. (ECIR'17); modeling from structured resources - CHARAGRAM (EMNLP'16), Wieting et al. (ICLR'16); modeling as an autoencoder - SDAE (NAACL'16), Tweet2Vec (SIGIR'16); modeling using weak supervision - SSWE (TKDE'16), Tweet2Vec (ACL'16).]

2.1 Modeling within-tweet relationships

Motivation: Every tweet is assumed to have a latent topic vector, which influences the distribution of the words in the tweet. For example, though the phrase catch the ball appears frequently in the corpus, if we know that the topic of a tweet is "technology", we can expect words such as bug or exception after the word catch (ignoring the word the), instead of the word ball, since catch the bug/exception is more plausible under the topic "technology". On the other hand, if the topic of the tweet is "sports", then we can expect ball after catch. These intuitions indicate that the prediction of neighboring words for a given word strongly relies on the tweet as well.

Models: (Le and Mikolov, 2014)'s work is the first to exploit this idea to compute distributed document representations that are good at predicting the words in the document. They propose two models, PV-DM and PV-DBOW, which extend the Continuous Bag Of Words (CBOW) and Skip-gram variants of the popular Word2Vec model (Mikolov et al., 2013) respectively: PV-DM inserts an additional document token (which can be thought of as another word) that is shared across all contexts generated from the same document, while PV-DBOW attempts to predict the words sampled from the document given the document representation alone.
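To make the PV-DBOW objective concrete, the following pure-Python sketch (the toy corpus, dimensionality and learning rate are illustrative assumptions; real implementations such as gensim's Doc2Vec use negative sampling or hierarchical softmax for efficiency) trains one vector per tweet to predict the words sampled from that tweet via a full softmax:

```python
import math
import random

random.seed(0)

# Toy corpus: each "tweet" is a list of word ids (illustrative data).
tweets = [[0, 1, 2], [0, 1, 3], [2, 3, 4]]
vocab_size, dim, lr = 5, 8, 0.1

# PV-DBOW parameters: one vector per tweet, one output vector per word.
doc_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in tweets]
word_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_step(doc_id, word_id):
    """One SGD step: push the tweet vector to predict one of its words."""
    d = doc_vecs[doc_id]
    scores = [sum(dk * wk for dk, wk in zip(d, w)) for w in word_out]
    probs = softmax(scores)
    loss = -math.log(probs[word_id])
    grad_d = [0.0] * dim
    for v in range(vocab_size):
        err = probs[v] - (1.0 if v == word_id else 0.0)  # dLoss/dScore_v
        for k in range(dim):
            grad_d[k] += err * word_out[v][k]
            word_out[v][k] -= lr * err * d[k]
    for k in range(dim):
        d[k] -= lr * grad_d[k]
    return loss

losses = []
for epoch in range(50):
    total = 0.0
    for doc_id, words in enumerate(tweets):
        for word_id in words:
            total += train_step(doc_id, word_id)
    losses.append(total)
```

After training, doc_vecs holds the tweet representations; note that, as discussed above, embedding an unseen tweet requires further gradient steps of this kind, which is costly at test time.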
Although originally proposed for paragraphs and documents, these models work better than the traditional models BOW (Harris, 1954) and LDA (Blei et al., 2003) for tweet classification and microblog retrieval tasks (Wang et al., 2016). The authors in (Wang et al., 2016) make the PV-DM and PV-DBOW models concept-aware (a concept being a rich semantic signal from a tweet) by augmenting two features, attention over contextual words and a conceptual tweet embedding, which jointly exploit concept-level senses of tweets to compute better representations. Both of the discussed works have the following characteristics: (1) they use a shallow architecture, which enables fast training; (2) computing representations for test tweets requires computing gradients, which is time-consuming for real-time Twitter applications; and (3) most importantly, they fail to exploit textual information from related tweets that can bear salient semantic signals.

2.2 Modeling inter-tweet relationships

Motivation: To capture rich tweet semantics, researchers are attempting to exploit a sentence-level variant of the Distributional Hypothesis (Harris, 1954; Polajnar et al., 2015). The idea is to infer the tweet representation from the content of adjacent tweets in a related stream, such as a user's Twitter timeline, or a topical, retweet or conversational stream. This approach significantly alleviates the context insufficiency problem caused by the ambiguous and short nature of tweets (Ren et al., 2016; Ganesh et al., 2017).
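As a rough sketch of this inter-tweet objective, the toy code below (the 2-d word vectors and the tweet triples are made-up illustrations, not learned values) represents each tweet as an average of its word vectors and applies a softmax over cosine similarities that favors the adjacent tweet over a randomly sampled one:

```python
import math

def avg(vectors):
    """BOW tweet representation: element-wise mean of its word vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def adjacent_tweet_loss(target, adjacent, negatives):
    """Softmax over cosine similarities: minimize -log P(adjacent | target)."""
    sims = [cosine(target, adjacent)] + [cosine(target, n) for n in negatives]
    z = sum(math.exp(s) for s in sims)
    return -math.log(math.exp(sims[0]) / z)

# Hypothetical 2-d word vectors, chosen so related tweets point the same way.
w = {"good": [1.0, 0.2], "game": [0.9, 0.1], "great": [1.0, 0.3],
     "match": [0.8, 0.2], "rain": [-1.0, 0.5], "today": [-0.9, 0.4]}

target = avg([w["good"], w["game"]])
adjacent = avg([w["great"], w["match"]])
random_tweet = avg([w["rain"], w["today"]])

loss_correct = adjacent_tweet_loss(target, adjacent, [random_tweet])
loss_swapped = adjacent_tweet_loss(target, random_tweet, [adjacent])
```

The loss is lower when the semantically adjacent tweet is treated as the positive example; minimizing it is the pressure that pulls adjacent tweets together in the learned space.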
Models: Skip-Thought Vectors (STV) (Kiros et al., 2015) is a widely popular sentence encoder, which is trained to predict the adjacent sentences in the books corpus (Zhu et al., 2015). Although testing is cheap, as it involves only a forward propagation of the test sentence, STV is very slow to train owing to its complicated model architecture. To combat this computational inefficiency, FastSent (Hill et al., 2016) proposes a simple additive (log-linear) sentence model, which predicts adjacent sentences (represented as BOW) given the BOW representation of a sentence in context. This model can exploit the same signal at a much lower computational expense. Parallel to this work, Siamese CBOW (Kenter et al., 2016) develops a model which directly compares the BOW representations of two sentences, bringing the embedding of a sentence closer to those of its adjacent sentences and away from those of randomly occurring sentences in the corpus. For FastSent and Siamese CBOW, the test sentence representation is a simple average of the word vectors obtained after training. Both of these models are general-purpose sentence representation models trained on the books corpus, yet give competitive performance over previous models on the tweet semantic similarity computation task. (Ganesh et al., 2017)'s model attempts to exploit these signals directly from Twitter. With the help of an attention technique and learned user representations, this log-linear model is able to capture salient semantic information from the chronologically adjacent tweets of a target tweet in the user's Twitter timeline.

2.3 Modeling from structured resources

Motivation: In recent times, building representation models based on supervision from richly structured resources such as the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) (containing noisy phrase pairs) has yielded high-quality sentence representations. These methods work by maximizing the similarity of the paired sentences in the learned semantic space.
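A minimal sketch of this objective on a single paraphrase pair, assuming a margin-based formulation with sampled negative examples (the vectors and margin value below are illustrative, not taken from any trained model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def paraphrase_margin_loss(u, v, neg_u, neg_v, delta=0.4):
    """Each side of the paraphrase pair (u, v) must be closer to its
    paraphrase than to a sampled negative example by at least delta."""
    loss_u = max(0.0, delta - cosine(u, v) + cosine(u, neg_u))
    loss_v = max(0.0, delta - cosine(v, u) + cosine(v, neg_v))
    return loss_u + loss_v

# A well-separated pair incurs zero loss; a violated margin incurs a penalty.
zero = paraphrase_margin_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0])
penalty = paraphrase_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 0.0])
```

Summed over the millions of noisy phrase pairs in a resource like PPDB, this kind of loss is what shapes the semantic space the models below learn.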
Models: CHARAGRAM (Wieting et al., 2016a) embeds textual sequences by learning a character-based compositional model that sums the vectors of its character n-grams, followed by an elementwise nonlinearity. This simple architecture, trained on PPDB, is able to beat models with complex architectures like CNNs and LSTMs on the SemEval 2015 Twitter textual similarity task by a large margin. This result emphasizes the importance of character-level models in addressing differences due to spelling variation and word choice. The authors, in their subsequent work (Wieting et al., 2016b), conduct a comprehensive analysis of models spanning the range of complexity from word averaging to LSTMs, testing their ability to do transfer and supervised learning after optimizing a margin-based loss on PPDB. For transfer learning, they find that models based on word averaging perform well on both in-domain and out-of-domain textual similarity tasks, beating the LSTM model by a large margin. In the supervised setting, the word averaging models again perform well on both sentence similarity and textual entailment tasks, outperforming the LSTM. However, for the sentiment classification task, they find an LSTM (trained on PPDB) beats the averaging models and establishes a new state of the art. These results suggest that structured resources play a vital role in computing general-purpose embeddings useful in downstream applications.

2.4 Modeling as an autoencoder

Motivation: The autoencoder-based approach learns latent (or compressed) representations by reconstructing its own input. Since textual data like tweets contain discrete input signals, sequence-to-sequence models (Sutskever et al., 2014) like STV can be used to build the solution. The encoder, which encodes the input tweet, can typically be a CNN (Kim, 2014), a recurrent model such as an RNN, GRU or LSTM (Karpathy et al., 2015), or a memory network (Sukhbaatar et al., 2015). The decoder, which generates the output tweet, is typically a recurrent model that predicts an output token at every time step.
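Denoising variants of this autoencoding objective (discussed next) first corrupt the input and then train the decoder to reproduce the original tweet. A possible corruption function, with illustrative noise types and probabilities (these are hyperparameters, not fixed values from any paper), might look like:

```python
import random

random.seed(2)

def corrupt(tokens, drop_prob=0.3, swap_prob=0.3):
    """Denoising-autoencoder-style noise: randomly delete tokens, then swap
    some adjacent pairs. The training target remains the uncorrupted input."""
    kept = [t for t in tokens if random.random() > drop_prob]
    out = list(kept)
    i = 0
    while i < len(out) - 1:
        if random.random() < swap_prob:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # do not swap the same token twice
        else:
            i += 1
    return out

tweet = ["so", "excited", "for", "the", "game", "tonight"]
noisy = corrupt(tweet)  # a shorter, locally shuffled variant of the tweet
```

Training the encoder-decoder to map corrupt(tweet) back to tweet forces the representation to capture the factors of variation that survive the noise.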
Models: The Sequential Denoising Autoencoder (SDAE) (Hill et al., 2016) is an LSTM-based sequence-to-sequence model, which is trained to recover the original data from a corrupted version. SDAE produces robust representations by learning to represent the data in terms of features that explain its important factors of variation. Tweet2Vec (Vosoughi et al., 2016) is a recent model which uses a character-level CNN-LSTM encoder-decoder architecture trained to reconstruct the input tweet directly. This model outperforms competitive models that work at the word level, like PV-DM and PV-DBOW, on semantic similarity computation and sentiment classification tasks, thereby showing that the character-level nature of Tweet2Vec is well suited to deal with the noise and idiosyncrasies of tweets. Tweet2Vec controls the generalization error by using a data augmentation technique, wherein tweets are replicated and some of the words in the replicated tweets are replaced with their synonyms. Both SDAE and Tweet2Vec have the advantage that they do not need a coherent inter-sentence narrative (unlike STV), which is hard to obtain in Twitter.

2.5 Modeling using weak supervision

Motivation: In a weakly supervised setup, we create labels for a tweet automatically and predict them to learn potentially more sophisticated models than those obtained by unsupervised learning alone. Examples of such labels include the sentiment of the overall tweet, words such as hashtags present in the tweet, and so on. This technique can create a huge labeled dataset, which is especially useful for building data-hungry, sophisticated deep learning models.
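The automatic labeling step motivated above can be sketched as a simple distant-supervision filter; the emoticon lists and the convention of dropping the labeling token from the text are illustrative assumptions:

```python
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-(", ":'("}

def distant_label(tweet):
    """Label a tweet by the polarity of its emoticons, remove the
    emoticons from the text, and skip ambiguous or unlabeled tweets."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither or both polarities: unusable
        return None
    text = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return (text, 1 if has_pos else 0)
```

Running such a filter over a large tweet stream yields the kind of massive, automatically labeled dataset that data-hungry models need, at the cost of some label noise.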
Models: (Tang et al., 2016) learn sentiment-specific word embeddings (SSWE), which encode polarity information in the word representations so that words with contrasting polarities and similar syntactic contexts (like good and bad) are pushed away from each other in the semantic space. SSWE utilizes massive distant-supervised tweets, collected using positive and negative emoticons, to build powerful tweet representations, which are shown to be useful in tasks such as sentiment classification and word similarity computation in a sentiment lexicon. (Dhingra et al., 2016) observe that hashtags in tweets can be considered as topics, and hence tweets with similar hashtags must come closer to each other. Their model predicts the hashtags by using a Bi-GRU layer to embed the tweets from their characters. Due to subword modeling, such character-level models can approximate the representations of rare words and new words (words not seen during training) in the test tweets really well. This model outperforms word-level baselines on the hashtag prediction task, thereby suggesting that exploring character-level models for tweets is a worthy research direction to pursue. Both these works fail to study the model's generality (Weston et al., 2014), i.e., the ability of the model to transfer the learned representations to diverse tasks.
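The subword-modeling advantage noted above can be sketched with a CHARAGRAM-style composition: sum the vectors of a string's character n-grams and apply an elementwise nonlinearity. The n-gram range, dimensionality and the randomly initialized (untrained) embedding table below are illustrative assumptions:

```python
import math
import random

random.seed(1)
dim = 64
table = {}  # hypothetical character n-gram embedding table, filled lazily

def char_ngrams(text, n_max=3):
    """All character n-grams of the '#'-padded string, for n = 1..n_max."""
    padded = "#" + text + "#"
    grams = []
    for n in range(1, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

def embed(text):
    """Sum the n-gram vectors, then apply an elementwise tanh."""
    total = [0.0] * dim
    for g in char_ngrams(text):
        if g not in table:
            table[g] = [random.uniform(-0.1, 0.1) for _ in range(dim)]
        for k in range(dim):
            total[k] += table[g][k]
    return [math.tanh(x) for x in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))
```

Because a misspelling like cooool shares most of its n-grams with cool, the two strings receive similar vectors even before any training, whereas a word-level model would treat cooool as an unknown token.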
3 Future Directions

In this section we present future research directions which we believe are worth pursuing to generate high-quality tweet embeddings.

• (Ren et al., 2016) propose a supervised neural network utilizing contextualized features from conversation-, author- and topic-based context about a target tweet to perform well in tweet classification. Apart from (Ganesh et al., 2017)'s work, which utilizes author context, there is no other work that builds an unsupervised tweet representation model on Twitter-specific contexts such as conversation and topical streams. We believe such a solution directly exploits semantic signals (or nuances) from Twitter, unlike STV or Siamese CBOW, which are trained on the books corpus.

• (dos Santos and Gatti, 2014) propose a supervised, hybrid model exploiting both character- and word-level information for the Twitter sentiment analysis task. Since the settings in which a character-level model beats a word-level model are not yet well understood, we believe it would be interesting to explore such a hybrid compositional model for building unsupervised tweet representations.

• Twitter provides a platform for users to interact with other users. To the best of our knowledge, there is no related work that computes unsupervised tweet representations by exploiting user profile attributes like the profile picture, user biography and set of followers, or social interactions like the retweet context (the set of surrounding tweets in a user's retweet stream) and the favorite context (the set of surrounding tweets in a user's favorite tweet stream).
• DSSM (Huang et al., 2013; Shen et al., 2014) is a family of deep models trained to maximize the relevance of clicked documents given a query. Such a ranking loss function helps the model cater to a wide variety of applications such as web search ranking, ad selection/relevance, question answering, knowledge inference and machine translation. We observe that such a loss function has not been explored for building unsupervised tweet representations. We believe employing a ranking loss directly on tweets, using a large-scale microblog dataset, can result in representations that are useful to Twitter applications beyond those studied in the tweet representation learning literature.

• Linguists assume that language is best understood as a hierarchical tree of phrases, rather than a flat sequence of words or characters. It is difficult to obtain syntactic trees for tweets, as most of them are not grammatically correct. The average of word vectors model has the simplest compositional architecture, with no additional parameters, yet displays strong performance, outperforming complex architectures such as CNNs and LSTMs for several downstream applications (Wieting et al., 2016a,b). We believe a theoretical understanding of why word averaging models perform well can help linguists embrace these models.

• Models in (Wieting et al., 2016a,b) learn from noisy phrase pairs of PPDB. Note that the source of the underlying texts is completely different from Twitter. It would be interesting to see the effectiveness of such models when directly trained on structured resources from Twitter, like the Twitter Paraphrase Corpus (Xu et al., 2014). The main challenge with this approach is the small size of the annotated Twitter resources, which can encourage models like (Arora et al., 2017) that work well even when the training data is scarce or nonexistent.
• Tweets often have an accompanying image, which sometimes has visual correspondence with the textual content (Chen et al., 2013). Can we build multimodal representations for tweets accompanied by correlated visual content, and compare them on traditional benchmarks? We can leverage insights from the multimodal skip-gram model (Lazaridou et al., 2015), which builds multimodally-enhanced word vectors that perform well on traditional semantic benchmarks. However, it is hard to detect visual tweets, and learning from a non-visual tweet can degrade its tweet representation. It would be interesting to see if a dispersion metric (Kiela et al., 2014) for tweets can be explored to overcome this problem and build a non-degradable, improved tweet representation.

• Interpreting tweet representations to unearth the encoded features responsible for their performance on a downstream task is an important but less studied research area. (Ganesh et al., 2016)'s work is the first to open the black box of vector embeddings for tweets. They propose elementary property prediction tasks which measure the accuracy to which a given tweet representation encodes an elementary property (like slang words, hashtags, mentions, etc.). The main drawback of the work is that they fail to correlate their study with downstream applications. We believe performing such a correlation study can clearly highlight the set of elementary features behind the performance of one representation model over another for a given downstream task.
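The ranking-loss direction above could be instantiated, in the spirit of DSSM, as a softmax with a smoothing factor over the cosine similarities between a query tweet and one relevant plus several sampled irrelevant tweets; all vectors here are illustrative toy values:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def ranking_loss(query, relevant, negatives, gamma=10.0):
    """DSSM-style objective: minimize -log P(relevant | query), where the
    probability is a softmax (smoothed by gamma) over cosine similarities."""
    sims = [cosine(query, relevant)] + [cosine(query, n) for n in negatives]
    exps = [math.exp(gamma * s) for s in sims]
    return -math.log(exps[0] / sum(exps))

# The loss is near zero when the relevant tweet is closest to the query,
# and large when an irrelevant tweet outranks it.
good = ranking_loss([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = ranking_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.1]])
```

What plays the role of the "click" signal on Twitter (retweets, replies, favorites) is an open design choice for this direction.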
4 Conclusion

In this work we study the problem of learning unsupervised tweet representations. We believe our survey of the existing works, organized by objective function, can give vital perspectives to researchers and aid their understanding of the field. We also believe the future research directions studied in this work can help in breaking the barriers to building high-quality, general-purpose tweet representation models.

References
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. CoRR.

Yoshua Bengio. 2013a. Deep Learning of Representations: Looking Forward. In Proc. of the 1st Intl. Conf. on Statistical Language and Speech Processing, pages 1-37.

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013b. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research.

Tao Chen, Dongyuan Lu, Min-Yen Kan, and Peng Cui. 2013. Understanding and Classifying Image Tweets. In ACM Multimedia Conference, MM '13, pages 781-784.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics.

Cicero Nogueira dos Santos and Maira Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proc. of the 25th Intl. Conf. on Computational Linguistics, pages 69-78.

J Ganesh, Manish Gupta, and Vasudeva Varma. 2016. Interpreting the syntactic and social elements of the tweet representations via elementary property prediction tasks. CoRR abs/1611.04887.

J Ganesh, Manish Gupta, and Vasudeva Varma. 2017. Improving tweet representations using temporal and user context. In Advances in Information Retrieval - 39th European Conference on IR Research, ECIR, pages 575-581.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proc. of the Conf. of the North American Chapter of the Association for Computational Linguistics, pages 758-764.

Anastasia Giachanou and Fabio Crestani. 2016. Like it or not: A Survey of Twitter Sentiment Analysis Methods. ACM Computing Surveys (CSUR).

Zellig S. Harris. 1954. Distributional Structure. Word.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data. In Proc. of the Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367-1377.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In Proc. of the 22nd ACM Intl. Conf. on Information and Knowledge Management, pages 2333-2338.

Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR abs/1506.02078.

Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics.

Douwe Kiela, Felix Hill, Anna Korhonen, and Stephen Clark. 2014. Improving multi-modal representations using image dispersion: Why less is sometimes more. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 835-841.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing, pages 1746-1751.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Proc. of the 2015 Intl. Conf. on Advances in Neural Information Processing Systems, pages 3294-3302.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. In Proc. of the Conf. of the North American Chapter of the Association for Computational Linguistics, pages 153-163.

Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proc. of the 31st Intl. Conf. on Machine Learning, pages 1188-1196.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proc. of the Intl. Conf. on Neural Information Processing Systems, pages 3111-3119.

Tamara Polajnar, Laura Rimell, and Stephen Clark. 2015. An exploration of discourse-based sentence spaces for compositional distributional semantics. In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), page 1.

Yafeng Ren, Yue Zhang, Meishan Zhang, and Donghong Ji. 2016. Context-sensitive twitter sentiment classification using neural network. In Proc. of the 30th AAAI Conference on Artificial Intelligence, pages 215-221.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. 2014. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. In Proc. of the 23rd ACM Intl. Conf. on Information and Knowledge Management, pages 101-110.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS, pages 2440-2448.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104-3112.

Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. Sentiment Embeddings with Applications to Sentiment Analysis. IEEE Transactions on Knowledge and Data Engineering.

Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2Vec: Learning Tweet Embeddings using Character-level CNN-LSTM Encoder-Decoder. In Proc. of the 39th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 1041-1044.

Yashen Wang, Heyan Huang, Chong Feng, Qiang Zhou, Jiahui Gu, and Xiong Gao. 2016. CSE: Conceptual sentence embeddings based on attention model. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics.

Zhiyu Wang, Peng Cui, Lexing Xie, Wenwu Zhu, Yong Rui, and Shiqiang Yang. 2014. Bilateral correspondence model for words-and-pictures association in multimedia-rich microblogs. TOMCCAP.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #TagSpace: Semantic Embeddings from Hashtags. In Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing, pages 1822-1827.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016a. Charagram: Embedding words and sentences via character n-grams. In Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1504-1515.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016b. Towards universal paraphrastic sentence embeddings. CoRR abs/1511.08198.

Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan, and Yangfeng Ji. 2014. Extracting lexically divergent paraphrases from twitter. TACL.