Deep Learning Paradigm with Transformed Monolingual Word Embeddings for Multilingual Sentiment Analysis
Yujie Lu and
Tatsunori Mori
Graduate School of Environment and Information Sciences, Yokohama National University, Yokohama, 240851, Japan
{luyujie, mori}@forest.eis.ynu.ac.jp

Abstract
The surge of social media use brings a huge demand for multilingual sentiment analysis (MSA) for unveiling cultural differences. So far, traditional methods have resorted to machine translation: texts in other languages are translated into English, and methods that worked for English are then applied. However, this paradigm is conditioned by the quality of machine translation. In this paper, we propose a new deep learning paradigm that assimilates the differences between languages for MSA. We first pre-train monolingual word embeddings separately, then map the word embeddings in different spaces into a shared embedding space, and finally train a parameter-sharing deep neural network for MSA. The experimental results show that our paradigm is effective. In particular, our CNN model outperforms a state-of-the-art baseline by around 2.1% in terms of classification accuracy.
Introduction

The prevalence of social media has allowed for the collection of abundant subjective multilingual texts. Twitter is a particularly significant multilingual data source that provides researchers with sufficient opinion pieces on various topics from all over the world. An analysis of these multilingual opinion texts can reveal the cultural variations in public opinions from different areas. Therefore, an efficient multilingual sentiment analysis (MSA) method that can process all multilingual texts (mixed monolingual texts) simultaneously is necessary.

There has been substantial research on monolingual sentiment analysis, including sentiment analysis of traditional reviews (product/movie, etc.; Pang et al., 2002; Turney, 2002; Pang and Lee, 2008) and of tweets (Agarwal et al., 2011; Go et al., 2009; Xiang and Zhou, 2014; Mukherjee and Bhattacharyya, 2012). Instead of creating separate models for each language, an MSA method should use a single model (with the same parameters for all languages) to process texts in different languages.

However, compared with monolingual sentiment analysis, research on MSA has progressed slowly. One of the reasons for this is that there is no benchmark dataset that supports the evaluation of MSA methods (particularly their cross-language adaptability). As many previous studies have highlighted, open-source sentiment datasets are imbalanced (Mihalcea et al., 2007; Denecke, 2008; Wan, 2009; Steinberger et al., 2011): there are many freely available annotated sentiment corpora for English, but such corpora for other languages are scarce or even nonexistent. As a compromise, many of the previous multilingual corpora have been built using human/machine translations, which are unrealistic. In this study, we used the MDSU corpus as our training/test dataset (Lu et al., 2017).
The MDSU corpus contains three distinct languages (i.e., English, Japanese, and Chinese) and four identical international topics (i.e., iPhone 6, Windows 8, Vladimir Putin, and Scottish Independence), with 5,422 tweets in total. The multilinguality of the corpus makes it the most suitable training/test dataset for MSA.

Moreover, traditional machine learning methods that are effective in monolingual settings are not necessarily effective in multilingual settings, because they usually require heavy language-specific feature engineering that in turn needs language-specific resources (e.g., polarity lexicons) and tools (e.g., POS taggers and parsers). This prevents the application of many sophisticated monolingual methods to other languages, particularly minor languages that lack basic NLP tools. Until now, the most typically used methods of MSA have been based on machine translation (MT): first, texts in other languages are translated into English, and then machine learning methods are developed based on the expanded English texts.

However, this paradigm is conditioned strongly by the quality of the MT. Considering that our processing objects, tweets, contain many informal expressions, it is even more difficult to guarantee an accurate MT. Therefore, we propose a new deep learning paradigm to integrate the processing of different languages into a unified computation model. First, we pre-train monolingual word embeddings separately; second, we map them from their different spaces into a shared embedding space; and finally, we train a parameter-sharing deep neural network for MSA.
Our paradigm is presented in Figure 1.

Figure 1: MT-Based Paradigm and Deep Learning Paradigm

Although the study by Ruder et al. (2016) is most similar to ours in the use of deep learning methods, there are two fundamental differences. First, they only input raw monolingual word embeddings (an open-source, pre-trained word embedding for English and random word embeddings for the other languages) into their deep learning methods, whereas we used customized pre-trained word embeddings and further transformed them into a shared space. Second, they created separate models for each language, whereas we developed a single parameter-sharing model for all languages. (In this paper, "parameter-sharing" specifically means that the same model parameters are shared between different languages.)

To the best of our knowledge, this study is the first to use a deep learning paradigm for MSA. Moreover, because of the use of such a paradigm, the only resources we required were word embeddings for each language and tokenizers for non-spaced languages (e.g., Chinese). We expected this paradigm to assimilate language differences and take full advantage of the size of multilingual datasets (compared with their smaller monolingual parts). In this study, we employed LSTM and CNN models. Our parameter-sharing CNN model with adjusted word embeddings outperformed the machine-translation-based baseline by nearly 5.3% and the state-of-the-art baseline by 2.1%, thereby proving its effectiveness.

This paper is organized as follows: in Section 2, we discuss related studies; in Section 3, we describe our methods; in Section 4, we present and discuss the results of the experiments; and finally, in Section 5, we draw conclusions.
Related Work

In this section, we introduce MSA-related studies, including those on multilingual subjectivity analysis as well as the MSA of traditional text and of social media.
Sentiment analysis in a multilingual framework was first conducted for subjectivity analysis. Mihalcea et al. (2007) explored the automatic generation of resources (i.e., lexicon translation and corpus projection) for the subjectivity analysis of a new language (i.e., Romanian). They translated the English polarity lexicon into the target language, assessed the quality of the generated lexicon through an annotation study, and proposed a rule-based target-language classifier using the generated lexicon. The results revealed that the translated lexicon was less reliable than the English one, and the performance of the rule-based subjectivity classifier was worse in Romanian than in English. They also conducted a subjectivity annotation on a parallel corpus (English sentences were manually translated into Romanian); the results indicated that in most cases, subjectivity was preserved during the translation. They projected the subjectivity annotations onto the Romanian part to automatically obtain a Romanian subjectivity corpus and trained Naive Bayes (NB) classifiers. The results revealed that the performance of the NB classifiers was worse in Romanian than in English.

Banea et al. (2010) translated the English corpus into other languages (i.e., Romanian, French, English, German, and Spanish) and explored the integration of unigram features from multiple languages into a machine learning approach for subjectivity analysis. They demonstrated that both English and the other languages could benefit from using features from multiple languages. They believed that this was probably because, when one language does not provide sufficient information, another one can serve as a supplement.
Although there is extensive scope for improvement, translation-based methods have inspired many other studies. Research on MSA began relatively late. Denecke (2008) translated German movie reviews into English, developed SentiWordNet-based methods for English movie reviews, and tested the proposed methods on the German corpus. The results revealed that the performance of the proposed methods in MSA was similar to that in monolingual settings. Wan (2009) leveraged a labeled English corpus for Chinese sentiment classification. He first machine translated the labeled English corpora and an unlabeled Chinese corpus into the target languages, and then proposed a co-training approach to use the unlabeled corpora. His experimental results suggested that the co-training approach outperformed standard inductive and transductive classifiers. Steinberger et al. (2011) annotated entity-opinion pairs in a parallel news article corpus in seven European languages: English, Spanish, French, German, Czech, Italian, and Hungarian (they first did the annotation work for English and then projected those annotations onto the other languages). Their simple method, which aggregates word polarities for entity-level sentiment analysis, was tested on the entity-opinion pairs in the parallel corpus. They created a valuable resource for entity-level sentiment analysis in a multilingual setting; however, their method, as they observed, is preliminary and depends substantially on language-specific polarity lexicons.
Recently, the MSA of social media content has been increasing. Balahur and Turchi (2013) conducted an MSA of tweets. They first translated English tweets into four languages, namely Italian, Spanish, French, and German (the texts in the test set were further corrected manually), to create an artificial multilingual corpus. They then tested support vector machine (SVM) classifiers using polarity lexicon-based features on various combinations of the datasets in different languages. The results suggested that the combined use of training data from multiple languages improves the performance of sentiment classification. Volkova et al. (2013) constructed a multilingual tweet dataset in English, Spanish, and Russian using Amazon Mechanical Turk. They explored the lexical variations in subjective expression and the differences in emoticon and hashtag usage by gender in the three languages; their results demonstrated that gender information can be used to improve the performance of sentiment analysis in all three languages.
Our study differs from the previous studies in the following ways. First, in the multilingual datasets of previous studies, the datasets of languages other than English have been projected from the English dataset. Banea et al. (2010) and Balahur and Turchi (2013) used MT to obtain texts in the target languages, which are considerably noisy. Mihalcea et al. (2007) and Denecke (2008) directly used parallel corpora to eliminate this noise. However, real multilingual opinion texts would not be in the form of parallel corpora, because users usually give their opinions in one language. Therefore, the MDSU corpus used in this study, which includes three distant languages and covers common international topics, is useful for testing the multilingual adaptability of a method.

As for methods, Denecke (2008) and Wan (2009) adopted the "MT + machine learning" approach, which unavoidably imports bias during the MT. The abstraction of the word feature in Balahur and Turchi (2013) can be applied to other languages, but it requires language-specific polarity lexicons. Banea et al. (2010) used unigrams in multiple languages as features, but these might be restricted due to data sparseness issues. Volkova et al. (2013) proved the effectiveness of employing gender information, but their classifiers are not designed for multilingual settings. By contrast, our deep learning methods require no polarity lexicons and can unify different languages through a neural text model that uses word embeddings.
Methods

In this section, we introduce our baseline methods and the proposed deep learning methods (i.e., transformed word embeddings + deep learning). The global polarity in the MDSU corpus has three types: positive, negative, and neutral; therefore, our study is technically a three-way classification task.
Baselines

Our first baseline was MT-based. We used Google Translate (https://cloud.google.com/translate/) to translate Japanese/Chinese tweets into English. Google Translate is a paid service that supports more than 100 languages at various levels. For Japanese and Chinese, neural MT technology was enabled, providing more reliable translation results for the baselines.

SVM-based learning methods with n-gram features, proposed by Pang et al. (2002) and Go et al. (2009), have been frequently used as baselines in many monolingual (English) studies. Similar to their settings, we used the default SVM model with a linear kernel and C = 1 and fed the binarized unigram/bigram term frequencies as features. The one-vs-one strategy was adopted for multiclass classification. Following the traditional paradigm, the SVM model trained on all translated tweets in the MDSU corpus is our first baseline, denoted as MT+SVM.

In addition, we re-implemented the NB model of Banea et al. (2010) that uses the cumulation of monolingual unigram features. We fine-tuned their method in two ways: first, we used both unigrams and bigrams as features; and second, we used all the features instead of only parts of them. We denote this state-of-the-art baseline, which uses no language-specific polarity lexicons, as Banea (2010)*.

Transformation of Word Embeddings

Since there are no comparable open-source word embeddings learnt from Twitter data for multiple languages, we independently obtained word embeddings using numerous monolingual texts for each language. However, these monolingual word embeddings were heterogeneous in terms of vector space (the meaning of each dimension differed between languages). Hence, we attempted to reduce the discrepancy between the monolingual word embeddings.

This notion was adopted from Mikolov et al. (2013). In their study, they highlighted that the same concepts have similar geometric arrangements in their respective vector spaces.
This implies that if a matrix transformation is adequately performed, monolingual word embeddings in heterogeneous spaces can be adjusted into a shared vector space. Since then, many other ways to conduct this transformation have been proposed (Ruder, 2017). Following Mikolov et al. (2013), we used the Translation Matrix method to obtain a linear projection between the languages using a set of pivot word pairs.

Assume a set of word pairs {x_i, z_i}_{i=1}^{n}, where x_i and z_i are the vector representations of word i in the source and target languages, respectively. We aimed to identify a translation matrix W_{S→T} that minimizes the following objective function:

    minimize_{W_{S→T}}  Σ_{i=1}^{n} || W_{S→T} x_i - z_i ||^2    (1)

After W_{S→T} was identified, we mapped the vocabulary matrix Z (the matrix that consists of the word embeddings of all words in the training corpus) of one language space to another by computing Ẑ = Z W_{S→T}. For example, we transferred the Japanese vocabulary matrix to the English vector space using Ẑ_J = Z_J W_{J→E}.

In this paper, we developed two translation matrices, W_{J→E} and W_{C→E}, to unify our separately pre-trained monolingual word embeddings into a shared space. We selected the top K most frequent words in the English training corpus as our pivot words, translated them into Japanese and Chinese (using Google Translate), and finally obtained the translation matrices using a linear regression algorithm.

Although the linear projection by the Translation Matrix method can be considered a word-level MT, the space transformation is considerably less expensive than building a full-fledged MT system.
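Estimating the translation matrix of Eq. (1) amounts to an ordinary least-squares problem. A minimal numpy sketch follows, using a row-vector convention (so the mapping is computed as Ẑ = Z W rather than W x); the toy data stands in for real pivot-pair embeddings and is purely an assumption for illustration:

```python
import numpy as np

def learn_translation_matrix(X_src, Z_tgt):
    """Learn W minimizing sum_i ||W x_i - z_i||^2 (Eq. (1)), row-vector form.

    X_src: (n, d) source-language embeddings of the pivot words.
    Z_tgt: (n, d) target-language embeddings of their translations.
    Returns W of shape (d, d) such that X_src @ W approximates Z_tgt.
    """
    W, _, _, _ = np.linalg.lstsq(X_src, Z_tgt, rcond=None)
    return W

# Toy example: 5 pivot pairs in 3-dimensional spaces related by an exact linear map.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # "Japanese" pivot embeddings
W_true = rng.normal(size=(3, 3))
Z = X @ W_true                       # "English" pivot embeddings

W = learn_translation_matrix(X, Z)
Z_hat = X @ W                        # map the source vocabulary into the target space
print(np.allclose(Z_hat, Z))         # -> True (exact recovery in the noise-free toy case)
```

With real embeddings the relation is only approximately linear, so the residual of the least-squares fit is non-zero and the mapped vectors land near, not on, their translations.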
LSTM Model

RNNs have received tremendous attention in the NLP field and have been employed for many tasks, including word/phrase prediction, speech recognition, image caption generation, and MT. Traditional neural networks are stateless, whereas RNNs have the unique property of being "stateful". By reusing the hidden units of the previous time step, RNNs cyclically encode past information within the network.

Let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in a tweet; then, a tweet of n words can be represented as X = (x_1, ..., x_T). At each time step t, the hidden state h_t of the RNN is updated as follows:

    h_t = f(h_{t-1}, x_t)    (2)

where f is a function that takes a signal x_t as input at time step t and updates the current state h_t based on the influence of x_t and the previous state h_{t-1}.

A vanilla RNN only combines the previous hidden state h_{t-1} with the current input x_t, which is not powerful enough to represent a complex context. Thus, we used an LSTM network instead.

The LSTM model introduces a structure called a memory block (see Figure 2). A memory block consists of four main elements: input, output, and forget gates and a self-connected cell. The cell is at the center of the LSTM memory block. Gates can be regarded as water valves, which yield values between 0 and 1, describing how much of each component should be let through. An LSTM memory block has three of these gates to modulate the cell state.

Specifically, the input gate i_t controls the candidate state of the cell C̃_t; the forget gate f_t regulates the previous state of the cell C_{t-1}; and the output gate o_t determines the parts of the cell state C_t to output.

Eqs. (3)-(8) describe how a layer of memory blocks is updated at every time step t.
    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (3)
    C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (4)
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (5)
    C_t = i_t * C̃_t + f_t * C_{t-1}    (6)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (7)
    h_t = o_t * tanh(C_t)    (8)

Figure 2: LSTM Memory Block

where x_t is the input to the memory block layer at time t, W_i, W_c, W_f, W_o, U_i, U_c, U_f, U_o are weight matrices, and b_i, b_c, b_f, b_o are bias vectors. Although LSTM memory blocks have a unique (more complicated) way of computing the hidden state, they use the same network structure as the RNN. The lengths of both the hidden layer and the cell layer of the LSTM take the same value as the dimensionality of the word embeddings.

CNN Model

There have been continual debates on which model, the RNN or the CNN, is more suited for NLP tasks (Yin et al., 2017). Therefore, we use a CNN model for MSA as well. One of the advantages of CNNs is that they have far fewer parameters than fully connected networks with the same number of hidden units, which makes them much easier to train. Our CNN is similar to that of Kim (2014) and is presented in Figure 3.

Figure 3: Network Structure of the CNN Model

As in the RNN, a tweet of n words was represented as follows:

    x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n    (9)

where ⊕ is the concatenation operator. Here, the final index of the word vectors in a tweet is n instead of T. In general, x_{i:i+j} means the concatenation of the words x_i, x_{i+1}, ..., x_{i+j}.

To unify the matrix representation of tweets of different lengths, the maximum length over all tweets in the dataset was used as the fixed size of the tweet matrices. For shorter tweets, zero word vectors were padded at the back of the tweet matrix.

The layers of the CNN are formed by a convolution operation followed by a pooling operation. First, we performed a convolution operation that transforms a window of h words (i.e., x_{i:i+h-1}) into a feature c_i:

    c_i = σ(w · x_{i:i+h-1} + b)    (10)

where w denotes a filter, h is the window size of the filter, σ is a non-linear activation function, and b is a bias term.

By applying the filter w to each possible window of words in a sentence, we obtained a feature map:

    c = [c_1, c_2, ..., c_{n-h+1}]    (11)

Second, we performed a subsampling operation, for which we used the following max-pooling method, based on the idea of capturing the most important feature from each feature map:

    c_max = max{c}    (12)

From Eqs. (10)-(12), each filter generates one c_max from a tweet matrix. The number of feature maps per window size in our CNN model was 100, and three different window sizes were used; thus, our model had 300 different filters in total. The corresponding 300 c_max values formed the penultimate layer, which was then passed to a fully connected softmax layer to predict the global polarity of a tweet.

Experiments

In this section, we compare our deep learning methods with the baseline methods. We first describe our experimental setup, followed by a discussion of the results.
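The convolution and max-pooling of Eqs. (10)-(12) can be sketched in numpy as follows. The ReLU activation and the toy dimensions are assumptions made for illustration (the paper only states that σ is a non-linear activation function):

```python
import numpy as np

def conv_feature_map(X, w, b, h):
    """Eqs. (10)-(11): slide one filter w over every window of h word vectors.

    X: (n, k) tweet matrix (n words, k-dimensional embeddings).
    w: (h * k,) filter weights; b: scalar bias.
    Returns the feature map c of length n - h + 1.
    """
    n = X.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = X[i:i + h].ravel()           # x_{i:i+h-1}, concatenated
        c[i] = np.maximum(w @ window + b, 0)  # sigma = ReLU (an assumption)
    return c

def max_pool(c):
    """Eq. (12): keep only the strongest activation of the feature map."""
    return c.max()

# Toy tweet: 6 words with 4-dimensional embeddings, one filter of window size 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
w = rng.normal(size=(3 * 4,))
c = conv_feature_map(X, w, b=0.1, h=3)
print(c.shape, max_pool(c))  # feature map length is 6 - 3 + 1 = 4
```

In the full model this is repeated for 300 filters, and the 300 pooled values form the penultimate layer fed to the softmax classifier.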
As described in Section 1, we used the MDSU corpus as our training/test dataset. The MDSU corpus was originally built for deeper sentiment understanding in a multilingual setting; therefore, its tweets are annotated with many fine-grained tags in addition to the global (overall) polarity. In this paper, we used the global polarities as the classification labels. Lu et al. (2017) filtered out apparently non-emotional tweets and prioritized long tweets with rich language phenomena during data selection; therefore, the tweets in the MDSU corpus are more complex and longer than those in randomly collected or noisily labeled tweet datasets.

Table 1 presents the global polarity distribution for each language in the MDSU corpus. The polarity distribution of each language, although not perfectly uniform, does not differ largely. Moreover, the polarity distribution of the entire corpus is well balanced, rendering it a suitable corpus for three-way sentiment classification. The length of a tweet is defined as the number of elements (including words, emoticons, and punctuation) after the preprocessing described below. The maximum length (also the fixed size for the CNN models) in the MDSU corpus is 124: 41 for English, 93 for Japanese, and 124 for Chinese.

Table 1: Polarity Distribution for Each Language in the MDSU Corpus

Language  Abbr.  Positive  Neutral  Negative  Total
The language used in social media is more casual than that in traditional media. There are many unique ways of expression on Twitter, such as emoticons, Unicode emojis, misspelled words, letter-repeating words, all-caps words, and special tags. As language-independent preprocessing, we detected emoticons (e.g., "o))))") using regular expressions and replaced them with "EMOTICON" (some rare expressions were registered to an ad hoc list), and we replaced URLs with "URL".

We also performed language-dependent preprocessing. For English, we lowercased the characters and tokenized the tweets with TweetTokenizer; for Japanese, we normalized the characters and tokenized the tweets with MeCab (http://taku910.github.io/mecab/); for Chinese, we converted traditional characters to simplified characters and tokenized the tweets with NLPIR (http://ictclas.nlpir.org/).

In addition to the annotated MDSU corpus, we accumulated large collections of raw tweets through the Twitter RESTful API using the same query keywords over a one-year period. We first excluded undesirable tweets (e.g., tweets starting with "RT") using the same veto patterns as Lu et al. (2017); then, we checked the preceding 10 tweets to delete repeating tweets, because similar tweets usually appear in succession. After filtering out the undesirable tweets, the remaining tweets were preprocessed as described above. The number of remaining tweets was 232,214 (EN), 264,179 (JA), and 148,052 (ZH). The vocabulary size for each collection was 63,343 (EN), 49,575 (JA), and 52,292 (ZH).

Our vector representations for words were learnt using FastText (https://github.com/facebookresearch/fastText). Because the scale of our corpus for word embedding training was relatively small, we set the minimal number of word occurrences to 2. We used the skip-gram model because it generates higher-quality representations for infrequent words (Mikolov et al., 2013). The word embeddings for each language were trained separately on the corresponding corpus.
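The language-independent preprocessing described above can be sketched with the standard library as follows; the regular expressions here are deliberately simplified assumptions (the actual emoticon inventory used in the paper is much richer and backed by an ad hoc list):

```python
import re

# Simplified patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"[:;=]-?[)(]")  # e.g. :) ;-( =(  -- a toy subset

def preprocess(tweet: str) -> str:
    """Normalize URLs and emoticons to placeholder tokens."""
    tweet = tweet.lower()  # note: lowercasing applies to English only in the paper
    tweet = URL_RE.sub("URL", tweet)
    return EMOTICON_RE.sub("EMOTICON", tweet)

print(preprocess("Love my iPhone 6 :) see https://example.com"))
# -> love my iphone 6 EMOTICON see URL
```

Language-dependent steps (tokenization with TweetTokenizer, MeCab, or NLPIR) would follow this normalization.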
Words that were not present in the pre-trained word list were initialized randomly in the deep learning models.

The dimensionality of our word embeddings was 100, and the Japanese/Chinese spaces were transformed by their respective translation matrices. For the translation matrices, we set K to 3500, which means that the top 3500 English words and their translations were used as the pivot word pairs. We split the 3500 pivot word pairs into two sets: a training set (3000 words) and a test set (500 words). The translation matrices were obtained based on the training sets. As a validation, we calculated the change of the Euclidean/cosine distance for each word pair in the test set before and after the mapping; Table 2 depicts the decrease in the sum of the two distances.

Table 2: Sum of Embedding Distances of Word Pairs in the Test Set
Language  Metric              Before Mapping  After Mapping
Japanese  Euclidean Distance
Japanese  Cosine Distance
Chinese   Euclidean Distance
Chinese   Cosine Distance
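The validation above, measuring how the summed Euclidean and cosine distances of a held-out pivot pair change after the mapping, can be sketched as follows; the two-dimensional vectors and the axis-swapping matrix are toy assumptions, not the paper's data:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy test pair: source vector x, its target translation z, and a learned mapping W.
x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # a toy mapping that happens to align x with z exactly

before = euclidean(x, z) + cosine_distance(x, z)
after = euclidean(x @ W, z) + cosine_distance(x @ W, z)
print(before, after)  # the mapping shrinks the summed distance
```

With real embeddings the post-mapping distance does not drop to zero, but a clear decrease over the test pairs indicates that the learned matrix generalizes beyond the training pivots.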
All the methods were tested using 10-fold cross-validation. For the deep learning models, we randomly selected 10% of the training split of each fold as a development set to tune the parameters for early stopping.

For a fair comparison, we empirically set the hyper-parameters of the deep learning models to be as consistent as possible. Both models were trained using a stochastic gradient descent (SGD) algorithm over shuffled mini-batches with the Adadelta update rule and a mini-batch size of 50. The dropout technique is effective in preventing the co-adaptation of hidden units by randomly setting a portion of the hidden units to zero during feedforward/backpropagation. Therefore, to prevent overfitting, we applied dropout in both deep learning models on the penultimate layer (the input to the softmax layer), with a dropout rate of 0.5. The dimensionality of the word embeddings was 100, and the lengths of both the hidden and cell layers of the LSTM were likewise 100.
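The dropout applied to the penultimate layer can be sketched in numpy as follows. The inverted-dropout rescaling (dividing the survivors by the keep probability at training time) is an implementation choice assumed here, not stated in the paper:

```python
import numpy as np

def dropout(x, rate=0.5, train=True, rng=None):
    """Inverted dropout: zero a fraction `rate` of units during training and
    rescale the survivors so the expected activation is unchanged at test time."""
    if not train:
        return x  # no-op at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= rate  # keep each unit with probability 1 - rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(10)  # a toy penultimate-layer activation
out = dropout(h, rate=0.5, rng=rng)
print(out)  # each unit is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At test time the layer is used unmodified, so no rescaling of the weights is needed.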
Table 3 presents the classification accuracies of the baselines. According to Table 3, the average accuracy of the separate SVM classifiers over the original datasets was the same as that over the translated datasets. This shows that, on monolingual datasets, the same method does not necessarily perform worse after the texts have been translated by MT. In addition, the performance of the MT+SVM model (which uses all translated tweets) was worse than the average accuracy of the separate SVM classifiers over the original datasets (53.0% vs. 54.5%), showing the limitation of the traditional paradigm (i.e., "MT + machine learning").

For the classifiers that directly use the cumulation of unigrams and bigrams, both SVM and Banea (2010)* performed better than MT+SVM, by 0.8% and 3.4%, respectively. These increases indicate that the use of cumulated n-grams is effective; although this may cause a data sparseness problem (Banea et al., 2010), it could be mitigated by feature selection.

Table 3: Results of Baselines
Model          Dataset         Feature                       Accuracy
SVM            EN              unigram+bigram                0.529
SVM            JA              unigram+bigram                0.596
SVM            ZH              unigram+bigram                0.509
Average        -               -                             0.545
SVM            EN              unigram+bigram                0.529
SVM            Translated JA   unigram+bigram                0.591
SVM            Translated ZH   unigram+bigram                0.515
Average        -               -                             0.545
MT+SVM         Translated ALL  unigram+bigram                0.530
SVM            ALL             cumulation of unigram+bigram  0.538
Banea (2010)*  ALL             cumulation of unigram+bigram  0.564
Table 4 presents the classification accuracies of the deep learning models; the word embeddings input to the models in this table involved no transformation.

First, our deep learning paradigm performs better than the MT+SVM method (the traditional paradigm). Specifically, the parameter-sharing LSTM and CNN models outperformed the MT+SVM model by 1.2% and 4.3%, respectively. Thus, the deep learning paradigm is more efficient than the traditional paradigm. In addition, the LSTM performed worse than the Banea (2010)* baseline, whereas the CNN excelled it. Thus, the CNN is more suitable for MSA than the LSTM.

We also conducted the learning separately on each language split. The results revealed that the average accuracies of the separate LSTM/CNN classifiers were slightly higher than the accuracies of the mixed case (54.4% vs. 54.2%, and 58.1% vs. 57.3%), implying that the deep learning methods did not improve after using the entire dataset. This was a result of the heterogeneity of the vector spaces of the word embeddings, because the raw word embeddings were learned separately.

Furthermore, we observed that both the MT+LSTM and MT+CNN models (trained on the translated datasets and using only English word embeddings) performed worse than the LSTM and CNN models (trained on the original datasets and using multilingual word embeddings). Ideally, if the JA/ZH tweets were perfectly translated, the performance should have increased. This suggests that the noise introduced by MT outweighs the heterogeneity of the multilingual word embeddings.

Table 4: Results of Deep Learning Models
Model                   Dataset         Accuracy
LSTM                    EN              0.531
LSTM                    JA              0.569
LSTM                    ZH              0.532
Average                 -               0.544
MT+LSTM                 Translated ALL  0.541
Parameter-sharing LSTM  ALL             0.542
CNN                     EN              0.578
CNN                     JA              0.610
CNN                     ZH              0.553
Average                 -               0.581
MT+CNN                  Translated ALL  0.564
Parameter-sharing CNN   ALL             0.573
The unification of the different vector spaces was expected to further improve the deep learning paradigm. Table 5 presents the classification accuracies of the deep learning models before and after the space transformation. According to Table 5, the effects on the LSTM and CNN models were divided. After the space transformation, the accuracy of the LSTM decreased by 0.6%, whereas the accuracy of the CNN increased by 1.4%. This suggests that the same vector space transformation is not necessarily suitable for different kinds of network structures.

Overall, the CNN model fed with transformed word embeddings was the most effective.

Table 5: Results of Deep Learning Models Before and After Space Transformation
Model                   Dataset  Word Embedding  Accuracy
Parameter-sharing LSTM  ALL      Raw (Table 4)   0.542
Parameter-sharing LSTM  ALL      Transformed     0.536
Parameter-sharing CNN   ALL      Raw (Table 4)   0.573
Parameter-sharing CNN   ALL      Transformed     0.587
Conclusion and Future Work
In this paper, we proposed a novel deep learning paradigm for MSA. We mapped monolingual word embeddings into a shared embedding space and used parameter-sharing deep learning models to unify the processing of multiple languages. The tests on a well-balanced tweet sentiment corpus, the MDSU corpus, revealed the effectiveness of our deep learning paradigm. In particular, our CNN model fed with translation matrix-transformed word embeddings achieves a rise of 2.3% compared with the strong Banea (2010)* baseline.

Our paradigm provides great cross-lingual adaptability. Training tweets in any other language can be transferred into vector representations using transformed word embeddings and then combined into the learning process of the deep learning models.

The novelty of our study lies not in the complexity of the network itself, but in the unification of heterogeneous monolingual word embeddings and the parameter-sharing model for multilingual datasets. In the future, we plan to attempt more complex transformation methods and network structures.
References
Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38.

Alexandra Balahur and Marco Turchi. 2013. Improving sentiment analysis in Twitter using multilingual machine translated data. In Proceedings of Recent Advances in Natural Language Processing, pages 49–55.

Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2010. Multilingual subjectivity: Are more languages better? In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 28–36.

Pierre Luc Carrier and Kyunghyun Cho. 2017. LSTM networks for sentiment analysis. http://deeplearning.net/tutorial/lstm.html. [Online; accessed May 10, 2017].

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology.

Kerstin Denecke. 2008. Using SentiWordNet for multilingual sentiment analysis. In Proceedings of the 24th International Conference on Data Engineering Workshop (ICDE 2008), pages 507–512.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report (Stanford), pages 1–6.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Yujie Lu, Kotaro Sakamoto, Hideyuki Shibuki, and Tatsunori Mori. 2017. Construction of a multilingual annotated corpus for deeper sentiment understanding in social media. Journal of Natural Language Processing.

Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), pages 976–983.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv:1309.4168.

Subhabrata Mukherjee and Pushpak Bhattacharyya. 2012. Sentiment analysis in Twitter with lightweight discourse analysis. In Proceedings of the 24th International Conference on Computational Linguistics: Technical Papers (COLING 2012), pages 1847–1864.

Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Now Publishers.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86.

Sebastian Ruder. 2017. A survey of cross-lingual embedding models. arXiv:1706.04902.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. INSIGHT-1 at SemEval-2016 Task 5: Deep learning for multilingual aspect-based sentiment analysis. In Proceedings of SemEval-2016, pages 330–336.

Josef Steinberger, Polina Lenkova, Mijail Kabadjov, Ralf Steinberger, and Erik van der Goot. 2011. Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In Proceedings of Recent Advances in Natural Language Processing, pages 770–775.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 417–424.

Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1815–1827.

Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 235–243.

Haohan Wang and Bhiksha Raj. 2017. On the origin of deep learning. arXiv:1702.07800v4.

Bing Xiang and Liang Zhou. 2014. Improving Twitter sentiment analysis with topic-based mixture modeling and semi-supervised training. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers) (ACL 2014), pages 434–439.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv:1702.01923.