Zero-Shot Cross-Lingual Opinion Target Extraction
Soufian Jebbara and Philipp Cimiano
Semalytix GmbH, Bielefeld, Germany
Semantic Computing Group, CITEC, Bielefeld University, Bielefeld, Germany
soufi[email protected]@cit-ec.uni-bielefeld.de
Abstract
Aspect-based sentiment analysis involves the recognition of so-called opinion target expressions (OTEs). To automatically extract OTEs, supervised learning algorithms are usually employed which are trained on manually annotated corpora. The creation of these corpora is labor-intensive, and sufficiently large datasets are therefore usually only available for a very narrow selection of languages and domains. In this work, we address the lack of available annotated data for specific languages by proposing a zero-shot cross-lingual approach for the extraction of opinion target expressions. We leverage multilingual word embeddings that share a common vector space across various languages and incorporate these into a convolutional neural network architecture for OTE extraction. Our experiments with 5 languages give promising results: We can successfully train a model on annotated data of a source language and perform accurate prediction on a target language without ever using any annotated samples in that target language. Depending on the source and target language pairs, we reach performances in a zero-shot regime of up to 77% of a model trained on target language data. Furthermore, we can increase this performance up to 87% of a baseline model trained on target language data by performing cross-lingual learning from multiple source languages.
Introduction

In recent years, there has been an increasing interest in developing sentiment analysis models that predict sentiment at a more fine-grained level than at the level of a complete document. A paradigm coined as Aspect-based Sentiment Analysis (ABSA) addresses this need by defining the sentiment expressed in a text relative to an opinion target (also called aspect). Consider the following example from a restaurant review:

"Moules were excellent, lobster ravioli was VERY salty!"
In this example, there are two sentiment statements, one positive and one negative. The positive one is indicated by the word "excellent" and is expressed towards the opinion target "Moules". The second, negative sentiment is indicated by the word "salty" and is expressed towards the "lobster ravioli".

A key task within this fine-grained sentiment analysis consists of identifying so-called opinion target expressions (OTEs). To automatically extract OTEs, supervised learning algorithms are usually employed which are trained on manually annotated corpora. In this paper, we are concerned with how to transfer classifiers trained on one domain to another domain. In particular, we focus on the transfer of models across languages to alleviate the need for multilingual training data. We propose a model that is capable of accurate zero-shot cross-lingual OTE extraction, thus reducing the reliance on annotated data for every language. Similar to Upadhyay et al. (2018), our model leverages multilingual word embeddings (Smith et al., 2017; Lample et al., 2018) that share a common vector space across various languages. The shared space allows us to transfer a model trained on source language data to predict OTEs in a target language for which no (i.e. zero-shot setting) or only small amounts of data are available, thus allowing us to apply our model to under-resourced languages.

Our main contributions can be summarized as follows:

• We present the first approach for zero-shot cross-lingual opinion target extraction and achieve up to 87% of the performance of a monolingual baseline.

• We investigate the benefit of using multiple source languages for cross-lingual learning and show that we can improve by 6 to 8 points in F1-score compared to a model trained on a single source language.

• We investigate the benefit of augmenting the zero-shot approach with additional data points from the target language. We observe that we can save hundreds of annotated data points by employing a cross-lingual approach.

• We compare two methods for obtaining cross-lingual word embeddings on the task.
Approach

A common approach for extracting opinion target expressions is to phrase the task as a sequence tagging problem using the well-known IOB scheme (Tjong Kim Sang and Veenstra, 1999) to represent OTEs as a sequence of tags. According to this scheme, each word in our text is marked with one of three tags, namely I, O or B, that indicate if the word is at the Beginning, Inside or Outside of a target expression. An example of such an encoding can be seen below:

The wine list is also really nice .
 O    I    I   O   O     O    O   O
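To make the scheme concrete, a minimal sketch of decoding an IOB1 tag sequence back into target spans might look as follows (the function name and tag handling are our own illustration, not part of the original system):

```python
def iob_to_spans(tags):
    """Decode an IOB1 tag sequence into (start, end) token spans.

    In IOB1, a phrase starts with I; B is only used to mark the
    boundary between two directly adjacent phrases.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            if start is not None:  # close the running phrase
                spans.append((start, i))
                start = None
        elif tag == "B":  # boundary: close the running phrase, open a new one
            if start is not None:
                spans.append((start, i))
            start = i
        else:  # "I": continue or open a phrase
            if start is None:
                start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["The", "wine", "list", "is", "also", "really", "nice", "."]
tags   = ["O",   "I",    "I",   "O",  "O",    "O",      "O",    "O"]
spans = iob_to_spans(tags)  # [(1, 3)] -> the phrase "wine list"
```

Running this on the example above recovers the single opinion target "wine list".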
By rephrasing the task in this way, we can address it using established sequence tagging models. In this work, we use a multi-layer convolutional neural network (CNN) as our sequence tagging model. The model receives a sequence of words as input features and predicts an output sequence of IOB tags. In order to keep our model simple and our results clear, we restrict our input representation to a sequence of word embeddings. While additional features such as Part-of-Speech (POS) tags are known to perform well in the domain of OTE extraction (Toh and Su, 2016; Kumar et al., 2016; Jebbara and Cimiano, 2016), they would require a separately trained model for POS-tag prediction which cannot be assumed to be available for every language. We refrain from using more complex architectures such as memory networks as our goal is mainly to investigate the possibility of performing zero-shot cross-lingual transfer learning for OTE prediction. Being the first approach proposing this, we leave the question of how to increase the performance of the approach by using more complex architectures to future work. (Note that the B tag is only used to indicate the boundary of two consecutive phrases.)

In the following, we describe our monolingual CNN model for OTE extraction which we use as our baseline model. Afterwards, we show how we adapt this model for a cross-lingual and even zero-shot regime.

Our monolingual baseline model consists of a word embedding layer, a stack of convolution layers, and a standard feed-forward layer followed by a final output layer. Formally, the word sequence $w = (w_1, \ldots, w_n)$ is passed to the word embedding layer that maps each word $w_i$ to its embedding vector $x_i$ using an embedding matrix $W$. The sequence of word embedding vectors $x = (x_1, \ldots, x_n)$ is processed by a stack of $L$ convolutional layers, each with a kernel width of $l_{conv}$, $d_{conv}$ filter maps and a ReLU activation function $f$ (Nair and Hinton, 2010).
The final output of these convolution layers is a sequence of abstract representations $h^L = (h^L_1, \ldots, h^L_n)$ that incorporate the immediate context of each word by means of the learned convolution operations. (The input sequences are padded with zeros to allow the application of the convolution operations to the edge words.) The hidden states $h^L_i$ of the last convolution layer are processed by a regular feed-forward layer to further increase the model's capacity, and the resulting sequence is passed to the output layer. In a last step, each hidden state is projected to a probability distribution over all possible output tags $q_i = (q^B_i, q^I_i, q^O_i)$ using a standard feed-forward layer with weights $W_{tag}$, bias $b_{tag}$ and a softmax activation function.

Since the prediction of each tag can be interpreted as a classification, the network is trained to minimize the categorical cross-entropy between the expected tag distribution $p_i$ and the predicted tag distribution $q_i$ of each word $i$:

$$H(p_i, q_i) = -\sum_{t \in T} p^t_i \log(q^t_i),$$

where $T = \{I, O, B\}$ is the set of IOB tags, $p^t_i \in \{0, 1\}$ is the expected probability of tag $t$ and $q^t_i \in [0, 1]$ the predicted probability. Figure 1 depicts the sequence labeling architecture.

[Figure 1: Model for sequence tagging using convolution operations. For simplicity, we only show a single convolution operation. The gray boxes depict padding vectors. The layers inside the dashed box are shared across multiple languages.]
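As a small worked example, the per-token cross-entropy loss described above can be computed directly. The function below is a plain-Python sketch for illustration (not the paper's training code), with the tag order (I, O, B) assumed:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_t p_t * log(q_t) over the three IOB tags.

    p: expected (one-hot) distribution, q: predicted softmax distribution.
    Terms with p_t == 0 contribute nothing, so they are skipped.
    """
    return -sum(p_t * math.log(q_t) for p_t, q_t in zip(p, q) if p_t > 0)

# Gold tag is I (one-hot); the model assigns it probability 0.7,
# so the loss is -log(0.7), roughly 0.357.
loss = cross_entropy([1.0, 0.0, 0.0], [0.7, 0.2, 0.1])

# The loss for a whole sentence is the sum of these per-token losses.
```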
Our cross-lingual model works purely with cross-lingual embeddings that have been trained on monolingual datasets and in a second step have been aligned across languages. In fact, the embeddings are pre-computed in an offline fashion and are not adapted while training the convolutional network on data from a specific language. As the inputs to the convolutional network are only the cross-lingual embeddings, the network can be applied to any language for which the embeddings have been aligned. Since the word embeddings for source and target language share a common vector space, the shared parts of the target language model are able to process data samples from the completely unseen target language and perform accurate prediction, i.e. enabling zero-shot cross-lingual extraction of opinion target expressions.

We rely on two approaches to compute embeddings that are aligned across languages. Both methods rely on fastText (Bojanowski et al., 2017) to compute monolingual embeddings trained on Wikipedia articles. The first method is the one proposed by Smith et al. (2017), which computes a singular value decomposition (SVD) on a dictionary of translated word pairs to obtain an optimal, orthogonal projection matrix from one space into the other. We refer to this method as SVD-aligned (obtained from https://github.com/Babylonpartners/fastText_multilingual). We use these embeddings in our experiments in Sections 3.3, 3.4 and 3.6. The second method, proposed by Lample et al. (2018), performs the alignment of embeddings across languages in an unsupervised fashion, without requiring translation pairs. The approach uses adversarial training to initialize the cross-lingual mapping and a synthetically generated bilingual dictionary to fine-tune it with the Procrustes algorithm (Schönemann, 1966). We refer to the multilingual embeddings from Lample et al. (2018) as ADV-aligned. These are used in Section 3.5.
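The SVD-based alignment step can be illustrated with a short numpy sketch. Given embedding matrices X and Y whose rows are translation pairs, the orthogonal map W minimizing the Frobenius norm of XW − Y is W = UVᵀ, where USVᵀ is the SVD of XᵀY (the classical orthogonal Procrustes solution). The function below is our own illustration of this idea, not the authors' released code:

```python
import numpy as np

def procrustes_alignment(X, Y):
    """Orthogonal matrix W minimizing ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays holding source- and target-language
    embeddings of a bilingual dictionary; row i is a translation pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: if Y is an exact rotation of X, that rotation is recovered.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # a random orthogonal map
X = rng.standard_normal((200, 10))
W = procrustes_alignment(X, X @ Q)  # W recovers Q (up to numerical error)
```

Mapping every source-language embedding through W places it in the target space, after which a shared model such as the tagger above can operate on both languages.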
Experiments

In this section, we investigate the proposed zero-shot cross-lingual approach and evaluate it on the widely used dataset of Task 5 of the SemEval 2016 workshop. With our evaluation, we answer the following research questions:

RQ1: To what degree is the model capable of performing OTE extraction for unseen languages?

RQ2: Is there a benefit in training on more than one source language?

RQ3: What improvements can be expected when a small amount of samples for the target language are available?

RQ4: How big is the impact of the used alignment method on the OTE extraction performance?

Before we answer these questions, we give a brief overview of the used datasets and resources.
As part of Task 5 of the SemEval 2016 workshop (Pontiki et al., 2016), a collection of datasets for aspect-based sentiment analysis on various languages and domains was published. Due to its relatively large number of samples and high coverage of languages and domains, the datasets are commonly used to evaluate ABSA approaches. To answer our research questions, we make use of a selection of the available datasets. We evaluate our cross-lingual approach on the available datasets for the restaurant domain for the 5 languages Dutch (nl), English (en), Russian (ru), Spanish (es) and Turkish (tr). (The ADV-aligned embeddings were obtained from https://github.com/facebookresearch/MUSE. We tried to include the dataset of French reviews in our evaluation, but the provided download script no longer works.) Table 1 gives a brief overview of the used datasets.

Dataset     Sentences  Tokens  Targets
en (train)  2000       29278   1880
en (test)   676        10080   650
es (train)  2070       36164   1937
es (test)   881        13290   731
nl (train)  1722       24981   1283
nl (test)   575        7690    394
ru (train)  3655       53734   3159
ru (test)   1209       17856   972
tr (train)  1232       12702   1385
tr (test)   144        1360    159

Table 1: Statistics of the SemEval 2016 ABSA dataset for the restaurant domain.
In all our experiments, we report F1-scores for the extracted opinion target expressions, computed on exact matches of the character spans as in the original SemEval task (Pontiki et al., 2016).

As described in Section 2.2, our model relies on pretrained multilingual embeddings. For both SVD-aligned and ADV-aligned, we use the embeddings as provided by the original authors. However, we restrict our vocabulary to the most frequent 50,000 words per language (as appearing in the respective embedding files) to reduce memory consumption.

For all experiments, we fix our model architecture to 5 convolution layers, each having a kernel size of 3, a dimensionality of 300 units and a ReLU activation function (Nair and Hinton, 2010). The penultimate feed-forward layer has 300 dimensions and a ReLU activation as well. We apply dropout (Srivastava et al., 2014) on the word embedding layer with a rate of 0.3 and between all other layers with 0.5. The word embeddings and the penultimate layer are L1-regularized (Ng, 2004). The network's parameters are optimized using the stochastic optimization technique Adam (Kingma and Ba, 2015). We optimize the number of training epochs for each model using early stopping (Caruana et al., 2000) but do not tune other hyperparameters of our models. We always pick 20% of our available training data for the validation process. For the zero-shot scenario, this entails that we optimize the number of epochs on the source language and not on the target language to simulate true zero-shot learning.
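The exact-match scoring described above can be sketched as follows; spans are compared as (sentence id, character start, character end) triples, and the helper below is our own simplified illustration of the SemEval-style evaluation:

```python
def exact_match_f1(pred_spans, gold_spans):
    """F1-score over exact character-span matches.

    pred_spans / gold_spans: iterables of (sentence_id, start, end) triples.
    A predicted target counts as correct only if its span matches exactly.
    """
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two predictions matches a gold span exactly: P = R = F1 = 0.5
f1 = exact_match_f1({(0, 0, 6), (0, 10, 25)}, {(0, 0, 6), (0, 30, 35)})
```

Note that a prediction overlapping a gold target only partially counts as both a false positive and a false negative under this metric.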
In this section, we present our evaluation for zero-shot learning. We first examine a setting with a single source language. Then, we evaluate the effect of cross-lingual learning from multiple source languages.
Single Source Language
This part of our evaluation addresses our first research question:

RQ1: To what degree is the model capable of performing OTE extraction for unseen languages?

To answer this question, we perform a set of experiments in the zero-shot setting. We train a model on the training portion of a source language and evaluate the model performance on all possible target languages. Figure 2 shows the obtained scores. The reported results are averaged over 10 runs with different random seeds. The main diagonal represents results of models both trained and tested on target language data. We consider these our monolingual baselines.

[Figure 2: Zero-shot F1-scores for cross-lingual learning from a single source to a target language.]

In general, the proposed approach achieves relatively high scores for some language pairs, although with large performance differences depending on the exact source and target language pairs. Looking at the absolute scores, the best performing cross-lingual language pair is en → es with an F1-score of 0.50. This is followed by en → nl at 0.46. The lowest is es → tr with an F1-score of 0.14. When considering the results relative to their respective monolingual baselines, the highest relative performance is achieved by en → nl at 77% of a nl → nl model, followed by en → es and ru → nl, which both reach about 74%. The weakest performing language pair is still es → tr at 29% relative performance. In general, the Turkish language seems to benefit the least from the cross-lingual transfer learning, while Russian is on average the best source language in terms of relative performance achievement for the target languages. Overall, the presented results show that it is in fact possible for most considered languages to train a model for OTE extraction without ever using any annotated data in that target language.

Multiple Source Languages

In the next experiment, we want to address our second research question:

RQ2: Is there a benefit in training on more than one source language?

As we explained in Section 2.2, our approach allows us to train and test on any number of source and target languages, provided that we have aligned word embeddings for each considered language. In order to answer our second research question, we train a model on the available training data for all but one language and perform prediction on the test data for the left-out language. The results for these experiments are summarized in Table 2.

target               en    es    nl    ru    tr
best → target        0.45  0.50  0.46  0.37  0.30
all others → target  0.52  0.58  0.53  0.43  0.27
target → target      0.66  0.68  0.60  0.56  0.48

Table 2: Zero-shot results for cross-lingual learning from multiple source languages to a target language. The row best → target represents the best performing cross-lingual model from Figure 2 for each target language. all others → target are the results for training on all languages except for the target language. target → target shows the monolingual scores that act as a baseline.

We can see that all languages with the exception of Turkish seem to profit from a cross-lingual transfer setting with multiple source languages. The absolute improvements are in the range of 6 to 8 points in F1-score, while the performance on Turkish samples drops by 3 points. We can summarize that we obtain substantial improvements for most languages when training on a combination of multiple source languages. In fact, for en, es, nl and ru, the results of our cross-lingual models trained on all other languages reach between 78% and 87% relative performance of a model trained with target language data.
While our goal is to reduce the effort of annotating huge amounts of data in a target language to which the model is to be transferred, it might still be reasonable to provide a few annotated samples for a target language. Our next research question addresses this issue:

RQ3: What improvements can be expected when a small amount of samples for the target language are available?

We answer this question by training our models jointly on a source language dataset as well as a small amount of target language samples and compare this to a baseline model that only uses target language samples. By gradually increasing the available target samples, we can directly observe their benefit on the test performance. Figure 3 shows a visualization for the source language en and the target languages es, nl, ru, and tr.

[Figure 3: Cross-lingual results for increasing numbers of training samples from the target language.]

We can immediately see that a monolingual model requires at least 100 target samples to produce meaningful results, as opposed to a cross-lingual model that performs well with source language samples alone. Training on increasing amounts of target samples improves the model performances monotonically for each target language, and the model leveraging the bilingual data consistently outperforms the monolingual baseline model. The benefits of the source language data are especially pronounced when very few target samples are available, i.e. less than 200. As an example, a model trained on bilingual data using all available English samples and 200 Dutch samples is competitive with a monolingual model trained on 1000 Dutch samples (0.55 vs. 0.56).

As one would expect, the results in Table 2 and Figure 3 suggest that training the model on more data samples leads to a better performance. Since our model can leverage the data from all languages simultaneously, we can exhaust our resources and train an instance of our model that has access to all training data samples from all languages, including the target training data. This is reflected by the dashed line in Figure 3. We see, however, that the model cannot leverage the other source languages beyond what it achieves with the combination of the full target and English language data alone.
The previous experiments show that we can achieve good performance in a cross-lingual setting for OTE extraction using the multilingual word embeddings proposed by Smith et al. (2017). Now we address our final research question:

RQ4: How big is the impact of the used alignment method on the OTE extraction performance?

With our final research question, we compare our previous results to an alternative method of aligning word embeddings in multiple languages. We repeat our experiments in Section 3.3 using the embeddings of Lample et al. (2018), which we refer to as ADV-aligned. To enable a direct comparison to the zero-shot results in Section 3.3, we report absolute differences in F1-score to the scores obtained with SVD-aligned for all source and target language combinations.

[Figure 4: Zero-shot results comparing the multilingual embeddings ADV-aligned to SVD-aligned. A positive value means a higher absolute F1-score for ADV-aligned and vice versa. For readability, score differences are scaled by a factor of 100.]

As can be seen in Figure 4, both methods perform well overall, albeit differently for specific language pairs. In a monolingual setting (i.e. the main diagonal), ADV-aligned performs slightly worse than SVD-aligned, with the exception of en → en. Using ADV-aligned, Spanish appears to be a more effective source language than using SVD-aligned, as the average performance is about 2.9 points higher. It can also be observed that the cross-lingual transfer learning works better for English as a target language using ADV-aligned, since the average performance is about 2.2 points higher than for SVD-aligned. The opposite is true for Dutch as a target language, which shows a reduction in performance by 2.1 points on average. Overall, for 13 of the 25 language pairs, the embeddings based on SVD-aligned perform better than embeddings aligned with ADV-aligned.

In this last part of our evaluation, we want to put our work into the perspective of prior systems for opinion target extraction on the SemEval 2016 restaurant datasets. We report results for our multilingual model that is trained on the combined training data of all languages and evaluated on the corresponding test datasets. We compare our model to the respective state-of-the-art for each language in Table 3.

System                       en     es     nl     ru     tr
Toh and Su (2016)            0.723  –      –      –      –
Àlvarez-López et al. (2016)  –      –
Pontiki et al. (2016)*       0.441  0.520  0.506  0.493  0.419
Li and Lam (2017)            –      –      –      –
all → target (Ours)          0.660  0.687  0.624

Table 3: Overview of the current state-of-the-art for opinion target extraction for 5 languages. Our model is trained on the combined training data of all languages and evaluated on the respective test datasets. The row marked with * is the baseline provided by the workshop organizers. To our knowledge, no better model is published for Russian and Turkish.

We can see that the competition is strongest for English, where we fall behind recent monolingual systems. This corresponds to rank 7 of 19 of the original SemEval competition. Regarding the other languages, we see that we are close to the best Spanish and Dutch systems and even clearly outperform systems for Russian and Turkish by at least 7 points in F1-score. With that, we present the first approach on this task to achieve such competitive performances for a variety of languages with a single, multilingual model.
Discussion

The presented experiments shed light on the performance of our proposed approach under various circumstances. In the following, we want to discuss its limitations and consider explanations for performance differences of different language pairs.
Model Limitations
The core of our proposed sequence labeling approach consists of aligned word embeddings and shared CNN layers. Due to the limited context of a CNN layer, the model can only base its decisions for each word on the local information around that word. In many cases, this information is sufficient, since most opinion target expressions are adjective-noun phrases which are well enough identified by the local context for most considered languages. (90% of OTEs in the English dataset consist of zero or more adjectives followed by at least one noun.) As future work, it is worth investigating how far our findings translate to more complex model architectures that have been proposed for OTE extraction, such as memory networks or attention-based models.

Language Characteristics

Due to the inherent variability of natural languages and of the used datasets, it is difficult to identify the exact reasons for the observed performance differences between language pairs. However, we suspect that language features such as word order, inflection, or agglutination affect the compatibility of languages. As an example, Turkish is considered a highly agglutinative language, that is, complex words are composed by attaching several suffixes to a word stem. This sets it apart from the other 4 languages. This language feature might present a difficulty for our approach, since the appending of suffixes is not optimally reflected in the tokenization process and the used word embeddings. An approach that performs alignment of languages on subword units might alleviate this problem and lead to performance gains for language pairs with similar inflection rules.

Syntactic regularities such as word order might also play a role in our transfer learning approach. It is reasonable to assume that the CNN layers of our approach pick up patterns in the word order of a source language that are indicative of an opinion target expression, e.g. "the [NOUN] is good". When applying such a model to a target language with drastically different word order regularities, these patterns might not appear as such in the target language. For the considered languages, we see the following characteristics: where English and Spanish are generally considered to follow a Subject-Verb-Object (SVO) order, Dutch largely exhibits a combination of SOV and SVO cases. Turkish and Russian are overall flexible in their word order and allow a variety of syntactic structures. In the case of Turkish, its morphological and syntactic features seem to explain some of the relatively low results. However, with the small sample of languages and the many potential influencing factors at play, we are aware that it is not possible to draw any strong conclusions. Further research has to be conducted in this direction to answer open questions.
Related Work

Our work brings together the domains of opinion target extraction on the one side and cross-lingual learning on the other side. In this section, we give a brief overview of both domains and point out parallels to previous work.
Opinion Target Extraction
San Vicente et al. (2015) present a system that addresses opinion target extraction as a sequence labeling problem based on a perceptron algorithm with token, word shape and clustering-based features.

Toh and Wang (2014) propose a Conditional Random Field (CRF) as a sequence labeling model that includes a variety of features such as Part-of-Speech (POS) tags and dependency tree features, word clusters and features derived from the WordNet taxonomy. The model was later improved using neural network output probabilities (Toh and Su, 2016) and achieved the best results on the SemEval 2016 dataset for English restaurant reviews.

Jakob and Gurevych (2010) follow a very similar approach that addresses opinion target extraction as a sequence labeling problem using CRFs. Their approach includes features derived from words, Part-of-Speech tags and dependency paths, and performs well in a single- and cross-domain setting.

Kumar et al. (2016) present a CRF-based model that makes use of a variety of morphological and linguistic features and is one of the few systems that submitted results for more than one language for the SemEval 2016 ABSA challenge. The strong reliance on high-level NLP features, such as dependency trees, named-entity information and WordNet features, restricts its wide applicability to resource-poor languages.

Among neural network models, Poria et al. (2016) and Jebbara and Cimiano (2016) use deep convolutional neural networks (CNNs) with Part-of-Speech (POS) tag features. Poria et al. (2016) also extend their base model using linguistic rules. Wang et al. (2017) use coupled multi-layer attentions to extract opinion expressions and opinion targets jointly. This approach, however, relies on additional annotations for opinion expressions alongside annotations for the opinion targets. Li and Lam (2017) propose two LSTMs with memory interaction to detect aspect and opinion terms. In order to generate opinion expression annotations for the SemEval dataset, a sentiment lexicon is used in combination with high-precision dependency rules.

For a more comprehensive overview of ABSA and OTE extraction approaches, we refer to Pontiki et al. (2016).
Cross-Lingual and Zero-Shot Learning for Sequence Labelling
With the CLOpinionMiner, Zhou et al. (2015) present a method for cross-lingual opinion target extraction that relies on machine translation. The approach derives an annotated dataset for a target language by translating the annotated source language data. Part-of-Speech tags and dependency path features are projected into the translated data using the word alignment information of the translation algorithm. The approach is evaluated for English to Chinese reviews. A drawback of the presented method is that it requires access to a strong machine translation algorithm for source to target language that also provides word alignment information. Additionally, it builds upon NLP resources that are not available for many potential target languages.

Addressing the task of zero-shot spoken language understanding (SLU), Upadhyay et al. (2018) follow a similar approach as our work. They use the aligned embeddings from Smith et al. (2017) in combination with a bidirectional RNN and target zero-shot SLU for Hindi and Turkish.

Overall, our work differs from the related work by presenting a simple model for the zero-shot extraction of opinion target expressions. By using no annotated target data or elaborate NLP resources, such as Part-of-Speech taggers or dependency parsers, our approach is easily applicable to many resource-poor languages.
Conclusion

In this work, we presented a method for cross-lingual and zero-shot extraction of opinion target expressions which we evaluated on 5 languages. Our approach uses multilingual word embeddings that are aligned into a single vector space to allow for cross-lingual transfer of models.

Using English as a source language in a zero-shot setting, our approach was able to reach an F1-score of 0.50 for Spanish and 0.46 for Dutch. This corresponds to relative performances of 74% and 77% compared to a baseline system trained on target language data. By using multiple source languages, we increased the zero-shot performance to F1-scores of 0.58 and 0.53, respectively, which correspond to 85% and 87% in relative terms. We investigated the benefit of augmenting the zero-shot approach with additional data points from the target language. Here, we observed that we can save several hundreds of annotated data points by employing a cross-lingual approach. Among the 5 considered languages, Turkish seemed to benefit the least from cross-lingual learning in all experiments. The reason for this might be that Turkish is the only agglutinative language in the dataset. Further, we compared two approaches for aligning multilingual word embeddings in a single vector space and found their results to vary for individual language pairs but to be comparable overall. Lastly, we compared our multilingual model with the state-of-the-art for all languages and saw that we achieve competitive performances for some languages and even present the best system for Russian and Turkish.

Acknowledgement
This work was supported in part by the H2020 project Prêt-à-LLOD under Grant Agreement number 825182.
References
Tamara Àlvarez-López, Jonathan Juncal-Martínez, Milagros Fernández Gavilanes, Enrique Costa-Montenegro, and Francisco Javier González-Castaño. 2016. GTI at SemEval-2016 Task 5: SVM and CRF for Aspect Detection and Unsupervised Aspect-Based Sentiment Analysis. In SemEval@NAACL-HLT, pages 306–311. The Association for Computer Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Rich Caruana, Steve Lawrence, and C. Lee Giles. 2000. Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping. In NIPS, pages 402–408. MIT Press.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1035–1045.

Soufian Jebbara and Philipp Cimiano. 2016. Aspect-Based Relational Sentiment Analysis Using a Stacked Neural Network Architecture. In ECAI 2016 - 22nd European Conference on Artificial Intelligence, 29 August-2 September 2016, The Hague, The Netherlands - Including Prestigious Applications of Artificial Intelligence (PAIS 2016), pages 1123–1131.

Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations.

Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, and Chris Biemann. 2016. IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis. In SemEval@NAACL-HLT, pages 1129–1135. The Association for Computer Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Xin Li and Wai Lam. 2017. Deep Multi-Task Learning for Aspect Term Extraction with Memory Interaction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2886–2892. Association for Computational Linguistics.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML).
Proceedings of the 27th InternationalConference on International Conference on Ma-chine Learning , ICML’10, pages 807–814, USA.Omnipress.Andrew Y. Ng. 2004. Feature Selection, L1 vs. L2Regularization, and Rotational Invariance. In
Pro-ceedings of the Twenty-first International Confer-ence on Machine Learning , ICML ’04, pages 78–,New York, NY, USA. ACM.Maria Pontiki, Dimitris Galanis, Haris Papageor-giou, Ion Androutsopoulos, Suresh Manandhar, Mo-hammad Al-Smadi, Mahmoud Al-Ayyoub, YanyanZhao, Bing Qin, Orph´ee De Clercq, V´eroniqueHoste, Marianna Apidianaki, Xavier Tannier, Na-talia V. Loukachevitch, Evgeniy Kotelnikov, N´uriaBel, Salud Mar´ıa Jim´enez Zafra, and G¨ulsen Eryigit.2016. SemEval-2016 Task 5: Aspect Based Sen-timent Analysis. In
Proceedings of the 10thInternational Workshop on Semantic Evaluation,SemEval@NAACL-HLT 2016, San Diego, CA, USA,June 16-17, 2016 , pages 19–30.Soujanya Poria, Erik Cambria, and Alexander Gel-bukh. 2016. Aspect Extraction for Opinion Min-ingwith a Deep Convolutional Neural Network.
Knowledge-Based Systems , 108:42–49.I˜naki San Vicente, Xabier Saralegi, and RodrigoAgerri. 2015. Elixa: A modular and flexible ABSAplatform. In
Proceedings of the 9th InternationalWorkshop on Semantic Evaluation , pages 748–752,Denver, Colorado. Association for ComputationalLinguistics.Peter H. Sch¨onemann. 1966. A generalized solution ofthe orthogonal procrustes problem.
Psychometrika ,31(1):1–10.Samuel L. Smith, David H. P. Turban, Steven Hamblin,and Nils Y. Hammerla. 2017. Offline bilingual wordvectors, orthogonal transformations and the invertedsoftmax. In
International Conference on LearningRepresentations .Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,Ilya Sutskever, and Ruslan Salakhutdinov. 2014.Dropout: A Simple Way to Prevent Neural Networksfrom Overfitting.
Journal of Machine Learning Re-search , 15:1929–1958.Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Rep-resenting text chunks. In
Proceedings of Euro-pean Chapter of the ACL (EACL) , pages 173–179.Bergen, Norway.hiqiang Toh and Jian Su. 2016. NLANGP atSemEval-2016 Task 5: Improving Aspect BasedSentiment Analysis using Neural Network Features.In
Proceedings of the 10th International Work-shop on Semantic Evaluation, SemEval@NAACL-HLT 2016 , volume 2015, pages 282–288.Zhiqiang Toh and Wenting Wang. 2014. DLIREC: As-pect Term Extraction and Term Polarity Classifica-tion System. In
Proceedings of the 8th InternationalWorkshop on Semantic Evaluation , pages 235–240.Shyam Upadhyay, Manaal Faruqui, Gokhan Tur, DilekHakkani-Tur, and Larry Heck. 2018. (Almost) Zero-Shot Cross-Lingual Spoken Language Understand-ing. In
Proceedings of the 2018 IEEE InternationalConference on Acoustics, Speech and Signal Pro-cessing (ICASSP) .Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier,and Xiaokui Xiao. 2017. Coupled Multi-Layer At-tentions for Co-Extraction of Aspect and OpinionTerms. In
Proceedings of the Thirty-First AAAIConference on Artificial Intelligence, February 4-9,2017, San Francisco, California, USA. , pages 3316–3322.Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2015.CLOpinionMiner: Opinion Target Extraction in aCross-Language Scenario.