Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization
Wataru Hirota*, Yoshihiko Suhara, Behzad Golshan, Wang-Chiew Tan
Osaka University, Megagon
{yoshi, behzad, wangchiew}@megagon.ai

Abstract
We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.

Introduction
Learning multilingual sentence representations (Ruder et al. 2019) is a key technique for building NLP applications with multilingual support. A primary advantage of multilingual sentence embeddings is that they enable us to train a single classifier on a single language (e.g., English) and then apply it to other languages (e.g., German) without training models for those languages. Furthermore, recent multilingual sentence embedding techniques (Artetxe and Schwenk 2019b; Chidambaram et al. 2019) have been shown to exhibit competitive performance on several downstream NLP tasks compared to the two-stage approach that relies on machine translation followed by monolingual sentence embedding techniques.

The main challenge of multilingual sentence embeddings is that they are sensitive to textual similarity (textual similarity bias), which negatively affects the semantic similarity of sentence embeddings (Zhu, Li, and de Melo 2018). The following example illustrates this point:

S1: What time is the pool open tonight?
S2: What time are the stores on 5th open tonight?
S3: When does the pool open this evening?

S1 and S3 have similar intents: they ask for the opening hours of the pool in the evening. S2 has a different intent: it asks about the opening hours of stores. We expect embeddings of sentences with the same intent to be closer (e.g., to have higher cosine similarity) to one another than embeddings of sentences with different intents.

* This work was done during an internship at Megagon Labs.

We tested several pre-trained (multilingual) sentence embedding models (Pagliardini, Gupta, and Jaggi 2018; Conneau et al. 2017; Artetxe and Schwenk 2019b; Chidambaram et al. 2019) in both monolingual and cross-lingual settings. Somewhat surprisingly, every model provided lower similarity scores between S1 and S3 (compared to S1 and S2, or S2 and S3). This is mainly because S1 and S2 are more textually similar (both sentences contain “what time” and “tonight”) than S1 and S3. This example highlights that general-purpose multilingual sentence embeddings exhibit textual similarity bias, which is a fundamental limitation as they may not correctly capture the semantic similarity of sentences.

Motivated by the need for sentence embeddings that better reflect the semantics of sentences, we examine multilingual semantic specialization, which tailors pre-trained multilingual sentence embeddings to handle semantic similarity. Although prior work has developed semantic specialization methods for word embeddings (Mrkšić et al. 2017) and has studied semantic and linguistic properties of sentence embeddings (Zhu, Li, and de Melo 2018; Conneau et al. 2018a), no prior work has considered semantic specialization of multilingual sentence embeddings.

In this paper, we develop a “lightweight” approach for semantic specialization of multilingual embeddings that can be applied to any base model.
Our approach fine-tunes a pre-trained multilingual sentence embedding model based on a classification task that considers semantic similarity. This aligns with common pre-training techniques for NLP (Howard and Ruder 2018; Peters et al. 2018; Devlin et al. 2019). We explore several loss functions to determine which is appropriate for the semantic specialization of cross-lingual sentence embeddings. We found that naive choices of loss functions such as the softmax loss, which is a common choice for classification, may significantly degrade the original multilingual sentence embedding model.

We also design Emu to specialize multilingual sentence embeddings using only monolingual training data, as it is expensive to collect parallel training data in multiple languages. Our solution incorporates language adversarial training to enhance the multilinguality of sentence embeddings. Specifically, we implemented a language discriminator that tries to identify the language of an input sentence given its embedding, and we optimize the multilingual sentence embeddings to confuse the language discriminator.

We conducted experiments on three cross-lingual intent classification tasks that involve 6 languages. The results show that Emu successfully specializes the state-of-the-art multilingual sentence embedding technique, namely LASER, using only monolingual training data together with unlabeled data in other languages. It outperforms the original LASER model and monolingual sentence embeddings with machine translation by up to 47.7% and 86.2% respectively.

The contributions of the paper are as follows:
• We developed Emu, a system that semantically enhances pre-trained multilingual sentence embeddings. Emu incorporates multilingual adversarial training on top of fine-tuning to enhance multilinguality without using parallel sentences.
• We experimented with several loss functions and show that two loss functions, namely L2-constrained softmax loss and center loss, outperform common loss functions used for fine-tuning.
• We show that Emu successfully specializes multilingual sentence embeddings using only monolingual labeled data.

Multilingual Semantic Specialization
The architecture of Emu is depicted in Figure 1. There are three main components, which we detail next: multilingual encoder E, semantic classifier C, and language discriminator D. The solid lines show the flow of the forward propagation for fine-tuning C and E, and the dotted lines show that for D. These arrows are reversed during backpropagation. The semantic classifier and language discriminator are only used for fine-tuning.

After fine-tuning, Emu uses the fine-tuned multilingual encoder to obtain sentence embeddings for input sentences. More specifically, we expect the embeddings of two related sentences, in any languages, to be close to each other under a similarity measure (e.g., cosine similarity). We consider cosine similarity as it is the most common choice and can be calculated efficiently (Wang et al. 2017).

Multilingual Encoder
A multilingual encoder is a language-agnostic sentence encoder that converts sentences in any language into embedding vectors in a common space. Emu is flexible with the choice of multilingual encoders and their architectures. The only requirement of this component is that it encodes a sentence in any language into a sentence embedding. Our code is available at https://github.com/megagonlabs/emu.

Figure 1: Architecture of Emu. Labels in the diagram: multilingual encoder E, semantic classifier C, language discriminator D, input sentences (including a sentence in language t), semantic label, language score.

In this paper, we use LASER (Artetxe and Schwenk 2019b) as a base multilingual sentence embedding model. LASER is a multilingual sentence embedding model that covers more than 93 languages with more than 23 different alphabets. It is an encoder-decoder model that shares the same BiLSTM encoder with max-pooling across languages and uses byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016) to accept sentences in any language as input. The model is trained on a set of bilingual translation tasks and has been shown to achieve state-of-the-art performance on cross-lingual NLP tasks including bitext mining. We use LASER instead of multilingual BERT models (Devlin et al. 2019) because (1) LASER outperformed the BERT model on the XNLI task (Artetxe and Schwenk 2019b) and (2) a LASER model can be used as a sentence encoder without any changes. In contrast, a BERT model needs to be fine-tuned to use the first vector corresponding to the class symbol [CLS] as a sentence embedding, and a BERT variant for sentence embeddings also needs supervision to train a model (Reimers and Gurevych 2019).

Semantic Classifier

The semantic classifier categorizes input sentences into groups that share the same intent, such as “seeking pool information” or “seeking restaurant information”. We expect the semantic classifier to enhance multilingual sentence embeddings to better reflect the semantic similarity of related sentences, where the semantic similarity is calculated as the cosine similarity between the embeddings of the two sentences.

Additionally, we expect the learned embeddings to retain semantic similarity with respect to cosine similarity. Thus, we propose the use of L2-constrained softmax loss (Ranjan, Castillo, and Chellappa 2017) and center loss (Wen et al. 2016), which are known to be effective for image recognition tasks. To the best of our knowledge, we are the first to apply these loss functions for fine-tuning embedding models. We describe these loss functions next.

L2-constrained softmax loss

L2-constrained softmax loss (Ranjan, Castillo, and Chellappa 2017) imposes a hard constraint on the norm of embedding vectors on top of the softmax loss:

\[
\begin{aligned}
\text{minimize} \quad & -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{T} u_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^{T} u_i + b_j}} \\
\text{subject to} \quad & \|u_i\|_2 = \alpha, \quad \forall i = 1, \ldots, M,
\end{aligned}
\]

where M denotes the number of training samples, C denotes the number of classes, and u_i and y_i are the i-th sentence embedding vector and its true label respectively. The L2 constraint ensures that embedding vectors are distributed on the hypersphere of radius α. Therefore, the Euclidean distance between two vectors on the hypersphere is a monotonic function of their cosine distance.
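The relationship between the norm constraint and cosine similarity can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`l2_constrain`, `l2_softmax_loss`) and the toy weights are hypothetical, and α = 50 is taken from the hyper-parameter settings reported later in the paper. It projects two vectors onto the hypersphere of radius α and verifies that the squared Euclidean distance equals 2α²(1 − cos(u, v)), then evaluates the softmax loss on a constrained embedding.

```python
import math


def l2_constrain(u, alpha):
    """Rescale u so that its L2 norm equals alpha (a point on the hypersphere)."""
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [alpha * x / norm for x in u]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def l2_softmax_loss(u, label, weights, biases, alpha):
    """Softmax cross-entropy evaluated on the norm-constrained embedding."""
    z = l2_constrain(u, alpha)
    logits = [sum(w_k * z_k for w_k, z_k in zip(w, z)) + b
              for w, b in zip(weights, biases)]
    top = max(logits)  # subtract the max for numerical stability
    log_partition = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_partition - logits[label]


alpha = 50.0  # value reported in the paper's hyper-parameter settings
u = l2_constrain([1.0, 2.0, 3.0], alpha)
v = l2_constrain([-2.0, 0.5, 4.0], alpha)

# On the hypersphere, squared Euclidean distance is 2 * alpha^2 * (1 - cos).
lhs = squared_euclidean(u, v)
rhs = 2 * alpha ** 2 * (1 - cosine_similarity(u, v))

# Hypothetical two-class output layer; the weights and biases are made up.
toy_weights = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0]]
toy_biases = [0.0, 0.0]
loss = l2_softmax_loss([1.0, 2.0, 3.0], 0, toy_weights, toy_biases, alpha)
```

Because the distance identity holds exactly on the hypersphere, ranking neighbors by Euclidean distance and ranking them by cosine similarity give the same order for constrained embeddings.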
This property is helpful for specializing sentence embeddings to learn semantic similarity in the form of cosine similarity. Note that this L2 constraint is different from the L2 regularization term applied to the weight parameters of the output layer; in that case, the regularization term is added to the loss function rather than enforced as a hard constraint.

To implement L2-constrained softmax loss, the model additionally inserts an L2-normalization layer that normalizes the encoder output u (i.e., u / ‖u‖_2), followed by a layer that scales the result by a hyper-parameter α. The scaled vectors are then fed into the output layer, where the model evaluates the softmax loss.

Center loss
The center loss (Wen et al. 2016) was originally developed for face recognition tasks to stabilize deep features learned from data. The center loss is described as follows:

\[
L_{\mathrm{center}} = \frac{1}{2} \sum_{i=1}^{m} \| u_i - c_{y_i} \|_2^2, \tag{1}
\]

where c_{y_i} denotes the centroid of the sentence embedding vectors of class y_i. The loss function pulls the embedding vector of the i-th sample toward the centroid of its true category.

Our motivation for using this loss function is to enhance the intra-class compactness of sentence embeddings. That is, we want to ensure that the sentence embeddings that have the same intent form compact clusters, because other loss functions, such as the softmax loss, do not have this functionality. The center loss also works as a cross-lingual center loss: if multilingual training data are available, it pulls sentences that belong to the same intent, in any language, into the same cluster.

We combine the center loss with the L2-constrained softmax loss using a hyper-parameter λ:

\[
L_C = L_{L_2\text{-sm}} + \lambda L_{\mathrm{center}}, \tag{2}
\]

where L_{L_2-sm} denotes the L2-constrained softmax loss function.

Language Discriminator
The semantic classifier does not directly consider multilinguality, so the model, which is fine-tuned on a single language, may perform worse on other languages. To avoid this problem, we incorporate multilingual adversarial learning into the framework. Specifically, the language discriminator D aims to identify the language of an input sentence given its embedding, whereas the multilingual sentence encoder E incorporates an additional loss function to “confuse” D. The idea was inspired by related work that used adversarial learning for multilingual NLP models (Chen et al. 2018; Chen and Cardie 2018). We hypothesized, and our experiments confirm, that incorporating adversarial learning also enhances the multilinguality of sentence embeddings.

Table 1: Statistics of the datasets (HotelQA, ATIS, Quora).

The language discriminator is trained to determine whether the languages of two input embeddings are different. Simultaneously, the other part of the model is trained to confuse the discriminator. In our implementation, we use Wasserstein GAN (Arjovsky, Chintala, and Bottou 2017) because it is known to be more robust than the original GAN (Goodfellow et al. 2014).

Algorithm 1 shows a single training step of Emu. Each step consists of two training routines: one for the language discriminator D_t and one for the other components (multilingual sentence encoder E and semantic classifier C). Target language t denotes the language used for training (e.g., English); t is randomly chosen from the training language set if multiple languages are used for training. Adversarial languages L is a set of languages that are used to retrieve adversarial sentences.
To train language discriminator D_t, training sentences in language t and adversarial sentences from a randomly chosen language ℓ ∈ L are used to evaluate L_{D_t}. Formally, the loss function for any training language t is described as

\[
L_{D_t} = L_d(1, D_t(u_t)) + L_d(0, D_t(v_\ell)), \tag{3}
\]

where L_d(·, ·) is the cross entropy loss, and u_t and v_ℓ are embedding vectors (encoded by E) of sentences in language t and language ℓ (t ≠ ℓ). Our design implements a language discriminator for each training language t. For instance, language discriminator D_{t=en} aims to predict whether an input multilingual sentence embedding belongs to English.

Next, labeled sentences in language t and adversarial sentences in language ℓ are sampled to update the parameters of E and C with the parameters of D_t fixed. The overall loss function L_{C+D_t} now takes into account the loss value of D_t so that the multilingual encoder E generates multilingual sentence embeddings for sentences in languages t and ℓ that cannot be distinguished by the language discriminator D_t. We use hyper-parameter γ to balance the loss functions:

\[
L_{C+D_t} = L_C - \gamma L_{D_t}. \tag{4}
\]

Evaluation
We evaluated Emu based on the cross-lingual intent classification task. The task is to detect the intent of an input sentence in a source language (e.g., German) based on labeled sentences associated with intent labels in a target language (e.g., English). We consider similarity-based intent detection, which categorizes an input sentence based on the label of its nearest neighbor, that is, the labeled sentence with the highest cosine similarity to the input sentence. We adopted this evaluation method since it is widely used in search-based QA systems (Paşca 2003) and works robustly, especially if training data are sparse. An intuitive alternative for intent detection is to directly use the trained semantic classifier (see Figure 1). We evaluated the classification results using the semantic classifier, but the performance was poor. Therefore, we excluded the results from the tables.

Algorithm 1 Single Training Step of Emu
Require: Training language t, adversarial languages L, iteration number k, clipping interval c.
 1: for 1 to k do
 2:     Sample training sentences as x_t
 3:     Sample adversarial language ℓ from L
 4:     Sample adversarial sentences as x_ℓ
 5:     u_t ← E(x_t); v_ℓ ← E(x_ℓ)
 6:     Evaluate loss L_{D_t}(u_t, v_ℓ)          ▷ Eq. 3
 7:     Update D_t parameters
 8:     Clip D_t parameters to [−c, c]
 9: Sample training sentences and labels as x_t and y_t
10: Sample adversarial language ℓ from L
11: Sample adversarial sentences as x_ℓ
12: u_t ← E(x_t); v_ℓ ← E(x_ℓ)
13: Evaluate loss L_{C+D_t}(u_t, v_ℓ, y_t)      ▷ Eq. 4
14: Update E and C parameters

Dataset
We used three datasets for evaluation. Some statistics of these datasets are shown in Table 1.
HotelQA is a real-world private corpus of 820 questions collected via a multi-channel communication platform for hotel guests and hotel staff. Questions are always made by guests and have ground truth labels for 28 intent classes (e.g., check-in, pool). The utterances are professionally translated into 5 non-English languages: German (de), Spanish (es), French (fr), Japanese (ja), and Chinese (zh). We split the dataset into training and test sets so that the sentences used for fine-tuning do not appear in the test set.
ATIS (Hemphill, Godfrey, and Doddington 1990) is a publicly available corpus for spoken dialog systems and is widely used for intent classification research. The dataset consists of more than 5k sentences, each of which is assigned one of 22 intent labels. We excluded the “flights” class from the dataset since the class accounts for about 75% of the dataset. We also ensured that each class has at least 5 sentences in each of the train and test datasets. As a result, 13 classes remained in the dataset. Similar to previous studies (Conneau et al. 2018b; Glavas et al. 2019), we used Google Translate to generate corresponding translations in the same 5 non-English languages as HotelQA.

Quora is a publicly available paraphrase detection dataset that contains over 400k questions with duplicate labels. Each row is a pair of questions with a duplicate label. Duplicate questions can be considered sentences that belong to the same intent. Therefore, we created a graph where each node is a question and an edge between two nodes denotes that these questions are considered duplicates. By doing this, we can consider each disjoint clique in the graph as a single intent class. Specifically, we filtered only complete subgraphs whose size (i.e., … ATIS.

Baselines
MT + sent2vec
We consider the two-stage approach that uses machine translation and monolingual sentence embeddings in a pipeline. We used Google Translate for translation and sent2vec (Pagliardini, Gupta, and Jaggi 2018) as a baseline method.

Softmax loss
Softmax loss is the most common loss function for classification, and thus a natural choice for fine-tuning the embeddings. We used the softmax loss function to train the semantic classifier and adjust the embeddings.
Contrastive loss
Contrastive loss (Chopra et al. 2005) is a widely used pairwise loss function for metric learning. The loss function minimizes the squared distance between two embeddings if the labels are the same, and it pushes two samples apart up to a margin (we used m = 2.…) otherwise. For contrastive loss, we use the Siamese (i.e., dual-encoder) architecture (Chopra et al. 2005), which feeds two input sentences into a shared encoder (i.e., multilingual encoder E) to obtain sentence embeddings.

N-pair loss
As another metric learning method, we used the N-pair sampling cosine loss (Yang et al. 2019), which first samples one positive sample and N − 1 negative samples and then minimizes a cosine similarity-based loss function.

Experimental Settings
For each dataset, we used only English training data to fine-tune the models with Emu and the baseline methods. To train Emu's language discriminator, we used unlabeled training data in the other, non-English languages (i.e., de, es, fr, ja, zh).

Emu variants
To verify the effect of the language discriminator and the center loss, we also evaluated Emu without the language discriminator (Emu w/o LD) and Emu without the language discriminator or the center loss (Emu w/o LD+CL) as a part of an ablation study. Finally, we evaluated Emu-Parallel, which uses parallel sentences instead of randomly sampled sentences for cross-lingual adversarial training.

Quora dataset: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
MT + sent2vec: The non-English sentences obtained through MT from English had to be translated back to English.
InferSent: We tested the official implementation of InferSent (Conneau et al. 2017), finding that performance was unstable and often significantly lower than that of sent2vec. Thus, we decided to use sent2vec in the experiments.

Table 2: Experimental results (Acc@1) on three datasets. The highest performance (excluding Emu-Parallel) is in bold and the highest performance by Emu-Parallel is underlined. *, **, and *** denote p-value < …, …, and …, respectively, based on the binomial proportion confidence intervals of Acc@1 values against the baseline methods.

(a) HotelQA
                                   En → *                        * → En
Method             en-en    de    es    fr    ja    zh      de    es    fr    ja    zh
Baseline
MT + sent2vec      48.6    41.0  35.4  34.7  47.2  43.1    46.5  47.2  44.4  48.6  41.7
LASER (original)   55.6    45.1  48.6  48.6  47.9  45.1    43.8  45.8  50.7  44.4  49.3
Contrastive loss   34.0    19.4  12.5  22.9  25.7  21.5    24.3  18.8  25.7  24.3  20.1
N-pair loss        27.8    20.8  22.9  21.5  20.8  21.5    24.3  24.3  25.7  25.0  20.1
Softmax loss       30.6    13.9  13.9   7.6   8.3   7.6    13.2  24.3  16.0  20.8  13.9
Proposed
Emu                …
Emu w/o LD         76.4 …
Emu w/o LD+CL      77.1 …
Emu-Parallel       …

(b) ATIS
                                   En → *                        * → En
Method             en-en    de    es    fr    ja    zh      de    es    fr    ja    zh
Baseline
MT + sent2vec      90.5    87.3  89.7  87.7   2.4   7.1    84.9  84.9  86.1  80.6  81.7
LASER (original)   88.5    86.5  84.1  81.3  85.3  87.7    87.7  87.7  85.7  86.5  …
Contrastive loss   83.3    62.3  67.9  63.9  44.0  52.8    66.7  69.4  64.3  57.9  59.1
N-pair loss        81.0    57.1  49.2  52.0  30.2  42.1    58.3  57.9  55.2  41.3  41.3
Softmax loss       90.5    48.0  63.5  52.0  56.0  52.0    50.0  46.0  45.2  35.7  39.7
Proposed
Emu                …
Emu w/o LD         97.6 …
Emu w/o LD+CL      98.4 …
Emu-Parallel       …

(c) Quora
                                   En → *                        * → En
Method             en-en    de    es    fr    ja    zh      de    es    fr    ja    zh
Baseline
MT + sent2vec      77.6    74.0  75.8  73.5   1.8  72.6    70.4  70.4  69.5  70.4  71.3
LASER (original)   88.8 83.9 …
Contrastive loss   65.0    35.4  43.0  42.6  27.8  26.0    50.2  59.2  54.3  50.7  49.8
N-pair loss        61.4    23.8  40.4  35.9  12.6  26.5    50.2  53.4  45.7  50.7  52.0
Softmax loss       75.8    20.2  35.4  30.5  12.1  16.1    31.4  39.0  35.9  28.3  26.0
Proposed
Emu                …
Emu w/o LD         89.7 83.9 85.7 83.4 82.1 …
Emu w/o LD+CL      88.3    75.3  80.3  75.8  70.9  78.5    72.6  82.1  75.3  81.6  80.7
Emu-Parallel       …
Hyper-parameters
We used the official implementation of LASER (https://github.com/facebookresearch/LASER) and the pre-trained models including BPE. We implemented our proposed method and the baseline methods using PyTorch. We used an initial learning rate of … and optimized the model with Adam. We used a batch size of 16. For our proposed methods, we set α = 50 and λ = 10^−…. All the models were trained for 3 epochs. The language discriminator D has two 900-dimensional fully-connected layers with a dropout rate of 0.2. The hyper-parameters were set to γ = 10^−…, k = 5, and c = 0.…. The language discriminator was also optimized with Adam, with an initial learning rate of … × ….

Evaluation Metric
We used the leave-one-out evaluation method on the test data. For each sentence, we consider the other sentences in the test data as labeled sentences and predict its label from the nearest neighbor among them. The idea is to exclude the direct translation of an input sentence in the target language, which makes the nearest neighbor search more challenging and simulates the real-world setting where parallel sentences are missing. We used Acc@1 (the ratio of test sentences that are correctly categorized into the intent classes) as our evaluation metric.
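The leave-one-out protocol described above can be sketched as follows. This is an illustrative sketch in plain Python rather than the authors' evaluation script: the function name `leave_one_out_acc_at_1` is hypothetical, and it assumes that the source-language and target-language test sets are index-aligned (entry i in both lists is the same sentence), so that skipping index i excludes the direct translation of the query.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def leave_one_out_acc_at_1(query_embs, candidate_embs, labels):
    """Fraction of queries whose nearest candidate (excluding the one at the
    same index, i.e. the query itself or its direct translation) has the
    same intent label as the query."""
    correct = 0
    for i, query in enumerate(query_embs):
        best_index, best_sim = None, -math.inf
        for j, candidate in enumerate(candidate_embs):
            if j == i:
                continue  # leave out the query's own counterpart
            sim = cosine_similarity(query, candidate)
            if sim > best_sim:
                best_index, best_sim = j, sim
        if labels[best_index] == labels[i]:
            correct += 1
    return correct / len(query_embs)


# Toy 2-dimensional embeddings: sentences 0 and 1 share one intent,
# sentences 2 and 3 share another.
source_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
target_embs = [[1.0, 0.05], [0.95, 0.1], [0.05, 1.0], [0.1, 0.95]]
intent_labels = [0, 0, 1, 1]

acc = leave_one_out_acc_at_1(source_embs, target_embs, intent_labels)
```

For a monolingual task such as en-en, the same list can be passed as both `query_embs` and `candidate_embs`, in which case skipping index i simply removes the query from its own candidate set.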
Results and Discussion
Table 2 shows the experimental results on these three datasets. In Table 2 (a), Emu achieved the best performance for all the 11 tasks (en-fr, en-ja, and ja-en by Emu w/o LD and en-ja by Emu w/o LD+CL). Emu outperformed the baseline methods including the original LASER model. In Table 2 (b), Emu achieved the best performance for 10 tasks (en-fr by Emu w/o LD+CL). The original LASER model showed the best performance for zh-en, and all of the Emu methods degraded the performance for that task. In Table 2 (c), Emu achieved the best performance for 7 tasks (en-zh by Emu w/o LD), whereas the original LASER model achieved the best performance for the rest of the tasks. Overall, Emu outperformed the baseline methods, including the original LASER model, on 28 of the 33 tasks. At the same time, Emu failed to improve the performance of five tasks, namely zh-en on ATIS (Table 2 (b)) and en-fr, fr-en, ja-en, ja-zh on Quora (Table 2 (c)). We would like to emphasize that the Emu models were trained using labeled data only in English; Emu also used unlabeled data in non-English languages. Therefore, it is noteworthy that our framework successfully specializes multilingual sentence embeddings for multiple language pairs that involve English, using only English labeled data. The results support that Emu is effective in semantically specializing multilingual sentence embeddings.

Table 3: Relative performance (Acc@1 on HotelQA) of Emu w/o LD models trained on different training languages against the original LASER model for each language pair.

Training data   en-en    en-de    en-fr    de-en    de-de    de-fr    fr-en    fr-de    fr-fr
En only         +37.5%   +40.0%   +34.3%   +27.0%   +10.0%   +1.7%    +12.3%   +12.7%   +11.1%
De only         +26.2%   +47.7%   +10.0%   +49.2%   +10.0%   +25.0%   +9.6%    +7.9%    +9.9%
Fr only         +30.0%   +33.8%   +28.6%   +17.5%   +8.7%    +16.7%   +31.5%   +15.9%   +17.3%
En + De         +37.5%   +58.5%   +27.1%   +50.8%   +17.5%   +23.3%   +9.6%    +12.7%   +14.8%
En + Fr         +40.0%   +60.0%   +50.0%   +46.0%   +12.5%   +33.3%   +35.6%   +25.4%   +23.5%
De + Fr         +28.7%   +50.8%   +37.1%   +55.6%   +12.5%   +46.7%   +31.5%   +25.4%   +17.3%
En + De + Fr    +41.2%   +63.1%   +47.1%   +60.3%   +20.0%   +56.7%   +31.5%   +34.9%   +25.9%

Table 4: Relative performance of Acc@1 on HotelQA of Emu w/o LD against the original LASER model for each language pair (rows: first language of the pair; columns: second language).

* →    en       de       es       fr       zh       ja
en     +37.5%   +40.0%   +22.9%   +34.3%   +39.1%   +26.1%
de     +27.0%   +10.0%   +10.0%   +1.7%    +14.5%   +20.9%
es     +34.8%   +0.0%    +11.5%   +5.0%    +21.3%   +8.0%
fr     +12.3%   +12.7%   +23.2%   +11.1%   +13.2%   +7.1%
zh     +21.9%   +34.5%   +11.4%   +9.6%    +9.1%    +10.1%
ja     +18.3%   +31.0%   +23.4%   +20.3%   +32.3%   +22.1%

For nearly all the tasks, we observe that the baseline fine-tuning methods (i.e., contrastive loss, N-pair loss, softmax loss) do not improve the performance but instead decrease the accuracy values compared to the original LASER performance, with few exceptions (e.g., the softmax loss on the ATIS en-en task). The results indicate that fine-tuning multilingual sentence embeddings is sensitive to the choice of loss functions, and L2-constrained softmax loss is the best choice among the loss functions.

The original LASER model consistently performs better on all datasets for the en-en task compared to the other tasks. This is partially due to the higher quality of sentence embeddings in English. More specifically, the LASER model was trained on MT tasks, translating text from 93 languages to either English or Spanish as target languages, with English having the most training data in the dataset (Artetxe and Schwenk 2019b).

Table 5: Ablation study of Emu. Each value denotes the average percentage point (pp) drop after removing the component. Negative values denote improvements after removing the component. ** and *** denote p-values < … and < … (Wilcoxon signed-rank test) respectively.

Component                HotelQA   ATIS     Quora
Language Discriminator   1.45      2.81**   1.05
Center loss              −….44     0.04     6.…***

MT+sent2vec shows significantly low Acc@1 values for the en-ja and en-zh tasks on ATIS, and for the en-ja task on Quora. Investigating this trend, we observed that back-translation of sentences that were translated from en into ja/zh results in the following types of degradation: (1) missing words, especially interrogative pronouns (e.g., what, when, which) and verbs, and (2) significant changes in the word order. As discussed above, sent2vec embeddings are also susceptible to this type of perturbation. From the evaluation perspective, we can consider the en-en performance of MT+sent2vec as an upper bound on its performance for all en-* and *-en tasks, assuming that MT exactly translates back to the original sentence in English. Nevertheless, Emu consistently outperforms MT+sent2vec.

Ablation study
We conducted an ablation study to quantitatively evaluate the contribution of each component of Emu, namely, the language discriminator and the center loss. First, we compared Emu w/o LD with Emu to verify the effect of the language discriminator, and then compared Emu w/o LD and Emu w/o LD+CL to determine the effect of the center loss.

Table 5 shows the average percentage point drop (i.e., the degree of contribution) of each component. The language discriminator had a significant contribution of 2.81 points on ATIS. The contributions were 1.45 points and 1.05 points on HotelQA and Quora respectively. Similarly, the center loss had a significant impact on Quora, whereas it had almost no effect on ATIS and had a negative impact on HotelQA.

Sentence embedding visualization
We conducted a qualitative analysis to observe how our framework with the language discriminator specialized multilingual sentence embeddings and enhanced the multilinguality. We filtered English and German sentences from the test data of the ATIS dataset and visualized sentence embeddings of (a) the original LASER model, (b) the softmax loss, (c) Emu w/o LD, and (d) Emu in the same 2D space using t-SNE.

Figure 2: Visualizations of the sentence embeddings of English (◦) and German (×) test data of the ATIS dataset. We used t-SNE to convert the sentence embeddings into the 2D space. Each point is a sentence and the color denotes the intent class. The plots are: (a) the original LASER embeddings, (b) softmax loss, (c) Emu w/o LD, (d) Emu.

Figure 2 shows visualizations of these methods. Figure 2(a) shows that the original LASER sentence embeddings have multilinguality, as the sentences with the same intent in English and German were embedded close to each other. Figure 2(b) shows that fine-tuning the model with the softmax loss function not only broke the intent clusters but also spoiled the multilinguality. In Figure 2(c), Emu w/o LD successfully specialized the sentence embeddings, whereas multilinguality was degraded, as the English and German sentence embeddings of the same intent classes were separated compared to the original LASER model. Finally, Emu (with the language discriminator) moved sentence embeddings of the same intent in English and German close to each other, as shown in Figure 2(d).

From the results, we observe that incorporating the language discriminator enriches the multilinguality in the embedding space.

Do we need parallel sentences for Emu?
We compared Emu to Emu-Parallel, which uses parallel sentences instead of randomly sampled sentences, to verify whether using parallel sentences makes multilingual adversarial learning more effective. The results are shown in Tables 2 (a)-(c). Compared to Emu, Emu-Parallel showed lower Acc@1 values on the three datasets. The decreases were -0.5 points, -1.2 points, and -5.9 points on HotelQA, ATIS, and Quora respectively. The differences are not statistically significant except for Quora. The results show that the language discriminator of Emu does not need a costly parallel corpus but can improve performance using unlabeled and non-parallel sentences in other languages.

What language(s) should we use for training?
We also investigated how the performance changes when fine-tuning with training data in multiple languages other than English. To isolate the effect of the training languages, we turned off the language discriminator in this analysis to ensure that Emu uses data only in the specified languages. We summarize the relative performance of Emu w/o LD against the original LASER model on the HotelQA dataset in Table 4. As discussed above, the accuracy values of tasks that involve English on at least one side (i.e., source language, target language, or both) show larger improvements than the other pairs, which only involve non-English languages. This is likely because sentence embeddings of those languages were not appropriately fine-tuned compared to those of English, since training data in those languages were not used.

Therefore, we hypothesized that using training data in the same language as the target and/or source language would be the best choice. To test the hypothesis, we chose English, German, and French as source/target languages and conducted additional experiments on the HotelQA dataset. The experimental settings, including the hyper-parameters, followed the main experiments, with only the training data used for fine-tuning being different.

Table 3 shows the results. When only using training data in a single language (i.e., En only, De only, Fr only), the target language was the best training data for monolingual intent classification tasks, because this method achieved the best performance in the en-en, de-de, and fr-fr tasks respectively. Similarly, using the source and target languages as training data was the best configuration for methods that trained in two languages. That is, En+De achieved the best performance for the en-de and de-en tasks. En+Fr (De+Fr) also achieved the best performance for the en-fr (de-fr) and fr-en (fr-de) tasks. Finally, the method that used training data in the three languages (En+De+Fr) showed the best accuracy values for 7 out of 9 tasks. The degradation in the remaining two tasks (i.e., en-fr and fr-en) occurred when En+De+Fr incorporated a language that was neither the source nor the target language.

From the results, we conclude that we should focus on creating training data in a target or source language to obtain the best performance with Emu and use our budget effectively.

Related Work
Multilingual embedding techniques (Ruder et al. 2019) have been well studied, and most of the prior work has focused on word embeddings. However, relatively few techniques have been developed for multilingual sentence embeddings. This is because such techniques (Hermann and Blunsom 2014; Artetxe and Schwenk 2019b) require parallel sentences for training multilingual sentence embeddings, and some use both sentence-level and word-level alignment information (Luong, Pham, and Manning 2015). Rücklé et al. (2018) developed an unsupervised sentence embedding method based on concatenating and aggregating cross-lingual word embeddings. They also confirmed that the method performs well in cross-lingual as well as monolingual settings. Schwenk and Douze (2017) used machine translation tasks to learn multilingual sentence representations. This idea has been further expanded in LASER (Artetxe and Schwenk 2019b; 2019a), a recently developed system that trains a language-agnostic sentence embedding model with a large number of translation tasks on large-scale parallel corpora.

Similar to the center loss used in this paper, two techniques have incorporated cluster-level information (Huang et al. 2018; Doval et al. 2018) to enhance the compactness of word clusters and thereby improve the quality of multilingual word embedding models. However, neither of them directly uses the centroid of each class to calculate loss values for training.

Adversarial learning (Goodfellow et al. 2014) is a common technique that has been used for many NLP tasks, including (monolingual) sentence embeddings (Patro et al. 2018) and multilingual word embeddings (Conneau et al. 2018b; Chen and Cardie 2018). Chen et al. (2018) developed a technique that uses a language discriminator to train a cross-lingual sentiment classifier.
Our framework is similar in its use of a language discriminator, but our novelty is that it uses the language discriminator for learning multilingual sentence embeddings instead of cross-lingual transfer. Joty et al. (2017) used a language discriminator to train a model for cross-lingual question similarity calculation. Their setting differs from ours, as their method requires parallel sentences in different languages and pair-wise similarity labels instead of class labels.

There is a line of work on post-processing word embedding models called word embedding specialization (Faruqui et al. 2015; Kiela, Hill, and Clark 2015; Mrkšić et al. 2017). Prior work specialized word embeddings with different external resources such as semantic information (Faruqui et al. 2015). The common approaches are (1) post-hoc learning (Faruqui et al. 2015), which uses an additional loss function to tune pre-trained embeddings, (2) learning an additional model (Glavaš and Vulić 2018; Vulić et al. 2018), and (3) the fine-tuning approach (Abdalla, Sahlgren, and Hirst 2019), which is similar to ours. However, to the best of our knowledge, we are the first to approach semantic specialization of multilingual sentence embeddings.
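
To make the notion of using class centroids to calculate loss values concrete, the following is a minimal sketch of a center-style loss in plain Python. It is an illustration under stated assumptions, not the implementation used for Emu: the function names, the toy 2D embeddings, and the intent labels are hypothetical, the centroids are simply recomputed from the batch, and the loss is averaged over the batch as a convention chosen here.

```python
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Return the mean embedding (centroid) of each class in the batch."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def center_loss(embeddings, labels, centroids):
    """Half the mean squared Euclidean distance to the class centroid.

    Averaging over the batch is an assumption of this sketch.
    """
    total = 0.0
    for vec, label in zip(embeddings, labels):
        center = centroids[label]
        total += sum((v - c) ** 2 for v, c in zip(vec, center))
    return 0.5 * total / len(embeddings)


# Hypothetical toy data: four 2D "sentence embeddings" with two intent classes.
embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [0.0, 6.0]]
labels = ["flight", "flight", "fare", "fare"]

centroids = class_centroids(embeddings, labels)
print(center_loss(embeddings, labels, centroids))  # 0.5
```

A smaller value means that sentences of the same class lie closer to their centroid, which is the compactness property discussed above. In a training loop, a term of this kind would typically be added to a classification loss rather than used alone.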
Conclusion
We have presented Emu, a semantic specialization framework for multilingual sentence embeddings. Emu incorporates multilingual adversarial training on top of fine-tuning to enhance multilinguality without using parallel sentences. Our experimental results show that Emu outperformed the baseline methods, including the state-of-the-art multilingual sentence embeddings, LASER, and monolingual sentence embeddings after machine translation, on multiple language pairs. The results also show that Emu can successfully train a model using only monolingual labeled data and unlabeled data in other languages.

Acknowledgments
We thank Sorami Hisamoto for sharing his literature survey on cross-lingual embedding techniques and Tom Mitchell for helpful comments, as well as the anonymous reviewers for their constructive feedback.
References
Abdalla, M.; Sahlgren, M.; and Hirst, G. 2019. Enriching word embeddings with a regressor instead of labeled corpora. In Proc. AAAI ’19, 6188–6195.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In Proc. ICML ’17, volume 70, 214–223.
Artetxe, M., and Schwenk, H. 2019a. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proc. ACL ’19, 3197–3203.
Artetxe, M., and Schwenk, H. 2019b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics
Proc. EMNLP ’18, 261–270.
Chen, X.; Sun, Y.; Athiwaratkun, B.; Cardie, C.; and Weinberger, K. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics
Proc. RepL4NLP ’19, 250–259.
Chopra, S.; Hadsell, R.; LeCun, Y.; et al. 2005. Learning a similarity metric discriminatively, with application to face verification. In Proc. CVPR ’05, 539–546.
Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proc. EMNLP ’17, 670–680.
Conneau, A.; Kruszewski, G.; Lample, G.; Barrault, L.; and Baroni, M. 2018a. What you can cram into a single $&! Proc. ACL ’18, 2126–2136.
Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018b. Word translation without parallel data. In Proc. ICLR ’18.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT ’19, 4171–4186.
Doval, Y.; Camacho-Collados, J.; Espinosa Anke, L.; and Schockaert, S. 2018. Improving cross-lingual word embeddings by meeting in the middle. In Proc. EMNLP ’18, 294–304.
Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E.; and Smith, N. A. 2015. Retrofitting word vectors to semantic lexicons. In Proc. NAACL-HLT ’15, 1606–1615.
Glavaš, G., and Vulić, I. 2018. Explicit retrofitting of distributional word vectors. In Proc. ACL ’18, 34–45.
Glavaš, G.; Litschko, R.; Ruder, S.; and Vulić, I. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proc. ACL ’19.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Proc. NIPS ’14, 2672–2680.
Hemphill, C. T.; Godfrey, J. J.; and Doddington, G. R. 1990. The ATIS spoken language systems pilot corpus. In Proc. the Workshop on Speech and Natural Language, HLT ’90, 96–101.
Hermann, K. M., and Blunsom, P. 2014. Multilingual models for compositional distributed semantics. In Proc. ACL ’14, 58–68.
Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. In Proc. ACL ’18, 328–339.
Huang, L.; Cho, K.; Zhang, B.; Ji, H.; and Knight, K. 2018. Multi-lingual common semantic space construction via cluster-consistent word embedding. In Proc. EMNLP ’18.
Joty, S.; Nakov, P.; Màrquez, L.; and Jaradat, I. 2017. Cross-language learning with adversarial neural networks. In Proc. CoNLL ’17, 226–237.
Kiela, D.; Hill, F.; and Clark, S. 2015. Specializing word embeddings for similarity or relatedness. In Proc. EMNLP ’15, 2044–2048.
Luong, T.; Pham, H.; and Manning, C. D. 2015. Bilingual word representations with monolingual quality in mind. In Proc. RepL4NLP ’15, 151–159.
Mrkšić, N.; Vulić, I.; Ó Séaghdha, D.; Leviant, I.; Reichart, R.; Gašić, M.; Korhonen, A.; and Young, S. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics
NAACL-HLT ’18.
Paşca, M. 2003. Open-domain question answering from large text collections. MIT Press.
Patro, B. N.; Kurmi, V. K.; Kumar, S.; and Namboodiri, V. P. 2018. Learning semantic sentence embeddings using pair-wise discriminator. In Proc. COLING ’18.
Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. NAACL-HLT ’18, 2227–2237.
Ranjan, R.; Castillo, C. D.; and Chellappa, R. 2017. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
Reimers, N., and Gurevych, I. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proc. EMNLP-IJCNLP ’19, 3973–3983.
Rücklé, A.; Eger, S.; Peyrard, M.; and Gurevych, I. 2018. Concatenated power mean word embeddings as universal cross-lingual sentence representations. arXiv preprint arXiv:1803.01400.
Ruder, S.; Vulić, I.; Søgaard, A.; and Faruqui, M. 2019. Cross-Lingual Word Embeddings. Morgan & Claypool Publishers.
Schwenk, H., and Douze, M. 2017. Learning joint multilingual sentence representations with neural machine translation. In Proc. RepL4NLP ’17, 157–167.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proc. ACL ’16, 1715–1725.
Vulić, I.; Glavaš, G.; Mrkšić, N.; and Korhonen, A. 2018. Post-specialisation: Retrofitting vectors of words unseen in lexical resources. In Proc. NAACL-HLT ’19.
Wang, J.; Zhang, T.; Sebe, N.; Shen, H. T.; et al. 2017. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence
Proc. ECCV ’16, 499–515.
Yang, Y.; Abrego, G. H.; Yuan, S.; Guo, M.; Shen, Q.; Cer, D.; Sung, Y.-h.; Strope, B.; and Kurzweil, R. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In Proc. IJCAI ’19, 5370–5378.
Zhu, X.; Li, T.; and de Melo, G. 2018. Exploring semantic properties of sentence embeddings. In