Quality Estimation without Human-labeled Data
Yi-Lin Tuan, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Francisco Guzmán, Lucia Specia
University of California Santa Barbara, Facebook AI, Imperial College London
[email protected], {ahelk,adirendu,vishrav,fguzman}@fb.com, [email protected]

Abstract
Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many approaches exist for quality estimation, they are based on supervised machine learning requiring costly human-labelled data. As an alternative, we propose a technique that does not rely on examples from human annotators and instead uses synthetic training data. We train off-the-shelf architectures for supervised quality estimation on our synthetic data and show that the resulting models achieve comparable performance to models trained on human-annotated data, both for sentence- and word-level prediction.
The adoption of Machine Translation (MT) has been increasing in areas ranging from government and finance to even social media, due to the substantial improvements achieved by Neural Machine Translation (NMT). However, even with improved performance, translation quality is not consistent across language pairs, domains, and sentences. This can be detrimental to end-users' trust and can cause unintended consequences arising from poor translations. Thus, having metrics to assess the quality of translated content is crucial to ensure that only high-quality translations are provided to end-users or downstream tasks. Quality Estimation (QE) metrics aim to predict translation quality without access to reference translations (Blatz et al., 2004; Specia et al., 2009, 2013).

State-of-the-art QE techniques have leveraged MT systems and language-specific human annotations as supervision, including direct assessment and post-editing (Kepler et al., 2019a; Fonseca et al., 2019; Sun et al., 2020). However, these annotations are costly and time-consuming, particularly for word-level QE, where each token needs a label.

Some unsupervised approaches take inspiration from statistical MT (Popović, 2012; Moreau and Vogel, 2012; Etchegoyhen et al., 2018) or apply uncertainty quantification (Fomicheva et al., 2020) for QE. However, their performance is inferior to that of supervised models. In related areas such as automatic post-editing, parallel data has been used to create synthetic post-editing data (Negri et al., 2018); however, this technique only compares machine-translated sentences to references. Our approach augments MT errors with additional errors via masked language model rewriting.

We leverage noisy, mined comparable sentences obtained by weakly-supervised techniques (El-Kishky et al., 2020b).
These noisy bitexts have been mined from a variety of domains such as Wikipedia (Schwenk et al., 2019a) and large web-crawls (Schwenk et al., 2019b; El-Kishky et al., 2020a; El-Kishky and Guzmán, 2020) and have been shown to be an invaluable source of training data for NMT models. Using this data is crucial to avoid data leakage between a trained NMT model and the data we use to create synthetic QE data. For each source-target sentence pair from the mined data, we apply an MT system to generate a candidate translation of the source sentence. Additionally, we rewrite each target reference sentence using a masked language model to introduce errors. These two approaches generate two alternative "translations" of the source sentence. We then produce pseudo-labels for each token in these translations by edit-distance alignment to the original reference sentence. This results in each translated word being pseudo-labelled as correct or incorrect, which is our synthetic QE training data. Analogously, sentence-level training data is derived as the proportion of incorrect words per sentence.

Figure 1: The pipeline to synthesize data for QE from comparable mined data.
Our main contributions are: (i) we explore a simple technique to effectively generate synthetic data for QE that allows for both word-level and sentence-level estimation; (ii) we demonstrate that our technique performs comparably to off-the-shelf models trained on human-annotated data.
Word-level QE has been mainly framed as the task of predicting which words in the translation need to be post-edited. As such, word-level QE aims to assign a tag to each word and each gap between words in a machine-generated translation as correct, i.e., the word does not need editing, or incorrect, i.e., the word should be substituted or deleted, or words should be inserted (tags for gaps) (Specia et al., 2020).

For word-level QE, we denote the tag of each word in a translation as m_t ∈ {OK, BAD}, where t ∈ [1, T] and T is the length of the translation. We also denote the tag of each gap between two words (including the beginning and the end) as g_t ∈ {OK, BAD}, where t ∈ [1, T+1].

In traditional QE, data is collected by first translating source sentences using an MT model. Second, experts post-edit these translations. Third, the post-edits and machine translations are aligned so as to induce the minimum edit distance between the tokens of each. Finally, each m_t is labelled as BAD if the word should be deleted or substituted, and each g_t is labelled as BAD if at least one word should be inserted there. Sentence-level QE labels can be generated by computing the Human-targeted Translation Error Rate (HTER) (Snover and Brent, 2001; Snover et al., 2006), which is the minimum
Figure 2: The rewriting process by text-infilling using a masked language model.

ratio of edit operations needed to fix the translation to the number of its tokens. We explore the possibility of skipping the costly human post-editing process by proposing a data synthesis pipeline, which we then test on human-labelled data.
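Concretely, HTER can be computed with a token-level Levenshtein distance. The sketch below is illustrative only; the function names are ours, not from the paper's implementation, and we normalize by the post-edit length:

```python
def token_edit_distance(hyp, ref):
    # Minimum number of insertions, deletions, and substitutions
    # turning the token list `hyp` into `ref` (single-row Levenshtein DP).
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[len(ref)]

def hter(mt_tokens, post_edit_tokens):
    # HTER: edit operations needed to fix the translation,
    # normalized by the number of post-edit tokens.
    return token_edit_distance(mt_tokens, post_edit_tokens) / len(post_edit_tokens)
```

For example, fixing the two-token translation "a b" into "a c" takes one substitution, giving an HTER of 0.5.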
As depicted in Figure 1, we synthesize data from mined Wikipedia datasets, where each example consists of a (source, target) sentence pair. We create candidate translations of source sentences in two ways. For the first approach, we apply the NMT model to translate each source sentence. For the second approach, we rewrite each reference target sentence using a masked language model (MLM), as shown in the MLM Rewrites block in Figure 1. The two approaches create two forms of translations. Then, by treating target sentences as if they were post-edited data (pseudo post-edits), we identify errors in each candidate translation by looking at the insertions, deletions, and substitutions between the references and generated translations.
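This alignment-to-tags step can be sketched with Python's difflib. This is a minimal illustration of the idea rather than the authors' implementation; it simplifies by treating unequal-length `replace` blocks as pure substitutions:

```python
from difflib import SequenceMatcher

def pseudo_tags(mt_tokens, pe_tokens):
    # Align an MT hypothesis to a pseudo post-edit and derive
    # word tags m_1..m_T and gap tags g_1..g_{T+1}.
    word = ["OK"] * len(mt_tokens)
    gap = ["OK"] * (len(mt_tokens) + 1)
    sm = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):
            # MT tokens that must be substituted or deleted are BAD
            for i in range(i1, i2):
                word[i] = "BAD"
        elif op == "insert":
            # post-edit tokens missing from the MT mark this gap as BAD
            gap[i1] = "BAD"
    return word, gap
```

For instance, aligning the MT "a b c x" to the pseudo post-edit "a B c x y" tags the second word BAD (substitution) and the final gap BAD (insertion).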
Neural Machine Translation.
For the first approach to generating synthetic data, we use a pretrained NMT model to create translations. The NMT model is the same model that was used to generate translations in the supervised data; the architecture is a standard transformer as used in (Vaswani et al., 2017; Ott et al., 2019). The process of creating synthetic QE data first involves translating each source sentence using this model and taking the output as a translation which will later be used to generate the synthetic labels. When decoding, we apply a beam of 5, following the NMT models available in Fomicheva et al. (2020), to generate a candidate translation. Next, we take the mined reference target sentence and treat it as a pseudo post-edit of the machine translation. We then compute the edit distance between the MTs and the pseudo post-edits. The resulting edit operations are the pseudo tags, which consist of word tags m_t and gap tags g_t. This process is illustrated in Algorithm 1.

Data              English-German               English-Chinese
                  size   MT bad (%)  Gap bad (%)   size   MT bad (%)  Gap bad (%)
Human annotation  7K     27.8        4.7           7K     54.2        8.4
NMT               459K   38.2        5.7           189K   49.5        6.8
MLM (word-QE)     459K   40.7        2.9           189K   53.9        8.6
MLM (sent-QE)     459K   43.1        3.3           189K   49.9        2.7

Table 1: Statistics of annotated and synthetic (NMT and MLM) data.

Algorithm 1: DataSynthesis-NMT
Input: pairs (source, target) from mined data, pretrained NMT model
Output: (MTs, pseudo tags)
for each pair (source, target) do
    MTs = NMT(source)
    {m_t}_{t=1}^{T}, {g_t}_{t=1}^{T+1} = edit_distance(MTs, target)
    pseudo tags = ({m_t}_{t=1}^{T}, {g_t}_{t=1}^{T+1})
return (MTs, pseudo tags)

Rewriting by Masked Language Model (MLM).
Our second approach to creating synthetic QE training data is to introduce errors by rewriting target sentences. We inject these errors by performing text-infilling (Zhu et al., 2019; Lewis et al., 2019). As displayed in Figure 2, we perform text-infilling by applying three operations: (1) randomly substituting a proportion of tokens with mask tokens, (2) randomly deleting short text spans, and (3) randomly inserting contiguous mask tokens; the MLM then fills in all masks conditioned on the source sentence.
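A minimal sketch of these corruption operations follows; the default probabilities and the `<mask>` symbol are illustrative, span lengths are drawn from Poisson(1) + 1 as in the DataSynthesis-Rewriting procedure below, and the final MLM infilling step is deliberately left out since it depends on the chosen pretrained model:

```python
import math
import random

def poisson(lam=1.0):
    # Knuth's algorithm for sampling a Poisson-distributed integer.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def corrupt(tokens, p_sub=0.15, p_del=0.05, p_ins=0.05, mask="<mask>"):
    # Apply the three text-infilling corruptions to a token list:
    # substitution masks, span deletions, and inserted mask spans.
    # A pretrained MLM would subsequently fill every `mask` token.
    out, i = [], 0
    while i < len(tokens):
        r = random.random()
        if r < p_del:
            i += poisson() + 1                    # (2) delete a span of length Poisson(1)+1
            continue
        if r < p_del + p_sub:
            out.append(mask)                      # (1) substitute the token with a mask
        else:
            out.append(tokens[i])
        if random.random() < p_ins:
            out.extend([mask] * (poisson() + 1))  # (3) insert a contiguous mask span
        i += 1
    return out
```

Setting all probabilities to zero passes the sentence through unchanged, which makes the corruption rates easy to tune.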
Algorithm 2: DataSynthesis-Rewriting
Input: pairs (S, W) := (source, target) from mined data, pretrained MLM
Input: P_s, P_d, P_i as the probabilities of substitution, deletion, and insertion
Output: (pseudo MTs, pseudo tags)
for each pair (S, W) do
    W' = randomly mask tokens in W with probability P_s
    D = randomly mark deletions in W' with probability P_d
    W' = randomly delete a text span at each mark in D from W' (length ~ Poisson(λ=1) + 1)
    I = randomly mark insertions in W' with probability P_i
    W' = randomly insert contiguous masks at each mark in I into W' (length ~ Poisson(λ=1) + 1)
    rewrites = MLM fills in the masks of (S, W')
    {m_t}_{t=1}^{T}, {g_t}_{t=1}^{T+1} = edit_distance(rewrites, W)
    pseudo tags = ({m_t}_{t=1}^{T}, {g_t}_{t=1}^{T+1})
return (rewrites, pseudo tags)

We compare the performance of QE models trained on NMT-based synthetic data, rewriter-based synthetic data, and a two-model ensemble where each model is trained on a different form of synthetic data.

We focus on data released by the WMT20 shared task on QE for predicting post-editing effort, which includes English-to-German (En-De) and English-to-Chinese (En-Zh) word-level data and their sentence-level HTER (Specia et al., 2020) (available at https://github.com/sheffieldnlp/mlqe-pe). As the human-annotated data is sampled from Wikipedia, we choose to synthesize data from WikiMatrix (Schwenk et al., 2019a), which consists of mined Wikipedia parallel data, from which we sample pairs above a LASER (Artetxe and Schwenk, 2019) margin score threshold to ensure high-quality pairs. We note that the original QE data is not a subset of WikiMatrix. The German and Chinese text were tokenized using the Moses (https://github.com/alvations/sacremoses) and Jieba (https://github.com/fxsjy/jieba) tokenizers, respectively. We list the statistics of the filtered WikiMatrix data as well as our resulting synthetic data in Table 1.

For the off-the-shelf QE model, we choose the

Data              English-German            English-Chinese
                  MCC    F1-OK   F1-BAD     MCC    F1-OK   F1-BAD
Human annotation
Improvement (%)   +0.20  -1.40   -0.40      +4.00  +1.83   +2.43

Table 2: Results of word-level QE trained on human-annotated (7k) and synthetic data. Improvement in MCC for En-De and En-Zh shows synthetic data can train word-level models comparable to human-annotated data. We report improvement comparing models trained with human annotation vs. our combined NMT+MLM synthetic data.
Data              English-German             English-Chinese
                  Pearson   MAE     RMSE     Pearson   MAE     RMSE
Human annotation
Improvement (%)   -5.50     +14.7   +9.63    +3.18     -1.79   -1.67

Table 3: Results of sentence-level HTER QE trained on human-annotated and synthetic data. For Pearson, positive improvement is better, while for MAE and RMSE negative is better. We report improvement comparing models trained with human annotation vs. our combined NMT+MLM synthetic data.

multi-task predictor-estimator model (Kim et al., 2017) implemented by OpenKiwi v0.1.3 (Kepler et al., 2019b). This was the top-performing architecture for QE at WMT19 (Kepler et al., 2019a; Fonseca et al., 2019). We train the predictor on parallel MT data provided by the WMT20 QE shared task. The predictor produces contextualized word representations; the estimator passes these features through a 2-layer, 125-dimension bidirectional LSTM (biLSTM) and then feeds the outputs into a 1-layer linear word-level classifier. The first output of the biLSTM is also fed into a multi-layer perceptron to predict a sentence-level score. For multi-task learning, we train the model with both word- and sentence-level data.

For a fair comparison, we take the pretrained predictor provided by the WMT20 QE shared task, fine-tune the whole model on the human-annotated data, and compare results to those when fine-tuning on our synthetic data. We test by comparing model predictions to held-out human-annotated QE labels at the word and sentence level. At the word level, we measure QE performance with the Matthews Correlation Coefficient (MCC) (Matthews, 1975) (main metric), as well as F1 scores for BAD and OK tags. At the sentence level, we measure Pearson's correlation (Benesty et al., 2009), mean absolute error (MAE), and root-mean-square error (RMSE).

As shown in Table 2, for word-level QE, the model trained on synthetic data generated from NMT translations performs comparably to the same model trained on the original 7k human-annotated post-edits. This suggests that having human annotators post-edit each translation to create training data may be unnecessary and that using reference sentences is good enough.
The model trained on the MLM-rewriting synthetic data generally under-performs NMT-generated data on MCC. However, we note that it performs better on F1 for OK tags. Therefore, we also ensemble the two models trained on each set of synthetic data through a linear combination. This yields comparable or better performance than the model trained on human-annotated data according to the main metric, MCC.

In Table 3, we compare the models trained on human-annotated data to our synthetic data for predicting sentence-level HTER scores. Again, our synthetic data from NMT-generated translations outperforms MLM-rewriting data. Both under-perform models trained on human-annotated data, but when combined they significantly improve and even outperform human-annotated data for En-Zh. This once again suggests that the two forms of synthetic data are complementary and provide valuable signals for QE. The results reported in Tables 2 and 3 are evaluated on the test set provided (test20).

Size          MCC     F1-BAD   F1-OK
English-German
English-Chinese
50k           53.07   66.58    83.65
100k          53.88   67.13    84.10
All (189k)    53.42   66.86    84.47

Table 4: Ablation study of synthetic data amounts.
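The two-model ensemble reported in Tables 2 and 3 is a linear combination of the two systems' outputs. A minimal sketch, where the weight `alpha` and the score lists are illustrative rather than the tuned values:

```python
def ensemble(scores_nmt, scores_mlm, alpha=0.5):
    # Linearly interpolate predictions (e.g., sentence-level HTER scores
    # or per-token BAD probabilities) from the QE model trained on
    # NMT-based synthetic data and the one trained on MLM rewrites.
    return [alpha * a + (1 - alpha) * b for a, b in zip(scores_nmt, scores_mlm)]
```

In practice `alpha` would be chosen on a held-out development set; alpha = 1.0 recovers the NMT-data model and alpha = 0.0 the MLM-data model.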
In this section, we further analyze how the quantity of synthetic data impacts performance, and what types of errors are represented in each of the MLM and NMT portions of the synthetic data.
As previously observed, the amount of synthetic data is orders of magnitude larger than the amount of human-annotated data. This begs the question: how much benefit do we get from smaller amounts of synthetic data? To analyze how the quantity of synthetic data affects QE performance, we conduct an ablation study of word-level QE. As shown in Table 4, using only about half of the synthetic data generated (200k for En-De and 100k for En-Zh) is comparable to using the full generated set. While this suggests an upper bound on the performance attainable by training on synthetic data, the ablation also suggests that this synthetic process can yield good performance with even a small amount of synthetic data.
In addition to the performance comparison, we posit that there are essential differences between MLM and NMT synthetic data. To test this, bilingual volunteers qualitatively analyzed the types of mistakes from MLM rewrites vs. traditional NMT translations. (Note that the ablation study above is only trained on word-level data.) The major reported differences in error types are:

1. Deletions from NMT translations appear more natural and do not destroy the sentence fluency. However, deletions in MLM rewrites are more destructive (e.g., "new york restaurants" vs. "new restaurants": the semantics is changed).

2. Most incorrect insertions or deletions from NMT translations are due to re-ordering words (e.g., "on 2020 in california" vs. "in california on 2020"). However, insertions with MLM rewrites introduce seemingly random words.

3. NMT translations often have semantically distant word substitutions, whereas MLM rewrites tend to substitute similar words (e.g., "strong tea" vs. "powerful tea").

In summary, NMT translations and MLM rewrites appear to generate different types of errors: the former leads to more subtle errors while the latter often introduces more catastrophic errors. Since a high-quality QE model should be able to detect both types of errors, ensembling the models trained on these two forms of synthetic data is indeed expected to outperform using only one form of synthetic data.
In this work, we devise a technique for building word- and sentence-level QE models by creating synthetic training data. By training an off-the-shelf model on our synthetic data, we achieve performance comparable to, and often better than, training on human-annotated data. This technique for data synthesis can be invaluable when human annotation is difficult to come by, for example when dealing with low-resource scenarios.

This work can be extended in various ways. While we investigate the scenario of utilizing solely synthetic data, further work can study the effects of augmenting human-labelled data with synthetic data. Further work can also analyze the efficacy of this technique for low-resource language pairs where such human annotation is difficult to obtain. Additionally, instead of a simple MLM rewriter, adversarial training to generate and detect errors could provide more realistic synthetic data.