Translationese as a Language in "Multilingual" NMT
Parker Riley♦∗, Isaac Caswell♠, Markus Freitag♠, David Grangier♠♦
♦University of Rochester
♠Google Research
∗Work done while at Google Research.
Abstract
Machine translation has an undesirable propensity to produce "translationese" artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target text? There is no data with original source and original target, so we train a sentence-level classifier to distinguish translationese from original target text, and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to produce more natural outputs at test time, yielding gains in human evaluation scores on both adequacy and fluency. Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU score, increasing it while decreasing human-rated quality. We analyze these outputs using metrics measuring the degree of translationese, and present an analysis of the volatility of heuristic-based train-data tagging.

Introduction

"Translationese" is a term that refers to artifacts present in text that was translated into a given language that distinguish it from text originally written in that language (Gellerstam, 1986). These artifacts include lexical and word order choices that are influenced by the source language (Gellerstam, 1996) as well as the use of more explicit and simpler constructions (Baker et al., 1993).

These differences between translated and original text mean that the direction in which parallel data (bitext) was translated is potentially important for machine translation (MT) systems.
Figure 1: Illustration of MT train+test parallel data, organized into quadrants based on whether the source or target is translated or original.

Most parallel data is either source-original (the source was translated into the target) or target-original (the target was translated into the source), though sometimes neither side is original because both were translated from a third language. Figure 1 illustrates the four possible combinations of translated and original source and target data.

Recent work has examined the impact of translationese in MT evaluation, using the WMT evaluation campaign as the most prominent example. From 2014 through 2018, WMT test sets were constructed such that 50% of the sentence pairs are source-original (upper right quadrant of Figure 1) and the rest are target-original (lower left quadrant). Toral et al. (2018), Zhang and Toral (2019), and Graham et al. (2019) have examined the effect of this testing setup on MT evaluation, and have all argued that target-original test data should not be included in future evaluation campaigns because the translationese source is too easy to translate. While target-original test data does have the downside of a translationese source side, recent work has also shown that human raters prefer MT output that is closer in distribution to original target text than translationese (Freitag et al., 2019). This indicates that the target side of test data should also be original (upper left quadrant of Figure 1); however, it is unclear how to produce high-quality test data (let alone training data) that is simultaneously source- and target-original.

Because of this lack of original-to-original sentence pairs, we frame this as a zero-shot translation task, where translationese and original text are distinct languages or domains. We adapt techniques from zero-shot translation with multilingual models (Johnson et al., 2016), where the training pairs are tagged with a reserved token corresponding to the domain of the target side: translationese or original text. Tagging is helpful when the training set mixes data of different types by allowing the model to 1) see each pair's type in training to preserve distinct behaviors and avoid regressing to a mean/dominant prediction across data types, and 2) elicit different behavior in inference, i.e. providing a tag at test time yields predictions resembling a specific data type. We then investigate what happens when the input is an original sentence in the source language and the model's output is also biased to be original, a scenario never observed in training.

Tagging in this fashion is not trivial, as most MT training sets do not annotate which pairs are source-original and which are target-original (Europarl (Koehn, 2005) is a notable exception, but it is somewhat small and not in the news domain), so in order to distinguish them we train binary classifiers to distinguish original and translated target text.

Finally, we perform several analyses of tagging these "languages" and demonstrate that tagged back-translation (Caswell et al., 2019) can be framed as a simplified version of our method, and thereby improved by targeted decoding.

Our contributions are as follows:

1. We propose two methods to train translationese classifiers using only monolingual text, coupled with synthetic text produced by machine translation.

2. Using only original→translationese and translationese→original training pairs, we apply techniques from zero-shot multilingual MT to enable original→original translation.

3.
We demonstrate with human evaluations that this technique improves translation quality, both in terms of fluency and adequacy.
4. We show that biasing the model to instead produce translationese outputs inflates BLEU scores while harming quality as measured by human evaluations.
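The tagging mechanism itself is simple to illustrate. Below is a minimal sketch, not the paper's actual pipeline: the tag string, the data format, and the classifier stand-in are all illustrative assumptions. It shows how a reserved source-side token can mark pairs whose target side a classifier judges to be original, and how the same token then acts as a control knob at inference time.

```python
# Sketch: tagging parallel data for domain-controlled NMT.
# Assumptions (not from the paper): the reserved token string and the
# `classify_target_is_original` stand-in for the trained classifier.

TAG = "<2original>"  # reserved token; must also exist in the NMT vocabulary

def classify_target_is_original(target_sentence: str) -> bool:
    """Stand-in for the trained translationese classifier."""
    raise NotImplementedError

def tag_training_pair(source: str, target: str) -> tuple[str, str]:
    # Prepend the reserved token when the target side looks original,
    # so the model can associate the tag with original-style output.
    if classify_target_is_original(target):
        return f"{TAG} {source}", target
    return source, target

def prepare_inference_input(source: str, want_original: bool) -> str:
    # At test time the tag requests original-style (natural) output;
    # omitting it requests translationese-style output.
    return f"{TAG} {source}" if want_original else source
```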
Method

Motivated by prior work detailing the importance of distinguishing translationese from original text (Kurokawa et al., 2009; Lembersky et al., 2012; Toral et al., 2018; Zhang and Toral, 2019; Graham et al., 2019; Freitag et al., 2019; Edunov et al., 2019) as well as work in zero-shot translation (Johnson et al., 2016), we hypothesize that performance on the source-original translation task can be improved by distinguishing target-original and target-translationese examples in the training data and constructing an NMT model to perform zero-shot original→original translation.

Because most MT training sets do not annotate each sentence pair's original language, we train a binary classifier to predict whether the target side of a pair is original text in that language or translated from the source language. This follows several prior works attempting to identify translations (Kurokawa et al., 2009; Koppel and Ordan, 2011; Lembersky et al., 2012).

To train the classifier, we need target-language text annotated by whether it is original or translated. We use News Crawl data from WMT as target-original data. It consists of news articles crawled from the internet, so we assume that most of them are not translations. Getting translated data is trickier; most human-translated pairs where the original language is annotated are only present in test sets, which are generally small. To sidestep this, we choose to use machine translation as a proxy for human translationese, based on the assumption that they are similar. This allows us to create classifier training data using only unannotated monolingual data. We propose two ways of doing this: using forward translation (FT) or round-trip translation (RTT). Both are illustrated in Figure 2.

To generate FT data, we take source-language News Crawl data and translate it into the target language using a machine translation model trained on WMT training bitext. We can then train a classifier to distinguish the generated text from monolingual target-language text.

One potential problem with the FT data set is that the original and translated pairs may differ not only in the respects we care about (i.e. translationese), but also in content. Taking English→French as an example language pair, one could imagine that certain topics are more commonly reported on in original English language news than in French, and vice versa, e.g. news about American or French politics, respectively. The words and phrases representing those topics could then act as signals to the classifier to distinguish the original language.

Figure 2: Illustration of data set creation for the FT and RTT translationese classifiers. The Source→Target and Target→Source nodes represent NMT systems.

To address this, we also experiment with RTT data. For this approach we take target-language monolingual data and round-trip translate it with two machine translation models (target→source and then source→target), resulting in another target-language sentence that should contain the same content as the original sentence, alleviating the concern with FT data. Here we hope that the noise introduced by round-trip translation will be similar enough to human translationese to be useful for our downstream task.

In both settings, we use the trained binary classifier to detect and tag training bitext pairs where the classifier predicted that the target side is original.
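As a concrete illustration of the two strategies, the following sketch builds labeled classifier examples from monolingual data only. The `translate_src_to_tgt` and `translate_tgt_to_src` helpers are hypothetical stand-ins for the WMT-bitext-trained NMT systems; this is not the authors' code.

```python
# Sketch: building translationese-classifier training data from
# monolingual corpora only. The translate functions are hypothetical
# stand-ins for NMT systems trained on WMT bitext.

def translate_src_to_tgt(sentence: str) -> str:
    raise NotImplementedError

def translate_tgt_to_src(sentence: str) -> str:
    raise NotImplementedError

def ft_examples(src_mono: list[str], tgt_mono: list[str]):
    # Forward translation (FT): machine-translated source-language news is
    # labeled translationese (0); target-language news is labeled original (1).
    data = [(translate_src_to_tgt(s), 0) for s in src_mono]
    data += [(t, 1) for t in tgt_mono]
    return data

def rtt_examples(tgt_mono: list[str]):
    # Round-trip translation (RTT): each target sentence is paired with its
    # own round-trip, so both classes share the same content and the
    # classifier cannot key on topic alone.
    data = []
    for t in tgt_mono:
        data.append((t, 1))                                    # original
        data.append((translate_src_to_tgt(translate_tgt_to_src(t)), 0))
    return data
```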
Data

We perform our experiments on WMT18 English→German bitext and WMT15 English→French bitext. We use WMT News Crawl for monolingual data (2007-2017 for German and 2007-2014 for French). We filter out sentences longer than 250 subwords (see Section 3.2 for the vocabulary used) and remove pairs whose length ratio is greater than 2. This results in about 5M pairs for English→German. We do not filter the English→French bitext, resulting in 41M sentence pairs.

For monolingual data, we deduplicate and filter sentences with more than 70 tokens or 500 characters. For the experiments described later in Section 5.3, this monolingual data is back-translated with a target-to-source translation model; after doing so, we remove any sentence pairs where the back-translated source is longer than 75 tokens or 550 characters. This results in 216.5M sentences for English→German (of which we only use 24M at a time) and 39M for English→French. As a final step, we use an in-house language identification tool based on the publicly-available Compact Language Detector 2 (https://github.com/CLD2Owners/cld2) to remove all pairs with the incorrect source or target language. This was motivated by observing that some training pairs had the incorrect language on one side, including cases where both sides were the same; Khayrallah and Koehn (2018) found that this type of noise is especially harmful to neural models.

The classifiers were trained on the target language monolingual data in addition to either an equal amount of source language monolingual data machine-translated into the target language (for the FT classifiers) or the same target sentences round-trip translated through the source language with MT (for the RTT classifiers). In both cases, the MT models were trained only with WMT bitext.

The models used to generate the synthetic data have BLEU (Papineni et al., 2002) performance as follows on newstest2014/full: German→English 31.8; English→German 28.5; French→English 39.2; English→French 40.6. Here and elsewhere, we report BLEU scores with SacreBLEU (Post, 2018); see Section 3.3.

Both language pairs considered in this work are high-resource. While translationese is a potential concern for all language pairs, in low-resource settings it is overshadowed by general quality concerns stemming from the lack of training data. We leave for future work the application of these techniques to low-resource language pairs.
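The bitext filters above are simple to state in code. Here is a minimal sketch of the pair-level filtering, assuming subword segmentation has already been applied; the thresholds are the ones quoted above.

```python
# Sketch: bitext length/ratio filtering as described above.
# `src` and `tgt` are lists of subword tokens.

MAX_SUBWORDS = 250
MAX_LEN_RATIO = 2.0

def keep_pair(src: list[str], tgt: list[str]) -> bool:
    # Drop overly long sentences.
    if len(src) > MAX_SUBWORDS or len(tgt) > MAX_SUBWORDS:
        return False
    # Drop pairs whose lengths are too mismatched (ratio > 2 either way).
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if shorter == 0 or longer / shorter > MAX_LEN_RATIO:
        return False
    return True
```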
Architecture and Training

Our NMT models use the transformer-big architecture (Vaswani et al., 2017) implemented in lingvo (Shen et al., 2019b) with a shared source-target byte-pair-encoding (BPE) vocabulary (Sennrich et al., 2016b) of 32k types. To stabilize training, we use exponentially weighted moving average (EMA) decay (Buduma and Locascio, 2017).

Language   Classifier   Bitext % Orig.   BT % Orig.
French     FT           47%              84%
French     RTT          30%              68%
German     FT           22%*             82%
German     RTT          29%*             70%

Table 1: Percentage of training data where the target side was classified as original. English→German pairs with predicted original German (marked with a *) were upsampled to balance both bitext subsets' sizes.
Checkpoints were picked by best dev BLEU on a set consisting of a tagged and untagged version of every input.

For the translationese classifier, we trained a three-layer CNN-based classifier optimized with Adagrad. We picked checkpoints by F1 on the development set, which was newstest2015 for English→German and a subset of newstest2013 containing 500 English-original and 500 French-original sentence pairs for English→French. We found that the choice of architecture (RNN/CNN) and hyperparameters did not make a substantial difference in classifier accuracy.
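The paper does not publish the classifier code; the following is a minimal sketch of a comparable three-layer CNN text classifier in PyTorch, trained with Adagrad as described above. All layer sizes and hyperparameters here are assumptions for illustration, not the authors' settings.

```python
# Sketch: a small three-layer CNN sentence classifier (PyTorch),
# optimized with Adagrad. All sizes/hyperparameters are illustrative.
import torch
import torch.nn as nn

class TranslationeseCNN(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, channels: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Three stacked 1-D convolutions over the token dimension.
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.out = nn.Linear(channels, 2)  # original vs. translationese

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> logits: (batch, 2)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, emb, seq)
        x = self.convs(x).max(dim=2).values        # global max pooling
        return self.out(x)

model = TranslationeseCNN(vocab_size=32000)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
tokens = torch.randint(1, 32000, (8, 50))   # (batch, seq_len)
labels = torch.randint(0, 2, (8,))          # 1 = original, 0 = translated
optimizer.zero_grad()
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()
```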
Evaluation

We report BLEU (Papineni et al., 2002) scores with SacreBLEU (Post, 2018) and include the identification string to facilitate comparison with future work. We also run human evaluations for the best performing systems (Section 4.3).

(SacreBLEU signature: BLEU+case.mixed+lang.LANGUAGE PAIR+num-refs.1+smooth.exp+test.SET+tok.intl+version.1.2.15)
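For reference, corpus-level BLEU with the sacrebleu Python package looks like the sketch below; the sentences are made up for illustration, and the exact configuration the authors used is given by the signature string above.

```python
# Sketch: corpus BLEU with the sacrebleu package (pip install sacrebleu).
# Sentences here are invented for illustration.
import sacrebleu

hypotheses = ["The cat sat on the mat.", "He plays the piano well."]
# One reference stream; sacrebleu accepts several for multi-reference BLEU.
references = [["The cat sat on the mat.", "He plays piano very well."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```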
Test set →     Src-Orig         Trg-Orig         Both
Decode →       Nt.      Tr.     Tr.      Nt.     Match
Match? →       ✗        ✓       ✗        ✓       ✓

a. En→Fr: Avg. newstest20{ }
Untagged       –        –       –        –       –
FT clf.        –        –       –        –       –
RTT clf.       38.0     39.4    43.2     44.1    41.8

b. En→De: Avg. newstest20{ }
Untagged       –        –       –        –       –
FT clf.        28.3     36.0    29.4     29.8    33.6
RTT clf.       32.3     36.2    –        –       –

Table 2: Average BLEU for models trained on (a) WMT 2014 English→French bitext and (b) WMT 2018 English→German bitext, tagged according to target side classifier predictions. The tag controls the output domain: translationese ("Tr.") or original/natural text ("Nt."). Matching output and test domains ("Match?" row) for both halves ("Both" column) achieves the highest combined BLEU.
Results

Before evaluating the usefulness of our translationese classifiers for the downstream task of machine translation, we can first evaluate how accurate they are at distinguishing original text from human translations. We use WMT test sets for this evaluation, because they consist of source-original and target-original sentence pairs in equal number. For French, the FT classifier scored . F1 and the RTT classifier scored . on newstest2014/full. For German, the FT classifier achieved . F1 and the RTT classifier scored . on newstest2015. We note that while the FT classifiers perform reasonably well, the RTT classifiers are less effective. This result is in line with prior work by Kurokawa et al. (2009), who trained an SVM classifier on French sentences to detect translations from English. They used word n-gram features for their classifier and achieved 0.77 F1, but were worried about a potential content effect and so also trained a classifier where nouns and verbs were replaced with corresponding part-of-speech (POS) tags, achieving 0.69 F1. Note that they tested on the Canadian Hansard corpus (containing Canadian parliamentary transcripts in English and French) while we tested on WMT test sets, so the numbers are not directly comparable, but it is interesting to see the similar trends in comparing content-aware and content-unaware versions of the same method. We also point out that Kurokawa et al. (2009) both trained and tested with human-translated sentences, while we trained our classifiers with machine-translated sentences while still testing on human-translated data.

The portion of our data classified as target-original by each classifier is reported in Table 1.
Table 2a shows the BLEU scores of three models all trained on WMT 2014 English→French bitext. They differ in how the data was partitioned: either it wasn't, or tags were applied to those sentence pairs with a target side that a classifier predicted to be original French. We first note that the model trained on data tagged by the round-trip translation (RTT) classifier performs slightly worse than the baseline. However, the model trained with data tagged by the forward translation (FT) classifier is able to achieve an improvement of 0.5 BLEU on both halves of the test set when biased toward translationese on the source-original half and original text on the target-original half. This, coupled with the observation that the BLEU score on the source-original half sharply drops when adding the tag, indicates that the two halves of the test set represent quite different tasks, and that the model has learned to associate the tag with some aspects specific to generating original text as opposed to translationese.

However, we were not able to replicate this positive result on the English→German language pair (Table 2b). Interestingly, in this scenario the relative ordering of the FT and RTT models is reversed, with the German RTT-trained model outperforming the FT-trained one. This is also interesting because the German FT classifier achieved a higher F1 score than the French one, indicating that a classifier's performance alone is not a sufficient indicator of its effect on translation performance. One possible explanation for the negative result is that the English→German bitext only contains 5M pairs, as opposed to the 41M for English→French, so splitting the data into two portions could make it difficult to learn both portions' output distributions properly.
Human Evaluation

In the previous subsection, we saw that BLEU for the source-original half of the test set went down when the model trained with FT classifications (FT clf.) was decoded as if it were target-original (Table 2a). Prior work has shown that BLEU has a low correlation with human judgments when the reference contains translationese but the system output is biased toward original/natural text (Freitag et al., 2019). This is the very situation we find ourselves in now. Consequently, we run a human evaluation to see if the output truly is more natural and thereby preferred by human raters, despite the loss in BLEU. We run both a fluency and an adequacy evaluation for English→French to compare the quality of this system when decoding as if source-original vs. target-original. We also compare the system with the Untagged baseline. All evaluations are conducted with bilingual speakers whose native language is French, and each output is rated by 3 different raters, with the average taken as the final score. Our two evaluations are as follows:

• Adequacy: Raters were shown only the source sentence and the model output. Each output was scored on a 6-point scale.

• Fluency: Raters saw two target sentences (two models' outputs) without the source sentence, and were asked to select which was more fluent, or whether they were equally good.

Fluency human evaluation results are shown in Table 3.

Test set →   Src-Orig
Tagging ↓    Decode     BLEU    % Preferred
Untagged     -          –       –
FT clf.      Transl.    –       –
FT clf.      Natural    –       –

Table 3: Fluency side-by-side human evaluation for WMT English→French newstest2014/full (Table 2a). We evaluate only the source-original half of the test set because it corresponds to our goal of original→original translation. Despite a BLEU drop, humans rate the natural decode on average as more fluent than both the bitext model output and the same model with the translationese decode.

We measured inter-rater agreement using Fleiss' Kappa (Fleiss, 1971), which attains a maximum value of 1 when raters always agree. This value was 0.24 for the comparison with the untagged baseline, and 0.16 for the comparison with the translationese decodes. The agreement levels are fairly low, indicating a large amount of subjectivity for this task. However, raters on average still indicated a preference for the FT clf. model's natural decodes. This provides evidence that they are more fluent than both the translationese decodes from the same model and the baseline untagged model, despite the drop in BLEU compared to each.

Adequacy human ratings are summarised in Table 4. Both decodes from the FT clf. model scored significantly better than the baseline. This is especially true of the natural decodes, demonstrating that the model does not suffer a loss in adequacy by generating more fluent output, and actually sees a significant gain. We hypothesize that splitting the data as we did here allowed the model to learn a sharper distribution for both portions, thereby increasing the quality of both decode types. Some additional evidence for this is the fact that the FT clf. model's training loss was consistently lower than that of the baseline.

Test set →   Src-Orig
Tagging ↓    Decode     BLEU    Adequacy
Untagged     -          43.9    4.51
FT clf.      Transl.    –       –**
FT clf.      Natural    –       –

Table 4: Human evaluation of adequacy for WMT English→French on the source-original half of newstest2014/full. Humans rated each output separately on a 6-point scale. As with fluency (Table 3), the natural decode scores the best, despite a BLEU loss. The single and double asterisks indicate that the adequacy value is significantly greater than the first row's value at significance level α = 0. and α = 0. , respectively, according to a one-tailed paired t-test. The difference between the second and third rows was not significant at α = 0. .
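Fleiss' Kappa is straightforward to compute; below is a small sketch using statsmodels. The rating matrix is invented for illustration (three raters giving two-way preference judgments), not the paper's data.

```python
# Sketch: inter-rater agreement with Fleiss' Kappa via statsmodels.
# Ratings are invented: 10 items x 3 raters, categories
# 0 = prefer system A, 1 = equally good, 2 = prefer system B.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(10, 3))  # (items, raters)

# Convert per-rater labels to per-item category counts, then score.
table, _ = aggregate_raters(ratings, n_cat=3)
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.2f}")
```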
Measuring Translationese

Translationese tends to be simpler, more standardised and more explicit (Baker et al., 1993) compared to original text and can retain typical characteristics of the source language (Toury, 2012). Toral (2019) proposed metrics attempting to quantify the degree of translationese present in a translation. Following their work, we quantify lexical simplicity with two metrics: lexical variety and lexical density. We also calculate the length variety between the source sentence and the generated translations to measure interference from the source.
An output is simpler when it uses a lower number of unique tokens/words. By generating output closer to original target text, our hope is to increase lexical variety. Lexical variety is calculated as the type-token ratio (TTR):
$$\mathrm{TTR} = \frac{\text{number of types}}{\text{number of tokens}} \tag{1}$$

Scarpa (2006) found that translationese tends to be lexically simpler and have a lower percentage of content words (adverbs, adjectives, nouns and verbs) than original written text. Lexical density is calculated as follows:

$$\text{lexical density} = \frac{\text{number of content words}}{\text{number of total words}} \tag{2}$$

Both MT and humans tend to avoid restructuring the source sentence and stick to sentence structures popular in the source language. This results in a translation with similar length to that of the source sentence. By measuring the length variety, we measure interference in the translation because its length is guided by the source sentence's structure. We compute the normalized absolute length difference at the sentence level and average the scores over the test set of source-target pairs (x, y):

$$\text{length variety} = \frac{\big||x| - |y|\big|}{|x|} \tag{3}$$

Results for all three translationese measurements are shown in Table 5.

Test set →   Src-Orig
Tagging ↓    Decode     Lex. Var.   Lex. Density   Len. Var.
Untagged     -          0.258       0.393          0.246
FT clf.      Transl.    0.255       0.396          –
FT clf.      Natural    –           –              –

Table 5: Measuring the degree of translationese for WMT English→French newstest2014/full on the source-original half. Higher lexical variety, lexical density, and length variety indicate less translationese output.
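These three measures are easy to reproduce. The sketch below computes them for whitespace-tokenized text; the content-word test is delegated to a hypothetical `is_content_word` predicate, since the paper does not specify its POS tagger.

```python
# Sketch: the three translationese measures defined in Eqs. (1)-(3).
# Tokenization is naive whitespace splitting; `is_content_word` is a
# hypothetical stand-in for a POS-based content-word test.

def is_content_word(token: str) -> bool:
    """Stand-in: true for adverbs, adjectives, nouns, and verbs."""
    raise NotImplementedError

def type_token_ratio(tokens: list[str]) -> float:
    # Eq. (1): unique types over total tokens.
    return len(set(tokens)) / len(tokens)

def lexical_density(tokens: list[str]) -> float:
    # Eq. (2): content words over total words.
    return sum(is_content_word(t) for t in tokens) / len(tokens)

def length_variety(src: str, hyp: str) -> float:
    # Eq. (3): normalized absolute length difference, per sentence.
    x, y = len(src.split()), len(hyp.split())
    return abs(x - y) / x

def corpus_length_variety(pairs: list[tuple[str, str]]) -> float:
    # Averaged over the test set, as in the paper.
    return sum(length_variety(s, h) for s, h in pairs) / len(pairs)
```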
Lexical Variety: Using the tag to decode as natural text (i.e. more like original target text) increases lexical variety. This is expected as original sentences tend to use a larger vocabulary.
Lexical Density: We also increase lexical density when decoding as natural text. In other words, the model has a higher percentage of content words in its output, which is an indication that it is more like original target-language text.
Length Variety: Unlike the previous two metrics, decoding as natural text does not lead to a more "natural" (i.e. larger) average length variety. One reason may be related to the fact that this is the only metric that also depends on the source sentence: since all of our training pairs feature translationese on either the source or target side, both the tagged and untagged training pairs will feature similar sentence structures, so the model never fully learns to produce different structures. This further illustrates the problem of the lack of original→original training data noted in the introduction.

Heuristic Tagging

Rather than tagging training data with a trained classifier, as explored in the previous sections, it might be possible to tag using much simpler heuristics, and achieve a similar effect. We explore two options here.
Here, we partition the training pairs (x, y) according to a simple length ratio |x|/|y|. We use a threshold ρ̂_length empirically calculated from two large monolingual corpora, M_x and M_y:

$$\hat{\rho}_{\text{length}} = \frac{\tfrac{1}{|M_x|}\sum_{x_i \in M_x} |x_i|}{\tfrac{1}{|M_y|}\sum_{y_i \in M_y} |y_i|} \tag{4}$$

For English→French, we found ρ̂_length = 0. , meaning that original French sentences tend to have more tokens than English. We tag all pairs with length ratio greater than ρ̂_length (49.8% of the training bitext). Based on the discussion in Section 5.1.3, we expect that |x|/|y| ≈ . indicates translationese, so in this case the tag should mean "produce translationese" instead of "produce original text."

As a second heuristic, we tag examples with a target-side lexical density of greater than 0.5, which means that the target is more likely to be original than translationese. Please refer to Section 5.1.2 for an explanation of this metric.
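A sketch of the length-ratio threshold and both heuristic taggers follows. Whitespace tokenization is assumed again; corpora are lists of token lists, and the density function is the `lexical_density` sketch from the previous section.

```python
# Sketch: heuristic tagging by length ratio (Eq. 4) or lexical density.
# `src_mono` / `tgt_mono` are large monolingual corpora as lists of
# token lists.

def mean_length(corpus: list[list[str]]) -> float:
    return sum(len(sent) for sent in corpus) / len(corpus)

def length_ratio_threshold(src_mono, tgt_mono) -> float:
    # Eq. (4): ratio of mean source length to mean target length.
    return mean_length(src_mono) / mean_length(tgt_mono)

def tag_by_length_ratio(src: list[str], tgt: list[str], rho_hat: float) -> bool:
    # Tag pairs whose length ratio exceeds the corpus-level threshold.
    return len(src) / len(tgt) > rho_hat

def tag_by_lexical_density(tgt: list[str], density_fn) -> bool:
    # `density_fn` is the lexical_density function from the earlier sketch.
    # Tag pairs whose target side looks original (density > 0.5).
    return density_fn(tgt) > 0.5
```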
Table 6 shows the results for this experiment, compared to the untagged baseline and the classifier-tagged model from Table 2a. This table specifically looks at the effect of controlling whether the output should feature more or less translationese on each subset of the test set. We see that the lexical density tagging approach yields expected results, in that the tag can be used to effectively increase BLEU on the target-original portion of the test set. The length-ratio tagging, however, has the opposite effect: producing shorter outputs ("decode as if translationese") produces higher target-original BLEU and lower source-original BLEU. We speculate that this data partition has accidentally picked up on some artifact of the data.

Two interesting observations from Table 6 are that 1) both heuristic tagging methods perform much more poorly than the classifier tagging method on both test set halves, and 2) all varieties of tagging produce large performance changes (up to -7.2 BLEU). This second observation highlights that tagging can be powerful, and dangerous when it does not correspond well with the desired feature.

Test set →             Src-Orig   Src-Orig   Trg-Orig   Trg-Orig
Decode as if →         Natural    Transl.    Transl.    Natural
∴ Domain match? →      ✗          ✓          ✗          ✓
Train data tagging ↓
Untagged               –          –          –          –
FT clf.                –          –          –          –
Length Variety         38.2       36.1       43.6       36.2
Lex. Density           36.9       36.7       41.2       43.4

Table 6: Comparing heuristic- and classifier-based tagging. BLEU scores are averaged for newstest2014/full and newstest2015 English→French. The trained classifier outperforms both heuristics, and length-ratio tagging has the reverse effect from what we expect.

Tagged Back-Translation

We also investigated whether using a classifier to tag training data improved model performance in the presence of back-translated (BT) data. Caswell et al. (2019) introduced tagged back-translation (TBT), where all back-translated pairs are tagged and no bitext pairs are. They experimented with decoding the model with a tag ("as-if-back-translated") but found it harmed BLEU score. However, in our early experiments we discovered that doing this actually improved the model's performance on the target-original portion of the test set, while harming it on the source-original half. Thus, we frame TBT as a heuristic method for identifying target-original pairs: the monolingual data used for the back-translations is assumed to be original, and the target side of the bitext is assumed to be translated. We wish to know whether we can find a better tagging scheme for the combined BT+bitext data, based on a classifier or some other heuristic.

Results for English→French models trained with BT data are presented in Table 7a. While combining the bitext classified by the FT classifier with all-tagged BT data yields a minor gain of 0.2 BLEU over the TBT baseline of Caswell et al. (2019), the other methods do not beat the baseline. This indicates that assuming all of the target monolingual data to be original is not as harmful as the error introduced by the classifiers.

English→German results are presented in Table 7b. Combining the bitext classified by the RTT classifier with all-tagged BT data matched the performance of the TBT baseline, but none of the models outperformed it. This is expected, given the poor performance of the bitext-only models for this language pair.
Test set →                        Src-Orig           Trg-Orig           Combined
Decode as if →                    Natural   Transl.  Transl.   Natural  Both
∴ Domain match? →                 ✗         ✓        ✗         ✓        ✓
Bitext tagging ↓  BT tagging ↓

a. English→French: Avg. newstest20{ }
Untagged          All Tagged      38.4      40.8     47.5      49.8     45.5
FT clf.           All Tagged      –         –        –         –        –
FT clf.           FT clf.         38.2      –        –         –        –

b. English→German: Avg. newstest20{ }
Untagged          All Tagged      33.5      37.3     36.7      37.1     –
FT clf.           All Tagged      33.4      37.2     36.2      –        –
RTT clf.          RTT clf.        31.6      35.7     –         –        –

Table 7: Average BLEU scores for models trained on (a) WMT 2018 English→French bitext plus 39M back-translated monolingual sentences, and (b) WMT 2018 English→German bitext plus 24M back-translated monolingual sentences. As before, we tag by heuristics and/or classifier predictions on the target side.
Qualitative Examples

In Table 8, we show example outputs for WMT English→French comparing the Untagged baseline with the FT clf. natural decodes. In the first example, avec suffisamment d'art is an incorrect word-for-word translation, as the French word art cannot be used in that context. Here the word habilement, which is close to "skilfully" in English, sounds more natural. In the second example, libre d'impôt is the literal translation of "tax-free", but French documents rarely use it; they prefer pas imposable, meaning "not taxable".

Source     Sorry she didn't phrase it artfully enough for you.
Untagged   Désolée, elle ne l'a pas formulé avec suffisamment d'art pour vous.
FT clf.    Désolé elle ne l'a pas formulé assez habilement pour vous.

Source     Your first 10,000 is tax free.
Untagged   Votre première tranche de 10 000 est libre d'impôt.
FT clf.    La première tranche de 10 000 n'est pas imposable.

Table 8: Example English→French output comparing the untagged baseline with the FT clf. natural decode.
Related Work

The effects of translationese on MT training and evaluation have been investigated by many prior authors (Kurokawa et al., 2009; Lembersky et al., 2012; Toral et al., 2018; Zhang and Toral, 2019; Graham et al., 2019; Freitag et al., 2019; Edunov et al., 2019; Freitag et al., 2020). Training classifiers to detect translationese has also been done (Kurokawa et al., 2009; Koppel and Ordan, 2011; Shen et al., 2019a). Similarly to this work, Kurokawa et al. (2009) used their classifier to preprocess MT training data; however, they completely removed target-original pairs. In contrast, Lembersky et al. (2012) used both types of data (without explicitly distinguishing them with a classifier), and used entropy-based measures to cause their phrase-based system to favor phrase table entries with target phrases that are more similar to a corpus of translationese than original text. In this work, we combine aspects from each of these: we train a classifier to partition the training data, and use both subsets to train a single model with a mechanism allowing control over the degree of translationese to produce in the output. We also show with human evaluations that source-original test sentence pairs result in BLEU scores that do not correlate well with translation quality when evaluating models trained to produce more original output.
In addition to the methods in Caswell et al. (2019), tagging training data and using the tags to control output is a technique that has been growing in popularity. Tags on the source sentence have been used to indicate target language in multilingual models (Johnson et al., 2016), formality level in English→Japanese (Yamagishi et al., 2016), politeness in English→German (Sennrich et al., 2016a), gender from a gender-neutral language (Kuczmarski and Johnson, 2018), as well as to produce domain-targeted translation (Kobus et al., 2016). Shu et al. (2019) use tags at training and inference time to increase the syntactic diversity of their output while maintaining translation quality; similarly, Agarwal and Carpuat (2019) and Marchisio et al. (2019) use tags to control the reading level (e.g. simplicity/complexity) of the output. Overall, tagging can be seen as domain adaptation (Freitag and Al-Onaizan, 2016; Luong and Manning, 2015).
Conclusion

We have demonstrated that translationese and original text can be treated as separate target languages in a "multilingual" model, distinguished by a classifier trained using only monolingual and synthetic data. The resulting model has improved performance in the ideal, zero-shot scenario of original→original translation, as measured by human evaluation of adequacy and fluency. However, this is associated with a drop in BLEU score, indicating that better automatic evaluation is needed.

Acknowledgments
We are grateful to the anonymous reviewers for suggesting useful additions.
References
Swetha Agarwal and Marine Carpuat. 2019. Controlling Text Complexity in Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Mona Baker, Gill Francis, and Elena Tognini-Bonelli. 1993. Corpus Linguistics and Translation Studies: Implications and Applications, chapter 2. John Benjamins Publishing Company, Netherlands.

Nikhil Buduma and Nicholas Locascio. 2017. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly Media, Inc.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, and Michael Auli. 2019. On the evaluation of machine translation systems trained with back-translation. arXiv preprint arXiv:1908.05204.

J.L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast Domain Adaptation for Neural Machine Translation. CoRR, abs/1612.06897.

Markus Freitag, Isaac Caswell, and Scott Roy. 2019. APE at Scale and Its Implications on MT Evaluation Biases. In Proceedings of the Fourth Conference on Machine Translation, pages 34–44, Florence, Italy. Association for Computational Linguistics.

Markus Freitag, David Grangier, and Isaac Caswell. 2020. BLEU might be Guilty but References are not Innocent.

Martin Gellerstam. 1986. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup.

Martin Gellerstam. 1996. Translations as a source for cross-linguistic studies. Lund Studies in English, 88:53–62.

Yvette Graham, Barry Haddow, and Philipp Koehn. 2019. Translationese in machine translation evaluation. CoRR, abs/1906.09833.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. CoRR, abs/1611.04558.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics.

Catherine Kobus, Josep Maria Crego, and Jean Senellart. 2016. Domain Control for Neural Machine Translation. CoRR, abs/1612.06140.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Moshe Koppel and Noam Ordan. 2011. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1318–1326, Stroudsburg, PA, USA. Association for Computational Linguistics.

James Kuczmarski and Melvin Johnson. 2018. Gender-aware natural language translation. Technical Disclosure Commons.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. 2009. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2012. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 255–265, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79.

Kelly Marchisio, Jialiang Guo, Cheng-I Lai, and Philipp Koehn. 2019. Controlling the reading level of machine translation output. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 193–203, Dublin, Ireland. European Association for Machine Translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. arXiv preprint arXiv:1804.08771.

Federica Scarpa. 2006. Corpus-based quality-assessment of specialist translation: A study using parallel and comparable corpora in English and Italian. Insights into specialized translation–linguistics insights. Bern: Peter Lang, pages 155–172.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Jiajun Shen, Peng-Jen Chen, Matt Le, Junxian He, Jiatao Gu, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. 2019a. The source-target domain mismatch problem in machine translation. arXiv preprint arXiv:1909.13151.

Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara N. Sainath, Yuan Cao, et al. 2019b. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling. CoRR, abs/1902.08295.

Raphael Shu, Hideki Nakayama, and Kyunghyun Cho. 2019. Generating Diverse Translations with Sentence Codes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Antonio Toral. 2019. Post-editese: an exacerbated translationese. CoRR, abs/1907.00900.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.

Gideon Toury. 2012. Descriptive Translation Studies and Beyond: Revised Edition, volume 100. John Benjamins Publishing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Hayahide Yamagishi, Shin Kanouchi, Takayuki Sato, and Mamoru Komachi. 2016. Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 203–210.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets.