Automatically Extracting Challenge Sets for Non-local Phenomena in Neural Machine Translation
Leshem Choshen and Omri Abend
School of Computer Science and Engineering, Department of Cognitive Sciences
The Hebrew University of Jerusalem
[email protected], [email protected]
Abstract
We show that the state-of-the-art Transformer MT model is not biased towards monotonic reordering (unlike previous recurrent neural network models), but that nevertheless, long-distance dependencies remain a challenge for the model. Since most dependencies are short-distance, common evaluation metrics will be little influenced by how well systems perform on them. We therefore propose an automatic approach for extracting challenge sets replete with long-distance dependencies, and argue that evaluation using this methodology provides a complementary perspective on system performance. To support our claim, we compile challenge sets for English-German and German-English, which are much larger than any previously released challenge set for MT. The extracted sets are large enough to allow reliable automatic evaluation, which makes the proposed approach a scalable and practical solution for evaluating MT performance on the long tail of syntactic phenomena.

Our extracted challenge sets and codebase are found in https://github.com/borgr/auto_challenge_sets.

The assumption that proximate source words are more likely to correspond to proximate target words has often been introduced as a bias (henceforth, locality bias) into statistical MT systems (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005). While reordering phenomena, abundant for some language pairs, violate this simplifying assumption, it has often proved to be a useful inductive bias in practice, especially when complemented with targeted techniques for addressing non-monotonic translation (e.g., Och, 2002; Chiang, 2005). For example, if an adjective precedes a noun in one language and modifies it syntactically, it is likely that their corresponding words
will appear close to each other in the translation; that is, they may not be immediately adjacent or even in the same order in the translation, but it is unlikely that they will be arbitrarily distant from one another.

In the era of Neural Machine Translation (NMT), such biases are implicitly introduced by the sequential nature of the LSTM architecture (Bahdanau et al., 2015; see §2). The influential Transformer model (Vaswani et al., 2017) replaces the sequential LSTMs with self-attention, which does not seem to possess this bias. We show that the default implementation of the Transformer does retain some bias, but that it can be relieved by using learned positional embeddings (§3).

Long-distance dependencies (LDD) between words and phrases present a long-standing problem for MT (Sennrich, 2016), as they are generally more difficult to detect (indeed, they pose an ongoing challenge for parsing as well (Xu et al., 2009)), and often result in non-monotonic translation if the target differs from the source in terms of its word order and lexicalization patterns. The Transformer's indifference to the absolute position of the tokens raises the question of whether long-distance dependencies are still an open problem. We address this question by proposing an automatic method to compile challenge sets for evaluating system performance on LDD (§4). We distinguish between two main LDD types: (1) reordering LDD, namely cases where source and target words largely correspond to one another but are ordered differently; (2) lexical LDD, where the way a word or a contiguous expression on the target side is translated is dependent on non-adjacent words on the source side.

We define a methodology for extracting both LDD types. For reordering LDD, we build on Birch (2011), whereas for lexical LDD we compile a list of linguistic phenomena that yield LDD, and use a dependency parser to find instances of these phenomena in the source side of a parallel corpus.
As a test case, we apply this method to construct challenge sets (§4.2) for German-English and English-German. The approach can be easily scaled to other languages for which a good enough parser exists.

Experimenting both with RNN and self-attention NMT architectures, we find that although the latter presents no locality bias, LDD remain challenging. Moreover, lexical LDD become increasingly challenging with their distance, suggesting that syntactic distance remains an important determinant of performance in state-of-the-art (SoTA) NMT.

We conclude that evaluating LDD using targeted challenge sets gives a detailed picture of MT performance, and underscores challenges the field has yet to fully address. As particular types of LDD are not frequent enough to significantly affect coarse-grained measures, such as BLEU (Papineni et al., 2002) or TER (Snover et al., 2006), our evaluation approach provides a complementary perspective on system performance.

A common architecture for text-to-text generation tasks is the (Bi)LSTM encoder-decoder (Bahdanau et al., 2015). This architecture consists of several LSTM layers for the encoder and the decoder and a thin attention layer connecting them. The LSTM is a recurrent network with a state vector it updates. At every step, it discards some of the current and past information and aggregates the rest into the state. Any information about the past comes from this state, which is a learned "summary" of the previous states (cf. Greff et al., 2017).
Hence, for information to reach a certain prediction step, it should be stored and then kept throughout the intermediate steps (tokens). While theoretically information could be kept indefinitely (Hochreiter and Schmidhuber, 1997), practical evidence shows that LSTM performance decreases with the distance between the trigger and the prediction (Linzen et al., 2016; Liu et al., 2018), and that LSTMs have difficulties generalizing over sequence lengths (Suzgun et al., 2018).

Despite being affected by absolute distances between syntactically dependent tokens (Linzen et al., 2016), LSTMs tend to learn structural information to a certain extent even without being instructed to do so explicitly (Gulordava et al., 2018). Futrell and Levy (2018) discuss linguistic phenomena similar to those we discuss in §4.2, and show that LSTM encoder-decoder systems handle them better than previous n-gram based systems, despite being profoundly affected by distance.

Transformer (Vaswani et al., 2017) models are also encoder-decoders, but instead of LSTMs, they use self-attention. Self-attention is based on gating all outputs of the previous layer as inputs for the current one; put differently, it aggregates all the input in one step. This approach makes information from all parts of the input sequence equally reachable. While this is not the only architecture with such attributes (van den Oord et al., 2016), we focus on it due to its SoTA results for MT (Lakew et al., 2018). The Transformer's use of self-attention inspired other works in related fields (Devlin et al., 2018), some of which attributed their performance gains to the model's ability to capture long-range context (Müller et al., 2018).

As the Transformer does not aggregate input sequentially, token positions must be represented through other means. For that purpose, the embedding of each input token W is concatenated with an embedding of its position in the source sentence P.
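To make the contrast with sequential aggregation concrete, the self-attention operation described above can be sketched as a minimal single-head version. This is an illustrative sketch, not the paper's implementation: the learned query/key/value projections and the multi-head structure of the real Transformer are omitted.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention over a sequence X of shape (n, d).

    Every output position is a softmax-weighted average over *all* input
    positions, so no position is privileged by its distance from any other.
    """
    d = X.shape[1]
    # Use the inputs themselves as queries, keys and values; a real
    # Transformer first applies learned projections W_Q, W_K, W_V.
    scores = X @ X.T / np.sqrt(d)                  # (n, n) similarity scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X                             # each row mixes all positions
```

Without positional information this operation is permutation-equivariant: permuting the input rows merely permutes the output rows. This is precisely why token positions must be supplied through separate embeddings, as discussed next.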
While positional embeddings can generally be any vectors, two implementations are commonly used (Tebbifakhr et al., 2018; Guo et al., 2018): learned positional embeddings (learnedPEs; P is randomly initialized), and sine positional embeddings (SinePEs), defined as:

P(pos, 2i) = sin(pos / 10000^(2i/dim))
P(pos, 2i+1) = cos(pos / 10000^(2i/dim))

where dim is the dimension of the embedding. Vaswani et al. (2017) report that they see no benefit in learnedPEs, and hence use SinePEs, which have much fewer parameters.

Most of the dependencies between words are short. Short-distance linguistic dependencies include some of the most common phenomena in language, such as determination, modification by an adjective and compounding. For example, 62% of the dependencies in the standard UD EWT training set (Silveira et al., 2014) are between tokens that are up to one word apart. It stands to reason that the locality bias is useful in these cases. Nevertheless, as system quality improves, rarer, more challenging dependencies become a priority, and languages present countless long-distance phenomena (Deng and Xue, 2017). One example is subject-verb agreement, where a correct translation requires that the verb be inflected according to the headword of the subject (e.g., in English "dogs that ..., bark", while "a dog that ..., barks"). When translating such cases, a locality bias may impede performance, by biasing the model not to attend to both the subject's head and the main verb (which may be arbitrarily distant), thereby preventing it from correctly inflecting the main verb.

Due to the benefits of the locality bias, it featured prominently in statistical MT, including in the IBM models, where alignments are constrained not to cross too much (Brown et al., 1993), and in predicting probabilities of reorderings (Koehn et al., 2003; Chiang, 2005).
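The SinePEs defined earlier in this section can be computed with a few lines of code; this is a minimal pure-Python sketch (a real implementation would vectorize it):

```python
import math

def sine_positional_embeddings(max_len, dim):
    """SinePEs as defined above: P(pos, 2i) = sin(pos / 10000^(2i/dim)) and
    P(pos, 2i+1) = cos(pos / 10000^(2i/dim)), following Vaswani et al. (2017)."""
    P = [[0.0] * dim for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, dim, 2):  # i iterates over the even indices, 2i
            angle = pos / (10000 ** (i / dim))
            P[pos][i] = math.sin(angle)
            if i + 1 < dim:
                P[pos][i + 1] = math.cos(angle)
    return P
```

Each dimension pair thus oscillates at its own wavelength, forming a geometric progression over dimensions, which is what lets the model recover relative offsets between positions.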
Difficulties in handling LDD have motivated the development of syntax-based MT (Yamada and Knight, 2001), which can effectively represent reordering at the phrase level, such as when translating between VSO and SOV languages. However, syntax-based MT models remain limited in their ability to map between arbitrarily different word orders (Sun et al., 2009; Xiong et al., 2012). For example, reorderings that violate the assumption that the trees form contiguous phrases would be difficult for most such models to capture. In the next section (§3) we show that the Transformer, when implemented with learnedPEs, presents no locality bias, and hence can, in principle, learn dependencies between any two positions of the source, and use them at any step during decoding.

With major improvements in system performance, crude assessments of performance are becoming less satisfying; that is, evaluation metrics do not give an indication of the performance of MT systems on important challenges for the field (Isabelle and Kuhn, 2018). String-similarity metrics against a reference are known to capture only partial and coarse-grained aspects of the task (Callison-Burch et al., 2006), but are still the common practice in various text generation tasks. Their opaqueness and difficulty to interpret have led to efforts to improve evaluation measures so that they better reflect the requirements of the task (Anderson et al., 2016; Sulem et al., 2018; Choshen and Abend, 2018b), and to increased interest in defining more interpretable and telling measures (Lo and Wu, 2011; Hodosh et al., 2013; Birch et al., 2016; Choshen and Abend, 2018a).

A promising path forward is complementing string-similarity evaluation with linguistically meaningful challenge sets. Such sets have the advantage of being interpretable: they test for specific phenomena that are important for humans and are crucial for language understanding. Interpretability also means that evaluation artefacts are more likely to be detected earlier.
So far, such challenge sets have been constructed for French-English (Isabelle et al., 2017; Isabelle and Kuhn, 2018) and English-Swedish (Ahrenberg, 2018). Previous challenge sets were compiled by manually searching corpora for specific phenomena of interest (e.g., yes-no questions, which are formulated differently in English and French). These corpora are carefully made but are small in size (ten examples per phenomenon), which means that evaluation must be done manually as well.

As our methodology extracts sentences automatically based on parser output, we are able to compile much larger challenge sets, which allows us to apply standard MT measures to each sub-corpus corresponding to a specific phenomenon. The methodology is, therefore, more flexible, and can be straightforwardly adapted to accommodate future advances in MT evaluation.

In this section we show that encoder-decoder models based on BiLSTM with attention (see §2) do exhibit a locality bias, but that the Transformer, whose encoder is based on self-attention, and in which token position is encoded only through learnedPEs, does not present any such bias.
In order to test whether an NMT system presents a locality bias in a controlled environment, we examine a setting of arbitrary absolute order of the source-side tokens. In this case, systems that are predisposed towards monotonic decoding are likely to present lower performance, while systems that have no predisposition as to the order of the target-side tokens relative to the source-side tokens are not expected to show any change in performance. In order to create a controlled setting, where source-side token order is arbitrary, we extract fixed-length sentences, and apply the same permutation to all of them. We then train systems with the permuted source-side data (and the same target-side data), and compare results to a control condition where no permutation is applied. (In WMT 2019, English-German phenomena were tested with a new corpus, using both human and automatic evaluation. It is not possible, however, to use this evaluation outside the competition (Avramidis et al., 2019).)

Concretely, we experiment on a German-English setting, extracting all sentences of the most common length (18) from the WMT2015 (Bojar et al., 2015) training data. This results in 130,983 sentences, of which we hold out 1,000 sentences for testing. It is comparable in training set size to a low-resource language setting.

We set a fixed permutation σ: [18] → [18] and train systems on four versions of the training data (settings): (1) Regular, to be used for control; (2) Permuted source-side, in which we apply σ over all source-side tokens; (3) PerPosEmb, where the positional embeddings of the source-side tokens are permuted; and (4) Reversed, where tokens are input in reverse order. We apply the following permutation, σ, to the source-side tokens (two-line notation; each position in the top row is mapped to the position below it):

( 11  5  9 15  8 14 10  1  3 16 12  2  0  6 17  4 13  7 )
(  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 )
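Applying such a fixed permutation to every source sentence is straightforward. The one-line form of σ below is our reconstruction of the two-line matrix above, and the reading direction (token at position p moves to position SIGMA[p]) is an assumption; the experiment only requires that the same fixed permutation be applied to every sentence.

```python
# One-line form of sigma, reconstructed from the two-line matrix above
# (assumption: the token at source position p moves to position SIGMA[p]).
SIGMA = [11, 5, 9, 15, 8, 14, 10, 1, 3, 16, 12, 2, 0, 6, 17, 4, 13, 7]

def permute_source(tokens, sigma=SIGMA):
    """Build one source sentence of the Permuted training condition."""
    assert len(tokens) == len(sigma), "sentences are fixed to length 18"
    out = [None] * len(tokens)
    for p, tok in enumerate(tokens):
        out[sigma[p]] = tok
    return out
```

Reading the matrix in the opposite direction would simply apply the inverse permutation, which is equally valid for the purpose of the experiment.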
We did not find any property that would deem this permutation special (examining, e.g., its decomposition into cycles). We therefore assume that similar results will hold for other σs as well.

We train a Transformer model, optimizing using Adam (Kingma and Ba, 2015). We set the embedding size to 512, a dropout rate of 0.1, 6 stack layers in both the encoder and the decoder, and 8 attention heads. We use tokenization, truecasing and BPE (Sennrich et al., 2016) as preprocessing, following the same protocol as Yang et al. (2018). We experiment both with learnedPEs and with SinePEs. We train the BiLSTM model using the Nematus implementation (Sennrich et al., 2017b), and use their supplied scripts for preprocessing, training and testing, changing only the datasets used. For all models, we report the highest BLEU score on the test data for any epoch during training, and perform early stopping after 10 consecutive epochs without improvement.

In the Transformer with learnedPEs, 5 repetitions were done in the Regular setting; for the other settings, 5 repetitions for Permuted, 1 for PerPosEmb and 1 for Reversed. In addition, we trained the BiLSTM model and the Transformer with SinePEs, both in the Regular condition and in Permuted; each was trained once. (Formally, if the source sentence is (t_1, ..., t_18), then the input to the Transformer in the PerPosEmb setting is ([W(t_1); P(σ(1))], ..., [W(t_18); P(σ(18))]).)

Table 1: BLEU score for the various Transformer settings on regular and permuted data (differences from Regular are given in brackets). Nematus and the Transformer using SinePEs show decreased performance when permuting the input; the Transformer with learnedPEs does not. Rows correspond to the models used (Model), the positional embeddings fed to the Transformer (Positional), and the order of the input tokens (Setting; see text).
Table 1 presents our results. We find that the Nematus BiLSTM suffers substantially from permuting the source-side tokens, but that the Transformer does not exhibit a locality bias. Indeed, for learnedPEs, in all settings (Regular, Permuted, Reversed and PerPosEmb), BLEU scores are essentially the same. We also find that the common practice of using fixed SinePEs does introduce some bias, as attested by the small performance drop between Regular and Permuted.

Like Vaswani et al. (2017), we find that in the Regular setting, learnedPEs are not superior in performance to SinePEs, despite having more expressive power. However, our results suggest that the decision between learnedPEs and SinePEs is not without consequences: learnedPEs are preferable if a locality bias is undesired (as is potentially the case for highly divergent language pairs).
Finding that Transformers do not present a locality bias has implications for how to construct their input in MT settings, as well as in other tasks that use self-attention encoders, such as image captioning (You et al., 2016). It is common practice to augment the source side with globally-applicable information, e.g., the target language in multilingual MT (Johnson et al., 2017). Having no locality bias implies this additional information can be added at any fixed point in the sequence fed to a Transformer, provided that the positional embeddings do not themselves introduce such a bias. This is not the case with BiLSTMs, which often require introducing the same information at each input token to allow it to be effectively used by the system (Yao et al., 2017; Rennie et al., 2017).
One of the stated motivations of the Transformer model is to effectively tackle long-distance dependencies, which are "a key challenge in many sequence transduction tasks" (Vaswani et al., 2017). Our results from the previous section show that fixed reordering patterns are indeed completely transparent to Transformers. This, however, still leaves the question of how Transformers handle linguistic reordering patterns, which may involve varying distances between dependent tokens.
We propose a method for scalably compiling challenge sets to support fine-grained MT evaluation for different types of LDD. We address two main types:
Reordering LDD are cases where the words on the two sides of the parallel corpus largely correspond to one another, but are ordered differently. These cases may require attending to source words in a highly non-monotonic order, but the generation of each target word is localized to a specific region in the source sentence. For example, in English-German, the verb in a subordinate clause appears in a final position, while the verb in the English source appears right after the subject. Consider "The man that is sitting on the chair", and the corresponding German "Der Mann, der auf dem Stuhl sitzt" (lit. the man, that on the chair sits): while the verb is placed at different clause positions in the two cases, the words mostly have direct correspondents. Our methodology follows Birch (2011) in detecting such phenomena based on alignment. Concretely, we extract a word alignment between corresponding sentences, and collect all sentences that include a pair of aligned words in the source and target sides whose indices have a difference of at least d ∈ N.

Lexical LDD are cases where the translation of a single word or phrase is determined by non-adjacent words on the source side. This requires attending to two or more regions that can be arbitrarily distant from one another. Several phenomena, such as light verbs (Isabelle and Kuhn, 2018), are known from the linguistic and MT literature to yield lexical LDD. Our methodology takes a predefined set of such phenomena, and defines rules for detecting each of them over dependency parses of the source side. See §4.2 for the list of phenomena we experiment on in this paper.

Focusing on LDD, we restrict ourselves to instances where the absolute distance between the word and the dependent is at least d ∈ N.
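The alignment-based criterion for reordering LDD can be sketched as follows. The (source, target, alignment) triple representation is our own simplification, with alignments given as (source_index, target_index) pairs as output by aligners such as FastAlign; it is not the paper's codebase.

```python
def is_reordering_ldd(alignment, d=5):
    """True if some aligned pair's source and target indices differ by >= d,
    the criterion used to select sentences for the reordering-LDD set."""
    return any(abs(s - t) >= d for s, t in alignment)

def filter_reordering_ldd(aligned_corpus, d=5):
    """Keep qualifying sentence pairs from (source, target, alignment) triples."""
    return [(src, tgt) for src, tgt, align in aligned_corpus
            if is_reordering_ldd(align, d)]
```

The threshold d directly controls how non-monotonic the retained sentence pairs must be; d = 5 is the value used for the corpora in §4.2.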
Selecting a large enough d entails that the extracted phenomena are unlikely to be memorized as a phrase with a specific meaning (e.g., encoding "make the whole thing up" [d = 3] as a phrase, rather than as a discontiguous phrase "make ... up" with an argument "the whole thing"). This increases the probability that such cases, if translated correctly, reflect the MT systems' ability to recognize that such discontiguous units are likely to be translated as a single piece.

We note that by extracting the challenge set based on syntactic parses, we by no means assume these representations are internally represented by the MT systems in any way, or assume such a representation is required for correctly translating such constructions. The extraction method is merely a way of finding phenomena we have reason to believe are difficult to translate, and meaningful for language understanding. We use Universal Dependencies (UD; Nivre et al., 2016) as a syntactic representation, due to its cross-lingual consistency (about 90 languages are supported so far), which allows research on difficult LDD phenomena that recur across languages. Our extraction methods resemble previous challenge set approaches (Isabelle et al., 2017; Isabelle and Kuhn, 2018; Ahrenberg, 2018) in using linguistically motivated sets of sentence pairs to assess translation quality. However, as our extraction method is fully automatic, it allows for the compilation of much larger challenge sets over many language pairs. The challenge sets we extract contain hundreds or thousands of pairs (§4.2). The size of the sets allows using any MT evaluation measure to assess performance, and is thus a much more scalable solution than manual inspection, as is commonly done in challenge set approaches.

On the other hand, an automatic methodology has the side-effect of being noisier, and of not necessarily selecting the most representative sentences for each phenomenon. For instance, befinden sich (lit.
to determine) includes a verb and a reflexive pronoun, which do not necessarily appear contiguously in German. However, as befinden always appears with the reflexive sich, it might not pose a challenge to NMT systems, which can essentially ignore the reflexive pronoun upon translation.

Next, we discuss the compilation of German-English and English-German corpora. We select these pairs, as they are among the most studied in MT, and comparatively high results are obtained for them (Bojar et al., 2017). Hence, they are more likely to benefit from a fine-grained analysis.

For the reordering LDD corpus, we align each source and target sentence using FastAlign (Dyer et al., 2013) and collect all sentences with at least one pair of source-side and target-side tokens whose indices have a difference of at least d = 5. For example:

Source: Wäre es ein großer Misserfolg, nicht den Titel in der Ligue 1 zu gewinnen, wie dies in der letzten Saison der Fall war?
Gloss: Would-be it a big failure, not the title in the Ligue 1 to win, as this in the last season the case was?
Target: In Ligue 1, would not winning the title, like last season, be a big failure?
We extract lexical LDD using simple rules over source-side parse trees, parsed with UDPipe (Straka and Straková, 2017). For a sentence to be selected, at least one word should separate the detected pair of words. We picked several well-known challenging constructions for translation that involve discontiguous phrases: reflexive verbs, verb-particle constructions and preposition stranding. We note that while these constructions often yield lexical LDDs, and are thus expected to be challenging on average, some of their instances can be translated literally (e.g., amuse oneself is translated to amüsieren sich).

Reflexive Verbs. Prototypically, reflexivity is the case where the subject and object corefer. Reflexive pronouns in English end with self or selves (e.g., yourselves), and in German include sich, dich, mich and uns, among others. However, reflexive pronouns can often change the meaning of a verb unpredictably, and may thus lead to different translations for non-reflexive instances of a verb, compared to reflexive ones. For example, abheben in German means taking off (as of a plane), but sich abheben means standing out. Similarly, in the example below, drängte sich translates to intrude, while drängte normally translates to pushed.

A source sentence is said to include a reflexive verb if one of its tokens is parsed with a reflexive morphological feature (Reflex=Yes). For example:

Source: [...] es ertragen zu müssen, daß eine unsympathische Fremde sich unaufhörlich in ihren Familienkreis drängte.
Target: [...] to see an uncongenial alien permanently intruded on her own family group.
Phrasal Verbs are verbs that are made up of a verb and a particle (or several particles), which may change the meaning of the verb unpredictably. Examples of English phrasal verbs include run into (in the sense of meet) and give in, and in German they include examples such as einladen (invite), consisting morphologically of the particle ein and the verb laden (load).

A source sentence is said to include a phrasal verb if a particle dependent (UD labels compound:prt or prt) exists in the parse. For instance, trat in itself means stepped, but in the extracted example below, trat ... entgegen translates to received. For example:

Source: [...] ich trat ihm in wahnsinniger Wut entgegen.
Target: [...] I received him in frantic sort.
Preposition Stranding is the case where a preposition does not appear adjacent to the object it refers to. In English, it will often appear at the end of the sentence or a clause, for example, The banana she stepped on or The boy I read the book to. Preposition stranding is common in English and other languages such as the Scandinavian languages or Dutch (Hornstein and Weinberg, 1981). However, in German, it is not part of standard written language (Beermann and Ik-Han, 2005), although it does (rarely) appear (Fanselow, 1983). We therefore extract this challenge set only with English as the source side.

While preposition stranding is often regarded as a syntactic phenomenon, we consider it here a lexical LDD, since the translation of prepositions (and in some cases their accompanying verbs) is dependent on the prepositional object, which in the case of preposition stranding may be distant from the preposition itself. For example, translating the car we looked for into German usually uses the verb suchen (search), while translating the car we looked at does not. Translating prepositions is difficult in general (Hashemi and Hwa, 2014), but preposition stranding is especially so, as there is no adjacent object to assist disambiguation.

A source sentence is said to include preposition stranding if it contains two nodes with an edge of the type obl (oblique) or a subcategory thereof between them, and the UD POS tag of the dependent is adposition (ADP). For example:

Source: [...] wherever she wanted to send the hedgehog to [...]
Gloss: [...] where she the hedgehog rolled-towards wanted [...]
Target: [...] wo sie den Igel hinrollen wollte [...]

Phenomena                 Books    Newstest2013
De↔En Reorder             7,457    306
Baseline (full dataset)   51,467   3,000

Table 2: Sizes of reordering and baseline corpora.

                              All      ≥ ·     ≥ ·     ≥ ·    News
De→En  Particle              8,361   7,584   6,261   4,780    232
       Reflexive            13,207   8,122   5,598   4,226    281
En→De  Particle              4,636     786     111      36     17
       Reflexive             3,225   1,188     460     274     11
       Preposition Stranding   682     191      85      40      8

Table 3: Sizes of lexical LDD corpora. Challenge sets are partitioned (in order of appearance) by the language pair, the phenomenon type, and the minimal distance (≥ ·) between the head and the dependent; the phenomenon appears in the source. The rightmost column gives statistics for the Newstest2013 corpora; the rest are on Books.
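The three source-side detection rules described in this section (a reflexive morphological feature, a particle dependent, and a stranded preposition under an obl edge) can be sketched over a parsed sentence. The token-dict format below is a simplified stand-in for CoNLL-U output (e.g., from UDPipe), not the paper's actual codebase.

```python
def detect_lexical_ldd(tokens, d=1):
    """Detect the three constructions via the rules above.

    `tokens` is a dependency-parsed sentence: a list of dicts with 1-based
    'id', 'head' (0 for the root), 'deprel', 'upos' and a 'feats' dict.
    Returns (head_id, dep_id, label) triples where at least d words
    separate the head from the dependent.
    """
    hits = []
    for tok in tokens:
        if tok["head"] == 0:
            continue
        between = abs(tok["id"] - tok["head"]) - 1  # words strictly between
        if between < d:
            continue
        if tok["feats"].get("Reflex") == "Yes":
            hits.append((tok["head"], tok["id"], "reflexive"))
        if tok["deprel"] in ("compound:prt", "prt"):
            hits.append((tok["head"], tok["id"], "particle"))
        if tok["deprel"].startswith("obl") and tok["upos"] == "ADP":
            hits.append((tok["head"], tok["id"], "prep-stranding"))
    return hits
```

Raising d filters out the adjacent, easily memorized instances, which is exactly how the distance-partitioned challenge sets of §5 are produced.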
We turn to evaluate SoTA NMT performance on the extracted challenge sets.
Experimental Setup.
We trained the Transformer on WMT2015 training data (Bojar et al., 2015); for parameters, see §3.1. For Nematus, we used the non-ensemble pre-trained model from Sennrich et al. (2017a). Each of the test sets, either a baseline or a challenge set, for the Transformer and Nematus used a maximum of 10k and
                              Transformer        Nematus
                              Books    News      Books    News
De→En  Baseline                9.02    28.23     16.26    26.32
       Reorder                 7.16    22.68     13.88    22.73
       Particle                7.52    27.46     15.41    23.98
       Reflexive               8.15    27.84     14.91    27.04
En→De  Baseline                6.33    23.7      12.25    22.03
       Reorder                 4.31    19.4       9.02    20.38
       Particle                5.30    17.83      9.55    16.72
       Reflexive               5.07    15.77      9.97    21.81
       Preposition Stranding   5.37    11.82      9.73     6.27

Table 4: BLEU scores on the challenge sets (minimum distance between head and dependent d ≥ 1). A clear, consistent drop from the Baseline (full corpus) score is observed in all cases. The top part of the table corresponds to German-to-English (De→En) sets, and the bottom part to English-to-German (En→De) sets. Within each part, rows correspond to various linguistic phenomena, including reordering LDD (Reorder), verb-particle constructions (Particle), reflexive verbs (Reflexive) and preposition stranding. Columns correspond to the models (Transformer/Nematus) and the domains (Books/News).
1k sentences per set, respectively.

Two parallel corpora were used for extracting the challenge sets. One is newstest2013 (Bojar et al., 2015), from the news domain, which is commonly used as a development set for English-German. The other is the relatively unused Books corpus (Tiedemann, 2012), from the more challenging domain of literary translation. The corpora contain 3K and 51K sentences, respectively. For lexical LDD, we took the distance (d) between the relevant words to be at least 1, meaning there is at least one word separating them. See Tables 2 and 3 for the sizes of the extracted corpora.

For evaluation, we use the MOSES implementation of BLEU (Papineni et al., 2002; Koehn et al., 2007), and for reordering LDD, also RIBES (Isozaki et al., 2010), which focuses on reordering. RIBES measures the correlation of n-gram ranks between the output and the reference, over n-grams that appear uniquely and in both.

Manual Validation.
To assess the ability of our procedure to extract relevant LDDs, we manually analyzed over 180 source German sentences extracted from Books, and 81 English ones, including all the instances extracted from News and 45 extracted from Books, where instances are evenly distributed between phenomena and distances of exactly 1, 2 or 5. We find that 85% of the German sentences, 87% of the English News sentences and 86% of the Books ones indeed contain the target phenomenon. For details of the manual evaluation of the extraction procedure, see Appendix A. (We subsample a smaller test set for Nematus, since the most competitive model for the language pair requires Theano. As Theano has been deprecated for two years now, it cannot run on our GPUs, which entails long inference times.)

Table 5: The effect of dependency distance for lexical LDDs on SoTA performance. Results are in BLEU over the Books challenge sets. Columns correspond to the minimum distance, where All does not restrict distance (control). The rightmost column presents the Spearman correlation of the phenomenon's score with the minimum distance used. All correlations but one are highly negative, implying that distance has a negative effect on performance.

                  News    Books
German  Baseline   0.82    0.57
        Reorder    0.79    0.54
English Baseline   0.79    0.56
        Reorder    0.77    0.53

Table 6: RIBES scores on the reordering LDD challenge sets. Sentences extracted as being challenging to reorder are harder for the Transformer (lower score). This trend is consistent with our experiments with BLEU. The first column indicates the source language.
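The Spearman correlations in the rightmost column of Table 5 can be computed as below. This is a minimal, tie-free implementation; in practice one would feed it the per-threshold BLEU scores of a phenomenon and the corresponding minimum-distance values.

```python
def spearman(xs, ys):
    """Spearman rank correlation of two equally long sequences of distinct
    values: Pearson correlation computed on their ranks (ties not handled)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A BLEU score that decreases monotonically as the minimum distance grows yields a correlation of exactly -1, which is the pattern Table 5 reports for almost all phenomena.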
Results.
Comparison of the overall BLEU scores of the NMT models (Table 4) against their performance on the challenge sets shows that the phenomena are challenging for both models. Both in the small development set of newstest2013 and in the large set of Books, the challenge subparts are more challenging across the board. For reordering LDD, we further apply RIBES and find a similar trend: the RIBES score is lower for the reorder challenge set than for the baseline (see Table 6).

In order to confirm that the distance between the head and the dependent (the "length" of the dependency) is related to the observed performance drop in the case of lexical LDD, we partition each of the challenge sets according to their length (d), and compare the results to a control condition, where all instances of the phenomena listed in §4.2 are extracted, including non-LDD instances, i.e., sentences where the head and the dependent are adjacent. System performance on the sliced challenge sets (Table 5) shows that performance indeed decreases with d. Results thus indicate that it is not only the presence of the phenomena that makes these sets challenging, but that the challenge increases with the distance.

We validate this main finding using manual annotation of German-to-English cases. Using two annotators (with high agreement between them; κ = 0.79), we find that the decrease in performance with d is replicated. We measure how many of the detected lexical LDDs are correctly translated, ignoring the rest of the source and output, as done in manual challenge set approaches. We find that 60%, 54% and 38% of the cases are translated correctly for d = 1, 2 and 5, respectively. This suggests that the extracted phenomena and the distance indeed pose a challenge, and that the automatic metric we use shows the correct trend in these cases. See Appendix B for details.

Discussion.
Interestingly, these results hold true for the Transformer despite its indifference to the absolute word order. Therefore, word distance in itself is not what makes such phenomena challenging, contrary to what one might expect from the definition of LDD. It seems then that these phenomena are especially challenging due to their non-standard linguistic structure (e.g., syntactic and lexical structure), and the varying distances at which LDDs manifest themselves. The models, therefore, seem unable to learn the linguistic structure underlying these phenomena, which may motivate more explicit modelling of linguistic biases in NMT models, as proposed by, e.g., Eriguchi et al. (2017) and Song et al. (2019).

We note that our experiments were not designed to compare the performance of BiLSTM and self-attention models. We therefore do not see the Transformer's inferior performance on Books, relative to Nematus, as an indication of this model's general ability in out-of-domain settings. What is evident from the results is that translating Books is a challenge in itself, probably due to the register of the language and the presence of frequent non-literal translations.

A potential confound is sentence length: performance was reported to decrease with source length in BiLSTMs (Carpuat et al., 2013; Murray and Chiang, 2018), but to increase in Transformers (Zhang et al., 2018). Length is generally greater in the challenge sets than in the full test set, and generally increases with d, so our results show, if anything, a decrease of performance with length. To assess whether our corpora are challenging due to a length bias, we randomly sample from Books 1,000 corpora with 1,000, 100 and 10 sentences each. The correlation between their corresponding average length and the Transformer's BLEU score on them was 0.06, 0.09 and 0.03 respectively.
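The length-bias control just described can be sketched as follows. This is a hypothetical illustration: `score_fn` stands in for translating the subsample and scoring it with BLEU against its references, and all names are ours:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def length_score_correlation(sentences, score_fn, n_corpora=1000,
                             corpus_size=100, seed=0):
    """Correlate the average sentence length of random subsamples with
    their score, as in the control experiment above."""
    rng = random.Random(seed)
    lengths, scores = [], []
    for _ in range(n_corpora):
        sample = rng.sample(sentences, corpus_size)
        lengths.append(sum(len(s.split()) for s in sample) / corpus_size)
        scores.append(score_fn(sample))
    return pearson(lengths, scores)
```

A correlation near zero, as reported above, indicates that average length alone does not predict the system's score on a subsample.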
While this suggests length is not a strong predictor of performance, to verify that the difficulty is not a result of the distribution of lengths in the challenge sets, we conduct another experiment. For each challenge set and each value of d (0–3), we sample 100 corpora: for each sentence in a given challenge set, we sample a sentence whose length differs by no more than 1. This results in corpora with a similar length distribution, but sampled from the overall population of Books sentences. Results show that the BLEU score of the challenge sets in all German-to-English cases is lower than that of any randomly sampled corpus. In the English-German cases, trends are similar, albeit less pronounced. This may be due to the low number of long English sentences, which leads to more homogeneous samples. Overall, results suggest that length is extremely unlikely to be the only cause for the observed trends.

Most sampled corpora actually had better scores than the baseline; we believe this is because very short sentences, which are mostly noise, are never sampled.
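The length-matched sampling used in this experiment can be sketched as follows (the names are illustrative and not taken from the released codebase):

```python
import random
from collections import defaultdict

def length_matched_sample(challenge, pool, tolerance=1, seed=0):
    """For each challenge sentence, draw a pool sentence whose length
    differs by at most `tolerance` tokens, yielding a control corpus
    with a similar length distribution."""
    rng = random.Random(seed)
    by_len = defaultdict(list)
    for s in pool:
        by_len[len(s.split())].append(s)
    control = []
    for s in challenge:
        n = len(s.split())
        candidates = [c for d in range(-tolerance, tolerance + 1)
                      for c in by_len[n + d]]
        if candidates:  # skip sentences with no close-length match
            control.append(rng.choice(candidates))
    return control
```

Comparing the challenge set's score against the score distribution of many such controls isolates the effect of the phenomena from the effect of sentence length.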
Conclusion. As NMT system performance is constantly improving, more reliable methods for identifying and classifying their failures are needed. Much research effort is therefore devoted to developing more fine-grained and interpretable evaluation methods, including challenge-set approaches. In this paper, we showed that, using a UD parser, it is possible to extract challenge sets that are large enough to allow scalable MT evaluation of important and challenging phenomena.

An accumulating body of research is devoted to the ability of modern neural architectures such as LSTMs (Linzen et al., 2016) and pretrained embeddings (Hewitt and Manning, 2019; Liu et al., 2019; Jawahar et al., 2019) to represent linguistic features. This paper contributes to this literature by confirming that the Transformer model can indeed be made indifferent to the absolute order of the words, but also by showing that this does not entail that the model can overcome the difficulties of LDD in naturalistic data. We may carefully conclude that, despite the remarkable feats of current NMT models, inducing linguistic structure in its more evasive and challenging instances is still beyond the reach of state-of-the-art NMT, which motivates exploring more linguistically-informed models.
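As a concrete illustration of the parser-based extraction step, the following sketch selects sentences from CoNLL-U parses (e.g., UDPipe output) in which a target dependency spans at least a given number of intervening words. The relation shown (`compound:prt`, phrasal-verb particles) is only one of the phenomena covered in the paper, and the function name is ours, not from the released codebase:

```python
def extract_ldd_sentences(conllu_text, relation="compound:prt", min_dist=1):
    """Select sentences containing a dependency of the given UD relation
    whose head and dependent have at least `min_dist` words between them."""
    selected = []
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#") and "\t" in line]
        # keep only basic word lines (integer IDs; skip ranges like "1-2")
        rows = [r for r in rows if r[0].isdigit()]
        words = [r[1] for r in rows]
        for r in rows:
            head, deprel = int(r[6]), r[7]
            if deprel == relation and head > 0:
                between = abs(int(r[0]) - head) - 1  # words strictly in between
                if between >= min_dist:
                    selected.append(" ".join(words))
                    break
    return selected
```

For example, "He turned the lights off" (particle two words away from its verb) is selected, while "He turned off the lights" (adjacent particle) is not, which is exactly the distinction the lexical LDD challenge sets rely on.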
Acknowledgments. This work was supported by the Israel Science Foundation (grant no. 929/17).
References
Lars Ahrenberg. 2018. A challenge set for English-Swedish machine translation. In SLTC.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Eleftherios Avramidis, Vivien Macketanz, Ursula Strohriegel, and Hans Uszkoreit. 2019. Linguistic evaluation of German-English machine translation using a test suite. In Proceedings of the Fourth Conference on Machine Translation (WMT-2019), August 1-2, Florence, Italy. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Dorothee Beermann and Lars Ik-Han. 2005. Preposition stranding and locative adverbs in German. Organizing Grammar, 86:31.

Alexandra Birch. 2011. Reordering Metrics for Statistical Machine Translation. Ph.D. thesis, The University of Edinburgh.

Alexandra Birch, Omri Abend, Ondřej Bojar, and Barry Haddow. 2016. HUME: Human UCCA-based evaluation of machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1264–1274.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In WMT@EMNLP.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In EACL.

Marine Carpuat, Lucia Specia, and Dekai Wu. 2013. Proceedings of SSST-7, Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation. In EMNLP 2014.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 263–270. Association for Computational Linguistics.

Leshem Choshen and Omri Abend. 2018a. Automatic metric validation for grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Leshem Choshen and Omri Abend. 2018b. Reference-less measure of faithfulness for grammatical error correction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Dun Deng and Nianwen Xue. 2017. Translation divergences in Chinese–English machine translation: An empirical investigation. Computational Linguistics, 43(3):521–565.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In ACL.

Gisbert Fanselow. 1983. Zu einigen Problemen von Kasus, Rektion und Bindung in der deutschen Syntax. Universität Konstanz.

Richard Futrell and Roger P. Levy. 2018. Do RNNs learn human-like abstract word order preferences? CoRR, abs/1811.01866.

Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In NAACL-HLT.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2018. Star-Transformer. arXiv preprint arXiv:1902.09113.

Homa B. Hashemi and Rebecca Hwa. 2014. A comparison of MT errors and ESL errors. In LREC, pages 2696–2700.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Norbert Hornstein and Amy Weinberg. 1981. Case theory and preposition stranding. Linguistic Inquiry, 12(1):55–91.

Pierre Isabelle, Colin Cherry, and George F. Foster. 2017. A challenge set approach to evaluating machine translation. In EMNLP.

Pierre Isabelle and Roland Kuhn. 2018. A challenge set for French→English machine translation. arXiv preprint arXiv:1806.02725.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. 2018. A comparison of Transformer and recurrent neural networks on multilingual neural machine translation. In COLING.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Nelson F. Liu, Omer Levy, Roy Schwartz, Chenhao Tan, and Noah A. Smith. 2018. LSTMs exploit linguistic attributes of data. In Rep4NLP@ACL.

Chi-kiu Lo and Dekai Wu. 2011. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 220–229. Association for Computational Linguistics.

Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In WMT.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In WMT.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proc. of LREC, pages 1659–1666.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, Bibliothek der RWTH Aachen.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In SSW.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Rico Sennrich. 2016. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. CoRR, abs/1612.04629.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017a. The University of Edinburgh's neural MT systems for WMT17. In WMT.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017b. Nematus: A toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, volume 200.

Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, and Jinsong Su. 2019. Semantic neural machine translation using AMR. arXiv preprint arXiv:1902.07282.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. Semantic structural evaluation for text simplification. In NAACL-HLT.

Jun Sun, Min Zhang, and Chew Lim Tan. 2009. A non-contiguous tree sequence alignment-based model for statistical machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 914–922. Association for Computational Linguistics.

Mirac Suzgun, Yonatan Belinkov, and Stuart M. Shieber. 2018. On evaluating the generalization of LSTM models in formal languages. arXiv preprint arXiv:1811.01001.

Amirhossein Tebbifakhr, Ruchit Agrawal, Matteo Negri, and Marco Turchi. 2018. Multi-source transformer with combined losses for automatic post-editing. In WMT.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Modeling the translation of predicate-argument structure for SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pages 902–911. Association for Computational Linguistics.

Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 245–253. Association for Computational Linguistics.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1346–1355. Association for Computational Linguistics.

Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. Pages 4904–4912.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659.

Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In ACL.

Appendix
A Manual Evaluation of Extraction Procedure
Manual evaluation of the sentences extracted using our procedure was performed by two proficient annotators (authors of this paper), one for each source language. These include 180 source German sentences extracted from Books, and 81 English sentences including all the instances extracted from News and 45 extracted from Books. Within Books, sentences are distributed uniformly across phenomena and d values d ∈ {1, 2, 5}.

In German, phrasal verbs are detected with high precision: 96% of the sentences indeed include a phrasal verb LDD. For reflexive verbs, 63% include reflexive verbs with a distance of at least 1, and two thirds of the remaining cases (25%) include verbal non-reflexive pronouns with d ≥ 1 (in German, some pronouns may be used both as reflexive and non-reflexive). While non-reflexive verbal pronouns are not lexical LDDs (as they can mostly be translated word by word), they do challenge the system to disambiguate them from reflexive verbs. Our analysis in the next section shows that the extracted non-reflexive cases present similar trends to the reflexive cases.

In English (Table 7), detection precision is high for reflexive and phrasal verbs. Preposition stranding detection precision is lower. However, wrongly extracted examples mostly involve prepositional objects that are elided or difficult to detect. We therefore consider the difficulties such cases pose to be sufficiently similar to the ones posed by preposition stranding.

B Manual MT Performance Evaluation
Using the same sample of 180 sentences used for German detection (see Manual Validation in the paper), we analyze the performance of the Transformer using in-house annotators. One annotator (an author of the paper), proficient in English and German, was presented with the German source in which the relevant tokens were marked. The annotator was asked to locate and mark the corresponding part in the English reference. Places in which the gold translation did not contain a translation of the phenomenon, usually due to alignment errors in the corpus or complete omission by the human translator, were removed from the analysis. Then, two annotators (a different author and a non-author), proficient in English, were asked to judge whether the Transformer output conveys the meaning marked in the reference. Inter-annotator agreement was computed to be κ = 0.79.

Results (Table 8) show a decrease in performance when increasing the distance d. With reflexive verbs, this effect is smaller between d = 1 and d = 2. However, looking at each category separately (reflexive or non-reflexive pronouns) shows that performance decreases with d in all cases (Table 9).

                        News   Books
Reflexive               0.91   0.87
Preposition Stranding   0.75   0.60
Particle                0.94   1.00

Table 7: Ratio of extracted sentences that indeed present the target lexical LDD in English. Rows correspond to lexical LDD types, and columns correspond to corpora.
Table 8: Results of manual annotation of translation quality per lexical LDD phenomenon in German-to-English translation with the Transformer. Left: number of sentences annotated for each type. Right: accuracy (ratio of cases deemed to be correctly translated). Columns correspond to the distance d, as judged by the annotators. Numbers reported are after removing extraction errors and disagreements.

Table 9: Results of manual annotation of translation quality per sub-type of reflexive verbs in German-to-English translation with the Transformer. Left: number of sentences annotated for each type. Right: accuracy (ratio of cases deemed to be correctly translated). Columns correspond to the distance d.