Linguistic Input Features Improve Neural Machine Translation
Rico Sennrich and Barry Haddow
School of Informatics, University of Edinburgh
[email protected], [email protected]

Abstract
Neural machine translation has recently achieved impressive results, while using little in the way of external linguistic information. In this paper we show that the strong learning capability of neural MT models does not make linguistic features redundant; they can be easily incorporated to provide further improvements in performance. We generalize the embedding layer of the encoder in the attentional encoder-decoder architecture to support the inclusion of arbitrary features, in addition to the baseline word feature. We add morphological features, part-of-speech tags, and syntactic dependency labels as input features to English↔German and English→Romanian neural machine translation systems. In experiments on WMT16 training and test sets, we find that linguistic input features improve model quality according to three metrics: perplexity, BLEU, and CHRF3. An open-source implementation of our neural MT system is available (https://github.com/rsennrich/nematus), as are sample files and configurations (https://github.com/rsennrich/wmt16-scripts).

1 Introduction

Neural machine translation has recently achieved impressive results (Bahdanau et al., 2015; Jean et al., 2015), while learning from raw, sentence-aligned parallel text and using little in the way of external linguistic information. (Linguistic tools are most commonly used in preprocessing, e.g. for Turkish segmentation (Gülçehre et al., 2015).) However, we hypothesize that various levels of linguistic annotation can be valuable for neural machine translation. Lemmatisation can reduce data sparseness, and allow inflectional variants of the same word to explicitly share a representation in the model. Other types of annotation, such as parts-of-speech (POS) or syntactic dependency labels, can help in disambiguation. In this paper we investigate whether linguistic information is beneficial to neural translation models, or whether their strong learning capability makes explicit linguistic features redundant.

Let us motivate the use of linguistic features using examples of actual translation errors by neural MT systems. In translation out of English, one problem is that the same surface word form may be shared between several word types, due to homonymy or word formation processes such as conversion. For instance, close can be a verb, adjective, or noun, and these different meanings often have distinct translations into other languages. Consider the following English→German example:

1. We thought a win like this might be close.
2. Wir dachten, dass ein solcher Sieg nah sein könnte.
3. *Wir dachten, ein Sieg wie dieser könnte schließen.
For the English source sentence in Example 1 (our translation in Example 2), a neural MT system (our baseline system from Section 4) mistranslates close as a verb, and produces the German verb schließen (Example 3), even though close is an adjective in this sentence, which has the German translation nah. Intuitively, part-of-speech annotation of the English input could disambiguate between verb, noun, and adjective meanings of close.

As a second example, consider the following German→English example:

4. Gefährlich ist die Route aber dennoch.
   (gloss: dangerous is the route but still)
5. However the route is dangerous.
6. *Dangerous is the route, however.
German main clauses have a verb-second (V2) word order, whereas English word order is generally SVO. The German sentence (Example 4; English reference in Example 5) topicalizes the predicate gefährlich 'dangerous', putting the subject die Route 'the route' after the verb. Our baseline system (Example 6) retains the original word order, which is highly unusual in English, especially for prose in the news domain. A syntactic annotation of the source sentence could support the attentional encoder-decoder in learning which words in the German source to attend to (and translate) first.

We will investigate the usefulness of linguistic features for the language pair German↔English, considering the following linguistic features:

- lemmas
- subword tags (see Section 3.2)
- morphological features
- POS tags
- dependency labels

The inclusion of lemmas is motivated by the hope for a better generalization over inflectional variants of the same word form. The other linguistic features are motivated by disambiguation, as discussed in our introductory examples.

2 Neural Machine Translation

We follow the neural machine translation architecture by Bahdanau et al. (2015), which we will briefly summarize here. The neural machine translation system is implemented as an attentional encoder-decoder network with recurrent neural networks.

The encoder is a bidirectional neural network with gated recurrent units (Cho et al., 2014) that reads an input sequence $x = (x_1, \dots, x_m)$ and calculates a forward sequence of hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_m)$, and a backward sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_m)$. The hidden states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are concatenated to obtain the annotation vector $h_j$.

The decoder is a recurrent neural network that predicts a target sequence $y = (y_1, \dots, y_n)$. Each word $y_i$ is predicted based on a recurrent hidden state $s_i$, the previously predicted word $y_{i-1}$, and a context vector $c_i$, which is computed as a weighted sum of the annotations $h_j$.
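As an illustrative sketch (not the authors' Theano implementation), the bidirectional encoder's annotation vectors can be computed with plain NumPy; a simple tanh RNN stands in for the GRU, and all sizes and weights below are invented for the example:

```python
import numpy as np

def rnn_states(X, W, U, h0):
    """Run a simple tanh RNN over embedded inputs X (seq_len x m)."""
    states, h = [], h0
    for x in X:
        h = np.tanh(W @ x + U @ h)  # one recurrent step
        states.append(h)
    return states

def encode(X, fwd, bwd, n):
    """Annotation vectors h_j = [forward state; backward state]."""
    h0 = np.zeros(n)
    forward = rnn_states(X, *fwd, h0)
    backward = rnn_states(X[::-1], *bwd, h0)[::-1]  # reverse input, re-align states
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

m, n, seq = 4, 3, 5  # embedding size, hidden units, sentence length (toy values)
rng = np.random.default_rng(0)
X = rng.normal(size=(seq, m))
fwd = (rng.normal(size=(n, m)), rng.normal(size=(n, n)))
bwd = (rng.normal(size=(n, m)), rng.normal(size=(n, n)))
H = encode(X, fwd, bwd, n)
print(len(H), H[0].shape)  # one annotation vector of size 2n per source position
```

Each source position thus receives a context-sensitive vector combining left-to-right and right-to-left information, which the attention mechanism later weights.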
The weight of each annotation $h_j$ is computed through an alignment model $\alpha_{ij}$, which models the probability that $y_i$ is aligned to $x_j$. The alignment model is a single-layer feedforward neural network that is learned jointly with the rest of the network through backpropagation.

A detailed description can be found in (Bahdanau et al., 2015), although our implementation is based on a slightly modified form of this architecture, released for the dl4mt tutorial (https://github.com/nyu-dl/dl4mt-tutorial). Training is performed on a parallel corpus with stochastic gradient descent. For translation, a beam search with small beam size is employed.

2.1 Adding Input Features

Our main innovation over the standard encoder-decoder architecture is that we represent the encoder input as a combination of features (Alexandrescu and Kirchhoff, 2006). We here show the equation for the forward states of the encoder (for the simple RNN case; consider (Bahdanau et al., 2015) for GRU):

$$\overrightarrow{h}_j = \tanh\left(\overrightarrow{W} E x_j + \overrightarrow{U}\,\overrightarrow{h}_{j-1}\right) \quad (1)$$

where $E \in \mathbb{R}^{m \times K_x}$ is a word embedding matrix, $\overrightarrow{W} \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U} \in \mathbb{R}^{n \times n}$ are weight matrices, with $m$ and $n$ being the word embedding size and number of hidden units, respectively, and $K_x$ being the vocabulary size of the source language. We generalize this to an arbitrary number of features $|F|$:

$$\overrightarrow{h}_j = \tanh\left(\overrightarrow{W} \left( \big\Vert_{k=1}^{|F|} E_k x_{jk} \right) + \overrightarrow{U}\,\overrightarrow{h}_{j-1}\right) \quad (2)$$

where $\Vert$ denotes vector concatenation, $E_k \in \mathbb{R}^{m_k \times K_k}$ are the feature embedding matrices, with $\sum_{k=1}^{|F|} m_k = m$, and $K_k$ is the vocabulary size of the $k$th feature. In other words, we look up separate embedding vectors for each feature, which are then concatenated. The length of the concatenated vector matches the total embedding size, and all other parts of the model remain unchanged.

3 Linguistic Input Features

Our generalized model of the previous section supports an arbitrary number of input features. In this paper, we will focus on a number of well-known linguistic features.
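The feature-embedding generalization of Equation (2) can be sketched in a few lines of NumPy. This is a hedged illustration, not the released code: the feature names, vocabulary sizes, and embedding sizes below are invented, and a plain tanh RNN step replaces the GRU:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature setup; only the *structure* mirrors Equation (2).
vocab_sizes = {"word": 10, "pos": 5, "subword_tag": 4}   # K_k per feature
embed_sizes = {"word": 6, "pos": 3, "subword_tag": 3}    # m_k per feature
m = sum(embed_sizes.values())                            # total embedding size
n = 8                                                    # hidden units

# One embedding matrix E_k (m_k x K_k) per feature, plus RNN weights.
E = {k: rng.normal(size=(embed_sizes[k], vocab_sizes[k])) for k in vocab_sizes}
W = rng.normal(size=(n, m))
U = rng.normal(size=(n, n))

def forward_state(feature_ids, h_prev):
    """h_j = tanh(W [E_1 x_j1 ; ... ; E_F x_jF] + U h_{j-1})."""
    embedded = np.concatenate([E[k][:, i] for k, i in feature_ids.items()])
    return np.tanh(W @ embedded + U @ h_prev)

h = np.zeros(n)
for ids in [{"word": 3, "pos": 1, "subword_tag": 0},
            {"word": 7, "pos": 4, "subword_tag": 2}]:
    h = forward_state(ids, h)
print(h.shape)  # hidden state size is unchanged: (8,)
```

Because the per-feature embeddings are concatenated to the same total size $m$, the rest of the network is untouched; only the embedding lookup changes.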
Our main empirical question is whether providing linguistic features to the encoder improves the translation quality of neural machine translation systems, or whether the information emerges from training encoder-decoder models on raw text, making its inclusion via explicit features redundant. All linguistic features are predicted automatically; we use Stanford CoreNLP (Toutanova et al., 2003; Minnen et al., 2001; Chen and Manning, 2014) to annotate the English input for English→German, and ParZu (Sennrich et al., 2013) to annotate the German input for German→English. We here discuss the individual features in more detail.
3.1 Lemma

Using lemmas as input features guarantees sharing of information between word forms that share the same base form. In principle, neural models can learn that inflectional variants are semantically related, and represent them as similar points in the continuous vector space (Mikolov et al., 2013). However, while this has been demonstrated for high-frequency words, we expect that a lemmatized representation increases data efficiency; low-frequency variants may even be unknown to word-level models. With character- or subword-level models, it is unclear to what extent they can learn the similarity between low-frequency word forms that share a lemma, especially if the word forms are superficially dissimilar. Consider the following two German word forms, which share the lemma liegen 'lie':

- liegt 'lies' (3.p.sg. present)
- läge 'lay' (3.p.sg. subjunctive II)

The lemmatisers we use are based on finite-state methods, which ensures a large coverage, even for infrequent word forms. We use the Zmorge analyzer for German (Schmid et al., 2004; Sennrich and Kunz, 2014), and the lemmatiser in the Stanford CoreNLP toolkit for English (Minnen et al., 2001).

3.2 Subword Tags

In our experiments, we operate on the level of subwords to achieve open-vocabulary translation with a fixed symbol vocabulary, using a segmentation based on byte-pair encoding (BPE) (Sennrich et al., 2016c). We note that in BPE segmentation, some symbols are potentially ambiguous, and can either be a separate word, or a subword segment of a larger word. Also, text is represented as a sequence of subword units with no explicit word boundaries, but word boundaries are potentially helpful to learn which symbols to attend to, and when to forget information in the recurrent layers. We propose an annotation of subword structure similar to the popular IOB format for chunking and named entity recognition, marking whether a symbol in the text forms the beginning (B), inside (I), or end (E) of a word. A separate tag (O) is used if a symbol corresponds to the full word.
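The B/I/E/O annotation can be derived deterministically from a BPE-segmented token stream. The sketch below assumes the subword-nmt convention of marking continued units with a trailing "@@"; it is an illustration, not the released preprocessing script:

```python
def subword_tags(tokens):
    """Assign B/I/E/O word-structure tags to BPE-segmented tokens.
    A trailing '@@' (the subword-nmt convention) marks a unit that is
    continued by the next token."""
    tags, word_start = [], True
    for tok in tokens:
        continued = tok.endswith("@@")
        if word_start and continued:
            tags.append("B")   # first piece of a multi-piece word
        elif word_start:
            tags.append("O")   # token is a full word on its own
        elif continued:
            tags.append("I")   # middle piece
        else:
            tags.append("E")   # last piece of a multi-piece word
        word_start = not continued
    return tags

tokens = "Le@@ oni@@ das beg@@ ged in the arena .".split()
print(subword_tags(tokens))
# ['B', 'I', 'E', 'B', 'E', 'O', 'O', 'O', 'O']
```

The output matches the subword-tag row of Figure 1 for the same sentence.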
3.3 Morphological Features

For German→English, the parser annotates the German input with morphological features. Different word types have different sets of features (for instance, nouns have case, number and gender, while verbs have person, number, tense and aspect), and features may be underspecified. We treat the concatenation of all morphological features of a word, using a special symbol for underspecified features, as a string, and treat each such string as a separate feature value.
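The construction of one feature string per word can be sketched as follows; the key names, value tags, and the "*" underspecification symbol are illustrative assumptions, not the ParZu output format:

```python
def morph_string(morph, keys, unspecified="*"):
    """Concatenate a word's morphological features into a single string,
    inserting a special symbol for underspecified features. Each distinct
    string is treated as one feature value. (Key and value names are
    hypothetical.)"""
    return "|".join(morph.get(k, unspecified) for k in keys)

noun_keys = ["case", "number", "gender"]
print(morph_string({"case": "nom", "number": "sg", "gender": "fem"}, noun_keys))
# nom|sg|fem
print(morph_string({"number": "sg"}, noun_keys))
# *|sg|*   (case and gender underspecified)
```

Treating each string as an atomic value keeps the feature vocabulary small (1400 values for German in Table 1) at the cost of not sharing parameters between partially overlapping feature bundles.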
3.4 POS Tags and Dependency Labels

In our introductory examples, we motivated POS tags and dependency labels as possible disambiguators. Each word is associated with one POS tag, and one dependency label. The latter is the label of the edge connecting a word to its syntactic head, or 'ROOT' if the word has no syntactic head.
3.5 On Using Word-level Features in a Subword Model

We segment rare words into subword units using BPE. The subword tags encode the segmentation of words into subword units, and need no further modification. All other features are originally word-level features. To annotate the segmented source text with features, we copy the word's feature value to all its subword units. An example is shown in Figure 1.
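Copying word-level feature values to subword units is a simple alignment step; the following sketch uses a toy stand-in for a real BPE segmenter (the dictionary and function names are invented for the example):

```python
def copy_features(words, word_features, segment):
    """Copy each word's feature value to all of its BPE subword units.
    `segment` maps a word to its list of subword units."""
    subwords, features = [], []
    for word, feat in zip(words, word_features):
        units = segment(word)
        subwords.extend(units)
        features.extend([feat] * len(units))  # replicate the word-level feature
    return subwords, features

# Toy segmenter: splits two rare words, leaves everything else intact.
toy_bpe = {"Leonidas": ["Le@@", "oni@@", "das"], "begged": ["beg@@", "ged"]}
segment = lambda w: toy_bpe.get(w, [w])

words = ["Leonidas", "begged", "in", "the", "arena", "."]
lemmas = ["Leonidas", "beg", "in", "the", "arena", "."]
subwords, lemma_feats = copy_features(words, lemmas, segment)
print(subwords)
print(lemma_feats)
```

Running this reproduces the word and lemma rows of Figure 1: the lemma Leonidas is repeated for all three subword units of the word.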
4 Evaluation

We evaluate our systems on the WMT16 shared translation task English↔German. The parallel training data consists of about 4.2 million sentence pairs.

Figure 1: Original dependency tree for the sentence "Leonidas begged in the arena .", and our feature representation after BPE segmentation (the tree drawing is not reproduced here):

word-level annotation:
words | Leonidas | begged | in   | the | arena | .
POS   | NNP      | VBD    | IN   | DT  | NN    | .
dep   | nsubj    | root   | prep | det | pobj  | root

after BPE segmentation:
words        | Le:      | oni:     | das      | beg: | ged  | in   | the | arena | .
lemmas       | Leonidas | Leonidas | Leonidas | beg  | beg  | in   | the | arena | .
subword tags | B        | I        | E        | B    | E    | O    | O   | O     | O
POS          | NNP      | NNP      | NNP      | VBD  | VBD  | IN   | DT  | NN    | .
dep          | nsubj    | nsubj    | nsubj    | root | root | prep | det | pobj  | root

To enable open-vocabulary translation, we encode words via joint BPE (Sennrich et al., 2016c; https://github.com/rsennrich/subword-nmt), learning 89,500 merge operations on the concatenation of the source and target side of the parallel training data. We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10,000 minibatches via BLEU and perplexity on a validation set (newstest2013).

For neural MT, perplexity is a useful measure of how well the model can predict a reference translation given the source sentence. Perplexity is thus a good indicator of whether input features provide any benefit to the models, and we report the best validation set perplexity of each experiment. To evaluate whether the features also increase translation performance, we report case-sensitive BLEU scores with mteval-13b.perl on two test sets, newstest2015 and newstest2016. We also report CHRF3 (Popović, 2015), a character n-gram F score which was found to correlate well with human judgments, especially for translations out of English (Stanojević et al., 2015); we use the re-implementation included with the subword code. The two metrics may occasionally disagree, partly because they are highly sensitive to the length of the output. BLEU is precision-based, whereas CHRF3 considers both precision and recall, with a bias for recall. For BLEU, we also report whether differences between systems are statistically significant according to a bootstrap resampling significance test (Riezler and Maxwell, 2005).

We train models for about a week, and report results for an ensemble of the 4 last saved models (with models saved every 12 hours). The ensemble serves to smooth the variance between single models. Decoding is performed with beam search with a beam size of 12.

To ensure that performance improvements are not simply due to an increase in the number of model parameters, we keep the total size of the embedding layer fixed to 500. Table 1 lists the embedding size we use for linguistic features; the embedding layer size of the word-level feature varies, and is set to bring the total embedding layer size to 500. If we include the lemma feature, we roughly split the embedding vector one-to-two between the lemma feature and the word feature. The table also shows the network vocabulary size; for all features except lemmas, we can represent all feature values in the network vocabulary (in the case of words, this is due to BPE segmentation). For lemmas, we choose the same vocabulary size as for words, replacing rare lemmas with a special UNK symbol.

Table 1: Vocabulary size, and size of embedding layer of linguistic features, in the system that includes all features, and contrastive experiments that add a single feature over the baseline. The embedding layer size of the word feature is set to bring the total size to 500.

                  |     input vocabulary      | embedding size
feature           | EN     | DE      | model  | all | single
subword tags      | 4      | 4       | 4      | 5   | 5
POS tags          | 46     | 54      | 54     | 10  | 10
morph. features   | -      | 1400    | 1400   | 10  | 10
dependency labels | 46     | 33      | 46     | 10  | 10
lemmas            | 800000 | 1500000 | 85000  | 115 | 167
words             | 78500  | 85000   | 85000  | *   | *

Sennrich et al. (2016b) report large gains from using monolingual in-domain training data, automatically back-translated into the source language to produce a synthetic parallel training corpus. We use the synthetic corpora produced in these experiments (3.6–4.2 million sentence pairs; available at http://statmt.org/rsennrich/wmt16_backtranslations/), and we trained systems which include this data to compare against the state of the art. We note that our experiments with this data entail a syntactic annotation of automatically translated data, which may be a source of noise. For the systems with synthetic data, we double the training time to two weeks.

We also evaluate linguistic features for the lower-resourced translation direction English→Romanian, with 0.6 million sentence pairs of parallel training data, and 2.2 million sentence pairs of synthetic parallel data. We use the same linguistic features as for English→German. We follow Sennrich et al. (2016a) in the configuration, and use dropout for the English→Romanian systems. We drop out full words (both on the source and target side) with a probability of 0.1. For all other layers, the dropout probability is set to 0.2.
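Validation perplexity, used above for model selection, can be made concrete with a short sketch. The formula is the standard definition (exponentiated mean negative log-probability per reference token); the paper assumes it rather than spelling it out, and the probabilities below are invented:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability the model
    assigns to each token of the reference translation."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token probabilities for a 4-token reference.
logprobs = [math.log(p) for p in [0.5, 0.25, 0.125, 0.5]]
print(round(perplexity(logprobs), 3))  # 2**1.75 ≈ 3.364
```

Lower perplexity means the model finds the reference translation less surprising, which is why it serves as a quick proxy for whether extra input features help before running full decoding and BLEU evaluation.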
5 Results

Table 2 shows our main results for German→English, and English→German. The baseline system is a neural MT system with only one input feature, the (sub)words themselves. For both translation directions, linguistic features improve the best perplexity on the development data (47.3 → 46.2, and 54.9 → 52.9, respectively). For German→English, the linguistic features lead to an increase of 1.5 BLEU (31.4 → 32.9) and 0.5 CHRF3 (58.0 → 58.5) on newstest2016. For English→German, we observe improvements of 0.6 BLEU (27.8 → 28.4) and 1.2 CHRF3 (56.0 → 57.2).

Table 2: German↔English translation results: best perplexity on dev (newstest2013), and BLEU and CHRF3 on test15 (newstest2015) and test16 (newstest2016). BLEU scores that are significantly different (p < 0.05) from the respective baseline are marked with (*).

             |            German→English             |            English→German
             | ppl↓ | BLEU↑         | CHRF3↑        | ppl↓ | BLEU↑         | CHRF3↑
system       | dev  | test15 test16 | test15 test16 | dev  | test15 test16 | test15 test16
baseline     | 47.3 | 27.9   31.4   | 54.0   58.0   | 54.9 | 23.0   27.8   | 52.6   56.0
all features | 46.2 | 28.7*  32.9*  | 54.8   58.5   | 52.9 | 23.8*  28.4*  | 53.9   57.2

Table 3 shows contrastive experiments in which a single linguistic feature is added to the baseline. The subword tags yield improvements for English→German, but not for German→English. All other features outperform the baseline in terms of perplexity, and yield significant improvements in BLEU on at least one test set. The gain from different features is not fully cumulative; we note that the information encoded in different features overlaps. For instance, both the dependency labels and the morphological features encode the distinction between German subjects and accusative objects, the former through different labels (subj and obja), the latter through grammatical case (nominative and accusative).

Table 3: Contrastive experiments with individual linguistic features: best perplexity on dev (newstest2013), and BLEU and CHRF3 on test15 (newstest2015) and test16 (newstest2016). BLEU scores that are significantly different (p < 0.05) from the respective baseline are marked with (*).

                  |            German→English             |            English→German
                  | ppl↓ | BLEU↑         | CHRF3↑        | ppl↓ | BLEU↑         | CHRF3↑
system            | dev  | test15 test16 | test15 test16 | dev  | test15 test16 | test15 test16
baseline          | 47.3 | 27.9   31.4   | 54.0   58.0   | 54.9 | 23.0   27.8   | 52.6   56.0
lemmas            | 47.1 | 28.4   32.3*  | 54.6   58.7   | 53.4 | 23.8*  28.5*  | 53.7   56.7
subword tags      | 47.3 | 27.7   31.5   | 54.0   58.1   | 54.7 | 23.6*  28.1   | 53.2   56.4
morph. features   | 47.1 | 28.2   32.4*  | 54.3   58.4   | -    | -      -      | -      -
POS tags          | 46.9 | 28.1   32.4*  | 54.1   57.8   | 53.2 | 24.0*  28.9*  | 53.3   56.8
dependency labels | 46.9 | 28.1   31.8*  | 54.2   58.3   | 54.0 | 23.4*  28.0   | 53.1   56.5

We also evaluated adding linguistic features to a stronger baseline, which includes synthetic parallel training data. In addition, we compare our neural systems against phrase-based (PBSMT) and syntax-based (SBSMT) systems by Williams et al. (2016), all of which make use of linguistic annotation on the source and/or target side. Results are shown in Table 4. For German→English, we observe similar improvements in the best development perplexity (45.2 → 44.1), BLEU (37.5 → 38.5), and CHRF3 (62.2 → 62.8). Our BLEU score is on par with the best submitted system to this year's WMT 16 shared translation task, which is similar to our baseline MT system, but which also uses a right-to-left decoder for reranking (Sennrich et al., 2016a). We expect that linguistic input features and bidirectional decoding are orthogonal, and that we could obtain further improvements by combining the two.

For English→German, improvements in development set perplexity carry over (49.7 → 48.4), but not to BLEU and CHRF3. While we cannot clearly account for the discrepancy between perplexity and translation metrics, factors that potentially lower the usefulness of linguistic features in this setting are the stronger baseline, trained on more data, and the low robustness of linguistic tools in the annotation of the noisy, synthetic data sets. Both our baseline neural MT systems and the systems with linguistic features substantially outperform phrase-based and syntax-based systems for both translation directions.

Table 4: German↔English translation results with additional, synthetic training data: best perplexity on dev (newstest2013), and BLEU and CHRF3 on test15 (newstest2015) and test16 (newstest2016). BLEU scores that are significantly different (p < 0.05) from the respective baseline are marked with (*).

                              |            German→English             |            English→German
                              | ppl↓ | BLEU↑         | CHRF3↑        | ppl↓ | BLEU↑         | CHRF3↑
system                        | dev  | test15 test16 | test15 test16 | dev  | test15 test16 | test15 test16
PBSMT (Williams et al., 2016) | -    | 29.9   35.1   | 56.2   60.9   | -    | 23.7   28.4   | 52.6   56.6
SBSMT (Williams et al., 2016) | -    | 29.5   34.4   | 56.0   61.0   | -    | 24.5   30.6   | 55.3   59.9
baseline                      | 45.2 | 31.5   37.5   | 57.0   62.2   | 49.7 | 27.5   33.1   | 56.3   60.5
all features                  | 44.1 | 32.1*  38.5*  | 57.5   62.8   | 48.4 | 27.1   33.2   | 56.5   60.6

In the previous tables, we have reported the best perplexity. To address the question about the randomness in perplexity, and whether the best perplexity just happened to be lower for the systems with linguistic features, we show perplexity on our development set as a function of training time for different systems (Figure 2). We can see that perplexity is consistently lower for the systems trained with linguistic features.

Figure 2: English→German (black) and German→English (red) development set perplexity as a function of training time (number of minibatches), with and without linguistic features. (Plot not reproduced here.)

Table 5 shows results for a lower-resourced language pair, English→Romanian. With linguistic features, we observe improvements of 1.0 BLEU over the baseline, both for the systems trained on parallel data only (23.8 → 24.8) and the systems which use synthetic training data (28.2 → 29.2). In terms of BLEU, the best submission to WMT16 was a system combination by Peter et al. (2016). Our best system is competitive with this submission.

Table 5: English→Romanian translation results: best perplexity on newsdev2016, and BLEU and CHRF3 on newstest2016. BLEU scores that are significantly different (p < 0.05) from the respective baseline are marked with (*).

system                      | ppl↓ | BLEU↑ | CHRF3↑
(Peter et al., 2016)        | -    | 28.9  | 57.1
baseline                    | 74.9 | 23.8  | 52.5
all features                | 72.7 | 24.8* | 53.5
baseline (+synth. data)     | 50.9 | 28.2  | 56.1
all features (+synth. data) | 50.1 | 29.2* | 56.6

Table 6 shows translation examples of our baseline, and the system augmented with linguistic features. We see that the augmented neural MT systems, in contrast to the respective baselines, successfully resolve the reordering for the German→English example, and the disambiguation of close for the English→German example.

Table 6: Translation examples illustrating the effect of adding linguistic input features.

system       | sentence
source       | Gefährlich ist die Route aber dennoch.
reference    | However the route is dangerous.
baseline     | Dangerous is the route, however.
all features | However, the route is dangerous.
source       | [We thought] a win like this might be close.
reference    | [...] dass ein solcher Gewinn nah sein könnte.
baseline     | [...] ein Sieg wie dieser könnte schließen.
all features | [...] ein Sieg wie dieser könnte nah sein.
6 Related Work

Linguistic features have been used in neural language modelling (Alexandrescu and Kirchhoff, 2006), and are also used in other tasks for which neural models have recently been employed, such as syntactic parsing (Chen and Manning, 2014). This paper addresses the question whether linguistic features on the source side are beneficial for neural machine translation. On the target side, linguistic features are harder to obtain for a generation task such as machine translation, since this would require incremental parsing of the hypotheses at test time; this is possible future work.

Among others, our model incorporates information from a dependency annotation, but is still a sequence-to-sequence model. Eriguchi et al. (2016) propose a tree-to-sequence model whose encoder computes vector representations for each phrase in the source tree. Their focus is on exploiting the (unlabelled) structure of a syntactic annotation, whereas we are focused on the disambiguation power of the functional dependency labels.

Factored translation models are often used in phrase-based SMT (Koehn and Hoang, 2007) as a means to incorporate extra linguistic information. However, neural MT can provide a much more flexible mechanism for adding such information. Because phrase-based models cannot easily generalize to new feature combinations, the individual models either treat each feature combination as an atomic unit, resulting in data sparsity, or assume independence between features, for instance by having separate language models for words and POS tags. In contrast, we exploit the strong generalization ability of neural networks, and expect that even new feature combinations, e.g. a word that appears in a novel syntactic function, are handled gracefully.

One could consider the lemmatized representation of the input as a second source text, and perform multi-source translation (Zoph and Knight, 2016). The main technical difference is that in our approach, the encoder and attention layers are shared between features, which we deem appropriate for the types of features that we tested.
7 Conclusion

In this paper we investigate whether linguistic input features are beneficial to neural machine translation, and our empirical evidence suggests that this is the case.

We describe a generalization of the encoder in the popular attentional encoder-decoder architecture for neural machine translation that allows for the inclusion of an arbitrary number of input features. We empirically test the inclusion of various linguistic features, including lemmas, part-of-speech tags, syntactic dependency labels, and morphological features, into English↔German and English→Romanian neural MT systems. Our experiments show that the linguistic features yield improvements over our baseline, resulting in improvements on newstest2016 of 1.5 BLEU for German→English, 0.6 BLEU for English→German, and 1.0 BLEU for English→Romanian.

In the future, we expect several developments that will shed more light on the usefulness of linguistic (or other) input features, and whether they will establish themselves as a core component of neural machine translation. On the one hand, the machine learning capability of neural architectures is likely to increase, decreasing the benefit provided by the features we tested. On the other hand, there is potential to explore the inclusion of novel features for neural MT, which might prove to be even more helpful than the ones we investigated, and the features we investigated may prove especially helpful for some translation settings, such as very low-resourced settings and/or translation settings with a highly inflected source language.
Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21) and 644402 (HimL).
References

[Alexandrescu and Kirchhoff 2006] Andrei Alexandrescu and Katrin Kirchhoff. 2006. Factored Neural Language Models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 1–4, New York City, USA. Association for Computational Linguistics.

[Bahdanau et al. 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

[Chen and Manning 2014] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar. Association for Computational Linguistics.

[Cho et al. 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

[Eriguchi et al. 2016] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. ArXiv e-prints.

[Gülçehre et al. 2015] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

[Jean et al. 2015] Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal Neural Machine Translation Systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal. Association for Computational Linguistics.

[Koehn and Hoang 2007] Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic. Association for Computational Linguistics.

[Mikolov et al. 2013] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, pages 746–751. The Association for Computational Linguistics.

[Minnen et al. 2001] Guido Minnen, John A. Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

[Pascanu et al. 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 1310–1318, Atlanta, USA.

[Peter et al. 2016] Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, and Marcis Pinnis. 2016. The QT21/HimL Combined Machine Translation System. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

[Popović 2015] Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

[Riezler and Maxwell 2005] Stefan Riezler and John T. Maxwell. 2005. On Some Pitfalls in Automatic Evaluation and Significance Testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

[Schmid et al. 2004] Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. A German Computational Morphology Covering Derivation, Composition, and Inflection. In Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), pages 1263–1266.

[Sennrich and Kunz 2014] Rico Sennrich and Beat Kunz. 2014. Zmorge: A German Morphological Lexicon Extracted from Wiktionary. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland.

[Sennrich et al. 2013] Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.

[Sennrich et al. 2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

[Sennrich et al. 2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

[Sennrich et al. 2016c] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

[Stanojević et al. 2015] Miloš Stanojević, Amir Kamran, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273, Lisbon, Portugal. Association for Computational Linguistics.

[Toutanova et al. 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

[Williams et al. 2016] Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. 2016. Edinburgh's Statistical Machine Translation Systems for WMT16. In Proceedings of the First Conference on Machine Translation (WMT16).

[Zeiler 2012] Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.

[Zoph and Knight 2016] Barret Zoph and Kevin Knight. 2016. Multi-Source Neural Translation. In Proceedings of NAACL-HLT 2016.