NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems
Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, Loïc Barrault
Laboratoire d'Informatique de l'Université du Maine (LIUM), Language and Speech Technology (LST) Team, Le Mans, France

ABSTRACT
In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural sequence-to-sequence architectures. nmtpy decouples the specification of a network from the training and inference utilities to simplify the addition of a new architecture and reduce the amount of boilerplate code to be written. nmtpy has been used for LIUM's top-ranked submissions to the WMT Multimodal Machine Translation and News Translation tasks in 2016 and 2017.
1 OVERVIEW

nmtpy is a refactored, extended and Python 3 only version of dl4mt-tutorial, a Theano (Theano Development Team, 2016) implementation of attentive Neural Machine Translation (NMT) (Bahdanau et al., 2014). The development of the nmtpy project, which has been open-sourced under the MIT license in March 2017, started in March 2016 as an effort to adapt dl4mt-tutorial to multimodal translation models. nmtpy has now become a powerful toolkit where adding a new model is as simple as deriving from an abstract base class to fill in a set of fundamental methods and (optionally) implementing a custom data iterator. The training and inference utilities are as model-agnostic as possible, allowing one to use them for different sequence generation networks such as multimodal NMT and image captioning, to name a few. This flexibility and the rich set of provided architectures (Section 3) is what differentiates nmtpy from Nematus (Sennrich et al., 2017), another NMT software derived from dl4mt-tutorial.

2 WORKFLOW
Figure 1 describes the general workflow of a training session. An experiment in nmtpy is described with a configuration file (Appendix A) to ensure reusability and reproducibility. A training experiment can be simply launched by providing this configuration file to nmt-train, which sets up the environment and starts the training. Specifically, nmt-train automatically selects a free GPU, sets the seed for all random number generators and finally creates a model (model_type option) instance. Architecture-specific steps like data loading, weight initialization and graph construction are delegated to the model instance. The corresponding log file and model checkpoints are named in a way to reflect the experiment options determined by the configuration file (Example: model_type-e...).
[Figure 1 shows the main components: model definitions (NMT, MNMT, FNMT, ...), data iterators (Text, BiText, PKL, ...) and periodic evaluation (BLEU, METEOR, external metrics, ...).]
Figure 1: The components of nmtpy.

Periodically, nmt-train delegates the evaluation to nmt-translate processes which perform beam-search decoding, compute the requested metrics and return the results back so that nmt-train can track the progress and save the best checkpoints to disk. Several examples regarding the usage of the utilities are given in Appendix B.

2.1 ADDING NEW ARCHITECTURES
New architectures can be defined by creating a new file under nmtpy/models/ using a copy of an existing architecture and modifying the following predefined methods:

• __init__(): Instantiates a model. Keyword arguments can be used to add options specific to the architecture that will be automatically gathered from the configuration file by nmt-train.
• init_params(): Initializes the layers and weights.
• build(): Defines the Theano computation graph that will be used during training.
• build_sampler(): Defines the Theano computation graph that will be used during beam-search. This is generally very similar to build() but with sequential RNN steps and non-masked tensors.
• load_valid_data(): Loads the validation data for perplexity computation.
• load_data(): Loads the training data.
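To make this structure concrete, the following is a minimal, hypothetical skeleton of such a model file; the base-class import path and constructor arguments are assumptions rather than nmtpy's exact API:

# A minimal, hypothetical skeleton of a new architecture under nmtpy/models/.
from nmtpy.models.basemodel import BaseModel   # assumed location of the abstract base class

class Model(BaseModel):
    def __init__(self, **kwargs):
        # Architecture-specific options arrive as keyword arguments
        # gathered from the [model] section of the configuration file.
        super().__init__(**kwargs)
        self.rnn_dim = kwargs.get('rnn_dim', 100)

    def init_params(self):
        """Initialize the layers and weights (embeddings, GRUs, ...)."""

    def build(self):
        """Define the Theano computation graph used during training."""

    def build_sampler(self):
        """Define the step-wise graph used during beam-search."""

    def load_data(self):
        """Load the training data with an appropriate iterator."""

    def load_valid_data(self):
        """Load the validation data used for perplexity computation."""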
2.2 BUILDING BLOCKS

In this section, we introduce the currently available components and features of nmtpy that one can use to design their architectures.
Training
nmtpy provides Theano implementations of stochastic gradient descent (SGD) and its adaptive variants RMSProp (Tieleman & Hinton, 2012), Adadelta (Zeiler, 2012) and Adam (Kingma & Ba, 2014) to optimize the weights of the trained network. A preliminary support for gradient noise (Neelakantan et al., 2015) is available for Adam. Gradient norm clipping (Pascanu et al., 2013) is enabled by default with a threshold of 5 to avoid exploding gradients. Although the provided architectures all use the cross-entropy objective by their nature, any arbitrary differentiable objective function can be used since the training loop is agnostic to the architecture being trained.
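As an illustration of the clipping step, the following is a small sketch of global gradient-norm clipping in Theano; it mirrors the behaviour described above but is not necessarily nmtpy's exact implementation:

# Sketch of global gradient-norm clipping (Pascanu et al., 2013):
# if the L2 norm of all gradients exceeds clip_c, rescale them proportionally.
import theano.tensor as tensor

def clip_grads(grads, clip_c=5.0):
    norm = tensor.sqrt(sum((g ** 2).sum() for g in grads))
    factor = tensor.switch(norm > clip_c, clip_c / norm, 1.0)
    return [g * factor for g in grads]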
Regularization
A dropout (Srivastava et al., 2014) layer which can be placed after any arbitrary feed-forward layer in the architecture is available. This layer works in inverse mode where the magnitudes are scaled during training instead of testing. Additionally, an L2 regularization loss with a scalar factor defined by the decay_c option in the configuration can be added to the training loss.
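A rough sketch of this inverted dropout behaviour, shown here with numpy for brevity (the real layer is implemented in Theano):

# Inverted dropout: scale kept activations by 1/keep_prob at training time,
# so no rescaling is needed at test time.
import numpy as np

def inverted_dropout(x, p_drop, train=True, rng=np.random):
    if not train or p_drop == 0.0:
        return x
    keep = 1.0 - p_drop
    mask = rng.binomial(1, keep, size=x.shape).astype(x.dtype)
    return x * mask / keep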
Initialization
The weight initialization is governed by the weight_init option and supports Xavier (Glorot & Bengio, 2010) and He (He et al., 2015) initialization methods besides orthogonal (Saxe et al., 2013) and random normal.
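For example, the Xavier scheme can be sketched as follows (illustrative numpy code; the actual helper in nmtpy may differ in naming and details):

# Xavier/Glorot initialization: variance scaled by fan-in + fan-out.
import numpy as np

def xavier_init(n_in, n_out, rng=np.random):
    scale = np.sqrt(2.0 / (n_in + n_out))
    return (scale * rng.randn(n_in, n_out)).astype('float32')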
Layers
The following layers are available in the latest version of nmtpy:

• Feed-forward and highway layers (Srivastava et al., 2015)
• Gated Recurrent Unit (GRU) (Chung et al., 2014)
• Conditional GRU (CGRU) (Firat & Cho, 2016)
• Multimodal CGRU (Caglayan et al., 2016a;b)

Layer normalization (Ba et al., 2016), a method that adaptively learns to scale and shift the incoming activations of a neuron, can be enabled for GRU and CGRU blocks.
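The normalization itself amounts to the following transformation (an illustrative sketch with learned gain and bias vectors, not nmtpy's actual code):

# Layer normalization (Ba et al., 2016): normalize the activations of a layer
# across its units, then apply a learned gain and bias.
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gain * (x - mean) / std + bias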
Iteration
Parallel and monolingual text iterators with compressed (.gz, .bz2, .xz) file support are available under the names TextIterator and BiTextIterator. Additionally, the multimodal WMTIterator allows using image features and source/target sentences at the same time for multimodal NMT (Section 3.3). We recommend using shuffle_mode:trglen, when implemented by the iterator, to speed up training by efficiently batching same-length sequences.
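The idea behind shuffle_mode:trglen can be sketched as follows (illustrative code, not nmtpy's actual iterator implementation):

# Group samples by target length so that each minibatch contains sequences
# of the same length, avoiding wasted computation on padding.
import random
from collections import defaultdict

def trglen_batches(samples, batch_size):
    buckets = defaultdict(list)
    for src, trg in samples:
        buckets[len(trg)].append((src, trg))
    batches = []
    for bucket in buckets.values():
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    random.shuffle(batches)
    return batches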
Post-processing
All decoded translations will be post-processed if the filter option is given in the configuration file. This is useful in the case where one would like to compute automatic metrics on surface forms instead of segmented forms. Currently available filters are bpe and compound for cleaning subword BPE (Sennrich et al., 2016) and German compound-splitting (Sennrich & Haddow, 2015), respectively.
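For instance, assuming the common "@@ " continuation marker, the bpe filter essentially performs the following de-segmentation (illustrative sketch):

# Merge BPE subword units back into surface words before scoring.
def bpe_desegment(line):
    return line.replace('@@ ', '').replace('@@', '')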
Metrics
nmt-train performs patience-based early stopping using either validation perplexity or one of the following external evaluation metrics:

• bleu: Wrapper around Moses multi-bleu BLEU (Papineni et al., 2002)
• bleu_v13a: A Python reimplementation of Moses mteval-v13a.pl BLEU
• meteor: Wrapper around METEOR (Lavie & Agarwal, 2007)

The above metrics are also available for nmt-translate to immediately score the produced hypotheses. Other metrics can be easily added and made available as early-stopping metrics.

3 ARCHITECTURES

3.1 NEURAL MACHINE TRANSLATION (NMT)

The baseline NMT architecture (attention) is based on the original dl4mt-tutorial implementation, which differs from Bahdanau et al. (2014) in the following major aspects:

• A CGRU decoder which consists of two GRU layers interleaved with the attention mechanism.
• The hidden state of the decoder is initialized with a non-linear transformation applied to the mean bi-directional encoder state, in contrast to the last bi-directional encoder state.
• The Maxout (Goodfellow et al., 2013) hidden layer before the softmax operation is removed.

In addition, nmtpy offers the following configurable options for this NMT:

• layer_norm: Enables/disables layer normalization for the bi-directional GRU encoder.
• init_cgru: Allows initializing the CGRU with all-zeros instead of the mean encoder state.
• n_enc_layers: Number of additional unidirectional GRU encoders to stack on top of the bi-directional encoder.
• tied_emb: Allows sharing feedback embeddings and output embeddings (2way) or all embeddings in the network (3way) (Inan et al., 2016; Press & Wolf, 2016).
• *_dropout: Dropout probabilities for three dropout layers placed after source embeddings (emb_dropout), encoder hidden states (ctx_dropout) and pre-softmax activations (out_dropout).
3.2 FACTORED NMT

Factored NMT (FNMT) is an extension of NMT which is able to generate two output symbols. The architecture of such a model is presented in Figure 2. In contrast to multi-task architectures, FNMT outputs share the same recurrence and output symbols are generated in a synchronous fashion.
Figure 2: Global architecture of the Factored NMT system.

Two FNMT variants, which differ in how they handle the output layer, are currently available:

• attention_factors: the lemma and factor embeddings are concatenated to form a single feedback embedding.
• attention_factors_seplogits: the output paths for lemmas and factors are kept separate, with different pre-softmax transformations applied for specialization.

FNMT with lemmas and linguistic factors has been successfully used for the IWSLT'16 English→French (García-Martínez et al., 2016) and the WMT'17 English→Latvian and English→Czech evaluation campaigns (http://matrix.statmt.org/). Note that FNMT currently uses a dedicated nmt-translate-factors utility, though it will probably be merged in the near future.
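To make the idea of a shared recurrence with two synchronous outputs concrete, the following is an illustrative sketch (not nmtpy's actual code) of how lemma and factor distributions could both be computed from the same decoder state:

# Two output distributions computed from one shared decoder hidden state.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def factored_outputs(h_dec, W_lemma, W_factor):
    # h_dec: shared decoder state; W_lemma / W_factor: output projections.
    p_lemma = softmax(h_dec @ W_lemma)
    p_factor = softmax(h_dec @ W_factor)
    return p_lemma, p_factor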
3.3 MULTIMODAL NMT & CAPTIONING
We provide several multimodal architectures (Caglayan et al., 2016a;b) where the probability of a target word is conditioned on source sentence representations and convolutional image features (Figure 3). More specifically, these architectures extend the monomodal CGRU into a multimodal one where the attention mechanism can be shared or separate between the input modalities. A late fusion of the attended context vectors is done by either summing or concatenating the modality-specific representations.

Our attentive multimodal system for the Multilingual Image Description Generation track of the WMT'16 Multimodal Machine Translation shared task surpassed the baseline architecture (Elliott et al., 2015) by +1.1 METEOR and +3.4 BLEU and ranked first among multimodal submissions (Specia et al., 2016).
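The fusion step described above can be sketched as follows (illustrative Theano code assuming (batch, dim) context matrices; the real implementation sits inside the multimodal CGRU):

# Late fusion of the textual and visual attended context vectors.
import theano.tensor as tensor

def fuse_contexts(ctx_txt, ctx_img, method='sum'):
    if method == 'sum':
        return ctx_txt + ctx_img
    return tensor.concatenate([ctx_txt, ctx_img], axis=1)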
3.4 LANGUAGE MODELING
A GRU-based language model architecture (rnnlm) is available in the repository and can be used with nmt-test-lm to obtain language model scores.
Figure 3: The architecture of multimodal attention (Caglayan et al., 2016b).

3.5 IMAGE CAPTIONING
A GRU-based reimplementation of the Show, Attend and Tell architecture (Xu et al., 2015), which learns to generate a natural language description by applying soft attention over convolutional image features, is available under the name img2txt. This architecture has recently been used as a baseline system for the Multilingual Image Description Generation track of the WMT'17 Multimodal Machine Translation shared task.
4 TOOLS
In this section, we present the translation and rescoring utilities nmt-translate and nmt-rescore. Other auxiliary utilities are briefly described in Appendix C.
4.1 NMT-TRANSLATE

nmt-translate is responsible for translation decoding using the beam-search method defined by the NMT architecture. This default beam-search supports single and ensemble decoding for both monomodal and multimodal translation models. If a given architecture reimplements the beam-search method in its class, that one will be used instead.

Since the number of CPUs in a single machine is 2x-4x higher than the number of GPUs and we mainly reserve the GPUs for training, nmt-translate makes use of CPU workers for maximum efficiency. More specifically, each worker receives a model instance (or instances when ensembling) and performs the beam-search on samples that it continuously fetches from a shared queue. This queue is filled by the master process using the iterator provided by the model.

One thing to note for parallel CPU decoding is that if numpy is linked against a BLAS implementation with threading support enabled (as is the case with Anaconda & Intel MKL), each spawned process attempts to use all available threads in the machine, leading to a resource conflict. In order for nmt-translate to benefit correctly from parallelism, the number of threads per process is thus limited to 1 by setting the X_NUM_THREADS=1 environment variable, where X is one of OPENBLAS, OMP or MKL depending on the numpy installation. The impact of this setting and the overall decoding speed in terms of words/sec (wps) are reported in Table 1 for a medium-sized En→Tr NMT with ∼10M parameters.
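The worker scheme and the thread limiting described above can be sketched as follows; this is a simplified, self-contained illustration rather than nmtpy's actual decoding code, and translate() is a stand-in for the per-sample beam search:

# Limit BLAS threading before numpy would be imported by the workers.
import os
for var in ('OPENBLAS_NUM_THREADS', 'OMP_NUM_THREADS', 'MKL_NUM_THREADS'):
    os.environ.setdefault(var, '1')

import multiprocessing as mp

def translate(sample):
    # Placeholder for beam-search decoding of a single sample.
    return sample.upper()

def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:        # sentinel: no more work
            break
        idx, sample = item
        out_q.put((idx, translate(sample)))

if __name__ == '__main__':
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
    for w in workers:
        w.start()
    samples = ['first source sentence', 'second source sentence']
    for item in enumerate(samples):
        in_q.put(item)          # master fills the shared queue
    for _ in workers:
        in_q.put(None)
    hyps = [out_q.get() for _ in samples]
    for w in workers:
        w.join()
    print(sorted(hyps))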
4.2 NMT-RESCORE
A 1-best plain text or n-best hypotheses file can be rescored with nmt-rescore using either a single model or an ensemble of models. Since rescoring a given hypothesis simply means computing its negative log-likelihood given the source sentence, nmt-rescore uses a single GPU to efficiently compute the scores in batched mode. See Appendix B for examples.

BLAS Threads | Tesla K40 | 4 CPU   | 8 CPU   | 16 CPU
Default      | 185 wps   | 26 wps  | 25 wps  | 25 wps
Set to 1     | 185 wps   | 109 wps | 198 wps | 332 wps

Table 1: Median beam-search speed over 3 runs with beam size 12: decoding on a single Tesla K40 GPU is roughly equivalent to using 8 CPUs (Intel Xeon E5-2687v3).
5 CONCLUSION
We have presented nmtpy, an open-source sequence-to-sequence framework based on dl4mt-tutorial and refined in many ways to ease the task of integrating new architectures. The toolkit has been internally used in our team for tasks ranging from monomodal, multimodal and factored NMT to image captioning and language modeling, helping to achieve top-ranked submissions during evaluation campaigns like IWSLT and WMT.

ACKNOWLEDGMENTS
This work was supported by the French National Research Agency (ANR) through the CHIST-ERA M2CR project (http://m2cr.univ-lemans.fr), under contract number ANR-15-CHR2-0006-01.

REFERENCES
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. Does multimodality help human and machine for translation and image captioning? In Proceedings of the First Conference on Machine Translation, pp. 627–633, Berlin, Germany, August 2016a. Association for Computational Linguistics.

Ozan Caglayan, Loïc Barrault, and Fethi Bougares. Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976, 2016b.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL http://arxiv.org/abs/1412.3555.

Desmond Elliott, Stella Frank, and Eva Hasler. Multi-language image description with neural sequence models. CoRR, abs/1510.04709, 2015. URL http://arxiv.org/abs/1510.04709.

Orhan Firat and Kyunghyun Cho. Conditional gated recurrent unit with attention mechanism. github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf, 2016.

Mercedes García-Martínez, Loïc Barrault, and Fethi Bougares. Factored neural machine translation architectures. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT'16, Seattle, USA, 2016. URL http://workshop2016.iwslt.org/downloads/IWSLT_2016_paper_2.pdf.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1319–1327, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/goodfellow13.html.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 1026–1034. IEEE, 2015.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alon Lavie and Abhaya Agarwal. Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pp. 228–231, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Stan Szpakowicz and Marie-Francine Moens (eds.), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pp. 311–318, Stroudsburg, PA, USA, 2002.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of The 30th International Conference on Machine Learning, pp. 1310–1318, 2013.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Rico Sennrich and Barry Haddow. A joint dependency model of morphological and syntactic structure for statistical machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 114–121. Association for Computational Linguistics, 2015.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch-Mayne, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation, pp. 65–68. Association for Computational Linguistics (ACL), April 2017. ISBN 978-1-945626-34-0.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, Berlin, Germany. Association for Computational Linguistics, 2016.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016. URL http://arxiv.org/abs/1605.02688.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of The 32nd International Conference on Machine Learning, pp. 2048–2057, 2015.

Matthew D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

A CONFIGURATION FILE EXAMPLE

[training]
model_type: attention
patience: 20
valid_freq: 1000
valid_metric: meteor
valid_start: 2
valid_beam: 3
valid_njobs: 16
valid_save_hyp: True
decay_c: 1e-5
clip_c: 5
seed: 1235
save_best_n: 2
device_id: auto
snapshot_freq: 10000
max_epochs: 100

[model]
tied_emb: 2way
layer_norm: True
shuffle_mode: trglen
filter: bpe
n_words_src: 0
n_words_trg: 0
save_path: ~/models
rnn_dim: 100
embedding_dim: 100
weight_init: xavier
batch_size: 32
optimizer: adam
lrate: 0.0004
emb_dropout: 0.2
ctx_dropout: 0.4
out_dropout: 0.4

[model.dicts]
src: ~/data/train.norm.max50.tok.lc.bpe.en.pkl
trg: ~/data/train.norm.max50.tok.lc.bpe.de.pkl

[model.data]
train_src: ~/data/train.norm.max50.tok.lc.bpe.en
train_trg: ~/data/train.norm.max50.tok.lc.bpe.de
valid_src: ~/data/val.norm.tok.lc.bpe.en
valid_trg: ~/data/val.norm.tok.lc.bpe.de
valid_trg_orig: ~/data/val.norm.tok.lc.de

B USAGE EXAMPLES

$ nmt-train -c wmt-en-de.conf
$ nmt-train -c wmt-en-de.conf 'model_type:my_amazing_nmt'
$ nmt-train -c wmt-en-de.conf 'rnn_dim:500' 'embedding_dim:300'
$ nmt-train -c wmt-en-de.conf 'device_id:gpu5'
Listing 1: Example usage patterns for nmt-train.

$ nmt-translate -j 30 -m best_model.npz -S val.tok.bpe.en \
    -R val.tok.de -o out.tok.de -M bleu meteor -b 10
$ nmt-translate -m model*npz -S val.tok.de \
    -o out.tok.50best.de -b 50 -N 50
$ nmt-translate -m best_model.npz -S val.tok.bpe.en \
    -R val.tok.de -o out.tok.de -e

Listing 2: Example usage patterns for nmt-translate.

$ nmt-rescore -m model*npz -s val.tok.bpe.en \
    -t out.tok.50best.de \
    -o out.tok.50best.rescored.de

Listing 3: Example usage patterns for nmt-rescore.
C DESCRIPTION OF THE PROVIDED TOOLS

nmt-build-dict
Generates .pkl vocabulary files from a preprocessed corpus. A single/combined vocabulary for two or more languages can be created with the -s flag.

nmt-extract
Extracts arbitrary weights from a model snapshot which can further be used as pre-trained weights of a new experiment or analyzed using visualization techniques (especially for embeddings).

nmt-coco-metrics
A stand-alone utility which computes multi-reference BLEU, METEOR, CIDEr (Vedantam et al., 2015) and ROUGE-L (Lin, 2004) using the MSCOCO evaluation tools (Chen et al., 2015). Multiple systems can be given with the -s flag to produce a table of scores.

nmt-bpe-(learn,apply)
Copies of the subword utilities (Sennrich et al., 2016, https://github.com/rsennrich/subword-nmt) which are used to first learn a BPE segmentation model over a given corpus file and then apply it to new sentences.

nmt-test-lm
Computes the language model perplexity of a given corpus.