CytonMT: an Efficient Neural Machine Translation Open-source Toolkit Implemented in C++
Xiaolin Wang Masao Utiyama Eiichiro Sumita
Advanced Translation Research and Development Promotion Center
National Institute of Information and Communications Technology, Japan
{xiaolin.wang,mutiyama,eiichiro.sumita}@nict.go.jp

Abstract
This paper presents an open-source neural machine translation toolkit named CytonMT (https://github.com/arthurxlw/cytonMt). The toolkit is built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. The toolkit features training efficiency, code simplicity and translation quality. Benchmarks show that CytonMT accelerates training by 64.5% to 110.8% on neural networks of various sizes, and achieves competitive translation quality.

Introduction

Neural Machine Translation (NMT) has made remarkable progress over the past few years (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016). Just as Moses (Koehn et al., 2007) does for statistical machine translation (SMT), open-source NMT toolkits contribute greatly to this progress, including but not limited to,

• RNNsearch-LV (Jean et al., 2015), https://github.com/sebastien-j/LV_groundhog
• Luong-NMT (Luong et al., 2015a), https://github.com/lmthang/nmt.hybrid
• DL4MT by Kyunghyun Cho et al., https://github.com/nyu-dl/dl4mt-tutorial
• BPE-char (Chung et al., 2016), https://github.com/nyu-dl/dl4mt-cdec
• Nematus (Sennrich et al., 2017), https://github.com/EdinburghNLP/nematus
• OpenNMT (Klein et al., 2017), https://github.com/OpenNMT/OpenNMT-py
• Seq2seq (Britz et al., 2017), https://github.com/google/seq2seq
• ByteNet (Kalchbrenner et al., 2016), https://github.com/paarthneekhara/byteNet-tensorflow (unofficial) and others
• ConvS2S (Gehring et al., 2017), https://github.com/facebookresearch/fairseq
• Tensor2Tensor (Vaswani et al., 2017), https://github.com/tensorflow/tensor2tensor
• Marian (Junczys-Dowmunt et al., 2018), https://github.com/marian-nmt/marian

These open-source NMT toolkits are undoubtedly excellent software. However, they share a common issue: all of them except Marian, which was developed simultaneously with our toolkit, are written in script languages with dependencies on third-party GPU platforms (see Table 1).

Using script languages and third-party GPU platforms is a two-edged sword. On one hand, it greatly reduces the workload of coding neural networks. On the other hand, it also causes two problems as follows,

• Running efficiency drops, and profiling and optimization become difficult, as direct access to GPUs is blocked by the language interpreters or the platforms. NMT systems typically require days or weeks to train, so training efficiency is a paramount concern. Slightly faster training can make the difference between plausible and impossible experiments (Klein et al., 2017).

• Researchers using these toolkits may be constrained by the platforms. Unexplored computations or operations may become disallowed or unnecessarily inefficient on a third-party platform, which lowers the chances of developing novel neural network techniques.

Toolkit        Language  Platform
RNNsearch-LV   Python    Theano, GroundHog
Luong-NMT      Matlab    Matlab
DL4MT          Python    Theano
BPE-char       Python    Theano
Nematus        Python    Theano
OpenNMT        Lua       Torch
Seq2seq        Python    Tensorflow
ByteNet        Python    Tensorflow
ConvS2S        Lua       Torch
Tensor2Tensor  Python    Tensorflow
Marian         C++       –
CytonMT        C++       –

Table 1: Languages and Platforms of Open-source NMT Toolkits

CytonMT is developed to address this issue, in hopes of providing the community an attractive alternative. The toolkit is written in C++, the genuine official language of NVIDIA, the manufacturer of the most widely-used GPU hardware. This gives the toolkit an efficiency advantage over other toolkits. Implementing in C++ also gives CytonMT great flexibility and freedom in coding.
Researchers who are interested in the real calculations inside neural networks can trace the source code down to kernel functions, matrix operations or NVIDIA's APIs, and then modify them freely to test their novel ideas.

The code simplicity of CytonMT is comparable to that of NMT toolkits implemented in script languages. This owes to an open-source general-purpose neural network library in C++, named CytonLib, which is shipped as part of the source code. The library defines a simple and friendly pattern for users to build arbitrary network architectures at the cost of two lines of genuine C++ code per layer.

CytonMT achieves competitive translation quality, which is the main purpose of NMT toolkits. It implements the popular framework of the attention-based RNN encoder-decoder. Among the reported systems of the same architecture, it ranks at top positions on the benchmarks of both the WMT14 and WMT17 English-to-German tasks.

The remainder of this paper presents the details of CytonMT in terms of method, implementation, benchmarks, and future work.
Method

The toolkit approaches the problem of machine translation using the attention-based RNN encoder-decoder proposed by Bahdanau et al. (2014) and Luong et al. (2015a). Figure 1 illustrates the architecture.

Figure 1: Model Architecture of CytonMT

The conditional probability of a translation given a source sentence is formulated as,

$$\log p(\mathbf{y}|\mathbf{x}) = \sum_{j=1}^{m} \log p(y_j \mid H_o^{\langle j \rangle}) = \sum_{j=1}^{m} \log\big(\mathrm{softmax}_{y_j}(\tanh(W_o H_o^{\langle j \rangle} + B_o))\big), \quad (1)$$

$$H_o^{\langle j \rangle} = F_{att}(H_s, H_t^{\langle j \rangle}), \quad (2)$$

where $\mathbf{x}$ is a source sentence; $\mathbf{y} = (y_1, \ldots, y_m)$ is a translation; $H_s$ are the source-side top-layer hidden states; $H_t^{\langle j \rangle}$ is a target-side top-layer hidden state; $H_o^{\langle j \rangle}$ is a state generated by an attention model $F_{att}$; $W_o$ and $B_o$ are the weight and bias of an output embedding.

The toolkit adopts the multiplicative attention model proposed by Luong et al. (2015a), because it is slightly more efficient than the additive variant proposed by Bahdanau et al. (2014); this comparison is addressed in Britz et al. (2017) and Vaswani et al. (2017). Figure 2 illustrates the model, formulated as,

$$a_{st}^{\langle ij \rangle} = \mathrm{softmax}(F_a(H_s^{\langle i \rangle}, H_t^{\langle j \rangle})) = \frac{e^{F_a(H_s^{\langle i \rangle}, H_t^{\langle j \rangle})}}{\sum_{i'=1}^{n} e^{F_a(H_s^{\langle i' \rangle}, H_t^{\langle j \rangle})}}, \quad (3)$$

$$F_a(H_s^{\langle i \rangle}, H_t^{\langle j \rangle}) = H_s^{\langle i \rangle \top} W_a H_t^{\langle j \rangle}, \quad (4)$$

$$C_s^{\langle j \rangle} = \sum_{i=1}^{n} a_{st}^{\langle ij \rangle} H_s^{\langle i \rangle}, \quad (5)$$

$$C_{st}^{\langle j \rangle} = [C_s^{\langle j \rangle}; H_t^{\langle j \rangle}], \quad (6)$$

$$H_o^{\langle j \rangle} = \tanh(W_c C_{st}^{\langle j \rangle}), \quad (7)$$

where $F_a$ is a scoring function for alignment; $W_a$ is a matrix for linearly mapping target-side hidden states into a space comparable to the source side; $a_{st}^{\langle ij \rangle}$ is an alignment coefficient; $C_s^{\langle j \rangle}$ is a source-side context; $C_{st}^{\langle j \rangle}$ is a context derived from both sides.

Figure 2: Architecture of Attention Model
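To make Equations 3 to 7 concrete, the following is a minimal single-position sketch of the multiplicative attention step in plain C++. The function name attentionStep, the nested-vector types and the loop-based matrix products are illustrative assumptions, not CytonMT's actual code, which performs these computations in batched form on the GPU (see Figures 3 and 4).

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of Equations 3 to 7 for one target position j.
// Hs: n x d source states H_s^<i>; Wa: d x d; ht: H_t^<j>; Wc: d x 2d.
std::vector<float> attentionStep(
    const std::vector<std::vector<float>>& Hs,
    const std::vector<std::vector<float>>& Wa,
    const std::vector<float>& ht,
    const std::vector<std::vector<float>>& Wc)
{
    size_t n = Hs.size(), d = ht.size();

    std::vector<float> waHt(d, 0.0f);               // W_a H_t^<j>
    for (size_t r = 0; r < d; ++r)
        for (size_t c = 0; c < d; ++c)
            waHt[r] += Wa[r][c] * ht[c];

    std::vector<float> score(n);                    // F_a = H_s^<i>T W_a H_t^<j>  (Eq. 4)
    for (size_t i = 0; i < n; ++i) {
        float s = 0.0f;
        for (size_t r = 0; r < d; ++r) s += Hs[i][r] * waHt[r];
        score[i] = s;
    }

    float maxS = *std::max_element(score.begin(), score.end());
    float sum = 0.0f;                               // softmax over source positions (Eq. 3)
    for (size_t i = 0; i < n; ++i) { score[i] = std::exp(score[i] - maxS); sum += score[i]; }
    for (size_t i = 0; i < n; ++i) score[i] /= sum; // a_st^<ij>

    std::vector<float> cs(d, 0.0f);                 // source-side context C_s^<j>  (Eq. 5)
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < d; ++r) cs[r] += score[i] * Hs[i][r];

    std::vector<float> cst(cs);                     // C_st^<j> = [C_s^<j>; H_t^<j>]  (Eq. 6)
    cst.insert(cst.end(), ht.begin(), ht.end());

    std::vector<float> ho(d, 0.0f);                 // H_o^<j> = tanh(W_c C_st^<j>)  (Eq. 7)
    for (size_t r = 0; r < d; ++r) {
        float v = 0.0f;
        for (size_t c = 0; c < 2 * d; ++c) v += Wc[r][c] * cst[c];
        ho[r] = std::tanh(v);
    }
    return ho;
}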
Implementation

The toolkit consists of a general-purpose neural network library, and a neural machine translation system built upon the library. The neural network library defines a class named Network to facilitate the construction of arbitrary neural networks. Users only need to inherit the class, declare components as data members, and write two lines of code per component in an initialization function. For example, the complete code of the attention network formulated by Equations 3 to 7 is presented in Figure 3. This piece of code fulfills the task of building a neural network as follows,

• The class Variable stores numeric values and gradients. Components are connected together by passing pointers to Variable around.

• The data member layers collects all the components. The base class Network calls the functions forward, backward and calculateGradient of each component to perform the actual computation.

The code of the actual computation is organized in the functions forward, backward and calculateGradient of each type of component. Figure 4 presents some examples. Note that these codes have been slightly simplified for illustration.

class Attention: public Network
{
  DuplicateLayer dupHt;   // declare components
  LinearLayer linearHt;
  MultiplyHsHt multiplyHsHt;
  SoftmaxLayer softmax;
  WeightedHs weightedHs;
  Concatenate concateCsHt;
  LinearLayer linearCst;
  ActivationLayer actCst;

  Variable* init(LinearLayer* linHt, LinearLayer* linCst,
      Variable* hs, Variable* ht)
  {
    Variable* tx;
    tx=dupHt.init(ht);                          // make two copies
    layers.push_back(&dupHt);
    tx=linearHt.init(linHt, tx);                // WaHt
    layers.push_back(&linearHt);
    tx=multiplyHsHt.init(hs, tx);               // Fa
    layers.push_back(&multiplyHsHt);
    tx=softmax.init(tx);                        // ast
    layers.push_back(&softmax);
    tx=weightedHs.init(hs, tx);                 // Cs
    layers.push_back(&weightedHs);
    tx=concateCsHt.init(tx, &dupHt.y1);         // Cst
    layers.push_back(&concateCsHt);
    tx=linearCst.init(linCst, tx);              // WcCst
    layers.push_back(&linearCst);
    tx=actCst.init(tx, CUDNN_ACTIVATION_TANH);  // Ho
    layers.push_back(&actCst);
    return tx;  // pointer to result
  }
};

Figure 3: Complete Code of the Attention Model Formulated by Equations 3 to 7

void LinearLayer::forward()
{
  // y = w^T x for a batch of num columns (column-major cuBLAS GEMM)
  cublasXgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      dimOutput, num, dimInput,
      &one, w.data, w.ni, x.data, dimInput,
      &zero, y.data, dimOutput);
}

void LinearLayer::backward()
{
  // x.grad = w * y.grad, accumulated according to beta
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N,
      dimInput, num, dimOutput,
      &one, w.data, w.ni, y.grad.data, dimOutput,
      &beta, x.grad.data, dimInput);
}

void LinearLayer::calculateGradient()
{
  // w.grad += x * y.grad^T
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T,
      dimInput, dimOutput, num,
      &one, x.data, dimInput, y.grad.data, dimOutput,
      &one, w.grad.data, w.grad.ni);
}

void EmbeddingLayer::forward()
{
  ...
  embedding_kernel<<
Figure 4: Code Performing the Actual Computation
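The two-lines-per-component pattern works because the base class drives all collected components. The following is a minimal sketch of how such a driver might look; the Layer interface and the exact loop structure here are assumptions for illustration, and the real CytonLib classes may differ in their signatures.

#include <cstddef>
#include <vector>

// Hypothetical interface shared by all components.
class Layer
{
public:
    virtual void forward() = 0;            // compute outputs from inputs
    virtual void backward() = 0;           // propagate gradients to inputs
    virtual void calculateGradient() = 0;  // accumulate parameter gradients
    virtual ~Layer() {}
};

// Sketch of the base class: subclasses (e.g. Attention::init in Figure 3)
// fill layers, and the base class runs the actual computation.
class Network : public Layer
{
protected:
    std::vector<Layer*> layers;

public:
    // Run components in construction order for the forward pass.
    void forward() override
    {
        for (std::size_t i = 0; i < layers.size(); ++i) layers[i]->forward();
    }

    // Run components in reverse order for backpropagation.
    void backward() override
    {
        for (std::size_t i = layers.size(); i > 0; --i) layers[i - 1]->backward();
    }

    void calculateGradient() override
    {
        for (std::size_t i = layers.size(); i > 0; --i) layers[i - 1]->calculateGradient();
    }
};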
Benchmarks
CytonMT is tested on the widely-used benchmarks of the WMT14 and WMT17 English-to-German tasks (Bojar et al., 2017) (Table 2). Both datasets are processed and converted using byte-pair encoding (Gage, 1994; Schuster and Nakajima, 2012) with a shared source-target vocabulary of about 37,000 tokens. The WMT14 corpora are processed by the scripts from Vaswani et al. (2017) (https://github.com/tensorflow/tensor2tensor). The WMT17 corpora are processed by the scripts from Junczys-Dowmunt et al. (2018) (https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin), which include 10 million back-translated sentence pairs for training.

The benchmarks were run on an Intel Xeon CPU E5-2630 @ 2.4 GHz and a Quadro M4000 GPU (Maxwell) with 1,664 CUDA cores @ 773 MHz (2,573 GFLOPS). The software was CentOS 6.8, CUDA 9.1 (driver 387.26), CUDNN 7.0.5, Theano 1.0.1 and Tensorflow 1.5.0. Nematus, Torch and OpenNMT were the latest versions in December 2017. Marian was the latest version in May 2018.

CytonMT is run with the hyperparameter settings presented in Table 3 unless stated otherwise. The settings provide both fast training and competitive translation quality according to our experiments on a variety of translation tasks. Dropout is applied to the hidden states between non-top recurrent layers R_s, R_t and the output H_o according to Wang et al. (2017). Label smoothing estimates the marginalized effect of label dropout during training, which makes models learn to be less confident (Szegedy et al., 2016); this improves BLEU scores (Vaswani et al., 2017). Length penalty is applied using the formula in Wu et al. (2016).
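For reference, the length penalty of Wu et al. (2016) divides a hypothesis's log-probability by lp(Y) = (5 + |Y|)^α / (5 + 1)^α. The sketch below shows this formula with α = 0.6 from Table 3; how exactly CytonMT integrates it into beam search is assumed here, not taken from its source code.

#include <cmath>

// Length penalty of Wu et al. (2016): lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha.
double lengthPenalty(int length, double alpha = 0.6)
{
    return std::pow(5.0 + length, alpha) / std::pow(5.0 + 1.0, alpha);
}

// Score used to rank finished beam-search hypotheses (assumed usage).
double rescore(double logProb, int length, double alpha = 0.6)
{
    return logProb / lengthPenalty(length, alpha);
}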
Data Set              Sent. Pairs   Source Tokens   Target Tokens
WMT14
Train (standard)        4,500,966     113,548,249     107,259,529
Dev. (tst2013)              3,000          64,807          63,412
Test (tst2014)              3,003          67,617          63,078
WMT17
Train (standard)        4,590,101     118,768,285     112,009,072
Train (back trans.)    10,000,000     190,611,668     149,198,444
Dev. (tst2016)              2,999          64,513          62,362
Test (tst2017)              3,004          64,776          60,963

Table 2: WMT English-to-German Corpora
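As background on the preprocessing above, byte-pair encoding repeatedly merges the most frequent adjacent pair of symbols into a new symbol, growing a subword vocabulary. The following is a minimal sketch of one merge step, illustrating the algorithm of Gage (1994) rather than the third-party scripts actually used for these corpora.

#include <map>
#include <string>
#include <utility>
#include <vector>

// One merge step of byte-pair encoding: find the most frequent adjacent
// pair of symbols and fuse it into a single new symbol everywhere.
void bpeMergeStep(std::vector<std::vector<std::string>>& corpus)
{
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const auto& word : corpus)
        for (size_t i = 0; i + 1 < word.size(); ++i)
            ++counts[{word[i], word[i + 1]}];
    if (counts.empty()) return;

    auto best = counts.begin();
    for (auto it = counts.begin(); it != counts.end(); ++it)
        if (it->second > best->second) best = it;

    // Replace every occurrence of the best pair with the merged symbol.
    for (auto& word : corpus) {
        std::vector<std::string> merged;
        for (size_t i = 0; i < word.size(); ++i) {
            if (i + 1 < word.size() && word[i] == best->first.first
                    && word[i + 1] == best->first.second) {
                merged.push_back(word[i] + word[i + 1]);
                ++i;  // skip the second element of the merged pair
            } else {
                merged.push_back(word[i]);
            }
        }
        word = std::move(merged);
    }
}

Applying this step repeatedly grows the symbol vocabulary toward the roughly 37,000 shared tokens used for the corpora in Table 2.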
Hyperparameter          Value
Embedding Size          512
Hidden State Size       512
Encoder/Decoder Depth   2
Encoder                 Bidirectional
RNN Type                LSTM
Dropout                 0.2
Label Smoothing         0.1
Optimizer               SGD
Learning Rate           1.0
Learning Rate Decay     0.7
Beam Search Size        10
Length Penalty          0.6
Table 3: Hyperparameter Settings

Four baseline toolkits and CytonMT train models using the hyperparameter settings in Table 3. The number of layers and the size of embeddings and hidden states vary, as large networks are often used in real-world applications to achieve higher accuracy at the cost of more running time.

Table 4 presents the training speed of the different toolkits measured in source tokens per second. The results show that the training speed of CytonMT is much higher than that of the baselines. OpenNMT is the fastest baseline, while CytonMT achieves a speedup over it of 64.5% to 110.8%. Moreover, CytonMT shows a consistent tendency to speed up more on larger networks.
Table 5 compares the BLEU of CytonMT with the reported results from systems of the same architecture (attention-based RNN encoder-decoder). BLEU is calculated on cased, tokenized text to be comparable to previous work (Sutskever et al., 2014; Luong et al., 2015b; Wu et al., 2016; Zhou et al., 2016). The settings of CytonMT on WMT14 follow Table 3, while the settings on WMT17 adopt a depth of 3 and a hidden state size of 1024, as the training set is three times larger.
Embed./State Size       512    512   1024   1024
Enc./Dec. Layers          2      4      2      4
Nematus                1875   1190    952    604
OpenNMT                2872   2038   1356    904
Seq2Seq                1618   1227    854    599
Marian                 2630   1832   1120    688
CytonMT                4725   3751   2571   1906
Speedup (vs. OpenNMT) 64.5%  84.1%  89.6%  110.8%

Table 4: Training Speed Measured in Source Tokens per Second

System                            Open Src.   BLEU
WMT14
Nematus (Klein, 2017)                 ✓
CytonMT                               ✓
GNMT (Wu, 2016)                                24.61
WMT17
Nematus (Sennrich, 2017)              ✓
CytonMT                               ✓
Marian (Junczys-Dowmunt, 2018)        ✓

Table 5: Comparing BLEU with Public Records

The cross entropy of the development set is monitored at regular intervals during training, approximately every 400K sentence pairs. If the entropy has not decreased by a small threshold proportional to the learning rate within 12 consecutive checks, the learning rate decays by 0.7 and the training restarts from the previous best model. The whole training procedure terminates when no improvement is made during two neighboring decays of the learning rate. The actual training took 28 epochs on WMT14 and 12 epochs on WMT17.

Table 5 shows that CytonMT achieves competitive BLEU points on both benchmarks. On WMT14, it is only outperformed by Google's production system (Wu et al., 2016), which is very much larger in scale and much more demanding on hardware. On WMT17, it achieves the same level of performance as Marian, which ranks high among the single-system entries of WMT17. Note that the state-of-the-art scores on these benchmarks have recently been pushed forward by novel network architectures such as Gehring et al. (2017), Vaswani et al. (2017) and Shazeer et al. (2017).

Conclusion and Future Work

This paper introduces CytonMT, an open-source NMT toolkit built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. CytonMT speeds up training by more than 64.5%, and achieves competitive BLEU points on the WMT14 and WMT17 corpora. The source code of CytonMT is simple because of CytonLib, an open-source general-purpose neural network library contained in the toolkit. Therefore, CytonMT is an attractive alternative for the research community. We open-source this toolkit in hopes of benefiting the community and promoting the field, and we look forward to hearing feedback from the community.

The future work on CytonMT will continue in two directions. One direction is to further optimize the code for GPUs, such as supporting multiple GPUs. The problem we used to face is that GPUs have evolved very fast over the last few years. For example, the microarchitectures of NVIDIA GPUs evolved twice during the development of CytonMT, from Maxwell to Pascal, and then to Volta. Therefore, we have not explored cutting-edge GPU techniques, as the coding effort might be outdated quickly. Multi-GPU machines are common now, so we plan to support them.

The other direction is to support the latest NMT architectures such as ConvS2S (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017). In these architectures, recurrent structures are replaced by convolution or attention structures. Their high performance indicates that the new structures suit the translation task better, so we also plan to support them in the future.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Proceedings of the 3rd International Conference on Learning Representations, pages 1–15.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark. Association for Computational Linguistics.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1693–1703, Berlin, Germany. Association for Computational Linguistics.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. arXiv preprint arXiv:1804.00344.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 5149–5152. IEEE.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.

Xiaolin Wang, Masao Utiyama, and Eiichiro Sumita. 2017. Empirical study of dropout scheme for neural machine translation. In Proceedings of the 16th Machine Translation Summit, pages 1–15.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation.