Asynchronous Bidirectional Decoding for Neural Machine Translation
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, Hongji Wang
Xiamen University, Xiamen, China; Tsinghua University, Beijing, China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
The dominant neural machine translation (NMT) models apply unified attentional encoder-decoder neural networks for translation. Traditionally, NMT decoders adopt recurrent neural networks (RNNs) to perform translation in a left-to-right manner, leaving the target-side contexts generated from right to left unexploited during translation. In this paper, we equip the conventional attentional encoder-decoder NMT framework with a backward decoder, in order to explore bidirectional decoding for NMT. Attending to the hidden state sequence produced by the encoder, our backward decoder first learns to generate the target-side hidden state sequence from right to left. Then, the forward decoder performs translation in the forward direction, and at each translation prediction timestep it simultaneously applies two attention models to consider the source-side and reverse target-side hidden states, respectively. With this new architecture, our model is able to fully exploit source- and target-side contexts to improve translation quality. Experimental results on NIST Chinese-English and WMT English-German translation tasks demonstrate that our model achieves substantial improvements over the conventional NMT model of 3.14 and 1.38 BLEU points, respectively. The source code of this work can be obtained from https://github.com/DeepLearnXMU/ABD-NMT.
Introduction
Recently, end-to-end neural machine translation (NMT) (Kalchbrenner and Blunsom 2013; Sutskever, Vinyals, and Le 2014; Cho et al. 2014) has achieved promising results and gained increasing attention. Compared with conventional statistical machine translation (SMT) (Koehn, Och, and Marcu 2003; Chiang 2007), which needs explicitly designed features to capture translation regularities, NMT aims to construct a unified encoder-decoder framework based on neural networks to model the entire translation process. Further, the introduction of the attention mechanism (Bahdanau, Cho, and Bengio 2015) enhances the capability of NMT in capturing long-distance dependencies. Despite being a relatively new framework, the attentional encoder-decoder NMT has quickly become the de facto method.
Source: rì fángwèitīng zhǎngguān : bú wàng jūnguó lìshǐ zūnzhòng línguó zūnyán

Reference: japan defense chief : never forget militaristic history , respect neighboring nations ' dignity

L2R: japan 's defense agency chief : death of militarism respects its neighbors ' dignity

R2L: japanese defense agency has never forgotten militarism 's history to respect the dignity of neighboring countries
Table 1: Translation examples of NMT systems with different decoding manners. L2R/R2L denotes the translation produced by the NMT system with left-to-right/right-to-left decoding. Texts highlighted in wavy/dashed lines are incorrect/correct translations, respectively.

Generally, most NMT decoders are based on recurrent neural networks (RNNs) and generate translations in a left-to-right manner. Thus, despite the advantage of encoding unbounded target words predicted previously for the prediction at each time step, these decoders are incapable of capturing the reverse target-side context for translation. Once errors occur in previous predictions, the quality of subsequent predictions would be undermined due to the negative impact of the noisy forward-encoded target-side contexts. Intuitively, the reverse target-side contexts are also crucial for translation predictions, since they not only provide complementary signals but also bring different biases to the NMT model (Hoang, Haffari, and Cohn 2017). Consider the example in Table 1: the latter half of the Chinese sentence, misinterpreted by the conventional NMT system, is accurately translated by the NMT system with right-to-left decoding. Therefore, it is important to investigate how to integrate reverse target-side contexts into the decoder to improve the translation performance of NMT.

To this end, many researchers have resorted to introducing bidirectional decoding into NMT (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017). Most of them re-ranked candidate translations using bidirectional decoding scores, in order to select a translation with both proper prefixes and suffixes. However, such methods also come with drawbacks that limit the potential of bidirectional decoding in NMT. On the one hand, due to the limited search space and search errors of beam search, the generated 1-best translation is often far from satisfactory and thus fails to provide sufficient information as a complement for the other decoder. On the other hand, because the bidirectional decoders are often independent from each other during translation, the unidirectional decoder is unable to fully exploit the target-side contexts produced by the other decoder, and consequently the generated candidate translations are still undesirable. Therefore, how to effectively exert the influence of bidirectional decoding on NMT is still worthy of further study.

In this paper, we significantly extend the conventional attentional encoder-decoder NMT framework by introducing a backward decoder, for the purpose of fully exploiting reverse target-side contexts to improve NMT. As shown in Fig. 1, along with our novel asynchronous bidirectional decoders, the proposed model remains an end-to-end attentional NMT framework, which mainly consists of three components: 1) an encoder embedding the input source sentence into bidirectional hidden states; 2) a backward decoder that is similar to the conventional NMT decoder but performs translation in the right-to-left manner, where the generated hidden states encode the reverse target-side contexts; and 3) a forward decoder that generates the final translation from left to right and introduces two attention models simultaneously considering the source-side bidirectional and target-side reverse hidden state vectors for translation prediction.
Compared with previous related NMT models, our model has the following advantages: 1) The backward decoder learns to produce hidden state vectors that essentially encode the semantics of potential hypotheses, allowing the following forward decoder to utilize richer target-side contexts for translation. 2) By integrating right-to-left target-side context modeling and left-to-right translation generation into an end-to-end joint framework, our model alleviates the error propagation of reverse target-side context modeling to some extent.

The major contributions of this paper are summarized as follows:

• We thoroughly analyze and point out the existing drawbacks of research on NMT with bidirectional decoding.

• We introduce a backward decoder to encode the right-to-left target-side contexts, as a supplement to the conventional context modeling mechanism of NMT. To the best of our knowledge, this is the first attempt to investigate the effectiveness of an end-to-end attentional NMT model with asynchronous bidirectional decoders.

• Experiments on Chinese-English and English-German translation show that our model achieves significant improvements over the conventional NMT model.
Our Model
As described above, our model mainly includes three components: 1) a neural encoder with parameter set $\theta_e$; 2) a neural backward decoder with parameter set $\theta_b$; and 3) a neural forward decoder with parameter set $\theta_f$, which will be elaborated in the following subsections.

Particularly, we choose the Gated Recurrent Unit (GRU) (Cho et al. 2014) to build the encoder and decoders, as it is widely used in the NMT literature and requires relatively few parameters. However, it should be noted that our model is also applicable to other RNNs, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997).
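For concreteness, the recurrent functions $\phi(\cdot)$ and $f(\cdot)$ used in the following subsections can be read as a standard GRU transition (in the formulation used by Bahdanau, Cho, and Bengio 2015); the notation below is ours and is included only for reference:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh(W x_t + U(r_t \odot h_{t-1})), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, $z_t$ and $r_t$ are the update and reset gates, and $x_t$ is the current input (a word embedding for the encoder, or the previously generated word together with a context vector for a decoder).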
The Neural Encoder

The neural encoder of our model is identical to that of the dominant NMT model, which is modeled using a bidirectional RNN.

The forward RNN reads a source sentence $x = x_1, x_2, \ldots, x_N$ in a left-to-right order. At each timestep, we apply a recurrent activation function $\phi(\cdot)$ to learn the semantic representation of the word sequence $x_{1:i}$ as $\overrightarrow{h}_i = \phi(\overrightarrow{h}_{i-1}, x_i)$. Likewise, the backward RNN scans the source sentence in the reverse order and generates the semantic representation $\overleftarrow{h}_i$ of the word sequence $x_{i:N}$. Finally, we concatenate the hidden states of the two RNNs to form an annotation sequence $h = \{h_1, h_2, \ldots, h_i, \ldots, h_N\}$, where $h_i = [\overrightarrow{h}_i^{T}, \overleftarrow{h}_i^{T}]^{T}$ encodes information about the $i$-th word with respect to all the other surrounding words in the source sentence.

In our model, these annotations provide source-side contexts not only for the forward decoder but also for the backward one, via different attention models.
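As an illustrative sketch of this bidirectional encoder (a minimal PyTorch-style example written by us for exposition, not the authors' released implementation; class and variable names are hypothetical), the annotation sequence $h$ can be produced as follows:

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU encoder producing annotations h_i = [forward h_i; backward h_i]."""

    def __init__(self, vocab_size, emb_dim=620, hidden_dim=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs a left-to-right and a right-to-left GRU and
        # concatenates their hidden states along the feature dimension.
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) integer word indices of the source sentence
        emb = self.embedding(src_ids)       # (batch, src_len, emb_dim)
        annotations, _ = self.rnn(emb)      # (batch, src_len, 2 * hidden_dim)
        return annotations                  # one annotation h_i per source position
```

Each row of `annotations` plays the role of $h_i$ above and is what both decoders attend to; the embedding and hidden sizes follow the hyper-parameters reported in the experiments (620 and 1000).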
The Neural Backward Decoder

The neural backward decoder of our model is also similar to the decoder of the dominant NMT model; the only difference is that it performs decoding in a right-to-left manner. Given the source-side hidden state vectors of the encoder and all target words generated previously, the backward decoder models how to reversely produce the next target word. Using this decoder, we calculate the conditional probability of the reverse translation $\overleftarrow{y} = (y_1, y_2, \ldots, y_M)$ as follows:

$$P(\overleftarrow{y} \mid x; \theta_e, \theta_b) = \prod_{j=1}^{M} P(y_j \mid y_{>j}, x; \theta_e, \theta_b) = \prod_{j=1}^{M} \overleftarrow{g}(y_{j+1}, \overleftarrow{s}_j, m_j^{eb}), \qquad (1)$$

where $\overleftarrow{g}(\cdot)$ is a non-linear function, $\overleftarrow{s}_j$ and $m_j^{eb}$ denote the decoding state and the source-side context vector at the $j$-th time step, respectively, and $M$ indicates the length of the reverse translation.

Here, $\overleftarrow{s}_j$ is computed by the GRU activation function $f(\cdot)$ as $\overleftarrow{s}_j = f(\overleftarrow{s}_{j+1}, y_{j+1}, m_j^{eb})$, and $m_j^{eb}$ is defined by an encoder-backward decoder attention model as the weighted sum of the source annotations $\{h_i\}$:

$$m_j^{eb} = \sum_{i=1}^{N} \alpha_{j,i}^{eb} \cdot h_i, \qquad (2)$$

$$\alpha_{j,i}^{eb} = \frac{\exp(e_{j,i}^{eb})}{\sum_{i'=1}^{N} \exp(e_{j,i'}^{eb})}, \qquad (3)$$

$$e_{j,i}^{eb} = (v_a^{eb})^{T} \tanh(W_a^{eb} \overleftarrow{s}_{j+1} + U_a^{eb} h_i), \qquad (4)$$

where $v_a^{eb}$, $W_a^{eb}$ and $U_a^{eb}$ are the parameters of the encoder-backward decoder attention model. In this way, the backward decoder is also able to automatically select the effective source words for reversely predicting target words.

Figure 1: The architecture of the proposed NMT model. Note that the forward decoder directly attends to the reverse hidden state sequence $\overleftarrow{s} = \{\overleftarrow{s}_1, \overleftarrow{s}_2, \ldots, \overleftarrow{s}_M\}$ rather than the word sequence produced by the backward decoder.

By introducing this backward decoder, our NMT model is able to better exploit target-side contexts for translation prediction. More importantly, in addition to generating a target word sequence, the backward decoder produces the target-side hidden states $\overleftarrow{s}$, which capture rich reverse target-side contexts for further use by the forward decoder.
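A compact sketch of the encoder-backward decoder attention in Eqs. (2)-(4) is given below (again an illustrative PyTorch-style fragment under our own naming, not the authors' code):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention computing a context vector from a decoder state and annotations."""

    def __init__(self, state_dim, annot_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(state_dim, attn_dim, bias=False)   # plays the role of W_a^{eb}
        self.U_a = nn.Linear(annot_dim, attn_dim, bias=False)   # plays the role of U_a^{eb}
        self.v_a = nn.Linear(attn_dim, 1, bias=False)           # plays the role of v_a^{eb}

    def forward(self, prev_state, annotations):
        # prev_state:  (batch, state_dim), e.g. the backward decoder state s_{j+1}
        # annotations: (batch, src_len, annot_dim), the encoder annotations h_1..h_N
        energies = self.v_a(torch.tanh(
            self.W_a(prev_state).unsqueeze(1) + self.U_a(annotations)))   # Eq. (4)
        alpha = torch.softmax(energies.squeeze(-1), dim=-1)               # Eq. (3)
        context = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)   # Eq. (2)
        return context, alpha    # context corresponds to m_j^{eb}
```

At each step, the forward decoder described next uses two such attention modules: one over the encoder annotations $h$ and one over the backward decoder states $\overleftarrow{s}$.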
The Neural Forward Decoder

The neural forward decoder of our model is extended from the decoder of the dominant NMT model. It performs decoding in a left-to-right manner under the semantic guidance of the source-side and reverse target-side contexts, which are separately captured by the encoder and the backward decoder. The forward decoder is trained to sequentially predict the next target word given the source-side hidden state vectors of the encoder, the reverse target-side hidden state sequence generated by the backward decoder, and all target words generated previously. Formally, the conditional probability of the translation $y = (y_1, y_2, \ldots, y_M)$ is defined as follows:

$$P(y \mid x; \theta_e, \theta_b, \theta_f) = \prod_{j=1}^{M} P(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f),$$

where $y_{<j}$ denotes the previously generated target words. At each timestep, the forward decoder state is updated using two context vectors produced by two attention models analogous to Eqs. (2)-(4): one attending to the encoder annotations $h$ and the other attending to the backward decoder states $\overleftarrow{s}$.

Given a training corpus $D = \{(x, y)\}$, we train the proposed model according to the following objective:

$$J(D; \theta_e, \theta_b, \theta_f) = \arg\max_{\theta_e, \theta_b, \theta_f} \frac{1}{|D|} \sum_{(x,y) \in D} \big\{ \lambda \cdot \log P(y \mid x; \theta_e, \theta_b, \theta_f) + (1 - \lambda) \cdot \log P(\overleftarrow{y} \mid x; \theta_e, \theta_b) \big\}, \qquad (14)$$

where $\overleftarrow{y}$ is obtained by inverting $y$, and $\lambda$ is a hyper-parameter used to balance the preference between the two terms.

The first term $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$ models the translation procedure illustrated in Figure 1. To ensure consistency between model training and testing, we perform beam search to generate the reverse hidden states $\overleftarrow{s}$ when optimizing $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$. In addition, to guarantee that the $\overleftarrow{s}$ produced in this way are of high quality, we further introduce the second term $\log P(\overleftarrow{y} \mid x; \theta_e, \theta_b)$ to maximize the conditional likelihood of $\overleftarrow{y}$. Note that beam search has a high time complexity, and we therefore directly adopt greedy search instead to implement right-to-left decoding, which proves to be sufficiently effective in our experiments.

Once the proposed model is trained, we adopt a two-phase scheme to translate an unseen input sentence $x$: First, we use the backward decoder with greedy search to sequentially generate $\overleftarrow{s}$ until the target-side start symbol $\langle s \rangle$ occurs with the highest probability. Then, we perform beam search on the forward decoder to find the best translation that approximately maximizes $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$.

Experiments

We evaluated the proposed model on the NIST Chinese-English and WMT English-German translation tasks.

Setup

For Chinese-English translation, the training data consist of 1.25M bilingual sentences with 27.9M Chinese words and 34.5M English words. These sentence pairs are mainly extracted from LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06. We chose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03), 2004 (MT04), 2005 (MT05), and 2006 (MT06) datasets as our test sets. We evaluated the translations using BLEU (Papineni et al. 2002).

For English-German translation, we used the WMT 2015 training data, which contains 4.46M sentence pairs with 116.1M English words and 108.9M German words. In particular, we segmented words via byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016b). The news-test 2013 set was used as the development set and the news-test 2015 set as the test set.

To train the NMT models efficiently, we trained each model with sentences of length up to 50 words. In doing so, 90.12% and 89.03% of the Chinese-English and English-German parallel sentences, respectively, were covered in the experiments.
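Relating this back to the training objective in Eq. (14): assuming the per-sentence log-likelihoods of the forward and backward decoders have already been computed, the interpolated loss can be sketched as follows (a simplification written by us, not the authors' training code):

```python
import torch

def interpolated_nll(logp_forward, logp_backward, lam=0.7):
    """Negative interpolated log-likelihood corresponding to Eq. (14).

    logp_forward:  (batch,) log P(y | x) from the forward decoder, which attends
                   to both the encoder annotations and the backward decoder states.
    logp_backward: (batch,) log P of the reversed target from the backward decoder.
    lam:           the interpolation weight lambda (0.7 for Chinese-English and
                   0.8 for English-German in the reported experiments).
    """
    objective = lam * logp_forward + (1.0 - lam) * logp_backward
    # The objective is maximized, so the training loss is its negative batch mean.
    return -objective.mean()
```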
Besides, we set the vocabulary size to 30K for Chinese-English translation and 50K for English-German translation, and mapped all the out-of-vocabulary words in the Chinese-English corpus to a special token UNK. These vocabularies cover 97.4% of the Chinese words and 99.3% of the English words in the Chinese-English corpus, and almost 100.0% of the English words and 98.2% of the German words in the English-German corpus, respectively. We applied RMSprop (Graves 2013) (momentum = 0, $\rho = 0.95$, and $\epsilon = 1 \times 10^{-}$) to train the models for 5 epochs and selected the best model parameters according to the model performance on the development set. During this procedure, we set the following hyper-parameters: word embedding dimension as 620, hidden layer size as 1000, learning rate as $\times 10^{-}$, batch size as 80, gradient norm as 1.0, and dropout rate as 0.3. All the other settings are the same as in (Bahdanau, Cho, and Bengio 2015).

SYSTEM            MT03   MT04   MT05   MT06   Average
COVERAGE          34.49  38.34  34.91  34.25  35.50
MemDec            36.16  39.81  35.91  35.98  36.97
DeepLAU           39.35  41.15  38.07  37.29  38.97
DMAtten           38.33  40.11  36.71  35.29  37.61
Moses             32.93  34.76  31.31  31.05  32.51
RNNSearch         36.59  39.57  35.56  35.29  36.75
RNNSearch(R2L)    36.54  39.70  35.61  34.67  36.63
ATNMT             38.09  40.99  36.87  36.17  38.03
NSC(RT)           37.68  40.82  36.21  35.50  37.55
NSC(HS)           37.99  40.74  36.82  36.32  37.97
Our Model

Table 2: Evaluation of the NIST Chinese-English translation task using case-insensitive BLEU scores (λ=0.7). We display the experimental results of the first four models as reported in (Wang et al. 2017; Zhang et al. 2017). COVERAGE (Tu et al. 2016) is a basic NMT model with a coverage model. MemDec (Wang et al. 2016) improves translation quality with external memory. DeepLAU (Wang et al. 2017) reduces the gradient propagation length inside the recurrent unit of RNN-based NMT. DMAtten (Zhang et al. 2017) incorporates word reordering knowledge into attentional NMT.

Baselines

We compared the proposed model against the following state-of-the-art SMT and NMT systems:

• Moses: an open-source phrase-based translation system with its default configuration and a 4-gram language model trained on the target portion of the training data. Note that we used all the data to train Moses.

• RNNSearch: a re-implementation of the attention-based NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes taken from the dl4mt tutorial (https://github.com/nyu-dl/dl4mt-tutorial).

• RNNSearch(R2L): a variant of RNNSearch that produces translations in a right-to-left direction.

• ATNMT: an attention-based NMT system with two directional decoders (Liu et al. 2016), which explores agreement on target-bidirectional NMT. Using this model, we first run beam search for the forward and backward models independently to obtain two k-best lists, and then re-score the combination of these two lists using the joint model to find the best candidate. Following (Liu et al. 2016), we set both beam sizes of the two decoders as 10. Note that we replaced the LSTM adopted in (Liu et al. 2016) with a GRU to ensure a fair comparison.

• NSC(RT): a variant of the neural system combination framework proposed by Zhou et al. (2017). It first uses an attentional NMT model consisting of one standard encoder and one backward decoder to produce the best reverse translation. Then, another attentional NMT model generates the final output from its standard encoder and a reverse translation encoder which embeds the best reverse translation, in a way similar to the multi-source NMT model (Zoph and Knight 2016).
This model differs from ours in two aspects: (1) it is not an end-to-end model, and (2) it considers the embedded hidden states of the reverse translation, while our model considers the hidden states produced by the backward decoder.

• NSC(HS): similar to NSC(RT), with the only difference that it directly considers the reverse hidden states produced by the backward decoder.

We set the beam sizes of all the above-mentioned models as 10, and the beam sizes of the backward and forward decoders of our model as 1 and 10, respectively.

Results on Chinese-English Translation

Parameters. The RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) models have 85.6M, 85.6M, 171.2M, 120.0M and 130.0M parameters, respectively. By contrast, the parameter size of our model is about 130.0M.

Speed. We used a single GPU device (1080Ti) to train the models. It takes one hour to train 6,500, 6,500, 6,500, 4,700 and 3,708 minibatches for the RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) models, respectively. The training speed of the proposed model is relatively slow: about 1,758 mini-batches are processed in one hour.

Figure 2: Experiment results on the development set using different λs (BLEU score as a function of λ).

We first investigated the impact of the hyper-parameter λ (see Eq. (14)) on the development set. To this end, we gradually varied λ from 0.5 to 1.0 with an increment of 0.1 in each step. As shown in Fig. 2, our model achieved the best performance when λ = 0.7. Therefore, we set λ = 0.7 for all subsequent experiments.

The experimental results on Chinese-English translation are reported in Table 2. We also display the performance of some dominant individual models, such as COVERAGE (Tu et al. 2016), MemDec (Wang et al. 2016), DeepLAU (Wang et al. 2017) and DMAtten (Zhang et al. 2017), on the same data set. Specifically, the proposed model significantly outperforms Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) by 7.38, 3.14, 3.26, 1.86, 2.34, and 1.92 BLEU points, respectively. Even when compared with (Tu et al. 2016; Wang et al. 2016; 2017; Zhang et al. 2017), our model still performs better in the same setting. Moreover, we draw the following conclusions:

(1) In contrast to RNNSearch and RNNSearch(R2L), our model exhibits much better performance. These results support our hypothesis that the forward and backward decoders are complementary to each other in target-side context modeling, and therefore the simultaneous exploration of bidirectional decoders leads to better translations.

(2) On all test sets, our model outperforms ATNMT, which indicates that, compared with k-best hypotheses rescoring (Liu et al. 2016), joint modeling with attention to reverse hidden states behaves better in exploiting reverse target-side contexts. The underlying reason is that the reverse hidden states encode richer target-side contexts than a single translation. In addition, compared with k-best hypotheses rescoring, our model can refine translations at a more fine-grained level via the attention mechanism.

(3) Particularly, the fact that NSC(HS) outperforms NSC(RT) reveals the advantage of the reverse hidden state representations of the backward decoder in overcoming data sparsity. Besides, our model behaves better than NSC(HS), which accords with our intuition that, to some extent, the joint model is able to alleviate error propagation when encoding target-side contexts.

(4) Note that the performance of our model is better than that of our model (RR). This result verifies our speculation that model training with the translations obtained by greedy search is superior due to the consistency between the training and testing procedures.

Finally, based on the length of the source sentences, we divided our test sets into different groups and then compared the system performances in each group. Fig. 3 illustrates the BLEU scores on these groups of test sets. We observe that our model achieves the best performance in all groups, although the performance of all systems drops as the length of the source sentences increases. These results clearly demonstrate once again the effectiveness of our model.

Figure 3: BLEU scores on different translation groups divided according to source sentence length.

Case Study

To better understand how our model outperforms the others, we studied the 1-best translations produced by the different models. Table 3 provides a Chinese-English translation example. We find that RNNSearch produces a translation with a good prefix, while RNNSearch(R2L) generates a translation with a desirable suffix. Although there are various models with bidirectional decoding that could exploit bidirectional contexts, most of them are unable to translate the whole sentence precisely, and our model is the only one capable of producing a high-quality translation in this case.

SYSTEM            TEST
BPEChar           23.90
RecAtten          25.00
ConvEncoder       24.20
Moses             20.54
RNNSearch         24.88
RNNSearch(R2L)    23.83
ATNMT             25.08
NSC(RT)           25.15
NSC(HS)           25.36
Our Model

Table 4: Evaluation of the WMT English-German translation task using case-sensitive BLEU scores (λ=0.8). We directly cite the experimental results of the first three models as provided by (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) is an attentional NMT model with a character-level decoder. RecAtten (Yang et al. 2017) uses a recurrent attention model to explicitly model the dependence between attentions among target words. ConvEncoder (Gehring et al. 2017) introduces a convolutional encoder into NMT.

Results on English-German Translation

To make our experiments more convincing, we also report the results of some existing systems on the same data set, including BPEChar (Chung, Cho, and Bengio 2016), RecAtten (Yang et al. 2017), and ConvEncoder (Gehring et al. 2017). We determined the optimal λ as 0.8 according to the performance of our model on the development set.

Table 4 presents the results on English-German translation. Our model still significantly outperforms the others, including some dominant NMT systems with other improvement techniques. We believe that our work can be easily applied to other architectures. It should be noted that the BLEU score gaps between our model and the others on English-German translation are much smaller than those on Chinese-English translation. The underlying reasons lie in the following two aspects, which have also been mentioned in (Shen et al. 2016). First, the Chinese-English datasets contain four reference translations for each sentence, while the English-German dataset has only a single reference. Second, compared with German, Chinese is more distantly related to English, leading to the more pronounced advantage of utilizing target-side contexts in Chinese-English translation.
Source: yīyuè kāishǐ , zǒngwùshěng jiāng yǒu liù míng zhíyuán yī zhōu zhìshǎo yī tiān bù xūyào jìn bàngōngshì , kěyǐ zài jiā lǐ , dàxué huò túshūguǎn tòuguò gāosù wǎngluò fúwù gōngzuò .

Reference: starting from january , the ministry of internal affairs and communications will have six employees who do n't need to go to their offices at least one day a week ; instead they may work from home , universities or libraries through high - speed internet services .

Moses: since january , there will be six staff members – a week for at least one day in office , they can at home , university or through high - speed internet library services .

RNNSearch: as early as january , six staff members will not be required to enter office at least one day in one week , which can be done through high - speed internet services through high - speed internet services .

RNNSearch(R2L): beginning in january , at least six staff members have to go to the office for at least one week and can work at home , and university or library through high - speed internet services .

ATNMT: at the beginning of january , there will be six staff members to go to office at least one week , which can be done through high - speed internet services at home and university or libraries .

NSC(RT): at least six staff members will leave office for at least one week at least one week , and can work at home and university or library through high - speed internet services .

NSC(HS): in january , there will be six staff members who are required to enter offices for at least one day at least one day , and we can work at home , university or library through high - speed internet services .

Our Model: starting in january , six staff members will not need to enter the office at least one day in one week , and they can work at home , universities or libraries through high - speed internet services .

Table 3: Translation examples of different systems. Texts highlighted in wavy lines are incorrectly translated. Please note that the translations produced by RNNSearch and RNNSearch(R2L) are complementary to each other, and the translation generated by our model is the most accurate and complete.

Related Work

In this work, we mainly focus on how to exploit bidirectional decoding to refine translation, which has long been a research focus in machine translation.

In SMT, many approaches based on a backward language model (BLM) or target-bidirectional decoding have been explored to capture right-to-left target-side contexts for translation. For example, Watanabe and Sumita (2002) explored two decoding methods: one is right-to-left decoding based on the left-to-right beam search algorithm; the other decodes in both directions and merges the two hypothesized partial sentences into one. Finch and Sumita (2009) integrated both mono-directional approaches to reduce the effects caused by language specificity; in particular, they integrated the BLM into their reverse translation decoder. Beyond left-to-right decoding, Zhang et al. (2013) studied the effects of multiple decomposition structures as well as dynamic bidirectional decomposition on SMT.

When it comes to NMT, the dominant RNN-based NMT models also perform translation in a left-to-right manner, leading to the same drawback of underutilizing target-side contexts. To address this issue, Liu et al. (2016) first jointly train both directional LSTM models, and then at test time search for target-side translations that are supported by both models. Similarly, Sennrich et al. (2016a) attempted to re-rank the left-to-right decoding results by right-to-left decoding, leading to diversified translation results. Recently, Hoang et al. (2017) proposed an approximate inference framework based on continuous optimization that enables decoding with bidirectional translation models. Finally, it is noteworthy that our work is also related to pre-translation (Niehues et al. 2016; Zhou et al. 2017) and neural automatic post-editing (Pal et al. 2017; Dowmunt and Grundkiewicz 2017) for NMT, because our model involves two stages of translation.

Overall, the most relevant models include (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017). Our model significantly differs from these works in the following aspects: 1) The motivation of our work differs from theirs. Specifically, in this work we aim to fully exploit the reverse target-side contexts encoded by right-to-left hidden state vectors to improve NMT with left-to-right decoding. In contrast, Liu et al. (2016), Sennrich et al. (2016a) and Hoang et al. (2017) investigated how to exploit bidirectional decoding scores to produce better translations; Niehues et al. (2016) and Zhou et al. (2017) intended to combine the advantages of both NMT and SMT; and Pal et al. (2017) and Dowmunt and Grundkiewicz (2017) explored neural architectures for the task of automatic post-editing of machine translation output. 2) Our model attends to right-to-left hidden state vectors, while (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017) considered the raw best output of a machine translation system instead. 3) Our model is an end-to-end NMT model, while the bidirectional decoders adopted in (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017) were independent from each other, and the component used to produce the raw translation was independent from the NMT model in (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017). In addition, Serban et al. (2018) introduced a backward RNN to refine the forward RNN, but their backward network is only used during training, which is different from our model, where both the forward and backward RNNs are used for sequence generation.

Conclusions and Future Work

In this paper, we have equipped the conventional attentional encoder-decoder NMT model with a backward decoder. In our model, the backward decoder first produces hidden state vectors encoding reverse target-side contexts. Then, the two hidden state sequences generated by the encoder and the backward decoder are simultaneously exploited via attention mechanisms by the forward decoder for translation. Compared with previous models, ours is an end-to-end NMT model that fully utilizes reverse target-side contexts for translation.
Experimental results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model.

Our model is generally applicable to other models with RNN-based decoders. Therefore, the effectiveness of our approach on other tasks involving RNN-based decoders, such as image captioning, will be investigated in future research. Moreover, in our current work the attention mechanisms acting on the encoder and on the backward decoder are independent from each other; intuitively, however, these two mechanisms should be closely associated. We are therefore interested in exploring better combinations of the two attention mechanisms to further refine our model.

Acknowledgments

The authors were supported by the National Natural Science Foundation of China (Nos. 61672440, 61573294 and 61432013), the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), the Natural Science Foundation of Fujian Province of China (No. 2016J05161), and the National Key R&D Program of China (Nos. 2017YFC011300 and 2016YFB1001503). We also thank the reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR2015.

Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics 33(2):201–228.

Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP2014, 1724–1734.

Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL2016, 1693–1703.

Dowmunt, M. J., and Grundkiewicz, R. 2017. An exploration of neural sequence-to-sequence architectures for automatic post-editing. In arXiv:1706.04138v1.

Finch, A., and Sumita, E. 2009. Bidirectional phrase-based statistical machine translation. In Proc. of EMNLP2009, 1124–1132.

Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL2017, 123–135.

Graves, A. 2013. Generating sequences with recurrent neural networks. In arXiv:1308.0850v5.

Hoang, C. D. V.; Haffari, G.; and Cohn, T. 2017. Decoding as continuous optimization in neural machine translation. In arXiv:1701.02854.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP2013, 1700–1709.

Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proc. of NAACL2003, 48–54.

Liu, L.; Utiyama, M.; Finch, A.; and Sumita, E. 2016. Agreement on target-bidirectional neural machine translation. In Proc. of NAACL2016, 411–416.

Niehues, J.; Cho, E.; Ha, T.-L.; and Waibel, A. 2016. Pre-translation for neural machine translation. In Proc. of COLING2016, 1828–1836.

Pal, S.; Naskar, S. K.; Vela, M.; Liu, Q.; and van Genabith, J. 2017. Neural automatic post-editing using prior alignment and reranking. In Proc. of EACL2017, 349–355.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL2002, 311–318.

Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Edinburgh neural machine translation systems for WMT 16. In arXiv:1606.02891v2.

Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL2016, 1715–1725.

Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2018. Twin networks: Matching the future for sequence generation. In Proc. of ICLR2018.

Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016.
Minimum risk training for neural machine translation. In Proc. of ACL2016, 1683–1692.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS2014, 3104–3112.

Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL2016, 76–85.

Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memory-enhanced decoder for neural machine translation. In Proc. of EMNLP2016, 278–286.

Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL2017, 136–145.

Watanabe, T., and Sumita, E. 2002. Bidirectional decoding for statistical machine translation. In Proc. of COLING2002, 1200–1208.

Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL2017, 383–387.

Zhang, H.; Toutanova, K.; Quirk, C.; and Gao, J. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proc. of NAACL2013, 12–21.

Zhang, J.; Wang, M.; Liu, Q.; and Zhou, J. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL2017, 1524–1534.

Zhou, L.; Hu, W.; Zhang, J.; and Zong, C. 2017. Neural system combination for machine translation. In Proc. of ACL2017, 378–384.

Zoph, B., and Knight, K. 2016. Multi-source neural translation. In Proc. of NAACL2016.