DTMT: A Novel Deep Transition Architecture for Neural Machine Translation
Fandong Meng and Jinchao Zhang
WeChat AI - Pattern Recognition Center, Tencent Inc.
{fandongmeng, dayerzhang}@tencent.com

Abstract
Past years have witnessed rapid developments in Neural Machine Translation (NMT). Most recently, with advanced modeling and training techniques, the RNN-based NMT (RNMT) has shown its potential strength, even compared with the well-known Transformer (self-attentional) model. Although the RNMT model can possess very deep architectures through stacking layers, the transition depth between consecutive hidden states along the sequential axis is still shallow. In this paper, we further enhance the RNN-based NMT through increasing the transition depth between consecutive hidden states and build a novel Deep Transition RNN-based Architecture for Neural Machine Translation, named DTMT. This model enhances the hidden-to-hidden transition with multiple non-linear transformations, as well as maintains a linear transformation path throughout this deep transition by a well-designed linear transformation mechanism to alleviate the gradient vanishing problem. Experiments show that with the specially designed deep transition modules, our DTMT can achieve remarkable improvements on translation quality. Experimental results on the Chinese⇒English translation task show that DTMT can outperform the Transformer model by +2.09 BLEU points and achieve the best results ever reported on the same dataset. On the WMT14 English⇒German and English⇒French translation tasks, DTMT shows superior quality to the state-of-the-art NMT systems, including the Transformer and the RNMT+.

Introduction
Neural Machine Translation (NMT) with an encoder-decoder framework (Cho et al. 2014; Sutskever et al. 2014; Bahdanau et al. 2015) has developed rapidly in recent years.
We release the source code at: https://github.com/fandongmeng/DTMT

Barone et al. (2017) apply the deep transition architecture (Pascanu et al. 2014) to RNMT, while there is still a large margin between this transition model and the state-of-the-art models, e.g. the Transformer (Vaswani et al. 2017) and the RNMT+ (Chen et al. 2018). In this paper, we further enhance the RNMT through increasing the transition depth of the consecutive hidden states along the sequential axis and build a novel and effective Deep Transition RNN-based Architecture for Neural Machine Translation, named DTMT. We design three deep transition modules, which correspondingly extend the RNN modules of shallow RNMT in the encoder and the decoder, to enhance the non-linear transformation between consecutive hidden states. Since the deep transition increases the number of nonlinear steps, this may lead to the problem of vanishing gradients. To alleviate this problem, we propose a Linear Transformation enhanced Gated Recurrent Unit (L-GRU) for DTMT, which provides a linear transformation path throughout the deep transition.

We test the effectiveness of our DTMT on Chinese⇒English, English⇒German and English⇒French translation tasks. Experimental results on NIST Chinese⇒English translation show that DTMT can outperform the Transformer model by +2.09 BLEU points and achieve the best results ever reported on the same dataset. On WMT14 English⇒German and English⇒French translation, it consistently leads to substantial improvements and shows superior quality to the state-of-the-art NMT systems (Vaswani et al. 2017; Chen et al. 2018). Our contributions are summarized as follows:

• We tap the potential strength of deep transition between consecutive hidden states and propose a novel deep transition RNN-based architecture for NMT, which achieves state-of-the-art results on multiple translation tasks.

• We propose a simple yet more effective linear transformation enhanced GRU for our deep transition RNMT, which provides a linear transformation path for the deep transition of consecutive hidden states. Additionally, L-GRU can also be used to enhance other GRU-based architectures, such as the shallow RNMT and the stacked RNMT.

• We apply recent advanced techniques, including multi-head attention, layer normalization, label smoothing, and dropouts to enhance our DTMT. Additionally, we find that the positional encoding (Vaswani et al. 2017), originally designed for the non-recurrent Transformer, can also assist the training of our RNN-based model.
Background
Attention-based RNMT
Given a source sentence x = {x_1, x_2, ..., x_n} and a target sentence y = {y_1, y_2, ..., y_m}, RNN-based neural machine translation (RNMT) models the translation probability word by word:

p(y|x) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, x; \theta) = \prod_{t=1}^{m} \mathrm{softmax}(f(c_t, y_{t-1}, s_t))   (1)

where f(·) is a non-linear function, and s_t is the hidden state of the decoder RNN at time step t:

s_t = g(s_{t-1}, y_{t-1}, c_t)   (2)

c_t is a distinct source representation for time t, calculated as a weighted sum of the source annotations:

c_t = \sum_{j=1}^{n} a_{t,j} h_j   (3)

Formally, h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j] is the annotation of x_j, which can be computed by a bi-directional RNN (Schuster and Paliwal 1997) with GRU and contains information about the whole source sentence with a strong focus on the parts surrounding x_j. Here,

\overrightarrow{h}_j = \mathrm{GRU}(x_j, \overrightarrow{h}_{j-1}); \quad \overleftarrow{h}_j = \mathrm{GRU}(x_j, \overleftarrow{h}_{j+1})   (4)

The weight a_{t,j} is computed as

a_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{n} \exp(e_{t,k})}   (5)

where e_{t,j} = v_a^T \tanh(W_a \tilde{s}_{t-1} + U_a h_j) scores how much \tilde{s}_{t-1} attends to h_j, and \tilde{s}_{t-1} = g(s_{t-1}, y_{t-1}) is an intermediate state tailored for computing the attention score.

Figure 1: Deep transition RNN, in which the transition between consecutive hidden states is deep.

Deep Transition RNN
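To make Eqs. (3)-(5) concrete, the following minimal NumPy sketch computes the attention weights and the context vector for one decoding step. All variable names (and the toy dimensions) are illustrative assumptions, not taken from the released DTMT code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(h, s_tilde, W_a, U_a, v_a):
    """Additive attention: e_{t,j} = v_a^T tanh(W_a s_tilde + U_a h_j),
    then a_{t,j} = softmax(e_t) and c_t = sum_j a_{t,j} h_j (Eqs. 3-5)."""
    scores = np.array([v_a @ np.tanh(W_a @ s_tilde + U_a @ h_j) for h_j in h])
    a = softmax(scores)                      # attention weights a_{t,j}
    c_t = (a[:, None] * h).sum(axis=0)       # context vector c_t
    return c_t, a

# toy usage: 5 source positions, bidirectional annotations of size 2*d
rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(5, 2 * d))              # source annotations h_j
s_tilde = rng.normal(size=d)                 # intermediate decoder state
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, 2 * d))
v_a = rng.normal(size=d)
c_t, a = additive_attention(h, s_tilde, W_a, U_a, v_a)
```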
Barone et al. (2017) first apply the deep transition RNN to NMT. As shown in Figure 1, in a deep transition RNN the next state is computed by the sequential application of multiple transition layers at each time step, effectively using a feed-forward network embedded inside the recurrent cell. Obviously, this kind of architecture increases the depth of transition between the consecutive hidden states along the sequential axis, unlike the deep stacked RNN, in which the transition between consecutive hidden states is still shallow. Although the deep transition RNN has been proven superior to the deep stacked RNN on the language modeling task (Pascanu et al. 2014), its potential for NMT has not been fully exploited, and there remains a large gap to state-of-the-art systems such as the Transformer (Vaswani et al. 2017) and the RNMT+ (Chen et al. 2018).
Model Description
In this section, we describe our novel Deep Transition RNN-based Architecture for NMT (DTMT). As shown in Figure 2, DTMT consists of a bidirectional deep transition encoder and a deep transition decoder, connected by multi-head attention (Vaswani et al. 2017). There are three deep transition modules: 1) the encoder transition for encoding the source sentence into a sequence of distributed representations; 2) the query transition for forming a query state to attend to the source representations; and 3) the decoder transition for generating the final decoder state of the current time step. In each transition module, the transition block consists of a Linear Transformation enhanced GRU (L-GRU) at the bottom followed by several Transition GRUs (T-GRUs) from bottom to top. Before proceeding to the details of DTMT, we first describe the key components (i.e. GRU and its variants) of our deep transition modules.

Figure 2: The architecture of DTMT. The bidirectional deep transition encoder (on the left) and the deep transition decoder (on the right) are connected by multi-head attention. There are three deep transition modules, namely the encoder transition, the query transition and the decoder transition, each of which consists of an L-GRU (the square frames fused with a small circle) at the bottom followed by several T-GRUs (the square frames) from bottom to top.
Gated Recurrent Unit and its Variants
GRU:
The Gated Recurrent Unit (GRU) (Cho et al. 2014) computes each hidden state h_t as follows:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t   (6)

where \odot is an element-wise product, z_t is the update gate, and \tilde{h}_t is the candidate activation, computed as:

\tilde{h}_t = \tanh(W_{xh} x_t + r_t \odot (W_{hh} h_{t-1}))   (7)

where x_t is the input embedding, and r_t is the reset gate. Reset and update gates are computed as:

r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})   (8)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})   (9)

Actually, GRU can be viewed as a non-linear activation function with a specially designed gating mechanism, since the updated h_t has two sources controlled by the update gate and the reset gate: 1) the direct transfer from the previous state h_{t-1}; and 2) the candidate update \tilde{h}_t, which is a nonlinear transformation of the previous state h_{t-1} and the input embedding.
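A minimal NumPy sketch of one GRU step following Eqs. (6)-(9); the parameter dictionary, initialization helper and omission of bias terms are our simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(d_x, d_h, seed=0):
    """Randomly initialized weight matrices; biases omitted for brevity."""
    rng = np.random.default_rng(seed)
    names = ["W_xr", "W_hr", "W_xz", "W_hz", "W_xh", "W_hh"]
    shapes = [(d_h, d_x), (d_h, d_h)] * 3
    return {n: 0.1 * rng.normal(size=s) for n, s in zip(names, shapes)}

def gru_step(x_t, h_prev, p):
    """Standard GRU update."""
    r = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev)              # reset gate, Eq. (8)
    z = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev)              # update gate, Eq. (9)
    h_cand = np.tanh(p["W_xh"] @ x_t + r * (p["W_hh"] @ h_prev))   # candidate, Eq. (7)
    return (1.0 - z) * h_prev + z * h_cand                         # Eq. (6)
```

T-GRU: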
Transition GRU (T-GRU) is a key component of the deep transition block. A basic deep transition block can be composed of a GRU followed by several T-GRUs from bottom to top at each time step, just as in Figure 1. In the whole recurrent procedure, for the current time step, the "state" output of one GRU/T-GRU is used as the "state" input of the next T-GRU, and the "state" output of the last T-GRU for the current time step is carried over as the "state" input of the first GRU for the next time step. For a T-GRU, each hidden state at time step t is computed as follows:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t   (10)
\tilde{h}_t = \tanh(r_t \odot (W_{hh} h_{t-1}))   (11)

where the reset gate r_t and the update gate z_t are computed as:

r_t = \sigma(W_{hr} h_{t-1})   (12)
z_t = \sigma(W_{hz} h_{t-1})   (13)

As we can see, T-GRU is a special case of GRU with only "state" as input. It is also similar to the convolutional GRU (Kaiser and Sutskever 2015). Here the updated h_t has two sources controlled by the update gate and the reset gate: 1) the direct transfer from the previous hidden state h_{t-1}; and 2) the candidate update \tilde{h}_t, which is a nonlinear transformation of the previous hidden state h_{t-1}. That is to say, T-GRU conducts both non-linear transformation and direct transfer of the input. This architecture makes training deep models easier.
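The corresponding T-GRU step, written to mirror Eqs. (10)-(13); again a self-contained sketch with assumed parameter names and no bias terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t_gru_step(h_prev, p):
    """Transition GRU: a GRU that takes only the previous state as input,
    used for the inner layers of a deep transition block."""
    r = sigmoid(p["W_hr"] @ h_prev)                 # reset gate, Eq. (12)
    z = sigmoid(p["W_hz"] @ h_prev)                 # update gate, Eq. (13)
    h_cand = np.tanh(r * (p["W_hh"] @ h_prev))      # candidate activation, Eq. (11)
    return (1.0 - z) * h_prev + z * h_cand          # Eq. (10)
```

L-GRU: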
L-GRU is a Linear Transformation enhanced GRU that incorporates an additional linear transformation of the input in its dynamics. Each hidden state at time step t is computed as follows:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t   (14)

where the candidate activation \tilde{h}_t is computed as:

\tilde{h}_t = \tanh(W_{xh} x_t + r_t \odot (W_{hh} h_{t-1})) + l_t \odot H(x_t)   (15)

where the reset gate r_t and the update gate z_t are computed as in Formulas (8) and (9), and H(x_t) = W_x x_t is the linear transformation of the input x_t, controlled by the linear transformation gate l_t, which is computed as:

l_t = \sigma(W_{xl} x_t + W_{hl} h_{t-1})   (16)

In L-GRU, the updated h_t has three sources controlled by the update gate, the reset gate and the linear transformation gate: 1) the direct transfer from the previous state h_{t-1}; 2) the candidate update \tilde{h}_t; and 3) a direct contribution from the linear transformation of the input, H(x_t). Compared with GRU, L-GRU conducts both non-linear and linear transformations of the inputs, including the embedding input and the state input. Clearly, with L-GRU and T-GRUs, the deep transition model can alleviate the problem of vanishing gradients, since this structure provides a linear transformation path as a supplement between consecutive hidden states, which are otherwise connected by only multi-step non-linear transformations (e.g. GRU+T-GRUs).

Our L-GRU is inspired by the Linear Associative Unit (LAU) (Wang et al. 2017), which also incorporates a linear transformation of the input x_t while preserving the original non-linear abstraction produced by the input and the previous hidden state. Different from the LAU, 1) the linear transformation of the input is controlled by both the update gate z_t and the linear transformation gate l_t; and 2) the linear transformation gate l_t only focuses on the linear transformation of the embedding input. These may be the main reasons why L-GRU is more effective than the LAU, as verified in our experiments described later.
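A sketch of one L-GRU step following Eqs. (14)-(16); as before, parameter names and the absence of biases are our illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_gru_step(x_t, h_prev, p):
    """Linear Transformation enhanced GRU: a GRU whose candidate activation
    also receives a gated linear copy of the input."""
    r = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev)            # reset gate, Eq. (8)
    z = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev)            # update gate, Eq. (9)
    l = sigmoid(p["W_xl"] @ x_t + p["W_hl"] @ h_prev)            # linear gate, Eq. (16)
    h_x = p["W_x"] @ x_t                                         # H(x_t): linear transform of input
    h_cand = np.tanh(p["W_xh"] @ x_t + r * (p["W_hh"] @ h_prev)) + l * h_x  # Eq. (15)
    return (1.0 - z) * h_prev + z * h_cand                       # Eq. (14)
```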
DTMT

The formal description of the encoder and the decoder of DTMT is as follows:
Encoder:
The encoder is a bidirectional deep transition encoder based on recurrent neural networks. Let L_s be the depth of the encoder transition; then for the j-th source word in the forward direction, the forward source state \overrightarrow{h}_j \equiv \overrightarrow{h}_{j,L_s} is computed as:

\overrightarrow{h}_{j,0} = \text{L-GRU}(x_j, \overrightarrow{h}_{j-1,L_s})
\overrightarrow{h}_{j,k} = \text{T-GRU}_k(\overrightarrow{h}_{j,k-1}) \quad \text{for } 1 \le k \le L_s

where the input to the first L-GRU is the word embedding x_j, while the T-GRUs have only "state" as input. Recurrence occurs as the previous state \overrightarrow{h}_{j-1,L_s} enters the computation in the first L-GRU transition for the current step. The reverse source word states are computed similarly and concatenated to the forward ones to form the bidirectional source annotations C \equiv \{[\overrightarrow{h}_{j,L_s}, \overleftarrow{h}_{j,L_s}]\}.
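Building on the l_gru_step and t_gru_step sketches above, the following illustrates one forward pass of the encoder transition over a source sequence; it is a simplified single-direction sketch under the same assumed parameter layout, not the released implementation.

```python
def deep_transition_encode(xs, h0, lgru_params, tgru_params_list):
    """Run the forward encoder over a source sequence xs (a list of word
    embeddings): at each position an L-GRU consumes the embedding and the
    previous step's final state, then L_s T-GRUs deepen the transition."""
    h, states = h0, []
    for x_j in xs:
        h = l_gru_step(x_j, h, lgru_params)      # h_{j,0}
        for p in tgru_params_list:               # h_{j,1} ... h_{j,L_s}
            h = t_gru_step(h, p)
        states.append(h)                         # forward annotation of x_j
    # run the same procedure on reversed(xs) and concatenate the two
    # directions to obtain the bidirectional annotations C
    return states
```

Decoder: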
As shown in Figure 2, the deep transition decoder consists of two transition modules, named the query transition and the decoder transition, of which the query transition is conducted before the multi-head attention and the decoder transition is conducted after it. These transition modules can be conducted to an arbitrary transition depth. Suppose the depth of the query transition is L_q and the depth of the decoder transition is L_d; then

s_{t,0} = \text{L-GRU}(y_{t-1}, s_{t-1,L_q+L_d+1})
s_{t,k} = \text{T-GRU}_k(s_{t,k-1}) \quad \text{for } 1 \le k \le L_q

where y_{t-1} is the embedding of the previous target word. Then the context representation c_t of the source sentence is computed by multi-head additive attention:

c_t = \text{Multihead-Attention}(C, s_{t,L_q})

After that, the decoder transition is computed as

s_{t,L_q+1} = \text{L-GRU}(c_t, s_{t,L_q})
s_{t,L_q+p} = \text{T-GRU}_p(s_{t,L_q+p-1}) \quad \text{for } 2 \le p \le L_d+1

The current state vector s_t \equiv s_{t,L_q+L_d+1} is then used by the feed-forward output network, as in Formula (1), to predict the current target word.
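The decoder step below ties the pieces together, reusing l_gru_step, t_gru_step and additive_attention from the earlier sketches. It is an illustrative simplification: single-head additive attention stands in for the paper's multi-head attention, and the parameter dictionary layout is assumed.

```python
def decoder_step(y_prev_emb, s_prev, C, params):
    """One DTMT decoder step: query transition -> attention over the
    source annotations C -> decoder transition."""
    s = l_gru_step(y_prev_emb, s_prev, params["query_lgru"])   # s_{t,0}
    for p in params["query_tgrus"]:                            # s_{t,1} ... s_{t,L_q}
        s = t_gru_step(s, p)
    c_t, _ = additive_attention(C, s, params["W_a"], params["U_a"], params["v_a"])
    s = l_gru_step(c_t, s, params["dec_lgru"])                 # s_{t,L_q+1}
    for p in params["dec_tgrus"]:                              # s_{t,L_q+2} ... s_{t,L_q+L_d+1}
        s = t_gru_step(s, p)
    return s, c_t   # s feeds the output softmax of Formula (1)
```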
Advanced Techniques

Besides the multi-head attention, we apply the most recent advanced techniques during training to enhance our model:
• Dropout: We apply dropout on the embedding layers, the output layer before prediction, and the candidate activation output of the recurrent units (Semeniuta et al. 2016).
• Label Smoothing: We use uniform label smoothing with an uncertainty of 0.1 (Szegedy et al. 2015).
• Layer Normalization: Inspired by the Transformer, per-gate layer normalization (Ba et al. 2016) is applied inside the recurrent units.
• Positional Encoding: We also add the positional encoding (Vaswani et al. 2017). Additionally, we apply a scaling factor of 1/√d_k (d_k is the dimension of the embedding) to the original positional encoding function.
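A sketch of the sinusoidal positional encoding with the 1/√d_k scaling described above; how the scaled encoding is combined with the embeddings (and at which layers) is an implementation detail not specified here.

```python
import numpy as np

def scaled_positional_encoding(max_len, d_k):
    """Sinusoidal positional encoding (Vaswani et al. 2017), multiplied by
    1/sqrt(d_k); the scaling factor is the only change to the original."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_k)[None, :]                             # (1, d_k)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_k)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return pe / np.sqrt(d_k)
```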
SYSTEM                 ARCHITECTURE              #PARA    MT06   MT02   MT03   MT04   MT05   MT08   AVE
Existing end-to-end NMT systems
Shen et al. (2016)     GRU with MRT              –        37.34  40.36  40.93  41.37  38.81  29.23  38.14
Wang et al. (2017)     DeepLAU (4 layers)        –        37.29  –      39.35  41.15  38.07  –      –
Zhang et al. (2018)    Bi-directional decoding   –        38.38  –      40.02  42.32  38.84  –      –
Meng et al. (2018)     GRU with KV-Memory        –        39.08  40.67  38.40  41.10  38.73  30.87  37.95
Cheng et al. (2018)    AST (2 layers)            –        44.44  46.10  44.07  45.61  44.06  34.94  42.96
Vaswani et al. (2017)  Transformer (BIG)         277.6M   44.78  45.32  44.13  45.92  44.06  35.33  42.95
Our end-to-end NMT systems
this work              SHALLOW RNMT              143.2M   42.99  44.24  42.96  44.97  42.69  33.00  41.57
this work              DTMT

Table 1: Case-insensitive BLEU scores (%) on NIST Chinese⇒English translation. Our deep transition model outperforms the state-of-the-art models, including the Transformer (Vaswani et al. 2017) and the deep RNMT augmented with AST (Cheng et al. 2018).
Experiments
Setup
We carry out experiments on Chinese⇒English (Zh⇒En), English⇒German (En⇒De) and English⇒French (En⇒Fr) translation tasks. For these tasks, we tokenize the references and evaluate the translation quality with BLEU scores (Papineni et al. 2002) computed by the multi-bleu.pl script. For Zh⇒En, the training data consists of 1.25M sentence pairs extracted from the LDC corpora. We choose the NIST 2006 (MT06) dataset as our validation set, and the NIST 2002 (MT02), 2003 (MT03), 2004 (MT04), 2005 (MT05) and 2008 (MT08) datasets as our test sets. For En⇒De and En⇒Fr, we perform our experiments on the corpora provided by WMT14, which comprise 4.5M and 36M sentence pairs, respectively. We use newstest2013 as the validation set and newstest2014 as the test set.
Training Details
In training the neural networks, we follow Sennrich et al. (2016) to split words into sub-word units. For Zh⇒En, the number of merge operations in byte pair encoding (BPE) is set to 30K for both source and target languages. For En⇒De and En⇒Fr, we use a shared vocabulary generated by 32K BPE operations, following Chen et al. (2018). The parameters are initialized uniformly between [-0.08, 0.08] and updated by SGD with the learning rate controlled by the Adam optimizer (Kingma and Ba 2014). Following Chen et al. (2018), we vary the learning rate as follows (a sketch of this schedule is given at the end of this subsection):

lr = lr_0 \cdot \min\left(1 + \frac{t \cdot (n-1)}{n \cdot p},\; n,\; n \cdot (2n)^{\frac{s - n \cdot t}{e - s}}\right)   (17)

Here, t is the current step, n is the number of concurrent model replicas in training, p is the number of warmup steps, s is the start step of the exponential decay, and e is the end step of the decay. For Zh⇒En, we use 2 M40 GPUs for synchronous training and set p, s and e to 500, 8000, and 64000 respectively. For En⇒De, we use 8 M40 GPUs and set p, s and e to 50, 200000, and 1200000 respectively. For En⇒Fr, we use 8 M40 GPUs and set p, s and e to 50, 400000, and 3000000 respectively.

We limit the length of sentences to 128 sub-words for Zh⇒En and 256 sub-words for En⇒De and En⇒Fr in the training stage. We batch sentence pairs according to approximate length, and limit input and output tokens to 4096 per GPU. We apply the dropout strategy to avoid over-fitting (Hinton et al. 2012). For Zh⇒En, we set the dropout rates of the embedding layers, the layer before prediction and the RNN output layer to 0.5, 0.5 and 0.3 respectively. For En⇒De, we set these dropout rates to 0.3, 0.3 and 0.1 respectively. For En⇒Fr, we set these dropout rates to 0.2, 0.2 and 0.1 respectively. For each model of the translation tasks, the dimension of the word embeddings and the hidden layer is 1024. Translations are generated by beam search, and log-likelihood scores are normalized by the sentence length. We set the beam size to 4 with a length penalty. We monitor the training process every 2K iterations and decide the early stop condition by validation BLEU.
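A sketch of the learning-rate schedule of Eq. (17); the function name and argument order are ours, and lr0 stands for the base learning rate whose value is task-specific.

```python
def rnmt_plus_lr(t, lr0, n, p, s, e):
    """Learning-rate schedule of Eq. (17), following Chen et al. (2018):
    a warmup over p steps, a plateau at n * lr0, then an exponential decay
    between the start step s and the end step e."""
    warmup = 1.0 + t * (n - 1) / (n * p)
    plateau = float(n)
    decay = n * (2.0 * n) ** ((s - n * t) / (e - s))
    return lr0 * min(warmup, plateau, decay)
```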
System Description

• SHALLOW RNMT: a shallow yet strong RNMT baseline system, which is our in-house implementation of the attention-based RNMT (Bahdanau et al. 2015).
• DTMT_NUM: our deep transition systems, where the subscript NUM stands for the transition depth (i.e. the number of T-GRUs) above the bottom L-GRU in each transition module (i.e. the encoder transition, the query transition, and the decoder transition). For example, DTMT_1 uses one T-GRU above the L-GRU in each transition module.

Results on NIST Chinese⇒English
Table 1 shows the results on the NIST Zh⇒En translation task. Our baseline system SHALLOW RNMT significantly outperforms previous RNN-based NMT systems on the same datasets. Shen et al. (2016) propose minimum risk training (MRT) to optimize the model with respect to BLEU scores. Wang et al. (2017) propose the linear associative units (LAU) to address the issue of gradient diffusion, and their deep model (DeepLAU) achieves good performance.
Zhang et al. (2018) propose to exploit both left-to-right and right-to-left decoding strategies to capture bidirectional dependencies. Meng et al. (2018) propose key-value memory augmented attention to improve the adequacy of translation. Compared with them, our baseline system SHALLOW RNMT outperforms their best models by more than 3 BLEU points. SHALLOW RNMT is only 1.4 BLEU points lower than the state-of-the-art deep models, i.e. the Transformer (with 6 attention layers) (Vaswani et al. 2017) and the deep RNMT augmented with AST (Cheng et al. 2018). Our deep transition models bring further substantial improvements over SHALLOW RNMT. Compared with the Transformer and the deep RNMT augmented with AST, DTMT achieves better BLEU scores, outperforming the Transformer by +2.09 BLEU points on average and yielding the best results ever reported on this dataset.

ARCHITECTURE                                EN-DE   EN-FR
Zhou et al. (2016)     LSTM (8 layers)      20.60   37.70
Luong et al. (2015)    LSTM (4 layers)      20.90   31.50
Wang et al. (2017)     DeepLAU (4 layers)   23.80   35.10
Wu et al. (2016)       GNMT (8 layers)      24.60   38.95
Gehring et al. (2017)  ConvS2S (15 layers)  25.16   40.46
Cheng et al. (2018)    AST (2 layers)       25.26   –
Vaswani et al. (2017)  Transformer (BIG)    28.40   41.00
Chen et al. (2018)     RNMT+ (8 layers)     28.49   41.00
this work              SHALLOW RNMT         25.66   39.28
this work              DTMT

Table 2: Case-sensitive BLEU scores (%) on WMT14 English⇒German and English⇒French translation.
Results on WMT14 En⇒De and En⇒Fr

To demonstrate that our models work well across different language pairs, we also evaluate our models on the WMT14 En⇒De and En⇒Fr translation tasks, as listed in Table 2. For comparison, we list existing NMT systems which are trained on the same WMT14 corpora. Among these systems, the Transformer (Vaswani et al. 2017) and the RNMT+ (Chen et al. 2018) report the best results. Our SHALLOW RNMT achieves better performance than most previous RNMT systems except for the RNMT+, which demonstrates that SHALLOW RNMT is also a strong baseline system for both En⇒De and En⇒Fr. Our deep transition model improves over this strong baseline on En⇒De and by +1.47 BLEU on En⇒Fr. With a deeper transition architecture, DTMT brings larger gains on En⇒De and +2.74 BLEU on En⇒Fr over the baseline, and outperforms the state-of-the-art systems, i.e. the Transformer and the RNMT+.

ARCHITECTURE    RNN            #PARA    BLEU
SHALLOW RNMT    GRU            143.2M   41.57
SHALLOW RNMT    LAU
SHALLOW RNMT    L-GRU
DTMT            GRU+T-GRU
DTMT            LAU+T-GRU
DTMT            GRU+T-GRUs
DTMT            LAU+T-GRUs

Table 3: Comparisons of GRU, LAU and L-GRU with different architectures on NIST Chinese⇒English translation (average BLEU scores (%) on test sets). The italics in the "RNN" column indicate the potential variants of the corresponding architecture; for example, the RNN units in DTMT can be GRU+T-GRU or LAU+T-GRU.
Analysis
L-GRU vs. GRU & LAU:
We investigate the effectiveness of the proposed L-GRU on different architectures, including the SHALLOW RNMT and the DTMTs. From Table 3 we can see that the L-GRU is effective, since it consistently brings substantial improvements over different architectures. In particular, it brings +2.26 BLEU points of improvement over SHALLOW RNMT on average over the five test sets, and it also leads to +0.78~+0.88 BLEU points of improvement over our deep transition architectures. Compared with the LAU (Wang et al. 2017), L-GRU still yields gains of up to +0.77 BLEU points. These results demonstrate that L-GRU is a more effective unit for both the deep transition models and the shallow RNMT model.
Ablation Study:
Our deep transition model consists of three deep transition modules: the encoder transition, the query transition and the decoder transition. We perform an ablation study on Zh⇒En translation to investigate the effectiveness of these transition modules, taking DTMT as the base model. As shown in Table 4, removing any one of the transition modules degrades translation quality by up to -1.74 BLEU. Among these transition modules, the encoder transition is the most important, since deleting it leads to the most obvious decline (-1.74 BLEU). We also conduct an ablation study of the L-GRU and the "Advanced Techniques" on the En⇒De task. As shown in Table 5, deleting the L-GRU and/or the "Advanced Techniques" leads to sharp declines in translation quality. Therefore, we can conclude that both the L-GRU and the "Advanced Techniques" are key components of DTMT.

enc-transition   query-transition   dec-transition   BLEU
√                √                  √
×                √                  √
√                ×                  √
√                √                  ×

Table 4: Ablation study of deep transition modules on NIST Chinese⇒English translation (average BLEU scores (%) on test sets). Here "×" stands for replacing the transition module with the corresponding part of the conventional shallow RNMT (Bahdanau et al. 2015).

ARCHITECTURE                        BLEU
DTMT
  − L-GRU                           27.81
  − L-GRU & Advanced Techniques     26.27

Table 5: Ablation study of the L-GRU and the "Advanced Techniques" on WMT14 English⇒German translation.
Transition Depth & Positional Encoding:
Table 6 shows the impact of the transition depth and the positional encoding on Zh⇒En translation. From these results, we can draw the following conclusions: 1) as the transition depth increases (rows 3-6), our model consistently achieves better performance; 2) T-GRUs do bring significant improvements even over the strong baseline (rows 2-3); and 3) across different architectures, the positional encoding further brings consistent improvements (+0.1~+0.3 BLEU) over its counterpart without the positional encoding. This demonstrates that, although the positional encoding is originally designed for a non-recurrent network (i.e. the Transformer), it can also assist the training of RNN-based models by modeling the positions of tokens in the sequence.

ARCHITECTURE                             NON-PE   PE
1    SHALLOW RNMT                        41.22    41.57
2      + L-GRU                           43.54    43.83
3-6  DTMT (increasing transition depth)

Table 6: Impact of transition depth and the positional encoding (PE) on NIST Chinese⇒English translation (average BLEU scores (%) on test sets).

About Length:
A more detailed comparison between DTMT, SHALLOW RNMT and the Transformer suggests that our deep transition architectures are essential to achieving the superior performance. Figure 3 shows the BLEU scores of the generated translations on the test sets with respect to the lengths of the source sentences. In particular, we test the BLEU scores on sentences longer than {0, 10, 20, 30, 40, 50, 60} words in the merged test set of MT02, MT03, MT04, MT05 and MT08. Clearly, on sentences of different lengths, DTMT consistently outperforms SHALLOW RNMT and the Transformer.

Figure 3: The BLEU scores (%) of generated translations on the merged test sets with respect to the lengths of source sentences. The numbers on the X-axis stand for source sentences longer than the corresponding length.
Related Work
Our work is inspired by the deep transition RNN (Pascanu et al. 2014). Barone et al. (2017) first apply this kind of architecture to NMT, while there is still a large margin between this transition model and the state-of-the-art NMT models. Different from these works, we substantially enhance the deep transition architecture and build a state-of-the-art deep transition NMT model from three aspects: 1) fusing L-GRU and T-GRUs to provide a linear transformation path between consecutive hidden states, as well as preserving the non-linear transformation path; 2) exploiting three deep transition modules, including the encoder transition, the query transition and the decoder transition; and 3) investigating and combining recent advanced techniques, including multi-head attention, label smoothing, layer normalization, dropout on multiple layers and positional encoding.

Our work is also inspired by deep stacked RNN models for NMT (Zhou et al. 2016; Wu et al. 2016; Wang et al. 2017; Chen et al. 2018). Zhou et al. (2016) propose fast-forward connections to address the notorious problem of vanishing/exploding gradients for deep stacked RNMT. Wang et al. (2017) propose the Linear Associative Unit (LAU) to reduce the gradient path inside the recurrent units. Different from these studies, we focus on the deep transition architecture and propose a novel linear transformation enhanced GRU (L-GRU) for our deep transition RNMT. L-GRU is verified to be more effective than the LAU, although L-GRU exploits more concise operations with the same parameter quantity to incorporate the linear transformation of the embedding input. Our training recipe is also inspired by the RNMT+ (Chen et al. 2018).

Conclusion
We propose a novel and effective deep transition architecture for NMT. Our empirical study on Chinese⇒English, English⇒German and English⇒French translation tasks shows that our DTMT achieves remarkable improvements in translation quality. Experimental results on NIST Chinese⇒English translation show that DTMT outperforms the Transformer by +2.09 BLEU points, even with fewer parameters, and achieves the best results ever reported on the same dataset. On the WMT14 English⇒German and English⇒French tasks, it shows superior quality to the state-of-the-art NMT systems (Vaswani et al. 2017; Chen et al. 2018).
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. Deep architectures for neural machine translation. arXiv preprint arXiv:1707.07631, 2017.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018.
Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. Towards robust neural machine translation. In ACL, 2018.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. CoRR, abs/1511.08228, 2015.
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. In ICML, 2017.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, and Di Wang. Neural machine translation with key-value memory-augmented attention. In IJCAI, 2018.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In ICLR, 2014.
Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. TSP, 1997.
Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. CoRR, abs/1603.05118, 2016.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In ACL, 2016.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. Why self-attention? A targeted evaluation of neural machine translation architectures. In EMNLP, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. Deep neural machine translation with linear associative unit. In ACL, 2017.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous bidirectional decoding for neural machine translation. In AAAI, 2018.
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. TACL, 2016.