Semi-Autoregressive Neural Machine Translation
Chunqi Wang ∗ Ji Zhang Haiqing Chen
Alibaba Group
{shiyuan.wcq, zj122146, haiqing.chenhq}@alibaba-inc.com

Abstract
Existing approaches to neural machine translation are typically autoregressive models. While these models attain state-of-the-art translation quality, they suffer from low parallelizability and are thus slow at decoding long sequences. In this paper, we propose a novel model for fast sequence generation, the semi-autoregressive Transformer (SAT). The SAT keeps the autoregressive property globally but relieves it locally, and is thus able to produce multiple successive words in parallel at each time step. Experiments conducted on English-German and Chinese-English translation tasks show that the SAT achieves a good balance between translation quality and decoding speed. On WMT'14 English-German translation, the SAT achieves a 5.58× speedup while maintaining 88% of the translation quality, significantly better than previous non-autoregressive methods. When producing two words at each time step, the SAT is almost lossless (only 1% degeneration in BLEU score).

∗ Part of this work was done when the author was at the Institute of Automation, Chinese Academy of Sciences.

1 Introduction

Neural networks have been successfully applied to a variety of tasks, including machine translation. The encoder-decoder architecture is the central idea of neural machine translation (NMT). The encoder first encodes a source-side sentence x = x_1 ... x_m into hidden states, and then the decoder generates the target-side sentence y = y_1 ... y_n from the hidden states according to an autoregressive model p(y_t | y_1 ... y_{t-1}, x).

Recurrent neural networks (RNNs) are inherently good at processing sequential data. Sutskever et al. (2014); Cho et al. (2014) successfully applied RNNs to machine translation. Bahdanau et al. (2014) introduced the attention mechanism into the encoder-decoder architecture and greatly improved NMT. GNMT (Wu et al., 2016) further improved NMT by a bunch of tricks including residual connections and reinforcement learning.

Figure 1: The different levels of autoregressive properties. Lines with arrows indicate dependencies. We mark the longest dependency path with bold red lines. The length of the longest dependency path decreases as we relieve the autoregressive property. An extreme case is non-autoregressive, where there is no dependency at all.

The sequential property of RNNs leads to their wide application in language processing. However, this property also hinders parallelizability, so RNNs are slow to execute on modern hardware optimized for parallel execution. As a result, a number of more parallelizable sequence models were proposed, such as ConvS2S (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017). These models avoid dependencies between different positions in each layer and thus can be trained much faster than RNN-based models. At inference time, however, these models are still slow because of the autoregressive property.

A recent work (Gu et al., 2017) proposed a non-autoregressive NMT model that generates all target-side words in parallel. While the parallelizability is greatly improved, the translation quality suffers a considerable drop. In this paper, we propose the semi-autoregressive Transformer (SAT) for faster sequence generation. Unlike Gu et al. (2017), the SAT is semi-autoregressive, which means it keeps the autoregressive property globally but relieves it locally. As a result, the SAT can produce multiple successive words in parallel at each time step.
Figure 1 gives an illustration of the different levels of autoregressive properties. Experiments conducted on English-German and Chinese-English translation show that, compared with non-autoregressive methods, the SAT achieves a better balance between translation quality and decoding speed. On WMT'14 English-German translation, the proposed SAT is 5.58× faster than the Transformer while maintaining 88% of the translation quality. Besides, when producing two words at each time step, the SAT is almost lossless.

It is worth noting that although we apply the SAT to machine translation, it is not designed specifically for translation, as are the models of Gu et al. (2017); Lee et al. (2018). The SAT can also be applied to any other sequence generation task, such as summary generation and image caption generation.

2 Related Work

Almost all state-of-the-art NMT models are autoregressive (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017), meaning that the model generates words one by one and is not friendly to modern hardware optimized for parallel execution. A recent work (Gu et al., 2017) attempts to accelerate generation by introducing a non-autoregressive model. Based on the Transformer (Vaswani et al., 2017), they made many modifications. The most significant modification is that they avoid feeding the previously generated target words to the decoder, but instead feed the source words to predict the next target word. They also introduced a set of latent variables to model the fertilities of source words to tackle the multimodality problem in translation. Lee et al. (2018) proposed another non-autoregressive sequence model based on iterative refinement. The model can be viewed as both a latent variable model and a conditional denoising autoencoder. They also proposed a learning algorithm that is a hybrid of lower-bound maximization and reconstruction error minimization.

The work most relevant to our proposed semi-autoregressive model is Kaiser et al. (2018). They first autoencode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. What we have in common with their idea is that we have not entirely abandoned autoregression, but rather shortened the autoregressive path.

A related study on realistic speech synthesis is the parallel WaveNet (Oord et al., 2017). The paper introduced probability density distillation, a new method for training a parallel feed-forward network from a trained WaveNet (Van Den Oord et al., 2016) with no significant difference in quality.

There are also some works that share a somewhat similar idea with ours: character-level NMT (Chung et al., 2016; Lee et al., 2016) and chunk-based NMT (Zhou et al., 2017; Ishiwatari et al., 2017). Unlike the SAT, these models are not able to produce multiple tokens (characters or words) at each time step. Oda et al. (2017) proposed a bit-level decoder, where a word is represented by a binary code and each bit of the code can be predicted in parallel.
3 The Transformer

Since our proposed model is built upon the Transformer (Vaswani et al., 2017), we briefly introduce the Transformer here. The Transformer uses an encoder-decoder architecture. We describe the encoder and decoder below.
From the source tokens, learned embeddings of dimension d_model are generated, which are then modified by an additive positional encoding. The positional encoding is necessary since the network does not leverage the order of the sequence by recurrence or convolution. The authors use an additive encoding defined as:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

where pos is the position of a word in the sentence and i is the dimension. The authors chose this function because they hypothesized it would allow the model to learn to attend by relative positions easily. The encoded word embeddings are then used as input to the encoder, which consists of N blocks, each containing two layers: (1) a multi-head attention layer, and (2) a position-wise feed-forward layer.

Multi-head attention builds upon scaled dot-product attention, which operates on a query Q, key K and value V:

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V

where d_k is the dimension of the key. The authors scale the dot product by 1/\sqrt{d_k} to avoid the inputs to the softmax function growing too large in magnitude. Multi-head attention computes h different queries, keys and values with h linear projections, computes scaled dot-product attention for each query, key and value, concatenates the results, and projects the concatenation with another linear projection:

H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O

in which W_i^Q, W_i^K ∈ R^{d_model × d_k} and W_i^V ∈ R^{d_model × d_v}. The attention mechanism in the encoder performs attention over itself (Q = K = V), so it is also called self-attention.

The second component in each encoder block is a position-wise feed-forward layer defined as:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

where W_1 ∈ R^{d_model × d_ff}, W_2 ∈ R^{d_ff × d_model}, b_1 ∈ R^{d_ff}, b_2 ∈ R^{d_model}.

For more stable and faster convergence, a residual connection (He et al., 2016) is applied around each layer, followed by layer normalization (Ba et al., 2016). For regularization, dropout (Srivastava et al., 2014) is applied before the residual connections.
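To make these definitions concrete, here is a minimal NumPy sketch (not the authors' implementation, which is in TensorFlow) of the sinusoidal positional encoding and scaled dot-product attention as defined above; the optional mask argument anticipates the causal masks discussed later.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as defined above (assumes even d_model)."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)          # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)               # block masked positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)      # softmax over keys
    return weights @ V
```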
Figure 2: The architecture of the Transformer, also of the SAT, where the red dashed boxes point out the different parts of these two models.
The decoder is similar to the encoder and is also composed of N blocks. In addition to the two layers in each encoder block, the decoder inserts a third layer, which performs multi-head attention over the output of the encoder.

It is worth noting that, different from the encoder, the self-attention layer in the decoder must be masked with a causal mask, which is a lower triangular matrix, to ensure that the prediction for position i can depend only on the known outputs at positions less than i during training.

4 The Semi-Autoregressive Transformer

We propose a novel NMT model, the Semi-Autoregressive Transformer (SAT), that can produce multiple successive words in parallel. As shown in Figure 2, the architecture of the SAT is almost the same as the Transformer, except for some modifications in the decoder.

4.1 Group-Level Chain Rule
Standard NMT models usually factorize the joint probability of a word sequence y_1 ... y_n according to the word-level chain rule

p(y_1 ... y_n | x) = \prod_{t=1}^{n} p(y_t | y_1 ... y_{t-1}, x)

resulting in each word being decoded depending on all previous decoding results, thus hindering parallelizability. In the SAT, we extend the standard word-level chain rule to the group-level chain rule. We first divide the word sequence y_1 ... y_n into consecutive groups

G_1, G_2, ..., G_{[(n-1)/K]+1} = y_1 ... y_K, y_{K+1} ... y_{2K}, ..., y_{[(n-1)/K] \times K + 1} ... y_n

where [·] denotes the floor operation, and K is the group size as well as the indicator of parallelizability. The larger the K, the higher the parallelizability. Except for the last group, all groups must contain K words. Then comes the group-level chain rule

p(y_1 ... y_n | x) = \prod_{t=1}^{[(n-1)/K]+1} p(G_t | G_1 ... G_{t-1}, x)

This group-level chain rule avoids the dependencies between consecutive words if they are in the same group. With the group-level chain rule, the model no longer produces words one by one as the Transformer does, but rather group by group. In the next subsections, we show how to implement the model in detail.
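As a small illustration (ours, not from the paper's code), the following Python snippet shows how a target sequence is divided into consecutive groups of size K and how the joint probability then factorizes over groups:

```python
def split_into_groups(y, K):
    """Divide y_1 ... y_n into consecutive groups of K words; the last group
    may be shorter, exactly as in the group-level chain rule."""
    return [y[i:i + K] for i in range(0, len(y), K)]

y = ["we", "propose", "the", "SAT", "model"]   # n = 5 target words
print(split_into_groups(y, 2))
# [['we', 'propose'], ['the', 'SAT'], ['model']]
#
# The joint probability factorizes as
#   p(G1 | x) * p(G2 | G1, x) * p(G3 | G1, G2, x),
# and the K words inside each group are predicted in parallel.
```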
4.2 Long-Distance Prediction

In autoregressive models, to predict y_t, the model is fed with the previous word y_{t-1}. We refer to this as short-distance prediction. In the SAT, however, we feed y_{t-K} to predict y_t, which we refer to as long-distance prediction. At the beginning of decoding, we feed the model with K special symbols <s> to predict y_1 ... y_K in parallel. Then y_1 ... y_K are fed to the model to predict y_{K+1} ... y_{2K} in parallel. This process continues until a terminator is generated. Figure 3 gives illustrations of both short- and long-distance prediction.

Figure 3: Short-distance prediction (top) and long-distance prediction (bottom).
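The decoding loop implied by long-distance prediction can be sketched as follows. This is an illustrative sketch only: `decode_step` stands in for one forward pass of the SAT decoder (replaced here by a toy function) that maps the current decoder inputs to the K token ids of the next group.

```python
def sat_greedy_decode(decode_step, K, bos_id, eos_id, max_len=24):
    """Greedy long-distance prediction: start from K <s> symbols and produce
    the next K words in parallel at every step until a terminator appears."""
    produced = []
    while len(produced) < max_len:
        inputs = [bos_id] * K + produced        # position t is fed y_{t-K}
        group = decode_step(inputs)             # K token ids, predicted in parallel
        for tok in group:
            if tok == eos_id:
                return produced
            produced.append(tok)
    return produced

K, BOS, EOS = 2, 0, 9
def toy_decode_step(inputs):
    """Toy stand-in for the real decoder: emits consecutive ids, then EOS."""
    start = len(inputs) - K + 1
    return [i if i < 6 else EOS for i in range(start, start + K)]

print(sat_greedy_decode(toy_decode_step, K, BOS, EOS))   # [1, 2, 3, 4, 5]
```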
4.3 Relaxed Causal Mask

In the Transformer decoder, the causal mask is a lower triangular matrix, which strictly prevents earlier decoding steps from peeping at information from later steps. We denote it as the strict causal mask. However, in the SAT decoder, the strict causal mask is not a good choice. As described in the previous subsection, in long-distance prediction, the model predicts y_{K+1} by being fed with y_1. With the strict causal mask, the model can only access y_1 when predicting y_{K+1}, which is not reasonable since y_1 ... y_K have already been produced. It is better to allow the model to access y_1 ... y_K rather than only y_1 when predicting y_{K+1}.

Therefore, we use a coarse-grained lower triangular matrix as the causal mask that allows peeping at later information within the same group. We refer to it as the relaxed causal mask. Given the target length n and the group size K, the relaxed causal mask M ∈ R^{n × n} and its elements are defined as

M[i][j] = 1 if j ≤ ([(i - 1)/K] + 1) × K, and 0 otherwise.

For a more intuitive understanding, Figure 4 gives a comparison between the strict and relaxed causal masks.

Figure 4: Strict causal mask (left) and relaxed causal mask (right) when the target length n = 6 and the group size K = 2. We mark their differences in bold.
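A small NumPy sketch (ours, using 0-based indices rather than the paper's notation) that builds the relaxed causal mask and reproduces the n = 6, K = 2 example of Figure 4:

```python
import numpy as np

def relaxed_causal_mask(n, K):
    """Position i may attend to every position j in its own group of K words
    or in any earlier group (1 = visible, 0 = masked)."""
    i = np.arange(n)[:, None]        # 0-based row (query) index
    j = np.arange(n)[None, :]        # 0-based column (key) index
    group_end = (i // K + 1) * K     # exclusive end of row i's group
    return (j < group_end).astype(int)

print(relaxed_causal_mask(6, 2))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
# With K = 1 this reduces to the strict lower-triangular causal mask of the Transformer.
```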
Using the group-level chain rule instead of the word-level chain rule, long-distance prediction instead of short-distance prediction, and the relaxed causal mask instead of the strict causal mask, we successfully extend the Transformer to the SAT. The Transformer can be viewed as a special case of the SAT when the group size K = 1. The non-autoregressive Transformer (NAT) described in Gu et al. (2017) can also be viewed as a special case of the SAT when the group size K is not less than the maximum target length.

Model | Complexity | Acceleration
Transformer | N(a + b) | 1
SAT (beam search) | (N/K)a + Nb | K(a + b) / (a + Kb)
SAT (greedy search) | (N/K)(a + b) | K

Table 1: Theoretical complexity and acceleration of the SAT. a denotes the time consumed by the decoder network (calculating a distribution over the target vocabulary) at each time step and b denotes the time consumed by search (searching for top scores, expanding nodes and pruning). In practice, a is usually much larger than b since the network is deep.

Table 1 gives the theoretical complexity and acceleration of the model. We list two search strategies separately: beam search and greedy search. Beam search is the most prevalent search strategy. However, it requires the decoder states to be updated once every word is generated, thus hindering the decoding parallelizability. When decoding with greedy search, there is no such concern, so the parallelizability of the SAT can be maximized.

5 Experiments

We evaluate the proposed SAT on English-German and Chinese-English translation tasks.

Datasets

For English-German translation, we choose the corpora provided by WMT 2014 (Bojar et al., 2014). We use the newstest2013 dataset for development and the newstest2014 dataset for test. For Chinese-English translation, the corpora we use are extracted from LDC (they include LDC2002E18, LDC2003E14, LDC2004T08 and LDC2005T0). We chose the NIST02 dataset for development, and the NIST03, NIST04 and NIST05 datasets for test. For English and German, we tokenized and segmented the sentences into subword symbols using byte-pair encoding (BPE) (Sennrich et al., 2015) to restrict the vocabulary size. As for Chinese, we segmented sentences into characters. For English-German translation, we use a shared source and target vocabulary. Table 2 summarizes the two corpora.

Corpus | Sentence Number | Source Vocab Size | Target Vocab Size
EN-DE | 4.5M | 36K | 36K
ZH-EN | 1.8M | 9K | 34K

Table 2: Summary of the two corpora.

Baseline
We use the base Transformer model described in Vaswani et al. (2017) as the baseline, where d_model = 512 and N = 6. In addition, for comparison, we also prepared a lighter Transformer model, in which two encoder/decoder blocks are used (N = 2) and the other hyper-parameters remain the same.

Hyperparameters
Unless otherwise specified, all hyperparameters are inherited from the base Transformer model. We try three different settings of the group size K: K = 2, K = 4, and K = 6. For English-German translation, we share the same weight matrix between the source and target embedding layers and the pre-softmax linear layer. For Chinese-English translation, we only share the weights of the target embedding layer and the pre-softmax linear layer.

Search Strategies
We use two search strategies: beam search and greedy search. As mentioned in Section 4.4, these two strategies lead to different parallelizability. When the beam size is set to 1, greedy search is used; otherwise, beam search is used.
Knowledge Distillation
Knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016) describes a class of methods for training a smaller student network to perform better by learning from a larger teacher network. For NMT, Kim and Rush (2016) proposed a sequence-level knowledge distillation method. In this work, we apply this method to train the SAT using a pre-trained autoregressive Transformer network. This method consists of three steps: (1) train an autoregressive Transformer network (the teacher), (2) run beam search over the training set with this model, and (3) train the SAT (the student) on this newly created corpus.

Model | Beam Size | BLEU | Degeneration | Latency | Speedup
Transformer | 4 | 27.11 | 0% | 346ms | 1.00×
Transformer, N=2 | 4 | 24.30 | 10% | 163ms | 2.12×
NAT (Gu et al., 2017) | - | 17.69 | 25% | 39ms | 15.6×
NAT (rescoring 10) | - | 18.66 | 20% | 79ms | 7.68×
NAT (rescoring 100) | - | 19.17 | 18% | 257ms | 2.36×
LT (Kaiser et al., 2018) | - | 19.80 | 27% | 105ms | -
LT (rescoring 10) | - | 21.00 | 23% | - | -
LT (rescoring 100) | - | 22.50 | 18% | - | -
IRNAT (Lee et al., 2018) | - | 18.91 | 22% | - | 1.98×
This Work:
SAT, K=2 | 4 | 26.90 | 1% | 229ms | 1.51×
SAT, K=4 | 4 | 25.71 | 5% | 149ms | 2.32×
SAT, K=6 | 4 | 24.83 | 8% | 116ms | 2.98×

Table 3: Results on English-German translation. Latency is calculated on a single NVIDIA TITAN Xp without batching. For comparison, we also list results reported by Gu et al. (2017); Kaiser et al. (2018); Lee et al. (2018). Note that Gu et al. (2017); Lee et al. (2018) used PyTorch as their platform, while we and Kaiser et al. (2018) used TensorFlow. Even on the same platform, implementation and hardware may not be exactly the same. Therefore, it is not fair to directly compare BLEU and latency. A fairer way is to compare performance degradation and speedup, which are calculated based on each method's own baseline.
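As the caption notes, the Degeneration and Speedup columns are computed relative to each method's own autoregressive baseline; a tiny illustrative calculation (ours, not the authors' script) using the SAT K=6 row of Table 3:

```python
def degeneration(bleu, baseline_bleu):
    """Relative BLEU drop with respect to the model's own baseline."""
    return 1.0 - bleu / baseline_bleu

def speedup(latency_ms, baseline_latency_ms):
    return baseline_latency_ms / latency_ms

# SAT, K=6 vs. the Transformer baseline (Table 3):
print(round(degeneration(24.83, 27.11), 2))   # 0.08 -> 8% degeneration
print(round(speedup(116, 346), 2))            # 2.98 -> 2.98x speedup
```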
Initialization
Since the SAT and the Transformer have only slight differences in their architecture (see Figure 2), in order to accelerate convergence, we use a pre-trained Transformer model to initialize some parameters in the SAT. These parameters include all parameters in the encoder, the source and target word embeddings, and the pre-softmax weights. Other parameters are initialized randomly. In addition to accelerating convergence, we find this method also slightly improves the translation quality.
Training
As in Vaswani et al. (2017), we train the SAT by minimizing cross-entropy with label smoothing. The optimizer we use is Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 10^-9. We vary the learning rate during training using the learning rate function described in Vaswani et al. (2017). All models are trained for 10K steps on 8 NVIDIA TITAN Xp GPUs, with each minibatch consisting of about 30k tokens. For evaluation, we average the last five checkpoints saved at an interval of 1000 training steps.
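The learning rate function referred to above is the warmup-then-decay schedule of Vaswani et al. (2017); a small sketch follows, where the warmup of 4000 steps is an assumption carried over from the base Transformer configuration, not a value stated in this paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    (schedule of Vaswani et al., 2017; warmup_steps=4000 is assumed here)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 20000):
    print(s, round(transformer_lr(s), 6))
# The learning rate rises linearly during warmup, then decays as step^-0.5.
```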
Evaluation Metrics

We evaluate the translation quality of the model using the BLEU score (Papineni et al., 2002).
Implementation
We implement the proposed SAT with TensorFlow (Abadi et al., 2016). The code and resources needed for reproducing the results are released at https://github.com/chqiwang/sa-nmt.

Table 3 summarizes the results of English-German translation. According to the results, the translation quality of the SAT gradually decreases as K increases, which is consistent with intuition. When K = 2, the SAT decodes 1.51× faster than the Transformer and is almost lossless in translation quality (it only drops 0.21 BLEU score). With K = 6, the SAT achieves a 2.98× speedup while the performance degeneration is only 8%.

When using greedy search, the acceleration becomes much more significant. When K = 6, the decoding speed of the SAT can reach about 5.58× that of the Transformer while maintaining 88% of the translation quality. Compared with Gu et al. (2017); Kaiser et al. (2018); Lee et al. (2018), the SAT achieves a better balance between translation quality and decoding speed. Compared to the lighter Transformer (N = 2), with K = 4, the SAT achieves a higher speedup with significantly better translation quality.

Model | b=1 | b=16 | b=32 | b=64
Transformer | 346ms | 58ms | 53ms | 56ms
SAT, K=2 | 229ms | 38ms | 32ms | 32ms
SAT, K=4 | 149ms | 24ms | 21ms | 20ms
SAT, K=6 | 116ms | 20ms | 17ms | 16ms

Table 4: Time needed to decode one sentence under various batch size settings. A single NVIDIA TITAN Xp is used in this test.

Model | K=1 | K=2 | K=4 | K=6
Latency | 1384ms | 607ms | 502ms | 372ms

Table 5: Time needed to decode one sentence on a CPU device. Sentences are decoded one by one without batching. K=1 denotes the Transformer.

In a real production environment, sentences are often decoded not one by one but batch by batch. To investigate whether the SAT can accelerate decoding when decoding in batches, we test the decoding latency under different batch size settings. As shown in Table 4, the SAT significantly accelerates decoding even with a large batch size. It is also good to know whether the SAT can still accelerate decoding on a CPU device that does not support parallel execution as well as a GPU. Results in Table 5 show that even on a CPU device, the SAT can still accelerate decoding significantly.

Table 6 summarizes the results on Chinese-English translation. With K = 2, the SAT decodes 1.69× faster while maintaining 97% of the translation quality. In an extreme setting where K = 6 and the beam size is 1, the SAT achieves a 6.41× speedup while maintaining 83% of the translation quality.

As shown in Figure 5, sequence-level knowledge distillation is very effective for training the SAT. For larger K, the effect is more significant. This phenomenon echoes observations by Gu et al. (2017); Oord et al. (2017); Lee et al. (2018). In addition, we tried word-level knowledge distillation (Kim and Rush, 2016), but only a slight improvement was observed.
K=6
K=2
K=4
K=6
Chinese-English English-German B L E U w/o KD with KD Figure 5: Performance of the SAT with and withoutsequence-level knowledge distillation. c r o ss - e n t r o p y positionsTransformer SAT,K=2 SAT,K=4
SAT,K=6
Figure 6: Position-wise cross-entropy for various models on English-German translation.
Position-Wise Cross-Entropy
In Figure 6, we plot the position-wise cross-entropy for various models. To compare with the baseline model, the results in the figure are from models trained on the original corpora, i.e., without knowledge distillation. As shown in the figure, the position-wise cross-entropy has an apparent periodicity with a period of K. For positions in the same group, the position-wise cross-entropy increases monotonically, which indicates that long-distance dependencies are always more difficult to model than short ones. This suggests that the key to further improving the SAT is to improve its ability to model long-distance dependencies.
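For completeness, a small NumPy sketch (ours, under the assumption that per-token losses are already available) of how such a position-wise curve can be computed from token-level cross-entropies:

```python
import numpy as np

def position_wise_cross_entropy(token_ce, max_pos=50):
    """Average the token-level cross-entropy at each target position over a set
    of sentences; `token_ce` is a list of 1-D arrays, one per sentence."""
    sums, counts = np.zeros(max_pos), np.zeros(max_pos)
    for ce in token_ce:
        n = min(len(ce), max_pos)
        sums[:n] += np.asarray(ce[:n])
        counts[:n] += 1
    return sums / np.maximum(counts, 1)

# For the SAT, this curve rises within each group of K positions and drops back
# at every group boundary, giving the periodicity seen in Figure 6.
```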
Case Study

Table 7 lists three sample Chinese-English translations from the development set. As shown in the table, even when producing K = 6 words at each time step, the model can still generate fluent sentences. As reported by Gu et al. (2017), instances of repeated words or phrases are most prevalent in their non-autoregressive model. In the SAT, this is also the case. This suggests that we may be able to improve the translation quality of the SAT by reducing the similarity of the output distributions of adjacent positions.

Model | Beam Size | BLEU (NIST03) | BLEU (NIST04) | BLEU (NIST05) | BLEU (Averaged) | Degeneration | Latency | Speedup
Transformer | 4 | 40.74 | 40.54 | 40.48 | 40.59 | 0% | 410ms | 1.00×
Transformer, N=2 | 4 | 37.30 | 38.55 | 36.87 | 37.57 | 7% | 169ms | 2.43×
This Work:
SAT, K=2 | 4 | 39.13 | 40.04 | 39.55 | 39.57 | 3% | 243ms | 1.69×
SAT, K=4 | 4 | 37.08 | 38.06 | 37.12 | 37.42 | 8% | 152ms | 2.70×
SAT, K=6 | 4 | 34.61 | 36.29 | 35.06 | 35.32 | 13% | 129ms | 3.18×

Table 6: Results on Chinese-English translation. Latency is calculated on NIST02.

Example 1
Transformer: the international football federation will severely punish the fraud on the football field
SAT, K=2: fifa will severely punish the deception on the football field
SAT, K=4: fifa a will severely punish the fraud on the football court
SAT, K=6: fifa a will severely punish the fraud on the football football court
Reference: federation international football association to mete out severe punishment for fraud on the football field

Example 2
Transformer: the largescale exhibition of campus culture will also be held during the meeting .
SAT, K=2: the largescale cultural cultural exhibition on campus will also be held during the meeting .
SAT, K=4: the campus campus exhibition will also be held during the meeting .
SAT, K=6: a largescale campus culture exhibition will also be held on the sidelines of the meeting .
Reference: there will also be a large - scale campus culture show during the conference .

Example 3
Transformer: this is the second time mr koizumi has visited the yasukuni shrine since he came to power .
SAT, K=2: this is the second time that mr koizumi has visited the yasukuni shrine since he took office .
SAT, K=4: this is the second time that koizumi has visited the yasukuni shrine since he came into power .
SAT, K=6: this is the second visit to the yasukuni shrine since mr koizumi came office power .
Reference: this is the second time that junichiro koizumi has paid a visit to the yasukuni shrine since he became prime minister .

Table 7: Three sample Chinese-English translations by the SAT and the Transformer. We mark repeated words or phrases by red font and underline.
6 Conclusion

In this work, we have introduced a novel model for faster sequence generation based on the Transformer (Vaswani et al., 2017), which we refer to as the semi-autoregressive Transformer (SAT). Combining the original Transformer with the group-level chain rule, long-distance prediction and the relaxed causal mask, the SAT can produce multiple consecutive words at each time step and thus speed up decoding significantly. We conducted experiments on English-German and Chinese-English translation. Compared with previously proposed non-autoregressive models (Gu et al., 2017; Lee et al., 2018; Kaiser et al., 2018), the SAT achieves a better balance between translation quality and decoding speed. On WMT'14 English-German translation, the SAT achieves a 5.58× speedup while maintaining 88% of the translation quality, significantly better than previous methods. When producing two words at each time step, the SAT is almost lossless (only 1% degeneration in BLEU score).

In the future, we plan to investigate better methods for training the SAT to further shrink the performance gap between the SAT and the Transformer. Specifically, we believe that the following two directions are worth studying. First, use objective functions beyond maximum likelihood to improve the modeling of long-distance dependencies. Second, explore new methods for knowledge distillation. We also plan to extend the SAT to allow the use of different group sizes K at different positions, instead of using a fixed value.

Acknowledgments
We would like to thank the anonymous reviewers for their valuable comments. We also thank Wenfu Wang and Hao Wang for helpful discussion, and Linhao Dong and Jinghao Niu for their help in paper writing.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12-58.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Shonosuke Ishiwatari, Jingtao Yao, Shujie Liu, Mu Li, Ming Zhou, Naoki Yoshinaga, Masaru Kitsuregawa, and Weijia Jia. 2017. Chunk-based decoder for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1901-1912.

Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.

Yusuke Oda, Philip Arthur, Graham Neubig, Koichiro Yoshino, and Satoshi Nakamura. 2017. Neural machine translation via binary code prediction. arXiv preprint arXiv:1704.06918.

Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, and Klaus Macherey. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Hao Zhou, Zhaopeng Tu, Shujian Huang, Xiaohua Liu, Hang Li, and Jiajun Chen. 2017. Chunk-based bi-scale decoder for neural machine translation. arXiv preprint arXiv:1705.01452.