Glancing Transformer for Non-Autoregressive Neural Machine Translation
Lihua Qian∗, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, Lei Li
Shanghai Jiao Tong University, ByteDance AI Lab, Nanjing University
{qianlihua,lqiu,wnzhang,yyu}@apex.sjtu.edu.cn, {zhouhao.nlp,wangmingxuan.89,lileilab}@bytedance.com, [email protected]
∗ The work was done when the first author was an intern at ByteDance.
Abstract
Non-autoregressive neural machine translation achieves remarkable inference acceleration compared to autoregressive models. However, current non-autoregressive models still fall behind their autoregressive counterparts in prediction accuracy. We attribute the accuracy gaps to two disadvantages of non-autoregressive models: a) learning simultaneous generation under the overly strong conditional independence assumption; b) lacking explicit target language modeling. In this paper, we propose Glancing Transformer (GLAT) to address the above disadvantages, which reduces the difficulty of learning simultaneous generation and introduces explicit target language modeling in the non-autoregressive setting at the same time. Experiments on several benchmarks demonstrate that our approach significantly improves the accuracy of non-autoregressive models without sacrificing any inference efficiency. In particular, GLAT achieves 30.91 BLEU on WMT 2014 German-English, which narrows the gap between autoregressive models and non-autoregressive models to less than 0.5 BLEU.
Introduction

Recently, the non-autoregressive transformer (NAT) has attracted wide attention in neural machine translation (Gu et al., 2018): it generates all target words simultaneously rather than sequentially. NAT abandons the left-to-right order constraint of text generation and can be significantly faster (almost a dozen times speed-up) than the autoregressive transformer (Vaswani et al., 2017). However, NAT still falls behind the autoregressive transformer (AT) in the quality of output sentences (e.g., BLEU (Papineni et al., 2002) for machine translation). Much related work proposes to improve NAT from different aspects, including introducing latent variables, learning from AT, and modifying training objectives (Gu et al., 2018; Bao et al., 2019; Wei et al., 2019; Li et al., 2019; Wang et al., 2019). Among them, the iterative editing approaches (Lee et al., 2018; Gu et al., 2019; Ghazvininejad et al., 2019) obtain very competitive accuracy compared to strong AT baselines by executing multiple rounds of NAT decoding. Nevertheless, iterative NAT sacrifices the speed merit of NAT due to the iterative process. To sum up, better NAT models with fast speed and competitive accuracy still need to be explored.

In this paper, we attribute the accuracy gaps between NAT and AT to two disadvantages of NAT in training: a) learning simultaneous generation under the overly strong conditional independence assumption; b) lacking explicit target language modeling. Firstly, the overly strong conditional independence assumption in training increases the learning difficulty of NAT's decoder. NAT assumes that all word predictions are conditionally independent, which is hard for the model to satisfy and frequently leads to learning with multiple potential answers. Consider the English sentence "Thank you.": there are many German translations, such as "Danke schön." and "Vielen Dank.". If the assumption behind NAT is not well supported, both "schön" and "Dank" can be potentially correct words at the second position. For AT, in contrast, there is much less learning with multiple potential correct words; for example, "Dank" is the only correct next word after "Vielen". Multiple potential answers cause conflicts in learning, so NAT's decoder remains poorly learned in such a difficult setting. Furthermore, the lack of explicit target language modeling in NAT's decoder can cause NAT's encoder and decoder to prevent each other from training effectively. Specifically, the poorly learned decoder (at least in the early training phase) does not back-propagate adequate error information to the encoder, and in return, the resulting encoder does not provide adequate information to the decoder. This makes the model stuck in training, and we call it the encoder-decoder learning deadlock. For AT, the encoder-decoder learning deadlock is not as severe, since AT explicitly learns target language modeling. Learning the target language model keeps the AT decoder close to a proper model, which makes the decoder capable of back-propagating relatively accurate gradients even if the encoder is noisy. Please refer to Section 2 for more analysis.

To address the above learning disadvantages of NAT, we propose the Glancing Transformer (GLAT), which is equipped with an extremely simple yet very effective technique called reference glancing in training. Specifically, we perform two-pass decoding of NAT during training.
In the first pass, we obtain the predicted distributions and the reference (ground-truth sentence) from the training data, then sample target words by glancing at the reference according to how well the reference is predicted. In the second pass, we feed the sampled words to the decoder at their corresponding positions and train the decoder to predict the remaining words via maximum likelihood estimation. In our proposed GLAT, more reference words tend to be sampled as decoding inputs for inaccurately predicted sentences. Training with reference glancing is a gradual learning process: we sample many words when the predictions of the reference are inaccurate, and the number of sampled words decreases as the NAT model predicts the reference more accurately along the training process.

Our proposed GLAT effectively alleviates the two learning disadvantages of NAT. Firstly, the gradual learning process in GLAT reduces the learning difficulty caused by the overly strong assumption, as the assumption in learning changes from weak to strong. In the early training stage, the model learns with a weak assumption that only the remaining target words are conditionally independent. During training, learning with weak assumptions builds a foundation for learning with stronger assumptions, in which more target words are assumed to be conditionally independent, and this gradually makes learning with the strong assumption of fully simultaneous generation easier. Secondly, GLAT offers an explicit way to learn the target language model (outputting target words based on sampled target word inputs), by which the encoder-decoder learning deadlock can be effectively addressed. The underlying rationale is that, with target language modeling, the decoder does not deviate too far when the encoder is noisy at the beginning of training. Our proposed approach is simple yet effective, obtaining significant improvements (about 5 BLEU) on EN-DE/DE-EN compared to the vanilla NAT without losing any inference speed-up.

GLAT outperforms the current state-of-the-art parallel decoding model Mask-Predict (Ghazvininejad et al., 2019) on WMT14 DE-EN and WMT16 RO-EN and closes the gaps to less than 0.5 BLEU on WMT14 EN-DE and WMT16 EN-RO, while enjoying a much larger speed-up. Considering pure end-to-end NAT models without iterative refinement, GLAT outperforms the best competitors by clear margins in BLEU. When compared to standard AT models, GLAT can still close the performance gap to about 1 BLEU point while keeping a substantial speed-up. The leap in performance and the simplicity of the approach indicate that our proposal holds a great deal of potential.

In this section, we begin by comparing the formulations of AT and NAT, and highlight two main differences between NAT and AT. We then analyze in detail why these differences make the learning of NAT models difficult, and from this analysis we derive potential approaches for improving NAT. Formally, consider a sequence-to-sequence model (Cho et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) for predicting Y = {y_1, y_2, ..., y_T} given the input sentence X = {x_1, x_2, ..., x_N}. Regardless of the network architecture, the problem can be formulated as p(Y | X; θ). Learning is to find the best statistical model, i.e., the optimal θ.
In the AT model, the conditional probability can "exactly" be formulated with the chain rule:

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, X; θ).   (1)

In contrast, NAT assumes that the target words are conditionally independent given the source, and factorizes the probability as

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | X; θ).   (2)

Glancing Transformer

Figure 1: Illustration of NAT training with reference sampling.

GLAT reduces the difficulty of learning under the complete conditional independence assumption and avoids the encoder-decoder learning deadlock by introducing explicit target language modeling. Generally, GLAT performs two-pass decoding of NAT during training. In the first pass, we obtain the output prediction and compare it with the reference (ground-truth sentence), then sample target words by glancing at the reference according to how well this reference is predicted. In the second pass, we feed the sampled words to the decoder at their corresponding positions and train the decoder to predict the remaining words via maximum likelihood estimation. In our proposed GLAT, more reference words tend to be sampled as decoding inputs for inaccurately predicted sentences. As the model predicts the reference better along the training process, it glances at fewer reference words, and the learning setting gradually approaches that of fully simultaneous generation. This gradual process weakens the conditional independence assumption in accordance with how well the model supports it, as only the remaining words are assumed to be conditionally independent. GLAT also effectively addresses the encoder-decoder learning deadlock, because it introduces explicit target language modeling in the process of predicting the remaining words with some target words as input.

Formally, given the training data D = {(X, Y)}_{i=1}^{L} for predicting Y = {y_1, y_2, ..., y_T} from the input sentence X = {x_1, x_2, ..., x_N}, the training objective of GLAT is:

L_GLAT = − Σ_{(X,Y) ∈ D} Σ_{y_t ∈ Y \ GS(Y, Ŷ)} log p(y_t | GS(Y, Ŷ), X; θ).   (3)

Here, Ŷ is the sentence predicted in the first decoding pass, and GS(Y, Ŷ) is the set of target words sampled by the glancing sampling strategy GS. The sampling strategy samples more words from the reference Y if Ŷ is less accurate compared to Y, and fewer words otherwise. Additionally, Y \ GS(Y, Ŷ) is the difference set, i.e., the remaining words excluding the sampled ones. In the following sections, we describe the whole training process and the sampling strategy in detail.

Our training procedure with two-pass decoding for GLAT is illustrated in Figure 1. In the first decoding pass, given the encoder F_encoder and the decoder F_decoder, H = {h_1, h_2, ..., h_T} is the sequence of decoder inputs gathered from the source embeddings of the input X, and Ŷ = F_decoder(H, F_encoder(X; θ); θ) is the sentence predicted in the first decoding pass. With the predicted sentence Ŷ, GS(Y, Ŷ) is the set of words sampled from the reference Y according to our sampling strategy GS, which is introduced in the next section. Note that we use an attention mechanism to form the decoder inputs from the input X; previous work adopts UniformCopy (Gu et al., 2018) or SoftCopy (Wei et al., 2019) instead, but empirically we find that they give almost the same results in our setting.
In the second decoding pass, we cover the original decoding inputs H with the embeddings of the words from GS(Y, Ŷ) to get the new decoding inputs H' = F_cover(E_{y_t ∈ GS(Y, Ŷ)}(y_t), H), where F_cover replaces entries according to the corresponding positions. Namely, if a word is sampled at some position, we use its word embedding to replace the original decoding input at the same position. Here, the word embeddings are obtained from the softmax embedding matrix of the decoder. With the covered decoding inputs H', the probabilities of the remaining words at each position, p(y_t | GS(Y, Ŷ), X; θ), are computed by F_decoder(H', F_encoder(X; θ), t; θ). Finally, only the loss for the remaining target words is computed, and we use maximum likelihood estimation to learn the model parameters θ according to Equation 3.

For the glancing sampling strategy, we employ a two-step procedure: the first step decides the sampling number, and the second step samples the reference words based on this number. Formally, given the input X, its predicted sentence Ŷ, and its reference Y, the goal of the reference glancing function GS(Y, Ŷ) is to obtain a set of sampled words S, where S is a subset of Y. In the first step, we compute a sampling number N(Y, Ŷ), which decides the number of sampled words. In the second step, we sample the set S with GS(Y, Ŷ), subject to the constraint |S| = N(Y, Ŷ).

Adaptive Sampling Number. Although we can introduce target language modeling by reference glancing, we also need to carefully control the sampling number. The reason is that excessive target-word input can lead to over-dependence on the target words, so the ability to extract information from the source inputs is less trained, which inevitably degrades model capacity. With an idea similar to curriculum learning (Bengio et al., 2009), we propose a curriculum sampling strategy that decides the number so as to remove more difficulty at the start and gradually approach the simultaneous generation setting as the model becomes better trained. Intuitively, for translation tasks, the sentences that cannot be translated well are more difficult to learn; for these sentences, we should reduce more of the learning difficulty and enlarge the sampling number. Thus, the demand for target words also varies across training samples. Specifically, given the input X, its translation Ŷ, and its reference Y, we use the distance d(Y, Ŷ) to measure how well Y is generated given X, where d(·) is a distance function. The fewer words of the target sentence the model can generate, the more additional target words are needed to reduce the difficulty of simultaneous generation. Considering that only identical words in the output can match the target, we adopt the Hamming distance (Hamming, 1950) in this work, that is, d(Y, Ŷ) = Σ_{t=1}^{T} 1(y_t ≠ ŷ_t), where t is the word position. To better control the reference sampling, we further employ a ratio function f_ratio to adjust the sampling number more flexibly. With the distance and the ratio function, the sampling number is determined by N(Y, Ŷ) = f_ratio · d(Y, Ŷ), which adaptively adjusts the learning procedure.

Random Sampling Strategy. Given the sampling number N(Y, Ŷ), we then decide which words should be chosen. We use simple random sampling as our strategy, that is, we randomly choose N(Y, Ŷ) words from the reference Y. Random sampling is the most straightforward approach to obtaining samples. We choose this strategy for the following reasons. First, it samples from an almost unrestricted space, which is unbiased and vital for drawing conclusions. Second, although random sampling does not necessarily pick the best candidates, it is very effective in most cases; many previous studies benefit from its simple implementation and desirable performance, such as BERT (Devlin et al., 2019).
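To make the training procedure concrete, here is a minimal, self-contained sketch of one GLAT training step. It is our own illustration rather than the authors' implementation: the tiny PyTorch encoder/decoder, the truncated source embeddings used as decoder inputs, and all hyperparameters are illustrative stand-ins, while the glancing function follows the adaptive Hamming-distance rule and random position sampling described above.

```python
# Illustrative sketch of one GLAT training step (assumptions: toy Transformer modules,
# source and target of equal length, decoder inputs crudely copied from source embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, nhead, vocab = 64, 4, 1000
src_embed = nn.Embedding(vocab, d_model)
tgt_embed = nn.Embedding(vocab, d_model)  # also used as the output softmax matrix
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), 2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), 2)

def decode(dec_inputs, memory):
    """Parallel (non-autoregressive) decoding: logits for every position at once."""
    hidden = decoder(dec_inputs, memory)      # no causal mask, so all positions are predicted in parallel
    return hidden @ tgt_embed.weight.T

def glancing_sample(y, y_hat, f_ratio=0.5):
    """Glancing sampling GS(Y, Y_hat): pick N = f_ratio * Hamming(Y, Y_hat) random reference positions."""
    hamming = (y != y_hat).sum(dim=-1)        # d(Y, Y_hat), one value per sentence
    n = (f_ratio * hamming).long()            # adaptive sampling number N(Y, Y_hat)
    glance = torch.zeros_like(y, dtype=torch.bool)
    for b in range(y.size(0)):                # random positions, sentence by sentence
        idx = torch.randperm(y.size(1))[: n[b]]
        glance[b, idx] = True
    return glance                             # True = glanced word fed to the decoder

def glat_step(x, y):
    memory = encoder(src_embed(x))
    dec_inputs = src_embed(x)[:, : y.size(1), :]             # stand-in for the copied decoder inputs H
    with torch.no_grad():                                     # first pass: predict Y_hat
        y_hat = decode(dec_inputs, memory).argmax(-1)
    glance = glancing_sample(y, y_hat)
    dec_inputs = torch.where(glance.unsqueeze(-1), tgt_embed(y), dec_inputs)   # covered inputs H'
    logits = decode(dec_inputs, memory)                       # second pass
    return F.cross_entropy(logits[~glance], y[~glance])       # Eq. (3): loss only on remaining words

x = torch.randint(0, vocab, (2, 6))   # toy batch: source and target of the same length
y = torch.randint(0, vocab, (2, 6))
print(glat_step(x, y))
```

In the actual model, the decoder inputs are formed by an attention-based copy of the source representations and the target length is predicted separately; both are omitted here for brevity.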
Experiments

In this section, we present experimental results to verify the effectiveness of GLAT.

Table 1: Performance on the WMT14 EN-DE/DE-EN and WMT16 EN-RO/RO-EN benchmarks. I_dec is the number of refinement iterations and m is the number of reranking candidates.

| Category | Model | WMT14 EN-DE | WMT14 DE-EN | WMT16 EN-RO | WMT16 RO-EN | Speed-up |
|---|---|---|---|---|---|---|
| AT Models | Transformer (Vaswani et al.) | 27.30 | / | / | / | / |
| | Transformer (ours) | 27.48 | 31.27 | 33.70 | 34.05 | 1.0× |
| Fully NAT Models | NAT-FT | 17.69 | 21.47 | 27.29 | 29.06 | 15.6× |
| | NAT-IR | 13.91 | 16.77 | 24.45 | 25.73 | 9.0× |
| | NAT-REG | 20.65 | 24.77 | / | / | 27.6× |
| | imitate-NAT | 22.44 | 25.67 | 28.61 | 28.90 | 18.6× |
| | Flowseq | 21.45 | 26.16 | 29.34 | 30.44 | / |
| | Mask-Predict base (I_dec=1) | 18.05 | 21.83 | 27.32 | 28.20 | / |
| | NAT-base | 20.12 | 24.46 | 28.47 | 29.43 | 15.3× |
| | GLAT (ours) | | | | | × |
| NAT Models w/ Iterative Refinements | NAT-IR (I_dec=10) | 21.61 | 25.48 | 29.32 | 30.19 | 1.5× |
| | LevT (I_dec=6+) | | / | / | 33.26 | 4.0× |
| | Mask-Predict base (I_dec=10) | 27.03 | | | | / |
| NAT Models w/ Reranking | NAT-FT (m=10) | 18.66 | 22.41 | 29.02 | 30.76 | 7.7× |
| | NAT-FT (m=100) | 19.17 | 23.20 | 29.79 | 31.44 | 2.4× |
| | NAT-REG (m=9) | 24.61 | 28.90 | / | / | 15.1× |
| | imitate-NAT (m=7) | 24.15 | 27.28 | 31.45 | 31.81 | 9.7× |
| | Flowseq (m=30) | 23.48 | 28.40 | 31.75 | 32.49 | / |
| | NAT-DCRF (m=9) | 26.07 | 29.68 | / | / | 6.1× |
| | GLAT (m=7, ours) | | | | | × |

Datasets. To validate the effectiveness of our method, we conduct experiments on three machine translation benchmarks: WMT14 EN-DE (4.5M translation pairs), WMT16 EN-RO (610k translation pairs), and IWSLT16 DE-EN (150K translation pairs). The datasets are tokenized and segmented into subword units using BPE (Sennrich et al., 2016). We preprocess WMT14 EN-DE following the data preprocessing in Vaswani et al. (2017) and use the processed data provided by Lee et al. (2018) for WMT16 EN-RO and IWSLT16 DE-EN.

Distillation. Following previous work (Gu et al., 2018; Lee et al., 2018; Wang et al., 2019), we also use sequence-level knowledge distillation for all datasets. We employ the base autoregressive Transformer of Vaswani et al. (2017) to distill the data and then train our models on the distilled data for all datasets.

Inference. GLAT only modifies the training procedure and performs one-step non-autoregressive generation, as the base model in Gu et al. (2018) does. We also consider the common practice of noisy parallel decoding (Gu et al., 2018; Lee et al., 2018; Guo et al., 2019; Wang et al., 2019), which generates several decoding candidates in parallel and selects the best via re-scoring with a pre-trained autoregressive model. For GLAT, we first predict m target length candidates, then generate an output sequence with argmax decoding for each length candidate, which is also called length parallel decoding (LPD) (Wei et al., 2019). We then use the pre-trained Transformer to rank these sequences and identify the best overall output as the final output.
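As a sketch of this inference procedure, the snippet below decodes one candidate per length and re-ranks them with an autoregressive scorer. The callables nat_decode and at_score are hypothetical stand-ins for the non-autoregressive generator and the pre-trained AT model; neither name comes from the paper.

```python
# Sketch of length parallel decoding (LPD) with AT re-scoring, under the assumption of
# two black-box callables supplied by the caller (hypothetical, not from the paper):
#   nat_decode(x, length) -> argmax output of the NAT model for the given target length
#   at_score(x, y)        -> log-probability of y under a pre-trained autoregressive model
from typing import Callable, List, Sequence

def length_parallel_decode(
    x: Sequence[int],
    predicted_length: int,
    m: int,
    nat_decode: Callable[[Sequence[int], int], List[int]],
    at_score: Callable[[Sequence[int], List[int]], float],
) -> List[int]:
    # m length candidates centred on the predicted target length.
    lengths = [predicted_length + off for off in range(-(m // 2), m - m // 2)]
    lengths = [L for L in lengths if L > 0]
    # Each candidate is decoded independently, so this loop can run in parallel.
    candidates = [nat_decode(x, L) for L in lengths]
    # Re-rank with the autoregressive model and return the best-scoring output.
    return max(candidates, key=lambda y: at_score(x, y))
```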
Implementation. We adopt the vanilla model that copies the source input uniformly (Gu et al., 2018) as our base model (NAT-base) and replace UniformCopy with an attention mechanism using positions. For the WMT datasets, we follow the hyperparameters of the base Transformer in Vaswani et al. (2017), and we choose a smaller setting for IWSLT16 since it is a smaller dataset: 5 layers for both encoder and decoder, with the model size d_model set to 256. We train the model with batches of 64k/8k tokens for the WMT/IWSLT datasets, respectively, using the Adam optimizer (Kingma & Ba, 2014). For the WMT datasets, the learning rate warms up to 5e-4 in 4k steps and gradually decays according to the inverse square root schedule of Vaswani et al. (2017); for IWSLT16 DE-EN, we adopt linear annealing as in Lee et al. (2018). For the ratio function f_ratio, we use a linearly decreasing function from 0.5 to 0.3 for the WMT datasets and a fixed ratio of 0.5 for IWSLT16. The final model is created by averaging the 5 best checkpoints.

Figure 2: The trade-off between speed-up and BLEU on WMT14 DE-EN.
Figure 3: The histogram of probability differences on the test set of WMT14 EN-DE.

Baselines. We compare our method with strong representative baselines, including fully non-autoregressive models: NAT-FT, the NAT with fertility (Gu et al., 2018); NAT-REG, the NAT with regularization (Wang et al., 2019); our vanilla NAT-base model; imitate-NAT, which imitates an autoregressive model (Wei et al., 2019); Flowseq, the flow-based NAT model (Ma et al., 2019); and NAT-DCRF, the NAT with CRF (Lafferty et al., 2001) decoding (Sun et al., 2019); as well as NAT models with iterative refinement: Mask-Predict (Ghazvininejad et al., 2019), NAT-IR (Lee et al., 2018), and LevT (Gu et al., 2019). For all tasks, we take the performance of other NAT models directly from the figures reported in their papers when available on our datasets.

The main results on the benchmarks are presented in Table 1. Clearly, our method significantly improves the translation quality and outperforms strong baselines by a large margin. Our method introduces explicit target language modeling for the decoder and gradually learns simultaneous generation, which addresses the encoder-decoder learning deadlock and better captures the underlying data structure. It is worth noting that although models equipped with iterative decoding achieve competitive BLEU (Papineni et al., 2002) scores, their success comes at the cost of the speed advantage. Our method fully maintains the inference efficiency of fully non-autoregressive models, since we only modify the training process. Compared with the competitors, we highlight our empirical advantages:

• The performance of GLAT is surprisingly good. Compared with the vanilla NAT-base model, GLAT obtains significant improvements (about 5 BLEU) on EN-DE/DE-EN. Additionally, GLAT surpasses other fully non-autoregressive models by a big margin (almost +3 BLEU on average). The results are even very close to those of the AT model, which shows great potential.

• GLAT is extremely simple and can be applied to other NAT models flexibly, as we only modify the training process by reference glancing while keeping inference unchanged. For comparison, imitate-NAT introduces additional AT models as teachers; NAT-DCRF utilizes a CRF to generate outputs sequentially; NAT-IR and Mask-Predict iteratively generate the target outputs.

We also see in Table 1 that BLEU and speed-up are more or less contradictory.
Therefore, we present a scatter plot in Figure 2, displaying the trend of speed-up and BLEU for different NAT models. GLAT lies to the top-right of the competing methods: it outperforms the competitors in BLEU when speed-up is controlled, and in speed-up when BLEU is controlled. This indicates that GLAT outperforms previous state-of-the-art NAT methods. Although iterative models like Mask-Predict achieve competitive BLEU scores, they only maintain minor speed advantages over AT. In contrast, fully non-autoregressive models remarkably improve the inference speed.

Table 2: Performance on IWSLT16 with a fixed sampling ratio.

| Sampling Method | λ | BLEU |
|---|---|---|
| Fixed | 0.0 | 24.66 |
| | 0.1 | 24.91 |
| | 0.2 | 27.12 |
| | 0.3 | 24.98 |
| | 0.4 | 22.96 |
| Adaptive | - | |

Table 3: Performance on IWSLT16 with a decreasing sampling ratio.

| Sampling Method | λ_s | λ_e | BLEU |
|---|---|---|---|
| Fixed | 0.5 | 0 | 27.80 |
| | 0.5 | 0.1 | 28.21 |
| | 0.5 | 0.2 | 27.15 |
| | 0.5 | 0.3 | 23.37 |
| Adaptive | - | - | |

Table 4: Performance on WMT14 EN-DE with different sampling strategies.

| Sampling Strategy | uniform | p_ref | 1−p_ref | double correct | double false |
|---|---|---|---|---|---|
| GLAT | 25.21 | 24.87 | 25.37 | 25.10 | 25.38 |
| GLAT (m=7) | 26.55 | 25.83 | 26.52 | 26.35 | 26.52 |

The NAT model always gives flat distributions in prediction. We computed the difference between the highest and the second-highest probability at each output position on the test set of WMT14 EN-DE; the resulting histogram is presented in Figure 3. We find that the model trained with reference glancing outputs sharper word distributions, namely, the highest probability is much larger than the second-highest one. GLAT has fewer positions where the highest probability is close to the second-highest one than NAT-base. The sharper prediction distributions indicate that GLAT can obtain and exploit more relevant information for prediction. Thus, our approach outputs more confident predictions and better fits the simultaneous generation setting.

Effectiveness of the Adaptive Sampling Strategy. To validate the effectiveness of the adaptive strategy for the sampling number N(Y, Ŷ), we also propose two fixed approaches for comparison. The first one decides the sampling number with λ·T, where T is the length of Y and λ is a constant ratio. The second one is more flexible: it sets a start ratio λ_s and an end ratio λ_e, and linearly reduces the sampling number from λ_s·T to λ_e·T over the course of training. As shown in Table 2 and Table 3, our adaptive approach (Adaptive in the tables) clearly outperforms both fixed approaches. The results confirm our intuition that the sampling schedule affects the translation performance of our NAT model; a schedule that first offers relatively easy generation problems and then turns harder benefits the final performance. More specifically:

• The random sampling strategy is robust. Even with the simplest constant ratio, GLAT still achieves remarkable results: with λ = 0.2, it outperforms the λ = 0.0 baseline by about 2.5 BLEU.

• The experiments support our suggestion that it is beneficial to learn simple patterns at the start and gradually transfer to more complex patterns. The flexible decreasing ratio works better than the constant one, and our proposed adaptive approach achieves the best results.
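For concreteness, the three sampling-number rules compared above can be written as follows. This is our own sketch; the function names and default ratios are illustrative, not taken from the paper's code.

```python
# Sampling-number schedules compared in Tables 2 and 3 (illustrative defaults).
def fixed_number(target_len: int, lam: float = 0.2) -> int:
    # Constant ratio: always glance at lam * T reference words.
    return int(lam * target_len)

def decreasing_number(target_len: int, step: int, total_steps: int,
                      lam_start: float = 0.5, lam_end: float = 0.1) -> int:
    # Linearly anneal the ratio from lam_start to lam_end over training.
    lam = lam_start + (lam_end - lam_start) * min(step / total_steps, 1.0)
    return int(lam * target_len)

def adaptive_number(hamming_distance: int, f_ratio: float = 0.5) -> int:
    # GLAT's adaptive rule: N(Y, Y_hat) = f_ratio * d(Y, Y_hat).
    return int(f_ratio * hamming_distance)
```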
Influence of Different Reference Sampling Distributions. To analyze the influence of the sampling distribution over the reference for glancing, we conduct experiments with different sampling distributions. By default, we randomly choose words for glancing, that is, we assume each word in the reference has equal probability and sample from a uniform distribution. To investigate whether reference words that can be accurately predicted, or on the contrary words that cannot, are more important for training, we devise four other sampling distributions for comparison: p_ref, 1−p_ref, "double correct", and "double false". For p_ref and 1−p_ref, the sampling probability of each reference word is proportional to the probability the model assigns to that reference word, p_ref, or to 1−p_ref, respectively. For "double correct" and "double false", we make the sampling probability of correctly predicted words twice that of falsely predicted words, or vice versa. The results for the different sampling distributions are listed in Table 4. In comparison, sampling distributions that give falsely predicted words a higher probability achieve better performance than those that favor correctly predicted words, indicating that words that are hard to predict are more important for glancing during training. We also find that the performance of the uniform sampling distribution is similar to that of the distributions favoring falsely predicted words.
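These alternative glancing distributions can be viewed as weighted sampling over reference positions. The sketch below is our own illustration with hypothetical inputs: p_ref holds the first-pass probability assigned to each reference word, and correct marks the positions the first pass already predicts correctly.

```python
# Weighted glancing-position sampling for the strategies compared in Table 4.
import numpy as np

def glancing_positions(p_ref: np.ndarray, correct: np.ndarray, n: int,
                       strategy: str = "uniform") -> np.ndarray:
    T = len(p_ref)
    if strategy == "uniform":
        weights = np.ones(T)
    elif strategy == "p_ref":            # favour words the model already predicts well
        weights = p_ref
    elif strategy == "1-p_ref":          # favour words the model predicts poorly
        weights = 1.0 - p_ref
    elif strategy == "double_correct":   # correctly predicted words twice as likely
        weights = np.where(correct, 2.0, 1.0)
    elif strategy == "double_false":     # falsely predicted words twice as likely
        weights = np.where(correct, 1.0, 2.0)
    else:
        raise ValueError(strategy)
    weights = weights / weights.sum()
    return np.random.choice(T, size=min(n, T), replace=False, p=weights)

# Example: glance at one of three positions, favouring the poorly predicted ones.
print(glancing_positions(np.array([0.9, 0.2, 0.6]), np.array([True, False, True]), 1, "1-p_ref"))
```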
Related Work

Since Gu et al. (2018) proposed the non-autoregressive transformer (NAT), which enables parallel sequence generation, there have been several advances in developing non-autoregressive models.

Fully Non-Autoregressive Models. One line of work aims to reduce the model's burden of handling dependencies among output words. Gu et al. (2018) interpret a latent variable as the number of target words aligned to each source word. Ma et al. (2019) utilize generative flow to model expressive signals related to the output. Bao et al. (2019) model the positions of output words to address the word reordering issue. Ran et al. (2019) reorder the input sequence into an intermediate translation of the output sequence. Another branch of work considers transferring knowledge from autoregressive models to non-autoregressive models. Wei et al. (2019) guide the training by imitating demonstrations from modules of autoregressive models. Li et al. (2019) leverage the relevance between hidden states and the attention distribution of autoregressive models. Compared to these fully non-autoregressive models, our proposed method stays simple and can significantly boost performance without modifying the model architecture or the inference process.

Non-Autoregressive Models with Structured Decoding. In order to model the dependencies between words, Sun et al. (2019) introduce a CRF inference module in NAT and perform additional sequential decoding after the non-autoregressive computation at inference time. Since GLAT only performs one-step non-autoregressive generation, our approach is orthogonal to the method proposed by Sun et al. (2019), and the two can be combined.

Non-Autoregressive Models with Iterative Refinement. A series of works is devoted to semi-autoregressive models that combine the strengths of both types of models by iteratively refining the previous outputs. Lee et al. (2018) propose a method of iterative refinement based on a denoising autoencoder. Gu et al. (2019) utilize insertion and deletion to refine the generation. Ghazvininejad et al. (2019) train the model as a masked language model, and the model iteratively replaces mask tokens with new outputs. Despite the relatively better accuracy, the multiple decoding iterations largely reduce the inference efficiency of non-autoregressive models.

Although the masked language model employed in Mask-Predict (Ghazvininejad et al., 2019) also introduces target language modeling, Mask-Predict reduces the difficulty of simultaneous generation at inference time rather than during training. To deal with the difficulty caused by the overly strong conditional independence assumption, both Mask-Predict and GLAT adopt a gradual approach: Mask-Predict gradually generates the final sentence by iteratively refining the words that may be incorrect, while GLAT gradually learns simultaneous generation by starting from learning with weak assumptions and approaching the complete conditional independence assumption along the training process. Besides the decrease in inference speed-up caused by iterative decoding, Mask-Predict also forcibly optimizes its objective without considering how well the model supports the strong conditional independence assumption, which could introduce noise into the model.

Conclusion

In non-autoregressive models, learning proceeds under a strong conditional independence assumption and lacks explicit target language modeling, which makes training challenging. In this paper, we propose the Glancing Transformer to tackle these problems. With the glanced reference, the model reduces the learning difficulty by gradually learning with stronger conditional independence assumptions, and it finds a way out of the encoder-decoder learning deadlock by explicitly learning target language modeling. Experimental results show that our approach significantly improves the performance of non-autoregressive machine translation with one-step generation. As non-autoregressive models are efficient and hold large potential in multiple tasks, we plan to apply our approach to other tasks.

Broader Impact

Machine translation is a crucial technology for facilitating communication across languages and cultures. The vast amount of information on social media platforms precludes manual translation and demands highly efficient yet effective automatic translation systems. Our work aims at improving the effectiveness and efficiency of neural machine translation systems (dominantly based on the Transformer model). Our proposed method can be beneficial for researchers and practitioners working in the field of machine translation, as well as for researchers working on general sequence modeling and generation, such as text generation, dialog systems, and question answering. This work can also lead to potential business benefits: users, multinational companies, and international organizations can utilize systems built upon our algorithms to obtain high-quality translations with greatly reduced latency. A further potential societal impact is that, with the proposed non-autoregressive generation model, much overhead in energy consumption can be saved, since inference is faster and more resource-efficient.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, and Lei Li. Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677, 2019.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pp. 41–48, 2009.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734, 2014.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186, 2019.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP, pp. 6114–6123, 2019.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.

Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In NeurIPS, pp. 11179–11189, 2019.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, volume 33, pp. 3723–3730, 2019.

Richard W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, 1950.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pp. 282–289, 2001.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pp. 1173–1182, 2018.

Zhuohan Li, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based training for non-autoregressive translation. In NeurIPS, 2019.

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pp. 4273–4283, 2019.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, pp. 311–318, 2002.

Qiu Ran, Yankai Lin, Peng Li, and Jie Zhou. Guiding non-autoregressive neural machine translation decoding with reordering information. arXiv preprint arXiv:1911.02215, 2019.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725, 2016.

Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. Fast structured decoding for sequence models. In NeurIPS, pp. 3016–3026, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.

Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In AAAI, 2019.

Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. In