Glancing Transformer for Non-Autoregressive Neural Machine Translation
Lihua Qian∗, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, Lei Li
Shanghai Jiao Tong University, ByteDance AI Lab, Nanjing University
{qianlihua,lqiu,wnzhang,yyu}@apex.sjtu.edu.cn, {zhouhao.nlp,wangmingxuan.89,lileilab}@bytedance.com, [email protected]
∗ The work was done when the first author was an intern at ByteDance.
Abstract
Non-autoregressive neural machine translation achieves remarkable inference acceleration compared to autoregressive models. However, current non-autoregressive models still fall behind their autoregressive counterparts in prediction accuracy. We attribute the accuracy gaps to two disadvantages of non-autoregressive models: a) learning simultaneous generation under the overly strong conditional independence assumption; b) lacking explicit target language modeling. In this paper, we propose Glancing Transformer (GLAT) to address the above disadvantages, which reduces the difficulty of learning simultaneous generation and introduces explicit target language modeling in the non-autoregressive setting at the same time. Experiments on several benchmarks demonstrate that our approach significantly improves the accuracy of non-autoregressive models without sacrificing any inference efficiency. In particular, GLAT achieves 30.91 BLEU on WMT 2014 German-English, which narrows the gap between autoregressive models and non-autoregressive models to less than 0.5 BLEU.
Introduction

Recently, the non-autoregressive transformer (NAT) has attracted wide attention in neural machine translation (Gu et al., 2018): it generates all target words simultaneously rather than sequentially. NAT abandons the left-to-right order constraint of text generation and can be significantly faster (almost a dozen times speed-up) than the autoregressive transformer (Vaswani et al., 2017). However, NAT still falls behind the autoregressive transformer (AT) in the quality of output sentences (e.g., BLEU (Papineni et al., 2002) for machine translation). Much related work proposes to improve NAT from different aspects, including introducing latent variables, learning from AT, and modifying training objectives (Gu et al., 2018; Bao et al., 2019; Wei et al., 2019; Li et al., 2019; Wang et al., 2019). Among them, the iterative editing approaches (Lee et al., 2018; Gu et al., 2019; Ghazvininejad et al., 2019) obtain very competitive accuracy compared to strong AT baselines by executing multiple rounds of NAT decoding. Nevertheless, iterative NAT sacrifices the speed merit of NAT due to the iterative process. To sum up, better NAT models with fast speed and competitive accuracy still need to be explored.

In this paper, we attribute the accuracy gaps between NAT and AT to two disadvantages of NAT in training: a) learning simultaneous generation under the overly strong conditional independence assumption; b) lacking explicit target language modeling. Firstly, the overly strong conditional independence assumption in training increases the learning difficulty of NAT's decoder. NAT assumes that all word predictions are conditionally independent, which is hard for the model to satisfy and frequently leads to learning with multiple potential answers. Consider the English sentence "Thank you.": there are many German translations, such as "Danke schön." and "Vielen Dank.". If the assumption behind NAT is not well supported, both "schön" and "Dank" can be potentially correct words at the second position. For AT, in contrast, there is much less learning with multiple potential correct words; for example, "Dank" is the only correct next word after "Vielen". Multiple potential answers cause conflicts in learning, so NAT's decoder remains poorly learned in such a difficult setting. Furthermore, the lack of explicit target language modeling in NAT's decoder can cause NAT's encoder and decoder to prevent each other from training effectively. Specifically, the poorly learned decoder (at least in the early training phase) does not back-propagate adequate error information to the encoder, and in return, the resulting encoder does not provide adequate information to the decoder. This makes the model stuck in training, and we call it the encoder-decoder learning deadlock. For AT, the encoder-decoder learning deadlock is not as severe, since AT explicitly learns target language modeling. Learning the target language model keeps the AT decoder close to a proper model, which makes the decoder capable of back-propagating relatively accurate gradients even if the encoder is noisy. Please refer to Section 2 for more analysis.

To address the above learning disadvantages of NAT, we propose the Glancing Transformer (GLAT), which is equipped with an extremely simple yet very effective technique called reference glancing in training. Specifically, we perform two-pass decoding of NAT during training.
In the first pass, we obtain the predicted distributions and the reference (ground-truth sentence) from the training data, then sample target words by glancing at the reference according to how well the reference is predicted. In the second pass, we feed the sampled words to the decoder at their corresponding positions and train the decoder to predict the remaining words via maximum likelihood estimation. In our proposed GLAT, more reference words tend to be sampled as decoding inputs for inaccurately predicted sentences. Training with reference glancing is a gradual learning process: we sample many words when the predictions of the reference are inaccurate, and the number of sampled words decreases as the NAT model predicts the reference more accurately along the training process.

Our proposed GLAT effectively alleviates the two learning disadvantages of NAT. Firstly, the gradual learning process in GLAT reduces the learning difficulty caused by the overly strong assumption, as the assumption in learning changes from weak to strong. In the early training stage, the model learns with a weak assumption that only the remaining target words are conditionally independent. During training, learning with weak assumptions builds a foundation for learning with stronger assumptions, in which more target words are assumed to be conditionally independent, and this gradually makes learning with the strong assumption of fully simultaneous generation easier. Secondly, GLAT offers an explicit way to learn the target language model (outputting target words based on sampled target word inputs), by which the encoder-decoder learning deadlock can be effectively addressed. The underlying rationale is that, with target language modeling, the decoder does not deviate too far when the encoder is noisy at the beginning of training. Our proposed approach is simple yet effective, obtaining significant improvements (about 5 BLEU) on EN-DE/DE-EN compared to the vanilla NAT without losing any inference speed-up.

GLAT outperforms the current state-of-the-art parallel decoding model Mask-Predict (Ghazvininejad et al., 2019) on WMT14 DE-EN and WMT16 RO-EN and closes the gaps to less than 0.5 BLEU on WMT14 EN-DE and WMT16 EN-RO, while enjoying a much larger speed-up. Considering pure end-to-end NAT models without iterative refinement, GLAT outperforms the best competitors by clear margins in BLEU. When compared to standard AT models, GLAT can still close the performance gap to about 1 BLEU point while keeping a substantial speed-up. The leap in performance and the simplicity of the approach indicate that our proposal holds a great deal of potential.

In this section, we begin by comparing the formulations of AT and NAT, and highlight two main differences between NAT and AT. We then analyze in detail why these differences make the learning of NAT models difficult, and from this analysis we derive potential approaches for improving NAT. Formally, consider a sequence-to-sequence model (Cho et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) for predicting Y = {y_1, y_2, ..., y_T} given the input sentence X = {x_1, x_2, ..., x_N}. Regardless of the network architecture, the problem can be formulated as p(Y | X; θ). Learning is to find the best statistical model, i.e., the optimal θ.
In the AT model, the conditional probability can "exactly" be formulated with the chain rule:

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, X; θ).   (1)

In contrast, NAT assumes that the target words are conditionally independent given the source, and factorizes the probability as

p(Y | X; θ) = ∏_{t=1}^{T} p(y_t | X; θ).   (2)

Glancing Transformer

Figure 1: Illustration of NAT training with reference sampling.

GLAT reduces the difficulty of learning under the complete conditional independence assumption and avoids the encoder-decoder learning deadlock by introducing explicit target language modeling. Generally, GLAT performs two-pass decoding of NAT during training. In the first pass, we obtain the output prediction and compare it with the reference (ground-truth sentence), then sample target words by glancing at the reference according to how well this reference is predicted. In the second pass, we feed the sampled words to the decoder at their corresponding positions and train the decoder to predict the remaining words via maximum likelihood estimation. In our proposed GLAT, more reference words tend to be sampled as decoding inputs for inaccurately predicted sentences. As the model predicts the reference better along the training process, it glances at fewer reference words, and the learning setting gradually approaches that of fully simultaneous generation. This gradual process weakens the conditional independence assumption in accordance with how well the model supports it, as only the remaining words are assumed to be conditionally independent. GLAT also effectively addresses the encoder-decoder learning deadlock, because it introduces explicit target language modeling in the process of predicting the remaining words with some target words as input.

Formally, given the training data D = {(X, Y)}_{i=1}^{L} for predicting Y = {y_1, y_2, ..., y_T} from the input sentence X = {x_1, x_2, ..., x_N}, the training objective of GLAT is:

L_GLAT = − Σ_{(X,Y) ∈ D} Σ_{y_t ∈ Y \ GS(Y, Ŷ)} log p(y_t | GS(Y, Ŷ), X; θ).   (3)

Here, Ŷ is the sentence predicted in the first decoding pass, and GS(Y, Ŷ) is the set of target words sampled by the glancing sampling strategy GS. The sampling strategy samples more words from the reference Y if Ŷ is less accurate compared to Y, and fewer words otherwise. Additionally, Y \ GS(Y, Ŷ) is the difference set, i.e., the remaining words excluding the sampled ones. In the following sections, we describe the whole training process and the sampling strategy in detail.

Our training procedure with two-pass decoding for GLAT is illustrated in Figure 1. In the first decoding pass, given the encoder F_encoder and the decoder F_decoder, H = {h_1, h_2, ..., h_T} is the sequence of decoder inputs gathered from the source embeddings of the input X, and Ŷ = F_decoder(H, F_encoder(X; θ); θ) is the sentence predicted in the first decoding pass. With the predicted sentence Ŷ, GS(Y, Ŷ) is the set of words sampled from the reference Y according to our sampling strategy GS, which is introduced in the next section. Note that we use an attention mechanism to form the decoder inputs from the input X; previous work adopts UniformCopy (Gu et al., 2018) or SoftCopy (Wei et al., 2019) instead, but empirically we find that they give almost the same results in our setting.
In the second decoding pass, we cover the original decoding inputs H with the embeddings of the words from GS(Y, Ŷ) to get the new decoding inputs H' = F_cover(E_{y_t ∈ GS(Y, Ŷ)}(y_t), H), where F_cover replaces entries according to the corresponding positions. Namely, if a word is sampled at some position, we use its word embedding to replace the original decoding input at the same position. Here, the word embeddings are obtained from the softmax embedding matrix of the decoder. With the covered decoding inputs H', the probabilities of the remaining words at each position, p(y_t | GS(Y, Ŷ), X; θ), are computed by F_decoder(H', F_encoder(X; θ), t; θ). Finally, only the loss for the remaining target words is computed, and we use maximum likelihood estimation to learn the model parameters θ according to Equation 3.

For the glancing sampling strategy, we employ a two-step procedure: the first step decides the sampling number, and the second step samples the reference words based on this number. Formally, given the input X, its predicted sentence Ŷ, and its reference Y, the goal of the reference glancing function GS(Y, Ŷ) is to obtain a set of sampled words S, where S is a subset of Y. In the first step, we compute a sampling number N(Y, Ŷ), which decides the number of sampled words. In the second step, we sample the set S with GS(Y, Ŷ), subject to the constraint |S| = N(Y, Ŷ).

Adaptive Sampling Number. Although we can introduce target language modeling by reference glancing, we also need to carefully control the sampling number. The reason is that excessive target-word input can lead to over-dependence on the target words, so the ability to extract information from the source inputs is less trained, which inevitably degrades model capacity. With an idea similar to curriculum learning (Bengio et al., 2009), we propose a curriculum sampling strategy that decides the number so as to remove more difficulty at the start and gradually approach the simultaneous generation setting as the model becomes better trained. Intuitively, for translation tasks, the sentences that cannot be translated well are more difficult to learn; for these sentences, we should reduce more of the learning difficulty and enlarge the sampling number. Thus, the demand for target words also varies across training samples. Specifically, given the input X, its translation Ŷ, and its reference Y, we use the distance d(Y, Ŷ) to measure how well Y is generated given X, where d(·) is a distance function. The fewer words of the target sentence the model can generate, the more additional target words are needed to reduce the difficulty of simultaneous generation. Considering that only identical words in the output can match the target, we adopt the Hamming distance (Hamming, 1950) in this work, that is, d(Y, Ŷ) = Σ_{t=1}^{T} 1(y_t ≠ ŷ_t), where t is the word position. To better control the reference sampling, we further employ a ratio function f_ratio to adjust the sampling number more flexibly. With the distance and the ratio function, the sampling number is determined by N(Y, Ŷ) = f_ratio · d(Y, Ŷ), which adaptively adjusts the learning procedure.

Random Sampling Strategy. Given the sampling number N(Y, Ŷ), we then decide which words should be chosen. We use simple random sampling as our strategy, that is, we randomly choose N(Y, Ŷ) words from the reference Y. Random sampling is the most straightforward approach to obtaining samples. We choose this strategy for the following reasons. First, it samples from an almost unrestricted space, which is unbiased and vital for drawing conclusions. Second, although random sampling does not necessarily pick the best candidates, it is very effective in most cases; many previous studies benefit from its simple implementation and desirable performance, such as BERT (Devlin et al., 2019).
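To make the training procedure concrete, here is a minimal, self-contained sketch of one GLAT training step. It is our own illustration rather than the authors' implementation: the tiny PyTorch encoder/decoder, the truncated source embeddings used as decoder inputs, and all hyperparameters are illustrative stand-ins, while the glancing function follows the adaptive Hamming-distance rule and random position sampling described above.

```python
# Illustrative sketch of one GLAT training step (assumptions: toy Transformer modules,
# source and target of equal length, decoder inputs crudely copied from source embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, nhead, vocab = 64, 4, 1000
src_embed = nn.Embedding(vocab, d_model)
tgt_embed = nn.Embedding(vocab, d_model)  # also used as the output softmax matrix
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), 2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), 2)

def decode(dec_inputs, memory):
    """Parallel (non-autoregressive) decoding: logits for every position at once."""
    hidden = decoder(dec_inputs, memory)      # no causal mask, so all positions are predicted in parallel
    return hidden @ tgt_embed.weight.T

def glancing_sample(y, y_hat, f_ratio=0.5):
    """Glancing sampling GS(Y, Y_hat): pick N = f_ratio * Hamming(Y, Y_hat) random reference positions."""
    hamming = (y != y_hat).sum(dim=-1)        # d(Y, Y_hat), one value per sentence
    n = (f_ratio * hamming).long()            # adaptive sampling number N(Y, Y_hat)
    glance = torch.zeros_like(y, dtype=torch.bool)
    for b in range(y.size(0)):                # random positions, sentence by sentence
        idx = torch.randperm(y.size(1))[: n[b]]
        glance[b, idx] = True
    return glance                             # True = glanced word fed to the decoder

def glat_step(x, y):
    memory = encoder(src_embed(x))
    dec_inputs = src_embed(x)[:, : y.size(1), :]             # stand-in for the copied decoder inputs H
    with torch.no_grad():                                     # first pass: predict Y_hat
        y_hat = decode(dec_inputs, memory).argmax(-1)
    glance = glancing_sample(y, y_hat)
    dec_inputs = torch.where(glance.unsqueeze(-1), tgt_embed(y), dec_inputs)   # covered inputs H'
    logits = decode(dec_inputs, memory)                       # second pass
    return F.cross_entropy(logits[~glance], y[~glance])       # Eq. (3): loss only on remaining words

x = torch.randint(0, vocab, (2, 6))   # toy batch: source and target of the same length
y = torch.randint(0, vocab, (2, 6))
print(glat_step(x, y))
```

In the actual model, the decoder inputs are formed by an attention-based copy of the source representations and the target length is predicted separately; both are omitted here for brevity.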
Experiments

In this section, we present experimental results to verify the effectiveness of GLAT.

Table 1: Performance on the WMT14 EN-DE/DE-EN and WMT16 EN-RO/RO-EN benchmarks. I_dec is the number of refinement iterations and m is the number of reranking candidates.

| Category | Model | WMT14 EN-DE | WMT14 DE-EN | WMT16 EN-RO | WMT16 RO-EN | Speed-up |
|---|---|---|---|---|---|---|
| AT Models | Transformer (Vaswani et al.) | 27.30 | / | / | / | / |
| | Transformer (ours) | 27.48 | 31.27 | 33.70 | 34.05 | 1.0× |
| Fully NAT Models | NAT-FT | 17.69 | 21.47 | 27.29 | 29.06 | 15.6× |
| | NAT-IR | 13.91 | 16.77 | 24.45 | 25.73 | 9.0× |
| | NAT-REG | 20.65 | 24.77 | / | / | 27.6× |
| | imitate-NAT | 22.44 | 25.67 | 28.61 | 28.90 | 18.6× |
| | Flowseq | 21.45 | 26.16 | 29.34 | 30.44 | / |
| | Mask-Predict base (I_dec=1) | 18.05 | 21.83 | 27.32 | 28.20 | / |
| | NAT-base | 20.12 | 24.46 | 28.47 | 29.43 | 15.3× |
| | GLAT (ours) | | | | | × |
| NAT Models w/ Iterative Refinements | NAT-IR (I_dec=10) | 21.61 | 25.48 | 29.32 | 30.19 | 1.5× |
| | LevT (I_dec=6+) | | / | / | 33.26 | 4.0× |
| | Mask-Predict base (I_dec=10) | 27.03 | | | | / |
| NAT Models w/ Reranking | NAT-FT (m=10) | 18.66 | 22.41 | 29.02 | 30.76 | 7.7× |
| | NAT-FT (m=100) | 19.17 | 23.20 | 29.79 | 31.44 | 2.4× |
| | NAT-REG (m=9) | 24.61 | 28.90 | / | / | 15.1× |
| | imitate-NAT (m=7) | 24.15 | 27.28 | 31.45 | 31.81 | 9.7× |
| | Flowseq (m=30) | 23.48 | 28.40 | 31.75 | 32.49 | / |
| | NAT-DCRF (m=9) | 26.07 | 29.68 | / | / | 6.1× |
| | GLAT (m=7, ours) | | | | | × |

Datasets. To validate the effectiveness of our method, we conduct experiments on three machine translation benchmarks: WMT14 EN-DE (4.5M translation pairs), WMT16 EN-RO (610k translation pairs), and IWSLT16 DE-EN (150K translation pairs). The datasets are tokenized and segmented into subword units using BPE (Sennrich et al., 2016). We preprocess WMT14 EN-DE following the data preprocessing in Vaswani et al. (2017) and use the processed data provided by Lee et al. (2018) for WMT16 EN-RO and IWSLT16 DE-EN.

Distillation. Following previous work (Gu et al., 2018; Lee et al., 2018; Wang et al., 2019), we also use sequence-level knowledge distillation for all datasets. We employ the base autoregressive Transformer of Vaswani et al. (2017) to distill the data and then train our models on the distilled data for all datasets.

Inference. GLAT only modifies the training procedure and performs one-step non-autoregressive generation, as the base model in Gu et al. (2018) does. We also consider the common practice of noisy parallel decoding (Gu et al., 2018; Lee et al., 2018; Guo et al., 2019; Wang et al., 2019), which generates several decoding candidates in parallel and selects the best via re-scoring with a pre-trained autoregressive model. For GLAT, we first predict m target length candidates, then generate an output sequence with argmax decoding for each length candidate, which is also called length parallel decoding (LPD) (Wei et al., 2019). We then use the pre-trained Transformer to rank these sequences and identify the best overall output as the final output.
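As a sketch of this inference procedure, the snippet below decodes one candidate per length and re-ranks them with an autoregressive scorer. The callables nat_decode and at_score are hypothetical stand-ins for the non-autoregressive generator and the pre-trained AT model; neither name comes from the paper.

```python
# Sketch of length parallel decoding (LPD) with AT re-scoring, under the assumption of
# two black-box callables supplied by the caller (hypothetical, not from the paper):
#   nat_decode(x, length) -> argmax output of the NAT model for the given target length
#   at_score(x, y)        -> log-probability of y under a pre-trained autoregressive model
from typing import Callable, List, Sequence

def length_parallel_decode(
    x: Sequence[int],
    predicted_length: int,
    m: int,
    nat_decode: Callable[[Sequence[int], int], List[int]],
    at_score: Callable[[Sequence[int], List[int]], float],
) -> List[int]:
    # m length candidates centred on the predicted target length.
    lengths = [predicted_length + off for off in range(-(m // 2), m - m // 2)]
    lengths = [L for L in lengths if L > 0]
    # Each candidate is decoded independently, so this loop can run in parallel.
    candidates = [nat_decode(x, L) for L in lengths]
    # Re-rank with the autoregressive model and return the best-scoring output.
    return max(candidates, key=lambda y: at_score(x, y))
```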
Implementation. We adopt the vanilla model that copies the source input uniformly (Gu et al., 2018) as our base model (NAT-base) and replace UniformCopy with an attention mechanism using positions. For the WMT datasets, we follow the hyperparameters of the base Transformer in Vaswani et al. (2017), and we choose a smaller setting for IWSLT16 since it is a smaller dataset: 5 layers for both encoder and decoder, with the model size d_model set to 256. We train the model with batches of 64k/8k tokens for the WMT/IWSLT datasets, respectively, using the Adam optimizer (Kingma & Ba, 2014). For the WMT datasets, the learning rate warms up to 5e-4 in 4k steps and gradually decays according to the inverse square root schedule of Vaswani et al. (2017); for IWSLT16 DE-EN, we adopt linear annealing as in Lee et al. (2018). For the ratio function f_ratio, we use a linearly decreasing function from 0.5 to 0.3 for the WMT datasets and a fixed ratio of 0.5 for IWSLT16. The final model is created by averaging the 5 best checkpoints.

Figure 2: The trade-off between speed-up and BLEU on WMT14 DE-EN.
Figure 3: The histogram of probability differences on the test set of WMT14 EN-DE.

Baselines. We compare our method with strong representative baselines, including fully non-autoregressive models: NAT-FT, the NAT with fertility (Gu et al., 2018); NAT-REG, the NAT with regularization (Wang et al., 2019); our vanilla NAT-base model; imitate-NAT, which imitates an autoregressive model (Wei et al., 2019); Flowseq, the flow-based NAT model (Ma et al., 2019); and NAT-DCRF, the NAT with CRF (Lafferty et al., 2001) decoding (Sun et al., 2019); as well as NAT models with iterative refinement: Mask-Predict (Ghazvininejad et al., 2019), NAT-IR (Lee et al., 2018), and LevT (Gu et al., 2019). For all tasks, we take the performance of other NAT models directly from the figures reported in their papers when available on our datasets.

The main results on the benchmarks are presented in Table 1. Clearly, our method significantly improves the translation quality and outperforms strong baselines by a large margin. Our method introduces explicit target language modeling for the decoder and gradually learns simultaneous generation, which addresses the encoder-decoder learning deadlock and better captures the underlying data structure. It is worth noting that although models equipped with iterative decoding achieve competitive BLEU (Papineni et al., 2002) scores, their success comes at the cost of the speed advantage. Our method fully maintains the inference efficiency of fully non-autoregressive models, since we only modify the training process. Compared with the competitors, we highlight our empirical advantages:

• The performance of GLAT is surprisingly good. Compared with the vanilla NAT-base model, GLAT obtains significant improvements (about 5 BLEU) on EN-DE/DE-EN. Additionally, GLAT surpasses other fully non-autoregressive models by a big margin (almost +3 BLEU on average). The results are even very close to those of the AT model, which shows great potential.

• GLAT is extremely simple and can be applied to other NAT models flexibly, as we only modify the training process by reference glancing while keeping inference unchanged. For comparison, imitate-NAT introduces additional AT models as teachers; NAT-DCRF utilizes a CRF to generate outputs sequentially; NAT-IR and Mask-Predict iteratively generate the target outputs.

We also see in Table 1 that BLEU and speed-up are more or less contradictory.
Therefore, we present a scatter plot in Figure 2, displaying the trend of speed-up and BLEU for different NAT models. GLAT lies to the top-right of the competing methods: it outperforms the competitors in BLEU when speed-up is controlled, and in speed-up when BLEU is controlled. This indicates that GLAT outperforms previous state-of-the-art NAT methods. Although iterative models like Mask-Predict achieve competitive BLEU scores, they only maintain minor speed advantages over AT. In contrast, fully non-autoregressive models remarkably improve the inference speed.

Table 2: Performance on IWSLT16 with a fixed sampling ratio.

| Sampling Method | λ | BLEU |
|---|---|---|
| Fixed | 0.0 | 24.66 |
| | 0.1 | 24.91 |
| | 0.2 | 27.12 |
| | 0.3 | 24.98 |
| | 0.4 | 22.96 |
| Adaptive | - | |

Table 3: Performance on IWSLT16 with a decreasing sampling ratio.

| Sampling Method | λ_s | λ_e | BLEU |
|---|---|---|---|
| Fixed | 0.5 | 0 | 27.80 |
| | 0.5 | 0.1 | 28.21 |
| | 0.5 | 0.2 | 27.15 |
| | 0.5 | 0.3 | 23.37 |
| Adaptive | - | - | |

Table 4: Performance on WMT14 EN-DE with different sampling strategies.

| Sampling Strategy | uniform | p_ref | 1−p_ref | double correct | double false |
|---|---|---|---|---|---|
| GLAT | 25.21 | 24.87 | 25.37 | 25.10 | 25.38 |
| GLAT (m=7) | 26.55 | 25.83 | 26.52 | 26.35 | 26.52 |

The NAT model always gives flat distributions in prediction. We computed the difference between the highest and the second-highest probability at each output position on the test set of WMT14 EN-DE; the resulting histogram is presented in Figure 3. We find that the model trained with reference glancing outputs sharper word distributions, namely, the highest probability is much larger than the second-highest one. GLAT has fewer positions where the highest probability is close to the second-highest one than NAT-base. The sharper prediction distributions indicate that GLAT can obtain and exploit more relevant information for prediction. Thus, our approach outputs more confident predictions and better fits the simultaneous generation setting.

Effectiveness of the Adaptive Sampling Strategy. To validate the effectiveness of the adaptive strategy for the sampling number N(Y, Ŷ), we also propose two fixed approaches for comparison. The first one decides the sampling number with λ·T, where T is the length of Y and λ is a constant ratio. The second one is more flexible: it sets a start ratio λ_s and an end ratio λ_e, and linearly reduces the sampling number from λ_s·T to λ_e·T over the course of training. As shown in Table 2 and Table 3, our adaptive approach (Adaptive in the tables) clearly outperforms both fixed approaches. The results confirm our intuition that the sampling schedule affects the translation performance of our NAT model; a schedule that first offers relatively easy generation problems and then turns harder benefits the final performance. More specifically:

• The random sampling strategy is robust. Even with the simplest constant ratio, GLAT still achieves remarkable results: with λ = 0.2, it outperforms the λ = 0.0 baseline by about 2.5 BLEU.

• The experiments support our suggestion that it is beneficial to learn simple patterns at the start and gradually transfer to more complex patterns. The flexible decreasing ratio works better than the constant one, and our proposed adaptive approach achieves the best results.
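For concreteness, the three sampling-number rules compared above can be written as follows. This is our own sketch; the function names and default ratios are illustrative, not taken from the paper's code.

```python
# Sampling-number schedules compared in Tables 2 and 3 (illustrative defaults).
def fixed_number(target_len: int, lam: float = 0.2) -> int:
    # Constant ratio: always glance at lam * T reference words.
    return int(lam * target_len)

def decreasing_number(target_len: int, step: int, total_steps: int,
                      lam_start: float = 0.5, lam_end: float = 0.1) -> int:
    # Linearly anneal the ratio from lam_start to lam_end over training.
    lam = lam_start + (lam_end - lam_start) * min(step / total_steps, 1.0)
    return int(lam * target_len)

def adaptive_number(hamming_distance: int, f_ratio: float = 0.5) -> int:
    # GLAT's adaptive rule: N(Y, Y_hat) = f_ratio * d(Y, Y_hat).
    return int(f_ratio * hamming_distance)
```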
Influence of Different Reference Sampling Distributions. To analyze the influence of the sampling distribution over the reference for glancing, we conduct experiments with different sampling distributions. By default, we randomly choose words for glancing, that is, we assume each word in the reference has equal probability and sample from a uniform distribution. To investigate whether reference words that can be accurately predicted, or on the contrary words that cannot, are more important for training, we devise four other sampling distributions for comparison: p_ref, 1−p_ref, "double correct", and "double false". For p_ref and 1−p_ref, the sampling probability of each reference word is proportional to the probability the model assigns to that reference word, p_ref, or to 1−p_ref, respectively. For "double correct" and "double false", we make the sampling probability of correctly predicted words twice that of falsely predicted words, or vice versa. The results for the different sampling distributions are listed in Table 4. In comparison, sampling distributions that give falsely predicted words a higher probability achieve better performance than those that favor correctly predicted words, indicating that words that are hard to predict are more important for glancing during training. We also find that the performance of the uniform sampling distribution is similar to that of the distributions favoring falsely predicted words.
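These alternative glancing distributions can be viewed as weighted sampling over reference positions. The sketch below is our own illustration with hypothetical inputs: p_ref holds the first-pass probability assigned to each reference word, and correct marks the positions the first pass already predicts correctly.

```python
# Weighted glancing-position sampling for the strategies compared in Table 4.
import numpy as np

def glancing_positions(p_ref: np.ndarray, correct: np.ndarray, n: int,
                       strategy: str = "uniform") -> np.ndarray:
    T = len(p_ref)
    if strategy == "uniform":
        weights = np.ones(T)
    elif strategy == "p_ref":            # favour words the model already predicts well
        weights = p_ref
    elif strategy == "1-p_ref":          # favour words the model predicts poorly
        weights = 1.0 - p_ref
    elif strategy == "double_correct":   # correctly predicted words twice as likely
        weights = np.where(correct, 2.0, 1.0)
    elif strategy == "double_false":     # falsely predicted words twice as likely
        weights = np.where(correct, 1.0, 2.0)
    else:
        raise ValueError(strategy)
    weights = weights / weights.sum()
    return np.random.choice(T, size=min(n, T), replace=False, p=weights)

# Example: glance at one of three positions, favouring the poorly predicted ones.
print(glancing_positions(np.array([0.9, 0.2, 0.6]), np.array([True, False, True]), 1, "1-p_ref"))
```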
Related Work

Since Gu et al. (2018) proposed the non-autoregressive transformer (NAT), which enables parallel sequence generation, there have been several advances in developing non-autoregressive models.

Fully Non-Autoregressive Models. One line of work aims to reduce the model's burden of handling dependencies among output words. Gu et al. (2018) interpret a latent variable as the number of target words aligned to each source word. Ma et al. (2019) utilize generative flow to model expressive signals related to the output. Bao et al. (2019) model the positions of output words to address the word reordering issue. Ran et al. (2019) reorder the input sequence into an intermediate translation of the output sequence. Another branch of work considers transferring knowledge from autoregressive models to non-autoregressive models. Wei et al. (2019) guide the training by imitating demonstrations from modules of autoregressive models. Li et al. (2019) leverage the relevance between hidden states and the attention distribution of autoregressive models. Compared to these fully non-autoregressive models, our proposed method stays simple and can significantly boost performance without modifying the model architecture or the inference process.

Non-Autoregressive Models with Structured Decoding. In order to model the dependencies between words, Sun et al. (2019) introduce a CRF inference module in NAT and perform additional sequential decoding after the non-autoregressive computation at inference time. Since GLAT only performs one-step non-autoregressive generation, our approach is orthogonal to the method proposed by Sun et al. (2019), and the two can be combined.

Non-Autoregressive Models with Iterative Refinement. A series of works is devoted to semi-autoregressive models that combine the strengths of both types of models by iteratively refining the previous outputs. Lee et al. (2018) propose a method of iterative refinement based on a denoising autoencoder. Gu et al. (2019) utilize insertion and deletion to refine the generation. Ghazvininejad et al. (2019) train the model as a masked language model, and the model iteratively replaces mask tokens with new outputs. Despite the relatively better accuracy, the multiple decoding iterations largely reduce the inference efficiency of non-autoregressive models.

Although the masked language model employed in Mask-Predict (Ghazvininejad et al., 2019) also introduces target language modeling, Mask-Predict reduces the difficulty of simultaneous generation at inference time rather than during training. To deal with the difficulty caused by the overly strong conditional independence assumption, both Mask-Predict and GLAT adopt a gradual approach: Mask-Predict gradually generates the final sentence by iteratively refining the words that may be incorrect, while GLAT gradually learns simultaneous generation by starting from learning with weak assumptions and approaching the complete conditional independence assumption along the training process. Besides the decrease in inference speed-up caused by iterative decoding, Mask-Predict also forcibly optimizes its objective without considering how well the model supports the strong conditional independence assumption, which could introduce noise into the model.

Conclusion

In non-autoregressive models, learning proceeds under a strong conditional independence assumption and lacks explicit target language modeling, which makes training challenging. In this paper, we propose the Glancing Transformer to tackle these problems. With the glanced reference, the model reduces the learning difficulty by gradually learning with stronger conditional independence assumptions, and it finds a way out of the encoder-decoder learning deadlock by explicitly learning target language modeling. Experimental results show that our approach significantly improves the performance of non-autoregressive machine translation with one-step generation. As non-autoregressive models are efficient and hold large potential in multiple tasks, we plan to apply our approach to other tasks.

Broader Impact

Machine translation is a crucial technology for facilitating communication across languages and cultures. The vast amount of information on social media platforms precludes manual translation and demands highly efficient yet effective automatic translation systems. Our work aims at improving the effectiveness and efficiency of neural machine translation systems (dominantly based on the Transformer model). Our proposed method can be beneficial for researchers and practitioners working in the field of machine translation, as well as for researchers working on general sequence modeling and generation, such as text generation, dialog systems, and question answering. This work can also lead to potential business benefits: users, multinational companies, and international organizations can utilize systems built upon our algorithms to obtain high-quality translations with greatly reduced latency. A further potential societal impact is that, with the proposed non-autoregressive generation model, much overhead in energy consumption can be saved, since inference is faster and more resource-efficient.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, and Lei Li. Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677, 2019.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pp. 41–48, 2009.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734, 2014.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186, 2019.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP, pp. 6114–6123, 2019.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.

Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In NeurIPS, pp. 11179–11189, 2019.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, volume 33, pp. 3723–3730, 2019.

Richard W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, 1950.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pp. 282–289, 2001.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pp. 1173–1182, 2018.

Zhuohan Li, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based training for non-autoregressive translation. In NeurIPS, 2019.

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pp. 4273–4283, 2019.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, pp. 311–318, 2002.

Qiu Ran, Yankai Lin, Peng Li, and Jie Zhou. Guiding non-autoregressive neural machine translation decoding with reordering information. arXiv preprint arXiv:1911.02215, 2019.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725, 2016.

Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. Fast structured decoding for sequence models. In NeurIPS, pp. 3016–3026, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.

Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In AAAI, 2019.

Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. In