An Empirical Study of Generation Order for Machine Translation
William Chan ∗ Google Research, Brain Team [email protected]
Mitchell Stern ∗ University of California, Berkeley [email protected]
Jamie Kiros Google Research, Brain Team [email protected]
Jakob Uszkoreit Google Research, Brain Team [email protected]
∗ Equal contribution.
Abstract
In this work, we present an empirical study of generation order for machine translation. Building on recent advances in insertion-based modeling, we first introduce a soft order-reward framework that enables us to train models to follow arbitrary oracle generation policies. We then make use of this framework to explore a large variety of generation orders, including uninformed orders, location-based orders, frequency-based orders, content-based orders, and model-based orders. Curiously, we find that for the WMT'14 English → German translation task, order does not have a substantial impact on output quality, with unintuitive orderings such as alphabetical and shortest-first matching the performance of a standard Transformer. This demonstrates that traditional left-to-right generation is not strictly necessary to achieve high performance. On the other hand, results on the WMT'18 English → Chinese task tend to vary more widely, suggesting that translation for less well-aligned language pairs may be more sensitive to generation order.
1 Introduction

Neural sequence models (Sutskever et al., 2014; Cho et al., 2014) have been successfully applied to a broad range of tasks in recent years. While these models typically generate their outputs using a fixed left-to-right order, there has also been some investigation into non-left-to-right and order-independent generation in pursuit of quality or speed. For example, Vinyals et al. (2015) explored the problem of predicting sets using sequence models. While this is a domain where generation order should intuitively be unimportant, they nevertheless found it to make a substantial difference in practice. Ford et al. (2018) explored treating language modeling as a two-pass process, where words from certain classes are generated in the first pass and the remaining words are filled in during the second. They found that generating function words first followed by content words yielded the best results. Separately, Gu et al. (2018) and Lee et al. (2018) developed non-autoregressive approaches to machine translation where the entire output can be generated in parallel in constant time. These models do away with order selection altogether but typically lag behind their autoregressive counterparts in translation quality.

More recently, a number of novel insertion-based architectures have been developed for sequence generation (Gu et al., 2019; Stern et al., 2019; Welleck et al., 2019). These frameworks license a diverse set of generation orders, including uniform (Welleck et al., 2019), random (Gu et al., 2019), or balanced binary tree (Stern et al., 2019) orders. Some of them also match the quality of state-of-the-art left-to-right models (Stern et al., 2019). In this paper, we utilize one such framework to explore an extensive collection of generation orders, evaluating them on the WMT'14 English-German and WMT'18 English-Chinese translation tasks. We find that a number of non-standard choices achieve BLEU scores comparable to those obtained with the classical approach, suggesting that left-to-right generation might not be a necessary ingredient for high-quality translation. Our contributions are as follows:

• We introduce a general soft order-reward framework that can be used to teach insertion-based models to follow any specified ordering.
• We perform a thorough empirical study of various orders, including: uniform, random, left-to-right, right-to-left, common-first, rare-first, shortest-first, longest-first, alphabetical, and model-adaptive.
• On the WMT'14 English → German task, we show that there is surprisingly little variation in BLEU across different generation orders. We further find that many orders are able to match the performance of a standard base Transformer.
Figure 1: A schematic of the Insertion Transformer model for a Chinese-English translation pair. The model is encouraged to predict the correct set of remaining words within each slot. Using our order-reward framework (Section 3), we can derive the necessary weight distribution to apply to the set of correct actions in order to train the model to follow any oracle generation policy of interest.
Serial generation:
Hypothesis                    Insertion
[]                            (ate, 0)
[ate]                         (snack, 1)
[ate, snack]                  (man, 0)
[man, ate, snack]             (the, 0)
[the, man, ate, snack]        (a, 3)
[the, man, ate, a, snack]     (⟨EOS⟩, 5)

Parallel generation:
Hypothesis                    Insertions
[]                            (ate, 0)
[ate]                         (man, 0), (snack, 1)
[man, ate, snack]             (the, 0), (a, 2)
[the, man, ate, a, snack]     (⟨EOS⟩, 5)

Figure 2: Example decoding paths for serial and parallel generation using the Insertion Transformer.
2 Insertion Transformer

Neural sequence models have traditionally been designed with left-to-right prediction in mind. In the classical setting, output sequences are produced by repeatedly appending tokens to the rightmost end of the hypothesis until an end-of-sequence token is generated. Though high-performing across a wide range of application areas, this approach lacks the flexibility to accommodate other types of inference such as parallel generation, constrained decoding, and infilling. Moreover, it also leaves open the possibility that a non-left-to-right factorization of the joint distribution over output sequences could outperform the usual monotonic ordering.

To address these concerns, several recent approaches have been proposed for insertion-based sequence modeling, in which sequences are constructed by repeatedly inserting tokens at arbitrary locations in the output rather than only at the right-most position. We use one such insertion-based model, the Insertion Transformer (Stern et al., 2019), for our empirical study. We give a brief overview of the model in this section before moving on to the details of our investigation.
The Insertion Transformer (Stern et al., 2019) is a sequence-to-sequence model in which the output is formed by successively inserting one or more tokens at arbitrary locations into a partial hypothesis. This type of generation is made possible through the use of a joint distribution over tokens and slots. More formally, given an input x and a partial output ŷ_t at time t, the Insertion Transformer gives the joint distribution

    p(c, l | x, ŷ_t) = InsertionTransformer(x, ŷ_t),

where c ∈ V is the content being selected from the vocabulary V and 0 ≤ l ≤ |ŷ_t| is the insertion location.

As its name suggests, the Insertion Transformer extends the Transformer model (Vaswani et al., 2017) with a few key modifications to generalize from ordinary next-token modeling to joint token-and-slot modeling. First, the Insertion Transformer removes the causal attention mask from the decoder, allowing for fully contextualized output representations to be derived after each insertion. Second, the Insertion Transformer pads the length-n decoder input on both ends so that n + 2 output vectors are produced. It then concatenates adjacent pairs of output vectors to obtain n + 1 slot representations, which in turn inform the conditional distributions over tokens within each slot, p(c | l). Lastly, it performs an additional attention step over the slot representations to obtain a location distribution p(l), which is multiplied with the conditional content distributions to obtain the full joint distribution: p(c, l) = p(c | l) p(l). A schematic of the architecture is given in Figure 1 for reference. We note that Stern et al. (2019) also experimented with a number of other architectural variants, but we use the baseline version of the model described above in our experiments for simplicity.

Once the model has been trained, it can be used for greedy autoregressive sequence generation as follows. At each step of decoding, we compute the joint argmax

    (ĉ_t, l̂_t) = argmax_{c, l} p(c, l | x, ŷ_t)

to determine what content ĉ_t should be inserted at which location l̂_t. We then apply this insertion, increasing the sequence length by one, and repeat this process until an end-of-sequence token is produced. This is the serial decoding procedure shown in the left half of Figure 2.

The model can also be used for parallel, partially autoregressive decoding. Instead of computing the joint argmax across all locations, we instead compute the best content for each location:

    ĉ_{l,t} = argmax_c p(c | l, x, ŷ_t).

We then insert the highest-scoring tokens in parallel for all slots that are not yet finished, increasing the sequence length by anywhere between one and n + 1 tokens. This strategy is visualized in the right half of Figure 2.
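To make the two decoding procedures concrete, the following is a minimal sketch of the greedy serial and parallel decoding loops. The dummy_model stand-in, the toy vocabulary, and the step cap are illustrative assumptions; a real implementation would call the trained Insertion Transformer to obtain the per-slot log-probabilities.

```python
import numpy as np

# Toy vocabulary for illustration; index 0 is the end-of-sequence token.
VOCAB = ["<EOS>", "the", "man", "ate", "a", "snack"]
EOS_ID = 0

def dummy_model(x, y_hat):
    """Stand-in for the Insertion Transformer: returns log-probabilities of shape
    [len(y_hat) + 1 slots, len(VOCAB)]. A real model would condition on x and y_hat."""
    rng = np.random.default_rng(len(y_hat))
    logits = rng.normal(size=(len(y_hat) + 1, len(VOCAB)))
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def serial_decode(model, x, max_steps=20):
    """Greedy serial decoding: one insertion per step (left half of Figure 2)."""
    y_hat = []
    for _ in range(max_steps):
        log_p = model(x, y_hat)                                  # [slots, vocab]
        slot, token = np.unravel_index(log_p.argmax(), log_p.shape)
        if token == EOS_ID:
            break
        y_hat.insert(slot, VOCAB[token])
    return y_hat

def parallel_decode(model, x, max_steps=20):
    """Greedy parallel decoding: insert the best token into every unfinished slot."""
    y_hat = []
    for _ in range(max_steps):
        log_p = model(x, y_hat)                                  # [slots, vocab]
        best = log_p.argmax(axis=-1)                             # best content per slot
        insertions = [(s, t) for s, t in enumerate(best) if t != EOS_ID]
        if not insertions:                                       # all slots finished
            break
        for slot, token in sorted(insertions, reverse=True):     # right-to-left keeps indices valid
            y_hat.insert(slot, VOCAB[token])
    return y_hat

print(serial_decode(dummy_model, x="那个男人吃了小吃"))
print(parallel_decode(dummy_model, x="那个男人吃了小吃"))
```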
3 Generation Orders

Having presented our model of interest, we now describe a general soft order-reward framework that can be used to train the model to follow any oracle ordering for sequence generation. Let O(a) be an order function mapping insertion actions to real numbers, where lower values correspond to better actions, and let p(a) be the probability assigned by the model to action a. From these, we construct a reward function R(a), an oracle policy q_oracle, and a per-slot loss L:

    R(a) = −O(a)   if a ∈ A*
    R(a) = −∞      if a ∉ A*

    q_oracle(a) = exp(R(a)/τ) / Σ_{a′ ∈ A*} exp(R(a′)/τ)

    L = KL(q_oracle ‖ p)

Here, A* is the set of all valid actions. The temperature τ ∈ (0, ∞) controls the sharpness of the distribution: τ → 0 results in a one-hot distribution with all mass on the best-scoring action under the order function O(a), while τ → ∞ results in a uniform distribution over all valid actions. Intermediate values of τ result in distributions which are biased towards better-scoring actions but allow for other valid actions to be taken some of the time.

Having defined the target distribution, we take the slot loss L for insertions within a particular slot to be the KL divergence between the oracle distribution q_oracle and the model distribution p. Substituting L in for the slot loss within the training framework of Stern et al. (2019) then gives the full sequence generation loss, which we can use to train an Insertion Transformer under any oracle policy rather than just the specific one they propose. We describe a wide variety of generation orders, each characterized by a different order function O(a), in the remainder of this section. A summary is given in Table 1.

Order                             Order function O(a)
Uniform                           0
Balanced binary tree              |s − (i + j)/2|
Random                            rank(hash(w))
Sequential (L2R vs. R2L)          ±s
Frequency (common vs. rare)       ±rank(frequency(w))
Length (short vs. long)           ±rank(length(w))
Alphabetical (A → z vs. z → A)    ±rank(w)
Adaptive (easy vs. hard)          ±log p(a)

Table 1: Order functions for an action a corresponding to the insertion of word w into slot s within span (i, j). The rank terms are computed with respect to the set of words in the valid action set A*.
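As a concrete illustration of the framework, the sketch below computes the soft oracle distribution and the per-slot KL loss for a set of valid actions given their order-function values. The array-based layout and function names are assumptions made for clarity, not the authors' implementation.

```python
import numpy as np

def oracle_distribution(order_values, temperature):
    """Soft oracle over valid actions: q(a) ∝ exp(-O(a) / τ).

    order_values: array of O(a) for the *valid* actions in a slot
                  (invalid actions are simply excluded, i.e. reward -inf).
    """
    rewards = -np.asarray(order_values, dtype=np.float64)
    logits = rewards / temperature
    logits -= logits.max()                      # for numerical stability
    q = np.exp(logits)
    return q / q.sum()

def slot_kl_loss(order_values, model_log_probs, temperature):
    """Per-slot loss KL(q_oracle || p) restricted to the valid actions."""
    q = oracle_distribution(order_values, temperature)
    p_log = np.asarray(model_log_probs, dtype=np.float64)
    return float(np.sum(q * (np.log(q + 1e-12) - p_log)))

# Toy example: three valid actions with order values 1, 2, 3 (lower is better).
# A small temperature concentrates the oracle on the best-scoring action,
# while a large temperature approaches a uniform target over valid actions.
print(oracle_distribution([1.0, 2.0, 3.0], temperature=0.1))    # ~one-hot
print(oracle_distribution([1.0, 2.0, 3.0], temperature=100.0))  # ~uniform
```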
We evaluate two uninformed orders, uniform and random. The uniform order O(a) = 0 gives equal reward, or equivalently equal probability mass, to any valid action. We also experiment with a random order O(a) = rank(hash(w)), wherein we hash each word and use the sorted hash ID as the generation order. The random order forces the model to follow a specific random path, whereas the uniform order gives equal probability mass to any order.

We explore two types of location-based orders: the balanced binary tree order and monotonic orders. The balanced binary tree order O(a) = |s − (i + j)/2| encourages the model to place most of its probability mass on the middle tokens of a missing span, so that text is generated in a balanced binary tree order. We also experiment with soft monotonic orders O(a) = ±s, i.e., soft left-to-right and soft right-to-left, which differ slightly from the left-to-right teacher forcing traditionally used in sequence-to-sequence models. First, we still maintain a uniform roll-in policy (described at the end of this section), which increases diversity during training and helps avoid label bias. Additionally, this endows the model with the ability to "look back" and insert missing tokens in the middle of the sequence during inference, as opposed to always being forced to append only at one end of the sequence. The order reward is also soft (as controlled by the τ term above): we do not place all the probability mass on the next monotonic token, but merely encourage the model to generate in a monotonic fashion.

We evaluate two frequency-based orders: rare words first via O(a) = rank(frequency(w)) and common words first via O(a) = −rank(frequency(w)). For these orders, we simply sort the words based on their frequencies and use their rank as the order. We note that the most frequent words tend to be punctuation and stop words, such as commas, periods, and "the" in English.

We also explore content-based orders. One class of orders is based on word length: O(a) = ±rank(length(w)). This encourages the model to emit either all the shortest words first or all the longest words first. We also explore alphabetical orderings O(a) = ±rank(w), where sorting is based on Unicode order. We note that in Unicode, uppercase letters occur before lowercase letters. This biases the model to produce capitalized words first (or last), typically corresponding to nouns in German. Additionally, for Chinese, the characters are roughly sorted by radical and stroke count, which bears a loose relation to the complexity and frequency of the character.

The orders presented thus far are static, meaning they are independent of the model. We also introduce "easy" and "hard" adaptive orders induced by O(a) = ±log p(a), which consult the model's posterior to determine the oracle policy. Because the model changes after each gradient step, the order adapts to the model's posterior over the course of training. In the "easy" version, we use O(a) = +log p(a), which is similar to a local greedy soft EM loss: we renormalize the current model's posterior over valid actions and optimize towards that distribution. This pushes the model's posterior towards actions that are correct and on which it has already placed probability mass, intuitively reinforcing the model to select what it thinks are the easiest actions first. Conversely, the "hard" variant uses O(a) = −log p(a), which encourages the model to place probability mass on what it thinks are the hardest valid actions. This is akin to a negative feedback system whose stationary condition is the uniform distribution.

Finally, we follow Stern et al. (2019) and use a uniform roll-in policy when sampling partial outputs at training time: we first select a subset size uniformly at random, then select a random subset of the output of that size. Repeated tokens are handled via greedy left or right alignment to the true output. A sketch of several of the static order functions is given below.
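As an illustration of how the static order functions in Table 1 can be realized, the sketch below scores candidate insertions for a single slot under a few of the orders. The word-level granularity, helper names, and toy frequency table are assumptions made for clarity (the actual models operate on subword tokens).

```python
from typing import Dict

# Toy corpus frequencies, assumed for illustration.
FREQ: Dict[str, int] = {"the": 500, "man": 40, "ate": 30, "a": 450, "snack": 5}

def rank(values):
    """Map each value to its rank among the given values (0 = smallest)."""
    order = sorted(set(values))
    return {v: i for i, v in enumerate(order)}

def order_left_to_right(slot: int) -> float:
    return float(slot)                          # O(a) = +s

def order_binary_tree(slot: int, i: int, j: int) -> float:
    return abs(slot - (i + j) / 2)              # prefer the middle of the span (i, j)

def order_rare_first(word: str, candidates) -> float:
    freq_rank = rank([FREQ.get(w, 0) for w in candidates])
    return float(freq_rank[FREQ.get(word, 0)])  # rarer word -> lower rank -> better

def order_shortest_first(word: str, candidates) -> float:
    len_rank = rank([len(w) for w in candidates])
    return float(len_rank[len(word)])

def order_alphabetical(word: str, candidates) -> float:
    alpha_rank = rank(list(candidates))         # Python string order follows Unicode code points
    return float(alpha_rank[word])

# Example: score the candidate words for one slot under several orders.
candidates = ["the", "man", "snack"]
for w in candidates:
    print(w,
          order_rare_first(w, candidates),
          order_shortest_first(w, candidates),
          order_alphabetical(w, candidates))
```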
Input: It would of course be a little simpler for the Germans if there were a coherent and standardised European policy, which is currently not the case.
Output: Es wäre für die Deutschen natürlich ein wenig einfacher, wenn es eine kohärente und einheitliche europäische Politik gäbe, was derzeit nicht der Fall ist.

Input: according to the data of National Bureau of Statistics, the fixed asset investment growth, total imports and other data in July have come down.
Output: 根据国家统计局的数据，月份的固定资产投资增长、进口总额和其他数据有所下降。

Figure 3: Example decodes for models trained to generate tokens in alphabetical (Unicode) order. Blue tokens correspond to those being inserted at the current time step, and gray tokens correspond to those not yet generated. Note that the desired ordering applies on a per-slot basis rather than a global basis.
Input: It will be sung by all the artists at all the three concerts at the same time.
Output: Es wird von allen Künstlern bei allen drei Konzerten gleichzeitig gesungen.

Figure 4: An example of longest-first generation.
4 Experiments

For our experiments, we train and evaluate models for each order on two standard machine translation datasets: WMT14 En-De and WMT18 En-Zh. For WMT14 En-De, we follow the standard setup with newstest2013 as our development set and newstest2014 as our test set. For WMT18 En-Zh, we use the official preprocessed data (http://data.statmt.org/wmt18/translation-task/preprocessed/zh-en/) with no additional data normalization or filtering, taking newstest2017 to be our development set and newstest2018 our test set. En-Zh evaluation is carried out using sacreBLEU (Post, 2018), with signature BLEU+case.mixed+lang.en-zh+numrefs.1+smooth.exp+test.wmt18+tok.zh+version.1.2.12. In both cases, we train all models for 1M steps using sequence-level knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016) from a base Transformer (Vaswani et al., 2017). We perform a sweep over temperatures τ and EOS penalties (Stern et al., 2019) on the development set, but otherwise perform no additional hyperparameter tuning, borrowing all other model and optimization settings from the base Transformer.
Input: imagine eating enough peanuts to serve as your dinner.
Output: 想象一下，吃足够的花生作为你的晚餐。

Figure 5: An example of common-first generation.

Order                    En → De               En → Zh
                         (τ: low → high)       (τ: low → high)
Binary Tree              91%   86%   80%       88%   83%   78%
Random                   86%   81%   72%       82%   77%   68%
Left-to-Right            95%   88%   77%       88%   82%   70%
Right-to-Left            95%   90%   78%       92%   83%   72%
Common First             92%   88%   80%       88%   84%   76%
Rare First               88%   81%   73%       83%   77%   67%
Shortest First           93%   88%   80%       91%   84%   76%
Longest First            92%   86%   77%       92%   84%   76%
Alphabetical (A → z)     93%   87%   77%       88%   82%   73%
Alphabetical (z → A)     90%   84%   74%       85%   78%   69%
Table 2: Percentage of insertions that follow the target order exactly, averaged over the development set.
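The adherence numbers in Table 2 can be computed by checking, at each insertion the model makes, whether it picked an action whose order-function value is minimal among the valid actions for that step. The sketch below is a minimal version of such a check; the per-step record format is an assumption.

```python
def order_adherence(steps):
    """Fraction of insertions whose chosen action is best under the order function.

    steps: iterable of (chosen_score, valid_scores) pairs, where the scores are
           order-function values O(a) for the chosen action and for all valid
           actions at that decoding step (lower is better).
    """
    total, followed = 0, 0
    for chosen_score, valid_scores in steps:
        total += 1
        if chosen_score <= min(valid_scores):
            followed += 1
    return followed / max(total, 1)

# Toy example: three insertions, two of which picked the best-scoring action.
steps = [(0.0, [0.0, 1.0, 2.0]),
         (1.0, [0.0, 1.0]),
         (0.5, [0.5, 3.0])]
print(f"{order_adherence(steps):.0%}")   # -> 67%
```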
By and large, we find that the Insertion Transformer is remarkably capable of learning to generate according to whichever order it was trained for. We give example decodes for three different generation orders in Figures 3, 4, and 5. In the first example, we see that the alphabetical En-De model adheres to the Unicode ordering for Latin characters (punctuation → uppercase → lowercase), and that the En-Zh model similarly adheres to the Unicode order for Chinese (punctuation → CJK characters sorted by radical and stroke count). In the second example, the longest-first En-De model generates subwords in decreasing order of length as expected. Finally, in the third example, the common-first En-Zh model begins with common particles and punctuation before generating the main content words.

We give a quantitative measurement of the success of each model in Table 2, computing the percentage of insertions across the development set that adhered to the best-scoring action under the desired ordering. Most models exhibit similar trends, with the majority of En-De models achieving accuracies in excess of 90% when a low temperature is used, and with corresponding results in the mid-to-upper 80% range for En-Zh. Even the random order based on token hashes has accuracies exceeding 80% for both languages, demonstrating that the model has a strong capacity to adapt to any oracle policy.
Next, we measure the quality of our models by evaluating their performance on their respective test sets. The BLEU scores are reported in Table 3. The uniform loss proposed by Stern et al. (2019) serves as a strong baseline for both language pairs, coming within 0.6 points of the original Transformer for En-De at 26.72 BLEU, and attaining a respectable score of 33.1 BLEU on En-Zh. We note that the gap between the normal Transformer and the Insertion Transformer is somewhat larger for the latter at 2.7 points, which we hypothesize is a result of the larger discrepancy between word orders in the two languages combined with the more difficult nature of the Insertion Transformer training objective.

Most of the content-based orderings (frequency-based, length-based, alphabetical) perform comparably to the uniform loss, and even the random order is not far behind. The adaptive orders perform similarly well, with easy-first attaining one of the highest scores on En-De. Curiously, we were unable to identify any strong patterns in the generation order of the adaptive easy-first model. The model did have a slight preference towards function words (e.g., "," and "der"), but the preference was weak. As for location-based losses, the binary tree loss is notable in that it achieves the highest score across all losses for both languages. On the other hand, we note that while the soft left-to-right and right-to-left losses perform substantially better than the hard loss employed in the original work by Stern et al. (2019), performance does suffer when using parallel decoding for those models, which is generally untrue of the other orderings. We believe this is due in part to exposure bias issues, arising from the mismatch between the monotonic ordering and the uniform roll-in policy, that are not shared by the other losses.
Order                    En → De               En → Zh
                         Serial    Parallel    Serial    Parallel
Transformer              27.3      –           35.8      –
Uniform                  27.12     26.72       32.9      33.1
Binary Tree              27.29     27.41       32.6      34.0
Random                   26.15     26.10       32.6      32.4
Left-to-Right            26.37     25.56       31.7      31.2
Right-to-Left            26.60     24.49       32.4      30.8
Common First             26.88     26.86       33.5      32.9
Rare First               26.06     26.24       32.5      32.2
Shortest First           27.05     27.15       33.0      32.7
Longest First            26.45     26.41       32.8      33.2
Alphabetical (A → z)     26.86     26.58       32.7      32.5
Alphabetical (z → A)     27.22     26.37       33.1      33.0
Easy First               26.95     27.05       32.5      32.5
Hard First               25.85     26.30       32.4      32.9

(The En → De Transformer result is from Vaswani et al. (2017), and the En → De Uniform and Binary Tree results are from Stern et al. (2019); all other results are from this work.)
Table 3: Test BLEU results for WMT14 En-De newstest2014 and WMT18 En-Zh newstest2018 with serial and parallel decoding.
For additional analysis, we consider how well our models perform relative to one another conditioned on the length of the source sentence. Sentence length can be seen as a rough proxy for the difficulty of translating a sentence, and this analysis lets us determine whether some order variants achieve better BLEU scores than others depending on the source sentence's length. For each sentence in the En-De and En-Zh development sets, we compute the source length and bin the sentences into groups of size 5, up to a maximum length of 50. Within each bin, we compute sentence-level BLEU and take the mean score across all sentences. This is done for each of our model variants (a sketch of the binning procedure is given below). Figure 6 illustrates the results of this experiment.

We observe surprisingly little variance across models in all length bins. This suggests that sentences that are difficult to translate are difficult across all orderings, and no particular ordering appears strictly better or worse than the others. One small exception is a performance fall-off of the hard-first ordering on very long sentences across both datasets. We also observe a different distribution of BLEU scores across bin lengths for En-De and En-Zh. In particular, the En-De models are approximately monotonically decreasing in performance as source length increases, while En-Zh performance is roughly flat across sentence lengths. This also highlights the importance of taking additional, diverse language pairs into consideration, as translation properties observed on one language pair may not carry over to others.

Ultimately, given the similarity of the development scores across sentence lengths and of the test scores for the various models, we come to the surprising conclusion that for single-sentence English-German translation, generation order is relatively unimportant. For English-Chinese the picture is less clear, and we leave further analysis to future work: under the Insertion Transformer framework order again does not appear to matter much, but there remains a 2.7 BLEU gap between the Insertion Transformer results and our Transformer baseline.
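A minimal sketch of the binning analysis described above, assuming a user-supplied sentence_bleu(hypothesis, reference) function and whitespace-tokenized source sentences; the helper names are illustrative.

```python
from collections import defaultdict

def binned_mean_bleu(sources, hypotheses, references, sentence_bleu,
                     bin_size=5, max_len=50):
    """Mean sentence-level BLEU per source-length bin (bins of 5 tokens, up to 50)."""
    scores = defaultdict(list)
    for src, hyp, ref in zip(sources, hypotheses, references):
        length = max(1, min(len(src.split()), max_len))
        bin_id = (length - 1) // bin_size * bin_size + bin_size  # e.g. 1-5 -> 5, 6-10 -> 10
        scores[bin_id].append(sentence_bleu(hyp, ref))
    return {b: sum(v) / len(v) for b, v in sorted(scores.items())}

# Usage (with any sentence-level BLEU implementation, e.g. sacreBLEU's):
# bins = binned_mean_bleu(dev_sources, dev_hypotheses, dev_references,
#                         sentence_bleu=my_sentence_bleu)
```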
[Figure 6 consists of two panels, (a) English → German and (b) English → Chinese, plotting mean sentence-level BLEU against source sequence length for the Uniform, Binary Tree, Random, Common First, Rare First, Shortest First, Longest First, A → z, z → A, Easy First, and Hard First models.]
Figure 6: Sentence-level BLEU scores as a function of sentence length for several of our model variants. Source sentences in each development set are binned into groups of size 5, up to length 50.

5 Related Work

In recent work, several insertion-based frameworks have been proposed for the generation of sequences in a non-left-to-right fashion for machine translation (Stern et al., 2019; Welleck et al., 2019; Gu et al., 2019). Stern et al. (2019) introduced the Insertion Transformer and explored uniform and balanced binary tree orders. We built upon and generalized this approach in order to explore a much broader set of orders. Welleck et al. (2019) explored insertions using a binary-tree formulation. They also explored uniform and model-based orders, but found them to lag significantly behind their left-to-right baselines. Additionally, despite using a binary-tree formulation for generation, they did not explore tree-based orders. Gu et al. (2019) introduced a model which did not explicitly represent the output canvas arising from insertions, but rather used an implicit representation through conditioning on the insertion sequence. They also performed an exploration of different generation orders, including random, odd-even, common-first, rare-first, and a search-adaptive order. Their search-adaptive order can be seen as a global version of our local model-adaptive order: we use the local greedy posterior as the reward function, whereas they use the sequence-level log-probability. Curiously, in their framework, the random order fell significantly behind the left-to-right baseline, while they showed small gains with their search-adaptive order. One key difference between our work and Welleck et al. (2019) and Gu et al. (2019) is that we use a soft order-reward framework as opposed to teacher forcing. This might explain some of the performance differences, as our framework allows for a more flexible training objective. Additionally, since we use a uniform roll-in policy, our models may have less of a label bias problem, as they are trained to be able to continue from any partial output rather than just those arising from the target policy.
6 Conclusion

In this work, we investigated a broad array of generation orders for machine translation using an insertion-based sequence generation model, the Insertion Transformer. We found that regardless of the type of strategy selected, be it location-based, frequency-based, length-based, alphabetical, model-based, or even random, the Insertion Transformer is able to learn it with high fidelity and produce high-quality output in the selected order. This is especially true for English-German single-sentence translation, for which we by and large found order not to matter. This opens a wide range of possibilities for generation tasks where monotonic orderings are not the most natural choice, and we would be excited to explore some of these areas in future work.

References
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.

Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E. Dahl. 2018. The Importance of Generation Order in Language Modeling. In EMNLP.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-Autoregressive Neural Machine Translation. In ICLR.

Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based Decoding with Automatically Inferred Generation Order. In arXiv.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop.

Yoon Kim and Alexander M. Rush. 2016. Sequence-Level Knowledge Distillation. In EMNLP.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In WMT.

Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion Transformer: Flexible Sequence Generation via Insertion Operations. In ICML.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order Matters: Sequence to sequence for sets. In ICLR.

Sean Welleck, Kiante Brantley, Hal Daume, and Kyunghyun Cho. 2019. Non-Monotonic Sequential Text Generation. In ICML.

Appendix

Full development set results for En-De translation and En-Zh translation.
For each generation order and temperature τ, the appendix table reports development set BLEU, with the score obtained under the tuned EOS penalty given in parentheses, for both serial and parallel decoding on English → German and English → Chinese.