Hard-Coded Gaussian Attention for Neural Machine Translation
Weiqiu You∗, Simeng Sun∗, Mohit Iyyer
College of Information and Computer Sciences
University of Massachusetts Amherst
{wyou,simengsun,miyyer}@cs.umass.edu
∗ Authors contributed equally.

Abstract
Recent work has questioned the importance of the Transformer's multi-headed attention for achieving high translation quality. We push further in this direction by developing a "hard-coded" attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
1 Introduction

The Transformer (Vaswani et al., 2017) has become the architecture of choice for neural machine translation. Instead of using recurrence to contextualize source and target token representations, Transformers rely on multi-headed attention mechanisms (MHA), which speed up training by enabling parallelization across timesteps. Recent work has called into question how much MHA contributes to translation quality: for example, a significant fraction of attention heads in a pretrained Transformer can be pruned without appreciable loss in BLEU (Voita et al., 2019; Michel et al., 2019), and self-attention can be replaced by less expensive modules such as convolutions (Yang et al., 2018; Wu et al., 2019). In this paper, we take this direction to an extreme by developing a variant of MHA without any learned parameters (Section 3).
[Figure 1: the sentence "Jane went to the office". Top, Standard Transformer: scaled dot product of learned query and key vectors. Bottom, Ours: fixed Gaussian distributions centered around nearby tokens.]

Figure 1: Three heads of learned self-attention (top) as well as our hard-coded attention (bottom) given the query word "to". In our variant, each attention head is a Gaussian distribution centered around a different token within a local window.

Concretely, we replace each attention head with a "hard-coded" version, which is simply a standard normal distribution centered around a particular position in the sequence (Figure 1). In Figure 1, the hard-coded head distribution centered on the word "to" (shown in green) is [0.054, 0.242, 0.399, 0.242, 0.054]. When we replace all encoder and decoder self-attention mechanisms with our hard-coded variant, we achieve almost identical BLEU scores to the baseline Transformer for four different language pairs (Section 4). Our code is available at https://github.com/fallcat/stupidNMT.

These experiments maintain fully learned MHA cross attention, which allows the decoder to condition its token representations on the encoder's outputs. We next attempt to additionally replace cross attention with a hard-coded version, which results in substantial drops of 5-10 BLEU.
[Figure 2: three panels (Encoder Self-Attention, Decoder Self-Attention, Decoder Cross-Attention); x-axis: heads (Head 1, Head 2) at each layer (L1-L5); y-axis: distance.]
Figure 2: Most learned attention heads for a Transformer trained on IWSLT16 En-De focus on a local window around the query position. The x-axis plots each head of each layer, while the y-axis refers to the distance between the query position and the argmax of the attention head distribution (averaged across the entire dataset).

Motivated to find the minimal number of learned attention parameters needed to make up this deficit, we explore configurations with only one learned cross attention head in total, which performs just slightly worse (1-3 BLEU) than the baseline.

By replacing MHA with hard-coded attention, we improve memory efficiency (26.4% more tokens per batch) and decoding speed (30.2% increase in sentences decoded per second) without significantly lowering BLEU, although these efficiency improvements are capped by other more computationally-expensive components of the model (Section 5). We also perform analysis experiments (Section 6.2) on linguistic properties (e.g., long-distance subject-verb agreement) that MHA is able to model better than hard-coded attention. Finally, we develop further variants of hard-coded attention in Section 6.3, including a version without any attention weights at all.

Our hard-coded Transformer configurations have intuitively severe limitations: attention in a particular layer is highly concentrated on a local window in which fixed weights determine a token's importance. Nevertheless, the strong performance of these limited models indicates that the flexibility enabled by fully-learned MHA is not as crucial as commonly believed: perhaps attention is not all you need. We hope our work will spur further development of simpler, more efficient models for neural machine translation.
2 Background

In this section, we first briefly review the Transformer architecture of Vaswani et al. (2017) with a focus on its multi-headed attention. Then, we provide an analysis of the learned attention head distributions of a trained Transformer model, which motivates the ideas discussed afterwards.
2.1 Multi-headed Transformer attention

The Transformer is an encoder-decoder model formed by stacking layers of attention blocks. Each encoder block contains a self-attention layer followed by layer normalization, a residual connection, and a feed-forward layer. Decoder blocks are identical to those of the encoder except they also include a cross attention layer, which connects the encoder's representations to the decoder.

To compute a single head of self-attention given a sequence of token representations $t_{1 \dots n}$, we first project these representations to queries $q_{1 \dots n}$, keys $k_{1 \dots n}$, and values $v_{1 \dots n}$ using three different linear projections. Then, to compute the self-attention distribution at a particular position $i$ in the sequence, we take the scaled dot product between the query vector $q_i$ and all of the key vectors (represented by matrix $K$). We then use this distribution to compute a weighted average of the values ($V$); a code sketch of this computation appears at the end of this section:

$$\mathrm{Attn}(q_i, K, V) = \mathrm{softmax}\left(\frac{q_i K^{\top}}{\sqrt{d_k}}\right) V \quad (1)$$

where $d_k$ is the dimensionality of the key vector. For MHA, we use different projection matrices to obtain the query, key, and value representations for each head. The key difference between self-attention and cross attention is that the queries and keys come from different sources: specifically, the keys are computed by passing the encoder's final layer token representations through a linear projection. To summarize, MHA is used in three different components of the Transformer: encoder self-attention, decoder self-attention, and cross attention.

2.2 Learned heads mostly focus on local windows

The intuition behind MHA is that each head can focus on a different type of information (e.g., syntactic or semantic patterns). While some heads have been shown to possess interpretable patterns (Voita et al., 2019; Correia et al., 2019), other work has cautioned against using attention patterns to explain a model's behavior (Jain and Wallace, 2019). In our analysis, we specifically examine the behavior of a head with respect to the current query token's position in the sequence. We train a baseline Transformer model (five layers, two heads per layer) on the IWSLT 2016 En→De dataset, and compute aggregated statistics on its learned heads. Figure 2 shows that outside of a few layers, most of the model's heads focus their attention (i.e., the argmax of the attention distribution) on a local neighborhood around the current sequence position. For example, both self-attention heads in the first layer of the encoder tend to focus on just a one to two token window around the current position. The decoder self-attention and cross attention heads show higher variability, but most of their heads are still on average focused on local information. These results raise the question of whether replacing self-attention with "hard-coded" patterns that focus on local windows will significantly affect translation quality.
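For concreteness, here is a minimal PyTorch sketch of the single-head computation in Equation 1; the function name and tensor shapes are our own illustration, not code from the authors' codebase:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """One learned attention head (Eq. 1).
    q, k: (n, d_k) query/key projections; v: (n, d_v) value projections."""
    scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))  # (n, n)
    weights = torch.softmax(scores, dim=-1)  # one distribution per query position
    return weights @ v                       # weighted average of the values
```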
3 Hard-coded Gaussian attention

While learned attention enables model flexibility (e.g., a head can "look" far away from the current position if it needs to), it is unclear from the above analysis how crucial this flexibility is. To examine this question, we replace the attention distribution computation in Equation 1 (i.e., the scaled dot product of queries and keys) with a fixed Gaussian distribution. (Yang et al. (2018) implement a similar idea, except the mean and standard deviation of their Gaussians are learned with separate neural modules.) In doing so, we remove all learned parameters from the attention computation: the mean of the Gaussian is determined by the position i of the current query token, and the standard deviation is always set to 1. (Preliminary experiments with other standard deviation values did not yield significant differences, so we do not vary the standard deviation for any experiments in this paper.) As Transformers contain both self-attention and cross attention, the rest of this section details how we replace both of these components with simplified versions. We will refer to experimental results on the relatively small IWSLT16 English-German dataset throughout this section to contextualize the impact of the various design decisions we describe. Section 4 contains a more fleshed out experimental section with many more datasets and language pairs.
3.1 Hard-coded self-attention

In self-attention, the queries and keys are derived from the same token representations and as such have the same length n. The baseline Transformer (BASE) computes the self-attention distribution at position i by taking the dot product between the query representation q_i and all of the key vectors k_{1...n}. We instead use a fixed Gaussian distribution centered around position i − 1 (the token to the left), i (the query token), or i + 1 (the token to the right). More formally, we replace Equation 1 with

$$\mathrm{Attn}(i, V) = \mathcal{N}(f(i), \sigma)\, V. \quad (2)$$

The mean of the Gaussian f(i) and its standard deviation σ are both hyperparameters; for all of our experiments, we set σ to 1 and f(i) to either i − 1, i, or i + 1, depending on the head configuration. Note that this definition is completely agnostic to the input representation: the distributions remain the same regardless of what sentence is fed in or what layer we are computing the attention at. (The Gaussian distribution is cut off on the borders of the sentence and is not renormalized to sum to one.) Additionally, our formulation removes the query and key projections from the attention computation; the Gaussians are used to compute a weighted average of the value vectors. (Preliminary models that additionally remove the value projections performed slightly worse when we hard-coded cross attention, so we omit them from the paper.)

Instead of learning different query and key projection matrices to define different heads, we simply design head distributions with different means. Figure 1 shows an example of our hard-coded self-attention for a simple sentence. We iterate over different configurations of distribution means f(i) on the IWSLT16 En-De dataset, while keeping the cross attention learned. (See the Appendix for a table describing the effects of varying f(i) on IWSLT16 En-De BLEU score; we find in general that hard-coded heads within each layer should focus on different tokens within the local window for optimal performance.) Our best validation result with hard-coded self-attention (HC-SA) replaces encoder self-attention with distributions centered around i − 1 and i + 1 and decoder self-attention with distributions centered around i − 1 and i. This model achieves slightly higher BLEU than the baseline Transformer (30.3 vs 30.0 BLEU).
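To make Equation 2 concrete, the following sketch (our own, under the assumptions above: σ = 1, borders cut off, no renormalization) computes one hard-coded head in PyTorch:

```python
import math
import torch

def hard_coded_self_attention(v, offset=0, sigma=1.0):
    """One hard-coded head (Eq. 2): a fixed, input-agnostic Gaussian over key
    positions, centered at f(i) = i + offset for each query position i.
    v: (n, d_v) value vectors (the only learned projection that remains)."""
    n = v.size(0)
    pos = torch.arange(n, dtype=torch.float)
    mean = (pos + offset).unsqueeze(1)              # (n, 1): f(i) per query
    # Unnormalized Gaussian density; borders are simply cut off and the rows
    # are not renormalized, following the paper's footnote.
    dist = torch.exp(-0.5 * ((pos.unsqueeze(0) - mean) / sigma) ** 2)
    dist = dist / (sigma * math.sqrt(2 * math.pi))  # (n, n), fixed for all inputs
    return dist @ v
```

Because the distribution ignores the input entirely, it depends only on the sequence length and could be precomputed and cached for each n.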
3.2 Cross attention

We turn next to cross attention, which on its face seems more difficult to replace with hard-coded distributions. Unlike self-attention, the queries and keys in cross attention are not derived from the same token representations; rather, the queries come from the decoder while the keys come from the encoder. Since the number of queries can now be different from the number of keys, setting the distribution means by position is less trivial than it is for self-attention. Here, we describe two methods to simplify cross attention, starting with a fully hard-coded approach and moving on to a minimal learned configuration.
Hard-coded cross attention:
We begin with a simple solution to the problem of queries and keys having variable lengths. Given a training dataset, we compute the length ratio γ by dividing the average source sentence length by the average target sentence length. Then, to define a hard-coded cross attention distribution for target position i, we center the Gaussians on positions ⌊γi − 1⌋, ⌊γi⌋, and ⌊γi + 1⌋ of the source sentence. When we implement this version of hard-coded cross attention and also hard-code the encoder and decoder self-attention as described previously (HC-ALL), our BLEU score on IWSLT16 En-De drops from 30.3 to 21.1. Clearly, cross attention is more important for maintaining translation quality than self-attention. Michel et al. (2019) notice a similar phenomenon when pruning heads from a pretrained Transformer: removing certain cross attention heads can substantially lower BLEU.
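Using Table 1, the IWSLT16 En-De length ratio is γ ≈ 28.5 / 29.6 ≈ 0.96. Below is a sketch of one hard-coded cross attention head under this scheme (our own illustration; offset ∈ {−1, 0, +1} selects among the three means above):

```python
import math
import torch

def hard_coded_cross_attention(v_src, tgt_len, gamma, offset=0, sigma=1.0):
    """One hard-coded cross attention head: for target position i, a fixed
    Gaussian over source positions centered at floor(gamma * i + offset).
    v_src: (m, d_v) value vectors derived from the encoder's final layer."""
    m = v_src.size(0)
    src_pos = torch.arange(m, dtype=torch.float)
    tgt_pos = torch.arange(tgt_len, dtype=torch.float)
    mean = torch.floor(gamma * tgt_pos + offset).unsqueeze(1)  # (tgt_len, 1)
    dist = torch.exp(-0.5 * ((src_pos.unsqueeze(0) - mean) / sigma) ** 2)
    dist = dist / (sigma * math.sqrt(2 * math.pi))
    return dist @ v_src                                        # (tgt_len, d_v)
```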
Learning a single cross attention head: Prior to the advent of the Transformer, many neural machine translation architectures relied on just a single cross attention "head" (Bahdanau et al., 2015). The Transformer has many heads at many layers, but how many of these are actually necessary? Here, we depart from the parameter-free approach by instead removing cross attention at all but the final layer of the decoder, where we include only a single learned head (SH-X). Note that this is the only learned head in the entire model, as both the encoder and decoder self-attention is hard-coded. On IWSLT16 En-De, our BLEU score improves from 21.1 to 28.2, less than 2 BLEU under the BASE Transformer.
Dataset          Train       Test   Len SRC  Len TGT
IWSLT16 En-De    196,884     993    28.5     29.6
IWSLT17 En-Ja    223,108     1,452  22.9     16.0
WMT16 En-Ro      612,422     1,999  27.4     28.3
WMT14 En-De      4,500,966   3,003  28.5     29.6
WMT14 En-Fr      10,493,816  3,003  26.0     28.8

Table 1: Statistics of the datasets used. The last two columns show the average number of tokens for source and target sentences, respectively.
4 Experiments

The previous section developed hard-coded configurations and presented results on the relatively small IWSLT16 En-De dataset. Here, we expand our experiments to include a variety of different datasets, language pairs, and model sizes. For all hard-coded head configurations, we use the optimal IWSLT16 En-De setting detailed in Section 3.1 and perform no additional tuning on the other datasets. This configuration nevertheless proves robust, as we observe similar trends with our hard-coded Transformers across all of the datasets.

4.1 Datasets

We experiment with four language pairs, English↔{German, Romanian, French, Japanese}, to show the consistency of our proposed attention variants. For the En-De pair, we use both the small IWSLT 2016 and the larger WMT 2014 datasets. For all datasets except WMT14 En→De and WMT14 En→Fr, we run experiments in both directions. For English-Japanese, we train and evaluate on the IWSLT 2017 En↔Ja TED talk dataset. More dataset statistics are shown in Table 1.
4.2 Architectures

Our BASE model is the original Transformer from Vaswani et al. (2017), reimplemented in PyTorch (Paszke et al., 2019) by Akoury et al. (2019) (https://github.com/dojoteef/synst). To implement hard-coded attention, we only modify the attention functions in this codebase and keep everything else the same. For the two small IWSLT datasets, we follow prior work by using a small Transformer architecture with embedding size 288, hidden size 507, four heads, five layers, and a learning rate of 3e-4 with a linear scheduler. For the larger datasets, we use the standard Transformer base model, with embedding size 512, hidden size 2048, eight heads, six layers, and a warmup scheduler with 4,000 warmup steps. For hard-coded configurations, we duplicate heads to fit these architectures (e.g., we have two heads per layer in the encoder, with means of i + 1 and i − 1). For all experiments, we report BLEU scores using SacreBLEU (Post, 2018) to be able to compare with other work. (SacreBLEU signature: BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.intl+version.1.2.11, with LANG ∈ {en-de, de-en, en-fr} and TEST ∈ {wmt14/full, iwslt2017/tst2013}.) We report BLEU on the IWSLT16 En-De dev set following previous work (Gu et al., 2018; Lee et al., 2018; Akoury et al., 2019); for other datasets, we report test BLEU. As the full WMT14 En→Fr is too large for us to feasibly train on, we instead follow Akoury et al. (2019) and train on just the Europarl / Common Crawl subset, while evaluating using the full dev/test sets. For WMT16 En-Ro and IWSLT17 En-Ja, we follow previous work for preprocessing (Sennrich et al., 2016), encoding the latter with a 32K sentencepiece vocabulary (https://github.com/google/sentencepiece) and measuring de-tokenized BLEU with SacreBLEU.

Dataset          BASE   HC-SA   HC-ALL   SH-X
IWSLT16 En-De    30.0   30.3    21.1     28.2
IWSLT16 De-En    34.4   34.8    25.7     33.3
IWSLT17 En-Ja    20.9   20.7    10.6     18.5
IWSLT17 Ja-En    11.6   10.9     6.1     10.1
WMT16 En-Ro      33.0   32.9    25.5     30.4
WMT16 Ro-En      33.1   32.8    26.2     31.7
WMT14 En-De      26.8   26.3    21.7     23.5
WMT14 En-Fr      40.3   39.1    35.6     37.1

Table 2: Comparison of the discussed Transformer variants on six smaller datasets (top) and two larger datasets (bottom). Hard-coded self-attention (HC-SA) achieves almost identical BLEU scores to BASE across all datasets, while a model with only one cross attention head (SH-X) performs slightly worse.

4.3 Results

Broadly, the trends we observed on IWSLT16 En-De in the previous section are consistent for all of the datasets and language pairs. Our findings are summarized as follows:
• A Transformer with hard-coded self-attention in the encoder and decoder and learned cross attention (HC-SA) achieves almost equal BLEU scores to the BASE Transformer.
• Hard-coding both cross attention and self-attention (HC-ALL) considerably drops BLEU compared to BASE, suggesting cross attention is more important for translation quality.
• A configuration with hard-coded self-attention and a single learned cross attention head in the final decoder layer (SH-X) consistently performs 1-3 BLEU worse than BASE.

These results motivate a number of interesting analysis experiments (e.g., what kinds of phenomena is MHA better at handling than hard-coded attention?), which we describe in Section 6. The strong performance of our highly-simplified models also suggests that we may be able to obtain memory or decoding speed improvements, which we investigate in the next section.
5 Efficiency

We have thus far motivated our work as an exploration of which components of the Transformer are necessary to obtain high translation quality. Our results demonstrate that encoder and decoder self-attention can be replaced with hard-coded attention distributions without loss in BLEU, and that MHA brings minor improvements over single-headed cross attention. In this section, we measure efficiency improvements in terms of batch size increases and decoding speedup.
Experimental setup:
We run experiments on WMT16 En-Ro with the larger architecture to support our conclusions. For each model variant discussed below, we present its memory efficiency as the maximum number of tokens per batch allowed during training on a single GeForce RTX 2080 Ti. Additionally, we provide inference speed as the number of sentences per second each model can decode on a 2080 Ti, reporting the average of five runs with a batch size of 256.
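As a rough illustration of how such throughput numbers can be collected (this is not the authors' profiling script; model.decode is an assumed interface):

```python
import time
import torch

def sentences_per_second(model, batches, runs=5):
    """Average decoding throughput over several runs, mirroring the setup
    above (five runs, fixed batches). `model.decode` is a hypothetical API."""
    model.eval()
    rates = []
    with torch.no_grad():
        for _ in range(runs):
            start = time.time()
            n_sentences = 0
            for batch in batches:
                model.decode(batch)          # hypothetical decode call
                n_sentences += batch.size(0)
            rates.append(n_sentences / (time.time() - start))
    return sum(rates) / len(rates)
```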
Hard-coding self-attention yields small efficiency gains:
Table 3 summarizes our profiling experiments. (Experiments with the smaller IWSLT16 En-De model are described in the Appendix.) Hard-coding self-attention and preserving learned cross attention allows us to fit 17% more tokens into a single batch, while also providing a 6% decoding speedup compared to BASE on the larger architecture used for WMT16 En-Ro. The improvements in both speed and memory usage are admittedly limited, which motivates us to measure the maximum efficiency gain if we only modify self-attention (i.e., preserving learned cross attention). We run a set of upper bound experiments where we entirely remove self-attention in the encoder and decoder. The resulting encoder thus just becomes a stack of feed-forward layers on top of the initial subword embeddings. Somewhat surprisingly, the resulting model still achieves a fairly decent BLEU score relative to the BASE model's 33.0. As for the efficiency gains, we can fit 27% more tokens into a single batch, and decoding speed improves by 12.3% over BASE. This relatively low upper bound for HC-SA shows that simply hard-coding self-attention does not guarantee a significant speedup. Previous work that simplifies attention (Wu et al., 2019; Michel et al., 2019) also reports efficiency improvements of similarly low magnitudes.

[Table 3: Model / BLEU / sent/sec / tokens/batch rows for BASE, HC-SA, SH-X, BASE/-SA, SH-X/-SA.]

Table 3: Decoding speedup (in terms of sentences per second) and memory improvements (max tokens per batch) on WMT16 En-Ro for a variety of models. The last two rows refer to BASE and SH-X configurations whose self-attention is completely removed.

Single-headed cross attention speeds up decoding:
Despite removing learned self-attention from both the encoder and decoder, we did not observe huge efficiency or speed gains. However, reducing the source attention to just a single head results in more significant improvements. By only keeping single-headed cross attention in the last layer, we are able to achieve a 30.2% speedup and fit 26.4% more tokens into memory compared to
BASE. Compared to HC-SA, SH-X obtains a 22.9% speedup and an 8.0% bigger batch size.

From our profiling experiments, most of the speed and memory considerations of the Transformer are associated with the large feed-forward layers that we do not modify in any of our experiments, which caps the efficiency gains from modifying the attention implementation. While we did not show huge efficiency improvements on modern GPUs, it remains possible that (1) a more tailored implementation could leverage the model simplifications we have made, and (2) these differences are larger on other hardware (e.g., CPUs). We leave these questions for future work.
[Figure 3: BLEU vs. number of layers on WMT16 En-Ro for BASE, HC-SA, BASE/-FF, and HC-SA/-FF.]
Figure 3: BLEU performance on WMT16 En-Ro before and after removing all feed-forward layers from the models. BASE and HC-SA achieve almost identical BLEU scores, but HC-SA relies more on the feed-forward layers than the vanilla Transformer. As shown on the plot, with a four layer encoder and decoder, the BLEU gap between BASE/-FF and BASE is 1.8, while the gap between HC-SA and HC-SA/-FF is 3.2.

6 Analysis

Taken as a whole, our experimental results suggest that many of the components in the Transformer can be replaced by highly-simplified versions without adversely affecting translation quality. In this section, we explain how hard-coded self-attention does not degrade translation quality (Section 6.1), perform a detailed analysis of the behavior of our various models by comparing the types of errors made by learned versus hard-coded attention (Section 6.2), and examine different attention configurations that naturally follow from our experiments (Section 6.3).
6.1 Feed-forward layers compensate for hard-coded self-attention

Given the good performance of HC-SA on multiple datasets, it is natural to ask why hard-coding self-attention does not deteriorate translation quality. We conjecture that feed-forward (FF) layers play a more important role in HC-SA than in BASE by compensating for the loss of learned dynamic self-attention. To test this hypothesis, we conduct an analysis experiment in which we train four model configurations while varying the number of layers: BASE, BASE without feed-forward layers (BASE/-FF), HC-SA, and HC-SA without feed-forward layers (HC-SA/-FF). As shown in Figure 3, BASE and HC-SA have similar performance, and both /-FF models have consistently lower BLEU scores. However, HC-SA without FF layers performs much worse compared to its BASE counterpart. This result confirms our hypothesis that FF layers are more important in HC-SA and capable of recovering the potential performance degradation brought by hard-coded self-attention. Taking a step back to hard-coding cross attention, the failure of hard-coding cross attention might be because the feed-forward layers of the decoder are not powerful enough to compensate for modeling both hard-coded decoder self-attention and cross attention.

6.2 Errors of learned versus hard-coded attention

Since hard-coded attention is much less flexible than learned attention and can struggle to encode global information, we are curious to see if its performance declines as a function of sentence length. (We note that gradients will flow across long distances if the number of layers is large enough, since the effective window size increases with multiple layers; van den Oord et al., 2016; Kalchbrenner et al., 2016.) To measure this, we categorize the WMT14 En-De test set into five bins by reference length and plot the decrease in BLEU between BASE and our hard-coded configurations for each bin. Somewhat surprisingly, Figure 4 shows that the BLEU gap between BASE and HC-SA seems to be roughly constant across all bins. However, the fully hard-coded HC-ALL model clearly deteriorates as reference length increases.

[Figure 4: BLEU vs. reference length bins (<10, 10-20, 20-30, 30-40, >40) on WMT14 En-De for BASE, HC-SA, HC-ALL, and SH-X.]

Figure 4: BLEU difference vs. BASE as a function of reference length on the WMT14 En-De test set. When cross attention is hard-coded (HC-ALL), the BLEU gap worsens as reference length increases.
Does hard-coding attention produce any systematic linguistic errors?
For a more fine-grained analysis, we run experiments on LingEval97 (Sennrich, 2017), an English→German dataset consisting of contrastive translation pairs. This dataset measures targeted errors on thirteen different linguistic phenomena such as agreement and adequacy.

[Table 4: accuracy per error type (rows include np-agreement and polarity-affix-ins) for BASE, HC-SA, and HC-ALL.]

Table 4: Accuracy for each error type in the LingEval97 contrastive set. Hard-coding self-attention results in slightly lower accuracy for most error types, while more significant degradation is observed when hard-coding self and cross attention. We refer readers to Sennrich (2017) for descriptions of each error type.
BASE and HC-SA perform very similarly across all error types (Table 4), which is perhaps unsurprising given that their BLEU scores are almost identical. (Accuracy is computed by counting how many references have lower token-level cross entropy loss than their contrastive counterparts.) Interestingly, the category with the highest decrease from BASE for both HC-SA and HC-ALL is deleted negations, in which ein is replaced with the negation kein; HC-ALL is 11% less accurate (absolute) at detecting these substitutions than BASE (94% vs 83%). On the other hand, both HC-SA and HC-ALL are actually better than BASE at detecting inserted negations, with HC-ALL achieving a robust 98.7% accuracy. We leave further exploration of this phenomenon to future work. Finally, we observe that for the subject-verb agreement category, the discrepancy between BASE and the hard-coded models increases as the distance between subject and verb increases (Figure 5). This result confirms that self-attention is important for modeling some long-distance phenomena, and that cross attention may be even more crucial.
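A sketch of the contrastive scoring just described; the model interface here is hypothetical, not the actual codebase's API:

```python
import torch
import torch.nn.functional as F

def token_level_nll(model, src, tgt):
    """Token-level cross entropy of `tgt` under the model; assumes a
    hypothetical interface model(src, tgt_in) -> (tgt_len, vocab) logits."""
    logits = model(src, tgt[:-1])
    return F.cross_entropy(logits, tgt[1:], reduction="mean").item()

def contrastive_accuracy(model, pairs):
    """LingEval97-style scoring: a pair counts as correct when the reference
    scores a lower token-level cross entropy than its contrastive variant."""
    correct = sum(
        token_level_nll(model, src, ref) < token_level_nll(model, src, contrast)
        for src, ref, contrast in pairs
    )
    return correct / len(pairs)
```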
[Figure 5: accuracy (0.65-0.90) vs. subject-verb distance (up to >15) on LingEval for BASE, HC-SA, and HC-ALL.]

Figure 5: Hard-coded models become increasingly worse than BASE at subject-verb agreement as the dependency grows longer.

Do hard-coded models struggle when learned self-attention focuses on non-local information? Since hard-coded models concentrate most of the attention probability mass on local tokens, they might underperform on sentences for which the learned heads of the BASE model focus on tokens far from the current query position. We define a token to be "off-diagonal" when the maximum probability of that token's attention is at least two steps away from the query position. A sentence's "off-diagonality" is then the proportion of off-diagonal tokens within the sentence. We bin the sentences in the IWSLT En-De development set by their off-diagonality and analyze the translation quality of our models on these different bins. Figure 6 shows that for decoder self-attention, the BLEU gap between HC-ALL and BASE increases as off-diagonality increases, while the gap between BASE and SH-X remains relatively constant across all bins. HC-SA even outperforms BASE for sentences with fewer off-diagonal tokens.

[Figure 6: BLEU vs. off-diagonality bins for decoder self-attention on IWSLT De-En, for BASE, HC-SA, SH-X, and HC-ALL.]

Figure 6: Hard-coded attention performs better for sentences with low off-diagonality (i.e., sentences for which the BASE model's learned attention focuses close to the query position for most of their tokens).
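The off-diagonality metric defined above can be computed directly from an attention matrix; a minimal sketch (our own helper, not from the released code):

```python
import torch

def off_diagonality(attn):
    """Proportion of 'off-diagonal' tokens in a sentence: tokens whose
    attention argmax is at least two steps away from the query position.
    attn: (n, n) attention matrix with one row per query position."""
    n = attn.size(0)
    argmax_pos = attn.argmax(dim=-1)
    query_pos = torch.arange(n)
    return ((argmax_pos - query_pos).abs() >= 2).float().mean().item()
```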
6.3 Further variants of hard-coded attention

One natural question about the hard-coded attention strategy described in Section 3 is whether it is necessary to assign some probability to all tokens in the sequence. After all, the probabilities outside a local window become very marginal, so perhaps it is unnecessary to preserve them. We take inspiration from Wu et al. (2019), who demonstrate that lightweight convolutions can replace self-attention in the Transformer without harming BLEU, by recasting our hard-coded attention as a convolution with a hard-coded 1-D kernel. While this decision limits the Gaussian distribution to span over just the tokens within a fixed window around the query token, it does not appreciably impact BLEU (second column of Table 5). We set the window size to 3 in all experiments, so the kernel weights become [0.242, 0.399, 0.242].

         Original  Conv (window=3)  Indexing
En-De    30.3      30.1             29.8
En-Ro    32.4      32.3             31.4

Table 5: Comparison of three implementations of HC-SA. Truncating the distribution to a three token span has little impact, while removing the weights altogether slightly lowers BLEU.
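A sketch of the truncated-window variant as a fixed-kernel depthwise 1-D convolution (our own illustration; for σ = 1 and window 3, the kernel below reproduces the [0.242, 0.399, 0.242] weights of a head centered on the query):

```python
import math
import torch
import torch.nn.functional as F

def truncated_gaussian_conv(v, sigma=1.0, window=3):
    """Hard-coded head recast as a depthwise 1-D convolution with a fixed
    kernel: the Gaussian is truncated to `window` tokens around the query.
    v: (n, d_v) value vectors."""
    half = window // 2
    offsets = torch.arange(-half, half + 1, dtype=torch.float)
    kernel = torch.exp(-0.5 * (offsets / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    weight = kernel.repeat(v.size(1), 1).unsqueeze(1)  # (d_v, 1, window)
    x = v.t().unsqueeze(0)                             # (1, d_v, n)
    out = F.conv1d(x, weight, padding=half, groups=v.size(1))
    return out.squeeze(0).t()                          # (n, d_v)
```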
Are any attention weights necessary at all? The previous setting with a constrained window size suggests another follow-up: is it necessary to have any attention weights within this local window at all? A highly-efficient alternative is to have each head simply select a single value vector associated with a token in the window. Here, our implementation requires no explicit multiplication with a weight vector, as we can compute each head's representation by simply indexing into the value vectors. Mathematically, this is equivalent to convolving with a binary kernel (e.g., convolution with [1, 0, 0] is equivalent to indexing the left token representation). The third column of Table 5 shows that this indexing approach results in less than a 1 BLEU drop across two datasets, which offers an interesting avenue for future efficiency improvements.
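The indexing variant then amounts to replacing the kernel with a one-hot vector; a minimal sketch, with the border handling being our own assumption:

```python
import torch

def index_left_token(v):
    """'Indexing' head: convolving with the one-hot kernel [1, 0, 0] just
    selects each position's left neighbor, so we skip the multiplication
    entirely. Position 0 keeps its own vector (border handling is assumed)."""
    idx = (torch.arange(v.size(0)) - 1).clamp(min=0)
    return v[idx]
```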
Where should we add additional cross attention heads? Our experiments with cross attention so far have been limited to learning just a single head, as we have mainly been interested in minimal configurations. If we have a larger budget of cross attention heads, where should we put them? Is it better to have more cross attention heads in the last layer of the decoder (and no heads anywhere else), or to distribute them across multiple layers of the decoder? Experiments on the WMT16 En-Ro dataset (Figure 7) indicate that distributing learned heads over multiple layers leads to significantly better BLEU than adding all of them to the same layer. (We used the smaller IWSLT En-De architecture for this experiment.)

[Figure 7: BLEU on WMT16 En-Ro vs. number of cross attention heads, comparing multiple heads in the same layer to single heads across layers.]

Figure 7: Adding more cross attention heads in the same layer helps less than adding individual heads across different layers.

7 Related work

Attention mechanisms were first introduced to augment vanilla recurrent models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Chorowski et al., 2015; Wu et al., 2016; Miceli Barone et al., 2017) but have become the featured component of the state-of-the-art Transformer architecture (Vaswani et al., 2017) for NMT. We review recent research that focuses on analyzing and improving multi-headed attention, and draw connections to our work.

The intuitive advantage of MHA is that different heads can focus on different types of information, all of which will eventually be helpful for translation. Voita et al. (2019) find that some heads focus on tokens adjacent to the query (mirroring our analysis in Section 2), while others focus on specific dependency relations or rare tokens. Correia et al. (2019) discover that some heads are sensitive to subword clusters or interrogative words. Tang et al. (2018) show that the number of MHA heads affects the ability to model long-range dependencies. Michel et al. (2019) show that pruning many heads from a pretrained model does not significantly impact BLEU scores. Similarly, Voita et al. (2019) prune many encoder self-attention heads without degrading BLEU, while Tang et al. (2019) further simplify the Transformer by removing the entire encoder for a drop of three BLEU points. In contrast to the existing literature on model pruning, we train our models without learned attention heads instead of removing them post-hoc.

There have been many efforts to modify MHA in Transformers. One such direction is to inject linguistic knowledge through auxiliary supervised tasks (Garg et al., 2019; Pham et al., 2019). Other work focuses on improving inference speed: Yang et al. (2018) replace decoder self-attention with a simple average attention network, assigning equal weights to target-side previous tokens. Wu et al. (2019) also speed up decoding by replacing self-attention with convolutions that have time-step dependent kernels; we further simplify this work with our fixed convolutional kernels in Section 6. Cui et al. (2019) also explore fixed attention while retaining some learned parameters, and Vashishth et al. (2019) show that using uniform or random attention deteriorates performance on paired sentence tasks, including machine translation. (In preliminary experiments, we find that using uniform distributions for encoder self-attention decreases BLEU; this result is similar to the indexing implementation we describe in Section 6.3.) Other work has also explored modeling locality (Shaw et al., 2018; Yang et al., 2018).
8 Conclusion

In this paper, we present "hard-coded" Gaussian attention, which, while lacking any learned parameters, can rival multi-headed attention for neural machine translation. Our experiments suggest that encoder and decoder self-attention are not crucial for translation quality compared to cross attention. We further find that a model with hard-coded self-attention and just a single cross attention head performs only slightly worse than a baseline Transformer. Our work provides a foundation for future work into simpler and more computationally efficient neural machine translation.
Acknowledgments
We thank the anonymous reviewers for their thoughtful comments, Omer Levy for general guidance and for suggesting some of our efficiency experiments, the UMass NLP group for helpful comments on earlier drafts, Nader Akoury for assisting with modifications to his Transformer codebase, and Kalpesh Krishna for advice on the structure of the paper.

References
Nader Akoury, Kalpesh Krishna, and Mohit Iyyer. 2019. Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1269–1281, Florence, Italy. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184, Hong Kong, China. Association for Computational Linguistics.

Hongyi Cui, Shohei Iida, Po-Hsuan Hung, Takehito Utsuro, and Masaaki Nagata. 2019. Mixed multi-head self-attention for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 206–214, Hong Kong. Association for Computational Linguistics.

Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. 2019. Jointly learning to align and translate with transformer models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4452–4461, Hong Kong, China. Association for Computational Linguistics.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Brussels, Belgium. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107, Copenhagen, Denmark. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pages 14014–14024. Curran Associates, Inc.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, page 125. ISCA.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Thuong Pham, Dominik Macháček, and Ondřej Bojar. 2019. Promoting the knowledge of source syntax in transformer NMT is not needed. Computación y Sistemas, 23.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272, Brussels, Belgium. Association for Computational Linguistics.

Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2019. Understanding neural machine translation by simplification: The case of encoder-free models. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1186–1193, Varna, Bulgaria. INCOMA Ltd.

Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui. 2019. Attention interpretability across NLP tasks. arXiv preprint arXiv:1909.11218.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4449–4458, Brussels, Belgium. Association for Computational Linguistics.
A Mixed position for hard-coded self-attention works the best
Enc-Config  Dec-Config  BLEU
(l, l)      (l, l)      27.4
(l, l)      (c, c)      27.8
(l, l)      (l, c)      28.1
(l, r)      (l, c)

Table 6: Search for the best configuration for hard-coded self-attention. 'l' stands for left, focusing on i − 1; 'r' for i + 1; and 'c' for i. Middle layers are (l, r) for the encoder and (l, c) for the decoder. Each cell shows the settings we used in the lowest and highest layers.

B Memory efficiency and inference speedups
Table 7 summarizes the results of our profiling experiments on the IWSLT16 En-De development set.
BASEHC-SASH-XHC-ALL 40%-60% 60%-80% 80%-100%Percent off Diagonal2025303540 B L E U IWSLT En-De Decoder Self-Attention
BASEHC-SASH-XHC-ALL 40%-60% 60%-80% 80%-100%Percent off Diagonal2025303540 B L E U IWSLT De-En Encoder Self-Attention
BASEHC-SASH-XHC-ALL
Figure 8: Off-diagonal analysis for IWSLT En-De/De-En self-attention
[Table 7: Model / BLEU / sent/sec / tokens/batch rows for BASE, HC-SA, SH-X, BASE/-SA, SH-X/-SA.]

Table 7: Decoding speedup (in terms of sentences per second) and memory improvements (max tokens per batch) on IWSLT16 En-De for a variety of models. The last two rows refer to BASE and SH-X configurations whose self-attention is completely removed.