Synthesizer: Rethinking Self-Attention in Transformer Models
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
Google Research, Mountain View
Abstract
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including machine translation (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and multi-task language understanding (GLUE, SuperGLUE).
Introduction

Transformer models [Vaswani et al., 2017] have demonstrated success across a wide range of tasks. This has resulted in Transformers largely displacing once popular auto-regressive and recurrent models in recent years. At the heart of Transformer models lies the query-key-value dot product attention. The success of Transformer models is widely attributed to this self-attention mechanism, since fully connected token graphs, which are able to model long-range dependencies, provide a robust inductive bias.

But is the dot product self-attention really so important? Do we need it? Is it necessary to learn attention weights via expensive pairwise dot products? This paper seeks to develop a deeper understanding of the role that the dot product self-attention mechanism plays in Transformer models.

The fundamental role of dot product self-attention is to learn self-alignment, i.e., to determine the relative importance of a single token with respect to all other tokens in the sequence. To this end, there have been memory metaphors and analogies constructed to support this claim. Indeed, the terms query, keys, and values imply that self-attention emulates a content-based retrieval process which leverages pairwise interactions at its very core. This paper rethinks this entire process.

Moving against convention, this paper postulates that we can not only do without dot product self-attention but also content-based memory-like self-attention altogether. Traditionally, attention weights are learned at the instance or sample level, where weights are produced by instance-level pairwise interactions. As a result, these instance-specific interactions often fluctuate freely across different instances, as they lack a consistent global context.

This paper proposes Synthesizer, a new model that learns to synthesize the self-alignment matrix instead of manually computing pairwise dot products. We propose a diverse suite of synthesizing functions and extensively evaluate them. We characterize the source information that these synthesizing functions receive, i.e., whether they receive information from individual tokens, token-token interactions, and/or global task information. Intuitively, different source inputs to the synthesizing functions should capture diverse views, which may be useful when employed in conjunction.

Aside from generalizing the standard Transformer model, we show that it is possible to achieve competitive results with fully global attention weights that do not consider token-token interactions or any instance-level (local) information at all. More specifically, a random matrix Synthesizer model achieves a competitive BLEU score on WMT 2014 English-German. We observe that the popular and well-established dot-product content-based attention can be replaced with simpler variants without sacrificing much performance in some cases. In general, we believe our findings will spur further investigation and discussion about the true role and utility of the self-attention mechanism in Transformer models.

Synthesizer is completely transformation-based, only relies on simple feed-forward layers, and completely dispenses with dot products and explicit token-token interactions. To reiterate, this work moves away from the implied notion of a query-key-value memory store and shows that randomized alignment matrices are sufficient for many tasks in practice.
Our Contributions
Our key contributions are described as follows:

• We propose Synthetic Attention, a new way of learning to attend without explicitly attending (i.e., without dot product attention or content-based attention). Instead, we generate the alignment matrix independently of token-token dependencies and explore a potpourri of parameterized functions for synthesizing attention matrices.

• We propose Synthesizer, a new model that leverages Synthetic Attention. The model performs competitively with state-of-the-art Transformer models on a wide range of language tasks, including machine translation and language modeling.

• Moreover, we show that (1) random learnable alignment matrices perform competitively and (2) token-token dependencies are not necessary to achieve good performance with Transformer models on certain tasks.
Related Work

Attention-based models are used across a wide spectrum of problem domains. Such models are especially popular, due to their effectiveness, in the language and vision domains. Attention models can be traced back to the machine translation models of [Bahdanau et al., 2014] and [Luong et al., 2015], where attention is employed to learn soft word alignments between language pairs. The intuition behind the attention mechanism is deeply rooted in the notion of memory-based retrieval [Graves et al., 2014, Weston et al., 2014], in which soft differentiable addressing of memory was initially proposed.

The paradigm of learning self-alignments, also known as self-attention, has been largely popularized by Transformer models [Vaswani et al., 2017]. This technical narrative has also been explored by a number of other recent studies, including those on intra-attention [Parikh et al., 2016], self-matching networks [Wang et al., 2017], and LSTMN [Cheng et al., 2016]. To this end, Transformer models, which function primarily based on self-attention and feed-forward layers, generally serve as a reliable replacement for autoregressive recurrent models.

The self-attention layer itself has been the subject of many recent technical innovations. For example, recent studies have investigated improving the layer's overall efficiency via sparsification and reducing the complexity of computing the alignment matrix [Child et al., 2019, Kitaev et al., 2020, Huang et al., 2018, Tay et al., 2020, Beltagy et al., 2020]. These methods are tightly coupled with the query-key-value paradigm, employing a form of memory-based content retrieval as an attention mechanism.
The Synthesizer Model

This section introduces our proposed Synthesizer model. At its core, our model is essentially a Transformer model with the self-attention modules replaced by our Synthetic Attention modules. Figure 1 illustrates the key ideas behind (a) the Transformer, (b) the Dense Synthesizer and (c) the Random Synthesizer.
This section introduces Synthetic Attention, our proposed self-attention module. Our model removes the notion of query-key-values in the self-attention module and directly synthesizes the alignment matrix instead.
Dense Synthesizer
Let us consider the simplest variation of the Synthesizer model, which is conditioned on each input token. Overall, our method accepts an input X ∈ R^{ℓ×d} and produces an output Y ∈ R^{ℓ×d}. Here, ℓ refers to the sequence length and d refers to the dimensionality of the model. We first adopt F(.), a parameterized function, for projecting input X_i from d dimensions to ℓ dimensions:

B_i = F(X_i),   (1)

where F(.) is a parameterized function that maps R^d to R^ℓ and i is the i-th token of X. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length ℓ. Essentially, with this model, each token predicts attention weights for every token in the input sequence. In practice, we adopt a simple two-layer feed-forward network with a ReLU activation for F(.):

F(X) = W_2(σ_R(W_1(X) + b_1)) + b_2,   (2)

where σ_R is the ReLU activation function. Hence, B is now of R^{ℓ×ℓ}. Given B, we now compute:

Y = Softmax(B) G(X),   (3)

where G(.) is another parameterized function of X that is analogous to V (the values) in the standard Transformer model. This approach eliminates the dot product altogether by replacing QK^⊤ in standard Transformers with the synthesizing function F(.).
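To make Equations (1)-(3) concrete, here is a minimal single-head NumPy sketch of the Dense Synthesizer forward pass. The function name and weight shapes are our own illustrative choices, and G(.) is sketched as a single linear (value) projection; none of this is taken from the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, W_v):
    """Single-head Dense Synthesizer (Eqs. 1-3). X has shape (l, d)."""
    # F(X): two-layer feed-forward net with ReLU, mapping each token to l logits.
    B = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2  # synthetic alignment logits, (l, l)
    G = X @ W_v                                 # value projection G(X), (l, d)
    return softmax(B) @ G                       # Eq. (3), output (l, d)

l, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(l, d))
Y = dense_synthesizer(
    X,
    W1=0.1 * rng.normal(size=(d, d)), b1=np.zeros(d),
    W2=0.1 * rng.normal(size=(d, l)), b2=np.zeros(l),
    W_v=0.1 * rng.normal(size=(d, d)),
)
print(Y.shape)  # (8, 16)
```

One design consequence visible here is that F(.) maps each token to exactly ℓ logits, so the layer is tied to a fixed (maximum) sequence length.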
Random Synthesizer

The previous variant learns synthetic attention by conditioning on each input of X and projecting to ℓ dimensions. Hence, the Dense Synthesizer conditions on each token independently, as opposed to pairwise token interactions in the vanilla Transformer model. We consider another variation of Synthesizer where the attention weights are not conditioned on any input tokens. Instead, the attention weights are initialized to random values. These values can then either be trainable or kept fixed (denoted as Fixed).

Let R be a randomly initialized matrix. The Random Synthesizer is defined as:

Y = Softmax(R) G(X),   (4)

where R ∈ R^{ℓ×ℓ}. Notably, each head adds ℓ² parameters to the network. The basic idea of the Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples. This is a direct generalization of the recently proposed fixed self-attention patterns of Raganato et al. [2020].

Figure 1: Our proposed Synthesizer model architecture: (a) Transformer (query-key-value dot product attention), (b) Synthesizer (Dense) and (c) Synthesizer (Random).
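Under the same illustrative single-head assumptions, the Random Synthesizer reduces to a lookup of shared alignment logits; in the trainable variant R would be updated by backpropagation, while in the Fixed variant it is frozen at its random initialization. A minimal sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

l, d = 8, 16
rng = np.random.default_rng(0)
R = rng.normal(size=(l, l))          # random alignment logits, shared by all inputs
W_v = 0.1 * rng.normal(size=(d, d))  # value projection G(.)

X = rng.normal(size=(l, d))
Y = softmax(R) @ (X @ W_v)           # Eq. (4): Y = Softmax(R) G(X)
print(Y.shape)  # (8, 16)
```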
Factorized Models

The Dense Synthesizer adds d × ℓ parameters to the network. On the other hand, the Random Synthesizer adds ℓ × ℓ parameters. Here, note that we omit the Q, K projections in the standard Transformer, which results in further parameter savings. Despite these savings, synthesized models can be cumbersome to learn when ℓ is large. Hence, we propose factorized variations of the Synthesizer models and show that these variants perform comparably in practice.
Factorized Dense Synthesizer
Factorized outputs not only slightly reduce the parameter cost of the Synthesizer but also aid in preventing overfitting. The factorized variant of the Dense Synthesizer can be expressed as follows:

A, B = F_A(X_i), F_B(X_i),   (5)

where F_A(.) projects input X_i into a dimensions, F_B(.) projects X_i to b dimensions, and a × b = ℓ. The output of the factorized module is now written as:

Y = Softmax(C) G(X),   (6)

where C = H_A(A) ∗ H_B(B) and H_A, H_B are tiling functions with C ∈ R^{ℓ×ℓ}. A tiling function simply duplicates a vector k times, i.e., it maps R^ℓ → R^{ℓk}. In this case, H_A(.) is a projection of R^a → R^{ab} and H_B(.) is a projection of R^b → R^{ba}. To avoid having similar values within the same block, we compose the outputs of H_A and H_B.
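A sketch of the factorized variant (Equations 5-6), assuming F_A and F_B are single linear maps and implementing the tiling functions H_A and H_B with np.tile; these are our readings of the equations, not the paper's exact code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factorized_dense_synthesizer(X, Wa, Wb, W_v, a, b):
    """Factorized Dense Synthesizer (Eqs. 5-6); requires a * b == sequence length."""
    l = X.shape[0]
    assert a * b == l
    A = X @ Wa                   # F_A(X): (l, a)
    B = X @ Wb                   # F_B(X): (l, b)
    HA = np.tile(A, (1, b))      # H_A: duplicate each a-vector b times -> (l, l)
    HB = np.tile(B, (1, a))      # H_B: duplicate each b-vector a times -> (l, l)
    C = HA * HB                  # compose the two factors elementwise
    return softmax(C) @ (X @ W_v)

l, d, a, b = 8, 16, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(l, d))
Y = factorized_dense_synthesizer(
    X, Wa=0.1 * rng.normal(size=(d, a)), Wb=0.1 * rng.normal(size=(d, b)),
    W_v=0.1 * rng.normal(size=(d, d)), a=a, b=b)
print(Y.shape)  # (8, 16)
```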
Factorized Random Synthesizer

Similar to Factorized Dense Synthesizers, we are also able to factorize R into low-rank matrices R_1, R_2 ∈ R^{ℓ×k}:

Y = Softmax(R_1 R_2^⊤) G(X).   (7)

Therefore, it is easy to see that, for each head, this reduces the parameter cost from ℓ² to 2ℓk, where k ≪ ℓ, and hence helps prevent overfitting. In practice, we use a small value of k = 8. We were not expecting this variant to work at all, but it turns out to be a strong baseline.

Mixture of Synthesizers

Finally, we note that all of the proposed synthetic attention variants can be mixed in an additive fashion. This can be expressed as:

Y = Softmax(α_1 S_1(X) + · · · + α_N S_N(X)) G(X),   (8)

where S(.) is a parameterized synthesizing function and the α (where Σα = 1) are learnable weights. In the case of mixing the Factorized Random Synthesizer with the standard Dense Synthesizer, this is expressed as:

Y = Softmax(R_1 R_2^⊤ + F(X)) G(X).   (9)

We investigate several Mixture of Synthesizers variants in our experiments.
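The sketch below covers the Factorized Random Synthesizer (Equation 7) and a two-way mixture (Equation 9), again with a single head and a single linear stand-in for the dense synthesizing function F; the learnable mixing weights α of Equation (8) are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

l, d, k = 8, 16, 4  # the paper uses a small k (e.g., k = 8) with k << l
rng = np.random.default_rng(0)
R1 = rng.normal(size=(l, k))          # low-rank factors of R: 2*l*k parameters
R2 = rng.normal(size=(l, k))
W_v = 0.1 * rng.normal(size=(d, d))   # value projection G(.)
F = 0.1 * rng.normal(size=(d, l))     # linear stand-in for the dense synthesizer

X = rng.normal(size=(l, d))
Y_fr = softmax(R1 @ R2.T) @ (X @ W_v)           # Eq. (7): Factorized Random
Y_mix = softmax(R1 @ R2.T + X @ F) @ (X @ W_v)  # Eq. (9): additive mixture
print(Y_fr.shape, Y_mix.shape)  # (8, 16) (8, 16)
```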
This paper asks fundamental questions about the attention matrix A and whether it is possible to synthesize A by alternate means other than pairwise attention. It is worth noting that the regular dot product attention can also be subsumed by our Synthesizer framework, i.e., Synthesizer generalizes the Transformer model. In the case of the Transformer, the synthesizing function in question is S(X) = F_Q(X) F_K(X)^⊤.

Model | S(X) | Condition On | Sample | Interact | |θ|
Dot Product Attention | F_Q(X_i) F_K(X_j)^⊤ | X_i, X_j ∀j | Local | Yes | 2d²
Random | R | N/A | Global | No | ℓ²
Factorized Random | R_1 R_2^⊤ | N/A | Global | No | 2ℓk
Dense | F_2(σ_R(F_1(X_i))) | X_i | Local | No | d² + dℓ
Factorized Dense | H_A(F_A(X_i)) ∗ H_B(F_B(X_i)) | X_i | Local | No | d² + d(a + b)

Table 1: Overview of all synthesizing functions.

Table 1 lists the different model variants explored within our Synthesizer framework.
The 'condition on' column refers to whether the synthesized output is produced as a function of X_i or of every X_i, X_j pair. The 'sample' column indicates whether a given variant leverages local or global context. Random Synthesizers are global because they share the same alignment patterns across all samples. Dense Synthesizers are considered to be local, as they are conditioned on X_i, which makes the alignment pattern dependent on each individual sample. To this end, it is imperative for synthesized models to have multiple heads to be effective. Finally, we note that Random Synthesizers are related to Relative Positional Representations [Shaw et al., 2018], which typically augment standard self-attention mechanisms. The key difference is that Random Synthesizers capture positional (relative) information without relying on token-token semantics.
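As a sanity check on the claim that Synthesizer generalizes the Transformer, the same template Y = Softmax(S(X)) G(X) recovers vanilla self-attention when S(X) = F_Q(X) F_K(X)^⊤. A minimal sketch (single head; the usual 1/√d scaling is omitted, matching the formulation in Table 1):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

l, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(l, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))

S = (X @ Wq) @ (X @ Wk).T   # S(X) = F_Q(X) F_K(X)^T: pairwise dot product logits
Y = softmax(S) @ (X @ Wv)   # identical template to every synthesizer variant above
print(Y.shape)  # (8, 16)
```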
Experiments

This section outlines our experimental setup and results.

Machine Translation

We conduct experiments on WMT'14 English-German (EnDe) and WMT'14 English-French (EnFr), which are well-established machine translation benchmarks. The WMT EnDe dataset is comprised of 4.5 million sentence pairs, while the EnFr dataset consists of 36 million sentence pairs. We implement our models in Tensor2Tensor using the standard base hyperparameter settings. Further details can be found in the appendix.
Experimental Results on Machine Translation
Table 2 reports results on machine translation. First, we observe that our Random Synthesizer baseline achieves competitive BLEU scores on both EnDe and EnFr. The non-trainable (i.e., fixed) variant performs substantially worse, but still yields a surprisingly strong BLEU score given that its attention weights are fixed and random. Most other Synthesizer variants achieve competitive performance, although with slight performance degradation compared to Transformers. An interesting finding is that the mixture model of the Random and Dense Synthesizers outperforms vanilla Transformers on EnDe. When mixing in the standard dot product attention, performance on EnDe increases further.

In general, the performance of the Synthesizer variants is competitive with standard Transformers for this task. Furthermore, Synthesizer variants have reduced computational complexity and lower parameter costs than Transformers. Taken together, synthetic attention is an appealing alternative to traditional dot product self-attention.
Table 2: Experimental results on the WMT'14 English-German and WMT'14 English-French machine translation tasks (BLEU) and One Billion Word language modeling (LM1B, perplexity).
Model | Rouge-1 | Rouge-2 | Rouge-L
Transformer | 38.24 | - | -
Synthesizer (D+V) | 38.57 | 16.64 | -

Table 3: Experimental results on abstractive summarization (CNN/Dailymail, Rouge-1/2/L) and dialogue generation (PersonaChat: BLEU-1/4, Rouge-L, Meteor, CIDEr, Emb).
Language Modeling

We experiment on the well-established task of subword-level language modeling using the One Billion Word (LM1B) dataset. Our baselines are similar to the ones used for machine translation, except that they only involve the decoder, in keeping with the LM task. We implement our models in Tensor2Tensor and train them on 16 TPU V2 chips. Further details can be found in the appendix.

Experimental Results on LM1B
Table 2 reports our results on LM1B (perplexity). We find that the Random Synthesizers perform within a few perplexity points of the vanilla Transformer model. The best performing model is the Synthesizer (Dense + Vanilla), which achieves the best performance in this setting.
Text Generation

Next, we evaluate Synthesizer on two text generation tasks: abstractive summarization using the CNN/Dailymail dataset and dialogue generation using the PersonaChat dataset [Zhang et al., 2018]. The model used is a simple Seq2Seq Transformer model.
We leverage our Synthesizer in both the encoder and decoder. All models use the base size setting. For the dialogue generation task, due to the smaller dataset size, we train a small model. For the summarization task, we use the well-established metrics, i.e., Rouge-1, Rouge-2 and Rouge-L. For the dialogue generation task, we use NLG-Eval [Sharma et al., 2017] (https://github.com/Maluuba/nlg-eval) and report BLEU-1, BLEU-4, Rouge-L, Meteor, CIDEr and embedding-based similarity scores (Emb).

Results on Summarization

Table 3 reports results for the summarization and dialogue generation tasks. For summarization, we find that the (R) and (D) variants do not outperform Transformers. The performance of the (D) model falls slightly below Transformers on Rouge-L. Hence, we postulate that local, sample-wise pairwise interactions are important for the summarization task. On the other hand, the utility of synthesized attention can also be observed, i.e., the (R+V) and (R+D) models both outperform Transformers.
Results on Dialogue Generation
On this task, Synthesizers (R) and (D) both outperform vanilla Transformers by a reasonable margin across most reported metrics.
Multi-Task Language Understanding

Finally, we evaluate our Synthesizer model on multi-task language understanding (GLUE [Wang et al., 2018] and SuperGLUE [Wang et al., 2019]), following the T5 (text-to-text Transformer) [Raffel et al., 2019] methodology. Our experiments are based on the T5 repository (https://github.com/google-research/text-to-text-transfer-transformer) and are implemented in Mesh TensorFlow [Shazeer et al., 2018]. We pre-train the vanilla T5 models and our models using the span denoising objective. We then co-train the model on multiple tasks: we co-train on the en_mix mixture (SuperGLUE and GLUE) with a constant learning rate of 10⁻³.

Model | GLUE | CoLA | SST | MRPC | STSB | QQP | MNLI | QNLI | RTE
T5 (Base) | 83.5 | 53.1 | - | -/84.6 | - | - | - | - | -
Table 4: Experimental results (dev scores) on multi-task language understanding (GLUE benchmark) for the small model and en_mix mixture. Note: this task has been co-trained with SuperGLUE.
Model | SGlue | BoolQ | CB | CoPA | MultiRC | ReCoRD | RTE | WiC | WSC
T5 (Base) | 70.3 | 78.2 | 72.1/83.9 | 59.0 | 73.1/32.1 | - | - | - | -
Syn (R) | 61.1 | 69.5 | 54.6/73.2 | 60.0 | 63.0/15.7 | 58.4/57.4 | 67.5 | 64.4 | 66.3
Syn (D) | 58.5 | 69.5 | 51.7/71.4 | 51.0 | 66.0/15.8 | 54.1/53.0 | 67.5 | 65.2 | 58.7
Syn (D+V) | 69.7 | 79.3 | 74.3/85.7 | 64.0 | 73.8/33.7 | 69.9/69.2 | 78.7 | 64.3 | 68.3
Syn (R+V) | - | - | - | - | - | - | - | - | -
Table 5: Experimental results (dev scores) on multi-task language understanding (SuperGLUE benchmark) for the small model and en_mix mixture. Note: this task has been co-trained with GLUE.
Results on GLUE and SuperGLUE
Tables 4 and 5 report results on the GLUE and SuperGLUE benchmarks. We note that the (R) and (D) variants of Synthesizer do not achieve reasonable performance. This can be largely attributed to the fact that the encoder self-attention in the T5 setting also functions as a cross-sentence attention. For example, in the entailment or reading comprehension tasks, the premise and hypothesis are concatenated together and self-attention effectively acts as cross-sentence attention. Optimistically, we observe that Syn (R+V) outperforms the T5 model by a substantial margin (+1.9 points on SuperGLUE and +0.6 points on GLUE).
Discussion

On all evaluated tasks, we showed that synthesized attention functions competitively, i.e., it achieves performance reasonably close to dot product self-attention. On one task (dialogue generation), dot product self-attention is found to actually degrade performance. Amongst the other tasks, machine translation is the least affected by the removal of the vanilla dot product. These findings allow us to introspect about whether pairwise comparisons for self-attention are even necessary. We would like to emphasize that this solely refers to self-attention and not cross-attention. On the multi-task language understanding benchmark, the self-attention functions as a form of cross-attention by concatenating sentence pairs. Hence, synthesized attention performance is considerably worse than that of vanilla Transformers. However, complementing the base T5 model with synthetic attention boosts performance, showing that synthesized attention provides additional value to current state-of-the-art models.
Analysis

In this section, we perform a deeper analysis of the Synthesizer model.

Figure 2: Histogram of encoder and decoder attention weights on MT (WMT EnDe); panels show Enc L1, Enc L3, Enc L5, Dec L1, Dec L3 and Dec L5, where L denotes the layer number and Enc/Dec denotes encoder or decoder.
Distribution of Weights
We are interested in investigating how the synthetically generated attention weights differ from the dot product attention weights. Figure 2 shows the attention histograms of trained Transformer and Synthesizer models. We report histograms at layers 1, 3 and 5 of a 6-layer (Transformer or Synthesizer) model partway through training; we found that the weight distributions remain relatively identical thereafter. Figure 3 shows the initialization state. We observe that there are distinct differences in the weight distributions of Synthesizer and Transformer models. The variance of the Synthesizer weights tends to be higher. On the other hand, the weights of the Transformer model tend to gravitate toward zero and have smaller variance. There are also notable differences across the (R) and (D) Synthesizer variants. Specifically, the (D) model in general has greater maximum values, with more mass at larger weights, while the values of the (R) model tend to stay closer to zero.

Figure 3: Decoder weights at initialization (reference).
Effect of Number of Heads

We also investigate the impact of the number of heads on performance. We trained Random Synthesizer models for the small versions of three machine translation tasks using the T5 framework without pretraining. For simplicity, evaluation is done via greedy decoding. We report scores on the development set. Table 6 reports the effect of varying the number of heads on performance.
Heads | EnDe | EnFr | EnRo
Syn h=2 | 19.43 | 34.12 | 18.67
Syn h=4 | 20.42 | 35.26 | 19.78
Syn h=8 | 20.88 | 34.92 | 20.28
Syn h=16 | 21.71 | 35.26 | 20.43
Syn h=32 | - | - | -

Table 6: Effect of the number of heads on multi-task MT. Increasing the number of heads improves performance.
Conclusion

This paper proposed Synthesizer, a new Transformer model that employs Synthetic Attention. We conducted a principled study to better understand and evaluate the utility of global alignment and of local, instance-wise alignment (e.g., independent-token and token-token based) in self-attention. We show that, on multiple tasks such as machine translation, language modeling and dialogue generation, synthetic attention demonstrates competitive performance compared to vanilla self-attention. Moreover, for the dialogue generation task, pairwise interactions actually hurt performance. Notably, we reemphasize that this study refers to self-attention; we found that we are not able to replace cross-attention with simpler variants in most cases. Overall, we hope our study will encourage further investigations into the component-wise effectiveness of well-established Transformer models.
Supplementary Material
Machine Translation

We implement our models in Tensor2Tensor, using the standard base hyperparameter settings. Specifically, we use byte-pair encoding (BPE) and 6-layer Transformer networks with a hidden size of 512, a filter size of 2048 and 8 heads. We use label smoothing of 0.1. The maximum sequence length follows the default setting. Training is performed using 8 V100 GPUs. We train all models for the same number of steps and report results at the last checkpoint. We use a length penalty of 0.6 and a beam size of 4, following the default settings. We also compare with standard Transformer models. In the interest of keeping a consistent, fair evaluation across all model settings, we do not use checkpoint averaging or tune the decoding hyperparameters, although this generally leads to better performance. We evaluate BLEU scores using sacrebleu.
Language Modeling

We implement our models in Tensor2Tensor using the packed TPU setup. We train our models on 16 TPU V2 chips. We use the same base model setting for a fair comparison across all model variations: the model has 6 layers and 8 heads, with the base filter and hidden sizes. We used conv_relu for the positional feed-forward layers across all baselines, since we find them to perform slightly better. We report results (subword-level perplexity scores) on the test set at the final checkpoint.
Summarization

For the summarization task, we train all models with the base size setting. All results are reported on the test set. We use the well-established metrics, i.e., Rouge-1, Rouge-2 and Rouge-L. Experiments are conducted using Mesh TensorFlow.
Dialogue Generation

For the dialogue generation task, due to the smaller dataset size, we train our models at the small size. Experiments are conducted in Tensor2Tensor. We use NLG-Eval [Sharma et al., 2017] and report BLEU-1, BLEU-4, Rouge-L, Meteor, CIDEr and embedding-based similarity scores (Emb).
Multi-Task Language Understanding

Our experiments are based on the T5 repository and implemented in Mesh TensorFlow [Shazeer et al., 2018]. We pre-train the vanilla T5 models and our models using the span denoising objective. We then co-train the model on multiple tasks: we co-train on the en_mix mixture (SuperGLUE and GLUE) with a constant learning rate of 10⁻³. Embedding and softmax output layer parameters are kept fixed. The maximum sequence length follows the standard T5 setup. We evaluate on the en_mix mixture as defined in the original codebase, which comprises training GLUE, SuperGLUE and SQuAD in a single model.
YNTHESIZER , most of which we found to havemarginal or no improvement over the simple dense/random variations. • Convolution - Applying a 1D convolution instead of a 2 layer nonlinear network. We varythe filter width in our experiments. • Bottleneck - Converting the 2 layered feed forward network to a bottleneck layer, e.g., → → . We also experiment with a convolutional variant of bottleneck, i.e.,projecting to low dimension space and then projecting back to high dimensions. • Gated Linear Units (GLU), applying the GLU units of [Dauphin et al., 2017] as the Synthe-sizing function. https://github.com/Maluuba/nlg-eval . https://github.com/google-research/text-to-text-transfer-transformer f = 3 ) Linear 27.43ConvReluConv ( f = 3 ) 27.51ConvReluConv ( f = 5 ) 27.56ConvReluConv ( f = 3 , ) 27.49Bottleneck + Dense 27.43Bottleneck + ConvReluConv 27.72GLU 27.43Table 7: Results for additional S YNTHESIZER variants on WMT EnDe (BLEU scores)10 eferences
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 933-941. JMLR.org, 2017.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer. arXiv preprint arXiv:1809.04281, 2018.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799, 2017. URL http://arxiv.org/abs/1706.09799.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414-10423, 2018.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. arXiv preprint arXiv:2002.11296, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261-3275, 2019.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189-198, 2017.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018.