An Efficient Transformer Decoder with Compressed Sub-layers
Yanyang Li, Ye Lin∗, Tong Xiao, Jingbo Zhu
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China
NiuTrans Research, Shenyang, China
{blamedrlee, linye2015}@outlook.com, {xiaotong, zhujingbo}@mail.neu.edu.cn
∗ Authors contributed equally.

Abstract
The large attention-based encoder-decoder network (Transformer) has become prevalent recently due to its effectiveness. But the high computational complexity of its decoder raises an inefficiency issue. By examining the mathematical formulation of the decoder, we show that under some mild conditions the architecture can be simplified by compressing its sub-layers, the basic building blocks of Transformer, to achieve a higher degree of parallelism. We thereby propose
Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42× faster with performance on par with a strong baseline. This strong baseline is already 2× faster than the widely used standard baseline without loss in performance.

Introduction
Transformer is an attention-based encoder-decoder model (Vaswani et al. 2017). It has shown promising results in machine translation tasks recently (Wang et al. 2019; Aharoni, Johnson, and Firat 2019; Dehghani et al. 2019). Nonetheless, Transformer suffers from an inefficiency issue at inference. This problem is attributed to the Transformer decoder for two reasons: 1) the decoder is deep (Kasai et al. 2020). It consists of multiple layers, and each layer contains three sub-layers, including two attentions and a feed-forward network; 2) the attention has a high (quadratic-time) complexity (Zhang, Xiong, and Su 2018), as it needs to compute the correlation between any two input words.

Previous work has focused on improving the complexity of the attention in the decoder to accelerate inference. For example, AAN uses an averaging operation to avoid computing the correlation between input words (Zhang, Xiong, and Su 2018). SAN shares the attention results among layers (Xiao et al. 2019). On the other hand, we observe that vanilla attention runs faster in training than in inference thanks to its parallelism. This offers a new direction: a higher degree of parallelism could speed up inference. The most representative work of this type is the non-autoregressive approach (Gu et al. 2018). Its decoder predicts all words in parallel, but fails to model the word dependencies. Despite their successes, all these systems still have a deep decoder.

In this work, we propose to parallelize the sub-layers to obtain a shallow autoregressive decoder. This approach does not suffer from the poor results of directly reducing the depth and avoids the limitations of non-autoregressive approaches. We prove that the two attention sub-layers in a decoder layer can be parallelized if we assume their inputs are close to each other. This assumption holds, and we thereby compress these two attentions into one. Furthermore, we show that the remaining feed-forward network can also be merged into the attention due to their linearity. To this end, we propose Compressed Attention Network (CAN for short). The decoder layer of CAN possesses a single attention sub-layer that does the previous three sub-layers' jobs in parallel. As another "bonus", CAN is simple and easy to implement.

In addition, Kasai et al. (2020) empirically discover that existing systems do not balance the encoder and decoder depths well. Based on their work, we build a system with a deep encoder and a shallow decoder, which is 2× faster than the widely used standard baseline without loss in performance. It requires neither architecture modification nor extra parameters. This system serves as a stronger baseline for a more convincing comparison.

We evaluate CAN and the stronger baseline on 14 machine translation tasks, including WMT14 English↔{German, French} (En↔{De, Fr}) and WMT17 English↔{German, Finnish, Latvian, Russian, Czech} (En↔{De, Fi, Lv, Ru, Cs}). The experiments show that CAN is up to 2.82× faster than the standard baseline with almost no loss in performance. Even compared to our stronger baseline, CAN still has a 1.42× speed-up, while other acceleration techniques such as SAN and AAN achieve only around 1.12× in the same setting.

To summarize, our contributions are as follows:
• We propose CAN, a novel architecture that accelerates Transformer by compressing its sub-layers for a higher degree of parallelism.
CAN is easy to implement.
• Our work is based on a stronger baseline, which is 2× faster than the widely used standard baseline.
• Extensive experiments on 14 WMT machine translation tasks show that CAN is 1.42× faster than the stronger baseline and 2.82× faster than the standard baseline. CAN also outperforms other approaches such as SAN and AAN.

Figure 1: Transformer vs. CAN (Chinese pinyin → English: "wǒ hěn hǎo ." → "I am fine ."). (a) The Transformer decoder with self-attention, cross-attention and FFN sub-layers; (b) the CAN decoder with a single compressed-attention sub-layer.

Background: Transformer
Transformer is one of the state-of-the-art neural models in machine translation. It consists of an N-layer encoder and an N-layer decoder, where N = 6 in most cases. The encoder maps the source sentence to a sequence of continuous representations, and the decoder maps these representations to the target sentence. All layers in the encoder or decoder are identical to each other.

Each layer in the decoder consists of three sub-layers: the self-attention, the cross-attention and the feed-forward network (FFN). The self-attention takes the output $X$ of the previous sub-layer as its input and produces a tensor of the same size as its output. It computes the attention distribution $A_x$ and then averages $X$ by $A_x$. We denote the self-attention as $Y_x = \mathrm{Self}(X)$, where $X \in \mathbb{R}^{t \times d}$, $t$ is the target sentence length and $d$ is the dimension of the hidden representation:

$A_x = \mathrm{SoftMax}\!\left( X W_q^x (W_k^x)^\top X^\top / \sqrt{d} \right)$   (1)
$Y_x = A_x X W_v^x$   (2)

where $W_q^x, W_k^x, W_v^x \in \mathbb{R}^{d \times d}$.

The cross-attention is similar to the self-attention, except that it takes the encoder output $H$ as an additional input. We denote the cross-attention as $Y_h = \mathrm{Cross}(X, H)$, where $H \in \mathbb{R}^{s \times d}$ and $s$ is the source sentence length:

$A_h = \mathrm{SoftMax}\!\left( X W_q^h (W_k^h)^\top H^\top / \sqrt{d} \right)$   (3)
$Y_h = A_h H W_v^h$   (4)

where $W_q^h, W_k^h, W_v^h \in \mathbb{R}^{d \times d}$.

The FFN applies a non-linear transformation to its input $X$. We denote the FFN as $Y_f = \mathrm{FFN}(X)$:

$Y_f = \mathrm{ReLU}(X W_1 + b_1) W_2 + b_2$   (5)

where $W_1 \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$, $b_1 \in \mathbb{R}^{d_{\mathrm{ff}}}$, $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$ and $b_2 \in \mathbb{R}^{d}$.

Figure 2: The cosine similarity of inputs for every two adjacent sub-layers on the WMT14 En-De translation task (a dark cell means the inputs are dissimilar). Left: self-attention and cross-attention; right: cross-attention and FFN.

All sub-layers are coupled with the residual connection (He et al. 2016a), i.e., $Y = f(X) + X$ where $f$ could be any sub-layer. Their inputs are also preprocessed by layer normalization first (Ba, Kiros, and Hinton 2016). Fig. 1(a) shows the architecture of the Transformer decoder. For more details, we refer the reader to Vaswani et al. (2017).
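For concreteness, here is a minimal single-head PyTorch sketch of one decoder layer following Eqs. 1-5. It is an illustration only: the class and variable names are ours, and target-side masking, multi-head splitting and incremental decoding are omitted; it is not the implementation used in the experiments.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDecoderLayer(nn.Module):
    """Single-head sketch of one Transformer decoder layer (Eqs. 1-5).

    Pre-norm residual connections as described in the text; masking,
    multi-head splitting and incremental decoding are left out.
    """

    def __init__(self, d: int, d_ff: int):
        super().__init__()
        # Self-attention projections (Eqs. 1-2).
        self.wq_x, self.wk_x, self.wv_x = (nn.Linear(d, d, bias=False) for _ in range(3))
        # Cross-attention projections (Eqs. 3-4).
        self.wq_h, self.wk_h, self.wv_h = (nn.Linear(d, d, bias=False) for _ in range(3))
        # FFN parameters (Eq. 5).
        self.w1, self.w2 = nn.Linear(d, d_ff), nn.Linear(d_ff, d)
        self.ln_self, self.ln_cross, self.ln_ffn = (nn.LayerNorm(d) for _ in range(3))
        self.scale = math.sqrt(d)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: (t, d) target-side states, h: (s, d) encoder output.
        # Self-attention: scores = (X Wq^x)(X Wk^x)^T / sqrt(d), output = A_x (X Wv^x).
        xn = self.ln_self(x)
        a_x = F.softmax(self.wq_x(xn) @ self.wk_x(xn).t() / self.scale, dim=-1)
        x = x + a_x @ self.wv_x(xn)
        # Cross-attention: keys and values come from the encoder output H.
        xn = self.ln_cross(x)
        a_h = F.softmax(self.wq_h(xn) @ self.wk_h(h).t() / self.scale, dim=-1)
        x = x + a_h @ self.wv_h(h)
        # FFN: Y_f = ReLU(X W1 + b1) W2 + b2, with a residual connection.
        xn = self.ln_ffn(x)
        return x + self.w2(F.relu(self.w1(xn)))
```

Stacking N such layers, plus embeddings and an output projection, gives the decoder sketched in Fig. 1(a). Note that the three sub-layers must run sequentially, which is exactly the dependency CAN removes.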
Compressed Attention Network

Compressing Self-Attention and Cross-Attention
As suggested by Huang et al. (2016), the output of one layer in a residual network can be decomposed into the sum of all outputs from previous layers. For the adjacent self-attention and cross-attention, we can write their final output as $Y = X + \mathrm{Self}(X) + \mathrm{Cross}(X', H)$, where $X$ is the input of the self-attention and $X' = X + \mathrm{Self}(X)$ is the input of the cross-attention. If $X$ and $X'$ were identical, we would be able to accelerate the computation of $Y$ by parallelizing these two attentions, as $X'$ would not need to wait for $\mathrm{Self}(X)$ to finish.

Previous work (He et al. 2016b) has shown that the inputs of adjacent layers are similar. This implies that $X$ and $X'$ are close and the parallelization is possible. We empirically verify this in the left part of Fig. 2 by examining the cosine similarity between the inputs of every self-attention and cross-attention pair. It shows that $X$ and $X'$ are indeed close to each other (a high similarity for the diagonal entries). Therefore we can assume $X$ and $X'$ are identical (we omit the layer normalization for simplicity):

$Y = X + \mathrm{Self}(X) + \mathrm{Cross}(X, H)$   (6)

Figure 3: Compressed-Attention.

By observing that Eq. 2 and Eq. 4 are essentially matrix multiplications, we can rewrite $\mathrm{Self}(X) + \mathrm{Cross}(X, H)$ as a single matrix multiplication:

$A = \left[ A_x^\top, A_h^\top \right]^\top$   (7)
$\mathrm{Self}(X) + \mathrm{Cross}(X, H) = A \left[ X W_v^x, H W_v^h \right]$   (8)

where $[\cdot]$ is the concatenation operation along the first dimension. Xiao et al. (2019) show that some attention distributions $A_x$ and $A_h$ are duplicates. This means that there exists a certain redundancy in $\{W_q^x, W_k^x\}$ and $\{W_q^h, W_k^h\}$. Thus we can safely share $W_q^x$ and $W_q^h$ (denoted $W_q$) to parallelize the computation of the attention distribution $A$:

$\bar{A} = \left( X W_q \left[ X W_k^x, H W_k^h \right]^\top \right) / \sqrt{d}$   (9)
$A = \left[ \mathrm{SoftMax}(\bar{A}_{\cdot, 1 \ldots t}),\ \mathrm{SoftMax}(\bar{A}_{\cdot, t+1 \ldots t+s}) \right]$   (10)

However, this $A$ consists of two SoftMax distributions and is used in Eq. 8 without normalization. The output variance is then doubled, which leads to poor optimization (Glorot and Bengio 2010). It is advised to divide $A$ by $\sqrt{2}$ to preserve the variance. This resembles a single distribution, so we use one SoftMax instead and find that it works well:

$A = \mathrm{SoftMax}\!\left( \frac{X W_q \left[ X W_k^x, H W_k^h \right]^\top}{\sqrt{d}} \right)$   (11)

Now we can compute $Y$ in Eq. 6 efficiently by using Eq. 11 as well as Eq. 8 to compute $\mathrm{Self}(X) + \mathrm{Cross}(X, H)$.
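A minimal sketch of Eqs. 8-11 in the same illustrative PyTorch style as above (our own naming, masking again omitted): the query projection is shared, the keys and values of the target and source sides are concatenated, and a single softmax replaces the two.

```python
import torch
import torch.nn.functional as F


def compressed_attention_dist(x, h, wq, wk_x, wk_h, scale):
    """Eq. 11: one attention distribution over the t target and s source
    positions, with a single shared query projection wq."""
    keys = torch.cat([wk_x(x), wk_h(h)], dim=0)           # (t + s, d)
    return F.softmax(wq(x) @ keys.t() / scale, dim=-1)    # (t, t + s)


def self_plus_cross(a, x, h, wv_x, wv_h):
    """Eq. 8: Self(X) + Cross(X, H) collapsed into one weighted sum
    over the concatenated value projections."""
    values = torch.cat([wv_x(x), wv_h(h)], dim=0)          # (t + s, d)
    return a @ values                                       # (t, d)
```

Here wq, wk_x, wk_h, wv_x and wv_h would be nn.Linear(d, d, bias=False) modules and scale = sqrt(d), matching the decoder-layer sketch earlier.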
Compressing Attention and FFN

It is natural to consider merging the attention and the FFN with the same approach for further speed-up. As suggested by the right part of Fig. 2, the similarities between the inputs of
the adjacent cross-attention and FFN are low (dark diagonal entries). This implies that it is not ideal to make the identical-input assumption to parallelize the cross-attention and the FFN. Here we provide another solution. Given that the attention is merely a weighted sum and the FFN performs a linear projection first, we can merge them by exploiting this linearity. This not only parallelizes the computation of the attention and the FFN but also removes redundant matrix multiplications.

Figure 4: Performance (BLEU) and translation speed (tokens/sec) vs. the numbers of encoder and decoder layers on the WMT14 En-De translation task.

We substitute $X$ in Eq. 5 by $Y$ in Eq. 6:

$Y_f = \mathrm{ReLU}\!\left( X W_1 + A \left[ X W_v^x, H W_v^h \right] W_1 + b_1 \right) W_2 + b_2$   (12)

We can combine $W_1$ with $W_v^x$ as well as $W_v^h$ into $\widetilde{W}_v^x, \widetilde{W}_v^h \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$, as these matrices are learnable and multiplied together:

$Y_f = \mathrm{ReLU}\!\left( X W_1 + A \left[ X \widetilde{W}_v^x, H \widetilde{W}_v^h \right] + b_1 \right) W_2 + b_2$   (13)

Furthermore, $X W_1$ can be computed in parallel with other transformations such as $X W_q$. This eventually gives us a more efficient decoder layer architecture, named Compressed-Attention. The whole computation process is shown in Fig. 3: it first computes the attention distribution $A$ by Eq. 11, then performs the attention operation via Eq. 13, and produces $Y_f$ as the final result. The proposed Compressed Attention Network (CAN) stacks compressed-attentions to form its decoder. Fig. 1 shows the difference between Transformer and CAN.
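Putting Eqs. 11 and 13 together, a CAN decoder layer can be sketched as a single module in the same illustrative PyTorch style (our own naming and layer-norm placement; masking, multiple heads and incremental decoding are again left out):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompressedAttention(nn.Module):
    """Sketch of one CAN decoder layer (Eqs. 11 and 13): a single sub-layer
    doing the work of the self-attention, cross-attention and FFN."""

    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)     # shared query projection (Eq. 9)
        self.wk_x = nn.Linear(d, d, bias=False)   # keys from the target states X
        self.wk_h = nn.Linear(d, d, bias=False)   # keys from the encoder output H
        # Folded value projections (W_v with W_1 absorbed), mapping d -> d_ff.
        self.wv_x = nn.Linear(d, d_ff, bias=False)
        self.wv_h = nn.Linear(d, d_ff, bias=False)
        self.w1 = nn.Linear(d, d_ff)               # the X W_1 + b_1 path of Eq. 13
        self.w2 = nn.Linear(d_ff, d)
        self.ln = nn.LayerNorm(d)
        self.scale = math.sqrt(d)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        xn = self.ln(x)
        # Eq. 11: one attention distribution over the t target and s source positions.
        keys = torch.cat([self.wk_x(xn), self.wk_h(h)], dim=0)      # (t + s, d)
        a = F.softmax(self.wq(xn) @ keys.t() / self.scale, dim=-1)  # (t, t + s)
        # Eq. 13: attention over the folded values plus the X W_1 path, then ReLU and W_2.
        values = torch.cat([self.wv_x(xn), self.wv_h(h)], dim=0)    # (t + s, d_ff)
        y = F.relu(self.w1(xn) + a @ values)
        return x + self.w2(y)                                        # residual connection
```

Note that the four projections of X (the query, key, value and W_1 paths) are independent of each other, so they can be launched in parallel or fused into one matrix multiplication, which is where the higher degree of parallelism comes from.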
Balancing Encoder and Decoder Depths

Based on the findings of Kasai et al. (2020), we learn that a shallow decoder offers a great speed gain, while a deep encoder makes up for the loss of a shallow decoder without adding a heavy computation overhead. Since their work is based on knowledge distillation (Hinton, Vinyals, and Dean 2015), here we re-examine this idea under the standard training setting (without knowledge distillation).

Fig. 4 shows the performance and speed when we gradually reduce the decoder depth while adding more encoder layers. We see that although the overall number of parameters remains the same, the baseline can be 2× faster without losing performance.
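As a rough illustration of this balanced configuration, the sketch below builds a deep-encoder/shallow-decoder model with standard PyTorch modules. The 12/2 split and the other hyper-parameters mirror the setup described in the Experiments section, but this is not the fairseq implementation actually used.

```python
import torch.nn as nn


def balanced_transformer(d_model: int = 512, n_heads: int = 8,
                         enc_layers: int = 12, dec_layers: int = 2) -> nn.Transformer:
    """Deep-encoder / shallow-decoder Transformer with the same total number of
    layers as the standard 6+6 baseline. Illustrative only: the experiments in
    this paper use the fairseq implementation, not torch.nn.Transformer."""
    return nn.Transformer(
        d_model=d_model,
        nhead=n_heads,
        num_encoder_layers=enc_layers,
        num_decoder_layers=dec_layers,
        dim_feedforward=4 * d_model,  # FFN hidden size = 4x the embedding size
        dropout=0.1,
    )
```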
Experiments

Experimental Setup
Datasets
We evaluate our methods on 14 machine translation tasks (7 datasets, each in both directions): WMT14 En↔{De, Fr} and WMT17 En↔{De, Fi, Lv, Ru, Cs}.

The WMT14 En↔{De, Fr} datasets are tokenized by a script from Moses¹. We apply BPE (Sennrich, Haddow, and Birch 2016) with 32K merge operations to segment words into subword units. Sentences with more than 250 subword units are removed. The first two rows of Table 1 give the detailed statistics of these two datasets. For En-De, we share the source and target vocabularies. We choose newstest-2013 as the validation set and newstest-2014 as the test set. For En-Fr, we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014.

All WMT17 datasets are the official preprocessed versions from the WMT17 website². BPE with 32K merge operations is similarly applied to these datasets. We use the concatenation of all available preprocessed validation sets of the WMT17 datasets as our validation set:
• En↔De. We use the concatenation of newstest2014, newstest2015 and newstest2016 as the validation set.
• En↔Fi. We use the concatenation of newstest2015, newsdev2015, newstest2016 and newstestB2016 as the validation set.
• En↔Lv. We use newsdev2016 as the validation set.
• En↔Ru. We use the concatenation of newstest2014, newstest2015 and newstest2016 as the validation set.
• En↔Cs. We use the concatenation of newstest2014, newstest2015 and newstest2016 as the validation set.
We use newstest2017 as the test set for all WMT17 datasets. Detailed statistics of these datasets are shown in Table 1. For all 14 translation tasks, we report case-sensitive tokenized BLEU scores³.

Table 1: Data statistics.

                 Train              Valid            Test
                 sent.     word     sent.    word    sent.    word
WMT14   En↔De    4.5M      220M     3000     110K    3003     114K
        En↔Fr    35M       2.2B     26K      1.7M    3003     155K
WMT17   En↔De    5.9M      276M     8171     356K    3004     128K
        En↔Fi    2.6M      108M     8870     330K    3002     110K
        En↔Lv    4.5M      115M     2003     90K     2001     88K
        En↔Ru    25M       1.2B     8819     391K    3001     132K
        En↔Cs    52M       1.2B     8658     354K    3005     118K

¹ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
² http://data.statmt.org/wmt17/translation-task/preprocessed/
³ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Table 2: Results on the WMT14 En↔{De, Fr} translation tasks (Speed in tokens/sec; ∆BLEU and ∆Speed are relative to the Balanced baseline).

        System     Test     ∆BLEU    Valid    Speed     ∆Speed
En-De   Baseline   27.32    -        26.56    104.27    -
        Balanced   27.46    0.00     26.81    219.53    0.00%
        SAN
        AAN
        CAN
De-En   Baseline   30.50    -        30.34    103.97    -
        Balanced   30.76    0.00     30.37    206.00    0.00%
        SAN
        AAN
        CAN
En-Fr   Baseline   40.82    -        46.80    104.65    -
        Balanced   40.55    0.00     46.87    206.54    0.00%
        SAN
        AAN
        CAN
Fr-En   Baseline   36.33    -        47.03    105.85    -
        Balanced   36.86    0.00     46.89    201.13    0.00%
        SAN
        AAN
        CAN

Model Setup
Our baseline system is based on the open-source implementation of the Transformer model presented in Ott et al. (2019). For all machine translation tasks, the standard Transformer baseline (Baseline) consists of a 6-layer encoder and a 6-layer decoder. The embedding size is set to 512. The number of attention heads is 8. The FFN hidden size equals 4× the embedding size. Dropout with a value of 0.1 is used for regularization. We adopt the inverse square root learning rate schedule with 8,000 warmup steps. We stop training once the model stops improving on the validation set. All systems are trained on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al. 2018) and a batch size of 4,096 tokens per GPU. We average the model parameters of the last 5 epochs for better performance. At test time, the model is decoded with a beam of width 4 and half-precision. For an accurate speed comparison, we decode with a batch size of 1 to avoid paddings. The stronger balanced baseline (Balanced) shares its setting with the standard baseline, except that its encoder depth is 12 and its decoder depth is 2.

We compare CAN and other model acceleration approaches with our baselines. We choose Sharing Attention Network (SAN) (Xiao et al. 2019) and Average Attention Network (AAN) (Zhang, Xiong, and Su 2018) for comparison, as they have been proven to be effective in various machine translation tasks (Birch et al. 2018). All hyper-parameters of CAN, SAN and AAN are identical to those of the balanced baseline system. Results are the average of 3 runs.
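The inverse square root schedule mentioned above warms the learning rate up linearly and then decays it with the inverse square root of the step count. A standalone sketch is given below; the peak value 7e-4 is our assumed placeholder, since the exact learning rate is not restated here.

```python
def inverse_sqrt_lr(step: int, warmup: int = 8000, peak_lr: float = 7e-4) -> float:
    """Inverse square root schedule: linear warmup to `peak_lr` over `warmup`
    steps, then decay proportional to 1/sqrt(step). The peak value is an
    assumption for illustration only."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)
```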
Results

Table 2 shows the results of various systems on WMT14 En↔{De, Fr}. Our balanced baseline has nearly the same
performance as the standard baseline, but its speed is 2× faster on average. A similar phenomenon is also observed in the WMT17 experiments in Table 3. This observation indicates that existing systems do not balance the encoder and decoder depths well. We also report the performance of AAN, SAN and the proposed CAN. All three approaches have similar BLEU scores and slightly underperform the balanced baseline. CAN is more stable than the others, as its maximum ∆BLEU is -0.39, while SAN's is -0.67 and AAN's is -0.61. As for speed, SAN and AAN offer a similar level of acceleration, while CAN provides a higher one. We find that the length ratio between the translation and the source sentence in De-En is higher than in the other tasks, e.g., 1.0 for De-En and 0.981 for En-Fr. In this case the decoder tends to predict more words and consumes more time in De-En, and thus acceleration approaches that work on the decoder are more effective.

More experimental results to justify the effectiveness of CAN are presented in Table 3. We evaluate the balanced baseline as well as CAN on the five WMT17 language pairs. The results again show that the balanced baseline is indeed a strong baseline, with BLEU scores close to the standard baseline, and is consistently 2× faster. CAN shows a similar trend: it slightly underperforms the balanced baseline but is consistently faster.

Table 3: Results on the WMT17 En↔{De, Fi, Lv, Ru, Cs} translation tasks (Speed in tokens/sec; ∆BLEU and ∆Speed are relative to the Balanced baseline).

        System     Test     ∆BLEU    Valid    Speed     ∆Speed
En-De   Baseline   28.40    -        31.30    106.58    -
        Balanced   28.65    0.00     31.39    218.35    0.00%
        CAN
De-En   Baseline   34.48    -        35.36    103.04    -
        Balanced   34.38    0.00     35.16    220.05    0.00%
        CAN
En-Fi   Baseline   21.28    -        18.31    103.84    -
        Balanced   21.38    0.00     18.67    207.73    0.00%
        CAN
Fi-En   Baseline   25.54    -        21.32    106.59    -
        Balanced   25.63    0.00     21.29    209.88    0.00%
        CAN
En-Lv   Baseline   16.14    -        21.33    107.20    -
        Balanced   15.98    0.00     21.21    219.02    0.00%
        CAN
Lv-En   Baseline   18.74    -        24.79    106.25    -
        Balanced   18.69    0.00     24.54    216.06    0.00%
        CAN
En-Ru   Baseline   30.44    -        30.67    106.46    -
        Balanced   30.28    0.00     30.59    214.52    0.00%
        CAN
Ru-En   Baseline   34.44    -        32.39    107.24    -
        Balanced   34.24    0.00     32.22    213.78    0.00%
        CAN
En-Cs   Baseline   24.00    -        28.09    106.18    -
        Balanced   23.69    0.00     28.03    212.65    0.00%
        CAN
Cs-En   Baseline   30.00    -        33.01    104.00    -
        Balanced   30.06    0.00     32.86    202.96    0.00%
        CAN
Table 4: BLEU scores on the WMT14 En-De test set before and after knowledge distillation (KD); ∆BLEU is relative to the Balanced baseline.

System      Before KD            After KD
            Test      ∆BLEU      Test      ∆BLEU
Balanced    27.46     0.00       27.82     0.00
SAN
AAN
CAN
Table 5: Ablation study on the WMT14 En-De translation task (Compress Attention: compress the self-attention and cross-attention only; Compress FFN: compress the cross-attention and FFN only; Compress All: compress the self-attention, cross-attention and FFN).

System                  BLEU     ∆BLEU    Speed     ∆Speed
Balanced                27.46    0.00     219.53    0.00%
+ Compress Attention    27.09    -0.37    263.64    +20.09%
+ Compress FFN          27.69    +0.23    233.17    +6.21%
+ Compress All          27.32    -0.14    290.08    +32.14%

Analysis
Knowledge Distillation
Although SAN, AAN and CAN offer a considerable speed gain over the balanced baseline, they all suffer from performance degradation, as shown in Table 2 and Table 3. The popular solution to this is knowledge distillation (KD). Here we choose sequence-level knowledge distillation (Kim and Rush 2016) for better performance in machine translation tasks. The balanced baseline is used to generate the pseudo data for KD.

Table 4 shows that KD closes the performance gap between the fast attention models (SAN, AAN and CAN) and the balanced baseline. This fact suggests that all three systems have enough capacity for good performance, but that training from scratch does not reach a good convergence state. It suggests that these systems might require more careful hyper-parameter tuning or a better optimization method.
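Sequence-level KD here simply means re-labelling the training sources with the teacher's beam-search outputs and training the student on those pseudo-targets. A schematic sketch follows; `translate` and `fit` are hypothetical stand-ins for the toolkit's decoding and training routines, not real fairseq APIs.

```python
def sequence_level_kd(teacher, student, train_sources, beam_size=4):
    """Sequence-level knowledge distillation (Kim and Rush 2016), schematically:
    the teacher (here the balanced baseline) beam-decodes every training source,
    and the student (SAN / AAN / CAN) is trained on the resulting pseudo-parallel
    corpus instead of the original references. `translate` and `fit` are
    hypothetical placeholders for the actual decoding and training routines."""
    pseudo_targets = [teacher.translate(src, beam_size=beam_size) for src in train_sources]
    student.fit(list(zip(train_sources, pseudo_targets)))
    return student
```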
Ablation Study

To investigate which part of CAN contributes the most to the acceleration as well as to the performance loss, we compress only the self-attention and cross-attention, or only the cross-attention and FFN. Table 5 shows the results of this ablation study. We can see that compressing the two attentions provides a 20.09% speed-up, while compressing the attention and FFN provides only 6.21%. This is because the FFN is already highly parallelized, so accelerating it alone does not bring much gain.
Figure 5: Translation speed (tokens/sec) vs. beam size and translation length on the WMT14 En-De translation task.

On the other hand, compressing the attentions brings the most performance loss, which shows that the identical-input assumption is strong. Fig. 2 shows that the inputs of adjacent layers are not very similar in the lower layers, so using CAN in low layers might bring a great loss. We also find that compressing the attention and FFN gives an even better result. This might be because we remove redundant parameters from the model.

Sensitivity Analysis
We study how the speed is affected by other factors in Fig. 5, e.g., the beam size and the translation length. The left of Fig. 5 shows that CAN is consistently faster than the balanced baseline with different beam sizes. As the speed gain provided by CAN is proportional to the speed of the baseline, it becomes less obvious when the baseline is slow, i.e., when translating with a large beam. An opposite trend appears in the middle of Fig. 5 for the translation length. This is because overheads such as data preparation dominate the translation time of short sentences, which results in a slow speed even when the translation time is short. As both the baseline and CAN get faster when generating longer translations, one might suspect that the superior acceleration of CAN over other approaches comes from CAN generating longer translations. Further analysis is conducted and shown in the right of Fig. 5. We see that CAN, SAN and AAN generate translations with similar lengths on the two WMT14 translation tasks. This observation justifies that the superior acceleration brought by CAN does come from its design rather than from translation lengths.

Error Analysis
As shown in Table 2 and Table 3, the acceleration of CAN comes at a cost in performance. Here we conduct experiments to better understand in which aspects CAN sacrifices quality for speed-up. We first evaluate the sentence-level BLEU score of each translation, then cluster the translations according to their averaged word frequencies or lengths.

Fig. 6 shows the results. The left of Fig. 6 indicates that CAN does well on sentences with low word frequencies, but not on those with high frequencies. The right of Fig. 6 shows that CAN does not translate short sentences well but is quite good at translating long sentences.
Figure 6: BLEU score [%] vs. word frequency and translation length on the WMT14 En-De translation task.

These facts are counter-intuitive, as one might expect a weaker model to do well on easy samples (high-frequency words and short sentences) but not on hard ones (low-frequency words and long sentences). This might be because the identical-input assumptions we used to derive CAN matter most for easy samples. We leave this for future exploration.

Parallelism Study
A simple approach to obtaining higher parallelism without modifying the architecture is to increase the batch size at inference. Fig. 7 compares the inference speed of the balanced baseline and CAN under varying batch sizes. We can see that both systems run faster with a larger batch size and that CAN is consistently faster than the balanced baseline. But the acceleration of CAN over the baseline (∆Speed) diminishes as the batch size gets larger. In this case we observe that CAN reaches its highest parallelism (nearly 100% GPU utilization) at a smaller batch size than the baseline. This means that enlarging the batch size no longer provides acceleration for CAN, while the baseline can still be further sped up. We expect CAN to be even faster if more tensor cores are available in the future.
Figure 7: Speed (tokens/sec) and ∆Speed [%] vs. batch size on the WMT14 En-De translation task.
Training Study
We plot the training and validation loss curves of the standard baseline, the balanced baseline and CAN in Fig. 8 to study their convergence. We can see that all systems converge stably. The balanced baseline has a higher loss than the standard baseline on both the training and validation sets, but their BLEU scores are close, as shown in Table 2. This is due to the shallow decoder of the balanced baseline: since the loss is determined by the decoder, a shallow decoder with less capacity has a higher loss. Wang et al. (2019) indicate that the encoder depth has a greater impact on BLEU scores than the decoder depth, so the deep encoder makes up for the performance loss of the shallow decoder. We also see that CAN has a higher loss than the balanced baseline because we compress the decoder. Since we do not enhance the encoder, the BLEU score drops accordingly.

Related Work
Model Acceleration
Large Transformer models have demonstrated their effectiveness on various natural language processing tasks, including machine translation (Vaswani et al. 2017), language modelling (Baevski and Auli 2019), etc. The by-product of this huge network is a slow inference speed. Previous work has focused on improving model efficiency from different perspectives. For example, knowledge distillation approaches treat the large network's output as the ground truth to train a small network (Kim and Rush 2016). Low-bit quantization approaches represent and run the model with 8-bit integers (Lin et al. 2020). Our work follows another line of research, which pursues a more efficient architecture.

Chen et al. (2018) show that the attention of Transformer benefits the encoder the most and that the decoder can be safely replaced by a recurrent network. This reduces the complexity of the decoder to linear time but incurs a high cost in training. Zhang, Xiong, and Su (2018) show that the self-attention is not necessary and a simple averaging is enough. Xiao et al. (2019) indicate that most attention distributions are redundant and thus share these distributions among layers. Kitaev, Kaiser, and Levskaya (2020) use locality-sensitive hashing to select a constant number of words and perform attention on them.

Figure 8: Training and validation loss curves of the standard baseline, the balanced baseline and CAN.
Deep Transformer
Recent studies have shown that deepening the Transformer encoder is more beneficial than widening the encoder or deepening the decoder (Bapna et al. 2018). Wang et al. (2019) show that placing the layer normalization before (Pre-Norm) rather than after (Post-Norm) the sub-layer allows us to train deep Transformers. Xiong et al. (2020) prove that the success of the Pre-Norm network relies on its well-behaved gradients. Zhang, Titov, and Sennrich (2019) suggest that a proper initialization is enough to train a deep Post-Norm network. Kasai et al. (2020) exploit a similar observation, but to build a faster rather than a better model. They show that, using knowledge distillation, a deep-encoder shallow-decoder model can run much faster without losing any performance. Based on their work, we use this model as our baseline system and evaluate it on extensive machine translation tasks without knowledge distillation.
Conclusion
In this work, we propose CAN, whose decoder layer consists of only one attention sub-layer. CAN offers consistent acceleration by providing a high degree of parallelism. Experiments on 14 WMT machine translation tasks show that CAN is up to 2.82× faster than the standard baseline. We also use a stronger baseline for comparison: it employs a deep encoder and a shallow decoder, and is 2× faster than the standard Transformer baseline without loss in performance.
This work was supported in part by the National Science Foundation of China (Nos. 61876035 and 61732005) and the National Key R&D Program of China (No. 2019QY1801). The authors would like to thank the anonymous reviewers for their comments.
References
Aharoni, R.; Johnson, M.; and Firat, O. 2019. Massively Multilingual Neural Machine Translation. In Proceedings of NAACL-HLT 2019, 3874–3884. Association for Computational Linguistics. doi:10.18653/v1/n19-1388.

Ba, L. J.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Normalization. CoRR abs/1607.06450. URL http://arxiv.org/abs/1607.06450.

Baevski, A.; and Auli, M. 2019. Adaptive Input Representations for Neural Language Modeling. In Proceedings of ICLR 2019. OpenReview.net. URL https://openreview.net/forum?id=ByxZX20qFQ.

Bapna, A.; Chen, M. X.; Firat, O.; Cao, Y.; and Wu, Y. 2018. Training Deeper Neural Machine Translation Models with Transparent Attention. In Proceedings of EMNLP 2018, 3028–3033. Association for Computational Linguistics. doi:10.18653/v1/d18-1338.

Birch, A.; Finch, A. M.; Luong, M.; Neubig, G.; and Oda, Y. 2018. Findings of the Second Workshop on Neural Machine Translation and Generation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, NMT@ACL 2018, 1–10. Association for Computational Linguistics. doi:10.18653/v1/w18-2701.

Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G. F.; Jones, L.; Schuster, M.; Shazeer, N.; Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Chen, Z.; Wu, Y.; and Hughes, M. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In Proceedings of ACL 2018, Volume 1: Long Papers. Association for Computational Linguistics.

Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and Kaiser, L. 2019. Universal Transformers. In Proceedings of ICLR 2019. OpenReview.net. URL https://openreview.net/forum?id=HyzdRiR9Y7.

Fan, A.; Grave, E.; and Joulin, A. 2020. Reducing Transformer Depth on Demand with Structured Dropout. In Proceedings of ICLR 2020. OpenReview.net. URL https://openreview.net/forum?id=SylO2yStDr.

Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9 of JMLR Proceedings, 249–256. JMLR.org. URL http://proceedings.mlr.press/v9/glorot10a.html.

Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O. K.; and Socher, R. 2018. Non-Autoregressive Neural Machine Translation. In Proceedings of ICLR 2018. OpenReview.net. URL https://openreview.net/forum?id=B1l8BtlCb.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep Residual Learning for Image Recognition. In Proceedings of CVPR 2016, 770–778. IEEE Computer Society. doi:10.1109/CVPR.2016.90.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity Mappings in Deep Residual Networks. In Computer Vision - ECCV 2016, Part IV, volume 9908 of Lecture Notes in Computer Science, 630–645. Springer. doi:10.1007/978-3-319-46493-0_38.

He, T.; Tan, X.; Xia, Y.; He, D.; Qin, T.; Chen, Z.; and Liu, T. 2018. Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 7955–7965. URL http://papers.nips.cc/paper/8019-layer-wise-coordination-between-encoder-and-decoder-for-neural-machine-translation.

Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531. URL http://arxiv.org/abs/1503.02531.

Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; and Weinberger, K. Q. 2016. Deep Networks with Stochastic Depth. In Computer Vision - ECCV 2016, Part IV, volume 9908 of Lecture Notes in Computer Science, 646–661. Springer. doi:10.1007/978-3-319-46493-0_39.

Kasai, J.; Pappas, N.; Peng, H.; Cross, J.; and Smith, N. A. 2020. Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation. CoRR abs/2006.10369. URL https://arxiv.org/abs/2006.10369.

Kim, Y.; and Rush, A. M. 2016. Sequence-Level Knowledge Distillation. In Proceedings of EMNLP 2016, 1317–1327. The Association for Computational Linguistics. doi:10.18653/v1/d16-1139.

Kitaev, N.; Kaiser, L.; and Levskaya, A. 2020. Reformer: The Efficient Transformer. In Proceedings of ICLR 2020. OpenReview.net. URL https://openreview.net/forum?id=rkgNKkHtvB.

Lin, Y.; Li, Y.; Liu, T.; Xiao, T.; Liu, T.; and Zhu, J. 2020. Towards Fully 8-bit Integer Inference for the Transformer Model. In Proceedings of IJCAI 2020, 3759–3765. ijcai.org. doi:10.24963/ijcai.2020/520.

Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G. F.; Elsen, E.; García, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; and Wu, H. 2018. Mixed Precision Training. In Proceedings of ICLR 2018. OpenReview.net. URL https://openreview.net/forum?id=r1gs9JgRZ.

Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 48–53. Association for Computational Linguistics. doi:10.18653/v1/n19-4009.

Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL 2016, Volume 1: Long Papers. The Association for Computer Linguistics. doi:10.18653/v1/p16-1162.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 5998–6008. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.

Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D. F.; and Chao, L. S. 2019. Learning Deep Transformer Models for Machine Translation. In Proceedings of ACL 2019, Volume 1: Long Papers, 1810–1822. Association for Computational Linguistics. doi:10.18653/v1/p19-1176.

Xiao, T.; Li, Y.; Zhu, J.; Yu, Z.; and Liu, T. 2019. Sharing Attention Weights for Fast Transformer. In Proceedings of IJCAI 2019, 5292–5298. ijcai.org. doi:10.24963/ijcai.2019/735.

Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; and Liu, T. 2020. On Layer Normalization in the Transformer Architecture. CoRR abs/2002.04745. URL https://arxiv.org/abs/2002.04745.

Zhang, B.; Titov, I.; and Sennrich, R. 2019. Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention. In Proceedings of EMNLP-IJCNLP 2019, 898–909. Association for Computational Linguistics. doi:10.18653/v1/D19-1083.

Zhang, B.; Xiong, D.; and Su, J. 2018. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of ACL 2018, Volume 1: Long Papers. Association for Computational Linguistics.