Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning
Xuebo Liu∗, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao & Zhaopeng Tu
NLP2CT Lab, Department of Computer and Information Science, University of Macau
Tencent AI Lab
The University of Sydney
[email protected], {vinnylywang,zptu}@tencent.com, {derekfw,lidiasc}@um.edu.com, [email protected]

ABSTRACT
Encoder layer fusion (EncoderFusion) is a technique to fuse all the encoder layers (instead of the uppermost layer) for sequence-to-sequence (Seq2Seq) models, which has proven effective on various NLP tasks. However, it is still not entirely clear why and when EncoderFusion should work. In this paper, our main contribution is to take a step further in understanding EncoderFusion. Many previous studies believe that the success of EncoderFusion comes from exploiting surface and syntactic information embedded in lower encoder layers. Unlike them, we find that the encoder embedding layer is more important than the other intermediate encoder layers. In addition, the uppermost decoder layer consistently pays more attention to the encoder embedding layer across NLP tasks. Based on this observation, we propose a simple fusion method, SurfaceFusion, which fuses only the encoder embedding layer for the softmax layer. Experimental results show that SurfaceFusion outperforms EncoderFusion on several NLP benchmarks, including machine translation, text summarization, and grammatical error correction. It obtains state-of-the-art performance on the WMT16 Romanian-English and WMT14 English-French translation tasks. Extensive analyses reveal that SurfaceFusion learns more expressive bilingual word embeddings by building a closer relationship between relevant source and target embeddings. The source code will be released.
1 INTRODUCTION
Sequence-to-sequence (Seq2Seq) learning (Sutskever et al., 2014) has advanced the state of the art in various natural language processing (NLP) tasks, such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017; Wu et al., 2019), text summarization (Wang et al., 2019b; Zhang et al., 2020), and grammatical error correction (Kiyono et al., 2019; Kaneko et al., 2020). Seq2Seq models are generally implemented with an encoder-decoder framework, in which a multi-layer encoder summarizes a source sequence into a sequence of representations and a multi-layer decoder produces the target sequence conditioned on the encoded representations.

Recent studies reveal that fusing the intermediate encoder layers (EncoderFusion) is beneficial for Seq2Seq models, such as layer attention (Bapna et al., 2018), layer aggregation (Dou et al., 2018; Wang et al., 2019c), and layer-wise coordination (He et al., 2018). Despite its effectiveness, not much is known about how fusing encoder layer representations works. The intuitive explanation is that fusing encoder layers exploits surface and syntactic information embedded in the lower encoder layers (Belinkov et al., 2017; Peters et al., 2018). However, other studies show that attending to lower encoder layers (excluding the encoder embedding layer) does not improve model performance (Domhan, 2018), which conflicts with the existing conclusions. It is still unclear why and when fusing encoder layers should work in Seq2Seq models.

This paper tries to shed light upon the behavior of Seq2Seq models augmented with the EncoderFusion method. To this end, we propose a novel fine-grained layer attention to evaluate the contribution of individual encoder layers. We conduct experiments on several representative Seq2Seq NLP tasks, including machine translation, text summarization, and grammatical error correction. Through a series of analyses, we find that the uppermost decoder layer pays more attention to the encoder embedding layer. Masking the encoder embedding layer significantly drops model performance by generating hallucinatory (i.e. fluent but unfaithful to the source) predictions. The encoded representation of the standard Seq2Seq models (i.e. without fusing encoder layers) may not have enough capacity to model both semantic and surface features (especially those at the encoder embedding layer). We call the problem described above the source representation bottleneck.

Based on this observation, we simplify the EncoderFusion approaches by only connecting the encoder embedding layer to the softmax layer (SurfaceFusion). The SurfaceFusion approach shortens the path distance between source and target embeddings, which can help to learn better bilingual embeddings with direct interactions. Experimental results on several Seq2Seq NLP tasks show that our method consistently outperforms both the vanilla Seq2Seq model and the layer attention model. Extensive analyses reveal that our approach produces more aligned bilingual word embeddings by shortening the path distance between them, which confirms our claim.

Our main contributions are as follows:
• We introduce a fine-grained layer attention method to qualitatively and quantitatively evaluate the contribution of individual encoder layers.
• We demonstrate that the encoder embedding layer is essential for fusing encoder layers, which consolidates conflicting findings reported by previous studies.
• We propose a simple yet effective SurfaceFusion approach to directly exploit the encoder embedding layer for the decoder, which produces more expressive bilingual embeddings.

∗ Work was done when Xuebo Liu and Liang Ding were interning at Tencent AI Lab.
2 PRELIMINARIES
2.1 SEQUENCE-TO-SEQUENCE LEARNING
Seq2Seq learning aims to maximize the log-likelihood of a target sequence y = {y_1, ..., y_J} conditioned on a source sequence x = {x_1, ..., x_I}, which is formulated as ŷ = arg max log P(y|x). Typically, Seq2Seq learning can be implemented as various architectures (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Wu et al., 2019), among which the Transformer (Vaswani et al., 2017) has advanced the state of the art. Without loss of generality, we use Transformer as the testbed in this paper. Transformer consists of an encoder E equipped with N identical layers to map the source sequence x into distributed representations, based on which a decoder D equipped with M identical layers generates the target sequence y:

X^N = E(X^0), with X^n = FFN(ATT(X^{n-1}, X^{n-1}, X^{n-1})), n = 1, ..., N    (1)

Y^M = D(Y^0, X^N), with Y^m = FFN(ATT(ATT(Y^{m-1}, Y^{m-1}, Y^{m-1}), X^N, X^N)), m = 1, ..., M    (2)

where X^0 denotes the sum of the word embeddings X^emb and position embeddings X^pos of x, Y^0 denotes that of the shifted-right y, FFN(·) denotes a position-wise feed-forward network, and ATT(·) denotes a multi-head dot-product attention network with three arguments: query, key and value. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are used in each sub-layer, which are suppressed in Equations 1 and 2 for clarity. Finally, the output representation Y^M of the decoder is projected into the probability P(y|x), which is optimized during model training.
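For concreteness, a minimal PyTorch-style sketch of the layer recursion in Equations 1 and 2 is given below. It is an illustrative simplification rather than the implementation used in this work: residual connections, layer normalization, causal and padding masks are omitted, and the hyperparameter names are assumptions.

```python
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer: self-attention followed by a feed-forward network (Eq. 1)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        # ATT(X^{n-1}, X^{n-1}, X^{n-1}): query, key and value are all the previous layer output
        h, _ = self.self_attn(x, x, x)
        return self.ffn(h)  # residual connections and layer norm omitted for brevity


class DecoderLayer(nn.Module):
    """One decoder layer: self-attention, cross-attention to X^N, then FFN (Eq. 2)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, y, enc_out):
        h, _ = self.self_attn(y, y, y)               # ATT(Y^{m-1}, Y^{m-1}, Y^{m-1}); causal mask omitted
        h, _ = self.cross_attn(h, enc_out, enc_out)  # vanilla decoder attends only to the uppermost encoder layer X^N
        return self.ffn(h)
```

Note that, as in Equation 2, the vanilla decoder layer only ever sees `enc_out` (X^N); the lower encoder layers, including the embedding layer, are invisible to it.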
2.2 EXPERIMENTAL SETUP

To validate the universality of the source representation bottleneck in Seq2Seq models, we conducted experiments on three representative tasks, which vary in the distance between input and output domains and in the scale of training data:
Machine translation takes a sentence in one language as input, and outputs a semantically equivalent sentence in another language. We conducted experiments on three benchmark datasets: small-scale WMT16 Romanian-English (Ro-En; 0.6M instances), medium-scale WMT14 English-German (En-De; 4.5M instances), and large-scale WMT14 English-French (En-Fr; 36.0M instances). The tokenized BLEU score (Papineni et al., 2002) was used for all the translation tasks.
Text summarization takes a long-text document as input, and outputs a short and adequate summary in the same language. We used the CNN/Daily Mail corpus (0.3M instances). We evaluated with the standard ROUGE metric (Lin, 2004), i.e. Rouge-1, Rouge-2, and Rouge-L.
Grammatical error correction takes a sentence with grammatical errors as input, and outputs a corrected sentence. We used the CONLL14 dataset as the testbed (1.4M instances). The MaxMatch (M²) scores (Dahlmeier & Ng, 2012) were used for evaluation with precision, recall, and F0.5 values.

The machine translation task has distant input/output domains (i.e. in different languages), while the other tasks have similar input/output domains (i.e. in the same language). We used Transformer (Vaswani et al., 2017) as the Seq2Seq model. Details of the datasets and model training are listed in Appendix A.1.

3 BEHAVIOR OF ENCODERFUSION
In this section, we first formulate our research hypothesis of the source representation bottleneck (§3.1) that EncoderFusion is expected to solve. In the following subsections, we propose a fine-grained layer attention model (§3.2) to validate our hypothesis with carefully designed experiments (§3.3).
3.1 SOURCE REPRESENTATION BOTTLENECK
Seq2Seq models learn more abstract features as the layer level increases (i.e. X^0 → X^N and Y^0 → Y^M) (Belinkov et al., 2017). It has been extensively validated that a reasonable use of both the abstract representations (at higher-level layers) and the surface representations (at lower-level layers) is beneficial for various NLP (Lu & Li, 2013; Hu et al., 2014; Dou et al., 2018; Peters et al., 2018) and CV (Long et al., 2014; Pinheiro et al., 2016; Lin et al., 2017; Chen et al., 2018a) tasks. However, the Seq2Seq decoder only takes the abstract representations at the uppermost layer X^N as input (Equation 2), while ignoring the useful surface representations at the other layers X^n (n < N). Although X^N has encoded surface features from low-level representations through layer-by-layer abstraction and residual connections, we hypothesize that its limited representation capacity may not sufficiently model those surface features from lower encoder layers, especially the embedding layer. We call this issue the source representation bottleneck.

3.2 FINE-GRAINED LAYER ATTENTION
For each decoder layer, layer attention (Bapna et al., 2018; Peters et al., 2018) assigns normalized scalar weights to all encoder layers, providing a direct way to evaluate the contribution made by each encoder layer. However, the capacity of a simple scalar weight is limited, leading to an insufficient evaluation of the contributions.

Motivated by fine-grained attention (Choi et al., 2018), in which each element of a context vector receives an individual attention weight, we propose a fine-grained layer attention model to combine the advantages of both techniques. This allows us to more convincingly evaluate the contribution of individual encoder layers to the model performance. Besides, the nature of fine-grained attention enables us to give in-depth analyses of the representation power in §3.3.

Specifically, we replace the layer-agnostic source representation X^N with a layer-aware representation S^m for each decoder layer Y^m, which is calculated as:

S^m = Σ_{n=0}^{N} ŵ_{m,n} ⊙ X^n,   ŵ_{m,n} = [ŵ_{m,n,1}, ..., ŵ_{m,n,D}],   ŵ_{m,n,d} = exp(w_{m,n,d}) / Σ_{n'=0}^{N} exp(w_{m,n',d})

where ⊙ denotes element-wise multiplication, and w_{m,n,d} denotes an element of the learnable attention weight W ∈ R^{M×(N+1)×D}, with D the dimensionality of the source representation. When n = 0, we use the word embeddings X^emb without position embeddings as X^0, which has been empirically proved effective. We applied a regularization technique, DropConnect (Wan et al., 2013), to the attention weight W for stable training, which randomly drops each w_{m,n,d} with a probability p and divides W by 1 − p. We set p to 0.3 for all the experiments.
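The following is a minimal sketch of the fine-grained layer attention described above, written in PyTorch-style Python. It is illustrative only: the DropConnect placement (applied to the raw weights before the softmax) and the shapes are assumptions consistent with the formulation, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedLayerAttention(nn.Module):
    """Per-dimension attention over the stack of encoder layers (embedding layer included).

    For decoder layer m, every one of the D dimensions receives its own weight over the
    N+1 encoder layers, normalized by a softmax along the layer axis; the fused source
    representation S^m is the element-wise weighted sum of X^0, ..., X^N.
    """

    def __init__(self, num_dec_layers, num_enc_layers, d_model, dropconnect=0.3):
        super().__init__()
        # learnable weights W of shape [M, N+1, D]
        self.w = nn.Parameter(torch.zeros(num_dec_layers, num_enc_layers + 1, d_model))
        self.p = dropconnect

    def forward(self, enc_layers, m):
        # enc_layers: list of N+1 tensors of shape [batch, src_len, D]; index 0 is the embedding layer X^0
        x = torch.stack(enc_layers, dim=0)          # [N+1, batch, src_len, D]
        w = self.w[m]                               # [N+1, D]
        if self.training:
            # DropConnect: randomly zero individual weights with probability p and rescale by 1/(1 - p)
            mask = torch.bernoulli(torch.full_like(w, 1.0 - self.p))
            w = w * mask / (1.0 - self.p)
        w = F.softmax(w, dim=0)                     # normalize over the layer axis, separately per dimension
        return (w[:, None, None, :] * x).sum(dim=0)  # S^m: [batch, src_len, D]
```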
Table 2 lists the results. The proposed fine-grained layer attention model consistently outperforms the vanilla Transformer across Seq2Seq tasks, demonstrating the benefit of fusing surface features at lower-level layers.

Table 1: Results of existing encoder layer fusion methods on the WMT16 Ro-En translation task.

Model                           BLEU
Vanilla Transformer             33.80
Layer aggregation               34.05
Layer-wise coordination         34.19
Coarse-grained layer attention  34.32
Fine-grained layer attention    34.45

We evaluated several EncoderFusion methods in Table 1, including layer aggregation (Dou et al., 2018), layer-wise coordination (He et al., 2018), and coarse-grained layer attention (Bapna et al., 2018). Their results are respectively 34.05, 34.19, and 34.32 BLEU, which are all lower than that of fine-grained layer attention (34.45). Based on these experimental results, we thus choose fine-grained layer attention as the representative of EncoderFusion in the following analyses.
3.3 BEHAVIOR CHANGES ACROSS ENCODER LAYERS
In this section, we investigate whether the surface features at lower encoder layers (especially the encoder embedding layer) contribute to the model performance via carefully designed experiments.

Figure 1: Attention distribution of each decoder layer (x-axis) attending to encoder layers (y-axis), for (a) Translation: Ro-En, (b) Summarization, and (c) Correction.

Visualization of layer attention
We first visualize the learned layer attention distribution in Figure 1, in which each weight is averaged over all dimensions. Generally, a higher weight denotes a greater contribution of an encoder layer to the corresponding decoder layer.

Clearly, in all tasks the higher decoder layers, especially the uppermost ones, pay more attention to the encoder embedding layer, which indicates that the surface representations potentially bring some additional useful features to the model. Voita et al. (2019) reveal that the upper layers of the decoder are responsible for the translation part while the lower layers handle the language modeling part. Similarly, our results show that surface representations might play an important role in learning to translate source tokens.

Among the Seq2Seq models, there are still considerable differences in the attention heatmaps. In the summarization model, almost all decoder layers focus more on the encoder embedding layer, while in the other two models the intermediate decoder layers pay more attention to the higher-level encoder layers. This is consistent with the findings of Rothe et al. (2019), who reveal that the summarization task, as a typical extractive generation task, tends to use more surface features to generate extractive summaries. In contrast, both machine translation and error correction require a large amount of syntactic and semantic information, which is generally embedded in higher-level encoder layers (Peters et al., 2018).

However, we still cannot conclude that the source representation bottleneck exists in Seq2Seq models, since the surface features might merely act as a noise regularizer that improves the robustness of the encoder output representations. To dispel this doubt, we further design two experiments to directly evaluate the effectiveness of the surface features at the encoder embedding layer.
Contribution of individual encoder layer
In this experiment, we quantitatively analyze the behavior changes of a trained Seq2Seq model when masking a specific encoder layer (i.e. setting its attention weight to zero and redistributing the other attention weights). Note that the masking operation does not affect the information flow of the encoding computation, i.e. Equation 1 is kept unchanged.
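The masking operation itself is straightforward; the sketch below (illustrative Python, with assumed tensor shapes) shows how one layer's fusion weight can be zeroed out and the remaining weights renormalized without touching the encoder forward pass.

```python
import torch


def mask_encoder_layer(attn_weights: torch.Tensor, layer: int) -> torch.Tensor:
    """Zero out the fusion weight assigned to one encoder layer and renormalize.

    attn_weights: [..., N+1] softmax-normalized weights over the encoder layers,
    where index 0 is the embedding layer. Only the fusion weights used by the
    decoder are modified; the encoder computation (Eq. 1) is left unchanged.
    """
    masked = attn_weights.clone()
    masked[..., layer] = 0.0
    # redistribute the remaining probability mass over the unmasked layers
    return masked / masked.sum(dim=-1, keepdim=True)
```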
Figure 2: Relative changes of (a) model performance and (b) length of output when masking individual encoder layers in the trained Seq2Seq models. As seen, masking the embedding layer leads to a significant drop in model performance and an increase in output length.

Figure 2(a) shows the contribution of individual encoder layers to model performance. As seen, masking the encoder embedding layer seriously harms the model performance in all tasks, which confirms our claim that the surface features in the embedding layer are essential to Seq2Seq models.

Figure 2(b) shows the results on the output length. Masking the encoder embedding layer consistently increases the length of the generated output, which is especially significant for the summarization model. One possible reason is that the instances in the translation and correction tasks have similar input/output lengths, while the summarization instances have distant input/output lengths.

By analyzing the model outputs, we found that the Seq2Seq models tend to generate hallucinatory (i.e. fluent but unfaithful to the source) predictions (Lee et al., 2019; Wang & Sennrich, 2020) when the embedding layer is masked. Taking the correction task as an example, the right prediction "anyone" was replaced by the hallucinatory prediction "friends of anyone" in the masked model, although the corresponding source contains no information related to "friends". This issue becomes worse in the summarization task, since the hallucinatory prediction is more likely to be an entire sentence. The additional hallucinations increase the output length and reduce the model performance. In addition, Lee et al. (2019) point out that even if hallucinations occur only occasionally, a Seq2Seq model may lose user trust more severely than with other prediction problems, indicating the importance of fusing surface features at the embedding layer. More cases are studied in Appendix A.2.
Expressivity of attended dimensions in the encoder embedding layer
As shown in Figure 1, the uppermost decoder layer pays most attention to the encoder embedding layer (i.e. the lower right corner). If the embedding layer acted as a noise regularizer, its dimensions would be randomly attended by the fine-grained model; otherwise, the dimensions with higher attention weights should be distinguishable from the other dimensions.

Starting from this intuition, we reordered the dimensions of the encoder embedding layer according to the attention weights ŵ_{M,0}, and split it into two equal sub-embedding matrices, i.e. more attended dimensions and less attended dimensions. We compared the expressivity of the two sub-embedding matrices by the commonly used singular value decomposition (Gao et al., 2019; Wang et al., 2019a; Shen et al., 2020), in which higher normalized singular values denote that the embedding is more uniformly distributed and thus more expressive. The singular values are normalized by dividing them by the largest value, and their log-scale values are reported for better clarity.
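A minimal NumPy sketch of this expressivity measure is given below. The variable names in the commented usage (e.g. `attn_weights_top_to_emb`, `embedding_matrix`) are hypothetical placeholders for the quantities described above.

```python
import numpy as np


def normalized_log_singular_values(embedding: np.ndarray) -> np.ndarray:
    """Normalized singular values (log scale) of an embedding matrix of shape [vocab, dim].

    Slower-decaying values indicate a more uniformly distributed, and thus more
    expressive, set of embedding dimensions.
    """
    s = np.linalg.svd(embedding, compute_uv=False)
    return np.log(s / s.max())


# Illustrative usage: split the embedding dimensions by the fine-grained attention
# weights of the uppermost decoder layer over the encoder embedding layer.
# order = np.argsort(-attn_weights_top_to_emb)            # most-attended dimensions first
# dim = embedding_matrix.shape[1]
# curve_more = normalized_log_singular_values(embedding_matrix[:, order[:dim // 2]])
# curve_less = normalized_log_singular_values(embedding_matrix[:, order[dim // 2:]])
```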
Figure 3: Log-scale singular values of the three sub-embedding matrices in the fine-grained layer attention models, for (a) Translation: Ro-En, (b) Summarization, and (c) Correction. Higher log eigenvalues denote more expressive dimensions.

Figure 3 depicts the singular value results. For comparison, we also report the values of randomly selected dimensions. Clearly, the more attended dimensions are the most expressive, while the less attended dimensions are the least expressive. These results demonstrate that the fine-grained attention model indeed extracts useful surface information from the encoder embedding layer, which does not merely play the role of a noise regularizer.

From the above experiments, we conclude that the encoder embedding layer indeed provides useful surface information, which is not fully exploited by the standard Seq2Seq models.
4 OUR METHOD
In Section 3, we showed that the uppermost decoder layer requires more surface features for better representation learning. One possible reason is that the uppermost decoder layer is used to predict individual target tokens, which naturally benefits more from token-level surface features than from sequence-level abstract features. To validate this assumption, we simplify fine-grained layer attention so that only the uppermost decoder layer can attend to the embedding layer and the output layer of the encoder. Empirical results show that this simplified variant works on par with the original one, revealing that the surface features embedded at the source embedding layer are expressive.

Although the layer attention model partially alleviates the source representation bottleneck, it potentially introduces unnecessary intermediate encoder representations. To address this, we propose to directly connect the decoder softmax layer and the encoder embedding layer with a simple SurfaceFusion method.

4.1 SURFACEFUSION
Seq2Seq learning aims to maximize the log-likelihood of a target sequence y given a source sequence x. In practice, it factorizes the likelihood of the target sequence into individual token likelihoods:

ŷ = arg max Σ_{j=1}^{J} log P(y_j | y_{<j}, x)

SurfaceFusion exposes the encoder embedding layer directly to this token-level prediction: in addition to the distribution computed from the uppermost encoder representation X^N, the softmax layer also receives a distribution computed from the encoder embedding layer, and the two are fused in one of two ways. Hard fusion (Φ_hard) combines the two output distributions at the probability level, weighted by a hyperparameter λ. Soft fusion instead combines the two sources at the logit level, before the final softmax, so that the surface features interact with the abstract features earlier in the prediction.
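For illustration, the sketch below shows one plausible realization of the two fusion schemes in PyTorch-style Python. The exact functional forms are assumptions, not the paper's equations: hard fusion is rendered as a log-linear interpolation of the two probability distributions with weight λ, and soft fusion as a weighted combination of the two logit vectors before the softmax; the variable names `logits_out` and `logits_emb` are illustrative.

```python
import torch
import torch.nn.functional as F


def hard_fusion(logits_out: torch.Tensor, logits_emb: torch.Tensor, lam: float) -> torch.Tensor:
    """Probability-level fusion (assumed log-linear interpolation with weight lambda).

    logits_out: logits computed from the uppermost encoder representation X^N
    logits_emb: logits computed from the encoder embedding layer
    Returns fused log-probabilities over the target vocabulary.
    """
    log_p_out = F.log_softmax(logits_out, dim=-1)
    log_p_emb = F.log_softmax(logits_emb, dim=-1)
    return lam * log_p_emb + (1.0 - lam) * log_p_out


def soft_fusion(logits_out: torch.Tensor, logits_emb: torch.Tensor, lam: float) -> torch.Tensor:
    """Logit-level fusion: combine the two logit vectors before the final softmax."""
    return F.log_softmax(lam * logits_emb + (1.0 - lam) * logits_out, dim=-1)
```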
Table 2: Results of the proposed SurfaceFusion methods on the Seq2Seq tasks. "FGLA" denotes fine-grained layer attention. The existing results are Ghazvininejad et al. (2019) for Ro-En, Ott et al. (2018) for En-De and En-Fr, Ott et al. (2019) for summarization, and Chollampatt & Ng (2018) for correction. All reported scores are the higher the better.

Model          Ro-En   En-De   En-Fr   RG-1   RG-2   RG-L   Prec.   Recall   F0.5
Existing
Vanilla
FGLA
Hard fusion    35.1    29.5    43.9
Soft fusion

Table 2 lists the results of the proposed approach on the different tasks. In addition to the vanilla Seq2Seq model ("Vanilla"), we also report the results of existing studies on the same datasets ("Existing") for better comparison. Our re-implementation of the vanilla models matches the results reported in previous works, which we believe makes the evaluation convincing. Clearly, the proposed fusion approaches outperform the baselines (i.e. "Vanilla" and "FGLA") in all cases, while there are still considerable differences among the model variations. Hard fusion performs better on the translation tasks, while soft fusion is superior on the summarization and correction tasks. Unlike hard fusion, which operates at the probability level, soft fusion operates at the logit level to provide an earlier and more direct way of fusing surface features, which might be a better solution for tasks with a similar input/output domain.

Closeness of word embeddings   SurfaceFusion shortens the path distance between source and target embeddings, which can help to learn better bilingual embeddings with direct interactions. Table 3 shows the cosine similarities between the tied source and target embeddings on the Ro-En translation task. In this experiment, we first train an additional aligner (i.e. fast-align (Dyer et al., 2013)) on the training corpus and use the alignment links to construct a word dictionary. The results calculated over the dictionary show that the relationship between the source and target embeddings becomes much closer (i.e. higher cosine similarities). This can help each side to learn better representations, and has been validated to be beneficial for Seq2Seq models (Press & Wolf, 2017; Liu et al., 2019).

Table 3: Cosine similarities between aligned source and target word embeddings. "All" and "Non-Shared" denote keeping or removing the aligned pair when the source and target words are the same, which are easier to align.

Model           All     Non-Shared
Vanilla
SurfaceFusion   0.650   0.417

Expressivity of word embeddings   In this experiment, we quantitatively evaluate the expressivity of the word embeddings learned by different models using the singular value decomposition. The related experimental details and executions are similar to those of Figure 3.

Figure 4: Log-scale singular values of the embeddings (SurfaceFusion vs. Vanilla).

Figure 4 shows the results for the tied source and target embeddings on the Ro-En translation task. The word embeddings of the vanilla model have fast-decaying singular values, which limits the representational power of the embeddings to a small sub-space. The SurfaceFusion model slows down the decay and the singular values become more uniformly distributed, which demonstrates that the fused surface features remarkably enhance the representation learning of embeddings. This provides a better starting point for the model to effectively extract surface and abstract features, which leads to an improvement in model performance.

5 RELATED WORK

EncoderFusion in Seq2Seq   Lower encoder layers that embed useful surface features are far away from the training signals, which makes it difficult for deep Seq2Seq models to exploit such features. Although residual connections (He et al., 2016) have been incorporated to combine layers, these connections are "shallow" themselves, and only fuse by simple, one-step operations (Yu et al., 2018). In response to this problem, several approaches have been proposed to fuse the encoder layers with advanced methods, such as layer attention (Bapna et al., 2018; Shen et al., 2018; Wang et al., 2019c), layer aggregation (Dou et al., 2018; Wang et al., 2018a; Dou et al., 2019; Li et al., 2020), and layer-wise coordination (He et al., 2018; Liu et al., 2020). Although these methods show promising results on different NLP tasks, not much is known about how EncoderFusion works. In addition, some other studies show that exploiting low-layer encoder representations fails to improve model performance (Domhan, 2018).

In this paper, we consolidate the conflicting conclusions of existing studies by pointing out that the encoder embedding layer is the key, which can help Seq2Seq models to precisely predict target words. Based on this finding, we propose a novel SurfaceFusion to directly connect the encoder embedding layer and the softmax layer, which consistently outperforms current EncoderFusion approaches across different NLP tasks.

Variants of Feature Fusion   Feature fusion aims to merge two sets of features into one, and is frequently employed in CV tasks such as semantic segmentation (Long et al., 2014; Chen et al., 2018a; Zhang et al., 2018) and object detection (Pinheiro et al., 2016; Lin et al., 2017). Zhang et al. (2018) show that simply fusing surface and abstract features tends to be less effective due to the gap in semantic levels.

For NLP tasks, researchers have investigated fusion models for language understanding (Lu & Li, 2013; Hu et al., 2014; Peters et al., 2018) and language generation (Gulcehre et al., 2015; Sriram et al., 2017; Stahlberg et al., 2018). Nguyen & Chiang (2019) propose to fuse features at the representation level, but we empirically find this kind of fusion method is not orthogonal to multi-layer models due to the large semantic gap. Gulcehre et al. (2015) combine the predictions produced by the Seq2Seq model with external LM predictions in a late fusion manner, which poses little impact on the original information flow. Stahlberg et al. (2018) improve upon it by removing the dependence on the manually defined hyper-parameter. In this work, we demonstrate the effectiveness of the two typical probability-level fusion methods on sequence-to-sequence learning tasks. Unlike these approaches, which rely on an external model, our approach only requires a surface attention module that can be jointly trained with the vanilla Seq2Seq model.

6 CONCLUSION AND FUTURE WORK

In this paper, we investigate how encoder layer fusion works in solving the source representation bottleneck.
Based on a series of experiments on different Seq2Seq tasks, we find that the encoder embedding layer is important to the success of EncoderFusion through exploiting useful surface information. Based on this observation, we propose a novel SurfaceFusion to directly connect the encoder embedding layer and the softmax layer. Experiments show that SurfaceFusion consistently outperforms the conventional EncoderFusion on several datasets. Extensive analyses reveal that SurfaceFusion enhances the learning of expressive bilingual word embeddings for Seq2Seq models, which confirms our claim.

Future directions include validating our findings on more Seq2Seq tasks (e.g. dialogue and speech recognition) and model architectures (e.g. RNMT+ (Chen et al., 2018b) and DynamicConv (Wu et al., 2019)). It is also worthwhile to explore more alternatives to EncoderFusion from the perspective of exploiting the embedding layer.

REFERENCES

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. In EMNLP, 2018.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In ACL, 2017.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018a.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018b.

Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. Fine-grained attention mechanism for neural machine translation. Neurocomputing, 2018.

Shamil Chollampatt and Hwee Tou Ng. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI, 2018.

Daniel Dahlmeier and Hwee Tou Ng. Better evaluation for grammatical error correction. In NAACL, 2012.

Tobias Domhan. How much attention do you need? A granular analysis of neural machine translation architectures. In ACL, 2018.

Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. Exploiting deep representations for neural machine translation. In EMNLP, 2018.

Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong Zhang. Dynamic layer aggregation for neural machine translation with routing-by-agreement. In AAAI, 2019.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM model 2. In NAACL, 2013.

Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. Representation degeneration problem in training natural language generation models. In ICLR, 2019.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP, 2019.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. In arXiv, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, 2018.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, 2014.

Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL, 2020.

Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. An empirical study of incorporating pseudo data into grammatical error correction. In EMNLP, 2019.

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucinations in neural machine translation. In OpenReview, 2019.

Jian Li, Xing Wang, Baosong Yang, Shuming Shi, Michael R Lyu, and Zhaopeng Tu. Neuron interaction based representation composition for neural machine translation. In AAAI, 2020.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

Fenglin Liu, Xuancheng Ren, Guangxiang Zhao, and Xu Sun. Layer-wise cross-view decoding for sequence-to-sequence learning. In arXiv, 2020.

Xuebo Liu, Derek F. Wong, Yang Liu, Lidia S. Chao, Tong Xiao, and Jingbo Zhu. Shared-private bilingual word embeddings for neural machine translation. In ACL, 2019.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2014.

Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.

Toan Nguyen and David Chiang. Improving lexical choice in neural machine translation. In NAACL, 2019.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In WMT@EMNLP, 2018.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In ACL, 2002.

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In ICLR, 2018.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.

Pedro H. O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In ECCV, 2016.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. In EACL, 2017.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. In arXiv, 2019.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Rethinking batch normalization in transformers. In ICML, 2020.

Yanyao Shen, Xu Tan, Di He, Tao Qin, and Tie-Yan Liu. Dense information flow for neural machine translation. In NAACL, 2018.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. Cold fusion: Training seq2seq models together with language models. In arXiv, 2017.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. In WMT, 2018.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.

Chaojun Wang and Rico Sennrich. On exposure bias, hallucination and domain shift in neural machine translation. In arXiv, 2020.

Dilin Wang, Chengyue Gong, and Qiang Liu. Improving neural language modeling via adversarial training. In ICML, 2019a.

Liang Wang, Wei Zhao, Ruoyu Jia, Sujian Li, and Jingming Liu. Denoising based sequence-to-sequence pre-training for text generation. In EMNLP, 2019b.

Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, and Jingbo Zhu. Multi-layer representation fusion for neural machine translation. In COLING, 2018a.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. In ACL, 2019c.

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. Switchout: An efficient data augmentation algorithm for neural machine translation. In EMNLP, 2018b.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. In arXiv, 2016.

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In ICML, 2020.

Zhenli Zhang, Xiangyu Zhang, Chao Peng, Dazhi Cheng, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.

APPENDIX

A.1 EXPERIMENTAL SETUP

Table 4: Statistics of the datasets and hyperparameters for the experiments. All the data have been tokenized and split into joint sub-word units (Sennrich et al., 2016). "Batch" denotes the number of source tokens and target tokens used in each training step. "DP" denotes the dropout value (Srivastava et al., 2014). "LP" denotes the length penalty (Wu et al., 2016). "Base" and "Big" denote the two kinds of model variants of Transformer. We chose the checkpoint with the best validation perplexity for testing.

Machine translation   For WMT16 Romanian-English, we used the preprocessed data and existing result from Ghazvininejad et al. (2019). The validation set is newsdev2016 and the test set is newstest2016. For WMT14 English-German, the preprocessed data and existing result are derived from Ott et al. (2018).
The validation set is newstest2013 and the test set is newstest2014. For WMT14 English-French, we reported the existing result from Ott et al. (2018) and followed them to preprocess the data sets. The validation set is newstest2012+2013 and the test set is newstest2014.

Text summarization   For the CNN/Daily Mail dataset, we used the existing result and preprocessing method of Ott et al. (2019). During testing, the minimum length was set to 55 and the maximum length was set to 140, which were tuned on the development data. We also followed Paulus et al. (2018) to disallow repeating the same trigram.

Grammatical error correction   For the CONLL14 benchmark, the preprocessing script and existing result are given by Chollampatt & Ng (2018). We applied the regularization technique SwitchOut (Wang et al., 2018b) in this task to prevent overfitting, which was set to 0.8 for the source and 0.9 for the target.

Table 4 gives more details of the benchmarks. Note that other unmentioned hyperparameters are kept the same as in the original Transformer paper (Vaswani et al., 2017). All the models are implemented with the open-source toolkit fairseq (Ott et al., 2019).

https://drive.google.com/uc?id=1YrAwCEuktG-iDVxtEW-FE72uFTLc5QMl
https://drive.google.com/uc?id=0B_bZck-ksdkpM25jRUN2X2UxMm8
https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh
https://github.com/pytorch/fairseq

A.2 CASE STUDY

Tables 5, 6 and 7 give cases from the three tasks. We can see that the hallucination issue related to surface features consistently appears across the different Seq2Seq tasks. The most representative cases are those from the correction task, in which very similar input/output sequences still exhibit such mistakes.

Another observation is the prediction omission problem when masking the encoder output layer. The lack of abstract features leads to incomplete semantics of the source representations, thus making Seq2Seq models omit generating a part of the source, hurting the model performance. By looking at the cases over the three tasks, we find that prediction omission is widespread in the prediction of modifiers, e.g. adjectives and adverbs.

Table 5: Examples from the Ro-En translation task. Red words are good predictions, while blue words are bad predictions. Masking the embedding layer ("Mask Emb") of the fine-grained layer attention model leads to hallucinatory predictions, prolonging the prediction length, while masking the output layer ("Mask Out") leads to prediction omissions, shortening the length.

Hallucination
  Source:    diseara voi merge acasa si voi dormi linistit .
  Reference: i will go home tonight and sleep well .
  Vanilla:   i will go home and sleep quietly .
  Mask Emb:  the device will go home and i will sleep peacefully .
  Mask Out:  i will go home and sleep quietly .

Omission
  Source:    radem adesea mult atunci cand vorbim .
  Reference: we often laugh a lot when we talk .
  Vanilla:   we often laugh a lot when we talk .
  Mask Emb:  we often laugh a lot when we talk .
  Mask Out:  we often laugh when we talk .

Table 6: Examples from the CNN/DM summarization task.

Hallucination
  Source:    ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy purposes ...
  Reference: ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .
  Vanilla:   ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .
  Mask Emb:  ... It is able to carry just as much power - 400,000 volts . The design is a T-shape , with two ‘ hanging baskets ’ either side ...
  Mask Out:  ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .

Omission
  Source:    ... Opening statements in his trial are scheduled to begin Monday ...
  Reference: ... Opening statements are scheduled Monday in the trial of James Holmes ...
  Vanilla:   ... Prosecutors are not swayed, will seek the death penalty . Opening statements in his trial are scheduled to begin Monday . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...
  Mask Emb:  ... Prosecutors are not swayed, will seek the death penalty . Opening statements in his trial are scheduled to begin Monday . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...
  Mask Out:  ... Prosecutors are not swayed and will seek the death penalty . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...

Table 7: Examples from the CONLL correction task.

Hallucination
  Source:    They can become anyone .
  Reference: They can become anyone .
  Vanilla:   They can become anyone .
  Mask Emb:  They can become friends with anyone .
  Mask Out:  They can become anyone .

Omission
  Source:    In conclude , people should think carefully of what is the consequences of telling the relatives his or her generic disorder issue .
  Reference: In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Vanilla:   In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Mask Emb:  In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Mask Out:  In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic issue