Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning
Xuebo Liu∗, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao & Zhaopeng Tu
NLP2CT Lab, Department of Computer and Information Science, University of Macau
Tencent AI Lab
The University of Sydney
[email protected], {vinnylywang,zptu}@tencent.com, {derekfw,lidiasc}@um.edu.com, [email protected]

ABSTRACT
Encoder layer fusion (EncoderFusion) is a technique to fuse all the encoder layers (instead of the uppermost layer) for sequence-to-sequence (Seq2Seq) models, which has proven effective on various NLP tasks. However, it is still not entirely clear why and when EncoderFusion should work. In this paper, our main contribution is to take a step further in understanding EncoderFusion. Many previous studies believe that the success of EncoderFusion comes from exploiting surface and syntactic information embedded in lower encoder layers. Unlike them, we find that the encoder embedding layer is more important than the other intermediate encoder layers. In addition, the uppermost decoder layer consistently pays more attention to the encoder embedding layer across NLP tasks. Based on this observation, we propose a simple fusion method, SurfaceFusion, which fuses only the encoder embedding layer for the softmax layer. Experimental results show that SurfaceFusion outperforms EncoderFusion on several NLP benchmarks, including machine translation, text summarization, and grammatical error correction. It obtains state-of-the-art performance on the WMT16 Romanian-English and WMT14 English-French translation tasks. Extensive analyses reveal that SurfaceFusion learns more expressive bilingual word embeddings by building a closer relationship between relevant source and target embeddings. The source code will be released.
1 INTRODUCTION
Sequence-to-sequence (Seq2Seq) learning (Sutskever et al., 2014) has advanced the state of the art in various natural language processing (NLP) tasks, such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017; Wu et al., 2019), text summarization (Wang et al., 2019b; Zhang et al., 2020), and grammatical error correction (Kiyono et al., 2019; Kaneko et al., 2020). Seq2Seq models are generally implemented with an encoder-decoder framework, in which a multi-layer encoder summarizes a source sequence into a sequence of representations and a multi-layer decoder produces the target sequence conditioned on the encoded representations.

Recent studies reveal that fusing the intermediate encoder layers (EncoderFusion) is beneficial for Seq2Seq models, such as layer attention (Bapna et al., 2018), layer aggregation (Dou et al., 2018; Wang et al., 2019c), and layer-wise coordination (He et al., 2018). Despite its effectiveness, not much is known about how fusing encoder layer representations works. The intuitive explanation is that fusing encoder layers exploits surface and syntactic information embedded in the lower encoder layers (Belinkov et al., 2017; Peters et al., 2018). However, other studies show that attending to lower encoder layers (excluding the encoder embedding layer) does not improve model performance (Domhan, 2018), which conflicts with the existing conclusions. It is still unclear why and when fusing encoder layers should work in Seq2Seq models.

This paper tries to shed light upon the behavior of Seq2Seq models augmented with the EncoderFusion method. To this end, we propose a novel fine-grained layer attention to evaluate the contribution of individual encoder layers. We conduct experiments on several representative Seq2Seq NLP tasks, including machine translation, text summarization, and grammatical error correction. Through a series of analyses, we find that the uppermost decoder layer pays more attention to the encoder embedding layer. Masking the encoder embedding layer significantly drops model performance by generating hallucinatory (i.e. fluent but unfaithful to the source) predictions. The encoded representation of the standard Seq2Seq models (i.e. without fusing encoder layers) may not have enough capacity to model both semantic and surface features (especially those at the encoder embedding layer). We call the problem described above the source representation bottleneck.

Based on this observation, we simplify the EncoderFusion approaches by only connecting the encoder embedding layer to the softmax layer (SurfaceFusion). The SurfaceFusion approach shortens the path distance between source and target embeddings, which can help to learn better bilingual embeddings with direct interactions. Experimental results on several Seq2Seq NLP tasks show that our method consistently outperforms both the vanilla Seq2Seq model and the layer attention model. Extensive analyses reveal that our approach produces more aligned bilingual word embeddings by shortening the path distance between them, which confirms our claim.

Our main contributions are as follows:
• We introduce a fine-grained layer attention method to qualitatively and quantitatively evaluate the contribution of individual encoder layers.
• We demonstrate that the encoder embedding layer is essential for fusing encoder layers, which consolidates conflicting findings reported by previous studies.
• We propose a simple yet effective SurfaceFusion approach to directly exploit the encoder embedding layer for the decoder, which produces more expressive bilingual embeddings.

∗ Work was done when Xuebo Liu and Liang Ding were interning at Tencent AI Lab.
2 PRELIMINARIES
2.1 SEQUENCE-TO-SEQUENCE LEARNING
Seq2Seq learning aims to maximize the log-likelihood of a target sequence y = {y_1, ..., y_J} conditioned on a source sequence x = {x_1, ..., x_I}, which is formulated as ŷ = arg max log P(y|x). Typically, Seq2Seq learning can be implemented as various architectures (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Wu et al., 2019), among which the Transformer (Vaswani et al., 2017) has advanced the state of the art. Without loss of generality, we use Transformer as the testbed in this paper. Transformer consists of an encoder E equipped with N identical layers to map the source sequence x into distributed representations, based on which a decoder D equipped with M identical layers generates the target sequence y:

X^N = E(X^0), with X^n = FFN(ATT(X^{n-1}, X^{n-1}, X^{n-1})), n = 1, ..., N    (1)

Y^M = D(Y^0, X^N), with Y^m = FFN(ATT(ATT(Y^{m-1}, Y^{m-1}, Y^{m-1}), X^N, X^N)), m = 1, ..., M    (2)

where X^0 denotes the sum of the word embeddings X^emb and position embeddings X^pos of x, Y^0 denotes that of the shifted-right y, FFN(·) denotes a position-wise feed-forward network, and ATT(·) denotes a multi-head dot-product attention network with three arguments: query, key and value. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are used in each sub-layer, which are suppressed in Equations 1 and 2 for clarity. Finally, the output representation Y^M of the decoder is projected into the probability P(y|x), which is optimized during model training.
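For concreteness, a minimal PyTorch-style sketch of the layer recursion in Equations 1 and 2 is given below. It is an illustrative simplification rather than the implementation used in this work: residual connections, layer normalization, causal and padding masks are omitted, and the hyperparameter names are assumptions.

```python
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer: self-attention followed by a feed-forward network (Eq. 1)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        # ATT(X^{n-1}, X^{n-1}, X^{n-1}): query, key and value are all the previous layer output
        h, _ = self.self_attn(x, x, x)
        return self.ffn(h)  # residual connections and layer norm omitted for brevity


class DecoderLayer(nn.Module):
    """One decoder layer: self-attention, cross-attention to X^N, then FFN (Eq. 2)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, y, enc_out):
        h, _ = self.self_attn(y, y, y)               # ATT(Y^{m-1}, Y^{m-1}, Y^{m-1}); causal mask omitted
        h, _ = self.cross_attn(h, enc_out, enc_out)  # vanilla decoder attends only to the uppermost encoder layer X^N
        return self.ffn(h)
```

Note that, as in Equation 2, the vanilla decoder layer only ever sees `enc_out` (X^N); the lower encoder layers, including the embedding layer, are invisible to it.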
2.2 EXPERIMENTAL SETUP

To validate the universality of the source representation bottleneck in Seq2Seq models, we conducted experiments on three representative tasks, which vary in the distance between input and output domains and in the scale of training data:
Machine translation takes a sentence in one language as input, and outputs a semantically equivalent sentence in another language. We conducted experiments on three benchmark datasets: small-scale WMT16 Romanian-English (Ro-En; 0.6M instances), medium-scale WMT14 English-German (En-De; 4.5M instances), and large-scale WMT14 English-French (En-Fr; 36.0M instances). The tokenized BLEU score (Papineni et al., 2002) was used for all the translation tasks.
Text summarization takes a long-text document as input, and outputs a short and adequate summary in the same language. We used the CNN/Daily Mail corpus (0.3M instances). We evaluated with the standard ROUGE metric (Lin, 2004), i.e. Rouge-1, Rouge-2, and Rouge-L.
Grammatical error correction takes a sentence with grammatical errors as input, and outputs a corrected sentence. We used the CONLL14 dataset as the testbed (1.4M instances). The MaxMatch (M²) scores (Dahlmeier & Ng, 2012) were used for evaluation with precision, recall, and F0.5 values.

The machine translation task has distant input/output domains (i.e. in different languages), while the other tasks have similar input/output domains (i.e. in the same language). We used Transformer (Vaswani et al., 2017) as the Seq2Seq model. Details of the datasets and model training are listed in Appendix A.1.

3 BEHAVIOR OF ENCODERFUSION
In this section, we first formulate our research hypothesis of the source representation bottleneck (§3.1) that EncoderFusion is expected to solve. In the following subsections, we propose a fine-grained layer attention model (§3.2) to validate our hypothesis with carefully designed experiments (§3.3).
3.1 SOURCE REPRESENTATION BOTTLENECK
Seq2Seq models learn more abstract features as the layer level increases (i.e. X^0 → X^N and Y^0 → Y^M) (Belinkov et al., 2017). It has been extensively validated that a reasonable use of both the abstract representations (at higher-level layers) and the surface representations (at lower-level layers) is beneficial for various NLP (Lu & Li, 2013; Hu et al., 2014; Dou et al., 2018; Peters et al., 2018) and CV (Long et al., 2014; Pinheiro et al., 2016; Lin et al., 2017; Chen et al., 2018a) tasks. However, the Seq2Seq decoder only takes the abstract representations at the uppermost layer X^N as input (Equation 2), while ignoring the useful surface representations at the other layers X^n (n < N). Although X^N has encoded surface features from low-level representations through layer-by-layer abstraction and residual connections, we hypothesize that its limited representation capacity may not sufficiently model those surface features from lower encoder layers, especially the embedding layer. We call this issue the source representation bottleneck.

3.2 FINE-GRAINED LAYER ATTENTION
For each decoder layer, layer attention (Bapna et al., 2018; Peters et al., 2018) assigns normalized scalar weights to all encoder layers, providing a direct way to evaluate the contribution made by each encoder layer. However, the capacity of a simple scalar weight is limited, leading to an insufficient evaluation of the contributions.

Motivated by fine-grained attention (Choi et al., 2018), in which each element of a context vector receives an individual attention weight, we propose a fine-grained layer attention model to combine the advantages of both techniques. This allows us to more convincingly evaluate the contribution of individual encoder layers to the model performance. Besides, the nature of fine-grained attention enables us to give in-depth analyses of the representation power in §3.3.

Specifically, we replace the layer-agnostic source representation X^N with a layer-aware representation S^m for each decoder layer Y^m, which is calculated as:

S^m = Σ_{n=0}^{N} ŵ_{m,n} ⊙ X^n,   ŵ_{m,n} = [ŵ_{m,n,1}, ..., ŵ_{m,n,D}],   ŵ_{m,n,d} = exp(w_{m,n,d}) / Σ_{n'=0}^{N} exp(w_{m,n',d})

where ⊙ denotes element-wise multiplication, and w_{m,n,d} denotes an element of the learnable attention weight W ∈ R^{M×(N+1)×D}, with D the dimensionality of the source representation. When n = 0, we use the word embeddings X^emb without position embeddings as X^0, which has been empirically proved effective. We applied a regularization technique, DropConnect (Wan et al., 2013), to the attention weight W for stable training, which randomly drops each w_{m,n,d} with a probability p and divides W by 1 − p. We set p to 0.3 for all the experiments.
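The following is a minimal sketch of the fine-grained layer attention described above, written in PyTorch-style Python. It is illustrative only: the DropConnect placement (applied to the raw weights before the softmax) and the shapes are assumptions consistent with the formulation, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedLayerAttention(nn.Module):
    """Per-dimension attention over the stack of encoder layers (embedding layer included).

    For decoder layer m, every one of the D dimensions receives its own weight over the
    N+1 encoder layers, normalized by a softmax along the layer axis; the fused source
    representation S^m is the element-wise weighted sum of X^0, ..., X^N.
    """

    def __init__(self, num_dec_layers, num_enc_layers, d_model, dropconnect=0.3):
        super().__init__()
        # learnable weights W of shape [M, N+1, D]
        self.w = nn.Parameter(torch.zeros(num_dec_layers, num_enc_layers + 1, d_model))
        self.p = dropconnect

    def forward(self, enc_layers, m):
        # enc_layers: list of N+1 tensors of shape [batch, src_len, D]; index 0 is the embedding layer X^0
        x = torch.stack(enc_layers, dim=0)          # [N+1, batch, src_len, D]
        w = self.w[m]                               # [N+1, D]
        if self.training:
            # DropConnect: randomly zero individual weights with probability p and rescale by 1/(1 - p)
            mask = torch.bernoulli(torch.full_like(w, 1.0 - self.p))
            w = w * mask / (1.0 - self.p)
        w = F.softmax(w, dim=0)                     # normalize over the layer axis, separately per dimension
        return (w[:, None, None, :] * x).sum(dim=0)  # S^m: [batch, src_len, D]
```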
Table 2 lists the results. The proposed fine-grained layer attention model consistently outperforms the vanilla Transformer across Seq2Seq tasks, demonstrating the benefit of fusing surface features at lower-level layers.

Table 1: Results of existing encoder layer fusion methods on the WMT16 Ro-En translation task.

Model                           BLEU
Vanilla Transformer             33.80
Layer aggregation               34.05
Layer-wise coordination         34.19
Coarse-grained layer attention  34.32
Fine-grained layer attention    34.45

We evaluated several EncoderFusion methods in Table 1, including layer aggregation (Dou et al., 2018), layer-wise coordination (He et al., 2018), and coarse-grained layer attention (Bapna et al., 2018). Their results are respectively 34.05, 34.19, and 34.32 BLEU, which are all lower than that of fine-grained layer attention (34.45). Based on these experimental results, we thus choose fine-grained layer attention as the representative of EncoderFusion in the following analyses.
3.3 BEHAVIOR CHANGES ACROSS ENCODER LAYERS
In this section, we investigate whether the surface features at lower encoder layers (especially the encoder embedding layer) contribute to the model performance via carefully designed experiments.

Figure 1: Attention distribution of each decoder layer (x-axis) attending to encoder layers (y-axis), for (a) Translation: Ro-En, (b) Summarization, and (c) Correction.

Visualization of layer attention
We first visualize the learned layer attention distribution in Figure 1, in which each weight is averaged over all dimensions. Generally, a higher weight denotes a greater contribution of an encoder layer to the corresponding decoder layer.

Clearly, in all tasks the higher decoder layers, especially the uppermost ones, pay more attention to the encoder embedding layer, which indicates that the surface representations potentially bring some additional useful features to the model. Voita et al. (2019) reveal that the upper layers of the decoder are responsible for the translation part while the lower layers handle the language modeling part. Similarly, our results show that surface representations might play an important role in learning to translate source tokens.

Among the Seq2Seq models, there are still considerable differences in the attention heatmaps. In the summarization model, almost all decoder layers focus more on the encoder embedding layer, while in the other two models the intermediate decoder layers pay more attention to the higher-level encoder layers. This is consistent with the findings of Rothe et al. (2019), who reveal that the summarization task, as a typical extractive generation task, tends to use more surface features to generate extractive summaries. In contrast, both machine translation and error correction require a large amount of syntactic and semantic information, which is generally embedded in higher-level encoder layers (Peters et al., 2018).

However, we still cannot conclude that the source representation bottleneck exists in Seq2Seq models, since the surface features might merely act as a noise regularizer that improves the robustness of the encoder output representations. To dispel this doubt, we further design two experiments to directly evaluate the effectiveness of the surface features at the encoder embedding layer.
Contribution of individual encoder layer
In this experiment, we quantitatively analyze the behavior changes of a trained Seq2Seq model when masking a specific encoder layer (i.e. setting its attention weight to zero and redistributing the other attention weights). Note that the masking operation does not affect the information flow of the encoding computation, i.e. Equation 1 is kept unchanged.
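The masking operation itself is straightforward; the sketch below (illustrative Python, with assumed tensor shapes) shows how one layer's fusion weight can be zeroed out and the remaining weights renormalized without touching the encoder forward pass.

```python
import torch


def mask_encoder_layer(attn_weights: torch.Tensor, layer: int) -> torch.Tensor:
    """Zero out the fusion weight assigned to one encoder layer and renormalize.

    attn_weights: [..., N+1] softmax-normalized weights over the encoder layers,
    where index 0 is the embedding layer. Only the fusion weights used by the
    decoder are modified; the encoder computation (Eq. 1) is left unchanged.
    """
    masked = attn_weights.clone()
    masked[..., layer] = 0.0
    # redistribute the remaining probability mass over the unmasked layers
    return masked / masked.sum(dim=-1, keepdim=True)
```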
Figure 2: Relative changes of (a) model performance and (b) length of output when masking individual encoder layers in the trained Seq2Seq models. As seen, masking the embedding layer leads to a significant drop in model performance and an increase in output length.

Figure 2(a) shows the contribution of individual encoder layers to model performance. As seen, masking the encoder embedding layer seriously harms the model performance in all tasks, which confirms our claim that the surface features in the embedding layer are essential to Seq2Seq models.

Figure 2(b) shows the results on the output length. Masking the encoder embedding layer consistently increases the length of the generated output, which is especially significant for the summarization model. One possible reason is that the instances in the translation and correction tasks have similar input/output lengths, while the summarization instances have distant input/output lengths.

By analyzing the model outputs, we found that the Seq2Seq models tend to generate hallucinatory (i.e. fluent but unfaithful to the source) predictions (Lee et al., 2019; Wang & Sennrich, 2020) when the embedding layer is masked. Taking the correction task as an example, the right prediction "anyone" was replaced by the hallucinatory prediction "friends of anyone" in the masked model, although the corresponding source contains no information related to "friends". This issue becomes worse in the summarization task, since the hallucinatory prediction is more likely to be an entire sentence. The additional hallucinations increase the output length and reduce the model performance. In addition, Lee et al. (2019) point out that even if hallucinations occur only occasionally, a Seq2Seq model may lose user trust more severely than with other prediction problems, indicating the importance of fusing surface features at the embedding layer. More cases are studied in Appendix A.2.
Expressivity of attended dimensions in the encoder embedding layer
As shown in Figure 1, the uppermost decoder layer pays most attention to the encoder embedding layer (i.e. the lower right corner). If the embedding layer acted as a noise regularizer, its dimensions would be randomly attended by the fine-grained model; otherwise, the dimensions with higher attention weights should be distinguishable from the other dimensions.

Starting from this intuition, we reordered the dimensions of the encoder embedding layer according to the attention weights ŵ_{M,0}, and split it into two equal sub-embedding matrices, i.e. more attended dimensions and less attended dimensions. We compared the expressivity of the two sub-embedding matrices by the commonly used singular value decomposition (Gao et al., 2019; Wang et al., 2019a; Shen et al., 2020), in which higher normalized singular values denote that the embedding is more uniformly distributed and thus more expressive. The singular values are normalized by dividing them by the largest value, and their log-scale values are reported for better clarity.
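A minimal NumPy sketch of this expressivity measure is given below. The variable names in the commented usage (e.g. `attn_weights_top_to_emb`, `embedding_matrix`) are hypothetical placeholders for the quantities described above.

```python
import numpy as np


def normalized_log_singular_values(embedding: np.ndarray) -> np.ndarray:
    """Normalized singular values (log scale) of an embedding matrix of shape [vocab, dim].

    Slower-decaying values indicate a more uniformly distributed, and thus more
    expressive, set of embedding dimensions.
    """
    s = np.linalg.svd(embedding, compute_uv=False)
    return np.log(s / s.max())


# Illustrative usage: split the embedding dimensions by the fine-grained attention
# weights of the uppermost decoder layer over the encoder embedding layer.
# order = np.argsort(-attn_weights_top_to_emb)            # most-attended dimensions first
# dim = embedding_matrix.shape[1]
# curve_more = normalized_log_singular_values(embedding_matrix[:, order[:dim // 2]])
# curve_less = normalized_log_singular_values(embedding_matrix[:, order[dim // 2:]])
```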
Figure 3: Log-scale singular values of the three sub-embedding matrices in the fine-grained layer attention models, for (a) Translation: Ro-En, (b) Summarization, and (c) Correction. Higher log eigenvalues denote more expressive dimensions.

Figure 3 depicts the singular value results. For comparison, we also report the values of randomly selected dimensions. Clearly, the more attended dimensions are the most expressive, while the less attended dimensions are the least expressive. These results demonstrate that the fine-grained attention model indeed extracts useful surface information from the encoder embedding layer, which does not merely play the role of a noise regularizer.

From the above experiments, we conclude that the encoder embedding layer indeed provides useful surface information, which is not fully exploited by the standard Seq2Seq models.
4 OUR METHOD
In Section 3, we showed that the uppermost decoder layer requires more surface features for better representation learning. One possible reason is that the uppermost decoder layer is used to predict individual target tokens, which naturally benefits more from token-level surface features than from sequence-level abstract features. To validate this assumption, we simplify fine-grained layer attention so that only the uppermost decoder layer can attend to the embedding layer and the output layer of the encoder. Empirical results show that this simplified variant works on par with the original one, revealing that the surface features embedded at the source embedding layer are expressive.

Although the layer attention model partially alleviates the source representation bottleneck, it potentially introduces unnecessary intermediate encoder representations. To address this, we propose to directly connect the decoder softmax layer and the encoder embedding layer with a simple SurfaceFusion method.

4.1 SURFACEFUSION
Seq2Seq learning aims to maximize the log-likelihood of a target sequence y given a source sequence x. In practice, it factorizes the likelihood of the target sequence into individual token likelihoods:

ŷ = arg max Σ_{j=1}^{J} log P(y_j | y_{<j}, x)

SurfaceFusion exposes the encoder embedding layer directly to this token-level prediction: in addition to the distribution computed from the uppermost encoder representation X^N, the softmax layer also receives a distribution computed from the encoder embedding layer, and the two are fused in one of two ways. Hard fusion (Φ_hard) combines the two output distributions at the probability level, weighted by a hyperparameter λ. Soft fusion instead combines the two sources at the logit level, before the final softmax, so that the surface features interact with the abstract features earlier in the prediction.
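For illustration, the sketch below shows one plausible realization of the two fusion schemes in PyTorch-style Python. The exact functional forms are assumptions, not the paper's equations: hard fusion is rendered as a log-linear interpolation of the two probability distributions with weight λ, and soft fusion as a weighted combination of the two logit vectors before the softmax; the variable names `logits_out` and `logits_emb` are illustrative.

```python
import torch
import torch.nn.functional as F


def hard_fusion(logits_out: torch.Tensor, logits_emb: torch.Tensor, lam: float) -> torch.Tensor:
    """Probability-level fusion (assumed log-linear interpolation with weight lambda).

    logits_out: logits computed from the uppermost encoder representation X^N
    logits_emb: logits computed from the encoder embedding layer
    Returns fused log-probabilities over the target vocabulary.
    """
    log_p_out = F.log_softmax(logits_out, dim=-1)
    log_p_emb = F.log_softmax(logits_emb, dim=-1)
    return lam * log_p_emb + (1.0 - lam) * log_p_out


def soft_fusion(logits_out: torch.Tensor, logits_emb: torch.Tensor, lam: float) -> torch.Tensor:
    """Logit-level fusion: combine the two logit vectors before the final softmax."""
    return F.log_softmax(lam * logits_emb + (1.0 - lam) * logits_out, dim=-1)
```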
Table 2: Results of the proposed SurfaceFusion methods on the Seq2Seq tasks. "FGLA" denotes fine-grained layer attention. The existing results are Ghazvininejad et al. (2019) for Ro-En, Ott et al. (2018) for En-De and En-Fr, Ott et al. (2019) for summarization, and Chollampatt & Ng (2018) for correction. All reported scores are the higher the better.

Model          Ro-En   En-De   En-Fr   RG-1   RG-2   RG-L   Prec.   Recall   F0.5
Existing
Vanilla
FGLA
Hard fusion    35.1    29.5    43.9
Soft fusion

Table 2 lists the results of the proposed approach on the different tasks. In addition to the vanilla Seq2Seq model ("Vanilla"), we also report the results of existing studies on the same datasets ("Existing") for better comparison. Our re-implementation of the vanilla models matches the results reported in previous works, which we believe makes the evaluation convincing. Clearly, the proposed fusion approaches outperform the baselines (i.e. "Vanilla" and "FGLA") in all cases, while there are still considerable differences among the model variations. Hard fusion performs better on the translation tasks, while soft fusion is superior on the summarization and correction tasks. Unlike hard fusion, which operates at the probability level, soft fusion operates at the logit level to provide an earlier and more direct way of fusing surface features, which might be a better solution for tasks with a similar input/output domain.

Closeness of word embeddings   SurfaceFusion shortens the path distance between source and target embeddings, which can help to learn better bilingual embeddings with direct interactions. Table 3 shows the cosine similarities between the tied source and target embeddings on the Ro-En translation task. In this experiment, we first train an additional aligner (i.e. fast-align (Dyer et al., 2013)) on the training corpus and use the alignment links to construct a word dictionary. The results calculated over the dictionary show that the relationship between the source and target embeddings becomes much closer (i.e. higher cosine similarities). This can help each side to learn better representations, and has been validated to be beneficial for Seq2Seq models (Press & Wolf, 2017; Liu et al., 2019).

Table 3: Cosine similarities between aligned source and target word embeddings. "All" and "Non-Shared" denote keeping or removing the aligned pair when the source and target words are the same, which are easier to align.

Model           All     Non-Shared
Vanilla
SurfaceFusion   0.650   0.417

Expressivity of word embeddings   In this experiment, we quantitatively evaluate the expressivity of the word embeddings learned by different models using the singular value decomposition. The related experimental details and executions are similar to those of Figure 3.

Figure 4: Log-scale singular values of the embeddings (SurfaceFusion vs. Vanilla).

Figure 4 shows the results for the tied source and target embeddings on the Ro-En translation task. The word embeddings of the vanilla model have fast-decaying singular values, which limits the representational power of the embeddings to a small sub-space. The SurfaceFusion model slows down the decay and the singular values become more uniformly distributed, which demonstrates that the fused surface features remarkably enhance the representation learning of embeddings. This provides a better starting point for the model to effectively extract surface and abstract features, which leads to an improvement in model performance.

5 RELATED WORK

EncoderFusion in Seq2Seq   Lower encoder layers that embed useful surface features are far away from the training signals, which makes it difficult for deep Seq2Seq models to exploit such features. Although residual connections (He et al., 2016) have been incorporated to combine layers, these connections are "shallow" themselves, and only fuse by simple, one-step operations (Yu et al., 2018). In response to this problem, several approaches have been proposed to fuse the encoder layers with advanced methods, such as layer attention (Bapna et al., 2018; Shen et al., 2018; Wang et al., 2019c), layer aggregation (Dou et al., 2018; Wang et al., 2018a; Dou et al., 2019; Li et al., 2020), and layer-wise coordination (He et al., 2018; Liu et al., 2020). Although these methods show promising results on different NLP tasks, not much is known about how EncoderFusion works. In addition, some other studies show that exploiting low-layer encoder representations fails to improve model performance (Domhan, 2018).

In this paper, we consolidate the conflicting conclusions of existing studies by pointing out that the encoder embedding layer is the key, which can help Seq2Seq models to precisely predict target words. Based on this finding, we propose a novel SurfaceFusion to directly connect the encoder embedding layer and the softmax layer, which consistently outperforms current EncoderFusion approaches across different NLP tasks.

Variants of Feature Fusion   Feature fusion aims to merge two sets of features into one, and is frequently employed in CV tasks such as semantic segmentation (Long et al., 2014; Chen et al., 2018a; Zhang et al., 2018) and object detection (Pinheiro et al., 2016; Lin et al., 2017). Zhang et al. (2018) show that simply fusing surface and abstract features tends to be less effective due to the gap in semantic levels.

For NLP tasks, researchers have investigated fusion models for language understanding (Lu & Li, 2013; Hu et al., 2014; Peters et al., 2018) and language generation (Gulcehre et al., 2015; Sriram et al., 2017; Stahlberg et al., 2018). Nguyen & Chiang (2019) propose to fuse features at the representation level, but we empirically find this kind of fusion method is not orthogonal to multi-layer models due to the large semantic gap. Gulcehre et al. (2015) combine the predictions produced by the Seq2Seq model with external LM predictions in a late fusion manner, which poses little impact on the original information flow. Stahlberg et al. (2018) improve upon it by removing the dependence on the manually defined hyper-parameter. In this work, we demonstrate the effectiveness of the two typical probability-level fusion methods on sequence-to-sequence learning tasks. Unlike these approaches, which rely on an external model, our approach only requires a surface attention module that can be jointly trained with the vanilla Seq2Seq model.

6 CONCLUSION AND FUTURE WORK

In this paper, we investigate how encoder layer fusion works in solving the source representation bottleneck.
Based on a series of experiments on different Seq2Seq tasks, we find that the encoder embedding layer is important to the success of EncoderFusion through exploiting useful surface information. Based on this observation, we propose a novel SurfaceFusion to directly connect the encoder embedding layer and the softmax layer. Experiments show that SurfaceFusion consistently outperforms the conventional EncoderFusion on several datasets. Extensive analyses reveal that SurfaceFusion enhances the learning of expressive bilingual word embeddings for Seq2Seq models, which confirms our claim.

Future directions include validating our findings on more Seq2Seq tasks (e.g. dialogue and speech recognition) and model architectures (e.g. RNMT+ (Chen et al., 2018b) and DynamicConv (Wu et al., 2019)). It is also worthwhile to explore more alternatives to EncoderFusion from the perspective of exploiting the embedding layer.

REFERENCES

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. In EMNLP, 2018.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In ACL, 2017.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018a.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018b.

Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. Fine-grained attention mechanism for neural machine translation. Neurocomputing, 2018.

Shamil Chollampatt and Hwee Tou Ng. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI, 2018.

Daniel Dahlmeier and Hwee Tou Ng. Better evaluation for grammatical error correction. In NAACL, 2012.

Tobias Domhan. How much attention do you need? A granular analysis of neural machine translation architectures. In ACL, 2018.

Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. Exploiting deep representations for neural machine translation. In EMNLP, 2018.

Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong Zhang. Dynamic layer aggregation for neural machine translation with routing-by-agreement. In AAAI, 2019.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM model 2. In NAACL, 2013.

Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. Representation degeneration problem in training natural language generation models. In ICLR, 2019.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP, 2019.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. In arXiv, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, 2018.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, 2014.

Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL, 2020.

Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. An empirical study of incorporating pseudo data into grammatical error correction. In EMNLP, 2019.

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucinations in neural machine translation. In OpenReview, 2019.

Jian Li, Xing Wang, Baosong Yang, Shuming Shi, Michael R Lyu, and Zhaopeng Tu. Neuron interaction based representation composition for neural machine translation. In AAAI, 2020.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

Fenglin Liu, Xuancheng Ren, Guangxiang Zhao, and Xu Sun. Layer-wise cross-view decoding for sequence-to-sequence learning. In arXiv, 2020.

Xuebo Liu, Derek F. Wong, Yang Liu, Lidia S. Chao, Tong Xiao, and Jingbo Zhu. Shared-private bilingual word embeddings for neural machine translation. In ACL, 2019.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2014.

Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.

Toan Nguyen and David Chiang. Improving lexical choice in neural machine translation. In NAACL, 2019.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In WMT@EMNLP, 2018.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In ACL, 2002.

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In ICLR, 2018.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.

Pedro H. O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In ECCV, 2016.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. In EACL, 2017.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. In arXiv, 2019.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Rethinking batch normalization in transformers. In ICML, 2020.

Yanyao Shen, Xu Tan, Di He, Tao Qin, and Tie-Yan Liu. Dense information flow for neural machine translation. In NAACL, 2018.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. Cold fusion: Training seq2seq models together with language models. In arXiv, 2017.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. In WMT, 2018.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.

Chaojun Wang and Rico Sennrich. On exposure bias, hallucination and domain shift in neural machine translation. In arXiv, 2020.

Dilin Wang, Chengyue Gong, and Qiang Liu. Improving neural language modeling via adversarial training. In ICML, 2019a.

Liang Wang, Wei Zhao, Ruoyu Jia, Sujian Li, and Jingming Liu. Denoising based sequence-to-sequence pre-training for text generation. In EMNLP, 2019b.

Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, and Jingbo Zhu. Multi-layer representation fusion for neural machine translation. In COLING, 2018a.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. In ACL, 2019c.

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. Switchout: An efficient data augmentation algorithm for neural machine translation. In EMNLP, 2018b.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. In arXiv, 2016.

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In ICML, 2020.

Zhenli Zhang, Xiangyu Zhang, Chao Peng, Dazhi Cheng, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.

APPENDIX

A.1 EXPERIMENTAL SETUP

Table 4: Statistics of the datasets and hyperparameters for the experiments. All the data have been tokenized and split into joint sub-word units (Sennrich et al., 2016). "Batch" denotes the number of source tokens and target tokens used in each training step. "DP" denotes the dropout value (Srivastava et al., 2014). "LP" denotes the length penalty (Wu et al., 2016). "Base" and "Big" denote the two kinds of model variants of Transformer. We chose the checkpoint with the best validation perplexity for testing.

Machine translation   For WMT16 Romanian-English, we used the preprocessed data and existing result from Ghazvininejad et al. (2019). The validation set is newsdev2016 and the test set is newstest2016. For WMT14 English-German, the preprocessed data and existing result are derived from Ott et al. (2018).
The validation set is newstest2013 and the test set is newstest2014. For WMT14 English-French, we reported the existing result from Ott et al. (2018) and followed them to preprocess the data sets. The validation set is newstest2012+2013 and the test set is newstest2014.

Text summarization   For the CNN/Daily Mail dataset, we used the existing result and preprocessing method of Ott et al. (2019). During testing, the minimum length was set to 55 and the maximum length was set to 140, which were tuned on the development data. We also followed Paulus et al. (2018) to disallow repeating the same trigram.

Grammatical error correction   For the CONLL14 benchmark, the preprocessing script and existing result are given by Chollampatt & Ng (2018). We applied the regularization technique SwitchOut (Wang et al., 2018b) in this task to prevent overfitting, which was set to 0.8 for the source and 0.9 for the target.

Table 4 gives more details of the benchmarks. Note that other unmentioned hyperparameters are kept the same as in the original Transformer paper (Vaswani et al., 2017). All the models are implemented with the open-source toolkit fairseq (Ott et al., 2019).

https://drive.google.com/uc?id=1YrAwCEuktG-iDVxtEW-FE72uFTLc5QMl
https://drive.google.com/uc?id=0B_bZck-ksdkpM25jRUN2X2UxMm8
https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh
https://github.com/pytorch/fairseq

A.2 CASE STUDY

Tables 5, 6 and 7 give cases from the three tasks. We can see that the hallucination issue related to surface features consistently appears across the different Seq2Seq tasks. The most representative cases are those from the correction task, in which very similar input/output sequences still exhibit such mistakes.

Another observation is the prediction omission problem when masking the encoder output layer. The lack of abstract features leads to incomplete semantics of the source representations, thus making Seq2Seq models omit generating a part of the source, hurting the model performance. By looking at the cases over the three tasks, we find that prediction omission is widespread in the prediction of modifiers, e.g. adjectives and adverbs.

Table 5: Examples from the Ro-En translation task. Red words are good predictions, while blue words are bad predictions. Masking the embedding layer ("Mask Emb") of the fine-grained layer attention model leads to hallucinatory predictions, prolonging the prediction length, while masking the output layer ("Mask Out") leads to prediction omissions, shortening the length.

Hallucination
  Source:    diseara voi merge acasa si voi dormi linistit .
  Reference: i will go home tonight and sleep well .
  Vanilla:   i will go home and sleep quietly .
  Mask Emb:  the device will go home and i will sleep peacefully .
  Mask Out:  i will go home and sleep quietly .

Omission
  Source:    radem adesea mult atunci cand vorbim .
  Reference: we often laugh a lot when we talk .
  Vanilla:   we often laugh a lot when we talk .
  Mask Emb:  we often laugh a lot when we talk .
  Mask Out:  we often laugh when we talk .

Table 6: Examples from the CNN/DM summarization task.

Hallucination
  Source:    ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy purposes ...
  Reference: ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .
  Vanilla:   ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .
  Mask Emb:  ... It is able to carry just as much power - 400,000 volts . The design is a T-shape , with two ‘ hanging baskets ’ either side ...
  Mask Out:  ... But it is able to carry just as much power - 400,000 volts . It is designed to be less obtrusive and will be used for clean energy .

Omission
  Source:    ... Opening statements in his trial are scheduled to begin Monday ...
  Reference: ... Opening statements are scheduled Monday in the trial of James Holmes ...
  Vanilla:   ... Prosecutors are not swayed, will seek the death penalty . Opening statements in his trial are scheduled to begin Monday . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...
  Mask Emb:  ... Prosecutors are not swayed, will seek the death penalty . Opening statements in his trial are scheduled to begin Monday . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...
  Mask Out:  ... Prosecutors are not swayed and will seek the death penalty . Holmes says he was suffering ‘ a psychotic episode ’ at the time ...

Table 7: Examples from the CONLL correction task.

Hallucination
  Source:    They can become anyone .
  Reference: They can become anyone .
  Vanilla:   They can become anyone .
  Mask Emb:  They can become friends with anyone .
  Mask Out:  They can become anyone .

Omission
  Source:    In conclude , people should think carefully of what is the consequences of telling the relatives his or her generic disorder issue .
  Reference: In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Vanilla:   In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Mask Emb:  In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic disorder issue .
  Mask Out:  In conclusion , people should think carefully about what is the consequences of telling the relatives his or her generic issue