The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives
Elena Voita, Rico Sennrich, Ivan Titov
Yandex, Russia; University of Amsterdam, Netherlands; University of Edinburgh, Scotland; University of Zurich, Switzerland
[email protected]  [email protected]@cl.uzh.ch  [email protected]
Abstract
We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformer for our analysis, as it has been shown effective on tasks with very different objectives: machine translation (MT), standard left-to-right language modeling (LM) and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by the Transformer differ significantly depending on the objective. In this work, we use canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers and how this process depends on the choice of learning objective. For example, as we go from bottom to top layers, information about the past in left-to-right language models vanishes and predictions about the future get formed. In contrast, for MLM, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top MLM layers.
Introduction

Deep (i.e. multi-layered) neural networks have become the standard approach for many natural language processing (NLP) tasks, and their analysis has been an active topic of research. One popular approach for analyzing representations of neural models is to evaluate how informative they are for various linguistic tasks, so-called "probing tasks". Previous work has made some interesting observations regarding these representations; for example, Zhang and Bowman (2018) show that untrained LSTMs outperform trained ones on a word identity prediction task; and Blevins et al. (2018) show that up to a certain layer the performance of representations obtained from a deep LM improves on a constituent labeling task, but then decreases, while with representations obtained from an MT encoder performance continues to improve up to the highest layer. These observations have, however, been somewhat anecdotal, and an explanation of the process behind such behavior has been lacking.

In this paper, we attempt to explain more generally why such behavior is observed. Rather than measuring the quality of representations obtained from a particular model on some auxiliary task, we characterize how the learning objective determines the information flow in the model. In particular, we consider how the representations of individual tokens in the Transformer evolve between layers under different learning objectives.

We look at this task from the information bottleneck perspective on learning in neural networks. Tishby and Zaslavsky (2015) state that "the goal of any supervised learning is to capture and efficiently represent the relevant information in the input variable about the output-label variable"; hence the representations undergo transformations which aim to encode as much information about the output label as possible, while 'forgetting' irrelevant details about the input. As we study sequence encoders and look into representations of individual tokens rather than the entire input, our situation is more complex. In our setting, the information preserved in a representation of a token is induced by the two roles it plays: (i) predicting the output label from the current token representation; (ii) preserving the information necessary to build representations of other tokens. (We clarify how we define output labels for the LM, MLM and MT objectives in Section 2.) For example, a language model constructs a representation which is not only useful for predicting an output label (in this case, the next token), but also informative for producing representations of subsequent tokens in a sentence. This is different from the MT setting, where there is no single encoder state from which an output label is predicted. We hypothesize that the training procedure (or, in our notation, the task) defines:

1. the nature of the changes a token representation undergoes from layer to layer;
2. the process of interactions and relationships between tokens;
3. the type of information which gets lost and acquired by a token representation in these interactions.

In this work, we study how the choice of objective affects the process by which information is encoded in token representations of the Transformer (Vaswani et al., 2017), as this architecture achieves state-of-the-art results on tasks with very different objectives such as machine translation (MT) (Bojar et al., 2018; Niehues et al., 2018), standard left-to-right language modeling (LM) (Radford et al., 2018) and masked language modeling (MLM) (Devlin et al., 2018).
The Transformer encodes a sentence by iteratively updating representations associated with each token, starting from a context-agnostic representation consisting of a positional and a token embedding. At each layer, token representations exchange information among themselves via multi-head attention, and this information is then propagated to the next layer via a feed-forward transformation. We investigate how this process depends on the choice of objective function (LM, MLM or MT) while keeping the data and model architecture fixed.

We start by illustrating the process of information loss and gain in representations of individual tokens, estimating the mutual information between token representations at each layer and the input token identity (i.e. the word type) or the output label (e.g., the next word for LM).

Then, we investigate the behavior of token representations from two perspectives: how they influence and are influenced by other tokens. Using canonical correlation analysis, we evaluate the extent of change a representation undergoes and the degree of influence. We reveal differences in the patterns of this behavior for different tasks.

Finally, we study which type of information gets lost and gained in the interactions between tokens, and to what extent a certain property is important for defining a token representation at each layer and for each task. As the properties, we consider token identities ('word type'), positions, identities of surrounding tokens and CCG supertags. In these analyses we rely on similarity computations.

We find that (1) with the LM objective, as we go from bottom to top layers, information about the past gets lost and predictions about the future get formed; (2) for MLM, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top layers; (3) for MT, though representations get refined with context, less processing is happening and most information about the word type does not get lost. This provides us with a hypothesis for why the MLM objective may be preferable to LM in the pretraining context: with LM, neither information about the current token and its past nor about its future is represented well, the former because this information gets discarded, the latter because the model does not have access to the future.

Our key contributions are as follows:

• we propose to view the evolution of a token representation between layers from the compression/prediction trade-off perspective;

• we conduct a series of experiments supporting this view and showing that two processes, losing information about the input and accumulating information about the output, take place in the evolution of representations (for MLM, these processes are clearly distinguished and can be viewed as two stages, 'context encoding' and 'token prediction');

• we relate our results to findings from previous work, putting them in the proposed perspective, and provide insights into the internal workings of the Transformer trained with different objectives;

• we propose an explanation for the superior performance of the MLM objective over the LM one for pretraining.

All analysis is done in a model-agnostic manner by investigating properties of token representations at each layer, and can, in principle, be applied to other multi-layer deep models (e.g., multi-layer RNN encoders).
In this section, we describe the tasks we consider. For each task, we define the input X and the output Y.

Machine translation. Given a source sentence X = (x_1, x_2, ..., x_S) and a target sentence Y = (y_1, y_2, ..., y_T), NMT models predict words in the target sentence word by word, i.e. provide estimates of the conditional distribution p(y_i | X, y_{1:i-1}, θ). We train a standard Transformer for the translation task and then analyze its encoder. In contrast to the other two tasks described below, representations from the top layers are not directly used to predict output labels but to encode the information which is then used by the decoder.

Language modeling. LMs estimate the probability of a word given the previous words in a sentence, P(x_t | x_1, ..., x_{t-1}, θ). More formally, the model is trained with inputs X = (x_1, ..., x_{t-1}) and outputs Y = (x_t), where x_t is the output label predicted from the final (i.e. top-layer) representation of the token x_{t-1}. It is straightforward to apply the Transformer to this task (Radford et al., 2018; Lample and Conneau, 2019).

Masked language modeling. We also consider the MLM objective (Devlin et al., 2018), randomly sampling a fraction of the tokens to be predicted (15% in Devlin et al. (2018)). We replace the corresponding input token by [MASK] or a random token (in 80% and 10% of the cases, respectively, following Devlin et al. (2018)), keeping it unchanged otherwise. For a sentence (x_1, x_2, ..., x_S), where token x_i is replaced with ~x_i, the model receives X = (x_1, ..., x_{i-1}, ~x_i, x_{i+1}, ..., x_S) as input and needs to predict Y = (x_i). The label x_i is predicted from the final representation of the token ~x_i.

Data and setting. As described below, for a fair comparison we use the same training data, model architecture and parameter initialization across all the tasks. In order to make sure that our findings are reliable, we also use multiple datasets and repeat experiments with different random initializations for each task. We train all models on data from the WMT news translation shared task. We conduct separate series of experiments using two language pairs: WMT 2017 English-German (5.8m sentence pairs) and WMT 2014 English-French (40.8m sentence pairs). For language modeling, we use only the source side of the parallel data. We remove 2.8m randomly chosen sentence pairs from the English-French dataset and use their source side for analysis; English-French models are trained on the remaining 38m sentence pairs. We consider different dataset sizes (2.5m and 5.8m for English-German; 2.5m, 5.8m and 38m for English-French). Our findings hold for all languages, dataset sizes and initializations. In the following, all illustrations are provided for the models trained on the full English-German dataset (5.8m sentence pairs). We follow the setup and training procedure of the Transformer base model (Vaswani et al., 2017); for details, see the appendix.
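To make the MLM input construction above concrete, here is a minimal sketch (our illustration, not the authors' code; the function name, the toy vocabulary and the sampling rates follow the BERT-style recipe of Devlin et al. (2018)):

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, vocab, mask_prob=0.15):
    """Return (input tokens, {position: gold label}) for one sentence.

    BERT-style recipe: sample a subset of positions; replace the token with
    [MASK] 80% of the time, with a random token 10% of the time, and keep it
    unchanged otherwise. The label is always the original token.
    """
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                      # 80%: mask symbol
        elif r < 0.9:
            inputs[i] = random.choice(vocab)      # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets

# toy usage
vocab = ["the", "cat", "sat", "on", "a", "mat", "."]
print(make_mlm_example(["the", "cat", "sat", "on", "the", "mat", "."], vocab))
```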
In this section, we give an intuitive explanation of the Information Bottleneck (IB) principle (Tishby et al., 1999) and consider a direct application of this principle to our analysis.
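Before the detailed description, it may help to state the IB objective in a standard form (this formulation and the trade-off parameter β are a common textbook presentation, not the paper's own notation):

```latex
% Information Bottleneck: find a compressed representation \tilde{X} of the
% input X that keeps as much information about the output Y as possible.
\max_{p(\tilde{x} \mid x)} \; I(\tilde{X}; Y) \;-\; \beta \, I(\tilde{X}; X),
\qquad \beta > 0
```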
The IB method (Tishby et al., 1999) considers a joint distribution of input-output pairs p(X, Y) and aims to extract a compressed representation ~X of an input X such that ~X retains as much information as possible about the output Y. More formally, the IB method maximizes the mutual information (MI) with the output, I(~X; Y), while penalizing the MI with the input, I(~X; X). The latter term in the objective ensures that the representation is indeed a compression. Intuitively, the choice of the output variable Y determines the split of X into irrelevant and relevant features: the relevant features need to be retained, while the irrelevant ones should be dropped.

Tishby and Zaslavsky (2015) argue that computation in a multi-layered neural model can be regarded as an evolution towards the theoretical optimum of the IB objective. A sequence of layers is viewed as a Markov chain, and the process of obtaining Y corresponds to compressing the representation as it flows across layers, retaining only information relevant to predicting Y. This implies that Y defines the information flow in the model. Since Y is different for each model, we expect to see different patterns of information flow across models, and this is the focus of our study.

In this work, we view every sequence model (MT, LM and MLM) as learning a function from input X to output Y. The input is a sequence of tokens X = (x_1, x_2, ..., x_n) and the output Y is defined in Section 2. Recall that we focus on representations of individual tokens in every layer rather than on the representation of the entire sequence. We start our analysis of divergences in the information flow for different objectives by estimating the amount of information about input or output tokens retained in the token representation at each layer.

Inspired by Tishby and Zaslavsky (2015), we estimate MI between token representations at a certain layer and an input token. To estimate MI, we need to treat token representations at a layer as samples from a discrete distribution. To obtain such a distribution, the authors of the original work (Shwartz-Ziv and Tishby, 2017) binned the neurons' arctan activations; using these discretized values for each neuron in a layer, they were able to treat a layer representation as a discrete variable. They considered neural networks with at most 12 neurons per layer, but in practical scenarios (e.g., we have 512 neurons in each layer) this approach is not feasible. Instead, similarly to Sajjadi et al. (2018), we discretize the representations by clustering them into a large number of clusters and then use the cluster labels instead of the continuous representations in the MI estimator.

Specifically, we take only representations of the 1000 most frequent (sub)words. We gather representations for 5 million occurrences of these at each layer for each of the three models. We then cluster the representations into N = 10000 clusters using mini-batch k-means with k = 100. In the experiments studying the mutual information between a layer and source (or target) labels we further filter occurrences: we take only occurrences where the source and target labels are among the top 1000 most frequent (sub)words.

First, we estimate the MI between an input token and a representation of this token at each layer.
Figure 1: The mutual information between an input token and a representation of this token at each layer.
Figure 2: The mutual information between token representations at a layer and source (or target) tokens ((a) LM, (b) MLM). For MLM, only tokens replaced at random are considered, to get examples where input and output are different.

In this experiment, we form the data for MLM as in the test regime; in other words, the input token is always the same as the output token. Results are shown in Figure 1. For LM, the amount of relevant information about the current input token decreases. This agrees with our expectations: some of the information about the history is intuitively not relevant for predicting the future. MT shows a similar behavior, but the decrease is much less sharp. This is again intuitive: information about the exact identity is likely to be useful for the decoder. The most interesting and surprising graph is for MLM: first, similarly to the other models, the information about the input token gets lost, but then, at the two top layers, it gets recovered. We will refer to these phases in further discussion as context encoding and token reconstruction, respectively. Whereas such non-monotonic behavior is impossible when analyzing entire layers, as in Tishby and Zaslavsky (2015), in our case it suggests that this extra information is obtained from other tokens in the context.

We perform the same analysis but now measure MI with the output label for LM and MLM. In this experiment, we form the data for MLM as in training, masking or replacing a fraction of tokens. We then take only the tokens replaced with a random one, to get examples where input and output tokens are different. Results are shown in Figure 2. We can see that, as expected, MI with input tokens decreases while MI with output tokens increases. Both LM and MLM are trained to predict a token (the next one for LM, the current one for MLM) by encoding input and context information. While in Figure 1 we observed monotonic behavior for LM, when looking at the information with both input and output tokens we can see the two processes, losing information about the input and accumulating information about the output, for both LM and MLM models. For MLM these processes are more distinct and can be thought of as the context encoding and token prediction (compression/prediction) stages. For MT, since nothing is predicted directly, we see only the encoding stage of this process. This observation also relates to the findings by Blevins et al. (2018): they show that up to a certain layer the performance of representations obtained from a deep multi-layer RNN LM improves on a constituent labeling task, but then decreases, while for representations obtained from an MT encoder performance continues to improve up to the highest layer. We further support this view with other experiments in Section 6.3.

Even though the information-theoretic view provides insights into the processes shaping the representations, direct MI estimation from finite samples for densities on multi-dimensional spaces is challenging (Paninski, 2003). For this reason, in the subsequent analysis we use more well-established frameworks such as canonical correlation analysis to provide new insights and also to corroborate the findings made in this section (e.g., the presence of two phases in MLM encoding). Even though we will be using different machinery, we will focus on the same two IB-inspired questions: (1) how does information flow across layers? and (2) what information does a layer represent?
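A minimal sketch of the discretization-based MI estimate described above (our illustration, not the authors' code; the function name, the sklearn-based clustering and estimator, and the toy data are assumptions):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import mutual_info_score

def layer_token_mi(representations, token_ids, n_clusters=10_000, seed=0):
    """Estimate I(layer representation; token) by discretizing representations.

    representations: (num_occurrences, hidden_size) states at one layer
    token_ids:       (num_occurrences,) input (or output) token ids
    """
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    cluster_ids = km.fit_predict(representations)
    # MI between two discrete variables (token id vs. cluster label), in nats
    return mutual_info_score(token_ids, cluster_ids)

# toy usage with random data (in the paper: ~5M occurrences of the 1000 most
# frequent subwords, 10000 clusters, one estimate per layer and per model)
reps = np.random.randn(5000, 512)
toks = np.random.randint(0, 1000, size=5000)
print(layer_token_mi(reps, toks, n_clusters=100))
```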
In this section, we analyze the flow of information. The questions we ask include: how much processing is happening in a given layer; which tokens influence other tokens most; which tokens gain most information from other tokens. As we will see, these questions can be reduced to a comparison between network representations. We start by describing the tool we use.
We rely on the recently introduced projection weighted Canonical Correlation Analysis (PWCCA) (Morcos et al., 2018), an improved version of SVCCA (Raghu et al., 2017). Both approaches are based on classic Canonical Correlation Analysis (CCA) (Hotelling, 1936).

CCA is a multivariate statistical method for relating two sets of observations arising from an underlying process. In our setting, the underlying process is a neural network trained on some task. The two sets of observations can be seen as 'two views' on the data: intuitively, we look at the same data (tokens in a sentence) from two standpoints. For example, one view is one layer and the other view is another layer. Alternatively, one view can be the l-th layer in one model, whereas the other view can be the same l-th layer in another model. CCA lets us measure the similarity between pairs of views.

Formally, given a set of tokens (x_1, x_2, ..., x_N) (with the sentences they occur in), we gather their representations produced by two models (m_1 and m_2) at layers l_1 and l_2, respectively. To achieve this, we encode the whole sentences and take the representations of the tokens we are interested in. We get two views of these tokens by the models: v^(m_1,l_1) = (x_1^(m_1,l_1), x_2^(m_1,l_1), ..., x_N^(m_1,l_1)) and v^(m_2,l_2) = (x_1^(m_2,l_2), x_2^(m_2,l_2), ..., x_N^(m_2,l_2)). The representations are gathered in two matrices X_1 ∈ M_{a,N} and X_2 ∈ M_{b,N}, where a and b are the numbers of neurons in the models. These matrices are then given to CCA (specifically, PWCCA). CCA identifies a linear relation maximizing the correlation between the matrices and computes the similarity. The values of PWCCA range from 0 to 1, with 1 indicating that the observations are linear transformations of each other and 0 indicating no correlation.

In the next sections, we vary two aspects of this process: the tokens and the 'points of view'.
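A minimal sketch of a CCA-based similarity between two views (our illustration; this is plain unweighted CCA rather than the projection-weighted PWCCA variant actually used in the paper, and the function name and toy data are placeholders):

```python
import numpy as np

def mean_cca_similarity(X, Y):
    """Mean canonical correlation between two views of the same N tokens.

    X: (N, a) representations from view 1 (e.g., layer l of model m1)
    Y: (N, b) representations from view 2 (e.g., layer l of model m2)
    Returns a value in [0, 1]; 1 - similarity can be read as a CCA distance.
    PWCCA (Morcos et al., 2018) additionally weights the CCA directions by
    how much of the representation they account for.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # singular values of Qx^T Qy are the canonical correlations
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return corrs.mean()

# toy usage: two random "views" of 2000 tokens
v1, v2 = np.random.randn(2000, 512), np.random.randn(2000, 512)
print(1.0 - mean_cca_similarity(v1, v2))  # CCA-style distance
```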
We start with the analysis where we do not attempt to distinguish between different token types.
As the first step in our analysis, we measure the difference between representations learned for different tasks. In other words, we compare representations v^(m_1,l) and v^(m_2,l) at different layers l. Here the data is all tokens of 5000 sentences. We also quantify differences between representations of models trained with the same objective but different random initializations. The results are provided in Figure 3a.

Figure 3: PWCCA distance (a) between representations of different models at each layer ("init." indicates different initializations), (b) between consecutive layers of the same model.

First, differences due to the training objective are much larger than those due to random initialization of a model. This indicates that PWCCA captures underlying differences in the types of information learned by a model rather than those due to randomness in the training process.

The MT and MLM objectives produce representations that are closer to each other than to LM's representations. The reason for this might be two-fold. First, for LM only preceding tokens are in the context, whereas for MT and MLM it is the entire sentence. Second, both MT and MLM focus on a given token, as it either needs to be reconstructed or translated. In contrast, LM produces a representation needed for predicting the next token.
In a similar manner, we measure the difference between representations of consecutive layers in each model (Figure 3b). In this case we take views v^(m,l) and v^(m,l+1) and vary the layer l and the task m. For MT, the extent of change monotonically decreases when going from the bottom to the top layers, whereas there is no such monotonicity for LM and MLM. This mirrors our view of LM and especially MLM as undergoing phases of encoding and reconstruction (see Section 4.2), thus requiring a stage of dismissing information irrelevant to the output, which, in turn, is accompanied by large changes in the representations between layers.

In this section, we select tokens with some predefined property (e.g., frequency) and investigate how much these tokens are influenced by other tokens or how much they influence other tokens.
Amount of change.
We measure the extent of change for a group of tokens as the PWCCA distance between the representations of these tokens for a pair of adjacent layers (l, l+1). This quantifies the amount of information the tokens receive in this layer.

Influence. To measure the influence of a token at the l-th layer on other tokens, we measure the PWCCA distance between two versions of the representations of the other tokens in a sentence: first, after encoding as usual; second, when encoding the first l-1 layers as usual and masking out the influencing token at the l-th layer. (By masking out we mean that other tokens are forbidden to attend to the chosen one.)

Figure 4: Token change vs. its frequency ((a) MT, (b) LM, (c) MLM).

Figure 4 shows a clear dependence of the amount of change on token frequency. Frequent tokens change more than rare ones in all layers in both LM and MT. Interestingly, unlike MT, for LM this dependence dissipates as we move towards the top layers. We can speculate that top layers focus on predicting the future rather than incorporating the past, and, at that stage, the frequency of the last observed token becomes less important. The behavior for MLM is quite different. The two stages for MLM could already be seen in Figures 1 and 3b; they are even more pronounced here. The transition from a generalized token representation, formed at the encoding stage, to recreating the token identity apparently requires more changes for rare tokens.

Figure 5: Token influence vs. its frequency ((a) MT, (b) LM, (c) MLM).

When measuring influence, we find that rare tokens generally influence more than frequent ones (Figure 5). We notice an extreme influence of rare tokens at the first MT layer and at all LM layers. In contrast, rare tokens are not the most influencing ones at the lower layers of MLM. We hypothesize that the training procedure of MLM, with masking out some tokens or replacing them with random ones, teaches the model not to over-rely on these tokens before their context is well understood. To test this hypothesis, we additionally trained MT and LM models with token dropout on the input side (Figure 6). As we expected, there is no extreme influence of rare tokens when using this regularization, supporting the above interpretation. Interestingly, our earlier study of the MT Transformer (Voita et al., 2019) shows how this influence of rare tokens is implemented by the model: in that work, we observed that, for any considered language pair, there is one dedicated attention head in the first encoder layer which tends to point to the least frequent tokens in every sentence. The above analysis suggests that this phenomenon is likely due to overfitting. We also analyzed the extent of change and influence splitting tokens according to their part of speech; see the appendix for details.
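A rough illustration of the masking idea behind the influence measure (our own sketch with an untrained toy PyTorch encoder; the paper's actual implementation, and whether masking applies only at layer l or from layer l onwards, may differ):

```python
import torch
import torch.nn as nn

layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
     for _ in range(6)]
)

def encode(x, blocked_pos=None, block_from_layer=None):
    """x: (seq_len, batch, 512). If blocked_pos is given, tokens cannot attend
    to that position from layer `block_from_layer` onwards (an approximation:
    the key-padding mask also hides the position from itself)."""
    seq_len, batch, _ = x.shape
    mask = None
    if blocked_pos is not None:
        mask = torch.zeros(batch, seq_len, dtype=torch.bool)
        mask[:, blocked_pos] = True  # True = this key is ignored by attention
    states = []
    for l, layer in enumerate(layers):
        use_mask = mask if (mask is not None and l >= block_from_layer) else None
        x = layer(x, src_key_padding_mask=use_mask)
        states.append(x)
    return states

tokens  = torch.randn(10, 1, 512)              # 10 tokens, batch of 1
plain   = encode(tokens)
blocked = encode(tokens, blocked_pos=3, block_from_layer=2)
# compare plain[l] and blocked[l] for the *other* tokens (e.g., with the
# CCA-based similarity above) to quantify the influence of token 3
```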
Figure 6: Token influence vs. its frequency for models trained with word dropout (in training, each input token is replaced with a random token with probability 10%); (a) MT, (b) LM.

Whereas in the previous section we were interested in quantifying the amount of information exchanged between tokens, here we primarily want to understand what the representation in each layer 'focuses' on. We evaluate to what extent a certain property is important for defining a token representation at each layer by (1) selecting a large number of token occurrences and taking their representations, and (2) validating whether the value of the property is the same for the token occurrences corresponding to the closest representations. Though our approach is different from probing tasks, we choose properties which enable us to relate to other works reporting similar behaviour (Zhang and Bowman, 2018; Blevins et al., 2018; Tenney et al., 2019a). The properties we consider are token identity, position in a sentence, neighboring tokens and CCG supertags.

For our analysis, we take 100 random word types from the top 5,000 in our vocabulary. For each word type, we gather 1,000 different occurrences along with their representations from all three models. For each representation, we take the closest neighbors among the representations at each layer and evaluate the percentage of neighbors with the same value of the property.
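A minimal sketch of this neighbor-based evaluation (our illustration; the function name, the choice of k and the sklearn-based neighbor search are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def property_agreement(representations, property_values, k=5):
    """Fraction of the k nearest neighbors (per occurrence) that share the
    occurrence's property value (e.g., position, left neighbor, CCG supertag).

    representations: (num_occurrences, hidden_size) states at one layer
    property_values: (num_occurrences,) property value of each occurrence
    """
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(representations)
    # the closest point is the occurrence itself, hence k + 1 and [:, 1:]
    _, idx = nn_index.kneighbors(representations)
    neighbor_props = property_values[idx[:, 1:]]
    return (neighbor_props == property_values[:, None]).mean()

# toy usage
reps  = np.random.randn(1000, 512)
props = np.random.randint(0, 50, size=1000)   # e.g., position in the sentence
print(property_agreement(reps, props, k=5))
```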
In this section, we track the loss of information about token identity (i.e., word type) and position. Our motivation is three-fold. First, this will help us to confirm the results provided in Figure 1; second, it allows us to relate to work reporting results for probing tasks predicting token identity. Finally, the Transformer starts encoding a sentence from a positional and a word embedding, so it is natural to look at how this information is preserved.
We want to check to what extent a model confuses representations of different words. For each of the considered words, we add 9000 representations of words which could potentially be confused with it. For this extended set of representations, we follow the methodology described above.

Figure 7: Preserving (a) token identity, (b) position.

Figure 8: t-SNE of different occurrences of the tokens "is" (red), "are" (orange), "was" (blue), "were" (light blue). On the x-axis are layers.

Results are presented in Figure 7a. Reassuringly, the plot is very similar to the one computed with MI estimators (Figure 1), further supporting the interpretations we gave previously (Section 4). Now, let us recall the findings by Zhang and Bowman (2018) regarding the superior performance of untrained LSTMs over trained ones on the task of token identity prediction. They mirror our view of the evolution of a token representation as going through compression and prediction stages, where the learning objective defines the process of forgetting information: if a network is not trained, it is not forced to forget input information.

Figure 8 shows how representations of different occurrences of the words "is", "are", "was", "were" get mixed in MT and LM layers and disambiguated for MLM. For MLM, 15% of tokens were masked as in training. In the first layer, these masked states form a cluster separate from the others, and then they get disambiguated as we move bottom-up across the layers.
We evaluate the average distance between the position of the current occurrence and the positions of the top 5 closest representations (see the appendix for details). The results are provided in Figure 7b. This illustrates how information about the input (in this case, position), potentially not so relevant to the output (e.g., the next word for LM), gets gradually dismissed. As expected, encoding input positions is more important for MT, so this effect is more pronounced for LM and MLM. An illustration is in Figure 9: for MT, even at the last layer the ordering by position is noticeable.

Figure 9: t-SNE of different occurrences of the token "it", position in color (the larger the word index, the darker its color). On the x-axis are layers.

Figure 10: Preserving immediate neighbors ((a) left, (b) right).
In this section, we look at two properties: the identities of the immediate neighbors of the current token and the CCG supertag of the current token. On the one hand, these properties represent a model's understanding of different types of context: lexical (neighboring token identities) and syntactic. On the other hand, they are especially useful for our analysis since they can be split into information about the 'past' and the 'future' by taking either the left or right neighbor, or the corresponding part of a CCG tag.
Figure 10 supports our previous expectation that for LM the importance of the previous token decreases, while information about the future token is being formed. For MLM, the importance of neighbors increases until the second layer and decreases after. This may reflect the stages of context encoding and token reconstruction.
Figure 11: Preserving CCG supertag: (a) full tag, (b) part corresponding to the preceding words, (c) part corresponding to the following words.
Results are provided in Figure 11a. As in previous experiments, the importance of the CCG tag for MLM degrades at higher layers. This agrees with the work by Tenney et al. (2019a): the authors observe that for different tasks (e.g., part-of-speech, constituents, dependencies, semantic role labeling, coreference) the contribution of a layer to a task increases up to a certain layer, but then decreases at the top layers. (In their experiments, representations are pooled across layers with a scalar mixing technique similar to the one used in the ELMo model (Peters et al., 2018); the probing classifier is trained jointly with the mixing weights, and the learned coefficients are used to estimate the contribution of different layers to a particular task.) Our work gives insights into the underlying process defining this behavior. To derive CCG supertags, we use the tagger by Yoshikawa et al. (2017), the latest version with ELMo: https://github.com/masashi-y/depccg.

For LM these results are not really informative, since it does not have access to the future. We go further and measure the importance of the parts of a CCG tag corresponding to the previous (Figure 11b) and next (Figure 11c) parts of a sentence. It can be clearly seen that LM first accumulates information about the left part of the CCG tag, understanding the syntactic structure of the past. Then this information gets dismissed while information about the future is being formed.

Figure 12 shows how representations of different occurrences of the token "is" get reordered in the space according to CCG tags (colors correspond to tags).

Figure 12: t-SNE of different occurrences of the token "is", CCG tag in color (the intensity of a color is the token position). On the x-axis are layers.

Related Work

Previous work analyzed representations of MT and/or LM models by using probing tasks. Different levels of linguistic analysis have been considered, including morphology (Belinkov et al., 2017a; Dalvi et al., 2017; Bisazza and Tump, 2018), syntax (Shi et al., 2016; Tenney et al., 2019b) and semantics (Hill et al., 2017; Belinkov et al., 2017b; Raganato and Tiedemann, 2018; Tenney et al., 2019b). Our work complements this line of research by analyzing how word representations evolve between layers and gives insights into how models trained on different tasks come to represent different information.

Canonical correlation analysis has been previously used to investigate the learning dynamics of CNNs and RNNs, to measure the intrinsic dimensionality of layers in CNNs, and to compare representations of networks which memorize and generalize (Raghu et al., 2017; Morcos et al., 2018). Bau et al. (2019) used SVCCA as one of the methods for identifying important individual neurons in NMT models. Saphra and Lopez (2019) used SVCCA to investigate how representations of linguistic structure are learned over time in LMs.
Conclusions

In this work, we analyze how the learning objective determines the information flow in the model. We propose to view the evolution of a token representation between layers from the compression/prediction trade-off perspective. We conduct a series of experiments supporting this view and propose a possible explanation for the superior performance of MLM over LM for pretraining. We relate our findings to observations previously made in the context of probing tasks.
Acknowledgments
We would like to thank the anonymous reviewers for their comments. The authors also thank Artem Babenko, David Talbot and the Yandex Machine Translation team for helpful discussions and inspiration. Ivan Titov acknowledges support of the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support from the Swiss National Science Foundation (PP00P1_176727).

References
Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2019. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, New Orleans.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.
Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10. Asian Federation of Natural Language Processing.
Arianna Bisazza and Clara Tump. 2018. The lazy encoder: A fine-grained analysis of the role of morphology in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2871–2876, Brussels, Belgium. Association for Computational Linguistics.
Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19, Melbourne, Australia. Association for Computational Linguistics.
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels. Association for Computational Linguistics.
Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 142–151. Asian Federation of Natural Language Processing.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Felix Hill, Kyunghyun Cho, Sébastien Jean, and Yoshua Bengio. 2017. The representational geometry of word meanings acquired by neural machine translation models. Machine Translation, 31.
Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28:321–337.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, Montreal, Canada.
Jan Niehues, Ronaldo Cattoni, Sebastian Stüker, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 Evaluation Campaign. In Proceedings of the 15th International Workshop on Spoken Language Translation, pages 118–123, Bruges, Belgium.
Liam Paninski. 2003. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.
Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NeurIPS, Los Angeles.
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems 31, pages 5228–5237.
Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3257–3267, Minneapolis, Minnesota. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534. Association for Computational Linguistics.
Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.
Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, Los Angeles.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
Masashi Yoshikawa, Hiroshi Noji, and Yuji Matsumoto. 2017. A* CCG parsing with a supertag and dependency factored model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 277–287, Vancouver, Canada. Association for Computational Linguistics.
Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361, Brussels, Belgium. Association for Computational Linguistics.
A Data and Setting
A.1 Model architecture
For machine translation models, we follow the setup of the Transformer base model (Vaswani et al., 2017). More precisely, the number of layers in both the encoder and the decoder is N = 6. We use h = 8 parallel attention layers, i.e. heads. The dimensionality of input and output is d_model = 512, and the inner layer of the feed-forward networks has dimensionality d_ff = 2048. For language models, we use only the encoder of the model (with the same hyper-parameters).
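A sketch of an encoder with these hyper-parameters in PyTorch (an illustration of the configuration only, not the authors' training code; the dropout value is the Transformer base default and an assumption here):

```python
import torch.nn as nn

# Transformer-base encoder: N = 6 layers, h = 8 heads,
# d_model = 512, d_ff = 2048 (Vaswani et al., 2017)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
```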
A.2 Training

Sentences were encoded using byte-pair encoding (Sennrich et al., 2016), with source and target vocabularies of about 32000 tokens. The source vocabulary is the same for all tasks. Training examples were batched together by approximate sequence length; each training batch contained a set of translation pairs totaling approximately 15000 source tokens. The optimizer and learning rate schedule are the same as in Vaswani et al. (2017). Since using a large number of training steps was reported to be important for the MLM objective, we follow Devlin et al. (2018) and train MLM for 1 million training steps, and the other models until convergence.
B Fine-grained analysis of change and influence: varying PoS
Figure 13 shows the amount of change for different parts of speech, and Figure 14 the amount of influence for different parts of speech. (We use the part-of-speech tagger from Stanford CoreNLP (Manning et al., 2014).) Generally, the patterns are similar to the ones for frequency groups: parts of speech with frequent tokens (prepositions, conjunctions, etc.) change more and influence less.

C What does a layer represent?
C.1 Preserving token identity: experimental setup
In this section, we want to check to what extent a model confuses representations of different tokens. For each of the selected tokens described above ("main" tokens), we pick tokens which could potentially be confused with the token under consideration.

Figure 13: Token change vs. its part of speech ((a) MT, (b) LM, (c) MLM).

Figure 14: Token influence vs. its part of speech ((a) MT, (b) LM, (c) MLM).