Position Information in Transformers: An Overview
Philipp Dufter*, Martin Schmitt*, Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
{philipp,martin}@cis.lmu.de

ABSTRACT
Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition a Transformer is invariant with respect to reorderings of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this paper, we provide an overview of common methods to incorporate position information into Transformer models. The objectives of this survey are to i) showcase that position information in Transformers is a vibrant and extensive research area; ii) enable the reader to compare existing methods by providing a unified notation and meaningful clustering; iii) indicate what characteristics of an application should be taken into account when selecting a position encoding; iv) provide stimuli for future research.
1 INTRODUCTION
The Transformer model as introduced by Vaswani et al. (2017) has been found to perform well for many tasks, such as machine translation or language modeling. With the rise of pretrained language models (PLMs) (Peters et al., 2018; Howard & Ruder, 2018; Devlin et al., 2019; Brown et al., 2020) Transformer models have become even more popular. As a result they are at the core of many state-of-the-art natural language processing (NLP) models. A Transformer model consists of several layers, or blocks. Each layer is a self-attention (Vaswani et al., 2017) module followed by a feed-forward layer. Layer normalization and residual connections are additional components of a layer.

A Transformer model itself is invariant with respect to re-orderings of the input. However, text data is inherently sequential. Without position information the meaning of a sentence is not well-defined (e.g., the sequence "the cat chases the dog" vs. the multi-set {the, the, dog, chases, cat}). Clearly it should be beneficial to incorporate this essential inductive bias into any model that processes text data.

Therefore, there is a range of different methods to incorporate position information into NLP models, especially PLMs that are based on Transformer models. Adding position information can be done by using position embeddings, manipulating attention matrices, or preprocessing the input with a recurrent neural network. Overall there is a huge variety of methods that add both absolute and relative position information to Transformer models. Similarly, many papers analyze and compare a subset of position embedding variants. But, to the best of our knowledge, there is no broad overview of relevant work on position information in Transformers that systematically aggregates and categorizes existing approaches and analyzes the differences between them.

The objective of this paper is to provide an overview of methods that incorporate and analyze position information in Transformer models. More specifically we aim at i) showcasing that position information in Transformers is a vibrant and extensive research area; ii) enabling the reader to compare existing methods by providing a unified notation and meaningful clustering; iii) indicating what characteristics of an application should be taken into account when selecting a position encoding; iv) providing stimuli for future research. This paper is work in progress. We plan to actively continue adding papers that we missed or that are newly published. If you want to contribute, spot an error or miss a paper, please do reach out to us.

* Equal contribution - random order.
Figure 1: A rough overview of a plain Transformer encoder block (grey block) without any position information. The grey block is usually repeated for l layers. An overview of the actual attention computation is shown on the right.

2 BACKGROUND
2.1 NOTATION
We denote scalars with lowercase letters x ∈ R, vectors with boldface x ∈ R^d, and matrices with boldface uppercase letters X ∈ R^{t × d}. We index vectors and matrices as follows: (x_i)_{i=1,...,d} = x and (X_{ij})_{i=1,...,t, j=1,...,d} = X. Further, the i-th row of X is the vector X_i ∈ R^d. The transpose is denoted as X⊤. When we are referring to positions we use r, s, t, ..., whereas we use i, j, ... to denote components of a vector. The maximum sequence length is called t_max.

2.2 TRANSFORMER MODEL
Attention mechanisms were first used in the context of machine translation by Bahdanau et al. (2015). While they still relied on a recurrent neural network at its core, Vaswani et al. (2017) proposed a model that relies on attention only. They found that it outperforms recurrent neural network approaches by large margins on the machine translation task. In their paper they introduced a new neural network architecture, the Transformer model, which is an encoder-decoder architecture. We now briefly describe the essential building block, the
Transformer encoder block, as shown in Figure 1. One block, also called layer, is a function f_θ: R^{t_max × d} → R^{t_max × d} with f_θ(X) =: Z that is defined by

    A = (1/√d) X W^(q) (X W^(k))⊤
    M = SoftMax(A) X W^(v)
    O = LayerNorm(M + X)                                    (1)
    F = ReLU(O W^(f1) + b^(f1)) W^(f2) + b^(f2)
    Z = LayerNorm(O + F).

Here, SoftMax(A)_{ts} = e^{A_{ts}} / Σ_{k=1}^{t_max} e^{A_{tk}} is the row-wise softmax function, LayerNorm(X)_t = g ⊙ (X_t − μ(X_t))/σ(X_t) + b is layer normalization (Lei Ba et al., 2016), where μ(x), σ(x) return the mean and standard deviation of a vector, and ReLU(X) = max(0, X) is the maximum operator applied componentwise. Note that for addition of a vector to a matrix we assume broadcasting as implemented in NumPy (https://numpy.org/). Overall the parameters of a single layer are

    θ = (W^(q), W^(k), W^(v) ∈ R^{d × d}, g^(1), g^(2), b^(1), b^(2) ∈ R^d,      (2)
         W^(f1) ∈ R^{d × d_f}, W^(f2) ∈ R^{d_f × d}, b^(f1) ∈ R^{d_f}, b^(f2) ∈ R^d),

with d the hidden dimension, d_f the intermediate dimension, and t_max the maximum sequence length. It is common to consider multiple, say h, attention heads. More specifically, W^(q), W^(k), W^(v) ∈ R^{d × d_h} where d = h d_h. Subsequently, the matrices M^(m) ∈ R^{t_max × d_h} coming from each attention head m are concatenated along the second dimension to obtain M. A full Transformer model is then the function T: R^{t_max × d} → R^{t_max × d} that consists of the composition of multiple, say l, layers, i.e., T(X) = f_{θ_l} ∘ f_{θ_{l−1}} ∘ ... ∘ f_{θ_1}(X).

When considering an input U = (u_1, u_2, ..., u_t) that consists of t units, such as characters, subwords, or words, first unit embeddings U ∈ R^{t_max × d} are created by a lookup in the embedding matrix E ∈ R^{n × d} with n being the vocabulary size. More specifically, U_i = E_{u_i} is the embedding vector that corresponds to the unit u_i. Finally, U is then (among others) used as input to the Transformer model. In the case that U is shorter than t_max, it is padded, i.e., filled with special PAD symbols.
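To make the notation concrete, the following is a minimal single-head NumPy sketch of Eq. 1 (no dropout, no multi-head splitting); variable names mirror the notation above and the toy sizes are arbitrary.

```python
import numpy as np

def softmax(a):
    # row-wise softmax as defined above
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, g, b, eps=1e-5):
    # LayerNorm(X)_t = g * (X_t - mu(X_t)) / sigma(X_t) + b
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return g * (x - mu) / (sigma + eps) + b

def encoder_block(X, theta):
    W_q, W_k, W_v, W_f1, b_f1, W_f2, b_f2, g1, b1, g2, b2 = theta
    d = X.shape[-1]
    A = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)            # attention matrix, Eq. 1
    M = softmax(A) @ (X @ W_v)                          # weighted sum of values
    O = layer_norm(M + X, g1, b1)                       # residual + layer norm
    F = np.maximum(0.0, O @ W_f1 + b_f1) @ W_f2 + b_f2  # feed-forward with ReLU
    return layer_norm(O + F, g2, b2)

t_max, d, d_f = 8, 16, 32
rng = np.random.default_rng(0)
theta = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
         rng.normal(size=(d, d_f)), np.zeros(d_f), rng.normal(size=(d_f, d)), np.zeros(d),
         np.ones(d), np.zeros(d), np.ones(d), np.zeros(d))
X = rng.normal(size=(t_max, d))
Z = encoder_block(X, theta)   # Z has shape (t_max, d)
```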
2.3 ORDER INVARIANCE
If we take a close look at the Transformer model, we see that it is invariant to re-orderings of the input. More specifically, consider any permutation matrix P_π ∈ R^{t_max × t_max}. When passing P_π X to a Transformer layer, one gets P_π SoftMax(A) P_π⊤ P_π X W^(v) = P_π M, as P_π⊤ P_π is the identity matrix. All remaining operations are position-wise and thus P_π T(X) = T(P_π X) for any input X. As language is inherently sequential, it is desirable to have P_π T(X) ≠ T(P_π X), which can be achieved by incorporating position information.
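This invariance is easy to check numerically. The following self-contained sketch applies a random permutation matrix to the input of a position-free self-attention layer and verifies that the output is simply permuted in the same way.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # position-free self-attention: SoftMax(X Wq (X Wk)^T / sqrt(d)) X Wv
    A = (X @ W_q) @ (X @ W_k).T / np.sqrt(X.shape[-1])
    e = np.exp(A - A.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ (X @ W_v)

rng = np.random.default_rng(0)
t, d = 6, 8
X = rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

P = np.eye(t)[rng.permutation(t)]            # permutation matrix P_pi
lhs = self_attention(P @ X, W_q, W_k, W_v)   # layer applied to permuted input
rhs = P @ self_attention(X, W_q, W_k, W_v)   # permuted output of the layer
print(np.allclose(lhs, rhs))                 # True: the layer is order-invariant
```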
2.4 ENCODER-DECODER

There are different setups how to use a Transformer model. One common possibility is to have an encoder only. For example, BERT (Devlin et al., 2019) uses a Transformer model T(X) as encoder to perform masked language modeling. In contrast, a traditional sequence-to-sequence approach can be materialized by adding a decoder. The decoder works almost identically to the encoder with two exceptions: 1) The upper triangle of the attention matrix A is usually masked in order to avoid information flow from future positions during the decoding process. 2) The output of the encoder is integrated through a cross-attention layer inserted before the feed-forward layer. See (Vaswani et al., 2017) for more details. The differences between an encoder and an encoder-decoder architecture are mostly irrelevant for the injection of position information and many architectures rely just on encoder layers. Thus for the sake of simplicity we will talk about Transformer encoder blocks in general for the rest of the paper. See §4.4 for position encodings that are tailored for encoder-decoder architectures.
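For illustration, the decoder-side masking of future positions can be realized by setting the strict upper triangle of A to −∞ before the row-wise softmax; a minimal sketch:

```python
import numpy as np

def causal_mask(A):
    # mask the strict upper triangle so position t cannot attend to positions s > t
    mask = np.triu(np.ones_like(A, dtype=bool), k=1)
    return np.where(mask, -np.inf, A)

A = np.arange(16, dtype=float).reshape(4, 4)   # toy attention scores
A_masked = causal_mask(A)
# after the row-wise softmax, the masked entries receive zero attention weight
```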
3 RECURRING CONCEPTS IN POSITION INFORMATION MODELS

While there is a variety of approaches to integrate position information into Transformers, there are some recurring ideas, which we outline here.

3.1 ABSOLUTE VS. RELATIVE POSITION ENCODING
Absolute positions encode the absolute position of a unit within a sentence. Another approach is to encode the position of a unit relative to other units. This makes sense intuitively, as in sentences like "The cat chased the dog." and "Suddenly, the cat chased the dog." the change in absolute positions comes only with a slight semantic change, whereas the relative positions of "cat" and "dog" are decisive for the meaning of the sentences.

3.2 REPRESENTATION OF POSITION INFORMATION
Adding Position Embeddings (APE). One common approach is to add position embeddings to the input before it is fed to the actual Transformer model: if U ∈ R^{t_max × d} is the matrix of unit embeddings, a matrix P ∈ R^{t_max × d} representing the position information is added, i.e., their sum is fed to the Transformer model: T(U + P).
Figure 2: Example of absolute and relative position biases that can be added to the attention matrix.
Left: self-attention matrix for an example sentence.
Middle: learnable absolute position biases.
Right: position biases with a relative reference point. They differ from absolute encodings as they exhibit an intuitive weight-sharing pattern.

For the first Transformer layer, adding position embeddings has the following effect:

    Ã = (1/√d) (U + P) W^(q) W^(k)⊤ (U + P)⊤
    M̃ = SoftMax(Ã) (U + P) W^(v)
    Õ = LayerNorm(M̃ + U + P)                               (3)
    F̃ = ReLU(Õ W^(f1) + b^(f1)) W^(f2) + b^(f2)
    Z̃ = LayerNorm(Õ + F̃).
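As a concrete illustration of APE, a minimal sketch that adds a randomly initialized, learnable absolute position embedding matrix to the unit embeddings before the first layer (toy sizes; the Transformer call itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, t_max, d = 1000, 128, 64
E = rng.normal(size=(n_vocab, d)) * 0.02   # unit embedding matrix
P = rng.normal(size=(t_max, d)) * 0.02     # learned absolute position embeddings

unit_ids = np.array([5, 42, 7, 0])         # toy input sequence u_1 .. u_t
U = E[unit_ids]                            # unit embeddings, shape (t, d)
X = U + P[: len(unit_ids)]                 # APE: add position embeddings, then feed X to T(.)
```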
Modifying Attention Matrix (MAM). Instead of adding position embeddings, other approaches directly modify the attention matrix, for example by adding absolute or relative position biases to the matrix, see Figure 2. In fact, one big effect of adding position embeddings is that it modifies the attention matrix as follows:

    Â ∼ U W^(q) W^(k)⊤ U⊤   [unit-unit]
      + P W^(q) W^(k)⊤ U⊤ + U W^(q) W^(k)⊤ P⊤   [unit-position]                   (4)
      + P W^(q) W^(k)⊤ P⊤.   [position-position]

As indicated, the matrix Â can then be decomposed into unit-unit interactions (corresponding to the original A) as well as unit-position and position-position interactions. We write ∼ as we omit the scaling factor for the attention matrix for simplicity.

As adding position embeddings results in a modification of the attention matrix, these two approaches are highly interlinked. Still, we make a distinction between them for two reasons: i) While adding position embeddings results, among other effects, in a modified attention matrix, MAM only modifies the attention matrix. ii) APE involves learning embeddings for position information, whereas MAM is often interpreted as adding or multiplying scalar biases to the attention matrix A, see Figure 2.
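A minimal sketch of MAM, not taken from any specific surveyed paper: a learnable scalar bias per clipped relative distance is added to the unit-unit attention scores (cf. the right panel of Figure 2).

```python
import numpy as np

def relative_bias(t_len, b, r_max):
    # b holds one learnable scalar per relative distance in [-r_max, r_max]
    rel = np.arange(t_len)[None, :] - np.arange(t_len)[:, None]   # rel[t, s] = s - t
    rel = np.clip(rel, -r_max, r_max)
    return b[rel + r_max]                                         # (t_len, t_len) bias matrix

rng = np.random.default_rng(0)
t_len, r_max = 5, 3
b = rng.normal(size=2 * r_max + 1)            # learnable bias parameters
A = rng.normal(size=(t_len, t_len))           # unit-unit attention scores
A_mam = A + relative_bias(t_len, b, r_max)    # MAM: position information enters only via A
```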
3.3 INTEGRATION

In theory there are many possibilities where to inject position information, but in practice the information is either integrated into the input, into each attention matrix, or directly before the output. When adding position information at the beginning, it only affects the first layer and has to be propagated to upper layers indirectly. Often, APE is only added at the beginning, and MAM approaches are used for each layer and attention head.
4 CURRENT POSITION INFORMATION MODELS
In this section we provide an overview of current position information models. Note that we use the term position information model to refer to a method that integrates position information; the term position encoding refers to a position ID associated with units (e.g., numbered from 1 to t, or assigning relative distances). A position embedding then refers to a numerical vector associated with a position encoding. We cluster position information models along two dimensions: reference point and topic, see Table 1. The following sections deal with each topic, and within each topic we discuss approaches with different reference points. Table 2 provides more details for each method and aims at making comparisons easier.

Topic: Sequential
  Absolute: (Devlin et al., 2019); (Kitaev et al., 2020); (Liu et al., 2020b); (Press et al., 2020); (Wang et al., 2020); (Dehghani et al., 2019)
  Absolute & Relative: (Shaw et al., 2018); (Ke et al., 2020); (Dufter et al., 2020); (He et al., 2020)
  Relative: (Dai et al., 2019); (Raffel et al., 2020); (Wu et al., 2020); (Huang et al., 2020); (Shen et al., 2018); (Neishi & Yoshinaga, 2019)

Topic: Sinusoidal
  Absolute: (Vaswani et al., 2017); (Li et al., 2019)
  Relative: (Yan et al., 2019)

Topic: Graphs
  Absolute: (Shiv & Quirk, 2019)
  Absolute & Relative: (Wang et al., 2019)
  Relative: (Zhu et al., 2019); (Cai & Lam, 2020); (Schmitt et al., 2021)

Topic: Decoder
  Absolute: (Takase & Okazaki, 2019); (Oka et al., 2020); (Bao et al., 2019)

Topic: Crosslingual
  Absolute: (Artetxe et al., 2020); (Ding et al., 2020); (Liu et al., 2020a); (Liu et al., 2020c)

Topic: Analysis
  Absolute: (Yang et al., 2019); (Wang & Chen, 2020)
  Absolute & Relative: (Rosendahl et al., 2019); (Wang et al., 2021)

Table 1: Overview and categorization of papers dealing with position information. We categorize along two dimensions: topic, i.e., a tag that describes the main topic of a paper, and which reference point is used for the position encodings.

4.1 SEQUENTIAL

4.1.1 ABSOLUTE POSITION ENCODINGS
The original Transformer paper considered absolute position encodings. One of the two approaches proposed by Vaswani et al. (2017) follows Gehring et al. (2017) and learns a position embedding matrix P ∈ R^{t_max × d} corresponding to the absolute positions 1, 2, ..., t_max − 1, t_max in a sequence. This matrix is simply added to the unit embeddings U before they are fed to the Transformer model (APE).

In the simplest case, the position embeddings are randomly initialized and then adapted during training of the network (Gehring et al., 2017; Vaswani et al., 2017; Devlin et al., 2019). Gehring et al. (2017) find that adding position embeddings only helps marginally in a convolutional neural network. A Transformer model without any position information, however, performs much worse for some tasks (e.g., Wang et al., 2019, Wang et al., 2021).

For very long sequences, i.e., large t_max, the number of parameters added with P is significant. Thus, Kitaev et al. (2020) proposed a more parameter-efficient factorization called axial position embeddings. Although their method is not described in the paper, a description can be found in their code (e.g., https://huggingface.co/transformers/model_doc/reformer.html). Intuitively, they have one embedding that marks a larger segment and a second embedding that indicates the position within each segment, see Figure 3 for an overview.

Figure 3: Overview of the structure of P with axial position embeddings by Kitaev et al. (2020). They use two position embeddings, one of which can be interpreted as encoding a segment (bottom, P^(2)) and the position within that segment (top, P^(1)). This factorization is more parameter efficient, especially for long sequences.

More specifically, the matrix P gets split into two embedding matrices P^(1) ∈ R^{t_1 × d_1}, P^(2) ∈ R^{t_2 × d_2} with d = d_1 + d_2 and t_max = t_1 t_2. Then

    P_{tj} = P^(1)_{r, j}        if j ≤ d_1, with r = t mod t_1,
    P_{tj} = P^(2)_{s, j − d_1}  if j > d_1, with s = ⌊t / t_1⌋.                    (5)
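A minimal sketch of the factorization in Eq. 5 (the exact parameterization in the Reformer code may differ): the within-segment embedding and the segment embedding are looked up separately and concatenated along the feature dimension.

```python
import numpy as np

def axial_positions(P1, P2):
    # P1: (t1, d1) positions within a segment, P2: (t2, d2) segment markers (Eq. 5)
    t1, _ = P1.shape
    t2, _ = P2.shape
    t = np.arange(t1 * t2)
    left = P1[t % t1]                  # j <= d1: row r = t mod t1 of P1
    right = P2[t // t1]                # j > d1:  row s = floor(t / t1) of P2
    return np.concatenate([left, right], axis=1)   # P with shape (t1*t2, d1+d2)

rng = np.random.default_rng(0)
P1 = rng.normal(size=(16, 32))         # t1=16, d1=32
P2 = rng.normal(size=(8, 32))          # t2=8,  d2=32
P = axial_positions(P1, P2)            # shape (128, 64): far fewer parameters than 128*64
```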
Liu et al. (2020b) argue that position embeddings should be parameter-efficient, data-driven, and should be able to handle sequences that are longer than any sequence in the training data. They propose a new model called flow-based Transformer or FLOATER, where they model position information with a continuous dynamic model. More specifically, consider P as a sequence of timesteps p_1, p_2, ..., p_{t_max}. They suggest to model position information as a continuous function p: R^+ → R^d with

    p(t) = p(s) + ∫_s^t h(τ, p(τ), θ_h) dτ                                          (6)

for 0 ≤ s < t with some initial value for p(0), where h is some function, e.g., a neural network with parameters θ_h. In the simplest case they then define p_i := p(iΔt) for some fixed offset Δt. They experiment both with adding the information only in the first layer and at each layer (layerwise APE). Even though they share parameters across layers, they use different initial values p(0) and thus have different position embeddings at each layer. Sinusoidal position embeddings (cf. §4.2) are a special case of their dynamic model. Further, they provide a method to use the original position embeddings of a pretrained Transformer model while adding the dynamic model during finetuning only. In their experiments they observe that FLOATER outperforms learned and sinusoidal position embeddings, especially for long sequences. Further, adding position information at each layer increases performance.

Another approach to increase the Transformer efficiency both during training and inference is to keep t_max small. The Shortformer by Press et al. (2020) caches previously computed unit representations and therefore does not need to handle a large number of units at the same time. This is made possible by what they call position-infused attention, where the position embeddings are added to the keys and queries, but not the values. Thus, the values are position independent and representations from previous subsequences can seamlessly be processed. More specifically, they propose

    Ã ∼ (U + P) W^(q) W^(k)⊤ (U + P)⊤                                               (7)
    M̃ = SoftMax(Ã) U W^(v).

The computation of the attention matrix Ã still depends on absolute position encodings in Shortformer, but M̃ does not contain it, as it is only a weighted sum of unit embeddings in the first layer. Consequently, Shortformer can attend to outputs of previous subsequences and the position information has to be added in each layer again. Press et al. (2020) report large improvements in training speed, as well as language modeling perplexity.

While the former approaches all follow the APE methodology, Wang et al. (2020) propose an alternative to simply summing position and unit embeddings. Instead of having one embedding per unit, they model the representation as a function over positions. That is, instead of feeding U_t + P_t to the model for position t, they suggest to model the embedding of unit u as a function g^(u): N → R^d such that the unit has a different embedding depending on the position at which it occurs. After having proposed desired properties for such functions (position-free offset and boundedness), they introduce complex-valued unit embeddings where the k-th component is defined as follows:

    g^(u)(t)_k = r^(u)_k exp(i(ω^(u)_k t + θ^(u)_k)).                               (8)

Then, r^(u), ω^(u), θ^(u) ∈ R^d are learnable parameters that define the unit embedding for the unit u. Their approach can also be interpreted as having a word embedding, parameterized by r^(u), that is component-wise multiplied with a position embedding, parameterized by ω^(u), θ^(u). They test these position-sensitive unit embeddings not only on Transformer models, but also on static embeddings, LSTMs, and CNNs, and observe large improvements.

4.1.2 RELATIVE POSITION ENCODINGS
Among the first, Shaw et al. (2018) introduced an alternative method for incorporating both absolute and relative position encodings. In their absolute variant they propose to change the computation to

    A_{ts} ∼ U_t⊤ W^(q) (W^(k)⊤ U_s + a^(k)_{(t,s)}),                               (9)

where a^(k)_{(t,s)} ∈ R^d models the interaction between positions t and s. Further they modify the computation of the values to

    M_t = Σ_{s=1}^{t_max} SoftMax(A)_{ts} (W^(v)⊤ U_s + a^(v)_{(t,s)}),             (10)

where a^(v)_{(t,s)} ∈ R^d again models the interaction. While it cannot directly be compared with the effect of simple addition of position embeddings, they roughly omit the position-position interaction and have only one unit-position term. In addition, they do not share the projection matrices but directly model the pairwise position interaction with the vectors a. In an ablation analysis they found that solely adding a^(k)_{(t,s)} might be sufficient.

To achieve relative positions they simply set

    a^(k)_{(t,s)} := w^(k)_{(clip(s − t, r))},                                      (11)

where clip(x, r) = max(−r, min(r, x)) and w^(k)_{(t)} ∈ R^d for −r ≤ t ≤ r, for a maximum relative distance r. Analogously for a^(v)_{(t,s)}. To reduce space complexity, they share the parameters across attention heads. While it is not explicitly mentioned in their paper, we assume that they add the position information in each layer but do not share the parameters. The authors find that relative position embeddings perform better in machine translation and that the combination of absolute and relative embeddings does not improve the performance.
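A sketch of the clipped relative-distance lookup from Eq. 11 and how it enters Eq. 9; the variable names and toy sizes are ours, not Shaw et al.'s implementation.

```python
import numpy as np

def relative_key_vectors(t_len, w_k, r):
    # w_k: (2r+1, d) embeddings for relative distances clip(s - t, r) in [-r, r]   (Eq. 11)
    rel = np.arange(t_len)[None, :] - np.arange(t_len)[:, None]   # rel[t, s] = s - t
    rel = np.clip(rel, -r, r)
    return w_k[rel + r]                    # a^(k): shape (t_len, t_len, d)

rng = np.random.default_rng(0)
t_len, d, r = 6, 8, 2
w_k = rng.normal(size=(2 * r + 1, d))
a_k = relative_key_vectors(t_len, w_k, r)

# Eq. 9: A[t, s] ~ (U_t^T W_q) . (W_k^T U_s + a^(k)_(t,s))
U = rng.normal(size=(t_len, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
q = U @ W_q                                            # queries
A = q @ (U @ W_k).T + np.einsum('td,tsd->ts', q, a_k)  # unit-unit term + relative term
```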
Dai et al. (2019) propose the Transformer XL model. The main objective is to cover long sequences and to overcome the constraint of having a fixed-length context. To this end they fuse Transformer models with recurrence. This requires special handling of position information and thus a new position information model. At each attention head they adjust the computation of the attention matrix to

    A_{ts} ∼ U_t⊤ W^(q) W^(k)⊤ U_s   [content-based addressing]
           + U_t⊤ W^(q) V^(k)⊤ R_{t−s}   [content-dependent position bias]          (12)
           + b⊤ W^(k)⊤ U_s   [global content bias]
           + c⊤ V^(k)⊤ R_{t−s},   [global position bias]

where R ∈ R^{τ × d} is a sinusoidal position embedding matrix as in (Vaswani et al., 2017) and b, c ∈ R^d are learnable parameters. They use different projection matrices for the relative positions, namely V^(k) ∈ R^{d × d}. Note that Transformer-XL is unidirectional and thus τ = t_m + t_max − 1, where t_m is the memory length in the model. Furthermore they add this mechanism to all attention heads and layers, while sharing the position parameters across layers and heads.

There are more approaches that explore variants of Eq. 4. Ke et al. (2020) propose untied position embeddings. More specifically, they simply put U into the Transformer and then modify the attention matrix A in the first layer by adding a position bias

    A ∼ U W^(q) W^(k)⊤ U⊤ + P V^(q) V^(k)⊤ P⊤.                                     (13)

Compared to Eq. 4 they omit the unit-position interaction terms and use different projection matrices, V^(q), V^(k) ∈ R^{d × d}, for units and positions. Similarly, they add relative position embeddings by adding a scalar value: they add a matrix A^r ∈ R^{t_max × t_max}, where A^r_{t,s} = b_{t−s+t_max} and b is a learnable parameter vector, which is why we categorize this approach as MAM. A very similar idea with relative position encodings has also been used by Raffel et al. (2020). Ke et al. (2020) further argue that the [CLS] token has a special role and thus they replace the terms P_1⊤ V^(q) V^(k)⊤ P_s with a single parameter θ_1 and analogously P_t⊤ V^(q) V^(k)⊤ P_1 with θ_2, i.e., they disentangle the position of the [CLS] token from the other position interactions. They provide theoretical arguments that their absolute and relative position embeddings are complementary. Indeed, in their experiments the combination of relative and absolute embeddings boosts performance on the GLUE benchmark. They provide an analysis of the position biases learned by their network, see Figure 4.

Figure 4: Figure by Ke et al. (2020). Their position bias is independent of the input and can thus be easily visualized. The absolute position biases learn intuitive patterns as shown above. Patterns (from left to right) include ignoring position information, attending locally, globally, to the left, and to the right. One can clearly see the untied position bias for the first token, i.e., the [CLS] token, on the left and top of each matrix.

A similar idea has been explored in (Dufter et al., 2020), where in a more limited setting, i.e., in the context of PoS-tagging, learnable absolute or relative position biases are learned instead of full position embeddings.

Complementary to that line of research is a method by He et al. (2020): In their model DeBERTa, they omit the position-position interaction and focus on unit-position interactions. However, their embeddings are still untied or disentangled as they use different projection matrices for unit and position embeddings. They introduce relative position embeddings A^r ∈ R^{2t_max × d} and define

    δ(t, s) = 0              if t − s ≤ −t_max,
    δ(t, s) = 2t_max − 1     if t − s ≥ t_max,                                      (14)
    δ(t, s) = t − s + t_max  else.

They then compute

    A_{ts} ∼ U_t⊤ W^(q) W^(k)⊤ U_s + U_t⊤ W^(q) V^(k)⊤ A^r_{δ(t,s)} + A^r_{δ(s,t)}⊤ V^(q) W^(k)⊤ U_s   (15)

as the attention in each layer. While they share the weights of A^r across layers, the weight matrices are separate for each attention head and layer. In addition they change the scaling factor from √(1/d_h) to √(1/(3d_h)). In the last layer they inject a traditional absolute position embedding matrix P ∈ R^{t_max × d}. Thus they use both MAM and APE.
They want relative encodings to be available in every layer but argue that the model should be reminded of absolute encodings right before the masked language model prediction. In their example sentence "a new store opened beside the new mall" they argue that "store" and "mall" have similar relative positions to "new" and thus absolute positions are required for predicting masked units.

The following two approaches do not work with embeddings, but instead propose a direct multiplicative smoothing on the attention matrix and can thus be categorized as MAM. Wu et al. (2020) propose a smoothing based on relative positions in their model DA-Transformer. They consider the matrix of absolute values of relative distances R ∈ N^{t_max × t_max} where R_{ts} = |t − s|. For each attention head m they obtain R^(m) = w^(m) R with w^(m) ∈ R being a learnable scalar parameter. They then compute

    A ∼ ReLU((X W^(q) W^(k)⊤ X⊤) ∘ R̂^(m)),                                         (16)

where R̂^(m) is a rescaled version of R^(m) and ∘ is component-wise multiplication. For rescaling they use a learnable sigmoid function, i.e.,

    R̂^(m) = (1 + exp(v^(m))) / (1 + exp(v^(m) − R^(m))).                            (17)
Figure 5: Figure by Neishi & Yoshinaga (2019). Overview of the architecture when using an RNN for learning position information. They combine their RNN-Transformer with relative position embeddings by Shaw et al. (2018) in a model called
RR-Transformer (far right).

Overall, they only add 2h parameters as each head has two learnable parameters. Intuitively, they want to allow each attention head to choose whether to attend to long range or short range dependencies. Note that their model is direction-agnostic. The authors observe improvements for text classification over the vanilla Transformer, relative position encodings by (Shaw et al., 2018), Transformer-XL (Dai et al., 2019), and TENER (Yan et al., 2019).

Related to the DA-Transformer, Huang et al. (2020) review absolute and relative position embedding methods and propose four position information models with relative position encodings: (i) Similar to (Wu et al., 2020) they scale the attention matrix by

    A ∼ (X W^(q) W^(k)⊤ X⊤) ∘ R,                                                    (18)

where R_{ts} = r_{|s−t|} and r ∈ R^{t_max} is a learnable vector. (ii) They also consider R_{ts} = r_{s−t} to distinguish different directions. (iii) As a new variant they propose

    A_{ts} ∼ sumproduct(W^(q)⊤ X_t, W^(k)⊤ X_s, r_{s−t}),                           (19)

where r_{s−t} ∈ R^d are learnable parameters and sumproduct is the scalar product extended to three vectors. (iv) Last, they extend the method by Shaw et al. (2018) to not only add relative positions to the key, but also to the query in Eq. 9, and in addition remove the position-position interaction. More specifically,

    A_{ts} ∼ (W^(q)⊤ U_t + r_{s−t})⊤ (W^(k)⊤ U_s + r_{s−t}) − r_{s−t}⊤ r_{s−t}.     (20)

On several GLUE tasks (Wang et al., 2018) they find that the last two methods perform best.

The next two approaches are not directly related to relative position encodings, but they can be interpreted as using relative position information. Shen et al. (2018) do not work directly with a Transformer model. Still, they propose Directional Self-Attention Networks (Di-SAN). Besides other differences to plain self-attention (e.g., multidimensional attention), they notably mask out the upper/lower triangular matrix or the diagonal in A to achieve non-symmetric attention matrices. Allowing attention only in a specific direction does not add position information directly, but still makes the attention mechanism position-aware to some extent, i.e., enables the model to distinguish directions.

Neishi & Yoshinaga (2019) argue that recurrent neural networks (RNN) in the form of gated recurrent units (GRU) (Cho et al., 2014) are able to encode relative positions. Thus they propose to replace position encodings by adding a single GRU layer on the input before feeding it to the Transformer, see Figure 5. With their model called RNN-Transformer they observe comparable performance to position embeddings; for longer sequences, however, the GRU yields better performance. Combining their approach with the method by Shaw et al. (2018) improves performance further, a method they call RR-Transformer.
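A minimal PyTorch sketch of this idea; the module choice and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

d = 64
gru = nn.GRU(input_size=d, hidden_size=d, batch_first=True)

U = torch.randn(2, 10, d)   # batch of unit embeddings (batch, t, d)
H, _ = gru(U)               # H encodes each unit together with its left context
# H (instead of U + P) is then fed to the position-free Transformer encoder
```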
4.2 SINUSOIDAL
Another line of work experiments with sinusoidal values that are kept fixed during training to encode position information in a sequence. The approach proposed by Vaswani et al. (2017) is an instance of the absolute position APE pattern, called sinusoidal position embeddings, defined as

    P_{tj} = sin(10000^{−j/d} t)        if j even,
    P_{tj} = cos(10000^{−(j−1)/d} t)    if j odd.                                   (21)

They observe comparable performance between learned absolute position embeddings and their sinusoidal variant. However, they hypothesize that the sinusoidal structure helps for long range dependencies. This is for example verified by Liu et al. (2020b). An obvious advantage is also that they can handle sequences of arbitrary length, which most position models cannot. They are usually kept fixed and are not changed during training and thus very parameter-efficient.

Indeed, sinusoidal position embeddings exhibit useful properties in theory. Yan et al. (2019) investigate the dot product of sinusoidal position embeddings and prove important properties: (i) The dot product of two sinusoidal position embeddings depends only on their relative distance. That is, P_t⊤ P_{t+r} is independent of t. (ii) P_t⊤ P_{t−r} = P_t⊤ P_{t+r}, which means that sinusoidal position embeddings are unaware of directions. However, in practice the sinusoidal embeddings are projected with two different projection matrices, which destroys these properties, see Figure 6.

Figure 6: Figure by Yan et al. (2019). Shown is the value of the dot product (y-axis) between sinusoidal position embeddings with different relative distance k (x-axis). The blue line shows the dot product without projection matrices and the other two lines with random projections. Relative position without directionality can be encoded without projection matrices, but with the projections this information is destroyed.
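A sketch of Eq. 21 that also checks properties (i) and (ii) numerically:

```python
import numpy as np

def sinusoidal_embeddings(t_max, d):
    # Eq. 21: even dimensions use sin, odd dimensions use cos
    P = np.zeros((t_max, d))
    pos = np.arange(t_max)[:, None]
    j = np.arange(0, d, 2)[None, :]
    P[:, 0::2] = np.sin(pos * 10000 ** (-j / d))
    P[:, 1::2] = np.cos(pos * 10000 ** (-j / d))
    return P

P = sinusoidal_embeddings(512, 64)
# (i) dot products for a fixed offset r are independent of t ...
print(np.allclose(P[10] @ P[10 + 5], P[200] @ P[200 + 5]))    # True
# (ii) ... and unaware of direction: P_t . P_{t-r} == P_t . P_{t+r}
print(np.allclose(P[100] @ P[100 - 5], P[100] @ P[100 + 5]))  # True
```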
Thus, Yan et al. (2019) propose a Direction- and Distance-aware Attention in their model TENER that maintains these properties and can, in addition, distinguish between directions. They compute

    A_{ts} ∼ U_t⊤ W^(q) W^(k)⊤ U_s   [unit-unit]
           + U_t⊤ W^(q) R_{t−s}   [unit-relative position]                          (22)
           + u⊤ W^(k)⊤ U_s   [unit bias]
           + v⊤ R_{t−s},   [relative position bias]

where R_{t−s} ∈ R^d is a sinusoidal relative position vector defined as

    R_{t−s, j} = sin((t − s) 10000^{−j/d})        if j even,
    R_{t−s, j} = cos((t − s) 10000^{−(j−1)/d})    if j odd,                          (23)

and u, v ∈ R^d are learnable parameters for each head and layer. In addition, they set W^(k) to the identity matrix and omit the scaling factor √d as they find that it performs better. Overall, the authors find massive performance increases for named entity recognition compared to standard Transformer models.

Dehghani et al. (2019) use a variant of sinusoidal position embeddings in their Universal Transformer. In their model they combine Transformers with the recurrent inductive bias of recurrent neural networks. The basic idea is to replace the layers of a Transformer model with a single layer that is recurrently applied to the input, that is, they share the weights across layers. In addition they propose conditional computation where they can halt or continue computation for each position individually.
When l denotes the l-th application of the Transformer layer to the input, they add the position embeddings as follows:

    P^l_{t,j} = sin(10000^{−j/d} t) + sin(10000^{−j/d} l)            if j even,
    P^l_{t,j} = cos(10000^{−(j−1)/d} t) + cos(10000^{−(j−1)/d} l)    if j odd.       (24)

Their approach can be interpreted as adding sinusoidal position embeddings at each layer.

Li et al. (2019) argue that the variance of sinusoidal position embeddings per position across dimensions varies greatly: for small positions it is rather small and for large positions it is rather high. The authors consider this a harmful property and propose maximum variances position embeddings (mvPE) as a remedy. They change the computation to

    P_{tj} = sin(10000^{−j/d} k t)        if j even,
    P_{tj} = cos(10000^{−(j−1)/d} k t)    if j odd.                                  (25)

They claim that only sufficiently large values of the hyperparameter k are suitable.

Figure 7: Figure by Wang et al. (2019). They compute absolute and relative encodings not based on the sequential order of a sentence (left), but based on a dependency tree (right). Both absolute and relative encodings can be created.

4.3 GRAPHS
In the following section, we will take a look at position information models for graphs, i.e., cases where Transformers have been used for genuine graph input as well as cases where the graph is used as a sentence representation, e.g., a dependency graph. We distinguish two types of graph position models according to the assumptions they make about the graph structure: positions in hierarchies (trees) and arbitrary graphs.
4.3.1 HIERARCHIES (TREES)

Wang et al. (2019) propose structural position representations (SPR), see Figure 7. This means that instead of treating a sentence as a sequence of information, they perform dependency parsing and compute distances on the parse tree (dependency graph). We can distinguish two settings: (i) Analogously to absolute position encodings in sequences, where unit u_t is assigned position t, absolute SPR assigns u_t the position abs(u_t) := d_tree(u_t, ROOT), where ROOT is the root of the dependency tree (the main verb of the sentence) and d_tree(x, y) is the path length between x and y in the tree. (ii) For the relative SPR between the units u_t, u_s, they define rel(u_t, u_s) = abs(u_t) − abs(u_s) if u_t is on the path from u_s to the root or vice versa. Otherwise, they use rel(u_t, u_s) = sgn(t − s)(abs(u_t) + abs(u_s)). So we see that SPR does not only assume the presence of a graph hierarchy but also needs a strict order to be defined on the graph nodes, because rel equally encodes sequential relative position. This makes SPR a suitable choice for working with dependency graphs but renders SPR incompatible with other tree structures.
Having defined the position of a node in a tree, Wang et al. (2019) inject their SPR via sinusoidal APE for absolute positions and via learned embeddings + MAM for relative positions. It is noteworthy that Wang et al. (2019) achieve their best performance by combining both variants of SPR with sequential position information and that SPR as sole sentence representation, i.e., without additional sequential information, leads to a large drop in performance. Dependency parsers usually do not operate on subwords, so subwords are assigned the position of their main word.

Figure 8: Figure from (Schmitt et al., 2021), showing their definition of relative position encodings in a graph based on the lengths of shortest paths. ∞ means that there is no path between two nodes. Larger positive and negative numbers represent sequential relative position in multi-token node labels (dashed green arrows).

Shiv & Quirk (2019) propose alternative absolute tree position encodings (TPE). They draw inspiration from the mathematical properties of sinusoidals but do not use them directly like Wang et al. (2019). Also unlike SPR, their position encodings consider the full path from a node to the root of the tree and not only its length, thus assigning every node a unique position. This is more in line with the spirit of absolute sequential position models (§4.1.1). The first version of TPE is parameter-free: The path from the root of an n-ary tree to some node is defined as the individual decisions that lead to the destination, i.e., which of the n children is the next to be visited at each intermediate step. These decisions are encoded as one-hot vectors of size n. The whole path is simply the concatenation of these vectors (padded with 0s for shorter paths). In a second version, multiple instances of parameter-free TPE are concatenated and each one is weighted with a different learned parameter. After scaling and normalizing these vectors, they are added to the unit embeddings before the first Transformer layer (APE).
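A sketch of the parameter-free TPE variant: the root-to-node path in an n-ary tree is encoded as a concatenation of one-hot child-choice vectors, zero-padded to a maximum depth. The function name and padding convention are our own assumptions.

```python
import numpy as np

def tree_position(path, n, max_depth):
    # path: sequence of child indices (0 .. n-1) from the root to the node,
    # encoded as concatenated one-hot vectors and zero-padded to max_depth steps
    enc = np.zeros(n * max_depth)
    for depth, child in enumerate(path):
        enc[depth * n + child] = 1.0
    return enc

# in a binary tree (n=2): the root, its left child, and the left child's right child
print(tree_position([], 2, 3))       # [0. 0. 0. 0. 0. 0.]
print(tree_position([0], 2, 3))      # [1. 0. 0. 0. 0. 0.]
print(tree_position([0, 1], 2, 3))   # [1. 0. 0. 1. 0. 0.]
```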
4.3.2 ARBITRARY GRAPHS
Zhu et al. (2019) were the first to propose a Transformer model capable of processing arbitrary graphs. Their position information model solely defines the relative position between nodes and incorporates this information by manipulating the attention matrix (MAM):
    A_{ts} ∼ U_t⊤ W^(q) (W^(k)⊤ U_s + W^(r)⊤ r_{(t,s)})                             (26)
    M_t = Σ_{s=1}^{t_max} SoftMax(A)_{ts} (W^(v)⊤ U_s + W^(f)⊤ r_{(t,s)}),

where W^(r), W^(f) ∈ R^{d × d} are additional learnable parameters, and r_{(t,s)} ∈ R^d is a representation of the sequence of edge labels and special edge direction symbols (↑ and ↓) on the shortest path between the nodes u_t and u_s. Zhu et al. (2019) experiment with 5 different ways of computing r, where the best performance is achieved by two approaches: (i) a CNN with d kernels of size 4 that convolves the embedded label sequence U^(r) into r (cf. Kalchbrenner et al., 2014) and (ii) a one-layer self-attention module with sinusoidal position embeddings P (cf. §4.2):

    A^(r) ∼ (U^(r) + P) W^(qr) W^(kr)⊤ (U^(r) + P)⊤
    M^(r) = SoftMax(A^(r)) (U^(r) + P) W^(vr)                                       (27)
    a^(r) = SoftMax(W^(r2) tanh(W^(r1) M^(r)⊤))
    r = Σ_{k=1}^{t^(r)_max} a^(r)_k M^(r)_k,

with W^(r1) ∈ R^{d_r × d}, W^(r2) ∈ R^{1 × d_r} additional model parameters. While there is a special symbol for the empty path from one node to itself, this method implicitly assumes that there is always at least one path between any two nodes. While it is easily possible to extend this work to disconnected graphs by introducing another special symbol, the effect on performance is unclear.

Cai & Lam (2020) also define relative position in a graph based on shortest paths. They differ from the former approach in omitting the edge direction symbols and using a bidirectional GRU (Cho et al., 2014) to aggregate the label information on the paths (cf. the RNN-Transformer described by Neishi & Yoshinaga (2019)). After linearly transforming the GRU output, it is split into a forward and a backward part: [r_{t→s}; r_{s→t}] = W^(r) GRU(...). These vectors are injected into the model in a variant of APE:

    A_{st} ∼ (U_s + r_{s→t})⊤ W^(q) W^(k)⊤ (U_t + r_{t→s})
           = U_s⊤ W^(q) W^(k)⊤ U_t   [content-based addressing]
           + U_s⊤ W^(q) W^(k)⊤ r_{t→s}   [source relation bias]                      (28)
           + r_{s→t}⊤ W^(q) W^(k)⊤ U_t   [target relation bias]
           + r_{s→t}⊤ W^(q) W^(k)⊤ r_{t→s}.   [universal relation bias]

It is noteworthy that Cai & Lam (2020) additionally include absolute SPR (see §4.3.1) in their model to exploit the hierarchical structure of the AMR (abstract meaning representation) graphs they evaluate on. It is unclear which position model has more impact on performance.

Schmitt et al. (2021) avoid computational overhead in their
Graformer model by defining relative position encodings in a graph as the length of shortest paths instead of the sequence of edge labels (see Figure 8 for an example):

    r_{(t,s)} = ∞, if there is no path between t and s,
    r_{(t,s)} = the sequential relative position of u_t, u_s shifted by a constant to avoid clashes, if the subwords u_t, u_s belong to the same main word,
    r_{(t,s)} = d_graph(t, s), if d_graph(t, s) ≤ d_graph(s, t),
    r_{(t,s)} = −d_graph(s, t), if d_graph(t, s) > d_graph(s, t),

where d_graph(x, y) is the length of the shortest path between x and y. This definition also avoids the otherwise problematic case where there is more than one shortest path between two nodes because the length is always the same even if the label sequences are not. The so-defined position information is injected via learnable scalar embeddings + MAM (cf. Raffel et al., 2020).

In contrast to the other approaches, Graformer explicitly models disconnected graphs (∞) and does not add any sequential position information. Unfortunately, Schmitt et al. (2021) do not evaluate Graformer on the same tasks as the other discussed approaches, which makes a performance comparison difficult.
4.4 DECODER

Takase & Okazaki (2019) propose a simple extension to sinusoidal embeddings by incorporating sentence lengths in the position encodings of the decoder. Their motivation is to be able to control the output length during decoding and to enable the decoder to generate any sequence length independent of what lengths have been observed during training.
The proposed length-difference position embeddings are

    P_{tj} = sin(10000^{−j/d} (l − t))        if j even,
    P_{tj} = cos(10000^{−(j−1)/d} (l − t))    if j odd,                              (29)

where l is a given length constraint. Similarly, they propose a length-ratio position embedding given by

    P_{tj} = sin(l^{−j/d} t)        if j even,
    P_{tj} = cos(l^{−(j−1)/d} t)    if j odd.                                        (30)

The length constraint l is the output length of the gold standard. They observe that they can control the output length effectively during decoding. Oka et al. (2020) extended this approach by adding noise to the length constraint (adding a randomly sampled integer to the length) and by predicting the target sentence length using the Transformer model. Although, in theory, these approaches could also be used in the encoder, the above work focuses on the decoder.

Bao et al. (2019) propose to predict the positions of word units in the decoder in order to allow for effective non-autoregressive decoding, see Figure 9. More specifically, they predict the target sentence length and a permutation from decoder inputs and subsequently reorder the position embeddings in the decoder according to the predicted permutation. Their model called PNAT achieves performance improvements in machine translation.

Figure 9: Figure by Bao et al. (2019). Overview of their PNAT architecture with the position prediction module. They use the encoder output to predict the output length and use a modified version as input to the decoder. The position predictor then predicts a permutation of position encodings for the output sequence.
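A sketch of Eqs. 29 and 30, reusing the sinusoidal pattern of Eq. 21 with the length constraint l; the function name and the variant switch are our own.

```python
import numpy as np

def length_aware_embeddings(t_max, d, l, variant="difference"):
    P = np.zeros((t_max, d))
    pos = np.arange(t_max)[:, None]
    j = np.arange(0, d, 2)[None, :]
    if variant == "difference":        # Eq. 29: encode the remaining length l - t
        arg = (l - pos) * 10000.0 ** (-j / d)
    else:                              # Eq. 30: replace the base 10000 by l
        arg = pos * float(l) ** (-j / d)
    P[:, 0::2] = np.sin(arg)
    P[:, 1::2] = np.cos(arg)
    return P

P_diff = length_aware_embeddings(50, 64, l=50)                     # length-difference
P_ratio = length_aware_embeddings(50, 64, l=50, variant="ratio")   # length-ratio
```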
4.5 CROSSLINGUAL
Unit order differs considerably across languages. While English uses a subject-verb-object (SVO) ordering, other languages use variants of this principal ordering. For example, unit order is quite flexible in German, whereas it is rather strict in English. This raises the question whether it is useful to share position information across languages.

By default, position embeddings are shared in multilingual models (Devlin et al., 2019; Conneau et al., 2020). Artetxe et al. (2020) observe mixed results with language-specific position embeddings in the context of transferring monolingual models to multiple languages: for most languages it helps, but for some it seems harmful. They experimented with learned absolute position embeddings as proposed in (Devlin et al., 2019).
Ding et al. (2020) use crosslingual position embeddings (XL PE): in the context of machine translation, they obtain reorderings of the source sentence and subsequently integrate both the original and reordered position encodings into the model, and observe improvements on the machine translation task.

Liu et al. (2020a) find that position information hinders zero-shot crosslingual transfer in the context of machine translation. They remove a residual connection in a middle layer to break the propagation of position information, and thus achieve large improvements in zero-shot translation.

Similarly, Liu et al. (2020c) find that unit order information harms crosslingual transfer, e.g., in a zero-shot transfer setting. They reduce position information by a) removing the position embeddings and replacing them with one-dimensional convolutions, i.e., leveraging only local position information, b) randomly shuffling the unit order in the source language, and c) using position embeddings from a multilingual model and freezing them. Indeed they find that reducing order information with these three methods increases performance for crosslingual transfer.
4.6 ANALYSIS
There is a range of work comparing and analyzing position information models. Rosendahl et al. (2019) analyze them in the context of machine translation. They find similar performance for absolute and relative encodings, but relative encodings are superior for long sentences. In addition, they find that the number of learnable parameters can often be reduced without performance loss.

Yang et al. (2019) evaluate the ability to recover the original word positions after shuffling some input words. In a comparison of recurrent neural networks, Transformer models, and DiSAN (both with learned position embeddings), they find that RNN and DiSAN achieve similar performance on the word reordering task, whereas the Transformer is worse. However, when trained on machine translation the Transformer performs best in the word reordering task.

Wang & Chen (2020) provide an in-depth analysis of what position embeddings in large pretrained language models learn. They compare the embeddings from BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), and sinusoidal embeddings. See Figure 10 for an analysis.

Figure 10: Figure by Wang & Chen (2020). Shown is the position-wise cosine similarity of position embeddings (APE) after pretraining. They compare three pretrained language models that use learned absolute position embeddings as in (Devlin et al., 2019), and sinusoidal positions as in (Vaswani et al., 2017). BERT shows a cutoff at 128 as it is first trained on sequences with 128 tokens and subsequently extended to longer sequences. GPT-2 exhibits the most homogeneous similarity patterns.

More recently, Wang et al. (2021) present an extensive analysis of position embeddings. They empirically compare 13 variants of position embeddings. Among other findings, they conclude that absolute position embeddings are favorable for classification tasks and relative embeddings perform better for span prediction tasks.

We provide a high level comparison of the discussed methods in Table 2. In this table we cluster similar approaches from a methodological point of view. The objective is to make comparisons easier and spot commonalities faster.
Model(s)                                                                     | Ref. | Inject.
Sequences
Transformer w/ emb. (Vaswani et al., 2017); BERT (Devlin et al., 2019); Reformer (Kitaev et al., 2020) | A | APE
FLOATER (Liu et al., 2020b)                                                  | A | APE
Shortformer (Press et al., 2020)                                             | A | APE
Shaw et al. (2018) (abs)                                                     | A | MAM
Shaw et al. (2018) (rel); T5 (Raffel et al., 2020); Huang et al. (2020)      | R | MAM
DeBERTa (He et al., 2020)                                                    | B | Both
Transformer XL (Dai et al., 2019); DA-Transformer (Wu et al., 2020)          | R | MAM
TUPE (Ke et al., 2020); Dufter et al. (2020)                                 | B | MAM
RNN-Transf. (Neishi & Yoshinaga, 2019)                                       | R | -
Transformer w/ sin. (Vaswani et al., 2017); Li et al. (2019); Takase & Okazaki (2019); Oka et al. (2020) | A | APE
Universal Transf. (Dehghani et al., 2019)                                    | A | APE
Di-SAN (Shen et al., 2018); TENER (Yan et al., 2019)                         | R | MAM
Trees
SPR-abs (Wang et al., 2019)                                                  | A | APE
SPR-rel (Wang et al., 2019)                                                  | R | MAM
TPE (Shiv & Quirk, 2019)                                                     | A | APE
Graphs
Struct. Transformer (Zhu et al., 2019)                                       | R | MAM
Graph Transformer (Cai & Lam, 2020)                                          | R | APE
Graformer (Schmitt et al., 2021)                                             | R | MAM

Table 2: Comparison of the discussed position models by input structure (sequences, trees, graphs), reference point (A = absolute, R = relative, B = both), and injection method (APE, MAM, or both). Approaches are clustered to avoid repetition and otherwise listed in the same order as discussed in the text. The - symbol means that an entry does not fit into our categories. Note that a model as a whole can combine different position models while this comparison focuses on the respective novel part(s).
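To make the injection methods compared in Table 2 concrete, the sketch below (simplified and with our own naming, not taken from any of the cited papers) contrasts adding position embeddings to the input (APE) with adding a position-dependent bias to the attention matrix (MAM), here a learnable scalar per relative distance.

```python
# Simplified sketch (our own naming) contrasting the two injection methods in Table 2:
# adding position embeddings to the input (APE) vs. adding a position-dependent bias
# to the attention logits (MAM).
import torch

t, d = 6, 8                                    # toy sequence length and hidden size
token_emb = torch.randn(t, d)

# APE: position embeddings are added to the token embeddings before the first layer.
pos_emb = torch.nn.Embedding(t, d)
x_ape = token_emb + pos_emb(torch.arange(t))

# MAM: the input stays position-free; a relative-distance bias is added inside attention.
q = k = token_emb                              # stand-in for learned query/key projections
distance = torch.arange(t)[:, None] - torch.arange(t)[None, :]
rel_bias = torch.nn.Embedding(2 * t - 1, 1)    # one learnable scalar per relative distance
scores = q @ k.T / d ** 0.5 + rel_bias(distance + t - 1).squeeze(-1)
attention = torch.softmax(scores, dim=-1)

print(x_ape.shape, attention.shape)
```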
CONCLUSION

We presented an overview of methods to inject position information into Transformer models. We hope our unified notation and systematic comparison (Table 2) will foster understanding and spark new ideas in this important research area. Some open questions that we consider largely unanswered so far, and that can be part of future work, include:

i) How do the current position information models compare empirically on different tasks? Some analysis papers, such as Wang et al. (2021), are extensive and provide many insights. Still, not all aspects and differences of the position information models are fully understood.

ii) How important is word order for specific tasks? For many tasks, treating sentences as bags of words could be sufficient. Indeed, Wang et al. (2021) show that without position embeddings the performance drop for some tasks is marginal. Thus we consider it interesting to investigate for which tasks position information is essential.

iii) Can we use position information models to include more information about the structure of text? While there are many models for processing sequential and graph-based structures, there is a wide range of structural information in text that is not currently considered. Some examples include tables, document layouts, list enumerations, and sentence order. Could these structures be integrated with current position information models, or are new methods required for representing document structure?

ACKNOWLEDGEMENTS
This work was supported by the European Research Council.

REFERENCES
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.),
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 4623–4637. Association for Computational Linguistics, 2020. URL .
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, 2015. URL http://arxiv.org/abs/1409.0473.
Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, and Lei Li. Non-autoregressive transformer by position learning.
CoRR , abs/1911.10677, 2019. URL http://arxiv.org/abs/1911.10677 .Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,Ilya Sutskever, and Dario Amodei. Language models are few-shot learners.
CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
Deng Cai and Wai Lam. Graph transformer for graph-to-sequence learning.
AAAI Conference on Artificial Intelligence, 2020. URL https://aaai.org/Papers/AAAI/2020GB/AAAI-CaiD.6741.pdf.
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.),
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Pro-cessing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a SpecialInterest Group of the ACL , pp. 1724–1734. ACL, 2014. doi: 10.3115/v1/d14-1179. URL https://doi.org/10.3115/v1/d14-1179 .Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek,Francisco Guzm´an, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Un-supervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Na-talie Schluter, and Joel R. Tetreault (eds.),
Proceedings of the 58th Annual Meeting of the As-sociation for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 , pp. 8440–8451.Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.747. URL https://doi.org/10.18653/v1/2020.acl-main.747 .Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov.Transformer-XL: Attentive language models beyond a fixed-length context. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL .
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HyzdRiR9Y7.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of the 2019 Conference ofthe North American Chapter of the Association for Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) , pp. 4171–4186, Minneapolis, Minnesota, June2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL .Liang Ding, Longyue Wang, and Dacheng Tao. Self-attention with cross-lingual position represen-tation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.),
Proceedingsof the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, On-line, July 5-10, 2020 , pp. 1679–1685. Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.153. URL https://doi.org/10.18653/v1/2020.acl-main.153 .Philipp Dufter, Martin Schmitt, and Hinrich Sch¨utze. Increasing learning efficiency of self-attentionnetworks through direct position interactions, learnable temperature, and convoluted attention.In
Proceedings of the 28th International Conference on Computational Linguistics , pp. 3630–3636, Barcelona, Spain (Online), December 2020. International Committee on ComputationalLinguistics. doi: 10.18653/v1/2020.coling-main.324. URL .Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutionalsequence to sequence learning. In Doina Precup and Yee Whye Teh (eds.),
Proceedings of the34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11August 2017 , volume 70 of
Proceedings of Machine Learning Research , pp. 1243–1252. PMLR,2017. URL http://proceedings.mlr.press/v70/gehring17a.html .Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bertwith disentangled attention. arXiv preprint arXiv:2006.03654 , 2020. URL https://arxiv.org/pdf/2006.03654.pdf .Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1031. URL .
Zhiheng Huang, Davis Liang, Peng Xu, and Bing Xiang. Improve transformer models with better relative position embeddings. In Trevor Cohn, Yulan He, and Yang Liu (eds.),
Proceed-ings of the 2020 Conference on Empirical Methods in Natural Language Processing: Find-ings, EMNLP 2020, Online Event, 16-20 November 2020 , pp. 3327–3335. Association forComputational Linguistics, 2020. doi: 10.18653/v1/2020.findings-emnlp.298. URL https://doi.org/10.18653/v1/2020.findings-emnlp.298 .Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural networkfor modelling sentences. In
Proceedings of the 52nd Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) , pp. 655–665, Baltimore, Maryland, June2014. Association for Computational Linguistics. doi: 10.3115/v1/P14-1062. URL .Guolin Ke, Di He, and Tie-Yan Liu. Rethinking the positional encoding in language pre-training. arXiv preprint arXiv:2006.15595 , 2020. URL https://arxiv.org/pdf/2006.15595.pdf .Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In . OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB .Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprintarXiv:1607.06450 , 2016. URL https://arxiv.org/pdf/1607.06450.pdf .Hailiang Li, Adele Y. C. Wang, Yang Liu, Du Tang, Zhibin Lei, and Wenye Li. An augmentedtransformer architecture for natural language generation tasks. In Panagiotis Papapetrou, XueqiCheng, and Qing He (eds.), , pp. 1–7. IEEE, 2019. doi: 10.1109/ICDMW48858.2019.9024754. URL https://doi.org/10.1109/ICDMW48858.2019.9024754 .Danni Liu, Jan Niehues, James Cross, Francisco Guzm´an, and Xian Li. Improving zero-shot trans-lation by disentangling positional information. arXiv preprint arXiv:2012.15127 , 2020a. URL https://arxiv.org/pdf/2012.15127.pdf .Xuanqing Liu, Hsiang-Fu Yu, Inderjit S. Dhillon, and Cho-Jui Hsieh. Learning to encode positionfor transformer with continuous dynamical model.
CoRR , abs/2003.09229, 2020b. URL https://arxiv.org/abs/2003.09229 .Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, MikeLewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretrainingapproach.
CoRR , abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692 .Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya, Andrea Madotto, Zhaojiang Lin, and PascaleFung. On the importance of word order information in cross-lingual sequence labeling.
CoRR ,abs/2001.11164, 2020c. URL https://arxiv.org/abs/2001.11164 .Masato Neishi and Naoki Yoshinaga. On the relation between position information and sentencelength in neural machine translation. In Mohit Bansal and Aline Villavicencio (eds.),
Proceedingsof the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong,China, November 3-4, 2019 , pp. 328–338. Association for Computational Linguistics, 2019. doi:10.18653/v1/K19-1031. URL https://doi.org/10.18653/v1/K19-1031 .Yui Oka, Katsuki Chousa, Katsuhito Sudoh, and Satoshi Nakamura. Incorporating noisy lengthconstraints into transformer with length-aware positional encodings. In
Proceedings of the 28th International Conference on Computational Linguistics, pp. 3580–3585, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. URL .
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In
Proceedings of the 2018 Con-ference of the North American Chapter of the Association for Computational Linguistics: Hu-man Language Technologies, Volume 1 (Long Papers) , pp. 2227–2237, New Orleans, Louisiana,June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL .Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better language modeling using shorterinputs, 2020. URL https://arxiv.org/abs/2012.15832 .Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Languagemodels are unsupervised multitask learners.
OpenAI blog , 1(8):9, 2019.Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, YanqiZhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
J. Mach. Learn. Res. , 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html .Jan Rosendahl, Viet Anh Khoa Tran, Weiyue Wang, and Hermann Ney. Analysis of po-sitional encodings for neural machine translation.
IWSLT, Hong Kong, China , 2019.URL .Martin Schmitt, Leonardo F. R. Ribeiro, Philipp Dufter, Iryna Gurevych, and Hinrich Sch¨utze. Mod-eling graph structure via relative position for better text generation from knowledge graphs.
Com-puting Research Repository , 2006.09242, 2021. URL https://arxiv.org/abs/2006.09242 .Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position represen-tations. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.),
Proceedings of the 2018Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018,Volume 2 (Short Papers) , pp. 464–468. Association for Computational Linguistics, 2018. doi:10.18653/v1/n18-2074. URL https://doi.org/10.18653/v1/n18-2074 .Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. Disan: Di-rectional self-attention network for rnn/cnn-free language understanding. In Sheila A. McIlraithand Kilian Q. Weinberger (eds.),
Proceedings of the Thirty-Second AAAI Conference on Arti-ficial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18),New Orleans, Louisiana, USA, February 2-7, 2018 , pp. 5446–5455. AAAI Press, 2018. URL .Vighnesh Leonardo Shiv and Chris Quirk. Novel positional encodings to enable tree-basedtransformers. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alch´e-Buc, Emily B. Fox, and Roman Garnett (eds.),
Advances in Neural Information Pro-cessing Systems 32: Annual Conference on Neural Information Processing Systems 2019,NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada , pp. 12058–12068, 2019.URL http://papers.nips.cc/paper/9376-novel-positional-encodings-to-enable-tree-based-transformers .Sho Takase and Naoaki Okazaki. Positional encoding to control output sequence length. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3999–4004, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1401. URL .
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.),
Advances in Neu-ral Information Processing Systems 30 , pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf .Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE:A multi-task benchmark and analysis platform for natural language understanding. In
Proceedingsof the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks forNLP , pp. 353–355, Brussels, Belgium, November 2018. Association for Computational Linguis-tics. doi: 10.18653/v1/W18-5446. URL .Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen.Encoding word order in complex embeddings. In . OpenReview.net, 2020.URL https://openreview.net/forum?id=Hke-WTVtwr .Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, and Jakob GrueSimonsen. On position embeddings in BERT. In
International Conference on Learning Repre-sentations , 2021. URL https://openreview.net/forum?id=onxoVA9FxMw .Xing Wang, Zhaopeng Tu, Longyue Wang, and Shuming Shi. Self-attention with structural positionrepresentations. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),
Proceedings ofthe 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong,China, November 3-7, 2019 , pp. 1403–1409. Association for Computational Linguistics, 2019.doi: 10.18653/v1/D19-1145. URL https://doi.org/10.18653/v1/D19-1145 .Yu-An Wang and Yun-Nung Chen. What do position embeddings learn? an empirical study ofpre-trained language model positional encoding. In Bonnie Webber, Trevor Cohn, Yulan He,and Yang Liu (eds.),
Proceedings of the 2020 Conference on Empirical Methods in NaturalLanguage Processing, EMNLP 2020, Online, November 16-20, 2020 , pp. 6840–6849. Asso-ciation for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.555. URL https://doi.org/10.18653/v1/2020.emnlp-main.555 .Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. Da-transformer: Distance-aware transformer.
CoRR , abs/2010.06925, 2020. URL https://arxiv.org/abs/2010.06925 .Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. TENER: adapting transformer encoder fornamed entity recognition.
CoRR , abs/1911.04474, 2019. URL http://arxiv.org/abs/1911.04474 .Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu. Assessing theability of self-attention networks to learn word order. In Anna Korhonen, David R. Traum, andLlu´ıs M`arquez (eds.),
Proceedings of the 57th Conference of the Association for ComputationalLinguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , pp.3635–3644. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1354. URL https://doi.org/10.18653/v1/p19-1354 .Jie Zhu, Junhui Li, Muhua Zhu, Longhua Qian, Min Zhang, and Guodong Zhou. Modeling graphstructure in transformer for better AMR-to-text generation. In
Proceedings of the 2019 Con-ference on Empirical Methods in Natural Language Processing and the 9th International JointConference on Natural Language Processing (EMNLP-IJCNLP) , pp. 5459–5468, Hong Kong,China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1548.URL