AUBER: Automated BERT Regularization
Hyun Dong Lee ∗ Columbia University [email protected]
Seongmin Lee ∗ Seoul National University [email protected]
U Kang † Seoul National University [email protected]

ABSTRACT
How can we effectively regularize BERT? Although BERT proves its effectiveness in various downstream natural language processing tasks, it often overfits when there are only a small number of training instances. A promising direction for regularizing BERT is to prune its attention heads based on a proxy score for head importance. However, heuristic-based methods are usually suboptimal since they predetermine the order by which attention heads are pruned. In order to overcome such a limitation, we propose AUBER, an effective regularization method that leverages reinforcement learning to automatically prune attention heads from BERT. Instead of depending on heuristics or rule-based policies, AUBER learns a pruning policy that determines which attention heads should or should not be pruned for regularization. Experimental results show that AUBER outperforms existing pruning methods by achieving better accuracy. In addition, our ablation study empirically demonstrates the effectiveness of our design choices for AUBER.
1 INTRODUCTION
How can we effectively regularize BERT (Devlin et al. (2018))? In natural language processing, it has been observed that generalization could be greatly improved by fine-tuning a large-scale language model pre-trained on a large unlabeled corpus. In particular, BERT demonstrated such an effectiveness on a wide range of downstream natural language processing tasks including question answering and language inference.

Despite its recent success and wide adoption, fine-tuning BERT on a downstream task is prone to overfitting due to over-parameterization; BERT-base has 110M parameters and BERT-large has 340M parameters. This problem worsens when there are only a small number of training instances available. Some observations report that fine-tuning sometimes fails when a target dataset has fewer than 10,000 training instances (Devlin et al. (2018); Phang et al. (2018)).

To mitigate this critical issue, multiple studies attempt to regularize BERT by pruning parameters or using dropout to decrease its model complexity (Michel et al. (2019); Voita et al. (2019); Lee et al. (2020)). Among these approaches, we regularize BERT by pruning attention heads since pruning yields simple and explainable results and it can be used along with other regularization methods. In order to avoid combinatorial search, whose computational complexity grows exponentially with the number of heads, the existing methods measure the importance of each attention head based on heuristics such as an approximation of the sensitivity of BERT to pruning a specific attention head. However, these approaches are based on hand-crafted heuristics that are not guaranteed to be directly related to the model performance, and therefore would result in suboptimal performance.

In this paper, we propose AUBER, an effective method for regularizing BERT. AUBER overcomes the limitation of past attempts to prune attention heads from BERT by leveraging reinforcement learning. When pruning attention heads from BERT, our method automates this process by learning policies rather than relying on a predetermined rule-based policy and heuristics. AUBER prunes BERT sequentially in a layer-wise manner. For each layer, AUBER extracts features that are useful for the reinforcement learning agent to determine which attention head to be pruned from the current layer. The final pruning policy found by the reinforcement learning agent is used to prune the corresponding layer.

∗ These authors have contributed equally to the work.
† Corresponding author
Figure 1: An overview of AUBER transitioning from Layer 2 to Layer 3 of BERT-base.

Before AUBER proceeds to process the next layer, BERT is fine-tuned to recapture the information lost due to pruning attention heads. An overview of AUBER transitioning from the second to the third layer of BERT is demonstrated in Figure 1.

Our contributions are summarized as follows:

• Regularization.
BERT is prone to overfitting when the training dataset is too small. AUBER effectively prunes appropriate attention heads to decrease the model capacity and regularize BERT.

• Automation. By leveraging reinforcement learning, we automate the process of regularizing BERT. Instead of depending on hand-crafted policies or heuristics, which often yield suboptimal results, AUBER inspects the current state of BERT and automatically chooses which attention head should be pruned.

• Experiments. We perform extensive experiments, and show that AUBER successfully regularizes BERT, improving the performance metric and outperforming other head pruning methods. Through an ablation study, we empirically show that our design choices for AUBER are effective.

The rest of this paper is organized as follows. Section 2 explains preliminaries. Section 3 describes our proposed method, AUBER. Section 4 presents experimental results. After discussing related works in Section 5, we conclude in Section 6.
2 PRELIMINARY
We describe preliminaries on multi-headed self-attention (Section 2.1), BERT (Section 2.2), and deep Q-learning (Section 2.3).

2.1 MULTI-HEADED SELF-ATTENTION
A self-attention function maps a query vector and a set of key-value vector pairs to an output. We compute the query, key, and value vectors by multiplying the input embeddings Q, K, V ∈ R^{N×d} with the parameterized matrices W^Q ∈ R^{d×n}, W^K ∈ R^{d×n}, and W^V ∈ R^{d×m} respectively, where N is the number of tokens in the sentence, and n, m, and d are the query, value, and embedding dimensions respectively. In multi-headed self-attention, H independently parameterized self-attention heads are applied in parallel to project the input embeddings into multiple representation subspaces. Each attention head contains parameter matrices W_i^Q ∈ R^{d×n}, W_i^K ∈ R^{d×n}, and W_i^V ∈ R^{d×m}. Output matrices of the H independent self-attention heads are concatenated and once again projected by a matrix W^O ∈ R^{Hm×d} to obtain the final result. This process can be represented as:

MultiHeadAtt(Q, K, V) = Concat(Att_{1...H}(Q, K, V)) W^O     (1)

where Att_i(Q, K, V) = softmax((Q W_i^Q)(K W_i^K)^T / √n) V W_i^V     (2)

2.2 BERT

BERT (Devlin et al. (2018)) is a multi-layer Transformer (Vaswani et al. (2017)) pre-trained on masked language modeling and next sentence prediction tasks. It is then fine-tuned on specific tasks including question answering and language inference. It achieved state-of-the-art performance on a variety of downstream natural language processing tasks. BERT-base has 12 Transformer blocks and each block has 12 self-attention heads. Despite its success in various natural language processing tasks, BERT sometimes overfits when the training dataset is small due to over-parameterization: 110M parameters for BERT-base. Thus, there has been a growing interest in BERT regularization through various methods such as pruning or dropout (Lee et al. (2020)).

2.3 DEEP Q-LEARNING
A deep Q network (DQN) is a multi-layered neural policy network that outputs a vector of action-value pairs for a given state s. For a d_s-dimensional state space and an action space containing d_a actions, the neural network is a function from R^{d_s} to R^{d_a}. Two important aspects of the DQN algorithm as proposed by Mnih et al. (2013) are the use of a target network and the use of experience replay. The target network is the same as the policy network except that its parameters are copied every τ steps from the policy network. For experience replay, observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm.

3 PROPOSED METHOD
We propose AUBER, our method for regularizing BERT by automatically learning to prune attention heads from BERT. After presenting the overview of the proposed method in Section 3.1, we describe how we frame the problem of pruning attention heads as a reinforcement learning problem in Section 3.2. We then describe the state representation in Section 3.3, and propose the full AUBER algorithm in Section 3.4.

3.1 OVERVIEW
We observe that BERT is prone to overfitting for tasks with few training data. However, the existing head pruning methods rely on hand-crafted heuristics and hyperparameters, which give sub-optimal results. The goal of AUBER is to automate the pruning process for successful regularization. Designing such a regularization method entails the following challenges:

1. Automation. How can we automate the head pruning process for regularization without resorting to sub-optimal heuristics?
2. State representation. When automating the regularization process as a reinforcement learning problem, how can we represent the state of BERT in a way useful for the pruning?
3. Action search space scalability. BERT has many parameters, many layers, and many attention heads in each layer. When automating the regularization process of BERT as a reinforcement learning problem, how can we handle the prohibitively large action search space for pruning?

We propose the following main ideas to address the challenges:

1. Reinforcement learning. We exploit reinforcement learning, specifically DQN, with accuracy enhancement as the reward. It is natural to leverage DQN for these problems that have a discrete action space (Sutton & Barto (2018)). Experience replay also allows efficient usage of previous experiences and stable convergence (Mnih et al. (2013)).
2. L1 norm of value matrix. We use the L1 norm of the value matrix of each attention head to represent the initial state of a layer. When a head is pruned, the corresponding value is set to 0.
3. Dually-greedy manner. We prune the attention heads layer by layer sequentially to reduce the search space. Moreover, we prune one attention head at a time instead of handling all possible pruning combinations at once so that the action search space becomes more scalable.

3.2 AUTOMATED REGULARIZATION WITH REINFORCEMENT LEARNING
AUBER leverages reinforcement learning for efficient search of a regularization strategy without relying on heuristics. We exploit DQN among various reinforcement learning frameworks to take advantage of experience replay and to easily handle the discrete action space. Here we introduce the detailed setting of the reinforcement learning framework.
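The DQN machinery described above (a policy network, a target network updated by periodic copying, uniformly sampled experience replay, and an ε-greedy behavior policy) can be sketched as follows. This is a minimal illustrative sketch, not the exact configuration used in our experiments: a linear Q-function stands in for the feedforward policy network, and all hyperparameter values and class names are placeholder assumptions.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Uniformly sampled experience replay, as in Mnih et al. (2013)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        self.buffer.append((state, action, next_state, reward))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

class LinearDQN:
    """A linear Q-function Q(s) = s @ W; a stand-in for the MLP policy net."""
    def __init__(self, state_dim, num_actions, lr=0.01, gamma=0.9):
        self.W = np.zeros((state_dim, num_actions))   # policy-net parameters
        self.target_W = self.W.copy()                 # target-net parameters
        self.lr, self.gamma = lr, gamma

    def q_values(self, state):
        return state @ self.W

    def act(self, state, eps):
        """Epsilon-greedy selection over the discrete head-pruning actions."""
        if random.random() < eps:
            return random.randrange(self.W.shape[1])
        return int(np.argmax(self.q_values(state)))

    def optimize(self, memory, batch_size=8):
        """One TD(0) update on a uniform minibatch from the replay memory."""
        for s, a, s_next, r in memory.sample(batch_size):
            target = r
            if s_next is not None:                    # None marks a terminal state
                target += self.gamma * np.max(s_next @ self.target_W)
            td_error = target - self.q_values(s)[a]
            self.W[:, a] += self.lr * td_error * s    # gradient step for action a
        return self.W

    def sync_target(self):
        self.target_W = self.W.copy()                 # copy every τ steps
```

Here `W` and `target_W` play the roles of the policy network P and target network T, and `sync_target` corresponds to copying the policy-net parameters into the target net every τ steps.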
Initial state
As mentioned in Section 2.2, layer l has multiple attention heads, each of which has its own query, key, and value matrices. For layer l of BERT, we derive the initial state s_l using the L1 norm of the value matrix of each attention head. Further details of this computation are elaborated in Section 3.3.

Action
The action space a of AUBER is discrete. For a BERT model with H attention heads per layer, the number of possible actions is H + 1 (i.e. a ∈ {1, 2, ..., H, H + 1}). When the action a = i ∈ {1, 2, ..., H − 1, H} is chosen, the corresponding i-th attention head is pruned. The action a = H + 1 signals the DQN agent to quit pruning. With a continuous action space (e.g. an effective sparsity ratio), a separate heuristic-based pruning algorithm must be used in order to choose which attention heads should be pruned. However, having a discrete action space allows the reinforcement learning agent to automatically infer the expected reward for each possible pruning policy, thereby minimizing the usage of error-prone heuristics.

Next State
After the i-th head is pruned, the value of the i-th index of s_l is set to 0. This modified state is provided as the next state to the agent. When the action a = H + 1, the next state is set to None. This mechanism allows the agent to recognize which attention heads have been pruned and decide the next best pruning policy based on past decisions.
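A minimal sketch of this state-action mechanic is shown below. It assumes 0-based head indices with `action == num_heads` encoding the quit action a = H + 1, and assumes the model exposes Hugging Face Transformers' `prune_heads` method (which takes a `{layer_index: [head_indices]}` dict); the helper names are ours, and the state construction follows the L1-norm recipe detailed in Section 3.3.

```python
import numpy as np

def initial_state(value_matrices):
    """s_l from the per-head value matrices: entrywise L1 norm,
    standardize to mean 0 / std 1, then softmax (Section 3.3 recipe)."""
    norms = np.array([np.abs(W_v).sum() for W_v in value_matrices])
    z = (norms - norms.mean()) / norms.std()
    e = np.exp(z - z.max())                 # numerically stable softmax
    return e / e.sum()

def step(model, layer, state, action, num_heads):
    """Apply the agent's action: prune head `action`, or stop when
    action == num_heads (the a = H + 1 quit action in 0-based indexing)."""
    if action == num_heads:                 # quit pruning
        return None                         # terminal next state
    if model is not None:                   # e.g. a BertForSequenceClassification
        model.prune_heads({layer: [action]})
    next_state = state.copy()
    next_state[action] = 0.0                # mask the pruned head in s_l
    return next_state
```

For example, with three mock heads whose value-matrix norms differ, `initial_state` assigns the largest state entry to the head with the largest L1 norm, and `step(model, l, s, i, H)` zeroes entry i of the state after pruning head i.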
Reward
The reward of AUBER is the change in accuracy:

Δacc = current_accuracy − prev_accuracy     (3)

where current_accuracy is the accuracy of the current BERT model evaluated on a validation set, and prev_accuracy is the accuracy obtained from the previous state, or the accuracy of the original BERT model if no attention heads are pruned. If we set the reward simply as current_accuracy, DQN cannot capture the differences among reward values if the changes in accuracy are relatively small. Setting the reward as the change in accuracy has a normalization effect, thus stabilizing the training process of the DQN agent. The reward for the action a = H + 1 is a hyper-parameter that can be adjusted to encourage or discourage active pruning. In AUBER, it is set to encourage the DQN agent to prune only when the expected change in accuracy is positive.

Fine-tuning
After the best pruning policy for layer l of BERT is found, the BERT model pruned according to the best pruning policy is fine-tuned with a smaller learning rate. This fine-tuning step is crucial since it adjusts the weights of the remaining attention heads to compensate for information lost due to pruning. Then, the initial state of layer l + 1 is calculated and provided to the agent. Since frequent fine-tuning may lead to overfitting, we separate the training dataset into two: a mini-validation dataset and a mini-training dataset. The mini-validation dataset is the dataset on which the pruned BERT model is evaluated to return a reward. After the optimal pruning policy is determined by using the mini-validation dataset, the mini-training dataset is used to fine-tune the pruned model. When all layers are pruned by AUBER, the final model is fine-tuned with the entire training dataset with early stopping.

3.3 STATE REPRESENTATION
The initial state s_l of layer l of BERT is computed through the following procedure. We first calculate the L1 norm of the value matrix of each attention head. Then, we standardize the norm values to have a mean μ = 0 and a standard deviation σ = 1. Finally, the softmax function is applied to the norm values to yield s_l. We devise the method based on the following lemma.

Lemma 1.
For a layer with H heads, let N be the number of tokens in the sentence and m, n, and d be the value, query, and embedding dimensions respectively. Let Q, K, V ∈ R^{N×d} be the input query, key, and value matrices, and W_i^Q, W_i^K, and W_i^V be the weight parameters of the i-th head such that W_i^Q, W_i^K ∈ R^{d×n} and W_i^V ∈ R^{d×m}. Let O_i be the output of the i-th head. Then, ||O_i|| ≤ C ||W_i^V|| for the constant C = N ||V||.

Proof. See the appendix.

This lemma provides the theoretical insight that the L1 norm of the value matrix of a head bounds the L1 norm of its output matrix, which implies the importance of the head in the layer.

3.4 AUBER: AUTOMATED BERT REGULARIZATION
Our DQN agent processes the BERT model in a layer-wise manner. For each layer l with H attention heads, the agent receives an initial layer embedding s_l which encodes useful characteristics of this layer. Then, the agent outputs the index of an attention head that is expected to increase or maintain the training accuracy when removed. After an attention head i is pruned, the value of the i-th index of s_l is set to 0, and it is provided as the next state to the agent. This process is repeated until the action a = H + 1. The model pruned up to layer l is fine-tuned on the training dataset, and a new initial layer embedding s_{l+1} is calculated from the fine-tuned model. Algorithm 1 illustrates the process of AUBER.

Algorithm 1:
AUBER
Input: A BERT model B_t fine-tuned on task t.
Output:
Regularized B_t.

L ← number of layers in B_t
E ← total number of episodes
H ← number of attention heads per layer
for l ← 0 to L − 1 do
    Initialize policy net P and target net T
    Initialize replay memory M
    original_accuracy ← eval(B_t)
    for e ← 0 to E − 1 do
        s_l ← state vector of layer l
        prune_num ← 0
        action ← None
        prev_accuracy ← original_accuracy
        while action ≠ H + 1 do
            if prune_num = H − 1 then
                action ← H + 1
                s*_l ← None
                reward ← 0
            else
                action ← P(s_l)
                B_t.prune_head(action)
                prune_num ← prune_num + 1
                s_l[action] ← 0
                s*_l ← s_l
                current_accuracy ← eval(B_t)
                reward ← current_accuracy − prev_accuracy
                prev_accuracy ← current_accuracy
            end
            M.update(s_l, action, s*_l, reward)
        end
        P.optimize(M, T)
    end
    B_t.finetune(t)
end

4 EXPERIMENTS
We conduct experiments to answer the following questions about AUBER.

• Q1. Accuracy (Section 4.2). Given a BERT model fine-tuned on a specific natural language processing task, how well does AUBER improve the performance of the model?
• Q2. Ablation Study (Section 4.3). How useful is the L1 norm of the value matrices of attention heads in representing the state of BERT? How does the order in which the layers are processed by AUBER affect regularization?

4.1 EXPERIMENTAL SETUP
Datasets.
We perform downstream natural language processing tasks on four GLUE datasets: MRPC, CoLA, RTE, and WNLI. We test AUBER on datasets that contain fewer than 10,000 training instances since past experiments report that fine-tuning sometimes fails when a target dataset has fewer than 10,000 training instances (Devlin et al. (2018); Phang et al. (2018)). Detailed information on these datasets is described in Table 1.

Table 1: Datasets.
dataset | MRPC | CoLA | RTE | WNLI

BERT Model.
We use the pre-trained bert-base-cased model provided by huggingface. We fine-tune this model on each dataset mentioned in Table 1 to obtain the initial model; the number of fine-tuning epochs, the max sequence length, the per-GPU batch size, and the learning rate are fixed per task.

Reinforcement Learning.
We use a feedforward neural network for the DQN agent, whose input dimension matches the number of attention heads per layer and whose output dimension is the number of possible actions. LeakyReLU is applied after all layers except for the last one. We use the epsilon-greedy strategy for choosing actions, annealing epsilon from its initial to its final value with a fixed decay. Transitions are stored in a replay memory and sampled in minibatches to train the DQN agent with a discount factor γ. A smaller learning rate is used when fine-tuning BERT after processing a layer. Before processing each layer, the training dataset is randomly split to yield a mini-training dataset and a mini-validation dataset. When fine-tuning the final model, early stopping is applied with a fixed patience value.

Competitors.
We compare AUBER with other methods that prune BERT's attention heads. As a simple baseline, we examine a random pruning policy and denote the method as Random. We examine two different pruning methods based on an importance score. In both methods, if AUBER prunes P attention heads from BERT, we also prune the P attention heads with the smallest importance scores to obtain the competitor model. We denote the pruning method using the confidence score as Confidence. The confidence score of an attention head is the average of its maximum attention weight; a high confidence score indicates that the weight is concentrated on a single token. On the other hand, Michel et al. (2019) performs a forward and backward pass to calculate gradients and uses them to assign an importance score to each attention head. Voita et al. (2019) constructs a new loss function that minimizes the classification error and the number of heads being used so that unproductive heads are pruned while maintaining the model performance. We prune the same number of heads as AUBER by tuning hyperparameters for a fair comparison.

Implementation.
We construct all models using the PyTorch framework. All the models are trained and tested on a GeForce GTX 1080 Ti GPU.

4.2 ACCURACY
We evaluate the performance of AUBER against competitors. Table 2 shows the results on the four GLUE datasets specified in Table 1. Note that AUBER outperforms its competitors on regularizing BERT that is fine-tuned on MRPC, CoLA, RTE, or WNLI. While most of its competitors fail to improve the performance of BERT on the dev datasets of MRPC and CoLA, AUBER improves the performance of BERT.

https://nyu-mll.github.io/CoLA/
https://aclweb.org/aclwiki/Recognizing_Textual_Entailment
https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
https://github.com/huggingface/transformers

Table 2: Performance of AUBER and competing head pruning methods on the four GLUE datasets.

(a) MRPC (Accuracy)
Policy | Score
Random | 83.97
Confidence | 83.33
Michel et al. (2019) | 83.09
Voita et al. (2019) | 84.17

(b) CoLA (Matthew's correlation)
Policy | Score
Random | 57.89
Confidence | 55.74
Michel et al. (2019) | 54.73
Voita et al. (2019) | 57.24

(c) RTE (Accuracy)
Policy | Score
Random | 63.47
Confidence | 64.26
Michel et al. (2019) | 63.18
Voita et al. (2019) | 63.61

(d) WNLI (Accuracy)
Policy | Score
Random | 43.07
Confidence | –
Michel et al. (2019) | 54.93
Voita et al. (2019) | –
4.3 ABLATION STUDY
Here we empirically demonstrate the effectiveness of our design choices for AUBER. More specifically, we validate that the L1 norm of the value matrix of each attention head effectively guides AUBER to predict the best action. Moreover, we show that AUBER successfully regularizes BERT regardless of the direction in which the layers are processed. Table 3 shows the performances of the variants of AUBER on the four GLUE datasets listed in Table 1.

4.3.1 AUBER WITH THE KEY/QUERY MATRICES AS THE STATE VECTOR
Among the query, key, and value matrices of each attention head, we show that the value matrix best represents the current state of BERT. Here we evaluate the performance of AUBER against AUBER-Query and AUBER-Key, which use the query and key matrices respectively to obtain the initial state. Note that AUBER, which uses the value matrix to obtain state vectors, outperforms AUBER-Query and AUBER-Key on all four tasks.

4.3.2 AUBER WITH THE L2 NORM OF THE VALUE MATRICES AS THE STATE VECTOR
The L1 norm of the value matrices is used to compute the state vector based on the theoretical derivation. In this ablation study, we experimentally show that the L1 norm of the value matrices is appropriate for the state vector. We define a new variant, AUBER-L2, which leverages the L2 norm of the value matrices instead of the L1 norm to compute the initial state vector. The performance of AUBER is far superior to that of AUBER-L2 in most cases, bolstering that the L1 norm of the value matrices effectively represents the state of BERT.

4.3.3 EFFECT OF PROCESSING LAYERS IN A DIFFERENT ORDER
We empirically demonstrate how the order in which the layers are processed affects the final performance. We evaluate the performance of AUBER against AUBER-Reverse, which processes BERT in the opposite direction (i.e. from Layer 12 to Layer 1 for BERT-base). Note that both AUBER and AUBER-Reverse effectively regularize BERT, proving the effectiveness of AUBER regardless of the order in which BERT layers are pruned. The differences in the final performance and the number of attention heads pruned can be attributed to the fine-tuning step after pruning each layer. Since the fine-tuning step adjusts the weights of the remaining attention heads in order to take the previous pruning policies into account, processing BERT in different directions may lead to different adjustments in weights. Varying updates on weights may make a previously important attention head become unimportant and vice versa, thus resulting in different pruning policies and final accuracies.

Table 3: We compare AUBER with four of its variants: AUBER-Query, AUBER-Key, AUBER-L2, and AUBER-Reverse on four GLUE datasets to demonstrate the effectiveness of various ways to calculate the initial state. AUBER-Query and AUBER-Key use the query and key matrices respectively, and AUBER-L2 leverages the L2 norm of the value matrix to obtain the initial state. AUBER-Reverse processes BERT starting from the final layer (i.e. the 12th layer for BERT-base). Bold font indicates the best accuracy among competing pruning methods.

(a) MRPC (Accuracy)
Policy | # heads pruned | Score

(b) CoLA (Matthew's correlation)
Policy | # heads pruned | Score
AUBER-Query | 72 | 55.52
AUBER-Key | 55 | 55.78
AUBER-L2 | 63 | 54.85
AUBER-Reverse | 57 | 59.48

(c) RTE (Accuracy)
Policy | # heads pruned | Score
AUBER-Query | 83 | 65.34
AUBER-Key | 86 | 63.18
AUBER-L2 | 61 | –
AUBER-Reverse | 99 | 64.62

(d) WNLI (Accuracy)
Policy | # heads pruned | Score
AUBER-Query | 96 | –
AUBER-Key | 101 | –
AUBER-L2 | 94 | 53.52
AUBER-Reverse | 101 | 54.93
5 RELATED WORK
A number of studies focused on analyzing the effectiveness of multi-headed attention (Voita et al. (2019); Michel et al. (2019)). These studies evaluate the importance of each attention head by measuring some heuristics such as the average of its maximum attention weight, where the average is taken over tokens in a set of sentences used for evaluation, or the expected sensitivity of the model to attention head pruning. Their results show that a large percentage of attention heads with low importance scores can be pruned without significantly impacting performance. However, they usually yield suboptimal results since they predetermine the order in which the attention heads are pruned by using heuristics.

To prevent overfitting of BERT on downstream natural language processing tasks, various regularization techniques have been proposed. A variant of dropout improves the stability of fine-tuning a big, pre-trained language model even with only a few training examples of a target task (Lee et al. (2020)). Other existing heuristics to prevent overfitting include choosing a small learning rate or a triangular learning rate schedule, and a small number of iterations.

To automate the process of Convolutional Neural Network pruning, He & Han (2018) leverage reinforcement learning to determine the best sparsity ratio for each layer. Important features that characterize a layer are encoded and provided to a reinforcement learning agent to determine how much of the current layer should be pruned. To the best of our knowledge, AUBER is the first attempt to use reinforcement learning to prune attention heads from Transformer-based models such as BERT.
6 CONCLUSION
We propose AUBER, an effective method to regularize BERT by automatically pruning attention heads. Instead of depending on heuristics or rule-based policies, AUBER leverages reinforcement learning to learn a pruning policy that determines which attention heads should be pruned for better regularization. Experimental results demonstrate that AUBER effectively regularizes BERT, increasing the performance of the original model on the dev dataset. In addition, we experimentally demonstrate the effectiveness of our design choices for AUBER.
REFERENCES
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

Yihui He and Song Han. ADC: automated deep compression and acceleration with reinforcement learning. CoRR, abs/1802.03494, 2018.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. Mixout: Effective regularization to finetune large-scale pretrained language models. In International Conference on Learning Representations (ICLR). OpenReview.net, 2020.

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 14014–14024, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088, 2018.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008, 2017.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pp. 5797–5808. Association for Computational Linguistics, 2019.

A APPENDIX
A.1 PROOF FOR LEMMA 1

Lemma 1.
For a layer with H heads, let N be the number of tokens in the sentence and m, n, and d be the value, query, and embedding dimensions respectively. Let Q, K, V ∈ R^{N×d} be the input query, key, and value matrices, and W_i^Q, W_i^K, and W_i^V be the weight parameters of the i-th head such that W_i^Q, W_i^K ∈ R^{d×n} and W_i^V ∈ R^{d×m}. Let O_i be the output of the i-th head. Then, ||O_i|| ≤ C ||W_i^V|| for the constant C = N ||V||.

Proof. For the i-th head in the layer, let

softmax_i = softmax((Q W_i^Q)(K W_i^K)^T / √n)     (4)

and

v_i = V W_i^V.     (5)

The output of the head, O_i, is evaluated as O_i = softmax_i v_i. Then,

||O_i|| = Σ_{j=1}^{N} Σ_{k=1}^{m} |(O_i)_{jk}|     (6)
       = Σ_{j=1}^{N} Σ_{k=1}^{m} |(softmax_i)_{j·} · (v_i)_{·k}|     (7)
       ≤ Σ_{j=1}^{N} Σ_{k=1}^{m} ||(softmax_i)_{j·}||_2 ||(v_i)_{·k}||_2     (8)
       = Σ_{j=1}^{N} ||(softmax_i)_{j·}||_2 Σ_{k=1}^{m} ||(v_i)_{·k}||_2     (9)

Since the L1 norm of a vector is always greater than or equal to the L2 norm of the vector, and each row of softmax_i sums to 1,

||O_i|| ≤ Σ_{j=1}^{N} ||(softmax_i)_{j·}||_1 Σ_{k=1}^{m} ||(v_i)_{·k}||_1     (10)
       ≤ N Σ_{k=1}^{m} ||(v_i)_{·k}||_1     (11)
       = N Σ_{j=1}^{N} Σ_{k=1}^{m} |(v_i)_{jk}|     (12)
       = N Σ_{j=1}^{N} Σ_{k=1}^{m} |V_{j·} · (W_i^V)_{·k}|     (13)
       ≤ N Σ_{j=1}^{N} Σ_{k=1}^{m} ||V_{j·}||_2 ||(W_i^V)_{·k}||_2     (14)
       ≤ N Σ_{j=1}^{N} Σ_{k=1}^{m} ||V_{j·}||_1 ||(W_i^V)_{·k}||_1     (15)
       = N (Σ_{j=1}^{N} ||V_{j·}||_1)(Σ_{k=1}^{m} ||(W_i^V)_{·k}||_1)     (16)
       = N ||V|| ||W_i^V||     (17)

where the norm of the matrices is the entrywise L1 norm, ||A|| = Σ_j Σ_k |A_jk|. All heads in the same layer take the same V as input and N is constant. Thus,

||O_i|| ≤ C ||W_i^V||     (18)

for the constant C = N ||V||. ∎
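The bound of Lemma 1 can also be checked numerically. The following sketch is our own illustration (all function names and the random-input setup are ours, not part of the paper's artifacts); it draws one random attention head and compares ||O_i|| against N ||V|| ||W_i^V|| with entrywise L1 norms.

```python
import numpy as np

def entrywise_l1(A):
    """Entrywise L1 norm: ||A|| = sum_j sum_k |A_jk|."""
    return np.abs(A).sum()

def check_lemma1(N=6, d=8, n=4, m=5, seed=0):
    """Compute ||O_i|| and the bound N * ||V|| * ||W_i^V|| for one random head."""
    rng = np.random.default_rng(seed)
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    W_q = rng.standard_normal((d, n))
    W_k = rng.standard_normal((d, n))
    W_v = rng.standard_normal((d, m))

    scores = (Q @ W_q) @ (K @ W_k).T / np.sqrt(n)   # (N, N) attention logits
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax_i: rows sum to 1
    O = attn @ (V @ W_v)                            # head output O_i, shape (N, m)

    bound = N * entrywise_l1(V) * entrywise_l1(W_v)
    return entrywise_l1(O), bound
```

Running `check_lemma1` over several random seeds, the output norm always stays below the bound, as the proof guarantees.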