LazyFormer: Self Attention with Lazy Update
Chengxuan Ying
Dalian University of Technology [email protected]
Guolin Ke
Microsoft Research [email protected]
Di He
Microsoft Research [email protected]
Tie-Yan Liu
Microsoft Research [email protected]
February 26, 2021

Abstract
Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer consists of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once, in the first layer, and is then reused in all upper layers. In this way, a large part of the computation can be saved. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
Introduction

Using pre-trained contextual representations (e.g., BERT, Devlin et al. (2018)) has become the standard way to improve performance on downstream tasks in natural language processing. The Transformer (Vaswani et al., 2017) is the basic building block for almost all pre-training methods (Liu et al., 2019; Clark et al., 2019b; Devlin et al., 2018). A Transformer layer is composed of an efficient densely connected network operating on each position separately and a less efficient self-attention module, which costs $O(n^2)$ time and space (n is the sequence length). This quadratic cost becomes a bottleneck in the Transformer, especially when n is large. Many recent works (Wang et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Kitaev et al., 2020; Choromanski et al., 2020) try to reduce the $O(n^2)$ cost to $O(n\sqrt{n})$ or $O(n \log n)$ by sparsifying or approximating the attention matrix.

In this paper, different from previous works, we explore a simple idea to improve model efficiency by computing the self-attention infrequently. We call the resulting model LazyFormer. More specifically, a LazyFormer consists of multiple basic blocks, and each basic block is composed of m Transformer layers, as shown in Figure 1. In each block, we compute the Query-Key dot-product attention only once, in the first layer, and then reuse it in the upper layers. In this way, LazyFormer only needs to calculate the attention once every m Transformer layers, reducing the computational cost from $O(n^2)$ to $O(n^2/m)$.

We conduct extensive experiments to verify the efficiency and effectiveness of LazyFormer in language pre-training. From the results, we observe: 1) compared with a standard Transformer model of the same capacity, LazyFormer is 1.3x faster without hurting any performance; 2) LazyFormer allows us to train larger models effectively; specifically, at the same pre-training cost, the larger LazyFormer outperforms the baseline by 1 point, and it achieves a better GLUE score using only 50% of the cost; 3) LazyFormer handles longer sequences better, and the speed-up becomes more significant as n increases.

Related work
LazyFormer is inspired by several recent works that investigate what the self-attention module learns in each layer (Clark et al., 2019a; Vig and Belinkov, 2019; Vig, 2019; Xiao et al., 2019). From these previous works, we can easily see that the distributions of the self-attention outputs are similar between adjacent Transformer layers. Such observations motivate us to reuse the self-attention outputs of the lower layers in the upper layers.

[Figure 1: The basic block in LazyFormer. Within a lazy block, the Q Project, K Project, DotProd, and Softmax are computed once on the block input; each of the m layers keeps its own V Project, MatMul, Add & Norm, and FFN sublayers.]

Some works leverage this observation for better training, such as Gong et al. (2019) and Lan et al. (2019). However, these works share the parameters of self-attention across different layers but still need to compute the self-attention in every layer; therefore, they cannot save computational cost. Our work is also orthogonal to works that modify the self-attention module itself to reduce its cost, such as Reformer (Kitaev et al., 2020) and Linformer (Wang et al., 2020). These works aim to reduce the computation within each layer, while ours reuses the self-attention outputs of lower layers. The two approaches could be combined to further reduce the cost of language pre-training.
LazyFormer

The attention module (Vaswani et al., 2017) in the Transformer can be generally formulated as querying a dictionary with key-value pairs, e.g., $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d})\,V$, where $d$ is the dimensionality of the hidden representations. In the self-attention module, $Q$ (Query), $K$ (Key), and $V$ (Value) are parameterized from the input $x$, i.e., $\text{Attention}(xW^Q, xW^K, xW^V)$, where $W^Q$, $W^K$, and $W^V$ are the projection matrices. It is easy to see that the computational complexity of $\text{softmax}(QK^\top/\sqrt{d})$ is $O(n^2)$, since it has to compute the pair-wise correlations for all input tokens and produces an $n \times n$ matrix (the attention matrix). When we stack k Transformer layers, all attention matrices need to be computed, so the total cost of the self-attention calculation is $O(kn^2)$.

Many recent works (Vig and Belinkov, 2019; Vig, 2019) show that the attention matrices are similar in different layers, especially in adjacent layers. Therefore, we argue that the attention matrix may not need to be computed in every layer. For example, we can calculate the attention in one layer and reuse it for multiple adjacent upper layers.

Formally, we define a new Transformer variant called LazyFormer, which is composed of several "lazy" blocks. Each lazy block consists of m Transformer layers, as shown in Figure 1. In each block, the attention matrix is computed only in the first layer, using the input to the block, and is then reused by all m − 1 upper layers. We can stack k/m lazy blocks to construct a k-layer Transformer model. In this way, the cost of computing the attention is reduced from $O(kn^2)$ to $O(kn^2/m)$.
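To make the lazy update concrete, the following is a minimal PyTorch sketch of one lazy block. It is our illustration, not the authors' fairseq implementation; the class and argument names are invented for exposition, and it assumes the post-layer-norm residual structure of the original Transformer. The Query/Key projections, dot-product, and softmax run once per block, and all m layers share the resulting attention distribution while keeping their own value projections and feed-forward sublayers, as in Figure 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LazyBlock(nn.Module):
    """One lazy block: the attention distribution is computed once, reused m times."""

    def __init__(self, m: int, d_model: int, n_heads: int, d_ffn: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Q/K projections exist only once per block ...
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # ... while each of the m layers keeps its own V/output/FFN parameters.
        self.v_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(m))
        self.o_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(m))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(m)
        )
        self.attn_norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(m))
        self.ffn_norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(m))

    def split_heads(self, x):  # (B, n, d_model) -> (B, heads, n, d_head)
        B, n, _ = x.shape
        return x.view(B, n, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        # Attention distribution: computed once per block, from the block input.
        q, k = self.split_heads(self.q_proj(x)), self.split_heads(self.k_proj(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        # Note: no dropout on `attn`, per the dropout-removal trick described below.
        for v_proj, o_proj, ffn, norm1, norm2 in zip(
            self.v_projs, self.o_projs, self.ffns, self.attn_norms, self.ffn_norms
        ):
            # Each layer still projects V from its own input before the shared MatMul.
            v = self.split_heads(v_proj(x))
            ctx = (attn @ v).transpose(1, 2).reshape(x.shape)  # merge heads
            x = norm1(x + o_proj(ctx))  # Add & Norm after attention
            x = norm2(x + ffn(x))       # Add & Norm after FFN
        return x
```

Stacking k/m such blocks yields a k-layer model that materializes only k/m attention matrices per forward pass instead of k.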
Table 1: GLUE scores on the dev set. All models are pre-trained on 16GB of data. Both M2x6 and M2x6-S are LazyFormers with six two-layer blocks, without dropout in self-attention. M2x6-S uses the same parameter size as BERT. For the ablation study, M2x6-SD is M2x6-S with the dropout in self-attention kept. M2x6 increases the parameter count to match the same pre-training cost as BERT. M2x6 mid is the intermediate 500k-step checkpoint of M2x6. The details of the models are shown in Table 2.
[Columns: Steps, MNLI-m/mm, QNLI, QQP, SST, CoLA, MRPC, RTE, STS, Avg.; rows: BERT (1M steps), M2x6-S (1M), M2x6-SD (1M), M2x6 (1M), and M2x6 mid (500k). The numeric scores are not recoverable from the source.]

Based on the design of the lazy block, we further use the following two additional methods to improve LazyFormer.

Wider Layers.
In LazyFormer, the number of projection matrices $W^Q$ and $W^K$ is reduced. Therefore, LazyFormer has fewer total parameters than stacked Transformers of the same width and depth. For example, the BERT-Base model has 12 layers, with the embedding dimension set to 768 and the hidden dimension set to 3072. This configuration leads to a model size of about 110M parameters. If we use m = 3 for LazyFormer in the same setting, the model contains only about 100M parameters.

As model capacity plays an important role in language pre-training (Raffel et al., 2019), we can slightly increase the hidden/embedding dimensions in LazyFormer to match the same number of parameters. Note that increasing the hidden/embedding dimensions only slightly affects efficiency: a wider LazyFormer is still much faster than a Transformer of the same depth. We can even increase the width of LazyFormer until its forward/backward speed matches that of the Transformer, to achieve better performance at the same pre-training cost.
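As a back-of-envelope check (our sketch, using only the BERT-Base dimensions stated above; exact totals also depend on embeddings and other sublayers), the roughly 10M-parameter gap comes almost entirely from the dropped Query/Key projections:

```python
def qk_params(d_model: int) -> int:
    # One Query plus one Key projection: weights (d x d) and biases (d), twice.
    return 2 * (d_model * d_model + d_model)

d_model, n_layers, m = 768, 12, 3
# With m = 3, a 12-layer LazyFormer has 4 blocks; every layer except the
# first of each block drops its W_Q and W_K projections.
layers_without_qk = n_layers - n_layers // m   # 12 - 4 = 8
saved = layers_without_qk * qk_params(d_model)
print(f"saved ~= {saved / 1e6:.1f}M parameters")  # ~9.4M: about 110M -> ~100M
```

Remove dropout in self-attention.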
Dropout is used in self-attention by default. The cost of this dropout is also $O(n^2)$, as it is applied to the $n \times n$ attention matrix. Recent work (Lan et al., 2019) shows that the dropout in self-attention can be safely removed without hurting performance. Therefore, for better efficiency, we also remove the dropout in self-attention in LazyFormer.

Experiments

To verify the performance of the proposed LazyFormer, we conduct extensive experiments and present the results in this section. We use the BERT-Base (112M parameters) architecture for all experiments. Specifically, BERT-Base consists of 12 Transformer layers, in which the embedding dimension is 768, the number of attention heads is 12, and the hidden dimension is 3072. Besides absolute positional encodings, we further use the relative positional encodings of Raffel et al. (2019) in the self-attention module for better performance. We provide all the experimental details and results in the Appendix.
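For reference, one way to realize the relative positional encodings mentioned above is a learned per-head bias added to the attention logits, in the spirit of Raffel et al. (2019). The sketch below clips relative distances instead of using T5's logarithmic bucketing, so it is a simplification rather than the exact scheme used in the paper:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned per-head bias b[head, j - i] added to attention logits.
    Distances are clipped rather than log-bucketed: a simplified variant."""

    def __init__(self, n_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        # One embedding row per clipped relative distance in [-max, +max].
        self.bias = nn.Embedding(2 * max_distance + 1, n_heads)

    def forward(self, n: int) -> torch.Tensor:
        pos = torch.arange(n)
        rel = pos[None, :] - pos[:, None]  # (n, n) offsets j - i
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)  # (heads, n, n)

# usage: logits = q @ k.transpose(-2, -1) / d_head ** 0.5 + rel_bias(n)
```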
Following Devlin et al. (2018), we use a 16GB corpus (the English Wikipedia corpus and BookCorpus (Zhu et al., 2015)) for pre-training. We set the vocabulary size (sub-word tokens) to 32,768, the sequence length to 512, and the batch size to 256. We use the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models. All code is implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017). All models are run on 8 NVIDIA Tesla V100 GPUs with mixed precision (Micikevicius et al., 2017).

We use Mβxγ to denote a LazyFormer structure, where β is the number of layers in each lazy block and γ is the total number of blocks. For example, M2x6 denotes the LazyFormer with 6 blocks, each with 2 Transformer layers.

First, we set up models for the overall comparison. Besides the baseline BERT model, we set up two LazyFormer variants:
1) M2x6-S, which uses six lazy blocks with two Transformer layers in each block, and increases the hidden dimension to 3456 to retain the same parameter size as BERT; 2) M2x6, which increases the hidden dimension to 4480, the embedding dimension to 896, and the number of attention heads to 14, to retain the same pre-training cost as BERT. Both M2x6 and M2x6-S remove dropout in the self-attention module. The detailed settings can be found in Table 2.
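The naming convention and the Table 2 layouts can be summarized in a small configuration map (a hypothetical helper; the field names are ours):

```python
# blocks: layers per lazy block, listed block by block (M1x12 = vanilla BERT).
# W = FFN hidden dimension, H = embedding dimension, N = attention heads.
CONFIGS = {
    "BERT (M1x12)": dict(blocks=[1] * 12, W=3072, H=768, N=12),  # 112M params
    "M2x6-S":       dict(blocks=[2] * 6,  W=3456, H=768, N=12),  # same size as BERT
    "M2x6":         dict(blocks=[2] * 6,  W=4480, H=896, N=14),  # same cost as BERT
}
```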
Table 2: Details of the models in Table 1. W is the hidden dimension in the feed-forward layers, H is the embedding dimension, and N is the number of attention heads. Time is the training wall time of 100 iterations on 8 V100 GPUs.

Model     Params   (W, H, N)         Time   Speedup
BERT      112M     (3072, 768, 12)   45     1.0x
M2x6-S    112M     (3456, 768, 12)   35     1.3x
M2x6-SD   112M     (3456, 768, 12)   38     1.2x
M2x6      157M     (4480, 896, 14)   45     1.0x

Table 3: Comparison of different LazyFormer layouts. To match the parameter size of BERT, we increase the hidden dimension W for each layout. Dropout in self-attention is kept in all settings. M5M3M2M2 denotes a model with four sequential lazy blocks of 5, 3, 2, and 2 layers, respectively. (Entries marked – are not recoverable from the source.)

Model            W      Time   GLUE-Avg.
BERT (M1x12-SD)  3072   45     –
M2x6-SD          3456   38     –
M3x4-SD          3584   35     82.94
M4x3-SD          3648   34     83.22
M6x2-SD          3712   33     82.47
M5M3M2M2-SD      3584   35     82.89
M2M2M3M5-SD      3584   35     82.82

The results are shown in Table 1. First, M2x6-S achieves a slight improvement over BERT while being about 1.3x faster. This result indicates that LazyFormer is much more efficient without hurting any performance. Furthermore, M2x6 outperforms the baseline by 1 point in terms of the GLUE average score and is consistently better on almost all GLUE tasks. This demonstrates another important strength of LazyFormer: it allows us to increase model capacity to achieve better performance at the same computational cost. Besides, from Table 1, we can also see that the results of M2x6 mid (the intermediate 500k-step checkpoint of M2x6) are already competitive with those of BERT trained for 1M steps. This suggests LazyFormer can learn a better model in a significantly shorter time. As shown in Figure 2, both M2x6 and M2x6-S converge much faster than BERT in terms of both pre-training validation loss and downstream task performance.
As aforementioned, dropout in self-attention brings additional cost, so we empirically study whether it is essential. First, as shown in Table 1, comparing M2x6-S with M2x6-SD (M2x6-S with dropout in self-attention), removing the dropout in self-attention slightly improves performance. Besides, as shown in Table 2, removing the dropout in self-attention brings a 10% speed-up. In short, removing dropout in self-attention improves efficiency without hurting performance.

[Figure 2: (a) validation loss in pre-training; (b) MNLI-m score; (c) GLUE average score, each plotted over training for M2x6-S, M2x6, and BERT. Both M2x6-S and M2x6 converge much faster than the baseline, and M2x6 achieves better downstream performance while using much less pre-training cost.]
Table 4: Comparison of different (W, H, N) settings under the same pre-training cost. (Entries marked – are not recoverable from the source.)

Model    Params   (W, H, N)         MNLI-m/mm     GLUE-Avg.
M2x6-S   112M     (3456, 768, 12)   85.72/85.67   83.69
M2x6     157M     (4480, 896, 14)   –             –
–        –        (6144, 768, 12)   –             –
M2x7     148M     (4480, 768, 12)   86.25/86.26   84.75

[Figure 3: Speedup ratio of LazyFormer under different settings of lazy blocks (M1x12, M2x6-S, M3x4-S, M4x3-S, M6x2-S), plotted against sequence length.]
Different layouts.
There are many design choices in LazyFormer. For example, one can give the low-level blocks more or fewer layers, or give the whole model more or fewer blocks. We study how different layouts perform and summarize the results in Table 3. First, we find that when the number of blocks decreases, training is faster, but the acceleration is not significant, while the final performance gets worse. Therefore, we observe that using six blocks is a good trade-off between efficiency and effectiveness. Second, we find the number of blocks is the key factor for performance: among the 4-block models, setting each block to 3 layers (M3x4-SD) achieves performance similar to models that assign different numbers of layers to different blocks (M5M3M2M2-SD, M2M2M3M5-SD).

Different model settings under the same computational cost.
As aforementioned, LazyFormer allows us to use a larger model capacity within the same computational cost. Therefore, we investigate different settings for increasing the model parameters and summarize the results in Table 4. Specifically, we study the performance when we increase the hidden dimension W, the embedding dimension H, or the number of attention heads N, or when we stack more blocks. From the results, we find that all these settings achieve much better performance than the baseline.

Speedup for longer sequences.
When the cost of the densely connected layers is included, LazyFormer reduces the per-layer cost from $O(n^2 + p)$ to $O(n^2/m + p)$, where $p$ is the cost of the densely connected layers. Therefore, as n increases, the cost becomes dominated by the $n^2$ term, and the speedup brought by LazyFormer grows. To study this, we report speedup ratios for different n in Figure 3. It is easy to see that LazyFormer brings an almost 2x speed-up for long sequences.
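This trend is easy to reproduce from the cost model itself. The sketch below evaluates the ratio $(n^2 + p)/(n^2/m + p)$ for a fixed, purely illustrative FFN cost p (the real constant depends on layer widths and hardware); the ratio approaches m as n grows, consistent with the near-2x speed-up observed for m = 2:

```python
def speedup(n: int, m: int, p: float) -> float:
    """Speed-up implied by the paper's cost model:
    O(n^2 + p) per layer becomes O(n^2 / m + p) with m-layer lazy blocks."""
    return (n * n + p) / (n * n / m + p)

# Illustrative p only; with m = 2 the ratio climbs toward 2 for long sequences.
for n in (512, 1024, 4096, 16384):
    print(n, round(speedup(n, m=2, p=1e6), 2))  # 1.12, 1.34, 1.89, 1.99
```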
Conclusion

We propose LazyFormer, which lazily updates the self-attention module and leads to an efficient Transformer architecture. Specifically, LazyFormer consists of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once, in the first layer, and is then reused by all upper layers. Besides, LazyFormer removes the dropout in the self-attention module for better efficiency and increases the model capacity for better performance. Extensive experimental results demonstrate both the efficiency and effectiveness of LazyFormer.
References
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019a. What does BERT look at? An analysis of BERT's attention. ACL 2019, page 276.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2019b. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Efficient training of BERT by progressively stacking. In International Conference on Machine Learning, pages 2337–2346.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer. arXiv preprint arXiv:1906.11024.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.
Table 5: Hyperparameters for the pre-training and fine-tuning. (Values restored from the descriptions in Appendix A; entries marked – are not recoverable from the source.)

Hyperparameter        Pre-training       Fine-tuning
Max Steps             1M                 –
Max Epochs            –                  10
Learning Rate         1e-4 (peak)        searched per task
Batch Size            256                32
Warm-up Ratio         1% (10k steps)     –
Sequence Length       512                512
Learning Rate Decay   Linear             Linear
Adam ε                1e-6               1e-6
Adam (β1, β2)         (0.9, 0.999)       (0.9, 0.999)
Clip Norm             1.0                –
Dropout               0.1                –
Weight Decay          0.01               –
A Experimental Details
Pre-training.
Following BERT (Devlin et al., 2018), we use both the English Wikipedia corpus and BookCorpus (Zhu et al., 2015) for language pre-training. Concatenating these two datasets yields a corpus of 16GB of raw text. We adopt the following consecutive pre-processing steps: segmenting documents into sentences with spaCy (https://spacy.io); normalizing, lower-casing, and tokenizing the texts with the Moses decoder (Koehn et al., 2007); and finally applying byte pair encoding (BPE) (Sennrich et al., 2015) with the vocabulary size set to 32,768.

We found that data cleaning is important for language pre-training. To this end, we de-duplicate the documents, normalize the punctuation, concatenate short sequences, replace URLs and other hyperlinks with special tokens, and filter low-frequency tokens. As a result, our re-implemented baselines, such as BERT, achieve higher average GLUE scores than in the original papers.

We use masked language modeling as the pre-training objective. We remove the next sentence prediction task and use the FULL-SENTENCES mode to pack sentences, as suggested in RoBERTa (Liu et al., 2019). We train the models for 1000k steps with a batch size of 256 and a maximum sequence length of 512. The masking probability is set to 0.15; 80% of the masked positions are replaced by [MASK], 10% by randomly sampled words, and the remaining 10% are kept unchanged. We use Adam (Kingma and Ba, 2014) as the optimizer and set its hyperparameter ε to 1e-6 and (β1, β2) to (0.9, 0.999). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage, after which the learning rate decays linearly to zero. We set the dropout probability to 0.1, the gradient clip norm to 1.0, and the weight decay to 0.01. Besides the final checkpoint, we also save intermediate checkpoints and fine-tune them on the downstream tasks, to compare the efficiency of different methods.
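For concreteness, the masking procedure described above can be sketched as follows (an illustrative token-level version, not the authors' fairseq code, which operates on tensors of token ids):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """BERT-style MLM corruption: each position is selected with probability
    0.15; selected positions become [MASK] 80% of the time, a random token
    10% of the time, and stay unchanged the remaining 10% of the time."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must predict this token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: random word
            # else: 10% keep the original token unchanged
    return inputs, targets
```

Fine-tuning.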
We use the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models. Specifically, we use nine tasks in GLUE: CoLA, RTE, MRPC, STS-B, SST, QNLI, QQP, and MNLI-m/mm. For the evaluation metrics, we report Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks. We use the same optimizer (Adam) with the same hyperparameters as in pre-training. Following previous works, we search over learning rates during fine-tuning for each downstream task. The detailed settings are listed in Table 5. For a fair comparison, we do not apply any tricks during fine-tuning. Each configuration is run five times with different random seeds, and the median of the five results on the development set is used as the performance of that configuration. We ultimately report the best number over all configurations. We also provide all the detailed results in Table 6.
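The selection protocol amounts to a best-over-configurations of a median-over-seeds, e.g. (a hypothetical helper; the scores are made-up numbers for illustration only):

```python
import statistics

def reported_score(dev_scores_by_config):
    # Median over the five seeds of each configuration, then the best
    # configuration's median is reported.
    return max(statistics.median(scores) for scores in dev_scores_by_config.values())

print(reported_score({"lr=1e-5": [84.1, 84.3, 83.9, 84.0, 84.2],
                      "lr=3e-5": [84.6, 84.4, 84.9, 84.5, 84.7]}))  # -> 84.6
```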
Table 6: GLUE scores on the dev set. All models are pre-trained on 16GB of data. Both the per-task scores and the GLUE-average scores, evaluated at intermediate checkpoints (e.g., 500k steps) and at 1M steps, are reported. Models ending with '-S' keep the same parameter size as BERT; models ending with '-D' keep the dropout layers in self-attention.
[Rows cover BERT (M1x12-SD), M2x6-S, M2x6-SD, M2x6, and the other layouts of Tables 3 and 4; the per-checkpoint numeric scores are not recoverable from the source.]