Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
Tsung-Han Wu, Chun-Chen Hsieh, Yen-Hao Chen, Po-Han Chi, Hung-yi Lee
National Taiwan University, College of Electrical Engineering and Computer Science
{r07942145, r07942150, r07921112, r08942074, hungyilee}@ntu.edu.tw

Abstract
In this paper, we seek to reduce the computation complexity of transformer-based models for speech representation learning. We evaluate 10 attention mechanisms; then, we pre-train the transformer-based model with those attention mechanisms in a self-supervised fashion and use them as feature extractors on downstream tasks, including phoneme classification and speaker classification. We find that the proposed approach, which only uses hand-crafted and learnable attentions, is comparable with the full self-attention.
Index Terms: self-supervised learning, attention mechanism
1. Introduction
Transformers [1] have become the most powerful network architecture in many fields, including speech, natural language processing (NLP), and computer vision (CV). Their outstanding performance stems not only from capturing long-term dependencies in the input sequence but also from their remarkable training efficiency. Among all transformer-based models, BERT [2] is probably the most famous: it learns strong language representations from unlabeled text. The idea of training transformer-based models on unlabeled audio data has also been widely studied [3, 4, 5, 6, 7]. The basic recipe is that the transformer-based model learns to recover masked audio as pre-training, and the pre-trained model is then fine-tuned for downstream applications.

Nonetheless, transformer-based models usually incur a high computational cost because of their self-attention mechanism. Despite its effectiveness, vanilla self-attention [8], also known as full attention, suffers severely from quadratic memory and computation requirements with respect to the input sequence length. Input sentences in NLP already run into this problem, not to mention the much longer input sequences in speech. Therefore, several new attention mechanisms have been proposed to reduce the time complexity.

Here we summarize variants of attention mechanisms for the transformer. Sparse Transformers [9] craft new attention patterns that imitate the original ones to reduce memory usage and computation; the work proposes two attention masks, both of which have lower time complexity and satisfactory performance. Routing Transformers [10] use K-means clustering to determine the candidate keys for each query in self-attention, while Reformer [11] addresses the issue with a hashing algorithm. Moreover, locality-sensitive hashing (LSH) can solve the
Maximum Inner Product Search (MIPS) problem efficiently [12], and the self-attention mechanism in transformers can be regarded as a MIPS problem. Asymmetric LSH (ALSH) [13] provides the first provably sublinear-time hashing algorithm for MIPS, and the authors prove that no symmetric LSH family exists for MIPS. Subsequently, the same authors propose an improved ALSH algorithm [14] that reaches better performance. XBOX [15] is another ALSH algorithm. Meanwhile, Simple LSH [16] argues that neither of these LSH/ALSH schemes deals with MIPS well and proposes two stronger LSH/ALSH algorithms. Query Normalized First (QNF) [12] is a later method that can be used together with the previously mentioned LSH algorithms and shows superior empirical performance.

Furthermore, aside from hashing methods and hand-crafted attention masks, other algorithms can also reduce the time complexity of self-attention. Adaptive Attention [17] is a self-attention mechanism that learns its optimal attention span, allowing models to deal with longer sequences. Longformer [18] introduces an attention mechanism that scales linearly with the input length, making it easy to process longer input sequences; it also brings in dilated attention, analogous to dilated CNNs [19]. Lite Transformer [20] uses CNNs along with the original self-attention to accelerate computation and boost performance. Last but not least, SYNTHESIZER [21] proposes two simple yet effective ways to generate attention weights directly, without token-token interactions. Note that SYNTHESIZER differs from the LSH/ALSH attentions and Sparse attentions, all of which derive attention masks from token-token interactions; this modification drastically accelerates both training and inference.

Among these attentions, some have already been applied to NLP or CV tasks, whereas the others have only been examined theoretically or with simple empirical evaluation. In this paper, we make the following key contributions:

1. We implement these attention mechanisms and examine their efficiency and effectiveness in self-supervised transformer-based models learned from unlabeled audio. Table 1 summarizes the attentions implemented in this paper.
2. We propose a new attention mechanism, inspired by two previous works [21, 22], that yields competitive performance with a great reduction in training time.
2. Methodology
Each transformer layer takes an input vector sequence X = (x_1, ..., x_L) with X ∈ R^{L×D}, where L is the input sequence length and D stands for the hidden dimension. Three different linear layers project the input sequence X to its corresponding query matrix (Q ∈ R^{L×D}), key matrix (K ∈ R^{L×D}), and value matrix (V ∈ R^{L×D}), respectively. Each vector x_i in X has a query vector q_i (the i-th row of Q), a key vector k_i (the i-th row of K), and a value vector v_i (the i-th row of V). For standard single-head attention [1], the attention weight matrix A ∈ R^{L×L} is generated by multiplying the query matrix Q with the transpose of the key matrix K. An element A[i, j] is computed as

A[i, j] = (q_i · k_j) / √D.   (1)

For multi-head attention, H weight matrices A_1, ..., A_H are generated, where H stands for the number of attention heads.

Although multi-head attention is powerful, it requires an enormous amount of computation, and the situation deteriorates as input sequences become longer. A primitive idea is to restrict the number of keys attended to by each query; more specifically, the number of attended keys should not correlate strongly with the input length. This restriction can be realized efficiently by locality-sensitive hashing (LSH). In the following subsections, we elaborate on all the LSH algorithms implemented in this paper.
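To make (1) concrete, here is a minimal NumPy sketch of the single-head weight computation followed by the usual softmax over keys; the function names and toy dimensions are ours, not from the paper.

```python
import numpy as np

def attention_weights(X, Wq, Wk, scale_dim):
    """Single-head attention weights per Eq. (1): A[i, j] = q_i . k_j / sqrt(D)."""
    Q = X @ Wq                               # (L, D) query matrix
    K = X @ Wk                               # (L, D) key matrix
    logits = Q @ K.T / np.sqrt(scale_dim)    # (L, L) attention weight matrix A
    # Softmax over the key axis turns each row into a distribution.
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

L, D = 6, 8                                  # toy sequence length and hidden size
rng = np.random.default_rng(0)
X = rng.normal(size=(L, D))
A = attention_weights(X, rng.normal(size=(D, D)), rng.normal(size=(D, D)), D)
print(A.shape)  # (6, 6)
```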
Figure 1: (a) full attention, (b) sparse attention (strided), (c) sparse attention (fixed), (d) LSH/ALSH attention. Comparison between full attention and the other attention masks implemented in this work. These figures show the connections between the queries (columns) and the keys (rows); the colored squares represent the candidates that should be attended to in self-attention. In (a), dark blue squares represent queries, whereas light blue squares represent the attended key indices. In (b) and (c), dark blue squares are still queries, medium blue squares represent the key indices attended to by some attention heads, and light blue squares represent the key indices attended to by the other attention heads. Lastly, in (d), for each query, the LSH/ALSH algorithms determine the key indices to be attended to.

2.1. LSH/ALSH attention

In (1), the dot products of all pairs of q_i and k_j have to be computed. The basic idea of LSH is to quickly identify the pairs of q_i and k_j with large enough dot products; only the dot products of the identified pairs have to be computed, while the rest are directly set to zero.

In general, asymmetric LSH (ALSH) uses two different transformations S(x) and R(x), while symmetric LSH (LSH) uses exactly one transformation S(x). Both S(x) and R(x) take a vector x as input and output another vector. In the asymmetric case, query vectors q and key vectors k are encoded with the different transformations S(x) and R(x), respectively, while in the symmetric case we encode both q and k with the same transformation S(x). We then define the hash function

h_Sign(x) = sign(a^T x),   (2)

where x can be q or k, and a is a random vector with a_i ~ N(0, 1). For ALSH, if h_Sign(S(q)) = h_Sign(R(k)), query q attends to key k; otherwise, we take no action. For LSH, query q attends to key k only if h_Sign(S(q)) = h_Sign(S(k)). Additionally, we define a hyperparameter C to control the number of keys attended to: instead of all the keys that meet the hashing condition, we keep only the C keys with the top values of

score(k) = { a^T k,  if h_Sign(q) ≥ 0;  −a^T k,  if h_Sign(q) < 0. }   (3)

Due to space limitations, we only briefly list the formulation of each LSH algorithm evaluated in this paper; for further explanation of each algorithm, please refer to the original papers. All queries and keys have to be normalized before the hashing functions are applied. Note that the normalization method may differ between LSH algorithms, but for simplicity it is not explicitly formulated in the following descriptions.

Sign-ALSH [14]. There are two hyperparameters U and m; we let U = 0.75
and m = 2, which was shown to yield the best empirical results [14]. Next, to remove the norms of q and x, let M ≜ max_{x∈X} ‖x‖ and define the transformation T: R^D → R^D, T(x) = Ux/M, together with
S, R: R^D → R^{D+m}:

S(x) = [x; 0; 0; ...; 0],   (4)
R(x) = [x; 1/2 − ‖x‖²; 1/2 − ‖x‖⁴; ...; 1/2 − ‖x‖^{2^m}].   (5)
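The sketch below illustrates the Sign-ALSH transforms (4)-(5) and the sign hash (2) under our own naming; for brevity it uses a single random projection, whereas practical implementations hash with many projections.

```python
import numpy as np

U, m = 0.75, 2                                  # hyperparameters from Eqs. (4)-(5)

def sign_alsh_transforms(queries, keys):
    """Asymmetric Sign-ALSH transforms: queries get S, keys get R."""
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)  # normalize q
    M = np.linalg.norm(keys, axis=1).max()
    k = U * keys / M                                               # T(x) = Ux / M
    S = np.hstack([qn, np.zeros((len(queries), m))])               # [x; 0; ...; 0]
    norms = np.linalg.norm(k, axis=1, keepdims=True)
    # R(x) = [x; 1/2 - ||x||^2; 1/2 - ||x||^4; ...; 1/2 - ||x||^(2^m)]
    R = np.hstack([k] + [0.5 - norms ** (2 ** (i + 1)) for i in range(m)])
    return S, R

def sign_hash(vectors, a):
    """h_Sign(x) = sign(a^T x) from Eq. (2)."""
    return np.sign(vectors @ a)

rng = np.random.default_rng(1)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(10, 8))
S, R = sign_alsh_transforms(q, k)
a = rng.normal(size=S.shape[1])                 # one shared random projection
match = sign_hash(S, a)[:, None] == sign_hash(R, a)[None, :]
print(match.shape)  # (4, 10): True where a query may attend to a key
```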
XBOX [15] is an asymmetric LSH; it requires neither normalization nor hyperparameters. Its transformations are defined as
S, R: R^D → R^{D+1}:

S(x) = [x; 0],   (6)
R(x) = [x; √(M² − ‖x‖²)],   (7)

where M ≜ max_{x∈X} ‖x‖.

Simple LSH and Simple ALSH [16] have no hyperparameters. For normalization, let M ≜ max_{x∈X} ‖x‖ and define the transformation T: R^D → R^D, T(x) = x/M. For Simple LSH, define the transformation S: R^D → R^{D+1}:

S(x) = [x; √(1 − ‖x‖²)],   (8)

whereas for Simple ALSH, define two transformations S, R: R^D → R^{D+2}:

S(x) = [x; 0; √(1 − ‖x‖²)],   (9)
R(x) = [x; √(1 − ‖x‖²); 0].   (10)

In QNF [12], let M ≜ max_{x∈X} ‖x‖ and define the transformations S, R: R^D → R^{D+1}:

S(x) = [λx; 0], where λ = M/‖x‖,   (11)
R(x) = [x; √(M² − ‖x‖²)].   (12)
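As a rough end-to-end picture, this sketch combines the XBOX transforms (6)-(7) with the top-C rule (3) to select candidate keys per query; how ties and under-full buckets are handled here is our assumption, not something the paper specifies.

```python
import numpy as np

def xbox_transforms(queries, keys):
    """XBOX (Eqs. (6)-(7)): S(q) = [q; 0], R(k) = [k; sqrt(M^2 - ||k||^2)]."""
    M = np.linalg.norm(keys, axis=1).max()
    S = np.hstack([queries, np.zeros((len(queries), 1))])
    tail = np.sqrt(M ** 2 - np.linalg.norm(keys, axis=1, keepdims=True) ** 2)
    return S, np.hstack([keys, tail])

def top_c_keys(queries, keys, a, C):
    """Pick C candidate keys per query using the scoring rule of Eq. (3)."""
    S, R = xbox_transforms(queries, keys)
    hq, hk = np.sign(S @ a), np.sign(R @ a)
    out = []
    for i in range(len(queries)):
        score = (a[:-1] @ keys.T) if hq[i] >= 0 else -(a[:-1] @ keys.T)
        score = np.where(hk == hq[i], score, -np.inf)  # keep matching buckets only
        # Top-C indices (may include non-matching keys if fewer than C match).
        out.append(np.argsort(score)[-C:])
    return np.array(out)

rng = np.random.default_rng(2)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(16, 8))
a = rng.normal(size=9)                          # D + 1 after the transform
print(top_c_keys(q, k, a, C=3))                 # (4, 3) key indices per query
```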
Figure 2: Attention weights learned by our model; panels (a)-(l) show the 12 attention heads H-1 through H-12.

Figure 3: Comparison between the attention weight generation of transformers and SYNTHESIZER: (a) Transformer, (b) Dense SYNTHESIZER, (c) Random SYNTHESIZER.
2.2. Sparse attention

Instead of using algorithms to determine which keys should be attended to, Sparse Attention [9] simply crafts attention masks M ∈ R^{L×L}. With such a mask, the attention weight matrix is defined as

A[i, j] = (q_i · k_j / √D) M[i, j],   (13)

where M[i, j] is multiplied with the attention weight in (1). If M is very sparse, a real implementation of (13) only has to consider the unmasked elements, so the computation can be greatly accelerated. Two different masks are proposed in [9]; they are shown in Figures 1b and 1c.
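As a rough illustration of (13), the sketch below builds a causal strided mask in the spirit of [9] (the exact window and stride layout of [9] differs in detail) and multiplies it into the attention weights.

```python
import numpy as np

def strided_mask(L, stride):
    """Strided sparse mask: each query sees a local causal window plus
    every stride-th earlier position (a simplified version of [9])."""
    M = np.zeros((L, L))
    for i in range(L):
        M[i, max(0, i - stride + 1): i + 1] = 1.0        # local window
        cols = np.arange(i + 1)
        M[i, cols[cols % stride == stride - 1]] = 1.0    # strided positions
    return M

L, D = 16, 8
rng = np.random.default_rng(3)
Q, K = rng.normal(size=(L, D)), rng.normal(size=(L, D))
M = strided_mask(L, stride=4)
A = (Q @ K.T / np.sqrt(D)) * M        # Eq. (13): the mask multiplies the weights
print(int(M.sum()), "of", L * L, "entries kept")
```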
2.3. SYNTHESIZER

Instead of learning attention masks by algorithms, SYNTHESIZER [21] learns the attention weights directly. We compare the typical attention weight generation processes of the Transformer and SYNTHESIZER in Figure 3. We implement two versions of SYNTHESIZER in this paper.

Dense SYNTHESIZER. The attention weights A are generated by feeding X to a function F with two hidden layers, as in the left flow of Figure 3b. Here, W_1: R^D → R^N and W_2: R^N → R^L:

F(X) = W_2(σ_R(W_1(X) + b_1)) + b_2,   (14)

where σ_R is the ReLU activation function and N is a user-defined hyperparameter.

Random SYNTHESIZER. Here A (A ∈ R^{H×L×L}) is a learnable matrix that is randomly initialized and learned together with the other parts of the network; in other words, the elements of A are treated as network parameters. Note that A does not depend on the input sequences: it is the same across all inputs, so all data in a batch share the same attention weights.
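A minimal sketch of both variants follows: the Dense weights are computed from X by (14), while the Random weights are a plain parameter tensor. Note that W_2 maps to R^L, so the maximum sequence length must be fixed in advance; all names and toy sizes here are ours.

```python
import numpy as np

L, D, N, H = 6, 8, 16, 2
rng = np.random.default_rng(4)
X = rng.normal(size=(L, D))

# Dense SYNTHESIZER, Eq. (14): weights are a function of X alone,
# with no query-key dot products.
W1, b1 = rng.normal(size=(D, N)), np.zeros(N)
W2, b2 = rng.normal(size=(N, L)), np.zeros(L)
A_dense = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # (L, L) weights

# Random SYNTHESIZER: the weights are a learnable tensor that is
# independent of the input and shared by every sequence in a batch.
A_random = rng.normal(size=(H, L, L))              # trained like any parameter

print(A_dense.shape, A_random.shape)               # (6, 6) (2, 6, 6)
```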
Figure 4: Performance of utterance-level speaker classification versus time complexity.

2.4. Proposed attention

The proposed method is based on the Random SYNTHESIZER, but the initialization comes from the crafted attention masks of previous work [22]. Among the 12 attention heads in our model, seven are initialized according to the patterns proposed in [22], while the others are randomly initialized.
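The initialization scheme can be sketched as below; the seven relative-offset patterns are stand-ins in the spirit of [22], not the exact patterns used in the paper, and all function names are ours.

```python
import numpy as np

def handcrafted_head(L, offset):
    """One hand-crafted pattern: a logit matrix whose softmax concentrates on
    the token at a fixed relative offset (diagonal-style, in the spirit of [22])."""
    A = np.full((L, L), -1e9)                        # pre-softmax logits
    for i in range(L):
        A[i, min(max(i + offset, 0), L - 1)] = 0.0   # peak at position i + offset
    return A

def init_proposed_attention(H, L, n_crafted=7, seed=0):
    """Proposed scheme: 7 of H heads start from crafted patterns, the rest
    are random; all of them remain learnable parameters afterwards."""
    rng = np.random.default_rng(seed)
    heads = [handcrafted_head(L, off) for off in range(-3, -3 + n_crafted)]
    heads += [rng.normal(size=(L, L)) for _ in range(H - n_crafted)]
    return np.stack(heads)                           # (H, L, L) parameter tensor

A = init_proposed_attention(H=12, L=10)
print(A.shape)  # (12, 10, 10): input-independent attention weights
```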
3. Experiments
All attention mechanisms and SYNTHESIZER models are compared in Table 1. There are three main groups, standing for Sparse Attention, LSH, and SYNTHESIZER, respectively. We compare the time complexity of both training and inference, and we also list the application fields in which each attention has been applied. Here, we can make four key observations: 1) For Sparse Attention and LSH, although inference takes a bit longer, they require less training time as L grows larger. 2) The second term in every LSH complexity is for hashing, which we implement in float16; this is why those terms are multiplied by 1/2. Also, the hashing functions carry no gradient, so their training time complexity is the same as their inference complexity. 3) For Syn. (Dense), letting N ≪ L accelerates training dramatically; we let N = 16 in this work. 4) Syn. (Random) needs no computation at all to generate attention weights during inference.

Following the two-stage training process of BERT [2], we pre-train the transformer-based models on unlabeled audio data (the LibriSpeech [23] train-clean-360 subset) and fine-tune them for the downstream tasks.
Table 1: Summary of all attentions. L: input sequence length, D: hidden dimension, H: number of attention heads, C: number of keys attended to, and N: hidden dimension in SYNTHESIZER (Dense).

Attention | Time Complexity (Training) | Time Complexity (Inference) | Application Fields
Baseline (QK) [1] | LD² + 2HL² | LD² + HL² | Speech, CV, NLP
Baseline (Q) [11] | LD² + 2HL² | LD² + HL² | CV, NLP
Sparse (strided) [9] | LD² + 2HL√L + HL√L | LD² + HL√L + HL√L/2 | CV
Sparse (fixed) [9] | LD² + 2HL√L + HL√L | LD² + HL√L/2 + HL√L/2 | CV
Sign-ALSH [14] | LD² + (HL + 2HL²)/2 + HLC | LD² + (HL + 2HL²)/2 + HLC | RS‡
XBOX [15] | LD² + (HL + 2HL²)/2 + HLC | LD² + (HL + 2HL²)/2 + HLC | RS
XBOX (QNF) [12] | LD² + (HL + 2HL²)/2 + HLC | LD² + (HL + 2HL²)/2 + HLC | RS
Simple LSH [16] | LD² + (HL + 2HL²)/2 + HLC | LD² + (HL + 2HL²)/2 + HLC | RS
Simple ALSH [16] | LD² + (HL + 2HL²)/2 + HLC | LD² + (HL + 2HL²)/2 + HLC | RS
SYN. (Dense) [21] | LDN + 2L²N | LDN + L²N | NLP
SYN. (Dense+M§) [21] | HLDN + 2HL²N | HLDN + HL²N | NLP
SYN. (Random) [21] | HL² | – | NLP
SYN. (Ours) | HL² | – | –

‡ Recommender Systems. § M denotes multi-head attention. ¶ Input acoustic features (Mel features) are applied directly to the downstream models.
Table 2: Performance of all attentions on the downstream tasks (accuracy).

Attention | Speaker (Utterance) | Speaker (Frame) | Phoneme (1-hidden) | Phoneme (2-hidden)
Baseline (Mel¶) | 0.0060 | 0.0033 | 0.5246 | 0.5768
Baseline (QK) | 0.9926 | 0.9824 | 0.6460 | 0.6887
Baseline (Q) | 0.9898 | 0.9622 | 0.5893 | 0.6345
Sparse (Strided) | 0.9786 | 0.9039 | 0.6048 | 0.6450
Sparse (Fixed) | 0.9597 | 0.7960 | 0.6069 | 0.6846
Sign-ALSH | 0.9716 | 0.8237 | 0.5863 | 0.6393
XBOX | 0.9639 | 0.7994 | 0.5860 | 0.6262
XBOX (QNF) | 0.9667 | 0.7958 | 0.5819 | 0.6241
Simple LSH | 0.9628 | 0.7370 | 0.5771 | 0.6189
Simple ALSH | 0.9678 | 0.7999 | 0.5783 | 0.6214
SYN. (Dense) | 0.9660 | 0.9027 | 0.6180 | 0.6287
SYN. (Dense+M§) | 0.9509 | 0.9135 | 0.6073 | 0.6471
SYN. (Random) | 0.9803 | 0.8868 | 0.5820 | 0.6237
SYN. (Ours) | 0.9842 | 0.9855 | 0.6157 | 0.6492

All transformer-based models have the same architecture as the 6-layer Audio ALBERT (AALBERT) [24]. We adopt the shared-QK attention [11] for all the methods in Sections 2.1 and 2.2, which ties the weights of Q and K to further reduce the computation requirement. All other settings of the pre-training stage follow [24]: the models are trained for 500k steps with a batch size of 50 and a learning rate of 5e-5, using the LAMB optimizer [25].

We evaluate the self-supervised models on three downstream tasks: utterance-level speaker classification (on the train-clean-100 subset), frame-level speaker classification (also on train-clean-100), and phoneme classification (on the train-clean-360 subset with phoneme labels). In the downstream tasks, the pre-trained models are used as feature extractors whose parameters are fixed during training. For utterance-level speaker classification, the extracted representations are passed to an average pooling layer, and the mean representation is then fed to a linear classifier. For the frame-level task, we simply train a linear classifier to predict the speaker label of every audio frame. For phoneme classification, we use both a one-hidden-layer and a two-hidden-layer downstream model; here, a layer consists of a linear layer followed by a ReLU activation.

The results are shown in Table 2. Baseline (QK) and Baseline (Q) (shared-QK attention) remarkably outperform Baseline (Mel), which shows the importance of pre-training. The LSH/ALSH algorithms hurt most downstream tasks, showing that restricting the attention via LSH/ALSH is not effective enough. For utterance-level speaker classification, the average pooling layer in the downstream model acts like a global attention mechanism, which compensates for the effects of LSH/ALSH. Sparse Attention obtains higher accuracy than LSH/ALSH, suggesting that local information is important: Sparse Attention always contains a fixed-size local window, whereas LSH/ALSH does not.

The SYNTHESIZER models perform even better on average than the other two groups; however, they fail to match Baseline (Q) on the frame-level speaker classification task. Our model, which combines a Random SYNTHESIZER with hand-crafted masks, achieves performance competitive with Baseline (Q) and even outperforms Baseline (QK) on the frame-level speaker classification task. Notably, the training time of a Random SYNTHESIZER model can be less than that of Baseline (QK), and it also uses less memory.

Figure 4 plots utterance-level speaker classification accuracy against the number of keys attended to. Sparse attention obtains better performance than LSH/ALSH but attends to more keys. The number of keys for all the SYNTHESIZERs is considered zero because they generate the attention weights directly. The figure shows that the proposed approach achieves better performance with less computation.
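For concreteness, the utterance-level probe described above is just mean pooling over the frozen extractor's frame representations followed by a linear classifier; the sketch below uses our own names and toy sizes (the dimensions and speaker count are illustrative, not the paper's).

```python
import numpy as np

def utterance_logits(frame_repr, W, b):
    """Utterance-level probe: average-pool the frozen extractor's frame
    representations over time, then apply a linear speaker classifier."""
    pooled = frame_repr.mean(axis=0)        # (D,) mean over the L frames
    return pooled @ W + b                   # (n_speakers,) logits

L, D, n_speakers = 120, 768, 251            # illustrative sizes only
rng = np.random.default_rng(5)
reprs = rng.normal(size=(L, D))             # stand-in for extracted features
print(utterance_logits(reprs, rng.normal(size=(D, n_speakers)),
                       np.zeros(n_speakers)).shape)  # (251,)
```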
In Figure 2, we visualize the attention weights of all 12 heads in our model. Heads 1 to 7 remain very similar to their corresponding patterns at initialization, while the other heads are rather random. This outcome may be partly because the patterns of Heads 1 to 7 already capture the information well enough; the other heads therefore only need to learn the details neglected by the first seven heads.
4. Conclusion
We explore the possibility of reducing the computation complexity of transformer-based models for self-supervised representation learning. We try LSH and Sparse attention to limit the number of token-token interactions in self-attention, and we also introduce the recently proposed SYNTHESIZER attention modules. We then propose combining the Random SYNTHESIZER with hand-crafted patterns. In our experiments, the proposed architecture not only performs comparably to vanilla transformer-based models on downstream tasks but also requires less training time and less memory.

5. References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] X. Song, G. Wang, Z. Wu, Y. Huang, D. Su, D. Yu, and H. Meng, "Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks," arXiv preprint arXiv:1910.10387, 2019.
[4] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li, "Improving Transformer-based speech recognition using unsupervised pre-training," arXiv preprint arXiv:1910.09932, 2019.
[5] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[6] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[7] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," arXiv preprint arXiv:1904.03416, 2019.
[8] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," arXiv preprint arXiv:1703.03130, 2017.
[9] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv preprint arXiv:1904.10509, 2019.
[10] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, "Efficient content-based sparse attention with Routing Transformers," arXiv preprint arXiv:2003.05997, 2020.
[11] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient Transformer," arXiv preprint arXiv:2001.04451, 2020.
[12] Q. Huang, G. Ma, J. Feng, Q. Fang, and A. K. Tung, "Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1561-1570.
[13] A. Shrivastava and P. Li, "Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS)," in Advances in Neural Information Processing Systems, 2014, pp. 2321-2329.
[14] A. Shrivastava and P. Li, "Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS)," arXiv preprint arXiv:1410.5410, 2014.
[15] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet, "Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces," in Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 257-264.
[16] B. Neyshabur and N. Srebro, "On symmetric and asymmetric LSHs for inner product search," arXiv preprint arXiv:1410.5518, 2014.
[17] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin, "Adaptive attention span in transformers," arXiv preprint arXiv:1905.07799, 2019.
[18] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020.
[19] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[20] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han, "Lite Transformer with long-short range attention," arXiv preprint arXiv:2004.11886, 2020.
[21] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, "Synthesizer: Rethinking self-attention in Transformer models," arXiv preprint arXiv:2005.00743, 2020.
[22] A. Raganato, Y. Scherrer, and J. Tiedemann, "Fixed encoder self-attention patterns in transformer-based machine translation," arXiv preprint arXiv:2002.10260, 2020.
[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.
[24] P.-H. Chi, P.-H. Chung, T.-H. Wu, C.-C. Hsieh, S.-W. Li, and H.-y. Lee, "Audio ALBERT: A lite BERT for self-supervised learning of audio representation," arXiv preprint, 2020.
[25] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Large batch optimization for deep learning: Training BERT in 76 minutes," in International Conference on Learning Representations, 2020.