END-TO-END MULTI-CHANNEL TRANSFORMER FOR SPEECH RECOGNITION
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, and Siegfried Kunzmann
Alexa Machine Learning, Amazon, USA
{fengjc, radfarmr, mouchta, bbking, kunzman}@amazon.com

ABSTRACT
Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship "within" and "between" channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that on a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.
Index Terms — Transformer network, Attention layer, Multi-channel ASR, End-to-end ASR, Speech recognition
1. INTRODUCTION
In the past few years, voice-assisted devices have become ubiquitous, and enabling them to recognize speech well in noisy environments is essential. One approach to making these devices robust against noise is to equip them with multiple microphones, so that the spectral and spatial diversity of the target and interference signals can be leveraged using beamforming approaches [1–6]. It has been demonstrated in [4, 6, 7] that beamforming methods for multi-channel speech enhancement produce substantial improvements for ASR systems; therefore, existing ASR pipelines are mainly built on beamforming as a pre-processor, cascaded with an acoustic-to-text model [2, 8–10].

A popular beamforming method in the field of ASR is super-directive (SD) beamforming [11, 12], which computes its beamforming weights under the assumption of a spherically isotropic noise field. This method requires knowledge of the distances between sensors and white noise gain control [2]. With the great success of deep neural networks in ASR, there has been significant interest in having end-to-end all-neural models in voice-assisted devices. Therefore, neural beamformers are becoming state-of-the-art technologies for the unification of all-neural models in speech recognition devices [8–10, 13–19]. In general, neural beamformers can be categorized into fixed beamforming (FBF) and adaptive beamforming (ABF). While the beamforming weights in FBF [10, 16] are fixed at inference time, the weights in ABF [8, 9, 13–15, 17] can vary based on the input utterances [17, 19] or on the expected speech and noise statistics computed by a neural mask estimator [13, 14] together with the well-known MVDR formalization [20].

Transformers [21] are powerful neural architectures that have lately been used in ASR [22–24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only recently has the attention concept also been applied to beamforming, specifically for speech and noise mask estimation [9, 27]. While theoretically founded via the MVDR formalization [20], a good speech and noise mask estimator needs to be pre-trained on synthetic data with well-defined target speech and noise annotations; the speech and noise statistics of synthetic data, however, may be far from those of real-world data, which can lead to noise leaking into the target speech statistics and vice versa [28]. This drawback can further deteriorate joint fine-tuning with the cascaded acoustic models.

In this paper, we bypass the above front-end formalization and propose an end-to-end multi-channel transformer network that directly takes the spectral and spatial representations (magnitude and phase of the STFT coefficients) of the raw channels, and uses attention layers to learn the contextual relationship within each channel and across channels, while modeling the acoustic-to-text mapping. The experimental results show that our method outperforms two neural beamformers cascaded with transformers by 9% and 9.33%, respectively, in terms of relative WER reduction on a far-field in-house dataset. In Sections 2, 3, and 4, we present the proposed model, our experimental setup and results, and the conclusions, respectively.
2. PROPOSED METHOD
Given C channels of audio sequences X = (X_1, ..., X_i, ..., X_C) and the target token sequence Y = (y_1, ..., y_j, ..., y_U) of length U, where X_i ∈ R^{T×F} is the i-th channel feature matrix with T frames and F features, and y_j ∈ R^{L×1} is a one-hot vector of a token from a predefined set of L tokens, our objective is to learn a mapping that maximizes the conditional probability p(Y|X). An overview of the multi-channel transformer is shown in Fig. 1; it contains the channel and token embeddings, the multi-channel encoder, and the multi-channel decoder. For clarity, and to focus on how we integrate multiple channels with attention mechanisms, we omit the multi-head attention [21], layer normalization [29], and residual connections [30] from the equations, and only illustrate them in Fig. 2.

Channel and Token Embeddings: Like other sequence-to-sequence learning problems, we start by projecting the source channel features and one-hot token vectors into dense embedding spaces, for more discriminative representations. The i-th channel feature matrix X_i contains magnitude features X_i^{mag} and phase features X_i^{pha}; more details are described in Sec. 3. We use three linear projection layers, W_i^{me}, W_i^{pe}, and W_i^{je}, to embed the magnitude, phase, and their concatenated embeddings, respectively.
Fig. 1. An overview of the proposed multi-channel transformer network. C, N_e, and N_d are the number of channels, encoder layers, and decoder layers, respectively. Note that the audio sequences X_1, ..., X_i, ..., X_C share the same token sequence Y.
Since the transformer networks do not model the position of a token within a sequence, we employ the positional encoding (PE) [21] to add temporal ordering into the embeddings. The overall embedding process can be formulated as:

X̂_i = [X_i^{mag} W_i^{me}, X_i^{pha} W_i^{pe}] W_i^{je} + PE(t, f)    (1)

Here, all the bias vectors are omitted and [·, ·] indicates concatenation. X̂_i ∈ R^{T×d_m}, where d_m is the embedding size, i ∈ {1, ..., C}, t ∈ {1, ..., T}, and f ∈ {1, ..., d_m}. Similarly, the token embedding is formulated as:

ŷ_j = W^{te} y_j + b^{te} + PE(j, l)    (2)

Here W^{te} and b^{te} are learnable token-specific weight and bias parameters, ŷ_j ∈ R^{d_m×1}, j ∈ {1, ..., U}, and l ∈ {1, ..., d_m}.

Channel-wise Self Attention Layer (CSA): Each encoder layer starts by applying self-attention per channel (Fig. 2(b)) in order to learn the contextual relationship within a single channel. Following [21], we use the multi-head scaled dot-product attention (MH-SDPA), shown in Fig. 2(a), as the scoring function to compute the attention weights across time. Given the i-th channel embeddings X̂_i from Eq. (1), we obtain the queries, keys, and values via linear transformations followed by an activation function:

Q_i^{cs} = σ(X̂_i W^{cs,q} + 1 (b_i^{cs,q})^T)
K_i^{cs} = σ(X̂_i W^{cs,k} + 1 (b_i^{cs,k})^T)    (3)
V_i^{cs} = σ(X̂_i W^{cs,v} + 1 (b_i^{cs,v})^T)

Here σ(·) is the ReLU activation function, W^{cs,*} ∈ R^{d_m×d_m} and b^{cs,*} ∈ R^{d_m×1} are learnable weight and bias parameters, and 1 ∈ R^{T×1} is an all-ones vector. The channel-wise self attention output is then computed by:

H_i^{cs} = Softmax(Q_i^{cs} (K_i^{cs})^T / √d_m) V_i^{cs}    (4)

where the scaling by √d_m is for numerical stability [21]. We then add the residual connection [30] and layer norm [29] (see Fig. 2(b)) before feeding the contextual time-attended representations through the feed-forward layers, in order to get the final channel-wise attention outputs Ĥ_i^{cs}, as shown at the top of Fig. 2(b).

Cross-channel Attention Layer (CCA): The cross-channel attention layer (Fig. 2(c)) learns not only the cross correlation between time frames but also the cross correlation between channels, given the self-attended channel representations {Ĥ_i^{cs}}_{i=1}^{C}. We propose to create Q, K, and V as follows:

Q_i^{cc} = σ(Ĥ_i^{cs} W^{cc,q} + 1 (b_i^{cc,q})^T)
K_i^{cc} = σ(H^{CCA} W^{cc,k} + 1 (b_i^{cc,k})^T)    (5)
V_i^{cc} = σ(H^{CCA} W^{cc,v} + 1 (b_i^{cc,v})^T)

H^{CCA} = Σ_{j, j≠i} A_j ⊙ Ĥ_j^{cs}    (6)

where Ĥ_i^{cs} is the input for generating the queries. In addition, the keys and values are generated from the weighted sum of contributions from the other channels, {Ĥ_j^{cs}}_{j=1, j≠i}^{C}, i.e., Eq. (6), which is similar to the beamforming process. Note that A_j, W^{cc,*}, and b^{cc,*} are learnable weight and bias parameters, and ⊙ indicates element-wise multiplication. The cross-channel attention output is then computed by:

H_i^{cc} = Softmax(Q_i^{cc} (K_i^{cc})^T / √d_m) V_i^{cc}    (7)

To the best of our knowledge, this is the first time such a cross-channel attention mechanism has been introduced within the transformer network for multi-channel ASR. Similar to CSA, we feed the contextual channel-attended representations through feed-forward layers to get the final cross-channel attention outputs Ĥ_i^{cc}, as shown at the top of Fig. 2(c). To learn more sophisticated contextual representations, we stack multiple CSAs and CCAs to form the encoder network outputs {H_i^e}_{i=1}^{C} in Fig. 2(d).
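To make the CSA and CCA computations concrete, the following is a minimal single-head PyTorch sketch of one encoder layer (Eqs. (3)–(7)). Multi-head attention, layer norm, residual connections, and feed-forward blocks are omitted, as in the equations above, and all class names, function names, and shape choices (e.g., the shape of A_j) are our illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Scaled dot-product attention with ReLU-activated Q/K/V projections (Eqs. (3)-(4), (7))."""
    def __init__(self, d_m):
        super().__init__()
        self.q = nn.Linear(d_m, d_m)   # the Linear biases play the role of b^{*,q|k|v}
        self.k = nn.Linear(d_m, d_m)
        self.v = nn.Linear(d_m, d_m)
        self.d_m = d_m

    def forward(self, x_q, x_kv):
        q = F.relu(self.q(x_q))        # (T_q, d_m)
        k = F.relu(self.k(x_kv))       # (T_kv, d_m)
        v = F.relu(self.v(x_kv))
        attn = torch.softmax(q @ k.T / self.d_m ** 0.5, dim=-1)
        return attn @ v                # (T_q, d_m)

class MultiChannelEncoderLayer(nn.Module):
    """Per-channel CSA, then CCA with keys/values built from the other channels (Eqs. (5)-(6))."""
    def __init__(self, num_channels, d_m):
        super().__init__()
        self.csa = nn.ModuleList([SingleHeadAttention(d_m) for _ in range(num_channels)])
        self.cca = nn.ModuleList([SingleHeadAttention(d_m) for _ in range(num_channels)])
        # A_j of Eq. (6); one (d_m,) element-wise weight vector per channel is an assumption
        self.A = nn.Parameter(torch.ones(num_channels, d_m))

    def forward(self, xs):             # xs: list of C embedded channels, each (T, d_m)
        h_cs = [attn(x, x) for attn, x in zip(self.csa, xs)]   # CSA: within-channel attention
        outputs = []
        for i, attn in enumerate(self.cca):
            # Eq. (6): element-wise weighted sum over the *other* channels
            h_cca = sum(self.A[j] * h_cs[j] for j in range(len(h_cs)) if j != i)
            outputs.append(attn(h_cs[i], h_cca))               # queries come from channel i
        return outputs
```

Stacking N_e such layers, with the omitted residual, normalization, and feed-forward blocks restored, yields the encoder outputs {H_i^e}.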
Fig. 2. The attention blocks in our multi-channel transformer. (a) shows the multi-head scaled dot-product attention (MH-SDPA). (b), (c), and (d) show a channel-wise self attention layer (CSA), a cross-channel attention layer (CCA), and a multi-channel encoder-decoder attention layer (EDA), respectively.

Multi-channel Encoder-Decoder Attention Layer (EDA): Similar to [21], we employ the masked self-attention layer (MSA), Ĥ^{sa}, to model the contextual relationship between target tokens and their predecessors. It is computed as in Eqs. (3) and (4), but with the token embeddings (Eq. (2)) as inputs. We then create the queries from Ĥ^{sa}, and the keys as well as values from the multi-channel encoder outputs {H_i^e}_{i=1}^{C}, as follows:

Q^{ed} = σ(Ĥ^{sa} W^{ed,q} + 1 (b^{ed,q})^T)
K^{ed} = σ((1/C) Σ_{i=1}^{C} H_i^e W^{ed,k} + 1 (b^{ed,k})^T)    (8)
V^{ed} = σ((1/C) Σ_{i=1}^{C} H_i^e W^{ed,v} + 1 (b^{ed,v})^T)

Again, W^{ed,*} and b^{ed,*} are learnable weight and bias parameters. The multi-channel decoder attention then becomes the regular encoder-decoder attention of the transformer decoder. Similarly, by applying MH-SDPA, layer norm, and feed-forward layers, we get the final decoder output Ĥ^{ed}, as shown at the top of Fig. 2(d). To train our multi-channel transformer, we use the cross-entropy loss with label smoothing of value ε_ls = 0.1 [31].
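Continuing the sketch above (same imports and the same SingleHeadAttention class), the multi-channel encoder-decoder attention of Eq. (8) can be written as follows; averaging before the linear projection is equivalent to projecting each channel and averaging, since the map is linear. Again this is a single-head, illustrative sketch:

```python
class MultiChannelEncDecAttention(nn.Module):
    """EDA: queries from the masked self-attention output, keys/values from averaged encoder outputs."""
    def __init__(self, d_m):
        super().__init__()
        self.attn = SingleHeadAttention(d_m)    # reuses the sketch class defined above

    def forward(self, h_sa, h_enc):
        # h_sa: (U, d_m) masked self-attention output; h_enc: list of C (T, d_m) encoder outputs
        h_avg = torch.stack(h_enc).mean(dim=0)  # (1/C) * sum_i H_i^e of Eq. (8)
        return self.attn(h_sa, h_avg)           # (U, d_m): one attended vector per target position
```

The output then passes through the usual layer-norm and feed-forward blocks, and a final linear-plus-softmax layer produces the predicted token probabilities trained with the label-smoothed cross-entropy loss.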
3. EXPERIMENTS

3.1. Dataset
To evaluate our multi-channel transformer method (MCT), we conduct a series of ASR experiments using over 2,000 hours of speech utterances from our in-house anonymized far-field dataset. The training set, validation set (for model hyper-parameter selection), and test set comprise 2,000 hours (3,120,000 utterances), 4 hours (6,000 utterances), and 16 hours (25,000 utterances), respectively. The device-directed speech data was captured using a smart speaker with 7 microphones and an aperture of 63 mm. The users may move while speaking to the device, so the interaction with the device was completely unconstrained. From this dataset, 2 microphone signals separated by the aperture distance, together with the super-directive beamformed signal obtained from all 7 microphone signals via [11], are employed throughout the experiments.
3.2. Baselines

We compare our multi-channel transformer (MCT) to four baselines:

(1) Single channel + Transformer (SCT): This serves as the single-channel baseline. We feed each of the two raw channels individually into the transformer for training and testing, and report the average WER over the two channels.

(2) Super-directive (SD) beamformer [11] + Transformer (SDBF-T): The SD beamformer is widely used in voice-assisted devices, including the one we used to obtain the beamformed signal in the in-house dataset; it uses all seven microphones. Multiple beamformers are built in the frequency domain toward different look directions, and the one with the maximum output energy is selected as the ASR input (see the sketch after this list); the input features to the transformer are therefore extracted from a single channel of beamformed audio.

(3) Neural beamformer [10] + Transformer (NBF-T): This serves as the fixed beamforming (FBF) baseline, using two microphone signals as inputs rather than the seven used by the SD beamformer. Multiple beamforming matrices toward seven beam directions are learned to combine the channels, and a convolutional layer then combines the energy features from all beam directions. The beamforming matrices are initialized with the MVDR beamformer [20].

(4) Neural mask-based beamformer [13] + Transformer (NMBF-T): This serves as the adaptive beamforming (ABF) baseline, and also uses two microphone signals as inputs. The mask estimator was pre-trained following [13].

Note that both neural beamforming models are jointly fine-tuned with the transformers.
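For intuition only, the look-direction selection in SDBF-T amounts to applying several fixed frequency-domain beamformers and keeping the output with maximum energy. A minimal NumPy sketch, treating the superdirective weights of [11] as given; the function and variable names are our illustration, not the production front-end:

```python
import numpy as np

def select_max_energy_beam(stft_channels, beam_weights):
    """stft_channels: (C, T, F) complex STFTs of the microphone signals.
    beam_weights: (B, C, F) complex beamforming weights, one set per look direction
    (any conjugation convention is assumed folded into the weights).
    Returns the (T, F) beamformed STFT with the highest total output energy."""
    beams = np.einsum('bcf,ctf->btf', beam_weights, stft_channels)  # combine channels per beam
    energies = (np.abs(beams) ** 2).sum(axis=(1, 2))                # total energy per direction
    return beams[np.argmax(energies)]
```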
3.3. Experimental Setup and Results

The transformers in all the baselines and our multi-channel transformer (MCT) use d_m = 256, d_ff = 1,024 hidden neurons, and h = 3 attention heads. While MCT and the transformer for NMBF-T use N_e = 4 and N_d = 4, the other transformers use N_e = 6 and N_d = 6 in order to have comparable model sizes, as shown in Table 1. Note that NMBF-T is about 5M parameters larger than the other methods, due to the BLSTM and feed-forward layers used in the mask estimator of [13].

Results of all the experiments are reported as the relative word error rate reduction (WERR). Given a method A's WER (WER_A) and a baseline B's WER (WER_B), the WERR of A over B is computed as (WER_B − WER_A) / WER_B; the higher the WERR, the better. For example, reducing a baseline WER of 10.0% to 9.0% yields a WERR of 10%.

Table 1. The relative word error rate reductions, WERRs (%), comparing the multi-channel transformer (MCT) to the beamformers cascaded with transformers. A higher number indicates a better WER.

Method                           | No. of channels | No. of parameters (Million) | WERR over SCT | WERR over SDBF-T | WERR over NBF-T | WERR over NMBF-T
SC + Transformer (SCT)           | 1               | 13.29                       | -             | -                | -               | -
SDBF [11] + Transformer (SDBF-T) | 7               | 13.29                       | 6.27          | -                | -               | -
NBF [10] + Transformer (NBF-T)   | 2               | 13.31                       | 2.42          | -4.11            | -               | -
NMBF [13] + Transformer (NMBF-T) | 2               | 18.53                       | 2.07          | -4.49            | -               | -
MCT with 2 channels (MCT-2)      | 2               | 13.63                       | 11.21         |                  | 9.00            | 9.33
MCT with 3 channels (MCT-3)      | 3               | 13.80                       |               |                  |                 |

Table 2. The WERRs (%) over MCT (with both CSA and CCA) when using CSA only or CCA only.

Channel-wise self attention (CSA) | Cross-channel attention (CCA) | WERR (%) over MCT
✓                                 | ✓                             | 0
✓                                 | ✗                             | -12.71
✗                                 | ✓                             | -13.12

The input features, the log-STFT square magnitude (for SCT and SDBF-T) and the STFT (for NBF-T and NMBF-T), are extracted every 10 ms with a window size of 25 ms from 80K audio samples (resulting in T = 166 frames per utterance); the features of each frame are then stacked with those of the left two frames, followed by downsampling by a factor of 3 to achieve a low frame rate, resulting in F = 768 feature dimensions. In the proposed method, we use both the log-STFT square magnitude features and phase features, following [32, 33], by applying the sine and cosine functions to the principal angles of the STFT at each time-frequency bin. We used the Adam optimizer [34] and varied the learning rate following [21, 22] for optimization. The subword tokenizer [35] is used to create tokens from the transcriptions; we use L = 4,000 tokens in total.

Table 1 shows the performance of our method (MCT-2) and the beamformer+transformer methods over the different baselines. While all cascaded beamformer+transformer methods perform better than SCT (by 2.07% to 6.27%), our method improves the WER the most (by 11.21%). When comparing WERRs over SDBF-T, however, only MCT-2 improves the WER. The degradations of NBF-T and NMBF-T relative to SDBF-T may be attributed not only to their using 2 rather than 7 microphones, but also to suboptimal front-end formalizations, either through a fixed set of weights for look-direction fusion (NBF-T) or through flawed speech/noise mask estimation (NMBF-T). Comparing our method directly to NBF-T and NMBF-T, we see 9% and 9.33% relative improvements, respectively.

We further investigated whether the information from the super-directive beamformer channel is complementary to the multi-channel transformer. To this end, we take the beamformed signal from the SD beamformer as a third channel and feed it together with the other two channels to our transformer (MCT-3). As shown in the last row of Table 1, about 10% extra relative improvement is achieved compared to MCT-2.
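To make the front-end concrete, here is a minimal sketch of the feature pipeline described earlier in this section, assuming 16 kHz audio and PyTorch (so the 25 ms window and 10 ms hop correspond to 400 and 160 samples); the FFT size, padding scheme, and retained bin count are our assumptions:

```python
import torch

def stft_features(wave, n_fft=512, hop=160, win=400, n_bins=256):
    """wave: (num_samples,) mono audio at 16 kHz. Returns log square magnitude
    and sin/cos phase features per 25 ms frame with a 10 ms hop."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win), return_complex=True)
    spec = spec[:n_bins].T                        # (T, n_bins); 256 bins * 3-frame stacking -> F = 768
    log_mag = torch.log(spec.abs() ** 2 + 1e-10)  # log-STFT square magnitude
    angle = spec.angle()                          # principal angle at each time-frequency bin
    phase = torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)
    return log_mag, phase

def stack_and_downsample(feats, left=2, factor=3):
    """Stack each frame with its two left neighbors, then keep every third frame (low frame rate)."""
    pad = feats[:1].repeat(left, 1)               # repeat the first frame as left-context padding
    padded = torch.cat([pad, feats], dim=0)       # (T + left, F)
    stacked = torch.cat([padded[i:i + len(feats)] for i in range(left + 1)], dim=-1)
    return stacked[::factor]                      # roughly T/3 frames, 3F feature dims
```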
In Fig. 3, we evaluate the convergence rate and quality by comparing the learning curves of our model with those of the other beamformer-transformer cascaded methods. Note that our model starts to converge at around 100K training steps, while the others have not.
Fig. 3. The WERR w.r.t. the training steps of our methods (MCT-2, MCT-3) compared to the beamformers cascaded with transformers. Our model starts to converge at around 100K steps, while the others have not.

We compute the WERRs of all methods over a fixed reference point, namely the highest WER observed during this period, which is attained by NBF-T (the left-most point of NBF-T, corresponding to WERR = 0). Our method converges faster than the others, with consistently higher relative WER improvements. We also observe that NMBF-T converges the slowest, with NBF-T the second slowest.

Finally, we conducted an ablation study to demonstrate the importance of the channel-wise self attention (CSA) and cross-channel attention (CCA) layers. To this end, we train two variants of the multi-channel transformer using CSA only or CCA only. Table 2 shows that the WERR drops significantly when either attention type is removed.

Furthermore, our model can easily be applied to more than 3 channels. In an 8-microphone case, the number of parameters would increase by only about 10% compared to the one-microphone case.
4. CONCLUSION
We proposed an end-to-end transformer-based multi-channel ASR model. We demonstrated that our model can capture the contextual relationships within and across channels via attention mechanisms. The experiments showed that our method (MCT-2) outperforms three cascaded beamformer-plus-acoustic-model pipelines in terms of WERR, and can easily be applied to more than 2 channels with an affordable increase in model parameters.

5. REFERENCES
[1] Maurizio Omologo, Marco Matassoni, and Piergiorgio Svaizer, "Speech recognition with microphone arrays," in Microphone Arrays, pp. 331–353. Springer, 2001.
[2] Matthias Wölfel and John McDonough, Distant Speech Recognition, John Wiley & Sons, 2009.
[3] Kenichi Kumatani, John McDonough, and Bhiksha Raj, "Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127–140, 2012.
[4] Keisuke Kinoshita et al., "A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, pp. 7, 2016.
[5] Tuomas Virtanen, Rita Singh, and Bhiksha Raj, Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons, 2012.
[6] Tobias Menne, Jahn Heymann, Anastasios Alexandridis, Kazuki Irie, Albert Zeyer, Markus Kitza, Pavel Golik, Ilia Kulikov, Lukas Drude, Ralf Schlüter, Hermann Ney, Reinhold Haeb-Umbach, and Athanasios Mouchtaris, "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in CHiME-4 Workshop, 2016.
[7] Jon Barker, Ricard Marxer, et al., "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in ASRU, 2015.
[8] Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, and Shinji Watanabe, "MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition," in ASRU, 2019.
[9] Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, and Shinji Watanabe, "End-to-end multi-speaker speech recognition with transformer," in ICASSP, 2020.
[10] Kenichi Kumatani, Wu Minhua, Shiva Sundaram, Nikko Ström, and Björn Hoffmeister, "Multi-geometry spatial acoustic modeling for distant speech recognition," in ICASSP, 2019.
[11] Simon Doclo and Marc Moonen, "Superdirective beamforming robust against microphone mismatch," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 617–631, 2007.
[12] Ivan Himawan, Iain McCowan, and Sridha Sridharan, "Clustered blind beamforming from ad-hoc microphone arrays," TASLP, vol. 19, no. 4, pp. 661–676, 2010.
[13] Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, 2016.
[14] Hakan Erdogan, John R. Hershey, et al., "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016.
[15] Tsubasa Ochiai, Shinji Watanabe, et al., "Multichannel end-to-end speech recognition," arXiv preprint arXiv:1703.04783, 2017.
[16] Wu Minhua, Kenichi Kumatani, Shiva Sundaram, Nikko Ström, and Björn Hoffmeister, "Frequency domain multi-channel acoustic modeling for distant speech recognition," in ICASSP, 2019.
[17] Bo Li, Tara N. Sainath, et al., "Neural network adaptive beamforming for robust multichannel speech recognition," in Interspeech, 2016.
[18] Xiong Xiao, Shinji Watanabe, et al., "Deep beamforming networks for multi-channel speech recognition," in ICASSP, 2016.
[19] Zhong Meng, Shinji Watanabe, et al., "Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition," in ICASSP, 2017.
[20] Jack Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[22] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition," in ICASSP, 2018.
[23] Liang Lu, Changliang Liu, Jinyu Li, and Yifan Gong, "Exploring transformers for large-scale speech recognition," arXiv preprint arXiv:2005.09684, 2020.
[24] Yongqiang Wang et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP, 2020.
[25] Martin Radfar, Athanasios Mouchtaris, and Siegfried Kunzmann, "End-to-end neural transformer based spoken language understanding," in Interspeech, 2020.
[26] Georgios Paraskevopoulos, Srinivas Parthasarathy, Aparna Khare, and Shiva Sundaram, "Multiresolution and multimodal speech recognition with transformers," arXiv preprint arXiv:2004.14840, 2020.
[27] Bahareh Tolooshams, Ritwik Giri, Andrew H. Song, Umut Isik, and Arvindh Krishnaswamy, "Channel-attention dense U-Net for multichannel speech enhancement," in ICASSP, 2020.
[28] Lukas Drude, Jahn Heymann, and Reinhold Haeb-Umbach, "Unsupervised training of neural mask-based beamforming," arXiv preprint arXiv:1904.01578, 2019.
[29] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[31] Christian Szegedy, Vincent Vanhoucke, et al., "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[32] Zhong-Qiu Wang and DeLiang Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," TASLP, vol. 27, no. 2, pp. 457–468, 2018.
[33] Zhong-Qiu Wang, Jonathan Le Roux, and John R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," in ICASSP, 2018.
[34] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," in ACL, 2016.