Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT
Ye Bai, Jiangyan Yi, Member, IEEE, Jianhua Tao, Senior Member, IEEE, Zhengkun Tian, Zhengqi Wen, Member, IEEE, and Shuai Zhang
Abstract—Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which implicitly contains the token-level relationships, we can predict a token without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence becomes realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS). The three modules are built from basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. At last, the probability distribution over the vocabulary is computed for each token position. Therefore, speech recognition is re-formulated as a position-wise classification problem. Furthermore, we propose a cross-modal transfer learning method that refines semantics from a large-scale pre-trained language model, BERT, to improve performance. We conduct experiments on two publicly available speech datasets of different scales, AISHELL-1 and AISHELL-2. Experimental results show that our proposed model achieves a large speedup and competitive performance compared with autoregressive transformer models. To better understand the behaviors of LASO, we analyze the model by visualizing the attention patterns. The results show that the PDS attends to specific encoded acoustic representations based on positions and that the decoder captures token relationships.

Index Terms—speech recognition, fast, end-to-end, non-autoregressive, attention, BERT, transfer learning
Ye Bai is currently a PhD candidate at the University of Chinese Academy of Sciences (UCAS), Beijing, China. E-mail: [email protected]
Jiangyan Yi is currently an Assistant Professor in NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected]
Jianhua Tao is currently a Professor in NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected] (Corresponding authors: Jianhua Tao and Jiangyan Yi.)
Zhengkun Tian is currently a PhD candidate at the University of Chinese Academy of Sciences (UCAS), Beijing, China. E-mail: [email protected]
Zhengqi Wen is currently an Associate Professor in NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected]
Shuai Zhang is currently a PhD candidate at the University of Chinese Academy of Sciences (UCAS), Beijing, China. E-mail: [email protected]
I. INTRODUCTION

Deep learning has significantly improved the performance of automatic speech recognition (ASR). Conventionally, an ASR system consists of an acoustic model (AM), a pronunciation lexicon, and a language model (LM). A deep neural network (DNN) is used to model the observation probabilities of hidden Markov models (HMMs) [1], [2]. This DNN-HMM hybrid approach has achieved success in ASR. However, the pipeline of a DNN-HMM hybrid system usually requires training Gaussian mixture model based HMMs (GMM-HMMs) to generate frame-level alignments and tied states. Building the pronunciation lexicon requires expert knowledge in phonetics. The complexity of this building pipeline limits the development of an ASR system. Moreover, the different building procedures of the AM and the LM make the system difficult to optimize jointly. Thus, possible error accumulation in the pipeline influences the performance of a hybrid ASR system.

Pure neural network based end-to-end (E2E) ASR systems have attracted the interest of researchers in recent years [3], [4], [5], [6], [7]. Different from hybrid ASR systems, these systems use one DNN to model acoustics and language simultaneously, so that the network can be optimized with back-propagation in an E2E manner. In particular, attention-based encoder-decoder (AED) models have achieved promising performance in ASR [8], [9]. An AED model first encodes the acoustic feature sequence into latent representations with an encoder. With the latent representations, the decoder predicts the text token sequence step by step. The attention mechanism queries a proper latent vector from the outputs of the encoder for the decoder to predict. However, even with non-recurrent structures that can be implemented in parallel [10], [11], the recognition speed limits the deployment of AED models in real-world applications. Two main reasons influence inference speed:
• First, the encoder encodes the whole utterance, so the decoder starts inference only after the user has spoken the whole utterance;
• Second, multi-pass forward propagation of the decoder costs much time during beam-search.

Several works focus on the first problem, so that the system can generate the token sequence in a streaming manner. Monotonic attention mechanisms [12], [13] enforce the attention alignments to be monotonic. Therefore, the encoder only encodes a local chunk of the acoustic feature sequence, and the decoder predicts the next token without future context in the acoustic feature sequence. Triggered-attention systems [14], [15] use connectionist temporal classification (CTC) spikes to segment the acoustic feature sequence adaptively, and the decoder predicts the next token when it is activated by a spike. Transducer-based models [7], [16], [17] use an extra blank token and marginalize over all possible alignments, so the model can immediately predict the next token once the encoder has accumulated enough information. All these models can generate a token sequence in a streaming manner and show promising results. However, they limit the models to using local information in the speech sequence, which potentially ignores global semantic relationships in the speech sequence. The global semantic relationships contain not only the relationships among acoustic frames but also the relationships among tokens [18], as shown in Fig. 1.

In this paper, we aim to address the second problem, i.e., we would like to generate the token sequence without beam-search.
We propose an attention-based feedforward neural network model for non-autoregressive speech recognition called "LASO" (Listen Attentively, and Spell Once). The LASO model first uses an encoder to encode the whole acoustic sequence into high-level representations. Then, the proposed position dependent summarizer (PDS) module queries the latent representation corresponding to each token position from the outputs of the encoder. It bridges the length gap between the speech and the token sequence. At last, the decoder further refines the representations and predicts a token for each position. Because the prediction of one token does not depend on another token, beam-search is not used. And because the network is a non-recurrent feedforward structure, it can be implemented in parallel. To further improve the ability of LASO (especially the decoder) to capture semantic relationships, we propose a cross-modal knowledge transferring method. Specifically, we align the semantic spaces of the hidden representations of LASO and the pre-trained large-scale language model BERT [19]. We use a teacher-student learning based knowledge transferring method [20] to leverage the knowledge in BERT. We conduct experiments on two publicly available Chinese Mandarin datasets to evaluate the proposed methods with different data sizes. The experiments demonstrate that our proposed method achieves competitive performance and efficiency.

The contributions of this work are summarized as follows:
1) We propose a non-autoregressive attention-based feedforward neural network, LASO, for speech recognition. The model leverages the whole context of the speech and generates all tokens in the sequence in parallel. The experiments demonstrate that our proposed model achieves high efficiency and competitive performance.
2) We propose a cross-modal knowledge transferring method from BERT for improving the performance of LASO. The experiments demonstrate the effectiveness of the knowledge transferring from BERT. The results also show that speech signals have internal structures similar to the corresponding text, so that knowledge from BERT can benefit the non-autoregressive ASR model.
3) We visualize the attention of LASO in detail to analyze the model. The visualization results show that the proposed PDS module can attend to specific encoded acoustic representations and that the decoder can capture token relationships.

Fig. 1. A spectrogram of an example utterance, "my dog is cute". A word corresponds to a segment in the speech signal. The relationships among the segments can be seen as the relationships among the corresponding tokens, which are referred to as language semantics in this paper.

This journal version is extended from our INTERSPEECH 2020 conference paper [21]. The new content in this paper includes leveraging pre-trained BERT models to further improve the performance, more detailed experiments on large-scale datasets, and visualization for better understanding of the models. The rest of the paper is organized as follows. Section II briefly compares autoregressive AED models and non-autoregressive AED models. Section III re-formulates speech recognition as a position-wise classification problem. Section IV describes the proposed LASO model. Section V describes how to train the model and how we distill the knowledge from BERT. Section VI introduces how the model generates a sentence. Section VII compares this work with previous related work. Sections VIII and IX present the setup and results of the experiments, respectively. Section X discusses this paper. At last, Section XI concludes this paper and presents future work.

II. BACKGROUND: AUTOREGRESSIVE MODELS VS. NON-AUTOREGRESSIVE MODELS
In this section, we introduce the background of autoregressive AED models (ARMs) and non-autoregressive AED models (NARMs). We compare these two paradigms to better introduce the proposed method in the following sections.
A. Autoregressive AED Models
The ARM predicts the next token based on the previously generated tokens. That is, for a speech-text pair (X, Y), the ARM factorizes the conditional probability P(Y|X) with the chain rule:

P_{\mathrm{ARM}}(Y \mid X) = P(y_1 \mid X) \prod_{j=2}^{L} P(y_j \mid y_{<j}, X),   (1)

where X = [x_1, ..., x_T] is the acoustic feature sequence, each x_t denotes a feature vector, Y = [y_1, ..., y_L] is the text token sequence, and y_{<j} = [y_1, ..., y_{j-1}] denotes the previous context of the token y_j. The token can be, for example, a character or a word.
Fig. 2. An illustration of the proposed LASO model. LASO consists of an encoder, a position dependent summarizer (PDS), and a decoder. All three modules are composed of basic attention blocks, which consist of multi-head attention and a position-wise feedforward network (FFN in the figure). We first use a CNN to subsample the acoustic feature sequence. Then, the encoder extracts high-level representations from the subsampled sequence. The PDS queries the high-level acoustic representations corresponding to each token position. Then, the decoder further refines the language semantics. For each position, a probability distribution over the vocabulary is computed with a softmax function. During inference, we select the most likely token at each position to form the token sequence. The extra positions at the tail are predicted as filler tokens.
B. Non-Autoregressive AED Models

Different from the ARM, the NARM predicts each token without dependence on other tokens. Specifically, the NARM assumes conditional independence among tokens:

P_{\mathrm{NARM}}(Y \mid X) = \prod_{j=1}^{L} P(y_j \mid X).   (2)

Because each probability does not depend on the other tokens, parallel implementation is possible. Another view of Eq. (2) is to process each token independently rather than to process the product. The details are described in Section III.

Conventionally, the relationships among tokens are considered an important factor for token sequence generation. We refer to these relationships as language semantics in this paper. The earlier non-autoregressive model CTC assumes conditional independence over the token sequence [22]. However, to achieve good performance, CTC-based systems use n-gram LMs for modeling the language semantics [23]. And the more advanced transducer models [7] use a neural network to model the language semantics. An AED model captures the language semantics with the decoder [3], [4], [5]. Different from them, in this paper, we propose to use a self-attention mechanism to model the implicit language semantics, which compensates for the loss of the explicit autoregressive language model. The details are described in Section IV.

III. ASR AS POSITION-WISE CLASSIFICATION
In this paper, we give a new perspective on the speech recognition problem. The basic idea is that the language semantics are implicitly contained in the speech signal. Fig. 1 shows a spectrogram of the utterance "my dog is cute". Each segment corresponds to a word in the utterance. We observe that the language semantics, i.e., the relationships among the tokens, are expressed among the segments implicitly. Thus, if the speech signal of the whole utterance is available, we can leverage these implicit language semantics to improve the performance of ASR.

Based on the above observations, we consider the ASR problem as position-wise classification, given the whole speech utterance. Namely, we use the whole acoustic feature sequence, including explicit acoustic characteristics and implicit language semantics, to predict each token, rather than estimate the probability of the token sequence. When all the tokens are predicted, we simply put them together as the recognition result. Formally, we predict the following probability:

P(y_j \mid X) = f(X), \quad j = 1, \ldots, L,   (3)

where X denotes the whole acoustic feature sequence, y_j denotes a token in the token sequence, L is the length of the token sequence, and f is some non-linear function. In this paper, we use the proposed feedforward neural network LASO as the non-linear function f. Usually, the length of the token sequence is unknown in advance. We use a simple way to tackle this: we set L to a large enough number, and the tail of the token sequence is predicted as the filler token.

Fig. 3. An illustration of an attention block.

IV. THE PROPOSED LASO MODEL
In this section, we introduce the proposed LASO model. The architecture is shown in Fig. 2. The encoder encodes the acoustic feature sequence into high-level representations. The PDS queries the high-level representation corresponding to each token position. Another purpose of the PDS is to bridge the length gap between the speech and the token sequence. The decoder further captures language semantics from the outputs of the PDS. At last, the probability distributions over the vocabulary are computed with linear transformations and softmax functions. For inference, the most likely token at each position is selected. For a token sequence whose length is shorter than L, the tail is filled with the filler token.

Fig. 4. An illustration of dot-product attention. The queries and the keys are used to compute the attention scores with matrix multiplication and the softmax function. The attention scores are used as weights to fuse the values.

A. Attention Block
The feedforward attention structure [24] captures the global relationships in a sequence. Different from recurrent neural networks, which encode history context into latent vectors, the feedforward attention mechanism uses a weighted sum to fuse the input sequence. Because of its feedforward structure, it can be computed in parallel. In this work, we use the scaled dot-product attention and a position-wise feedforward network as the basic submodule, following [24], but we use "pre-norm" [25] for stable training. The structure is shown in Fig. 3.

The scaled dot-product attention is computed by

\mathrm{Atten}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{D_k}}\right) V,   (4)

where Q \in \mathbb{R}^{T_q \times D_k} denotes the queries, K \in \mathbb{R}^{T_k \times D_k} denotes the keys, and V \in \mathbb{R}^{T_k \times D_v} denotes the values. As shown in Fig. 4, the attention scores are computed with the dot products of the queries and the keys and are then normalized with a softmax function. The normalized attention scores are sharp at some positions and small at others. Then, by matrix multiplication, the values are fused at the corresponding positions. This procedure can be seen as the query querying the keys and fetching the corresponding value from the values.

To make the attention scores more varied, the attention can be extended to a multi-head version:

\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_H) W^{o}, \quad h_i = \mathrm{Atten}(Q W_i^{q}, K W_i^{k}, V W_i^{v}), \quad i = 1, \ldots, H.   (5)

The queries, keys, and values are transformed into subspaces with the parameter matrices W_i^{q}, W_i^{k}, W_i^{v}, where i is the index of the head. Then, the scaled dot-product attention is computed for the transformed inputs. At last, the outputs are concatenated together and multiplied with W^{o}. h_i is one attention head, and H is the number of heads.

A position-wise feedforward neural network (FFN) transforms the output of the attention at each position:

\mathrm{FFN}(u) = W_2\, \mathrm{Activate}(W_1 u + b_1) + b_2,   (6)

where u is the vector at one position, W_1, W_2, b_1, and b_2 are learnable parameters, and "Activate" is a nonlinear activation function. In this work, gated linear units (GLUs) [26] are used. Residual connections [27] and layer normalization (LN) [28] are also used. Different from [24], we place the LN layer before each sublayer inside the residual branch, following [25], to make training stable and effective. The structure of the attention block is shown in Fig. 3.
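To make the block structure concrete, the following is a minimal PyTorch sketch of such a pre-norm attention block (multi-head attention plus a position-wise FFN with GLU, each with layer normalization, dropout, and a residual connection). The class name, the hyper-parameter values, and the choice to normalize only the query in the cross-attention case are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PreNormAttentionBlock(nn.Module):
    """Pre-norm attention block: LN -> multi-head attention -> dropout -> residual,
    then LN -> position-wise FFN (with GLU) -> dropout -> residual."""

    def __init__(self, d_model=512, n_heads=4, d_inner=2048, dropout=0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_ffn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # GLU halves the channel dimension, so the first projection doubles it.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2 * d_inner),
            nn.GLU(dim=-1),
            nn.Linear(d_inner, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key_value):
        # query: (batch, T_q, d_model); key_value: (batch, T_k, d_model)
        q = self.ln_attn(query)
        attn_out, _ = self.attn(q, key_value, key_value)
        x = query + self.dropout(attn_out)
        x = x + self.dropout(self.ffn(self.ln_ffn(x)))
        return x
```

In this sketch, the encoder and decoder would call the block with query equal to key_value (self-attention), while the PDS would pass the positional encodings as the query and the encoder outputs as key_value.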
B. Encoder

The encoder extracts high-level representations from the acoustic feature sequence. We first use a two-layer convolutional neural network (CNN) to subsample the acoustic feature sequence. We set the stride on the time axis to 2, so that the frame rate is reduced to 1/4. Then the outputs of the CNN are flattened to a T-by-D_m matrix, where T is the length of the subsampled feature sequence and D_m is the dimensionality. Another purpose of the CNN is to capture the locality of the acoustic feature sequence.

Then, the encoder stacks N_e attention blocks, as shown in Fig. 2. The keys, queries, and values are all the same, so it is a self-attention mechanism. That is, the attention scores are obtained by computing the dot product between every two vectors of the inputs. Therefore, long-term dependencies are captured.
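As a rough sketch of such a convolutional front-end (the channel count, kernel sizes, and 80-dimensional filter-bank input are illustrative assumptions, not the exact configuration of the experiments):

```python
import torch
import torch.nn as nn


class ConvSubsampler(nn.Module):
    """Two conv layers with stride 2 in time, reducing the frame rate to 1/4,
    then flattening the channel/frequency axes into a T' x D_m matrix."""

    def __init__(self, n_mels=80, d_model=512, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # After two stride-2 convs the frequency axis has ceil(n_mels / 4) bins.
        self.proj = nn.Linear(channels * ((n_mels + 3) // 4), d_model)

    def forward(self, feats):
        # feats: (batch, T, n_mels) acoustic features
        x = self.conv(feats.unsqueeze(1))           # (batch, C, ~T/4, ~n_mels/4)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # (batch, ~T/4, C * n_mels/4)
        return self.proj(x)                         # (batch, ~T/4, d_model)
```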
C. Position Dependent Summarizer

The PDS module is the core of the proposed LASO model. The PDS module leverages queries, which depend on the positions of the token sequence, to query the high-level representations from the encoder. The module can be seen as "summarizing" the acoustic features, so we name it the "summarizer". As a result, it bridges the gap between the length of the acoustic feature sequence and the length of the token sequence.

The PDS module consists of N_s attention blocks. For the first block, the queries are positional encodings, and the queries of the other blocks are the outputs of the previous block. Each query represents a token position in the token sequence. The keys and the values are the outputs of the encoder. The length L of the positional encoding sequence is pre-set by counting the lengths of the utterances in the training set and adding some tolerance. For example, if the maximum length of the token sequences in the training set is 90, we can set L to 100.

We use sinusoidal positional encodings [24]:

\mathrm{pe}_{i,2j} = \sin\left(i / 10000^{2j/D_m}\right), \quad \mathrm{pe}_{i,2j+1} = \cos\left(i / 10000^{2j/D_m}\right),   (7)

where i = 1, ..., L denotes the i-th position and j is the index of an element. A benefit of using sinusoidal positional encodings is that they allow the model to easily learn the relativity of positions: for a fixed offset k, the positional encoding of position i + k can be represented as a linear function of the encoding of position i [24].
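A small sketch of generating these sinusoidal encodings, which serve as the queries of the first PDS block (the tensor shapes and the value L = 100 are illustrative):

```python
import math
import torch


def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix with
    pe[i, 2j] = sin(i / 10000^(2j/d_model)) and pe[i, 2j+1] = cos(i / 10000^(2j/d_model))."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


# Example: queries for L = 100 token positions with model dimension 512.
queries = sinusoidal_position_encoding(100, 512)   # fed to the first PDS block
```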
D. Decoder

The decoder further refines the representations from the PDS module. Similar to the encoder, it is a self-attention module, which consists of N_d attention blocks. It captures the relationships in the sequence, i.e., the implicit language semantics queried by the PDS. The inputs of the decoder are the outputs of the PDS.

After the decoder, a linear transformation and a softmax function are used to compute the probability distribution over the token vocabulary.
E. Formulation

The LASO model can be formulated as follows:

Z = \mathrm{Enc}(X),
q_i = \mathrm{Summarize}(Z, \mathrm{pe}_i), \quad i = 1, 2, \ldots, L,
Q = [q_1, \ldots, q_L],
P(y_i \mid X) = \mathrm{Dec}(Q), \quad i = 1, 2, \ldots, L,   (8)

where X = [x_1, ..., x_T] is the feature sequence, Z denotes the high-level representations encoded by the encoder, and the probability over the vocabulary at each position, P(y_i|X), is computed with the PDS and the decoder. The function "Summarize" represents the PDS module, which attends to the outputs of the encoder in terms of the positional encodings. Because the positional encodings are deterministic rather than random and are a part of the whole model, they are not written in the probability expression.
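To make Eq. (8) concrete, the following sketch wires the three modules together with standard PyTorch layers. It uses nn.TransformerEncoderLayer as a stand-in for the attention blocks, random placeholder position queries, and omits the CNN front-end and the GLU details, so it is an approximation of the architecture rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class LasoSketch(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=4,
                 num_enc=6, num_pds=2, num_dec=4, max_tokens=100):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(d_model, n_heads,
                                              dim_feedforward=2048,
                                              batch_first=True, norm_first=True)
        self.encoder = nn.ModuleList([block() for _ in range(num_enc)])
        # PDS: cross-attention from position-dependent queries to encoder outputs.
        self.pds_attn = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(num_pds)])
        self.decoder = nn.ModuleList([block() for _ in range(num_dec)])
        self.out = nn.Linear(d_model, vocab_size)
        # Placeholder queries for the L = max_tokens positions; the paper uses
        # sinusoidal positional encodings here.
        self.register_buffer("pos_queries", torch.randn(max_tokens, d_model))

    def forward(self, feats):
        # feats: (batch, T', d_model) subsampled acoustic representations.
        z = feats
        for layer in self.encoder:
            z = layer(z)                                    # Z = Enc(X)
        q = self.pos_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        for attn in self.pds_attn:
            q, _ = attn(q, z, z)                            # q_i = Summarize(Z, pe_i)
        for layer in self.decoder:
            q = layer(q)                                    # Dec(Q)
        return self.out(q)                                  # (batch, L, vocab) logits


# Example (hypothetical vocabulary size):
# model = LasoSketch(vocab_size=5000)
# logits = model(torch.randn(2, 120, 512))   # -> shape (2, 100, 5000)
```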
V. LEARNING

In this section, we introduce the learning procedure of the LASO model.
A. Maximum Likelihood Estimation
We use the maximum likelihood estimation (MLE) criterion to train the parameters of the LASO model. We minimize the following negative log-likelihood (NLL) loss:

\mathrm{NLL}(\theta) = -\frac{1}{NL} \sum_{n=1}^{N} \sum_{i=1}^{L} \log P_{\theta}\left(y_i^{(n)} \mid X^{(n)}\right),   (9)

where (X^{(n)}, Y^{(n)}) is the n-th speech-text pair in the corpus, N is the total number of pairs, and y_i^{(n)} is the i-th token in Y^{(n)}. L is the preset length of the token sequence. If the length of a text sequence is shorter than L, the tail is padded with the filler token.
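A minimal sketch of this position-wise NLL loss, assuming the label sequences have already been padded to length L with the filler token id (so that, as in Eq. (9), every position contributes to the loss):

```python
import torch
import torch.nn.functional as F


def position_wise_nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, L, vocab) outputs of the final linear layer (pre-softmax);
    targets: (batch, L) token ids padded to length L with the filler token id."""
    batch, length, vocab = logits.shape
    return F.cross_entropy(logits.reshape(batch * length, vocab),
                           targets.reshape(batch * length))
```

Here F.cross_entropy combines the log-softmax and the NLL, and its default mean reduction corresponds to the 1/(NL) averaging of Eq. (9) within a batch.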
Fig. 5. An illustration of semantic refinement from BERT. The valid part of the token sequence is fed into BERT, and the MSE loss between the last hidden layers of the decoder and BERT is minimized. The optional linear transformation is used when the dimensionalities of the decoder and BERT are different.
B. Semantic Refinement from BERT
To further improve the performance, we use teacher-student learning [29], [30], [31], [32] to refine knowledge from a pre-trained LM. BERT, a kind of denoising autoencoder LM trained on very large-scale text, has shown a powerful ability for language modeling and has achieved state-of-the-art performance on many NLP tasks [19]. Inspired by our previous paper [20], we transfer the knowledge from BERT to the LASO model. Another potential advantage of using BERT as the teacher model is that both our proposed LASO and BERT are bidirectional models, i.e., the model predicts a token using both the left context and the right context.

The basic idea is that BERT can provide a good semantic representation for each token, and the outputs of the decoder also provide token-level representations. Therefore, we make the outputs of the decoder approximate those of BERT. We minimize the mean squared error (MSE) between their last hidden layers, as shown in Fig. 5. To match the training procedure of BERT, we add the special tokens required by BERT to the token sequence that is fed to the teacher. The final training objective combines the NLL loss in Eq. (9) with this MSE loss, where the MSE term is weighted by a coefficient λ.
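A sketch of this refinement loss under the following assumptions: the Hugging Face transformers package provides the pre-trained Chinese BERT teacher, the decoder's last hidden states are projected to BERT's dimensionality with an optional linear layer, and the first valid positions of the decoder are aligned one-to-one with BERT's positions. The alignment, the function names, and the 512-dimensional LASO decoder are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-chinese").eval()    # frozen teacher
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
proj = nn.Linear(512, bert.config.hidden_size)   # optional: LASO dim -> BERT dim (768)


def refinement_loss(decoder_hidden: torch.Tensor, texts: list, device: str = "cpu"):
    """decoder_hidden: (batch, L, d_laso) last hidden states of the LASO decoder,
    whose leading positions are assumed to correspond to the valid tokens.
    texts: list of transcription strings for the batch."""
    enc = tokenizer(texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        teacher = bert(**enc).last_hidden_state          # (batch, L_bert, 768)
    loss = 0.0
    for b in range(len(texts)):
        n = int(enc["attention_mask"][b].sum())          # number of valid BERT positions
        student = proj(decoder_hidden[b, :n])            # project and truncate (assumption)
        loss = loss + nn.functional.mse_loss(student, teacher[b, :n])
    return loss / len(texts)
```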
VI. INFERENCE

The inference of LASO is simple. We just select the most likely token at each position:

\hat{y}_i = \arg\max_{y_i} P(y_i \mid X), \quad i = 1, \ldots, L.   (12)

Then, the special tokens and the filler tokens at the tail are removed to form the final recognition result.
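A sketch of this position-wise greedy decoding; the token ids for the end-of-sentence and filler symbols are hypothetical, and cutting at the first special token is just one possible post-processing choice.

```python
import torch


def greedy_decode(logits: torch.Tensor, eos_id: int = 2, filler_id: int = 0) -> list:
    """logits: (L, vocab) position-wise distributions for one utterance.
    Pick the arg-max token at every position in parallel, then drop the tail."""
    tokens = logits.argmax(dim=-1).tolist()    # one arg-max per position, no beam search
    result = []
    for t in tokens:
        if t in (eos_id, filler_id):           # stop at the first special/filler token
            break
        result.append(t)
    return result
```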
VII. RELATED WORK

In this section, we review and compare previous related work in two main aspects. One is non-autoregressive AED models, and the other is the utilization of LMs for AED models.
A. Non-Autoregressive AED Models
Non-autoregressive AED models were first used in machine translation (MT). Gu et al. first proposed non-autoregressive machine translation and introduced fertility to tackle the multimodality problem [33]. Auxiliary regularization [34] and enhanced decoder input [35] were proposed to improve the performance. Lee et al. proposed an iterative refinement algorithm for MT [36]. Ma et al. proposed a flow model for sequence generation [37]. These models showed promising results in both performance and efficiency. However, these models are used for MT rather than ASR. Speech signals have their own properties, for example, the monotonic alignment with the token sequence and the implicit language semantics in speech. This motivated us to propose a simpler non-autoregressive AED model for ASR. A non-autoregressive transformer, which completes the masked tokens iteratively, has been proposed for ASR [38]. However, its decoder needs multi-pass forward propagation. Different from the previous work, we propose the non-autoregressive model LASO, which only needs one-pass propagation. We reformulate speech recognition as a position-wise classification problem. We propose the PDS to extract token-level representations from speech. The PDS bridges the length gap between the speech and the token sequence.

TABLE I
THE DESCRIPTION OF THE DATASETS
B. The Utilization of LMs
Fusion methods, such as shallow fusion, deep fusion, and cold fusion, integrate external LMs to improve the performance [39], [40]. However, these methods add extra complexity during inference. And these methods can only use unidirectional LMs, so recent powerful bidirectional LMs, such as ELMo [41] and BERT [19], are not applicable. BERT has been used to rescore n-best results [42]. However, rescoring increases computation during inference, and it does not use the representation ability of BERT. Bai et al. proposed the LST approach to transfer knowledge from an LM to an ARM with teacher-student learning [20]. Futami et al. [43] distilled knowledge from BERT to an ARM. In this paper, we transfer the knowledge from BERT to improve the LASO model. LASO and BERT both capture global token-level relationships. With this method, the LASO model can benefit from the representation ability of BERT without adding any extra computation during inference.
C. Cross-Modal Semantic Alignment
The semantic refinement from BERT is also related to cross-modal semantic alignment, i.e., aligning the spaces of LASO and BERT. The concept of cross-modal semantic alignment was first used in cross-modal retrieval [44], [45], [46], [47], [48], [49]. In particular, recent deep learning work learns a shared semantic space between images and text based on neural networks [46], [47], [48], [49]. The basic idea of these works is to map the features from different modalities so that the system can easily compute their similarities. A recent ASR-free approach to text query-based keyword search from speech is also based on this idea, i.e., learning a shared semantic space between the text query and speech [50]. The motivation of our proposed semantic refinement from BERT is also to align the semantics of LASO and BERT. However, different from the retrieval work, we do not train models of two modalities but only train the LASO model. Namely, the BERT model, which has been confirmed to be a powerful language model, is an auxiliary model used to train the LASO model and is not used during inference. In addition, this work is in a cross-modal knowledge transferring setting, i.e., transferring knowledge from the text-modal model BERT to the speech-modal model LASO.

VIII. EXPERIMENTAL SETUP
In this section, we introduce the datasets and the experimental setup. All the experiments are implemented with the deep learning toolkit PyTorch [51] and the Python programming language.
A. Datasets
We conduct experiments on the public Chinese speech datasets AISHELL-1 [52] and AISHELL-2 [53]. These two datasets have different scales of data, so we can evaluate generalization on both a small dataset and a large dataset.

AISHELL-1 contains 178 hours of Mandarin speech. The speech is recorded by 400 speakers. All audio is recorded with high-fidelity microphones at 44.1 kHz and then subsampled to 16 kHz. The content of the dataset covers 5 domains, including "Finance", "Science and Technology", "Sports", "Entertainment", and "News".

AISHELL-2 contains about 1000 hours of Mandarin speech for training. The training set is recorded by 1991 speakers with iPhone smartphones. The content covers voice commands, digital sequences, places of interest, entertainment, finance, technology, sports, English spellings, and free speaking without specific topics. The development sets and the test sets are recorded with different devices to evaluate generalization across recording equipment.

The details of the two datasets are shown in Table I.

B. Setup
Basic settings. We first evaluate the models on the small-scale (150 hours) dataset AISHELL-1. Then we extend the experiments to the large-scale (1000 hours) dataset AISHELL-2. We use Mel-filter bank (FBANK) features as the inputs, which are extracted every 10 ms with a 25 ms frame length. For AISHELL-1, the token vocabulary contains the characters in the training set and three special symbols. Note that, different from AED models, the inputs of LASO are only speech features but not text embeddings; therefore, LASO is a unimodal model.

TABLE II
THE SYMBOLS OF THE HYPER-PARAMETERS OF THE ARCHITECTURE
Symbol      Description
D_m         The dimensionality of the inputs of the multi-head attention.
D_in        The inner dimensionality of the position-wise FFN.
Activation  The type of activation function of the position-wise FFN.
Baseline settings. We use a Speech-Transformer-style autoregressive model [10] as the first baseline, which we refer to as Transformer, and a self-attention CTC model [54] as the second baseline. For SAN-CTC, because it only has the encoder part, we set the number of attention layers to 12 so that its model size is comparable to Transformer and the other models. We refer to this model as SAN-CTC.

LASO settings. We compare different architectures of LASO. The symbols of the network configuration are shown in Table II. The basic attention blocks are shown in Fig. 3 and are the same as in the two baselines. The length of the positional encoding sequence of the PDS module (L in Eq. (8)) is set to a fixed value large enough to cover the token sequences in the training set.

Training settings. We use the Adam algorithm [55] to optimize the models. We use the warm-up learning rate schedule [24]:

\alpha = D_m^{-0.5} \cdot \min\left(\mathrm{step}^{-0.5}, \; \mathrm{step} \cdot \mathrm{warmup}^{-1.5}\right).   (13)

The warm-up step is set to 12000. The dropout rate is set to 0.1. Each batch contains about 100 seconds of speech, and we accumulate gradients over 12 steps to simulate a big batch [56] to stabilize training. We train the models until they converge. The typical number of epochs for LASO is 130, and the typical number of epochs for Speech-Transformer and SAN-CTC is 80.

We use SpecAugment [57] for data augmentation. The frequency masking width is 27, and the time masking width is 40. Both frequency masking and time masking are applied twice, but we do not use time warping. We use label smoothing with a factor of 0.1 to mitigate the over-confidence problem during training. We average the parameters of the models saved at the last 10 epochs as the final model.

We use Google's pre-trained Chinese BERT model (https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) for semantic refinement. This model has 12 transformer layers. The model dimensionality is 768. The total number of parameters is 110M. The vocabulary of the BERT model contains 21128 tokens. During training, the MSE loss term is weighted by a fixed coefficient λ.

Performance evaluation. For performance evaluation, we use the standard edit-distance based error rate, i.e., the character error rate (CER). For speed evaluation, we use both the real-time factor (RTF) and the averaged processing time (APT). RTF is a standard metric to evaluate the processing time cost of an ASR system. It is the averaged time cost to process one second of speech:

\mathrm{RTF} = \frac{\text{Total Processing Time}}{\text{Total Duration}}.   (14)

This metric is dimensionless and independent of the utterance duration. To take into account the impact of utterance duration on processing time, we also compute APT:

\mathrm{APT} = \frac{\text{Total Processing Time}}{\text{Total Number of Utterances}}.   (15)

The unit of APT is second.

The reason for using APT is to show the efficiency of processing one utterance. It considers the waiting time of a user and is suitable for both online applications (speech interaction systems) and offline applications (such as voice document transcription). Specifically, for online applications, the waiting time of a user is the time between the point at which the user stops speaking and the appearance of the ASR result on the screen. For offline applications, the waiting time of a user is the time between the point at which the user inputs the utterance and the appearance of the ASR result on the screen. These metrics ignore the factors that a speech engineer cannot control, e.g., the data-transmission speed over the Internet.

RTF may give an overestimated sense of the speed of whole-utterance ASR systems (AED, Speech-Transformer, or bidirectional-AM based hybrid models), because whole-utterance ASR systems start processing speech only after receiving the whole utterance, unlike streaming models. For streaming models, the waiting time of a user in an online application can be estimated as

\mathrm{RTF} \times \text{the length of the last data package}.   (16)

However, for whole-utterance ASR systems, the length of the whole utterance needs to be considered. To address this issue, we compute APT, which directly averages the processing time cost per utterance.
We compute these two practical values on commonly used deep learning devices (GPUs) to show the time cost of the ASR systems. We also note that the time cost of feature extraction is included in our experimental implementation.
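For concreteness, a small sketch of the warm-up schedule in Eq. (13) and the speed metrics in Eqs. (14) and (15); the function names are our own, and D_m and the warm-up step are taken from the settings reported above.

```python
def warmup_lr(step: int, d_model: int = 512, warmup: int = 12000) -> float:
    """Learning rate schedule of Eq. (13):
    alpha = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                       # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def rtf_and_apt(total_processing_time: float, total_duration: float,
                num_utterances: int):
    """Eq. (14): RTF = processing time / audio duration (dimensionless).
    Eq. (15): APT = processing time / number of utterances (seconds)."""
    return (total_processing_time / total_duration,
            total_processing_time / num_utterances)
```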
IX. EXPERIMENTAL RESULTS

In this section, we introduce the experimental results.
A. Comparing Model Architectures on AISHELL-1
First, we compare the performance of different model configurations. Table III shows the results of the different configurations. From Table III, we draw the following conclusions.
1) D_m: A larger dimensionality of the attention block gives the model more representation power and achieves better performance.
2) Activation: We find that using GLU as the activation function is consistently more effective than using ReLU. The GLU introduces more parameters into the model. However, comparing a deeper model with ReLU and a shallower model with GLU whose model sizes are similar, the model with GLU achieves better performance.
3) Semantic refinement from BERT: semantic refinement from BERT improves the larger models. However, for the models with the smaller D_m, the performance improvements are subtle. We attribute this to the large discrepancy between the dimensionality of LASO and that of BERT, which makes it difficult for the LASO models to learn the representations of BERT.
In the rest of the paper, we refer to model 2, model 12, and model 16 in Table III as LASO-small, LASO-middle, and LASO-big, respectively, to compare performance with previous models.
B. Comparisons with Other Methods on AISHELL-1
We compare our proposed LASO with the baseline autoregressive model Transformer and the non-autoregressive CTC model SAN-CTC. We also compare it with previous work. The results are shown in Table IV. We can see that our proposed LASO models achieve competitive performance compared with the hybrid models, the CTC models, and the autoregressive transformer models. In particular, with semantic refinement from BERT, LASO-middle and LASO-big outperform the autoregressive Transformer model and achieve state-of-the-art performance. Moreover, the inference speed of the non-autoregressive models is much faster than that of the autoregressive models.

We implemented a competitive baseline autoregressive model Transformer, which achieves a CER of 6.6% on the test set of AISHELL-1. The performance of LASO-middle, which has a similar number of parameters to Transformer, is comparable to Transformer. And the bigger model LASO-big outperforms Transformer. Both LASO-middle and LASO-big outperform the CTC based non-autoregressive model SAN-CTC.

Table IV also lists the RTF and APT of the models. RTF is the ratio of the total inference time to the total duration of the test set. APT is the averaged time cost for decoding one utterance (including the time of feature extraction) on the test set. The inference is done utterance by utterance on an NVIDIA RTX 2080Ti GPU. We can see that the inference speed of the non-autoregressive models is much faster than that of the autoregressive transformer model Transformer. The APT of the non-autoregressive models is dramatically reduced (from 961 ms to under 20 ms). And we can see that even though the model size of LASO-big is larger than Transformer, the APT of LASO-big is much lower. With the non-autoregressive and feedforward structure, the inference of the model can be implemented very efficiently.

The inference speed of the baseline non-autoregressive model SAN-CTC is also very fast. However, compared with the LASO models of similar scale, its performance is degraded. We believe this is because the LASO models use the decoder, which plays the role of an autoencoder LM and can capture language semantics more effectively. In addition, for CTC models, the blank symbols inserted in the token sequence may hinder the model from capturing language semantics.

For LASO-middle and LASO-big, semantic refinement from BERT reduces CERs by 11% to 12% relative on the test set. This demonstrates that teacher-student learning with BERT can improve the ability of LASO to capture language semantics. However, for the small-size model LASO-small, semantic refinement from BERT does not improve the performance. We believe this is because the dimensionalities of the representations of LASO-small and BERT are very different (256 vs. 768), which hinders the LASO model from learning knowledge from the BERT model.
C. Comparisons with Other Methods on AISHELL-2
We then extend the experiments to the larger-scale dataset AISHELL-2. AISHELL-2 contains about 1000 hours of training data, and the covered topics of AISHELL-2 are more diverse than those of AISHELL-1. Furthermore, the training set is recorded with iPhones, and the test sets of AISHELL-2 cover three different channels, i.e., iPhone, Android smartphones, and Hi-Fi microphones. So we can evaluate the generalization of the models on AISHELL-2 in more detail. Because AISHELL-2 is larger than AISHELL-1 and training models on AISHELL-2 costs much more time, we directly use the architectures selected in the previous experiments on this dataset.

The experimental results are shown in Table V. We can see that the proposed LASO models also achieve promising performance. Compared with the hybrid systems, the LAS systems, and the CTC model, all LASO models achieve better performance. And with similar and larger model sizes, LASO-middle and LASO-big outperform previous state-of-the-art transformer models. We find that with more data and a larger model, LASO can achieve better performance than well-trained transformer models. With semantic refinement from BERT, the performance is further improved. However, the improvements are not as significant as in the experiments on AISHELL-1. And for LASO-small, the performance degrades. The possible reasons are: 1) the capacity gap between the small model LASO-small and the big model BERT is very large; 2) AISHELL-2 has more data and is more complex than AISHELL-1, which influences the effectiveness of knowledge distillation [62], [63].
TABLE III
THE CHARACTER ERROR RATES ON AISHELL-1 WITH DIFFERENT HYPER-PARAMETERS
(Columns include Model, D_m, D_in, and Activation.)
TABLE IV
COMPARISONS WITH OTHER WORK ON AISHELL-1

Model                          #Params   Dev CER   Test CER   RTF / APT
KALDI (nnet3)† ‡               -         -         8.6        -
KALDI (chain)* † ‡             -         -         7.4        -
LAS [58]                       -         9.4       10.6       -
ESPnet (Transformer)† ‡ [59]   -         6.0       6.7        -
A-FMLM [38]                    -         6.2       6.7        -
Fan et al. (Transformer) [60]  -         -         6.7        -
AGS CTC‡ [61]                  -         7.0       7.9        -
Transformer (baseline1)        67.5M     6.1       6.6        0.19 / 961 ms
SAN-CTC (baseline2)            56.4M     7.2       7.8        0.0033 / 16 ms
LASO-small
LASO-small w/ BERT             20.6M     7.0       7.8        0.0027 / 13 ms
LASO-middle
LASO-middle w/ BERT            63.3M     5.4       6.2        0.0035 / 17 ms
LASO-big
LASO-big w/ BERT               80.0M

* From the KALDI official repository: https://github.com/kaldi-asr/kaldi/blob/master/egs/aishell/s5/RESULTS.
† With speed perturbation based data augmentation.
‡ With an extra language model at the inference stage.
X. DISCUSSION
In this section, we analyze the attention patterns and discuss the impact of the sentence lengths.
A. Visualization of Attention Patterns
To better understand the behaviors of the LASO model, we visualize the attention scores of an utterance with LASO-big. We show the first four heads of the attention scores of the last layers of the encoder, the PDS, and the decoder. The detailed visualization is listed in the supplemental materials. Fig. 6 shows the visualization results. We summarize the observations as follows.
1) Different heads in one layer have different attention patterns. This implies that one representation of the sequence attends to different representations in different heads.
2) For the encoder, some attention patterns show top-left to bottom-right alignments (Fig. 6b and Fig. 6d). This meets the expectation: the matched representations are around the corresponding representation in the speech sequence. And some heads do not show obvious patterns.
3) For the decoder, we can see that some hidden representations at a token position attend to the hidden representation at the previous token position (Fig. 6g), and some attend to the hidden representation at the next token position (Fig. 6f). Most hidden representations corresponding to filler tokens show similar attention patterns to each other.
TABLE V
COMPARISONS WITH OTHER WORK ON AISHELL-2

Model                         #Params   Dev iOS   Dev Android   Dev Mic   Dev Avg   Test iOS   Test Android   Test Mic   Test Avg
KALDI (chain)† ‡              -         9.1       10.4          11.8      10.4      8.8        9.6            10.9       9.8
LAS [58]                      -         -         -             -         -         9.2        9.7            10.3       9.7
ESPnet (Transformer)* † ‡     -         -         -             -         -         7.5        8.9            8.6        8.3
Transformer (baseline1)       67.5M     6.4       7.2           7.7       7.1       7.1        8.0            8.2        7.8
SAN-CTC (baseline2)           56.4M     8.3       8.9           8.8       8.6       8.0        9.0            8.9        8.7
LASO-small
LASO-small w/ BERT            20.6M     8.9       10.0          10.3      9.7       8.8        9.8            10.5       9.7
LASO-middle
LASO-middle w/ BERT           63.3M     6.5       7.2           7.4       7.0       6.6
LASO-big
LASO-big w/ BERT              80.0M

* From the ESPnet official repository: https://github.com/espnet/espnet/blob/master/egs/aishell2/asr1/RESULTS.md.
† With speed perturbation based data augmentation.
‡ With an extra language model at the inference stage.

Fig. 6. The visualization of the attention scores of the model LASO-big. We show the first four heads of the last layer of each module: the encoder in panels (a)-(d), the decoder in panels (e)-(h), and the PDS in panels (i)-(l). Because the token sequence is too long to visualize, we truncate it to the first 15 tokens. The full visualization is listed in the supplemental materials.

These observations show that: 1) the various attention patterns make the model fuse the representations from various aspects; 2) the attention mechanism can learn meaningful alignments in terms of the positional encodings; 3) the two special filler tokens are predicted at the tail positions, and the decoder captures the token relationship based on the self-attention mechanism.
Fig. 7. Scatter plots for analyzing the impact of the lengths of the sentences: (a) the CER of each sentence vs. the sentence length on the AISHELL-1 test set; (b) the CER of each sentence vs. the sentence length on the AISHELL-2 iPhone test set. The grey histograms represent the ratios of sentence lengths (i.e., the number of tokens) in the training set. The area of each point shows the number of examples at that point. We use baseline1 Transformer, baseline2 SAN-CTC, and LASO-middle since they have similar model sizes. From the plots, we can see no significant difference between the three models. The three models have the same CER for some sentences, so the points overlap. Note that there is an outlier in the second figure whose CER is 100%, because the reference of this sentence is wrong.
B. The Impact of the Lengths of Sentences
Position parameters exist in the PDS module. This may cause the performance to depend on the length of a sentence. To check this point, we plot the CER of each sentence vs. the length of the sentence in Fig. 7. We can see no significant difference among the three models.

Fig. 7 also provides histograms of the ratios of the different lengths in the training set. We can see that the distributions approximate a Gaussian distribution, as expected. And the trend of the CERs is similar to the length distribution of the training set, i.e., the CERs of more sentences are zero in the middle part of the figures. This is because the models are trained with more data of middle lengths. All three models can recognize sentences with unseen lengths (or few-shot lengths) in the training set. However, the error rates are relatively higher than for sentences with seen lengths. This phenomenon is as expected: because the three models (Transformer, SAN-CTC, LASO) are all whole-utterance ASR models, the length of the utterance is an implicit factor in training the models. This is different from conventional hybrid models based on local-window models, i.e., time-delay neural networks (TDNNs), CNNs, or latency-controlled BLSTMs. LASO is more sensitive to unseen lengths than the other two models, because the PDS module needs to be trained with various lengths.

In engineering practice, two methods can be adopted to address the unseen-length issue of whole-utterance models: 1) select data with various lengths to train the models; 2) use a voice activity detection (VAD) system to cut long utterances into segments.

XI. CONCLUSIONS AND FUTURE WORK
This paper proposes a feedforward neural network based non-autoregressive speech recognition model called LASO. The model consists of an encoder, a position dependent summarizer, and a decoder. The encoder encodes the acoustic feature sequence into high-level representations. The PDS converts the acoustic representation sequence into a token-level sequence. And the decoder further captures the token-level relationships. Because the prediction of each token does not rely on other tokens, and the whole model is feedforward, parallel prediction of the whole sentence is realizable. Thus, the inference speed is much improved compared with autoregressive attention-based end-to-end models. Furthermore, we propose to refine semantics from the large-scale pre-trained language model BERT to improve the performance. Experimental results show that LASO can achieve competitive performance with much lower recognition latency. In the future, we will improve the performance of LASO through the architecture and loss functions. We also find that LASO requires more epochs during training, and we will try to find strategies to speed up training.

XII. ACKNOWLEDGMENT
The authors are grateful to the anonymous reviewers for their invaluable comments that improved the completeness and readability of this paper.

REFERENCES
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] D. Yu and L. Deng, Automatic Speech Recognition. Springer, 2016.
[3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[4] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP, 2016, pp. 4945–4949.
[5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016, pp. 4960–4964.
[6] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. ICASSP, 2017, pp. 4835–4839.
[7] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. ICASSP, 2018, pp. 4774–4778.
[9] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention," in Proc. Interspeech, 2019, pp. 231–235.
[10] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. ICASSP, 2018, pp. 5884–5888.
[11] S. Zhou, L. Dong, S. Xu, and B. Xu, "Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese," in Proc. Interspeech, 2018, pp. 791–795.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and linear-time attention by enforcing monotonic alignments," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2837–2846.
[13] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," 2018. [Online]. Available: https://openreview.net/pdf?id=Hko85plCW
[14] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proc. ICASSP, 2019, pp. 5666–5670.
[15] T. Hori, N. Moritz, C. Hori, and J. Le Roux, "Transformer-based long-context end-to-end speech recognition," in Proc. Interspeech, 2020, pp. 5011–5015.
[16] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, "Self-attention transducers for end-to-end speech recognition," in Proc. Interspeech, 2019, pp. 4395–4399.
[17] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, "Synchronous transformers for end-to-end speech recognition," in Proc. ICASSP, 2020, pp. 7884–7888.
[18] Y.-A. Chung and J. Glass, "Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech," in Proc. Interspeech, 2018, pp. 811–815.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019.
[20] Y. Bai, J. Yi, J. Tao, Z. Tian, and Z. Wen, "Learn spelling from teachers: Transferring knowledge from language models to sequence-to-sequence speech recognition," in Proc. Interspeech, 2019, pp. 3795–3799.
[21] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang, "Listen attentively, and spell once: Whole sentence generation via a non-autoregressive architecture for low-latency speech recognition," in Proc. Interspeech, 2020.
[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[23] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU, 2015, pp. 167–174.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[25] T. Q. Nguyen and J. Salazar, "Transformers without tears: Improving the normalization of self-attention," arXiv preprint arXiv:1910.05895, 2019.
[26] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 933–941.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[29] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 535–541.
[30] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, "Learning small-size DNN with output-distribution-based criteria," in Proc. Interspeech, 2014.
[31] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[33] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, "Non-autoregressive neural machine translation," in International Conference on Learning Representations, 2018.
[34] Y. Wang, F. Tian, D. He, T. Qin, C. Zhai, and T.-Y. Liu, "Non-autoregressive machine translation with auxiliary regularization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5377–5384.
[35] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T.-Y. Liu, "Non-autoregressive neural machine translation with enhanced decoder input," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3723–3730.
[36] J. Lee, E. Mansimov, and K. Cho, "Deterministic non-autoregressive neural sequence modeling by iterative refinement," in Proc. EMNLP, 2018.
[37] X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy, "FlowSeq: Non-autoregressive conditional sequence generation with generative flow," in Proc. EMNLP-IJCNLP, 2019, pp. 4273–4283.
[38] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[39] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On using monolingual corpora in neural machine translation," arXiv preprint, 2015.
[40] A. Sriram, H. Jun, S. Satheesh, and A. Coates, "Cold fusion: Training seq2seq models together with language models," in Proc. Interspeech, 2018, pp. 387–391.
[41] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proc. NAACL-HLT, 2018.
[42] J. Shin, Y. Lee, and K. Jung, "Effective sentence scoring method using BERT for speech recognition," in Asian Conference on Machine Learning, 2019, pp. 1081–1093.
[43] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Distilling the knowledge of BERT for sequence-to-sequence ASR," in Proc. Interspeech, 2020.
[44] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," in Advances in Neural Information Processing Systems, 2003.
[45] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 119–126.
[46] Q.-Y. Jiang and W.-J. Li, "Deep cross-modal hashing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3232–3240.
[47] M. Fan, W. Wang, P. Dong, L. Han, R. Wang, and G. Li, "Cross-media retrieval by learning rich semantic embeddings of multimedia," in Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1698–1706.
[48] L. Zhen, P. Hu, X. Wang, and D. Peng, "Deep supervised cross-modal retrieval," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10394–10403.
[49] Z. Yang, Z. Lin, P. Kang, J. Lv, Q. Li, and W. Liu, "Learning shared semantic space with correlation alignment for cross-modal event retrieval," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 16, no. 1, pp. 1–22, 2020.
[50] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, "End-to-end ASR-free keyword search from speech," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1351–1359, 2017.
[51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 8026–8037.
[52] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017, pp. 1–5.
[53] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," arXiv preprint arXiv:1808.10583, 2018.
[54] J. Salazar, K. Kirchhoff, and Z. Huang, "Self-attention networks for connectionist temporal classification in speech recognition," in Proc. ICASSP, 2019, pp. 7115–7119.
[55] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[57] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019, pp. 2613–2617.
[58] S. Sun, P. Guo, L. Xie, and M. Hwang, "Adversarial regularization for attention based end-to-end robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1826–1838, 2019.
[59] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, "A comparative study on transformer vs RNN in speech applications," in Proc. ASRU, 2019, pp. 449–456.
[60] Z. Fan, S. Zhou, and B. Xu, "Unsupervised pre-training for sequence to sequence speech recognition," arXiv preprint arXiv:1910.12418, 2019.
[61] F. Ding, W. Guo, L. Dai, and J. Du, "Attention-based gated scaling adaptive acoustic model for CTC-based speech recognition," in Proc. ICASSP, 2020, pp. 7404–7408.
[62] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," in International Conference on Learning Representations, 2017.
[63] J. H. Cho and B. Hariharan, "On the efficacy of knowledge distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.