[PDF] Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT

Abstract

Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which has the token-level relationship implicitly, we can predict a token without explicitly autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence is realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. At last, the probability distribution on the vocabulary is computed for each token position. Therefore, speech recognition is re-formulated as a position-wise classification problem. Further, we propose a cross-modal transfer learning method to refine semantics from a large-scale pre-trained language model BERT for improving the performance.

Full PDF

JJOURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Fast End-to-End Speech Recognition via aNon-Autoregressive Model and Cross-ModalKnowledge Transferring from BERT

Ye Bai, Jiangyan Yi,

Member, IEEE,

Jianhua Tao,

Senior Member, IEEE

Zhengkun Tian,Zhengqi Wen,

Member, IEEE

Shuai Zhang,

Abstract —Attention-based encoder-decoder (AED) modelshave achieved promising performance in speech recognition.However, because the decoder predicts text tokens (such ascharacters or words) in an autoregressive manner, it is difﬁcultfor an AED model to predict all tokens in parallel. This makesthe inference speed relatively slow. We believe that because theencoder already captures the whole speech utterance, whichhas the token-level relationship implicitly, we can predict atoken without explicitly autoregressive language modeling. Whenthe prediction of a token does not rely on other tokens, theparallel prediction of all tokens in the sequence is realizable.Based on this idea, we propose a non-autoregressive speechrecognition model called LASO (Listen Attentively, and SpellOnce). The model consists of an encoder, a decoder, and a positiondependent summarizer (PDS). The three modules are based onbasic attention blocks. The encoder extracts high-level repre-sentations from the speech. The PDS uses positional encodingscorresponding to tokens to convert the acoustic representationsinto token-level representations. The decoder further capturestoken-level relationships with the self-attention mechanism. Atlast, the probability distribution on the vocabulary is computedfor each token position. Therefore, speech recognition is re-formulated as a position-wise classiﬁcation problem. Further,we propose a cross-modal transfer learning method to reﬁnesemantics from a large-scale pre-trained language model BERTfor improving the performance. We conduct experiments on twoscales of public speech datasets AISHELL-1 and AISHELL-2.Experimental results show that our proposed model achieves aspeedup of about × and competitive performance, comparedwith the autoregressive transformer models. To better understandthe behaviors of LASO, we analyze the model by visualizingthe attention patterns. The results show that PDS can attendspeciﬁc encoded acoustic representations based on positions andthe decoder can capture token relationships. Index Terms —speech recognition, fast, end-to-end, non-autoregressive, attention, BERT, transfer learning

Ye Bai is currently a PhD candidate at University of Chinese Academy ofSciences (UCAS), Beijing, China. e-mail: [email protected] Yi is currently an Assistant Professor in NLPR, Instituteof Automation, Chinese Academy of Sciences, Beijing, China. e-mail:[email protected] Tao is currently a Professor in NLPR, Institute of Automation,Chinese Academy of Sciences, Beijing, China. e-mail: [email protected](Corresponding author: Jianhua Tao and Jiangyan Yi).Zhengkun Tian is currently a PhD candidate at University ofChinese Academy of Sciences (UCAS), Beijing, China. e-mail:[email protected] Wen is currently an Associate Professor in NLPR, Instituteof Automation, Chinese Academy of Sciences, Beijing, China. e-mail:[email protected] Zhang is currently a PhD candidate at University of Chinese Academyof Sciences (UCAS), Beijing, China. e-mail: [email protected].

I. I

NTRODUCTION D EEP learning has signiﬁcantly improved the performanceof automatic speech recognition (ASR). Conventionally,an ASR system consists of an acoustic model (AM), a pronun-ciation lexicon, and a language model (LM). The deep neuralnetwork (DNN) is used to model observation probabilities ofthe hidden Markov models (HMMs) [1], [2]. This DNN-HMMhybrid approach has achieved success in ASR. However, thepipeline of a DNN-HMM hybrid system usually requires totrain Gaussian mixture model based HMMs (GMM-HMM) forgenerating frame-level alignments and tying states. Buildingthe pronunciation lexicon requires knowledge of experts inphonetics. This complexity of the building pipeline limitsthe development of an ASR system. Moreover, the differentbuilding procedures of the AM and the LM make the systemdifﬁcult to be optimized jointly. Thus, the possible erroraccumulation in the pipeline inﬂuences the performance ofa hybrid ASR system.Pure neural network based end-to-end (E2E) ASR systemsattract interests of researchers these years [3], [4], [5], [6],[7]. Different from the hybrid ASR systems, these systemsuse one DNN to model acoustic and language simultaneouslyso that the network can be optimized with back-propagationalgorithms in an E2E manner. In particular, attention-basedencoder-decoder (AED) models have achieved promising per-formance in ASR [8], [9]. The AED models ﬁrst encode theacoustic feature sequence into latent representations with anencoder. With the latent representations, the decoder predictsthe text token sequence step-by-step. The attention mechanismqueries a proper latent vector from the outputs of the encoderfor the decoder to predict. However, even with the non-recurrent structure which can be implemented in parallel [10],[11], the recognition speed limits the deployment of AEDmodels in real-world applications. Two main reasons inﬂuenceinference speed: • First, the encoder encodes the whole utterance, so thedecoder starts inference after the user speaks out thewhole utterance; • Second, multi-pass forward propagation of the decodercosts much time during beam-search.Several work focus on the ﬁrst problem to make the systemcan generate the token sequence in a streaming manner. Mono-tonic attention mechanism [12], [13] enforces the attentionalignments to be monotonic. Therefore, the encoder only a r X i v : . [ c s . C L ] F e b OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2 encodes a local chunk in the acoustic feature sequence, andthe decoder predicts the next token without the future contextin the acoustic feature sequence. Triggered-attention systems[14], [15] use connectionist temporal classiﬁcation (CTC)spikes to segment the acoustic feature sequence adaptively,and the decoder predicts the next token when it is activatedby a spike. Transducer-based models [7], [16], [17] use anextra blank token and marginalize all possible alignments, sothe model can immediately predict the next token when theencoder accumulates enough information. All these modelscan generate a token sequence in a steaming manner andshow promising results. However, they limit the models usinglocal information in the speech sequence. It potentially ignoresglobal semantic relationships in the speech sequence. Theglobal semantic relationships contain not only the relationshipsamong acoustic frames but also the relationships among tokens[18], as shown in Fig. 1.In this paper, we aim to address the second problem, i.e.,we would like to generate the token sequence without beam-search. We propose an attention-based feedforward neural net-work model for non-autoregressive speech recognition called“LASO” (Listen Attentively, and Spell Once). The LASOmodel ﬁrst uses an encoder to encode the whole acousticsequence into high-level representations. Then, the proposed position dependent summarizer (PDS) module queries thelatent representation corresponding to the token position fromthe outputs of the encoder. It bridges the length gap betweenthe speech and the token sequence. At last, the decoderfurther reﬁnes the representations and predicts a token for eachposition. Because the prediction of one token does not dependon another token, beam-search is not used. And because thenetwork is a non-recurrent feedforward structure, it can beimplemented in parallel. To further improve the ability tocapture the semantic relationship of LASO (especially forthe decoder), we propose to use a cross-modal knowledgetransferring method. Speciﬁcally, we align the semantic spacesof the hidden representations of the LASO and the pre-trainedlarge-scale language model BERT [19]. We use the teacher-student learning based knowledge transferring method [20] toleverage knowledge in BERT. We conduct experiments on twopublicly available Chinese Mandarin datasets to evaluate theproposed methods with different data sizes. The experimentsdemonstrate that our proposed method achieves a competitiveperformance and efﬁciency.The contributions of this work are summarized as follows:1) We propose a non-autoregressive attention-based feed-forward neural network LASO for speech recognition.The model leverages the whole context of the speechand generates each token in the sequence in parallel.The experiments demonstrate that our proposed modelachieves high efﬁciency and competitive performance.2) We propose a cross-modal knowledge transferringmethod from BERT for improving the performance ofLASO. The experiments demonstrate the effectivenessof the knowledge transferring from BERT. The resultsalso show that the speech signals have similar internalstructures with corresponding text so that knowledgefrom BERT can beneﬁt the non-autoregressive ASR my dog is cute

Language Semantics in Speech

Fig. 1. A spectrogram of an example utterance. A word corresponds to asegment in the speech signal. The relationships among the segments can beseen as the relationships among the corresponding tokens, which are referredto as language semantics in this paper. model.3) We visualize the attention of LASO in detail to an-alyze the models. The visualization results show thatthe proposed PDS module can attend speciﬁc encodedacoustic representations and the decoder can capturetoken relationships.This journal version paper is extended from our conferencepaper of INTERSPEECH 2020 [21]. The new content in thispaper includes leveraging pre-trained BERT models to furtherimprove the performance, more detailed experiments on large-scale datasets, visualization for better understanding the mod-els. The rest of the paper is organized as follows. Section IIbrieﬂy compares the autoregressive AED models and thenon-autoregressive AED models. Section III re-formulates thespeech recognition as a position-wise classiﬁcation problem.Section IV describes the proposed LASO model. Section Vdescribes how to train the model and how we distill theknowledge from BERT. Section VI introduces how the modelgenerates a sentence. Section VII compares this work withprevious related work. Section VIII and Section IX presentsetup and results of experiments, respectively. Section Xdiscusses this paper. At last, Section XI concludes this paperand presents future work.II. B

ACKGROUND : A

UTOREGRESSIVE M ODELS VS .N ON -A UTOREGRESSIVE M ODELS

In this section, we introduce the background of autoregres-sive AED models (ARM) and non-autoregressive AED models(NARM). We compare these two paradigms to better introducethe proposed method in the rest sections.

A. Autoregressive AED Models

The ARM predicts the next token based on the previouslygenerated tokens. That is, for a speech-text pair ( 𝑋, 𝑌 ) , theARM factorizes the conditional probability 𝑃 ( 𝑌 | 𝑋 ) with thechain rule: 𝑃 ARM ( 𝑌 | 𝑋 ) = 𝑃 ( 𝑦 | 𝑋 ) 𝐿 (cid:214) 𝑗 = 𝑃 ( 𝑦 𝑗 | 𝑦 < 𝑗 , 𝑋 ) , (1)where 𝑋 = [ 𝑥 , · · · , 𝑥 𝑇 ] is the acoustic feature sequence,each 𝑥 denotes a feature vector, 𝑌 = [ 𝑦 , · · · , 𝑦 𝐿 ] is thetext token sequence, and 𝑦 < 𝑗 = [ 𝑦 , · · · , 𝑦 𝑗 − ] denotes theprevious context of the token 𝑦 𝑗 . The token can be a word OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3

Multi-Head Attention

FFN FFNFFNFFN FFN FFNLN LN LN LN LN LN

Multi-Head AttentionFFN FFNFFNFFN FFN FFN FFNCNN Subsampling N e × (N s -1)×N d ×Encoder Decoder

Position Dependent

Summarizer (PDS)

Sinusoidal Position EncodingMulti-Head Attention

FFN FFNFFNFFN FFN FFNLN LN LN LN LN LN

Multi-Head AttentionFFN FFNFFNFFN FFN FFNLN LN LN LN LN LN my dog is cute

Linear&Softmax Linear&Softmax

Linear&

Softmax Linear&Softmax Linear&Softmax Linear&Softmax

LNLN LN LN LN LN LN

Fig. 2. An illustration of the proposed LASO model. The LASO consists of an encoder, a position dependent summarizer (PDS), and a decoder. All the threemodules are composed of basic attention blocks, which consist of multi-head attention and a position-wise feedforward network (FFN in the ﬁgure). We ﬁrstuse a CNN to subsample the acoustic feature sequence. Then, the encoder extracts high-level representations from the subsampled sequence. The PDS queriesthe high-level acoustic representations corresponding to each token position. Then, the decoder further reﬁnes the language semantics. For each position, aprobability distribution over the vocabulary is computed with a softmax function. And during inference, we select the most likely token at each position toform the token sequence. The extra tokens are removed in the token sequence. Each attention block includes layer normalization (LN in the ﬁgure)and a residual connection. And we add sinusoidal position encodings to the subsampled acoustic feature sequence. The whole architecture is non-recurrent,so it can be implemented in parallel for fast inference. or a sub-word (phone, character, or wordpiece). This can beseen as a conditional language model, i.e., the model estimatesthe probability of the token sequence given the acousticfeature sequence. Typically, Eq. (1) is implemented witha neural network, which is an encoder-decoder architecture.The encoder encodes the acoustic feature sequence, and thedecoder computes 𝑃 ( 𝑦 𝑗 | 𝑦 < 𝑗 , 𝑋 ) step-by-step.To ﬁnd the token sequence which has the highest probabilityapproximately, a beam-search algorithm is used during infer-ence. The decoder maintains a beam of potential candidates ofthe token sequence which have high probabilities. Thus, thedecoder has to forward propagate these candidates. In addition,the generation of each token depends on the previously gener-ated tokens, so it is difﬁcult to implement parallel generation. B. Non-Autoregressive AED Models

Different from the ARM, the NARM predicts each tokenwithout dependence on other tokens. Speciﬁcally, the NARMassumes the conditional independence on tokens: 𝑃 NARM ( 𝑌 | 𝑋 ) = 𝐿 (cid:214) 𝑗 = 𝑃 ( 𝑦 𝑗 | 𝑋 ) . (2) Because each probability does not depend on the other tokens,parallel implementation is possible. Another view of Eq. (2)is to process each token independently rather than to processthe production. The details are described in Section III.Conventionally, the relationships among tokens are consid-ered an important factor for the token sequence generation.We refer to this relationship as language semantics in thispaper. The previous non-autoregressive model CTC assumesconditional independence on token sequence [22]. However,to achieve good performance, the CTC-based systems use n-gram LMs for modeling the language semantics [23]. Andthe advanced version transducer models [7] use a neuralnetwork to model the language semantics. An AED modelcaptures the language semantics by the decoder [3], [4], [5].Different from them, in this paper, we propose to use a self-attention mechanism to model the implicit language semantics,which compensates for the loss of the explicit autoregressivelanguage model. The details are described in Section IV.III. ASR AS P OSITION - WISE C LASSIFICATION

In this paper, we give a new perspective on the speechrecognition problem. The basic idea is that the language

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

Multi-Head AttentionK V QPosition-wise FFNDropoutLayerNorm

Dropout

Fig. 3. An illustration of an attention block. semantics is implicitly contained in the speech signal. Fig. 1shows a spectrogram of an utterance “my dog is cute”. Eachsegment corresponds to a word in the utterance. We observethat the language semantics, i.e., the relationships among thetokens, is expressed among the segments implicitly. Thus,if the speech signal of the whole utterance is available, wecan leverage this implicit language semantic to improve theperformance of ASR.Based on the above observations, we consider the ASRproblem as position-wise classiﬁcation, given the wholespeech utterance. Namely, we use the whole acoustic fea-ture sequence, including explicit acoustic characteristics andimplicit language semantics, to predict one token, but notestimate the probability of the token sequence. When allthe tokens are predicted, we simply put them together asthe recognition result. Formally, we predict the followingprobability: 𝑃 ( 𝑦 𝑗 | 𝑋 ) = 𝑓 ( 𝑋 ) , 𝑗 = , · · · , 𝐿, (3)where 𝑋 denotes the whole acoustic feature sequence, 𝑦 𝑗 denotes a token in the token sequence, 𝐿 is the length ofthe token sequence, and 𝑓 is some non-linear function. In thispaper, we use the proposed feedforward neural network LASOas the non-linear function 𝑓 . Usually, the length of the tokensequence is unknown in advance. We use a simple way totackle this. We set 𝐿 as a big enough number, and the tail ofthe token sequence will be predicted to the ﬁller token.IV. T HE P ROPOSED

LASO M

ODEL

In this section, we introduce the proposed LASO model.The architecture is shown in Fig. 2. The encoder encodes theacoustic feature sequence into the high-level representations.The PDS queries the high-level representation correspondingto each token position. Another purpose of the PDS is to bridgethe length gap between the speech and the token sequence.The decoder further captures language semantics from theoutputs of the PDS. At last, the probability distributions over

KeysQueries Values OutputsAttentionScores

MatMul & Softmax

MatMul

Fig. 4. An illustration of dot-product attention. The queries and the keysare used to compute the attention scores with matrix multiplication and thesoftmax function. The attention scores are used as weights to fuse values. the vocabulary are computed with the linear transformationsand the softmax functions. For inference, the most likely tokenat each position is selected. For the token sequence whoselength is shorter than 𝐿 , the tail is ﬁlled with the ﬁller token . This also predicts the length of the token sequenceautomatically. These ﬁller tokens are easily removed.The whole network is a feedforward structure so that itcan be implemented in parallel to make inference fast. Weintroduce the details of the basic attention block and eachmodule in the rest of this section. A. Attention Block

The feedforward attention structure [24] captures the globalrelationship in a sequence. Different from recurrent neuralnetworks which encode history context into latent vectors, thefeedforward attention mechanism uses a weighted sum to fusethe input sequence. Because of its feedforward structure, it canbe computed in parallel. In this work, we use the scaled dot-product attention and position-wise feedforward network asthe basic submodule, following [24]. But we use “pre-norm”[25] for stable training. The structure is shown in Fig. 3.The scaled dot-product attention is computed byAtten ( 𝑄, 𝐾, 𝑉 ) = Softmax ( 𝑄𝐾 𝑇 √ 𝐷 𝑘 ) 𝑉, (4)where 𝑄 ∈ R 𝑇 𝑞 × 𝐷 𝑘 denotes the queries, 𝐾 ∈ R 𝑇 𝑘 × 𝐷 𝑘 denotesthe keys, and 𝑉 ∈ R 𝑇 𝑘 × 𝐷 𝑣 denotes values. As shown in Fig. 4,the attention scores are computed with the dot products ofqueries and keys, then are normalized with softmax functions.The normalized attention scores will be sharp at some position,and others are small. Then, by matrix multiplication, the valuesare fused to the corresponding position. This procedure canbe seen as that the query queries keys and fetches out acorresponding value from values.To make the attention scores various, it can be extended tomulti-head version:MHA ( 𝑄, 𝐾, 𝑉 ) = Concat ( ℎ , · · · , ℎ 𝐻 ) 𝑊 𝑜 ,ℎ 𝑖 = Atten ( 𝑄𝑊 𝑞𝑖 , 𝐾𝑊 𝑘𝑖 , 𝑉𝑊 𝑣𝑖 ) , 𝑖 = , · · · , 𝐻. (5)The queries, keys, and values are transformed into subspaceswith parameter matrices 𝑊 𝑞𝑖 , 𝑊 𝑘𝑖 , 𝑊 𝑣𝑖 , where 𝑖 is the index ofhead. Then, the scaled dot-product attention is computed forthe transformed inputs. At last, the outputs are concatenated OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5 together and multiplied with 𝑊 𝑜 . ℎ 𝑖 is one attention head. 𝐻 is the number of heads.A position-wise feedforward neural network (FFN) trans-forms output of the attention at each position:FFN ( 𝑢 ) = 𝑊 Activate ( 𝑊 𝑢 + 𝑏 ) + 𝑏 , (6)where 𝑢 is a vector at one position, 𝑊 , 𝑊 , 𝑏 , and 𝑏 are learnable parameters, “Activate” is a nonlinear activationfunction. In this work, gated linear units (GLUs) [26] are used.Residual connection [27] and layer normalization (LN) [28]are also used. Different from [24], we set the LN layer beforethe residual connection, following [25] to make training stableand effective. The structure of the attention block is shown inFig. 3. B. Encoder

The encoder extracts high-level representations from theacoustic feature sequence. We ﬁrst use a two-layer convolu-tional neural network (CNN) to subsample the acoustic featuresequence. We set the stride on the time axis to , so that theframe rate is reduced to / . Then the outputs of the CNN areﬂattened to a 𝑇 -by- 𝐷 𝑚 matrix, where 𝑇 is the length of thesubsampled feature sequence, and 𝐷 𝑚 is the dimensionality.Another purpose of the CNN is to capture the locality of theacoustic feature sequence.Then, the encoder has a stack of 𝑁 𝑒 attention blocks, asshown in Fig. 2. The keys, queries, and values are all thesame, so it is a self-attention mechanism. That is, the attentionscores are obtained by computing dot-product between everytwo vectors of the inputs. Therefore, the long-term dependencyis captured. C. Position Dependent Summarizer

The PDS module is the core of the proposed LASO model.The PDS module leverages queries, which depend on positionsof the token sequence, to query the high-level representationsfrom the encoder. It can be seen that the module “summarizes”the acoustic features, so we name it as “summarizer”. As theresult, it bridges the gap between the length of the acousticfeature sequence and the length of the token sequence.The PDS module consists of 𝑁 𝑠 attention blocks. For theﬁrst block, the queries are position encodings. And the queriesof the other blocks are the outputs of the previous block. Eachquery represents the token position in the token sequence. Thekeys and the values are all the outputs of the encoder. Thelength of 𝐿 of the position encoding sequence is pre-set bycounting the length of utterances in the training set and addsome tolerance. For example, if the maximum length of tokensequences in the training set is 90, we can set this 𝐿 to 100.We use sinusoidal position encodings [24]:pe 𝑖, 𝑗 = sin ( 𝑖 / 𝑗 / 𝐷 𝑚 ) , pe 𝑖, 𝑗 + = cos ( 𝑖 / 𝑗 / 𝐷 𝑚 ) , (7)where 𝑖 = , · · · , 𝐿 denotes the 𝑖 -th position and 𝑗 is the indexof an element. A beneﬁt to use sinusoidal position encodingsis that it would allow the model to easily learn relativity of positions. That is, the positional encoding of a ﬁxed distancebetween two positions 𝑘 can be represented as a linear functionof the two positions [24]. D. Decoder

The decoder further reﬁnes the representations of the PDSmodule. Similar to the encoder, it is a self-attention module,which consists of 𝑁 𝑑 attention blocks. It captures the rela-tionships in the sequence, i.e., implicit language semantics,which is queried by the PDS. The inputs of the decoder arethe outputs of the PDS.After the decoder, a linear transformation and a softmaxfunction are used to compute the probability distribution overthe token vocabulary. E. Formulation

The LASO model can be formulated as follows: 𝑍 = Enc ( 𝑋 ) ,𝑞 𝑖 = Summarize ( 𝑍, pe 𝑖 ) , 𝑖 = , , · · · , 𝐿,𝑄 = [ 𝑞 , · · · , 𝑞 𝐿 ] 𝑃 ( 𝑦 𝑖 | 𝑋 ) = Dec ( 𝑄 ) , 𝑖 = , , · · · , 𝐿, (8)where 𝑋 = [ 𝑥 , · · · , 𝑥 𝑇 ] is the feature sequence, 𝑍 denotes thehigh-level representations which encoded with the encoder,and the probability over the vocabulary at each position 𝑃 ( 𝑦 𝑖 | 𝑋 ) is computed with the PDS and the decoder. Function“Summarize” represents the PDS module, which attends theoutputs of the encoder in terms of the positional encodings.Because the positional encodings, which are deterministic butnot random, are a part of the whole model, they are not writtenin the probability expression.V. L EARNING

In this section, we introduce the learning procedure of theLASO model.

A. Maximum Likelihood Estimation

We use maximum likelihood estimation (MLE) criterion totrain the parameters of the LASO model. We minimize thefollowing negative log-likelihood (NLL) loss.NLL ( 𝜃 ) = − 𝑁 𝐿 𝑁 ∑︁ 𝑛 = 𝐿 ∑︁ 𝑖 = log 𝑃 𝜃 ( 𝑦 ( 𝑛 ) 𝑖 | 𝑋 ( 𝑛 ) ) , (9)where 𝑋 ( 𝑛 ) , 𝑌 ( 𝑛 ) is the 𝑛 -th speech-text pair in the corpus.The total number of the pairs is 𝑁 . 𝑦 ( 𝑛 ) 𝑖 is 𝑖 -th token in 𝑌 ( 𝑛 ) . 𝐿 is the length of the token sequence, which is preset. If thelength of a text data is shorter than 𝐿 , is used to padit, as shown in Fig. 2. Thus, the length of the token sequenceis estimated automatically. 𝜃 denotes trainable parameters ofthe model.This training procedure is end-to-end and does not dependon any frame-level label generated by some GMM-HMMsystem. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6

BERT my dog is cute

Decoder

MSE

Linear Linear Linear Linear Linear Linear Outputs of the Last Hidden LayerOutputs of the Last Hidden Layer

Fig. 5. An illustration of semantic reﬁnement from BERT. The valid part ofthe token sequence is inputted into BERT. And the MSE loss between thelast hidden layers of decoder and BERT is minimized. The optional lineartransformation is used when the dimensionalities of the decoder and BERTare different.

B. Semantic Reﬁnement from BERT

To further improve the performance, we use teacher-studentlearning [29], [30], [31], [32] to reﬁne knowledge from pre-trained LM. BERT, a kind of denoising autoencoder LMtrained on very large scale text, has shown the powerfulability of language modeling and achieved state-of-the-artperformance on many NLP tasks [19]. Inspired by our previouspaper [20], we transfer the knowledge from BERT to theLASO model. Another potential advantage of using BERT asthe teacher model is that both our proposed LASO and BERTare bidirectional models, i.e., the model predicts a token usingboth the left context and the right context.The basic idea is that the BERT can provide good semanticrepresentation for each token. And we consider the outputs ofthe decoder also provide token-level representation. Therefore,we make the outputs of the decoder approximate the BERT.We minimize the mean squared error (MSE) between theirlast hidden layers, as shown in Fig. 5. To match the trainingprocedure of BERT, we add a token at the head of thetoken sequence.We input the token sequence into BERT model to fetch outthe outputs of the last hidden layer. Note that we only inputthe valid part of the token sequence, but not the padding token at the tail. and are converted to [CLS] and [SEP] , which are two special tokens added at the headand the tail of a sentence in BERT [19].We also use the valid part of the outputs of the decoder, i.e.,the outputs corresponding to the subsequence from tothe ﬁrst . Then, we compute the MSE:MSE ( 𝜃 ) = 𝑁 𝑁 ∑︁ 𝑛 = 𝐿 ( 𝑛 ) 𝑣 𝐿 ( 𝑛 ) 𝑣 ∑︁ 𝑖 = 𝐷 𝐵 ∑︁ 𝑑 = ( ℎ ( 𝑛 ) 𝑖,𝑑 − 𝑠 ( 𝑛 ) 𝑖,𝑑 ) , (10)where ℎ ( 𝑛 ) 𝑖,𝑑 is the 𝑑 -th element of the 𝑖 -th vector in the decoderoutput sequence for the 𝑛 -th data, and 𝑠 ( 𝑛 ) 𝑖,𝑑 is the 𝑑 -th elementof the 𝑖 -th vector in the BERT output sequence for the 𝑛 -th data, 𝐷 𝐵 is the dimensionality of BERT output, 𝐿 ( 𝑛 ) 𝑣 is thevalid length of 𝑛 -th data, and 𝑁 is the total number of thedata. 𝜃 represents all parameters of the model.If the dimensionality of the decoder is different from BERT,we can simply use an optional linear transformation to makethe dimensionality matching, as shown in Fig. 5.Another beneﬁt of transferring knowledge with the hiddenlayer rather than probability is that it makes the construction ofvocabulary more ﬂexible. We can use only a part vocabularyof BERT rather than all tokens.At last, we combine the NLL loss and the MSE loss as theﬁnal loss: 𝐿 ( 𝜃 ) = NLL ( 𝜃 ) + 𝜆 MSE ( 𝜃 ) , (11)where 𝜆 is a coefﬁcient to balance the values of the two losses.The typical value is . .The BERT model is only used at the training stage. It doesnot add any extra complexity during inference.VI. I NFERENCE

The inference of LASO is simple. We just select the mostlikely token at each position: ˆ 𝑦 𝑖 = arg max 𝑦 𝑖 𝑃 ( 𝑦 𝑖 | 𝑋 ) . 𝑖 = , · · · , 𝐿, (12)Then, the special tokens (or including if themodel is trained with BERT) are removed. Note that thepositional encodings ( [ pe 𝑖 ; · · · ; pe 𝐿 ] ) are a part of the wholemodel, so that they are inputed into the model as whole.This inference procedure does not depend on beam-search,so multi-pass forward propagation is not needed. Thus, theinference time cost is much reduced.VII. R ELATED W ORK

In this section, we review and compare the previously re-lated work in two main aspects. One is the non-autoregressiveAED model. And the other is the utilization of LMs for AEDmodels.

A. Non-Autoregressive AED Models

Non-autoregressive AED models are ﬁrst used in machinetranslation (MT). Gu et al. ﬁrst proposed non-autoregressivemachine translation and introduced fertility to tack the mul-timodality problem [33]. Auxiliary regularization [34] andenhanced decoder input [35] are proposed to improve theperformance. Lee et al. proposed an iterative reﬁnement al-gorithm for MT [36]. MA et al. proposed a ﬂow model forsequence generation [37]. These models showed promisingresults on both performance and efﬁciency. However, thesemodels are used for MT but not ASR. Speech signal has itsown property, for example, the monotonic alignment with thetoken sequence, and the implicit language semantics in speech.This motivated us to propose a simpler non-autoregressiveAED model for ASR. The non-autoregressive transformer,which completes the masked tokens iteratively, is proposedfor ASR [38]. However, its decoder needs multi-pass forwardpropagation. Different from the previous work, we propose anon-autoregressive model LASO, which only needs one-pass

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

TABLE IT HE D ESCRIPTION OF THE D ATASETS propagation. We reformulate speech recognition as a position-wise classiﬁcation problem. We propose the PDS to extracttoken-level representation from speech. The PDS bridges thelength gap between the speech and the token sequence.

B. The Utilization of LMs

Fusion methods, such as shallow fusion, deep fusion, andcold fusion, integrate external LMs to improve the perfor-mance [39], [40]. However, these methods add extra com-plexity during inference. And these methods can only useunidirectional LMs, so recent powerful bidirectional LMs suchas ELMo [41], BERT [19], are not applicable. BERT wasused to rescore the n-best results [42]. However, rescoringincreases computation during inference. And it does not usethe representation ability of BERT. Bai et al. proposed LSTapproach to transfer knowledge from an LM to an ARM withteacher-student learning [20]. [43] distilled knowledge fromBERT to an ARM. In this paper, we transfer the knowledgefrom BERT to improve the LASO model. LASO and BERTboth capture the global token-level relationship. With thismethod, the LASO model can beneﬁt from the representationability of BERT but does not add any extra computation duringinference.

C. Cross-Modal Semantic Alignment

The semantic reﬁnement from BERT is also related to cross-modal semantic alignment, i.e., aligning the spaces of LASOand BERT. The concept of cross-modal semantic alignment isﬁrst used in cross-model retrieval [44], [45], [46], [47], [48],[49]. Especially that the recent deep learning work learn ashared semantic space between the images and the text basedon neural networks [46], [47], [48], [49]. The basic idea ofthese work is to map the features from different modalities sothat the system can easily compute their similarities. RecentASR-free approach to text query-based keyword search fromspeech also based on this idea, i.e., learning a shared semanticspace between text query and speech [50]. The motivationof our proposed semantic reﬁnement from BERT is also toalign semantic of LASO and BERT. However, different fromthe retrieval work, we do not train the models of the twomodalities but only train the LASO model. Namely, the BERTmodel, which has been conﬁrmed as a powerful language model, is an auxiliary model to train the LASO model andis not used during inference. In addition, this work is in across-model knowledge transferring setting, i.e., transferringknowledge from text-modal model BERT to speech-modalmodel LASO . VIII. E XPERIMENTAL S ETUP

In this section, we introduce the used datasets and theexperimental setup. All the experiments are implemented withdeep learning toolkit PyTorch [51] with python programminglanguage.

A. Datasets

We conduct experiments on public Chinese speech datasetsAISHELL-1 [52] and AISHELL-2 [53]. These two datasetshave different scales of data so that we can evaluate thegeneralization on both a small dataset and a large dataset.AISHELL-1 contains 178 hours of Mandarin speech. Thespeech is recorded by 400 speakers. All audio is recordedwith high ﬁdelity microphones in 44.1 kHz, then subsampledto 16 kHz. The content of the datasets covers 5 domainsincluding “Finance”, “Science and Technology”, “Sports”,“Entertainments”, and “News”.AISHELL-2 contains about 1000 hours of Mandarin Speechfor training. The training set is recorded by 1991 speakers withiPhone smartphones. The content covers voice commands,digital sequence, places of interest, entertainment, ﬁnance,technology, sports, English spellings, and free speaking with-out speciﬁc topics. The development sets and the test sets arerecorded with different equipment to evaluate generalizationfor different equipment.The details of the two datasets are shown in Table I. B. Setup

Basic settings . We ﬁrst evaluate the models on the small-scale (150 hours) dataset AISHELL-1. Then we extend theexperiments to the large-scale (1000 hours) dataset AISHELL-2. We use -dimension Mel-ﬁlter bank features (FBANK) as Different from AED models, the inputs of LASO are only speech featuresbut not text embeddings. Therefore, it is a unimodal model. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

TABLE IIT HE S YMBOLS OF THE H YPER - PARAMETERS OF THE A RCHITECTURE

Symbol Description 𝐷 𝑚 The dimensionality of the inputs of the multi-head attention. 𝐷 𝑖𝑛 The inner dimensionality of the position-wise FFN.Activation The type of activation function of the position-wise FFN. the inputs, which are extracted every 10ms with 25ms of framelength. For AISHELL-1, the token vocabulary contains characters in the training set and three special symbols, i.e.," " for the start of the sentence, " " for unseencharacters, and " " as the ﬁller of the tail of a tokensequence. The vocabulary size for AISHELL-2 is . Baseline settings . We build an autoregressive AED modelSpeech-Transformer [10], [11] and an non-autoregressiveSAN-CTC model [54] as two baselines. The two models usethe same settings as LASO, including the above mentionedfeatures, vocabulary, and basic attention blocks (Fig. 3). Be-cause we would like to compare the performance of SAN-CTC in the non-autoregressive setting, the decoding processis greedy, i.e., directly select the most likely token and removethe blanks, without beam-search on an N-gram LM basesearching graph.For Speech-Transformer, both the encoder and the decoderhave 6 attention layers. The model dimensionality is 512. Thenumber of heads of the attention is 8. The dimensionality ofthe intermediate FFN is 2048. And the activation function isGLU. It also uses the same CNN subsampling layer as theLASO models, as shown in Fig. 2. We refer to this model as

Transformer . For SAN-CTC, because it only has encoderpart, we set the number of the attention layer to 12. The modelsize of is comparable to

Transformer and other models. Werefer to this model as

SAN-CTC . LASO settings . We compare different architectures ofLASO. The symbols of network conﬁguration are shown inTable II. The basic attention blocks are shown in Fig. 3, whichare the same with the two baselines. The lengths of positionalencodings of the PDS module are set to ( 𝐿 in Eq. (8) ). Training settings . We use Adam algorithm [55] to optimizethe models. We use the warm-up learning rate schedule [24]: 𝛼 = 𝐷 − . · min ( 𝑠𝑡𝑒 𝑝 − . , 𝑠𝑡𝑒 𝑝 · 𝑤𝑎𝑟𝑚𝑢 𝑝 − . ) . (13)The warm-up step is set to 12000. The dropout rate is set to0.1. Each batch contains about 100 seconds of speech, andwe accumulate gradients of 12 steps to simulate a big batch[56] for stabilizing training. We train the models until theyconverge. The typical number of epochs for LASO is 130.And the typical number of epochs for Speech-Transformer andSAN-CTC is 80.We use SpecAugment [57] for data augmentation. Thefrequency masking width is 27. The time masking width is40. Both frequency masking and time masking are employedtwice. But we do not use time warping. We leverage labelsmoothing with 0.1 for over-conﬁdence problems during train- ing. We average parameters of the models which are saved atthe last 10 epochs as the ﬁnal model.We use the Google’s pre-trained Chinese BERT model forsemantic reﬁnement. This model has 12 transformer layers.The model dimensionality is 768. The total number of param-eters is 110M. The vocabulary of the BERT model contains21128 tokens. During training, the coefﬁcient 𝜆 in Eq. (11)is set to . . Performance evaluation . For performance evaluation, weuse the standard edit-distance based error rate, i.e. charactererror rate (CER). For speed evaluation, we use both real-timefactor (RTF) and averaged processing time (APT). RTF is astandard metric to evaluate processing time cost of an ASRsystem. It is the averaged time cost to process one-secondspeech: RTF = Total Processing TimeTotal Duration . (14)This metric is dimensionless and independent with utteranceduration. To consider the impact to processing time of theutterance duration, we compute APT:APT = Total Processing TimeTotal Number of Utterance . (15)The unit of APT is second.The reason to use ATP is to show the efﬁciency to processone utterance. It considers waiting time of a user and is suit-able to both online applications (speech interaction systems)and ofﬂine applications (such as voice document transcription).Speciﬁcally, for online applications, the waiting time of a useris “the time between the stop-point at which the user stopsspeaking and the appearance on the screen of the ASR result”.And for ofﬂine applications, the waiting time of a user is “thetime between the point at which the user inputs the utteranceand the appearance on the screen of the ASR result”. Thesemetrics ignore the uncontrollable factors of a speech engineer,e.g., data-transmission speed on Internet.RTF may give an overestimated sense of whole-utteranceASR systems (AED, Speech-Transformer, or bidirectionalAM based hybrid models), because the whole-utterance ASRsystems starts processing speech after receiving the whole ut-terance, unlike some streaming models. For streaming models,the waiting time of a user of an online application can beestimated asRTF × the length of the last data package . (16)However, for whole-utterance ASR systems, the length ofwhole-utterance is needed to be considered. To address thisissue, we compute APT, which directly average processingtime cost of an utterance. We compute these two practicalvalues on commonly used deep learning devices (GPUs) toshow the time cost of the ASR systems. We also mentionthat the time cost of feature extraction is included in ourexperimental implementation.IX. E XPERIMENTAL R ESULTS

In this section, we introduce the experimental results. https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9

A. Comparing Model Architectures on AISHELL-1

First, we comparing the performance of the different modelconﬁgurations. Table III shows the results of the differentconﬁgurations. From Table III, we conclude statements asfollows.1) 𝐷 𝑚 : A large number of the dimensionality of theattention block makes the model have more powerfulrepresentation ability and achieve better performance.2) Activation: We ﬁnd that using GLU as the activationfunction is more effective than using ReLU consistently.The GLU will introduce more parameters to the model.However, comparing the deeper model with ReLU andthe shallower model with GLU, whose model sizes arerelatively similar, the model with GLU can achievebetter performance.3) 𝐷 𝑚 = . How-ever, for the models with 𝐷 𝑚 = , the performanceimprovements are subtle. We analyze that it is becausethe discrepancy of the dimensionality of the LASO andBERT is too large so that it is hard for the LASO modelsto learn the representations of BERT.In the rest of the paper, we refer to model 2, model 12,and model 16 in Table III as LASO-small , LASO-middle ,and

LASO-big , respectively, to compare performance withprevious models.

B. Comparisons with Other Methods on AISHELL-1

We compare our proposed LASO with baseline autore-gressive model

Transformer and non-autoregressive CTCmodel

SAN-CTC . We also compare it with previous work. Theresults are shown in Table IV. We can see that our proposedLASO models achieve competitive performance, comparedwith the hybrid models, the CTC models, and autoregressivetransformer models. In particular, with semantic reﬁnementfrom BERT,

LASO-middle and

LASO-big outperform theautoregressive Transformer model and achieve state-of-the-art performances. Moreover, the inference speed of the non-autoregressive models is much faster than the autoregressivemodels.We implemented a competitive baseline autoregressivemodel

Transformer , which achieves a CER of 6.6% on thetest set of AISHELL-1. The performance of

LASO-middle ,which has a similar scale of the number of the parameters with

Transformer , is comparable to

Transformer . And thebigger model

LASO-big outperforms

Transformer . Both

LASO-middle and

LASO-big outperform the CTC basednon-autoregressive model

SAN-CTC . https://github.com/kaldi-asr/kaldi/blob/master/egs/aishell/s5/RESULTS https://github.com/espnet/espnet/blob/master/egs/aishell2/asr1/RESULTS.md Table IV also lists RTF and APT of the models. RTF isthe ratio of the total inference time to the total duration ofthe test set. APT is the averaged time cost for decoding oneutterance (including the time of feature extraction) on thetest set. The inference is done utterance by utterance on anNVIDIA RTX 2080Ti GPU. We can see that the inferencespeed of the non-autoregressive models is much faster than theautoregressive transformer model

Transformer . The APTof the non-autoregressive models is reduced by × . And wecan see that even the model size of LASO-big is larger than

Transformer , the APT of

LASO-big is much reduced.With the non-autoregressive and the feedforward structure, theinference of the model can be implemented very efﬁciently.The inference speed of the baseline non-autoregressivemodel

SAN-CTC is also very fast. However, compared withthe similar scale LASO models, the performance is degraded.We analyze that the LASO models use the decoder, whichplays the role of an autoencoder LM, can capture languagesemantics more effectively. In addition, for CTC models, theblank symbols inserted in the token sequence may inﬂuencethe model to capture language semantics.For

LASO-middle and

LASO-big , semantic reﬁnementfrom BERT much reduces CERs by 11% to 12% on the testset. This demonstrates that teacher-student learning with BERTcan improve the ability of the LASO to capture the languagesemantics. However, for the small-size model

LASO-small semantic reﬁnement from BERT does not improve the perfor-mance. We analyze that the dimensionality of the representa-tions of

LASO-small and BERT are very different (256 vs.768), which inﬂuences the LASO model to learn knowledgefrom the BERT model.

C. Comparisons with Other Methods on AISHELL-2

We then extend the experiments to the larger scale datasetAISHELL-2. AISHELL-2 contains about 1000 hours of train-ing data. And the covered topics of AISHELL-2 are morediverse than AISHELL-1. Furthermore, the training set isrecorded with iPhone, and the test sets of AISHELL-2 coverthree different channels, i.e., iPhone, Android smartphones,and Hi-Fi microphones. So we can evaluate the generaliza-tion of the models on AISHELL-2 more in detail. BecauseAISHELL-2 is larger than AISHELL-1 and training modelson AISHELL-2 costs much more time, we directly use the se-lected architectures from previous experiments on this dataset.The experimental results are shown in Table V. We can seethat the proposed LASO models also achieve a promising per-formance. Compared with the hybrid systems, LAS systems,and the CTC model, all LASO models achieve better perfor-mance. And with the similar and larger scale model sizes,

LASO-middle and

LASO-big outperform previous state-of-the-art transformer models. We ﬁnd that with more dataand a larger model, LASO can achieve better performance thanthe well-trained transformer models. With semantic reﬁnementfrom BERT, the performances are further improved. However,the improvements are not signiﬁcant like the experimentson AISHELL-1. And for

LASO-small , the performancedegrades. The possible reasons are 1) the model capacity is

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

TABLE IIIT HE C HARACTER E RROR R ATES ON

AISHELL-1

WITH D IFFERENT H YPER - PARAMETERS

Model Dm Din Activation

OMPARISONS WITH O THER W ORK ON

AISHELL-1

Model † ‡ - - 8.6 -KALDI(chain) * † ‡ - - 7.4 -LAS [58] - 9.4 10.6 -ESPNet (Transformer) † ‡ [59] - 6.0 6.7 -A-FMLM [38] - 6.2 6.7 -Fan et al. (Transformer) [60] - - 6.7 -AGS CTC ‡ [61] - 7.0 7.9 - Transformer (baseline1) 67.5M 6.1 6.6 0.19 / 961ms

SAN-CTC (baseline2) 56.4M 7.2 7.8 0.0033 / 16ms

LASO-small

LASO-small w/ BERT 20.6M 7.0 7.8 0.0027 / 13ms

LASO-middle

LASO-middle w/ BERT 63.3M 5.4 6.2 0.0035 / 17ms

LASO-big

LASO-big w/ BERT 80.0M * from the KALDI ofﬁcial repository . † with speed perturbation based data augmentation. ‡ with an extra language model at the inference stage. very large between the small model LASO-small and thebig model BERT; 2) AISHELL-2 has more data and is morecomplex than AISHELL-1, which inﬂuences the effectivenessof knowledge distillation [62], [63].X. D

ISCUSSION

In this section, we analyze the attention patterns and discussthe impact of the sentence lengths.

A. Visualization of Attention Patterns

To better understand the behaviors of the LASO model, wevisualize the attention scores of an utterance with

LASO-big .We show the ﬁrst four heads of attention scores of the last layers of the encoder, the PDS, and the decoder. Thedetailed visualization is listed in the supplemental materials.Fig. 6 shows the visualization results. We summarize theobservations as follows.1) Different heads in one layer have different attentionpatterns. This implies that one representation of thesequence in different head attends different representa-tions.2) For the encoder, some attention patterns show the top-left to bottom-right alignments (Fig. 6b and Fig. 6d).This meets the expectation: the matched representa-tions are around the corresponding representation in thespeech sequence. And some heads do not show obviouspatterns.3) For the decoder, we can see that some hidden representa-tion at a token position attends the hidden representationat the previous token position (Fig. 6g), and some hiddenrepresentation at a token position attends the hiddenrepresentation at the next token position (Fig. 6f). Mosthidden representations corresponding to ﬁller tokens attend the hidden representations correspondingto .4) For the PDS, the position corresponding to a tokenattends a small range of representations, and the overallpatterns are from the top-left to the bottom-right (Fig. 6i,Fig. 6j, Fig. 6k, and Fig. 6l). For the ﬁrst severalpositions corresponding to the ﬁller tokens, theattention pattern is like a vertical line. However, the linesare corresponding to different positions of the speechin different head (Fig. 6j and Fig. 6l). And most other tokens attend the ﬁrst several representations.We analyze that these representations denote ﬁllers.From the above observations, we conclude that 1) different

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11

TABLE VC

OMPARISONS WITH O THER W ORK ON

AISHELL-2Model † ‡ - 9.1 10.4 11.8 10.4 8.8 9.6 10.9 9.8LAS [58] - - - - - 9.2 9.7 10.3 9.7ESPNet (transformer) * † ‡ - - - - - 7.5 8.9 8.6 8.3

Transformer (baseline1) 67.5M 6.4 7.2 7.7 7.1 7.1 8.0 8.2 7.8

SAN-CTC (baseline2) 56.4M 8.3 8.9 8.8 8.6 8.0 9.0 8.9 8.7

LASO-small

LASO-small w/ BERT 20.6M 8.9 10.0 10.3 9.7 8.8 9.8 10.5 9.7

LASO-middle

LASO-middle w/ BERT 63.3M 6.5 7.2 7.4 7.0 6.6

LASO-large

LASO-large w/ BERT 80.0M * from the ESPnet ofﬁcial repository . † with speed perturbation based data augmentation. ‡ with an extra language model at the inference stage. time step t i m e s t e p (a) The attention scores of the ﬁrsthead of the last encoder layer. time step t i m e s t e p (b) The attention scores of the sec-ond head of the last encoder layer. time step t i m e s t e p (c) The attention scores of the thirdhead of the last encoder layer. time step t i m e s t e p (d) The attention scores of the forthhead of the last encoder layer. VRV! HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! token position VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (e) The attention scores of the ﬁrsthead of the last decoder layer. VRV! HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! token position VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (f) The attention scores of the sec-ond head of the last decoder layer. VRV! HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! token position VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (g) The attention scores of the thirdhead of the last decoder layer. VRV! HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! token position VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (h) The attention scores of the forthhead of the last decoder layer. time step VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (i) The attention scores of the ﬁrsthead of the last PDS layer. time step VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (j) The attention scores of the sec-ond head of the last PDS layer. time step VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (k) The attention scores of the thirdhead of the last PDS layer. time step VRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV!HRV! t o k e n p o s i t i o n (l) The attention scores of the forthhead of the last PDS layer.Fig. 6. The visualization of the attention scores of the model LASO-big . Here, we show the ﬁrst four heads of the last layer of each module. And becausethe token sequence is long to visualize, we truncate the ﬁrst 15 tokens. All visualization is listed in the supplemental materials. attention patterns make the model fuse the representationsfrom various aspects; 2) the attention mechanism can learnmeaningful alignments in terms of the positional encodings;3) the two special ﬁller tokens and absorb the meaningless representations from the encoder; 4) speciﬁcattention patterns exist in the different heads of the decoder.These demonstrate that the PDS module attends speciﬁc acous-tic representations based on positions and the decoder captures

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12 (a) A scatter plot of the CER of each sentence vs. the length of asentence on AISHELL-1 test set. (b) A scatter plot of the CER of each sentence vs. the length of asentence on AISHELL-2 iPhone test set.Fig. 7. Scatter plots for analyzing the impact of the lengths of the sentences.The grey histograms represent the ratios of lengths of the sentence (i.e., thenumber of the tokens) on the training set. The area of a scatter shows thenumber of the examples at the that point. We use baseline1

Transformer ,baseline2

SAN-CTC , and

LASO-middle since they have similar model sizes.From the plots, we can see no signiﬁcant difference between the three models.The three models have the same CER for some sentences so that the scattersare overlapped. Note that an outlier in the second ﬁgure which CER is 100%since the reference of this sentence is wrong. the token relationship based on the self-attention mechanism.

B. The Impact of the Lengths of Sentences

Position parameters exist in the PDS module. This maycause that the performance relies on the length of a sentence.To check this point, we plot scatters of the CER of eachsentence vs. the length of a sentence in Fig. X. We can see no signiﬁcant difference among the three models.Fig. X also provides histograms of ratios of the differentlengths of training set. We can see that the distributionsapproximate Gaussian distribution as expected. And the trendof CERs is similar with the length distribution of the trainingset, i.e., the CERs of more sentences are zero in the middlepart in the ﬁgures. This is because the models are trainedwith more data with the middle lengths. All the three modelscan recognize the sentences with unseen lengths (or few-shot lengths) in the training set. However, the error rates are relatively higher than the sentences with seen lengths.This phenomenon is as expected. Because the three models(

Transformer , SAN-CTC , LASO ) are all whole-utteranceASR models, the length of the utterance is an implicit factorto train the models. This is different from the local-windowmodels based conventional hybrid models, i.e., time-delayneural networks (TDNN), CNNs, or latency-control BLSTM.LASO is more sensitive with the unseen lengths than othertwo models. Because the PDS module need to be trained withvarious lengths.In engineering practice, two methods can be adopted toaddress the unseen-length issue of the whole-utterance models:1) select data with various lengths to train the models; 2) usinga voice activity detection (VAD) system to cut long utterancesinto segments.XI. C

ONCLUSIONS AND F UTURE W ORK

This paper proposes a feedforward neural network basednon-autoregressive speech recognition model called LASO.The model consists of an encoder, a position dependentsummarizer, and a decoder. The encoder encodes the acousticfeature sequence into high-level representations. The PDSconverts the acoustic representation sequence to the token-level sequence. And the decoder further captures the token-level relationship. Because the prediction of each token doesnot rely on other tokens, and the whole model is feedforward,the parallelization of whole sentence prediction is realizable.Thus, the inference speed is much improved, compared withthe autoregressive attention-based end-to-end models. Further-more, we propose to reﬁne semantics from a large-scale pre-trained language model BERT to improve the performance.Experimental results show that LASO can achieve a com-petitive performance with much low recognition latency. Inthe future, we will improve the performance of LASO by thearchitecture and loss functions. We ﬁnd that LASO is trainedwith more epochs during training. We will try to ﬁnd strategiesto speed up training.XII. A

CKNOWLEDGMENT

The authors are grateful to the anonymous reviewers fortheir invaluable comments that improve the completeness andreadability of this paper.R

EFERENCES[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al. , “Deep neuralnetworks for acoustic modeling in speech recognition: The shared viewsof four research groups,”

IEEE Signal processing magazine , vol. 29,no. 6, pp. 82–97, 2012.[2] D. Yu and L. Deng,

AUTOMATIC SPEECH RECOGNITION.

Springer,2016.[3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,“Attention-based models for speech recognition,” in

Advances in neuralinformation processing systems , 2015, pp. 577–585.[4] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio,“End-to-end attention-based large vocabulary speech recognition,” in-ternational conference on acoustics, speech, and signal processing , pp.4945–4949, 2016.[5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: Aneural network for large vocabulary conversational speech recognition,”in . IEEE, 2016, pp. 4960–4964.

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13 [6] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in . IEEE, 2017, pp. 4835–4839.[7] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711 , 2012.[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen,A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al. , “State-of-the-artspeech recognition with sequence-to-sequence models,” in . IEEE, 2018, pp. 4774–4778.[9] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter,and H. Ney, “Rwth asr systems for librispeech: Hybrid vs attention,”

Proc. Interspeech 2019 , pp. 231–235, 2019.[10] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrencesequence-to-sequence model for speech recognition,” in . IEEE, 2018, pp. 5884–5888.[11] S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,”in

Proc. Interspeech 2018 , 2018, pp. 791–795. [Online]. Available:http://dx.doi.org/10.21437/Interspeech.2018-1107[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online andlinear-time attention by enforcing monotonic alignments,” in

Proceed-ings of the 34th International Conference on Machine Learning-Volume70 . JMLR. org, 2017, pp. 2837–2846.[13] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” 2018.[Online]. Available: https://openreview.net/pdf?id=Hko85plCW[14] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in

ICASSP 2019-2019 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP) .IEEE, 2019, pp. 5666–5670.[15] T. Hori, N. Moritz, C. Hori, and J. L. Roux, “Transformer-Based Long-Context End-to-End Speech Recognition,” in

Proc.Interspeech 2020 , 2020, pp. 5011–5015. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2928[16] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, “Self-attention transducers forend-to-end speech recognition,”

Proc. Interspeech 2019 , pp. 4395–4399,2019.[17] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronoustransformers for end-to-end speech recognition,” in

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and SignalProcessing (ICASSP) . IEEE, 2020, pp. 7884–7888.[18] Y.-A. Chung and J. Glass, “Speech2vec: A sequence-to-sequenceframework for learning word embeddings from speech,” in

Proc.Interspeech 2018 , 2018, pp. 811–815. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2341[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-trainingof deep bidirectional transformers for language understanding,” Jun.2019.[20] Y. Bai, J. Yi, J. Tao, Z. Tian, and Z. Wen, “Learn spelling from teachers:Transferring knowledge from language models to sequence-to-sequencespeech recognition,”

Proc. Interspeech 2019 , pp. 3795–3799, 2019.[21] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang, “Listen attentively,and spell once: Whole sentence generation via a non-autoregressivearchitecture for low-latency speech recognition,”

Proc. Interspeech 2020 ,2020.[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connection-ist temporal classiﬁcation: labelling unsegmented sequence data withrecurrent neural networks,” in

Proceedings of the 23rd internationalconference on Machine learning , 2006, pp. 369–376.[23] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speechrecognition using deep rnn models and wfst-based decoding,” in . IEEE, 2015, pp. 167–174.[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in

Advancesin neural information processing systems , 2017, pp. 5998–6008.[25] T. Q. Nguyen and J. Salazar, “Transformers without tears: Improvingthe normalization of self-attention,” arXiv preprint arXiv:1910.05895 ,2019.[26] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modelingwith gated convolutional networks,” in

Proceedings of the 34th Interna-tional Conference on Machine Learning-Volume 70 . JMLR. org, 2017,pp. 933–941. [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in

Proceedings of the IEEE conference on computer visionand pattern recognition , 2016, pp. 770–778.[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXivpreprint arXiv:1607.06450 , 2016.[29] C. Buciluˇa, R. Caruana, and A. Niculescu-Mizil, “Model compression,”in

Proceedings of the 12th ACM SIGKDD international conference onKnowledge discovery and data mining , 2006, pp. 535–541.[30] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size dnn withoutput-distribution-based criteria,” in

Fifteenth annual conference of theinternational speech communication association , 2014.[31] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neuralnetwork,” arXiv preprint arXiv:1503.02531 , 2015.[32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Ben-gio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550 ,2014.[33] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, “Non-autoregressive neural machine translation,” in

International Conferenceon Learning Representations , 2018.[34] Y. Wang, F. Tian, D. He, T. Qin, C. Zhai, and T.-Y. Liu, “Non-autoregressive machine translation with auxiliary regularization,” in

Proceedings of the AAAI Conference on Artiﬁcial Intelligence , vol. 33,2019, pp. 5377–5384.[35] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T.-Y. Liu, “Non-autoregressiveneural machine translation with enhanced decoder input,” in

Proceedingsof the AAAI Conference on Artiﬁcial Intelligence , vol. 33, 2019, pp.3723–3730.[36] J. Lee, E. Mansimov, and K. Cho, “Deterministic non-autoregressiveneural sequence modeling by iterative reﬁnement,” 2018.[37] X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy, “Flowseq: Non-autoregressive conditional sequence generation with generative ﬂow,”pp. 4273–4283, 2019.[38] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, “Listen and ﬁll in themissing letters: Non-autoregressive transformer for speech recognition,” arXiv preprint arXiv:1911.04908 , 2019.[39] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares,H. Schwenk, and Y. Bengio, “On using monolingual corpora in neuralmachine translation,” arXiv: Computation and Language , 2015.[40] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion:Training seq2seq models together with language models,” in

Proc.Interspeech 2018 , 2018, pp. 387–391. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1392[41] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, andL. Zettlemoyer, “Deep contextualized word representations,” Jun. 2018.[42] J. Shin, Y. Lee, and K. Jung, “Effective sentence scoring method usingbert for speech recognition,” in

Asian Conference on Machine Learning ,2019, pp. 1081–1093.[43] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara,“Distilling the knowledge of bert for sequence-to-sequence asr,”

Proc.Interspeech 2020 , 2020.[44] V. Lavrenko, R. Manmatha, J. Jeon et al. , “A model for learning thesemantics of pictures.” in

Nips , vol. 1, no. 2. Citeseer, 2003.[45] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotationand retrieval using cross-media relevance models,” in

Proceedings ofthe 26th annual international ACM SIGIR conference on Research anddevelopment in informaion retrieval , 2003, pp. 119–126.[46] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” in

Proceedings ofthe IEEE conference on computer vision and pattern recognition , 2017,pp. 3232–3240.[47] M. Fan, W. Wang, P. Dong, L. Han, R. Wang, and G. Li, “Cross-media retrieval by learning rich semantic embeddings of multimedia,” in

Proceedings of the 25th ACM international conference on Multimedia ,2017, pp. 1698–1706.[48] L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modalretrieval,” in

Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition , 2019, pp. 10 394–10 403.[49] Z. Yang, Z. Lin, P. Kang, J. Lv, Q. Li, and W. Liu, “Learning sharedsemantic space with correlation alignment for cross-modal event re-trieval,”

ACM Transactions on Multimedia Computing, Communications,and Applications (TOMM) , vol. 16, no. 1, pp. 1–22, 2020.[50] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kings-bury, “End-to-end asr-free keyword search from speech,”

IEEE Journalof Selected Topics in Signal Processing , vol. 11, no. 8, pp. 1351–1359,2017.[51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison,A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,

OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14

B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: Animperative style, high-performance deep learning library,” in

Advances in Neural Information Processing Systems , H. Wallach,H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, andR. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019, pp.8026–8037. [Online]. Available: https://proceedings.neurips.cc/paper/2019/ﬁle/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf[52] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,”in . IEEE, 2017, pp. 1–5.[53] J. Du, X. Na, X. Liu, and H. Bu, “Aishell-2: transforming mandarin asrresearch into industrial scale,” arXiv preprint arXiv:1808.10583 , 2018.[54] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks forconnectionist temporal classiﬁcation in speech recognition,” in

ICASSP2019 - 2019 IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP) , 2019, pp. 7115–7119.[55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”

Proc. Interspeech 2019 , pp. 2613–2617, 2019.[58] S. Sun, P. Guo, L. Xie, and M. Hwang, “Adversarial regularizationfor attention based end-to-end robust speech recognition,”

IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 27,no. 11, pp. 1826–1838, 2019.[59] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang,M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe,T. Yoshimura, and W. Zhang, “A comparative study on transformer vsrnn in speech applications,” in , Dec 2019, pp. 449–456.[60] Z. Fan, S. Zhou, and B. Xu, “Unsupervised pre-traing for sequence tosequence speech recognition,” arXiv preprint arXiv:1910.12418 , 2019.[61] F. Ding, W. Guo, L. Dai, and J. Du, “Attention-based gated scalingadaptive acoustic model for ctc-based speech recognition,” in

ICASSP2020 - 2020 IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP) , 2020, pp. 7404–7408.[62] S. Zagoruyko and N. Komodakis, “Paying more attention to attention:Improving the performance of convolutional neural networks via atten-tion transfer,” in

International Conference on Learning Representations ,2017.[63] J. H. Cho and B. Hariharan, “On the efﬁcacy of knowledge distillation,”in