Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

Qi Liu, Student Member, IEEE, Zhehuai Chen, Hao Li, Mingkun Huang, Yizhou Lu, and Kai Yu, Senior Member, IEEE
Abstract—End-to-end (E2E) systems have played a more and more important role in automatic speech recognition (ASR) and achieved great performance. However, E2E systems recognize output word sequences directly from the input acoustic features, so they can only be trained on limited acoustic data. Extra text data is widely used to improve the results of traditional artificial neural network-hidden Markov model (ANN-HMM) hybrid systems, but involving extra text data in standard E2E ASR systems may break the E2E property during decoding. In this paper, a novel modular E2E ASR system is proposed. The modular E2E ASR system consists of two parts: an acoustic-to-phoneme (A2P) model and a phoneme-to-word (P2W) model. The A2P model is trained on acoustic data, while extra data, including large-scale text data, can be used to train the P2W model. This additional data enables the modular E2E ASR system to model not only the acoustic part but also the language part. During the decoding phase, the two models are integrated and act as a standard acoustic-to-word (A2W) model. In other words, the proposed modular E2E ASR system can be easily trained with extra text data and decoded in the same way as a standard E2E ASR system. Experimental results on the Switchboard corpus show that the modular E2E model achieves a better word error rate (WER) than standard A2W models.
Index Terms—automatic speech recognition, connectionist temporal classification, attention-based encoder decoder
(This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102). The authors would like to thank Heinrich Dinkel and Rao Ma for English proofreading and editing. Corresponding author: Kai Yu. Qi Liu, Zhehuai Chen, Hao Li, Mingkun Huang, Yizhou Lu, and Kai Yu are with the SpeechLab, Department of Computer Science and Engineering, and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected].)

I. Introduction

Deep learning has been widely used in ASR systems. Traditional ASR with deep learning typically employs ANN-HMM hybrid systems [1], [2], [3]: the ANN predicts the posteriors of HMM states, while the HMM is trained separately to fit the long-term model. Hybrid systems have some disadvantages. First, forced alignment is needed to train the ANN part in the traditional ANN-HMM pipeline. Second, the short-term model (ANN) and the long-term model (HMM) are trained separately, so the knowledge each has learned is not shared with the other. Finally, a decoding mechanism such as the weighted finite-state transducer (WFST) [4] and lattice generation/search is needed during the test phase. However, with more powerful networks like long short-term memory (LSTM) [5] and convolutional neural networks (CNN) [6], E2E systems have the ability to model the word sequence directly from the acoustic features with only neural network computation.

Connectionist temporal classification (CTC) [7] is a widely used E2E algorithm. CTC adds a special blank label to model the intermediate frames. The blank label and the forward-backward algorithm enable CTC to convert unsegmented input sequences to variable-length output sequences. CTC-based systems have achieved good results in several sequence labeling tasks, including speech recognition [8] and handwriting recognition [9].

Sequence to sequence (S2S) [10] is another type of E2E model, which consists of two networks: an encoder and a decoder. The encoder models the input sequence as embedding vectors, and the decoder uses these vectors to generate the output sequence. In recent years, S2S models, especially attention-based S2S models, have achieved great success in both natural language processing [11] and speech recognition [12], [13].

The RNN-transducer [14] combines the advantages of both CTC and S2S models. It contains an encoder-decoder-like mechanism and a CTC-like criterion. Research has shown that the RNN-transducer can achieve competitive word error rates in speech recognition [15], [16].

Many researchers focus on E2E systems during the training phase. However, some works such as [9], [17] still involve extra decoding mechanisms like WFST. In this paper, we mainly focus on E2E systems during the decoding phase, i.e., the whole system performs like one neural network during test. We call this the A2W property or E2E property.

Extra text data can be used to train a strong language model and generate a WFST. It has been shown that extra text data incorporated via a WFST with a language model can significantly improve the performance of hybrid ASR systems. However, E2E systems decode the word sequence directly from acoustic features with only neural networks. This property gives E2E systems faster decoding speed [18] and the ability to store only neural network parameters, making it possible to deploy them on low-resource machines. Combining a WFST with an E2E ASR system naively would break this property. Therefore, many researchers work on how to add large-scale text data to E2E ASR systems. [19] joins CTC, an attention-based decoder, and an RNN language model together as one big joint decoder. [20] applies back-translation to convert text data to unsupervised acoustic data. [21] uses the phoneme sequence to train a multi-modal E2E system.

Currently, E2E systems based on phonemes, characters, or sub-words generated by byte pair encoding (BPE) [22] have achieved great performance. However, the desired output of ASR systems is a word sequence rather than sequences of these small units, so an additional procedure is still needed to combine the small units into words. Moreover, some Asian languages, including Chinese, Japanese, and Korean, are more difficult to split into such small units compared with Latin-based languages.
Therefore, in this paper, we investigate how to use extra text data to improve the performance of 'true' E2E models, i.e., A2W models. We use a novel modular E2E system [23]. The modular E2E system contains two networks: an acoustic-to-phoneme (A2P) network and a phoneme-to-word (P2W) network. During the training phase, the acoustic data is used to train both the A2P and P2W networks, and extra text data can be used to train the P2W network. During the decoding phase, the output of the A2P network is the input of the P2W network, which means the proposed model outputs a word sequence from the input acoustic feature sequence with only neural network calculations. This makes the whole modular system perform like a normal E2E system. Compared with traditional E2E systems, the proposed modular E2E system has three advantages. First, it performs like an A2W model during the decoding phase, without the need to store other components such as a WFST to decode the results. Second, the modular E2E system can be trained with extra text data while holding the E2E property; the most commonly used method to incorporate text data in A2W models is WFST [17], which violates the E2E property. Finally, the modular design makes the system easy to extend or adapt. In this paper, we give an example of out-of-vocabulary (OOV) word extension.

The rest of the paper is organized as follows. Section II briefly introduces the background of traditional E2E ASR systems. Section III proposes our modular E2E model. Section IV shows the implementation details of modular E2E systems. Finally, Section V demonstrates the experimental results and Section VI gives the conclusion.

II. Traditional E2E ASR Systems
A. Connectionist Temporal Classification
An E2E system is designed to solve the sequence labeling problem, which predicts a corresponding output sequence from a given input sequence. The difficulty of solving the sequence labeling problem with deep learning lies in the mismatch between the lengths of the input and output sequences. In practice, an input acoustic feature may contain hundreds of frames, while the corresponding word sequence only has about twenty words.

CTC [7] uses a special blank label to fill in the intermediate frames. Formally, the merge function is defined as

    $\beta : [V \cup \{\mathrm{blank}\}]^{*} \to V^{*}$,    (1)

where $V$ is the vocabulary of the output word sequence. The merge function first combines identical consecutive symbols and then removes all the blank symbols. For example,

    $\beta(a{-}{-}aab) = \beta({-}{-}a{-}ab) = \beta(a{-}abbb) = \beta(aa{-}aab) = aab$,

where $-$ means blank. In other words, this example shows the valid CTC alignments of the word sequence 'aab' with an input feature of length 6.

Let $x$ be the input feature sequence and $w$ be the corresponding word sequence. The CTC criterion, i.e., the probability $P(w|x)$, is the summation over all the possible CTC alignments obtained through the merge function $\beta$:

    $P(w|x) = \sum_{\pi \in \beta^{-1}(w)} P(\pi|x) = \sum_{\pi \in \beta^{-1}(w)} \prod_{t=1}^{T} P(\pi_t|x)$,    (2)

where $T = \mathrm{length}(\pi) = \mathrm{length}(x)$. Since $\pi$ and $x$ have the same length, $P(\pi_t|x)$ can be easily calculated by a neural network with a softmax output layer.

The number of alignments $\pi$ grows exponentially with the input sequence length, so $P(w|x)$ cannot be calculated by brute-force enumeration. However, the probability can be computed efficiently by the forward-backward algorithm [24], [25]. Figure 1 gives a brief overview of applying the algorithm with the CTC criterion; the details can be found in [7].

Fig. 1. The possible paths of the forward-backward algorithm [7]. The white circles denote optional blank.
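To make the merge function concrete, below is a minimal Python sketch of $\beta$ from Eq. (1), using '-' for blank; it reproduces the four example alignments above. The function name and symbol choices are illustrative, not from the paper.

```python
BLANK = "-"

def ctc_merge(path):
    """Apply the CTC merge function: collapse repeats, then drop blanks."""
    merged = []
    prev = None
    for symbol in path:
        if symbol != prev:      # combine identical consecutive symbols
            merged.append(symbol)
        prev = symbol
    return "".join(s for s in merged if s != BLANK)  # remove blanks

# The four alignments from the example above all merge to 'aab':
for path in ["a--aab", "--a-ab", "a-abbb", "aa-aab"]:
    assert ctc_merge(path) == "aab"
```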
B. Sequence to Sequence

S2S [10], also known as the encoder-decoder model, is another commonly used E2E method. CTC predicts a single word or blank for each frame of the input feature and then uses the merge function $\beta$ to derive the output word sequence. S2S, in contrast, treats the sequence labeling problem as a conditional language modeling problem. In other words, an S2S model is trained as a language model that generates output word sequences conditioned on input features.

Formally, the S2S model contains two networks: an encoder and a decoder. The encoder 'encodes' the input feature into a single compressed embedding vector. The decoder uses the embedding vector and 'decodes' the output word sequence as a conditional language model.

Let $x$ and $w$ be the input feature and the output word sequence. The criterion of S2S, i.e., the probability $P(w|x)$, is the product of the conditional probabilities of each single word by the chain rule:

    $P(w|x) = \prod_{i=1}^{N} P(w_i \mid x, w_{1:i-1})$,    (3)

where $N = \mathrm{length}(w)$. The conditional probability can be calculated by two neural networks:

    $h = \mathrm{encoder}(x)$,    (4)
    $P(w_i \mid x, w_{1:i-1}) = \mathrm{decoder}(h, w_{1:i-1})$.    (5)

Figure 2 illustrates the framework of the S2S model.

Fig. 2. An example in which the S2S model predicts 'WXYZ' with input 'ABC' [10]. The left part is the encoder network and the right part is the decoder network.

However, the S2S model compresses the input feature into a single embedding vector. When the input sequence is too long, some information, especially earlier parts of the input feature, might be lost, resulting in performance degradation. The attention mechanism [26] was proposed to utilize the information of the input features more efficiently. [12], [27] show that the attention mechanism can obtain a large improvement in the final WER. Therefore, an attention-based S2S model is used in our work.
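As an illustration of Eqs. (3)-(5), the following is a minimal PyTorch sketch of an attention-free encoder-decoder. All layer sizes and the single-layer LSTM choices are placeholders for illustration, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, w_in):
        # (h, c): the single compressed embedding of the input feature, Eq. (4)
        _, (h, c) = self.encoder(x)
        # Decode conditioned on the embedding and the word history, Eq. (5)
        dec_out, _ = self.decoder(self.embed(w_in), (h, c))
        return torch.log_softmax(self.out(dec_out), dim=-1)

model = Seq2Seq(feat_dim=36, vocab_size=100)
x = torch.randn(1, 50, 36)            # 50 frames of 36-dim features
w_in = torch.randint(0, 100, (1, 8))  # shifted word history w_{1:i-1}
log_probs = model(x, w_in)            # per-step P(w_i | x, w_{1:i-1}), Eq. (3)
```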
C. E2E ASR Systems

Currently, two types of E2E ASR systems are commonly used. The first type maps the acoustic feature to a sequence of small units such as phonemes, characters, or bigger sub-word units such as BPE [17], [28], [29], [30], [31], [32]. This type of E2E ASR system has a relatively small 'vocabulary' size (usually less than 100), which enables efficient training with lower memory consumption. In practice, these E2E systems are able to achieve good performance. However, models of this type usually need an additional module to convert the small-unit sequence into the corresponding word sequence. The additional module, such as a WFST [17] or language-model-rescoring-based beam search [30], may break the E2E property during the decoding phase. Moreover, a language model trained with extra text data is usually used to build a WFST separately from the neural network. This additional language model, especially as a WFST, might consume lots of resources, which makes the whole system hard to deploy on low-resource devices.

This paper mainly focuses on the other type, usually called A2W models [33], [34], [35], which directly map the acoustic feature to a word sequence. This type can decode the output word sequence with a single neural network. However, due to the large number of output units (usually more than ten thousand words) and the long acoustic feature length (usually about one thousand frames), systems of this type are harder and slower to train. Moreover, the large memory consumption makes some algorithms like the RNN-transducer hard to implement in a proper A2W system. Another problem is that A2W networks can only be trained with acoustic data; a language model trained with extra text data cannot be easily integrated with A2W models.
Fig. 3. The framework of the modular E2E system [23]: (a) the acoustic-to-phoneme module (CTC), (b) the phoneme-to-word module (CTC or S2S), (c) PSD-based joint training. The dashed-line square indicates the trained part.
Choosing which type of E2E ASR system to use is a trade-off. The decoding results of A2W models can be generated directly by the neural networks. This means that low-resource terminals like smartphones or automobiles only need to store the neural network parameters. Many smartphones also have a neural processing unit (NPU) in their system on a chip (SoC), which can speed up the calculation of neural networks. The other type can employ greedy-search decoding and word boundaries to achieve the A2W purpose [36], [37]. However, it abandons the simplicity of using extra text data to improve model performance, and acoustic data is harder to obtain than text data. The proposed modular E2E ASR system is instead designed to be optimized with both acoustic data and text data while the whole system performs like a single A2W neural network during the decoding phase, similar to [21], [38].

III. Modular E2E ASR System
In this section, the proposed modular E2E ASR system [23] is introduced. The modular E2E system consists of two networks: an A2P network and a P2W network. The A2P network can only be trained with acoustic data; it predicts the corresponding phoneme sequence from given acoustic features. Meanwhile, the P2W network translates the phoneme sequence to the desired word sequence and can be trained on both acoustic data and text data. Finally, the P2W network is fine-tuned on the acoustic data. Figure 3 shows the whole framework of the modular system.
A. A2P Network
The A2P network is trained on acoustic data; it predicts the posterior probability of the phoneme sequence $p$ given the acoustic feature sequence $x$, i.e.,

    $P(p|x) = \mathrm{A2P}(x)$.    (6)

In the modular system, the A2P network can be considered the counterpart of the acoustic model in the standard ANN-HMM hybrid setting. It recognizes the acoustic features and produces the corresponding phoneme sequence. In fact, it is a phoneme-based E2E ASR network, i.e., the first type of E2E ASR system mentioned in the previous section. This indicates that all the E2E optimization methods can be used to improve the performance of the A2P network.

Even though any E2E criterion could be used to train the A2P network, in this work only the vanilla CTC is used as its criterion. The reason is that the predicted phoneme sequence might contain many errors, so the whole posterior sequence is taken as the output of the A2P network. The CTC criterion can predict the posterior sequence based only on the input acoustic data, while the S2S criterion must predict the posterior sequence relying not only on the input acoustic data but also on the previously predicted phonemes. More precisely, for a given acoustic feature $x$, a phoneme sequence $p = (p_1, \ldots, p_T)$, and a trained A2P network, the CTC criterion is formulated as

    $P(p|x) = \prod_{i=1}^{T} \mathrm{A2P}(p_i \mid x)$.    (7)

Therefore, for any possible $p$, the posterior $P(p|x)$ can be calculated depending only on $x$. However, for the cross-entropy criterion used in S2S, we have

    $P(p|x) = \prod_{i=1}^{T} \mathrm{A2P}(p_i \mid x, p_{1:i-1})$.    (8)

Here, for any possible $p$, $P(p|x)$ depends not only on $x$ but also on $p$ itself. To provide more information, the output of the A2P network, i.e., the input of the P2W network, is the posterior over all possible phoneme sequences rather than a single best phoneme sequence. More precisely, let $V_p$ denote the phoneme vocabulary; the output of the A2P network is a $T \times |V_p|$ matrix representing the posterior sequence rather than a single length-$T$ phoneme sequence. However, as mentioned above, for a given acoustic feature $x$, S2S models cannot produce the $T \times |V_p|$ matrix, since the posterior they calculate relies not only on $x$ but also on $p$. Therefore, only CTC is used as the criterion for the A2P network.
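The argument above can be made concrete with a small sketch, under assumed shapes and sizes (the phoneme inventory size and layer widths below are illustrative): a CTC-trained A2P network emits the full $T \times |V_p|$ posterior matrix in one forward pass, because each frame's softmax depends only on $x$.

```python
import torch
import torch.nn as nn

class A2P(nn.Module):
    """CTC-style A2P sketch: acoustic frames in, per-frame phoneme posteriors out."""
    def __init__(self, feat_dim=36, num_phones=46, hidden=512):
        super().__init__()
        # num_phones is assumed to include the CTC blank label
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_phones)

    def forward(self, x):                        # (B, T, feat_dim)
        h, _ = self.blstm(x)
        return torch.softmax(self.proj(h), dim=-1)  # (B, T, |V_p|)

a2p = A2P()
posteriors = a2p(torch.randn(1, 1000, 36))  # the full T x |V_p| matrix at once
```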
B. P2W Network

The P2W network can be trained on both acoustic data and text data. The input of the P2W network is the posterior over all possible phoneme sequences $p$ and the output is the desired word sequence $w$, i.e.,

    $P(w|p) = \mathrm{P2W}(p)$.    (9)

The P2W network can be considered the language model part of the ANN-HMM hybrid system, although the generated word sequence depends on the input phoneme posterior. Compared with a traditional language model, which models $P(w)$, the P2W network is trained on $P(w|p)$. The P2W network and the language model are both trained on large-scale text data. However, the language model focuses on the unconditional internal relationship among the words in sentences, whereas P2W models learn not only the unconditional word distribution but also the phoneme-to-word dictionary. Additionally, the phoneme alignment, i.e., which phoneme belongs to which word, is also learned by P2W models. In other words, the P2W network of the modular system is a more powerful language model conditioned on given phoneme sequences. In this work, vanilla CTC and cross entropy with an attention-based S2S are used as the criterion; a sketch of the CTC version follows the list below. [9] shows that the implicit language model of CTC can beat some weak explicit language models, and the decoder of S2S has the same structure as a traditional LSTM language model. This indicates that the P2W network has the ability to model the extra text data. In general, the proposed P2W network solves three issues:
• completing the phoneme alignment automatically;
• predicting proper words in the dictionary from the given phoneme sequence;
• inferring proper words from the given word-sequence history.
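As a rough illustration of the CTC version, the following sketch feeds the phoneme posterior matrix through a linear projection into a BLSTM. The input projection layer and the exact sizes are our assumptions for illustration, since the paper does not spell out how the posterior rows enter the network; the layer count loosely follows the CTC P2W configuration in Section V.

```python
import torch
import torch.nn as nn

class P2W(nn.Module):
    """CTC-criterion P2W sketch: phoneme posteriors in, word posteriors out."""
    def __init__(self, num_phones=46, vocab_size=30275, hidden=1024):
        super().__init__()
        # Assumed input projection: each T' x |V_p| posterior row is treated
        # as a dense feature vector and projected before the BLSTM.
        self.proj = nn.Linear(num_phones, hidden)
        self.blstm = nn.LSTM(hidden, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for CTC blank

    def forward(self, phone_posteriors):                  # (B, T', |V_p|)
        h, _ = self.blstm(self.proj(phone_posteriors))
        return torch.log_softmax(self.out(h), dim=-1)     # per-frame word posteriors
```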
C. PSD Joint Training
Finally, the P2W network is fine-tuned on the acoustic data. In this phase, the A2P network is fixed. In practice, the length of the acoustic feature (usually about one thousand frames) is much longer than the length of its corresponding phoneme sequence (usually less than one hundred). Consequently, the posterior sequence predicted by the A2P network and the corresponding phoneme sequence have different information rates, so it is not suitable to directly use the phoneme posteriors output by the A2P network as the input of the P2W network. Down-sampling helps here. Phoneme synchronous decoding (PSD) [18] is a technique originally designed to speed up the decoding of CTC. It removes the blank frames in the phoneme posterior sequence, which greatly reduces the information rate without performance loss. In this work, PSD is employed as a down-sampling layer between the A2P and P2W networks. Given the A2P network output, i.e., the posterior sequence $p_1, \ldots, p_n$, where $p_i(t)$ represents the posterior of phoneme $t$ at frame $i$, we have

    $X = \{\, i \mid \log p_i(\mathrm{blank}) - \max_{t \neq \mathrm{blank}} \log p_i(t) < \lambda \,\}$,    (10)
    $\mathrm{PSD}(p_1, \ldots, p_n) = (p_{k_1}, \ldots, p_{k_m}), \quad k_i \in X$.    (11)

Here $\lambda$ is a pre-defined threshold. This means that PSD removes the frames of the A2P output that have a high posterior on the blank label. The advantages of using PSD include:
• adapting the different information rates among the input acoustic features, the intermediate phoneme sequence, and the predicted word sequence;
• removing unnecessary blank information from the A2P network output;
• speeding up the training of the P2W network.

With PSD, the whole system can be considered as

    $P(w|x) = \sum_{p} P(w|p)\, P(p|x)$    (12)
    $\qquad\;\; \approx \mathrm{P2W}(\mathrm{PSD}(\mathrm{A2P}(x)))$.    (13)

A sketch of the PSD layer is given below.
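The following is a small PyTorch sketch of the PSD layer of Eqs. (10)-(11); the blank index and default threshold value are assumptions for illustration (the experiments in Section V use $\lambda = 8$).

```python
import torch

def psd(posteriors, blank_id=0, lam=8.0):
    """posteriors: (T, |V_p|) phoneme posterior matrix from the A2P network."""
    log_p = torch.log(posteriors + 1e-10)
    blank = log_p[:, blank_id]
    non_blank = log_p.clone()
    non_blank[:, blank_id] = float("-inf")
    best_non_blank = non_blank.max(dim=1).values
    keep = (blank - best_non_blank) < lam   # the frame set X of Eq. (10)
    return posteriors[keep]                 # down-sampled sequence, Eq. (11)

# Typical composition of the two modules during decoding, as in Eq. (13):
# word_posteriors = p2w(psd(a2p(x)))
```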
D. Advantages of the Modular E2E ASR System

Compared with traditional HMM ASR systems, E2E systems can be deployed easily, because an E2E system only involves neural network calculations, which means only neural network parameters need to be stored. Besides, the neural network calculation can be accelerated by a GPU or NPU, making it more suitable for devices such as smartphones. However, one big obstacle for E2E ASR systems is that their performance heavily relies on the amount of acoustic data. Compared with expensive acoustic data, text data is easier to collect. Phoneme- or character-based E2E systems try to improve prediction accuracy with a WFST encoding a language model trained on additional text data, which weakens the E2E property compared with HMM ASR systems. Here, the proposed modular design splits the whole system into two parts: one trained only on acoustic data and one trained on text data. In the training phase, the modular system is split into an acoustic model and a language model. During the decoding phase, the whole system performs as a unified A2W model. In general, the modular E2E ASR system is an A2W model that can be easily trained with extra text data. Moreover, its modular design enables system extension or adaptation, since we can fine-tune or re-train the A2P and P2W networks separately. In the next section, the OOV word extension is given as an example.

IV. Implementation Details
This section presents the implementation details of the modular E2E ASR system. The OOV word extension is taken as an example to illustrate the extension capability of the proposed system.
A. Phoneme Sequence Generation
Large-scale text data is used to train the P2W network. However, the P2W network needs the phoneme sequence posterior as input, while the text data only contains word sequences. Therefore, the corresponding phoneme sequence of the extra text data must be generated. In this work, a word-to-phoneme dictionary is utilized. For polyphonic words, we randomly choose one pronunciation as the oracle one. For each word $w$ in the word sequence, the dictionary is used to look up its corresponding phoneme sequence $\mathrm{dict}(w)$; these are finally concatenated as the generated phoneme sequence. Formally, for a word sequence $w = (w_1, \ldots, w_N)$ in the extra text data, the generated phoneme sequence is

    $p = \mathrm{onehot}\big(\mathrm{concatenate}_{i=1}^{N}\; \mathrm{dict}(w_i)\big)$,    (14)

where $\mathrm{onehot}(\cdot)$ maps the phoneme sequence to its one-hot distribution form. The data pair $(p, w)$ is used to train the P2W network. A sketch of this generation step follows.
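A toy sketch of Eq. (14), using a hypothetical two-word lexicon: the concatenated phoneme string is turned into one-hot rows so the text data matches the posterior-matrix input of the P2W network.

```python
import torch

lexicon = {"the": ["dh", "ah"], "cat": ["k", "ae", "t"]}  # toy dictionary
phone_vocab = {"dh": 0, "ah": 1, "k": 2, "ae": 3, "t": 4}

def text_to_phone_posteriors(words):
    phones = [ph for w in words for ph in lexicon[w]]     # concatenate dict(w_i)
    onehot = torch.zeros(len(phones), len(phone_vocab))
    for i, ph in enumerate(phones):
        onehot[i, phone_vocab[ph]] = 1.0                  # one-hot "posterior" row
    return onehot

p = text_to_phone_posteriors(["the", "cat"])              # shape (5, 5)
```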
B. Text Data Initialization

The generated phoneme sequences can be used to train the P2W network with large-scale text data. However, during the decoding phase, the predicted phoneme posterior may be problematic, since it is calculated by an A2P network rather than generated from an oracle word sequence. The mismatch between the phoneme sequence predicted by the A2P network and the oracle phoneme sequence generated from the word sequence leads to performance degradation. In fact, the experiments show that training the P2W network only with oracle phoneme sequences leads to imprecise results.

To solve this problem, the oracle phoneme sequences generated from the extra text data are used to initialize the P2W network. After that, the P2W network is fine-tuned on predicted phoneme sequences with PSD. Let $(x_a, w_a)$ denote the feature sequences and word sequences of the acoustic data, and let $w_t$ denote the word sequences of the text data. The whole training work-flow is as follows:
1) Use the phoneme sequence generation method described above to generate the corresponding oracle phoneme posterior sequences $p_a^o$ and $p_t^o$.
2) Use the acoustic data $(x_a, p_a^o)$ to train the A2P network.
3) Use the large-scale text data $(p_t^o, w_t)$ to initialize the P2W network.
4) Generate the predicted phoneme posterior sequences $p_a = \mathrm{PSD}(\mathrm{A2P}(x_a))$.
5) Fine-tune the P2W network with the data $(p_a, w_a)$.

In this work, the A2P and P2W networks are not jointly trained, because only acoustic data could be used for joint training, and the A2P network is already trained on the acoustic data. Moreover, the acoustic data used to fine-tune the P2W network is down-sampled by PSD, which greatly reduces the number of frames and accelerates the fine-tuning.
C. OOV Words Extension

How to deal with OOV words is a crucial issue for ASR systems, especially for A2W E2E ASR systems. In both CTC and S2S systems, extending the vocabulary means adding the new words to the vocabulary and changing the dimension of the last softmax layer. Hence, to extend OOV words in a traditional A2W E2E ASR system, the whole neural network needs to be re-trained on acoustic data that contains the OOV words. The time and resource consumption of this re-training procedure is huge. Another problem is that OOV words are usually rare words: their frequency in the original acoustic data may be very low or even zero, which leads to little performance improvement after re-training. Some approaches [22] have been proposed to solve the above problems. However, in this paper, we mainly focus on A2W models, and [22] uses sub-words as the output of the model, which is not an A2W E2E system.

The proposed modular E2E system can be used to solve the above OOV word extension problem. The most time-consuming part is the training of the acoustic model, i.e., the A2P network. However, the trained A2P network can be directly used to decode the phonemes of OOV words; only the P2W network needs to be re-trained. Notably, the P2W network can be trained on text data alone. This means that to re-train the P2W network, the data that needs to be obtained is text data containing the OOV words rather than acoustic data. Compared to acoustic data, text data is easier and cheaper to obtain.

More precisely, let $D$ be the original acoustic data, i.e., $(p_a, w_a)$ in the above subsection, and let $A$ denote the extra text data containing the OOV words, where $A$ contains the corresponding phoneme sequences generated by the same phoneme sequence generation method described above. Three ways are proposed to re-train the P2W network of a normally trained modular E2E ASR system:
1) Directly fine-tuning: just use the extra text data $A$ to fine-tune the P2W network.
2) Alternative training: train the P2W network alternately between epochs on the original acoustic data $D$ and the extra text data $A$ (see the sketch after this list).
3) Multi-modal: this method [21] is only proposed for the S2S P2W network. An additional encoder is added to the P2W network. Samples from $D$ are encoded by the original encoder and samples from $A$ are encoded by the new encoder, while both are decoded by the original decoder.

More details on the OOV word extension of the modular E2E ASR system can be found in [39].
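For illustration, the alternative-training schedule can be sketched as follows; update_step is a hypothetical stand-in for one P2W gradient step, and only the epoch-level alternation between $D$ and $A$ is the point here.

```python
import itertools

def alternative_training(update_step, D, A, num_epochs=10):
    """Alternate whole epochs between acoustic-derived data D and OOV text data A."""
    for _, data in zip(range(num_epochs), itertools.cycle([D, A])):
        for batch in data:
            update_step(batch)   # one P2W update step on this batch

# Dummy usage: the "update" just records which batches were visited.
visited = []
alternative_training(visited.append, D=[1, 2], A=[3], num_epochs=4)
assert visited == [1, 2, 3, 1, 2, 3]   # epochs: D, A, D, A
```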
V. Experiments

A. Experimental Setup
The data corpus used for the experiments is Switchboard [40], which contains 300 hours of speech. The extracted acoustic feature is a 36-dimensional fbank over a 25 ms time window with a 10 ms frame shift. The neural networks are trained with Kaldi [41], PyTorch [42], and MXNet [43].

The extra large-scale text data is the transcription of the Fisher corpus, which contains more than 2M sentences and 22M words. Both the acoustic data and the text data use the same vocabulary of size 30275. The evaluation sets are swbd and callhm from the NIST eval2000 test set.

The baseline hybrid HMM model contains a 5-layer LSTM, where each layer contains 1024 memory cells and a 256-node projection layer. The last layer is a softmax over 8k clustered tri-phone states.

The standard modular E2E ASR system consists of two networks. The A2P network contains a 4-layer bidirectional LSTM, and each layer contains 1024 cells. The P2W network has two versions. The CTC version has a 3-layer bidirectional LSTM, and each BLSTM layer contains 1024 memory cells. The S2S version is composed of an encoder with a 5-layer bidirectional LSTM and a decoder with a 5-layer unidirectional LSTM; each layer of both the encoder and decoder networks has 700 memory cells. Finally, the default PSD threshold is 8.
B. Experimental Results of the Modular E2E ASR System

1) Different P2W Training Procedure:
In the standard modular E2E training procedure, the P2W network is trained twice: it is first initialized with large-scale text data and then fine-tuned on the predictions of the A2P network. Some experiments were conducted to show the necessity of the extra text data and the fine-tuning process. Table I shows the WER performance. It demonstrates that acoustic-data fine-tuning of the P2W network is necessary for modular E2E systems. Even though the P2W network achieves a very low WER of 1.4/2.5 with oracle phoneme sequences, it fails to capture the characteristics of the A2P network output and produces poor results. We also tried removing the extra text data and training the modular system with only the acoustic data; the whole system then degrades to a normal A2W model, and the performance is worse than that of the modular system trained with text data.
TABLE I
WER performance comparison among different training procedures. TDI refers to the text data initialization described in Section IV-B. Each cell gives the WER on the swbd and callhm test sets, respectively.

Training procedure   | CTC       | S2S
No extra text data   | 16.3/29.2 | 17.6/29.4
No fine-tuning       | 54.5/61.7 | 73.1/76.5
TDI                  | 15.5/27.6 | 16.8/29.4
2) Different PSD Threshold:
The PSD threshold [18] controls the number of frames of training data for P2W network fine-tuning. With more frames, the information is more complete, while the training is slower, and vice versa. Some experiments were conducted to check the influence of the PSD threshold on the performance of modular E2E models. Table II shows the results. They show that a large PSD threshold can slightly improve performance. However, a threshold that is too large or too small makes the P2W network more difficult to converge, especially for randomly initialized ones. This also demonstrates that extra text data not only improves the performance but also enhances the convergence of the P2W network.
TABLE II
WER performance comparison among different PSD thresholds. TDI means text data initialization, i.e., the use of extra text data.
3) Different Acoustic Models:
Other than the baseline BLSTM acoustic model, two weak acoustic models were trained to examine the effect of different acoustic models. The weak LSTM acoustic model contains a 5-layer LSTM; each LSTM layer has 1024 memory cells and 256 projection nodes [5]. The weak FSMN model contains an 8-layer FSMN [44] and a 2-layer DNN; each FSMN layer contains 1024 units and 256 projection nodes [5], while each DNN layer contains 1024 units. Each acoustic model is trained with a phoneme-based CTC criterion. The phoneme CTC system is directly decoded with an extra WFST encoding a language model. The word CTC system has the same structure as the phoneme CTC system except for the last softmax layer. The modular systems use the trained acoustic model as the A2P network. Table III shows the WER results.

TABLE III
WER performance comparison between different acoustic models. "Text" indicates whether the model uses extra text data; "A2W" indicates whether the model is an A2W model during the decoding phase. (Columns: Model | A2W | Text | WER.)

It is clear that modular systems can be improved by using better A2P models. This means that all the optimization methods used on normal E2E acoustic models could be applied to the A2P network, which may lead to performance improvements of the proposed modular system. We also observe that the performance of the modular S2S is slightly worse than that of the modular CTC model. The reason is that the output of the A2P network might contain errors. Since CTC assumes that each predicted word is independent of the others, an error only affects a small area; in S2S models, however, each word prediction depends on its predecessors, so errors accumulate and degrade the final performance. In fact, if we use the oracle phoneme sequence as the model input, S2S performs better than CTC. Table III also shows that the gap between the modular CTC and modular S2S systems becomes smaller with a better acoustic model.
4) Comparison Among Different Baselines:
We compared the WER performance among different ASR baselines, including the DNN-HMM hybrid system, the two types of traditional E2E systems, and the proposed modular system. Table IV shows the comparison results. It can be observed that E2E systems with small-unit outputs like characters or phonemes perform better than normal A2W models, and that this type of E2E system can easily be combined with a language model to improve its performance. Among A2W models, the proposed modular systems perform better than normal word CTC or S2S models; we believe the performance improvements come from the use of extra text data. Overall, the proposed modular E2E model obtains better performance from the extra text data than traditional A2W models, i.e., A2W models directly trained with CTC or S2S on word-level output units. Compared with the other type of E2E systems, based on characters or phonemes, the proposed modular systems achieve slightly worse performance while holding the A2W E2E property, i.e., the whole system performs as a single A2W neural network during the decoding phase.

[34] obtains better results by using multiple optimization methods. These methods, including speed perturbation, i-vector adaptation, phoneme network initialization, and CTC-S2S joint training, are not used in this work. We believe the proposed modular E2E ASR system could achieve better results with these optimization algorithms.

TABLE IV
WER performance comparison among different baselines. "Text" indicates whether the model uses extra text data; "A2W" indicates whether the model is an A2W model during the decoding phase. The models marked with an asterisk are trained by us. (Columns: Model | A2W | Text | WER.)

C. Experimental Results of OOV Word Extension

This subsection demonstrates that the modular design enables easy extension or adaptation, using OOV word extension as an example.

1) Extra Test Set: The data corpus used for the OOV word extension is almost the same as in the normal modular E2E experiments; the only difference is the evaluation set. Two data corpora are used: one is the in-domain eval2000 test set, which is the combination of swbd and callhm, and the other is the cross-domain dev93 test set from the WSJ corpus.

To investigate the performance gain of the OOV word extension, the full vocabulary $V_f$ is cut down to a small vocabulary $V_s$, where $V_s$ only contains words that occur more than 10 times in the training set. The baseline modular E2E models are trained with the small vocabulary $V_s$, and the OOV word extension models are trained with the vocabularies $V_{ev}$ and $V_{dev}$, which are the unions of $V_s$ and the corresponding vocabulary of each test set. Table V shows the OOV rate of each vocabulary. The extra text data is extracted from the Fisher corpus; not all the sentences in Fisher are used, only those containing OOV words.
TABLE V
The size and OOV rate (on the training, eval2000, and dev93 sets) of each vocabulary: $V_f$ (30275), $V_s$ (6805), $V_{ev}$ (7649), $V_{dev}$ (7627).

The test set is split into two subsets: in-vocabulary sentences (IVS) and out-of-vocabulary sentences (OOVS). If every word in a sentence appears in the vocabulary $V_s$, the sentence belongs to IVS; otherwise it belongs to OOVS.
2) Experimental Results of In-domain Test Set:
Table VI shows the WER performance on the in-domain eval2000 test set with OOV word extension. For the in-domain test set, the baseline system with the full vocabulary performs better than the system with the small vocabulary. It is also observed that, for the in-domain test set, the OOV words do not cause much performance degradation: the WER gap between IVS and OOVS is about 20% relative, and since the OOV rate is 3.33% for the baseline small-vocabulary models, the poor WER on OOVS has a small impact on the total WER.

For CTC models, directly fine-tuning and alternative training both outperform the small baseline models. The alternative training model even beats the large baseline model on OOVS. For S2S models, the gap between the small and large baseline models is smaller than for CTC, which indicates that S2S models are less influenced by OOV words than CTC models. Directly fine-tuning, which only uses the text data with OOV words, may increase the risk of over-fitting for S2S models; however, alternative training still performs slightly better than the small baseline system. Multi-modal is less effective than alternative training, especially on OOVS.

TABLE VI
WER performance comparison on the in-domain eval2000 test set with OOV word extension.

Model                   | Vocab size | All  | IVS  | OOVS
Modular CTC             | 30275      | 22.8 | 21.0 | 26.0
Modular CTC             | 6805       | 26.0 | 24.4 | 29.0
 + directly fine-tuning | 7649       | 24.6 | 23.0 | 27.5
 + alternative training | 7649       | 23.5 | 22.5 | 25.4
Modular S2S             | 30275      | 23.5 | 21.1 | 28.0
Modular S2S             | 6805       | 24.5 | 22.1 | 29.0
 + directly fine-tuning | 7649       | 26.8 | 24.0 | 32.1
 + alternative training | 7649       | 24.2 | 22.0 | 28.3
 + multi-modal          | 7649       | 24.3 | 21.7 | 29.2
3) Experimental Results of Cross-domain Test Set:
Table VII gives the results on the cross-domain dev93 test set from WSJ. Since no acoustic data from WSJ is used to train the models, the WER is much higher than other reported results. It can be observed that the WER on OOVS is much higher than on IVS. Given that the OOV rate is 15.2% for the small baseline systems and 6.4% for the big baseline systems, the overall WER is strongly influenced by the poor WER on OOVS. It is also observed that the baseline S2S performs poorly due to the predicted-word dependence during decoding.
TABLE VII
WER performance comparison on the cross-domain dev93 test set with OOV word extension.

Model                   | Vocab size | All  | IVS  | OOVS
Modular CTC             | 30275      | 39.1 | 26.2 | 40.8
Modular CTC             | 6805       | 36.4 | 18.4 | 38.7
 + directly fine-tuning | 7627       | 36.9 | 17.6 | 39.4
 + alternative training | 7627       | 30.3 | 17.8 | 31.9
Modular S2S             | 30275      | 43.8 | 22.6 | 46.5
Modular S2S             | 6805       | 41.3 | 20.4 | 44.0
 + directly fine-tuning | 7627       | 39.1 | 20.5 | 42.2
 + alternative training | 7627       | 35.6 | 18.9 | 37.8
 + multi-modal          | 7627       | 40.7 | 18.5 | 43.6
For the cross-domain experiments, using the full vocabulary does not work well. For CTC models, alternative training improves the performance on OOVS compared with both baseline systems, and it even obtains better results on IVS. For modular S2S models, directly fine-tuning and alternative training outperform the baseline systems on both IVS and OOVS. Lastly, multi-modal is still ineffective on OOVS.

Overall, using a small vocabulary degrades the performance. Extending the vocabulary to full size is beneficial for in-domain tasks but does not perform well for cross-domain tasks. The proposed modular E2E ASR system, however, can use extra text data to extend the OOV words and improve the performance. Regarding the three fine-tuning methods, only multi-modal is considered ineffective on OOVS; directly fine-tuning is useful on OOVS in almost every case; and alternative training is the best choice, although it needs more training time and is harder to converge.

VI. Conclusion
This paper proposes a modular training strategy for E2E ASR. In particular, the proposed method splits the E2E system into two parts: an A2P network and a P2W network. The P2W network can be trained with large-scale text data, which improves the WER performance. During the decoding phase, the two networks are combined and act as a single A2W network that holds the E2E property. Experiments on the 300-hour Switchboard corpus show that this novel approach outperforms naive A2W models and reaches the level of state-of-the-art A2W models [34] under the same training procedure. Besides, the modular design enables efficient revision of the whole system; the OOV word extension experiment provides an example. Future work includes:
• using other optimization methods, including speed perturbation [47], speaker adaptation [48], and GloVe initialization [49], to improve the A2P network;
• training the P2W network with other E2E criteria such as the CTC-S2S multitask method [50];
• trying other cross-domain experiments using the same method as in the OOV word extension.

References
[1] N. Morgan and H. Bourlard, "Continuous speech recognition using multilayer perceptrons with hidden Markov models," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1990, pp. 413–416.
[2] H. Bourlard and C. J. Wellekens, "Links between Markov models and multilayer perceptrons," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 12, pp. 1167–1178, 1990.
[3] N. Morgan and H. Bourlard, "Neural networks for statistical recognition of continuous speech," Proceedings of the IEEE, vol. 83, no. 5, pp. 742–772, 1995.
[4] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
[5] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 338–342.
[6] Y. M. Qian, M. X. Bi, T. Tan, and K. Yu, "Very deep convolutional neural networks for noise robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263–2276, 2016.
[7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 1764–1772.
[9] Q. Liu, L. J. Wang, and Q. Huo, "A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition," in Proceedings of International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 461–465.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Neural Information Processing Systems Conference (NIPS), 2014, pp. 3104–3112.
[11] Y. M. Cui, Z. P. Chen, S. Wei, S. J. Wang, T. Liu, and G. Hu, "Attention-over-attention neural networks for reading comprehension," in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2017, pp. 593–602.
[12] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[13] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.
[14] A. Graves, Sequence Transduction with Recurrent Neural Networks, arXiv:1211.3711, 2012.
[15] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[16] E. Battenberg, J. T. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. R. Liu, S. Satheesh, A. Sriram, and Z. Y. Zhu, "Exploring neural transducers for end-to-end speech recognition," in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 206–213.
[17] Y. J. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 167–174.
[18] Z. H. Chen, W. Deng, T. Xu, and K. Yu, "Phone synchronous decoding with CTC lattice," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 1923–1927.
[19] T. Hori, J. Cho, and S. Watanabe, End-to-end Speech Recognition with Word-based RNN Language Models, arXiv:1808.02608, 2018.
[20] T. Hayashi, S. Watanabe, Y. Zhang, T. Toda, T. Hori, R. Astudillo, and K. Takeda, Back-Translation-Style Data Augmentation for End-to-End ASR, arXiv:1807.10893, 2018.
[21] A. Renduchintala, S. Y. Ding, M. Wiesner, and S. Watanabe, "Multi-modal data augmentation for end-to-end ASR," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 2394–2398.
[22] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1715–1725.
[23] Z. H. Chen, Q. Liu, H. Li, and K. Yu, "On modular training of neural acoustics-to-word model for LVCSR," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4754–4758.
[24] L. Baum and J. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, no. 3, pp. 360–363, 1967.
[25] L. Baum and G. Sell, "Growth transformations for functions on manifolds," Pacific Journal of Mathematics, vol. 27, no. 2, pp. 211–227, 1968.
[26] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of International Conference on Learning Representations (ICLR), 2015.
[27] E. Variani, T. Bagby, E. McDermott, and M. Bacchiani, "End-to-end training of acoustic models for large vocabulary continuous speech recognition with TensorFlow," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 1641–1645.
[28] L. Lu, X. Zhang, and S. Renals, "On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5060–5064.
[29] Y. Miao, M. Gowayyed, X. Na, T. Ko, F. Metze, and A. Waibel, "An empirical exploration of CTC acoustic models," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2623–2627.
[30] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 2207–2211.
[31] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer," in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 193–199.
[32] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, "Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 3532–3536.
[33] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, "Direct acoustics-to-word models for English conversational speech recognition," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 959–963.
[34] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny, "Building competitive direct acoustics-to-word models for English conversational speech recognition," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4759–4763.
[35] C. Z. Yu, C. L. Zhang, C. Weng, J. Cui, and D. Yu, "A multistage training framework for acoustic-to-word model," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 786–790.
[36] Y. Z. He, T. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. H. Wu, R. M. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. Y. Chang, K. Rao, and A. Gruenstein, "Streaming end-to-end speech recognition for mobile devices," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6381–6385.
[37] J. Y. Li, R. Zhao, H. Hu, and Y. F. Gong, "Improving RNN transducer modeling for end-to-end speech recognition," in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 114–121.
[38] A. Sriram, H. Jun, S. Satheesh, and A. Coates, Cold Fusion: Training Seq2Seq Models Together with Language Models, arXiv:1708.06426, 2017.
[39] H. Li, Z. H. Chen, Q. Liu, Y. M. Qian, and K. Yu, "OOV words extension for modular neural acoustics-to-word model," in Proceedings of National Conference on Man-Machine Speech Communication (NCMMSC), 2019.
[40] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1992, pp. 517–520.
[41] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. M. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. M. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic Differentiation in PyTorch, 2017.
[43] T. Q. Chen, M. Li, Y. T. Li, M. Lin, N. Y. Wang, M. J. Wang, T. J. Xiao, B. Xu, C. Y. Zhang, and Z. Zhang, MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, arXiv:1512.01274, 2015.
[44] S. L. Zhang, C. Liu, H. Jiang, S. Wei, L. R. Dai, and Y. Hu, Feedforward Sequential Memory Networks: A New Structure to Learn Long-term Dependency, arXiv:1512.08301, 2015.
[45] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, "Advances in all-neural speech recognition," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4805–4809.
[46] S. Palaskar and F. Metze, "Acoustic-to-word recognition with sequence-to-sequence models," in Proceedings of IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 397–404.
[47] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 3586–3589.
[48] T. Tan, Y. M. Qian, M. F. Yin, Y. M. Zhuang, and K. Yu, "Cluster adaptive training for deep neural network," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4325–4329.
[49] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[50] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.