Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech
Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff
Amazon AWS AI
{sunkaral, ronanks}@amazon.com

Abstract
In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced alignment based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves consistent improvements over lexical-only baselines across all punctuation classes.

Index Terms: speech recognition, punctuation prediction, multimodal fusion, semi-supervised learning
1. Introduction
The output text generated by automatic speech recognition (ASR) systems is typically devoid of punctuation and sentence formatting. The lack of sentence segmentation and punctuation makes it difficult to comprehend the ASR output. For example, consider the two sentences: "Let's eat Grandma" vs. "Let's eat, Grandma!". Punctuation restoration not only helps in understanding the context of the text but also greatly improves readability. Punctuated text often helps in boosting the performance of several downstream natural language understanding (NLU) tasks.

There is a plethora of work on punctuation prediction over the past few decades. While some early methods of punctuation prediction used finite-state or hidden Markov models [1, 2], other techniques have investigated probabilistic models such as language modeling [3, 4, 5], conditional random fields (CRFs) [6, 7] and maximum entropy models [8]. As neural networks gained popularity, several approaches were proposed based on sequence labeling and neural machine translation [9]. These models widely used convolutional neural networks (CNNs) and LSTM-based architectures [10]. More recently, attention [11, 12] and transformer [13, 14, 15] based architectures, which have been successfully applied to a wide variety of tasks, have been shown to perform well for punctuation prediction.

Although punctuation prediction is a well explored problem in the literature, most of these improvements do not directly translate to all domains. In particular, punctuation prediction for conversational speech is not very well explored [16, 17, 15]. A number of approaches have been proposed that exploit acoustic features in addition to lexical features for the punctuation task, but they are rather limited and do not clearly address the gap in performance on ASR outputs. In this paper, we focus on a multimodal semi-supervised deep learning approach for punctuation prediction in conversational speech by leveraging pretrained lexical and acoustic encoders.
While several methodologies have used either text-only or acoustic-only information [18] for predicting punctuation, many studies show that combining both types of features yields the best performance [13, 19, 20, 21]. Acoustic features widely used in the literature include prosodic information such as pause duration, phone duration, and pitch-related values like fundamental frequency and energy. [20] shows that using acoustic information leads to increased recognition of full stops. In [21], a hierarchical encoder is used to encode per-frame acoustic features into word-level features, and the results show that incorporating acoustic features significantly outperforms purely lexical systems. However, when trained on a very large independent text corpus, the lexical system outperformed the multimodal system that was trained on parallel audio/text corpora. To mitigate this, the work in [13] introduced speech2vec embeddings, but these do not vary with respect to the acoustic context in the reference speech.

In general, we identify two potential shortcomings with the aforementioned multimodal systems. First, the training is still suboptimal due to the lack of large-scale parallel audio/text corpora. Secondly, the models trained on reference text transcripts do not perform that well on ASR outputs, although incorporating acoustic features reduces the gap to some extent.
In this work, we introduce a novel framework for multimodal fusion of lexical and acoustic embeddings for punctuation prediction in conversational speech. Specifically, we investigate the benefits of using lexical and acoustic encoders that are pretrained on large amounts of unpaired text and audio data using unsupervised learning. The key idea is to learn contextual representations through unsupervised training where substantial amounts of unlabeled data are available, and then improve the performance on a downstream task like punctuation, for which the amount of data is limited, by leveraging the learned representations. For multimodal fusion, we explore an attention mechanism to automatically learn the alignment of word-level lexical features and frame-level acoustic features in the absence of explicit forced alignments.

We also show the adaptation of our proposed multimodal architecture to the streaming use case by limiting the future context. We further study the effect of pretrained encoders with respect to varying data sizes and their performance when trained on very small amounts of data. Finally, we exploit N-best lists from ASR to perform data augmentation and reduce the gap in performance when tested on ASR outputs.
2. Semi-supervised learning architecture
This section introduces our proposed multimodal semi-supervised learning architecture (MuSe) for punctuation prediction. We pose the prediction task as a sequence labeling problem where the model outputs a sequence of punctuation labels given text and the corresponding audio. The architecture contains three main components: an acoustic encoder, a lexical encoder, and a fusion block that combines the outputs of both encoders. Figure 1 shows a schematic overview of the proposed approach.

The lexical encoder is pretrained on a large unlabelled text corpus for learning rich contextual representations and fine-tuned for the downstream task (i.e., weights are updated during punctuation model training). Given a sequence of input words $(x_1^l, x_2^l, ..., x_m^l)$, subwords $(s_1^l, s_2^l, ..., s_n^l)$ are extracted using a wordpiece tokenizer [22]. The resulting subwords are fed as input to the pretrained encoder, which outputs a sequence of lexical features $H^l = (h_1^l, h_2^l, ..., h_n^l)$ at its final layer.

The acoustic encoder takes the audio signal as input and outputs a sequence of frame-level acoustic embeddings $(x_1^a, x_2^a, ..., x_T^a)$. The acoustic encoder is pretrained on a large unlabelled audio corpus with the objective of predicting future samples from a given signal context. This unsupervised pretraining is based on the work of Schneider et al. [23]. After pretraining, we freeze the parameters of the acoustic encoder. The frame-level acoustic embeddings are then passed through a convolution layer followed by a uni-directional LSTM layer to learn task-specific embeddings $\tilde{H}^a = (\tilde{h}_1^a, \tilde{h}_2^a, ..., \tilde{h}_T^a)$.

Since the lexical and acoustic features differ in sequence length, it is not straightforward to concatenate them. Section 3 discusses two different approaches for aligning the acoustic feature sequence with the lexical sequence. Once we obtain the resulting aligned acoustic sequence $H^a = (h_1^a, h_2^a, ..., h_n^a)$, we concatenate the last-layer representations of the pretrained lexical encoder ($H^l$) with the outputs of the acoustic encoder ($H^a$) and feed them to a linear layer with softmax activation to classify over the punctuation labels, generating $(\hat{p}_1, \hat{p}_2, ..., \hat{p}_n)$ as outputs:

$\hat{p}_i = \mathrm{softmax}(W_k (h_i^l \oplus h_i^a) + b_k)$    (1)

where $W_k$ and $b_k$ denote the weights and bias of the linear output layer. The model is finetuned end-to-end to minimize the cross-entropy loss between the predicted distribution $\hat{p}_i$ and the targets $p_i$.
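To make the fusion step concrete, the following is a minimal PyTorch sketch of Equation (1): aligned lexical and acoustic states are concatenated and passed through a linear output layer, with cross-entropy computed on the resulting logits. The module name, dimensions, and tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate aligned lexical and acoustic features and classify punctuation.

    Sketch of Eq. (1): p_i = softmax(W_k (h_i^l (+) h_i^a) + b_k).
    Dimensions are illustrative; the paper uses a 768-dim lexical encoder
    and a 256-dim task-specific acoustic LSTM, with 4 punctuation classes.
    """
    def __init__(self, lex_dim=768, ac_dim=256, num_classes=4):
        super().__init__()
        self.classifier = nn.Linear(lex_dim + ac_dim, num_classes)

    def forward(self, h_lex, h_ac):
        # h_lex: (batch, n_subwords, lex_dim) from the pretrained lexical encoder
        # h_ac:  (batch, n_subwords, ac_dim) acoustic features aligned to sub-words
        fused = torch.cat([h_lex, h_ac], dim=-1)   # (batch, n, lex_dim + ac_dim)
        return self.classifier(fused)              # logits over punctuation labels

# Training minimizes cross-entropy between predicted logits and per-subword targets:
# loss = nn.CrossEntropyLoss()(logits.view(-1, 4), targets.view(-1))
```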
3. Multimodal fusion alignment
Several approaches have been proposed in the past for the fusion of acoustic features with a lexical encoder. Most of these approaches used word-level prosodic inputs concatenated with the lexical inputs or the outputs of the lexical encoder. In this section, we describe how we model frame-level acoustic features for fusion with a sub-word lexical encoder using two different approaches: force-aligned word durations and a sub-word attention model.
Figure 1: An overview of our multimodal semi-supervised learning architecture for punctuation prediction.

Some prosodic features like fundamental frequency and energy can be averaged across each word and used as input to the acoustic encoder. Similarly, word duration can also be used as a feature, and the work by [16] has shown minor improvements in punctuation prediction for conversational speech by employing relative word timing and word duration. However, such a mechanism does not capture acoustic context beyond the word, and it prevents the use of frame-level acoustic features, for which a word-level average vector is not meaningful.

For this reason, we model frame-level acoustic features using an LSTM-based acoustic encoder, where force-aligned word durations are used to obtain the final word boundaries. We then use the word boundaries to select the respective LSTM state outputs to form word-level features. We duplicate the same output for all the sub-words within each word.
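As an illustration of this word-boundary selection, the sketch below indexes frame-level LSTM outputs at each word's final frame and copies the selected state to that word's sub-words. The function name, tensor shapes, and the form of the alignment inputs are assumptions for illustration.

```python
import torch

def align_frames_to_subwords(lstm_out, word_end_frames, subword_to_word):
    """Select acoustic LSTM states at force-aligned word boundaries.

    lstm_out:         (T, ac_dim) frame-level acoustic encoder outputs
    word_end_frames:  list of final frame indices, one per word (from forced alignment)
    subword_to_word:  list mapping each sub-word position to its word index
    returns:          (n_subwords, ac_dim) word-boundary states duplicated over sub-words
    """
    word_states = lstm_out[torch.tensor(word_end_frames)]   # (n_words, ac_dim)
    return word_states[torch.tensor(subword_to_word)]       # (n_subwords, ac_dim)
```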
In the previous approach, we require force-aligned durations during training and word-level timestamps at inference time to form sub-word acoustic embeddings. While this is possible to achieve with conventional hybrid ASR systems, it may become an overhead when used in conjunction with end-to-end ASR systems such as LAS and Transformer [24, 25] (only in cases where punctuation is not modelled along with the phonetic unit). The use of force-aligned durations may also limit the acoustic context to a limited number of frames. For this reason, we introduce an attention module that uses scaled dot-product attention [26] to find the alignment between the acoustic feature sequence and the sub-word lexical sequence, operating on a query $Q$, key $\kappa$, and value $\vartheta$:

$\mathrm{Attention}(Q, \kappa, \vartheta) = \mathrm{softmax}\left(\frac{Q\kappa^T}{\sqrt{d_\kappa}}\right)\vartheta$    (2)

where $d_\kappa$ is the dimension of the keys. For the attention model, we use the same encoder architecture as shown in Figure 1. However, since attention may not require such a low-resolution input acoustic sequence, we downsample the input feature sequence by using a fixed stride of 2 in the 1-dimensional convolution layer. In the attention module, the key $\kappa$ and value $\vartheta$ are obtained from the LSTM state outputs. The sub-word encoder outputs are used as the query for this attention. The key $\kappa$ is obtained by using a projection layer whose weight matrix is $W_\kappa$:

$\kappa_i^a = f(W_\kappa, h_i^a)$    (3)

The attention mechanism computes the attention weight according to the similarity between the query $h_i^l$ and each key $\kappa_i^a$, and a weighted sum of the values is then obtained using the attention weights:

$\tilde{h}_i^a = \mathrm{Attention}(h_i^l, \kappa^a, h^a)$    (4)

The resulting aligned acoustic hidden vector is concatenated with the lexical encoder output and given as input to the softmax layer as explained in Section 2.
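A minimal sketch of this attention-based alignment (Equations 2-4) is given below: sub-word lexical states serve as queries, while the acoustic LSTM states supply the values and, via a learned projection $W_\kappa$, the keys. Dimensions and names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class AcousticAttentionAligner(nn.Module):
    """Align frame-level acoustic features to sub-word positions (Eqs. 2-4)."""
    def __init__(self, lex_dim=768, ac_dim=256):
        super().__init__()
        # kappa_i^a = f(W_kappa, h_i^a): project acoustic states into the query space
        self.key_proj = nn.Linear(ac_dim, lex_dim)

    def forward(self, h_lex, h_ac):
        # h_lex: (batch, n_subwords, lex_dim) sub-word queries
        # h_ac:  (batch, T_frames, ac_dim) acoustic LSTM states (keys via projection, values as-is)
        k = self.key_proj(h_ac)                                   # (batch, T, lex_dim)
        scores = torch.matmul(h_lex, k.transpose(1, 2))           # (batch, n, T)
        weights = torch.softmax(scores / math.sqrt(k.size(-1)), dim=-1)
        return torch.matmul(weights, h_ac)                        # (batch, n, ac_dim) aligned states
```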
4. Experiments
We conduct our experiments on the English Fisher corpus [27]. The training data consists of 348 hours of conversational telephone speech, whereas the dev and test sets each consist of around 42 hours. To prepare the data splits, we took a subset of the full Fisher corpus that only includes segments with a minimum length of six words. Punctuation classes in the Fisher corpus are highly unbalanced (see Table 1), which is typical for conversational speech. The Fisher corpus has separate time-annotated and punctuated transcripts. For the forced alignment fusion experiments, we need to compute the word boundary information from the time-annotated transcripts using a pretrained acoustic model. For this purpose, we trained a TDNN-LSTM acoustic model with the lattice-free Maximum Mutual Information (MMI) criterion using the Kaldi ASR toolkit [28], and the same model is used for obtaining ASR transcriptions on the test data. We restored punctuation marks such as periods, commas, and question marks to the time-annotated transcripts by aligning them with the corresponding punctuated transcripts.
Table 1: Distribution of punctuation classes in the Fisher corpus.

Class               Count       Percentage
No punctuation      2,962,489   80.58
Comma (,)           70,927      11.83
FullStop (.)        362,166     6.26
Question mark (?)   56,128      1.33
In addition to pretrained wav2vec features, we also experimented with two other prosodic features: pitch and melspec. The prosodic features are computed using a 25 ms frame window with a 10 ms frame shift. We extracted F0 features based on the Kaldi pitch tracker method [29], a highly modified version of the getf0 (RAPT) algorithm, using the Kaldi ASR toolkit [28]. Each frame is represented by 4-dimensional features consisting of: the probability of voicing (POV), i.e. the warped Normalized Cross Correlation Function (NCCF); normalized log pitch (the log-pitch with POV-weighted mean subtraction over a 1.5 second window); delta pitch (the time derivative of log-pitch); and raw log pitch. We also use 80-dimensional mel-scale spectrograms as an alternative to pitch features, as they have been shown to transfer prosody well in text-to-speech systems [30].

For unsupervised wav2vec feature extraction, we train a wav2vec-large model [23] on the 348-hour Fisher audio corpus (https://github.com/pytorch/fairseq/blob/master/examples/wav2vec). For training, we preprocess the audio files by splitting each file into separate files of 10 to 30 seconds in length. The model is trained with a contrastive loss objective and has 12 convolutional layers with skip connections. The output is a 512-dimensional unsupervised wav2vec representation.

Table 2: F1 scores for punctuation prediction using various acoustic features and two different fusion techniques. NP: No punctuation; FS: Fullstop; QM: Question mark.

Model   Fusion   Feat      NP     Comma   FS     QM
BLSTM   -        -         96.2   69.4    66.1   74.0
BERT    -        -         96.5   71.3    71.1   78.4
MuSe    FA       pitch     97.3   74.1    74.6   80.4
MuSe    FA       melspec   97.4   74.2    74.6   80.5
MuSe    FA       wav2vec   97.5   75.6    75.6   81.3
MuSe    Att      pitch     97.3   73.5    73.4   79.0
MuSe    Att      melspec   97.4   73.5    73.4   80.1
MuSe    Att      wav2vec   97.5   75.5    73.4   81.3
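For concreteness, the sketch below extracts 80-dimensional mel-spectrogram features with the 25 ms window and 10 ms shift described above, using librosa as an assumed stand-in for the actual feature pipeline (the pitch features in the paper come from the Kaldi pitch tracker, which is not reproduced here); the 8 kHz sampling rate reflects telephone speech.

```python
import librosa

def melspec_features(wav_path, sr=8000, n_mels=80):
    """80-dim mel-spectrogram with a 25 ms window and 10 ms shift (illustrative)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                    # FFT size; the 25 ms window is zero-padded
        win_length=int(0.025 * sr),   # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        n_mels=n_mels,
    )
    return librosa.power_to_db(mel).T  # (n_frames, 80)
```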
Our primary baseline model is a 4-layer BLSTM based on the work of Żelasko et al. [16], with each layer having 128 units in each direction. We also train another lexical-only model, a pretrained truncated BERT model [31] consisting of 6 transformer self-attention layers, each with a hidden size of 768. The proposed (MuSe) model consists of a lexical encoder which is a pretrained truncated BERT model. The acoustic encoder used for learning task-specific embeddings consists of a convolutional layer with kernel size 5 and an LSTM hidden layer of size 256. We use a learning rate of 0.00002 and a dropout of 0.1 for the truncated BERT and MuSe models. For all experiments with the pretrained lexical encoder, we use a subword vocabulary of size 28k.
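As an illustration of the lexical encoder configuration, a 6-layer, 768-dimensional truncated BERT model could be instantiated as below; the HuggingFace transformers API and the exact 28k vocabulary value are assumptions rather than the authors' code.

```python
from transformers import BertConfig, BertModel

# Truncated BERT lexical encoder: 6 self-attention layers, hidden size 768,
# ~28k-subword vocabulary (values taken from the experimental setup above).
config = BertConfig(
    vocab_size=28000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
)
lexical_encoder = BertModel(config)
# In practice the weights would be initialized from the bottom layers of a
# pretrained BERT checkpoint rather than trained from scratch.
```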
5. Results
First, we compare the performance of our proposed multimodal architecture (MuSe) with the baseline BLSTM model and the lexical-only BERT model (see Table 2). As expected, the purely lexical BERT model outperformed the BLSTM on all punctuation marks. We notice significant improvements (5% and 4%) in Fullstop and Question Mark under the F1 metric. This indicates that fine-tuning a pretrained lexical encoder for the punctuation task outperforms recurrent models trained from scratch, consistent with several other downstream tasks that are fine-tuned with BERT [31, 32, 33]. (Although the lexical encoder could be further pretrained on Fisher data, we did not investigate this in this paper.)

We now compare the lexical-only models to multimodal fusion models trained on three different features: pitch, melspec and wav2vec. Overall, we observe that using any kind of acoustic information helped in improving punctuation prediction across all three classes (Fullstop, Comma and Question Mark). This indicates that the fusion of acoustic features is still beneficial in conjunction with state-of-the-art pretrained lexical encoders, as they model different aspects of punctuation.

Among the acoustic features, pitch and melspec show similar performance improvements, except on Question Mark when attention is used for fusion. This is understandable given that both pitch and melspec features are extracted from audio using signal processing techniques and have been used as prosodic features in the past [30]. Unsupervised wav2vec features proved to be the best among all acoustic features for multimodal fusion, and their performance is significantly better. Forced alignment (FA) fusion performs slightly better than attention (Att) based fusion for Fullstop, while the performance is similar on Comma and Question Mark. We hypothesize that this is because providing explicit acoustic information through duration labels helps better prediction of full stops, as opposed to implicit learning through the attention mechanism. Although our results are not directly comparable with the results provided in [16] on the Fisher corpus (as the splits are different), we achieved better performance on all classes of punctuation.
We have conducted experiments to study the real-time (streaming) performance of the proposed model when there is no future context. For this experiment, we apply upper-triangular masking (similar to the transformer [26] decoder layers) in the self-attention layers of the lexical encoder to mask out the right-side context. Since the acoustic encoder is unidirectional, we did not make any further changes. For the experimental results presented in Table 3, we used forced alignment as the fusion technique and wav2vec as the acoustic features. We also trained an additional 4-layer LSTM model for baseline comparison. Similar to the bidirectional models, the pretrained BERT model performs ∼2% better than the LSTM model, showing that pretraining also helps in learning better representations in the absence of right-side context. We also observe that adding acoustic information leads to an additional ∼3% improvement over the lexical BERT model, confirming the effectiveness of our proposed approach for the streaming use case.
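A minimal sketch of the upper-triangular masking used for the streaming variant is shown below, assuming an additive mask applied to the lexical encoder's self-attention scores before the softmax.

```python
import torch

def causal_attention_mask(seq_len):
    """Additive mask that blocks attention to future (right-side) positions."""
    # The strict upper triangle is set to -inf so that, after softmax, each
    # position can only attend to itself and earlier positions.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Example: scores = scores + causal_attention_mask(scores.size(-1))
```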
Table 3: F1 scores for punctuation prediction (streaming).

Model   NP     Comma   FS     QM
LSTM    95.4   67.6    68.7   72.3
BERT    95.6   68.5    69.7   75.1
MuSe    96.3   72.1    72.3   78.6
We perform experiments on varying data sizes to study the effectiveness of our proposed approach. We experimented with data sizes of 1 hour, 10 hours and 100 hours, and the results are reported in Table 4. For this study, we compared the lexical-only models (BLSTM and BERT) with our best performing model (MuSe) from Table 2. As expected, the performance of all three approaches improved with the increase in data size, on both reference and ASR outputs. However, our proposed model (MuSe) fared significantly better than the two lexical models when trained on smaller datasets (1 hour and 10 hours), and it performed very well on reference transcripts when compared with the BLSTM models. However, the gap was significantly reduced when tested on ASR outputs. Our proposed model (MuSe) performed better than the lexical BERT model due to the fusion of acoustic features, but the performance gap on ASR outputs is still quite evident. This shows that although pretrained models perform well on reference transcripts with smaller datasets, they are not yet robust to ASR errors. This is due to the fact that pretrained masked language models like BERT were trained only on reference transcripts and have not seen the grammatical errors that are introduced by ASR.

Table 4: F1 scores for punctuation on varying data sizes.

                  Comma         FS            QM
Model   Hours     Ref    ASR    Ref    ASR    Ref    ASR
BLSTM   1         49.7   49.8   43.9   44.3   29.7   28.0
BERT    1         54.5   54.0   51.2   50.8   42.5   35.7
MuSe    1         58.8   57.4   55.2   55.2   48.3   40.1
BLSTM   10        60.9   58.1   53.5   51.9   46.3   42.4
BERT    10        65.6   62.0   62.5   58.9   61.6   57.8
MuSe    10        68.2   63.5   65.9   62.9   72.9   60.8
BLSTM   100       68.9   62.8   66.1   63.4   72.8   65.6
BERT    100       70.1   65.5   69.3   64.8   75.8   68.6
MuSe    100       71.6   66.2   70.8   65.7   77.1   69.5
We have seen that models trained on reference transcripts do not perform that well when tested on ASR outputs. To make the models more robust against ASR errors, we perform data augmentation with ASR outputs for training [15]. For punctuation restoration, we use an edit distance measure to align the ASR hypothesis with the reference punctuated text and transfer the punctuation from each word in the reference transcription to the hypothesis. If there are words that are punctuated in the reference but were deleted in the ASR hypothesis, we restore the punctuation to the previous word. We performed experiments with data augmentation using N-best lists, and the results are reported in Table 5.

From the results, it is evident that the models trained purely on reference transcripts were outperformed by models trained on augmented text (both reference and ASR outputs). The last three rows of Table 5 indicate that the data augmentation approach yielded better performance on all classes of punctuation. Overall, data augmentation with 3-best lists gave the best performance. Question Mark improved by 6% in F1 score with data augmentation. The improvement might be due to the increased number of training examples in the augmented data.
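The sketch below illustrates this restoration procedure, using difflib's sequence alignment as an assumed stand-in for the edit-distance alignment: punctuation on matched reference words is carried over to the hypothesis, and punctuation on reference words deleted by the ASR system is attached to the previous hypothesis word.

```python
import difflib

PUNCTUATION = ",.?"

def restore_punctuation(ref_words, hyp_words):
    """Carry trailing punctuation from a punctuated reference to an ASR hypothesis.

    ref_words: punctuated reference tokens, e.g. ["okay,", "so", "yeah."]
    hyp_words: unpunctuated ASR tokens,     e.g. ["okay", "so"]
    Punctuation on reference words deleted by the ASR system is attached to
    the previous hypothesis word, as described above.
    """
    strip = lambda w: w.rstrip(PUNCTUATION)
    hyp_punct = list(hyp_words)
    matcher = difflib.SequenceMatcher(
        a=[strip(w).lower() for w in ref_words],
        b=[w.lower() for w in hyp_words],
    )
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        for offset, i in enumerate(range(i1, i2)):
            punct = ref_words[i][len(strip(ref_words[i])):]   # trailing punctuation, if any
            if not punct:
                continue
            target = j1 + offset
            if target >= j2:          # reference word has no counterpart (deletion)
                target = j2 - 1       # fall back to the previous hypothesis word
            if 0 <= target < len(hyp_punct):
                hyp_punct[target] += punct
    return hyp_punct

# restore_punctuation(["okay,", "so", "yeah."], ["okay", "so"])
# -> ["okay,", "so."]   (the full stop of the deleted "yeah." moves to the previous word)
```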
Table 5: Comparison of F1 scores for punctuation with models trained on reference transcripts and ASR augmented data.

Model       n-best   NP     Comma   FS     QM
BLSTM-Ref   -        94.5   63.5    63.8   66.7
BERT-Ref    -        95.2   65.5    64.1   68.3
MuSe-Ref    -        95.6   67.2    66.7   70.6
MuSe-ASR    1-best   95.8   68.5    69.0   75.7
MuSe-ASR    3-best   95.6   69.0    69.5   76.4
MuSe-ASR    5-best   95.5   67.3    66.0   76.3
6. Conclusions
We introduced a novel multimodal semi-supervised learning framework which leverages large amounts of unlabelled audio and text data for punctuation prediction. We proposed an alternative attention-based multimodal fusion mechanism which is effective in the absence of forced-alignment word durations. Through our data-size ablation study, we showed that our proposed model is superior in performance to lexical-only models on reference transcripts. To address the performance gap on ASR outputs, we presented a robust model that is less affected by ASR errors by performing data augmentation with N-best lists.

7. References

[1] Y. Gotoh and S. Renals, "Sentence boundary detection in broadcast speech transcripts," 2000.
[2] H. Christensen, Y. Gotoh, and S. Renals, "Punctuation annotation using statistical prosody models," in ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding, 2001.
[3] A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, and Y. Lu, "Automatic detection of sentence boundaries and disfluencies based on recognized words," in Fifth International Conference on Spoken Language Processing, 1998.
[4] D. Beeferman, A. Berger, and J. Lafferty, "Cyberpunc: A lightweight punctuation annotation system for speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2. IEEE, 1998, pp. 689–692.
[5] A. Gravano, M. Jansche, and M. Bacchiani, "Restoring punctuation and capitalization in transcribed speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2009, pp. 4741–4744.
[6] W. Lu and H. T. Ng, "Better punctuation prediction with dynamic conditional random fields," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010, pp. 177–186.
[7] N. Ueffing, M. Bisani, and P. Vozila, "Improved models for automatic punctuation prediction for spoken and written text," in Interspeech, 2013, pp. 3097–3101.
[8] J. Huang and G. Zweig, "Maximum entropy model for punctuation annotation from speech," in Seventh International Conference on Spoken Language Processing, 2002.
[9] A. Öktem, M. Farrús, and L. Wanner, "Attentional parallel RNNs for generating punctuation in transcribed speech," in International Conference on Statistical Language and Speech Processing. Springer, 2017, pp. 131–142.
[10] V. Pahuja, A. Laha, S. Mirkin, V. Raykar, L. Kotlerman, and G. Lev, "Joint learning of correlated sequence labelling tasks using bidirectional recurrent neural networks," arXiv preprint arXiv:1703.04650, 2017.
[11] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in Interspeech, 2016, pp. 3047–3051.
[13] J. Yi and J. Tao, "Self-attention based model for punctuation prediction using word and speech embeddings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7270–7274.
[14] B. Nguyen, V. B. H. Nguyen, H. Nguyen, P. N. Phuong, T.-L. Nguyen, Q. T. Do, and L. C. Mai, "Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging," arXiv preprint arXiv:1908.02404, 2019.
[15] M. Sunkara, S. Ronanki, K. Dixit, S. Bodapati, and K. Kirchhoff, "Robust prediction of punctuation and truecasing for medical ASR," in Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, 2020, pp. 53–62.
[16] P. Żelasko, P. Szymański, J. Mizgajski, A. Szymczak, Y. Carmiel, and N. Dehak, "Punctuation prediction model for conversational speech," arXiv preprint arXiv:1807.00543, 2018.
[17] Ł. Augustyniak, P. Szymanski, M. Morzy, P. Zelasko, A. Szymczak, J. Mizgajski, Y. Carmiel, and N. Dehak, "Punctuation prediction in spontaneous conversations: Can we mitigate ASR errors with retrofitted word embeddings?" arXiv preprint arXiv:2004.05985, 2020.
[18] A. Moró and G. Szaszák, "A prosody inspired RNN approach for punctuation of machine produced speech transcripts to improve human readability," IEEE, 2017, pp. 000219–000224.
[19] O. Klejch, P. Bell, and S. Renals, "Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches," in IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 433–440.
[20] H. Christensen, Y. Gotoh, and S. Renals, "Punctuation annotation using statistical prosody models," in ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding, 2001.
[21] O. Klejch, P. Bell, and S. Renals, "Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5700–5704.
[22] M. Schuster and K. Nakajima, "Japanese and Korean voice search," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 5149–5152.
[23] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[24] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," arXiv preprint arXiv:1508.01211, 2015.
[25] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[27] C. Cieri, D. Miller, and K. Walker, "The Fisher corpus: A resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69–71.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[29] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2494–2498.
[30] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," arXiv preprint arXiv:1803.09047, 2018.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[32] Q. Chen, Z. Zhuo, and W. Wang, "BERT for joint intent classification and slot filling," arXiv preprint arXiv:1902.10909, 2019.
[33] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, "End-to-end open-domain question answering with BERTserini," arXiv preprint arXiv:1902.01718, 2019.