Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript
Mukuntha Narayanan Sundararaman†, Ayush Kumar, Jithendra Vepa
Observe.AI
† Work done during internship.
ABSTRACT
Recent years have witnessed significant improvement in the ability of ASR systems to recognize spoken utterances. However, recognition is still a challenging task for noisy and out-of-domain data, where substitution and deletion errors are prevalent in the transcribed text. These errors significantly degrade the performance of downstream tasks. In this work, we propose a BERT-style language model, referred to as PhonemeBERT, that learns a joint language model over the phoneme sequence and the ASR transcript to obtain phonetic-aware representations that are robust to ASR errors. We show that PhonemeBERT can be used on downstream tasks using phoneme sequences as additional features, and also in a low-resource setup where only ASR transcripts are available for the downstream tasks, with no phoneme information. We evaluate our approach extensively by generating noisy data for three benchmark datasets, Stanford Sentiment Treebank, TREC and ATIS, for sentiment, question and intent classification tasks respectively. The proposed approach comprehensively beats the state-of-the-art baselines on each dataset.
Index Terms— language modelling, phoneme, ASR
1. INTRODUCTION
With the proliferation of voice-enabled technologies, spoken language understanding (SLU) has become increasingly ubiquitous. The general modus operandi of SLU systems is to convert voice into text using an ASR engine and apply Natural Language Understanding (NLU) to the transcribed text to comprehend the speaker's intents and requests. Despite advancements in ASR systems, domain adaptation and word recognition in noisy setups remain a big challenge. Proper nouns, out-of-vocabulary words and phonetically confusable words are often mistranscribed to common words. In a typical SLU system that operates on the ASR outputs, these errors degrade the performance of the system on the downstream tasks [1]. A variety of approaches have been employed to address the errors in the ASR system. These methods can be broadly categorized into four groups:
a) Modelling word confidence: Liu et al. [2] propose a BERT model that jointly encodes the word confusion network and the dialogue context. Ladhak et al. [3] propose LatticeRNN to encode the ambiguities of the ASR recognition for the intent classification task;
b) ASR correction: Weng et al. [4] present contextual language correction on ASR outputs, jointly modelled with the LU task, that learns from ASR n-best transcriptions. Mani et al. [5] use a machine translation technique for domain adaptation to correct ASR mistakes in medical conversations;
c) End-to-end SLU: Serdyuk et al. [6] explore the possibility of extending end-to-end ASR learning to include the NLU component and optimizing the whole system for the SLU task, while Ghannay et al. [7] study end-to-end named entity and semantic concept extraction from speech to circumvent the errors arising from the ASR pipeline;
d) Phoneme-enhanced representations: Yenigalla et al. [8] use word2vec-generated phoneme embeddings for an emotion recognition task. Fang et al. [9] propose word2vec-based approaches to learn phoneme embeddings that capture pronunciation similarities between phonemes, making classification robust to ASR errors.
This work falls in the category of additionally utilizing phoneme information to learn representations that are more robust to ASR transcription errors. The phoneme information can be complementary to the ASR transcript, especially when the speech is noisier. To learn representations, earlier research [8, 9] trained phoneme and ASR embeddings independently of each other for SLU tasks. Moreover, Fang et al. [9] derive the phoneme sequence on top of the ASR transcripts. This method propagates errors in the word sequence to the phoneme sequence and hence does not offer the needed complementary information in the two sequences. We, instead, propose using phoneme sequences extracted directly from the audio and learning representations in a pre-training methodology that jointly models the phoneme and ASR sequences. The main contributions of this work are: 1) a method to jointly model ASR transcripts and phoneme sequences using a BERT-based pre-training setup; 2) extensive experiments on benchmark datasets showing the method's effectiveness; the results show that PhonemeBERT can leverage phoneme sequences as additional features, making it robust to ASR errors; 3) PhonemeBERT learns phonetic-aware representations that perform well even in the absence of phoneme information during downstream task fine-tuning, with only noisy ASR transcripts available.
Fig. 1. Proposed architecture of PhonemeBERT that jointly models the phoneme sequence and ASR transcript.
2. PROPOSED METHODOLOGY
In the proposed model, we aim to learn representations that are more robust to ASR errors. To accomplish this objective, we utilize the phoneme sequence in addition to the word sequence to train a joint language model (LM),
PhonemeBERT, initialized with weights from RoBERTa [10]. RoBERTa is a robustly optimized pre-trained LM that uses masked language modelling (MLM) as the pre-training objective. We propose to extend this setup to pre-train an LM on phoneme sequences with the MLM objective. Instead of using individual MLM tasks for word and phoneme sequences, we propose joint modelling by training the model on a parallel corpus of words and corresponding phoneme sequences. The training sequence in the proposed model contains the ASR transcript concatenated with the phoneme sequence (Figure 1). The position embeddings for the phoneme sequence again start at 0 to enable easier alignment. Following the training regime of RoBERTa, we randomly select 15% of the BPE tokens from the word and phoneme sequences for prediction. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token of the same type (word or phoneme). To predict a token masked in the word sequence, the model can attend either to the surrounding word tokens or to the phoneme sequence, encouraging the model to align the word and phoneme representations and making the word representations more phonetic-aware. The joint modelling of words and phonemes helps the model leverage the phoneme context when the word context is insufficient to infer the masked token in the word sequence, and vice versa. To train the model, we utilize Byte Pair Encoding (BPE) to represent the vocabulary of words and phonemes. BPE has been shown to be an effective method to handle large vocabularies with sub-word units. It has also been shown to work well with sub-words based on phonemes [11].
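As an illustration of the joint input construction and masking described above, the following minimal sketch builds a concatenated word-plus-phoneme example and applies the 80/10/10 masking per token type. It is a simplified outline under our own assumptions; the helper names, plain-Python token lists and vocabulary handling are illustrative, not the actual training code.

```python
import random

MASK_TOKEN = "[MASK]"
WORD, PHONEME = 0, 1


def build_joint_example(word_bpe, phoneme_bpe):
    """Concatenate ASR-transcript BPE tokens with phoneme BPE tokens.

    Position ids restart at 0 for the phoneme segment to ease alignment,
    and a type id marks whether a token comes from the 'word' or the
    'phoneme' sequence (hypothetical helper; names are illustrative).
    """
    tokens = list(word_bpe) + list(phoneme_bpe)
    positions = list(range(len(word_bpe))) + list(range(len(phoneme_bpe)))
    types = [WORD] * len(word_bpe) + [PHONEME] * len(phoneme_bpe)
    return tokens, positions, types


def mask_for_mlm(tokens, types, word_vocab, phoneme_vocab, rate=0.15):
    """RoBERTa-style masking over the joint sequence: select ~15% of tokens
    for prediction; of those, 80% -> [MASK], 10% -> a random token of the
    same type (word or phoneme), and 10% are left unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, (tok, typ) in enumerate(zip(tokens, types)):
        if random.random() >= rate:
            continue
        labels[i] = tok                       # this position is predicted
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_TOKEN
        elif r < 0.9:
            vocab = word_vocab if typ == WORD else phoneme_vocab
            inputs[i] = random.choice(vocab)  # random token of the same type
        # else: keep the original token (remaining 10%)
    return inputs, labels
```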
We use a byte-level BPE vocabulary as in RoBERTa [10], which allows using spaces (pauses in the phoneme sequence) as part of the BPE tokens. To represent phoneme sequences, we train our model with a vocabulary consisting of 600 phoneme sub-word units. Finally, we represent each BPE token using three embeddings: token, position and type embeddings. Token embeddings for the word sequence are initialized from the RoBERTa-base model, while token embeddings for the phoneme sequence are initialized randomly. Position embeddings are initialized from the RoBERTa-base model. In addition to the token and position embeddings, we randomly initialize a type embedding denoting whether a token comes from the 'word' or the 'phoneme' sequence.

With the primary objective of learning generic representations for noisy data, we choose three datasets from different domains for the pre-training step (Table 1): the LibriSpeech corpus [12], Amazon reviews [13] and SQuAD v1.1 [14]. Specifically, we use the LibriSpeech train-clean-360 corpus, while we randomly sample the Amazon reviews and SQuAD datasets to collect a specified number of utterances across multiple topics and paragraphs.

Table 1. Statistics of the pre-training dataset.

Once we collect the raw text corpus, we follow a strategy similar to that described in Fang et al. [9] to create a noisy speech corpus. Specifically, we use Amazon Polly (aws.amazon.com/polly) to convert the raw text to speech (TTS). Next, we apply Speech Synthesis Markup Language (SSML) tags to the audio to change the prosody of the produced speech. We also add ambient noise to the speech to make the data align with real-life data. For a specific speech file, we apply one SSML tag and one ambient noise. Once the noisy speech data is generated, we use an off-the-shelf ASR engine (Amazon Transcribe) to transcribe the audio. Since our hypothesis is to create a language model that is robust to ASR errors, we generate data spanning different levels of word error rate (WER). We reject any transcription that has a WER of less than 5% or more than 40%. We additionally use more than one noise level for 25% of the pre-training corpus to capture the characteristics of word and phoneme sequences at different noise levels. The mean WER of the generated corpus is 31.2%.

In addition to transcribing the speech, we generate the phoneme sequence from a separately trained sequence-to-sequence model, similar to the setup in [15]. The primary objective of using an independent, acoustically trained phoneme generator is to keep it unbiased by language modelling and focused only on the acoustic information in the speech. The specific details of the phoneme sequence-to-sequence model are beyond the scope of this work. This step is different from the work done in [9], where Fang et al. generate the phoneme sequence from the transcribed output as a post-processing step. That approach has the drawback that errors present in the ASR transcript are propagated to the phoneme sequence; the resulting phoneme sequence would never be able to recover the deleted words and substitution errors of the ASR transcript. Instead, we use a sequence-to-sequence model to output phoneme sequences so that we can retain the complementary information in the word and phoneme sequences, especially for noisy speech.
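As a rough sketch of the WER-based filtering applied to the generated corpus, the snippet below computes the per-utterance WER between the reference text and the ASR hypothesis with the open-source jiwer package and keeps only utterances in the 5-40% band; the surrounding data handling is assumed and is not our exact pipeline.

```python
import jiwer  # open-source WER computation package (assumed available)


def keep_utterance(reference: str, hypothesis: str,
                   low: float = 0.05, high: float = 0.40) -> bool:
    """Keep a synthesized/transcribed pair only if its WER lies in [5%, 40%],
    mirroring the corpus filtering described above."""
    wer = jiwer.wer(reference, hypothesis)
    return low <= wer <= high


# Example: filter a list of (reference, asr_hypothesis) pairs.
pairs = [
    ("why in tennis are zero points called love", "ten are zero points car logs"),
    ("the cat sat on the mat", "the cat sat on a mat"),
]
filtered = [(ref, hyp) for ref, hyp in pairs if keep_utterance(ref, hyp)]
```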
3. EXPERIMENTS AND RESULTS

3.1. Dataset
For experimental evaluation of the approach, we use three benchmark datasets, described below (Table 2). We use the exact same setup as above to generate the noisy version of each dataset and the corresponding ASR transcripts and phoneme sequences.
• SST-5: The Stanford Sentiment Treebank dataset [16], containing labels on a five-point scale from very negative to very positive sentiment.
• TREC: A question classification dataset [17] with coarse (6 classes) and fine-grained (50 classes) label sets.
• ATIS: Audio recordings of people making flight reservations [18], with intent recognition as one of the downstream tasks.
Table 2. Statistics of the downstream task datasets.
We use RoBERTa as the base model and pre-train it with the augmented phoneme sequences. Once we have the phoneme-enhanced pre-trained model, PhonemeBERT, we fine-tune it for the downstream tasks. The PhonemeBERT model was trained with a batch size of 192 on V100 GPUs for 50 epochs. The downstream models are fine-tuned on top of the PhonemeBERT model for 20 epochs. The checkpoint with the best validation score is used to report the test results. For cases where a pre-defined validation set is not provided, we use 20% of the training set as the validation set.
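The downstream fine-tuning settings reported in Table 3 differ only in which features are fed to the encoder: the ASR transcript, the phoneme sequence, or both. A minimal sketch of how such inputs could be assembled is given below; the separator handling and the helper name are our own illustrative assumptions rather than the released preprocessing.

```python
def make_finetune_input(asr_transcript=None, phoneme_sequence=None, sep="</s>"):
    """Assemble the text fed to the fine-tuned encoder for one utterance.

    - w_asr setting:          only the ASR transcript
    - p_asr setting:          only the phoneme sequence
    - w_asr + p_asr setting:  transcript and phonemes joined by a separator
    Hypothetical helper; the actual preprocessing may differ.
    """
    parts = [p for p in (asr_transcript, phoneme_sequence) if p]
    if not parts:
        raise ValueError("need at least one of asr_transcript / phoneme_sequence")
    return f" {sep} ".join(parts)


# Example usage for the three fine-tuning settings.
w_asr = "ten are zero points car logs"
p_asr = "W.AY.IH.N. .T.EH.N.AH.S. .AO.R. .Z.IY.R.OW. .K.AO.L.D. .L.OW.V."
print(make_finetune_input(asr_transcript=w_asr))                          # w_asr only
print(make_finetune_input(phoneme_sequence=p_asr))                        # p_asr only
print(make_finetune_input(asr_transcript=w_asr, phoneme_sequence=p_asr))  # both
```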
We define three baseline systems against which we compare the results:
• Oracle: Fine-tuning the RoBERTa-base model by training and testing on the clean (original) text corpus for the downstream tasks. This baseline specifies the upper limit of the results using clean (original) data.
• B1: In this baseline, we fine-tune the RoBERTa-base model with the ASR transcripts on the downstream tasks. This method specifies an immediate fine-tuning strategy for the downstream task with noisy data.
• B2: In this baseline, we further pre-train the RoBERTa-base model on the ASR transcripts from our pre-training corpus with the word-only Masked Language Modeling (MLM) task [10]. We then fine-tune it with the ASR transcripts for each downstream task (a sketch of this continued pre-training step is given after this list).
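For the B2 baseline, continued word-only MLM pre-training can be run with standard open-source tooling; the sketch below uses the Hugging Face transformers and datasets libraries and only approximates our setup (the corpus file name and the hyperparameters shown are illustrative).

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# ASR transcripts from the pre-training corpus; the file name is illustrative.
with open("asr_transcripts.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

dataset = Dataset.from_dict({"text": lines}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# Word-only masked language modelling with the usual 15% masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="b2-mlm", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```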
The comparative results of the proposed approach against the various baselines are presented in Table 3.

Table 3. Comparison of the proposed model, PhonemeBERT, with the baselines. An ablation study without joint modelling (A3) and using only one mode of feature for fine-tuning on the downstream task (A1, A2) is also presented. All models are initialized with weights from RoBERTa-base before any optional pre-training.

PhonemeBERT, trained with the joint modelling framework of ASR transcripts and phoneme sequences, outperforms the other models across all datasets. The proposed model comprehensively beats the model pre-trained on ASR transcripts only (B2) by up to 3.4%. This shows that a language model trained on noisy ASR transcripts alone is not as robust to ASR errors as one trained with both ASR transcripts and phoneme sequences. Furthermore, the baseline B2 performs worse than a model pre-trained with the joint modelling setup but fine-tuned on the downstream tasks using only ASR transcripts as input (A1). This result justifies our hypothesis that joint modelling helps learn a phonetic-aware language model that is more powerful in such noisy conditions. In the absence of phoneme sequences during inference, the jointly pre-trained model can thus be used for downstream task fine-tuning with only ASR transcripts as the input feature. We also evaluate the setup using only the phoneme sequence for downstream fine-tuning (A2). The results show a drop in evaluation scores for all but one dataset, with a sharp decline for SST-5 and TREC-50 compared to the other results. This can be explained by the fact that both of these tasks require an understanding of the semantics of the text rather than syntactic information; phoneme sequences do not capture semantic information but are rather a syntactic representation of the utterances, hence the loss of performance in this setup. Finally, to study the impact of the joint modelling framework, we pre-train a model with independent word-MLM and phoneme-MLM tasks (A3), where each task sees the context from its respective sequence only. PhonemeBERT performs better than the model trained on independent MLM tasks by up to 2.4% accuracy.
Table 4. Accuracy comparison of the baselines (B1, B2) and the proposed model (P: PhonemeBERT) at different WER ranges.
In Table 5, we present an example where the ASR transcription is completely erroneous, but the presence of "why" in the phoneme sequence helps in the correct classification.

Ref: Why in tennis are zero points called love?
Hyp: ten are zero points car logs
Phoneme: W.AY.IH.N. .T.EH.N.AH.S. .AO.R. .Z.IY.R.OW. .K.AO.L.D. .L.OW.V.
Gold Truth: DESC; PhonemeBERT: DESC; B2: ENTY
Table 5. An example with an incorrect ASR hypothesis correctly classified using the phoneme sequence.

In a separate analysis, we categorize the test sets of TREC-50 and SST-5 into buckets based on the WER score of each instance and compare the performance of the trained models (Table 4) on each bucket. We note that for the lower WER range, the baseline pre-trained on only ASR transcripts performs best. However, at higher WER (>= 20), the performance of the two models (B1, B2) declines steeply, while PhonemeBERT outperforms the word-only models and is less susceptible to the noise. This reinforces the fact that we need better modelling ability to address robustness to ASR errors. For WER >= 30, PhonemeBERT outperforms the next-best model (B2) by 4.99% and 2.93% in accuracy on TREC-50 and SST-5 respectively.
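The per-bucket analysis behind Table 4 can be reproduced with a few lines of evaluation code; the sketch below groups test utterances by their WER against the reference text (again using jiwer) and reports accuracy per bucket. The field names and bucket boundaries here are illustrative assumptions, not the exact analysis script.

```python
from collections import defaultdict
import jiwer  # assumed available for WER computation


def bucket_of(wer: float) -> str:
    """Map a WER value to an illustrative bucket; boundaries follow the
    ranges discussed in the text."""
    if wer < 0.20:
        return "<20"
    if wer < 0.30:
        return "20-30"
    return "30+"


def accuracy_by_wer_bucket(examples):
    """`examples` is an iterable of dicts with hypothetical keys:
    'reference', 'hypothesis', 'gold_label', 'predicted_label'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        b = bucket_of(jiwer.wer(ex["reference"], ex["hypothesis"]))
        total[b] += 1
        correct[b] += int(ex["predicted_label"] == ex["gold_label"])
    return {b: correct[b] / total[b] for b in total}
```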
4. CONCLUSIONS AND FUTURE WORK
In this work, we propose joint modelling of the phoneme sequence and ASR transcript to build a language model robust to ASR errors. We demonstrate that the proposed model performs better than any of the strong baselines on downstream NLU tasks. Through an ablation study, we show that the proposed setup also works in a low-resource setup, still producing better results than a model pre-trained with only ASR transcripts. Our analysis suggests that the method is significantly better at higher WER ranges. In the future, we would like to extend the work to tasks such as entity recognition and question answering on noisy data. Additionally, we would also like to explore the possibility of using a larger corpus for the pre-training step.

5. REFERENCES
[1] T. Desot, F. Portet, and M. Vacher, "SLU for voice command in smart home: Comparison of pipeline and end-to-end approaches," in IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore, 2019, pp. 822–829, IEEE.
[2] C. Liu, S. Zhu, Z. Zhao, R. Cao, L. Chen, and K. Yu, "Jointly encoding word confusion network and dialogue context with BERT for spoken language understanding," CoRR, vol. abs/2005.11640, 2020.
[3] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, "LatticeRNN: Recurrent neural networks over lattices," in Interspeech, 2016, pp. 695–699, ISCA.
[4] Y. Weng, S. S. Miryala, C. Khatri, R. Wang, H. Zheng, P. Molino, M. Namazifar, A. Papangelis, H. Williams, F. Bell, and G. Tür, "Joint contextual modeling for ASR correction and language understanding," in ICASSP, 2020, pp. 6349–6353, IEEE.
[5] A. Mani, S. Palaskar, and S. Konam, "Towards understanding ASR error correction for medical conversations," in Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, 2020, pp. 7–11.
[6] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, "Towards end-to-end spoken language understanding," in ICASSP, 2018, pp. 5754–5758, IEEE.
[7] S. Ghannay, A. Caubrière, Y. Estève, N. Camelin, E. Simonnet, A. Laurent, and E. Morin, "End-to-end named entity and semantic concept extraction from speech," in SLT, 2018, pp. 692–699, IEEE.
[8] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Interspeech 2018, 2018, pp. 3688–3692, ISCA.
[9] A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, "Using phoneme representations to build predictive models robust to ASR errors," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, 2020, pp. 699–708, ACM.
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
[11] A. Zeyer, W. Zhou, T. Ng, R. Schlüter, and H. Ney, "Investigations on phoneme-based end-to-end speech recognition," CoRR, vol. abs/2005.09336, 2020.
[12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Queensland, Australia, 2015, pp. 5206–5210, IEEE.
[13] R. He and J. J. McAuley, "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering," in Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, 2016, pp. 507–517, ACM.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Texas, USA, 2016, pp. 2383–2392.
[15] K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, "On the choice of modeling unit for sequence-to-sequence speech recognition," in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic, Eds., 2019, pp. 3800–3804, ISCA.
[16] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP, Seattle, Washington, USA, 2013, pp. 1631–1642, ACL.
[17] X. Li and D. Roth, "Learning question classifiers," in COLING, 2002.
[18] G. Tür, D. Hakkani-Tür, and L. P. Heck, "What is left to be understood in ATIS?," in 2010 IEEE Spoken Language Technology Workshop, SLT 2010, California, USA.