Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model
Junwei Liao*, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, Michael Zeng

University of Electronic Science and Technology of China
Microsoft Cognitive Services Research Group, USA
Microsoft STCA NLP Group
[email protected], [email protected], {yushi, migon, lisho, seeskime, liyang.lu, nzeng}@microsoft.com

* Work done as an intern at Microsoft.

ABSTRACT
Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript can still be challenging to read due to disfluency, filler words, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and the ASR system alike will be propagated to the next task in the pipeline. In this work, we propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into a readable text for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. Since the dataset is small, we propose a novel data augmentation method and use a two-stage training strategy to fine-tune the RoBERTa pre-trained model. On the constructed test set, our model outperforms a production two-step pipeline-based post-processing method by a large margin of 13.26 on readability-aware WER (RA-WER) and 17.53 on BLEU metrics. Human evaluation also demonstrates that our method can generate more human-readable transcripts than the baseline method.
Index Terms — speech recognition, ASR post-processing for readability, pre-trained language model, data augmentation
1. INTRODUCTION
With the rapid development of speech-to-text technologies, ASR systems have achieved high recognition accuracy, even beating the performance of professional human transcribers on conversational telephone speech in terms of Word Error Rate (WER) [1]. ASR systems bring convenience to users in many scenarios. However, colloquial speech is fraught with disfluencies, informal words, and other noise that make it difficult to understand. While ASR systems do a great job of recognizing which words are said, their verbatim transcriptions create many problems for modern applications that must comprehend the meaning and intent of what is said. These defects in speech transcription will significantly harm the experience of application users if the system cannot handle them well.

There is a long line of previous work focusing on making ASR transcripts more human-readable, which is referred to as metadata extraction (MDE) [2]. MDE breaks the goal down into several classification tasks on top of verbatim transcription. Most MDE systems use both textual and prosodic information, combined by Hidden Markov Model [3, 4], Maximum Entropy, or Conditional Random Fields methods [3, 5]. While MDE improves the readability of ASR transcripts to a certain extent, it ignores the recognition errors introduced by the ASR system. Thus MDE needs to work with other ASR post-processing components, such as language model rescoring, to provide the final human-readable transcript.

In this paper, we propose an end-to-end ASR post-processing model for readability (APR). Readability in this context refers to having proper segmentation, capitalization, and fluency, without any errors, so our model aims to transform the ASR output into an error-free and readable text in one shot. Our model is based on RoBERTa [6], a pre-trained language model used for NLU tasks. Inspired by UniLM [7], which applies self-attention masks on BERT [8] to convert it into a sequence-to-sequence model, we adapt RoBERTa into a generative model for the NLG task.

Since there is no off-the-shelf dataset for the proposed task, we construct the desired dataset from the RT-03 MDE Training Data [9], which includes speech audio, human transcripts, and annotations. We evaluate the proposed approach on a test split of the constructed dataset and compare it with a production pipeline-based baseline. Our model significantly outperforms the baseline method by 13.26 on RA-WER and 17.53 on BLEU metrics. Human evaluation also shows that the proposed model generates more human-readable transcripts for ASR output.
[Fig. 1. ASR post-processing model for readability based on the modified RoBERTa architecture and data augmentation. In the figure, an ungrammatical sentence is converted to speech by the TTS system (Tacotron 2); the TTS output is transcribed by the ASR system (cltLSTM); and the ASR output serves as the source for the Transformer-based APR model, which is pre-trained on the augmented data and fine-tuned on the gold (source, target) data to produce a readable sentence.]

To solve the problem of scarce training data, in addition to directly fine-tuning the pre-trained language model, we propose a novel data augmentation method that synthesizes large-scale training data using Grammatical Error Correction (GEC) corpora followed by text-to-speech (TTS) and ASR. We adopt a two-stage training strategy, namely pre-training and fine-tuning, to better exploit the augmented data. This two-phase training not only learns useful information from the augmented data, but also avoids the risk of the model being overwhelmed and adversely affected by it.
2. MODEL
Figure 1 shows the whole picture of the ASR post-processing for readability.
The middle part of Figure 1 is the ASR model. We use a contextual layer trajectory LSTM (cltLSTM) [10] as our ASR model. The cltLSTM decouples the temporal modeling and target classification tasks with time and depth LSTMs, respectively, and incorporates future context frames to gather more information for accurate acoustic modeling. The input features are 80-dimensional log Mel filter banks computed for every 20 milliseconds (ms) of speech using frame skipping [11]. The softmax layer has 9404 nodes to model the senone labels. Runtime decoding is performed using a 5-gram LM with a decoding graph of around 5 gigabytes (GB). The cltLSTM has a 24-frame lookahead, which corresponds to a 480 ms duration.
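As an illustration of this front end, the following is a minimal sketch (not the production feature extractor) of computing 80-dimensional log Mel filter bank features with frame skipping using torchaudio. The 16 kHz sample rate, 25 ms window, and 10 ms hop are assumptions; the paper only specifies 80 Mel bins and one feature vector per 20 ms.

```python
# Sketch of the acoustic front end: 80-dim log-Mel features with
# frame skipping so that one feature vector covers 20 ms of speech.
import torch
import torchaudio

def log_mel_features(wav_path: str) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,        # 25 ms window at 16 kHz (assumed)
        hop_length=160,   # 10 ms hop at 16 kHz (assumed)
        n_mels=80,        # 80 Mel bins, as in the paper
    )(waveform)
    log_mel = torch.log(mel + 1e-6)         # (1, 80, frames)
    # Frame skipping: keep every second 10 ms frame, giving one
    # feature vector per 20 ms of speech.
    return log_mel[..., ::2].squeeze(0).T   # (frames, 80)
```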
The proposed APR model (the rightmost part of Figure 1) is based on RoBERTa [6], a robustly optimized BERT [8] pre-training approach. Both BERT and RoBERTa have a single Transformer stack and are pre-trained only with bidirectional prediction, which makes them more discriminative than generative. However, [12] demonstrated the effectiveness of transfer learning from BERT to a sequence-to-sequence task by initializing both the encoder and the decoder with pre-trained BERT in their speech recognition correction work. Inspired by this work and UniLM [7], we apply self-attention masks to the RoBERTa model to convert it into a sequence-to-sequence generation model. To achieve whole-sentence prediction rather than only masked-position prediction, we use an autoregressive approach during fine-tuning.
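To make the masking mechanism concrete, here is a minimal sketch (not the authors' code) of a UniLM-style self-attention mask: source positions attend bidirectionally to the whole source, while target positions attend to the source plus only their own left context, which is what turns a bidirectional encoder into a sequence-to-sequence generator.

```python
# UniLM-style seq2seq attention mask for a packed [source; target]
# sequence. mask[i, j] == True means position i may attend to j.
import torch

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :src_len] = True   # every position sees the full source
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    mask[src_len:, src_len:] = causal   # target is strictly left-to-right
    return mask
```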
3. DATA AUGMENTATION
Transformer-based models are usually trained on millions of parallel sentences and tend to overfit easily if the data is scarce. [12, 13] proposed two self-complementary regularization techniques to solve this problem. Besides initializing the model weights with the pre-trained language model mentioned in the previous section, the other solution is data augmentation. Following their cue, we propose a novel data augmentation method for APR.

We synthesize large-scale training data using a grammatical error correction (GEC) dataset as the seed data. GEC aims to correct different kinds of errors, such as spelling, punctuation, grammatical, and word choice errors. The purpose of using it is to introduce more types of errors that are not specific to our ASR system and therefore make the APR model more general. The GEC dataset comes from the restricted track of the BEA 2019 shared task [14] and is composed of pairs of grammatically incorrect sentences and corresponding sentences corrected by a human annotator.

First, we use a text-to-speech (TTS) system (the leftmost part of Figure 1) to convert the ungrammatical sentences to speech. Specifically, we use a Tacotron2 model [15], which is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. We feed the grammatically incorrect sentences from the seed corpus into Tacotron2 to produce audio files simulating human speakers.

Then, these audio files are fed into our ASR system, which outputs the corresponding transcripts. The resulting text contains both the grammatical errors found originally in the GEC dataset and the TTS+ASR pipeline errors. Finally, we pair the outputs of the ASR system with the original grammatical sentences as the training samples. After the above process, we obtain about 1.1M sentence pairs.
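Schematically, the augmentation loop can be sketched as follows; `synthesize` and `transcribe` are hypothetical stand-ins for the Tacotron2 TTS system and the cltLSTM ASR system described in this paper.

```python
# Turn GEC sentence pairs into (noisy ASR output, clean target) pairs.
from typing import Callable, List, Tuple

def augment(
    gec_pairs: List[Tuple[str, str]],     # (ungrammatical, corrected)
    synthesize: Callable[[str], bytes],   # text -> audio waveform
    transcribe: Callable[[bytes], str],   # audio waveform -> text
) -> List[Tuple[str, str]]:
    augmented = []
    for ungrammatical, corrected in gec_pairs:
        audio = synthesize(ungrammatical)  # TTS simulates a speaker
        noisy = transcribe(audio)          # ASR adds recognition errors
        # The source now carries both the original grammatical errors
        # and the TTS+ASR pipeline errors; the target stays clean.
        augmented.append((noisy, corrected))
    return augmented
```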
4. EXPERIMENTS

4.1. Gold dataset construction
To the best of our knowledge, there is no off-the-shelf dataset that can be directly used in our APR task. Therefore, we construct the desired dataset from the MDE corpus. We use the English Conversational Telephone Speech (CTS) in the released data, of which the transcripts and annotations cover approximately 40 hours of CTS audio of casual and conversational speech. By parsing the annotation files, we get the transcript with metadata annotation, which, for example, uses '/.' for statement boundaries (SU), '<>' for fillers, '[]' for disfluency edit words, and '*' for interruption points inside edit disfluencies. The following example shows an ASR transcript with metadata annotation:

    and <uh> <you know> wash your clothes wherever you are /. and [ you ] * you really get used to the outdoors /.

We generate a readable target transcript by cleaning up the metadata annotations, i.e., the deletable portions of edit disfluencies and the fillers are removed, and each SU is presented as a separate line within the transcript. We also capitalize the first word of each sentence to further improve the transcript's readability. Through the above steps, the transcript with metadata annotation becomes a readable text:
    And wash your clothes wherever you are.
    And you really get used to the outdoors.
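A minimal sketch of this cleanup step, handling only the four markers named above (the full RT-03 MDE annotation inventory is richer than this):

```python
# Strip MDE metadata markers and emit one capitalized sentence per SU.
import re

def clean_mde(annotated: str) -> str:
    text = re.sub(r"<[^>]*>", " ", annotated)   # drop fillers, e.g. <uh>
    text = re.sub(r"\[[^\]]*\]", " ", text)     # drop deletable edit words
    text = text.replace("*", " ")               # drop interruption points
    sentences = []
    for su in text.split("/."):                 # '/.' marks an SU boundary
        su = " ".join(su.split())
        if su:
            sentences.append(su[0].upper() + su[1:] + ".")
    return "\n".join(sentences)

print(clean_mde(
    "and <uh> <you know> wash your clothes wherever you are /. "
    "and [ you ] * you really get used to the outdoors /."
))
# And wash your clothes wherever you are.
# And you really get used to the outdoors.
```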
After the above processing, we obtain 27,355 readable transcripts in total. About 1K samples each are extracted for validation and testing. We ensure that the samples for training, validation, and testing come from different conversations. The audio is transcribed using the aforementioned ASR model, and the outputs are paired with the readable targets.
4.2. Baseline method and evaluation metrics

We use the production two-step post-processing pipeline of our ASR system as the baseline, namely n-best LM rescoring followed by inverse text normalization (ITN). This pipeline works well for sequentially improving speech recognition accuracy and display format for readability. The language model is a stacked RNN with two unidirectional LSTM layers [16]. ITN is configured to turn on capitalization, punctuation, and the correction of simple grammatical errors for the MDE test data (https://catalog.ldc.upenn.edu/LDC2004T12). Since this pipeline lacks handling of disfluencies in spoken language, for a fair comparison, we add a simple step to remove some disfluencies. Specifically, we remove the commonly used filled pauses (e.g., uh, um) and discourse markers (e.g., you know, I mean), and we filter repeated words, keeping only one of them (e.g., I'm I'm → I'm, it's it's → it's).

We use the BLEU [17] score to measure the performance of the APR model. We also extend the conventional WER used in speech recognition to a readability-aware WER (RA-WER) by removing the text normalization before calculating the Levenshtein distance.
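For concreteness, here is a minimal sketch of RA-WER: the usual word-level Levenshtein distance normalized by the reference length, computed directly on the formatted text (no lowercasing or punctuation stripping beforehand).

```python
# Readability-aware WER: standard dynamic-programming edit distance
# over words, applied to the un-normalized, formatted transcripts.
def ra_wer(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)
```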
Model               RA-WER   BLEU
ASR transcript       45.13   45.10
+ LM rescoring       44.71   45.68
+ ITN                38.77   53.39
+ RM disfluencies    33.52   56.76
APR (FT)             21.33   72.77
APR (PT + FT)        20.26   74.29

Table 1. Performance of APR models and the baseline method on the test set of gold data.
4.3. Training setup

As aforementioned, the augmented GEC data is valuable for generalizing the APR model. However, its source side does not have many disfluencies and other errata that often occur in spoken communication. As a result, if we mix the augmented data with the gold data during model training, the massive augmented data tends to overwhelm the gold data and introduce unnecessary and even erroneous editing knowledge, which is undesirable for readability. To solve this problem, we follow the strategy of [18]. Specifically, we train the model on the augmented data and the gold data in two phases: pre-training and fine-tuning, respectively.

We explore the effect of data augmentation with two configurations in our experiments. In the first configuration, denoted APR (FT), we directly fine-tune the proposed model, initialized with the weights of pre-trained RoBERTa, on the constructed data. In the second configuration, denoted APR (PT + FT), we first pre-train the model, initialized with the weights of pre-trained RoBERTa, on the augmented data, and then fine-tune it on the constructed data.

Our model is implemented in PyTorch on top of the Huggingface Transformers library (https://github.com/huggingface/Transformers) and based on the RoBERTa-large architecture. We train our model with the Adam optimizer (lr = 2.5e-6, β₁ = 0.9, β₂ = 0.999) for 3 epochs, with linear warmup over one-tenth of the total steps followed by linear decay, on a batch size of 8 per GPU. We adopt the 50K byte-level BPE vocabulary used in RoBERTa so that we can directly transfer its pre-trained weights. Each model is trained on 4 NVIDIA V100 GPUs with 32 GB of memory using mixed precision. We use the same training setup for all training stages. In the fine-tuning stage, checkpoints are selected on the validation set, and the beam size for beam search is 5.
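As a sketch of this setup, assuming the standard PyTorch and Huggingface Transformers APIs; `model` and `total_steps` are hypothetical names supplied by the surrounding training script.

```python
# Optimizer and learning-rate schedule matching the description above.
from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, total_steps):
    optimizer = Adam(model.parameters(), lr=2.5e-6, betas=(0.9, 0.999))
    # Linear warmup over one-tenth of the total steps, then linear decay.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=total_steps // 10,
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```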
5. RESULTS

5.1. Results analysis
In Table 1, we compare model performance on the constructed test data. From the table, we can see that the APR (PT + FT) model outperforms the baseline (ASR + LM rescoring + ITN + RM disfluencies) by a large margin of 13.26 RA-WER and 17.53 BLEU (absolute values).

To better understand the readability contribution of each stage in the baseline method, we compute the metric scores between each stage's output and the reference sentences and put the results in the first group of Table 1. LM rescoring only brings a 0.42 RA-WER reduction and a 0.58 BLEU improvement. One explanation is that for highly casual and conversational speech, recognition correctness, especially verbatim recognition, only contributes a small part of the readability. ITN further reduces RA-WER by 5.94 and improves BLEU by 7.71, but not by much: having proper capitalization and punctuation is clearly helpful for readability, but not enough. With the simple step of removing some disfluencies, the performance gain is significant (RA-WER 5.25, BLEU 3.37), which implies that disfluencies are the major factor in the gap between the baseline and our proposed method.

The second group in Table 1 shows the performance of the proposed model using different training strategies. The two-stage training (PT + FT) further reduces RA-WER by 1.07 points and increases BLEU by 1.52 points. This result proves the effectiveness of the proposed data augmentation method and the two-stage training strategy for the APR task.

Example 1
    ASR transcript: yeah i don't believe they have to pay any uh like federal tax
    Ground truth: I don't believe, they have to pay any federal tax.
    ASR + LM rescoring + ITN: Yeah, I don't believe they have to pay any. Uh, like federal tax, uh?
    APR (PT + FT): I don't believe, they have to pay any federal tax.

Example 2
    ASR transcript: yeah i i buy him every once in awhile and an i bought one and it was you know blah
    Ground truth: I buy them every once in a while. And I bought one. And it was blah.
    ASR + LM rescoring + ITN: Yeah I I buy 'em every once in awhile and an I bought one and it was, you know blah.
    APR (PT + FT): I buy them every once in a while. And I bought one. And it was blah.

Example 3
    ASR transcript: they have as far as i'm concerned because i'm i'm not a big vegetable eater they have too many a yellow vegetables on the same day
    Ground truth: They have too many yellow vegetables on the same day.
    ASR + LM rescoring + ITN: They have, as far as I'm concerned, because I'm I'm not a big vegetable eater. They have too many a yellow vegetables on the same day.
    APR (PT + FT): They have too many yellow vegetables on the same day.

Example 4
    ASR transcript: i just see a lot of a social and cultural differences that could a post problems with a puerto rico becoming a state
    Ground truth: I just see a lot of social and cultural differences that could pose problems with Puerto Rico becoming a state.
    ASR + LM rescoring + ITN: I just see a lot of, uh, social and cultural differences that could a post problems with Puerto Rico becoming a state.
    APR (PT + FT): I just see a lot of social and cultural differences that could cause problems with Puerto Rico becoming a state.

Table 2. Comparison of readable transcripts generated by the baseline method and the proposed model. The bold parts of the sentences are corrections of recognition errors.
5.2. Case study

To conduct a qualitative analysis of the readable transcripts produced by the proposed model, we compare output examples of our APR (PT + FT) model with the baseline method (ASR + LM rescoring + ITN). In Table 2, we can see that both the baseline method and our model improve the readability of the ASR transcript by adding punctuation and capitalizing names and the first word of each sentence. But the baseline is verbatim, keeping all the words in the ASR transcript, which makes the sentences disfluent and leads to incorrect segmentation and punctuation. For instance, in the first example, the baseline wrongly segments the sentence and adds a question mark instead of a period due to the influence of the filler words ("uh", "like"). In contrast to the baseline, our model removes all words that cause the disfluency and adds correct punctuation in the appropriate places to make the transcript more readable. The third example better illustrates the capability of the proposed model in removing disfluencies: our model removes all the repetitions, asides, and parentheticals to get a clean sentence identical to the ground truth.

Besides punctuation, capitalization, and the removal of disfluencies, our model also corrects some recognition errors where the baseline method fails to do so. We use bold font to highlight the corrected errors in Table 2. An interesting finding is that the last example gets "a post problems" fixed to "cause problems", which differs from the ground truth "pose problems" because the latter is less frequently used. It is arguable that although the original user input is "pose", our model's result is more readable for most human readers and machine applications if we do not consider personalization.
5.3. Human evaluation

Readability is subjective, and the BLEU score and RA-WER may not be consistent with human perception. Thus, we also conduct a human evaluation on the Switchboard corpus [19]. Specifically, we conduct an A/B test to compare our model with the baseline method. To build a test set for human evaluation, we randomly chose 100 audio samples with source sentence lengths between 20 and 60 words. These audio samples are passed through the ASR system to get transcripts. We then use the baseline method and our model to generate the output texts, respectively. Three annotators are shown the generated texts in random order and are asked to choose the more readable one. Each case receives three labels from three annotators, and the final decision is made by majority vote. Based on the above experiment design, the annotators vote for the outputs of our model 70 times out of 100 cases (win rate 70%), which means that our model is rated more readable than the baseline method. A two-sided binomial test on the result confirms that our model is statistically significantly more readable than the baseline method, with a p-value of less than 0.01.
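The reported significance can be reproduced with a standard two-sided binomial test, sketched here with SciPy:

```python
# Two-sided binomial test on 70 wins out of 100 A/B comparisons,
# against the null hypothesis that both systems are equally likely
# to be preferred (p = 0.5).
from scipy.stats import binomtest

result = binomtest(k=70, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)  # well below the reported 0.01 threshold
```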
6. CONCLUSION
In this work, we propose an ASR post-processing model for readability based on a modified RoBERTa pre-trained language model. Fine-tuned on our constructed data, the proposed model is capable of "translating" the ASR output into an error-free and readable transcript for human understanding and downstream tasks. Case analysis and human evaluation demonstrate that our model outperforms the traditional pipeline-based baseline method and generates more readable transcripts.

7. REFERENCES

[1] Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke, "The Microsoft 2017 conversational speech recognition system," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5934–5938.

[2] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Barbara Peskin, Jeremy Ang, Dustin Hillard, Mari Ostendorf, Marcus Tomalin, Phil Woodland, and Mary Harper, "Structural metadata research in the EARS program," in Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. IEEE, 2005, vol. 5, pp. v–957.

[3] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Barbara Peskin, and Mary Harper, "The ICSI/SRI/UW RT-04 structural metadata extraction system," in Proc. of EARS RT-04 Workshop, 2004.

[4] M. Tomalin and P. C. Woodland, "The RT04 evaluation structural metadata systems at CUED," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), 2004.

[5] Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper, "Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech," in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 64–71.

[6] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.

[7] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon, "Unified language model pre-training for natural language understanding and generation," in Advances in Neural Information Processing Systems, 2019, pp. 13063–13075.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.

[9] S. Strassel, C. Walker, and H. Lee, "RT-03 MDE training data speech," Linguistic Data Consortium, Philadelphia, 2004.

[10] Jinyu Li, Liang Lu, Changliang Liu, and Yifan Gong, "Improving layer trajectory LSTM with future context frames," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6550–6554.

[11] Yajie Miao, Jinyu Li, Yongqiang Wang, Shi-Xiong Zhang, and Yifan Gong, "Simplifying long short-term memory acoustic models for fast training and decoding," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2284–2288.

[12] Oleksii Hrinchuk, Mariya Popova, and Boris Ginsburg, "Correction of automatic speech recognition with transformer sequence-to-sequence model," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7074–7078.

[13] Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi, Ming Gong, Linjun Shou, Hong Qu, and Michael Zeng, "Improving readability for automatic speech recognition transcription," arXiv preprint arXiv:2004.04438, 2020.

[14] Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe, "The BEA-2019 shared task on grammatical error correction," in BEA@ACL, 2019.

[15] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerrv-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[16] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.

[17] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[18] Yi Zhang, Tao Ge, Furu Wei, Ming Zhou, and Xu Sun, "Sequence-to-sequence pre-training with data augmentation for sentence rewriting," arXiv preprint arXiv:1909.06002, 2019.

[19] John J. Godfrey and Edward Holliman, "Switchboard-1 Release 2," Linguistic Data Consortium, Philadelphia, 1997.