META-LEARNING FOR IMPROVING RARE WORD RECOGNITION IN END-TO-END ASR
Florian Lux and Ngoc Thang Vu
University of Stuttgart, Institute for Natural Language Processing, 70569 Stuttgart, Germany
ABSTRACT
We propose a new method of generating meaningful embeddings for speech, changes to four commonly used meta learning approaches to enable them to perform keyword spotting in continuous signals, and an approach to combining their outcomes with an end-to-end automatic speech recognition system to improve rare word recognition. We verify the functionality of each of our three contributions in two experiments exploring their performance for different numbers of classes (N-way) and examples per class (k-shot) in a few-shot setting. We find that the speech embeddings work well and that the changes to the meta learning approaches clearly enable them to perform spotting in continuous signals. Despite the interface between keyword spotting and speech recognition being very simple, we are able to consistently improve word error rate by up to 5%.
Index Terms — meta learning, keyword spotting, speech recognition, speech embedding
1. INTRODUCTION
While end-to-end (E2E) [1] deep learning (DL) models brought great improvements to the field of automatic speech recognition (ASR) in recent years and reduced word error rates (WER) on benchmark datasets significantly [2], they also come with a set of problems. Even though they are designed to work well with unknown words, since they usually concatenate subword units and can thus essentially produce any textual transcription, they tend to perform even worse on rare words than classical DL ASR systems [3]. This is because they contain an internal latent language model, which is biased and cannot be tweaked. For example, an ASR trained on the LibriSpeech corpus is very likely to transcribe the name Hilde as Hilda or the name Josef as Joseph.

Even though those words are rare, they tend to be of great importance for many tasks. A good example of this is automatic meeting transcription. People frequently address each other by name in a meeting, and meeting participants can have very unique names with unique pronunciations. So for the transcriptions, prior knowledge of which names to expect could be very helpful. Incorporating it, however, is a non-trivial task, because E2E approaches cannot be fine-tuned easily, since they lack a lexicon or a pronunciation dictionary.

Proper nouns have been identified as a challenging problem in ASR for a while now [4]. Recently, some approaches have arisen to tackle this challenge with E2E ASR using a specialised architecture and losses [5] or using specific data and training procedures to better represent contextual information [6]. Our approach is meant for rare words in general; in this work, however, we choose rare proper nouns as exemplary data and use few-shot learning to improve performance on them.

In meta learning [7] using the metric space approach (MSML) [8], an embedding function is trained to transform datapoints into a metric space, where comparisons purely based on positions are possible. Since structural knowledge about a task resides within the embedding function, new information can be considered by making comparisons to reference samples of data we want to adapt to. Essentially, clustering is performed in a latent space. MSML could provide solutions for our goal of improving recognition of rare words in highly specific contexts, since it can operate with as little as one reference datapoint of the data we want to adapt to, it does not need any retraining or time-consuming tweaking mechanisms for new reference samples, and the decisions made on the basis of the comparisons in the metric space are simple and interpretable and can thus be integrated into an E2E workflow with few complications.

We propose a two-part ASR system that integrates metric space representations of expected difficult keywords into the E2E pipeline. An ASR system is built using the Transformer architecture [9], which has been shown to outperform any prior architecture for most speech related tasks [10, 11, 12, 13] by heavily relying on attention mechanisms [14]. The intermediate representation that the Transformer encoder produces is used as prior embedding and then embedded further into a metric space on a frame-shifted window basis, inspired by previous work on keyword spotting [15]. Prior approaches to embedding audio include using a skip-gram model for speech [16] and using convolutional neural nets (CNNs) for general purpose audio detection [17]. By using the Transformer as embedding function, we hope to get more phonetically rich embeddings. The metric space embeddings are then compared to reference samples using renowned MSML approaches such as Siamese Networks [18], Relation Networks [19], Prototypical Networks [20] and Matching Networks [21]. The results of these comparisons are taken into account when decoding the orthographic representation of the utterance.

Our main contributions are as follows: we propose a new method of generating meaningful embeddings of spoken language, we redesign MSML approaches to spot keywords in a continuous signal, and we showcase an approach to integrate the results into an E2E ASR system.
We verify our proposed methods on two severely different datasets and observe that the embeddings work well, that the keyword spotting achieves high F scores, and that the keyword spotting ASR interface improves rare-word WER by 1.6% on average.
2. PROPOSED APPROACH
An overview of the entire system we built can be seen in figure 1. The ASR encoder feeds into the ASR decoder, and also into each of the four MSML agents in a windowed fashion. The ASR decoder produces hypotheses and the MSML agents (usually only the one you favor) produce a list of recognized keywords. The selection of the best hypothesis from those that the decoder's beamsearch produces takes the list of recognized keywords into consideration.
Fig. 1. Overview of how the components are combined (speech signal → feature extractor → shared ASR encoder → ASR decoder and, via windowing, the four meta agents Siamese, Prototypical, Relation and Matching → recognized keywords influence the decoded text)
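To make the data flow of figure 1 concrete, the following is a minimal Python sketch of how the pieces could be wired together. All names (asr_encoder, asr_decoder, meta_agent, supports, sliding_windows) are hypothetical placeholders rather than the authors' implementation, and the window sizes and threshold are illustrative assumptions.

```python
def sliding_windows(features, win=40, shift=10):
    # Frame-shifted windows over the feature matrix (sizes are assumptions).
    return [features[i:i + win]
            for i in range(0, max(1, len(features) - win + 1), shift)]

def transcribe_with_keywords(features, asr_encoder, asr_decoder, meta_agent,
                             supports, threshold=0.8, beam_size=4):
    """features: (T, 80) feature matrix of one utterance.
    supports: dict mapping keyword -> list of ASR-encoded reference snippets."""
    # 1) Regular E2E decoding: beam search returns a ranked list of hypotheses.
    hypotheses = asr_decoder(asr_encoder(features), beam_size=beam_size)

    # 2) Keyword spotting: encode the signal window by window and score
    #    every window against every support of every keyword.
    encoded_windows = [asr_encoder(chunk) for chunk in sliding_windows(features)]
    recognized = []
    for keyword, refs in supports.items():
        score = max(meta_agent.score(w, ref)
                    for w in encoded_windows for ref in refs)
        if score > threshold:
            recognized.append(keyword)

    # 3) Rerank: prefer the most probable hypothesis containing a spotted keyword.
    for hyp in hypotheses:
        if any(kw.lower() in hyp.lower() for kw in recognized):
            return hyp
    return hypotheses[0]
```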
We need to abstract away from speaker, microphone, recording environment and so on, but retain the phonetic information. To bypass this challenge, we use the output of the encoder of a Transformer ASR as the input to the metric space embedding function. The Transformer encoder has been trained on thousands of hours of speech beforehand and thus abstracts away from surface level features and is robust against data variations already. To use the ASR system as embedding function, a feature matrix derived from an arbitrary audio signal is fed to the encoder of the Transformer, yielding an embedding matrix with 512 dimensions per timestep in our case. The Transformer applies attention with respect to the entire input sequence, so in order to get stable embeddings we have to limit the attention span to the size of our window. This is done by windowing over the signal first and encoding one appropriately sized chunk at a time using the ASR.
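A minimal sketch of this windowed embedding step follows, assuming the pre-trained encoder is exposed as a callable transformer_encoder that maps a (frames, 80) feature matrix to an embedding matrix with 512 dimensions per timestep; the window and shift sizes are illustrative assumptions, not values reported here.

```python
import numpy as np

def windowed_asr_embeddings(features, transformer_encoder, win=40, shift=10):
    """Embed an utterance chunk by chunk so that self-attention never sees
    context beyond the current window.

    features: (T, 80) Mel feature matrix of the whole utterance.
    Returns a list of embedding matrices, one per window.
    """
    windows = []
    for start in range(0, max(1, features.shape[0] - win + 1), shift):
        chunk = features[start:start + win]            # (win, 80)
        windows.append(transformer_encoder(chunk))     # (win', 512)
    return windows
```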
MSML approaches are usually used for classification on clean data, mostly images. We, however, want a network that can be applied efficiently to a sequence of windows of a signal and, rather than classifying the correct class, spot whether something is present or not. This requires some major changes to the architectures of the four meta learning approaches used.

All of the approaches are redesigned to look at one class at a time. Each network assigns a similarity score to a pair of a query and a member of the support set of a class. The query is a window of ASR-encoded signal. For each class, the highest similarity score that any of its supports achieves with any of the windows is saved. In the end, the score of each class is compared against a manually set threshold and if it peaks above, it is added to the list of recognized keywords that is passed on to the ASR decoder.

For the Siamese Network, the support and the query are embedded into the metric space, and the cosine similarity (CosSim, the normalized dot product, a measure of how well vectors align) of the resulting vectors is computed. The scoring mechanism is displayed in equation (1). An encoded matrix $M$ is denoted as $\mathrm{enc}(M)$, the query as $q$, the support as $s$:

$$\mathrm{score}^{q}_{\mathrm{class}} = \max_{s \in \mathrm{class}} \frac{\mathrm{enc}(q) \cdot \mathrm{enc}(s)}{\|\mathrm{enc}(q)\| \cdot \|\mathrm{enc}(s)\|} \qquad (1)$$

In our Relation Network implementation, the query and a support are again encoded using the metric space encoder. Here, however, the similarity function is learnt, not static. The two vectors are concatenated and passed through a small multi layer perceptron to get the similarity score. The relation scoring mechanism can be seen in equation (2); $\mathrm{relat}$ denotes the relation function, $\mathrm{concat}$ denotes vector concatenation:

$$\mathrm{score}^{q}_{\mathrm{class}} = \max_{s \in \mathrm{class}} \mathrm{relat}(\mathrm{concat}(\mathrm{enc}(q), \mathrm{enc}(s))) \qquad (2)$$

Our Prototypical implementation takes in multiple supports at a time, which are averaged after they have been encoded to build a prototype of the corresponding class; the prototype is then compared to the query using CosSim. For the case of k=1, it is equivalent to the Siamese Network. The scoring mechanism is displayed in equation (3), the calculation of the prototype in equation (4):

$$\mathrm{score}^{q}_{\mathrm{class}} = \frac{\mathrm{enc}(q) \cdot \mathrm{enc}(p_{\mathrm{class}})}{\|\mathrm{enc}(q)\| \cdot \|\mathrm{enc}(p_{\mathrm{class}})\|} \qquad (3)$$

$$p_{\mathrm{class}} = \sum_{s \in \mathrm{class}} \frac{s}{|\mathrm{class}|} \qquad (4)$$

The Matching Network is usually designed to benefit from fully conditional encoding functions into a metric space, which rely on attention. For this task, however, this setup did not converge at all. Instead we again use a simple encoder like the other networks, however with the output being sequences. We apply attention from the query to the support sequence. The conceptual matching lies within the properties of the resulting attention matrix, namely how well the two sequences align. To get a similarity score from the attention matrix, we pass it through a small long short-term memory layer [22]. The matching scoring mechanism is displayed in equation (5); $\mathrm{att}$ is an attention mechanism and $\mathrm{reduce}$ is any method to reduce over a sequence axis:

$$\mathrm{score}^{q}_{\mathrm{class}} = \max_{s \in \mathrm{class}} \mathrm{reduce}(\mathrm{att}(\mathrm{enc}(q), \mathrm{enc}(s))) \qquad (5)$$

The decoder of the ASR produces a set of hypotheses using a beamsearch. One of them is chosen as the final ASR output. This choice usually just depends on the ranking that the beamsearch itself assigns to the hypotheses. In our system, this choice is influenced by the list of recognized keywords that the MSML agents produce. If a keyword is spotted, hypotheses that contain the word are moved to the front of the ranking.
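As a concrete illustration of equations (1), (3) and (4) and of the per-class maximum over windows, here is a small numpy sketch. The metric-space encoder enc is assumed to map an ASR-encoded window to a fixed-size vector; the Relation and Matching scores additionally need the learnt relation MLP and the attention/LSTM head described above and are therefore omitted.

```python
import numpy as np

def cos_sim(a, b):
    # Normalized dot product: how well two vectors align.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def siamese_score(query, supports, enc):
    # Eq. (1): best cosine similarity between the encoded query window
    # and any encoded support of the class.
    q = enc(query)
    return max(cos_sim(q, enc(s)) for s in supports)

def prototypical_score(query, supports, enc):
    # Eq. (4): the prototype is the average of the class' supports;
    # Eq. (3): cosine similarity of encoded query and encoded prototype.
    prototype = np.mean(np.stack(supports), axis=0)
    return cos_sim(enc(query), enc(prototype))

def class_score(windows, supports, enc, score_fn=siamese_score):
    # The overall score of a class is the best score any window achieves;
    # it is then compared against a manually set threshold (e.g. 0.8).
    return max(score_fn(w, supports, enc) for w in windows)
```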
The spotting at this level is binary. To illustrate how this works, assume we have a sentence that contains the name Noirtier. Shown below are the beginnings of the hypotheses that our ASR yields for the sentence:
• <eos> nautier was near the bed ...
• <eos> natier was near the bed ...
• <eos> nartier was near the bed ...
• <eos> noirtier was near the bed ...
The correct name appears in the hypotheses, however only as the fourth option. Now if our keyword spotting system detects that the sentence contains the name, we pick the most probable hypothesis that contains the desired transcription, which is the fourth one here.
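A minimal sketch of this reranking step, under the assumption that hypotheses arrive as a beam-search-ranked list of strings and that spotting is binary:

```python
def rerank(hypotheses, recognized_keywords):
    """Move hypotheses that contain a spotted keyword to the front,
    otherwise keep the beam-search order untouched."""
    def contains_keyword(hyp):
        return any(kw.lower() in hyp.lower() for kw in recognized_keywords)
    # Stable sort: within the "contains a keyword" and "does not" groups,
    # the original beam-search ranking is preserved.
    return sorted(hypotheses, key=lambda hyp: not contains_keyword(hyp))

hyps = ["<eos> nautier was near the bed",
        "<eos> natier was near the bed",
        "<eos> nartier was near the bed",
        "<eos> noirtier was near the bed"]
print(rerank(hyps, ["Noirtier"])[0])  # -> "<eos> noirtier was near the bed"
```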
3. DATASETS

3.1. Speech Commands Dataset
We train the meta learning agents on the Speech Commands dataset [23]. It contains a total of 35 different words with 1000 to 3000 samples each. To further enrich the variety of the dataset, we add 160 more names as keywords with 4 to 6 samples each. Those samples come from three different speech synthesis systems and random data augmentations are applied to them. The additional keywords are carefully selected in order to avoid including names that are orthographically different, yet pronounced the same.
3.2. LibriSpeech

The LibriSpeech corpus [24] is the training set for the Transformer ASR and the development set for the MSML agents. It contains over 1000 hours of read speech. For tuning the meta learning agents, a subset is created from the dev-clean portion. A total of 40 proper nouns are selected, which occur between 4 and 10 times in the whole corpus. For each of them, 4 sentences containing them are selected. From each of the utterances, the name that occurs in it is manually cut out and labelled to create a collection of high quality supports for each name. While we choose proper nouns as exemplary data, what we really want to explore is the performance on rare words, hence the upper limit on their occurrence counts.
3.3. National Speech Corpus

The Singapore English National Speech Corpus [25] contains over 3000 hours of speech. The speakers have varying native languages and are recorded with a variety of microphones in different recording environments. We thus believe that the small sub-corpus of sentences containing proper nouns which we derived from it is well suited to reflect the performance of each of the components in a challenging, close-to-real-world scenario. Our derived corpus contains 120 utterances covering 30 proper nouns, with 4 utterances per name. Again, the samples we use as supports are selected and cut manually.
4. EXPERIMENTS

4.1. Setup
For the audio embeddings, we train a Transformer ASR on the LibriSpeech corpus using the ESPnet toolkit [26, 27]. The recipe uses 80 Mel-frequency buckets as the input and the training is capped at 40 epochs, similar to the recipe of the IMS-Speech system [28]. The training took 96 hours on 4 Titan X GPUs.

The two experiments use the exact same setup, but look at different aspects. The metric space encoder that is used for all of the MSML agents is built as follows: two 1D convolutional layers with a kernel size of 3 and 20 filters using hyperbolic tangent as activation function are stacked on top of each other, separated by a maximum pooling layer with a window size of 2. A global max pooling reduces over the sequence axis. The contrastive loss function [29, 30] is found to work best. RMSProp is used as the optimizer with a learning rate of 0.001 and a discounting factor of 0.9.

To train the networks, we sample pairs of audio and assign the label 1 if they are instances of the same word or 0 otherwise. We always alternate between positive and negative examples, even though we expect mostly negative labels during inference. We do so for 800,000 steps, even though convergence shows much earlier, since the MSML networks contain only around 40,000 parameters each.

For the actual experiment we perform random sampling, since sampling every possible combination of queries and supports is not feasible. This means for any N and k, a random query and k random supports for N classes are selected. Then all four MSML agents are applied to the same randomized setup, as well as the unmodified system as baseline in the second experiment. We get one result for every combination of N, k and agent per run. The random sampling is performed a total of 10 times and the results are averaged.
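A minimal PyTorch sketch of the metric space encoder and its pairwise training step as described above; the input dimensionality (512, the ASR encoder output), the contrastive-loss margin and the batching are assumptions made for illustration, not reported settings.

```python
import torch
import torch.nn as nn

class MetricSpaceEncoder(nn.Module):
    """Two 1D convolutions (kernel 3, 20 filters, tanh), max pooling in
    between, and a global max pool over the sequence axis."""
    def __init__(self, in_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 20, kernel_size=3), nn.Tanh(),
            nn.MaxPool1d(2),
            nn.Conv1d(20, 20, kernel_size=3), nn.Tanh(),
        )

    def forward(self, x):            # x: (batch, in_dim, frames)
        h = self.net(x)              # (batch, 20, frames')
        return h.amax(dim=-1)        # global max pool -> (batch, 20)

def contrastive_loss(a, b, label, margin=1.0):
    # label 1: same word, pull embeddings together;
    # label 0: push apart up to the (assumed) margin.
    dist = nn.functional.pairwise_distance(a, b)
    return (label * dist.pow(2) +
            (1 - label) * torch.clamp(margin - dist, min=0).pow(2)).mean()

encoder = MetricSpaceEncoder()
opt = torch.optim.RMSprop(encoder.parameters(), lr=0.001, alpha=0.9)

def training_step(x1, x2, label):
    # x1, x2: ASR-encoded windows of a sampled pair, label in {0, 1};
    # positive and negative pairs are alternated during training.
    opt.zero_grad()
    loss = contrastive_loss(encoder(x1), encoder(x2), label)
    loss.backward()
    opt.step()
    return loss.item()
```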
The first experiment aims to explore the impact that k and N have on the performance of the various architectures for spotting keywords in full utterances. The performance is measured with the F score, which is the harmonic mean of precision and recall.
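For reference, this is the standard F score over the binary spotting decisions; a small sketch (how decisions and ground truth are paired up is a bookkeeping assumption, not the paper's evaluation code):

```python
def f_score(decisions, truths):
    """decisions, truths: parallel lists of booleans (spotted / actually present)."""
    tp = sum(d and t for d, t in zip(decisions, truths))
    fp = sum(d and not t for d, t in zip(decisions, truths))
    fn = sum(t and not d for d, t in zip(decisions, truths))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```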
Fig. 2. Spotting F scores across different N and agents for k=1 (x-axis: N-way; y-axis: F score; curves: Siamese, Relation, Prototypical, Matching)
Fig. 3. Spotting F scores across different N and agents for k=4 (x-axis: N-way; y-axis: F score; curves: Siamese, Relation, Prototypical, Matching)

The results on the test set (as seen in figures 2 and 3) show Matching and Relation Networks achieving the highest scores. Siamese Networks only take the top spot when given enough supports for each task. The drop in precision and thus in F score is severe for an increase in N; however, considering that the random baseline decreases from 0.5 down to 0.032, the results are still well above chance. Prototypical Networks do not seem to perform too well overall for this kind of task on the test set, which we suspect is due to some supports disturbing the integrity of the prototype.

The second experiment focuses on the ASR performance and the impact that the meta learning agents' results have on it. This is measured using the WER of the ASR with and without the meta learning agents as baseline (called Vanilla here). Inconsistencies in the Vanilla WERs originate from the randomized setup used to ensure comparability of the approaches within each N and k. The WER in this evaluation only counts mistakes that originate from the keywords. All other mistakes are ignored, as the meta learning agents have no effect on them. The reranking in this setup considers all of the hypotheses that the beam search of the decoder (beamsize = 4) produces.

Table 1. Word Error Rates in % across N (1, 5, 10, 15, 20, 25, 30), k (1, 4) and approaches (Vanilla, Siamese, Relation, Prototypical, Matching)

As can be seen in table 1, the meta learning agents help the ASR slightly in most cases for any N and k. Interestingly, there is very little difference between the meta learning agents. One more remarkable observation is that the Matching Networks seem to be the only cases where the WER actually increases over the vanilla system with no augmentations. This is probably due to them having low precision but very high recall in the first experiment, and thus a lot of false positives.
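To clarify the keyword-restricted evaluation, the following is a deliberately simplified proxy: it counts the fraction of keyword occurrences in the references that are missing from the corresponding hypotheses. This is an assumption-laden stand-in for illustration, not the alignment-based WER restriction actually used in the paper.

```python
def keyword_error_rate(references, hypotheses, keywords):
    """Fraction of keyword occurrences in the references that do not appear
    in the corresponding hypotheses (simplified proxy, not the paper's metric)."""
    total, errors = 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        for kw in keywords:
            count = ref_tokens.count(kw.lower())
            total += count
            errors += max(0, count - hyp_tokens.count(kw.lower()))
    return errors / total if total else 0.0
```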
5. DISCUSSION

Impact of Transformer-based Embeddings
To begin, we want to address insights into the usefulness of the Transformer encoder as embedding function. The results on the development set and the results on the test set for both experiments are remarkably close, even though the data is severely different. This good performance across corpora hints at the Transformer ASR embeddings working as intended. In comparison, when the same MSML nets are trained on Mel frequency buckets directly, the performance shows a massive drop from the development to the test set.
The Spotting Process
To provide some insight into the spotting process over a whole signal, figure 4 shows the Siamese similarity score over time given a support of the word Prometheus. At the point in the sentence where the keyword actually appears, the score peaks above 0.95. In most longer signals, however, there are random peaks aside from this main peak. Looking at the transcriptions, unfortunately no specific reason for them becomes apparent. The margin between an actual peak and a random peak across signals is, however, big enough for a consistent threshold to be set (we choose 0.8).

Fig. 4. Siamese similarity score over frames of the utterance "Jupiter had bound Prometheus on Mount Caucasus", given one support for the keyword Prometheus
Impact of N, k and Support Quality on MSML Agents
Next, we want to give some insights into cases of MSML agent performance which we find interesting. The MSML agents outperform the random baseline by a huge margin for any N and k, with the performance decreasing for large N. The Relation Network manages to perform well even with few supports and, surprisingly, even in cases where the only support it is given is pronounced slightly wrong. The Prototypical and Siamese Networks greatly benefit from more supports; however, the prototype is sometimes disrupted by heavily mispronounced names in the supports. This is not the case on the development set, where the Siamese and the Prototypical approach actually yield the best results, together with the Relation approach. So robustness against imperfect supports appears to be important for close-to-real-world scenarios.
Impact of Imperfect Speech on ASR
One interesting insight about the keyword spotting ASR interface concerns the cases where the regular ASR tends to make mistakes. Looking into the data and the hypotheses of the ASR, we find that usually the most probable ASR hypothesis matches exactly what the speaker of the utterance articulated. This is, however, not always the desired transcription, since speakers tend to be sloppy, especially with names, or mispronounce them entirely.
Impact of the Keyword Spotting Integration
Another interesting observation can be made in the cases where the hypothesis was not changed, despite the MSML agents correctly spotting a keyword. This happens quite frequently for names that follow phonotactics very different from English, since they do not appear in the hypotheses at all and can thus not be reranked. At the same time this is also a good thing, since especially the Matching Network often spotted keywords which were not present. The hypothesis does not change then, because the word in question simply does not appear in the selection. So the ASR hypotheses act as a filter on the MSML agents and the MSML agents act as a filter on the ASR hypotheses.
6. CONCLUSIONS
In this paper we introduce the idea of using a Transformer ASR encoder as an embedding function, which enables stable, good performance across severely different corpora. We also discuss a setup that enables four commonly used MSML approaches to perform keyword recognition on arbitrary reference samples in continuous signals. Finally, we introduce a system that combines ASR with keyword spotting to enable modular and simple fine-tuning of neural E2E ASR to expected vocabulary. Experiments show a keyword recognition performance much higher than chance across different setups and an improvement in WER, indicating that the fine-tuning works despite the mechanism being a rather simple proof of concept.
7. REFERENCES

[1] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[2] Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, and Shujie Liu, "On the comparison of popular end-to-end models for large scale speech recognition," 2020.
[3] Jinxi Guo, Tara Sainath, and Ron Weiss, "A spelling correction model for end-to-end speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[4] Antoine Laurent, Sylvain Meignier, and Paul Deléglise, "Improving recognition of proper nouns in ASR through generating and filtering phonetic transcriptions," Computer Speech & Language, vol. 28, July 2014.
[5] Cal Peyser, Tara N. Sainath, and Golan Pundak, "Improving proper noun recognition in end-to-end ASR by customization of the MWER loss criterion," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7789–7793.
[6] Uri Alon, Golan Pundak, and Tara N. Sainath, "Contextual speech recognition with difficult negative training examples," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6440–6444.
[7] Jürgen Schmidhuber, Evolutionary principles in self-referential learning, or on learning how to learn, Ph.D. thesis, Technische Universität München, 1987.
[8] J. Goldberger, S. Roweis, Geoffrey E. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in NIPS, 2004.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[10] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., "A comparative study on Transformer vs RNN in speech applications," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 449–456.
[11] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, "Neural speech synthesis with Transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6706–6713.
[12] Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà, "End-to-end speech translation with the Transformer," in IberSPEECH, 2018, pp. 60–63.
[13] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[14] Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.
[15] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
[16] Yu-An Chung and James Glass, "Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech," in Proc. Interspeech 2018, 2018, pp. 811–815.
[17] IBM, "MAX-Audio-Embedding-Generator," Sept. 2019.
[18] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah, "Signature verification using a "siamese" time delay neural network," in Advances in Neural Information Processing Systems, 1994, pp. 737–744.
[19] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[20] Jake Snell, Kevin Swersky, and Richard Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[21] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[22] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[23] Pete Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," arXiv e-prints, 2018.
[24] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[25] Jia Xin Koh, Aqilah Mislan, Kevin Khoo, Brian Ang, Wilson Ang, Charmaine Ng, and YY Tan, "Building the Singapore English National Speech Corpus," Malay, vol. 20, no. 25.0, pp. 19–3, 2019.
[26] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in Interspeech, 2018, pp. 2207–2211.
[27] Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan, "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7654–7658.
[28] Pavel Denisov and Ngoc Thang Vu, "IMS-Speech: A speech to text tool," Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019, pp. 170–177, 2019.
[29] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," Predicting Structured Data, vol. 1, no. 0, 2006.
[30] Raia Hadsell, Sumit Chopra, and Yann LeCun, "Dimensionality reduction by learning an invariant mapping," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE, 2006.