Speech Enhancement for Wake-Up-Word detection in Voice Assistants
David Bonet, Guillermo Cámbara, Fernando López, Pablo Gómez, Carlos Segura, Jordi Luque
Universitat Politècnica de Catalunya (UPC), Spain · Universitat Pompeu Fabra (UPF), Spain · Telefónica, Spain · Telefónica Research, Spain
[email protected]
Abstract
Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants. A very common issue of voice assistants is that they are easily activated by background noise such as music, TV or background speech that accidentally triggers the device. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises. The SE model is a fully-convolutional denoising auto-encoder at waveform level and is trained using log-Mel Spectrogram and waveform reconstruction losses together with the BCE loss of a simple WUW classification network. A new database has been purposely prepared for the task of recognizing the WUW in challenging conditions, containing negative samples that are phonetically very similar to the keyword. The database is extended with public databases and an exhaustive data augmentation to simulate different noises and environments. The results obtained by concatenating the SE with simple and state-of-the-art WUW detectors show that the SE does not have a negative impact on the recognition rate in quiet environments, while increasing the performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.
Index Terms: keyword spotting, speech enhancement, wake-up-word, deep learning, convolutional neural network
1. Introduction
Voice interaction with devices is becoming ubiquitous. Most of them use a mechanism to avoid the excessive usage of resources, a trigger word detector. This ensures the efficient use of resources, using a Speech-To-Text tool only when needed and with the consequent start of a conversation. It is key to only start this conversation when the user is addressing the device; otherwise the user experience is notoriously degraded. Thus, the wake-up-word detection system must be robust enough to avoid wake-ups with TV, music, speech and sounds that do not contain the key phrase.

A common approach to reduce the impact of this type of noise in the system is the adoption of speech enhancement algorithms. Speech enhancement is the task of improving the perceptual intelligibility and quality of speech by removing background noise [1]. Its main applications are in the field of mobile and internet communications [2] and in hearing aids [3], but SE has also been applied successfully to automatic speech recognition systems [4, 5, 6].

Traditional SE methods involve a characterization step of the noise spectrum, which is then used to try to reduce the noise from the regenerated speech signal. Examples of these approaches are spectral subtraction [3], Wiener filtering [7] and subspace algorithms [8]. One of the main drawbacks of the classical approaches is that they are not very robust against non-stationary noises or other types of noises that can mask speech, like background speech. In recent years, Deep Learning approaches have been widely applied to SE at the waveform level [9, 10] and at the spectral level [6, 11]. In the first case, a common architecture falls within the encoder-decoder paradigm. In [12], the authors proposed a fully convolutional generative adversarial network architecture structured as an auto-encoder with U-Net-like skip connections. Other recent work [13] proposes a similar architecture at the waveform level that includes an LSTM between the encoder and the decoder and is trained directly with a regression loss combined with a spectrogram-domain loss.

Inspired by these recent models, we propose a similar SE auto-encoder architecture in the time domain that is optimized not only by minimizing waveform and Mel-spectrogram regression losses, but also includes a task-dependent classification loss provided by a simple WUW classifier acting as a Quality-Net [14]. This last term serves as a task-dependent objective quality measure that trains the model to enhance important speech features that might otherwise be degraded.
2. Speech Enhancement
Speech enhancement is interesting for triggering-phrase detection since it tries to remove noise that could trigger the device, and at the same time improves speech quality and intelligibility for a better detection. In this case, we try to tackle the most common noisy environments where voice assistants are used: TV, music, background conversations, office noise and living room noise. Some of these types of background noise, such as TV and background conversations, are the most likely to trigger the voice assistant and are also the most challenging to remove.
2.1. Model architecture and training

Our model has a fully-convolutional denoising auto-encoder architecture with skip connections (Fig. 1), which has proven to be very effective in SE tasks [12], working end-to-end at waveform level. In training, we input a noisy audio x ∈ ℝ^T, comprised of a clean speech signal y ∈ ℝ^T and background noise n ∈ ℝ^T, so that x = λy + (1 − λ)n.

The encoder compresses the input signal and expands the number of channels. It is composed of six convolutional blocks (ConvBlock1D), each consisting of a convolutional layer followed by an instance normalization and a rectified linear unit (ReLU). Kernel size K = 4 and stride S = 2 are used, except in the first layer where K = 7 and S = 1. The compressed signal goes through an intermediate stage where the shape is preserved, consisting of three residual blocks (ResBlock1D), each formed by two ConvBlock1D with K = 3 and S = 1, where a skip connection is added from the input of the residual block to its output. The last stage of the SE model is the decoder, where the original shape of the raw audio is recovered at the output. Its architecture follows the inverse structure of the encoder, where deconvolutional blocks (DeconvBlock1D) replace the convolutional layers of the ConvBlock1D with transposed convolutional layers. Skip connections from the encoder blocks to the decoder blocks are also used to ensure low-level detail when reconstructing the waveform.

We use a regression loss function (L1 loss) at raw waveform level together with another L1 loss over the log-Mel Spectrogram, as proposed in [15], to reconstruct a "cleaned" signal ŷ at the output. Finally, we include the classification loss (BCE loss) when training the SE model jointly with the classifier or when concatenating a pretrained classifier at its output. Thus, we also optimize the SE model for the specific task of WUW classification. Our final loss function is defined as a linear combination of the three losses:

L_T = α L_raw(y, ŷ) + β L_spec(S(y), S(ŷ)) + γ L_BCE    (1)

where α, β and γ are hyperparameters weighting each loss term, and S(·) denotes the log-Mel Spectrogram of the signal, which is computed using 512 FFT bins, a window of 20 ms with 10 ms of shift and 40 filters in the Mel scale.

Figure 1: End-to-end SE model at waveform level concatenated with a classifier. Log-Mel Spectrogram and waveform reconstruction losses of the SE model can be used together with the BCE loss of the classifier as a linear combination to train the model.
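For illustration, a minimal PyTorch sketch of the combined objective in Eq. (1) is shown below. It assumes 16 kHz audio, the log-Mel settings stated above (512 FFT bins, 20 ms window, 10 ms shift, 40 Mel filters), and a classifier that outputs raw logits; the names CombinedWUWLoss, wuw_logits and wuw_labels are illustrative placeholders, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F
import torchaudio


class CombinedWUWLoss(torch.nn.Module):
    """Linear combination of waveform L1, log-Mel L1 and WUW BCE losses, as in Eq. (1)."""

    def __init__(self, alpha=1.0, beta=1.0, gamma=1.0, sample_rate=16000):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # 512 FFT bins, 20 ms window, 10 ms shift, 40 Mel filters (values stated in the paper)
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=512,
            win_length=int(0.020 * sample_rate), hop_length=int(0.010 * sample_rate),
            n_mels=40)

    def log_mel(self, wav):
        # Small offset avoids log(0) on silent frames
        return torch.log(self.melspec(wav) + 1e-5)

    def forward(self, y_hat, y, wuw_logits=None, wuw_labels=None):
        loss = self.alpha * F.l1_loss(y_hat, y)                                     # waveform L1
        loss = loss + self.beta * F.l1_loss(self.log_mel(y_hat), self.log_mel(y))   # log-Mel L1
        if wuw_logits is not None:                                                  # task-dependent BCE term
            loss = loss + self.gamma * F.binary_cross_entropy_with_logits(wuw_logits, wuw_labels)
        return loss
```

Setting γ = 0 yields the reconstruction-only training, while α = β = 0 reduces to training the classifier alone, in line with the configurations listed later in Section 3.4.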
3. Methodology
3.1. Database

The database used for conducting the experiments presented here consists of WUW samples labeled as positive, and other non-WUW samples labeled as negative. Since the chosen keyword is "OK Aura", which triggers Telefónica's home assistant, Aura, positive samples are drawn from the company's in-house databases. Some of the negative samples have also been recorded in such databases, but we also add speech and other types of acoustic events from external data sets, so the models gain robustness with further data. Information about all the data used is detailed in this section.
3.1.1. In-house recordings

In a first round, around 4300 WUW samples from 360 speakers were recorded, resulting in 2.8 hours of audio. Furthermore, office ambient noise was recorded as well, with the aim of having samples for noise data augmentation. A second data collection round was carried out in order to study and improve some sensitive cases where WUW modules typically underperform. For instance, this dataset contains rich metadata about positive and negative utterances, like room distance, speech accent, emotion, age or gender. Furthermore, the negative utterances contain words phonetically similar to "OK Aura", since these are the most ambiguous for a classifier to recognize. Detailed information about data acquisition is explained in the following subsection.
3.1.2. Data collection form

A web-based Jotform form has been designed for data collection (https://form.jotform.com/201694606537056). The form is open and still receiving input data from volunteers, so readers are also invited to contribute to the dataset. Up to the date of this work, 1096 samples from 80 speakers have been recorded, amounting to 1.2 hours of audio. Volunteers are asked to pronounce various scripted utterances at a close distance and also at two meters from the device microphone. The similarity levels are the following:

1. Exact WUW, in an isolated manner: "OK Aura."
2. Exact WUW, in a context: "Perfecto, voy a mirar qué dan hoy. OK Aura." (Perfect, I'll check what's on today. OK Aura.)

3. Contains "Aura": "Hay un aura de paz y tranquilidad." (There is an aura of peace and tranquility.)

4. Contains "OK": "OK, a ver qué ponen en la tele." (OK, let's see what's on TV.)

5. Contains similar word units to "Aura": "Hola Laura." (Hello Laura.)

6. Contains similar word units to "OK": "Prefiero el hockey al baloncesto." (I prefer hockey to basketball.)

7. Contains similar word units to "OK Aura": "Porque Laura, ¿qué te pareció la película?" (Because Laura, what did you think of the movie?)

3.1.3. External data
General negative examples have been randomly chosen from the publicly available Spanish Common Voice corpus [16], which currently holds over 300 hours of validated audio. However, we keep a 10:1 ratio between negative and positive samples, since such a ratio proved to yield good results in [17], thus avoiding bigger ratios that lead to increasing computational times. Therefore, we have used a Common Voice partition consisting of 55 h for training, 7 h for development and 7 h for testing.

Background noises were selected from various public datasets according to different use case scenarios: living room background noise (HOME-LIVINGB) from the QUT-NOISE database [18], TV audios from the IberSpeech-RTVE Challenge [19], and music and conversations from free libraries such as https://freemusicarchive.org/.

All the audio samples are monaural signals stored in Waveform Audio File Format (WAV) with a sampling rate of 16 kHz. The collected speech data was processed with a Speech Activity Detection (SAD) module producing timestamps where speech occurs. For this purpose, the pyannote.audio toolkit [20] has been used, trained with the AMI corpus [21]. This helped us to use only the valid speech segments of the audios we collected.

We mainly used two features to train the models: Mel-Frequency Cepstral Coefficients (MFCCs) and the log-Mel Spectrogram. The MFCCs were obtained by first filtering the audio with a band-pass filter (20 Hz to 8 kHz) and then extracting the first thirteen coefficients with a window size of 100 milliseconds and a frame shift of 50 milliseconds. The procedure to extract the log-Mel Spectrogram (S(·)) is detailed in Section 2.1.

Train, development and test partitions are split ensuring that neither speaker nor background noise is repeated between partitions, trying to maintain an 80-10-10 proportion, respectively. The total data, containing internal and external datasets, consists of 50,737 non-WUW samples and 4,651 WUW samples.

3.2. Data augmentation

Several Room Impulse Responses (RIR) were created based on the Image Source Method (ISM) [22], for a room of dimensions (Lx, Ly, Lz) drawn from predefined ranges, with microphone and source randomly located at any (x, y) point within a constrained height range. Every TV and music original recording was convolved with different RIRs to simulate the signal picked up by the microphone of the device in the room.

Adding background noise to clean speech signals is the main data augmentation technique used in the training stage. We use background noises of different scenarios (TV, music, background conversations, office noise and living room noise) and a wide range of SNRs to improve the performance of the models against noisy environments. In each epoch, we create different noisy samples by randomly selecting a background noise sample for each speech event and combining them with a randomly chosen SNR in a specified range.
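A minimal sketch of this noise-mixing augmentation is given below, assuming both signals are equal-length 16 kHz mono NumPy arrays. The paper formulates the mixture as x = λy + (1 − λ)n; here the noise is simply rescaled to reach a target SNR, which is equivalent up to an overall gain. The function name mix_at_snr and the SNR sampling range shown are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    speech_power = np.mean(speech ** 2) + eps
    noise_power = np.mean(noise ** 2) + eps
    # Gain such that 10*log10(speech_power / (gain**2 * noise_power)) = snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: one fresh noisy sample per epoch, with an SNR drawn from a specified range
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder clean speech segment (1 s at 16 kHz)
noise = rng.standard_normal(16000)    # placeholder background noise segment
snr_db = rng.uniform(-5.0, 30.0)      # illustrative range only; the paper's exact bounds differ
noisy = mix_at_snr(speech, noise, snr_db)
```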
3.3. Wake-up-word models

With the aim of assessing the quality of the trained SE models, we use several trigger word detection classifiers, reporting the impact of the SE module on WUW classification performance. The WUW classifiers used here are: a LeNet, a well-known standard classifier that is easy to optimize [23]; Res15, Res15-narrow and Res8, based on a reimplementation by Tang and Lin [24] of Sainath and Parada's Convolutional Neural Networks (CNNs) for keyword spotting [25], using residual learning techniques with dilated convolutions [26]; SGRU and SGRU2, two Recurrent Neural Network (RNN) models based on the open-source tool named Mycroft Precise [27], a lightweight wake-up-word detection tool implemented in TensorFlow, of which these are two bigger variations that we have implemented in PyTorch; and CNN-FAT2019, a CNN architecture adapted from a kernel [28] in Kaggle's FAT 2019 competition [29], which has shown good performance in tasks like audio tagging or detection of gender, identity and speech events from pulse signals [30].
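None of the architectures above is reproduced here; purely as an interface illustration, the sketch below shows the kind of small binary classifier (log-Mel input, single logit output for the BCE term) that can be concatenated after the SE model. The class and layer choices are placeholders, not any of the models listed in this section.

```python
import torch
import torch.nn as nn

class TinyWUWClassifier(nn.Module):
    """Minimal stand-in WUW classifier: 40-band log-Mel input, one WUW/non-WUW logit out."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # pool over frequency and time
        )
        self.fc = nn.Linear(32, 1)        # single logit: WUW vs. non-WUW

    def forward(self, log_mel):
        # log_mel: (batch, n_mels, frames) -> add a channel dimension for Conv2d
        h = self.conv(log_mel.unsqueeze(1)).flatten(1)
        return self.fc(h).squeeze(1)      # raw logit, intended for BCEWithLogitsLoss
```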
Figure 2: Macro F1-score box plot for different SNR ranges. Classifiers trained with low noise (SNR range starting at 5 dB).

Figure 3: Macro F1-score box plot for different SNR ranges. Classifiers trained with a very wide range of noise.

3.4. Experimental setup

Speech signals and background noises are combined randomly following the procedure explained in Section 3.2 with a given SNR range. The SE model is trained to cover a wide SNR range, whereas WUW models are trained to cover two scenarios: a classifier trained with the same SNR range as the SE model, and a classifier less aware of noise, trained with a narrower SNR range starting at 5 dB. This way, it is possible to study the impact of the SE model depending on whether the classifier has been trained with more or less noise.

Data imbalance is addressed by balancing the classes in each batch using a weighted sampler. We use a fixed window length of 1.5 seconds based on the annotated timestamps for our collected database, and random cuts for the rest of the Common Voice samples.

All the models are trained with early stopping based on the validation loss with 10 epochs of patience. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 50. Loss (1) allows training the models in multiple ways, and we define different SE models and classifiers based on the loss terms used:

a) Classifier: we remove the auto-encoder from the architecture (Fig. 1) and train any of the classifiers using the noisy audio as input: α = β = 0 and γ = 1.

b) SE model (SimpleSE): we remove the classifier from the architecture and optimize the auto-encoder based on the reconstruction losses only: α = β = 1 and γ = 0.

c) SE model + frozen classifier (FrozenSE): operations of the classifier are dropped from the backward graph for gradient calculation, optimizing only the SE model for a given pretrained classifier (LeNet): α = β = γ = 1.

d) SE model + classifier (JointSE): auto-encoder and LeNet are trained jointly using the three losses: α = β = γ = 1.

All the models take as input windows of 1.5 seconds of audio, to ensure that common WUW utterances fit fully within them, since the average "OK Aura" is about 0.8 seconds long. Therefore, we perform an atomic test evaluating whether a single window contains the WUW or not. Both negative and positive samples are assigned a background noise sample with which they are combined at a random SNR within certain ranges, as described above.

Given the output scores of the models, the threshold to decide whether a test sample is a WUW or not is chosen as the one yielding the biggest difference between true and false positive rates, based on Youden's J statistic [31]. Once the threshold is decided, the macro F1-score is computed in order to balance WUW/non-WUW proportions in the results. We average such scores across all the WUW classifiers described in Section 3.3, for every SNR range.
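As an illustration of this threshold selection, a small sketch using scikit-learn's ROC utilities is shown below. It assumes arrays of binary labels and model scores; the function name and the toy data are hypothetical and this is not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score

def youden_threshold(labels, scores):
    """Pick the decision threshold maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical example: scores from a WUW classifier on a development set
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])
thr = youden_threshold(labels, scores)
macro_f1 = f1_score(labels, (scores >= thr).astype(int), average="macro")
print(f"threshold={thr:.2f}, macro F1={macro_f1:.3f}")
```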
4. Results
Figure 2 illustrates the improvement of WUW detection in noisy scenarios when concatenating our FrozenSE model with all the WUW classifiers described in Section 3.3 trained with low noise (SNR range starting at 5 dB), as could be found in simple voice assistant systems. Applying SE in quiet scenarios maintains fairly good results, and improves them in lower SNR ranges.

If we train the classifiers with more data augmentation (the very wide SNR range), the baseline classifier results for noisier scenarios improve. Results using the FrozenSE do not decrease, but the improvement in ranges of severe noise is not as large as in Figure 2; see Figure 3.

In Section 3.4 we defined the parameters of the loss function (1) used to train a classifier (case a)), and the different approaches used to train the SE model, either standalone (b), c)) or in conjunction with the classifier (d)). In Figure 4 we can see how JointSE performs better than all the other cases in almost every SNR range. From 40 dB to 10 dB of SNR, the results are very similar for the four models. In contrast, in the noisiest ranges we can see that the classifier without an SE model is the worst performer, followed by the SimpleSE case, where only the waveform and spectral reconstruction losses are used. We found that the FrozenSE case, which includes the classification loss in the training stage, improves the results for the wake-up-word detection task. However, the best results are obtained with the JointSE case, where the SE model and LeNet are trained jointly using all three losses.

Figure 4: Comparison of different training methods for the SE models and LeNet classifier, in terms of the macro F1-score for different SNR ranges. All models trained with the very wide SNR range.

We compared the WUW detection results of our JointSE with other state-of-the-art SE models (SEGAN [12] and Denoiser [13]), followed by a classifier (a data-augmented LeNet), in different noise scenarios. In Table 1, it can be observed that when training the models together with the task loss, the results in our setup are better than with other more powerful but more general SE models, since there is no mismatch between the SE and the classifier in the end-to-end system and it is also more adapted to common home noises. JointSE improves the detection over the no-SE-model case, especially in scenarios with background conversations, loud office noise or loud TV; see Table 2.
Table 1: Macro F1-score when enhancing the noisy audios with state-of-the-art SE models and using a LeNet as the classifier.
SNR [dB]             No SE    SEGAN    Denoiser    JointSE
Clean [20, ·]        0.980    0.964    ·           ·
Noisy [10, ·]        0.969    0.940    ·           ·
Very noisy [0, ·]    0.869    0.798    ·           ·
Table 2: Macro F1-score percentage difference between JointSE and a LeNet without any SE module, for different background noise types. Positive values mean that the JointSE score is bigger than the single LeNet's.
SNR [dB]             Music    TV    Office    Living Room    Conversations
Clean [20, ·]        ·        ·     ·         ·              ·
Noisy [10, ·]        ·        ·     ·         ·              ·
Very noisy [0, ·]    ·        ·     ·         ·              ·
5. Conclusions
In this paper we proposed a SE model adapted to the task of WUW detection in voice assistants for the home environment. The SE model is a fully-convolutional denoising auto-encoder at waveform level, and it is trained using log-Mel Spectrogram and waveform regression losses together with a task-dependent WUW classification loss. Results show that for clean and slightly noisy conditions, SE in general does not bring a substantial improvement over a classifier trained with proper data augmentation, but in the case of very noisy conditions SE does improve the performance, especially when the SE and WUW detector are trained jointly end-to-end.

6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[2] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.
[3] L.-P. Yang and Q.-J. Fu, "Spectral subtraction-based speech enhancement for cochlear implant patients in background noise," The Journal of the Acoustical Society of America, vol. 117, no. 3, pp. 1001–1004, 2005.
[4] C. Zorilă, C. Boeddeker, R. Doddipatla, and R. Haeb-Umbach, "An investigation into the effectiveness of enhancement in ASR training and test for CHiME-5 dinner party transcription," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 47–53.
[5] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[6] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
[7] J. Meyer and K. U. Simmer, "Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. IEEE, 1997, pp. 1167–1170.
[8] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.
[9] D. Rethage, J. Pons, and X. Serra, "A Wavenet for speech denoising," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
[10] H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chén, P. Koch, M. De Vos, and A. Mertins, "Improving GANs for speech enhancement," arXiv preprint arXiv:2001.05532, 2020.
[11] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[12] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[13] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," arXiv preprint arXiv:2006.12847, 2020.
[14] S.-W. Fu, C.-F. Liao, and Y. Tsao, "Learning with learned loss function: Speech enhancement with Quality-Net to improve perceptual evaluation of speech quality," IEEE Signal Processing Letters, vol. 27, pp. 26–30, 2019.
[15] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[16] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[17] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, "Mining effective negative training samples for keyword spotting," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7444–7448.
[18] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," Proceedings of Interspeech 2010, 2010.
[19] E. Lleida, A. Ortega, A. Miguel, V. Bazán-Gil, C. Pérez, M. Gómez, and A. de Prada, "Albayzin 2018 evaluation: the IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media," Applied Sciences, vol. 9, no. 24, p. 5412, 2019.
[20] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, "pyannote.audio: neural building blocks for speaker diarization," in ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[21] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
[22] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[23] Y. LeCun et al., "LeNet-5, convolutional neural networks," URL: http://yann.lecun.com/exdb/lenet, vol. 20, no. 5, p. 14, 2015.
[24] R. Tang and J. Lin, "Honk: A PyTorch reimplementation of convolutional neural networks for keyword spotting," CoRR, vol. abs/1710.06554, 2017. [Online]. Available: http://arxiv.org/abs/1710.06554
[25] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[26] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," CoRR, vol. abs/1710.10361, 2017.
[27] MycroftAI, "Mycroft Precise," https://github.com/MycroftAI/mycroft-precise.
[29] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, and X. Serra, "Audio tagging with noisy labels and minimal supervision," arXiv preprint arXiv:1906.02975, 2019.
[30] G. Cámbara, J. Luque, and M. Farrús, "Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7564–7568.
[31] W. J. Youden, "Index for rating diagnostic tests," Cancer, vol. 3, no. 1, pp. 32–35, 1950.