Speech Enhancement for Wake-Up-Word detection in Voice Assistants
David Bonet, Guillermo Cámbara, Fernando López, Pablo Gómez, Carlos Segura, Jordi Luque
Universitat Politècnica de Catalunya (UPC), Spain · Universitat Pompeu Fabra (UPF), Spain · Telefónica, Spain · Telefónica Research, Spain
[email protected]
Abstract
Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants. A very common issue of voice assistants is that they are easily activated by background noise such as music, TV or background speech that accidentally triggers the device. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises. The SE model is a fully-convolutional denoising auto-encoder at waveform level and is trained using log-Mel Spectrogram and waveform reconstruction losses together with the BCE loss of a simple WUW classification network. A new database has been purposely prepared for the task of recognizing the WUW in challenging conditions, containing negative samples that are phonetically very similar to the keyword. The database is extended with public databases and an exhaustive data augmentation to simulate different noises and environments. The results obtained by concatenating the SE with simple and state-of-the-art WUW detectors show that the SE does not have a negative impact on the recognition rate in quiet environments, while increasing the performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.
Index Terms: keyword spotting, speech enhancement, wake-up-word, deep learning, convolutional neural network
1. Introduction
Voice interaction with devices is becoming ubiquitous. Most of them use a mechanism to avoid the excessive usage of resources, a trigger word detector. This ensures the efficient use of resources, using a Speech-To-Text tool only when needed and with the consequent start of a conversation. It is key to only start this conversation when the user is addressing the device; otherwise the user experience is notoriously degraded. Thus, the wake-up-word detection system must be robust enough to avoid wake-ups with TV, music, speech and sounds that do not contain the key phrase.

A common approach to reduce the impact of this type of noise in the system is the adoption of speech enhancement algorithms. Speech enhancement is the task of improving the perceptual intelligibility and quality of speech by removing background noise [1]. Its main applications are in the field of mobile and internet communications [2] and in hearing aids [3], but SE has also been applied successfully to automatic speech recognition systems [4, 5, 6].

Traditional SE methods involve a characterization step of the noise spectrum, which is then used to try to reduce the noise from the regenerated speech signal. Examples of these approaches are spectral subtraction [3], Wiener filtering [7] and subspace algorithms [8]. One of the main drawbacks of the classical approaches is that they are not very robust against non-stationary noises or other types of noises that can mask speech, like background speech. In recent years, Deep Learning approaches have been widely applied to SE at the waveform level [9, 10] and at the spectral level [6, 11]. In the first case, a common architecture falls within the encoder-decoder paradigm. In [12], the authors proposed a fully convolutional generative adversarial network architecture structured as an auto-encoder with U-Net-like skip connections. Other recent work [13] proposes a similar architecture at the waveform level that includes an LSTM between the encoder and the decoder and is trained directly with a regression loss combined with a spectrogram-domain loss.

Inspired by these recent models, we propose a similar SE auto-encoder architecture in the time domain that is optimized not only by minimizing waveform and Mel-spectrogram regression losses, but also includes a task-dependent classification loss provided by a simple WUW classifier acting as a Quality-Net [14]. This last term serves as a task-dependent objective quality measure that trains the model to enhance important speech features that might otherwise be degraded.
2. Speech Enhancement
Speech enhancement is interesting for triggering-phrase detection since it tries to remove noise that could trigger the device, and at the same time improves speech quality and intelligibility for a better detection. In this case, we try to tackle the most common noisy environments where voice assistants are used: TV, music, background conversations, office noise and living room noise. Some of these types of background noise, such as TV and background conversations, are the most likely to trigger the voice assistant and are also the most challenging to remove.
2.1. Model architecture and training

Our model has a fully-convolutional denoising auto-encoder architecture with skip connections (Fig. 1), which has proven to be very effective in SE tasks [12], working end-to-end at waveform level. In training, we input a noisy audio x ∈ ℝ^T, comprised of a clean speech signal y ∈ ℝ^T and background noise n ∈ ℝ^T, so that x = λy + (1 − λ)n.

The encoder compresses the input signal and expands the number of channels. It is composed of six convolutional blocks (ConvBlock1D), each consisting of a convolutional layer followed by an instance normalization and a rectified linear unit (ReLU). Kernel size K = 4 and stride S = 2 are used, except in the first layer where K = 7 and S = 1. The compressed signal goes through an intermediate stage where the shape is preserved, consisting of three residual blocks (ResBlock1D), each formed by two ConvBlock1D with K = 3 and S = 1, where a skip connection is added from the input of the residual block to its output. The last stage of the SE model is the decoder, where the original shape of the raw audio is recovered at the output. Its architecture follows the inverse structure of the encoder, where deconvolutional blocks (DeconvBlock1D) replace the convolutional layers of the ConvBlock1D with transposed convolutional layers. Skip connections from the encoder blocks to the decoder blocks are also used to ensure low-level detail when reconstructing the waveform.

We use a regression loss function (L1 loss) at raw waveform level together with another L1 loss over the log-Mel Spectrogram, as proposed in [15], to reconstruct a "cleaned" signal ŷ at the output. Finally, we include the classification loss (BCE loss) when training the SE model jointly with the classifier or when concatenating a pretrained classifier at its output. Thus, we also optimize the SE model for the specific task of WUW classification. Our final loss function is defined as a linear combination of the three losses:

L_T = α L_raw(y, ŷ) + β L_spec(S(y), S(ŷ)) + γ L_BCE    (1)

where α, β and γ are hyperparameters weighting each loss term, and S(·) denotes the log-Mel Spectrogram of the signal, which is computed using 512 FFT bins, a window of 20 ms with 10 ms of shift and 40 filters in the Mel scale.

Figure 1: End-to-end SE model at waveform level concatenated with a classifier. Log-Mel Spectrogram and waveform reconstruction losses of the SE model can be used together with the BCE loss of the classifier as a linear combination to train the model.
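For illustration, a minimal PyTorch sketch of the combined objective in Eq. (1) is shown below. It assumes 16 kHz audio, the log-Mel settings stated above (512 FFT bins, 20 ms window, 10 ms shift, 40 Mel filters), and a classifier that outputs raw logits; the names CombinedWUWLoss, wuw_logits and wuw_labels are illustrative placeholders, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F
import torchaudio


class CombinedWUWLoss(torch.nn.Module):
    """Linear combination of waveform L1, log-Mel L1 and WUW BCE losses, as in Eq. (1)."""

    def __init__(self, alpha=1.0, beta=1.0, gamma=1.0, sample_rate=16000):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # 512 FFT bins, 20 ms window, 10 ms shift, 40 Mel filters (values stated in the paper)
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=512,
            win_length=int(0.020 * sample_rate), hop_length=int(0.010 * sample_rate),
            n_mels=40)

    def log_mel(self, wav):
        # Small offset avoids log(0) on silent frames
        return torch.log(self.melspec(wav) + 1e-5)

    def forward(self, y_hat, y, wuw_logits=None, wuw_labels=None):
        loss = self.alpha * F.l1_loss(y_hat, y)                                     # waveform L1
        loss = loss + self.beta * F.l1_loss(self.log_mel(y_hat), self.log_mel(y))   # log-Mel L1
        if wuw_logits is not None:                                                  # task-dependent BCE term
            loss = loss + self.gamma * F.binary_cross_entropy_with_logits(wuw_logits, wuw_labels)
        return loss
```

Setting γ = 0 yields the reconstruction-only training, while α = β = 0 reduces to training the classifier alone, in line with the configurations listed later in Section 3.4.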
3. Methodology
3.1. Database

The database used for conducting the experiments presented here consists of WUW samples labeled as positive, and other non-WUW samples labeled as negative. Since the chosen keyword is "OK Aura", which triggers Telefónica's home assistant, Aura, positive samples are drawn from the company's in-house databases. Some of the negative samples have also been recorded in such databases, but we also add speech and other types of acoustic events from external data sets, so the models gain robustness with further data. Information about all the data used is detailed in this section.
3.1.1. In-house recordings

In a first round, around 4300 WUW samples from 360 speakers were recorded, resulting in 2.8 hours of audio. Furthermore, office ambient noise was recorded as well, with the aim of having samples for noise data augmentation. A second data collection round was carried out in order to study and improve some sensitive cases where WUW modules typically underperform. For instance, this dataset contains rich metadata about positive and negative utterances, like room distance, speech accent, emotion, age or gender. Furthermore, the negative utterances contain words phonetically similar to "OK Aura", since these are the most ambiguous for a classifier to recognize. Detailed information about data acquisition is explained in the following subsection.
3.1.2. Data collection form

A web-based Jotform form has been designed for data collection (https://form.jotform.com/201694606537056). The form is open and still receiving input data from volunteers, so readers are also invited to contribute to the dataset. Up to the date of this work, 1096 samples from 80 speakers have been recorded, amounting to 1.2 hours of audio. Volunteers are asked to pronounce various scripted utterances at a close distance and also at two meters from the device microphone. The similarity levels are the following:

1. Exact WUW, in an isolated manner: "OK Aura."
2. Exact WUW, in a context: "Perfecto, voy a mirar qué dan hoy. OK Aura." (Perfect, I'll check what's on today. OK Aura.)

3. Contains "Aura": "Hay un aura de paz y tranquilidad." (There is an aura of peace and tranquility.)

4. Contains "OK": "OK, a ver qué ponen en la tele." (OK, let's see what's on TV.)

5. Contains similar word units to "Aura": "Hola Laura." (Hello Laura.)

6. Contains similar word units to "OK": "Prefiero el hockey al baloncesto." (I prefer hockey to basketball.)

7. Contains similar word units to "OK Aura": "Porque Laura, ¿qué te pareció la película?" (Because Laura, what did you think of the movie?)

3.1.3. External data
General negative examples have been randomly chosen from the publicly available Spanish Common Voice corpus [16], which currently holds over 300 hours of validated audio. However, we keep a 10:1 ratio between negative and positive samples, since such a ratio proved to yield good results in [17], thus avoiding bigger ratios that lead to increasing computational times. Therefore, we have used a Common Voice partition consisting of 55 h for training, 7 h for development and 7 h for testing.

Background noises were selected from various public datasets according to different use case scenarios: living room background noise (HOME-LIVINGB) from the QUT-NOISE database [18], TV audios from the IberSpeech-RTVE Challenge [19], and music and conversations from free libraries such as https://freemusicarchive.org/.

All the audio samples are monaural signals stored in Waveform Audio File Format (WAV) with a sampling rate of 16 kHz. The collected speech data was processed with a Speech Activity Detection (SAD) module producing timestamps where speech occurs. For this purpose, the pyannote.audio toolkit [20] has been used, trained with the AMI corpus [21]. This helped us to use only the valid speech segments of the audios we collected.

We mainly used two features to train the models: Mel-Frequency Cepstral Coefficients (MFCCs) and the log-Mel Spectrogram. The MFCCs were obtained by first filtering the audio with a band-pass filter (20 Hz to 8 kHz) and then extracting the first thirteen coefficients with a window size of 100 milliseconds and a frame shift of 50 milliseconds. The procedure to extract the log-Mel Spectrogram (S(·)) is detailed in Section 2.1.

Train, development and test partitions are split ensuring that neither speaker nor background noise is repeated between partitions, trying to maintain an 80-10-10 proportion, respectively. The total data, containing internal and external datasets, consists of 50,737 non-WUW samples and 4,651 WUW samples.

3.2. Data augmentation

Several Room Impulse Responses (RIR) were created based on the Image Source Method (ISM) [22], for a room of dimensions (Lx, Ly, Lz) drawn from predefined ranges, with microphone and source randomly located at any (x, y) point within a constrained height range. Every TV and music original recording was convolved with different RIRs to simulate the signal picked up by the microphone of the device in the room.

Adding background noise to clean speech signals is the main data augmentation technique used in the training stage. We use background noises of different scenarios (TV, music, background conversations, office noise and living room noise) and a wide range of SNRs to improve the performance of the models against noisy environments. In each epoch, we create different noisy samples by randomly selecting a background noise sample for each speech event and combining them with a randomly chosen SNR in a specified range.
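A minimal sketch of this noise-mixing augmentation is given below, assuming both signals are equal-length 16 kHz mono NumPy arrays. The paper formulates the mixture as x = λy + (1 − λ)n; here the noise is simply rescaled to reach a target SNR, which is equivalent up to an overall gain. The function name mix_at_snr and the SNR sampling range shown are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    speech_power = np.mean(speech ** 2) + eps
    noise_power = np.mean(noise ** 2) + eps
    # Gain such that 10*log10(speech_power / (gain**2 * noise_power)) = snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: one fresh noisy sample per epoch, with an SNR drawn from a specified range
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder clean speech segment (1 s at 16 kHz)
noise = rng.standard_normal(16000)    # placeholder background noise segment
snr_db = rng.uniform(-5.0, 30.0)      # illustrative range only; the paper's exact bounds differ
noisy = mix_at_snr(speech, noise, snr_db)
```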
3.3. Wake-up-word models

With the aim of assessing the quality of the trained SE models, we use several trigger word detection classifiers, reporting the impact of the SE module on WUW classification performance. The WUW classifiers used here are: a LeNet, a well-known standard classifier that is easy to optimize [23]; Res15, Res15-narrow and Res8, based on a reimplementation by Tang and Lin [24] of Sainath and Parada's Convolutional Neural Networks (CNNs) for keyword spotting [25], using residual learning techniques with dilated convolutions [26]; SGRU and SGRU2, two Recurrent Neural Network (RNN) models based on the open-source tool named Mycroft Precise [27], a lightweight wake-up-word detection tool implemented in TensorFlow, of which these are two bigger variations that we have implemented in PyTorch; and CNN-FAT2019, a CNN architecture adapted from a kernel [28] in Kaggle's FAT 2019 competition [29], which has shown good performance in tasks like audio tagging or detection of gender, identity and speech events from pulse signals [30].
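None of the architectures above is reproduced here; purely as an interface illustration, the sketch below shows the kind of small binary classifier (log-Mel input, single logit output for the BCE term) that can be concatenated after the SE model. The class and layer choices are placeholders, not any of the models listed in this section.

```python
import torch
import torch.nn as nn

class TinyWUWClassifier(nn.Module):
    """Minimal stand-in WUW classifier: 40-band log-Mel input, one WUW/non-WUW logit out."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # pool over frequency and time
        )
        self.fc = nn.Linear(32, 1)        # single logit: WUW vs. non-WUW

    def forward(self, log_mel):
        # log_mel: (batch, n_mels, frames) -> add a channel dimension for Conv2d
        h = self.conv(log_mel.unsqueeze(1)).flatten(1)
        return self.fc(h).squeeze(1)      # raw logit, intended for BCEWithLogitsLoss
```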
Figure 2: Macro F1-score box plot for different SNR ranges. Classifiers trained with low noise (SNR range starting at 5 dB).

Figure 3: Macro F1-score box plot for different SNR ranges. Classifiers trained with a very wide range of noise.

3.4. Experimental setup

Speech signals and background noises are combined randomly following the procedure explained in Section 3.2 with a given SNR range. The SE model is trained to cover a wide SNR range, whereas WUW models are trained to cover two scenarios: a classifier trained with the same SNR range as the SE model, and a classifier less aware of noise, trained with a narrower SNR range starting at 5 dB. This way, it is possible to study the impact of the SE model depending on whether the classifier has been trained with more or less noise.

Data imbalance is addressed by balancing the classes in each batch using a weighted sampler. We use a fixed window length of 1.5 seconds based on the annotated timestamps for our collected database, and random cuts for the rest of the Common Voice samples.

All the models are trained with early stopping based on the validation loss with 10 epochs of patience. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 50. Loss (1) allows training the models in multiple ways, and we define different SE models and classifiers based on the loss terms used:

a) Classifier: we remove the auto-encoder from the architecture (Fig. 1) and train any of the classifiers using the noisy audio as input: α = β = 0 and γ = 1.

b) SE model (SimpleSE): we remove the classifier from the architecture and optimize the auto-encoder based on the reconstruction losses only: α = β = 1 and γ = 0.

c) SE model + frozen classifier (FrozenSE): operations of the classifier are dropped from the backward graph for gradient calculation, optimizing only the SE model for a given pretrained classifier (LeNet): α = β = γ = 1.

d) SE model + classifier (JointSE): auto-encoder and LeNet are trained jointly using the three losses: α = β = γ = 1.

All the models take as input windows of 1.5 seconds of audio, to ensure that common WUW utterances fit fully within them, since the average "OK Aura" is about 0.8 seconds long. Therefore, we perform an atomic test evaluating whether a single window contains the WUW or not. Both negative and positive samples are assigned a background noise sample with which they are combined at a random SNR within certain ranges, as described above.

Given the output scores of the models, the threshold to decide whether a test sample is a WUW or not is chosen as the one yielding the biggest difference between true and false positive rates, based on Youden's J statistic [31]. Once the threshold is decided, the macro F1-score is computed in order to balance WUW/non-WUW proportions in the results. We average such scores across all the WUW classifiers described in Section 3.3, for every SNR range.
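As an illustration of this threshold selection, a small sketch using scikit-learn's ROC utilities is shown below. It assumes arrays of binary labels and model scores; the function name and the toy data are hypothetical and this is not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score

def youden_threshold(labels, scores):
    """Pick the decision threshold maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical example: scores from a WUW classifier on a development set
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])
thr = youden_threshold(labels, scores)
macro_f1 = f1_score(labels, (scores >= thr).astype(int), average="macro")
print(f"threshold={thr:.2f}, macro F1={macro_f1:.3f}")
```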
4. Results
Figure 2 illustrates the improvement of WUW detection in noisy scenarios when concatenating our FrozenSE model with all the WUW classifiers described in Section 3.3 trained with low noise (SNR range starting at 5 dB), as could be found in simple voice assistant systems. Applying SE in quiet scenarios maintains fairly good results, and improves them in lower SNR ranges.

If we train the classifiers with more data augmentation (the very wide SNR range), the baseline classifier results for noisier scenarios improve. Results using the FrozenSE do not decrease, but the improvement in ranges of severe noise is not as large as in Figure 2; see Figure 3.

In Section 3.4 we defined the parameters of the loss function (1) used to train a classifier (case a)), and the different approaches used to train the SE model, either standalone (b), c)) or in conjunction with the classifier (d)). In Figure 4 we can see how JointSE performs better than all the other cases in almost every SNR range. From 40 dB to 10 dB of SNR, the results are very similar for the four models. In contrast, in the noisiest ranges we can see that the classifier without an SE model is the worst performer, followed by the SimpleSE case, where only the waveform and spectral reconstruction losses are used. We found that the FrozenSE case, which includes the classification loss in the training stage, improves the results for the wake-up-word detection task. However, the best results are obtained with the JointSE case, where the SE model and LeNet are trained jointly using all three losses.

Figure 4: Comparison of different training methods for the SE models and LeNet classifier, in terms of the macro F1-score for different SNR ranges. All models trained with the very wide SNR range.

We compared the WUW detection results of our JointSE with other state-of-the-art SE models (SEGAN [12] and Denoiser [13]), followed by a classifier (a data-augmented LeNet), in different noise scenarios. In Table 1, it can be observed that when training the models together with the task loss, the results in our setup are better than with other more powerful but more general SE models, since there is no mismatch between the SE and the classifier in the end-to-end system and it is also more adapted to common home noises. JointSE improves the detection over the no-SE-model case, especially in scenarios with background conversations, loud office noise or loud TV; see Table 2.
Table 1: Macro F1-score when enhancing the noisy audios with state-of-the-art SE models and using a LeNet as the classifier.
SNR [dB]             No SE    SEGAN    Denoiser    JointSE
Clean [20, ·]        0.980    0.964    ·           ·
Noisy [10, ·]        0.969    0.940    ·           ·
Very noisy [0, ·]    0.869    0.798    ·           ·
Table 2: Macro F1-score percentage difference between JointSE and a LeNet without any SE module, for different background noise types. Positive values mean that the JointSE score is bigger than the single LeNet's.
SNR [dB]             Music    TV    Office    Living Room    Conversations
Clean [20, ·]        ·        ·     ·         ·              ·
Noisy [10, ·]        ·        ·     ·         ·              ·
Very noisy [0, ·]    ·        ·     ·         ·              ·
5. Conclusions
In this paper we proposed a SE model adapted to the task of WUW detection in voice assistants for the home environment. The SE model is a fully-convolutional denoising auto-encoder at waveform level, and it is trained using log-Mel Spectrogram and waveform regression losses together with a task-dependent WUW classification loss. Results show that for clean and slightly noisy conditions, SE in general does not bring a substantial improvement over a classifier trained with proper data augmentation, but in the case of very noisy conditions SE does improve the performance, especially when the SE and WUW detector are trained jointly end-to-end.

6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[2] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.
[3] L.-P. Yang and Q.-J. Fu, "Spectral subtraction-based speech enhancement for cochlear implant patients in background noise," The Journal of the Acoustical Society of America, vol. 117, no. 3, pp. 1001–1004, 2005.
[4] C. Zorilă, C. Boeddeker, R. Doddipatla, and R. Haeb-Umbach, "An investigation into the effectiveness of enhancement in ASR training and test for CHiME-5 dinner party transcription," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 47–53.
[5] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[6] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
[7] J. Meyer and K. U. Simmer, "Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. IEEE, 1997, pp. 1167–1170.
[8] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.
[9] D. Rethage, J. Pons, and X. Serra, "A Wavenet for speech denoising," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
[10] H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chén, P. Koch, M. De Vos, and A. Mertins, "Improving GANs for speech enhancement," arXiv preprint arXiv:2001.05532, 2020.
[11] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[12] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[13] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," arXiv preprint arXiv:2006.12847, 2020.
[14] S.-W. Fu, C.-F. Liao, and Y. Tsao, "Learning with learned loss function: Speech enhancement with Quality-Net to improve perceptual evaluation of speech quality," IEEE Signal Processing Letters, vol. 27, pp. 26–30, 2019.
[15] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[16] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[17] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, "Mining effective negative training samples for keyword spotting," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7444–7448.
[18] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," Proceedings of Interspeech 2010, 2010.
[19] E. Lleida, A. Ortega, A. Miguel, V. Bazán-Gil, C. Pérez, M. Gómez, and A. de Prada, "Albayzin 2018 evaluation: the IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media," Applied Sciences, vol. 9, no. 24, p. 5412, 2019.
[20] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, "pyannote.audio: neural building blocks for speaker diarization," in ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[21] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
[22] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[23] Y. LeCun et al., "LeNet-5, convolutional neural networks," URL: http://yann.lecun.com/exdb/lenet, vol. 20, no. 5, p. 14, 2015.
[24] R. Tang and J. Lin, "Honk: A PyTorch reimplementation of convolutional neural networks for keyword spotting," CoRR, vol. abs/1710.06554, 2017. [Online]. Available: http://arxiv.org/abs/1710.06554
[25] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[26] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," CoRR, vol. abs/1710.10361, 2017.
[27] MycroftAI, "Mycroft Precise," https://github.com/MycroftAI/mycroft-precise.
[29] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, and X. Serra, "Audio tagging with noisy labels and minimal supervision," arXiv preprint arXiv:1906.02975, 2019.
[30] G. Cámbara, J. Luque, and M. Farrús, "Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7564–7568.
[31] W. J. Youden, "Index for rating diagnostic tests," Cancer, vol. 3, no. 1, pp. 32–35, 1950.