On training targets for noise-robust voice activity detection
Sebastian Braun
Microsoft Research
Redmond, WA
[email protected]

Ivan Tashev
Microsoft Research
Redmond, WA
[email protected]
Abstract—Voice activity detection (VAD) is an often required module in various speech processing, analysis, and classification tasks. While state-of-the-art neural network based VADs can achieve great results, they often exceed computational budgets and real-time operating requirements. In this work, we propose a computationally efficient real-time VAD network that achieves state-of-the-art results on several public real recording datasets. We investigate different training targets for the VAD and show that using the segmental voice-to-noise ratio (VNR) is a better and more noise-robust training target than the clean speech level based VAD. We also show that multi-target training improves the performance further.
Index Terms—voice activity detection, convolutional recurrent neural network, real-time inference
I. INTRODUCTION
Classifiers that detect speech presence in audio signals, widely known as voice activity detectors (VADs), are a mature but still popular research topic. There are countless applications that require or benefit from a VAD, such as audio processing modules [1], [2] like speech enhancers, localizers, echo controllers, automatic gain controls, speech level or signal-to-noise ratio (SNR) measurement, speech-rate or silence-rate estimation, pre-processing, gating or segmentation for speech recognizers, speech-related classifiers for emotion [3], gender, age, identity, anomaly, or toxicity, speech encoding and transmission [4], and so forth.

Traditional statistical models, widely used in speech enhancement due to their simplicity, often rely only on the non-stationarity of speech [5]–[7]. Better accuracy can be achieved by integrating models of features like pitch, harmonicity [8], [9], modulation [10], spectral shape [11], [12], etc. [13]. Dealing with increasing numbers of features becomes more difficult with human-crafted models, often resulting in diminishing performance gains. Therefore, data-driven approaches, especially neural networks [14]–[16], are an attractive choice and have shown substantial performance boosts [17].

A major design point of VAD methods is the temporal resolution of the classification. While for some applications like speech segmentation or emotion classification a larger granularity is sufficient, speech processing applications usually require a VAD at frame-level resolution with typical frame lengths between 5 and 20 ms. While some deep learning (DL) based VADs designed for the first class of applications use coarser temporal resolutions of 0.2 - 10 s, which can additionally improve robustness and performance, this limits their use as general-purpose VADs for the second class of applications. Furthermore, most applications have real-time requirements and are deployed on computationally and power-constrained edge devices, which poses challenges on the latency and computational efficiency of the algorithms.

In this work, we propose a neural network architecture for real-time VAD on a typical short audio frame basis. The network provides predictions per frame without look-ahead, and is small and efficient enough to be executed on standard CPUs of typical battery-powered devices. The main contribution of this work is, however, the investigation of noise-robust VAD training targets. We show that training on the clean speech level-based VAD is less robust in low SNR conditions than using the frequency-weighted segmental voice-to-noise ratio (VNR). While a ground truth that depends only on the clean speech level is ill-conditioned at very low SNR, where the speech might not even be audible, the VNR label accounts for the audibility of speech in noise. Additionally, we propose a multi-target training, using the speech level-based VAD and the VNR as joint targets, which improves the performance further. While a method for frame-level SNR estimation has been proposed in [18], to the best of our knowledge this is the first work to use SNR as a training target for VAD.

II. SIGNAL MODEL AND TRAINING TARGETS
We assume that noisy training data is generated by mixing possibly reverberant speech with noise, resulting in the training mixture signal
$$y(t) = h(t) \star s(t) + v(t), \qquad (1)$$
where $h(t)$, $s(t)$, $v(t)$ are the acoustic impulse response (AIR), the non-reverberant speech signal and the noise signal, $t$ is the time index, and $\star$ denotes the convolution operator. As we are interested in detecting speech, be it reverberant or not, but do not consider reverberant tails as desired speech information, we define the target speech signal
$$x(t) = h_\mathrm{win}(t) \star s(t), \qquad (2)$$
where $h_\mathrm{win}(t)$ is a windowed version of the full AIR $h(t)$ removing the late reverberation tail. We chose an exponentially decaying window corresponding to a decay rate of · s/dB, starting from the direct path of the AIR.

The typical VAD training target is defined by the clean speech level threshold $T_\mathrm{level}$ as
$$\mathrm{VAD}(n) = \begin{cases} 1 & \text{if } \|\mathbf{W}_\mathrm{VAD}\,\mathbf{x}(n)\| > T_\mathrm{level} \\ 0 & \text{if } \|\mathbf{W}_\mathrm{VAD}\,\mathbf{x}(n)\| \leq T_\mathrm{level}, \end{cases} \qquad (3)$$
where the vector $\mathbf{x}(n)$ contains the frequency-domain representation of the target speech signal $x(t)$ at frame $n$, and the matrix $\mathbf{W}_\mathrm{VAD}$ is an optional frequency weighting such as a bandpass filter or loudness weighting.

We propose an alternative training target, i.e. the Mel-weighted segmental VNR
$$\mathrm{VNR}(n) = 10 \log_{10} \frac{\|\mathbf{W}_\mathrm{VNR}\,\mathbf{x}(n)\|^2}{\|\mathbf{W}_\mathrm{VNR}\,\mathbf{v}(n)\|^2}, \qquad (4)$$
where $\mathbf{v}(n)$ is the frequency-domain representation of the noise signal $v(t)$ and the matrix $\mathbf{W}_\mathrm{VNR}$ is an auditory or loudness weighting, e.g. a Mel filterbank. Speech presence can then be computed similarly as in (3) by comparing the continuous VNR values to a threshold. Note that in contrast to the VAD, which depends on the speech level only, the VNR also takes noise into account, providing additional information about the background noise and the audibility of the speech in noise. This circumvents the problem of an ill-defined VAD (3) at highly negative SNR, i.e. when the speech may not even be audible. Furthermore, the VNR attributes a physical and interpretable meaning to the speech detection threshold. Consequently, by adjusting the detection threshold for a VNR-based VAD, we can detect either any audible speech in noise using a very low threshold, or only well-audible foreground speech by choosing a positive VNR threshold.
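To make the two targets concrete, the following sketch computes frame-wise VAD and VNR labels from the target speech and the noise signal. It is a minimal sketch, not the paper's implementation: it assumes librosa for the STFT and Mel filterbank, reuses the Mel weighting for both targets (the paper uses a bandpass weighting for the VAD target), and the threshold scale factor is a placeholder.

```python
import numpy as np
import librosa

def vad_vnr_targets(x, v, sr=16000, n_fft=512, hop=256, n_mels=32, t_level=None):
    """Frame-wise VAD (eq. 3) and Mel-weighted segmental VNR (eq. 4) labels.

    x: target (early-reverberant) speech, v: noise, both time-domain arrays.
    Threshold and Mel settings are illustrative placeholders.
    """
    # Frequency-domain representations x(n), v(n): magnitude spectra per frame
    X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop))   # (freq, frames)
    V = np.abs(librosa.stft(v, n_fft=n_fft, hop_length=hop))

    # W: Mel filterbank weighting (the paper's W_VAD could be a bandpass instead)
    W = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (mel, freq)

    # Per-frame energies after weighting
    ex = np.sum((W @ X) ** 2, axis=0)                           # ||W x(n)||^2
    ev = np.sum((W @ V) ** 2, axis=0) + 1e-12                   # ||W v(n)||^2

    # Eq. (3): speech-level based VAD with a signal-dependent threshold
    if t_level is None:
        t_level = 1e-3 * ex.max()   # placeholder scale; the paper uses a scaled maximum
    vad = (ex > t_level).astype(np.float32)

    # Eq. (4): segmental voice-to-noise ratio in dB
    vnr = 10.0 * np.log10(ex / ev)
    return vad, vnr
```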
III. PROPOSED SINGLE- AND MULTI-TARGET LOSSES
Having defined the two training targets, the binary VAD(n) and the continuous VNR(n), in the previous section, we define the considered loss functions in the following. A natural choice for a binary classification target like VAD(n) is the binary cross-entropy (BCE) [19]
$$\mathcal{L}_\mathrm{BCE}(z) = -\frac{1}{N} \sum_{n} \big[ z_n \log(\hat{z}_n) + (1 - z_n) \log(1 - \hat{z}_n) \big], \qquad (5)$$
where $z_n$ and $\hat{z}_n$ are labels and predictions, respectively, and $N$ is the number of frames. On the other hand, continuous distributions like VNR(n) are often fitted using the mean squared error (MSE) or mean absolute error (MAE) [18]. We indeed determined in preliminary experiments that the BCE performed best when training VAD(n), while the MAE loss performed best when training to predict VNR(n).

As multi-target training can be conveniently realized in deep learning and often helps generalization and robustness, we propose to combine the VAD(n) and VNR(n) training targets. We explored applying the BCE and MAE losses to the training targets, resulting in the two multi-target training loss options
$$\mathcal{L}_\mathrm{VADVNR,1} = (1 - \alpha)\, \mathcal{L}_\mathrm{BCE}(\mathrm{VAD}) + \alpha\, \mathcal{L}_\mathrm{MAE}(\mathrm{VNR}), \qquad (6)$$
$$\mathcal{L}_\mathrm{VADVNR,2} = \mathcal{L}_\mathrm{BCE}(\mathrm{VAD}) + \mathcal{L}_\mathrm{BCE}(\mathrm{VNR}), \qquad (7)$$
where the weighting factor α = 0.· to balance BCE and MAE was optimized on the validation set. We finally consider four different loss functions (a minimal implementation sketch of the multi-target options follows the list):

1) Predict VAD with BCE loss: L_BCE(VAD)
2) Predict VNR with MAE loss: L_MAE(VNR)
3) Predict both VAD and VNR with BCE and MAE: L_VADVNR,1
4) Predict both VAD and VNR with BCE: L_VADVNR,2
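As a concrete example, the sketch below implements the two multi-target loss options (6) and (7) in PyTorch. It is a minimal sketch under the stated assumptions: vad_hat and vnr_hat are the two sigmoid outputs of the network, the VNR labels are already mapped to [0, 1] as described in Section V, and alpha is a placeholder value rather than the tuned one.

```python
import torch
import torch.nn.functional as F

def multi_target_loss(vad_hat, vnr_hat, vad, vnr, variant=1, alpha=0.5):
    """Multi-target VAD/VNR losses, cf. eqs. (6) and (7).

    All tensors have shape (batch, frames); the VNR targets are assumed to be
    mapped to [0, 1]. alpha is a placeholder, not the tuned value.
    """
    bce_vad = F.binary_cross_entropy(vad_hat, vad)         # L_BCE(VAD)
    if variant == 1:
        mae_vnr = F.l1_loss(vnr_hat, vnr)                  # L_MAE(VNR)
        return (1.0 - alpha) * bce_vad + alpha * mae_vnr   # eq. (6)
    bce_vnr = F.binary_cross_entropy(vnr_hat, vnr)         # BCE on [0,1]-mapped VNR
    return bce_vad + bce_vnr                               # eq. (7)
```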
IV. PROPOSED NETWORK ARCHITECTURE
Fig. 1. a) Convolutional recurrent network architecture, b) efficient causal convolutions with temporal kernel size 2.

Due to the success of convolutional recurrent network (CRN) structures for efficient noise suppression [20], [21], we adapt a similar structure for the VAD classification task. Figure 1a) shows the proposed network architecture. To reduce the input dimensionality, the input features are log-Mel energies. The features are encoded by a few 2D convolutional layers extracting spectro-temporal information. The frequency axis is reduced by a stride of 2 in every layer, while the channel dimension is doubled. The convolution over time is causal, meaning no future information is used to infer the current frame, as illustrated in Fig. 1b). The kernel size along the time axis is only 2, which is very efficient, but still extracts temporal information across N_CNN + 1 frames, where N_CNN is the number of convolution layers. The output of the convolutional encoder is reshaped to a single vector and fed to a recurrent layer, specifically a gated recurrent unit (GRU) [22]. Finally, two fully connected (FC) layers are used to obtain the desired output of one (single-target case) or two values (multi-target case) per time frame. All convolutional and FC hidden layers use parametric rectified linear unit (PReLU) activations, while the output layer uses a sigmoid to obtain constrained values.
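A minimal PyTorch sketch of such a causal CRN is given below, assuming 64 log-Mel input bins and the layer dimensions of Table I; the fourth convolution's channel count and the reshape size follow the "channels doubled, frequency halved" rule described above rather than a released model. Padding by one frame on the past side of the time axis keeps the temporal kernel of size 2 causal.

```python
import torch
import torch.nn as nn

class CrnVad(nn.Module):
    """Sketch of a causal convolutional recurrent VAD (cf. Fig. 1 and Table I).

    Layer sizes follow the textual description; the exact model may differ.
    """

    def __init__(self, n_mels=64, n_outputs=2):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [
                # pad 1 frame on the past side of the time axis (causal, kernel 2)
                # and 1 bin on each side of the frequency axis (kernel 3, stride 2)
                nn.ZeroPad2d((1, 1, 1, 0)),  # (freq_left, freq_right, time_past, time_future)
                nn.Conv2d(cin, cout, kernel_size=(2, 3), stride=(1, 2)),
                nn.PReLU(),
            ]
        self.encoder = nn.Sequential(*layers)
        feat = chans[-1] * (n_mels // 2 ** (len(chans) - 1))  # 128 * 4 = 512 for 64 Mel bins
        self.gru = nn.GRU(feat, 512, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(512, 256), nn.PReLU(),
            nn.Linear(256, n_outputs), nn.Sigmoid(),
        )

    def forward(self, logmel):
        # logmel: (batch, frames, n_mels) -> conv input (batch, 1, frames, n_mels)
        z = self.encoder(logmel.unsqueeze(1))
        # (batch, 128, frames, 4) -> (batch, frames, 512)
        z = z.permute(0, 2, 1, 3).flatten(2)
        z, _ = self.gru(z)
        return self.fc(z)  # per-frame VAD / VNR predictions in [0, 1]

# Example: 1 s of audio at 16 kHz with a 16 ms frame shift gives ~62 frames
# out = CrnVad()(torch.randn(1, 62, 64))  # -> (1, 62, 2)
```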
V. EXPERIMENTAL VALIDATION
A. Training data
We use a large-scale augmented synthetic training set to ensure generalization to real-world signals. The training set uses 544 h of high mean opinion score (MOS) rated speech recordings from the LibriVox corpus, 247 h of noise recordings from Audioset, Freesound, internal noise recordings, and 1 h of colored stationary noise. Except for the 65 h of internal noise recordings, the data is publicly available as part of the 2nd DNS challenge (https://github.com/microsoft/DNS-Challenge). Non-reverberant speech files were augmented with acoustic impulse responses randomly drawn from a set of 7000 measured and simulated responses from several public and internal databases. 20% of the non-reverberant speech is not reverb-augmented, to represent conditions such as close-talk microphones or professionally recorded speech.

Time-shifted speech and noise sequences were mixed with a Gaussian SNR distribution N(5, ·) dB and augmented to different microphone signal levels with N(−·, ·) dBFS. The total training set comprised roughly 1000 h of 10 s mixture-target signal pairs. A more detailed description of the dataset generation can be found in [21]. The training targets VAD(n) and VNR(n) are obtained as described in Section II.

For training monitoring and hyper-parameter tuning, we generated a synthetic validation set in the same way as above, using speech from the DAPS dataset [23], and room impulse responses (RIRs) and noise from the QUT database (https://research.qut.edu.au/saivt/databases/qut-noise-databases-and-protocols/).
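The following sketch illustrates the mixing step described above: scaling a noise clip so that a speech/noise pair reaches a Gaussian-sampled SNR, then rescaling the mixture to a Gaussian-sampled level. The helper and the specific distribution parameters (except the 5 dB mean SNR) are illustrative placeholders, not the paper's data generation code.

```python
import numpy as np

def mix_at_snr(speech, noise, rng, snr_mean=5.0, snr_std=10.0,
               level_mean=-28.0, level_std=10.0):
    """Mix a speech and a noise clip at a sampled SNR and output level.

    Distribution standard deviations and the level mean are placeholders.
    Signals are float arrays at the same rate; noise is at least as long as speech.
    """
    noise = noise[: len(speech)]
    snr_db = rng.normal(snr_mean, snr_std)

    # Scale the noise so that 10*log10(P_speech / P_noise) equals the target SNR
    p_s = np.mean(speech ** 2) + 1e-12
    p_n = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    mix = speech + noise

    # Rescale the mixture to a sampled level (RMS-based dBFS here)
    level_db = rng.normal(level_mean, level_std)
    gain = 10 ** (level_db / 20) / (np.sqrt(np.mean(mix ** 2)) + 1e-12)
    return mix * gain, speech * gain, noise * gain

# rng = np.random.default_rng(0)
# y, x, v = mix_at_snr(speech_clip, noise_clip, rng)
```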
B. Experimental setup

Training targets and features were computed from 32 ms frames with 16 ms frame shift. The VAD weighting W_VAD was a bandpass filter between [150, ·] Hz, and we used a signal-dependent threshold $T_\mathrm{level} = 0.· \cdot \max_n \|\mathbf{W}_\mathrm{VAD}\,\mathbf{x}(n)\|$. W_VNR was a 32-band Mel weighting, and VNR(n) was limited to the range [−·, ·] dB and mapped to the range [0, 1] for training purposes. Both targets, VAD(n) and VNR(n), were temporally smoothed to remove measurement errors and obtain smoother predictions, using a centered moving-average window of length 0.2 s.

The network input features were 64 log-Mel energy bins in the range 0-8 kHz. Table I shows the network layer dimensions. Convolutional layers have parameters {inChannels → outChannels, kernelSize, stride, padding}, where kernelSize and stride are defined as (time, frequency), and padding is defined as (timeLeft, timeRight, freqLow, freqHigh). Reshaping, the uni-directional GRU, and FC layers are defined by {inShape → outShape}.

TABLE I
NEURAL NETWORK LAYER DIMENSIONS

layer type   hyperparameters                     activation
conv2D       1 → 16, (2,3), (1,2), (1,0,1,1)     PReLU
conv2D       16 → 32, (2,3), (1,2), (1,0,1,1)    PReLU
conv2D       32 → 64, (2,3), (1,2), (1,0,1,1)    PReLU
conv2D       64 → 128, (2,3), (1,2), (1,0,1,1)   PReLU
reshape      128 × 4 → 512                       –
GRU          512 → 512                           Sigmoid, tanh
FC           512 → 256                           PReLU
FC           256 → 1-2                           Sigmoid

The proposed networks are trained using the AdamW optimizer [24] with a learning rate of ·, weight decay 0.01, a batch size of 50 sequences of 10 s length, and adaptive gradient norm clipping [25]. Training is stopped when there is no improvement anymore on the validation set.
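For illustration, log-Mel input features with the framing described above could be computed as below; the window type and the log floor are assumptions, since they are not specified in the text.

```python
import numpy as np
import librosa

def logmel_features(audio, sr=16000, n_mels=64, eps=1e-10):
    """64 log-Mel energy bins from 32 ms frames with 16 ms shift (0-8 kHz at 16 kHz).

    Window type and the log floor eps are assumptions, not specified in the paper.
    """
    n_fft = int(0.032 * sr)   # 512 samples at 16 kHz
    hop = int(0.016 * sr)     # 256 samples
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels, power=2.0)
    return np.log(mel + eps).T   # (frames, n_mels), matching the network input
```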
C. Metrics, test sets, and post-processing for evaluation

We used the area under the curve (AUC) of the false positive rate vs. true positive rate as the metric for evaluation, training monitoring, and tuning. In contrast to the often used precision and recall or F-score, the AUC is threshold-independent, and we believe it is therefore more holistic.

As DL methods can only be convincing when generalization to real data can be proven, we use three different public datasets consisting only of real recordings in the wild. We chose the datasets to be from public domains, to have been used in prior studies to benchmark VAD methods for comparability, and to be representative of common real-world scenarios and challenging SNR distributions. The three test sets are described in the following.
The KAIST dataset (https://github.com/jtkim-kaist/VAD) consists of four 30-min field recordings in noisy environments (bus stop, construction site, park, room) performed by the authors of [26]. Speech on- and off-sets are human-labelled by the authors. The HAVIC database [27] is a collection of 72 h of videos of everyday-life situations like children playing, kitchens, living rooms, traffic, offices, sports events, etc., with human annotations of acoustic events. We categorized all speech-related events as target, while the labels {noise, background noise, music, unintelligible, baby, TV, singing} were considered as non-speech. Although baby and singing are voice, they were excluded as they are not represented in our speech training data. The AVA speech v1.0 dataset (http://research.google.com/ava/index.html) is a 30 h collection of crowd-source labeled YouTube clips (we were only able to obtain 120 of the 160 15-min clips from YouTube). Most of the AVA speech clips seem to originate from movies, so this dataset should be considered a biased subsample of possible scenarios.

As all test sets are human-labeled, the labels might contain annotation errors, exhibit coarse temporal granularity (pauses between words and sentences are usually still labeled as speech), and carry temporal label uncertainty. This might interfere with our fine-grained frame-wise VAD predictions, which will detect short pauses between words. To mitigate these test label inaccuracies, we apply a post-processing to the frame-wise predictions VAD(n), VNR(n) by smoothing the predictions with a fixed window of 0.4 s length. The window involves no look-ahead, computing the 90-th percentile within the window, in order to detect speech even when only a fraction of the frames within the 0.4 s window contains speech activity. This post-processing achieved up to 6% relative AUC improvement on the given datasets, while still being real-time compatible.
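The sketch below shows this causal percentile smoothing followed by an AUC evaluation against frame-wise labels; the function name and the use of scikit-learn are illustrative choices, not the paper's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def causal_percentile_smooth(pred, win_frames=25, q=90):
    """Smooth frame-wise predictions with a causal window (no look-ahead),
    taking the 90-th percentile over the past win_frames frames.

    With a 16 ms frame shift, 25 frames correspond to the 0.4 s window.
    """
    out = np.empty_like(pred, dtype=np.float64)
    for n in range(len(pred)):
        out[n] = np.percentile(pred[max(0, n - win_frames + 1): n + 1], q)
    return out

# labels: binary frame-wise speech labels; vad_pred: raw network output per frame
# auc = roc_auc_score(labels, causal_percentile_smooth(vad_pred))
```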
D. Results

We compare our models to a state-of-the-art baseline, ACAM [26], employing a recursive temporal attention module and features with a look-ahead of 190 ms. Table II shows the AUC results for the baseline and the proposed models. We can see that our best model achieves comparable performance on KAIST and better performance on the diverse and noisy HAVIC dataset, while being real-time without look-ahead and using a less complex model at inference time. The ONNX model of our network processes 1 s of audio in 7 ms on a laptop CPU at 2.5 GHz, while for ACAM, [26] reported 10 ms without and over 200 ms with feature extraction on a GPU workstation.

TABLE II
AUC RESULTS (%) ON TEST SETS

target      loss           output   KAIST   HAVIC   AVA
ACAM [26]   –              VAD      ·       ·       ·
VAD,VNR     BCE,MAE (6)    VAD      95.9    86.5    92.0
VAD,VNR     BCE (7)        VNR      ·       ·       ·
VAD,VNR     BCE (7)        VAD      93.9    86.1    91.4

The first and second columns in Table II indicate the training targets and loss functions of the proposed networks. As described in Section III, we trained four different networks, two single-target and two multi-target predictors. Since the multi-target networks provide two outputs per frame, they are evaluated in two ways, using either the VAD or the VNR output. While the multi-target training improves performance on the test sets, we did not find useful improvements by combining the VAD and VNR outputs of the model, compared to only using the VNR output. The two top results per dataset are printed boldface. We can observe that predicting VNR yields significantly better results than VAD on KAIST and HAVIC, while the difference on AVA is minor. The VNR output of the multi-target predictors shows consistently equal or better results than the single-target predictors, while the VAD output of the multi-target networks still performs worse than the VNR single target. This illustrates both the advantage of predicting VNR over VAD, and the slight gain from multi-target training. Note that the label errors of the HAVIC dataset prevent achieving high results.

Fig. 2. AUC (%) on the validation set over SNR (dB) for different training targets (VAD BCE, VNR MAE, VAD+VNR BCE).

Further analysis of the AUC on the validation set, grouped by the mixing SNR and shown in Fig. 2, reveals that all training targets indeed perform similarly at high SNR, while the VNR and VAD+VNR targets provide better performance at medium to low SNR. This proves the noise robustness of using VNR as a training target compared to the common clean speech level based activity. Interestingly, on the validation set the VNR-only loss (orange) shows slightly better results than the multi-target loss (green line) in Fig. 2, while all three test sets in Table II show a slight advantage of the multi-target training. We can still conclude that the multi-target loss helps generalization, due to the better results on three real recording test sets in contrast to the synthetically mixed validation set, which the networks were also optimized on and which is therefore indirectly seen data.

Fig. 3. Outputs of the proposed model trained to output VAD and VNR with BCE loss for a) an easy scenario with prominent foreground speech (foreground speech with background noise and music), b) a recording without active speech (quiet office, typing); y-axis: VNR (dB).

The two outputs of the best proposed model, the multi-target prediction with BCE loss, are shown in Fig. 3 for two test recordings. The waveform is shown in blue, the VAD output in orange, and the VNR in green. The waveform and the VAD (range [0,1]) are scaled to the VNR dB y-axis for illustration purposes. For the easy scenario in Fig. 3a), both outputs provide good indications of active and non-active speech frames. Both predictors show low values even in short pauses between words. However, a recording in a quiet office, containing only slight ambient noise and keyboard typing, in Fig. 3b) reveals some problems of the classification VAD output. The output is more noisy, and the VAD predictions become quite large for some non-speech acoustic events. On the other hand, the VNR prediction is consistently low, hovering around -10 dB, indicating very unlikely speech activity at any time.
It is worthwhile mentioning that the lowest equal error rate (false alarm rate equals miss rate) on the validation set is achieved with a VNR threshold of -7 dB.
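A common way to find such an operating point is to locate the threshold where false alarm and miss rates cross on the ROC curve; a minimal sketch using scikit-learn (an illustrative choice, not the paper's tooling) is shown below.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Return the equal error rate and the score threshold where the
    false alarm rate equals the miss rate."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr                              # miss rate
    idx = int(np.nanargmin(np.abs(fnr - fpr)))   # closest crossing point
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]

# eer, thr = equal_error_rate(frame_labels, vnr_predictions)
# thr is in the scale of the scores passed in; map back to dB if the
# scores are the [0, 1]-mapped VNR outputs.
```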
VI. CONCLUSIONS
We have proposed an efficient real-time neural network based VAD that achieves state-of-the-art results on challenging real recordings. We showed that the segmental VNR is a more noise-robust training target for VAD than the clean speech level based activity, and that it can be further improved by combining both targets for multi-target training. The proposed network is flexible for most applications, as the frame-level decisions can be converted to coarser granularity by simple post-processing. The efficient network design allows straightforward integration in various speech processing tasks and different implementation platforms, as the inference time on a modern CPU is around 7 ms per second of audio without any runtime optimizations.
REFERENCES

[1] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. New Jersey, USA: Wiley, 2004.
[2] I. Tashev, Sound Capture and Processing: Practical Approaches. Wiley, July 2009.
[3] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, "Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge," Speech Communication, vol. 53, no. 9, pp. 1062–1087, 2011, Sensing Emotion and Affect - Facing Realism in Speech Processing.
[4] T. Bäckström, Speech Coding with Code-Excited Linear Prediction, 1st ed. Springer, 2017.
[5] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detector," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.
[6] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1383–1393, May 2012.
[7] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, pp. 504–512, Jul. 2001.
[8] N. Dhananjaya and B. Yegnanarayana, "Voiced/nonvoiced detection based on robustness of voiced epochs," IEEE Signal Processing Letters, vol. 17, no. 3, pp. 273–276, 2010.
[9] L. N. Tan, B. J. Borgstrom, and A. Alwan, "Voice activity detection using harmonic frequency components in likelihood ratio test," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4466–4469.
[10] C.-C. Hsu, K.-M. Cheong, T.-S. Chi, and Y. Taso, "Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation," IEICE Transactions on Information and Systems, vol. E98.D, no. 10, pp. 1808–1817, 2015.
[11] S. O. Sadjadi and J. H. L. Hansen, "Unsupervised speech activity detection using voicing measures and perceptual spectral flux," IEEE Signal Processing Letters, vol. 20, no. 3, pp. 197–200, 2013.
[12] I. Yoo, H. Lim, and D. Yook, "Formant-based robust voice activity detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2238–2245, 2015.
[13] S. Graf, T. Herbig, M. Buck, and G. Schmidt, "Features for voice activity detection: a comparative analysis," EURASIP Journal on Applied Signal Processing, vol. 91, 2015.
[14] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, "Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2013, pp. 483–487.
[15] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7378–7382.
[16] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza, "Deep neural networks for multi-room voice activity detection: Advancements and comparative evaluation," in International Joint Conference on Neural Networks (IJCNN), 2016, pp. 3391–3398.
[17] I. Tashev and S. Mirsamadi, "DNN-based causal voice activity detector," in Information Theory and Applications Workshop. University of California - San Diego, February 2016.
[18] H. Li, D. Wang, X. Zhang, and G. Gao, "Frame-level signal-to-noise ratio estimation using deep learning," in Proc. Interspeech, 2020.
[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[20] in Proc. Interspeech, 2018, pp. 3229–3233.
[21] S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," to appear at ICASSP 2021, 2021. [Online]. Available: arXiv:2101.09249
[22] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proc. 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[23] G. J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—A dataset, insights, and challenges," IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, 2015.
[24] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
[25] P. Seetharaman, G. Wichern, B. Pardo, and J. Le Roux, "AutoClip: Adaptive gradient clipping for source separation networks," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2020, pp. 1–6.
[26] J. Kim and M. Hahn, "Voice activity detection using an adaptive context attention model," IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181–1185, 2018.
[27] S. Strassel, A. Morris, J. Fiscus, C. Caruso, H. Lee, P. Over, J. Fiumara, B. Shaw, B. Antonishek, and M. Michel, "Creating HAVIC: Heterogeneous Audio Visual Internet Collection," in 8th International Conference on Language Resources and Evaluation (LREC), 2012.