Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks
Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma, Abhinav Thanda, Shankar M Venkatesan
Samsung R&D Institute India, Bangalore
{r.aralikatti, dilip.margam, tanay.sharma, abhinav.t89, s.venkatesan}@samsung.com

ABSTRACT
This paper demonstrates two novel methods to estimate the global SNR of speech signals. In both methods, a Deep Neural Network-Hidden Markov Model (DNN-HMM) acoustic model used in speech recognition systems is leveraged for the additional task of SNR estimation. In the first method, the entropy of the DNN-HMM output is computed. Recent work on Bayesian deep learning has shown that a DNN-HMM trained with dropout can be used to estimate model uncertainty by approximating it as a deep Gaussian process. In the second method, this approximation is used to obtain model uncertainty estimates. Noise-specific regressors are used to predict the SNR from the entropy and model uncertainty. The DNN-HMM is trained on the GRID corpus and tested on different noise profiles from the DEMAND noise database at SNR levels ranging from -10 dB to 30 dB.
Index Terms — SNR Estimation, Dropout, Entropy, Deep Neural Networks
1. INTRODUCTION
Signal-to-noise ratio (SNR) estimation of a signal is an important step in many speech processing techniques such as robust automatic speech recognition (ASR) ([1, 2]), speech enhancement ([3, 4]), noise suppression and speech detection. The global signal-to-noise ratio (SNR) of a signal x(t) in dB is defined as follows:

SNR_dB(x) = 10 log10 ( Power(s) / Power(n) )

The signal x(t) = s(t) + n(t), where s(t) represents the clean signal and n(t) is the noise component.

State-of-the-art ASR has achieved very low error rates with the advent of deep learning. However, the performance of ASR systems can still be improved in noisy conditions. Robust ASR techniques such as noise-aware training [1] and related methods ([5], [2]) require an estimate of the noise present in the speech signal.

Recently, it has been shown that incorporating visual features (extracted from lip movements during speech) can lead to improved word error rates (WER) in noisy environments ([6], [7]). In [8], both audio and visual modalities are used for speech enhancement. With the proliferation of voice assistants and front-facing cameras in smartphones, using visual features to improve ASR seems feasible. This raises a crucial question: when should the camera be turned on to make use of features from the visual modality? In such scenarios, we can benefit from accurate SNR estimation by turning on the camera only in noisy environments.

In this paper, we present two novel methods to estimate the global SNR (at an utterance level) of a speech signal. Both methods require training a DNN-based speech classifier on noise-free audio using alignments generated from a GMM-HMM model trained for ASR. The first method estimates SNR by computing the entropy of the DNN's output. The second method uses model uncertainty estimates obtained by using dropout during inference, as shown in [9]. In section 2, we present related work. Section 3 describes the entropy based SNR estimator.
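As a concrete illustration, the global SNR definition above can be computed directly when the clean and noise components are available separately, as is the case when test data is constructed by mixing. This sketch is ours, not the paper's code; the function name and the use of mean squared amplitude as power are assumptions.

```python
import numpy as np

def global_snr_db(s, n):
    """Global SNR in dB: 10 * log10(Power(s) / Power(n)),
    with power taken as the mean squared amplitude."""
    return 10.0 * np.log10(np.mean(s ** 2) / np.mean(n ** 2))

# Example: noise with one-tenth the power of the signal gives roughly +10 dB.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)              # unit-power "clean" signal
n = np.sqrt(0.1) * rng.standard_normal(16000)  # noise at 1/10 the power
print(global_snr_db(s, n))  # close to 10.0
```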
Section 4 describes the dropout based SNR estimator. Section 5 describes the architecture of the network, the training procedure and the experiments performed. Section 6 presents the results. Section 7 concludes the paper.
2. RELATED WORK
SNR estimation has been an active area of research. In [10], the authors use handcrafted features such as signal energy, signal variability, pitch and voicing probability to train noise-specific regressors that compute the SNR of an input signal. In [11], the amplitude of clean speech is modelled by a gamma distribution and noise is assumed to be normally distributed; SNR is estimated by observing changes to the parameters of the gamma distribution upon addition of noise. The NIST-SNR measurement tool uses a sequential GMM to model the speech and non-speech parts of a signal to estimate the SNR. In [12], a voice activity detector (VAD) is used to classify frames as either voiced, unvoiced or silence, and the noise spectrum is estimated from this information. After subtracting the noise spectrum from the input signal to obtain the clean signal, SNR is estimated. In [13], computational auditory scene analysis is used to estimate speech dominated and noise dominated portions of the signal in order to obtain SNR.

Estimation of instantaneous SNR is also a subtask in many speech enhancement methods ([8, 14, 15, 16]). In [17], a neural network is trained to output the SNR in each frequency channel using amplitude modulation spectrogram (AMS) features obtained from the input signal. In [18], the peaks and valleys of the smoothed short-time power estimate of a signal are used to estimate the noise power and instantaneous SNR.
3. ENTROPY BASED SNR ESTIMATION
In this method, a neural network trained as part of an ASR system to predict the posterior distribution of HMM states is used, and the Shannon entropy of the posterior distribution is computed. In information theory, Shannon entropy measures the average uncertainty of an encoding machine. Similarly, in our case the posterior distribution obtained from the DNN, which is trained as part of the ASR system, acts as the encoding distribution. Whenever the feature vector of a clean signal is forwarded through the DNN, it is expected to give a meaningful posterior distribution. But when a feature vector of a noisy signal is forwarded through the network, the posteriors are expected to be arbitrary, which in most cases leads to a higher entropy value. This follows from the assumption that addition of noise to the speech signal results in arbitrary features.

Let F_j denote the j-th input frame of utterance U and Y (of dimension d) denote the output of the DNN. The entropy for a given input F_j is computed as shown in equation 1:

H(F_j) = - Σ_{i=1}^{d} P[Y_i] log P[Y_i]    (1)

Entropy(U) = (1/m) Σ_{j=1}^{m} H(F_j)    (2)

SNR(U) = f1(Entropy(U))    (3)

where P[.] denotes the softmax activation and Y_i is the i-th dimension of Y. The average entropy over all m input frames of an utterance is used as the entropy measure for that utterance. A polynomial regressor f1(.) is trained on utterance-level entropy values to predict the SNR of the speech signal. The advantage of this method is that it works on any kind of noise which randomizes the speech signal. DNN-HMM based ASR systems, which are sensitive to noisy conditions, can take advantage of entropy values to estimate the SNR with low computational overhead. In figure 1, it is clearly seen that the average entropy increases with increasing noise.
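Equations 1 and 2 can be sketched in a few lines, assuming only that the DNN's output layer is a softmax over d classes; the helper names below are ours. Confident (clean-like) posteriors yield a low average entropy, while flat (noisy-like) posteriors approach the maximum value log d.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_entropy(p):
    # Equation 1: H(F_j) = -sum_i P[Y_i] log P[Y_i]
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def utterance_entropy(logits):
    # Equation 2: average of the frame entropies over the m frames
    return frame_entropy(softmax(logits)).mean()

d = 8
confident = np.tile(20.0 * np.eye(d)[0], (5, 1))  # 5 sharply peaked frames
flat = np.zeros((5, d))                           # 5 uniform frames
print(utterance_entropy(confident))  # near 0
print(utterance_entropy(flat))       # log(8) ≈ 2.079
```

In the paper, the regressor f1 is then fit on such utterance-level entropy values against known SNRs, per noise type.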
4. SNR ESTIMATION USING DROPOUT UNCERTAINTY

4.1. Bayesian uncertainty using dropout
Gal and Ghahramani showed in [9] that the use of dropout while training DNNs can be thought of as a Bayesian approximation of a deep Gaussian process (GP). Using this GP approximation, estimates of the model uncertainty of DNNs trained using dropout are derived. More specifically, it is shown that the uncertainty of the DNN output for a given input can be approximated by computing the variance of multiple output samples obtained by using dropout during inference. The use of dropout during inference results in a different output every time the forward pass is done for a given input. The variance of these output samples is the uncertainty for the given input.

The above method is used to obtain uncertainty estimates for the DNN that was trained as part of the DNN-HMM based ASR system as explained in section 5. This DNN is referred to as the dropout network throughout this paper. If the input is corrupted by noise, it is expected that the model uncertainty derived from dropout will be higher. The model uncertainty for a given input F_j is computed as shown in equation 4:

MU(F_j) = Σ_{i=1}^{d} Var[Y_i]    (4)

Uncertainty(U) = (1/m) Σ_{j=1}^{m} MU(F_j)    (5)

SNR(U) = f2(Uncertainty(U))    (6)

SNR(U) = f3(Uncertainty(U), Entropy(U))    (7)

where MU stands for the model uncertainty per frame. The average variance over all input frames is used as the uncertainty measure for an utterance. The SNR of the utterance is estimated as shown in equation 6, where f2(.) is a polynomial regressor trained to predict SNR from the uncertainty value. The regressor f3(.) is trained on both the uncertainty and the entropy of the utterance to output the SNR value. We compare the performance of all three regressors in table 1.

It may not always be feasible to run the forward pass multiple times per input frame in order to obtain output samples. Given the input frame and the weights of the dropout network, it should be possible to algebraically derive the variance and expectation of the output layer. The uncertainty of the model is the consequence of the uncertainty added by dropout in each layer of the network. The following equations depict how the model uncertainty can be computed mathematically. For simplicity, consider a neural network with one layer. The output of the one-layer network with ReLU activation is Y = ReLU(W (D ∘ F) + b), where ∘ denotes the Hadamard product and D denotes the dropout mask. The variance of the i-th dimension of the output is given in equation 8:

Var[Y_i] = Var[ReLU(W_i^T (D ∘ F) + b_i)] = Var[ReLU( Σ_{j=1}^{m} W_ij D_j F_j + b_i )] = Var[ReLU(A_i)]    (8)

where A_i = Σ_{j=1}^{m} W_ij D_j F_j + b_i, W_i denotes the i-th row of the matrix W, and m is the dimension of F. Each dropout variable D_j is a Bernoulli variable with probability of success p, so Var[D_j] = p(1 - p). Since all the dropout Bernoulli random variables are independent of one another,

Var[A_i] = Σ_{j=1}^{m} W_ij² F_j² Var[D_j] = p(1 - p) Σ_{j=1}^{m} W_ij² F_j²    (9)

The difficulty lies in computing Var[Y_i], because it involves the non-linear ReLU activation. To compute Var[Y_i] exactly, one has to integrate Y_i over all possible dropout masks (2^m possibilities), which is computationally prohibitive. One can proceed using a first-order Taylor approximation in m variables. In [19], it is assumed that the sum of activation values follows a normal distribution by the central limit theorem, but this assumption did not hold empirically in our case because of the multiple layers in the network.

However, the variance of the output is some complex non-linear function of the input and the dropout network weights. Therefore, it should be possible to train another DNN to learn this non-linear relationship, so that the uncertainty can be estimated with a single forward pass of this second network. This second neural network is referred to as the variance network in this paper. The variance network explained in section 5.1.1 was able to successfully learn the mapping from the input frame to the output (dropout uncertainty), as shown in figure 3.
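Equation 9 can be checked numerically for the pre-activation A_i of a single linear layer (ignoring the ReLU and the bias, which shift A_i without changing its variance). This verification sketch is ours, not the paper's code; the dimensions and drop probability are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, p = 64, 10, 0.5                 # input dim, output dim, drop probability
W = rng.standard_normal((d, m))
F = rng.standard_normal(m)

# Monte Carlo: sample keep-masks D ~ Bernoulli(1-p) and compute A = W (D ∘ F).
masks = (rng.random((20000, m)) >= p).astype(float)
A = masks * F @ W.T                   # (20000, d) pre-activation samples
mc_var = A.var(axis=0)

# Equation 9: Var[A_i] = p(1-p) * sum_j W_ij^2 F_j^2
analytic = p * (1 - p) * (W ** 2 @ F ** 2)
print(np.max(np.abs(mc_var / analytic - 1.0)))  # small relative error
```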
5. EXPERIMENTS
A DNN-HMM based ASR system is trained on the Grid corpus [20], with part of the data used for training and the rest for testing. Mel-scale filter-bank features, with contextual frames on both sides, are used as input features. The activation function used is ReLU, along with dropout in all hidden layers (p is the probability of dropping a neuron). The output dimension of the DNN corresponds to the number of HMM states, and there are six hidden layers. This DNN, also referred to as the dropout network in this paper, is used for estimating entropy and variance in all our experiments, except for section 4.2.

We experimented on 16 different noise types from the DEMAND noise dataset, where noise is added to the test set of utterances. We observe a strong correlation between average entropy and SNR, as shown in figure 1. Similar results are obtained for average dropout uncertainty versus SNR: model uncertainty increases with increasing noise, as shown in figure 2. The variance was computed by taking 100 output samples per input frame, but we obtained similar results when we reduced the number of samples to 20 per input frame. Figure 2 shows the variation in model uncertainty with respect to SNR for the same six arbitrarily chosen noises as in figure 1.

The variance computation was done on the output samples obtained from the DNN before the application of softmax. This gave better results, since the softmax function tends to exponentially squash the outputs to lie between 0 and 1, which causes the variance along many of the output dimensions to be ignored. Using the ReLU non-linearity also gave better results compared to the sigmoid and tanh non-linearities.
This is expected, as both sigmoid and tanh tend to saturate, which does not allow the variance (or model uncertainty) to propagate to the output layer.

The variance network is used for fast dropout uncertainty estimation. It is trained on uncertainty estimates obtained from the dropout network: utterances from the GRID corpus are mixed with noise from the DEMAND [21] dataset, targets are generated using the previously trained dropout network, and training covers 12 types of noise at 40 different SNR levels (from -10 dB to 30 dB). Testing is done on different utterances from the GRID corpus mixed with noise samples not exposed to the network during training.

The variance network is able to successfully learn the mapping from the input frame to the output uncertainty. The plots in figure 3 show the variation of the output uncertainty for four types of noise (CAR, PARK, KITCHEN, MEETING) which were not used during training.
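The variance network idea, distilling the sampled uncertainty into a single deterministic forward pass, can be illustrated in a toy setting. For the one-layer linear case of equation 9, MU(F) is exactly linear in the squared features, so even a least-squares "student" recovers it; this simplification and every name in it are ours, whereas the paper trains a DNN for the full multi-layer nonlinear case.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, p = 16, 4, 0.5
W = rng.standard_normal((d, m))

def teacher_uncertainty(X):
    # Per-frame dropout uncertainty of a one-layer linear model:
    # MU(F) = sum_i Var[A_i] = p(1-p) * sum_i sum_j W_ij^2 F_j^2
    return p * (1 - p) * (X ** 2 @ (W ** 2).sum(axis=0))

# "Variance network" stand-in: least-squares fit from squared features
# to the teacher's uncertainty, then evaluated in one pass at test time.
X_train = rng.standard_normal((500, m))
coef, *_ = np.linalg.lstsq(X_train ** 2, teacher_uncertainty(X_train), rcond=None)

X_test = rng.standard_normal((100, m))
pred = X_test ** 2 @ coef
print(np.max(np.abs(pred - teacher_uncertainty(X_test))))  # ~0 (exact fit)
```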
6. RESULTS
To obtain the SNR of an input signal, we train noise-specific linear regressors that output the SNR value given the utterance-level entropy and uncertainty estimates.

Fig. 1. Plot depicting the relationship between the averaged entropy of an utterance (defined in equation 2) and the SNR value of the utterance, for test utterances of six arbitrarily chosen noise types.
Fig. 2. Plot showing the relationship between the averaged uncertainty of an utterance (as in equation 5) and the SNR value of the utterance, for test utterances of six arbitrarily chosen noise types.

Fig. 3. Plot showing the relationship between the output of the variance network and noisy input speech at different SNR values, for four unseen (not used in training) noises.
Table 1. The Mean Absolute Error (MAE) of our SNR estimation methods (f1, f2, f3) compared against the NIST and WADA methods, for the DKITCHEN, NPARK and OMEETING noise types at SNR levels of -10, -5, 0, 5 and 10 dB. [Numeric entries missing in this copy.]

In table 1, we compare the three regressors (f1, f2 and f3) described previously with well-known SNR estimation methods, namely the NIST STNR estimation tool and the WADA SNR estimation method described in [11]. It is observed that the regressor trained on dropout uncertainty performed better than the entropy based regressor. Interestingly, the regressor trained on both the dropout uncertainty and the entropy performed worse than the one regressing on the network uncertainty alone. However, all three regressors produced better SNR estimates than either WADA or NIST, particularly at low SNR levels.

Though we clearly see a correlation between the entropy/dropout uncertainty and the noise in the signal, to finally obtain the SNR value of the signal we have to train a noise-specific regressor on top of the entropy/dropout uncertainty values. The possibility of directly predicting SNR independent of the background noise needs further research. In [10], the authors propose using a DNN to find out which of the noise types most closely resembles the input, and using the corresponding regressor to estimate SNR. However, since the dropout network is trained on clean audio, irrespective of the type of noise in the speech signal, the trend of increasing uncertainty with increasing noise held even in unseen noise conditions. The variance network, which is trained on specific noise types in order to avoid the computational cost of taking samples during inference, clearly maintained this trend even in unseen noise conditions, as shown in figure 3.
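The regressor-fitting step (equations 3 and 6) reduces to ordinary polynomial regression per noise type. The sketch below uses synthetic (statistic, SNR) pairs generated from a known cubic, our stand-in for the measured entropy or uncertainty values, to show the mechanics with np.polyfit.

```python
import numpy as np

# Hypothetical utterance-level statistics (entropy or uncertainty values)
stat = np.linspace(0.1, 2.0, 40)

# Synthetic ground truth: SNR as a known cubic function of the statistic
true_coeffs = [-4.0, 5.0, -30.0, 32.0]
snr = np.polyval(true_coeffs, stat)

# Fit the noise-specific regressor f(.) as a degree-3 polynomial
fitted = np.polyfit(stat, snr, deg=3)
f = np.poly1d(fitted)
print(f(1.0))  # equals np.polyval(true_coeffs, 1.0) = 3.0
```

On real data the statistic-to-SNR relation is only approximately polynomial, so the degree and the per-noise-type split become modeling choices.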
7. CONCLUSION
In this paper, we have shown that it is possible to extract useful information from uncertainty (either from entropy or from Bayesian estimates) and predict the SNR of a speech signal. To the best of the authors' knowledge, previous research in deep learning based speech processing has not made use of uncertainty information. Using this uncertainty information to better design and improve current ASR and speech enhancement algorithms is a possible future direction of research. Another possible extension is to investigate predicting instantaneous SNR instead of global SNR. The methods proposed in this paper for SNR estimation do not impose any conditions on the type of noise corrupting the signal. This leaves open the possibility of applying similar noise estimation techniques to non-speech signals.

8. REFERENCES

[1] Michael L Seltzer, Dong Yu, and Yongqiang Wang, “An investigation of deep neural networks for noise robust speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7398–7402.

[2] Kang Hyun Lee, Shin Jae Kang, Woo Hyun Kang, and Nam Soo Kim, “Two-stage noise aware training using asymmetric deep denoising autoencoder,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5765–5769.

[3] Yariv Ephraim and David Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.

[4] Elias Nemer, Rafik Goubran, and Samy Mahmoud, “SNR estimation of speech signals using subbands and fourth-order statistics,” IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171–174, 1999.

[5] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “Dynamic noise aware training for speech enhancement based on deep neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[6] Abhinav Thanda and Shankar M Venkatesan, “Multi-task learning of deep neural networks for audio visual automatic speech recognition,” arXiv preprint arXiv:1701.02477, 2017.

[7] Abhinav Thanda and Shankar M Venkatesan, “Audio visual speech recognition using deep recurrent neural networks,” in IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction. Springer, 2016, pp. 98–109.

[8] Pascal Scalart et al., “Speech enhancement based on a priori signal to noise estimation,” in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on. IEEE, 1996, vol. 2, pp. 629–632.

[9] Yarin Gal and Zoubin Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, 2016, pp. 1050–1059.

[10] Pavlos Papadopoulos, Andreas Tsiartas, and Shrikanth Narayanan, “Long-term SNR estimation of speech signals in known and unknown channel conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2495–2506, 2016.

[11] Chanwoo Kim and Richard M Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” in Ninth Annual Conference of the International Speech Communication Association, 2008.

[12] Juan A Morales-Cordovilla, Ning Ma, Victoria Sánchez, José L Carmona, Antonio M Peinado, and Jon Barker, “A pitch based noise estimation technique for robust speech recognition with missing data,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 4808–4811.

[13] Arun Narayanan and DeLiang Wang, “A CASA-based system for long-term SNR estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2518–2527, 2012.

[14] Jinkyu Lee, Keulbit Kim, Turaj Shabestary, and Hong-Goo Kang, “Deep bi-directional long short-term memory based speech enhancement for wind noise reduction,” in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017. IEEE, 2017, pp. 41–45.

[15] Israel Cohen, “Relaxed statistical model for speech enhancement and a priori SNR estimation,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 870–881, 2005.

[16] Yao Ren and Michael T Johnson, “An improved SNR estimator for speech enhancement,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008, pp. 4901–4904.

[17] Jürgen Tchorz and Birger Kollmeier, “SNR estimation based on amplitude modulation analysis with applications to noise suppression,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184–192, 2003.

[18] Rainer Martin, “An efficient algorithm to estimate the instantaneous SNR of speech signals,” in Eurospeech, 1993, vol. 93, pp. 1093–1096.

[19] Sida Wang and Christopher Manning, “Fast dropout training,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 118–126.

[20] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao, “An audio-visual corpus for speech perception and automatic speech recognition,”