MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection
Zhenpeng Zheng, Jianzong Wang*, Ning Cheng, Jian Luo, Jing Xiao

Ping An Technology (Shenzhen) Co., Ltd.
{zhengzhenpeng479, wangjianzong347, chengning221, luojian304, xiaojing661}@pingan.com.cn

*Corresponding author: Jianzong Wang, [email protected]

Abstract
Voice activity detection (VAD) makes a distinction between speech and non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal processing methods. Existing DNN-based models always handcrafted a fixed window to make use of contextual speech information to improve the performance of VAD. However, a fixed window of contextual speech information cannot handle various unpredictable noise environments or highlight the speech information most critical to the VAD task. In order to solve this problem, this paper proposed an adaptive multiple receptive-field attention neural network, called MLNET, for the VAD task. MLNET leveraged multiple branches to extract multiple kinds of contextual speech information and investigated an effective attention block to weight the most crucial parts of the context for the final classification. Experiments in real-world scenarios demonstrated that the proposed MLNET-based model outperformed other baselines.
Index Terms: Voice Activity Detection, Adaptive Multiple Receptive-field Attention
1. Introduction
Voice Activity Detection (VAD), which aims at removing noise or silence from the original speech signal and obtaining the valid speech signal, is an essential part of speech recognition and other speech-based applications [1, 2, 3]. Unfortunately, in real environments, speech signals may contain numerous background noises and have a low signal-to-noise ratio (SNR), which brings great challenges to building an accurate VAD system.

With the continuous development of speech technologies, research on VAD has remained a hot spot [4, 5, 6]. Early research focused on parametric methods: energy functions, zero-crossing rate, statistical signal analysis, or other acoustic features [7, 8, 9, 10]. Later, various machine learning based methods were applied to VAD systems: Gaussian Mixture Models (GMM) [11], Hidden Markov Models (HMM) [12], and Support Vector Machines (SVM) [13]. More specifically, deep learning models were also established: deep neural networks (DNN) [14, 15, 16, 17, 18], deep belief networks (DBN) [19], convolutional neural networks (CNN) [20, 21], and recurrent neural networks (RNN) [22, 23]. Moreover, speech enhancement (SE) based methods were introduced to cope with low-SNR noisy environments [24, 25].

The above machine learning based models have achieved great progress on the VAD task. These models share a common characteristic: to obtain a robust VAD system, speech including contextual information, just as in other speech-based systems, is forwarded into a neural network. However, when training or testing, it was hard to select the optimal hyper-parameter that determines the amount of contextual information. Specifically, when more information was selected, much noisy information may be included around short speech segments and cause false positive detections. When less contextual information was selected, the VAD system may not make use of enough effective information to make the right classification. Based on the above analysis, when performing VAD tasks, a VAD system should be capable of selecting the most appropriate contextual information according to the characteristics of the current speech and focusing on the most important speech segments to obtain the optimal detection results. Motivated by the successful application of the attention mechanism in image and natural language understanding tasks, this paper proposed an attention-based model, MLNET, to choose the appropriate speech segment for the classifier. MLNET took advantage of different gated units to extract different contextual speech information and leveraged a channel-selection attention mechanism to choose the most appropriate contextual information to adapt to different noisy environments. In particular, the first attention model for the VAD task was the ACAM model; it focused on the effect of a certain frame in a fixed window, which is different from our window-size attention [23].
From our point of view, the most important frame for the detection result is the current frame; the surrounding frames are also important for the final result, but they serve as auxiliary information.

With respect to the state of the art, the main contributions of this paper were as follows:
• An architecture, MLNET, which could adapt to different segments and select the optimal contextual information for the final classifier, was proposed for VAD.
• A useful mechanism of gated units was leveraged to extract different contextual speech information.
• An attention strategy for effective and appropriate contextual information selection was investigated.
• The proposed method was benchmarked against several state-of-the-art methods, and the function of each part of MLNET was also compared.
2. Methodology
The MLNET-based VAD realized a frame-based speech/non-speech classifier. Suppose the training corpus can be marked as χ = {(x_t, y_t)}_{t=1}^{T}, where x_t is the acoustic feature vector of the t-th frame of the audio signal and y_t denotes the label of x_t. If x_t is a speech frame, then y_t = 1; otherwise, y_t = 0. Because of the importance of contextual information for speech applications, x_t is usually expanded to I_t = [x_{t-r}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r}] when training or testing. Here, the objective of VAD is to learn a function f as in (1):

ŷ_t = f([x_{t-r}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r}])    (1)

where ŷ_t denotes the prediction for x_t and r denotes the window size of the contextual speech information.

In experiments, the value r was a hyper-parameter, and a fixed value of r was hard to adapt to various speech environments. For example, when x_t was a speech frame and the speech segment around x_t was short, a large r would include much non-speech information and cause false positive results. In turn, when r was small, the contextual information may not be fully utilized, which can result in false negative or false positive results. To address these problems, this paper proposed the MLNET model, which leveraged multiple gated affined units and an attention mechanism to adaptively choose the optimal receptive-field speech for classification. The MLNET architecture is shown in Figure 1.

Figure 1: The MLNET architecture. Each x_{t-i} (i ∈ [-r, +r]) represents the 40-dimensional log-Mel features of one frame, and (2r+1) frames' features are input into MLNET.

The detailed architecture of the multiple receptive-field attention block is shown in Figure 2. In order to realize adaptive feature selection, multiple branches were leveraged to extract speech features with different receptive fields; each branch represents a specific receptive field, i.e., a specific amount of contextual information. Specifically, for the branch with receptive field r_i, the contextual speech information can be denoted as I′_t = [x_{t-r_i}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r_i}], which includes (2·r_i + 1) frame features. For the calculation of the subsequent attention modules, the feature matrices I′_t of different sizes need to be converted into feature matrices of the same size. This paper made use of the gated affined unit to finish this task, as this non-linearity has proved to work better for modeling audio than other affine functions or ReLU [26]. The gated affined unit is defined as (2):

q_{t,i} = tanh(W_{f,r_i} ∗ I′_t + b_{f,r_i}) ⊙ σ(W_{g,r_i} ∗ I′_t + b_{g,r_i})    (2)

where W_{f,r_i}, b_{f,r_i} denote the affine parameters of the tanh branch and W_{g,r_i}, b_{g,r_i} denote the affine parameters of the sigmoid branch. It is obvious that the sizes of W_{f,r_i} and W_{g,r_i} change with r_i while the size of q_{t,i} is constant regardless of r_i. The ∗ denotes a convolution operator, while ⊙ denotes an element-wise multiplication operator.

Figure 2: Adaptive multiple receptive-field attention block. As can be seen, input features with different receptive fields r_i are processed in different branches, and the attentional feature vector at time t is calculated based on each branch.

After the gated affined unit operation, different two-dimensional feature matrices, denoted q_{t,1}, ..., q_{t,i}, ..., q_{t,r}, are obtained, and each feature matrix q_{t,i} has the same size.
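To make the gated affined unit of (2) concrete, the following is a minimal PyTorch sketch of one branch; the module name, tensor shapes, and the choice of a 'valid' Conv1d over the frame axis are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedAffinedUnit(nn.Module):
    """One branch of Eq. (2): tanh(W_f * I + b_f) ⊙ σ(W_g * I + b_g).

    Hypothetical sketch: maps a (2*r_i+1)-frame context of 40-dim
    log-mel features to a fixed-size output, independent of r_i.
    """
    def __init__(self, r_i: int, feat_dim: int = 40, out_channels: int = 40):
        super().__init__()
        kernel = 2 * r_i + 1  # full receptive field of this branch
        # A 'valid' convolution over the frame axis collapses the whole
        # context window, so every branch yields the same output size.
        self.conv_f = nn.Conv1d(feat_dim, out_channels, kernel)
        self.conv_g = nn.Conv1d(feat_dim, out_channels, kernel)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (batch, feat_dim, 2*r_i+1) -- the window I'_t around frame t
        return torch.tanh(self.conv_f(ctx)) * torch.sigmoid(self.conv_g(ctx))

# Example: branches with receptive fields 1, 3, 5, 7, 9 all emit
# tensors of identical shape (batch, 40, 1).
branches = [GatedAffinedUnit(r) for r in (1, 3, 5, 7, 9)]
frames = torch.randn(8, 40, 19)  # a 19-frame context window, batch of 8
outs = [b(frames[:, :, 9 - r:9 + r + 1]) for b, r in zip(branches, (1, 3, 5, 7, 9))]
assert all(o.shape == outs[0].shape for o in outs)
```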
By analogy with images, each feature matrix q_{t,i} can be regarded as a channel feature map, and we produce a channel attention map to decide which feature matrices should be focused on. Based on previous channel attention work in images [27, 28], the attention module's structure is shown in Figure 3. Firstly, both average-pooling and max-pooling operations are used to aggregate the information of each feature matrix, generating two feature descriptors that represent the different feature matrices. Then, both descriptors are forwarded to a shared 2-layer fully connected DNN to produce two different attention ratio vectors, p_{t,max} and p_{t,avg}. Finally, by adding a normalization operation, p_{t,max} and p_{t,avg} are used to generate the final attention ratio vector a_t of (3):

a_t = σ(FC(avgpool(q_t)) + FC(maxpool(q_t)))
    = σ(W_1(W_0(avgpool(q_t))) + W_1(W_0(maxpool(q_t))))    (3)

where FC denotes the shared 2-layer fully connected DNN and W_0 and W_1 are the weights of the DNN. Note that the first-layer weight W_0 is followed by a leaky ReLU activation function. After obtaining a_t, the attentioned (scaled) feature matrix is calculated through (4)-(7):

a_t = [a_{t,1}, ..., a_{t,i}, ..., a_{t,r}]    (4)

p_{t,i} = σ(a_{t,i}) / Σ_{i=1}^{r} σ(a_{t,i})    (5)

p_t = [p_{t,1}, p_{t,2}, ..., p_{t,r}]    (6)

Q_t = Σ_{i=1}^{r} p_{t,i} · q_{t,i}    (7)

where a_t denotes the attention vector calculated by the attention module, σ is the sigmoid function, and p_t represents the normalized value of a_t. Finally, Q_t is calculated by summing the original q_{t,i} weighted by the attentional coefficients.

Figure 3: Attention module structure. The gated affined unit output q_i of receptive field r_i produces two feature vectors through max-pooling and average-pooling, and the two vectors are input into a 2-layer FC to produce the attentional weights. The final attentional feature matrix is obtained through (4)-(7).

Through the adaptive multiple receptive-field attention block, the scaled feature matrix Q_t is obtained, and Q_t is reshaped into a feature vector m_t. The Bi-LSTM is skilled at learning contextual speech information and the DNN is an excellent classifier, so the feature sequence M = [m_1, m_2, ..., m_T] is then fed into a two-layer Bi-LSTM and a 1-layer fully connected DNN to make the final classification. The training of the MLNET-based VAD can be regarded as a common supervised optimization problem with the traditional cross-entropy loss function:

L_crossentropy = − Σ_{t=1}^{T} { y_t log(ŷ_t) + (1 − y_t) log(1 − ŷ_t) }    (8)

Because of the characteristics of the proposed model, this paper further investigated an additional attention loss function to adapt the attention mechanism in the training phase. The attention loss function was designed to emphasize the most important receptive field r_k, with the definitions shown in (9)-(11):

k = argmax_i (p_{t,i})    (9)

L_attention = − Σ_{t=1}^{T} Σ_{i=1}^{r} y_{t,i} log(p_{t,i})    (10)

L = L_crossentropy + L_attention    (11)

where L_attention can be regarded as a softmax-style classification problem in which the most appropriate receptive field p_{t,k} is the target: the corresponding y_{t,k} = 1 and the other y_{t,i} = 0 (i ≠ k).
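The channel attention of (3)-(7) can be sketched as follows; this is a hypothetical PyTorch rendering under our own naming, with the per-branch outputs q_{t,i} stacked along a channel axis.

```python
import torch
import torch.nn as nn

class ReceptiveFieldAttention(nn.Module):
    """Sketch of Eqs. (3)-(7): score each branch's feature map with a
    shared 2-layer FC over avg- and max-pooled descriptors, normalize
    the scores, and sum the branches with those weights."""
    def __init__(self, num_branches: int = 5, hidden: int = 8):
        super().__init__()
        self.fc = nn.Sequential(              # shared FC of Eq. (3)
            nn.Linear(num_branches, hidden),  # W_0
            nn.LeakyReLU(),                   # leaky ReLU after W_0
            nn.Linear(hidden, num_branches),  # W_1
        )

    def forward(self, q: torch.Tensor):
        # q: (batch, num_branches, feat) -- stacked branch outputs q_{t,i}
        avg = q.mean(dim=2)                   # avg-pool descriptor
        mx = q.amax(dim=2)                    # max-pool descriptor
        a = torch.sigmoid(self.fc(avg) + self.fc(mx))   # Eq. (3)
        p = torch.sigmoid(a)                  # second sigmoid, per Eq. (5)
        p = p / p.sum(dim=1, keepdim=True)    # normalization of Eq. (5)
        Q = (p.unsqueeze(-1) * q).sum(dim=1)  # weighted sum of Eq. (7)
        return Q, p

attn = ReceptiveFieldAttention()
q = torch.randn(8, 5, 40)   # 5 branches, 40-dim gated features
Q, p = attn(q)              # Q: (8, 40); p sums to 1 over branches
```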
3. Experiments
Table 1: Model Configuration

Name                 Unit
Gated affined unit   (2·[1, 3, 5, 7, 9] + 1) × (2·[1, 3, 5, 7, 9] + 1)
Attention module     1
2-layer Bi-LSTM      (64 + 64) × (64 + 64)
1-layer FC           64

In this paper, the English corpus Aurora4 [29] and the Chinese corpus Thchs30 [30] were applied to train and test the proposed model. In our experiments, firstly, because of the imbalance of speech and non-speech, 2-second-long silence segments were added to the beginning and end of each utterance. In training, the clean speech corpora were corrupted by the 100 public noise types of HuNonspeech (http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/). Each utterance was randomly corrupted at a level of -5 dB to 20 dB SNR, and all utterances have an average SNR of 7.5 dB. The NOISEX-92 noise dataset (http://spib.rice.edu/spib/select_noise.html) was used to construct the testing dataset, and 4 types of unseen noises, including babble, factory and destroyer engine, were selected to corrupt the clean speech. The SNR was likewise set between -5 dB and 20 dB with an average of 7.5 dB. For Aurora4, 95% of the training data were used for training and 5% were used as dev data; the testing corpus of Aurora4 was corrupted by NOISEX-92 and leveraged as the testing data. For Thchs30, the dataset was constructed with the same process, the only difference being that the dev dataset leveraged the original corresponding data.
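As an illustration of the corruption procedure described above, the following is a minimal NumPy sketch of mixing a noise clip into clean speech at a target SNR; the function name and helper logic are our paraphrase of the setup, not released tooling.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the result has the requested SNR (dB)."""
    # Tile or crop the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise power so that 10*log10(p_speech / p_noise') = snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Each utterance is corrupted at a random SNR in [-5, 20] dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
noise = rng.standard_normal(8000)     # stand-in for a HuNonspeech clip
noisy = mix_at_snr(speech, noise, rng.uniform(-5.0, 20.0))
```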
For comparison, the metrics of F1-score and DCF were selected as performance measurements. F1-score takes into account both precision and recall, and is a common evaluation index for binary classification problems. DCF reflects the error performance of the model. The two metrics were defined as follows:

F1-score(θ) = 2·TP / (2·TP + FP + FN)    (12)

DCF(θ) = 0.75 × P_FN(θ) + 0.25 × P_FP(θ)    (13)

where θ denotes a given system decision-threshold setting. TP represents the number of true positive examples, while FP and FN represent the numbers of false positive and false negative examples. P_FP is the probability of a false positive and P_FN is the probability of a false negative; the weights of (13) follow the Fearless Steps protocol (http://fearlesssteps.exploreapollo.org/). It should be noted that the larger the F1-score, the better the performance, while the smaller the DCF, the better the performance. In testing, we calculated the two metrics for each recording respectively and averaged the metrics of all recordings as the final score.
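A minimal sketch of both metrics over frame-level decisions follows; the helper is our own (the definitions of P_FN and P_FP as miss and false-alarm rates are an assumption consistent with (13)).

```python
import numpy as np

def f1_and_dcf(y_true: np.ndarray, y_pred: np.ndarray):
    """Frame-level F1-score, Eq. (12), and DCF, Eq. (13), for 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    f1 = 2 * tp / (2 * tp + fp + fn)
    p_fn = fn / max(tp + fn, 1)   # miss rate over speech frames
    p_fp = fp / max(fp + tn, 1)   # false-alarm rate over non-speech frames
    dcf = 0.75 * p_fn + 0.25 * p_fp
    return f1, dcf

# Metrics are computed per recording, then averaged over recordings.
y_true = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 1, 0])
print(f1_and_dcf(y_true, y_pred))
```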
Table 2: Result Comparison of Aurora4

                      F1-score          DCF
Name                  Dev     Eval      Dev     Eval
Google VAD (mode 0)   72.33   76.32     22.06   18.34
CRNN                  89.14   87.23     9.23    10.34
ACAM                  90.56   89.03     8.95    9.01
MLNET                 91.38   89.27     8.77    9.23
Table 3: Result Comparison of Thchs30

                      F1-score          DCF
Name                  Dev     Eval      Dev     Eval
Google VAD (mode 0)   74.71   74.60     17.88   18.18
CRNN                  91.35   89.90     8.21    9.72
ACAM                  92.53   91.27     7.67    8.51
MLNET                 93.25   92.58     6.89    8.12
The acoustic feature extracted for MLNET was the 40-dimensional log-mel filterbank, with a frame size of 25 ms and a shift of 10 ms. The window of the attention block was set to 19 frames, which corresponds to 190 ms of contextual information, while the gated affined units' receptive fields were set to 1, 3, 5, 7 and 9. Other parameters of MLNET are shown in Table 1. For training our proposed model, the matrix weight parameters of MLNET were all initialized with random uniform initialization, while the bias parameters were initialized to a constant of 0.1. In our experiments, we trained the network for 150 epochs with the Adam algorithm, stopping when the loss function changed little. The batch size was 32 and the learning rate was set to 0.001. During training, a gradient clipping strategy was also applied: the gradient of each parameter at each iteration was limited to between -1 and 1.
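Putting the optimization details together, a short hypothetical training-step sketch follows; the placeholder model and tensors are ours, and only the optimizer settings and the per-parameter clipping to [-1, 1] come from the text above.

```python
import torch
import torch.nn as nn

# Placeholder standing in for MLNET's attention block + Bi-LSTM + FC head.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(features).squeeze(-1)
    loss = bce(logits, labels)   # cross-entropy term of Eq. (8)
    loss.backward()
    # Per-parameter gradient clipping to [-1, 1], as described above.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.clamp_(-1.0, 1.0)
    optimizer.step()
    return loss.item()

features = torch.randn(32, 40)                 # batch size 32
labels = torch.randint(0, 2, (32,)).float()    # frame-level 0/1 targets
print(train_step(features, labels))
```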
To demonstrate the effectiveness of our proposed model, three VAD approaches were used for comparison. The first was Google's WebRTC VAD system [31]. Additionally, Vafeiadis et al. proposed a 2-D CRNN model that has achieved great success in recent speech activity detection [32]. Next, the ACAM approach, which is the state of the art among attention-based methods, was also included [23]. In our experiments, we made use of the same parameters, but trick strategies such as batch normalization and regularization were not leveraged. To alleviate the effect of the input features, it should be noted that 40-dimensional log-mel acoustic features were used to establish the CRNN and ACAM baseline models, which differs from the original approaches.

The results are summarized in Table 2 and Table 3. We can observe that MLNET achieved the best performance and outperformed the other three baseline models. In particular, the three learning-based methods achieved F1-scores about 10% higher than Google's VAD and DCFs about 8% lower. ACAM and MLNET outperformed the conventional deep learning method CRNN, which proves that attention-based models are helpful for improving detection accuracy. Compared with ACAM, MLNET applied attention over windows instead of ACAM's frame-based attention, and the experiments showed that the window-based attention model achieved higher performance than the frame-based one. This also conforms to our prior knowledge that the current frame's information is the most important and the surrounding frames are auxiliary information when predicting the current frame.
Table 4: Module Comparison

                  F1-score          DCF
Model             Dev     Eval      Dev     Eval
Bi-LSTM           85.89   84.17     12.24   13.54
+Gated Unit       87.63   86.24     11.16   12.01
+Non-Attention    90.85   88.73     9.11    9.88
+Attention        91.38   89.27     8.77    9.23
In order to illustrate the functionality of each part of our proposed model, a comparison of the individual modules was further investigated. The base was the Bi-LSTM, which just leveraged the contextual speech information: the feature vectors of the contextual speech were concatenated into a longer vector before being fed into the network. The second was the gated unit model, in which the gated unit operation replaced the concatenation mechanism. The third and the fourth were leveraged to verify the effect of the multiple windows, with and without attention. The Aurora4 dataset was leveraged for this evaluation, and the results are shown in Table 4. As noted in this table, the gated affined unit based models had better performance than the direct concatenation of Bi-LSTM inputs, achieving about a 2% increase in F1-score and a 1.5% decrease in DCF. Adding the multiple receptive fields without attention, the VAD's performance was further improved by about a 2.5% increase in F1-score and a 2.13% decrease in DCF. Lastly, the adaptive attention mechanism was also helpful for MLNET's performance. In contrast, the adaptive attention achieved only a small accuracy improvement over the non-attention module; the reason may be that a mechanism of receptive-field selection already exists in the multiple receptive-field block. In particular, we observed in our experiments that the adaptive multiple receptive-field attention block was also helpful for speeding up model convergence. To sum up, the proposed method can better deal with VAD problems.
4. Conclusion
Existing DNN-based VAD models only leveraged fixed receptive-field contextual speech information and were unable to handle speech segments of different lengths adaptively. To overcome this defect, this paper proposed the MLNET architecture for the VAD task. MLNET made use of different gated affined units to extract different contextual speech information and leveraged an adaptive attention block to select the most relevant speech segments. Compared with existing models, the experiments demonstrated that MLNET outperformed the other baseline models, proving that the proposed architecture is helpful for dealing with VAD problems.
5. Acknowledgements
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

References

[1] M. W. Mak and H. B. Yu, "A study of voice activity detection techniques for NIST speaker recognition evaluations," Computer Speech & Language, 2014, pp. 295-313.
[2] J. Ramirez, J. M. Gorriz, and J. C. Segura, Voice Activity Detection: Fundamentals and Speech Recognition System Robustness. InTech, 2007.
[3] T. Yoshimura, T. Hayashi, K. Takeda, and S. Watanabe, "End-to-end automatic speech recognition integrated with CTC-based voice activity detection," in ICASSP, 2020.
[4] B. Sharma, R. K. Das, and H. Li, "Multi-level adaptive speech activity detector for speech in naturalistic environments," in Interspeech, 2019.
[5] F. Martinelli, G. Dellaferrera, P. Mainar, and M. Cernak, "Spiking neural networks trained with backpropagation for low power neuromorphic implementation of voice activity detection," in ICASSP, 2020.
[6] G. Dellaferrera, F. Martinelli, and M. Cernak, "A bin encoding training of a spiking neural network based voice activity detection," in ICASSP, 2020.
[7] X. Yang, B. Tan, J. Ding, J. Zhang, and J. Gong, "Comparative study on voice activity detection algorithm," in International Conference on Electrical and Control Engineering, 2010.
[8] K.-H. Woo, T.-Y. Yang, K.-J. Park, and C. Lee, "Robust voice activity detection algorithm for estimating noise spectrum," Electronics Letters, vol. 36, no. 2, pp. 180-181, 2002.
[9] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detector," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.
[10] T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, "Voice activity detection: Merging source and filter-based information," IEEE Signal Processing Letters, 2016, pp. 252-256.
[11] T. Ng, B. Zhang, and L. Nguyen, "Developing a speech activity detection system for the DARPA RATS program," in Interspeech, 2012.
[12] V. H. and S. H., "Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement," IET Signal Processing, vol. 6, no. 1, pp. 54-63, 2012.
[13] E. Dong, G. Liu, Y. Zhou, and X. Zhang, "Applying support vector machines to voice activity detection," in International Conference on Signal Processing Proceedings. IEEE, 2003.
[14] S. Tong, H. Gu, and K. Yu, "A comparative study of robustness of deep learning approaches for VAD," in ICASSP, 2016.
[15] X. Zhang and D. Wang, "Boosting contextual information for deep neural network based voice activity detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 2, pp. 252-264, 2016.
[16] Z. C. Fan, Z. Bai, X.-L. Zhang, S. Rahardja, and J. Chen, "AUC optimization for deep learning based voice activity detection," in ICASSP, 2019.
[17] A. Ivry, I. Cohen, and B. Berdugo, "Evaluation of deep-learning-based voice activity detectors and room impulse response models in reverberant environments," in ICASSP, 2020.
[18] A. Ivry, B. Berdugo, and I. Cohen, "Voice activity detection for transient noisy environment based on diffusion nets," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 254-264, 2019.
[19] X. Zhang and J. Wu, "Deep belief networks based voice activity detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697-710, 2013.
[20] I. Ariav and I. Cohen, "An end-to-end multimodal voice activity detection using WaveNet encoder and residual networks," IEEE Journal of Selected Topics in Signal Processing, pp. 1-1, 2019.
[21] S. Y. Chang, B. Li, G. Simko, T. N. Sainath, A. Tripathi, A. van den Oord, and O. Vinyals, "Temporal modeling using dilated convolution and gating for voice-activity-detection," in ICASSP, 2018.
[22] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in ICASSP, 2013, pp. 7378-7382.
[23] J. Kim and M. Hahn, "Voice activity detection using an adaptive context attention model," IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1, 2018.
[24] R. Lin, C. Costello, C. Jankowski, and V. Mruthyunjaya, "Optimizing voice activity detection for noisy conditions," in Interspeech, 2019, pp. 2030-2034.
[25] Y. Jung, Y. Kim, Y. Choi, and H. Kim, "Joint learning using denoising variational autoencoders for voice activity detection," in Interspeech, 2018, pp. 1210-1214.
[26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1606.05328
[27] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, 2018, pp. 7132-7141.
[28] S. Woo, J. Park, J. Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in ECCV, 2018, pp. 3-19.
[29] N. Parihar and J. Picone, Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02. Institute for Signal and Information Processing, Mississippi State University, 2002.
[30] D. Wang, X. Zhang, and Z. Zhang, "THCHS-30: A free Chinese speech corpus," 2015. [Online]. Available: http://arxiv.org/abs/1512.01882
[31] Google WebRTC, "https://webrtc.org/," 2016.
[32] A. Vafeiadis, E. Fanioudakis, I. Potamitis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, and R. Hamzaoui, "Two-dimensional convolutional recurrent neural networks for speech activity detection," in ICASSP, 2019.