MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection
Zhenpeng Zheng, Jianzong Wang*, Ning Cheng, Jian Luo, Jing Xiao

Ping An Technology (Shenzhen) Co., Ltd.
{zhengzhenpeng479, wangjianzong347, chengning221, luojian304, xiaojing661}@pingan.com.cn

*Corresponding author: Jianzong Wang, [email protected]

Abstract
Voice activity detection (VAD) makes a distinction between speech and non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal processing methods. Existing DNN-based models always handcrafted a fixed window to make use of contextual speech information to improve the performance of VAD. However, a fixed window of contextual speech information cannot handle various unpredictable noise environments or highlight the speech information most critical to the VAD task. In order to solve this problem, this paper proposed an adaptive multiple receptive-field attention neural network, called MLNET, for the VAD task. MLNET leveraged multiple branches to extract multiple kinds of contextual speech information and investigated an effective attention block to weight the most crucial parts of the context for the final classification. Experiments in real-world scenarios demonstrated that the proposed MLNET-based model outperformed other baselines.
Index Terms: Voice Activity Detection, Adaptive Multiple Receptive-field Attention
1. Introduction
Voice Activity Detection (VAD), which aims at removing noise or silence from the original speech signal and obtaining the valid speech signal, is an essential part of speech recognition and other speech-based applications [1, 2, 3]. Unfortunately, in real environments, speech signals may contain numerous background noises and have a low signal-to-noise ratio (SNR), which brings great challenges to building an accurate VAD system.

With the continuous development of speech technologies, research on VAD has remained a hot spot [4, 5, 6]. Early research focused on parametric methods: energy functions, zero-crossing rate, statistical signal analysis, or other acoustic features [7, 8, 9, 10]. Later, various machine learning based methods were applied to VAD systems: Gaussian Mixture Models (GMM) [11], Hidden Markov Models (HMM) [12], and Support Vector Machines (SVM) [13]. More specifically, deep learning models were also established: deep neural networks (DNN) [14, 15, 16, 17, 18], deep belief networks (DBN) [19], convolutional neural networks (CNN) [20, 21], and recurrent neural networks (RNN) [22, 23]. Moreover, speech enhancement (SE) based methods were introduced to cope with low-SNR noisy environments [24, 25].

The above machine learning based models have achieved great progress on the VAD task. These models share a common characteristic: to obtain a robust VAD system, speech including contextual information, just as in other speech-based systems, is forwarded into a neural network. However, when training or testing, it was hard to select the optimal hyper-parameter that determines the amount of contextual information. Specifically, when more information was selected, much noisy information may be included around short speech segments and cause false positive detections. When less contextual information was selected, the VAD system may not make use of enough effective information to make the right classification. Based on the above analysis, when performing VAD tasks, a VAD system should be capable of selecting the most appropriate contextual information according to the characteristics of the current speech and focusing on the most important speech segments to obtain the optimal detection results. Motivated by the successful application of the attention mechanism in image and natural language understanding tasks, this paper proposed an attention-based model, MLNET, to choose the appropriate speech segment for the classifier. MLNET took advantage of different gated units to extract different contextual speech information and leveraged a channel-selection attention mechanism to choose the most appropriate contextual information to adapt to different noisy environments. In particular, the first attention model for the VAD task was the ACAM model; it focused on the effect of a certain frame in a fixed window, which is different from our window-size attention [23].
From our point of view, the most important frame for the detection result is the current frame; the surrounding frames are also important for the final result, but they serve as auxiliary information.

With respect to the state of the art, the main contributions of this paper were as follows:
• An architecture, MLNET, which could adapt to different segments and select the optimal contextual information for the final classifier, was proposed for VAD.
• A useful mechanism of gated units was leveraged to extract different contextual speech information.
• An attention strategy for effective and appropriate contextual information selection was investigated.
• The proposed method was benchmarked against several state-of-the-art methods, and the function of each part of MLNET was also compared.
2. Methodology
The MLNET-based VAD realized a frame-based speech/non-speech classifier. Suppose the training corpus can be marked as χ = {(x_t, y_t)}_{t=1}^{T}, where x_t is the acoustic feature vector of the t-th frame of the audio signal and y_t denotes the label of x_t. If x_t is a speech frame, then y_t = 1; otherwise, y_t = 0. Because of the importance of contextual information for speech applications, x_t is usually expanded to I_t = [x_{t-r}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r}] when training or testing. Here, the objective of VAD is to learn a function f as in (1):

ŷ_t = f([x_{t-r}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r}])    (1)

where ŷ_t denotes the prediction for x_t and r denotes the window size of the contextual speech information.

In experiments, the value r was a hyper-parameter, and a fixed value of r was hard to adapt to various speech environments. For example, when x_t was a speech frame and the speech segment around x_t was short, a large r would include much non-speech information and cause false positive results. In turn, when r was small, the contextual information may not be fully utilized, which can result in false negative or false positive results. To address these problems, this paper proposed the MLNET model, which leveraged multiple gated affined units and an attention mechanism to adaptively choose the optimal receptive-field speech for classification. The MLNET architecture is shown in Figure 1.

Figure 1: The MLNET architecture. Each x_{t-i} (i ∈ [-r, +r]) represents the 40-dimensional log-Mel features of one frame, and (2r+1) frames' features are input into MLNET.

The detailed architecture of the multiple receptive-field attention block is shown in Figure 2. In order to realize adaptive feature selection, multiple branches were leveraged to extract speech features with different receptive fields; each branch represents a specific receptive field, i.e., a specific amount of contextual information. Specifically, for the branch with receptive field r_i, the contextual speech information can be denoted as I′_t = [x_{t-r_i}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r_i}], which includes (2·r_i + 1) frame features. For the calculation of the subsequent attention modules, the feature matrices I′_t of different sizes need to be converted into feature matrices of the same size. This paper made use of the gated affined unit to finish this task, as this non-linearity has proved to work better for modeling audio than other affine functions or ReLU [26]. The gated affined unit is defined as (2):

q_{t,i} = tanh(W_{f,r_i} ∗ I′_t + b_{f,r_i}) ⊙ σ(W_{g,r_i} ∗ I′_t + b_{g,r_i})    (2)

where W_{f,r_i}, b_{f,r_i} denote the affine parameters of the tanh branch and W_{g,r_i}, b_{g,r_i} denote the affine parameters of the sigmoid branch. It is obvious that the sizes of W_{f,r_i} and W_{g,r_i} change with r_i while the size of q_{t,i} is constant regardless of r_i. The ∗ denotes a convolution operator, while ⊙ denotes an element-wise multiplication operator.

Figure 2: Adaptive multiple receptive-field attention block. As can be seen, input features with different receptive fields r_i are processed in different branches, and the attentional feature vector at time t is calculated based on each branch.

After the gated affined unit operation, different two-dimensional feature matrices, denoted q_{t,1}, ..., q_{t,i}, ..., q_{t,r}, are obtained, and each feature matrix q_{t,i} has the same size.
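To make the gated affined unit of (2) concrete, the following is a minimal PyTorch sketch of one branch; the module name, tensor shapes, and the choice of a 'valid' Conv1d over the frame axis are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedAffinedUnit(nn.Module):
    """One branch of Eq. (2): tanh(W_f * I + b_f) ⊙ σ(W_g * I + b_g).

    Hypothetical sketch: maps a (2*r_i+1)-frame context of 40-dim
    log-mel features to a fixed-size output, independent of r_i.
    """
    def __init__(self, r_i: int, feat_dim: int = 40, out_channels: int = 40):
        super().__init__()
        kernel = 2 * r_i + 1  # full receptive field of this branch
        # A 'valid' convolution over the frame axis collapses the whole
        # context window, so every branch yields the same output size.
        self.conv_f = nn.Conv1d(feat_dim, out_channels, kernel)
        self.conv_g = nn.Conv1d(feat_dim, out_channels, kernel)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (batch, feat_dim, 2*r_i+1) -- the window I'_t around frame t
        return torch.tanh(self.conv_f(ctx)) * torch.sigmoid(self.conv_g(ctx))

# Example: branches with receptive fields 1, 3, 5, 7, 9 all emit
# tensors of identical shape (batch, 40, 1).
branches = [GatedAffinedUnit(r) for r in (1, 3, 5, 7, 9)]
frames = torch.randn(8, 40, 19)  # a 19-frame context window, batch of 8
outs = [b(frames[:, :, 9 - r:9 + r + 1]) for b, r in zip(branches, (1, 3, 5, 7, 9))]
assert all(o.shape == outs[0].shape for o in outs)
```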
By analogy with images, each feature matrix q_{t,i} can be regarded as a channel feature map, and we produce a channel attention map to decide which feature matrices should be focused on. Based on previous channel attention work in images [27, 28], the attention module's structure is shown in Figure 3. Firstly, both average-pooling and max-pooling operations are used to aggregate the information of each feature matrix, generating two feature descriptors that represent the different feature matrices. Then, both descriptors are forwarded to a shared 2-layer fully connected DNN to produce two different attention ratio vectors, p_{t,max} and p_{t,avg}. Finally, by adding a normalization operation, p_{t,max} and p_{t,avg} are used to generate the final attention ratio vector a_t of (3):

a_t = σ(FC(avgpool(q_t)) + FC(maxpool(q_t)))
    = σ(W_1(W_0(avgpool(q_t))) + W_1(W_0(maxpool(q_t))))    (3)

where FC denotes the shared 2-layer fully connected DNN and W_0 and W_1 are the weights of the DNN. Note that the first-layer weight W_0 is followed by a leaky ReLU activation function. After obtaining a_t, the attentioned (scaled) feature matrix is calculated through (4)-(7):

a_t = [a_{t,1}, ..., a_{t,i}, ..., a_{t,r}]    (4)

p_{t,i} = σ(a_{t,i}) / Σ_{i=1}^{r} σ(a_{t,i})    (5)

p_t = [p_{t,1}, p_{t,2}, ..., p_{t,r}]    (6)

Q_t = Σ_{i=1}^{r} p_{t,i} · q_{t,i}    (7)

where a_t denotes the attention vector calculated by the attention module, σ is the sigmoid function, and p_t represents the normalized value of a_t. Finally, Q_t is calculated by summing the original q_{t,i} weighted by the attentional coefficients.

Figure 3: Attention module structure. The gated affined unit output q_i of receptive field r_i produces two feature vectors through max-pooling and average-pooling, and the two vectors are input into a 2-layer FC to produce the attentional weights. The final attentional feature matrix is obtained through (4)-(7).

Through the adaptive multiple receptive-field attention block, the scaled feature matrix Q_t is obtained, and Q_t is reshaped into a feature vector m_t. The Bi-LSTM is skilled at learning contextual speech information and the DNN is an excellent classifier, so the feature sequence M = [m_1, m_2, ..., m_T] is then fed into a two-layer Bi-LSTM and a 1-layer fully connected DNN to make the final classification. The training of the MLNET-based VAD can be regarded as a common supervised optimization problem with the traditional cross-entropy loss function:

L_crossentropy = − Σ_{t=1}^{T} { y_t log(ŷ_t) + (1 − y_t) log(1 − ŷ_t) }    (8)

Because of the characteristics of the proposed model, this paper further investigated an additional attention loss function to adapt the attention mechanism in the training phase. The attention loss function was designed to emphasize the most important receptive field r_k, with the definitions shown in (9)-(11):

k = argmax_i (p_{t,i})    (9)

L_attention = − Σ_{t=1}^{T} Σ_{i=1}^{r} y_{t,i} log(p_{t,i})    (10)

L = L_crossentropy + L_attention    (11)

where L_attention can be regarded as a softmax-style classification problem in which the most appropriate receptive field p_{t,k} is the target: the corresponding y_{t,k} = 1 and the other y_{t,i} = 0 (i ≠ k).
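The channel attention of (3)-(7) can be sketched as follows; this is a hypothetical PyTorch rendering under our own naming, with the per-branch outputs q_{t,i} stacked along a channel axis.

```python
import torch
import torch.nn as nn

class ReceptiveFieldAttention(nn.Module):
    """Sketch of Eqs. (3)-(7): score each branch's feature map with a
    shared 2-layer FC over avg- and max-pooled descriptors, normalize
    the scores, and sum the branches with those weights."""
    def __init__(self, num_branches: int = 5, hidden: int = 8):
        super().__init__()
        self.fc = nn.Sequential(              # shared FC of Eq. (3)
            nn.Linear(num_branches, hidden),  # W_0
            nn.LeakyReLU(),                   # leaky ReLU after W_0
            nn.Linear(hidden, num_branches),  # W_1
        )

    def forward(self, q: torch.Tensor):
        # q: (batch, num_branches, feat) -- stacked branch outputs q_{t,i}
        avg = q.mean(dim=2)                   # avg-pool descriptor
        mx = q.amax(dim=2)                    # max-pool descriptor
        a = torch.sigmoid(self.fc(avg) + self.fc(mx))   # Eq. (3)
        p = torch.sigmoid(a)                  # second sigmoid, per Eq. (5)
        p = p / p.sum(dim=1, keepdim=True)    # normalization of Eq. (5)
        Q = (p.unsqueeze(-1) * q).sum(dim=1)  # weighted sum of Eq. (7)
        return Q, p

attn = ReceptiveFieldAttention()
q = torch.randn(8, 5, 40)   # 5 branches, 40-dim gated features
Q, p = attn(q)              # Q: (8, 40); p sums to 1 over branches
```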
3. Experiments
Table 1: Model Configuration

Name                 Unit
Gated affined unit   (2·[1, 3, 5, 7, 9] + 1) × (2·[1, 3, 5, 7, 9] + 1)
Attention module     1
2-layer Bi-LSTM      (64 + 64) × (64 + 64)
1-layer FC           64

In this paper, the English corpus Aurora4 [29] and the Chinese corpus Thchs30 [30] were applied to train and test the proposed model. In our experiments, firstly, because of the imbalance of speech and non-speech, 2-second-long silence segments were added to the beginning and end of each utterance. In training, the clean speech corpora were corrupted by the 100 public noise types of HuNonspeech (http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/). Each utterance was randomly corrupted at a level of -5 dB to 20 dB SNR, and all utterances have an average SNR of 7.5 dB. The NOISEX-92 noise dataset (http://spib.rice.edu/spib/select_noise.html) was used to construct the testing dataset, and 4 types of unseen noises, including babble, factory and destroyer engine, were selected to corrupt the clean speech. The SNR was likewise set between -5 dB and 20 dB with an average of 7.5 dB. For Aurora4, 95% of the training data were used for training and 5% were used as dev data; the testing corpus of Aurora4 was corrupted by NOISEX-92 and leveraged as the testing data. For Thchs30, the dataset was constructed with the same process, the only difference being that the dev dataset leveraged the original corresponding data.
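As an illustration of the corruption procedure described above, the following is a minimal NumPy sketch of mixing a noise clip into clean speech at a target SNR; the function name and helper logic are our paraphrase of the setup, not released tooling.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the result has the requested SNR (dB)."""
    # Tile or crop the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise power so that 10*log10(p_speech / p_noise') = snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Each utterance is corrupted at a random SNR in [-5, 20] dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
noise = rng.standard_normal(8000)     # stand-in for a HuNonspeech clip
noisy = mix_at_snr(speech, noise, rng.uniform(-5.0, 20.0))
```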
For comparison, the metrics of F1-score and DCF were selected as performance measurements. F1-score takes into account both precision and recall, and is a common evaluation index for binary classification problems. DCF reflects the error performance of the model. The two metrics were defined as follows:

F1-score(θ) = 2·TP / (2·TP + FP + FN)    (12)

DCF(θ) = 0.75 × P_FN(θ) + 0.25 × P_FP(θ)    (13)

where θ denotes a given system decision-threshold setting. TP represents the number of true positive examples, while FP and FN represent the numbers of false positive and false negative examples. P_FP is the probability of a false positive and P_FN is the probability of a false negative; the weights of (13) follow the Fearless Steps protocol (http://fearlesssteps.exploreapollo.org/). It should be noted that the larger the F1-score, the better the performance, while the smaller the DCF, the better the performance. In testing, we calculated the two metrics for each recording respectively and averaged the metrics of all recordings as the final score.
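A minimal sketch of both metrics over frame-level decisions follows; the helper is our own (the definitions of P_FN and P_FP as miss and false-alarm rates are an assumption consistent with (13)).

```python
import numpy as np

def f1_and_dcf(y_true: np.ndarray, y_pred: np.ndarray):
    """Frame-level F1-score, Eq. (12), and DCF, Eq. (13), for 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    f1 = 2 * tp / (2 * tp + fp + fn)
    p_fn = fn / max(tp + fn, 1)   # miss rate over speech frames
    p_fp = fp / max(fp + tn, 1)   # false-alarm rate over non-speech frames
    dcf = 0.75 * p_fn + 0.25 * p_fp
    return f1, dcf

# Metrics are computed per recording, then averaged over recordings.
y_true = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 1, 0])
print(f1_and_dcf(y_true, y_pred))
```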
Table 2: Result Comparison of Aurora4

                      F1-score          DCF
Name                  Dev     Eval      Dev     Eval
Google VAD (mode 0)   72.33   76.32     22.06   18.34
CRNN                  89.14   87.23     9.23    10.34
ACAM                  90.56   89.03     8.95    9.01
MLNET                 91.38   89.27     8.77    9.23
Table 3: Result Comparison of Thchs30

                      F1-score          DCF
Name                  Dev     Eval      Dev     Eval
Google VAD (mode 0)   74.71   74.60     17.88   18.18
CRNN                  91.35   89.90     8.21    9.72
ACAM                  92.53   91.27     7.67    8.51
MLNET                 93.25   92.58     6.89    8.12
The acoustic feature extracted for MLNET was the 40-dimensional log-mel filterbank, with a frame size of 25 ms and a shift of 10 ms. The window of the attention block was set to 19 frames, which corresponds to 190 ms of contextual information, while the gated affined units' receptive fields were set to 1, 3, 5, 7 and 9. Other parameters of MLNET are shown in Table 1. For training our proposed model, the matrix weight parameters of MLNET were all initialized with random uniform initialization, while the bias parameters were initialized to a constant of 0.1. In our experiments, we trained the network for 150 epochs with the Adam algorithm, stopping when the loss function changed little. The batch size was 32 and the learning rate was set to 0.001. During training, a gradient clipping strategy was also applied: the gradient of each parameter at each iteration was limited to between -1 and 1.
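Putting the optimization details together, a short hypothetical training-step sketch follows; the placeholder model and tensors are ours, and only the optimizer settings and the per-parameter clipping to [-1, 1] come from the text above.

```python
import torch
import torch.nn as nn

# Placeholder standing in for MLNET's attention block + Bi-LSTM + FC head.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(features).squeeze(-1)
    loss = bce(logits, labels)   # cross-entropy term of Eq. (8)
    loss.backward()
    # Per-parameter gradient clipping to [-1, 1], as described above.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.clamp_(-1.0, 1.0)
    optimizer.step()
    return loss.item()

features = torch.randn(32, 40)                 # batch size 32
labels = torch.randint(0, 2, (32,)).float()    # frame-level 0/1 targets
print(train_step(features, labels))
```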
To demonstrate the effectiveness of our proposed model, three VAD approaches were used for comparison. The first was Google's WebRTC VAD system [31]. Additionally, Vafeiadis et al. proposed a 2-D CRNN model that has achieved great success in recent speech activity detection [32]. Next, the ACAM approach, which is the state of the art among attention-based methods, was also included [23]. In our experiments, we made use of the same parameters, but trick strategies such as batch normalization and regularization were not leveraged. To alleviate the effect of the input features, it should be noted that 40-dimensional log-mel acoustic features were used to establish the CRNN and ACAM baseline models, which differs from the original approaches.

The results are summarized in Table 2 and Table 3. We can observe that MLNET achieved the best performance and outperformed the other three baseline models. In particular, the three learning-based methods achieved F1-scores about 10% higher than Google's VAD and DCFs about 8% lower. ACAM and MLNET outperformed the conventional deep learning method CRNN, which proves that attention-based models are helpful for improving detection accuracy. Compared with ACAM, MLNET applied attention over windows instead of ACAM's frame-based attention, and the experiments showed that the window-based attention model achieved higher performance than the frame-based one. This also conforms to our prior knowledge that the current frame's information is the most important and the surrounding frames are auxiliary information when predicting the current frame.
Table 4: Module Comparison

                  F1-score          DCF
Model             Dev     Eval      Dev     Eval
Bi-LSTM           85.89   84.17     12.24   13.54
+Gated Unit       87.63   86.24     11.16   12.01
+Non-Attention    90.85   88.73     9.11    9.88
+Attention        91.38   89.27     8.77    9.23
In order to illustrate the functionality of each part of our proposed model, a comparison of the individual modules was further investigated. The base was the Bi-LSTM, which just leveraged the contextual speech information: the feature vectors of the contextual speech were concatenated into a longer vector before being fed into the network. The second was the gated unit model, in which the gated unit operation replaced the concatenation mechanism. The third and the fourth were leveraged to verify the effect of the multiple windows, with and without attention. The Aurora4 dataset was leveraged for this evaluation, and the results are shown in Table 4. As noted in this table, the gated affined unit based models had better performance than the direct concatenation of Bi-LSTM inputs, achieving about a 2% increase in F1-score and a 1.5% decrease in DCF. Adding the multiple receptive fields without attention, the VAD's performance was further improved by about a 2.5% increase in F1-score and a 2.13% decrease in DCF. Lastly, the adaptive attention mechanism was also helpful for MLNET's performance. In contrast, the adaptive attention achieved only a small accuracy improvement over the non-attention module; the reason may be that a mechanism of receptive-field selection already exists in the multiple receptive-field block. In particular, we observed in our experiments that the adaptive multiple receptive-field attention block was also helpful for speeding up model convergence. To sum up, the proposed method can better deal with VAD problems.
4. Conclusion
Existing DNN-based VAD models only leveraged fixed receptive-field contextual speech information and were unable to handle speech segments of different lengths adaptively. To overcome this defect, this paper proposed the MLNET architecture for the VAD task. MLNET made use of different gated affined units to extract different contextual speech information and leveraged an adaptive attention block to select the most relevant speech segments. Compared with existing models, the experiments demonstrated that MLNET outperformed the other baseline models, proving that the proposed architecture is helpful for dealing with VAD problems.
5. Acknowledgements
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

References

[1] M. W. Mak and H. B. Yu, "A study of voice activity detection techniques for NIST speaker recognition evaluations," Computer Speech & Language, 2014, pp. 295-313.
[2] J. Ramirez, J. M. Gorriz, and J. C. Segura, Voice Activity Detection: Fundamentals and Speech Recognition System Robustness. InTech, 2007.
[3] T. Yoshimura, T. Hayashi, K. Takeda, and S. Watanabe, "End-to-end automatic speech recognition integrated with CTC-based voice activity detection," in ICASSP, 2020.
[4] B. Sharma, R. K. Das, and H. Li, "Multi-level adaptive speech activity detector for speech in naturalistic environments," in Interspeech, 2019.
[5] F. Martinelli, G. Dellaferrera, P. Mainar, and M. Cernak, "Spiking neural networks trained with backpropagation for low power neuromorphic implementation of voice activity detection," in ICASSP, 2020.
[6] G. Dellaferrera, F. Martinelli, and M. Cernak, "A bin encoding training of a spiking neural network based voice activity detection," in ICASSP, 2020.
[7] X. Yang, B. Tan, J. Ding, J. Zhang, and J. Gong, "Comparative study on voice activity detection algorithm," in International Conference on Electrical and Control Engineering, 2010.
[8] K.-H. Woo, T.-Y. Yang, K.-J. Park, and C. Lee, "Robust voice activity detection algorithm for estimating noise spectrum," Electronics Letters, vol. 36, no. 2, pp. 180-181, 2002.
[9] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detector," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.
[10] T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, "Voice activity detection: Merging source and filter-based information," IEEE Signal Processing Letters, 2016, pp. 252-256.
[11] T. Ng, B. Zhang, and L. Nguyen, "Developing a speech activity detection system for the DARPA RATS program," in Interspeech, 2012.
[12] V. H. and S. H., "Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement," IET Signal Processing, vol. 6, no. 1, pp. 54-63, 2012.
[13] E. Dong, G. Liu, Y. Zhou, and X. Zhang, "Applying support vector machines to voice activity detection," in International Conference on Signal Processing Proceedings. IEEE, 2003.
[14] S. Tong, H. Gu, and K. Yu, "A comparative study of robustness of deep learning approaches for VAD," in ICASSP, 2016.
[15] X. Zhang and D. Wang, "Boosting contextual information for deep neural network based voice activity detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 2, pp. 252-264, 2016.
[16] Z. C. Fan, Z. Bai, X.-L. Zhang, S. Rahardja, and J. Chen, "AUC optimization for deep learning based voice activity detection," in ICASSP, 2019.
[17] A. Ivry, I. Cohen, and B. Berdugo, "Evaluation of deep-learning-based voice activity detectors and room impulse response models in reverberant environments," in ICASSP, 2020.
[18] A. Ivry, B. Berdugo, and I. Cohen, "Voice activity detection for transient noisy environment based on diffusion nets," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 254-264, 2019.
[19] X. Zhang and J. Wu, "Deep belief networks based voice activity detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697-710, 2013.
[20] I. Ariav and I. Cohen, "An end-to-end multimodal voice activity detection using WaveNet encoder and residual networks," IEEE Journal of Selected Topics in Signal Processing, pp. 1-1, 2019.
[21] S. Y. Chang, B. Li, G. Simko, T. N. Sainath, A. Tripathi, A. van den Oord, and O. Vinyals, "Temporal modeling using dilated convolution and gating for voice-activity-detection," in ICASSP, 2018.
[22] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in ICASSP, 2013, pp. 7378-7382.
[23] J. Kim and M. Hahn, "Voice activity detection using an adaptive context attention model," IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1, 2018.
[24] R. Lin, C. Costello, C. Jankowski, and V. Mruthyunjaya, "Optimizing voice activity detection for noisy conditions," in Interspeech, 2019, pp. 2030-2034.
[25] Y. Jung, Y. Kim, Y. Choi, and H. Kim, "Joint learning using denoising variational autoencoders for voice activity detection," in Interspeech, 2018, pp. 1210-1214.
[26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1606.05328
[27] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, 2018, pp. 7132-7141.
[28] S. Woo, J. Park, J. Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in ECCV, 2018, pp. 3-19.
[29] N. Parihar and J. Picone, Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02. Institute for Signal and Information Processing, Mississippi State University, 2002.
[30] D. Wang, X. Zhang, and Z. Zhang, "THCHS-30: A free Chinese speech corpus," 2015. [Online]. Available: http://arxiv.org/abs/1512.01882
[31] Google WebRTC, "https://webrtc.org/," 2016.
[32] A. Vafeiadis, E. Fanioudakis, I. Potamitis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, and R. Hamzaoui, "Two-dimensional convolutional recurrent neural networks for speech activity detection," in ICASSP, 2019.