Compact Speaker Embedding: lrx-vector
Munir Georges, Jonathan Huang*, Tobias Bocklet
Intel Labs, Munich, Germany; Apple Inc., Cupertino, California, USA;
Technische Hochschule Ingolstadt, Germany; Technische Hochschule Nürnberg, Germany
[email protected], [email protected], [email protected]
*Work done at Intel Labs
Abstract
Deep neural networks (DNN) have recently been widely used in speaker recognition systems, achieving state-of-the-art performance on various benchmarks. The x-vector architecture is especially popular in this research community, due to its excellent performance and manageable computational complexity. In this paper, we present the lrx-vector system, which is the low-rank factorized version of the x-vector embedding network. The primary objective of this topology is to further reduce the memory requirement of the speaker recognition system. We discuss the deployment of knowledge distillation for training the lrx-vector system and compare against low-rank factorization with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the weights by 28% compared to the full-rank x-vector system while keeping the recognition rate constant (1.83% EER).
Index Terms: speaker recognition, x-vector, low power
1. Introduction
Speaker recognition systems have been popularized in consumer devices such as smart phones and smart speakers for access control. This is achieved by generating a voice print from the user's speech during interaction with the device and comparing it against an existing voice print. Voice prints are usually generated by speaker embeddings from Deep Neural Networks (DNNs), which can also serve as the underlying feature for diarization in multi-speaker meetings [1, 2]. DNNs have been extensively explored in the literature for the generation of speaker embeddings with different objective functions [3, 4, 5, 6]. The x-vector system [3] emerged as a favorite in the research community, due to its robust training, state-of-the-art performance and the availability of recipes in the popular Kaldi framework [7]. Our work here is focused on the x-vector system.

Local inference provides clear advantages over cloud solutions, for example improved protection of user privacy, lower recognition latency, and autonomy from communication channels. Local inference has previously been addressed for speech recognition [8] and spoken language understanding [9]. The main challenge in local inference is the limited compute and memory available on the device; increasing these requirements has adverse implications for the cost and energy efficiency of the device. Memory access during inference has been identified as a bottleneck for energy efficiency. In this paper, we focus on reducing the memory footprint of an x-vector-based speaker verification system. Furthermore, a reduction in memory footprint leads to lower cost for the device.

There is already a wealth of literature on the compression of DNNs. Training DNNs with low-rank matrices jointly with the target objective has previously been explored for vision and audio signals. Novikov et al. [10] explore low-rank factorization of neural networks using CIFAR-10 and 1000-class ImageNet ILSVRC-2012. Sak et al. [11] use low-rank projection layers in Recurrent Neural Networks (RNN) for speech recognition. A rank-constrained DNN topology for keyword spotting is proposed by Nakkiran et al. [12]. In related work on compression, HashNet uses a low-cost hash function to randomly group weights into hash buckets; Chen et al. [13] propose this approach and compare it with low-rank networks. Wu et al. [14] explore quantization of convolutional neural networks and compare it with various alternatives, including low-rank decomposition and approximation of non-linear responses. An energy-efficient hardware accelerator using a low-rank approximation is also proposed by Zhu et al. [15], where inactive neurons are bypassed. Sahraeian et al. [16] explore low-rank factorization beyond compression aspects, using Singular Value Decomposition (SVD) of the weight matrices to achieve sparse multilingual acoustic models. More generally, Dighe et al. [17] improve acoustic models by training with low-rank and sparse soft targets. Similar success was achieved for Deep Gaussian Conditional Random Fields, as explored by, e.g., Chandra et al. [18]. Ding et al. [19] describe a structured low-rank constraint using domain-specific and domain-invariant DNNs. Applying low-rank and low-rank-plus-diagonal matrix parametrization to RNNs is studied by Barone et al. [20]. Sharan et al. [21] explore random projection for low-rank tensor factorization and describe its use on gene expression and EEG time series data. Zhang et al.
[22] apply structural sparsification to Time-Delay Neural Networks (TDNN) to remove redundant structures. Alternative approaches are subject to our further research, e.g., binary neural networks as successfully applied to natural language understanding [23].

In this paper, we propose low-rank x-vector speaker embeddings trained with knowledge distillation. We call the resulting embedding lrx-vector. The paper is organized as follows: Section 2 describes the baseline speaker recognition system, with details on the model topology and loss function. In Section 3, we introduce the modified model topology and the model training methodology using knowledge distillation. We present the results of our experiments in Section 4 and conclude the paper in Section 5.
2. Baseline Speaker Embedding System
The x-vector embedding comprises two parts. First, the feature sequence, i.e., mel-filterbank features, is processed by layers of TDNNs. Second, a statistics pooling layer encodes a segment of speech, and the embedding vector is computed by a feed-forward network (FFN). This architecture is illustrated in Figure 1.

Figure 1: x-vector speaker embedding. The input sequence of speech features at the top is processed by three TDNN layers. A statistics pooling layer is computed over a speech segment. The speaker embedding is finally computed by an FFN.

This work is based on the x-vector model structure proposed by Snyder et al. [3], with some simplifications. Compared to the original x-vector model, our architecture, shown in Table 1, uses an increased input feature dimension from 24 to 40, reduces the pooling dimension from 1500 to 512 and removes a fully-connected layer between the embedding and the speaker output layer. We reduce the embedding dimension from 512 to 256. In our testing, these modifications did not degrade the recognition performance and resulted in much lower complexity. We use this topology as a baseline for comparison against the lrx-vector.
Table 1: Baseline x-vector configuration for a speech utterance with T frames. Three TDNN and two FFN layers process the input sequence. The embedding is computed by an FFN on top of the statistics pooling layer.

layer           context          Affine
Layer1          [t-2, t+2]       (5 × 40) × 512
Layer2          {t-2, t, t+2}    (3 × 512) × 512
Layer3          {t-2, t, t+2}    (3 × 512) × 512
Layer4          {t}              512 × 512
Layer5          {t}              512 × 512
Stats pooling   [0, T)           N/A
Segment         [0, T)           1024 × 256
Softmax         [0, T)           256 × N
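As an illustration, the following is a minimal PyTorch sketch of the configuration in Table 1. It is not the authors' implementation; the class name, argument names and the speaker count are placeholders.

```python
import torch
import torch.nn as nn

class XVectorBaseline(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, emb_dim=256, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers expressed as dilated 1-D convolutions over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),  # [t-2, t+2]
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),    # {t-2, t, t+2}
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),    # {t-2, t, t+2}
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),                # {t}
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),                # {t}
        )
        self.segment = nn.Linear(2 * hidden, emb_dim)   # "Segment" (embedding) layer
        self.output = nn.Linear(emb_dim, num_speakers)  # used during training only

    def embed(self, feats):
        # feats: (batch, feat_dim, T) mel-filterbank sequence
        h = self.frame_layers(feats)
        # Statistics pooling: mean and standard deviation over the whole segment.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.segment(stats)  # 256-dimensional x-vector

    def forward(self, feats):
        return self.output(self.embed(feats))
```

The statistics pooling concatenates the mean and standard deviation of the last frame-level layer, which is why the Segment layer in Table 1 has an input dimension of 2 × 512 = 1024.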
The output layer is only used during model training; for speaker enrollment and verification, the embedding is taken from the Segment layer (see Table 1). One speaker embedding is computed for an entire utterance, regardless of its length. We use the length-normalized cosine distance between the 256-dimensional embedding vectors of the enrollment and test utterances to produce the speaker recognition score.
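Scoring an enrollment/test pair thus reduces to a cosine similarity between length-normalized embeddings; a minimal sketch (the function name is ours):

```python
import torch.nn.functional as F

def verification_score(enroll_emb, test_emb):
    # Length-normalize both 256-dimensional embeddings and use their
    # cosine similarity as the speaker recognition score.
    e = F.normalize(enroll_emb, dim=-1)
    t = F.normalize(test_emb, dim=-1)
    return (e * t).sum(dim=-1)
```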
While the conventional softmax loss works reasonably well for training speaker embeddings, it is specifically designed for classification, not verification tasks. Speaker recognition systems trained with softmax loss typically use PLDA [24] in the backend to improve separation between speakers. The triplet loss function, which is designed to reduce intra-speaker and increase inter-speaker distance, has been shown to be more effective for speaker recognition [4]. Likewise, the end-to-end loss [5] performs better than softmax. The downside of these kinds of losses is that the training infrastructure is significantly more complex than one used for supervised learning with softmax. In a prior study [6], we explored the use of several recently proposed loss functions that were first introduced in face recognition research. These loss functions are drop-in replacements for softmax; thus, the modification to training code is simple, with little overhead in training speed. We found Additive Margin Softmax (AM-softmax) [25] to perform best on the far-field test set, and incorporating PLDA did not improve performance over the simpler cosine distance. The elimination of PLDA from the inference pipeline makes the entire model easy to deploy to target hardware, with the help of tools such as the Intel® Distribution of OpenVINO™ toolkit (https://docs.openvinotoolkit.org/).
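For reference, below is a minimal sketch of AM-softmax [25] as a drop-in replacement for softmax cross-entropy. The scale s and margin m are hyper-parameters whose values are not stated in this paper, so the defaults are placeholders, as is the speaker count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim=256, num_speakers=1000, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb, dim=-1), F.normalize(self.weight, dim=-1))
        # Subtract the additive margin m from the target-class cosine only,
        # then scale by s and apply the usual cross-entropy.
        onehot = F.one_hot(labels, num_classes=cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)
```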
3. Compact Speaker Embedding System
First, we analyze the vanilla TDNN layer, which we represent as a feed-forward network (FFN). A layer of the baseline x-vector topology described in Section 2 processes the input x_t ∈ R^(c×n) at time t, a concatenation of c feature vectors according to the layer context. For example, layer 2 of Table 1 is defined by the following row:

layer     context          Affine
Layer2    {t-2, t, t+2}    (3 × 512) × 512

Here, c = 3 for the layer context t-2, t, t+2 of features generated by layer one at the corresponding time steps. Each input feature to this layer has n = 512 dimensions, and the overall TDNN output has m = 512 dimensions. The input sequence is shifted by one frame to compute the output for the next time step. The FFN representing the TDNN layer with weight matrix W ∈ R^((c·n)×m), but no bias, is defined as

y = Φ(W x_t)    (1)

where Φ is a non-linear activation function, i.e., ReLU in this paper. The overall number of weights in a TDNN layer is c·n·m. In the lrx-vector embedding, the TDNN matrix described above is replaced by two matrices W_a ∈ R^((c·n)×k) and W_b ∈ R^(k×m). The output of the TDNN layer becomes

y = Φ((W_a W_b) x_t)    (2)

where W_a and W_b are low-rank representations of W with low-rank constant k, 0 < k < n. The overall number of weights in a low-rank TDNN layer is k·(c·n + m), which is significantly smaller than in the vanilla TDNN layer when k is set properly. The lrx-vector configuration is presented in Table 2.

Table 2: lrx-vector system configuration for a T-frame speech utterance. The matrices of the four layers with the most weights are replaced by low-rank matrices.

layer           context          Affine
Layer1          [t-2, t+2]       (5 × 40) × 512
Layer2          {t-2, t, t+2}    (3 × 512) × k_2, k_2 × 512
Layer3          {t-2, t, t+2}    (3 × 512) × k_3, k_3 × 512
Layer4          {t}              512 × k_4, k_4 × 512
Layer5          {t}              512 × k_5, k_5 × 512
Stats pooling   [0, T)           N/A
Segment         [0, T)           1024 × 256
Softmax         [0, T)           256 × N

With the lrx-vector system, fewer weights need to be stored in non-volatile memory. On the other hand, the compute increases, which is most often not an issue on recent DSP platforms that include efficient matrix-matrix multiplication units.
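A minimal sketch of such a factorized affine layer is shown below (our naming, not the authors' code); the stored parameter count follows the k·(c·n + m) expression above.

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorized replacement for a (c*n) x m affine layer: W is approximated
    by W_a W_b of rank k, storing k*(c*n + m) instead of c*n*m weights."""
    def __init__(self, in_dim, out_dim, rank):
        super().__init__()
        self.wa = nn.Linear(in_dim, rank, bias=False)   # maps (c*n) -> k
        self.wb = nn.Linear(rank, out_dim, bias=False)  # maps k -> m

    def forward(self, x):
        return self.wb(self.wa(x))

# Example weight counts for layer 2 of Table 2 (c=3, n=m=512) with rank k=256:
full_rank = 3 * 512 * 512          # 786,432 weights
low_rank = 256 * (3 * 512 + 512)   # 524,288 weights
```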
We have explored several different ways of training the lrx-vector system:

1. Random initialization: The network is initialized by PyTorch's default random initialization. It is trained in the same way as the baseline x-vector system, using the AM-softmax loss described in Section 2.

2. lrx-SVD: An x-vector baseline system is trained, and singular value decomposition (SVD) is performed on the weight matrices. For each layer, we keep a subset of the singular values. This is similar to the work of Nakkiran et al. [12]. The network is not trained any further after SVD.

3. lrx-SVD_F: A network obtained from SVD is fine-tuned with the baseline training setup until convergence, using a lowered learning rate.

4. Knowledge distillation:
This is a method of using a larger teacher network to train a smaller student network, to achieve better performance than is possible with the smaller network alone. It has been successfully applied in computer vision tasks [26]. We find the use of a well-trained full-rank x-vector (i.e., the baseline system described in Section 2) as the teacher of the lrx-vector to be particularly effective. Our model training procedure is modified with a loss function combining contributions from knowledge distillation (KD) and AM-softmax (AMS):

L_KD,AMS = α·L_KD + (1 − α)·L_AMS    (3)

Here, L_KD can be computed by the Kullback-Leibler divergence (KLD), the Mean Square Error (MSE) or the Cosine Similarity (COS); we compare these options in the experiments. Determining the weight α with, e.g., a grid search is time and compute intensive. We partially circumvent this by applying an idea derived from multi-task learning, previously proposed by Du et al. [27]: the KD loss is minimized only as long as its gradient has non-negative cosine similarity with the target gradient; otherwise, the teacher is ignored. Hence, we minimize the following loss with Gradient Cosine Similarity (GCS) to train our lrx-vector speaker identification embedding, using a fixed value of α:

L_GCS,KD,AMS = { L_KD,AMS   if cos(∇L_KD, ∇L_AMS) > 0
              { L_AMS      otherwise                       (4)
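A rough PyTorch sketch of the gated combination in Eqs. (3) and (4) is given below, using KD-MSE as an example distillation loss. The helper names are ours, and alpha=0.5 is only a placeholder, not the value used in the paper.

```python
import torch
import torch.nn.functional as F

def kd_mse(student_out, teacher_out):
    # One of the distillation losses compared in the paper (KD-MSE):
    # mean squared error against the frozen teacher's outputs.
    return F.mse_loss(student_out, teacher_out.detach())

def _flatten(grads, params):
    # Flatten per-parameter gradients, treating unused parameters as zero.
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def gcs_backward(model, l_kd, l_ams, alpha=0.5):
    """Back-propagate alpha*L_KD + (1-alpha)*L_AMS only if the gradients of the
    two losses have positive cosine similarity, otherwise L_AMS alone (Eq. 4)."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_kd = torch.autograd.grad(l_kd, params, retain_graph=True, allow_unused=True)
    g_ams = torch.autograd.grad(l_ams, params, retain_graph=True, allow_unused=True)
    cos = F.cosine_similarity(_flatten(g_kd, params), _flatten(g_ams, params), dim=0)
    loss = alpha * l_kd + (1.0 - alpha) * l_ams if cos > 0 else l_ams
    loss.backward()
    return loss
```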
4. Experiments
Our systems were developed using VoxCeleb 1 and 2. The training data is described by McLaren et al. [28] and Nagrani et al. [29, 30]. Evaluation was performed on the VOiCES challenge far-field text-independent dataset [31]; the VOiCES development set was used to optimize our system. We present results for the development and evaluation sets using the equal error rate (EER) and minimum decision cost function (minDCF) metrics as defined in the VOiCES challenge [32].
Our training data is prepared by applying 4x data augmentation. Each augmented speech file is convolved with a randomly chosen room impulse response (RIR), drawn from 100 artificially generated with Pyroomacoustics [33] and 100 selected from the Aachen Impulse Response Database [34], and then mixed with randomly chosen clips from Google's AudioSet under Creative Commons [35]. The SNR for mixing was uniformly distributed between 0 and 18 dB. We extract 40-dimensional mel-filterbank features from 25 ms frames with 15 ms overlap. The features are mean-normalized over a 3-second sliding window.

Our system was developed using the PyTorch framework (https://pytorch.org/). For the baseline system, we used an initial learning rate of 0.1, decaying to a final learning rate of 0.0001 over 30 epochs of the training data. Training with knowledge distillation and fine-tuning started at a lower learning rate of 0.01. The same weight decay is used in all experiments. We trained all networks until convergence, which most often took no longer than 20 epochs.

An SVD is used to factorize the matrices of a previously trained x-vector system; Figure 2 shows the singular values for each layer. We found empirically that a low-rank input layer and low-rank layers after the stats pooling significantly decrease the speaker identification accuracy. In contrast, Nakkiran et al. [12] successfully use a low-rank first layer for keyword spotting. In this paper, we set each k_i for layers 2 to 5 to a fixed fraction of n_i, the dimension of the input vector to layer i. It cannot be ruled out that a full grid search would find better choices of k_i for a given target number of overall weights in the lrx-vector. Automatically determining the optimal k_i is subject of our further research.

Figure 2: Singular values of the x-vector model matrices. The low-rank constants are, e.g., k_2 = k_3 = 256 and k_4 = k_5 = 384.

Table 3 compares the scaled full-rank x-vector to the randomly initialized lrx-vector, lrx-SVD and lrx-SVD_F. Note that we linearly scaled the output dimension of each x-vector layer before the stats pooling by the same constant factor, in order to obtain the same model size as the lrx-vector. The results indicate that SVD, even with fine-tuning, did not produce better results than random initialization.

Table 3: Factorization with singular value decomposition (SVD) compared to training from scratch (scaled x-vector, lrx-vector with random initialization, lrx-SVD, lrx-SVD_F).
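The lrx-SVD initialization described above can be sketched as follows (our naming; the per-layer truncation ranks follow the discussion above):

```python
import torch

def svd_factorize(weight, k):
    """Factor a trained (out x in) weight matrix into two rank-k factors by
    keeping the k largest singular values: weight is approximated by wb @ wa."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    sqrt_s = torch.sqrt(s[:k])
    wb = u[:, :k] * sqrt_s             # (out x k), initializes the second factor
    wa = sqrt_s.unsqueeze(1) * vh[:k]  # (k x in), initializes the first factor
    return wa, wb
```

The two factors can then be copied into the weights of the corresponding low-rank TDNN layer and either used as-is (lrx-SVD) or fine-tuned further (lrx-SVD_F).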
Next, we turn our attention to another technique. For our research, we choose the teacher to be the baseline system described in Section 2. The EER and minDCF of the teacher system in Table 4 constitute a strong baseline, competitive with the top x-vector systems of comparable complexity in the VOiCES Challenge [32]. We use this baseline system to teach a student lrx-vector system with 550k weights, as described in the previous section.

In the first set of results in Table 4, we present the experiments using the standard KD learning objective from Eq. 3 with a fixed α. KD is generally an improvement over random initialization and lrx-SVD_F, with KD-MSE performing best on both the development and evaluation sets. Although this set of results is promising, it might be possible to improve further by tuning α; practically, doing this by grid search would require long training times.

Table 4: Comparison of loss functions between teacher and student for the lrx-vector system with 550k weights overall. The teacher is a baseline system with 4800k weights.

system                      Dev EER   Dev minDCF   Eval EER   Eval minDCF
lrx-vector KD-COS (550k)    2.67      0.273        6.37       0.456
Teacher (4800k)             1.83      0.189        5.5        0.381

Table 5 shows the results with gradient cosine similarity as described in Section 3. Here, the best performance is achieved with the cosine-similarity distillation loss, which improves the EER relative to lrx-SVD_F on both the development and evaluation sets.

Table 5: Knowledge distillation applied only with positive gradient cosine similarity between teacher and student; otherwise, only the additive margin softmax speaker identification loss is used.

system                      Dev EER   Dev minDCF   Eval EER   Eval minDCF
Teacher (4800k)             1.83      0.189        5.5        0.381
Comparing x-vector and lrx-vector based speaker identification systems of different sizes at approximately the same EER is the subject of this part of the evaluation.
Figure 3: Achieved equal error rate (EER) on the development set for different numbers of weights of the x-vector and lrx-vector systems.

The number of weights in the models was adjusted by changing the dimension of each hidden layer by a multiplication factor smaller than 1. This factor is the same for all layers in the network up to the stats pooling layer; automatically determining layer-dependent factors is a subject of future research. The lrx-vector as well as the x-vector systems were trained by knowledge distillation using KD-GCS-COS. We selected the best models on the development set, as in the previous sections. The results are shown in Table 6 and Figure 3. The lrx-vector systems require fewer weights to be stored in ROM than comparable x-vector systems that meet the same EER.

Table 6: Weight reduction of the lrx-vector system compared to an x-vector system that achieves the same EER.
EER (Dev)   lrx-vector size   % of x-vector equivalent
5. Conclusion
This paper addresses compact speaker identification with the lrx-vector embedding. We propose a low-rank version of the popular TDNN-based x-vector embedding in which large matrices are replaced by low-rank matrices. We address one of the main bottlenecks of low-power inference on small edge devices, memory access, by reducing the size of the model. On the VOiCES far-field test set, we achieved a 28% reduction in the number of parameters compared to the full-size model at the same EER of 1.83%. The lrx-vector is also shown to reduce the model size compared to a scaled-down x-vector at comparable EERs across a wide range of operating points. Future research beyond this work includes other means of searching for the best knowledge distillation hyper-parameter α, and joint low-rank and weight quantization optimizations.

6. References

[1] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP, 2019, pp. 5796-5800.

[2] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," in ICASSP, 2019, pp. 6301-6305.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, 2018, pp. 5329-5333.

[4] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.

[5] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP, 2016, pp. 5115-5119.

[6] J. Huang and T. Bocklet, "Intel far-field speaker recognition system for VOiCES Challenge 2019," in Proc. Interspeech, 2019, pp. 2473-2477.

[7] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011.

[8] M. Georges, S. Kanthak, and D. Klakow, "Accurate client-server based speech recognition keeping personal data on the client," in ICASSP, 2014.

[9] G. Stemmer, M. Georges, J. Hofer, P. Rozen, J. G. Bauer, J. Nowicki, T. Bocklet, H. R. Colett, O. Falik, M. Deisher, and S. J. Downing, "Speech recognition and understanding on hardware-accelerated DSP," in Proc. Interspeech, 2017, pp. 2036-2037.

[10] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," in Advances in Neural Information Processing Systems 28, 2015, pp. 442-450.

[11] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," CoRR, vol. abs/1402.1128, 2014.

[12] P. Nakkiran, R. Alvarez, R. Prabhavalkar, and C. Parada, "Compressing deep neural networks using a rank-constrained topology," in Proc. Interspeech, 2015, pp. 1473-1477.

[13] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 2285-2294.

[14] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[15] J. Zhu, Z. Qian, and C.-Y. Tsui, "LRADNN: High-throughput and energy-efficient deep neural network accelerator using low rank approximation," in Asia and South Pacific Design Automation Conference (ASP-DAC), 2016, pp. 581-586.

[16] R. Sahraeian and D. Van Compernolle, "A study of rank-constrained multilingual DNNs for low-resource ASR," in ICASSP, 2016, pp. 5420-5424.

[17] P. Dighe, A. Asaei, and H. Bourlard, "Low-rank and sparse soft targets to learn better DNN acoustic models," in ICASSP, 2017, pp. 5265-5269.

[18] S. Chandra, N. Usunier, and I. Kokkinos, "Dense and low-rank Gaussian CRFs using deep embeddings," in IEEE International Conference on Computer Vision (ICCV), 2017.

[19] Z. Ding and Y. Fu, "Deep domain generalization with structured low-rank constraint," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 304-313, 2018.

[20] A. V. Miceli Barone, "Low-rank passthrough neural networks," in Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, 2018.

[21] V. Sharan, K. S. Tai, P. Bailis, and G. Valiant, "Compressed factorization: Fast and accurate low-rank factorization of compressively-sensed data," in Proceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 5690-5700.

[22] J. Zhang, J. Huang, M. Deisher, H. Li, and Y. Chen, "Structural sparsification for far-field speaker recognition with GNA," 2019.

[23] M. Georges, K. Czarnowski, and T. Bocklet, "Ultra-compact NLU: Neuronal network binarization as regularization," in Proc. Interspeech, 2019, pp. 809-813.

[24] S. Ioffe, "Probabilistic linear discriminant analysis," in Computer Vision - ECCV 2006, Springer, 2006, pp. 531-542.

[25] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926-930, 2018.

[26] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531

[27] Y. Du, W. M. Czarnecki, S. M. Jayakumar, R. Pascanu, and B. Lakshminarayanan, "Adapting auxiliary losses using gradient similarity," 2018.

[28] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The 2016 speakers in the wild speaker recognition evaluation," in Interspeech, 2016, pp. 823-827.

[29] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech and Language, 2019.

[30] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017.

[31] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout et al., "Voices Obscured in Complex Environmental Settings (VOiCES) corpus," in Interspeech, 2018. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1454

[32] M. K. Nandwana, J. Van Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, "The VOiCES from a distance challenge 2019 evaluation plan," arXiv preprint arXiv:1902.10828, 2019.

[33] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in ICASSP, 2018, pp. 351-355.

[34] M. Jeub, M. Schäfer, and P. Vary, "A binaural room impulse response database for the evaluation of dereverberation algorithms," in International Conference on Digital Signal Processing (DSP), 2009, pp. 1-5.

[35] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.