Multichannel CRNN for Speaker Counting: an Analysis of Performance
Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, Alexandre Guérin
Orange Labs, France
Univ. Grenoble Alpes, GIPSA-lab, Grenoble-INP, CNRS, Grenoble, France
ABSTRACT
Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localisation and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. In a previous work, we addressed the speaker counting problem with a multichannel convolutional recurrent neural network which produces an estimate at a short-term frame resolution. In this work, we show that, for a given frame, there is an optimal position in the input sequence for best prediction accuracy. We empirically demonstrate the link between that optimal position, the length of the input sequence and the size of the convolutional filters.
1. INTRODUCTION
Speaker counting, i.e., estimating the evolving number of speakers in an audio recording, is a crucial stage in several audio processing tasks such as speaker diarization, localisation and tracking. It can be seen as a subtask of speaker diarization, which estimates who speaks and when in a speech segment [1, 2]. This task has rarely been addressed in the speech processing literature as a problem of its own: in a majority of source separation and localisation methods, the number of speakers is considered a known and essential prerequisite [3–6], or is estimated by clustering separation/localisation features [7, 8]. Speaker counting becomes even more difficult when several speakers overlap. It proves particularly useful for tracking, as it can help with the difficult problem of detecting the appearance and disappearance of a speaker track along time [9].

In the source counting literature, single-channel parametric methods rely on ad-hoc parameters to infer the number of speakers [10, 11]. Multichannel approaches exploit spatial information to better discriminate speakers. Classical multichannel methods are based on eigenvalue analysis of the array spatial covariance matrix [12–14], but cannot be used in underdetermined configurations. Clustering approaches in the time-frequency (TF) domain overcome this restriction [15–19]. Nevertheless, they often turn out to be poorly robust to reverberation; moreover, they often require the maximum number of speakers as an input parameter.

More recently, deep learning has been applied to the audio source counting problem. In [20], a convolutional neural network is used to classify single-channel noisy audio signals into three classes: 1, 2 or 3-or-more sources. In [21], the authors compare several neural network architectures with long short-term memory (LSTM) or convolutional layers, and also tried both classification and regression paradigms.
They extended their work in [22] with a single-channel convolutional recurrent neural network (CRNN) predicting the maximum number of speakers occurring in a 5-second audio signal. Recently, we proposed an adaptation of this CRNN with multichannel input features to predict the number of speakers at a short-term precision on reverberant speech signals [23].

In this paper, we extend our work in [23] by providing an empirical analysis of the speaker counting CRNN with regards to the sequence-to-one output mapping. We demonstrate that, for the best prediction on a given frame, there is an optimal choice of the decoded label within an output sequence, depending on convolutional and recurrent parameters. Past information is needed to let the LSTM converge, and a few overhead frames also help for best accuracy.
2. SPEAKER COUNTING SYSTEM
The method used in this paper is the same as in [23]; we briefly recall its main lines in this section.
To provide spatial information to the network, we use the Ambisonics representation as a multichannel input feature. The main advantages of the Ambisonics format are its ability to accurately represent the spatial properties of a sound field, while being almost independent of the microphone type. The Ambisonics format is produced by projecting an audio signal onto a basis of spherical harmonics. For practical use, this infinite basis is truncated, which defines the Ambisonics order: here, we provide first-order Ambisonics (FOA) to the network, leading to 4 channels. For a plane wave coming from azimuth θ and elevation φ, and bearing a sound pressure p, the FOA components are given in the STFT domain by:

  [W(t,f), X(t,f), Y(t,f), Z(t,f)]^T = [1, √3 cos θ cos φ, √3 sin θ cos φ, √3 sin φ]^T p(t,f),   (1)

where t and f denote the STFT time and frequency bins, respectively. We use the N3D Ambisonics normalization standard [24].

The phase of p(t,f) is considered redundant information across the channels, thus we only use the magnitude of the FOA components. By stacking them, we end up with a three-dimensional tensor X ∈ R^(N_t × F × I), with N_t frames, F frequency bins and I channels, as the input feature for the neural network. The input signals are analyzed with a 1,024-point STFT (hence F = 513) using overlapping sinusoidal analysis windows. The parameter N_t takes several values in our experiments, see Section 3.

Speaker counting is considered as a classification problem with 6 classes, from 0 to 5 concurrent speakers. For a given frame, the target is encoded as a one-hot vector y of size 6, and the softmax function is used for the output layer. For inference, the predicted class is the one with the highest probability in the output distribution.

We use the same architecture as in [23], illustrated in Figure 1.
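The FOA encoding of Eq. (1) and the stacking of magnitudes into the input tensor can be sketched in a few lines of NumPy. This is an illustrative single-plane-wave sketch, not the paper's code; the function name and example dimensions are ours.

```python
import numpy as np

def foa_plane_wave(p_tf, azimuth, elevation):
    """Encode a plane-wave pressure STFT p_tf (N_t x F, complex) into
    first-order Ambisonics magnitudes (N3D normalization), per Eq. (1)."""
    gains = np.array([
        1.0,                                                # W
        np.sqrt(3) * np.cos(azimuth) * np.cos(elevation),   # X
        np.sqrt(3) * np.sin(azimuth) * np.cos(elevation),   # Y
        np.sqrt(3) * np.sin(elevation),                     # Z
    ])
    # stack |W|, |X|, |Y|, |Z| into the (N_t, F, I) tensor, I = 4 channels
    return np.abs(p_tf)[:, :, None] * np.abs(gains)[None, None, :]

# illustrative input: N_t = 25 frames, F = 513 frequency bins
rng = np.random.default_rng(0)
p = rng.standard_normal((25, 513)) + 1j * rng.standard_normal((25, 513))
X_in = foa_plane_wave(p, azimuth=np.pi / 4, elevation=0.0)
print(X_in.shape)  # (25, 513, 4)
```

Since the gains are real, |g·p| = |g|·|p|, so only the pressure magnitude and the direction-dependent gains are needed.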
A first block is composed of 2 convolutional layers with 64 and 32 filters respectively, followed by a max-pooling layer, and another 2 convolutional layers with 128 and 64 filters respectively, also followed by a max-pooling layer. The filter support size K (the size along both the time and frequency axes) is varied as part of the analysis in Section 3. The max-pooling operation only applies to the frequency axis, to keep the temporal dimension unchanged and allow a frame-based decision. The following layer is an LSTM used in a sequence-to-sequence mapping mode (see [23] for more details), producing one hidden vector per frame, i.e., an output sequence of length N_t. Finally, each of these vectors goes through the 6-unit softmax output layer that produces the probability distribution over the classes. Therefore, this pipeline enables the network to compute a probability distribution for each frame.

Figure 1: Architecture of the counting neural network, similar to [23].

To train and test the neural network, we use synthesized speech signals comprising between 1 and 5 speakers who begin and end talking at random times. The speech signal of each speaker is individually convolved with a spatial room impulse response (SRIR) generated using the image-source method [?]. Then the individual wet speech signals are mixed together and a diffuse noise is added. The reader can find more details on the mixture generation algorithm in [23]. We end up with separate sets of speech signals for training, validation and test. Note that the SRIRs, speech and noise signals used for validation and test are never encountered during training.
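As a rough illustration of this pipeline, here is a sketch of such a CRNN in PyTorch. It follows the layer counts given above (64/32 and 128/64 filters, frequency-only pooling, an LSTM, a 6-unit output), but the pooling factor, LSTM width and other hyperparameters are our assumptions, not values from [23].

```python
import torch
import torch.nn as nn

class CountingCRNN(nn.Module):
    """Sketch of a frame-wise speaker counting CRNN. Pooling factor (4)
    and LSTM width (40) are assumptions, not taken from the paper."""
    def __init__(self, n_freq=513, n_classes=6, K=3, hidden=40):
        super().__init__()
        pad = K // 2  # 'same' zero padding, keeps both time and frequency dims
        def block(c_in, c1, c2):
            return nn.Sequential(
                nn.Conv2d(c_in, c1, K, padding=pad), nn.ReLU(),
                nn.Conv2d(c1, c2, K, padding=pad), nn.ReLU(),
                nn.MaxPool2d((1, 4)),  # pool frequency only, keep time axis
            )
        self.conv = nn.Sequential(block(4, 64, 32), block(32, 128, 64))
        feat = 64 * (n_freq // 16)            # channels x pooled frequency bins
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                     # x: (batch, N_t, F, 4)
        x = x.permute(0, 3, 1, 2)             # -> (batch, 4, N_t, F)
        x = self.conv(x)                      # -> (batch, 64, N_t, F // 16)
        x = x.permute(0, 2, 1, 3).flatten(2)  # -> (batch, N_t, feat)
        h, _ = self.lstm(x)                   # sequence-to-sequence mapping
        return self.out(h)                    # per-frame logits, (batch, N_t, 6)

model = CountingCRNN()
logits = model(torch.randn(2, 30, 513, 4))
print(logits.shape)  # torch.Size([2, 30, 6])
```

The softmax is left to the loss function (e.g. cross-entropy on the per-frame logits), which is the usual idiom in PyTorch.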
3. PERFORMANCE ANALYSIS
In this section we evaluate the performance of the CRNN on the test set depending on the values of two parameters: the position n of an analyzed frame within a sequence of length N_t, and the support size K of the convolutional filters.

The sequence-to-sequence nature of the LSTM layer provides a way to predict the most probable class for each frame of an input sequence of N_t frames. However, the amount of information available for predicting a given frame depends on its position n within the N_t-frame sequence. For instance, if n = 0 (first frame in the sequence), the prediction relies only on the content of that frame in the spectrogram plus the content of neighboring frames (because of the support of the convolutional filters), whereas for n = N_t − 1 (last frame in the sequence), the prediction can fully benefit from the recurrent nature of the LSTM, by gathering information from all the previous frames in the sequence. This leads to the hypothesis that a prediction for a frame at the beginning of a sequence will be less accurate than a prediction for a frame further along in the sequence. To assess this hypothesis, we compute the prediction accuracy over all frames of the test set while forcing those frames to occupy the same given position n within the input sequence.

Figure 2 shows the average accuracy of the CRNN on the test set, depending on the position n in the sequence used for the prediction of each frame, and for several values of K. Each color corresponds to a value of N_t, with one curve showing the accuracy per position n, and one horizontal line showing the average accuracy over all positions. These results extend those of [23]. We see that the CRNN achieves a good overall accuracy given the noisy and reverberant environment, with up to 5 speakers in the signal.
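This evaluation protocol, i.e., forcing each test frame to occupy a fixed position n within its input window, can be sketched as follows. Here `predict_seq` is a hypothetical stand-in for the trained CRNN, not the paper's code.

```python
import numpy as np

def accuracy_at_position(predict_seq, features, labels, N_t, n):
    """Accuracy when every decoded frame is forced to position n of its
    N_t-frame input window. predict_seq maps an (N_t, F, I) window to a
    length-N_t array of class labels (stand-in for the trained CRNN)."""
    T = len(labels)
    correct = total = 0
    for t in range(T):
        start = t - n                      # place frame t at window index n
        if start < 0 or start + N_t > T:   # skip frames too close to the edges
            continue
        pred = predict_seq(features[start:start + N_t])[n]
        correct += int(pred == labels[t])
        total += 1
    return correct / max(total, 1)

# sanity check with a dummy oracle predictor that reads back the label
# we hid in the first feature bin of every frame
T = 40
labels = np.arange(T) % 6
features = np.zeros((T, 2, 1))
features[:, 0, 0] = labels
oracle = lambda window: window[:, 0, 0].astype(int)
print(accuracy_at_position(oracle, features, labels, N_t=10, n=6))  # 1.0
```

Sweeping n from 0 to N_t − 1 with such a routine produces one accuracy curve per (N_t, K) pair, as plotted in Figure 2.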
As in [23], we notice the trend that the average accuracy increases with the length of the sequence, but the framewise results show that the CRNN can even do better than the average performance if we avoid frames at the beginning and at the end of the sequence.

Figure 2: Average framewise speaker counting accuracy of the proposed CRNN, as a function of the position of the decoded frame in the N_t-sequence, for several values of N_t and of the convolutional filter size: (a) K = 3, (b) K = 5, (c) K = 7. The horizontal bars indicate the average accuracy over the frame positions.

Interestingly, all curves follow a similar trend: the accuracy increases with n, then we observe a plateau (except for N_t = 10), then the accuracy decreases as n approaches N_t. The first interpretation we can draw is that the LSTM layer needs a certain amount of information (several timesteps) to converge and output an optimal prediction (hence the plateau), whose level varies slightly across experiments. All curves show almost the same marked rise at the beginning of the sequences: for example, for K = 3 and N_t = 30, the accuracy increases notably between n = 0 and n = 6. At n = 6 the accuracy begins to fall for N_t = 10, whereas it keeps rising for the other values of N_t. Further along the sequence, the increase stabilizes and reaches a plateau, around n = 30 for N_t = 50. Here, the LSTM seems to have reached its optimal prediction power. So we can finally correlate the length of the input sequence with the overall accuracy, for the simple reason that the LSTM needs time to converge and aggregate information to provide a better prediction. This highlights the fact that, in practice, we need to provide the LSTM with a certain amount of past information for higher prediction accuracy. In the following paragraphs, we comment on the decrease of accuracy after the optimal plateau.
Another interesting element in the curves is the drop of accuracy for the last values of the decoded frame position n, when it gets close to N_t. This drop occurs for all values of N_t. For example, for K = 3 and N_t = 10, the accuracy gradually decreases from its best value at n = 6 to the last frame position n = 9; for N_t = 30, it decreases from its maximum value at n = 25 to the last frame position n = 29. In fact, for K = 3, the peak in accuracy appears at frame position n = N_t − 5 for all N_t except N_t = 10, and the accuracy falls for the subsequent positions. If we do the same analysis for K = 5, the drop happens a bit earlier, at position N_t − 9, except for N_t = 10, for which the LSTM convergence seems to condition the increase of accuracy. For K = 7 the phenomenon is less consistent among the different values of N_t: the drop begins at slightly different positions for N_t = 30 and N_t = 50. Yet the position where the accuracy starts to drop clearly seems to be correlated with the size of the convolutional filters.

In fact, this is due to the use of zero-padding in the convolutional layers. For a filter size K = 3, the spectrograms are padded with one frame of zeros when the filter is applied to an edge frame (n = 0 or n = N_t − 1), so that the temporal dimension is kept through all layers. This padding followed by the convolution yields an edge frame with less information (since the convolution includes the void information added by padding) for the next convolutional layer, which applies the same operation, and so on until the 4th and last convolutional layer. After this layer, the padding operation has thus added a void frame 4 times at the end of the spectrogram, which explains the drop of accuracy over the last 4 frames of the sequence, due to this lack of information.
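This accumulation of void frames can be checked numerically with a toy model: repeated 'same'-mode 1-D convolutions of an all-ones signal, where any frame whose output deviates from 1 has absorbed padded zeros. The averaging kernel and layer count are our own stand-ins for the CRNN's convolutional stack.

```python
import numpy as np

def contaminated_tail(K, n_layers=4, N_t=50):
    """Number of trailing frames touched by zero padding after n_layers
    'same'-mode convolutions of width K (toy averaging kernels standing
    in for the CRNN's convolutional layers)."""
    x = np.ones(N_t)
    kernel = np.ones(K) / K
    for _ in range(n_layers):
        x = np.convolve(x, kernel, mode="same")  # zero-padded at the edges
    # frames deviating from 1.0 have mixed in padded zeros;
    # contamination is symmetric, so halve the count to get the tail only
    return int(np.sum(~np.isclose(x, 1.0)) // 2)

print(contaminated_tail(3))  # 4 void frames for K = 3
print(contaminated_tail(5))  # 8 void frames for K = 5
```

Each layer pushes the contamination (K − 1)/2 frames further inward, so after 4 layers the last 2(K − 1) frames are affected, matching the positions where the accuracy curves drop.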
This can also be calculated for K = 5: the filter size forces a padding of 2 void frames at each edge for each of the 4 convolutional layers, which leads to 2 × 4 = 8 void frames at the end of the spectrogram before entering the LSTM layer. We again obtain the position n = N_t − 9 at which the accuracy begins to drop. An empirical formula can then be drawn to compute the optimal position n_opt of the analyzed frame for best accuracy:

  n_opt = N_t − 2K + 1.   (2)

For K = 7, the formula gives an optimal position n = N_t − 13 before the drop in accuracy begins. Note that this result is not perfectly observed in the curves. In particular for N_t = 10, this padding effect is balanced by the rise of accuracy due to the LSTM convergence when accumulating information across successive frames. That is why, for example, for K = 5 and N_t = 10, the drop begins later, at a position where the convergence is still in progress.

The above analysis has shown two aspects of the results:

• The LSTM needs time to converge, so that for a good speaker counting accuracy we need to provide a certain amount of past information.

• The peak of performance is obtained at a position several frames before the end of the sequence, because after the convolutional layers the last frames suffer from padding. The number of overhead frames needed after the analyzed frame for best accuracy is 2(K − 1), where K is the size of the convolutional filters. This gives the optimal position of the analyzed frame with respect to the end of the sequence, as padding would lower accuracy at any later position.

Therefore, for best speaker counting performance, a CRNN needs past information as well as some overhead, depending on the recurrent and convolutional parameters.
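Equation (2) and the underlying void-frame count can be written as a small helper; this is an illustration of the formula, with the architecture's 4-layer depth as the default.

```python
def optimal_position(N_t, K, n_layers=4):
    """Optimal decoded-frame index before padding degrades accuracy.
    Each of the n_layers convolutional layers appends (K - 1) // 2 void
    frames at the tail, so the last unaffected index is
    N_t - 1 - n_layers * (K - 1) // 2, i.e. N_t - 2K + 1 for 4 layers."""
    return N_t - 1 - n_layers * ((K - 1) // 2)

print(optimal_position(30, 3))  # 25, the observed peak for K = 3, N_t = 30
print(optimal_position(50, 5))  # 41
```

For 4 layers the helper reduces exactly to Eq. (2), since 1 + 4(K − 1)/2 = 2K − 1.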
4. CONCLUSION
In this paper we have proposed an analysis of the counting accuracy of a CRNN, depending on the position of the analyzed frame within the input sequence and on the size of the convolutional filters. We showed that the LSTM indeed needs several steps to converge and provide the best possible accuracy. But although this convergence should theoretically reach its maximum at the end of the sequence, we witness a drop in accuracy towards the end of the sequence, which is explained by the zero-padding in the cascaded convolutional layers (the very mechanism that enables a framewise prediction for source counting). The use of convolutional filters thus requires some overhead in the sequence. So there is a tradeoff between real-time prediction of the number of speakers and the accuracy of this prediction.
5. REFERENCES