Speaker Change Detection Using Features through A Neural Network Speaker Classifier
Zhenhao Ge, Ananth N. Iyer, Srinath Cheluvaraja, Aravind Ganapathiraju
Intelligent Systems Conference 2017, 7-8 September 2017 | London, UK
Interactive Intelligence Inc., Indianapolis, Indiana, USA
Email: {roger.ge, ananth.iyer, srinath.cheluvaraja, aravind.ganapathiraju}@inin.com

Abstract: The mechanism proposed here is for real-time speaker change detection in conversations. It first trains a neural network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e. similarity scores compared to the in-domain speakers. These transformed features demonstrate very distinctive patterns, which facilitate differentiating speakers and enable speaker change detection with some straightforward distance metrics. The speaker classifier and the speaker change detector are trained/tested using speech of the first 200 (in-domain) and the remaining 126 (out-of-domain) male speakers in TIMIT respectively. For speaker classification, 100% accuracy at a 200-speaker size is achieved on any testing file, given that the speech duration is at least 0.97 seconds. For speaker change detection using the speaker classification outputs, performance based on 0.5, 1, and 2 second inspection intervals was evaluated in terms of error rate and F1 score, using data synthesized by concatenating speech from various speakers. The detector captures close to 97% of the changes by comparing the current second of speech with the previous second, which is very competitive among literature using other methods.
Keywords: Speaker Change Detection, Speaker Classification, Neural Network
I. INTRODUCTION
Speaker Change Detection (SCD) is the task of detecting changes of speaker during conversations. An efficient and accurate speaker change detector can be used to partition conversations into homogeneous segments, where only one speaker is present. Speaker recognition or verification can then be performed on the clustered speaker segments, rather than on a frame-by-frame basis, to improve accuracy and reduce cost. However, SCD is challenging since prior information about the speakers is absent, and it is usually required to detect speaker change in real-time, within a limited delay, e.g. within 1 or 2 seconds of speech.

SCD can be divided into retrospective vs. real-time detection [1]. The former is normally based on model training for speakers and a detection algorithm, using Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), etc. [2]. It includes approaches with different thresholding criteria, such as the Bayesian Information Criterion (BIC) [3] and Kullback-Leibler (KL)-based metrics [4]. For real-time detection, the decision has to be made using limited preceding data with low computational cost. Research has focused on improving features and developing efficient distance metrics. Lu et al. [5] obtained reliable change detection in real-time news broadcasting with a Bayesian feature fusion method. In the evaluation using TIMIT synthesized data by Kotti et al. [6], a significant drop in accuracy was observed for speaker changes within durations of less than 2 seconds. Another work, from Ajmera et al. [7], reported 81% recall and 22% precision using BIC and log-likelihood ratios on the HUB-4-1997 3-hour news data.

Here a novel real-time mechanism for SCD is presented. It first transforms conversations into speaker classification outputs through a feed-forward neural network trained with in-domain speaker data; it then detects whether a speaker change is present by comparing the similarity of adjacent intervals of 0.5, 1, or 2 seconds. Though the speakers present in the testing conversations are usually out-of-domain (unseen) speakers to the network, and the outputs merely serve as the likelihoods of them being classified as the in-domain speakers, the pattern of each new speaker is still revealed in the network outputs and can be used to distinguish one speaker from another. This enables the development of straightforward distance metrics to capture speaker change. Very promising performance is achieved on synthesized conversational speech from TIMIT, which is not feasible if the raw features are used directly without going through the speaker classification network.

Fig. 1 shows a global picture of using an NN-based speaker classifier as a feature transformer and then detecting speaker changes with the improved features. The following sections walk through the 3 major components in Fig. 1: data preparation (Sec. II), the framework of neural network (NN) based speaker classification (Sec. III), and the speaker change detection mechanism, i.e. the distance metrics used for detection based on speaker classification outputs (Sec. IV). Finally, the conclusion and future work are in Sec. V.

II. DATA PREPARATION
Speech of all 326 male speakers in the "train" folder of the TIMIT corpus is used here. Data of males from the "test" folder and data of females from both "train" and "test" folders are currently not used. For each speaker, there are 10 data files containing one sentence each from 3 categories: "SX" (5 sentences), "SI" (3 sentences) and "SA" (2 sentences). The 326 male speakers are sorted alphabetically and divided into 2 groups: the first 200 speakers (group A) and the remaining 126 speakers (group B). For group A, sentences in the "SX" and "SI" categories are different among speakers. They are combined for each speaker and used to train the text-independent neural network speaker classifier. Sentences in the "SA" category are the same and shared by all speakers, so they can be used to test the accuracy with no distinguishable information added through content. For group B, synthesized conversations are generated by concatenating speech from multiple speakers.
Fig. 1. Diagram of using improved features through an NN-based speaker classifier for Speaker Change Detection (SCD).

Conversations created using the "SX" and "SI" sentences of the first 63 of the 126 speakers are used to find the optimal threshold to determine speaker change, while conversations with the "SX" and "SI" sentences of the remaining 63 speakers are used for testing the SCD performance.

The following 2 subsections introduce the process of converting raw speech into the features used in the development of the speaker classifier and the SCD algorithm, including a) preprocessing, and b) feature extraction and concatenation.
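For illustration, the following is a minimal Python sketch of the speaker partitioning described above; the flattened directory layout, the file naming, and the timit_root path are assumptions about a local TIMIT copy, not part of the paper.

```python
import os

timit_root = "TIMIT/train"  # hypothetical path; assumes dialect subfolders flattened

# TIMIT male speaker directories start with "M"; sort alphabetically.
males = sorted(d for d in os.listdir(timit_root) if d.startswith("M"))

group_a, group_b = males[:200], males[200:]        # 200 in-domain, 126 out-of-domain
scd_train, scd_test = group_b[:63], group_b[63:]   # threshold tuning vs. evaluation

def files_by_category(speaker):
    """Split a speaker's 10 sentence files into SX/SI (unique text)
    and SA (text shared by all speakers)."""
    path = os.path.join(timit_root, speaker)
    wavs = [f for f in os.listdir(path) if f.endswith(".wav")]
    unique = [f for f in wavs if f.startswith(("SX", "SI"))]  # classifier training
    shared = [f for f in wavs if f.startswith("SA")]          # classifier testing
    return unique, shared
```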
A. Preprocessing
Preprocessing mainly consists of a) scaling the maximum absolute amplitude to 1, and b) Voice Activity Detection (VAD) to eliminate the unvoiced parts of speech. Experiments show that both speaker classification and speaker change detection perform significantly better if speakers are evaluated only on voiced speech, especially when the data is noisy.

An improved version of Giannakopoulos's recipe [8] with short-term energy and spectral centroid is developed for VAD. Given a short-term signal s(n) with N samples, the energy is:

E = \frac{1}{N} \sum_{n=1}^{N} |s(n)|^2,   (1)

and given the corresponding Discrete Fourier Transform (DFT) S(k) of s(n) with K frequency components, the spectral centroid can be formulated as:

C = \frac{\sum_{k=1}^{K} k S(k)}{\sum_{k=1}^{K} S(k)}.   (2)

The short-term energy E is used to discriminate silence from environmental noise, and the spectral centroid C can be used to remove non-environmental noise, i.e. non-speech sound such as coughing, mouse clicking and keyboard tapping, since these normally have different spectral centroids compared to human speech. Only when E and C are both above their thresholds T_E and T_C is a speech frame considered voiced; otherwise, it is removed. These thresholds are adjusted to be slightly higher, to enforce a stricter VAD algorithm and ensure the quality of the captured voiced sections. This is achieved by tuning the signal median smoothing parameters, such as step size and smoothing order, as well as setting the thresholds T_E and T_C as weighted averages of the local maxima in the distribution histograms of the short-term energy and spectral centroid respectively. In this work, the TIMIT speech with the original 16 kHz sampling rate is segmented into overlapping frames with fixed window and hop sizes.
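As an illustration, the following is a minimal sketch of the energy/centroid VAD; the 50 ms / 25 ms frame sizes and the median-based thresholds are placeholders, since the paper derives T_E and T_C from weighted local maxima of the histograms with median smoothing.

```python
import numpy as np

def vad_mask(signal, fs=16000, win=0.05, hop=0.025):
    """Return a boolean voiced/unvoiced mask per frame, using the
    short-term energy of Eq. (1) and the spectral centroid of Eq. (2)."""
    n_win, n_hop = int(win * fs), int(hop * fs)
    energies, centroids = [], []
    for start in range(0, len(signal) - n_win, n_hop):
        frame = signal[start:start + n_win]
        energies.append(np.mean(frame ** 2))              # short-term energy, Eq. (1)
        spec = np.abs(np.fft.rfft(frame))
        k = np.arange(1, len(spec) + 1)
        centroids.append(np.sum(k * spec) / (np.sum(spec) + 1e-10))  # centroid, Eq. (2)
    energies, centroids = np.array(energies), np.array(centroids)
    # Simplified thresholds; the paper instead uses weighted averages of
    # histogram local maxima, tuned slightly higher for stricter VAD.
    t_e, t_c = 0.5 * np.median(energies), 0.5 * np.median(centroids)
    return (energies > t_e) & (centroids > t_c)  # voiced only if both criteria hold
```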
B. Feature Extraction, Normalization and Concatenation

The 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) with delta and double delta were generated from the preprocessed speech, following Ellis's recipe [9]. They were extracted using overlapping 25 ms Hamming windows which hop every 10 ms. Then, the features of each speaker were normalized with his own mean and variance. To capture the transition patterns within longer durations, these 39-dimensional feature frames were concatenated to form overlapping longer frames. In this work, 10 frames (100 ms) were concatenated with a hop size of 3 frames (30 ms), as shown in Fig. 2.
Fig. 2. Feature concatenation example with a window size of 10 frames and a hop size of 3 frames.
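A minimal numpy sketch of this concatenation, assuming the MFCC matrix has one 39-dimensional row per frame; the function name is illustrative.

```python
import numpy as np

def concatenate_frames(mfcc, win=10, hop=3):
    """Stack `win` consecutive 39-dim frames into one 390-dim vector,
    advancing `hop` frames each time (Fig. 2): 10 frames / hop 3 here."""
    frames = [mfcc[i:i + win].reshape(-1)              # 10 x 39 -> 390
              for i in range(0, len(mfcc) - win + 1, hop)]
    return np.stack(frames)

# e.g. mfcc of shape (T, 39) -> concatenated features of shape (~T/3, 390)
```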
III. NEURAL NETWORK SPEAKER CLASSIFICATION
The concatenated features (e.g. 390-dimensional feature vectors) are used as the input to a neural network speaker classifier. As mentioned in the first paragraph of Sec. II, the "SX" and "SI" sentences of the first 200 male speakers were used for training, and the remaining "SA" sentences from the same set of speakers were used for testing.
A. Cost Function and Model Structures
Ng's neural network training recipe for hand-written digit classification [10] is used here, which treats the multi-class problem as K separate binary classifications. It can be considered a generalization of the cost function of binary classification using logistic regression, and is built on slightly different concepts compared with the cross-entropy cost function with softmax as the output layer [11].

Given M samples, K output classes, and L layers, including the input, output and all hidden layers in between, the cost function can be formulated as:

J(\Theta) = -\frac{1}{M}\left[\sum_{m=1}^{M}\sum_{k=1}^{K}\left(y_k^{(m)}\log(h_\theta(x^{(m)})_k) + (1-y_k^{(m)})\log(1-h_\theta(x^{(m)})_k)\right)\right] + \frac{\lambda}{2M}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta_{ji}^{(l)})^2,   (3)

where h_\theta(x^{(m)})_k is the k-th output of the final layer, given the m-th input sample x^{(m)}, and y_k^{(m)} is its corresponding target label. The 2nd half of Eq. (3) is the regularization factor to prevent over-fitting, where \lambda is the regularization parameter and \theta_{ji}^{(l)} is the j-th row, i-th column element of the weight matrix \Theta^{(l)} between the l-th and (l+1)-th layers, i.e. the weight from the i-th node in the l-th layer to the j-th node in the (l+1)-th layer.

In this work, there is only 1 hidden layer (L = 3) with 200 nodes (s_2 = 200), the input feature dimension is 390 (s_1 = 390), and the speaker classifier was trained with data from 200 speakers (s_3 = K = 200). Therefore, the network structure is 390:200:200, with weight matrices \Theta^{(1)} (200 × 391) and \Theta^{(2)} (200 × 201). The additional 1 column in each is a bias vector, which is left out of regularization, since the change of bias is unrelated to over-fitting. In this example, the regularization part in Eq. (3) can be instantiated as

\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta_{ji}^{(l)})^2 = \sum_{i=1}^{390}\sum_{j=1}^{200}(\theta_{j,i}^{(1)})^2 + \sum_{i=1}^{200}\sum_{j=1}^{200}(\theta_{j,i}^{(2)})^2.   (4)
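To make Eq. (3) concrete, here is a minimal numpy sketch of the regularized cost for this 390:200:200 network; the function name and the numerical clipping are illustrative additions, not part of the original recipe.

```python
import numpy as np

def cost(thetas, X, Y, lam):
    """Regularized K-way binary cross-entropy cost, Eq. (3).
    thetas: [Theta1 (200 x 391), Theta2 (200 x 201)],
    X: (M, 390) inputs, Y: (M, 200) one-hot targets."""
    M = X.shape[0]
    A = X
    for theta in thetas:
        A = np.hstack([np.ones((A.shape[0], 1)), A])  # prepend bias unit
        A = 1.0 / (1.0 + np.exp(-A @ theta.T))        # sigmoid activation, Eq. (5)
    H = np.clip(A, 1e-10, 1 - 1e-10)                  # avoid log(0)
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / M
    # Bias columns (first column of each Theta) are excluded from regularization.
    reg_term = lam / (2 * M) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term
```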
B. Model Training and Performance Evaluation

The neural network model is trained through forward-backward propagation. Denoting z^{(l)} and a^{(l)} as the input and output of the l-th layer, the sigmoid function

a^{(l)} = g(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}}   (5)

is selected as the activation function, and the input z^{(l+1)} of the (l+1)-th layer is transformed from the output a^{(l)} of the l-th layer using z^{(l+1)} = \Theta^{(l)} a^{(l)}. Then, h_\theta(x) can be computed through forward propagation: x = a^{(1)} \to z^{(2)} \to a^{(2)} \to \cdots \to z^{(L)} \to a^{(L)} = h_\theta(x). The weight matrices \Theta^{(l)} are randomly initialized from a continuous uniform distribution over a small symmetric interval around zero, and then trained through backward propagation of \partial J / \partial \theta_{j,i}^{(l)}, by minimizing J(\Theta) using Rasmussen's conjugate gradient algorithm, which handles the step size (learning rate) automatically with a slope ratio method [12].

In evaluating the classifier performance, the sigmoid output of the final layer h_\theta(x^{(m)}) is a K-dimensional vector, each element in the range (0, 1). It serves as the "likelihood" indicating how likely the m-th input frame is to be classified as each of the K speakers. The speaker can be predicted from the sum of log likelihoods of M input frames (the prediction scores), and the predicted speaker ID k^* is the index of its maximum:

k^* = \arg\max_{k \in [1,K]} \left( \sum_{m=1}^{M} \log(h_\theta(x^{(m)})_k) \right).   (6)

M can range from 1 to the entire frame length of the testing file. If M = 1, the accuracy achieved is based on individual frames, each of which covers 100 ms (the window duration T_win in feature concatenation) with 30 ms of new data compared with the previous frame. On the other hand, if M is equal to the total number of frames in the file, the accuracy is file-based. The average duration of sentences (i.e. file length) is about 2.5 seconds. In general, larger M leads to higher accuracy. Given the best model available with the network structure
390:200:200, Fig. 3 demonstrates an example of the file-level prediction scores of speaker MPGR0. It shows that the peak over the positives (in the green circle) drops slightly but remains distinguishable enough from all other negatives, when moving from the file SI1410 in the training set to the file SA1 in the testing set.
Fig. 3. File-level prediction scores of speaker MPGR0 in the training and testing sets respectively.

Using this model, the file-level training and testing accuracies at a 200-speaker size are both 100%, as indicated in Table I. The frame-level testing accuracy is 79.65%, which indicates that 79.65% of the frames in the testing set, with duration as little as 0.1 second, can be classified correctly. Table I also shows the minimum, mean, and maximum number of consecutive frames needed, and their corresponding durations, in order to achieve 100% accuracy, evaluated through all files in both training and testing datasets.

TABLE I. NN-BASED SPEAKER CLASSIFICATION PERFORMANCE WITH THE FIRST 200 MALES IN 16K TIMIT (0.1 SEC./FRAME, ~2.5 SEC./FILE)

Dataset | Accuracy (%)    | Frames (seconds) needed for 100% accuracy
        | frame  | file   | min       | mean         | max
train   | 96.63  | 100    | 2 (0.13)  | 2.80 (0.15)  | 6 (0.25)
test    | 79.65  | 100    | 5 (0.22)  | 11.59 (0.42) | 30 (0.97)

Since the next frame provides only 30 ms (the hop duration T_hop in feature concatenation) of additional information compared with the current frame, given the number of frames needed N, the formula to compute the corresponding required duration T is

T = (N - 1) \times T_{hop} + 1 \times T_{win}.   (7)

With this formula, it requires only 11.59 frames (0.42 seconds) on average to achieve 100% accuracy on the testing dataset. Using the training data for testing is normally not legitimate, and here it is used merely to get a sense of how the accuracy drops when switching from training data to testing data.
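A minimal sketch of the prediction rule of Eq. (6) and the duration formula of Eq. (7); the function names are illustrative, and T_hop = 0.03 s and T_win = 0.1 s follow the frame timing described above.

```python
import numpy as np

def predict_speaker(H):
    """Speaker prediction from M frames of network outputs H (M x K),
    by summing log-likelihoods per speaker, Eq. (6)."""
    scores = np.sum(np.log(np.clip(H, 1e-10, 1.0)), axis=0)
    return int(np.argmax(scores))            # predicted speaker index k*

def required_duration(n_frames, t_hop=0.03, t_win=0.1):
    """Speech duration covered by N consecutive concatenated frames, Eq. (7)."""
    return (n_frames - 1) * t_hop + t_win

# e.g. required_duration(11.59) ~= 0.42 s, the mean needed for 100% test accuracy
```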
C. Model Parameter Optimization

The current neural network model with the structure 390:200:200 is the best one found, in terms of the highest frame-level testing accuracy, after grid searching over a) the number of hidden layers and b) the number of nodes per hidden layer, using a subset containing only 10% of the training and testing data, randomly selected.

Once the ideal network structure is identified, the model training is conducted with a regularization parameter λ in the cost function J(Θ) that is iteratively reduced from 3 to 0 through training. This dynamic regularization scheme is experimentally shown to avoid over-fitting and allow more iterations to reach a refined model with better performance. The training is set to terminate once the testing frame accuracy improves by less than a small margin over the last 2 consecutive training iterations. The training set, covering all 200 in-domain speakers, is fed in as a whole batch of data, which requires about 1 hour to train on a computer with an i7-3770 CPU and 16 GB memory. Therefore, the computational cost is certainly manageable.

IV. SPEAKER CHANGE DETECTION USING SPEAKER CLASSIFICATION OUTPUTS
The main task in this work is to detect speaker changes in conversations. Developing an NN-based speaker classifier is one approach to improving the features for that purpose. Here, given the raw feature x \in \mathbb{R}^{390}, the transformed new feature is denoted as

d = \log(h_\theta(x)) \in \mathbb{R}^{200}.   (8)

Dividing the conversation into consecutive speech intervals with equal frame length M, the goal is to develop distance metrics to measure the difference between the 2 sets of improved features at the current interval t and the previous interval t-1, which is formulated as:

d'_t = \mathrm{dist}(d_t, d_{t-1}).   (9)

Fig. 4 shows an example of the concatenation of 100 200-dimensional transformed features for 5 in-domain and 5 out-of-domain speakers. These features are reversed to linear scale (i.e. h_\theta(x)) rather than logarithmic scale for better visibility, and are taken from the testing set containing "SA" sentences. The in-domain speakers have speaker IDs 10, 20, 30, 40 and 50 (selected from the first 200 speakers), while the IDs of the out-of-domain ones are 210, 220, 230, 240 and 250 (selected from the speakers with IDs 201 to 326). The prediction scores are shown in gray scale, the larger the darker. The pattern for each speaker in (a) is fairly clear, since the scores peak at the speakers' own ID indices. The pattern for the speakers in (b) is not as apparent, but one can still find some "strip lines", which indicate the consistency in similarity when comparing one out-of-domain speaker with all in-domain speakers.

Fig. 4. Prediction output pattern visualization for (a) 5 in-domain and (b) 5 out-of-domain speakers.
A. Distance Metrics to Compare Adjacent Intervals
With the "SX" and "SI" sentences of the remaining 126 out-of-domain male speakers, 2 concatenated speech streams are created, using the data from the first 63 and the remaining 63 speakers respectively. They are used for training (threshold determination) and testing (performance evaluation) respectively in SCD. Sentences from the same speaker are first concatenated, keeping only the speech in the first T seconds, where T is the duration of the shortest concatenation among all 126 speakers (T = 14 seconds in this work). These per-speaker segments are then concatenated again to form the synthesized training and testing conversations (14 × 63 = 882 seconds ≈ 15 minutes each), as shown in Fig. 5.

Fig. 5. Speech concatenation to form the synthesized conversations for training and testing in speaker change detection.
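A minimal sketch of this synthesis procedure, assuming each speaker's sentence waveforms are already loaded; the function name and the returned change-point list are illustrative additions.

```python
import numpy as np

def synthesize_conversation(speaker_waves, t=14, fs=16000):
    """Build a synthesized conversation (Fig. 5): concatenate each speaker's
    sentences, trim to the first t seconds, then join speakers end to end.
    `speaker_waves` maps speaker ID -> list of sentence waveforms."""
    segments, change_points = [], []
    for spk in sorted(speaker_waves):
        speech = np.concatenate(speaker_waves[spk])[: t * fs]  # first t seconds
        segments.append(speech)
        change_points.append(sum(len(s) for s in segments) / fs)  # boundary (sec.)
    # The last boundary is the end of the audio, not a speaker change.
    return np.concatenate(segments), change_points[:-1]
```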
The concatenated speech is then examined in adjacent but non-overlapping intervals t of M frames each. Using p-norm distance metrics, Eq. (9) can be instantiated as:

d'_t = \left( \sum_{k=1}^{K} |\bar{d}_{t,k} - \bar{d}_{t-1,k}|^p \right)^{1/p},   (10)

where K is the number of in-domain speakers used to train the speaker classifier, i.e. the dimension of the transformed features, d_t and d_{t-1} are both feature matrices of size M × K, and \bar{d}_t, \bar{d}_{t-1} are their mean vectors of dimension K. The difference d'_t between the current and previous intervals should be low (a negative sample) if the feature matrices d_t and d_{t-1} belong to the same speaker, and high (a positive sample) otherwise. In this work, several values of p, including p = ∞, were tested, and p = 2, i.e. the Euclidean distance, provided the best separation between positive (higher value expected) and negative (lower value expected) samples.

Some distance metrics other than the p-norm, such as the Bhattacharyya distance for comparing 2 sets of samples, were also evaluated here. However, since the major difference between \bar{d}_t and \bar{d}_{t-1} shows up in only a few dimensions, far fewer than the full dimension K, the covariance matrices of d_t and d_{t-1} are not positive definite, so this type of distance is not feasible.
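A minimal sketch of Eq. (10), assuming the two intervals are given as M × K matrices of log-likelihood features; the function name and the infinity-norm branch are illustrative.

```python
import numpy as np

def interval_distance(d_cur, d_prev, p=2):
    """p-norm distance between the mean transformed-feature vectors of the
    current and previous intervals, Eq. (10); p = 2 (Euclidean) worked best.
    d_cur, d_prev: (M, K) log-likelihood features of two adjacent intervals."""
    diff = np.abs(d_cur.mean(axis=0) - d_prev.mean(axis=0))
    if np.isinf(p):
        return diff.max()                 # infinity norm
    return np.sum(diff ** p) ** (1.0 / p)
```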
B. SCD Training and Testing

Denoting the difference d'_t between the current and previous intervals t, t-1 as a sample x, a speaker change is detected if x is higher than the optimal threshold x*. Fig. 6 (a, b, c) plots d'_t vs. interval t with interval durations of 0.5, 1, and 2 seconds, where positive samples are highlighted with red stars. They are evenly distributed since the conversational speech is concatenated from individual speakers' speech of the same duration. By modeling the positive and negative samples as two Gaussian distributions, the Bayesian decision boundary is selected as the optimal threshold x*.

As shown in Fig. 6 (d, e, f), the negative samples far outnumber the positive samples, especially when the time interval is small, so the dataset is very skewed. Therefore, the F1 score is used along with the error rate P_e to measure SCD performance. Given the False Negative Ratio (FNR), i.e. the ratio of positives classified as negative (FN) vs. all positives (P), and the False Positive Ratio (FPR), i.e. the ratio of negatives classified as positive (FP) vs. all negatives (N), with P = TP + FN and N = TN + FP, P_e and F1 can be computed as:

P_e = \frac{FN + FP}{P + N},   (11)

F1 = \frac{2 TP}{2 TP + FP + FN}.   (12)

Table II shows all these statistics for performance evaluation. The results for the training data are theoretical, computed using the Gaussian distributions in Fig. 6 (d, e, f), while those for the testing data are counted experimentally, using plots similar to Fig. 6 (a, b, c). Note, however, that the optimal thresholds for the training data may no longer be optimal for the testing data.
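As a sketch of this training step, the snippet below fits one Gaussian to each class, grid-searches the crossing point of the count-weighted PDFs as x*, and computes P_e and F1 per Eqs. (11) and (12); the grid search and all names are simplifications, and it assumes the positive samples have the larger mean.

```python
import numpy as np
from scipy.stats import norm

def bayes_threshold(neg, pos):
    """Threshold x* where the count-weighted Gaussian PDFs of negative and
    positive samples cross (Fig. 6 d-f), found numerically on a grid."""
    w_neg, w_pos = len(neg), len(pos)
    xs = np.linspace(min(neg.min(), pos.min()), max(neg.max(), pos.max()), 10000)
    p_neg = w_neg * norm.pdf(xs, neg.mean(), neg.std())
    p_pos = w_pos * norm.pdf(xs, pos.mean(), pos.std())
    mid = (xs > neg.mean()) & (xs < pos.mean())  # search between the two means
    return xs[mid][np.argmin(np.abs(p_neg[mid] - p_pos[mid]))]

def scd_metrics(d_vals, labels, x_star):
    """Error rate P_e, Eq. (11), and F1 score, Eq. (12), at threshold x*.
    d_vals: interval differences d'_t; labels: True where a change occurs."""
    pred = d_vals > x_star
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    p_e = (fn + fp) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return p_e, f1
```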
TABLE II. SCD PERFORMANCE ON SYNTHESIZED CONVERSATIONS (THEORETICAL ON THE TRAINING SET, EXPERIMENTAL ON THE TESTING SET), WITH MULTIPLE INSPECTION INTERVALS

itvl./spkr. | itvl. (sec.) | P_e (%) | F1    | FNR (%) | FPR (%) | P_e (%) | F1    | FNR (%) | FPR (%)
            |              |          (theoretical)              |          (experimental)
28          | 0.5          | 0.479   | 0.929 | 10.288  | 0.121   | 2.042   | 0.747 | 14.516  | 1.587
14          | 1            | 0.189   | 0.987 | 2.022   | 0.050   | 0.454   | 0.969 | 0       | 0.488
7           | 2            | 0.076   | 0.997 | 0.412   | 0.020   | 0.227   | 0.992 | 0       | 0.265

The table shows that more than 10% of the speaker changes cannot be detected by comparing the features in the current and previous 0.5 second intervals, i.e. the FNR is 10.288% in the theoretical case and 14.516% in the experimental case. However, these numbers drop significantly once the inspection interval gets longer.

C. Potential Further Improvement
The approach described above for SCD checks the difference d'_t between d_t and d_{t-1}, the features in the current and previous intervals. However, comparing the current difference d'_t with the previous difference d'_{t-1} and the next difference d'_{t+1}, i.e. the difference of the differences, may reveal more reliable information. This is based on the assumption that, if a speaker change occurs in the current interval, d'_t will be much higher than both its previous and next values, d'_{t-1} and d'_{t+1}. This distance metric can be considered a "second derivative" of the raw feature, and is formulated as:

d''_t = (d'_t - d'_{t-1}) + (d'_t - d'_{t+1}).   (13)

It shows accuracy improvement in some noisy cases, such as reducing the error rate on the testing data with a 0.5 second interval. However, it delays the decision by 1 additional time interval, since it requires the next feature d_{t+1} in the computation.
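A short vectorized sketch of Eq. (13) over a whole sequence of interval differences; the function name is illustrative.

```python
import numpy as np

def second_difference(d_prime):
    """'Second derivative' score of Eq. (13):
    d''_t = (d'_t - d'_{t-1}) + (d'_t - d'_{t+1}).
    It peaks where the current interval differs from both neighbors; note it
    delays the decision by one interval, since d'_{t+1} is required."""
    d = np.asarray(d_prime)
    return 2 * d[1:-1] - d[:-2] - d[2:]  # valid for t = 1 .. len(d) - 2
```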
V. CONCLUSION AND FUTURE WORK

In this work, a novel real-time SCD approach using improved features through a speaker classification network is presented. The features are represented by vectors of attributes of the in-domain speakers, i.e. projected onto a space spanned by the in-domain speakers. This enables the use of simple distance metrics, such as the Euclidean distance between feature centroids, to detect speaker change in adjacent intervals. Using TIMIT data of 200 male speakers, the classifier is guaranteed to achieve 100% accuracy with speech no longer than 1 second. In the 15-minute synthesized conversations of 63 different speakers (62 unique speaker changes), theoretically only around 2% of the changes are mis-detected, with an F1 score above 0.98. This outperforms the results in [6], which also used TIMIT synthesized data, based on the algorithms in [5]. These results are still very competitive compared to other algorithms using real-world conversations [7], [13].

The next step is to test the algorithm with real-world conversations, where the number of speakers should be fewer and the speaker changes may be less frequent, but they can be less predictable and speaker conflicts may occur. Since the Bayesian threshold depends on the speaker change frequency, which is unpredictable in real-world scenarios, more robust and dynamic thresholding might be necessary to improve the performance. Second, better SCD performance has been observed with conversations from in-domain speakers. Thus, speaker clustering based on initial detection results is also desirable, for converting new speakers into in-domain speakers and forming a better speaker classifier for feature transformation. Third, currently the speaker classifier is trained to maximize classification accuracy for the in-domain speakers, rather than trained towards being the best feature transformer for detecting speaker changes. Finally, how to select speakers as in-domain speakers to train it for that purpose is still unclear and needs to be explored in the future.

Fig. 6. Experimental and theoretical distributions of positive and negative samples with multiple interval durations (optimal thresholds x* = 67.809, 51.856 and 40.366 for the 0.5, 1 and 2 second intervals respectively).
REFERENCES

[1] Thor Bundgaard Nielsen, "Efficient recursive speaker segmentation for unsupervised audio editing," 2013.
[2] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, "Speaker diarization: A review of recent research," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] Chuck Wooters and Marijn Huijbregts, "The ICSI RT07s speaker diarization system," in Multimodal Technologies for Perception of Humans. Springer, 2008.
[4] Jamal-Eddine Rougui, Mohammed Rziza, Driss Aboutajdine, Marc Gelgon, and José Martinez, "Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast," in ICASSP'06. IEEE, 2006.
[5] Lie Lu and Hong-Jiang Zhang, "Speaker change detection and tracking in real-time news broadcasting analysis," in Proceedings of the Tenth ACM International Conference on Multimedia. ACM, 2002, pp. 602-610.
[6] Margarita Kotti, Luis Gustavo P. M. Martins, Emmanouil Benetos, Jaime S. Cardoso, and Constantine Kotropoulos, "Automatic speaker segmentation using multiple features and distance measures: A comparison of three approaches," in IEEE ICME'06.
[7] Jitendra Ajmera, Iain McCowan, and Hervé Bourlard, "Robust speaker change detection," IEEE Signal Processing Letters, 2004.
[8] Theodoros Giannakopoulos, "A method for silence removal and segmentation of speech signals, implemented in Matlab," University of Athens, Athens.
[9] Daniel P. W. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab," 2005.
[10] Andrew Ng, "Machine learning," online course, Coursera.
[11] Sargur Srihari, CSE574: Introduction to Machine Learning, lecture notes, University at Buffalo, http://www.cedar.buffalo.edu/~srihari/CSE574/, accessed 2016-07-21.
[12] Carl Edward Rasmussen, "Gaussian processes for machine learning," 2006.
[13] Benjamin Bigot, Isabelle Ferrané, and Julien Pinquier, "Exploiting speaker segmentations for automatic role detection. An application to broadcast news documents," in Content-Based Multimedia Indexing (CBMI), 2010 International Workshop on. IEEE, 2010, pp. 1-6.