U-vectors: Generating clusterable speaker embedding from unlabeled data
M. F. Mridha, Abu Quwsar Ohi, M. Ameer Ali, Muhammad Mostafa Monowar, Md. Abdul Hamid
M. F. Mridha
Department of Computer Science & Engineering, Bangladesh University of Business & Technology, Dhaka, Bangladesh
[email protected]

Abu Quwsar Ohi
Department of Computer Science & Engineering, Bangladesh University of Business & Technology, Dhaka, Bangladesh
[email protected]

M. Ameer Ali
Department of Computer Science & Engineering, Bangladesh University of Business & Technology, Dhaka, Bangladesh
[email protected]

Muhammad Mostafa Monowar
Department of Information Technology, Faculty of Computing & Information Technology, King Abdulaziz University, Jeddah-21589, Kingdom of Saudi Arabia
[email protected]

Md. Abdul Hamid
Department of Information Technology, Faculty of Computing & Information Technology, King Abdulaziz University, Jeddah-21589, Kingdom of Saudi Arabia
[email protected]

Abstract
Speaker recognition deals with recognizing speakers by their speech. Speaker recognition strategies may explore speech timbre, accent, speaking patterns, and so on. Supervised speaker recognition has been thoroughly investigated; however, a careful review shows that unsupervised speaker recognition systems mostly depend on domain adaptation policies. This paper introduces a speaker recognition strategy for unlabeled data, which generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy rests on the assumption that a small speech segment contains only a single speaker. Based on this assumption, we construct pairwise constraints to train twin deep learning architectures with noise augmentation policies that generate speaker embeddings. Without relying on domain adaptation, the process produces clusterable speaker embeddings in an unsupervised manner, and we name them unsupervised vectors (u-vectors). The evaluation is conducted on two popular speaker recognition datasets for the English language, TIMIT and LibriSpeech. We also include a Bengali dataset, Bengali ASR, to illustrate the diversity of domain shifts for speaker recognition systems. Finally, we conclude that the proposed approach achieves remarkable performance using pairwise architectures.
Keywords: Speaker recognition · Clustering · Twin networks · Deep learning
1 Introduction

Speech is the most engaging and acceptable form of communication among people. Artificial intelligence (AI) systems continuously target various challenges of speech-related topics, including speech recognition, speech segmentation, speaker recognition, speaker diarization, and so on. Among the different sub-domains of AI, deep learning (DL) strategies often perform superior to other techniques. However, DL strategies require a vast amount of labeled data to operate on speech-related queries.

Among the various speech-based solutions, speaker recognition has the fascinating use of identifying users by hearing their speech. Speaker recognition systems are directly involved with biometric identification systems and are suitable for authenticating users remotely by voice. Compared to other biometric systems, such as facial recognition and fingerprint matching, speaker recognition is considered challenging due to several difficulties. The difficulties include speech states such as emotional conditions, environmental noise, health conditions, speaking styles, etc. Further, in comparison with supervised speaker recognition approaches, unsupervised and semi-supervised strategies are hardly investigated.

DL architectures have been extensively investigated for supervised speaker recognition systems. For speaker and speech recognition models, speech spectrograms and mel-frequency cepstral coefficients (MFCC) [1] are used as preprocessing strategies. For such cases, convolutional neural networks (CNN) are generally implemented [2]. However, recent architectures process raw audio and extract speaker-recognizable features. SincNet [3] improves the feature extraction process from raw audio waves. The architecture fuses sinc functions with CNNs, which can extract speaker information from low and high cutoff frequencies. AM-MobileNet1D [4] further demonstrates that 1D-CNN architectures are sufficient for identifying features from raw audio waveforms, while requiring fewer parameters compared to SincNet. Although supervised speaker recognition architectures perform excellently in recognition tasks on a large set of speakers, the problem lies in labeling a large dataset.

Generating speech embeddings has been widely observed in the speaker recognition domain [5, 6, 7]. Embedding refers to generating vectors of continuous values. Currently, unsupervised speaker recognition systems implement domain adaptation policies, mostly fused with embedding vectors. Domain adaptation refers to finding appropriate similarities, where a framework is trained on training data and tested on similar yet unseen data. Hence, domain adaptation strategies may perform poorly when the variation between the training data and the unseen dataset is massive. Further, domain adaptation strategies may also produce lower accuracy if they are trained on an inadequate dataset with minimal variation [7]. Yet, efforts have been made to reduce the gap between unseen data and training data by adding adversarial training [8], improving training policies [9], covariance adaptation policies [10], etc. Although these policies improve robustness, most strategies are still prone to various speech diversions, such as language, speech pattern, age, emotion, and so on.

Currently, in the context of DL, embeddings can be generated using triplet [11] and pairwise loss [12] techniques. In a triplet loss architecture, three parallel inputs flow through the network: anchor, negative, and positive. The positive input contains the same class as the anchor, whereas the negative input contains a different class. In a pairwise architecture, a pair of inputs flows through the network that belongs either to a single class or to different classes. Triplet architectures have been used for speaker feature extraction in supervised practice [13].
Figure 1: The figure illustrates the difficulty of the problem. First, a set of segmented speech is available, with an unknown number of speakers (in the example, two speakers, p and q). Speech segments are windowed into smaller speech frames, assuming that all frames of a single speech segment belong to a single class. Further, a DL-based embedding system is used to find speech similarities (inter-segment similarity) and relations from speech segments. The process results in clusterable speech embeddings.
This paper introduces an unsupervised strategy for generating speaker embeddings directly from the unseen data. Hence, the method does not depend on domain adaptation policies and can adapt diverse features from most speech data. Moreover, we insist on converting DL architectures' training process to both semi-supervised and unsupervised manners. To do so, we require segmented audio streams (of length 1 second) and need to guarantee that a segment contains only one person's speech. We further window the audio segments into smaller speech frames (0.2 seconds) and use them to train the DL architecture. The audio segments are assigned pseudo labels, which are further reconstructed by the DL architecture. Figure 1 illustrates the construction of the training procedure.
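To make the windowing concrete, the following minimal Python sketch (the function name and the example array are ours; the 16 kHz sample rate is the one stated in Section 4.1) cuts a one-second segment into the non-overlapping 0.2-second frames that inherit the segment's pseudo label:

```python
import numpy as np

def frame_segment(segment: np.ndarray, sr: int = 16000,
                  frame_seconds: float = 0.2) -> np.ndarray:
    """Split a single-speaker segment into non-overlapping frames.

    Every frame inherits the segment's pseudo label, following the
    assumption that one segment holds exactly one speaker.
    """
    frame_len = int(sr * frame_seconds)       # 3200 samples at 16 kHz
    n_frames = len(segment) // frame_len      # M frames; remainder is dropped
    return segment[:n_frames * frame_len].reshape(n_frames, frame_len)

# A 1-second segment at 16 kHz yields M = 5 frames of 0.2 s each.
frames = frame_segment(np.random.randn(16000))
print(frames.shape)  # (5, 3200)
```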
The clustering process introduced in this paper is based on a particular assumption: a speaker speaks continuously for a specific time. Hence, if we use segmentation methods or extract small speech segments based on voice segmentation techniques, most speech segments will contain an individual's speech. However, some segments might be impure, i.e., a single segment may contain multiple individuals' speech. Nevertheless, we argue that the ratio of impurity would be small enough for most general speech conversations. Hence, we investigate such a strategy with one of the most common neural network architectures: a siamese network [14].

This paper tends to solve the speaker recognition problem in an unsupervised manner based on some constraints. We provide Table 1, which summarizes the paper's mathematical notations, to facilitate the readers. To comprehend the problem statement, let S be a database of speech segments, where X_k is a short-length audio segment containing the speech of an individual. Also, let x_i be a smaller window/frame of the audio segment, where x_i ∈ X_k. From a particular speech segment X_k, M non-overlapping speech frames are generated. As a speech segment is stated to belong to a single individual, the smaller speech frames also belong to that individual. From this intuition, we construct pairwise constraints between audio frames (a brief construction sketch follows below). We define two speech frames as belonging to the same cluster if they belong to the same audio segment. On the contrary, we consider two speech frames as belonging to different clusters if they belong to different audio segments. Based on the pairwise relations, a set of clusters C can be generated, where a single cluster (c_i ∈ C) belongs to a specific speech segment. Considering that most speech segments will belong to a single speaker, we can assume that most clusters c_i would contain a single individual's data. However, as multiple speech segments can belong to a single individual, multiple clusters may contain a single individual's data. Hence, the challenge is to find optimal cluster relationships such that no two clusters contain the speech of the same individual. The training strategy has two possibilities:

• In the case of an unlabeled dataset, let us consider that each audio file contains a single individual's speech. For a set of audio files with an unknown number of speakers, our approach is suitable to produce clusterable embeddings based on speakers. This constraint is similar to semi-supervised learning, as some of the pairwise constraints are known [15].

• Let us consider a dataset containing multiple speakers' conversations, where a single audio stream may include various speakers. In such a case, no pairwise constraints are known, and an unsupervised strategy is required. Hence, we produce hypothetical pairwise constraints based on audio segmentation processes such as VAD, word segmentation [16], etc. However, in such a case, the embedding system's accuracy depends on the purity of the audio segmentation process.

For both of these problems, a DL architecture is used to aggregate multiple clusters such that the resulting cluster contains all of the embeddings of a single individual. We imply that if a DL function can properly extract speech features from audio frames, it can obtain optimal reasoning about speech frames being similar and dissimilar. Further, an optimally trained DL framework can successfully re-cluster the data based on feature similarity rather than the number of hypothetical clusters.
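Under the stated assumptions, a balanced set of can-link and cannot-link pairs can be sampled directly from the segment-level pseudo labels. The sketch below is our own illustration (the helper name build_pairs and the unit targets, later scaled to α, are assumptions), not the paper's reference implementation:

```python
import random

def build_pairs(frames, labels, n_pairs, seed=0):
    """Sample can-link / cannot-link frame pairs from pseudo labels.

    frames: list of speech frames; labels[i]: index of the segment that
    frame i was cut from (its pseudo cluster). Alternating pair types
    keeps the can-link / cannot-link ratio balanced, as the training
    requires. Assumes each segment contributes more than one frame.
    """
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for n in range(n_pairs):
        i = rng.randrange(len(frames))
        same = (n % 2 == 0)  # even: can-link, odd: cannot-link
        candidates = [k for k, l in enumerate(labels)
                      if (l == labels[i]) == same and k != i]
        j = rng.choice(candidates)
        x1.append(frames[i])
        x2.append(frames[j])
        y.append(0 if same else 1)  # 1 is later scaled to alpha
    return x1, x2, y
```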
We extend our investigation towards finding such a DL training strategy. Since the approach generates speaker embedding vectors from speech data in an unsupervised manner (without domain adaptation), we name the output embedding vectors unsupervised vectors (u-vectors). Formally, we define u-vectors as follows.

Definition 1.1. Unsupervised vectors (u-vectors) refer to a set of DNN-generated speech embeddings, trained from unlabeled speech segments, that are clusterable based on speakers.

In the formal definition, by DNN we indicate any specific implementation of a substantially deep neural network, such as convolutional, feedforward, recurrent, etc.
The overall contributions of the paper are the following:
• We introduce an unsupervised strategy of generating speaker-dependent embeddings, named u-vectors. The training process is domain-independent and directly learns from the unseen data.

• We use pairwise constraints and non-generative augmentations to train DL architectures using unlabeled data.

• We evaluate the proposed architecture with two inter-cluster-based strategies: triplet and pairwise architectures. We also conclude that a DL architecture can discriminate speakers from pseudo labels based on feature similarity.

We organize the paper as follows: Section 2 reviews the related works conducted in the speaker recognition domain. Section 3 clarifies the construction of the training procedure, along with the challenges and modifications. Section 4 illustrates the experimental setup, the datasets, and the analysis of the architectures' performance. In Section 5, we sketch the proposed method's future initiatives. Finally, Section 6 concludes the paper.

Table 1: The mathematical notations used in the paper are summarized.

Notation   Description
S          A set of audio segments. Audio segments are fragments of a continuous audio stream. We assume that most audio segments contain the speech of a single individual.
X          A single audio segment, X ∈ S.
x_i        An audio frame, generated by taking shorter frames from an audio segment, x_i ∈ X_k. Audio frames are used to train DL architectures.
M          The number of possible audio frames in an audio segment, x_{1 ≤ i ≤ M} ∈ X_k. Theoretically, M × |x_i| = |X_k|.
C          A set of clusters. These clusters are formed using hypothetical pairwise constraints. As cluster linkages are constructed based on the speech segment relations, it can be considered that |S| = |C|.
c_i        A subset of the entire cluster set, c_i ⊆ C. Here, c_i represents a cluster constructed using the inter-relationship of speech frames belonging to a specific speech segment X_i.
N          The actual number of individuals in S, considering the ground truth. For this specific problem, the value of N is unknown.
α          The distance hyperparameter used for the AutoEmbedder [12] architecture. For other architectures, α may indicate a connectivity state for any two cluster nodes.

2 Related Works

Speaker recognition has been a topic of interest over the past decades, and various systems have been proposed to solve the challenge. In the domain of speaker recognition, numerous techniques have been observed since the late 2000s. Among these, embedding architectures have been widely explored to extract the diversity of speech frames. Embedding models are often considered feature extractors, which can generate a speech-print (analogous to a fingerprint) of an individual. Hence, every individual's speech will remain closer in the embedding space, forming a cluster of embeddings from speech frames.

Gaussian mixture model (GMM) supervectors [17] (stacking mean vectors from GMMs) and joint factor analysis (JFA) [18] have been popularly integrated into the speaker recognition task. Joint factor analysis merges the speaker-dependent and channel-dependent supervectors and generates a new supervector based on the dependency. GMM and JFA were widely accepted as feature extractors and implemented in various speaker recognition strategies. Later on, inspired by JFA, the identity vector (i-vector) [19] was introduced. The i-vector changes the channel-dependent supervectors and integrates speaker information within the supervectors. Hence, the i-vector became more sensitive to speech variations and was widely accepted by researchers. In most cases of JFA and i-vectors, MFCC is widely implemented.
MFCC is a linear cosine transform of a log power spectrum, used to extract a sound's information. However, an MFCC with a lower cepstral coefficient returns only sound information, whereas a higher value of the coefficient represents speaker information as well [20]. Further, probabilistic linear discriminant analysis (PLDA) [21] is mostly used for implementing speaker verification and identification systems using i-vectors [5].

The recent improvement of DL architectures has led to revisiting speech embedding representation from a neural architecture perspective. The deep vector (d-vector) [6] is a mutated implementation of speech frame embeddings using deep neural networks (DNN). The d-vector depends on the automated feature extraction process of a DNN. The model's training process is supervised, and in the basic implementation of the d-vector, it is explored as a text-dependent system. After the training procedure, the softmax layer is left out, and the embeddings are extracted from the last hidden layer. Although the d-vector is based on a DNN, further studies have been made using CNN architectures [22]. In the modified architecture, the speech is converted into mel coefficients, which are normalized and supplied to the CNN. Moreover, extensive studies have been made to improve the basic d-vector to a text-independent unsupervised vector generation using domain adaptation [23]. In the upgraded version, the mechanism is split into two parts: a DNN that extracts embeddings and a separately trained classifier that classifies speakers. The limitation of these studies is that most of them require massive labeled data in the training procedure. Also, the embeddings' performance in the case of unseen speakers dramatically depends on the training data.

As DNN architectures depend on the amount of training data, an improved strategy over the d-vector was proposed, named the x-vector [7]. X-vectors are a modified version of d-vectors that depend on basic sound augmentation techniques: noise and reverberation. Further, the implementation highly motivates data augmentation usage and presents a decent accuracy improvement over i-vectors. The default x-vector is implemented based on the improved text-independent version of the d-vector [23] by properly utilizing data augmentation.

The present state-of-the-art speech embedding systems tend to be unsupervised. However, the concept of unsupervision still depends on a large set of training data. Both d-vectors and x-vectors directly rely on the domain adaptation [24] policy of neural network architectures. Hence, the performance of these architectures on unseen data massively depends on the volume and diversity of the training dataset. The domain adaptation capability of neural network architectures is further increased by using synthetic datasets [25]. However, the performance is still dependent on previously learned features and may suffer due to data inefficiency and domain variation between training and testing data. Therefore, we introduce an approach that is independent of the domain adaptation of neural network architectures. Instead, the proposed method utilizes the automated feature extraction of neural networks.
3 Methodology

The proposed work relies on speech segments, discussed in Section 3.1. Speech segments are broken into small speech frames for constructing pairwise constraints, addressed in Section 3.2. Uncertainties due to the pseudo labels of pairwise constraints are discussed in Section 3.3. Challenges of deciding the segmentation length are addressed in Section 3.4. Finally, the DL framework used to find the actual cluster assignments is theorized in Section 3.5. The methodology explains the procedure of training the AutoEmbedder [12], a pairwise constraint-based architecture; the process of training triplet architectures is similar. Figure 2 illustrates the overall workflow of producing u-vectors.
3.1 Speech Segmentation

To generate the pairwise constraints, we consider a speech segment as belonging to a single individual. Moreover, if it is possible to extract accurate pairwise constraints, a DL framework can be trained using those constraints. To generate such pairwise constraints, we depend on speech segmentation procedures.

Speech can be easily filtered using various segmentation techniques. Methods such as VAD [26] and word segmentation [27] can indeed be adopted to define such speech segments containing the voice of a single individual. It is also feasible to assume that a single individual mostly speaks more than one word in a conversation. Hence, it is also possible to queue multiple speech segments and hypothesize that they come from a single individual. However, increasing the queue or the size of a speech segment also increases the probability of impurity of a speech segment (discussed in Section 3.4). By impurity in a speech segment, we mean that a speech segment contains more than one speaker. Impure data can often prevent DL frameworks from finding the actual relationships among clusters. Hence, to minimize the impurity risk, we study speech segments with a length of one second. After the successful extraction of speech segments, the pairwise constraints are constructed. Although the overall framework depends on proper speech segmentation techniques, we avoid implementing such segmentation methods. Instead, we provide a detailed evaluation of embedding accuracy based on various levels of cluster impurity.
Figure 2: The figure illustrates the comprehensive procedure of generating u-vectors. The audio stream is segmented and given pseudo labels. Then a cluster network is formed using small non-overlapping speech frames from the speech segments. Finally, training data is generated from the cluster network based on the needs of the siamese architecture. The pairwise network requires an equal number of can-link and cannot-link pairs. In contrast, triplet networks receive three inputs: one can-link pair and one cannot-link pair. The network reforms the cluster associations based on speaker similarity.
3.2 Pairwise Constraint Construction

The embedding architecture is trained based on pairwise constraints. Considering x_i and x_j as two random speech frames, two circumstances may occur: a) the speech frames may belong to the same audio segment X_k, or b) they may belong to different audio segments. In the current state of the problem, as the ground truth of the speech labels is unknown for every speech segment, we consider each segment as belonging to a different individual. Hence, the number of unique pseudo labels is equal to |S|. Mathematically, with C being a set of clusters, c_k being a particular cluster of similar nodes, and X_k being a specific speech segment,

    ∀ x_i ∈ X_k and ∀ x_j ∈ X_k : x_i, x_j ∈ c_k
    ∀ x_i ∈ X_k and ∀ x_j ∉ X_k : x_j ∉ c_k                                  (1)

The DL framework is trained based on the defined cluster constraints. To properly introduce the inter-cluster and intra-cluster relations to a DL framework, we define a ground regression function based on the pairwise criteria derived in Eq. 1. The function is defined as

    P_c(x_i, x_j) = { 0  if x_i, x_j ∈ c_p
                    { α  if x_i ∈ c_p and x_j ∈ c_q, p ≠ q                   (2)

In general, P_c(·,·) outputs the distance constraint that each embedding pair (generated from speech frames) must hold. The function infers that an embedding pair must be at a close distance if the frames belong to the same cluster, or at a distance of α otherwise. However, embedding pairs belonging to different clusters may be at a distance greater than α, which is established in the AutoEmbedder architecture (Eq. 4). We use the pairwise constraints to train a DL architecture. Further, we revisit the uncertainty of the data clusters and the segmentation impurities, and explore why a DL framework is necessary in such a case.
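As a minimal illustration of Eq. (2), the target function can be written directly in Python; the function name is ours, α is the distance hyperparameter of Table 1, and α = 100 below is only an example value:

```python
def pairwise_target(label_i: int, label_j: int, alpha: float) -> float:
    """Ground regression target of Eq. (2): 0 for frames that share a
    pseudo cluster (same segment), alpha for frames of different ones."""
    return 0.0 if label_i == label_j else alpha

# Frames cut from segment 7 pair at distance 0; frames from segments
# 7 and 12 must sit alpha apart in the embedding space.
assert pairwise_target(7, 7, alpha=100.0) == 0.0
assert pairwise_target(7, 12, alpha=100.0) == 100.0
```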
3.3 Uncertainty of Cluster Assignments

The cluster assignments are mostly uncertain based on two major concerns: a) the segmented audio X_k may be impure, and b) the ground truth of the cluster assignments is unknown. Therefore, in most cases, the number of ground labels (defined as N) is theoretically not equal to the number of clusters, i.e., N ≠ |C| and N ≠ |S|, where |S| = |C|. Moreover, due to such impurity and uncertainty of the ground labels, the following flaws in the training dataset (based on pairwise properties) are frequently observed:

• Impurity in must-link constraints: The dataset's core concept is the assumption that an audio segment X_k contains only one individual's speech. In general, a segmentation system may inaccurately identify speech segments and hold multiple individuals' speech in a single audio segment. However, if we consider short-length audio segments, the probability of speaker fusion rapidly decreases.
• Errors in cannot-link constraints: Let x_i ∈ c_p and x_j ∈ c_q, where c_p ≠ c_q. The cluster assignments are made based on the number of audio segments. Hence, for most datasets, the number of speech segments is greater than the actual number of speakers, |C| ≥ N. Therefore, considering the ground truth, the assumption c_p ≠ c_q may be wrong, and the data pair x_i and x_j may belong to the same cluster.

If we consider a cluster network C with no impurity, the task of the DL framework is to eliminate the errors in the cannot-link constraints based on the feature relationships. Hence, if it is possible to prioritize the speech features to a DL framework, it can allegedly aggregate appropriate clusters from erroneous cannot-link clusters. Therefore, training the DL architecture reduces errors in cannot-link constraints. However, reducing the impurity of the input data's must-link constraints considerably depends on the length of the speech segments and the segmentation policies.

3.4 Segment Length Selection

The time-domain length of the speech segments (defined by |X|) plays a vital role in the overall performance of the training process. Each segment is further windowed into smaller speech frames. Hence, the segment length must be divisible by the length of the fixed-size speech frames (defined by |x|). Various architectures consider overlapped frames while windowing speech signals. However, we avoid such measures, as such overlaps result in mixing similar speech patterns into multiple speech frames.

To illustrate the trade-off of selecting an optimal speech segment length |X|, let L_mean be the mean and L_std the standard deviation (std) of the lengths of the speech segmentations for a given dataset (or a buffer of an audio stream). Statistically, L_mean − L_std is then the optimal minimal length for which we can assume that most segments strictly contain the speech of a single individual. However, if the minimum segment length is considered, the number of frames per segment M would also reduce.

Reducing the number of frames per segment due to a shorter segment would deliver fewer inter-cluster relations for each segment. The reduction of inter-cluster associations would also cause the DL framework to struggle to find feature relations between speech frames. Further, increasing the size of the speech segments may result in impure components if |X| ≥ L_mean − L_std. To explore the reason for the impurity, let us consider an audio stream with a mean time J_mean and standard deviation J_std after which the speaker changes. In such a condition, selecting the segment length too high may result in |X| ≥ J_mean − J_std. However, statistically, in most general conversations, the length of the minimal speech segmentation is mostly less than the speaker exchange time, L_mean + L_std ≤ J_mean − J_std. Therefore, if we can select an |X| for which |X| < L_mean − L_std < J_mean − J_std, the rate of impurity would be zero. Hence, selecting |X| ≈ L_mean would reduce the rate of impurity. For the experimental datasets, L_mean is equal to one second (illustrated in Table 2). Hence, we experiment with one-second speech segments. Further, we investigate a pairwise framework in which we try to trick the DL architecture into converging towards the ground cluster relationship.
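The following helper (our own, not from the paper) illustrates this rule of thumb: choose |X| near L_mean and round it to a whole number of 0.2-second frames, given measured VAD segment durations:

```python
import numpy as np

def suggest_segment_length(durations_s, frame_seconds=0.2):
    """Pick |X| close to L_mean and round it to a multiple of the frame
    length |x| so that each segment yields M whole, non-overlapping frames."""
    l_mean = float(np.mean(list(durations_s)))
    m = max(1, round(l_mean / frame_seconds))  # frames per segment, M
    return m * frame_seconds                   # |X| = M * |x|

# TIMIT-like statistics (Table 2): mean segment duration of about 1 s.
print(suggest_segment_length([0.8, 1.0, 1.2, 0.9, 1.1]))  # 1.0
```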
3.5 DL Framework

As the DL architecture, we use the pairwise constraint-based AutoEmbedder framework to re-cluster speech data. However, we introduce further modifications to the network's general training process to strengthen the learning progress. In general, the AutoEmbedder architecture is trained based on the pairwise constraints defined by the function P_c(·,·). The architecture follows siamese network constraints, which can be presented as

    S(x, x′) = ReLU(‖E_φ(x) − E_φ(x′)‖, α),  0 ≤ S(x, x′) ≤ α                (3)

The ReLU(·,·) function used in Eq. 3 is a thresholded ReLU function, such that

    ReLU(x, α) = { x  if 0 ≤ x < α
                 { α  if x ≥ α                                               (4)

In Eq. 3, S(·,·) represents the siamese network function, which receives two inputs. The framework contains a single shared DNN network E_φ(·) that maps a higher-dimensional input to lower-dimensional clusterable embeddings. The Euclidean distance of the embedding pair is calculated and passed through the thresholded ReLU activation function derived in Eq. 4. The threshold value is the cluster margin α. Due to the threshold, the siamese architecture always generates outputs in the range [0, α]. The L2 loss function is used to train the general AutoEmbedder architecture. The AutoEmbedder architecture is trained using an equal number of must-link and cannot-link constraints for each data batch. In a triplet architecture, this balancing is automatically resolved, as each triplet contains a fusion of cannot-link (negative) and can-link (positive) data.

Both types of cluster relationships (can-link and cannot-link) may contain faulty assumptions and pseudo labels considering the ground truth. Hence, a basic augmentation scheme is used to keep the DL network from overfitting on erroneous cluster relationships. Although various augmentation techniques are available, we adhere to mixing noise with the speech data. For noise augmentation, we implement a basic formula:
    Aug(x_i, noise, thres) = { x′_i | x′_i = x_i × (1 − thres) + noise × thres },  0 ≤ thres ≤ 1      (5)

Here, Aug(·,·,·) is a function that produces augmented speech data from the input x_i. The parameter thres is a threshold that defines the ratio of mixing noise with the speech data x_i. Augmenting noise with speech frames makes the AutoEmbedder network less likely to be confused by erroneous data pairs. Fusing noise may help the architecture ignore faulty data pairs due to differing noise conditions. Moreover, augmenting data also increases data variation, from which the network extracts more beneficial features.
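A direct translation of Eq. (5) follows; the helper name and the example mixing ratio are ours, and the random noise clip merely stands in for a sample from the noise dataset [37]:

```python
import numpy as np

def augment(x: np.ndarray, noise: np.ndarray, thres: float) -> np.ndarray:
    """Noise augmentation of Eq. (5): a convex mix of speech and noise.

    thres in [0, 1] sets the noise ratio; thres = 0 keeps the frame
    clean, thres = 1 replaces it with pure noise.
    """
    assert 0.0 <= thres <= 1.0
    noise = noise[:len(x)]                    # trim the noise clip to frame length
    return x * (1.0 - thres) + noise * thres

frame = np.random.randn(3200)                 # a 0.2 s frame at 16 kHz
noise = np.random.randn(3200)                 # e.g., a clip from the noise dataset
noisy = augment(frame, noise, thres=0.2)      # 0.2 is only an example ratio
```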
Algorithm 1 presents the pseudocode of the pairwise training process.

Algorithm 1: AutoEmbedder training for speaker recognition.

Input: Dataset D containing speech frames, DL model with initial weights E_φ, distance hyperparameter α, training epochs E_p

Initialize the siamese network, S_φ(·,·) ← ReLU(‖E_φ(·) − E_φ(·)‖, α)
for epoch ← 1 to E_p do
    foreach D_batch ∈ D do
        X, X′, Y ← {}, {}, {}
        counter ← 0
        foreach x ∈ D_batch do
            X ← append x to X
            if counter < |batch|/2 then
                X′ ← randomly pick and append a can-link speech frame from D
                Y ← append 0 to Y
            else
                X′ ← randomly pick and append a cannot-link speech frame from D
                Y ← append α to Y
            counter ← counter + 1
        X ← randomly select half of the speech frames and augment them
        X′ ← randomly select half of the speech frames and augment them
        S_φ ← train S_φ with (X, X′, Y)

4 Experimental Analysis

In this section, we evaluate the proposed scheme based on the impurity of the speech segmentation. As the architecture's target is to produce clusterable embeddings, we use k-means to cluster the embeddings. Further, we use three popular metrics, accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI), to measure clustering effectiveness. The metrics are calculated as demonstrated in [12] and are widely used to measure the purity of clustering [15, 28].

4.1 Experimental Setup

TensorFlow [29] and Keras [30] are used to implement the neural network architectures. Moreover, scikit-learn [31] is used to implement the clustering algorithms, NumPy [32] is used to perform efficient numerical operations, and Librosa [33] is used to perform audio analysis. The datasets were segmented using a threshold of 16 decibels, implemented as a VAD. The audio streams were processed with a sample rate of 16000 Hz. For the audio-to-spectrogram conversion, the parameters are set as follows: size of the fast Fourier transform: 191; window size: 128; stride: 34; mel scales: 100. Speech spectrograms are used as inputs to train the DL architectures.
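A sketch of the preprocessing under the parameters above follows (the wrapper function is ours; note that librosa may warn about sparse mel filters when 100 mel bands are requested from a 191-point FFT):

```python
import numpy as np
import librosa

def frame_to_spectrogram(frame: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mel spectrogram with the Section 4.1 parameters: FFT size 191,
    window size 128, stride (hop) 34, and 100 mel scales."""
    mel = librosa.feature.melspectrogram(
        y=frame, sr=sr, n_fft=191, win_length=128, hop_length=34, n_mels=100)
    return librosa.power_to_db(mel, ref=np.max)  # log-power mel spectrogram

spec = frame_to_spectrogram(np.random.randn(3200))
print(spec.shape)  # (100, 95): mel bands x time steps for one 0.2 s frame
```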
4.2 Datasets

For the experiments, three speech datasets are used. TIMIT [34] and LibriSpeech [35] are popular speech datasets for the English language. Moreover, we use the Bengali Automated Speech Recognition dataset [36] to show the diversity of our approach for an additional language. Among the three datasets, TIMIT and LibriSpeech contain studio-grade audio speech. The Bengali ASR dataset is crowdsourced and hence contains diverse sound and noise variations. Table 2 illustrates basic statistics of each dataset. Throughout the experiments, we abbreviate the LibriSpeech and Bengali ASR datasets as LIBRI and ASR, respectively. As the training procedure augments noise with the speech frames, we use a scalable noisy speech dataset [37]. The dataset contains diverse environmental noises, which helps the architectures explore and relate speech features.

Table 2: The table illustrates the mean, median, and standard deviation of the segment duration and the words per sentence for each dataset. The segment duration is calculated using the setup described in Section 4.1.

Dataset            Segment duration (s)        Words per sentence
                   Mean   Median   STD         Mean   Median   STD
TIMIT [34]         1      0.8      0.6         8.63   8.0      2.6
LibriSpeech [35]   1.2    0.8      1           18.9   15.0     12.9
Bengali ASR [36]   1.3    1.2      0.8         3.20   3.0      3.0
4.3 Evaluation

We implement two methods, the AutoEmbedder (pairwise architecture) and a triplet architecture, to analyze the speech embeddings based on the proposed strategy. Apart from these two strategies, the currently popular speech vector methods do not fit the training properties considered in this paper, as they mostly follow a supervised learning or domain adaptation strategy. Hence, we disregard them in our experiments.

We use DenseNet121 [38] as the baseline architecture for both DL frameworks. Further, both models are connected to a final dense layer, so both the pairwise and triplet networks produce embedding vectors of the same fixed dimension. We added l2-normalization on the output layer of the triplet network, as it is suggested to increase the accuracy of the framework [39]. For the AutoEmbedder architecture, we implemented the default L2 loss, whereas the triplet architecture is trained using the semi-hard triplet loss.
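A hedged Keras sketch of the pairwise setup follows. The DenseNet121 backbone, L2-style loss, and the learning rate come from the text; the input shape, the embedding width of 128, and α = 100 are our assumptions, since the exact values are not recoverable here:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

ALPHA = 100.0  # cluster margin alpha; the exact value is an assumption

def build_siamese(embedder: keras.Model) -> keras.Model:
    """Siamese wrapper of Eq. (3): thresholded-ReLU Euclidean distance
    between the shared embedder's outputs (threshold as in Eq. (4))."""
    x1 = keras.Input(shape=embedder.input_shape[1:])
    x2 = keras.Input(shape=embedder.input_shape[1:])
    dist = tf.norm(embedder(x1) - embedder(x2), axis=-1, keepdims=True)
    out = tf.minimum(dist, ALPHA)  # thresholded ReLU of Eq. (4)
    model = keras.Model([x1, x2], out)
    model.compile(optimizer=keras.optimizers.Adam(5e-4), loss="mse")  # L2 loss
    return model

# DenseNet121 backbone with a dense embedding head, trained from scratch.
backbone = keras.applications.DenseNet121(
    include_top=False, weights=None, input_shape=(100, 95, 1), pooling="avg")
embedder = keras.Sequential([backbone, layers.Dense(128)])  # 128: assumed width
siamese = build_siamese(embedder)
# siamese.fit([X, X_prime], Y) with targets 0 (can-link) or ALPHA (cannot-link).
```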
Figure 3: The graphs in the first row visualize the metrics on the training and ground datasets containing 25 speakers with impurity = 0 for the triplet architecture. The lower row shows the same for the pairwise architecture. Each column represents benchmarks carried out on a single dataset.

The evaluation process guarantees that both architectures are trained using the same dataset/data subset. As the training process is unsupervised, the architectures receive the same data for the training and testing processes. However, for the training process, the labels are unknown and generated based on the paper's assumptions; we refer to such a dataset as a training dataset. By ground dataset, we refer to the same dataset while further considering the ground-truth values. Both frameworks are trained with a fixed batch size, using the Adam [40] optimizer with a learning rate of 0.0005.

The training phase's data processing includes heavy computational complexity, including online noise augmentation and spectrogram conversion. For the augmentation process, each dataset is randomly augmented with a threshold drawn from a bounded range. Further, computing the ACC, NMI, and ARI metrics requires quadratic time complexity. Hence, we limit the number of speakers to 150. Instead of training on the overall dataset, we train on a subset of the data, where each speaker contributes 10 seconds of speech. For testing on the ground-truth data, a random selection of 2 seconds of speech is made for each speaker. To probe the architectures properly, we scramble the training dataset's pseudo labels, which produces a can-link impurity in the dataset labels. Hence, we use the impurity ratio as a training-data condition to illustrate the rate of impure cluster assignments in the training data's pseudo labels.

Figure 3 illustrates a benchmark of the triplet and pairwise architectures while training on the three different datasets, with speakers = 25 and impurity = 0. The triplet architecture smoothly learns from the training data and greatly overfits on the augmented training data. The ground dataset benchmark is also as expected, since the triplet architecture's correctness first increases and then gradually decreases due to overfitting. Hence, from the visualization, it can be acknowledged that the triplet network only memorizes the speech features with respect to the pseudo labels assigned to them.
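A minimal sketch of how such can-link impurity can be injected (the helper is ours; reassignments are drawn uniformly over the pseudo clusters):

```python
import numpy as np

def scramble_labels(labels: np.ndarray, impurity: float, seed: int = 0) -> np.ndarray:
    """Corrupt a fraction of the pseudo labels to simulate impure
    segmentation: each chosen frame is reassigned to a random pseudo
    cluster (occasionally its own, which only lowers the effective rate)."""
    rng = np.random.default_rng(seed)
    out = labels.copy()
    n_bad = int(impurity * len(labels))
    idx = rng.choice(len(labels), size=n_bad, replace=False)
    out[idx] = rng.integers(0, labels.max() + 1, size=n_bad)
    return out

# 100 segments x 5 frames, with 5% of the frame labels scrambled.
labels = np.repeat(np.arange(100), 5)
noisy_labels = scramble_labels(labels, impurity=0.05)
```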
Figure 4: The graph illustrates the metrics on the training and ground datasets containing 50 speakers with impurity = 0.
Figure 5: The graph illustrates the metrics on the training and ground datasets containing 50 speakers with impurity = 0.05.

On the contrary, the pairwise architecture generates a satisfactory performance, with some irregularities. In general, deep learning architectures produce higher accuracy on the training dataset than on the validation dataset. However, in this case, the performance on the ground dataset is mostly higher than on the training dataset. Yet, the performance on the ground datasets generally decreases after 400 epochs. As the number of speakers is small, the architecture easily overfits on the training dataset. Increasing the number of speakers to 50 reduces the overfitting on the training data, as illustrated in Figure 4. The triplet architecture still overfits on the training data's pseudo labels, whereas the pairwise architecture gives a balanced performance on the ground dataset.
Figure 6: The graph illustrates the metrics on the training and ground datasets containing 50 speakers with impurity = 0.1.

Increasing the impurity of the inter-connections of the training data reduces the performance of the architectures. Figures 5 and 6 illustrate benchmarks conducted with impurity = 0.05 and impurity = 0.1, respectively, while considering speakers = 50. The triplet architecture still overfits on the training data. In contrast, the pairwise architecture slowly memorizes the training dataset; yet, it holds a marginal exactness on the ground data before overfitting on the training data.

Table 3: The table benchmarks the pairwise architecture on the TIMIT dataset with four groups of speakers: 25, 50, 100, and 150. For each group of speakers, the table also considers three segmentation impurities, 0, 0.05, and 0.1, to illustrate the shortcomings of incorrect segmentation for a fully unsupervised speaker recognition strategy.
[Table 3 body: Train and Ground ACC, NMI, and ARI scores for speaker groups of 25, 50, 100, and 150 at impurities 0, 0.05, and 0.1; the numeric entries are not recoverable from the source.]
Table 4: The table benchmarks the pairwise architecture on the LIBRI dataset with four groups of speakers: 25, 50, 100, and 146. For each group of speakers, the table also considers three segmentation impurities, 0, 0.05, and 0.1, to illustrate the shortcomings of incorrect segmentation for a fully unsupervised speaker recognition strategy.
[Table 4 body: Train and Ground ACC, NMI, and ARI scores for speaker groups of 25, 50, 100, and 146 at impurities 0, 0.05, and 0.1; the numeric entries are not recoverable from the source.]

The pairwise architecture performs best with impurity = 0 on every dataset. However, increasing the number of speakers reduces the performance of the architecture. Likewise, increasing the impurity of the speech segments further reduces the performance of the architecture. A small fluctuation is observed for the LIBRI and ASR datasets when the number of speakers is kept at 25 and 50: increasing the number of speakers from 25 to 50 causes an increase in accuracy, which is inconsistent.

Table 5: The table benchmarks the pairwise architecture on the ASR dataset with four groups of speakers: 25, 50, 100, and 150. For each group of speakers, the table also considers three segmentation impurities, 0, 0.05, and 0.1, to illustrate the shortcomings of incorrect segmentation for a fully unsupervised speaker recognition strategy.
[Table 5 body: Train and Ground ACC, NMI, and ARI scores for speaker groups of 25, 50, 100, and 150 at impurities 0, 0.05, and 0.1; the numeric entries are not recoverable from the source.]
The architecture requires a sufficient number of speech variations from the users to learn the proper feature relationships between speech frames. Limiting the number of speakers to 25 reduces the speech variation available to the model. Hence, the architecture struggles to find better speech relations, and lessened performance is observed. Increasing the number of speakers to 50 balances the speech variations in the training data and causes an increase in accuracy.
5 Future Work

The pairwise architecture, together with the proposed training strategy, performs well in the speaker recognition process. However, throughout the investigation, the architecture showed some issues that have to be considered. Firstly, training the architecture with little speech variation causes overfitting, as observed while keeping speakers = 25. Secondly, as the augmentation procedure fuses noises, speech data with excessive noise may not generate good results. Further, as the system fully depends on segmentation, a remaining target lies in developing an optimal audio segmentation procedure. Resolving these challenges would open the architecture to a wide range of speaker recognition and evaluation uses.
6 Conclusion

This paper introduces clusterable speech embeddings based on speakers, named u-vectors. The architecture's policy deals with pseudo labels and trains from unlabeled datasets. The procedure is suitable for both semi-supervised and unsupervised training strategies. We evaluate such strategies with two appropriate deep learning architectures: pairwise and triplet. From the perspective of unlabeled data, the architecture performs at an acceptable rate with respect to the number of speakers and the speech segmentation errors. However, the method requires clean speech and robust segmentation techniques to properly construct clusterable u-vectors, depending on the speaker variations. We strongly believe that such an in-depth strategy of generating pseudo labels to train speaker recognition models will help researchers develop new schemes.
References

[1] Vibha Tiwari. MFCC and its applications in speaker recognition. International Journal on Emerging Technologies, 1(1):19–22, 2010.
[2] Anurag Chowdhury and Arun Ross. Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Transactions on Information Forensics and Security, 15:1616–1629, 2019.
[3] Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 1021–1028. IEEE, 2018.
[4] J. A. Chagas Nunes, D. Macêdo, and C. Zanchettin. AM-MobileNet1D: A portable model for speaker recognition. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2020.
[5] Daniel Garcia-Romero and Carol Y. Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
[6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4052–4056. IEEE, 2014.
[7] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333. IEEE, 2018.
[8] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li. Unsupervised domain adaptation via domain adversarial training for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4889–4893. IEEE, 2018.
[9] Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero. Unsupervised domain adaptation for i-vector speaker recognition. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, volume 8, 2014.
[10] Daniel Garcia-Romero, Xiaohui Zhang, Alan McCree, and Daniel Povey. Improving speaker recognition performance in the domain adaptation challenge using deep neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 378–383. IEEE, 2014.
[11] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[12] Abu Quwsar Ohi, M. F. Mridha, Farisa Benta Safir, Md. Abdul Hamid, and Muhammad Mostafa Monowar. AutoEmbedder: A semi-supervised DNN embedding system for clustering. Knowledge-Based Systems, 204:106190, 2020.
[13] Chunlei Zhang, Kazuhito Koishida, and John H. L. Hansen. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(9):1633–1644, 2018.
[14] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1763–1771, 2017.
[15] Yazhou Ren, Kangrong Hu, Xinyi Dai, Lili Pan, Steven C. H. Hoi, and Zenglin Xu. Semi-supervised deep embedded clustering. Neurocomputing, 325:121–130, 2019.
[16] Herman Kamper, Karen Livescu, and Sharon Goldwater. An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 719–726. IEEE, 2017.
[17] William M. Campbell, Douglas E. Sturim, and Douglas A. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.
[18] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1435–1447, 2007.
[19] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2010.
[20] K. I. Molla and Keikichi Hirose. On the effectiveness of MFCCs and their statistical distribution properties in speaker identification. In 2004 IEEE Symposium on Virtual Environments, Human-Computer Interfaces and Measurement Systems (VECIMS), pages 136–141. IEEE, 2004.
[21] Sergey Ioffe. Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pages 531–542. Springer, 2006.
[22] Yu-hsin Chen, Ignacio Lopez-Moreno, Tara N. Sainath, Mirkó Visontai, Raziel Alvarez, and Carolina Parada. Locally-connected and convolutional neural networks for small footprint speaker recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[23] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Interspeech, pages 999–1003, 2017.
[24] Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. Advances in Domain Adaptation Theory. Elsevier, 2019.
[25] Evaggelos Spyrou, Eirini Mathe, Georgios Pikramenos, Konstantinos Kechagias, and Phivos Mylonas. Data augmentation vs. domain adaptation—a case study in human activity recognition. Technologies, 8(4):55, 2020.
[26] Zheng-Hua Tan, Najim Dehak, et al. rVAD: An unsupervised segment-based robust voice activity detection method. Computer Speech & Language, 59:1–21, 2020.
[27] Michael R. Brent. Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences, 3(8):294–301, 1999.
[28] Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4066–4075, 2019.
[29] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[30] François Chollet et al. Keras, 2015.
[31] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[32] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[33] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, pages 18–25, 2015.
[34] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. STIN, 93:27403, 1993.
[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
[36] Oddur Kjartansson, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, and Linne Ha. Crowd-sourced speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), pages 52–55, Gurugram, India, August 2018.
[37] Chandan K. A. Reddy, Ebrahim Beyrami, Jamie Pool, Ross Cutler, Sriram Srinivasan, and Johannes Gehrke. A scalable noisy speech dataset and online subjective test framework. In Proc. Interspeech 2019, pages 1816–1820, 2019.
[38] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[39] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.