Siamese Capsule Network for End-to-End Speaker Recognition In The Wild
Amirhossein Hajavi, Ali Etemad
Department of ECE & Ingenuity Labs Research Institute, Queen's University, Kingston
{a.hajavi, ali.etemad}@queensu.ca

ABSTRACT
We propose an end-to-end deep model for speaker verification in the wild. Our model uses thin-ResNet for extracting speaker embeddings from utterances, and a Siamese capsule network with dynamic routing as the Back-end to calculate a similarity score between the embeddings. We conduct a series of experiments comparing our model to state-of-the-art solutions, showing that our model outperforms all the other models while using substantially less training data. We also perform additional experiments to study the impact of different speaker embeddings on the Siamese capsule network. We show that the best performance is achieved by using embeddings obtained directly from the feature aggregation module of the Front-end and passing them to the higher capsules using dynamic routing.
Index Terms — Deep Speaker Recognition, End-to-End Speaker Recognition, Siamese Networks, Capsules
1. INTRODUCTION
Speaker verification models are comprised of two main parts: the Front-end component, which encodes an utterance into fixed-size embedding vectors [1], and the Back-end component, which measures the similarity of two given vectors in the form of a similarity score [2]. Most commonly in previous studies, the two components of the speaker verification system are two separate processes: the Front-end process of embedding extraction through a Deep Neural Network (DNN) pipeline, and the Back-end conducted via a non-trainable secondary module, typically cosine similarity.

Deep learning has emerged as a successful tool for speaker recognition in the past years. In this context, using DNNs as substitutes for different components of conventional techniques such as i-Vector [1, 2] has been the subject of many studies [3, 4, 5, 6, 7]. Replacing the Front-end component of the i-Vector/PLDA pipeline with DNN models in order to extract speech representations from utterances is one of the common techniques used in recent literature for both speaker identification and verification [5, 6, 7]. Architectures such as Time-Delay Neural Networks (TDNN) [3], Recurrent Neural Networks (RNN) [4], and Convolutional Neural Networks (CNN) [5, 6, 7] are some examples of DNN models used in recent studies.

The most common technique used for the Back-end component in DNN-based speaker recognition is the cosine distance. Despite the advancement of DNN models in extracting learnt speech representations, a very limited number of studies have used sophisticated DNN models to replace cosine distance in the Back-end component. The neural networks used as the Back-end component have typically consisted of simple Multi-Layer Perceptrons (MLP) [8, 9, 10], whereas more sophisticated methods, such as Siamese networks, have been shown to be useful in other application areas [11]. Siamese networks have been utilized to measure the level of similarity between two feature vectors [12]. The network takes the two vectors as inputs and produces an output score indicating the degree of similarity between them.
Different forms of Siamese networks have proven successful in many speech-related tasks such as language detection [13], domain adaptation [14], and speech representation learning [15]. While Siamese networks that incorporate MLPs often successfully learn simple input-output transformations, they fail to consider the part-whole relations within the feature vectors. On the other hand, capsule networks, along with routing mechanisms, have been designed to detect part-whole relations and have recently emerged [16] as an upcoming approach in deep learning, with impressive results obtained in image recognition [17], speech emotion recognition [18], keyword detection [19], and brain-computer interfaces [20]. The integration of pose matrices inside capsules enables capsule networks to support different instantiation parameters such as deformation and orientation. Capsules also use routing mechanisms such as dynamic routing [16] or EM routing [21] to capture the part-whole or whole-part relationships [17] in the input features.

In this paper we propose a speaker recognition model with a Back-end Siamese capsule network for text-independent speaker verification. Our model can be integrated with any Front-end speaker representation learning model, resulting in a thoroughly end-to-end pipeline. Our contributions in this paper can be summarized as follows. (1) We propose a novel Siamese network using capsules for speaker recognition. We integrate our proposed model with a state-of-the-art Front-end DNN to perform speaker verification. (2) We train the end-to-end pipeline with VoxCeleb1 to perform speaker recognition in the wild. Our results outperform those of other solutions and set a new state-of-the-art.
(3) We show that despite the fact that the dataset used for training our model (VoxCeleb1) is much smaller than VoxCeleb2, which is used by some other studies, our model still achieves superior results.

The rest of the paper is organized as follows. In Section 2, related work on Siamese networks, end-to-end speaker recognition, and capsules is discussed. Section 3 explains the architecture and details of our proposed model. In Section 4, the conducted experiments and results are presented. Finally, Section 5 concludes the paper and presents suggestions for future work.
2. RELATED WORK

2.1. Siamese Networks for Speaker Recognition
Deep learning approaches for speaker recognition have gained a lot of attention given the advancements in computational capacity and the availability of large in-the-wild datasets [22, 23]. A large number of studies using DNN models for speaker embedding extraction have been performed in the past few years. The most prominent studies have used various CNN architectures such as ResNet [24, 6, 7] to obtain effective speaker embeddings from spectral representations of the utterances. Other successful models such as X-Vectors [3] have used TDNNs to obtain reliable speaker embeddings from MFCC features.

The majority of DNN models used in speaker recognition take a single utterance as input and provide a fixed-size vector as the speaker embedding for the utterance. A separate process is then used to calculate the similarity of the two embeddings obtained from an enrolment utterance and a test utterance for speaker verification. The most commonly used technique among recent studies for calculating the similarity score is the cosine similarity (as shown in Equation 1). In the few cases where a similarity score is calculated using a DNN model [8, 9, 10, 25], the performance has not been comparable to the state-of-the-art.
Score(V_1, V_2) = (V_1^T V_2) / (||V_1|| ||V_2||)    (1)

With the emergence of capsule networks [16], there have been considerable advancements in learning deep representations of data [17, 18, 19, 20]. Taking advantage of routing mechanisms such as dynamic routing has enabled capsules to capture part-whole relations. However, to the best of our knowledge, capsule networks have not yet been explored for speaker recognition purposes.

On the topic of using capsules within a Siamese network architecture, a number of papers have explored this strategy in areas outside of speaker recognition. For example, a Siamese capsule network was proposed in [11], which used primary capsules to capture the facial parts from CNN embeddings of the input pair of face images. A higher-level capsule used information routed from the primary capsules via dynamic routing to construct part-whole relationships. The representations acquired from the capsules were then transformed to a secondary latent space, and the final similarity score was calculated using a non-linear combination of these representations.

The work done in [26] utilized Siamese capsule networks as a tool for calculating the similarity of two short sequences of text via embeddings obtained from Bidirectional Gated Recurrent Units (BGRU). Employing a similar approach to the Siamese capsule network proposed for face recognition, the embeddings were first passed through primary capsules to identify parts of the text (a representation of words and phrases). The higher capsules then used the information from the primary capsules via a routing mechanism to formulate a representation for the whole text.

As the mentioned studies suggest, the use of capsule networks (either as a stand-alone model or as part of a Siamese network) has shown promising results. However, there have been no studies on the use of Siamese capsule networks for speaker recognition.
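As a concrete point of reference for the conventional Back-end, the cosine-similarity score of Equation 1 can be sketched in a few lines of NumPy; the small epsilon guard against zero-norm vectors is our addition, not part of Equation 1:

```python
import numpy as np

def cosine_score(v1: np.ndarray, v2: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between two speaker embeddings (Equation 1)."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps))
```

A pair of embeddings from the same speaker is expected to score close to 1, while embeddings from different speakers should score lower; the verification decision is then made by thresholding this score.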
In this work we aim to introduce a novel network based on this type of architecture to perform speaker verification using audio signals.
3. PROPOSED NETWORK
We propose an architecture based on a Siamese network with a Back-end that utilizes capsules for directly measuring a similarity score for speaker verification. Through the following sub-sections, we describe the different components of our model. An overview of the model is presented in Figure 1.
3.1. Front-end

We utilize a model with a ResNet-based architecture as the Front-end component of our end-to-end DNN model. In order to better show the impact of the Siamese capsule network, the Front-end component, as depicted in Figure 1, shares the majority of its architectural details with the model proposed in [24]. The model is a modified version of ResNet34, namely thin-ResNet34, which utilizes 34 convolution layers in the main body of the model. It also uses an effective feature aggregation method, namely GhostVlad [24]. The original model uses a final fully connected (FC) layer to transform the features into a latent space, and the final embedding is scaled down to a 512-element vector which is later used for verification via cosine distance. In our paper, we opt to remove the FC layer in order to preserve the maximum information possible from the embedding vectors provided by the GhostVlad pooling mechanism.

Fig. 1. The architecture of our proposed end-to-end speaker recognition model with Siamese capsule networks.

The thin-ResNet model operates on the enrollment utterance and the test utterance using an identical set of weights. This results in two vectors of size 4096, one for each utterance. These vectors are then paired together in a matrix of dimension (4096 × 2), such that each index of the embedding vector of the enrollment utterance is paired with the same index in the embedding vector of the test utterance. This helps to better compare the embeddings later on in the Back-end component of our network.

3.2. Back-end

The Back-end component of the DNN model (see Figure 1) used in this paper consists of only higher capsules, as we opt not to use the primary capsules. The operations inside primary capsule layers include non-linear convolutions and the squashing operation.
We aim to compare each index of the enrollment utterance embedding with the same index of the test utterance embedding, and the non-linear effect of the convolution layer and the squashing operation inside the primary capsule prevents this direct correspondence. Therefore, by skipping the operations of the primary capsule, the embeddings are directly passed to the higher capsules. Each tuple (v_i, v'_i), where v_i is the i-th index in the enrollment embedding and v'_i is the same index in the test embedding, is considered a single part while extracting the part-whole relations using dynamic routing.

We utilize 4 higher capsules in the Back-end model. The number of capsules was found empirically and may differ with respect to the diversity and complexity of the set of speakers. We also use dynamic routing [16] with 3 iterations as the routing mechanism between the embedding tuples and the capsules. In this mechanism, embedding tuples are first normalised using L2 normalization (see Equation 2). The contribution of each tuple V_i to the higher capsule C_j is then determined by a coefficient c_ij obtained through a routing softmax (see Equation 3). The value p_ij is calculated through multiple iterations using gradients obtained from the final loss function. Lastly, the representations from the higher capsules are aggregated and passed to a Sigmoid function. The model is trained through binary classification, and the final score calculated by the Sigmoid function is presented as the similarity score between the enrollment utterance and the test utterance.

V̂_i = V_i / (||V_i|| + ε)    (2)

c_ij = exp(p_ij) / Σ_k exp(p_ik)    (3)

Our Siamese capsule network utilizes four higher capsules with a capsule dimension of 128. Each capsule receives information from 4096 tuples through dynamic routing. This results in a large number of trainable parameters, which increases the probability of over-fitting.
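To make the Back-end concrete, the following NumPy sketch pairs the two embeddings into index-wise tuples and routes them to the higher capsules. The squashing non-linearity and the agreement-based update of the logits p_ij follow the standard dynamic routing of Sabour et al. [16]; the transformation matrices W and the dimensions here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squashing non-linearity [16]: preserves direction, maps length into [0, 1).
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + eps)

def dynamic_routing(enroll, test, W, n_iters=3, eps=1e-8):
    """Route index-wise (enrollment, test) tuples to higher capsules.

    enroll, test: (n_tuples,) embedding vectors (4096-dim in the paper)
    W:            (n_tuples, n_caps, caps_dim, 2) transformation matrices (assumed)
    """
    tuples = np.stack([enroll, test], axis=1)                  # (n_tuples, 2) pairing
    tuples = tuples / (np.linalg.norm(tuples, axis=1, keepdims=True) + eps)  # Eq. 2
    u_hat = np.einsum('ijdk,ik->ijd', W, tuples)               # per-capsule predictions
    p = np.zeros(W.shape[:2])                                  # routing logits p_ij
    for _ in range(n_iters):
        c = np.exp(p) / np.exp(p).sum(axis=1, keepdims=True)   # Eq. 3 (softmax)
        v = squash(np.einsum('ij,ijd->jd', c, u_hat))          # capsule outputs
        p = p + np.einsum('ijd,jd->ij', u_hat, v)              # agreement update
    return v                                                   # (n_caps, caps_dim)
```

The capsule outputs would then be aggregated and passed through a Sigmoid to produce the binary same/different-speaker score, as described above.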
To address this over-fitting issue, we opt to train the model using random selection of utterances from various speakers.

Training of the model is done using an approach similar to that of the triplet loss. For each step of training, three utterances are selected: two utterances from the same speaker and one spoken by a different speaker. The same-speaker utterances are selected from all the utterances of a speaker using a uniform random distribution without replacement. Then a third utterance is selected from the utterances of a random speaker, reducing the chance of the same triplet being selected again later in training.

We use the Adam optimizer for training our Siamese capsule network. The Front-end model remains frozen during training in order to isolate the source of any changes in performance to the impact of the capsule network only. The learning rate is initially set to 0.01 but is varied with a cyclical learning rate pattern [30] to ensure optimal convergence. For hardware, we use a single Titan RTX GPU for training. For the batch size, 64 utterance triplets are selected at each step of training.
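The triplet-selection scheme described above can be sketched as follows; the dictionary layout and function name are ours, for illustration only:

```python
import random

def sample_triplet(utts_by_speaker, rng=random):
    """Sample a training triplet: two utterances from one speaker (selected
    uniformly without replacement) plus one from a different, random speaker."""
    speakers = list(utts_by_speaker)
    spk = rng.choice(speakers)
    anchor, positive = rng.sample(utts_by_speaker[spk], 2)
    other = rng.choice([s for s in speakers if s != spk])
    negative = rng.choice(utts_by_speaker[other])
    return anchor, positive, negative
```

Because both the speaker and the utterances are redrawn at every step, the effective pool of distinct triplets is far larger than the raw utterance count, which is the property the paper relies on to train on the smaller VoxCeleb1.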
4. EXPERIMENT SETUP AND RESULTS

4.1. Dataset
In this paper, the VoxCeleb1 dataset is used for both training and evaluation. The training set consists of approximately 148k utterances spoken by 1,211 speakers. With the random selection of utterances for triplets, the number of possible triplet combinations for training grows combinatorially, given that 122 is the average utterance count per speaker in VoxCeleb1. Hence, selecting the VoxCeleb1 dataset, with fewer speakers and utterances compared to VoxCeleb2, helps with managing the number of combinations.

Table 1. The results for evaluation of our Siamese capsule network in comparison to several benchmark models.

  Model                                                Loss                  Train set  EER%
  Nagrani et al. [22]  I-Vector + PLDA                 –                     VoxCeleb1  8.80
  Cai et al. [27]      ResNet34 + SAP                  A-softmax + PLDA      VoxCeleb1  4.40
  Cai et al. [27]      ResNet34 + LDE                  A-softmax + PLDA      VoxCeleb1  4.48
  Chung et al. [28]    ResNet50 + TAP                  Triplet Loss          VoxCeleb2  4.19
  Hajavi et al. [6]    UtterIdNet + TDV                Softmax               VoxCeleb2  4.26
  Okabe et al. [29]    TDNN (X-Vector) + TAP           Softmax               VoxCeleb1  3.85
  Xie et al. [24]      Thin-ResNet34 + GhostVlad       Softmax               VoxCeleb2  3.22
  Ours                 Thin-ResNet34 + Siamese Capsule Binary Cross-entropy  VoxCeleb1  3.14

Table 2. The results for evaluation of Siamese capsule networks with different architectures, using embeddings from different layers of the Front-end model. FC: the embedding from the last FC layer of the model; Aggreg.: the embedding from the output of the GhostVlad aggregation module.

  Layer    Dimension  No. Caps.  Primary Caps.  EER
  FC       512        2          No             3.86
  FC       512        4          No             3.65
  FC       512        6          No             3.63
  Aggreg.  4096       2          No             3.18
  Aggreg.  4096       4          No             3.14
  Aggreg.  4096       6          No             3.16
  Aggreg.  4096       2          Yes            4.06
  Aggreg.  4096       4          Yes            3.83
  Aggreg.  4096       6          Yes            3.90
Table 1 presents the results of our experiments. In this table, the performance of the Siamese capsule network is compared to the benchmark models of thin-ResNet + GhostVlad, ResNet34 + SAP, and ResNet34 + LDE. The comparison illustrates that our model achieves an EER of 3.14%, outperforming all the benchmark models. We also compare the performance of our model with respect to the amount of data needed for training. While some models [6, 24, 28] use the VoxCeleb2 dataset, which contains more than 1M utterances, for training, our model utilizes substantially less data and yet outperforms these models. This may be due to the random selection of utterance triplets, which increases the number of training samples provided by the VoxCeleb1 dataset.

We also evaluate the performance of our model using different numbers of capsules in the architecture. We additionally test the embeddings from two different layers of the Front-end model and the effect of using primary capsules immediately on the embeddings. Table 2 includes the results of these experiments. As shown in the results, the best performance is achieved using four capsules in the architecture. Also, the embeddings obtained directly from the GhostVlad aggregation module lead to considerably better performance compared to the bottleneck features collected from the fully connected layer. Using primary capsules on the embeddings leads to lower performance. This is in agreement with the argument made earlier (see Section 3.2) that the non-linear operations performed in primary capsules prevent the model from collecting more decisive information from the embeddings.

Finally, we should point out that while a number of other works have attempted to use CNN-based Siamese architectures for speaker recognition in the past [9, 25], they have generally achieved less competitive results compared to the approach of using cosine distance on the obtained embeddings.
However, the use of a Siamese capsule network in our model shows improvement over the conventional approach, which indicates the viability of such architectures for speaker verification in the wild.
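For reference, the Equal Error Rate (EER) reported throughout this section is the error rate at the operating point where the false-acceptance rate equals the false-rejection rate. A minimal threshold-sweep implementation (our own sketch, not the authors' evaluation code) is:

```python
import numpy as np

def compute_eer(scores, labels):
    """EER: error rate at the threshold where false accepts == false rejects.

    scores: similarity scores (higher means more likely same speaker)
    labels: 1 for same-speaker (target) trials, 0 for impostor trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = float("inf"), 0.5
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # false-acceptance rate
        frr = np.mean(scores[labels == 1] < t)   # false-rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

In practice, evaluation toolkits interpolate between thresholds rather than sweeping only the observed scores, but the sweep above conveys the metric.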
5. CONCLUSION AND FUTURE WORK
In this paper, a novel Siamese network using capsules and dynamic routing was proposed for speaker verification in the wild. Our end-to-end pipeline used thin-ResNet as its Front-end component for speech representation learning, while capsules were used in its Back-end to extract part-whole relations from the embeddings, later used to calculate the similarity score between two representations. Experiments on the VoxCeleb test set illustrated that our model outperforms the other benchmarks by obtaining an EER of 3.14%, setting a new state-of-the-art. Our model can be trained with substantially less training data to reach the desired performance through random selection of utterance triplets. As a possible future route, we intend to extend our experiments with the Siamese capsule network to utterances from various domains, such as different environments and noise levels.
6. ACKNOWLEDGEMENTS
The authors would like to thank IMRSV Data Labs for their support of this work and also acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research (grant no.: CRDPJ 533919-18).

7. REFERENCES

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[2] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," Odyssey, 2010.
[3] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5796–5800, 2019.
[4] J. Wang, K.-C. Wang, M. Law, F. Rudzicz, and M. Brudno, "Centroid-based deep metric learning for speaker recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3652–3656, 2019.
[5] I. Kim, K. Kim, J. Kim, and C. Choi, "Deep speaker representation using orthogonal decomposition and recombination for speaker verification," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6126–6130, 2019.
[6] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," INTERSPEECH, pp. 2878–2882, 2019.
[7] ——, "Knowing what to listen to: Early attention for deep speech representation learning," arXiv preprint arXiv:2009.01822, 2020.
[8] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4874–4878, 2018.
[9] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," IEEE Spoken Language Technology Workshop (SLT), pp. 171–178, 2016.
[10] K. Sriskandaraja, V. Sethu, and E. Ambikairajah, "Deep siamese architecture based replay detection for secure voice biometric," INTERSPEECH, pp. 671–675, 2018.
[11] J. O'Neill, "Siamese capsule networks," arXiv preprint arXiv:1805.07242, 2018.
[12] D. Chicco, "Siamese neural networks: An overview," Artificial Neural Networks, pp. 73–94, 2020.
[13] S. Shon, A. Ali, and J. Glass, "Convolutional neural network and language embeddings for end-to-end dialect recognition," Odyssey, The Speaker and Language Recognition Workshop, pp. 98–104, 2018.
[14] S. Rozenberg, H. Aronowitz, and R. Hoory, "Siamese x-vector reconstruction for domain adapted speaker recognition," arXiv preprint arXiv:2007.14146, 2020.
[15] R. Riad, C. Dancette, J. Karadayi, N. Zeghidour, T. Schatz, and E. Dupoux, "Sampling strategies in siamese networks for unsupervised speech representation learning," INTERSPEECH, pp. 2658–2662, 2018.
[16] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
[17] A. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton, "Stacked capsule autoencoders," Advances in Neural Information Processing Systems, pp. 15512–15522, 2019.
[18] X. Wu, S. Liu, Y. Cao, X. Li, J. Yu, D. Dai, X. Ma, S. Hu, Z. Wu, X. Liu et al., "Speech emotion recognition using capsule networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6695–6699, 2019.
[19] Y. Xiong, V. Berisha, and C. Chakrabarti, "Residual + capsule networks (ResCap) for simultaneous single-channel overlapped keyword recognition," INTERSPEECH, pp. 3337–3341, 2019.
[20] G. Zhang and A. Etemad, "Capsule attention for multimodal EEG and EOG spatiotemporal representation learning with application to driver vigilance estimation," arXiv preprint arXiv:1912.07812, 2019.
[21] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in International Conference on Learning Representations, 2018.
[22] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "Voxceleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020.
[23] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," INTERSPEECH, pp. 818–822, 2016.
[24] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5791–5795, 2019.
[25] Y. Zhang, M. Yu, N. Li, C. Yu, J. Cui, and D. Yu, "Seq2seq attentional siamese neural networks for text-dependent speaker verification," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6131–6135, 2019.
[26] Y. Wu, J. Li, J. Wu, and J. Chang, "Siamese capsule networks with global and local features for text classification," Neurocomputing, 2020.
[27] W. Cai, J. Chen, and M. Li, "Analysis of length normalization in end-to-end speaker verification system," INTERSPEECH, pp. 3618–3622, 2018.
[28] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," INTERSPEECH, pp. 1086–1090, 2018.
[29] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," INTERSPEECH, pp. 2252–2256, 2018.
[30] L. N. Smith, "Cyclical learning rates for training neural networks," IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.