Sparse Architectures for Text-Independent Speaker Verification Using Deep Neural Networks
Sara Sedighi
Department of Electrical and Computer Engineering, Boise State University, Idaho, USA

Shayan Ramhormozi
Department of Network and Communication, FANAP Telecom
Abstract—Network pruning is of great importance because it eliminates unimportant weights and features that are activated only due to network over-parametrization. The advantages of enforcing sparsity include preventing overfitting and speeding up computation. Given the large number of parameters in deep architectures and the huge amount of computational power they require, network compression becomes critically important. In this work, we impose structured sparsity for speaker verification, i.e., validating a query speaker against a gallery of enrolled speakers. We show that sparsity enforcement alone can improve verification results, since it counteracts the initial overfitting in the network.
I. INTRODUCTION
Recent advancements in deep learning have introduced new approaches to training deep networks that have led to near-human performance in image and object recognition, speech recognition [1]–[4], and data mining [5], [6]. Approaches based on information theory have also been proposed to provide a framework for better interpreting deep architectures [7]. Some of these approaches, such as dropout [8], address the overfitting issue [9]. When training deep neural networks, over-parametrization makes the architecture unnecessarily complicated, and a huge amount of computational power is required for training and model evaluation [10].

Different approaches have been proposed for compressing models, including model compression [11], [12], pruning [13], [14], and sparsity-inducing ℓ-norm regularization [15]. Previous works such as [16] have shown that only a small fraction of the weights needs to be trained, with the remainder predicted by kernel-based estimators. A large portion of previously proposed methods rely on multiple tuning steps, which makes the models hard to scale. One underlying issue is the model complexity and computational burden associated with the large number of network parameters. Feature selection is one approach for reducing the number of unimportant neurons: selecting the important features by removing unimportant elements can induce weight pruning. Many feature-selection methods, such as PCA and autoencoders, have been proposed in this context.

For effective network compression, methods such as the group lasso [17], structure scale constraining [18], and Structured Sparsity Learning (SSL) [19] have been proposed. Most prior works, however, do not address how accuracy relates to compression. In this work, we propose imposing structured sparsity for speaker verification. We show that simply sparsifying the network can improve speaker verification results.

II. IMPOSING SPARSITY
A. Group sparse regularization
We focus on enforcing group sparsity to prune convolutional and fully-connected layers. The group lasso has been widely used for feature selection by enforcing sparsity on groups of weights [17], [20]. The objective of group sparsity is to select the effective channels of a convolutional layer or the effective neurons of a fully-connected layer.

Assume a convolutional layer is represented by a weight tensor $W \in \mathbb{R}^{C \times Width \times Height \times F}$, where $C$ is the number of input channels, $Width \times Height$ is the kernel size, and $F$ is the number of output channels. The training objective is

$$\mathcal{L}(W) = \mathcal{L}_{data}(W) + \lambda_r\,\ell(W) + \lambda_{gs}\sum_{m=1}^{N}\frac{1}{\sqrt{|G(W^{m})|}}\,\mathcal{L}_{gs}\big(G(W^{m})\big), \qquad (1)$$

where $\mathcal{L}_{gs}$ is the group sparsity loss, $|G(W^{m})|$ is the number of groups (channels) in the $m$th layer, and $\lambda_r$ and $\lambda_{gs}$ are the regularization hyper-parameters. Assuming $L$ different groups of weights, the group sparsity loss is defined as

$$\mathcal{L}_{gs} = \sum_{j=1}^{L}\sqrt{\sum_{i=1}^{|w^{(j)}|}\big(w^{(j)}_{i}\big)^{2}}, \qquad (2)$$

where $w^{(j)}$ denotes the $j$th group of weights and $|w^{(j)}|$ is the number of weights in that group.

Fig. 1. Group sparsity on the fully-connected layers.
The main objective of group sparse regularization is the removal of redundant features that are activated because of network over-parametrization. For a fully-connected layer, a group consists of all the weights connected to a single neuron, as shown in Fig. 1. A minimal sketch of this penalty is given below.
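The following is a minimal PyTorch sketch of the group-lasso penalty of Eqs. (1)-(2), not the authors' implementation. It assumes one group per output filter of a convolutional layer and one group per output neuron of a fully-connected layer; the function name and the value of lambda_gs are illustrative.

```python
import torch
import torch.nn as nn

def group_sparsity_penalty(model: nn.Module, lambda_gs: float = 1e-4):
    """Group-lasso penalty: per layer, sum the L2 norms of its weight groups,
    scaled by 1/sqrt(number of groups in that layer) as in Eq. (1)."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # One group per output filter: weight shape (F, C, kH, kW) -> (F, C*kH*kW).
            groups = module.weight.view(module.weight.size(0), -1)
        elif isinstance(module, nn.Linear):
            # One group per output neuron: weight shape (out_features, in_features).
            groups = module.weight
        else:
            continue
        num_groups = groups.size(0)
        # Eq. (2): L2 norm of each group, summed over groups.
        penalty = penalty + groups.norm(p=2, dim=1).sum() / (num_groups ** 0.5)
    return lambda_gs * penalty

# Usage inside a training step (criterion, net, x, y are placeholders):
# loss = criterion(net(x), y) + group_sparsity_penalty(net)
```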
III. EXPERIMENTS
Speaker verification consists of comparing a query speaker against a gallery of speaker models and validating the claimed identity. It is commonly divided into text-dependent and text-independent settings. In the text-dependent setting, the spoken utterances are the same across enrollment and test. In the text-independent setting, however, no assumption is made about the utterances. The challenge in the latter scenario comes from the fact that the features must remain discriminative for the speaker while being robust to variation in the speech content.
Input:
For each sample sound file, a window of 25 ms with 15 ms overlap is used, and the result is a spectrogram computed over each 1-second segment of the audio. Along the third dimension, first- and second-order derivative features are appended. The SpeechPy library has been used for speech feature extraction [21]. A sketch of this feature recipe is given below.
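The paper uses SpeechPy; the sketch below reproduces the same recipe (25 ms window, 15 ms overlap, stacked first/second derivatives) with librosa for illustration. The 16 kHz sampling rate, 40 mel filters, and the function name are assumptions, not taken from the paper.

```python
import numpy as np
import librosa

def spectrogram_cube(path: str, sr: int = 16000, duration: float = 1.0) -> np.ndarray:
    """Log-mel spectrogram for a 1-second utterance, with first- and
    second-order derivatives stacked along a third (depth) axis."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    win = int(0.025 * sr)   # 25 ms window
    hop = int(0.010 * sr)   # 15 ms overlap -> 10 ms hop
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                          hop_length=hop, n_mels=40)
    log_spec = librosa.power_to_db(spec)                # (n_mels, n_frames)
    delta1 = librosa.feature.delta(log_spec, order=1)   # first-order derivative
    delta2 = librosa.feature.delta(log_spec, order=2)   # second-order derivative
    return np.stack([log_spec, delta1, delta2], axis=-1)  # (n_mels, n_frames, 3)
```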
Dataset:
We used the VoxCeleb dataset for our experiments [22]. Of the 1211 available speakers, 40 are chosen for testing and the rest are used for training, as suggested in [22]. The raw audio is extracted from ordinary YouTube videos and includes a variety of nuisance factors, such as background noise and different recording qualities, which makes the dataset very challenging. For our experiments, we choose very short 1-second utterances, applying Voice Activity Detection (VAD) to remove the silent parts. Using short utterances has forensic applications and makes our experiments more challenging; it is also more realistic, since in most real-world applications only short utterances from different subjects are available.

For the speaker verification architecture, we used convolutional neural networks due to their success in applications such as action recognition [23], object recognition [3], speaker verification, and audio-visual matching.
Training and optimization objective:
The architecture is shown in Fig. 2 and consists of two deep neural networks with shared weights. This is a Siamese neural network [24], which has been utilized for a variety of applications [25]–[27]. The main objective of a Siamese network is to create a joint output feature space that distinguishes genuine pairs from impostor pairs. The idea is that if the two elements of an input pair come from the same subject, their outputs should be close under a simple distance metric, and far apart if they have different identities. The training loss function must account for both conditions, and the contrastive loss $\mathcal{C}(X, Y)$ is used for this purpose:

$$\mathcal{C}(X, Y) = \frac{1}{N}\sum_{k=1}^{N}\mathcal{C}\big(Y_k, (X_1, X_2)_k\big), \qquad (3)$$

where $N$ is the number of training pairs and the per-pair loss $\mathcal{C}\big(Y_l, (X_1, X_2)_l\big)$ is defined as

$$\mathcal{C}\big(Y_l, (X_1, X_2)_l\big) = Y\,\mathcal{C}_{gen}\big(D_W(X_1, X_2)_l\big) + (1 - Y)\,\mathcal{C}_{imp}\big(D_W(X_1, X_2)_l\big) + \lambda\,\|W\|, \qquad (4)$$

in which $\lambda\,\|W\|$ is the regularization term. $\mathcal{C}_{gen}$ and $\mathcal{C}_{imp}$ are the costs associated with genuine and impostor pairs, respectively, and are defined as functions of $D_W(X_1, X_2)$:

$$\begin{cases} \mathcal{C}_{gen}\big(D_W(X_1, X_2)\big) = D_W(X_1, X_2) \\ \mathcal{C}_{imp}\big(D_W(X_1, X_2)\big) = \max\big\{0,\; \eta - D_W(X_1, X_2)\big\} \end{cases} \qquad (5)$$

where $\eta$ is a predefined margin and $D_W(X_1, X_2)$ is the Euclidean distance between the output features of the two elements of the pair. A sketch of this loss appears below.
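The following is a minimal PyTorch sketch of the contrastive loss of Eqs. (3)-(5), written for illustration rather than as the authors' code. The margin value and the convention that label 1 marks a genuine pair are assumptions; the $\lambda\,\|W\|$ term is left to the optimizer's weight decay.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1: torch.Tensor, emb2: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Contrastive loss over a batch of pairs.
    label is 1 for genuine pairs (same speaker) and 0 for impostor pairs."""
    dist = F.pairwise_distance(emb1, emb2)                # Euclidean distance D_W
    genuine_cost = dist                                   # C_gen(D_W) = D_W
    impostor_cost = torch.clamp(margin - dist, min=0.0)   # C_imp(D_W) = max(0, eta - D_W)
    loss = label * genuine_cost + (1.0 - label) * impostor_cost
    return loss.mean()
```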
Architecture:
For the architecture, VGG-Net has been chosen as an effective model for image classification [28]. The architecture is modified to match the input features, and the output dimensionality of the last layer is set to 64. Average pooling is employed for spatial dimension matching [29]. An illustrative sketch of such a branch is given after Fig. 2.
Fig. 2. The general employed architecture for speaker verification.
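The sketch below shows one branch of a Siamese network with a VGG-style feature extractor, global average pooling for spatial dimension matching, and a 64-dimensional output, as described above. It is not the exact modified VGG of the paper: the layer widths and depth are illustrative assumptions, and only the shared-weight structure, the pooling, and the 64-dimensional embedding follow the text.

```python
import torch
import torch.nn as nn

class EmbeddingBranch(nn.Module):
    """Illustrative VGG-style branch: stacked 3x3 convolutions, global average
    pooling, and a 64-dimensional embedding layer."""
    def __init__(self, in_channels: int = 3, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # average pooling over the spatial dimensions
        )
        self.fc = nn.Linear(256, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x).flatten(1))

# Siamese use: the same branch (shared weights) embeds both elements of a pair.
# branch = EmbeddingBranch(); emb1, emb2 = branch(x1), branch(x2)
```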
Results and comparison:
The proposed approach is compared with baseline methods. The GMM-UBM baseline [30] uses 39 MFCC coefficients, including first- and second-order derivatives, with a Universal Background Model (UBM) of 512 mixture components. The i-vector system [31] is also used for comparison. The results are reported in Table I. As can be observed, the proposed sparsity approach outperforms the other methods.
TABLE I
COMPARING THE PROPOSED APPROACH WITH THE OTHER METHODS.

Model            EER (%)
GMM-UBM [30]     28.22
I-vectors [31]   24.91
CNN [baseline]   24.68
CNN [SSL]        24.11
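Since the comparison in Table I is reported in terms of the Equal Error Rate, the following is a standard sketch (not the authors' scoring code) of how EER can be computed from verification trials, assuming similarity scores where a higher score means a more likely genuine pair.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false-acceptance rate (FAR) equals
    the false-rejection rate (FRR). labels: 1 for genuine trials, 0 for
    impostor trials; scores: higher means more likely genuine."""
    far, tar, _ = roc_curve(labels, scores)
    frr = 1.0 - tar
    idx = np.nanargmin(np.abs(far - frr))   # point where FAR and FRR cross
    return float((far[idx] + frr[idx]) / 2.0)
```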
An important additional benefit of enforcing sparsity is the speedup. The results are shown in Fig. 3.
Fig. 3. The speedup for separate layers.
IV. CONCLUSION
In this work, we proposed applying sparsity imposition to speaker verification. Experimental results demonstrated the effectiveness of enforcing sparsity, owing to its ability to prevent overfitting. This is a direct outcome of removing unimportant elements of the network, such as neurons in the fully-connected layers and output filters in the convolutional layers.
REFERENCES

[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[3] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928, IEEE, 2015.
[4] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056, IEEE, 2014.
[5] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, "Word-level language identification and predicting codeswitching points in Swahili-English language data," in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 21–29, 2016.
[6] R. Shirvani, M. Piergallini, G. S. Gautam, and M. Chouikha, "The Howard University system submission for the shared task in language identification in Spanish-English codeswitching," in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 116–120, 2016.
[7] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[10] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
[11] J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[13] R. Reed, "Pruning algorithms: A survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
[15] M. D. Collins and P. Kohli, "Memory bounded deep convolutional networks," arXiv preprint arXiv:1412.1442, 2014.
[16] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
[17] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
[19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
[20] L. Meier, S. Van De Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
[21] A. Torfi, "SpeechPy: A library for speech processing and recognition," arXiv preprint arXiv:1803.01094, 2018.
[22] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[23] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[24] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
[25] X. Sun, A. Torfi, and N. Nasrabadi, "Deep Siamese convolutional neural networks for identical twins and look-alike identification," Deep Learning in Biometrics, p. 65, 2018.
[26] R. R. Varior, M. Haloi, and G. Wang, "Gated Siamese convolutional neural network architecture for human re-identification," in European Conference on Computer Vision, pp. 791–808, Springer, 2016.
[27] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, 2015.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[30] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.