Sparse Architectures for Text-Independent Speaker Verification Using Deep Neural Networks
Sara Sedighi
Department of Electrical and Computer Engineering, Boise State University, Idaho, USA

Shayan Ramhormozi
Department of Network and Communication, FANAP Telecom
Abstract—Network pruning is of great importance because it eliminates unimportant weights and features that are activated only due to network over-parametrization. The advantages of enforcing sparsity include preventing overfitting and speeding up computation. Given the large number of parameters in deep architectures and the huge amount of computational power they require, network compression becomes critically important. In this work, we impose structured sparsity for speaker verification, i.e., validating a query speaker against a gallery of enrolled speakers. We show that sparsity enforcement alone can improve verification results, since it counteracts the initial overfitting in the network.
I. INTRODUCTION
Recent advancements in deep learning have introduced new approaches to training deep networks that have led to near-human performance in image and object recognition, speech recognition [1]–[4], and data mining [5], [6]. Approaches based on information theory have also been proposed to provide a framework for better interpreting deep architectures [7]. Some of these approaches, such as dropout [8], address the overfitting issue [9]. When training deep neural networks, over-parametrization makes the architecture unnecessarily complicated, and a huge amount of computational power is required for training and model evaluation [10].

Different approaches have been proposed for compressing models, including model compression [11], [12], pruning [13], [14], and sparsity-inducing ℓ-norm regularization [15]. Previous works such as [16] have shown that only a small fraction of the weights needs to be trained, with the remainder predicted by kernel-based estimators. A large portion of previously proposed methods rely on multiple tuning steps, which makes the models hard to scale. One underlying issue is the model complexity and computational burden associated with the large number of network parameters. Feature selection is one approach for reducing the number of unimportant neurons: selecting the important features by removing unimportant elements can induce weight pruning. Many feature-selection methods, such as PCA and autoencoders, have been proposed in this context.

For effective network compression, methods such as the group lasso [17], structure scale constraining [18], and Structured Sparsity Learning (SSL) [19] have been proposed. Most prior works, however, do not address how accuracy relates to compression. In this work, we propose imposing structured sparsity for speaker verification. We show that simply sparsifying the network can improve speaker verification results.

II. IMPOSING SPARSITY
A. Group sparse regularization
We focus on enforcing group sparsity to prune convolutional and fully-connected layers. The group lasso has been widely used for feature selection by enforcing sparsity on groups of weights [17], [20]. The objective of group sparsity is to select the effective channels of a convolutional layer or the effective neurons of a fully-connected layer.

Assume a convolutional layer is represented by a weight tensor $W \in \mathbb{R}^{C \times Width \times Height \times F}$, where $C$ is the number of input channels, $Width \times Height$ is the kernel size, and $F$ is the number of output channels. The training objective is

$$\mathcal{L}(W) = \mathcal{L}_{data}(W) + \lambda_r\,\ell(W) + \lambda_{gs}\sum_{m=1}^{N}\frac{1}{\sqrt{|G(W^{m})|}}\,\mathcal{L}_{gs}\big(G(W^{m})\big), \qquad (1)$$

where $\mathcal{L}_{gs}$ is the group sparsity loss, $|G(W^{m})|$ is the number of groups (channels) in the $m$th layer, and $\lambda_r$ and $\lambda_{gs}$ are the regularization hyper-parameters. Assuming $L$ different groups of weights, the group sparsity loss is defined as

$$\mathcal{L}_{gs} = \sum_{j=1}^{L}\sqrt{\sum_{i=1}^{|w^{(j)}|}\big(w^{(j)}_{i}\big)^{2}}, \qquad (2)$$

where $w^{(j)}$ denotes the $j$th group of weights and $|w^{(j)}|$ is the number of weights in that group.

Fig. 1. Group sparsity on the fully-connected layers.
The main objective of group sparse regularization is the removal of redundant features that are activated because of network over-parametrization. For a fully-connected layer, a group consists of all the weights connected to a single neuron, as shown in Fig. 1. A minimal sketch of this penalty is given below.
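The following is a minimal PyTorch sketch of the group-lasso penalty of Eqs. (1)-(2), not the authors' implementation. It assumes one group per output filter of a convolutional layer and one group per output neuron of a fully-connected layer; the function name and the value of lambda_gs are illustrative.

```python
import torch
import torch.nn as nn

def group_sparsity_penalty(model: nn.Module, lambda_gs: float = 1e-4):
    """Group-lasso penalty: per layer, sum the L2 norms of its weight groups,
    scaled by 1/sqrt(number of groups in that layer) as in Eq. (1)."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # One group per output filter: weight shape (F, C, kH, kW) -> (F, C*kH*kW).
            groups = module.weight.view(module.weight.size(0), -1)
        elif isinstance(module, nn.Linear):
            # One group per output neuron: weight shape (out_features, in_features).
            groups = module.weight
        else:
            continue
        num_groups = groups.size(0)
        # Eq. (2): L2 norm of each group, summed over groups.
        penalty = penalty + groups.norm(p=2, dim=1).sum() / (num_groups ** 0.5)
    return lambda_gs * penalty

# Usage inside a training step (criterion, net, x, y are placeholders):
# loss = criterion(net(x), y) + group_sparsity_penalty(net)
```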
III. EXPERIMENTS
Speaker verification consists of comparing a query speaker against a gallery of speaker models and validating the claimed identity. It is commonly divided into text-dependent and text-independent settings. In the text-dependent setting, the spoken utterances are the same across enrollment and test. In the text-independent setting, however, no assumption is made about the utterances. The challenge in the latter scenario comes from the fact that the features must remain discriminative for the speaker while being robust to variation in the speech content.
Input:
For each sample sound file, a window of 25 ms with 15 ms overlap is used, and the result is a spectrogram computed over each 1-second segment of the audio. Along the third dimension, first- and second-order derivative features are appended. The SpeechPy library has been used for speech feature extraction [21]. A sketch of this feature recipe is given below.
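The paper uses SpeechPy; the sketch below reproduces the same recipe (25 ms window, 15 ms overlap, stacked first/second derivatives) with librosa for illustration. The 16 kHz sampling rate, 40 mel filters, and the function name are assumptions, not taken from the paper.

```python
import numpy as np
import librosa

def spectrogram_cube(path: str, sr: int = 16000, duration: float = 1.0) -> np.ndarray:
    """Log-mel spectrogram for a 1-second utterance, with first- and
    second-order derivatives stacked along a third (depth) axis."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    win = int(0.025 * sr)   # 25 ms window
    hop = int(0.010 * sr)   # 15 ms overlap -> 10 ms hop
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                          hop_length=hop, n_mels=40)
    log_spec = librosa.power_to_db(spec)                # (n_mels, n_frames)
    delta1 = librosa.feature.delta(log_spec, order=1)   # first-order derivative
    delta2 = librosa.feature.delta(log_spec, order=2)   # second-order derivative
    return np.stack([log_spec, delta1, delta2], axis=-1)  # (n_mels, n_frames, 3)
```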
Dataset:
We used the VoxCeleb dataset for our experiments [22]. Of the 1211 available speakers, 40 are chosen for testing and the rest are used for training, as suggested in [22]. The raw audio is extracted from ordinary YouTube videos and includes a variety of nuisance factors, such as background noise and different recording qualities, which makes the dataset very challenging. For our experiments, we choose very short 1-second utterances, applying Voice Activity Detection (VAD) to remove the silent parts. Using short utterances has forensic applications and makes our experiments more challenging; it is also more realistic, since in most real-world applications only short utterances from different subjects are available.

For the speaker verification architecture, we used convolutional neural networks due to their success in applications such as action recognition [23], object recognition [3], speaker verification, and audio-visual matching.
Training and optimization objective:
The architecture is shown in Fig. 2 and consists of two deep neural networks with shared weights. This is a Siamese neural network [24], which has been utilized for a variety of applications [25]–[27]. The main objective of a Siamese network is to create a joint output feature space that distinguishes genuine pairs from impostor pairs. The idea is that if the two elements of an input pair come from the same subject, their outputs should be close under a simple distance metric, and far apart if they have different identities. The training loss function must account for both conditions, and the contrastive loss $\mathcal{C}(X, Y)$ is used for this purpose:

$$\mathcal{C}(X, Y) = \frac{1}{N}\sum_{k=1}^{N}\mathcal{C}\big(Y_k, (X_1, X_2)_k\big), \qquad (3)$$

where $N$ is the number of training pairs and the per-pair loss $\mathcal{C}\big(Y_l, (X_1, X_2)_l\big)$ is defined as

$$\mathcal{C}\big(Y_l, (X_1, X_2)_l\big) = Y\,\mathcal{C}_{gen}\big(D_W(X_1, X_2)_l\big) + (1 - Y)\,\mathcal{C}_{imp}\big(D_W(X_1, X_2)_l\big) + \lambda\,\|W\|, \qquad (4)$$

in which $\lambda\,\|W\|$ is the regularization term. $\mathcal{C}_{gen}$ and $\mathcal{C}_{imp}$ are the costs associated with genuine and impostor pairs, respectively, and are defined as functions of $D_W(X_1, X_2)$:

$$\begin{cases} \mathcal{C}_{gen}\big(D_W(X_1, X_2)\big) = D_W(X_1, X_2) \\ \mathcal{C}_{imp}\big(D_W(X_1, X_2)\big) = \max\big\{0,\; \eta - D_W(X_1, X_2)\big\} \end{cases} \qquad (5)$$

where $\eta$ is a predefined margin and $D_W(X_1, X_2)$ is the Euclidean distance between the output features of the two elements of the pair. A sketch of this loss appears below.
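The following is a minimal PyTorch sketch of the contrastive loss of Eqs. (3)-(5), written for illustration rather than as the authors' code. The margin value and the convention that label 1 marks a genuine pair are assumptions; the $\lambda\,\|W\|$ term is left to the optimizer's weight decay.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1: torch.Tensor, emb2: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Contrastive loss over a batch of pairs.
    label is 1 for genuine pairs (same speaker) and 0 for impostor pairs."""
    dist = F.pairwise_distance(emb1, emb2)                # Euclidean distance D_W
    genuine_cost = dist                                   # C_gen(D_W) = D_W
    impostor_cost = torch.clamp(margin - dist, min=0.0)   # C_imp(D_W) = max(0, eta - D_W)
    loss = label * genuine_cost + (1.0 - label) * impostor_cost
    return loss.mean()
```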
Architecture:
For the architecture, VGG-Net has been chosen as an effective model for image classification [28]. The architecture is modified to match the input features, and the output dimensionality of the last layer is set to 64. Average pooling is employed for spatial dimension matching [29]. An illustrative sketch of such a branch is given after Fig. 2.
Fig. 2. The general employed architecture for speaker verification.
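The sketch below shows one branch of a Siamese network with a VGG-style feature extractor, global average pooling for spatial dimension matching, and a 64-dimensional output, as described above. It is not the exact modified VGG of the paper: the layer widths and depth are illustrative assumptions, and only the shared-weight structure, the pooling, and the 64-dimensional embedding follow the text.

```python
import torch
import torch.nn as nn

class EmbeddingBranch(nn.Module):
    """Illustrative VGG-style branch: stacked 3x3 convolutions, global average
    pooling, and a 64-dimensional embedding layer."""
    def __init__(self, in_channels: int = 3, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # average pooling over the spatial dimensions
        )
        self.fc = nn.Linear(256, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x).flatten(1))

# Siamese use: the same branch (shared weights) embeds both elements of a pair.
# branch = EmbeddingBranch(); emb1, emb2 = branch(x1), branch(x2)
```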
Results and comparison:
The proposed approach is compared with baseline methods. The GMM-UBM baseline [30] uses 39 MFCC coefficients, including first- and second-order derivatives, with a Universal Background Model (UBM) of 512 mixture components. The i-vector system [31] is also used for comparison. The results are reported in Table I. As can be observed, the proposed sparsity approach outperforms the other methods.
TABLE I
COMPARING THE PROPOSED APPROACH WITH THE OTHER METHODS.

Model            EER (%)
GMM-UBM [30]     28.22
I-vectors [31]   24.91
CNN [baseline]   24.68
CNN [SSL]        24.11
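Since the comparison in Table I is reported in terms of the Equal Error Rate, the following is a standard sketch (not the authors' scoring code) of how EER can be computed from verification trials, assuming similarity scores where a higher score means a more likely genuine pair.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false-acceptance rate (FAR) equals
    the false-rejection rate (FRR). labels: 1 for genuine trials, 0 for
    impostor trials; scores: higher means more likely genuine."""
    far, tar, _ = roc_curve(labels, scores)
    frr = 1.0 - tar
    idx = np.nanargmin(np.abs(far - frr))   # point where FAR and FRR cross
    return float((far[idx] + frr[idx]) / 2.0)
```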
An important additional benefit of enforcing sparsity is the speedup. The results are shown in Fig. 3.
Fig. 3. The speedup for separate layers.
IV. CONCLUSION
In this work, we proposed applying sparsity imposition to speaker verification. Experimental results demonstrated the effectiveness of enforcing sparsity, owing to its ability to prevent overfitting. This is a direct outcome of removing unimportant elements of the network, such as neurons in the fully-connected layers and output filters in the convolutional layers.
REFERENCES

[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[3] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928, IEEE, 2015.
[4] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056, IEEE, 2014.
[5] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, "Word-level language identification and predicting codeswitching points in Swahili-English language data," in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 21–29, 2016.
[6] R. Shirvani, M. Piergallini, G. S. Gautam, and M. Chouikha, "The Howard University system submission for the shared task in language identification in Spanish-English codeswitching," in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 116–120, 2016.
[7] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[10] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
[11] J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[13] R. Reed, "Pruning algorithms: A survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
[15] M. D. Collins and P. Kohli, "Memory bounded deep convolutional networks," arXiv preprint arXiv:1412.1442, 2014.
[16] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
[17] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
[19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
[20] L. Meier, S. Van De Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
[21] A. Torfi, "SpeechPy: A library for speech processing and recognition," arXiv preprint arXiv:1803.01094, 2018.
[22] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[23] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[24] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
[25] X. Sun, A. Torfi, and N. Nasrabadi, "Deep Siamese convolutional neural networks for identical twins and look-alike identification," Deep Learning in Biometrics, p. 65, 2018.
[26] R. R. Varior, M. Haloi, and G. Wang, "Gated Siamese convolutional neural network architecture for human re-identification," in European Conference on Computer Vision, pp. 791–808, Springer, 2016.
[27] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, 2015.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[30] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.