Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification
Xiaoyang Qu, Jianzong Wang∗, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
{quxiaoyang343, wangjianzong347, xiaojing661}@pingan.com.cn

Abstract
State-of-the-art speaker verification models are based on deep learning techniques, which heavily depend on hand-designed neural architectures from experts or engineers. We borrow the idea of neural architecture search (NAS) for the text-independent speaker verification task. As NAS can learn deep network structures automatically, we introduce the NAS conception into the well-known x-vector network. Furthermore, this paper proposes an evolutionary algorithm enhanced neural architecture search method called Auto-Vector to automatically discover promising networks for the speaker verification task. The experimental results demonstrate that our NAS-based model outperforms state-of-the-art speaker verification models.
Index Terms: speaker verification, deep learning, neural network, neural architecture search.
1. Introduction
Speaker verification is the task of verifying whether an utterance belongs to an enrolled speaker, based on the enrolled speaker's information. It can be categorized into text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). TI-SV is more convenient for practical applications, as it places no constraints, e.g., on duration or lexical content, on the utterances to verify. However, it is also more difficult to achieve good performance with, due to the many potential variabilities of the utterances. In this work, we focus on TI-SV.

In the early years, i-vector [1] based models with a PLDA [2] backend dominated the development of speaker verification applications. In recent years, deep neural networks (DNNs) trained as acoustic models for automatic speech recognition (ASR) have been integrated into the i-vector system [3, 4, 5]. Although the ASR DNN can enhance phonetic modeling in the i-vector UBM, it adds a high computational cost to the i-vector system. Most recently, deep-learning techniques have been used as utterance-level speaker feature extractors [6, 7, 8, 9] and enable end-to-end pipelines that discriminate between speakers [10, 11, 12].

However, these architectures are hand-designed by experts or experienced engineers, which places high demands on their knowledge and experience. As a result, neural architecture search (NAS) [13, 14] has become an increasingly popular topic in both academia and industry, because of its great potential to automatically find effective architectures that outperform hand-crafted ones. Early works on NAS are based on reinforcement learning or evolutionary algorithms [13, 15, 16, 17, 18], but these approaches are expensive in time. To reduce the search time cost, researchers have proposed a wide range of optimization paradigms [19, 20, 21, 22], of which the hyper-network [23, 24, 25, 26] is a typical representative.

∗Corresponding author: Jianzong Wang, [email protected]
In this work, we bring the idea of hyper-network based neural architecture search into text-independent speaker verification and improve its search efficiency by means of a memetic evolutionary algorithm. Our contributions are as follows. (1) As NAS can learn deep network structures automatically, we introduce the NAS conception into the x-vector network. (2) To learn more promising structures for speaker verification, we build a large-scale hyper-network with repetitive architecture motifs. (3) To discover more promising candidate networks, we use a memetic evolutionary algorithm. (4) The experimental results demonstrate that our NAS-based x-vector and Auto-Vector outperform state-of-the-art speaker verification methods on two datasets.
2. Proposed Methods
2.1. NAS-Based x-vector

First, let us review the well-known x-vector network shown in Figure 1(a). Suppose the input utterance contains T frames. The first five layers, L1 to L5, are frame-level hidden layers connected with a time-delay architecture using temporal context windows. The context window over the first layer spans the range from t−2 to t+2. The second and third layers splice the output of the previous layer at time steps {t−2, t, t+2} and {t−3, t, t+3}, respectively. The statistics pooling layer builds an utterance-level feature by calculating the mean and standard deviation over the frame-level features. Note that the seventh hidden layer L7 and the final softmax output layer are used only for training and are discarded in the evaluation process; the sixth layer L6 is used as the x-vector embedding.

As NAS can learn deep network structures automatically, we introduce the NAS conception into the x-vector network, as shown in Figure 1(b). For conventional x-vectors, the context windows between frame-level hidden layers are set by experts. Here, we let the number and the size of the context windows be decided automatically. This method is developed from hyper-network-based NAS, which stands out among efficient NAS approaches because it significantly reduces the tedious training process by sharing the hyper-network's parameters with all candidate networks. The key point is to specify the search space of the hyper-network, which contains all possible candidate networks. As shown in Figure 1(b), we incorporate various choices of temporal context windows for the first five layers. Then, we use the memetic evolutionary search policy described in Section 2.3.2 to find the optimal candidate network among the combinations of temporal context window choices. The statistics pooling layer and the sixth and seventh layers are the same as in conventional x-vectors. However, this small search space limits the potential of the NAS-based x-vector. To enable an ample search space, we design Auto-Vector for speaker verification, as described in Section 2.2.
Figure 1: (a) The embedding DNN architecture of x-vector. (b) Our NAS-based x-vector. (c) Our Auto-Vector for speaker verification.
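To make the time-delay splicing concrete, here is a minimal PyTorch sketch of the frame-level stack, assuming, as is common for TDNNs, that a context such as {t−2, t, t+2} is realized as a dilated 1-D convolution; the 512/1500 layer widths follow the x-vector configuration given in Section 3, and everything else is illustrative.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One frame-level layer: a context like {t-2, t, t+2} corresponds to
    Conv1d(kernel_size=3, dilation=2) applied along the time axis."""
    def __init__(self, in_dim, out_dim, kernel_size, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation)
        self.act = nn.ReLU()
    def forward(self, x):                     # x: (batch, feat_dim, frames)
        return self.act(self.conv(x))

class StatsPooling(nn.Module):
    """Utterance-level feature: mean and std over the frame axis."""
    def forward(self, x):
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

# Frame-level stack of the vanilla x-vector; contexts per layer:
# [t-2, t+2], {t-2, t, t+2}, {t-3, t, t+3}, {t}, {t}.
frame_layers = nn.Sequential(
    TDNNLayer(40, 512, kernel_size=5),
    TDNNLayer(512, 512, kernel_size=3, dilation=2),
    TDNNLayer(512, 512, kernel_size=3, dilation=3),
    TDNNLayer(512, 512, kernel_size=1),
    TDNNLayer(512, 1500, kernel_size=1),
)

x = torch.randn(8, 40, 300)                   # a batch of 3 s MFCC inputs
pooled = StatsPooling()(frame_layers(x))      # (8, 3000), fed to L6 and L7
```

In the NAS-based variant of Figure 1(b), the kernel size and dilation of each layer become per-layer search choices rather than fixed values.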
2.2. Auto-Vector

To automatically learn more promising neural architectures for text-independent speaker verification, we build a large-scale hyper-network with repetitive architecture motifs. As shown in Figure 1(c), the framework includes three parts: input features, architecture, and loss.
Input Features. MFCCs (Mel-frequency cepstral coefficients) are used to extract frame-level acoustic feature vectors from the raw waveform signals. The frames are converted into input acoustic features of 40-dimensional MFCCs with a frame length of 25 ms. This gives spectrograms of size 40×300 for 3 seconds of speech.
Architecture. As shown in Figure 1(c), we build a hyper-network containing the entire search space of architectures. The hyper-network is stacked from blocks with identical structure but different weights. Assume there are $N_B$ choice blocks, and every block has $N_{op}$ choice operations. Each choice block applies either one or two different operations out of the $N_{op}$ possible options. There are therefore $N_{op} + N_{op}(N_{op}-1)/2$ possible combinations of operations in each block. In our experiments, $N_{op}$ is set to 6, so we have six possible operations: a max-pooling layer, an identity operation, and convolution layers of size 1×1, 3×3, 5×5, and 7×7. The average temporal pooling is implemented by applying 2D adaptive average pooling over the input planes. As the size of the search space grows exponentially with the number of choice blocks $N_B$, this large-scale search space admits many more promising networks.
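A minimal PyTorch sketch of one choice block under the description above: six candidate operations, of which one or two are applied and their outputs summed, mirroring the SUM node in Figure 1(c). The padding values are our additions to keep all branch outputs shape-compatible.

```python
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """One block of the hyper-network: 6 ops -> 6 + 6*5/2 = 21 combinations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),
            nn.Conv2d(channels, channels, 1, padding=0),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Conv2d(channels, channels, 7, padding=3),
        ])
    def forward(self, x, selected):
        # 'selected' holds the indices of one or two chosen operations;
        # their outputs are summed, as in the SUM node of Figure 1(c)
        return sum(self.ops[i](x) for i in selected)

block = ChoiceBlock(32)
y = block(torch.randn(4, 32, 40, 300), selected=(2, 4))  # 1x1 + 5x5 branches
```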
Loss. In addition to softmax pre-training, we also use a distance-based loss function, such as the triplet loss or the generalized end-to-end loss. Among the various distance-based loss functions [12, 11, 10], the generalized end-to-end loss [11] performs best, because it not only learns to rank but also emphasizes the hard examples. The details are given in Section 2.4.
The training procedure of the NAS-based x-vector and Auto-Vector consists of four steps: (1) design a search space; (2) train the hyper-network; (3) search for optimal sub-networks with their parameters inherited from the hyper-network; (4) retrain the best-accuracy candidate sub-network as a standalone model. The first step has been described in Section 2.2.

The second step is to train the hyper-network. The training goal is formulated as

$\theta^*(H) = \arg\min_{\theta} L_{train}(H, \theta)$   (1)

where $\theta^*$ denotes the weights of the hyper-network $H$ and $L_{train}(\cdot)$ denotes a loss function on the training dataset.

The third step is to search for high-quality sub-networks, with their parameters inherited from the hyper-network. This search is a black-box optimization that aims to find an approximate maximizer of an objective function $f$ using a given budget of $N$ sub-network evaluations. It can be formulated as

$\tilde{a}^* \approx a^* = \arg\max_{a_n} f(a_n \sim H)$   (2)

where the sub-network $a_n$ is sampled from the search space of the hyper-network $H$. For this optimization goal, we develop a memetic-algorithm-based evolutionary policy, described in Section 2.3.2.

The last step is to retrain the obtained optimal sub-network for the best performance. The model parameters are learned by minimizing the accumulative loss:

$\phi^*(\tilde{a}^*) = \arg\min_{\phi} L_{train}(\tilde{a}^*, \phi)$   (3)

where $\phi^*$ denotes the weights of the best-accuracy candidate sub-model $\tilde{a}^*$ and $L_{train}(\cdot)$ denotes a loss function on the training dataset. The details of the loss function are given in Section 2.4.
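As a concrete illustration of step (2), here is a minimal sketch in the single-path one-shot style of [26], on which the hyper-network approach builds: each batch activates one randomly sampled path, so every candidate sub-network shares, and jointly trains, the hyper-network weights. The `hypernet(features, path)` interface and the data loader are placeholder assumptions.

```python
import random

def train_hypernet(hypernet, loader, optimizer, criterion,
                   num_blocks, num_ops=6):
    """Step (2): theta* = argmin_theta L_train(H, theta).
    One random path per batch means all candidate sub-networks
    share (and train) the same hyper-network weights."""
    hypernet.train()
    for features, labels in loader:
        # uniformly sample one or two distinct ops per choice block
        path = [tuple(random.sample(range(num_ops), k=random.choice((1, 2))))
                for _ in range(num_blocks)]
        optimizer.zero_grad()
        loss = criterion(hypernet(features, path), labels)
        loss.backward()
        optimizer.step()
```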
2.3.2. Memetic Evolutionary Search Policy

Our search policy is based on the memetic algorithm, an augmentation of the genetic algorithm: it integrates one or more local search components into the genetic algorithm to reduce the likelihood of premature convergence. Promising child individuals are thereby generated by recombination from, and local adaptation of, outstanding individuals.

The search process is shown in Algorithm 1. The inputs include an empty population set Ω with size S, the generation number G, and the well-trained hyper-network H. The key operations of the search process are as follows. (1) In the mutation operation, the selected candidate re-chooses one or two different operations in each of its choice blocks with probability 0.1 to produce a new candidate; since cross-over here would amount to a local operation, we apply mutation only. (2) The local search employs a hill-climbing algorithm that discovers high-quality sub-networks by greedily moving in the direction of better-performing sub-networks. (3) The compete operation uses an acceptance criterion to keep the better of two candidates. (4) The fitness is calculated as $f_i = 1 - \delta_i$, where $\delta_i$ is the equal error rate of the i-th individual model. (5) The selection operation follows a tournament selection policy: a candidate set is randomly drawn from the overall population, and the best-fitness individual is chosen from this set rather than from the whole population. This avoids zooming in on good models too early and lets more of the search space be explored.

Algorithm 1: Memetic evolutionary search
  Input: the population set Ω with size S; the generation number G; the well-trained hyper-network H
  Output: a best-accuracy sub-network a
  for i ← 1 … S do                      ⊳ initialize the population set Ω
      a_i ← uniform-sampling(H)
      f_i ← fitness-evaluation(a_i)
      Ω ← Ω + {a_i, f_i}
  end for
  for j ← 1 … G do
      ä_j ← mutation(a_j)
      ā_j ← local-search(ä_j)
      a_j ← compete(ä_j, ā_j)
      f_j ← fitness-evaluation(a_j)
      Ω ← Ω + {a_j, f_j}
      Ω ← Ω − worst(Ω)
      a_{j+1} ← tournament-selection(Ω)
  end for
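A compact Python sketch of Algorithm 1 under the stated settings (mutation probability 0.1, fitness $f_i = 1 - \delta_i$, tournament selection, population size 100 and 2,000 generations per Section 3). The sub-network encoding (a list of per-block op-index tuples) and `evaluate_eer` are placeholder assumptions, and the hill-climbing budget is illustrative.

```python
import copy
import random

def fitness(sub_net, evaluate_eer):
    return 1.0 - evaluate_eer(sub_net)            # f_i = 1 - delta_i

def mutate(sub_net, num_ops=6, p=0.1):
    """Re-draw the op choice of each block with probability p (no cross-over)."""
    child = copy.deepcopy(sub_net)
    for b in range(len(child)):
        if random.random() < p:
            child[b] = tuple(random.sample(range(num_ops),
                                           k=random.choice((1, 2))))
    return child

def local_search(sub_net, evaluate_eer, steps=5):
    """Hill climbing: greedily keep small tweaks that improve fitness."""
    best, best_f = sub_net, fitness(sub_net, evaluate_eer)
    for _ in range(steps):
        cand = mutate(best, p=1.0 / len(best))    # tweak roughly one block
        f = fitness(cand, evaluate_eer)
        if f > best_f:
            best, best_f = cand, f
    return best

def tournament(population, k=10):
    """Pick the fittest of a random subset, not of the whole population."""
    return max(random.sample(population, k), key=lambda p: p[1])[0]

def memetic_search(sample_subnet, evaluate_eer, S=100, G=2000):
    pop = []
    for _ in range(S):                            # initialize the population
        a = sample_subnet()
        pop.append((a, fitness(a, evaluate_eer)))
    parent = tournament(pop)
    for _ in range(G):
        child = mutate(parent)
        refined = local_search(child, evaluate_eer)
        # compete: an acceptance criterion keeps the better candidate
        scored = [(c, fitness(c, evaluate_eer)) for c in (child, refined)]
        pop.append(max(scored, key=lambda p: p[1]))
        pop.remove(min(pop, key=lambda p: p[1]))  # drop the worst individual
        parent = tournament(pop)
    return max(pop, key=lambda p: p[1])[0]
```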
2.4. Backend

The objective of a typical cross-entropy loss is to learn to directly predict a label given an input, whereas metric learning aims to predict relative distances between inputs. In addition to softmax pre-training, we also use a distance-based loss function. Assume N speakers, each with M utterances. The loss function $L(\cdot)$ is

$L(e_a, e_p, e_n) = 1 - \sigma(d(e_a, e_p)) + \max_{1 \le k \le N,\, k \neq j} \sigma(d(e_a, e_n^k))$   (4)

where $d(e_a, e_p)$ is the scaled similarity score between the anchor embedding $e_a$ and the positive embedding $e_p$; here $e_a$ and $e_p$ belong to the same speaker $j$. The negative embedding $e_n^k$ is the centroid embedding of the $k$-th speaker, evaluated as $e_n^k = \frac{1}{M}\sum_{m=1}^{M} e_k(m)$ using the $M$ utterances of the $k$-th speaker, and $\sigma(\cdot)$ is the sigmoid function. The scaled similarity score $d(e_a, e_p)$ is defined as

$d(e_a, e_p) = w \cdot \cos(e_a, e_p) + b$   (5)

where $w$ and $b$ are learnable parameters and $\cos(\cdot)$ is the cosine similarity function.
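A PyTorch sketch of Eqs. (4) and (5), assuming the anchor/positive embeddings and the per-speaker negative centroids have already been computed; the initial values of w and b are illustrative.

```python
import torch
import torch.nn.functional as F

class ScaledCosine(torch.nn.Module):
    """Eq. (5): d(e_a, e_p) = w * cos(e_a, e_p) + b with learnable w, b."""
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(10.0))  # illustrative init
        self.b = torch.nn.Parameter(torch.tensor(-5.0))
    def forward(self, a, p):
        return self.w * F.cosine_similarity(a, p, dim=-1) + self.b

def contrast_loss(e_a, e_p, centroids, j, d):
    """Eq. (4): pull the anchor toward its positive pair and push it away
    from the hardest negative centroid (k != j, the anchor's own speaker)."""
    pos = torch.sigmoid(d(e_a, e_p))
    sims = torch.sigmoid(d(e_a.expand_as(centroids), centroids))
    mask = torch.ones(centroids.size(0), dtype=torch.bool)
    mask[j] = False                                  # exclude speaker j
    return 1.0 - pos + sims[mask].max()

d = ScaledCosine()
e_a, e_p = torch.randn(512), torch.randn(512)        # anchor, positive
centroids = torch.randn(8, 512)                      # e_n^k for 8 speakers
loss = contrast_loss(e_a, e_p, centroids, j=0, d=d)
```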
Table 1: Equal Error Rate Comparison (the lower, the better)

Model              | EER (Dataset1) | EER (Dataset2) | Size
LSTM-GE2E [11]     | 6.2%           | 8.3%           | 4.6M
x-vector [27]      | 4.6%           | 6.5%           | 6.14M
NAS-based x-vector | 4.3%           | 5.6%           | 6.32M
Auto-Vector        | –              | –              | –

3. Experiments
We use two datasets for evaluation.
Dataset1 includes 300 speakers with 4,527 utterances in total, whose durations mostly range from 3 to 7 seconds. We split the overall dataset into a training set of 270 speakers and a test set of 30 speakers. For each test speaker, 10 utterances are randomly chosen as enrollment utterances, and another 10 randomly chosen utterances are used as evaluation samples.
Dataset2 includes 4000 speakers with 23,573 utterances and more than 12,600 hours of speech. This dataset is split into a training set of 3,960 speakers and an evaluation set of 285 speakers; the evaluation speakers do not overlap with the training speakers.

The raw waveform audio, sampled at 16 kHz, is converted into frames using a Hamming window of width 25 ms and step 10 ms. MFCCs (Mel-frequency cepstral coefficients) are used to extract frame-level acoustic feature vectors from the raw waveform signals. The frames are converted into input acoustic features of 40-dimensional MFCCs with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds. This gives spectrograms of size 40×300 for 3 seconds of speech. An energy-based VAD is employed to filter out non-speech frames from the utterances. There are N speakers, each with M utterances.
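A sketch of the described front-end using librosa (the paper does not name its feature-extraction toolkit, so this is an assumption): 16 kHz audio, 25 ms Hamming windows with a 10 ms step, 40-dimensional MFCCs, an energy-based VAD, and mean normalization over a sliding window of up to 3 seconds.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 40-dim MFCCs from 25 ms Hamming windows with a 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window="hamming")       # (40, n_frames)
    # simple energy-based VAD: drop frames whose RMS energy
    # falls below 1% of the utterance peak
    energy = librosa.feature.rms(y=y, frame_length=int(0.025 * sr),
                                 hop_length=int(0.010 * sr))[0]
    mfcc = mfcc[:, energy[: mfcc.shape[1]] > 0.01 * energy.max()]
    # mean-normalize over a sliding window of up to 3 s (~300 frames)
    out = np.empty_like(mfcc)
    for t in range(mfcc.shape[1]):
        lo = max(0, t - 150)
        out[:, t] = mfcc[:, t] - mfcc[:, lo:t + 151].mean(axis=1)
    return out  # 3 s of speech gives roughly a 40 x 300 spectrogram
```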
Table 1 shows the EER comparison of four models on Dataset1 and Dataset2. Our Auto-Vector performs better than the LSTM and x-vector baselines: on both datasets, its equal error rate (EER) is much lower. This result shows that neural architecture search can find a better model than the expert-designed, hand-crafted models. We use the same backend (GE2E) for all evaluated systems to eliminate the impact of different backend classifiers.

The configurations of the two baseline networks are as follows. The first baseline is a 3-layer LSTM network [11] with a projection of size 256. The embedding vector (d-vector) size is the same as the LSTM projection size, and there are 768 hidden nodes in the LSTM layer. A moving average is used to obtain the embedding. The second baseline is the x-vector network [27]. Its first five layers, L1 to L5, are frame-level hidden layers; there are 512 nodes in each of the first four layers, L1 to L4, and 1,500 nodes in the fifth layer L5. The statistics pooling layer builds an utterance-level feature by calculating the mean and standard deviation over the frame-level features. The two utterance-level layers, L6 and L7, each have 512 nodes, and the sixth layer L6 is used as the embedding.

Our NAS-based x-vector is stacked with a repetitive context block, and each block contains four choice temporal context windows. As this search space is small, we use a random search policy for the NAS-based x-vector.

Table 2: The evaluation results of the hyper-network and sub-networks on Dataset1. Here, F is the number of filters in the first convolution layer and B is the number of choice blocks. For sub-networks, we retrain the top-10 sub-networks and report the mean x and standard deviation y as x ± y.

Model                  | size (M) | EER (%) | cost (GPU h)
HyperNet (F=16, B=24)  | 2.43     | 3.5     | 14.6
HyperNet (F=32, B=24)  | 6.08     | 2.7     | 21.7
HyperNet (F=64, B=24)  | 17.04    | 1.9     | 33.9
HyperNet (F=128, B=24) | 46.82    | 1.4     | 50.7
HyperNet (B=12, F=32)  | 5.06     | 3.1     | 18.9
HyperNet (B=24, F=32)  | 6.08     | 2.7     | 21.7
HyperNet (B=36, F=32)  | 7.08     | 2.6     | 24.3
HyperNet (B=48, F=32)  | 8.05     | 2.2     | 26.2
SubNet (F=16, B=48)    | –        | –       | –
SubNet (F=32, B=48)    | –        | –       | –
SubNet (F=64, B=48)    | –        | –       | –
SubNet (F=128, B=48)   | –        | –       | –
For Auto-Vector, the hyper-parameters of the hyper-network (the number of blocks B and the number of filters F) are analyzed in Section 3.3. The embedding size is set to 512. To decouple the correlation of sub-networks, we set the path dropout rate to 0.1. For the input, the batch size is set to 40 utterances, drawn from 8 speakers with 5 utterances each. For training, we use the Adam optimizer and a linear learning-rate decay policy with a base learning rate of 0.02. For the memetic evolutionary search, the size of the population set is 100 and the number of generations is 2,000.

The Impact of Hyper-parameters. First, we parameterize our models by F, the number of filters in the first convolution layer, as shown in Table 2. When F=16 and B=24, we obtain an average EER of 3.5% with about 2.43M parameters. With each doubling of the filter number, the model size grows by nearly three times, and the equal error rate decreases as the number of filters increases. Since we use two reduction blocks in our hyper-network, the model size would be expected to grow by four times; however, due to the dense layer at the end of the NAS-based model, it grows by only nearly three times. The best model achieves 1.4% EER with around 46.82M parameters.

Then, we parameterize our models by B, the number of choice blocks. When B=12 and F=32, we obtain an average EER of 3.1% with about 5.06M parameters; the best model achieves 2.2% EER with around 8.05M parameters. With each doubling of the block number, the model size increases only a little, because there are two dense layers in the tail of our model: the weights of the dense layers dominate the model size, so the model size grows at a low rate as the number of blocks doubles.

Figure 2: The evaluation histograms of 2,000 candidate sub-networks with their parameters inherited from the hyper-network. (a) Search hyper-network with F=32 and B=24. (b) Search hyper-network with F=64 and B=48.
The Impact of Evolutionary Search.
Compared with a random selection algorithm, our memetic evolutionary algorithm generates more high-quality candidate models. The equal error rate (EER) distribution of candidate models is shown in Figure 2: most of the candidate models found by our evolutionary algorithm have a lower EER than those found by random search. Moreover, the best-accuracy model is found by our evolutionary algorithm rather than by the random algorithm. This further shows that our evolutionary algorithm explores more of the search space and thereby generates more high-quality candidate models.
Sub-Network Re-Training. As shown in Table 2, we retrain the top-10 sub-networks and report the mean x and standard deviation y as x ± y. We aim to find the best-quality model whose size is smaller than the x-vector network. With F=64 and B=48, we discover the optimal model, whose size is 5.17M and whose EER is 1.8%.
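For reference, the equal error rate used throughout the evaluation is the operating point at which the false-accept and false-reject rates coincide; a common way to compute it from verification scores is sketched here with scikit-learn (an implementation assumption, not the authors' tooling).

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """scores: one similarity per trial; labels: 1 = same speaker, 0 = impostor.
    The EER is the point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # target trials
                         rng.normal(-1.0, 1.0, 500)])  # impostor trials
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER = {equal_error_rate(scores, labels):.2%}")
```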
4. Conclusion
In this paper, we introduce the NAS conception into the well-known x-vector network. To enable more of the search space to be explored, we use an evolutionary algorithm enhanced neural architecture search framework to search for high-quality sub-networks. The experiments show that our system outperforms two state-of-the-art end-to-end methods on a public dataset. Moreover, our NAS method achieves a reduction of 36%-86% in equal error rate compared with the state-of-the-art methods.
5. Acknowledgment
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400, and No. 2017YFB1401202.

6. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[2] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, "Text-dependent speaker recognition using PLDA with uncertainty propagation," matrix, vol. 500, no. 1, 2013.
[3] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP. IEEE, 2014, pp. 1695–1699.
[4] P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Odyssey, vol. 2014, 2014, pp. 293–298.
[5] M. McLaren, Y. Lei, and L. Ferrer, "Advances in deep neural network approaches to speaker recognition," in ICASSP. IEEE, 2015, pp. 4814–4818.
[6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[7] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[8] Z. Shi, M. Wang, L. Liu, H. Lin, and R. Liu, "A double joint Bayesian approach for j-vector based text-dependent speaker verification," arXiv preprint arXiv:1711.06434, 2017.
[9] T. Fu, Y. Qian, Y. Liu, and K. Yu, "Tandem deep features for text-dependent speaker verification," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[10] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.
[11] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.
[12] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[13] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[14] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint arXiv:1611.02167, 2016.
[15] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.
[16] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[17] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., "Evolving deep neural networks," in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
[18] P. R. Lorenzo and J. Nalepa, "Memetic evolution of deep neural networks," in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2018, pp. 505–512.
[19] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 19–34.
[20] R. Negrinho and G. Gordon, "DeepArchitect: Automatically designing and training deep architectures," arXiv preprint arXiv:1704.08792, 2017.
[21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," arXiv preprint arXiv:1802.03268, 2018.
[22] T. Elsken, J. H. Metzen, and F. Hutter, "Simple and efficient architecture search for convolutional neural networks," arXiv preprint arXiv:1711.04528, 2017.
[23] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, "SMASH: one-shot model architecture search through hypernetworks," arXiv preprint arXiv:1708.05344, 2017.
[24] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in International Conference on Machine Learning, 2018, pp. 549–558.
[25] C. Zhang, M. Ren, and R. Urtasun, "Graph hypernetworks for neural architecture search," arXiv preprint arXiv:1810.05749, 2018.
[26] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
[27] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.