Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification
Xiaoyang Qu, Jianzong Wang∗, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
{quxiaoyang343, wangjianzong347, xiaojing661}@pingan.com.cn

Abstract
State-of-the-art speaker verification models are based on deep learning techniques, which heavily depend on hand-designed neural architectures from experts or engineers. We borrow the idea of neural architecture search (NAS) for the text-independent speaker verification task. As NAS can learn deep network structures automatically, we introduce the NAS conception into the well-known x-vector network. Furthermore, this paper proposes an evolutionary algorithm enhanced neural architecture search method called Auto-Vector to automatically discover promising networks for the speaker verification task. The experimental results demonstrate that our NAS-based model outperforms state-of-the-art speaker verification models.
Index Terms: speaker verification, deep learning, neural network, neural architecture search.
1. Introduction
Speaker verification is the task of verifying whether an utterance belongs to an enrolled speaker, based on the enrolled speaker's information. It can be categorized into text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). TI-SV is more convenient for practical applications, as it places no constraints, e.g., on duration or lexical content, on the utterances to verify. However, it is also more difficult to achieve good performance with, due to the many potential variabilities of the utterances. In this work, we focus on TI-SV.

In the early years, i-vector [1] based models with a PLDA [2] backend dominated the development of speaker verification applications. In recent years, deep neural networks (DNNs) trained as acoustic models for automatic speech recognition (ASR) have been integrated into the i-vector system [3, 4, 5]. Although the ASR DNN can enhance phonetic modeling in the i-vector UBM, it adds a high computational cost to the i-vector system. Most recently, deep-learning techniques have been used as utterance-level speaker feature extractors [6, 7, 8, 9] and enable end-to-end pipelines that discriminate between speakers [10, 11, 12].

However, these architectures are hand-designed by experts or experienced engineers, which places high demands on their knowledge and experience. As a result, neural architecture search (NAS) [13, 14] has become an increasingly popular topic in both academia and industry, because of its great potential to automatically find effective architectures that outperform hand-crafted ones. Early works on NAS are based on reinforcement learning or evolutionary algorithms [13, 15, 16, 17, 18], but these approaches are expensive in time. To reduce the search time cost, researchers have proposed a wide range of optimization paradigms [19, 20, 21, 22], of which the hyper-network [23, 24, 25, 26] is a typical representative.

∗Corresponding author: Jianzong Wang, [email protected]
In this work, we bring the idea of hyper-network based neural architecture search into text-independent speaker verification and improve its search efficiency by means of a memetic evolutionary algorithm. Our contributions are as follows. (1) As NAS can learn deep network structures automatically, we introduce the NAS conception into the x-vector network. (2) To learn more promising structures for speaker verification, we build a large-scale hyper-network with repetitive architecture motifs. (3) To discover more promising candidate networks, we use a memetic evolutionary algorithm. (4) The experimental results demonstrate that our NAS-based x-vector and Auto-Vector outperform state-of-the-art speaker verification methods on two datasets.
2. Proposed Methods
2.1. NAS-Based x-vector

First, let us review the well-known x-vector network shown in Figure 1(a). Suppose the input utterance contains T frames. The first five layers, L1 to L5, are frame-level hidden layers connected with a time-delay architecture using temporal context windows. The context window over the first layer spans the range from t−2 to t+2. The second and third layers splice the output of the previous layer at time steps {t−2, t, t+2} and {t−3, t, t+3}, respectively. The statistics pooling layer builds an utterance-level feature by calculating the mean and standard deviation over the frame-level features. Note that the seventh hidden layer L7 and the final softmax output layer are used only for training and are discarded in the evaluation process; the sixth layer L6 is used as the x-vector embedding.

As NAS can learn deep network structures automatically, we introduce the NAS conception into the x-vector network, as shown in Figure 1(b). For conventional x-vectors, the context windows between frame-level hidden layers are set by experts. Here, we let the number and the size of the context windows be decided automatically. This method is developed from hyper-network-based NAS, which stands out among efficient NAS approaches because it significantly reduces the tedious training process by sharing the hyper-network's parameters with all candidate networks. The key point is to specify the search space of the hyper-network, which contains all possible candidate networks. As shown in Figure 1(b), we incorporate various choices of temporal context windows for the first five layers. Then, we use the memetic evolutionary search policy described in Section 2.3.2 to find the optimal candidate network among the combinations of temporal context window choices. The statistics pooling layer and the sixth and seventh layers are the same as in conventional x-vectors. However, this small search space limits the potential of the NAS-based x-vector. To enable an ample search space, we design Auto-Vector for speaker verification, as described in Section 2.2.
Figure 1: (a) The embedding DNN architecture of x-vector. (b) Our NAS-based x-vector. (c) Our Auto-Vector for speaker verification.
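To make the time-delay splicing concrete, here is a minimal PyTorch sketch of the frame-level stack, assuming, as is common for TDNNs, that a context such as {t−2, t, t+2} is realized as a dilated 1-D convolution; the 512/1500 layer widths follow the x-vector configuration given in Section 3, and everything else is illustrative.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One frame-level layer: a context like {t-2, t, t+2} corresponds to
    Conv1d(kernel_size=3, dilation=2) applied along the time axis."""
    def __init__(self, in_dim, out_dim, kernel_size, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation)
        self.act = nn.ReLU()
    def forward(self, x):                     # x: (batch, feat_dim, frames)
        return self.act(self.conv(x))

class StatsPooling(nn.Module):
    """Utterance-level feature: mean and std over the frame axis."""
    def forward(self, x):
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

# Frame-level stack of the vanilla x-vector; contexts per layer:
# [t-2, t+2], {t-2, t, t+2}, {t-3, t, t+3}, {t}, {t}.
frame_layers = nn.Sequential(
    TDNNLayer(40, 512, kernel_size=5),
    TDNNLayer(512, 512, kernel_size=3, dilation=2),
    TDNNLayer(512, 512, kernel_size=3, dilation=3),
    TDNNLayer(512, 512, kernel_size=1),
    TDNNLayer(512, 1500, kernel_size=1),
)

x = torch.randn(8, 40, 300)                   # a batch of 3 s MFCC inputs
pooled = StatsPooling()(frame_layers(x))      # (8, 3000), fed to L6 and L7
```

In the NAS-based variant of Figure 1(b), the kernel size and dilation of each layer become per-layer search choices rather than fixed values.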
2.2. Auto-Vector

To automatically learn more promising neural architectures for text-independent speaker verification, we build a large-scale hyper-network with repetitive architecture motifs. As shown in Figure 1(c), the framework includes three parts: input features, architecture, and loss.
Input Features. MFCCs (Mel-frequency cepstral coefficients) are used to extract frame-level acoustic feature vectors from the raw waveform signals. The frames are converted into input acoustic features of 40-dimensional MFCCs with a frame length of 25 ms. This gives spectrograms of size 40×300 for 3 seconds of speech.
Architecture. As shown in Figure 1(c), we build a hyper-network containing the entire search space of architectures. The hyper-network is stacked from blocks with identical structure but different weights. Assume there are $N_B$ choice blocks, and every block has $N_{op}$ choice operations. Each choice block applies either one or two different operations out of the $N_{op}$ possible options. There are therefore $N_{op} + N_{op}(N_{op}-1)/2$ possible combinations of operations in each block. In our experiments, $N_{op}$ is set to 6, so we have six possible operations: a max-pooling layer, an identity operation, and convolution layers of size 1×1, 3×3, 5×5, and 7×7. The average temporal pooling is implemented by applying 2D adaptive average pooling over the input planes. As the size of the search space grows exponentially with the number of choice blocks $N_B$, this large-scale search space admits many more promising networks.
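A minimal PyTorch sketch of one choice block under the description above: six candidate operations, of which one or two are applied and their outputs summed, mirroring the SUM node in Figure 1(c). The padding values are our additions to keep all branch outputs shape-compatible.

```python
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """One block of the hyper-network: 6 ops -> 6 + 6*5/2 = 21 combinations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),
            nn.Conv2d(channels, channels, 1, padding=0),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Conv2d(channels, channels, 7, padding=3),
        ])
    def forward(self, x, selected):
        # 'selected' holds the indices of one or two chosen operations;
        # their outputs are summed, as in the SUM node of Figure 1(c)
        return sum(self.ops[i](x) for i in selected)

block = ChoiceBlock(32)
y = block(torch.randn(4, 32, 40, 300), selected=(2, 4))  # 1x1 + 5x5 branches
```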
Loss. In addition to softmax pre-training, we also use a distance-based loss function, such as the triplet loss or the generalized end-to-end loss. Among the various distance-based loss functions [12, 11, 10], the generalized end-to-end loss [11] performs best, because it not only learns to rank but also emphasizes the hard examples. The details are given in Section 2.4.
The training procedure of the NAS-based x-vector and Auto-Vector consists of four steps: (1) design a search space; (2) train the hyper-network; (3) search for optimal sub-networks with their parameters inherited from the hyper-network; (4) retrain the best-accuracy candidate sub-network as a standalone model. The first step has been described in Section 2.2.

The second step is to train the hyper-network. The training goal is formulated as

$\theta^*(H) = \arg\min_{\theta} L_{train}(H, \theta)$   (1)

where $\theta^*$ denotes the weights of the hyper-network $H$ and $L_{train}(\cdot)$ denotes a loss function on the training dataset.

The third step is to search for high-quality sub-networks, with their parameters inherited from the hyper-network. This search is a black-box optimization that aims to find an approximate maximizer of an objective function $f$ using a given budget of $N$ sub-network evaluations. It can be formulated as

$\tilde{a}^* \approx a^* = \arg\max_{a_n} f(a_n \sim H)$   (2)

where the sub-network $a_n$ is sampled from the search space of the hyper-network $H$. For this optimization goal, we develop a memetic-algorithm-based evolutionary policy, described in Section 2.3.2.

The last step is to retrain the obtained optimal sub-network for the best performance. The model parameters are learned by minimizing the accumulative loss:

$\phi^*(\tilde{a}^*) = \arg\min_{\phi} L_{train}(\tilde{a}^*, \phi)$   (3)

where $\phi^*$ denotes the weights of the best-accuracy candidate sub-model $\tilde{a}^*$ and $L_{train}(\cdot)$ denotes a loss function on the training dataset. The details of the loss function are given in Section 2.4.
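As a concrete illustration of step (2), here is a minimal sketch in the single-path one-shot style of [26], on which the hyper-network approach builds: each batch activates one randomly sampled path, so every candidate sub-network shares, and jointly trains, the hyper-network weights. The `hypernet(features, path)` interface and the data loader are placeholder assumptions.

```python
import random

def train_hypernet(hypernet, loader, optimizer, criterion,
                   num_blocks, num_ops=6):
    """Step (2): theta* = argmin_theta L_train(H, theta).
    One random path per batch means all candidate sub-networks
    share (and train) the same hyper-network weights."""
    hypernet.train()
    for features, labels in loader:
        # uniformly sample one or two distinct ops per choice block
        path = [tuple(random.sample(range(num_ops), k=random.choice((1, 2))))
                for _ in range(num_blocks)]
        optimizer.zero_grad()
        loss = criterion(hypernet(features, path), labels)
        loss.backward()
        optimizer.step()
```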
2.3.2. Memetic Evolutionary Search Policy

Our search policy is based on the memetic algorithm, an augmentation of the genetic algorithm: it integrates one or more local search components into the genetic algorithm to reduce the likelihood of premature convergence. Promising child individuals are thereby generated by recombination from, and local adaptation of, outstanding individuals.

The search process is shown in Algorithm 1. The inputs include an empty population set Ω with size S, the generation number G, and the well-trained hyper-network H. The key operations of the search process are as follows. (1) In the mutation operation, the selected candidate re-chooses one or two different operations in each of its choice blocks with probability 0.1 to produce a new candidate; since cross-over here would amount to a local operation, we apply mutation only. (2) The local search employs a hill-climbing algorithm that discovers high-quality sub-networks by greedily moving in the direction of better-performing sub-networks. (3) The compete operation uses an acceptance criterion to keep the better of two candidates. (4) The fitness is calculated as $f_i = 1 - \delta_i$, where $\delta_i$ is the equal error rate of the i-th individual model. (5) The selection operation follows a tournament selection policy: a candidate set is randomly drawn from the overall population, and the best-fitness individual is chosen from this set rather than from the whole population. This avoids zooming in on good models too early and lets more of the search space be explored.

Algorithm 1: Memetic evolutionary search
  Input: the population set Ω with size S; the generation number G; the well-trained hyper-network H
  Output: a best-accuracy sub-network a
  for i ← 1 … S do                      ⊳ initialize the population set Ω
      a_i ← uniform-sampling(H)
      f_i ← fitness-evaluation(a_i)
      Ω ← Ω + {a_i, f_i}
  end for
  for j ← 1 … G do
      ä_j ← mutation(a_j)
      ā_j ← local-search(ä_j)
      a_j ← compete(ä_j, ā_j)
      f_j ← fitness-evaluation(a_j)
      Ω ← Ω + {a_j, f_j}
      Ω ← Ω − worst(Ω)
      a_{j+1} ← tournament-selection(Ω)
  end for
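A compact Python sketch of Algorithm 1 under the stated settings (mutation probability 0.1, fitness $f_i = 1 - \delta_i$, tournament selection, population size 100 and 2,000 generations per Section 3). The sub-network encoding (a list of per-block op-index tuples) and `evaluate_eer` are placeholder assumptions, and the hill-climbing budget is illustrative.

```python
import copy
import random

def fitness(sub_net, evaluate_eer):
    return 1.0 - evaluate_eer(sub_net)            # f_i = 1 - delta_i

def mutate(sub_net, num_ops=6, p=0.1):
    """Re-draw the op choice of each block with probability p (no cross-over)."""
    child = copy.deepcopy(sub_net)
    for b in range(len(child)):
        if random.random() < p:
            child[b] = tuple(random.sample(range(num_ops),
                                           k=random.choice((1, 2))))
    return child

def local_search(sub_net, evaluate_eer, steps=5):
    """Hill climbing: greedily keep small tweaks that improve fitness."""
    best, best_f = sub_net, fitness(sub_net, evaluate_eer)
    for _ in range(steps):
        cand = mutate(best, p=1.0 / len(best))    # tweak roughly one block
        f = fitness(cand, evaluate_eer)
        if f > best_f:
            best, best_f = cand, f
    return best

def tournament(population, k=10):
    """Pick the fittest of a random subset, not of the whole population."""
    return max(random.sample(population, k), key=lambda p: p[1])[0]

def memetic_search(sample_subnet, evaluate_eer, S=100, G=2000):
    pop = []
    for _ in range(S):                            # initialize the population
        a = sample_subnet()
        pop.append((a, fitness(a, evaluate_eer)))
    parent = tournament(pop)
    for _ in range(G):
        child = mutate(parent)
        refined = local_search(child, evaluate_eer)
        # compete: an acceptance criterion keeps the better candidate
        scored = [(c, fitness(c, evaluate_eer)) for c in (child, refined)]
        pop.append(max(scored, key=lambda p: p[1]))
        pop.remove(min(pop, key=lambda p: p[1]))  # drop the worst individual
        parent = tournament(pop)
    return max(pop, key=lambda p: p[1])[0]
```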
2.4. Backend

The objective of a typical cross-entropy loss is to learn to directly predict a label given an input, whereas metric learning aims to predict relative distances between inputs. In addition to softmax pre-training, we also use a distance-based loss function. Assume N speakers, each with M utterances. The loss function $L(\cdot)$ is

$L(e_a, e_p, e_n) = 1 - \sigma(d(e_a, e_p)) + \max_{1 \le k \le N,\, k \neq j} \sigma(d(e_a, e_n^k))$   (4)

where $d(e_a, e_p)$ is the scaled similarity score between the anchor embedding $e_a$ and the positive embedding $e_p$; here $e_a$ and $e_p$ belong to the same speaker $j$. The negative embedding $e_n^k$ is the centroid embedding of the $k$-th speaker, evaluated as $e_n^k = \frac{1}{M}\sum_{m=1}^{M} e_k(m)$ using the $M$ utterances of the $k$-th speaker, and $\sigma(\cdot)$ is the sigmoid function. The scaled similarity score $d(e_a, e_p)$ is defined as

$d(e_a, e_p) = w \cdot \cos(e_a, e_p) + b$   (5)

where $w$ and $b$ are learnable parameters and $\cos(\cdot)$ is the cosine similarity function.
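A PyTorch sketch of Eqs. (4) and (5), assuming the anchor/positive embeddings and the per-speaker negative centroids have already been computed; the initial values of w and b are illustrative.

```python
import torch
import torch.nn.functional as F

class ScaledCosine(torch.nn.Module):
    """Eq. (5): d(e_a, e_p) = w * cos(e_a, e_p) + b with learnable w, b."""
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(10.0))  # illustrative init
        self.b = torch.nn.Parameter(torch.tensor(-5.0))
    def forward(self, a, p):
        return self.w * F.cosine_similarity(a, p, dim=-1) + self.b

def contrast_loss(e_a, e_p, centroids, j, d):
    """Eq. (4): pull the anchor toward its positive pair and push it away
    from the hardest negative centroid (k != j, the anchor's own speaker)."""
    pos = torch.sigmoid(d(e_a, e_p))
    sims = torch.sigmoid(d(e_a.expand_as(centroids), centroids))
    mask = torch.ones(centroids.size(0), dtype=torch.bool)
    mask[j] = False                                  # exclude speaker j
    return 1.0 - pos + sims[mask].max()

d = ScaledCosine()
e_a, e_p = torch.randn(512), torch.randn(512)        # anchor, positive
centroids = torch.randn(8, 512)                      # e_n^k for 8 speakers
loss = contrast_loss(e_a, e_p, centroids, j=0, d=d)
```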
Table 1: Equal Error Rate Comparison (the lower, the better)

Model              | EER (Dataset1) | EER (Dataset2) | Size
LSTM-GE2E [11]     | 6.2%           | 8.3%           | 4.6M
x-vector [27]      | 4.6%           | 6.5%           | 6.14M
NAS-based x-vector | 4.3%           | 5.6%           | 6.32M
Auto-Vector        | –              | –              | –

3. Experiments
We use two datasets for evaluation.
Dataset1 includes 300 speakers with 4,527 utterances in total, whose durations mostly range from 3 to 7 seconds. We split the overall dataset into a training set of 270 speakers and a test set of 30 speakers. For each test speaker, 10 utterances are randomly chosen as enrollment utterances, and another 10 randomly chosen utterances are used as evaluation samples.
Dataset2 includes 4000 speakers with 23,573 utterances and more than 12,600 hours of speech. This dataset is split into a training set of 3,960 speakers and an evaluation set of 285 speakers; the evaluation speakers do not overlap with the training speakers.

The raw waveform audio, sampled at 16 kHz, is converted into frames using a Hamming window of width 25 ms and step 10 ms. MFCCs (Mel-frequency cepstral coefficients) are used to extract frame-level acoustic feature vectors from the raw waveform signals. The frames are converted into input acoustic features of 40-dimensional MFCCs with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds. This gives spectrograms of size 40×300 for 3 seconds of speech. An energy-based VAD is employed to filter out non-speech frames from the utterances. There are N speakers, each with M utterances.
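A sketch of the described front-end using librosa (the paper does not name its feature-extraction toolkit, so this is an assumption): 16 kHz audio, 25 ms Hamming windows with a 10 ms step, 40-dimensional MFCCs, an energy-based VAD, and mean normalization over a sliding window of up to 3 seconds.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 40-dim MFCCs from 25 ms Hamming windows with a 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window="hamming")       # (40, n_frames)
    # simple energy-based VAD: drop frames whose RMS energy
    # falls below 1% of the utterance peak
    energy = librosa.feature.rms(y=y, frame_length=int(0.025 * sr),
                                 hop_length=int(0.010 * sr))[0]
    mfcc = mfcc[:, energy[: mfcc.shape[1]] > 0.01 * energy.max()]
    # mean-normalize over a sliding window of up to 3 s (~300 frames)
    out = np.empty_like(mfcc)
    for t in range(mfcc.shape[1]):
        lo = max(0, t - 150)
        out[:, t] = mfcc[:, t] - mfcc[:, lo:t + 151].mean(axis=1)
    return out  # 3 s of speech gives roughly a 40 x 300 spectrogram
```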
Table 1 shows the EER comparison of four models on Dataset1 and Dataset2. Our Auto-Vector performs better than the LSTM and x-vector baselines: on both datasets, its equal error rate (EER) is much lower. This result shows that neural architecture search can find a better model than the expert-designed, hand-crafted models. We use the same backend (GE2E) for all evaluated systems to eliminate the impact of different backend classifiers.

The configurations of the two baseline networks are as follows. The first baseline is a 3-layer LSTM network [11] with a projection of size 256. The embedding vector (d-vector) size is the same as the LSTM projection size, and there are 768 hidden nodes in the LSTM layer. A moving average is used to obtain the embedding. The second baseline is the x-vector network [27]. Its first five layers, L1 to L5, are frame-level hidden layers; there are 512 nodes in each of the first four layers, L1 to L4, and 1,500 nodes in the fifth layer L5. The statistics pooling layer builds an utterance-level feature by calculating the mean and standard deviation over the frame-level features. The two utterance-level layers, L6 and L7, each have 512 nodes, and the sixth layer L6 is used as the embedding.

Our NAS-based x-vector is stacked with a repetitive context block, and each block contains four choice temporal context windows. As this search space is small, we use a random search policy for the NAS-based x-vector.

Table 2: The evaluation results of the hyper-network and sub-networks on Dataset1. Here, F is the number of filters in the first convolution layer and B is the number of choice blocks. For sub-networks, we retrain the top-10 sub-networks and report the mean x and standard deviation y as x ± y.

Model                  | size (M) | EER (%) | cost (GPU h)
HyperNet (F=16, B=24)  | 2.43     | 3.5     | 14.6
HyperNet (F=32, B=24)  | 6.08     | 2.7     | 21.7
HyperNet (F=64, B=24)  | 17.04    | 1.9     | 33.9
HyperNet (F=128, B=24) | 46.82    | 1.4     | 50.7
HyperNet (B=12, F=32)  | 5.06     | 3.1     | 18.9
HyperNet (B=24, F=32)  | 6.08     | 2.7     | 21.7
HyperNet (B=36, F=32)  | 7.08     | 2.6     | 24.3
HyperNet (B=48, F=32)  | 8.05     | 2.2     | 26.2
SubNet (F=16, B=48)    | –        | –       | –
SubNet (F=32, B=48)    | –        | –       | –
SubNet (F=64, B=48)    | –        | –       | –
SubNet (F=128, B=48)   | –        | –       | –
For Auto-Vector, the hyper-parameters of the hyper-network (the number of blocks B and the number of filters F) are analyzed in Section 3.3. The embedding size is set to 512. To decouple the correlation of sub-networks, we set the path dropout rate to 0.1. For the input, the batch size is set to 40 utterances, drawn from 8 speakers with 5 utterances each. For training, we use the Adam optimizer and a linear learning-rate decay policy with a base learning rate of 0.02. For the memetic evolutionary search, the size of the population set is 100 and the number of generations is 2,000.

The Impact of Hyper-parameters. First, we parameterize our models by F, the number of filters in the first convolution layer, as shown in Table 2. When F=16 and B=24, we obtain an average EER of 3.5% with about 2.43M parameters. With each doubling of the filter number, the model size grows by nearly three times, and the equal error rate decreases as the number of filters increases. Since we use two reduction blocks in our hyper-network, the model size would be expected to grow by four times; however, due to the dense layer at the end of the NAS-based model, it grows by only nearly three times. The best model achieves 1.4% EER with around 46.82M parameters.

Then, we parameterize our models by B, the number of choice blocks. When B=12 and F=32, we obtain an average EER of 3.1% with about 5.06M parameters; the best model achieves 2.2% EER with around 8.05M parameters. With each doubling of the block number, the model size increases only a little, because there are two dense layers in the tail of our model: the weights of the dense layers dominate the model size, so the model size grows at a low rate as the number of blocks doubles.

Figure 2: The evaluation histograms of 2,000 candidate sub-networks with their parameters inherited from the hyper-network. (a) Search hyper-network with F=32 and B=24. (b) Search hyper-network with F=64 and B=48.
The Impact of Evolutionary Search.
Compared with a random selection algorithm, our memetic evolutionary algorithm generates more high-quality candidate models. The equal error rate (EER) distribution of candidate models is shown in Figure 2: most of the candidate models found by our evolutionary algorithm have a lower EER than those found by random search. Moreover, the best-accuracy model is found by our evolutionary algorithm rather than by the random algorithm. This further shows that our evolutionary algorithm explores more of the search space and thereby generates more high-quality candidate models.
Sub-Network Re-Training. As shown in Table 2, we retrain the top-10 sub-networks and report the mean x and standard deviation y as x ± y. We aim to find the best-quality model whose size is smaller than the x-vector network. With F=64 and B=48, we discover the optimal model, whose size is 5.17M and whose EER is 1.8%.
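For reference, the equal error rate used throughout the evaluation is the operating point at which the false-accept and false-reject rates coincide; a common way to compute it from verification scores is sketched here with scikit-learn (an implementation assumption, not the authors' tooling).

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """scores: one similarity per trial; labels: 1 = same speaker, 0 = impostor.
    The EER is the point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # target trials
                         rng.normal(-1.0, 1.0, 500)])  # impostor trials
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER = {equal_error_rate(scores, labels):.2%}")
```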
4. Conclusion
In this paper, we introduce the NAS conception into the well-known x-vector network. To enable more of the search space to be explored, we use an evolutionary algorithm enhanced neural architecture search framework to search for high-quality sub-networks. The experiments show that our system outperforms two state-of-the-art end-to-end methods on a public dataset. Moreover, our NAS method achieves a reduction of 36%-86% in equal error rate compared with the state-of-the-art methods.
5. Acknowledgment
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400, and No. 2017YFB1401202.

6. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[2] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, "Text-dependent speaker recognition using PLDA with uncertainty propagation," matrix, vol. 500, no. 1, 2013.
[3] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP. IEEE, 2014, pp. 1695–1699.
[4] P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Odyssey, vol. 2014, 2014, pp. 293–298.
[5] M. McLaren, Y. Lei, and L. Ferrer, "Advances in deep neural network approaches to speaker recognition," in ICASSP. IEEE, 2015, pp. 4814–4818.
[6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[7] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[8] Z. Shi, M. Wang, L. Liu, H. Lin, and R. Liu, "A double joint Bayesian approach for j-vector based text-dependent speaker verification," arXiv preprint arXiv:1711.06434, 2017.
[9] T. Fu, Y. Qian, Y. Liu, and K. Yu, "Tandem deep features for text-dependent speaker verification," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[10] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.
[11] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.
[12] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[13] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[14] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint arXiv:1611.02167, 2016.
[15] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.
[16] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[17] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., "Evolving deep neural networks," in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
[18] P. R. Lorenzo and J. Nalepa, "Memetic evolution of deep neural networks," in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2018, pp. 505–512.
[19] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 19–34.
[20] R. Negrinho and G. Gordon, "DeepArchitect: Automatically designing and training deep architectures," arXiv preprint arXiv:1704.08792, 2017.
[21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," arXiv preprint arXiv:1802.03268, 2018.
[22] T. Elsken, J. H. Metzen, and F. Hutter, "Simple and efficient architecture search for convolutional neural networks," arXiv preprint arXiv:1711.04528, 2017.
[23] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, "SMASH: one-shot model architecture search through hypernetworks," arXiv preprint arXiv:1708.05344, 2017.
[24] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in International Conference on Machine Learning, 2018, pp. 549–558.
[25] C. Zhang, M. Ren, and R. Urtasun, "Graph hypernetworks for neural architecture search," arXiv preprint arXiv:1810.05749, 2018.
[26] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
[27] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.