Semi-Supervised Neural Architecture Search
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, Tie-Yan Liu
University of Science and Technology of China, Hefei, China
Microsoft Research Asia, Beijing, China
[email protected], [email protected], {xuta, ruiwa, taoqin, tyliu}@microsoft.com

Abstract
Neural architecture search (NAS) relies on a good controller to generate better architectures or predict the accuracy of given architectures. However, training the controller requires both abundant and high-quality pairs of architectures and their accuracy, while it is costly to evaluate an architecture and obtain its accuracy. In this paper, we propose SemiNAS, a semi-supervised NAS approach that leverages numerous unlabeled architectures (without evaluation and thus nearly no cost). Specifically, SemiNAS 1) trains an initial accuracy predictor with a small set of architecture-accuracy data pairs; 2) uses the trained accuracy predictor to predict the accuracy of a large number of architectures (without evaluation); and 3) adds the generated data pairs to the original data to further improve the predictor. The trained accuracy predictor can be applied to various NAS algorithms by predicting the accuracy of candidate architectures for them. SemiNAS has two advantages: 1) It reduces the computational cost under the same accuracy guarantee. On the NASBench-101 benchmark dataset, it achieves comparable accuracy with the gradient-based method while using only 1/7 of the architecture-accuracy pairs. 2) It achieves higher accuracy under the same computational cost. It achieves 94.02% test accuracy on NASBench-101, outperforming all the baselines when using the same number of architectures. On ImageNet, it achieves a 23.5% top-1 error rate (under the 600M FLOPS constraint) using 4 GPU days for search. We further apply it to the LJSpeech text to speech task, where it achieves a 97% intelligibility rate in the low-resource setting and a 15% test error rate in the robustness setting, with 9% and 7% improvements over the baseline respectively.
1 Introduction

Neural architecture search (NAS) for automatic architecture design has been successfully applied to several tasks including image classification and language modeling [42, 26, 6]. NAS typically contains two components: a controller (also called a generator) that controls the generation of new architectures, and an evaluator that trains candidate architectures and evaluates their accuracy. The controller learns to generate relatively better architectures via a variety of techniques (e.g., reinforcement learning [41, 42], evolution [20], gradient optimization [15, 17], Bayesian optimization [39]), and plays an important role in NAS [41, 42, 18, 20, 17, 15, 39]. To ensure the performance of the controller, a large number of high-quality pairs of architectures and their corresponding accuracy are required as training data.

∗ The work was done when the first author was an intern at Microsoft Research Asia.
Although a variety of metrics including accuracy, model size, and inference speed have been used as the search criterion, the accuracy of an architecture is the most important and costly one, and the other metrics can be easily calculated with almost zero computation cost. Therefore, we focus on accuracy in this work.

However, collecting such architecture-accuracy pairs is expensive, since it is costly for the evaluator to train each architecture to accurately obtain its accuracy, which incurs the highest computational cost in NAS. Popular methods usually consume hundreds to thousands of GPU days to discover eventually good architectures [41, 20, 17]. To address this problem, one-shot NAS [2, 18, 15, 35] uses a supernet to include all candidate architectures via weight sharing and trains the supernet to reduce the training time. While greatly reducing the computational cost, these approaches degrade the quality of the training data (architectures and their corresponding accuracy) for the controller [24], and thus suffer from accuracy decline on downstream tasks.

In various scenarios with limited labeled training data, semi-supervised learning [40] is a popular approach to leverage unlabeled data to boost the training accuracy. In the scenario of NAS, unlabeled architectures can be obtained through random generation, mutation [20], or simply going through the whole search space [32], which incurs nearly zero additional cost. Inspired by semi-supervised learning, in this paper we propose SemiNAS, a semi-supervised approach for NAS that leverages a large number of unlabeled architectures. Specifically, SemiNAS 1) trains an initial accuracy predictor with a set of architecture-accuracy data pairs; 2) uses the trained accuracy predictor to predict the accuracy of a large number of unlabeled architectures; and 3) adds the generated architecture-accuracy pairs to the original data to further improve the accuracy predictor. The trained accuracy predictor can be incorporated into various NAS algorithms by predicting the accuracy of unseen architectures.

SemiNAS can be applied to many NAS algorithms.
We take the neural architecture optimization (NAO) [17] algorithm as an example, since NAO has the following advantages: 1) it takes architecture-accuracy pairs as training data to train an accuracy predictor that predicts the accuracy of architectures, which can directly benefit from SemiNAS; 2) it supports both conventional methods which train each architecture from scratch [42, 20, 17] and one-shot methods which train a supernet with weight sharing [18, 17]; and 3) it is based on gradient optimization, which has shown better effectiveness and efficiency. Although we implement SemiNAS on NAO, it can easily be applied to other NAS methods, such as reinforcement learning based methods [42, 18] and evolutionary algorithm based methods [20].

SemiNAS shows advantages over both conventional NAS and one-shot NAS. Compared to conventional NAS, it can significantly reduce the computational cost while achieving similar accuracy, and achieve better accuracy with similar cost. Specifically, on the NASBench-101 benchmark, SemiNAS achieves accuracy similar to the gradient-based method [17] using only 1/7 of the architectures. Meanwhile, it achieves 94.02% mean test accuracy, surpassing all the baselines when evaluating the same number of architectures (with the same computational cost). Compared to one-shot NAS, SemiNAS achieves higher accuracy using similar computational cost. For image classification, within 4 GPU days for search, we achieve a 23.5% top-1 error rate on ImageNet under the mobile setting. For text to speech (TTS), using a few GPU days for search, SemiNAS achieves a 97% intelligibility rate in the low-resource setting and a 15% sentence error rate in the robustness setting, which outperforms the human-designed model by 9 and 7 points respectively. To the best of our knowledge, we are the first to develop NAS algorithms for the text to speech (TTS) task. We carefully design the search space and search metric for TTS, and achieve significant improvements compared to human-designed architectures. We believe that our designed search space and metric are helpful for future studies on NAS for TTS.

2 Related Work

From the perspective of the computational cost of training candidate architectures, previous works on NAS can be categorized into conventional NAS and one-shot NAS.

Conventional NAS includes [41, 42, 20, 17], which achieve significant improvements on several benchmark datasets. Obtaining the accuracy of the candidate architectures is expensive in conventional NAS, since every single architecture is trained from scratch and usually thousands of architectures need to be trained. The total cost is usually more than hundreds of GPU days [42, 20, 17].

To reduce this huge cost, one-shot NAS was proposed with the help of a weight sharing mechanism. [2] proposes to include all candidate operations in the search space within a supernet and to share parameters among candidate architectures. Each candidate architecture is a sub-graph in the supernet and only activates the parameters associated with it. The algorithm trains the supernet and then evaluates the accuracy of candidate architectures via the corresponding sub-graphs in the supernet. [18, 17, 15, 5, 36, 4, 27, 9] also leverage the one-shot idea to perform efficient search while using different search algorithms. Such a weight sharing mechanism successfully cuts the computational cost down to a few GPU days [18, 15, 4, 36]. However, the supernet requires careful design and its training needs careful tuning. Moreover, it shows inferior performance and reproducibility compared to conventional NAS.
One main cause is the short training time and inadequate update of each individual architecture [12, 24], which leads to an inaccurate ranking of the architectures and provides relatively low-quality architecture-accuracy pairs for the controller.

To sum up, there exists a trade-off between computational cost and accuracy. We formalize the computational cost of the evaluator as C = N × T, where N is the number of architecture-accuracy pairs for the controller to learn and T is the training time of each candidate architecture. In conventional NAS, the evaluator trains each architecture from scratch and T is typically several epochs (one epoch means training on the whole dataset once) to ensure the accuracy of the evaluation, leading to a large C. In one-shot NAS, T is reduced to a few mini-batches, which is inadequate for training and therefore produces low-quality architecture-accuracy pairs. Our SemiNAS handles this computation-accuracy trade-off from a new perspective: it reduces N by leveraging a large number of unlabeled architectures.

3 Method

In this section, we first describe the semi-supervised training of the accuracy predictor, and then introduce the implementation of the proposed SemiNAS algorithm.

3.1 Semi-Supervised Training of the Accuracy Predictor
To learn from both labeled architecture-accuracy pairs and unlabeled architectures without corresponding accuracy numbers, SemiNAS trains an accuracy predictor via semi-supervised learning. Specifically, we utilize a large number (M) of unevaluated architectures to improve the accuracy predictor. To utilize the numerous unlabeled data, we leverage self-supervised learning by predicting the accuracy of unevaluated architectures [11] and then combining them with the ground-truth data to further improve the accuracy predictor. Following [34], we apply dropout as noise during training.

However, a simple accuracy predictor can hardly learn information from architectures with pseudo labels via the regression task alone, even with the techniques in [34]. Inspired by [17], we use an accuracy predictor framework consisting of an encoder f_e, a predictor f_p and a decoder f_d. The encoder is implemented as an LSTM network that maps the discrete architecture x to a continuous embedding representation e_x, and the predictor uses fully connected layers to predict the accuracy of the architecture, taking the continuous embedding e_x as input. The decoder is an LSTM that decodes the continuous embedding back to the discrete architecture in an auto-regressive manner. The three components are trained jointly via the regression task and the reconstruction task. The semi-supervised learning of the accuracy predictor can be decomposed into three steps:

• Train the encoder f_e, predictor f_p and decoder f_d with N architecture-accuracy pairs, where each architecture is trained and evaluated.
• Generate M unlabeled architectures and use the trained encoder f_e and predictor f_p to predict their accuracy.
• Use both the N architecture-accuracy pairs and the M self-labeled pairs together to train a better accuracy predictor.

The accuracy predictor learns information from a limited number of architecture-accuracy pairs, while there are still numerous unseen architectures. With the help of the decoder, the encoder and the decoder together act like an autoencoder that learns the hidden representation of architectures. Therefore the whole framework is able to learn the information of architectures in an unsupervised way, without the requirement of ground-truth labels (evaluated accuracy), and further improves the accuracy predictor as the three components are trained jointly. The trained accuracy predictor can be incorporated into various NAS algorithms by predicting the accuracy of unseen architectures for them.

SemiNAS brings advantages over both conventional NAS and one-shot NAS, which can be illustrated under the computational cost formulation C = N × T. Compared to conventional NAS, SemiNAS can reduce C with a smaller N while using additional unlabeled architectures to avoid accuracy drop, and can also further improve the performance under the same computational cost. Compared to one-shot NAS, which has inferior accuracy, SemiNAS can improve the accuracy by using more unlabeled architectures under the same computational cost C. Specifically, in order to get a more accurate evaluation of architectures and improve the quality of architecture-accuracy pairs, we can extend the average training time T of each individual architecture. Meanwhile, we reduce the number of architectures to be trained (i.e., N) to keep the total budget C unchanged. A minimal code sketch of the semi-supervised training procedure is given below.
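The following is a minimal PyTorch sketch of this procedure, not the authors' implementation: the class and function names, hidden sizes, loss weight, the simplified non-autoregressive decoder and the toy data are illustrative assumptions, and architectures are assumed to be encoded as fixed-length sequences of operation ids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccuracyPredictor(nn.Module):
    """Encoder f_e, predictor f_p and (simplified) decoder f_d, trained jointly."""
    def __init__(self, vocab_size=7, hidden=64, seq_len=20):
        super().__init__()
        self.vocab_size, self.seq_len = vocab_size, seq_len
        self.emb = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)        # f_e
        self.predictor = nn.Sequential(                                  # f_p
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(0.1),                    # dropout acts as the injected noise [34]
            nn.Linear(hidden, 1), nn.Sigmoid())
        # f_d, simplified: reconstruct all tokens from e_x in one shot;
        # the paper uses an auto-regressive LSTM decoder instead.
        self.decoder = nn.Linear(hidden, seq_len * vocab_size)

    def forward(self, x):                       # x: (batch, seq_len) operation ids
        h, _ = self.encoder(self.emb(x))
        e_x = h.mean(dim=1)                     # continuous representation e_x
        acc = self.predictor(e_x).squeeze(-1)   # predicted accuracy
        rec = self.decoder(e_x).view(-1, self.vocab_size, self.seq_len)
        return e_x, acc, rec

def train(model, archs, accs, epochs=50, lr=1e-3, lambda_rec=0.5):
    # Joint loss: accuracy regression + architecture reconstruction.
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        _, pred, rec = model(archs)
        loss = F.mse_loss(pred, accs) + lambda_rec * F.cross_entropy(rec, archs)
        loss.backward()
        opt.step()

# Step 1: train on the N labeled architecture-accuracy pairs.
N, M = 100, 1000                                # M kept small so the toy example runs fast
labeled = torch.randint(0, 7, (N, 20))          # stand-in for evaluated architectures
labels = torch.rand(N)                          # stand-in for their measured accuracy
model = AccuracyPredictor()
train(model, labeled, labels)

# Step 2: pseudo-label M unlabeled (e.g., randomly generated) architectures.
unlabeled = torch.randint(0, 7, (M, 20))
model.eval()
with torch.no_grad():
    _, pseudo, _ = model(unlabeled)

# Step 3: retrain on the labeled pairs (up-sampled) plus the pseudo-labeled pairs.
up = 10                                         # illustrative up-sampling ratio
train(model,
      torch.cat([labeled.repeat(up, 1), unlabeled]),
      torch.cat([labels.repeat(up), pseudo]))
```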
3.2 The SemiNAS Algorithm

We now describe the implementation of our SemiNAS algorithm. We take NAO [17] as the base of our implementation, since it has the following advantages: 1) it contains an encoder-predictor-decoder framework, where the encoder and the predictor can predict the accuracy of a large number of architectures without evaluation, so it is straightforward to incorporate our method; 2) it performs architecture search by applying gradient ascent, which has shown better effectiveness and efficiency; 3) it can incorporate both conventional NAS (whose evaluator trains each architecture from scratch) and one-shot NAS (whose evaluator builds a supernet to train all the architectures via weight sharing).

NAO [17] uses an encoder-predictor-decoder framework as the controller, where the encoder f_e maps the discrete architecture representation x into a continuous representation e_x = f_e(x) and the predictor f_p predicts its accuracy ŷ = f_p(e_x). It then uses a decoder f_d, implemented as a multi-layer LSTM, to reconstruct the original discrete architecture from the continuous representation, x = f_d(e_x), in an auto-regressive manner.

After the controller is trained, for any given architecture x as input, NAO moves its representation e_x in the direction of the gradient ascent of the accuracy prediction f_p(e_x) to get a new and better continuous representation e'_x as follows:

e'_x = e_x + η ∂f_p(e_x) / ∂e_x,

where η is a step size. e'_x achieves a higher prediction accuracy f_p(e'_x) after gradient ascent. NAO then uses the decoder f_d to decode e'_x into a new architecture x', which is supposed to be better than architecture x. This architecture optimization process is performed for L iterations, where newly generated architectures at the end of each iteration are added to the architecture pool for evaluation and further used to train the controller in the next iteration. Finally, the best performing architecture in the architecture pool is selected as the final result. A code sketch of the gradient-ascent step is given below.
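Continuing the sketch from Section 3.1, the gradient-ascent step could look roughly as follows; the helper name and the greedy one-shot decoding are our simplifications rather than NAO's actual code.

```python
def optimize_architecture(model, x, eta=1e-2, steps=1):
    """Move e_x = f_e(x) along the gradient of f_p(e_x) and decode a new architecture x'."""
    model.eval()                                          # turn off dropout for a stable gradient
    h, _ = model.encoder(model.emb(x))
    e_x = h.mean(dim=1).detach().requires_grad_(True)     # e_x = f_e(x)
    for _ in range(steps):
        acc = model.predictor(e_x).sum()                  # f_p(e_x)
        grad, = torch.autograd.grad(acc, e_x)
        e_x = (e_x + eta * grad).detach().requires_grad_(True)   # e'_x = e_x + eta * d f_p / d e_x
    logits = model.decoder(e_x).view(-1, model.vocab_size, model.seq_len)
    return logits.argmax(dim=1)                           # greedy decoding of x'

# Example: improve a handful of already-evaluated architectures.
# better = optimize_architecture(model, labeled[:10])
```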
Algorithm 1 Semi-Supervised Neural Architecture Search

Input: Number of architectures N to evaluate. Number of unlabeled architectures M to use. The set of architecture-accuracy pairs D = ∅ to train the encoder-predictor-decoder. Number of architectures K based on which to generate better architectures. Training steps T to evaluate each architecture. Number of optimization iterations L. Step size η.
2: Generate N architectures. Use the evaluator to train each architecture for T steps (in the conventional way or the weight sharing way).
3: Evaluate the N architectures to obtain their accuracy and form the labeled dataset D.
4: for l = 1, · · · , L do
5:   Train f_e, f_p and f_d jointly using D.
6:   Randomly generate M architectures and use f_e and f_p to predict their accuracy, forming dataset D̂.
7:   Set D̃ = D ∪ D̂.
8:   Train f_e, f_p and f_d using D̃.
9:   Pick the K architectures with top accuracy among D̃. For each architecture, obtain a better architecture by applying gradient ascent optimization with step size η.
10:  Evaluate the newly generated architectures using the evaluator and add them to D.
11: end for
Output: The architecture in D with the best accuracy.

With the semi-supervised method proposed in Section 3.1, we propose our SemiNAS as shown in Alg. 1. First we train the encoder-predictor-decoder on a limited number (N) of architecture-accuracy pairs (line 5). Then we train the encoder-predictor-decoder with the additional M unlabeled architectures (lines 6-8). Finally, we perform the step of generating new architectures in the same way as in [17] (lines 9-10). An end-to-end code sketch of this loop is given at the end of this section.

3.3 Discussions

Although SemiNAS is mainly implemented based on NAO in this paper, the key idea of utilizing the trained encoder f_e and predictor f_p to predict the accuracy of numerous unlabeled architectures can be extended to a variety of NAS methods. For reinforcement learning based algorithms [41, 42, 18], where the controller is usually an RNN model, we can predict the accuracy of the architectures generated by the RNN and take the predicted accuracy as the reward to train the controller. For evolution based methods [20], we can predict the accuracy of the architectures generated through mutation and crossover, and then take the predicted accuracy as the fitness of the generated architectures.
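Putting the pieces together, a toy end-to-end run of Alg. 1 could be sketched as below, reusing AccuracyPredictor, train and optimize_architecture from the earlier sketches; evaluate is a random stub standing in for the costly evaluator (or for querying NASBench-101), and all sizes are illustrative.

```python
def random_archs(n, vocab_size=7, seq_len=20):
    return torch.randint(0, vocab_size, (n, seq_len))

def evaluate(archs):
    # Placeholder for real evaluation (training each architecture or querying a benchmark).
    return torch.rand(archs.size(0))

def seminas(N=100, M=1000, K=100, L=2, up=10):
    model = AccuracyPredictor()
    archs = random_archs(N)                               # line 2: generate N architectures
    accs = evaluate(archs)                                # line 3: evaluate them to form D
    for _ in range(L):                                    # line 4
        train(model, archs, accs)                         # line 5: supervised training on D
        unlabeled = random_archs(M)                       # line 6: M unlabeled architectures
        model.eval()
        with torch.no_grad():
            _, pseudo, _ = model(unlabeled)               #         pseudo-label them
        train(model,                                      # lines 7-8: retrain on D ∪ D̂
              torch.cat([archs.repeat(up, 1), unlabeled]),
              torch.cat([accs.repeat(up), pseudo]))
        top_k = archs[accs.argsort(descending=True)[:K]]  # line 9: top-K architectures ...
        new_archs = optimize_architecture(model, top_k)   #         ... improved by gradient ascent
        new_accs = evaluate(new_archs)                    # line 10: evaluate the new architectures
        archs = torch.cat([archs, new_archs])
        accs = torch.cat([accs, new_accs])
    return archs[accs.argmax()]                           # output: best architecture in D

# best_arch = seminas()
```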
4 Application to Image Classification

In this section, we demonstrate the effectiveness of SemiNAS on image classification tasks. We first conduct experiments on NASBench-101 [37] and then on the commonly used large-scale ImageNet.

4.1 NASBench-101

NASBench-101 [37] designs a cell-based search space following common practice [42, 17, 15]. It includes about 423k CNN architectures and trains each architecture on CIFAR-10 three times. Querying the accuracy of an architecture from the dataset is equivalent to training and evaluating the architecture. We hope to discover comparable architectures with less computational cost, or better architectures with comparable computational cost. Specifically, on this dataset, reducing the computational cost can be regarded as decreasing the number of queries.

Setup
Both the encoder and the decoder consist of a single-layer LSTM, and the predictor is a three-layer fully connected network. We use the Adam optimizer for training. During the search, only the validation accuracy is used. After the search, we report the mean test accuracy of the selected architecture over the 3 runs provided in the dataset. We report two settings of SemiNAS. For the first setting, we use N = 100, M = 10000 and up-sample the N labeled data (by directly duplicating them). We generate 100 new architectures based on the top K = 100 architectures following line 9 in Alg. 1 at each iteration and run for L = 2 iterations.
The algorithm evaluates 100 + 100 × 2 = 300 architectures in total. For the second setting, we set N = 1100, M = 10000 and again up-sample the N labeled data. We generate 300 new architectures based on the top K = 100 architectures at each iteration and run for L = 3 iterations. The algorithm queries 1100 + 300 × 3 = 2000 architectures in total. For comparison, we evaluate random search, regularized evolution (RE) [20] and NAO as baselines, where RE is validated as the best-performing algorithm in the NASBench-101 publication. We limit the number of queries of the baselines to 2000 for fair comparison. In particular, we run NAO with two settings, using 300 and 2000 architectures, for better comparison, considering that our SemiNAS is mainly implemented based on NAO in this paper. Additionally, we also combine our semi-supervised trained accuracy predictor with RE, named SemiNAS (RE), to show the potential of SemiNAS. All the experiments are conducted multiple times and we report the averaged results. Since the best test accuracy in the dataset is 94.32% and several algorithms are approaching it, we also report the test regret (gap to 94.32%) following the guide of [37], as well as the ranking of the accuracy among the whole dataset, to better illustrate the improvements of our method.

Results
All the results are listed in Table 1. Random search achieves a reasonable mean test accuracy but with a wide confidence interval, which implies that even a small gap in accuracy on this benchmark is a significant difference and that there is still a large margin for improvement. We can see that, when using the same number of architecture-accuracy pairs (2000), SemiNAS outperforms all the baselines with 94.02% test accuracy and a corresponding 0.30% test regret, which ranks near the top of the whole space. SemiNAS with only 300 architectures achieves a test accuracy and test regret on par with NAO using 2000 architectures, while NAO using only 300 architectures achieves an accuracy that is merely better than random search. This demonstrates that, with the help of unlabeled data, SemiNAS indeed outperforms the baselines when using the same number of labeled architectures, and can achieve similar performance while using much less resources. Further, SemiNAS (RE) is on par with the RE baseline while using only half the number of labeled architectures, and outperforms RE when using the same number of labeled architectures (2000). This implies the potential of semi-supervised learning in NAS to speed up the search and to be applied to various NAS algorithms. We also conduct experiments to study the effect of the number of unlabeled architectures (M) and the up-sampling ratio in SemiNAS; the results are in Section 7.2.
Table 1: Performances of different NAS methods on the NASBench-101 dataset.
4.2 ImageNet

The previous experiments on the NASBench-101 dataset verify the effectiveness and efficiency of SemiNAS in a well-controlled environment. We further evaluate our approach on the large-scale ImageNet dataset.
Search space
We adopt a MobileNet-v2 [23] based search space following ProxylessNAS [4]. It consists of multiple stacked layers, and we search the operation of each layer. Candidate operations include mobile inverted bottleneck convolution (MBConv) layers [23] with kernel sizes {3, 5, 7} and expansion ratios {3, 6}, as well as a zero-out layer, as sketched below.
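For illustration, the per-layer candidates can be encoded as follows; the tuple format and names are ours, not an existing ProxylessNAS API.

```python
import random

# Six MBConv variants (kernel size k, expansion ratio r) plus a zero-out op.
CANDIDATE_OPS = [("mbconv", k, r) for k in (3, 5, 7) for r in (3, 6)] + [("zero",)]

def sample_architecture(num_layers):
    # An architecture is simply one op index per searchable layer.
    return [random.randrange(len(CANDIDATE_OPS)) for _ in range(num_layers)]
```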
Setup

We randomly sample a subset of images from the training data as the validation set for architecture search. Since training on ImageNet is too expensive, we adopt the weight sharing mechanism [18, 4] to perform one-shot search. We train the supernet on V100 GPUs and set N = 100, M = 4000, running the search process for L = 3 iterations. In each iteration, new and better architectures are generated based on the top K = 100 architectures following line 9 in Alg. 1. The whole search takes about 4 GPU days on V100 GPUs. To fairly compare with other works, we limit the FLOPS of the discovered architecture to be less than
600M. The discovered architecture is trained with the SGD optimizer and a cosine learning rate schedule [16]. More training details are in Section 7.1. For NAO, we use the open source code (https://github.com/renqianluo/NAO_pytorch) and train it on the same search space used in this paper. In both SemiNAS and NAO, we train the supernet for the same number of steps at each iteration to keep the same cost, while NAO uses a larger N = 1000. For ProxylessNAS, since it also optimizes latency as an additional target, for fair comparison we use their open source code (https://github.com/mit-han-lab/proxylessnas) and rerun the search while optimizing accuracy without considering latency. We limit the FLOPS of the discovered architectures to be less than 600M. We run all the experiments multiple times.
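As a rough illustration of the weight-sharing evaluator assumed above, a supernet can hold all candidate operations per layer and activate only one of them for a sampled sub-graph; this toy sketch uses plain convolutions instead of MBConv blocks and is not the ProxylessNAS implementation.

```python
import random
import torch
import torch.nn as nn

class SuperLayer(nn.Module):
    def __init__(self, candidate_ops):
        super().__init__()
        # One module per candidate op; parameters are shared by every
        # architecture that picks this op at this layer.
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, op_idx):
        return self.ops[op_idx](x)

class SuperNet(nn.Module):
    def __init__(self, num_layers=6, channels=16):
        super().__init__()
        def make_ops():
            convs = [nn.Sequential(nn.Conv2d(channels, channels, k, padding=k // 2), nn.ReLU())
                     for k in (3, 5, 7)]
            return convs + [nn.Identity()]      # Identity stands in for the zero-out (skip) op
        self.layers = nn.ModuleList(SuperLayer(make_ops()) for _ in range(num_layers))

    def forward(self, x, arch):
        # arch: one op index per layer, i.e. a sub-graph of the supernet.
        for layer, op_idx in zip(self.layers, arch):
            x = layer(x, op_idx)
        return x

net = SuperNet()
arch = [random.randrange(4) for _ in net.layers]   # uniformly sampled sub-graph
out = net(torch.randn(2, 16, 32, 32), arch)
```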
Results

From the results in Table 2, SemiNAS achieves a 23.5% top-1 test error rate on ImageNet under the 600M FLOPS constraint, which outperforms all the other NAS works. Specifically, it significantly outperforms NAO, the baseline algorithm on which SemiNAS is mainly implemented, and outperforms ProxylessNAS, on which our search space is based. The discovered architecture is depicted in Section 7.3.

Model/Method | Top-1 (%) | Top-5 (%) | Params (Million) | FLOPS (Million)
MobileNetV2 [23] | 25.3 | - | 6.9 | 585
ShuffleNet 2× (v2) [38] | 25.1 | - | - | -

Table 2: Performances of different methods on ImageNet. For fair comparison, we run NAO on the same search space used in this paper, and run ProxylessNAS by optimizing accuracy without latency.

5 Application to Text to Speech

In this section, we further explore the application of SemiNAS to a new task: text to speech. Text to speech (TTS) [31, 25, 19, 13, 21] is an important task aiming to synthesize intelligible and natural speech from text. Encoder-decoder based neural TTS [25] has achieved significant improvements. However, due to the different modalities between the input (text) and the output (speech), popular TTS models are still complicated and require much human experience when designing the model architecture. Moreover, unlike many other sequence learning tasks (e.g., neural machine translation) where the Transformer model [30] is the dominant architecture, RNN based Tacotron [31, 25], CNN based Deep Voice [1, 7, 19], and Transformer based models [13] show comparable accuracy in TTS, without one being exclusively better than the others.

The complexity of the model architecture in TTS indicates great potential for NAS on this task. However, applying NAS to TTS also has challenges, mainly in two aspects: 1) Current TTS model architectures are complicated and include many human-designed components. It is difficult but important to design the network backbone and the corresponding search space for NAS. 2) Unlike other tasks (e.g., image classification) whose evaluation is objective and automatic, the evaluation of a TTS model requires subjective judgement and human evaluation in the loop (e.g., intelligibility rate for understandability and mean opinion score for naturalness). It is impractical to use human evaluation for thousands of architectures in NAS. Thus, it is difficult but also important to design a specific and appropriate objective metric as the reward of an architecture during the search process. Next, we design the search space and evaluation metric for NAS on TTS, and apply SemiNAS to two specific TTS settings: a low-resource setting and a robustness setting.

Search space
After surveying previous neural TTS models, we choose a multi-layer encoder-decoder network as the backbone for TTS. We search the operation of each layer of the encoder and the decoder. The search space includes 11 candidate operations in total: convolution layers with kernel sizes {1, 5, 9, 13, 17, 21, 25}, Transformer layers [13] with {2, 4, 8} heads, and an LSTM layer. Specifically, we use a unidirectional LSTM layer, causal convolution layers and causal self-attention layers in the decoder to avoid seeing information from future positions. Besides, every decoder layer is equipped with an additional encoder-decoder attention layer to capture the relationship between the source and target sequences, where the dot-product multi-head attention in Transformer [30] is adopted.

Evaluation metric
Previous works [21, 31, 25, 13, 19] have shown that the quality of the attention alignment between the encoder and decoder is an important factor for the quality of the synthesized speech, and misalignment can be observed for most mistakes (e.g., skipping and repeating). Accordingly, we use the diagonal focus rate (DFR) of the attention map between the encoder and decoder as the metric of an architecture. DFR is defined as:
DFR = ( Σ_{i=1}^{I} Σ_{o=ki-b}^{ki+b} A_{o,i} ) / ( Σ_{i=1}^{I} Σ_{o=1}^{O} A_{o,i} ),

where A ∈ R^{O×I} denotes the attention map, I and O are the lengths of the source input sequence and the target output sequence, k = O/I is the slope factor, and b is the width of the diagonal area in the attention map. DFR measures how much attention lies in the diagonal area of width b in the attention matrix; it ranges in [0, 1] and the larger the better. In addition, we also tried the validation loss as the search metric, but it was inferior to DFR in our preliminary experiments.
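A direct implementation of this definition is sketched below; how the diagonal band is clipped at the borders of the attention map is our own choice, since the paper does not specify it.

```python
import numpy as np

def diagonal_focus_rate(A: np.ndarray, b: int = 1) -> float:
    # A: attention map of shape (O, I); b: half-width of the diagonal band.
    O, I = A.shape
    k = O / I                                     # slope factor k = O / I
    num = 0.0
    for i in range(I):
        center = k * i
        lo = max(0, int(np.floor(center - b)))
        hi = min(O, int(np.ceil(center + b)) + 1)
        num += A[lo:hi, i].sum()                  # attention mass inside the band
    return float(num / A.sum())                   # in [0, 1], larger is better

# Example: a map whose mass lies exactly on the slope-k diagonal scores close to 1.
O, I = 50, 40
A = np.full((O, I), 1e-6)
for i in range(I):
    A[int(round(O / I * i)), i] = 1.0
print(diagonal_focus_rate(A, b=2))
```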
Task setting

Current TTS systems are capable of achieving near human-parity quality when trained on adequate data and tested on regular sentences [25, 13]. However, they perform poorly in two specific settings: 1) the low-resource setting, where only a small amount of paired speech and text data is available; and 2) the robustness setting, where the test sentences are not regular (e.g., too short, too long, or containing many word pieces with the same pronunciations). Under these two settings, the speech synthesized by a human-designed TTS model is usually not accurate and robust (i.e., some words are skipped or repeated). Thus we apply SemiNAS to these two settings to improve accuracy and robustness.

5.1 Low-Resource Setting

We conduct experiments on the LJSpeech dataset [10], which contains text and speech data pairs with approximately 24 hours of speech audio. To simulate the low-resource scenario, we randomly split out a small subset of paired speech and text samples as the training set. We use N = 100, M = 4000 and T = 3. We adopt the weight sharing mechanism and train the supernet on 4 GPUs. The search runs for about one day on 4 P40 GPUs. Besides, we train vanilla NAO as a baseline with N = 1000. The discovered architecture is then trained on the low-resource training set on 4 GPUs; more details are provided in Section 7.1. In the inference process, the output mel-spectrograms are transformed into audio samples using Griffin-Lim [8]. We run all the experiments multiple times.

Model/Method | Intelligibility Rate (%) | DFR (%)
Transformer TTS [13] | 88 | 86
NAO [17] | 94 | 88
SemiNAS | 97 | 90

Table 3: Results on LJSpeech under the low-resource setting. “DFR” is the diagonal focus rate.
Results
We test SemiNAS, NAO [17] and Transformer TTS (following [13]) on the test sentences and report the results in Table 3. We measure performance in terms of the word-level intelligibility rate (IR), a commonly used metric to evaluate the quality of generated audio [22]. IR is defined as the percentage of test words whose pronunciation is considered correct and clear by human listeners. SemiNAS achieves a 97% IR, with significant improvements of 9 points over the human-designed Transformer TTS and 3 points over NAO. We also list the DFR metric for each method in Table 3, where SemiNAS outperforms Transformer TTS and NAO in terms of DFR, which is consistent with the results on IR and indicates that our proposed search metric DFR can indeed guide NAS algorithms to achieve better accuracy. We also use MOS (mean opinion score) [28] to evaluate the naturalness of the synthesized speech, using Griffin-Lim as the vocoder; SemiNAS outperforms the other methods in terms of MOS, which also demonstrates its advantages. The discovered architecture is attached in Section 7.3.

5.2 Robustness Setting

We train on the whole LJSpeech dataset as the training data. For the robustness test, we select the 100 sentences used in [19] (attached in Section 7.4) that are found hard for TTS models. Training details follow those of the low-resource TTS experiment. The discovered architecture is attached in Section 7.3. We run all the experiments multiple times.
Results

We report the results in Table 4, including the DFR, the number of sentences with repeating and skipping words, and the sentence-level error rate. A sentence is counted as an error if it contains a repeating or skipping word. SemiNAS is better than Transformer TTS [13] and NAO [17] on all the metrics. It reduces the error rate by 7 and 4 points compared to the Transformer TTS structure designed by human experts and the architecture searched by NAO respectively.

Model/Method | DFR (%) | Repeat | Skip | Error (%)
Transformer TTS [13] | 15 | 1 | 21 | 22
NAO [17] | 25 | 2 | 18 | 19
SemiNAS | - | - | - | 15

Table 4: Robustness test on the 100 hard sentences. “DFR” stands for diagonal focus rate.

6 Conclusion

High-quality architecture-accuracy pairs are critical to NAS; however, accurately evaluating the accuracy of an architecture is costly. In this paper, we proposed SemiNAS, a semi-supervised learning method for NAS. It leverages a small set of high-quality architecture-accuracy pairs to train an initial accuracy predictor, and then utilizes a large number of unlabeled architectures to further improve the accuracy predictor. Experiments on image classification tasks (NASBench-101 and ImageNet) and text to speech tasks (the low-resource setting and the robustness setting) demonstrate 1) the efficiency of SemiNAS in reducing the computation cost over conventional NAS while achieving similar accuracy, and 2) its effectiveness in improving the accuracy of both conventional NAS and one-shot NAS under similar computational cost. In the future, we will apply SemiNAS to more tasks such as automatic speech recognition and text summarization. Furthermore, we will explore advanced semi-supervised learning methods [33, 3] to improve SemiNAS.
Broader Impact
This work focuses on neural architecture search. It has the following potential positive impacts on society: 1) improving the performance of neural networks for better applications, and 2) reducing the human effort in designing neural architectures. At the same time, it may have some negative consequences, because architecture search may cost many computational resources.
References

[1] Sercan Ö. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning, pages 195–204. JMLR.org, 2017.
[2] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
[3] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5049–5059, 2019.
[4] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[5] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760, 2019.
[6] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
[7] Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pages 2962–2970, 2017.
[8] Daniel Griffin and Jae Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[9] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[10] Keith Ito. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
[11] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
[12] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
[13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713, 2019.
[14] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[16] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[17] Renqian Luo, Fei Tian, Tao Qin, and Tie-Yan Liu. Neural architecture optimization. arXiv preprint arXiv:1808.07233, 2018.
[18] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pages 4092–4101, 2018.
[19] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
[20] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
[21] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pages 3165–3174, 2019.
[22] Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Almost unsupervised text to speech and automatic speech recognition. arXiv preprint arXiv:1905.06791, 2019.
[23] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[24] Christian Sciuto, Kaicheng Yu, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142, 2019.
[25] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
[26] David So, Quoc Le, and Chen Liang. The evolved transformer. In International Conference on Machine Learning, pages 5877–5886, 2019.
[27] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
[28] Robert C. Streijl, Stefan Winkler, and David S. Hands. Mean opinion score (mos) revisited: methods and applications, limitations and alternatives. Multimedia Systems, 22(2):213–227, 2016.
[29] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[31] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
[32] Wei Wen, Hanxiao Liu, Hai Li, Yiran Chen, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. arXiv preprint arXiv:1912.00848, 2019.
[33] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.
[34] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[35] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: Stochastic neural architecture search, 2018.
[36] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. Pc-darts: Partial channel connections for memory-efficient differentiable architecture search, 2019.
[37] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-bench-101: Towards reproducible neural architecture search. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7105–7114, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[38] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[39] Hongpeng Zhou, Minghao Yang, Jun Wang, and Wei Pan. Bayesnas: A bayesian approach for neural architecture search. arXiv preprint arXiv:1905.04919, 2019.
[40] Xiaojin Zhu and Andrew B. Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.
[41] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[42] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
7 Appendix

7.1 Training Details

7.1.1 NASBench-101

In the first setting (N = 100, M = 10000), 100 new architectures are generated based on the top K = 100 architectures at each iteration. In the second setting (N = 1100, M = 10000), 300 new architectures are generated based on the top K = 100 architectures at each iteration. We use a trade-off parameter λ to balance the regression loss and the reconstruction loss.

7.1.2 ImageNet

We build the supernet following [4] and train it with the SGD optimizer, decaying the learning rate by a factor of 0.97 per epoch. The discovered architecture is trained on 4 P40 cards with the SGD optimizer, an initial learning rate and a cosine learning rate schedule [16].

7.1.3 TTS

We adopt the weight sharing mechanism for the search and train the supernet on 4 GPUs. The discovered architecture is trained on the training set on 4 GPUs. We use the Adam optimizer and follow the same learning rate schedule as in [13] with warmup steps.

7.2 Study of SemiNAS

In this section, we conduct experiments on NASBench-101 to study SemiNAS, including the number of unlabeled architectures M and the up-sampling ratio of labeled architectures.

Figure 1: Study of SemiNAS on NASBench-101. (a): Performances with different M. (b): Performances with different up-sampling ratios.

Number of unlabeled architectures M

We study the effect of different M on SemiNAS. Given N = 100, we vary M over a range of values and plot the results in Fig. 1(a). Notice that M = 0 is equivalent to NAO without using any additional unlabeled architectures. We can see that the test accuracy increases as M increases, indicating that utilizing unlabeled architectures indeed helps the training of the controller and generates better architectures.

Up-sampling ratio
Since N is much smaller than M, we up-sample the labeled data to balance the two. We study how the up-sampling ratio affects the effectiveness of SemiNAS on NASBench-101. We set N = 100 and M = 10000 and vary the up-sampling ratio over several values, where a ratio of 1 means no up-sampling. The results are depicted in Figure 1(b). We can see that the final accuracy benefits from up-sampling but does not continue to improve when the ratio is too high.

7.3 Discovered Architectures

We show the architectures discovered by SemiNAS for each task.

ImageNet
We adopt the ProxylessNAS [4] search space, which is built on the MobileNet-V2 [23] backbone. It contains several different stages and each stage consists of multiple layers. We search the operation of each individual layer. There are 7 candidate operations in the search space:

• MBConv (k=3, r=3)
• MBConv (k=3, r=6)
• MBConv (k=5, r=3)
• MBConv (k=5, r=6)
• MBConv (k=7, r=3)
• MBConv (k=7, r=6)
• zero-out layer

where MBConv is mobile inverted bottleneck convolution, k is the kernel size and r is the expansion ratio [23]. Our discovered architecture for ImageNet is depicted in Fig. 2.

Figure 2: Architecture for ImageNet discovered by SemiNAS. “MBConv3” and “MBConv6” denote mobile inverted bottleneck convolution layers with expansion ratios of 3 and 6 respectively.

TTS

We adopt an encoder-decoder based architecture as the backbone and search the operation of each layer. Candidate operations include:

• Convolution layer with kernel size of 1
• Convolution layer with kernel size of 5
• Convolution layer with kernel size of 9
• Convolution layer with kernel size of 13
• Convolution layer with kernel size of 17
• Convolution layer with kernel size of 21
• Convolution layer with kernel size of 25
• Transformer layer with head number of 2
• Transformer layer with head number of 4
• Transformer layer with head number of 8
• LSTM layer
Low-Resource Setting
The architecture discovered by SemiNAS for the low-resource setting is shown in Fig. 3.

Figure 3: Architecture for the low-resource setting discovered by SemiNAS.
Robustness Setting
The architecture discovered by SemiNAS for the robustness setting is shown in Fig. 4.

Figure 4: Architecture for the robustness setting discovered by SemiNAS.
7.4 The 100 Hard Sentences

We list the 100 sentences we use for the robustness setting:

a b c.
x y z.
hurry.
warehouse.
referendum.
is it free?
justifiable.
environment.
debt runs.
gravitational.
cardboard film.
person thinking.
prepared killer.
aircraft torture.
allergic trouser.
strategic conduct.
worrying literature.
christmas is coming.
a pet dilemma thinks.
how was the math test?
good to the last drop.
an m b a agent listens.
a compromise disappears.
an axis of x y or z freezers.
she did her best to help him.
a backbone contests the chaos.
two a greater than two n nine.
don’t step on the broken glass.
a damned flips into the patient.
a trade purges within the b b c.
i’d rather be a bird than a fish.
i hear that nancy is very pretty.
i want more detailed information.
please wait outside of the house.
n a s a exposure tunes the waffle.
a mist dictates within the monster.
a sketch ropes the middle ceremony.
every farewell explodes the career.
she folded here handkerchief neatly.
against the steam chooses the studio.
rock music approaches at high velocity.
nine adam baye study on the two pieces.
an unfriendly decay conveys the outcome.
abstraction is often one floor above you.
a played lady ranks any publicized preview.
he told us a very exciting adventure story.
on august twenty eight mary plays the piano.
into a controller beams a concrete terrorist.
i often see the time eleven eleven on clocks.
it was getting dark and we weren’t there yet.
against every rhyme starves a choral apparatus.
everyone was busy so i went to the movie alone.
i checked to make sure that he was still alive.
a dominant vegetarian shies away from the g o p.
joe made the sugar cookies susan decorated them.
i want to buy a onesie but know it won’t suit me.
a former override of q w e r t y outside the pope.
f b i says that c i a says i’ll stay way from it.
any climbing dish listens to a cumbersome formula.
she wrote him a long letter but he didn’t read it.
dear beauty is in the heat not physical i love you.
an appeal on january fifth duplicates a sharp queen.
a farewell solos on march twenty third shakes north.
he ran out of money so he had to stop playing poker.
for example a newspaper has only regional distribution t.
i currently have four windows open up and i don’t know why.
next to my indirect vocal declines every unbearable academic.
opposite her sounding bag is a m c’s configured thoroughfare.
from april eighth to the present i only smoke four cigarettes.
i will never be this young again every oh damn i just got older.
a generous continuum of amazon dot com is the conflicting worker.
she advised him to come back at once the wife lectures the blast.
a song can make or ruin a person’s day if they let it get to them.
she did not cheat on the test for it was not the right thing to do.
he said he was not there yesterday however many people saw him there.
should we start class now or should we wait for everyone to get here?
if purple people eaters are real where do they find purple people to eat?
on november eighteenth eighteen twenty one a glittering gem is not enough.
a rocket from space x interacts with the individual beneath the soft flaw.
malls are great places to shop i can find everything i need under one roof.
i think i will buy the red car or i will lease the blue one the faith nests.
italy is my favorite country in fact i plan to spend two weeks there next year.
i would have gotten w w w w dot google dot com but my attendance wasn’t good enough.
nineteen twenty is when we are
unique together until we realise we are all the same.
my mum tries to be cool by saying h t t p colon slash slash w w w b a i d u dot com.
he turned in the research paper on friday otherwise he emailed a s d f at yahoo dot org.
she works two jobs to make ends meet at least that was her reason for no having time to join us.
a remarkable well promotes the alphabet into the adjusted luck the dress dodges across my assault.
a b c d e f g h i j k l m n o p q r s t u v w x y z one two three four five six seven eight nine ten.
across the waste persists the wrong pacifier the washed passenger parades under the incorrect computer.
if the easter bunny and the tooth fairy had babies would they take your teeth and leave chocolate for you?
sometimes all you need to do is completely make an ass of yourself and laugh it off to realise that life isn’t so bad after all.
she borrowed the book from him many years ago and hasn’t yet returned it why won’t the distinguishing love jump with the juvenile?
last friday in three week’s time i saw a spotted striped blue worm shake hands with a legless lizard the lake is a long way from here.
i was very proud of my nickname throughout high school but today i couldn’t be any different to what my nickname was the metal lusts the ranging captain charters the link.
i am happy to take your donation any amount will be greatly appreciated the waves were crashing on the shore it was a lovely sight the paradox sticks this bowl on top of a spontaneous tea.
a purple pig and a green donkey flew a kite in the middle of the night and ended up sunburn the contained error poses as a logical target the divorce attacks near a missing doom the opera fines the daily examiner into a murderer.
as the most famous singer-songwriter jay chou gave a perfect performance in beijing on may twenty fourth twenty fifth and twenty sixth twenty three all the fans thought highly of him and took pride in him all the tickets were sold out.
if you like tuna and tomato sauce try combining the two it’s really not as bad as it sounds the body may perhaps compensates for the loss of a true metaphysics the clock within this blog and the clock on my laptop are on hour different from each other.
someone i know recently combined maple syrup and buttered popcorn thinking it would taste like caramel popcorn it didn’t and they don’t recommend anyone else do it either the gentleman marches around the principal the divorce attacks near a missing doom the color misprints a circular worry across the controversy.
7.5 Audio Demo

We provide demos for both the low-resource setting and the robustness setting of the TTS experiments. Specifically, we provide 10 test cases for each setting, with their ground-truth audio (if it exists), the audio generated by Transformer TTS, and the audio generated by SemiNAS. The demo is available as a web page at https://speechresearch.github.io/seminas, where one can directly listen to the audio samples.

7.6 Implementation Details

We implement all the code in PyTorch version 1.2. We implement the core architecture search algorithm following NAO [17] (https://github.com/renqianluo/NAO_pytorch). For downstream tasks, we implement the code following the corresponding baselines. For the ImageNet experiments, we build our code based on the ProxylessNAS implementation (https://github.com/mit-han-lab/proxylessnas). For the TTS experiments, we build the code following Transformer TTS [13], which is originally in TensorFlow.