Angle-based Search Space Shrinking for Neural Architecture Search
Yiming Hu*, Yuding Liang*, Zichao Guo, Ruosi Wan, Xiangyu Zhang, Yichen Wei, Qingyi Gu, Jian Sun
Institute of Automation, Chinese Academy of Sciences; MEGVII Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences
{liangyuding, guozichao, wanruosi, zhangxiangyu, weiyichen, sunjian}@megvii.com
{huyiming2016, qingyi.gu}@ia.ac.cn

Abstract.
In this work, we present a simple and general search space shrinking method, called Angle-Based search space Shrinking (ABS), for Neural Architecture Search (NAS). Our approach progressively simplifies the original search space by dropping unpromising candidates, and thus can reduce the difficulty for existing NAS methods to find superior architectures. In particular, we propose an angle-based metric to guide the shrinking process. We provide comprehensive evidence showing that, in a weight-sharing supernet, the proposed metric is more stable and accurate than accuracy-based and magnitude-based metrics at predicting the capability of child models. We also show that the angle-based metric converges fast while training the supernet, enabling us to get promising shrunk search spaces efficiently. ABS can easily apply to most popular NAS approaches (e.g. SPOS, FairNAS, ProxylessNAS, DARTS and PDARTS). Comprehensive experiments show that ABS can dramatically enhance existing NAS approaches by providing promising shrunk search spaces.
Keywords: angle, search space shrinking, NAS
Neural Architecture Search (NAS), the process of automatic model design, has achieved significant progress in various computer vision tasks [33,8,20,31]. NAS methods usually search over a large search space covering billions of options to find the superior ones, which is time-consuming and challenging. Though many weight-sharing NAS methods [14,21,22,5,30] have been proposed to relieve the search efficiency problem, the challenge brought by the large and complicated search space still remains.

This work is supported by The National Key Research and Development Program of China (No. 2017YFA0700800), Beijing Academy of Artificial Intelligence (BAAI) and the National Natural Science Foundation of China under Grant 61673376.
* Equal contribution. This work was done when Yiming Hu was an intern at MEGVII Technology.

Fig. 1.
Overview of the proposed angle-based search space shrinking method. We first train the supernet for some epochs with uniform sampling. After this, all operators are ranked by their scores, and those whose rankings fall at the tail are dropped.
Shrinking the search space seems to be a feasible solution to relieve the optimization and efficiency problems of NAS over large and complicated search spaces. In fact, recent studies [7,24,25,4,18] have adopted different shrinking methods to simplify the large search space dynamically. These methods either speed up the search process or reduce the optimization difficulty in the training stage by progressively dropping unpromising candidate operators. Though existing shrinking methods have obtained decent results, it's still challenging to detect unpromising operators among lots of candidates. The key is to predict the capacity of candidates with an accurate metric. Existing NAS methods usually use an accuracy-based metric [21,26,4,18] or a magnitude-based metric [7,24,25] to guide the shrinking process. However, neither of them is satisfactory: the former is unstable and unable to accurately predict the performance of candidates in the weight-sharing setting [32], while the latter entails the rich-get-richer problem in the process of joint optimization [1,7].

In this work, we propose a novel angle-based metric to guide the shrinking process. It's obtained by computing the angle between the model's weight vector and its initialization. Recent work [6] has used a similar metric to measure the generalization of stand-alone models and demonstrates its effectiveness. For the first time, we introduce the angle-based metric to weight-sharing NAS. Compared with accuracy-based and magnitude-based metrics, the proposed angle-based metric is more effective and efficient. First, it can save heavy computation overhead by eliminating the inference procedure. Second, it has higher stability and ranking correlation than the accuracy-based metric in the weight-sharing supernet. Third, it converges faster than its counterparts, which enables us to detect and remove unpromising candidates during the early training stage.

Based on the angle-based metric, we further present a conceptually simple, flexible, and general method for search space shrinking, named Angle-Based search space Shrinking (ABS). As shown in Fig. 1, we divide the pipeline of ABS into multiple stages and progressively discard unpromising candidates according to our angle-based metric. ABS aims to get a shrunk search space covering many promising network architectures. In contrast to existing shrinking methods, the shrunk search spaces ABS finds don't rely on specific search algorithms, and thus are available for different NAS approaches to get immediate accuracy improvement.

ABS can easily apply to various NAS algorithms. We analyze and evaluate its effectiveness on NAS-Bench-201 [12] and ImageNet [17]. Our experiments show that several popular NAS algorithms consistently discover more powerful network architectures from the shrunk search spaces found by ABS. To sum up, our main contributions are as follows:

1. We clarify and verify the effectiveness of elaborately shrunk search spaces in enhancing the performance of existing NAS methods.
2. We design a novel angle-based metric to guide the process of search space shrinking, and verify its advantages, including efficiency, stability, and fast convergence, through extensive analysis experiments.
3. We propose a dynamic search space shrinking method that can be considered as a general plug-in to improve various NAS algorithms including SPOS [14], FairNAS [9], ProxylessNAS [5], DARTS [22] and PDARTS [7].
Weight-sharing NAS.
To reduce computation cost, many works [22,5,14,3,7] adopt weight-sharing mechanisms for efficient NAS. In these algorithms, different child models share the same copy of weights. Recent approaches to efficient NAS fall into two categories: one-shot methods [3,14,9] and gradient-based methods [22,5,30]. One-shot methods train an over-parameterized supernet based on various sampling strategies [14,9,3]. After this, they evaluate many child models from the supernet with the trained weights, and choose those with the best performance. Gradient-based algorithms [22,5,30] introduce architecture parameters for each operator, and jointly optimize the network weights and architecture parameters by back-propagation. Finally, they choose operators by the magnitudes of the architecture parameters.
Search Space Shrinking.
Several recent works [21,7,26,24,25,4,18] perform search space shrinking for efficient NAS. PNAS [21] and EPNAS [26] greatly improve the search efficiency compared to RL-based methods by progressively pruning and expanding the search space. PDARTS [7] proposes to shrink the search space to reduce computational overhead while bridging the large gap between the architecture depths in the search and evaluation scenarios. In order to improve the ranking quality of candidate networks, PCNAS [18] attempts to drop unpromising operators layer by layer based on one-shot methods. The shrinking techniques mentioned above are strongly associated with specific algorithms, and thus cannot easily apply to other NAS methods. In contrast, our search space shrinking method is simple and general, and can be considered as a plug-in to enhance the performance of different NAS algorithms. Moreover, an effective metric is vital for discovering less promising models or operators during search space shrinking. The accuracy-based metric [21,26,4,18] and the magnitude-based metric [7,24,25] are two widely used metrics in the NAS area. The accuracy-based metric is extremely unstable and not predictive of the true performance of candidates in the weight-sharing setting [32], while the magnitude-based metric involves the rich-get-richer problem in the process of joint optimization [1,7], thus leading to unfair competition among different operators. In contrast, our angle-based metric is much more stable and predictive, without the rich-get-richer problem.

Angle-based Metric.
Recently, the deep learning community has come to realize that the angle of weights is very useful for measuring the training behavior of neural networks: some works [19,2] theoretically prove that, due to the normalization layers widely used in neural networks, the angle of weights is more accurate than the Euclidean distance in representing the update of weights; [6] uses the angle between the weights of a well-trained network and the initialized weights to measure the generalization of the well-trained network in real data experiments. But the angle calculation method in [6] can't deal with non-parametric operators like identity and average pooling. To the best of our knowledge, no angle-based method was proposed before in the NAS field. Therefore we design a special strategy to apply the angle-based metric in NAS methods.
In this section, we first verify by experiments our claim that an elaborately shrunk search space can improve existing NAS algorithms. Then we propose an angle-based metric to guide the process of search space shrinking. Finally, we demonstrate the effectiveness of the overall angle-based search space shrinking method.
In this section, we conduct experiments to investigate the behaviors of NAS methods on various shrunk search spaces and point out that an elaborately shrunk search space can enhance existing NAS approaches. Our experiments are conducted on NAS-Bench-201 [12], which contains 15625 child models with ground-truths. We design 7 shrunk search spaces of various sizes from NAS-Bench-201, and evaluate the performance of five NAS algorithms [22,10,11,14,27] over the shrunk search spaces plus the original one.

Fig. 2 summarizes the experiment results. It shows that an elaborately shrunk search space can improve the given NAS methods by a clear margin. For example, GDAS finds the best model on CIFAR-10 from S2. On the CIFAR-100 dataset, all algorithms discover the best networks from S8. For SPOS, the best networks found on ImageNet-16-120 are from S5. However, not all shrunk search spaces are beneficial to NAS algorithms. Most of the shrunk search spaces show no superiority over the original one (S1), which makes it non-trivial to shrink the search space wisely. To deal with this issue, we propose an angle-based shrinking method to discover promising shrunk search spaces efficiently. The proposed shrinking procedure can apply to all existing NAS algorithms. We demonstrate its procedure and effectiveness in the following sections.

Fig. 2. An elaborately shrunk search space is better. We evaluate five different NAS algorithms [22,10,11,14,27] on eight search spaces.
According to [19,2], the weights of a neural network with Batch Normalization [15] are "scale invariant", which means the Frobenius norm of the weights can't affect the performance of the neural network. Due to the "scale invariant" property, the angle ΔW (defined in Eq. (1)) between the trained weights W and the initialized weights W_0 is better than the Euclidean distance of weights for representing the difference between initialized neural networks and trained ones:

\Delta W = \arccos\left(\frac{\langle W_0, W \rangle}{\|W_0\|_F \cdot \|W\|_F}\right), \qquad (1)

where \langle W_0, W \rangle denotes the inner product of W_0 and W, and \|\cdot\|_F denotes the Frobenius norm. [6] shows that ΔW is an efficient metric to measure the generalization of a well-trained stand-alone model.
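As a concrete illustration of Eq. (1), the sketch below computes the angle between a model's initialized and trained weights by flattening all learnable parameters into a single vector; the helper names are ours, not part of the original method.

```python
import numpy as np

def flatten_weights(weights):
    """Concatenate all learnable weight tensors (a dict of numpy arrays) into one 1-D vector."""
    return np.concatenate([w.reshape(-1) for w in weights.values()])

def weight_angle(init_weights, trained_weights):
    """Eq. (1): angle (in radians) between the initialized and trained weights."""
    v0 = flatten_weights(init_weights)
    vt = flatten_weights(trained_weights)
    cos = np.dot(v0, vt) / (np.linalg.norm(v0) * np.linalg.norm(vt))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards against rounding error
```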
Angle-based Metric for Child Model from Supernet. Since the angle shows a close connection to the generalization of trained networks, we consider using it to compare the performance of different child models from the supernet. However, directly using the angle ΔW of a child model may meet severe problems in the weight-sharing setting: the procedure of computing ΔW can't distinguish different structures with exactly the same learnable weights. Such a dilemma is caused by the non-parametric alternative operators ("none", "identity", "pooling") in the supernet. For example, child model 1 and child model 2 shown in Fig. 3 have exactly the same learnable weights [W_1, W_2, W_3], but child model 1 has a shortcut (OP4: identity), while child model 2 is sequential. Apparently child models 1 and 2 have different performance due to their diverse structures, but Δ[W_1, W_2, W_3] can't reflect such a difference.

Therefore, to take non-parametric operators into account, we use the following strategy to distinguish different structures with the same learnable weights. For "pooling" and "identity" operators, we assign a fixed weight to them and treat them like other operators with learnable weights: "pooling" has a k × k kernel whose elements are all 1/k, where k is the pooling size; "identity" has empty weights, which means we don't add anything to the weight vector for "identity". The "none" operator can totally change the connectivity of the child model, so we can't simply treat it as the other operators. Hence we design a new angle-based metric as follows to take the connectivity of the child model into account.

Fig. 3. Examples of the weight vector determined by structure and weights. V_1 and V_2 are the weight vectors of these child models respectively.

Definition of Angle-based Metric.
The supernet is seen as a directed acyclic graph G(O, E), where O = {o_1, o_2, ..., o_M} is the set of nodes, o_1 is the only root node (the input of the supernet), and o_M is the only leaf node (the output of the supernet); E = {(o_i, o_j, w_k) | alternative operators (including non-parametric operators except "none") from o_i to o_j with weights w_k}. Assume a child model is sampled from the supernet; it can be represented as a sub-graph g(O, Ẽ) of G, where Ẽ ⊂ E connects o_1 to o_M. The angle-based metric Δ_g given g is defined as:

\Delta_g = \arccos\left(\frac{\langle V(g, W_0), V(g, W) \rangle}{\|V(g, W_0)\|_F \cdot \|V(g, W)\|_F}\right), \qquad (2)

where W_0 is the initialized weights of the supernet G and W its trained weights; V(g, W) denotes the weight vector of g, constructed by concatenating the weights along all paths from o_1 to o_M in g. Its construction procedure is shown in Algorithm 1.

In this work, we do not distinguish "max pooling" and "average pooling" in our discussion and experiments. A path from node o_{i_1} to node o_{i_k} in a directed acyclic graph G(O, E) means there exists a subset P ⊂ Ẽ, where P = {(o_{i_1}, o_{i_2}, w_{j_1}), (o_{i_2}, o_{i_3}, w_{j_2}), ..., (o_{i_{k-1}}, o_{i_k}, w_{j_{k-1}})}.
Algorithm 1: Construction of the weight vector V(g, W) for a child model g
Input: a child model g(O, Ẽ) from the supernet, weights of the supernet W = {w_k}.
Output: weight vector V(g, W).
  Find all paths from the root node o_1 to the leaf node o_M in g: 𝒫 = {P ⊂ Ẽ | P is a path from o_1 to o_M};
  V = [∅] ([∅] means the empty vector);
  for P in 𝒫 do
      V_P = concatenate({w_k | (o_i, o_j, w_k) ∈ P});
      V = concatenate([V, V_P]);
  end
  V(g, W) = V;

The construction procedure described in Algorithm 1 makes sure that child models with diverse structures have different weight vectors, even with the same learnable weights. As an example, Fig. 3 illustrates the difference between the weight vectors of child models with "none" and "identity" (comparing child models 1 and 2). Since V(g, W) is well defined on child models from any type of supernet, we compute the angle-based metric on all child models no matter whether "none" is an alternative in the supernet.
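A minimal sketch of Algorithm 1 and Eq. (2), assuming the child model is given as a DAG whose edges carry the chosen operator's weights as numpy arrays ("identity" edges carry an empty array, "pooling" edges the fixed kernel described above); the graph encoding and function names are our own illustration, not the authors' released code.

```python
import numpy as np

def all_paths(edges, node, leaf):
    """Enumerate every path from `node` to `leaf`; `edges` maps a node to a list of
    (next_node, operator_weights) pairs describing the child model's DAG."""
    if node == leaf:
        return [[]]
    paths = []
    for nxt, w in edges.get(node, []):
        for tail in all_paths(edges, nxt, leaf):
            paths.append([w] + tail)
    return paths

def weight_vector(edges, root, leaf):
    """Algorithm 1: concatenate operator weights along every root-to-leaf path."""
    segments = []
    for path in all_paths(edges, root, leaf):
        for w in path:
            if w.size > 0:                      # "identity" contributes an empty weight
                segments.append(w.reshape(-1))
    return np.concatenate(segments) if segments else np.zeros(1)

def angle_metric(edges_init, edges_trained, root, leaf):
    """Eq. (2): angle between a child model's weight vectors at initialization and after training."""
    v0 = weight_vector(edges_init, root, leaf)
    vt = weight_vector(edges_trained, root, leaf)
    cos = np.dot(v0, vt) / (np.linalg.norm(v0) * np.linalg.norm(vt))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```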
Constructing Weight Vector on Cell-like/Block-like Supernet. Algorithm 1 presents the general construction procedure of the weight vector given a child model. It works well when the topology of the supernet isn't too complex. However, in the worst case, the length of the weight vector grows exponentially with the number of nodes, which can cause a massive computational burden when the number of nodes is large in practice. Luckily, existing popular NAS search spaces (MobileNet-like, DARTS, ...) all consist of several non-intersecting cells, which allows us to compute the angle-based metric within each cell instead of over the whole network. Specifically, we propose the following strategy as a computation-saving option (a short sketch is given below):

1. Divide the whole network into several non-intersecting blocks;
2. Construct the weight vector within each block respectively by Algorithm 1;
3. Obtain the weight vector of the child model by concatenating the weight vectors of the blocks.

Experiment results and further discussion are presented in Section 4.1.
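Under this computation-saving strategy, the per-block vectors from Algorithm 1 are simply concatenated; a short sketch reusing `weight_vector` from above, with the block description an assumed interface:

```python
def blockwise_weight_vector(blocks):
    """`blocks` is a list of (edges, root, leaf) triples, one per non-intersecting block.
    Build each block's vector with Algorithm 1 and concatenate the results."""
    parts = [weight_vector(edges, root, leaf) for edges, root, leaf in blocks]
    return np.concatenate(parts)
```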
Before demonstrating the pipeline of the angle-based shrinking method, we first need to define the angle-based score used to evaluate alternative operators. Assume P = {p_1, p_2, ..., p_N} represents the collection of all candidate operators in the supernet, where N is the number of candidate operators. We define the score of an operator as the expected angle-based metric of the child models containing that operator:

\mathrm{Score}(p_i) = \mathbb{E}_{g \in \{g \,\mid\, g \subset G,\ g\ \mathrm{contains}\ p_i\}}\,[\Delta_g], \quad i \in \{1, 2, \cdots, N\}, \qquad (3)

where g, G and Δ_g have been defined in Section 3.2, and g is uniformly sampled from {g | g ⊂ G, g contains p_i}. In practice, rather than computing the expectation in Eq. (3) precisely, we randomly sample a finite number of child models containing the operator and use the sample mean of the angle-based metric instead.
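A sketch of this sample-mean approximation of Eq. (3); `sample_child_model(op)` and `angle_of(model)` are assumed helpers that uniformly sample a child model containing the operator and return its Δ_g from Eq. (2), respectively.

```python
import numpy as np

def operator_score(op, sample_child_model, angle_of, num_samples=1000):
    """Approximate Eq. (3): average the angle-based metric over sampled child models
    that contain the candidate operator `op`."""
    angles = [angle_of(sample_child_model(op)) for _ in range(num_samples)]
    return float(np.mean(angles))
```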
Algorithm of Angle-based Shrinking Method. Now we can present the algorithm describing the pipeline shown in Fig. 1:
Algorithm 2: Angle-based Search Space Shrinking Method (ABS)
Input: a supernet G, a threshold T on the search space size, the number of operators k dropped per iteration.
Output: a shrunk supernet G̃.
  Let G̃ = G;
  while |G̃| > T do
      Train the supernet G̃ for several epochs following [14];
      Compute the score of each operator in G̃ by Eq. (3);
      Remove the k operators with the lowest scores from G̃;
  end

Note that during the shrinking process, at least one operator is preserved on each edge, since ABS should not change the connectivity of the supernet.
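The loop of Algorithm 2 written out as a sketch. The supernet interface (`size`, `candidate_operators`, `operators_on_edge`, `remove_operator`) and the training and scoring callbacks are assumptions for illustration, not the released implementation.

```python
def angle_based_shrinking(supernet, threshold, k, train_supernet, score_fn):
    """Algorithm 2 (ABS): alternately train the supernet and drop the k operators
    with the lowest angle-based scores until the search space is small enough."""
    while supernet.size() > threshold:
        train_supernet(supernet)                       # a few epochs with uniform sampling [14]
        scores = {op: score_fn(supernet, op) for op in supernet.candidate_operators()}
        dropped = 0
        for op in sorted(scores, key=scores.get):      # ascending angle-based score
            if dropped == k:
                break
            if len(supernet.operators_on_edge(op.edge)) > 1:   # keep at least one operator per edge
                supernet.remove_operator(op)
                dropped += 1
    return supernet
```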
In this section, we demonstrate the power of ABS in two aspects by experiments. First, we conduct adequate experiments to verify and analyze the effectiveness of our angle-based metric in terms of stability and convergence. All analysis experiments are conducted on NAS-Bench-201 [12], since it provides the real performance of all architectures, which can be treated as the ground truth. Second, we show that, combined with ABS, various popular NAS algorithms (SPOS, FairNAS, ProxylessNAS, DARTS, and PDARTS), whose source codes are available, achieve better performance on the shrunk MobileNet-like and DARTS search spaces.
First of all, we conduct experiments to verify whether the angle-based metric defined in Eq. (2) can really reflect the capability of stand-alone models with different structures. In detail, we uniformly select 50 child models from NAS-Bench-201 and train them from scratch to obtain fully optimized weights. Since the initialized weights are known, the angle of a model can be calculated as in Eq. (2). To quantify the correlation between the networks' capability and their angles, we rank the chosen 50 models according to their angles, and compute the Kendall rank correlation coefficient [16] (Kendall's Tau for short) against the ground truth provided by NAS-Bench-201.
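The ranking correlation used throughout this section can be computed directly from the two score lists; a minimal SciPy example with made-up values:

```python
from scipy.stats import kendalltau

# angle-based metric and ground-truth accuracy of the same models, in the same order (toy values)
angles = [1.21, 1.35, 1.18, 1.40, 1.29]
ground_truth_acc = [92.1, 93.4, 91.8, 93.9, 92.8]

tau, p_value = kendalltau(angles, ground_truth_acc)
print(f"Kendall's Tau = {tau:.3f}")
```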
Fig. 4.
The correlation between the angle-based ranking and the ground-truth ranking. We uniformly choose 50 models from NAS-Bench-201 [12] and train them from scratch.
Fig. 4 shows the correlation between the network ranking by angle and the ranking by ground truth on three datasets (CIFAR-10, CIFAR-100, ImageNet-16-120). The Kendall's Tau values on all three datasets are greater than 0.8, which suggests the angle of a model has a strong positive correlation with its capability. Therefore, it's reasonable to use the angle-based metric to compare the performance of trained models even with different structures.
Ranking Correlation in Weight-sharing Supernet.
The accuracy of a stand-alone model is the ground truth representing the real performance of the network, but the accuracy of a child model obtained from a weight-sharing supernet is not. Due to the weight co-adaptation problem in the weight-sharing supernet, the accuracy obtained from the supernet is not predictive of the real performance. So in this section, we verify the effectiveness of our angle-based metric in the weight-sharing supernet by comparing its ranking correlation with others. In detail, we first train a weight-sharing supernet constructed on the NAS-Bench-201 search space with the uniform sampling strategy [14]. Then we calculate different metrics, such as accuracy and angle, for all child models by inheriting the optimized weights from the supernet. At last, we rank the child models according to each metric and the ground truth respectively, and compute the Kendall's Tau between these two rankings as the ranking correlation. Since the magnitude-based metric can only rank operators within the same edge, we only compare the ranking correlations of the accuracy-based metric and the angle-based metric.
Table 1. The mean Kendall's Tau of 10 repeated experiments on NAS-Bench-201
Method               CIFAR-10   CIFAR-100   ImageNet-16-120
Random
Acc. w/ Re-BN [14]
Angle

Table 1 shows the ranking correlations based on three metrics (random, accuracy with Re-BN, and the angle-based metric) on three different datasets (CIFAR-10, CIFAR-100, ImageNet-16-120). The accuracy-based metric with Re-BN and the angle-based metric are both dramatically better than random selection. Importantly, our angle-based metric outperforms the accuracy-based metric by a clear margin on all three datasets, which suggests that our angle-based metric is more effective at evaluating the capability of child models from the supernet.

Re-BN means that before inferring the selected child model, we reset the batch normalization's [15] mean and variance and re-calculate them on the training dataset.
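The Re-BN procedure referenced in the footnote can be sketched in PyTorch as follows: reset each BatchNorm layer's running statistics and re-estimate them by forwarding training batches before evaluation. The number of calibration batches is our choice, not a value specified by the paper.

```python
import torch
import torch.nn as nn

def re_bn(child_model, train_loader, num_batches=20, device="cpu"):
    """Re-BN: reset BatchNorm running statistics and re-estimate them on training
    data before evaluating a child model that inherits supernet weights."""
    for m in child_model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()
    child_model.train()                      # BN updates running stats in train mode
    with torch.no_grad():
        for i, (images, _) in enumerate(train_loader):
            if i >= num_batches:
                break
            child_model(images.to(device))
    child_model.eval()
    return child_model
```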
Ranking Stability.
We have shown that our angle-based metric can achieve higher ranking correlation than the accuracy-based metric. In this section, we further discuss the ranking stability of our angle-based metric. In detail, we carry out 9 independent repeated experiments on each of the three datasets and calculate the means and variances of the ranking correlations obtained by the accuracy-based and angle-based metrics.
Fig. 5.
The ranking stability on NAS-Bench-201. Every column is the range of ranking correlation for a metric and dataset pair. A smaller column means more stable.
As Fig. 5 shows, our angle-based metric is extremely stable compared with the accuracy-based metric. It has a much smaller variance and a higher mean than the accuracy-based metric on all three datasets. This is a crucial advantage for NAS methods, as it can relieve the reproducibility problem in weight-sharing NAS approaches. The magnitude-based metric is still not included in this discussion, because it can't rank child models.
Convergence in Supernet Training.
In this section, we further investigate the convergence behaviors of the angle-based metric and the accuracy-based metric during supernet training. In the search space shrinking procedure, unpromising operators are usually removed when the supernet isn't yet well trained. Hence the quality of the metric used to evaluate child models' capability at the early training stage severely influences the final result of the shrinking method.

Fig. 6 shows the different metrics' ranking correlation with the ground truth at the early stage of supernet training. As shown in Fig. 6, our angle-based metric has higher ranking correlation on all three datasets during the first 10 epochs. Especially, there is a huge gap between the angle-based metric and the accuracy-based metric during the first 5 epochs. It suggests that our angle-based metric converges faster than the accuracy-based metric in supernet training, which makes it more powerful for guiding the shrinking procedure at the early training stage.
Fig. 6.
Ranking correlation of different metrics at the early stage of supernet training on NAS-Bench-201.
Time Cost for Metric Calculation.
Before search space shrinking, the metric reflecting the performance of child models or candidate operators must be obtained for guidance. The magnitude-based metric needs to train extra architecture parameters besides the network weights, which costs nearly double the time of supernet training. Instead, the accuracy-based metric only requires inference time by inheriting weights from the supernet, but it still costs much time when evaluating a large number of child models. Our angle-based metric can further save the inference time. To compare the time cost of calculating the accuracy-based and angle-based metrics, we train a supernet and apply these two metrics to 100 randomly selected models from NAS-Bench-201. Experiments are run ten times on an NVIDIA GTX 2080Ti GPU to calculate the mean and standard deviation. As Table 2 shows, the time cost of the angle-based metric on all three datasets is less than 1 second, while the accuracy-based metric's time costs are greater than 250 seconds.
Table 2. The processing time (100 models) of the accuracy-based and angle-based metrics on NAS-Bench-201
Method               CIFAR-10 (s)   CIFAR-100 (s)   ImageNet-16-120 (s)
Acc. w/ Re-BN [14]
Angle

Select Promising Operators.
The experiments above prove the superiority of the angle-based metric as an indicator for evaluating child models from the supernet, but we still need to verify whether it is really helpful for guiding the selection of promising operators. To verify the effectiveness of the angle-based metric for shrinking the search space, we compare the shrinking results based on three metrics (accuracy-based, magnitude-based, and angle-based). In our setting, the ground-truth score of each operator is obtained by averaging the ground-truth accuracy of all child models containing the given operator, and the ground-truth ranking is based on this score. We also rank the candidate operators according to their metric-based scores, where the angle-based score is defined as in Eq. (3); the accuracy-based score is similar to the ground-truth score except that the accuracy is obtained with weights inherited from the trained supernet; following [29], we use the magnitude-based score to rank the operators. After getting the metric-based ranks, we drop the twenty operators with the lowest rankings, and check the ground-truth rankings of the reserved operators. Fig. 7 shows the experiment results on NAS-Bench-201.
Fig. 7.
The operator distribution after shrinking in three repeated CIFAR-10 experiments on NAS-Bench-201 with different random seeds.
As shown in Fig. 7, the magnitude-based and accuracy-based metrics both remove most of the operators ranked in the top 8 by ground truth, while the angle-based metric reserves all of them. Moreover, almost all reserved operators found by the angle-based method have higher ground-truth scores than the removed ones, while the accuracy-based and magnitude-based methods seem to choose operators almost randomly. Besides, we repeat the experiments three times with different random seeds; the results show that the angle-based shrinking method can stably select promising operators with top ground-truth scores, while the shrunk spaces based on the accuracy-based and magnitude-based metrics show great uncertainty.

Though there is no guarantee that the child models with the best performance must be hidden in the shrunk search space consisting of operators with top ground-truth rankings, it's reasonable to believe we are more likely to discover well-behaved structures from elaborately shrunk search spaces with high ground-truth scores. Based on this motivation, the angle-based metric allows us to select those high-performing operators efficiently.
In this part, we conduct experiments to show the power of ABS combined with existing popular NAS algorithms. We choose five different NAS algorithms (SPOS [14], FairNAS [9], ProxylessNAS [5], DARTS [22], and PDARTS [7]), whose public codes are available, to apply our ABS method, using their own search spaces (the MobileNet-like search space for SPOS, FairNAS and ProxylessNAS, and the DARTS search space for DARTS and PDARTS). All shrinking experiments are performed on ImageNet. We randomly split the original training set into two parts: 50000 images for validation and the rest as the training set.
MobileNet-like Search Space.
The MobileNet-like search space consists of MobileNetV2 blocks with kernel sizes {3, 5, 7}, expansion ratios {3, 6} and identity as alternative operators. We test the performance of ABS with SPOS [14], ProxylessNAS [5] and FairNAS [9] on the MobileNet-like search space. SPOS and ProxylessNAS are applied on the Proxyless (GPU) search space [5], while FairNAS is applied on the same search space as [9]. We first shrink the MobileNet-like search spaces by ABS, then apply the three NAS algorithms to the shrunk spaces on ImageNet. As a comparison, we also apply these NAS methods to the original search spaces on ImageNet.

In detail, the whole shrinking process is divided into multiple stages. The supernet is trained for 100 epochs in the first shrinking stage and 5 epochs in each subsequent stage. We follow the block-like weight vector construction procedure to compute the angle-based metric. The score of each operator is acquired by averaging the angles of 1000 child models containing the given operator. Moreover, the base weights W_0 used to compute the angle are reset when over 50 operators have been removed from the original search space. This is because our exploratory experiments (see Fig. 4 in the appendix) show that after training models for several epochs, the angle between the current weights W and the initialized weights W_0 is always close to 90° due to the very high dimension of the weights (see the toy example after this paragraph). It doesn't mean the training is close to finished, but the change of angle becomes too tiny to distinguish the change of weights. Therefore, to represent the change of weights effectively during the mid-term of training, we need to reset the base weights used to compute the angle.

When sampling child models, ABS dumps models that don't satisfy the FLOPs constraint. For SPOS and ProxylessNAS, ABS removes the 7 operators whose rankings fall at the tail in each shrinking cycle. For FairNAS, ABS removes one operator per layer each time because of its fairness constraint. The shrinking process finishes when the size of the search space falls below a predefined threshold. In the re-training phase, we use the same training setting as [14] to retrain all the searched models from scratch, with one exception: dropout is added before the final fully-connected layer.
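The near-90° saturation mentioned above is the generic behavior of high-dimensional vectors; the toy snippet below (independent random vectors, not actual network weights) illustrates how the angle tends toward 90 degrees as the dimension grows, which is the reason the base weights are reset during shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (10, 1_000, 1_000_000):
    v0 = rng.standard_normal(dim)
    vt = rng.standard_normal(dim)
    cos = v0 @ vt / (np.linalg.norm(v0) * np.linalg.norm(vt))
    print(dim, np.degrees(np.arccos(cos)))   # approaches 90 degrees as dim grows
```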
Table 3. Search results on the MobileNet-like search space. * The searched models in their papers are retrained using our training setting.
Method             Flops   Top1 Acc.   Flops (ABS)   Top1 Acc. (ABS)
FairNAS [9]        322M    74.24%*
SPOS [14]          465M    75.33%*
ProxylessNAS [5]   467M    75.56%*

As Table 3 shows, all algorithms obtain significant benefits from our shrunk search spaces. SPOS and ProxylessNAS each find models from the shrunk search space with 0.6% higher accuracy than from the original search space. FairNAS also finds a better model from the shrunk search space.
The effectiveness of ABS for the DARTS and PDARTS methods on the DARTS search space is demonstrated in this part. Following the experiment settings in [22], we apply the search procedure on CIFAR-10, then retrain the selected models from scratch and evaluate them on ImageNet.

In detail, the block-like weight vector construction procedure is adopted when using ABS. The supernet is trained for 150 epochs in the first shrinking stage and 5 epochs in each subsequent stage. For the same reason as in the MobileNet-like search space, we reset the base weights once, when over 40 operators have been removed from the original search space. ABS removes one operator per edge in a shrinking cycle. The shrinking process stops when the size of the shrunk search space is less than a threshold; DARTS and PDARTS share the same threshold as the MobileNet-like search space. In the re-training stage, all algorithms use the same training setting as [7] to retrain the searched models.
Table 4. Results on the DARTS search space without human intervention (CIFAR-10). * For the form x(y), x means the model searched by us using the released code, and y means the searched model reported in the original paper.
Method        Param.           Top1 Acc.   Param. (ABS)   Top1 Acc. (ABS)
DARTS [22]    0.21M (0.26M)*
PDARTS [7]    0.29M (0.29M)*

Since these two algorithms are applied on CIFAR-10, they should yield models which perform well on CIFAR-10. Table 4 shows the search results of the two algorithms with and without ABS on the DARTS search space. All results presented in Table 4 involve no human intervention; the re-training phase has the same number of channels and layers as the searching phase. It shows that ABS can help DARTS and PDARTS get a significant improvement (0.53% at most) on the DARTS search space with CIFAR-10.
Table 5. ImageNet results on the DARTS search space. * The same notation as in Table 4.
Method                     Channels   Flops   Top1 Acc.
DARTS [22]                 48 (48)*
DARTS (ABS)                48         619M
DARTS (ABS, scale down)    45         547M
PDARTS [7]                 48 (48)*
PDARTS (ABS)               48         645M
PDARTS (ABS, scale down)   45         570M
The architectures found by DARTS and PDARTS with ABS on CIFAR-10 also perform well on ImageNet. Table 5 presents the performance of the architectures on ImageNet. Even though the search procedures are applied on CIFAR-10, the architectures found from the shrunk search space still dramatically outperform those from the original search space on ImageNet: DARTS and PDARTS get clear accuracy improvements without any human interference (0.87% for PDARTS over its counterpart from the original space, and 0.71% and 0.31% improvement respectively even compared with the results reported in [22,7]). Such vast improvement is probably due to the fact that the architectures found from the shrunk search space have more FLOPs, but it's reasonable that models with higher FLOPs are more likely to have better capability if the FLOPs are not constrained. Furthermore, to fairly compare performance under a FLOPs constraint, the channels of the architectures found from the shrunk space are scaled down to fit the constraint. Table 5 shows that even the constrained models from the shrunk search space still get better results.
In this paper, we point out that an elaborately shrunk search space can improve the performance of existing NAS algorithms. Based on this observation, we propose an angle-based search space shrinking method available to all existing NAS algorithms, named ABS. While applying ABS, we adopt a novel angle-based metric to evaluate the capability of child models from the supernet and guide the shrinking procedure. We verify the effectiveness of the angle-based metric through analysis experiments on NAS-Bench-201, and demonstrate the power of ABS combined with various NAS algorithms on multiple search spaces and datasets. All experimental results show that the proposed method is highly efficient and can significantly improve existing popular NAS algorithms.

However, some problems remain unsolved, for example, how to discriminate average pooling and max pooling, and how to handle more non-parametric operators such as different activation functions (ReLU [13], LeakyReLU [23], Swish [28], ...). In the future, we will spend more effort on discriminating more non-parametric operators using the angle-based metric in NAS.
References
1. Adam, G., Lorraine, J.: Understanding neural architecture search techniques. arXiv preprint arXiv:1904.00438 (2019)
2. Arora, S., Li, Z., Lyu, K.: Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981 (2018)
3. Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying one-shot architecture search. In: International Conference on Machine Learning. pp. 549–558 (2018)
4. Cai, H., Gan, C., Han, S.: Once for all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019)
5. Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)
6. Carbonnelle, S., De Vleeschouwer, C.: Layer rotation: a surprisingly simple indicator of generalization in deep networks? (2019)
7. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760 (2019)
8. Chen, Y., Yang, T., Zhang, X., Meng, G., Xiao, X., Sun, J.: DetNAS: Backbone search for object detection. In: Advances in Neural Information Processing Systems. pp. 6638–6648 (2019)
9. Chu, X., Zhang, B., Xu, R., Li, J.: FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845 (2019)
10. Dong, X., Yang, Y.: One-shot neural architecture search via self-evaluated template network. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 3681–3690 (2019)
11. Dong, X., Yang, Y.: Searching for a robust neural architecture in four GPU hours. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1761–1770 (2019)
12. Dong, X., Yang, Y.: NAS-Bench-201: Extending the scope of reproducible neural architecture search. In: International Conference on Learning Representations (ICLR) (2020), https://openreview.net/forum?id=HJxyZkBKDr
13. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pp. 315–323 (2011)
14. Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., Sun, J.: Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420 (2019)
15. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
16. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Communications of the ACM 60(6), 84–90 (2017)
18. Li, X., Lin, C., Li, C., Sun, M., Wu, W., Yan, J., Ouyang, W.: Improving one-shot NAS by suppressing the posterior fading. arXiv preprint arXiv:1910.02543 (2019)
19. Li, Z., Arora, S.: An exponential learning rate schedule for deep learning. arXiv preprint arXiv:1910.07454 (2019)
20. Liu, C., Chen, L.C., Schroff, F., Adam, H., Hua, W., Yuille, A.L., Fei-Fei, L.: Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 82–92 (2019)
21. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 19–34 (2018)
22. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
23. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)
24. Nayman, N., Noy, A., Ridnik, T., Friedman, I., Jin, R., Zelnik, L.: XNAS: Neural architecture search with expert advice. In: Advances in Neural Information Processing Systems. pp. 1975–1985 (2019)
25. Noy, A., Nayman, N., Ridnik, T., Zamir, N., Doveh, S., Friedman, I., Giryes, R., Zelnik-Manor, L.: ASAP: Architecture search, anneal and prune. arXiv preprint arXiv:1904.04123 (2019)
26. Pérez-Rúa, J.M., Baccouche, M., Pateux, S.: Efficient progressive neural architecture search. arXiv preprint arXiv:1808.00391 (2018)
27. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018)
28. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
29. Wang, L., Xie, L., Zhang, T., Guo, J., Tian, Q.: Scalable NAS with factorizable architectural parameters. arXiv preprint arXiv:1912.13256 (2019)
30. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10734–10742 (2019)
31. Xu, H., Yao, L., Zhang, W., Liang, X., Li, Z.: Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6649–6658 (2019)
32. Zhang, Y., Lin, Z., Jiang, J., Zhang, Q., Wang, Y., Xue, H., Zhang, C., Yang, Y.: Deeper insights into weight sharing in neural architecture search. arXiv preprint arXiv:2001.01431 (2020)
33. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8697–8710 (2018)

Appendix
Fig. 1.
Structures of searched architectures under FLOPs constraints over the MobileNet-like shrunk search space; see Table 3 for details.
Table 1.
The original search space S1 and its shrunk search spaces of various sizes from NAS-Bench-201; see Section 3.1 for details.
Search Space   Candidate Operators
S1             none, skip connect, conv 1×1, conv 3×3, average pooling 3×3
S2             skip connect, conv 1×1, conv 3×3, average pooling 3×3
S3             none, conv 1×1, conv 3×3, average pooling 3×3
S4             none, skip connect, conv, average pooling 3×3
S5             none, skip connect, conv 1×1, conv 3×3
S6             conv 1×1, conv 3×3, average pooling 3×3
S7             none, skip connect, average pooling 3×3
S8             conv 1×1, conv 3×3

Fig. 2.
Structures of searched architectures by PDARTS over the shrunk search space; see Tables 4 and 5 for details. (a) Normal cell learned on CIFAR-10. (b) Reduction cell learned on CIFAR-10.
Fig. 3.
Structures of searched architectures by DARTS over the shrunk search space; see Tables 4 and 5 for details.
Fig. 4.
The angle evolution of a stand-alone model on different datasets. The model is chosen from the NAS-Bench-201 search space. The angle values are in radians.
Fig. 5.