Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels
Binxin Ru∗, Xingchen Wan∗, Xiaowen Dong, Michael A. Osborne
Machine Learning Research Group, University of Oxford, Oxford, UK
{robin, xwan, xdong, mosb}@robots.ox.ac.uk
∗ Equal contribution.
Abstract
Bayesian optimisation (BO) has been widely used for hyperparameter optimisation but its application in neural architecture search (NAS) is limited due to the non-continuous, high-dimensional and graph-like search spaces. Current approaches either rely on encoding schemes, which are not scalable to large architectures and ignore the implicit topological structure of architectures, or use graph neural networks, which require additional hyperparameter tuning and a large amount of observed data, which is particularly expensive to obtain in NAS. We propose a neat BO approach for NAS, which combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate to capture the topological structure of architectures, without having to explicitly define a Gaussian process over high-dimensional vector spaces. We also harness the interpretable features learnt via the graph kernel to guide the generation of new architectures. We demonstrate empirically that our surrogate model is scalable to large architectures and highly data-efficient; competing methods require 3 to 20 times more observations to achieve equally good prediction performance as ours. We finally show that our method outperforms existing NAS approaches to achieve state-of-the-art results on NAS datasets.
Neural architecture search (NAS) is a recent and popular research direction that aims to automate the design of good neural network architectures for a given task/dataset. Neural architectures found via different NAS strategies have demonstrated state-of-the-art performance, outperforming human experts' designs on a variety of tasks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Similar to hyperparameter optimisation, NAS can often be formulated as a black-box optimisation problem [11] and evaluation of the objective can be very expensive as it involves training the queried architecture. Under this setting, query efficiency is highly valued [12, 11, 13] and Bayesian optimisation (BO), which has been successfully applied to hyperparameter optimisation [14, 15, 16, 17, 18, 19, 20], becomes a natural choice to consider.

Yet, conventional BO methods cannot be applied directly to NAS, as the popular cell-based search spaces are non-continuous and graph-like [9, 21, 22]; each possible architecture can be represented as a directed acyclic graph whose nodes correspond to operation units/layers and whose edges represent connections among these units [21]. To accommodate such search spaces, [12] and [23] propose distance metrics for comparing architectures, thus enabling the use of Gaussian process (GP)-based BO for NAS. [21] and [24] encode architectures with a high-dimensional vector of discrete and categorical variables. However, both the distance metrics and the encoding schemes are not scalable to large architectures [24] and overlook the topological structure of the architectures [13], which can be important [25]. Another line of work adopts graph neural networks (GNNs) in combination with Bayesian linear regression as the BO surrogate model to predict architecture performance [26, 27, 13]. These approaches treat architectures as attributed graph data and consider the graph topology of architectures. But the GNN design introduces additional hyperparameter tuning, and the training of the GNN also requires a relatively large amount of architecture data. Moreover, applying Bayesian linear regression on the features extracted by a GNN is not a principled way to obtain model uncertainty compared to a GP and often leads to poor uncertainty estimates [28]. The extracted features are also hard to interpret, and thus not helpful for guiding practitioners to generate new architectures.

In view of the above limitations, we propose a new BO-based NAS approach which uses a GP in combination with a graph kernel. It naturally handles the graph-like architecture spaces and takes into account the topological structure of architectures. Meanwhile, the surrogate preserves the merits of GPs in data-efficiency, uncertainty computation and automated surrogate hyperparameter treatment. Specifically, our main contributions can be summarised as follows.

• We introduce a GP-based BO strategy for NAS, NAS-BOWL, which is highly query-efficient and amenable to the graph-like NAS search spaces. Our proposed surrogate model combines a GP with a Weisfeiler-Lehman subtree (WL) graph kernel to exploit the implicit topological structure of architectures. It is scalable to large architecture cells (e.g. 32 nodes) and can achieve better prediction performance than GNN-based surrogates with much less training data.

• We harness the interpretable graph features extracted by the WL graph kernel and propose to learn their corresponding effects on the architecture performance based on surrogate gradient information.
We then demonstrate the usefulness of this on an application: guiding architecture mutation/generation when optimising the acquisition function.

• We empirically demonstrate that our surrogate model achieves superior performance with far fewer observations in NAS search spaces of different sizes. We finally show that our search strategy achieves state-of-the-art performance on both NAS-Bench datasets.
Architectures in popular NAS search spaces can be represented as a directed acyclic graph [11, 29, 21, 22, 25] where each graph node represents an operation unit or layer (e.g. a conv3×3 in [21], which denotes a sequence of operations: convolution with a 3×3 filter, batch normalisation and ReLU activation) and each edge defines the information flow from one layer to another. With this representation, NAS can be formulated as an optimisation problem to find the directed graph and its corresponding node operations (i.e. the directed attributed graph $G$) that give the best architecture validation performance $y(G)$: $G^{*} = \arg\max_{G} y(G)$.

To solve the above optimisation, we adopt BO, which is a query-efficient technique for optimising a black-box, expensive-to-evaluate objective [30]. BO uses a statistical surrogate to model the objective and builds an acquisition function based on the surrogate. The next query location is recommended by optimising the acquisition function, which balances exploitation and exploration. We use a GP as the surrogate model in this work, as it can achieve competitive modelling performance with a small number of query data [31] and gives an analytic predictive posterior mean $\mu(G_t \mid \mathcal{D}_{t-1})$ and variance $k(G_t, G'_t \mid \mathcal{D}_{t-1})$:

$$\mu(G_t \mid \mathcal{D}_{t-1}) = \mathbf{k}(G_t, \mathbf{G}_{1:t-1})\,\mathbf{K}_{t-1}^{-1}\,\mathbf{y}_{1:t-1},$$
$$k(G_t, G'_t \mid \mathcal{D}_{t-1}) = k(G_t, G'_t) - \mathbf{k}(G_t, \mathbf{G}_{1:t-1})\,\mathbf{K}_{t-1}^{-1}\,\mathbf{k}(\mathbf{G}_{1:t-1}, G'_t),$$

where $\mathbf{G}_{1:t-1} = \{G_1, \dots, G_{t-1}\}$ and $[\mathbf{K}_{t-1}]_{i,j} = k(G_i, G_j)$, with $k(\cdot,\cdot)$ being the graph kernel function. We experiment with Expected Improvement [32] as the acquisition function in this work, though our approach is compatible with alternative choices.

Graph kernels are kernel functions defined over graphs to compute their level of similarity. A generic graph kernel may be represented by the function $k(\cdot,\cdot)$ over a pair of graphs $G$ and $G'$ [33]:

$$k(G, G') = \langle \phi(G), \phi(G') \rangle_{\mathcal{H}}, \qquad (2.1)$$

where $\phi(\cdot)$ represents some vector embedding/features of the graph extracted by the graph kernel and $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ represents an inner product in the reproducing kernel Hilbert space (RKHS) [34, 33]. For more detailed reviews on graph kernels, the readers are referred to [34] and [33].
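To make this construction concrete, below is a minimal numpy sketch (not the paper's implementation) of the GP predictive posterior given above and the Expected Improvement acquisition, assuming a pairwise graph kernel function `graph_kernel(G, G_prime)` is available as a stand-in for the WL kernel introduced in Section 3.1.

```python
import numpy as np
from scipy.stats import norm

def gp_posterior(graph_kernel, train_graphs, y_train, test_graphs, jitter=1e-6):
    """Predictive mean and variance of a GP whose kernel is a graph kernel."""
    K = np.array([[graph_kernel(g, h) for h in train_graphs] for g in train_graphs])
    K += jitter * np.eye(len(train_graphs))          # jitter for numerical stability
    k_star = np.array([[graph_kernel(g, h) for h in train_graphs] for g in test_graphs])
    k_ss = np.array([graph_kernel(g, g) for g in test_graphs])
    mean = k_star @ np.linalg.solve(K, y_train)
    var = k_ss - np.einsum('ij,ij->i', k_star, np.linalg.solve(K, k_star.T).T)
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, best_so_far):
    """EI for maximising validation accuracy."""
    std = np.sqrt(var)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm.cdf(z) + std * norm.pdf(z)
```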
Figure 1: Illustration of one WL iteration on a NAS-Bench-101 cell. Given two architectures at initialisation, the WL kernel first collects the neighbourhood labels of each node (Step 1) and compresses the collected original labels, i.e. the features at h = 0 (initialisation), into features at h = 1 (Step 2). Each node is then relabelled with the compressed label/h = 1 feature (Step 3) and the two graphs are compared based on the histograms of both the h = 0 and h = 1 features (Step 4). This WL iteration is repeated until h = H. h is both the index of the WL iteration and the depth of the subtree features extracted. Substructures at h = 0 and h = 1 in Arch A are shown in the middle right of the plot.

Here we discuss two key aspects of our BO-based NAS strategy; the overall algorithm is presented in App. A.
Our primary proposal is to use an elegant graph kernel to circumvent the aforementioned limitations of GP-based BO in the NAS setting and enable the direct definition of a GP surrogate on the graph-like search space. This construction preserves desirable properties of GPs, such as principled uncertainty estimation, which is key for BO, and allows the user to deploy the rich advances in GP-based BO (including parallel computation [35, 14, 36, 37, 38, 39], multi-objective optimisation [40, 41, 42, 43] and transfer learning [44, 45, 46, 47, 48]) on NAS problems. The kernel choice encodes our prior knowledge on the objective and is crucial in GP modelling. Here we opt to base our GP surrogate on the Weisfeiler-Lehman (WL) graph kernel family [49] (we term our surrogate GPWL). In this section, we first illustrate the mechanism of the WL kernel, followed by our rationale for choosing it.

The WL kernel compares two directed graphs based on both local and global structures. It starts by comparing the node labels of both graphs (the h = 0 features 0 to 4 in Fig. 1) via a base kernel $k_{\mathrm{base}}\big(\phi_0(G), \phi_0(G')\big)$, where $\phi_0(G)$ denotes the histogram of h = 0 features in the graph and h is the index of WL iterations and the depth of the subtree features extracted. For a WL kernel with h > 0, it then proceeds to collect h = 1 features following Steps 1 to 3 in Fig. 1 and compares the two graphs with $k_{\mathrm{base}}\big(\phi_1(G), \phi_1(G')\big)$ based on the subtree structures of depth 1 [49, 50]. The procedure then repeats until the highest specified iteration level h = H, and the resultant WL kernel is given by

$$k_{\mathrm{WL}}^{H}(G, G') = \sum_{h=0}^{H} w_h\, k_{\mathrm{base}}\big(\phi_h(G), \phi_h(G')\big). \qquad (3.1)$$

Here $k_{\mathrm{base}}$ is a base kernel specified by the user, a simple example being the dot product of the feature embeddings $\langle \phi_h(G), \phi_h(G') \rangle$, and $w_h$ is the weight associated with each WL iteration h. We follow the convention in [49] and set all the weights equal. Note that, as a result of the WL label reassignment, the node labels in Arch A at initialisation (h = 0) are different from those in Arch A at Step 3 (h = 1); h = 0 features represent subtrees of depth 0 while h = 1 features are subtrees of depth 1. In this way, as h increases, the WL kernel captures higher-order features which correspond to increasingly larger neighbourhoods (see App. A for an algorithmic description of WL).

We argue that the WL kernel is a desirable choice for the NAS application for the following reasons.
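As an illustration of equation (3.1), the toy sketch below (an assumption-laden example, not the paper's code) combines per-iteration WL feature histograms with a dot-product base kernel and equal weights; the histograms themselves would be produced by the relabelling procedure of Algorithm 2 in App. A, and the feature strings used here are purely illustrative.

```python
from collections import Counter

def dot_product(hist1, hist2):
    """Base kernel: dot product of two sparse feature histograms."""
    return sum(count * hist2.get(feat, 0) for feat, count in hist1.items())

def wl_kernel_from_histograms(hists_g1, hists_g2, weights=None):
    """Equation (3.1): weighted sum of base kernels over WL iterations h = 0..H.
    hists_g1[h] and hists_g2[h] are Counters of the h-th iteration subtree features."""
    H = len(hists_g1) - 1
    weights = weights if weights is not None else [1.0] * (H + 1)  # equal weights, as in [49]
    return sum(w * dot_product(f1, f2)
               for w, f1, f2 in zip(weights, hists_g1, hists_g2))

# illustrative h=0 histograms (node labels) and h=1 histograms (depth-1 subtrees)
g1 = [Counter({'input': 1, 'conv3x3': 2, 'output': 1}),
      Counter({'input~conv3x3': 1, 'conv3x3~input,output': 1})]
g2 = [Counter({'input': 1, 'conv3x3': 1, 'maxpool3x3': 1, 'output': 1}),
      Counter({'input~conv3x3': 1, 'conv3x3~output': 1})]
print(wl_kernel_from_histograms(g1, g2))  # -> 5.0
```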
1. The WL kernel is able to compare labelled and directed graphs of different sizes. As discussed in Section 2.1, architectures in almost all popular NAS search spaces [21, 22, 29, 25] can be represented as directed graphs with node/edge attributes, so the WL kernel can be directly applied to them. By contrast, many graph kernels either do not handle node labels [51] or are incompatible with directed graphs [52, 53]. Converting architectures into undirected graphs can result in the loss of valuable information such as the direction of data flow in the architecture (we show this in Section 5.1).
2. The WL kernel is expressive yet highly interpretable.
The WL kernel captures substructures that go from local to global scale with increasing h values. Such multi-scale comparison is similar to that enabled by the Multiscale Laplacian kernel [52] and is desirable for architecture comparison. This is in contrast to graph kernels such as [54, 51], which only focus on local substructures, or those based on graph spectra [53], which only look at global connectivity. Furthermore, the WL kernel is derived directly from the Weisfeiler-Lehman graph isomorphism test [55], which is shown to be as powerful as a GNN in distinguishing non-isomorphic graphs [56, 57]. However, the higher-order graph features extracted by GNNs are hard for humans to interpret, whereas the subtree features learnt by the WL kernel (e.g. the h = 0 and h = 1 features in Figure 1) are easily interpretable. As we will discuss in Section 3.2, we can harness the surrogate gradient information on low-h substructures to identify the effect of particular node labels on the architecture performance, and thus learn useful information to guide new architecture generation.

3. The WL kernel is relatively efficient and scalable.
Other expressive graph kernels are often prohibitive to compute: for example, defining {n, m} as the number of nodes and edges in a graph, the random walk [58], shortest path [59] and graphlet kernels [51] incur complexities of O(n^3), O(n^4) and O(n^k) respectively, where k is the maximum graphlet size. Another approach, based on computing the architecture edit-distance [23], is also expensive: its exact solution is NP-complete [60] and is provably difficult to approximate [61]. On the other hand, the WL kernel only entails a complexity of O(hm) [49]. Empirically, we find that in typical NAS search spaces (such as the NAS-Bench datasets) featuring rather small cells, a small h usually suffices (even in the deliberately large cell search space we construct later, a modest h is sufficient); this implies that the kernel computing cost is likely eclipsed by the O(N^3) complexity of GPs (naively computing the Gram matrix over all pairs of N graphs costs O(N^2 hm), which can be further improved to O(Nhm + N^2 hn); see [56]), not to mention that the main bottleneck of NAS is the actual training of the architectures. The scalability of WL is also to be contrasted with other approaches such as an encoding of all input-output paths [24], which without truncation scales exponentially with n.

With the above-mentioned merits, the incorporation of the WL kernel permits the use of GP-based BO on various NAS search spaces. This enables practitioners to harness the rich literature of GP-based BO methods for hyperparameter optimisation and redeploy them on NAS problems. Meanwhile, the use of a GP surrogate frees us from hand-picking the WL kernel hyperparameter h, as we can automatically learn its optimal value by maximising the Bayesian marginal likelihood. This leads to a method with almost no inherent hyperparameters that require manual tuning. We empirically justify the superior prediction performance of our GP surrogate with a WL kernel against other graph kernels and GNNs in Section 5.1. Note that we may further improve the expressiveness of the surrogate by adding multiple types of kernels together, especially if the kernels used capture different aspects of graph information. We briefly investigate this in App. B and find that the extent of the performance gain depends on the NAS search space; a WL kernel alone can be sufficient for common cell-based spaces. We leave a comprehensive evaluation of such kernel combinations to future work.

Figure 2: Mean derivatives and the corresponding mean ± std of validation accuracy for different node feature types in NAS-Bench-101 (N101) (a)(b) and NAS-Bench-201 (N201) (c)(d). The x-axis is the number of nodes having the same specific type of node feature in the architectures. Note that architectures containing more negative-derivative features tend to have lower accuracy and vice versa; thus, the derivative information is useful in assessing the effects of node features.

In the preceding section, we have elaborated the advantages of using WL graph kernels to make the NAS search space amenable to GP-based BO. A key advantage of WL that we identify in Section 3.1 is that it extracts interpretable features. Here, we demonstrate that its integration with the GP surrogate can further allow us to distinguish the relative impact of these features on the architecture performance.
This could potentially be a starting point towards explainable NAS, and practically it can be helpful for practitioners who are not only interested in finding good-performing architectures but also in how to modify an architecture to further improve its performance.

To assess the effect of the extracted features, we propose to utilise the derivatives of the GP predictive mean. Derivatives as tools for interpretability have been used previously [62, 63, 64], but the GP surrogate and WL kernel in our method mean that we may compute derivatives with respect to the interpretable features analytically. Formally, the derivative with respect to the j-th element of $\phi_t = \phi(G_t)$ (the feature vector of a test graph) is also Gaussian and has expected value

$$\mathbb{E}_{p(y \mid G_t, \mathcal{D}_{t-1})}\!\left[\frac{\partial y}{\partial \phi_t^{j}}\right] = \frac{\partial \mu}{\partial \phi_t^{j}} = \frac{\partial \langle \phi_t, \mathbf{\Phi}_{t-1} \rangle}{\partial \phi_t^{j}}\,\mathbf{K}_{t-1}^{-1}\,\mathbf{y}_{1:t-1}, \qquad (3.2)$$

where $\phi(\cdot)$ has the same meaning as in equation (2.1) and $\mathbf{\Phi}_{t-1} = [\phi(G_1), \dots, \phi(G_{t-1})]^{\top}$ is the feature matrix stacked from the feature vectors of the previous observations. Intuitively, since each $\phi_t^{j}$ denotes the count of a WL feature in $G_t$, its derivative naturally encodes the direction and sensitivity of the GP objective (in this case the predicted validation accuracy) with respect to that particular feature.

We illustrate the usefulness of the gradient information with an example on h = 0 features in Fig. 2 (we choose h = 0 for ease of illustration; our method applies to WL features of higher h without loss of generality). We randomly sample 100 architectures to train our GPWL surrogate and reserve another 500 samples as the validation set, for both the NAS-Bench-101 and NAS-Bench-201 datasets. We evaluate the gradient with respect to the h = 0 features (i.e. the node operation types) on the validation set and compute the average gradient among graphs with the same number of a specific node feature. The mean gradient results for different node feature types are shown in Fig. 2a and 2c, while the validation accuracies achieved by architectures containing different numbers of the various node feature types are shown in Fig. 2b and 2d. For example, in Fig. 2a, the gradient of maxpool3×3 at two occurrences is the average gradient of the GP posterior mean with respect to that node feature across all validation architectures which contain two maxpool3×3 nodes; the corresponding point in Fig. 2b is the average (± one standard deviation) validation accuracy of those architectures. The conv3×3 operation always has positive gradients in Fig. 2a, which informs us that having more such features likely contributes to better accuracy. Conversely, the negative gradients of conv1×1 suggest that this operation is relatively undesirable, whereas the near-zero gradients of maxpool3×3 suggest that this operation has little impact on the architecture performance. These observations are confirmed by the accuracy plots in Fig. 2b, and similar results are observed on the NAS-Bench-201 data. The trends at high feature counts may be affected by the saturating benefit of a feature (e.g. the conv3×3 results in Fig. 2b) or by the increasingly rare observations of extreme architectures, which make the posterior mean of the GP surrogate converge to the prior mean of zero.

The interpretability offered here can be useful in many respects. Here, we demonstrate one example: harnessing the feature gradient information to generate candidate architectures for acquisition function optimisation.
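The derivative in equation (3.2) is straightforward to evaluate in practice. Below is a small numpy sketch for the dot-product base kernel, for which the derivative reduces to $\mathbf{\Phi}_{t-1}^{\top}\mathbf{K}_{t-1}^{-1}\mathbf{y}_{1:t-1}$ (cf. equation (C.2) in App. C); the feature names and numbers are illustrative only.

```python
import numpy as np

def feature_gradients(Phi_train, K_train, y_train):
    """Expected derivative of the GP posterior mean w.r.t. the WL feature counts,
    for a dot-product base kernel: d mu / d phi = Phi^T K^{-1} y."""
    return Phi_train.T @ np.linalg.solve(K_train, y_train)

# toy example: 3 training graphs described by 2 WL features (counts of two node ops)
Phi = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
K = Phi @ Phi.T + 1e-6 * np.eye(3)     # Gram matrix of the dot-product kernel
y = np.array([0.94, 0.91, 0.88])       # validation accuracies
grads = feature_gradients(Phi, K, y)
print(dict(zip(['conv3x3', 'conv1x1'], grads)))  # positive => more of it tends to help
```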
Under the NAS setting, optimising the acquisition function over the search space can be challenging [12, 26, 24]: the non-continuous search space makes analytic gradients ineffective, and exhaustively evaluating all possible architectures is computationally unviable. A way to generate a population of candidate architectures for acquisition function optimisation at each BO iteration is therefore necessary for all BO-based NAS strategies. The naive way to do so is to randomly sample architectures from the search space [21, 65], while a better alternative is genetic mutation, which generates the candidate architectures by mutating a small pool of parent architectures [12, 26, 24, 13]. We build on genetic mutation, but instead of using random mutation, we propose to use the gradient information provided by GPWL to guide the mutation in a more informed way. A high-level comparison between these approaches is given in App. C. Specifically, we transform the gradients into pseudo-probabilities defining the chance of mutation for each node and edge in the architecture. A sub-feature whose gradient is very positive, and which thus contributes positively to the validation accuracy, will have a lower probability of mutation. Once a node or an edge is chosen for mutation, we reuse the gradient information on its possible change options to define their corresponding probabilities of being chosen. The detailed procedure is described with reference to an example cell in App. C. In summary, we propose a new way to learn which architecture component to modify, as well as how to modify it, to improve performance by making use of the interpretable features extracted in our GPWL surrogate.

Recently there have also been several attempts at using BO for NAS [12, 21, 26, 13, 24]. To overcome the limitations of conventional BO on non-continuous and graph-like NAS search spaces, [12] proposes a similarity measure among neural architectures based on optimal transport to enable the use of a GP surrogate, while [21] and [24] suggest encoding schemes that characterise neural architectures with a vector of discrete and categorical variables. Yet, the proposed kernel in [12] can be slow to compute for large architectures [13], and such encoding schemes are not scalable to search cells with a large number of nodes [24]. Alternatively, several works use graph neural networks (GNNs) as the surrogate model [26, 27, 13] to capture the graph structure of neural architectures. However, the design of the GNN introduces many additional hyperparameters to be tuned, and the GNN requires a relatively large amount of training data to achieve decent prediction performance, as shown in Section 5.1. The most related work is [66], which applies GP-based BO with graph-induced kernels to design multimodal-fusion deep neural networks. [66] assigns each possible architecture in the search space to a node in an undirected super-graph and uses a diffusion kernel to capture similarities between the nodes. The need to construct and compute on this large super-graph limits the application of the method to relatively small search spaces. In contrast, we model each architecture in the search space as an individual directed graph and propose to compare pairs of graphs with the WL kernel. Such a setup allows our method to act on larger and more complex architectures and to capture data-flow directions in architectures. Our approach of comparing graphs without the need to reference a super-graph is also computationally cheaper.
In addition to the BO-NAS literature, there are also several works that apply graph kernels in BO [67, 68, 69]. However, all of these works focus on the undirected-graph setting, which is very different from our directed-graph NAS problem, and none of them investigates the use of the WL graph kernel family.

Table 1: Regression performance (in terms of rank correlation) of different graph kernels.
[Table 1 rows: WL (complexity O(Hm)), RW, SP and MLP† kernels, with their worst-case complexities and rank correlations (mean ± standard error over 20 trials) on N101, CIFAR10, CIFAR100, ImageNet16 and Flower-102. †: L is the number of neighbours, a hyperparameter of the MLP kernel.]
Figure 3: Mean rank correlation achieved by the GPWL, GNN and COMBO surrogates across 20 trials on the different datasets (panels: CIFAR10, CIFAR100, ImageNet16, N101 and Flowers102). Error bars denote ± one standard error.

We examine the regression performance of GPWL on the available NAS datasets: NAS-Bench-101 on CIFAR10 (denoted N101) [21], and NAS-Bench-201 on CIFAR10, CIFAR100 and ImageNet16 [22] (denoted by their respective image datasets hereafter). However, recognising that both datasets only contain CIFAR-sized images and relatively small architecture cells (each cell is a graph of 7 and 4 nodes in NAS-Bench-101 and NAS-Bench-201, respectively), as a further demonstration of the scalability of our proposed method to much larger architectures we also construct a dataset with 547 architectures sampled from the randomly wired graph generator described in [25]; each architecture cell has 32 operation nodes and all the architectures are trained on the Flowers102 dataset [70] (we denote this dataset as Flower102 hereafter). Similar to [21, 22, 13], we use Spearman's rank correlation between the predicted and the true validation accuracy as the performance metric, because in NAS what matters is the relative ranking among different architectures.
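For clarity, the evaluation protocol used throughout this section can be sketched as follows; `surrogate.predict` is a hypothetical interface for whichever surrogate (GPWL, GNN or COMBO) is being assessed.

```python
from scipy.stats import spearmanr

def rank_correlation(surrogate, val_graphs, true_accuracies):
    """Spearman rank correlation between predicted and true validation accuracy."""
    pred_mean, _ = surrogate.predict(val_graphs)     # hypothetical surrogate interface
    rho, _ = spearmanr(pred_mean, true_accuracies)
    return rho
```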
Comparison with other graph kernels. We first compare the performance of the WL kernel against other popular graph kernels, namely the (fast) Random Walk (RW) [54, 58], Shortest-Path (SP) [59] and Multiscale Laplacian (MLP) [52] kernels, when combined with GPs. These competing graph kernels are chosen because they represent distinct graph kernel classes and are suitable for the NAS search space with small or no modifications. In each NAS dataset, we randomly sample 50 architectures to train the GP surrogate and use another 400 architectures as the validation set to evaluate the rank correlation between the predicted and the ground-truth validation accuracy.

We repeat each trial 20 times and report the mean and standard error of all the kernel choices on all NAS datasets in Table 1. We also include the worst-case complexity of the kernel computation between a pair of graphs in the table. The results in this section justify our reasoning in Section 3.1: combined with the interpretability benefits we discussed, WL consistently outperforms the other kernels across search spaces while retaining modest computational costs. RW is often a close competitor, but its computational complexity is worse and it does not always converge. MLP, which requires us to convert directed graphs to undirected graphs, performs poorly, thereby validating that directional information is highly important. Finally, in BO, uncertainty estimates are as important as the predictive accuracy itself; we show that GPWL produces sound uncertainty estimates in App. D.
Comparison with GNN and alternative GP surrogate
We then compare the regression performance of our GPWL surrogate against two competitive baselines: GNN [13], which uses a combination of a graph convolutional network and a final Bayesian linear regression layer as the surrogate, and COMBO [67], which uses GPs with a graph diffusion kernel on combinatorial graphs (we choose COMBO because it uses a GP surrogate with a different kernel choice and is close to the most related work [66], whose implementation is not publicly available). We follow the same set-up described above but repeat the experiments with a varying number of training data to evaluate the data-efficiency of the different surrogates. It is evident from Fig. 3 that our GPWL surrogate clearly outperforms both competing methods on all the NAS datasets with much less training data. Specifically, GPWL requires at least 3 times fewer data than GNN and at least 10 times fewer data than COMBO on the NAS-Bench-201 datasets. It is also able to achieve high rank correlation on datasets with larger search spaces such as NAS-Bench-101 and Flowers102, while requiring 20 times fewer data than GNN on Flowers102 and over 30 times fewer data on NAS-Bench-101.

Figure 4: Median validation error on the NAS-Bench datasets with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials (panels: N101, CIFAR10, CIFAR100, ImageNet16). Shades denote ± one standard error.

We further benchmark our proposed NAS approach, NAS-BOWL, against a range of existing methods, including random search, TPE [15], reinforcement learning (rl) [71], BO with SMAC (smacbo) [17], regularised evolution [72] and BO with a GNN surrogate (gcnbo) [13]. On NAS-Bench-101, we also include BANANAS [24], where the authors claim state-of-the-art performance. In both NAS-Bench datasets, validation errors for different random seeds are provided, thereby creating noisy objective function observations. We perform experiments using both the deterministic setup described in [24], where the validation errors over multiple random initialisations are averaged to provide deterministic objective functions, and also report results with noisy/stochastic objective functions. We show the validation results for both setups in Figure 4 and the corresponding test results in App. E. In these figures, we use NASBOWLm and
NASBOWLr to denote the NAS-BOWL variants with architectures generated from the algorithm described in Section 3.2 and from random sampling, respectively. Similarly,
BANANASm and
BANANASr represent the BANANAS results using mutation and random sampling in [24]. The readers are referred to App. E for our setup details.

It is evident that NAS-BOWL outperforms all baselines on all NAS-Bench tasks, achieving both the lowest validation and the lowest test errors. The recent neural-network-based methods such as BANANAS and GCNBO are often the strongest competitors, but we emphasise that, unlike our approach, these methods inevitably introduce a number of extra hyperparameters whose tuning is non-trivial, and they have poorly calibrated uncertainty estimates [28]. The experiments with stochastic errors further show that even in the more challenging setup with noisy objective function observations, NAS-BOWL still performs very well as it inherits the robustness against noisy data from the GP model. Finally, we perform ablation studies on NAS-BOWL in App. E.
In this paper, we propose a novel BO-based architecture search strategy, NAS-BOWL, which uses a GP surrogate with the WL graph kernel to handle architecture inputs. We show that our method achieves superior prediction performance across various tasks and attains state-of-the-art performance on the NAS-Bench datasets. Building on our proposed framework, a broader range of GP-based BO techniques can be deployed to tackle more challenging NAS problems such as the multi-objective and transfer-learning settings. In addition, we also exploit the human-interpretable WL feature extraction for architecture generation; we believe this is a starting point for explainable NAS, which is another exciting direction that warrants future investigation.

References

[1] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, "Large-scale evolution of image classifiers," in
International Conference on Machine Learning (ICML), pp. 2902–2911, 2017.
[2] B. Zoph and Q. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[3] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, "Efficient architecture search by network transformation," in AAAI Conference on Artificial Intelligence, 2018.
[4] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," arXiv preprint arXiv:1806.09055, 2018.
[5] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in European Conference on Computer Vision (ECCV), pp. 19–34, 2018.
[6] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu, "Neural architecture optimization," in Advances in Neural Information Processing Systems (NIPS), pp. 7816–7827, 2018.
[7] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," in International Conference on Machine Learning (ICML), pp. 4092–4101, 2018.
[8] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," arXiv:1802.01548, 2018.
[9] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710, 2018.
[10] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: Stochastic neural architecture search," arXiv preprint arXiv:1812.09926, 2018.
[11] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," arXiv:1808.05377, 2018.
[12] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, "Neural architecture search with Bayesian optimisation and optimal transport," in Advances in Neural Information Processing Systems (NIPS), pp. 2016–2025, 2018.
[13] H. Shi, R. Pi, H. Xu, Z. Li, J. T. Kwok, and T. Zhang, "Multi-objective neural architecture search via predictive network performance optimization," 2019.
[14] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
[15] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems (NIPS), pp. 2546–2554, 2011.
[16] J. Bergstra, D. Yamins, and D. D. Cox, "Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures," in International Conference on Machine Learning (ICML), 2013.
[17] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in International Conference on Learning and Intelligent Optimization, pp. 507–523, Springer, 2011.
[18] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter, "Fast Bayesian optimization of machine learning hyperparameters on large datasets," arXiv:1605.07079, 2016.
[19] S. Falkner, A. Klein, and F. Hutter, "BOHB: Robust and efficient hyperparameter optimization at scale," in International Conference on Machine Learning (ICML), pp. 1436–1445, 2018.
[20] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas, "Bayesian optimization in AlphaGo," arXiv:1812.06855, 2018.
[21] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter, "NAS-Bench-101: Towards reproducible neural architecture search," in
International Conference on Machine Learning (ICML), pp. 7105–7114, 2019.
[22] X. Dong and Y. Yang, "NAS-Bench-201: Extending the scope of reproducible neural architecture search," arXiv preprint arXiv:2001.00326, 2020.
[23] H. Jin, Q. Song, and X. Hu, "Auto-Keras: An efficient neural architecture search system," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946–1956, 2019.
[24] C. White, W. Neiswanger, and Y. Savani, "BANANAS: Bayesian optimization with neural architectures for neural architecture search," arXiv preprint arXiv:1910.11858, 2019.
[25] S. Xie, A. Kirillov, R. Girshick, and K. He, "Exploring randomly wired neural networks for image recognition," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293, 2019.
[26] L. Ma, J. Cui, and B. Yang, "Deep neural architecture search with deep graph Bayesian optimization," in Web Intelligence (WI), pp. 500–507, IEEE/WIC/ACM, 2019.
[27] C. Zhang, M. Ren, and R. Urtasun, "Graph hypernetworks for neural architecture search," in International Conference on Learning Representations, 2019.
[28] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter, "Bayesian optimization with robust Bayesian neural networks," in Advances in Neural Information Processing Systems (NIPS), pp. 4134–4142, 2016.
[29] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.
[30] E. Brochu, V. M. Cora, and N. De Freitas, "A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning," arXiv preprint arXiv:1012.2599, 2010.
[31] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, vol. 2. MIT Press, Cambridge, MA, 2006.
[32] J. Mockus, V. Tiesis, and A. Zilinskas, "The application of Bayesian methods for seeking the extremum," Towards Global Optimization, vol. 2, no. 117-129, p. 2, 1978.
[33] N. M. Kriege, F. D. Johansson, and C. Morris, "A survey on graph kernels," Applied Network Science, vol. 5, no. 1, pp. 1–42, 2020.
[34] G. Nikolentzos, G. Siglidis, and M. Vazirgiannis, "Graph kernels: A survey," arXiv preprint arXiv:1904.12218, 2019.
[35] D. Ginsbourger, R. Le Riche, and L. Carraro, "Kriging is well-suited to parallelize optimization," in Computational Intelligence in Expensive Optimization Problems, pp. 131–162, Springer, 2010.
[36] J. González, Z. Dai, P. Hennig, and N. Lawrence, "Batch Bayesian optimization via local penalization," in Artificial Intelligence and Statistics, pp. 648–657, 2016.
[37] J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik, "Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space," arXiv preprint arXiv:1706.01825, 2017.
[38] K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos, "Parallelised Bayesian optimisation via Thompson sampling," in International Conference on Artificial Intelligence and Statistics, pp. 133–142, 2018.
[39] A. Alvi, B. Ru, J.-P. Calliess, S. Roberts, and M. A. Osborne, "Asynchronous batch Bayesian optimisation with improved local penalisation," in International Conference on Machine Learning, pp. 253–262, 2019.
[40] M. Emmerich and J.-W. Klinkenberg, "The computation of the expected improvement in dominated hypervolume of Pareto front approximations," Rapport technique, Leiden University, vol. 34, pp. 7–3, 2008.
[41] W. Lyu, F. Yang, C. Yan, D. Zhou, and X. Zeng, "Multi-objective Bayesian optimization for analog/RF circuit synthesis," in Proceedings of the 55th Annual Design Automation Conference, pp. 1–6, 2018.
[42] D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams, "Predictive entropy search for multi-objective Bayesian optimization," in International Conference on Machine Learning, pp. 1492–1501, 2016.
[43] B. Paria, K. Kandasamy, and B. Póczos, "A flexible framework for multi-objective Bayesian optimization using random scalarizations," arXiv preprint arXiv:1805.12168, 2018.
[44] M. Wistuba, N. Schilling, and L. Schmidt-Thieme, "Two-stage transfer surrogate model for automatic hyperparameter optimization," in
Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 199–214, Springer, 2016.
[45] M. Poloczek, J. Wang, and P. I. Frazier, "Warm starting Bayesian optimization," pp. 770–781, IEEE, 2016.
[46] Z. Wang, B. Kim, and L. P. Kaelbling, "Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior," in Advances in Neural Information Processing Systems, pp. 10477–10488, 2018.
[47] M. Wistuba, N. Schilling, and L. Schmidt-Thieme, "Scalable Gaussian process-based transfer surrogates for hyperparameter optimization," Machine Learning, vol. 107, no. 1, pp. 43–78, 2018.
[48] M. Feurer, B. Letham, and E. Bakshy, "Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles," in AutoML Workshop at ICML, 2018.
[49] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, no. 77, pp. 2539–2561, 2011.
[50] F. Höppner and M. Jahnke, "Enriched Weisfeiler-Lehman kernel for improved graph clustering of source code," in International Symposium on Intelligent Data Analysis, pp. 248–260, Springer, 2020.
[51] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, "Efficient graphlet kernels for large graph comparison," in Artificial Intelligence and Statistics, pp. 488–495, 2009.
[52] R. Kondor and H. Pan, "The multiscale Laplacian graph kernel," in Advances in Neural Information Processing Systems, pp. 2990–2998, 2016.
[53] N. de Lara and E. Pineau, "A simple baseline algorithm for graph classification," arXiv preprint arXiv:1810.09155, 2018.
[54] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 321–328, 2003.
[55] B. Weisfeiler and A. A. Lehman, "A reduction of a graph to a canonical form and an algebra arising during this reduction," Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968.
[56] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, "Weisfeiler and Leman go neural: Higher-order graph neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4602–4609, 2019.
[57] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?," arXiv preprint arXiv:1810.00826, 2018.
[58] T. Gärtner, P. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives," in Learning Theory and Kernel Machines, pp. 129–143, Springer, 2003.
[59] K. M. Borgwardt and H.-P. Kriegel, "Shortest-path kernels on graphs," in Fifth IEEE International Conference on Data Mining (ICDM'05), pp. 8–pp, IEEE, 2005.
[60] Z. Zeng, A. K. Tung, J. Wang, J. Feng, and L. Zhou, "Comparing stars: On approximating graph edit distance," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 25–36, 2009.
[61] C.-L. Lin, "Hardness of approximating graph transformation problem," in International Symposium on Algorithms and Computation, pp. 74–82, Springer, 1994.
[62] A. P. Engelbrecht, I. Cloete, and J. M. Zurada, "Determining the significance of input parameters using sensitivity analysis," in International Workshop on Artificial Neural Networks, pp. 382–388, Springer, 1995.
[63] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1885–1894, JMLR.org, 2017.
[64] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?' Explaining the predictions of any classifier," in
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
[65] A. Yang, P. M. Esperança, and F. M. Carlucci, "NAS evaluation is frustratingly hard," in International Conference on Learning Representations (ICLR), 2020.
[66] D. Ramachandram, M. Lisicki, T. J. Shields, M. R. Amer, and G. W. Taylor, "Bayesian optimization on graph-structured search spaces: Optimizing deep multimodal fusion architectures," Neurocomputing, vol. 298, pp. 80–89, 2018.
[67] C. Oh, J. Tomczak, E. Gavves, and M. Welling, "Combinatorial Bayesian optimization using the graph Cartesian product," in Advances in Neural Information Processing Systems, pp. 2910–2920, 2019.
[68] J. Cui and B. Yang, "Graph Bayesian optimization: Algorithms, evaluations and applications," arXiv preprint arXiv:1805.01157, 2018.
[69] T. Shiraishi, T. Le, H. Kashima, and M. Yamada, "Topological Bayesian optimization with persistence diagrams," arXiv preprint arXiv:1902.09722, 2019.
[70] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," pp. 722–729, IEEE, 2008.
[71] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[72] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4780–4789, 2019.
[73] C. E. Rasmussen, "Gaussian processes in machine learning," in Summer School on Machine Learning, pp. 63–71, Springer, 2003.
[74] M. Gönen and E. Alpaydin, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, no. 64, pp. 2211–2268, 2011.
[75] N. M. Kriege, P.-L. Giscard, and R. Wilson, "On valid optimal assignment kernels and applications to graph classification," in Advances in Neural Information Processing Systems, pp. 1623–1631, 2016.
[76] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," in International Conference on Learning Representations (ICLR), 2019.
[77] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.
[78] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," arXiv preprint arXiv:0912.3995, 2009.
A Algorithms
The overall algorithm of our NAS-BOWL method is presented in Algorithm 1. The algorithm for the general Weisfeiler-Lehman subtree kernel is presented in Algorithm 2 (modified from [49]). Note that here we assume the weights associated with each WL iteration h in equation (3.1) to be equal for all h.
Algorithm 1 NAS-BOWL
Input: Observation data D_0, number of BO iterations T, BO batch size B, acquisition function α(·)
Output: The best architecture found
for t = 1, ..., T do
    Generate a pool of candidate architectures G_t
    Select {G_t,i}_{i=1..B} = argmax_{G ∈ G_t} α_t(G | D_{t-1})
    Evaluate the validation accuracies {y_t,i}_{i=1..B} of {G_t,i}_{i=1..B}
    D_t ← D_{t-1} ∪ ({G_t,i}_{i=1..B}, {y_t,i}_{i=1..B})
end for
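For readers who prefer code to pseudocode, the following Python sketch mirrors Algorithm 1; all the callables passed in (surrogate fitting, candidate generation, the acquisition function and the expensive architecture training) are placeholders for the components described in the main text, not actual library functions.

```python
def nas_bowl(init_data, fit_gpwl, generate_candidates, acquisition, train_and_evaluate,
             n_iterations=30, batch_size=5, pool_size=200):
    """Sketch of Algorithm 1. The callables are placeholders for the components
    described in the main text (GPWL surrogate fitting, candidate generation via
    mutation/random sampling, the acquisition function, and the expensive training)."""
    data = list(init_data)                       # [(architecture graph, validation accuracy), ...]
    for _ in range(n_iterations):
        surrogate = fit_gpwl(data)
        pool = generate_candidates(data, surrogate, pool_size)
        ranked = sorted(pool, key=lambda g: acquisition(surrogate, g, data), reverse=True)
        batch = ranked[:batch_size]              # top-B candidates by acquisition value
        data += [(g, train_and_evaluate(g)) for g in batch]
    return max(data, key=lambda d: d[1])         # best (architecture, accuracy) pair found
```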
Algorithm 2 Weisfeiler-Lehman subtree kernel computation between two graphs
Input: Graphs {G_1, G_2}, maximum number of WL iterations H
Output: The kernel function value k between the graphs
Initialise the feature vectors {φ(G_1), φ(G_2)} with the respective counts of the original node labels, i.e. the h = 0 WL features (e.g. the i-th entry of φ(G_1) is the count of the i-th node label in graph G_1)
for h = 1, ..., H do
    Assign a multiset label M_h(v) to each node v consisting of the multiset {l_{h-1}(u) | u ∈ N(v)}, where l_{h-1}(v) is the node label of node v at the (h-1)-th WL iteration and N(v) are the neighbour nodes of node v
    Sort the elements in M_h(v) in ascending order and concatenate them into a string s_h(v)
    Add l_{h-1}(v) as a prefix to s_h(v)
    Compress each string s_h(v) using a hash function f such that f(s_h(v)) = f(s_h(w)) iff s_h(v) = s_h(w) for any two nodes {v, w}
    Set l_h(v) := f(s_h(v)) for all v
    Concatenate φ(G_1), φ(G_2) with the respective counts of the new labels
end for
Compute the inner product between the feature vectors in the RKHS: k = ⟨φ(G_1), φ(G_2)⟩_H
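A compact Python sketch of Algorithm 2 is given below. It assumes a graph is represented as a dict of node labels plus a list of directed edges, uses the sorted multiset string itself as the compressed label (a simple stand-in for the hash function f), and takes each node's predecessors as its neighbourhood; the paper's exact conventions may differ.

```python
from collections import Counter

def wl_features(node_labels, edges, H):
    """Concatenated WL feature histogram of a directed labelled graph.
    node_labels: {node_id: label}; edges: [(u, v), ...] meaning u -> v."""
    neighbours = {v: [] for v in node_labels}
    for u, v in edges:
        neighbours[v].append(u)                 # predecessors as the neighbourhood (assumption)
    labels = dict(node_labels)
    hist = Counter(labels.values())             # h = 0 features: original node labels
    for _ in range(H):
        new_labels = {}
        for v in labels:
            multiset = sorted(labels[u] for u in neighbours[v])
            new_labels[v] = labels[v] + '~' + ','.join(multiset)  # compressed relabel
        labels = new_labels
        hist.update(labels.values())            # append the h-th iteration features
    return hist

def wl_subtree_kernel(g1, g2, H=1):
    """Dot product of the WL feature histograms of two graphs."""
    f1, f2 = wl_features(*g1, H), wl_features(*g2, H)
    return sum(count * f2.get(feat, 0) for feat, count in f1.items())

# toy cell: input -> conv3x3 -> output
cell = ({0: 'input', 1: 'conv3x3', 2: 'output'}, [(0, 1), (1, 2)])
print(wl_subtree_kernel(cell, cell, H=1))  # -> 6
```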
B Combining Different Kernels

In general, the sum or product of valid kernels gives another valid kernel; as such, combining different kernels to yield a better-performing kernel is common in the GP and Multiple Kernel Learning (MKL) literature [73, 74]. In this section, we conduct a preliminary discussion of its usefulness to GPWL. As a singular example, we consider the additive kernel that is a linear combination of the WL kernel and the MLP kernel:

$$k_{\mathrm{add}}(G_1, G_2) = \alpha\, k_{\mathrm{WL}}(G_1, G_2) + \beta\, k_{\mathrm{MLP}}(G_1, G_2) \quad \text{s.t.} \quad \alpha + \beta = 1,\; \alpha, \beta \geq 0, \qquad (B.1)$$

where α, β are the kernel weights. We choose WL and MLP because we expect them to extract diverse information: whereas WL processes the graph node information directly, MLP considers the spectrum of the graph Laplacian matrix, which often reflects global properties such as the topology and the graph connectivity. We expect the more diverse features captured by the constituent kernels to lead to a more effective additive kernel. While it is possible to determine the weights in a more principled way, such as jointly optimising them under the GP log-marginal likelihood, in this example we simply fix α and β to constant values. We then perform regression on the NAS-Bench-101 and Flower102 datasets following the setup in Section 5.1. We repeat each experiment 20 times and report the mean and standard deviation in Table 2, and we show the uncertainty estimates of the additive kernel in Fig. 5. In both search spaces the additive kernel outperforms its constituent kernels, but the gain over the WL kernel is marginal. Interestingly, while MLP performs poorly on its own, the complementary spectral information extracted by it can be helpful when used alongside our WL kernel. Generally, we hypothesise that as the search space increases in complexity (e.g., larger graphs, more edge connections permitted, etc.), the benefit from combining different kernels will increase; we defer a more comprehensive discussion of this to future work.

Table 2: Regression performance (in terms of rank correlation) of additive kernels. [Rows: WL + MLP, WL† and MLP†, with values given as mean ± standard deviation on N101 and Flower-102. †: Taken directly from Table 1.]

Figure 5: Predictive vs ground-truth validation error of GPWL with the additive kernel on N101 and Flower-102 in log-log scale. Error bars denote ± one standard deviation.
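At the Gram-matrix level, equation (B.1) amounts to a convex combination of the two kernels' Gram matrices, as in the sketch below; the weight value shown is illustrative rather than the one used in the experiments.

```python
import numpy as np

def additive_gram(K_wl, K_mlp, alpha=0.5):
    """Equation (B.1) at the Gram-matrix level: a convex combination of two valid
    kernels is itself a valid kernel. alpha could also be learnt jointly with the
    GP hyperparameters via the marginal likelihood."""
    return alpha * K_wl + (1.0 - alpha) * K_mlp

# toy 3x3 Gram matrices standing in for the WL and MLP kernels over the same graphs
K_wl = np.array([[3.0, 1.0, 0.0], [1.0, 4.0, 2.0], [0.0, 2.0, 5.0]])
K_mlp = np.eye(3)
K_add = additive_gram(K_wl, K_mlp, alpha=0.7)
```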
C Different Ways to Generate Candidate Architectures for Acquisition Function Optimisation

Figure 6: Different ways to generate the candidate architecture population for acquisition function optimisation: (a) random sampling, (b) genetic mutation, (c) gradient-guided mutation. Random sampling (a) uses no information from the queried data or the surrogate model. Conventional genetic mutation (b) uses the best architectures queried so far as the parent architectures for generating new architectures. Our proposed gradient-guided mutation (c) further exploits the gradient information of the posterior model with respect to the interpretable features learnt by the WL kernel to guide how to mutate a given parent architecture.

A way to generate a population of candidate architectures for acquisition function optimisation at each BO iteration is necessary for all BO-based NAS strategies. The naive way to do so is to randomly sample architectures from the search space [21, 65] (Fig. 6a). This is simple to implement but ignores any information contained in the past query data or the predictive posterior model, and is thus inefficient in exploring the huge search space. A more popular approach is based on genetic mutation, which generates the candidate architectures by mutating a small pool of parent architectures [12, 26, 24, 13] (Fig. 6b). The parent architectures are usually chosen among the queried architectures which give the top validation performance or acquisition function values. Generating the candidate architecture pool in this way enables us to exploit the prior information on the best architectures observed so far, and thus to explore the large search space more efficiently. However, in all prior work [12, 26, 24, 13], no information can be gained on how to mutate a parent architecture. As a result, every node or edge in a parent architecture gets an equal chance of undergoing mutation and is randomly changed to any other possible status except its current one with uniform probability (e.g. in NAS-Bench-101, a conv3×3 node has a uniform probability of becoming a maxpool3×3 node or a conv1×1 node).

In contrast, we believe that the aforementioned gradient information of GPWL provides a more informed way of guiding the mutation. As discussed in Section 3.2 in the main text, we re-state the analytic expression of the posterior feature derivative:

$$\mathbb{E}_{p(y \mid G_t, \mathcal{D}_{t-1})}\!\left[\frac{\partial y}{\partial \phi_t^{j}}\right] = \frac{\partial \mu}{\partial \phi_t^{j}} = \frac{\partial \langle \phi_t, \mathbf{\Phi}_{t-1} \rangle}{\partial \phi_t^{j}}\,\mathbf{K}_{t-1}^{-1}\,\mathbf{y}_{1:t-1}. \qquad (C.1)$$

When a simple dot product is used (i.e. $\langle \phi_1, \phi_2 \rangle = \phi_1 \cdot \phi_2$), the derivative is simply given by

$$\mathbb{E}_{p(y \mid G_t, \mathcal{D}_{t-1})}\!\left[\frac{\partial y}{\partial \phi_t}\right] = \mathbf{\Phi}_{t-1}^{\top}\,\mathbf{K}_{t-1}^{-1}\,\mathbf{y}_{1:t-1}. \qquad (C.2)$$

When the optimal assignment [75] inner product is used, we have $\langle \phi_1, \phi_2 \rangle = \sum_i \min(\phi_1^i, \phi_2^i)$. The $\min(\cdot,\cdot)$ operator leads to non-differentiable points. To tackle this we can use a continuous approximation similar to that proposed in [39]:

$$\langle \phi_1, \phi_2 \rangle = \sum_i \min(\phi_1^i, \phi_2^i) \approx \sum_i \Big[(\phi_1^i)^p + (\phi_2^i)^p\Big]^{1/p}, \qquad (C.3)$$

where we choose a large negative $p$. However, empirically we find the gradients computed via the approximation in equation (C.3) to be mostly consistent with equation (C.2) (as we will show, we normalise the gradients into pseudo-probabilities and the magnitudes of the gradients do not matter as much), but the latter can be computed much more cheaply.

Given this, we transform the gradients using a sigmoid transformation of the negative of the gradients to obtain positive values, and then normalise them to represent pseudo-probabilities that encode the chance of mutation for each node and edge feature in the architecture.
A sub-feature (e.g. node 1 being a conv3×3 operation in Fig. 7) whose gradient is very positive will have a lower probability of being mutated, as we want to keep such good features which contribute positively to the validation accuracy. On the other hand, features with large negative gradients will be subject to a higher probability of undergoing mutation, as we want to change these features that contribute negatively to the architecture performance.

With reference to the illustrative example in Fig. 7, the posterior mean gradients with respect to node labels (i.e. h = 0 features) can be obtained directly, but those with respect to edges are not trivial; an edge can be present in multiple h = 1 features (for example, the edge from node 0 to node 2 in Fig. 7 is present in two h = 1 features), each of which can have a potentially different sign or magnitude. To deal with this, we assign the probability of mutation to an edge based on the mean gradient of all the features in which the edge appears. This smooths out the noisy contribution of a particular edge to the architecture's validation performance. Once a node or an edge is chosen for mutation, we reuse the gradient information on its possible change options to define the corresponding probability of change. For example, if node 2 with label maxpool3×3 in Fig. 7 is chosen to mutate, it is more likely to change into conv3×3 than conv1×1. In summary, we propose a procedure to learn which architecture component to modify, as well as how to modify it, to improve performance by taking advantage of the interpretable features extracted by our WL kernel GP and their corresponding derivatives. We briefly compare the different candidate acquisition strategies discussed in this section in App. E.5.

Note that an alternative would be to assign a length-scale to each feature embedding produced by the WL kernel and then learn these length-scales with automatic relevance determination (ARD) kernels based on the GP marginal likelihood, so as to assess the responsiveness of the architecture performance with respect to different features. However, the number of possible features in an architecture search space can be large and grows exponentially with the architecture size. Accurately learning these length-scales can then be very difficult and can easily lead to suboptimal values. Moreover, the length-scales only reflect the smoothness/variability of the objective function values with respect to the features, but do not tell the direction of their effects on the architecture performance. Filtering features based on ARD length-scales can therefore lead to the removal of important architecture features which positively correlate with the validation performance.

Figure 7: An illustration of using gradient information on various features to guide the mutation.
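The gradient-to-probability transformation described above can be sketched as follows; the operation set, gradient values and the softmax weighting of the replacement options are illustrative assumptions rather than the exact scheme used in our experiments.

```python
import numpy as np

def mutation_probabilities(grads):
    """Sigmoid of the negative gradient, normalised: features with very positive
    gradients (good features) get a low probability of being mutated."""
    scores = 1.0 / (1.0 + np.exp(np.asarray(grads, dtype=float)))   # sigmoid(-gradient)
    return scores / scores.sum()

def mutate_node(rng, node_ops, grad_by_op):
    """Pick one node to mutate (weighted by the gradient of its current op), then pick
    its replacement op with probability weighted by the ops' gradients."""
    node_probs = mutation_probabilities([grad_by_op[op] for op in node_ops])
    idx = rng.choice(len(node_ops), p=node_probs)
    options = [op for op in grad_by_op if op != node_ops[idx]]
    opt_weights = np.exp([grad_by_op[op] for op in options])        # favour positive-gradient ops
    new_ops = list(node_ops)
    new_ops[idx] = rng.choice(options, p=opt_weights / opt_weights.sum())
    return new_ops

rng = np.random.default_rng(0)
grad_by_op = {'conv3x3': 1.1, 'conv1x1': -0.8, 'maxpool3x3': 0.05}  # from the GPWL posterior
print(mutate_node(rng, ['conv3x3', 'maxpool3x3', 'conv1x1'], grad_by_op))
```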
D Predictive Mean ± Standard Deviation of GPWL Surrogate on NAS Datasets
In this section, we show the GPWL predictions on the various NAS datasets when trained with 50 samples each. Not only does GPWL produce a satisfactory predictive mean, in terms of both the rank correlation and the agreement with the ground truth, it also gives sound uncertainty estimates: in most cases the ground truths lie within the error bar representing one standard deviation of the GP predictive distribution. For the training of GPWL, we always transform the validation errors (the targets of the regression) into log-scale, normalise the data, and transform the predictions back, as empirically we find this leads to better uncertainty estimates.
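The target transformation mentioned above is simple to implement; a small illustrative sketch (with made-up validation errors) is given below.

```python
import numpy as np

def transform_targets(val_errors):
    """Map validation errors to normalised log-space for GP training."""
    y = np.log(np.asarray(val_errors, dtype=float))
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, (mu, sigma)

def inverse_transform(y_normalised, stats):
    """Map GP predictions back to the original validation-error scale."""
    mu, sigma = stats
    return np.exp(y_normalised * sigma + mu)

y_train, stats = transform_targets([0.062, 0.055, 0.071])
print(inverse_transform(y_train, stats))   # recovers the original validation errors
```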
Figure 8: Predicted vs ground-truth validation error of GPWL on the various NAS-Bench tasks (N101, CIFAR10, CIFAR100, ImageNet16 and Flower102) in log-log scale. Error bars denote ± one standard deviation.

E Experimental Details
All experiments were conducted on a 36-core 2.3GHz Intel Xeon processor with 512 GB RAM.
E.1 Datasets
We experiment on the following datasets:

• NAS-Bench-101 [21]: The search space is a directed acyclic graph with up to 7 nodes and a maximum of 9 edges. Besides the input node and the output node, each remaining operation node can choose one of three possible operations: conv3×3, conv1×1 and maxpool3×3. The dataset contains all 423,624 unique neural architectures in the search space. Each architecture is trained for 108 epochs and evaluated on the CIFAR10 image dataset, and the evaluation is repeated over 3 random initialisation seeds. We can access the final training/validation/test accuracy, the number of parameters as well as the training time of each architecture from the dataset. The dataset and its API can be downloaded from https://github.com/google-research/nasbench/.

• NAS-Bench-201 [22]: The search space is a directed acyclic graph with 4 nodes and 6 edges. Each edge corresponds to an operation selected from a set of 5 possible options: conv1×1, conv3×3, avgpool3×3, skip-connect and zeroize. This search space is applicable to almost all up-to-date NAS algorithms. Note that although the search space of NAS-Bench-201 is more general, it is smaller than that of NAS-Bench-101. The dataset contains all 15,625 unique neural architectures in the search space. Each architecture is trained for 200 epochs and evaluated on 3 image datasets (CIFAR10, CIFAR100 and ImageNet16-120), with the evaluation repeated over 3 random initialisation seeds. We can access the training and validation accuracy/loss after every training epoch, the final test accuracy/loss, the number of parameters as well as FLOPs from the dataset. The dataset and its API can be downloaded from https://github.com/D-X-Y/NAS-Bench-201.

• Flowers102: We generate this dataset based on the random graph generators proposed in [25]. The search space is a directed acyclic graph with 32 nodes and a varying number of edges. All the nodes can take one of three possible options: input, output and relu-conv3×3; thus the graph can have multiple inputs and outputs. This search space is very different from those of NAS-Bench-101 and NAS-Bench-201 and is used to test the scalability of our surrogate model to a large search space (in terms of the number of nodes in the graph). The edges/wiring/connections in the graph are created by one of three classic random graph models: Erdős-Rényi (ER), Barabási-Albert (BA) and Watts-Strogatz (WS); a minimal sketch of this wiring-generation step is given after this list. Different random graph models result in graphs of different topological structures and connectivity patterns, and are defined by one or two hyperparameters. We investigate a total of 69 different sets of hyperparameters: 8 values for the hyperparameter of the ER model, 6 values for the hyperparameter of the BA model and 55 different value combinations for the two hyperparameters of the WS model. For each hyperparameter set, we generate 8 different architectures using the random graph model and train each architecture for 250 epochs before evaluating on the Flowers102 dataset; the training set-ups follow [76]. This results in our dataset of 552 randomly wired neural architectures.
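The wiring generation for this search space can be sketched with NetworkX's classic random-graph generators as below; the DAG orientation rule and the hyperparameter values shown are illustrative assumptions rather than the exact pipeline of [25] and [76].

```python
import networkx as nx

def random_wiring(model="WS", n_nodes=32, **hp):
    """Sample a random undirected graph, then orient its edges to obtain a DAG.

    model: one of "ER", "BA", "WS"; hp carries the generator hyperparameters
    (e.g. p for ER, m for BA, k and p for WS).
    """
    if model == "ER":
        g = nx.erdos_renyi_graph(n_nodes, hp["p"])
    elif model == "BA":
        g = nx.barabasi_albert_graph(n_nodes, hp["m"])
    elif model == "WS":
        g = nx.watts_strogatz_graph(n_nodes, hp["k"], hp["p"])
    else:
        raise ValueError(f"unknown random graph model: {model}")
    # Orient every edge from the lower-indexed to the higher-indexed node so
    # that the resulting directed graph is acyclic.
    dag = nx.DiGraph()
    dag.add_nodes_from(g.nodes)
    dag.add_edges_from((min(u, v), max(u, v)) for u, v in g.edges)
    return dag

# Hypothetical usage: a Watts-Strogatz wiring with k = 4 neighbours and rewiring probability p = 0.75.
arch = random_wiring("WS", n_nodes=32, k=4, p=0.75)
print(arch.number_of_nodes(), arch.number_of_edges())
```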
E.2 Experimental Setup

NAS-BOWL We use a batch size $B = 5$ (i.e., at each BO iteration, the architectures yielding the top-5 acquisition function values are selected to be evaluated in parallel). When the mutation algorithm described in Section 3.2 is used, we use a pool size of $P = 200$, half of which is generated by mutating the top-10 best-performing architectures already queried and the other half is generated by random sampling to encourage more exploration in NAS-Bench-101. In NAS-Bench-201, accounting for the much smaller search space and the consequently lesser need for exploration, we simply generate all candidate architectures from mutation. For experiments with random acquisition, we also use $P = 200$ throughout, and we study the effect of varying $P$ later in this section. We use WL with optimal assignment (OA) for all datasets apart from NAS-Bench-201. Denoting the feature vectors of two graphs $G_1$ and $G_2$ as $\phi_1$ and $\phi_2$ respectively, the OA inner product in the WL case is given by the histogram intersection $\langle \phi_1, \phi_2 \rangle = \sum_i \min(\phi_{1,i}, \phi_{2,i})$, where $\phi_{j,i}$ is the $i$-th element of the vector $\phi_j$. On NAS-Bench-201, which features a much smaller search space, we find a simple dot product of the feature vectors, $\phi_1^\top \phi_2$, to perform empirically better; a minimal sketch of both inner products is given below. We always use 10 random samples to initialise NAS-BOWL.

On the NAS-Bench-101 dataset, we always apply pruning (which is available in the NAS-Bench-101 API) to remove invalid nodes and edges from the graphs. On the NAS-Bench-201 dataset, since the architectures are defined over a DARTS-like, edge-labelled search space, we first convert the edge-labelled graphs to node-labelled graphs as a pre-processing step. It is worth noting that it is possible to use a WL kernel defined over edge-labelled graphs directly (e.g. the WL-edge kernel proposed by [49]), although in this paper we find the WL kernels over node-labelled graphs to perform empirically better.
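As an illustration, the sketch below computes the two inner products on (aligned) WL feature-count vectors discussed above: the histogram intersection used for the optimal-assignment variant, and the plain dot product used on NAS-Bench-201. The function names and toy histograms are our own and purely illustrative.

```python
import numpy as np

def oa_inner_product(phi1, phi2):
    """Optimal-assignment (histogram intersection) inner product of two WL
    feature-count vectors aligned over the same vocabulary of subtree features."""
    phi1 = np.asarray(phi1, dtype=float)
    phi2 = np.asarray(phi2, dtype=float)
    return float(np.minimum(phi1, phi2).sum())

def dot_inner_product(phi1, phi2):
    """Plain dot product of the two feature-count vectors."""
    return float(np.dot(phi1, phi2))

# Hypothetical usage with toy WL histograms:
phi_a = [2, 0, 1, 3]
phi_b = [1, 1, 1, 2]
print(oa_inner_product(phi_a, phi_b))   # min-sum: 1 + 0 + 1 + 2 = 4
print(dot_inner_product(phi_a, phi_b))  # dot:     2 + 0 + 1 + 6 = 9
```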
BANANAS We use the code made public by the authors [24] (https://github.com/naszilla/bananas) and use the default settings contained in the code, with the exception of the number of architectures queried at each BO iteration (i.e. the BO batch size): to conform to our test settings we use $B = 5$ instead of the code's default. While we do not change the default pool size of $P = 200$ at each BO iteration, instead of filling the pool entirely by mutating the best architectures, we only generate half of the pool by mutating the top-10 best architectures and generate the other half randomly, to enable a fair comparison with our method. It is worth noting that neither change led to a significant deterioration in the performance of BANANAS: under the deterministic validation error setup, the results we report are largely consistent with those reported in [24]; under the stochastic validation error setup, our BANANAS results actually slightly outperform the results in the original paper. It is finally worth noting that a public implementation of BANANAS on NAS-Bench-201 was not released by the authors.
GCNBO for NAS We implemented the GNN surrogate in Section 5.1 ourselves following the description in the most recent work [13] (whose code has not been publicly released), which uses a graph convolutional neural network in combination with a Bayesian linear regression layer to predict architecture performance in its BO-based NAS. To ensure a fair comparison with our NAS-BOWL, we then define a standard Expected Improvement (EI) acquisition function based on the predictive distribution of the GNN surrogate to obtain another BO-based NAS baseline in Section 5.2, GCNBO. Similar to all the other baselines, including our NASBOWLr and
BANANASr, we use random sampling to generate candidate architectures for acquisition function optimisation. However, different from NAS-BOWL and BANANAS, GCNBO uses a batch size $B = 1$: at each BO iteration, NAS-BOWL and BANANAS select 5 new architectures to evaluate next, whereas GCNBO selects only 1. This setup should favour GCNBO when we measure the optimisation performance against the number of architecture evaluations (the metric used in Figs. 4 and 9), because at each BO iteration GCNBO selects the next architecture $G_t$ based on the most up-to-date information $\alpha_t(G \mid \mathcal{D}_{t-1})$, whereas NAS-BOWL and BANANAS only select one architecture $G_{t,1}$ in such a fully informed way and select the other four architectures $\{G_{t,i}\}_{i=2}^{5}$ with outdated information. Specifically, in the sequential case ($B = 1$), $G_{t,2}$ is selected only after we have evaluated $G_{t,1}$: $G_{t,2}$ is selected by maximising $\alpha_t(G \mid \{\mathcal{D}_{t-1}, (G_{t,1}, y_{t,1})\})$, and the same procedure applies for $G_{t,3}$, $G_{t,4}$ and $G_{t,5}$. However, in the batch case ($B = 5$), where $G_{t,i}$ for $2 \leq i \leq 5$ needs to be selected before $G_{t,i-1}$ is evaluated, $\{G_{t,i}\}_{i=2}^{5}$ are all decided based on $\alpha_t(G \mid \mathcal{D}_{t-1})$, like $G_{t,1}$; a schematic sketch contrasting the two regimes is given below. For a more detailed discussion on sequential ($B = 1$) and batch ($B > 1$) BO, the reader is referred to [39].
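To make the contrast between the two selection regimes concrete, below is a schematic sketch; the surrogate, acquisition and evaluation functions are stubbed placeholders (purely hypothetical), not the actual NAS-BOWL or GCNBO implementations.

```python
import random

# Hypothetical stand-ins for the real surrogate fitting, acquisition and objective.
def fit(surrogate, data):
    return surrogate                 # refit the surrogate on the observed data (stub)

def expected_improvement(surrogate, data, g):
    return random.random()           # EI value of candidate architecture g (stub)

def evaluate(g):
    return random.random()           # validation error obtained by training g (stub)

def select_batch(surrogate, data, pool, batch_size=5):
    """Batch BO (B = 5): every architecture in the batch is scored with an
    acquisition conditioned on the same data D_{t-1}."""
    surrogate = fit(surrogate, data)
    scores = {g: expected_improvement(surrogate, data, g) for g in pool}
    return sorted(pool, key=scores.get, reverse=True)[:batch_size]

def select_sequential(surrogate, data, pool, n_steps=5):
    """Sequential BO (B = 1): the surrogate is refitted after every single
    evaluation, so each pick uses fully up-to-date information."""
    chosen = []
    for _ in range(n_steps):
        surrogate = fit(surrogate, data)
        g_next = max(pool, key=lambda g: expected_improvement(surrogate, data, g))
        data = data + [(g_next, evaluate(g_next))]
        chosen.append(g_next)
    return chosen, data

# Hypothetical usage with a toy pool of candidate "architectures":
pool = [f"arch_{i}" for i in range(10)]
print(select_batch(None, [], pool))
print(select_sequential(None, [], pool)[0])
```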
Other Baselines For all the other baselines (random search [77], TPE [15], reinforcement learning [71], BO with SMAC [17] and regularised evolution [72]), we follow the implementation available at https://github.com/automl/nas_benchmarks for NAS-Bench-101 [21], and we modify it to be applicable to NAS-Bench-201 [22]. Note that, like GCNBO, all these methods are sequential ($B = 1$), and thus should enjoy the same advantage mentioned above when measured against the number of architectures evaluated.

E.3 Additional NAS-Bench Results

Test Errors Against Number of Evaluations
We show the test errors against the number of evaluations, using both stochastic and deterministic validation errors of the NAS-Bench datasets, in Fig. 9. It is worth noting that regardless of whether the validation errors are stochastic or not, the test errors are always averaged to deterministic values for fair comparison. NAS-BOWL still outperforms the other methods under this metric, achieving a lower test error, faster convergence, or both under most circumstances. This corresponds well with the results on the validation error in Fig. 4 and further confirms the superior performance of the proposed NAS-BOWL in searching for optimal architectures.
Results Against GPU-Hours
In Figs. 10 and 11, we show the validation and test errors against the number of GPU-hours used to train the architectures (instead of the number of architectures evaluated, as in Figs. 4 and 9), respectively; NAS-BOWL outperforms the baselines in terms of GPU-time as well. It is worth noting that the training time is available from both NAS-Bench datasets under the standardised settings described in [21, 22]; we did not actually train these models. Finally, whereas for the sequential methods the GPU-time is equivalent to the wall-clock time, since our method features batch BO the wall-clock time can be dramatically reduced from the GPU-time reported here by taking advantage of any available parallel computing facility (e.g., if $B = 5$ and we have 5 GPUs available in parallel, the wall-clock time is roughly one fifth of the GPU-time).

Figure 9: Median test error on the NAS-Bench datasets (N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials. Shades denote ± 1 standard error.

Figure 10: Median validation error on the NAS-Bench datasets (N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials, against the number of GPU-hours. Shades denote ± 1 standard error.

Figure 11: Median test error on the NAS-Bench datasets (N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials, against the number of GPU-hours. Shades denote ± 1 standard error.

E.4 Effect of Varying Pool Size
As discussed in the main text, NAS-BOWL introduces no inherent hyperparameters that require manual tuning, but as discussed in App. C, the choice of how to generate the candidate architectures requires us to specify a number of parameters, such as the pool size ($P$, the number of candidate architectures to generate at each BO iteration) and the batch size $B$. In our main experiments, we set $P = 200$ and $B = 5$ throughout; in this section, we consider the effect of varying $P$ to investigate whether the performance of NAS-BOWL is sensitive to this parameter.

We keep $B = 5$ but vary $P$ over four values (including $P = 50$ and our default $P = 200$), and keep all other settings consistent with the other experiments using the deterministic validation errors on NAS-Bench-101 (N101) (i.e. averaging the validation errors over seeds to remove stochasticity). We report our results in Fig. 12, where the median is computed from 20 repeated experiments. While the convergence speed varies slightly between the different choices of $P$, for all choices apart from $P = 50$, which performs slightly worse, NAS-BOWL converges to similar validation and test errors by the end of 150 architecture evaluations. This suggests that the performance of NAS-BOWL is rather robust to the value of $P$, and that our recommendation of $P = 200$ performs well both in terms of the final solution returned and the convergence speed.

Figure 12: Effect of varying $P$ on NAS-BOWL in N101 ((a) validation error, (b) test error).

E.5 Ablation Studies
In this section we perform ablation studies on the NAS-BOWL performance on both N101 and N201 (with deterministic validation errors). We repeat each experiment 20 times, and we present the median and standard error of both the validation and test performances in Fig. 13 (N101 in (a)(b) and N201 in (c)(d)). We now explain each legend as follows:
1. gradmutate: Full NAS-BOWL using the gradient-guided architecture mutation described in Section 3.2 (identical to NASBOWLm in Figs. 4 and 9);

2. mutate: NAS-BOWL with the standard mutation algorithm. Specifically, we use an identical setup to the gradient-guided mutation scheme, with the only exception that the mutation probabilities of all nodes and edges are uniform;

3. WL: NAS-BOWL with random candidate generation. This is identical to NASBOWLr in Figs. 4 and 9;
4. UCB: NAS-BOWL with random candidate generation, but with the acquisition function changed from Expected Improvement (EI) to the Upper Confidence Bound (UCB) [78], $\alpha_{\mathrm{acq}} = \mu + \beta_n \sigma$, where $\mu$ and $\sigma$ are the predictive mean and standard deviation of the GPWL surrogate, respectively, and $\beta_n$ is a coefficient that changes as a function of $n$, the number of BO iterations. We select $\beta$ at initialisation ($\beta_0$) and adjust it according to $\beta_n = \beta_0 \sqrt{\log(2(n+1))}$ as suggested by [78];

5. VH: NAS-BOWL with random candidate generation, but instead of leaving the value of $h$ (the number of WL iterations) to be automatically determined by the optimisation of the GP log marginal likelihood, we set $h = 0$, i.e. no WL iteration takes place and the only features we use are the counts of each type of original node operation (e.g. conv3×3). This essentially reduces the WL kernel to a Vertex Histogram (VH) kernel.

Figure 13: Ablation studies of NAS-BOWL ((a) validation error on N101, (b) test error on N101, (c) validation error on N201, (d) test error on N201).

We find that using an appropriate $h$ is crucial: in both N101 and N201, VH significantly underperforms the other variants, although the extent of the underperformance is smaller in N201, likely due to its smaller search space. This suggests that how the nodes are connected, which is captured by the higher-order WL features, is very important, and that the multi-scale feature extraction in the WL kernel is crucial to the success of NAS-BOWL. On the other hand, the choice of the acquisition function seems not to matter as much, as there is little difference between the UCB and WL runs in both N101 and N201. Finally, using either mutation algorithm leads to a significant improvement in the performance of NAS-BOWL; between gradmutate and mutate, while there is little difference in terms of final performance, in both cases gradmutate does converge faster in the initial phase. We note that both datasets still feature rather small search spaces where the gain from guided search can be limited, as it is possible that random mutation might already be sufficient to find high-performing regions in the search space. We expect the potential gain in performance of gradmutate