Optimal Transport Kernels for Sequential and Parallel Neural Architecture Search
Vu Nguyen∗ (University of Oxford)
Tam Le∗ (RIKEN AIP)
Makoto Yamada (Kyoto University & RIKEN AIP)
Michael A. Osborne (University of Oxford)

∗ These authors contributed equally.
Preprint. Under review.
Abstract
Neural architecture search (NAS) automates the design of deep neural networks. One of the main challenges in searching complex and non-continuous architectures is to compare the similarity of networks that the conventional Euclidean metric may fail to capture. Optimal transport (OT) is resilient to such complex structure by considering the minimal cost for transporting a network into another. However, the OT is generally not negative definite which may limit its ability to build the positive-definite kernels required in many kernel-dependent frameworks. Building upon tree-Wasserstein (TW), which is a negative definite variant of OT, we develop a novel discrepancy for neural architectures, and demonstrate it within a Gaussian process surrogate model for the sequential NAS settings. Furthermore, we derive a novel parallel NAS, using quality k-determinantal point process on the GP posterior, to select diverse and high-performing architectures from a discrete set of candidates. Empirically, we demonstrate that our TW-based approaches outperform other baselines in both sequential and parallel NAS.
Neural Architecture Search (NAS) is the process of automating architecture engineering to find the best design of our neural network model. This output architecture will perform well for a particular provided dataset. With the increasing interest in deep learning in recent years, NAS has attracted significant research attention [9, 11, 31, 32, 33, 43, 44, 48, 55, 61, 63]. We refer the interested readers to the survey [12] for a detailed review of NAS.

Bayesian optimization (BO) utilizes a probabilistic model, particularly a Gaussian process (GP) [42], for determining future evaluations, and its evaluation efficiency makes it well suited for the expensive evaluations of NAS. However, the conventional BO approaches [49, 51] are not suitable to capture the complex and non-continuous designs of neural architectures. Recent work [24] has considered optimal transport (OT) for measuring neural architectures. This views two networks as logistical suppliers and receivers, then optimizes to find the minimal transportation cost as the distance, i.e., similar architectures will need less cost for transporting and vice versa. However, the existing OT distances for architectures, such as OTMANN [24], do not easily lend themselves to the creation of a positive semi-definite (p.s.d.) kernel (covariance function), because OT is generally not negative definite [38] (§8.3). This is critical, as the GP is not a valid random process when the covariance function (kernel) is not p.s.d. (see Lem. 2.1). In addition, there is still an open research direction for parallel NAS where the goal is to select multiple high-performing and diverse candidates from a discrete set of candidates for parallel evaluations. This discrete property makes parallel NAS interesting and different from the existing batch BO approaches [7, 20], which are typically designed to handle continuous observations.

We propose a p.s.d. tree-Wasserstein distance for neural network architectures. We design a new way to capture both global and local information via n-gram and indegree/outdegree representations for networks. In addition, we propose the k-determinantal point process (k-DPP) quality for selecting diverse and high-performing architectures from a discrete set. This discrete property of NAS makes k-DPP ideal for sampling the optimal choices, overcoming the greedy selection used in the existing batch Bayesian optimization [7, 20, 59]. We summarize our contributions as follows:

• A tree-Wasserstein distance with a novel design for capturing local and global information from architectures, which results in a p.s.d. kernel while the existing OT distance does not.
• A demonstration of tree-Wasserstein as the novel GP covariance function for sequential NAS.
• A parallel NAS approach using k-DPP for selecting diverse and high-quality architectures from a discrete set.
We first argue that the covariance matrices associated with a kernel function of a Gaussian process (GP) and a k-DPP need to be positive semi-definite (p.s.d.) for a valid random process in Lemma 2.1. We then develop tree-Wasserstein (TW) [8, 13, 29], the negative definite variant of OT, for measuring the similarity of architectures. Consequently, we can build a p.s.d. kernel upon optimal transport (OT) geometry for modelling with GPs and k-determinantal point processes (k-DPPs).
Lemma 2.1.
If a covariance function k of a Gaussian process is not positive semi-definite, the resulting GP is not a valid random process.

Proof of Lemma 2.1 is placed in the Appendix §D.1.
We give a brief review of OT, tree metrics, and tree-Wasserstein (TW), which are the main components of our NAS framework. We denote $[n] = \{1, 2, \ldots, n\}$, $\forall n \in \mathbb{N}^+$. Let $(\Omega, d)$ be a measurable metric space. For any $x \in \Omega$, we use $\delta_x$ for the Dirac unit mass on $x$.

Optimal transport.
OT, a.k.a. Wasserstein, Monge-Kantorovich, or Earth Mover's distance, is a powerful tool to compare probability measures [38, 56]. Let $\omega, \nu$ be Borel probability distributions on $\Omega$ and $\mathcal{R}(\omega, \nu)$ be the set of probability distributions $\pi$ on $\Omega \times \Omega$ such that $\pi(B \times \Omega) = \omega(B)$ and $\pi(\Omega \times B') = \nu(B')$ for all Borel sets $B, B'$. The 1-Wasserstein distance $W_d$ [56] (p.2) between $\omega, \nu$ is defined as:
$$W_d(\omega, \nu) = \inf_{\pi \in \mathcal{R}(\omega, \nu)} \int_{\Omega \times \Omega} d(x, z)\, \pi(dx, dz) \quad (1)$$
where $d$ is a ground metric (i.e., cost metric) of OT.

Tree metrics and Tree-Wasserstein.
A metric $d: \Omega \times \Omega \rightarrow \mathbb{R}_+$ is a tree metric if there exists a tree $\mathcal{T}$ with positive edge lengths such that $\forall x \in \Omega$, $x$ is a node of $\mathcal{T}$; and $\forall x, z \in \Omega$, $d(x, z)$ equals the length of the (unique) path between $x$ and $z$ [47] (§7, p.145–182).

Let $d_{\mathcal{T}}$ be the tree metric on tree $\mathcal{T}$ rooted at $r$. For $x, z \in \mathcal{T}$, we denote $\mathcal{P}(x, z)$ as the (unique) path between $x$ and $z$. We write $\Gamma(x)$ for the set of nodes in the subtree of $\mathcal{T}$ rooted at $x$, defined as $\Gamma(x) = \{z \in \mathcal{T} \mid x \in \mathcal{P}(r, z)\}$. For an edge $e$ in $\mathcal{T}$, let $v_e$ be the deeper-level node of edge $e$ (the node farther from root $r$), and $w_e$ be the positive length of that edge.

Tree-Wasserstein (TW) is a special case of OT whose ground metric is a tree metric [8, 13, 29]. Given two measures $\omega, \nu$ supported on tree $\mathcal{T}$, and setting the tree metric $d_{\mathcal{T}}$ as the ground metric, the TW distance $W_{d_{\mathcal{T}}}$ between $\omega, \nu$ has a closed-form solution [29]:
$$W_{d_{\mathcal{T}}}(\omega, \nu) = \sum_{e \in \mathcal{T}} w_e \left| \omega\big(\Gamma(v_e)\big) - \nu\big(\Gamma(v_e)\big) \right|. \quad (2)$$
We note that we can derive p.s.d. kernels on the tree-Wasserstein distance $W_{d_{\mathcal{T}}}$ [29], as opposed to the standard OT $W_d$ for general $d$ [38].

We present a new approach leveraging the tree-Wasserstein for measuring the similarity of neural network architectures. We consider a neural network architecture $x$ by $(S^o, A)$ where $S^o$ is a multi-set of operations in each layer of $x$, and $A$ is an adjacency matrix representing the connection among those layers in $x$. We can also view a neural network as a directed labeled graph where each layer is a node in the graph, and the operation in each layer is a node label (i.e., $A$ represents the graph structure, and $S^o$ contains node labels). We then propose to extract information from neural network architectures by distilling them into three separate quantities as follows:

$n$-gram representation for layer operations. Each neural network consists of several operations from the input layer to the output layer. Inspired by the $n$-gram representation of a document in natural language processing, we view a neural network as a document and its operations as words. Therefore, we can use $n$-grams (i.e., $n$-length paths) to represent the operations considered in the neural network. We then normalize the $n$-gram, and denote it as $x^o$ for a neural network $x$. Particularly, for $n = 1$, the $n$-gram representation is a frequency vector of operations, as used in Nasbot [24]. When we use all $n \leq \ell$, where $\ell$ is the number of network layers, the $n$-gram representation shares the same spirit as the path encoding used in Bananas [60].

Let $S$ be the set of operations, and $S^n = S \times S \times \cdots \times S$ ($n$ times); the $n$-grams can be represented as empirical measures as follows
$$\omega^o_x = \sum_{s \in S^n} x^o_s \delta_s, \qquad \omega^o_z = \sum_{s \in S^n} z^o_s \delta_s \quad (3)$$
where $x^o_s$ and $z^o_s$ are the frequencies of the $n$-gram operation $s \in S^n$ in architectures $x$ and $z$, respectively. We can leverage the TW distance to compare the $n$-gram representations $\omega^o_x$ and $\omega^o_z$ using Eq. (2), denoted as $W_{d_{\mathcal{T}_o}}(\omega^o_x, \omega^o_z)$. To compute this distance, we utilize a predefined tree structure for network operations by hierarchically grouping similar network operations into a tree, as illustrated in Fig. 1. We can utilize domain knowledge to define the grouping and the edge weights; for example, conv1 and conv3 can be in the same group while maxpool is from another group. Inspired by the partition-based tree metric sampling [29], we define the edge weights to decrease as the edge gets farther from the root.
Although such design can be subjective, the final distance (defined later in Eq. (5)) will be calibrated and normalized properly when modeling with a GP in §3. We refer to Fig. 6 and Appendix §E for an example of the TW computation for neural network architectures.
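As a small illustration of the n-gram measures in Eq. (3), the sketch below builds normalized 1-gram and 2-gram frequency dictionaries from a sequence of layer operations. It is a toy helper under a simplifying assumption (operations along a single path), not the paper's implementation.

```python
from collections import Counter

def ngram_measure(operations, n):
    """Normalized n-gram empirical measure of Eq. (3) for a list of operations.

    `operations` is the sequence of operation labels along a path through the
    network (a simplification for illustration; the paper builds n-grams from
    n-length paths in the architecture graph).
    """
    grams = [tuple(operations[i:i + n]) for i in range(len(operations) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

ops = ["conv1", "conv3", "conv3", "conv3"]
print(ngram_measure(ops, 1))   # {('conv1',): 0.25, ('conv3',): 0.75}
print(ngram_measure(ops, 2))   # {('conv1', 'conv3'): 1/3, ('conv3', 'conv3'): 2/3}
```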
Indegree and outdegree representations for network structure. We propose to leverage the indegree and outdegree of each layer, which are the numbers of incoming and outgoing layers respectively, as an alternative way to represent a network structure. We denote $L_x$ as the set of all layers which one can reach from the input layer of neural network $x$. Let $\eta_{x,\ell}$ and $M_x$ be the lengths of the shortest paths from the input layer to layer $\ell$ and to the output layer, respectively. Observing the common structure of neural networks, which start with an input layer, connect with some middle layers, and end with an output layer, we represent the indegree and outdegree of network layers in $x$ as empirical measures $\omega^{d-}_x$ and $\omega^{d+}_x$, defined as
$$\omega^{d-}_x = \sum_{\ell \in L_x} x^{d-}_\ell \, \delta_{\frac{\eta_{x,\ell}+1}{M_x+1}}, \qquad \omega^{d+}_x = \sum_{\ell \in L_x} x^{d+}_\ell \, \delta_{\frac{\eta_{x,\ell}+1}{M_x+1}}, \quad (4)$$
where $x^{d-}_\ell$ and $x^{d+}_\ell$ are the normalized indegree and outdegree of layer $\ell$ of $x$, respectively.

For indegree and outdegree information, the supports of the empirical measures $\omega^{d-}_x$ and $\omega^{d-}_z$ lie in a one-dimensional space, so the tree structure reduces to a chain of supports. Thus, we can use $W_{d_{\mathcal{T}_-}}(\omega^{d-}_x, \omega^{d-}_z)$ to compare those empirical measures. Similarly, we have $W_{d_{\mathcal{T}_+}}(\omega^{d+}_x, \omega^{d+}_z)$ for the empirical measures $\omega^{d+}_x$ and $\omega^{d+}_z$ built from outdegree information. Since the tree is a chain, the TW is equivalent to the univariate OT.

Figure 1: We represent two architectures x and z by network structure (via outdegree and indegree) and network operation (using 1-gram in this example). The similarity between each respective representation is estimated by tree-Wasserstein to compute the minimal cost of transporting one object to another. As a nice property of optimal transport, our tree-Wasserstein can handle different layer sizes and different operation types. The weights in each histogram are calculated from the architectures. The histogram bins in outdegree and indegree are aligned with the network structure on the left. See the Appendix §E for detailed calculations.
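As an illustration of Eq. (4), the following sketch builds the indegree measure of an architecture as a histogram over normalized depths. The 0/1 adjacency-matrix encoding (entry A[i][j] = 1 means layer i feeds layer j, layer 0 is the input, the last layer is the output) is an assumption for illustration, not the paper's exact format.

```python
from collections import deque

def indegree_measure(adjacency):
    """Build the indegree measure of Eq. (4) from a 0/1 adjacency matrix.

    Returns a dictionary {support location: mass}. The encoding is an
    illustrative assumption, not the paper's exact data format.
    """
    n = len(adjacency)
    # Shortest-path depth eta of every layer from the input layer (BFS).
    eta = [None] * n
    eta[0] = 0
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if adjacency[i][j] and eta[j] is None:
                eta[j] = eta[i] + 1
                queue.append(j)
    M = eta[n - 1]                      # depth of the output layer
    indeg = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    total = sum(indeg)
    measure = {}
    for j in range(n):
        if eta[j] is None or indeg[j] == 0:
            continue                    # skip unreachable layers and the input layer
        support = (eta[j] + 1) / (M + 1)
        measure[support] = measure.get(support, 0.0) + indeg[j] / total
    return measure

# Toy architecture: input -> {conv1, conv3} -> output.
A = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
print(indegree_measure(A))   # {2/3: 0.5, 1.0: 0.5}
```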
Tree-Wasserstein distance for neural networks. Given neural networks $x$ and $z$, we consider three separate TWs for the $n$-gram, indegree and outdegree representations of the networks. Then, we define $d_{NN}$ as a convex combination with positive weights $\{\alpha_1, \alpha_2, \alpha_3 \mid \sum_i \alpha_i = 1, \alpha_i \geq 0\}$ for $W_{d_{\mathcal{T}_o}}$, $W_{d_{\mathcal{T}_-}}$, and $W_{d_{\mathcal{T}_+}}$ respectively, to compare neural networks $x$ and $z$ as:
$$d_{NN}(x, z) = \alpha_1 W_{d_{\mathcal{T}_o}}(\omega^o_x, \omega^o_z) + \alpha_2 W_{d_{\mathcal{T}_-}}(\omega^{d-}_x, \omega^{d-}_z) + (1 - \alpha_1 - \alpha_2) W_{d_{\mathcal{T}_+}}(\omega^{d+}_x, \omega^{d+}_z). \quad (5)$$
The proposed discrepancy $d_{NN}$ can capture not only the frequency of layer operations, but also network structure, e.g., the indegree and outdegree of network layers. We illustrate our proposed TW for neural networks in Fig. 8, describing each component in Eq. (5). We also describe the calculation in detail in the Appendix §E. We highlight a useful property of our proposed $d_{NN}$: it can compare two architectures with different layer sizes and/or operation sizes.

Proposition 1.
The $d_{NN}$ for neural networks is a pseudo-metric and negative definite.

Proof of Proposition 1 is placed in the Appendix §D.2. Our discrepancy $d_{NN}$ is negative definite, as opposed to the OT for neural networks considered in [24], which is indefinite. Therefore, from Proposition 1 and following Theorem 3.2.2 in [2], we can derive a positive definite TW kernel upon $d_{NN}$ for neural networks $x, z$ as
$$k(x, z) = \exp\left(- d_{NN}(x, z) / \sigma_l\right), \quad (6)$$
where the scalar $\sigma_l$ is the length-scale parameter. Our kernel has three hyperparameters: the length-scale $\sigma_l$ in Eq. (6), and $\alpha_1$ and $\alpha_2$ in Eq. (5). We refer to the Appendix §G for further discussion about the properties of the pseudo-distance $d_{NN}$.
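A minimal sketch of turning the discrepancy into a covariance matrix; `d_nn` is assumed to be a callable implementing Eq. (5), and the length-scale value is arbitrary.

```python
import numpy as np

def tw_kernel_matrix(archs, d_nn, lengthscale=1.0):
    """Gram matrix of the TW kernel k(x, z) = exp(-d_NN(x, z) / sigma_l), Eq. (6).

    `archs` is a list of architectures and `d_nn` a callable implementing
    Eq. (5); both are assumptions for illustration.
    """
    n = len(archs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-d_nn(archs[i], archs[j]) / lengthscale)
    return K
```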
Problem setting. We consider a noisy black-box function $f: \mathcal{X} \rightarrow \mathbb{R}$ over some domain $\mathcal{X}$ containing neural network architectures. As a black-box function, we do not have a closed form for $f$ and it is expensive to evaluate. Our goal is to find the best architecture $x^* \in \mathcal{X}$ such that
$$x^* = \arg\max_{x \in \mathcal{X}} f(x). \quad (7)$$
We view the black-box function above as a machine learning experiment which takes as input a neural network architecture $x$ and produces an accuracy $y$. We can write $y = f(x) + \epsilon$, where we have considered Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_f^2)$ with the noise variance $\sigma_f^2$ estimated from the data.

Bayesian optimization (BO) optimizes the black-box function by sequentially evaluating it [18, 49, 36]. Particularly, BO can speed up the optimization process by using a probabilistic model to guide the search [51]. BO has demonstrated impressive success for optimising expensive black-box functions across domains.
Surrogate models. Bayesian optimization reasons about $f$ by building a surrogate model, such as a Gaussian process (GP) [42], Bayesian deep learning [53], a deep neural network [52, 60] or a random forest [4]. Among these choices, the GP is the most popular model, offering three key benefits: (i) closed-form uncertainty estimation, (ii) evaluation efficiency, and (iii) learning of hyperparameters. A GP outputs a normally distributed random variable at every point in the input space. The predictive distribution for a new observation also follows a Gaussian distribution [42], where we can estimate the expected function value $\mu(x)$ and the predictive uncertainty $\sigma^2(x)$ as
$$\mu(x') = k(x', X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y \quad (8)$$
$$\sigma^2(x') = k(x', x') - k(x', X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} k^T(x', X) \quad (9)$$
where $X = [x_1, \ldots, x_N]$ and $y = [y_1, \ldots, y_N]$ are the collected architectures and performances; $K(U, V)$ is a covariance matrix whose element $(i, j)$ is calculated as $k(x_i, x_j)$ with $x_i \in U$ and $x_j \in V$; $\sigma_n^2$ is the measurement noise variance and $I$ is the identity matrix.

Generating a pool of candidates $\mathcal{P}_t$. We follow [24, 60] to generate a list of candidate networks using an evolutionary algorithm [1]. First, we stochastically select top-performing candidates with higher acquisition function values. Then, we apply a mutation operator to each candidate to produce modified architectures. Finally, we evaluate the acquisition function on these mutations, add them to the initial pool, and repeat for several steps to obtain a pool of candidates $\mathcal{P}_t$.
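The following sketch evaluates the GP predictive mean and variance of Eqs. (8, 9) over a discrete candidate pool; the kernel callable and the data containers are assumptions for illustration, not the paper's code.

```python
import numpy as np

def gp_posterior(kernel, X_train, y_train, candidates, noise_var=1e-3):
    """GP predictive mean (Eq. 8) and variance (Eq. 9) over a candidate pool.

    `kernel(a, b)` returns the covariance between two architectures, e.g. the
    TW kernel of Eq. (6); all names here are illustrative assumptions.
    """
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(X_train)))
    k_star = np.array([[kernel(c, a) for a in X_train] for c in candidates])
    mean = k_star @ K_inv @ np.asarray(y_train)
    var = np.array([kernel(c, c) for c in candidates]) - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mean, var
```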
Optimizing hyperparameters. We optimize the model hyperparameters by maximizing the log marginal likelihood. We present the derivatives for estimating the hyperparameters $\alpha_1$ and $\alpha_2$ of the tree-Wasserstein in the appendix. We optimize these variables via multi-started gradient descent.

We sequentially suggest a single architecture for evaluation using a decision function $\alpha(x)$ (a.k.a. acquisition function) from the surrogate model. This acquisition function is carefully designed to trade off between exploration of the search space and exploitation of current promising regions. We utilize the GP-UCB [54] as the main decision function $\alpha(x) = \mu(x) + \kappa \sigma(x)$, where $\kappa$ is the parameter controlling exploration, and $\mu$ and $\sigma$ are the GP predictive mean and standard deviation from Eqs. (8, 9). Empirically, we find that GP-UCB generally performs better than expected improvement (EI) (see the Appendix §H.1) and other acquisition functions (see [60]). We note that GP-UCB also comes with a theoretical guarantee of convergence [54].

We maximize the acquisition function to select the next architecture $x_{t+1} = \arg\max_{x \in \mathcal{P}_t} \alpha_t(x)$. This maximization is done on the discrete set of candidates $\mathcal{P}_t$ obtained previously. The selected candidate is the one we expect to be the best if we are optimistic in the presence of uncertainty.

Algorithm 1 Sequential and Parallel NAS using Gaussian process with tree-Wasserstein kernel
Input: Initial data $D_0$, black-box function $f(x)$.
Output: The best architecture $x^*$.
for $t = 1, \ldots, T$ do
    Generate architecture candidates $\mathcal{P}_t$ by random permutation from the top architectures.
    Learn a GP (including hyperparameters) using TW from $D_{t-1}$ to perform estimation over $\mathcal{P}_t$, including (i) the covariance matrix $K_{\mathcal{P}_t}$, (ii) the predictive mean $\mu_{\mathcal{P}_t}$ and (iii) the predictive variance $\sigma_{\mathcal{P}_t}$.
    If Sequential: select the next architecture $x_t = \arg\max_{x \in \mathcal{P}_t} \alpha(x \mid \mu_{\mathcal{P}_t}, \sigma_{\mathcal{P}_t})$, then evaluate the new architecture $y_t = f(x_t)$ and augment the data $D_t \leftarrow D_{t-1} \cup (x_t, y_t)$.
    If Parallel: select $B$ architectures $X_t = [x_{t,1}, \ldots, x_{t,B}] = \text{k-DPP}(K_{\mathcal{P}_t})$ in Eq. (12), then evaluate in parallel $Y_t = f(X_t)$ and augment $D_t \leftarrow D_{t-1} \cup (X_t, Y_t)$.
end for

The parallel setting speeds up the optimization process by selecting a batch of architectures for parallel evaluations. We present the k-determinantal point process (k-DPP) with quality to select, from a discrete pool of candidates $\mathcal{P}_t$, architectures that are (i) high-performing and (ii) diverse, covering the most information while avoiding redundancy.

The DPP [28] is an elegant probabilistic measure used to model negative correlations within a subset and hence promote its diversity. A k-determinantal point process (k-DPP) [27] is a distribution over all subsets of a ground set $\mathcal{P}_t$ of cardinality $k$. It is determined by a positive semidefinite kernel $K_{\mathcal{P}_t}$. Let $K_A$ be the submatrix of $K_{\mathcal{P}_t}$ consisting of the entries $K_{ij}$ with $i, j \in A \subseteq \mathcal{P}_t$. Then, the probability of observing $A \subseteq \mathcal{P}_t$ is proportional to $\det(K_A)$,
$$P(A \subseteq \mathcal{P}_t) \propto \det(K_A), \quad (10)$$
$$\text{where } K_{ij} = q_i \, \phi_i^T \phi_j \, q_j. \quad (11)$$

k-DPP with quality. While the original idea of a k-DPP is to find a diverse subset, we can extend it to find a subset which is both diverse and high-quality. For this, we write a DPP kernel $K$ as a Gram matrix, $K = \Phi^T \Phi$, where the columns of $\Phi$ are vectors representing items in the set $S$. We now take this one step further, writing each column $\Phi_i$ as the product of a quality term $q_i \in \mathbb{R}^+$ and a vector of normalized diversity features $\phi_i$ with $\|\phi_i\| = 1$. The entries of the kernel can now be written as in Eq. (11). As discussed in [28], this decomposition of $K$ has two main advantages. First, it implicitly enforces the constraint that $K$ must be positive semidefinite, which can potentially simplify learning. Second, it allows us to independently model quality and diversity, and then combine them into a unified model. Particularly, we have $P_K(A) \propto \left(\prod_{i \in A} q_i^2\right) \det\big([\phi_i^T \phi_j]_{i, j \in A}\big)$, where the first term increases with the quality of the selected items, and the second term increases with the diversity of the selected items. Without the quality component, we would get a very diverse set of architectures, but we might fail to include the most high-performing architectures in $\mathcal{P}_t$, focusing instead on low-quality outliers. By integrating the two models we can achieve a more balanced result.
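A small sketch of the quality-diversity decomposition in Eq. (11): each item's feature vector is split into a scalar quality and a unit-norm diversity direction. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 6, 4

quality = rng.uniform(0.5, 2.0, size=n_items)          # q_i > 0
phi = rng.normal(size=(n_items, dim))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)       # normalized diversity features

# Eq. (11): K_ij = q_i * phi_i^T phi_j * q_j, p.s.d. by construction.
K = np.outer(quality, quality) * (phi @ phi.T)

# The probability of a subset A is proportional to det(K_A), Eq. (10).
A = [0, 2, 5]
print(np.linalg.det(K[np.ix_(A, A)]))
```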
Conditioning. In the parallel setting, given the training data, we would like to select high-quality and diverse architectures from the pool of candidates $\mathcal{P}_t$ described above. We condition on the training data when constructing the covariance matrix over the testing candidates from $\mathcal{P}_t$. We make the following proposition connecting the k-DPP conditioning and GP uncertainty estimation. This view allows us to learn the covariance matrix using a GP; for example, we can maximize the GP marginal likelihood for learning the TW distance and kernel hyperparameters for the k-DPP.

Proposition 2.
Conditioned on the training set, the probability of selecting new candidates from a pool $\mathcal{P}_t$ is equivalent to the determinant of the Gaussian process predictive covariance matrix.

Proof of Proposition 2 is placed in the Appendix §F.1. We can utilize the GP predictive mean $\mu(\cdot)$ in Eq. (8) to estimate the quality of the unknown architectures. Then, we construct the covariance (kernel) matrix over the test candidates for selection following Eq. (12) as
$$K_{\mathcal{P}_t}(x_i, x_j) = \exp(-\mu(x_i)) \, \sigma(x_i, x_j) \, \exp(-\mu(x_j)), \quad \forall x_i, x_j \in \mathcal{P}_t \quad (12)$$
where $\mu(x_i)$ and $\sigma(x_i, x_j)$ are the GP predictive mean and covariance defined in Eqs. (8, 9). Each term in Eq. (12) is naturally in the range $[0, 1]$, thus it balances between diversity and quality. Finally, we sample $B$ architectures from the covariance matrix $K_{\mathcal{P}_t}$, which encodes both diversity (exploration) and high utility (exploitation). The sampling algorithm requires precomputing the eigenvalues [27]. Sampling from a k-DPP requires $O(NB)$ time overall, where $B$ is the batch size.
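A sketch of building the quality-weighted kernel of Eq. (12) from the GP posterior; the kernel callable and helper names are assumptions for illustration. The returned matrix can then be handed to any k-DPP sampler to draw a diverse, high-quality batch.

```python
import numpy as np

def quality_kdpp_kernel(kernel, X_train, y_train, candidates, noise_var=1e-3):
    """Quality-weighted k-DPP kernel of Eq. (12) built from the GP posterior.

    `kernel(a, b)` is assumed to implement the TW kernel of Eq. (6); this is
    an illustrative sketch, not the paper's implementation.
    """
    K_tr = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    K_inv = np.linalg.inv(K_tr + noise_var * np.eye(len(X_train)))
    k_star = np.array([[kernel(c, a) for a in X_train] for c in candidates])
    K_cc = np.array([[kernel(a, b) for b in candidates] for a in candidates])

    mu = k_star @ K_inv @ np.asarray(y_train)          # predictive mean, Eq. (8)
    cov = K_cc - k_star @ K_inv @ k_star.T             # predictive covariance, Eq. (9)

    quality = np.exp(-mu)                               # exp(-mu(x_i)) quality term
    return np.outer(quality, quality) * cov             # Eq. (12)
```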
Advantages. The connection between GP and k-DPP allows us to directly sample diverse and high-quality samples from the GP posterior. This leads to the key advantage that we can optimally sample a batch of candidates without the need for greedy selection. On the other hand, the existing batch BO approaches rely either on a greedy strategy [6, 7, 20] to sequentially select the points in a batch or on independent sampling [14, 22]. The greedy algorithm is non-optimal and the independent sampling approaches cannot coordinate the information across points in a batch.

We note that our k-DPP above is related to [25], but differs in two respects: [25] considers k-DPP for batch BO in (i) the continuous setting and (ii) using pure exploration (without quality). We consider this as a baseline in the experiments.

Figure 2: Sequential NAS on different distances for BO (left) and different baselines (middle and right); panels show test error over iterations on NASBENCH101 and NASBENCH201-Cifar100. Our approaches are BO-TW and BO-TW-2G for the 1-gram and 2-gram representations.

Experimental settings.
All experimental results are averaged over 30 independent runs with different random seeds. We set the number of candidate architectures in each pool to $|\mathcal{P}_t| = 100$. We will release all source code in the final version. We utilize the popular NAS tabular datasets NASBENCH101 (NB101) [64] and NASBENCH201 (NB201) [10] for evaluation. TW and TW-2G stand for our TW using the 1-gram and 2-gram representation, respectively.

Ablation study between different distances for BO.
We design an ablation study using different distances within a BO framework. Particularly, we consider the vanilla optimal transport (Wasserstein distance). We follow [24] to define the cost metric for OT. This baseline can be seen as a modified version of Nasbot [24]. In addition, we compare our approach with BO using the Gromov-Wasserstein distance [34] (BO-GW) and path encoding (BO-Path Encoding) as used in [60]. The results in Fig. 2 (left) suggest that the proposed TW using 2-gram performs the best among the BO distances for neural network architectures. The OT and GW result in an indefinite (non-p.s.d.) kernel. For using OT and GW in our GP, we keep adding ("jitter") noise to the diagonal of the kernel matrix until it becomes p.s.d. We utilize the POT library [15] for the implementation of OT and GW.

While our framework is able to handle any n-gram representation, we find that the 2-gram is empirically the best choice. This choice is supported by the fact that two stacked 3x3 convolution layers can represent the effect of a 5x5 convolution kernel. In addition, the use of the full n-gram may result in a very sparse representation in which some features are no longer meaningful. Therefore, in the experiments we only consider the 1-gram and 2-gram.
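A minimal sketch of the jitter trick mentioned above: repeatedly add noise to the diagonal until the smallest eigenvalue is non-negative. The starting jitter value and growth factor are arbitrary choices, not values from the paper.

```python
import numpy as np

def add_jitter_until_psd(K, jitter=1e-8, factor=10.0, max_tries=20):
    """Add increasing diagonal noise to K until it is positive semi-definite."""
    K = np.array(K, dtype=float)
    for _ in range(max_tries):
        if np.linalg.eigvalsh(K).min() >= 0:
            return K
        K += jitter * np.eye(len(K))
        jitter *= factor
    return K
```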
Sequential NAS. We validate our GP-BO model using tree-Wasserstein in the sequential setting. Since NB101 is somewhat harder than NB201, we allocate a larger query budget for NB101 than for NB201, including a portion of random selection at the beginning of BO.

We compare our approach against common baselines including random search, evolutionary search, TPE [3], BOHB [14], NASBOT [24] and BANANAS [60]. We use the AutoML library (https://github.com/automl/nas_benchmarks) for TPE and BOHB, including the results for NB101 but not NB201. We do not compare with reinforcement learning approaches, which have been shown to perform poorly in [60].

We show in Fig. 2 that our tree-Wasserstein, including the 1-gram and 2-gram variants, results in the best performance with a wide margin over the second best, BANANAS [60], which needs to specify the meta neural network with extra hyperparameters (layers, nodes, learning rate). Random search performs poorly in NAS due to the high-dimensional and complex space. Our GP-based optimizer offers closed-form uncertainty estimation without iterative approximation in a neural network (via back-propagation). As a property of the GP, our BO-TW can generalize well using fewer observations.

Figure 3: Batch NAS comparison using TW-2Gram and a batch size B = 5; panels show test error over iterations on NB201-Cifar10, NB201-Cifar100 and NB201-ImageNet for GP-TS, GP-BUCB/KB, Evolutionary, k-DPP and k-DPP Quality.
Figure 4: Performance with different batch sizes B on ImageNet (k-DPP Quality with B = 1 (sequential), 2, 3, 5, 10).

Batch NAS.
We next demonstrate our model on selecting multiple architectures for parallel evaluation, i.e., the parallel NAS setting. There are fewer approaches for parallel NAS compared to the sequential setting. We compare our k-DPP quality against Thompson sampling [22], GP-BUCB [7] and k-DPP for batch BO [25]. GP-BUCB is equivalent to Kriging believer [19] when the hallucinated observation value is set to the GP predictive mean; therefore, we label them as GP-BUCB/KB. We also compare with the vanilla k-DPP (without using quality) [25].

We allocate a maximum budget of queries, including random initial architectures. The result in Fig. 3 shows that our proposed k-DPP quality is the best among the baselines. We refer to the Appendix for additional experiments, including varying batch sizes and more results on NB201.

Our sampling from k-DPP quality is advantageous against the existing batch BO approaches [19, 7, 25, 22] in that we can optimally select a batch of architectures without relying on a greedy selection strategy. In addition, our k-DPP quality can leverage the advantage of the GP in estimating the hyperparameters for the covariance matrix.

Finally, we study the performance with different choices of batch size B in Fig. 4, which naturally confirms that the performance increases with larger batch size B.

Figure 5: Estimated hyperparameters on NB101 over iterations.
Estimating hyperparameters.
We plot the estimated hyperparameters $\lambda_1 = \alpha_1 / \sigma_l$, $\lambda_2 = \alpha_2 / \sigma_l$, $\lambda_3 = (1 - \alpha_1 - \alpha_2) / \sigma_l$ over iterations in Fig. 5. These indicate the relative contributions of the operation, indegree and outdegree components toward $d_{NN}$ in Eq. (5). Particularly, the operation component receives more weight and is more informative than the individual indegree or outdegree.

We have presented a new framework for sequential and parallel NAS. Our framework constructs the similarity between architectures using tree-Wasserstein geometry. Then, it utilizes a Gaussian process surrogate model for modeling and optimization. We draw the connection between the GP predictive distribution and k-DPP quality for selecting diverse and high-performing architectures from a discrete set. We demonstrate on NASBENCH101 and NASBENCH201 that our model outperforms the existing baselines in sequential and parallel settings.

Broader Impact
This paper presents a new machine learning approach that may have several societal impacts:

• Our proposed technique in sequential and batch neural architecture search will be widely applicable to all deep neural network based approaches, including a wide range of machine learning algorithms (deep reinforcement learning, deep generative models, deep learning) and a tremendous range of applications (computer vision, natural language processing, manufacturing and more).
• Our NAS approaches automate the neural network design process, which significantly saves cost and time for machine learning practitioners by taking them out of the tuning loop.
• The tree-Wasserstein distance for neural networks can be of independent interest for different tasks which utilize a kernel or covariance matrix, such as neural network compression, neural network training and continual learning.
• The k-DPP quality with Gaussian process can be of independent interest for other tasks, such as selecting a diverse and high-quality set of points from a discrete set in document summarization, video summarization, etc.
References

[1] Thomas Back.
Evolutionary algorithms in theory and practice: evolution strategies, evolution-ary programming, genetic algorithms . Oxford university press, 1996.[2] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel.
Harmonic analysis on semigroups .Springer-Verlag, 1984.[3] J S Bergstra, R Bardenet, Y Bengio, and B Kégl. Algorithms for hyper-parameter optimization.In
Advances in Neural Information Processing Systems , pages 2546–2554, 2011.[4] Leo Breiman. Random forests.
Machine learning , 45(1):5–32, 2001.[5] Rainer E Burkard and Eranda Cela. Linear assignment problems and extensions. In
Handbookof combinatorial optimization , pages 75–149. Springer, 1999.[6] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussianprocess optimization with upper confidence bound and pure exploration. In
Machine Learningand Knowledge Discovery in Databases , pages 225–240. Springer, 2013.[7] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitationtradeoffs in gaussian process bandit optimization.
The Journal of Machine Learning Research ,15(1):3873–3923, 2014.[8] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithmsfor Earth Mover’s distance.
Theory of Computing Systems , 48(2):428–442, 2011.[9] Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages1761–1770, 2019.[10] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neuralarchitecture search.
International Conference on Learning Representation , 2020.[11] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural archi-tecture search via lamarckian evolution.
International Conference on Learning Representation ,2019.[12] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey.
Journal of Machine Learning Research , 20(55):1–21, 2019.[13] Steven N Evans and Frederick A Matsen. The phylogenetic Kantorovich–Rubinstein metric forenvironmental sequence samples.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 74(3):569–592, 2012.[14] Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameteroptimization at scale. In
International Conference on Machine Learning, pages 1436–1445, 2018.
[15] Rémi Flamary and Nicolas Courty. POT Python Optimal Transport library.
GitHub: https://github.com/rflamary/POT , 2017.[16] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 , 2018.[17] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance.
PatternAnalysis and applications , 13(1):113–129, 2010.[18] Roman Garnett, Michael A Osborne, and Stephen J Roberts. Bayesian optimization for sensorset selection. In
Proceedings of the 9th ACM/IEEE international conference on informationprocessing in sensor networks , pages 209–219, 2010.[19] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to paral-lelize optimization. In
Computational Intelligence in Expensive Optimization Problems , pages131–162. Springer, 2010.[20] Javier González, Zhenwen Dai, Philipp Hennig, and Neil D Lawrence. Batch Bayesian op-timization via local penalization. In
International Conference on Artificial Intelligence andStatistics , pages 648–657, 2016.[21] Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, and Svetha Venkatesh. Algo-rithmic assurance: An active approach to algorithmic testing using Bayesian optimisation. In
Advances in Neural Information Processing Systems (NeurIPS) , pages 5465–5473, 2018.[22] José Miguel Hernández-Lobato, James Requeima, Edward O Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration ofchemical space.
In International Conference on Machine Learning , pages 1470–1479, 2017.[23] Haifeng Jin, Qingquan Song, and Xia Hu. Auto-keras: Efficient neural architecture search withnetwork morphism.[24] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric P Xing.Neural architecture search with bayesian optimisation and optimal transport. In
Advances inNeural Information Processing Systems , pages 2016–2025, 2018.[25] Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian process banditoptimization via determinantal point processes. In
Advances in Neural Information ProcessingSystems , pages 4206–4214, 2016.[26] R Kondor and J Lafferty. Diffusion kernels on graphs and other discrete input spaces. In
International Conference on Machine Learning , pages 315–322, 2002.[27] Alex Kulesza and Ben Taskar. k-dpps: Fixed-size determinantal point processes. In
Proceedingsof the 28th International Conference on Machine Learning (ICML) , pages 1193–1200, 2011.[28] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning.
Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
[29] Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced variants of Wasserstein distances. In Advances in Neural Information Processing Systems, pages 12283–12294, 2019.
[30] Lisha Li and Kevin Jamieson. Hyperband: A novel bandit-based approach to hyperparameter optimization.
Journal of Machine Learning Research , 18:1–52, 2018.[31] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu.Hierarchical representations for efficient architecture search.
International Conference onLearning Representation , 2018.[32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search.
International Conference on Learning Representation , 2019.[33] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimiza-tion. In
Advances in neural information processing systems , pages 7816–7827, 2018.[34] Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching.
Foundations of computational mathematics , 11(4):417–487, 2011.[35] Bruno T Messmer and Horst Bunke. A new algorithm for error-tolerant subgraph isomorphismdetection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):493–504, 1998.
[36] Vu Nguyen and Michael A Osborne. Knowing the what but not the where in Bayesian optimization. In
International Conference on Machine Learning , 2020.[37] Vu Nguyen, Sebastian Schulze, and Michael A Osborne. Bayesian optimization for iterativelearning. arXiv preprint arXiv:1909.09593 , 2019.[38] Gabriel Peyré and Marco Cuturi. Computational optimal transport.
Foundations and Trends inMachine Learning , 11(5-6):355–607, 2019.[39] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecturesearch via parameters sharing. In
International Conference on Machine Learning , pages4095–4104, 2018.[40] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advancesin neural information processing systems , pages 1177–1184, 2007.[41] Santu Rana, Cheng Li, Sunil Gupta, Vu Nguyen, and Svetha Venkatesh. High dimensionalBayesian optimization with elastic Gaussian process. In
Proceedings of the 34th InternationalConference on Machine Learning , pages 2883–2891, 2017.[42] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.[43] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for imageclassifier architecture search. In
Proceedings of the AAAI conference on Artificial Intelligence ,volume 33, pages 4780–4789, 2019.[44] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan,Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In
Proceedings ofthe 34th International Conference on Machine Learning , pages 2902–2911, 2017.[45] Binxin Ru, Ahsan S Alvi, Vu Nguyen, Michael A Osborne, and Stephen J Roberts. Bayesianoptimisation over multiple continuous and categorical inputs. In
International Conference onMachine Learning , 2020.[46] Christian Sciuto, Kaicheng Yu, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluatingthe search phase of neural architecture search. arXiv preprint arXiv:1902.08142 , 2019.[47] Charles Semple and Mike Steel. Phylogenetics.
Oxford Lecture Series in Mathematics and itsApplications , 2003.[48] Syed Asif Raza Shah, Wenji Wu, Qiming Lu, Liang Zhang, Sajith Sasidharan, Phil DeMar,Chin Guok, John Macauley, Eric Pouyoul, Jin Kim, et al. Amoebanet: An sdn-enabled networkservice for big data science.
Journal of Network and Computer Applications , 119:70–82, 2018.[49] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Takingthe human out of the loop: A review of Bayesian optimization.
Proceedings of the IEEE ,104(1):148–175, 2016.[50] Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In
Learning theoryand kernel machines , pages 144–158. Springer, 2003.[51] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machinelearning algorithms. In
Advances in neural information processing systems , pages 2951–2959,2012.[52] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram,Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable Bayesian optimization using deepneural networks. In
Proceedings of the 32nd International Conference on Machine Learning ,pages 2171–2180, 2015.[53] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimizationwith robust bayesian neural networks. In
Advances in Neural Information Processing Systems ,pages 4134–4142, 2016.[54] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian processoptimization in the bandit setting: No regret and experimental design. In
Proceedings of the27th International Conference on Machine Learning , pages 1015–1022, 2010.[55] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programmingapproach to designing convolutional neural network architectures. In
Proceedings of the Genetic and Evolutionary Computation Conference, pages 497–504, 2017.
[56] Cedric Villani.
Topics in optimal transportation . American Mathematical Soc., 2003.[57] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graphkernels.
Journal of Machine Learning Research , 11(Apr):1201–1242, 2010.[58] Walter D Wallis, Peter Shoubridge, M Kraetz, and D Ray. Graph distances using graph union.
Pattern Recognition Letters , 22(6-7):701–704, 2001.[59] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesianoptimization in high-dimensional spaces. In
International Conference on Artificial Intelligenceand Statistics , pages 745–754, 2018.[60] Colin White, Willie Neiswanger, and Yash Savani. Bananas: Bayesian optimization with neuralarchitectures for neural architecture search. arXiv preprint arXiv:1910.11858 , 2019.[61] Lingxi Xie and Alan Yuille. Genetic cnn. In
Proceedings of the IEEE international conferenceon computer vision , pages 1379–1388, 2017.[62] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecturesearch.
International Conference on Learning Representation , 2019.[63] Quanming Yao, Ju Xu, Wei-Wei Tu, and Zhanxing Zhu. Efficient neural architecture search viaproximal iterations. In
AAAI Conference on Artificial Intelligence , 2020.[64] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter.Nas-bench-101: Towards reproducible neural architecture search. In
International Conferenceon Machine Learning , pages 7105–7114, 2019.[65] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
International Conference on Representation Learning, 2017.

In the appendix, we first review the related approaches in sequential and batch neural architecture search. We then present an illustrative example of the proposed tree-Wasserstein distance for a neural network architecture. Then, we provide additional details of the Bayesian optimization in estimating the hyperparameters. Finally, we show further empirical comparisons and analysis for the model.
A Related works in Neural architecture search
We refer the interested readers to [12] for a comprehensive survey on neural architecture search. Many different search strategies have been attempted to explore the space of neural architectures, including random search, evolutionary methods, reinforcement learning (RL), gradient-based methods and Bayesian optimization.
Evolutionary approaches. [44, 43, 55, 31, 48, 61, 11] have been extensively used for NAS. In the context of evolutionary search, the mutation operations include adding a layer, removing a layer or changing the type of a layer (e.g., from convolution to pooling) of the neural network architecture. Then, the evolutionary approaches update the population, e.g., via tournament selection by removing the worst or oldest individual from the population.
Reinforcement learning.
NASNet [65] is a reinforcement learning algorithm for NAS which achieves state-of-the-art results on CIFAR-10 and PTB; however, the algorithm requires 3000 GPU days to train. Efficient Neural Architecture Search (ENAS) [39] proposes to use a controller which discovers architectures by learning to search for an optimal subgraph within a large graph. The controller is trained with policy gradient to select a subgraph that maximizes the validation set's expected reward. The model corresponding to the subgraph is trained to minimize a canonical cross-entropy loss. Because multiple child models share parameters, ENAS requires fewer GPU-hours than other approaches and 1000-fold fewer than "standard" NAS.
Gradient-based approaches. [33, 32, 9, 63] represent the search space as a directed acyclic graph (DAG) containing billions of sub-graphs, each of which indicates a kind of neural architecture. To avoid traversing all the possibilities of the sub-graphs, they develop a differentiable sampler over the DAG. The benefit of such an idea is that a differentiable space enables computation of gradient information, which can speed up the convergence of the underlying optimization algorithm. Various techniques have been proposed, e.g., DARTS [32], SNAS [62], and NAO [33]. While these approaches based on gradient-based learning can reduce the computational resources required for NAS, it is currently not well understood whether an initial bias in exploring certain parts of the search space more than others might lead to bias and thus result in premature convergence of NAS [46]. In addition, the gradient-based approach may be less appropriate for exploring different spaces (e.g., with completely different numbers of layers), as opposed to the approach presented in this paper.
Bayesian optimization.
BO has been an emerging technique for black-box optimization when function evaluations are expensive [41, 16, 21], and it has seen great success in hyperparameter optimization for deep learning [30, 37, 45]. Recently, Bayesian optimization has been used for searching for the best neural architecture [24, 23, 60]. BO relies on a covariance function to represent the similarity between two data points. For such similarity representation, we can (1) directly measure the similarity of the networks by optimal transport, then model with a GP surrogate as in [24]; or (2) measure the graphs based on path-based encodings, then model with a neural network surrogate as in [60]. OTMANN [24] shares similarities with Wasserstein (earth mover's) distances, which also have an OT formulation. However, it is not a Wasserstein distance itself; in particular, the supports of the masses and the cost matrices change depending on the two networks being compared. One drawback of OTMANN is that it may not be negative definite, so it cannot be used to build a p.s.d. kernel, which is an important requirement for modeling with a GP. This is the motivation for our proposed tree-Wasserstein.
Path-based encoding.
BANANAS [60] proposes the path-based encoding for neural network architectures. The drawback of path-based encoding is that we need to enumerate all possible paths from the input node to the output node in terms of the operations. This can potentially raise a scalability issue, although it can work well on the NASBench dataset [64], which results in $\sum_{i=0}^{5} 3^i = 364$ possible paths.

Kernel graph.
Previous work considers neural network architectures as graphs, then defines various distances and kernels on graphs [17, 26, 35, 50, 58]. However, they may not be ideal for our NAS setting because neural networks have additional complex properties besides the graphical structure, such as the type of operations performed at each layer, the number of neurons, etc. Although some methods do allow different vertex sets [57], they cannot handle layer masses and layer similarities.
B Related works in batch neural architecture search
There are several approaches in the batch Bayesian optimization literature which can be used to select multiple architectures for evaluation, including random search, evolutionary search and most of the batch Bayesian optimization approaches, such as Kriging believer [19], GP-BUCB [7], GP-Thompson Sampling [22], and BOHB [14].

Kriging believer (KB) [19] exploits an interesting fact about GPs: the predictive variance of a GP depends only on the input $x$, not on the outcome values $y$. KB iteratively constructs a batch of points. First, it finds the maximum of the acquisition function, as in the sequential setting. Next, KB moves to the next maximum by suppressing this point. This is done by inserting the outcome at this point as a hallucinated value. This process is repeated until the batch is filled.

GP-BUCB [7] is related to the above Kriging believer in exploiting the GP predictive variance. Particularly, GP-BUCB is similar to KB when the hallucinated value is set to the GP predictive mean.

GP-Thompson Sampling [22] generates a batch of points by drawing from the posterior distribution of the GP to fill in a batch. In the continuous setting, we can draw a GP sample using random Fourier features [40]. In our discrete case of NAS, we can simply draw samples from the GP predictive mean.
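A compact sketch of the Kriging believer loop described above. The helper callables `gp_fit`, `gp_predict` (returning mean and variance) and `acquisition` are assumptions passed in by the caller; this is an illustration of the strategy, not the paper's implementation.

```python
def kriging_believer_batch(gp_fit, gp_predict, acquisition,
                           X_train, y_train, candidates, batch_size):
    """Select a batch by repeatedly maximizing the acquisition function and
    hallucinating the GP predictive mean as the observation."""
    X, y, batch = list(X_train), list(y_train), []
    pool = list(candidates)
    for _ in range(batch_size):
        model = gp_fit(X, y)
        scores = [acquisition(*gp_predict(model, c)) for c in pool]
        best = max(range(len(pool)), key=scores.__getitem__)
        x_next = pool.pop(best)
        batch.append(x_next)
        X.append(x_next)
        y.append(gp_predict(model, x_next)[0])   # hallucinate the predictive mean
    return batch
```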
C Datasets

We summarize the two benchmark datasets used in the paper. Neural architecture search (NAS) methods are notoriously difficult to reproduce and compare due to different search spaces, training procedures and computing costs. These issues make methods inaccessible to most researchers. Therefore, the two benchmark datasets below have been created.
NASBENCH101.
The NAS-Bench-101 dataset (https://github.com/google-research/nasbench) contains over 423k neural architectures with precomputed training, validation, and test accuracy [64]. In the NASBench dataset, the neural network architectures have been exhaustively trained and evaluated on CIFAR-10 to create a queryable dataset.

NASBENCH201.
NAS-Bench-201 (https://github.com/D-X-Y/NAS-Bench-201) includes all possible architectures generated by 4 nodes and 5 associated operation options, which results in 15,625 neural cell candidates in total. The NASBench201 dataset includes tabular results for three sub-datasets: CIFAR-10, CIFAR-100 and ImageNet-16-120.

D Proofs
D.1 Proof for Lemma 2.1
Proof.
We consider $X \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$. If $k$ is not a p.s.d. kernel, then there is some set of $n$ points $(t_i)_{i=1}^{n}$ and corresponding weights $\alpha_i \in \mathbb{R}$ such that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i k(t_i, t_j) \alpha_j < 0. \quad (13)$$
By the GP assumption, $\mathrm{Cov}(X(t_i), X(t_j)) = k(t_i, t_j)$, so the variance becomes negative:
$$\mathrm{Var}\Big(\sum_{i=1}^{n} \alpha_i X(t_i)\Big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \,\mathrm{Cov}(X(t_i), X(t_j))\, \alpha_j < 0. \quad (14)$$
The negative variance concludes our proof that the GP is no longer valid with a non-p.s.d. kernel.

Figure 6: Example of TW used for calculating two architectures. The label of each histogram bin is highlighted in green. The distance between two nodes in a tree is the total cost along the (unique) path between the two nodes, see Eq. (2); for example, the cost for moving maxpool (m) to conv1 (c1) is the sum of the edge weights along the path between them. We use a similar analogy for computing the 2-gram (2g) representation.

D.2 Proof for Proposition 1
Proof.
We have that tree-Wasserstein (TW) is a metric and negative definite [29]. Therefore, $W_{d_{\mathcal{T}_o}}$, $W_{d_{\mathcal{T}_-}}$, $W_{d_{\mathcal{T}_+}}$ are also metrics and negative definite. Moreover, the discrepancy $d_{NN}$ is a convex combination with positive weights of $W_{d_{\mathcal{T}_o}}$, $W_{d_{\mathcal{T}_-}}$, $W_{d_{\mathcal{T}_+}}$. Therefore, it is easy to verify that for given neural networks $x_1, x_2, x_3$, we have:
• $d_{NN}(x_1, x_1) = 0$.
• $d_{NN}(x_1, x_2) = d_{NN}(x_2, x_1)$.
• $d_{NN}(x_1, x_2) + d_{NN}(x_2, x_3) \geq d_{NN}(x_1, x_3)$.
Thus, $d_{NN}$ is a pseudo-metric. Additionally, a convex combination with positive weights preserves negative definiteness. Therefore, $d_{NN}$ is negative definite.

D.3 Tree-Wasserstein kernel for neural networks
Proposition 3. Given the scalar length-scale parameter $\sigma_l$, the tree-Wasserstein kernel for neural networks $k(x, z) = \exp\!\big(-\frac{d_{NN}(x, z)}{\sigma_l}\big)$ is infinitely divisible.
Proof. Given two neural networks $x$ and $z$, we introduce new kernels $k_\gamma(x, z) = \exp\!\big(-\frac{d_{NN}(x, z)}{\gamma \sigma_l}\big)$ for $\gamma \in \mathbb{N}^*$. Following [2] (Theorem 3.2.2, p.74), $k_\gamma(x, z)$ is also positive definite. Moreover, we also have $k(x, z) = (k_\gamma(x, z))^\gamma$. Then, following [2] (Definition 2.6, p.76), we complete the proof.
From Proposition 3, one does not need to recompute the Gram matrix of the TW kernel for each choice of $\sigma_l$, since it suffices to compute it once.

E An example of TW computation for neural network architectures

In this section, we present an example of using TW for transporting two neural network architectures in Fig. 6. We consider a set $S$ of operations of interest as follows: $S = \{\text{cv1}, \text{cv3}, \text{mp3}\}$.

Table 1: The tree metric between operations in $W_{d_{\mathcal{T}_o}}$ (with the tree structure in Fig. 6), i.e., the pairwise path lengths between cv1, cv3 and mp3.

Neural network information $(S^o, A)$. We use the order top-bottom and left-right for layers in $S^o$.
• For neural network $x$, we have $S^o_x = \{\text{in}, \text{cv1}, \text{cv3}, \text{cv3}, \text{cv3}, \text{out}\}$, with adjacency matrix $A_x$ as shown in Fig. 6.
• For neural network $z$, we have $S^o_z = \{\text{in}, \text{cv3}, \text{cv1}, \text{mp3}, \text{mp3}, \text{out}\}$, with adjacency matrix $A_z$ as shown in Fig. 6.
We show how to calculate these three representations (layer operation, indegree and outdegree) using tree-Wasserstein.

E.1 n-gram representation for layer operations
• 1-gram representation. The 1-gram representations $x^o$ and $z^o$ for neural networks $x$ and $z$ respectively are
$$x^o = \tfrac{1}{4}(1, 3, 0), \qquad z^o = \tfrac{1}{4}(1, 1, 2),$$
where we use the order (cv1, cv3, mp3) for the frequencies of the operations in the set $S$ for the 1-gram representation of a neural network.
• 2-gram representation. For the 2-gram representations $x^o_2$ and $z^o_2$ for neural networks $x$ and $z$ respectively, we use the following order for $S \times S$: (cv1-cv1, cv1-cv3, cv1-mp3, cv3-cv1, cv3-cv3, cv3-mp3, mp3-cv1, mp3-cv3, mp3-mp3). So we have
$$x^o_2 = \tfrac{1}{2}(0, 1, 0, 0, 1, 0, 0, 0, 0), \qquad z^o_2 = \tfrac{1}{3}(0, 0, 1, 0, 0, 1, 0, 0, 1).$$
Or, we can represent them as empirical measures
$$\omega_{x^o_2} = \tfrac{1}{2}\delta_{\text{cv1-cv3}} + \tfrac{1}{2}\delta_{\text{cv3-cv3}}, \qquad \omega_{z^o_2} = \tfrac{1}{3}\delta_{\text{cv1-mp3}} + \tfrac{1}{3}\delta_{\text{cv3-mp3}} + \tfrac{1}{3}\delta_{\text{mp3-mp3}}.$$
• Tree metrics for n-gram representations for layer operations. We can use the tree metric in Fig. 6 for the 1-gram and 2-gram representations; the tree metrics for operations are summarized in Table 1. Using the closed-form computation of tree-Wasserstein presented in Eq. (2) in the main text, we can compute $W_{d_{\mathcal{T}_o}}(x^o, z^o)$ for the 1-gram representation (Eq. (15)) and $W_{d_{\mathcal{T}_o}}(x^o_2, z^o_2)$ for the 2-gram representation (Eq. (16)): in each case, we sum over the edges of the corresponding operation tree the edge weight times the absolute difference between the total masses of the two measures in the subtree below that edge.
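As a quick numeric illustration of the 1-gram case above, the following snippet evaluates Eq. (2) on the example measures, assuming hypothetical edge weights for the operation tree in Fig. 6 (the paper's actual weights are not reproduced here).

```python
# 1-gram measures of the example architectures (order: cv1, cv3, mp3).
x_o = {"cv1": 1/4, "cv3": 3/4, "mp3": 0.0}
z_o = {"cv1": 1/4, "cv3": 1/4, "mp3": 2/4}

# Hypothetical edge weights of the operation tree
# (root -> conv group -> {cv1, cv3}, root -> mp3); illustrative only.
w_cv1, w_cv3, w_conv, w_mp3 = 0.1, 0.1, 0.9, 1.0

# Eq. (2): sum over edges of weight * |difference of subtree masses|.
tw = (w_cv1 * abs(x_o["cv1"] - z_o["cv1"])
      + w_cv3 * abs(x_o["cv3"] - z_o["cv3"])
      + w_conv * abs((x_o["cv1"] + x_o["cv3"]) - (z_o["cv1"] + z_o["cv3"]))
      + w_mp3 * abs(x_o["mp3"] - z_o["mp3"]))
print(tw)   # 0.0 + 0.05 + 0.45 + 0.5 = 1.0 with these illustrative weights
```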
E.2 Indegree and outdegree representations for network structure

The indegree and outdegree empirical measures (ω_{d^-_x}, ω_{d^+_x}) and (ω_{d^-_z}, ω_{d^+_z}) for neural networks x and z respectively are

ω_{d^-_x} = Σ_i x_{d^-,i} δ_{(η_{x,i}+1)/(M_x+1)},    ω_{d^+_x} = Σ_i x_{d^+,i} δ_{(η_{x,i}+1)/(M_x+1)},   (17)

ω_{d^-_z} = Σ_i z_{d^-,i} δ_{(η_{z,i}+1)/(M_z+1)},    ω_{d^+_z} = Σ_i z_{d^+,i} δ_{(η_{z,i}+1)/(M_z+1)},   (18)

where

x_{d^-} = (0/7, 1/7, 1/7, 1/7, 2/7, 2/7),    x_{d^+} = (3/7, 1/7, 1/7, 1/7, 1/7, 0/7),   (19)

z_{d^-} = (0/7, 1/7, 1/7, 1/7, 2/7, 2/7),    z_{d^+} = (2/7, 2/7, 1/7, 1/7, 1/7, 0/7),   (20)

η_x = (0, 1, 1, 1, 2, 2),    η_z = (0, 1, 1, 2, 2, 2),   (21)

M_x = 2, M_z = 2, and x_{d^-,i}, x_{d^+,i}, η_{x,i} are the i-th elements of x_{d^-}, x_{d^+}, η_x respectively. Consequently, one can leverage the indegree and outdegree of the network structures to distinguish between x and z.

We demonstrate in Fig. 7 how to calculate the tree-Wasserstein for indegree and outdegree. The supports of the empirical measures ω_{d^-_x} and ω_{d^-_z} lie on a line, so we simply choose a tree that is a chain of real values for the tree-Wasserstein distance. In this case, the tree-Wasserstein is equivalent to univariate optimal transport. The same holds for the empirical measures ω_{d^+_x} and ω_{d^+_z}.

• W_{d_{T^-}}(ω_{d^-_x}, ω_{d^-_z}) for the indegree representation. Using Eq. (2), we have

W_{d_{T^-}}(ω_{d^-_x}, ω_{d^-_z}) = (1/3) |4/7 − 5/7| + (1/3) |(3/7 + 4/7) − (2/7 + 5/7)| = 1/21 ≈ 0.048,   (22)

where the two costs 1/3 are the edge weights (from δ_{2/3} to δ_1, and from δ_{1/3} to δ_{2/3}, respectively) in Fig. 7, and the terms inside the absolute values are the total masses of the empirical measures in the subtrees rooted at the deeper node of the corresponding edge, as defined in Eq. (2). The tree is simply a chain of increasing real values, i.e., the chain 1/3 → 2/3 → 1, and the weight of each edge is simply the ℓ_1 distance between its two nodes.
Figure 7: Illustration of the indegree and outdegree used in TW. Each layer is annotated with (distance to root, indegree, outdegree). We can represent the empirical measures as ω_{d^-_x} = Σ_{ℓ ∈ L_x} x_{d^-,ℓ} δ_{(η_{x,ℓ}+1)/(M_x+1)} = 3/7 δ_{2/3} + 4/7 δ_1 and ω_{d^+_x} = Σ_{ℓ ∈ L_x} x_{d^+,ℓ} δ_{(η_{x,ℓ}+1)/(M_x+1)} = 3/7 δ_{1/3} + 3/7 δ_{2/3} + 1/7 δ_1, and for z as ω_{d^-_z} = Σ_{ℓ ∈ L_z} z_{d^-,ℓ} δ_{(η_{z,ℓ}+1)/(M_z+1)} = 2/7 δ_{2/3} + 5/7 δ_1 and ω_{d^+_z} = Σ_{ℓ ∈ L_z} z_{d^+,ℓ} δ_{(η_{z,ℓ}+1)/(M_z+1)} = 2/7 δ_{1/3} + 3/7 δ_{2/3} + 2/7 δ_1. The tree, which is a chain, is used to compute the distance.

• W_{d_{T^+}}(ω_{d^+_x}, ω_{d^+_z}) for the outdegree representation. Similarly, for the outdegree representation, we have

W_{d_{T^+}}(ω_{d^+_x}, ω_{d^+_z}) = (1/3) |1/7 − 2/7| + (1/3) |(3/7 + 1/7) − (3/7 + 2/7)| = 2/21 ≈ 0.095.   (23)

From W_{d_{T_o}}, W_{d_{T^-}} and W_{d_{T^+}}, we can obtain the discrepancy d_NN between the neural networks x and z as in Eq. (5) with predefined values α_1, α_2, α_3.
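Below is a small numerical check of the indegree and outdegree terms, written as a Python sketch. It builds the degree measures of Eqs. (17)-(18) from the per-layer degree and depth vectors of the example above and evaluates the closed form of Eq. (2) on the chain 1/3 → 2/3 → 1; the two printed values match Eqs. (22) and (23). The helper names are ours, not from any released implementation.

import numpy as np

def degree_measure(degrees, depths, M):
    """Eqs. (17)-(18): mass deg_i / sum(deg) placed at support point (depth_i + 1) / (M + 1)."""
    degrees = np.asarray(degrees, dtype=float)
    support = (np.asarray(depths, dtype=float) + 1.0) / (M + 1.0)
    return support, degrees / degrees.sum()

def chain_tw(s1, m1, s2, m2):
    """TW when all supports lie on a chain of increasing reals (univariate OT, Eq. (2))."""
    grid = np.unique(np.concatenate([s1, s2]))
    mu = np.array([m1[s1 == t].sum() for t in grid])
    nu = np.array([m2[s2 == t].sum() for t in grid])
    # The edge between grid[i] and grid[i+1] has weight grid[i+1] - grid[i]; the subtree
    # below its deeper node carries the mass of every support point >= grid[i+1].
    tail_diff = np.abs(np.cumsum((mu - nu)[::-1])[::-1][1:])
    return float(np.sum(np.diff(grid) * tail_diff))

# Per-layer (indegree, outdegree, depth) vectors of networks x and z from Eqs. (19)-(21).
indeg_x, outdeg_x, depth_x = [0, 1, 1, 1, 2, 2], [3, 1, 1, 1, 1, 0], [0, 1, 1, 1, 2, 2]
indeg_z, outdeg_z, depth_z = [0, 1, 1, 1, 2, 2], [2, 2, 1, 1, 1, 0], [0, 1, 1, 2, 2, 2]

sx, mx = degree_measure(indeg_x, depth_x, M=2)
sz, mz = degree_measure(indeg_z, depth_z, M=2)
print(chain_tw(sx, mx, sz, mz))   # 1/21 ~ 0.048, the indegree term of Eq. (22)

sx, mx = degree_measure(outdeg_x, depth_x, M=2)
sz, mz = degree_measure(outdeg_z, depth_z, M=2)
print(chain_tw(sx, mx, sz, mz))   # 2/21 ~ 0.095, the outdegree term of Eq. (23)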
F Optimizing hyperparameters in TW and GP

Equivalently, we consider λ_1 = α_1/σ_l, λ_2 = α_2/σ_l and λ_3 = (1 − α_1 − α_2)/σ_l in Eq. (6) and present the derivatives for estimating the variables λ in our kernel,

k(u, v) = exp( −λ_1 W_{d_{T_o}}(u, v) − λ_2 W_{d_{T^-}}(u, v) − λ_3 W_{d_{T^+}}(u, v) ).   (24)

The hyperparameters of the kernel are optimised by maximising the log marginal likelihood (LML) of the GP surrogate,

θ* = arg max_θ L(θ, D),   (25)

where we collect the hyperparameters into θ = {λ_1, λ_2, λ_3, σ_n}. The LML and its derivative are defined as [42]

L(θ) = −(1/2) y^T K^{−1} y − (1/2) log |K| + constant,   (26)

∂L/∂θ = (1/2) ( y^T K^{−1} (∂K/∂θ) K^{−1} y − tr( K^{−1} ∂K/∂θ ) ),   (27)

where y are the function values at the sample locations and K is the covariance matrix of k(x, x') evaluated on the training data.

Table 2: Properties comparison across different distances for use with GP-BO and k-DPP. GW is Gromov-Wasserstein. TW is tree-Wasserstein. OT is optimal transport.

                                 Matrix/Graph   Path-Encode   OT (or W)   GW   TW
Closed-form estimation                ✓              ✓            ✗        ✗    ✓
Positive semi-definite                ✓              ✓            ✗        ✗    ✓
Different architecture sizes         ✓/✗             ✗            ✓        ✓    ✓
Scaling with architecture size        ✓              ✗            ✓        ✓    ✓
The optimization of the LML is performed via multi-started gradient descent. The gradient in Eq. (27) relies on the gradient of the kernel k w.r.t. each of its parameters:

∂k(u, v)/∂λ_1 = −W_{d_{T_o}}(u, v) × k(u, v),   (28)

∂k(u, v)/∂λ_2 = −W_{d_{T^-}}(u, v) × k(u, v),   (29)

∂k(u, v)/∂λ_3 = −W_{d_{T^+}}(u, v) × k(u, v).   (30)
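As a concrete illustration of Eqs. (24)-(30), here is a minimal numpy sketch of the TW kernel, its log marginal likelihood and the gradient with respect to λ_1, λ_2, λ_3. It assumes the three pairwise distance matrices Wo, Wm, Wp between the training architectures have already been computed; the constant term of Eq. (26) is dropped, and the function names are ours.

import numpy as np

def tw_kernel(Wo, Wm, Wp, lam, sigma_n):
    """TW kernel of Eq. (24), with observation noise sigma_n^2 added on the diagonal."""
    K = np.exp(-lam[0] * Wo - lam[1] * Wm - lam[2] * Wp)
    return K + sigma_n ** 2 * np.eye(K.shape[0])

def lml_and_grad(Wo, Wm, Wp, lam, sigma_n, y):
    """Log marginal likelihood of Eq. (26) (up to a constant) and its gradient w.r.t. lambda."""
    K = tw_kernel(Wo, Wm, Wp, lam, sigma_n)
    K_noise_free = K - sigma_n ** 2 * np.eye(K.shape[0])
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    lml = -0.5 * y @ alpha - 0.5 * np.linalg.slogdet(K)[1]
    grads = []
    for W in (Wo, Wm, Wp):
        dK = -W * K_noise_free                                            # Eqs. (28)-(30)
        grads.append(0.5 * (alpha @ dK @ alpha - np.trace(Kinv @ dK)))    # Eq. (27)
    return lml, np.array(grads)

In practice, such a gradient step is wrapped in a multi-started optimizer over θ = {λ_1, λ_2, λ_3, σ_n}, as described above.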
F.1 Proof for Proposition 2

Proof.
Let A and B be the training and test sets respectively. Using the Schur complement, we have det(K_{A∪B}) = det(K_A) × det(K_B − K_{BA} K_A^{−1} K_{AB}), and the probability of selecting B is

P(B ⊂ P | A) = det(K_{A∪B}) / det(K_A) = det( K_B − K_{BA} K_A^{−1} K_{AB} ) = det( σ²(B | A) ).   (31)

This shows that the conditioning of the k-DPP is equivalent to the GP predictive variance σ²(B | A) in Eq. (9).
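The determinant identity behind Eq. (31) can be checked numerically. The Python sketch below uses a squared-exponential kernel on random points merely as a stand-in for a p.s.d. covariance; the split into A and B is arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_A, n_B = 4, 3
X = rng.normal(size=(n_A + n_B, 2))
# Any p.s.d. kernel works here; a squared-exponential kernel is used as a stand-in.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

K_A, K_B, K_BA = K[:n_A, :n_A], K[n_A:, n_A:], K[n_A:, :n_A]

# Ratio of principal minors used by the conditional k-DPP ...
ratio = np.linalg.det(K) / np.linalg.det(K_A)
# ... equals the determinant of the Schur complement, i.e., the GP posterior covariance of B | A.
posterior_cov = K_B - K_BA @ np.linalg.solve(K_A, K_BA.T)
print(np.isclose(ratio, np.linalg.det(posterior_cov)))   # True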
G Distance Properties Comparison

We summarize in Table 2 the key benefits of using tree-Wasserstein (n-gram) as the main distance with a GP for sequential NAS and with a k-DPP for batch NAS. Tree-Wasserstein offers a closed-form computation and a positive semi-definite covariance matrix, which is critical for GP and k-DPP modeling.
Comparison with graph kernel.
Besides the adjacency matrix representation, each architecture includes layer masses and operation types. We note that two different architectures may share the same adjacency matrix while differing in operation types and layer masses.
Comparison with path-based encoding.
TW can scale well to larger numbers of nodes and layers, while path-based encoding is limited to small architectures.
Comparison with OT approaches in computational complexity.
In general, OT is formulated as a linear programming problem and its computational complexity is super-cubic in the size of the probability measures [5] (e.g., using the network simplex). On the other hand, TW has a closed-form computation in Eq. (2), and its computational complexity is linear in the number of edges in the tree. Therefore, TW is much faster than OT in applications [29], and is especially useful for large-scale settings where the computation of OT becomes prohibitive.
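To illustrate the gap concretely, the Python sketch below computes the same transport cost twice: once by solving the generic OT linear program with scipy, and once with the closed form available when the supports lie on a chain (via scipy's univariate Wasserstein routine, since TW on a chain coincides with univariate OT, as noted in Appendix E.2). The measures reuse the indegree example of Appendix E.2 with the zero-mass point omitted; the LP construction is a textbook formulation, not code from the paper.

import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

def ot_lp(mu, nu, C):
    """Generic OT as a linear program over the transport plan P (no closed form)."""
    n, m = len(mu), len(nu)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # row sums of P equal mu
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # column sums of P equal nu
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]), bounds=(0, None))
    return res.fun

support = np.array([2 / 3, 1.0])
mu = np.array([3 / 7, 4 / 7])                # indegree measure of x from Appendix E.2
nu = np.array([2 / 7, 5 / 7])                # indegree measure of z
C = np.abs(support[:, None] - support[None, :])   # ground metric |s_i - s_j| on the chain

print(ot_lp(mu, nu, C))                                   # LP solution, ~1/21
print(wasserstein_distance(support, support, mu, nu))     # closed form, ~1/21

For measures with n support points, the LP has n^2 variables, whereas the closed form only sweeps the chain once, which is the complexity contrast described above.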
H Additional Experiments and Illustrations
H.1 Model Analysis
Figure 8: Tree-Wasserstein distances over 500 architectures on NASBENCH101. The three panels show the pairwise architecture-by-architecture distance matrices for the operation term W_{d_{T_o}}, the indegree term W_{d_{T^-}} and the outdegree term W_{d_{T^+}}.
Figure 9: Additional sequential NAS comparison on NASBENCH201 (ImageNet, CIFAR-10 and CIFAR-100): test error versus the number of iterations for Rand, Evolution, Bananas, BO-OT (NASBOT), BO-TW and BO-TW-2G.
Figure 10: Additional result of batch NAS on NB101 (test error versus iterations for GP-TS, GP-BUCB/KB, Evolutionary, k-DPP and k-DPP Quality). We use TW-2G and a batch size B = 5.

We illustrate the three internal distances of our tree-Wasserstein, namely W_{d_{T_o}}, W_{d_{T^+}} and W_{d_{T^-}}, in Fig. 8. Each internal distance captures different aspects of the networks. The zero diagonal of each matrix indicates that identical architectures are correctly assigned zero distance.

H.2 Further sequential and batch NAS experiments
To complement the results presented in the main paper, we present additional experiments on both the sequential NAS setting (Fig. 9) and the batch NAS setting (Fig. 10) using the NB101 and NB201 datasets. In the batch setting of Fig. 10, the proposed k-DPP quality consistently achieves the best performance.
H.3 Ablation study using different acquisition functions
We evaluate our proposed model using two common acquisition functions, GP-UCB and EI. The results in Fig. 11 suggest that UCB tends to perform much better than EI in our NAS setting. This result is consistent with the comparison presented in Bananas [60].
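For reference, the two acquisition functions compared in this ablation have the following standard forms, shown here as a generic Python sketch for maximization of the GP objective (when minimizing test error, the objective is negated). This is not the exact configuration used in our experiments, and the exploration weight beta below is an assumed placeholder.

import numpy as np
from scipy.stats import norm

def gp_ucb(mean, std, beta=2.0):
    """GP-UCB acquisition: posterior mean plus an exploration bonus."""
    return mean + np.sqrt(beta) * std

def expected_improvement(mean, std, best_so_far):
    """Expected improvement of the GP posterior over the incumbent value."""
    std = np.maximum(std, 1e-12)              # guard against zero predictive deviation
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm.cdf(z) + std * norm.pdf(z)

# In the discrete NAS setting, both are evaluated on every candidate architecture
# and the candidate with the largest acquisition value is queried next.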
H.4 Using k-DPP quality with another distance
In addition to the tree-Wasserstein presented in the main paper, we demonstrate the proposed k-DPP quality using the path distance [60]. We show that our k-DPP quality is not restricted to TW-2G, but can be used with different choices of kernel distance. In particular, we present in Fig. 12 the comparison on two datasets from NB201: ImageNet and CIFAR-100. The results show that our k-DPP quality performs best among the compared batch selection approaches when the path distance is used.
Figure 11: Optimizing the acquisition function using GP-UCB and EI on NB101 (test error versus iterations for TW 2-Gram with EI and with GP-UCB). The results suggest that using GP-UCB leads to better performance than EI.
Figure 12: Batch NAS comparison using another kernel distance, the path distance, on NB201 ImageNet and CIFAR-100 (test error versus iterations for GP-TS, GP-BUCB/KB, Evolutionary, k-DPP, k-DPP Quality and k-DPP Quality TW2G). We use a batch size B = 5.