Neural Architecture Search with Bayesian Optimisation and Optimal Transport
Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabás Póczos, Eric P. Xing
Carnegie Mellon University, Petuum Inc. {kandasamy, willie, schneide, bapoczos, epxing}@cs.cmu.edu
Abstract
Bayesian Optimisation (BO) refers to a class of methods for global optimisation of a function f which is only accessible via point evaluations. It is typically used in settings where f is expensive to evaluate. A common use case for BO in machine learning is model selection, where it is not possible to analytically model the generalisation performance of a statistical model, and we resort to noisy and expensive training and validation procedures to choose the best model. Conventional BO methods have focused on Euclidean and categorical domains, which, in the context of model selection, only permit tuning scalar hyper-parameters of machine learning algorithms. However, with the surge of interest in deep learning, there is an increasing demand to tune neural network architectures. In this work, we develop NASBOT, a Gaussian process based BO framework for neural architecture search. To accomplish this, we develop a distance metric in the space of neural network architectures which can be computed efficiently via an optimal transport program. This distance might be of independent interest to the deep learning community as it may find applications outside of BO. We demonstrate that NASBOT outperforms other alternatives for architecture search in several cross validation based model selection tasks on multi-layer perceptrons and convolutional neural networks.
In many real world problems, we are required to sequentially evaluate a noisy black-box function f with the goal of finding its optimum in some domain X. Typically, each evaluation is expensive in such applications, and we need to keep the number of evaluations to a minimum. Bayesian optimisation (BO) refers to an approach for global optimisation that is popularly used in such settings. It uses Bayesian models for f to infer function values at unexplored regions and guide the selection of points for future evaluations. BO has been successfully applied to many optimisation problems in optimal policy search, industrial design, and scientific experimentation. That said, the quintessential use case for BO in machine learning is model selection [14, 40]. For instance, consider selecting the regularisation parameter λ and kernel bandwidth h for an SVM. We can set this up as a zeroth order optimisation problem where our domain is a two dimensional space of (λ, h) values, and each function evaluation trains the SVM on a training set and computes the accuracy on a validation set. The goal is to find the model, i.e. hyper-parameters, with the highest validation accuracy.

The majority of the BO literature has focused on settings where the domain X is either Euclidean or categorical. This suffices for many tasks, such as the SVM example above. However, with recent successes in deep learning, neural networks are increasingly becoming the method of choice for many machine learning applications. A number of recent works have designed novel neural network architectures to significantly outperform the previous state of the art [12, 13, 37, 45]. This motivates studying model selection over the space of neural architectures to optimise for generalisation performance. A critical challenge in this endeavour is that evaluating a network via train and validation procedures is very expensive. This paper proposes a BO framework for this problem.

While there are several approaches to BO, those based on Gaussian processes (GP) [35] are most common in the BO literature. In its most unadorned form, a BO algorithm operates sequentially, starting at time 0 with a GP prior for f; at time t, it incorporates results of evaluations from 1, . . . , t − 1 in the form of a posterior for f. It then uses this posterior to construct an acquisition function ϕ_t, where ϕ_t(x) is a measure of the value of evaluating f at x at time t if our goal is to maximise f. Accordingly, it chooses to evaluate f at the maximiser of the acquisition, i.e. x_t = argmax_{x∈X} ϕ_t(x). There are two key ingredients to realising this plan for GP based BO. First, we need to quantify the similarity between two points x, x′ in the domain in the form of a kernel κ(x, x′). The kernel is needed to define the GP, which allows us to reason about an unevaluated value f(x′) when we have already evaluated f(x). Secondly, we need a method to maximise ϕ_t.

These two steps are fairly straightforward in conventional domains. For example, in Euclidean spaces, we can use one of many popular kernels such as Gaussian, Laplacian, or Matérn; we can maximise ϕ_t via off the shelf branch-and-bound or gradient based methods. However, when each x ∈ X is a neural network architecture, this is not the case. Hence, our challenges in this work are two-fold. First, we need to quantify (dis)similarity between two networks. Intuitively, in Fig. 1, network 1a is more similar to network 1b than it is to 1c.
Secondly, we need to be able to traverse the space of such networks to optimise the acquisition function. Our main contributions are as follows.

1. We develop a (pseudo-)distance for neural network architectures called OTMANN (Optimal Transport Metrics for Architectures of Neural Networks) that can be computed efficiently via an optimal transport program.
2. We develop a BO framework for optimising functions on neural network architectures called NASBOT (Neural Architecture Search with Bayesian Optimisation and Optimal Transport). This includes an evolutionary algorithm to optimise the acquisition function.
3. Empirically, we demonstrate that NASBOT outperforms other baselines on model selection tasks for multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our python implementations of OTMANN and NASBOT are available at github.com/kirthevasank/nasbot.

Related Work: Recently, there has been a surge of interest in methods for neural architecture search [1, 6, 8, 21, 25, 26, 30, 32, 36, 41, 51–54]. We discuss them in detail in the Appendix due to space constraints. Broadly, they fall into two categories, based on either evolutionary algorithms (EA) or reinforcement learning (RL). EA provides a simple mechanism to explore the space of architectures by making a sequence of changes to networks that have already been evaluated. However, as we will discuss later, they are not ideally suited for optimising functions that are expensive to evaluate. While RL methods have seen recent success, architecture search is in essence an optimisation problem – find the network with the lowest validation error. There is no explicit need to maintain a notion of state and solve credit assignment [43]. Since RL is a fundamentally more difficult problem than optimisation [16], these approaches need to try a very large number of architectures to find the optimum. This is not desirable, especially in computationally constrained settings.

None of the above methods have been designed with a focus on the expense of evaluating a neural network, with an emphasis on being judicious in selecting which architecture to try next. Bayesian optimisation (BO) uses introspective Bayesian models to carefully determine future evaluations and is well suited for expensive evaluations. BO usually consumes more computation to determine future points than other methods, but this pays dividends when the evaluations are very expensive. While there has been some work on BO for architecture search [2, 15, 28, 40, 44], they have only been applied to optimise feed forward structures, e.g. Fig. 1a, but not Figs. 1b, 1c. We compare NASBOT to one such method and demonstrate that feed forward structures are inadequate for many problems.
Our goal is to maximise a function f defined on a space X of neural network architectures. When we evaluate f at x ∈ X, we obtain a possibly noisy observation y of f(x). In the context of architecture search, f is the performance on a validation set after x is trained on the training set. If x⋆ = argmax_X f(x) is the optimal architecture, and x_t is the architecture evaluated at time t, we want f(x⋆) − max_{t≤n} f(x_t) to vanish fast as the number of evaluations n → ∞. We begin with a review of BO and then present a graph theoretic formalism for neural network architectures.

Figure 1: An illustration of some CNN architectures (a), (b), (c). In each layer, i: indexes the layer, followed by the label (e.g. conv3), and then the number of units (e.g. number of filters); the layer mass (see Section 3) is denoted in parentheses. The input and output layers are pink while the decision (softmax) layers are green.
(a) 0: ip(235); 1: conv3, 16(16); 2: conv3, 16(256); 3: conv3, 32(512); 4: conv5, 32(1024); 5: max-pool, 1(32); 6: fc, 16(512); 7: softmax(235); 8: op(235).
(b) 0: ip(235); 1: conv3, 16(16); 2: conv3, 16(256); 3: conv3, 16(256); 4: conv3, 16(256); 5: conv5, 32(1024); 6: max-pool, 1(32); 7: fc, 16(512); 8: softmax(235); 9: op(235).
(c) 0: ip(240); 1: conv7, 16(16); 2: conv5, 32(512); 3: conv3/2, 16(256); 4: conv3, 16(256); 5: avg-pool, 1(32); 6: max-pool, 1(16); 7: max-pool, 1(16); 8: fc, 16(512); 9: conv3, 16(256); 10: softmax(120); 11: max-pool, 1(16); 12: fc, 16(512); 13: softmax(120); 14: op(240).
The following are the unnormalised and normalised distances d, d̄ from Section 3. All self distances are 0, i.e. d(G, G) = d̄(G, G) = 0. Unnormalised: d(a, b) ≈ 175, d(a, c) ≈ 1479, d(b, c) ≈ 1621.

A GP is a random process defined on some domain X, and is characterised by a mean function µ : X → R and a (covariance) kernel κ : X × X → R. Given n observations D_n = {(x_i, y_i)}_{i=1}^n, where x_i ∈ X, y_i = f(x_i) + ε_i ∈ R, and ε_i ∼ N(0, η²), the posterior process f | D_n is also a GP with mean µ_n and covariance κ_n. Denote Y ∈ R^n with Y_i = y_i; k, k′ ∈ R^n with k_i = κ(x, x_i), k′_i = κ(x′, x_i); and K ∈ R^{n×n} with K_{i,j} = κ(x_i, x_j). Then µ_n, κ_n can be computed via

    µ_n(x) = k^⊤ (K + η² I)^{−1} Y,    κ_n(x, x′) = κ(x, x′) − k^⊤ (K + η² I)^{−1} k′.    (1)

For more background on GPs, we refer readers to Rasmussen and Williams [35]. When tasked with optimising a function f over a domain X, BO models f as a sample from a GP. At time t, we have already evaluated f at points {x_i}_{i=1}^{t−1} and obtained observations {y_i}_{i=1}^{t−1}. To determine the next point for evaluation x_t, we first use the posterior GP to define an acquisition function ϕ_t : X → R, which measures the utility of evaluating f at any x ∈ X according to the posterior. We then maximise the acquisition, x_t = argmax_X ϕ_t(x), and evaluate f at x_t. The expected improvement acquisition [31],

    ϕ_t(x) = E[ max{0, f(x) − τ_{t−1}} | {(x_i, y_i)}_{i=1}^{t−1} ],    (2)

measures the expected improvement over the current maximum value according to the posterior GP. Here τ_{t−1} = max_{i≤t−1} f(x_i) denotes the current best value. This expectation can be computed in closed form for GPs. We use EI in this work, but the ideas apply just as well to other acquisitions [3].

GP/BO in the context of architecture search:
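To make equations (1) and (2) concrete, the following is a minimal numpy/scipy sketch of the GP posterior and the closed-form EI computation for a generic kernel function. It is an illustrative sketch rather than the NASBOT implementation; in particular, using the best observed value as τ_{t−1} is a standard simplification.

```python
import numpy as np
from scipy.stats import norm

def gp_posterior(x_new, X, Y, kernel, eta2):
    """Posterior mean and variance at x_new given data (X, Y); equation (1)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    k = np.array([kernel(x_new, a) for a in X])
    A = np.linalg.inv(K + eta2 * np.eye(len(X)))        # (K + eta^2 I)^{-1}
    mu = k @ A @ np.asarray(Y, dtype=float)
    var = kernel(x_new, x_new) - k @ A @ k
    return mu, max(var, 1e-12)

def expected_improvement(x_new, X, Y, kernel, eta2):
    """Expected improvement acquisition; equation (2) in closed form."""
    mu, var = gp_posterior(x_new, X, Y, kernel, eta2)
    sigma = np.sqrt(var)
    tau = float(np.max(Y))                               # current best value
    z = (mu - tau) / sigma
    return (mu - tau) * norm.cdf(z) + sigma * norm.pdf(z)
```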
Intuitively, κ(x, x′) is a measure of similarity between x and x′. If κ(x, x′) is large, then f(x) and f(x′) are highly correlated. Hence, the GP effectively imposes a smoothness condition on f : X → R; i.e. since networks (a) and (b) in Fig. 1 are similar, they are likely to have similar cross validation performance. In BO, when selecting the next point, we balance between exploitation, choosing points that we believe will have high f value, and exploration, choosing points that we do not know much about so that we do not get stuck at a bad optimum. For example, if we have already evaluated f(a), then exploration incentivises us to choose (c) over (b), since we can reasonably gauge f(b) from f(a). On the other hand, if f(a) has a high value, then exploitation incentivises choosing (b), as it is more likely to be the optimum than (c).

Our formalism will view a neural network as a graph whose vertices are the layers of the network. We will use the CNNs in Fig. 1 to illustrate the concepts. A neural network G = (L, E) is defined by a set of layers L and directed edges E. An edge (u, v) ∈ E is an ordered pair of layers. In Fig. 1, the layers are depicted by rectangles and the edges by arrows. A layer u ∈ L is equipped with a layer label ℓℓ(u) which denotes the type of operations performed at the layer. For instance, in Fig. 1a, ℓℓ(1) = conv3 and ℓℓ(5) = max-pool denote a 3 × 3 convolution and a max-pooling operation. The attribute ℓu(u) denotes the number of computational units in a layer. In Fig. 1b, ℓu(5) = 32 and ℓu(7) = 16 are the number of convolutional filters and fully connected nodes.

In addition, each network has decision layers which are used to obtain the predictions of the network. For a classification task, the decision layers perform softmax operations and output the probabilities that an input datum belongs to each class. For regression, the decision layers perform linear combinations of the outputs of the previous layers and output a single scalar. All networks have at least one decision layer. When a network has multiple decision layers, we average the output of each decision layer to obtain the final output. The decision layers are shown in green in Fig. 1. Finally, every network has a unique input layer u_ip and output layer u_op with labels ℓℓ(u_ip) = ip and ℓℓ(u_op) = op. It is instructive to think of the role of u_ip as feeding a data point to the network and the role of u_op as averaging the results of the decision layers. The input and output layers are shown in pink in Fig. 1. We refer to all layers that are not input, output or decision layers as processing layers.

The directed edges are to be interpreted as follows. The output of each layer is fed to each of its children; so both layers 2 and 3 in Fig. 1b take the output of layer 1 as input. When a layer has multiple parents, the inputs are concatenated; so layer 5 in Fig. 1b sees an input of
16 + 16 filtered channels coming in from its two parent layers. Finally, we mention that neural networks are also characterised by the values of the weights/parameters between layers. In architecture search, we typically do not consider these weights. Instead, an algorithm will (somewhat ideally) assume access to an optimisation oracle that can minimise the loss function on the training set and find the optimal weights.

We next describe a distance d : X × X → R+ for neural architectures. Recall that our eventual goal is a kernel for the GP; given a distance d, we will aim for κ(x, x′) = e^{−β d(x,x′)^p}, where β, p ∈ R+, as the kernel. Many popular kernels take this form. For example, when X ⊂ R^n and d is the L2 norm, p = 1, 2 correspond to the Laplacian and Gaussian kernels respectively.
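To make the graph formalism concrete, the network in Fig. 1a can be encoded with a plain dictionary of layers and an edge list, assuming the natural feed-forward reading of the figure; this particular data structure is our own choice for exposition and not the representation used in the released code.

```python
# Each layer: (label, number of units); None where the figure lists no unit count.
layers = {
    0: ("ip", None), 1: ("conv3", 16), 2: ("conv3", 16), 3: ("conv3", 32),
    4: ("conv5", 32), 5: ("max-pool", 1), 6: ("fc", 16), 7: ("softmax", None),
    8: ("op", None),
}
# Directed edges (u, v): the output of layer u is fed to layer v.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]

def children(u):
    return [v for (s, v) in edges if s == u]

def parents(v):
    return [s for (s, t) in edges if t == v]
```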
OTMANN Distance

To motivate this distance, note that the performance of a neural network is determined by the amount of computation at each layer, the types of these operations, and how the layers are connected. A meaningful distance should account for these factors. To that end, OTMANN is defined as the minimum of a matching scheme which attempts to match the computation at the layers of one network to the layers of the other. We incur penalties for matching layers with different types of operations or those at structurally different positions. We will find a matching that minimises these penalties, and the total penalty at the minimum will give rise to a distance. We first describe two concepts, layer masses and path lengths, which we will use to define OTMANN.

Layer masses:
The layer masses ℓm : L → R+ will be the quantity that we match between the layers of two networks when comparing them. ℓm(u) quantifies the significance of layer u. For processing layers, ℓm(u) will represent the amount of computation carried out by layer u and is computed via the product of ℓu(u) and the number of incoming units. For example, in Fig. 1b, ℓm(5) = 32 × (16 + 16), as there are 16 filtered channels coming from each of its two parent layers. As there is no computation at the input and output layers, we cannot define the layer mass directly as we did for the processing layers. Therefore, we use ℓm(u_ip) = ℓm(u_op) = ζ Σ_{u∈PL} ℓm(u), where PL denotes the set of processing layers and ζ ∈ (0, 1] is a parameter to be determined. Intuitively, we are using an amount of mass that is proportional to the amount of computation in the processing layers. Similarly, the decision layers occupy a significant role in the architecture as they directly influence the output. While there is computation being performed at these layers, this might be problem dependent – there is more computation performed at the softmax layer in a 10 class classification problem than in a 2 class problem. Furthermore, we found that setting the layer mass for decision layers based on computation underestimates their contribution to the network. Following the same intuition as we did for the input/output layers, we assign an amount of mass proportional to the mass in the processing layers. Since the outputs of the decision layers are averaged, we distribute the mass among all decision layers; that is, if DL are the decision layers, then for all u ∈ DL, ℓm(u) = (ζ/|DL|) Σ_{u∈PL} ℓm(u). In all our experiments, we use ζ = 0.1. In Fig. 1, the layer masses for each layer are shown in parentheses.

Path lengths from/to u_ip/u_op: In a neural network G, a path from u to v is a sequence of layers u_1, . . . , u_s where u_1 = u, u_s = v, and (u_i, u_{i+1}) ∈ E for all i ≤ s − 1. The length of this path is the number of hops from one node to another in order to get from u to v. Let the shortest (longest) path length from u to v be the smallest (largest) number of hops from one node to another among all paths from u to v. Additionally, define the random walk path length as the expected number of hops to get from u to v if, from any layer, we hop to one of its children chosen uniformly at random. For example, in Fig. 1c, one pair of layers has shortest, longest, and random walk path lengths of 5, 7, and 5.67 respectively. For any u ∈ L, let δ^sp_op(u), δ^lp_op(u), δ^rw_op(u) denote the length of the shortest, longest and random walk paths from u to the output u_op. Similarly, let δ^sp_ip(u), δ^lp_ip(u), δ^rw_ip(u) denote the corresponding lengths for walks from the input u_ip to u. As the layers of a neural network can be topologically ordered (a topological ordering is an ordering of the layers u_1, . . . , u_{|L|} such that u comes before v if (u, v) ∈ E), the above path lengths are well defined and finite. Further, for any s ∈ {sp, lp, rw} and t ∈ {ip, op}, δ^s_t(u) can be computed for all u ∈ L in O(|E|) time (see Appendix A.3 for details).

We are now ready to describe OTMANN. Given two networks G1 = (L1, E1), G2 = (L2, E2) with n1, n2 layers respectively, we will attempt to match the layer masses in both networks. We let Z ∈ R^{n1×n2}_+ be such that Z(i, j) denotes the amount of mass matched between layer i ∈ G1 and j ∈ G2. The OTMANN distance is computed by solving the following optimisation problem:

    minimise_Z   φ_lmm(Z) + φ_nas(Z) + ν_str φ_str(Z)                                      (3)
    subject to   Σ_{j∈L2} Z_ij ≤ ℓm(i),   Σ_{i∈L1} Z_ij ≤ ℓm(j),   ∀ i ∈ L1, j ∈ L2.

The label mismatch term φ_lmm penalises matching masses that have different labels, while the structural term φ_str penalises matching masses at structurally different positions with respect to each other. If we choose not to match any mass in either network, we incur a non-assignment penalty φ_nas. ν_str > 0 determines the trade-off between the structural and other terms. The inequality constraints ensure that we do not over assign the masses in a layer. We now describe φ_lmm, φ_nas, and φ_str.

Label mismatch penalty φ_lmm: We begin with a label penalty matrix M ∈ R^{L×L}, where L is the number of all label types and M(x, y) denotes the penalty for transporting a unit mass from a layer with label x to a layer with label y. We then construct a matrix C_lmm ∈ R^{n1×n2} with C_lmm(i, j) = M(ℓℓ(i), ℓℓ(j)) corresponding to the mislabel cost for matching unit mass from each layer i ∈ L1 to each layer j ∈ L2. We then set φ_lmm(Z) = ⟨Z, C_lmm⟩ = Σ_{i∈L1, j∈L2} Z(i, j) C_lmm(i, j) to be the sum of all matchings from L1 to L2 weighted by the label penalty terms. This matrix M, illustrated in Table 1, is a parameter that needs to be specified for OTMANN. It can be specified with an intuitive understanding of the functionality of the layers; e.g. many values in M are ∞, while for similar layers we choose a value less than 1.

Table 1: An example label mismatch cost matrix M. There is zero cost for matching identical layers, < 1 cost for similar layers, and infinite cost for disparate layers.

              conv3   conv5   max-pool   avg-pool   fc
  conv3         0      0.2       ∞          ∞        ∞
  conv5        0.2      0        ∞          ∞        ∞
  max-pool      ∞       ∞        0         0.25      ∞
  avg-pool      ∞       ∞       0.25        0        ∞
  fc            ∞       ∞        ∞          ∞        0

Non-assignment penalty φ_nas: We set this to be the amount of mass that is unassigned in both networks, i.e. φ_nas(Z) = Σ_{i∈L1} (ℓm(i) − Σ_{j∈L2} Z_ij) + Σ_{j∈L2} (ℓm(j) − Σ_{i∈L1} Z_ij). This essentially implies that the cost for not assigning unit mass is 1. The costs in Table 1 are defined relative to this. For similar layers x, y, M(x, y) ≪ 1, and for disparate layers M(x, y) ≫ 1. That is, we would rather match a conv3 to a conv5 than not assign it, provided the structural penalty for doing so is small; conversely, we would rather not assign a conv3 than assign it to an fc. This also explains why we did not use a trade-off parameter like ν_str for φ_lmm and φ_nas – it is simple to specify reasonable values for M(x, y) from an understanding of their functionality.

Structural penalty φ_str: We define a matrix C_str ∈ R^{n1×n2} where C_str(i, j) is small if layers i ∈ L1 and j ∈ L2 are at structurally similar positions in their respective networks. We then set φ_str(Z) = ⟨Z, C_str⟩.
For i ∈ L1, j ∈ L2, we let C_str(i, j) = Σ_{s∈{sp,lp,rw}} Σ_{t∈{ip,op}} |δ^s_t(i) − δ^s_t(j)| be the average of all path length differences, where the δ^s_t are the path lengths defined previously. We define φ_str in terms of the shortest/longest/random-walk path lengths from/to the input/output because they capture various notions of information flow in a neural network; a layer's input is influenced by the paths the data takes before reaching the layer, and its output influences all layers it passes through before reaching the decision layers. If the path lengths are similar for two layers, they are likely to be at similar structural positions. Further, this form allows us to solve (3) efficiently via an OT program and prove distance properties about the solution. If we need to compute pairwise distances for several networks, as is the case in BO, the path lengths can be pre-computed in O(|E|) time and used to construct C_str for two networks at the moment of computing the distance between them.

This completes the description of our matching program. In Appendix A, we prove that (3) can be formulated as an Optimal Transport (OT) program [47]. OT is a well studied problem with several efficient solvers [33]. Our theorem below shows that the solution of (3) is a distance.
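As an illustration of how (3) can be handed to an off-the-shelf solver, the sketch below casts it as a linear program with scipy. It uses the fact that φ_nas(Z) = tm(G1) + tm(G2) − 2 Σ_ij Z_ij, so the non-assignment penalty can be folded into the per-entry cost; the function signature and the large finite stand-in for ∞ are our own choices, not the released implementation.

```python
import numpy as np
from scipy.optimize import linprog

def otmann_distance(lm1, lm2, C_lmm, C_str, nu_str, big=1e6):
    """Solve the matching program (3) for two networks.

    lm1, lm2 : layer masses of the two networks (length n1 and n2).
    C_lmm    : n1 x n2 label mismatch costs (np.inf for disparate labels).
    C_str    : n1 x n2 structural costs.
    """
    lm1, lm2 = np.asarray(lm1, float), np.asarray(lm2, float)
    n1, n2 = len(lm1), len(lm2)

    C = np.where(np.isinf(C_lmm), big, C_lmm) + nu_str * np.asarray(C_str, float)
    # phi_nas(Z) = tm(G1) + tm(G2) - 2 * sum(Z): fold the -2 into the cost matrix.
    c = (C - 2.0).ravel()

    # Row sums: sum_j Z_ij <= lm1[i]; column sums: sum_i Z_ij <= lm2[j]; Z >= 0.
    A_rows = np.kron(np.eye(n1), np.ones((1, n2)))
    A_cols = np.kron(np.ones((1, n1)), np.eye(n2))
    res = linprog(c, A_ub=np.vstack([A_rows, A_cols]),
                  b_ub=np.concatenate([lm1, lm2]),
                  bounds=(0, None), method="highs")
    return res.fun + lm1.sum() + lm2.sum()   # add back the constant tm(G1) + tm(G2)
```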
Table 2: Descriptions of modifiers to transform one network to another. The first four change the number of units in the layers but do not change the architecture, while the last five change the architecture.

  Operation       Description
  dec_single      Pick a layer at random and decrease the number of units by a fixed fraction.
  dec_en_masse    Pick several layers at random and decrease the number of units by a fixed fraction for all of them.
  inc_single      Pick a layer at random and increase the number of units by a fixed fraction.
  inc_en_masse    Pick several layers at random and increase the number of units by a fixed fraction for all of them.
  dup_path        Pick a random path u_1, . . . , u_k, duplicate u_2, . . . , u_{k−1} and connect them to u_1 and u_k.
  remove_layer    Pick a layer at random and remove it. Connect the layer's parents to its children if necessary.
  skip            Randomly pick layers u, v where u is topologically before v. Add (u, v) to E.
  swap_label      Randomly pick a layer and change its label.
  wedge_layer     Randomly remove an edge (u, v) from E. Create a new layer w and add (u, w), (w, v) to E.
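For concreteness, here is how two of the architecture-changing modifiers in Table 2 might look on the dictionary-plus-edge-list encoding used earlier; the helper signatures are illustrative assumptions rather than the released implementation.

```python
import random

def skip(layers, edges, topo_order, rng=random):
    """`skip` from Table 2: add an edge (u, v) with u topologically before v."""
    i, j = sorted(rng.sample(range(len(topo_order)), 2))
    u, v = topo_order[i], topo_order[j]
    if (u, v) not in edges:
        edges.append((u, v))
    return layers, edges

def wedge_layer(layers, edges, new_label, new_units, rng=random):
    """`wedge_layer` from Table 2: replace a random edge (u, v) by u -> w -> v."""
    u, v = rng.choice(edges)
    edges.remove((u, v))
    w = max(layers) + 1                      # fresh layer index
    layers[w] = (new_label, new_units)
    edges.extend([(u, w), (w, v)])
    return layers, edges
```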
Theorem 1. Let d(G1, G2) be the solution of (3) for networks G1, G2. Under mild regularity conditions on M, d(·, ·) is a pseudo-distance. That is, for all networks G1, G2, G3, it satisfies d(G1, G2) ≥ 0, d(G1, G2) = d(G2, G1), d(G1, G1) = 0, and d(G1, G3) ≤ d(G1, G2) + d(G2, G3).

For what follows, define d̄(G1, G2) = d(G1, G2)/(tm(G1) + tm(G2)), where tm(G_i) = Σ_{u∈L_i} ℓm(u) is the total mass of a network. Note that d̄ ≤ 1. While d̄ does not satisfy the triangle inequality, it provides a useful measure of dissimilarity normalised by the amount of computation. Our experience suggests that d puts more emphasis on the amount of computation at the layers over structure, and vice versa for d̄. Therefore, it is prudent to combine both quantities in any downstream application. The caption in Fig. 1 gives d, d̄ values for the examples in that figure (computed with our choice of ν_str).

We conclude this section with a couple of remarks. First, OTMANN shares similarities with Wasserstein (earth mover's) distances, which also have an OT formulation. However, it is not a Wasserstein distance itself—in particular, the supports of the masses and the cost matrices change depending on the two networks being compared. Second, while there has been prior work on defining various distances and kernels on graphs, we cannot use them in BO because neural networks have additional complex properties in addition to graphical structure, such as the type of operations performed at each layer, the number of neurons, etc. The above work either defines the distance/kernel between vertices or assumes the same vertex (layer) set [9, 23, 29, 38, 49], none of which apply in our setting. While some methods do allow different vertex sets [48], they cannot handle layer masses and layer similarities. Moreover, the computation of the above distances is more expensive than OTMANN. Hence, these methods cannot be directly plugged into a BO framework for architecture search.

In Appendix A, we provide additional material on OTMANN. This includes the proof of Theorem 1, a discussion on some design choices, and implementation details such as the computation of the path lengths. Moreover, we provide illustrations to demonstrate that OTMANN is a meaningful distance for architecture search. For example, a t-SNE embedding places similar architectures close to each other. Further, scatter plots showing the validation error vs distance on real datasets demonstrate that networks with small distance tend to perform similarly on the problem.

NASBOT
We now describe NASBOT, our BO algorithm for neural architecture search. Recall that in order to realise the BO scheme outlined in Section 2.1, we need to specify (a) a kernel κ for neural architectures and (b) a method to optimise the acquisition ϕ_t over these architectures. Due to space constraints, we will only describe the key ideas and defer all details to Appendix B.

As described previously, we will use a negative exponentiated distance for κ. Precisely, κ = α e^{−βd} + ᾱ e^{−β̄ d̄}, where d, d̄ are the OTMANN distance and its normalised version. We mention that while this has the form of popular kernels, we do not know yet if it is in fact a kernel. In our experiments, we did not encounter an instance where the eigenvalues of the kernel matrix were negative. In any case, there are several methods to circumvent this issue in kernel methods [42].

We use an evolutionary algorithm (EA) approach to optimise the acquisition function (2). For this, we begin with an initial pool of networks and evaluate the acquisition ϕ_t on those networks. Then we generate a set of N_mut mutations of this pool as follows. First, we stochastically select N_mut candidates from the set of networks already evaluated, such that those with higher ϕ_t values are more likely to be selected than those with lower values. Then we modify each candidate to produce a new architecture. These modifications, described in Table 2, might change the architecture either by increasing or decreasing the number of computational units in a layer, by adding or deleting layers, or by changing the connectivity of existing layers. Finally, we evaluate the acquisition on these N_mut mutations, add them to the initial pool, and repeat for the prescribed number of steps. While EA works fine for cheap functions, such as the acquisition ϕ_t which is analytically available, it is not suitable when evaluations are expensive, such as training a neural network. This is because EA selects points for future evaluations that are already close to points that have been evaluated, and is hence inefficient at exploring the space. In our experiments, we compare NASBOT to the same EA scheme used to optimise the acquisition and demonstrate that the former outperforms the latter.

We conclude this section by observing that this framework for NASBOT/OTMANN has additional flexibility beyond what has been described. If one wishes to tune over drop-out probabilities, regularisation penalties and batch normalisation at each layer, they can be treated as part of the layer label, via an augmented label penalty matrix M which accounts for these considerations. If one wishes to jointly tune other scalar hyper-parameters (e.g. learning rate), they can use an existing kernel for Euclidean spaces and define the GP over the joint architecture + hyper-parameter space via a product kernel. BO methods for early stopping in iterative training procedures [17–20, 22] can be easily incorporated by defining a fidelity space. Using a line of work in scalable GPs [39, 50], one can apply our methods to challenging problems which might require trying a very large number of networks. These ideas can help deploy NASBOT in large scale settings, but are tangential to our goal of introducing a BO method for architecture search.

Figure 2: Cross validation results on the blog feedback, indoor location, slice localisation, naval propulsion, protein, news, and Cifar10 datasets, comparing RAND, EA, TreeBO, and NASBOT. In all figures, the x axis is time (hours). The y axis is the cross validation mean squared error (MSE) in the first 6 figures and the classification error in the last. Lower is better in all cases. The title of each figure states the dataset and the number of parallel workers (GPUs). All figures were averaged over multiple independent runs of each method. Error bars indicate one standard error.
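The evolutionary procedure for maximising the acquisition can be sketched as follows; the helper names (sample_initial_pool, apply_random_modifier) and the softmax-style selection rule are our own illustrative assumptions, not the exact scheme in the released code.

```python
import numpy as np

def ea_maximise_acquisition(acq, sample_initial_pool, apply_random_modifier,
                            n_steps=10, n_mut=20, rng=None):
    """Evolutionary search for an (approximate) maximiser of the acquisition."""
    rng = rng or np.random.default_rng()
    pool = list(sample_initial_pool())
    scores = [acq(x) for x in pool]

    for _ in range(n_steps):
        s = np.asarray(scores, dtype=float)
        probs = np.exp(s - s.max())
        probs /= probs.sum()                 # favour candidates with high acquisition
        parents = rng.choice(len(pool), size=n_mut, p=probs)

        # Apply one of the Table 2 modifiers to each selected candidate.
        children = [apply_random_modifier(pool[i]) for i in parents]
        scores.extend(acq(x) for x in children)
        pool.extend(children)

    return pool[int(np.argmax(scores))]
```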
Methods: We compare NASBOT to the following baselines. RAND: random search; EA (evolutionary algorithm): the same EA procedure described above; TreeBO [15]: a BO method which only searches over feed forward structures. Random search is a natural baseline to compare optimisation methods. However, unlike in Euclidean spaces, there is no natural way to randomly explore the space of architectures. Our RAND implementation operates in exactly the same way as NASBOT, except that the EA procedure is fed a random sample from Unif(0, 1) instead of the GP acquisition each time it evaluates an architecture. Hence, RAND is effectively picking a random network from the same space explored by NASBOT; neither method has an unfair advantage because it considers a different space. While there are other methods for architecture search, their implementations are highly nontrivial and are not made available.
Datasets: We use the following datasets: blog feedback [4], indoor location [46], slice localisation [11], naval propulsion [5], protein tertiary structure [34], news popularity [7], and Cifar10 [24]. The first six are regression problems for which we use MLPs. The last is a classification task on images for which we use CNNs. Table 3 gives the size and dimensionality of each dataset. For the first six datasets, we use a fixed train–validation–test split and normalised the input and output to have zero mean and unit variance. Hence, a constant predictor will have a mean squared error of approximately 1. For Cifar10 we use 40K images for training and 10K each for validation and testing.
Table 3: The first row gives the number of samples N and the dimensionality D of each dataset in the form (N, D). The subsequent rows show the regression MSE or classification error (lower is better) on the test set for each method (RAND, EA, TreeBO, NASBOT), with one standard error. The columns are Blog (60K samples), Indoor (21K), Slice (54K), Naval (12K), Protein (46K), News (40K), and Cifar10 (60K); the last column is for Cifar10, where we took the best models found by each method and trained them for a larger number of iterations. We also report the test errors obtained by training the VGG-19 architecture with our training procedure for 60K and 150K iterations as a reference.

Experimental Set up:
Each method is executed in an asynchronously parallel set up of 2–4 GPUs. That is, it can evaluate multiple models in parallel, with each model on a single GPU. When the evaluation of one model finishes, the methods can incorporate the result and immediately re-deploy the next job without waiting for the others to finish. For the blog, indoor, slice, naval and protein datasets we use 2 GeForce GTX 970 (4GB) GPUs and a computational budget of 8 hours for each method. For the news popularity dataset we use GeForce GTX 980 (6GB) GPUs with a budget of 6 hours, and for Cifar10 we use 4 K80 (12GB) GPUs with a budget of 10 hours. For the regression datasets, we train each model with stochastic gradient descent (SGD) with a fixed step size and a batch size of 256 for 20K batch iterations. For Cifar10, we start with a larger step size and reduce it gradually; we train in batches of 32 images for 60K batch iterations. The number of networks each method evaluates depends on the size of the networks chosen and the number of GPUs.

Results:
Fig. 2 plots the best validation score for each method against time. In Table 3, we present the results on the test set with the best model chosen on the basis of validation set performance. On the Cifar10 dataset, we also trained the best models for longer; these results are in the last column of Table 3. We see that NASBOT is the most consistent of all methods. The average time taken by NASBOT to determine the next architecture to evaluate was on the order of seconds, as it was for RAND, EA, and TreeBO; the time taken to train and validate models was on the order of 10–40 minutes depending on the model size. Fig. 2 includes this time taken to determine the next point. Like many BO algorithms, while NASBOT's selection criterion is time consuming, it pays off when evaluations are expensive. In Appendices B and C, we provide additional details on the experiment set up and conduct synthetic ablation studies by holding out different components of the NASBOT framework. We also illustrate some of the best architectures found – on many datasets, common features were long skip connections and multiple decision layers.

Finally, we note that while our Cifar10 experiments fall short of the current state of the art [25, 26, 53], the amount of computation in these works is several orders of magnitude more than ours (both the computation invested to train a single model and the number of models trained). Further, they use constrained spaces specialised for CNNs, while NASBOT is deployed in a very general model space. We believe that our results can also be improved by employing enhanced training techniques such as image whitening, image flipping, drop out, etc. For example, using our training procedure on the VGG-19 architecture [37] yielded a noticeably higher test set error than what VGG-19 is known to achieve on Cifar10. That said, we believe our results are encouraging and lay out the premise for BO for neural architectures.
We described NASBOT, a BO framework for neural architecture search. NASBOT finds better architectures for MLPs and CNNs more efficiently than other baselines on several datasets. A key contribution of this work is the efficiently computable OTMANN distance for neural network architectures, which may be of independent interest as it might find applications outside of BO. Our code for NASBOT and OTMANN will be made available.

Acknowledgements
We would like to thank Guru Guruganesh and Dougal Sutherland for the insightful discussions. Thisresearch is partly funded by DOE grant DESC0011114, NSF grant IIS1563887, and the Darpa D3Mprogram. KK is supported by a Facebook fellowship and a Siebel scholarship.
References [1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architecturesusing reinforcement learning. arXiv preprint arXiv:1611.02167 , 2016.[2] James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameteroptimization in hundreds of dimensions for vision architectures. 2013.[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive CostFunctions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.
CoRR ,2010.[4] Krisztian Buza. Feedback prediction for blogs. In
Data analysis, machine learning and knowledgediscovery , pages 145–152. Springer, 2014.[5] Andrea Coraddu, Luca Oneto, Aessandro Ghio, Stefano Savio, Davide Anguita, and Massimo Figari.Machine learning approaches for improving condition-based maintenance of naval propulsion plants.
Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the MaritimeEnvironment , 230(1):136–153, 2016.[6] Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Adanet: Adaptivestructural learning of artificial neural networks. arXiv preprint arXiv:1607.01097 , 2016.[7] Kelwin Fernandes, Pedro Vinagre, and Paulo Cortez. A proactive intelligent decision support system forpredicting the popularity of online news. In
Portuguese Conference on Artificial Intelligence , 2015.[8] Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning.
Evolutionary Intelligence , 1(1):47–62, 2008.[9] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance.
Pattern Analysisand applications , 13(1):113–129, 2010.[10] David Ginsbourger, Janis Janusevskis, and Rodolphe Le Riche. Dealing with asynchronicity in parallelgaussian process based global optimization. In
ERCIM , 2011.[11] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro. 2d imageregistration in ct images using radial image descriptors. In
International Conference on Medical ImageComputing and Computer-Assisted Intervention , pages 607–614. Springer, 2011.[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016.[13] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolu-tional networks. In
CVPR , 2017.[14] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for generalalgorithm configuration. In
LION , 2011.[15] Rodolphe Jenatton, Cedric Archambeau, Javier González, and Matthias Seeger. Bayesian optimizationwith tree-structured dependencies. In
International Conference on Machine Learning , 2017.[16] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextualdecision processes with low bellman rank are pac-learnable. arXiv preprint arXiv:1610.09512 , 2016.[17] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Gaussianprocess bandit optimisation with multi-fidelity evaluations. In
Advances in Neural Information ProcessingSystems , pages 992–1000, 2016.[18] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabas Poczos. Multi-fidelity gaussian process bandit optimisation. arXiv preprint arXiv:1603.06288 , 2016.[19] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multi-fidelitymulti-armed bandit. In
Advances in Neural Information Processing Systems , pages 1777–1785, 2016.[20] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabas Poczos. Multi-fidelity BayesianOptimisation with Continuous Approximations. arXiv preprint arXiv:1703.06240 , 2017.[21] Hiroaki Kitano. Designing neural networks using genetic algorithms with graph generation system.
Complex systems , 4(4):461–476, 1990.[22] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimizationof machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079 , 2016.[23] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete input spaces. In
ICML ,volume 2, pages 315–322, 2002.[24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.[25] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang,and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559 , 2017.
26] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchicalrepresentations for efficient architecture search. arXiv preprint arXiv:1711.00436 , 2017.[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.
Journal of machine learningresearch , 9(Nov):2579–2605, 2008.[28] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towardsautomatically-tuned neural networks. In
Workshop on Automatic Machine Learning , pages 58–65, 2016.[29] Bruno T Messmer and Horst Bunke. A new algorithm for error-tolerant subgraph isomorphism detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence , 20(5):493–504, 1998.[30] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju,Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprintarXiv:1703.00548 , 2017.[31] J.B. Mockus and L.J. Mockus. Bayesian approach to global optimization and application to multiobjectiveand constrained problems.
Journal of Optimization Theory and Applications , 1991.[32] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architec-tures. arXiv preprint arXiv:1704.08792 , 2017.[33] Gabriel Peyré and Marco Cuturi.
Computational Optimal Transport . Available online, 2017.[34] PS Rana. Physicochemical properties of protein tertiary structure data set, 2013.[35] C.E. Rasmussen and C.K.I. Williams.
Gaussian Processes for Machine Learning . Adaptative computationand machine learning series. University Press Group Limited, 2006.[36] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and AlexKurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041 , 2017.[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-tion. arXiv preprint arXiv:1409.1556 , 2014.[38] Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In
Learning theory and kernelmachines , pages 144–158. Springer, 2003.[39] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In
Advances inneural information processing systems , pages 1257–1264, 2006.[40] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian Optimization of Machine LearningAlgorithms. In
Advances in Neural Information Processing Systems , 2012.[41] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies.
Evolutionary computation , 10(2):99–127, 2002.[42] Dougal J Sutherland.
Scalable, Active and Flexible Learning on Distributions . PhD thesis, CarnegieMellon University Pittsburgh, PA, 2015.[43] Richard S Sutton and Andrew G Barto.
Reinforcement learning: An introduction , volume 1. MIT pressCambridge, 1998.[44] Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A Osborne. Raiders of thelost architecture: Kernels for bayesian optimization in conditional parameter spaces. arXiv preprintarXiv:1409.4011 , 2014.[45] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, DumitruErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In
Proceedings ofthe IEEE conference on computer vision and pattern recognition , pages 1–9, 2015.[46] Joaquín Torres-Sospedra, Raúl Montoliu, Adolfo Martínez-Usó, Joan P Avariento, Tomás J Arnau, MauriBenedito-Bordonau, and Joaquín Huerta. Ujiindoorloc: A new multi-building and multi-floor database forwlan fingerprint-based indoor localization problems. In
Indoor Positioning and Indoor Navigation (IPIN),2014 International Conference on , pages 261–270. IEEE, 2014.[47] Cédric Villani.
Optimal transport: old and new , volume 338. Springer Science & Business Media, 2008.[48] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels.
Journal of Machine Learning Research , 11(Apr):1201–1242, 2010.[49] Walter D Wallis, Peter Shoubridge, M Kraetz, and D Ray. Graph distances using graph union.
PatternRecognition Letters , 22(6-7):701–704, 2001.[50] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes(kiss-gp). In
International Conference on Machine Learning, pages 1775–1784, 2015.
[51] Lingxi Xie and Alan Yuille. Genetic cnn. arXiv preprint arXiv:1703.01513, 2017.
[52] Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552, 2017.
[53] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[54] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.

A Additional Details on OTMANN
A.1 Optimal Transport Reformulation
We begin with a review of optimal transport. Throughout this section, ⟨·, ·⟩ denotes the Frobenius dot product, and 1_n, 0_n ∈ R^n denote the vectors of ones and zeros respectively.

A review of Optimal Transport [47]: Let y1 ∈ R^{n1}_+, y2 ∈ R^{n2}_+ be such that 1^⊤_{n1} y1 = 1^⊤_{n2} y2. Let C ∈ R^{n1×n2}_+. The following optimisation problem,

    minimise_Z   ⟨Z, C⟩                                                          (4)
    subject to   Z ≥ 0,   Z 1_{n2} = y1,   Z^⊤ 1_{n1} = y2,

is called an optimal transport program. One interpretation of this set up is that y1 denotes the supplies at n1 warehouses, y2 denotes the demands at n2 retail stores, C_ij denotes the cost of transporting a unit mass of supplies from warehouse i to store j, and Z_ij denotes the mass of material transported from i to j. The program attempts to find a transportation plan which minimises the total cost of transportation ⟨Z, C⟩.

OT formulation of (3): We now describe the OT formulation of the
OTMANN distance. In addition to providing an efficient way to solve (3), the OT formulation will allow us to prove the metric properties of the solution. When computing the distance between G1, G2, for i = 1, 2, let tm(G_i) = Σ_{u∈L_i} ℓm(u) denote the total mass in G_i, and n̄_i = n_i + 1 where n_i = |L_i|. y1 = [{ℓm(u)}_{u∈L1}, tm(G2)] ∈ R^{n̄1} will be the supplies in our OT problem, and y2 = [{ℓm(u)}_{u∈L2}, tm(G1)] ∈ R^{n̄2} will be the demands. To define the cost matrix, we augment the mislabel and structural penalty matrices C_lmm, C_str with an additional row and column of zeros; i.e. C′_lmm = [C_lmm, 0_{n1}; 0^⊤_{n̄2}] ∈ R^{n̄1×n̄2}, and C′_str is defined similarly. Let C′_nas = [0_{n1×n2}, 1_{n1}; 1^⊤_{n2}, 0] ∈ R^{n̄1×n̄2}, and let C′ = C′_lmm + C′_str + C′_nas (here and below we absorb the trade-off parameter ν_str into C_str to simplify notation). We will show that (3) is equivalent to the following OT program:

    minimise_{Z′}   ⟨Z′, C′⟩                                                     (5)
    subject to      Z′ 1_{n̄2} = y1,   Z′^⊤ 1_{n̄1} = y2.

One interpretation of (5) is that the last row/column appended to the cost matrices serves as a non-assignment layer and that the cost for transporting unit mass to this layer from all other layers is 1. The costs for mislabelling were defined relative to this non-assignment cost. The costs for similar layers are much smaller than 1; therefore, the optimiser is incentivised to transport mass among similar layers rather than not assign it, provided that the structural penalty is not too large. Correspondingly, the cost for very disparate layers is much larger, so that we would never match, say, a convolutional layer with a pooling layer. In fact, the ∞'s in Table 1 can be replaced by any sufficiently large value and the solution will be the same. The following theorem shows that (3) and (5) are equivalent.

Theorem 2.
Problems (3) and (5) are equivalent, in that they both have the same minimum and we can recover the solution of one from the other.

Proof.
We will show that there exists a bijection between feasible points in both problems with the same value for the objective. First let Z ∈ R^{n1×n2} be a feasible point for (3). Let Z′ ∈ R^{n̄1×n̄2} be such that its first n1 × n2 block is Z and Z′_{n̄1,j} = ℓm(j) − Σ_{i=1}^{n1} Z_ij for j ≤ n2, Z′_{i,n̄2} = ℓm(i) − Σ_{j=1}^{n2} Z_ij for i ≤ n1, and Z′_{n̄1,n̄2} = Σ_ij Z_ij. Then, for all i ≤ n1, Σ_j Z′_ij = ℓm(i), and Σ_{j≤n2} Z′_{n̄1,j} + Z′_{n̄1,n̄2} = Σ_j ℓm(j) − Σ_ij Z_ij + Σ_ij Z_ij = tm(G2). We therefore have Z′ 1_{n̄2} = y1. Similarly, we can show Z′^⊤ 1_{n̄1} = y2. Therefore, Z′ is feasible for (5). We see that the objectives are equal via simple calculations,

    ⟨Z′, C′⟩ = ⟨Z′, C′_lmm + C′_str⟩ + ⟨Z′, C′_nas⟩                                  (6)
             = ⟨Z, C_lmm + C_str⟩ + Σ_{j=1}^{n2} Z′_{n̄1,j} + Σ_{i=1}^{n1} Z′_{i,n̄2}
             = ⟨Z, C_lmm⟩ + ⟨Z, C_str⟩ + Σ_{i∈L1} (ℓm(i) − Σ_{j∈L2} Z_ij) + Σ_{j∈L2} (ℓm(j) − Σ_{i∈L1} Z_ij).

The converse also follows via a straightforward argument. For a given Z′ that is feasible for (5), we let Z be its first n1 × n2 block. By the equality constraints and non-negativity of Z′, Z is feasible for (3). By reversing the argument in (6), we see that the objectives are also equal.
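For completeness, the augmented program (5) can be given directly to a standard balanced-OT solver. The sketch below uses the emd routine from the POT package (an assumption about tooling; any exact OT solver would do), applies ν_str to the structural costs as in (3), and replaces ∞ by a large finite value.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def otmann_via_ot(lm1, lm2, C_lmm, C_str, nu_str, big=1e6):
    """Solve the augmented OT program (5) and return the OTMANN distance."""
    lm1, lm2 = np.asarray(lm1, float), np.asarray(lm2, float)
    n1, n2 = len(lm1), len(lm2)

    # Supplies and demands: append the total mass of the *other* network.
    y1 = np.concatenate([lm1, [lm2.sum()]])
    y2 = np.concatenate([lm2, [lm1.sum()]])

    # Augmented cost: zero row/column for the label and structural parts,
    # plus a non-assignment cost of 1 between real layers and the extra layer.
    C = np.zeros((n1 + 1, n2 + 1))
    C[:n1, :n2] = np.where(np.isinf(C_lmm), big, C_lmm) + nu_str * np.asarray(C_str, float)
    C[:n1, n2] = 1.0
    C[n1, :n2] = 1.0

    Z = ot.emd(y1, y2, C)        # optimal transport plan for (5)
    return float((Z * C).sum())
```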
Figure 3: An example of two CNNs which have d = d̄ = 0 distance. Left network: 0: ip; 1: conv3, 16; 2: conv3, 16; 3: conv3, 32; 4: max-pool; 5: fc, 16; 6: softmax; 7: op. Right network: 0: ip; 1: conv3, 16; 2: conv3, 8; 3: conv3, 8; 4: conv3, 32; 5: max-pool; 6: fc, 16; 7: softmax; 8: op. The OT solution matches the mass in each layer in the network on the left to the layer horizontally opposite to it on the right with 0 cost. For layer 2 on the left, its mass is mapped to layers 2 and 3 on the right. However, while the descriptor of these networks is different, their functional behaviour is the same.

A.2 Distance Properties of OTMANN

The following theorem shows that the solution of (3) is a pseudo-distance. This is a formal version of Theorem 1 in the main text.
Theorem 3.
Assume that the mislabel cost matrix M satisfies the triangle inequality; i.e. for all labels x, y, z we have M(x, z) ≤ M(x, y) + M(y, z). Let d(G1, G2) be the solution of (3) for networks G1, G2. Then d(·, ·) is a pseudo-distance. That is, for all networks G1, G2, G3, it satisfies d(G1, G2) ≥ 0, d(G1, G2) = d(G2, G1), d(G1, G1) = 0, and d(G1, G3) ≤ d(G1, G2) + d(G2, G3).

Some remarks are in order. First, observe that while d(·, ·) is a pseudo-distance, it is not a distance; i.e. d(G1, G2) = 0 does not imply G1 = G2. For example, while the networks in Figure 3 have different descriptors according to our formalism in Section 2.2, their distance is 0. However, it is not hard to see that their functionality is the same – in both cases, the output of layer 1 is passed through 16 conv3 filters and then fed to a layer with 32 conv3 filters – and hence, this property is desirable in this example. It is not yet clear, however, if the topology induced by our metric equates two functionally dissimilar networks. We leave it to future work to study equivalence classes induced by the OTMANN distance. Second, despite the OT formulation, this is not a Wasserstein distance. In particular, the supports of the masses and the cost matrices change depending on the two networks being compared.
Proof of Theorem 3. We will use the OT formulation (5) in this proof. The first three properties are straightforward. Non-negativity follows from the non-negativity of Z′, C′ in (5). It is symmetric since the cost matrix for d(G2, G1) is C′^⊤ if the cost matrix for d(G1, G2) is C′, and ⟨Z′, C′⟩ = ⟨Z′^⊤, C′^⊤⟩ for all Z′. We also have d(G1, G1) = 0 since, then, C′ has a zero diagonal.

To prove the triangle inequality, we will use a gluing lemma, similar to what is used in the proof for Wasserstein distances [33]. Let G1, G2, G3 be given and m1, m2, m3 be their total masses. Let the solutions to d(G1, G2) and d(G2, G3) be P ∈ R^{n̄1×n̄2} and Q ∈ R^{n̄2×n̄3} respectively. When solving (5), we see that adding extra mass to the non-assignment layers does not change the objective, as an optimiser can transport mass between the two layers with 0 cost. Hence, we can assume w.l.o.g. that (5) was solved with y_i = [{ℓm(u)}_{u∈L_i}, (Σ_{j∈{1,2,3}} tm(G_j) − tm(G_i))] ∈ R^{n̄_i} for i = 1, 2, 3, when computing the distances d(G1, G2), d(G2, G3), d(G1, G3); i.e. the total mass was m1 + m2 + m3 for all three pairs. We can similarly assume that P, Q account for this extra mass, i.e. P_{n̄1,n̄2} and Q_{n̄2,n̄3} have been increased by m3 and m1 respectively from their solutions in (5).

To apply the gluing lemma, let R = P diag(1/y2) Q ∈ R^{n̄1×n̄3}, where diag(1/y2) is a diagonal matrix whose (j, j)th element is 1/(y2)_j (note y2 > 0). We see that R is feasible for (5) when computing d(G1, G3):

    R 1_{n̄3} = P diag(1/y2) Q 1_{n̄3} = P diag(1/y2) y2 = P 1_{n̄2} = y1.

Similarly, R^⊤ 1_{n̄1} = y3. Now, let U′, V′, W′ be the cost matrices C′ in (5) when computing d(G1, G2), d(G2, G3), and d(G1, G3) respectively. We will use the following technical lemma whose proof is given below.

Lemma 4.
For all i ∈ L̄1, j ∈ L̄2, k ∈ L̄3, we have W′_ik ≤ U′_ij + V′_jk.

Given Lemma 4, the triangle inequality follows:

    d(G1, G3) ≤ ⟨R, W′⟩ = Σ_{i∈L̄1, k∈L̄3} W′_ik Σ_{j∈L̄2} P_ij Q_jk / (y2)_j
              ≤ Σ_{i,j,k} (U′_ij + V′_jk) P_ij Q_jk / (y2)_j
              = Σ_{i,j} U′_ij (P_ij / (y2)_j) Σ_k Q_jk + Σ_{j,k} V′_jk (Q_jk / (y2)_j) Σ_i P_ij
              = Σ_{i,j} U′_ij P_ij + Σ_{j,k} V′_jk Q_jk
              = d(G1, G2) + d(G2, G3).

The first step uses the fact that d(G1, G3) is the minimum over all feasible solutions, and the third step uses Lemma 4. The fourth step rearranges terms and the fifth step uses P^⊤ 1_{n̄1} = Q 1_{n̄3} = y2.

Proof of Lemma 4. Let W′ = W′_lmm + W′_str + W′_nas be the decomposition into the label mismatch, structural and non-assignment parts of the cost matrices; define similar quantities U′_lmm, U′_str, U′_nas, V′_lmm, V′_str, V′_nas for U′, V′. Noting that a ≤ b + c and d ≤ e + f imply a + d ≤ b + e + c + f, it is sufficient to show the triangle inequality for each component individually. For the label mismatch term, (W′_lmm)_ik ≤ (U′_lmm)_ij + (V′_lmm)_jk follows directly from the conditions on M by setting x = ℓℓ(i), y = ℓℓ(j), z = ℓℓ(k), where i, j, k index L̄1, L̄2, L̄3 respectively.

For the non-assignment terms, when (W′_nas)_ik = 0 the claim is true trivially. (W′_nas)_ik = 1 either when (i = n̄1, k ≤ n3) or (i ≤ n1, k = n̄3). In the former case, when j ≤ n2, (U′_nas)_ij = 1, and when j = n̄2, (V′_nas)_jk = 1 as k ≤ n3. We therefore have (W′_nas)_ik = (U′_nas)_ij + (V′_nas)_jk = 1. A similar argument shows equality for the (i ≤ n1, k = n̄3) case as well.

Finally, for the structural terms we note that W′_str can be written as W′_str = Σ_t W′_(t), as can U′_str and V′_str. Here t indexes over the choices for the types of distances considered, i.e. t ∈ {sp, lp, rw} × {ip, op}. It is sufficient to show (W′_(t))_ik ≤ (U′_(t))_ij + (V′_(t))_jk. This inequality takes the form

    |δ^(t)_{1i} − δ^(t)_{3k}| ≤ |δ^(t)_{1i} − δ^(t)_{2j}| + |δ^(t)_{2j} − δ^(t)_{3k}|,

where δ^(t)_{gℓ} refers to distance type t in network g for layer ℓ. The above is simply the triangle inequality for real numbers. This concludes the proof of Lemma 4.

A.3 Implementation & Design Choices

Masses on the decision & input/output layers:
A.3 Implementation & Design Choices
Masses on the decision & input/output layers: It is natural to ask why one needs to model the mass in the decision and input/output layers. For example, a seemingly natural choice is to use zero mass for these layers. Using zero mass is a reasonable strategy if we were to allow only one decision layer. However, when there are multiple decision layers, consider comparing the following two networks: the first is a feed forward MLP with non-linear layers, the second is the same network but with an additional linear decision layer u, with one edge from u_ip to u and an edge from u to u_op. The latter models the function as the sum of a linear and a non-linear term, which might be suitable for some problems, unlike modelling it only as a non-linear term. If we did not add layer masses for the input/output/decision layers, then the distance between these two networks would be 0, as there would be equal mass in the feed forward part of both networks and it could be matched with 0 cost.
Algorithm 1:
Compute δ_rw^op(u) for all u ∈ L.
Require: G = (L, E), with the layers L topologically sorted in S.
Set δ_rw^op(u_op) = 0 and δ_rw^op(u) = nan for all u ≠ u_op.
while S is not empty do
    u ← pop_last(S)
    if u ≠ u_op then
        Δ ← {δ_rw^op(c) : c ∈ children(u)}
        δ_rw^op(u) ← 1 + mean(Δ)
    end if
end while
Return δ_rw^op.
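A short Python sketch of Algorithm 1 is given below, assuming the random-walk distance to the output satisfies the natural recursion δ_rw^op(u) = 1 + mean{δ_rw^op(c) : c ∈ children(u)}, i.e. the expected length of a walk that picks a child uniformly at random at each step. The toy network at the bottom is illustrative only.

def random_walk_dist_to_output(layers, children, u_op):
    """layers: list in topological order; children: dict layer -> list of children."""
    delta = {u: None for u in layers}
    delta[u_op] = 0.0
    for u in reversed(layers):          # pop_last on the topologically sorted list
        if u == u_op:
            continue
        child_vals = [delta[c] for c in children[u]]
        delta[u] = 1.0 + sum(child_vals) / len(child_vals)
    return delta

# Toy feed-forward network with one skip connection: ip -> a -> b -> op and a -> op.
layers = ["ip", "a", "b", "op"]
children = {"ip": ["a"], "a": ["b", "op"], "b": ["op"], "op": []}
print(random_walk_dist_to_output(layers, children, "op"))
# b: 1; a: 1 + (1 + 0)/2 = 1.5; ip: 1 + 1.5 = 2.5.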
Table 4: The label mismatch cost matrix M we used in our CNN experiments. M(x, y) denotes the penalty for transporting a unit mass from a layer with label x to a layer with label y. The labels abbreviated are conv3, conv5, conv7, max-pool, avg-pool, fc, and softmax in order. A blank indicates ∞ cost. We have not shown the ip and op layers, but they are similar to the fc column, with 0 in the diagonal and ∞ elsewhere.
Table 5: The label mismatch cost matrix M we used in our MLP experiments. The labels abbreviated are relu, crelu, leaky-relu, softplus, elu, logistic, tanh, and linear.
The label penalty matrices used in our
NASBOT implementation, described below, satisfy the triangle inequality condition in Theorem 3.
CNNs: Table 4 shows the label penalty matrix M used in our CNN experiments with labels conv3, conv5, conv7, max-pool, avg-pool, fc, softmax, ip, op. conv_k denotes a k × k convolution while avg-pool and max-pool are pooling operations. In addition, we also use res3, res5, res7 layers which are inspired by ResNets. A res_k uses 2 concatenated conv_k layers, but the input to the first layer is added to the output of the second layer before the relu activation – see Figure 2 in He et al. [12]. The layer mass for res_k layers is twice that of a conv_k layer. The costs for the res layers in the label penalty matrix are the same as for the corresponding conv layers. The cost between a res_k and a conv_j is a convex combination of M(conv_k, conv_j) and the non-assignment cost. The intuition is that a res_k is similar to a conv_k block except for the residual addition.
MLPs: Table 5 shows the label penalty matrix M used in our MLP experiments with labels relu, crelu, leaky-relu, softplus, elu, logistic, tanh, linear, ip, op. Here the first seven are common non-linear activations; relu, crelu, leaky-relu, softplus and elu are rectifiers, while logistic and tanh are sigmoidal activations.
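The triangle inequality condition on M can be verified mechanically over all label triples. The sketch below does this for a small, made-up penalty matrix; the values are illustrative and are not the entries of Tables 4 and 5.

import itertools
import math

INF = math.inf
labels = ["conv3", "conv5", "max-pool", "fc"]
M = {
    ("conv3", "conv3"): 0.0, ("conv3", "conv5"): 0.2, ("conv3", "max-pool"): INF, ("conv3", "fc"): INF,
    ("conv5", "conv3"): 0.2, ("conv5", "conv5"): 0.0, ("conv5", "max-pool"): INF, ("conv5", "fc"): INF,
    ("max-pool", "conv3"): INF, ("max-pool", "conv5"): INF, ("max-pool", "max-pool"): 0.0, ("max-pool", "fc"): INF,
    ("fc", "conv3"): INF, ("fc", "conv5"): INF, ("fc", "max-pool"): INF, ("fc", "fc"): 0.0,
}

# Check M(x, z) <= M(x, y) + M(y, z) for all triples, treating blanks as infinity.
violations = [(x, y, z) for x, y, z in itertools.product(labels, repeat=3)
              if M[(x, z)] > M[(x, y)] + M[(y, z)]]
print("triangle inequality holds" if not violations else violations)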
Other details: Our implementation of
OTMANN differs from what is described in the main text in two ways. First, in our CNN experiments, for an fc layer u, we scale the layer mass ℓm(u) down by a constant factor relative to what is described in the main text. This is because, in the convolutional and pooling channels, each unit is an image whereas in the fc layers each unit is a scalar. One could, in principle, account for the image sizes at the various layers when computing the layer masses, but this also has the added complication of depending on the size of the input image, which varies from problem to problem. Our approach is simpler and yields reasonable results.
Secondly, we use a slightly different form for C_str. First, for i ∈ L1, j ∈ L2, we let C_str^all(i, j) = Σ_{s∈{sp, lp, rw}} Σ_{t∈{ip, op}} |δ_s^t(i) − δ_s^t(j)| capture the path length differences when considering all layers. For CNNs, we similarly construct matrices C_str^conv, C_str^pool, C_str^fc, except they only consider the convolutional, pooling and fully connected layers respectively in the path lengths. For C_str^conv, the distances to the output (from the input) can be computed by zeroing outgoing (incoming) edges to layers that are not convolutional. We can similarly construct C_str^pool and C_str^fc, only counting the pooling and fully connected layers. Our final cost matrix for the structural penalty is the average of these four matrices, C_str = (C_str^all + C_str^conv + C_str^pool + C_str^fc)/4. For MLPs, we adopt a similar strategy by computing matrices C_str^all, C_str^rec, C_str^sig with all layers, only rectifiers, and only sigmoidal layers, and let C_str = (C_str^all + C_str^rec + C_str^sig)/3. The intuition is that by considering certain types of layers, we are accounting for different types of information flow due to different operations.
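The construction of C_str^all can be sketched as follows, given the six path-length features (shortest, longest and random-walk path lengths to u_ip and u_op) for each layer. The feature values below are toy numbers rather than quantities computed from a real network; the restricted matrices C_str^conv, C_str^pool and C_str^fc are built the same way after recomputing the features on the corresponding sub-networks, and the final C_str averages the matrices as described above.

import numpy as np

FEATURES = ["sp_ip", "lp_ip", "rw_ip", "sp_op", "lp_op", "rw_op"]

def structural_cost(delta1, delta2):
    """delta1, delta2: lists of dicts mapping feature name -> value, one dict per layer."""
    C = np.zeros((len(delta1), len(delta2)))
    for i, di in enumerate(delta1):
        for j, dj in enumerate(delta2):
            C[i, j] = sum(abs(di[f] - dj[f]) for f in FEATURES)
    return C

g1 = [{"sp_ip": 1, "lp_ip": 1, "rw_ip": 1, "sp_op": 2, "lp_op": 3, "rw_op": 2.5},
      {"sp_ip": 2, "lp_ip": 2, "rw_ip": 2, "sp_op": 1, "lp_op": 1, "rw_op": 1.0}]
g2 = [{"sp_ip": 1, "lp_ip": 2, "rw_ip": 1.5, "sp_op": 2, "lp_op": 2, "rw_op": 2.0}]
print(structural_cost(g1, g2))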
A.4 Some Illustrations of the OTMANN Distance
We illustrate that
OTMANN computes reasonable distances on neural network architectures via a two-dimensional t-SNE visualisation [27] of the network architectures based on this distance. Given a distance matrix between m objects, t-SNE embeds them in a d-dimensional space so that objects with small distances are placed close to each other. Figure 4 shows the t-SNE embedding using the OTMANN distance and its normalised version. We have indexed 13 networks a–n in both figures and displayed their architectures in Figure 5. Similar networks are placed close to each other, indicating that OTMANN induces a meaningful topology among neural network architectures.
Next, we show that the distances induced by OTMANN are correlated with validation error performance. In Figure 6 we provide the following scatter plot for networks trained in our experiments on the Indoor, Naval and Slice datasets. Each point in the figure is for a pair of networks: the x-axis is the OTMANN distance between the pair and the y-axis is the difference in the validation error on the dataset. The networks used in each figure give rise to a large number of pairwise points per scatter plot. As the figure indicates, when the distance is small, the difference in performance is close to 0. However, as the distance increases, the points are more scattered. Intuitively, one should expect that while networks that are far apart could perform similarly or differently, similar networks should perform similarly. Hence, OTMANN induces a useful topology in the space of architectures that is smooth with respect to validation performance on real world datasets. This demonstrates that it can be incorporated in a BO framework to optimise a network based on its validation error.
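The embeddings in Figure 4 can be reproduced from any matrix of pairwise OTMANN distances with a standard t-SNE implementation that accepts precomputed distances. The sketch below uses a random symmetric matrix purely as a stand-in for such a distance matrix.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
A = rng.random((30, 30))
D = (A + A.T) / 2.0          # symmetrise the stand-in distance matrix
np.fill_diagonal(D, 0.0)     # zero distance from an architecture to itself

emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=10).fit_transform(D)
print(emb.shape)             # (30, 2): one 2-D point per architecture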
B Implementation of
NASBOT
Here, we describe our BO framework for
NASBOT in full detail.
B.1 The Kernel
As described in the main text, we use a negative exponentiated distance as our kernel. Precisely, we use
    κ(·,·) = α exp(−Σ_i β_i d_i^p(·,·)) + ᾱ exp(−Σ_i β̄_i d̄_i^p̄(·,·)).   (7)
Here, d_i, d̄_i are the OTMANN distance and its normalised counterpart developed in Section 3, computed with different values ν_str,i for ν_str. β_i, β̄_i manage the relative contributions of d_i, d̄_i, while (α, ᾱ) manage the contributions of each kernel in the sum. An ensemble approach of the above form, instead of trying to pick a single best value, ensures that NASBOT accounts for the different topologies induced by the different distances d_i, d̄_i. In the experiments we report, we used four values for {ν_str,i}_i, with p = 1 and p̄ = 2. Our experience suggests that NASBOT was not particularly sensitive to these choices except when we used only very large or only very small values in {ν_str,i}_i.
NASBOT, as described above, has hyper-parameters of its own: α, ᾱ, {(β_i, β̄_i)}_i and the GP noise variance η. While maximising the GP marginal likelihood is a common approach to pick hyper-parameters, this might cause over-fitting when there are many of them. Further, as training large neural networks is typically expensive, we have to contend with few observations for the GP in practical settings. Our solution is to start with a (uniform) prior over these hyper-parameters and sample hyper-parameter values from the posterior under the GP likelihood [40], which we found to be robust. While it is possible to treat ν_str itself as a hyper-parameter of the kernel, this would require us to re-compute all pairwise distances of networks that have already been evaluated each time we change the hyper-parameters. On the other hand, with the above approach, we can compute and store distances for different ν_str,i values whenever a new network is evaluated, and then compute the kernel cheaply for different values of α, ᾱ, {(β_i, β̄_i)}_i.
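A sketch of evaluating the kernel in (7) on a set of architectures is given below, assuming the pairwise OTMANN distance matrices D_i and their normalised counterparts D̄_i have already been computed and stored for each ν_str,i. The hyper-parameter values in the example are arbitrary placeholders rather than values sampled from the posterior.

import numpy as np

def nasbot_kernel(D_list, Dbar_list, alpha, alpha_bar, beta, beta_bar, p=1, p_bar=2):
    """D_list, Dbar_list: lists of (n x n) distance matrices, one per nu_str value."""
    unnorm = sum(b * (D ** p) for b, D in zip(beta, D_list))
    norm = sum(b * (D ** p_bar) for b, D in zip(beta_bar, Dbar_list))
    return alpha * np.exp(-unnorm) + alpha_bar * np.exp(-norm)

# Toy symmetric distance matrices for 4 architectures and two nu_str values.
rng = np.random.default_rng(1)
def toy_dist(n=4):
    A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0); return A

K = nasbot_kernel([toy_dist(), toy_dist()], [toy_dist(), toy_dist()],
                  alpha=1.0, alpha_bar=1.0, beta=[1.0, 0.5], beta_bar=[1.0, 0.5])
print(K)   # 4 x 4 kernel matrix with alpha + alpha_bar on the diagonal (distance zero)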
Figure 4: Two dimensional t-SNE embeddings of randomly generated CNN architectures based on the OTMANN distance (top) and its normalised version (bottom). Some networks have been indexed a–n in the figures; these network architectures are illustrated in Figure 5. Networks that are similar are embedded close to each other, indicating that OTMANN induces a meaningful topology among neural network architectures.
Figure 5: Illustrations of the networks indexed a–n in Figure 4.
Figure 6: Each point in the scatter plot indicates the log distance between two architectures (x axis) and the difference in the validation error (y axis), on the Indoor, Naval and Slice datasets. The networks trained in our experiments give rise to a large number of pairwise points. On all datasets, when the distance is small, so is the difference in the validation error. As the distance increases, there is more variance in the validation error difference. Intuitively, one should expect that while networks that are far apart could perform similarly or differently, networks with small distance should perform similarly.
B.2 Optimising the Acquisition
We use an evolutionary algorithm (EA) approach to optimise the acquisition function (2). We begin with an initial pool of networks and evaluate the acquisition ϕ_t on those networks. Then we generate a set of N_mut mutations of this pool as follows. First, we stochastically select N_mut candidates from the set of networks already evaluated, such that those with higher ϕ_t values are more likely to be selected than those with lower values. Then we apply a mutation operator to each candidate to produce a modified architecture. Finally, we evaluate the acquisition on these N_mut mutations, add them to the initial pool, and repeat for the prescribed number of steps.
Mutation Operator:
To describe the mutation operator, we will first define a library of modifications to a neural network. These modifications, described in Table 6, might change the architecture either by increasing or decreasing the number of computational units in a layer, by adding or deleting layers, or by changing the connectivity of existing layers. They provide a simple mechanism to explore the space of architectures that are close to a given architecture. The one-step mutation operator takes a given network and applies one of the modifications in Table 6, picked at random, to produce a new network. The k-step mutation operator takes a given network and repeatedly applies the one-step operator k times – the new network will have undergone k changes from the original one. One can also define a compound operator, which picks the number of steps probabilistically. In our implementation of NASBOT, we used such a compound operator which chooses between one and five steps with fixed probabilities. Typical implementations of EA in Euclidean spaces define the mutation operator via a Gaussian (or other) perturbation of a chosen candidate. It is instructive to think of the probabilities for each step in our scheme above as being analogous to the width of the Gaussian chosen for perturbation.
Sampling strategy:
The sampling strategy for EA is as follows. Let {z_i}_i, where z_i ∈ X, be the points evaluated so far. We sample N_mut new points from a distribution π where π(z_i) ∝ exp(g(z_i)/σ). Here g is the function to be optimised (for NASBOT, ϕ_t at time t) and σ is the standard deviation of all previous evaluations. As the probability for large g values is higher, they are more likely to get selected; σ provides normalisation to account for different ranges of function values.
dec_single: Pick a layer at random and decrease the number of units by a fixed fraction.
dec_en_masse: First topologically order the layers, randomly pick 1/8 of the layers (in order) and decrease the number of units in each by a fixed fraction. For networks with eight layers or fewer, pick a larger fraction of the layers (instead of 1/8), and for those with four layers or fewer, a larger fraction still.
inc_single: Pick a layer at random and increase the number of units by a fixed fraction.
inc_en_masse: Choose a large subset of layers, as for dec_en_masse, and increase the number of units by a fixed fraction.
dup_path: Duplicates a random path in the network. Randomly pick a node u_1 and then pick one of its children u_2 at random. Keep repeating to generate a path u_1, u_2, ..., u_{k−1}, u_k until you decide to stop randomly. Create duplicate layers ũ_2, ..., ũ_{k−1} where ũ_i = u_i for i = 2, ..., k−1. Add these layers along with new edges (u_1, ũ_2), (ũ_{k−1}, u_k), and (ũ_j, ũ_{j+1}) for j = 2, ..., k−2.
remove_layer: Picks a layer at random and removes it. If this layer was the only child (parent) of any of its parents (children) u, then adds an edge from u (one of its parents) to one of its children (u).
skip: Randomly picks layers u, v where u is topologically before v and (u, v) ∉ E. Adds (u, v) to E.
swap_label: Randomly pick a layer and change its label.
wedge_layer: Randomly pick an edge (u, v) ∈ E. Create a new layer w with a random label ℓℓ(w). Remove (u, v) from E and add (u, w), (w, v). If applicable, set the number of units ℓu(w) to be (ℓu(u) + ℓu(v))/2.
Table 6: Descriptions of modifiers used to transform one network to another. The first four change the number of units in the layers but do not change the architecture, while the last five change the architecture.
Since our candidate selection scheme at each step favours networks that have high acquisition value, our EA scheme is more likely to search in regions that are known to have high acquisition. The stochasticity in this selection scheme, and the fact that we could take multiple steps in the mutation operation, ensure that we still sufficiently explore the space. Since an evaluation of ϕ_t is cheap, we can use many EA steps to explore several architectures and optimise ϕ_t.
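A condensed sketch of this EA scheme is given below. The mutation operator and the acquisition are stand-ins (a Gaussian perturbation of a real number and a simple quadratic); the point is only to illustrate the exp(g(z)/σ)-weighted selection and the mutate–evaluate–extend loop.

import numpy as np

rng = np.random.default_rng(0)
mutate = lambda x: x + rng.normal(scale=0.3)      # stand-in for a Table 6 modifier
phi = lambda x: -(x - 2.0) ** 2                   # stand-in acquisition function

def evolutionary_maximise(init_pool, n_rounds=20, n_mut=10):
    pool = [(x, phi(x)) for x in init_pool]
    for _ in range(n_rounds):
        vals = np.array([v for _, v in pool])
        sigma = vals.std() + 1e-8                          # normalise the score range
        probs = np.exp((vals - vals.max()) / sigma)        # favour high phi values
        probs /= probs.sum()
        idx = rng.choice(len(pool), size=n_mut, p=probs)   # stochastic selection
        children = [mutate(pool[i][0]) for i in idx]
        pool.extend((c, phi(c)) for c in children)
    return max(pool, key=lambda t: t[1])[0]

print(evolutionary_maximise([0.0, 1.0, 5.0]))              # should approach 2.0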
Other details: The EA procedure is also initialised with the same initial pools shown in Figures 20 and 21. In our NASBOT implementation, we increase the total number of EA evaluations n_EA at rate O(√t), where t is the current time step in NASBOT. We set N_mut to be O(√n_EA). Hence, initially we are only considering a small neighborhood around the initial pool, but as we proceed along BO, we expand to a larger region, and also spend more effort to optimise ϕ_t.
Considerations when performing modifications:
The modifications in Table 6 are straightforward for MLPs. But in CNNs one needs to ensure that the image sizes are the same when concatenating them as inputs to a layer. This is because strides can shrink the size of the image. When we perform a modification, we check if this condition is violated and, if so, disallow that modification. When a skip modifier attempts to add a connection from a layer with a large image size to one with a smaller one, we add avg-pool layers at stride 2 so that the connection can be made (this can be seen, e.g., in the second network in Fig. 8).
B.3 Other Implementation Details
Initialisation:
We initialise
NASBOT (and other methods) with an initial pool of networks. Thesenetworks are illustrated in Fig. 20 for CNNs and Fig. 21 for MLPs at the end of the document. Theseare the same networks used to initialise the EA procedure to optimise the acquisition. All initialnetworks have feed forward structure. For the CNNs, the first networks have structure similar to theVGG nets [37] and the remaining have blocked feed forward structures as in He et al. [12]. We alsouse blocked structures for the MLPs with the layer labels decided arbitrarily. Domain:
For
NASBOT, and other methods, we impose the following constraints on the search space. If the EA modifier (see Table 6) generates a network that violates these constraints, we simply skip it.
• Maximum number of layers
• Maximum mass
• Maximum in/out degree
• Maximum number of edges
• Maximum number of units per layer
• Minimum number of units per layer
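Such constraints amount to a simple feasibility check on each candidate network, sketched below; the numeric bounds in the sketch are placeholders, not the limits used in our experiments.

def satisfies_domain_constraints(layers, edges, unit_counts, total_mass,
                                 max_layers=100, max_mass=1e8, max_degree=5,
                                 max_edges=200, max_units=2048, min_units=8):
    """layers: list of layer ids; edges: list of (parent, child) pairs;
    unit_counts: number of units in each processing layer."""
    in_deg = {u: 0 for u in layers}
    out_deg = {u: 0 for u in layers}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return (len(layers) <= max_layers
            and total_mass <= max_mass
            and len(edges) <= max_edges
            and all(d <= max_degree for d in in_deg.values())
            and all(d <= max_degree for d in out_deg.values())
            and all(min_units <= n <= max_units for n in unit_counts))

# Example: a tiny three-layer chain.
print(satisfies_domain_constraints(["ip", "fc1", "op"], [("ip", "fc1"), ("fc1", "op")],
                                   unit_counts=[64], total_mass=64))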
Layer types: We use the layer types detailed in Appendix A.3 for both CNNs and MLPs. For CNNs, all pooling operations are done at stride 2. For convolutional layers, we use either stride 1 or 2 (specified in the illustrations). For all layers in a CNN we use relu activations.
Parallel BO:
We use a parallelised experimental set-up where multiple models can be evaluated in parallel. We handle parallel BO via the hallucination technique in Ginsbourger et al. [10].
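Schematically, the hallucination approach augments the observed data with the GP posterior mean at the pending evaluation points before selecting the next point. The sketch below assumes a GP object with fit and predict methods; these names are placeholders rather than a specific library interface.

def next_point_with_hallucinations(gp, X_obs, y_obs, X_pending, candidates, acquisition):
    # Pretend each pending evaluation returned its posterior mean, refit the GP,
    # and then maximise the acquisition over the candidate set.
    X_fake, y_fake = list(X_obs), list(y_obs)
    for x in X_pending:
        mu, _ = gp.predict(x)
        X_fake.append(x)
        y_fake.append(mu)
    gp.fit(X_fake, y_fake)
    return max(candidates, key=lambda x: acquisition(gp, x))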
Finally, we emphasise that many of the above choices were made arbitrarily, and we were able to get NASBOT working efficiently with our first choice for these parameters/specifications. Note that many end-to-end systems require specification of such choices.
C Addendum to Experiments
C.1 Baselines
RAND : Our
RAND implementation operates in exactly the same way as
NASBOT, except that the EA procedure (Sec. B.2) is fed a random sample from Unif(0, 1) instead of the GP acquisition each time it evaluates an architecture. That is, we follow the same schedule for n_EA and N_mut as we did for NASBOT. Hence
RAND has the opportunity to explore the same space as
NASBOT, but picks the next evaluation randomly from this space.
EA: This is as described in Appendix B, except that we fix N_mut = 10 at all times. In our experiments, where we used a budget based on time, it was difficult to predict the total number of evaluations so as to set N_mut in perhaps a more intelligent way.
TreeBO: As the implementation from Jenatton et al. [15] was not made available, we wrote our own. It differs from the version described in the paper in a few ways. We do not tune for a regularisation penalty and step size as they do, to keep it in line with the rest of our experimental set up. We set the maximum depth of the network to match the maximum number of layers we allowed for the other methods. We also check for the other constraints given in Appendix B before evaluating a network. The original paper uses a tree structured kernel which can allow for efficient inference with a large number of samples. For simplicity, we construct the entire kernel matrix and perform standard GP inference. The result of the inference is the same, and the number of GP samples remained small in our experiments, so a sophisticated procedure was not necessary.
C.2 Details on Training
In all methods, for each proposed network architecture, we trained the network on the train data set and periodically evaluated its performance on the validation data set. For MLP experiments, we optimised network parameters using stochastic gradient descent with a fixed step size and a batch size of 256 for 20,000 iterations. We computed the validation set MSE every 100 iterations; from this we returned the minimum MSE that was achieved. For CNN experiments, we optimised network parameters using stochastic gradient descent with a batch size of 32. We started with an initial learning rate that we reduced gradually. We also used batch normalisation and trained the model for a fixed number of batch iterations, computing the validation set classification error periodically; from this we returned the minimum classification error that was achieved.
After each method returned an optimal neural network architecture, we again trained each optimal network architecture on the train data set, periodically evaluated its performance on the validation data set, and finally computed the MSE or classification error on the test data set. For MLP experiments, we used the same optimisation procedure as above; we then computed the test set MSE at the iteration where the network achieved the minimum validation set MSE. For CNN experiments, we used the same optimisation procedure as above, except here the optimal network architecture was trained for 120,000 iterations; we then computed the test set classification error at the iteration where the network achieved the minimum validation set classification error.
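The per-architecture training protocol can be summarised by the following sketch, in which train_step and validation_error are placeholder hooks for whatever framework is used to train the network.

def train_and_select(model, n_iters, eval_every, train_step, validation_error):
    """Train for n_iters steps, tracking the best validation error and when it occurred."""
    best_err, best_iter = float("inf"), 0
    for it in range(1, n_iters + 1):
        train_step(model)                      # one SGD step on a training batch
        if it % eval_every == 0:
            err = validation_error(model)
            if err < best_err:
                best_err, best_iter = err, it
    return best_err, best_iter

# Toy usage with dummy hooks.
state = {"w": 0.0}
print(train_and_select(state, n_iters=500, eval_every=100,
                       train_step=lambda m: m.update(w=m["w"] + 0.01),
                       validation_error=lambda m: abs(m["w"] - 3.0)))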
C.3 Optimal Network Architectures and Initial Pool
Here we illustrate and compare the optimal neural network architectures found by different methods. In Figures 8-11, we show some optimal network architectures found on the Cifar10 data by
NASBOT , EA , RAND , and
TreeBO , respectively. We also show some optimal network architectures found forthese four methods on the Indoor data, in Figures 12-15, and on the Slice data, in Figures 16-19. Acommon feature among all optimal architectures found by
NASBOT was the presence of long skipconnections and multiple decision layers.In Figure 21, we show the initial pool of MLP network architectures, and in Figure 20, we show theinitial pool of CNN network architectures. On the Cifar10 dataset, VGG-19 was one of the networksin the initial pool. While all methods beat VGG-19 when trained for 24K iterations (the numberof iterations we used when picking the model),
TreeBO and
RAND lose to VGG-19 (see Section 5for details). This could be because the performance after shorter training periods may not exactlycorrelate with performance after longer training periods.
C.4 Ablation Studies and Design Choices
We conduct experiments comparing the various design choices in
NASBOT . Due to computationalconstraints, we carry them out on synthetic functions.In Figure 7a, we compare
NASBOT using only the normalised distance, only the unnormalised distance, and the combined kernel as in (7). While the individual distances perform well, the combined form outperforms both.
Next, we modify our EA procedure to optimise the acquisition. We execute NASBOT using only the EA modifiers which change the computational units (first four modifiers in Table 6), then using the modifiers which only change the structure of the networks (bottom five in Table 6), and finally using all nine modifiers, as used in all our experiments. The combined version outperforms the first two.
Finally, we experiment with different choices for p and p̄ in (7). As the figures indicate, the performance was not particularly sensitive to these choices.
Below we describe the three synthetic functions f1, f2, f3 used in our synthetic experiments; f1 applies to CNNs while f2, f3 apply to MLPs. Here am denotes the average mass per layer, deg_i is the average in-degree of the layers, deg_o is the average out-degree, δ is the shortest distance from u_ip to u_op, str is the average stride in CNNs, frac_conv3 is the fraction of layers that are conv3, and frac_sigmoid is the fraction of layers that are sigmoidal. Each function is built from terms of the form exp(−c |q − q0|) with fixed constants c, q0: a common base combines such terms in am, deg_i, deg_o, δ, |L| and |E|; f1 adds CNN-specific terms in str, |L| and am, plus frac_conv3; f2 and f3 add MLP-specific terms in am and |E|, plus frac_sigmoid.
D Additional Discussion on Related Work
Historically, evolutionary (genetic) algorithms (EA) have been the most common method used for designing architectures [8, 21, 26, 30, 36, 41, 51]. EA techniques are popular as they provide a simple mechanism to explore the space of architectures by making a sequence of changes to networks that have already been evaluated. However, as we will discuss later, EA algorithms, while conceptually and computationally simple, are typically not best suited for optimising functions that are expensive to evaluate. A related line of work first sets up a search space for architectures via incremental modifications, and then explores this space via random exploration, MCTS, or A* search [6, 25, 32].
Figure 7:
We compare
NASBOT for different design choices in our framework. (a): Comparison of
NASBOT using only the normalised distance e^(−β̄ d̄), only the unnormalised distance e^(−β d), and the combination e^(−β d) + e^(−β̄ d̄). (b): Comparison of NASBOT using only the EA modifiers which change the computational units (top 4 in Table 6), modifiers which only change the structure of the networks (bottom 5 in Table 6), and all 9 modifiers. (c): Comparison of NASBOT with different choices for p and p̄. In all figures, the x axis is the number of evaluations and the y axis is the negative maximum value (lower is better). All figures were produced by averaging over at least 10 runs.
Some of the methods above can only optimise among feed forward structures, e.g. Fig. 1a, but cannot handle spaces with arbitrarily structured networks, e.g. Figs. 1b, 1c. The most successful recent architecture search methods that can handle arbitrary structures have adopted reinforcement learning (RL) [1, 52–54]. However, architecture search is in essence an optimisation problem – find the network with the highest function value. There is no explicit need to maintain a notion of state and solve the credit assignment problem in RL [43]. Since RL is fundamentally more difficult than optimisation [16], these methods typically need to try a very large number of architectures to find the optimum. This is not desirable, especially in computationally constrained settings.
Figure 8:
Optimal network architectures found with
NASBOT on Cifar10 data.
Figure 9:
Optimal network architectures found with EA on Cifar10 data.
Figure 10:
Optimal network architectures found with
RAND on Cifar10 data.
Figure 11:
Optimal network architectures found with
TreeBO on Cifar10 data. Figure 12:
Optimal network architectures found with
NASBOT on Indoor data.
Figure 13:
Optimal network architectures found with EA on Indoor data. Figure 14:
Optimal network architectures found with
RAND on Indoor data.
Figure 15:
Optimal network architectures found with
TreeBO on Indoor data. Figure 16:
Optimal network architectures found with
NASBOT on Slice data.
Figure 17:
Optimal network architectures found with EA on Slice data. Figure 18:
Optimal network architectures found with
RAND on Slice data.
Figure 19:
Optimal network architectures found with
TreeBO on Slice data.
Figure 20:
Initial pool of CNN network architectures. The first networks have structure similar to the VGG nets [37] and the remaining have blocked feed forward structures as in He et al. [12].
Figure 21: Initial pool of MLP network architectures.