Contrastive Embeddings for Neural Architectures
arXiv preprint, February 2021
Daniel Hesslow & Iacopo Poli
LightOn
{firstname}@lighton.ai

Abstract
The performance of algorithms for neural architecture search strongly depends on the parametrization of the search space. We use contrastive learning to identify networks across different initializations based on their data Jacobians, and automatically produce the first architecture embeddings independent from the parametrization of the search space. Using our contrastive embeddings, we show that traditional black-box optimization algorithms, without modification, can reach state-of-the-art performance in Neural Architecture Search. As our method provides a unified embedding space, we perform for the first time transfer learning between search spaces. Finally, we show the evolution of embeddings during training, motivating future studies into using embeddings at different training stages to gain a deeper understanding of the networks in a search space.
1 Introduction
Traditionally, the design of state-of-the-art neural network architectures is informed by domain knowledge and requires a large amount of manual work to find the best hyperparameters. However, automated architecture search methods have recently achieved state-of-the-art results on tasks such as image classification, object detection, semantic segmentation and speech recognition, or even data augmentation and platform-aware optimization (Ren et al., 2020).

Neural Architecture Search (NAS) was introduced by Zoph & Le (2016), using reinforcement learning. Since then, different search spaces have been developed (Zoph et al., 2018), and there now exists a set of search spaces that is commonly used to evaluate NAS algorithms (Dong & Yang, 2020; Siems et al., 2020; Klyuchnikov et al., 2020). Different search algorithms have also been conceived: DARTS (Liu et al., 2018) relaxes the search space to be continuous to perform architecture optimization by gradient descent, and aging evolution (Real et al., 2019) is a genetic algorithm designed for NAS.

Recently, Mellor et al. (2020) showed that statistics computed on architectures at initialization, before training, can be used to infer which will perform better after training. In particular, they find a heuristic based on samples of the data Jacobians of networks at initialization. Additionally, Mellor et al. (2020) is, to our knowledge, the first NAS method other than random search that is invariant to the parametrization of the search space.

In parallel, contrastive learning has gathered interest in the computer vision community and produced various state-of-the-art results (He et al., 2020; Chen et al., 2020; Caron et al., 2020; Grill et al., 2020). In contrastive learning, the model learns an informative embedding space through a self-supervised pre-training phase: from the images in a batch, pairs are generated through random transformations, and the model is trained to generate similar (dissimilar) embeddings for similar (dissimilar) images.

In this work, we combine contrastive learning with the idea from Mellor et al. (2020) of using the Jacobians of networks at initialization, in order to find an embedding space suitable for Neural Architecture Search. We frame our work in the context of the theory presented by Wang et al. (2016). The embedding space that we generate is invariant to the search space of origin, allowing us to accomplish transfer learning between different search spaces.

1.1 Motivations and Contributions
We design a method to produce architecture embeddings using contrastive learning and information available from their initial state. Our technique is capable of generating embeddings independent from the parametrization of the search space, that evolve during training. We leverage these contrastive embeddings in Neural Architecture Search using traditional black-box optimization algorithms. Moreover, since they provide a unified embedding space across search spaces, we exploit them to perform transfer learning between search spaces.
Parametrization-independent embeddings
NAS methods promise to outperform random search; however, the encoding of the architectures must show some structure for the search algorithm to exploit. These embeddings are typically produced by condensing all the parameters used to generate an architecture into a single vector. The method used to generate architectures from a search space thereby implicitly parametrizes it.

The parametrization of the search space affects the performance of a NAS algorithm, as noted by Wang et al. (2019). However, when performing architecture search, it is not feasible to test multiple different parametrizations of the search space and evaluate which one performs better: once we have started to evaluate networks in a search space, there is no reason to discard previous evaluations. While the current generation of NAS alleviates the need for experts in the design of architectures, expert knowledge is now needed to build and parametrize a search space compatible with the chosen search algorithm, implying that it is exceedingly difficult to outperform a simple random search.

We present in Sec. 4 the first method to create network embeddings without relying on their parametrization in any search space, through the combination of modern contrastive learning and the theory of data Jacobian matrices for neural architecture search.
Evolution of the embeddings during training
In Section 4.4, we show how the embeddings vary during training, noting that the training procedure tends to connect areas with similar final test accuracy together. We hypothesize that this information could enable even more efficient architecture search methods in the future.
Leveraging traditional black-box algorithms
Existing methods to generate architecture embeddings rely on metrics from their computational graphs to identify similar architectures, either by explicitly trying to preserve the edit distance in the embedding space, or by leveraging more sophisticated methods from graph representation learning. Our method leverages the information contained in the Data Jacobian Matrix of networks at initialization to train a contrastive model. As such, it can produce embeddings that more meaningfully capture the structure of the search space. As a result, traditional black-box algorithms perform well for architecture search, as shown for NAS-Bench-201 (Dong & Yang, 2020) in Section 5.1.
Transfer learning between search spaces
Our method provides a unified embedding space, since it does not depend on the parametrization of networks in any search space. We exploit this feature to perform transfer learning between search spaces for the first time. In practice, we perform it between the size and the topology spaces of NATS-Bench (Dong et al., 2020) in Section 5.2.
2 Related Work
Neural Architecture Search
Previous works have attempted to improve network embeddings: Klyuchnikov et al. (2020) use graph2vec (Narayanan et al., 2017) to find embeddings such that networks with the same computational graph share the same embeddings, and similarly Yan et al. (2020) produce embeddings that are invariant to graph isomorphisms, although their method differs in that it trains a variational autoencoder to produce the embeddings. Wei et al. (2020) use a contrastive loss to find a low-dimensional metric space where the graph edit distances of the original parametrization are approximately preserved.

In the absence of dense sampling, all of these works rely on the prior that the edit distance is a good predictor for relative performance. In contrast, our method learns to find an embedding space based on intrinsic properties of the architectures. It can therefore discover properties of the architectures which are not present in their graph representation.
Data Jacobian
Methods based on the Jacobians with respect to the input of trained networks have been shown to provide valuable information for knowledge transfer and distillation (Czarnecki et al., 2017; Srinivas & Fleuret, 2018), as well as analysis and regularization of networks (Wang et al., 2016).
Neural Tangent Kernel
The Jacobian of the network with respect to the parameters is computed for inference with neural tangent kernels (NTK) (Jacot et al., 2018). Using the NTK as a proxy for NAS (Park et al., 2020) underperforms the neural network Gaussian process (NNGP) kernel. The NNGP provides an inexpensive signal for predicting if an architecture exceeds median performance, but it is worse than training for a few epochs in predicting the order of the top-performing architectures.
Contrastive learning
Different techniques have been developed in contrastive learning. He et al. (2020) train a network with a contrastive loss against a memory bank of negative samples produced by a slowly moving average version of itself. Chen et al. (2020) remove the memory bank and only consider negative samples from within the same minibatch. Grill et al. (2020) remove the negative samples completely, but stabilize the training by encoding the positive samples using a momentum encoder.
3 Background
3.1 Traditional Architecture Embeddings
A decision tree is created either implicitly or explicitly to sample networks from a search space. To encode an architecture, one records all choices made while the decision tree is traversed into a vector, which is then used as the embedding of the architecture. Without any additional knowledge, a NAS algorithm will assume that all choices in the decision tree have an equal importance on the characteristics of an architecture.
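To make this concrete, the sketch below uses a hypothetical cell space with five operations and six edges (loosely modeled on a NAS-Bench-201-style cell, as an assumption for illustration): each sampled choice is recorded as a one-hot block and the blocks are concatenated into the traditional embedding vector.

```python
# Minimal sketch (hypothetical search space): a traditional embedding simply
# records every sampled choice into a fixed-length vector, so distances between
# embeddings reflect the parametrization rather than the architectures themselves.
import random

OPERATIONS = ["none", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]
NUM_EDGES = 6  # number of decisions made while traversing the decision tree

def sample_architecture():
    """Traverse the (implicit) decision tree: pick one operation per edge."""
    return [random.randrange(len(OPERATIONS)) for _ in range(NUM_EDGES)]

def traditional_embedding(choices):
    """Concatenate one-hot encodings of every decision into a single vector."""
    embedding = []
    for op_index in choices:
        one_hot = [0.0] * len(OPERATIONS)
        one_hot[op_index] = 1.0
        embedding.extend(one_hot)
    return embedding

arch = sample_architecture()
print(arch, traditional_embedding(arch))
```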
3.2 Data Jacobian
Extended Data Jacobian matrices are used by Wang et al. (2016) to analyze trained neural networks. We ground our work in their theoretical setting, and introduce the relevant concepts below.

Multi-layer perceptrons with ReLU activations are locally equivalent to a linear model: the ReLU after a linear layer can be combined into a single linear layer, where each row in the matrix is replaced by zeros if the corresponding output pre-activation is negative:

$$\mathrm{ReLU}(Wx) = \hat{W}x, \qquad \hat{W}_{ij} = \begin{cases} W_{ij} & \text{if } (Wx)_i \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

Since matrix multiplication is closed, within a neighborhood where the signs of all neuron pre-activations are constant, the full network can be replaced by a single matrix. This property can be extended to any model whose layers can be rewritten as linear layers, including convolutional layers and average pooling layers. Max pooling layers also retain this property, and can be treated similarly to ReLU.

Therefore, in a local neighbourhood close to $x$, the full information of a network $f$ is contained within its Data Jacobian Matrix (DJM).
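A minimal sketch of this identity, assuming a single linear layer followed by a ReLU: masking the rows of $W$ whose pre-activation is negative reproduces the non-linear output exactly.

```python
# Minimal sketch: within a neighbourhood where the pre-activation signs are
# fixed, ReLU(W x) equals a masked linear map \hat{W} x.
import torch

torch.manual_seed(0)
W = torch.randn(5, 3)
x = torch.randn(3)

pre_activation = W @ x
# Zero out the rows of W whose output pre-activation is negative.
W_hat = torch.where(pre_activation.unsqueeze(1) >= 0, W, torch.zeros_like(W))

assert torch.allclose(torch.relu(W @ x), W_hat @ x)
print(W_hat)
```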
$$\mathrm{DJM}(x) = \frac{\partial f(x)}{\partial x}, \qquad \text{and within that neighbourhood } f(x) = \mathrm{DJM}(x)\,x.$$

We can evaluate the Data Jacobian Matrix at many different points $x$ to gather information about multiple different neighbourhoods. If we assume the network to have a single output, its DJM is a vector, and we can then stack the DJMs at different points to form the Extended Data Jacobian Matrix (EDJM). If a network has multiple outputs, we can sum them to get a single output, which we use to calculate the EDJM.

Wang et al. (2016) use the singular values of the EDJM to compute a score, and empirically show that the score is correlated with the depth of the network and its model capacity.
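The following sketch illustrates how an EDJM could be assembled with automatic differentiation; the toy network, the number of sampled points, and the use of PyTorch are assumptions for illustration only.

```python
# Minimal sketch: compute the Data Jacobian at several input points with
# autograd and stack them into an Extended Data Jacobian Matrix (EDJM).
# Summing over outputs reduces a multi-output network to a single output,
# as described in the text.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

def data_jacobian(net, x):
    """Jacobian of the summed output with respect to a single input point x."""
    x = x.clone().requires_grad_(True)
    y = net(x).sum()          # multiple outputs -> single output
    (djm,) = torch.autograd.grad(y, x)
    return djm                # shape: (input_dim,)

points = torch.randn(8, 16)   # 8 input points, each of dimension 16
edjm = torch.stack([data_jacobian(net, p) for p in points])  # shape (8, 16)

# Wang et al. (2016) score networks from the singular values of the EDJM.
singular_values = torch.linalg.svdvals(edjm)
print(edjm.shape, singular_values)
```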
3.3 Contrastive Learning
Contrastive learning is a self-supervised method that finds an informative embedding space of the input data, useful for downstream tasks. Central to contrastive learning is the concept of a view of an object: two different views of the same object are only superficially different, and we should be able to train a network to see past these differences and identify the same underlying object. To this end, a network is trained to map different views of the same object close to each other in the embedding space and, conversely, views of different objects should be far apart from each other, as shown in Figure 1.
Figure 1: In contrastive learning a network learns to produce similar embeddings for different viewsof the same picture, while producing dissimilar embeddings for dissimilar pictures.
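As a concrete reference, below is a minimal sketch of a SimCLR-style (NT-Xent) contrastive loss implementing this push/pull behaviour. The batch size and embedding dimension are placeholders, and this is a generic formulation rather than the exact loss code used in this work.

```python
# Minimal sketch of the contrastive (NT-Xent / SimCLR-style) objective:
# embeddings of two views of the same object are pulled together, all other
# pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, dim)
    similarity = z @ z.t() / temperature                      # (2B, 2B)
    similarity.fill_diagonal_(float("-inf"))                  # ignore self-pairs
    batch_size = z1.shape[0]
    # The positive of sample i is its other view, located batch_size away.
    targets = torch.arange(2 * batch_size, device=z.device)
    targets = (targets + batch_size) % (2 * batch_size)
    return F.cross_entropy(similarity, targets)

z1, z2 = torch.randn(4, 8), torch.randn(4, 8)
print(nt_xent(z1, z2))
```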
4 Contrastive Embeddings for Neural Architectures
We leverage intrinsic properties of the networks to encode them without depending on their parametrization. We must rely on properties of the architectures at initialization, since it is not computationally feasible to train the architectures to obtain their embeddings. At variance with previous work, we develop a method to find such properties automatically, using contrastive learning.

To this effect, we train a network that takes an architecture at initialization as input and produces an embedding at the output. It is desirable that the network has the following two properties:

• Different initializations of the same architecture should yield similar embeddings.
• Different architectures should yield different embeddings.

We can therefore frame our embedding problem as a contrastive learning task: different initializations of the same network will correspond to different views of the same sample in the contrastive learning framework.
4.1 Our Method
Following Mellor et al. (2020), we compute the Extended Data Jacobian Matrix (EDJM) of architectures at initialization, and we use a low-rank projection of it as input to our contrastive network, to limit memory requirements. We will refer to the projected version of the EDJM as the Extended Projected Data Jacobian Matrix (EPDJM):

$$\mathrm{EPDJM}(X)_i = \phi_X\left(\frac{\partial \lVert f(X_i) \rVert}{\partial X_i}\right), \tag{1}$$

where $\phi_X$ denotes a projection onto the top-$k$ principal components,

$$X = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} V^T, \qquad \phi_X(x) = U_1 \Sigma_1 x. \tag{2}$$

The contrastive network is then applied to the EPDJM, and trained using SimCLR (Chen et al., 2020). Once the contrastive network is trained, we can obtain the embeddings of any architecture in the search space, as shown in Figure 2. The embeddings can then be used with any black-box optimization algorithm; we use Sequential Model-Based Bayesian Optimization with Gaussian Processes, with Expected Improvement as the acquisition function.

Figure 2: Illustration of our method for obtaining the network embeddings. We sample architectures from the search space, and form a batch of views with different random initializations. We compute the data Jacobians, project them, and feed them to the contrastive network. The contrastive model learns to generate similar embeddings for networks with similar performance.
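A sketch of how the EPDJM of Eqs. (1)-(2) could be computed is given below. The projection is implemented here via the top right singular vectors of the stacked Jacobians (equivalently, the leading block of the SVD); the toy network, batch size, and choice of k are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of Eq. (1)-(2): compute the data Jacobians of the norm of the
# network output and project them onto the top-k principal components to form
# the EPDJM.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
X = torch.randn(32, 64)   # a batch of input points

def data_jacobians(net, X):
    X = X.clone().requires_grad_(True)
    out = net(X).norm(dim=1).sum()    # ||f(X_i)|| summed over the batch
    (jac,) = torch.autograd.grad(out, X)
    return jac                        # (batch, input_dim), one Jacobian per point

def project_top_k(jacobians, k=16):
    """Project each Jacobian onto the top-k principal components (via SVD)."""
    U, S, Vh = torch.linalg.svd(jacobians, full_matrices=False)
    return jacobians @ Vh[:k].t()     # (batch, k)

epdjm = project_top_k(data_jacobians(net, X), k=16)
print(epdjm.shape)                     # torch.Size([32, 16])
```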
4.2 Implementation

Since the input of the contrastive network is a set of Jacobians at different data points, it is desirable that the architecture of the contrastive network is invariant under permutation of the data points. To this end, we use the simple architecture from Zaheer et al. (2017), which encodes each sampled Jacobian matrix with a shared multilayer perceptron (MLP), aggregates them by taking their mean, and finally produces its output with another MLP.

For the contrastive learning, we use SimCLR with a batch size of 512 and a temperature of 0.1. We project the Jacobians down to a 256-dimensional space. To accelerate the contrastive learning, we precompute the projected Jacobians using four different initializations for each architecture. The computation of the projected Jacobians takes less than 2 hours on a single GPU (NVIDIA RTX 2080 Ti). The embedding size is set to 256, and we use a single-layer feedforward network for the projection head.

We use the implementation of Gaussian Processes from GPy (since 2012) to select new architectures for evaluation, by randomly sampling 20 architectures and choosing the one with the highest expected improvement. We open source our code, including all hyperparameters, at https://github.com/lightonai/contrastive-embeddings-for-neural-architectures.
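A permutation-invariant encoder in this spirit might look like the sketch below; the hidden sizes and number of layers are assumptions, and only the overall shared-MLP, mean-pooling, and readout-MLP structure follows the description above.

```python
# Minimal sketch of a permutation-invariant contrastive network in the spirit
# of Zaheer et al. (2017): each projected Jacobian is encoded by a shared MLP,
# the encodings are averaged, and a second MLP produces the embedding.
import torch
import torch.nn as nn

class ContrastiveEncoder(nn.Module):
    def __init__(self, jacobian_dim=256, hidden_dim=512, embedding_dim=256):
        super().__init__()
        self.per_point = nn.Sequential(            # shared MLP, applied per data point
            nn.Linear(jacobian_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.readout = nn.Sequential(              # MLP applied after mean pooling
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
        )

    def forward(self, epdjm):
        # epdjm: (batch, num_points, jacobian_dim); averaging over the points
        # makes the output invariant to their order.
        return self.readout(self.per_point(epdjm).mean(dim=1))

encoder = ContrastiveEncoder()
embeddings = encoder(torch.randn(8, 32, 256))   # 8 architectures, 32 points each
print(embeddings.shape)                          # torch.Size([8, 256])
```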
4.3 Analysis of the Embeddings

We plot the t-SNE projections (Van der Maaten & Hinton, 2008) at different stages of our method in Figure 3 to analyze the influence of the contrastive learning on the embeddings. We note that the EPDJM alone carries some meaning in the t-SNE space. The contrastive embeddings at initialization of the network already exhibit more evident structure. Finally, the contrastive learning phase produces clean embeddings with little noise, that clearly separate architectures based on performance.

Further, we predict the performance of the unseen networks in the search space using LightGBM (Ke et al., 2017) with the default hyperparameters, to analyse the predictive power of the embeddings. The results for NAS-Bench-201 (Dong & Yang, 2020) are shown in Figure 4 and Table 1, and indicate that the contrastive embeddings are highly predictive of the performance of the architectures in this search space.

Table 1: Metrics computed on predicted accuracies for the three benchmarks in NAS-Bench-201. This provides a condensed view of Figure 4.

                    Correlation    Kendall-τ
    CIFAR-10        0.88           0.57
    CIFAR-100       0.86           0.57
    ImageNet16-120

Figure 3: t-SNE projections of different statistics of 1500 architectures in NAS-Bench-201. Panels: (a) EPDJM, (b) untrained embedding of the EPDJMs, (c) trained embedding of the EPDJMs.
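A sketch of this evaluation protocol could look as follows; the arrays standing in for the contrastive embeddings and the final test accuracies are random placeholders, not real data.

```python
# Minimal sketch: fit LightGBM on contrastive embeddings of a subset of
# architectures and measure rank correlation on held-out ones.
import numpy as np
import lightgbm as lgb
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1500, 256))     # contrastive embeddings (placeholder)
accuracies = rng.uniform(10, 90, size=1500)   # final test accuracies (placeholder)

train, test = np.arange(500), np.arange(500, 1500)
model = lgb.LGBMRegressor()                   # default hyperparameters, as in the text
model.fit(embeddings[train], accuracies[train])

predicted = model.predict(embeddings[test])
tau, _ = kendalltau(predicted, accuracies[test])
print(f"Kendall-tau on held-out architectures: {tau:.2f}")
```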
4.4 Embeddings During Optimization
Once the contrastive network is trained, it can produce embeddings for architectures at various points during their training. We show the evolution of the embeddings of 100 networks during the first 50 epochs of training in Figure 5: the embeddings vary during the training procedure, potentially enabling future methods to learn more information about the search space for each evaluated network. In particular, the training procedure tends to connect areas with similar final test accuracy.
Figure 4: Predicted accuracy against actual accuracy on (a) CIFAR-10, (b) CIFAR-100, and (c) ImageNet16-120. The predictions are produced by LightGBM applied on the contrastive embeddings of 500 randomly selected architectures in NAS-Bench-201 (Dong & Yang, 2020).

Figure 5: t-SNE projections of movement in embedding space during training of 50 architectures in the NAS-Bench-201 benchmark. The color of each trajectory represents the final test accuracy of the architecture on the ImageNet16-120 data set, a downsampled version of the traditional ImageNet dataset.
5 Architecture Search
We evaluate our contrastive embeddings on the task of architecture search. We use Sequential Model-Based Optimization (SMBO) with Gaussian Processes (Bergstra et al., 2011) to guide the search: it is a commonly used method, developed in a different context, and its performance in our setting implies that there is structure in our embeddings that can be easily leveraged.

The covariance function for the Gaussian Process is chosen to be the Matern-52 kernel, a stationary kernel that only depends on the Euclidean distance between the architectures in the embedding space. If this can correctly guide our optimization, then the Euclidean distance within the embedding space is a good predictor of relative performance.

Based on the work of Wang et al. (2016), who construct their score by normalizing the EDJM by the principal singular value, we investigate the contrastive embeddings produced by both the normalized and the unnormalized EDJM.
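A sketch of such an SMBO loop is shown below; scikit-learn's Gaussian process is substituted for GPy for brevity, and the embedding array and evaluation function are placeholders for the real search-space data.

```python
# Minimal sketch of SMBO with a Matern-5/2 GP surrogate and Expected
# Improvement over the embedding space.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, candidates, best_so_far):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))            # all architecture embeddings (placeholder)
evaluate = lambda idx: float(rng.uniform(10, 90))    # stand-in for training an architecture

observed_idx = list(rng.choice(len(embeddings), size=5, replace=False))
observed_acc = [evaluate(i) for i in observed_idx]

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(embeddings[observed_idx], observed_acc)
    # Randomly sample 20 unseen candidates and pick the highest EI, as in Sec. 4.2.
    candidates = rng.choice(
        [i for i in range(len(embeddings)) if i not in observed_idx], size=20, replace=False
    )
    ei = expected_improvement(gp, embeddings[candidates], max(observed_acc))
    chosen = int(candidates[np.argmax(ei)])
    observed_idx.append(chosen)
    observed_acc.append(evaluate(chosen))

print(f"best accuracy found: {max(observed_acc):.2f}")
```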
5.1 NAS-Bench-201

We show the results on the NAS-Bench-201 benchmark (Dong & Yang, 2020) in Figure 6. We notice that the normalization by the principal singular value significantly degrades the performance on this benchmark. However, both versions show a clear improvement over random search, and the unnormalized version also significantly outperforms regularized evolution (Real et al., 2019) when the number of evaluated architectures is small.

We remark that these results require significant manual tuning of the hyperparameters of the optimization algorithm. Nevertheless, they demonstrate that the contrastive embeddings do contain

[Figure 6: accuracy of the best architecture found against the number of evaluated architectures, comparing Random Search, Regularized Evolution, and the contrastive embeddings.]