Contrastive Embeddings for Neural Architectures
arXiv preprint, February 2021
Daniel Hesslow & Iacopo Poli
LightOn
{firstname}@lighton.ai

Abstract
The performance of algorithms for neural architecture search strongly depends on the parametrization of the search space. We use contrastive learning to identify networks across different initializations based on their data Jacobians, and automatically produce the first architecture embeddings independent from the parametrization of the search space. Using our contrastive embeddings, we show that traditional black-box optimization algorithms, without modification, can reach state-of-the-art performance in Neural Architecture Search. As our method provides a unified embedding space, we perform for the first time transfer learning between search spaces. Finally, we show the evolution of embeddings during training, motivating future studies into using embeddings at different training stages to gain a deeper understanding of the networks in a search space.
1 Introduction
Traditionally, the design of state-of-the-art neural network architectures is informed by domain knowledge and requires a large amount of manual work to find the best hyperparameters. However, automated architecture search methods have recently achieved state-of-the-art results on tasks such as image classification, object detection, semantic segmentation and speech recognition, or even data augmentation and platform-aware optimization (Ren et al., 2020).

Neural Architecture Search (NAS) was introduced by Zoph & Le (2016), using reinforcement learning. Since then, different search spaces have been developed (Zoph et al., 2018), and there now exists a set of search spaces that is commonly used to evaluate NAS algorithms (Dong & Yang, 2020; Siems et al., 2020; Klyuchnikov et al., 2020). Different search algorithms have also been conceived: DARTS (Liu et al., 2018) relaxes the search space to be continuous to perform architecture optimization by gradient descent, and aging evolution (Real et al., 2019) is a genetic algorithm designed for NAS.

Recently, Mellor et al. (2020) showed that statistics computed on architectures at initialization, before training, can be used to infer which will perform better after training. In particular, they find a heuristic based on samples of the data Jacobians of networks at initialization. Additionally, Mellor et al. (2020) is, to our knowledge, the first NAS method other than random search that is invariant to the parametrization of the search space.

In parallel, contrastive learning has gathered interest in the computer vision community and produced various state-of-the-art results (He et al., 2020; Chen et al., 2020; Caron et al., 2020; Grill et al., 2020). In contrastive learning, the model learns an informative embedding space through a self-supervised pre-training phase: from the images in a batch, pairs are generated through random transformations, and the model is trained to generate similar (dissimilar) embeddings for similar (dissimilar) images.

In this work, we combine contrastive learning with the idea from Mellor et al. (2020) of using the Jacobians of networks at initialization, in order to find an embedding space suitable for Neural Architecture Search. We frame our work in the context of the theory presented by Wang et al. (2016). The embedding space that we generate is invariant to the search space of origin, allowing us to accomplish transfer learning between different search spaces.

1.1 Motivations and Contributions
We design a method to produce architecture embeddings using contrastive learning and information available from their initial state. Our technique is capable of generating embeddings independent from the parametrization of the search space, that evolve during training. We leverage these contrastive embeddings in Neural Architecture Search using traditional black-box optimization algorithms. Moreover, since they provide a unified embedding space across search spaces, we exploit them to perform transfer learning between search spaces.
Parametrization-independent embeddings
NAS methods promise to outperform random search; however, the encoding of the architectures must show some structure for the search algorithm to exploit. These embeddings are typically produced by condensing all the parameters used to generate an architecture into a single vector. The method used to generate architectures from a search space thereby implicitly parametrizes it.

The parametrization of the search space affects the performance of a NAS algorithm, as noted by Wang et al. (2019). However, when performing architecture search, it is not feasible to test multiple different parametrizations of the search space and evaluate which one performs better: once we have started to evaluate networks in a search space, there is no reason to discard previous evaluations. While the current generation of NAS alleviates the need for experts in the design of architectures, expert knowledge is now needed to build and parametrize a search space compatible with the chosen search algorithm, implying that it is exceedingly difficult to outperform a simple random search.

We present in Sec. 4 the first method to create network embeddings without relying on their parametrization in any search space, through the combination of modern contrastive learning and the theory of data Jacobian matrices for neural architecture search.
Evolution of the embeddings during training
In Section 4.4, we show how the embeddings vary during training, noting that the training procedure tends to connect areas with similar final test accuracy together. We hypothesize that this information could enable even more efficient architecture search methods in the future.
Leveraging traditional black-box algorithms
Existing methods to generate architecture embeddings rely on metrics from their computational graphs to identify similar architectures, either by explicitly trying to preserve the edit distance in the embedding space, or by leveraging more sophisticated methods from graph representation learning. Our method leverages the information contained in the Data Jacobian Matrix of networks at initialization to train a contrastive model. As such, it can produce embeddings that more meaningfully capture the structure of the search space. As a result, traditional black-box algorithms perform well for architecture search, as shown for NAS-Bench-201 (Dong & Yang, 2020) in Section 5.1.
Transfer learning between search spaces
Our method provides a unified embedding space, since it does not depend on the parametrization of networks in any search space. We exploit this feature to perform transfer learning between search spaces for the first time. In practice, we perform it between the size and the topology spaces of NATS-Bench (Dong et al., 2020) in Section 5.2.
2 Related Work
Neural Architecture Search
Previous works have attempted to improve network embeddings: Klyuchnikov et al. (2020) use graph2vec (Narayanan et al., 2017) to find embeddings such that networks with the same computational graph share the same embeddings, and similarly Yan et al. (2020) produce embeddings that are invariant to graph isomorphisms, although their method differs in that it trains a variational autoencoder to produce the embeddings. Wei et al. (2020) use a contrastive loss to find a low-dimensional metric space where the graph edit distances of the original parametrization are approximately preserved.

In the absence of dense sampling, all of these works rely on the prior that the edit distance is a good predictor for relative performance. In contrast, our method learns to find an embedding space based on intrinsic properties of the architectures. It can therefore discover properties of the architectures which are not present in their graph representation.
Data Jacobian
Methods based on the Jacobians with respect to the input of trained networks have been shown to provide valuable information for knowledge transfer and distillation (Czarnecki et al., 2017; Srinivas & Fleuret, 2018), as well as analysis and regularization of networks (Wang et al., 2016).
Neural Tangent Kernel
The Jacobian of the network with respect to the parameters is computed for inference with neural tangent kernels (NTK) (Jacot et al., 2018). Using the NTK as a proxy for NAS (Park et al., 2020) underperforms the neural network Gaussian process (NNGP) kernel. The NNGP provides an inexpensive signal for predicting if an architecture exceeds median performance, but it is worse than training for a few epochs in predicting the order of the top-performing architectures.
Contrastive learning
Different techniques have been developed in contrastive learning. He et al. (2020) train a network with a contrastive loss against a memory bank of negative samples produced by a slowly moving average version of itself. Chen et al. (2020) remove the memory bank and only consider negative samples from within the same minibatch. Grill et al. (2020) remove the negative samples completely, but stabilize the training by encoding the positive samples using a momentum encoder.
3 Background
3.1 Traditional Architecture Embeddings
A decision tree is created either implicitly or explicitly to sample networks from a search space. To encode an architecture, one records all choices made while the decision tree is traversed into a vector, which is then used as the embedding of the architecture. Without any additional knowledge, a NAS algorithm will assume that all choices in the decision tree have an equal importance on the characteristics of an architecture.
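To make this concrete, the sketch below uses a hypothetical cell space with five operations and six edges (loosely modeled on a NAS-Bench-201-style cell, as an assumption for illustration): each sampled choice is recorded as a one-hot block and the blocks are concatenated into the traditional embedding vector.

```python
# Minimal sketch (hypothetical search space): a traditional embedding simply
# records every sampled choice into a fixed-length vector, so distances between
# embeddings reflect the parametrization rather than the architectures themselves.
import random

OPERATIONS = ["none", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]
NUM_EDGES = 6  # number of decisions made while traversing the decision tree

def sample_architecture():
    """Traverse the (implicit) decision tree: pick one operation per edge."""
    return [random.randrange(len(OPERATIONS)) for _ in range(NUM_EDGES)]

def traditional_embedding(choices):
    """Concatenate one-hot encodings of every decision into a single vector."""
    embedding = []
    for op_index in choices:
        one_hot = [0.0] * len(OPERATIONS)
        one_hot[op_index] = 1.0
        embedding.extend(one_hot)
    return embedding

arch = sample_architecture()
print(arch, traditional_embedding(arch))
```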
3.2 Data Jacobian
Extended Data Jacobian matrices are used by Wang et al. (2016) to analyze trained neural networks. We ground our work in their theoretical setting, and introduce the relevant concepts below.

Multi-layer perceptrons with ReLU activations are locally equivalent to a linear model: the ReLU after a linear layer can be combined into a single linear layer, where each row in the matrix is replaced by zeros if the corresponding output pre-activation is negative:

$$\mathrm{ReLU}(Wx) = \hat{W}x, \qquad \hat{W}_{ij} = \begin{cases} W_{ij} & \text{if } (Wx)_i \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

Since matrix multiplication is closed, within a neighborhood where the signs of all neuron pre-activations are constant, the full network can be replaced by a single matrix. This property can be extended to any model whose layers can be rewritten as linear layers, including convolutional layers and average pooling layers. Max pooling layers also retain this property, and can be treated similarly to ReLU.

Therefore, in a local neighbourhood close to $x$, the full information of a network $f$ is contained within its Data Jacobian Matrix (DJM).
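A minimal sketch of this identity, assuming a single linear layer followed by a ReLU: masking the rows of $W$ whose pre-activation is negative reproduces the non-linear output exactly.

```python
# Minimal sketch: within a neighbourhood where the pre-activation signs are
# fixed, ReLU(W x) equals a masked linear map \hat{W} x.
import torch

torch.manual_seed(0)
W = torch.randn(5, 3)
x = torch.randn(3)

pre_activation = W @ x
# Zero out the rows of W whose output pre-activation is negative.
W_hat = torch.where(pre_activation.unsqueeze(1) >= 0, W, torch.zeros_like(W))

assert torch.allclose(torch.relu(W @ x), W_hat @ x)
print(W_hat)
```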
$$\mathrm{DJM}(x) = \frac{\partial f(x)}{\partial x}, \qquad \text{and within that neighbourhood } f(x) = \mathrm{DJM}(x)\,x.$$

We can evaluate the Data Jacobian Matrix at many different points $x$ to gather information about multiple different neighbourhoods. If we assume the network to have a single output, its DJM is a vector, and we can then stack the DJMs at different points to form the Extended Data Jacobian Matrix (EDJM). If a network has multiple outputs, we can sum them to get a single output, which we use to calculate the EDJM.

Wang et al. (2016) use the singular values of the EDJM to compute a score, and empirically show that the score is correlated with the depth of the network and its model capacity.
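The following sketch illustrates how an EDJM could be assembled with automatic differentiation; the toy network, the number of sampled points, and the use of PyTorch are assumptions for illustration only.

```python
# Minimal sketch: compute the Data Jacobian at several input points with
# autograd and stack them into an Extended Data Jacobian Matrix (EDJM).
# Summing over outputs reduces a multi-output network to a single output,
# as described in the text.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

def data_jacobian(net, x):
    """Jacobian of the summed output with respect to a single input point x."""
    x = x.clone().requires_grad_(True)
    y = net(x).sum()          # multiple outputs -> single output
    (djm,) = torch.autograd.grad(y, x)
    return djm                # shape: (input_dim,)

points = torch.randn(8, 16)   # 8 input points, each of dimension 16
edjm = torch.stack([data_jacobian(net, p) for p in points])  # shape (8, 16)

# Wang et al. (2016) score networks from the singular values of the EDJM.
singular_values = torch.linalg.svdvals(edjm)
print(edjm.shape, singular_values)
```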
3.3 Contrastive Learning
Contrastive learning is a self-supervised method that finds an informative embedding space of the input data, useful for downstream tasks. Central to contrastive learning is the concept of a view of an object: two different views of the same object are only superficially different, and we should be able to train a network to see past these differences and identify the same underlying object. To this end, a network is trained to map different views of the same object close to each other in the embedding space and, conversely, views of different objects should be far apart from each other, as shown in Figure 1.
Figure 1: In contrastive learning a network learns to produce similar embeddings for different viewsof the same picture, while producing dissimilar embeddings for dissimilar pictures.
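As a concrete reference, below is a minimal sketch of a SimCLR-style (NT-Xent) contrastive loss implementing this push/pull behaviour. The batch size and embedding dimension are placeholders, and this is a generic formulation rather than the exact loss code used in this work.

```python
# Minimal sketch of the contrastive (NT-Xent / SimCLR-style) objective:
# embeddings of two views of the same object are pulled together, all other
# pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, dim)
    similarity = z @ z.t() / temperature                      # (2B, 2B)
    similarity.fill_diagonal_(float("-inf"))                  # ignore self-pairs
    batch_size = z1.shape[0]
    # The positive of sample i is its other view, located batch_size away.
    targets = torch.arange(2 * batch_size, device=z.device)
    targets = (targets + batch_size) % (2 * batch_size)
    return F.cross_entropy(similarity, targets)

z1, z2 = torch.randn(4, 8), torch.randn(4, 8)
print(nt_xent(z1, z2))
```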
4 Contrastive Embeddings for Neural Architectures
We leverage intrinsic properties of the networks to encode them without depending on their parametrization. We must rely on properties of the architectures at initialization, since it is not computationally feasible to train the architectures to obtain their embeddings. At variance with previous work, we develop a method to find such properties automatically, using contrastive learning.

To this effect, we train a network that takes an architecture at initialization as input and produces an embedding at the output. It is desirable that the network has the following two properties:

• Different initializations of the same architecture should yield similar embeddings.
• Different architectures should yield different embeddings.

We can therefore frame our embedding problem as a contrastive learning task: different initializations of the same network will correspond to different views of the same sample in the contrastive learning framework.
4.1 Our Method
Following Mellor et al. (2020), we compute the Extended Data Jacobian Matrix (EDJM) of architectures at initialization, and we use a low-rank projection of it as input to our contrastive network, to limit memory requirements. We will refer to the projected version of the EDJM as the Extended Projected Data Jacobian Matrix (EPDJM):

$$\mathrm{EPDJM}(X)_i = \phi_X\left(\frac{\partial \lVert f(X_i) \rVert}{\partial X_i}\right), \tag{1}$$

where $\phi_X$ denotes a projection onto the top-$k$ principal components,

$$X = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} V^T, \qquad \phi_X(x) = U_1 \Sigma_1 x. \tag{2}$$

The contrastive network is then applied to the EPDJM, and trained using SimCLR (Chen et al., 2020). Once the contrastive network is trained, we can obtain the embeddings of any architecture in the search space, as shown in Figure 2. The embeddings can then be used with any black-box optimization algorithm; we use Sequential Model-Based Bayesian Optimization with Gaussian Processes, with Expected Improvement as the acquisition function.

Figure 2: Illustration of our method for obtaining the network embeddings. We sample architectures from the search space, and form a batch of views with different random initializations. We compute the data Jacobians, project them, and feed them to the contrastive network. The contrastive model learns to generate similar embeddings for networks with similar performance.
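A sketch of how the EPDJM of Eqs. (1)-(2) could be computed is given below. The projection is implemented here via the top right singular vectors of the stacked Jacobians (equivalently, the leading block of the SVD); the toy network, batch size, and choice of k are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of Eq. (1)-(2): compute the data Jacobians of the norm of the
# network output and project them onto the top-k principal components to form
# the EPDJM.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
X = torch.randn(32, 64)   # a batch of input points

def data_jacobians(net, X):
    X = X.clone().requires_grad_(True)
    out = net(X).norm(dim=1).sum()    # ||f(X_i)|| summed over the batch
    (jac,) = torch.autograd.grad(out, X)
    return jac                        # (batch, input_dim), one Jacobian per point

def project_top_k(jacobians, k=16):
    """Project each Jacobian onto the top-k principal components (via SVD)."""
    U, S, Vh = torch.linalg.svd(jacobians, full_matrices=False)
    return jacobians @ Vh[:k].t()     # (batch, k)

epdjm = project_top_k(data_jacobians(net, X), k=16)
print(epdjm.shape)                     # torch.Size([32, 16])
```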
4.2 Implementation

Since the input of the contrastive network is a set of Jacobians at different data points, it is desirable that the architecture of the contrastive network is invariant under permutation of the data points. To this end, we use the simple architecture from Zaheer et al. (2017), which encodes each sampled Jacobian matrix with a shared multilayer perceptron (MLP), aggregates them by taking their mean, and finally produces its output with another MLP.

For the contrastive learning, we use SimCLR with a batch size of 512 and a temperature of 0.1. We project the Jacobians down to a 256-dimensional space. To accelerate the contrastive learning, we precompute the projected Jacobians using four different initializations for each architecture. The computation of the projected Jacobians takes less than 2 hours on a single GPU (NVIDIA RTX 2080 Ti). The embedding size is set to 256, and we use a single-layer feedforward network for the projection head.

We use the implementation of Gaussian Processes from GPy (since 2012) to select new architectures for evaluation, by randomly sampling 20 architectures and choosing the one with the highest expected improvement. We open source our code, including all hyperparameters, at https://github.com/lightonai/contrastive-embeddings-for-neural-architectures.
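A permutation-invariant encoder in this spirit might look like the sketch below; the hidden sizes and number of layers are assumptions, and only the overall shared-MLP, mean-pooling, and readout-MLP structure follows the description above.

```python
# Minimal sketch of a permutation-invariant contrastive network in the spirit
# of Zaheer et al. (2017): each projected Jacobian is encoded by a shared MLP,
# the encodings are averaged, and a second MLP produces the embedding.
import torch
import torch.nn as nn

class ContrastiveEncoder(nn.Module):
    def __init__(self, jacobian_dim=256, hidden_dim=512, embedding_dim=256):
        super().__init__()
        self.per_point = nn.Sequential(            # shared MLP, applied per data point
            nn.Linear(jacobian_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.readout = nn.Sequential(              # MLP applied after mean pooling
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
        )

    def forward(self, epdjm):
        # epdjm: (batch, num_points, jacobian_dim); averaging over the points
        # makes the output invariant to their order.
        return self.readout(self.per_point(epdjm).mean(dim=1))

encoder = ContrastiveEncoder()
embeddings = encoder(torch.randn(8, 32, 256))   # 8 architectures, 32 points each
print(embeddings.shape)                          # torch.Size([8, 256])
```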
4.3 Analysis of the Embeddings

We plot the t-SNE projections (Van der Maaten & Hinton, 2008) at different stages of our method in Figure 3 to analyze the influence of the contrastive learning on the embeddings. We note that the EPDJM alone carries some meaning in the t-SNE space. The contrastive embeddings at initialization of the network already exhibit more evident structure. Finally, the contrastive learning phase produces clean embeddings with little noise, that clearly separate architectures based on performance.

Further, we predict the performance of the unseen networks in the search space using LightGBM (Ke et al., 2017) with the default hyperparameters, to analyse the predictive power of the embeddings. The results for NAS-Bench-201 (Dong & Yang, 2020) are shown in Figure 4 and Table 1, and indicate that the contrastive embeddings are highly predictive of the performance of the architectures in this search space.

Table 1: Metrics computed on predicted accuracies for the three benchmarks in NAS-Bench-201. This provides a condensed view of Figure 4.

                    Correlation    Kendall-τ
    CIFAR-10        0.88           0.57
    CIFAR-100       0.86           0.57
    ImageNet16-120

Figure 3: t-SNE projections of different statistics of 1500 architectures in NAS-Bench-201. Panels: (a) EPDJM, (b) untrained embedding of the EPDJMs, (c) trained embedding of the EPDJMs.
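A sketch of this evaluation protocol could look as follows; the arrays standing in for the contrastive embeddings and the final test accuracies are random placeholders, not real data.

```python
# Minimal sketch: fit LightGBM on contrastive embeddings of a subset of
# architectures and measure rank correlation on held-out ones.
import numpy as np
import lightgbm as lgb
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1500, 256))     # contrastive embeddings (placeholder)
accuracies = rng.uniform(10, 90, size=1500)   # final test accuracies (placeholder)

train, test = np.arange(500), np.arange(500, 1500)
model = lgb.LGBMRegressor()                   # default hyperparameters, as in the text
model.fit(embeddings[train], accuracies[train])

predicted = model.predict(embeddings[test])
tau, _ = kendalltau(predicted, accuracies[test])
print(f"Kendall-tau on held-out architectures: {tau:.2f}")
```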
4.4 Embeddings During Optimization
Once the contrastive network is trained, it can produce embeddings for architectures at various points during their training. We show the evolution of the embeddings of 100 networks during the first 50 epochs of training in Figure 5: the embeddings vary during the training procedure, potentially enabling future methods to learn more information about the search space for each evaluated network. In particular, the training procedure tends to connect areas with similar final test accuracy.
Figure 4: Predicted accuracy against actual accuracy on (a) CIFAR-10, (b) CIFAR-100, and (c) ImageNet16-120. The predictions are produced by LightGBM applied on the contrastive embeddings of 500 randomly selected architectures in NAS-Bench-201 (Dong & Yang, 2020).

Figure 5: t-SNE projections of movement in embedding space during training of 50 architectures in the NAS-Bench-201 benchmark. The color of each trajectory represents the final test accuracy of the architecture on the ImageNet16-120 data set, a downsampled version of the traditional ImageNet dataset.
5 Architecture Search
We evaluate our contrastive embeddings on the task of architecture search. We use Sequential Model-Based Optimization (SMBO) with Gaussian Processes (Bergstra et al., 2011) to guide the search: it is a commonly used method, developed in a different context, and its performance in our setting implies that there is structure in our embeddings that can be easily leveraged.

The covariance function for the Gaussian Process is chosen to be the Matern-52 kernel, a stationary kernel that only depends on the Euclidean distance between the architectures in the embedding space. If this can correctly guide our optimization, then the Euclidean distance within the embedding space is a good predictor of relative performance.

Based on the work of Wang et al. (2016), who construct their score by normalizing the EDJM by the principal singular value, we investigate the contrastive embeddings produced by both the normalized and the unnormalized EDJM.
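A sketch of such an SMBO loop is shown below; scikit-learn's Gaussian process is substituted for GPy for brevity, and the embedding array and evaluation function are placeholders for the real search-space data.

```python
# Minimal sketch of SMBO with a Matern-5/2 GP surrogate and Expected
# Improvement over the embedding space.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, candidates, best_so_far):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))            # all architecture embeddings (placeholder)
evaluate = lambda idx: float(rng.uniform(10, 90))    # stand-in for training an architecture

observed_idx = list(rng.choice(len(embeddings), size=5, replace=False))
observed_acc = [evaluate(i) for i in observed_idx]

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(embeddings[observed_idx], observed_acc)
    # Randomly sample 20 unseen candidates and pick the highest EI, as in Sec. 4.2.
    candidates = rng.choice(
        [i for i in range(len(embeddings)) if i not in observed_idx], size=20, replace=False
    )
    ei = expected_improvement(gp, embeddings[candidates], max(observed_acc))
    chosen = int(candidates[np.argmax(ei)])
    observed_idx.append(chosen)
    observed_acc.append(evaluate(chosen))

print(f"best accuracy found: {max(observed_acc):.2f}")
```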
5.1 NAS-Bench-201

We show the results on the NAS-Bench-201 benchmark (Dong & Yang, 2020) in Figure 6. We notice that the normalization by the principal singular value significantly degrades the performance on this benchmark. However, both versions show a clear improvement over random search, and the unnormalized version also significantly outperforms regularized evolution (Real et al., 2019) when the number of evaluated architectures is small.

We remark that these results require significant manual tuning of the hyperparameters of the optimization algorithm. Nevertheless, they demonstrate that the contrastive embeddings do contain

[Figure 6: accuracy of the best architecture found against the number of evaluated architectures, comparing Random Search, Regularized Evolution, and the contrastive embeddings.]