Learning to reinforcement learn for Neural Architecture Search
Jorge Gomez Robles [email protected]
Joaquin Vanschoren [email protected]
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven 5612 AZ, The Netherlands
Abstract
Reinforcement learning (RL) is a goal-oriented learning solution that has proven to be successful for Neural Architecture Search (NAS) on the CIFAR and ImageNet datasets. However, a limitation of this approach is its high computational cost, making it unfeasible to replay it on other datasets. Through meta-learning, we could bring this cost down by adapting previously learned policies instead of learning them from scratch. In this work, we propose a deep meta-RL algorithm that learns an adaptive policy over a set of environments, making it possible to transfer it to previously unseen tasks. The algorithm was applied to various proof-of-concept environments in the past, but we adapt it to the NAS problem. We empirically investigate the agent's behavior during training when challenged to design chain-structured neural architectures for three datasets with increasing levels of hardness, to later fix the policy and evaluate it on two unseen datasets of different difficulty. Our results show that, under resource constraints, the agent effectively adapts its strategy during training to design better architectures than the ones designed by a standard RL algorithm, and can design good architectures during the evaluation on previously unseen environments. We also provide guidelines on the applicability of our framework in a more complex NAS setting by studying the progress of the agent when challenged to design multi-branch architectures.
Keywords:
Neural Architecture Search, Deep Meta-Reinforcement Learning, Image Classification
1. Introduction
Neural networks have achieved remarkable results in many fields, such as that of image classification. Crucial aspects of this success are the choice of the neural architecture and the chosen hyperparameters for the particular dataset of interest; however, this choice is not always straightforward. Although state-of-the-art neural networks can inspire the design of other architectures, this process heavily relies on the designer's level of expertise, making it a challenging and cumbersome task that is prone to deliver underperforming networks.

In an attempt to overcome these flaws, researchers have explored various techniques under the name of Neural Architecture Search (NAS) (Elsken et al., 2018). In NAS, the ultimate goal is to come up with an algorithm that takes any arbitrary dataset as input and outputs a well-performing neural network for some learning task of interest, so that we can accelerate the design process and remove the dependency on human intervention. Nevertheless, coming up with a solution of this kind is a complicated endeavor where researchers have to deal with several aspects such as the type of networks that they consider, the scope of the automation process, or the search strategy applied. A particular search strategy for NAS is reinforcement learning (RL), where a so-called agent learns how to design neural networks by sampling architectures and using their numeric performance on a specific dataset as the reward signal that guides the search. Popular standard RL algorithms such as Q-learning or Reinforce have been used to design state-of-the-art Convolutional Neural Networks (CNNs) for classification tasks on the CIFAR and ImageNet datasets (Pham et al., 2018; Cai et al., 2018; Zhong et al., 2018; Zoph and Le, 2016; Baker et al., 2016), but little attention is paid to delivering architectures for other datasets. In an attempt to fill that gap, a suitable alternative is deep meta-RL (Wang et al., 2016; Duan et al., 2016), where the agent acts on various environments to learn an adaptive policy that can be transferred to new environments.

In this work, we apply deep meta-RL to NAS, which, to the best of our knowledge, is a novel contribution. The environments that we consider are associated with standard image classification tasks on datasets with different levels of hardness sampled from a meta-dataset (Triantafillou et al., 2019). Our main experiments focus on the design of chain-structured networks and show that, under resource constraints, the resulting policy can adapt to new environments, outperform standard RL, and design better architectures than the ones inspired by state-of-the-art networks. We also experiment with extending our approach to the design of multi-branch architectures so that we can give directions for future work.

The remainder of this report is structured as follows. First, in Section 2, we introduce the preliminary concepts required to understand our work. Next, in Section 3, we discuss the related work for both reinforcement learning and NAS. In Section 4, we formally introduce our methodology, and in Section 5, the framework developed to implement it. In Section 6, we define the experiments, and in Section 7, we show the results. Finally, in Section 8, the conclusions are set out.
2. Preliminaries
Reinforcement learning (RL) is an approach to automate goal-directed learning (Sutton and Barto, 2012). It relies on two entities that interact with each other: an environment that delivers information about its state, and an agent that uses such information to learn how to achieve a goal in the environment. The interaction is a bilateral communication where the agent performs actions to modify the state of the environment, which responds with a numeric reward measuring how good the action was for achieving the goal. Typically, the sole interest of the agent is to improve its decision-making strategy, known as the policy, to maximize the total reward received over the whole interaction trial, since this will lead it to the desired goal. More strictly, RL is formalized using finite Markov Decision Processes (MDPs) as in Definition 1, borrowed from Duan et al. (2016), resulting in the agent-environment interaction illustrated in Figure 1.
Definition 1 (Reinforcement Learning)
We define a discrete-time finite-horizon discounted MDP $M = (\mathcal{X}, \mathcal{A}, \mathcal{P}, r, \rho_0, \gamma, T)$, in which $\mathcal{X}$ is a state set, $\mathcal{A}$ an action set, $\mathcal{P} : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto \mathbb{R}_{+}$ a transition probability distribution, $r : \mathcal{X} \times \mathcal{A} \mapsto [-R_{\max}, R_{\max}]$ a bounded reward function, $\rho_0 : \mathcal{X} \mapsto \mathbb{R}_{+}$ an initial state distribution, $\gamma \in [0, 1]$ a discount factor, and $T$ the horizon. Reinforcement learning typically aims to optimize a stochastic policy $\pi_\theta : \mathcal{X} \times \mathcal{A} \mapsto \mathbb{R}_{+}$ by maximizing the expected reward, modeled as
$$\eta(\pi_\theta) = \mathbb{E}_{\tau}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(x_t, a_t)\Big],$$
where $\tau = (x_0, a_0, \ldots)$ denotes the whole trajectory, $x_t \in \mathcal{X}$, $x_0 \sim \rho_0(x_0)$, $a_t \in \mathcal{A}$, $a_t \sim \pi_\theta(a_t \mid x_t)$, and $x_{t+1} \sim \mathcal{P}(x_{t+1} \mid x_t, a_t)$.

Figure 1: Graphic representation of the reinforcement learning interaction. Every time the agent performs an action $a_t$, the environment modifies its state $x_{t-1}$ to $x_t$, computes the reward $r_t = r(x_t, a_t)$, and sends both values to the agent, who uses them to optimize its policy.
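As a concrete reading of this objective, the short sketch below computes the discounted return of one trajectory and a Monte Carlo estimate of $\eta(\pi_\theta)$ from a handful of sampled trajectories; the function names and the toy rewards are ours and only illustrate the formula above.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one trajectory, as in Definition 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_eta(trajectories, gamma=0.99):
    """Monte Carlo estimate of eta(pi_theta): the average discounted
    return over trajectories collected by running the policy."""
    return float(np.mean([discounted_return(tr, gamma) for tr in trajectories]))

# Toy usage: two trajectories given as lists of per-step rewards.
print(estimate_eta([[0.0, 0.0, 1.0], [0.2, 0.5]], gamma=0.9))
```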
Neural Architecture Search (NAS) is the process of automating the design of neural networks. In order to formalize this definition, it is convenient to refer to the survey of Elsken et al. (2018), which characterizes a NAS work with three variables: the search space, the search strategy, and the performance estimation strategy. Figure 2 illustrates the interaction between these variables.

Figure 2: An illustration of the three NAS variables interacting (Elsken et al., 2018). At any moment during the search, the search strategy samples an architecture A from the search space and sends it to the performance estimation strategy, which returns the performance estimate. By design, the search space and the performance estimation strategy are named after the variables in Definition 1, since they are typically equivalent in NAS within the reinforcement learning setting.

The search space is the set of architectures considered in the search process. It is possible to define different spaces by constraining attributes of the networks, such as the maximum depth allowed, the type of layers to use, or the connections permitted between layers. A common abstraction inspired by popular networks is to separate the search spaces into chain structures and multi-branch structures, which can be either complete neural networks or cells that can be used to build more complex networks, as illustrated in Figure 3.

Figure 3: Examples of networks belonging to different search spaces. On the left, a chain-structured network. In the center, a multi-branch network. On the right, the same multi-branch structure used as a cell repeated multiple times to build a more complex network.
On the other hand, the search strategy is simply the algorithm used to perform the search. The choices range from naive approaches such as random search to more sophisticated ones like reinforcement learning (Baker et al., 2016; Zoph and Le, 2016), evolutionary algorithms (Real et al., 2018), or gradient descent search (Liu et al., 2018).

Lastly, the performance estimation strategy is the function used to measure the goodness of the sampled architectures. Formally, it is a function $R_D : \mathcal{X} \mapsto \mathbb{R}$ evaluating an architecture on a dataset $D$. The vanilla estimation strategy is the test accuracy of a network after training, but different alternatives have been proposed to try to deliver an accurate estimate in a short time, since expensive training creates a bottleneck in the search process.
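To make the interaction of Figure 2 concrete, the following sketch implements a minimal NAS loop in which random search plays the role of the search strategy and a placeholder stands in for the performance estimation strategy $R_D$; the layer vocabulary, function names, and dataset label are illustrative assumptions only.

```python
import random

LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool2", "avgpool2"]  # toy chain-structured space

def sample_architecture(max_depth=4):
    """Search strategy (here: random search) samples an architecture A from the search space."""
    return [random.choice(LAYER_CHOICES) for _ in range(random.randint(1, max_depth))]

def estimate_performance(architecture, dataset):
    """Performance estimation strategy R_D: placeholder score. In practice this would
    train the sampled network (possibly with early stopping) and return test accuracy."""
    return random.random()

best_arch, best_score = None, -1.0
for _ in range(10):                                   # search budget
    arch = sample_architecture()
    score = estimate_performance(arch, dataset="some-dataset")
    if score > best_score:
        best_arch, best_score = arch, score
print(best_arch, best_score)
```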
3. Related work
The key to reinforcement learning is the algorithm used to optimize the policy π_θ (see Definition 1). Through the years, researchers have proposed different algorithms, such as Reinforce (Williams, 1992), Q-learning (Watkins and Dayan, 1992), Actor-Critic (Konda, 2002), Deep-Q-Network (DQN) (Mnih et al., 2013), Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), and Proximal Policy Optimization (PPO) (Schulman et al., 2017). These algorithms have successfully solved problems in a variety of fields, from robotics (Kober et al., 2013) and video games (Bouzy and Chaslot, 2006; Mnih et al., 2013) to traffic control (Arel et al., 2010) or computational resource management (Mao et al., 2016), showing the power and utility of the reinforcement learning framework. Despite its success, a theoretical flaw of RL is that the policy learned only captures the one-to-one state-action relation of the environment in question, making it necessary to perform individual runs for every new environment of interest. The latter is a costly trait of this standard form of RL since it typically requires thousands of steps to converge to an optimal policy.

A research area that addresses this problem is meta-RL, in which agents are trained to learn transferable policies that do not require training from scratch on new problems. We identify two types of meta-RL algorithms: the ones that learn a good initialization for the neural networks representing the policy, and the ones that learn policies that can adapt their decision-making strategy to new environments, ideally without further training. First, Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) learns an initialization that allows few-shot learning in RL for a set of similar environments. Second, two algorithms have been proposed to learn policies that, once deployed, can adapt their decision-making strategy to new environments, ideally without further training: Learning to reinforcement learn (Wang et al., 2016) and RL² (Duan et al., 2016). These methods aim to learn a more sophisticated policy modeled by a Recurrent Neural Network (RNN) that captures the relation between states, actions, and meta-data of past actions. What emerges is a policy that can adapt its strategy to different environments. The main difference between the two works is the set of environments considered: for Wang et al. (2016) they come from a parameterized distribution, whereas for Duan et al. (2016) they are relatively unrelated. We believe that the latter makes these algorithms more suitable for scenarios where a distribution of environments cannot be guaranteed, although the general universality of the gradient-based learner (Finn and Levine, 2017) suggests that MAML could address this scenario too. Third, the Simple Neural AttentIve Learner (SNAIL) extends the idea behind Learning to reinforcement learn and RL² by using a more powerful temporal-focused model than the simple RNN. We note that none of these approaches have been applied to NAS.

As introduced earlier in Section 2.2, it is possible to address Neural Architecture Search (NAS) in different ways. Remarkable results have been achieved by applying optimization techniques such as Bayesian optimization, evolutionary algorithms, gradient-based search, and reinforcement learning. We are interested in reinforcement learning for NAS, due to the variety of works that have achieved state-of-the-art results. For other work in NAS, we refer to the survey of Elsken et al.
(2018).

Although the ultimate goal of NAS is to come up with a straightforward, fully-automated solution that can deliver a neural architecture for a machine learning task on any dataset of interest, there exist several factors that impede that ambition. Perhaps the most important of these factors is the high computational cost of NAS with reinforcement learning, which imposes constraints on different elements that impact the scope of the solutions. The first bottleneck is the computation of the reward, which typically is the test accuracy of the sampled architectures after training. Because of the expensiveness of such computation, researchers have proposed various performance estimation strategies to avoid expensive training procedures, and they have also imposed some constraints over the search space considered so that a lower number of architectures get sampled and evaluated. For the first aspect, we observe several relaxations: reducing the number of training epochs (Baker et al., 2016; Zhong et al., 2018), sharing weights between similar networks (Cai et al., 2018; Pham et al., 2018), or entirely skipping the train-evaluation procedure by predicting the performance of the architectures (Zhong et al., 2018). Although these alternatives have successfully reduced the computation time, they pay little attention to the effect of their potentially unfair estimations on the decision-making of the agent, and therefore, one should treat them carefully. On the other hand, for the search space, crucial choices are the cardinality of the space and the complexity of the architectures. Some researchers opt for ample spaces with various types of layers and no restrictions on their connections (Zoph and Le, 2016; Zoph et al., 2017; Zhong et al., 2018), whereas others prefer them smaller, such as a chain-structured space (Baker et al., 2016), or a multi-branch space modeled as a fully-connected graph with low cardinality (Pham et al., 2018).

It is important to note that an approach can dramatically reduce its computation time with relaxations on the search space alone. For instance, the same methodology can decrease its computational cost by a factor of 7 (28 days to 4 days, with hundreds of GPUs in both cases) if the space is restricted in the types of layers and the number of elements allowed in the architectures (Zoph and Le, 2016; Zoph et al., 2017). Furthermore, a constrained search space used jointly with some performance estimation strategy can reduce the cost to only 1 day with 1 GPU, as in BlockQNN-V2 (Zhong et al., 2018) and ENAS (Pham et al., 2018); however, this drastic reduction of the computational time should be treated with caution. In the case of BlockQNN-V2, the estimation of the performance of the networks (i.e., accuracy at a given epoch) depends on a surrogate prediction model that is not studied in detail by the authors, thus leaving room for potentially wrong predictions. On the other hand, a recent investigation (Singh et al., 2019) shows that the quality of the networks delivered by ENAS is not a consequence of reinforcement learning, but of the search space, which contains a majority of well-performing architectures that can be explored with a less expensive procedure such as random search, therefore losing its character of artificially smart search.

Another factor impacting a NAS with reinforcement learning work is the input dataset used. Although the works above usually transfer the best CIFAR-based architecture designed by the agent to the ImageNet dataset (Baker et al., 2016; Zoph and Le, 2016; Zoph et al., 2017; Cai et al., 2018; Pham et al., 2018), none of them make the agent design networks for other datasets. Furthermore, none of the works give insight on how using a different dataset could affect the complexity of the search. We believe that the lack of study for other datasets is ascribed to the costly task-oriented design of the reinforcement learning algorithms used, Q-learning and Reinforce, which require training the agent from scratch for every environment (i.e., a dataset) of interest. The authors do not justify the choice of these algorithms; hence, it would be desirable to study other reinforcement learning algorithms in the same NAS scenarios.

1. The term deep meta-reinforcement learning comes from the usage of deep models, such as neural networks, to represent the policy to be learned.
4. Methodology
We aim to improve the performance of Neural Architecture Search (NAS) with reinforcement learning (RL) by using meta-learning. We therefore build a meta-RL system that can learn across environments and adapt to them. We split the system into two components: the NAS variables and the reinforcement learning framework. For the reinforcement learning framework, we make use of a deep meta-RL algorithm that follows the same line as Learning to reinforcement learn (Wang et al., 2016) and RL² (Duan et al., 2016), with some minor adaptations in the meta-data employed and the design of the episodes. The environments that we consider are neural architecture design tasks for different datasets sampled from the meta-dataset collection (Triantafillou et al., 2019). On the other hand, for the NAS elements, we work with a slightly modified version of the search space of BlockQNN (Zhong et al., 2018) and, similarly, we use the test accuracy after early-stop training as the reward associated with the sampled architectures. In the remainder of the section, we discuss these elements further.

As described in Section 2.2, three NAS variables characterize a research work in this area: the search strategy, the search space, and the performance estimation strategy. In our case, we constrain the search strategy to a deep meta-reinforcement learning algorithm that we explain in detail in Section 4.2, and thus, here we only elaborate on the remaining two.
The set of architectures considered in our work is inspired by BlockQNN (Zhong et al., 2018), which defines the search space as all architectures that can be generated by sequentially stacking d ∈ ℕ vectors from a so-called Network Structure Code (NSC) space containing encodings of the most relevant layers for CNNs. An NSC vector has information on the type of a layer, the value of its most important hyperparameter, its position in the network, and the allowed incoming connections (i.e., the inputs), so that it becomes possible to represent any architecture as a list of NSCs. The NSC definition is flexible in that it can easily be modified or extended and, moreover, it allows us to define an equivalent discrete action space for the reinforcement learning agent, as described in Section 4.2.2.

In Table 1, we present the NSC space for our implementation. Given a list of NSC vectors representing an architecture, the network is built following the next rules: firstly, based on BlockQNN's results, if a convolution layer is found then a Pre-activation Convolutional Cell (PCC) (He et al., 2016) with 32 units is used; secondly, the concatenation and addition operations create padded versions of their inputs if they have different shapes; thirdly, if at the end of the building process the network has two or more leaves then they get merged with a concatenation operation. Figure 4 illustrates these rules.
Figure 4: Example of an architecture sampled from the search space of our approach. On the left, a list of Neural Structure Codes (NSCs): [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 0, 0], [2, 2, 2, 1, 0], [3, 1, 3, 0, 0], [4, 3, 2, 3, 0], [5, 1, 5, 0, 0], [6, 2, 2, 5, 0], [7, 5, 0, 2, 4], [8, 7, 0, 0, 0]. On the right, the corresponding network after the application of the rules: three convolution-pooling branches off the 84x84 input, an addition merging two of the pooled outputs, and a final concatenation of the remaining leaves. For the sake of simplicity, in this example, the convolutions are assumed to have one filter only.

2. The PCC stacks a ReLU, a convolution, and a batch normalization unit.
3. The selection of the number of units is made to reduce the cost of the training of the networks.
4. The last two rules do not apply for chain-structured architectures since no merge operations are needed.
5. The kernel size is an attribute for the convolutions, whereas for the pooling elements it refers to the layer's pool size.
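As a rough illustration of these building rules, the sketch below assembles a chain-structured tf.keras model from a list of NSC vectors, expanding convolutions into Pre-activation Convolutional Cells (ReLU, convolution with 32 filters, batch normalization). It covers only the chain-structured case (no merge operations), and the numeric type codes follow our reading of Table 1; it is a sketch, not the implementation used in our framework.

```python
import tensorflow as tf

# Assumed type codes (cf. Table 1): 1 = convolution, 2 = max pooling,
# 3 = average pooling, 7 = terminal.
def build_chain_network(nsc_list, input_shape=(84, 84, 3)):
    """Build a chain-structured network from NSC vectors [index, type, kernel, pred1, pred2]."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for _, layer_type, kernel, _, _ in nsc_list:
        if layer_type == 1:                       # PCC: ReLU -> Conv(32 filters) -> BatchNorm
            x = tf.keras.layers.ReLU()(x)
            x = tf.keras.layers.Conv2D(32, kernel, padding="same")(x)
            x = tf.keras.layers.BatchNormalization()(x)
        elif layer_type == 2:
            x = tf.keras.layers.MaxPooling2D(pool_size=kernel)(x)
        elif layer_type == 3:
            x = tf.keras.layers.AveragePooling2D(pool_size=kernel)(x)
        elif layer_type == 7:                     # terminal code: stop building
            break
    return tf.keras.Model(inputs, x)

# A three-layer chain: convolution with kernel 3, max pooling with pool size 2, terminal.
model = build_chain_network([[1, 1, 3, 0, 0], [2, 2, 2, 1, 0], [3, 7, 0, 0, 0]])
model.summary()
```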
Name             Index  Type  Kernel size  Predecessor 1  Predecessor 2
Convolution      T      1     {1, 3, 5}    K              ∅
Max Pooling      T      2     {2, 3}       K              ∅
Average Pooling  T      3     {2, 3}       K              ∅
Addition         T      5     ∅            K              K
Concatenation    T      6     ∅            K              K
Terminal         T      7     ∅            ∅              ∅

Table 1: The subset of the NSC space used, presented as in BlockQNN (Zhong et al., 2018). The changes with respect to the original BlockQNN space are: a) the identity operator (Type 4) is omitted; b) the pool size values changed from the original set to {2, 3}, because a pool size of 1 does not contribute to any reduction. The set T = {1, . . . , d} refers to the position of each layer in the network, where d is the maximum depth, and K = {0, 1, 2, . . . , current layer index − 1} is the index of its predecessor.

Our estimation of the long-term performance of the designed networks closely follows the early-stop approach of BlockQNN-V1 (Zhong et al., 2018), but we ignore the penalization of the network's FLOPs and density since we have empirically ascertained that it is too strict when the classification task is difficult (i.e., when low accuracy values are expected). The choice of an early-stop strategy is made to help reduce the computational cost of our approach. In short, for every sampled architecture N, a prediction module is appended, and the network is then trained for a small number of epochs to obtain its accuracy on a test set, which is the final estimation of its long-term performance. The datasets considered are balanced, and their train and test splits are designed beforehand (see Section 4.2.3).

As in BlockQNN, for the prediction module we stack a fully-connected dense layer with 1024 units and ReLU activation function, a dropout layer with a rate of 0.4, a dense layer with the number of units equal to the desired number of classes to predict and linear activation function, and a softmax that outputs the probabilities per class.
The training is performed to minimize the cross-entropy loss using the Adam optimizer (Kingma and Ba, 2015) with the hyper-parameters used in BlockQNN ($\beta_1$, $\beta_2$, $\epsilon_{adam}$) and a learning rate $\alpha_{adam} = 0.001$ that is reduced by a factor of 0.2 every five epochs. After training, the network is evaluated on a test set by fixing the network's weights and selecting the class with the highest probability as the final prediction per observation in the set, so that the standard accuracy $ACC_N$ can be returned.
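A minimal tf.keras sketch of this early-stop estimation is given below: it appends the prediction module described above to a sampled backbone, trains it briefly, and returns the test accuracy used as the reward. The added Flatten layer, the fixed learning rate (the decay schedule is omitted), and the function signature are simplifying assumptions rather than the exact implementation.

```python
import tensorflow as tf

def early_stop_accuracy(backbone, n_classes, train_data, test_data, epochs=12):
    """Estimate the long-term performance ACC_N of a sampled backbone network."""
    x = tf.keras.layers.Flatten()(backbone.output)          # assumed: flatten before the dense layer
    x = tf.keras.layers.Dense(1024, activation="relu")(x)   # fully-connected layer with ReLU
    x = tf.keras.layers.Dropout(0.4)(x)                     # dropout with rate 0.4
    logits = tf.keras.layers.Dense(n_classes)(x)            # linear activation
    outputs = tf.keras.layers.Softmax()(logits)             # class probabilities
    model = tf.keras.Model(backbone.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_data, epochs=epochs, verbose=0)         # short, early-stopped training
    _, accuracy = model.evaluate(test_data, verbose=0)
    return accuracy
```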
The deep meta-RL framework that we propose is different from standard RL in two main aspects. First, the agent is challenged to face more than one environment during training, and second, the distribution over the reward domain learned by the agent is now dependent on the whole history of states, actions, and rewards, instead of the simple state-action pairs. In the remainder of the section, we describe each of the RL elements.

A state x_i ∈ X is a multidimensional array of size d × 5, storing d NSC vectors sorted by layer index. While this representation is programmatically easy to control, it is not ideal in a machine learning setting. In particular, we note that every element of an NSC vector is a categorical variable. Therefore, when required, every NSC vector in x_t is transformed as follows: the layer's type is encoded into a one-hot vector of size 8, the predecessors into a one-hot vector of size (d + 1) each, and the kernel size into a one-hot vector of size (k + 1) with k = max(kernel size). The transformation ignores the layer index because the state implicitly incorporates the information of the position of each layer due to sorting. This encoding results in a multidimensional array of size d × (2d + k + 11). Figure 5 illustrates this transformation.

Figure 5: The different representations of a state. In this example, a state x_t contains d = 4 NSC vectors. On the left, a network as a list of NSC vectors ([0, 0, 0, 0, 0], [1, 1, 5, 0, 0], [2, 2, 2, 1, 0], [3, 6, 0, 0, 2]); on the right, the same network in its encoded representation. In our work, k = 5 as observed in Table 1.

6. This size is the result of having 7 types of layers (see Section 4.1.1) plus the type 0 representing an empty layer.
7. When working with chain-structured networks the second predecessor is always omitted, reducing the dimensionality of the encoding to d × (d + k + 10).
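The encoding takes only a few lines of numpy; the sketch below follows the description above (a one-hot block of size 8 for the type, k + 1 for the kernel, and d + 1 for each predecessor, with the layer index dropped) and reproduces the d × (2d + k + 11) shape for the Figure 5 example. The helper names are ours.

```python
import numpy as np

def one_hot(value, size):
    vec = np.zeros(size)
    vec[value] = 1.0
    return vec

def encode_state(state, d=10, k=5):
    """Encode a list of NSC vectors into a d x (2d + k + 11) array of one-hot blocks."""
    encoded = np.zeros((d, 2 * d + k + 11))
    for row, (_, layer_type, kernel, pred1, pred2) in enumerate(state):
        encoded[row] = np.concatenate([
            one_hot(layer_type, 8),      # 7 layer types plus the empty type 0
            one_hot(kernel, k + 1),      # kernel or pool size
            one_hot(pred1, d + 1),       # first predecessor
            one_hot(pred2, d + 1),       # second predecessor
        ])
    return encoded

# The d = 4 example of Figure 5.
state = [[0, 0, 0, 0, 0], [1, 1, 5, 0, 0], [2, 2, 2, 1, 0], [3, 6, 0, 0, 2]]
print(encode_state(state, d=4).shape)    # (4, 24), i.e., d x (2d + k + 11) with d = 4, k = 5
```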
We formulate the action space A as a discrete space of 14 actions, listed in Table 2. Each action a_i ∈ A can either append a new element from the NSC space to a state x_j ∈ X or control two pointers, p_1 and p_2, holding the indices of the predecessors to use for the next NSC vector. We note that for chain-structured networks no pointers are required, since the predecessor is always the previous layer, and neither are the merging operations addition and concatenation, making it possible to reduce the action space to 8 actions only.
Action ID   Description
A0          Add convolution with kernel size = 1, using predecessor p_1
A1          Add convolution with kernel size = 3, using predecessor p_1
A2          Add convolution with kernel size = 5, using predecessor p_1
A3          Add max-pooling with pool size = 2, using predecessor p_1
A4          Add max-pooling with pool size = 3, using predecessor p_1
A5          Add avg-pooling with pool size = 2, using predecessor p_1
A6          Add avg-pooling with pool size = 3, using predecessor p_1
A7          Add terminal state
A8          Add addition with predecessors p_1 and p_2
A9          Add concatenation with predecessors p_1 and p_2
A10         Shift p_1 one position up (i.e., p_1 = p_1 + 1)
A11         Shift p_1 one position down (i.e., p_1 = p_1 − 1)
A12         Shift p_2 one position up (i.e., p_2 = p_2 + 1)
A13         Shift p_2 one position down (i.e., p_2 = p_2 − 1)

Table 2: The action space proposed, which is compliant with the NSC space of Section 4.1.1.
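The mapping from the 14 discrete actions to changes in the NSC state can be sketched as follows; the exact bookkeeping (valid ranges of the pointers, the predecessor stored for the terminal code) is simplified, and the numeric type codes again follow our reading of Table 1.

```python
# Assumed NSC type codes: 1 = conv, 2 = max pool, 3 = avg pool,
# 5 = addition, 6 = concatenation, 7 = terminal.
APPEND_ACTIONS = {
    0: (1, 1), 1: (1, 3), 2: (1, 5),          # convolutions with kernel 1, 3, 5
    3: (2, 2), 4: (2, 3),                     # max pooling with pool size 2, 3
    5: (3, 2), 6: (3, 3),                     # average pooling with pool size 2, 3
    7: (7, 0), 8: (5, 0), 9: (6, 0),          # terminal, addition, concatenation
}

def apply_action(nsc_list, action, p1, p2):
    """Apply one of the 14 actions of Table 2 and return the new NSC list and pointers."""
    if action >= 10:                           # A10-A13: shift a predecessor pointer
        if action == 10: p1 += 1
        elif action == 11: p1 -= 1
        elif action == 12: p2 += 1
        elif action == 13: p2 -= 1
        return nsc_list, p1, p2
    layer_type, hyper = APPEND_ACTIONS[action]
    index = len(nsc_list) + 1
    pred1 = 0 if layer_type == 7 else p1       # terminal layers take no predecessor
    pred2 = p2 if layer_type in (5, 6) else 0  # only merge operations use the second predecessor
    return nsc_list + [[index, layer_type, hyper, pred1, pred2]], p1, p2
```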
In our work, an environment is a neural architecture design task for image classification on a specific dataset of interest. The goal for an agent in this environment is to come up with the best architecture possible after interacting for a certain number of time-steps. At any time-step t, the environment's state is x_t ∈ X, which is the NSC representation of a neural network N_t. The reward r_t ∈ [0, 1] associated with x_t is a function of the network's accuracy ACC_{N_t} ∈ [0, 1] (Section 4.1.2). The initial state x_0 of the environment is an empty architecture.

An agent can interact with the environment through a set of episodes by performing actions a_t ∈ A. In our terminology, an episode is the trajectory from a reset of the environment's state until a termination signal. The environment triggers the termination signal in the following cases: a) the predecessors p_1 and p_2 are out of bounds after the execution of a_t; b) a_t is a terminal action; c) x_t contains d NSC elements (the maximum depth) after performing a_t; d) the total number of actions executed in the current episode is higher than a given number τ; or e) the action led to an invalid architecture.
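The termination rules can be collected into a single check; the sketch below mirrors cases a) to e), with the validity of the architecture passed in as a flag since checking it requires building the network. Argument names and the bound used for the pointers are assumptions.

```python
def episode_terminated(nsc_list, last_action, p1, p2, steps_in_episode,
                       d=10, tau=10, architecture_is_valid=True):
    """Return True if any of the termination rules a)-e) holds."""
    pointers_out_of_bounds = not (0 <= p1 <= len(nsc_list)) or not (0 <= p2 <= len(nsc_list))
    return (pointers_out_of_bounds             # a) a predecessor pointer went out of bounds
            or last_action == 7                # b) the terminal action was chosen
            or len(nsc_list) >= d              # c) the maximum depth d was reached
            or steps_in_episode > tau          # d) the episode exceeded tau actions
            or not architecture_is_valid)      # e) the action led to an invalid architecture
```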
The agent-environment interaction process is formalized in Algorithm 1.

Algorithm 1: Agent-environment interaction
procedure INTERACT(Agent, Environment, Dataset, t_max, σ)
    done ← False
    t ← Environment.reset_to_initial_state()
    while t < t_max do
        a_t ← Agent.get_next_action()
        x_t ← Environment.update_state(a_t)
        N ← Environment.build_network(x_t)
        ACC_{N_t} ← N.accuracy(Dataset)
        if a_t is a shifting action then
            r_t ← σ · ACC_{N_t}
        else
            r_t ← ACC_{N_t}
        done ← Environment.is_termination()
        Agent.learn(x_t, a_t, r_t, done)
        if done then
            Environment.reset_to_initial_state()
            done ← False
As mentioned at the beginning of Section 4, we work with more than one environment. Specifically, we define five environments, each one associated with a different dataset sampled from the meta-dataset collection (Triantafillou et al., 2019). The datasets are listed in Table 3, and they were selected as explained in Appendix A. All datasets have balanced classes. In order to evaluate the accuracy of a network N_t, for any dataset we perform a deterministic 1/3 train-test split and follow the pre-processing initially proposed by the meta-dataset authors, so that the images are resized to a shape of 84 × 84 × 3.

Table 3: List of datasets considered for the environments (omniglot, vgg flower, and dtd for training; aircraft and cu birds for evaluation). They are sampled from the meta-dataset (Triantafillou et al., 2019) as explained in Appendix A.

Our deep meta-RL approach, illustrated in Figure 6, is based on the work of Wang et al. (2016) and Duan et al. (2016). They propose to learn a policy that, in addition to the state-action pairs of standard RL, uses the current time-step in the agent-environment interaction (the temporal information) as well as the previous action and reward. In this way, the agent can learn the relation between its past decisions and the current action. However, we introduce a modification in the temporal information, by considering the relative step within an episode instead of the global time-step, so that the agent can capture the relation between changes in a neural architecture.

Formally, let D be a set of Markov Decision Processes (MDPs). Consider an agent embedding a Recurrent Neural Network (RNN), with internal state h, modeling a policy π. At the start of a trial, a new task m_i ∈ D is sampled, and the internal state h is set to zeros (empty network). The agent then executes its action-selection strategy for a certain number t_max of discrete time-steps, performing n episodes of maximum length l depending on the environment's rules. At each step t (with 0 ≤ t ≤ t_max) an action a_t ∈ A is executed as a function of the observed history H_t = {x_0, a_0, r_0, c_0, ..., x_{t−1}, a_{t−1}, r_{t−1}, c_{t−1}, x_t}, i.e., the set of states {x_s}_{0≤s≤t} together with the actions, rewards, and episode-step counters up to step t − 1.

Figure 6: A graphic representation, inspired by the RL² illustration (Duan et al., 2016), of our deep meta-reinforcement learning framework. In this example, the trial consists of t_max = 6 time-steps, and the agent is able to complete two episodes of different length. c_s is a counter of the current step in the episode, and it gets reset at the start of any new episode. The initial states of the two episodes are shown to be different, although in practice the initial state of an episode could always be the same.

Similarly to Wang et al. (2016), we make use of the synchronous Advantage Actor-Critic (A2C) (Mnih et al., 2016) with one worker. As can be observed in Figure 7, the only change in the A2C network is in the input of the recurrent unit, so that the updates of the network's parameters remain unchanged.

Figure 7: Illustration of the meta-A2C architecture: the encoded and flattened state x_t is concatenated with the one-hot encoded previous action a_{t−1}, the previous reward r_{t−1}, and the episode-step counter c_t before being fed to the recurrent layer, which outputs the policy π(a_t | s_t; θ) and the value V(s_t; θ_v). In our implementation, the “State encoder” follows the procedure explained in Section 4.2.1, and the recurrent layer is an LSTM with 128 units.
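The only difference with respect to a plain A2C input is the concatenated vector s_t fed to the LSTM at every step; a small sketch of how such a vector can be assembled is shown below (cf. Figure 7). The flattening of the encoded state and the scalar encoding of the reward and counter are assumptions about details the figure does not fix.

```python
import numpy as np

def recurrent_input(encoded_state, prev_action, prev_reward, episode_step, n_actions=14):
    """Build s_t = [x_t, a_{t-1}, r_{t-1}, c_t], the input of the recurrent policy."""
    action_one_hot = np.zeros(n_actions)
    if prev_action is not None:              # no previous action at the very first step
        action_one_hot[prev_action] = 1.0
    return np.concatenate([
        encoded_state.ravel(),               # flattened one-hot NSC state
        action_one_hot,                      # previous action, one-hot encoded
        [prev_reward],                       # previous reward
        [float(episode_step)],               # step counter within the current episode
    ]).astype(np.float32)

# Example with a dummy encoded state of shape d x (2d + k + 11) = 10 x 36.
s_t = recurrent_input(np.zeros((10, 36)), prev_action=3, prev_reward=0.42, episode_step=2)
print(s_t.shape)                             # (360 + 14 + 1 + 1,) = (376,)
```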
Formally, let $t$ be the current time step, $s_t = x_t \cdot a_{t-1} \cdot r_{t-1} \cdot c_t$ a concatenation of inputs, $\pi(a_t \mid s_t; \theta)$ the policy, $V(s_t; \theta_v)$ the value function, $H$ the entropy, $j \in \mathbb{N}$ the horizon, $\gamma \in (0, 1]$ the discount factor, $\eta$ the regularization coefficient, and $R_t = \sum_{i=0}^{j-1} \gamma^{i} r_{t+i}$ the total accumulated return from time step $t$. The gradient of the objective function is:
$$\underbrace{\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\big(R_t - V(s_t; \theta_v)\big)}_{\text{Advantage estimate}} \;+\; \underbrace{\eta\, \nabla_\theta H\big(\pi(s_t; \theta)\big)}_{\text{Entropy regularization}} \qquad (1)$$
As is usually the case for A2C, the parameters $\theta$ and $\theta_v$ are shared except for the ones in the output layers. For a detailed description of the algorithm, we refer to the original paper (Mnih et al., 2016).

5. Evaluation framework

The current Neural Architecture Search (NAS) solutions lack a crucial element: an open-source framework for reproducibility and further research. Specifically for NAS with reinforcement learning, it would be desirable to build on a programming interface that allows researchers to explore the effect of different algorithms on the same NAS environment. In an attempt to fill this gap, we have developed the nasgym, a Python OpenAI Gym (Brockman et al., 2016) environment that can jointly be used with all the reinforcement learning algorithms exposed in the OpenAI baselines (Dhariwal et al., 2017).

We make use of the object-oriented paradigm to abstract the most essential elements of NAS as a reinforcement learning problem, resulting in a system that can be extended to perform new experiments, as displayed in Figure 8. Although the defaults in the nasgym are the elements in our methodology, the system allows us to easily modify the key components, such as the performance estimation strategy, the action space, or the Neural Structure Code space. We also provide an interface to use a database of experiments that can help to store previously computed rewards, thus reducing the computation time of future trials. All the deep learning components are built with TensorFlow v1.12 (Abadi et al., 2015).

8. Source code available at: github.com/gomerudo/nas-env

Figure 8: A sketch of the system built to perform our research. The nasgym package contains a default NAS environment whose states and actions are designed according to the Neural Structure Code (NSC) space defined in a .yml file. The hyperparameters for all the machine learning components are defined in a .ini file. Internally, the environment makes use of a dataset handler that reads TFRecords files and sends them as inputs to the neural architectures. A simple database of experiments is used to store experiments in a local file, although the logic can easily be extended to support a more robust database system.

In addition to the nasgym, we implement the meta version of the A2C algorithm on top of the OpenAI baselines. We believe that this software engineering effort will help to compare, reproduce, and develop future research in NAS.
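Because nasgym follows the OpenAI Gym interface, any agent from the OpenAI baselines can be plugged into the usual reset/step loop; the sketch below shows the intended usage pattern. The environment id and the exact registration and configuration mechanism are assumptions and may differ from the released package.

```python
import gym
import nasgym  # importing the package is assumed to register the NAS environment

env = gym.make("NAS-v1")                     # hypothetical environment id

observation = env.reset()
done, accumulated_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # replace with any OpenAI-baselines agent
    observation, reward, done, info = env.step(action)
    accumulated_reward += reward
print("accumulated reward for the episode:", accumulated_reward)
```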
6. Experiments

To evaluate our framework, we conduct three experiments. The first two aim to study the behavior of the agent when challenged to design chain-structured networks, and the third one is intended to observe its behavior in the multi-branch setting. We empirically assess the quality of the networks designed by the agent through episodes, the ability of the agent to adapt to each environment, and the runtimes of the training trials.

In the first experiment, the agent learns from the three train environments listed in Table 3, using deep meta-RL. It starts in the omniglot environment, continues in vgg flower, and finishes in dtd, so that it faces increasingly harder classification tasks (see Appendix A), and the policy learned in one environment is reused in the next one. The agent interacts with each environment for t_max = 8000, t_max = 7000, and t_max = 7000, respectively, so that the agent spends more time in the first environment to develop its initial knowledge. We compare against two baselines: random search and DeepQN with experience replay, where the agent learns a new policy on each environment (i.e., it does not re-use the policy between trials) for t_max = 6500, …, and 7000, respectively. Due to resource and time constraints, all t_max values were empirically selected according to the behaviour of the rewards (see Section 7). The most relevant hyper-parameters are set as follows:

- Environment
  - d = 10. The maximum depth of a neural architecture.
  - τ = 10. The maximum length of an episode.
- A2C hyper-parameters
  - j = 5. The number of steps to perform before updating the A2C parameters (see Section 4.2.5). We set the value to half the maximum depth of the networks to allow the agent to learn before the termination of an episode.
  - γ = 0.9. The discount factor for the past actions.
  - η = 0.01. The default in the OpenAI baselines (Dhariwal et al., 2017).
  - α: the learning rate, set as in Learning to reinforcement learn (Wang et al., 2016).
- DeepQN
  - Experience buffer size = t_max.
  - Target model's batch size = 20.
  - ε with linear decay from 1.0 to 0.1. The parameter controlling the exploration of the agent.
  - α: the learning rate.
- Training of the sampled networks
  - batch size = 128.
  - epochs = 12. The value used in BlockQNN (Zhong et al., 2018).

Experiment 2: evaluation of the policy. We fix the policy obtained in Experiment 1. The agent interacts with the evaluation environments, aircraft and cu birds, and deploys its decision-making strategy to design a neural architecture for each dataset. The interaction runs for t_max = 2000 to study the performance of the policy in short evaluation trials. At the end of the interaction, we select the best two architectures per environment (i.e., the ones with the highest reward) and train them on the same datasets, but applying a more intensive procedure as follows. First, we augment the capacity of the architectures by changing the number of filters in the convolution layers according to the layer's depth, i.e., number of units = 2^i with i being the current count of convolutions while building the network (e.g., number of units = 32 → … → 4096); we use a learning rate with exponential decay, and we train the network for 100 epochs. Since the datasets that we use are resized to a shape of 84x84x3, it is not fair to compare our resulting accuracy values with those of state-of-the-art architectures that assume a higher order of shape (Cui et al., 2018; Guo and Farrell, 2018; Hu and Qi, 2019), and neither is it fair to train our networks (which are designed for a given input size) with bigger images.

9. Source code available at: github.com/gomerudo/openai-baselines
Hence, based on the baselines of Hu and Qi (2019), we use a VGG-19 network (Liu and Deng, 2015) with only two blocks as our baseline on both datasets.

In the third experiment, we extend the search space to multi-branch architectures. We consider the omniglot environment only. The goal here is to observe the ability of the agent to design multi-branch networks through time, i.e., the number of multi-branch structures generated through training. The interaction runs for t_max = 20000 time-steps because more exploration is required due to the larger action space. The hyper-parameters are the same as in Experiment 1, except that τ = 20 and j = 10, because the trajectories are longer due to the shifting of the pointers controlling the predecessors, and batch size = 64, because the concatenation operation can generate networks that require more space in memory. We train the agent from scratch two times, varying the parameter σ (see Section 4.2.3) to study its effect in encouraging the generation of multi-branch structures.

7. Results

Experiment 1: evolution during training

Figure 9 shows the evolution of the best reward and the accumulated reward per episode (representing the quality of the neural architectures), as well as the episode length (in a chain-structured network this represents the number of layers). We observe that, in the first environment (omniglot), our deep meta-RL agent performs worse than DeepQN. Nevertheless, in the second and third environments (vgg flower, dtd), the agent performs better than the baselines from the very first steps, and more consistently through all episodes (showing less variance). DeepQN only catches up after many more episodes, although it exhibits a faster learning curve, which we ascribe to the linear exploration schedule that makes it end exploration sooner.

Figure 10 shows the entropy of the policy during training over the different environments, which in the A2C algorithm is related to the level of exploration by the agent (more exploration leads to high entropy). In the first environment, our agent explores the environment for a significant number of time-steps, which translates to the slow increase observed in Figure 9a. In the second environment, the exploration drops down quickly, except for a short period with increased exploration (time-steps 9005 to 12005). In the last and hardest environment, the agent re-explores the environment to adapt its strategy, leading to a reduction of the episode length (depth of the networks) and, consequently, the accumulated reward does not appear to increase due to the shorter episodes. We believe that exploration causes the drops in episode length in vgg flower and dtd (Figures 9e and 9h).
Figure 9: Evolution of training episodes through time from different perspectives (panels a-i), showing the means and ± one standard deviation.

In Figure 11, the proportion of the actions performed by the agent during training is shown. We see that it deployed different strategies per environment. Specifically, we note the changes in proportion for actions A0 (convolution with a kernel of size 1), A3 (max-pooling with pool size of 2), and A7 (the terminal state) when the environment switched from vgg flower to dtd, suggesting that the agent preferred different layers and depth according to the dataset.

Finally, Table 4 shows the running times per environment for each RL algorithm. Here, we do not observe significant differences considering that, once transferred, the policy of the deep meta-RL agent designs deeper and more costly networks, as observed in Figures 9e and 9h.

Figure 11: Proportion of actions performed by the agent per dataset in Experiment 1. The labels in the x-axis match the IDs in Table 2.

Dataset      Deep meta-RL   DQN
omniglot     11 days 9h     6 days 14h
vgg flower   7 days 23h     5 days 15h
dtd          6 days 17h     6 days 4h
Total        26 days 1h     18 days 9h

Table 4: Running times per dataset during training. All experiments ran on a single NVIDIA Tesla K40m GPU.

Experiment 2: evaluation of the policy

The results of replaying the learned policy on completely new datasets are displayed in Figure 12, and the corresponding runtimes are listed in Table 5. They show that the agent immediately finds a good solution (with a deep network), and rewards remain consistent; however, it does not improve over time, which warrants further study (see Section 8). Moreover, the strategy deployed by the agent is not different on each dataset, as observed in Figure 13. We confirm that the strategies are not significantly different by performing a Wilcoxon signed-rank test with the null hypothesis that the two related paired samples come from the same distribution. The output is a p-value = 0.48 with 95% confidence.

Figure 12: Evolution of evaluation episodes through time from different perspectives (panels a-f), showing the means and ± one standard deviation.

Dataset    Runtime
aircraft   2 days 6h
cu birds   2 days 22h
Total      5 days 4h

Table 5: Running times for the evaluation of the deep meta-RL agent. All experiments ran on a single NVIDIA Tesla K40m GPU.

As we mentioned in Section 6, another result of interest is the performance of the best networks designed by the agent when they follow more intensive training. Table 6 shows the accuracy values obtained. We note that the networks achieve low accuracy, in the majority of the cases worse than random guessing. An important observation is that these low values can be a consequence of the relaxation made in the shape of the images. Whereas state-of-the-art architectures on both aircraft and cu birds work with shapes greater than 200 × 200, we use 84 × 84, which might lead to loss of information. Moreover, state-of-the-art results for these datasets are usually obtained after data augmentation and use deeper and more complex networks with multi-branch structures (Cui et al., 2018; Guo and Farrell, 2018; Hu and Qi, 2019). However, in this experiment, we do not consider any of the latter aspects, since we work under resource constraints that force us to make relaxations, and thus a lower accuracy can be expected.

Despite the low values, the architectures for the two datasets designed by the deep meta-RL agent outperformed the shortened version of VGG19 by a significant amount. This shows that, by using the learned policy, it is possible to find better architectures than one inspired by state-of-the-art networks. A final observation is that the best architecture found by the agent during training did not become the best final network, thus exhibiting that early-stop can underestimate the long-term performance of the networks, which also warrants future work.

Dataset    Deep meta-RL (1st)   Deep meta-RL (2nd)   VGG19-like
aircraft   49.18 ± …            …                    …
cu birds   …                    …                    …

Table 6: Accuracy values of the best architectures after a more intense training. Every reported accuracy value is the mean ± one standard deviation.
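The Wilcoxon signed-rank test used above can be reproduced with scipy.stats; the paired samples below are made-up per-action proportions for the two evaluation datasets, included only to show the call, not the measured values.

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-action proportions for the two evaluation datasets.
aircraft_props = [0.30, 0.22, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03]
cu_birds_props = [0.28, 0.24, 0.14, 0.11, 0.09, 0.06, 0.04, 0.04]

statistic, p_value = wilcoxon(aircraft_props, cu_birds_props)
# A large p-value means we cannot reject the null hypothesis that both
# paired samples come from the same distribution.
print(statistic, p_value)
```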
Experiment 3: training on a more complex environment

Figure 14 shows the evolution of the best reward, episode length, and accumulated reward during the multi-branch experiment on omniglot. We do not observe differences in the behavior of the agent when using different σ values, but we note that it took longer to output meaningful rewards (around episode 3000) when compared to Experiment 1, causing extended runtimes as shown in Table 7.

Figure 14: Evolution of training episodes for the multi-branch experiment from different perspectives (panels a-c). The plots show the means and ± one standard deviation.

Table 7: Runtimes for the training of the deep meta-RL agent in the multi-branch search space for omniglot, for the two values of σ. All experiments ran on a single NVIDIA Tesla K40m GPU.

However, we stated in Section 6.2 that our main interest in this experiment is to study whether or not the agent can explore multi-branch structures. Figure 15a shows the entropy of the policy through time-steps, and Figure 15b the count of multi-branch structures through episodes. We note that during exploration the appearance of multi-branch structures is more likely, and after episode 3000 (represented by the vertical line in Figure 15a), when exploration drops down, the multi-branch structures become less frequent. Furthermore, we found that the proportion of multi-branch vs. chain-structured networks is only 1:10, meaning that the agent did not explore multi-branch structures aggressively, and settled for chain-structured networks instead. The latter is supported by the proportion of actions displayed in Figure 16, where the actions A8-A13 (related to multi-branch structures) are the least frequent.

Figure 15: The exploration of the agent through time. (a) The entropy of the policy through time-steps. The vertical line cuts the horizontal axis at the time-step where episode 3000 starts. (b) The count of multi-branch structures explored by the agent, showing the mean ± one standard deviation.

We believe that a multi-branch space requires us to handle differently how the predecessors of the NSC vectors are selected (see Section 4.2.2). Some alternatives are: defining heuristics to choose the predecessors instead of using the shifting operations, assigning other rewards to the actions related to the predecessors, and modifying the hyper-parameters of the A2C network to encourage more exploration of the agent at the beginning of the deep meta-RL training.

8. Conclusions and future work

In this work, we presented the first application of deep meta-RL in the NAS setting. Firstly, we investigated the advantages of deep meta-RL against standard RL in the relatively simple scenario of chain-structured architectures. Despite resource limitations (1 GPU only), we observed that a policy learned using deep meta-RL can be transferred to other environments and quickly designs architectures with higher and more consistent accuracy than standard RL. Nevertheless, standard RL outperforms meta-RL when both learn a policy from scratch. We also note that the meta-RL agent exhibited adaptive behavior during the training, changing its strategy according to the dataset in question. Secondly, we analyzed the adaptability of the agent during evaluation (i.e., when the policy's weights are fixed) and the quality of the networks that it designs for previously unseen datasets.
In our experiment, the agent was not able to adapt its strategy to different environments, but the performance of the networks delivered was better than the performance of a human-designed network, showing that the knowledge developed by our agent in the training environments is meaningful in others. Thirdly, we extended our approach to a more complex NAS scenario with a multi-branch search space. In this setting, the meta-RL agent was not able to deeply explore the multi-branch structures and settled for chain-structured networks instead.

We conclude that deep meta-RL does provide an advantage over standard RL when transferring is enabled, and it can effectively adapt its strategy to different environments during training. Moreover, the policy learned can be used to deliver meaningful and well-performing architectures on previously unseen environments without further training. We believe that it is possible to strengthen our deep meta-RL framework in future work. Specifically, we propose to investigate the following aspects under more powerful computational resources:

- Hyper-parameter tuning of the A2C components. In Experiment 1, we observed that the learning progress of the meta-RL agent is slow. We also noticed a long exploration window in the first environment. In order to improve these aspects, we propose to tune the hyper-parameters according to the next intuitions:
  - j: the parameter controlling the number of steps before a learning update. We suggest reducing this value to speed up learning.
  - η: the entropy regularization. Experiments varying the range of this hyper-parameter are required to observe its impact on the learning curve. Also, different values could be used depending on the hardness of the environments.
  - α: the learning rate. We suggest exploring decay functions for the learning rate to encourage faster learning after exploration.
- Duration of the agent-environment interaction. In Experiment 2, the policy did not exhibit adaptive behavior. A possible explanation is that the training trials were relatively short when compared to other reinforcement learning applications. Training the agent for longer trials could help improve the adaptation of the policy during evaluation.
- The action space in the multi-branch setting. In Experiment 3, we observed that the agent was not able to explore the multi-branch space sufficiently and settled for chain-structured networks instead. Although hyper-parameter tuning could also help encourage exploration of the multi-branching actions, we believe that redefining the actions is a more suitable area of improvement. In that line, we recommend exploring heuristics based on the number of connections to select the predecessors.
- The datasets and the performance estimation strategy. In all experiments, we observed a low accuracy of the networks on the datasets. Since we worked with constrained resources, we applied relaxations to the datasets and the performance estimation strategy to reduce the computational cost, which could have affected the accuracy of the networks. Future work can focus on designing a different set of environments with images of a smaller size, optimizing the performance estimation strategy per dataset, and investigating alternatives to reduce the cost of computing the rewards associated with the networks.
- Transforming other standard RL algorithms to a meta-RL version.
The transforma-tion of the A2C algorithm to a meta-RL implementation required to change the inputpassed to the policy and to rely on a recurrent unit to learn the temporal dependenciesbetween actions. This transformation is possible on other standard RL algorithms,which would help study different meta-RL approaches to NAS.- Benchmarking of other RL on the same NAS environments . In Section 5, we intro-duced the system developed to conduct our experiments, which allows to easily playother RL algorithms from the OpenAI baselines on the same NAS environments. Webelieve that this system will help to encourage research in these directions so that thebenefits of different RL algorithms on NAS can be studied in detail. Acknowledgments To the SURF cooperative for kindly providing the required computational resources. ToJane Wang for her valuable feedback to this work. References Mart´ın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, IanGoodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefow-icz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Man´e, Rajat Monga, SherryMoore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, IlyaSutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, FernandaVi´egas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, andXiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems,2015. URL http://tensorflow.org/ . Software available from tensorflow.org.I. Arel, C. Liu, T. Urbanik, and A. G. Kohls. Reinforcement learning-based multi-agentsystem for network traffic signal control. IET Intelligent Transport Systems , 4(2):128–135,June 2010. ISSN 1751-956X. doi: 10.1049/iet-its.2009.0070. . Gomez Robles and J. Vanschoren Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural networkarchitectures using reinforcement learning. CoRR , abs/1611.02167, 2016. URL http://arxiv.org/abs/1611.02167 .B. Bouzy and G. Chaslot. Monte-carlo go reinforcement learning experiments. In , pages 187–194, May 2006.doi: 10.1109/CIG.2006.311699.Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, JieTang, and Wojciech Zaremba. Openai gym, 2016.Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level networktransformation for efficient architecture search. CoRR , abs/1806.02639, 2018. URL http://arxiv.org/abs/1806.02639 .Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge J. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. CoRR , abs/1806.06193,2018. URL http://arxiv.org/abs/1806.06193 .Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, AlecRadford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai base-lines. https://github.com/openai/baselines , 2017.Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel.Rl : Fast reinforcement learning via slow reinforcement learning. CoRR , abs/1611.02779,2016. URL http://arxiv.org/abs/1611.02779 .Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: ASurvey. arXiv e-prints , art. arXiv:1808.05377, Aug 2018.Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations andgradient descent can approximate any learning algorithm. 
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. URL http://arxiv.org/abs/1703.03400.

Pei Guo and Ryan Farrell. Fine-grained visual categorization using PAIRS: Pose and appearance integration for recognizing subcategories. CoRR, abs/1801.09057, 2018. URL http://arxiv.org/abs/1801.09057.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.

Tao Hu and Honggang Qi. See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. CoRR, abs/1901.09891, 2019. URL http://arxiv.org/abs/1901.09891.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. ICLR: International Conference on Learning Representations, 2015.

Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. Int. J. Rob. Res., 32(11):1238–1274, September 2013. ISSN 0278-3649. doi: 10.1177/0278364913495721. URL http://dx.doi.org/10.1177/0278364913495721.

Vijaymohan Konda. Actor-critic Algorithms. PhD thesis, Cambridge, MA, USA, 2002. AAI0804543.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. CoRR, abs/1806.09055, 2018. URL http://arxiv.org/abs/1806.09055.

S. Liu and W. Deng. Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734, Nov 2015. doi: 10.1109/ACPR.2015.7486599.

Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets '16, pages 50–56, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4661-0. doi: 10.1145/3005745.3005750. URL http://doi.acm.org/10.1145/3005745.3005750.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, abs/1802.03268, 2018. URL http://arxiv.org/abs/1802.03268.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018. URL http://arxiv.org/abs/1802.01548.

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015. URL http://arxiv.org/abs/1502.05477.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.

Prabhant Singh, Tobias Jacobs, Sebastien Nicolas, and Mischa Schmidt. A study of the learning progress in neural architecture search techniques. arXiv e-prints, art. arXiv:1906.07590, Jun 2019.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2012.
Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-Dataset: A dataset of datasets for learning to learn from few examples. CoRR, abs/1903.03096, 2019. URL http://arxiv.org/abs/1903.03096.

Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matthew Botvinick. Learning to reinforcement learn. CoRR, abs/1611.05763, 2016. URL http://arxiv.org/abs/1611.05763.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, May 1992. ISSN 1573-0565. doi: 10.1007/BF00992698. URL https://doi.org/10.1007/BF00992698.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.

Zhao Zhong, Zichen Yang, Boyang Deng, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. BlockQNN: Efficient block-wise neural network architecture generation. CoRR, abs/1808.05584, 2018. URL http://arxiv.org/abs/1808.05584.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017. URL http://arxiv.org/abs/1707.07012.

Appendix A. Selection of the datasets

The deep meta-reinforcement learning framework that we implement requires a set of environments associated with image classification tasks. In order to design these environments, we rely on the meta-dataset (Triantafillou et al., 2019), a collection of 10 datasets with a concrete sampling procedure designed for meta-learning in few-shot image classification. In our setting, the datasets are intended for standard image classification; thus, we redefine the sampling strategy. Our interest is in using small yet challenging datasets that allow us to save computational resources without making the Neural Architecture Search (NAS) problem trivial.

In Table 8 the original datasets in the collection are listed. We select the ones that are smaller than CIFAR-10 (60K observations), which is the reference for NAS. The datasets satisfying this criterion are aircraft, cu birds, dtd, omniglot, traffic sign and vgg flower. We want to evaluate the hardness of these six datasets to define a sampling procedure from the collection, and thus we perform a short, individual deep meta-reinforcement learning trial with t_max = 200 for each dataset. Since at the beginning of the trial the agent has not yet developed any significant knowledge, its sampling of architectures is random. In Figure 17 the boxplot and barplot of the obtained accuracy values are presented, and in Table 9 the running time per experiment is shown.

A simple exploratory analysis suggests three types of datasets: a "trivial" dataset with high accuracy values even with simple networks (traffic sign), two "hard" datasets with low accuracy values (all values below 30%: dtd and cu birds), and three "medium" datasets with more diverse accuracy values (median around 30% and a broader interquartile range: aircraft, omniglot, vgg flower). For the running times, we observe that aircraft and cu birds result in the most expensive runs. Considering the computation time and the hardness of the classification tasks, we defined the sampling presented in Table 3. Our training datasets thus have different levels of hardness and correspond to the least costly runs.
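The grouping described above can be made explicit with a small sketch. The cut-off for the "trivial" group and the example accuracy values are assumptions for illustration; only the "all values below 30%" rule for the hard group is taken from the analysis above.

```python
# Minimal sketch of the hardness grouping used in the exploratory analysis.
# Accuracy values are in percent; the example lists are placeholders, not
# the measured values.

from statistics import median

def hardness(early_stop_accuracies):
    """Label a dataset as trivial, medium, or hard from random-architecture accuracies."""
    if max(early_stop_accuracies) < 30.0:
        return "hard"      # all values below 30%
    if median(early_stop_accuracies) > 60.0:
        return "trivial"   # high accuracy even with simple networks (assumed cut-off)
    return "medium"        # median around 30%, broader interquartile range

trial_accuracies = {            # early-stop accuracies from a short trial (t_max = 200)
    "traffic_sign": [70.2, 81.5, 76.0],   # placeholder values
    "dtd":          [12.4, 22.1, 18.7],
    "omniglot":     [25.3, 44.9, 33.0],
}

for name, accuracies in trial_accuracies.items():
    print(name, hardness(accuracies))
```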
Dataset ID     Dataset name                                   N classes   N observations
aircraft       FGVC-Aircraft                                  100         10000
cu birds       CUB-200-2011                                   200         11788
dtd            Describable Textures                           47          5640
fungi          FGVCx Fungi                                    1394        89760
ilsvrc 2012    ImageNet                                       1000        1280764
mscoco         Common Objects in Context                      80          330000
omniglot       Omniglot                                       1623        32460
quickdraw      Quick, Draw!                                   345         50426266
traffic sign   German Traffic Sign Recognition Benchmark      43          39209
vgg flower     VGG Flower                                     102         8189

Table 8: The original meta-dataset (Triantafillou et al., 2019) with the number of classes and observations after conversion with the official source code.

Dataset ID     Time
aircraft       9h49m
cu birds       16h20m
dtd            5h38m
omniglot       3h38m
traffic sign   4h33m
vgg flower     4h56m

Table 9: Running times of a deep meta-RL trial with t_max = 200, used to study the hardness and cost of each dataset.

Figure 17: Different visualizations (boxplot and barplot) of the early-stop accuracy values obtained to study the hardness of the datasets.

Appendix B. Networks designed by the deep meta-RL agent during training and evaluation

Here we show the best architectures designed by the agent in the three experiments. Figure 18 shows the best architecture per dataset during training (omniglot, vgg flower, and dtd). Figures 19 and 20 show the best two architectures during evaluation for aircraft and cu birds, respectively. Figure 21 shows the best architectures for the multi-branch experiment. For each architecture we report the early-stop accuracy obtained.

Figure 18: Best architectures designed for the training datasets.
(a) The best architecture for omniglot, with early-stop accuracy of 67.11: Input (84x84) → Convolution k=5 (80x80) → MaxPooling p=3 (26x26) → Convolution k=1 (26x26) → Convolution k=5 (22x22) → Convolution k=1 (22x22) → Convolution k=5 (18x18) → Convolution k=1 (18x18) → MaxPooling p=2 (9x9) → AvgPooling p=3 (3x3).
(b) The best architecture for vgg flower, with early-stop accuracy of 57.15: Input (84x84) → Convolution k=3 (82x82) → Convolution k=3 (80x80) → MaxPooling p=2 (40x40) → MaxPooling p=2 (20x20) → MaxPooling p=2 (10x10) → MaxPooling p=2 (5x5).
(c) The best architecture for dtd, with early-stop accuracy of 29.43: Input (84x84) → Convolution k=3 (82x82) → Convolution k=1 (82x82) → Convolution k=3 (80x80) → MaxPooling p=2 (40x40) → MaxPooling p=2 (20x20) → MaxPooling p=2 (10x10) → MaxPooling p=2 (5x5).

Figure 19: Best architectures designed for aircraft during evaluation of the policy.
(a) The best architecture, with early-stop accuracy of 48.22: Input (84x84) → Convolution k=3 (82x82) → Convolution k=5 (78x78) → MaxPooling p=2 (39x39) → MaxPooling p=2 (19x19).
(b) The second-best architecture, with early-stop accuracy of 47.95: Input (84x84) → Convolution k=1 (84x84) → Convolution k=3 (82x82) → Convolution k=3 (80x80) → MaxPooling p=2 (40x40) → MaxPooling p=3 (13x13).
Figure 20: Best architectures designed for cu birds during evaluation of the policy.
(a) The best architecture, with early-stop accuracy of 19.22: Input (84x84) → Convolution k=1 (84x84) → Convolution k=3 (82x82) → Convolution k=3 (80x80) → MaxPooling p=2 (40x40) → Convolution k=3 (38x38) → MaxPooling p=2 (19x19) → MaxPooling p=2 (9x9) → MaxPooling p=2 (4x4) → MaxPooling p=2 (2x2).
(b) The second-best architecture, with early-stop accuracy of 19.06: Input (84x84) → Convolution k=3 (82x82) → Convolution k=1 (82x82) → Convolution k=3 (80x80) → Convolution k=5 (76x76) → MaxPooling p=2 (38x38) → MaxPooling p=2 (19x19) → MaxPooling p=2 (9x9) → MaxPooling p=2 (4x4) → MaxPooling p=2 (2x2).

Figure 21: Best architectures designed during the experiment in a multi-branch search space.
(a) The best architecture when σ = 0.0, with early-stop accuracy of 66.10. Layers, merged by a final Concat: Input (84x84), Convolution k=5 (80x80), Convolution k=5 (76x76), AvgPooling p=3 (25x25), Convolution k=5 (21x21), MaxPooling p=2 (10x10), MaxPooling p=2 (10x10), MaxPooling p=2 (5x5), Convolution k=5 (6x6), MaxPooling p=2 (2x2), AvgPooling p=3 (3x3), Concat (3x3).
(b) The best architecture when σ = 0..: Input (84x84) → AvgPooling p=2 (42x42) → Convolution k=5 (38x38) → Convolution k=3 (36x36) → MaxPooling p=2 (18x18) → Convolution k=3 (16x16) → MaxPooling p=2 (8x8) → MaxPooling p=2 (4x4) → Convolution k=1 (4x4).
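To connect these designs back to code, the following sketch shows one possible way to instantiate the chain-structured network of Figure 18(a), the best omniglot architecture, with tf.keras. Only the layer order and the spatial sizes follow the figure; the number of filters, the activations, the input channels, and the classification head are assumptions, since the figure does not specify them.

```python
# Hedged sketch of the Figure 18(a) chain: layer order and spatial sizes
# follow the figure; filters, activations, channels, and the dense head
# are illustrative assumptions.

import tensorflow as tf

def build_omniglot_chain(num_classes=1623, channels=3, filters=32):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(84, 84, channels)),
        tf.keras.layers.Conv2D(filters, 5, activation="relu"),  # 80x80
        tf.keras.layers.MaxPooling2D(pool_size=3),              # 26x26
        tf.keras.layers.Conv2D(filters, 1, activation="relu"),  # 26x26
        tf.keras.layers.Conv2D(filters, 5, activation="relu"),  # 22x22
        tf.keras.layers.Conv2D(filters, 1, activation="relu"),  # 22x22
        tf.keras.layers.Conv2D(filters, 5, activation="relu"),  # 18x18
        tf.keras.layers.Conv2D(filters, 1, activation="relu"),  # 18x18
        tf.keras.layers.MaxPooling2D(pool_size=2),               # 9x9
        tf.keras.layers.AveragePooling2D(pool_size=3),           # 3x3
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_omniglot_chain()
model.summary()
```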