Neural Ensemble Search for Uncertainty Estimation and Dataset Shift

Sheheryar Zaidi*, Arber Zela*, Thomas Elsken, Chris Holmes, Frank Hutter, Yee Whye Teh
University of Oxford, University of Freiburg, Bosch Center for Artificial Intelligence
{szaidi, cholmes, y.w.teh}@stats.ox.ac.uk, {zelaa, elsken, fh}@cs.uni-freiburg.de
* Equal contribution.
Abstract
Ensembles of neural networks achieve superior performance compared to stand-alone networks not only in terms of accuracy on in-distribution data but also on data with distributional shift, alongside improved uncertainty calibration. Diversity among networks in an ensemble is believed to be key for building strong ensembles, but typical approaches only ensemble different weight vectors of a fixed architecture. Instead, we investigate neural architecture search (NAS) for explicitly constructing ensembles to exploit diversity among networks of varying architectures and to achieve robustness against distributional shift. By directly optimizing ensemble performance, our methods implicitly encourage diversity among networks, without the need to explicitly define diversity. We find that the resulting ensembles are more diverse compared to ensembles composed of a fixed architecture, and are therefore also more powerful. We show significant improvements in ensemble performance on image classification tasks both for in-distribution data and during distributional shift, with better uncertainty calibration.
1 Introduction

Automatically learning useful representations of data using deep neural networks has been successful across various tasks [30, 25, 40], leading to the ubiquitous deployment of neural networks. While some applications rely only on the predictions made by a neural network, many critical applications also require reliable predictive uncertainty estimates and robustness under the presence of distributional shift in the data observed at test time relative to the training data. Examples include medical imaging [16] and self-driving cars [6]. However, several studies have shown that neural networks are not always robust to dataset shift [43, 24], nor do they exhibit calibrated predictive uncertainty, resulting in overconfident and incorrect predictions [20].

Using an ensemble of networks rather than a stand-alone network is a strong baseline both in terms of predictive uncertainty calibration and robustness to dataset shift. Ensembles also outperform approximate Bayesian methods [33, 43, 21]. Their success is usually attributed to the diversity among the base learners; however, there are various definitions of diversity [32, 63] without a consensus. In practice, ensembles are usually constructed by choosing a fixed state-of-the-art architecture and creating base learners either by independently training random initializations of it (called deep ensembles [33]) or by picking various checkpoints of a single training trajectory [27, 37].

However, as we show, base learners with varying architectures make more diverse predictions. Therefore, picking a fixed architecture for the ensemble's base learners neglects diversity in favor of base learner strength. This has implications for the ensemble performance, since both diversity and base learner strength are important. To overcome this, we propose Neural Ensemble Search (NES); a NES algorithm finds a set of diverse neural architectures that together form a strong ensemble. By directly optimizing ensemble loss while maintaining independent training of base learners, a NES algorithm implicitly encourages diversity, without the need for explicitly defining the notion of diversity. In detail, our contributions are as follows:

1. We show that ensembles composed of varying architectures perform better than ensembles composed of a fixed architecture. We demonstrate that this is due to increased diversity among the ensemble's base learners (Sections 3 and 5).
2. Based on these findings and the importance of diversity, we propose two algorithms for Neural Ensemble Search: NES-RS and NES-RE. NES-RS is a simple random search based algorithm, and NES-RE is based on regularized evolution [44]. Both search algorithms seek performant ensembles with varying base learner architectures (Section 4).
3. Through experiments on image classification tasks, we evaluate the ensembles found by NES-RS and NES-RE from the point of view of both predictive performance and uncertainty calibration. We also compare to the baseline of deep ensembles with fixed, optimized architectures. We find our ensembles outperform deep ensembles not only on in-distribution data but also during dataset shift, with better predictive performance and uncertainty calibration (Section 5).

The code for our experiments is available at: https://github.com/automl/nes.

2 Related Work

Ensemble Learning.
Ensembles of neural networks [22, 31, 11] are commonly used to boost performance [50, 47, 23]. In practice, strategies for building ensembles include the popular approach of independently training multiple initializations of the same network, training base learners on bootstrap samples of the training data (i.e. bagging) [64], joint training with diversity-encouraging losses [34, 62, 52], and using checkpoints during the training trajectory of a network [27, 37]. We focus on ensembles of independently trained base learners, as this is a simple approach leading to strong ensembles (see [34] for a comparison of ensembling approaches). Diversity is believed to be key for successful ensembles, and various proposals have been made for its definition [32], yet there is no consensus [63]. Much recent interest in ensembles has been due to their strong predictive uncertainty estimation [33], with extensive empirical studies observing that ensembles outperform other approaches for uncertainty estimation, notably including approximate Bayesian methods [43, 21].
Neural Architecture Search.
Neural Architecture Search (NAS), the process of automatically designing neural network architectures, is a natural next step for automating the learning of representations with neural networks [15]. Existing strategies using reinforcement learning [2, 61, 65, 66], evolutionary algorithms [1, 14, 44, 45, 48] or Bayesian optimization [29, 39, 42, 53] have demonstrated that NAS can find architectures that surpass hand-crafted ones on a variety of tasks. A recent focus in NAS research has been on computational efficiency, with algorithms utilizing gradient-based optimization [8, 12, 36, 55, 59, 56], network morphisms [7, 13, 14], or multi-fidelity optimization [3, 17, 60].

A recent line of research which is closest to ours connects ensemble learning and NAS. Methods proposed by Cortes et al. [10] and Macko et al. [38] iteratively add (sub-)networks to an ensemble to improve the ensemble's performance. While our work focuses on generating a diverse set of architectures that perform well in an ensemble, while essentially fixing the way an ensemble is built from its base learners, the focus of [10] and [38] is on how to build the ensemble. As a consequence, the space of neural networks they consider is limited in contrast to our work: [10] only considers fully-connected neural networks, and [38] only uses NASNet-A [66] building blocks with varying depth and number of filters. Interestingly, [38] employs knowledge distillation [26], which is also used in NAS research [14, 46]. However, this can actively discourage diversity in predictions; therefore, we do not consider it in our work. Rather than iteratively growing an ensemble, Bian et al. [5] propose to prune redundant (sub-)networks in an ensemble without significant loss in performance by utilizing diversity. While all aforementioned work only focuses on performance for in-distribution data, we additionally consider dataset shift as well as uncertainty calibration.
3 Ensembles of Varying Architectures are More Diverse
After introducing our notation and problem set-up, we discuss diversity in ensembles of neural networks and provide empirical evidence that networks with varying architectures make more diverse predictions than networks with a fixed architecture trained multiple times.
Let $\mathcal{D}_\text{train} = \{(x_i, y_i) : i = 1, \dots, N\}$ be the training dataset, where the input $x_i \in \mathbb{R}^D$ and, assuming a classification task, the output $y_i \in \{1, \dots, C\}$. We use $\mathcal{D}_\text{val}$ and $\mathcal{D}_\text{test}$ for the validation and test datasets, respectively. Denote by $f_\theta$ a neural network with weights $\theta$, so $f_\theta(x) \in \mathbb{R}^C$ is the predicted probability vector over the classes for input $x$. Let $\ell(f_\theta(x), y)$ be the neural network's loss for data point $(x, y)$. Given $M$ networks $f_{\theta_1}, \dots, f_{\theta_M}$, we construct the ensemble $F$ of these networks by averaging the outputs, yielding $F(x) = \frac{1}{M} \sum_{i=1}^{M} f_{\theta_i}(x)$.

In addition to the ensemble's loss $\ell(F(x), y)$, we will also consider the average base learner loss and the oracle ensemble's loss. The average base learner loss is simply defined as $\frac{1}{M} \sum_{i=1}^{M} \ell(f_{\theta_i}(x), y)$; we use this to measure the average base learner strength. Similar to [34, 62], the oracle ensemble $F_\text{OE}$ composed of base learners $f_{\theta_1}, \dots, f_{\theta_M}$ is defined to be the function which, given an input $x$, returns the prediction of the base learner with the smallest loss for $(x, y)$, that is, $F_\text{OE}(x) = f_{\theta_k}(x)$, where $k \in \operatorname{argmin}_i \ell(f_{\theta_i}(x), y)$.

Of course, the oracle ensemble can only be constructed if the true class $y$ is known. We use the oracle ensemble loss as a measure of the diversity in base learner predictions. Intuitively, if base learners make diverse predictions for $x$, the oracle ensemble is more likely to find some base learner with a small loss, whereas if all base learners make identical predictions, the oracle ensemble yields the same output as any (and all) base learners. Therefore, as a rule of thumb, small oracle ensemble loss indicates more diverse base learner predictions.
Proposition 3.1. Suppose $\ell$ is negative log-likelihood (NLL). Then, the oracle ensemble loss, ensemble loss, and average base learner loss satisfy the following inequality:
$$\ell(F_\text{OE}(x), y) \;\leq\; \ell(F(x), y) \;\leq\; \frac{1}{M} \sum_{i=1}^{M} \ell(f_{\theta_i}(x), y).$$

We refer to Appendix A for a proof. Proposition 3.1 suggests that strong ensembles require not only strong average base learners (smaller upper bound), but also more diversity in their predictions (smaller lower bound). There is extensive theoretical work relating strong base learner performance and diversity with the generalization properties of ensembles [22, 63, 28, 4, 19]. Notably, Breiman [49] showed that the generalization error of random forests depends on the strength of individual trees and the correlation between their mistakes.
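As a sanity check on Proposition 3.1, the following minimal sketch (our illustration, not the paper's released code; all names and shapes are assumptions) computes the three losses on synthetic softmax predictions and verifies the inequality pointwise:

```python
import numpy as np

def nll(probs, y):
    # Negative log-likelihood of the true class, per example.
    return -np.log(probs[np.arange(len(y)), y])

# Toy setup: M base learners, N examples, C classes (illustrative only).
rng = np.random.default_rng(0)
M, N, C = 5, 1000, 10
logits = rng.normal(size=(M, N, C))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
y = rng.integers(0, C, size=N)

ensemble_loss = nll(probs.mean(0), y)               # l(F(x), y)
base_losses = np.stack([nll(p, y) for p in probs])  # l(f_i(x), y)
avg_base_loss = base_losses.mean(0)                 # average base learner loss
oracle_loss = base_losses.min(0)                    # l(F_OE(x), y)

# Proposition 3.1: oracle <= ensemble <= average, pointwise.
assert np.all(oracle_loss <= ensemble_loss + 1e-12)
assert np.all(ensemble_loss <= avg_base_loss + 1e-12)
```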
In practice, ensembles of neural networks are usually made by independently training $M$ random initializations of a network with a fixed architecture on the same training dataset. This procedure is called deep ensembles, and has been empirically observed to yield performant and calibrated ensembles [33, 43].

The fixed architecture used to build deep ensembles is typically chosen to be a strong stand-alone architecture, either hand-crafted or found by NAS. However, since ensemble performance depends not only on strong base learners but also on their diversity, optimizing the base learner's architecture and then constructing a deep ensemble neglects diversity in favor of strong base learner performance. Having base learner architectures vary allows more diversity in their predictions. In this section, we provide empirical evidence for this by visualizing the base learners' predictions. Fort et al. [18] found, by applying dimensionality reduction to the predictions, that base learners in a deep ensemble explore different parts of the function space. Building on this, we uniformly sample five architectures from the DARTS search space [36], train 20 initializations of each architecture on CIFAR-10, and visualize the similarity among the networks' predictions on the test dataset using t-SNE [51]. Experiment details are available in Section 5 and Appendix B.
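A minimal sketch of this visualization, assuming the predictions of each trained network on the test set have been flattened into a single vector (the helper name and inputs are hypothetical):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# preds: one row per trained network, formed by flattening its predicted
# class probabilities over the test set, shape (num_networks, N * C).
# arch_ids: integer architecture label per network (hypothetical inputs).
def plot_prediction_similarity(preds, arch_ids):
    emb = TSNE(n_components=2, random_state=0).fit_transform(preds)
    plt.scatter(emb[:, 0], emb[:, 1], c=arch_ids, cmap="tab10")
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()
```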
Figure 1: t-SNE visualization of base learner predictions. (a) Five different architectures, each trained with 20 different initializations. (b) Predictions of base learners in two ensembles, one with a fixed architecture and one with varying architectures.

As shown in Figure 1a, we observe clustering of predictions made by different initializations of a fixed architecture, suggesting that base learners with varying architectures explore different parts of the function space. Moreover, we also visualize the predictions of base learners of two ensembles, each of size $M = 30$, where one is a deep ensemble (i.e. a fixed architecture trained multiple times) and the other has varying architectures (found by NES-RS, which will be introduced in Section 4). Figure 1b shows more diversity in the ensemble with varying architectures than in the one with a fixed architecture.

4 Neural Ensemble Search

In this section, we propose the concept of neural ensemble search (NES). In summary, a NES algorithm optimizes the architectures of base learners in an ensemble to minimize ensemble loss. Given a network $f : \mathbb{R}^D \to \mathbb{R}^C$, let $L(f, \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \ell(f(x), y)$ be the loss of $f$ over dataset $\mathcal{D}$. Given a set of base learners $\{f_1, \dots, f_M\}$, let $\text{Ensemble}$ be the function which maps $\{f_1, \dots, f_M\}$ to the ensemble $F = \frac{1}{M} \sum_{i=1}^{M} f_i$ as defined in Section 3. To emphasize the architecture, we use the notation $f_{\theta, \alpha}$ to denote a network with architecture $\alpha \in \mathcal{A}$ and weights $\theta$, where $\mathcal{A}$ is a space of architectures. A NES algorithm aims to solve the following optimization problem:
$$\min_{\alpha_1, \dots, \alpha_M \in \mathcal{A}} \; L\big(\text{Ensemble}(f_{\theta_1, \alpha_1}, \dots, f_{\theta_M, \alpha_M}), \mathcal{D}_\text{val}\big) \qquad (1)$$
$$\text{s.t.} \quad \theta_i \in \operatorname{argmin}_\theta L(f_{\theta, \alpha_i}, \mathcal{D}_\text{train}) \quad \text{for } i = 1, \dots, M.$$

Eq. 1 is difficult to solve for at least two reasons. First, we are optimizing over $M$ architectures, so the search space is effectively $\mathcal{A}^M$, compared to it being $\mathcal{A}$ in typical NAS, making it more difficult to explore fully. Second, a larger search space also increases the risk of overfitting the ensemble loss to $\mathcal{D}_\text{val}$. A possible approach here is to consider the ensemble as a single large network to which we apply NAS, but joint training of an ensemble through a single loss has been empirically observed to underperform training base learners independently, especially for large neural networks [52]. Instead, our general approach to solve Eq. 1 consists of two steps:

1. Pool building: build a pool $\mathcal{P} = \{f_{\theta_1, \alpha_1}, \dots, f_{\theta_K, \alpha_K}\}$ of size $K$ consisting of potential base learners, where each $f_{\theta_i, \alpha_i}$ is a network trained independently on $\mathcal{D}_\text{train}$.
2. Ensemble selection: select $M$ base learners $f_{\theta_1^*, \alpha_1^*}, \dots, f_{\theta_M^*, \alpha_M^*}$ from $\mathcal{P}$ to form an ensemble which minimizes loss on $\mathcal{D}_\text{val}$. (We assume $K \geq M$.)

Step 1 reduces the options for the base learner architectures, with the intention to make the search more feasible and focus on strong architectures. Step 2 then selects a performant ensemble which implicitly encourages base learner strength and diversity. This procedure also ensures that the ensemble's base learners are trained independently. We propose using forward step-wise selection for step 2; that is, given the set of networks $\mathcal{P}$, we start with an empty ensemble and add to it the network from $\mathcal{P}$ which minimizes ensemble loss on $\mathcal{D}_\text{val}$. We repeat this without replacement until the ensemble is of size $M$. Let $\text{ForwardSelect}(\mathcal{P}, \mathcal{D}_\text{val}, M)$ denote the set of $M$ base learners selected from $\mathcal{P}$ by this procedure. Note that selecting the ensemble from $\mathcal{P}$ is a combinatorial optimization problem; a greedy approach such as ForwardSelect is nevertheless effective [9], while keeping computational overhead low, given the predictions of the networks on $\mathcal{D}_\text{val}$.
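A minimal sketch of ForwardSelect as described above, operating on cached validation predictions (function and argument names are our assumptions, not the released implementation):

```python
import numpy as np

def mean_nll(probs, y):
    # Mean negative log-likelihood of the true class over the dataset.
    return -np.log(probs[np.arange(len(y)), y]).mean()

def forward_select(val_preds, y_val, M):
    """Greedy forward step-wise selection without replacement.

    val_preds: array of shape (K, N, C) with each network's predicted
    probabilities on the validation set. Returns indices of M base learners.
    """
    selected, remaining = [], list(range(len(val_preds)))
    for _ in range(M):
        # Add the network whose inclusion minimizes ensemble validation NLL.
        best = min(
            remaining,
            key=lambda i: mean_nll(val_preds[selected + [i]].mean(0), y_val),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```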
We also experimented with three other ensemble selection algorithms: (1) starting with the best network by validation performance, add the next best network to the ensemble only if it improves validation performance, iterating until the ensemble size is $M$ or all models have been considered (this approach returns an ensemble of size at most $M$); (2) select the top $M$ networks by validation performance; (3) forward step-wise selection with replacement. We typically found that these three performed comparably to or worse than our choice, ForwardSelect.

Figure 2: Illustration of one iteration of NES-RE. Network architectures are represented as colored bars of different lengths illustrating different layers and widths. Starting with the current population, ensemble selection is applied to select parent candidates, among which one is sampled as the parent. A mutated copy of the parent is added to the population, and the oldest member is removed.

We have not yet discussed the algorithm for building the pool in step 1; we propose two approaches, NES-RS (Section 4.1) and NES-RE (Section 4.2). NES-RS is a simple random search based algorithm, while NES-RE is based on regularized evolution [44], a state-of-the-art NAS algorithm. Note that while gradient-based NAS methods have recently become popular, they are not naively applicable in our setting, as the base learner selection component ForwardSelect is typically non-differentiable.
4.1 NES with Random Search

In NAS, random search (RS) is a competitive baseline on carefully designed architecture search spaces [35, 57, 58]. Motivated by its success and simplicity, we first introduce NES with random search (NES-RS). NES-RS builds the pool $\mathcal{P}$ by independently sampling architectures uniformly from the search space $\mathcal{A}$ (and training them). Since the architectures of networks in $\mathcal{P}$ vary, applying ensemble selection is a simple way to exploit diversity, yielding a performant ensemble. Algorithm 1 describes NES-RS in pseudocode.

Algorithm 1: NES with Random Search
Data: Search space $\mathcal{A}$; ensemble size $M$; comp. budget $K$; $\mathcal{D}_\text{train}$, $\mathcal{D}_\text{val}$.
1. Sample $K$ architectures $\alpha_1, \dots, \alpha_K$ independently and uniformly from $\mathcal{A}$.
2. Train each architecture $\alpha_i$ using $\mathcal{D}_\text{train}$, yielding a pool of networks $\mathcal{P} = \{f_{\theta_1, \alpha_1}, \dots, f_{\theta_K, \alpha_K}\}$.
3. Select base learners $\{f_{\theta_1^*, \alpha_1^*}, \dots, f_{\theta_M^*, \alpha_M^*}\} = \text{ForwardSelect}(\mathcal{P}, \mathcal{D}_\text{val}, M)$ by forward step-wise selection without replacement.
4. Return ensemble $\text{Ensemble}(f_{\theta_1^*, \alpha_1^*}, \dots, f_{\theta_M^*, \alpha_M^*})$.

4.2 NES with Regularized Evolution

A more guided approach for building the pool $\mathcal{P}$ is using regularized evolution (RE) [44]. While RS has the benefit of simplicity, by sampling architectures uniformly, the resulting pool might contain many weak architectures, leaving few strong architectures for ForwardSelect to choose between. Therefore, NES-RS might require a large pool in order to explore interesting parts of the search space. RE is an evolutionary algorithm used for NAS which explores the search space by evolving a population of architectures. In summary, RE starts with a randomly initialized fixed-size population of architectures. At each iteration, a subset of size $m$ of the population is sampled, from which the best network by validation loss is selected as the parent. A mutated copy of the parent architecture, called the child, is trained and added to the population, and the oldest member of the population is removed, preserving the population size. This is iterated until the computational budget is reached, returning the history, i.e. all the networks evaluated during the search.

Based on RE for NAS, we propose NES-RE to build the pool of potential base learners. NES-RE starts by randomly initializing a population $p$ of size $P$. At each iteration, we apply ForwardSelect to the population to select an ensemble of size $m$, and we uniformly sample one base learner from the ensemble to be the parent. A mutated copy of the parent is added to $p$, and the oldest network is removed, as in regularized evolution. This process is repeated until the computational budget is reached, and the history is returned as the pool $\mathcal{P}$. See Algorithm 2 for pseudocode and Figure 2 for an illustration.
Algorithm 2: NES with Regularized Evolution
Data: Search space $\mathcal{A}$; ensemble size $M$; comp. budget $K$; $\mathcal{D}_\text{train}$, $\mathcal{D}_\text{val}$; population size $P$; number of parent candidates $m$.
1. Sample $P$ architectures $\alpha_1, \dots, \alpha_P$ independently and uniformly from $\mathcal{A}$.
2. Train each architecture $\alpha_i$ using $\mathcal{D}_\text{train}$, and initialize $p = \mathcal{P} = \{f_{\theta_1, \alpha_1}, \dots, f_{\theta_P, \alpha_P}\}$.
3. While $|\mathcal{P}| < K$ do:
4.   Select $m$ parent candidates $\{f_{\widetilde{\theta}_1, \widetilde{\alpha}_1}, \dots, f_{\widetilde{\theta}_m, \widetilde{\alpha}_m}\} = \text{ForwardSelect}(p, \mathcal{D}_\text{val}, m)$.
5.   Sample uniformly a parent architecture $\alpha$ from $\{\widetilde{\alpha}_1, \dots, \widetilde{\alpha}_m\}$. // parent stays in $p$
6.   Apply mutation to $\alpha$, yielding child architecture $\beta$.
7.   Train $\beta$ using $\mathcal{D}_\text{train}$ and add the trained network $f_{\theta, \beta}$ to $p$ and $\mathcal{P}$.
8.   Remove the oldest member in $p$. // as done in RE [44]
9. Select base learners $\{f_{\theta_1^*, \alpha_1^*}, \dots, f_{\theta_M^*, \alpha_M^*}\} = \text{ForwardSelect}(\mathcal{P}, \mathcal{D}_\text{val}, M)$ by forward step-wise selection without replacement.
10. Return ensemble $\text{Ensemble}(f_{\theta_1^*, \alpha_1^*}, \dots, f_{\theta_M^*, \alpha_M^*})$.

Also, note the distinction between the population and the pool in NES-RE: the population is evolved, whereas the pool is the set of all networks evaluated during evolution (i.e., the history) and is used post-hoc for selecting the ensemble. Moreover, ForwardSelect is used both for selecting $m$ parent candidates (line 4 in Algorithm 2) and choosing the final ensemble of size $M$ (line 9 in Algorithm 2). In general, $m \neq M$.

4.3 Ensemble Adaptation to Dataset Shift

Using deep ensembles is a common way of building a model robust to distributional shift relative to training data. In general, one may not know the type of distributional shift that occurs at test time. However, by using an ensemble, diversity in base learner predictions prevents the model from relying on one base learner's predictions, which may be not only incorrect but also overconfident.

We assume that one does not have access to data points with test-time shift at training time, but one does have access to some validation data $\mathcal{D}_\text{val}^\text{shift}$ with a validation shift, which encapsulates one's belief about test-time shift. A simple way to adapt NES-RS and NES-RE to return ensembles robust to shift is by using $\mathcal{D}_\text{val}^\text{shift}$ instead of $\mathcal{D}_\text{val}$ whenever applying ForwardSelect to select the final ensemble. In Algorithms 1 and 2, this is in lines 3 and 9, respectively. Note that in line 4 of Algorithm 2, we can also replace $\mathcal{D}_\text{val}$ with $\mathcal{D}_\text{val}^\text{shift}$ when expecting test-time shift; however, to avoid running NES-RE once for each of $\mathcal{D}_\text{val}$ and $\mathcal{D}_\text{val}^\text{shift}$, we simply sample one of $\mathcal{D}_\text{val}$, $\mathcal{D}_\text{val}^\text{shift}$ uniformly at each iteration, in order to explore architectures that work well both in-distribution and during shift. See Appendices C.2 and B.3 for further discussion.

5 Experiments

We apply NES using the cell-based search space for DARTS [36] and evaluate the ensembles found on two image classification datasets: Fashion-MNIST [54] and CIFAR-10-C [24]. CIFAR-10-C is a dataset based on CIFAR-10 but also includes validation and test dataset shifts, each with five severity levels. We use three metrics: NLL, classification error and expected calibration error (ECE) [20, 41]. Hyperparameter choices, experimental and implementation details are available in Appendix B. Note that we do not aim for state-of-the-art performance but rather focus on understanding the improvement over baselines based on deep ensembles. Unless stated otherwise, all evaluations are on the test dataset.
Baselines.
We compare the ensembles found by NES to the baseline of deep ensembles built using a fixed, optimized architecture. The fixed architecture is either: (1) optimized by random search, called DeepEns (RS); (2) the architecture found using DARTS, called DeepEns (DARTS); or (3) the architecture found using RE, called DeepEns (AmoebaNet). All architectures used have five nodes in their cells, except AmoebaNet, which has six nodes. All base learners also use the same training routine. See Appendix B for details.

Figure 3: Results on Fashion-MNIST with varying ensemble sizes $M \in \{3, 5, 10, 30\}$. Lines show the mean NLL achieved by the ensembles with 95% confidence intervals.

Results on in-distribution data.
Figure 3 and the top row of Figure 4 show the NLL achieved by NES-RS, NES-RE and the baselines as a function of the computational budget $K$ and ensemble size $M$. We see that NES algorithms consistently outperform the baselines, usually for most values of $K$. NES-RS and NES-RE perform comparably for CIFAR-10, but NES-RE outperforms NES-RS for Fashion-MNIST due to more efficient exploration of the search space. Interestingly, despite AmoebaNet being deeper, both NES algorithms outperform deep ensembles of AmoebaNet, except for NES-RS on Fashion-MNIST when $M = 30$, in which case the NLL is comparable.

Figure 4: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean NLL achieved by the ensembles with 95% confidence intervals.

Results during dataset shift.
Next, we evaluate the robustness of the ensembles during dataset shift for CIFAR-10-C. All base learners are trained on $\mathcal{D}_\text{train}$ without any form of data augmentation. However, we use a shifted validation dataset, $\mathcal{D}_\text{val}^\text{shift}$, and a shifted test dataset, $\mathcal{D}_\text{test}^\text{shift}$. $\mathcal{D}_\text{val}^\text{shift}$ is built by applying a random validation shift to each datapoint in $\mathcal{D}_\text{val}$. $\mathcal{D}_\text{test}^\text{shift}$ is built similarly but using instead test shifts applied to $\mathcal{D}_\text{test}$ (see Appendix B and [24] for details). The severity of the shift varies between 1-5. The fixed architecture used in the baseline DeepEns (RS) is selected based on its loss over $\mathcal{D}_\text{val}^\text{shift}$, but the DARTS and AmoebaNet architectures remain unchanged.

As shown in the bottom two rows of Figure 4, ensembles picked by NES-RS and NES-RE are more robust to dataset shift than all three baselines. Unsurprisingly, DeepEns (DARTS) and DeepEns (AmoebaNet) perform poorly in comparison to the other methods, as they are not optimized to deal with dataset shift; the gap in performance naturally increases with shift severity, highlighting that highly optimized architectures can fail heavily under dataset shift. We also see that NES-RE improves over the performance of NES-RS during dataset shift.

Figure 5: Results on CIFAR-10-C for the average base learner loss and the oracle ensemble loss (see Section 3 for details), with $K = 400$, shown (a) without data shift and (b) under data shift of severity 5. Recall that small oracle ensemble loss generally corresponds to higher diversity.

Classification error and uncertainty calibration.
We also assess the ensembles using classification error and expected calibration error (ECE). ECE measures whether the predicted probabilities are calibrated: intuitively, whenever the ensemble makes a prediction with a given confidence, it should be correct at a rate matching that confidence, and ECE measures the extent of mismatch between the model's confidence and accuracy. The results comparing NES-RE and NES-RS with baselines are shown in Table 1. In terms of classification error, we find that ensembles built by NES consistently outperform the baseline across ensemble sizes and shift severities. As with loss, NES-RE outperforms NES-RS during dataset shift. We also see that ensembles found by NES exhibit superior uncertainty calibration, reducing ECE against the baselines. Note that good uncertainty calibration is especially important when models are used during dataset shift.

Table 1: Error and ECE of ensembles on CIFAR-10-C for different shift severities and ensemble sizes $M$ with $K = 400$. Best values and all values within the confidence interval are bold-faced. See Table 3 for an extended version. Columns report classification error (out of 1) and expected calibration error (ECE) for NES-RS, NES-RE, DeepEns (RS), DeepEns (DARTS) and DeepEns (AmoebaNet). (Numeric entries are not recoverable from this extraction.)

Diversity and average base learner strength.
To understand why ensembles found by NES algorithms outperform deep ensembles with fixed, optimized architectures, we view the ensembles through the lens of the average base learner loss and oracle ensemble loss as defined in Section 3, as shown in Figure 5. Recall that small oracle ensemble loss indicates higher diversity. We see that NES finds ensembles with smaller oracle ensemble losses, indicating greater diversity among base learners. Unsurprisingly, the average base learner is weaker for NES as compared to DeepEns (RS). Despite this, the ensemble performs better, highlighting once again the importance of diversity.
6 Conclusion

We showed that ensembles with varying architectures are more diverse than ensembles with fixed architectures and argued that deep ensembles with fixed, optimized architectures neglect diversity. To this end, we proposed Neural Ensemble Search, which exploits diversity between base learners of varying architectures to find strong ensembles. We demonstrated empirically that NES-RE and NES-RS outperform deep ensembles in terms of both predictive performance and uncertainty calibration, for in-distribution data and also during dataset shift. We found that even NES-RS, a simple random search based algorithm, found ensembles capable of outperforming deep ensembles built with state-of-the-art architectures.

Acknowledgments
AZ, TE and FH acknowledge support by the European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme through grant no. 716721, and by BMBF grant DeToL. SZ acknowledges support from the Aker Scholarship. CH wishes to acknowledge support from The Alan Turing Institute, The Medical Research Council UK, and the EPSRC Bayes4Health grant. We also thank Julien Siems for providing a parallel implementation of regularized evolution.
References

[1] Noor Awad, Neeratyoy Mallik, and Frank Hutter. Differential Evolution for Neural Architecture Search. ICLR Neural Architecture Search Workshop, 2020.
[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing Neural Network Architectures using Reinforcement Learning. ICLR, 2017.
[3] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating Neural Architecture Search using Performance Prediction. In NIPS Workshop on Meta-Learning, 2017.
[4] Yijun Bian and Huanhuan Chen. When does Diversity Help Generalization in Classification Ensembles? ArXiv, abs/1903.06236, 2019.
[5] Yijun Bian, Qingquan Song, Mengnan Du, Jun Yao, Huanhuan Chen, and Xia Hu. Sub-Architecture Ensemble Pruning in Neural Architecture Search. ArXiv, abs/1910.00370, 2019.
[6] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. CoRR, abs/1604.07316, 2016.
[7] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient Architecture Search by Network Transformation. In Association for the Advancement of Artificial Intelligence, 2018.
[8] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In International Conference on Learning Representations, 2019.
[9] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble Selection from Libraries of Models. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, page 18, New York, NY, USA, 2004. Association for Computing Machinery.
[10] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 874-883, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[11] Thomas G. Dietterich. Ensemble Methods in Machine Learning. In Multiple Classifier Systems, pages 1-15, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
[12] Xuanyi Dong and Yi Yang. Searching for a Robust Neural Architecture in Four GPU Hours. In Computer Vision and Pattern Recognition (CVPR), pages 1761-1770, 2019.
[13] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Simple and Efficient Architecture Search for Convolutional Neural Networks. In NeurIPS Workshop on Meta-Learning, 2017.
[14] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient Multi-Objective Neural Architecture Search via Lamarckian Evolution. In International Conference on Learning Representations, 2019.
[15] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: A Survey. Journal of Machine Learning Research, 20(55):1-21, 2019.
[16] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115-118, 2017.
[17] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1437-1446, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.
[18] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep Ensembles: A Loss Landscape Perspective. ArXiv, abs/1912.02757, 2019.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[20] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321-1330, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[21] Fredrik K. Gustafsson, Martin Danelljan, and Thomas B. Schön. Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
[22] Lars K. Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001, 1990.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[24] Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations, 2019.
[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. ArXiv preprint, abs/1503.02531, 2015.
[27] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John Hopcroft, and Kilian Weinberger. Snapshot Ensembles: Train 1, get M for free. In International Conference on Learning Representations, 2017.
[28] Zhengshen Jiang, Hongzhi Liu, Bin Fu, and Zhonghai Wu. Generalized Ambiguity Decompositions for Classification with Applications in Active Learning and Unsupervised Ensemble Pruning. In AAAI, pages 2073-2079. AAAI Press, 2017.
[29] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric P. Xing. Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In Advances in Neural Information Processing Systems, pages 2016-2025, 2018.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[31] Anders Krogh and Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 231-238. MIT Press, 1995.
[32] Ludmila Kuncheva and Chris Whitaker. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Machine Learning, 51:181-207, 2003.
[33] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems 30, pages 6402-6413. Curran Associates, Inc., 2017.
[34] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks. arXiv e-prints, arXiv:1511.06314, 2015.
[35] Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search. In UAI, page 129. AUAI Press, 2019.
[36] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations, 2019.
[37] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations, 2017.
[38] Vladimir Macko, Charles Weill, Hanna Mazzawi, and Javier Gonzalvo. Improving Neural Architecture Search Image Classifiers via Ensemble Learning. ArXiv, abs/1903.06236, 2019.
[39] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards Automatically-Tuned Neural Networks. In ICML 2016 AutoML Workshop, 2016.
[40] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.
[41] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining Well Calibrated Probabilities using Bayesian Binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2901-2907. AAAI Press, 2015.
[42] Changyong Oh, Jakub Tomczak, Efstratios Gavves, and Max Welling. Combinatorial Bayesian Optimization using the Graph Cartesian Product. In Advances in Neural Information Processing Systems 32, pages 2914-2924. Curran Associates, Inc., 2019.
[43] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pages 13991-14002. Curran Associates, Inc., 2019.
[44] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized Evolution for Image Classifier Architecture Search. In AAAI, pages 4780-4789. AAAI Press, 2019.
[45] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-Scale Evolution of Image Classifiers. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2902-2911, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[46] Christoph Schorn, Thomas Elsken, Sebastian Vogel, Armin Runge, Andre Guntoro, and Gerd Ascheid. Automated design of error-resilient and hardware-efficient deep neural networks. ArXiv preprint, arXiv:1909.13844, 2019.
[47] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.
[48] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99-127, 2002.
[49] Leo Breiman. Random Forests. Machine Learning, 45:5-32, 2001.
[50] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[51] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[52] Andrew M. Webb, Charles Reynolds, Dan-Andrei Iliescu, Henry W. J. Reeve, Mikel Luján, and Gavin Brown. Joint Training of Neural Network Ensembles. CoRR, abs/1902.04422, 2019.
[53] Colin White, Willie Neiswanger, and Yash Savani. BANANAS: Bayesian Optimization with Neural Networks for Neural Architecture Search. arXiv preprint, 2020.
[54] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv e-prints, 2017.
[55] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic Neural Architecture Search. In International Conference on Learning Representations, 2019.
[56] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search. In International Conference on Learning Representations, 2020.
[57] Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard. In International Conference on Learning Representations, 2020.
[58] Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the Search Phase of Neural Architecture Search. In International Conference on Learning Representations, 2020.
[59] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and Robustifying Differentiable Architecture Search. In International Conference on Learning Representations, 2020.
[60] Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search. In ICML Workshop on AutoML, 2018.
[61] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical Block-wise Neural Network Architecture Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423-2432, 2018.
[62] Tianyi Zhou, Shengjie Wang, and Jeff A. Bilmes. Diverse Ensemble Evolution: Curriculum Data-Model Marriage. In Advances in Neural Information Processing Systems 31, pages 5905-5916. Curran Associates, Inc., 2018.
[63] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, 1st edition, 2012.
[64] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1):239-263, 2002.
[65] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In International Conference on Learning Representations, 2017.
[66] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.

A Proof of Proposition 3.1
Taking the loss function to be NLL, we have $\ell(f(x), y) = -\log [f(x)]_y$, where $[f(x)]_y$ is the probability assigned by the network $f$ to $x$ belonging to the true class $y$, i.e. indexing the predicted probabilities $f(x)$ with the true target $y$. Note that $t \mapsto -\log t$ is a convex and decreasing function.

We first prove $\ell(F_\text{OE}(x), y) \leq \ell(F(x), y)$. Recall, by definition of $F_\text{OE}$, we have $F_\text{OE}(x) = f_{\theta_k}(x)$ where $k \in \operatorname{argmin}_i \ell(f_{\theta_i}(x), y)$; therefore $[F_\text{OE}(x)]_y = [f_{\theta_k}(x)]_y \geq [f_{\theta_i}(x)]_y$ for all $i = 1, \dots, M$. That is, $f_{\theta_k}$ assigns the highest probability to the correct class $y$ for input $x$. Since $-\log$ is a decreasing function, we have
$$\ell(F(x), y) = -\log \left( \frac{1}{M} \sum_{i=1}^{M} [f_{\theta_i}(x)]_y \right) \geq -\log \left( [f_{\theta_k}(x)]_y \right) = \ell(F_\text{OE}(x), y).$$

We apply Jensen's inequality in its finite form for the second inequality. Jensen's inequality states that for a real-valued, convex function $\varphi$ with its domain being a subset of $\mathbb{R}$ and numbers $t_1, \dots, t_n$ in its domain, $\varphi\left(\frac{1}{n} \sum_{i=1}^{n} t_i\right) \leq \frac{1}{n} \sum_{i=1}^{n} \varphi(t_i)$. Noting that $-\log$ is a convex function, $\ell(F(x), y) \leq \frac{1}{M} \sum_{i=1}^{M} \ell(f_{\theta_i}(x), y)$ follows directly.
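For a concrete instance (our illustrative numbers, with $M = 2$): suppose $[f_{\theta_1}(x)]_y = 0.9$ and $[f_{\theta_2}(x)]_y = 0.5$. Then
$$\ell(F_\text{OE}(x), y) = -\log 0.9 \approx 0.105, \qquad \ell(F(x), y) = -\log \tfrac{0.9 + 0.5}{2} = -\log 0.7 \approx 0.357,$$
$$\tfrac{1}{2}\left(-\log 0.9 - \log 0.5\right) \approx 0.399,$$
so indeed $0.105 \leq 0.357 \leq 0.399$, as Proposition 3.1 guarantees.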
B Experimental and Implementation Details

We describe details of the experiments shown in Section 5 and Appendix C. Note that unless stated otherwise, all sampling over a discrete set is done uniformly in the discussion below.
B.1 Architecture Search Space
We use the same architecture search space as in DARTS [36]; denote this by $\mathcal{A}$. In $\mathcal{A}$, we search for two types of cells: normal cells, which preserve the spatial dimensions, and reduction cells, which reduce the spatial dimensions. These cells are stacked using a pre-determined macro-architecture where they are usually repeated and connected using additional skip connections. Each cell is a directed acyclic graph, where nodes represent feature maps in the computational graph and edges between them correspond to operation choices (e.g. a convolution operation). The cell parses inputs from the previous and previous-previous cells in its 2 input nodes. Afterwards, it contains 5 nodes: 4 intermediate nodes that aggregate the information coming from 2 previous nodes in the cell, and finally an output node that concatenates the outputs of all intermediate nodes across the channel dimension. AmoebaNet contains one more intermediate node, making it a deeper architecture. The set of possible operations (eight in total in DARTS) that we use for each edge in the cells is the same as in DARTS, but we leave out the "zero" operation since it is not necessary for non-differentiable approaches such as random search and evolution. We refer the reader to [36] for more details.
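A minimal sketch of sampling a random cell from such a space (the operation names and the encoding are our assumptions based on [36], not the released code):

```python
import random

# Seven candidate operations: the DARTS set without the "zero" op,
# as described above (exact list assumed from [36]).
OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3", "skip_connect"]

def sample_cell(num_intermediate=4):
    """Sample a random cell: each intermediate node draws two incoming
    edges from distinct earlier nodes (the two cell inputs plus prior
    intermediate nodes), with a random operation on each edge."""
    cell = []
    for node in range(num_intermediate):
        candidates = list(range(2 + node))  # inputs 0, 1 plus earlier nodes
        inputs = random.sample(candidates, 2)
        cell.append([(i, random.choice(OPS)) for i in inputs])
    return cell

# An architecture consists of one normal and one reduction cell.
architecture = {"normal": sample_cell(), "reduction": sample_cell()}
```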
B.2 Training Routine

The macro-architecture we use has 16 initial channels and 8 cells (6 normal and 2 reduction), and was trained using a batch size of 100 for 100 epochs for CIFAR-10-C and 15 epochs for Fashion-MNIST. Unlike DARTS, we do not use any data augmentation procedure during training, nor any additional regularization such as ScheduledDropPath [66] or auxiliary heads. All other hyperparameter settings are exactly as in DARTS [36].

We split the training data of Fashion-MNIST with 50k samples being used for training base learners and 10k samples reserved for validation, which is used by the NES algorithms during ensemble selection and by DeepEns (RS) for picking the best architecture to use in the deep ensemble. We also set aside a test set with 10k samples; note that this test set is of course never used apart from evaluating the final ensembles. Similarly, we split CIFAR-10-C into 40k, 10k and 10k samples used for training, validation and testing, respectively. Note that when considering dataset shift for CIFAR-10-C, we also apply two disjoint sets of "corruptions" (following the terminology used by [24]) to the validation and test sets. We never apply any corruption to the training data. More specifically, out of the 19 different corruptions provided by [24], we randomly apply one from {Speckle Noise, Gaussian Blur, Spatter, Saturate} to each data point in the validation set and one from {Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic Transform, Pixelate, JPEG Compression} to each data point in the test set. This choice of validation and test corruptions follows the recommendation of [24]. Also, as mentioned in Section 5, each of these corruptions has 5 severity levels, which yields 5 corresponding severity levels for $\mathcal{D}_\text{val}^\text{shift}$ and $\mathcal{D}_\text{test}^\text{shift}$.

For NES-RE, NES-RS and DeepEns (RS), the results shown are averaged over multiple runs of each algorithm, with error bars showing the 95% confidence interval. For NES-RS and DeepEns (RS) on CIFAR-10-C and Fashion-MNIST, we sampled 1,200 random architectures first, then we sampled 400 architectures without replacement from these 1,200 for each run (because we used a maximum budget of 400 architecture evaluations), with a total of 10 runs. This was done in order to avoid training a very large number of networks. For NES-RE on CIFAR-10-C, we used a total of 6 independent runs, and for NES-RE on Fashion-MNIST, we used a total of 5 independent runs. Also, whenever the results do not have the computational budget on one of the axes (e.g. Figure 5), the budget for NES-RS, NES-RE and DeepEns (RS) is 400 architecture evaluations.
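A minimal sketch of this corruption split (the apply_fn wrapper around the corruption implementations of [24] is hypothetical):

```python
import random

VAL_CORRUPTIONS = ["speckle_noise", "gaussian_blur", "spatter", "saturate"]
TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate",
    "jpeg_compression",
]

def corrupt_split(images, corruptions, severity, apply_fn):
    """Apply one randomly chosen corruption per datapoint at the given
    severity. apply_fn(image, name, severity) is a hypothetical wrapper
    around the corruption implementations of [24]."""
    return [apply_fn(x, random.choice(corruptions), severity) for x in images]
```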
B.3 Implementation Details of NES-RE

Parallelization. Running NES-RE on a single GPU requires evaluating hundreds of networks sequentially, which is tedious. To circumvent this, we distribute the "while $|\mathcal{P}| < K$" loop in Algorithm 2 over multiple GPUs, called worker nodes. We use the parallelism scheme provided by the hpbandster [17] codebase (https://github.com/automl/HpBandSter). In brief, the master node keeps track of the population and history (lines 1, 4-6, 8 in Algorithm 2), and it distributes the training of the networks to the individual worker nodes (lines 2, 7 in Algorithm 2). In our experiments, we always use 20 worker nodes and evolve a population $p$ of size $P = 50$. During iterations of evolution, we use an ensemble size of $m = 10$ to select parent candidates.

Mutations.
We adapt the mutations used in RE to the DARTS search space. As in RE, we first pick a normal or reduction cell at random to mutate and then sample one of the following mutations:

• identity: no mutation is applied to the cell.
• op mutation: sample one edge in the cell and replace its operation with another operation sampled from the list of operations.
• hidden state mutation: sample one intermediate node in the cell, then sample one of its two incoming edges. Replace the input node of that edge with another sampled node, without altering the edge's operation.

See [44] for details and illustrations of these mutations.
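A minimal sketch of these three mutations, operating on the cell encoding from the sampling sketch in Appendix B.1 (ours, not the released code; OPS is the operation list assumed there):

```python
import copy
import random

def mutate(architecture):
    """Apply one of the three mutations to a randomly chosen cell."""
    child = copy.deepcopy(architecture)
    cell = child[random.choice(["normal", "reduction"])]
    mutation = random.choice(["identity", "op", "hidden_state"])
    if mutation == "op":
        # Replace the operation on one randomly chosen edge.
        node = random.randrange(len(cell))
        edge = random.randrange(2)
        inp, _ = cell[node][edge]
        cell[node][edge] = (inp, random.choice(OPS))
    elif mutation == "hidden_state":
        # Rewire one incoming edge of a random node to another earlier
        # node, keeping the edge's operation.
        node = random.randrange(len(cell))
        edge = random.randrange(2)
        _, op = cell[node][edge]
        cell[node][edge] = (random.randrange(2 + node), op)
    return child  # "identity" leaves the child unchanged
```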
Adaptation of NES-RE to dataset shifts.
As described in Section 4.3, at each iteration of evolution, the validation set used in line 4 of Algorithm 2 is sampled uniformly between $\mathcal{D}_\text{val}$ and $\mathcal{D}_\text{val}^\text{shift}$ when dealing with dataset shift. In this case, we use shift severity level 5 for $\mathcal{D}_\text{val}^\text{shift}$. Once the evolution is complete and the pool $\mathcal{P}$ has been formed, then for each severity level $s \in \{0, 1, \dots, 5\}$, we apply ForwardSelect with $\mathcal{D}_\text{val}^\text{shift}$ of severity $s$ to select an ensemble from $\mathcal{P}$ (line 9 in Algorithm 2), which is then evaluated on $\mathcal{D}_\text{test}^\text{shift}$ of severity $s$. (Here $s = 0$ corresponds to no shift.) This only applies to CIFAR-10-C, as we do not consider dataset shift for Fashion-MNIST.
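A minimal sketch of this severity-wise selection (shifted_val_predictions is a hypothetical data-loading helper; forward_select is the sketch from Section 4):

```python
# For each shift severity, run forward selection on the pool's cached
# predictions over the matching shifted validation set; severity 0 means
# no shift. shifted_val_predictions(pool, s) is hypothetical and is
# assumed to return (preds of shape (K, N, C), labels of shape (N,)).
ensembles = {}
for severity in range(6):
    preds, labels = shifted_val_predictions(pool, severity)
    ensembles[severity] = forward_select(preds, labels, M=10)
```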
C Additional Experiments

In this section, we provide additional results for the experiments conducted in Section 5. Note that, as with all results shown in Section 5, all evaluations are made on test data unless stated otherwise.
C.1 Additional Results on Fashion-MNIST
To understand why NES algorithms outperform deep ensembles on Fashion-MNIST [54], we compare the average base learner loss (Figure 6) and oracle ensemble loss (Figure 7) of NES-RS, NES-RE and DeepEns (RS).
Figure 6: Average base learner loss for NES-RS, NES-RE and DeepEns (RS) on Fashion-MNIST with varying ensemble sizes $M \in \{3, 5, 10, 30\}$. Lines show the mean NLL and confidence intervals.
Figure 7: Oracle ensemble loss for NES-RS, NES-RE and DeepEns (RS) on Fashion-MNIST with varying ensemble sizes $M \in \{3, 5, 10, 30\}$. Lines show the mean NLL and confidence intervals.

Notice that, apart from the case when ensemble size $M = 30$, NES-RS and NES-RE find ensembles with both stronger and more diverse base learners (smaller losses in Figures 6 and 7, respectively). While it is expected that the oracle ensemble loss is smaller for NES-RS and NES-RE compared to DeepEns (RS), it initially appears surprising that DeepEns (RS) has a larger average base learner loss, considering that the architecture for the deep ensemble is chosen to minimize the base learner loss. We found that this is due to the loss having a sensitive dependence not only on the architecture but also on the initialization of the base learner networks. Therefore, re-training the best architecture by validation loss to build the deep ensemble yields base learners with higher losses due to the use of different random initializations. Fortunately, NES algorithms are not affected by this, since they simply select the ensemble's base learners from the pool without having to re-train anything, which allows them to exploit good architectures as well as initializations. Note that, for the CIFAR-10-C experiments, this was not the case; base learner losses did not have as sensitive a dependence on the initialization as they did on the architecture.

In Table 2, we compare the classification error and expected calibration error (ECE) of NES algorithms with the deep ensembles baseline for various ensemble sizes on Fashion-MNIST. Similar to the loss, NES algorithms also achieve smaller errors, while ECE remains approximately the same for all methods.

Table 2: Error and ECE of ensembles on Fashion-MNIST for different ensemble sizes $M$. Best values and all values within the confidence interval are bold-faced.
Columns report classification error (out of 1) and expected calibration error (ECE) for NES-RS, NES-RE, DeepEns (RS), DeepEns (DARTS) and DeepEns (AmoebaNet). (Numeric entries are not recoverable from this extraction.)

C.2 Additional Results on CIFAR-10-C

In this section, we provide additional experimental results on CIFAR-10-C. Table 3 is an extended version of Table 1, containing all severity levels 1-5. Figures 8, 9 and 10 show the loss, error and ECE, respectively, of the ensembles selected by NES and the baselines as a function of the budget $K$. For these ensembles, Figures 11 and 12 show the average base learner loss and the oracle ensemble loss, respectively. These plots generally show that NES algorithms find ensembles which outperform deep ensembles for almost all values of the budget $K$. Note that the error and ECE values in Table 3 correspond to the values at the rightmost end of the subplots in Figures 9 and 10, that is, when the budget $K = 400$.

We also include a variant of NES-RE, called NES-RE-0, in Figures 8, 9, 10, 11 and 12. NES-RE and NES-RE-0 are the same, except that NES-RE-0 uses the validation set $\mathcal{D}_\text{val}$ without any shift during iterations of evolution, as in line 4 of Algorithm 2. Following the discussion in Appendix B.3, recall that this is unlike NES-RE, where we sample the validation set to be either $\mathcal{D}_\text{val}$ or $\mathcal{D}_\text{val}^\text{shift}$ at each iteration of evolution. Therefore, NES-RE-0 evolves the population without taking into account dataset shift, with $\mathcal{D}_\text{val}^\text{shift}$ only being used for the post-hoc ensemble selection step in line 9 of Algorithm 2. The results shown are an average over 3 independent runs of NES-RE-0.

As shown in Figures 8 and 9, NES-RE-0 shows a minor improvement over NES-RE in terms of loss and error for ensemble size $M = 30$ in the absence of dataset shift. This is in line with expectations, because evolution in NES-RE-0 focuses on finding base learners which form strong ensembles for in-distribution data. On the other hand, when there is dataset shift, the performance of NES-RE-0 ensembles degrades, yielding higher loss and error than both NES-RS and NES-RE. Nonetheless, NES-RE-0 still manages to outperform the DeepEns baselines consistently. We draw two conclusions on the basis of these results: (1) NES-RE-0 can be a competitive option in the absence of dataset shift. (2) Sampling the validation set, as done in NES-RE, to be $\mathcal{D}_\text{val}$ or $\mathcal{D}_\text{val}^\text{shift}$ in line 4 of Algorithm 2 plays an important role in returning a final pool $\mathcal{P}$ of base learners from which ForwardSelect can select ensembles robust to dataset shift.

Table 3: Extension of Table 1. Error and ECE of ensembles on CIFAR-10-C for different shift severities and ensemble sizes $M$. Best values and all values within the confidence interval are bold-faced. Columns report classification error (out of 1) and expected calibration error (ECE) for NES-RS, NES-RE, DeepEns (RS), DeepEns (DARTS) and DeepEns (AmoebaNet). (Numeric entries are not recoverable from this extraction.)
Figure 8: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean NLL achieved by the ensembles with 95% confidence intervals. See Appendix C.2 for the definition of NES-RE-0.
Figure 9: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean error achieved by the ensembles with 95% confidence intervals. See Appendix C.2 for the definition of NES-RE-0.
Figure 10: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean ECE achieved by the ensembles with 95% confidence intervals. See Appendix C.2 for the definition of NES-RE-0.
Figure 11: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean of the average base learner loss with 95% confidence intervals. See Appendix C.2 for the definition of NES-RE-0.
Figure 12: Results on CIFAR-10-C [24] with varying ensemble sizes $M$ and shift severity. Lines show the mean oracle ensemble loss with 95% confidence intervals. See Appendix C.2 for the definition of NES-RE-0.