Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones?
Nadezhda Chirkova, Ekaterina Lobacheva, Dmitry Vetrov
Samsung-HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia; Samsung AI Center Moscow, Russia. Correspondence to: Nadezhda Chirkova <[email protected]>, Ekaterina Lobacheva <[email protected]>.
Preliminary work. Under review by the International Conference on Machine Learning (ICML 2020).
Abstract
One of the generally accepted views of modern deep learning is that increasing the number of parameters usually leads to better quality. The two easiest ways to increase the number of parameters are to increase the size of the network, e.g. its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget setting and investigate what is more effective: to train a single wide network, or to perform a memory split — to train an ensemble of several thinner networks with the same total number of parameters? We find that, for large enough budgets, the number of networks in the ensemble corresponding to the optimal memory split is usually larger than one. Interestingly, this effect holds for the commonly used sizes of the standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% and 82.52%, respectively. We call the described effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
1. Introduction
Over the last decade, many neural network architectures for solving different artificial intelligence problems have been proposed, and each new state-of-the-art model usually has more parameters than its predecessor. While working within a fixed architectural family, such as WideResNets (Zagoruyko & Komodakis, 2016) or Transformers (Vaswani et al., 2017), a variety of upscaling methods can be used to increase the network size.
Figure 1.
Illustration of the MSA effect. The relative arrangement of the quality curves of an individual model and ensembles leads to the MSA effect for large enough budgets: the ensemble of several medium-width neural networks (b) gives higher quality than one very wide neural network (a) or the ensemble of a lot of thin neural networks (c). The results for Transformer on IWSLT'14 De-En are shown; the x-axis denotes the total number of parameters, and the standard budget corresponds to the size of the Transformer with d_model = 512. The three curves correspond to individual networks, ensembles of 1/4-sized networks, and ensembles of 1/16-sized networks.
The two most common approaches are to increase the model depth, i.e. the number of layers (He et al., 2015; Wang et al., 2019; Pham et al., 2019), or to increase the model width by, for example, increasing the number of filters in a convolutional neural network or the dimensionality of embeddings and fully-connected layers in a Transformer. Both methods improve model performance and can also be combined for higher effectiveness (Tan & Le, 2019). In this work, we focus on increasing the network size by scaling the width of each layer, because straightforward upscaling of the network depth often leads to optimization difficulties (Huang et al., 2016; Bapna et al., 2018).
Rather than increasing the size of one model, one can train an ensemble of several models. One of the most popular methods to construct an ensemble of deep neural networks is to train individual networks from different random initializations and then average their predictions (Hansen & Salamon, 1990; Lakshminarayanan et al., 2017). Such ensembles are called Deep Ensembles. Deep Ensembles with an increasing ensemble size were shown to improve both classification performance and the quality of uncertainty estimation of the model (Szegedy et al., 2016; Wu et al., 2016; Devlin et al., 2019; Ashukha et al., 2020).
The effect of increasing the network size or the ensemble size independently is well investigated in the literature. In this work, we investigate this effect in a fixed memory budget setting, in which increasing the network size entails decreasing the ensemble size. By a fixed memory budget, we mean a fixed number of parameters. We focus on the following question: with a fixed number of parameters, what performs better: (a) one very wide neural network, (b) the ensemble of several medium-width neural networks, or (c) the ensemble of a lot of thin neural networks? Our main empirical result is that, for large enough memory budgets, (b) is better than (a) and (c): splitting the memory budget between several mid-size neural networks results in better performance than spending the budget on one big network or training a huge number of small networks (see figure 1). We call this effect the Memory Split Advantage (MSA). We perform a rigorous empirical study of the MSA effect and show the effect for VGG (Simonyan & Zisserman, 2014) and WideResNet on the CIFAR-10/100 datasets, as well as Transformer on IWSLT14 German-to-English. We observe that for a lot of dataset–architecture pairs, the MSA effect holds even for small architecture configurations, i.e. several times smaller than the commonly used ones.
The MSA effect leads to a simple and effective way of improving the quality of the model without changing the number of parameters. Splitting the network into an ensemble of several smaller ones also allows distributed training and prediction.
2. Related work
Individual networks.
According to the conventional bias-variance trade-off theory (Hastie et al., 2004), for a particular task and model family, there should be an optimal model size, such that larger, more complex models overfit and perform worse. However, a tendency of larger neural networks to achieve higher performance is observed in many practical applications of modern deep learning (Huang et al., 2018; Tan & Le, 2019; Radford et al., 2018; Devlin et al., 2019), even though such models are overparametrized and are able to fit even random labels (Zhang et al., 2017). This phenomenon is actively researched nowadays, and recent works (Belkin et al., 2018; Nakkiran et al., 2019) confirm that, in the overparametrized regime, increasing the size of the network leads to better quality.
Ensembles.
The standard ensembling approach for neural networks consists of training individual models independently and then averaging their predictions. Diversity in error distributions across member networks is essential to construct an effective ensemble (Hansen & Salamon, 1990; Krogh & Vedelsby, 1994). To achieve it, networks are usually trained on the same dataset but from different random initializations (Lakshminarayanan et al., 2017). In the case of neural networks, this approach results in higher performance than standard bagging (Lee et al., 2015). Ensembling of neural networks was shown to substantially increase both classification performance and the quality of uncertainty estimation in a lot of practical tasks (Szegedy et al., 2016; Wu et al., 2016; Devlin et al., 2019; Beluch et al., 2018; Ashukha et al., 2020). Moreover, a lot of winning and top-performing solutions of different Kaggle competitions use ensembles of deep neural networks.
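For concreteness, below is a minimal PyTorch-style sketch of this procedure: the same architecture is trained several times from different random seeds, and the softmax outputs are averaged at prediction time. The helpers `make_model` and `train_fn` are placeholders standing in for an actual model constructor and training loop, not code from the paper.

```python
import torch

def train_deep_ensemble(make_model, train_fn, n_members, train_loader):
    """Train n_members copies of the same architecture, each from a different
    random initialization (different seed), on the same training data."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)          # different random initialization per member
        model = make_model()
        train_fn(model, train_loader)    # standard independent training
        members.append(model)
    return members

def ensemble_predict(members, inputs):
    """Average the class probabilities (softmax outputs) over the members."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(inputs), dim=-1) for m in members])
    return probs.mean(dim=0)
```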
Memory efficient ensembles.
While achieving higher performance, ensembles of neural networks are also more resource-consuming in terms of memory. Various compression methods may be used to compress individual networks, such as sparsification and quantization (Han et al., 2016) or a more compact parametrization of weight matrices (Novikov et al., 2015). To compress an ensemble, one may compress each member network independently or apply, for example, distillation (Hinton et al., 2015) to the whole ensemble. These approaches and our findings are orthogonal and can be straightforwardly combined.
More compact ensembles may also be constructed by sharing parameters between member networks (Lee et al., 2015; Gao et al., 2019). Gao et al. (2019) propose a technique for simultaneously training several sub-networks inside one network, so that these sub-networks are then used as an ensemble. Hence, the authors also split one network into an ensemble, but in a different way: the sub-networks are dependent and share a large portion of parameters, while our findings show that even splitting one network into several independent ones boosts performance for large enough budgets.
Multi-branch architectures (Szegedy et al., 2016; Xie et al., 2017), which split some building blocks of a network into sub-blocks and aggregate (e.g. sum or concatenate) the sub-blocks' outputs, can also be seen as an ensemble of a large number of subnetworks. These subnetworks are again closely connected and trained jointly, while in our work, we investigate the effect of independent training of the networks in the ensemble.
3. Motivation
In this paper, we focus on solving a machine learning task with an ensemble of one or more neural networks with a fixed total number of parameters. For a fixed memory budget, we consider ensembles containing different numbers of networks and vary the width of the member networks, so that the resulting combination of the two parameters, the number of networks and the network width, matches the memory budget. We call the described ensembles with a fixed number of parameters memory splits. We do not change any other architectural properties of the individual networks except their width.
The dependency of an individual neural network's performance on its width usually looks as shown in figure 1: quality increases as the width grows and saturates for larger widths (Tan & Le, 2019). As a result, in practice, it is preferable to use a network wide enough to achieve high performance, but not too wide, because it needs a lot of resources while providing only a negligible quality improvement.
The same type of dependency is observed between quality and the number of networks in Deep Ensembles: the more networks the better, but for a large number of networks the quality saturates (Ashukha et al., 2020). An ensemble quality curve (the network size is fixed, the ensemble size varies) may be depicted in the same plot as the quality curve of an individual network by using the number of parameters on the x-axis (instead of the width or the number of networks). We show this curve for ensembles with member networks of two sizes in figure 1.
When working with a fixed memory budget, one needs to choose how to spend it: to train one large network, or to split the budget into two or more parts and train several smaller networks. Intuitively, the choice depends on the budget and on the relative arrangement of the quality curves of an individual model and ensembles. For a small budget (left vertical line in figure 1), one model is expected to give higher performance, because the curve for one model increases faster than for an ensemble. We assume such behavior on the premise that smaller models have a high bias, while ensembling tends to decrease the variance (Tumer & Ghosh, 1996). For a large budget (right vertical line in figure 1), the performance of an individual network saturates, while common practice says that ensembling increases performance even for very large networks (Szegedy et al., 2016; Devlin et al., 2019). Based on these observations, we hypothesize that for large budgets the MSA effect holds: it is preferable to split the budget and train an ensemble of several networks instead of one large network.
In this paper, we consider several dataset–architecture pairs and empirically investigate the following research questions (RQs):
RQ1: Does the MSA effect hold for various datasets and architectures?
RQ2: For which budgets does it hold? Does it hold for the standard budgets that are usually used in practice?
RQ3: What does an optimal split look like for different budgets?
4. Experimental design
We perform experiments with convolutional neural networks (WideResNet-28-10 and VGG16) on CIFAR-10 (Krizhevsky et al., a) and CIFAR-100 (Krizhevsky et al., b), and with Transformer on IWSLT14 German-English (De-En) (Cettolo et al., 2014). For each dataset–architecture pair, we investigate memory splitting for a range of memory budgets. To obtain different memory splits, we select the width factor, a hyperparameter controlling layer widths, so that the size of the member network is approximately equal to the budget divided by the ensemble size.
Since good model performance is usually obtained with carefully tuned hyperparameters, and the optimal hyperparameters may change for different network sizes, we investigate memory splitting in two settings:
(A) without hyperparameter tuning: we use the same hyperparameters for all memory splits, including turning off the regularization (weight decay, dropout);
(B) with hyperparameter tuning: we use grid search to tune the hyperparameters of a single network on the validation set for each network size; to train memory splits, we use the hyperparameters that are optimal for the member networks.
We can only find good hyperparameters approximately, which introduces additional noise into the results. Setting A avoids this additional noise and results in smoother plots. Setting A is also similar to the setting of (Nakkiran et al., 2019). Setting B is more practically oriented.
For each dataset–architecture pair, in each setting, we provide a memory split plot.
Memory split plot.
Each plot contains several lines; one line corresponds to a fixed memory budget. For a budget B (the number of parameters), we train several memory splits, each containing N networks of size B/N, with N = 1, 2, 4, ... (logarithmic scale). To obtain a network of size B/N, we adjust the width factor of the network. We then plot the test quality against the ensemble size N and analyze which N is optimal. The ensemble corresponding to the optimal N is called the optimal memory split for budget B. The optimal N usually varies for different budgets B.
For the majority of points on the plot, we average the test quality over 3–5 runs (3 for more computationally expensive runs, 5 for less expensive ones) and plot the mean ± standard deviation of the quality. The most computationally expensive ensembles (the largest N for one or two of the biggest budgets) were trained only once, but the standard deviation of their test quality is small because of the large ensemble size N.
In the rest of the paper, we denote an ensemble of N networks, each having S parameters, by E(N, S). The total number of parameters in E(N, S) equals N·S. Q(E(N, S)) denotes the test quality of the ensemble E(N, S). B_standard denotes the standard budget, equivalent to the number of parameters in a single network of a commonly used size for the specific architecture.
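To make the procedure concrete, the sketch below (not the paper's code) assembles the data behind one line of a memory split plot. It assumes a hypothetical helper train_and_eval_split(n_networks, member_size) that trains N independent networks of the given size and returns the test quality Q of their ensemble.

```python
def memory_split_curve(budget, train_and_eval_split,
                       ensemble_sizes=(1, 2, 4, 8, 16, 32), n_runs=3):
    """Quality of E(N, budget / N) for each ensemble size N at a fixed budget.
    train_and_eval_split(n_networks, member_size) is assumed to train the
    networks and return the ensemble's test quality."""
    curve = {}
    for n in ensemble_sizes:
        member_size = budget // n
        runs = [train_and_eval_split(n, member_size) for _ in range(n_runs)]
        curve[n] = sum(runs) / len(runs)   # mean quality over repeated runs
    return curve

# The optimal memory split for this budget is the N with the highest quality:
# n_opt = max(curve, key=curve.get)
```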
Experimental details for CNNs.
We consider VGG (Simonyan & Zisserman, 2014) (16 layers) and WideResNet (Zagoruyko & Komodakis, 2016) (28 layers). We use the implementation provided by Garipov et al. (2018), https://github.com/timgaripov/dnn-mode-connectivity, and scale the number of filters in the convolutional layers and the number of neurons in the fully-connected layers, all of which are proportional to the width factor k. For VGG, we consider a range of width factors; k = 64 corresponds to the standard, commonly used configuration. We use the size of this configuration (15.3M parameters) as the standard budget B_standard in the experiments with VGG. For WideResNet, we also consider a range of width factors; k = 160 corresponds to the standard model, WideResNet-28-10. We use the size of this configuration (36.8M parameters) as the standard budget B_standard in the experiments with WideResNet.
We train all networks for 200 epochs with SGD, an annealing learning rate schedule, and a batch size of 128. In setting A, we use an initial learning rate 10 times smaller than in the reference implementation to ensure that the training converges for all considered models. In setting B, we consider both the standard learning rate and the smaller one. For VGG, we use weight decay and binary dropout for the fully-connected layers. For WideResNet, we use weight decay and batch normalization (Ioffe & Szegedy, 2015). Dropout does not affect the quality much, so we do not use it for WideResNet, to reduce the hyperparameter grid size. Hyperparameter choices for all cases are listed in table 1. Quality is measured in accuracy. More details are given in Appendix A.
Experimental details for Transformer.
We use the standard Transformer architecture (Vaswani et al., 2017) with 6 layers, each with model dimension d_model = k, feed-forward dimension d_ffn = 2k, and 4 attention heads. We vary the width factor k to obtain networks with different numbers of parameters; k = 512 corresponds to the standard, commonly used configuration. We use the size of this configuration (39.5M parameters) as the standard budget in the experiments with Transformer. We use the implementation provided by fairseq (Ott et al., 2019), https://github.com/pytorch/fairseq.
We use label smoothing and batches of at most 4096 tokens. We train models using Adam (Kingma & Ba, 2015) with an inverse square-root learning rate schedule and 4000 warm-up steps. We stop training after the convergence of the validation loss or after 100/50 epochs for setting A/B, whichever comes first. In setting A, we use a fixed learning rate chosen to ensure stable training of all network sizes. In setting B, we use weight decay and binary dropout, and choose the optimal hyperparameters using grid search (see table 1). For evaluation, we use beam search with a beam size of 5 and a length penalty of 1.0. The translation quality is measured in BLEU (Papineni et al., 2002). More details are given in Appendix A.
Table 1.
Hyperparameter choices for all architectures and settings. Notation: wd — weight decay, lr — initial learning rate, dr — dropout rate. In setting A, we use fixed hyperparameters for all network sizes. In setting B, for each network size, we choose the optimal hyperparameters on the validation set using grid search over all possible combinations. (For Transformer, due to the limited computational resources, we first select the optimal lr and dr from all possible combinations with wd fixed, and then select the optimal wd.)
Rows: VGG, WideResNet, and Transformer; columns: the fixed lr, wd, and dr values used in setting A, and the lr, wd, and dr search grids used in setting B.
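When picking the width factor of a Transformer member network for a target size, a rough parameter-count estimate as a function of k is sufficient. The sketch below is such a back-of-the-envelope estimate for the 6-layer encoder–decoder configuration described above (d_model = k, d_ffn = 2k); it is not the exact count used in the paper. Biases, layer norms, and output projection details are ignored, and the vocabulary size and the shared-embedding assumption are illustrative.

```python
def approx_transformer_params(k, vocab_size=10000, n_layers=6):
    """Rough parameter count for an encoder-decoder Transformer with
    d_model = k and d_ffn = 2k (biases, layer norms, etc. are ignored)."""
    attn = 4 * k * k                 # Q, K, V and output projections
    ffn = 2 * (k * 2 * k)            # two feed-forward matrices, k x 2k and 2k x k
    encoder_layer = attn + ffn       # self-attention + FFN
    decoder_layer = 2 * attn + ffn   # self-attention + cross-attention + FFN
    embeddings = vocab_size * k      # shared input/output embeddings (assumption)
    return n_layers * (encoder_layer + decoder_layer) + embeddings

# For example, approx_transformer_params(512) is in the tens of millions,
# roughly matching the reported 39.5M of the standard configuration.
```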
5. Memory split advantage effect
The memory split plots for CNNs and Transformer are given in figures 2 and 3, respectively. The two columns show the results for the two settings (without hyperparameter tuning and with tuned hyperparameters), and each row corresponds to a dataset–architecture pair.
Verifying assumptions for the MSA effect.
Our experiments confirm the generally accepted view that increasing the size S of a single model results in an increase and saturation of the test quality Q(E(1, S)), both with and without hyperparameter tuning for each model size. We refer to this effect as individual quality saturation. It can be seen in figures 2 and 3 by going from bottom to top along the vertical line at abscissa N = 1 (moving across the colored lines); we also give more convenient plots in Appendix B. Similar results without hyperparameter tuning, and thus without proper regularization, were shown in (Nakkiran et al., 2019), but we did not find a comparable experiment for a regularized setting.
We also checked that, when the member network size S is fixed, increasing the ensemble size N results in an increase and saturation of Q(E(N, S)) (referred to later as ensemble quality saturation). This observation can also be made from figures 2 and 3, and similar results are given in (Ashukha et al., 2020). These two observations allow us to move ahead and answer the research questions stated in section 3.
Figure 2 panels (left column: no hyperparameter tuning; right column: tuned hyperparameters): WideResNet on CIFAR-100, VGG on CIFAR-100, WideResNet on CIFAR-10, and VGG on CIFAR-10 (top to bottom); each panel plots test accuracy against the number of networks in the ensemble, one line per memory budget.
Figure 2.
MSA effect for CNNs: mean ± standard deviation of the test accuracy of the ensembles with a fixed memory budget. The x-axis denotes the number of networks N in the ensemble. Each line corresponds to a fixed memory budget (the number of parameters) measured in standard budgets. One standard budget corresponds to the commonly used model size. For some large ensembles, the standard deviation is not provided due to the limited computational resources.
Figure 3 panels: Transformer on IWSLT De-En without hyperparameter tuning (left) and with tuned hyperparameters (right); each panel plots test BLEU against the number of networks, one line per memory budget.
Figure 3.
MSA effect for Transformer: mean ± standard deviation of BLEU of the ensembles with a fixed memory budget. The x-axis denotes the number of networks N in the ensemble. Each line corresponds to a fixed memory budget (the number of parameters) measured in standard budgets. One standard budget corresponds to the size of one Transformer with k = 512. For some large ensembles, the standard deviation is not provided due to the limited computational resources.
RQ1: Does the MSA effect hold for various datasets and architectures?
For a particular memory budget B, the MSA effect holds if the number of networks in the optimal memory split is larger than one: N* > 1, where N* = argmax_{N ≥ 1} Q(E(N, B/N)). In other words, the line on the plot corresponding to budget B has its optimum at an abscissa N > 1. We observe the MSA effect for all considered dataset–architecture pairs, for a wide range of budgets. For example, consider WideResNet with tuned hyperparameters on CIFAR-100 and a memory budget equivalent to one standard model, WideResNet-28-10. The red line, corresponding to this budget, has its optimum at abscissa N = 16. That is, sixteen WideResNets with a width factor roughly four times smaller than the standard one (the number of parameters in a network is a quadratic function of the width factor) perform significantly better than one standard WideResNet-28-10 (82.52% test accuracy vs. 80.60%).
The MSA effect holds in both designed settings. The results in setting A are smoother in the sense that there is no additional noise caused by hyperparameter tuning, but the quality in setting A is lower due to the absence of proper regularization. In setting B, hyperparameter optimization is approximate, and for some model sizes the found hyperparameters may be more suitable than for others. To make the hyperparameter search fairer, we used the same uniform grid for all model sizes. We perform additional experiments on the hyperparameter search below.
RQ2: For which budgets does the MSA effect hold?
The MSA effect holds for all considered budgets except several of the smallest ones. Notably, for each considered architecture, the MSA effect holds for the standard budget B_standard, denoted by the red line on all plots. This means that widely used configurations of popular architectures are not optimal, and a simple technique, memory splitting, can significantly improve their quality while retaining the same number of parameters. In the majority of cases, the MSA effect also holds for budgets several times smaller than the standard one.
For large budgets, the MSA effect is expected because of the individual quality saturation: if we split B = 8 B_standard, then E(2, 4 B_standard) easily outperforms E(1, 8 B_standard), because the quality gap between the member networks E(1, 4 B_standard) and E(1, 8 B_standard) is negligible. However, for smaller budgets, there is usually a significant quality gap between the single network and the member network of the optimal memory split. Ensembling the member networks pushes the quality of the memory split up enough to cause the MSA effect.
Figure 4.
Optimal memory splits for VGG and WideResNet on CIFAR-100 for different memory budgets. For each optimal memory split, the ensemble size is shown in the top plot, while the corresponding size of the member network is shown in the bottom plot. Left: results for the settings with and without hyperparameter tuning. Right: comparison of memory splits for VGG and WideResNet, aligned in terms of the number of parameters.
RQ3: What does an optimal split look like for different budgets?
Figure 4 shows optimal memory splits for VGG and WideResNet on CIFAR-100 for different memory budgets. The results for other dataset–architecture pairs look similar and are presented in Appendix C. We note that the depicted values of the optimal ensemble size and member network size are approximate, due to the discreteness of the ensemble size grid and the variance in accuracy of neural networks trained from different random seeds.
For all dataset–architecture pairs, with increasing memory budget, the optimal memory split grows both in terms of ensemble size and member network size. Hence, there is no single globally optimal ensemble size or member network size. For small budgets, the optimal decision is not to split the memory, and therefore the optimal ensemble size does not grow (see small budgets for VGG with hyperparameter tuning). For large budgets, the quality of one network saturates, and therefore the optimal member network size saturates too (see large budgets for all tasks).
In the case of VGG and WideResNet, hyperparameter tuning mostly consists of choosing the right regularization and therefore, first of all, makes large networks better. As a result, optimal memory splits for the setting with hyperparameter tuning contain fewer networks of larger size.
To compare optimal memory splits between different architectures, we plot the results for VGG and WideResNet on CIFAR-100 with hyperparameter tuning together in one plot (figure 4, right). We align the budgets by measuring them in the number of parameters. While achieving much higher quality than VGG, WideResNet also uses parameters more efficiently inside one network. Hence, for WideResNet, the optimal memory split consists of smaller networks, and the optimal ensemble size becomes greater than one starting from much smaller budgets than for VGG.
Tuning hyperparameters with Bayesian optimization.
In order to better justify that the MSA effect holds in the regularized setting, we conduct additional experiments with more careful hyperparameter tuning. We use Bayesian optimization (BO) (Snoek et al., 2012) to find hyperparameters better than the ones selected by grid search.
For different model sizes of WideResNet on CIFAR-100, we choose the best learning rate, weight decay, and dropout rate on the validation set using BO with 20 iterations. We use an open-source implementation of BO (https://github.com/fmfn/BayesianOptimization). BO takes more time than grid search, so we omit the largest budgets in this experiment.
The results are given in figure 5. The MSA effect holds for both hyperparameter tuning methods, although the optimization methods differ considerably: grid search explores the hyperparameter space regularly, whereas BO combines random search with searching within a narrow region near the best discovered points.
Hyperparameters found by BO in most cases result in higher test accuracies of single models than hyperparameters found by grid search. This is visible at abscissa N = 1 in figure 5. However, the test accuracies of memory splits are not always higher for BO than for grid search. This is because the optimal hyperparameters for the ensemble may differ from the optimal hyperparameters for the single model. Optimizing hyperparameters for the ensemble is extremely expensive, but potentially it could make memory splits even stronger compared to single models. In total, the maximum achieved test accuracy for the largest considered budget is higher for BO (82.75%) than for grid search (82.52%).
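As an illustration, a minimal sketch of such a tuning loop with the cited BayesianOptimization package is shown below. The objective here is a toy stand-in for an actual training-plus-validation run, the search bounds are illustrative assumptions rather than the paper's ranges, and the 5/15 split of the 20 iterations is also an assumption.

```python
from bayes_opt import BayesianOptimization

def objective(log10_lr, log10_wd, dropout):
    """Placeholder objective: in the paper's setting this would train a
    WideResNet with these hyperparameters and return validation accuracy."""
    # Replace this toy expression with an actual training + validation run.
    return -(log10_lr + 1.0) ** 2 - (log10_wd + 4.0) ** 2 - (dropout - 0.1) ** 2

# Hypothetical search bounds; the exact ranges used in the paper are not reproduced here.
pbounds = {"log10_lr": (-3.0, -0.5), "log10_wd": (-5.0, -2.0), "dropout": (0.0, 0.5)}

optimizer = BayesianOptimization(f=objective, pbounds=pbounds, random_state=0)
# 20 BO iterations in total, as in the paper; the init/iter split is an assumption.
optimizer.maximize(init_points=5, n_iter=15)
print(optimizer.max)  # best hyperparameters found and the corresponding objective value
```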
Figure 5.
MSA effect for the setting with hyperparameters tuned using Bayesian optimization, WideResNet on CIFAR-100. Each solid line shows mean ± standard deviation of the test accuracy of the ensembles with a fixed memory budget; the x-axis denotes the number of networks N in the ensemble. One standard budget corresponds to the size of one WideResNet-28-10. The dashed lines correspond to the results with grid search, for reference.
MSA effect for uncertainty estimation.
Ensembles of deep neural networks are often used for uncertainty estimation: probability estimates produced by ensembles are more accurate and reliable than those produced by a single network (Lakshminarayanan et al., 2017), in both in-domain (Ashukha et al., 2020) and out-of-domain (Snoek et al., 2019) settings.
However, this comparison is usually performed in a fixed-width setting, and hence with an unequal number of parameters. We show that ensembles outperform the single network in in-domain uncertainty estimation when the total number of parameters in both models is fixed and approximately equal.
We use the calibrated test negative log-likelihood (NLL) to measure the quality of in-domain uncertainty. Ashukha et al. (2020) highlight the importance of temperature scaling for both single networks and ensembles, since comparing log-likelihoods without temperature scaling may lead to arbitrary results. We measure the calibrated test NLL for the image classification models trained in the main experiments and discussed in paragraphs RQ1–RQ3. We show the results for WideResNet with tuned hyperparameters in figure 6; more results may be found in Appendix D.
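As a rough illustration of the metric, the sketch below fits a single temperature by minimizing NLL and reports the NLL at that temperature; for an ensemble, the log of the averaged probabilities can be passed in place of logits. This is only a sketch: the exact calibration protocol of Ashukha et al. (2020) may differ in details.

```python
import torch
import torch.nn.functional as F

def calibrated_nll(logits, labels, n_steps=200, lr=0.01):
    """Fit a single positive temperature that minimises NLL on the given data,
    then return the NLL at that temperature (temperature scaling)."""
    log_t = torch.zeros(1, requires_grad=True)   # temperature = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return F.cross_entropy(logits / log_t.exp(), labels).item()
```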
Figure 6.
MSA effect for the calibrated test NLL, WideResNet on CIFAR-100 and CIFAR-10. Each line shows mean ± standard deviation of the calibrated test NLL of the ensembles with a fixed memory budget; the x-axis denotes the number of networks N in the ensemble. One standard budget corresponds to the size of one WideResNet-28-10. For some large ensembles, the standard deviation is not provided due to the limited computational resources.
The MSA effect holds for the calibrated test NLL in all the same cases (architecture–dataset–budget triplets) as for the test accuracy. The number of networks in the memory split that is optimal w.r.t. the calibrated test NLL is usually equal to or slightly larger than the one optimal w.r.t. test accuracy.
6. Conclusion
In this work, we introduce the MSA effect and the resulting simple method of improving neural network performance in a limited memory setting. Investigating the MSA effect for various datasets, architectures, and budgets, we find that the effect holds even for small configurations of popular architectures, and that the larger the configuration, the bigger the ensemble size N and the network size S corresponding to the optimal memory split. Finding the optimal values of N and S without computing the full memory split plot is an interesting direction for future work.
Acknowledgments
We would like to thank Dmitry Molchanov for the valuable feedback. Results for convolutional neural networks were supported by a Russian Science Foundation grant.
References
Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020.
Bapna, A., Chen, M., Firat, O., Cao, Y., and Wu, Y. Training deeper neural machine translation models with transparent attention. In Conference on Empirical Methods in Natural Language Processing, 2018.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
Beluch, W. H., Genewein, T., Nürnberger, A., and Köhler, J. M. The power of ensembles for active learning in image classification. In Conference on Computer Vision and Pattern Recognition, 2018.
Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. Report on the 11th IWSLT evaluation campaign. In IWSLT, 2014.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Gao, Y., Cai, Z., Chen, Y., Chen, W., Yang, K., Sun, C., and Yao, C. Intra-ensemble in neural networks. arXiv preprint arXiv:1904.04466, 2019.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, 2018.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.
Hansen, L. and Salamon, P. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 1990.
Hastie, T., Tibshirani, R., Friedman, J., and Franklin, J. The elements of statistical learning: Data mining, inference, and prediction. Math. Intell., 27, 2004.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2015.
Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop at the Neural Information Processing Systems, 2015.
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, 2016.
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). a.
Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-100 (Canadian Institute for Advanced Research). b.
Krogh, A. and Vedelsby, J. Neural network ensembles, cross validation and active learning. In International Conference on Neural Information Processing Systems, 1994.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D. J., and Batra, D. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. Tensorizing neural networks. In International Conference on Neural Information Processing Systems, 2015.
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Demonstrations, 2019.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002.
Pham, N., Nguyen, T., Niehues, J., Müller, M., and Waibel, A. Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377, 2019.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2018.
Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Annual Meeting of the Association for Computational Linguistics, 2016.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
Snoek, J., Ovadia, Y., Fertig, E., Lakshminarayanan, B., Nowozin, S., Sculley, D., Dillon, J., Ren, J., and Nado, Z. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, 2019.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Workshop of the International Conference on Learning Representations, 2016.
Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 2019.
Tumer, K. and Ghosh, J. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29, 1996.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. In Annual Meeting of the Association for Computational Linguistics, 2019.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition, 2017.
Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, 2016.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
A. Experimental details
A.1. Details for CNNs
Data.
We conduct experiments on the CIFAR-100 and CIFAR-10 datasets, each containing 50000 training and 10000 testing examples. For tuning hyperparameters, we randomly select 5000 training examples as a validation set. After choosing the optimal hyperparameters, we retrain the models on the full training dataset. We use a standard data augmentation scheme following (Garipov et al., 2018): zero-padding with 4 pixels on each side, random cropping to produce 32×32 images, and horizontal mirroring with probability 0.5. In the experiments without hyperparameter tuning, we do not use data augmentation.
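A torchvision version of this augmentation could look as follows; the normalization statistics are the usual CIFAR values and are an assumption here, not taken from the paper.

```python
import torchvision.transforms as T

# 4-pixel zero padding, random 32x32 crop, horizontal flip with probability 0.5.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
])
```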
Training.
Following (Garipov et al., 2018), we train all models for 200 epochs using SGD with momentum and the following learning rate schedule: constant (100 epochs) – linearly annealing (80 epochs) – constant (20 epochs). The final learning rate is 100 times smaller than the initial one. We use a batch size of 128.
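A small helper implementing this schedule, as a sketch consistent with the description above (the reference implementation may differ in details):

```python
def learning_rate(epoch, lr_init, total_epochs=200):
    """Constant (first 50% of epochs) -> linear annealing (next 40%) ->
    constant (last 10%), with the final learning rate 100x smaller than the
    initial one; with 200 epochs this gives the 100/80/20 split above."""
    t = epoch / total_epochs
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - 0.99 * (t - 0.5) / 0.4   # linear from 1.0 down to 0.01
    else:
        factor = 0.01
    return lr_init * factor
```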
Testing.
We measure the test accuracy and the calibrated negative log-likelihood (NLL) of each ensemble or individual model.
Bayesian optimization.
We use bounded search ranges for Bayesian optimization: a log-scale range for the weight decay, a range starting at 0 for the dropout rate, and a learning rate range whose upper bound differs between VGG and WideResNet.
A.2. Details for Transformers
Data.
We conduct experiments on the IWSLT14 German-English (De-En) and English-German (En-De) translation tasks (Cettolo et al., 2014), which contain 160K training sentences and 7K validation sentences randomly sampled from the training data. We test on the concatenation of tst2010, tst2011, tst2012, tst2013, and dev2010. We preprocess the data using byte pair encoding (BPE) (Sennrich et al., 2016).
A.3. Details of the memory splitting procedure
For a budget B (the number of parameters), we train several memory splits E(N, B/N), each containing N networks of size B/N, with N = 1, 2, 4, ... The number of parameters in a network is a quadratic function of the width factor. To obtain a network of size B/N, we solve the quadratic equation with respect to the width factor and then round the width factor to the nearest integer. The difference in the number of parameters between the resulting memory split and the budget B is negligible compared to B.
To make predictions with an ensemble, we average the predictions of the individual networks after the softmax, i.e., we average the discrete distributions over classes.
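As an illustration, the width factor for a target size can also be found with a simple search, given any function that counts the parameters of a candidate width factor (e.g. by instantiating the network); this is a sketch, not the paper's code.

```python
def width_factor_for_budget(target_params, count_params, k_max=2048):
    """Return the integer width factor k whose parameter count is closest to
    target_params; count_params(k) is assumed to be increasing in k
    (it is a quadratic function of the width factor)."""
    return min(range(1, k_max + 1),
               key=lambda k: abs(count_params(k) - target_params))

# Usage sketch: for a memory split E(N, B/N),
# k = width_factor_for_budget(budget // n_networks, count_params)
```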
A.4. Computing infrastructure
Experiments were conducted on NVIDIA Tesla V100, Tesla P40, and Tesla P100 GPUs.
B. Individual network quality
Figure 7 shows the test quality of an individual network for different model sizes. We present the results for the settings with and without hyperparameter tuning for each network size. The results confirm the generally accepted view that increasing the network size leads to higher quality.
C. Optimal memory splits
Optimal memory splits for VGG/WideResNet on CIFAR-10 and Transformer on IWSLT'14 De-En for different memory budgets are shown in figures 8 and 9, respectively. The general pattern for these dataset–architecture pairs looks very similar to the results on CIFAR-100: with increasing memory budget, the optimal memory split grows both in terms of ensemble size and member network size.
In the case of VGG and WideResNet on CIFAR-10, similarly to CIFAR-100, optimal memory splits for the setting with hyperparameter tuning contain fewer networks of larger size. However, in the case of Transformer, the comparison is not that clear. The reason is that hyperparameter tuning for Transformer not only makes large networks better by regularizing them, but also makes smaller networks better by employing higher learning rates.
The comparison of optimal memory splits between VGG and WideResNet on CIFAR-10 generally looks similar to the results on CIFAR-100, but is a bit noisier (figure 8, right).
D. MSA effect for uncertainty estimation
Memory split plots for the calibrated test NLL for VGG with tuned hyperparameters are shown in figure 10. The results generally look similar to the case of WideResNet: the MSA effect holds for the calibrated test NLL in all the same cases as for the test accuracy.
Figure 7 panels (left: no hyperparameter tuning; right: tuned hyperparameters): WideResNet on CIFAR-100, VGG on CIFAR-100, WideResNet on CIFAR-10, VGG on CIFAR-10, and Transformer on IWSLT De-En; each panel shows the test accuracy (or test BLEU) of individual models against the network size in standard budgets.
Figure 7.
Quality of an individual network for different model sizes. To draw the boxplots, we use all the networks trained in the memory split experiments.
Figure 8.
Optimal memory splits for VGG and WideResNet on CIFAR-10 for different memory budgets. For each optimal memory split, the ensemble size is shown in the top plot, while the corresponding size of the member network is shown in the bottom plot. Left: results for the settings with and without hyperparameter tuning. Right: comparison of memory splits for VGG and WideResNet, aligned in terms of the number of parameters.
Figure 9.
Optimal memory splits for Transformer on IWSLT'14 De-En for different memory budgets, for the settings with and without hyperparameter tuning. For each optimal memory split, the ensemble size is shown in the top plot, while the corresponding size of the member network is shown in the bottom plot.
Figure 10 panels: VGG on CIFAR-100 (top) and VGG on CIFAR-10 (bottom), tuned hyperparameters; calibrated test NLL against the number of networks, one line per memory budget (in standard budgets).
Figure 10.