A Constructive Prediction of the Generalization Error Across Scales
Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, Nir Shavit
Published as a conference paper at ICLR 2020
{jonsr,belinkov,shanir}@csail.mit.edu, [email protected]
Massachusetts Institute of Technology; York University; Harvard University; Neural Magic Inc; Tel Aviv University

ABSTRACT
The dependency of the generalization error of neural networks on model and dataset size is of critical importance both in practice and for understanding the theory of neural networks. Nevertheless, the functional form of this dependency remains elusive. In this work, we present a functional form which approximates well the generalization error in practice. Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales. Our construction follows insights obtained from observations conducted over a range of model/data scales, in various model types and datasets, in vision and language tasks. We show that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data.
1 INTRODUCTION
With the success and heightened adoption of neural networks for real world tasks, some questions remain poorly answered. For a given task and model architecture, how much data would one require to reach a prescribed performance level? How big a model would be needed?

Addressing such questions is made especially difficult by the mounting evidence that large, deep neural networks trained on large-scale data outperform their smaller counterparts, rendering the training of high performance models prohibitively costly. Indeed, in the absence of practical answers to the above questions, surrogate approaches have proven useful. One such common approach is model scaling, where one designs and compares small-scale models, and applies the obtained architectural principles at a larger scale (e.g., Liu et al., 2018; Real et al., 2018; Zoph et al., 2018). Despite these heuristics being widely used to various degrees of success, the relation between the performance of a model in the small- and large-scale settings is not well understood. Hence, exploring the limitations or improving the efficiency of such methods remains subject to trial and error.

In this work we circle back to the fundamental question: what is the (functional) relation between generalization error and model and dataset sizes? Critically, we capitalize on the concept of model scaling in its strictest form: we consider the case where there is some given scaling policy that completely defines how to scale up a model from small to large scales. We include in this context all model parameters, such that traversing from one scale (in which all parameters are known) to another requires no additional resources for specifying the model (e.g., architecture search/design). We empirically explore the behavior of the generalization error over a wide range of datasets and models in vision and language tasks.
While the error landscape seems fairly complex at first glance, we observe the emergence of several key characteristics shared across benchmarks and domains. Chief among these characteristics is the emergence of regions where power-law behavior approximates the error well, both with respect to data size when holding model size fixed, and vice versa.

Motivated by these observations, we establish criteria which a function approximating the error landscape should meet. We propose an intuitive candidate for such a function and evaluate its quality, both in explaining the observed error landscapes and in extrapolating from small scale (seen) to large scale (unseen) errors. Critically, our functional approximation of the error depends on both model and data sizes. We find that this function leads to a high quality fit and extrapolation. For instance, the mean and standard deviation of the relative errors are under 2% when fitting across all scales investigated, and under 5% when extrapolating from a slimmed-down model (1/16 of the parameters) on a fraction of the training data (1/8 of the examples) on the ImageNet (Russakovsky et al., 2015) and WikiText-103 (Merity et al., 2016) datasets, with similar results for other datasets.

To the best of our knowledge, this is the first work that provides simultaneously:

• A joint functional form of the generalization error landscape, as dependent on both data and model size, with few, interpretable degrees of freedom (section 5).
• Direct and complete specification (via the scaling policy) of the model configuration attaining said generalization error across model and dataset sizes.
• Highly accurate approximation of error measurements across model and data scales via the functional form, evaluated on different models, datasets, and tasks (section 6).
• Highly accurate error prediction from small to large model and data (section 7).

We conclude with a discussion of some implications of our findings as a practical and principled tool for understanding network design at small scale and for efficient computation and trade-off design in general. We hope this work also provides a useful empirical leg to stand on and an invitation to search for a theory of generalization error which accounts for our findings.
2 RELATED WORK
Model scaling:
A number of studies have explored the effect of model scaling on performance. For instance, image classification networks can be scaled by depth (number of layers; He et al., 2016) or width (number of channels; Zagoruyko & Komodakis, 2016; Howard et al., 2017). More recently, Tan & Le (2019) demonstrated how scaling width, depth, and input resolution has combined positive effects larger than scaling each factor in isolation. However, this relationship has yet to be quantified in a predictive form: by how much will error change with model scaling? In this work, we focus on finding a constructive functional form for determining the model given a specified performance.
Data scaling:
It has long been recognized that more data improves performance, and various studies report such trends in both computer vision (e.g., Zhu et al., 2012; Sun et al., 2017) and language processing tasks (e.g., Banko & Brill, 2001; Talmor & Berant, 2019). A number of prior studies observed power-law relations between the generalization error and training data size (Cho et al., 2015; Miceli Barone et al., 2017; Johnson et al., 2018). Most relevant to our work, Hestness et al. (2017) explored the effect of data size on the generalization error in vision, language, and speech tasks, and observed a strikingly consistent power-law behavior in a large set of experiments. However, while these studies point to the empirical existence of a power law in terms of data, they do not offer tools for predicting the performance given a specified model. Nor do they offer low-cost methods to specify the model configuration which would attain the power law with data dependency. Indeed, Hestness et al. had to search over models and their configurations at large scale to exhibit their findings, incurring prohibitive computational costs.

In contrast, we demonstrate a constructive recipe, where we directly predict the test performance at large scale and specify the full model configuration which attains it (with no need for large-scale search), given performance at small scale.
Predicting model performance:
Since training models at full data/model scale may be computationally prohibitive, a line of work tries to predict the performance of a given model on a given dataset, without training the model, for example by using a bank of previously trained models, datasets, and their associated performances (Istrate et al., 2019). Others have proposed to estimate performance on small data (Klein et al., 2017) or model sizes (Zoph et al., 2018; Real et al., 2019) in the context of neural architecture search (NAS). In this case, the small-scale evaluation is used to compare models at small cost, to expedite the search process; see Elsken et al. (2019) for a recent survey. Our work complements previous approaches by demonstrating a functional form that can predict large-scale performance from small-scale measurements. Moreover, our method may be integrated in NAS, addressing some of its current limitations (as discussed in section 8).

Table 1: The datasets and models used in this work, along with their original training data size and the range of explored scales. For more information, see appendix A.

(a) Training data size (number of words) and model size (number of parameters excluding word embeddings) for language modeling tasks.
Dataset       Size (N)   Scales (n)            Base Model      Size (M)   Scales (m)
PTB           0.9M       2^-k N, 0 ≤ k ≤ 5     AWD-LSTM        20M        4^-k M, 0 ≤ k ≤ 6
WikiText-2    2M         2^-k N, 0 ≤ k ≤ 5     AWD-LSTM        20M        4^-k M, 0 ≤ k ≤ 6
WikiText-103  100M       2^-k N, 0 ≤ k ≤ 5     Transformer-XL  41M        4^-k M, 0 ≤ k ≤ 6

(b) Training data size (number of images) and model size (number of parameters) for image classification tasks.
Dataset   Size (N)  Scales (n)           Base Model  Size (M)  Scales (m)
ImageNet  1.2M      2^-k N, 0 ≤ k ≤ 5    ResNet-50   25.5M     4^-k M, 0 ≤ k ≤ 6
CIFAR10   60K       2^-k N, 0 ≤ k ≤ 5    WRN-44-16   0.7M      4^-k M, −2 ≤ k ≤ 5
CIFAR100  60K       2^-k N, 0 ≤ k ≤ 5    WRN-44-16   0.7M      4^-k M, −2 ≤ k ≤ 5
DTD       5640      2^-k N, 0 ≤ k ≤ 5    WRN-44-16   0.7M      4^-k M, −2 ≤ k ≤ 5
Aircraft  10K       2^-k N, 0 ≤ k ≤ 5    WRN-44-16   0.7M      4^-k M, −2 ≤ k ≤ 5
UCF101    13K       2^-k N, 0 ≤ k ≤ 5    WRN-44-16   0.7M      4^-k M, −2 ≤ k ≤ 5
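The configuration grids of Table 1 can be enumerated programmatically. The sketch below assumes the data-halving (2^-k) and model-quartering (4^-k) pattern suggested by the log-scaled axes of figures 1 and 2; the helper name and the exact k-ranges are illustrative, not taken from the authors' code.

```python
# Hypothetical sketch: enumerate all (n, m) configurations implied by a
# scaling policy that halves the data and quarters the model per step.

def scale_grid(N, M, data_ks, model_ks):
    """All (n, m) configurations spanned by the scaling policy."""
    return [(N * 2 ** -k, M * 4 ** -j) for k in data_ks for j in model_ks]

# Illustrative values for WikiText-103 / Transformer-XL from Table 1.
grid = scale_grid(N=100_000_000, M=41_000_000,
                  data_ks=range(6), model_ks=range(7))
print(len(grid))  # 42 configurations
```

With 6 data scales and 7 model scales, this yields 42 configurations, consistent with the 42–49 model-data configurations per dataset mentioned in section 6.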
Theoretical error bounds:
Much attention has been given to theoretical explanations of the generalization capabilities of deep neural networks (Neyshabur et al., 2017a;b; Allen-Zhu et al., 2018a;b; Arora et al., 2018). While fully engaging with this literature is beyond our scope, we note that recent studies have derived bounds involving power-law dependencies in both model (Yarotsky, 2018) and data size (Liang et al., 2019). We leave it as an open question for future work to find theoretical explanations for the empirical behavior and the functional form we investigate in this work.
3 EXPERIMENTAL SETUP
Notation:
Let D_n = {x_i, y_i}_{i=1}^n denote a labeled (training) dataset with n samples or datapoints. Let f_m denote a neural network whose size is the number of parameters m, such that ŷ = f_m(x) is the predicted label. Let ε(n, m) be the generalization error as a function of n and m, measured by a performance metric (e.g., top-1 accuracy or cross-entropy loss) on a held-out test set. We refer to this error function as the error landscape.

3.1 SCALING POLICIES
Dataset scaling:
We wish to scale datasets while preserving the original distribution. For image classification, we uniformly subsample all classes by a constant ratio, thus preserving the relative sample size per class. We limit the maximal sub-sampling to avoid eradicating any class. For language modeling, where the number of classes (vocabulary items) has a very long tail distribution, we randomly sample sentences such that the total number of sampled words will be a certain fraction of the original dataset. Table 1 reports the data scales we use. In all tasks the held-out test set remains untouched for evaluating the error.
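A minimal sketch of the class-balanced subsampling described above, for the image-classification case. The function name and the list-of-pairs dataset representation are our own illustration, not the authors' pipeline.

```python
# Hypothetical sketch of class-balanced subsampling: keep the same
# fraction of every class so relative class sizes are preserved.
import random
from collections import defaultdict

def subsample_balanced(dataset, fraction, seed=0):
    """dataset: list of (x, label) pairs; keep `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append((x, y))
    subset = []
    for items in by_class.values():
        k = max(1, int(len(items) * fraction))  # never eradicate a class
        subset.extend(rng.sample(items, k))
    return subset

data = [(i, i % 10) for i in range(1000)]  # toy set: 10 classes, 100 each
half = subsample_balanced(data, 0.5)
print(len(half))  # 500
```

Each class contributes exactly int(100 * 0.5) = 50 samples, so the relative sample size per class is unchanged.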
Model scaling:
We are critically interested in a method where moving across scales is defined by some scaling function, such that no additional significant computation would be incurred. We thus consider the case where the model architecture is given and the model size determines how to scale it. For instance, one may scale width (number of channels in convolutional networks, hidden state size in recurrent networks), depth (number of layers), do compound scaling (Tan & Le, 2019), or more generally define a function tying the model degrees of freedom and size. We focus primarily on width scaling in our experiments; the model scales are reported in Table 1. We also perform selected depth scaling to demonstrate flexibility with respect to the scaling method.

Figure 1: Error landscapes in log-log-log scale. Each point (blue dot) is the error resulting from training with a model/data configuration (m, n). The surface is a linear interpolation between the points, which is then projected on the (m, ε), (n, ε) and (m, n) planes. See Appendix C for details. (a) Wiki103 error (cross entropy) landscape. (b) CIFAR10 error (top-1) landscape.

Figure 2: Error vs. data size (left part of each subfigure) and model size (right part) for Wiki103 and CIFAR10. Solid dots are measurements, dashed lines are best fit to saturating power-law. (a) Wiki103 cross entropy vs. data and model size. (b) CIFAR10 top-1 error vs. data and model size.
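To make the width-scaling policy concrete: in a convolutional network, parameters scale with the product of input and output channel counts, so multiplying every width by s multiplies the conv parameter count by roughly s². A toy sketch (the plain 3x3 conv stack, layer widths, and helper name are illustrative assumptions, not the paper's architectures):

```python
# Illustrative width-scaling sketch: halving all widths roughly quarters
# the conv parameter count (first layer's input channels stay fixed).

def conv_params(widths, kernel=3, in_ch=3):
    """Parameter count of a plain conv stack with square kernels, no biases."""
    total, prev = 0, in_ch
    for w in widths:
        total += prev * w * kernel * kernel
        prev = w
    return total

base = [16, 32, 64, 128]
full = conv_params(base)
half = conv_params([w // 2 for w in base])
print(full, half)  # 97200 24408, a ratio just under 4
```

This is why one width-halving step corresponds roughly to a 4x reduction in model size m under this policy.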
Hyper-parameters:
For similar reasons we wish to avoid hyper-parameter search at large scales, and thus avoid the temptation to tune hyper-parameters accordingly (learning rate, regularization, etc.). Therefore, we hold all hyper-parameters fixed. This enables us to construct a functional form that fits the error landscape and can be used to predict the error across scales while completely defining the model attaining it. We consider pros and cons of this approach in the discussion (section 8).

3.2 TASKS, MODELS, OPTIMIZERS AND DATASETS
We experiment with both vision and language tasks. We use 6 benchmark datasets for image classification and 3 for language modeling. For image classification, we train ResNet (He et al., 2016) and WRN models (Zagoruyko & Komodakis, 2016) with stochastic gradient descent (SGD). In section 6.2 we explore the effect of varying architectures and optimizers for a fixed task (CIFAR100), adding VGG16 (Simonyan & Zisserman, 2014) and DenseNet (Huang et al., 2017) models trained with both Adam (Kingma & Ba, 2015) and SGD. For language modeling, we train AWD-LSTM (Merity et al., 2018) and Transformer-XL models (Dai et al., 2019) with SGD and Adam optimizers respectively. Summary statistics are shown in Table 1, along with the range of explored scales. Appendix A gives additional information.
4 OBSERVATIONS ON THE ERROR LANDSCAPE
Figures 1a and 1b respectively show an example test error landscape for width scaling of Transformer-XL on WikiText-103 and WRN-44-16 on CIFAR10. Various additional such landscapes are found in appendix C, showing largely consistent patterns. Examining the error landscapes yields the following observations:

O1 Model scaling
O1.1 For a given dataset size, scaling up the model results in an initial decrease in test error, which then saturates to a level determined by the dataset size. This behavior has been noted by Tan & Le (2019) across varied model scaling methods, although they have not engaged with the dependency on dataset size.

O1.2 The rate of error decrease with model size appears well approximated by a power-law.

These two observations together can be summarized as the following relation:

    ε(m, n) ≈ b(n) m^(−β(n)) + c_m(n)    (1)

where b, β, c_m may depend on the data size n, s.t. as m grows, ε → c_m. Example fits to this form (allowing b, β, c_m to be fit per n) are seen in figure 2a (right) and figure 2b (right).

O2 Data scaling
O2.1 For a given model size, scaling up the dataset results in an initial increase in performance, which then saturates to a level determined by the model size.

O2.2 The rate of error decrease with dataset size appears well approximated by a power-law. Hestness et al. (2017) also noted a similar relationship, but did not functionally tie the saturation level to the model size.

These two observations together can be summarized as the following relation:

    ε(m, n) ≈ a(m) n^(−α(m)) + c_n(m)    (2)

where a, α, c_n may depend on the model size m, s.t. as n grows, ε → c_n. Example fits to this form (allowing a, α, c_n to be fit per m) are seen in figure 2a (left) and figure 2b (left).
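Fitting the saturating form of equation 2 at a fixed model size can be done by scanning the saturation level and solving a log-log linear regression at each candidate. The pure-Python sketch below is our illustration of the idea, not the authors' fitting code (they fit the full joint form of section 5 by least squares).

```python
# Sketch: fit errs ~ a * n**-alpha + c by scanning the asymptote c and
# running a closed-form log-log regression for a and alpha per candidate.
import math

def fit_saturating_power_law(ns, errs):
    best = None
    lo = min(errs)
    for c in [lo * i / 200 for i in range(200)]:  # candidates below min error
        xs = [math.log(n) for n in ns]
        ys = [math.log(e - c) for e in errs]
        k = len(xs)
        mx, my = sum(xs) / k, sum(ys) / k
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        icpt = my - slope * mx
        resid = sum((y - slope * x - icpt) ** 2 for x, y in zip(xs, ys))
        if best is None or resid < best[0]:
            best = (resid, math.exp(icpt), -slope, c)
    _, a, alpha, c = best
    return a, alpha, c

# Synthetic check: recover a known saturating curve.
ns = [1000 * 2 ** k for k in range(8)]
errs = [0.5 * n ** -0.35 + 0.08 for n in ns]
a, alpha, c = fit_saturating_power_law(ns, errs)
print(alpha, c)  # close to the true alpha = 0.35 and c = 0.08
```

The same procedure, with m in place of n, fits equation 1 per fixed data size.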
O3 Joint properties. The behavior of the error when scaling model size while holding data size fixed, and vice versa, extends to the entire error landscape in a well-behaved manner, such that the manifold ε(m, n) is smooth everywhere as a function of both model and data scales.

5 FUNCTIONAL APPROXIMATION OF THE GENERALIZATION ERROR
5.1 CRITERIA
Motivated by the above observations, we now consider a functional approximation for the error landscape. In particular, let us consider function families meeting the following criteria, which augment and restrict our observations:

C1 As either model or dataset size goes to zero, the expected performance is equivalent to a random-guess error level ε₀.
C2 For a given dataset size, scaling up the model will result in an initial increase in performance, which will then saturate, taking the form in equation 1.
C3 For a given model size, scaling up the dataset will result in an initial increase in performance, which will then saturate, taking the form in equation 2.
C4 There exists an irreducible error ε∞, intrinsic to the dataset.
C5 The function must be smooth everywhere and monotonic non-increasing in terms of model and data size (observation O3).

While there are many possible function families meeting the above criteria, below we propose a simple function family for our evaluation. We do not claim that this is in fact the true underlying dependency, but rather that it serves as a good approximation of the error landscape, consistent with these criteria.

(Footnote: At some point error increase ensues; this point differs between datasets, see Appendix C for examples.)
(Footnote: The best-guess error level as m → 0 (at given n) or n → 0 (at given m) need not coincide, but can, e.g., in a balanced dataset.)

5.2 PROPOSED FUNCTION FAMILY
As a first insightful step, consider the implications of satisfying C2 and C3 simultaneously. By examining the limiting behavior as m or n grow, we have:

    As m grows large: c_m(n) ≈ a(m) n^(−α(m)) + c_n(m)
    As n grows large: c_n(m) ≈ b(n) m^(−β(n)) + c_m(n)

Thus, a consistent form satisfying C2 and C3 simultaneously is:

    ε(m, n) ≈ a(m) n^(−α(m)) + b(n) m^(−β(n)) + c∞    (3)

where c∞ is a constant not dependent on either m or n.

Let us now examine the simplified case where a, b, α, β are constant:

    ε̃(m, n) = a n^(−α) + b m^(−β) + c∞    (4)

where α ≥ 0 and β ≥ 0 control the global rate at which error decreases with data and model size, respectively, a > 0 and b > 0 are a form of unit conversion between data and model sizes and error, and c∞ > 0 is the asymptotic lower value attainable. This function is a special case of equation 3 and meets criteria C2 and C3 by construction. Importantly, C4 and C5 are also met.

However, by giving up the dependence of a, b, α, β on m, n, this function does not meet criterion C1. We thus need to model the transition from the initial random-guess level to the power-law region. We propose to parameterize the transition using the following envelope (complex) function:

    ε̂(m, n) = ε₀ ‖ ε̃(m, n) / (ε̃(m, n) − iη) ‖ = ε₀ ‖ (a n^(−α) + b m^(−β) + c∞) / (a n^(−α) + b m^(−β) + c∞ − iη) ‖    (5)

where i = √(−1). Here the simple pole at η controls the transition point from the initial random-guess level ε₀ as (m, n) increase. As (m, n) grow, ε̃ → c∞ and the final irreducible error ε∞ ≜ ε₀ c∞ η^(−1) is approached. The random-guess error ε₀ is a known parameter determined by dataset statistics (e.g., (N_classes − 1)/N_classes for a balanced dataset).
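The envelope of equation 5 is straightforward to evaluate with complex arithmetic. The sketch below uses arbitrary illustrative parameter values (not fitted values from the paper) and checks the two limiting behaviors: near the random-guess level ε₀ at tiny scales, and near ε₀·c∞/η at very large scales.

```python
# Sketch of equations 4 and 5 with arbitrary illustrative parameters.

def eps_tilde(m, n, a=10.0, b=10.0, alpha=0.4, beta=0.3, c_inf=0.1):
    """Equation 4: saturating power law in both data and model size."""
    return a * n ** -alpha + b * m ** -beta + c_inf

def eps_hat(m, n, eps0=0.9, eta=1.0):
    """Equation 5: complex 'envelope' |t / (t - i*eta)| scaled by eps0."""
    t = eps_tilde(m, n)
    return eps0 * abs(t / (t - 1j * eta))

small = eps_hat(1, 1)        # tiny model and data: near eps0 = 0.9
large = eps_hat(1e12, 1e12)  # huge model and data: near eps0*c_inf/eta = 0.09
print(round(small, 2), round(large, 2))  # 0.9 0.09
```

The pole at η sets the transition point: while ε̃ is well above η the envelope is near 1 and ε̂ ≈ ε₀; once ε̃ falls toward c∞, ε̂ approaches the irreducible level ε₀ c∞/η.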
Note that, due to our choice of a rational envelope, we can divide the form in equation 4 by a constant; without loss of generality, let us choose a = 1. Note that while the forms in equations 3 and 4 are well motivated, the approach taken for modeling the transition is solely one of convenience. In fact, the transition(s) as a function of m and n may be captured in the functional forms of a, b, α, β or by another envelope mechanism. We leave a more refined investigation of the nature of the transitions to future work.

6 ERROR LANDSCAPE ESTIMATION
We wish to empirically estimate the quality of the proposed functional parameterization as a fit to the true error landscape. Let ε̂(n, m; θ) be the parametric function family (equation 5) approximating the error landscape ε(n, m), where θ = {α, β, b, c∞, η}. Define the divergence δ(n, m; θ) as the relative difference between the estimated error ε̂(m, n; θ) and the true error ε(m, n):

    δ(n, m; θ) ≜ (ε̂(m, n; θ) − ε(m, n)) / ε(m, n)

We fit a least squares regression model to find the best parameters minimizing the divergence. In this section, we fit the function using 10-fold cross-validation across all model/data configurations m, n (see Table 1) and evaluate the fit quality. (In the next section, we perform extrapolation experiments, from seen to unseen points.) We perform the fit separately for each dataset and evaluate its quality by the mean µ and standard deviation σ of the divergence δ over all points (m, n). See Appendix B.1 for experimental details.

As figure 3 shows, estimated test accuracy is highly correlated with actual test accuracy for various datasets, with worst-case values µ < 1% and σ < 5%. Note that the number of free parameters is small (|θ| ≤ 6) compared to the number of points (42–49 model-data configurations), demonstrating the appropriateness of the proposed function for modeling the complex error landscape.

(a) Estimated vs. actual cross-entropy loss for various language modeling datasets (wiki103: µ = −0.1±1.3%, σ = 1.2±0.3%; PTB: µ = −0.0±0.3%, σ = 0.7±0.3%; wiki2: µ = −0.0±0.2%, σ = 0.4±0.2%).
(b) Estimated vs. actual test error for various image classification datasets (aircraft: µ = 0.5±0.1%, σ = 1.5±0.2%; dtd: µ = 0.2±0.1%, σ = 1.5±0.0%; ucf101: µ = −0.5±1.6%, σ = 4.4±0.7%; cifar10: µ = 0.1±0.1%, σ = 4.5±0.1%; imagenet: µ = 0.3±0.3%, σ = 1.9±0.5%; cifar100: µ = 0.7±0.2%, σ = 2.2±0.1%).

Figure 3: Error estimation results, using 10-fold cross-validation on all configurations in each dataset. For reference, the identity line is shown in blue. The legend shows mean µ and standard deviation σ of the divergence δ (± one std). See Appendix C for the actual and estimated landscapes in each dataset.

(a) Error landscape when scaling depth (at constant baseline width). (b) Width scaling fit at different constant depths (D): D=8: 0.4/4.2; D=14: 0.3/4.3; D=32: 1.2/5.3; D=44: 1.7/5.2; D=62: 0.4/4.4; D=128: 0.1/5.8. (c) Depth scaling fit at different constant widths (W): W=1: 0.1/4.0; W=2: 2.3/5.8; W=4: 0.6/4.9; W=8: −0.1/2.6; W=16: 0.3/2.9; W=32: 0.7/4.1; W=64: 0.6/4.1; W=128: 0.8/7.4.
Figure 4: Error landscape estimation results on CIFAR10 for width and depth scaling, showing small and comparable fit errors in both cases. Numbers in legends denote mean/variance of the estimation divergence.

6.1 A PROBE INTO DEPTH SCALING
Here we verify that our results extend to another canonical scaling policy, namely depth scaling. Figure 4a shows the error landscape with depth scaling on CIFAR10, exhibiting the same characteristics as width scaling. Figures 4b and 4c show error landscape estimation results for both cases of width and depth scaling, exhibiting small and comparable fit errors (confidence intervals < 1%). Since the difference in approximation quality is effectively indistinguishable when scaling depth or width orthogonally, we expect compound scaling to adhere to the same functional form. Indeed, we verified this on the publicly available (model scaling only) results for EfficientNet (Tan & Le, 2019).

6.2 ON THE VARIETY OF OPTIMIZERS AND ARCHITECTURES
Our study covers a deliberate variety of architectures (ResNet, WRN, LSTM, Transformer) and optimizers (Adam, SGD variants), following standard implementations in the literature as recommended for each dataset/model setting; see Appendix A. However, the model/optimizer settings differ in multiple aspects across the different tasks, rendering the comparison of, say, different optimizers, challenging.

(Footnote: For image classification, we set ε₀ = (N_classes − 1)/N_classes (the balanced dataset case). For language modeling, we estimate ε₀ as another parameter, such that θ = {α, β, b, c∞, η, ε₀} in this case.)

Figure 6: Extrapolation results. (a) Illustration of the extrapolation setup, where we fit on a subset of the points (in green) and predict on larger points (in red). (b) Extrapolation on ImageNet (model fraction 1/16, data fraction 1/8; µ = −4.5%, σ = 4.68%). (c) Extrapolation on WikiText-103 (model fraction 1/16, data fraction 1/8; µ = 0.5%, σ = 1.69%). Comprehensive results are given in Appendix D.

Figure 5: CIFAR100 error estimation results with three architectures (WRN, VGG, DenseNet) and two optimizers (SGD, Adam): wrn/sgd: µ = 0.0±0.8%, σ = 1.4±0.5%; vgg/adam: µ = 0.0±1.4%, σ = 1.3±0.6%; vgg/sgd: µ = 0.0±1.5%, σ = 1.3±0.9%; densenet/sgd: µ = 0.5±3.2%, σ = 4.5±2.7%; densenet/adam: µ = 0.5±2.6%, σ = 5.2±2.7%.
In this section we verify that the functional form holds when varying the optimizer and/or the architecture on the same task, namely image classification on CIFAR100. In addition to the previously examined setting of WRN with SGD, we add four more settings: two well known architectures (VGG and DenseNet), each trained with both SGD and Adam optimizers. See Appendix A for experimental details. Figure 5 exhibits consistent, accurate fit values across all architecture/optimizer settings, with mean divergence of µ < 1% (std: σ < 6%).

7 EXTRAPOLATION
In this section, we evaluate the ability of our functional approximation to extrapolate beyond seen model/data configurations. The primary question we ask is: can we predict the error of a large model/data configuration from the errors of smaller-scale model/data configurations? To do this, we fit the least squares regression on a subset of the configurations and predict the error on larger, unseen configurations. More formally, let (m_i, n_j) denote a given model/data configuration. We first estimate parameters θ_ij by fitting the function in equation 5 on all points of at most that size (m ≤ m_i, n ≤ n_j). Then we predict the error ε(m, n) at all points corresponding to larger configurations (m > m_i, n > n_j) using the estimated θ_ij. Finally, we measure the divergence δ(m, n) between the estimated error and the actual error at all larger configurations. This process is illustrated in figure 6a.

Figure 6b shows the results of one such extrapolation experiment, on ImageNet. In this case, we have fit the functional form on all configurations of model size m ≤ m_i = M/16 and data size n ≤ n_j = N/8, and predicted the error on all larger configurations. As the figure shows, the extrapolation is highly accurate, with a mean divergence of µ = 4.5% (std: σ = 4.7%). Figure 6c reports a similar experiment on WikiText-103. Here, again, we see very good extrapolation, with a mean divergence of µ = 0.5% (std: σ = 1.7%). Note that each extrapolation is run 10 times with different random initializations of θ_ij in the least squares, with negligible effect on the prediction.

In practice, we may be interested in extrapolation quality with different subsets of configurations. Appendix D provides detailed extrapolation results on multiple subsets of configurations, for both vision and language datasets.
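The extrapolation protocol can be sketched end-to-end on a synthetic, noise-free landscape generated from equation 4 with known parameters. The brute-force grid "fit" below stands in for the paper's least-squares regression, and fitting only α and β (holding the other parameters at their true values) is a simplification for illustration.

```python
# Protocol sketch: fit on small (m, n) configurations, predict larger ones,
# and measure the relative divergence delta on the held-out configurations.

def eps(m, n, a, b, alpha, beta, c):
    return a * n ** -alpha + b * m ** -beta + c   # equation 4

TRUE = dict(a=1.0, b=2.0, alpha=0.4, beta=0.3, c=0.05)
scales = [2 ** k for k in range(8)]
landscape = {(m, n): eps(m, n, **TRUE) for m in scales for n in scales}

# Fit alpha, beta on the "seen" small configurations (m <= 8, n <= 8).
seen = {k: v for k, v in landscape.items() if k[0] <= 8 and k[1] <= 8}
grid = [i / 20 for i in range(1, 20)]
alpha_hat, beta_hat = min(
    ((al, be) for al in grid for be in grid),
    key=lambda p: sum((eps(m, n, 1.0, 2.0, p[0], p[1], 0.05) - v) ** 2
                      for (m, n), v in seen.items()),
)

# Predict strictly larger configurations and measure the divergence delta.
unseen = {k: v for k, v in landscape.items() if k[0] > 8 and k[1] > 8}
deltas = [abs(eps(m, n, 1.0, 2.0, alpha_hat, beta_hat, 0.05) - v) / v
          for (m, n), v in unseen.items()]
print(alpha_hat, beta_hat, max(deltas))  # 0.4 0.3 0.0 (noise-free recovery)
```

On real measurements one would fit all of θ (including the envelope parameters of equation 5) by least squares, and repeat over multiple seen/unseen splits as in Appendix D.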
Generally, the extrapolation performs well once it is not ill-posed; ill-posedness may be caused by lack of signal in the region of the initial "random-guess" level, or by degenerate cases like having fewer measurements than the number of free parameters in θ.

8 DISCUSSION AND CONCLUSION
In this work, through insights gained by the joint examination of the dependencies of generalization error on both model and data size, we arrive at criteria for functions consistent with the form of the generalization error under a given scaling policy. We consider one such function and find it to be in very good agreement with the actual behavior of the error landscape. Indeed, the agreement is strong enough that extrapolation from small to large scale becomes feasible: the function predicts the behavior of the generalization error in practice for the practical case of scaling models and data. We discuss several example implications of knowing such a functional form.
Small-scale network development:
At the core of small-fidelity searches is the notion of performance rank comparison between models. However, small-scale and large-scale ranks are not assured to be consistent. If a functional form such as the one empirically found in this work holds very generally, then, in contrast, one can safely assess scaling rank between models at small scale, with the assurance that it remains consistent. This suggests that one would be well served by searching over scaling policies; a pertinent example of such a success is Tan & Le (2019). The functional form also explains the limitation of small-scale search: once reaching the random-guess error level, where the sensitivity to scaling vanishes, the informativeness of ranking diminishes. Finally, the functional form allows direct usage of differentiable methods for NAS.
Principled design:
Knowing the error landscape function facilitates reasoning about the choice of (m, n) attaining a specified error level. In other words, for any given error level, one can solve Eq. 5 for m, n based on small-scale measurements. Thus, one can quantitatively answer design questions regarding the expected (in particular, large-scale) relations between m, n, and ε. In fact, Eq. 5 provides direct answers to questions such as "how much data would one require to reach a prescribed performance level?" or "how big a model would be needed?" Imposing constraints is also straightforward. For instance, consider the following question: "What is the maximal model size possibly needed (useful), when the data is limited in size, n = n_lim (for a given model architecture and scaling policy)?" For a fixed dataset size, model scaling eventually contributes marginally to error reduction and becomes negligible when b m^(−β) ≪ n_lim^(−α) (Eq. 5). Define the relative contribution threshold T as satisfying T = n_lim^(−α) / (b m_max^(−β)). (For example, T = 10.) Then the maximal useful model size meeting threshold T is:

    m_max(T) = (bT)^(1/β) n_lim^(α/β)

Similarly, the maximal useful amount of data for a limited-size model m_lim is:

    n_max(T) = (1/(bT))^(1/α) m_lim^(β/α)

Moreover, Eq. 5 allows for complex design trade-offs. Generally, given some design trade-off cost function C(m, n, ε), one can minimize such cost subject to Eq. 5. For example, consider the case of optimizing for efficient computation, which has both practical and environmental importance (Schwartz et al., 2019). Since the number of FLOPs during training is ∝ m · n (for constant epoch budget), the trade-off cost function may be formulated as C(FLOPs, ε) = C(mn, ε). Further, since a constant error contour is very well approximated by c = n^(−α) + b m^(−β) (Eq. 5), datasets and models may be scaled with optimal resource efficiency, with no effect on performance, by solving:

    argmin_{m,n} m · n   s.t.   c = n^(−α) + b m^(−β)

The solution gives the optimal computational-efficiency ratio of model to data size: (bβ/α) · n^α / m^β = 1.

Limitations:
We have made a few simplifying assumptions in our choice of approximating function, in particular in how we model the transition from the initial random-guess error level, and in unifying the random-guess levels of the two scenarios (small model with large data, and large model with small data). We leave a more detailed examination of the behavior of the transitions from random-guess error levels, and refinements of the functional form, to future work.

Critically, the restrictive nature of our scaling framework (all parameters and hyperparameters described by a policy) is both a blessing and a challenge. The blessing comes in fulfilling the goal of finding simultaneously both the form of the generalization error and the full specification of the model and hyperparameters that attain it across scales. The challenge is that we have demonstrated in this work only the case of constant hyperparameters. We conjecture that the relation between model configuration and hyperparameter choice (Zela et al., 2018) may entail the potential to formulate hyperparameter-scaling policies similar in nature to the model-scaling policies, and that these too fall under the scope of the form we find in this work. This too will be the subject of future work.

We hope that this work will bring the actual functional form of the generalization error in this practical case of scaling to the fore, both in practice and as an empirical leg to stand on in the quest for its theoretical origins.
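The design relations discussed above (the maximal useful model size m_max(T) and the compute-optimal ratio on a constant-error contour) can be evaluated numerically. The sketch below is a minimal illustration: the parameter values b, α, β, and n_lim are hypothetical stand-ins, not fitted values from the paper.

```python
def max_useful_model(b, alpha, beta, n_lim, T=10.0):
    """Maximal useful model size m_max(T) = (b*T)^(1/beta) * n_lim^(alpha/beta),
    beyond which the model term b*m^-beta is T times smaller than n_lim^-alpha."""
    return (b * T) ** (1.0 / beta) * n_lim ** (alpha / beta)

def compute_optimal_ratio(m, n, b, alpha, beta):
    """The quantity (b*beta/alpha) * n^alpha / m^beta; it equals 1 at the
    FLOP-optimal (m, n) trade-off along a constant-error contour."""
    return (b * beta / alpha) * n ** alpha / m ** beta

# Hypothetical fitted parameters and data limit.
b, alpha, beta = 0.5, 0.7, 0.6
n_lim = 1e6
m_max = max_useful_model(b, alpha, beta, n_lim)
# Sanity check: at m_max the threshold definition T = n^-alpha / (b*m^-beta) holds.
T = n_lim ** -alpha / (b * m_max ** -beta)
print(round(T, 6))  # 10.0
```

The same pattern applies to n_max(T), with the roles of m and n exchanged.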
ACKNOWLEDGMENTS

We thank Alexander Rakhlin, Alexander Madry, Kai Xiao, Lu Mi, Vikas Garg, Dan Alistarh, and Tommi Jaakkola for discussions and their help. We also thank the anonymous reviewers for their valuable feedback. J.R. was partly supported by the Eli and Dorothy Berman Fellowship as well as grants NSF IIS-1447786, NSF CCF-1563880, and China-Singapore Suzhou Industrial Park. A.R. was partially supported by the Air Force Office of Scientific Research USA (FA9550-18-1-0054) through a grant to John K. Tsotsos. Y.B. was partly supported by the Harvard Mind, Brain, and Behavior Initiative.

REFERENCES
Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018b.
Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
Michele Banko and Eric Brill. Mitigating the paucity-of-data problem: Exploring the effect of training corpus size on classifier performance for natural language processing. In Proceedings of the First International Conference on Human Language Technology Research, pp. 1–5. Association for Computational Linguistics, 2001.
Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3034–3042, 2016.
James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In International Conference on Learning Representations, 2017.
Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, and Synho Do. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348, 2015.
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613, 2014.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285.
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1Dh8Tg0-.
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Roxana Istrate, Florian Scheidegger, Giovanni Mariani, Dimitrios Nikolopoulos, Costas Bekas, and A Cristiano I Malossi. TAPAS: Train-less accuracy predictor for architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3927–3934, 2019.
Mark Johnson, Peter Anderson, Mark Dras, and Mark Steedman. Predicting accuracy on large datasets from smaller pilot data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 450–455, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2072.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pp. 528–536, 2017.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the risk of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292, 2019.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyyGPP0TZ.
Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. Regularization techniques for fine-tuning in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1489–1494, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1156.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956, 2017a.
Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
E Real, A Aggarwal, Y Huang, and QV Le. Aging evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, 2019.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pp. 506–516, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. arXiv preprint arXiv:1907.10597, 2019.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852, 2017.
Alon Talmor and Jonathan Berant. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4911–4921, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1485.
Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906, 2018.
Xiangxin Zhu, Carl Vondrick, Deva Ramanan, and Charless C Fowlkes. Do we need more training data or better models for object detection? In BMVC, volume 3, pp. 5. Citeseer, 2012.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.
A DATASETS AND MODELS
A.1 IMAGE CLASSIFICATION
A.1.1 DATASETS
We evaluated our predictions on several popular image classification datasets. ImageNet (Russakovsky et al., 2015): a large-scale recognition benchmark consisting of natural images of 1000 object categories, with 1.28M training images spread roughly uniformly over the categories. It has 50K validation and 100K testing images. It has been the most popular large-scale benchmark for image classification methods for the better part of the last decade. CIFAR10/100 (Krizhevsky et al., 2009): 60K natural RGB images of 10 classes (100 for CIFAR100) with a train/test split of 50K/10K. For each of the following datasets, we use the version collated, resized, and split into train/validation/test sets by Rebuffi et al. (2017). DTD (Cimpoi et al., 2014): a texture database of 47 categories and 5640 images. Aircraft (Maji et al., 2013): 10K images of 100 different aircraft classes. UCF101 (Soomro et al., 2012): originally a video action-recognition dataset, converted using the method of Bilen et al. (2016) into a single image per video. It contains 13,320 images of 101 action classes.

A.1.2 MODELS
We experiment with four models for image classification. We use different variants of the popular ResNet architecture (He et al., 2016) in the main experiments. For ImageNet we use ResNet-50 and build on the code from the PyTorch framework (Paszke et al., 2017) to vary the model width. For all other datasets we use WRN-44-16 (Wu et al., 2016) of varying widths, modified from the implementation of Hoffer et al. (2018).

Scaling the models' width is performed by multiplying the number of channels in each convolutional layer and the width of the hidden linear layers by a constant factor and rounding to the nearest integer. The ranges of width scales (and data scales) for the main experiments are detailed in Table 1b.

In section 6.2, we perform width scaling for two additional architectures, VGG16bn (Simonyan & Zisserman, 2014) and DenseNet (L=40, k=32) (Huang et al., 2017). The VGG and DenseNet models were also modified for width scaling from the implementation of Hoffer et al. (2018). The model scales in this case are 2^−k, over a range of k, for both VGG and DenseNet.

Depth scaling, in the CIFAR10 case (section 6.1), is performed by appending extra layers within each block.
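The width-scaling rule described above (multiply each layer's channel count by a constant factor and round to the nearest integer) can be sketched as follows; the per-layer channel counts are hypothetical, chosen only to illustrate the operation, and do not reproduce any specific model from the paper.

```python
def scale_widths(channels, factor):
    """Scale a list of per-layer channel counts by a constant factor,
    rounding to the nearest integer (never below a width of 1)."""
    return [max(1, round(c * factor)) for c in channels]

# Example: scaling a hypothetical 4-stage backbone down and up.
base = [16, 32, 64, 128]
print(scale_widths(base, 0.5))  # [8, 16, 32, 64]
print(scale_widths(base, 1.5))  # [24, 48, 96, 192]
```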
A.1.3 TRAINING

In the main experiments, training is done via SGD with a momentum of 0.9, weight decay of 1e-4, and an initial learning rate of 0.1. For ImageNet we train for 90 epochs, decreasing the learning rate by a multiplicative factor of 0.1 after 30 and after 60 epochs. We use a batch size of 16. For all other vision datasets we use a batch size of 128. We begin training with a learning rate of 0.1, run for 200 epochs, and reduce by a multiplicative factor of 0.1 after 80, 120, and 160 epochs.

For the VGG and DenseNet experiments on CIFAR100 in section 6.2, we train with both SGD and Adam optimizers. We train VGG for 170 epochs and DenseNet for 300 epochs. Adam hyperparameters are default, with an initial learning rate of 1e-3. When training with SGD, we retain the initial learning rate, batch size, momentum, and weight decay as in the main experiments (at 0.1, 128, 0.9, and 1e-4, respectively) and follow standard stepped learning-rate schedules: for VGG, a learning-rate multiplicative factor of 0.1 after 80, 120, and 160 epochs; for DenseNet, a learning-rate multiplicative factor of 0.1 after 150 and 225 epochs.
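The stepped learning-rate schedules can be expressed as a small helper (a sketch, not the actual training code; the milestone lists are the ones quoted above):

```python
def stepped_lr(initial_lr, milestones, factor, epoch):
    """Learning rate at a given epoch under a stepped schedule:
    multiply by `factor` once for each milestone epoch already passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr * factor ** drops

# Main vision schedule: 0.1, dropped 10x after epochs 80, 120, and 160.
assert abs(stepped_lr(0.1, [80, 120, 160], 0.1, 100) - 0.01) < 1e-12
# ImageNet schedule: drops after epochs 30 and 60.
assert abs(stepped_lr(0.1, [30, 60], 0.1, 75) - 0.001) < 1e-12
```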
A.2 LANGUAGE MODELING
A.2.1 DATASETS
We evaluate on several datasets commonly used for (word-level) language modeling: Penn Treebank (Mikolov et al., 2010), WikiText-2 (Bradbury et al., 2017), and WikiText-103 (Merity et al., 2016). The PTB is a relatively small language modeling dataset of news texts, with a vocabulary of 10K unique words and about 900K/70K/80K training/validation/test words. WikiText-2 is drawn from Wikipedia articles and is both larger and richer, with a vocabulary of 33K words and 2M/210K/240K training/validation/test words. WikiText-103 is also based on Wikipedia, but larger still, with a vocabulary of 270K words and 100M training words (and the same validation and test sets as WikiText-2).

A.2.2 MODELS
We experiment with two standard models for language modeling: Transformer-XL (Dai et al., 2019) and AWD-LSTM (Merity et al., 2018). Transformer-XL is a recent language modeling architecture that is based on transformer self-attention (Vaswani et al., 2017), but modified to better learn dependencies beyond a fixed length by adding a segment-level recurrence mechanism. It has achieved state-of-the-art results on multiple benchmarks. We use the official PyTorch implementation (https://github.com/kimiyoung/transformer-xl) with its base configuration: 16 layers, embedding size of 410, inner dimension of 2100 in the fully-connected layers, and 10 attention heads. Training is done with Adam. See the implementation for other details. For scaling experiments, we decimate the inner dimension. We use Transformer-XL for WikiText-103.

AWD-LSTM is a long short-term memory (Hochreiter & Schmidhuber, 1997) language model with adaptive weight averaging. We use the official implementation (https://github.com/salesforce/awd-lstm-lm) with the recommended configuration: 3 layers, embedding size of 400, and hidden state size of 1150. Training is done with SGD. We use AWD-LSTM for PTB and WikiText-2 and follow the recommended settings for these two datasets. For scaling experiments, we decimate the hidden state size.

B ERROR ESTIMATION EXPERIMENT
B.1 EXPERIMENTAL DETAILS
In the experiment described in section 6, we fit a least-squares regression model to find the best parameters minimizing the divergence δ(m, n), evaluated at configurations m, n as in Table 1:

θ* = argmin_θ Σ_{m,n} |δ(m, n; θ)|

We quantify the quality of the fit by the mean μ and standard deviation σ of the fitted divergence, obtained by performing standard 10-fold cross-validation over all points (m, n), with confidence intervals reported as ± one standard deviation over the folds.
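As an illustration of this kind of fit, the sketch below recovers only the power-law part of the form, ε̃(m, n) = a·n^−α + b·m^−β + c∞, from synthetic noiseless measurements. Note the assumptions: the paper's Eq. 5 additionally models the transition to the random-guess level, the paper minimizes the absolute divergence δ rather than a squared residual, and the grid-plus-linear-solve scheme and all numeric values here are illustrative, not the authors' fitting code.

```python
import itertools

def eps_tilde(m, n, a, alpha, b, beta, c_inf):
    """Power-law part of the error form: a*n^-alpha + b*m^-beta + c_inf."""
    return a * n ** -alpha + b * m ** -beta + c_inf

def solve3(A, y):
    """Solve a 3x3 linear system A x = y by Gauss-Jordan elimination."""
    M = [row[:] + [v] for row, v in zip(A, y)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(3):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [x - f * z for x, z in zip(M[r], M[i])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit_theta(points, errs, alphas, betas):
    """Grid-search the exponents; for each (alpha, beta), solve the linear
    coefficients (a, b, c_inf) in closed form via the normal equations,
    and keep the parameter set with the smallest squared residual."""
    best = None
    for alpha, beta in itertools.product(alphas, betas):
        X = [[n ** -alpha, m ** -beta, 1.0] for m, n in points]
        A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
        rhs = [sum(r[i] * e for r, e in zip(X, errs)) for i in range(3)]
        a, b, c_inf = solve3(A, rhs)
        loss = sum((eps_tilde(m, n, a, alpha, b, beta, c_inf) - e) ** 2
                   for (m, n), e in zip(points, errs))
        if best is None or loss < best[0]:
            best = (loss, dict(a=a, alpha=alpha, b=b, beta=beta, c_inf=c_inf))
    return best[1]

# Synthetic noiseless measurements from a known theta.
points = [(m, n) for m in (1e4, 1e5, 1e6) for n in (1e3, 1e4, 1e5, 1e6)]
errs = [eps_tilde(m, n, 1.0, 0.5, 2.0, 0.4, 0.05) for m, n in points]
theta = fit_theta(points, errs, alphas=(0.3, 0.4, 0.5, 0.6), betas=(0.3, 0.4, 0.5))
print(theta["alpha"], theta["beta"])  # recovers 0.5 0.4
```

In practice one would minimize the divergence directly with a nonlinear least-squares routine; the coarse grid here merely keeps the sketch dependency-free.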
B.2 FOUND THETA VALUES

Table 2: Optimal values of θ as found by the least-squares regression fitting the functional form. (a) Image classification (fitting top-1 error): parameters α, β, b, c∞, and η for ImageNet, CIFAR10, CIFAR100, DTD, Aircraft, and UCF101. (b) Language modeling (fitting cross-entropy loss): parameters α, β, b, c∞, η, and ε for PTB, WikiText-2, and WikiText-103.
C ADDITIONAL ERROR LANDSCAPE MEASUREMENTS AND ESTIMATIONS
In this appendix, we provide error landscape measurements and estimations for all datasets, corresponding to the experiment in section 6. The results are shown in 3D graphs similar to figure 1. In each such graph, the z-axis is the logarithm of the generalization error as a function of two independent variables: the model size m and the data size n.

The 3D graphs are deliberately portrayed in log-log-log scale, as we cover a very large range of data scales and model scales and a correspondingly wide range of errors. This view is useful when one wishes to evaluate large dynamic ranges (simultaneously both very large and very small values) and is especially vivid in portraying power-law-like dependencies; a power law naturally forms a straight line in a log-log view.

In each figure, subfigure (a) shows the measured error landscape in log-log-log scale, where each point (blue dot) is the error resulting from training with a model/data configuration m, n. Subfigure (b) shows the best-fit estimated error landscape. The surface is a linear interpolation between the points, which is then projected on the model-error (m, ε), data-error (n, ε), and model-data (m, n) planes. The contour plots on each of these planes are the projections of the error landscape surface, and are useful in considering the behavior of the surface when holding one dimension constant.

We call attention to several interesting observations on the datasets explored:

• As quantified rigorously in section 6, the fits perform well across error ranges. In these surfaces, one also gets a qualitative sense of the fit adequacy across the wide ranges of dataset and model scales directly. While it is perhaps slightly difficult to assess the surface directly, a helpful view is to consider the similarity between the projections of the actual and estimated surfaces.

• With increasing model size, the error does indeed typically remain saturated.
However, in one of our tested datasets (figure 12) there was a renewed slight increase. We verify that this is indeed over-fitting, in the sense that there is no corresponding increase in the training error. We note that the functional form we find can actually be used to steer clear of the (m, n) regions where such over-fitting may occur.

• The simplifying approach of considering the random-guess levels (and associated transitions) for small models or small data as identical seems to work fairly well, with some deviation apparent on examining figure 15. Indeed, the simplification can hold well for balanced datasets, but need not for imbalanced ones, such as in the task of language modeling. Thus, a relaxation of this simplification is expected to be important both conceptually and practically.
Figure 7: ImageNet error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 8: CIFAR10 error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 9: CIFAR100 error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 10: DTD error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 11: Aircraft error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 12: UCF101 error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 13: PTB error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 14: WikiText-2 error landscape. (a) Actual error landscape. (b) Estimated error landscape.
Figure 15: WikiText-103 error landscape. (a) Actual error landscape. (b) Estimated error landscape.
D ADDITIONAL EXTRAPOLATION RESULTS
Here we provide detailed extrapolation results for all datasets. All figures are structured in a similar way. Each subplot shows estimated (y-axis) vs. actual error (x-axis), on a 0-to-1 scale on both axes. Each subplot is located at the coordinate of the maximal data and model given for the task of performing the fit to the functional form in equation 5. This is the point at the top-right corner of the green dots in the illustration in figure 6a. The target is to find the error-landscape values for unseen, larger scales of both model and data (red points in the same illustration). Going from left to right in each figure indicates observed measurements of the error from models of an increasing fraction w.r.t. the full size. Going from bottom to top indicates observed measurements of the error from dataset sizes of an increasingly large fraction of the full dataset.

In each subplot, every point shows the estimated vs. actual error on a model-data configuration. Points that were given for fitting the function are colored in green, while unseen points that were not used are in red. The red points show the estimation error vs. actual error when extrapolating to all larger models and data sizes. In each subplot, the mean and standard deviation over all divergences δ at the target points are given in text.

Each experiment's fit of the parameters was repeated 100 times, with different random initializations of θ. The shaded bands show one standard deviation across these runs.

The quality of the extrapolation is critically dependent on the signal provided in the (green) fitted points. Two limiting factors are evident on examining the figures below, and both play a role in the well-posedness of the solution:

• The proximity to the initial random-guess level. Only upon transitioning from the initial error plateau does meaningful signal about the scaling rates become available.
Indeed, for scales still in, or close to, the initial error region, one sees poor extrapolation results; see figures 18, 19, and 21, and the vivid origin of this phenomenon in figures 11, 10, and 12.

• A second source of ill-posedness is tied to the number of configurations used for the estimation of θ. Clearly, when this number is small, one cannot expect the extrapolation to be stable. In fact, at least two measurements in each scaling dimension (model/data) are needed, and no fewer than the number of parameters in θ in total. Indeed, for all the plots in this appendix, the smallest scale of m, n is omitted from the graph, such that the lowermost row and leftmost column span exactly two model and data scales, respectively. Of course, there is nothing directly tying the number of points to the scale of the configurations measured, and one can decouple these two factors by taking closer-spaced samples at small scale.

• When both of the above factors are not limiting the measurement, one readily sees that for divergences of no more than a few percent, it is sufficient to measure model/data configurations which are far removed from the configurations to which one wishes to extrapolate.
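The per-subplot statistics reported in these figures can be computed as follows. This is a sketch: the relative-divergence definition δ = (ε̂ − ε)/ε is an assumption following the paper's notation, and the error values are purely illustrative.

```python
def divergence(est, act):
    """Relative divergence between estimated and actual error at one
    configuration (assumed definition: (est - act) / act)."""
    return (est - act) / act

def summarize(estimates, actuals):
    """Mean and (population) standard deviation of the divergences
    over a set of extrapolation target points."""
    d = [divergence(e, a) for e, a in zip(estimates, actuals)]
    mu = sum(d) / len(d)
    sigma = (sum((x - mu) ** 2 for x in d) / len(d)) ** 0.5
    return mu, sigma

# Illustrative extrapolation targets: estimated vs. measured test error.
est = [0.21, 0.155, 0.118]
act = [0.20, 0.150, 0.120]
mu, sigma = summarize(est, act)
print(f"{100 * mu:.1f}% ± {100 * sigma:.1f}%")  # 2.2% ± 2.8%
```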
Figure 16: ImageNet extrapolation results.
Figure 17: CIFAR100 extrapolation results.
Figure 18: Aircraft extrapolation results.
Figure 19: DTD extrapolation results.
Figure 20: CIFAR10 extrapolation results.
Figure 21: UCF101 extrapolation results.
Figure 22: PTB extrapolation results.
Figure 23: WikiText-2 extrapolation results.
Figure 24: WikiText-103 extrapolation results.