Learning Curve Theory
Marcus Hutter
Abstract
Recently a number of empirical "universal" scaling law papers have been published, most notably by OpenAI. 'Scaling laws' refers to power-law decreases of training or test error w.r.t. more data, larger neural networks, and/or more compute. In this work we focus on scaling w.r.t. data size $n$. Theoretical understanding of this phenomenon is largely lacking, except in finite-dimensional models, for which error typically decreases with $n^{-1/2}$ or $n^{-1}$, where $n$ is the sample size. We develop and theoretically analyse the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta > 0$, and determine whether power laws are universal or depend on the data distribution.
Keywords: Power Law, Scaling, Learning Curve, Theory, Data Size, Error, Loss, Zipf

1 Introduction
Power laws in large-scale machine learning.
The 'mantra' of modern machine learning is 'bigger is better'. The larger and deeper Neural Networks (NNs) are, the more data they are fed, the longer they are trained, the better they perform. Apart from the problem of overfitting [BHM18] and the associated recent phenomenon of double descent [BHMM19], this in itself is rather unsurprising. But recently 'bigger is better' has been experimentally quantified, most notably by Baidu [HNA+17] and OpenAI [HKK+20, KMH+20, HKHM21]. They observe that the error or test loss decreases as a power law, with the data size, with the model size (number of NN parameters), as well as with the compute budget used for training, assuming one factor is not "bottlenecked" by the other two factors. If all three factors are increased appropriately in tandem, the loss has power-law scaling over a very wide range of data/model size and compute budget. If there is intrinsic noise in the data (or a non-vanishing model mis-specification), the loss can never reach zero, but at best can converge to the intrinsic entropy of the data (or the intrinsic representation/approximation error). When we talk about error, we mean test loss with this potential offset subtracted, similar to regret in online learning.

Ubiquity/universality of power laws.
Power laws have been observed for many problem types (supervised, unsupervised, transfer learning) and data types (images, video, text, even math) and many NN architectures (Transformers, ConvNets, ...) [HNA+17, RRBS19, HKK+20, KMH+20].

Theory: Scaling with model size.
Consider a function $f : [0,1]^d \to \mathbb{R}$ which we wish to approximate. A naive approximation is to discretize the hyper-cube to an $\varepsilon$-grid. This constitutes a model with $m = (1/\varepsilon)^d$ parameters, and if $f$ is $L$-Lipschitz, can approximate $f$ to accuracy $L\cdot\varepsilon = L\cdot m^{-1/d}$, i.e. the (absolute) error scales with model size $m$ as a power law with exponent $-1/d$. More generally, there exist (actually linear) models with $m$ parameters that can approximate all functions $f$ whose first $k$ derivatives are bounded to accuracy $O(m^{-k/d})$ [Mha96], again a power law, and without further assumptions, no reasonable model can do better [DHM89]; see [Pin99] for reformulations and discussions of these results in the context of NNs. Not being aware of this early theoretical work, this scaling law has very recently been empirically verified and extended by [SK20]. Instead of naively using the input dimension $d$, they determine and use the (fractal) dimension of the data distribution in the penultimate layer of the NN.
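To make the $m^{-1/d}$ grid-approximation argument concrete, here is a minimal sketch (our illustration, not from the paper): it approximates a Lipschitz function on $[0,1]^2$ by its value at the nearest grid point and checks that the measured sup-error shrinks proportionally to $m^{-1/d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x, y: np.sin(3 * x + 2 * y)        # some Lipschitz function on [0,1]^2
d = 2

for eps in [0.2, 0.1, 0.05, 0.025]:
    m = round(1 / eps) ** d                   # number of grid cells = "parameters"
    x, y = rng.random(200_000), rng.random(200_000)
    xg = (np.floor(x / eps) + 0.5) * eps      # center of the cell containing (x,y)
    yg = (np.floor(y / eps) + 0.5) * eps
    err = np.max(np.abs(f(x, y) - f(xg, yg))) # sup-error of piecewise-constant fit
    print(f"m = {m:5d}   error = {err:.4f}   m^(-1/d) = {m ** (-1 / d):.4f}")
```

Both printed columns shrink in proportion, illustrating the $L\cdot m^{-1/d}$ power law.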
Theory: Scaling with compute.

Most NNs are trained by some form of stochastic gradient descent, efficiently implemented in the form of back-propagation. Hence compute is proportional to the number of iterations $i$ times batch size times model size. So studying the scaling of error with the number of iterations tells us how error scales with compute. The loss landscape of NNs is highly irregular, which makes theoretical analyses cumbersome at best. At least asymptotically, the loss is locally convex, hence the well-understood stochastic (and online) convex optimization could be a first (but possibly misleading) path to search for theoretical understanding of scaling with compute. The error of most stochastic/online optimization algorithms scales as a power law $i^{-1/2}$ or $i^{-1}$ for convex functions [Bub15, Haz16].

Theory: Scaling with data size.

Even less is theoretically known about scaling with data size. [Cho20] and [HNA+17] consider a very simple Bernoulli model: essentially they observe that the Bernoulli parameter can be estimated to accuracy $1/\sqrt{n}$ from $n$ i.i.d. samples, i.e. the absolute loss (also) scales with $1/\sqrt{n}$ [HNA+17], and the log-loss or KL-divergence scales with $1/n$ [Cho20]. Indeed, the latter holds for any loss that is locally quadratic at the minimum, so it is not at all due to special properties of KL, as [Cho20] suggests. These observations trivially follow from the central limit theorem for virtually any finitely-parameterized model in the under-parameterized regime of more-data-than-parameters. This is of course always the case for their Bernoulli model, which only has one parameter, but not necessarily for the over-parameterized regime some modern NNs work in. Anyway, the scaling laws identified by OpenAI et al. are $n^{-\beta}$ for various $\beta < 1/2$, which neither the Bernoulli nor any finite-dimensional model can explain.
Data size vs iterations vs compute.
Above we have used the fact that compute is (usually in deep learning) proportional to the number of learning iterations, provided batch and model size are kept fixed. In addition:

(i) In online learning, every data item is used only once, hence the size of the data used up to iteration $n$ is proportional to $n$.
(ii) This is also true for stochastic learning algorithms for some recent networks, such as GPT-3, trained on massive data sets, where every data item is used at most once (with high probability).
(iii) When generating artificial data, it is natural to generate a new data item for each iteration.

Hence in all of these 3 settings, the learning curves error-with-data-size, error-with-iterations, and error-with-compute are scaled versions of each other. For this reason, scaling of error with iterations also tells us how error scales with data size and even with compute, but scaling with model size is different.

This work.
In this work we focus on scaling with data size $n$. As explained above, any reasonable finitely-parameterized model and reasonable loss function leads to a scaling law $n^{-\beta}$ with $\beta = 1/2$ or $\beta = 1$, but not the observed $\beta < 1/2$. We therefore conjecture that any theoretical explanation of power laws for a variety of $\beta$ (beyond 0-1 and absolute error implying $\beta = 1/2$, and locally-quadratic loss implying $\beta = 1$) requires real-world data of unbounded complexity, that is, no finite-dimensional model can "explain" all information in the data. Possible modelling choices are (a) scaling up the model with data, or (b) considering non-parametric models (e.g. kNN or Gaussian processes), or (c) a model with (countably) infinitely many parameters. We choose (c) for mathematical simplicity compared to (b), and because (c) clearly separates scaling with data from scaling with model size, unlike (a). In future, (a) and (b) should definitely also be pursued, in particular since we have no indication that our findings transfer. Within our toy model, we show that for domains of unbounded complexity, a large variety of learning curves are possible, including non-power-laws. It is plausible that this remains true for most infinite models. Real data is often Zipf distributed (e.g. the frequency of words in text), which is itself a power law. We show that this, in our toy model, implies power-law learning curves with "interesting" $\beta$, though most (even non-Zipf) distributions also lead to power laws, but with "uninteresting" $\beta$.

Contents. In Section 2 we introduce our setup: classification with countable "feature" space and a memorizing algorithm, the simplest model and algorithm we could come up with that exhibits interesting/relevant scaling behavior. In Section 3 we derive and discuss general expressions for expected learning curves, and for various specific data distributions: finite, Zipf, exponential, and beyond; many but not all lead to power laws. In Section 4 we estimate the uncertainty in empirical learning curves. We show that the signal-to-noise ratio deteriorates with $n$, which implies that many (costly) runs need to be averaged in practice to get a smooth learning curve. On the other hand, the signal-to-noise ratio of the time-averaged learning curves tends to infinity, hence even a single run suffices for large $n$. In Section 5 we perform some simple control experiments to confirm and illustrate the theory and claims, and the accuracy of the theoretical expressions. In Section 6 we discuss (potential) extensions of our toy model towards a more comprehensive and realistic theory of scaling laws: noisy labels, other loss functions, continuous features, models that generalize, and deep learning. Section 7 concludes with limitations and potential applications. Appendix A discusses losses beyond 0-1 loss. Appendix B contains derivations of the expected error, and in particular exact and approximate expressions for the time-averaged variance. Appendix C considers noisy labels. Appendix D derives an approximation of sums by integrals, tailored to our purpose. Appendix E lists notation. Appendix F contains some more plots.

2 Setup

We formally introduce our setup, model, algorithm, and loss function in this section: We consider classification problems with 0-1 loss and countable feature space. A natural practical example application would be classifying words w.r.t. some criterion. Our toy model is a deterministic classifier for features/words sampled i.i.d. w.r.t. some distribution.
Our toy algorithm predicts/recalls the class for a new feature from a previously observed (feature, class) pair, or acts randomly on a novel feature. The probability of an erroneous prediction is hence proportional to the probability of observing a new feature, which formally is equivalent to the model in [Cha81]. The usage and analyses of the model and the resulting expressions are totally different though. While [Cha81]'s aim is to develop estimators for the probability of discovering a new species from data, whatever the unknown true underlying probabilities, we are interested in the relationship between the true probability distribution of the data and the resulting learning curves, i.e. the scaling of expected (averaged) error with sample size. In Appendix A we show that, within a (for our purpose irrelevant) multiplicative constant, the results also apply to most other loss functions.

The toy model.
The goal of this work is to identify and study the simplest model that is able to exhibit power-law learning curves as empirically observed by [HNA+17, HKK+20, KMH+20] and others. Consider a classification problem $h \in \mathcal{H} := \mathcal{X} \to \mathcal{Y}$, e.g. $\mathcal{Y} = \{0,1\}$ for binary classification, where classifier $h$ is to be learnt from data $D_n := \{(x_1,y_1),...,(x_n,y_n)\} \in (\mathcal{X}\times\mathcal{Y})^n$. For finite $\mathcal{X}$ and $\mathcal{Y}$, this is a finite model class ($|\mathcal{H}| < \infty$), which, as discussed above, can only exhibit a restrictive range of learning curves, typically $n^{-1}/n^{-1/2}/e^{-O(n)}$ for locally-quadratic/absolute/0-1 error. In practice, $\mathcal{X}$ is often a (feature) vector space $\mathbb{R}^d$, which can support an infinite model class ($|\mathcal{H}| = \infty$), e.g. NNs, rich enough to exhibit (at least empirically) $n^{-\beta}$ scaling for many different $\beta \not\in \{1/2, 1\}$, typically $\beta \ll 1$. The smallest potentially suitable $\mathcal{X}$ would be countable, e.g. $\mathbb{N}$, which we henceforth assume. The model class $\mathcal{H} := \mathbb{N} \to \mathcal{Y}$ is uncountable and has infinite VC-dimension, hence is not uniformly PAC learnable, but can be learnt non-uniformly. Furthermore, for simplicity we assume that the data $D_n := \{(i_1,y_1),...,(i_n,y_n)\} \equiv (i_{1:n},y_{1:n})$ with "feature" $i_t \in \mathbb{N}$ "labelled" $y_t$ are noise-free = deterministic, i.e. $y_t = y_{t'}$ if $i_t = i_{t'}$. Let $h \in \mathcal{H}$ be the unknown true labelling function. We discuss relaxations of some of these assumptions later in Section 6, in particular the extension to other loss functions in Appendix A and noisy labels in Appendix C. Let features $i_t$ be drawn i.i.d. with $P[i_t = i] =: \theta_i \geq 0$ and $\sum_{i=1}^\infty \theta_i = 1$. The infinite vector $\theta \equiv (\theta_1,\theta_2,...)$ characterizes the feature distribution. The labels are then determined by $y_t = h(i_t)$.

The toy algorithm.
We consider a simple tabulation learning algorithm $A : \mathbb{N}\times(\mathbb{N}\times\mathcal{Y})^* \to \mathcal{Y}\cup\{\bot\}$ that stores all past labelled features $D_n$, and on the next feature $i_{n+1} = i$ recalls $y_t$ if $i_t = i$, i.e. if feature $i$ has appeared in the past, or outputs, in its simplest instantiation, "undefined" if $i \not\in i_{1:n}$, i.e. if $i$ is new. Formally:
$$A(i,D_n) := \begin{cases} y_t & \text{if } i = i_t \text{ for some } t \leq n \\ \bot & \text{else, i.e. if } i \not\in i_{1:n} \end{cases} \qquad (1)$$
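A minimal sketch of this tabulation algorithm (our illustration; the class and method names are ours):

```python
class MemorizingClassifier:
    """Tabulation algorithm A of Eq. (1): recall the stored label for a
    previously seen feature, output None (playing the role of ⊥) otherwise."""

    def __init__(self):
        self.table = {}           # feature i -> label y

    def predict(self, i):
        return self.table.get(i)  # None if feature i is new

    def update(self, i, y):
        self.table[i] = y         # memorize the (feature, label) pair
```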
Error.

Algorithm $A$ only makes an error predicting label $y_{n+1}$ if $i_{n+1} \not\in i_{1:n}$. We say $A$ makes 1 unit of error in this case. Formally, the (instantaneous) error $E_n$ of algorithm $A$ when predicting $y_{n+1}$ from $D_n$ is defined as $E_n := [\![i_{n+1}\not\in i_{1:n}]\!]$. The expectation of this w.r.t. the random choice of $D_n$ and $i_{n+1}$ gives the expected (instantaneous) error
$$EE_n := \mathbb{E}[E_n] = P[i_{n+1}\not\in i_{1:n}] = \sum_{i=1}^\infty \theta_i(1-\theta_i)^n \qquad (2)$$
A formal derivation is given in Appendix B, but the result is also intuitive: if feature $i$ has not been observed so far (which happens with probability $(1-\theta_i)^n$), and then feature $i$ is observed (which happens with probability $\theta_i$), the algorithm makes an error. $EE_n$ as a function of $n$ constitutes an (expected) learning curve, which we will henceforth study. In Appendix A we show that expression (2) remains valid within an irrelevant multiplicative constant for most other loss functions.
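The closed form (2) is easy to check by simulation; a minimal sketch (our illustration, with an arbitrary finite choice of $\theta$):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary example distribution
n, runs = 20, 100_000

exact = np.sum(theta * (1 - theta) ** n)      # Eq. (2)

draws = rng.choice(len(theta), size=(runs, n + 1), p=theta)
is_new = (draws[:, :-1] != draws[:, -1:]).all(axis=1)  # E_n = [[i_{n+1} not in i_{1:n}]]
print(f"Eq. (2): {exact:.5f}   Monte Carlo: {is_new.mean():.5f}")
```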
3 Expected Learning Curves

We now derive theoretical expected learning curves for various underlying data distributions. We derive exact and approximate, general and specific expressions for the scaling of expected error with sample size. Specifically we consider finite models, which lead to exponential error decay, and infinite Zipf distributions, which lead to interesting power laws with power $\beta < 1$. Interestingly, even highly skewed data distributions lead to power laws, albeit with "uninteresting" power $\beta = 1$.

Exponential decay.
In the simplest case of $m$ of the $\theta_i$ being equal to $\frac{1}{m}$ and the rest being 0, the error $EE_n = (1-\frac{1}{m})^n \approx e^{-n/m}$ decays exponentially with $n$. This is not too interesting to us, since (a) this case corresponds to a finite model (see above), (b) exponential decay is an "artifact" of the deterministic label and discontinuous 0-1 error, and (c) it will become a power law $1/n$ after time-averaging (Section 4).
[Figure: the individual contributions $m_j\bar\theta_j e^{-n\bar\vartheta_j}$ to the error $EE_n$, and their superposition, as a function of $n$.]

Superposition of exponentials.
Since (2) is invariant under bijective renumbering of the features $i\in\mathbb{N}$, we can w.l.g. assume $\theta_1 \geq \theta_2 \geq \theta_3 \geq ...$. Some $\theta$s may be equal. If we group equal $\theta$s together into $\bar\theta_j$ with multiplicity $m_j > 0$ and define $\bar\vartheta_j := -\ln(1-\bar\theta_j)$, then
$$EE_n = \sum_{j=1}^M m_j\bar\theta_j e^{-n\bar\vartheta_j} \qquad (3)$$
where $M\in\mathbb{N}\cup\{\infty\}$ is the number of different $\theta_i > 0$. This is a superposition of exponentials in $n$ (note that $\sum_{j=1}^M m_j\bar\theta_j = 1$) with different decay rates $\bar\vartheta_j$. If different $\bar\theta_j$ have widely different magnitudes and/or for suitable multiplicities $m_j$, the sum will be dominated by different terms at different "times" $n$. So there will be different phases of exponential decay, starting with fast decay $e^{-n\bar\vartheta_1}$ for small $n$, taken over by slower decay $e^{-n\bar\vartheta_2}$ for larger $n$, and $e^{-n\bar\vartheta_3}$ for even larger $n$, etc., though some terms may never (exclusively) dominate, or phases may be unidentifiably muddled together (see figure above). In any case, if $M = \infty$, the dominant terms shift indefinitely to ever smaller $\theta$ for ever larger $n$. For $M < \infty$, eventually $e^{-n\bar\vartheta_M}$ for the smallest $\bar\vartheta_M$ will dominate $EE_n$. The same caveats (a)-(c) apply as for $M = 1$ in the previous paragraph.

Approximations.
First, in our subsequent analysis we (can) approximate $(1-\theta_i)^n =: e^{-n\vartheta_i} \approx e^{-n\theta_i}$, justified as follows: (i) For $n\theta_i \ll 1$, both $(1-\theta_i)^n$ and $e^{-n\theta_i}$ are close to 1. (ii) For $\theta_i \ll 1$, $\vartheta_i \approx \theta_i$; while numerically $e^{-n\vartheta_i}/e^{-n\theta_i} \not\approx 1$ is possible for $n\theta_i \gg 1$, the exponential scaling of $e^{-n\vartheta_i}$ and $e^{-n\theta_i}$, which is what we care about, is sufficiently similar. (iii) There can only be a finite number of $\theta_i \not\ll 1$; say, $\theta_i$ for $i \leq i_0$ are not small. Then already for moderately large $n \gg 1/\theta_{i_0}$, all features $i \leq i_0$ have been observed with high probability and hence do not contribute (much) to the expected error (formally $e^{-n\theta_i} \ll 1$ for $i \leq i_0$), hence they can safely be ignored in any asymptotic analysis.

Second, let $f : \mathbb{R}\to\mathbb{R}$ be a smooth and monotone decreasing interpolation of $\theta : \mathbb{N}\to\mathbb{R}$, i.e. $f(i) := \theta_i$ and $f'(x) \leq 0$. We can then approximate the error as follows:
$$EE_n = \sum_{i=1}^\infty f(i)(1-f(i))^n \approx \int_1^\infty f(x)e^{-nf(x)}dx = \int_0^{\theta_1}\frac{ue^{-nu}\,du}{|f'(f^{-1}(u))|} \stackrel{\times}{\approx} \frac{1}{n^2|f'(f^{-1}(1/n))|} \qquad (4)$$
The first $\approx$ uses the two approximations introduced above. The equality follows from a reparametrization $u = f(x)$, with $f(1) = \theta_1$, $f(\infty) = 0$, $dx = du/f'(x)$, and $f' < 0$. The numerator $ue^{-nu}$ is maximal and (strongly) concentrated around $u = 1/n$, hence $u \approx 1/n$ gives most of the integral's contribution. Therefore replacing $u$ by $1/n$ in the denominator can be a reasonable approximation. The last $\stackrel{\times}{\approx}$ follows from this and from $\int_0^{\theta_1}ue^{-nu}du \approx \int_0^\infty ue^{-nu}du = 1/n^2$ for $n\theta_1 \gg 1$. An alternative, intuitive route: the sum (2) is dominated by the $i$ for which $\theta_i \approx 1/n$; estimating the number of such $i$, multiplied by $\theta_i(1-\theta_i)^n \approx \theta_i e^{-1}$, also leads to approximation (4). In Appendix B we show that the approximation error of the integral representation is bounded by $1/en + o(1/n)$.
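A quick numerical sanity check of (4) (our illustration): for a concrete decreasing $f$, here the unnormalized $f(x) = \alpha x^{-(\alpha+1)}$ with $\alpha = 1$ also used in Appendix B, we compare the exact sum with the right-hand side of (4):

```python
import numpy as np

alpha = 1.0
i = np.arange(1, 1_000_001, dtype=float)
theta = alpha * i ** -(alpha + 1)                 # f(i) = alpha * i^-(alpha+1)

f_inv = lambda u: (u / alpha) ** (-1 / (alpha + 1))
df    = lambda x: -alpha * (alpha + 1) * x ** -(alpha + 2)

for n in [100, 1_000, 10_000]:
    exact  = np.sum(theta * (1 - theta) ** n)     # sum as in Eq. (2)
    approx = 1 / (n ** 2 * abs(df(f_inv(1 / n)))) # right-hand side of Eq. (4)
    # the ratio settles near the constant Gamma(2-delta) (here Gamma(1/2) ~ 1.77),
    # illustrating that (4) is accurate up to a multiplicative constant only
    print(f"n={n:6d}  exact={exact:.5f}  approx={approx:.5f}  ratio={exact/approx:.3f}")
```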
Zipf-distributed data.

Empirically, many data have been observed to have a power-law distribution, called a Zipf distribution in this context; that is, for a countable domain, the frequency of the $i$th most frequent item is approximately proportional to $i^{-(\alpha+1)}$ for some $\alpha > 0$. In our model this will be the case if $\theta_i \propto i^{-(\alpha+1)}$, so let $u = f(x) = \alpha\cdot x^{-(\alpha+1)}$. This implies $x = f^{-1}(u) = (u/\alpha)^{-1/(1+\alpha)}$ and $f'(x) = -\alpha(\alpha+1)x^{-(\alpha+2)} = -\alpha(\alpha+1)(u/\alpha)^{(\alpha+2)/(\alpha+1)}$, hence
$$EE_n \approx \frac{1}{n^2|f'(f^{-1}(1/n))|} \stackrel{\times}{=} n^{-\beta}, \quad\text{where}\quad \beta := \frac{\alpha}{1+\alpha}$$
That is, Zipf-distributed data (with power $\alpha+1$) lead to a power-law learning curve (with power $\beta = \frac{\alpha}{1+\alpha} < 1$). The more careful analysis in Appendix B gives $EE_n \approx c_\alpha n^{-\beta}$ with $c_\alpha = \alpha^{1/(1+\alpha)}\Gamma(\frac{\alpha}{1+\alpha})/(\alpha+1)$. $c_1 = \frac12\sqrt{\pi} \doteq 0.886$ and $c_{0.1} \doteq 1.177$, in excellent agreement with the fit curves in Figure 2.
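As a worked instance of this constant: for $\alpha = 1$ (the classic Zipf case $\theta_i \propto i^{-2}$), using $\Gamma(\frac12) = \sqrt{\pi}$,
$$c_1 = \frac{1^{1/2}\,\Gamma(\frac12)}{1+1} = \frac{\sqrt{\pi}}{2} \doteq 0.886, \qquad\text{so}\qquad EE_n \approx 0.886\,n^{-1/2},$$
matching the fitted curve in Figure 2 (left).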
Exponentially-distributed data.
An exponential data distribution $\theta_i \propto e^{-\gamma i}$ is more skewed than any power law. For $u = f(x) = \gamma\cdot e^{-\gamma x}$ we have $x = f^{-1}(u) = \frac{1}{\gamma}\ln\frac{\gamma}{u}$ and $f'(x) = -\gamma^2e^{-\gamma x} = -\gamma u$, hence both approximations in (4) give $1/\gamma n$. A rigorous upper bound $EE_n \leq (e^{-1}+\gamma^{-1})/n + o(1/n)$ follows from (11) in Appendix B, and a rigorous lower bound $EE_n \stackrel{\times}{\geq} n^{-1}$ from the next paragraph. So even an exponential data distribution leads to a power-law learning curve, though the exponent 1 is much larger than observed in (most) experiments, which hints at data not being exponentially distributed, assuming this toy model has any real-world relevance.

Beyond exponentially-distributed data.
For (quite unrealistic) decay faster than exponential, e.g. $\theta_i \propto e^{-\gamma i^2}$, the approximations (4) are too crude, but somewhat surprisingly we always get a (sort of) power law as long as $\theta_i > 0$ for all $i$. First, the previous paragraph implies that $EE_n \stackrel{\times}{\leq} n^{-1}$ for any $\theta_i \stackrel{\times}{\leq} e^{-\gamma i}$ with $\gamma > 0$, i.e. the error decreases at least with $n^{-1}$ if the $i$th item has at most exponentially small probability in $i$. For a (partial) converse, define $n_i := \lceil 1/\vartheta_i\rceil \leq 1/\vartheta_i + 1$. Plugging $\vartheta_i := -\ln(1-\theta_i) \geq \theta_i$ and $2\theta_i \geq \vartheta_i \geq 1/n_i$ for $\theta_i < 0.79$ into $EE_n \geq \theta_i(1-\theta_i)^n = \theta_ie^{-n\vartheta_i}$, we get
$$EE_{n_i} \;\geq\; \theta_ie^{-n_i\vartheta_i} \;\geq\; \frac{1}{2n_i}e^{-(1/\vartheta_i+1)\vartheta_i} \;\stackrel{\times}{=}\; n_i^{-1}$$
Hence, if $\theta_i > 0$ for all $i$, then there are infinitely many $n$ for which $EE_n \stackrel{\times}{\geq} n^{-1}$. For $\theta_i$ going to zero exponentially or slower, the spacing between $n_{i+1}$ and $n_i$ has bounded ratio $n_{i+1}/n_i \leq e^\gamma$, which implies $EE_n \stackrel{\times}{\geq} n^{-1}$ for all $n$. For faster decaying $\theta_i$, e.g. $\theta_i = e^{-\gamma i^2}$, this is no longer the case. So in some weak sense, power-law learning curves are universal, but the power is mostly $1/n$, so this is not useful to explain the observed power laws.

4 Variance of Learning Curves

So far we have considered expected learning curves. This corresponds to averaging infinitely many experimental runs. In practice, only finitely many runs are possible, sometimes as few as 5 or even 1. In the following we consider the variance $V_n$ of the instantaneous error $E_n = [\![i_{n+1}\not\in i_{1:n}]\!]$ as a function of $n$. The standard error when averaging $k$ runs is then $\sqrt{V_n/k}$. The question we care most about here is whether (for large $n$) this is small or large compared to $EE_n$, because this determines whether learning curves (for small $k$) are smooth or look random, and how large a $k$ suffices for a good signal-to-noise ratio. We also consider time-averaged expected learning curves and their variance, which are much smoother. Note that cumulative errors are like attenuated drifting random walks, a property time-averaged learning curves qualitatively inherit (red curve in Figure 1, right).

Instantaneous Variance.

$E_n \in \{0,1\}$, hence $E_n^2 = E_n$, hence
$$\mathbb{E}[E_n^2] = \mathbb{E}[E_n] = \sum_{i=1}^\infty\theta_i(1-\theta_i)^n =: \mu_n, \quad\text{hence}\quad \mathbb{V}[E_n] = \mathbb{E}[E_n^2]-\mathbb{E}[E_n]^2 = \mu_n(1-\mu_n)$$
Since $\mu_n \to 0$ for $n\to\infty$, the standard deviation
$$\sigma_n := \sqrt{\mathbb{V}[E_n]} = \sqrt{\mu_n(1-\mu_n)} \approx \sqrt{\mu_n} \gg \mu_n = EE_n$$
That is, the standard deviation is much larger than the mean for large $n$. Indeed, for a single run there is no proper learning curve at all, since $E_n \in \{0,1\}$ (see Figures 4 & 5, top left). In order to get a good signal-to-noise ratio, one would need to average a large (and indeed increasing with $n$) number $k \gg 1/\mu_n$ of runs (see Figures 1, 4 & 5).

Time-averaged Mean and Variance.
In practice, beyond averaging over runs, other averages are performed to reduce noise. One alternative is to report the time-averaged error $\bar E$ rather than the instantaneous error $E$. We can calculate its mean and variance as follows:
$$\bar E_N := \frac{1}{N}\sum_{n=0}^{N-1}E_n \qquad (5)$$
$$\mathbb{E}[\bar E_N] = \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}[E_n] = \frac{1}{N}\sum_{i=1}^\infty\theta_i\sum_{n=0}^{N-1}(1-\theta_i)^n \stackrel{(a)}{=} \frac{1}{N}\sum_{i=1}^\infty[1-(1-\theta_i)^N] \qquad (6)$$
$$\mathbb{E}[\bar E_N^2] \stackrel{(b)}{=} \frac{1}{N^2}\sum_{i=1}^\infty[1-(1-\theta_i)^N] + \frac{1}{N^2}\sum_{i\neq j}[1-(1-\theta_i)^N-(1-\theta_j)^N+(1-\theta_i-\theta_j)^N]$$
$$\mathbb{V}[\bar E_N] \stackrel{(c)}{=} \frac{1}{N^2}\sum_{i=1}^\infty(1-\theta_i)^N[1-(1-\theta_i)^N] - \frac{1}{N^2}\sum_{i\neq j}[(1-\theta_i)^N(1-\theta_j)^N-(1-\theta_i-\theta_j)^N] \qquad (7)$$
where $(a)$ is simple algebra, $(b)$ follows from inserting the definition of $\bar E_N$ and some rather tedious algebraic manipulations (see Appendix B), and $(c)$ from inserting $(a)$ and $(b)$ into the definition of the variance and simple algebraic manipulation. We now revisit the exponential and Zipf cases studied earlier, after a trivial but noteworthy observation.

Case $\theta_i = [\![i=1]\!]$. In this case $i_n = 1$ for all $n$, hence $E_0 = 1$ and $E_n = 0$ for all $n\geq 1$, and $\mathbb{V}[E_n] = 0$ for all $n$. This is the fastest any error can decay, 0 after 1 observation, hence the fastest any time-averaged error can decay is $\bar E_N = 1/N$. This means that for any learning problem with instantaneous error decaying faster than $1/N$, one should report the instantaneous error rather than the slower-decaying and hence much larger time-averaged error. Most problems of interest in Deep Learning have much slower learning curves though, and for those, the time-averaged and the instantaneous error have the same decay rate, but the time-averaged error has lower variance, so it is the preferred one to plot or report.

Case $\theta_i = \frac{1}{m}[\![i\leq m]\!]$. In this case, while $EE_n = (1-\frac1m)^n \approx e^{-n/m}$ decays exponentially, the averaged quantities decay with $1/N$ (or $1/N^2$):
$$\mathbb{E}[\bar E_N] = \frac{m}{N}[1-(1-\tfrac1m)^N] \;\longrightarrow\; \frac{m}{N} \quad\text{for } N\to\infty$$
$$\mathbb{V}[\bar E_N] = \frac{m}{N^2}\big[(1-\tfrac1m)^N - m(1-\tfrac1m)^{2N} + (m-1)(1-\tfrac2m)^N\big] \;\approx\; \frac{m}{N^2}\big[e^{-N/m}-e^{-2N/m}\big] \;\to\; \frac{m}{N^2}e^{-N/m} \quad\text{for } N\to\infty$$
$$\sigma[\bar E_N] \approx \frac{\sqrt m}{N}e^{-N/2m} \;\ll\; \frac{m}{N} \approx \mathbb{E}[\bar E_N] \quad\text{for } N\gg m$$
The expressions for the mean and variance follow from the general expressions (6) and (7) above by inserting $\theta_i = \frac1m$ for $i\leq m$ and noting that $i > m$, for which $\theta_i = 0$, give no contribution, hence $\sum_i$ contains $m$ and $\sum_{i\neq j}$ contains $m(m-1)$ equal terms. For $N$ (much) larger than $m$, the standard deviation is thus exponentially smaller than the mean, so the time-averaged learning curves have a much better signal-to-noise ratio; see Figure 1. Also $\sigma[\bar E_N] \approx N^{-1/2} \ll 1 \approx \mathbb{E}[\bar E_N]$ for $m\gg N$. The intuition is easy: For $m\gg N$, at every $n < N$ a new $i_{n+1}$ is observed, i.e. $E_n = 1$ w.h.p. For $N\gg m$, all $m$ errors have been made, i.e. $\bar E_N = \frac{m}{N}$ w.h.p., so in both cases the variance is small. Only for $m\approx N$ is there sizeable uncertainty in $\bar E_N$. The situation is similar for the most interesting Zipf case:

Case $\theta_i \propto i^{-(\alpha+1)}$. Recall that for the Zipf distribution $\theta_i \propto i^{-(\alpha+1)}$, the expected error follows the power law $EE_n \approx c_\alpha n^{-\beta}$, where $0 < \beta = \frac{\alpha}{1+\alpha} < 1$.
The time-averaged error
$$\mathbb{E}[\bar E_N] \approx \frac{1}{N}\sum_{n=1}^N\mathbb{E}[E_n] \approx \frac{c_\alpha}{N}\int_0^N n^{-\beta}dn = c'_\alpha N^{-\beta} \quad\text{with}\quad c'_\alpha := \frac{c_\alpha}{1-\beta} = \alpha^{1/(1+\alpha)}\Gamma(\tfrac{\alpha}{1+\alpha})$$
follows the same power law with the same exponent $\beta$, which is a generic property, as foreshadowed earlier. As for the variance, we show in Appendix B that
$$\mathbb{E}[\bar E_N^2] \stackrel{\times}{\approx} \mathbb{E}[\bar E_N]^2, \qquad \mathbb{V}[\bar E_N] \stackrel{\times}{\approx} N^{-\frac{1+2\alpha}{1+\alpha}}, \qquad\text{hence}\qquad \sigma[\bar E_N] \stackrel{\times}{\approx} N^{-\frac{1+2\alpha}{2+2\alpha}} \;\ll\; N^{-\frac{\alpha}{1+\alpha}} \stackrel{\times}{\approx} \mathbb{E}[\bar E_N]$$
In particular, the signal-to-noise ratio is
$$\frac{\sigma[\bar E_N]}{\mathbb{E}[\bar E_N]} \stackrel{\times}{\approx} N^{-1/(2+2\alpha)}$$
That is, the standard deviation is much smaller than the mean, and a single run suffices to get a good (and for large $N$ excellent) estimate of the expected time-averaged learning curve. Even for $n \gtrsim N$, new features $i_{N+1}\not\in i_{1:N}$ have a small but sufficient chance of appearing, contributing to the error and variance, decreasing exponentially in the uniform model, but only as a power law in the Zipf model.
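The formulas (6) and (7) are directly computable for a truncated Zipf vector; a minimal sketch (our illustration) checking the predicted signal-to-noise scaling $N^{-1/(2+2\alpha)}$:

```python
import numpy as np

alpha = 1.0
M = 3000                                          # truncated Zipf support
theta = np.arange(1, M + 1, dtype=float) ** -(alpha + 1)
theta /= theta.sum()

def mean_var(N):
    a = (1 - theta) ** N                          # (1-theta_i)^N
    mean = (1 - a).sum() / N                      # Eq. (6)
    pair = (1 - theta[:, None] - theta[None, :]) ** N
    off = pair - np.outer(a, a)                   # (1-ti-tj)^N - (1-ti)^N (1-tj)^N
    np.fill_diagonal(off, 0.0)                    # keep only the i != j terms
    var = ((a * (1 - a)).sum() + off.sum()) / N ** 2   # Eq. (7)
    return mean, var

for N in [50, 200, 800]:
    mean, var = mean_var(N)
    snr = np.sqrt(var) / mean                     # sigma / mean
    print(f"N={N:4d}  sigma/mean = {snr:.4f}   ~ N^(-1/4) = {N ** -0.25:.4f}")
```

For $\alpha = 1$ the prediction is $\sigma/\mathrm{mean} \stackrel{\times}{\approx} N^{-1/4}$, and the two printed columns indeed shrink in proportion.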
General $\theta$ Case.

One can show that the signal-to-noise ratio for the time-averaged error improves with $N$ in general, for any choice of $\theta$. First note that the argument of $\sum_{i\neq j}[\,...\,]$ in (7) is non-negative, hence the variance is upper-bounded by the first sum $\sum_i$. Using $1-(1-\theta_i)^N \leq \theta_iN$, we get $\mathbb{V}[\bar E_N] \leq \frac{1}{N}EE_N$, so the signal-to-noise ratio is
$$\frac{\sigma[\bar E_N]}{\mathbb{E}[\bar E_N]} \;\leq\; \frac{\sqrt{EE_N/N}}{\mathbb{E}[\bar E_N]} \;=\; \frac{\sqrt{N\cdot EE_N}}{\sum_{n=0}^{N-1}EE_n} \;\longrightarrow\; 0 \quad\text{for } N\to\infty$$
To prove the limit $\to 0$, we exploit that $EE_n$ is monotone decreasing ($EE_n\searrow 0$). (i) For bounded total error $\sum_{n=0}^\infty EE_n \leq c$ (e.g. exponential error decay in finite models), $EE_n\searrow 0$ implies $EE_N = o(1/N)$, so the numerator tends to 0; the denominator is lower-bounded by $EE_0 = 1$. (ii) For unbounded total error $\sum_{n=0}^{N-1}EE_n \to\infty$ (most infinite models, e.g. Zipf and even exponential $\theta_i$), we factor the denominator as $\sum_{n=0}^{N-1}EE_n \equiv \sqrt{\sum_{n=0}^{N-1}EE_n}\cdot\sqrt{\sum_{n=0}^{N-1}EE_n}$ and lower-bound one factor by $\sum_{n=0}^{N-1}EE_n \geq N\cdot EE_N$, which is true since $EE_n\searrow$; the whole ratio is then upper-bounded by $1/\sqrt{\sum_{n=0}^{N-1}EE_n} \to 0$.
Figure 1: (Learning Curves) (left) for uniform data distribution $P[i_n = i] = \theta_i = \frac{1}{m}$ for $i \leq m = 10$, averaged over $k = 100$ runs. (right) for Zipf-distributed data $P[i_n = i] = \theta_i \propto i^{-(\alpha+1)}$ for $\alpha = 1$, averaged over $k = 10$ runs. See Figures 4 and 5 for more plots for other values of $k$.
Figure 2: (Power Law fit to Zipf-Distributed Data) for Zipf-distributed data $P[i_n = i] = \theta_i \propto i^{-(\alpha+1)}$ for $\alpha = 1$ (left) and $\alpha = 0.1$ (right); averaged over infinitely many runs (dots), for fitted $\hat\beta$ (solid) and theoretical $\beta = \frac{\alpha}{1+\alpha}$ (dashed), and empirical error for a single run ($k = 1$, dash-dotted).

5 Experiments

We performed some control experiments to verify the correctness of the theory and claims, and the accuracy of the theoretical expressions.

Uniform and Zipf-distributed data.
Figures 1 and 4 plot learning curves for uniformly distributed data $P[i_n = i] = \theta_i = \frac{1}{m}$ for $i \leq m = 10$, averaged over various numbers $k$ of runs.
Word-Frequency in Text File, Learning Curve, Power Law) (left)
Log-linear plot of the relative (left scale) and absolute (right scale) frequency of words inthe first 20469 words in file ‘book1’ of the Calgary Corpus, and fitted Zipf law. (right)
Power law fit to learning curve for this data set for a word classification task.runs. Figure 1 and 5 plot learning curves for Zipf-distributed data P [ i n = i ] = θ i ∝ i − ( α +1) for α = 1, also averaged over k = 1 , , , D have been generated, and the average is taken over k of them. Various errors areplotted as functions of the sample index/size n . The crosses are the instantaneous errors E n averaged over k runs. The black curves are the exact expected instantaneous error EE n . The shaded regions are 1 standard deviation σ n / √ k from the theory (not empirical).Similarly the blue dots, lines, and shadings are the time-averaged errors E N , their exactexpectation E [ E N ] and theoretical standard error (cid:112) V [ E N ] /k . The red triangles, lines, andshadings are the empirical cumulative errors (cid:80) Nn =0 E [ n ] ≡ N E N , their exact expectation (cid:80) Nn =0 EE [ n ] ≡ N E [ E N ], and theoretical standard error. The dashed lines connecting theempirical errors are for better visibility (only). Fitting power laws to learning curves.
We now fit power laws to the learningcurves of (exactly) Zipf distributed data. Figure 2 shows fits for synthetic data withZipf-exponents α = 1 and α = 0 .
1. The fit is “perfect” except for very small values of n . This is consistent with our approximation, which is good for nθ (cid:29)
1. Theoreticallywe expect and empirically we found the approximation (4) to be good for nθ (cid:38)
2. For α = 1 we have θ = 0 .
5, hence the approximation should be good for n (cid:38)
4, while for α = 0 . θ ˙= 0 . n (cid:38)
30; bothare consistent with the plots. To avoid clutter we only present expected curves. Theyperfectly match the averaged curves over infinitely many runs anyway (see Figures 1&5).The fitted power law exponents ˆ β are also close to the theoretical predictions β = α α ( = 0 . α = 1 and ˙= 0 .
091 for α = 0 . Text data.
It is well-known that the frequency of a word in typical texts is aboutinversely proportional to its rank in the frequency table: The most frequent word (‘the’)occurs about twice as often as the second most frequent word (‘a’), about three times asoften as the third most frequent word, and so on. That is, word frequency follows a Zipfdistribution with α ≈
0. Figures 3&6 (left) show the frequency distributions of the first20469 words in file ‘book1’ of the Calgary Corpus. Apart from the steps, caused by wordfrequencies being integers, the distribution is very close to Zipf. Note that more than halfof the words only appear once. Figures 3 (right) shows the learning curves for any wordclassification task. The power-law fit is reasonably good, but not perfect. The reason is11he step structure of especially rare words. Indeed, many θ s are equal, and only finitelymany are non-zero, so the learning curve is a finite superposition of exponentials as in(3). For moderate n this mixture amalgamates to an approximately power law. For large n , the error decays exponentially as exp( − θ min n ). Indeed, for larger n , Figures 6 showsthat the power fit becomes worse, and the true error decays faster than the fit power law.Note that for α ≤ i − ( α +1) is not summable, hence any such distribution must breakdown after some i , our approximation becomes invalid, and β ≤ In the following we discuss some potential extensions of the toy model. Some look feasible,others are hard or wishful thinking. We discuss the more realistic case of noisy labels,other loss functions, continuous features, and more realistic models that generalize, e.g.deep learning algorithms.
6 Extensions

In the following we discuss some potential extensions of the toy model. Some look feasible, others are hard or wishful thinking. We discuss the more realistic case of noisy labels, other loss functions, continuous features, and more realistic models that generalize, e.g. deep learning algorithms.

Noisy labels.
In most machine learning applications, labels (or more generally targets) are themselves noisy, not just the feature vector $x \in \mathcal{X}$, e.g. $y = h(x) + \text{Noise}$. The major implications are as follows:

(a) The learning algorithm needs to be a bit smarter than just memorization, e.g. predicting the average or by majority.
(b) Due to the label noise, the error cannot converge to 0 anymore, but to the intrinsic "entropy", which should be subtracted before studying scaling.
(c) For absolute (locally quadratic) loss there will be an extra $n^{-1/2}$ ($n^{-1}$) additive error term due to the parameter estimation error, hence
(d) the instantaneous loss will not decay exponentially anymore, even if the model is finite.
(e) Otherwise the scaling laws for Zipf data are unchanged.

In summary, the error/loss should be a sum of 3 terms, at least conceptually:

(1) the inherent entropy in the data,
(2) the parameter learning rate $n^{-1/2}$ for absolute loss, squared, i.e. $n^{-1}$, for (locally) quadratic loss,
(3) the same power law $n^{-\beta}$ as in the deterministic case.

In Appendix C we verify claims (b,c,d,e) for our toy model extended to noisy binary classification with square loss. What is remarkable is that the instantaneous square loss $\text{Loss}_n(A)$ for noisy labels turns out to include a term proportional to the time-averaged (0-1) error $\mathbb{E}[\bar E_n] = (6)$ of the deterministic case. But this "magically" ensures (c,d,e), since $\mathbb{E}[\bar E_n] \stackrel{\times}{\approx} \max\{EE_n, \frac1n\}$, at least for the choices of $\theta$ discussed in Section 3. For instance, for a finite model, $\text{Loss}_n(A) \stackrel{\times}{\approx} \mathbb{E}[\bar E_n] \stackrel{\times}{\approx} \frac1n$.

Other loss functions.
For our deterministic toy model, the loss function has little to no influence on the results, as discussed in Appendix A. For noisy labels, this also seems to be the case, except that $n^{-\beta}$ is now the fastest possible decay, with $\beta$ depending on the loss function: $\beta = \frac12$ for absolute loss and $\beta = 1$ for locally quadratic loss such as KL and square. Whether loss functions realizing any (other) value of $\beta > 0$ exist is less clear.

Continuous features.

Countable feature spaces have some applications, e.g. in NLP, where words can be identified with integer features $i \in \mathbb{N}$. In most applications, feature spaces are (effectively) continuous, often vector spaces $\mathbb{R}^d$, and no feature ever repeats exactly ($x_n \neq x_m$ for $n \neq m$). A simple model with a continuous domain is the Dirichlet Process, or the essentially equivalent Chinese Restaurant Process (CRP) and Stick-Breaking process. In the CRP, the continuous domain is essentially reduced to an exponentially distributed countable number of sticks=features, leading to power-law learning curves $n^{-\beta}$, but the exponent is restricted to $\beta = 1$, which is too limiting. But even the CRP is not exactly a special case of our toy model and is much harder to analyse. In some form of "mean-field" approximation it reduces to a special case of our model. The generalized 2-parameter Poisson-Dirichlet Process [BH10] also only leads to $\beta = 1$. Finding analytically tractable models with continuous features that exhibit interesting learning curves remains an open problem.

Generalizing algorithms.
Proper models/algorithms for continuous features need to generalize from observed inputs to similar future not-yet-observed inputs, which is at the heart of virtually all interesting machine learning models/algorithms. Such models are much more varied and harder to analyze. If the domain could be partitioned into countably many cells, each cell containing only sufficiently similar features, and this could be done a priori and fixed independently of the actually realized data $D_n$ and, most importantly, independently of the data size $n$, we would arrive back at our countable toy model (usually with noisy labels) and our analysis would (nearly) apply. But it is more plausible that a suitable partitioning, e.g. a clustering of the data, is in itself data(-size) dependent, and hence will affect the scaling. A more interesting non-parametric model, potentially amenable to theoretical analysis, is $k$-Nearest-Neighbors (kNN), likely with interesting learning curves. The 'perfect prediction for exact repetition' in our toy model can be viewed as an abstraction of 'classify features in the same cell alike', which itself is a toy model for 'classify similar observations alike or similarly', so maybe some of our findings or analysis tools approximately transfer.

Deep learning.

(Deep) neural networks are a particularly powerful class of models/algorithms that can generalize, but they are also notoriously difficult to analyse theoretically. It may be a long way from our toy model to a similar analysis of NNs. Furthermore, we have not at all considered the equally interesting questions of scaling with model size and compute.

7 Conclusion
Summary.
We introduced a very simple model that can exhibit power laws (decrease of error with data size) consistent with recent findings in deep learning. The model is plausibly the simplest such model, and that choice was deliberate, so as not to get bogged down in intractable math or forced into crude approximations or bounds at this early stage of investigation. Many, but not all, data distributions lead to power laws. We do not know whether the discovered specific relation between the Zipf exponent $\alpha$ and the power-law exponent $\beta = \alpha/(1+\alpha)$ is an artifact of the model, or has wider validity beyond this model. The signal-to-noise ratio for the time-averaged error tends to zero, which implies that a single experimental run suffices for stable results.
The toy model studied in this work is admittedly totally unrealistic as a Deep Learning model, but we believe it captures the (or at least a) true reason for the observed scaling laws w.r.t. data. Whether it has any predictive power, or can be generalized to NNs and/or to scaling laws for model size and/or compute, is beyond the scope of this paper. We hope that this initial investigation spurs more advanced theoretical investigations, and ultimately leads to predictive models. We have outlined some ideas in Section 6; some (more) are hopefully feasible. In any case, finding the simplest model which captures the essence is a necessary first step, and we believe our toy model fits this bill.
Applications.
Besides providing scientific insight, a good theoretical understanding of scaling laws could ultimately help tune network and algorithm parameters in a more principled way, and thus save significant compute for finding good large NNs by reducing hyper-parameter sweeps. The cost of training recent models has reached millions of dollars, and can exhaust and exceed even FAANGs' computational resources.
Acknowledgements.
I thank David Budden and Jörg Bornschein for encouraging me to look into the topic of scaling laws, and for interesting discussions.
References

[BH10] Wray Buntine and Marcus Hutter. A Bayesian review of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296, NICTA and ANU, Australia, 2010.
[BHM18] Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. arXiv:1806.05161 [cond-mat, stat], June 2018.
[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, August 2019.
[Bub15] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
[Cha81] Anne Chao. On Estimating the Probability of Discovering a New Species. Annals of Statistics, 9(6):1339-1342, November 1981.
[Cho20] Kyunghyun Cho. Scaling laws of recovering Bernoulli, November 2020. Blog post, http://kyunghyuncho.me/scaling-law-of-estimating-bernoulli/.
[DHM89] Ronald A. DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469-478, December 1989.
[Haz16] Elad Hazan. Introduction to Online Convex Optimization. 2016.
[HKHM21] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling Laws for Transfer. arXiv:2102.01293 [cs], February 2021.
[HKK+20] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling Laws for Autoregressive Generative Modeling. arXiv:2010.14701 [cs], November 2020.
[HNA+17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409 [cs, stat], December 2017.
[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs, stat], January 2020.
[Mha96] H. N. Mhaskar. Neural Networks for Optimal Approximation of Smooth and Analytic Functions. Neural Computation, 8(1):164-177, January 1996.
[Pin99] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, January 1999.
[RRBS19] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A Constructive Prediction of the Generalization Error Across Scales. In International Conference on Learning Representations, September 2019.
[SK20] Utkarsh Sharma and Jared Kaplan. A Neural Scaling Law from the Dimension of the Data Manifold. arXiv:2004.10802 [cs, stat], April 2020.
A Other Loss Functions
We can (slightly) generalize the learning algorithm $A$ to other loss functions and behaviors on $i_{n+1}\not\in i_{1:n}$. We continue to assume that $A'$ suffers $\text{Loss}_n = 0$ if $i_{n+1}\in i_{1:n}$, by using the stored $D_n$. Assume $A'$ suffers some $\text{Loss}_n$ if $i_{n+1}\not\in i_{1:n}$; then
$$\mathbb{E}[\text{Loss}_n] = \mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}]\cdot P[i_{n+1}\not\in i_{1:n}] + 0\cdot P[i_{n+1}\in i_{1:n}] = \mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}]\cdot EE_n$$
The second factor is our primary object of study. The first factor is often constant or bounded by constants: For instance, for $A$ in (1), $\text{Loss}_n = 1$ if $i_{n+1}\not\in i_{1:n}$, hence $\mathbb{E}[\text{Loss}_n] = EE_n$. Assume a classification problem with $K$ labels $\mathcal{Y} = \{0,...,K-1\}$, and modify $A$ to randomize its output (sample $y\in\mathcal{Y}$ uniformly) if $i_{n+1}\not\in i_{1:n}$; then there is a $1/K$ chance to accidentally predict the correct label, hence $\mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}] = 1-1/K$, where the expectation is now also w.r.t. the algorithm's randomness. For continuous $\mathcal{Y}$, if $A'$ samples $y$ from any (non-atomic) density over $\mathcal{Y}$ when $i_{n+1}\not\in i_{1:n}$, the probability of accidentally correctly predicting $y_{n+1}$ is 0, hence $\mathbb{E}[\text{Loss}_n] = EE_n$. For binary classification ($K = 2$) we could also let $A'$ predict $y = \frac12$ and use the absolute loss $\text{Loss}_n = |y-y_{n+1}|$, in which case $\mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}] = \frac12$. If $A'$ instead samples $y$ uniformly from $[0;1]$, then $\mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}] = \int_0^1|y-y_{n+1}|dy = \frac12$ for $y_{n+1}\in\{0,1\}$. If $\mathcal{Y} = [0;1]$, then $\mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}] = \int_0^1|y-y_{n+1}|dy = \frac14 + (\frac12-y_{n+1})^2$, hence $\frac14EE_n \leq \mathbb{E}[\text{Loss}_n] \leq \frac12EE_n$.

More generally, for any compact and uniformly rounded set $\mathcal{Y}\subseteq\mathbb{R}^d$, for any loss of the form $\text{Loss}_n := \ell(||y-y_{n+1}||)$, for any norm $||\cdot||$, for any continuous strictly increasing $\ell \geq 0$, and $A'$ sampling $y$ from any density $p_{alg}(y) > 0$ on $\mathcal{Y}$ if $i_{n+1}\not\in i_{1:n}$, we have, for some constants $c_1, c_2 > 0$,
$$c_1EE_n \leq \mathbb{E}[\text{Loss}_n] \leq c_2EE_n, \quad\text{or}\quad \mathbb{E}[\text{Loss}_n] \stackrel{\times}{=} EE_n \text{ for short},$$
and this fact holds even more generally. Since a multiplicative constant in the loss is irrelevant from a scaling perspective, all scaling results for $EE_n$ also apply to this (slightly) more general setting.

The proof of this is as follows: A uniformly rounded set can, by definition, be represented as a union of $\varepsilon$-balls for some fixed $\varepsilon > 0$, i.e. $\mathcal{Y} = \bigcup_{\tilde y\in\tilde{\mathcal{Y}}}B_\varepsilon(\tilde y)$ for some $\tilde{\mathcal{Y}}$, where $B_\varepsilon(\tilde y) := \{y : ||y-\tilde y||\leq\varepsilon\}$. Then
$$\mathbb{E}[\text{Loss}_n|i_{n+1}\not\in i_{1:n}] = \int_{\mathcal{Y}}\ell(||y-y_{n+1}||)\,p_{alg}(y)\,dy \qquad (8)$$
$$\stackrel{(a)}{\geq} \delta\int_{B_\varepsilon(\tilde y)}\ell(||y-y_{n+1}||)\,dy \;\stackrel{(b)}{\geq}\; \delta\int_{B_\varepsilon(y_{n+1})}\ell(||y-y_{n+1}||)\,dy \;\stackrel{(c)}{=}\; \delta\int_{B_\varepsilon(0)}\ell(||z||)\,dz =: c_1 \;\stackrel{(d)}{>}\; 0$$
In $(a)$, $\tilde y\in\tilde{\mathcal{Y}}$ is chosen such that $y_{n+1}\in B_\varepsilon(\tilde y)\subseteq\mathcal{Y}$, and $p_{alg} > \delta > 0$, since $p_{alg} > 0$ and $\mathcal{Y}$ is compact. In $(b)$, $B_\varepsilon(y_{n+1})$ can be obtained from $B_\varepsilon(\tilde y)$ by cutting out the moon $B_\varepsilon(\tilde y)\setminus B_\varepsilon(y_{n+1})$ and point-mirroring the moon at $\frac12(\tilde y+y_{n+1})$. The flip brings every point closer to $y_{n+1}$, hence decreases the integral, since $\ell$ is monotone increasing. $(c)$ just recenters the integral, which now is obviously independent of $y_{n+1}$. It is non-zero $(d)$, since $\ell$ is strictly increasing. Since $\ell$ is continuous and $\mathcal{Y}$ is compact, $\ell$ is upper-bounded by some $\ell_{\max} < \infty$. This immediately implies that (8) is upper-bounded by $c_2 := \ell_{\max} < \infty$.

B Derivation of Expectation and Variance
Expectation.
Recall that the error of (the basic form of) Algorithm $A$ is $E_n = [\![i_{n+1}\not\in i_{1:n}]\!]$. Hence the probability that Algorithm $A$ makes an error under distribution $\theta$, given data $D_n$, is
$$\mathbb{E}[E_n|D_n] = P[A(i_{n+1},D_n)\neq y_{n+1}|D_n] = \sum_{i\not\in i_{1:n}}P[i_{n+1}=i] = \sum_{i\not\in i_{1:n}}\theta_i \qquad (9)$$
The expectation of this w.r.t. $D_n$ is
$$EE_n := \mathbb{E}[E_n] = \mathbb{E}[\mathbb{E}[E_n|D_n]] = \sum_{i_{1:n}}P[A(i_{n+1},D_n)\neq y_{n+1}|D_n]\,P[D_n] = \sum_{i_{1:n}}\Big(\sum_{i\not\in i_{1:n}}\theta_i\Big)\prod_{t=1}^n\theta_{i_t}$$
$$= \sum_{i_{1:n}}\Big(\sum_{i=1}^\infty[\![i\neq i_1\wedge...\wedge i\neq i_n]\!]\theta_i\Big)\prod_{t=1}^n\theta_{i_t} = \sum_{i=1}^\infty\theta_i\sum_{i_{1:n}}\prod_{t=1}^n[\![i\neq i_t]\!]\theta_{i_t} = \sum_{i=1}^\infty\theta_i\prod_{t=1}^n\sum_{i_t\neq i}\theta_{i_t} = \sum_{i=1}^\infty\theta_i(1-\theta_i)^n$$
The result can actually be derived more easily as
$$EE_n = P[i_{n+1}\not\in i_{1:n}] = \sum_{i=1}^\infty P[i_{n+1}=i\wedge i_1\neq i\wedge...\wedge i_n\neq i] = \sum_{i=1}^\infty P[i_{n+1}=i]\prod_{t=1}^nP[i_t\neq i] = \sum_{i=1}^\infty\theta_i(1-\theta_i)^n \qquad (10)$$
but the former derivation is more suitable for generalization to other loss functions and noisy labels.

Approximation.

Let $f : (0;\infty)\to(0;\infty)$ be a continuously differentiable and decreasing extension of $\theta : \mathbb{N}\to\mathbb{R}$, i.e. $f(i) := \theta_i$ and $f'(x) < 0$.
Let $g(x) := f(x)e^{-nf(x)}$. Since $u\mapsto ue^{-nu}$ is unimodal with maximum $1/en$ at $u = 1/n$, and $f$ is monotone, $g(x)$ is unimodal with maximum $g_{\max} = 1/en$ at $x_{\max} = f^{-1}(1/n)$. We hence can use (18) (any $a\in[0;1]$) to upper-bound the sum in (10) by an integral as follows:
$$EE_n \stackrel{(10)}{=} \sum_{i=1}^\infty\theta_i(1-\theta_i)^n \leq \sum_{i=1}^\infty\theta_ie^{-n\theta_i} = \sum_{i=1}^\infty f(i)e^{-nf(i)} \stackrel{(18)}{\leq} g_{\max} + \int_a^\infty f(x)e^{-nf(x)}dx = \frac{1}{en} + \int_0^{f(a)}\frac{ue^{-nu}\,du}{|f'(f^{-1}(u))|}$$
where the last equality follows from a reparametrization $u = f(x)$, with $f(\infty) = 0$, $dx = du/f'(x)$, and $f' < 0$.

Now we need a lower bound on $(1-\theta_i)^n$. For $0\leq x\leq\varepsilon$ we have
$$e^{-x} \leq 1 - x + \tfrac{x^2}{2} = 1 - (1-\tfrac{x}{2})x \leq 1 - (1-\varepsilon)x$$
Inserting $x = \theta_i/(1-\varepsilon)$, we get
$$1-\theta_i \geq e^{-\theta_i/(1-\varepsilon)} \quad\text{for}\quad \theta_i \leq \varepsilon(1-\varepsilon)$$
Let $i_0$ be an index such that $\theta_{i_0} \leq \varepsilon(1-\varepsilon)$. We define $\tilde n := n/(1-\varepsilon)$. That $\tilde n$ is not an integer is no problem, and could even be avoided by rescaling $\theta_i$ instead. Similarly as for the upper bound, we get a lower bound
$$EE_n \geq \sum_{i=i_0}^\infty\theta_i(1-\theta_i)^n \geq \sum_{i=i_0}^\infty\theta_ie^{-\tilde n\theta_i} = -\sum_{i=1}^{i_0-1}\theta_ie^{-\tilde n\theta_i} + \sum_{i=1}^\infty\theta_ie^{-\tilde n\theta_i} \geq -\sum_{i=1}^{i_0-1}\theta_ie^{-\tilde n\theta_i} - \frac{1}{e\tilde n} + \int_0^{f(a)}\frac{ue^{-\tilde nu}\,du}{|f'(f^{-1}(u))|}$$
Let us further choose $i_0$ such that $\theta_{i_0-1} \geq \varepsilon(1-\varepsilon)$, which is possible as long as some $\theta_{i_0} \leq \varepsilon(1-\varepsilon) \leq \theta_{i_0-1}$. This finally leads to
$$EE_n \geq -e^{-\varepsilon n} - \frac{1}{en} + \int_0^{f(a)}\frac{ue^{-\tilde nu}\,du}{|f'(f^{-1}(u))|}$$
Since we can choose $\varepsilon$ arbitrarily small, combining both bounds, and choosing $a\to 0$ for $f$ such that $f(x)\to\infty$ for $x\to 0$, we have
$$\Big|EE_n - EE_n^{\smallint}\Big| \leq \frac{1}{en} + o(1/n) \quad\text{with}\quad EE_n^{\smallint} := \int_0^\infty\frac{ue^{-nu}\,du}{|f'(f^{-1}(u))|} \qquad (11)$$
The integral is dominated by $u \lesssim 1/n$, so for large $n$ it is determined by the asymptotics of $f'(f^{-1}(u))$ for $u\to 0$. Assume
$$|f'(f^{-1}(u))| \approx c'u^\delta \quad\text{for}\quad u\to 0, \quad\text{for some constants } c' \text{ and } \delta \qquad (12)$$
Substituting $u = v/n$ leads to
$$EE_n^{\smallint} \approx \int_0^\infty\frac{ue^{-nu}}{c'u^\delta}du = \frac{n^{\delta-2}}{c'}\int_0^\infty v^{1-\delta}e^{-v}dv = \frac{\Gamma(2-\delta)}{c'\,n^{2-\delta}} \qquad (13)$$
where $\Gamma$ is the Gamma function.

Zipf distribution.
For Zipf-distributed $\theta_i = \alpha i^{-(\alpha+1)}$, let $u = f(x) = \alpha\cdot x^{-(\alpha+1)}$. This implies $x = f^{-1}(u) = (u/\alpha)^{-1/(1+\alpha)}$ and $f'(x) = -\alpha(\alpha+1)x^{-(\alpha+2)} = -\alpha(\alpha+1)(u/\alpha)^{\frac{\alpha+2}{\alpha+1}}$, hence approximation (12) is actually exact, with $\delta = \frac{\alpha+2}{\alpha+1}$ and $c' = \alpha(\alpha+1)/\alpha^{(\alpha+2)/(\alpha+1)}$, leading to
$$EE_n^{\smallint} = c_\alpha n^{-\beta} \quad\text{with}\quad c_\alpha = \frac{\Gamma(2-\delta)}{c'} = \frac{\alpha^{1/(1+\alpha)}}{1+\alpha}\Gamma\Big(\frac{\alpha}{1+\alpha}\Big) \quad\text{and}\quad \beta = 2-\delta = \frac{\alpha}{1+\alpha}$$
Note that $\tilde n^{-\beta} = n^{-\beta} + o(1/n)$ for e.g. $\varepsilon = \ln n/n$, for which also $e^{-\varepsilon n} = 1/n$. Numerically one can check that $c_\alpha \leq 1.214$ for all $\alpha > 0$ and $c_\alpha \geq 0.886$ for (the interesting) $\alpha \leq 1$; that is, $c_\alpha$ is nearly independent of $\alpha$. $c_1 = \frac12\sqrt{\pi} \doteq 0.886$ and $c_{0.1} \doteq 1.177$ are in excellent agreement with the fit curves in Figure 2.
Time-averaged expectation and variance.
We now consider the time-averaged error
$$\bar E_N := \frac{1}{N}\sum_{n=0}^{N-1}E_n$$
We derive the expressions for its expectation $\mathbb{E}[\bar E_N]$ and variance $\mathbb{V}[\bar E_N]$ stated in Section 4. The expectation is trivial:
$$\mathbb{E}[\bar E_N] = \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}[E_n] = \frac{1}{N}\sum_{i=1}^\infty\theta_i\sum_{n=0}^{N-1}(1-\theta_i)^n = \frac{1}{N}\sum_{i=1}^\infty[1-(1-\theta_i)^N]$$
For the variance, we first compute $\bar E_N^2$, then $\mathbb{E}[\bar E_N^2]$, then $\mathbb{V}[\bar E_N] = \mathbb{E}[\bar E_N^2]-\mathbb{E}[\bar E_N]^2$:
$$\bar E_N^2 = \frac{1}{N^2}\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}E_nE_m \stackrel{(b)}{=} \frac{2}{N^2}\sum_{n=0}^{N-1}\sum_{m=0}^{n-1}E_nE_m + \frac{1}{N^2}\sum_{n=0}^{N-1}E_n^2$$
where $(b)$ breaks up the double sum into lower=upper triangle and diagonal terms. For $m < n$ we have
$$E_n\cdot E_m = [\![i_{n+1}\not\in i_{1:n}]\!]\cdot[\![i_{m+1}\not\in i_{1:m}]\!] = \sum_{i=1}^\infty[\![i_{n+1}=i\wedge i_n\neq i\wedge...\wedge i_1\neq i]\!]\cdot\sum_{j=1}^\infty[\![i_{m+1}=j\wedge i_m\neq j\wedge...\wedge i_1\neq j]\!]$$
$$= \sum_{i,j}[\![i_{n+1}=i\wedge i_n\neq i\wedge...\wedge i_{m+2}\neq i\wedge i_{m+1}=j\neq i\wedge i_m\neq j\neq i\wedge...\wedge i_1\neq j\neq i]\!]$$
(where $i_t\neq j\neq i$ is shorthand for $i_t\neq j\wedge i_t\neq i$). For $i = j$, $i_{m+1}=j\neq i$ (meaning $i_{m+1}=j\wedge i_{m+1}\neq i$) is a contradiction, so we can limit the sum to $i\neq j$. Taking the expectation and noting that $E_n\in\{0,1\}$ and that all $i_t$ are independent with $P[i_t=i]=\theta_i$, we get
$$\mathbb{E}[E_n\cdot E_m] = \sum_{i\neq j}P[i_{n+1}=i]\,P[i_n\neq i]\cdots P[i_{m+2}\neq i]\,P[i_{m+1}=j\neq i]\,P[i_m\neq j\neq i]\cdots P[i_1\neq j\neq i]$$
$$= \sum_{i\neq j}\theta_i(1-\theta_i)^{n-m-1}\theta_j(1-\theta_i-\theta_j)^m, \quad\text{hence}\quad \mathbb{E}\Big[\sum_{n=0}^{N-1}\sum_{m=0}^{n-1}E_nE_m\Big] = \sum_{i\neq j}b_{ijN} = \sum_{i<j}[b_{ijN}+b_{jiN}]$$
where
$$b_{ijN} := \sum_{n=0}^{N-1}\sum_{m=0}^{n-1}\theta_i(1-\theta_i)^{n-m-1}\theta_j(1-\theta_i-\theta_j)^m = \theta_i\theta_j\sum_{m=0}^{N-2}(1-\theta_i-\theta_j)^m\sum_{n=m+1}^{N-1}(1-\theta_i)^{n-m-1}$$
$$= \theta_j\sum_{m=0}^{N-2}(1-\theta_i-\theta_j)^m\big[1-(1-\theta_i)^{N-m-1}\big] = \frac{\theta_j}{\theta_i+\theta_j}\big[1-(1-\theta_i-\theta_j)^{N-1}\big] - (1-\theta_i)^N + (1-\theta_i)(1-\theta_i-\theta_j)^{N-1}$$
hence
$$b_{ijN}+b_{jiN} = 1 - (1-\theta_i)^N - (1-\theta_j)^N + (1-\theta_i-\theta_j)^N$$
The diagonal term is easy:
$$\frac{1}{N^2}\sum_{n=0}^{N-1}\mathbb{E}[E_n^2] = \frac{1}{N^2}\sum_{n=0}^{N-1}\mathbb{E}[E_n] = \frac{1}{N}\mathbb{E}[\bar E_N] = \frac{1}{N^2}\sum_{i=1}^\infty[1-(1-\theta_i)^N]$$
Putting everything together, non-diagonal ($i\neq j$) and diagonal ($i=j$) expressions, we get our final expression
$$\mathbb{E}[\bar E_N^2] = \frac{1}{N^2}\sum_{i\neq j}[1-(1-\theta_i)^N-(1-\theta_j)^N+(1-\theta_i-\theta_j)^N] + \frac{1}{N^2}\sum_{i=1}^\infty[1-(1-\theta_i)^N]$$
In order to get the variance of $\bar E_N$, we have to subtract the squared expected error
$$\mathbb{E}[\bar E_N]^2 = \Big(\frac{1}{N}\sum_{i=1}^\infty[1-(1-\theta_i)^N]\Big)^2 = \frac{1}{N^2}\sum_{i,j}[1-(1-\theta_i)^N][1-(1-\theta_j)^N]$$
$$= \frac{1}{N^2}\sum_{i\neq j}[1-(1-\theta_i)^N-(1-\theta_j)^N+(1-\theta_i)^N(1-\theta_j)^N] + \frac{1}{N^2}\sum_{i=1}^\infty[1-(1-\theta_i)^N]^2$$
where we expanded the product and separated the $i\neq j$ from the $i=j$ terms, which now easily leads to
$$\mathbb{V}[\bar E_N] = \mathbb{E}[\bar E_N^2]-\mathbb{E}[\bar E_N]^2 = \frac{1}{N^2}\sum_{i\neq j}[(1-\theta_i-\theta_j)^N-(1-\theta_i)^N(1-\theta_j)^N] + \frac{1}{N^2}\sum_{i=1}^\infty[(1-\theta_i)^N-(1-\theta_i)^{2N}]$$
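These closed forms are easy to validate by brute-force simulation; a minimal sketch (our illustration, with an arbitrary small $\theta$):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.5, 0.3, 0.2])
N, runs = 8, 100_000

a = (1 - theta) ** N                              # (1-theta_i)^N
mean = (1 - a).sum() / N                          # expectation of bar{E}_N
pair = (1 - theta[:, None] - theta[None, :]) ** N
off = pair - np.outer(a, a)                       # (1-ti-tj)^N - (1-ti)^N (1-tj)^N
np.fill_diagonal(off, 0.0)
var = ((a * (1 - a)).sum() + off.sum()) / N ** 2  # variance of bar{E}_N

draws = rng.choice(len(theta), size=(runs, N), p=theta)
Ebar = np.empty(runs)
for r in range(runs):
    seq = draws[r]
    Ebar[r] = np.mean([seq[n] not in seq[:n] for n in range(N)])  # time average
print(f"mean: exact {mean:.4f}  MC {Ebar.mean():.4f}")
print(f"var : exact {var:.6f}  MC {Ebar.var():.6f}")
```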
Approximation.

We can approximate the variance similarly to the expectation $EE_n$. We only provide a heuristic derivation analogous to (4):
$$\mathbb{V}[\bar E_N] \stackrel{(a)}{\approx} \frac{1}{N^2}\sum_{i\neq j}\big[e^{-(\theta_i+\theta_j)N}-e^{-\theta_iN}e^{-\theta_jN}\big] + \frac{1}{N^2}\sum_{i=1}^\infty\big[e^{-\theta_iN}-e^{-2\theta_iN}\big]$$
$$\stackrel{(b)}{\approx} \frac{1}{N^2}\int_0^\infty e^{-f(x)N}-e^{-2f(x)N}\,dx \stackrel{(c)}{=} \frac{1}{N^2}\int_0^{\theta_1}\frac{e^{-uN}-e^{-2uN}}{|f'(f^{-1}(u))|}du$$
$$\stackrel{(d)}{\approx} \frac{1}{N^2|f'(f^{-1}(\frac{\ln 2}{N}))|}\int_0^{\theta_1}e^{-uN}-e^{-2uN}\,du \stackrel{(e)}{\approx} \frac{1}{2N^3|f'(f^{-1}(\frac{\ln 2}{N}))|} \stackrel{\times}{\approx} \frac{1}{N}EE_N \begin{cases} \stackrel{\times}{\approx} \frac{1}{N}\mathbb{E}[\bar E_N] & \text{if } EE_N\stackrel{\times}{\approx}N^{-\beta} \\ \leq \frac{1}{N}\mathbb{E}[\bar E_N] & \text{always}\end{cases}$$
$(a)$ follows from $1-\theta_i \approx e^{-\theta_i}$; $(b)$ by setting $f(i) := \theta_i$ and replacing the sums by integrals (note that the $i\neq j$ sum vanishes identically in this approximation); $(c)$ follows from a reparametrization $u = f(x)$, with $f(1) = \theta_1$, $f(\infty) = 0$, $dx = du/f'(x)$, and $f' < 0$. The numerator $e^{-uN}-e^{-2uN}$ is maximal and (strongly) concentrated around $u = \frac{\ln 2}{N}$, hence $u \approx \frac{\ln 2}{N}$ gives most of the integral's contribution. Therefore, in $(d)$, replacing $u$ by $\frac{\ln 2}{N}$ in the denominator can be a reasonable approximation (within a multiplicative constant). $(e)$ follows from $\int_0^{\theta_1}... \approx \int_0^\infty ... = \frac{1}{2N}$ for $\theta_1N \gg 1$. We could use this approximation for various concrete $f$, but if we substitute $u = 1/N$ instead of $\ln 2/N$ in $(d)$, we only make a multiplicative error ($\times$), and the expression nicely reduces to the instantaneous expected error, $\frac{1}{N}EE_N$. For slowly decreasing error $EE_N \stackrel{\times}{\approx} N^{-\beta}$, we have $EE_N \stackrel{\times}{\approx} \mathbb{E}[\bar E_N]$. In general, $EE_N \leq \mathbb{E}[\bar E_N]$, since $EE_n$ is monotone decreasing.

C Noisy Labels
Here we generalize our model to noisy labels. We first derive generic expressions for (somewhat) general algorithms and losses. We then instantiate them for frequency estimation and square loss. Finally, we outline how to derive similar expressions for the absolute loss.
General loss.
Consider a binary classification problem where the labels $y_t\in\{0,1\}$ are noisy. Let $\gamma_i := P[y_t=1|i_t=i]$ be the probability that feature $i\in\mathbb{N}$ is labelled 1. The probability of observing feature $i$ itself remains $\theta_i := P[i_t=i]$, as in the deterministic case. Algorithm $A_i := A(i,D_n)\in[0;1]$ now aims to predict $\gamma_i$. The square loss when predicting $A_i$ while the true label is $y$, and its expectation w.r.t. $\gamma_i$, are
$$\text{Loss}_n(A|D_n,i_{n+1}=i,y_{n+1}=y) = (y-A_i)^2$$
$$\text{Loss}_n(A|D_n,i_{n+1}=i) = \gamma_i(1-A_i)^2 + (1-\gamma_i)(0-A_i)^2 = (\gamma_i-A_i)^2 + \gamma_i(1-\gamma_i)$$
The most naive learning algorithm would predict $\gamma_i$ by the observed frequencies: $A_i = k_i/n_i$ if feature $i$ occurred $n_i := \#\{t\leq n : i_t=i\}$ times and had label 1 in $k_i := \#\{t\leq n : i_t=i, y_t=1\}$ of those cases. Obviously $k_i/n_i \to \gamma_i$ provided $n_i\to\infty$, hence $\text{Loss}_n(A|D_n,i_{n+1}=i)$ converges to the intrinsic label "entropy" $\gamma_i(1-\gamma_i)$, rather than to 0, which has to be subtracted for a power-law analysis to make sense. Similarly, the expectation of the log-loss $-y\ln A_i-(1-y)\ln(1-A_i)$ w.r.t. $\gamma_i$ leads to the Kullback-Leibler loss $\text{KL}(\gamma_i||A_i)$ plus entropy $H(\gamma_i)$. More generally, let us assume $A(i,D_n)$ depends (somehow) only on $k_i$ and $n_i$ (e.g. the Laplace rule), and hence
$$\text{Loss}_n(A|D_n,i_{n+1}=i) = \ell(\gamma_i,A_i) = \ell(\gamma_i,k_i,n_i) = \text{Loss}_n(A|k_i,n_i,i_{n+1}=i)$$
for some function $\ell$. We now take the expectation over $D_n$:
$$L_i := \text{Loss}_n(A|i_{n+1}=i) = \sum_{D_n}\text{Loss}_n(A|D_n,i_{n+1}=i)P[D_n] = \sum_{n_i=0}^n\sum_{k_i=0}^{n_i}\Big(\sum_{D_n:k_i,n_i}P[D_n]\Big)\ell(\gamma_i,k_i,n_i)$$
where $\sum_{D_n:k_i,n_i}$ means that the sum is restricted to $D_n$ for which feature $i$ appears $n_i$ times, $k_i$ times of these with label 1. The probability of each such event is binomial:
$$\sum_{D_n:k_i,n_i}P[D_n] = \sum_{i_{1:n}:n_i}P[i_{1:n}]\sum_{y_{1:n}:k_i}P[y_{1:n}|i_{1:n}] = \binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\binom{n_i}{k_i}\gamma_i^{k_i}(1-\gamma_i)^{n_i-k_i}$$
This is obvious, or follows by explicit calculation of the sums and some algebra. Putting everything together and finally taking the expectation over $i$, we get
$$\text{Loss}_n(A|n_i,i_{n+1}=i) = \sum_{k_i=0}^{n_i}\binom{n_i}{k_i}\gamma_i^{k_i}(1-\gamma_i)^{n_i-k_i}\ell(\gamma_i,k_i,n_i) \qquad (14)$$
$$L_i \equiv \text{Loss}_n(A|i_{n+1}=i) = \sum_{n_i=0}^n\binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\text{Loss}_n(A|n_i,i_{n+1}=i) \qquad (15)$$
$$\text{Loss}_n(A) = \sum_{i=1}^\infty\theta_iL_i \qquad (16)$$
In the deterministic case, $\gamma_i\in\{0,1\}$, and for our memorizing algorithm (1) with 0-1 loss, $\ell(\gamma_i,k_i,n_i) = [\![n_i=0]\!]$ is independent of $k_i$, so the $k_i$-sum sums the binomial to 1, and the $n_i$-sum collapses to $n_i=0$, leading back to $\text{Loss}_n(A) = \sum_{i=1}^\infty\theta_i(1-\theta_i)^n = EE_n = (10)$.

Square loss.
Square loss. For noisy labels $\gamma_i\in[0;1]$, the frequency estimator $A_i=k_i/n_i$, square loss $\ell(\gamma_i,k_i,n_i)=(\gamma_i-k_i/n_i)^2$ for $n_i>0$ (with the intrinsic entropy $\gamma_i(1-\gamma_i)$ removed), and keeping $\ell(\gamma_i,k_i,0)=1$, we proceed as follows: The $k_i$-sum in (14) becomes the variance of $k_i/n_i$, hence $\mathrm{Loss}_n(A|n_i,i_{n+1}=i)=\gamma_i(1-\gamma_i)/n_i$. Unfortunately, plugging this into the next $n_i$-sum in (15) leads to a hypergeometric function. We tried various approximations, all leading essentially to the same end result. The simplest is to approximate $\gamma_i(1-\gamma_i)/n_i$ by $\gamma_i(1-\gamma_i)/(n_i+1)$, which is asymptotically correct and within a factor of 2 also valid for $1\le n_i<\infty$, and avoids hypergeometric functions altogether:

$$L_i ~=~ \sum_{n_i=0}^n\binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\,\mathrm{Loss}_n(A|n_i,i_{n+1}=i)$$
$$=~ (1-\theta_i)^n + \sum_{n_i=1}^n\binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\,\frac{\gamma_i(1-\gamma_i)}{n_i}$$
$$\approx~ (1-\theta_i)^n + \sum_{n_i=1}^n\binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\,\frac{\gamma_i(1-\gamma_i)}{n_i+1}$$
$$=~ (1-\theta_i)^n - (1-\theta_i)^n\gamma_i(1-\gamma_i) + \sum_{n_i=0}^n\binom{n}{n_i}\theta_i^{n_i}(1-\theta_i)^{n-n_i}\,\frac{\gamma_i(1-\gamma_i)}{n_i+1}$$
$$\overset{(a)}{=}~ [1-\gamma_i(1-\gamma_i)](1-\theta_i)^n + \frac{\gamma_i(1-\gamma_i)}{(n+1)\theta_i}\sum_{k=1}^{n+1}\binom{n+1}{k}\theta_i^k(1-\theta_i)^{n+1-k}$$
$$\overset{(b)}{=}~ [1-\gamma_i(1-\gamma_i)](1-\theta_i)^n + \frac{\gamma_i(1-\gamma_i)}{(n+1)\theta_i}\,[1-(1-\theta_i)^{n+1}]$$

where $(a)$ follows from substituting $n_i=k-1$ and $\binom{n}{k-1}/k=\binom{n+1}{k}/(n+1)$, and $(b)$ from adding the missing $k=0$ contribution and the fact that a complete binomial sums to 1. If we assume that the noise level is the same for all features, i.e. $\gamma_i=\gamma$ or $\gamma_i=1-\gamma$, then

$$\mathrm{Loss}_n(A) ~=~ \sum_{i=1}^\infty\theta_i L_i ~\approx~ [1-\gamma(1-\gamma)]\sum_{i=1}^\infty\theta_i(1-\theta_i)^n + \frac{\gamma(1-\gamma)}{n+1}\sum_{i=1}^\infty[1-(1-\theta_i)^{n+1}]$$
$$=~ [1-\gamma(1-\gamma)]\,\mathrm{EE}_n + \gamma(1-\gamma)\,\mathbf{E}[\bar E_{n+1}]$$

Again, in the deterministic case $\gamma\in\{0,1\}$, we get back $\mathrm{Loss}_n(A)=\mathrm{EE}_n$. If we assume that the $\gamma_i$ are bounded away from 0 and 1, then still, within a multiplicative constant,

$$\mathrm{Loss}_n(A) ~\overset{\times}{=}~ \mathrm{EE}_n + \mathbf{E}[\bar E_n]$$

It is quite remarkable that the instantaneous square loss for noisy labels includes a term proportional to the time-averaged (0-1) error of the deterministic case. As we have seen in the main paper, roughly, as long as $\mathrm{EE}_n$ goes to 0 slower than $1/n$, $\mathrm{EE}_n$ and $\mathbf{E}[\bar E_n]$ have the same asymptotics, which in turn implies that the results for deterministic classification transfer to noisy labels; in particular, $\alpha$-Zipf-distributed data lead to $\beta$-power-law learning curves with $\beta=\frac{\alpha}{1+\alpha}$. While $\mathrm{EE}_n$ can decay faster than $1/n$, e.g. exponentially for finite models, $\mathbf{E}[\bar E_n]$ and hence $\mathrm{Loss}_n(A)$ can never decay faster than $1/n$. As discussed in the introduction, the reason is that the accuracy to which parameters (here $\gamma_i$) can be estimated from $n_i$ i.i.d. data points is $\overset{\times}{=} n_i^{-1/2} \overset{\times}{\ge} n^{-1/2}$, which squares to $\mathrm{Loss}_n(A)\overset{\times}{\ge} n^{-1}$ for (locally) quadratic loss.
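The $1/n_i \to 1/(n_i{+}1)$ step is easy to check numerically. In the sketch below (my own; the values $n=200$, $\theta_i=0.05$, $\gamma=0.3$ are arbitrary), the exact plug-in of (14) into (15) is compared with the hypergeometric-free closed form $(b)$:

```python
import numpy as np
from scipy.stats import binom

# Check of the 1/n_i -> 1/(n_i+1) approximation behind closed form (b).
n, theta_i, g = 200, 0.05, 0.3
v = g * (1 - g)                                 # intrinsic label variance

ni = np.arange(1, n + 1)
exact = binom.pmf(0, n, theta_i) * 1.0 \
      + np.sum(binom.pmf(ni, n, theta_i) * v / ni)       # (14) plugged into (15)
approx = (1 - v) * (1 - theta_i) ** n \
       + v / ((n + 1) * theta_i) * (1 - (1 - theta_i) ** (n + 1))
print(f"exact L_i = {exact:.6f}   closed form (b) = {approx:.6f}")
```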
Absolute loss. For the absolute loss $\ell=|\gamma_i-k_i/n_i|$ there is no closed-form solution for (14). For large $n_i$, the binomial is approximately Gaussian (in $k_i/n_i$) with mean $\gamma_i$ and variance $\gamma_i(1-\gamma_i)/n_i$, and (14) evaluates to $\overset{\times}{=}\sqrt{\gamma_i(1-\gamma_i)/n_i}$. Plugging this into (15), we can approximate the $n_i$-sum for $n\theta_i\ll 1$ and for $n\theta_i\gg 1$. Plugging each into (16) and approximating the $i$-sum for $\alpha$-Zipf-distributed $\theta_i$, one can show that each, again, scales as $n^{-\beta}$ with $\beta=\frac{\alpha}{1+\alpha}$, but the latter has an additional $n^{-1/2}$ term. For $\alpha<1$ the loss still scales as $n^{-\beta}$, but for $\alpha>1$ the $n^{-1/2}$ term dominates, consistent with the $n^{-1/2}$ parameter-estimation accuracy entering the (locally linear) absolute loss unsquared.
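For the Gaussian step, the mean absolute deviation of a Gaussian with variance $\sigma^2$ is $\sigma\sqrt{2/\pi}$; the explicit constant $\sqrt{2/\pi}$ is my addition here, the argument in the text only needs the statement up to a multiplicative constant. A small check (the choice $\gamma=0.3$ is arbitrary):

```python
import numpy as np
from scipy.stats import binom

# Exact k_i-sum of (14) for absolute loss vs. the Gaussian approximation
# sqrt(2 g (1-g) / (pi n_i)), the mean absolute deviation of a Gaussian.
g = 0.3
for ni in [10, 100, 1000]:
    ks = np.arange(ni + 1)
    exact = np.sum(binom.pmf(ks, ni, g) * np.abs(g - ks / ni))
    gauss = np.sqrt(2 * g * (1 - g) / (np.pi * ni))
    print(f"n_i={ni:5d}  exact={exact:.5f}  Gaussian approx={gauss:.5f}")
```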
D Approximating Sums by Integrals

Sums $\sum_{i=1}^\infty g(i)$ can be approximated by integrals $\int_0^\infty g(x)\,dx$. Classically, upper-bounding the approximation error requires computing a cumbersome integral (Euler-Maclaurin remainder) or only works for finite sums (trapezoid rule). In the following we derive an upper bound on the approximation accuracy suitable for our purpose. First, note that for a monotone increasing function,

$$\int_{n-1}^m g(x)\,dx ~\le~ \sum_{i=n}^m g(i) ~\le~ \int_n^{m+1} g(x)\,dx \qquad (17)$$

with the inequalities reversed for monotone decreasing functions. Consider now any measurable function $g:[0;\infty)\to[0;\infty)$ increasing up to $g_{max}=g(x_{max})$ and thereafter decreasing (in our application, $g(x)=f(x)e^{-nf(x)}$). Let $i_m-1\le x_{max}\le i_m$. We split the sum into its increasing and decreasing parts and use (17) to lower-bound it:

$$\sum_{i=1}^\infty g(i) ~=~ \sum_{i=1}^{i_m-1} g(i) + \sum_{i=i_m}^\infty g(i) ~\ge~ \int_0^{i_m-1} g(x)\,dx + \int_{i_m}^\infty g(x)\,dx$$
$$=~ \int_0^\infty g(x)\,dx - \int_{i_m-1}^{i_m} g(x)\,dx ~\ge~ \int_0^\infty g(x)\,dx - g_{max}$$

To obtain an upper bound, we have to exclude $i_m-1$ and $i_m$ from the sums:

$$\sum_{i=1}^\infty g(i) ~=~ \sum_{i=1}^{i_m-2} g(i) + \sum_{i=i_m+1}^\infty g(i) + [g(i_m-1)+g(i_m)]$$
$$\le~ \int_1^{i_m-1} g(x)\,dx + \int_{i_m}^\infty g(x)\,dx + \min\{g(i_m-1),g(i_m)\} + \max\{g(i_m-1),g(i_m)\}$$
$$\le~ \int_1^{i_m-1} g(x)\,dx + \int_{i_m}^\infty g(x)\,dx + \int_{i_m-1}^{i_m} g(x)\,dx + g_{max} ~=~ \int_1^\infty g(x)\,dx + g_{max}$$

Together this leads to the following bound on the approximation error:

$$\bigg|\sum_{i=1}^\infty g(i) - \int_a^\infty g(x)\,dx\bigg| ~\le~ g_{max} \qquad (18)$$

for every choice of $a\in[0;1]$. Without further assumptions on $g$, this bound is tight: for the lower bound, consider $g(x)=g_{max}$ for $i_m-1<x<i_m$ and 0 otherwise; for the upper bound, consider $g(i_m)=g_{max}$ and 0 otherwise.
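Bound (18) is easy to probe numerically. In the sketch below (my own; $f(x)=c\,x^{-(\alpha+1)}$ with arbitrary $c=0.5$, $\alpha=1$, $n=1000$), note that $u\,e^{-nu}$ as a function of $u=f(x)$ is maximal at $u=1/n$, so $g_{max}=1/(e\,n)$:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of bound (18) for g(x) = f(x) exp(-n f(x)) with Zipf-like f.
alpha, c, n = 1.0, 0.5, 1000.0
f = lambda x: c * x ** (-(alpha + 1))
g = lambda x: f(x) * np.exp(-n * f(x))

i = np.arange(1, 10**7 + 1, dtype=float)        # truncate the convergent sum
total = np.sum(g(i))

x_max = (c * n) ** (1 / (alpha + 1))            # where f(x) = 1/n, g is maximal
integral = quad(g, 1.0, x_max)[0] + quad(g, x_max, np.inf)[0]   # a = 1 in (18)
g_max = 1 / (np.e * n)

print(f"|sum - integral| = {abs(total - integral):.2e}  <=  g_max = {g_max:.2e}")
```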
List of Notation

Symbol                              Explanation
$a/b\cdot c$                        $=(a/b)\cdot c$, but $a/bc=a/(bc)$
$[[\mathrm{Bool}]]$                 1 if Bool=True, 0 if Bool=False
$|S|$                               number of elements in set $S$
$\mathbf{P},\mathbf{E},\mathbf{V}$  probability, expectation, variance
$i\in i^n$                          short for $i\in\{i_1,...,i_n\}$
$\dot=$                             equal within the stated number of numerical digits
$\overset{\times}{=}$               equal within a multiplicative constant
$\overset{\times}{\approx}$         asymptotically or approximately proportional
$i,j\in\mathbb{N}$                  natural number "feature"
$t,n,m\in\mathbb{N}$                time/sample index
$N\in\mathbb{N}$                    sample size
$\theta_i$                          probability of feature $i$
$A$                                 tabular learning algorithm
$h:\mathbb{N}\to\mathcal{Y}$        classifier, e.g. binary $\mathcal{Y}=\{0,1\}$
$f:\mathbb{R}\to\mathbb{R}$         theoretical data distribution/scaling, $f(i)=\theta_i$
$\mathcal{D}_n$                     data consisting of $n$ (feature,label) pairs
$E_n$                               instantaneous error of $A$ on $i_{n+1}$ predicting $y_{n+1}$ from $\mathcal{D}_n$
$\mathrm{EE}_n$                     expectation of instantaneous error $E_n$ w.r.t. $\mathcal{D}_n$
$\bar E_N$                          time-averaged error for $n=1,...,N$
$\alpha$                            exponent of Zipf-distributed data frequency
$\beta$                             exponent of power law for error as a function of data size
$\gamma$                            decay rate for exponential data distribution
More Figures

[Figure 4: four panels plotting Error[n] against $n,N$ (legend entries: E[n], $\bar E$[n], EA[n], $\bar A$[n], ESum[n], Sum[n]); first panel a single run, remaining panels averages over runs.]
Figure 4: (Learning Curves) for uniform data distribution $\mathbf{P}[i_n=i]=\theta_i=1/m$ for $i\le m=10$, averaged over $k=1,\dots$ runs.
[Figure 5: four panels plotting Error[n] against $n,N$ for $0\le n\le 50$, with the same legend entries as Figure 4; first panel a single run, remaining panels averages over runs.]
Figure 5: (Learning Curves) for Zipf-distributed data $\mathbf{P}[i_n=i]=\theta_i\propto i^{-(\alpha+1)}$ for $\alpha=1$, averaged over $k=1,\dots$ runs.

[Figure 6 legend entries: power-law fits $9.18\,(n+1)^{-0.479}$, $2.03\,(n+1)^{-0.246}$, $4.88\,(n+1)^{-0.339}$, $1.86\,(n+1)^{-0.191}$, and cumulative fits $4.88\,(n+1)^{0.661}$, $1.86\,(n+1)^{0.809}$.]
Figure 6: (Word-Frequency in Text File, Learning Curve, Power Law) (left) Relative (left scale) and absolute (right scale) frequency of words among the first 20469 words of file 'book1' of the Calgary Corpus, and fitted Zipf law, which is a straight line in the log-log plot. (right) Power-law fit to the learning curve for a word-classification task on this data set. For large $n$, low-frequency words break the Zipf law and hence the power law. The solid line is the same as in Figure 3 (right), fit to the reliable region $n\le\dots$
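A sketch of this kind of experiment is below, under assumptions: any plain-text file can stand in for 'book1' (the filename is a placeholder), the classifier is the memorizing word predictor with 0-1 error, and the power law is fitted by least squares in log-log space over the lower half of the range as a stand-in for the "reliable region":

```python
import re
import numpy as np

# Empirical learning curve of the memorizing word predictor on a text file,
# with a least-squares power-law fit c*(n+1)^(-beta) in log-log coordinates.
# "book1" is a placeholder filename; any plain-text file works.
words = re.findall(r"[a-z]+", open("book1", encoding="latin-1").read().lower())

seen, err = set(), []
for w in words:                        # error = 1 iff the word was never seen
    err.append(w not in seen)
    seen.add(w)

N = len(err)
n = np.arange(1, N + 1)
avg = np.cumsum(err) / n               # time-averaged error Ebar_n
mask = n <= N // 2                     # fit only the lower, "reliable" region
b, log_c = np.polyfit(np.log(n[mask] + 1), np.log(avg[mask]), 1)
print(f"fit: {np.exp(log_c):.2f} * (n+1)^{b:.3f}")
```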