Finite size corrections for neural network Gaussian processes
Joseph M. Antognini¹

¹ Whisper AI, San Francisco, California. Correspondence to: Joseph M. Antognini <[email protected]>.

Presented at the ICML 2019 Workshop on Theoretical Physics for Deep Learning. Copyright 2019 by the author(s).

Abstract
There has been a recent surge of interest in modeling neural networks (NNs) as Gaussian processes. In the limit of a NN of infinite width the NN becomes equivalent to a Gaussian process. Here we demonstrate that for an ensemble of large, finite, fully connected networks with a single hidden layer the distribution of outputs at initialization is well described by a Gaussian perturbed by the fourth Hermite polynomial for weights drawn from a symmetric distribution. We show that the scale of the perturbation is inversely proportional to the number of units in the NN and that higher order terms decay more rapidly, thereby recovering the Edgeworth expansion. We conclude by observing that understanding how this perturbation changes under training would reveal the regimes in which the Gaussian process framework is valid to model NN behavior.
1. Introduction
Today it is well known that there is a deep connection between modern, highly overparameterized neural networks (NNs) and Gaussian processes. A foundational result in the field by Neal (1996) and Williams (1997) demonstrated that a randomly initialized NN with a single hidden layer of infinite width is identical to a Gaussian process so long as the weights of the NN are drawn from a distribution with finite variance. Although the covariance between different hidden units of the NN is zero, these works showed that the covariance between a single hidden unit with different inputs is non-zero, thereby making learning possible.

Although Neal (1996) and Williams (1997) only studied the case of NNs with a single hidden layer, several recent works have extended this insight to show that NNs with multiple, possibly convolutional, layers of infinite width are also Gaussian processes (Lee et al., 2017; Matthews et al., 2018; Novak et al., 2019; Garriga-Alonso et al., 2018). And while these works only studied NNs at initialization, Jacot et al. (2018) showed that gradient descent on a NN corresponds to applying a tangent kernel to an equivalent Gaussian process, and Arora et al. (2019) have recently weakened the conditions on this proof. Lee et al. (2019) empirically showed that application of the tangent kernel closely matches the predicted dynamics of a linearized deep NN.

Although NNs have grown dramatically in size, they have remained frustratingly finite. In practice, most practitioners tend to choose widths in the range 128–1024. Especially as deep learning models have moved onto mobile devices, there has been substantial effort to compress NNs so that they can perform inference quickly and efficiently on low-power devices (e.g., Hinton et al., 2015; Han et al., 2015; Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019; Tan & Le, 2019).

In this paper we attempt to bridge the divide between the infinite NNs of theory and the merely large NNs of practice. We show that the Edgeworth expansion is a useful tool to describe the distribution of NN outputs for large NN ensembles. In particular, the distribution of outputs for an ensemble of large, finite NNs is not Gaussian, but is instead a Gaussian perturbed by the fourth Hermite polynomial, and the magnitude of this perturbation is inversely proportional to the number of hidden units. The author believes that understanding the stability of this perturbation under training will be a useful way to assess the relevance of the NN Gaussian process framework for studying the long-term dynamics of NN training.
2. Derivation of the Gaussian process limit using the renormalization group
Let us consider a NN with a single hidden layer consisting of N units. For simplicity of notation we shall restrict ourselves to the case where both the input, x, and the output, y, are one-dimensional. The more general case of a multi-dimensional input does not change the derivation, and as Neal (1996) notes, when the output is multi-dimensional the different output dimensions are independent. We write the input-to-hidden weights as u_i, and the hidden-to-output weights as v_i. We will omit bias terms since biases are commonly initialized with zeros; their inclusion clutters the math but does not change the results. The hidden activations are given by

$$h_i = f(u_i x), \qquad (1)$$

where f is the activation function. The output of the NN is given by

$$y = \sum_{i=1}^{N} v_i h_i. \qquad (2)$$

Let us suppose that the weights are initialized by an i.i.d. sample from a probability distribution (not necessarily the same for different layers). At initialization the parameters u, v, and b are generally sampled independently from a set of probability distributions with finite variance.¹ Most practitioners today use Glorot uniform initialization (Glorot & Bengio, 2010), and this is the default in popular frameworks like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017).

¹ An important exception is orthogonal initialization (Saxe et al., 2013), which places an orthogonality constraint on the weight matrices, thereby causing the individual parameters to lose their independence.

As Neal (1996) observes, because the u_i are all independent of one another, the covariance between any two h_i must be zero for fixed x.² However, the covariance of a fixed hidden unit for two different inputs is, in general, non-zero. This covariance can be written as a kernel, C(x, x′), which, in the Gaussian process limit, expresses the entire state of the Gaussian process. Note however that the calculation of the covariance is independent of the number of hidden units. The requirement that the number of units in the hidden layer go to infinity is necessary for the distribution of outputs to be Gaussian. If the number of hidden units is finite, the covariance remains the same, but the output distribution may be different.

² And, because the v_i are all independent of one another, the covariance between two different output dimensions must be zero for fixed x as well (thereby justifying our study of the case of a single output dimension). Since this is simply a statement about expectations and does not require taking the limit N → ∞, it holds for NNs of both infinite and finite width. The N → ∞ limit was only required by Neal (1996) to guarantee a Gaussian distribution; it does not change the covariance, assuming that the weights are drawn from a distribution that scales inversely with N (which is required to keep the output finite in the limit of infinite N).

To determine the effect of a large, but finite, number of units in the hidden layer we use the renormalization group to recover the Edgeworth expansion. There are a number of works which derive the central limit theorem and Edgeworth expansion using the renormalization group (e.g., Anshelevich, 1999; Jona-Lasinio, 2001; Calvo et al., 2010), and here we follow the approach in Sethna (2006, pp. 291–3). The renormalization group analysis consists of four steps:

1. Coarse-grain.
2. Renormalize.
3. Find the fixed point of the renormalization transformation.
4. Linearize about this fixed point.
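Before walking through these steps it may help to make the setup concrete. The following is a minimal sketch (not the author's code; the helper name sample_outputs and all parameter values are illustrative) of the ensemble described by Eqs. 1–2: a single hidden layer of N ReLU units with Glorot uniform weights, zero biases, and a fixed scalar input.

```python
import numpy as np

def sample_outputs(n_nets, n_hidden, x=1.0, seed=0):
    """Sample y = sum_i v_i f(u_i x) (Eqs. 1-2) for an ensemble of
    independently initialized single-hidden-layer ReLU networks."""
    rng = np.random.default_rng(seed)
    # Glorot uniform limit: sqrt(6 / (fan_in + fan_out)).
    lim_u = np.sqrt(6.0 / (1 + n_hidden))   # input-to-hidden layer
    lim_v = np.sqrt(6.0 / (n_hidden + 1))   # hidden-to-output layer
    u = rng.uniform(-lim_u, lim_u, size=(n_nets, n_hidden))
    v = rng.uniform(-lim_v, lim_v, size=(n_nets, n_hidden))
    h = np.maximum(u * x, 0.0)              # Eq. 1 with f = ReLU
    return (v * h).sum(axis=1)              # Eq. 2

y = sample_outputs(n_nets=100_000, n_hidden=128)
print(y.mean(), y.std())
```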
To coarse-grain we observe that the output of the NN is given by the sum over all hidden units in Eq. 2. For a fixed input, the value of each activation may be considered to be a sample from some probability distribution p_h(h; x). Here we write h as an argument of the probability distribution function and treat the input x as a parameter since it affects the shape of the distribution but is not a random variate itself. Similarly, the value of v_i is given by a sample from p_v(v; x). Since the samples u_i are independent, the h_i are independent as well, and the output is given by the sum of the products of samples from the two probability distributions p_h(h; x) and p_v(v; x). This product of h_i and v_i can be considered to be a sample from a third probability distribution p_{hv}(hv; x). We now replace each pair of hidden units with a single hidden unit:

$$(hv)'_i = (hv)_i + (hv)_{i+1}. \qquad (3)$$

The probability distribution of the transformed (hv)′ is simply the convolution of the probability distribution p_{hv}(hv) with itself. It is easier to represent this transformation in Fourier space because the convolution becomes multiplication:

$$\tilde{p}'_{hv}(k; x) = \tilde{p}_{hv}(k; x)^2, \qquad (4)$$

where we use a tilde to represent the Fourier transformed distribution, and k to represent the frequency domain of the product hv.

In the next step we renormalize the transformed probability distribution so that it becomes self-similar to the original probability distribution. Specifically, when we add two random variates, the standard deviation of the result is larger by the factor $\sqrt{2}$. We therefore need to rescale the new probability distribution so that it has the same variance as the original. (We are assuming that the probability distributions we are working with are centered, so the mean does not need to be transformed.) The rescaled probability distribution is then $\sqrt{2}\, p'_{hv}(\sqrt{2}\, hv; x)$, where the prefactor is required to normalize the rescaled probability distribution. The final renormalization operator is therefore

$$\mathcal{R}[p_{hv}(hv; x)] \equiv \sqrt{2}\, p_{hv}(\sqrt{2}\, hv; x) * \sqrt{2}\, p_{hv}(\sqrt{2}\, hv; x) \qquad (5)$$
$$= \mathcal{F}^{-1}\!\left[\tilde{p}_{hv}(k/\sqrt{2}; x)^2\right]. \qquad (6)$$

At the fixed point of the renormalization transformation, successive applications of the transformation do not change the distribution, so we have

$$p^*_y(y; x) = \mathcal{R}[p^*_y(y; x)], \qquad (7)$$

where we use an asterisk to represent the fixed point and we now identify the sum of the h_i v_i from the repeated application of the transformation as the NN output y. Taking the Fourier transform we have

$$\tilde{p}^*_y(k; x) = \tilde{p}^*_y(k/\sqrt{2}; x)^2. \qquad (8)$$

The solution to this equation is the Gaussian distribution,

$$\tilde{p}^*_y(k; x) = \mathcal{N}(k; 0, \sigma) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} e^{-k^2/(2\sigma^2)}, \qquad (9)$$

where σ is the standard deviation of the original distribution and is a function of the input x. (Due to the renormalization transformation we constrain σ to remain fixed for fixed x.) This is just a restatement of the fact that the Gaussian distribution is the stable distribution for the family of distributions with finite variance (Feller, 1966, § VIII.4).
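The operator R is easy to illustrate numerically. The sketch below (an illustration, not part of the paper's experiments) works with samples rather than densities: pairing up variates as in Eq. 3 and dividing by √2 preserves the variance while driving any centered, finite-variance starting distribution toward the Gaussian fixed point. The uniform starting distribution is an arbitrary stand-in for p_{hv}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Start far from Gaussian: a centered uniform distribution
# (excess kurtosis -1.2) stands in for p_hv.
s = rng.uniform(-1.0, 1.0, size=2**22)

for step in range(1, 9):
    s = (s[0::2] + s[1::2]) / np.sqrt(2.0)  # coarse-grain (Eq. 3) and rescale
    print(step, s.std(), stats.kurtosis(s))
```

The standard deviation stays fixed while the excess kurtosis halves with each application, which is exactly the λ₄ = 2^{−1} eigenvalue that appears in the linearized analysis below.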
Let us now consider a probability distribution which is close to a Gaussian distribution, but is not exactly Gaussian. We can represent this distribution as $p_y(y) = p^*_y(y) + \epsilon \phi(y)$ for some small ε and arbitrary function φ(y). We can then linearize the renormalization transformation by finding its eigenvalues and eigenfunctions. These must satisfy the relationship

$$\mathcal{R}[p^*_y(y; x) + \epsilon \phi(y; x)] \simeq p^*_y(y; x) + \sum_{n=0}^{\infty} \lambda_n \epsilon \phi_n(y; x), \qquad (10)$$

where we drop terms of order ε² and higher. Substituting the renormalization transformation and taking the Fourier transform, we find

$$\tilde{\phi}_n(k; x) = \frac{1}{\lambda_n \sigma} \sqrt{\frac{2}{\pi}}\, e^{-k^2/(4\sigma^2)}\, \tilde{\phi}_n\!\left(\frac{k}{\sqrt{2}}; x\right). \qquad (11)$$

The set of eigenfunctions that satisfy this relationship in Fourier space is given by

$$\tilde{\phi}_n(k; x) = (ik)^n \mathcal{N}(k; 0, \sigma). \qquad (12)$$

Taking the inverse Fourier transform we find that the eigenfunctions are Hermite polynomials multiplied by a Gaussian:

$$\phi_n(y) = H_n(y)\, \mathcal{N}(y; 0, \sigma), \qquad (13)$$

where $H_n(x) \equiv (x - D)^n \cdot 1$, with D being the differential operator. From these eigenfunctions we find that the eigenvalues are

$$\lambda_n = 2^{1 - n/2}. \qquad (14)$$

Note that the first two eigenvalues are relevant, and the third is marginal. This is a consequence of the fact that as the width of the NN tends to infinity, the resulting output distribution must remain normalized and have a fixed mean and standard deviation. The rest of the eigenvalues are irrelevant, however, due to the fact that in the infinite width limit, the higher order moments must tend to the values of a Gaussian.

Now, since we have N units in the hidden layer of the NN, we will need to apply the renormalization transformation log₂ N times. The eigenvalues for the repeated transformation will therefore be

$$\lambda_n^{\log_2 N} = N^{1 - n/2}. \qquad (15)$$

We can now write out the probability distribution for the NN output as

$$p_y(y) = p^*_y(y) + c_3(x)\, \lambda_3^{\log_2 N} \phi_3(y; x) + c_4(x)\, \lambda_4^{\log_2 N} \phi_4(y; x) + \cdots \qquad (16)$$
$$= \mathcal{N}(y; 0, \sigma) \times \left(1 + c_3(x) N^{-1/2} H_3(y) + c_4(x) N^{-1} H_4(y) + \cdots\right), \qquad (17)$$

where we explicitly represent the constants c_i as being functions of the input x (because they are determined by the moments of the original probability distribution p_y(y; x)), and expect them to be of order unity in general. This result is a recovery of the Edgeworth expansion, but in this case it is the expansion of a stochastic process rather than a probability distribution because it is parameterized by the input, x (e.g., Juszkiewicz et al., 1995). Given the distribution p_y(y; x) the constants c_i can be determined exactly from a more rigorous derivation (e.g., Hansen, 2006). The Edgeworth expansion is an asymptotic series, so for fixed n the series converges as N → ∞, but the series itself diverges as n → ∞. For large N, then, we may drop the c₅ term and higher.³

³ In the infinite width limit the joint output distribution of two or more different inputs is given by a multivariate Gaussian distribution. In the finite width case we expect the joint distribution of two or more different inputs to be given by the multivariate Edgeworth expansion, although we do not show it here. See Sellentin et al. (2017) for a derivation of the multivariate Edgeworth expansion.
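The polynomials $H_n(x) \equiv (x - D)^n \cdot 1$ are the probabilists' Hermite polynomials. As a quick sanity check (a sketch; NumPy's hermite_e module implements exactly this family), the two polynomials that drive the corrections below have the standard closed forms:

```python
import numpy as np
from numpy.polynomial import hermite_e

z = np.linspace(-3.0, 3.0, 13)
# Coefficient vectors select the pure polynomials He_3 and He_4.
he3 = hermite_e.hermeval(z, [0, 0, 0, 1])
he4 = hermite_e.hermeval(z, [0, 0, 0, 0, 1])
print(np.allclose(he3, z**3 - 3*z))          # He_3(z) = z^3 - 3z
print(np.allclose(he4, z**4 - 6*z**2 + 3))   # He_4(z) = z^4 - 6z^2 + 3
```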
Now, it is generally the case that the probability distributions used to initialize NNs are symmetric and so have zero skew. This symmetry forces the first irrelevant eigenfunction (given by the third Hermite polynomial) to make no contribution, so that c₃ = 0. This implies that for the probability distributions typically used to initialize NNs we have

$$p_y(y) \simeq \frac{1}{\sqrt{2\pi}\,\sigma} e^{-y^2/(2\sigma^2)} \times \left[1 - \frac{c_4(x)}{N}\left(3 - 6\left(\frac{y}{\sigma}\right)^2 + \left(\frac{y}{\sigma}\right)^4\right)\right]. \qquad (18)$$
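For concreteness, a minimal sketch evaluating the perturbed density of Eq. 18 is shown below; the values chosen for sigma, c4, and the width are illustrative placeholders, not measured quantities.

```python
import numpy as np

def p_y(y, sigma, c4, n_hidden):
    """Evaluate the perturbed output density of Eq. 18."""
    z = y / sigma
    gauss = np.exp(-z**2 / 2.0) / (np.sqrt(2.0 * np.pi) * sigma)
    # The 1/N correction multiplies the fourth Hermite polynomial H4(y/sigma).
    return gauss * (1.0 - (c4 / n_hidden) * (3.0 - 6.0 * z**2 + z**4))

y = np.linspace(-0.5, 0.5, 11)
print(p_y(y, sigma=0.12, c4=0.01, n_hidden=128))
```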
3. Experiments
To verify the correctness of Eq. 18 we generate an ensemble of randomly initialized NNs. Each NN has a single hidden layer with 128 units and a ReLU activation. The NNs are initialized using the default behavior of layers in TensorFlow, with the weights drawn from the Glorot uniform distribution and the biases set to zero. We then observe the distribution of outputs for a fixed input of x = 1. The resulting distribution should be close to, but not exactly, Gaussian. To measure the deviation from Gaussianity we calculate the empirical cumulative distribution function (CDF) of the outputs and subtract the CDF of a Gaussian distribution with the same variance as the outputs. Calculating the CDF of p_y(y) in Eq. 18 and subtracting off the CDF of a normal distribution, we find that the difference is given by the third Hermite polynomial times a Gaussian:

$$\int_{-\infty}^{y} \left[ p_y(y') - \mathcal{N}(y'; 0, \sigma) \right] dy' = \frac{c_4}{N}\, \mathcal{N}(y; 0, \sigma)\, H_3(y). \qquad (19)$$

We compare the empirical CDF with the predicted CDF in Fig. 1 and find excellent agreement. From this empirical CDF we measure the best-fit value of c₄.

From Eq. 18 we expect the magnitude of the deviation from a Gaussian to scale inversely with the number of hidden units, N. To test this prediction we generate a set of ensembles of randomly initialized NNs for a range of N. We vary N between 8 and 148, and for each N we use an ensemble of NNs. As before, the weights are initialized from a Glorot uniform distribution, the biases are set to zero, and a ReLU activation is used. The input is fixed to x = 1 and the distribution of output values is collected across the ensemble. We then calculate the difference between the empirical CDF and a Gaussian CDF (with the variance of the Gaussian set to be the observed variance of the outputs). This difference is expected to be the third Hermite polynomial times a Gaussian times a scaling factor α, and we measure the best fit for α:

$$\int_{-\infty}^{y} \left[ p_y(y') - \mathcal{N}(y'; 0, \sigma) \right] dy' = \alpha\, \mathcal{N}(y; 0, \sigma)\, H_3(y). \qquad (20)$$

Figure 1.
The difference between the empirical CDF of the output of an ensemble of randomly initialized NNs and the CDF of a Gaussian (solid blue line). The predicted difference from Eq. 19 is shown as the dashed orange line. The empirical CDF was calculated for NNs each with a single hidden layer consisting of 128 units for a fixed input of x = 1.

The results for the measured values of α are shown in Fig. 2. We find that α scales approximately with the inverse of N as expected, except for small values of N, where the dependence is slightly steeper. The best fit is a power law slightly steeper than α ∝ N⁻¹.
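A minimal sketch of this measurement procedure is given below. It reuses the hypothetical sample_outputs helper from the sketch in Section 2; taking the dimensionless argument y/σ inside H₃ and fitting α by least squares are assumptions about details the text does not spell out.

```python
import numpy as np
from scipy import stats

def fit_alpha(y_samples):
    """Least-squares amplitude of the H3-times-Gaussian shape (Eq. 20)."""
    y = np.sort(y_samples)
    sigma = y.std()
    ecdf = np.arange(1, y.size + 1) / y.size
    diff = ecdf - stats.norm.cdf(y, scale=sigma)   # left-hand side of Eq. 20
    z = y / sigma
    shape = stats.norm.pdf(y, scale=sigma) * (z**3 - 3.0 * z)
    return shape @ diff / (shape @ shape)

for n_hidden in (8, 32, 128):
    y = sample_outputs(n_nets=100_000, n_hidden=n_hidden, seed=n_hidden)
    print(n_hidden, fit_alpha(y))
```

Each fitted α can then be regressed against N on a log-log scale to extract the scaling exponent.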
4. Discussion
Jacot et al. (2018) have shown that if an infinitely wide NN is trained on a mean squared error loss then the NN remains a Gaussian process throughout training. However, NNs used in practice are not infinitely wide and therefore are not Gaussian. Nevertheless, this analysis shows that the perturbation away from a Gaussian can be quantified and scales approximately inversely with N. It is therefore interesting to ask what happens to the perturbation when a NN is trained. Does it shrink over the course of training, stay the same magnitude, or increase?

In the first case we should expect that Gaussian processes will be a powerful framework to understand NNs because even if a typical NN is not quite a Gaussian process at the beginning of training, it will soon become one. In the second case, Gaussian processes will still be a useful framework, but with the understanding that the distributions are in fact perturbed Gaussian distributions rather than exactly Gaussian. If the final case holds, Gaussian processes will be a useful framework to understand the early stages of training, but will become progressively worse. Depending on how great the deviation from Gaussianity becomes, the Gaussian process framework may fail to describe the training dynamics entirely at a certain point in training. (If this occurs then we would be able to identify the point at which this perturbation diverges as a phase transition in the NN.) Based on the empirical success of Gaussian processes at predicting the trajectories of NNs of finite sizes in Lee et al. (2019), it is the opinion of the author that the output distribution is unlikely to stray too far from a Gaussian over the course of training. However, rigorously proving this to be the case is a more complicated problem than analyzing the perturbation from a Gaussian at initialization and is beyond the scope of this paper.
Figure 2.
The scaling of the perturbation of the distribution from a Gaussian with the number of hidden units. Each point is the best fit for α of the difference between the empirical CDF and the CDF of the third Hermite polynomial times a Gaussian (see Eq. 20). The predicted N⁻¹ scaling is shown with the dashed blue line and the best-fit scaling is shown with the dotted red line.

Acknowledgments
The author is grateful to Michael Flynn, Kevin Quetano, Josh Schneier, and Richard Zhu for helpful discussions and suggestions.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283, 2016.

Anshelevich, M. The linearization of the central limit operator in free probability theory. Probability Theory and Related Fields, 115(3):401–416, 1999.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Calvo, I., Cuchí, J. C., Esteve, J. G., and Falceto, F. Generalized central limit theorem and renormalization group. Journal of Statistical Physics, 141(3):409–421, 2010.

Feller, W. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, 1966.

Garriga-Alonso, A., Aitchison, L., and Rasmussen, C. E. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

Hansen, E. Edgeworth expansions, 2006. URL http://web.math.ku.dk/~erhansen/bootstrap_05/doku/noter/Edgeworth_24_01.pdf.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244, 2019.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Jona-Lasinio, G. Renormalization group and probability theory. Physics Reports, 352(4-6):439–458, 2001.

Juszkiewicz, R., Weinberg, D. H., Amsterdamski, P., Chodorowski, M., and Bouchet, F. Weakly nonlinear Gaussian fluctuations and the Edgeworth expansion. The Astrophysical Journal, 442:39–56, March 1995. doi: 10.1086/175420.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

Neal, R. M. Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Springer, 1996.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop, 2017. URL https://openreview.net/forum?id=BJJsrmfCZ.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Sellentin, E., Jaffe, A. H., and Heavens, A. F. On the use of the Edgeworth expansion in cosmology I: How to foresee and evade its pitfalls. arXiv e-prints, September 2017.

Sethna, J. Statistical Mechanics: Entropy, Order Parameters, and Complexity, volume 14. Oxford University Press, 2006.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Williams, C. K. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.