On the infinite width limit of neural networks with a standard parameterization
Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee
{jaschasd, romann, schsam, jaehlee}@google.com

April 21, 2020

ABSTRACT
There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel, while the NTK parameterization fails to capture crucial aspects of finite width networks such as: the dependence of training dynamics on relative layer widths, the relative training dynamics of weights and biases, and the overall learning rate scale. Here we propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity and yields a well-defined neural tangent kernel. We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization, but with better correspondence to the parameterization of typical finite width networks. Additionally, with careful tuning of width parameters, the improved standard parameterization kernels can outperform those stemming from an NTK parameterization. We release code implementing this improved standard parameterization as part of the Neural Tangents library [24] at https://github.com/google/neural-tangents.

Infinite width Bayesian [21, 15, 17, 18, 23, 9, 7, 28, 29, 6] and gradient descent trained [12, 16, 5, 28, 13, 8, 3, 1, 2, 26] neural networks are an area of active and extremely promising work. There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks: the NTK parameterization [12, §2], and the naive standard parameterization [25, §2.1]; [10, 11]. (Another line of work applies a different scaling and derives non-fixed infinite width kernels [20, 19, 4, 22].) However, the extrapolations of both of these parameterizations to infinite width fail to capture crucial aspects of finite width networks:

• In finite width networks, differences in relative layer widths can have a profound effect on training dynamics. Under the NTK parameterization, as layer width goes to infinity, relative layer width has no effect on training dynamics or predictions.

• As the naive standard parameterization is extended to large widths, the largest stable learning rate scales like $1/\text{width}$ [14, Theorem 7]; [25, §H]. A learning rate that goes to zero as width goes to infinity poses a variety of practical and theoretical challenges, including a neural tangent kernel with entries that diverge to infinity.

• At finite width, convolutional networks with an NTK parameterization have been reported to generalize more poorly than those with a standard parameterization [25, §I] (though we do not consistently reproduce this relationship in our own experiments; see Figure 3).

• For neither the NTK nor the naive standard parameterization do infinite width learning rates agree closely with those typically used to train finite width standard parameterization networks.

• The relative learning dynamics of bias and weight parameters are different in the NTK parameterization than they are for a standard parameterization finite width network.

In this note we propose an improved extrapolation of the standard parameterization to infinite width that resolves these inconsistencies while simultaneously leading to a well-defined neural tangent kernel.
| Parameterization | Standard (naive) | NTK | Standard (improved) |
|---|---|---|---|
| Layer equation, $x^{l+1} =$ | $W^l x^l + b^l$ | $\frac{\sigma_w}{\sqrt{sN^l}} W^l x^l + \sigma_b b^l$ | $\frac{1}{\sqrt{s}} W^l x^l + b^l$ |
| Weight shape, $W^l \in$ | $\mathbb{R}^{sN^{l+1} \times sN^l}$ | $\mathbb{R}^{sN^{l+1} \times sN^l}$ | $\mathbb{R}^{sN^{l+1} \times sN^l}$ |
| $W$ initialization, $W^l_{ij} \sim$ | $\mathcal{N}\!\left(0, \frac{\sigma_w^2}{sN^l}\right)$ | $\mathcal{N}(0, 1)$ | $\mathcal{N}\!\left(0, \frac{\sigma_w^2}{N^l}\right)$ |
| $b$ initialization, $b^l_i \sim$ | $\mathcal{N}(0, \sigma_b^2)$ | $\mathcal{N}(0, 1)$ | $\mathcal{N}(0, \sigma_b^2)$ |
| NNGP, $s \to \infty$, $K^{l+1} =$ | $\sigma_w^2 K^l + \sigma_b^2$ | $\sigma_w^2 K^l + \sigma_b^2$ | $\sigma_w^2 K^l + \sigma_b^2$ |
| NTK, $s \to \infty$, $\Theta^{l+1} =$ | diverges | $\sigma_w^2 K^l + \sigma_b^2 + \sigma_w^2 \Theta^l$ | $N^l K^l + 1 + \sigma_w^2 \Theta^l$ |

Table 1: Equations describing a fully connected layer for each parameterization, both for a finite width network and for the corresponding infinite width NNGP and NT kernels. Here $N^l$ is the baseline (finite network) width of layer $l$, and $s$ is a width-scaling factor that is taken to $\infty$ for infinite width networks.
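As a concrete illustration, the affine-layer kernel recursions in Table 1 can be written as the following minimal NumPy sketch. This is a hand-written illustration rather than the Neural Tangents implementation; the function name `kernel_step`, its arguments, and the numeric values in the example are our own assumptions, and the transformation of $K$ and $\Theta$ by the nonlinearity between affine layers is omitted.

```python
import numpy as np


def kernel_step(K, Theta, sigma_w2, sigma_b2, N_l, parameterization="standard_improved"):
    """One affine layer's update of the NNGP kernel K and the NTK Theta, as s -> infinity."""
    # NNGP update; identical for the NTK and improved standard parameterizations.
    K_next = sigma_w2 * K + sigma_b2
    if parameterization == "ntk":
        Theta_next = sigma_w2 * K + sigma_b2 + sigma_w2 * Theta
    elif parameterization == "standard_improved":
        # Weights contribute N_l * K (scales with the base width), biases contribute 1.
        Theta_next = N_l * K + 1.0 + sigma_w2 * Theta
    else:
        # Naive standard parameterization: the NTK diverges as s -> infinity.
        raise ValueError("unsupported parameterization: " + str(parameterization))
    return K_next, Theta_next


# Example: push a toy 4x4 input kernel through three affine layers with base widths 512, 512, 256.
K, Theta = np.ones((4, 4)), np.zeros((4, 4))
for N_l in [512, 512, 256]:
    K, Theta = kernel_step(K, Theta, sigma_w2=2.0, sigma_b2=0.1, N_l=N_l)
print(Theta)
```

Under the improved standard parameterization the dependence on the base widths $N^l$ survives the $s \to \infty$ limit, which is what allows relative layer widths to influence the resulting kernel.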
| Parameterization | Standard (naive) | NTK | Standard (improved) |
|---|---|---|---|
| Layer equation, $x^{l+1}_{i,p} =$ | $W^l_{i,j,m} x^l_{j,p+m} + b^l_i$ | $\frac{\sigma_w}{\sqrt{sN^l M}} W^l_{i,j,m} x^l_{j,p+m} + \sigma_b b^l_i$ | $\frac{1}{\sqrt{s}} W^l_{i,j,m} x^l_{j,p+m} + b^l_i$ |
| Weight shape, $W^l \in$ | $\mathbb{R}^{sN^{l+1} \times sN^l \times M}$ | $\mathbb{R}^{sN^{l+1} \times sN^l \times M}$ | $\mathbb{R}^{sN^{l+1} \times sN^l \times M}$ |
| $W$ initialization, $W^l_{ijm} \sim$ | $\mathcal{N}\!\left(0, \frac{\sigma_w^2}{sN^l M}\right)$ | $\mathcal{N}(0, 1)$ | $\mathcal{N}\!\left(0, \frac{\sigma_w^2}{N^l M}\right)$ |
| $b$ initialization, $b^l_i \sim$ | $\mathcal{N}(0, \sigma_b^2)$ | $\mathcal{N}(0, 1)$ | $\mathcal{N}(0, \sigma_b^2)$ |
| NNGP, $s \to \infty$, $K^{l+1} =$ | $\sigma_w^2 \mathcal{A}(K^l) + \sigma_b^2$ | $\sigma_w^2 \mathcal{A}(K^l) + \sigma_b^2$ | $\sigma_w^2 \mathcal{A}(K^l) + \sigma_b^2$ |
| NTK, $s \to \infty$, $\Theta^{l+1} =$ | diverges | $\sigma_w^2 \mathcal{A}(K^l) + \sigma_b^2 + \sigma_w^2 \mathcal{A}(\Theta^l)$ | $N^l M \mathcal{A}(K^l) + 1 + \sigma_w^2 \mathcal{A}(\Theta^l)$ |

Table 2: Equations describing a convolutional layer for each parameterization, both for a finite width network and for the corresponding infinite width NNGP and NT kernels. We use Einstein notation for summation: indices that appear only in a single term are implicitly summed over. $M$ is the number of spatial positions in the convolution kernel, $m$ indexes spatial locations within the kernel, $p + m$ corresponds to input spatial location $p$ offset by $m$, $N^l$ is the baseline (finite network) channel count of layer $l$, $\mathcal{A}(\cdot)$ is the diagonal averaging operator defined in Xiao et al. [27] and Novak et al. [23, §2.2.1], and $s$ is a width-scaling factor that is taken to $\infty$ for infinite channel count networks.

In this improved parameterization, the resulting infinite width network maintains a learning rate scale that agrees with that used to train the original network, preserves the impact of relative layer widths on training dynamics for finite width networks, and similarly preserves the relative training dynamics of weights and biases.

Affine layers in neural networks are typically written as

$$z^{l+1} = W^l y^l + b^l \qquad (1)$$

where $z^l$ are pre-activations, $y^l = \phi(z^l)$ are activations, $W^l$ are weights, and $b^l$ are biases. To preserve the scale of the pre-activations as the width of the network, $N^l$, is varied, one typically initializes the weights as $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N^l)$ and biases as $b^l_i \sim \mathcal{N}(0, \sigma_b^2)$. However, as was noted in [12], this leads to divergent gradient flow dynamics as $N^l \to \infty$. In [12], the authors resolve this situation by using an alternative parameterization where affine layers are written as

$$z^{l+1} = \frac{\sigma_w}{\sqrt{N^l}} \omega^l y^l + \sigma_b b^l \qquad (2)$$

where $\omega^l_{ij} \sim \mathcal{N}(0, 1)$.
[Figure 1 plots: NTK Parameterization (x-axis) vs. Standard Parameterization (y-axis), with panels for Classification Error and Mean Squared Error; legend: FC, Conv-Vec, Conv-GAP, WRN-LN; lower panels: Random Widths FC.]
Figure 1: Infinite width networks with various architectures achieve similar error when using the improved standard parameterization or the NTK parameterization, while the improved standard parameterization better matches properties of typical finite width networks. Each point compares the neural tangent kernel prediction error for the same architecture on CIFAR-10, but using the NTK (x-axis) or improved standard (y-axis) parameterization. (Upper) Each point corresponds to varying the training set size, depth (for FC/Conv; a fixed number of 4 blocks is used for WRN), and width (hidden width for FC/Conv; widening factor for WRN). FC is a fully connected network with constant hidden width, and Conv-Vec / Conv-GAP correspond to constant channel count convolutional networks without / with global average pooling. WRN-LN is a Wide Residual Network with four residual blocks and with Batch Normalization layers replaced by Layer Normalization. (Lower) Each layer width of the fully connected architecture is sampled at random.

This leads to a well-behaved infinite-width limit, but involves a number of inconsistencies relative to standard neural networks. The core idea here is to write the width of the neural network in each layer in terms of an auxiliary parameter $s$, as $n^l = sN^l$. We then write an affine layer as

$$z^{l+1} = \frac{1}{\sqrt{s}} W^l y^l + b^l \qquad (3)$$

The infinite width limit can be taken by letting $s \to \infty$. The parameter variances $\sigma_w^2$, $\sigma_b^2$ and the original layer widths $N^l$ instead appear in the variance of the initializer (as is typically done for finite width networks). A complete set of equations describing an affine layer, and the corresponding infinite width kernels, for this parameterization is given in Tables 1 and 2, for fully connected and convolutional architectures respectively.

A formal proof of convergence of the improved standard parameterization to the specified kernels is beyond the scope of this short note. However, we observe that the proof technique in Lee et al. [16, Apps. F, G] applies with minimal modification. Additionally, Monte Carlo validation of the correctness of the introduced kernels is performed as part of the Neural Tangents [24] unit test suite.
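To make that Monte Carlo idea concrete, the following is a minimal NumPy sketch (a toy illustration of the same idea, not the Neural Tangents test suite) that samples a single affine layer under the improved standard parameterization and compares its empirical NNGP entry and its own parameters' NTK contribution against the corresponding Table 1 entries. All variable names and the particular values of $\sigma_w^2$, $\sigma_b^2$, $N^l$, and $s$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w2, sigma_b2 = 2.0, 0.1   # illustrative values
N_l = N_lp1 = 32                # base widths of layers l and l+1
s = 8                           # width-scaling factor

# Two fixed inputs to the layer; per-coordinate kernel entry K = x1 . x2 / (s * N_l).
x1 = rng.normal(size=s * N_l)
x2 = rng.normal(size=s * N_l)
K = x1 @ x2 / (s * N_l)

# NNGP entry: E_i[z_i(x1) z_i(x2)] should approach sigma_w^2 K + sigma_b^2 (Table 1).
samples = []
for _ in range(500):
    W = rng.normal(scale=np.sqrt(sigma_w2 / N_l), size=(s * N_lp1, s * N_l))
    b = rng.normal(scale=np.sqrt(sigma_b2), size=s * N_lp1)
    z1 = W @ x1 / np.sqrt(s) + b
    z2 = W @ x2 / np.sqrt(s) + b
    samples.append(np.mean(z1 * z2))
print("NNGP entry, Monte Carlo:", np.mean(samples),
      " analytic:", sigma_w2 * K + sigma_b2)

# This layer's own NTK contribution: dz_i/dW_ij = x_j / sqrt(s) and dz_i/db_i = 1,
# so the contribution is x1 . x2 / s + 1 = N^l K + 1, matching the Table 1 entry.
print("layer NTK contribution:", x1 @ x2 / s + 1.0,
      " analytic:", N_l * K + 1.0)
```

Because the derivative of the output with respect to the layer's own weights does not involve the weights themselves, the layer's NTK contribution matches $N^l K + 1$ exactly, while the NNGP entry matches $\sigma_w^2 K + \sigma_b^2$ only in expectation over initializations.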
[Figure 2 plots: Classification Error vs. Width, with panels for 400 Training Samples and 4000 Training Samples; legend: Standard param, NNGP, NTK param.]
Figure 2: For fully connected networks, the neural tangent kernel prediction for the improved standard parameterization can outperform the NTK parameterization, especially when the layer widths $N^l$ used in the standard parameterization are tuned. Experiments are performed on the CIFAR-10 dataset with networks with 5 hidden layers.

[Figure 3 plots: NTK Parameterization (x-axis) vs. Standard Parameterization (y-axis), with panels for Classification Error and Mean Squared Error; legend: FC, Conv-VEC, Conv-GAP.]
Figure 3: SGD trained finite width neural networks perform similarly when using the standard parameterization or the NTK parameterization. For all experiments, the network was trained with an MSE loss on the full CIFAR-10 dataset (45k/5k/10k split). Each point in FC corresponds to a different width, and each point in Conv-VEC and Conv-GAP corresponds to a different number of channels in {8, 11, 16, 23, 32, 45, 64, 90, 128, 181, 256, 362, 512}. All networks are ReLU networks with $\sigma_w^2 = 2.$ and $\sigma_b^2 = 0.$ They were trained with vanilla SGD without L2 regularization or data augmentation. A constant learning rate was grid searched over 20 log-spaced values within [0.01, 100]; for the standard parameterization, the learning rate is divided by $\max(N^l)$. FC networks were trained with batch size 1024 for 3,000 epochs, whereas Conv networks were trained with batch size 256 for 10,000 epochs.
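The learning rate sweep described in the Figure 3 caption can be sketched as follows. This is our own illustration of the described protocol, not the training code used for the experiments; the layer widths shown are placeholder values.

```python
import numpy as np

# 20 log-spaced candidate learning rates in [0.01, 100], as described in the caption.
learning_rates = np.logspace(np.log10(0.01), np.log10(100.0), num=20)

# For standard-parameterization networks, the rate is divided by the largest layer width.
layer_widths = [512, 512, 512]   # example widths, not the values swept in the paper
standard_parameterization = True
if standard_parameterization:
    learning_rates = learning_rates / max(layer_widths)

print(learning_rates)
```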
In this section, we study empirical properties of infinite and finite width networks stemming from both the NTK and improved standard parameterizations. All of the experiments in this section were done using the Neural Tangents library [24]. Here we focus on kernels corresponding to ReLU networks with $\sigma_w^2 = 2.$ and $\sigma_b^2 = 0.$

In Figure 1 we compare the predictions of kernels for pairs of identical networks, using either the improved standard or the NTK parameterization. We find that the performance of the kernels resulting from the two parameterizations is extremely similar, while the training dynamics of the improved standard parameterization network are expected to better match those of typical finite width networks. In Figure 2 we show that if the width parameter $N^l$ is carefully tuned, then the neural tangent kernel for a fully connected network using the improved standard parameterization can outperform the kernel for an NTK parameterized network. In Figure 3, we show that random finite width networks using the standard and NTK parameterizations perform similarly.

The analytic forms for the various kernels inspire some additional interesting observations:

• For the NTK parameterization, the kernel resulting from a Bayesian neural network and the kernel resulting from gradient descent training of the readout layer of an infinite width network are the same. For both the naive and improved standard parameterizations, however, the two differ.

• For neural networks with a standard parameterization, the magnitude of the contribution of the bias to the neural tangent kernel (and thus to learning dynamics) remains constant with increasing width. However, the contribution of the weights to the learning dynamics grows like $N^l$. We should thus expect that as networks become wide, the role played by the bias in training becomes less important.

In this note, we introduced an improved extrapolation of finite width networks to infinite width that better matches the parameterization and learning dynamics of typical finite width networks. It is our hope that this will enable theory and experiments with infinite width networks to better explain the behavior of practical finite width networks.

References
[1] Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019a). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.

[2] Arora, S., Du, S. S., Li, Z., Salakhutdinov, R., Wang, R., and Yu, D. (2019b). Harnessing the power of infinitely wide deep nets on small-data tasks.

[3] Bietti, A. and Mairal, J. (2019). On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173.

[4] Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046.

[5] Chizat, L., Oyallon, E., and Bach, F. (2019). On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956.

[6] De Palma, G., Kiani, B., and Lloyd, S. (2019). Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems, pages 1962–1974.

[7] Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. (2019). Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 5724–5734. Curran Associates, Inc.

[8] Dyer, E. and Gur-Ari, G. (2019). Asymptotics of wide networks from Feynman diagrams. arXiv preprint arXiv:1909.11304.

[9] Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations.

[10] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

[11] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

[12] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.

[13] Jacot, A., Gabriel, F., and Hongler, C. (2019). Freeze and chaos for DNNs: an NTK view of batch normalization, checkerboard and boundary effects. arXiv preprint arXiv:1907.05715.

[14] Karakida, R., Akaho, S., and Amari, S.-i. (2018). Universal statistics of Fisher information in deep neural networks: mean field approach. International Conference on Artificial Intelligence and Statistics.

[15] Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations.

[16] Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems.

[17] Matthews, A., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. (2018a). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.

[18] Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018b). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.

[19] Mei, S., Misiakiewicz, T., and Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. Annual Conference on Learning Theory.

[20] Mei, S., Montanari, A., and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671.

[21] Neal, R. M. (1994). Priors for infinite networks. Technical Report CRG-TR-94-1, University of Toronto.

[22] Nguyen, P.-M. (2019). Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880.

[23] Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes.

[24] Novak, R., Xiao, L., Hron, J., Lee, J., Sohl-Dickstein, J., and Schoenholz, S. S. (2020). Neural Tangents: Fast and easy infinite neural networks in Python. https://github.com/google/neural-tangents.

[25] Park, D. S., Sohl-Dickstein, J., Le, Q. V., and Smith, S. L. (2019). The effect of network width on stochastic gradient descent and generalization: an empirical study. In International Conference on Machine Learning.

[26] Schwartz-Ziv, R. and Alemi, A. A. (2019). Information in infinite ensembles of infinitely-wide neural networks. arXiv preprint arXiv:1911.09189.

[27] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning.

[28] Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

[29] Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. In