Why bigger is not always better: on finite and infinite neural networks
Laurence Aitchison
University of Bristol, Bristol, UK. Correspondence to: Laurence Aitchison <[email protected]>.

Abstract
Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lack a key facet of the behaviour of real neural networks: the fixed kernel, determined only by network hyperparameters, implies that they cannot do any form of representation learning. The lack of representation or, equivalently, kernel learning leads to less flexibility and hence worse performance, giving a potential explanation for the inferior performance of infinite networks observed in the literature (e.g. Novak et al., 2019). We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network. This motivates the introduction of a new class of network: infinite networks with bottlenecks, which inherit the theoretical tractability of infinite networks while at the same time allowing representation learning.

One approach to understanding and improving neural networks is to perform Bayesian inference in an infinitely wide network (Lee et al., 2018; Matthews et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019). In this limit the outputs become Gaussian process distributed, enabling efficient and exact reasoning about uncertainty, and giving a means of interpretation using the parameter-free kernel function (which depends only on network hyperparameters such as depth). However, the performance of Bayesian infinite networks lags considerably behind state-of-the-art finite networks trained using SGD (e.g. compare performance in Garriga-Alonso et al. (2019), Novak et al. (2019) and Arora et al. (2019) against He et al. (2016) and Chen et al. (2018)). This seems surprising, because, to our knowledge, there are no reports of wider networks degrading classification performance (indeed, the opposite is sometimes argued; see Zagoruyko & Komodakis, 2016), and because exact Bayesian inference is provably optimal if the prior accurately describes our beliefs (Ramsey, 1926). Indeed, recent work on the Neural Tangent Kernel (NTK) (Li et al., 2019) has suggested that deterministic gradient descent in an infinite network gives slightly lower performance than Bayesian inference in the same network.

Our hypothesis is that the poor performance of Bayesian infinite networks arises because the top-layer representation (equivalent to the kernel) is fixed by the network hyperparameters, and thus cannot be learned from data. This breaks many of our key intuitions about why deep networks are effective. For instance, in transfer learning (Huh et al., 2016) we use a large-scale dataset such as ImageNet to learn a good high-level representation, then apply this representation to other tasks where less data is available. However, transfer learning is impossible in infinite Bayesian neural networks, because the top-layer representation is fixed by the network hyperparameters and so cannot be learned using e.g.
ImageNet.

To understand these issues, we analysed finite networks using tools from the infinite network literature (Lee et al., 2018; Matthews et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019). We begin by giving a toy, two-layer example, contrasting the flexibility of finite networks with the inflexibility of infinite networks, showing that flexible finite networks offer benefits under conditions of model-mismatch. We then introduce infinite networks with bottlenecks, which combine the theoretical tractability of infinite networks with the flexibility of finite networks. To obtain an analytic understanding of kernel/representation flexibility and learning in such networks, we consider linear infinite networks with bottlenecks, which are equivalent to finite deep linear networks. We took two approaches to characterising these networks. First, we considered the prior viewpoint, i.e. the covariance in the top-layer kernel induced by randomness in the lower-layer weights. In particular, we showed that narrower, deeper networks offer more flexibility, and that CNNs offer more flexibility than locally connected networks (LCNs) when the input is spatially structured. Second, we considered the posterior viewpoint, showing that under both MAP inference and posterior sampling, the representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing the labels). We found an important difference between MAP inference and sampling: for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width. Bayesian networks behave similarly when the network width and the number of output channels are equal, but as the network width increases, the learned representations become increasingly dominated by the prior, and insensitive to the outputs. Remarkably, we find that in a ResNet trained using SGD on CIFAR-10, the representation differs dramatically from the corresponding infinite network and is instead very close to the output kernel, as suggested by our deep linear results. This confirms the importance of working with a theoretical model, such as infinite networks with bottlenecks, that is capable of capturing representation learning.
1. Toy Example
In the introduction, we noted that infinite Bayesian networks perform worse than standard neural networks trained using stochastic gradient descent. Thus, as we make finite neural networks wider, there should be some point at which performance begins to degrade. We considered a simple, two-layer, fully-connected linear network with the full set of inputs denoted $X$, hidden unit activations denoted $H$, and 10-dimensional outputs denoted $Y$,
$$H = X W, \qquad Y = H V + \sigma \Xi, \tag{1}$$
where $\Xi$ is IID standard Gaussian noise, $W$ is the input-to-hidden weight matrix and $V$ is the hidden-to-output weight matrix, whose columns, $w_\mu$ and $v_\nu$, are generated IID from,
$$P(w_\mu) = \mathcal{N}\left(w_\mu; 0, \tfrac{1}{X} I\right), \qquad P(v_\nu) = \mathcal{N}\left(v_\nu; 0, \tfrac{1}{H} I\right), \tag{2}$$
and where the variance of the weights is normalised by the number of inputs to that layer, $X = 4$ for the 4-dimensional input, and $H$ for the width of the hidden layer.

In the first example (Fig. 1 left), we generated targets for supervised learning using a second neural network with weights generated as described above, with $H_\text{gen} \in \{1, 2, 4\}$ hidden units. We evaluated the Bayesian model evidence for networks with many different numbers of hidden units (x-axis). Bayesian reasoning would suggest that the model evidence for the true model (i.e. with a matched number of hidden units) should be higher than the model evidence for any other model, as indeed we found (Fig. 1 top left), and these patterns held true for the predictive probability, or equivalently test performance (Fig. 1 bottom left). While these results give an example where smaller networks perform better, they do not necessarily help us to understand the behaviour of neural networks on real datasets, where the true generative process for the data is not known, and is not in our model class. As such, we considered two further examples where the neural network generating the targets lay outside of our model class. In particular, we again generated target outputs by sampling a "true" network from the prior, but we modified the inputs to this network, first by multiplying those inputs by 100 (Fig. 1 middle), then by zeroing-out all but the first input unit (Fig. 1 right). Critically, we ensured model-mismatch by putting the original, unmodified inputs into the trained networks. In both of these experiments, there was an optimum number of hidden units, after which performance degraded as more hidden units were included.

To understand why this might be the case, it is insightful to consider the methods we used to evaluate the model evidence and generate these results. In particular, note that, conditioned on $H$, the output for any given channel, $y_\nu$, is IID and depends only on the corresponding column of the output weights, $v_\nu$,
$$P(Y \mid V, H) = \prod_\nu P(y_\nu \mid v_\nu, H) = \prod_\nu \mathcal{N}\left(y_\nu; H v_\nu, \sigma^2 I\right). \tag{3}$$
Thus, we can integrate over the output weights, $v_\nu$, to obtain a distribution over $Y$ conditioned on $H$,
$$P(Y \mid H) = \prod_\nu P(y_\nu \mid H) = \prod_\nu \mathcal{N}\left(y_\nu; 0, \tfrac{1}{H} H H^T + \sigma^2 I\right). \tag{4}$$
This is the classical Gaussian process representation of Bayesian linear regression (Rasmussen & Williams, 2006). Remembering that the hidden activities, $H$, are a deterministic function of the weights, $W$, and inputs, $X$, we can write this distribution as,
$$P(y_\nu \mid H) = P(y_\nu \mid W, X) = \mathcal{N}\left(y_\nu; 0, \tfrac{1}{H} X W W^T X^T + \sigma^2 I\right). \tag{5}$$
Thus, the first-layer weights, $W$, act as kernel hyperparameters in a Gaussian process: they control the covariance of the outputs, $y_\nu$.
Figure 1. A toy fully-connected, two-layer Bayesian linear network showing situations in which smaller networks perform better than larger networks. The red dots indicate the optimal number of hidden units in that simulation. Left: training data generated from the prior with $H_\text{gen}$ hidden units. Middle: training data generated from the prior with $H_\text{gen} = 4$ but where we scale up the inputs by a factor of 100. Right: training data generated from the prior with $H_\text{gen} = 4$, but where we zero out all but the first input dimension. Top: Bayesian model evidence. Bottom: predictive log-probability, or equivalently test error.

To evaluate the model evidence we need to integrate over $W$,
$$P(Y \mid X) = \int dW\, P(W) \prod_\nu P(y_\nu \mid W, X) = \mathbb{E}_{P(W)}\left[\prod_\nu P(y_\nu \mid W, X)\right], \tag{6}$$
and we estimate this integral by drawing
64,000 samples from the prior,
$P(W)$. Importantly, while $W$ provides flexibility in the kernel in finite networks, this flexibility gradually disappears as we consider wider hidden layers. In particular,
$$\lim_{H \to \infty} \tfrac{1}{H} W W^T = \lim_{H \to \infty} \tfrac{1}{H} \sum_{\mu=1}^{H} w_\mu w_\mu^T = \mathbb{E}\left[w_\mu w_\mu^T\right] = \tfrac{1}{X} I. \tag{7}$$
Therefore, in this limit, the distribution over $Y$ converges to,
$$\lim_{H \to \infty} P(Y \mid X) = \prod_\nu \mathcal{N}\left(y_\nu; 0, \tfrac{1}{X} X X^T + \sigma^2 I\right). \tag{8}$$
This is exactly the distribution we would expect from Bayesian linear regression in a one-layer network. Thus, by taking the infinite limit, we have eliminated the additional flexibility afforded by the two-layer network, and we can see that the superior performance of smaller networks in Fig. 1 emerges because they give additional flexibility in the covariance of the outputs, which gradually disappears as network size increases. Finally, note that sampling from the prior works well here both because of the concentration result above, and because we use a relatively small amount of data, 20 points.
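To make this concrete, the sketch below shows how the Monte Carlo estimate of Eq. (6) can be computed. It is a minimal illustration rather than the code behind Fig. 1: the data-generating settings, noise level, helper name `log_evidence` and the (much smaller) sample count are assumptions made purely for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
P, X_dim, Y_dim, sigma2 = 20, 4, 10, 0.1      # 20 datapoints, 4 inputs, 10 outputs

# Toy data: targets drawn from a "true" two-layer linear network with H_gen hidden units.
H_gen = 4
X = rng.standard_normal((P, X_dim))
W_true = rng.standard_normal((X_dim, H_gen)) / np.sqrt(X_dim)   # prior of Eq. (2)
V_true = rng.standard_normal((H_gen, Y_dim)) / np.sqrt(H_gen)
Y = X @ W_true @ V_true + np.sqrt(sigma2) * rng.standard_normal((P, Y_dim))

def log_evidence(H, n_samples=2_000):
    """Estimate log P(Y|X) for a two-layer linear network with H hidden units by
    averaging the Gaussian likelihood of Eq. (5) over prior samples of W (Eq. 6)."""
    log_liks = np.empty(n_samples)
    for s in range(n_samples):
        W = rng.standard_normal((X_dim, H)) / np.sqrt(X_dim)     # prior sample of W
        K = (X @ W) @ (X @ W).T / H + sigma2 * np.eye(P)         # covariance in Eq. (5)
        # the 10 output channels are IID given W, so their log-densities add
        log_liks[s] = multivariate_normal.logpdf(Y.T, mean=np.zeros(P), cov=K).sum()
    # log of the average likelihood, computed stably
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)

for H in [1, 2, 4, 16, 128]:
    print(H, log_evidence(H))
```

With more samples (the paper uses 64,000) the estimate stabilises, and the model evidence peaks near the matched number of hidden units, as in Fig. 1.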
2. Infinite networks with finite bottlenecks
In the previous section, we considered the simplest networks in which these phenomena emerge: a two-layer, linear network. In this section, we set up a full infinite network with bottlenecks and show that activity flowing through this network can be understood entirely in terms of kernel and covariance matrices.

Consider a single layer within a fully-connected network, where the potentially infinite activity at the previous layer, $H_{\ell-1}$, corresponding to a batch containing all inputs, is multiplied by a weight matrix, $W_\ell$, to give a finite number of activations, $A_\ell$. This activation matrix, $A_\ell$, is multiplied by another matrix, $M_\ell$, to give a potentially infinite updated activation matrix, $A'_\ell$, which is then passed through a non-linearity, $\phi$, to give the potentially infinite activity at this layer, $H_\ell$ (Fig. 2 summarises the corresponding feature-space and kernel quantities). Note that following Matthews et al. (2018), we use "activation" pre-nonlinearity and "activity" post-nonlinearity.
$$A_\ell = H_{\ell-1} W_\ell, \qquad A'_\ell = A_\ell M_\ell, \qquad H_\ell = \phi(A'_\ell) \tag{9}$$
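Before turning to the kernel-space description, the following minimal sketch shows one layer of Eq. (9) with the infinite width $M_\ell$ approximated by a large finite value; the dimensions and the ReLU nonlinearity are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, M_prev, N_l, M_l = 16, 2048, 8, 2048     # bottleneck width N_l is small; the M's are "large"

H_prev = rng.standard_normal((P, M_prev))                      # activity at the previous layer
W_l = rng.standard_normal((M_prev, N_l)) / np.sqrt(M_prev)     # prior of Eq. (11)
M_mat = rng.standard_normal((N_l, M_l)) / np.sqrt(N_l)         # prior of Eq. (12)

A_l = H_prev @ W_l           # finite bottleneck activations, P x N_l
A_prime = A_l @ M_mat        # "infinite" pre-nonlinearity activations, P x M_l
H_l = np.maximum(A_prime, 0) # activity after an (illustrative) ReLU nonlinearity

# the kernels of A_l and A'_l agree as M_l grows (Eq. 15)
K_l = A_l @ A_l.T / N_l
K_prime = A_prime @ A_prime.T / M_l
print(np.max(np.abs(K_l - K_prime)))   # small for large M_l
```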
Figure 2. The relationships between the feature-space and kernel representations of the neural network. For a typical finite neural network, $M_\ell = I$, so $M_\ell = N_\ell$. For a finite-infinite network (which allows us to compute $L_\ell$ from $K_\ell$), we send $M_\ell \to \infty$, and draw the elements of $M_\ell$ IID from a Gaussian distribution with zero mean and variance $1/N_\ell$.

Here, the input data is $H_0 = X$, and the matrices have shapes
$$H_\ell \in \mathbb{R}^{P \times M_\ell}, \quad A_\ell \in \mathbb{R}^{P \times N_\ell}, \quad A'_\ell \in \mathbb{R}^{P \times M_\ell}, \quad W_\ell \in \mathbb{R}^{M_{\ell-1} \times N_\ell}, \quad M_\ell \in \mathbb{R}^{N_\ell \times M_\ell}. \tag{10}$$
For an infinite network with bottlenecks, we take the limit as $M_\ell$ goes to infinity, leaving $N_\ell$ finite. As such, the activity before, $A'_\ell$, and after, $H_\ell$, the nonlinearity is infinite, with a finite linear bottleneck formed by $A_\ell$.

For a fully-connected network, the columns of $W_\ell$ and $M_\ell$, denoted $w_{\ell\lambda}$ and $m_{\ell\lambda}$, are generated IID from a Gaussian distribution,
$$P(W_\ell) = \prod_{\lambda=1}^{N_\ell} P(w_{\ell\lambda}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(w_{\ell\lambda}; 0, \tfrac{1}{M_{\ell-1}} I\right), \tag{11}$$
$$P(M_\ell) = \prod_{\lambda=1}^{M_\ell} P(m_{\ell\lambda}) = \prod_{\lambda=1}^{M_\ell} \mathcal{N}\left(m_{\ell\lambda}; 0, \tfrac{1}{N_\ell} I\right), \tag{12}$$
where the normalization constants, $1/M_{\ell-1}$ and $1/N_\ell$, ensure that activations remain normalized as they flow through the network.

Following the infinite network literature, we would like to characterise activity flowing through the network in terms of the activation kernel, $K_\ell$, and activity kernel, $L_\ell$,
$$K_\ell \equiv \tfrac{1}{N_\ell} A_\ell A_\ell^T, \qquad L_\ell \equiv \tfrac{1}{M_\ell} H_\ell H_\ell^T, \qquad L_0 = \tfrac{1}{M_0} X X^T. \tag{13}$$
We begin by characterising the relationship between $A'_\ell$ and $A_\ell$. As each channel (column) of $A'_\ell$ is a linear function of the corresponding channel of the weights, $a'_{\ell\lambda} = A_\ell m_{\ell\lambda}$, these activations are Gaussian and IID conditioned on $A_\ell$,
$$P(A'_\ell \mid A_\ell) = \prod_{\lambda=1}^{M_\ell} P(a'_{\ell\lambda} \mid A_\ell) = \prod_{\lambda=1}^{M_\ell} \mathcal{N}\left(a'_{\ell\lambda}; 0, K_\ell\right) = P(A'_\ell \mid K_\ell), \tag{14}$$
and taking the limit of $M_\ell \to \infty$,
$$\lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} M_\ell M_\ell^T = \tfrac{1}{N_\ell} I, \qquad \lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} A'_\ell (A'_\ell)^T = \lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} A_\ell M_\ell M_\ell^T A_\ell^T = \tfrac{1}{N_\ell} A_\ell A_\ell^T = K_\ell. \tag{15}$$
Thus, the kernel for $A_\ell$ is equivalent to the kernel for $A'_\ell$ in infinite networks with finite bottlenecks (Fig. 2).

Next, consider computing $K_\ell$ from $L_{\ell-1}$. As each channel (column) of the activations is a linear function of the corresponding channel of the weights, $a_{\ell\lambda} = H_{\ell-1} w_{\ell\lambda}$, the activations are Gaussian and IID conditioned on the activity at the previous layer,
$$P(A_\ell \mid H_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} P(a_{\ell\lambda} \mid H_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(a_{\ell\lambda}; 0, J_\ell\right) = P(A_\ell \mid J_\ell), \tag{16}$$
with covariance $J_\ell$.
For a fully connected network, the covariance, $J_\ell$, is equal to the previous layer's activity kernel, $L_{\ell-1}$,
$$J_\ell = L_{\ell-1} = \tfrac{1}{M_{\ell-1}} H_{\ell-1} H_{\ell-1}^T, \tag{17}$$
but the relationship is more complex in convolutional architectures (Garriga-Alonso et al., 2019; Novak et al., 2019) (Appendix A.2). As $A_\ell$ is always finite and random, $K_\ell$ is also a random variable, and inspecting the above expressions, its distribution can be written as a Wishart, centered on $L_{\ell-1}$.

Finally, consider computing $L_\ell$ from $K_\ell$. Note that as both $A'_\ell$ and $H_\ell$ are infinite, we can directly use standard results from infinite neural networks, i.e. those from Cho & Saul (2009), as in Lee et al. (2018); Matthews et al. (2018); Garriga-Alonso et al. (2019); Novak et al. (2019).

Linear infinite networks with finite bottlenecks can be obtained by setting $H_\ell = \phi(A'_\ell) = A'_\ell$, implying that $L_\ell = K_\ell$. Critically, this is equivalent to a deep linear network obtained by, in addition, setting $M_\ell = I$ so that $A'_\ell = A_\ell$ and $M_\ell = N_\ell$, as these choices imply that $A_\ell = A'_\ell = H_\ell$ so that again, $L_\ell = K_\ell$.

Given this setup, we can see that even a finite nonlinear network (i.e. with $M_\ell = I$) is a deep Gaussian process. In particular, in a deep Gaussian process, the activations at layer $\ell$, denoted $A_\ell$, consist of $N_\ell$ IID channels that are Gaussian-process distributed (Eq. 16), with a kernel/covariance determined by the activations at the previous layer. For a fully connected network,
$$J_\ell = L_{\ell-1} = \tfrac{1}{N_{\ell-1}} \phi(A_{\ell-1})\, \phi^T(A_{\ell-1}). \tag{18}$$
The relationship between finite neural networks and deep GPs is worth noting, because the same intuition, of the lower layers shaping the top-layer kernel, arises in both scenarios (e.g. Bui et al., 2016), and because there is potential for applying GP inference methods to neural networks, and vice versa.
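Because, for a linear network, every quantity above is a $P \times P$ matrix, the prior over a deep linear network can be simulated entirely in kernel space. The sketch below does this for a fully-connected linear network using Eqs. (13), (16) and (17); the depth, width and inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
P, M0, N, L = 10, 4, 32, 8                  # datapoints, input dimension, width, depth

X = rng.standard_normal((P, M0))
L0 = X @ X.T / M0                           # input kernel, Eq. (13)

K = L0
for layer in range(L):
    J = K                                   # linear network: J_l = L_{l-1} = K_{l-1}, Eq. (17)
    # each of the N channels of A_l is an IID Gaussian with covariance J (Eq. 16)
    A = np.linalg.cholesky(J + 1e-10 * np.eye(P)) @ rng.standard_normal((P, N))
    K = A @ A.T / N                         # stochastic activation kernel, Eq. (13)

print(np.diag(K))                           # stays O(1), but fluctuates around diag(L0)
```

The fluctuations of $K_\ell$ around its mean are exactly the kernel flexibility studied in the next section.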
3. The prior view on kernel flexibility
We can analyse how flexibility in the kernel emerges by looking at the variability (i.e. the variance and covariance) of $J_\ell$, $K_\ell$ and $L_\ell$. If the prior gives a stochastic kernel with higher variance, then it will be easier to shape that kernel by conditioning on data. In the appendix, we derive recursive updates for deep, linear, convolutional networks, but here, for simplicity, we give the fully-connected updates,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right], \tag{19a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] \approx \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right), \tag{19b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right], \tag{19c}$$
where,
$$\langle J^\ell_{ij}\rangle = \mathbb{E}\left[J^\ell_{ij} \mid L_0\right] = L^0_{ij}, \tag{19d}$$
and where $i$, $j$, $k$ and $l$ index datapoints.

This expression predicts that the variance of the kernel is proportional to the depth (including the last layer; $L+1$) and inversely proportional to the width, $N$,
$$\mathbb{C}\left[K^{L+1}_{ij}, K^{L+1}_{kl} \mid L_0\right] \approx \tfrac{L+1}{N}\left(L^0_{ik} L^0_{jl} + L^0_{il} L^0_{jk}\right). \tag{20}$$
This expression is so simple because, for a fully-connected linear network, the expected covariance at each layer is the same. For nonlinear and convolutional or locally-connected networks the covariance is still proportional to $1/N$, but the depth-dependence becomes more complex, as the covariance changes as it propagates through layers.

To check the validity of these expressions, we sampled 10,000 neural networks from the prior, and evaluated the variance of the kernel for a single input (Fig. 3). These inputs were either spatially unstructured (i.e. white noise), or spatially structured, in which case the inputs were the same across the whole image. For fully connected networks, we confirmed that the variance of the kernel is proportional to the depth including the last layer, $L+1$, and inversely proportional to width, $N$ (Fig. 3A). For locally connected networks, we found that structured and unstructured inputs gave the same kernel variance, which is expected as any spatial structure is destroyed after the first layer (Fig. 3B). Further, for convolutional networks with structured input, the variance of the kernel was proportional to network depth (Fig. 3C bottom), but whenever that spatial structure was absent, either because it was absent in the inputs or because it was eliminated by an LCN (Fig. 3BC bottom), the variance of the kernel was almost constant with depth (see Appendix A.2.1).

The large decrease in kernel flexibility for locally connected networks might be one reason behind the result in Novak et al. (2019) that locally connected networks have performance that is very similar to an infinite-width network, in which all flexibility has been eliminated. In essence, for a locally connected network, we sample the weights for each spatial region independently, so we in effect average over more IID random variables, reducing the variance of the kernel at the next layer, and hence reducing the possibility for data to shape that representation. In contrast, for a convolutional network we share weights across locations, increasing the variance in the kernel, and hence increasing the possibility for data to shape the representation.
Finally, as the spatial input size, $S$, increases, for convolutional networks with spatially structured inputs, the variance of the kernel is constant, whereas for locally connected or spatially unstructured inputs, the variance falls (Fig. 3D).
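The fully-connected check behind Fig. 3A can be sketched as follows: sample networks from the prior and compare the empirical variance of $K^{L+1}_{ii}$ for a single normalized input against Eq. (20). This is only an illustrative re-implementation under assumed settings, with far fewer than the paper's 10,000 samples.

```python
import numpy as np

rng = np.random.default_rng(3)
M0, N, L, n_samples = 16, 128, 4, 2000

x = rng.standard_normal(M0)
x = x / np.linalg.norm(x) * np.sqrt(M0)     # normalize so the input kernel L0_ii = 1
L0_ii = x @ x / M0

k_samples = np.empty(n_samples)
for s in range(n_samples):
    h = x
    for layer in range(L + 1):              # L hidden layers plus the last layer
        n_in = h.shape[0]
        W = rng.standard_normal((n_in, N)) / np.sqrt(n_in)
        h = h @ W
    k_samples[s] = h @ h / N                # K^{L+1}_{ii} for this prior sample

theory = (L + 1) / N * 2 * L0_ii**2         # Eq. (20) with i = j = k = l
print(k_samples.var(), theory)
```

With $N \gg L$ the two numbers agree closely, matching the regime in which the approximation in Eq. (19b) is valid.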
4. The posterior view on kernel flexibility
An alternative approach to understanding flexibility in finite neural networks is to consider the posterior viewpoint: how learning shapes top-level representations. To obtain analytical insights, we considered maximum a-posteriori and sampling-based inference in a deep, fully-connected, linear network. In both cases, we found that learned neural networks shift the representation from being close to the input kernel, defined by,
$$K_0 = L_0 = \tfrac{1}{M_0} X X^T, \tag{21}$$
to being close to the output kernel, defined by,
$$K_{L+1} = \tfrac{1}{N_{L+1}} Y Y^T. \tag{22}$$

Figure 3. The variance of the stochastic kernel induced by randomly sampling weights in finite, linear, fully connected and convolutional networks, with spatially structured and unstructured inputs. We use normalized inputs and circular convolutions to ensure that the kernel's expected value remains equal to 1 at all locations as it propagates through the network. The dashed lines in all plots display the theoretical approximation (Eq. 19), which is valid when the width is much greater than the number of layers. The solid lines display the empirical variance of the kernel from 10,000 simulations. A: The variance of the kernel for fully connected networks, plotted against network width, $N$, for shallow (blue; $L+1=1$) and deep (orange; $L+1=16$) networks (top), and plotted against network depth, $L+1$, for narrow (green; $N=64$) and wide (red; $N=1024$) networks (bottom). B: The variance of the kernel for locally connected networks with spatially structured and unstructured inputs, plotted against the number of channels, $N$, and against network depth, $L+1$. Note that the structured line lies underneath the unstructured line. The inputs are 1-dimensional with $S=32$ spatial locations. C: As in B, but for convolutional networks. D: The variance of the kernel as a function of the input spatial size, $S$, for deep ($L+1=16$) LCNs (top) and CNNs (bottom) with spatially structured and unstructured inputs.

In particular, under MAP inference, the shape of the kernel smoothly transitions from the input to the output kernel (Appendix B.2),
$$K_\ell = \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0, \tag{23}$$
where $N_{\ell<}$ is the geometric average of the width in layers $\ell+1$ to $L+1$, and $N_{\le\ell}$ is the geometric average of the width in layers 1 to $\ell$. Thus, the kernels (and the underlying weights) at each layer can be made arbitrarily large or small by changing the width, despite the prior distribution being chosen specifically to ensure that the scale of the kernels was invariant to network width. This is an issue inherent to the use of MAP inference, which often finds modes that give a poor characterisation of the Bayesian posterior. In contrast, if we sample the weights using Langevin sampling (Appendix C), and set all the intermediate widths, from $N_1$ to $N_L$, to $N$, then we get a similar expression,
$$K_\ell = \left(K_L K_0^{-1}\right)^{\ell/L} K_0, \tag{24}$$
where the kernels slowly transition from $K_0$ to $K_L$. The key difference is that the similarity between the top-layer representation, $K_L$, and the output kernel, $K_{L+1}$, depends on the ratio between the network width, $N$, and the number of output units, $Y = N_{L+1}$. In particular, if $Y = N$, then we get a relationship very similar to that for MAP inference,
$$K_\ell = \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{25}$$
However, as the network width grows very large, the prior begins to dominate, and the posterior becomes dominated by the prior,
$$\lim_{N/Y \to \infty} K_\ell = K_0, \tag{26}$$
as $K_L = K_0$.
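The interpolation in Eqs. (23-25) is straightforward to evaluate numerically using matrix fractional powers. The sketch below assumes equal widths (so the width-dependent prefactor in Eq. 23 is one) and uses random stand-ins for the input and output kernels rather than kernels from a real network.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

rng = np.random.default_rng(4)
P, L = 6, 5

def random_kernel(P, rank):
    A = rng.standard_normal((P, rank))
    return A @ A.T / rank + 1e-6 * np.eye(P)

K0 = random_kernel(P, 8)      # stand-in for the input kernel
KL1 = random_kernel(P, 3)     # stand-in for the output kernel (low rank, like one-hot labels)

for l in range(L + 2):
    # Eq. (25): K_l = (K_{L+1} K_0^{-1})^{l/(L+1)} K_0
    K_l = fractional_matrix_power(KL1 @ np.linalg.inv(K0), l / (L + 1)) @ K0
    K_l = np.real(K_l)        # tiny imaginary parts can appear from the Schur form
    print(l, np.linalg.norm(K_l - K0), np.linalg.norm(K_l - KL1))
```

As $\ell$ runs from 0 to $L+1$, the interpolated kernel moves monotonically away from $K_0$ and towards $K_{L+1}$.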
Finally, if the network width is small in comparison to the number of output units,
$$\lim_{N/Y \to 0} K_\ell = \left(K_{L+1} K_0^{-1}\right)^{\ell/L} K_0, \tag{27}$$
as the top-layer kernel converges to the output, $K_L = K_{L+1}$.

Figure 4. Comparison of kernels for finite and infinite neural networks at different layers. All kernels are computed on test data. A (top): Correlation (coefficient) between the kernel defined by the infinite network, and the kernel defined by a finite network after different numbers of training epochs. A (bottom): Correlation (coefficient) between the kernel defined by the finite network after different numbers of training epochs, and the output kernel defined by taking the inner product of one-hot vectors representing the class label. B (top): The Gaussian process marginal likelihood for the 10 functions given by the one-hot class labels, evaluated using the kernel output by different ResNet blocks. B (bottom): The fraction of variance in the direction of the one-hot output class labels. C (top): The eigenvalues of the kernel defined by the infinite network as we progress through layers, compared to a $-1$ power law (grey). C (bottom): The eigenvalues of the kernel defined by the finite network after 200 training epochs, as we progress through ResNet blocks.

The above results suggest that finite neural networks perform well by giving flexibility to interpolate between the input kernel and output kernel. To see how this happens in real neural networks, we considered a 34-layer ResNet without batchnorm, corresponding to the infinite network in Garriga-Alonso et al. (2019), trained on CIFAR-10. We began by computing the correlation between elements of the finite and infinite kernel (Fig. 4A top) as we go through ResNet blocks (x-axis), and as we go through training (blue lines). As expected, the randomly initialized, untrained network retains a high correlation with the infinite kernel at all layers, though the correlation is somewhat smaller for higher layers, as there has been more time for random sampling to build up discrepancies. However, for trained networks, this correspondence between the finite and infinite networks is far weaker: even at the first layer the correlation is substantially reduced, and as we go through layers, the correlation decreases to almost zero. To understand whether kernels were being actively shaped, we computed the correlation between the kernel for the finite network and the output kernel, defined by taking the inner product of vectors representing the one-hot class labels (Fig. 4A bottom). We found that while the correlation for the untrained network decreased across layers, training gives strong positive correlations with the output kernel, and these correlations increase as we move through network layers. Combined, these results indicate that the top-layer representation is much closer to the output kernel, as suggested by the deep linear results, than it is to the corresponding infinite network.

While correlation is a useful, simple measure of similarity, there are other measures of similarity that take into account the special structure of kernel matrices. In particular, we considered the marginal likelihood for the one-hot outputs corresponding to the class label, under a GP with a kernel given by a scaled sum of the kernel at that ResNet block and the identity (see Appendix D; Fig. 4B top). For the infinite network, the marginal likelihood increased somewhat as we moved through network layers, and the untrained finite network performed similarly, except that there was a decrease in performance at the last layer.
In contrast, the marginal likelihood for the finite, trained networks was initially very close to that of the infinite networks, but grew rapidly as we move through ResNet blocks.

To gain an insight into how training shaped the neural network kernels, we computed the variance in the subspace defined by the one-hot outputs (i.e. the classification directions; Fig. 4B bottom). We might have expected to see a steady increase in the variance in this subspace as we move through layers, but in fact the level was very small, only rising appreciably at the final block, and only for trained networks. To try to understand these results, we computed the eigenvalue spectrum of the kernels. For the infinite network (Fig. 4C top), we found that the eigenvalue spectrum at all levels decayed as a $-1$ power law. This is expected at the lowest level due to the well-known $1/f$ power spectrum of images (Van der Schaaf & van Hateren, 1996), but is not necessarily the case at higher levels. Given that the power spectrum of the output kernel is just a small set of equal-sized eigenvalues corresponding to the class labels (Fig. 4C bottom, green line), we might expect the eigenspectrum of finite networks to gradually get steeper as we move through network layers. In fact, we find the opposite: for intermediate layers, the eigenvalue spectrum becomes flatter, which can be interpreted as the network attempting to retain as much information as possible about all aspects of the image. It is only at the last layer that the relevant information is selected, giving an eigenvalue spectrum with around 10 large and roughly equally-sized eigenvalues, followed by much smaller eigenvalues, which mirrors the spectrum of the output kernel. This again confirms that the top-layer representation in trained networks is much closer to the output kernel than it is to the corresponding infinite network.
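The quantities plotted in Fig. 4 reduce to simple operations on $P \times P$ kernel matrices. The sketch below shows how the correlation with the output kernel and the eigenspectrum can be computed; the feature matrix standing in for a ResNet block's activity and the random labels are hypothetical placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(5)
P, n_classes, n_features = 200, 10, 512

labels = rng.integers(0, n_classes, size=P)
Y_onehot = np.eye(n_classes)[labels]
K_out = Y_onehot @ Y_onehot.T / n_classes        # output kernel, Eq. (22)

features = rng.standard_normal((P, n_features))  # stand-in for a block's flattened activity
K_feat = features @ features.T / n_features      # kernel of the (hypothetical) finite network

def kernel_correlation(K1, K2):
    """Correlation coefficient between the elements of two kernel matrices (Fig. 4A)."""
    a, b = K1.ravel(), K2.ravel()
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print("corr with output kernel:", kernel_correlation(K_feat, K_out))

eigvals = np.linalg.eigvalsh(K_feat)[::-1]       # eigenspectrum, largest first (Fig. 4C)
print("top eigenvalues:", eigvals[:5])
```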
5. Related work
Agrawal et al. (2020) independently introduced infinite networks with finite bottlenecks, but then made a very different contribution in that context. In particular, they highlighted that if we take the limit as some layers of a neural network go to infinity, convergence to the infinite networks with bottlenecks considered here is not immediate, but requires the neural network components to exhibit sufficient uniformity with respect to their inputs. In contrast, we show that finite bottlenecks can introduce flexibility and thereby improve performance even in two-layer linear networks, give analytic results in the case of linear networks, and show that these considerations are likely to be important in realistic large-scale networks, by showing that the kernel for a trained ResNet differs dramatically from that for the corresponding infinite network.

Technically, our work bears similarity to classical work on the dynamics of gradient descent in unregularised deep linear networks (Saxe et al., 2013). Importantly, the lack of regularisation in this work implies that infinitely many optimal solutions are available (e.g. all the lower-layer weights being fixed to the identity). In contrast, we focused on Bayesian inference, but also considered the optimal solution for regularised networks, which is much more constrained.
6. Conclusions
We have shown that finite Bayesian neural networks have more flexibility than infinite networks, and that this may explain the superior performance of finite networks. Thus, we introduced infinite networks with bottlenecks, and argue that, as they incorporate flexibility and are able to perform representation learning, they may be a better model of real neural networks. We then assessed the flexibility of deep linear networks from two perspectives. First, we looked at the prior viewpoint: the variability in the top-layer kernel induced by the prior over a finite neural network. Second, we looked at the posterior viewpoint: the ability of the learning process to shape the top-layer kernel. Under both MAP inference and sampling in finite networks, learning gradually shaped top-layer representations so as to match the output kernel. But, as Bayesian neural networks increase in width, the kernels become gradually less flexible, eliminating the possibility for learning to shape the kernel. In contrast, for MAP inference, the degree of kernel shaping is not affected by network width, and this additional flexibility might be an avenue for overfitting.
Acknowledgements
I would like to thank Adrià Garriga-Alonso, Sebastian Ober and Vidhi Lalchand for useful discussions.
References
Agrawal, D., Papamarkou, T., and Hinkle, J. Wide neural networks with bottlenecks are deep Gaussian processes. arXiv preprint arXiv:2001.00921, 2020.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., and Turner, R. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Cho, Y. and Saul, L. K. Kernel methods for deep learning. NeurIPS, 2009.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. Deep convolutional networks as shallow Gaussian processes. ICLR, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.

Huh, M., Agrawal, P., and Efros, A. A. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. ICLR, 2018.

Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., and Arora, S. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.

Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. ICLR, 2019.

Ramsey, F. P. Truth and probability. In Readings in Formal Epistemology, pp. 21–45. Springer, 1926.

Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

van der Schaaf, A. and van Hateren, J. H. Modelling the power spectra of natural images: statistics and information. Vision Research, 36(17):2759–2770, 1996.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
A. Kernel flexibility: prior viewpoint
To compute the covariance, which we denote $\mathbb{C}[\cdot]$, of the kernel for a deep network, we consider a recursion where we start with $\mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right]$, then compute the resulting $\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right]$, then compute the resulting $\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right]$. In particular, we apply the law of total covariance for $K_\ell \mid J_\ell$, and we consider linear networks for which $L_\ell = K_\ell$,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \,? \tag{28a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[\mathbb{E}\left[K^\ell_{ij} \mid J_\ell\right], \mathbb{E}\left[K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] + \mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] \tag{28b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] \tag{28c}$$
The first equation is different for fully connected and convolutional networks, so we give its form later.

Eq. (28b) always behaves in the same way for linear and nonlinear, fully connected and convolutional networks, so we consider this first. In particular, we always have $\mathbb{E}[K_\ell \mid J_\ell] = J_\ell$, so the first term in Eq. (28b) is
$$\mathbb{C}\left[\mathbb{E}\left[K^\ell_{ij} \mid J_\ell\right], \mathbb{E}\left[K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right]. \tag{29}$$
For the second term in Eq. (28b), we substitute the definition of $K_\ell$ (Eq. 13) into the definition of the covariance,
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] = \frac{1}{N_\ell^2} \sum_{\mu=1}^{N_\ell} \sum_{\nu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] - \left(\frac{1}{N_\ell} \sum_{\mu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} \mid J_\ell\right]\right)\left(\frac{1}{N_\ell} \sum_{\nu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right]\right). \tag{30}$$
As the $a$'s are jointly Gaussian, their expectations are
$$\mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] = J^\ell_{ij} J^\ell_{kl} + \delta_{\mu\nu}\left(J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk}\right), \tag{31a}$$
$$\mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} \mid J_\ell\right] = J^\ell_{ij}, \tag{31b}$$
$$\mathbb{E}\left[a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] = J^\ell_{kl}. \tag{31c}$$
Thus, the covariance of the kernel becomes,
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] = \frac{1}{N_\ell}\left(J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk}\right). \tag{32}$$
Substituting this into the second term in Eq. (28b),
$$\mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \frac{1}{N_\ell} \mathbb{E}\left[J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk} \mid L_0\right], \tag{33}$$
and writing the expected product in terms of the product of expectations and covariance,
$$\mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \frac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \frac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right), \tag{34}$$
where,
$$\langle J^\ell_{ik}\rangle = \mathbb{E}\left[J^\ell_{ik} \mid L_0\right]. \tag{35}$$
Substituting Eq. (29) and Eq. (34) into Eq.
(28b), we obtain
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \frac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \frac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right). \tag{36}$$
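Eq. (32), the key ingredient of Eq. (36), is easy to verify by simulation: draw many sets of $N_\ell$ IID Gaussian channels with a fixed covariance $J$, form the kernel, and compare the empirical covariance of two of its elements with the closed form. The covariance, width and indices below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
P, N, n_samples = 5, 64, 100_000

B = rng.standard_normal((P, P))
J = B @ B.T / P + 0.1 * np.eye(P)            # an arbitrary positive-definite covariance
chol = np.linalg.cholesky(J)

i, j, k, l = 0, 1, 2, 3
K_ij = np.empty(n_samples)
K_kl = np.empty(n_samples)
for s in range(n_samples):
    A = chol @ rng.standard_normal((P, N))   # N IID channels with covariance J
    K = A @ A.T / N
    K_ij[s], K_kl[s] = K[i, j], K[k, l]

empirical = np.cov(K_ij, K_kl)[0, 1]
closed_form = (J[i, k] * J[j, l] + J[i, l] * J[j, k]) / N   # Eq. (32)
print(empirical, closed_form)
```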
A.1. Fully connected network

Now we evaluate Eq. (28a), first for a fully connected network,
$$a^\ell_{\lambda, i} = \sum_\mu h^{\ell-1}_{i,\mu} W^\ell_{\mu,\lambda}, \tag{37}$$
where the weights are drawn from an independent zero-mean Gaussian, such that
$$\mathbb{E}\left[W^\ell_{\mu,\lambda} W^\ell_{\nu,\lambda}\right] = \tfrac{1}{N_{\ell-1}} \delta_{\mu,\nu}. \tag{38}$$
Thus, $a^\ell_\lambda$ has distribution,
$$P\left(a^\ell_\lambda\right) = \mathcal{N}\left(a^\ell_\lambda; 0, J_\ell\right), \tag{39}$$
where $J_\ell$ is given by,
$$J^\ell_{ij} = \mathbb{C}\left[a^\ell_{i,\lambda}, a^\ell_{j,\lambda}\right] = \mathbb{E}\left[a^\ell_{i,\lambda} a^\ell_{j,\lambda}\right] \tag{40}$$
$$= \mathbb{E}\left[\left(\sum_\mu h^{\ell-1}_{i,\mu} W^\ell_{\mu,\lambda}\right)\left(\sum_\nu h^{\ell-1}_{j,\nu} W^\ell_{\nu,\lambda}\right)\right] \tag{41}$$
$$= \sum_{\mu\nu} h^{\ell-1}_{i,\mu} h^{\ell-1}_{j,\nu}\, \mathbb{E}\left[W^\ell_{\mu,\lambda} W^\ell_{\nu,\lambda}\right]; \tag{42}$$
substituting for the expectation (Eq. 38), and identifying the activity kernel (Eq. 13),
$$= \tfrac{1}{N_{\ell-1}} \sum_\mu h^{\ell-1}_{i,\mu} h^{\ell-1}_{j,\mu} = L^{\ell-1}_{ij}. \tag{43}$$
Thus,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right]. \tag{44}$$
Combining this expression with Eq. (36) gives a complete form for the updates (Eq. 28),
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right], \tag{45a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \tfrac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right), \tag{45b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right]. \tag{45c}$$
However, this form is difficult to analyse due to the complexity of Eq. (45b). Instead, we can form an approximation to Eq. (45b) by noting that one of the recursive terms is negligible. Taking all network widths to be equal, $N_\ell = N$ (or at least of the same order), if
$$\mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right] = O(1/N), \tag{46}$$
then
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = O(1/N), \tag{47}$$
as the network is chosen such that the activities, and hence the covariances, $J^\ell_{ij}$, remain $O(1)$, so
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + O(1/N^2) = O(1/N), \tag{48}$$
so,
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = O(1/N). \tag{49}$$
To begin the recursion, the data is fixed, so
$$\mathbb{C}\left[J^1_{ij}, J^1_{kl} \mid L_0\right] = \mathbb{C}\left[L^0_{ij}, L^0_{kl} \mid L_0\right] = 0, \tag{50}$$
and,
$$\mathbb{C}\left[K^1_{ij}, K^1_{kl} \mid L_0\right] = \tfrac{1}{N_1}\left(\langle L^0_{ik}\rangle \langle L^0_{jl}\rangle + \langle L^0_{il}\rangle \langle L^0_{jk}\rangle\right) = O(1/N). \tag{51}$$
Thus, the covariance of the kernels and covariances is indeed $O(1/N)$, so $\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl}\right]$ can be approximated by Eq. (48). Combining this approximation with Eq. (45) gives the expressions in the main text (Eq. 19). Finally, note that the approximation in Eq. (48) remains true only as long as the number of layers is small, $L \ll N$.
A.2. Convolutional network

For locally connected and convolutional networks, we introduce spatial structure into the activations, and we use spatial indices, $r$, $s$, $u$ and $v$. Thus, the activations for datapoint $i$ at layer $\ell$, spatial location $r$ and channel $\lambda$ are given by,
$$a^\ell_{i,r\lambda} = \sum_{r'\mu} h^{\ell-1}_{i,r'\mu} W^\ell_{r'\mu, r\lambda}. \tag{52}$$
Note that for many purposes, these higher-order tensors can be treated as vectors and matrices, if we combine indices (e.g. using a "reshape" or "view" operation). The commas in the index list are used to denote how to combine indices for this particular operation, such that it can be understood as a standard matrix/vector operation. For the above equation, the activations, $a_\ell \in \mathbb{R}^{P \times S N_\ell}$, are given by the matrix product of the activities from the previous layer, $h_{\ell-1} \in \mathbb{R}^{P \times S M_{\ell-1}}$, and the weights, $W_\ell \in \mathbb{R}^{S M_{\ell-1} \times S N_\ell}$, where remember that $S$ is the number of spatial locations in the input.

For a convolutional neural network, the weights are the same if we consider the same input-to-output channels and the same spatial displacement, $d$, and are uncorrelated otherwise,
$$\mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right] = \frac{1}{M_{\ell-1} D_{\ell-1}}\, \delta_{\mu,\nu} \sum_{d \in \mathcal{D}_{\ell-1}} \delta_{r', (r+d)}\, \delta_{s', (s+d)}, \tag{53}$$
where $\mathcal{D}_{\ell-1}$ is the set of all valid spatial displacements for the convolution, and $D_{\ell-1} = |\mathcal{D}_{\ell-1}|$ is the number of valid spatial displacements (i.e. the size of the convolutional patch). For a locally-connected network, the only additional requirement is that the output spatial locations are the same,
$$\mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right] = \frac{1}{M_{\ell-1} D_{\ell-1}}\, \delta_{\mu,\nu}\, \delta_{r,s} \sum_{d \in \mathcal{D}_{\ell-1}} \delta_{r', (r+d)}\, \delta_{s', (s+d)}. \tag{54}$$
Now we can compute the covariance of the activations, $J_\ell$, for a convolutional network,
$$J^\ell_{ir,js} = \mathbb{E}\left[a^\ell_{i,r\lambda} a^\ell_{j,s\lambda} \mid L_{\ell-1}\right] \tag{55}$$
$$= \mathbb{E}\left[\left(\sum_{r'\mu} h^{\ell-1}_{i,r'\mu} W^\ell_{r'\mu, r\lambda}\right)\left(\sum_{s'\nu} h^{\ell-1}_{j,s'\nu} W^\ell_{s'\nu, s\lambda}\right) \,\middle|\, L_{\ell-1}\right] \tag{56}$$
$$= \sum_{\mu\nu r's'} h^{\ell-1}_{i,r'\mu} h^{\ell-1}_{j,s'\nu}\, \mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right]; \tag{57}$$
substituting the covariance of the weights (Eq. 53), and noting that the product of $h$'s forms the definition of the activity kernel (Eq. 13),
$$J^\ell_{ir,js} = \frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)}. \tag{58}$$
For locally connected intermediate layers, we instead substitute Eq. (54), which gives the same result, except that the output locations must be the same for there to be any covariance in the weights,
$$J^\ell_{ir,js} = \frac{1}{D_{\ell-1}}\, \delta_{r,s} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)}. \tag{59}$$
Substituting this into Eq. (28a),
$$\mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] = \mathbb{C}\left[\frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)},\; \frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{k(u+d), l(v+d)} \,\middle|\, L_0\right]. \tag{60}$$
Now, we can put together full recursive updates for convolutional networks, by pulling the sum out of the covariance above, and by taking the indices in Eq.
(28), as indexing both a datapoint and a spatial location (i.e. $i \to i,s$),
$$\mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] = \frac{1}{D_{\ell-1}^2} \sum_{dd'} \mathbb{C}\left[L^{\ell-1}_{i(r+d), j(s+d)}, L^{\ell-1}_{k(u+d'), l(v+d')} \mid L_0\right], \tag{61a}$$
$$\mathbb{C}\left[K^\ell_{ir,js}, K^\ell_{ku,lv} \mid L_0\right] \approx \mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] + \frac{1}{N_\ell}\left(\langle J^\ell_{ir,ku}\rangle \langle J^\ell_{js,lv}\rangle + \langle J^\ell_{ir,lv}\rangle \langle J^\ell_{js,ku}\rangle\right), \tag{61b}$$
$$\mathbb{C}\left[L^\ell_{ir,js}, L^\ell_{ku,lv} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ir,js}, K^\ell_{ku,lv} \mid L_0\right]. \tag{61c}$$
Finally, to compute these terms, note that we can recursively compute these expressions for $r = s$ and $u = v$,
$$\mathbb{C}\left[J^\ell_{ir,jr}, J^\ell_{ku,lu} \mid L_0\right] = \frac{1}{D_{\ell-1}^2} \sum_{dd'} \mathbb{C}\left[L^{\ell-1}_{i(r+d), j(r+d)}, L^{\ell-1}_{k(u+d'), l(u+d')} \mid L_0\right], \tag{62}$$
which reduces computational complexity, and the resulting expression can even be evaluated efficiently as a 2D convolution.
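For a 1-D circular convolution, the updates in Eqs. (58) and (59) amount to averaging the previous layer's activity kernel over jointly shifted spatial indices. The sketch below implements that directly for a kernel stored as a $(P, S, P, S)$ array; the patch size, displacements and input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
P, S = 3, 8                              # datapoints and spatial locations
disps = [-1, 0, 1]                       # the D valid displacements of the patch

# an arbitrary activity kernel L_{l-1}, indexed as [i, r, j, s]
H = rng.standard_normal((P, S, 4))
L_prev = np.einsum('irc,jsc->irjs', H, H) / H.shape[-1]

def conv_cov(L_prev):
    """J^l_{ir,js} = (1/D) sum_d L^{l-1}_{i(r+d), j(s+d)}, circular boundaries (Eq. 58)."""
    J = np.zeros_like(L_prev)
    for d in disps:
        # roll both spatial axes by -d so that index r reads location r+d
        J += np.roll(np.roll(L_prev, -d, axis=1), -d, axis=3)
    return J / len(disps)

def lcn_cov(L_prev):
    """Locally connected version (Eq. 59): only the r == s entries are nonzero."""
    mask = np.eye(S)[None, :, None, :]   # delta_{r,s}
    return conv_cov(L_prev) * mask

print(conv_cov(L_prev).shape, lcn_cov(L_prev).shape)
```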
A.2.1. Convolutional and locally connected networks

To understand the very different results for convolutional and locally connected networks (Fig. 3B–D), despite their having the same infinite limit, we need to consider how Eq. (62) interacts with Eq. (59). For a locally connected network, the covariance of activations at different locations is always zero, i.e. $J^\ell_{ir,js} = 0$ for $r \neq s$, whereas, for a spatially structured network, the $J^\ell_{ir,js}$ terms for $r \neq s$ have the same scale as those for $r = s$. The $J^\ell_{ir,js}$ terms enter into the variance of the kernel through Eq. (62). Note that there are $D^2_{\ell-1}$ terms in this sum, and the sum is normalized by dividing by $D^2_{\ell-1}$. Thus, in convolutional networks, there are $D^2_{\ell-1}$ terms all with the same scale, whereas in spatially unstructured networks, we have only $D_{\ell-1}$ nonzero terms, introducing an effective $1/D_{\ell-1}$ normalizer. This is particularly important if we consider the last layer. The last layer can be understood as a convolution, where the convolutional patch has the same size as the image (i.e. $D_L = S$), and there is no padding, such that the output has a single spatial location. In this case, the effect of this $1/S = 1/D_L$ normalizer can be very large.

B. Kernel flexibility: posterior viewpoint
B.1. Reparameterising finite neural networks
Swapping between a kernel representation and a feature representation is difficult if we work directly with a prior over the weights, $W_\ell \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$. Instead, note that as the weights are Gaussian, we can reparameterise the neural network, working instead with $V_\ell \in \mathbb{R}^{P \times N_\ell}$, which has independent standard Gaussian entries, where $P$ is the number of datapoints. In particular, we can write the activities at the next layer using,
$$A_\ell = H_{\ell-1} W_\ell = U_\ell^T V_\ell, \tag{63}$$
where $U_\ell \in \mathbb{R}^{P \times P}$ is any matrix that satisfies,
$$J_\ell = U_\ell^T U_\ell, \tag{64}$$
such as the Cholesky decomposition of the covariance, $J_\ell$. We can thus write the kernel as,
$$K_\ell = \tfrac{1}{N_\ell} A_\ell A_\ell^T = \tfrac{1}{N_\ell} H_{\ell-1} W_\ell W_\ell^T H_{\ell-1}^T = \tfrac{1}{N_\ell} U_\ell^T V_\ell V_\ell^T U_\ell. \tag{65}$$
Rearranging, we can write $\tfrac{1}{N_\ell} V_\ell V_\ell^T$, or equivalently the mismatch between the covariance, $J_\ell$, and the kernel, $K_\ell$, in terms of $J_\ell$ and $K_\ell$, and we denote this quantity $R_\ell$ for future use,
$$R_\ell = U_\ell^{-T} K_\ell U_\ell^{-1} = \tfrac{1}{N_\ell} V_\ell V_\ell^T, \tag{66}$$
where $X^{-T} = (X^{-1})^T = (X^T)^{-1}$, and we have assumed that $J_\ell$ is invertible, which, if nothing else, requires that the numbers of features $M_\ell$ and $N_\ell$ are larger than the number of datapoints.
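The reparameterisation in Eqs. (63-66) can be checked numerically: with $U_\ell$ the Cholesky-style factor of $J_\ell$ and $V_\ell$ standard Gaussian, $R_\ell$ computed from the kernel recovers $\tfrac{1}{N_\ell} V_\ell V_\ell^T$ exactly. The dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
P, N = 6, 512

B = rng.standard_normal((P, P))
J = B @ B.T / P + 0.1 * np.eye(P)        # covariance of the activations at this layer
U = np.linalg.cholesky(J).T              # J = U^T U, Eq. (64)

V = rng.standard_normal((P, N))          # standard Gaussian reparameterisation
A = U.T @ V                              # Eq. (63): each channel of A has covariance J
K = A @ A.T / N                          # Eq. (65)

M1 = np.linalg.solve(U.T, K)             # U^{-T} K
R = np.linalg.solve(U.T, M1.T).T         # R = U^{-T} K U^{-1}, Eq. (66)
print(np.allclose(R, V @ V.T / N))       # True: R_l = (1/N_l) V_l V_l^T
```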
B.2. MAP inference

Here, we consider MAP inference over $V_\ell$. As the entries of $V_\ell$ have a standard Gaussian prior, we have,
$$\log P(V_\ell) = -\tfrac{1}{2} \operatorname{Tr}\left(V_\ell V_\ell^T\right) + \text{const} \tag{67}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(U_\ell^{-T} K_\ell U_\ell^{-1}\right) + \text{const} \tag{68}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell U_\ell^{-1} U_\ell^{-T}\right) + \text{const} \tag{69}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell \left(U_\ell^T U_\ell\right)^{-1}\right) + \text{const} \tag{70}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell J_\ell^{-1}\right) + \text{const}. \tag{71}$$
We can write the likelihood in the same form,
$$\log P(Y \mid J_{L+1}) = -\tfrac{1}{2} \operatorname{Tr}\left(Y^T J_{L+1}^{-1} Y\right) + \text{const} \tag{72}$$
$$= -\tfrac{1}{2} \operatorname{Tr}\left(Y Y^T J_{L+1}^{-1}\right) + \text{const} \tag{73}$$
$$= -\tfrac{N_{L+1}}{2} \operatorname{Tr}\left(K_{L+1} J_{L+1}^{-1}\right) + \text{const}, \tag{74}$$
where,
$$K_{L+1} = \tfrac{1}{N_{L+1}} Y Y^T. \tag{75}$$
Note that we would usually incorporate IID noise in the outputs, and we are not doing so here in order to give exact, interpretable solutions. We do not expect this to change the overall pattern of the results, except to marginally weaken the connection between the output kernel, $K_{L+1}$, and top-layer kernel, $K_L$.

Thus, the joint probability can be written as,
$$\log P(V_1, \ldots, V_L, Y \mid X) = -\sum_{\ell=1}^{L+1} \tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell J_\ell^{-1}\right) + \text{const}. \tag{76}$$
Now we find the MAP values of $V_1, \ldots, V_L$,
$$V_1^*, \ldots, V_L^* = \underset{V_1, \ldots, V_L}{\arg\max}\; \log P(V_1, \ldots, V_L, Y \mid X), \tag{77}$$
by taking gradients of $P(V_1, \ldots, V_L, Y \mid X)$ with respect to $K_1, \ldots, K_L$. Note that we can find the mode of this distribution by differentiating with respect to many different quantities, and we choose $K_\ell$ because of algebraic convenience, and because it includes all relevant information from $V_\ell$ (Eq. 65). Further, note that as we are still working with the probability density of $V_1, \ldots, V_L$, we should not include a Jacobian term. Now we consider a linear, fully connected network where $J_\ell = K_{\ell-1}$,
$$0 = \frac{\partial}{\partial K_\ell} \log P(V_1, \ldots, V_L, Y \mid X) = -\tfrac{N_\ell}{2} K_{\ell-1}^{-1} + \tfrac{N_{\ell+1}}{2} K_\ell^{-1} K_{\ell+1} K_\ell^{-1}, \tag{78}$$
where we have used,
$$\frac{\partial \operatorname{Tr}\left(K_\ell^{-1} K_{\ell+1}\right)}{\partial K_\ell} = -K_\ell^{-1} K_{\ell+1} K_\ell^{-1}, \tag{79}$$
$$\frac{\partial \operatorname{Tr}\left(K_{\ell-1}^{-1} K_\ell\right)}{\partial K_\ell} = K_{\ell-1}^{-1}. \tag{80}$$
Thus, the MAP kernel changes by a fixed ratio,
$$S = N_{\ell+1} K_{\ell+1} K_\ell^{-1} = N_\ell K_\ell K_{\ell-1}^{-1}. \tag{81}$$
As the input kernel, $K_0$, and the output kernel, $K_{L+1}$, are fixed, we can solve for $S$,
$$S^{L+1} = \prod_{\ell=1}^{L+1} N_\ell K_\ell K_{\ell-1}^{-1} = K_{L+1} K_0^{-1} \prod_{\ell=1}^{L+1} N_\ell, \tag{82}$$
so,
$$S = \left(K_{L+1} K_0^{-1}\right)^{1/(L+1)} \left(\prod_{\ell=1}^{L+1} N_\ell\right)^{1/(L+1)}, \tag{83}$$
where the final term is the geometric average of the width at each layer.
As such, the kernel at any given layer is,
$$K_\ell = \left(\prod_{\ell'=1}^{\ell} K_{\ell'} K_{\ell'-1}^{-1}\right) K_0 \tag{84}$$
$$= \left(\prod_{\ell'=1}^{\ell} \tfrac{1}{N_{\ell'}} S\right) K_0 \tag{85}$$
$$= \frac{\left(\prod_{\ell'=1}^{L+1} N_{\ell'}\right)^{\ell/(L+1)}}{\prod_{\ell'=1}^{\ell} N_{\ell'}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{86}$$
Defining the geometric average of the number of units at each layer prior to (and including) $\ell$, and after $\ell$,
$$N_{\le\ell} = \left(\prod_{\ell'=1}^{\ell} N_{\ell'}\right)^{1/\ell}, \tag{87}$$
$$N_{\ell<} = \left(\prod_{\ell'=\ell+1}^{L+1} N_{\ell'}\right)^{1/(L+1-\ell)}, \tag{88}$$
we can write,
$$\frac{\left(\prod_{\ell'=1}^{L+1} N_{\ell'}\right)^{\ell/(L+1)}}{\prod_{\ell'=1}^{\ell} N_{\ell'}} = \frac{\left((N_{\le\ell})^{\ell} (N_{\ell<})^{L+1-\ell}\right)^{\ell/(L+1)}}{(N_{\le\ell})^{\ell}} \tag{89}$$
$$= \left((N_{\le\ell})^{-(L+1-\ell)} (N_{\ell<})^{L+1-\ell}\right)^{\ell/(L+1)} \tag{90}$$
$$= \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}}. \tag{91}$$
This factor is the ratio of the geometric averages of the widths for the subsequent and previous layers, raised to a power which depends on the distance to the end points (for $\ell = 0$ or $\ell = L+1$ this power is 0),
$$K_\ell = \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{92}$$
Thus, MAP does something sensible: no matter the network widths (and including as the network widths go to infinity), the representation interpolates smoothly between the input and output kernels. However, the scale of these representations can shift in a strange, and potentially pathological, fashion. Remember that we normalized the weights, taking into account the width of each layer, such that the representations maintained the same scale, irrespective of layer width. However, under MAP inference, the network width controls the scale of the kernel, with larger kernels at layer $\ell$ given by widening layers from 1 to $\ell$, and narrowing layers from $\ell+1$ to $L+1$.

C. Deriving a cost-function such that gradient descent is equivalent to sampling
C. Deriving a cost-function such that gradient descent is equivalent to sampling

The pathologies in the above derivations indicate that MAP, using full-batch gradient descent, may give a very poor approximation of the kernel induced by stochastic gradient descent. As such, we consider Langevin sampling, which not only gives Bayesian inference, but also gives a good starting point for thinking about the noise introduced by stochastic gradient descent. In particular, we perform Langevin sampling over $V_\ell$ (Eq. 63),

$$dV_\ell = \tfrac{dt}{2} \frac{\partial \mathcal{L}}{\partial V_\ell} + d\Xi_\ell, \qquad (93)$$

where $d\Xi_\ell$ is a matrix-valued Wiener process. Remembering that the objective is completely specified by $R_\ell = \tfrac{1}{N_\ell} V_\ell V_\ell^T$ for a linear or finite-infinite network, we consider the effect of this sampling on $R_\ell$. In particular, we consider the expected change in $R_\ell$ under Langevin sampling,

$$\mathbb{E}\big[dR_\ell \mid R_\ell\big] = \mathbb{E}\Big[\tfrac{1}{N_\ell} d\big(V_\ell V_\ell^T\big)\Big] = \frac{dt}{2 N_\ell} \left(\frac{\partial \mathcal{L}}{\partial V_\ell} V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + \tfrac{1}{N_\ell} \mathbb{E}\big[d\Xi_\ell \, d\Xi_\ell^T\big]. \qquad (94)$$

As the only stochasticity comes from the last term, and this term has known expectation,

$$\tfrac{1}{N_\ell} \mathbb{E}\big[d\Xi_\ell \, d\Xi_\ell^T\big] = dt \, I, \qquad (95)$$

we can compute the expected update, which becomes the exact update as we take $N_\ell \to \infty$,

$$\lim_{N_\ell \to \infty} \frac{dR_\ell}{dt} = \mathbb{E}\left[\frac{dR_\ell}{dt} \,\middle|\, R_\ell\right] = \frac{1}{2 N_\ell} \left(\left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right) V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + I. \qquad (96)$$

To check that these dynamics are sensible, we consider performing Langevin sampling using the above dynamics under the zero-mean, unit-variance prior on elements of $V_\ell$,

$$\mathcal{L} = -\tfrac{1}{2} \mathrm{Tr}\big(V_\ell V_\ell^T\big), \qquad (97)$$

so the gradient is,

$$\frac{\partial \mathcal{L}}{\partial V_\ell} = \frac{\partial}{\partial V_\ell} \Big[-\tfrac{1}{2} \mathrm{Tr}\big(V_\ell V_\ell^T\big)\Big] = -V_\ell. \qquad (98)$$

Thus,

$$\mathbb{E}\left[\frac{dR_\ell}{dt} \,\middle|\, R_\ell\right] = -\tfrac{1}{N_\ell} V_\ell V_\ell^T + I = -R_\ell + I. \qquad (99)$$

Now, we set the expected change in $R_\ell$ equal to zero,

$$0 = \mathbb{E}\left[\frac{dR_\ell}{dt}\right] = -\mathbb{E}\big[R_\ell\big] + I, \qquad (100)$$

and solving for the expected value of $R_\ell$,

$$\mathbb{E}\big[R_\ell\big] = \mathbb{E}\Big[\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big] = I, \qquad (101)$$

which is equal to the expected value of $\tfrac{1}{N_\ell} V_\ell V_\ell^T$ under the prior, as is necessary given that these dynamics perform exact Langevin sampling in the limit.
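The sanity check in Eqs. (97)-(101) can also be run numerically. A minimal Euler-Maruyama discretisation of Eq. (93) under the prior (my own sketch; the step size, matrix sizes, and number of steps are arbitrary choices, not values from the paper) gives a long-run average of $R_\ell = \tfrac{1}{N_\ell} V_\ell V_\ell^T$ close to the identity, as Eq. (101) requires.

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 3, 50           # P datapoints, N channels (hypothetical sizes)
dt, steps = 1e-2, 100_000

V = rng.standard_normal((P, N))
R_sum = np.zeros((P, P))

for _ in range(steps):
    grad = -V                                    # dL/dV for L = -0.5 Tr(V V^T), Eq. (98)
    V = V + 0.5 * dt * grad + np.sqrt(dt) * rng.standard_normal((P, N))
    R_sum += V @ V.T / N                         # R = (1/N) V V^T

print(R_sum / steps)   # approximately the identity matrix, as in Eq. (101)
```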
C.1. Langevin dynamics as the modes of an objective
We can write the expected dynamics of $R_\ell$ under Langevin sampling as the gradient of a surrogate objective, $\mathcal{L}'$,

$$\mathcal{L}' = \mathcal{L} + \tfrac{N_\ell}{2} \log \big|R_\ell\big| = \mathcal{L} + \tfrac{N_\ell}{2} \log \Big|\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big|. \qquad (102)$$

The gradient of the log-determinant is given by the pseudo-inverse,

$$\frac{\partial}{\partial V_\ell} \log \Big|\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big| = 2 \big(V_\ell V_\ell^T\big)^{-1} V_\ell. \qquad (103)$$

Thus, continuous gradient descent on the full objective, with the same $\tfrac{dt}{2}$ scaling as above, gives,

$$dV_\ell = \tfrac{dt}{2} \left[\frac{\partial \mathcal{L}'}{\partial V_\ell}\right] = \tfrac{dt}{2} \left[\frac{\partial \mathcal{L}}{\partial V_\ell} + N_\ell \big(V_\ell V_\ell^T\big)^{-1} V_\ell\right]. \qquad (104)$$

The implied change in $R_\ell$ is,

$$dR_\ell = \tfrac{1}{N_\ell} \big(dV_\ell \, V_\ell^T + V_\ell \, dV_\ell^T\big) = \frac{dt}{2 N_\ell} \left(\left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right) V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + dt \, I, \qquad (105)$$

and this is exactly equal to the expected change in $R_\ell$ induced by Langevin sampling (Eq. 96).
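To see concretely why the extra $\tfrac{N_\ell}{2} \log|R_\ell|$ term matters, the sketch below (a hypothetical illustration, not the paper's experiment; the step size and matrix sizes are arbitrary) takes gradient steps on $\mathcal{L}$ and on the surrogate $\mathcal{L}'$ for the prior-only objective of Eq. (97): the former drives $R_\ell$ towards the MAP mode at zero, while the latter drives it towards the prior expectation, the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, lr, steps = 3, 50, 1e-2, 20_000

def run(use_logdet):
    V = 0.1 * rng.standard_normal((P, N))
    for _ in range(steps):
        grad = -V                                          # dL/dV, Eq. (98)
        if use_logdet:
            grad = grad + N * np.linalg.inv(V @ V.T) @ V   # + N_l (V V^T)^{-1} V, Eqs. (103)-(104)
        V = V + lr * grad
    return V @ V.T / N                                     # R = (1/N) V V^T

print(np.diag(run(False)))   # plain MAP: R collapses towards zero
print(np.diag(run(True)))    # surrogate objective L': R approaches the identity
```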
C.2. The sampling objective as modified maximum-likelihood under a Wishart prior

To further check that the Langevin sampling result is sensible, we note that it is very similar to doing MAP inference under a Wishart prior, but that sampling fixes pathologies in this procedure due to the skew inherent in the Wishart distribution. In particular, the Wishart probability density is given by,

$$\log P(K_\ell \mid J_\ell) = \log \mathrm{Wishart}\big(K_\ell; \, \tfrac{1}{N_\ell} J_\ell, \, N_\ell\big) \qquad (106)$$
$$= \tfrac{N_\ell - P - 1}{2} \log \big|K_\ell\big| - \tfrac{N_\ell}{2} \log \big|J_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(J_\ell^{-1} K_\ell\big) + \mathrm{const}. \qquad (107)$$

The pathologies arise if we compare the expectation and the mode of this distribution,

$$\mathbb{E}\big[K_\ell \mid J_\ell\big] = J_\ell \qquad (108a)$$
$$\mathrm{arg\,max}_{K_\ell} \big[\log P(K_\ell \mid J_\ell)\big] = \tfrac{N_\ell - P - 1}{N_\ell} J_\ell, \qquad (108b)$$

where the matrices $K_\ell$ and $J_\ell$ are $P \times P$, and $K_\ell$ is the inner product of $N_\ell$ vectors with covariance $\tfrac{1}{N_\ell} J_\ell$. Thus, the mode gives a very poor characterisation of the expectation of the distribution, to the extent that if $N_\ell = P + 1$, the mode is zero while the expectation can take on any value. Thankfully, it is possible to find a closely related optimization problem that gives a good characterisation of the mean. In particular, we need to incorporate a new term in the objective that counteracts the "shrinkage" induced by the skew in the Wishart, such that the mode of the new objective equals the expectation,

$$\mathrm{arg\,max}_{K_\ell} \Big[\log P(K_\ell \mid J_\ell) + \tfrac{P+1}{2} \log \big|K_\ell\big|\Big] = J_\ell. \qquad (109)$$

Critically, this term, $\tfrac{P+1}{2} \log |K_\ell|$, is almost entirely independent of the parameters (it depends only on the size, $P$), and the combined objective is equivalent to the objective for Langevin sampling, $\mathcal{L}'$,

$$\log P(K_\ell \mid J_\ell) + \tfrac{P+1}{2} \log \big|K_\ell\big| = \tfrac{N_\ell}{2} \log \big|J_\ell^{-1} K_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(J_\ell^{-1} K_\ell\big) + \mathrm{const} \qquad (110)$$
$$= \tfrac{N_\ell}{2} \log \big|U_\ell^{-T} K_\ell U_\ell^{-1}\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(U_\ell^{-T} K_\ell U_\ell^{-1}\big) + \mathrm{const} \qquad (111)$$
$$= \tfrac{N_\ell}{2} \log \big|R_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(R_\ell\big) + \mathrm{const}, \qquad (112)$$

if we consider a simple one-layer setup, with $\mathcal{L}$ given by Eq. (97).
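The one-dimensional family $K = c J$ gives a quick check of the shrinkage in Eq. (108b) and its correction in Eq. (109). This restriction to a scalar multiple of $J$ is my own simplification rather than something the paper does: the unmodified Wishart log-density peaks at $c = (N_\ell - P - 1)/N_\ell$, while the modified objective peaks at $c = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
P, N = 5, 8
A = rng.standard_normal((P, P))
J = A @ A.T + P * np.eye(P)

def log_wishart(K):
    # log Wishart(K; J/N, N) up to an additive constant, Eq. (107)
    _, ldK = np.linalg.slogdet(K)
    return 0.5 * (N - P - 1) * ldK - 0.5 * N * np.trace(np.linalg.solve(J, K))

def modified(K):
    # Eq. (109): add ((P+1)/2) log|K| to counteract the shrinkage
    _, ldK = np.linalg.slogdet(K)
    return log_wishart(K) + 0.5 * (P + 1) * ldK

cs = np.linspace(0.05, 2.0, 2000)
print(cs[np.argmax([log_wishart(c * J) for c in cs])])   # approx (N - P - 1)/N = 0.25
print(cs[np.argmax([modified(c * J) for c in cs])])      # approx 1.0
```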
C.3. Representation learning in deep networks

The log-probability of the data at the final layer can be written in the same form as the objective for Langevin sampling (Eq. 102), and the modified objective for Wishart inference (Eq. 110). In particular,

$$\log P(\mathbf{y}_\mu \mid K_L) = -\tfrac{1}{2} \mathbf{y}_\mu^T L_L^{-1} \mathbf{y}_\mu - \tfrac{1}{2} \log \big|L_L\big| + \mathrm{const}. \qquad (113)$$

Defining the constant kernel, $K_{L+1} = \tfrac{1}{Y} Y Y^T$, we can write the log-probability of $Y$ in a manner that is consistent with the previous kernels,

$$\log P(Y \mid K_L) = \tfrac{Y}{2} \Big(\log \big|L_L^{-1} K_{L+1}\big| - \mathrm{Tr}\big(L_L^{-1} K_{L+1}\big)\Big) + \mathrm{const}, \qquad (114)$$

where the log determinant of $K_{L+1}$ is constant, so can be included without changing the objective. As such, the full objective can be written as,

$$\mathcal{L} = \tfrac{1}{2} \sum_{\ell=1}^{L+1} N_\ell \Big(\log \big|L_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(L_{\ell-1}^{-1} K_\ell\big)\Big). \qquad (115)$$

When we differentiate, only the terms that vary with $K_\ell$ are relevant,

$$\mathcal{L} = \tfrac{N_\ell}{2} \Big(\log \big|L_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(L_{\ell-1}^{-1} K_\ell\big)\Big) + \tfrac{N_{\ell+1}}{2} \Big(\log \big|L_\ell^{-1} K_{\ell+1}\big| - \mathrm{Tr}\big(L_\ell^{-1} K_{\ell+1}\big)\Big) + \mathrm{const}. \qquad (116)$$

While the derivations up to this point have been the same, the gradients for fully connected, locally connected, and convolutional networks diverge.
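Eq. (114) is just a repackaging of the Gaussian log-likelihood of the output columns. The sketch below is my own consistency check, assuming the $Y$ columns of the output matrix are i.i.d. $\mathcal{N}(0, L_L)$ under the model; it confirms that the two forms differ only by a term that does not depend on $L_L$.

```python
import numpy as np

rng = np.random.default_rng(5)
P, Y = 4, 6   # P datapoints, Y output channels

def random_spd(p):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

Yout = rng.standard_normal((P, Y))
K_out = Yout @ Yout.T / Y                        # K_{L+1} = (1/Y) Y Y^T

def direct(LL):
    # sum of Gaussian log-densities of the Y output columns, each N(0, L_L)
    _, ld = np.linalg.slogdet(LL)
    quad = np.trace(np.linalg.solve(LL, Yout @ Yout.T))
    return -0.5 * quad - 0.5 * Y * ld - 0.5 * P * Y * np.log(2 * np.pi)

def repackaged(LL):
    # Eq. (114): (Y/2)(log|L_L^{-1} K_{L+1}| - Tr(L_L^{-1} K_{L+1})), up to a constant
    M = np.linalg.solve(LL, K_out)
    _, ld = np.linalg.slogdet(M)
    return 0.5 * Y * (ld - np.trace(M))

L1, L2 = random_spd(P), random_spd(P)
# the difference of the two forms across two choices of L_L should agree
print(np.isclose(direct(L1) - direct(L2), repackaged(L1) - repackaged(L2)))
```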
C.4. Fully connected networks

For fully connected networks,

$$L_\ell = K_\ell, \qquad (117)$$

so the terms in the objective that depend on $K_\ell$ are,

$$\mathcal{L} = \tfrac{N_\ell}{2} \Big(\log \big|K_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(K_{\ell-1}^{-1} K_\ell\big)\Big) + \tfrac{N_{\ell+1}}{2} \Big(\log \big|K_\ell^{-1} K_{\ell+1}\big| - \mathrm{Tr}\big(K_\ell^{-1} K_{\ell+1}\big)\Big) + \mathrm{const}. \qquad (118)$$

Differentiating the relevant terms,

$$\frac{\partial \, \mathrm{Tr}\big(K_\ell^{-1} K_{\ell+1}\big)}{\partial K_\ell} = -K_\ell^{-1} K_{\ell+1} K_\ell^{-1} \qquad (119a)$$
$$\frac{\partial \, \mathrm{Tr}\big(K_{\ell-1}^{-1} K_\ell\big)}{\partial K_\ell} = K_{\ell-1}^{-1} \qquad (119b)$$
$$\frac{\partial \log \big|K_{\ell-1}^{-1} K_\ell\big|}{\partial K_\ell} = \frac{\partial \log |K_\ell|}{\partial K_\ell} = K_\ell^{-1} \qquad (119c)$$
$$\frac{\partial \log \big|K_\ell^{-1} K_{\ell+1}\big|}{\partial K_\ell} = -\frac{\partial \log |K_\ell|}{\partial K_\ell} = -K_\ell^{-1} \qquad (119d)$$

We then set the gradients to zero,

$$0 = \frac{\partial \mathcal{L}}{\partial K_\ell} = -\tfrac{1}{2}\big(N_{\ell+1} - N_\ell\big) K_\ell^{-1} + \tfrac{N_{\ell+1}}{2} K_\ell^{-1} K_{\ell+1} K_\ell^{-1} - \tfrac{N_\ell}{2} K_{\ell-1}^{-1}. \qquad (120)$$

We pre-multiply by $2 K_\ell$,

$$0 = -\big(N_{\ell+1} - N_\ell\big) I + N_{\ell+1} K_{\ell+1} K_\ell^{-1} - N_\ell K_\ell K_{\ell-1}^{-1}, \qquad (121)$$

and note that the resulting expression can be written in terms of the ratio, $T_{\ell+1} = K_{\ell+1} K_\ell^{-1}$,

$$0 = -\big(N_{\ell+1} - N_\ell\big) I + N_{\ell+1} T_{\ell+1} - N_\ell T_\ell. \qquad (122)$$

Solving for $T_{\ell+1}$,

$$T_{\ell+1} = I + \tfrac{N_\ell}{N_{\ell+1}} \big(T_\ell - I\big). \qquad (123)$$

We use $N_\ell = N$ for $\ell \in \{1, \dots, L\}$, and $N_{L+1} = Y$,

$$T_\ell = \begin{cases} T & \text{for } \ell \in \{1, \dots, L\} \\ I + \tfrac{N}{Y}\big(T - I\big) & \text{for } \ell = L+1 \end{cases} \qquad (124)$$

To compute $T$, we use,

$$K_{L+1} K_0^{-1} = T_{L+1} T^L, \qquad (125)$$

substituting for $T_{L+1}$,

$$K_{L+1} K_0^{-1} = \big(I + \tfrac{N}{Y}(T - I)\big) T^L. \qquad (126)$$

As this cannot be solved analytically for $T$, we consider three special cases. First, if there are many outputs in comparison to the number of hidden units (i.e. $N/Y \to 0$),

$$\lim_{N/Y \to 0} T = \big(K_{L+1} K_0^{-1}\big)^{1/L}, \qquad (127)$$

and thus the top-level kernel is equal to the output kernel, i.e. $K_L = K_{L+1}$. Second, we consider the other extreme, where there are many more hidden units than output channels (i.e. $N/Y \to \infty$). In this limit, we must have $0 = T - I$, because otherwise the $\tfrac{N}{Y}(T - I)$ term will explode,

$$\lim_{N/Y \to \infty} T = I, \qquad (128)$$

thus the representation does not change as it flows through the network. Finally, we consider a more reasonable case where the number of hidden units is of the order of the number of output channels; in particular, we consider $Y = N$,

$$T = \big(K_{L+1} K_0^{-1}\big)^{1/(L+1)}, \qquad (129)$$

and as such, the top-layer kernel is almost, but not quite, equal to the output kernel, and it gets closer as the network gets deeper,

$$K_L = T^L K_0 = \big(K_{L+1} K_0^{-1}\big)^{L/(L+1)} K_0. \qquad (130)$$
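For the $Y = N$ case, Eqs. (129)-(130) give the stationary top-layer kernel in closed form. The short sketch below (not from the paper; matrix powers are again computed through the symmetrised form using `scipy.linalg.fractional_matrix_power`) shows $K_L$ moving towards the output kernel as the depth $L$ grows.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

rng = np.random.default_rng(6)
P = 3

def random_spd(p):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

K0, KL1 = random_spd(P), random_spd(P)          # input kernel K_0, output kernel K_{L+1}
K0_half = mpow(K0, 0.5)
K0_half_inv = np.linalg.inv(K0_half)

def top_kernel(L):
    # Eq. (130): K_L = (K_{L+1} K_0^{-1})^{L/(L+1)} K_0, via the symmetric form
    M = mpow(K0_half_inv @ KL1 @ K0_half_inv, L / (L + 1))
    return K0_half @ M @ K0_half

for L in (1, 4, 16, 64):
    print(L, np.linalg.norm(top_kernel(L) - KL1))   # the gap shrinks as the network gets deeper
```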
D. Natural gradients for a Gaussian-process sum kernel

We begin by defining the covariance (kernel) as the sum over a set of kernels, $K_i$, weighted by $\lambda_i$,

$$K = \sum_i \lambda_i K_i. \qquad (131)$$

Our goal is to find the maximum-likelihood $\lambda_i$ parameters using a natural-gradient method. The likelihood is,

$$\log P(Y) = -\tfrac{1}{2} \mathrm{Tr}\big(K^{-1} Y Y^T\big) - \tfrac{N}{2} \log \big|K\big| + \mathrm{const}, \qquad (132)$$

and the gradient is,

$$\frac{\partial \log P(Y)}{\partial \lambda_\alpha} = \tfrac{1}{2} \mathrm{Tr}\big(L_\alpha L_y\big) - \tfrac{N}{2} \mathrm{Tr}\big(L_\alpha\big), \qquad (133)$$

where,

$$L_\alpha = K^{-1} K_\alpha \qquad (134)$$
$$L_y = K^{-1} Y Y^T. \qquad (135)$$

For a natural-gradient method, we need the expected second derivatives. For the first term, these are,

$$\mathbb{E}\left[\frac{\partial}{\partial \lambda_\beta} \Big[\tfrac{1}{2} \mathrm{Tr}\big(L_\alpha L_y\big)\Big]\right] = \mathbb{E}\Big[-\tfrac{1}{2} \big(\mathrm{Tr}(L_\beta L_\alpha L_y) + \mathrm{Tr}(L_\alpha L_\beta L_y)\big)\Big] \qquad (136)$$
$$= -\tfrac{1}{2} \big(\mathrm{Tr}(L_\beta L_\alpha \mathbb{E}[L_y]) + \mathrm{Tr}(L_\alpha L_\beta \mathbb{E}[L_y])\big) \qquad (137)$$
$$= -\tfrac{N}{2} \big(\mathrm{Tr}(L_\beta L_\alpha) + \mathrm{Tr}(L_\alpha L_\beta)\big) \qquad (138)$$
$$= -N \, \mathrm{Tr}\big(L_\alpha L_\beta\big), \qquad (139)$$

using basic matrix identities, and the fact that, under the model, $\mathbb{E}[L_y] = N I$. The second term is independent of $Y$, so we can just compute the second derivative, $\frac{\partial}{\partial \lambda_\beta} \big[-\tfrac{N}{2} \mathrm{Tr}(L_\alpha)\big] = \tfrac{N}{2} \mathrm{Tr}(L_\beta L_\alpha)$. Thus,

$$\mathbb{E}\left[\frac{\partial^2}{\partial \lambda_\alpha \, \partial \lambda_\beta} \log P(Y)\right] = -\tfrac{N}{2} \mathrm{Tr}\big(L_\alpha L_\beta\big).$$
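These quantities are exactly what is needed for a damped Fisher (natural-gradient) update on the $\lambda_i$. The sketch below is my own illustration rather than the paper's implementation: the random component kernels, the damping factor of 0.5, the positivity clamp, and the jitter term are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
P, N, n_comp = 20, 5, 3          # P datapoints, N output columns, n_comp component kernels

# hypothetical fixed component kernels K_i (random SPD matrices)
Ks = []
for _ in range(n_comp):
    A = rng.standard_normal((P, P))
    Ks.append(A @ A.T + np.eye(P))

lam_true = np.array([0.5, 1.5, 0.2])
K_true = sum(l * Ki for l, Ki in zip(lam_true, Ks))
Yobs = np.linalg.cholesky(K_true) @ rng.standard_normal((P, N))   # columns ~ N(0, K_true)

lam = np.ones(n_comp)
for _ in range(100):
    K = sum(l * Ki for l, Ki in zip(lam, Ks)) + 1e-8 * np.eye(P)
    Kinv = np.linalg.inv(K)
    Ls = [Kinv @ Ki for Ki in Ks]                       # L_alpha = K^{-1} K_alpha, Eq. (134)
    Ly = Kinv @ Yobs @ Yobs.T                           # L_y = K^{-1} Y Y^T, Eq. (135)
    grad = np.array([0.5 * np.trace(La @ Ly) - 0.5 * N * np.trace(La) for La in Ls])   # Eq. (133)
    F = 0.5 * N * np.array([[np.trace(La @ Lb) for Lb in Ls] for La in Ls])            # Fisher matrix
    lam = np.maximum(lam + 0.5 * np.linalg.solve(F, grad), 1e-6)   # damped step, weights kept positive

print(lam)   # roughly recovers lam_true, up to sampling noise in Yobs
```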