Why bigger is not always better: on finite and infinite neural networks
Laurence Aitchison
University of Bristol, Bristol, UK. Correspondence to: Laurence Aitchison <[email protected]>.

Abstract
Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lack a key facet of the behaviour of real neural networks: the fixed kernel, determined only by network hyperparameters, implies that they cannot do any form of representation learning. The lack of representation or, equivalently, kernel learning leads to less flexibility and hence worse performance, giving a potential explanation for the inferior performance of infinite networks observed in the literature (e.g. Novak et al., 2019). We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network. This motivates the introduction of a new class of network: infinite networks with bottlenecks, which inherit the theoretical tractability of infinite networks while at the same time allowing representation learning.

One approach to understanding and improving neural networks is to perform Bayesian inference in an infinitely wide network (Lee et al., 2018; Matthews et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019). In this limit the outputs become Gaussian process distributed, enabling efficient and exact reasoning about uncertainty, and giving a means of interpretation using the parameter-free kernel function (which depends only on network hyperparameters such as depth). However, the performance of Bayesian infinite networks lags considerably behind state-of-the-art finite networks trained using SGD (e.g. compare performance in Garriga-Alonso et al. (2019), Novak et al. (2019) and Arora et al. (2019) against He et al. (2016) and Chen et al. (2018)). This seems surprising, because, to our knowledge, there are no reports of wider networks degrading classification performance (indeed, the opposite is sometimes argued; see Zagoruyko & Komodakis, 2016), and because exact Bayesian inference is provably optimal if the prior accurately describes our beliefs (Ramsey, 1926). Indeed, recent work on the Neural Tangent Kernel (NTK) (Li et al., 2019) has suggested that deterministic gradient descent in an infinite network gives slightly lower performance than Bayesian inference in the same network.

Our hypothesis is that the poor performance of Bayesian infinite networks arises because the top-layer representation (equivalent to the kernel) is fixed by the network hyperparameters, and thus cannot be learned from data. This breaks many of our key intuitions about why deep networks are effective. For instance, in transfer learning (Huh et al., 2016) we use a large-scale dataset such as ImageNet to learn a good high-level representation, then apply this representation to other tasks where less data is available. However, transfer learning is impossible in infinite Bayesian neural networks, because the top-layer representation is fixed by the network hyperparameters and so cannot be learned using e.g.
ImageNet.

To understand these issues, we analysed finite networks using tools from the infinite network literature (Lee et al., 2018; Matthews et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019). We begin by giving a toy, two-layer example, contrasting the flexibility of finite networks with the inflexibility of infinite networks, showing that flexible finite networks offer benefits under conditions of model-mismatch. We then introduce infinite networks with bottlenecks, which combine the theoretical tractability of infinite networks with the flexibility of finite networks. To obtain an analytic understanding of kernel/representation flexibility and learning in such networks, we consider linear infinite networks with bottlenecks, which are equivalent to finite deep linear networks. We took two approaches to characterising these networks. First, we considered the prior viewpoint, i.e. the covariance in the top-layer kernel induced by randomness in the lower-layer weights. In particular, we showed that narrower, deeper networks offer more flexibility, and that CNNs offer more flexibility than locally connected networks (LCNs) when the input is spatially structured. Second, we considered the posterior viewpoint, showing that under both MAP inference and posterior sampling, the representations in learned neural networks slowly transition from being similar to the input kernel (i.e. the inner product of the inputs) to being similar to the output kernel (i.e. the inner product of one-hot vectors representing the labels). We found an important difference between MAP inference and sampling: for MAP inference, the learned representations transition from the input to the output kernel, irrespective of the network width. Bayesian networks behave similarly when the network width and the number of output channels are equal, but as the network width increases, the learned representations become increasingly dominated by the prior, and insensitive to the outputs. Remarkably, we find that in a ResNet trained using SGD on CIFAR-10, the representation differs dramatically from the corresponding infinite network and is instead very close to the output kernel, as suggested by our deep linear results. This confirms the importance of working with a theoretical model, such as infinite networks with bottlenecks, that is capable of capturing representation learning.
1. Toy Example
In the introduction, we noted that infinite Bayesian networks perform worse than standard neural networks trained using stochastic gradient descent. Thus, as we make finite neural networks wider, there should be some point at which performance begins to degrade. We considered a simple, two-layer, fully-connected linear network with the full set of inputs denoted $X$, hidden unit activations denoted $H$, and 10-dimensional outputs denoted $Y$,
$$H = X W, \qquad Y = H V + \sigma \Xi, \tag{1}$$
where $\Xi$ is IID standard Gaussian noise, $W$ is the input-to-hidden weight matrix and $V$ is the hidden-to-output weight matrix, whose columns, $w_\mu$ and $v_\nu$, are generated IID from,
$$P(w_\mu) = \mathcal{N}\left(w_\mu; 0, \tfrac{1}{X} I\right), \qquad P(v_\nu) = \mathcal{N}\left(v_\nu; 0, \tfrac{1}{H} I\right), \tag{2}$$
and where the variance of the weights is normalised by the number of inputs to that layer, $X = 4$ for the 4-dimensional input, and $H$ for the width of the hidden layer.

In the first example (Fig. 1 left), we generated targets for supervised learning using a second neural network with weights generated as described above, with $H_\text{gen} \in \{1, 2, 4\}$ hidden units. We evaluated the Bayesian model evidence for networks with many different numbers of hidden units (x-axis). Bayesian reasoning would suggest that the model evidence for the true model (i.e. with a matched number of hidden units) should be higher than the model evidence for any other model, as indeed we found (Fig. 1 top left), and these patterns held true for the predictive probability, or equivalently test performance (Fig. 1 bottom left). While these results give an example where smaller networks perform better, they do not necessarily help us to understand the behaviour of neural networks on real datasets, where the true generative process for the data is not known, and is not in our model class. As such, we considered two further examples where the neural network generating the targets lay outside of our model class. In particular, we again generated target outputs by sampling a "true" network from the prior, but we modified the inputs to this network, first by multiplying those inputs by 100 (Fig. 1 middle), then by zeroing-out all but the first input unit (Fig. 1 right). Critically, we ensured model-mismatch by putting the original, unmodified inputs into the trained networks. In both of these experiments, there was an optimum number of hidden units, after which performance degraded as more hidden units were included.

To understand why this might be the case, it is insightful to consider the methods we used to evaluate the model evidence and generate these results. In particular, note that, conditioned on $H$, the output for any given channel, $y_\nu$, is IID and depends only on the corresponding column of the output weights, $v_\nu$,
$$P(Y \mid V, H) = \prod_\nu P(y_\nu \mid v_\nu, H) = \prod_\nu \mathcal{N}\left(y_\nu; H v_\nu, \sigma^2 I\right). \tag{3}$$
Thus, we can integrate over the output weights, $v_\nu$, to obtain a distribution over $Y$ conditioned on $H$,
$$P(Y \mid H) = \prod_\nu P(y_\nu \mid H) = \prod_\nu \mathcal{N}\left(y_\nu; 0, \tfrac{1}{H} H H^T + \sigma^2 I\right). \tag{4}$$
This is the classical Gaussian process representation of Bayesian linear regression (Rasmussen & Williams, 2006). Remembering that the hidden activities, $H$, are a deterministic function of the weights, $W$, and inputs, $X$, we can write this distribution as,
$$P(y_\nu \mid H) = P(y_\nu \mid W, X) = \mathcal{N}\left(y_\nu; 0, \tfrac{1}{H} X W W^T X^T + \sigma^2 I\right). \tag{5}$$
Thus, the first-layer weights, $W$, act as kernel hyperparameters in a Gaussian process: they control the covariance of the outputs, $y_\nu$.
Figure 1. A toy fully-connected, two-layer Bayesian linear network showing situations in which smaller networks perform better than larger networks. The red dots indicate the optimal number of hidden units in that simulation. Left: training data generated from the prior with $H_\text{gen}$ hidden units. Middle: training data generated from the prior with $H_\text{gen} = 4$ but where we scale up the inputs by a factor of 100. Right: training data generated from the prior with $H_\text{gen} = 4$, but where we zero out all but the first input dimension. Top: Bayesian model evidence. Bottom: predictive log-probability, or equivalently test error.

To evaluate the model evidence we need to integrate over $W$,
$$P(Y \mid X) = \int dW\, P(W) \prod_\nu P(y_\nu \mid W, X) = \mathbb{E}_{P(W)}\left[\prod_\nu P(y_\nu \mid W, X)\right], \tag{6}$$
and we estimate this integral by drawing
64,000 samples from the prior,
$P(W)$. Importantly, while $W$ provides flexibility in the kernel in finite networks, this flexibility gradually disappears as we consider wider hidden layers. In particular,
$$\lim_{H \to \infty} \tfrac{1}{H} W W^T = \lim_{H \to \infty} \tfrac{1}{H} \sum_{\mu=1}^{H} w_\mu w_\mu^T = \mathbb{E}\left[w_\mu w_\mu^T\right] = \tfrac{1}{X} I. \tag{7}$$
Therefore, in this limit, the distribution over $Y$ converges to,
$$\lim_{H \to \infty} P(Y \mid X) = \prod_\nu \mathcal{N}\left(y_\nu; 0, \tfrac{1}{X} X X^T + \sigma^2 I\right). \tag{8}$$
This is exactly the distribution we would expect from Bayesian linear regression in a one-layer network. Thus, by taking the infinite limit, we have eliminated the additional flexibility afforded by the two-layer network, and we can see that the superior performance of smaller networks in Fig. 1 emerges because they give additional flexibility in the covariance of the outputs, which gradually disappears as network size increases. Finally, note that sampling from the prior works well here both because of the concentration result above, and because we use a relatively small amount of data, 20 points.
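To make this concrete, the sketch below shows how the Monte Carlo estimate of Eq. (6) can be computed. It is a minimal illustration rather than the code behind Fig. 1: the data-generating settings, noise level, helper name `log_evidence` and the (much smaller) sample count are assumptions made purely for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
P, X_dim, Y_dim, sigma2 = 20, 4, 10, 0.1      # 20 datapoints, 4 inputs, 10 outputs

# Toy data: targets drawn from a "true" two-layer linear network with H_gen hidden units.
H_gen = 4
X = rng.standard_normal((P, X_dim))
W_true = rng.standard_normal((X_dim, H_gen)) / np.sqrt(X_dim)   # prior of Eq. (2)
V_true = rng.standard_normal((H_gen, Y_dim)) / np.sqrt(H_gen)
Y = X @ W_true @ V_true + np.sqrt(sigma2) * rng.standard_normal((P, Y_dim))

def log_evidence(H, n_samples=2_000):
    """Estimate log P(Y|X) for a two-layer linear network with H hidden units by
    averaging the Gaussian likelihood of Eq. (5) over prior samples of W (Eq. 6)."""
    log_liks = np.empty(n_samples)
    for s in range(n_samples):
        W = rng.standard_normal((X_dim, H)) / np.sqrt(X_dim)     # prior sample of W
        K = (X @ W) @ (X @ W).T / H + sigma2 * np.eye(P)         # covariance in Eq. (5)
        # the 10 output channels are IID given W, so their log-densities add
        log_liks[s] = multivariate_normal.logpdf(Y.T, mean=np.zeros(P), cov=K).sum()
    # log of the average likelihood, computed stably
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)

for H in [1, 2, 4, 16, 128]:
    print(H, log_evidence(H))
```

With more samples (the paper uses 64,000) the estimate stabilises, and the model evidence peaks near the matched number of hidden units, as in Fig. 1.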
2. Infinite networks with finite bottlenecks
In the previous section, we considered the simplest networks in which these phenomena emerge: a two-layer, linear network. In this section, we set up a full infinite network with bottlenecks and show that activity flowing through this network can be understood entirely in terms of kernel and covariance matrices.

Consider a single layer within a fully-connected network, where the potentially infinite activity at the previous layer, $H_{\ell-1}$, corresponding to a batch containing all inputs, is multiplied by a weight matrix, $W_\ell$, to give a finite number of activations, $A_\ell$. This activation matrix, $A_\ell$, is multiplied by another matrix, $M_\ell$, to give a potentially infinite updated activation matrix, $A'_\ell$, which is then passed through a non-linearity, $\phi$, to give the potentially infinite activity at this layer, $H_\ell$ (Fig. 2 summarises the corresponding feature-space and kernel quantities). Note that following Matthews et al. (2018), we use "activation" pre-nonlinearity and "activity" post-nonlinearity.
$$A_\ell = H_{\ell-1} W_\ell, \qquad A'_\ell = A_\ell M_\ell, \qquad H_\ell = \phi(A'_\ell) \tag{9}$$
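Before turning to the kernel-space description, the following minimal sketch shows one layer of Eq. (9) with the infinite width $M_\ell$ approximated by a large finite value; the dimensions and the ReLU nonlinearity are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, M_prev, N_l, M_l = 16, 2048, 8, 2048     # bottleneck width N_l is small; the M's are "large"

H_prev = rng.standard_normal((P, M_prev))                      # activity at the previous layer
W_l = rng.standard_normal((M_prev, N_l)) / np.sqrt(M_prev)     # prior of Eq. (11)
M_mat = rng.standard_normal((N_l, M_l)) / np.sqrt(N_l)         # prior of Eq. (12)

A_l = H_prev @ W_l           # finite bottleneck activations, P x N_l
A_prime = A_l @ M_mat        # "infinite" pre-nonlinearity activations, P x M_l
H_l = np.maximum(A_prime, 0) # activity after an (illustrative) ReLU nonlinearity

# the kernels of A_l and A'_l agree as M_l grows (Eq. 15)
K_l = A_l @ A_l.T / N_l
K_prime = A_prime @ A_prime.T / M_l
print(np.max(np.abs(K_l - K_prime)))   # small for large M_l
```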
Figure 2. The relationships between the feature-space and kernel representations of the neural network. For a typical finite neural network, $M_\ell = I$, so $M_\ell = N_\ell$. For a finite-infinite network (which allows us to compute $L_\ell$ from $K_\ell$), we send $M_\ell \to \infty$, and draw the elements of $M_\ell$ IID from a Gaussian distribution with zero mean and variance $1/N_\ell$.

Here, the input data is $H_0 = X$, and the matrices have shapes
$$H_\ell \in \mathbb{R}^{P \times M_\ell}, \quad A_\ell \in \mathbb{R}^{P \times N_\ell}, \quad A'_\ell \in \mathbb{R}^{P \times M_\ell}, \quad W_\ell \in \mathbb{R}^{M_{\ell-1} \times N_\ell}, \quad M_\ell \in \mathbb{R}^{N_\ell \times M_\ell}. \tag{10}$$
For an infinite network with bottlenecks, we take the limit as $M_\ell$ goes to infinity, leaving $N_\ell$ finite. As such, the activity before, $A'_\ell$, and after, $H_\ell$, the nonlinearity is infinite, with a finite linear bottleneck formed by $A_\ell$.

For a fully-connected network, the columns of $W_\ell$ and $M_\ell$, denoted $w_{\ell\lambda}$ and $m_{\ell\lambda}$, are generated IID from a Gaussian distribution,
$$P(W_\ell) = \prod_{\lambda=1}^{N_\ell} P(w_{\ell\lambda}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(w_{\ell\lambda}; 0, \tfrac{1}{M_{\ell-1}} I\right), \tag{11}$$
$$P(M_\ell) = \prod_{\lambda=1}^{M_\ell} P(m_{\ell\lambda}) = \prod_{\lambda=1}^{M_\ell} \mathcal{N}\left(m_{\ell\lambda}; 0, \tfrac{1}{N_\ell} I\right), \tag{12}$$
where the normalization constants, $1/M_{\ell-1}$ and $1/N_\ell$, ensure that activations remain normalized as they flow through the network.

Following the infinite network literature, we would like to characterise activity flowing through the network in terms of the activation kernel, $K_\ell$, and activity kernel, $L_\ell$,
$$K_\ell \equiv \tfrac{1}{N_\ell} A_\ell A_\ell^T, \qquad L_\ell \equiv \tfrac{1}{M_\ell} H_\ell H_\ell^T, \qquad L_0 = \tfrac{1}{M_0} X X^T. \tag{13}$$
We begin by characterising the relationship between $A'_\ell$ and $A_\ell$. As each channel (column) of $A'_\ell$ is a linear function of the corresponding channel of the weights, $a'_{\ell\lambda} = A_\ell m_{\ell\lambda}$, these activations are Gaussian and IID conditioned on $A_\ell$,
$$P(A'_\ell \mid A_\ell) = \prod_{\lambda=1}^{M_\ell} P(a'_{\ell\lambda} \mid A_\ell) = \prod_{\lambda=1}^{M_\ell} \mathcal{N}\left(a'_{\ell\lambda}; 0, K_\ell\right) = P(A'_\ell \mid K_\ell), \tag{14}$$
and taking the limit of $M_\ell \to \infty$,
$$\lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} M_\ell M_\ell^T = \tfrac{1}{N_\ell} I, \qquad \lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} A'_\ell (A'_\ell)^T = \lim_{M_\ell \to \infty} \tfrac{1}{M_\ell} A_\ell M_\ell M_\ell^T A_\ell^T = \tfrac{1}{N_\ell} A_\ell A_\ell^T = K_\ell. \tag{15}$$
Thus, the kernel for $A_\ell$ is equivalent to the kernel for $A'_\ell$ in infinite networks with finite bottlenecks (Fig. 2).

Next, consider computing $K_\ell$ from $L_{\ell-1}$. As each channel (column) of the activations is a linear function of the corresponding channel of the weights, $a_{\ell\lambda} = H_{\ell-1} w_{\ell\lambda}$, the activations are Gaussian and IID conditioned on the activity at the previous layer,
$$P(A_\ell \mid H_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} P(a_{\ell\lambda} \mid H_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(a_{\ell\lambda}; 0, J_\ell\right) = P(A_\ell \mid J_\ell), \tag{16}$$
with covariance $J_\ell$.
For a fully connected network, the covariance, $J_\ell$, is equal to the previous layer's activity kernel, $L_{\ell-1}$,
$$J_\ell = L_{\ell-1} = \tfrac{1}{M_{\ell-1}} H_{\ell-1} H_{\ell-1}^T, \tag{17}$$
but the relationship is more complex in convolutional architectures (Garriga-Alonso et al., 2019; Novak et al., 2019) (Appendix A.2). As $A_\ell$ is always finite and random, $K_\ell$ is also a random variable, and inspecting the above expressions, its distribution can be written as a Wishart, centered on $L_{\ell-1}$.

Finally, consider computing $L_\ell$ from $K_\ell$. Note that as both $A'_\ell$ and $H_\ell$ are infinite, we can directly use standard results from infinite neural networks, i.e. those from Cho & Saul (2009), as in Lee et al. (2018); Matthews et al. (2018); Garriga-Alonso et al. (2019); Novak et al. (2019).

Linear infinite networks with finite bottlenecks can be obtained by setting $H_\ell = \phi(A'_\ell) = A'_\ell$, implying that $L_\ell = K_\ell$. Critically, this is equivalent to a deep linear network obtained by, in addition, setting $M_\ell = I$ so that $A'_\ell = A_\ell$ and $M_\ell = N_\ell$, as these choices imply that $A_\ell = A'_\ell = H_\ell$ so that again, $L_\ell = K_\ell$.

Given this setup, we can see that even a finite nonlinear network (i.e. with $M_\ell = I$) is a deep Gaussian process. In particular, in a deep Gaussian process, the activations at layer $\ell$, denoted $A_\ell$, consist of $N_\ell$ IID channels that are Gaussian-process distributed (Eq. 16), with a kernel/covariance determined by the activations at the previous layer. For a fully connected network,
$$J_\ell = L_{\ell-1} = \tfrac{1}{N_{\ell-1}} \phi(A_{\ell-1})\, \phi^T(A_{\ell-1}). \tag{18}$$
The relationship between finite neural networks and deep GPs is worth noting, because the same intuition, of the lower layers shaping the top-layer kernel, arises in both scenarios (e.g. Bui et al., 2016), and because there is potential for applying GP inference methods to neural networks, and vice versa.
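Because, for a linear network, every quantity above is a $P \times P$ matrix, the prior over a deep linear network can be simulated entirely in kernel space. The sketch below does this for a fully-connected linear network using Eqs. (13), (16) and (17); the depth, width and inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
P, M0, N, L = 10, 4, 32, 8                  # datapoints, input dimension, width, depth

X = rng.standard_normal((P, M0))
L0 = X @ X.T / M0                           # input kernel, Eq. (13)

K = L0
for layer in range(L):
    J = K                                   # linear network: J_l = L_{l-1} = K_{l-1}, Eq. (17)
    # each of the N channels of A_l is an IID Gaussian with covariance J (Eq. 16)
    A = np.linalg.cholesky(J + 1e-10 * np.eye(P)) @ rng.standard_normal((P, N))
    K = A @ A.T / N                         # stochastic activation kernel, Eq. (13)

print(np.diag(K))                           # stays O(1), but fluctuates around diag(L0)
```

The fluctuations of $K_\ell$ around its mean are exactly the kernel flexibility studied in the next section.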
3. The prior view on kernel flexibility
We can analyse how flexibility in the kernel emerges by looking at the variability (i.e. the variance and covariance) of $J_\ell$, $K_\ell$ and $L_\ell$. If the prior gives a stochastic kernel with higher variance, then it will be easier to shape that kernel by conditioning on data. In the appendix, we derive recursive updates for deep, linear, convolutional networks, but here, for simplicity, we give the fully-connected updates,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right], \tag{19a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] \approx \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right), \tag{19b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right], \tag{19c}$$
where,
$$\langle J^\ell_{ij}\rangle = \mathbb{E}\left[J^\ell_{ij} \mid L_0\right] = L^0_{ij}, \tag{19d}$$
and where $i$, $j$, $k$ and $l$ index datapoints.

This expression predicts that the variance of the kernel is proportional to the depth (including the last layer; $L+1$) and inversely proportional to the width, $N$,
$$\mathbb{C}\left[K^{L+1}_{ij}, K^{L+1}_{kl} \mid L_0\right] \approx \tfrac{L+1}{N}\left(L^0_{ik} L^0_{jl} + L^0_{il} L^0_{jk}\right). \tag{20}$$
This expression is so simple because, for a fully-connected linear network, the expected covariance at each layer is the same. For nonlinear and convolutional or locally-connected networks the covariance is still proportional to $1/N$, but the depth-dependence becomes more complex, as the covariance changes as it propagates through layers.

To check the validity of these expressions, we sampled 10,000 neural networks from the prior, and evaluated the variance of the kernel for a single input (Fig. 3). These inputs were either spatially unstructured (i.e. white noise), or spatially structured, in which case the inputs were the same across the whole image. For fully connected networks, we confirmed that the variance of the kernel is proportional to the depth including the last layer, $L+1$, and inversely proportional to width, $N$ (Fig. 3A). For locally connected networks, we found that structured and unstructured inputs gave the same kernel variance, which is expected as any spatial structure is destroyed after the first layer (Fig. 3B). Further, for convolutional networks with structured input, the variance of the kernel was proportional to network depth (Fig. 3C bottom), but whenever that spatial structure was absent, either because it was absent in the inputs or because it was eliminated by an LCN (Fig. 3BC bottom), the variance of the kernel was almost constant with depth (see Appendix A.2.1).

The large decrease in kernel flexibility for locally connected networks might be one reason behind the result in Novak et al. (2019) that locally connected networks have performance that is very similar to an infinite-width network, in which all flexibility has been eliminated. In essence, for a locally connected network, we sample the weights for each spatial region independently, so we in effect average over more IID random variables, reducing the variance of the kernel at the next layer, and hence reducing the possibility for data to shape that representation. In contrast, for a convolutional network we share weights across locations, increasing the variance in the kernel, and hence increasing the possibility for data to shape the representation.
Finally, as the spatial input size, $S$, increases, for convolutional networks with spatially structured inputs, the variance of the kernel is constant, whereas for locally connected or spatially unstructured inputs, the variance falls (Fig. 3D).
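The fully-connected check behind Fig. 3A can be sketched as follows: sample networks from the prior and compare the empirical variance of $K^{L+1}_{ii}$ for a single normalized input against Eq. (20). This is only an illustrative re-implementation under assumed settings, with far fewer than the paper's 10,000 samples.

```python
import numpy as np

rng = np.random.default_rng(3)
M0, N, L, n_samples = 16, 128, 4, 2000

x = rng.standard_normal(M0)
x = x / np.linalg.norm(x) * np.sqrt(M0)     # normalize so the input kernel L0_ii = 1
L0_ii = x @ x / M0

k_samples = np.empty(n_samples)
for s in range(n_samples):
    h = x
    for layer in range(L + 1):              # L hidden layers plus the last layer
        n_in = h.shape[0]
        W = rng.standard_normal((n_in, N)) / np.sqrt(n_in)
        h = h @ W
    k_samples[s] = h @ h / N                # K^{L+1}_{ii} for this prior sample

theory = (L + 1) / N * 2 * L0_ii**2         # Eq. (20) with i = j = k = l
print(k_samples.var(), theory)
```

With $N \gg L$ the two numbers agree closely, matching the regime in which the approximation in Eq. (19b) is valid.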
4. The posterior view on kernel flexibility
An alternative approach to understanding flexibility in finite neural networks is to consider the posterior viewpoint: how learning shapes top-level representations. To obtain analytical insights, we considered maximum a-posteriori and sampling-based inference in a deep, fully-connected, linear network. In both cases, we found that learned neural networks shift the representation from being close to the input kernel, defined by,
$$K_0 = L_0 = \tfrac{1}{M_0} X X^T, \tag{21}$$
to being close to the output kernel, defined by,
$$K_{L+1} = \tfrac{1}{N_{L+1}} Y Y^T. \tag{22}$$

Figure 3. The variance of the stochastic kernel induced by randomly sampling weights in finite, linear, fully connected and convolutional networks, with spatially structured and unstructured inputs. We use normalized inputs and circular convolutions to ensure that the kernel's expected value remains equal to 1 at all locations as it propagates through the network. The dashed lines in all plots display the theoretical approximation (Eq. 19), which is valid when the width is much greater than the number of layers. The solid lines display the empirical variance of the kernel from 10,000 simulations. A: The variance of the kernel for fully connected networks, plotted against network width, $N$, for shallow (blue; $L+1=1$) and deep (orange; $L+1=16$) networks (top), and plotted against network depth, $L+1$, for narrow (green; $N=64$) and wide (red; $N=1024$) networks (bottom). B: The variance of the kernel for locally connected networks with spatially structured and unstructured inputs, plotted against the number of channels, $N$, and against network depth, $L+1$. Note that the structured line lies underneath the unstructured line. The inputs are 1-dimensional with $S=32$ spatial locations. C: As in B, but for convolutional networks. D: The variance of the kernel as a function of the input spatial size, $S$, for deep ($L+1=16$) LCNs (top) and CNNs (bottom) with spatially structured and unstructured inputs.

In particular, under MAP inference, the shape of the kernel smoothly transitions from the input to the output kernel (Appendix B.2),
$$K_\ell = \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0, \tag{23}$$
where $N_{\ell<}$ is the geometric average of the width in layers $\ell+1$ to $L+1$, and $N_{\le\ell}$ is the geometric average of the width in layers 1 to $\ell$. Thus, the kernels (and the underlying weights) at each layer can be made arbitrarily large or small by changing the width, despite the prior distribution being chosen specifically to ensure that the scale of the kernels was invariant to network width. This is an issue inherent to the use of MAP inference, which often finds modes that give a poor characterisation of the Bayesian posterior. In contrast, if we sample the weights using Langevin sampling (Appendix C), and set all the intermediate widths, from $N_1$ to $N_L$, to $N$, then we get a similar expression,
$$K_\ell = \left(K_L K_0^{-1}\right)^{\ell/L} K_0, \tag{24}$$
where the kernels slowly transition from $K_0$ to $K_L$. The key difference is that the similarity between the top-layer representation, $K_L$, and the output kernel, $K_{L+1}$, depends on the ratio between the network width, $N$, and the number of output units, $Y = N_{L+1}$. In particular, if $Y = N$, then we get a relationship very similar to that for MAP inference,
$$K_\ell = \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{25}$$
However, as the network width grows very large, the prior begins to dominate, and the posterior becomes dominated by the prior,
$$\lim_{N/Y \to \infty} K_\ell = K_0, \tag{26}$$
as $K_L = K_0$.
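The interpolation in Eqs. (23-25) is straightforward to evaluate numerically using matrix fractional powers. The sketch below assumes equal widths (so the width-dependent prefactor in Eq. 23 is one) and uses random stand-ins for the input and output kernels rather than kernels from a real network.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

rng = np.random.default_rng(4)
P, L = 6, 5

def random_kernel(P, rank):
    A = rng.standard_normal((P, rank))
    return A @ A.T / rank + 1e-6 * np.eye(P)

K0 = random_kernel(P, 8)      # stand-in for the input kernel
KL1 = random_kernel(P, 3)     # stand-in for the output kernel (low rank, like one-hot labels)

for l in range(L + 2):
    # Eq. (25): K_l = (K_{L+1} K_0^{-1})^{l/(L+1)} K_0
    K_l = fractional_matrix_power(KL1 @ np.linalg.inv(K0), l / (L + 1)) @ K0
    K_l = np.real(K_l)        # tiny imaginary parts can appear from the Schur form
    print(l, np.linalg.norm(K_l - K0), np.linalg.norm(K_l - KL1))
```

As $\ell$ runs from 0 to $L+1$, the interpolated kernel moves monotonically away from $K_0$ and towards $K_{L+1}$.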
Finally, if the network width is small in comparison to the number of output units,
$$\lim_{N/Y \to 0} K_\ell = \left(K_{L+1} K_0^{-1}\right)^{\ell/L} K_0, \tag{27}$$
as the top-layer kernel converges to the output, $K_L = K_{L+1}$.

Figure 4. Comparison of kernels for finite and infinite neural networks at different layers. All kernels are computed on test data. A (top): Correlation (coefficient) between the kernel defined by the infinite network, and the kernel defined by a finite network after different numbers of training epochs. A (bottom): Correlation (coefficient) between the kernel defined by the finite network after different numbers of training epochs, and the output kernel defined by taking the inner product of one-hot vectors representing the class label. B (top): The Gaussian process marginal likelihood for the 10 functions given by the one-hot class labels, evaluated using the kernel output by different ResNet blocks. B (bottom): The fraction of variance in the direction of the one-hot output class labels. C (top): The eigenvalues of the kernel defined by the infinite network as we progress through layers, compared to a $-1$ power law (grey). C (bottom): The eigenvalues of the kernel defined by the finite network after 200 training epochs, as we progress through ResNet blocks.

The above results suggest that finite neural networks perform well by giving flexibility to interpolate between the input kernel and output kernel. To see how this happens in real neural networks, we considered a 34-layer ResNet without batchnorm, corresponding to the infinite network in Garriga-Alonso et al. (2019), trained on CIFAR-10. We began by computing the correlation between elements of the finite and infinite kernel (Fig. 4A top) as we go through ResNet blocks (x-axis), and as we go through training (blue lines). As expected, the randomly initialized, untrained network retains a high correlation with the infinite kernel at all layers, though the correlation is somewhat smaller for higher layers, as there has been more time for random sampling to build up discrepancies. However, for trained networks, this correspondence between the finite and infinite networks is far weaker: even at the first layer the correlation is substantially reduced, and as we go through layers, the correlation decreases to almost zero. To understand whether kernels were being actively shaped, we computed the correlation between the kernel for the finite network and the output kernel, defined by taking the inner product of vectors representing the one-hot class labels (Fig. 4A bottom). We found that while the correlation for the untrained network decreased across layers, training gives strong positive correlations with the output kernel, and these correlations increase as we move through network layers. Combined, these results indicate that the top-layer representation is much closer to the output kernel, as suggested by the deep linear results, than it is to the corresponding infinite network.

While correlation is a useful, simple measure of similarity, there are other measures of similarity that take into account the special structure of kernel matrices. In particular, we considered the marginal likelihood for the one-hot outputs corresponding to the class label, under a GP with a kernel given by a scaled sum of the kernel at that ResNet block and the identity (see Appendix D; Fig. 4B top). For the infinite network, the marginal likelihood increased somewhat as we moved through network layers, and the untrained finite network performed similarly, except that there was a decrease in performance at the last layer.
In contrast, the marginal likelihood for the finite, trained networks was initially very close to that of the infinite networks, but grew rapidly as we move through ResNet blocks.

To gain an insight into how training shaped the neural network kernels, we computed the variance in the subspace defined by the one-hot outputs (i.e. the classification directions; Fig. 4B bottom). We might have expected to see a steady increase in the variance in this subspace as we move through layers, but in fact the level was very small, only rising appreciably at the final block, and only for trained networks. To try to understand these results, we computed the eigenvalue spectrum of the kernels. For the infinite network (Fig. 4C top), we found that the eigenvalue spectrum at all levels decayed as a $-1$ power law. This is expected at the lowest level due to the well-known $1/f$ power spectrum of images (Van der Schaaf & van Hateren, 1996), but is not necessarily the case at higher levels. Given that the power spectrum of the output kernel is just a small set of equal-sized eigenvalues corresponding to the class labels (Fig. 4C bottom, green line), we might expect the eigenspectrum of finite networks to gradually get steeper as we move through network layers. In fact, we find the opposite: for intermediate layers, the eigenvalue spectrum becomes flatter, which can be interpreted as the network attempting to retain as much information as possible about all aspects of the image. It is only at the last layer that the relevant information is selected, giving an eigenvalue spectrum with around 10 large and roughly equally-sized eigenvalues, followed by much smaller eigenvalues, which mirrors the spectrum of the output kernel. This again confirms that the top-layer representation in trained networks is much closer to the output kernel than it is to the corresponding infinite network.
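The quantities plotted in Fig. 4 reduce to simple operations on $P \times P$ kernel matrices. The sketch below shows how the correlation with the output kernel and the eigenspectrum can be computed; the feature matrix standing in for a ResNet block's activity and the random labels are hypothetical placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(5)
P, n_classes, n_features = 200, 10, 512

labels = rng.integers(0, n_classes, size=P)
Y_onehot = np.eye(n_classes)[labels]
K_out = Y_onehot @ Y_onehot.T / n_classes        # output kernel, Eq. (22)

features = rng.standard_normal((P, n_features))  # stand-in for a block's flattened activity
K_feat = features @ features.T / n_features      # kernel of the (hypothetical) finite network

def kernel_correlation(K1, K2):
    """Correlation coefficient between the elements of two kernel matrices (Fig. 4A)."""
    a, b = K1.ravel(), K2.ravel()
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print("corr with output kernel:", kernel_correlation(K_feat, K_out))

eigvals = np.linalg.eigvalsh(K_feat)[::-1]       # eigenspectrum, largest first (Fig. 4C)
print("top eigenvalues:", eigvals[:5])
```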
5. Related work
Agrawal et al. (2020) independently introduced infinite networks with finite bottlenecks, but then made a very different contribution in that context. In particular, they highlighted that if we take the limit as some layers of a neural network go to infinity, convergence to the infinite networks with bottlenecks considered here is not immediate, but requires the neural network components to exhibit sufficient uniformity with respect to their inputs. In contrast, we show that finite bottlenecks can introduce flexibility and thereby improve performance even in two-layer linear networks, give analytic results in the case of linear networks, and show that these considerations are likely to be important in realistic large-scale networks, by showing that the kernel for a trained ResNet differs dramatically from that for the corresponding infinite network.

Technically, our work bears similarity to classical work on the dynamics of gradient descent in unregularised deep linear networks (Saxe et al., 2013). Importantly, the lack of regularisation in this work implies that infinitely many optimal solutions are available (e.g. all the lower-layer weights being fixed to the identity). In contrast, we focused on Bayesian inference, but also considered the optimal solution for regularised networks, which is much more constrained.
6. Conclusions
We have shown that finite Bayesian neural networks have more flexibility than infinite networks, and that this may explain the superior performance of finite networks. Thus, we introduced infinite networks with bottlenecks, and argue that, as they incorporate flexibility and are able to perform representation learning, they may be a better model of real neural networks. We then assessed the flexibility of deep linear networks from two perspectives. First, we looked at the prior viewpoint: the variability in the top-layer kernel induced by the prior over a finite neural network. Second, we looked at the posterior viewpoint: the ability of the learning process to shape the top-layer kernel. Under both MAP inference and sampling in finite networks, learning gradually shaped top-layer representations so as to match the output kernel. But, as Bayesian neural networks increase in width, the kernels become gradually less flexible, eliminating the possibility for learning to shape the kernel. In contrast, for MAP inference, the degree of kernel shaping is not affected by network width, and this additional flexibility might be an avenue for overfitting.
Acknowledgements
I would like to thank Adrià Garriga-Alonso, Sebastian Ober and Vidhi Lalchand for useful discussions.
References
Agrawal, D., Papamarkou, T., and Hinkle, J. Wide neural networks with bottlenecks are deep Gaussian processes. arXiv preprint arXiv:2001.00921, 2020.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., and Turner, R. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Cho, Y. and Saul, L. K. Kernel methods for deep learning. NeurIPS, 2009.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. Deep convolutional networks as shallow Gaussian processes. ICLR, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.

Huh, M., Agrawal, P., and Efros, A. A. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. ICLR, 2018.

Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., and Arora, S. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.

Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. ICLR, 2019.

Ramsey, F. P. Truth and probability. In Readings in Formal Epistemology, pp. 21–45. Springer, 1926.

Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

van der Schaaf, A. and van Hateren, J. H. Modelling the power spectra of natural images: statistics and information. Vision Research, 36(17):2759–2770, 1996.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
A. Kernel flexibility: prior viewpoint
To compute the covariance, which we denote $\mathbb{C}[\cdot]$, of the kernel for a deep network, we consider a recursion where we start with $\mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right]$, then compute the resulting $\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right]$, then compute the resulting $\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right]$. In particular, we apply the law of total covariance for $K_\ell \mid J_\ell$, and we consider linear networks for which $L_\ell = K_\ell$,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \,? \tag{28a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[\mathbb{E}\left[K^\ell_{ij} \mid J_\ell\right], \mathbb{E}\left[K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] + \mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] \tag{28b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] \tag{28c}$$
The first equation is different for fully connected and convolutional networks, so we give its form later.

Eq. (28b) always behaves in the same way for linear and nonlinear, fully connected and convolutional networks, so we consider this first. In particular, we always have $\mathbb{E}[K_\ell \mid J_\ell] = J_\ell$, so the first term in Eq. (28b) is
$$\mathbb{C}\left[\mathbb{E}\left[K^\ell_{ij} \mid J_\ell\right], \mathbb{E}\left[K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right]. \tag{29}$$
For the second term in Eq. (28b), we substitute the definition of $K_\ell$ (Eq. 13) into the definition of the covariance,
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] = \frac{1}{N_\ell^2} \sum_{\mu=1}^{N_\ell} \sum_{\nu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] - \left(\frac{1}{N_\ell} \sum_{\mu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} \mid J_\ell\right]\right)\left(\frac{1}{N_\ell} \sum_{\nu=1}^{N_\ell} \mathbb{E}\left[a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right]\right). \tag{30}$$
As the $a$'s are jointly Gaussian, their expectations are
$$\mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] = J^\ell_{ij} J^\ell_{kl} + \delta_{\mu\nu}\left(J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk}\right), \tag{31a}$$
$$\mathbb{E}\left[a^\ell_{\mu i} a^\ell_{\mu j} \mid J_\ell\right] = J^\ell_{ij}, \tag{31b}$$
$$\mathbb{E}\left[a^\ell_{\nu k} a^\ell_{\nu l} \mid J_\ell\right] = J^\ell_{kl}. \tag{31c}$$
Thus, the covariance of the kernel becomes,
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] = \frac{1}{N_\ell}\left(J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk}\right). \tag{32}$$
Substituting this into the second term in Eq. (28b),
$$\mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \frac{1}{N_\ell} \mathbb{E}\left[J^\ell_{ik} J^\ell_{jl} + J^\ell_{il} J^\ell_{jk} \mid L_0\right], \tag{33}$$
and writing the expected product in terms of the product of expectations and covariance,
$$\mathbb{E}\left[\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid J_\ell\right] \mid L_0\right] = \frac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \frac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right), \tag{34}$$
where,
$$\langle J^\ell_{ik}\rangle = \mathbb{E}\left[J^\ell_{ik} \mid L_0\right]. \tag{35}$$
Substituting Eq. (29) and Eq. (34) into Eq.
(28b), we obtain
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \frac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \frac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right). \tag{36}$$
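Eq. (32), the key ingredient of Eq. (36), is easy to verify by simulation: draw many sets of $N_\ell$ IID Gaussian channels with a fixed covariance $J$, form the kernel, and compare the empirical covariance of two of its elements with the closed form. The covariance, width and indices below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
P, N, n_samples = 5, 64, 100_000

B = rng.standard_normal((P, P))
J = B @ B.T / P + 0.1 * np.eye(P)            # an arbitrary positive-definite covariance
chol = np.linalg.cholesky(J)

i, j, k, l = 0, 1, 2, 3
K_ij = np.empty(n_samples)
K_kl = np.empty(n_samples)
for s in range(n_samples):
    A = chol @ rng.standard_normal((P, N))   # N IID channels with covariance J
    K = A @ A.T / N
    K_ij[s], K_kl[s] = K[i, j], K[k, l]

empirical = np.cov(K_ij, K_kl)[0, 1]
closed_form = (J[i, k] * J[j, l] + J[i, l] * J[j, k]) / N   # Eq. (32)
print(empirical, closed_form)
```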
A.1. Fully connected network

Now we evaluate Eq. (28a), first for a fully connected network,
$$a^\ell_{\lambda, i} = \sum_\mu h^{\ell-1}_{i,\mu} W^\ell_{\mu,\lambda}, \tag{37}$$
where the weights are drawn from an independent zero-mean Gaussian, such that
$$\mathbb{E}\left[W^\ell_{\mu,\lambda} W^\ell_{\nu,\lambda}\right] = \tfrac{1}{N_{\ell-1}} \delta_{\mu,\nu}. \tag{38}$$
Thus, $a^\ell_\lambda$ has distribution,
$$P\left(a^\ell_\lambda\right) = \mathcal{N}\left(a^\ell_\lambda; 0, J_\ell\right), \tag{39}$$
where $J_\ell$ is given by,
$$J^\ell_{ij} = \mathbb{C}\left[a^\ell_{i,\lambda}, a^\ell_{j,\lambda}\right] = \mathbb{E}\left[a^\ell_{i,\lambda} a^\ell_{j,\lambda}\right] \tag{40}$$
$$= \mathbb{E}\left[\left(\sum_\mu h^{\ell-1}_{i,\mu} W^\ell_{\mu,\lambda}\right)\left(\sum_\nu h^{\ell-1}_{j,\nu} W^\ell_{\nu,\lambda}\right)\right] \tag{41}$$
$$= \sum_{\mu\nu} h^{\ell-1}_{i,\mu} h^{\ell-1}_{j,\nu}\, \mathbb{E}\left[W^\ell_{\mu,\lambda} W^\ell_{\nu,\lambda}\right]; \tag{42}$$
substituting for the expectation (Eq. 38), and identifying the activity kernel (Eq. 13),
$$= \tfrac{1}{N_{\ell-1}} \sum_\mu h^{\ell-1}_{i,\mu} h^{\ell-1}_{j,\mu} = L^{\ell-1}_{ij}. \tag{43}$$
Thus,
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right]. \tag{44}$$
Combining this expression with Eq. (36) gives a complete form for the updates (Eq. 28),
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right], \tag{45a}$$
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + \tfrac{1}{N_\ell}\left(\mathbb{C}\left[J^\ell_{ik}, J^\ell_{jl} \mid L_0\right] + \mathbb{C}\left[J^\ell_{il}, J^\ell_{jk} \mid L_0\right]\right), \tag{45b}$$
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right]. \tag{45c}$$
However, this form is difficult to analyse due to the complexity of Eq. (45b). Instead, we can form an approximation to Eq. (45b) by noting that one of the recursive terms is negligible. Taking all network widths to be equal, $N_\ell = N$ (or at least of the same order), if
$$\mathbb{C}\left[L^{\ell-1}_{ij}, L^{\ell-1}_{kl} \mid L_0\right] = O(1/N), \tag{46}$$
then
$$\mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] = O(1/N), \tag{47}$$
as the network is chosen such that the activities, and hence the covariances, $J^\ell_{ij}$, remain $O(1)$, so
$$\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl} \mid L_0\right] = \mathbb{C}\left[J^\ell_{ij}, J^\ell_{kl} \mid L_0\right] + \tfrac{1}{N_\ell}\left(\langle J^\ell_{ik}\rangle \langle J^\ell_{jl}\rangle + \langle J^\ell_{il}\rangle \langle J^\ell_{jk}\rangle\right) + O(1/N^2) = O(1/N), \tag{48}$$
so,
$$\mathbb{C}\left[L^\ell_{ij}, L^\ell_{kl} \mid L_0\right] = O(1/N). \tag{49}$$
To begin the recursion, the data is fixed, so
$$\mathbb{C}\left[J^1_{ij}, J^1_{kl} \mid L_0\right] = \mathbb{C}\left[L^0_{ij}, L^0_{kl} \mid L_0\right] = 0, \tag{50}$$
and,
$$\mathbb{C}\left[K^1_{ij}, K^1_{kl} \mid L_0\right] = \tfrac{1}{N_1}\left(\langle L^0_{ik}\rangle \langle L^0_{jl}\rangle + \langle L^0_{il}\rangle \langle L^0_{jk}\rangle\right) = O(1/N). \tag{51}$$
Thus, the covariance of the kernels and covariances is indeed $O(1/N)$, so $\mathbb{C}\left[K^\ell_{ij}, K^\ell_{kl}\right]$ can be approximated by Eq. (48). Combining this approximation with Eq. (45) gives the expressions in the main text (Eq. 19). Finally, note that the approximation in Eq. (48) remains true only as long as the number of layers is small, $L \ll N$.
A.2. Convolutional network

For locally connected and convolutional networks, we introduce spatial structure into the activations, and we use spatial indices, $r$, $s$, $u$ and $v$. Thus, the activations for datapoint $i$ at layer $\ell$, spatial location $r$ and channel $\lambda$ are given by,
$$a^\ell_{i,r\lambda} = \sum_{r'\mu} h^{\ell-1}_{i,r'\mu} W^\ell_{r'\mu, r\lambda}. \tag{52}$$
Note that for many purposes, these higher-order tensors can be treated as vectors and matrices, if we combine indices (e.g. using a "reshape" or "view" operation). The commas in the index list are used to denote how to combine indices for this particular operation, such that it can be understood as a standard matrix/vector operation. For the above equation, the activations, $a_\ell \in \mathbb{R}^{P \times S N_\ell}$, are given by the matrix product of the activities from the previous layer, $h_{\ell-1} \in \mathbb{R}^{P \times S M_{\ell-1}}$, and the weights, $W_\ell \in \mathbb{R}^{S M_{\ell-1} \times S N_\ell}$, where remember that $S$ is the number of spatial locations in the input.

For a convolutional neural network, the weights are the same if we consider the same input-to-output channels and the same spatial displacement, $d$, and are uncorrelated otherwise,
$$\mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right] = \frac{1}{M_{\ell-1} D_{\ell-1}}\, \delta_{\mu,\nu} \sum_{d \in \mathcal{D}_{\ell-1}} \delta_{r', (r+d)}\, \delta_{s', (s+d)}, \tag{53}$$
where $\mathcal{D}_{\ell-1}$ is the set of all valid spatial displacements for the convolution, and $D_{\ell-1} = |\mathcal{D}_{\ell-1}|$ is the number of valid spatial displacements (i.e. the size of the convolutional patch). For a locally-connected network, the only additional requirement is that the output spatial locations are the same,
$$\mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right] = \frac{1}{M_{\ell-1} D_{\ell-1}}\, \delta_{\mu,\nu}\, \delta_{r,s} \sum_{d \in \mathcal{D}_{\ell-1}} \delta_{r', (r+d)}\, \delta_{s', (s+d)}. \tag{54}$$
Now we can compute the covariance of the activations, $J_\ell$, for a convolutional network,
$$J^\ell_{ir,js} = \mathbb{E}\left[a^\ell_{i,r\lambda} a^\ell_{j,s\lambda} \mid L_{\ell-1}\right] \tag{55}$$
$$= \mathbb{E}\left[\left(\sum_{r'\mu} h^{\ell-1}_{i,r'\mu} W^\ell_{r'\mu, r\lambda}\right)\left(\sum_{s'\nu} h^{\ell-1}_{j,s'\nu} W^\ell_{s'\nu, s\lambda}\right) \,\middle|\, L_{\ell-1}\right] \tag{56}$$
$$= \sum_{\mu\nu r's'} h^{\ell-1}_{i,r'\mu} h^{\ell-1}_{j,s'\nu}\, \mathbb{E}\left[W^\ell_{r'\mu, r\lambda} W^\ell_{s'\nu, s\lambda}\right]; \tag{57}$$
substituting the covariance of the weights (Eq. 53), and noting that the product of $h$'s forms the definition of the activity kernel (Eq. 13),
$$J^\ell_{ir,js} = \frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)}. \tag{58}$$
For locally connected intermediate layers, we instead substitute Eq. (54), which gives the same result, except that the output locations must be the same for there to be any covariance in the weights,
$$J^\ell_{ir,js} = \frac{1}{D_{\ell-1}}\, \delta_{r,s} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)}. \tag{59}$$
Substituting this into Eq. (28a),
$$\mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] = \mathbb{C}\left[\frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{i(r+d), j(s+d)},\; \frac{1}{D_{\ell-1}} \sum_{d \in \mathcal{D}_{\ell-1}} L^{\ell-1}_{k(u+d), l(v+d)} \,\middle|\, L_0\right]. \tag{60}$$
Now, we can put together full recursive updates for convolutional networks, by pulling the sum out of the covariance above, and by taking the indices in Eq.
(28), as indexing both a datapoint and a spatial location (i.e. $i \to i,s$),
$$\mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] = \frac{1}{D_{\ell-1}^2} \sum_{dd'} \mathbb{C}\left[L^{\ell-1}_{i(r+d), j(s+d)}, L^{\ell-1}_{k(u+d'), l(v+d')} \mid L_0\right], \tag{61a}$$
$$\mathbb{C}\left[K^\ell_{ir,js}, K^\ell_{ku,lv} \mid L_0\right] \approx \mathbb{C}\left[J^\ell_{ir,js}, J^\ell_{ku,lv} \mid L_0\right] + \frac{1}{N_\ell}\left(\langle J^\ell_{ir,ku}\rangle \langle J^\ell_{js,lv}\rangle + \langle J^\ell_{ir,lv}\rangle \langle J^\ell_{js,ku}\rangle\right), \tag{61b}$$
$$\mathbb{C}\left[L^\ell_{ir,js}, L^\ell_{ku,lv} \mid L_0\right] = \mathbb{C}\left[K^\ell_{ir,js}, K^\ell_{ku,lv} \mid L_0\right]. \tag{61c}$$
Finally, to compute these terms, note that we can recursively compute these expressions for $r = s$ and $u = v$,
$$\mathbb{C}\left[J^\ell_{ir,jr}, J^\ell_{ku,lu} \mid L_0\right] = \frac{1}{D_{\ell-1}^2} \sum_{dd'} \mathbb{C}\left[L^{\ell-1}_{i(r+d), j(r+d)}, L^{\ell-1}_{k(u+d'), l(u+d')} \mid L_0\right], \tag{62}$$
which reduces computational complexity, and the resulting expression can even be evaluated efficiently as a 2D convolution.
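For a 1-D circular convolution, the updates in Eqs. (58) and (59) amount to averaging the previous layer's activity kernel over jointly shifted spatial indices. The sketch below implements that directly for a kernel stored as a $(P, S, P, S)$ array; the patch size, displacements and input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
P, S = 3, 8                              # datapoints and spatial locations
disps = [-1, 0, 1]                       # the D valid displacements of the patch

# an arbitrary activity kernel L_{l-1}, indexed as [i, r, j, s]
H = rng.standard_normal((P, S, 4))
L_prev = np.einsum('irc,jsc->irjs', H, H) / H.shape[-1]

def conv_cov(L_prev):
    """J^l_{ir,js} = (1/D) sum_d L^{l-1}_{i(r+d), j(s+d)}, circular boundaries (Eq. 58)."""
    J = np.zeros_like(L_prev)
    for d in disps:
        # roll both spatial axes by -d so that index r reads location r+d
        J += np.roll(np.roll(L_prev, -d, axis=1), -d, axis=3)
    return J / len(disps)

def lcn_cov(L_prev):
    """Locally connected version (Eq. 59): only the r == s entries are nonzero."""
    mask = np.eye(S)[None, :, None, :]   # delta_{r,s}
    return conv_cov(L_prev) * mask

print(conv_cov(L_prev).shape, lcn_cov(L_prev).shape)
```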
A.2.1. Convolutional and locally connected networks

To understand the very different results for convolutional and locally connected networks (Fig. 3B–D), despite their having the same infinite limit, we need to consider how Eq. (62) interacts with Eq. (59). For a locally connected network, the covariance of activations at different locations is always zero, i.e. $J^\ell_{ir,js} = 0$ for $r \neq s$, whereas, for a spatially structured network, the $J^\ell_{ir,js}$ terms for $r \neq s$ have the same scale as those for $r = s$. The $J^\ell_{ir,js}$ terms enter into the variance of the kernel through Eq. (62). Note that there are $D^2_{\ell-1}$ terms in this sum, and the sum is normalized by dividing by $D^2_{\ell-1}$. Thus, in convolutional networks, there are $D^2_{\ell-1}$ terms all with the same scale, whereas in spatially unstructured networks, we have only $D_{\ell-1}$ nonzero terms, introducing an effective $1/D_{\ell-1}$ normalizer. This is particularly important if we consider the last layer. The last layer can be understood as a convolution, where the convolutional patch has the same size as the image (i.e. $D_L = S$), and there is no padding, such that the output has a single spatial location. In this case, the effect of this $1/S = 1/D_L$ normalizer can be very large.

B. Kernel flexibility: posterior viewpoint
B.1. Reparameterising finite neural networks
Swapping between a kernel representation and a feature representation is difficult if we work directly with a prior over the weights, $W_\ell \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$. Instead, note that as the weights are Gaussian, we can reparameterise the neural network, working instead with $V_\ell \in \mathbb{R}^{P \times N_\ell}$, which has independent standard Gaussian entries, where $P$ is the number of datapoints. In particular, we can write the activities at the next layer using,
$$A_\ell = H_{\ell-1} W_\ell = U_\ell^T V_\ell, \tag{63}$$
where $U_\ell \in \mathbb{R}^{P \times P}$ is any matrix that satisfies,
$$J_\ell = U_\ell^T U_\ell, \tag{64}$$
such as the Cholesky decomposition of the covariance, $J_\ell$. We can thus write the kernel as,
$$K_\ell = \tfrac{1}{N_\ell} A_\ell A_\ell^T = \tfrac{1}{N_\ell} H_{\ell-1} W_\ell W_\ell^T H_{\ell-1}^T = \tfrac{1}{N_\ell} U_\ell^T V_\ell V_\ell^T U_\ell. \tag{65}$$
Rearranging, we can write $\tfrac{1}{N_\ell} V_\ell V_\ell^T$, or equivalently the mismatch between the covariance, $J_\ell$, and the kernel, $K_\ell$, in terms of $J_\ell$ and $K_\ell$, and we denote this quantity $R_\ell$ for future use,
$$R_\ell = U_\ell^{-T} K_\ell U_\ell^{-1} = \tfrac{1}{N_\ell} V_\ell V_\ell^T, \tag{66}$$
where $X^{-T} = (X^{-1})^T = (X^T)^{-1}$, and we have assumed that $J_\ell$ is invertible, which, if nothing else, requires that the numbers of features $M_\ell$ and $N_\ell$ are larger than the number of datapoints.
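The reparameterisation in Eqs. (63-66) can be checked numerically: with $U_\ell$ the Cholesky-style factor of $J_\ell$ and $V_\ell$ standard Gaussian, $R_\ell$ computed from the kernel recovers $\tfrac{1}{N_\ell} V_\ell V_\ell^T$ exactly. The dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
P, N = 6, 512

B = rng.standard_normal((P, P))
J = B @ B.T / P + 0.1 * np.eye(P)        # covariance of the activations at this layer
U = np.linalg.cholesky(J).T              # J = U^T U, Eq. (64)

V = rng.standard_normal((P, N))          # standard Gaussian reparameterisation
A = U.T @ V                              # Eq. (63): each channel of A has covariance J
K = A @ A.T / N                          # Eq. (65)

M1 = np.linalg.solve(U.T, K)             # U^{-T} K
R = np.linalg.solve(U.T, M1.T).T         # R = U^{-T} K U^{-1}, Eq. (66)
print(np.allclose(R, V @ V.T / N))       # True: R_l = (1/N_l) V_l V_l^T
```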
B.2. MAP inference

Here, we consider MAP inference over $V_\ell$. As the entries of $V_\ell$ have a standard Gaussian prior, we have,
$$\log P(V_\ell) = -\tfrac{1}{2} \operatorname{Tr}\left(V_\ell V_\ell^T\right) + \text{const} \tag{67}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(U_\ell^{-T} K_\ell U_\ell^{-1}\right) + \text{const} \tag{68}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell U_\ell^{-1} U_\ell^{-T}\right) + \text{const} \tag{69}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell \left(U_\ell^T U_\ell\right)^{-1}\right) + \text{const} \tag{70}$$
$$= -\tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell J_\ell^{-1}\right) + \text{const}. \tag{71}$$
We can write the likelihood in the same form,
$$\log P(Y \mid J_{L+1}) = -\tfrac{1}{2} \operatorname{Tr}\left(Y^T J_{L+1}^{-1} Y\right) + \text{const} \tag{72}$$
$$= -\tfrac{1}{2} \operatorname{Tr}\left(Y Y^T J_{L+1}^{-1}\right) + \text{const} \tag{73}$$
$$= -\tfrac{N_{L+1}}{2} \operatorname{Tr}\left(K_{L+1} J_{L+1}^{-1}\right) + \text{const}, \tag{74}$$
where,
$$K_{L+1} = \tfrac{1}{N_{L+1}} Y Y^T. \tag{75}$$
Note that we would usually incorporate IID noise in the outputs, and we are not doing so here in order to give exact, interpretable solutions. We do not expect this to change the overall pattern of the results, except to marginally weaken the connection between the output kernel, $K_{L+1}$, and top-layer kernel, $K_L$.

Thus, the joint probability can be written as,
$$\log P(V_1, \ldots, V_L, Y \mid X) = -\sum_{\ell=1}^{L+1} \tfrac{N_\ell}{2} \operatorname{Tr}\left(K_\ell J_\ell^{-1}\right) + \text{const}. \tag{76}$$
Now we find the MAP values of $V_1, \ldots, V_L$,
$$V_1^*, \ldots, V_L^* = \underset{V_1, \ldots, V_L}{\arg\max}\; \log P(V_1, \ldots, V_L, Y \mid X), \tag{77}$$
by taking gradients of $P(V_1, \ldots, V_L, Y \mid X)$ with respect to $K_1, \ldots, K_L$. Note that we can find the mode of this distribution by differentiating with respect to many different quantities, and we choose $K_\ell$ because of algebraic convenience, and because it includes all relevant information from $V_\ell$ (Eq. 65). Further, note that as we are still working with the probability density of $V_1, \ldots, V_L$, we should not include a Jacobian term. Now we consider a linear, fully connected network where $J_\ell = K_{\ell-1}$,
$$0 = \frac{\partial}{\partial K_\ell} \log P(V_1, \ldots, V_L, Y \mid X) = -\tfrac{N_\ell}{2} K_{\ell-1}^{-1} + \tfrac{N_{\ell+1}}{2} K_\ell^{-1} K_{\ell+1} K_\ell^{-1}, \tag{78}$$
where we have used,
$$\frac{\partial \operatorname{Tr}\left(K_\ell^{-1} K_{\ell+1}\right)}{\partial K_\ell} = -K_\ell^{-1} K_{\ell+1} K_\ell^{-1}, \tag{79}$$
$$\frac{\partial \operatorname{Tr}\left(K_{\ell-1}^{-1} K_\ell\right)}{\partial K_\ell} = K_{\ell-1}^{-1}. \tag{80}$$
Thus, the MAP kernel changes by a fixed ratio,
$$S = N_{\ell+1} K_{\ell+1} K_\ell^{-1} = N_\ell K_\ell K_{\ell-1}^{-1}. \tag{81}$$
As the input kernel, $K_0$, and the output kernel, $K_{L+1}$, are fixed, we can solve for $S$,
$$S^{L+1} = \prod_{\ell=1}^{L+1} N_\ell K_\ell K_{\ell-1}^{-1} = K_{L+1} K_0^{-1} \prod_{\ell=1}^{L+1} N_\ell, \tag{82}$$
so,
$$S = \left(K_{L+1} K_0^{-1}\right)^{1/(L+1)} \left(\prod_{\ell=1}^{L+1} N_\ell\right)^{1/(L+1)}, \tag{83}$$
where the final term is the geometric average of the width at each layer.
As such, the kernel at any given layer is,
$$K_\ell = \left(\prod_{\ell'=1}^{\ell} K_{\ell'} K_{\ell'-1}^{-1}\right) K_0 \tag{84}$$
$$= \left(\prod_{\ell'=1}^{\ell} \tfrac{1}{N_{\ell'}} S\right) K_0 \tag{85}$$
$$= \frac{\left(\prod_{\ell'=1}^{L+1} N_{\ell'}\right)^{\ell/(L+1)}}{\prod_{\ell'=1}^{\ell} N_{\ell'}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{86}$$
Defining the geometric average of the number of units at each layer prior to (and including) $\ell$, and after $\ell$,
$$N_{\le\ell} = \left(\prod_{\ell'=1}^{\ell} N_{\ell'}\right)^{1/\ell}, \tag{87}$$
$$N_{\ell<} = \left(\prod_{\ell'=\ell+1}^{L+1} N_{\ell'}\right)^{1/(L+1-\ell)}, \tag{88}$$
we can write,
$$\frac{\left(\prod_{\ell'=1}^{L+1} N_{\ell'}\right)^{\ell/(L+1)}}{\prod_{\ell'=1}^{\ell} N_{\ell'}} = \frac{\left((N_{\le\ell})^{\ell} (N_{\ell<})^{L+1-\ell}\right)^{\ell/(L+1)}}{(N_{\le\ell})^{\ell}} \tag{89}$$
$$= \left((N_{\le\ell})^{-(L+1-\ell)} (N_{\ell<})^{L+1-\ell}\right)^{\ell/(L+1)} \tag{90}$$
$$= \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}}. \tag{91}$$
This factor is the ratio of the geometric averages of the widths for the subsequent and previous layers, raised to a power which depends on the distance to the end points (for $\ell = 0$ or $\ell = L+1$ this power is 0),
$$K_\ell = \left(\frac{N_{\ell<}}{N_{\le\ell}}\right)^{\frac{\ell(L+1-\ell)}{L+1}} \left(K_{L+1} K_0^{-1}\right)^{\ell/(L+1)} K_0. \tag{92}$$
Thus, MAP does something sensible: no matter the network widths (and including as the network widths go to infinity), the representation interpolates smoothly between the input and output kernels. However, the scale of these representations can shift in a strange, and potentially pathological, fashion. Remember that we normalized the weights, taking into account the width of each layer, such that the representations maintained the same scale, irrespective of layer width. However, under MAP inference, the network width controls the scale of the kernel, with larger kernels at layer $\ell$ given by widening layers from 1 to $\ell$, and narrowing layers from $\ell+1$ to $L+1$.

C. Deriving a cost-function such that gradient descent is equivalent to sampling
C. Deriving a cost-function such that gradient descent is equivalent to sampling

The pathologies in the above derivations indicate that MAP, using full-batch gradient descent, may give a very poor approximation of the kernel induced by stochastic gradient descent. As such, we consider Langevin sampling, which not only gives Bayesian inference, but also gives a good starting point for thinking about the noise introduced by stochastic gradient descent. In particular, we perform Langevin sampling over $V_\ell$ (Eq. 63),

$$dV_\ell = \tfrac{dt}{2} \frac{\partial \mathcal{L}}{\partial V_\ell} + d\Xi_\ell, \qquad (93)$$

where $d\Xi_\ell$ is a matrix-valued Wiener process. Remembering that the objective is completely specified by $R_\ell = \tfrac{1}{N_\ell} V_\ell V_\ell^T$ for a linear or finite-infinite network, we consider the effect of this sampling on $R_\ell$. In particular, we consider the expected change in $R_\ell$ under Langevin sampling,

$$\mathbb{E}\big[dR_\ell \mid R_\ell\big] = \mathbb{E}\Big[\tfrac{1}{N_\ell} d\big(V_\ell V_\ell^T\big)\Big] = \frac{dt}{2 N_\ell} \left(\frac{\partial \mathcal{L}}{\partial V_\ell} V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + \tfrac{1}{N_\ell} \mathbb{E}\big[d\Xi_\ell \, d\Xi_\ell^T\big]. \qquad (94)$$

As the only stochasticity comes from the last term, and this term has known expectation,

$$\tfrac{1}{N_\ell} \mathbb{E}\big[d\Xi_\ell \, d\Xi_\ell^T\big] = dt \, I, \qquad (95)$$

we can compute the expected update, which becomes the exact update as we take $N_\ell \to \infty$,

$$\lim_{N_\ell \to \infty} \frac{dR_\ell}{dt} = \mathbb{E}\left[\frac{dR_\ell}{dt} \,\middle|\, R_\ell\right] = \frac{1}{2 N_\ell} \left(\left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right) V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + I. \qquad (96)$$

To check that these dynamics are sensible, we consider performing Langevin sampling using the above dynamics under the zero-mean, unit-variance prior on elements of $V_\ell$,

$$\mathcal{L} = -\tfrac{1}{2} \mathrm{Tr}\big(V_\ell V_\ell^T\big), \qquad (97)$$

so the gradient is,

$$\frac{\partial \mathcal{L}}{\partial V_\ell} = \frac{\partial}{\partial V_\ell} \Big[-\tfrac{1}{2} \mathrm{Tr}\big(V_\ell V_\ell^T\big)\Big] = -V_\ell. \qquad (98)$$

Thus,

$$\mathbb{E}\left[\frac{dR_\ell}{dt} \,\middle|\, R_\ell\right] = -\tfrac{1}{N_\ell} V_\ell V_\ell^T + I = -R_\ell + I. \qquad (99)$$

Now, we set the expected change in $R_\ell$ equal to zero,

$$0 = \mathbb{E}\left[\frac{dR_\ell}{dt}\right] = -\mathbb{E}\big[R_\ell\big] + I, \qquad (100)$$

and solving for the expected value of $R_\ell$,

$$\mathbb{E}\big[R_\ell\big] = \mathbb{E}\Big[\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big] = I, \qquad (101)$$

which is equal to the expected value of $\tfrac{1}{N_\ell} V_\ell V_\ell^T$ under the prior, as is necessary given that these dynamics perform exact Langevin sampling in the limit.
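The sanity check in Eqs. (97)-(101) can also be run numerically. A minimal Euler-Maruyama discretisation of Eq. (93) under the prior (my own sketch; the step size, matrix sizes, and number of steps are arbitrary choices, not values from the paper) gives a long-run average of $R_\ell = \tfrac{1}{N_\ell} V_\ell V_\ell^T$ close to the identity, as Eq. (101) requires.

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 3, 50           # P datapoints, N channels (hypothetical sizes)
dt, steps = 1e-2, 100_000

V = rng.standard_normal((P, N))
R_sum = np.zeros((P, P))

for _ in range(steps):
    grad = -V                                    # dL/dV for L = -0.5 Tr(V V^T), Eq. (98)
    V = V + 0.5 * dt * grad + np.sqrt(dt) * rng.standard_normal((P, N))
    R_sum += V @ V.T / N                         # R = (1/N) V V^T

print(R_sum / steps)   # approximately the identity matrix, as in Eq. (101)
```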
C.1. Langevin dynamics as the modes of an objective
We can write the expected dynamics of $R_\ell$ under Langevin sampling as the gradient of a surrogate objective, $\mathcal{L}'$,

$$\mathcal{L}' = \mathcal{L} + \tfrac{N_\ell}{2} \log \big|R_\ell\big| = \mathcal{L} + \tfrac{N_\ell}{2} \log \Big|\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big|. \qquad (102)$$

The gradient of the log-determinant is given by the pseudo-inverse,

$$\frac{\partial}{\partial V_\ell} \log \Big|\tfrac{1}{N_\ell} V_\ell V_\ell^T\Big| = 2 \big(V_\ell V_\ell^T\big)^{-1} V_\ell. \qquad (103)$$

Thus, continuous gradient descent on the full objective, with the same $\tfrac{dt}{2}$ scaling as above, gives,

$$dV_\ell = \tfrac{dt}{2} \left[\frac{\partial \mathcal{L}'}{\partial V_\ell}\right] = \tfrac{dt}{2} \left[\frac{\partial \mathcal{L}}{\partial V_\ell} + N_\ell \big(V_\ell V_\ell^T\big)^{-1} V_\ell\right]. \qquad (104)$$

The implied change in $R_\ell$ is,

$$dR_\ell = \tfrac{1}{N_\ell} \big(dV_\ell \, V_\ell^T + V_\ell \, dV_\ell^T\big) = \frac{dt}{2 N_\ell} \left(\left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right) V_\ell^T + V_\ell \left(\frac{\partial \mathcal{L}}{\partial V_\ell}\right)^T\right) + dt \, I, \qquad (105)$$

and this is exactly equal to the expected change in $R_\ell$ induced by Langevin sampling (Eq. 96).
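To see concretely why the extra $\tfrac{N_\ell}{2} \log|R_\ell|$ term matters, the sketch below (a hypothetical illustration, not the paper's experiment; the step size and matrix sizes are arbitrary) takes gradient steps on $\mathcal{L}$ and on the surrogate $\mathcal{L}'$ for the prior-only objective of Eq. (97): the former drives $R_\ell$ towards the MAP mode at zero, while the latter drives it towards the prior expectation, the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, lr, steps = 3, 50, 1e-2, 20_000

def run(use_logdet):
    V = 0.1 * rng.standard_normal((P, N))
    for _ in range(steps):
        grad = -V                                          # dL/dV, Eq. (98)
        if use_logdet:
            grad = grad + N * np.linalg.inv(V @ V.T) @ V   # + N_l (V V^T)^{-1} V, Eqs. (103)-(104)
        V = V + lr * grad
    return V @ V.T / N                                     # R = (1/N) V V^T

print(np.diag(run(False)))   # plain MAP: R collapses towards zero
print(np.diag(run(True)))    # surrogate objective L': R approaches the identity
```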
C.2. The sampling objective as modified maximum-likelihood under a Wishart prior

To further check that the Langevin sampling result is sensible, we note that it is very similar to doing MAP inference under a Wishart prior, but that sampling fixes pathologies in this procedure due to the skew inherent in the Wishart distribution. In particular, the Wishart probability density is given by,

$$\log P(K_\ell \mid J_\ell) = \log \mathrm{Wishart}\big(K_\ell; \, \tfrac{1}{N_\ell} J_\ell, \, N_\ell\big) \qquad (106)$$
$$= \tfrac{N_\ell - P - 1}{2} \log \big|K_\ell\big| - \tfrac{N_\ell}{2} \log \big|J_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(J_\ell^{-1} K_\ell\big) + \mathrm{const}. \qquad (107)$$

The pathologies arise if we compare the expectation and the mode of this distribution,

$$\mathbb{E}\big[K_\ell \mid J_\ell\big] = J_\ell \qquad (108a)$$
$$\mathrm{arg\,max}_{K_\ell} \big[\log P(K_\ell \mid J_\ell)\big] = \tfrac{N_\ell - P - 1}{N_\ell} J_\ell, \qquad (108b)$$

where the matrices $K_\ell$ and $J_\ell$ are $P \times P$, and $K_\ell$ is the inner product of $N_\ell$ vectors with covariance $\tfrac{1}{N_\ell} J_\ell$. Thus, the mode gives a very poor characterisation of the expectation of the distribution, to the extent that if $N_\ell = P + 1$, the mode is zero while the expectation can take on any value. Thankfully, it is possible to find a closely related optimization problem that gives a good characterisation of the mean. In particular, we need to incorporate a new term in the objective that counteracts the "shrinkage" induced by the skew in the Wishart, such that the mode of the new objective equals the expectation,

$$\mathrm{arg\,max}_{K_\ell} \Big[\log P(K_\ell \mid J_\ell) + \tfrac{P+1}{2} \log \big|K_\ell\big|\Big] = J_\ell. \qquad (109)$$

Critically, this term, $\tfrac{P+1}{2} \log |K_\ell|$, is almost entirely independent of the parameters (it depends only on the size, $P$), and the combined objective is equivalent to the objective for Langevin sampling, $\mathcal{L}'$,

$$\log P(K_\ell \mid J_\ell) + \tfrac{P+1}{2} \log \big|K_\ell\big| = \tfrac{N_\ell}{2} \log \big|J_\ell^{-1} K_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(J_\ell^{-1} K_\ell\big) + \mathrm{const} \qquad (110)$$
$$= \tfrac{N_\ell}{2} \log \big|U_\ell^{-T} K_\ell U_\ell^{-1}\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(U_\ell^{-T} K_\ell U_\ell^{-1}\big) + \mathrm{const} \qquad (111)$$
$$= \tfrac{N_\ell}{2} \log \big|R_\ell\big| - \tfrac{N_\ell}{2} \mathrm{Tr}\big(R_\ell\big) + \mathrm{const}, \qquad (112)$$

if we consider a simple one-layer setup, with $\mathcal{L}$ given by Eq. (97).
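The one-dimensional family $K = c J$ gives a quick check of the shrinkage in Eq. (108b) and its correction in Eq. (109). This restriction to a scalar multiple of $J$ is my own simplification rather than something the paper does: the unmodified Wishart log-density peaks at $c = (N_\ell - P - 1)/N_\ell$, while the modified objective peaks at $c = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
P, N = 5, 8
A = rng.standard_normal((P, P))
J = A @ A.T + P * np.eye(P)

def log_wishart(K):
    # log Wishart(K; J/N, N) up to an additive constant, Eq. (107)
    _, ldK = np.linalg.slogdet(K)
    return 0.5 * (N - P - 1) * ldK - 0.5 * N * np.trace(np.linalg.solve(J, K))

def modified(K):
    # Eq. (109): add ((P+1)/2) log|K| to counteract the shrinkage
    _, ldK = np.linalg.slogdet(K)
    return log_wishart(K) + 0.5 * (P + 1) * ldK

cs = np.linspace(0.05, 2.0, 2000)
print(cs[np.argmax([log_wishart(c * J) for c in cs])])   # approx (N - P - 1)/N = 0.25
print(cs[np.argmax([modified(c * J) for c in cs])])      # approx 1.0
```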
C.3. Representation learning in deep networks

The log-probability of the data at the final layer can be written in the same form as the objective for Langevin sampling (Eq. 102), and the modified objective for Wishart inference (Eq. 110). In particular,

$$\log P(\mathbf{y}_\mu \mid K_L) = -\tfrac{1}{2} \mathbf{y}_\mu^T L_L^{-1} \mathbf{y}_\mu - \tfrac{1}{2} \log \big|L_L\big| + \mathrm{const}. \qquad (113)$$

Defining the constant kernel, $K_{L+1} = \tfrac{1}{Y} Y Y^T$, we can write the log-probability of $Y$ in a manner that is consistent with the previous kernels,

$$\log P(Y \mid K_L) = \tfrac{Y}{2} \Big(\log \big|L_L^{-1} K_{L+1}\big| - \mathrm{Tr}\big(L_L^{-1} K_{L+1}\big)\Big) + \mathrm{const}, \qquad (114)$$

where the log determinant of $K_{L+1}$ is constant, so can be included without changing the objective. As such, the full objective can be written as,

$$\mathcal{L} = \tfrac{1}{2} \sum_{\ell=1}^{L+1} N_\ell \Big(\log \big|L_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(L_{\ell-1}^{-1} K_\ell\big)\Big). \qquad (115)$$

When we differentiate, only the terms that vary with $K_\ell$ are relevant,

$$\mathcal{L} = \tfrac{N_\ell}{2} \Big(\log \big|L_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(L_{\ell-1}^{-1} K_\ell\big)\Big) + \tfrac{N_{\ell+1}}{2} \Big(\log \big|L_\ell^{-1} K_{\ell+1}\big| - \mathrm{Tr}\big(L_\ell^{-1} K_{\ell+1}\big)\Big) + \mathrm{const}. \qquad (116)$$

While the derivations up to this point have been the same, the gradients for fully connected, locally connected, and convolutional networks diverge.
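Eq. (114) is just a repackaging of the Gaussian log-likelihood of the output columns. The sketch below is my own consistency check, assuming the $Y$ columns of the output matrix are i.i.d. $\mathcal{N}(0, L_L)$ under the model; it confirms that the two forms differ only by a term that does not depend on $L_L$.

```python
import numpy as np

rng = np.random.default_rng(5)
P, Y = 4, 6   # P datapoints, Y output channels

def random_spd(p):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

Yout = rng.standard_normal((P, Y))
K_out = Yout @ Yout.T / Y                        # K_{L+1} = (1/Y) Y Y^T

def direct(LL):
    # sum of Gaussian log-densities of the Y output columns, each N(0, L_L)
    _, ld = np.linalg.slogdet(LL)
    quad = np.trace(np.linalg.solve(LL, Yout @ Yout.T))
    return -0.5 * quad - 0.5 * Y * ld - 0.5 * P * Y * np.log(2 * np.pi)

def repackaged(LL):
    # Eq. (114): (Y/2)(log|L_L^{-1} K_{L+1}| - Tr(L_L^{-1} K_{L+1})), up to a constant
    M = np.linalg.solve(LL, K_out)
    _, ld = np.linalg.slogdet(M)
    return 0.5 * Y * (ld - np.trace(M))

L1, L2 = random_spd(P), random_spd(P)
# the difference of the two forms across two choices of L_L should agree
print(np.isclose(direct(L1) - direct(L2), repackaged(L1) - repackaged(L2)))
```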
C.4. Fully connected networks

For fully connected networks,

$$L_\ell = K_\ell, \qquad (117)$$

so the terms in the objective that depend on $K_\ell$ are,

$$\mathcal{L} = \tfrac{N_\ell}{2} \Big(\log \big|K_{\ell-1}^{-1} K_\ell\big| - \mathrm{Tr}\big(K_{\ell-1}^{-1} K_\ell\big)\Big) + \tfrac{N_{\ell+1}}{2} \Big(\log \big|K_\ell^{-1} K_{\ell+1}\big| - \mathrm{Tr}\big(K_\ell^{-1} K_{\ell+1}\big)\Big) + \mathrm{const}. \qquad (118)$$

Differentiating the relevant terms,

$$\frac{\partial \, \mathrm{Tr}\big(K_\ell^{-1} K_{\ell+1}\big)}{\partial K_\ell} = -K_\ell^{-1} K_{\ell+1} K_\ell^{-1} \qquad (119a)$$
$$\frac{\partial \, \mathrm{Tr}\big(K_{\ell-1}^{-1} K_\ell\big)}{\partial K_\ell} = K_{\ell-1}^{-1} \qquad (119b)$$
$$\frac{\partial \log \big|K_{\ell-1}^{-1} K_\ell\big|}{\partial K_\ell} = \frac{\partial \log |K_\ell|}{\partial K_\ell} = K_\ell^{-1} \qquad (119c)$$
$$\frac{\partial \log \big|K_\ell^{-1} K_{\ell+1}\big|}{\partial K_\ell} = -\frac{\partial \log |K_\ell|}{\partial K_\ell} = -K_\ell^{-1} \qquad (119d)$$

We then set the gradients to zero,

$$0 = \frac{\partial \mathcal{L}}{\partial K_\ell} = -\tfrac{1}{2}\big(N_{\ell+1} - N_\ell\big) K_\ell^{-1} + \tfrac{N_{\ell+1}}{2} K_\ell^{-1} K_{\ell+1} K_\ell^{-1} - \tfrac{N_\ell}{2} K_{\ell-1}^{-1}. \qquad (120)$$

We pre-multiply by $2 K_\ell$,

$$0 = -\big(N_{\ell+1} - N_\ell\big) I + N_{\ell+1} K_{\ell+1} K_\ell^{-1} - N_\ell K_\ell K_{\ell-1}^{-1}, \qquad (121)$$

and note that the resulting expression can be written in terms of the ratio, $T_{\ell+1} = K_{\ell+1} K_\ell^{-1}$,

$$0 = -\big(N_{\ell+1} - N_\ell\big) I + N_{\ell+1} T_{\ell+1} - N_\ell T_\ell. \qquad (122)$$

Solving for $T_{\ell+1}$,

$$T_{\ell+1} = I + \tfrac{N_\ell}{N_{\ell+1}} \big(T_\ell - I\big). \qquad (123)$$

We use $N_\ell = N$ for $\ell \in \{1, \dots, L\}$, and $N_{L+1} = Y$,

$$T_\ell = \begin{cases} T & \text{for } \ell \in \{1, \dots, L\} \\ I + \tfrac{N}{Y}\big(T - I\big) & \text{for } \ell = L+1 \end{cases} \qquad (124)$$

To compute $T$, we use,

$$K_{L+1} K_0^{-1} = T_{L+1} T^L, \qquad (125)$$

substituting for $T_{L+1}$,

$$K_{L+1} K_0^{-1} = \big(I + \tfrac{N}{Y}(T - I)\big) T^L. \qquad (126)$$

As this cannot be solved analytically for $T$, we consider three special cases. First, if there are many outputs in comparison to the number of hidden units (i.e. $N/Y \to 0$),

$$\lim_{N/Y \to 0} T = \big(K_{L+1} K_0^{-1}\big)^{1/L}, \qquad (127)$$

and thus the top-level kernel is equal to the output kernel, i.e. $K_L = K_{L+1}$. Second, we consider the other extreme, where there are many more hidden units than output channels (i.e. $N/Y \to \infty$). In this limit, we must have $0 = T - I$, because otherwise the $\tfrac{N}{Y}(T - I)$ term will explode,

$$\lim_{N/Y \to \infty} T = I, \qquad (128)$$

thus the representation does not change as it flows through the network. Finally, we consider a more reasonable case where the number of hidden units is of the order of the number of output channels; in particular, we consider $Y = N$,

$$T = \big(K_{L+1} K_0^{-1}\big)^{1/(L+1)}, \qquad (129)$$

and as such, the top-layer kernel is almost, but not quite, equal to the output kernel, and it gets closer as the network gets deeper,

$$K_L = T^L K_0 = \big(K_{L+1} K_0^{-1}\big)^{L/(L+1)} K_0. \qquad (130)$$
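For the $Y = N$ case, Eqs. (129)-(130) give the stationary top-layer kernel in closed form. The short sketch below (not from the paper; matrix powers are again computed through the symmetrised form using `scipy.linalg.fractional_matrix_power`) shows $K_L$ moving towards the output kernel as the depth $L$ grows.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

rng = np.random.default_rng(6)
P = 3

def random_spd(p):
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

K0, KL1 = random_spd(P), random_spd(P)          # input kernel K_0, output kernel K_{L+1}
K0_half = mpow(K0, 0.5)
K0_half_inv = np.linalg.inv(K0_half)

def top_kernel(L):
    # Eq. (130): K_L = (K_{L+1} K_0^{-1})^{L/(L+1)} K_0, via the symmetric form
    M = mpow(K0_half_inv @ KL1 @ K0_half_inv, L / (L + 1))
    return K0_half @ M @ K0_half

for L in (1, 4, 16, 64):
    print(L, np.linalg.norm(top_kernel(L) - KL1))   # the gap shrinks as the network gets deeper
```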
D. Natural gradients for a Gaussian-process sum kernel

We begin by defining the covariance (kernel) as the sum over a set of kernels, $K_i$, weighted by $\lambda_i$,

$$K = \sum_i \lambda_i K_i. \qquad (131)$$

Our goal is to find the maximum-likelihood $\lambda_i$ parameters using a natural-gradient method. The likelihood is,

$$\log P(Y) = -\tfrac{1}{2} \mathrm{Tr}\big(K^{-1} Y Y^T\big) - \tfrac{N}{2} \log \big|K\big| + \mathrm{const}, \qquad (132)$$

and the gradient is,

$$\frac{\partial \log P(Y)}{\partial \lambda_\alpha} = \tfrac{1}{2} \mathrm{Tr}\big(L_\alpha L_y\big) - \tfrac{N}{2} \mathrm{Tr}\big(L_\alpha\big), \qquad (133)$$

where,

$$L_\alpha = K^{-1} K_\alpha \qquad (134)$$
$$L_y = K^{-1} Y Y^T. \qquad (135)$$

For a natural-gradient method, we need the expected second derivatives. For the first term, these are,

$$\mathbb{E}\left[\frac{\partial}{\partial \lambda_\beta} \Big[\tfrac{1}{2} \mathrm{Tr}\big(L_\alpha L_y\big)\Big]\right] = \mathbb{E}\Big[-\tfrac{1}{2} \big(\mathrm{Tr}(L_\beta L_\alpha L_y) + \mathrm{Tr}(L_\alpha L_\beta L_y)\big)\Big] \qquad (136)$$
$$= -\tfrac{1}{2} \big(\mathrm{Tr}(L_\beta L_\alpha \mathbb{E}[L_y]) + \mathrm{Tr}(L_\alpha L_\beta \mathbb{E}[L_y])\big) \qquad (137)$$
$$= -\tfrac{N}{2} \big(\mathrm{Tr}(L_\beta L_\alpha) + \mathrm{Tr}(L_\alpha L_\beta)\big) \qquad (138)$$
$$= -N \, \mathrm{Tr}\big(L_\alpha L_\beta\big), \qquad (139)$$

using basic matrix identities, and the fact that, under the model, $\mathbb{E}[L_y] = N I$. The second term is independent of $Y$, so we can just compute the second derivative, $\frac{\partial}{\partial \lambda_\beta} \big[-\tfrac{N}{2} \mathrm{Tr}(L_\alpha)\big] = \tfrac{N}{2} \mathrm{Tr}(L_\beta L_\alpha)$. Thus,

$$\mathbb{E}\left[\frac{\partial^2}{\partial \lambda_\alpha \, \partial \lambda_\beta} \log P(Y)\right] = -\tfrac{N}{2} \mathrm{Tr}\big(L_\alpha L_\beta\big).$$
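These quantities are exactly what is needed for a damped Fisher (natural-gradient) update on the $\lambda_i$. The sketch below is my own illustration rather than the paper's implementation: the random component kernels, the damping factor of 0.5, the positivity clamp, and the jitter term are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
P, N, n_comp = 20, 5, 3          # P datapoints, N output columns, n_comp component kernels

# hypothetical fixed component kernels K_i (random SPD matrices)
Ks = []
for _ in range(n_comp):
    A = rng.standard_normal((P, P))
    Ks.append(A @ A.T + np.eye(P))

lam_true = np.array([0.5, 1.5, 0.2])
K_true = sum(l * Ki for l, Ki in zip(lam_true, Ks))
Yobs = np.linalg.cholesky(K_true) @ rng.standard_normal((P, N))   # columns ~ N(0, K_true)

lam = np.ones(n_comp)
for _ in range(100):
    K = sum(l * Ki for l, Ki in zip(lam, Ks)) + 1e-8 * np.eye(P)
    Kinv = np.linalg.inv(K)
    Ls = [Kinv @ Ki for Ki in Ks]                       # L_alpha = K^{-1} K_alpha, Eq. (134)
    Ly = Kinv @ Yobs @ Yobs.T                           # L_y = K^{-1} Y Y^T, Eq. (135)
    grad = np.array([0.5 * np.trace(La @ Ly) - 0.5 * N * np.trace(La) for La in Ls])   # Eq. (133)
    F = 0.5 * N * np.array([[np.trace(La @ Lb) for Lb in Ls] for La in Ls])            # Fisher matrix
    lam = np.maximum(lam + 0.5 * np.linalg.solve(F, grad), 1e-6)   # damped step, weights kept positive

print(lam)   # roughly recovers lam_true, up to sampling noise in Yobs
```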