A new role for circuit expansion for learning in neural networks
Julia Steinberg,1, 2, ∗ Madhu Advani, and Haim Sompolinsky1, 3

1 Center for Brain Science, Harvard University, Cambridge MA 02138, USA
2 Department of Physics, Harvard University, Cambridge MA 02138, USA
3 Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem 91904, Israel
(Dated: August 21, 2020)

Many sensory pathways in the brain rely on sparsely active populations of neurons downstream from the input stimuli. The biological reason for the occurrence of expanded structure in the brain is unclear, but it may be that expansion increases the expressive power of a neural network. In this work, we show that expanding a neural network can improve its generalization performance even in cases in which the expanded structure is pruned after the learning period. To study this setting we use a teacher-student framework in which a perceptron teacher network generates labels which are corrupted with small amounts of noise. We then train a student network that is structurally matched to the teacher and can achieve optimal accuracy if given the teacher's synaptic weights. We find that sparse expansion of the input of a student perceptron network both increases its capacity and improves its generalization performance when learning a noisy rule from a teacher perceptron, provided these expansions are pruned after learning. We find similar behavior when the expanded units are stochastic and uncorrelated with the input, and we analyze this network in the mean field limit. We show by solving the mean field equations that the generalization error of the stochastic expanded student network continues to drop as the size of the network increases. The improvement in generalization performance occurs despite the increased complexity of the student network relative to the teacher it is trying to learn. We show that this effect is closely related to the addition of slack variables in artificial neural networks and suggest possible implications for artificial and biological neural networks.
I. INTRODUCTION
Learning and memory is thought to occur mainly through long term modification of synaptic connections among neurons, a phenomenon well established experimentally. Neural circuits also undergo structural changes. Adult neurogenesis in the mammalian brain facilitates the continuous creation of new neurons and occurs mainly in the olfactory bulb and the dentate gyrus of the hippocampus [1, 2]. Another form of structural plasticity is the continuous recycling of synapses, which is seen in both cortex and hippocampus. Several modeling studies have addressed the computational consequences of these phenomena (see [3, 4] for adult neurogenesis and [5] for synaptic recycling). Of particular interest are the dynamic changes in the circuit structure of the hippocampus, an area specialized in learning and memory, functions which would seem to benefit from a stable circuit structure.

In this work, we explore a novel computational benefit of structural dynamics in neural circuits that learn new associations or tasks. We show that under certain classes of learning paradigms, the expansion of a neural circuit architecture by recruiting additional neurons and synapses may facilitate the dynamics of learning. Expansions of circuit size that enable sparse coding have been shown to have computational benefits in several contexts in neuroscience and machine learning for sensory processing, learning, and memory [6–10]. In these models, circuit expansion and the resultant sparse coding yield better representations of the stimuli, enhancing pattern separation and improving the capacity for pattern retrieval and classification. Importantly, to realize these benefits, the expanded architecture needs to be stable after learning. In contrast, in our scenario the benefit of expansion is in its facilitating the dynamics of learning and not its information bearing potential.

∗ Correspondence to: [email protected]
In fact, expansion in this scenario is most beneficial when it is transient, with the added neurons and synapses pruned after the learning period; hence this hypothesis is consistent with the observed continuous recycling of synapses.

We consider neural networks that learn supervised classification rules generated by a single layer perceptron. Learning the rule by training with labelled examples may nevertheless be hampered by the complexity of the underlying data. We will focus on two cases of unrealizable rules. The first case occurs when the teacher network produces training labels corrupted by stochastic noise. The second case occurs when the teacher is more complex than the student network trying to learn the rule. In both of these cases, there is a critical size of the training set above which no single layer student is able to correctly classify all of the training examples. This critical size is called the student's capacity. We show that adding sparse expansions to student networks by random mappings of the original input increases the capacity of the network and improves its generalization performance as it is trained on larger training sets.

While the capacity of a network is clearly related to its dimensionality, it is not obvious, and even counterintuitive, that increasing the size of a network should improve its generalization performance. Using mean field theory and simulations over a wide range of network parameters, we show that expansion of the architecture during learning achieves improved generalization, particularly if the additional elements of the circuit are removed after learning. In addition, we show that the effect is more pronounced if the hidden representation during learning is sparse.
We find that the performance is most improved when the expanded units are random and uncorrelated with the original input, which suggests that having low overlap in expanded activity between different training inputs is crucial to improving performance.

Our analysis offers a new perspective on the important issue of the relation between model complexity and learning in neural networks. Artificial neural networks have achieved state of the art predictive performance on a variety of tasks [11, 12]. The primary benefit of training these enormous models appears to lie in their ability to represent very complex functions, and the link between width, depth, and expressivity of neural networks is discussed in detail in several studies, including [13–16]. These networks are often overparameterized in the sense that the number of examples the network is trained on is far less than the number of free parameters in the network [17]. Classical statistical learning theory suggests that such massively over-parameterized models should be expected to over-fit the training data [18] and make poor predictions on new inputs not seen by the network before. To resolve this apparent paradox, it has been suggested that modern learning algorithms, cost functions, and architectures incorporate strong explicit and implicit regularizations [19–22]. Our findings suggest there may be advantages to making neural networks larger than is required for expressing the underlying task. These advantages are related to enhancing the ease of learning convergence, and in these cases, optimal performance after learning is achieved upon removal of the additional nodes and weights. Indeed, pruning of Deep Neural Networks after training is a current topic of research in machine learning [23–26].

We start in section II by showing in simulations that implementing a sparse expansion of a perceptron network via a random mapping of the input can improve its generalization ability when learning from a noisy teacher.
In section III we analyze these results by studying a simpler model of a single layer perceptron in which the activity in the expanded units is random and uncorrelated with the stimulus. We use the replica method to derive a mean field theory exact in the thermodynamic limit, and find it matches well with simulations of large but finite size networks. In section III B we explain this phenomenon more intuitively by showing a correspondence between adding random input neurons and including slack variables in the optimization problem. We also discuss how hidden units in our two layer network model can resemble the stochastic expansion of the input layer in the one layer model. In section IV A we demonstrate how the benefit of sparse expansion also applies in more general cases of learning unrealizable rules by comparing the performance of a student learning a more complex teacher network to our theory results. In most of our work we have focused on convex learning algorithms. In section IV B we discuss to what extent these effects extend to other learning algorithms. Finally, we close by discussing some general implications of our results.

II. SPARSE EXPANSIONS AND LEARNING

FIG. 1. Teacher and student network schematics. (A) Noisy teacher network. (B) Student network. (C) Student with sparse hidden layer. (D) Student with stochastic expanded units.
We begin our analysis by considering a teacher perceptron network with N input nodes x_i, one output node, and N synaptic weights \bar{w}_i drawn iid from \bar{w}_i \sim \mathcal{N}(0, \sigma_w^2), and supervised learning tasks in which a student perceptron will attempt to learn the teacher's input-output rule from a training set provided by it. For each input x, drawn iid from x_i \sim \mathcal{N}(0, 1), the teacher generates a label \bar{y} \in \{-1, 1\} via the following rule

\bar{y} = \mathrm{sign}(\bar{h}), \qquad \bar{h} = \frac{1}{\sqrt{\sigma_w^2 N}} \sum_{i=1}^{N} \bar{w}_i x_i + \epsilon \qquad (1)

where \epsilon \sim \mathcal{N}(0, \sigma_{\rm out}^2) denotes an output or label noise (Fig. 1 A). We assume a training set consisting of P such input-output pairs, and we define \alpha_0 = P/N as the measurement density of training examples relative to the teacher.

The goal of training is to yield network weights that perform well on new inputs, i.e., to have a small generalization error E_g, defined as the expected fraction of mislabeled examples averaged over the full distributions of inputs x and the noise \epsilon as follows

E_g(w) = \left\langle \Theta\left( -y(x)\,\bar{y}(x) \right) \right\rangle_{x, \epsilon} \qquad (2)

where the student's label is

y(x) = \mathrm{sign}\left( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i x_i \right) \qquad (3)

The generalization error is minimized when the student weights equal those of the teacher, i.e. w = \bar{w}. This will yield the same generalization error as the teacher itself if it were tested on examples with labels generated via Eqn. 1. We refer to this error as the minimal generalization error, which can be expressed in terms of the noise as follows

E_{\min} = E_g(\bar{w}) = \frac{1}{\pi}\left( \frac{\pi}{2} - \tan^{-1}\left( \frac{1}{\sigma_{\rm out}} \right) \right) \qquad (4)

which provides a lower bound on the generalization error of a student, as no network architecture (even one more complex than a perceptron) can yield better performance. Finding the optimal set of weights may be difficult even if the number of examples is large.
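To make this setup concrete, the minimal error of Eqn. 4 can be checked against a direct simulation of the noisy teacher. The following is a minimal numpy sketch; the network size, test-set size, and noise level are illustrative choices, not the values used in the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P_test, sigma_w, sigma_out = 500, 200_000, 1.0, 0.25

# Teacher weights and gaussian test inputs, as in Eqn (1).
w_bar = rng.normal(0.0, sigma_w, size=N)
x = rng.normal(0.0, 1.0, size=(P_test, N))

# Noisy teacher labels: y_bar = sign(h_bar).
h_bar = x @ w_bar / np.sqrt(sigma_w**2 * N) + rng.normal(0.0, sigma_out, size=P_test)
y_bar = np.sign(h_bar)

# An "oracle" student handed the teacher weights, w = w_bar, labeling via Eqn (3).
y_student = np.sign(x @ w_bar / np.sqrt(N))

E_emp = np.mean(y_student != y_bar)
E_min = (np.pi / 2 - np.arctan(1.0 / sigma_out)) / np.pi   # Eqn (4)
print(E_emp, E_min)   # these agree closely
```

The check works because the mismatch probability of two jointly gaussian fields with correlation \rho is \arccos(\rho)/\pi; with \rho = 1/\sqrt{1 + \sigma_{\rm out}^2} this reduces to Eqn. 4.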
Due to label noise from the teacher, training examples will no longer be linearly separable, i.e., perfectly classified by a perceptron, beyond some critical value of P, rendering the training task "unrealizable" by a perceptron. Furthermore, unlike the realizable regime, in the unrealizable regime finding the minimum of the training error is a nonconvex problem and can be hampered by local minima. Here we assume that the training is restricted to minimizing the training error by applying convex algorithms. Such training algorithms are limited to training set sizes smaller than the capacity. The capacity depends on the level of output noise in the labels, see Fig. 5. For a teacher of fixed width N and a fixed training set of size P, we can increase the capacity of the student network by making the student network larger than the teacher.

There are several ways to expand the student network, and each has a different effect on the generalization performance. We first increase the network size by implementing a random transformation of the input stimuli to a hidden layer of size N_+, as depicted in C of Fig. 1. The labels in the student network are given by y^\mu = \mathrm{sign}(h^\mu) where

h^\mu = \frac{1}{\sqrt{N}} \left( \sum_{i=1}^{N} w_i x_i^\mu + \sum_{j=1}^{N_+} \tilde{w}_j z_j^\mu \right) \qquad (5)

where z^\mu represents the activity in a hidden layer of neurons generated by a random connectivity matrix J,

z_j^\mu = \frac{A}{\sqrt{f(1-f)}} \left( \Theta\left( \sum_{i=1}^{N} J_{ji} x_i^\mu - T \right) - f \right) \qquad (6)

where A is a positive scalar and T is a firing threshold chosen to produce hidden layer neuronal activity with a given sparsity f. The synapses J_{ji} are chosen iid according to J_{ji} \sim \mathcal{N}(0, 1) and are uncorrelated with the teacher network.

In simulations, we measure the performance of this network by estimating the generalization error on new examples generated from the same distribution as the training set (x^\mu, y^\mu). Because J is fixed, the training problem is still that of linear classification, with an expanded input layer of size N + N_+ = \beta N and a correspondingly expanded trained weight vector (w, \tilde{w}). We train the output weights using max-margin classification (i.e., linear SVM [27, 28]), which finds an error free solution that maximizes the minimal distance of the input examples from the separating plane, called the margin \kappa, which in our case is defined through the linear inequality

y^\mu h^\mu \ge \kappa \, \| (w, \tilde{w}) \|, \quad \forall \mu \qquad (7)

provided that such a solution exists. Max-margin classification is equivalent to solving the following problem,

(w^*, \tilde{w}^*) = \arg\min_{w, \tilde{w}} \sum_{i=1}^{N} w_i^2 + \sum_{j=1}^{N_+} \tilde{w}_j^2 \qquad (8)

\text{s.t.} \quad y^\mu h^\mu \ge 1 \quad \forall \mu \qquad (9)

The optimization problem in Eqn. 9 is convex and admits a unique solution (w^*, \tilde{w}^*). We choose the max-margin solution as in general it is known to yield a robust solution to the classification problem with good generalization performance [29, 30].

As expected, the addition of this hidden layer increases the capacity of the student, namely the maximal value of P for which the training data are linearly separable [31]. For instance, for the parameters of Figs. 2(a) and 2(b), the capacity increases from its maximum value at \beta = 1 to \alpha_c \sim 31 and \sim 65 for \beta = 5 and 10, respectively. In the limit N \to \infty, it appears that this increased capacity does not depend on the sparsity of the hidden layer, or depends on it only very weakly. Importantly, by enabling the network to train successfully on a large training set, adding the random layer substantially improves the generalization performance of the network, particularly if the hidden layer activity z^\mu is very sparse, i.e., f \ll 1, and the expanded units are pruned after learning. In contrast, for a hidden layer with dense activity, the generalization error decreases initially with increasing \alpha_0 but then saturates at an intermediate value of \alpha_0 and increases for larger values. Additionally, for dense activity the performance slightly deteriorates when the extra neurons are removed after learning, see Fig. 4. The role of sparseness will be discussed below in section III C.

FIG. 2. The generalization error E_g from simulations of a two layer network. 2(a) compares E_g as a function of \alpha_0 for a student network the same size as the teacher and for student networks with expansion factor \beta = 5 with dense (f = 0.5) and sparse (f = 0.02) activity. 2(b) does the same for \beta = 10. The oracle line represents the lowest possible generalization error due to the presence of label noise. The parameters A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials are used in both figures.

FIG. 3. Comparison of the generalization error in simulations of a sparsely expanded two layer network before and after pruning the expanded units. 3(a) shows simulations for a student network with expansion factor \beta = 5 and 3(b) shows simulations for a student network with \beta = 10. We see that for both values of \beta the student network with the best overall performance is the sparsely expanded network whose expanded weights are pruned after learning. The parameters f = 0.02, A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials are used in both figures.

The network given in Eqns. 5 and 6 is difficult to study analytically because of correlations in the activities of the hidden layer induced by J [32]. We therefore consider in the following section a simplified expansion scheme, which we call a stochastic architecture, shown in Fig. 1 D. In contrast to the deterministic scheme of Fig. 1 C, here the activity patterns of the additional neurons are not generated through connections from the input layer. Instead they are randomly generated for each training pattern \mu, independent of x^\mu. The advantage of this scheme is that the random activities of the hidden neurons are statistically independent of each other as well as across different training patterns, rendering the model amenable to study using the tools of statistical mechanics. Although this scheme is artificial from a biological perspective, we will show that when the deterministic layer is very sparse the system's behavior is similar to the stochastic model.

FIG. 4. Comparison of the generalization error from simulations of a densely expanded two layer network before and after pruning the expanded units. 4(a) shows simulations for a student network with expansion factor \beta = 5 and 4(b) shows simulations for a student network with \beta = 10. We see that the densely expanded network performs best when the expanded weights are unpruned. However, the performance of the sparsely expanded network with pruned weights in Fig. 3 is superior to the densely expanded network regardless of whether the weights are pruned or kept. The parameters f = 0.5, A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials were used for all figures.

III. THEORY OF PERCEPTRON LEARNING WITH EXPANDED STOCHASTIC UNITS
In this section, we develop intuition for the effect of sparse expansion on the generalization performance of a perceptron by considering a simpler single layer student network which can be solved analytically in the mean field limit. This network (shown in B of Fig. 1) is trained on a data set of \mu = 1, \dots, P examples with binary labels y^\mu generated by the noisy teacher network in Eqn. 1. For convenience, we keep the same normalization for the student and teacher weight vectors, which corresponds to setting \sigma_w^2 = \beta in Eqn. 1. The activity of the student network takes the form

h = \frac{1}{\sqrt{N}} \left( \sum_{i=1}^{N} w_i x_i + \sum_{j=1}^{N_+} \tilde{w}_j \tilde{x}_j \right) \qquad (10)

where the \tilde{x}_j are random units added to the input layer, drawn iid from a gaussian distribution with zero mean and variance \sigma_{\rm in}^2. The label y given to input x by the student is y(x) = \mathrm{sign}(h). The student weights are trained to optimize the max margin, Eqn. 9.

A. Mean field theory
We analyze the performance of the expanded student network in Eqn. 10. We will denote the measurement density of the training set relative to the width of this student as \alpha = \alpha_0/\beta. The mean field theory below is exact in the thermodynamic limit, where P, N \to \infty with \alpha \sim O(1). To perform an ensemble average of the system's properties over different realizations of training sets, we use the replica trick in a manner similar to [33–38]. We start by considering the version space for n replicated students indexed by a:

\langle V^n \rangle = \int \prod_a dw^a \, d\tilde{w}^a \, \delta\left( \sum_{i=1}^{N} (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde{w}_j^a)^2 - N \right) \times \prod_{\mu=1}^{P} \sum_{\sigma = \pm 1} \left\langle \Theta\left( \sigma h_a^\mu - \kappa \right) \Theta\left( \sigma \bar{h}^\mu \right) \right\rangle \qquad (11)

where we have normalized the weights so that \|w^a\|^2 + \|\tilde{w}^a\|^2 = N in all replicas, and \Theta is the Heaviside step function. The quantities h_a^\mu are the student's fields induced by the \mu-th input and the weight vector (w^a, \tilde{w}^a) of the a-th replica; \bar{h}^\mu are the teacher fields induced by the \mu-th input, including noise. The angular brackets denote averaging with respect to the gaussian input vectors x^\mu (with variance 1), the student input noise vectors \tilde{x}^\mu (with variance \sigma_{\rm in}^2), and the teacher label noise \epsilon^\mu (with variance \sigma_{\rm out}^2). Since the distribution of inputs is isotropic, one does not need to average over the teacher distribution. Evaluating Eqn. 11, we derive a mean field theory in terms of the following order parameters m_a, \tilde{r}_a, q_{ab}, and \tilde{q}_{ab}:

m_a = \frac{1}{N} \sum_{i=1}^{N} \bar{w}_i w_i^a \qquad (12)

\tilde{r}_a = \frac{1}{N} \sum_{i=1}^{N_+} (\tilde{w}_i^a)^2 \qquad (13)

q_{ab} = \frac{1}{N} \sum_{i=1}^{N} w_i^a w_i^b \qquad (14)

\tilde{q}_{ab} = \frac{\sigma_{\rm in}^2}{N} \sum_{i=1}^{N_+} \tilde{w}_i^a \tilde{w}_i^b \qquad (15)

The order parameters can be understood intuitively as follows: m_a corresponds to the overlap between the student weights w^a and the teacher perceptron weights.
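As a concrete illustration, the definitions in Eqns. 12–15 can be evaluated numerically. The vectors below are random stand-ins for two replica solutions, not solutions of the saddle point equations, and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta, sigma_in = 400, 5, 1.0
N_plus = (beta - 1) * N               # expanded units, so N + N_plus = beta * N

w_bar = rng.normal(size=N)            # stand-in teacher weights

def replica():
    """A stand-in replica (w^a, wt^a), normalized to ||w||^2 + ||wt||^2 = N."""
    w = w_bar + 0.5 * rng.normal(size=N)      # correlated with the teacher
    wt = rng.normal(size=N_plus)
    s = np.sqrt(N / (w @ w + wt @ wt))
    return w * s, wt * s

w_a, wt_a = replica()
w_b, wt_b = replica()

m_a   = (w_bar @ w_a) / N                    # Eqn (12): teacher-student overlap
r_a   = (wt_a @ wt_a) / N                    # Eqn (13): norm of expanded weights
q_ab  = (w_a @ w_b) / N                      # Eqn (14): replica-replica overlap
qt_ab = sigma_in**2 * (wt_a @ wt_b) / N      # Eqn (15): expanded-weight overlap

print(m_a, r_a, q_ab, qt_ab)
```

Note how the shared normalization forces 0 < \tilde{r}_a < 1, with 1 - \tilde{r}_a the fraction of the weight norm left in the teacher's input subspace.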
\tilde{r}_a corresponds to the norm of the expanded weights \tilde{w}^a; q_{ab} measures the overlap between the student weights w^a in replica a and w^b in replica b. Similarly, \tilde{q}_{ab} measures the overlap of the expansion weights \tilde{w}^a and \tilde{w}^b (scaled by the expansion-input variance \sigma_{\rm in}^2).

We apply the replica symmetric (RS) ansatz for the order parameters m_a, \tilde{r}_a, q_{ab}, and \tilde{q}_{ab}, which is exact because the version space of weight vectors is connected. This allows us to write the order parameter matrices in terms of the four scalar order parameters m, \tilde{r}, q, and \tilde{q} as follows

m_a = m \qquad (16)

\tilde{r}_a = \tilde{r} \qquad (17)

q_{ab} = (1 - q - \tilde{r})\,\delta_{ab} + q \qquad (18)

\tilde{q}_{ab} = \left( \sigma_{\rm in}^2 \tilde{r} - \tilde{q} \right) \delta_{ab} + \tilde{q} \qquad (19)

In the mean field limit, we can decompose \langle V^n \rangle into the sum of an entropic term and an energetic term, both functions of m, \tilde{r}, q, and \tilde{q}:

\langle V^n \rangle = \exp\left[ nN\left( G_0(q, \tilde{q}, \tilde{r}, m) + \alpha G_1(q, \tilde{q}, \tilde{r}, m) \right) \right] \qquad (20)

Within the replica framework, the max-margin solution is the unique solution which maximizes the margin and corresponds to the equivalence of w in all student replicas, such that the overlaps q \to 1 - \tilde{r} and \tilde{q} \to \sigma_{\rm in}^2 \tilde{r}. Taking the limit n \to 0, the averaged free energy is given by

\langle \log V \rangle = N\left( G_0(\tilde{r}, m) + \alpha G_1(\tilde{r}, m) \right) \qquad (21)

We obtain three closed saddle point equations for \kappa, m, and \tilde{r} by minimizing the free energy in Eqn. 21 with respect to m and \tilde{r} and requiring that V \to 0. The capacity of the network is determined by solving the mean field equations in the limit \kappa \to 0. Full details of the replica calculation and the form of the saddle point equations are given in Appendix A.

The performance of the system depends on the expansion parameter \beta and the input and output noise variances \sigma_{\rm in}^2 and \sigma_{\rm out}^2. We will focus primarily on \sigma_{\rm in} = 1, where the mean field equations simplify considerably, and solve them for \tilde{r} as a function of m and \beta. We find that in this case the capacity of the network, defined in terms of \alpha_0, obeys the simple scaling relation

\alpha_c(\beta, \sigma_{\rm out}) = \beta\, \alpha_c(1, \sigma_{\rm out}) \qquad (22)

where \alpha_c(1, \sigma_{\rm out}) is the capacity of the unexpanded network shown in Fig. 5. We derive an expression for E_g in terms of the mean field order parameters in Appendix D.

FIG. 5. The network capacity for random inputs as a function of the inverse variance of the label noise, obtained from the solution of the mean field equations for \sigma_{\rm in} = 1.

With the removal of the expanded weights, E_g takes the form

E_g = \frac{1}{\pi} \left( \frac{\pi}{2} - \tan^{-1}\left( \frac{R}{\sqrt{1 + \sigma_{\rm out}^2 - R^2}} \right) \right) \qquad (23)

where R is defined as the cosine of the angle between the student and teacher weights, which can be expressed in terms of the order parameters m and \tilde{r} as

R = \frac{m}{\sqrt{1 - \tilde{r}}} \qquad (24)

where the factor of \sqrt{1 - \tilde{r}} in Eqn. 24 is the fraction of the student weight norm in the subspace of the teacher. For student networks that are the same size as the teacher, \tilde{r} = 0 and R = m. The generalization error for a student network that retains its expanded units after learning, with stochastic noise included in each test example, is given by replacing R with m in Eqn. 23. Thus, we see that for improved generalization performance, it is necessary to prune the augmented units after learning, as was shown numerically for the deterministic network, Fig. 3. In the stochastic expansion, the intuition for removing these weights is straightforward, as retaining them implies injecting stochastic activities into each test example, uncorrelated with the task's input, which will obviously reduce performance.
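The pruning effect can be read off directly from Eqns. 23 and 24: whenever \tilde{r} > 0, the pruned overlap R = m/\sqrt{1 - \tilde{r}} exceeds m, and E_g decreases monotonically with the overlap. A short numerical check; the order-parameter values below are hypothetical, not solutions of the saddle point equations:

```python
import numpy as np

def E_g(R, sigma_out):
    """Generalization error, Eqn (23), for teacher-student cosine overlap R."""
    return (np.pi / 2 - np.arctan(R / np.sqrt(1.0 + sigma_out**2 - R**2))) / np.pi

sigma_out = 0.25
m, r_tilde = 0.9, 0.15            # hypothetical order-parameter values
R = m / np.sqrt(1.0 - r_tilde)    # Eqn (24): effective overlap after pruning

print(E_g(R, sigma_out))          # pruned student
print(E_g(m, sigma_out))          # student that keeps its stochastic units
print(E_g(1.0, sigma_out))        # oracle; reduces to E_min of Eqn (4)
```

The last line confirms that setting R = 1 in Eqn. 23 recovers Eqn. 4, while the first two lines show the pruned error falling below the unpruned one for the same order parameters.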
The situation is different in the deterministic network, in which correlations between the expanded and original components of the network are induced by the random map J; hence through learning the \tilde{w} acquire some information about the task. Indeed, as we have shown above, for dense expansion, retaining these weights slightly increases the performance. However, for sparse expansion, the correlation between the expanded activations and the task input is small (see below), hence pruning improves the performance, similar to the stochastic case. Finally, we note that in the case of zero output noise, E_g is just the angle between the student and teacher normalized by \pi, and the minimal error is given by Eqn. 23 with R = 1, in agreement with Eqn. 4.

In Fig. 6, we plot the theoretical results for E_g as a function of \alpha_0 for different values of \beta for two values of \sigma_{\rm out}. For both high and low \sigma_{\rm out}, the generalization error decreases monotonically as a function of \alpha_0 for fixed \beta and as a function of \beta for fixed \alpha_0. In 6(c) and 6(d) of Fig. 6 we show the minimal E_g as a function of \beta, defined as the generalization error reached for each \beta after minimizing over \alpha_0. An interesting question is whether, for a given size of training set, there is a finite optimal expansion ratio. We find two qualitatively different behaviors depending on the value of \sigma_{\rm out}. For low values of \sigma_{\rm out}, for each fixed value of \alpha_0, the student network with the lowest generalization error is the smallest network which can fit all of the training examples. For higher values of \sigma_{\rm out}, we find that making the network larger always improves the generalization performance for any value of \alpha_0, with the best performance occurring in the limit \beta \to \infty. The crossover between these two regimes occurs roughly around \sigma_{\rm out} \sim 0.5. We conclude that adding noisy units during learning gives the network the capacity to fit the label noise and train on more examples in a way that does not interfere with the relevant weight information. This allows networks with larger expansion ratios to achieve better generalization as they are trained on more examples.

FIG. 6. The replica theory results for the generalization error. E_g is shown as a function of \alpha_0 for several values of the expansion factor \beta for label noise with standard deviation \sigma_{\rm out} = 0.25 in 6(a) and standard deviation \sigma_{\rm out} = 1 in 6(b). E_g is shown as a function of \beta for \sigma_{\rm out} = 0.25 in 6(c) and \sigma_{\rm out} = 1 in 6(d). \sigma_{\rm in} = 1 for all figures.

So far, we have considered the simple case of \sigma_{\rm in} = 1. We now discuss briefly the effect of varying \sigma_{\rm in}. In Fig. 7, we demonstrate how varying the level of input noise can improve the generalization error by comparing theory and simulations for different choices of \sigma_{\rm in}. We find that calculations of E_g from simulations match very well with the values obtained from the solution of the mean field equations, shown in Fig. 7. For low label noise, the generalization performance is most substantially improved when the variance of activity in the added units is much lower than the variance of the patterns being learned, i.e. \sigma_{\rm in}^2 < 1. For a fixed value of label noise \sigma_{\rm out}, we find that there is an optimal variance of the augmented units which minimizes E_g for fixed measurement density \alpha_0 and expansion factor \beta. This value can be determined from the replica equations, as shown in Fig. 7(b) and discussed in Appendix F. We will return to this issue in section III C.

FIG. 7. The replica theory results compared with simulations for the generalization error E_g. 7(a) shows E_g for \sigma_{\rm out} = 0. , \beta = 5, and several values of \sigma_{\rm in}. The error bars are computed from the mean and standard deviation of 400 trials with N = 100. 7(b) shows E_g vs. \sigma_{\rm in} with \alpha_0 = 3, \beta = 5, and \sigma_{\rm out} = 1; the line represents the replica prediction for the value of \sigma_{\rm in} that minimizes the generalization error.

FIG. 8. Comparison of student networks with stochastic and sparse expansions.
8(a) compares simulations of the two layer student network with the theory results for the one layer network for \beta = 5, \sigma_{\rm out} = 0. , N = 100, and 200 trials. In general, we see that student networks with stochastic added units attain superior performance when compared to deterministic networks of the same size. 8(b) compares E_g(\beta) for the case of stochastic augmented input units and deterministic hidden units with dense and sparse activity, with \sigma_{\rm out} = 0. and N = 80. The parameters \sigma_{\rm in} = A = 1 and 200 trials are used for both figures.

B. Comparison between stochastic expansion and deterministic sparse expansion
For networks expanded with sparse hidden layers, the parameter A is closely related to \sigma_{\rm in}. We directly compare the generalization performance of the student network with a sparse hidden layer (Eqn. 5) with the student network with stochastic units added to the input (Eqn. 10) by setting \sigma_{\rm in} = A, so that the statistics of the expansion units match in the two networks. For simplicity we consider the case \sigma_{\rm in} = A = 1. Fig. 8(a) shows the generalization error for each network with \beta = 5, and Fig. 8(b) shows the generalization error as a function of the network expansion factor \beta. The stochastically expanded network achieves superior generalization performance for larger values of \alpha_0 and has a higher capacity. However, as can be seen, the performance of the deterministic networks approaches that of the stochastic network upon increasing the sparsity of the hidden layer activity. This is expected, as the correlations in the sparse activities are weak and hence approach the uncorrelated stochastic limit.

C. Correspondence with slack regularization
While it is clear that expanding a network increases its capacity, it is not obvious that the expansion we have implemented should lead to improved generalization. While widening a network increases its capacity to fit more training data, it may also increase its Rademacher complexity, improving its ability to fit random input-output data [22]. However, it turns out that the improved generalization performance in the networks we have studied can be related to an equivalence between our expanded network trained in the realizable regime and an unexpanded network trained in the unrealizable regime using slack regularization, which we now explain.

We consider the relation between our expansion schemes for learning and that of the slack SVM, which is defined as

\min_{w, \xi} \sum_{i=1}^{N} w_i^2 + C \sum_{\mu=1}^{P} (\xi^\mu)^2 \quad \text{s.t.} \quad y^\mu \left( \sum_{i=1}^{N} w_i x_i^\mu \right) \ge 1 - \xi^\mu \qquad (25)

While SVM learning works only in the realizable regime, the slack SVM is a convex optimization that allows nonzero classification errors (when \xi^\mu > 1) and regularizes them through the slack parameter C, which applies L2 regularization to the slack variables \xi^\mu. Although it does not minimize the training error, and its cost function does not have a well defined interpretation in terms of the classification task, it is a popular learning algorithm due to its simplicity and its empirically nice generalization properties.

To see the relation to the SVM with stochastic expansion, we first note that the minimizer \tilde{w} of Eqn. 8 will necessarily be in the span of the P input stochastic vectors \tilde{X}^\mu = \tilde{x}^\mu y^\mu, since any projection on the null space would increase the norm of \tilde{w} without contributing to the satisfaction of the inequalities. Defining new variables \xi^\mu as

\xi^\mu = \tilde{X}^{\mu T} \tilde{w} \qquad (26)

we can write the optimal \tilde{w} as

\tilde{w} = (\tilde{X}^T)^{+} \xi \qquad (27)

where \tilde{X} is the matrix of input stochastic vectors and + denotes the pseudo-inverse operation. Substituting Eqn. 27 into Eqn. 8 yields

\min_{w, \xi} \sum_{i=1}^{N} w_i^2 + \sum_{\mu=1}^{P} \sum_{\nu=1}^{P} \xi^\mu \left[ C^{+} \right]_{\mu\nu} \xi^\nu \quad \text{s.t.} \quad y^\mu \left( \sum_{i=1}^{N} w_i x_i^\mu \right) \ge 1 - \xi^\mu \qquad (28)

where

C_{\mu\nu} = \tilde{x}^\mu \tilde{x}^{\nu T} \qquad (29)

which is just the sample covariance matrix of the expanded inputs in the training set. We recognize the second term in Eqn. 28 as the square of the Mahalanobis distance of the vector \xi from a set of observations with zero mean and covariance matrix C_{\mu\nu}. Thus, the SVM with expanded networks is equivalent to a slack SVM of the original network that incorporates a Mahalanobis distance regularization of the slack variables, with the covariance regularizer matrix C injected by the expanded activities.

Furthermore, we can establish an exact correspondence between the stochastic expansion and the slack SVM, Eqn. 28, in the limit of large \beta (and fixed \alpha_0) by noting that in this limit \tilde{x}^\mu \tilde{x}^{\nu T} \sim \sigma_{\rm in}^2 \delta_{\mu\nu}; hence the slack term becomes

\sum_{\mu=1}^{P} \sum_{\nu=1}^{P} \xi^\mu \left[ C^{-1} \right]_{\mu\nu} \xi^\nu \to \frac{1}{\sigma_{\rm in}^2} \sum_{\mu=1}^{P} (\xi^\mu)^2 \qquad (30)

which is a generic slack regularization term, with C = \sigma_{\rm in}^{-2}. This implies that the addition of stochastic units becomes equivalent to the addition of slack terms in the limit \beta \to \infty. The equivalence breaks down completely for N_+ < P, when the matrix C_{\mu\nu} becomes non-invertible. The above equivalence holds also for the deterministic expansion, where now C_{\mu\nu} = z^\mu z^{\nu T}, see Eqn. 6. In the case of a sparse expansion, C_{\mu\nu} has small off-diagonal elements and diagonal elements equal to A^2, which plays the role of the slack regularizer.

IV. EXTENSIONS
So far, we have focused on a perceptron learning a noisy perceptron rule using convex learning algorithms. In the following section, we investigate whether random expansion of the network during learning is also beneficial when the teacher is given by a nonlinear classification rule, and when training with gradient based methods.
A. Learning a nonlinear classification rule
To model a perceptron learning a complex but deterministic rule, we consider a student perceptron learning from a quadratic teacher where the target rule is

y(x) = \mathrm{sign}\left( \frac{a}{\sqrt N} \sum_{i=1}^N w_i^0 x_i + \frac{1-a}{N} \sum_{i,j=1}^N w_{ij}^0 x_i x_j \right)    (31)

with weights drawn iid as w_i^0, w_{ij}^0 \sim \mathcal N(0,1). The scalar coefficient a, between zero and one, denotes the relative weight of the linear component of the teacher. Clearly, a perceptron student cannot emulate such a rule perfectly. In fact, for a perceptron with N weights, the optimal weights are w = w^0, with a nonzero minimal generalization error E_{min} which decreases with a. In addition, there is a critical capacity \alpha_c above which the training examples are unrealizable, with \alpha_c increasing with a. We now discuss the effect of adding the stochastic random layer as in Fig. 1D of size N_+, with N + N_+ = \beta N. Clearly, the capacity for learning with zero training error increases with \beta. We now ask whether this expansion is also beneficial for generalization, and whether pruning the network after learning improves performance. We have simulated training in this network using, as before, the max-margin algorithm. Results shown in Fig. 9 confirm that the expanded stochastic network performs better than the unexpanded one. Furthermore, the results are in excellent agreement with the behavior in the case of the noisy perceptron target rule, with noise strength given by

\sigma_{out} = \frac{1-a}{a} .    (32)

We also show simulation results for the two layer network with dense and sparse deterministic expansions in Fig. 10.
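As a sanity check on the effective-noise mapping of Eqn. 32, the quadratic teacher of Eqn. 31 can be sampled directly and compared with Gaussian label noise. The sketch below is ours, not the paper's simulation code; sizes, the seed, and a = 0.5 are illustrative choices:

```python
import numpy as np

# Illustrative Monte Carlo sketch of the quadratic teacher rule, Eqn. 31,
# and its effective-noise reading, Eqn. 32: relative to its own linear part,
# the quadratic part acts approximately like Gaussian label noise of
# relative strength sigma_out = (1 - a)/a.
rng = np.random.default_rng(2)
N, a, trials = 400, 0.5, 10_000
w0 = rng.normal(size=N)          # linear teacher weights w_i
W0 = rng.normal(size=(N, N))     # quadratic teacher weights w_ij

X = rng.normal(size=(trials, N))
lin = (a / np.sqrt(N)) * X @ w0
quad = ((1 - a) / N) * np.sum((X @ W0) * X, axis=1)
y = np.sign(lin + quad)          # teacher labels, Eqn. 31

# Fraction of labels flipped relative to the linear rule alone, compared
# with the prediction arctan(sigma_out)/pi for two independent centered
# Gaussian fields whose widths have ratio sigma_out.
flip = np.mean(y != np.sign(lin))
pred = np.arctan((1 - a) / a) / np.pi
assert abs(flip - pred) < 0.03
```

For large N the quadratic field is approximately Gaussian and uncorrelated with the linear field, which is why the simple two-Gaussian flip-rate formula applies.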
FIG. 9. Comparison of E_g as a function of \alpha for a student learning a quadratic teacher vs. the same student learning a linear teacher with label noise. Error bars in 9(a) are obtained from simulations of a stochastic expanded student with \sigma_{in} = 1 learning a quadratic teacher for 200 trials, and the solid lines correspond to the replica theory result for a student learning from a noisy teacher. In 9(b) we compare simulations of a two layer student network with f = 0.02 and A = 1 learning from a quadratic teacher with simulations of the same student network learning from a noisy teacher for 400 trials. The parameters a = 0.5, \sigma_{out} = 1, and N = 100 are used in both figures.

FIG. 10. Simulation results for a two layer expanded student learning a quadratic teacher. 10(a) compares E_g as a function of \alpha before and after pruning for a sparse hidden layer, and 10(b) compares the performance before and after pruning for a dense hidden layer. The parameters a = 0.5, \sigma_{out} = 1, \beta = 10, N = 100, and 400 trials are used in both figures.

As in the case of a noisy teacher, the optimal generalization performance occurs after the extra neurons and synapses are removed from the network for a sparse expansion. This effect persists for values of \beta as large as \beta = 40 for N = 60. In the case of the two layer network, it is not entirely obvious that removing the extra synapses would improve performance, as this structure may be used to learn something about the quadratic part of the teacher. It is possible that there are parameter regimes in which it is beneficial to keep the extra weights unpruned that we have been unable to reach due to computational limitations on \beta and N. Despite these potential shortcomings, our findings for both student architectures demonstrate that the benefits of expanding a network can also occur in the setting where the rule being learned is more complicated than the model.

B. Logistic regression
We will now consider alternative optimization methods and loss functions which allow a neural network to be trained beyond capacity. One example is logistic regression, with a cost function given by

L(w) = \sum_{\mu=1}^P \log\big( 1 + \exp(-u^\mu) \big)    (33)

u^\mu = \frac{1}{\sqrt N} \sum_{i=1}^N y^\mu w_i x_i^\mu    (34)

In the following, we consider full batch gradient descent, so that the update to the weights at each training epoch is given by \Delta w_i = -\eta\, \partial L(w)/\partial w_i.

In [39] it was shown that the normalized weight vector obtained by minimizing the logistic regression loss function via gradient descent should converge to the max margin solution after a sufficiently long training time if the training data is linearly separable. However, this correspondence depends on learning parameters such as \eta and the number of iterations. Note that, in general, convergence to the max margin solution requires running the logistic regression gradient based training for longer times than are required for finding a solution with zero training error. For unrealizable rules, e.g., the noisy teacher in Eqn. 1 and the quadratic teacher in Eqn. 31, logistic regression and max-margin classification are not equivalent for large P, because the training set provided by the teacher is not linearly separable.

In previous sections we have shown that stochastic and sparse expansions of perceptron networks increase the capacity of a network by making the training set linearly separable in a higher dimensional space. Thus, it is natural to ask under what conditions training an expanded network via logistic regression will result in a weight vector that converges to the new max margin solution in the higher dimensional space, and whether this solution can yield superior generalization performance compared to gradient based training of the unexpanded student network. We have simulated logistic regression learning for the problem of learning a noisy perceptron teacher, for several values of \eta and numbers of training epochs.
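The training scheme of Eqns. 33-34 can be sketched in a few lines. The sketch below is ours; the learning rate, epoch count, and sizes are illustrative choices rather than the values used in the paper's simulations:

```python
import numpy as np

# Minimal sketch of full batch gradient descent on the logistic loss,
# Eqns. 33-34, for a student perceptron learning a noisy linear teacher.
rng = np.random.default_rng(3)
N, P, eta, epochs, sigma_out = 100, 300, 0.5, 500, 0.5
w_teacher = rng.normal(size=N)
X = rng.normal(size=(P, N))
y = np.sign(X @ w_teacher / np.sqrt(N) + sigma_out * rng.normal(size=P))

w = np.zeros(N)
for _ in range(epochs):
    u = y * (X @ w) / np.sqrt(N)        # aligned fields u_mu, Eqn. 34
    s = 0.5 * (1.0 - np.tanh(u / 2.0))  # sigmoid(-u), overflow-safe form
    grad = -(X * (y * s)[:, None]).sum(axis=0) / np.sqrt(N)
    w -= eta * grad                     # full batch update, Delta w = -eta dL/dw

# Overlap with the teacher direction (R in the text) grows to a substantial
# value even though the noisy training set need not be linearly separable.
R = w @ w_teacher / (np.linalg.norm(w) * np.linalg.norm(w_teacher))
assert R > 0.7
```

Because the loss is differentiable everywhere, this procedure runs unchanged above capacity, where max-margin learning has no solution.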
We first consider the case \beta = 1, i.e. a student of the same size as the teacher. For \alpha below capacity, the margin increases monotonically with training epochs and converges asymptotically to the maximum margin, as shown in Fig. 12(b) (\alpha = 3), with the convergence time depending on \eta. In Fig. 12(a) we show, for the same \alpha, the value of the overlap between student and teacher as a function of \eta t. Interestingly, while R does seem to converge asymptotically to the max margin value, it is not monotonic and in fact reaches a maximum value larger than the infinite time asymptote early in the training. Thus, the max margin solution is not necessarily the one with the best generalization performance. Above capacity, logistic regression permits solutions with nonzero training error, and we find that it results in good generalization performance. The value of R as a function of \alpha is shown in Fig. 11. As seen, for small \alpha the overlap (achieved after a large number of epochs) is close to the max margin solution, with precise values dependent on \eta and the stopping criterion. When \alpha increases above capacity, R increases monotonically and seems to approach R = 1 for large \alpha (corresponding to the optimal solution w = w^0), although the amount of increase depends on \eta. Note that in this regime both R and \kappa converge quickly to their asymptotic values, as shown in Fig. 12.

For an expanded student network, i.e. \beta > 1, we find that R converges to the max margin value after long training times for \alpha below capacity, as shown in Fig. 13, and continues to increase with \alpha as it increases above capacity. However, for fixed values of \eta that are not too large, the largest value of R for any \alpha is obtained for the unexpanded network, i.e., \beta = 1, as shown in Fig. 13. This implies that in this range of parameters, expanding the network does not improve generalization performance.

FIG. 11. Simulation results for logistic regression showing R vs. \alpha for learning rates \eta = 0.5, 0.05, 0.005, for N = 100 and N = 400, with \sigma_{out} = 0.

FIG. 12. Simulation results for logistic regression with \beta = 1 and \sigma_{out} = 0. t is defined as the number of training epochs. 12(a) shows R as a function of \eta t for \alpha = 3 (below capacity) and \alpha = 8 (above capacity) for N = 100. 12(b) shows the margin \kappa as a function of \eta t for \alpha = 3 and \alpha = 8.
V. DISCUSSION
In this work, we have shown how expanding the architecture of neural networks can provide computational benefits beyond better expressivity, and can improve the generalization performance of the network even after the expanded weights and neurons are pruned following training. We obtain equations for the order parameters characterizing generalization in randomly expanded perceptron networks (called stochastic expansion) in the mean field limit, and show explicitly that expansion allows for more accurate learning of noisy or complex teacher networks. This is achieved by increasing network capacity during training, allowing the learning to benefit from more examples. We show a qualitatively similar improved performance when expanding by adding fixed random weights (deterministic expansion) connecting the input to sparsely active hidden units. Additional insight into our results is provided by showing that the expansion is effectively similar to the addition of slack variables to max-margin learning. We believe our findings suggest a possible biological function for adult neurogenesis in the brain.

FIG. 13. Simulation results for logistic regression for several values of \beta, with N = 100 and \sigma_{out} = 0. R vs. \alpha for \beta = 1, 3, 5 and \eta = 0.01. The max margin line in the plot corresponds to the max margin solution for \beta = 5, and the circles mark the capacity for each value of \beta.

In our analysis, we considered training sets drawn iid from a Gaussian distribution with no spatial structure. It would be interesting to see how our results could be extended to learning structured data. In particular, [40] developed a theory for the linear classification of manifolds with arbitrary geometry by using special anchor points on the manifolds to define novel geometrical measures of radius and dimension which can be directly linked to the classification capacity for manifolds of various geometries. It would be interesting to see if sparse expansions similar to those we have studied could be useful in classifying noisy manifolds, and if there is any correspondence to SVMs containing anisotropic slack regularization encoded in the structure of the covariance matrix as in Eqn. 29.

It would also be interesting to determine how and if our observations apply to learning in deep networks with multiple layers.
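The slack-covariance correspondence of Eqns. 26-30 referred to above can also be checked numerically. The sketch below is ours (sizes, seed, and names are illustrative): part (i) verifies that the squared norm of the minimum-norm expanded weights equals the Mahalanobis slack penalty, and part (ii) shows the Gram matrix concentrating on \sigma_{in}^2 \mathbb{1} as the expansion grows.

```python
import numpy as np

# Numerical sketch of the slack-SVM correspondence, Eqns. 26-30.
rng = np.random.default_rng(0)
P, N_plus, sigma_in = 20, 200, 0.7
Xt = rng.normal(0.0, sigma_in, size=(P, N_plus))  # expanded inputs (one row per example)
xi = rng.normal(size=P)                           # prescribed slack values

# (i) Minimum-norm expanded weights reproducing the slacks (Eqn. 27) have
# squared norm equal to the Mahalanobis penalty of Eqn. 28.
w = np.linalg.pinv(Xt) @ xi          # min-norm solution of xi = Xt @ w
C = Xt @ Xt.T                        # Gram matrix C_mu_nu = x_mu . x_nu (Eqn. 29)
assert np.allclose(Xt @ w, xi)       # constraints satisfied exactly (P < N_plus)
assert np.isclose(w @ w, xi @ np.linalg.pinv(C) @ xi)

# (ii) As the expansion grows, the normalized Gram matrix concentrates on
# sigma_in^2 * I, so the penalty reduces to a plain sum of squares (Eqn. 30).
def gram_deviation(n_plus):
    X = rng.normal(0.0, sigma_in, size=(P, n_plus))
    return np.max(np.abs(X @ X.T / n_plus - sigma_in**2 * np.eye(P)))

assert gram_deviation(100_000) < gram_deviation(200)
```

The off-diagonal Gram entries shrink like 1/\sqrt{N_+}, which is the mechanism behind the large-\beta limit of Eqn. 30.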
Neural network pruning techniques have been widely discussed in the deep learning community, and it has been shown that pruning can reduce the parameter counts of trained networks by over 90% without compromising accuracy [23, 41]. Training a pruned model from scratch is worse than retraining a pruned model, which suggests that the extra capacity of the network allows it to find more optimal solutions. In [42], the authors find that dense, randomly-initialized, feed-forward networks contain subnetworks that can reach test accuracy comparable to the original network in a similar number of training iterations when trained in isolation. It would be interesting to see if the extra weights in the larger networks can be translated into a regularization condition on the subnetwork.

Most of our work focused on max margin learning. We have also explored the effect of expansion on gradient based learning with a logistic regression cost function. We find that for appropriate choices of learning rate and learning time, generalization is similar to the max margin performance below the network capacity, consistent with [39]. We also found that in the explored parameter range, optimal generalization performance is achieved by the unexpanded network, as gradient based learning can extract useful information even beyond capacity. However, understanding generalization in gradient based learning requires a more thorough understanding of the roles of learning rate and training time, which is difficult given the lack of theory for the training dynamics of logistic regression. It would be interesting to see if there is a way to scale \eta such that expanding the network can provide similar benefits for logistic regression beyond capacity as for max margin learning. We leave this to future work.

We also note that generalization can also improve when adding unquenched noise to the student labels during training with the logistic loss, as this prevents the classifier from overfitting (results not shown; [43, 44]). This differs from our construction for two reasons. The first is that our student by construction learns the weights in the extended part of the network. The second is that our dimensionality expansion changes the properties of the training set, in that a nonlinearly separable training set in the original space may become linearly separable in the higher dimensional expanded space.

VI. ACKNOWLEDGMENTS
We would like to thank Haozhe Shan and Weishun Zhong for valuable discussions concerning our logistic regression results, and Subir Sachdev for helpful comments on the draft. This work is partially supported by the Gatsby Charitable Foundation, the Swartz Foundation, and the National Institutes of Health (Grant No. 1U19NS104653). J.S. acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. DGE1144152.

[1] C. Zhao, W. Deng, and F. H. Gage, Cell, 645 (2008).
[2] G.-l. Ming and H. Song, Neuron, 687 (2011).
[3] J. B. Aimone, Cold Spring Harbor Perspectives in Biology.
[4] W. Deng, J. B. Aimone, and F. H. Gage, Nature Reviews Neuroscience.
[5] G. Mongillo, S. Rumpel, and Y. Loewenstein, Current Opinion in Neurobiology, 7 (2017), Computational Neuroscience.
[6] B. A. Olshausen and D. J. Field, Current Opinion in Neurobiology.
[7] S. Ganguli and H. Sompolinsky, Annual Review of Neuroscience, 485 (2012), PMID: 22483042.
[8] A. Litwin-Kumar, K. D. Harris, R. Axel, H. Sompolinsky, and L. F. Abbott, Neuron, 1153 (2014).
[9] A. Treves and E. T. Rolls, Hippocampus, 374.
[10] M. V. Tsodyks and M. V. Feigelman, Europhysics Letters (EPL), 101 (1988).
[11] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[12] J. Schmidhuber, Neural Networks, 85 (2015).
[13] Y. Bengio and O. Delalleau, in International Conference on Algorithmic Learning Theory (Springer, 2011) pp. 18-36.
[14] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (2016) pp. 3360-3368.
[15] I. Safran and O. Shamir, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17 (JMLR.org, 2017) pp. 2979-2987.
[16] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, edited by D. Precup and Y. W. Teh (PMLR, International Convention Centre, Sydney, Australia, 2017) pp. 2847-2854.
[17] K. Simonyan and A. Zisserman, arXiv preprint arXiv:1409.1556 (2014).
[18] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, arXiv preprint arXiv:1611.03530 (2016).
[19] P. L. Bartlett and S. Mendelson, Journal of Machine Learning Research, 463 (2002).
[20] M. Advani and A. Saxe, (2017), arXiv:1710.03667 [stat.ML].
[21] Y. Bansal, M. Advani, D. Cox, and A. Saxe, (2018), arXiv:1806.00730 [stat.ML].
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (MIT Press, 2012).
[23] S. Han, J. Pool, J. Tran, and W. Dally, in Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc., 2015) pp. 1135-1143.
[24] T. Yang, Y. Chen, and V. Sze, in (2017) pp. 6071-6079.
[25] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, arXiv preprint arXiv:2003.03033 (2020).
[26] T. Gale, E. Elsen, and S. Hooker, arXiv preprint arXiv:1902.09574 (2019).
[27] C. Cortes and V. Vapnik, Machine Learning, 273 (1995).
[28] B. E. Boser, I. M. Guyon, and V. N. Vapnik, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92 (Association for Computing Machinery, New York, NY, USA, 1992) pp. 144-152.
[29] P. Bartlett and J. Shawe-Taylor, "Generalization performance of support vector machines and other pattern classifiers," in Advances in Kernel Methods: Support Vector Learning (MIT Press, Cambridge, MA, USA, 1999) pp. 43-54.
[30] V. Vapnik, Statistical Learning Theory.
[31] T. M. Cover, IEEE Transactions on Electronic Computers EC-14, 326 (1965).
[32] B. Babadi and H. Sompolinsky, Neuron, 1213 (2014).
[33] E. Gardner, Europhysics Letters (EPL), 481 (1987).
[34] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General, 271 (1988).
[35] E. Gardner, Journal of Physics A: Mathematical and General, 257 (1988).
[36] H. Seung, H. Sompolinsky, and N. Tishby, Physical Review A, 6056 (1992).
[37] T. L. H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys., 499 (1993).
[38] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning.
[39] D. Soudry, E. Hoffer, M. Shpigel Nacson, S. Gunasekar, and N. Srebro, Journal of Machine Learning Research, 1 (2018).
[40] S. Chung, D. D. Lee, and H. Sompolinsky, Phys. Rev. X, 031003 (2018).
[41] Y. LeCun, J. S. Denker, and S. A. Solla, in Advances in Neural Information Processing Systems 2, edited by D. S. Touretzky (Morgan-Kaufmann, 1990) pp. 598-605.
[42] J. Frankle and M. Carbin, in International Conference on Learning Representations (2019).
[43] C. M. Bishop, Neural Computation, 108 (1995).
[44] M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11 (Omnipress, USA, 2011) pp. 681-688.

Appendix A: Mean field equations
We outline the derivation of the mean field equations used to compute the order parameters defined in Eqns. 12, 13, 14, and 15, which are used to compute the generalization error given in Eqn. 23. We define the student field for each replica a of the student network as

h_a^\mu = \frac{1}{\sqrt N} \sum_i W_i^a X_i^\mu = \frac{1}{\sqrt N} \left( \sum_{i=1}^N w_i^a x_i^\mu + \sum_{j=1}^{N_+} \tilde w_j^a \tilde x_j^\mu \right),    (A1)

and the teacher field as

h_0^\mu = \frac{1}{\sqrt N} \sum_i W_i^0 X_i^\mu + \epsilon^\mu = \frac{1}{\sqrt N} \sum_{i=1}^N w_i^0 x_i^\mu + \epsilon^\mu.    (A2)

We can now write the average over the version space in Eqn. 20 in terms of these new variables,

\langle V^n \rangle = \left\langle \int \prod_{i=1}^N \prod_{j=1}^{N_+} \prod_a dw_i^a\, d\tilde w_j^a\; \delta\!\left( \sum_{i=1}^N (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde w_j^a)^2 - N \right) \prod_\mu^P \int dh_a^\mu\, d\hat h_a^\mu\, dh_0^\mu\, d\hat h_0^\mu \left[ \prod_a \Theta( y^\mu h_a^\mu - \kappa ) \right] \Theta( y^\mu h_0^\mu )\; I \right\rangle    (A3)

where I is given by

I = \exp\left[ -i \sum_{a\mu} h_a^\mu \hat h_a^\mu - i \sum_\mu h_0^\mu \hat h_0^\mu + \frac{i}{\sqrt N} \sum_{a\mu} \hat h_a^\mu \sum_i W_i^a X_i^\mu + i \sum_\mu \hat h_0^\mu \left( \frac{1}{\sqrt N} \sum_i W_i^0 X_i^\mu + \epsilon^\mu \right) \right]    (A4)

and the constraints in Eqns. A1 and A2 are implemented by the Lagrange multipliers \hat h_a^\mu and \hat h_0^\mu.
Averaging over the inputs x^\mu, \tilde x^\mu, and the noise \epsilon^\mu, I becomes

I = \int \prod_{\mu=1}^P \prod_{i=1}^N \frac{dx_i^\mu}{\sqrt{2\pi}} e^{-\frac{(x_i^\mu)^2}{2}} \prod_{j=1}^{N_+} \frac{d\tilde x_j^\mu}{\sqrt{2\pi\sigma_{in}^2}} e^{-\frac{(\tilde x_j^\mu)^2}{2\sigma_{in}^2}} \frac{d\epsilon^\mu}{\sqrt{2\pi\sigma_{out}^2}} e^{-\frac{(\epsilon^\mu)^2}{2\sigma_{out}^2}} \exp\left[ -i\sum_{\mu a} h_a^\mu \hat h_a^\mu - i\sum_\mu h_0^\mu \hat h_0^\mu + \frac{i}{\sqrt N}\sum_{\mu a}\sum_{i=1}^N ( \hat h_a^\mu w_i^a + \hat h_0^\mu w_i^0 ) x_i^\mu + \frac{i}{\sqrt N}\sum_{\mu a}\sum_{j=1}^{N_+} \hat h_a^\mu \tilde w_j^a \tilde x_j^\mu + i\sum_\mu \hat h_0^\mu \epsilon^\mu \right]

= \exp\left[ -\sum_\mu \left( i \sum_a h_a^\mu \hat h_a^\mu + i h_0^\mu \hat h_0^\mu + \sum_a \hat h_a^\mu \hat h_0^\mu \frac{1}{N}\sum_{i=1}^N w_i^a w_i^0 + \frac{1+\sigma_{out}^2}{2} (\hat h_0^\mu)^2 + \frac{1}{2}\sum_{ab} \hat h_a^\mu \hat h_b^\mu \left( \frac{1}{N}\sum_{i=1}^N w_i^a w_i^b + \frac{\sigma_{in}^2}{N}\sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) \right) \right]    (A5)

We define the order parameters m_a, q_{ab}, and \tilde q_{ab} as

m_a = \frac{1}{N} \sum_{i=1}^N w_i^0 w_i^a    (A6)

q_{ab} = \frac{1}{N} \sum_{i=1}^N w_i^a w_i^b    (A7)

\tilde q_{ab} = \frac{\sigma_{in}^2}{N} \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b    (A8)

For further convenience, we write the sum of q_{ab} and \tilde q_{ab} as

Q_{ab} = q_{ab} + \tilde q_{ab}.    (A9)

In terms of the order parameters, I becomes

I = \exp\left[ -\sum_\mu \left( i \sum_a h_a^\mu \hat h_a^\mu + i h_0^\mu \hat h_0^\mu + \sum_a \hat h_a^\mu \hat h_0^\mu m_a + \frac{1+\sigma_{out}^2}{2} (\hat h_0^\mu)^2 + \frac{1}{2}\sum_{ab} \hat h_a^\mu \hat h_b^\mu Q_{ab} \right) \right]    (A10)

We can now do the integrals over \hat h_a and \hat h_0, which gives

\prod_\mu^P \int dh_a^\mu\, d\hat h_a^\mu\, dh_0^\mu\, d\hat h_0^\mu\; I = \det( Q_{ab} - \bar m^2 )^{-P/2} \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu\; X    (A11)

where we have defined X as

X = \exp\left[ -\frac{1}{2} \sum_\mu \sum_{ab} ( \bar h^\mu \bar m - h_a^\mu )\, ( Q_{ab} - \bar m^2 )^{-1}\, ( \bar h^\mu \bar m - h_b^\mu ) \right]    (A12)

and \bar m and \bar h as

\bar m = \frac{m}{\sqrt{1+\sigma_{out}^2}}    (A13)

\bar h = \frac{h_0}{\sqrt{1+\sigma_{out}^2}}    (A14)

We now define the additional parameter \tilde r_a as

\tilde r_a = \frac{1}{N} \sum_{j=1}^{N_+} ( \tilde w_j^a )^2    (A15)

Since the solution space is connected, we can make the following replica symmetric ansatz for m_a, q_{ab}, \tilde q_{ab}, and \tilde r_a:

m_a = m    (A16)

\tilde r_a = \tilde r    (A17)

q_{ab} = (1 - \tilde r - q)\,\delta_{ab} + q    (A18)

\tilde q_{ab} = ( \sigma_{in}^2 \tilde r - \tilde q )\,\delta_{ab} + \tilde q    (A19)

Q_{ab} = ( r_Q - Q )\,\delta_{ab} + Q    (A20)

where Q = q + \tilde q and r_Q = 1 - (1 - \sigma_{in}^2)\tilde r. The inverse of the matrix in Eqn. A12 is given (for n \to 0) by

( Q_{ab} - \bar m^2 )^{-1} = \frac{1}{r_Q - Q}\,\delta_{ab} - \frac{Q - \bar m^2}{( r_Q - Q )^2}    (A21)

We now define X' as

X' = \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu\; X    (A22)

Plugging in the replica symmetric ansatz of Eqns. A16-A20, this becomes

X' = \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu \exp\left[ -\frac{1}{2(r_Q - Q)} \sum_a (h_a^\mu)^2 + \frac{Q - \bar m^2}{2(r_Q - Q)^2} \Big( \sum_a h_a^\mu \Big)^2 + \frac{\bar m \bar h^\mu}{r_Q - Q} \sum_a h_a^\mu - \frac{n\, ( \bar m \bar h^\mu )^2}{2(r_Q - Q)} \right]    (A23)

We decouple terms with different replica indices in Eqn. A23 by introducing the auxiliary variable t.
Then X' is

X' = 2 \int_0^\infty D\bar h \int Dt \left[ \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{h^2}{2(r_Q - Q)} + \frac{\sqrt{Q - \bar m^2}}{r_Q - Q}\, h t + \frac{\bar h \bar m}{r_Q - Q}\, h - \frac{\bar m^2 \bar h^2}{2(r_Q - Q)} \right) \right]^n    (A24)

where Dx = \frac{dx}{\sqrt{2\pi}} e^{-x^2/2}. Once we evaluate all of the integrals in the expression for \langle V^n \rangle, we can write it in the form

\langle V^n \rangle = \exp\big( nN [ G_0(q, \tilde q, \tilde r, m) + \alpha\, G_1(q, \tilde q, \tilde r, m) ] \big)    (A25)

where G_0 is an entropic contribution coming from the integral over the weights and G_1 is an energetic contribution whose form is dictated by the learning rule.

We start by computing the energetic contribution. We define A(t, \bar h) and Z(t, \bar h) as

A(t, \bar h) = \frac{1}{2(r_Q - Q)} \left( \sqrt{Q - \bar m^2}\, t + \bar h \bar m \right)^2 - \frac{\bar m^2 \bar h^2}{2(r_Q - Q)}    (A26)

Z(t, \bar h) = \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{1}{2(r_Q - Q)} \left[ h - \left( \sqrt{Q - \bar m^2}\, t + \bar h \bar m \right) \right]^2 \right)    (A27)

and rewrite X' as

X' = 2 \int_0^\infty D\bar h \int Dt\; \exp\big( n A(t, \bar h) \big)\, Z^n(t, \bar h)    (A28)

In the limit n \to 0, X' becomes

X' = \exp\left( \bar A n + 2n \int_0^\infty D\bar h \int Dt \log Z(t, \bar h) \right)    (A29)

where we define \bar A as the integral

\bar A = 2 \int_0^\infty D\bar h \int Dt\; A(t, \bar h) = \frac{Q - \bar m^2}{2(r_Q - Q)}    (A30)

We make the following rotation of variables,

x = \left( \sqrt{Q - \bar m^2}\, t + \bar m \bar h \right) / \sqrt{Q}    (A31)

y = \left( -\bar m\, t + \sqrt{Q - \bar m^2}\, \bar h \right) / \sqrt{Q}    (A32)

which allows us to write t and \bar h as

t = \left( \sqrt{Q - \bar m^2}\, x - \bar m\, y \right) / \sqrt{Q}    (A33)

\bar h = \left( \bar m\, x + \sqrt{Q - \bar m^2}\, y \right) / \sqrt{Q}    (A34)

and Z as a function of x alone,

Z(x) = \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{( h - \sqrt{Q}\, x )^2}{2( r_Q - Q )} \right) = \sqrt{r_Q - Q}\; H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right)    (A35)

Under this transformation, the Gaussian integrals become

2 \int_0^\infty D\bar h \int Dt = 2 \int Dx \int_{-x\bar m / \sqrt{Q - \bar m^2}}^\infty Dy = 2 \int Dx\; H\!\left( -x \bar m / \sqrt{Q - \bar m^2} \right)    (A36)

where we define

H(x) = \int_x^\infty Dy    (A37)

This gives us

2 \int_0^\infty D\bar h \int Dt \log Z(t, \bar h) = 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) + \frac{1}{2} \log( r_Q - Q )    (A38)

So X' becomes

X' = \exp\left[ n \left( 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) + \frac{1}{2} \log( r_Q - Q ) + \bar A \right) \right]    (A39)

Using the relation

\bar A - \frac{1}{2n} \log\det( Q_{ab} - \bar m^2 ) + \frac{1}{2} \log( r_Q - Q ) = 0    (A40)

the replicated volume of the version space becomes

\langle V^n \rangle = \int \prod_{a=1}^n d^N w^a\; d^{N_+} \tilde w^a\; \delta\!\left( \sum_{i=1}^N (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde w_j^a)^2 - N \right) \int dm \int dq_{ab} \int d\tilde q_{ab}\; \delta\!\left( N m - \sum_{i=1}^N w_i^a w_i^0 \right) \prod_{ab} \delta\!\left( N q_{ab} - \sum_{i=1}^N w_i^a w_i^b \right) \delta\!\left( N \tilde q_{ab} - \sigma_{in}^2 \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) \exp\left[ 2 n P \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) \right]    (A41)

We can compute the entropic term G_0(q, \tilde q, \tilde r, m) by considering the integrals over the configurations of weights allowed by the delta functions. Then \exp( nN G_0(q, \tilde q, \tilde r, m) ) is given by

\exp( nN G_0(q, \tilde q, \tilde r, m) ) = \int \prod_{a=1}^n dw^a\, d\tilde w^a\; \delta\!\left( N m - \sum_{i=1}^N w_i^a w_i^0 \right) \prod_{ab} \delta\!\left( N q_{ab} - \sum_{i=1}^N w_i^a w_i^b \right) \delta\!\left( N \tilde q_{ab} - \sigma_{in}^2 \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right)    (A42)

Introducing the Lagrange multipliers \hat m_a, \hat q_{ab}, and \hat{\tilde q}_{ab}, Eqn. A42 can be written as

\exp( nN G_0 ) = \int \prod_{a=1}^n dw^a\, d\tilde w^a \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left[ i \sum_{ab} \hat q_{ab} \Big( N q_{ab} - \sum_i w_i^a w_i^b \Big) + i \sum_{ab} \hat{\tilde q}_{ab} \Big( N \tilde q_{ab} - \sigma_{in}^2 \sum_j \tilde w_j^a \tilde w_j^b \Big) + i \sum_a \hat m_a \Big( N m_a - \sum_i w_i^a w_i^0 \Big) \right]

= \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left( i N \sum_{ab} \hat q_{ab} q_{ab} + i N \sum_{ab} \hat{\tilde q}_{ab} \tilde q_{ab} + i N \sum_a \hat m_a m_a \right) \int \prod_{a=1}^n dw^a\, d\tilde w^a\; \exp\big( -i \mathcal H( w^a, \tilde w^a, w^0 ) \big)    (A43)

where we have defined a "Hamiltonian" \mathcal H( w^a, \tilde w^a, w^0 ) as

\mathcal H( w^a, \tilde w^a, w^0 ) = \frac{1}{2} \sum_{ab} \left( \hat q_{ab} \sum_{i=1}^N w_i^a w_i^b + \sigma_{in}^2\, \hat{\tilde q}_{ab} \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) + \sum_a \hat m_a \sum_{i=1}^N w_i^a w_i^0    (A44)

Doing a Wick rotation i\hat m_a \to \hat m_a, i\hat q_{ab} \to \hat q_{ab}, i\hat{\tilde q}_{ab} \to \hat{\tilde q}_{ab} and integrating over the weights w and \tilde w, we have

\exp( nN G_0 ) = \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left( N \sum_{ab} \hat q_{ab} q_{ab} + N \sum_{ab} \hat{\tilde q}_{ab} \tilde q_{ab} + N \sum_a \hat m_a m \right) \exp\left( \frac{N}{2} \sum_{ab} \hat m_a ( \hat q^{-1} )_{ab} \hat m_b - \frac{N}{2\beta} \log\det \hat q - \frac{N(1 - \beta^{-1})}{2} \log\det \frac{\hat{\tilde q}}{\sigma_{in}^2} \right)    (A45)

We can evaluate the integral at the saddle point by solving for \hat m_a, \hat q_{ab}, and \hat{\tilde q}_{ab} using the three saddle point equations

0 = N m + N \sum_b ( \hat q^{-1} )_{cb}\, \hat m_b    (A46)

0 = N q_{cd} - \frac{N}{2} \sum_{ab} \hat m_a ( \hat q^{-1} )_{ac} ( \hat q^{-1} )_{bd} \hat m_b - \frac{N}{2\beta} ( \hat q^{-1} )_{cd}    (A47)

0 = N \tilde q_{cd} - \frac{N(1 - \beta^{-1})}{2} ( \hat{\tilde q}^{-1} )_{cd}    (A48)

We
make the following replica symmetric ansatz for \hat q_{ab} and \hat{\tilde q}_{ab}:

\hat q_{ab} = ( \hat q_0 - \hat q )\,\delta_{ab} + \hat q    (A49)

\hat{\tilde q}_{ab} = ( \hat{\tilde q}_0 - \hat{\tilde q} )\,\delta_{ab} + \hat{\tilde q}    (A50)

Inserting these expressions into Eqns. A46, A47, and A48 gives the following scalar equations:

\frac{1}{\hat q_0 - \hat q} = \beta( 1 - \tilde r - q )    (A51)

\hat m = -\frac{m}{\beta( 1 - \tilde r - q )}    (A52)

\hat q = -\frac{q - m^2}{\beta( 1 - \tilde r - q )^2}    (A53)

\frac{1}{\hat{\tilde q}_0 - \hat{\tilde q}} = \frac{\beta}{\beta - 1}\,( \sigma_{in}^2 \tilde r - \tilde q )    (A54)

\hat{\tilde q} = -\frac{\beta - 1}{\beta}\,\frac{\tilde q}{( \sigma_{in}^2 \tilde r - \tilde q )^2}    (A55)

Solving for \hat m, \hat q_0, \hat q, \hat{\tilde q}_0, and \hat{\tilde q}, we find

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2} \left( \frac{q - m^2}{\beta( 1 - \tilde r - q )} + \frac{\beta - 1}{\beta}\,\frac{\tilde q}{\sigma_{in}^2 \tilde r - \tilde q} + \frac{1}{\beta} \log\big( \beta( 1 - \tilde r - q ) \big) + \frac{\beta - 1}{\beta} \log\left( \frac{\beta( \sigma_{in}^2 \tilde r - \tilde q )}{\beta - 1} \right) \right)    (A56)

In summary, we have

\langle V^n \rangle = \exp\big( nN [ G_0(q, \tilde q, \tilde r, m) + \alpha\, G_1(q, \tilde q, \tilde r, m) ] \big)    (A57)

where the order parameters take their saddle point values, and

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2\beta} \left( \frac{q - m^2}{1 - \tilde r - q} + \log \beta( 1 - \tilde r - q ) \right) + \frac{\beta - 1}{2\beta} \left( \frac{\tilde q}{\sigma_{in}^2 \tilde r - \tilde q} + \log \frac{\beta( \sigma_{in}^2 \tilde r - \tilde q )}{\beta - 1} \right)    (A58)

G_1(q, \tilde q, \tilde r, m) = 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right)    (A59)

Appendix B: Max-margin limit in mean field theory
In the max margin limit, the uniqueness of the solutions for w and \tilde w implies

q \to 1 - \tilde r, \qquad \tilde q \to \sigma_{in}^2 \tilde r, \qquad Q \to r_Q    (B1)

In general, q and \tilde q approach their max margin values at different rates. To account for this we define the scaling factors \lambda and \tilde\lambda as

\lambda = \frac{r_Q - Q}{1 - \tilde r - q}    (B2)

\tilde\lambda = \frac{r_Q - Q}{\sigma_{in}^2 \tilde r - \tilde q}    (B3)

where \lambda^{-1} + \tilde\lambda^{-1} = 1; indeed, from the definitions Q = q + \tilde q and r_Q = 1 - (1 - \sigma_{in}^2)\tilde r we have (1 - \tilde r - q) + (\sigma_{in}^2 \tilde r - \tilde q) = r_Q - Q. This allows us to rewrite G_0 so that all of the singular terms scale as ( r_Q - Q )^{-1}, as follows:

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2} \left( \frac{\lambda}{\beta}\,\frac{q - m^2}{r_Q - Q} + \tilde\lambda\,\frac{\beta - 1}{\beta}\,\frac{\tilde q}{r_Q - Q} + \frac{1}{\beta} \log\big( \beta \lambda^{-1}( r_Q - Q ) \big) + \frac{\beta - 1}{\beta} \log\left( \frac{\beta \tilde\lambda^{-1}( r_Q - Q )}{\beta - 1} \right) \right)    (B4)

Taking the max margin limit followed by the limit n \to 0, we find that the free energy is given by

\langle \log V \rangle = \frac{N}{2( r_Q - Q )} \left( \lambda( 1 - \tilde r - m^2 ) + \lambda( \lambda^{-1} - 1 )^2\,\frac{( \beta - 1 )\sigma_{in}^2 \tilde r}{\beta} - \alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) [ \kappa - \sqrt{r_Q}\, x ]^2 \right)    (B5)

The saddle point equation for m is

\frac{\lambda \bar m}{\sqrt{r_Q - \bar m^2}} = \frac{\alpha \beta}{\sqrt{2\pi}} \int_{-\kappa/\sqrt{r_Q - \bar m^2}}^{\infty} Dx\; x\; \sigma_{out}\!\left( \frac{\kappa}{\sqrt{r_Q - \bar m^2}} + x \right)    (B6)

The saddle point equation for \tilde r is

\frac{\lambda\big( ( \lambda^{-1} - 1 )^2 ( \beta - 1 )\sigma_{in}^2 - 1 \big)}{\beta} = 2\alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) \frac{x( \kappa - \sqrt{r_Q}\, x ) + ( \sigma_{in}^2 - 1 )}{\sqrt{r_Q}}    (B7)

\qquad\qquad + \frac{\alpha \bar m ( \sigma_{in}^2 - 1 )}{\sqrt{2\pi}\, r_Q} \int_{-\kappa/\sqrt{r_Q - \bar m^2}}^{\infty} Dx\; \sqrt{r_Q - \bar m^2}\; x \left( \frac{\kappa}{\sqrt{r_Q - \bar m^2}} + x \right)    (B8)

We can use Eqn. B6 to further simplify this as

\frac{\lambda\big( ( \lambda^{-1} - 1 )^2 ( \beta - 1 )\sigma_{in}^2 - 1 \big)}{\beta} = 2\alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) \frac{x( \kappa - \sqrt{r_Q}\, x ) + ( \sigma_{in}^2 - 1 )}{\sqrt{r_Q}} + \frac{( \sigma_{in}^2 - 1 )( \sigma_{out}^2 + 1 )\, \lambda \bar m^2}{\beta\, r_Q}    (B9)

For \lambda, we have the saddle point equation

1 - m^2 = \tilde r \left( 1 + \frac{( \beta - 1 )\sigma_{in}^2}{( \lambda - 1 )^2} \right)    (B10)

which has the relevant solution

\lambda = 1 + \sqrt{ \frac{( \beta - 1 )\sigma_{in}^2 \tilde r}{1 - \tilde r - m^2} }    (B11)

R, the cosine of the angle between student and teacher, can be written in terms of m and \tilde r as

R = \frac{m}{\sqrt{1 - \tilde r}}    (B12)

For \sigma_{in} = 1, i.e. when the variance of the augmented units matches the variance of the original input, Eqns. B5, B6, B7, and B4 simplify considerably, and are given by

\frac{1}{N} \langle \ln V \rangle = \frac{1}{2( 1 - q )} \left( 1 - m^2 - \alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{1 - \bar m^2}} \right) [ \kappa - x ]^2 \right)    (B13)

\bar m = \frac{\alpha \sqrt{1 - \bar m^2}}{\sqrt{2\pi}} \int_{-\kappa/\sqrt{1 - \bar m^2}}^{\infty} Dx\; x\; \sigma_{out}\!\left( \frac{\kappa}{\sqrt{1 - \bar m^2}} + x \right)    (B14)

\tilde r = \frac{\beta - 1}{\beta}\,( 1 - m^2 )    (B15)

\lambda = \beta    (B16)

r_Q = 1    (B17)

We can now write R directly in terms of m and \beta as

R = \frac{m}{\sqrt{1 - \frac{\beta - 1}{\beta}( 1 - m^2 )}}    (B18)

Appendix C: Network at capacity
We determine the capacity of the network for fixed $\beta$ by setting the margin $\kappa = 0$ in the mean-field equations. After performing all of the integrals, we have the following three equations:

$$\lambda\left(1 - \tilde r - m^{2}\right) + \tilde\lambda(\tilde\lambda - 1)\,\frac{(\beta-1)\sigma_{in}^{2}\tilde r}{\beta} = \frac{\alpha}{\pi}\left(\operatorname{arccot}\!\left(\frac{\bar m}{\sqrt{r_Q - \bar m^{2}}}\right) - \frac{\bar m\sqrt{r_Q - \bar m^{2}}}{r_Q}\right) \quad (C1)$$

$$\frac{\lambda\bar m}{\sqrt{r_Q - \bar m^{2}}} = \frac{\alpha}{\pi}\,\frac{1}{1 + \sigma_{out}^{2}} \quad (C2)$$

$$\tilde\lambda(\tilde\lambda - 1)\,\frac{(\beta-1)\sigma_{in}^{2}}{\beta} = \frac{\alpha\left(\sigma_{in}^{2} - 1\right)}{\sqrt{\pi r_Q}}\,H\!\left(-\frac{\bar m}{\sqrt{r_Q - \bar m^{2}}}\right) + \left(\sigma_{in}^{2} - \sigma_{out}^{2} + 1\right)\frac{\lambda\bar m^{2}}{\beta r_Q} \quad (C3)$$

We can express $\alpha$ in terms of the rescaled load $\alpha_{0} = \alpha/\beta$ and solve these equations numerically for $\alpha_{0}$ to determine $\alpha_{c}$. For $\sigma_{in}^{2} = 1$, the equations for network capacity become

$$1 - m^{2} = \frac{\alpha_{0}}{\pi}\left(\operatorname{arccot}\!\left(\frac{\bar m}{\sqrt{1 - \bar m^{2}}}\right) - \bar m\sqrt{1 - \bar m^{2}}\right) \quad (C4)$$

$$\frac{\bar m}{\sqrt{1 - \bar m^{2}}} = \frac{\alpha_{0}}{\pi}\,\frac{1}{1 + \sigma_{out}^{2}} \quad (C5)$$

Note that these equations depend on $\alpha_{0}$ but not on $\beta$. This implies that for $\sigma_{in}^{2} = 1$, $\alpha_{c}/\beta$ is only a function of $\sigma_{out}^{2}$. The capacity of a network of size $\beta$ then obeys the simple scaling relation

$$\alpha_{c}\left(\beta, \sigma_{out}^{2}\right) = \beta\,\alpha_{c}\left(1, \sigma_{out}^{2}\right) \quad (C6)$$

Appendix D: Calculation of the generalization error
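The end result of this appendix, Eq. (D13), is equivalent to $E_g = \frac{1}{\pi}\arccos\!\left(R/\sqrt{1+\sigma_{out}^{2}}\right)$, and can be checked by directly sampling correlated Gaussian fields for the student and the noisy teacher. A minimal numerical sketch (the values of $R$ and $\sigma_{out}$ are illustrative):

```python
import numpy as np

def eg_formula(R, sigma_out):
    # Eq. (D13): E_g = (1/pi)(pi/2 - arctan(R / sqrt(1 + sigma_out^2 - R^2))),
    # equivalently (1/pi) arccos(R / sqrt(1 + sigma_out^2))
    return np.arccos(R / np.sqrt(1.0 + sigma_out**2)) / np.pi

def eg_monte_carlo(R, sigma_out, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    h0 = rng.standard_normal(n)                                # teacher field, unit variance
    h = R * h0 + np.sqrt(1.0 - R**2) * rng.standard_normal(n)  # student field, overlap R
    eps = sigma_out * rng.standard_normal(n)                   # output noise, variance sigma_out^2
    return np.mean(np.sign(h) != np.sign(h0 + eps))            # fraction of sign disagreements

# the two forms of Eq. (D13) agree
R, s = 0.8, 0.5
assert abs(eg_formula(R, s)
           - (np.pi / 2 - np.arctan(R / np.sqrt(1 + s**2 - R**2))) / np.pi) < 1e-12
```

With $R = 0.8$ and $\sigma_{out} = 0.5$, the Monte Carlo estimate matches the closed form to a few parts in $10^{3}$ at $10^{6}$ samples.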
To evaluate the generalization error in terms of the mean-field order parameters, we start from the following expression for the error on a single example:

$$E\left(w, x, \epsilon\right) = \Theta\!\left(-\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w_{i} x_{i}\right)\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w^{0}_{i} x_{i} + \epsilon\right)\right) \quad (D1)$$

Averaging over the input $x$ and the noise $\epsilon$, we get

$$E_{g}(w) = \int \prod_{i=1}^{N} \frac{dx_{i}}{\sqrt{2\pi}}\, e^{-x_{i}^{2}/2} \int \frac{d\epsilon}{\sqrt{2\pi}\,\sigma_{out}}\, e^{-\epsilon^{2}/2\sigma_{out}^{2}}\, \Theta\!\left(-\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w_{i} x_{i}\right)\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w^{0}_{i} x_{i} + \epsilon\right)\right) \quad (D2)$$

$$= \int \prod_{i=1}^{N} \frac{dx_{i}}{\sqrt{2\pi}}\, e^{-x_{i}^{2}/2} \int \frac{d\epsilon}{\sqrt{2\pi}\,\sigma_{out}}\, e^{-\epsilon^{2}/2\sigma_{out}^{2}} \int \frac{dh}{\sqrt{2\pi}} \frac{dh^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}} \frac{d\hat h^{0}}{\sqrt{2\pi}}\, \Theta\!\left(-h h^{0}\right) \quad (D3)$$

$$\times \exp\!\left(-i\hat h h - i\hat h^{0} h^{0} + \frac{i}{\sqrt N}\sum_{i=1}^{N}\left(\hat h w_{i} + \hat h^{0} w^{0}_{i}\right)x_{i} + i\hat h^{0}\epsilon\right) \quad (D4)$$

$$= \int \frac{dh}{\sqrt{2\pi}} \frac{dh^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}} \frac{d\hat h^{0}}{\sqrt{2\pi}}\, \Theta\!\left(-h h^{0}\right) \quad (D5)$$

$$\times \exp\!\left(-i\hat h h - i\hat h^{0} h^{0} - \frac{1}{2N}\left(\hat h^{2}\sum_{i=1}^{N} w_{i}^{2} + 2\hat h\hat h^{0}\sum_{i=1}^{N} w_{i} w^{0}_{i} + (\hat h^{0})^{2}\sum_{i=1}^{N} (w^{0}_{i})^{2}\right) - \frac{\sigma_{out}^{2}}{2}(\hat h^{0})^{2}\right) \quad (D6)$$

We set the normalization of the student and teacher to be

$$\|w\| = \|w^{0}\| = \sqrt N \quad (D7)$$

and define the order parameter $R$, the cosine of the angle between teacher and student, as

$$R = \frac{1}{N}\sum_{i=1}^{N} w_{i} w^{0}_{i} \quad (D8)$$

After performing the integral over $\hat h^{0}$, we can define a rescaled $R$ and $h^{0}$ as

$$\bar R = \frac{R}{\sqrt{1 + \sigma_{out}^{2}}} \quad (D9)$$

$$\bar h^{0} = \frac{h^{0}}{\sqrt{1 + \sigma_{out}^{2}}} \quad (D10)$$

We can then perform the integral over $\hat h$ to get the following integral over $h$ and $\bar h^{0}$:

$$E_{g}(R) = \int \frac{dh}{\sqrt{2\pi}} \frac{d\bar h^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}}\, \Theta\!\left(-h\bar h^{0}\right) e^{-\frac{1}{2}\left(1 - \bar R^{2}\right)\hat h^{2} - i\hat h\left(h - \bar R\bar h^{0}\right) - \frac{1}{2}(\bar h^{0})^{2}} \quad (D11)$$

$$= \int \frac{dh\, d\bar h^{0}}{2\pi\sqrt{1 - \bar R^{2}}}\, \Theta\!\left(-h\bar h^{0}\right) e^{-\frac{1}{2\left(1 - \bar R^{2}\right)}\left(h^{2} - 2h\bar h^{0}\bar R + (\bar h^{0})^{2}\right)} \quad (D12)$$

This evaluates to

$$E_{g}(R) = \frac{1}{\pi}\left(\frac{\pi}{2} - \tan^{-1}\!\left(\frac{R}{\sqrt{1 + \sigma_{out}^{2} - R^{2}}}\right)\right) \quad (D13)$$

In our expanded network, $m$ and $R$ are related by

$$m = \frac{1}{N}\, R\, \|w\|\, \|w^{0}\| \quad (D14)$$

This gives us

$$R = \frac{m}{\sqrt{1 - \tilde r}} \quad (D15)$$

In terms of $m$ and $\tilde r$, the generalization error can be written as

$$E_{g}(m, \tilde r) = \frac{1}{\pi}\left(\frac{\pi}{2} - \tan^{-1}\!\left(\frac{m}{\sqrt{(1 - \tilde r)(1 + \sigma_{out}^{2}) - m^{2}}}\right)\right) \quad (D16)$$

Appendix E: Large $\beta$ limit

We can find a closed expression for the generalization error in the limit $\beta \to \infty$ with $\sigma_{in} \leq$
$1$. In this limit we have $m \ll \alpha \ll \beta$ and $1 \ll \kappa$. Analysis of the saddle-point equations gives us the following relations:

$$\sigma_{in}^{2} = \frac{\alpha}{\beta}\,\kappa^{2} \quad (E1)$$

$$\sigma_{in}^{2}(\beta - 1) - \tilde\lambda = 0 \quad (E2)$$

$$\lambda\bar m = \frac{2\alpha}{\sqrt\pi}\,\kappa \quad (E3)$$

$$\beta\sigma_{in}^{2}\bar m = \frac{2\alpha}{\sqrt\pi}\,\sigma_{in}\sqrt{\frac{\beta}{\alpha}} \quad (E4)$$

$$\lambda = \sigma_{in}^{2}\sqrt{\frac{\beta}{1 - \tilde r - m^{2}}} \quad (E5)$$

which lead to the following expressions for $m$ and $\tilde r$:

$$m = \sqrt{\frac{2\alpha}{\pi\beta\sigma_{in}^{2}\left(1 + \sigma_{out}^{2}\right)}} \quad (E6)$$

$$1 - \tilde r = \frac{\pi\left(1 + \sigma_{out}^{2}\right)}{2\alpha\beta\sigma_{in}^{2}}\left(1 + \frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}\right)^{2} \quad (E7)$$

Plugging these into Eqn. (D15) gives us

$$R \approx \frac{\frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}}{1 + \frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}} \approx 1 - \frac{\pi\left(1 + \sigma_{out}^{2}\right)}{2\alpha} \quad (E8)$$

The expression for $R$ in Eqn. (E8) can be plugged into Eqn. (D13) to find an expression for the generalization error for $\beta \to \infty$, which is shown in Fig. 6(a). Note that this expression does not depend on $\sigma_{in}$ as long as $\sigma_{in} \leq 1$.

Appendix F: Optimal input noise
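Before optimizing over $\sigma_{in}$, it is worth confirming that the two forms in Eq. (E8) above are mutually consistent: expanding $R = y/(1+y)$ with $y = 2\alpha/\pi(1+\sigma_{out}^{2})$ at large $\alpha$ reproduces $1 - \pi(1+\sigma_{out}^{2})/2\alpha$. A symbolic sketch (the intermediate form of Eq. (E8) is a reconstruction):

```python
import sympy as sp

alpha, s_out = sp.symbols('alpha sigma_out', positive=True)

# first form of Eq. (E8)
y = 2 * alpha / (sp.pi * (1 + s_out**2))
R = y / (1 + y)

# leading large-alpha correction: alpha * (1 - R) -> pi (1 + sigma_out^2) / 2
correction = sp.limit(alpha * (1 - R), alpha, sp.oo)
assert sp.simplify(correction - sp.pi * (1 + s_out**2) / 2) == 0
```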
We find the optimal $\sigma_{in}$, which minimizes the generalization error, by maximizing $R$. Differentiating $R$ with respect to $\sigma_{in}$ gives us

$$\frac{dR}{d\sigma_{in}} = \frac{1}{\sqrt{1 - \tilde r}}\,\frac{dm}{d\sigma_{in}} + \frac{m}{2\left(1 - \tilde r\right)^{3/2}}\,\frac{d\tilde r}{d\sigma_{in}} \quad (F1)$$

which gives us the condition

$$\frac{dm}{d\sigma_{in}} = -\frac{m}{2\left(1 - \tilde r\right)}\,\frac{d\tilde r}{d\sigma_{in}}$$
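The stationarity condition above follows directly from Eq. (D15): setting $dR/d\sigma_{in} = 0$ for $R = m/\sqrt{1-\tilde r}$ gives an equation linear in $dm/d\sigma_{in}$. A symbolic sketch of the check:

```python
import sympy as sp

s = sp.Symbol('sigma_in', positive=True)
m = sp.Function('m')(s)
rt = sp.Function('rtilde')(s)

R = m / sp.sqrt(1 - rt)          # Eq. (D15), m and rtilde depending on sigma_in
dR = sp.diff(R, s)

# solve dR/dsigma_in = 0 for dm/dsigma_in
sol = sp.solve(sp.Eq(dR, 0), sp.Derivative(m, s))[0]
expected = -m / (2 * (1 - rt)) * sp.Derivative(rt, s)
assert sp.simplify(sol - expected) == 0
```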