A new role for circuit expansion for learning in neural networks
Julia Steinberg,1, 2, ∗ Madhu Advani, and Haim Sompolinsky1, 3

1 Center for Brain Science, Harvard University, Cambridge MA 02138, USA
2 Department of Physics, Harvard University, Cambridge MA 02138, USA
3 Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem 91904, Israel
(Dated: August 21, 2020)

Many sensory pathways in the brain rely on sparsely active populations of neurons downstream from the input stimuli. The biological reason for the occurrence of expanded structure in the brain is unclear, but it may be that expansion increases the expressive power of a neural network. In this work, we show that expanding a neural network can improve its generalization performance even in cases in which the expanded structure is pruned after the learning period. To study this setting we use a teacher-student framework in which a perceptron teacher network generates labels which are corrupted with small amounts of noise. We then train a student network that is structurally matched to the teacher and can achieve optimal accuracy if given the teacher's synaptic weights. We find that sparse expansion of the input of a student perceptron network both increases its capacity and improves its generalization performance when learning a noisy rule from a teacher perceptron, provided these expansions are pruned after learning. We find similar behavior when the expanded units are stochastic and uncorrelated with the input, and we analyze this network in the mean field limit. We show by solving the mean field equations that the generalization error of the stochastic expanded student network continues to drop as the size of the network increases. The improvement in generalization performance occurs despite the increased complexity of the student network relative to the teacher it is trying to learn. We show that this effect is closely related to the addition of slack variables in artificial neural networks and suggest possible implications for artificial and biological neural networks.
I. INTRODUCTION
Learning and memory is thought to occur mainly through long term modification of synaptic connections among neurons, a phenomenon well established experimentally. Neural circuits also undergo structural changes. Adult neurogenesis in the mammalian brain facilitates the continuous creation of new neurons and occurs mainly in the olfactory bulb and the dentate gyrus of the hippocampus [1, 2]. Another form of structural plasticity is the continuous recycling of synapses, which is seen in both cortex and hippocampus. Several modeling studies have addressed the computational consequences of these phenomena (see [3, 4] for adult neurogenesis and [5] for synaptic recycling). Of particular interest are the dynamic changes in the circuit structure of the hippocampus, an area specialized in learning and memory, functions which would seem to benefit from a stable circuit structure.

In this work, we explore a novel computational benefit of structural dynamics in neural circuits that learn new associations or tasks. We show that under certain classes of learning paradigms, the expansion of a neural circuit architecture by recruiting additional neurons and synapses may facilitate the dynamics of learning. Expansions of circuit size that enable sparse coding have been shown to have computational benefits in several contexts in neuroscience and machine learning for sensory processing, learning, and memory [6–10]. In these models, circuit expansion and the resultant sparse coding yield better representations of the stimuli, enhancing pattern separation and improving the capacity for pattern retrieval and classification. Importantly, to realize these benefits, the expanded architecture needs to be stable after learning. In contrast, in our scenario the benefit of expansion is in its facilitating the dynamics of learning and not its information bearing potential.

∗ Correspondence to: [email protected]
In fact, expansion in this scenario is most beneficial when it is transient, with the added neurons and synapses pruned after the learning period; hence this hypothesis is consistent with the observed continuous recycling of synapses.

We consider neural networks that learn supervised classification rules generated by a single layer perceptron. Learning the rule by training with labelled examples may nevertheless be hampered by the complexity of the underlying data. We will focus on two cases of unrealizable rules. The first case occurs when the teacher network produces training labels corrupted by stochastic noise. The second case occurs when the teacher is more complex than the student network trying to learn the rule. In both of these cases, there is a critical size of the training set above which no single layer student is able to correctly classify all of the training examples. This critical size is called the student's capacity. We show that adding sparse expansions to student networks by random mappings of the original input increases the capacity of the network and improves its generalization performance as it is trained on larger training sets.

While the capacity of a network is clearly related to its dimensionality, it is not obvious, and even counterintuitive, that increasing the size of a network should improve its generalization performance. Using mean field theory and simulations over a wide range of network parameters, we show that expansion of the architecture during learning achieves improved generalization, particularly if the additional elements of the circuit are removed after learning. In addition, we show that the effect is more pronounced if the hidden representation during learning is sparse.
We find that the performance is most improved when the expanded units are random and uncorrelated with the original input, which suggests that having low overlap in expanded activity between different training inputs is crucial to improving performance.

Our analysis offers a new perspective on the important issue of the relation between model complexity and learning in neural networks. Artificial neural networks have achieved state of the art predictive performance on a variety of tasks [11, 12]. The primary benefit of training these enormous models appears to lie in their ability to represent very complex functions, and the link between width, depth, and expressivity of neural networks is discussed in detail in several studies, including [13–16]. These networks are often overparameterized in the sense that the number of examples the network is trained on is far less than the number of free parameters in the network [17]. Classical statistical learning theory suggests that such massively over-parameterized models should be expected to over-fit the training data [18] and make poor predictions on new inputs not seen by the network before. To resolve this apparent paradox, it has been suggested that modern learning algorithms, cost functions, and architectures incorporate strong explicit and implicit regularizations [19–22]. Our findings suggest there may be advantages to making neural networks larger than is required for expressing the underlying task. These advantages are related to enhancing the ease of learning convergence, and in these cases, optimal performance after learning is achieved upon removal of the additional nodes and weights. Indeed, pruning of Deep Neural Networks after training is a current topic of research in machine learning [23–26].

We start in section II by showing in simulations that implementing a sparse expansion of a perceptron network via a random mapping of the input can improve its generalization ability when learning from a noisy teacher.
In section III we analyze these results by studying a simpler model of a single layer perceptron in which the activity in the expanded units is random and uncorrelated with the stimulus. We use the replica method to derive a mean field theory exact in the thermodynamic limit, and find it matches well with simulations of large but finite size networks. In section III B we explain this phenomenon more intuitively by showing a correspondence between adding random input neurons and including slack variables in the optimization problem. We also discuss how hidden units in our two layer network model can resemble the stochastic expansion of the input layer in the one layer model. In section IV A we demonstrate how the benefit of sparse expansion also applies in more general cases of learning unrealizable rules by comparing the performance of a student learning a more complex teacher network to our theory results. In most of our work we have focused on convex learning algorithms. In section IV B we discuss to what extent these effects extend to other learning algorithms. Finally, we close by discussing some general implications of our results.

II. SPARSE EXPANSIONS AND LEARNING

FIG. 1. Teacher and student network schematics. (A) Noisy teacher network. (B) Student network. (C) Student with sparse hidden layer. (D) Student with stochastic expanded units.
We begin our analysis by considering a teacher perceptron network with N input nodes x_i, one output node, and N synaptic weights \bar{w}_i drawn iid from \bar{w}_i \sim \mathcal{N}(0, \sigma_w^2), and supervised learning tasks in which a student perceptron will attempt to learn the teacher's input-output rule from a training set provided by it. For each input x, drawn iid from x_i \sim \mathcal{N}(0, 1), the teacher generates a label \bar{y} \in \{-1, 1\} via the following rule

\bar{y} = \mathrm{sign}(\bar{h}), \qquad \bar{h} = \frac{1}{\sqrt{\sigma_w^2 N}} \sum_{i=1}^{N} \bar{w}_i x_i + \epsilon \qquad (1)

where \epsilon \sim \mathcal{N}(0, \sigma_{\rm out}^2) denotes an output or label noise (Fig. 1 A). We assume a training set consisting of P such input-output pairs, and we define \alpha_0 = P/N as the measurement density of training examples relative to the teacher.

The goal of training is to yield network weights that perform well on new inputs, i.e., to have a small generalization error E_g, defined as the expected fraction of mislabeled examples averaged over the full distributions of inputs x and the noise \epsilon as follows

E_g(w) = \left\langle \Theta\left( -y(x)\,\bar{y}(x) \right) \right\rangle_{x, \epsilon} \qquad (2)

where the student's label is

y(x) = \mathrm{sign}\left( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i x_i \right) \qquad (3)

The generalization error is minimized when the student weights equal those of the teacher, i.e. w = \bar{w}. This will yield the same generalization error as the teacher itself if it were tested on examples with labels generated via Eqn. 1. We refer to this error as the minimal generalization error, which can be expressed in terms of the noise as follows

E_{\min} = E_g(\bar{w}) = \frac{1}{\pi}\left( \frac{\pi}{2} - \tan^{-1}\left( \frac{1}{\sigma_{\rm out}} \right) \right) \qquad (4)

which provides a lower bound on the generalization error of a student, as no network architecture (even one more complex than a perceptron) can yield better performance. Finding the optimal set of weights may be difficult even if the number of examples is large.
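To make this setup concrete, the minimal error of Eqn. 4 can be checked against a direct simulation of the noisy teacher. The following is a minimal numpy sketch; the network size, test-set size, and noise level are illustrative choices, not the values used in the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P_test, sigma_w, sigma_out = 500, 200_000, 1.0, 0.25

# Teacher weights and gaussian test inputs, as in Eqn (1).
w_bar = rng.normal(0.0, sigma_w, size=N)
x = rng.normal(0.0, 1.0, size=(P_test, N))

# Noisy teacher labels: y_bar = sign(h_bar).
h_bar = x @ w_bar / np.sqrt(sigma_w**2 * N) + rng.normal(0.0, sigma_out, size=P_test)
y_bar = np.sign(h_bar)

# An "oracle" student handed the teacher weights, w = w_bar, labeling via Eqn (3).
y_student = np.sign(x @ w_bar / np.sqrt(N))

E_emp = np.mean(y_student != y_bar)
E_min = (np.pi / 2 - np.arctan(1.0 / sigma_out)) / np.pi   # Eqn (4)
print(E_emp, E_min)   # these agree closely
```

The check works because the mismatch probability of two jointly gaussian fields with correlation \rho is \arccos(\rho)/\pi; with \rho = 1/\sqrt{1 + \sigma_{\rm out}^2} this reduces to Eqn. 4.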
Due to label noise from the teacher, training examples will no longer be linearly separable, i.e., perfectly classified by a perceptron, beyond some critical value of P, rendering the training task "unrealizable" by a perceptron. Furthermore, unlike the realizable regime, in the unrealizable regime finding the minimum of the training error is a nonconvex problem and can be hampered by local minima. Here we assume that the training is restricted to minimizing the training error by applying convex algorithms. Such training algorithms are limited to training set sizes smaller than the capacity. The capacity depends on the level of output noise in the labels, see Fig. 5. For a teacher of fixed width N and a fixed training set of size P, we can increase the capacity of the student network by making the student network larger than the teacher.

There are several ways to expand the student network, and each has a different effect on the generalization performance. We first increase the network size by implementing a random transformation of the input stimuli to a hidden layer of size N_+, as depicted in C of Fig. 1. The labels in the student network are given by y^\mu = \mathrm{sign}(h^\mu) where

h^\mu = \frac{1}{\sqrt{N}} \left( \sum_{i=1}^{N} w_i x_i^\mu + \sum_{j=1}^{N_+} \tilde{w}_j z_j^\mu \right) \qquad (5)

where z^\mu represents the activity in a hidden layer of neurons generated by a random connectivity matrix J,

z_j^\mu = \frac{A}{\sqrt{f(1-f)}} \left( \Theta\left( \sum_{i=1}^{N} J_{ji} x_i^\mu - T \right) - f \right) \qquad (6)

where A is a positive scalar and T is a firing threshold chosen to produce hidden layer neuronal activity with a given sparsity f. The synapses J_{ji} are chosen iid according to J_{ji} \sim \mathcal{N}(0, 1) and are uncorrelated with the teacher network.

In simulations, we measure the performance of this network by estimating the generalization error on new examples generated from the same distribution as the training set (x^\mu, y^\mu). Because J is fixed, the training problem is still that of linear classification, with an expanded input layer of size N + N_+ = \beta N and a correspondingly expanded trained weight vector (w, \tilde{w}). We train the output weights using max-margin classification (i.e., linear SVM [27, 28]), which finds an error free solution that maximizes the minimal distance of the input examples from the separating plane, called the margin \kappa, which in our case is defined through the linear inequality

y^\mu h^\mu \ge \kappa \, \| (w, \tilde{w}) \|, \quad \forall \mu \qquad (7)

provided that such a solution exists. Max-margin classification is equivalent to solving the following problem,

(w^*, \tilde{w}^*) = \arg\min_{w, \tilde{w}} \sum_{i=1}^{N} w_i^2 + \sum_{j=1}^{N_+} \tilde{w}_j^2 \qquad (8)

\text{s.t.} \quad y^\mu h^\mu \ge 1 \quad \forall \mu \qquad (9)

The optimization problem in Eqn. 9 is convex and admits a unique solution (w^*, \tilde{w}^*). We choose the max-margin solution as in general it is known to yield a robust solution to the classification problem with good generalization performance [29, 30].

As expected, the addition of this hidden layer increases the capacity of the student, namely the maximal value of P for which the training data are linearly separable [31]. For instance, for the parameters of Figs. 2(a) and 2(b), the capacity increases from its maximum value at \beta = 1 to \alpha_c \sim 31 and \sim 65 for \beta = 5 and 10, respectively. In the limit N \to \infty, it appears that this increased capacity does not depend on the sparsity of the hidden layer, or depends on it only very weakly. Importantly, by enabling the network to train successfully on a large training set, adding the random layer substantially improves the generalization performance of the network, particularly if the hidden layer activity z^\mu is very sparse, i.e., f \ll 1, and the expanded units are pruned after learning. In contrast, for a hidden layer with dense activity, the generalization error decreases initially with increasing \alpha_0 but then saturates at an intermediate value of \alpha_0 and increases for larger values. Additionally, for dense activity the performance slightly deteriorates when the extra neurons are removed after learning, see Fig. 4. The role of sparseness will be discussed below in section III C.

FIG. 2. The generalization error E_g from simulations of a two layer network. 2(a) compares E_g as a function of \alpha_0 for a student network the same size as the teacher and for student networks with expansion factor \beta = 5 with dense (f = 0.5) and sparse (f = 0.02) activity. 2(b) does the same for \beta = 10. The oracle line represents the lowest possible generalization error due to the presence of label noise. The parameters A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials are used in both figures.

FIG. 3. Comparison of the generalization error in simulations of a sparsely expanded two layer network before and after pruning the expanded units. 3(a) shows simulations for a student network with expansion factor \beta = 5 and 3(b) shows simulations for a student network with \beta = 10. We see that for both values of \beta the student network with the best overall performance is the sparsely expanded network whose expanded weights are pruned after learning. The parameters f = 0.02, A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials are used in both figures.

The network given in Eqns. 5 and 6 is difficult to study analytically because of correlations in the activities of the hidden layer induced by J [32]. We therefore consider in the following section a simplified expansion scheme, which we call a stochastic architecture, shown in Fig. 1 D. In contrast to the deterministic scheme of Fig. 1 C, here the activity patterns of the additional neurons are not generated through connections from the input layer. Instead they are randomly generated for each training pattern \mu, independent of x^\mu. The advantage of this scheme is that the random activities of the hidden neurons are statistically independent of each other as well as across different training patterns, rendering the model amenable to study using the tools of statistical mechanics. Although this scheme is artificial from a biological perspective, we will show that when the deterministic layer is very sparse the system's behavior is similar to the stochastic model.

FIG. 4. Comparison of the generalization error from simulations of a densely expanded two layer network before and after pruning the expanded units. 4(a) shows simulations for a student network with expansion factor \beta = 5 and 4(b) shows simulations for a student network with \beta = 10. We see that the densely expanded network performs best when the expanded weights are unpruned. However, the performance of the sparsely expanded network with pruned weights in Fig. 3 is superior to the densely expanded network regardless of whether the weights are pruned or kept. The parameters f = 0.5, A = 0. , \sigma_{\rm out} = 0.25, and N = 100 and 200 trials were used for all figures.

III. THEORY OF PERCEPTRON LEARNING WITH EXPANDED STOCHASTIC UNITS
In this section, we develop intuition for the effect of sparse expansion on the generalization performance of a perceptron by considering a simpler single layer student network which can be solved analytically in the mean field limit. This network (shown in B of Fig. 1) is trained on a data set of \mu = 1, \dots, P examples with binary labels y^\mu generated by the noisy teacher network in Eqn. 1. For convenience, we keep the same normalization for the student and teacher weight vectors, which corresponds to setting \sigma_w^2 = \beta in Eqn. 1. The activity of the student network takes the form

h = \frac{1}{\sqrt{N}} \left( \sum_{i=1}^{N} w_i x_i + \sum_{j=1}^{N_+} \tilde{w}_j \tilde{x}_j \right) \qquad (10)

where the \tilde{x}_j are random units added to the input layer, drawn iid from a gaussian distribution with zero mean and variance \sigma_{\rm in}^2. The label y given to input x by the student is y(x) = \mathrm{sign}(h). The student weights are trained to optimize the max margin, Eqn. 9.

A. Mean field theory
We analyze the performance of the expanded student network in Eqn. 10. We will denote the measurement density of the training set relative to the width of this student as \alpha = \alpha_0/\beta. The mean field theory below is exact in the thermodynamic limit, where P, N \to \infty with \alpha \sim O(1). To perform an ensemble average of the system's properties over different realizations of training sets, we use the replica trick in a manner similar to [33–38]. We start by considering the version space for n replicated students indexed by a:

\langle V^n \rangle = \int \prod_a dw^a \, d\tilde{w}^a \, \delta\left( \sum_{i=1}^{N} (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde{w}_j^a)^2 - N \right) \times \prod_{\mu=1}^{P} \sum_{\sigma = \pm 1} \left\langle \Theta\left( \sigma h_a^\mu - \kappa \right) \Theta\left( \sigma \bar{h}^\mu \right) \right\rangle \qquad (11)

where we have normalized the weights so that \|w^a\|^2 + \|\tilde{w}^a\|^2 = N in all replicas, and \Theta is the Heaviside step function. The quantities h_a^\mu are the student's fields induced by the \mu-th input and the weight vector (w^a, \tilde{w}^a) of the a-th replica; \bar{h}^\mu are the teacher fields induced by the \mu-th input, including noise. The angular brackets denote averaging with respect to the gaussian input vectors x^\mu (with variance 1), the student input noise vectors \tilde{x}^\mu (with variance \sigma_{\rm in}^2), and the teacher label noise \epsilon^\mu (with variance \sigma_{\rm out}^2). Since the distribution of inputs is isotropic, one does not need to average over the teacher distribution. Evaluating Eqn. 11, we derive a mean field theory in terms of the following order parameters m_a, \tilde{r}_a, q_{ab}, and \tilde{q}_{ab}:

m_a = \frac{1}{N} \sum_{i=1}^{N} \bar{w}_i w_i^a \qquad (12)

\tilde{r}_a = \frac{1}{N} \sum_{i=1}^{N_+} (\tilde{w}_i^a)^2 \qquad (13)

q_{ab} = \frac{1}{N} \sum_{i=1}^{N} w_i^a w_i^b \qquad (14)

\tilde{q}_{ab} = \frac{\sigma_{\rm in}^2}{N} \sum_{i=1}^{N_+} \tilde{w}_i^a \tilde{w}_i^b \qquad (15)

The order parameters can be understood intuitively as follows: m_a corresponds to the overlap between the student weights w^a and the teacher perceptron weights.
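As a concrete illustration, the definitions in Eqns. 12–15 can be evaluated numerically. The vectors below are random stand-ins for two replica solutions, not solutions of the saddle point equations, and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta, sigma_in = 400, 5, 1.0
N_plus = (beta - 1) * N               # expanded units, so N + N_plus = beta * N

w_bar = rng.normal(size=N)            # stand-in teacher weights

def replica():
    """A stand-in replica (w^a, wt^a), normalized to ||w||^2 + ||wt||^2 = N."""
    w = w_bar + 0.5 * rng.normal(size=N)      # correlated with the teacher
    wt = rng.normal(size=N_plus)
    s = np.sqrt(N / (w @ w + wt @ wt))
    return w * s, wt * s

w_a, wt_a = replica()
w_b, wt_b = replica()

m_a   = (w_bar @ w_a) / N                    # Eqn (12): teacher-student overlap
r_a   = (wt_a @ wt_a) / N                    # Eqn (13): norm of expanded weights
q_ab  = (w_a @ w_b) / N                      # Eqn (14): replica-replica overlap
qt_ab = sigma_in**2 * (wt_a @ wt_b) / N      # Eqn (15): expanded-weight overlap

print(m_a, r_a, q_ab, qt_ab)
```

Note how the shared normalization forces 0 < \tilde{r}_a < 1, with 1 - \tilde{r}_a the fraction of the weight norm left in the teacher's input subspace.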
\tilde{r}_a corresponds to the norm of the expanded weights \tilde{w}^a; q_{ab} measures the overlap between the student weights w^a in replica a and w^b in replica b. Similarly, \tilde{q}_{ab} measures the overlap of the expansion weights \tilde{w}^a and \tilde{w}^b (scaled by the expansion-input variance \sigma_{\rm in}^2).

We apply the replica symmetric (RS) ansatz for the order parameters m_a, \tilde{r}_a, q_{ab}, and \tilde{q}_{ab}, which is exact because the version space of weight vectors is connected. This allows us to write the order parameter matrices in terms of the four scalar order parameters m, \tilde{r}, q, and \tilde{q} as follows

m_a = m \qquad (16)

\tilde{r}_a = \tilde{r} \qquad (17)

q_{ab} = (1 - q - \tilde{r})\,\delta_{ab} + q \qquad (18)

\tilde{q}_{ab} = \left( \sigma_{\rm in}^2 \tilde{r} - \tilde{q} \right) \delta_{ab} + \tilde{q} \qquad (19)

In the mean field limit, we can decompose \langle V^n \rangle into the sum of an entropic term and an energetic term, both functions of m, \tilde{r}, q, and \tilde{q}:

\langle V^n \rangle = \exp\left[ nN\left( G_0(q, \tilde{q}, \tilde{r}, m) + \alpha G_1(q, \tilde{q}, \tilde{r}, m) \right) \right] \qquad (20)

Within the replica framework, the max-margin solution is the unique solution which maximizes the margin and corresponds to the equivalence of w in all student replicas, such that the overlaps q \to 1 - \tilde{r} and \tilde{q} \to \sigma_{\rm in}^2 \tilde{r}. Taking the limit n \to 0, the averaged free energy is given by

\langle \log V \rangle = N\left( G_0(\tilde{r}, m) + \alpha G_1(\tilde{r}, m) \right) \qquad (21)

We obtain three closed saddle point equations for \kappa, m, and \tilde{r} by minimizing the free energy in Eqn. 21 with respect to m and \tilde{r} and requiring that V \to 0. The capacity of the network is determined by solving the mean field equations in the limit \kappa \to 0. Full details of the replica calculation and the form of the saddle point equations are given in Appendix A.

The performance of the system depends on the expansion parameter \beta and the input and output noise variances \sigma_{\rm in}^2 and \sigma_{\rm out}^2. We will focus primarily on \sigma_{\rm in} = 1, where the mean field equations simplify considerably, and solve them for \tilde{r} as a function of m and \beta. We find that in this case the capacity of the network, defined in terms of \alpha_0, obeys the simple scaling relation

\alpha_c(\beta, \sigma_{\rm out}) = \beta\, \alpha_c(1, \sigma_{\rm out}) \qquad (22)

where \alpha_c(1, \sigma_{\rm out}) is the capacity of the unexpanded network shown in Fig. 5. We derive an expression for E_g in terms of the mean field order parameters in Appendix D.

FIG. 5. The network capacity for random inputs as a function of the inverse variance of the label noise, obtained from the solution of the mean field equations for \sigma_{\rm in} = 1.

With the removal of the expanded weights, E_g takes the form

E_g = \frac{1}{\pi} \left( \frac{\pi}{2} - \tan^{-1}\left( \frac{R}{\sqrt{1 + \sigma_{\rm out}^2 - R^2}} \right) \right) \qquad (23)

where R is defined as the cosine of the angle between the student and teacher weights, which can be expressed in terms of the order parameters m and \tilde{r} as

R = \frac{m}{\sqrt{1 - \tilde{r}}} \qquad (24)

where the factor of \sqrt{1 - \tilde{r}} in Eqn. 24 is the fraction of the student weight norm in the subspace of the teacher. For student networks that are the same size as the teacher, \tilde{r} = 0 and R = m. The generalization error for a student network that retains its expanded units after learning, with stochastic noise included in each test example, is given by replacing R with m in Eqn. 23. Thus, we see that for improved generalization performance, it is necessary to prune the augmented units after learning, as was shown numerically for the deterministic network, Fig. 3. In the stochastic expansion, the intuition for removing these weights is straightforward, as retaining them implies injecting stochastic activities into each test example, uncorrelated with the task's input, which will obviously reduce performance.
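The pruning effect can be read off directly from Eqns. 23 and 24: whenever \tilde{r} > 0, the pruned overlap R = m/\sqrt{1 - \tilde{r}} exceeds m, and E_g decreases monotonically with the overlap. A short numerical check; the order-parameter values below are hypothetical, not solutions of the saddle point equations:

```python
import numpy as np

def E_g(R, sigma_out):
    """Generalization error, Eqn (23), for teacher-student cosine overlap R."""
    return (np.pi / 2 - np.arctan(R / np.sqrt(1.0 + sigma_out**2 - R**2))) / np.pi

sigma_out = 0.25
m, r_tilde = 0.9, 0.15            # hypothetical order-parameter values
R = m / np.sqrt(1.0 - r_tilde)    # Eqn (24): effective overlap after pruning

print(E_g(R, sigma_out))          # pruned student
print(E_g(m, sigma_out))          # student that keeps its stochastic units
print(E_g(1.0, sigma_out))        # oracle; reduces to E_min of Eqn (4)
```

The last line confirms that setting R = 1 in Eqn. 23 recovers Eqn. 4, while the first two lines show the pruned error falling below the unpruned one for the same order parameters.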
The situation is different in the deterministic network, in which correlations between the expanded and original components of the network are induced by the random map J; hence through learning the \tilde{w} acquire some information about the task. Indeed, as we have shown above, for dense expansion, retaining these weights slightly increases the performance. However, for sparse expansion, the correlation between the expanded activations and the task input is small (see below), hence pruning improves the performance, similar to the stochastic case. Finally, we note that in the case of zero output noise, E_g is just the angle between the student and teacher normalized by \pi, and the minimal error is given by Eqn. 23 with R = 1, in agreement with Eqn. 4.

In Fig. 6, we plot the theoretical results for E_g as a function of \alpha_0 for different values of \beta for two values of \sigma_{\rm out}. For both high and low \sigma_{\rm out}, the generalization error decreases monotonically as a function of \alpha_0 for fixed \beta and as a function of \beta for fixed \alpha_0. In 6(c) and 6(d) of Fig. 6 we show the minimal E_g as a function of \beta, defined as the generalization error reached for each \beta after minimizing over \alpha_0. An interesting question is whether, for a given size of training set, there is a finite optimal expansion ratio. We find two qualitatively different behaviors depending on the value of \sigma_{\rm out}. For low values of \sigma_{\rm out}, for each fixed value of \alpha_0, the student network with the lowest generalization error is the smallest network which can fit all of the training examples. For higher values of \sigma_{\rm out}, we find that making the network larger always improves the generalization performance for any value of \alpha_0, with the best performance occurring in the limit \beta \to \infty. The crossover between these two regimes occurs roughly around \sigma_{\rm out} \sim 0.5. We conclude that adding noisy units during learning gives the network the capacity to fit the label noise and train on more examples in a way that does not interfere with the relevant weight information. This allows networks with larger expansion ratios to achieve better generalization as they are trained on more examples.

FIG. 6. The replica theory results for the generalization error. E_g is shown as a function of \alpha_0 for several values of the expansion factor \beta for label noise with standard deviation \sigma_{\rm out} = 0.25 in 6(a) and standard deviation \sigma_{\rm out} = 1 in 6(b). E_g is shown as a function of \beta for \sigma_{\rm out} = 0.25 in 6(c) and \sigma_{\rm out} = 1 in 6(d). \sigma_{\rm in} = 1 for all figures.

So far, we have considered the simple case of \sigma_{\rm in} = 1. We now discuss briefly the effect of varying \sigma_{\rm in}. In Fig. 7, we demonstrate how varying the level of input noise can improve the generalization error by comparing theory and simulations for different choices of \sigma_{\rm in}. We find that calculations of E_g from simulations match very well with the values obtained from the solution of the mean field equations, shown in Fig. 7. For low label noise, the generalization performance is most substantially improved when the variance of activity in the added units is much lower than the variance of the patterns being learned, i.e. \sigma_{\rm in}^2 < 1. For a fixed value of label noise \sigma_{\rm out}, we find that there is an optimal variance of the augmented units which minimizes E_g for fixed measurement density \alpha_0 and expansion factor \beta. This value can be determined from the replica equations, as shown in Fig. 7(b) and discussed in Appendix F. We will return to this issue in section III C.

FIG. 7. The replica theory results compared with simulations for the generalization error E_g. 7(a) shows E_g for \sigma_{\rm out} = 0. , \beta = 5, and several values of \sigma_{\rm in}. The error bars are computed from the mean and standard deviation of 400 trials with N = 100. 7(b) shows E_g vs. \sigma_{\rm in} with \alpha_0 = 3, \beta = 5, and \sigma_{\rm out} = 1; the line represents the replica prediction for the value of \sigma_{\rm in} that minimizes the generalization error.

FIG. 8. Comparison of student networks with stochastic and sparse expansions.
8(a) compares simulations of the two layer student network with the theory results for the one layer network for \beta = 5, \sigma_{\rm out} = 0. , N = 100, and 200 trials. In general, we see that student networks with stochastic added units attain superior performance when compared to deterministic networks of the same size. 8(b) compares E_g(\beta) for the case of stochastic augmented input units and deterministic hidden units with dense and sparse activity, with \sigma_{\rm out} = 0. and N = 80. The parameters \sigma_{\rm in} = A = 1 and 200 trials are used for both figures.

B. Comparison between stochastic expansion and deterministic sparse expansion
For networks expanded with sparse hidden layers, the parameter A is closely related to \sigma_{\rm in}. We directly compare the generalization performance of the student network with a sparse hidden layer (Eqn. 5) with the student network with stochastic units added to the input (Eqn. 10) by setting \sigma_{\rm in} = A, so that the statistics of the expansion units match in the two networks. For simplicity we consider the case \sigma_{\rm in} = A = 1. Fig. 8(a) shows the generalization error for each network with \beta = 5, and Fig. 8(b) shows the generalization error as a function of the network expansion factor \beta. The stochastically expanded network achieves superior generalization performance for larger values of \alpha_0 and has a higher capacity. However, as can be seen, the performance of the deterministic networks approaches that of the stochastic network upon increasing the sparsity of the hidden layer activity. This is expected, as the correlations in the sparse activities are weak and hence approach the uncorrelated stochastic limit.

C. Correspondence with slack regularization
While it is clear that expanding a network increases its capacity, it is not obvious that the expansion we have implemented should lead to improved generalization. While widening a network increases its capacity to fit more training data, it may also increase its Rademacher complexity, improving its ability to fit random input-output data [22]. However, it turns out that the improved generalization performance in the networks we have studied can be related to an equivalence between our expanded network trained in the realizable regime and an unexpanded network trained in the unrealizable regime using slack regularization, which we now explain.

We consider the relation between our expansion schemes for learning and that of the slack SVM, which is defined as

\min_{w, \xi} \sum_{i=1}^{N} w_i^2 + C \sum_{\mu=1}^{P} (\xi^\mu)^2 \quad \text{s.t.} \quad y^\mu \left( \sum_{i=1}^{N} w_i x_i^\mu \right) \ge 1 - \xi^\mu \qquad (25)

While SVM learning works only in the realizable regime, the slack SVM is a convex optimization that allows nonzero classification errors (when \xi^\mu > 1) and regularizes them through the slack parameter C, which applies L2 regularization to the slack variables \xi^\mu. Although it does not minimize the training error, and its cost function does not have a well defined interpretation in terms of the classification task, it is a popular learning algorithm due to its simplicity and its empirically nice generalization properties.

To see the relation to the SVM with stochastic expansion, we first note that the minimizer \tilde{w} of Eqn. 8 will necessarily be in the span of the P input stochastic vectors \tilde{X}^\mu = \tilde{x}^\mu y^\mu, since any projection on the null space would increase the norm of \tilde{w} without contributing to the satisfaction of the inequalities. Defining new variables \xi^\mu as

\xi^\mu = \tilde{X}^{\mu T} \tilde{w} \qquad (26)

we can write the optimal \tilde{w} as

\tilde{w} = (\tilde{X}^T)^{+} \xi \qquad (27)

where \tilde{X} is the matrix of input stochastic vectors and + denotes the pseudo-inverse operation. Substituting Eqn. 27 into Eqn. 8 yields

\min_{w, \xi} \sum_{i=1}^{N} w_i^2 + \sum_{\mu=1}^{P} \sum_{\nu=1}^{P} \xi^\mu \left[ C^{+} \right]_{\mu\nu} \xi^\nu \quad \text{s.t.} \quad y^\mu \left( \sum_{i=1}^{N} w_i x_i^\mu \right) \ge 1 - \xi^\mu \qquad (28)

where

C_{\mu\nu} = \tilde{x}^\mu \tilde{x}^{\nu T} \qquad (29)

which is just the sample covariance matrix of the expanded inputs in the training set. We recognize the second term in Eqn. 28 as the square of the Mahalanobis distance of the vector \xi from a set of observations with zero mean and covariance matrix C_{\mu\nu}. Thus, the SVM with expanded networks is equivalent to a slack SVM of the original network that incorporates a Mahalanobis distance regularization of the slack variables, with the covariance regularizer matrix C injected by the expanded activities.

Furthermore, we can establish an exact correspondence between the stochastic expansion and the slack SVM, Eqn. 28, in the limit of large \beta (and fixed \alpha_0) by noting that in this limit \tilde{x}^\mu \tilde{x}^{\nu T} \sim \sigma_{\rm in}^2 \delta_{\mu\nu}; hence the slack term becomes

\sum_{\mu=1}^{P} \sum_{\nu=1}^{P} \xi^\mu \left[ C^{-1} \right]_{\mu\nu} \xi^\nu \to \frac{1}{\sigma_{\rm in}^2} \sum_{\mu=1}^{P} (\xi^\mu)^2 \qquad (30)

which is a generic slack regularization term, with C = \sigma_{\rm in}^{-2}. This implies that the addition of stochastic units becomes equivalent to the addition of slack terms in the limit \beta \to \infty. The equivalence breaks down completely for N_+ < P, when the matrix C_{\mu\nu} becomes non-invertible. The above equivalence holds also for the deterministic expansion, where now C_{\mu\nu} = z^\mu z^{\nu T}, see Eqn. 6. In the case of a sparse expansion, C_{\mu\nu} has small off-diagonal elements and diagonal elements equal to A^2, which plays the role of the slack regularizer.

IV. EXTENSIONS
So far, we have focused on a perceptron learning a noisy perceptron rule using convex learning algorithms. In the following section, we investigate whether random expansion of the network during learning is also beneficial when the teacher is given by a nonlinear classification rule, and when training with gradient based methods.
A. Learning a nonlinear classification rule
To model a perceptron learning a complex but deterministic rule, we consider a student perceptron learning from a quadratic teacher where the target rule is

y(x) = \mathrm{sign}\left( \frac{a}{\sqrt N} \sum_{i=1}^N w_i^0 x_i + \frac{1-a}{N} \sum_{i,j=1}^N w_{ij}^0 x_i x_j \right)    (31)

with weights drawn iid as w_i^0, w_{ij}^0 \sim \mathcal N(0,1). The scalar coefficient a, between zero and one, denotes the relative weight of the linear component of the teacher. Clearly, a perceptron student cannot emulate such a rule perfectly. In fact, for a perceptron with N weights, the optimal weights are w = w^0, with a nonzero minimal generalization error E_{min} which decreases with a. In addition, there is a critical capacity \alpha_c above which the training examples are unrealizable, with \alpha_c increasing with a. We now discuss the effect of adding the stochastic random layer as in Fig. 1D of size N_+, with N + N_+ = \beta N. Clearly, the capacity for learning with zero training error increases with \beta. We now ask whether this expansion is also beneficial for generalization, and whether pruning the network after learning improves performance. We have simulated training in this network using, as before, the max-margin algorithm. Results shown in Fig. 9 confirm that the expanded stochastic network performs better than the unexpanded one. Furthermore, the results are in excellent agreement with the behavior in the case of the noisy perceptron target rule, with noise strength given by

\sigma_{out} = \frac{1-a}{a} .    (32)

We also show simulation results for the two layer network with dense and sparse deterministic expansions in Fig. 10.
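As a sanity check on the effective-noise mapping of Eqn. 32, the quadratic teacher of Eqn. 31 can be sampled directly and compared with Gaussian label noise. The sketch below is ours, not the paper's simulation code; sizes, the seed, and a = 0.5 are illustrative choices:

```python
import numpy as np

# Illustrative Monte Carlo sketch of the quadratic teacher rule, Eqn. 31,
# and its effective-noise reading, Eqn. 32: relative to its own linear part,
# the quadratic part acts approximately like Gaussian label noise of
# relative strength sigma_out = (1 - a)/a.
rng = np.random.default_rng(2)
N, a, trials = 400, 0.5, 10_000
w0 = rng.normal(size=N)          # linear teacher weights w_i
W0 = rng.normal(size=(N, N))     # quadratic teacher weights w_ij

X = rng.normal(size=(trials, N))
lin = (a / np.sqrt(N)) * X @ w0
quad = ((1 - a) / N) * np.sum((X @ W0) * X, axis=1)
y = np.sign(lin + quad)          # teacher labels, Eqn. 31

# Fraction of labels flipped relative to the linear rule alone, compared
# with the prediction arctan(sigma_out)/pi for two independent centered
# Gaussian fields whose widths have ratio sigma_out.
flip = np.mean(y != np.sign(lin))
pred = np.arctan((1 - a) / a) / np.pi
assert abs(flip - pred) < 0.03
```

For large N the quadratic field is approximately Gaussian and uncorrelated with the linear field, which is why the simple two-Gaussian flip-rate formula applies.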
FIG. 9. Comparison of E_g as a function of \alpha for a student learning a quadratic teacher vs. the same student learning a linear teacher with label noise. Error bars in 9(a) are obtained from simulations of a stochastic expanded student with \sigma_{in} = 1 learning a quadratic teacher for 200 trials, and the solid lines correspond to the replica theory result for a student learning from a noisy teacher. In 9(b) we compare simulations of a two layer student network with f = 0.02 and A = 1 learning from a quadratic teacher with simulations of the same student network learning from a noisy teacher for 400 trials. The parameters a = 0.5, \sigma_{out} = 1, and N = 100 are used in both figures.

FIG. 10. Simulation results for a two layer expanded student learning a quadratic teacher. 10(a) compares E_g as a function of \alpha before and after pruning for a sparse hidden layer, and 10(b) compares the performance before and after pruning for a dense hidden layer. The parameters a = 0.5, \sigma_{out} = 1, \beta = 10, N = 100, and 400 trials are used in both figures.

As in the case of a noisy teacher, the optimal generalization performance occurs after the extra neurons and synapses are removed from the network for a sparse expansion. This effect persists for values of \beta as large as \beta = 40 for N = 60. In the case of the two layer network, it is not entirely obvious that removing the extra synapses would improve performance, as this structure may be used to learn something about the quadratic part of the teacher. It is possible that there are parameter regimes in which it is beneficial to keep the extra weights unpruned that we have been unable to reach due to computational limitations on \beta and N. Despite these potential shortcomings, our findings for both student architectures demonstrate that the benefits of expanding a network can also occur in the setting where the rule being learned is more complicated than the model.

B. Logistic regression
We will now consider alternative optimization methods and loss functions which allow a neural network to be trained beyond capacity. One example is logistic regression, with a cost function given by

L(w) = \sum_{\mu=1}^P \log\big( 1 + \exp(-u^\mu) \big)    (33)

u^\mu = \frac{1}{\sqrt N} \sum_{i=1}^N y^\mu w_i x_i^\mu    (34)

In the following, we consider full batch gradient descent, so that the update to the weights at each training epoch is given by \Delta w_i = -\eta\, \partial L(w)/\partial w_i.

In [39] it was shown that the normalized weight vector obtained by minimizing the logistic regression loss function via gradient descent should converge to the max margin solution after a sufficiently long training time if the training data is linearly separable. However, this correspondence depends on learning parameters such as \eta and the number of iterations. Note that, in general, convergence to the max margin solution requires running the logistic regression gradient based training for longer times than are required for finding a solution with zero training error. For unrealizable rules, e.g., the noisy teacher in Eqn. 1 and the quadratic teacher in Eqn. 31, logistic regression and max-margin classification are not equivalent for large P, because the training set provided by the teacher is not linearly separable.

In previous sections we have shown that stochastic and sparse expansions of perceptron networks increase the capacity of a network by making the training set linearly separable in a higher dimensional space. Thus, it is natural to ask under what conditions training an expanded network via logistic regression will result in a weight vector that converges to the new max margin solution in the higher dimensional space, and whether this solution can yield superior generalization performance compared to gradient based training of the unexpanded student network. We have simulated logistic regression learning for the problem of learning a noisy perceptron teacher, for several values of \eta and numbers of training epochs.
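The training scheme of Eqns. 33-34 can be sketched in a few lines. The sketch below is ours; the learning rate, epoch count, and sizes are illustrative choices rather than the values used in the paper's simulations:

```python
import numpy as np

# Minimal sketch of full batch gradient descent on the logistic loss,
# Eqns. 33-34, for a student perceptron learning a noisy linear teacher.
rng = np.random.default_rng(3)
N, P, eta, epochs, sigma_out = 100, 300, 0.5, 500, 0.5
w_teacher = rng.normal(size=N)
X = rng.normal(size=(P, N))
y = np.sign(X @ w_teacher / np.sqrt(N) + sigma_out * rng.normal(size=P))

w = np.zeros(N)
for _ in range(epochs):
    u = y * (X @ w) / np.sqrt(N)        # aligned fields u_mu, Eqn. 34
    s = 0.5 * (1.0 - np.tanh(u / 2.0))  # sigmoid(-u), overflow-safe form
    grad = -(X * (y * s)[:, None]).sum(axis=0) / np.sqrt(N)
    w -= eta * grad                     # full batch update, Delta w = -eta dL/dw

# Overlap with the teacher direction (R in the text) grows to a substantial
# value even though the noisy training set need not be linearly separable.
R = w @ w_teacher / (np.linalg.norm(w) * np.linalg.norm(w_teacher))
assert R > 0.7
```

Because the loss is differentiable everywhere, this procedure runs unchanged above capacity, where max-margin learning has no solution.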
We first consider the case \beta = 1, i.e. a student of the same size as the teacher. For \alpha below capacity, the margin increases monotonically with training epochs and converges asymptotically to the maximum margin, as shown in Fig. 12(b) (\alpha = 3), with the convergence time depending on \eta. In Fig. 12(a) we show, for the same \alpha, the value of the overlap between student and teacher as a function of \eta t. Interestingly, while R does seem to converge asymptotically to the max margin value, it is not monotonic and in fact reaches a maximum value larger than the infinite time asymptote early in the training. Thus, the max margin solution is not necessarily the one with the best generalization performance. Above capacity, logistic regression permits solutions with nonzero training error, and we find that it results in good generalization performance. The value of R as a function of \alpha is shown in Fig. 11. As seen, for small \alpha the overlap (achieved after a large number of epochs) is close to the max margin solution, with precise values dependent on \eta and the stopping criterion. When \alpha increases above capacity, R increases monotonically and seems to approach R = 1 for large \alpha (corresponding to the optimal solution w = w^0), although the amount of increase depends on \eta. Note that in this regime both R and \kappa converge quickly to their asymptotic values, as shown in Fig. 12.

For an expanded student network, i.e. \beta > 1, we find that R converges to the max margin value after long training times for \alpha below capacity, as shown in Fig. 13, and continues to increase with \alpha as it increases above capacity. However, for fixed values of \eta that are not too large, the largest value of R for any \alpha is obtained for the unexpanded network, i.e., \beta = 1, as shown in Fig. 13. This implies that in this range of parameters, expanding the network does not improve generalization performance.

FIG. 11. Simulation results for logistic regression showing R vs. \alpha for learning rates \eta = 0.5, 0.05, 0.005, for N = 100 and N = 400, with \sigma_{out} = 0.

FIG. 12. Simulation results for logistic regression with \beta = 1 and \sigma_{out} = 0. t is defined as the number of training epochs. 12(a) shows R as a function of \eta t for \alpha = 3 (below capacity) and \alpha = 8 (above capacity) for N = 100. 12(b) shows the margin \kappa as a function of \eta t for \alpha = 3 and \alpha = 8.
V. DISCUSSION
In this work, we have shown how expanding the architecture of neural networks can provide computational benefits beyond better expressivity, and can improve the generalization performance of the network even after the expanded weights and neurons are pruned following training. We obtain equations for the order parameters characterizing generalization in randomly expanded perceptron networks (called stochastic expansion) in the mean field limit, and show explicitly that expansion allows for more accurate learning of noisy or complex teacher networks. This is achieved by increasing network capacity during training, allowing the learning to benefit from more examples. We show a qualitatively similar improved performance when expanding by adding fixed random weights (deterministic expansion) connecting the input to sparsely active hidden units. Additional insight into our results is provided by showing that the expansion is effectively similar to the addition of slack variables to max-margin learning. We believe our findings suggest a possible biological function for adult neurogenesis in the brain.

FIG. 13. Simulation results for logistic regression for several values of \beta, with N = 100 and \sigma_{out} = 0. R vs. \alpha for \beta = 1, 3, 5 and \eta = 0.01. The max margin line in the plot corresponds to the max margin solution for \beta = 5, and the circles mark the capacity for each value of \beta.

In our analysis, we considered training sets drawn iid from a Gaussian distribution with no spatial structure. It would be interesting to see how our results could be extended to learning structured data. In particular, [40] developed a theory for the linear classification of manifolds with arbitrary geometry by using special anchor points on the manifolds to define novel geometrical measures of radius and dimension which can be directly linked to the classification capacity for manifolds of various geometries. It would be interesting to see if sparse expansions similar to those we have studied could be useful in classifying noisy manifolds, and if there is any correspondence to SVMs containing anisotropic slack regularization encoded in the structure of the covariance matrix as in Eqn. 29.

It would also be interesting to determine how and if our observations apply to learning in deep networks with multiple layers.
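The slack-covariance correspondence of Eqns. 26-30 referred to above can also be checked numerically. The sketch below is ours (sizes, seed, and names are illustrative): part (i) verifies that the squared norm of the minimum-norm expanded weights equals the Mahalanobis slack penalty, and part (ii) shows the Gram matrix concentrating on \sigma_{in}^2 \mathbb{1} as the expansion grows.

```python
import numpy as np

# Numerical sketch of the slack-SVM correspondence, Eqns. 26-30.
rng = np.random.default_rng(0)
P, N_plus, sigma_in = 20, 200, 0.7
Xt = rng.normal(0.0, sigma_in, size=(P, N_plus))  # expanded inputs (one row per example)
xi = rng.normal(size=P)                           # prescribed slack values

# (i) Minimum-norm expanded weights reproducing the slacks (Eqn. 27) have
# squared norm equal to the Mahalanobis penalty of Eqn. 28.
w = np.linalg.pinv(Xt) @ xi          # min-norm solution of xi = Xt @ w
C = Xt @ Xt.T                        # Gram matrix C_mu_nu = x_mu . x_nu (Eqn. 29)
assert np.allclose(Xt @ w, xi)       # constraints satisfied exactly (P < N_plus)
assert np.isclose(w @ w, xi @ np.linalg.pinv(C) @ xi)

# (ii) As the expansion grows, the normalized Gram matrix concentrates on
# sigma_in^2 * I, so the penalty reduces to a plain sum of squares (Eqn. 30).
def gram_deviation(n_plus):
    X = rng.normal(0.0, sigma_in, size=(P, n_plus))
    return np.max(np.abs(X @ X.T / n_plus - sigma_in**2 * np.eye(P)))

assert gram_deviation(100_000) < gram_deviation(200)
```

The off-diagonal Gram entries shrink like 1/\sqrt{N_+}, which is the mechanism behind the large-\beta limit of Eqn. 30.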
Neural network pruning techniques have been widely discussed in the deep learning community, and it has been shown that pruning can reduce the parameter counts of trained networks by over 90% without compromising accuracy [23, 41]. Training a pruned model from scratch is worse than retraining a pruned model, which suggests that the extra capacity of the network allows it to find more optimal solutions. In [42], the authors find that dense, randomly-initialized, feed-forward networks contain subnetworks that can reach test accuracy comparable to the original network in a similar number of training iterations when trained in isolation. It would be interesting to see if the extra weights in the larger networks can be translated into a regularization condition on the subnetwork.

Most of our work focused on max margin learning. We have also explored the effect of expansion on gradient based learning with a logistic regression cost function. We find that for appropriate choices of learning rate and learning time, generalization is similar to the max margin performance below the network capacity, consistent with [39]. We also found that in the explored parameter range, optimal generalization performance is achieved by the unexpanded network, as gradient based learning can extract useful information even beyond capacity. However, understanding generalization in gradient based learning requires a more thorough understanding of the roles of learning rate and training time, which is difficult given the lack of theory for the training dynamics of logistic regression. It would be interesting to see if there is a way to scale \eta such that expanding the network can provide similar benefits for logistic regression beyond capacity as for max margin learning. We leave this to future work.

We also note that generalization can also improve when adding unquenched noise to the student labels during training with the logistic loss, as this prevents the classifier from overfitting (results not shown; [43, 44]). This differs from our construction for two reasons. The first is that our student by construction learns the weights in the extended part of the network. The second is that our dimensionality expansion changes the properties of the training set, in that a nonlinearly separable training set in the original space may become linearly separable in the higher dimensional expanded space.

VI. ACKNOWLEDGMENTS
We would like to thank Haozhe Shan and Weishun Zhong for valuable discussions concerning our logistic regression results, and Subir Sachdev for helpful comments on the draft. This work is partially supported by the Gatsby Charitable Foundation, the Swartz Foundation, and the National Institutes of Health (Grant No. 1U19NS104653). J.S. acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. DGE1144152.

[1] C. Zhao, W. Deng, and F. H. Gage, Cell, 645 (2008).
[2] G.-l. Ming and H. Song, Neuron, 687 (2011).
[3] J. B. Aimone, Cold Spring Harbor Perspectives in Biology.
[4] W. Deng, J. B. Aimone, and F. H. Gage, Nature Reviews Neuroscience.
[5] G. Mongillo, S. Rumpel, and Y. Loewenstein, Current Opinion in Neurobiology, 7 (2017), Computational Neuroscience.
[6] B. A. Olshausen and D. J. Field, Current Opinion in Neurobiology.
[7] S. Ganguli and H. Sompolinsky, Annual Review of Neuroscience, 485 (2012), PMID: 22483042.
[8] A. Litwin-Kumar, K. D. Harris, R. Axel, H. Sompolinsky, and L. F. Abbott, Neuron, 1153 (2014).
[9] A. Treves and E. T. Rolls, Hippocampus, 374.
[10] M. V. Tsodyks and M. V. Feigelman, Europhysics Letters (EPL), 101 (1988).
[11] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[12] J. Schmidhuber, Neural Networks, 85 (2015).
[13] Y. Bengio and O. Delalleau, in International Conference on Algorithmic Learning Theory (Springer, 2011) pp. 18-36.
[14] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (2016) pp. 3360-3368.
[15] I. Safran and O. Shamir, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17 (JMLR.org, 2017) pp. 2979-2987.
[16] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, edited by D. Precup and Y. W. Teh (PMLR, International Convention Centre, Sydney, Australia, 2017) pp. 2847-2854.
[17] K. Simonyan and A. Zisserman, arXiv preprint arXiv:1409.1556 (2014).
[18] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, arXiv preprint arXiv:1611.03530 (2016).
[19] P. L. Bartlett and S. Mendelson, Journal of Machine Learning Research, 463 (2002).
[20] M. Advani and A. Saxe, (2017), arXiv:1710.03667 [stat.ML].
[21] Y. Bansal, M. Advani, D. Cox, and A. Saxe, (2018), arXiv:1806.00730 [stat.ML].
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (MIT Press, 2012).
[23] S. Han, J. Pool, J. Tran, and W. Dally, in Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc., 2015) pp. 1135-1143.
[24] T. Yang, Y. Chen, and V. Sze, in (2017) pp. 6071-6079.
[25] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, arXiv preprint arXiv:2003.03033 (2020).
[26] T. Gale, E. Elsen, and S. Hooker, arXiv preprint arXiv:1902.09574 (2019).
[27] C. Cortes and V. Vapnik, Machine Learning, 273 (1995).
[28] B. E. Boser, I. M. Guyon, and V. N. Vapnik, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92 (Association for Computing Machinery, New York, NY, USA, 1992) pp. 144-152.
[29] P. Bartlett and J. Shawe-Taylor, "Generalization performance of support vector machines and other pattern classifiers," in Advances in Kernel Methods: Support Vector Learning (MIT Press, Cambridge, MA, USA, 1999) pp. 43-54.
[30] V. Vapnik, Statistical Learning Theory.
[31] T. M. Cover, IEEE Transactions on Electronic Computers EC-14, 326 (1965).
[32] B. Babadi and H. Sompolinsky, Neuron, 1213 (2014).
[33] E. Gardner, Europhysics Letters (EPL), 481 (1987).
[34] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General, 271 (1988).
[35] E. Gardner, Journal of Physics A: Mathematical and General, 257 (1988).
[36] H. Seung, H. Sompolinsky, and N. Tishby, Physical Review A, 6056 (1992).
[37] T. L. H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys., 499 (1993).
[38] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning.
[39] D. Soudry, E. Hoffer, M. Shpigel Nacson, S. Gunasekar, and N. Srebro, Journal of Machine Learning Research, 1 (2018).
[40] S. Chung, D. D. Lee, and H. Sompolinsky, Phys. Rev. X, 031003 (2018).
[41] Y. LeCun, J. S. Denker, and S. A. Solla, in Advances in Neural Information Processing Systems 2, edited by D. S. Touretzky (Morgan-Kaufmann, 1990) pp. 598-605.
[42] J. Frankle and M. Carbin, in International Conference on Learning Representations (2019).
[43] C. M. Bishop, Neural Computation, 108 (1995).
[44] M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11 (Omnipress, USA, 2011) pp. 681-688.

Appendix A: Mean field equations
We outline the derivation of the mean field equations used to compute the order parameters defined in Eqns. 12, 13, 14, and 15, which are used to compute the generalization error given in Eqn. 23. We define the student field for each replica a of the student network as

h_a^\mu = \frac{1}{\sqrt N} \sum_i W_i^a X_i^\mu = \frac{1}{\sqrt N} \left( \sum_{i=1}^N w_i^a x_i^\mu + \sum_{j=1}^{N_+} \tilde w_j^a \tilde x_j^\mu \right),    (A1)

and the teacher field as

h_0^\mu = \frac{1}{\sqrt N} \sum_i W_i^0 X_i^\mu + \epsilon^\mu = \frac{1}{\sqrt N} \sum_{i=1}^N w_i^0 x_i^\mu + \epsilon^\mu.    (A2)

We can now write the average over the version space in Eqn. 20 in terms of these new variables,

\langle V^n \rangle = \left\langle \int \prod_{i=1}^N \prod_{j=1}^{N_+} \prod_a dw_i^a\, d\tilde w_j^a\; \delta\!\left( \sum_{i=1}^N (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde w_j^a)^2 - N \right) \prod_\mu^P \int dh_a^\mu\, d\hat h_a^\mu\, dh_0^\mu\, d\hat h_0^\mu \left[ \prod_a \Theta( y^\mu h_a^\mu - \kappa ) \right] \Theta( y^\mu h_0^\mu )\; I \right\rangle    (A3)

where I is given by

I = \exp\left[ -i \sum_{a\mu} h_a^\mu \hat h_a^\mu - i \sum_\mu h_0^\mu \hat h_0^\mu + \frac{i}{\sqrt N} \sum_{a\mu} \hat h_a^\mu \sum_i W_i^a X_i^\mu + i \sum_\mu \hat h_0^\mu \left( \frac{1}{\sqrt N} \sum_i W_i^0 X_i^\mu + \epsilon^\mu \right) \right]    (A4)

and the constraints in Eqns. A1 and A2 are implemented by the Lagrange multipliers \hat h_a^\mu and \hat h_0^\mu.
Averaging over the inputs x^\mu, \tilde x^\mu, and the noise \epsilon^\mu, I becomes

I = \int \prod_{\mu=1}^P \prod_{i=1}^N \frac{dx_i^\mu}{\sqrt{2\pi}} e^{-\frac{(x_i^\mu)^2}{2}} \prod_{j=1}^{N_+} \frac{d\tilde x_j^\mu}{\sqrt{2\pi\sigma_{in}^2}} e^{-\frac{(\tilde x_j^\mu)^2}{2\sigma_{in}^2}} \frac{d\epsilon^\mu}{\sqrt{2\pi\sigma_{out}^2}} e^{-\frac{(\epsilon^\mu)^2}{2\sigma_{out}^2}} \exp\left[ -i\sum_{\mu a} h_a^\mu \hat h_a^\mu - i\sum_\mu h_0^\mu \hat h_0^\mu + \frac{i}{\sqrt N}\sum_{\mu a}\sum_{i=1}^N ( \hat h_a^\mu w_i^a + \hat h_0^\mu w_i^0 ) x_i^\mu + \frac{i}{\sqrt N}\sum_{\mu a}\sum_{j=1}^{N_+} \hat h_a^\mu \tilde w_j^a \tilde x_j^\mu + i\sum_\mu \hat h_0^\mu \epsilon^\mu \right]

= \exp\left[ -\sum_\mu \left( i \sum_a h_a^\mu \hat h_a^\mu + i h_0^\mu \hat h_0^\mu + \sum_a \hat h_a^\mu \hat h_0^\mu \frac{1}{N}\sum_{i=1}^N w_i^a w_i^0 + \frac{1+\sigma_{out}^2}{2} (\hat h_0^\mu)^2 + \frac{1}{2}\sum_{ab} \hat h_a^\mu \hat h_b^\mu \left( \frac{1}{N}\sum_{i=1}^N w_i^a w_i^b + \frac{\sigma_{in}^2}{N}\sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) \right) \right]    (A5)

We define the order parameters m_a, q_{ab}, and \tilde q_{ab} as

m_a = \frac{1}{N} \sum_{i=1}^N w_i^0 w_i^a    (A6)

q_{ab} = \frac{1}{N} \sum_{i=1}^N w_i^a w_i^b    (A7)

\tilde q_{ab} = \frac{\sigma_{in}^2}{N} \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b    (A8)

For further convenience, we write the sum of q_{ab} and \tilde q_{ab} as

Q_{ab} = q_{ab} + \tilde q_{ab}.    (A9)

In terms of the order parameters, I becomes

I = \exp\left[ -\sum_\mu \left( i \sum_a h_a^\mu \hat h_a^\mu + i h_0^\mu \hat h_0^\mu + \sum_a \hat h_a^\mu \hat h_0^\mu m_a + \frac{1+\sigma_{out}^2}{2} (\hat h_0^\mu)^2 + \frac{1}{2}\sum_{ab} \hat h_a^\mu \hat h_b^\mu Q_{ab} \right) \right]    (A10)

We can now do the integrals over \hat h_a and \hat h_0, which gives

\prod_\mu^P \int dh_a^\mu\, d\hat h_a^\mu\, dh_0^\mu\, d\hat h_0^\mu\; I = \det( Q_{ab} - \bar m^2 )^{-P/2} \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu\; X    (A11)

where we have defined X as

X = \exp\left[ -\frac{1}{2} \sum_\mu \sum_{ab} ( \bar h^\mu \bar m - h_a^\mu )\, ( Q_{ab} - \bar m^2 )^{-1}\, ( \bar h^\mu \bar m - h_b^\mu ) \right]    (A12)

and \bar m and \bar h as

\bar m = \frac{m}{\sqrt{1+\sigma_{out}^2}}    (A13)

\bar h = \frac{h_0}{\sqrt{1+\sigma_{out}^2}}    (A14)

We now define the additional parameter \tilde r_a as

\tilde r_a = \frac{1}{N} \sum_{j=1}^{N_+} ( \tilde w_j^a )^2    (A15)

Since the solution space is connected, we can make the following replica symmetric ansatz for m_a, q_{ab}, \tilde q_{ab}, and \tilde r_a:

m_a = m    (A16)

\tilde r_a = \tilde r    (A17)

q_{ab} = (1 - \tilde r - q)\,\delta_{ab} + q    (A18)

\tilde q_{ab} = ( \sigma_{in}^2 \tilde r - \tilde q )\,\delta_{ab} + \tilde q    (A19)

Q_{ab} = ( r_Q - Q )\,\delta_{ab} + Q    (A20)

where Q = q + \tilde q and r_Q = 1 - (1 - \sigma_{in}^2)\tilde r. The inverse of the matrix in Eqn. A12 is given (for n \to 0) by

( Q_{ab} - \bar m^2 )^{-1} = \frac{1}{r_Q - Q}\,\delta_{ab} - \frac{Q - \bar m^2}{( r_Q - Q )^2}    (A21)

We now define X' as

X' = \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu\; X    (A22)

Plugging in the replica symmetric ansatz of Eqns. A16-A20, this becomes

X' = \prod_\mu^P \int dh_a^\mu \int D\bar h^\mu \exp\left[ -\frac{1}{2(r_Q - Q)} \sum_a (h_a^\mu)^2 + \frac{Q - \bar m^2}{2(r_Q - Q)^2} \Big( \sum_a h_a^\mu \Big)^2 + \frac{\bar m \bar h^\mu}{r_Q - Q} \sum_a h_a^\mu - \frac{n\, ( \bar m \bar h^\mu )^2}{2(r_Q - Q)} \right]    (A23)

We decouple terms with different replica indices in Eqn. A23 by introducing the auxiliary variable t.
Then X' is

X' = 2 \int_0^\infty D\bar h \int Dt \left[ \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{h^2}{2(r_Q - Q)} + \frac{\sqrt{Q - \bar m^2}}{r_Q - Q}\, h t + \frac{\bar h \bar m}{r_Q - Q}\, h - \frac{\bar m^2 \bar h^2}{2(r_Q - Q)} \right) \right]^n    (A24)

where Dx = \frac{dx}{\sqrt{2\pi}} e^{-x^2/2}. Once we evaluate all of the integrals in the expression for \langle V^n \rangle, we can write it in the form

\langle V^n \rangle = \exp\big( nN [ G_0(q, \tilde q, \tilde r, m) + \alpha\, G_1(q, \tilde q, \tilde r, m) ] \big)    (A25)

where G_0 is an entropic contribution coming from the integral over the weights and G_1 is an energetic contribution whose form is dictated by the learning rule.

We start by computing the energetic contribution. We define A(t, \bar h) and Z(t, \bar h) as

A(t, \bar h) = \frac{1}{2(r_Q - Q)} \left( \sqrt{Q - \bar m^2}\, t + \bar h \bar m \right)^2 - \frac{\bar m^2 \bar h^2}{2(r_Q - Q)}    (A26)

Z(t, \bar h) = \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{1}{2(r_Q - Q)} \left[ h - \left( \sqrt{Q - \bar m^2}\, t + \bar h \bar m \right) \right]^2 \right)    (A27)

and rewrite X' as

X' = 2 \int_0^\infty D\bar h \int Dt\; \exp\big( n A(t, \bar h) \big)\, Z^n(t, \bar h)    (A28)

In the limit n \to 0, X' becomes

X' = \exp\left( \bar A n + 2n \int_0^\infty D\bar h \int Dt \log Z(t, \bar h) \right)    (A29)

where we define \bar A as the integral

\bar A = 2 \int_0^\infty D\bar h \int Dt\; A(t, \bar h) = \frac{Q - \bar m^2}{2(r_Q - Q)}    (A30)

We make the following rotation of variables,

x = \left( \sqrt{Q - \bar m^2}\, t + \bar m \bar h \right) / \sqrt{Q}    (A31)

y = \left( -\bar m\, t + \sqrt{Q - \bar m^2}\, \bar h \right) / \sqrt{Q}    (A32)

which allows us to write t and \bar h as

t = \left( \sqrt{Q - \bar m^2}\, x - \bar m\, y \right) / \sqrt{Q}    (A33)

\bar h = \left( \bar m\, x + \sqrt{Q - \bar m^2}\, y \right) / \sqrt{Q}    (A34)

and Z as a function of x alone,

Z(x) = \int_\kappa^\infty \frac{dh}{\sqrt{2\pi}} \exp\left( -\frac{( h - \sqrt{Q}\, x )^2}{2( r_Q - Q )} \right) = \sqrt{r_Q - Q}\; H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right)    (A35)

Under this transformation, the Gaussian integrals become

2 \int_0^\infty D\bar h \int Dt = 2 \int Dx \int_{-x\bar m / \sqrt{Q - \bar m^2}}^\infty Dy = 2 \int Dx\; H\!\left( -x \bar m / \sqrt{Q - \bar m^2} \right)    (A36)

where we define

H(x) = \int_x^\infty Dy    (A37)

This gives us

2 \int_0^\infty D\bar h \int Dt \log Z(t, \bar h) = 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) + \frac{1}{2} \log( r_Q - Q )    (A38)

So X' becomes

X' = \exp\left[ n \left( 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) + \frac{1}{2} \log( r_Q - Q ) + \bar A \right) \right]    (A39)

Using the relation

\bar A - \frac{1}{2n} \log\det( Q_{ab} - \bar m^2 ) + \frac{1}{2} \log( r_Q - Q ) = 0    (A40)

the replicated volume of the version space becomes

\langle V^n \rangle = \int \prod_{a=1}^n d^N w^a\; d^{N_+} \tilde w^a\; \delta\!\left( \sum_{i=1}^N (w_i^a)^2 + \sum_{j=1}^{N_+} (\tilde w_j^a)^2 - N \right) \int dm \int dq_{ab} \int d\tilde q_{ab}\; \delta\!\left( N m - \sum_{i=1}^N w_i^a w_i^0 \right) \prod_{ab} \delta\!\left( N q_{ab} - \sum_{i=1}^N w_i^a w_i^b \right) \delta\!\left( N \tilde q_{ab} - \sigma_{in}^2 \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) \exp\left[ 2 n P \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right) \right]    (A41)

We can compute the entropic term G_0(q, \tilde q, \tilde r, m) by considering the integrals over the configurations of weights allowed by the delta functions. Then \exp( nN G_0(q, \tilde q, \tilde r, m) ) is given by

\exp( nN G_0(q, \tilde q, \tilde r, m) ) = \int \prod_{a=1}^n dw^a\, d\tilde w^a\; \delta\!\left( N m - \sum_{i=1}^N w_i^a w_i^0 \right) \prod_{ab} \delta\!\left( N q_{ab} - \sum_{i=1}^N w_i^a w_i^b \right) \delta\!\left( N \tilde q_{ab} - \sigma_{in}^2 \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right)    (A42)

Introducing the Lagrange multipliers \hat m_a, \hat q_{ab}, and \hat{\tilde q}_{ab}, Eqn. A42 can be written as

\exp( nN G_0 ) = \int \prod_{a=1}^n dw^a\, d\tilde w^a \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left[ i \sum_{ab} \hat q_{ab} \Big( N q_{ab} - \sum_i w_i^a w_i^b \Big) + i \sum_{ab} \hat{\tilde q}_{ab} \Big( N \tilde q_{ab} - \sigma_{in}^2 \sum_j \tilde w_j^a \tilde w_j^b \Big) + i \sum_a \hat m_a \Big( N m_a - \sum_i w_i^a w_i^0 \Big) \right]

= \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left( i N \sum_{ab} \hat q_{ab} q_{ab} + i N \sum_{ab} \hat{\tilde q}_{ab} \tilde q_{ab} + i N \sum_a \hat m_a m_a \right) \int \prod_{a=1}^n dw^a\, d\tilde w^a\; \exp\big( -i \mathcal H( w^a, \tilde w^a, w^0 ) \big)    (A43)

where we have defined a "Hamiltonian" \mathcal H( w^a, \tilde w^a, w^0 ) as

\mathcal H( w^a, \tilde w^a, w^0 ) = \frac{1}{2} \sum_{ab} \left( \hat q_{ab} \sum_{i=1}^N w_i^a w_i^b + \sigma_{in}^2\, \hat{\tilde q}_{ab} \sum_{j=1}^{N_+} \tilde w_j^a \tilde w_j^b \right) + \sum_a \hat m_a \sum_{i=1}^N w_i^a w_i^0    (A44)

Doing a Wick rotation i\hat m_a \to \hat m_a, i\hat q_{ab} \to \hat q_{ab}, i\hat{\tilde q}_{ab} \to \hat{\tilde q}_{ab} and integrating over the weights w and \tilde w, we have

\exp( nN G_0 ) = \int \frac{d\hat m_a}{2\pi} \frac{d\hat q_{ab}}{2\pi} \frac{d\hat{\tilde q}_{ab}}{2\pi} \exp\left( N \sum_{ab} \hat q_{ab} q_{ab} + N \sum_{ab} \hat{\tilde q}_{ab} \tilde q_{ab} + N \sum_a \hat m_a m \right) \exp\left( \frac{N}{2} \sum_{ab} \hat m_a ( \hat q^{-1} )_{ab} \hat m_b - \frac{N}{2\beta} \log\det \hat q - \frac{N(1 - \beta^{-1})}{2} \log\det \frac{\hat{\tilde q}}{\sigma_{in}^2} \right)    (A45)

We can evaluate the integral at the saddle point by solving for \hat m_a, \hat q_{ab}, and \hat{\tilde q}_{ab} using the three saddle point equations

0 = N m + N \sum_b ( \hat q^{-1} )_{cb}\, \hat m_b    (A46)

0 = N q_{cd} - \frac{N}{2} \sum_{ab} \hat m_a ( \hat q^{-1} )_{ac} ( \hat q^{-1} )_{bd} \hat m_b - \frac{N}{2\beta} ( \hat q^{-1} )_{cd}    (A47)

0 = N \tilde q_{cd} - \frac{N(1 - \beta^{-1})}{2} ( \hat{\tilde q}^{-1} )_{cd}    (A48)

We
make the following replica symmetric ansatz for \hat q_{ab} and \hat{\tilde q}_{ab}:

\hat q_{ab} = ( \hat q_0 - \hat q )\,\delta_{ab} + \hat q    (A49)

\hat{\tilde q}_{ab} = ( \hat{\tilde q}_0 - \hat{\tilde q} )\,\delta_{ab} + \hat{\tilde q}    (A50)

Inserting these expressions into Eqns. A46, A47, and A48 gives the following scalar equations:

\frac{1}{\hat q_0 - \hat q} = \beta( 1 - \tilde r - q )    (A51)

\hat m = -\frac{m}{\beta( 1 - \tilde r - q )}    (A52)

\hat q = -\frac{q - m^2}{\beta( 1 - \tilde r - q )^2}    (A53)

\frac{1}{\hat{\tilde q}_0 - \hat{\tilde q}} = \frac{\beta}{\beta - 1}\,( \sigma_{in}^2 \tilde r - \tilde q )    (A54)

\hat{\tilde q} = -\frac{\beta - 1}{\beta}\,\frac{\tilde q}{( \sigma_{in}^2 \tilde r - \tilde q )^2}    (A55)

Solving for \hat m, \hat q_0, \hat q, \hat{\tilde q}_0, and \hat{\tilde q}, we find

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2} \left( \frac{q - m^2}{\beta( 1 - \tilde r - q )} + \frac{\beta - 1}{\beta}\,\frac{\tilde q}{\sigma_{in}^2 \tilde r - \tilde q} + \frac{1}{\beta} \log\big( \beta( 1 - \tilde r - q ) \big) + \frac{\beta - 1}{\beta} \log\left( \frac{\beta( \sigma_{in}^2 \tilde r - \tilde q )}{\beta - 1} \right) \right)    (A56)

In summary, we have

\langle V^n \rangle = \exp\big( nN [ G_0(q, \tilde q, \tilde r, m) + \alpha\, G_1(q, \tilde q, \tilde r, m) ] \big)    (A57)

where the order parameters take their saddle point values, and

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2\beta} \left( \frac{q - m^2}{1 - \tilde r - q} + \log \beta( 1 - \tilde r - q ) \right) + \frac{\beta - 1}{2\beta} \left( \frac{\tilde q}{\sigma_{in}^2 \tilde r - \tilde q} + \log \frac{\beta( \sigma_{in}^2 \tilde r - \tilde q )}{\beta - 1} \right)    (A58)

G_1(q, \tilde q, \tilde r, m) = 2 \int Dx\; H\!\left( -\frac{x \bar m}{\sqrt{Q - \bar m^2}} \right) \log H\!\left( \frac{\kappa - \sqrt{Q}\, x}{\sqrt{r_Q - Q}} \right)    (A59)

Appendix B: Max-margin limit in mean field theory
In the max margin limit, the uniqueness of the solutions for w and \tilde w implies

q \to 1 - \tilde r, \qquad \tilde q \to \sigma_{in}^2 \tilde r, \qquad Q \to r_Q    (B1)

In general, q and \tilde q approach their max margin values at different rates. To account for this we define the scaling factors \lambda and \tilde\lambda as

\lambda = \frac{r_Q - Q}{1 - \tilde r - q}    (B2)

\tilde\lambda = \frac{r_Q - Q}{\sigma_{in}^2 \tilde r - \tilde q}    (B3)

where \lambda^{-1} + \tilde\lambda^{-1} = 1; indeed, from the definitions Q = q + \tilde q and r_Q = 1 - (1 - \sigma_{in}^2)\tilde r we have (1 - \tilde r - q) + (\sigma_{in}^2 \tilde r - \tilde q) = r_Q - Q. This allows us to rewrite G_0 so that all of the singular terms scale as ( r_Q - Q )^{-1}, as follows:

G_0(q, \tilde q, \tilde r, m) = \frac{1}{2} \left( \frac{\lambda}{\beta}\,\frac{q - m^2}{r_Q - Q} + \tilde\lambda\,\frac{\beta - 1}{\beta}\,\frac{\tilde q}{r_Q - Q} + \frac{1}{\beta} \log\big( \beta \lambda^{-1}( r_Q - Q ) \big) + \frac{\beta - 1}{\beta} \log\left( \frac{\beta \tilde\lambda^{-1}( r_Q - Q )}{\beta - 1} \right) \right)    (B4)

Taking the max margin limit followed by the limit n \to 0, we find that the free energy is given by

\langle \log V \rangle = \frac{N}{2( r_Q - Q )} \left( \lambda( 1 - \tilde r - m^2 ) + \lambda( \lambda^{-1} - 1 )^2\,\frac{( \beta - 1 )\sigma_{in}^2 \tilde r}{\beta} - \alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) [ \kappa - \sqrt{r_Q}\, x ]^2 \right)    (B5)

The saddle point equation for m is

\frac{\lambda \bar m}{\sqrt{r_Q - \bar m^2}} = \frac{\alpha \beta}{\sqrt{2\pi}} \int_{-\kappa/\sqrt{r_Q - \bar m^2}}^{\infty} Dx\; x\; \sigma_{out}\!\left( \frac{\kappa}{\sqrt{r_Q - \bar m^2}} + x \right)    (B6)

The saddle point equation for \tilde r is

\frac{\lambda\big( ( \lambda^{-1} - 1 )^2 ( \beta - 1 )\sigma_{in}^2 - 1 \big)}{\beta} = 2\alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) \frac{x( \kappa - \sqrt{r_Q}\, x ) + ( \sigma_{in}^2 - 1 )}{\sqrt{r_Q}}    (B7)

\qquad\qquad + \frac{\alpha \bar m ( \sigma_{in}^2 - 1 )}{\sqrt{2\pi}\, r_Q} \int_{-\kappa/\sqrt{r_Q - \bar m^2}}^{\infty} Dx\; \sqrt{r_Q - \bar m^2}\; x \left( \frac{\kappa}{\sqrt{r_Q - \bar m^2}} + x \right)    (B8)

We can use Eqn. B6 to further simplify this as

\frac{\lambda\big( ( \lambda^{-1} - 1 )^2 ( \beta - 1 )\sigma_{in}^2 - 1 \big)}{\beta} = 2\alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{r_Q - \bar m^2}} \right) \frac{x( \kappa - \sqrt{r_Q}\, x ) + ( \sigma_{in}^2 - 1 )}{\sqrt{r_Q}} + \frac{( \sigma_{in}^2 - 1 )( \sigma_{out}^2 + 1 )\, \lambda \bar m^2}{\beta\, r_Q}    (B9)

For \lambda, we have the saddle point equation

1 - m^2 = \tilde r \left( 1 + \frac{( \beta - 1 )\sigma_{in}^2}{( \lambda - 1 )^2} \right)    (B10)

which has the relevant solution

\lambda = 1 + \sqrt{ \frac{( \beta - 1 )\sigma_{in}^2 \tilde r}{1 - \tilde r - m^2} }    (B11)

R, the cosine of the angle between student and teacher, can be written in terms of m and \tilde r as

R = \frac{m}{\sqrt{1 - \tilde r}}    (B12)

For \sigma_{in} = 1, i.e. when the variance of the augmented units matches the variance of the original input, Eqns. B5, B6, B7, and B4 simplify considerably, and are given by

\frac{1}{N} \langle \ln V \rangle = \frac{1}{2( 1 - q )} \left( 1 - m^2 - \alpha \int Dx\; H\!\left( \frac{-x \bar m}{\sqrt{1 - \bar m^2}} \right) [ \kappa - x ]^2 \right)    (B13)

\bar m = \frac{\alpha \sqrt{1 - \bar m^2}}{\sqrt{2\pi}} \int_{-\kappa/\sqrt{1 - \bar m^2}}^{\infty} Dx\; x\; \sigma_{out}\!\left( \frac{\kappa}{\sqrt{1 - \bar m^2}} + x \right)    (B14)

\tilde r = \frac{\beta - 1}{\beta}\,( 1 - m^2 )    (B15)

\lambda = \beta    (B16)

r_Q = 1    (B17)

We can now write R directly in terms of m and \beta as

R = \frac{m}{\sqrt{1 - \frac{\beta - 1}{\beta}( 1 - m^2 )}}    (B18)

Appendix C: Network at capacity
We determine the capacity of the network for fixed $\beta$ by setting the margin $\kappa = 0$ in the mean-field equations. After performing all of the integrals, we have the following three equations:

$$\lambda\left(1 - \tilde r - m^{2}\right) + \tilde\lambda(\tilde\lambda - 1)\,\frac{(\beta-1)\sigma_{in}^{2}\tilde r}{\beta} = \frac{\alpha}{\pi}\left(\operatorname{arccot}\!\left(\frac{\bar m}{\sqrt{r_Q - \bar m^{2}}}\right) - \frac{\bar m\sqrt{r_Q - \bar m^{2}}}{r_Q}\right) \quad (C1)$$

$$\frac{\lambda\bar m}{\sqrt{r_Q - \bar m^{2}}} = \frac{\alpha}{\pi}\,\frac{1}{1 + \sigma_{out}^{2}} \quad (C2)$$

$$\tilde\lambda(\tilde\lambda - 1)\,\frac{(\beta-1)\sigma_{in}^{2}}{\beta} = \frac{\alpha\left(\sigma_{in}^{2} - 1\right)}{\sqrt{\pi r_Q}}\,H\!\left(-\frac{\bar m}{\sqrt{r_Q - \bar m^{2}}}\right) + \left(\sigma_{in}^{2} - \sigma_{out}^{2} + 1\right)\frac{\lambda\bar m^{2}}{\beta r_Q} \quad (C3)$$

We can express $\alpha$ in terms of the rescaled load $\alpha_{0} = \alpha/\beta$ and solve these equations numerically for $\alpha_{0}$ to determine $\alpha_{c}$. For $\sigma_{in}^{2} = 1$, the equations for network capacity become

$$1 - m^{2} = \frac{\alpha_{0}}{\pi}\left(\operatorname{arccot}\!\left(\frac{\bar m}{\sqrt{1 - \bar m^{2}}}\right) - \bar m\sqrt{1 - \bar m^{2}}\right) \quad (C4)$$

$$\frac{\bar m}{\sqrt{1 - \bar m^{2}}} = \frac{\alpha_{0}}{\pi}\,\frac{1}{1 + \sigma_{out}^{2}} \quad (C5)$$

Note that these equations depend on $\alpha_{0}$ but not on $\beta$. This implies that for $\sigma_{in}^{2} = 1$, $\alpha_{c}/\beta$ is only a function of $\sigma_{out}^{2}$. The capacity of a network of size $\beta$ then obeys the simple scaling relation

$$\alpha_{c}\left(\beta, \sigma_{out}^{2}\right) = \beta\,\alpha_{c}\left(1, \sigma_{out}^{2}\right) \quad (C6)$$

Appendix D: Calculation of the generalization error
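The end result of this appendix, Eq. (D13), is equivalent to $E_g = \frac{1}{\pi}\arccos\!\left(R/\sqrt{1+\sigma_{out}^{2}}\right)$, and can be checked by directly sampling correlated Gaussian fields for the student and the noisy teacher. A minimal numerical sketch (the values of $R$ and $\sigma_{out}$ are illustrative):

```python
import numpy as np

def eg_formula(R, sigma_out):
    # Eq. (D13): E_g = (1/pi)(pi/2 - arctan(R / sqrt(1 + sigma_out^2 - R^2))),
    # equivalently (1/pi) arccos(R / sqrt(1 + sigma_out^2))
    return np.arccos(R / np.sqrt(1.0 + sigma_out**2)) / np.pi

def eg_monte_carlo(R, sigma_out, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    h0 = rng.standard_normal(n)                                # teacher field, unit variance
    h = R * h0 + np.sqrt(1.0 - R**2) * rng.standard_normal(n)  # student field, overlap R
    eps = sigma_out * rng.standard_normal(n)                   # output noise, variance sigma_out^2
    return np.mean(np.sign(h) != np.sign(h0 + eps))            # fraction of sign disagreements

# the two forms of Eq. (D13) agree
R, s = 0.8, 0.5
assert abs(eg_formula(R, s)
           - (np.pi / 2 - np.arctan(R / np.sqrt(1 + s**2 - R**2))) / np.pi) < 1e-12
```

With $R = 0.8$ and $\sigma_{out} = 0.5$, the Monte Carlo estimate matches the closed form to a few parts in $10^{3}$ at $10^{6}$ samples.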
To evaluate the generalization error in terms of the mean-field order parameters, we start from the following expression for the error on a single example:

$$E\left(w, x, \epsilon\right) = \Theta\!\left(-\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w_{i} x_{i}\right)\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w^{0}_{i} x_{i} + \epsilon\right)\right) \quad (D1)$$

Averaging over the input $x$ and the noise $\epsilon$, we get

$$E_{g}(w) = \int \prod_{i=1}^{N} \frac{dx_{i}}{\sqrt{2\pi}}\, e^{-x_{i}^{2}/2} \int \frac{d\epsilon}{\sqrt{2\pi}\,\sigma_{out}}\, e^{-\epsilon^{2}/2\sigma_{out}^{2}}\, \Theta\!\left(-\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w_{i} x_{i}\right)\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} w^{0}_{i} x_{i} + \epsilon\right)\right) \quad (D2)$$

$$= \int \prod_{i=1}^{N} \frac{dx_{i}}{\sqrt{2\pi}}\, e^{-x_{i}^{2}/2} \int \frac{d\epsilon}{\sqrt{2\pi}\,\sigma_{out}}\, e^{-\epsilon^{2}/2\sigma_{out}^{2}} \int \frac{dh}{\sqrt{2\pi}} \frac{dh^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}} \frac{d\hat h^{0}}{\sqrt{2\pi}}\, \Theta\!\left(-h h^{0}\right) \quad (D3)$$

$$\times \exp\!\left(-i\hat h h - i\hat h^{0} h^{0} + \frac{i}{\sqrt N}\sum_{i=1}^{N}\left(\hat h w_{i} + \hat h^{0} w^{0}_{i}\right)x_{i} + i\hat h^{0}\epsilon\right) \quad (D4)$$

$$= \int \frac{dh}{\sqrt{2\pi}} \frac{dh^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}} \frac{d\hat h^{0}}{\sqrt{2\pi}}\, \Theta\!\left(-h h^{0}\right) \quad (D5)$$

$$\times \exp\!\left(-i\hat h h - i\hat h^{0} h^{0} - \frac{1}{2N}\left(\hat h^{2}\sum_{i=1}^{N} w_{i}^{2} + 2\hat h\hat h^{0}\sum_{i=1}^{N} w_{i} w^{0}_{i} + (\hat h^{0})^{2}\sum_{i=1}^{N} (w^{0}_{i})^{2}\right) - \frac{\sigma_{out}^{2}}{2}(\hat h^{0})^{2}\right) \quad (D6)$$

We set the normalization of the student and teacher to be

$$\|w\| = \|w^{0}\| = \sqrt N \quad (D7)$$

and define the order parameter $R$, the cosine of the angle between teacher and student, as

$$R = \frac{1}{N}\sum_{i=1}^{N} w_{i} w^{0}_{i} \quad (D8)$$

After performing the integral over $\hat h^{0}$, we can define a rescaled $R$ and $h^{0}$ as

$$\bar R = \frac{R}{\sqrt{1 + \sigma_{out}^{2}}} \quad (D9)$$

$$\bar h^{0} = \frac{h^{0}}{\sqrt{1 + \sigma_{out}^{2}}} \quad (D10)$$

We can then perform the integral over $\hat h$ to get the following integral over $h$ and $\bar h^{0}$:

$$E_{g}(R) = \int \frac{dh}{\sqrt{2\pi}} \frac{d\bar h^{0}}{\sqrt{2\pi}} \frac{d\hat h}{\sqrt{2\pi}}\, \Theta\!\left(-h\bar h^{0}\right) e^{-\frac{1}{2}\left(1 - \bar R^{2}\right)\hat h^{2} - i\hat h\left(h - \bar R\bar h^{0}\right) - \frac{1}{2}(\bar h^{0})^{2}} \quad (D11)$$

$$= \int \frac{dh\, d\bar h^{0}}{2\pi\sqrt{1 - \bar R^{2}}}\, \Theta\!\left(-h\bar h^{0}\right) e^{-\frac{1}{2\left(1 - \bar R^{2}\right)}\left(h^{2} - 2h\bar h^{0}\bar R + (\bar h^{0})^{2}\right)} \quad (D12)$$

This evaluates to

$$E_{g}(R) = \frac{1}{\pi}\left(\frac{\pi}{2} - \tan^{-1}\!\left(\frac{R}{\sqrt{1 + \sigma_{out}^{2} - R^{2}}}\right)\right) \quad (D13)$$

In our expanded network, $m$ and $R$ are related by

$$m = \frac{1}{N}\, R\, \|w\|\, \|w^{0}\| \quad (D14)$$

This gives us

$$R = \frac{m}{\sqrt{1 - \tilde r}} \quad (D15)$$

In terms of $m$ and $\tilde r$, the generalization error can be written as

$$E_{g}(m, \tilde r) = \frac{1}{\pi}\left(\frac{\pi}{2} - \tan^{-1}\!\left(\frac{m}{\sqrt{(1 - \tilde r)(1 + \sigma_{out}^{2}) - m^{2}}}\right)\right) \quad (D16)$$

Appendix E: Large $\beta$ limit

We can find a closed expression for the generalization error in the limit $\beta \to \infty$ with $\sigma_{in} \leq$
$1$. In this limit we have $m \ll \alpha \ll \beta$ and $1 \ll \kappa$. Analysis of the saddle-point equations gives us the following relations:

$$\sigma_{in}^{2} = \frac{\alpha}{\beta}\,\kappa^{2} \quad (E1)$$

$$\sigma_{in}^{2}(\beta - 1) - \tilde\lambda = 0 \quad (E2)$$

$$\lambda\bar m = \frac{2\alpha}{\sqrt\pi}\,\kappa \quad (E3)$$

$$\beta\sigma_{in}^{2}\bar m = \frac{2\alpha}{\sqrt\pi}\,\sigma_{in}\sqrt{\frac{\beta}{\alpha}} \quad (E4)$$

$$\lambda = \sigma_{in}^{2}\sqrt{\frac{\beta}{1 - \tilde r - m^{2}}} \quad (E5)$$

which lead to the following expressions for $m$ and $\tilde r$:

$$m = \sqrt{\frac{2\alpha}{\pi\beta\sigma_{in}^{2}\left(1 + \sigma_{out}^{2}\right)}} \quad (E6)$$

$$1 - \tilde r = \frac{\pi\left(1 + \sigma_{out}^{2}\right)}{2\alpha\beta\sigma_{in}^{2}}\left(1 + \frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}\right)^{2} \quad (E7)$$

Plugging these into Eqn. (D15) gives us

$$R \approx \frac{\frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}}{1 + \frac{2\alpha}{\pi\left(1 + \sigma_{out}^{2}\right)}} \approx 1 - \frac{\pi\left(1 + \sigma_{out}^{2}\right)}{2\alpha} \quad (E8)$$

The expression for $R$ in Eqn. (E8) can be plugged into Eqn. (D13) to find an expression for the generalization error for $\beta \to \infty$, which is shown in Fig. 6(a). Note that this expression does not depend on $\sigma_{in}$ as long as $\sigma_{in} \leq 1$.

Appendix F: Optimal input noise
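Before optimizing over $\sigma_{in}$, it is worth confirming that the two forms in Eq. (E8) above are mutually consistent: expanding $R = y/(1+y)$ with $y = 2\alpha/\pi(1+\sigma_{out}^{2})$ at large $\alpha$ reproduces $1 - \pi(1+\sigma_{out}^{2})/2\alpha$. A symbolic sketch (the intermediate form of Eq. (E8) is a reconstruction):

```python
import sympy as sp

alpha, s_out = sp.symbols('alpha sigma_out', positive=True)

# first form of Eq. (E8)
y = 2 * alpha / (sp.pi * (1 + s_out**2))
R = y / (1 + y)

# leading large-alpha correction: alpha * (1 - R) -> pi (1 + sigma_out^2) / 2
correction = sp.limit(alpha * (1 - R), alpha, sp.oo)
assert sp.simplify(correction - sp.pi * (1 + s_out**2) / 2) == 0
```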
We find the optimal $\sigma_{in}$, which minimizes the generalization error, by maximizing $R$. Differentiating $R$ with respect to $\sigma_{in}$ gives us

$$\frac{dR}{d\sigma_{in}} = \frac{1}{\sqrt{1 - \tilde r}}\,\frac{dm}{d\sigma_{in}} + \frac{m}{2\left(1 - \tilde r\right)^{3/2}}\,\frac{d\tilde r}{d\sigma_{in}} \quad (F1)$$

which gives us the condition

$$\frac{dm}{d\sigma_{in}} = -\frac{m}{2\left(1 - \tilde r\right)}\,\frac{d\tilde r}{d\sigma_{in}}$$
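The stationarity condition above follows directly from Eq. (D15): setting $dR/d\sigma_{in} = 0$ for $R = m/\sqrt{1-\tilde r}$ gives an equation linear in $dm/d\sigma_{in}$. A symbolic sketch of the check:

```python
import sympy as sp

s = sp.Symbol('sigma_in', positive=True)
m = sp.Function('m')(s)
rt = sp.Function('rtilde')(s)

R = m / sp.sqrt(1 - rt)          # Eq. (D15), m and rtilde depending on sigma_in
dR = sp.diff(R, s)

# solve dR/dsigma_in = 0 for dm/dsigma_in
sol = sp.solve(sp.Eq(dR, 0), sp.Derivative(m, s))[0]
expected = -m / (2 * (1 - rt)) * sp.Derivative(rt, s)
assert sp.simplify(sol - expected) == 0
```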