Activation function dependence of the storage capacity of treelike neural networks
Jacob A. Zavatone-Veth ∗ Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA
Cengiz Pehlevan †
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA and Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
(Dated: July 23, 2020)

The expressive power of artificial neural networks crucially depends on the nonlinearity of their activation functions. Though a wide variety of nonlinear activation functions have been proposed for use in artificial neural networks, a detailed understanding of their role in determining the expressive power of a network has not emerged. Here, we study how activation functions affect the storage capacity of treelike two-layer networks in the infinite-width limit. We relate the boundedness or divergence of the capacity in this limit to the smoothness of the activation function, elucidating the relationship between previously studied special cases. Our results show that nonlinearity can both increase capacity and decrease the robustness of classification, and provide simple estimates for the capacity of networks with several commonly used activation functions.
The expressive power of artificial neural networks is well known [1–4], but a complete theoretical account of how their remarkable abilities arise is lacking [5, 6]. In particular, though a diverse array of nonlinear activation functions have been employed in neural networks [5–12], our understanding of the relationship between the choice of activation function and computational capability is incomplete [7–9]. Methods from the statistical mechanics of disordered systems have enabled the interrogation of this link in several special cases [9–15], but these previous works have not yielded a general theory.

In this work, we characterize how pattern storage capacity depends on the activation function in a tractable two-layer network model known as the treelike committee machine. We find that the storage capacity of a treelike committee machine remains finite in the infinite-width limit provided that the activation function is weakly differentiable, and it and its weak derivative are square-integrable with respect to the Gaussian measure. For example, the capacity with sign activation functions diverges, while that with rectified linear unit or error function activations is finite. We predict that nonlinearity should increase capacity, but may reduce the robustness of classification. These connections between expressive power and smoothness begin to shed light on the influence of activation functions on the capabilities of neural networks.
The treelike committee machine—The treelike committee machine is a two-layer neural network with N inputs divided among K hidden units into disjoint groups of N/K, and a binary output (Figure 1a) [9–12]. For a hidden unit activation function g, a set of hidden unit weight vectors {w_j ∈ ℝ^{N/K}}_{j=1}^{K}, a readout weight vector v ∈ ℝ^K, and a threshold ϑ ∈ ℝ, its output is given as

    y(x) = sign(s(x))    (1)

for

    s(x; {w_j}, v, ϑ) = (1/√K) Σ_{j=1}^{K} v_j g( w_j · x_j / √(N/K) ) − ϑ,    (2)

where x_j denotes the vector of inputs to the jth hidden unit. In this model, the readout weight vector and threshold are fixed, and only the hidden unit weights are learned. The perceptron can thus be viewed as the special case of a treelike committee machine with identity activation functions and equal readout weights [13, 14].

[FIG. 1. Pattern storage in treelike committee machines. (a) Network architecture. (b) Capacity α_c as a function of margin κ for several common activation functions (linear, quadratic, ReLU, erf). Solid and dashed lines indicate estimates of the capacity under replica-symmetric and one-step replica-symmetry-breaking ansätze, respectively.]

Statistical mechanics of pattern storage—To characterize this network's ability to classify a random dataset of P examples subject to constraints on the hidden unit weights imposed by a measure ρ, we define the Gardner volume [13, 14]

    Z = ∫ dρ({w_j}) Π_{µ=1}^{P} Θ( y^µ s(x^µ; {w_j}, v, ϑ) − κ ),    (3)

which measures the volume in weight space such that all examples are classified correctly with margin at least κ. As in most studies of the Gardner volume of neural networks, we consider a dataset consisting of Bernoulli disorder, in which the components of the inputs and the target outputs are independently and identically distributed as x^µ_{jk} = ±1 and y^µ = ±1 with equal probability, and we constrain each hidden unit weight vector to the sphere of radius (N/K)^{1/2} [9–11, 13].

We will study the sequential infinite-width limit N, P → ∞, K → ∞, with load α ≡ P/N = O(1). In this limit, we expect the free entropy per weight f = N^{−1} log Z to be self-averaging with respect to the distribution of the quenched disorder [12], and for there to exist a critical load α_c, termed the capacity, below which the classification task is solvable with probability one and above which Z vanishes [12–15]. The special case of this model with sign activation functions was intensively studied in the late 20th century, and its capacity was shown to diverge as K → ∞ [10, 11, 16]. In contrast, Baldassi et al. [9] recently showed that the capacity with rectified linear unit (ReLU) activations remains bounded in the infinite-width limit. Our primary objective in this work is to identify the class of activation functions for which the capacity remains finite.

We begin our analysis by specifying our choice of general constraints on the activation function, readout weights, and threshold. We will require the K → ∞ limit to be well-defined in the sense that the variance of the output preactivation s over the pattern distribution is finite. In the sequential limit, the central limit theorem implies that the hidden unit preactivations converge in distribution to a collection of independent Gaussian random variables [17]. Therefore, for s to have finite variance, the activation function g must lie in the Lebesgue space L²(γ) of functions that are square-integrable with respect to the Gaussian measure γ on the reals (see Appendix A). Furthermore, as var(s) ∝ ‖v‖²/K, we must have ‖v‖² = O(K). Observing that ‖v‖ sets the effective scale of ϑ and κ but does not affect the zero-margin capacity, we fix ‖v‖² = K for convenience. To ensure that s has mean zero, we set ϑ = K^{−1/2} (E g) Σ_{j=1}^{K} v_j, where E g = ∫ dγ(z) g(z) is the average value of the hidden unit activations. This choice maximizes the capacity for the symmetric datasets of interest (see Appendices B, C, and D), and generalizes the conditions on v and ϑ considered in previous works [9–11].

To compute the limiting quenched free entropy, we apply the replica trick, which exploits a limit identity for logarithmic averages and a non-rigorous interchange of limits to write

    f = lim_{n↓0} lim_{K→∞} lim_{N→∞} (1/(nN)) log E_{x,y} Z^n_{N,αN,K},    (4)

where the validity of the analytic continuation of the moments from positive integer n to n ↓ 0 is assumed. The relevant order parameters of this computation are the overlaps q^{ab}_j = (K/N) w^a_j · w^b_j [13, 15, 18], which represent the average overlap between the preactivations of the jth hidden unit in two different replicas a and b.
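Before turning to the replica computation, the following sketch illustrates the model and the storage criterion of Eqs. (1)–(3). It is not part of the original analysis: the sizes, the ReLU activation, the random seed, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values used in the paper)
N, K, P, kappa = 1200, 12, 600, 0.0
M = N // K                                   # inputs per hidden unit, N/K

def g(z):                                    # activation; ReLU chosen for illustration
    return np.maximum(z, 0.0)

# Fixed readout weights with ||v||^2 = K and the zero-mean threshold of the text
v = np.ones(K)
E_g = 1.0 / np.sqrt(2.0 * np.pi)             # E[g(z)] under N(0, 1) for ReLU
theta = E_g * v.sum() / np.sqrt(K)           # theta = K^{-1/2} (E g) sum_j v_j

# Hidden-unit weights drawn uniformly on the sphere of radius sqrt(N/K)
w = rng.standard_normal((K, M))
w *= np.sqrt(M) / np.linalg.norm(w, axis=1, keepdims=True)

# Bernoulli disorder: inputs and labels are +/-1 with equal probability
x = rng.choice([-1.0, 1.0], size=(P, K, M))
y = rng.choice([-1.0, 1.0], size=P)

# Preactivation s(x) of Eq. (2) and the per-pattern condition inside Eq. (3)
h = np.einsum('pkm,km->pk', x, w) / np.sqrt(M)   # local field of each hidden unit
s = (g(h) @ v) / np.sqrt(K) - theta
stored = y * s >= kappa
print(f"fraction of patterns stored at margin {kappa}: {stored.mean():.3f}")
# For random (untrained) weights roughly half the patterns satisfy the condition;
# the Gardner volume (3) measures the weight-space volume where all of them do.
```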
Under a replica- and hidden-unit-symmetric (RS) ansatz q^{ab}_j = q, one finds that

    f_RS = extr_q { α ∫ dγ(z) log H( (κ + √(q̃(q)) z) / √(σ² − q̃(q)) ) + (1/2)[ q/(1 − q) + log(1 − q) + log(2π) ] },    (5)

where H(z) = ∫_z^∞ dγ(x) is the Gaussian tail distribution function, σ² = E[g²] − (E g)² is the variance of the activation, and

    q̃(q) = cov[ g(x), g(y) ],   (x, y)ᵀ ~ N( 0, [[1, q], [q, 1]] ),    (6)

is an effective order parameter describing the average overlap between the activations of a given hidden unit in two different replicas. This expression for f_RS is equivalent to that given in [9] for ReLU activations, but we adopt a different definition for the effective order parameter that has a clearer statistical interpretation for generic activation functions.

To find the replica-symmetric capacity α_RS, one must take the limit q ↑ 1, as the Gardner volume tends to zero in this limit [9–14]. As q ↑ 1, q̃ ↑ σ², but the asymptotic properties of q̃ as a function of ε ≡ 1 − q depend on the choice of activation function. Making the general ansatz that σ² − q̃ ∼ ε^ℓ for some ℓ > 0, we find that α_RS ∼ ε^{ℓ−1} (see Appendix C). Therefore, the RS capacity diverges if ℓ < 1 and vanishes if ℓ > 1, while the boundary case ℓ = 1 is special in that the capacity is bounded but non-vanishing. For the special cases of g(x) = sign(x) and g(x) = ReLU(x), this behavior was noted by Baldassi et al. [9]. For sign, one has σ² − q̃ ∼ √ε, and α_RS diverges in the infinite-width limit, while for ReLU, σ² − q̃ ∼ ε, and α_RS remains finite. However, [9] and other previous studies [10, 11] relied on direct computation of the effective order parameters for all values of q, which is not tractable for most activation functions, and does not yield general insight.

Asymptotics of the effective order parameter—To understand the asymptotic behavior of q̃(q) as q ↑ 1 for a general activation function g, we apply tools from the theory of Gaussian measures [19]. As g is in L²(γ) by assumption, it has a Fourier-Hermite series

    g(x) = Σ_{k=0}^{∞} g_k He_k(x),    (7)

where {He_k} is the set of orthonormal Hermite polynomials (see Appendix A). We note that the L²(γ) norm of g can then be written as the sum of the squares of the coefficients g_k, i.e., ‖g‖²_γ = Σ_{k=0}^{∞} g_k², and that g_0 = E g. To express q̃(q) in terms of these coefficients, we recall the Mehler expansion of the standard bivariate Gaussian density ϕ(x, y; q) [20, 21]:

    ϕ(x, y; q) = ϕ(x) ϕ(y) Σ_{k=0}^{∞} q^k He_k(x) He_k(y),    (8)

where ϕ(x) = exp(−x²/2)/√(2π) is the univariate Gaussian density. Then, we can evaluate the expectation in (6), yielding

    q̃(q) + g_0² = Σ_{k=0}^{∞} g_k² q^k,    (9)

which, by Abel's theorem, is a bounded, continuous function of q ∈ [−1, 1] because q̃(1) + g_0² = ‖g‖²_γ is finite.
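The identity (9) and the contrasting small-ε behavior of sign and ReLU can be checked numerically. The sketch below is our own illustration rather than code from the paper: the Fourier-Hermite coefficients of ReLU are computed by Gauss-Hermite quadrature, and the left-hand side of (9) is evaluated from the classical arc-cosine (ReLU) and arcsine (sign) closed forms for Gaussian covariances, which are standard results not taken from this work; the quadrature order and the test values of q and ε are arbitrary.

```python
import numpy as np
from math import factorial

# Gauss-Hermite quadrature adapted to the standard Gaussian measure gamma
nodes, weights = np.polynomial.hermite_e.hermegauss(200)
weights = weights / np.sqrt(2.0 * np.pi)          # weights now sum to 1

def he(k, x):
    """Orthonormal Hermite polynomial He_k(x) / sqrt(k!)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.polynomial.hermite_e.hermeval(x, c) / np.sqrt(float(factorial(k)))

relu = lambda z: np.maximum(z, 0.0)

# Fourier-Hermite coefficients g_k of ReLU, Eq. (7)
kmax = 40
g = np.array([np.sum(weights * relu(nodes) * he(k, nodes)) for k in range(kmax + 1)])

# E[g(x) g(y)] = qtilde(q) + g_0^2 for ReLU, from the classical arc-cosine closed form
def relu_second_moment(q):
    return (np.sqrt(1.0 - q**2) + q * np.arccos(-q)) / (2.0 * np.pi)

q = 0.5
series = np.sum(g**2 * q ** np.arange(kmax + 1))  # right-hand side of Eq. (9)
print("E[g(x)g(y)]     :", relu_second_moment(q))
print("sum_k g_k^2 q^k :", series)

# sigma^2 - qtilde(1 - eps): ~ sqrt(eps) for sign (arcsine law), ~ eps for ReLU
sigma2_relu = 0.5 - 1.0 / (2.0 * np.pi)
for eps in [1e-2, 1e-3, 1e-4]:
    gap_sign = 1.0 - (2.0 / np.pi) * np.arcsin(1.0 - eps)
    gap_relu = sigma2_relu - (relu_second_moment(1.0 - eps) - 1.0 / (2.0 * np.pi))
    print(f"eps = {eps:.0e}   sign gap = {gap_sign:.4f}   ReLU gap = {gap_relu:.2e}")
```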
Writing q ≡ 1 − ε, we expand (1 − ε)^k in a binomial series and formally interchange the order of summation to obtain

    q̃(ε) + g_0² = Σ_{l=0}^{∞} [(−ε)^l / l!] Σ_{k=l}^{∞} (k)_l g_k²,    (10)

where (k)_l = k(k − 1)···(k − l + 1) is the falling factorial. We recognize the sums over k as the norms of the weak derivatives of g, which have formal Fourier-Hermite series

    g^{(l)}(x) = Σ_{k=l}^{∞} g_k √((k)_l) He_{k−l}(x),    (11)

which follow from the recurrence relation He′_k(x) = √k He_{k−1}(x) [19]. Therefore, q̃ admits a formal power series expansion in ε as

    q̃(ε) + g_0² = Σ_{l=0}^{∞} [(−1)^l / l!] ‖g^{(l)}‖²_γ ε^l.    (12)

For the RS capacity to remain bounded, we merely require that the first two terms in this series are finite, not for the series to converge at any higher order for non-vanishing ε. Therefore, the RS capacity is finite for once weakly-differentiable activations g such that the L² norms of the function and its weak derivative with respect to the Gaussian measure, ‖g‖_γ and ‖g′‖_γ, are finite. This class of functions is precisely the Sobolev class H¹(γ) [19]. We provide additional background material on H¹(γ) and weak differentiability in Appendix A.

Storage capacity—For any activation function in the class H¹(γ), we find that

    α_RS(κ) = (‖g′‖²_γ / σ²) α_G(κ/σ),    (13)

where

    α_G(κ) = [ ∫_{−κ}^{∞} dγ(z) (κ + z)² ]^{−1}    (14)

is Gardner's formula for the perceptron capacity [13] (see Appendix C). In terms of Fourier-Hermite coefficients, we have σ² = Σ_{k=1}^{∞} g_k² and ‖g′‖²_γ = Σ_{k=1}^{∞} k g_k². Thus, we have ‖g′‖²_γ ≥ σ², with equality if and only if all nonlinear terms (those corresponding to Hermite polynomials of degree two or greater) vanish. Therefore, introducing nonlinearity always increases the zero-margin RS capacity. However, as α_G(κ) is a monotonically decreasing function, the capacity at large margins can be reduced by nonlinearity if σ < 1. We note that the zero-margin capacity is invariant under rescaling of the activation function and hidden unit weights as g ↦ c₁ g, v ↦ c₂ v for constants c₁ and c₂. For finite margin, rescaling can increase or decrease the capacity by changing σ². Thus, in the sense of classification margin, introducing nonlinearity or re-scaling can reduce the robustness of classification.
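Equations (13) and (14) are simple enough to evaluate directly. The following sketch is an illustration under our own choices, not the paper's code: the list of activations and their (weak) derivatives, the quadrature order, and the integration routine are assumptions made for demonstration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def alpha_gardner(kappa):
    """Gardner's perceptron capacity, Eq. (14)."""
    integrand = lambda z: (kappa + z)**2 * np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)
    val, _ = quad(integrand, -kappa, np.inf)
    return 1.0 / val

# Gauss-Hermite quadrature for expectations under the Gaussian measure gamma
nodes, weights = np.polynomial.hermite_e.hermegauss(200)
weights = weights / np.sqrt(2.0 * np.pi)
E = lambda f: np.sum(weights * f(nodes))

def alpha_rs(g, dg, kappa=0.0):
    """RS capacity of the treelike committee machine, Eq. (13)."""
    sigma2 = E(lambda z: g(z)**2) - E(g)**2          # activation variance
    dnorm2 = E(lambda z: dg(z)**2)                   # ||g'||^2 under gamma
    return (dnorm2 / sigma2) * alpha_gardner(kappa / np.sqrt(sigma2))

activations = {
    # activation g and its (weak) derivative g'; this list is illustrative
    "linear":    (lambda z: z,                  lambda z: np.ones_like(z)),
    "ReLU":      (lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float)),
    "erf":       (lambda z: erf(z),             lambda z: 2.0/np.sqrt(np.pi)*np.exp(-z**2)),
    "quadratic": (lambda z: z**2,               lambda z: 2.0 * z),
}
for name, (g, dg) in activations.items():
    print(f"{name:9s}  alpha_RS(0) = {alpha_rs(g, dg):.3f}")
# Values quoted in the text: 2 (linear), 2*pi/(pi - 1) ~ 2.93 (ReLU), 4 (quadratic)
```

For the nonlinear choices the zero-margin values exceed Gardner's perceptron value of 2, consistent with ‖g′‖²_γ ≥ σ².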
Using this result, we can characterize the RS capacity of wide treelike committee machines for several commonly-used activation functions (see Appendix E for details). As previously noted, for a linear activation function, our result reduces to Gardner's perceptron capacity [13]. As the sign function is not weakly differentiable, we recover the result that the capacity diverges [10, 11]. ReLU is weakly differentiable, and we recover the result from [9] that α_RS = 2π/(π − 1) ≈ 2.93 at zero margin. The error function is likewise in H¹(γ) and yields a finite capacity in closed form, while for a quadratic activation one finds α_RS = 4 (see Appendix E). We plot the RS capacity as a function of margin for these activation functions in Figure 1b, illustrating how nonlinearity can reduce the large-margin capacity.

To demonstrate the applicability of our theory to more exotic activation functions, we consider several smooth, non-monotonic alternatives to ReLU that have been proposed in the deep learning literature [7, 8]. All of these functions are in H¹(γ), hence they yield finite capacities, which we estimate in Appendix E. In summary, our replica-symmetric calculation divides activation functions into those that are outside H¹(γ), for which the capacity diverges, and those that are in H¹(γ), for which the capacity is finite and given by (13). For functions in H¹(γ), our RS result predicts that nonlinearity should increase capacity, but can decrease classification robustness.

However, for nonlinear activation functions, one generically expects the energy landscape to become locally non-convex, and for replica symmetry breaking (RSB) to occur [9–12, 15, 18]. The RS estimate of the capacity is therefore only an upper bound, and one must account for RSB effects in order to obtain a more accurate estimate [9–12, 15, 16, 18]. To that end, we have calculated the capacity under a one-step replica-symmetry-breaking (1-RSB) ansatz, extending the results of earlier work [9–11] to arbitrary activation functions. Under the 1-RSB ansatz, the replicas are divided into groups of size m, with inter-group overlap q₀ and intra-group overlap q₁. Then, the capacity is extracted by taking the limit q₁ ↑ 1, m ↓ 0, with r ≡ m/(1 − q₁) finite [9–12, 18].
As detailed in Appendix D, this calculation yields an expression for the 1-RSB capacity as the solution to a two-dimensional minimization problem over q₀ and r. Importantly, the finite-capacity condition at 1-RSB is the same as that with RS. For functions in H¹(γ), the resulting minimization problem must usually be solved numerically, hence we give results for only a few tractable examples. For these examples, we illustrate this minimization problem by plotting its landscape in Figure 2. For comparison purposes, we include the landscape for linear activation functions, for which RSB does not occur [13–15, 22]. For ReLU, we obtain a finite zero-margin 1-RSB capacity that differs slightly from the value reported by [9], a discrepancy which likely results from differences in numerical analysis (see Appendix E). For the other tractable activation functions, including erf, we similarly obtain finite 1-RSB capacities at κ = 0; the numerical values and the corresponding optima (q₀*, r*) are reported in Appendix E.
Discussion—We have shown that the storage capacity of treelike committee machines with activation functions in H¹(γ) remains bounded in the infinite-width limit. Our results follow from a replica analysis of the Gardner volume, with the capacity given by a simple closed-form expression under a replica-symmetric ansatz and by a two-dimensional minimization problem with one-step replica-symmetry-breaking. Depending on the activation function, a fully accurate determination of the capacity would likely require higher levels in the Parisi hierarchy of replica-symmetry-breaking ansätze [18]. Furthermore, it can be challenging to rigorously prove that the capacity results obtained using the replica method at any level of the Parisi hierarchy are correct [15, 18, 22–24]. With these caveats in mind, our results begin to elucidate how nonlinear activation functions affect the ability of neural networks to robustly solve classification problems.

[FIG. 2. The landscape of the function whose minimum determines the 1-RSB capacity as a function of the inter-block overlap q₀ and the rescaled Parisi parameter r ≡ m/(1 − q₁) for several example activation functions. In each panel, the value of this function is shown in false color, with the location of the minimum indicated by an orange dot.]

The Gardner volume is agnostic to the choice of learning algorithm used to train the weights of the network, which makes it a general approach to studying storage capacity, but means that it can provide only limited insight into the practical realizability of the extant solutions [9–12]. In a recent study of least-squares function approximation by wide fully-connected committee machines, Panigrahi et al. [7] have shown using random matrix theory methods that the speed and robustness of learning with stochastic gradient descent is related to activation function smoothness. In particular, training error decreases rapidly with a large decay exponent under weaker assumptions on the data for activation functions that have a "kink" (a jump discontinuity in their first or second derivatives) than for smooth functions. This result is suggestively similar to that of this work, as such functions are at most once or twice weakly differentiable. However, further work will be required to determine whether a similar link between smoothness and trainability exists for classification tasks in treelike networks.

In this work, we have studied the activation-function dependence of the storage capacity of wide treelike committee machines. This network architecture is particularly convenient to study in the infinite-width limit, but it is far removed from the deep networks used in practical applications [5]. As a step towards more elaborate and realistic models, one could consider a fully-connected committee machine, in which each hidden unit is connected to the full set of inputs. Prior work on such networks with sign activation functions suggests that some qualitative aspects of the behavior of treelike committee machines should still hold true [10, 11, 25]. However, fully-connected committee machines possess a permutation symmetry with respect to interchange of the hidden units, which is broken at loads below the RS capacity [10].
This phenomenon and the presence of correlations between hidden units complicate the study of their infinite-width limit. Accurate determination of how their storage capacity depends on the activation function will therefore require further work, in which the insights developed in this study should prove broadly useful.

Acknowledgements—J. A. Zavatone-Veth acknowledges support from the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard and the Harvard Quantitative Biology Initiative. C. Pehlevan thanks the Harvard Data Science Initiative, Google, and Intel for support.

[1] G. Cybenko, Mathematics of Control, Signals and Systems, 303 (1989).
[2] K. Hornik, M. Stinchcombe, H. White, et al., Neural Networks, 359 (1989).
[3] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016).
[4] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (2016), pp. 3360–3368.
[5] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[7] A. Panigrahi, A. Shetty, and N. Goyal, arXiv preprint arXiv:1908.05660 (2019).
[8] P. Ramachandran, B. Zoph, and Q. V. Le, arXiv preprint arXiv:1710.05941 (2017).
[9] C. Baldassi, E. M. Malatesta, and R. Zecchina, Physical Review Letters, 170602 (2019).
[10] E. Barkai, D. Hansel, and H. Sompolinsky, Physical Review A, 4146 (1992).
[11] A. Engel, H. Köhler, F. Tschepke, H. Vollmayr, and A. Zippelius, Physical Review A, 7590 (1992).
[12] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).
[13] E. Gardner, Journal of Physics A: Mathematical and General, 257 (1988).
[14] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General, 271 (1988).
[15] M. Talagrand, Spin Glasses: A Challenge for Mathematicians: Cavity and Mean Field Models, Vol. 46 (Springer Science & Business Media, 2003).
[16] R. Monasson and R. Zecchina, Physical Review Letters, 2432 (1995).
[17] D. Pollard, A User's Guide to Measure Theoretic Probability, Vol. 8 (Cambridge University Press, 2002).
[18] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, Vol. 9 (World Scientific Publishing Company, 1987).
[19] V. I. Bogachev, Gaussian Measures (American Mathematical Society, 1998).
[20] W. Kibble, Mathematical Proceedings of the Cambridge Philosophical Society, 12 (1945).
[21] Y. L. Tong, The Multivariate Normal Distribution (Springer Science & Business Media, 2012).
[22] M. Shcherbina and B. Tirozzi, Communications in Mathematical Physics, 383 (2003).
[23] J. Ding and N. Sun, in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (2019), pp. 816–827.
[24] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová, Journal of Statistical Mechanics: Theory and Experiment, 124023 (2019).
[25] R. Urbanczik, Journal of Physics A: Mathematical and General, L387 (1997).
[26] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Vol. 55 (US Government Printing Office, 1948).
[27] R. H. Byrd, J. C. Gilbert, and J. Nocedal, Mathematical Programming, 149 (2000).
[28] P. Poirazi, T. Brannon, and B. W. Mel, Neuron, 989 (2003).

Appendix A: Gaussian measures, Hermite polynomials, and weak differentiability
In this appendix, we review relevant background material from the theory of Gaussian measures. Our discussion is a specialization of the more general discussion in Chapter 1 of Bogachev [19] to the one-dimensional case. We merely seek to summarize the relevant definitions and results, and will not attempt to provide rigorous proofs.

We let γ be the standard Gaussian probability measure on ℝ, which has density exp(−x²/2)/√(2π) with respect to Lebesgue measure. We let L²(γ) be the Lebesgue space of functions on ℝ that are square-integrable with respect to γ, and, for brevity, denote the norm on this space as ‖·‖_γ. The natural orthonormal basis for L²(γ) is given by the set of Hermite polynomials {He_k}_{k=0}^{∞}, which can be defined by the formula

    He_k(x) = [(−1)^k / √(k!)] exp(x²/2) (d^k/dx^k) exp(−x²/2).    (A1)

The Hermite polynomials satisfy the recurrence relation

    He′_k(x) = √k He_{k−1}(x) = x He_k(x) − √(k + 1) He_{k+1}(x)    (A2)

for k ≥ 1, with He_0 ≡ 1. For a given function g ∈ L²(γ), we define its Fourier-Hermite coefficients

    g_k = ∫ g He_k dγ,    (A3)

and the Fourier-Hermite series

    g(x) = Σ_{k=0}^{∞} g_k He_k(x),    (A4)

which is guaranteed to converge in mean-square by the fact that ‖g‖²_γ = Σ_{k=0}^{∞} g_k² is finite.

Then, using the recurrence relation He′_k(x) = √k He_{k−1}(x) and the fact that He′_0 ≡ 0, we can express the lth weak derivative of g as a formal Fourier-Hermite series

    g^{(l)}(x) = Σ_{k=l}^{∞} g_k √((k)_l) He_{k−l}(x),    (A5)

where (k)_r = k(k − 1)···(k − r + 1) is the falling factorial. If, for some r ≥ 0, the sum

    ‖g^{(r)}‖²_γ = Σ_{k=r}^{∞} (k)_r g_k²    (A6)

is finite, then the Fourier-Hermite series for g and its weak derivatives up to order r converge in mean-square. The class of functions satisfying this condition is the Sobolev class H^r(γ), which has Sobolev norm

    ‖g‖_{H^r(γ)} = ( Σ_{l=0}^{r} ‖g^{(l)}‖²_γ )^{1/2}.    (A7)
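As a numerical companion to Eqs. (A1)–(A7), the sketch below (our own illustration; the ReLU example, the quadrature order, and the truncation are arbitrary choices) constructs the orthonormal Hermite polynomials, checks orthonormality and the recurrence (A2), and evaluates the sums in (A6) for ReLU.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

# Quadrature for integrals against the standard Gaussian measure gamma
x, w = He.hermegauss(200)
w = w / np.sqrt(2.0 * np.pi)

def he_n(k, t):
    """Orthonormal Hermite polynomial He_k(t) / sqrt(k!), cf. Eq. (A1)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return He.hermeval(t, c) / np.sqrt(float(factorial(k)))

# Orthonormality: the Gram matrix of the first few He_k should be the identity
gram = np.array([[np.sum(w * he_n(j, x) * he_n(k, x)) for k in range(8)]
                 for j in range(8)])
print("max deviation from orthonormality:", np.abs(gram - np.eye(8)).max())

# Recurrence (A2): He_k'(t) = sqrt(k) He_{k-1}(t)
k, t = 5, np.linspace(-2.0, 2.0, 9)
c = np.zeros(k + 1)
c[k] = 1.0
lhs = He.hermeval(t, He.hermeder(c)) / np.sqrt(float(factorial(k)))
print("max recurrence error:", np.abs(lhs - np.sqrt(k) * he_n(k - 1, t)).max())

# Sobolev sums (A6) for ReLU: sum_k g_k^2 -> ||g||^2, sum_k k g_k^2 -> ||g'||^2
relu = lambda t: np.maximum(t, 0.0)
kmax = 60
g = np.array([np.sum(w * relu(x) * he_n(k, x)) for k in range(kmax + 1)])
print("sum g_k^2   :", np.sum(g**2), " (exact value 1/2)")
print("sum k g_k^2 :", np.sum(np.arange(kmax + 1) * g**2),
      " (exact value 1/2; the truncated series converges slowly because of the kink)")
```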
Having defined H^r(γ) in terms of Fourier-Hermite expansions, we now connect this definition to a more generic notion of weak differentiability. Let C_c^∞(ℝ) be the set of all infinitely-differentiable functions with compact support. For a locally integrable function f, we define its weak derivative f′ as a locally integrable function that satisfies the integration-by-parts formula

    ∫_ℝ φ′(x) f(x) dx = − ∫_ℝ φ(x) f′(x) dx    (A8)

for every φ ∈ C_c^∞(ℝ). The subset of functions in L²(ℝ) with weak derivatives up to order r of finite L² norm forms the Sobolev class H^r(ℝ). We can then define the class H^r_loc(ℝ) as the set of all functions f on ℝ such that φf ∈ H^r(ℝ) for all φ ∈ C_c^∞(ℝ). H^r(γ) coincides with the class of all functions f ∈ H^r_loc(ℝ) such that f and its weak derivatives up to order r have finite L²(γ) norm, and the corresponding weak derivatives coincide as well. In one dimension, the criterion that the (r − 1)th derivative is differentiable almost everywhere and is equal almost everywhere to the Lebesgue integral of its derivative implies the required weak differentiability condition. Furthermore, by Rademacher's theorem, every function that is locally Lipschitz continuous belongs to H^1_loc(ℝ).

Appendix B: The Gardner volume of the treelike committee machine
In this appendix, we give a detailed account of the computation of the Gardner volume of the treelike committee machine using the replica method. As described in the main text, the treelike committee machine [9–12] is a two-layer neural network with a total of N inputs divided into disjoint groups of N/K among K hidden units:

    y(x; {w_j}, v, ϑ) = sign( (1/√K) Σ_{j=1}^{K} v_j g( w_j · x_j / √(N/K) ) − ϑ ),    (B1)

where x_j ∈ ℝ^{N/K} is the vector of inputs to the jth hidden unit, {w_j ∈ ℝ^{N/K}}_{j=1}^{K} are the hidden unit weight vectors, v ∈ ℝ^K is the fixed readout weight vector, g is the activation function, and ϑ ∈ ℝ is a threshold. We want to characterize the ability of this network to classify a dataset of P independent and identically distributed random examples {(x^ν, y^ν)}_{ν=1}^{P}, where x^ν ∈ {−1, +1}^N and y^ν ∈ {−1, +1}, in terms of the Gardner volume [13, 14]

    Z_{N,P,K} = ∫ dρ({w_j}) Π_{ν=1}^{P} Θ( y^ν s(x^ν; {w_j}, v, ϑ) − κ ),    (B2)

where ρ is a measure on the space of hidden unit weights and s denotes the preactivation, i.e., the argument of the sign function in (B1). We will compute the limiting quenched free entropy per weight f in the sequential limit N, P → ∞, K → ∞, with load P/N → α ∈ (0, ∞), using the replica trick as

    f ≡ lim_{K→∞} lim_{N→∞} f_{N,αN,K} = lim_{K→∞} lim_{N→∞} (1/N) E_{x,y} log Z_{N,αN,K} = lim_{n↓0} lim_{K→∞} lim_{N→∞} (1/(nN)) log E_{x,y} Z^n_{N,αN,K},    (B3)

where E_{x,y} denotes expectation over the quenched Bernoulli disorder represented by the dataset.

We take the elements of x^ν to be independent and identically distributed, with equal probability of being positive or negative. We allow the distribution of y^ν to be asymmetric, with P(y^ν = +1) = 1 − P(y^ν = −1) = p for some p ∈ [0, 1]. For the weight-space measure ρ, we take each hidden unit weight vector to be distributed uniformly on the N/K-sphere of radius (N/K)^{1/2}.
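Before carrying the computation further, the following small Monte Carlo sketch (our own illustration; sizes, seed, and variable names are arbitrary) checks two facts used repeatedly below: for weights drawn uniformly on the sphere of radius (N/K)^{1/2} and Bernoulli inputs, the local field w · x / (N/K)^{1/2} is approximately standard Gaussian, and for two weight vectors its covariance over the patterns equals their normalized overlap.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M = 200                       # inputs per hidden unit, i.e. N/K (illustrative)
n_patterns = 20000

# Two "replica" weight vectors drawn uniformly on the sphere of radius sqrt(M)
w_a, w_b = rng.standard_normal((2, M))
w_a *= np.sqrt(M) / np.linalg.norm(w_a)
w_b *= np.sqrt(M) / np.linalg.norm(w_b)
q_ab = w_a @ w_b / M                                 # normalized overlap

# Bernoulli inputs and the corresponding local fields h = w . x / sqrt(M)
x = rng.choice([-1.0, 1.0], size=(n_patterns, M))
h_a = x @ w_a / np.sqrt(M)
h_b = x @ w_b / np.sqrt(M)

print("mean(h_a)     :", h_a.mean(), " (expected ~ 0)")
print("var(h_a)      :", h_a.var(), " (expected ~ 1)")
print("cov(h_a, h_b) :", np.mean(h_a * h_b), " vs overlap q_ab =", q_ab)

# Gaussianity of the local field: empirical CDF against the standard normal CDF
grid = np.linspace(-3.0, 3.0, 13)
emp = np.array([(h_a <= t).mean() for t in grid])
print("max |empirical CDF - Phi| :", np.abs(emp - norm.cdf(grid)).max())
```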
With this choice, the total volume of weight space is

    ∫ dρ = S_{N/K}( (N/K)^{1/2} )^K,    (B4)

where

    S_D(R) ≡ [2 π^{D/2} / Γ(D/2)] R^{D−1}    (B5)

is the surface area of the D-dimensional sphere of radius R. We will assume that ‖v‖² = K, but will not initially impose further conditions on the readout weights or threshold. Finally, we assume that g ∈ L²(γ).

Introducing replicas indexed by a = 1, ..., n, we can write the nth quenched moment of the Gardner volume as

    E_{x,y} Z^n = E_{x,y} ∫ Π_a dρ({w^a_j}) Π_{a,ν} Θ( y^ν [ (1/√K) Σ_j v_j g( w^a_j · x^ν_j / √(N/K) ) − ϑ ] − κ ).    (B6)

We observe immediately that the fact that the different patterns are independent and identically distributed implies that

    E_{x,y} Z^n = ∫ Π_a dρ({w^a_j}) { E_{x,y} Π_a Θ( y [ (1/√K) Σ_j v_j g( w^a_j · x_j / √(N/K) ) − ϑ ] − κ ) }^P,    (B7)

allowing us to simplify our notation by eliminating the pattern index ν. We now consider the local fields

    h^a_j ≡ √(K/N) w^a_j · x_j,    (B8)

which have mean zero and covariance

    cov(h^a_j, h^b_l) = δ_{jl} (K/N) w^a_j · w^b_j.    (B9)

In this setting, the natural order parameters are the Edwards-Anderson (EA) order parameters [13, 15, 18]

    q^{ab}_j ≡ (K/N) w^a_j · w^b_j   (a ≠ b),    (B10)

which measure the overlap between the weight vectors of each hidden unit in two different replicas. As we have chosen the weight vectors to lie on the sphere, the self-overlap of each hidden unit is fixed to unity, and the EA order parameters are bounded between negative one and one. In terms of the EA order parameters, we have

    cov(h^a_j, h^b_l) = δ_{jl} [ δ_{ab} + q^{ab}_j (1 − δ_{ab}) ].    (B11)

Then, as each of the local fields is the sum of N/K independent random variables and their covariance is finite, by the central limit theorem they converge in distribution as N → ∞ for any fixed K to a multivariate Gaussian with the same mean and covariance [17, 21]. We note that this limiting result would alternatively follow by inserting Fourier representations of the delta function to enforce the definition of the variables h^a_j, evaluating the averages over the inputs, and expanding the result to lowest order in 1/N.

We then define the function

    G({q^{ab}_j}) ≡ (1/n) log E_y E_h Π_a Θ( y [ (1/√K) Σ_j v_j g(h^a_j) − ϑ ] − κ ),    (B12)

where the average E_h is taken over the {q^{ab}_j}-dependent Gaussian distribution of the local fields. Introducing Lagrange multipliers q̂^{ab}_j to enforce the definitions of the order parameters q^{ab}_j, we obtain

    E_{x,y} Z^n = ∫ Π_b …