Activation function dependence of the storage capacity of treelike neural networks
Jacob A. Zavatone-Veth ∗ Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA
Cengiz Pehlevan †
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA and Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
(Dated: July 23, 2020)

The expressive power of artificial neural networks crucially depends on the nonlinearity of their activation functions. Though a wide variety of nonlinear activation functions have been proposed for use in artificial neural networks, a detailed understanding of their role in determining the expressive power of a network has not emerged. Here, we study how activation functions affect the storage capacity of treelike two-layer networks in the infinite-width limit. We relate the boundedness or divergence of the capacity in this limit to the smoothness of the activation function, elucidating the relationship between previously studied special cases. Our results show that nonlinearity can both increase capacity and decrease the robustness of classification, and provide simple estimates for the capacity of networks with several commonly used activation functions.
The expressive power of artificial neural networks is well known [1–4], but a complete theoretical account of how their remarkable abilities arise is lacking [5, 6]. In particular, though a diverse array of nonlinear activation functions have been employed in neural networks [5–12], our understanding of the relationship between the choice of activation function and computational capability is incomplete [7–9]. Methods from the statistical mechanics of disordered systems have enabled the interrogation of this link in several special cases [9–15], but these previous works have not yielded a general theory.

In this work, we characterize how pattern storage capacity depends on the activation function in a tractable two-layer network model known as the treelike committee machine. We find that the storage capacity of a treelike committee machine remains finite in the infinite-width limit provided that the activation function is weakly differentiable, and it and its weak derivative are square-integrable with respect to the Gaussian measure. For example, the capacity with sign activation functions diverges, while that with rectified linear unit or error function activations is finite. We predict that nonlinearity should increase capacity, but may reduce the robustness of classification. These connections between expressive power and smoothness begin to shed light on the influence of activation functions on the capabilities of neural networks.
The treelike committee machine—The treelike committee machine is a two-layer neural network with N inputs divided among K hidden units into disjoint groups of N/K, and a binary output (Figure 1a) [9–12]. For a hidden unit activation function g, a set of hidden unit weight vectors {w_j ∈ ℝ^{N/K}}_{j=1}^{K}, a readout weight vector v ∈ ℝ^K, and a threshold ϑ ∈ ℝ, its output is given as

    y(x) = sign(s(x))    (1)

for

    s(x; {w_j}, v, ϑ) = (1/√K) Σ_{j=1}^{K} v_j g( w_j · x_j / √(N/K) ) − ϑ,    (2)

where x_j denotes the vector of inputs to the jth hidden unit. In this model, the readout weight vector and threshold are fixed, and only the hidden unit weights are learned. The perceptron can thus be viewed as the special case of a treelike committee machine with identity activation functions and equal readout weights [13, 14].

[FIG. 1. Pattern storage in treelike committee machines. (a) Network architecture. (b) Capacity α_c as a function of margin κ for several common activation functions (linear, quadratic, ReLU, erf). Solid and dashed lines indicate estimates of the capacity under replica-symmetric and one-step replica-symmetry-breaking ansätze, respectively.]

Statistical mechanics of pattern storage—To characterize this network's ability to classify a random dataset of P examples subject to constraints on the hidden unit weights imposed by a measure ρ, we define the Gardner volume [13, 14]

    Z = ∫ dρ({w_j}) Π_{µ=1}^{P} Θ( y^µ s(x^µ; {w_j}, v, ϑ) − κ ),    (3)

which measures the volume in weight space such that all examples are classified correctly with margin at least κ. As in most studies of the Gardner volume of neural networks, we consider a dataset consisting of Bernoulli disorder, in which the components of the inputs and the target outputs are independently and identically distributed as x^µ_{jk} = ±1 and y^µ = ±1 with equal probability, and we constrain each hidden unit weight vector to the sphere of radius (N/K)^{1/2} [9–11, 13].

We will study the sequential infinite-width limit N, P → ∞, K → ∞, with load α ≡ P/N = O(1). In this limit, we expect the free entropy per weight f = N^{−1} log Z to be self-averaging with respect to the distribution of the quenched disorder [12], and for there to exist a critical load α_c, termed the capacity, below which the classification task is solvable with probability one and above which Z vanishes [12–15]. The special case of this model with sign activation functions was intensively studied in the late 20th century, and its capacity was shown to diverge as K → ∞ [10, 11, 16]. In contrast, Baldassi et al. [9] recently showed that the capacity with rectified linear unit (ReLU) activations remains bounded in the infinite-width limit. Our primary objective in this work is to identify the class of activation functions for which the capacity remains finite.

We begin our analysis by specifying our choice of general constraints on the activation function, readout weights, and threshold. We will require the K → ∞ limit to be well-defined in the sense that the variance of the output preactivation s over the pattern distribution is finite. In the sequential limit, the central limit theorem implies that the hidden unit preactivations converge in distribution to a collection of independent Gaussian random variables [17]. Therefore, for s to have finite variance, the activation function g must lie in the Lebesgue space L²(γ) of functions that are square-integrable with respect to the Gaussian measure γ on the reals (see Appendix A). Furthermore, as var(s) ∝ ‖v‖²/K, we must have ‖v‖² = O(K). Observing that ‖v‖ sets the effective scale of ϑ and κ but does not affect the zero-margin capacity, we fix ‖v‖² = K for convenience. To ensure that s has mean zero, we set ϑ = K^{−1/2} (E g) Σ_{j=1}^{K} v_j, where E g = ∫ dγ(z) g(z) is the average value of the hidden unit activations. This choice maximizes the capacity for the symmetric datasets of interest (see Appendices B, C, and D), and generalizes the conditions on v and ϑ considered in previous works [9–11].

To compute the limiting quenched free entropy, we apply the replica trick, which exploits a limit identity for logarithmic averages and a non-rigorous interchange of limits to write

    f = lim_{n↓0} lim_{K→∞} lim_{N→∞} (1/(nN)) log E_{x,y} Z^n_{N,αN,K},    (4)

where the validity of the analytic continuation of the moments from positive integer n to n ↓ 0 is assumed. The relevant order parameters of this computation are the overlaps q^{ab}_j = (K/N) w^a_j · w^b_j [13, 15, 18], which represent the average overlap between the preactivations of the jth hidden unit in two different replicas a and b.
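Before turning to the replica computation, the following sketch illustrates the model and the storage criterion of Eqs. (1)–(3). It is not part of the original analysis: the sizes, the ReLU activation, the random seed, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values used in the paper)
N, K, P, kappa = 1200, 12, 600, 0.0
M = N // K                                   # inputs per hidden unit, N/K

def g(z):                                    # activation; ReLU chosen for illustration
    return np.maximum(z, 0.0)

# Fixed readout weights with ||v||^2 = K and the zero-mean threshold of the text
v = np.ones(K)
E_g = 1.0 / np.sqrt(2.0 * np.pi)             # E[g(z)] under N(0, 1) for ReLU
theta = E_g * v.sum() / np.sqrt(K)           # theta = K^{-1/2} (E g) sum_j v_j

# Hidden-unit weights drawn uniformly on the sphere of radius sqrt(N/K)
w = rng.standard_normal((K, M))
w *= np.sqrt(M) / np.linalg.norm(w, axis=1, keepdims=True)

# Bernoulli disorder: inputs and labels are +/-1 with equal probability
x = rng.choice([-1.0, 1.0], size=(P, K, M))
y = rng.choice([-1.0, 1.0], size=P)

# Preactivation s(x) of Eq. (2) and the per-pattern condition inside Eq. (3)
h = np.einsum('pkm,km->pk', x, w) / np.sqrt(M)   # local field of each hidden unit
s = (g(h) @ v) / np.sqrt(K) - theta
stored = y * s >= kappa
print(f"fraction of patterns stored at margin {kappa}: {stored.mean():.3f}")
# For random (untrained) weights roughly half the patterns satisfy the condition;
# the Gardner volume (3) measures the weight-space volume where all of them do.
```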
Under a replica- and hidden-unit-symmetric (RS) ansatz q^{ab}_j = q, one finds that

    f_RS = extr_q { α ∫ dγ(z) log H( (κ + √(q̃(q)) z) / √(σ² − q̃(q)) ) + (1/2)[ q/(1 − q) + log(1 − q) + log(2π) ] },    (5)

where H(z) = ∫_z^∞ dγ(x) is the Gaussian tail distribution function, σ² = E[g²] − (E g)² is the variance of the activation, and

    q̃(q) = cov[ g(x), g(y) ],   (x, y)ᵀ ~ N( 0, [[1, q], [q, 1]] ),    (6)

is an effective order parameter describing the average overlap between the activations of a given hidden unit in two different replicas. This expression for f_RS is equivalent to that given in [9] for ReLU activations, but we adopt a different definition for the effective order parameter that has a clearer statistical interpretation for generic activation functions.

To find the replica-symmetric capacity α_RS, one must take the limit q ↑ 1, as the Gardner volume tends to zero in this limit [9–14]. As q ↑ 1, q̃ ↑ σ², but the asymptotic properties of q̃ as a function of ε ≡ 1 − q depend on the choice of activation function. Making the general ansatz that σ² − q̃ ∼ ε^ℓ for some ℓ > 0, we find that α_RS ∼ ε^{ℓ−1} (see Appendix C). Therefore, the RS capacity diverges if ℓ < 1 and vanishes if ℓ > 1, while the boundary case ℓ = 1 is special in that the capacity is bounded but non-vanishing. For the special cases of g(x) = sign(x) and g(x) = ReLU(x), this behavior was noted by Baldassi et al. [9]. For sign, one has σ² − q̃ ∼ √ε, and α_RS diverges in the infinite-width limit, while for ReLU, σ² − q̃ ∼ ε, and α_RS remains finite. However, [9] and other previous studies [10, 11] relied on direct computation of the effective order parameters for all values of q, which is not tractable for most activation functions, and does not yield general insight.

Asymptotics of the effective order parameter—To understand the asymptotic behavior of q̃(q) as q ↑ 1 for a general activation function g, we apply tools from the theory of Gaussian measures [19]. As g is in L²(γ) by assumption, it has a Fourier-Hermite series

    g(x) = Σ_{k=0}^{∞} g_k He_k(x),    (7)

where {He_k} is the set of orthonormal Hermite polynomials (see Appendix A). We note that the L²(γ) norm of g can then be written as the sum of the squares of the coefficients g_k, i.e., ‖g‖²_γ = Σ_{k=0}^{∞} g_k², and that g_0 = E g. To express q̃(q) in terms of these coefficients, we recall the Mehler expansion of the standard bivariate Gaussian density ϕ(x, y; q) [20, 21]:

    ϕ(x, y; q) = ϕ(x) ϕ(y) Σ_{k=0}^{∞} q^k He_k(x) He_k(y),    (8)

where ϕ(x) = exp(−x²/2)/√(2π) is the univariate Gaussian density. Then, we can evaluate the expectation in (6), yielding

    q̃(q) + g_0² = Σ_{k=0}^{∞} g_k² q^k,    (9)

which, by Abel's theorem, is a bounded, continuous function of q ∈ [−1, 1] because q̃(1) + g_0² = ‖g‖²_γ is finite.
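The identity (9) and the contrasting small-ε behavior of sign and ReLU can be checked numerically. The sketch below is our own illustration rather than code from the paper: the Fourier-Hermite coefficients of ReLU are computed by Gauss-Hermite quadrature, and the left-hand side of (9) is evaluated from the classical arc-cosine (ReLU) and arcsine (sign) closed forms for Gaussian covariances, which are standard results not taken from this work; the quadrature order and the test values of q and ε are arbitrary.

```python
import numpy as np
from math import factorial

# Gauss-Hermite quadrature adapted to the standard Gaussian measure gamma
nodes, weights = np.polynomial.hermite_e.hermegauss(200)
weights = weights / np.sqrt(2.0 * np.pi)          # weights now sum to 1

def he(k, x):
    """Orthonormal Hermite polynomial He_k(x) / sqrt(k!)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.polynomial.hermite_e.hermeval(x, c) / np.sqrt(float(factorial(k)))

relu = lambda z: np.maximum(z, 0.0)

# Fourier-Hermite coefficients g_k of ReLU, Eq. (7)
kmax = 40
g = np.array([np.sum(weights * relu(nodes) * he(k, nodes)) for k in range(kmax + 1)])

# E[g(x) g(y)] = qtilde(q) + g_0^2 for ReLU, from the classical arc-cosine closed form
def relu_second_moment(q):
    return (np.sqrt(1.0 - q**2) + q * np.arccos(-q)) / (2.0 * np.pi)

q = 0.5
series = np.sum(g**2 * q ** np.arange(kmax + 1))  # right-hand side of Eq. (9)
print("E[g(x)g(y)]     :", relu_second_moment(q))
print("sum_k g_k^2 q^k :", series)

# sigma^2 - qtilde(1 - eps): ~ sqrt(eps) for sign (arcsine law), ~ eps for ReLU
sigma2_relu = 0.5 - 1.0 / (2.0 * np.pi)
for eps in [1e-2, 1e-3, 1e-4]:
    gap_sign = 1.0 - (2.0 / np.pi) * np.arcsin(1.0 - eps)
    gap_relu = sigma2_relu - (relu_second_moment(1.0 - eps) - 1.0 / (2.0 * np.pi))
    print(f"eps = {eps:.0e}   sign gap = {gap_sign:.4f}   ReLU gap = {gap_relu:.2e}")
```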
Writing q ≡ 1 − ε, we expand (1 − ε)^k in a binomial series and formally interchange the order of summation to obtain

    q̃(ε) + g_0² = Σ_{l=0}^{∞} [(−ε)^l / l!] Σ_{k=l}^{∞} (k)_l g_k²,    (10)

where (k)_l = k(k − 1)···(k − l + 1) is the falling factorial. We recognize the sums over k as the norms of the weak derivatives of g, which have formal Fourier-Hermite series

    g^{(l)}(x) = Σ_{k=l}^{∞} g_k √((k)_l) He_{k−l}(x),    (11)

which follow from the recurrence relation He′_k(x) = √k He_{k−1}(x) [19]. Therefore, q̃ admits a formal power series expansion in ε as

    q̃(ε) + g_0² = Σ_{l=0}^{∞} [(−1)^l / l!] ‖g^{(l)}‖²_γ ε^l.    (12)

For the RS capacity to remain bounded, we merely require that the first two terms in this series are finite, not for the series to converge at any higher order for non-vanishing ε. Therefore, the RS capacity is finite for once weakly-differentiable activations g such that the L² norms of the function and its weak derivative with respect to the Gaussian measure, ‖g‖_γ and ‖g′‖_γ, are finite. This class of functions is precisely the Sobolev class H¹(γ) [19]. We provide additional background material on H¹(γ) and weak differentiability in Appendix A.

Storage capacity—For any activation function in the class H¹(γ), we find that

    α_RS(κ) = (‖g′‖²_γ / σ²) α_G(κ/σ),    (13)

where

    α_G(κ) = [ ∫_{−κ}^{∞} dγ(z) (κ + z)² ]^{−1}    (14)

is Gardner's formula for the perceptron capacity [13] (see Appendix C). In terms of Fourier-Hermite coefficients, we have σ² = Σ_{k=1}^{∞} g_k² and ‖g′‖²_γ = Σ_{k=1}^{∞} k g_k². Thus, we have ‖g′‖²_γ ≥ σ², with equality if and only if all nonlinear terms (those corresponding to Hermite polynomials of degree two or greater) vanish. Therefore, introducing nonlinearity always increases the zero-margin RS capacity. However, as α_G(κ) is a monotonically decreasing function, the capacity at large margins can be reduced by nonlinearity if σ < 1. We note that the zero-margin capacity is invariant under rescaling of the activation function and hidden unit weights as g ↦ c₁ g, v ↦ c₂ v for constants c₁ and c₂. For finite margin, rescaling can increase or decrease the capacity by changing σ². Thus, in the sense of classification margin, introducing nonlinearity or re-scaling can reduce the robustness of classification.
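Equations (13) and (14) are simple enough to evaluate directly. The following sketch is an illustration under our own choices, not the paper's code: the list of activations and their (weak) derivatives, the quadrature order, and the integration routine are assumptions made for demonstration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def alpha_gardner(kappa):
    """Gardner's perceptron capacity, Eq. (14)."""
    integrand = lambda z: (kappa + z)**2 * np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)
    val, _ = quad(integrand, -kappa, np.inf)
    return 1.0 / val

# Gauss-Hermite quadrature for expectations under the Gaussian measure gamma
nodes, weights = np.polynomial.hermite_e.hermegauss(200)
weights = weights / np.sqrt(2.0 * np.pi)
E = lambda f: np.sum(weights * f(nodes))

def alpha_rs(g, dg, kappa=0.0):
    """RS capacity of the treelike committee machine, Eq. (13)."""
    sigma2 = E(lambda z: g(z)**2) - E(g)**2          # activation variance
    dnorm2 = E(lambda z: dg(z)**2)                   # ||g'||^2 under gamma
    return (dnorm2 / sigma2) * alpha_gardner(kappa / np.sqrt(sigma2))

activations = {
    # activation g and its (weak) derivative g'; this list is illustrative
    "linear":    (lambda z: z,                  lambda z: np.ones_like(z)),
    "ReLU":      (lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float)),
    "erf":       (lambda z: erf(z),             lambda z: 2.0/np.sqrt(np.pi)*np.exp(-z**2)),
    "quadratic": (lambda z: z**2,               lambda z: 2.0 * z),
}
for name, (g, dg) in activations.items():
    print(f"{name:9s}  alpha_RS(0) = {alpha_rs(g, dg):.3f}")
# Values quoted in the text: 2 (linear), 2*pi/(pi - 1) ~ 2.93 (ReLU), 4 (quadratic)
```

For the nonlinear choices the zero-margin values exceed Gardner's perceptron value of 2, consistent with ‖g′‖²_γ ≥ σ².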
Using this result, we can characterize the RS capacity of wide treelike committee machines for several commonly-used activation functions (see Appendix E for details). As previously noted, for a linear activation function, our result reduces to Gardner's perceptron capacity [13]. As the sign function is not weakly differentiable, we recover the result that the capacity diverges [10, 11]. ReLU is weakly differentiable, and we recover the result from [9] that α_RS = 2π/(π − 1) ≈ 2.93 at zero margin. The error function is likewise in H¹(γ) and yields a finite capacity in closed form, while for a quadratic activation one finds α_RS = 4 (see Appendix E). We plot the RS capacity as a function of margin for these activation functions in Figure 1b, illustrating how nonlinearity can reduce the large-margin capacity.

To demonstrate the applicability of our theory to more exotic activation functions, we consider several smooth, non-monotonic alternatives to ReLU that have been proposed in the deep learning literature [7, 8]. All of these functions are in H¹(γ), hence they yield finite capacities, which we estimate in Appendix E. In summary, our replica-symmetric calculation divides activation functions into those that are outside H¹(γ), for which the capacity diverges, and those that are in H¹(γ), for which the capacity is finite and given by (13). For functions in H¹(γ), our RS result predicts that nonlinearity should increase capacity, but can decrease classification robustness.

However, for nonlinear activation functions, one generically expects the energy landscape to become locally non-convex, and for replica symmetry breaking (RSB) to occur [9–12, 15, 18]. The RS estimate of the capacity is therefore only an upper bound, and one must account for RSB effects in order to obtain a more accurate estimate [9–12, 15, 16, 18]. To that end, we have calculated the capacity under a one-step replica-symmetry-breaking (1-RSB) ansatz, extending the results of earlier work [9–11] to arbitrary activation functions. Under the 1-RSB ansatz, the replicas are divided into groups of size m, with inter-group overlap q₀ and intra-group overlap q₁. Then, the capacity is extracted by taking the limit q₁ ↑ 1, m ↓ 0, with r ≡ m/(1 − q₁) finite [9–12, 18].
As detailed in Appendix D, this calculation yields an expression for the 1-RSB capacity as the solution to a two-dimensional minimization problem over q₀ and r. Importantly, the finite-capacity condition at 1-RSB is the same as that with RS. For functions in H¹(γ), the resulting minimization problem must usually be solved numerically, hence we give results for only a few tractable examples. For these examples, we illustrate this minimization problem by plotting its landscape in Figure 2. For comparison purposes, we include the landscape for linear activation functions, for which RSB does not occur [13–15, 22]. For ReLU, we obtain a finite zero-margin 1-RSB capacity that differs slightly from the value reported by [9], a discrepancy which likely results from differences in numerical analysis (see Appendix E). For the other tractable activation functions, including erf, we similarly obtain finite 1-RSB capacities at κ = 0; the numerical values and the corresponding optima (q₀*, r*) are reported in Appendix E.
Discussion—We have shown that the storage capacity of treelike committee machines with activation functions in H¹(γ) remains bounded in the infinite-width limit. Our results follow from a replica analysis of the Gardner volume, with the capacity given by a simple closed-form expression under a replica-symmetric ansatz and by a two-dimensional minimization problem with one-step replica-symmetry-breaking. Depending on the activation function, a fully accurate determination of the capacity would likely require higher levels in the Parisi hierarchy of replica-symmetry-breaking ansätze [18]. Furthermore, it can be challenging to rigorously prove that the capacity results obtained using the replica method at any level of the Parisi hierarchy are correct [15, 18, 22–24]. With these caveats in mind, our results begin to elucidate how nonlinear activation functions affect the ability of neural networks to robustly solve classification problems.

[FIG. 2. The landscape of the function whose minimum determines the 1-RSB capacity as a function of the inter-block overlap q₀ and the rescaled Parisi parameter r ≡ m/(1 − q₁) for several example activation functions. In each panel, the value of this function is shown in false color, with the location of the minimum indicated by an orange dot.]

The Gardner volume is agnostic to the choice of learning algorithm used to train the weights of the network, which makes it a general approach to studying storage capacity, but means that it can provide only limited insight into the practical realizability of the extant solutions [9–12]. In a recent study of least-squares function approximation by wide fully-connected committee machines, Panigrahi et al. [7] have shown using random matrix theory methods that the speed and robustness of learning with stochastic gradient descent is related to activation function smoothness. In particular, training error decreases rapidly with a large decay exponent under weaker assumptions on the data for activation functions that have a "kink" (a jump discontinuity in their first or second derivatives) than for smooth functions. This result is suggestively similar to that of this work, as such functions are at most once or twice weakly differentiable. However, further work will be required to determine whether a similar link between smoothness and trainability exists for classification tasks in treelike networks.

In this work, we have studied the activation-function dependence of the storage capacity of wide treelike committee machines. This network architecture is particularly convenient to study in the infinite-width limit, but it is far removed from the deep networks used in practical applications [5]. As a step towards more elaborate and realistic models, one could consider a fully-connected committee machine, in which each hidden unit is connected to the full set of inputs. Prior work on such networks with sign activation functions suggests that some qualitative aspects of the behavior of treelike committee machines should still hold true [10, 11, 25]. However, fully-connected committee machines possess a permutation symmetry with respect to interchange of the hidden units, which is broken at loads below the RS capacity [10].
This phenomenon and the presence of correlations between hidden units complicate the study of their infinite-width limit. Accurate determination of how their storage capacity depends on the activation function will therefore require further work, in which the insights developed in this study should prove broadly useful.

Acknowledgements—J. A. Zavatone-Veth acknowledges support from the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard and the Harvard Quantitative Biology Initiative. C. Pehlevan thanks the Harvard Data Science Initiative, Google, and Intel for support.

[1] G. Cybenko, Mathematics of Control, Signals and Systems, 303 (1989).
[2] K. Hornik, M. Stinchcombe, H. White, et al., Neural Networks, 359 (1989).
[3] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016).
[4] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (2016), pp. 3360–3368.
[5] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[7] A. Panigrahi, A. Shetty, and N. Goyal, arXiv preprint arXiv:1908.05660 (2019).
[8] P. Ramachandran, B. Zoph, and Q. V. Le, arXiv preprint arXiv:1710.05941 (2017).
[9] C. Baldassi, E. M. Malatesta, and R. Zecchina, Physical Review Letters, 170602 (2019).
[10] E. Barkai, D. Hansel, and H. Sompolinsky, Physical Review A, 4146 (1992).
[11] A. Engel, H. Köhler, F. Tschepke, H. Vollmayr, and A. Zippelius, Physical Review A, 7590 (1992).
[12] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).
[13] E. Gardner, Journal of Physics A: Mathematical and General, 257 (1988).
[14] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General, 271 (1988).
[15] M. Talagrand, Spin Glasses: A Challenge for Mathematicians: Cavity and Mean Field Models, Vol. 46 (Springer Science & Business Media, 2003).
[16] R. Monasson and R. Zecchina, Physical Review Letters, 2432 (1995).
[17] D. Pollard, A User's Guide to Measure Theoretic Probability, Vol. 8 (Cambridge University Press, 2002).
[18] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, Vol. 9 (World Scientific Publishing Company, 1987).
[19] V. I. Bogachev, Gaussian Measures (American Mathematical Society, 1998).
[20] W. Kibble, Mathematical Proceedings of the Cambridge Philosophical Society, 12 (1945).
[21] Y. L. Tong, The Multivariate Normal Distribution (Springer Science & Business Media, 2012).
[22] M. Shcherbina and B. Tirozzi, Communications in Mathematical Physics, 383 (2003).
[23] J. Ding and N. Sun, in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (2019), pp. 816–827.
[24] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová, Journal of Statistical Mechanics: Theory and Experiment, 124023 (2019).
[25] R. Urbanczik, Journal of Physics A: Mathematical and General, L387 (1997).
[26] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Vol. 55 (US Government Printing Office, 1948).
[27] R. H. Byrd, J. C. Gilbert, and J. Nocedal, Mathematical Programming, 149 (2000).
[28] P. Poirazi, T. Brannon, and B. W. Mel, Neuron, 989 (2003).

Appendix A: Gaussian measures, Hermite polynomials, and weak differentiability
In this appendix, we review relevant background material from the theory of Gaussian measures. Our discussion is a specialization of the more general discussion in Chapter 1 of Bogachev [19] to the one-dimensional case. We merely seek to summarize the relevant definitions and results, and will not attempt to provide rigorous proofs.

We let γ be the standard Gaussian probability measure on ℝ, which has density exp(−x²/2)/√(2π) with respect to Lebesgue measure. We let L²(γ) be the Lebesgue space of functions on ℝ that are square-integrable with respect to γ, and, for brevity, denote the norm on this space as ‖·‖_γ. The natural orthonormal basis for L²(γ) is given by the set of Hermite polynomials {He_k}_{k=0}^{∞}, which can be defined by the formula

    He_k(x) = [(−1)^k / √(k!)] exp(x²/2) (d^k/dx^k) exp(−x²/2).    (A1)

The Hermite polynomials satisfy the recurrence relation

    He′_k(x) = √k He_{k−1}(x) = x He_k(x) − √(k + 1) He_{k+1}(x)    (A2)

for k ≥ 1, with He_0 ≡ 1. For a given function g ∈ L²(γ), we define its Fourier-Hermite coefficients

    g_k = ∫ g He_k dγ,    (A3)

and the Fourier-Hermite series

    g(x) = Σ_{k=0}^{∞} g_k He_k(x),    (A4)

which is guaranteed to converge in mean-square by the fact that ‖g‖²_γ = Σ_{k=0}^{∞} g_k² is finite.

Then, using the recurrence relation He′_k(x) = √k He_{k−1}(x) and the fact that He′_0 ≡ 0, we can express the lth weak derivative of g as a formal Fourier-Hermite series

    g^{(l)}(x) = Σ_{k=l}^{∞} g_k √((k)_l) He_{k−l}(x),    (A5)

where (k)_r = k(k − 1)···(k − r + 1) is the falling factorial. If, for some r ≥ 0, the sum

    ‖g^{(r)}‖²_γ = Σ_{k=r}^{∞} (k)_r g_k²    (A6)

is finite, then the Fourier-Hermite series for g and its weak derivatives up to order r converge in mean-square. The class of functions satisfying this condition is the Sobolev class H^r(γ), which has Sobolev norm

    ‖g‖_{H^r(γ)} = ( Σ_{l=0}^{r} ‖g^{(l)}‖²_γ )^{1/2}.    (A7)
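As a numerical companion to Eqs. (A1)–(A7), the sketch below (our own illustration; the ReLU example, the quadrature order, and the truncation are arbitrary choices) constructs the orthonormal Hermite polynomials, checks orthonormality and the recurrence (A2), and evaluates the sums in (A6) for ReLU.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

# Quadrature for integrals against the standard Gaussian measure gamma
x, w = He.hermegauss(200)
w = w / np.sqrt(2.0 * np.pi)

def he_n(k, t):
    """Orthonormal Hermite polynomial He_k(t) / sqrt(k!), cf. Eq. (A1)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return He.hermeval(t, c) / np.sqrt(float(factorial(k)))

# Orthonormality: the Gram matrix of the first few He_k should be the identity
gram = np.array([[np.sum(w * he_n(j, x) * he_n(k, x)) for k in range(8)]
                 for j in range(8)])
print("max deviation from orthonormality:", np.abs(gram - np.eye(8)).max())

# Recurrence (A2): He_k'(t) = sqrt(k) He_{k-1}(t)
k, t = 5, np.linspace(-2.0, 2.0, 9)
c = np.zeros(k + 1)
c[k] = 1.0
lhs = He.hermeval(t, He.hermeder(c)) / np.sqrt(float(factorial(k)))
print("max recurrence error:", np.abs(lhs - np.sqrt(k) * he_n(k - 1, t)).max())

# Sobolev sums (A6) for ReLU: sum_k g_k^2 -> ||g||^2, sum_k k g_k^2 -> ||g'||^2
relu = lambda t: np.maximum(t, 0.0)
kmax = 60
g = np.array([np.sum(w * relu(x) * he_n(k, x)) for k in range(kmax + 1)])
print("sum g_k^2   :", np.sum(g**2), " (exact value 1/2)")
print("sum k g_k^2 :", np.sum(np.arange(kmax + 1) * g**2),
      " (exact value 1/2; the truncated series converges slowly because of the kink)")
```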
Having defined H^r(γ) in terms of Fourier-Hermite expansions, we now connect this definition to a more generic notion of weak differentiability. Let C_c^∞(ℝ) be the set of all infinitely-differentiable functions with compact support. For a locally integrable function f, we define its weak derivative f′ as a locally integrable function that satisfies the integration-by-parts formula

    ∫_ℝ φ′(x) f(x) dx = − ∫_ℝ φ(x) f′(x) dx    (A8)

for every φ ∈ C_c^∞(ℝ). The subset of functions in L²(ℝ) with weak derivatives up to order r of finite L² norm forms the Sobolev class H^r(ℝ). We can then define the class H^r_loc(ℝ) as the set of all functions f on ℝ such that φf ∈ H^r(ℝ) for all φ ∈ C_c^∞(ℝ). H^r(γ) coincides with the class of all functions f ∈ H^r_loc(ℝ) such that f and its weak derivatives up to order r have finite L²(γ) norm, and the corresponding weak derivatives coincide as well. In one dimension, the criterion that the (r − 1)th derivative is differentiable almost everywhere and is equal almost everywhere to the Lebesgue integral of its derivative implies the required weak differentiability condition. Furthermore, by Rademacher's theorem, every function that is locally Lipschitz continuous belongs to H^1_loc(ℝ).

Appendix B: The Gardner volume of the treelike committee machine
In this appendix, we give a detailed account of the computation of the Gardner volume of the treelike committee machine using the replica method. As described in the main text, the treelike committee machine [9–12] is a two-layer neural network with a total of N inputs divided into disjoint groups of N/K among K hidden units:

    y(x; {w_j}, v, ϑ) = sign( (1/√K) Σ_{j=1}^{K} v_j g( w_j · x_j / √(N/K) ) − ϑ ),    (B1)

where x_j ∈ ℝ^{N/K} is the vector of inputs to the jth hidden unit, {w_j ∈ ℝ^{N/K}}_{j=1}^{K} are the hidden unit weight vectors, v ∈ ℝ^K is the fixed readout weight vector, g is the activation function, and ϑ ∈ ℝ is a threshold. We want to characterize the ability of this network to classify a dataset of P independent and identically distributed random examples {(x^ν, y^ν)}_{ν=1}^{P}, where x^ν ∈ {−1, +1}^N and y^ν ∈ {−1, +1}, in terms of the Gardner volume [13, 14]

    Z_{N,P,K} = ∫ dρ({w_j}) Π_{ν=1}^{P} Θ( y^ν s(x^ν; {w_j}, v, ϑ) − κ ),    (B2)

where ρ is a measure on the space of hidden unit weights and s denotes the preactivation, i.e., the argument of the sign function in (B1). We will compute the limiting quenched free entropy per weight f in the sequential limit N, P → ∞, K → ∞, with load P/N → α ∈ (0, ∞), using the replica trick as

    f ≡ lim_{K→∞} lim_{N→∞} f_{N,αN,K} = lim_{K→∞} lim_{N→∞} (1/N) E_{x,y} log Z_{N,αN,K} = lim_{n↓0} lim_{K→∞} lim_{N→∞} (1/(nN)) log E_{x,y} Z^n_{N,αN,K},    (B3)

where E_{x,y} denotes expectation over the quenched Bernoulli disorder represented by the dataset.

We take the elements of x^ν to be independent and identically distributed, with equal probability of being positive or negative. We allow the distribution of y^ν to be asymmetric, with P(y^ν = +1) = 1 − P(y^ν = −1) = p for some p ∈ [0, 1]. For the weight-space measure ρ, we take each hidden unit weight vector to be distributed uniformly on the N/K-sphere of radius (N/K)^{1/2}.
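Before carrying the computation further, the following small Monte Carlo sketch (our own illustration; sizes, seed, and variable names are arbitrary) checks two facts used repeatedly below: for weights drawn uniformly on the sphere of radius (N/K)^{1/2} and Bernoulli inputs, the local field w · x / (N/K)^{1/2} is approximately standard Gaussian, and for two weight vectors its covariance over the patterns equals their normalized overlap.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M = 200                       # inputs per hidden unit, i.e. N/K (illustrative)
n_patterns = 20000

# Two "replica" weight vectors drawn uniformly on the sphere of radius sqrt(M)
w_a, w_b = rng.standard_normal((2, M))
w_a *= np.sqrt(M) / np.linalg.norm(w_a)
w_b *= np.sqrt(M) / np.linalg.norm(w_b)
q_ab = w_a @ w_b / M                                 # normalized overlap

# Bernoulli inputs and the corresponding local fields h = w . x / sqrt(M)
x = rng.choice([-1.0, 1.0], size=(n_patterns, M))
h_a = x @ w_a / np.sqrt(M)
h_b = x @ w_b / np.sqrt(M)

print("mean(h_a)     :", h_a.mean(), " (expected ~ 0)")
print("var(h_a)      :", h_a.var(), " (expected ~ 1)")
print("cov(h_a, h_b) :", np.mean(h_a * h_b), " vs overlap q_ab =", q_ab)

# Gaussianity of the local field: empirical CDF against the standard normal CDF
grid = np.linspace(-3.0, 3.0, 13)
emp = np.array([(h_a <= t).mean() for t in grid])
print("max |empirical CDF - Phi| :", np.abs(emp - norm.cdf(grid)).max())
```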
With this choice, the total volume of weight space is

    ∫ dρ = S_{N/K}( (N/K)^{1/2} )^K,    (B4)

where

    S_D(R) ≡ [2 π^{D/2} / Γ(D/2)] R^{D−1}    (B5)

is the surface area of the D-dimensional sphere of radius R. We will assume that ‖v‖² = K, but will not initially impose further conditions on the readout weights or threshold. Finally, we assume that g ∈ L²(γ).

Introducing replicas indexed by a = 1, ..., n, we can write the nth quenched moment of the Gardner volume as

    E_{x,y} Z^n = E_{x,y} ∫ Π_a dρ({w^a_j}) Π_{a,ν} Θ( y^ν [ (1/√K) Σ_j v_j g( w^a_j · x^ν_j / √(N/K) ) − ϑ ] − κ ).    (B6)

We observe immediately that the fact that the different patterns are independent and identically distributed implies that

    E_{x,y} Z^n = ∫ Π_a dρ({w^a_j}) { E_{x,y} Π_a Θ( y [ (1/√K) Σ_j v_j g( w^a_j · x_j / √(N/K) ) − ϑ ] − κ ) }^P,    (B7)

allowing us to simplify our notation by eliminating the pattern index ν. We now consider the local fields

    h^a_j ≡ √(K/N) w^a_j · x_j,    (B8)

which have mean zero and covariance

    cov(h^a_j, h^b_l) = δ_{jl} (K/N) w^a_j · w^b_j.    (B9)

In this setting, the natural order parameters are the Edwards-Anderson (EA) order parameters [13, 15, 18]

    q^{ab}_j ≡ (K/N) w^a_j · w^b_j   (a ≠ b),    (B10)

which measure the overlap between the weight vectors of each hidden unit in two different replicas. As we have chosen the weight vectors to lie on the sphere, the self-overlap of each hidden unit is fixed to unity, and the EA order parameters are bounded between negative one and one. In terms of the EA order parameters, we have

    cov(h^a_j, h^b_l) = δ_{jl} [ δ_{ab} + q^{ab}_j (1 − δ_{ab}) ].    (B11)

Then, as each of the local fields is the sum of N/K independent random variables and their covariance is finite, by the central limit theorem they converge in distribution as N → ∞ for any fixed K to a multivariate Gaussian with the same mean and covariance [17, 21]. We note that this limiting result would alternatively follow by inserting Fourier representations of the delta function to enforce the definition of the variables h^a_j, evaluating the averages over the inputs, and expanding the result to lowest order in 1/N.

We then define the function

    G({q^{ab}_j}) ≡ (1/n) log E_y E_h Π_a Θ( y [ (1/√K) Σ_j v_j g(h^a_j) − ϑ ] − κ ),    (B12)

where the average E_h is taken over the {q^{ab}_j}-dependent Gaussian distribution of the local fields. Introducing Lagrange multipliers q̂^{ab}_j to enforce the definitions of the order parameters q^{ab}_j, we obtain

    E_{x,y} Z^n = ∫ Π_b …