On transversality of bent hyperplane arrangements and the topological expressiveness of ReLU neural networks
OON TRANSVERSALITY OF BENT HYPERPLANEARRANGEMENTS AND THE TOPOLOGICALEXPRESSIVENESS OF RELU NEURAL NETWORKS
J. ELISENDA GRIGSBY AND KATHRYN LINDSEY
Abstract.
Let F : R n → R be a feedforward ReLU neural network.It is well-known that for any choice of parameters, F is continuous andpiecewise (affine) linear. We lay some foundations for a systematic inves-tigation of how the architecture of F impacts the geometry and topologyof its possible decision regions, F − ( −∞ , t ) and F − ( t, ∞ ), for binaryclassification tasks. Following the classical progression for smooth func-tions in differential topology, we first define the notion of a generic,transversal ReLU neural network and show that almost all ReLU net-works are generic and transversal. We then define a partially-orientedlinear 1–complex in the domain of F and identify properties of this com-plex that yield an obstruction to the existence of bounded connectedcomponents of a decision region. We use this obstruction to prove thata decision region of a generic, transversal ReLU network F : R n → R with a single hidden layer of dimension n + 1 can have no more than onebounded connected component. Introduction
Neural networks have rapidly become one of the most widely-used toolsin the machine learning toolkit. Unfortunately, despite–or, perhaps, be-cause of–their spectacular success in applications, significant foundationalquestions remain. Of these, we believe many would benefit greatly from thedirect attention of theoretical mathematicians, particularly those in the geo-metric topology, nonlinear algebra, and dynamics communities. An impor-tant goal of this paper and its sequels is to advertise some of these problemsto those communities.Recall that one can view a (trained) feedforward neural network as aparticular type of function, F : R n → R m , between Euclidean spaces. Theinputs to the function are data feature vectors and the outputs are typicallyused to answer m –class classification problems by partitioning the inputspace into decision regions according to which component of the functionoutput is maximized at that point.The main purpose of the present work is to present a framework for study-ing the question: How does the architecture of a feedforward neural networkconstrain the topology of its decision regions?
Here, the architecture of afeedforward neural network refers simply to the dimensions of the hiddenlayers. The neural networks we consider here will be fully-connected ReLUnetworks without skip connections. The topological expressiveness of an
JEG was partially supported by Simons Collaboration grant 635578.KL was partially supported by NSF grant number DMS-1901247. a r X i v : . [ m a t h . C O ] A ug J. ELISENDA GRIGSBY AND KATHRYN LINDSEY architecture is the collection of possible homeomorphism types of decisionregions that can appear as the parameters vary (cf. [2, 9]).First: why should the machine learning community care about topologicalexpressiveness ?Recall that a cornerstone theoretical result in the study of neural net-works (for a variety of activation functions, including the widely-used ReLUfunction that is our focus here) is the Universal Approximation Theorem([5, 13, 1]), which says that a sufficiently high-dimensional neural networkcan approximate any continuous function on a compact set to arbitraryaccuracy. This is the version of representational power or expressiveness frequently cited by practitioners as a guarantee that feedforward neural net-works can yield a solution to any data question one might throw at them.Yet continuous functions can be quite poorly behaved, and certain classesof poorly behaved continuous functions are undesirable targets for learning.For example, functions with high Lipschitz constants and ones whose partialderivatives are highly variable with respect to the input direction lead tothe easy production of adversarial examples and hence to potentially poorgeneralization to unseen data (cf. [4, 19, 15]). Moreover, for classificationproblems, it is the partitioning of the feature space into decision regions andnot the specific form of the function we learn that is relevant.It is important to remark at this point that homeomorphism is a verycoarse equivalence relation. Two different decision regions can be homeo-morphic and still have quite different geometric properties (shape, volume,etc.). However, homeomorphism is a good equivalence relation to consideron a first pass, because the coarsest, most fundamental global features of thedata are preserved by homeomorphism (number of connected components,homology groups, etc.), and these coarse features are very likely to be stableunder different data representation choices. The flip side of this observationis that if a particular architecture lacks the topological expressiveness tocapture obvious topological features inherent in a well-sampled labeled dataset, it will not generalize well to unseen data.In the present work, we will focus on the simplest case of feedforwardneural networks, F : R n → R , with 1–dimensional output. Such a networkis typically used to answer binary (aka Yes/No) classification problems bychoosing a threshold, t ∈ R , and declaring the sublevel set of t to be the“N” decision region, the superlevel set of t to be the “Y” decision region,and the level set of t to be the decision boundary. That is: N F ( t ) := F − (( −∞ , t )) B F ( t ) := F − ( { t } ) Y F ( t ) := F − (( t, ∞ )) . Classical results in differential topology now tell us that if F were smooth,we could perturb F slightly to be Morse, and the indices and values of itscritical points would then provide strong information about the topologyof its decision regions. Unfortunately, a ReLU neural network map F isclearly not smooth; it is continuous and piecewise affine linear. 
Yet F will OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 3 typically (i.e., for almost all choices of parameters) be well-behaved enoughthat the information we need to understand the topology of the decisionregions should be extractable directly from the weights and biases of theneural network. We don’t make it to the finish line in the present work, butwe make a good start.We begin by reviewing some standard results in the theory of affine hyper-plane arrangements and convex polyhedra, relying heavily on Grunert’s work[7] on polyhedral complexes (Definition 3.11) and Hanin-Rolnick’s work [11](see also [16, 17, 10]) generalizing the classical notion of hyperplane arrange-ments to so-called bent hyperplane arrangements (Definition 6.1). Theseideas dovetail nicely and provide the right formalism for understanding theclass of ReLU neural network maps. Note that the appearance of polyhedralcomplexes in the study of ReLU networks is well-known, and made explicit,e.g., in the relationship between ReLU neural networks with rational pa-rameters and tropical rational functions [22]. Our contribution here is toformalize the relationship enough to open a path for applying classical ideasin differential topology to extract information about the topology of deci-sion regions. We summarize our main results below. More precise versionsof these theorems, along with their proofs, appear in later sections. Theorem 1.
Let F : R n → R be a ReLU neural network map. F is contin-uous and affine linear on the cells of a canonical realization of the domain, R n , as a polyhedral complex, C ( F ) . Moreover, when F is generic (Definition 2.9) and transversal (Definition8.2), we explicitly identify cells of this polyhedral complex with naturalobjects defined by Hanin-Rolnick: Theorem 2.
Let F : R n → R be a generic, transversal ReLU neural networkmap. The n –cells of the canonical polyhedral complex, C ( F ) , are the closuresof the activation regions (Definition 6.4) of F , and the ( n − –skeleton of C ( F ) is the bent hyperplane arrangement (Definition 6.1) of F . We slightly extend classical transversality results (Theorems 7 and 8) toobtain:
Theorem 3.
Almost all ReLU neural networks are generic and transversal.
We show that every parametrized family of neural networks (Definition2.4) F : R n × R D ( n ,...,n m , → R is piecewise smooth in the following sense: Theorem 4.
Every parametrized family of ReLU neural networks F issmooth on the complement of a codimension 1 algebraic set. Many of the key observations in Theorems 1–4 were proved in [11]. Thetheorems above place those results in a broader context. Once we’ve estab-lished these foundational results, we turn our attention to addressing somefirst questions about architecture’s impact on topological expressiveness. We We’d like to use the word generically here, but the term generic is unavoidably usedin a different context later in the paper (Definition 2.9).
J. ELISENDA GRIGSBY AND KATHRYN LINDSEY begin by using the framework developed above to recast and reprove the re-sult of Johnson [14] that inspired this study. A variant of this result wasproved independently by Hanin-Sellke [12]: Theorem 5.
For any integer n ≥ , let F : R n → R be a ReLU neuralnetwork, all of whose hidden layers have dimension ≤ n . Then for anydecision threshold t ∈ R , each of Y F ( t ) B F ( t ) , and N F ( t ) is either empty orunbounded.Remark . Theorem 5 is not quite true for n = 1. For example, it is easyto see that the neural network map F : R → R defined by F ( x ) = σ ( x )has B F ( t ) = { t } for all t >
0. However, Hanin-Sellke [12] prove, subject tothe assumptions in the statement of Theorem 5 with no restrictions on n ,that Y F ( t ) and N F ( t ) are either empty or unbounded.We also have the following new application: Theorem 6.
Let F : R n → R n +1 → R be a ReLU neural network with inputdimension n and a single hidden layer of dimension n + 1 . Each decisionregion of F associated to a transversal threshold can have no more than bounded connected component. A crucial player in the proof of Theorem 6 is the 1–skeleton, C ( F ) , ofthe polyhedral complex, C ( F ), which is naturally endowed with a partialorientation pointing in the direction in which F increases (Definition 9.15).This partially-oriented graph will figure prominently in the sequel, as it isprecisely the information needed to determine geometric and topologicalinformation about the decision regions. Note that the partial orientationdata can be extracted directly from the weight matrices of the neural networkusing the chain rule (Lemma 9.17).This paper is heavy on definitions and notation, since we pulled from avariety of sources to lay necessary foundations for a consistent and generaltheory. Some sections may therefore be safely skimmed on a first readingand referenced only as needed to understand the proofs of the main re-sults. Sections 2 and 9 largely fall into this category. Similarly, Section7 establishes important results about parameterized neural network maps,but nothing in this section is referenced elsewhere in the paper.Sections 3 and 4 establish notation and key terminology. We do the bulkof the technical work in Sections 5, 6, 8, and the new applications can befound in Section 10. Contents
1. Introduction 1Acknowledgements 52. Layer maps and hyperplane arrangements 53. Polyhedral complexes 84. Transversality 114.1. Classical transversality results 11 Johnson’s proof holds for a large class of activation functions, including the ReLUactivation function studied here.
OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 5 ∇ F –oriented1–skeleton of the canonical polyhedral complex 2710. Obstructions to Topological Expressiveness and Applications 2910.1. Obstructing multiple bounded connected components 34References 36 Acknowledgements.
The authors would like to thank Jesse Johnson forproposing many of the questions at the heart of this investigation; BorisHanin for illuminating conversations during the
Foundations of Deep Learn-ing program at the Simons Institute for the Theory of Computing in thesummer of 2019; Jenna Rajchgot for very helpful discussions about algebraicgeometry; Yaim Cooper and Jordan Ellenberg for inspiration and encour-agement. 2.
Layer maps and hyperplane arrangements
In what follows, let • ReLU : R → R denote the function ReLU( x ) := max { , x } , and • σ : R n → R n denote the function that applies ReLU to each coor-dinate. Definition 2.1.
Let n ∈ N . A neural network defined on R n with ReLUactivation function on all hidden layers and one-dimensional output is afinite sequence of natural numbers n , . . . , n m together with affine maps A i : R n i → R n i +1 for i = 0 , . . . , m . This determines a function F : R n → R ,which we call the associated neural network map , given by the composition R n F = σ ◦ A −−−−−−→ R n F = σ ◦ A −−−−−−→ R n F = σ ◦ A −−−−−−→ . . . F m = σ ◦ A m −−−−−−−→ R n m G = A m +1 −−−−−−→ R . Such a neural network is said to be of architecture ( n , . . . , n m , , depth m + 1 , and width max { n , . . . , n m , } . The k th layer map of such a neuralnetwork is the composition σ ◦ A k for k = 1 , . . . , m and is the map G = A k for k = m + 1 .Remark . Note that in Definition 2.1 the activation function on the finallayer map is the Identity function, not σ . Accordingly, we use the notation G on the output layer map to distinguish it from the hidden layer maps, F k . J. ELISENDA GRIGSBY AND KATHRYN LINDSEY
An affine map A : R n → R m is specified by a weight matrix W ∈ M m × n ( R ) and a bias vector (cid:126)b ∈ R m , as follows. Let ( W | (cid:126)b ) denote the m × ( n + 1) matrix whose final column is (cid:126)b . For each (cid:126)x ∈ R n , let (cid:126)x (cid:48) := ( (cid:126)x, (cid:126)x under the embedding of R n into R n +1 fixing the finalcoordinate to 1. Then(1) A ( (cid:126)x ) = ( W | (cid:126)b ) (cid:126)x (cid:48) , where the product on the right is the matrix product, viewing (cid:126)x (cid:48) as an( n + 1) × Definition 2.3.
For any network architecture ( n , . . . , n m , , the total di-mension D ( n , . . . , n m , of the parameter space of neural networks of archi-tecture ( n , . . . , n m , is the total number of parameters (weights and biases)that define the matrices A , . . . , A m +1 . Definition 2.4.
Let ( n , . . . , n m , be a network architecture. The param-eterized family of ReLU neural networks with architecture ( n , . . . , n m , isthe map F : R n × R D ( n ,...,n m , → R defined as follows. For each s ∈ R D ( n ,...,n m , , F s : R n × { s } → R is the ReLU neural network map associated to the weights and biases givenby s . Observe that for each row of W , the corresponding row ( W i | b i ) ∈ R n +1 ,of the augmented matrix ( W | b ) ∈ M m × ( n +1) ( R ) determines a set S i ⊂ R n that is the solution set to a homogeneous equation:(2) S i := { (cid:126)x ∈ R n | ( W i | b i ) · ( (cid:126)x |
1) = 0 } . If we know the weight vector is non-zero (i.e. W i (cid:54) = ) and hence S i is ahyperplane, we will denote it by H i . In this case, R n \ H i has two connectedcomponents, H + i := { (cid:126)x ∈ R n | ( W i | b i ) · ( (cid:126)x | > } (3) H − i := { (cid:126)x ∈ R n | ( W i | b i ) · ( (cid:126)x | < } , (4)which endows H i with a co-orientation, pointing toward H + i .We define: Definition 2.5. An ordered affine solution set arrangement in R n is a fi-nite ordered set, S = { S , . . . , S m } , where each S i is the solution set to ahomogeneous affine linear equation as described above in equation (2) .Remark . If W i (cid:54) = , S i will be a hyperplane. However, in the degeneratecase W i = , S i is not a hyperplane. S i is empty if b i (cid:54) = 0 and S i is all of R n if b i = 0. Definition 2.7.
An ordered affine solution set arrangement, S = { S , . . . , S m } ,is said to be in general position (aka generic ) if for all subsets { S i , . . . , S i p } ⊆S , it is the case that S i ∩ . . . ∩ S i p is an affine linear subspace of R n of di-mension n − p , where a negative-dimensional intersection is understood tobe empty. OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 7
Remark . Note that the solution set of an affine linear equation is an affinelinear subspace of R n of dimension n − Definition 2.9. (i) A layer map of a neural network is said to be degenerate if a row, W i , of its associated weight matrix, W , is for some i . (Note thatthe corresponding bias, b i , may be zero or nonzero.) Otherwise thelayer map is said to be nondegenerate . A neural network with atleast one degenerate layer map is said to be degenerate and is saidto be nondegenerate otherwise.(ii) A layer map of a neural network is said to be generic if the cor-responding affine solution set arrangement is generic (Definition2.7). A neural network whose layer maps are all generic is said tobe generic .Remark . In Definition 2.1, we define the width of a neural network ofarchitecture ( n , . . . , n m ,
1) to be M = max { n , . . . , n m , } . If one of thehidden layers, R n i , has dimension n i ≤ M , we shall find it convenient toreplace it with a layer of dimension M , imbedding R n i into R M in the stan-dard way and replacing the affine map A i : R n i − → R n i with the degeneratemap whose final M − n i components are all 0. With this understood, allneural networks described in the present work will have hidden layers ofdimension equal to the width. Lemma 2.11.
Let A : R n → R n be an affine linear map given by A ( (cid:126)x ) :=( W | b ) (cid:126)x (cid:48) as in Equation 1, and let S = { S , . . . , S n } be the associated affinesolution set arrangement in R n described in Equation 2. Then A is aninvertible function if and only if S is generic.Proof. Since translation by (cid:126)b is an invertible operation, the Invertible MatrixTheorem tells us that A is an invertible function iff the n × n weight matrix W is invertible iff the row space of W has dimension n . But this is true iff,for each k ≤ n , the rank of any k × n matrix W (cid:48) obtained by choosing k rowsof W has dimension k . By the rank-nullity theorem, the above condition isequivalent to the kernel of W (cid:48) having dimension n − k , and this is equivalentto every k –fold intersection of affine solution sets in S being an affine linearsubspace of dimension n − k (i.e., S is generic), as desired. (cid:3) Definition 2.12.
The co-oriented ordered hyperplane arrangement associ-ated to the k th layer of a neural network map is the (possibly empty) setformed by removing the degenerate solution sets from the affine solution setarrangement associated to that layer map and assigning a co-orientation tothe remaining hyperplanes as described in the text before Definition 2.5. The geometry and combinatorics of hyperplane arrangements is a beauti-ful and rich subject in its own right. We content ourselves here with recallingthe notions that will be important for this paper, referring the interestedreader to [18] for more details.
Definition 2.13. An ordered, co-oriented hyperplane arrangement , A = { H , . . . , H j } , J. ELISENDA GRIGSBY AND KATHRYN LINDSEY in R n is an ordered set of co-oriented affine hyperplanes in R n . By for-getting the ordering of the set and the co-orientations of the affine hyper-planes we obtain a classical hyperplane arrangement; that is, a finite set, A = { H , . . . , H j } , of affine hyperplanes in R n .Remark . We shall use A (resp., H i ) if the hyperplanes in our arrange-ment are ordered and co-oriented and A (resp., H i ) if not.The rank of a hyperplane arrangement A in R n is the dimension of thespace spanned by the normals to the hyperplanes in A . Definition 2.15.
Let ( n , . . . , n m ) be a neural network architecture.(i) The affine solution set arrangement map is the map that asso-ciates to each parameterization s ∈ R D ( n ,...,n m ) the ( m + 1) –tuple ( S , . . . , S m ) s of affine solution set arrangements associated to s ,where S i is the affine solution set arrangement associated to theneural network layer map F i +1 : R n i → R n i +1 endowed with theweights and biases specified by s .(ii) The co-oriented hyperplane arrangement map is the map that asso-ciates to each parameterization s ∈ R D ( n ,...,n m ) the ( m + 1) –tuple ( A , . . . , A m ) s of co-oriented hyperplane arrangements, where A i is obtained from the affine solution set arrangement S i defined inpart (i) by removing the degenerate solution sets and endowing theremaining hyperplanes with co-orientations as in equation (4) .Remark . Note that in Definition 2.15, | A i | ≤ n i +1 , with equality if andonly if the corresponding layer map is nondegenerate. Degenerate arrange-ments are rare, cf. Remark 2.8 and Lemma 2.17.Recall that a property P is said to hold for almost every point x in R n (with respect to Lebesgue measure) if the set of points x ∈ R for which P ( x ) is false is a (Lebesgue measurable) set of measure 0 (equivalently, a null set .).The following well-known lemma (cf. [18]) follows in a straightforwardway from standard facts in linear algebra. Alternatively, the reader mayobtain the result using Theorem 7. Lemma 2.17.
Let ( n , . . . , n m ) be a network architecture. For almost every s ∈ R D ( n ,...,n m ) , each S i in ( S , . . . , S m ) is a generic (co-oriented) hyper-plane arrangement. Polyhedral complexes
We will need some basic facts about the geometry and combinatorics ofconvex polytopes, polyhedral sets, and polyhedral complexes. We quicklyrecall relevant background and terminology, referring the interested readerto [6, 7] for a more thorough treatment.
Definition 3.1. [6, Sec. 2.6, 3.1] (i) A polyhedral set P in R n is an intersection of finitely many closedaffine half spaces H +1 , . . . , H + m ⊆ R n . OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 9 (ii) A convex polytope in R n is a bounded polyhedral set.Remark . A polyhedral set is is an intersection of convex sets, henceconvex.
Definition 3.3. [6, Sec. 2.2] (i) We call a hyperplane H in R n a cutting hyperplane of P and say H cuts P if there exists x , x ∈ P with x ∈ P ∩ H + and x ∈ P ∩ H − .(ii) We call a hyperplane H in R n a supporting hyperplane of P andsay H supports P if H does not cut P and H ∩ P (cid:54) = ∅ . Definition 3.4. [6, Sec. 2.6] (i) A subset F ⊂ P is said to be a face of P if either F = ∅ , F = P ,or F = H ∩ P for some supporting hyperplane of P .(ii) ∅ and P are called the improper faces of P . All other faces are proper .(iii) If F is a maximal proper face of P (that is, it is contained in noproper faces of P but itself ), we say F is a maximal proper face of P . We shall refer to a maximal proper face of P as a facet of P .Remark . Each region of a hyperplane arrangement is the interior of apolyhedral set.Lemma 3.6 can be found in [6, Sec. 2.6]:
Lemma 3.6.
Every polyhedral set P of dimension n has an irredundant realization as an intersection P = H +1 ∩ . . . ∩ H + m satisfying the property that P (cid:54) = P i := (cid:92) ≤ j ≤ nj (cid:54) = i H + j for each i = 1 , . . . , m . Moreover, for an irredundant realization as above,the set of facets of P is precisely the set of proper faces of the form P ∩ H i . Definition 3.7.
Let S be a subset of R n . The affine hull of S , denotedaff ( S ) , is the minimal affine linear subspace of R n containing S . Equivalentlyaff ( S ) is the intersection of all affine linear subspaces of R n containing S .Remark . Note that a single point in R n may be considered a 0-dimensionalaffine linear subspace of R n . Definition 3.9.
For ≤ k ≤ n , a polyhedral set of dimension k in R n is apolyhedral set whose affine hull has dimension k . Definition 3.10.
Let P be a polyhedral set of dimension n .(i) A k –face of P is a face of P that has dimension k .(ii) A facet of P is an ( n − –face of P .(iii) A vertex of P is a –face of P . Definition 3.11. [7, Definition 1.9] A polyhedral complex C of dimension d is a finite set of polyhedral sets of dimension k , for ≤ k ≤ d , called the cells of C , satisfying the additional properties:(i) If P ∈ C , then every face of P is in C . (ii) If P, Q ∈ C , then P ∩ Q is a single mutual face of P and Q . In the above, we refer to the cells of dimension d as the top-dimensional cells of C . Definition 3.12.
The domain or underlying set |C| of a polyhedral complex C is the union of its cells. Definition 3.13.
Let C be a polyhedral complex of dimension d . The k –skeleton of C , denoted C k , is the subcomplex of all polyhedral sets of C ofdimension i , where ≤ i ≤ k . Note that C ⊆ C ⊆ . . . ⊆ C d − ⊆ ( C d = C ) . Remark . Note that in the present work, we shall require all of our poly-hedral sets to have finitely many faces and all of our polyhedral complexesto contain finitely many polyhedral sets.
Definition 3.15. If C is a polyhedral complex embedded in R n and |C| = R n ,we call C a polyhedral decomposition of R n . Definition 3.16. (i) Any hyperplane arrangement A in R n induces a polyhedral decom-position, C ( A ) , of R n as follows. Define the n -dimensional cells of C ( A ) to be the closures of the regions of A , and for < i < n ,inductively define the i -dimensional cells of C ( A ) to be the facets ofthe i + 1 dimensional cells.(ii) Any affine solution set arrangement S = { S , . . . , S m } in R n in-duces a polyhedral decomposition, C ( S ) , formed by first removingthe degenerate affine solution sets from S to obtain a hyperplanearrangement, A , and setting C ( S ) := C ( A ) . Note that the domain of the n − C ( A ) in Definition 3.16 is thethe union of the hyperplanes comprising A . Definition 3.17.
Let
M, R be polyhedral complexes. A map f : | M | → | R | is cellular if for every cell K ∈ M there exists a cell L ∈ R with f ( K ) ⊆ L . Definition 3.18.
Let M be polyhedral complexes with | M | embedded in R m .A map f : | M | → R r is linear on cells of M , if for each cell K ∈ M , therestriction of f to | K | is an affine linear map. Definition 3.19.
Let M and M (cid:48) be polyhedral complexes. M (cid:48) is said to bea subdivision of M if | M | = | M (cid:48) | and each cell of M (cid:48) is contained in a cellof M . Definition 3.20. ( [7] ) Let M and R be polyhedral complexes with a map f : | M | → R r linear on cells of M , where R is embedded in R r . The levelset complex of f is the set M ∈ R defined as M ∈ R := { S ∩ f − ( Y ) | S ∈ M, Y ∈ R } . The assertion that a level set complex is a polyhedral complex is justifiedby the following lemma:
OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 11
Lemma 3.21. ( [7, Lemma 2.5] ]) Let M be a polyhedral complex embedded insome R m with a map f : | M | → R r linear on cells and let R be a polyhedralcomplex embedded in R r . The level set complex M ∈ R is a polyhedral complexwhose underlying set | M | ∈ R is also embedded in R m . Lemma 3.22. If f : | M | → R r is as above and | R | = R r as in Definition3.15, then(i) f : M ∈ R → R is cellular , and(ii) M ∈ R is a subdivision of M .Proof. Immediate from the definitions. (cid:3)
In the present work, we will be most interested in the simple case of levelset complexes for maps R n → R , where the single 0–cell of the polyhedralcomplex R ⊆ R is a threshold t ∈ R , and the two 1–cells are the unboundedintervals ( −∞ , t ] and [ t, ∞ ). Definition 3.23. An interval complex is a polyhedral complex on R of theform { ( −∞ , t ] , { t } , [ t, ∞ ) } for some t ∈ R . (See Section 1.2.2 in [7].) In this case, Grunert’s Lemma 3.21 abovegives realizations of the sub- and super-level sets of the threshold t ∈ R aspolyhedral complexes. 4. Transversality
In this subsection, we recall notions of transversality and state and provetwo easy extensions of classical transversality results.4.1.
Classical transversality results.
We now briefly review some clas-sical transversality results, following [8].We denote the tangent space of a smooth manifold X at a point x ∈ X by T x X . Recall that for a smooth map f : X → Y of manifolds with f ( x ) = y ,the derivative df x is a linear map between tangent spaces, df x : T x X → T y Y ,and the image df x ( T x X ) is a linear subspace of T y Y . Recall, also, thatthe sum of two linear subspaces U and V of a linear space W is the set U + V := { u + v : u ∈ U, v ∈ V } .In Definition 4.1 and Theorems 7 and 8, assume X to be a smooth man-ifold with or without boundary, Y and Z to be smooth manifolds withoutboundary, Z a smoothly embedded submanifold of Y , and f : X → Y asmooth map. Definition 4.1.
We say that f is transverse to Z and write f (cid:116) Z if (5) df p ( T p X ) + T f ( p ) Z = T f ( p ) Y for all p ∈ f − ( Z ) .Remark . Definition 4.1 allows for the possibility that X is a manifold ofdimension 0, i.e. consists of (without loss of generality) a single point p . Inthis case T p { p } = { } and so df p ( T p { p } ) = { } , so condition (5) reduces tothe condition that if f ( p ) ∈ Z , then Z and Y must agree in a neighborhoodof f ( p ). Remark . Note that if the image f ( X ) does not intersect Z , then condi-tion (5) is vacuously true, so f is transverse to Z . Theorem 7 (Map Transversality Theorem) . [8, p. 28] If f is transverse to Z , then f − ( Z ) is an embedded submanifold of X. Furthermore, the codi-mension of f − ( Z ) in X equals the codimension of Z in Y .Remark . The Map Transversality Theorem uses the standard conventionthat the dimension of the empty set can assume any number. In the MapTransversality Theorem, if f − ( Z ) = ∅ , one considers the codimension of f − ( Z ) in X to be the codimension of Z in Y . Theorem 8 (Parametric Transversality Theorem) . [8, p. 68] Let S be asmooth manifold and let F : X × S → Y be a smooth map. If F is transverseto Z , then for (Lebesgue) almost every s ∈ S the restriction map F s : X → Y given by F s ( x ) = F ( x, s ) is transverse to Z . We wish to apply the Parametric Transversality Theorem to the parametrizedfamily of neural networks of a fixed architecture (Definition 2.4), but thisfamily does not satisfy the smoothness requirements, so we develop the nec-essary non-smooth analogues in § Extensions of the classical transversality results to maps onpolyhedral complexes that are smooth on cells.
We introduce a polyhedral analogue of Definition 4.1:
Definition 4.5.
Let X be a polyhedral complex of dimension d in R n , let f : | X | → R r be a map which is smooth on all cells of X and let Z be asmoothly embedded submanifold (without boundary) of R r . We say that f is transverse on cells to Z and write f (cid:116) c Z if:(i) the restriction of f to the interior , int ( C ) , of every k –cell C of X is transverse to Z (in the sense of Definition 4.1) when ≤ k ≤ d ,and(ii) the restriction of f to every –cell (vertex) of X is transverse to Z .Remark . We note that a function defined on a 0-cell is considered to be(vacuously) smooth.Note that condition (ii) of Definition 4.5 implies that if f is transverse to Z and there exists a vertex v of X such that f ( v ) ∈ Z , then Z must havethe full dimension r . Thus, if dim( Z ) < r , as is the case in all situations wewill consider, f being transverse to Z implies no vertex of X is sent by f to Z . We will be particularly interested in the case in which the codomain isan interval complex, i.e. r = 1 and Z = { t } is a threshold in R . Corollary 4.7 (Corollary of Theorem 7) . Let X be a polyhedral complex ofdimension d in R n . Let f : | X | → R r be a map which is smooth on cells of X and let Z be a smoothly embedded submanifold of R r for which f (cid:116) c Z .Then we have: • For every cell C ∈ X of dimension k , where ≤ k ≤ d , f − ( Z ) ∩ int ( C ) is a (possibly empty) smoothly embedded submanifold of int( C ) .Furthermore, the codimension of f − ( Z ) ∩ int ( C ) in int ( C ) equalsthe codimension of Z in R r . OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 13 • If dim ( Z ) < r , then for every cell C ∈ X of dimension (vertex), f ( C ) (cid:54)∈ Z .Proof. This follows immediately from Theorem 7, since the interior of anypolyhedral set of dimension k ∈ N is a nonempty smooth manifold. (cid:3) We will need the following version of the Parametric Transversality The-orem for families of maps that are linear on cells of a polyhedral complex.
Proposition 4.8.
Let X be a polyhedral complex in R n , S a smooth manifoldwithout boundary, and Z ⊆ R r a smoothly embedded submanifold withoutboundary. Suppose that for each cell C ∈ X , the restricted map F | C × S : C → R r is smooth and the further restricted map F | C (cid:48) × S : C (cid:48) × S → R r , where C (cid:48) = (cid:40) int ( C ) if C is of dimension ≥ ,C if C is of dimension , is transverse to Z . Then for (Lebesgue) almost every s ∈ S , the map f s : | X | × { s } → R r given by f s ( x ) = F ( x, s ) is transverse to Z .Proof. For each cell C ∈ X , the Parametric Transversality Theorem impliesthat there exists a null set S C ⊂ S such that f s | C (cid:48) ×{ s } is transverse to Z forevery s ∈ S \ S C . Let S X = (cid:83) C ∈ X S C ; as a finite union of null sets, S X is anull set. Then for every C ∈ X and s ∈ S \ S X , we have that f s | C (cid:48) ×{ s } : C (cid:48) × { s } → R r is transverse to Z . Hence f s is transverse to Z for all s ∈ S \ S X . (cid:3) Maps on polyhedral complexes and transversal thresholds
We now turn to applying the transversality statements developed in theprevious section to ReLU neural network maps.
Definition 5.1.
Let M be a polyhedral complex embedded in R n , n ∈ N ,and let F : | M | → R be a map that is smooth on cells. A threshold t ∈ R is said to be transversal for F and M if F is transverse on cells (Definition4.5) to the ( -dimensional) submanifold { t } of R . In this case, we write F (cid:116) c { t } .Remark . Although Section 4.2 and Definition 5.1 require only that themap F be smooth on cells, from this point onwards we restrict to the casethat F is affine linear on cells, since this is the setting relevant for under-standing neural network maps. In this case, we can recast transversality oncells for affine linear maps F in terms of F -nonconstant cellular neighbor-hoods, as follows.For the remainder of this section, let M be a polyhedral complex em-bedded in R n , n ∈ N , and let F : | M | → R be a map that is linear oncells. Definition 5.3.
A point x ∈ M is said to have a F -nonconstant cellularneighborhood in M if F is nonconstant on each cell of M containing x . Note that each vertex of M is itself a cell on which F is necessarilyconstant; hence, no vertex of M can be said to have a F -nonconstant cellularneighborhood. Lemma 5.4.
A threshold t ∈ R is transversal for F and M if and only ifeach point p ∈ F − ( { t } ) has a F -nonconstant cellular neighborhood in M .Proof. This is a matter of working through definitions. Assume F is linearon cells. The threshold t is transversal for F and M if and only if for any k -cell C ∈ X with k ≥ f to int( C ) is transverse to { t } , andthe restriction of f to any 0-cell X is transverse to { t } . This is equivalentto the statement that if p ∈ F − ( { t } ) and p is in a cell C ∈ M , then df p ( T p C ) + T f ( p ) { t } = T f ( p ) R . Since T f ( p ) { t } = 0, this equality holds if and only if df p ( T p C ) = T f ( p ) R ∼ = R for every cell C containing p , which is the definition of p having a noncon-stant cellular neighborhood. (cid:3) Remark . For F : | M | → R a map which is linear on the cells of apolyhedral complex M ⊆ R n as above, [7, Sec. 3.2] defines the notionof a regular point of F , by analogy to ideas from classical smooth Morsetheory. Informally, a regular point of F is one for which there exists alocal coordinate system on which the function F agrees with one of thecoordinates. Although we won’t need this more general notion in the presentwork, we remark that it follows immediately from the definitions that aregular, non-vertex, point of M necessarily has a F -nonconstant cellularneighborhood. Lemma 5.6.
Let t ∈ R be a transversal threshold for F and M . Then forevery cell C ∈ M , F − ( { t } ) ∩ C is either empty or aff ( F − ( { t } ) ∩ C ) is ahyperplane in aff ( C ) . Moreover, whenever F − ( { t } ) ∩ C is nonempty, thehyperplane aff ( F − ( { t } ) ∩ C ) cuts C (as in Definition 3.3).Proof. The statement that F − ( { t } ) ∩ C is a submanifold of codimension 1in C is from the Map Transversality Theorem (Theorem 7); its affine hull isa hyperplane because F is affine linear. Let H = aff( F − ( { t } ) ∩ C ) (cid:54) = ∅ . If H were a supporting hyperplane of C , then H ∩ C would be a non-emptylower-dimensional face of C , all of whose points map to t . Applying Lemma5.4, this would contradict the assumption that t is a transversal threshold.Hence, H cuts C whenever F − ( { t } ) ∩ C (cid:54) = ∅ . (cid:3) Lemma 5.7.
All but finitely many thresholds t ∈ R are transversal for F and M .Proof. The polyhedral complex M is, by definition, finite. Hence there areonly finitely many cells on which F is constant. But Lemma 5.4 tells us thatthe images of the constant cells are the only nontransversal thresholds for F and M . The result follows. (cid:3) OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 15 Bent hyperplane arrangements and canonical polyhedralcomplexes
The following notions were introduced in [11].
Definition 6.1. [11, Eqn. (2), Lem. 4]
Let R n F = σ ◦ A −−−−−−→ R n F = σ ◦ A −−−−−−→ . . . F m = σ ◦ A m −−−−−−−→ R n m G = A m +1 −−−−−−→ R . be a ReLU neural network. Let A ( k ) = (cid:110) H ( k )1 , . . . , H ( k ) n ik (cid:111) denote the hyperplane arrangement in R n k − associated to the layer map F k (Definition 2.12).(i) A bent hyperplane associated to the k th layer of F , for k ∈ { , . . . , m } ,is the preimage in R n of any hyperplane H ( k ) i (in R n k − ) associatedto the k th layer map: ( F k − ◦ . . . ◦ F ) − (cid:16) H ( k ) i (cid:17) . A bent hyperplane associated to the first layer of F is any hyperplane H (1) i (in R n ) associated to the first layer map.(ii) The bent hyperplane arrangement associated to the k th layer of F ,for k ∈ { , . . . , m } , is the set B ( k ) F := ( F k − ◦ . . . ◦ F ) − n ik (cid:91) j =1 H ( k ) j . The bent hyperplane arrangement associated to the first layer of F is the hyperplane arrangement (in R n ) associated to the first layerof F .(iii) The bent hyperplane arrangement associated to the entire neuralnetwork is the union of the bent hyperplane arrangements associatedto the layers: B F := m (cid:91) k =1 B ( k ) F . Remark . Note that the preimage of the hyperplane (in R n m ) associatedto the final layer G of the neural network is not included in the bent hyper-plane arrangement. This omission is motivated by the fact there is no factorof σ in G , and so G does not cause the preimage of this hyperplane in R n to belong to the locus where the neural network map is non-differentiable. Lemma 6.3.
A neural network map is smooth on the complement of itsbent hyperplane arrangement.Proof.
Clearly, each layer map F k : R n k − → R n k is smooth on the comple-ment of the domain of its associated hyperplane arrangement A ( k ) , and thefinal layer map G is smooth everywhere since it is affine. The restriction ofthe neural network map F to R n \ B F can be viewed as the composition ofthe restriction of the layer maps to complements of the domains their asso-ciated hyperplane arrangement; thus, as a composition of smooth maps, therestriction of F to R n \ B F is smooth. (cid:3) Definition 6.4. [11, Def 1, Lem 2]
Let F : R n → R be a ReLU neu-ral network map. An activation region of F is a connected component ofthe complement of the bent hyperplane arrangement associated to F , i.e. aconnected component of R n \ B F .Remark . Note that it is possible for the preimage of a hyperplane ar-rangement associated to a (non-first) layer map to have codimension 0, not 1,in R n . In this case, the bent hyperplane arrangement does not everywherelook locally like a hyperplane arrangement.As a simple example of this phenomenon, consider a two-layer ReLUneural network F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) R , where A (1) is the standard coordinate hyperplane arrangement, and A (2) = (cid:8) H (2) (cid:9) , where H (2) is any hyperplane through the origin. Then B (1) F = { H st , H st } , the standard co-oriented coordinate axes, and B (2) F = { ( x, y ) ∈ R | x, y ≤ , } the closed all-negative orthant. In particular, the bent hyperplane arrange-ment is codimension 0, not 1, and hence the closure of the activation regions(Definition 6.4) is a proper subset of R n .This phenomenon arises when a map fails to be transversal to a threshold,an observation that motivates Definition 8.2 and Theorem 2. Note that itis also a measure zero phenomenon. See Theorem 3.We now turn our attention to constructing a canonical polyhedral de-composition of the domain of a ReLU neural network using the work of [7].In the transversal case, we explicitly relate this decomposition to the benthyperplane arrangements and activation regions in Theorems 1 and 2. Definition 6.6. A polyhedral subcomplex of a polyhedral complex C is asubset C (cid:48) ⊆ C such that for every cell P in C (cid:48) , every face of P is also in C (cid:48) .The underlying set for a subcomplex C (cid:48) , denoted |C (cid:48) | , is the union of the cellsin C (cid:48) . Definition 6.7.
Let F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R be a ReLU neural network. For i ∈ { , . . . , m } , denote by R ( i ) the polyhedralcomplex on R n i − induced (as in Definition 3.16) by the hyperplane arrange-ment associated to the i th layer map F i (Definition 2.12). Inductively definepolyhedral complexes C ( F ) , . . . , C ( F m ◦ . . . F ) on R n as follows:Set C ( F ) := R (1) , and for i = 2 , . . . , m , inductively set (6) C ( F i ◦ . . . ◦ F ) := C ( F i − ◦ . . . ◦ F ) ∈ R ( i ) . The canonical polyhedral complex associated to the neural network is thepolyhedral complex C F := C ( F m ◦ . . . F ) . OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 17
Remark . Note that level set complexes were only defined for maps thatare linear on cells, so for the inductive step (6) of Definition 6.7 to makesense, and in order to assert that each C ( F i ◦ . . . ◦ F ) is a polyhedral complex,we must justify that for each i the map F i ◦ . . . ◦ F : R n → R n i is linear oncells of C ( F i ◦ . . . ◦ F ); this is done in Theorem 1. Remark . Note the canonical polyhedral complexes do not depend on thefinal layer map, G , which is purely affine and does not have a factor of σ .Compare with Remark 6.2. Theorem 1.
Let F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R be a ReLU neural network. For each i = 1 , . . . , m , C ( F i ◦ . . . ◦ F ) is apolyhedral decomposition of R n such that(i) F i ◦ . . . ◦ F is linear on the cells of C ( F i ◦ . . . ◦ F ),(ii) (cid:83) ik =1 B ( k ) F is the domain of a polyhedral subcomplex of C ( F i ◦ . . . ◦ F ). Proof.
For each i = 1 , . . . , m , denote by A ( i ) the hyperplane arrangement (of ≤ n i affine hyperplanes in R n i − ) associated to the layer map F i : R n i − → R n i (Definition 2.12), and denote by R ( i ) the induced polyhedral decompo-sition of R n i − (Definition 3.16). We proceed by induction on i .For i = 1, it is immediate that B (1) F = A (1) forms the ( n − R (1) and F is linear on cells of C ( F ) = R (1) .Now consider i > i −
1. Lemma 3.21together with condition (i) of the inductive hypothesis implies C ( F i ◦ . . . ◦ F )is a polyhedral complex.By condition (i) of the inductive hypothesis, each cell in C ( F i − ◦ . . . ◦ F ) ∈ R ( i ) is the intersection of a cell in C ( F i − ◦ . . . ◦ F ) with the preimageof a cell in R ( i ) . The map F i − ◦ . . . ◦ F is linear on each such intersectionby assumption. The layer map F i : R n i − → R n i is linear on cells of R ( i ) .Condition (i) follows.By condition (ii) of the inductive hypothesis, (cid:83) i − k =1 B ( k ) F is the domainof a polyhedral subcomplex of C ( F i − ◦ . . . ◦ F ). The polyhedral complex C ( F i ◦ . . . ◦ F ) is, by definition, formed by subdividing the polyhedral complex C ( F i − ◦ . . . ◦ F ), so (cid:83) i − k =1 B ( k ) F is the domain of a polyhedral subcomplex of C ( F i ◦ . . . ◦ F ). Let R ( i )( n i − − denote the ( n i − − R ( i ) . Notingthat the domain of R ( i )( n i − − is the union of the hyperplanes in A ( i ) , we have |C ( F i − ◦ . . . ◦ F ) | ∈ R ( i )( ni − − = B ( i ) F . Since the union of two subcomplexes of a polyhedral complex is a sub-complex, we see that (cid:83) ik =1 B ( k ) F is a polyhedral subcomplex of C ( F i ◦ . . . ◦ F ),which is condition (ii). (cid:3) In Definition 5.1, we say what it means for a threshold t to be transversal for F and M , where M is an arbitrary polyhedral complex and F : | M | → R is linear on cells of M . Equipped with the result (Theorem 1) that a neural network map F : R n → R is linear on cells of its canonical polyhedralcomplex C ( F ), it is natural to use M = C ( F ) to make the following definition. Definition . A threshold t ∈ R is a transversal threshold for a neuralnetwork F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R if t is a transversal threshold for F and its canonical polyhedral complex C ( F ) = C ( F m ◦ . . . F ) . Piecewise smoothness of the parametrized family of neuralnetworks
Lemma . Fix any network architecture ( n , . . . , n m , . Then there existsa finite set E of polynomials in the variables x , . . . , x n , s , . . . s D such that(i) If T is a term of a polynomial in E , then T has the form xs τ . . . s τ D D for some x ∈ { x , . . . , x n , } and ( τ , . . . , τ D ) ∈ { , } D , and(ii) F is smooth on the complement of the set Z F defined by Z F := { ( x, s ) ∈ R n × R D : f i (( x, s )) = 0 for some f i ∈ E } . Proof.
Fix a network architecture ( n , . . . , n m , D = D ( n , . . . , n m , E be the set of all possible “inputs” ofany ReLU in the expression defining F . Rather than presenting a formalproof, we give an illustrative example that demonstrates all the key ideas.We consider the network architecture (1 , , F : R × R → R given by( x, ( a, b, c, d, e, f, g )) (cid:55)→ ReLU(( e · ReLU( ax + b ) + f · ReLU( cx + d ) + g )Each of the three ReLU’s in this expression acts as either the identity or 0,depending on the sign of its argument.Let E be the set of all possible expressions that are inputs of a ReLUin the expression for F , allowing for the possibility that each nested ReLUcould be either 0 or the identity. That is, E = { ax + b, cx + d, e ( ax + b ) + f ( cx + d ) + g, e ( ax + b ) + g, f ( cx + d ) + g, g } . Let Z F ⊂ R × R be the set of points ( x, s ) where at least one function in E evaluates to 0.For any fixed input ( x, s ), F ( x, s ) is given by one of 2 possible (notnecessarily distinct) formulas (which correspond to each of the 3 ReLU’sbeing in one of two possible “states”). Let H be the set of 2 (not necessarilydistinct) functions formed by replacing each ReLU with either 0 or theidentity. That is, H is the set consisting of the following 8 (not necessarily OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 19 distinct) functions :( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · ( ax + b ) + f · ( cx + d ) + g ( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · ( ax + b ) + f · · ( cx + d ) + g ( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · · ( ax + b ) + f · ( cx + d ) + g ( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · · ( ax + b ) + f · · ( cx + d ) + g ( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · ( ax + b ) + f · ( cx + d ) + g )( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · ( ax + b ) + f · · ( cx + d ) + g )( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · · ( ax + b ) + f · ( cx + d ) + g )( x, ( a, b, c, d, e, f, g )) (cid:55)→ e · · ( ax + b ) + f · · ( cx + d ) + g )Which of these 8 functions represents the value of F at a given point canonly change at points where the argument of a ReLU in the expression for F – i.e. a polynomial in E – changes sign. Now, since all the functions in E arecontinuous and Z F is their set of zeros, for any point ( x, s ) ∈ ( R × R ) \ Z F it follows from continuity that there exists a neighborhood U of ( x, s ) onwhich the sign of each function in E is constant. Consequently, there is afixed function f ∈ H such that F agrees with f on U . Since all functionsin H are smooth, this implies that the restriction of F to R × R \ Z F issmooth. (cid:3) Theorem 4.
Fix any network architecture ( n , . . . , n m , Z F ⊂ R n × R D ( n ,...,n m , such that(i) The parametrized family of neural networks of architecture ( n , . . . , n m , F : R n × R D ( n ,...,n m , → R , is smooth on the complement of Z F .(ii) Z F is the vanishing set of a polynomial, and hence is a closed,nowhere dense subset with Lebesgue measure 0,(iii) the complement of Z F consists of finitely many connected compo-nents. Proof.
Let E be the set of polynomials constructed in Lemma 7.1, define thepolynomial F := (cid:81) f i ∈ E f i , and observe that the set Z F ⊂ R n + D ( n ,...,n m , from Lemma Z F ⊂ R n + D ( n ,...,n m , is the vanishing set for F . Standardresults in real algebraic geometry imply that Z F , as a codimension onealgebraic set, is nowhere dense, has Lebesgue measure 0, and its complementconsists of finitely many connected components (see e.g. [3, 20]). (cid:3) Proposition . For any network architecture ( n , . . . , n m , , let Z F be thealgebraic set constructed in the proof of Lemma 7.1 for the parametrizedfamily of neural networks F : R n × R D ( n ,...,n m , → R , and for each s ∈ R D ( n ,...,n m , , denote by F s the neural network map definedby F s ( x ) = F ( x, s ) . Then (cid:91) s ∈ Z F ( B F s × { s } ) ⊆ Z f . Proof.
Fix a parameter s ∈ R D ( n ,...,n m , . The bent hyperplane arrangement B F s is the union the preimages in R n of the hyperplanes in the hyperplanearrangements associate to the layer maps of F s . Each such hyperplaneis the vanishing set of the argument of a ReLU in the expression for F s .Thus the polynomial in the variables x , . . . , x D ( n ,...,n m , that defines sucha hyperplane is obtained by substituting the values for s into one of thepolynomials in the set E constructed in the proof of Lemma 7.1, implying F ( x, s ) = 0, where F is the product of the polynomials in E . Hence, anypoint x in the preimage in R n of such a hyperplane satisfies ( x, s ) ∈ Z F . (cid:3) Transversal neural networks
In what follows, let π j : R m → R denote the projection onto the j th coordinate. Definition . Let R n F = σ ◦ A −−−−−−→ R n F = σ ◦ A −−−−−−→ . . . F m = σ ◦ A m −−−−−−−→ R n m G = A m +1 −−−−−−→ R be a ReLU neural network. For each i ∈ { , . . . , m } and j ∈ { , . . . , n i } , the node map , F i,j , is the map π j ◦ A i ◦ F i − ◦ . . . ◦ F : R n → R . Definition . A ReLU neural network F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R is said to be transversal if, for each i ∈ { , . . . , m } and each j ∈ { , . . . , n i } , t = 0 is a transversal threshold (Definition 5.1) for the node map F i,j : R n → R . Remark . The descriptors generic and transversal , when applied to ReLUneural networks, are similar but complementary concepts. We devote a fewwords to explaining the difference.A ReLU neural network is generic if each solution set arrangement for eachlayer map is generic. It is not immediate, yet it is true, that if a solution setarrangement is generic then each solution set in the arrangement intersectseach intersection of solution sets in that layer transversely.In contrast, if a ReLU neural network is transversal then it follows fromthe definitions that each bent hyperplane intersects the bent hyperplanesfrom all previous layers transversely.Put simply, when applied to ReLU neural networks, the term generic describes intersections of cells associated to a single layer map, and theterm transversal describes intersections of cells associated to different layermaps.
OPOLOGICAL EXPRESSIVENESS OF RELU NEURAL NETWORKS 21
Theorem 2.
If a ReLU neural network F : R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R is transversal (in the sense of Definition 8.2), then the bent hyperplanearrangement B F is the domain of the ( n − C ( F m ◦ . . . ◦ F ) and the closures of the activation regionsof F are the n –cells of C ( F m ◦ . . . ◦ F ). Proof.
We proceed by induction on m . The base case m = 1 is immediate.Now consider any fixed value of m > m . In particular, for each node map F m,j , j ∈ { , . . . , n m } ,the bent hyperplane arrangement B F m,j = (cid:83) m − i =1 B ( i ) F m,j is the domain of the( n − C ( F m − ◦ . . . F ). But for each i ∈ { , . . . , m − } and j ∈ { , . . . , n m } , we have B ( i ) F m,j = B ( i ) F , so the bent hyperplane arrangement, B (cid:48) F := m − (cid:91) i =1 B ( i ) F , for the first m − n − C ( F m − ◦ . . . ◦ F ).To see that B F is contained in the ( n − C ( F m ◦ . . . ◦ F ), webegin by noting that B (cid:48) F is contained in the ( n − C ( F m ◦ . . . ◦ F )since C ( F m ◦ . . . ◦ F ) is a subdivision of C ( F m − ◦ . . . ◦ F ). Moreover, bydefinition B F = B (cid:48) F ∪ n m (cid:91) j =1 F − m,j ( { } ) . It therefore suffices to show that (cid:83) n m j =1 F − m,j ( { } ) is contained in the ( n − C ( F m ◦ . . . ◦ F ).But since 0 is a transversal threshold for each node map F m,j : R n → R ,this follows from the Map Transversality Theorem. Explicitly, for every cell C ∈ C ( F m − ◦ . . . ◦ F ), Theorem 7 tells us that C ∩ F − m,j ( { } ) is codimension1 in C . Since C ( F m − ◦ . . . ◦ F ) has dimension n , it follows that any newcell in B F \ B (cid:48) F has dimension ≤ ( n − C ( F m ◦ . . . ◦ F ) is contained in B F we will show that any k –cell C in C ( F m ◦ . . . ◦ F ) n − is also in B F .Since C ( F m ◦ . . . ◦ F ) is, by definition, a subdivision of C ( F m − ◦ . . . ◦ F ),the cell C is contained in a cell C (cid:48) of C ( F m − ◦ . . . ◦ F ). Assume WLOG that C (cid:48) has minimal dimension among all cells in C ( F m − ◦ . . . ◦ F ) containing C , and let k (cid:48) be the dimension of C (cid:48) . If k (cid:48) ≤ n −
1, then C ⊆ C (cid:48) ⊆C ( F m − ◦ . . . ◦ F ) n − , and the inductive hypothesis tells us C ⊆ B (cid:48) F , asdesired.So we may assume that C (cid:48) has dimension n . Therefore the constructiondescribed in the proof of Theorem 1 and the fact that C has dimension ≤ n − C is equal to the intersectionof C (cid:48) with F − m,j ( { } ) for some node j in the m th layer map. It follows that C ⊆ B F .We conclude that B F = C ( F m ◦ . . . ◦ F ) n − as desired. (cid:3) Theorem 3.
For any given architecture R n F (cid:47) (cid:47) R n F (cid:47) (cid:47) . . . F m (cid:47) (cid:47) R n m G (cid:47) (cid:47) R of feedforward ReLU neural network, almost every (with respect to Lebesguemeasure on R D ( n ,...,n m , ) choice of parameters yields a transversal ReLUneural network. Proof.
We proceed by induction on m, the number of hidden layers. When m = 1, each node map F_{1,j} : R^{n_0} → R is an affine linear map, so F is transversal if and only if each node map F_{1,j} takes at least one nonzero value, that is, if and only if the associated weight/bias vector (W_j | b_j) is nonzero for each j. The set of parameters for which some (W_j | b_j) is zero clearly forms a null set, as desired.

Now consider a fixed value of m ≥ 2, and assume the result holds for smaller values of m. For any neural network F of architecture (n_0, . . . , n_m, 1) and any j ∈ {1, . . . , n_m}, the node map F_{m,j} is a neural network of architecture (n_0, . . . , n_{m−1}, 1). Moreover, F is transversal if and only if the node map F_{m,1} is a transversal neural network and for every j ∈ {1, . . . , n_m} the node map F_{m,j} has t = 0 as a transversal threshold.

For any n ∈ N, let λ_n denote Lebesgue measure on R^n. By the inductive assumption, there exists a λ_{D(n_0,…,n_{m−1},1)}–null set N_1 ⊂ R^{D(n_0,…,n_{m−1},1)} such that for every parameter in R^{D(n_0,…,n_{m−1},1)} \ N_1, the node map F_{m,1} is transversal. Let N_2 denote the set of points in R^{D(n_0,…,n_m,1)} whose projection to the first D(n_0, . . . , n_{m−1}, 1) coordinates lies in N_1; then λ_{D(n_0,…,n_m,1)}(N_2) = 0. Let δ = D(n_0, . . . , n_m, 1) − D(n_0, . . . , n_{m−1}, 1). We will show that for every s ∈ R^{D(n_0,…,n_{m−1},1)} \ N_1, there exists a set Y_s ⊂ R^δ such that λ_δ(Y_s) = 0 and, for every x ∈ R^δ \ Y_s, the neural network F associated to the parameter (s, x) ∈ R^{D(n_0,…,n_m,1)} satisfies that for every j ∈ {1, . . . , n_m} the node map F_{m,j} has t = 0 as a transversal threshold. Assuming such sets Y_s are defined, set

N_3 := { (s, x) : s ∈ R^{D(n_0,…,n_{m−1},1)} \ N_1, x ∈ Y_s }.

Tonelli's Theorem will then imply that λ_{D(n_0,…,n_m,1)}(N_3) = 0. Then, for every parameter in R^{D(n_0,…,n_m,1)} \ (N_2 ∪ N_3), the associated neural network is transversal.

So fix s ∈ R^{D(n_0,…,n_{m−1},1)} \ N_1. Then, for each j ∈ {1, . . . , n_m}, we represent the parametrized family of layer-m, node-j node maps of neural networks of architecture (n_0, . . . , n_m, 1) whose first m − 1 layers are determined by s by

(F_{m,j})_s : R^{n_0} × R^{n_{m−1}+1} → R.

(The coordinate in R^{n_{m−1}+1} parameterizes the possible weights and bias for the j-th affine linear map of the m-th layer.) Let C(F_{m−1} ∘ ⋯ ∘ F_1) denote the canonical polyhedral complex (on R^{n_0}) associated to the first m − 1 layers of any such network, which are determined by s. By construction, for each w ∈ R^{n_{m−1}+1}, the node map

(F_{m,j})_s |_w : R^{n_0} × {w} → R, given by (F_{m,j})_s |_w (x) = (F_{m,j})_s (x, w),

is linear on the cells of C(F_{m−1} ∘ ⋯ ∘ F_1). Therefore, by Parametric Transversality (Proposition 4.8), to prove that (F_{m,j})_s |_w is transverse to {0} ⊂ R for λ_{n_{m−1}+1}–almost every w ∈ R^{n_{m−1}+1}, it suffices to show that for every cell C ∈ C(F_{m−1} ∘ ⋯ ∘ F_1), the parametrized map

(F_{m,j})_s |_{C′ × R^{n_{m−1}+1}} : C′ × R^{n_{m−1}+1} → R

is transverse to {0}, where C′ = C if dim(C) = 0 and C′ = int(C) otherwise. For any such C, to show that (F_{m,j})_s |_{C′ × R^{n_{m−1}+1}} is transverse to {0} ⊆ R, it suffices to show that (F_{m,j})_s |_{C′ × R^{n_{m−1}+1}} is surjective, since the whole space, R, is clearly transverse to any embedded submanifold.

Accordingly, fix a non-empty cell C ∈ C(F_{m−1} ∘ ⋯ ∘ F_1). Then C′ is non-empty, so its image in R^{n_{m−1}} under F_{m−1} ∘ ⋯ ∘ F_1 is non-empty. Let p ∈ R^{n_{m−1}} be a point in this image. We claim that for every t ∈ R there exists some affine linear transformation R^{n_{m−1}} → R sending p to t. This is true because every affine linear transformation A sends p to some value, t_0, and if t ≠ t_0 then the affine linear transformation A + (t − t_0) will send p to t. So (F_{m,j})_s |_{C′ × R^{n_{m−1}+1}} is surjective.

Thus, for λ_{n_{m−1}+1}–almost every w ∈ R^{n_{m−1}+1}, the node map (F_{m,j})_s |_w is transverse to {0}; let Y_{s,j} be the λ_{n_{m−1}+1}–null set where this fails. Set Y_s to be the set of points x in R^δ such that, for some j ∈ {1, . . . , n_m}, the projection of x to the (n_{m−1} + 1) coordinates representing the weights and bias of the j-th affine map of the m-th layer lies in Y_{s,j}. By construction, λ_δ(Y_s) = 0. □
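Here D(n_0, . . . , n_m, 1) is the dimension of the parameter space, i.e., the total number of weights and biases of the architecture. The following minimal numpy sketch (ours, not from the paper; the helper name is illustrative) tallies D for a small architecture and spot-checks the base-case condition from the proof, namely that for a randomly drawn first layer every augmented row (W_j | b_j) is nonzero, which of course holds with probability 1.

```python
import numpy as np

def param_count(dims):
    """Total number of weights and biases D(n_0, ..., n_m, 1) for a
    fully-connected ReLU network with layer widths dims = (n_0, ..., n_m, 1)."""
    return sum((dims[i] + 1) * dims[i + 1] for i in range(len(dims) - 1))

rng = np.random.default_rng(0)
dims = (2, 3, 3, 1)                      # architecture (n_0, n_1, n_2, 1)
D = param_count(dims)                    # here D = 9 + 12 + 4 = 25
print("D =", D)

# Base case of the proof (m = 1): a random first-layer parameter almost surely
# has every augmented row (W_j | b_j) nonzero, so 0 is a transversal threshold
# for each first-layer node map.
W1 = rng.standard_normal((dims[1], dims[0]))
b1 = rng.standard_normal(dims[1])
augmented_rows = np.hstack([W1, b1[:, None]])
print("all (W_j | b_j) nonzero:", np.all(np.linalg.norm(augmented_rows, axis=1) > 0))
```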
9. Binary codings of regions of co-oriented hyperplane arrangements and the gradient vector field of a ReLU neural network map

In this section, we collect elementary facts about co-oriented hyperplane arrangements that will be useful in the proofs of Theorems 5 and 6. We also introduce a partial orientation on C(F)^{(1)}, the 1–skeleton of the bent hyperplane arrangement of a generic, transversal ReLU neural network, defined using the gradient of the neural network function F. This partially-oriented graph plays an important role in the proof of Theorem 6.

9.1. Regions, vertices, and edges of classical hyperplane arrangements.
Definition 9.1. A region of a (possibly ordered, co-oriented) hyperplane arrangement A in R^n is a connected component of R^n \ ∪_{H ∈ A} H. Let r(A) denote the number of regions of A.

Note that each region, R, of an ordered, co-oriented hyperplane arrangement A = {H_1, . . . , H_k} is naturally labeled with a binary k–tuple, θ ∈ {0,1}^k, where the i-th component of θ associated to R is 1 (resp., 0) if the co-orientation of H_i points towards (resp., away from) R.

[Figure 1: three co-oriented hyperplanes H1, H2, H3 in R^2, with the seven complementary regions labeled (1,1,1), (0,1,1), (0,0,1), (1,0,1), (1,0,0), (1,1,0), (0,1,0).]

Figure 1.
There are 7 regions in the complement of 3 co-oriented hyperplanes in R^2. The assignment of binary 3–tuples to regions is not surjective.

Definition 9.2. We shall denote the region of A labeled by the binary k–tuple θ by R_θ(A) and refer to it as the θ region of A. If the ordered, co-oriented hyperplane arrangement A is clear from context, we will abbreviate the notation to R_θ.

The assignment of binary k–tuples to regions of an ordered, co-oriented hyperplane arrangement A is clearly injective, but it need not be surjective, as illustrated in Figure 1. Lemma 9.3 gives a sufficient condition for the assignment to be bijective.

The contents of Lemmas 9.3 and 9.4 are well known and follow immediately from a classical theorem of Zaslavsky [21] (cf. [18, Thm. 2.5]). We include proofs here for completeness.

Lemma 9.3. Let A := {H_1, . . . , H_k} be a generic ordered, co-oriented hyperplane arrangement in R^n, where k ≤ n. Then r(A) = 2^k, and hence there is a one-to-one correspondence between regions of A and binary k–tuples.

Proof. An elementary fact, [18, Lem. 2.1 (a)], used in the proof of Zaslavsky's theorem is that if A is a hyperplane arrangement, H ∈ A is a hyperplane, A′ := A \ {H} is the deleted arrangement, and A″ := {K ∩ H | K ∈ A′} is the restricted arrangement, then r(A) = r(A′) + r(A″). If A is generic, then the result follows from a double induction on n and k, since r(A′) = 2^{k−1} and r(A″) = 2^{k−1}. □

Lemma 9.4. Let A = {H_1, . . . , H_k} be a (generic or non-generic) ordered, co-oriented hyperplane arrangement of k ≤ n hyperplanes in R^n. Then A has no bounded regions.
This result follows from [18, Lem. 2.1 (b)], but, since no proof is provided there for part (b), we present a different approach (which has the added benefit of reminding the reader of a notion of duality between the domain and the parameter space of an affine linear map). The classical Minkowski–Weyl theorem (cf. [6]) tells us that every convex polytope P ⊆ R^n (see Definition 3.1) has two equivalent representations:

(i) as an intersection of half-spaces (H representation), a minimal subset of which give rise to the facets of P, and
(ii) as a convex hull of points (V representation), a minimal subset of which are the vertices of P.

Moreover, these representations are dual in the sense that if P is a convex polytope in R^n, we can define a so-called polar dual to P, P* ⊆ (R^n)*, as follows. First translate P so that the origin is in its interior. Then define

P* := { y ∈ R^n | x · y ≤ 1 for all x ∈ P }.

P* is also a convex polytope of dimension n, and its vertices (resp., facets) are in natural bijective correspondence with the facets (resp., vertices) of P [6, Sec. 3.4].

Proof. Suppose that A is a hyperplane arrangement of k ≤ n hyperplanes in R^n and there is a bounded region, R, of A. The closure of R is a convex polytope, P, and hence has a combinatorial dual P* whose vertices are in bijective correspondence with (a subset of) the hyperplanes of A. Since P* is a convex polytope of dimension n and is the convex hull of its vertices, and the convex hull of k points has dimension at most k − 1, it follows that k ≥ n + 1. □

Lemma 9.5. Let S = {S_1, . . . , S_N} ⊂ R^n be a nondegenerate solution set arrangement with defining set, {(W_i | b_i)}_{i=1}^N, of augmented weight/bias vectors as in Equation 2, and let A = {H_1, . . . , H_N} ⊂ R^n be its associated (possibly non-generic) hyperplane arrangement as in Definitions 2.12 and 3.16. Let {H_{i_1}, . . . , H_{i_n}} ⊆ A be any rank n subarrangement of size n. Then

p = H_{i_1} ∩ ⋯ ∩ H_{i_n}

is a 0–cell (vertex) of the canonical polyhedral complex C(A). Conversely, every vertex p of C(A) can be realized as p = H_{i_1} ∩ ⋯ ∩ H_{i_n} for some rank n subarrangement {H_{i_1}, . . . , H_{i_n}} ⊆ A.

Proof. Let A′ = {H_{i_1}, . . . , H_{i_n}} be a rank n, size n subarrangement of A. From Lemma 2.11 it follows that A′ is a generic arrangement, and hence the n–fold intersection, H_{i_1} ∩ ⋯ ∩ H_{i_n}, is an affine subspace of R^n of dimension 0. Indeed, since the (n − k)–dimensional affine subspaces of R^n associated to k–fold intersections of k–element subsets of A′ are reverse-ordered by inclusion, we see that p = H_{i_1} ∩ ⋯ ∩ H_{i_n} is the unique 0–cell in the boundary of all cells of the polyhedral complex C(A′). Since A is obtained from A′ by adding hyperplanes, C(A) is a polyhedral subdivision of C(A′), and so p is also a 0–cell (vertex) of C(A), as desired.

For the converse statement, we proceed by induction on n. For the base case (n = 1), it follows directly from the definition of C(A) that every 0–cell (vertex) is a hyperplane of A. Now let n > 1 and suppose p is a 0–cell of C(A). We know that p ∈ K for some hyperplane K ∈ A. Consider the restricted solution set arrangement

S_K = { K ∩ H | H ∈ A \ K },

from which we obtain a restricted hyperplane arrangement A_K by deleting the degenerate solution sets. Then p is also a 0–cell in the canonical polyhedral complex C(A_K). Since A_K is an arrangement in the (n − 1)–dimensional affine space K, the inductive hypothesis provides hyperplanes H_{i_1}, . . . , H_{i_{n−1}} ∈ A such that p = (K ∩ H_{i_1}) ∩ ⋯ ∩ (K ∩ H_{i_{n−1}}) in K. Letting H_{i_n} = K, it follows that p = H_{i_1} ∩ ⋯ ∩ H_{i_n}. The subarrangement {H_{i_1}, . . . , H_{i_n}} must therefore be rank n, since otherwise its intersection would be an affine space of dimension > 0. □

Corollary 9.6. Let A = {H_1, . . . , H_N} ⊂ R^n, {(W_i | b_i)}_{i=1}^N be as above. There is a canonical surjective map from the set of linearly-independent n–element subsets (subbases), {W_{i_1}, . . . , W_{i_n}} ⊆ {W_1, . . . , W_N}, to the set of vertices of C(A).

Proof. Subbases of the set, {W_1, . . . , W_N}, of weight vectors are in canonical bijective correspondence with the set of rank n, size n subarrangements of A. By Lemma 9.5, there is a surjective map from the set of rank n, size n subarrangements of A to the set of vertices of C(A), defined by taking the n–fold intersection of the hyperplanes in the subarrangement. □

Lemma 9.7. Let A = {H_1, . . . , H_N} ⊂ R^n, {(W_i | b_i)}_{i=1}^N be as above, and let B_p = {W_{i_1}, . . . , W_{i_n}} and B_q = {W_{j_1}, . . . , W_{j_n}} be two subbases of {W_1, . . . , W_N} with corresponding canonical vertices p and q, respectively, as guaranteed by Corollary 9.6. If |B_p ∩ B_q| = n − 1, then either p = q or p ∪ q is the boundary of a 1–cell in C(A).

Proof. If B_p = {W_{i_1}, . . . , W_{i_n}} and B_q = {W_{j_1}, . . . , W_{j_n}} are as above, then (reordering j_1, . . . , j_n if necessary) we may assume that i_k = j_k for k = 1, . . . , n − 1 and i_n ≠ j_n. Now (the proof of) Corollary 9.6 tells us that
p = (H_{i_1} ∩ ⋯ ∩ H_{i_{n−1}}) ∩ H_{i_n} and q = (H_{i_1} ∩ ⋯ ∩ H_{i_{n−1}}) ∩ H_{j_n},

and hence that p and q are points on the same 1–dimensional affine space, H_{i_1} ∩ ⋯ ∩ H_{i_{n−1}} ⊆ |C(A)|. The conclusion follows. □

Corollary 9.8. Let A = {H_1, . . . , H_{n+1}} be any (generic or non-generic) arrangement of n + 1 hyperplanes in R^n. Every pair of 0–cells (vertices), p ≠ q, of C(A) is connected by some 1–cell (edge) of C(A). That is, all vertices are adjacent in the graph C(A)^{(1)}.

Proof. Every pair of n–element subsets of an (n + 1)–element set has a common (n − 1)–element subset. So if B_p, B_q are two subbases of {W_1, . . . , W_{n+1}} and p ≠ q, then Lemma 9.7 tells us p and q are adjacent in C(A)^{(1)}. □
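To make Lemma 9.5, Corollary 9.6, and Corollary 9.8 concrete, here is a small numpy sketch (ours; the three hyperplanes are just an example) for an arrangement of n + 1 = 3 hyperplanes in R^2: every rank-2 pair of weight vectors determines a vertex of C(A) by solving a 2 × 2 linear system, and any two of the resulting vertices share a hyperplane, hence are joined by an edge of C(A).

```python
import numpy as np
from itertools import combinations

# A hyperplane H_i = { x : W_i . x + b_i = 0 } in R^2; three hyperplanes, n + 1 = 3.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Lemma 9.5 / Corollary 9.6: each rank-n (here rank-2) subset of the weight
# vectors determines a vertex of C(A), obtained by solving the 2x2 linear system.
vertices = {}
for idx in combinations(range(len(W)), 2):
    Wsub, bsub = W[list(idx)], b[list(idx)]
    if abs(np.linalg.det(Wsub)) > 1e-12:          # subbasis: rank n
        vertices[idx] = np.linalg.solve(Wsub, -bsub)   # W_i . x + b_i = 0 for i in idx

for idx, p in vertices.items():
    print(idx, p)    # (0,1) -> (0,0),  (0,2) -> (0,1),  (1,2) -> (1,0)

# Corollary 9.8: with n + 1 hyperplanes, any two subbases share n - 1 = 1 weight
# vector, so any two distinct vertices lie on a common hyperplane of A and are
# joined by an edge of C(A).
for (i1, p), (i2, q) in combinations(vertices.items(), 2):
    print(i1, i2, "share hyperplane(s)", set(i1) & set(i2))
```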
We now consider a generic, ordered, co-oriented hyperplane arrangement A = {H_1, . . . , H_n} in R^n. Note that |A| = n, so Lemma 9.3 applies. Let R̄_1 ⊂ R^n denote the closure of the (1, . . . , 1) ∈ {0,1}^n region of A. Note that R̄_1 is a polyhedral set of dimension n. The following lemma is immediate.

Lemma 9.9. Let A = {H_1, . . . , H_n} be a generic, ordered, co-oriented arrangement of n hyperplanes in R^n. Then the faces of R̄_1 are in natural bijection with binary n–tuples. Explicitly, the map from {0,1}^n to the set of faces of R̄_1 given by

θ ∈ {0,1}^n ↦ F_θ := { θ ⊙ v | v ∈ R̄_1 }

is a bijection.

Remark 9.10. In the above, θ ⊙ v denotes the
Hadamard product (component-wise product) of θ and v. Accordingly, F_{(1,…,1)} = R̄_1, and F_{(0,…,0)} = {(0, . . . , 0)}. Moreover, dim(F_θ) = Σ_{i=1}^n θ_i.

Let A^st be the standard ordered, co-oriented coordinate hyperplane arrangement in R^n. That is, A^st = {H_i^st}_{i=1}^n, where

H_i^st := { v = (v_1, . . . , v_n) ∈ R^n | v_i = 0 },

co-oriented in the direction of the non-negative half-space, {v ∈ R^n | v_i ≥ 0}. We shall denote by R̄_1^st the closure of the (1, . . . , 1) region of A^st, and by F_θ^st its θ face. In other words, R_1^st (resp., R̄_1^st) is the positive (resp., non-negative) orthant in R^n.

Lemma 9.11. Let A = {H_1, . . . , H_n} be a generic, ordered, co-oriented arrangement of n hyperplanes in R^n associated to a generic layer map σ ∘ A : R^n → R^n. Then A maps the A decomposition of R^n to the A^st decomposition of R^n, in the following sense:
• A(H_i) = H_i^st for all i = 1, . . . , n, and
• A(R_θ) = R_θ^st and A(F_θ) = F_θ^st for all θ ∈ {0,1}^n.
Moreover, on each closed region R̄_θ, the layer map σ ∘ A is the composition of the affine isomorphism A realizing the above identification, followed by the projection R̄_θ^st → F_θ^st given by taking the Hadamard product with θ ∈ {0,1}^n ⊆ R^n.

Proof. Immediate from the definition of the ReLU function. □
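Lemma 9.11 is easy to verify numerically. The sketch below (ours; the random weights are merely illustrative) samples points in R^2, reads off the binary label of the region of the co-oriented arrangement containing each point, and checks that the layer map σ ∘ A agrees with A followed by the Hadamard product with that label, so each region is carried into the corresponding face of the non-negative orthant.

```python
import numpy as np

rng = np.random.default_rng(1)

# A generic layer map sigma o A on R^2: A(x) = Wx + b with W invertible.
n = 2
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)
while abs(np.linalg.det(W)) < 1e-3:          # ensure A is an affine isomorphism
    W = rng.standard_normal((n, n))

def A(x):            # the affine map underlying the layer
    return W @ x + b

def layer(x):        # sigma o A, with sigma = componentwise ReLU
    return np.maximum(A(x), 0.0)

def label(x):        # binary label of the region of the co-oriented arrangement containing x
    return (A(x) > 0).astype(int)

# Lemma 9.11: on the region with label theta, the layer map agrees with
# A followed by the Hadamard product with theta (projection onto F^st_theta).
for _ in range(5):
    x = rng.standard_normal(n)
    theta = label(x)
    assert np.allclose(layer(x), theta * A(x))
    # the image lands in the face F^st_theta of the non-negative orthant:
    assert np.all(layer(x)[theta == 0] == 0) and np.all(layer(x)[theta == 1] > 0)
print("layer map agrees with (Hadamard with theta) o A on sampled regions")
```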
9.2. Regions of bent hyperplane arrangements and the ∇F–oriented 1–skeleton of the canonical polyhedral complex. We can similarly endow the activation regions of a generic, transversal ReLU neural network with a binary labeling as follows. Recall that if

F : R^{n_0} →^{F_1} R^{n_1} →^{F_2} ⋯ →^{F_m} R^{n_m} →^{G} R

is transversal and generic, Theorem 2 guarantees that the domain of the (n_0 − 1)–skeleton of C(F) (resp., the set of n_0–cells) agrees with the bent hyperplane arrangement, B_F (resp., the closures of the activation regions of B_F). In this case, the image of every activation region of F (interior of an n_0–cell of C(F)) is contained in a unique region of the co-oriented hyperplane arrangement in each layer:

Definition 9.12. Let F : R^{n_0} →^{F_1} R^{n_1} →^{F_2} ⋯ →^{F_m} R^{n_m} →^{G} R be a transversal, generic ReLU neural network, and let A^{(i)} denote the co-oriented hyperplane arrangement associated to F_i. The (θ_1, . . . , θ_m)–region of F, denoted R_{(θ_1,…,θ_m)}, is the unique activation region of F satisfying the property that for each i ∈ {1, . . . , m},

F_{i−1} ∘ ⋯ ∘ F_1 ( R_{(θ_1,…,θ_m)} ) ⊆ R_{θ_i}( A^{(i)} ).

Remark 9.13. As with regions of classical hyperplane arrangements, the assignment of binary (n_1, . . . , n_m)–tuples to activation regions of a generic, transversal neural network is injective, but need not be surjective. If n_0 = n_1 = . . . = n_m, then the assignment is bijective, by Lemma 9.3.

Lemma 9.14. Let F : R^{n_0} →^{F_1} R^{n_1} →^{F_2} ⋯ →^{F_m} R^{n_m} →^{G} R be a transversal, generic ReLU neural network with associated weight matrices W_1, . . . , W_m, W_{m+1}. Let R_{(θ_1,…,θ_m)} be a region of B_F with associated sequence of binary tuples (θ_1, . . . , θ_m), and let p ∈ R̄, where R̄ is the closure of R_{(θ_1,…,θ_m)}. Then

∂F/∂x |_{x=p} = W_{m+1} W_m^{θ_m} ⋯ W_1^{θ_1},

where we define W_k^{θ_k} to be the matrix obtained from W_k by replacing the i-th row of W_k with 0's when the i-th entry of θ_k is 0.

Proof. Immediate from the definition of the affine linear function on R_{(θ_1,…,θ_m)} and the chain rule for partial derivatives. □

The 1-skeleton, C(F)^{(1)}, of the canonical polyhedral decomposition for a neural network F : R^n → R is an embedded linear graph in R^n. This graph has a natural partial orientation, defined as follows.

Definition 9.15. Let F : R^n → R be a neural network, denote by C(F)^{(1)} the 1–skeleton of its canonical polyhedral complex, and let C be a 1–cell in C(F)^{(1)}.
• If F is nonconstant on C, orient C in the direction in which F increases.
• If F is constant on C, we will leave it unlabeled and refer to C as a flat edge.
We will refer to this (partial) orientation on C(F)^{(1)} as the grad(F)–orientation or ∇F–orientation.

Remark 9.16. Note that F is linear on C by Theorem 1.
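Lemma 9.14 expresses the gradient of F on an activation region purely in terms of the weight matrices and the region's binary label. A minimal numpy sketch (ours; the architecture (2, 3, 3, 1) and random weights are just for illustration) compares the formula with a centered finite-difference estimate at a random input:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small ReLU network R^2 -> R^3 -> R^3 -> R with random weights.
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((3, 3)), rng.standard_normal(3)
W3, b3 = rng.standard_normal((1, 3)), rng.standard_normal(1)

def forward(x):
    """Return F(x) together with the activation pattern (theta_1, theta_2)."""
    z1 = W1 @ x + b1; theta1 = (z1 > 0).astype(float); a1 = theta1 * z1
    z2 = W2 @ a1 + b2; theta2 = (z2 > 0).astype(float); a2 = theta2 * z2
    return (W3 @ a2 + b3)[0], theta1, theta2

x = rng.standard_normal(2)           # almost surely in the interior of a region
F_x, theta1, theta2 = forward(x)

# Lemma 9.14: on this region, dF/dx = W3 W2^{theta2} W1^{theta1}, where W^{theta}
# zeroes out the rows of W indexed by the 0 entries of theta.
grad_formula = W3 @ (theta2[:, None] * W2) @ (theta1[:, None] * W1)

# Compare with a centered finite-difference estimate of the gradient at x.
eps = 1e-6
grad_fd = np.array([(forward(x + eps * e)[0] - forward(x - eps * e)[0]) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(grad_formula.ravel(), grad_fd, atol=1e-4))   # True
```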
For a transversal, generic ReLU neural network F : R^{n_0} →^{F_1} R^{n_1} →^{F_2} ⋯ →^{F_m} R^{n_m} →^{G} R, Lemma 9.17 provides a way to calculate the orientations on the 1–cells of C(F)^{(1)} combinatorially from the weight matrices of the neural network layers and the list of binary tuples of the regions adjacent to the 1–cells.

Lemma 9.17. Let F : R^{n_0} → ⋯ → R^{n_m} → R be a transversal, generic ReLU neural network with associated weight matrices W_1, . . . , W_m, W_{m+1}, and let C be a 1–cell in C(F)^{(1)} with arbitrarily-chosen orientation. Let v_C be the unit norm vector parallel to C in the direction of the arbitrarily-chosen orientation. Let Θ(C) denote the set of binary tuples associated to regions R for which C is in ∂R, and let

θ^⊙(C) = ( θ_1^⊙(C), . . . , θ_m^⊙(C) )

be the Hadamard product of all of the binary tuple sequences (θ_1, . . . , θ_m) ∈ Θ(C). We recover the grad(F)–orientation on C (Definition 9.15) by assigning to C:
• the orientation in the direction of v_C if ( W_{m+1} W_m^{θ_m^⊙(C)} ⋯ W_1^{θ_1^⊙(C)} ) · v_C > 0,
• the orientation in the direction of −v_C if ( W_{m+1} W_m^{θ_m^⊙(C)} ⋯ W_1^{θ_1^⊙(C)} ) · v_C < 0, and
• no orientation if ( W_{m+1} W_m^{θ_m^⊙(C)} ⋯ W_1^{θ_1^⊙(C)} ) · v_C = 0.

Proof.
Definition 9.15 tells us that if F is nonconstant on C then the orientation on C is in the direction parallel to C in which F increases. Identify the tangent vectors along C with ±v_C, and the result follows from the fact that F is linear on C and agrees with the linear functions on all regions containing C in their boundary. □
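For a single hidden layer, the recipe of Lemma 9.17 can be carried out directly. The sketch below (ours; the particular weights just realize three co-oriented lines in R^2) picks a point on an unbounded 1–cell, reads off the labels of the two adjacent regions by nudging the point across the supporting hyperplane, forms their Hadamard product, and compares the sign of (W_2 W_1^{θ^⊙}) · v_C with the actual change of F along the edge.

```python
import numpy as np

# Single hidden layer network F(x) = ReLU(-x1) + ReLU(-x2) + ReLU(x1 + x2 - 1).
W1 = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b1 = np.array([0.0, 0.0, -1.0])
W2 = np.array([[1.0, 1.0, 1.0]])

F = lambda x: float(W2 @ np.maximum(W1 @ x + b1, 0.0))
label = lambda x: (W1 @ x + b1 > 0).astype(float)       # region label theta_1

# A 1-cell C of C(F): the ray {x1 = 0, x2 < 0}.  Pick a point on it and a unit
# direction v_C; the two adjacent regions are reached by nudging across H_1.
p, v_C = np.array([0.0, -1.0]), np.array([0.0, -1.0])
normal = np.array([1.0, 0.0])                           # normal to the line x1 = 0
eps = 1e-6
theta_had = label(p + eps * normal) * label(p - eps * normal)   # Hadamard product

# Lemma 9.17: sign of (W2 W1^{theta}) . v_C gives the grad(F)-orientation of C.
masked_W1 = theta_had[:, None] * W1                     # zero the rows with theta_i = 0
sign_formula = np.sign(float(W2 @ masked_W1 @ v_C))
sign_direct = np.sign(F(p + 1e-3 * v_C) - F(p))         # does F increase along v_C?
print(sign_formula, sign_direct)                        # both +1: C is oriented along v_C
```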
10. Obstructions to Topological Expressiveness and Applications

This section uses the framework developed in the previous sections to give an alternative perspective on the Hanin–Sellke, Johnson result that a width n ReLU neural network F : R^n → R has decision regions that are either empty or unbounded. We also develop an architecture-based obstruction to the existence of multiple bounded connected components in a decision region. Recall the statement of the Hanin–Sellke, Johnson result:

Theorem 5. [14, 12]
Let n ≥ 2. Suppose

F : R^n →^{F_1} R^n →^{F_2} ⋯ →^{F_m} R^n →^{G} R

is a width n ReLU neural network map and t ∈ R. Each of the sets

N_F(t) := F^{−1}((−∞, t)),   B_F(t) := F^{−1}({t}),   Y_F(t) := F^{−1}((t, ∞))

is either empty or unbounded.

We will need some elementary facts about the image of a ReLU neural network in the width n case. We address the case of generic and non-generic layer maps separately.

Proposition 10.1. Let N = (F_m ∘ ⋯ ∘ F_1) : R^n → ⋯ → R^n be the composition of all but the final layer map of a generic (Definition 2.9) width n ReLU neural network in which every hidden layer has dimension n. Then Im(N) ⊆ R̄_1^st ⊆ R^n is the domain of a polyhedral complex, C, with at most one cell of dimension n. Explicitly,

Im(N) = |C| = P ∪ Q,

where P is a (possibly empty) polyhedral set of dimension n, and Q is a union of polyhedral sets of dimension < n. Moreover, if P is nonempty, n ≥ 2, and

P = H_1^+ ∩ ⋯ ∩ H_m^+

is an irredundant realization of P as an intersection of closed half spaces, then the hyperplane arrangement {H_1, . . . , H_m} has rank ≥ 2.

Proof. We proceed by induction on the number, m, of layers. If m = 1, Im(N) is R̄_1^st, which is the domain of a polyhedral complex with a single n–cell, P = R̄_1^st, realizable as an irredundant intersection of n half-spaces. Moreover, if n ≥ 2, the rank of the corresponding hyperplane arrangement, {H_1^st, . . . , H_n^st}, is ≥ 2, as required.

Now consider m ≥ 2, and suppose C′ is the polyhedral complex whose domain agrees with the image, Im(N′) = P′ ∪ Q′, of the first m − 1 layers of N as described. Let
• A be the ordered, co-oriented hyperplane arrangement associated to F_m,
• C(A) be the associated polyhedral decomposition of the domain of F_m (Definition 3.16),
• and C′ ∩ C(A) be the complex whose cells are pair-wise intersections of cells of C′ with cells of C(A).
By Lemma 9.11, F_m is linear on the cells of C′ ∩ C(A). Indeed, it is immediate that we can alternatively characterize C′ ∩ C(A) as the level set complex of F_m relative to C′ and the polyhedral decomposition associated to the standard hyperplane arrangement, A^st, in the codomain. Noting that the image of a polyhedral set under an affine linear map is a polyhedral set, we now define C to be the complex whose cells are the images of cells of C′ ∩ C(A) under F_m.

It then follows immediately from Lemma 9.11 that all of the cells of C have dimension < n except possibly F_m(P′ ∩ R̄_1(A)). Further, since F_m is an affine isomorphism on R̄_1(A), this cell will be n–dimensional iff P′ ∩ R̄_1(A) is n–dimensional. In this case, we claim that as long as n ≥ 2, an irredundant bounding hyperplane arrangement of the cell P′ ∩ R̄_1(A) will have rank ≥ 2, since both P′ and R̄_1(A) have this property. To see this, note that the union of the bounding hyperplane arrangements for P′ and R̄_1(A) yields a bounding hyperplane arrangement for P′ ∩ R̄_1(A), and it necessarily has rank ≥ 2. Recall that a redundant inequality in a system of linear inequalities is a non-negative linear combination of the other linear inequalities in the system. This implies that the rank of any bounding hyperplane arrangement of a polyhedral set is equal to the rank of an irredundant bounding hyperplane arrangement. Moreover, the image under F_m of an irredundant bounding hyperplane arrangement for P′ ∩ R̄_1(A) is also irredundant and has rank ≥ 2, since F_m is an affine isomorphism on R̄_1(A). Defining P to be
• F_m(P′ ∩ R̄_1(A)) if the polyhedral set P′ ∩ R̄_1(A) has dimension n, and
• ∅ otherwise,
and Q to be the union of all other cells of C, the result follows. □

Lemma 10.2. Let N = (F_m ∘ ⋯ ∘ F_1) : R^n → R^n → ⋯ → R^n be the composition of all but the final layer map of a generic width n ReLU neural network in which every hidden layer has dimension n, and let Im(N) = |C| = P ∪ Q as in Proposition 10.1. If x ∈ |C^{(n−1)}|, then N^{−1}({x}) is unbounded.

Proof. We proceed again by induction on the number, m, of layers. The result is clear when m = 1, since in this case Lemma 9.11 tells us that Q = ∅ and P = R̄_1^st, and each point x ∈ |C^{(n−1)}| = ∂P = ∂(R̄_1^st) is in the image of the projection map R̄_θ^st → F_θ^st for some θ, hence has unbounded preimage.

Now suppose m > 1, and assume the result holds for the image, |C′| = P′ ∪ Q′, of the first m − 1 layers of N. That is, each point in |C′^{(n−1)}| has unbounded preimage. Let x ∈ |C^{(n−1)}| ⊆ Im(N). If x is in the image of |C′^{(n−1)}|, then it has unbounded preimage by the inductive hypothesis. So we may assume that x is in the image of int(P′) and not in the image of ∂P′. But since x ∈ |C^{(n−1)}|, Lemma 9.11 implies that F_m^{−1}({x}) ∩ |C′| is a ray contained in int(P′). Since int(P′) is the image of the interior of a polyhedral set in the domain under a composition of affine linear isomorphisms, the preimage of this ray is a ray in the domain, hence unbounded. The conclusion follows. □

Lemma 10.3. Let X be an affine hyperplane in R^n, and let P = H_1^+ ∩ ⋯ ∩ H_m^+ be an irredundant representation of a non-empty n–dimensional polyhedral set in R^n such that the hyperplane arrangement A = {H_1, . . . , H_m} has rank ≥ 2, and, for each i, let F_i = P ∩ H_i be the facet in ∂P corresponding to the bounding hyperplane H_i. If X ∩ P ≠ ∅, then X ∩ F_i ≠ ∅ for some i.

Proof. We proceed by induction on m, the base case being m = 2. Note also that the rank assumption implies that n ≥ 2.
In this case, if X ∩ P ≠ ∅ and X ∩ F_1 = ∅, then X and H_1 must be parallel. But then the rank assumption implies that X ∩ F_2 ≠ ∅, as desired.

Now assume P = H_1^+ ∩ ⋯ ∩ H_m^+ for m > 2, where A = {H_1, . . . , H_m} has rank ≥ 2. Reordering if necessary, we may assume {H_1, H_2} has rank 2. If X ∩ F_m ≠ ∅, we're done, so we may assume WLOG that X ∩ F_m = ∅. But then P′ := ∩_{i=1}^{m−1} H_i^+ is a necessarily irredundant representation of a polyhedral set with fewer facets, whose bounding hyperplane arrangement has rank ≥ 2. So the inductive hypothesis tells us that there exists some i ≤ m − 1 with X ∩ F_i ≠ ∅, as desired. □

Lemma 10.4. For any n ∈ N, let N = (F_m ∘ ⋯ ∘ F_1) : R^n → ⋯ → R^n be the composition of all but the final layer map of a non-generic width n ReLU neural network. Then each point in Im(N) has unbounded preimage.

Proof. Let F_i be the first non-generic layer map in the composition N = F_m ∘ ⋯ ∘ F_1. Since F_i is non-generic, the affine map A_i underlying F_i is non-invertible, by Lemma 2.11. Indeed, the preimage under A_i of any point in its image is an affine linear subspace of R^n of dimension ≥ 1, hence unbounded. Recalling that F_i = σ ∘ A_i is a map that is linear on the cells of the canonical polyhedral decomposition C(F_{i−1} ∘ ⋯ ∘ F_1) of R^n, it follows immediately that the linear map on each cell is also non-invertible, hence the preimage of any point p ∈ Im(F_i) is unbounded. Any point q ∈ Im(N) is of the form q = (F_m ∘ ⋯ ∘ F_{i+1})(p) for some p ∈ Im(F_i). So N^{−1}({q}) is unbounded. □

Recall that a threshold t is transversal for a neural network F (Definition 6.10) if it is transversal for F with respect to its canonical polyhedral complex C(F).

Lemma 10.5. Let t be a transversal threshold for a neural network F : R^n → R. Then B_F(t) = ∂N_F(t) = ∂Y_F(t).
Proof.
In the case that B_F(t) = ∅, either Y_F(t) = R^n and N_F(t) = ∅, or Y_F(t) = ∅ and N_F(t) = R^n, so the statement holds. Now suppose B_F(t) ≠ ∅. For each cell C ∈ C(F) such that F^{−1}({t}) ∩ C ≠ ∅, Lemma 5.6 guarantees that aff(F^{−1}({t}) ∩ C) is a hyperplane that cuts C. Denote by C^+ and C^− the intersections of C with the two open half-spaces that form the complement of this hyperplane. Since F is nonconstant on C (by Lemma 5.4), precisely one of C^+, C^− must be contained in Y_F(t) and the other must be contained in N_F(t). Therefore B_F(t) ⊆ ∂Y_F(t) and B_F(t) ⊆ ∂N_F(t). The reverse inclusions are obvious. □

We are now ready for:
Proof of Theorem 5.
Fix an integer n ≥ 2 and let F : R^n → R be a ReLU neural network whose hidden layers all have dimension ≤ n. We may then assume WLOG that every intermediate layer has dimension n (cf. Remark 2.10). Decompose F as F = G ∘ N, where N : R^n → R^n is the composition of all the layer maps except for the final one, and G : R^n → R is the final layer map. If G is degenerate, it is immediate that the theorem holds. So assume that G is nondegenerate.

Step 1: We will prove that for every t ∈ R, the decision boundary B_F(t) is either empty or unbounded. Since G is nondegenerate, X_t := G^{−1}({t}) is an affine hyperplane in the final hidden layer of F. Note that

B_F(t) = N^{−1}( G^{−1}({t}) ) = N^{−1}( Im(N) ∩ X_t ).

If Im(N) ∩ X_t is empty, then B_F(t) is empty, as desired. So assume Im(N) ∩ X_t is nonempty.

Case 1: We first consider the case that N is non-generic. If Im(N) ∩ X_t is nonempty, then N^{−1}(Im(N) ∩ X_t) is unbounded by Lemma 10.4, and if Im(N) ∩ X_t is empty, then so is N^{−1}(Im(N) ∩ X_t).

Case 2: Now consider the case that N is generic. By Proposition 10.1, Im(N) is the domain of a polyhedral complex C that has a unique (possibly empty) n-cell P.

Subcase a: If X_t ∩ P = ∅, then the assumption that Im(N) ∩ X_t ≠ ∅ implies X_t ∩ |C^{(n−1)}| ≠ ∅. Therefore B_F(t) is unbounded by Lemma 10.2.

Subcase b: If X_t ∩ P ≠ ∅, then P is nonempty. Because n ≥ 2, Proposition 10.1 guarantees that an irredundant bounding hyperplane arrangement of P has rank ≥ 2. Hence X_t ∩ |C^{(n−1)}| ≠ ∅ by Lemma 10.3. Therefore B_F(t) is unbounded by Lemma 10.2.

Step 2: We will use the fact that B_F(t) is empty or unbounded to show N_F(t) and Y_F(t) are also.

Case 1: When t ∈ R is a transversal threshold, it is now straightforward to see that the decision regions Y_F(t) and N_F(t) are also either empty or unbounded, since B_F(t) = ∂N_F(t) = ∂Y_F(t) by Lemma 10.5, and a bounded set cannot have unbounded closure.

Case 2: Suppose t ∈ R is a non-transversal threshold. We will give an argument for Y_F(t); the argument for N_F(t) is analogous. Let X_t^+ be the positive half-space associated to the co-oriented affine hyperplane X_t. Then

F^{−1}((t, ∞)) = N^{−1}( Im(N) ∩ X_t^+ ).

If this intersection is empty, then Y_F(t) is empty, as desired. If this intersection is nonempty, there are two subcases.

Subcase a: t < max{ t′ ∈ R : F^{−1}((t′, ∞)) ≠ ∅ }. In this case, since F is continuous, there exists ε > 0 such that (t, t + ε) ⊂ { t′ ∈ R : F^{−1}((t′, ∞)) ≠ ∅ }. Hence Lemma 5.7 implies there exists a transversal threshold t′ > t for which F^{−1}({t′}) is non-empty. Noting that F^{−1}({t′}) = B_F(t′), it follows from the first part of the proof that B_F(t′) is unbounded. So Y_F(t) ⊇ B_F(t′) must also be unbounded.

Subcase b: t = max{ t′ ∈ R : F^{−1}((t′, ∞)) ≠ ∅ }. In this case Y_F(t) is empty, as desired. □

10.1. Obstructing multiple bounded connected components.
As observed in [14], it is straightforward to construct, for every n, a width n + 1 neural network with a single hidden layer, R^n → R^{n+1} → R, that has a bounded decision region consisting of a single connected component; see the sketch below for one such construction. We prove that such a simple architecture cannot produce a decision region with more than one bounded connected component.
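For concreteness, here is one such construction in the case n = 2 (our example; not necessarily the construction used in [14]). The function F(x) = ReLU(−x_1) + ReLU(−x_2) + ReLU(x_1 + x_2 − 1) is realized by a network of architecture (2, 3, 1); it vanishes exactly on the triangle with vertices (0,0), (1,0), (0,1) and tends to infinity along every ray leaving the triangle, so N_F(t) is a nonempty bounded decision region for small t > 0.

```python
import numpy as np

# F(x) = ReLU(-x1) + ReLU(-x2) + ReLU(x1 + x2 - 1): a width-3 single hidden
# layer network on R^2 (architecture (2, 3, 1)).
W1 = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b1 = np.array([0.0, 0.0, -1.0])
W2 = np.array([1.0, 1.0, 1.0])

def F(x):
    return float(W2 @ np.maximum(W1 @ np.asarray(x, dtype=float) + b1, 0.0))

# F is 0 on the closed triangle conv{(0,0), (1,0), (0,1)} ...
print(F([0.2, 0.2]), F([0.0, 1.0]))          # 0.0  0.0
# ... and grows along every ray leaving the triangle, so N_F(t) = {F < t} is a
# nonempty bounded decision region for, say, t = 1/2.
for direction in [(1, 0), (-1, 0), (0, -1), (1, 1), (-1, 1), (2, -1)]:
    d = np.array(direction, dtype=float)
    print(direction, [round(F(r * d), 2) for r in (2, 10, 50)])   # values increase
```

Theorem 6 below shows that this architecture can never produce two such bounded components.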
Theorem 6. Let F : R^n → R^{n+1} → R be a ReLU neural network. Then a decision region Y_F(t) or N_F(t) associated to a transversal threshold t can have no more than one bounded connected component.
Lemma 10.6. Let P ⊆ R^n be a polyhedral set, and let F : P → R be an affine linear map on P. If F achieves a maximum (resp., minimum) on the interior of P, then F achieves this maximum (resp., minimum) value on all of P, and hence on all faces in its boundary.

Proof. This is a standard result in linear programming. The maximum (resp., minimum) value of an affine linear function on a polyhedral set is achieved when the dot product with a particular vector (the vector of weights of the affine linear function) is maximized (resp., minimized). But since P is a closed subset of a linear subspace of R^n, the maximum value of the dot product with a fixed vector is either attained on the boundary or on the entire set. □

Corollary 10.7. Let F : R^n → R be a neural network, and C(F) be its canonical polyhedral complex. If P is a cell of C(F) with at least one vertex as a face, and F achieves a maximum (resp., minimum) on P, then F achieves a maximum (resp., minimum) at a vertex of P.

Proof. The function F is linear on P by Theorem 1. Under the given assumptions, every face of P will also be a polyhedral set with a vertex as a face. The result follows by strong induction on the dimension of P, applying Lemma 10.6 in the inductive step. □

Corollary 10.8. Let F and C(F) be as above. If P is a bounded cell (polytope) of C(F) of any dimension, then F achieves a maximum (resp., minimum) at a vertex of P. Note that F needn't be generic or transversal.
Proof.
Every polytope
P ⊂ R^n (of any dimension) has at least one vertex. Moreover, P is bounded, hence compact, since cells are closed. The extreme value theorem then guarantees that F achieves both a minimum and maximum value on P. The result follows from Corollary 10.7. □

Proposition 10.9. Let t be a transversal threshold for a neural network F : R^n → R, and let S be a bounded connected component of Y_F(t) (resp., N_F(t)). Then there exist non-empty bounded subgraphs G′ ⊆ G ⊆ C(F)^{(1)} which, when endowed with the ∇F–orientation (Definition 9.15), satisfy:
(i) G′ is flat;
(ii) G′ ⊆ G ⊂ S;
(iii) there is a non-empty collection, E, of edges adjacent to G, satisfying the property that every edge e ∈ E points towards G (resp., points away from G) and has nonempty intersection with ∂S and with the other decision region, N_F(t) (resp., Y_F(t)).

Remark 10.10. Note that for any graphs G_1 and G_2, the statement "G_1 is a subgraph of G_2" implies that the set of vertices (resp., edges) of G_1 is a subset of the set of vertices (resp., edges) of G_2. The vertices (resp., edges) of the graph C(F)^{(1)} are the 0-cells (resp., 1-cells) of C(F).

Remark 10.11. One can view the graph G′ described in Proposition 10.9 as a PL analogue of a Morse critical point of index n (resp., 0). Recall that if f : M^n → R is a Morse function on a smooth n–dimensional manifold, any bounded connected component, S, of a superlevel set (resp., sublevel set) must contain a critical point of index n (resp., of index 0) in its interior. Moreover, once the manifold has been endowed with a Riemannian metric, the gradient vector field, ∇f, will be transverse to ∂S (since it is the preimage of a regular value) and will point into (resp., out of) S.

Proof.
Let S̄ denote the closure of S. Since S̄ is closed and bounded, hence compact, the extreme value theorem tells us that F attains its maximum (resp., minimum) value, M ∈ R (resp., m ∈ R), on S̄. I.e., there exists x ∈ S̄ such that F(x) = M and F(y) ≤ M for all y ∈ S̄. But Lemma 10.6 implies that F^{−1}({M}) contains a non-empty subgraph, G′, of C(F)^{(1)}, since a maximum value, if attained on the interior of a cell, is attained on the whole cell, including its boundary. Moreover, G′ ⊂ S, for if G′ ∩ ∂S ≠ ∅ (recall ∂S ⊆ F^{−1}({t})), then t = M, which would imply that t is not a transversal threshold, since its preimage would then contain a vertex, which by definition cannot have a nonconstant cellular neighborhood.

Let G be the maximal subgraph of C(F)^{(1)} ∩ S̄ that both contains G′ and is entirely contained in S. Properties (i) and (ii) are immediate by construction. To see Property (iii), note that C(F)^{(1)} is connected and unbounded, so it follows that C(F)^{(1)} ∩ ∂S ≠ ∅. Since t is a transversal threshold, all points in ∂S have nonconstant cellular neighborhood, hence all edges of C(F)^{(1)} intersecting ∂S are oriented, and the orientations are toward (resp., away from) G if S ⊆ Y_F(t) (resp., S ⊆ N_F(t)). □
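In the width n + 1 = 3 example above (the triangle network, our running example, not the paper's), the flat subgraph G′ of Proposition 10.9 is visible explicitly: F ≡ 0 on the boundary of the triangle, so all three bounded edges of C(F)^{(1)} are flat, and the unbounded edges meeting them are oriented away from the bounded component of N_F(t). A quick numerical check:

```python
import numpy as np
from itertools import combinations

# Triangle network F(x) = ReLU(-x1) + ReLU(-x2) + ReLU(x1 + x2 - 1).
W1 = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b1 = np.array([0.0, 0.0, -1.0])
W2 = np.array([1.0, 1.0, 1.0])
F = lambda x: float(W2 @ np.maximum(W1 @ np.asarray(x, float) + b1, 0.0))

# Vertices of C(F) = pairwise intersections of the three lines W1_i . x + b1_i = 0.
verts = [np.linalg.solve(W1[list(ij)], -b1[list(ij)]) for ij in combinations(range(3), 2)]
print([v.round(3) for v in verts], [F(v) for v in verts])   # (0,0), (0,1), (1,0); F = 0 at all

# The bounded 1-cells (triangle edges) are flat: F is constant (= 0) along them.
for p, q in combinations(verts, 2):
    mid = (p + q) / 2
    print("edge", p.round(1), q.round(1), "flat:", np.isclose(F(mid), F(p)) and np.isclose(F(p), F(q)))

# An adjacent unbounded edge, e.g. the ray {x1 = 0, x2 <= 0}, is oriented away
# from the triangle: F increases as we leave the bounded component of N_F(1/2).
print(F([0.0, 0.0]) < F([0.0, -1.0]) < F([0.0, -2.0]))   # True
```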
Proof of Theorem 6. We open by noting that since F has a single hidden layer, its canonical polyhedral complex, C(F), is simply the canonical polyhedral complex, C(A), of the hyperplane arrangement, A, in R^n associated to the first layer map. We may assume without loss of generality that |A| = n + 1, for if the first layer map is degenerate then the neural network has width n, and hence its decision regions have no bounded connected components, by Theorem 5.

Now let t ∈ R be a transversal threshold for a (not necessarily generic, not necessarily transversal) ReLU network, F : R^n → R^{n+1} → R. We will show that Y_F(t) has no more than one bounded connected component. The argument for N_F(t) is analogous.

Assume, aiming for a contradiction, that Y_F(t) has more than one bounded connected component. Choose two of these, and call them S_1 and S_2. As described in Proposition 10.9, there exist non-empty bounded subgraphs G_i ⊂ S_i (for i = 1, 2) of the 1–skeleton of C(A) = C(F) and associated non-empty collections, E_i, of edges adjacent to G_i, equipped with ∇F–orientation pointing towards G_i. For each S_i, choose an external vertex p_i ∈ G_i. That is, choose a vertex p_i ∈ G_i in the boundary of an edge of E_i.

Now Corollary 9.8 tells us that p_1 and p_2 are connected by an edge (1–cell), e, of C(A). It follows that e is in both E_1 and E_2. But this is impossible, since it would require e to be oriented in two different directions at once. We conclude that one of E_1 or E_2 must be empty, hence Proposition 10.9 tells us that one of S_1, S_2 must be empty. The result follows. □

References

[1] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations (ICLR). OpenReview.net, 2018.
[2] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures.
IEEE Trans. Neural Networks Learn. Syst., 25(8):1553–1565, 2014.
[3] Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real algebraic geometry, volume 36 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)]. Springer-Verlag, Berlin, 1998. Translated from the 1987 French original, revised by the authors.
[4] Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann N. Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 854–863. PMLR, 2017.
[5] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2(4):303–314, 1989.
[6] Branko Grünbaum. Convex polytopes, volume 221 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 2003. Prepared and with a preface by Volker Kaibel, Victor Klee and Günter M. Ziegler.
[7] Romain Grunert. Piecewise Linear Morse Theory. PhD thesis, Freie Universität Berlin, 2016. https://refubium.fu-berlin.de/handle/fub188/12531.
[8] V. Guillemin and A. Pollack. Differential Topology. AMS Chelsea Publishing, 2010.
[9] William H. Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. CoRR, abs/1802.04443, 2018.
[10] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2596–2604. PMLR, 2019.
[11] Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. CoRR, abs/1906.00904, 2019.
[12] Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. CoRR, abs/1710.11278, 2017.
[13] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[14] Jesse Johnson. Deep, skinny neural networks are not universal approximators. In International Conference on Learning Representations (ICLR). OpenReview.net, 2019.
[15] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR). OpenReview.net, 2018.
[16] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.
[17] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4565–4573. PMLR, 2018.
[18] Richard P. Stanley. An introduction to hyperplane arrangements. In Geometric combinatorics, volume 13 of IAS/Park City Math. Ser., pages 389–496. Amer. Math. Soc., Providence, RI, 2007.
[19] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 6542–6551, 2018.
[20] Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Trans. Amer. Math. Soc., 133:167–178, 1968.
[21] Thomas Zaslavsky. Facing up to arrangements: face-count formulas for partitions of space by hyperplanes. Mem. Amer. Math. Soc., 1(issue 1, no. 154):vii+102, 1975.
[22] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In ICML, 2018.
[23] Günter M. Ziegler. Lectures on polytopes, volume 152 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1995.
Boston College; Department of Mathematics; 522 Maloney Hall; Chestnut Hill, MA 02467
E-mail address: [email protected]

Boston College; Department of Mathematics; 567 Maloney Hall; Chestnut Hill, MA 02467
E-mail address: