Geometry of Deep Convolutional Networks
Stefan Carlsson
School of EECS, KTH, Stockholm, Sweden
May 23, 2019
Abstract
We give a formal procedure for computing preimages of convolutional network outputs using the dual basis defined from the set of hyperplanes associated with the layers of the network. We point out the special symmetry associated with arrangements of hyperplanes of convolutional networks that take the form of regular multidimensional polyhedral cones. We discuss the efficiency of a large number of layers of nested cones, resulting from incremental small-size convolutions, in giving a good compromise between efficient contraction of data to low dimensions and shaping of preimage manifolds. We demonstrate how a specific network flattens a non-linear input manifold to an affine output manifold and discuss its relevance to understanding classification properties of deep networks.
Deep convolutional networks for classification map input data domains to output domains that ideally correspond to various classes. The ability of deep networks to construct various mappings has been the subject of several studies over the years [1, 3, 10] and has in general resulted in various estimates of capacity given a network structure. The actual mappings that are learnt by training a specific network, however, often raise a set of questions, such as: why are increasingly deeper networks advantageous [13, 14]? What are the mechanisms responsible for the successful generalisation properties of deep networks? Also, the basic question of why deep learning over large datasets is so much more effective than earlier machine learning approaches is still essentially open [7]. These questions are not in general answered by studies of capacity. A more direct approach, based on actual trained networks and the mappings they are efficiently able to produce, seems needed in order to answer them. It seems ever more likely, for instance, that the ability of deep networks to generalize is connected with some sort of restriction of the mappings that they can theoretically produce, and that these mappings are ideally adapted to the problems for which deep learning has proven successful.

Due to the complexity of deep networks, the actual computation of how input domains are mapped to output classifiers has been considered prohibitively difficult. From general considerations of networks with rectifier (ReLU) non-linearities we know that these functions must be piecewise linear [10], but the relation between network parameters, such as convolutional filter weights and fully connected layer parameters, and the actual functions remains largely obscure. In general, work has therefore been concentrated on empirical studies of actual trained networks [6, 8, 9]. Recently, however, there have been attempts to understand the relation between networks and their mapping properties from a more general and theoretical point of view. This has included specific procedures for generating preimages of network outputs [4] and more systematic studies of the nature of the piecewise linear functions and mappings involved in deep networks [2, 11, 15].

In this work we make the assertion that understanding the geometry of deep networks and the manifolds of data they process is an effective way to understand the comparative success of deep networks. We will consider convolutional networks with ReLU non-linearities. These can be completely characterised by the corresponding hyperplanes associated with individual convolutional kernels. We will demonstrate that the individual arrangement of hyperplanes inside a layer and the relative arrangement between layers is crucial to understanding the success of various deep network structures and how they map data from input domains to output classifiers.

We will consider only the convolutional part of a deep network with a single channel. We will assume no subsampling or max pooling. This will allow us to get a clear understanding of the role of the convolutional part. A more complete analysis involving multiple channels and fully connected layers is possible but more complex and will be left to future work.

The focus of our study is to analyse how domains of input data are mapped through a deep network. A complete understanding of this mapping and its inverse or preimage will give a detailed description of the workings of the network.
Since we are not considering the final fully connected layers, we will demonstrate how to compute in detail the structure of the input data manifold that can be mapped to a specified reduced-dimensionality affine manifold in the activity space of the final convolutional output layer. This flattening of input data is often considered a necessary preprocessing step for efficient classification.

The understanding of mappings between layers will be based on the specific understanding of how to compute preimages of network activities. We will recapitulate and extend the work in [4] based on the construction of a dual basis from an arrangement of hyperplanes. By specialising to convolutional networks we will demonstrate that the arrangement of hyperplanes associated with a specific layer can be effectively described by a regular multidimensional polyhedral cone oriented in the identity direction in the input space of the layer. Cones associated with successive layers are then in general partly nested inside their predecessors. This leads to efficient contraction and shaping of the input domain data manifold. In general, however, contraction and shaping are in conflict, in the sense that efficient contraction implies less efficient shaping. We will argue that this conflict is resolved by extending the number of layers of the network, with small incremental updates of the filters at each layer.

The main contribution of the paper is the exploitation of the properties of nested cones in order to explain how non-linear manifolds can be shaped and contracted so as to comply with the distribution of actual class manifolds and to enable efficient preprocessing for the final classifier stages of the network. We specifically demonstrate the capability of the convolutional part of the network to flatten non-linear input manifolds, which has previously been suggested as an important preprocessing step in object recognition [5, 12].

Transformations between layers in a network with ReLU as the non-linear element can be written as

$$y = [Wx + b]_+ \quad (1)$$

where $[\cdot]_+$ denotes the ReLU function $\max(0, x_i)$ applied component-wise to the elements of the input vector $x$, which will be confined to the positive orthant of the $d$-dimensional Euclidean input space. It divides the components of the output vector $y$ into two classes depending on the location of the input $x$:

$$\{ j : w_j^T x + b_j > 0 \} \;\rightarrow\; y_j = w_j^T x + b_j$$
$$\{ i : w_i^T x + b_i \le 0 \} \;\rightarrow\; y_i = 0 \quad (2)$$

In order to analyse the way domains are mapped through the network we will be interested in the set of inputs $x$ that can generate a specific output $y$:

$$P(y) = \{ x : y = [Wx + b]_+ \} \quad (3)$$

This set, known as the preimage of $y$, can be empty, contain a unique element $x$, or consist of a whole domain of the input space. The last case is quite obvious considering the ReLU non-linearity, which maps whole half spaces of the input domain to 0-components of the output $y$.

The preimage will depend on the location of the input relative to the arrangement of the hyperplanes defined by the affine part of the mapping:

$$\Pi_i = \{ x : w_i^T x + b_i = 0 \} \quad i = 1, 2, \dots, d \quad (4)$$

These hyperplanes divide the input space into a maximum of $2^d$ different cells, with the maximum attained if all hyperplanes cut through the input space, which we take as the non-negative orthant of the $d$-dimensional Euclidean space $R^d_+$.
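As a concrete illustration of eqs. (1)-(4), the following minimal NumPy sketch applies one ReLU layer and counts the cells of the hyperplane arrangement realised in the orthant. The dimension, weights and biases are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d))   # random weights: d hyperplanes in general position
b = rng.normal(size=d)

def layer(x):
    """Eq. (1): affine map followed by the component-wise ReLU."""
    return np.maximum(0.0, W @ x + b)

# Eq. (2): the sign pattern of Wx + b decides which components survive.
x = rng.uniform(0, 1, size=d)           # a point in the non-negative orthant
print("output:", layer(x))
print("cell sign pattern:", tuple(W @ x + b > 0))

# Eq. (4): the d hyperplanes divide the input space into at most 2^d cells.
samples = rng.uniform(0, 5, size=(100_000, d))
cells = {tuple(row) for row in (samples @ W.T + b > 0)}
print(f"{len(cells)} of {2**d} possible cells realised in the sampled orthant")
```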
Understanding the arrangement of these hyperplanes in general, and especially in the case of convolutional mappings, will be central to our understanding of how input domains are contracted and collapsed through the network.

The preimage problem can be treated geometrically using these hyperplanes as well as the constraint input domains defined by them. For a given output $y$ we denote the components where $y_j > 0$ as $j_1, j_2, \dots, j_q$ and the complementary index set where $y_i = 0$ as $i_1, i_2, \dots, i_p$. With each positive component of $y$ we can associate a hyperplane:

$$\Pi^*_j = \{ x : y_j = w_j^T x + b_j \} \quad j = j_1, j_2, \dots, j_q \quad (5)$$

which is just the hyperplane $\Pi_j$ translated by the output $y_j$. For the 0-components of $y$ we can define the half spaces

$$X^-_i = \{ x : w_i^T x + b_i \le 0 \} \quad i = i_1, i_2, \dots, i_p \quad (6)$$

i.e. the half space cut out by the negative side of the plane $\Pi_i$. These planes and half spaces, together with the general input domain constraint of being inside $R^d_+$, define the preimage constraints given the output $y$. If we define the affine intersection subspace

$$\Pi^* = \Pi^*_{j_1} \cap \Pi^*_{j_2} \cap \dots \cap \Pi^*_{j_q} \quad (7)$$

and the intersection of half spaces

$$X^- = X^-_{i_1} \cap X^-_{i_2} \cap \dots \cap X^-_{i_p} \quad (8)$$

the preimage of $y$ can be defined as:

$$P(y) = \Pi^* \cap X^- \cap R^d_+ \quad (9)$$

The constraint sets and the preimage set are illustrated in figure 1 for the case of $d = 3$ and various outputs $y$ with different numbers of 0-components.

For fully connected networks, computing the preimage set amounts to finding the intersection of an affine subspace with a polytope in $d$-dimensional space. This problem is known to be exponential in $d$ and therefore intractable. However, we will see that this situation changes substantially when we consider convolutional instead of fully connected networks.

In order to get more insight into the nature of preimages we will devise a general method of computation that highlights the nature of the arrangement of hyperplanes. The set of hyperplanes $\Pi_i$, $i = 1 \dots d$, will be assumed to be in general position, i.e. no two planes are parallel. The intersection of all hyperplanes excluding plane $i$,

$$S_i = \Pi_1 \cap \Pi_2 \cap \dots \cap \Pi_{i-1} \cap \Pi_{i+1} \cap \dots \cap \Pi_d \quad (10)$$

is then a one-dimensional affine subspace $S_i$ that is contained in every hyperplane $\Pi_j$ excluding $j = i$. For all $i$ we can define vectors $e_i$ in $R^d$ parallel to $S_i$. The general position of the hyperplanes then guarantees that the set $e_i$ is complete in $R^d$. By translating all vectors $e_i$ to the point in $R^d$ which is the mutual intersection of all planes $\Pi_i$,

$$\Pi_1 \cap \Pi_2 \cap \dots \cap \Pi_d \quad (11)$$

they can therefore be used as a basis that spans $R^d$. This construction also has the property that the intersection of the subset of hyperplanes

$$\Pi_{j_1} \cap \Pi_{j_2} \cap \dots \cap \Pi_{j_q} \quad (12)$$

is spanned by the complementary dual basis set

$$e_{i_1}, e_{i_2}, \dots, e_{i_p} \quad (13)$$

The dual basis can now be used to express the solution to the preimage problem. The affine intersection subspace $\Pi^*$ associated with the positive components $j_1, j_2, \dots, j_q$ of the output $y$ is spanned by the complementary vectors associated with the negative components $i_1, i_2, \dots, i_p$. These indices also define the hyperplanes $\Pi_{i_1}, \dots, \Pi_{i_p}$ that constrain the preimage to lie in the intersection of the half spaces associated with their negative sides.

We now define the positive direction of the vector $e_i$ as that associated with the negative side of the plane $\Pi_i$. If we consider the intersection of the subspace $\Pi^*$ and the subspace generated by the intersection of the hyperplanes $\Pi_i$ associated with the negative components of $y$ we get:
$$\Pi^*_{j_1} \cap \Pi^*_{j_2} \cap \dots \cap \Pi^*_{j_q} \cap \Pi_{i_1} \cap \dots \cap \Pi_{i_p} \quad (14)$$

Due to the complementarity of the positive and negative indices, this is a unique element $x^* \in R^d$ (marked "output" in figure 1), which lies in the affine subspace of the positive output components $\Pi^*$ as well as on the intersection of the boundary hyperplanes $\Pi_i$ that make up the half space intersection constraint $X^-$ for the preimage. If we take the subset of the dual basis vectors $e_{i_1}, e_{i_2}, \dots, e_{i_p}$ and move them to this intersection element, they will span the part of the negative constraint region $X^-$ associated with the preimage. That is, the preimage of the output $y$ is given by:

$$P(y) = \{ x \in R^d_+ : x = x^* + \sum_{i = i_1}^{i_p} \alpha_i e_i, \;\; \alpha_i \ge 0 \} \quad (15)$$
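The dual basis construction and the preimage formula (15) can be checked numerically. The sketch below uses the fact that, for a full-rank $W$, column $i$ of $W^{-1}$ is parallel to the line $S_i$ of eq. (10); the layer is a random illustrative one, and the additional constraint $x \in R^d_+$ of eq. (15) is ignored for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
Winv = np.linalg.inv(W)

# Dual basis (eqs. 10-13): moving along column i of W^{-1} changes only output
# component i, so that column is parallel to the line S_i shared by all planes
# except plane i. Orient each e_i toward the negative side of its plane.
E = -Winv

# A target output with one zero component, so q = 2 and p = 1 here.
y = np.array([0.0, 0.7, 1.2])
zero = np.flatnonzero(y == 0)

# The unique element x* of eq. (14): solving Wx + b = y exactly places x on the
# planes of the zero components and on the translated planes of the rest.
x_star = Winv @ (y - b)

# Eq. (15): x* plus the positive span of the dual basis vectors of the zero
# components reproduces y under the ReLU layer.
for a in rng.uniform(0, 2, size=(5, len(zero))):
    x = x_star + E[:, zero] @ a
    assert np.allclose(np.maximum(0.0, W @ x + b), y)
print("all sampled preimage points map back to y")
```

The assertion passes because moving along $e_i$ changes only output component $i$, pushing it to the negative side, which the ReLU then clips to zero.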
Figure 1: Left: 3 planes in general position and the preimage (red) of the output (black dot); the $e_2$ and $e_3$ components of the dual basis are used to generate the preimage. Right: Polyhedral cone of 3 planes from a circulant layer transformation matrix (two different views). The nesting property of the cone will refer to its ability to "grip" the coordinate axes of the input space.

We will now specialise to the standard case of convolutional networks. In order to emphasize the basic role of geometric properties we will consider only a single channel with no subsampling. Most of what we state will generalize to the more general case of multiple channels with different convolutional kernels but needs a more careful analysis. We will exploit the fact that convolutional matrices are in most respects asymptotically equivalent to circulant matrices, where each new row is a one-element cyclic shift of the previous. For any convolution matrix we will consider the corresponding circulant matrix that appends rows at the end to make it square and circulant. Especially when the support of the convolution is small relative to the dimension $d$, typically of the order of 10 in relation to 1000, this approximation will be negligible. Except for special cases the corresponding circulant will be full rank, which means that the properties of the dual basis etc. derived previously apply here as well. As is standard, we will assume that the bias $b$ is the same for all applications of the convolution kernel.

The first thing to note about the hyperplanes associated with circulant matrices is that they all intersect on the identity line going through the origin and the point $(1, 1, \dots, 1)$.
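Before deriving this formally, here is a small numerical check, with an invented kernel and bias: a short convolution kernel is embedded in a circulant matrix, and all its hyperplanes pass through a common point on the identity line, the apex $x_j = -b/a$ derived below:

```python
import numpy as np

# A hypothetical small-support kernel embedded in a d x d circulant matrix:
# each row is a one-element cyclic shift of the previous one.
d, kernel, bias = 8, [1.0, 0.5, -0.25], -1.0
row0 = np.zeros(d)
row0[:len(kernel)] = kernel
C = np.stack([np.roll(row0, i) for i in range(d)])

# Every row has the same sum a, so x_j = -b/a for all j solves all d plane
# equations at once: the hyperplanes meet on the identity line (eq. 16).
a = C[0].sum()
apex = np.full(d, -bias / a)
print(np.allclose(C @ apex + bias, 0.0))    # True: common apex of the cone
```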
Denote the circulant matrix as $C$ with elements $c_{i,j}$. The circulant property implies $c_{i+1,j} = c_{i,j-1}$ for $i = 1 \dots d-1$, $j = 2 \dots d$, and $c_{i+1,1} = c_{i,d}$: each row is shifted one step cyclically relative to the previous. For the hyperplane corresponding to row $i$ we have:

$$\sum_{j=1}^{d} c_{i,j} x_j + b = 0 \quad (16)$$

It is easy to see that the circulant property implies that the sum of all elements along a row is the same for all rows. Let the sum of a row be $a$. We then get $x_j = -b/a$ for $j = 1 \dots d$ as a solution to this system of equations, which is a point on the identity line in $R^d$.

The arrangement of the set of hyperplanes

$$w_i^T x + b = 0 \quad i = 1 \dots d \quad (17)$$

with $w_i^T$ the $i$:th row of the circulant augmented convolutional matrix $W$, will be highly regular. Consider a cyclic permutation $Px$ of the components of the input $x$ described by the single shift matrix $P$, i.e. $x_i$ is mapped to $x_{i+1}$ for $i = 1 \dots d-1$ and $x_d$ is mapped to $x_1$. We then get:

$$w_i^T P x + b = w_{i+1}^T x + b = 0 \quad i = 1 \dots d-1$$
$$w_d^T P x + b = w_1^T x + b = 0 \quad (18)$$

which states that points on the hyperplane associated with weights $w_i$ are mapped to the hyperplane associated with weights $w_{i+1}$. The hyperplanes associated with the weights $w_i$, $i = 1 \dots d$, therefore form a regular multidimensional polyhedral cone in $R^d$ around the identity line, with the apex located at $x^T = (-b/a, -b/a, \dots, -b/a)$, controlled by the bias $b$ and the sum of filter weights $a$. Geometrically, the cone is determined by the apex location, the angle of the planes to the central identity line and its rotation in $d$-dimensional space. Apex location and angle are two parameters, which leaves $d - 2$ degrees of freedom for the rotation of the cone in $R^d$. This maximum degree of freedom is however attained only for unrestricted circulant transformations. The finite support of the convolution weights in CNNs will heavily restrict rotations of the cone. The implications of this will be discussed later.

Any transformation between two layers in a convolutional network can now be considered as a mapping between two regular multidimensional polyhedral cones that are symmetric around the identity line in $R^d$. The coordinate planes of the input space $R^d_+$ can be modelled as such a cone, as well as the output space given by the convolution. The strong regularity of these cones will of course impose strong regularities on the geometric description of the mapping between layers. Just as in the general case, this transformation can be broken down into transformations between intersection subspaces of the two cones.

In order to get an idea of this we will start with a simple multi-layer network with two-dimensional input and output and a circulant transformation:

$$x_1^{(l+1)} = [\, a^{(l)} x_1^{(l)} + b^{(l)} x_2^{(l)} + c^{(l)} \,]_+$$
$$x_2^{(l+1)} = [\, b^{(l)} x_1^{(l)} + a^{(l)} x_2^{(l)} + c^{(l)} \,]_+ \quad (19)$$

Figure 2 illustrates the mapping of data from the input space $(x_1, x_2)$ to the output space $(y_1, y_2)$ for two networks with 3 and 6 layers respectively. The dashed lines represent successive preimages of data that map to a specific location at a layer. By connecting them we get domains of input data mapped to the same output at the final layer, i.e. they are contraction flows depicting how data is moved through the network.
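A rough numerical counterpart of figure 2 can be obtained by iterating eq. (19) on a grid of inputs. The layer parameters below are invented for illustration; the count of distinct outputs shows how whole input domains collapse:

```python
import numpy as np

def layer2(x, a, b, c):
    """Eq. (19): one 2-node circulant ReLU layer."""
    return np.maximum(0.0, np.array([a * x[0] + b * x[1] + c,
                                     b * x[0] + a * x[1] + c]))

# Per-layer parameters (a, b, c), invented for illustration.
params = [(1.0, 0.2, -0.3), (0.9, 0.3, -0.2), (1.1, 0.1, -0.25)]

# Push a grid over the input square through the layers and count distinct
# outputs: whole input domains collapse onto single points, the contraction
# flows sketched by the dashed lines of figure 2.
grid = [np.array([u, v]) for u in np.linspace(0, 1, 50)
        for v in np.linspace(0, 1, 50)]
outputs = set()
for x in grid:
    for a, b, c in params:
        x = layer2(x, a, b, c)
    outputs.add(tuple(np.round(x, 6)))
print(f"{len(grid)} inputs -> {len(outputs)} distinct outputs")
```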
Figure 2: Illustration of how data in the 2d input space $(x_1, x_2)$ contracts by successive layers in multi-layer 2-node networks with circulant transformations. Left: Transformation $(a, b) = (1, 0)$; only the bias differs between layers. Alternating red and blue frames show successive layers remapped to the input space. Right: Arbitrary circulant transformations. Note how the nesting property ensures a good variation in the generated manifolds. When nesting becomes less pronounced for higher values of the input, the variation of the manifolds diminishes.

Note that in both networks the major part of the input domain is mapped to the output $(0, 0)$. The first network has $a = 1$, $b = 0$, and the domains of the input that are mapped to the output are just quite trivial planar manifolds. The second network, with more varied weights, illustrates how input domain manifolds with more structure can be created. It also demonstrates the importance of the concept of "nested cones" and how this affects the input data manifolds. The red lines represent data that is associated with layer cones that are completely nested inside their predecessors, while the black lines represent data where the succeeding cone has a wider angle than its predecessor. When this happens, the hyperplanes associated with the output cone will intersect the hyperplanes of the input cone, and input data beyond this intersection is just transformed linearly. Since all data in figure 2 is remapped to the input space, this has the effect that data is not transformed at all, and it has no effect on the shaping of the input manifold. One could say that these layers are "wasted" beyond the location of the intersection as far as network properties are concerned, since they contribute neither to the shaping nor to the contraction of input data manifolds. The effect of this on the input manifold can be seen as a less diverse variation of its shape (black) compared to the previous part associated with the completely nested part of the layer cones.

In higher dimensions the effects of nested vs. partially nested cones appear in the same way but are more elaborate. In addition to the 2d case we also have to consider rotations of the cone, which, as was pointed out earlier, have $d - 2$ degrees of freedom in $d$-dimensional space. The effects of contraction of data from higher to lower dimensions also become more intricate as the number of different subspace dimensionalities increases. Most of these effects can be illustrated with 3-dimensional input and output spaces. For $d = 3$ the generic circulant matrix can be used to define a layer transformation:

$$y_1 = [\, a x_1 + b x_2 + c x_3 + d \,]_+$$
$$y_2 = [\, c x_1 + a x_2 + b x_3 + d \,]_+$$
$$y_3 = [\, b x_1 + c x_2 + a x_3 + d \,]_+ \quad (20)$$

The transformation properties of this network are most easily illustrated if we start with the pure bias case with transformation $W = I$, i.e. $a = 1$, $b = 0$, $c = 0$. A specific element in input space is mapped according to its position relative to the hyperplanes. If we use the dual basis to define the coordinates of the output data, the mapping for the input element will be the same in input cells with the same relation to all hyperplanes. In $d$ dimensions, the hyperplanes divide the input space into $2^d$ cells where elements are mapped to a specific associated intersection subspace in the output domain. The grey shaded boxes in figure 3 indicate two cells with different numbers of negative constraints, 1 and 2 respectively.
The content of the upper one, with one negative constraint, including all its bounding faces and their intersections, is mapped to a specific 2d hyperplane in the output domain, while the content of the lower one, with two negative constraints, is mapped to the 1d intersection of two hyperplanes. This illustrates the most important property of the nesting of the cones associated with the input and output layer: for a range of transformations in the vicinity of the identity mapping, the input space, properly normalised in range, is divided into cells where the elements of the cells, including their bounding faces and their intersections, are mapped to output intersection subspaces of equal or lower dimension. This means that the content of the cell is irreversibly contracted to lower dimensions.

Figure 3 also contains examples of preimages of individual elements (dark shaded grey rectangles) and the components of the dual bases used to span these. Note that these are affected by changing the angle of the output cone. It introduces a limit of the nesting beyond which the mapping properties of the transformation are changed, so that data no longer maps to a manifold of equal or lower dimensionality, i.e. the contraction property is lost in those regions of the input space where the nesting of cones ceases.

We will formally define this important property of nested cones as follows:

Let $R^d_+$ be the non-negative orthant of Euclidean $d$-space. Let $\Pi_i$ be the hyperplane defined as $x_i = 0$ for $(x_1 \dots x_d) \in R^d_+$. Consider a set of corresponding hyperplanes $\bar\Pi_1 \dots \bar\Pi_d$ in $R^d_+$ associated with a circulant matrix. Take a subset $i_1 \dots i_p$ of these hyperplanes and form the intersection subset $M_{i_1 \dots i_p} = \bar\Pi_{i_1} \cap \dots \cap \bar\Pi_{i_p}$. If for each $x$ in $M_{i_1 \dots i_p}$ the positive span $S^+(e_{j_1} \dots e_{j_q})$ of the associated dual basis contains an element in the corresponding intersection subset $\Pi_{i_1} \cap \dots \cap \Pi_{i_p}$, and thereby in each subset of planes with these indices but no other subset, we say that the cone formed by the hyperplanes $\bar\Pi$ is completely nested in the cone formed by the hyperplanes $\Pi$.
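The non-increasing dimension property is easy to verify numerically in the pure bias case discussed above. In this sketch the number of zero components of a point, i.e. the codimension of the coordinate intersection subspace containing it, never decreases through the layer:

```python
import numpy as np

rng = np.random.default_rng(2)
bias = -0.4    # pure bias layer: eq. (20) with a = 1, b = c = 0

def codim(v, tol=1e-12):
    """Number of zero components: the codimension of the coordinate
    intersection subspace the point lies on."""
    return int(np.sum(np.abs(v) < tol))

# Sample points in cells, on faces and on edges of R^3_+ and verify that the
# layer never decreases the zero count: data only moves to intersection
# subspaces of equal or lower dimension.
ok = True
for _ in range(10_000):
    x = rng.uniform(0, 2, size=3)
    x[rng.random(3) < 0.3] = 0.0          # push some points onto faces/edges
    y = np.maximum(0.0, x + bias)
    ok = ok and codim(y) >= codim(x)
print("dimension non-increasing for all samples:", ok)
```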
Figure 3: Illustration of how data maps between intersection subspaces in two successive layers of 3-node networks with circulant transformations.
Left: Identity transformation and bias only. An element in the input space maps according to its cell location, determined by its sign relative to the output hyperplanes. The figure illustrates how two different cells, dark grey and light grey, are mapped to the red 2d plane and the red 1d line in the output layer. Note that 3d volumes, 2d faces and 1d edges all map to the same output intersection subspace. This means that the dimension of the subspace location of an element is non-increasing, which implies that data gets contracted. This is a consequence of the nesting properties of the output and input polyhedral cones.
Right:
The transformation is now such that the angle of the cone is increased. The planes of the output layer now intersect the coordinate axes of the input space. Beyond the intersections the non-increasing contraction property ceases. The dark grey areas indicate preimages of points located on the output 1d coordinate axes. Note how they decrease in size between the left and right examples.

We see that this definition implies that the cone formed by the planes $\bar\Pi$ is completely contained in that formed by the planes $\Pi$, but also that its relative rotation is restricted. We will have reason to relax the condition of inclusion for all elements $i_1 \dots i_p$ in the intersection subset and talk about cones with restricted nesting. Complete nesting implies contraction of data from one layer to the next, which can be seen from the fact that all elements of the complete intersection subset $M_{i_1 \dots i_p} = \bar\Pi_{i_1} \cap \dots \cap \bar\Pi_{i_p}$ are mapped to the intersection subset $\Pi_{i_1} \cap \dots \cap \Pi_{i_p}$ with the same dimensionality. In addition, elements from intersection subsets formed by subsets of the indices $i_1 \dots i_p$, which are of higher dimension, will also be mapped to this same intersection subset. Consequently, the mapping of data between layers is from intersection subsets to intersection subsets of equal or lower dimension. This is the crucial property connecting the degree of nesting with the degree of contraction.

Going further through the network to higher layers, this contraction is iterated and data is increasingly concentrated on intersection subspaces of lower dimension, which is reflected by the increased sparsity of the nodes of the network. The convolutional part of a deep network can therefore be seen as a component in a metric learning system, where the purpose of the training is to create domains in input space associated with different classes that are mapped to separate low-dimensional outputs, as a preprocessing for the final fully connected layers that will make efficient separation of classes possible.

There is therefore a conflict between the diversity of input manifolds that contract to low-dimensional outputs and the degree of contraction that can be generated in a convolutional network. The efficient resolution of this conflict seems to lie in increasing the number of layers in the network, in order to be able to shape diverse input manifolds, but with small incremental convolution filters that retain the nesting property of the cone in order to preserve the proper degree of contraction. Empirically, this is exactly what has been demonstrated to be the most efficient way to increase the performance of deep networks [13, 14].

We are now in a position to give a general characterization of the preimage corresponding to a specified output domain at the final convolutional output layer, assuming the property of nested layer cones. Ideally we would like to include the final fully connected layers in this analysis, but this will require a special study since we cannot assume the nesting property to be valid for these.
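Before turning to that, the iterated contraction and the resulting sparsity described above can be observed in a toy stack of near-identity circulant layers. All parameters below are invented, so this is only a qualitative illustration of the claimed behaviour:

```python
import numpy as np

rng = np.random.default_rng(3)
d, depth, n = 64, 12, 200

def circulant(kernel):
    row = np.zeros(d)
    row[:len(kernel)] = kernel
    return np.stack([np.roll(row, i) for i in range(d)])

# Layers built from small incremental perturbations of the identity filter,
# each with a slightly negative bias: sparsity grows with depth as data
# concentrates on lower-dimensional intersection subspaces.
x = rng.uniform(0, 1, size=(n, d))
for l in range(depth):
    W = circulant(np.array([1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=3))
    x = np.maximum(0.0, x @ W.T - 0.05)
    print(f"layer {l + 1:2d}: fraction of zero nodes = {(x == 0).mean():.2f}")
```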
In the endthe network should map different classes to linearly separable domains in orderto enable efficient classification. It is generally suggested that the preprocess-ing part of a network corresponds to flattening nonlinear input manifolds inorder to achieve this final separation at the output. In order to be able to drawas general conclusions as possible we shall demonstrate the exact structure ofa nonlinear input manifold that maps to a prespecified affine manifold at thefinal convolutional layer. We denote this manifold M and the output at thefinal convolutional layer by x ( l ) . The final layer can be characterised by theset of hyperplanes: Π ( l )1 . . . Π ( l ) d . Let the zero components of the output x ( l ) be i , i . . . i q . It can then be associated with the intersection of the outputmanifold and the corresponding hyperplanes M ∩ Π ( l ) i ∩ Π ( l ) i ∩ . . . ∩ Π ( l ) i q (21)11he degree of intersection q will depend on the dimensionality of M . If M is a d − l . This is the maximum complexity situation thatwill generate a d − M means reducing the possible intersection with combinations of hyperplanes.Note that if M intersects the set the intersection of planes i , i . . . i p it alsointersects the intersection of any subset of these. Intersecting M with each ofthese subsets will generate pieces of intersections linked together. These areaffine subsets with different dimensionality and the preimage of each piece willbe generated by complementary dual basis components. This is illustrated byfigure 6 for the case of an affine plane in R intersecting to give a triangulardomain. In this case we have three points on the coordinate axis, and three linesconnecting these. The three points will all span 2d planes bases on different pairsof complementary dual basis components. In addition to these, the points on thelines of the triangle generated by intersecting M with each of the three individualoutput planes will generate 1 d lines that jointly will span a 2 d plane. This planewill connect continuously with the planes spanned from the points on the axis toyield a piecewise planar input manifold to the final layer. Continuing throughthe network, this piecewise planar manifold will intersect with the planes oflayer l − M at the next layer. Note howeverthat the nested cone property substantially reduces complexity compared to thegeneral case of arbitrary hyperplanes.It should be pointed out that these manifold do not necessarily correspondto actual class manifolds since we are not considering the complete networkwith fully connected layers. They can however be considered as more elaborateand specific building blocks in order to construct the actual class manifolds ofa trained network. We have defined a formal procedure for computing preimages of deep lineartransformation networks with ReLU non linearities using the dual basis ex-tracted from the set of hyperplanes representing the transformation. Special-ising to convolutional networks we demonstrate that the complexity and thesymmetry of the arrangement of corresponding hyperplanes is substantially re-duced and we show that these arrangements can be modelled closely with mul-tidimensisional regular polyhedral cones around the identity line in input space.We point out the crucial property of nested cones which guarantees efficientcontraction of data to lower dimensions and argue that this property could be12elevant in the design of real networks. 
We have defined a formal procedure for computing preimages of deep linear transformation networks with ReLU non-linearities, using the dual basis extracted from the set of hyperplanes representing the transformation. Specialising to convolutional networks, we demonstrate that the complexity and the symmetry of the arrangement of the corresponding hyperplanes is substantially reduced, and we show that these arrangements can be modelled closely by multidimensional regular polyhedral cones around the identity line in input space. We point out the crucial property of nested cones, which guarantees efficient contraction of data to lower dimensions, and argue that this property could be relevant in the design of real networks.

By increasing the number of layers used to shape input manifolds in the form of preimages, we can retain the nested cone property that most efficiently exploits network data in order to construct input manifolds that comply with manifolds corresponding to real classes; this would explain the success of ever deeper networks for deep learning. Retaining the nested cone property can be expressed as a limitation of the degrees of freedom of multidimensional rotation of the cones. Since convolutional networks essentially always have convolutions of limited spatial support, this is to a high degree built in to existing systems. The desire to retain the property of nesting could however act as an extra constraint to further reduce the complexity of the convolutions. This of course means that the degrees of freedom are reduced for a network, which could act as a regularization constraint and potentially explain the puzzling efficiency of generalisation of deep networks in spite of a high number of parameters.

We demonstrate that it is in principle possible to compute non-linear input manifolds that map to affine output manifolds. This demonstrates the ability of deep convolutional networks to achieve flattening of input data, which is generally considered an important preprocessing step for classification. Since we do not consider a complete network with fully connected layers at the end, we cannot give details of how classification is achieved. The explicit demonstration of non-linear manifolds that map to affine outputs however indicates a possible basic structure of input manifolds for classes. It is easy to see that a parallel translation of the affine output manifold would result in two linearly separable manifolds that would be generated by essentially parallel translated non-linear manifolds in the input space. This demonstrates that convolutional networks can be designed to exactly separate sufficiently "covariant" classes, that this could be the reason for the relative success of convolutional networks over previous machine learning approaches to classification, and it would explain why using a large number of classes for training is advantageous, since they all contribute to very similar individual manifolds.

Disregarding these speculations, the fact remains that these manifolds will always exist, since they are derived on purely formal grounds from the structure of the network. If they have no role in classification, their presence will have to be explained in other ways.
References

[1] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

[2] Ronen Basri and David W. Jacobs. Efficient representation of low-dimensional manifolds using deep networks. CoRR, abs/1602.04723, 2016.

[3] Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pages 18-36. Springer, 2011.

[4] Stefan Carlsson, Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, and Kevin Smith. The preimage of rectifier network activities. In International Conference on Learning Representations (workshop), 2017.

[5] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333-341, 2007.

[6] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4829-4837, 2016.

[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[8] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[9] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision (IJCV), 2016.

[10] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.

[11] Randall Balestriero and Richard Baraniuk. A spline theory of deep learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[12] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.

[15] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
Figure 4: Piecewise planar manifold in 3d input space that maps to an affine manifold (blue triangle) at the final convolutional layer in a 3-node 3-layer network with circulant transformations. All data is remapped to the input space.
Left: Red patches are mapped to 0-dimensional red points on the three output coordinate axes; blue patches are mapped to 1d lines connecting the points (dark red is outside, light red is inside of the manifold).
Right: Patches that are generated by selective components of the dual basis at each layer. The positive span generated by selective components of the dual basis, emanating from the red output points on the triangle as well as from each intersection with coordinate lines in earlier layers, intersects with the arrangement of hyperplanes representing the preceding layer. The 1d intersections are then used as seed points for new spans that intersect the next preceding layer, etc. The 2d intersections together with selective edges from the spans generate linking patches that ensure the continuity of the input manifold.