Deep Function Machines: Generalized Neural Networks for Topological Layer Expression
William H. Guss
Machine Learning at Berkeley, University of California, Berkeley
[email protected]

ABSTRACT
In this paper we propose a generalization of deep neural networks called deep function machines (DFMs). DFMs act on vector spaces of arbitrary (possibly infinite) dimension, and we show that a family of DFMs is invariant to the dimension of input data; that is, the parameterization of the model does not directly hinge on the quality of the input (e.g. high resolution images). Using this generalization we provide a new theory of universal approximation of bounded non-linear operators between function spaces. We then suggest that DFMs provide an expressive framework for designing new neural network layer types with topological considerations in mind. Finally, we introduce a novel architecture, RippLeNet, for resolution invariant computer vision, which empirically achieves state of the art invariance.
1 INTRODUCTION
In recent years, deep learning has radically transformed a majority of approaches to computer vision, reinforcement learning, and generative models [Schmidhuber (2015)]. Theoretically, we still lack a unified description of what computational mechanisms have made these deeper models more successful than their wider counterparts. Substantial analysis by Shalev-Shwartz et al. (2011), Raghu et al. (2016), Poole et al. (2016), and many others gives insight into how the properties of neural architectures, like depth and weight sharing, determine the expressivity of those architectures. Less studied, however, is how the properties of data, such as sample statistics or geometric structure, determine the architectures which are most expressive on that data.

Surprisingly, the latter perspective leads to simple questions without answers rooted in theory. For example, what topological properties of images allow convolutional layers such expressivity and generalizability thereon? Intuitively, spatial locality and translation invariance are sufficient justifications in practice, but is there a more general theory which suggests the optimality of convolutions? Furthermore, do there exist weight sharing schemes beyond convolutions and fully connected layers that give rise to provably more expressive models in practice? In this paper, we will more concretely study the data-architecture relationship and develop a theoretical framework for creating layers and architectures with provable properties subject to topological and geometric constraints imposed on the data.
The Problem with Resolution.
To motivate a use for such a framework, we consider the problem of learning on high resolution data. Computationally, machine learning deals with discrete signals, but frequently those signals are sampled in time from a continuous function. For example, audio is inherently a continuous function $f : [0, t_{end}] \to \mathbb{R}$, but is sampled as a vector $v \in \mathbb{R}^{1 \times t}$. Even in vision, images are generally piecewise smooth functions $f : \mathbb{R}^2 \to \mathbb{R}$ mapping pixel position to color intensity, but are sampled as tensors $v \in \mathbb{R}^{x \times y \times c}$. Performing tractable machine learning as the resolution of images or audio increases almost always requires some lossy preprocessing like PCA or Discrete Fourier Analysis [Burch (2001)]. Convolutional neural networks avoid dealing therein by intuitively assuming a spatial locality on these vectors. However, one wonders what is lost through the use of various dimensionality reduction and weight sharing schemes. Note that we do not claim that deep learning on high resolution data is currently intractable or ineffective; the problem of resolution is presented as an example in which topological constraints can be imposed on a type of data to yield new architectures with desired, provable properties.

Figure 1: Left: A discrete vector $v \in \mathbb{R}^{l \times w}$ representation of an image. Right: The true continuous function $f : \mathbb{R}^2 \to \mathbb{R}$ from which it was sampled.

A key observation in discussing a large class of smooth functions is their simplicity. Although from a set theoretic perspective the graph of a function consists of infinitely many points, relatively complex algebras of functions can be described with symbolic simplicity. A great example are polynomials: the space of all square ($x^2$) monomials occupies a one-dimensional vector space, and one can generalize this phenomenon beyond these basic families. Thus we will explore what results from embracing the assumption that a signal is really a sample from a continuous process, and utilize the analytic simplicity of certain smooth functions to derive new layer types.
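To make the sampling picture concrete, the short sketch below (our own illustration, not from the paper; the bump-plus-sine signal and the chosen resolutions are hypothetical) samples a fixed smooth signal at increasing resolutions. The dimension of the sample vector grows without bound while the underlying function stays fixed:

```python
import numpy as np

def xi(u):
    # A fixed smooth "ground truth" signal on [0, 1] (hypothetical example).
    return np.exp(-20.0 * (u - 0.3) ** 2) + 0.5 * np.sin(6.0 * np.pi * u)

for n in [16, 64, 256, 1024]:
    u = np.linspace(0.0, 1.0, n)   # sample positions in E = [0, 1]
    x = xi(u)                      # the discrete signal x in R^n
    print(f"resolution {n:5d}: the sample lives in R^{n}, "
          "but xi itself never changed")
```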
Our Contribution. First, we extend neural networks to the infinite dimensional domain of continuous functions and define deep function machines (DFMs), a general family of function approximators which encapsulates this continuous relaxation and its discrete counterpart. Thereafter, we survey and refocus past analysis of neural networks with infinitely (and potentially uncountably) many nodes with respect to the expressiveness of the maps that they represent. We show that DFMs not only admit most other infinite dimensional neural network generalizations in the literature, but also provide the necessary language to solve two long standing questions of universal approximation raised following Stinchcombe (1999). With the framework firmly established, we then return to our motivating goal of provable deep learning and show that DFMs naturally give rise to neural networks which are provably invariant to the resolution of the input, and indeed that DFMs can be used more generally to construct architectures (e.g. those with convolutions) with provable properties given topological assumptions. Finally, we experimentally verify such constructions by introducing a new type of layer, WaveLayers, apart from convolutions.

2 BACKGROUND
In order to propose deep function machines we must establish what it means for a neural network to act directly on continuous functions. Recall the standard McCulloch & Pitts (1943) feed-forward neural network.
Definition 2.1 (Discrete Neural Networks). We say $\mathcal{N} : \mathbb{R}^n \to \mathbb{R}^m$ is a (discrete) feed-forward neural network iff the following recurrence relation is defined for adjacent layers $\ell \to \ell'$:
$$\mathcal{N} : y_{\ell'} = g\left(W_\ell^T y_\ell\right); \qquad y_0 := x, \qquad (2.1)$$
where $W_\ell$ is a weight tensor and $g$ is a non-polynomial activation function.

Suppose that we wish to map one space of functions to another with a neural network. Consider the model of $\mathcal{N}$ as the number of neurons for every layer becomes uncountable. The index for each neuron then becomes real-valued, along with the weight and input vectors. The process is roughly depicted in Figure 2. The core idea behind the derivation is that as the number of nodes in the network becomes uncountable, we need to apply a normalizing term to the contribution of each node in the evaluation of the following layer so as to avoid saturation. Eventually this process resembles Lebesgue integration.

More formally, let $\mathcal{N}$ be an $L$ layer neural network as given in Definition 2.1. Without loss of generality we will examine the first layer, $\ell = 1$. Let us denote $\xi : X \subset \mathbb{R} \to \mathbb{R}$ as some arbitrary continuous input function for the neural network (see related work).

Figure 2: Left: Resolution refinement of an input signal by simple functions. Right: An illustration of the extension of neural networks to infinite dimensions. Note that $x \in \mathbb{R}^N$ is a sample of $f^{(N)}$, a simple function with $\|f^{(N)} - \xi\| \to 0$ as $N \to \infty$. Furthermore, the process is not actually countable, as depicted here.

Likewise consider a real-valued, piecewise integrable weight function, $\omega_\ell : \mathbb{R}^2 \to \mathbb{R}$, for a layer $\ell$, which is composed with two indexing variables $u \in E_\ell$ and $v \in E_{\ell'}$. In this analysis we will restrict the indices to lie in compact sets $E_\ell, E_{\ell'} \subset \mathbb{R}$. If $f$ is a simple function, then for some finite partition of $E_\ell$, say $u_0 < \cdots < u_n$, we have $f = \sum_{m=1}^{n} \chi_{[u_{m-1}, u_m]}\, p_m$, where for all $u \in [u_{m-1}, u_m]$, $p_m \le \xi(u)$. Visually this is a piecewise constant function underneath the graph of $\xi$. Suppose that some vector $x$ is sampled from $\xi$; then we can make $x$ a simple function by taking an arbitrary partition of $E_\ell$ so that when $u_0 < u < u_1$, $f(u) = x_1$; when $u_1 < u < u_2$, $f(u) = x_2$; and so on. This simple function $f$ is essentially piecewise constant on intervals of uniform length so that on each interval it attains the value of the $n$th component, $x_n$. Finally, if $w_v$ is some simple function approximating the $v$-th row of some weight matrix $W_\ell$ in the same fashion, then $w_v \cdot f$ is also a simple function. Therefore the particular neural layer associated to $f$ (and thereby $x$) is
$$y_v = g(W_\ell^T x)_v = g\left(\sum_{m=1}^{n} W_\ell^{mv}\, x_m\, \mu([u_{m-1}, u_m])\right) = g\left(\int_{E_\ell} w_\ell^v(u)\, f(u)\, d\mu(u)\right), \qquad (2.2)$$
where $\mu$ is the Lebesgue measure on $\mathbb{R}$.

Now suppose that there is a refinement of $x$; that is, returning to our original problem, there is a higher resolution sample of $\xi$, say $f'$ (and thereby $x'$), so that it more closely approximates $\xi$. It then follows that the corresponding refined partition, $u'_0 < \cdots < u'_k$ (where $k > n$), occupies the same $E_\ell$, but individually $\mu([u'_{m-1}, u'_m]) \le \mu([u_{m-1}, u_m])$.
Therefore we weight the contribution of each $x'_n$ less than each $x_n$, in a measure theoretic sense. Recalling the theory of simple functions, without loss of generality assume $\xi, \omega(\cdot, \cdot) \ge 0$. Then, letting
$$F_v = \left\{(w_v, f) : E_\ell \to \mathbb{R} \;\middle|\; f, w_v \text{ simple},\ 0 \le f \le \xi,\ 0 \le w_v \le \omega_\ell(\cdot, v)\right\}, \qquad (2.3)$$
it follows immediately that
$$\sup_{(f, w_v) \in F_v} \int_{E_\ell} w_v(u)\, f(u)\, d\mu(u) = \int_{E_\ell} \omega_\ell(u, v)\, \xi(u)\, d\mu(u). \qquad (2.4)$$
Therefore we give the following definition for infinite dimensional neural networks.

Definition 2.2 (Operator Neural Networks). We call $\mathcal{O} : L^1(E_\ell) \to L^1(E_{\ell'})$ an operator neural network parameterized by $\omega_\ell$ if for two adjacent layers $\ell \to \ell'$
$$\mathcal{O} : y_{\ell'}(v) = g\left(\int_{E_\ell} y_\ell(u)\, \omega_\ell(u, v)\, d\mu(u)\right); \qquad y_0(v) = \xi(v), \qquad (2.5)$$
where $E_\ell, E_{\ell'}$ are locally compact Hausdorff measure spaces, $u \in E_\ell$, and $v \in E_{\ell'}$. It is no loss of generality to extend the results in this work to weight kernels indexed by arbitrary $u, v \in \mathbb{R}^n$, but we omit this treatment for ease of understanding.
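As a sanity check on Definition 2.2, the sketch below (our own illustration; the kernel, input signal, and grid sizes are hypothetical choices) evaluates an operator layer by replacing the integral in (2.5) with a Riemann sum, where the grid spacing plays the role of the measure term $\mu([u_{m-1}, u_m])$ above; refining the input grid barely changes the output:

```python
import numpy as np

def operator_layer(xi, omega, v_grid, n_u, g=np.tanh):
    """Numerically evaluate y(v) = g( integral of xi(u) * omega(u, v) du )."""
    u, du = np.linspace(0.0, 1.0, n_u, retstep=True)
    # Riemann sum over u for every output index v; du normalizes each node.
    return g(np.array([np.sum(xi(u) * omega(u, v)) * du for v in v_grid]))

xi = lambda u: np.sin(2.0 * np.pi * u)               # input function
omega = lambda u, v: np.exp(-10.0 * (u - v) ** 2)    # a smooth weight kernel
v_grid = np.linspace(0.0, 1.0, 32)

coarse = operator_layer(xi, omega, v_grid, n_u=50)
fine = operator_layer(xi, omega, v_grid, n_u=5000)
print(np.max(np.abs(coarse - fine)))  # small: the layer is resolution-stable
```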
3 DEEP FUNCTION MACHINES

With operator neural networks defined, we endeavour to define a topologically inspired framework for developing expressive layer types. A powerful language of abstraction for describing feed-forward (and potentially recurrent) neural network architectures is that of computational skeletons as introduced in Daniely et al. (2016). Recall the following definition.
Definition 3.1.
A computational skeleton $\mathcal{S}$ is a directed acyclic graph whose non-input nodes are labeled by activations.

Daniely et al. (2016) provides an excellent account of how these graph structures abstract the many neural network architectures we see in practice. We will give these skeletons "flesh and skin," so to speak, and in doing so pursue a suitable generalization of neural networks which allows intermediate mappings between possibly infinite dimensional topological vector spaces. DFMs are that generalization.
Definition 3.2 (Deep Function Machines). A deep function machine $\mathcal{D}$ is a computational skeleton $\mathcal{S}$ indexed by $I$ with the following properties:

- Every vertex in $\mathcal{S}$ is a topological vector space $X_\ell$ where $\ell \in I$.
- If nodes $\ell \in A \subset I$ feed into $\ell'$, then the activation on $\ell'$ is denoted $y_{\ell'} \in X_{\ell'}$ and is defined as
$$y_{\ell'} = g\left(\sum_{\ell \in A} T_\ell\left[y_\ell\right]\right) \qquad (3.1)$$
where $T_\ell : X_\ell \to X_{\ell'}$ is some affine form called the operation of node $\ell$.

To see the expressive power of this generalization, we propose several operations $T_\ell$ that not only encapsulate ONNs and other abstractions on infinite dimensional neural networks, but also almost all feed-forward architectures used in practice.

3.1 GENERALIZED NEURAL LAYERS
Generalized neural layers are the basic units of the theory of deep function machines, and they can be used to construct architectures of neural networks with provable properties, such as the resolution invariance we seek. The most basic case is $X_\ell = \mathbb{R}^n$ and $X_{\ell'} = \mathbb{R}^m$, where we should expect a standard neural network. As either $X_\ell$ or $X_{\ell'}$ becomes infinite dimensional, we hope to attain models of functional MLPs from Rossi et al. (2002) or infinite layer neural networks from Globerson & Livni (2016) with universal approximation properties.

Definition 3.3 (Generalized Layer Operations). We suggest several natural generalized layer families $T_\ell$ for DFMs as follows.

- $T_\ell$ is said to be $o$-operational if and only if $X_\ell$ and $X_{\ell'}$ are spaces of integrable functions over locally compact Hausdorff measure spaces, and
$$T_\ell[y_\ell](v) = o(y_\ell)(v) = \int_{E_\ell} y_\ell(u)\, \omega_\ell(u, v)\, d\mu(u). \qquad (3.2)$$
For example, $X_\ell, X_{\ell'} = C(\mathbb{R})$ yields operator neural networks.

- $T_\ell$ is said to be $n$-discrete if and only if $X_\ell$ and $X_{\ell'}$ are finite dimensional vector spaces, and
$$T_\ell[y_\ell] = n(y_\ell) = W_\ell^T y_\ell. \qquad (3.3)$$
For example, $X_\ell = \mathbb{R}^n$, $X_{\ell'} = \mathbb{R}^m$ yields standard feed-forward neural networks.

- $T_\ell$ is said to be $f$-functional if and only if $X_\ell$ is some space of integrable functions as mentioned previously and $X_{\ell'}$ is a finite dimensional vector space, and
$$T_\ell[y_\ell] = f(y_\ell) = \int_{E_\ell} \omega_\ell(u)\, y_\ell(u)\, d\mu(u). \qquad (3.4)$$
For example, $X_\ell = C(\mathbb{R})$, $X_{\ell'} = \mathbb{R}^n$ yields functional MLPs. Nothing precludes the definition from allowing multiple functions as input; the operation must just be carried out on each coordinate function. Note that $y_\ell(u)$ is a scalar function and $\omega_\ell$ is a vector valued function of dimension $\dim(X_{\ell'})$.

- $T_\ell$ is said to be $d$-defunctional if and only if $X_\ell$ is a finite dimensional vector space and $X_{\ell'}$ is some space of integrable functions, and
$$T_\ell[y_\ell](v) = d(y_\ell)(v) = \omega_\ell(v)^T y_\ell. \qquad (3.5)$$
For example, $X_\ell = \mathbb{R}^n$, $X_{\ell'} = C(\mathbb{R})$. Additionally, this definition can easily be extended to function spaces on finite dimensional vector spaces by using the Kronecker product.

Figure 3: Examples of three different deep function machines with activations omitted and $T_\ell$ replaced with the actual type. Left: A standard feed forward binary classifier (without convolution). Middle: An operator neural network. Right: A complicated DFM with residues.

The naturality of the above layer operations comes from their universality and generality.
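The following sketch (our own, with arbitrary grids and kernels) renders the four layer families of Definition 3.3 as plain callables, emphasizing that they differ only in which of the domain and codomain are function spaces; here a "function" is represented by its samples on a fixed grid:

```python
import numpy as np

U = np.linspace(0.0, 1.0, 200)           # grid representing E for the function side
DU = U[1] - U[0]

def o_layer(y, omega):                   # functions -> functions, eq. (3.2)
    return np.array([np.sum(y * omega(U, v)) * DU for v in U])

def n_layer(y, W):                       # vectors -> vectors, eq. (3.3)
    return W.T @ y

def f_layer(y, omegas):                  # functions -> vectors, eq. (3.4)
    return np.array([np.sum(w(U) * y) * DU for w in omegas])

def d_layer(y, omegas):                  # vectors -> functions, eq. (3.5)
    return sum(y_i * w(U) for y_i, w in zip(y, omegas))

y_fn = np.sin(2.0 * np.pi * U)           # a sampled input function
vec = f_layer(y_fn, omegas=[np.cos, np.sin])   # maps the function into R^2
```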
3.2 RELATED WORK AND A UNIFIED VIEW OF INFINITE DIMENSIONAL NEURAL NETWORKS
Operator neural networks are just one of many instantiations of DFMs. Before we show universality results for deep function machines, it should be noted that there has been substantial effort in the literature to explore various embodiments of infinite dimensional neural networks. To the best of the authors' knowledge, DFMs provide a single unified view of every such proposed framework to date.

In particular, Neal (1996) proposed the first analysis of neural networks with countably infinite nodes, showing that as the number of nodes in discrete neural networks tends to infinity, they converge to a Gaussian process prior over functions. Later, Williams (1998) provided a deeper analysis of such a limit on neural networks. A great deal of effort was placed on analyzing covariance maps associated to the Gaussian processes resultant from infinite neural networks with both sigmoidal and Gaussian activation functions. These results were based mostly in the framework of Bayesian learning, and led to a great deal of analyses of the relationship between non-parametric kernel methods and infinite networks, including Le Roux & Bengio (2007), Seeger (2004), Cho & Saul (2011), Hazan & Jaakkola (2015), and Globerson & Livni (2016).

Out of this initial work, Hazan & Jaakkola (2015) define infinite layer neural networks with one or two hidden layers, which map a vector $x \in \mathbb{R}^n$ to a real value by considering infinitely many feature maps $\phi_w(x) = g(\langle w, x \rangle)$, where $w$ is an index variable in $\mathbb{R}^n$. Then for some weight function $u : \mathbb{R}^n \to \mathbb{R}$, the output of an infinite layer neural network is the real number $\int u(w)\, \phi_w(x)\, d\mu(w)$. This approach can be kernelized and has therefore resulted in further theory by Globerson & Livni (2016) that aligns neural networks with Gaussian processes and kernel methods. Operator neural networks differ significantly in that we let each $w$ be freely parameterized by some function $\omega$ and require that $x$ be a continuous function on a locally compact Hausdorff space. Additionally, no universal approximation theory is provided for infinite layer networks directly, but it is cited as following from the work of Le Roux & Bengio (2007). As we will see, DFMs will not only encapsulate (and benefit from) these results, but also provide a general universal approximation theory therefor.
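As a concrete reading of the infinite layer construction above, the following sketch (our own illustration; the weight function and the Gaussian sampling measure are hypothetical choices) estimates the output $\int u(w)\, \phi_w(x)\, d\mu(w)$ by Monte Carlo, sampling finitely many of the feature maps $\phi_w$:

```python
import numpy as np

rng = np.random.default_rng(0)

def infinite_layer(x, u, n_samples=100_000):
    """Monte Carlo estimate of  integral of u(w) * g(<w, x>) dmu(w),  mu = N(0, I)."""
    w = rng.standard_normal((n_samples, x.shape[0]))  # draws from mu
    phi = np.tanh(w @ x)                              # feature maps phi_w(x)
    return np.mean(u(w) * phi)                        # expectation under mu

x = np.array([0.5, -1.0, 2.0])
u = lambda w: np.sum(w, axis=1)        # a hypothetical weight function u(w)
print(infinite_layer(x, u))
```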
Table 1: Unification of Infinite Dimensional Neural Network Theory

| Name | Form | DFM | Authors |
|---|---|---|---|
| Infinite NNs | $b + \sum_{j=1}^{\infty} v_j h(x; u_j)$ | $\mathcal{N}_\infty : \mathbb{R}^n \xrightarrow{n} \bigoplus_{i=1}^{\infty} \mathbb{R} \xrightarrow{n} \mathbb{R}^m$ | Neal (1996); Williams (1998) |
| Functional MLPs | $\sum_{i=1}^{p} \beta_i\, g\left(\int w_i \xi\, d\mu\right)$ | $\mathcal{F} : L^1(\mathbb{R}) \xrightarrow{f} \mathbb{R}^p \xrightarrow{n} \mathbb{R}$ | Stinchcombe (1999); Rossi et al. (2002) |
| Continuous NNs | $\int \omega(u)\, g(x \cdot \omega(u))\, du$ | $\mathcal{C} : \mathbb{R}^n \xrightarrow{d} L^1([a, b]) \xrightarrow{f} \mathbb{R}^m$ | Le Roux & Bengio (2007) |
| Non-parametric Continuous NNs | $\int \omega(u)\, g(\langle x, u \rangle)\, du$ | $\mathcal{C}' : \mathbb{R}^n \xrightarrow{d} L^1(\mathbb{R}) \xrightarrow{f} \mathbb{R}^m$ | Le Roux & Bengio (2007) |
| Infinite Layer NNs | same as non-parametric continuous NNs | same as non-parametric continuous NNs | Globerson & Livni (2016); Hazan & Jaakkola (2015) |

Another variant of infinite dimensional neural networks, which we hope to generalize, is the functional multilayer perceptron (Functional MLP). This body of work is not referenced in any of the aforementioned work on infinite layer neural networks, but it is clearly related. The fundamental idea is that given some $f \in V = C(X)$, where $X$ is a locally compact Hausdorff space, there exists a generalization of neural networks which approximates arbitrary continuous bounded functionals on $V$ (maps $f \mapsto a \in \mathbb{R}$). These functional MLPs take the form $\sum_{i=1}^{p} \beta_i\, g\left(\int \omega_i(x) f(x)\, d\mu(x)\right)$. The authors show the power of such an approximation using the functional analysis results of Stinchcombe (1999) and additionally provide statistical consistency results, defining well defined optimal parameter estimation in the infinite dimensional case.

Stemming additionally from the initial work of Neal (1996), the final variant, called continuous neural networks, has two manifestations: the first is more closely related to functional perceptrons and the last is exactly the formulation of infinite layer NNs. Initially Le Roux & Bengio (2007) propose an infinite dimensional neural network of the form $\int \omega(u)\, g(x \cdot \omega(u))\, d\mu(u)$ and show universal approximation in this regime. Overall this formulation mimics multiplication by some weighting vector as in infinite layer NNs, except that in the continuous neural formulation $\omega$ can be parameterized by a set of weights. Thereafter, to prove connections with Gaussian processes from a different vantage, they propose non-parametric continuous neural networks, $\int \omega(u)\, g(x \cdot u)\, d\mu(u)$, which are exactly infinite-layer neural networks.

In the view of deep function machines, the foregoing variants of infinite and semi-infinite dimensional neural networks are merely instantiations of different computational skeleton structures. A summary of the unified view is given in Table 1.
3.3 APPROXIMATION THEORY OF DEEP FUNCTION MACHINES

In addition to unification, DFMs provide a powerful language for proving universal approximation theorems for neural networks of any depth and dimension (by dimension, we mean both infinite and finite dimensional neural networks). The central theme of our approach is that the approximation theories of any DFM can be factored through the standard approximation theories of discrete neural networks. In the forthcoming section, this principle allows us to prove two approximation theories which have been open questions since Stinchcombe (1999).

The classic result of Cybenko (1989) yields the theory for $n$-discrete layers. For $f$-functional layers, the work of Stinchcombe (1999) proved in great generality that for certain topologies on $C(E_\ell)$, two layer functional MLPs universally approximate any continuous functional on $C(E_\ell)$. Following Stinchcombe (1999), Rossi et al. (2002) extended these results to the case wherein multiple $o$-operational layers prepend $f$-functional layers. We will show in particular that $o$-operational and similarly $d$-defunctional layers alone are dense in the much richer space of uniformly continuous bounded operators on function space. We give three results of increasing power, but decreasing transparency.

Theorem 3.4 (Point Approximation). Let $[a, b] \subset \mathbb{R}$ be a bounded interval and $g : \mathbb{R} \to B \subset \mathbb{R}$ be a continuous, bijective activation function. Then if $\xi : E_\ell \to \mathbb{R}$ and $f : E_{\ell'} \to B$ are $L^1(\mu)$ integrable functions, there exists a unique class of $o$-operational layers such that $g \circ o[\xi] = f$.
Proof. We seek a class of weight kernels $\omega_\ell$ so that $o[\xi] = f$. Let
$$\omega_\ell(u, v) = \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)$$
where $\Xi(u)$ is the indefinite integral of $\xi$. Define $h$ so that it satisfies the following two equivalent equations $\mu$-a.e.:
$$h(\Xi(b), v) - h(\Xi(a), v) = f(v), \qquad \left.\frac{\partial h(x, v)}{\partial x}\, \xi(u)\right|_{x = \Xi(u),\; u \in \{a, b\}} = 0. \qquad (3.6)$$
The proof is completed in the appendix.

The statement of Theorem 3.4 is not itself very powerful; we merely claim that $o$-operational layers can at least map any one function to any other one function. However, the proof yields insight into what the weight kernels of $o$-operational layers look like when the single condition $\xi \mapsto f$ is imposed. Therefrom, we conjecture but do not prove that a statistically optimal initialization for training $o$-operational layers is given by satisfying (3.6) when $\xi = \frac{1}{m}\sum_{n=1}^{m} \xi_n$ and $f = \frac{1}{m}\sum_{n=1}^{m} f_n$, where the training set $\{(\xi_n, f_n)\}$ is drawn i.i.d. from some distribution $\mathcal{D}$.

Theorem 3.5 (Nonlinear Operator Approximation). Suppose that $E_1, E_2$ are bounded intervals in $\mathbb{R}$. For all $\kappa, \lambda$, if $K : \mathrm{Lip}_\lambda(E_1) \to \mathrm{Lip}_\kappa(E_2)$ is a uniformly continuous, nonlinear operator, then for every $\epsilon > 0$ there exists a deep function machine
$$\mathcal{D} : L^1(E_1) \xrightarrow{\,o\,} L^1(E') \xrightarrow{\,o\,} L^1(E_2) \qquad (3.7)$$
(for an intermediate interval $E'$) such that $\left\| \mathcal{D}|_{\mathrm{Lip}_\lambda} - K \right\| < \epsilon$.

With two layer operator networks universal, it remains to consider $d$-defunctional layers.
Theorem 3.6 (Nonlinear Basis Approximation). Suppose $I, E_1, E_2$ are compact intervals, and let $C^\omega(X)$ denote the set of analytic functions on $X$. If $B : I^n \to C^\omega(E_2)$ is a continuous basis map to analytic functions, then for every $\epsilon > 0$ there exists a deep function machine
$$\mathcal{D} : \mathbb{R}^n \xrightarrow{\,d\,} L^1(E_1) \xrightarrow{\,o\,} L^1(E_2) \qquad (3.8)$$
such that $\|\mathcal{D}|_{I^n} - B\| < \epsilon$ in the topology of uniform convergence.

To the best of our knowledge, the above are the first approximation theorems for nonlinear operators and basis maps on function spaces for neural networks. The proofs in the appendix roughly involve a factorization of arbitrary DFMs through approximation theories for $n$-discrete layers.

Essentially, the factorization works as follows. In both of the foregoing theorems we want to roughly approximate some nonlinear map $K$ with a DFM $\mathcal{D}$. We therefore define an operator $|K|$, called an affine projection, that takes functions, converts them into piecewise constant approximations, applies $K$, and then again converts the result to piecewise constant approximations. Since there are a finite number, say $N$ and $M$, of pieces given in the input and the output of $|K|$ respectively, we can define an operator $\tilde{K} : \mathbb{R}^N \to \mathbb{R}^M$, called a lattice map, which in some sense reproduces $|K|$. We then show both Theorem 3.5 and Theorem 3.6 by approximating $\tilde{K}$ with a discrete neural network, $\mathcal{N}$, and choosing $\mathcal{D}$ to be such that its discretization is $\mathcal{N}$. Surprisingly, this principle holds for any DFM structure and a large class of different $K$, not just those which use piecewise constant approximations!
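To make the factorization tangible, the sketch below (our own illustration; the operator $K$, the partition sizes, and the grid are hypothetical, and we use piecewise constant projections for simplicity) builds the projection $\rho_P$, a right inverse $\rho_P^*$, and the induced lattice map $\tilde{K} = \rho_Q \circ K \circ \rho_P^*$, which is the finite dimensional map a discrete network is then asked to approximate:

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 1000)      # dense grid representing E

def rho(f_vals, n_pieces):
    """rho_P: sample a function (given on GRID) at n_pieces partition points."""
    idx = np.linspace(0, len(GRID) - 1, n_pieces).astype(int)
    return f_vals[idx]

def rho_star(v):
    """rho_P^*: lift a vector back to a piecewise constant function on GRID."""
    return np.repeat(v, int(np.ceil(len(GRID) / len(v))))[: len(GRID)]

def K(f_vals):
    """A hypothetical nonlinear operator on functions: K[f] = f^2 + 0.1."""
    return f_vals ** 2 + 0.1

def lattice_map(x, m_out=16):
    """K-tilde : R^N -> R^M, the finite map through which D factors."""
    return rho(K(rho_star(x)), m_out)

f = np.sin(2 * np.pi * GRID)
x = rho(f, 16)                  # rho_P(f) in R^16
print(lattice_map(x))           # target values for a discrete network N
```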
4 NEURAL TOPOLOGY FROM TOPOLOGY

As we have now shown, deep function machines can express arbitrarily powerful configurations of 'perceptron layer' mappings between different spaces. However, it is not yet theoretically clear how different configurations of the computational skeleton and the particular spaces $X_\ell$ do or do not lead to a difference in expressiveness of DFMs. To answer questions of structure, we will return to the motivating example of high-resolution data, but now in the language of deep function machines.

4.1 RESOLUTION INVARIANT NEURAL NETWORKS
If an input $(x_j) \in \mathbb{R}^N$ is sampled from a continuous function $\xi \in C(E_\ell)$, $o$-operational layers are a natural way of extending neural networks to deal directly with $\xi$. As before, it is useful to think of each $o$ as a continuous relaxation of a class of $n$, and from this perspective we can gain insight into the weight tensors of $n$-discrete layers as the resolution of $x$ increases.

Theorem 4.1 (Invariance). If $T_\ell$ is an $o$-operational layer with an integrable weight kernel $\omega(u, v)$ of $O(1)$ parameters, then there is a unique fully connected $n$-discrete layer, $N_\ell$, with $O(N)$ parameters so that $T_\ell[\xi](j) = N_\ell(x)_j$ for all $\xi, x$ as above.

Figure 4: DFM construction of a resolution invariant $n$-discrete layer via the resolution invariance schema: an $o$-operational layer $\mathcal{O} : C(E_\ell) \to C(E_{\ell'})$ with parameterized kernel $\omega_\ell(u, v; w) = \sum_{k=1}^{n} f(u, v; w_k)$ is discretized into an $n$-discrete instantiation $[\mathcal{O}]_n : \mathbb{R}^N \to \mathbb{R}^M$.

Theorem 4.1 is a statement of variance in parameterization; when the input is a sample of a smooth signal, fully connected $n$-discrete layers are naively overparameterized. DFMs therefore yield a simple resolution invariance schema for neural networks. Instead of placing arbitrary restrictions on $W_\ell$ like convolution, or assuming that gradient descent will implicitly find a smooth weight matrix or filter $W_\ell$ for $n$, we take $W_\ell$ to be the discretization of a smooth $\omega_\ell(u, v)$. An immediate advantage is that the weight surfaces, $\omega_\ell(u, v)$, of $o$-operational layers can be parameterized by dense families $f(u, v; w)$, whose parameters $w$ do not depend on the resolution of the input but on the complexity of the model being learnt.
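The sketch below (our own, using a hypothetical two-parameter kernel family) implements this schema in its simplest form: a smooth kernel $\omega(u, v; w)$ is discretized into a weight matrix at any requested resolution, so the trainable parameter count stays fixed while the instantiated matrix grows:

```python
import numpy as np

def omega(u, v, w):
    # A kernel family with a fixed number of parameters w (here just two).
    return w[0] * np.exp(-w[1] * (u - v) ** 2)

def discretize(w, n_in, n_out):
    """n-discrete instantiation: sample omega on an (n_in x n_out) grid."""
    u = np.linspace(0.0, 1.0, n_in)[:, None]
    v = np.linspace(0.0, 1.0, n_out)[None, :]
    return omega(u, v, w) / n_in      # 1/n_in is the measure normalization

w = np.array([1.0, 10.0])             # O(1) parameters...
W_low  = discretize(w, 28 * 28, 64)   # ...instantiated at low resolution,
W_high = discretize(w, 112 * 112, 64) # ...or high, with no new parameters
```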
4.2 TOPOLOGICALLY INSPIRED LAYER PARAMETERIZATIONS

Furthermore, we can now explore new parameterizations by constructing weight tensors, and thereby neural network topologies, which approximate the action of the operator neural networks that most expressively fit the topological properties of the data. Generally, new restrictions on the weights of discrete neural networks might be achieved as follows:

1. Given that the input data $x$ is sampled from some $f \in F \subset \{g : E_1 \to \mathbb{R}\}$, find a closed algebra of weight kernels so that $\omega \in A$ is minimally parameterized and $g \circ o[F]$ is a sufficiently "rich" class of functions.
2. Repeat this process for each layer of a computational skeleton $\mathcal{S}$ and yield a DFM $\mathcal{O}$.
3. Instantiate a deep function machine $[\mathcal{O}]_n$, called the $n$-discrete instantiation of $\mathcal{O}$, consisting of only $n$-discrete layers, by discretizing each $o$-operational layer through the resolution invariance schema sample:
$$W_\ell = \left[ w_\ell\left( \frac{i}{e_u}, \frac{j}{e_u}, \cdots, \frac{k}{e_v}, \frac{t}{e_v}, \cdots \right) \right]_{ij \ldots kt \ldots} \qquad (4.1)$$
where $e_\eta$ denotes the cardinality of the sample along the $\eta$-axis of $E_\ell = [0, 1]^{\dim(E_\ell)}$.

This process is depicted in Figure 4. This perspective yields interpretations of existing layer types and the creation of new ones. For example, convolutional $n$-discrete layers provably approximate $o$-operational layers with weight kernels that are solutions to the ultrahyperbolic partial differential equation.

Figure 5: The WaveLayer architecture of RippLeNet for MNIST. Bracketed numbers denote the number of wave coefficients. The images following each WaveLayer are example activations of neurons after training, given by feeding a '0' into RippLeNet.

Theorem 4.2 (Convolutional Neural Networks). Let $N_\ell$ be an $n$-discrete convolutional layer such that $n(x) = h \star x$, where $\star$ is the convolution operator and $h$ is a filter tensor. Then there is an $o$-operational layer, $O_\ell$, with $\omega_\ell(u_1, \ldots, u_n, v_1, \ldots, v_n)$ such that
$$\sum_{k=1}^{n} \frac{\partial^2 \omega}{\partial u_k^2} = c^2 \sum_{k=1}^{n} \frac{\partial^2 \omega}{\partial v_k^2} \qquad (4.2)$$
and its $n$-discrete instantiation is $[O_\ell]_n = N_\ell$.

Using Theorem 4.2, we therefore propose the following generalization of convolutional layers, with weight kernels satisfying (4.2), whose $n$-discrete instantiation is resolution invariant.

Definition 4.3 (WaveLayers). We say that $T_\ell$ is a WaveLayer if it is the $n$-discrete instantiation (via the resolution invariance schema) of an $o$-operational layer with weight kernel of the form
$$\omega_\ell(u, v) = s_0 + \sum_{i=1}^{b} s_i \cos\left(w_i^T(u, v) - p_i\right); \qquad s_i, p_i \in \mathbb{R},\ w_i \in \mathbb{R}^{\dim(E_\ell)}. \qquad (4.3)$$
WaveLayers are named as such because the kernels $\omega_\ell$ are superpositions of standing waves moving in directions encoded by $w_i$, offset in phase by $p_i$. Additionally, any $n$-discrete convolutional layer can be expressed by WaveLayers by setting the direction $\theta_i$ of each $w_i$ to $\theta_i = \pi/4$. In this case, instead of learning the values of $h$ at each $j$, we learn $s_i, w_i, p_i$.
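A minimal sketch of a WaveLayer weight surface and its $n$-discrete instantiation follows (our own illustration; the number of waves, grid sizes, and random initialization are hypothetical choices, and a real layer would make s, w, p trainable):

```python
import numpy as np

rng = np.random.default_rng(0)

class WaveKernel:
    """omega(u, v) = s0 + sum_i s_i * cos(w_i . (u, v) - p_i), as in eq. (4.3)."""

    def __init__(self, n_waves):
        self.s0 = 0.0
        self.s = rng.normal(scale=0.5, size=n_waves)         # amplitudes
        self.w = rng.normal(scale=2.0, size=(n_waves, 2))    # directions/frequencies
        self.p = rng.uniform(0.0, 2 * np.pi, size=n_waves)   # phases

    def __call__(self, u, v):
        phase = self.w[:, 0, None, None] * u + self.w[:, 1, None, None] * v
        return self.s0 + np.sum(
            self.s[:, None, None] * np.cos(phase - self.p[:, None, None]), axis=0)

    def instantiate(self, n_in, n_out):
        """Sample the kernel into a weight matrix; parameters stay O(n_waves)."""
        u = np.linspace(0.0, 1.0, n_in)[:, None]
        v = np.linspace(0.0, 1.0, n_out)[None, :]
        return self(u, v) / n_in

layer = WaveKernel(n_waves=8)
W = layer.instantiate(n_in=784, n_out=128)   # e.g. a flattened 28x28 input
print(W.shape)   # (784, 128), from only 8 * 4 + 1 trainable scalars
```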
5 EXPERIMENTS

With theoretical guarantees given for DFMs, we propose a series of experiments to test the learnability, expressivity, and resolution invariance of WaveLayers.
RippLeNet.
To find a baseline model, we performed a grid search on MNIST over various DFMs and hyperparameters, arriving at RippLeNet, an architecture similar to the classic LeNet-5 of LeCun et al. (1998), depicted in Figure 5. The model is the $n$-discrete instantiation of the DFM in (5.1) and consists of 5 successive WaveLayers with tanh activations and no pooling. We found that using ReLU often resulted in a failure of learning to converge. Changes in the activation shapes (e.g. 24x24x2 → ...) are determined by the choice of $E_\ell$ at each node of the DFM. The trainable parameters, in particular the magnitudes of internal frequencies, were initialized at offsets proportional to the number of waves comprising each $o$-operational layer. Likewise, the orientation of each wave was initialized uniformly on the unit spherical shell.

Note: WaveLayers are not the same as Fourier networks or FC layers with cos activation functions.
It suffices to view WaveLayers as simply another way to reparameterize the weight matrix of $n$-discrete layers, and therefore the VC dimension of WaveLayers is less than that of normal fully connected layers. See the appendix.

Figure 6: A plot of test error versus number of trainable parameters for RippLeNet in comparison with early LeNet architectures.
Model Expressive Parameter Reduction. In Theorem 4.1, it was shown that DFMs in some sense parameterize the complexity of the model being learned, without constraints imposed by the form of the data. RippLeNet benefits in that the sheer number of parameters (and therefore the variance) can be reduced until the model expresses the mapping to satisfactory error, without concern for resolution variants.

We empirically verify this methodology by fixing the model architecture and increasing the number of waves per layer uniformly. This results in an exponential marginal utility on the lowest error achieved on MNIST with respect to the number of parameters in the model, shown in Figure 6. The slight outperformance of early LeNet architectures suggests that future work in optimizing WaveLayers might be fruitful in achieving state of the art parameter reduction.

Figure 7: A plot of training time (normalized for each layer type with respect to the training time at the baseline resolution) as the resolution of MNIST scales.

Resolution Invariance.
True resolution invariance has the desirable property of consistency. In principle, consistency requires that regardless of the resolution complexity of the data, the training time, parameterization, and testing accuracy of a model do not vary.

We test consistency for RippLeNet by fixing all aspects of initialization save for the input resolution of images. For each training run, we rescale MNIST using both bicubic and nearest neighbor scaling to square resolutions of side length $R$. In conjunction, we compare the resolution consistency of fully connected (FC) and convolutional architectures. For FC models, the number of free parameters on the first layer is increased out of necessity. Likewise, the size of the input filters for convolutional models is varied. As shown in Figure 7, WaveLayers remain invariant to resolution changes in both multirun variance and normalized convergence iterations, whereas both FC and convolutional layers exhibit an increase in both measurements with resolution.

6 CONCLUSION
In this paper we proposed deep function machines, a novel framework for topologically inspired layer parameterization. We showed that given topological assumptions, DFMs provide theoretical tools to yield provable properties in neural networks. We then used this framework to derive WaveLayers, a new type of provably resolution invariant layer for processing data sampled from continuous signals such as images and audio. The derivation of WaveLayers was additionally accompanied by the proposal of several layer operations for DFMs between infinite and/or finite dimensional vector spaces. We for the first time proved a theory of non-linear operator and functional basis approximation for neural networks of infinite dimensions, closing two long standing questions since Stinchcombe (1999). We then utilized the expressive power of such DFMs to arrive at a novel architecture for resolution invariant image processing, RippLeNet.

Future Work. Although we've laid the groundwork for exploration into the theory of deep function machines, there are still many open questions both theoretically and empirically. The drastic outperformance in resolution variance of RippLeNet in comparison to traditional layer types suggests that new layer types via DFMs, with provable properties in mind, should be further explored. Furthermore, a deeper analysis of existing global network topologies using DFMs may be useful given their expressive power.

REFERENCES
Carl Burch. A survey of machine learning. A survey for the Pennsylvania Governor's School for the Sciences, 2001.

Youngmin Cho and Lawrence K Saul. Analysis and extension of arc-cosine kernels for large margin classification. arXiv preprint arXiv:1112.3712, 2011.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.

Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.

Amir Globerson and Roi Livni. Learning infinite-layer networks: beyond the kernel trick. arXiv preprint arXiv:1606.05316, 2016.

Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.

Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. 2007.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Radford M Neal. Bayesian Learning for Neural Networks, volume 118. 1996.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

Fabrice Rossi, Brieuc Conan-Guez, and François Fleuret. Theoretical properties of functional multilayer perceptrons. 2002.

Jurgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

Matthias Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(02):69–106, 2004.

Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Maxwell B Stinchcombe. Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477, 1999.

Christopher KI Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
APPENDIX A: ADDITIONAL TOY DEMONSTRATION
To provide a direct comparison between convolutional, fully connected, and WaveLayer operations on data with semicontinuity assumptions, we conducted several toy demonstrations. We construct a dataset of functions (similar to a dataset of audio waveforms) whose input pairs are Gaussian bump functions $B_\mu(u)$ centered at different points $\mu \in [0, 1]$. The corresponding "labels" or target outputs are squares of Gaussian bump functions plus a linear form whose slopes are given by the position of the center, that is $B_\mu(u)^2 + \mu(u - \mu)$. The desired map we wish to learn is then $T : B_\mu(u) \mapsto B_\mu(u)^2 + \mu \cdot (u - \mu)$. To construct the actual dataset $D$, for a random sample of centers $\mu$ we sample the input/output pairs over 100 evenly spaced sub-intervals. The resultant dataset is a list of $N$ pairs of input/output vectors $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $x, y \in \mathbb{R}^{100}$.

We then train three different two layer DFMs with $n$-discrete convolutional, fully-connected, and WaveLayers respectively. The following three figures show the outputs of all three layer types as training progresses. In the first three quadrants, the output of each layer type on a particular example datapoint is shown along with that example's input/target functions, $(x_i, y_i)$. The particular example shown is chosen randomly at the beginning of training. In the bottom right, the log training error over the whole dataset of each layer type is shown. A tick is the number of batches seen by the algorithm.

At initialization, the three layers exhibit predicted behavior despite artifacts towards the boundaries of the intervals. The convolutional layer acts as a mollifier, smoothing the input signal, as its own kernel is Gaussian. The fully connected layer generates, as predicted, a normally distributed set of different output activations and does not regard the spatial locality (and thereby continuity) of the input. Finally, the WaveLayer output exhibits behaviour predicted in Neal (1996); that is, it limits towards a smoothed random walk over the input signal.

As training continues, both the convolutional and WaveLayer outputs preserve the continuity of the input signal, and approximate the smoothness of the output signal as induced by their own relation to ultrahyperbolic differential equations. Since the FC layer is not restricted to any given topology, although it approximates the desired output signal closely in the $L^2$ norm, it fails to achieve smoothness, as this regularization is not explicitly coded. It is important to note that in this example the WaveLayer output immediately surpasses the accuracy of the convolutional output because the convolutional output only has bias units across entire channels, whereas the bias units of WaveLayers are themselves functions $\sum_{k=1}^{n} s_k \cos(w_k \cdot v + p_k) + b_k$. Therefore WaveLayers can impose biases heterogeneously across their output signals, whereas convolutional layers require much deeper architectures to artificially generate such biases.

The results of this toy demonstration illustrate the intermediate flexibility of WaveLayers between purely fully connected and convolutional architectures. Satisfying the same differential equation (4.2), convolutional and WaveLayer architectures are regularized by spatial locality, but WaveLayers can in fact go beyond convolutional layers and employ translational heterogeneity.
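The toy dataset above is easy to reproduce; a minimal sketch follows (our own code; the bump width is a hypothetical choice, as it is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 100)          # 100 evenly spaced sample points

def bump(mu, width=0.05):
    """Gaussian bump B_mu(u) centered at mu (the width is our assumption)."""
    return np.exp(-((u - mu) ** 2) / (2.0 * width ** 2))

def make_dataset(n_pairs):
    xs, ys = [], []
    for mu in rng.uniform(0.0, 1.0, size=n_pairs):
        x = bump(mu)                     # input:  B_mu(u)
        y = x ** 2 + mu * (u - mu)       # target: B_mu(u)^2 + mu * (u - mu)
        xs.append(x)
        ys.append(y)
    return np.stack(xs), np.stack(ys)

X, Y = make_dataset(1000)                # X, Y have shape (1000, 100)
```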
Although the purpose of this work is not to demonstrate the superiority of either convolutional or WaveLayer architectures, it does open a new avenue of exploration in neural architecture design, using DFMs to design layer types with topological constraints.
APPENDIX B: THEOREMS AND PROOFS
8.1 POINT APPROXIMATION
Theorem 8.1.
Let $[a, b] \subset \mathbb{R}$ be a bounded interval and $g : \mathbb{R} \to B \subset \mathbb{R}$ be a continuous, bijective activation function. Then if $\xi : E_\ell \to \mathbb{R}$ and $f : E_{\ell'} \to B$ are $L^1(\mu)$ integrable functions, there exists a unique class of $o$-operational layers such that $g \circ o[\xi] = f$.

Proof.
We will give an exact formula for the weight function $\omega_\ell$ corresponding to $o$ so that the formula is true. Recall that
$$y_{\ell'}(v) = g\left(\int_{E_\ell} \xi(u)\, \omega_\ell(u, v)\, d\mu(u)\right). \qquad (8.1)$$
Then let $\omega_\ell(u, v) = \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)$, where $\Xi(u)$ is the indefinite integral of $\xi$ and $h : \mathbb{R} \times E_{\ell'} \to \mathbb{R}$ is some jointly and separately integrable function. By the bijectivity of $g$ onto its codomain, $\omega_\ell$ exists. Now further specify $h$ so that $h(\Xi(u), v)\big|_{u \in \partial E_\ell} = f(v)$. Then by the fundamental theorem of (Lebesgue) calculus and the chain rule,
$$g(o[\xi](v)) = g\left(\int_{E_\ell} \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)\, \xi(u)\, d\mu(u)\right) = g\left(g^{-1}(h(\Xi(u), v))\right)\Big|_{u \in \partial E_\ell} = f(v). \qquad (8.2)$$
A generalization of this theorem to $E_\ell \subset \mathbb{R}^n$ is given by Stokes' theorem.

8.2 DENSITY IN LINEAR OPERATORS
Theorem 8.2 (Approximation of Linear Operators). Suppose $E_\ell, E_{\ell'}$ are $\sigma$-compact, locally compact, measurable, Hausdorff spaces. If $K : C(E_\ell) \to C(E_{\ell'})$ is a bounded linear operator, then there exists an $o$-operational layer such that for all $y_\ell \in C(E_\ell)$, $o[y_\ell] = K[y_\ell]$.

Proof.
Let $\zeta_t : C(E_{\ell'}) \to \mathbb{R}$ be the linear form which evaluates its argument at $t \in E_{\ell'}$; that is, $\zeta_t(f) = f(t)$. Then because $\zeta_t$ is bounded on its domain, $\zeta_t \circ K = K^\star \zeta_t : C(E_\ell) \to \mathbb{R}$ is a bounded linear functional. Then from the Riesz Representation Theorem we have that there is a unique regular Borel measure $\mu_t$ on $E_\ell$ such that
$$\left(K y_\ell\right)(t) = K^\star \zeta_t\left(y_\ell\right) = \int_{E_\ell} y_\ell(s)\, d\mu_t(s), \qquad \|\mu_t\| = \|K^\star \zeta_t\|. \qquad (8.3)$$
We will show that $\kappa : t \mapsto K^\star \zeta_t$ is continuous. Take an open neighborhood of $K^\star \zeta_t$, say $V \subset [C(E_\ell)]^*$, in the weak* topology. Recall that the weak* topology endows $[C(E_\ell)]^*$ with the smallest collection of open sets so that the maps in $i(C(E_\ell)) \subset [C(E_\ell)]^{**}$ are continuous, where $i : C(E_\ell) \to [C(E_\ell)]^{**}$ is such that $i(f) = \hat{f} = \phi \mapsto \phi(f)$, $\phi \in [C(E_\ell)]^*$. Then without loss of generality $V = \bigcap_{n=1}^{m} \hat{f}_{\alpha_n}^{-1}(U_{\alpha_n})$, where $f_{\alpha_n} \in C(E_\ell)$ and the $U_{\alpha_n}$ are open in $\mathbb{R}$. Now $\kappa^{-1}(V) = W$ is such that if $t \in W$ then $K^\star \zeta_t \in \bigcap_m \hat{f}_{\alpha_n}^{-1}(U_{\alpha_n})$. Therefore for all $f_{\alpha_n}$, $K^\star \zeta_t(f_{\alpha_n}) = \zeta_t(K[f_{\alpha_n}]) = K[f_{\alpha_n}](t) \in U_{\alpha_n}$.

We would like to show that there is an open neighborhood of $t$, say $D$, so that $D \subset W$ and $\kappa(D) \subset V$. First, since all the maps $K[f_{\alpha_n}] : E_{\ell'} \to \mathbb{R}$ are continuous, let $D = \bigcap_m (K[f_{\alpha_n}])^{-1}(U_{\alpha_n}) \subset E_{\ell'}$. Then if $r \in D$, $\hat{f}_{\alpha_n}[K^\star \zeta_r] = K[f_{\alpha_n}](r) \in U_{\alpha_n}$ for all $1 \le n \le m$. Therefore $\kappa(r) \in V$ and so $\kappa(D) \subset V$.
Theorem 8.3.
Suppose that E , E are bounded intervals in R . If K : Lip λ ( E ) → Lip κ ( E ) is auniformly continuous, nonlinear operator. Then for every (cid:15) > there exists a deep function machine D : L ( E ) L ( E ) L ( E ) o o (8.5) such that (cid:13)(cid:13) D| Lip λ − K (cid:13)(cid:13) < (cid:15). We will first introduce some defintions which quantize uniformly continuous operators on functionspace.
Definition 8.4.
Let P = p < · · · < p N be some partition of a compact interval E with N components. We call ρ P : Lip ∗ ( E ) → R N and ρ ∗ P : R M → Lip ∗ ( E ) affine projection maps if ρ P ( f ) = ( f ( p i )) Ni =1 ρ ∗ P ( v ) = v (cid:55)→ N − (cid:88) i =0 χ P i ( x ) (cid:20) ( v i +1 − v i ) µ ( P i ) ( t − p i ) + v i (cid:21) (8.6) where χ P i is the indicator function on P i = [ p i , p i +1 ) when i < N and P N − = [ p N − , p N ] . Definition 8.5.
Let
P, Q be partitions of E , E of N, M components respectively. If K : Lip λ ( E ) → Lip κ ( E ) , its affine projection, | K | , and its lattice map, ˜ K , are defined so that thefollowing diagram commutes,Lip λ ( E ) R N Lip λ ( E ) Lip κ ( E ) R M Lip κ ( E ) | K | ρ P ˜ K ρ ∗ P Kρ ∗ Q ρ Q Lemma 8.6 (Strong Linear Approximation) . If K : Lip λ ( E ) → Lip κ ( E ) is a uniformly continuous,nonlinear operator, then for every (cid:15) > there exist partitions P, Q of E , E so that (cid:107) K − | K |(cid:107) < (cid:15). Proof.
To show the lemma, we will chase the commutative diagram above by approximation.For any δ > , we claim that there exists a P such that for any f ∈ Lip λ ( E ) , the affine projectionapproximates f ; that is, (cid:107) f − ρ ∗ P ◦ ρ P ◦ f (cid:107) L ( µ ) < δ. To see this, take P to be a uniform partition of with ∆ p := µ ( P i ) < δµ ( E ) λ . Then (cid:90) | f − ρ ∗ P ◦ ρ P ◦ f | dµ ≤ N − (cid:88) i =1 (cid:90) P i (cid:12)(cid:12)(cid:12)(cid:12) f ( t ) − (cid:20) ( f ( p i ) − f ( p i )) µ ( P i ) ( t − p i ) + f ( p i ) (cid:21) (cid:12)(cid:12)(cid:12)(cid:12) dµ ( t ) ≤ N − (cid:88) i =1 (cid:90) P i | f ( t ) − f ( p i ) − λ ( t − p i ) | dµ ( t ) ≤ N − (cid:88) i =1 (cid:90) P i λ | t − p i | dµ ( t ) ≤ λ ∆ p N < δ.
Now by the absolute continuity of K , for every (cid:15) > there is a δ and therefore a partition P of E sothat if (cid:107) f − ρ ∗ P ◦ ρ P ◦ f (cid:107) L ( µ ) < δ then (cid:107) K [ f ] − K [ ρ ∗ P ◦ ρ P ◦ f ] (cid:107) L ( µ ) < (cid:15)/ . Finally let Q be auniform partition of E so that for every φ ∈ Lip κ ( E ) , (cid:107) φ − ρ ∗ Q ◦ ρ Q ◦ φ (cid:107) L ( µ ) < (cid:15)/ . It followsthat for every f ∈ Lip λ ( E ) (cid:107) K [ f ] − | K | [ f ] (cid:107) L ( µ ) ≤ (cid:107) K [ f ] − K ◦ ρ ∗ P ◦ ρ P [ f ] (cid:107) + (cid:107) K [ f ] − ρ ∗ Q ◦ ρ Q ◦ K [ f ] (cid:107) < (cid:15) (cid:15) (cid:15). Therefore the affine projection of K approximates K . This completes the proof.With the lemma given we will approximate nonlinear operators through an approximation of theaffine approximation using n -discrete DFMs. Proof of Theorem 3.5.
Let (cid:15) > be given. By Lemma 8.6 there exist partitions, P, Q , so that (cid:107) K − | K |(cid:107) < (cid:15)/ . The cooresponding lattice map ˜ K : R N → R M is therefore continuous. Since E is a compact interval, the image ρ P [ Lip λ ( E )] is compact and homeomorphic to the unit hypercube [0 , N . By the universal approximation theorem of Cybenko (1989), for every δ , there exists a deepfunction machine N : R N R J R M , n n so that (cid:107) ˜ K − N (cid:107) ∞ < δ. Then, the continuity of the affine projection maps implies that there exist δ such that (cid:107) ρ ∗ Q ◦ N ◦ ρ P − | K |(cid:107) < (cid:15)/ . Therefore the induced operator on N represents K ; that is, (cid:107) ρ ∗ Q ◦ N ◦ ρ P − K (cid:107) < (cid:15) .Let N be parameterized by W ∈ R N × J and W ∈ R J × M . Let S be any uniform partition of an I = [0 , with J components. Then parameterize a deep function machine D with weight kernels ω ( u, v ) = N (cid:88) i =1 J (cid:88) j =1 χ S j × P i ( u, v ) W i,j δ ( u − p i ) ,ω ( v, x ) = M − (cid:88) k =1 χ Q k ( x ) J (cid:88) j =1 (cid:34) W j,k +1 − W j,k µ ( Q k ) ( x − q k ) + W j,k (cid:35) δ ( v − s j ) , where δ is the dirac delta function. We claim that D = ρ ∗ Q ◦N ◦ ρ P . Performing routine computations,for any f ∈ Lip λ ( E ) , D [ f ] = T ◦ g ◦ (cid:18)(cid:90) E f ( u ) ω ( u, v ) dµ ( u ) (cid:19) = T ◦ g ◦ (cid:90) E N (cid:88) i =1 J (cid:88) j =1 χ S j × P i ( u, v ) W i,j f ( u ) δ ( u − p i ) dµ ( u ) = T ◦ g ◦ J (cid:88) j =1 ρ P ( f ) T W j χ S j ( v ) := T ◦ g ◦ h ( v ) hus, h is identical to the j th neuron of the first n -discrete layer in N ◦ ρ P when v = s j . Turning to T in D , we get that D [ f ] = (cid:90) I ω ( v, x ) g ( h ( v )) dµ ( v )= M − (cid:88) k =1 χ Q k ( x ) J (cid:88) j =1 (cid:34) W j,k +1 − W j,k µ ( Q k ) ( x − q k ) + W j,k (cid:35) g ( h ( s j ))= M − (cid:88) k =1 χ Q k ( x ) · g ( ρ P ( f ) T W ) T (cid:20) W k +1 − W k µ ( Q k ) ( x − q k ) + W k (cid:21) = ρ ∗ Q ( g ( ρ P ( f ) T W ) T W ) = ρ ∗ Q ◦ N ◦ ρ P [ f ] . Therefore (cid:107)D|
Lip λ − K (cid:107) < (cid:15) and this completes the proof.We will now prove a similar theorem for d -defunctional layers. Theorem 8.7 (Nonlinear Basis Approximation) . Suppose
I, E , E are compact intervals, and let C ω ( X ) denote the set of analytic functions on X . If B : I n → C ω ( E ) is a continuous basis map toanalytic functions then for every (cid:15) > there exists a deep function machine D : R n L ( E ) L ( E ) d o (8.7) such that (cid:107) D | I n − B (cid:107) < (cid:15) in the topolopgy of uniform convergence.Proof. Recall that the set of polynomials on E , P , are a basis for the vector space C ω ( E ) . Thereforethe map B has a decomposition through nmaots κ, ∆ so that the following diagram commutes (cid:96) ( R ) R n C ω ( E ) κ ∆ B and κ : ( a i ) ∞ i =1 (cid:55)→ (cid:80) a n g n where g n is the mononomial of degree n . The existence of ∆ can beverified through a composition of the direct product of basis projections in C ω ( E ) and B .For each m ∈ N the projection image in the m th coordinate, π m [∆[ R n ]] = R and so again themaos factor into a countable collection of maps (∆ i : R n → R ) ∞ i =1 so that (cid:81) ∞ i =1 ∆ i = ∆ . We willapproximate B by approximations of κ ◦ ∆ via increasing products of ∆ i .Define the aforementioned increasing product map ∆ ( N ) as ∆ ( N ) = N (cid:89) i =1 ∆ i × ∞ (cid:89) N +1 c where c is the constant map. Now with (cid:15) > given, we wish to show that there exists an N so that (cid:107) κ ◦ ∆ ( N ) − B (cid:107) < (cid:15) in the topology of uniform convergence.To see this let P N ⊂ P denote the set of polynomials of degree at most n. Next we define a open’mullification’ of P N . In particular let O P N ( (cid:15) ) = { f ∈ C ω ( (cid:15) ) | | f − g | < (cid:15), g ∈ P N } . It is clear that O P N ( (cid:15) ) ⊂ O P N ( (cid:15) ) when N ≤ N and furthermore by the density of P in C ω ( E ) we have that { O P i ( (cid:15) ) } ∞ i =1 is an open cover of B [ I n ] ⊂ C ω ( E ) . Since I n is compact B [ I n ] is a compact subset of C ω ( E ) and thus there is a finite index set I (cid:48) = { N , . . . N k } so that (cid:83) t ∈ I (cid:48) O P t ⊃ B [ I n ] . If N = max I (cid:48) then O P N ( (cid:15) ) ⊃ B [ I n ] . Therefore for every x ∈ I n we havethat (cid:107) κ ◦ ∆ ( N ) − B (cid:107) < (cid:15) since κ ◦ ∆ ( N ) is a polynomial of degree at most N. ow we will filter the maps ψ N := π ...N ◦ ∆ ( N ) where ψ N : R n → R N through the universalapproximation theory of standard discrete neural networks. Let N be a two n -discrete layer DFM sothat (cid:107)N − ψ N (cid:107) < (cid:15) . For convienience let N := n ◦ g n Then we can instantiate N as the DFM in(8.7) using the same method as in the proof of the nonlinear operator approximation theory above.For the d -defunctional layer let W jk be the weight tensor of n in N so that n ( x ) j = (cid:80) k W jk x k .Then let the weight kernel for d be ω k ( v ) = ρ ∗ ( W · k ) . and then d [ x ] | v = j = n ( x ) j . We will ommitthe design of weight kernels for the o -operational layer, but this is not difficult to establish. Alltogether we now have that via the approximation of N and equivalence of N and its instantiation inthe statement of the theorem, (cid:107) κ ◦ ρ ◦ o ◦ g ◦ d − κ ◦ ι N ◦ ψ N (cid:107) < (cid:15). Finally we need deal with the basis map κ . On the compact set ∆ ( N ) [ I n ] = ι N ◦ ψ N [ I n ] , κ is abounded linear operator and its composition, κ ◦ ρ ◦ o is also a bounded linear operator. Therefore bythe bounded linear approximation theorem of o -operational layers, there is a (cid:107) o (cid:48) − κ ◦ ρ ◦ o (cid:107) < (cid:15) .Appending such o (cid:48) to d as above we achieve the approximation bound of the theorem. 
This completes the proof.

8.4 RESOLUTION INVARIANCE
Theorem 8.8. If $T_\ell$ is an $o$-operational layer with an integrable weight kernel $\omega(u, v)$ of $O(1)$ parameters, then there is a unique $n$-discrete layer with $O(N)$ parameters so that $o[\xi](j) = n[x]_j$ for all indices $j$ and for all $\xi, x$ as above.

Proof. Given some $o$, we will give a direct computation of the corresponding weight matrix of $n$. It follows that
$$o[\xi](v) = \int_{E_\ell} \xi(u)\, \omega_\ell(u, v)\, d\mu(u) = \sum_{n=1}^{N-1} \int_{n}^{n+1} \left((x_{n+1} - x_n)(u - n) + x_n\right) \omega_\ell(u, v)\, d\mu(u) = \sum_{n=1}^{N-1} \left[(x_{n+1} - x_n) \int_{n}^{n+1} (u - n)\, \omega_\ell(u, v)\, d\mu(u) + x_n \int_{n}^{n+1} \omega_\ell(u, v)\, d\mu(u)\right]. \qquad (8.8)$$
Now, let $V_n(v) = \int_{n}^{n+1} (u - n)\, \omega_\ell(u, v)\, d\mu(u)$ and $Q_n(v) = \int_{n}^{n+1} \omega_\ell(u, v)\, d\mu(u)$. We can now easily simplify (8.8) using the telescoping trick of summation:
$$o[\xi](v) = x_N V_{N-1}(v) + \sum_{n=2}^{N-1} x_n \left(Q_n(v) - V_n(v) + V_{n-1}(v)\right) + x_1 \left(Q_1(v) - V_1(v)\right). \qquad (8.9)$$
Given indices $j \in \{1, \cdots, M\}$, let $W \in \mathbb{R}^{N \times M}$ so that $W_{n,j} = Q_n(j) - V_n(j) + V_{n-1}(j)$, $W_{N,j} = V_{N-1}(j)$, and $W_{1,j} = Q_1(j) - V_1(j)$. It follows that if $W$ parameterizes some $n$, then $n[x]_j = o[\xi](j)$ for every $f$ sampled/approximated by $x$ and $\xi$. Furthermore, $\dim(W) \in O(N)$, and $n$ is unique up to $L^1(\mu)$ equivalence.

8.5 CONVOLUTIONAL NEURAL NETWORKS AND THE ULTRAHYPERBOLIC DIFFERENTIAL EQUATION
8.5 CONVOLUTIONAL NEURAL NETWORKS AND THE ULTRAHYPERBOLIC DIFFERENTIAL EQUATION

Proof.
A general solution to (4.3) is of the form $\omega(u,v) = F(u - cv) + G(u + cv)$, where $F, G$ are twice differentiable. Essentially, the shape of $\omega$ stays constant in $u$, but the position of $\omega$ varies in $v$. For every $h$ there exists a continuous $F$ so that $F(j) = h_j$ and $G = 0$. Let $\omega(u,v) = F(u - cv) + G(u + cv)$. Therefore, applying Theorem 4.1 to the $o$ parameterized by $\omega$, we yield a weight matrix $W$ so that

$$ o[\xi](j) = \int_E \xi(u)\,\big(F(u - cj) + 0\big)\,d\mu(u) = (Wx)_j = (h \star x)_j = n[x]_j. \tag{8.10} $$

This completes the proof.
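The following toy sketch illustrates (8.10) (the filter, sizes, and the interpolant $F$ are our own choices): sampling a traveling-wave kernel $\omega(u,v) = F(u - cv)$ on an integer grid yields a Toeplitz weight matrix, i.e. exactly a discrete convolution.

```python
import numpy as np

# A toy sketch of (8.10) (filter, sizes, and the interpolant F are our own
# choices): sampling a traveling-wave kernel omega(u, v) = F(u - c v) on an
# integer grid yields a Toeplitz weight matrix, i.e. a discrete convolution.
c = 1.0
h = np.array([1.0, -2.0, 1.0])                   # filter taps; F(j) = h_j

def F(t):
    # a compactly supported interpolant with F(k) = h[k] for k = 0, 1, 2
    k = np.clip(np.round(t).astype(int), 0, len(h) - 1)
    return np.where((t >= -0.5) & (t <= len(h) - 0.5), h[k], 0.0)

N = 8
u = np.arange(N + len(h) - 1)                    # (padded) input grid
v = np.arange(N)                                 # output indices j
W = F(u[None, :] - c * v[:, None])               # W_{j,u} = F(u - c j): Toeplitz

x = np.random.randn(len(u))
print(np.allclose(W @ x, np.convolve(x, h[::-1], mode="valid")))   # True
```

Each row of $W$ is the same filter shifted by $cj$, which is precisely the weight-sharing scheme of a convolutional layer.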
9 APPENDIX C: VC DIMENSION OF DISCRETIZED OPERATOR NEURAL NETWORKS

In order to calculate the VC dimension of DFMs containing only discretized $o$-operational layers, denoted $\mathcal{D}$, we note that $\mathcal{D} \subset \mathcal{N}$, where $\mathcal{N}$ is the family of all DFMs with $n$-discrete skeletons whose per-node dimensionality is exactly that of the discretization $\mathcal{D}$. Thus the VC dimension of $\mathcal{D}$ can be bounded by that of $\mathcal{N}$; however, a more fine-tuned estimate is both possible and essential.

Suppose that in designing some deep architecture, one wishes to keep the VC dimension low whilst increasing per-node activation dimensionality. In practice, optimization in higher dimensions is easier when a low-dimensional parameterization is embedded therein. For example, hyperdimensional computing, sparse coding, and convolutional neural networks naturally necessitate high-dimensional hidden spaces but benefit from regularized capacity. Since the dimensionality of the discretization $\mathcal{O}$ does not depend on the original dimensionality of the space, the capacity of $\mathcal{O}$ depends directly on the "complexity" of the family of weight surfaces with which it is endowed. It would therefore be convenient to answer the following question formally.

The VC Problem. Let $\mathcal{W}_\ell \subset L^2(\mathbb{R}^2, \mu)$ be some family of weight surfaces. Then induce $\mathcal{O}_\mathcal{W}$, a family of discretized $o$-operational layers, with $\mathcal{O}_\mathcal{W} := \{[o_W]_n\}_{W \in \mathcal{W}}$, where $[\cdot]_n$ denotes the discretization. What is
$\mathrm{VCDim}(\mathcal{O}_\mathcal{W})$?

Although in this work we do not directly attack this problem, a solution leads to another dimension of layer and architecture design beyond topological constraints. In practice, one would be able to choose a set $\mathcal{W}_\ell$ which gives a satisfactory generalizability condition on one's learning problem.
10 APPENDIX D: ANALYTICAL DERIVATION OF CONTINUOUS ERROR BACKPROPAGATION FOR SEPARABLE WEIGHT KERNELS
With these theoretical guarantees given for DFMs, implementing the feedforward and error backpropagation algorithms in this context is an essential next step. We will consider operator neural networks with polynomial kernels. As aforementioned, in the case where a DFM has nodes with non-separable kernels, we cannot give the guarantees we give in the following section; therefore, a standard auto-differentiation set-up will suffice for DFMs with, for example, wave layers.

Feedforward propagation is straightforward, and relies on memoizing operators by using the separability of weight polynomials. Essentially, integration need only occur once per layer to yield coefficients on power functions. See Algorithm 1.

10.0.1 FEED-FORWARD PROPAGATION
We will say that a function $f: \mathbb{R}^2 \to \mathbb{R}$ is numerically integrable if it can be separated into $f(x,y) = g(x)h(y)$.

Theorem 10.1. If $\mathcal{O}$ is an operator neural network with $L$ consecutive layers, then given any $\ell$ such that $1 \le \ell < L$, $y_\ell$ is numerically integrable, and if $\xi$ is any continuous and Riemann integrable input function, then $\mathcal{O}[\xi]$ is numerically integrable.

Algorithm 1 Feedforward Propagation on $\mathcal{F}$
  Input: input function $\xi$
  for $\ell \in \{1, \ldots, L-1\}$ do
    for $t \in Z_X^\ell$ do
      Calculate $I_t^\ell = \int_{E_\ell} y_\ell(j_\ell)\, j_\ell^t \, dj_\ell$.
    end for
    for $s \in Z_Y^\ell$ do
      Calculate $C_s^\ell = \sum_{a \in Z_X^\ell} k_{a,s}\, I_a^\ell$.
    end for
    Memoize $y_{\ell+1}(j) = g\big(\sum_{b \in Z_Y^\ell} j^b\, C_b^\ell\big)$.
  end for
  The output is given by $\mathcal{O}[\xi] = y_L$.

Proof.
Consider the first layer. We can write the sigmoidal output of the $(\ell+1)$th layer as a function of the previous layer; that is,

$$ y_{\ell+1} = g\left( \int_{E_\ell} w_\ell(j_\ell, j_{\ell+1})\, y_\ell(j_\ell)\, dj_\ell \right). \tag{10.1} $$

Clearly this composition can be expanded using the polynomial definition of the weight surface. Hence

$$ y_{\ell+1} = g\left( \int_{E_\ell} y_\ell(j_\ell) \sum_{x_{\ell+1} \in Z_Y^\ell} \sum_{x_\ell \in Z_X^\ell} k_{x_\ell, x_{\ell+1}}\, j_\ell^{x_\ell}\, j_{\ell+1}^{x_{\ell+1}} \, dj_\ell \right) = g\left( \sum_{x_{\ell+1} \in Z_Y^\ell} j_{\ell+1}^{x_{\ell+1}} \sum_{x_\ell \in Z_X^\ell} k_{x_\ell, x_{\ell+1}} \int_{E_\ell} y_\ell(j_\ell)\, j_\ell^{x_\ell}\, dj_\ell \right), \tag{10.2} $$

and therefore $y_{\ell+1}$ is numerically integrable. For the purpose of constructing an algorithm, let $I_{x_\ell}^\ell$ be the evaluation of the integral in the above definition for any given $x_\ell$.

It is important to note that the previous argument requires that $y_\ell$ be Riemann integrable. Hence, with $\xi$ satisfying those conditions, it follows inductively that every $y_\ell$ is integrable. That is, because $y_1$ is integrable, it follows from the numerical integrability of each layer that $\mathcal{O}[\xi] = y_L$ is numerically integrable. This completes the proof.

Using the logic of the previous proof, it follows that the development of some inductive algorithm is possible.
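The following is a runnable sketch of Algorithm 1 (the activation, degrees, coefficients, quadrature rule, and input function are our own choices): each layer costs one batch of one-dimensional integrals $I_t^\ell$, after which the next layer's output is memoized in closed form as $g$ applied to a polynomial.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# A runnable sketch of Algorithm 1 (activation, degrees, coefficients,
# quadrature rule, and input function are our own choices): each layer costs
# one batch of 1-D integrals I_t, after which the next layer's output is
# memoized in closed form as g applied to a polynomial.
g = lambda z: 1.0 / (1.0 + np.exp(-z))            # sigmoidal activation
nodes, weights = leggauss(64)                     # Gauss-Legendre on [-1, 1]
u, wq = 0.5 * (nodes + 1.0), 0.5 * weights        # rescaled to E = [0, 1]
integrate = lambda f: np.sum(wq * f(u))           # numerical integral over E

L, deg = 3, 4                                     # layers, polynomial degree
rng = np.random.default_rng(0)
k = [rng.normal(size=(deg, deg)) / deg for _ in range(L - 1)]

y = lambda j: np.sin(np.pi * j)                   # input function xi = y_1
for l in range(L - 1):
    # I_t = int_E y_l(j) j^t dj  -- the only integration this layer requires
    I = np.array([integrate(lambda j, t=t: y(j) * j**t) for t in range(deg)])
    C = k[l].T @ I                                # C_s = sum_a k_{a,s} I_a
    y = lambda j, C=C: g(sum(C[b] * j**b for b in range(deg)))  # memoize y_{l+1}

print("O[xi](0.5) =", y(0.5))                     # evaluate the output anywhere
```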
10.0.2 CONTINUOUS ERROR BACKPROPAGATION

As is common for non-convex problems in discretized neural networks, a stochastic gradient descent method will be developed using a continuous analogue of error backpropagation. We define the loss function as follows.
Definition 10.2.
For an operator neural network $\mathcal{O}$ and a dataset $\{(\gamma_n(j), \delta_n(j))\}$, we say that the error for a given $n$ is defined by

$$ E = \frac{1}{2} \int_{E_L} \big(\mathcal{O}(\gamma_n) - \delta_n\big)^2 \, dj_L \tag{10.3} $$

This error definition follows from $\mathcal{N}$, as the typical error function for $\mathcal{N}$ is just the square norm of the difference of the desired and predicted output vectors; in this case we use the $L^2$ norm on $C(E_L)$ in the same fashion.

We first propose the following lemma to aid in our derivation of a computationally suitable error backpropagation algorithm.
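As a quick illustration, the loss (10.3) is just a squared $L^2$ distance computed by quadrature on $E_L$ (the output and target functions and the grid below are our own stand-ins):

```python
import numpy as np

# A quick illustration of the loss (10.3) by quadrature on E_L = [0, 1]
# (the output/target functions and the grid are our own stand-ins).
j = np.linspace(0.0, 1.0, 1001)
O_gamma = np.tanh(2 * j)            # stand-in for the network output O(gamma_n)
delta = j                           # stand-in for the desired output delta_n

E_loss = 0.5 * np.trapz((O_gamma - delta) ** 2, j)
print("E =", E_loss)
```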
Lemma 10.3. Given some layer $\ell > 1$ in $\mathcal{O}$, functions of the form $\Psi_\ell = g'\big(\Sigma_{\ell-1}\, y_{\ell-1}\big)$, the argument being the pre-activation of layer $\ell$, are numerically integrable.

Proof. If

$$ \Psi_\ell = g'\left( \int_{E_{\ell-1}} y_{\ell-1}\, w_{\ell-1} \, dj_{\ell-1} \right) \tag{10.4} $$

then

$$ \Psi_\ell = g'\left( \sum_{b \in Z_Y^{\ell-1}} j_\ell^b \sum_{a \in Z_X^{\ell-1}} k_{a,b}^{\ell-1} \int_{E_{\ell-1}} y_{\ell-1}\, j_{\ell-1}^a \, dj_{\ell-1} \right), \tag{10.5} $$

hence $\Psi_\ell$ can be numerically integrated and thereby evaluated.

The ability to simplify the derivative of the output of each layer greatly reduces the computational time of the error backpropagation: it becomes a function defined on the interval of integration of the next iterated integral.
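A minimal sketch of the lemma (the sigmoid, degree, coefficients, and grid are our own choices): because the kernel is separable, the pre-activation is a polynomial in $j$ whose coefficients reuse the feedforward integrals $I_a$, so $\Psi_\ell$ is available in closed form with no nested integration.

```python
import numpy as np

# A minimal sketch of Lemma 10.3 (sigmoid, degree, coefficients, and grid are
# our own choices): separability makes the pre-activation of layer l a
# polynomial in j whose coefficients reuse the feedforward integrals I_a, so
# Psi_l = g'(pre-activation) needs no nested integration.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

deg = 4
rng = np.random.default_rng(1)
k = rng.normal(size=(deg, deg)) / deg               # k^{l-1}_{a,b}

u = np.linspace(0.0, 1.0, 2001)
y_prev = np.sin(np.pi * u)                          # y_{l-1} on a grid
I = np.array([np.trapz(y_prev * u**a, u) for a in range(deg)])   # I_a

def Psi(j):
    # Psi_l(j) = g'( sum_b j^b sum_a k_{a,b} I_a ), cf. (10.5)
    return g_prime(sum((k[:, b] @ I) * j**b for b in range(deg)))

print("Psi_l(0.3) =", Psi(0.3))
```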
Theorem 10.4. The gradient $\nabla E(\gamma, \delta)$ of the error function (10.3) on some $\mathcal{O}$ can be evaluated numerically.

Proof. Recall that $E$ over $\mathcal{O}$ is composed of the $k_{x,y}^\ell$ for $x \in Z_X^\ell$, $y \in Z_Y^\ell$, and $1 \le \ell \le L$. If we show that $\partial E / \partial k_{x,y}^\ell$ can be numerically evaluated for arbitrary $\ell, x, y$, then every component of $\nabla E$ is numerically evaluable and hence $\nabla E$ can be numerically evaluated. Given some arbitrary $\ell$ in $\mathcal{O}$, let $n = L - \ell$. We will examine the particular partial derivative for the case that $n = 1$, and then, for arbitrary $n$, induct over each iterated integral.

Consider the following expansion for $n = 1$:

$$ \begin{aligned} \frac{\partial E}{\partial k_{x,y}^{L-1}} &= \frac{\partial}{\partial k_{x,y}^{L-1}} \frac{1}{2}\int_{E_L} [\mathcal{O}(\gamma) - \delta]^2 \, dj_L \\ &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int_{E_{L-1}} j_{L-1}^x\, j_L^y\, y_{L-1} \, dj_{L-1}\, dj_L \\ &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \int_{E_{L-1}} j_{L-1}^x\, y_{L-1}\, dj_{L-1}\, dj_L \end{aligned} \tag{10.6} $$

Since the second integral in (10.6) is exactly $I_x^{L-1}$ from (10.2), it follows that

$$ \frac{\partial E}{\partial k_{x,y}^{L-1}} = I_x^{L-1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \, dj_L, \tag{10.7} $$

and clearly for the case of $n = 1$ the theorem holds.

Now we show that this is also the case for larger $n$. It will become clear why we have chosen to include $n = 1$ in the proof upon expansion of the partial derivative in these higher-order cases. Let us expand the gradient for $n \in \{2, \ldots, L-1\}$:

$$ \frac{\partial E}{\partial k_{x,y}^{L-n}} = \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \underbrace{\int_{E_{L-1}} w_{L-1}\, \Psi_{L-1} \int \cdots \int_{E_{L-n+1}} w_{L-n+1}\, \Psi_{L-n+1}}_{n-1 \text{ iterated integrals}} \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x\, j_{L-n+1}^y \, dj_{L-n} \ldots dj_L \tag{10.8} $$

As aforementioned, proving the $n = 1$ case separately is required because for $n = 1$, (10.8) has a section of $n - 1 = 0$ iterated integrals, which is not possible for the proceeding logic.

We now use the order-invariance property of iterated integrals (that is, $\int_A \int_B f(x,y)\,dx\,dy = \int_B \int_A f(x,y)\,dy\,dx$) and reverse the order of integration of (10.8). In order to reverse the order of integration, we must ensure each iterated integral has an integrand containing only variables which are guaranteed integration over some region. To examine this, we propose the following recurrence relation for the gradient. Let $\{B_s\}$ be defined along $L - n \le s \le L$ as follows:

$$ \begin{aligned} B_L &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, B_{L-1} \, dj_L, \\ B_s &= \int_{E_s} \Psi_s \sum_{a \in Z_X^s} \sum_{b \in Z_Y^s} j_s^a\, j_s^b\, B_{s-1} \, dj_s, \\ B_{L-n} &= \int_{E_{L-n}} j_{L-n}^x\, j_{L-n+1}^y \, dj_{L-n}, \end{aligned} \tag{10.9} $$

such that $\partial E / \partial k_{x,y}^\ell = B_L$. If we wish to reverse the order of integration, we must find a recurrence relation on a sequence $\{\tilde{B}_s\}$ such that $\partial E / \partial k_{x,y}^{L-n} = \tilde{B}_{L-n} = B_L$. Consider the gradual reversal of (10.8). Clearly,

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int_{E_{L-1}} w_{L-1}\, \Psi_{L-1} \int \cdots \int_{E_{L-n+1}} j_{L-n+1}^y\, w_{L-n+1}\, \Psi_{L-n+1} \, dj_{L-n+1} \ldots dj_L\, dj_{L-n} \tag{10.10} $$

is the first-order reversal of (10.8). We now show the second-order case, with the first weight function expanded:

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x \int_{E_{L-n+1}} \sum_b \sum_a k_{a,b}\, j_{L-n+1}^{a+y}\, \Psi_{L-n+1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int \cdots \int_{E_{L-n+2}} j_{L-n+2}^b\, w_{L-n+2}\, \Psi_{L-n+2} \, dj_{L-n+1} \ldots dj_L\, dj_{L-n}. \tag{10.11} $$
Repeated iteration of the method seen in (10.10) and (10.11), where the innermost integral is moved to the outside of the $(L-s)$th iterated integral, with $s$ the iteration, yields the following full reversal of (10.8). For notational simplicity, recall that $\ell = L - n$; then

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_\ell} y_\ell\, j_\ell^x \int_{E_{\ell+1}} \sum_{a \in Z_X^{\ell+1}} j_{\ell+1}^{a+y}\, \Psi_{\ell+1} \int_{E_{\ell+2}} \sum_{b \in Z_Y^{\ell+1}} \sum_{c \in Z_X^{\ell+2}} k_{a,b}^\ell\, j_{\ell+2}^{b+c}\, \Psi_{\ell+2} \int_{E_{\ell+3}} \sum_{d \in Z_Y^{\ell+2}} \sum_{e \in Z_X^{\ell+3}} k_{c,d}^{\ell+2}\, j_{\ell+3}^{d+e}\, \Psi_{\ell+3} \int \cdots \int_{E_L} \sum_{q \in Z_Y^{L-1}} k_{p,q}^{L-1}\, j_L^q\, [\mathcal{O}(\gamma) - \delta]\, \Psi_L \, dj_L \ldots dj_{L-n}. \tag{10.12} $$

Observing the reversal in (10.12), we yield the following recurrence relation for $\{B_s\}$. Bear in mind that $\ell = L - n$, that $x$ and $y$ still correspond with $\partial E / \partial k_{x,y}^\ell$, and that the following relation uses its definition on $s$ for cases not otherwise defined:

$$ \begin{aligned} B_{L,t} &= \int_{E_L} \sum_{b \in Z_Y^{L-1}} k_{t,b}^{L-1}\, j_L^b\, [\mathcal{O}(\gamma) - \delta]\, \Psi_L \, dj_L, \\ B_{s,t} &= \int_{E_s} \sum_{b \in Z_Y^{s-1}} \sum_{a \in Z_X^s} k_{t,b}^{s-1}\, j_s^{a+b}\, \Psi_s\, B_{s+1,a} \, dj_s, \\ B_{\ell+1} &= \int_{E_{\ell+1}} \sum_{a \in Z_X^{\ell+1}} j_{\ell+1}^{a+y}\, \Psi_{\ell+1}\, B_{\ell+2,a} \, dj_{\ell+1}, \\ \frac{\partial E}{\partial k_{x,y}^\ell} &= B_\ell = \int_{E_\ell} j_\ell^x\, y_\ell\, B_{\ell+1} \, dj_\ell. \end{aligned} \tag{10.13} $$

Algorithm 2 Error Backpropagation
  Input: input $\gamma$, desired $\delta$, learning rate $\alpha$, time $t$.
  for $\ell \in \{2, \ldots, L\}$ do
    Calculate $\Psi_\ell = g'\big( \int_{E_{\ell-1}} y_{\ell-1}\, w_{\ell-1} \, dj_{\ell-1} \big)$
  end for
  For every $t$, compute $B_{L,t}$ from (10.13).
  Update the output coefficient matrix: $k_{x,y}^{L-1} - \alpha\, I_x^{L-1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \, dj_L \to k_{x,y}^{L-1}$.
  for $\ell = L-2$ to $1$ do
    If it is null, compute and memoize $B_{\ell+2,t}$ from (10.13).
    Compute but do not store $B_{\ell+1} \in \mathbb{R}$.
    Compute $\partial E / \partial k_{x,y}^\ell = B_\ell$ from (10.13).
    Update the weights on layer $\ell$: $k_{x,y}^\ell(t) - \alpha B_\ell \to k_{x,y}^\ell(t+1)$.
  end for

Note that $B_{L-n} = B_L$ by this logic. With (10.13), we need only show that $B_{L-n}$ is integrable. Hence we induct on $L - n \le s \le L$ over $\{B_s\}$ under the proposition that $B_s$ is not only numerically integrable but also constant.

Consider the base case $s = L$. For every $t$, because every function in the integrand of $B_{L,t}$ in (10.13) is composed of $j_L$, functions of the form $B_{L,t}$ must be numerically integrable and clearly $B_{L,t} \in \mathbb{R}$. Now suppose that $B_{s+1,t}$ is numerically integrable and constant. Then, trivially, $B_{s,u}$ is also numerically integrable by the contents of the integrand in (10.13), and $B_{s,u} \in \mathbb{R}$. Hence the proposition that $s + 1$ implies $s$ holds for $\ell < s < L$.

Lastly we must show that both $B_{\ell+1}$ and $B_\ell$ are numerically integrable. By induction $B_{\ell+2}$ must be numerically integrable; hence, by the contents of its integrand, $B_{\ell+1}$ must also be numerically integrable and real. As a result, $B_\ell = \partial E / \partial k_{x,y}^\ell$ is real and numerically integrable.

Since we have shown that $\partial E / \partial k_{x,y}^\ell$ is numerically integrable, $\nabla E$ must therefore be numerically evaluable as aforementioned. This completes the proof.
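To ground the last-layer case concretely, here is a runnable sketch of the $n = 1$ update in Algorithm 2 via (10.7) (the toy two-layer network, activation, target, and grid are our own choices; deeper layers would reuse the $B_{s,t}$ recurrence of (10.13) in the same way):

```python
import numpy as np

# A runnable sketch of the n = 1 (last-layer) update in Algorithm 2 via
# (10.7); the toy network, activation, target, and grid are our own choices.
g = lambda z: 1.0 / (1.0 + np.exp(-z))
gp = lambda z: g(z) * (1.0 - g(z))                # g'

deg = 3
rng = np.random.default_rng(2)
k_out = rng.normal(size=(deg, deg))               # output coefficients k^{L-1}

j = np.linspace(0.0, 1.0, 2001)                   # grid on E_L = E_{L-1}
y_prev = np.sin(np.pi * j)                        # y_{L-1}
I = np.array([np.trapz(y_prev * j**x, j) for x in range(deg)])  # I^{L-1}_x

pre = sum((k_out[:, b] @ I) * j**b for b in range(deg))  # pre-activation on E_L
out, Psi_L = g(pre), gp(pre)                      # O(gamma) and Psi_L
delta = j                                         # toy desired output

grad = np.empty((deg, deg))
for x in range(deg):
    for y_ in range(deg):
        # dE/dk_{x,y} = I_x * int_{E_L} [O(gamma) - delta] Psi_L j^y dj  (10.7)
        grad[x, y_] = I[x] * np.trapz((out - delta) * Psi_L * j**y_, j)

alpha = 0.1
k_out -= alpha * grad                             # the gradient step on k^{L-1}
print("||grad|| =", np.linalg.norm(grad))
```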