Deep Function Machines: Generalized Neural Networks for Topological Layer Expression
William H. Guss
Machine Learning at Berkeley, University of California, Berkeley
[email protected]

ABSTRACT
In this paper we propose a generalization of deep neural networks called deep function machines (DFMs). DFMs act on vector spaces of arbitrary (possibly infinite) dimension, and we show that a family of DFMs is invariant to the dimension of input data; that is, the parameterization of the model does not directly hinge on the quality of the input (e.g. high resolution images). Using this generalization we provide a new theory of universal approximation of bounded non-linear operators between function spaces. We then suggest that DFMs provide an expressive framework for designing new neural network layer types with topological considerations in mind. Finally, we introduce a novel architecture, RippLeNet, for resolution invariant computer vision, which empirically achieves state of the art invariance.
1 INTRODUCTION
In recent years, deep learning has radically transformed a majority of approaches to computer vision, reinforcement learning, and generative models [Schmidhuber (2015)]. Theoretically, we still lack a unified description of what computational mechanisms have made these deeper models more successful than their wider counterparts. Substantial analysis by Shalev-Shwartz et al. (2011), Raghu et al. (2016), Poole et al. (2016), and many others gives insight into how the properties of neural architectures, like depth and weight sharing, determine the expressivity of those architectures. Less studied, however, is how the properties of data, such as sample statistics or geometric structure, determine the architectures which are most expressive on that data.

Surprisingly, the latter perspective leads to simple questions without answers rooted in theory. For example, what topological properties of images allow convolutional layers such expressivity and generalizability thereon? Intuitively, spatial locality and translation invariance are sufficient justifications in practice, but is there a more general theory which suggests the optimality of convolutions? Furthermore, do there exist weight sharing schemes beyond convolutions and fully connected layers that give rise to provably more expressive models in practice? In this paper, we will more concretely study the data-architecture relationship and develop a theoretical framework for creating layers and architectures with provable properties subject to topological and geometric constraints imposed on the data.
The Problem with Resolution.
To motivate a use for such a framework, we consider the problem of learning on high resolution data. Computationally, machine learning deals with discrete signals, but frequently those signals are sampled in time from a continuous function. For example, audio is inherently a continuous function $f : [0, t_{end}] \to \mathbb{R}$, but is sampled as a vector $v \in \mathbb{R}^{1 \times t}$. Even in vision, images are generally piecewise smooth functions $f : \mathbb{R}^2 \to \mathbb{R}$ mapping pixel position to color intensity, but are sampled as tensors $v \in \mathbb{R}^{x \times y \times c}$. Performing tractable machine learning as the resolution of images or audio increases almost always requires some lossy preprocessing like PCA or Discrete Fourier Analysis [Burch (2001)]. Convolutional neural networks avoid dealing therein by intuitively assuming a spatial locality on these vectors. However, one wonders what is lost through the use of various dimensionality reduction and weight sharing schemes. Note that we do not claim that deep learning on high resolution data is currently intractable or ineffective; the problem of resolution is presented as an example in which topological constraints can be imposed on a type of data to yield new architectures with desired, provable properties.

Figure 1: Left: A discrete vector $v \in \mathbb{R}^{l \times w}$ representation of an image. Right: The true continuous function $f : \mathbb{R}^2 \to \mathbb{R}$ from which it was sampled.

A key observation in discussing a large class of smooth functions is their simplicity. Although from a set theoretic perspective the graph of a function consists of infinitely many points, relatively complex algebras of functions can be described with symbolic simplicity. A great example are polynomials: the space of all square ($x^2$) monomials occupies a one-dimensional vector space, and one can generalize this phenomenon beyond these basic families. Thus we will explore what results from embracing the assumption that a signal is really a sample from a continuous process, and utilize the analytic simplicity of certain smooth functions to derive new layer types.
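To make the sampling picture concrete, the short sketch below (our own illustration, not from the paper; the bump-plus-sine signal and the chosen resolutions are hypothetical) samples a fixed smooth signal at increasing resolutions. The dimension of the sample vector grows without bound while the underlying function stays fixed:

```python
import numpy as np

def xi(u):
    # A fixed smooth "ground truth" signal on [0, 1] (hypothetical example).
    return np.exp(-20.0 * (u - 0.3) ** 2) + 0.5 * np.sin(6.0 * np.pi * u)

for n in [16, 64, 256, 1024]:
    u = np.linspace(0.0, 1.0, n)   # sample positions in E = [0, 1]
    x = xi(u)                      # the discrete signal x in R^n
    print(f"resolution {n:5d}: the sample lives in R^{n}, "
          "but xi itself never changed")
```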
Our Contribution. First, we extend neural networks to the infinite dimensional domain of continuous functions and define deep function machines (DFMs), a general family of function approximators which encapsulates this continuous relaxation and its discrete counterpart. Thereafter, we survey and refocus past analysis of neural networks with infinitely (and potentially uncountably) many nodes with respect to the expressiveness of the maps that they represent. We show that DFMs not only admit most other infinite dimensional neural network generalizations in the literature, but also provide the necessary language to solve two long standing questions of universal approximation raised following Stinchcombe (1999). With the framework firmly established, we then return to our motivating goal of provable deep learning and show that DFMs naturally give rise to neural networks which are provably invariant to the resolution of the input, and indeed that DFMs can be used more generally to construct architectures (e.g. those with convolutions) with provable properties given topological assumptions. Finally, we experimentally verify such constructions by introducing a new type of layer, WaveLayers, apart from convolutions.

2 BACKGROUND
In order to propose deep function machines we must establish what it means for a neural network to act directly on continuous functions. Recall the standard McCulloch & Pitts (1943) feed-forward neural network.
Definition 2.1 (Discrete Neural Networks). We say $\mathcal{N} : \mathbb{R}^n \to \mathbb{R}^m$ is a (discrete) feed-forward neural network iff the following recurrence relation is defined for adjacent layers $\ell \to \ell'$:
$$\mathcal{N} : y_{\ell'} = g\left(W_\ell^T y_\ell\right); \qquad y_0 := x, \qquad (2.1)$$
where $W_\ell$ is a weight tensor and $g$ is a non-polynomial activation function.

Suppose that we wish to map one space of functions to another with a neural network. Consider the model of $\mathcal{N}$ as the number of neurons for every layer becomes uncountable. The index for each neuron then becomes real-valued, along with the weight and input vectors. The process is roughly depicted in Figure 2. The core idea behind the derivation is that as the number of nodes in the network becomes uncountable, we need to apply a normalizing term to the contribution of each node in the evaluation of the following layer so as to avoid saturation. Eventually this process resembles Lebesgue integration.

More formally, let $\mathcal{N}$ be an $L$ layer neural network as given in Definition 2.1. Without loss of generality we will examine the first layer, $\ell = 1$. Let us denote $\xi : X \subset \mathbb{R} \to \mathbb{R}$ as some arbitrary continuous input function for the neural network (see related work).

Figure 2: Left: Resolution refinement of an input signal by simple functions. Right: An illustration of the extension of neural networks to infinite dimensions. Note that $x \in \mathbb{R}^N$ is a sample of $f^{(N)}$, a simple function with $\|f^{(N)} - \xi\| \to 0$ as $N \to \infty$. Furthermore, the process is not actually countable, as depicted here.

Likewise consider a real-valued, piecewise integrable weight function, $\omega_\ell : \mathbb{R}^2 \to \mathbb{R}$, for a layer $\ell$, which is composed with two indexing variables $u \in E_\ell$ and $v \in E_{\ell'}$. In this analysis we will restrict the indices to lie in compact sets $E_\ell, E_{\ell'} \subset \mathbb{R}$. If $f$ is a simple function, then for some finite partition of $E_\ell$, say $u_0 < \cdots < u_n$, we have $f = \sum_{m=1}^{n} \chi_{[u_{m-1}, u_m]}\, p_m$, where for all $u \in [u_{m-1}, u_m]$, $p_m \le \xi(u)$. Visually this is a piecewise constant function underneath the graph of $\xi$. Suppose that some vector $x$ is sampled from $\xi$; then we can make $x$ a simple function by taking an arbitrary partition of $E_\ell$ so that when $u_0 < u < u_1$, $f(u) = x_1$; when $u_1 < u < u_2$, $f(u) = x_2$; and so on. This simple function $f$ is essentially piecewise constant on intervals of uniform length so that on each interval it attains the value of the $n$th component, $x_n$. Finally, if $w_v$ is some simple function approximating the $v$-th row of some weight matrix $W_\ell$ in the same fashion, then $w_v \cdot f$ is also a simple function. Therefore the particular neural layer associated to $f$ (and thereby $x$) is
$$y_v = g(W_\ell^T x)_v = g\left(\sum_{m=1}^{n} W_\ell^{mv}\, x_m\, \mu([u_{m-1}, u_m])\right) = g\left(\int_{E_\ell} w_\ell^v(u)\, f(u)\, d\mu(u)\right), \qquad (2.2)$$
where $\mu$ is the Lebesgue measure on $\mathbb{R}$.

Now suppose that there is a refinement of $x$; that is, returning to our original problem, there is a higher resolution sample of $\xi$, say $f'$ (and thereby $x'$), so that it more closely approximates $\xi$. It then follows that the corresponding refined partition, $u'_0 < \cdots < u'_k$ (where $k > n$), occupies the same $E_\ell$, but individually $\mu([u'_{m-1}, u'_m]) \le \mu([u_{m-1}, u_m])$.
Therefore we weight the contribution of each $x'_n$ less than each $x_n$, in a measure theoretic sense. Recalling the theory of simple functions, without loss of generality assume $\xi, \omega(\cdot, \cdot) \ge 0$. Then, letting
$$F_v = \left\{(w_v, f) : E_\ell \to \mathbb{R} \;\middle|\; f, w_v \text{ simple},\ 0 \le f \le \xi,\ 0 \le w_v \le \omega_\ell(\cdot, v)\right\}, \qquad (2.3)$$
it follows immediately that
$$\sup_{(f, w_v) \in F_v} \int_{E_\ell} w_v(u)\, f(u)\, d\mu(u) = \int_{E_\ell} \omega_\ell(u, v)\, \xi(u)\, d\mu(u). \qquad (2.4)$$
Therefore we give the following definition for infinite dimensional neural networks.

Definition 2.2 (Operator Neural Networks). We call $\mathcal{O} : L^1(E_\ell) \to L^1(E_{\ell'})$ an operator neural network parameterized by $\omega_\ell$ if for two adjacent layers $\ell \to \ell'$
$$\mathcal{O} : y_{\ell'}(v) = g\left(\int_{E_\ell} y_\ell(u)\, \omega_\ell(u, v)\, d\mu(u)\right); \qquad y_0(v) = \xi(v), \qquad (2.5)$$
where $E_\ell, E_{\ell'}$ are locally compact Hausdorff measure spaces, $u \in E_\ell$, and $v \in E_{\ell'}$. It is no loss of generality to extend the results in this work to weight kernels indexed by arbitrary $u, v \in \mathbb{R}^n$, but we omit this treatment for ease of understanding.
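As a sanity check on Definition 2.2, the sketch below (our own illustration; the kernel, input signal, and grid sizes are hypothetical choices) evaluates an operator layer by replacing the integral in (2.5) with a Riemann sum, where the grid spacing plays the role of the measure term $\mu([u_{m-1}, u_m])$ above; refining the input grid barely changes the output:

```python
import numpy as np

def operator_layer(xi, omega, v_grid, n_u, g=np.tanh):
    """Numerically evaluate y(v) = g( integral of xi(u) * omega(u, v) du )."""
    u, du = np.linspace(0.0, 1.0, n_u, retstep=True)
    # Riemann sum over u for every output index v; du normalizes each node.
    return g(np.array([np.sum(xi(u) * omega(u, v)) * du for v in v_grid]))

xi = lambda u: np.sin(2.0 * np.pi * u)               # input function
omega = lambda u, v: np.exp(-10.0 * (u - v) ** 2)    # a smooth weight kernel
v_grid = np.linspace(0.0, 1.0, 32)

coarse = operator_layer(xi, omega, v_grid, n_u=50)
fine = operator_layer(xi, omega, v_grid, n_u=5000)
print(np.max(np.abs(coarse - fine)))  # small: the layer is resolution-stable
```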
3 DEEP FUNCTION MACHINES

With operator neural networks defined, we endeavour to define a topologically inspired framework for developing expressive layer types. A powerful language of abstraction for describing feed-forward (and potentially recurrent) neural network architectures is that of computational skeletons as introduced in Daniely et al. (2016). Recall the following definition.
Definition 3.1.
A computational skeleton $\mathcal{S}$ is a directed acyclic graph whose non-input nodes are labeled by activations.

Daniely et al. (2016) provides an excellent account of how these graph structures abstract the many neural network architectures we see in practice. We will give these skeletons "flesh and skin," so to speak, and in doing so pursue a suitable generalization of neural networks which allows intermediate mappings between possibly infinite dimensional topological vector spaces. DFMs are that generalization.
Definition 3.2 (Deep Function Machines). A deep function machine $\mathcal{D}$ is a computational skeleton $\mathcal{S}$ indexed by $I$ with the following properties:

- Every vertex in $\mathcal{S}$ is a topological vector space $X_\ell$ where $\ell \in I$.
- If nodes $\ell \in A \subset I$ feed into $\ell'$, then the activation on $\ell'$ is denoted $y_{\ell'} \in X_{\ell'}$ and is defined as
$$y_{\ell'} = g\left(\sum_{\ell \in A} T_\ell\left[y_\ell\right]\right) \qquad (3.1)$$
where $T_\ell : X_\ell \to X_{\ell'}$ is some affine form called the operation of node $\ell$.

To see the expressive power of this generalization, we propose several operations $T_\ell$ that not only encapsulate ONNs and other abstractions on infinite dimensional neural networks, but also almost all feed-forward architectures used in practice.

3.1 GENERALIZED NEURAL LAYERS
Generalized neural layers are the basic units of the theory of deep function machines, and they can be used to construct architectures of neural networks with provable properties, such as the resolution invariance we seek. The most basic case is $X_\ell = \mathbb{R}^n$ and $X_{\ell'} = \mathbb{R}^m$, where we should expect a standard neural network. As either $X_\ell$ or $X_{\ell'}$ becomes infinite dimensional, we hope to attain models of functional MLPs from Rossi et al. (2002) or infinite layer neural networks from Globerson & Livni (2016) with universal approximation properties.

Definition 3.3 (Generalized Layer Operations). We suggest several natural generalized layer families $T_\ell$ for DFMs as follows.

- $T_\ell$ is said to be $o$-operational if and only if $X_\ell$ and $X_{\ell'}$ are spaces of integrable functions over locally compact Hausdorff measure spaces, and
$$T_\ell[y_\ell](v) = o(y_\ell)(v) = \int_{E_\ell} y_\ell(u)\, \omega_\ell(u, v)\, d\mu(u). \qquad (3.2)$$
For example, $X_\ell, X_{\ell'} = C(\mathbb{R})$ yields operator neural networks.

- $T_\ell$ is said to be $n$-discrete if and only if $X_\ell$ and $X_{\ell'}$ are finite dimensional vector spaces, and
$$T_\ell[y_\ell] = n(y_\ell) = W_\ell^T y_\ell. \qquad (3.3)$$
For example, $X_\ell = \mathbb{R}^n$, $X_{\ell'} = \mathbb{R}^m$ yields standard feed-forward neural networks.

- $T_\ell$ is said to be $f$-functional if and only if $X_\ell$ is some space of integrable functions as mentioned previously and $X_{\ell'}$ is a finite dimensional vector space, and
$$T_\ell[y_\ell] = f(y_\ell) = \int_{E_\ell} \omega_\ell(u)\, y_\ell(u)\, d\mu(u). \qquad (3.4)$$
For example, $X_\ell = C(\mathbb{R})$, $X_{\ell'} = \mathbb{R}^n$ yields functional MLPs. Nothing precludes the definition from allowing multiple functions as input; the operation must just be carried out on each coordinate function. Note that $y_\ell(u)$ is a scalar function and $\omega_\ell$ is a vector valued function of dimension $\dim(X_{\ell'})$.

- $T_\ell$ is said to be $d$-defunctional if and only if $X_\ell$ is a finite dimensional vector space and $X_{\ell'}$ is some space of integrable functions, and
$$T_\ell[y_\ell](v) = d(y_\ell)(v) = \omega_\ell(v)^T y_\ell. \qquad (3.5)$$
For example, $X_\ell = \mathbb{R}^n$, $X_{\ell'} = C(\mathbb{R})$. Additionally, this definition can easily be extended to function spaces on finite dimensional vector spaces by using the Kronecker product.

Figure 3: Examples of three different deep function machines with activations omitted and $T_\ell$ replaced with the actual type. Left: A standard feed forward binary classifier (without convolution). Middle: An operator neural network. Right: A complicated DFM with residues.

The naturality of the above layer operations comes from their universality and generality.
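The following sketch (our own, with arbitrary grids and kernels) renders the four layer families of Definition 3.3 as plain callables, emphasizing that they differ only in which of the domain and codomain are function spaces; here a "function" is represented by its samples on a fixed grid:

```python
import numpy as np

U = np.linspace(0.0, 1.0, 200)           # grid representing E for the function side
DU = U[1] - U[0]

def o_layer(y, omega):                   # functions -> functions, eq. (3.2)
    return np.array([np.sum(y * omega(U, v)) * DU for v in U])

def n_layer(y, W):                       # vectors -> vectors, eq. (3.3)
    return W.T @ y

def f_layer(y, omegas):                  # functions -> vectors, eq. (3.4)
    return np.array([np.sum(w(U) * y) * DU for w in omegas])

def d_layer(y, omegas):                  # vectors -> functions, eq. (3.5)
    return sum(y_i * w(U) for y_i, w in zip(y, omegas))

y_fn = np.sin(2.0 * np.pi * U)           # a sampled input function
vec = f_layer(y_fn, omegas=[np.cos, np.sin])   # maps the function into R^2
```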
3.2 RELATED WORK AND A UNIFIED VIEW OF INFINITE DIMENSIONAL NEURAL NETWORKS
Operator neural networks are just one of many instantiations of DFMs. Before we show universality results for deep function machines, it should be noted that there has been substantial effort in the literature to explore various embodiments of infinite dimensional neural networks. To the best of the authors' knowledge, DFMs provide a single unified view of every such proposed framework to date.

In particular, Neal (1996) proposed the first analysis of neural networks with countably infinite nodes, showing that as the number of nodes in discrete neural networks tends to infinity, they converge to a Gaussian process prior over functions. Later, Williams (1998) provided a deeper analysis of such a limit on neural networks. A great deal of effort was placed on analyzing covariance maps associated to the Gaussian processes resultant from infinite neural networks with both sigmoidal and Gaussian activation functions. These results were based mostly in the framework of Bayesian learning, and led to a great deal of analyses of the relationship between non-parametric kernel methods and infinite networks, including Le Roux & Bengio (2007), Seeger (2004), Cho & Saul (2011), Hazan & Jaakkola (2015), and Globerson & Livni (2016).

Out of this initial work, Hazan & Jaakkola (2015) define infinite layer neural networks with one or two hidden layers, which map a vector $x \in \mathbb{R}^n$ to a real value by considering infinitely many feature maps $\phi_w(x) = g(\langle w, x \rangle)$, where $w$ is an index variable in $\mathbb{R}^n$. Then for some weight function $u : \mathbb{R}^n \to \mathbb{R}$, the output of an infinite layer neural network is the real number $\int u(w)\, \phi_w(x)\, d\mu(w)$. This approach can be kernelized and has therefore resulted in further theory by Globerson & Livni (2016) that aligns neural networks with Gaussian processes and kernel methods. Operator neural networks differ significantly in that we let each $w$ be freely parameterized by some function $\omega$ and require that $x$ be a continuous function on a locally compact Hausdorff space. Additionally, no universal approximation theory is provided for infinite layer networks directly, but it is cited as following from the work of Le Roux & Bengio (2007). As we will see, DFMs will not only encapsulate (and benefit from) these results, but also provide a general universal approximation theory therefor.
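As a concrete reading of the infinite layer construction above, the following sketch (our own illustration; the weight function and the Gaussian sampling measure are hypothetical choices) estimates the output $\int u(w)\, \phi_w(x)\, d\mu(w)$ by Monte Carlo, sampling finitely many of the feature maps $\phi_w$:

```python
import numpy as np

rng = np.random.default_rng(0)

def infinite_layer(x, u, n_samples=100_000):
    """Monte Carlo estimate of  integral of u(w) * g(<w, x>) dmu(w),  mu = N(0, I)."""
    w = rng.standard_normal((n_samples, x.shape[0]))  # draws from mu
    phi = np.tanh(w @ x)                              # feature maps phi_w(x)
    return np.mean(u(w) * phi)                        # expectation under mu

x = np.array([0.5, -1.0, 2.0])
u = lambda w: np.sum(w, axis=1)        # a hypothetical weight function u(w)
print(infinite_layer(x, u))
```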
Table 1: Unification of Infinite Dimensional Neural Network Theory

| Name | Form | DFM | Authors |
|---|---|---|---|
| Infinite NNs | $b + \sum_{j=1}^{\infty} v_j h(x; u_j)$ | $\mathcal{N}_\infty : \mathbb{R}^n \xrightarrow{n} \bigoplus_{i=1}^{\infty} \mathbb{R} \xrightarrow{n} \mathbb{R}^m$ | Neal (1996); Williams (1998) |
| Functional MLPs | $\sum_{i=1}^{p} \beta_i\, g\left(\int w_i \xi\, d\mu\right)$ | $\mathcal{F} : L^1(\mathbb{R}) \xrightarrow{f} \mathbb{R}^p \xrightarrow{n} \mathbb{R}$ | Stinchcombe (1999); Rossi et al. (2002) |
| Continuous NNs | $\int \omega(u)\, g(x \cdot \omega(u))\, du$ | $\mathcal{C} : \mathbb{R}^n \xrightarrow{d} L^1([a, b]) \xrightarrow{f} \mathbb{R}^m$ | Le Roux & Bengio (2007) |
| Non-parametric Continuous NNs | $\int \omega(u)\, g(\langle x, u \rangle)\, du$ | $\mathcal{C}' : \mathbb{R}^n \xrightarrow{d} L^1(\mathbb{R}) \xrightarrow{f} \mathbb{R}^m$ | Le Roux & Bengio (2007) |
| Infinite Layer NNs | same as non-parametric continuous NNs | same as non-parametric continuous NNs | Globerson & Livni (2016); Hazan & Jaakkola (2015) |

Another variant of infinite dimensional neural networks, which we hope to generalize, is the functional multilayer perceptron (Functional MLP). This body of work is not referenced in any of the aforementioned work on infinite layer neural networks, but it is clearly related. The fundamental idea is that given some $f \in V = C(X)$, where $X$ is a locally compact Hausdorff space, there exists a generalization of neural networks which approximates arbitrary continuous bounded functionals on $V$ (maps $f \mapsto a \in \mathbb{R}$). These functional MLPs take the form $\sum_{i=1}^{p} \beta_i\, g\left(\int \omega_i(x) f(x)\, d\mu(x)\right)$. The authors show the power of such an approximation using the functional analysis results of Stinchcombe (1999) and additionally provide statistical consistency results, defining well defined optimal parameter estimation in the infinite dimensional case.

Stemming additionally from the initial work of Neal (1996), the final variant, called continuous neural networks, has two manifestations: the first is more closely related to functional perceptrons and the last is exactly the formulation of infinite layer NNs. Initially Le Roux & Bengio (2007) propose an infinite dimensional neural network of the form $\int \omega(u)\, g(x \cdot \omega(u))\, d\mu(u)$ and show universal approximation in this regime. Overall this formulation mimics multiplication by some weighting vector as in infinite layer NNs, except that in the continuous neural formulation $\omega$ can be parameterized by a set of weights. Thereafter, to prove connections with Gaussian processes from a different vantage, they propose non-parametric continuous neural networks, $\int \omega(u)\, g(x \cdot u)\, d\mu(u)$, which are exactly infinite-layer neural networks.

In the view of deep function machines, the foregoing variants of infinite and semi-infinite dimensional neural networks are merely instantiations of different computational skeleton structures. A summary of the unified view is given in Table 1.
3.3 APPROXIMATION THEORY OF DEEP FUNCTION MACHINES

In addition to unification, DFMs provide a powerful language for proving universal approximation theorems for neural networks of any depth and dimension (by dimension, we mean both infinite and finite dimensional neural networks). The central theme of our approach is that the approximation theories of any DFM can be factored through the standard approximation theories of discrete neural networks. In the forthcoming section, this principle allows us to prove two approximation theories which have been open questions since Stinchcombe (1999).

The classic result of Cybenko (1989) yields the theory for $n$-discrete layers. For $f$-functional layers, the work of Stinchcombe (1999) proved in great generality that for certain topologies on $C(E_\ell)$, two layer functional MLPs universally approximate any continuous functional on $C(E_\ell)$. Following Stinchcombe (1999), Rossi et al. (2002) extended these results to the case wherein multiple $o$-operational layers prepend $f$-functional layers. We will show in particular that $o$-operational and similarly $d$-defunctional layers alone are dense in the much richer space of uniformly continuous bounded operators on function space. We give three results of increasing power, but decreasing transparency.

Theorem 3.4 (Point Approximation). Let $[a, b] \subset \mathbb{R}$ be a bounded interval and $g : \mathbb{R} \to B \subset \mathbb{R}$ be a continuous, bijective activation function. Then if $\xi : E_\ell \to \mathbb{R}$ and $f : E_{\ell'} \to B$ are $L^1(\mu)$ integrable functions, there exists a unique class of $o$-operational layers such that $g \circ o[\xi] = f$.
Proof. We seek a class of weight kernels $\omega_\ell$ so that $o[\xi] = f$. Let
$$\omega_\ell(u, v) = \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)$$
where $\Xi(u)$ is the indefinite integral of $\xi$. Define $h$ so that it satisfies the following two equivalent equations $\mu$-a.e.:
$$h(\Xi(b), v) - h(\Xi(a), v) = f(v), \qquad \left.\frac{\partial h(x, v)}{\partial x}\, \xi(u)\right|_{x = \Xi(u),\; u \in \{a, b\}} = 0. \qquad (3.6)$$
The proof is completed in the appendix.

The statement of Theorem 3.4 is not itself very powerful; we merely claim that $o$-operational layers can at least map any one function to any other one function. However, the proof yields insight into what the weight kernels of $o$-operational layers look like when the single condition $\xi \mapsto f$ is imposed. Therefrom, we conjecture but do not prove that a statistically optimal initialization for training $o$-operational layers is given by satisfying (3.6) when $\xi = \frac{1}{m}\sum_{n=1}^{m} \xi_n$ and $f = \frac{1}{m}\sum_{n=1}^{m} f_n$, where the training set $\{(\xi_n, f_n)\}$ is drawn i.i.d. from some distribution $\mathcal{D}$.

Theorem 3.5 (Nonlinear Operator Approximation). Suppose that $E_1, E_2$ are bounded intervals in $\mathbb{R}$. For all $\kappa, \lambda$, if $K : \mathrm{Lip}_\lambda(E_1) \to \mathrm{Lip}_\kappa(E_2)$ is a uniformly continuous, nonlinear operator, then for every $\epsilon > 0$ there exists a deep function machine
$$\mathcal{D} : L^1(E_1) \xrightarrow{\,o\,} L^1(E') \xrightarrow{\,o\,} L^1(E_2) \qquad (3.7)$$
(for an intermediate interval $E'$) such that $\left\| \mathcal{D}|_{\mathrm{Lip}_\lambda} - K \right\| < \epsilon$.

With two layer operator networks universal, it remains to consider $d$-defunctional layers.
Theorem 3.6 (Nonlinear Basis Approximation). Suppose $I, E_1, E_2$ are compact intervals, and let $C^\omega(X)$ denote the set of analytic functions on $X$. If $B : I^n \to C^\omega(E_2)$ is a continuous basis map to analytic functions, then for every $\epsilon > 0$ there exists a deep function machine
$$\mathcal{D} : \mathbb{R}^n \xrightarrow{\,d\,} L^1(E_1) \xrightarrow{\,o\,} L^1(E_2) \qquad (3.8)$$
such that $\|\mathcal{D}|_{I^n} - B\| < \epsilon$ in the topology of uniform convergence.

To the best of our knowledge, the above are the first approximation theorems for nonlinear operators and basis maps on function spaces for neural networks. The proofs in the appendix roughly involve a factorization of arbitrary DFMs through approximation theories for $n$-discrete layers.

Essentially, the factorization works as follows. In both of the foregoing theorems we want to roughly approximate some nonlinear map $K$ with a DFM $\mathcal{D}$. We therefore define an operator $|K|$, called an affine projection, that takes functions, converts them into piecewise constant approximations, applies $K$, and then again converts the result to piecewise constant approximations. Since there are a finite number, say $N$ and $M$, of pieces given in the input and the output of $|K|$ respectively, we can define an operator $\tilde{K} : \mathbb{R}^N \to \mathbb{R}^M$, called a lattice map, which in some sense reproduces $|K|$. We then show both Theorem 3.5 and Theorem 3.6 by approximating $\tilde{K}$ with a discrete neural network, $\mathcal{N}$, and choosing $\mathcal{D}$ to be such that its discretization is $\mathcal{N}$. Surprisingly, this principle holds for any DFM structure and a large class of different $K$, not just those which use piecewise constant approximations!
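To make the factorization tangible, the sketch below (our own illustration; the operator $K$, the partition sizes, and the grid are hypothetical, and we use piecewise constant projections for simplicity) builds the projection $\rho_P$, a right inverse $\rho_P^*$, and the induced lattice map $\tilde{K} = \rho_Q \circ K \circ \rho_P^*$, which is the finite dimensional map a discrete network is then asked to approximate:

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 1000)      # dense grid representing E

def rho(f_vals, n_pieces):
    """rho_P: sample a function (given on GRID) at n_pieces partition points."""
    idx = np.linspace(0, len(GRID) - 1, n_pieces).astype(int)
    return f_vals[idx]

def rho_star(v):
    """rho_P^*: lift a vector back to a piecewise constant function on GRID."""
    return np.repeat(v, int(np.ceil(len(GRID) / len(v))))[: len(GRID)]

def K(f_vals):
    """A hypothetical nonlinear operator on functions: K[f] = f^2 + 0.1."""
    return f_vals ** 2 + 0.1

def lattice_map(x, m_out=16):
    """K-tilde : R^N -> R^M, the finite map through which D factors."""
    return rho(K(rho_star(x)), m_out)

f = np.sin(2 * np.pi * GRID)
x = rho(f, 16)                  # rho_P(f) in R^16
print(lattice_map(x))           # target values for a discrete network N
```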
4 NEURAL TOPOLOGY FROM TOPOLOGY

As we have now shown, deep function machines can express arbitrarily powerful configurations of 'perceptron layer' mappings between different spaces. However, it is not yet theoretically clear how different configurations of the computational skeleton and the particular spaces $X_\ell$ do or do not lead to a difference in expressiveness of DFMs. To answer questions of structure, we will return to the motivating example of high-resolution data, but now in the language of deep function machines.

4.1 RESOLUTION INVARIANT NEURAL NETWORKS
If an input $(x_j) \in \mathbb{R}^N$ is sampled from a continuous function $\xi \in C(E_\ell)$, $o$-operational layers are a natural way of extending neural networks to deal directly with $\xi$. As before, it is useful to think of each $o$ as a continuous relaxation of a class of $n$, and from this perspective we can gain insight into the weight tensors of $n$-discrete layers as the resolution of $x$ increases.

Theorem 4.1 (Invariance). If $T_\ell$ is an $o$-operational layer with an integrable weight kernel $\omega(u, v)$ of $O(1)$ parameters, then there is a unique fully connected $n$-discrete layer, $N_\ell$, with $O(N)$ parameters so that $T_\ell[\xi](j) = N_\ell(x)_j$ for all $\xi, x$ as above.

Figure 4: DFM construction of a resolution invariant $n$-discrete layer via the resolution invariance schema: an $o$-operational layer $\mathcal{O} : C(E_\ell) \to C(E_{\ell'})$ with parameterized kernel $\omega_\ell(u, v; w) = \sum_{k=1}^{n} f(u, v; w_k)$ is discretized into an $n$-discrete instantiation $[\mathcal{O}]_n : \mathbb{R}^N \to \mathbb{R}^M$.

Theorem 4.1 is a statement of variance in parameterization; when the input is a sample of a smooth signal, fully connected $n$-discrete layers are naively overparameterized. DFMs therefore yield a simple resolution invariance schema for neural networks. Instead of placing arbitrary restrictions on $W_\ell$ like convolution, or assuming that gradient descent will implicitly find a smooth weight matrix or filter $W_\ell$ for $n$, we take $W_\ell$ to be the discretization of a smooth $\omega_\ell(u, v)$. An immediate advantage is that the weight surfaces, $\omega_\ell(u, v)$, of $o$-operational layers can be parameterized by dense families $f(u, v; w)$, whose parameters $w$ do not depend on the resolution of the input but on the complexity of the model being learnt.
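The sketch below (our own, using a hypothetical two-parameter kernel family) implements this schema in its simplest form: a smooth kernel $\omega(u, v; w)$ is discretized into a weight matrix at any requested resolution, so the trainable parameter count stays fixed while the instantiated matrix grows:

```python
import numpy as np

def omega(u, v, w):
    # A kernel family with a fixed number of parameters w (here just two).
    return w[0] * np.exp(-w[1] * (u - v) ** 2)

def discretize(w, n_in, n_out):
    """n-discrete instantiation: sample omega on an (n_in x n_out) grid."""
    u = np.linspace(0.0, 1.0, n_in)[:, None]
    v = np.linspace(0.0, 1.0, n_out)[None, :]
    return omega(u, v, w) / n_in      # 1/n_in is the measure normalization

w = np.array([1.0, 10.0])             # O(1) parameters...
W_low  = discretize(w, 28 * 28, 64)   # ...instantiated at low resolution,
W_high = discretize(w, 112 * 112, 64) # ...or high, with no new parameters
```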
4.2 TOPOLOGICALLY INSPIRED LAYER PARAMETERIZATIONS

Furthermore, we can now explore new parameterizations by constructing weight tensors, and thereby neural network topologies, which approximate the action of the operator neural networks that most expressively fit the topological properties of the data. Generally, new restrictions on the weights of discrete neural networks might be achieved as follows:

1. Given that the input data $x$ is sampled from some $f \in F \subset \{g : E_1 \to \mathbb{R}\}$, find a closed algebra of weight kernels so that $\omega \in A$ is minimally parameterized and $g \circ o[F]$ is a sufficiently "rich" class of functions.
2. Repeat this process for each layer of a computational skeleton $\mathcal{S}$ and yield a DFM $\mathcal{O}$.
3. Instantiate a deep function machine $[\mathcal{O}]_n$, called the $n$-discrete instantiation of $\mathcal{O}$, consisting of only $n$-discrete layers, by discretizing each $o$-operational layer through the resolution invariance schema sample:
$$W_\ell = \left[ w_\ell\left( \frac{i}{e_u}, \frac{j}{e_u}, \cdots, \frac{k}{e_v}, \frac{t}{e_v}, \cdots \right) \right]_{ij \ldots kt \ldots} \qquad (4.1)$$
where $e_\eta$ denotes the cardinality of the sample along the $\eta$-axis of $E_\ell = [0, 1]^{\dim(E_\ell)}$.

This process is depicted in Figure 4. This perspective yields interpretations of existing layer types and the creation of new ones. For example, convolutional $n$-discrete layers provably approximate $o$-operational layers with weight kernels that are solutions to the ultrahyperbolic partial differential equation.

Figure 5: The WaveLayer architecture of RippLeNet for MNIST. Bracketed numbers denote the number of wave coefficients. The images following each WaveLayer are example activations of neurons after training, given by feeding a '0' into RippLeNet.

Theorem 4.2 (Convolutional Neural Networks). Let $N_\ell$ be an $n$-discrete convolutional layer such that $n(x) = h \star x$, where $\star$ is the convolution operator and $h$ is a filter tensor. Then there is an $o$-operational layer, $O_\ell$, with $\omega_\ell(u_1, \ldots, u_n, v_1, \ldots, v_n)$ such that
$$\sum_{k=1}^{n} \frac{\partial^2 \omega}{\partial u_k^2} = c^2 \sum_{k=1}^{n} \frac{\partial^2 \omega}{\partial v_k^2} \qquad (4.2)$$
and its $n$-discrete instantiation is $[O_\ell]_n = N_\ell$.

Using Theorem 4.2, we therefore propose the following generalization of convolutional layers, with weight kernels satisfying (4.2), whose $n$-discrete instantiation is resolution invariant.

Definition 4.3 (WaveLayers). We say that $T_\ell$ is a WaveLayer if it is the $n$-discrete instantiation (via the resolution invariance schema) of an $o$-operational layer with weight kernel of the form
$$\omega_\ell(u, v) = s_0 + \sum_{i=1}^{b} s_i \cos\left(w_i^T(u, v) - p_i\right); \qquad s_i, p_i \in \mathbb{R},\ w_i \in \mathbb{R}^{\dim(E_\ell)}. \qquad (4.3)$$
WaveLayers are named as such because the kernels $\omega_\ell$ are superpositions of standing waves moving in directions encoded by $w_i$, offset in phase by $p_i$. Additionally, any $n$-discrete convolutional layer can be expressed by WaveLayers by setting the direction $\theta_i$ of each $w_i$ to $\theta_i = \pi/4$. In this case, instead of learning the values of $h$ at each $j$, we learn $s_i, w_i, p_i$.
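A minimal sketch of a WaveLayer weight surface and its $n$-discrete instantiation follows (our own illustration; the number of waves, grid sizes, and random initialization are hypothetical choices, and a real layer would make s, w, p trainable):

```python
import numpy as np

rng = np.random.default_rng(0)

class WaveKernel:
    """omega(u, v) = s0 + sum_i s_i * cos(w_i . (u, v) - p_i), as in eq. (4.3)."""

    def __init__(self, n_waves):
        self.s0 = 0.0
        self.s = rng.normal(scale=0.5, size=n_waves)         # amplitudes
        self.w = rng.normal(scale=2.0, size=(n_waves, 2))    # directions/frequencies
        self.p = rng.uniform(0.0, 2 * np.pi, size=n_waves)   # phases

    def __call__(self, u, v):
        phase = self.w[:, 0, None, None] * u + self.w[:, 1, None, None] * v
        return self.s0 + np.sum(
            self.s[:, None, None] * np.cos(phase - self.p[:, None, None]), axis=0)

    def instantiate(self, n_in, n_out):
        """Sample the kernel into a weight matrix; parameters stay O(n_waves)."""
        u = np.linspace(0.0, 1.0, n_in)[:, None]
        v = np.linspace(0.0, 1.0, n_out)[None, :]
        return self(u, v) / n_in

layer = WaveKernel(n_waves=8)
W = layer.instantiate(n_in=784, n_out=128)   # e.g. a flattened 28x28 input
print(W.shape)   # (784, 128), from only 8 * 4 + 1 trainable scalars
```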
5 EXPERIMENTS

With theoretical guarantees given for DFMs, we propose a series of experiments to test the learnability, expressivity, and resolution invariance of WaveLayers.
RippLeNet.
To find a baseline model, we performed a grid search on MNIST over various DFMs and hyperparameters, arriving at RippLeNet, an architecture similar to the classic LeNet-5 of LeCun et al. (1998), depicted in Figure 5. The model is the $n$-discrete instantiation of the DFM in (5.1) and consists of 5 successive WaveLayers with tanh activations and no pooling. We found that using ReLU often resulted in a failure of learning to converge. Changes in the activation shapes (e.g. 24x24x2 → ...) are determined by the choice of $E_\ell$ at each node of the DFM. The trainable parameters, in particular the magnitudes of internal frequencies, were initialized at offsets proportional to the number of waves comprising each $o$-operational layer. Likewise, the orientation of each wave was initialized uniformly on the unit spherical shell.

Note: WaveLayers are not the same as Fourier networks or FC layers with cos activation functions.
It suffices to view WaveLayers as simply another way to reparameterize the weight matrix of $n$-discrete layers, and therefore the VC dimension of WaveLayers is less than that of normal fully connected layers. See the appendix.

Figure 6: A plot of test error versus number of trainable parameters for RippLeNet in comparison with early LeNet architectures.
Model Expressive Parameter Reduction. In Theorem 4.1, it was shown that DFMs in some sense parameterize the complexity of the model being learned, without constraints imposed by the form of the data. RippLeNet benefits in that the sheer number of parameters (and therefore the variance) can be reduced until the model expresses the mapping to satisfactory error, without concern for resolution variants.

We empirically verify this methodology by fixing the model architecture and increasing the number of waves per layer uniformly. This results in an exponential marginal utility on the lowest error achieved on MNIST with respect to the number of parameters in the model, shown in Figure 6. The slight outperformance of early LeNet architectures suggests that future work in optimizing WaveLayers might be fruitful in achieving state of the art parameter reduction.

Figure 7: A plot of training time (normalized for each layer type with respect to the training time at the baseline resolution) as the resolution of MNIST scales.

Resolution Invariance.
True resolution invariance has the desirable property of consistency. In principle, consistency requires that regardless of the resolution complexity of the data, the training time, parameterization, and testing accuracy of a model do not vary.

We test consistency for RippLeNet by fixing all aspects of initialization save for the input resolution of images. For each training run, we rescale MNIST using both bicubic and nearest neighbor scaling to square resolutions of side length $R$. In conjunction, we compare the resolution consistency of fully connected (FC) and convolutional architectures. For FC models, the number of free parameters on the first layer is increased out of necessity. Likewise, the size of the input filters for convolutional models is varied. As shown in Figure 7, WaveLayers remain invariant to resolution changes in both multirun variance and normalized convergence iterations, whereas both FC and convolutional layers exhibit an increase in both measurements with resolution.

6 CONCLUSION
In this paper we proposed deep function machines, a novel framework for topologically inspired layer parameterization. We showed that given topological assumptions, DFMs provide theoretical tools to yield provable properties in neural networks. We then used this framework to derive WaveLayers, a new type of provably resolution invariant layer for processing data sampled from continuous signals such as images and audio. The derivation of WaveLayers was additionally accompanied by the proposal of several layer operations for DFMs between infinite and/or finite dimensional vector spaces. We for the first time proved a theory of non-linear operator and functional basis approximation for neural networks of infinite dimensions, closing two long standing questions since Stinchcombe (1999). We then utilized the expressive power of such DFMs to arrive at a novel architecture for resolution invariant image processing, RippLeNet.

Future Work. Although we've laid the groundwork for exploration into the theory of deep function machines, there are still many open questions both theoretically and empirically. The drastic outperformance in resolution variance of RippLeNet in comparison to traditional layer types suggests that new layer types via DFMs, with provable properties in mind, should be further explored. Furthermore, a deeper analysis of existing global network topologies using DFMs may be useful given their expressive power.

REFERENCES
Carl Burch. A survey of machine learning. A survey for the Pennsylvania Governor's School for the Sciences, 2001.

Youngmin Cho and Lawrence K Saul. Analysis and extension of arc-cosine kernels for large margin classification. arXiv preprint arXiv:1112.3712, 2011.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.

Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.

Amir Globerson and Roi Livni. Learning infinite-layer networks: beyond the kernel trick. arXiv preprint arXiv:1606.05316, 2016.

Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.

Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. 2007.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Radford M Neal. Bayesian Learning for Neural Networks, volume 118. 1996.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

Fabrice Rossi, Brieuc Conan-Guez, and François Fleuret. Theoretical properties of functional multilayer perceptrons. 2002.

Jurgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

Matthias Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(02):69–106, 2004.

Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Maxwell B Stinchcombe. Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477, 1999.

Christopher KI Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
APPENDIX A: ADDITIONAL TOY DEMONSTRATION
To provide a direct comparison between convolutional, fully connected, and WaveLayer operations on data with semicontinuity assumptions, we conducted several toy demonstrations. We construct a dataset of functions (similar to a dataset of audio waveforms) whose input pairs are Gaussian bump functions $B_\mu(u)$ centered at different points $\mu \in [0, 1]$. The corresponding "labels" or target outputs are squares of Gaussian bump functions plus a linear form whose slopes are given by the position of the center, that is $B_\mu(u)^2 + \mu(u - \mu)$. The desired map we wish to learn is then $T : B_\mu(u) \mapsto B_\mu(u)^2 + \mu \cdot (u - \mu)$. To construct the actual dataset $D$, for a random sample of centers $\mu$ we sample the input/output pairs over 100 evenly spaced sub-intervals. The resultant dataset is a list of $N$ pairs of input/output vectors $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $x, y \in \mathbb{R}^{100}$.

We then train three different two layer DFMs with $n$-discrete convolutional, fully-connected, and WaveLayers respectively. The following three figures show the outputs of all three layer types as training progresses. In the first three quadrants, the output of each layer type on a particular example datapoint is shown along with that example's input/target functions, $(x_i, y_i)$. The particular example shown is chosen randomly at the beginning of training. In the bottom right, the log training error over the whole dataset of each layer type is shown. A tick is the number of batches seen by the algorithm.

At initialization, the three layers exhibit predicted behavior despite artifacts towards the boundaries of the intervals. The convolutional layer acts as a mollifier, smoothing the input signal, as its own kernel is Gaussian. The fully connected layer generates, as predicted, a normally distributed set of different output activations and does not regard the spatial locality (and thereby continuity) of the input. Finally, the WaveLayer output exhibits behaviour predicted in Neal (1996); that is, it limits towards a smoothed random walk over the input signal.

As training continues, both the convolutional and WaveLayer outputs preserve the continuity of the input signal, and approximate the smoothness of the output signal as induced by their own relation to ultrahyperbolic differential equations. Since the FC layer is not restricted to any given topology, although it approximates the desired output signal closely in the $L^2$ norm, it fails to achieve smoothness, as this regularization is not explicitly coded. It is important to note that in this example the WaveLayer output immediately surpasses the accuracy of the convolutional output because the convolutional output only has bias units across entire channels, whereas the bias units of WaveLayers are themselves functions $\sum_{k=1}^{n} s_k \cos(w_k \cdot v + p_k) + b_k$. Therefore WaveLayers can impose biases heterogeneously across their output signals, whereas convolutional layers require much deeper architectures to artificially generate such biases.

The results of this toy demonstration illustrate the intermediate flexibility of WaveLayers between purely fully connected and convolutional architectures. Satisfying the same differential equation (4.2), convolutional and WaveLayer architectures are regularized by spatial locality, but WaveLayers can in fact go beyond convolutional layers and employ translational heterogeneity.
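The toy dataset above is easy to reproduce; a minimal sketch follows (our own code; the bump width is a hypothetical choice, as it is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 100)          # 100 evenly spaced sample points

def bump(mu, width=0.05):
    """Gaussian bump B_mu(u) centered at mu (the width is our assumption)."""
    return np.exp(-((u - mu) ** 2) / (2.0 * width ** 2))

def make_dataset(n_pairs):
    xs, ys = [], []
    for mu in rng.uniform(0.0, 1.0, size=n_pairs):
        x = bump(mu)                     # input:  B_mu(u)
        y = x ** 2 + mu * (u - mu)       # target: B_mu(u)^2 + mu * (u - mu)
        xs.append(x)
        ys.append(y)
    return np.stack(xs), np.stack(ys)

X, Y = make_dataset(1000)                # X, Y have shape (1000, 100)
```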
Although the purpose of this work is not to demonstrate the superiority of either convolutional or WaveLayer architectures, it does open a new avenue of exploration in neural architecture design, using DFMs to design layer types with topological constraints.
APPENDIX B: THEOREMS AND PROOFS
8.1 POINT APPROXIMATION
Theorem 8.1.
Let $[a, b] \subset \mathbb{R}$ be a bounded interval and $g : \mathbb{R} \to B \subset \mathbb{R}$ be a continuous, bijective activation function. Then if $\xi : E_\ell \to \mathbb{R}$ and $f : E_{\ell'} \to B$ are $L^1(\mu)$ integrable functions, there exists a unique class of $o$-operational layers such that $g \circ o[\xi] = f$.

Proof.
We will give an exact formula for the weight function $\omega_\ell$ corresponding to $o$ so that the formula is true. Recall that
$$y_{\ell'}(v) = g\left(\int_{E_\ell} \xi(u)\, \omega_\ell(u, v)\, d\mu(u)\right). \qquad (8.1)$$
Then let $\omega_\ell(u, v) = \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)$, where $\Xi(u)$ is the indefinite integral of $\xi$ and $h : \mathbb{R} \times E_{\ell'} \to \mathbb{R}$ is some jointly and separately integrable function. By the bijectivity of $g$ onto its codomain, $\omega_\ell$ exists. Now further specify $h$ so that $h(\Xi(u), v)\big|_{u \in \partial E_\ell} = f(v)$. Then by the fundamental theorem of (Lebesgue) calculus and the chain rule,
$$g(o[\xi](v)) = g\left(\int_{E_\ell} \left[(g^{-1})' \circ (h(\Xi(u), v))\right] h'(\Xi(u), v)\, \xi(u)\, d\mu(u)\right) = g\left(g^{-1}(h(\Xi(u), v))\right)\Big|_{u \in \partial E_\ell} = f(v). \qquad (8.2)$$
A generalization of this theorem to $E_\ell \subset \mathbb{R}^n$ is given by Stokes' theorem.

8.2 DENSITY IN LINEAR OPERATORS
Theorem 8.2 (Approximation of Linear Operators). Suppose $E_\ell, E_{\ell'}$ are $\sigma$-compact, locally compact, measurable, Hausdorff spaces. If $K : C(E_\ell) \to C(E_{\ell'})$ is a bounded linear operator, then there exists an $o$-operational layer such that for all $y_\ell \in C(E_\ell)$, $o[y_\ell] = K[y_\ell]$.

Proof.
Let $\zeta_t : C(E_{\ell'}) \to \mathbb{R}$ be the linear form which evaluates its argument at $t \in E_{\ell'}$; that is, $\zeta_t(f) = f(t)$. Then because $\zeta_t$ is bounded on its domain, $\zeta_t \circ K = K^\star \zeta_t : C(E_\ell) \to \mathbb{R}$ is a bounded linear functional. Then from the Riesz Representation Theorem we have that there is a unique regular Borel measure $\mu_t$ on $E_\ell$ such that
$$\left(K y_\ell\right)(t) = K^\star \zeta_t\left(y_\ell\right) = \int_{E_\ell} y_\ell(s)\, d\mu_t(s), \qquad \|\mu_t\| = \|K^\star \zeta_t\|. \qquad (8.3)$$
We will show that $\kappa : t \mapsto K^\star \zeta_t$ is continuous. Take an open neighborhood of $K^\star \zeta_t$, say $V \subset [C(E_\ell)]^*$, in the weak* topology. Recall that the weak* topology endows $[C(E_\ell)]^*$ with the smallest collection of open sets so that the maps in $i(C(E_\ell)) \subset [C(E_\ell)]^{**}$ are continuous, where $i : C(E_\ell) \to [C(E_\ell)]^{**}$ is such that $i(f) = \hat{f} = \phi \mapsto \phi(f)$, $\phi \in [C(E_\ell)]^*$. Then without loss of generality $V = \bigcap_{n=1}^{m} \hat{f}_{\alpha_n}^{-1}(U_{\alpha_n})$, where $f_{\alpha_n} \in C(E_\ell)$ and the $U_{\alpha_n}$ are open in $\mathbb{R}$. Now $\kappa^{-1}(V) = W$ is such that if $t \in W$ then $K^\star \zeta_t \in \bigcap_m \hat{f}_{\alpha_n}^{-1}(U_{\alpha_n})$. Therefore for all $f_{\alpha_n}$, $K^\star \zeta_t(f_{\alpha_n}) = \zeta_t(K[f_{\alpha_n}]) = K[f_{\alpha_n}](t) \in U_{\alpha_n}$.

We would like to show that there is an open neighborhood of $t$, say $D$, so that $D \subset W$ and $\kappa(D) \subset V$. First, since all the maps $K[f_{\alpha_n}] : E_{\ell'} \to \mathbb{R}$ are continuous, let $D = \bigcap_m (K[f_{\alpha_n}])^{-1}(U_{\alpha_n}) \subset E_{\ell'}$. Then if $r \in D$, $\hat{f}_{\alpha_n}[K^\star \zeta_r] = K[f_{\alpha_n}](r) \in U_{\alpha_n}$ for all $1 \le n \le m$. Therefore $\kappa(r) \in V$ and so $\kappa(D) \subset V$.
Theorem 8.3.
Suppose that E , E are bounded intervals in R . If K : Lip λ ( E ) → Lip κ ( E ) is auniformly continuous, nonlinear operator. Then for every (cid:15) > there exists a deep function machine D : L ( E ) L ( E ) L ( E ) o o (8.5) such that (cid:13)(cid:13) D| Lip λ − K (cid:13)(cid:13) < (cid:15). We will first introduce some defintions which quantize uniformly continuous operators on functionspace.
Definition 8.4.
Let P = p < · · · < p N be some partition of a compact interval E with N components. We call ρ P : Lip ∗ ( E ) → R N and ρ ∗ P : R M → Lip ∗ ( E ) affine projection maps if ρ P ( f ) = ( f ( p i )) Ni =1 ρ ∗ P ( v ) = v (cid:55)→ N − (cid:88) i =0 χ P i ( x ) (cid:20) ( v i +1 − v i ) µ ( P i ) ( t − p i ) + v i (cid:21) (8.6) where χ P i is the indicator function on P i = [ p i , p i +1 ) when i < N and P N − = [ p N − , p N ] . Definition 8.5.
Let
P, Q be partitions of E , E of N, M components respectively. If K : Lip λ ( E ) → Lip κ ( E ) , its affine projection, | K | , and its lattice map, ˜ K , are defined so that thefollowing diagram commutes,Lip λ ( E ) R N Lip λ ( E ) Lip κ ( E ) R M Lip κ ( E ) | K | ρ P ˜ K ρ ∗ P Kρ ∗ Q ρ Q Lemma 8.6 (Strong Linear Approximation) . If K : Lip λ ( E ) → Lip κ ( E ) is a uniformly continuous,nonlinear operator, then for every (cid:15) > there exist partitions P, Q of E , E so that (cid:107) K − | K |(cid:107) < (cid:15). Proof.
To show the lemma, we will chase the commutative diagram above by approximation.For any δ > , we claim that there exists a P such that for any f ∈ Lip λ ( E ) , the affine projectionapproximates f ; that is, (cid:107) f − ρ ∗ P ◦ ρ P ◦ f (cid:107) L ( µ ) < δ. To see this, take P to be a uniform partition of with ∆ p := µ ( P i ) < δµ ( E ) λ . Then (cid:90) | f − ρ ∗ P ◦ ρ P ◦ f | dµ ≤ N − (cid:88) i =1 (cid:90) P i (cid:12)(cid:12)(cid:12)(cid:12) f ( t ) − (cid:20) ( f ( p i ) − f ( p i )) µ ( P i ) ( t − p i ) + f ( p i ) (cid:21) (cid:12)(cid:12)(cid:12)(cid:12) dµ ( t ) ≤ N − (cid:88) i =1 (cid:90) P i | f ( t ) − f ( p i ) − λ ( t − p i ) | dµ ( t ) ≤ N − (cid:88) i =1 (cid:90) P i λ | t − p i | dµ ( t ) ≤ λ ∆ p N < δ.
Now by the absolute continuity of K , for every (cid:15) > there is a δ and therefore a partition P of E sothat if (cid:107) f − ρ ∗ P ◦ ρ P ◦ f (cid:107) L ( µ ) < δ then (cid:107) K [ f ] − K [ ρ ∗ P ◦ ρ P ◦ f ] (cid:107) L ( µ ) < (cid:15)/ . Finally let Q be auniform partition of E so that for every φ ∈ Lip κ ( E ) , (cid:107) φ − ρ ∗ Q ◦ ρ Q ◦ φ (cid:107) L ( µ ) < (cid:15)/ . It followsthat for every f ∈ Lip λ ( E ) (cid:107) K [ f ] − | K | [ f ] (cid:107) L ( µ ) ≤ (cid:107) K [ f ] − K ◦ ρ ∗ P ◦ ρ P [ f ] (cid:107) + (cid:107) K [ f ] − ρ ∗ Q ◦ ρ Q ◦ K [ f ] (cid:107) < (cid:15) (cid:15) (cid:15). Therefore the affine projection of K approximates K . This completes the proof.With the lemma given we will approximate nonlinear operators through an approximation of theaffine approximation using n -discrete DFMs. Proof of Theorem 3.5.
Let (cid:15) > be given. By Lemma 8.6 there exist partitions, P, Q , so that (cid:107) K − | K |(cid:107) < (cid:15)/ . The cooresponding lattice map ˜ K : R N → R M is therefore continuous. Since E is a compact interval, the image ρ P [ Lip λ ( E )] is compact and homeomorphic to the unit hypercube [0 , N . By the universal approximation theorem of Cybenko (1989), for every δ , there exists a deepfunction machine N : R N R J R M , n n so that (cid:107) ˜ K − N (cid:107) ∞ < δ. Then, the continuity of the affine projection maps implies that there exist δ such that (cid:107) ρ ∗ Q ◦ N ◦ ρ P − | K |(cid:107) < (cid:15)/ . Therefore the induced operator on N represents K ; that is, (cid:107) ρ ∗ Q ◦ N ◦ ρ P − K (cid:107) < (cid:15) .Let N be parameterized by W ∈ R N × J and W ∈ R J × M . Let S be any uniform partition of an I = [0 , with J components. Then parameterize a deep function machine D with weight kernels ω ( u, v ) = N (cid:88) i =1 J (cid:88) j =1 χ S j × P i ( u, v ) W i,j δ ( u − p i ) ,ω ( v, x ) = M − (cid:88) k =1 χ Q k ( x ) J (cid:88) j =1 (cid:34) W j,k +1 − W j,k µ ( Q k ) ( x − q k ) + W j,k (cid:35) δ ( v − s j ) , where δ is the dirac delta function. We claim that D = ρ ∗ Q ◦N ◦ ρ P . Performing routine computations,for any f ∈ Lip λ ( E ) , D [ f ] = T ◦ g ◦ (cid:18)(cid:90) E f ( u ) ω ( u, v ) dµ ( u ) (cid:19) = T ◦ g ◦ (cid:90) E N (cid:88) i =1 J (cid:88) j =1 χ S j × P i ( u, v ) W i,j f ( u ) δ ( u − p i ) dµ ( u ) = T ◦ g ◦ J (cid:88) j =1 ρ P ( f ) T W j χ S j ( v ) := T ◦ g ◦ h ( v ) hus, h is identical to the j th neuron of the first n -discrete layer in N ◦ ρ P when v = s j . Turning to T in D , we get that D [ f ] = (cid:90) I ω ( v, x ) g ( h ( v )) dµ ( v )= M − (cid:88) k =1 χ Q k ( x ) J (cid:88) j =1 (cid:34) W j,k +1 − W j,k µ ( Q k ) ( x − q k ) + W j,k (cid:35) g ( h ( s j ))= M − (cid:88) k =1 χ Q k ( x ) · g ( ρ P ( f ) T W ) T (cid:20) W k +1 − W k µ ( Q k ) ( x − q k ) + W k (cid:21) = ρ ∗ Q ( g ( ρ P ( f ) T W ) T W ) = ρ ∗ Q ◦ N ◦ ρ P [ f ] . Therefore (cid:107)D|
Lip λ − K (cid:107) < (cid:15) and this completes the proof.We will now prove a similar theorem for d -defunctional layers. Theorem 8.7 (Nonlinear Basis Approximation) . Suppose
I, E , E are compact intervals, and let C ω ( X ) denote the set of analytic functions on X . If B : I n → C ω ( E ) is a continuous basis map toanalytic functions then for every (cid:15) > there exists a deep function machine D : R n L ( E ) L ( E ) d o (8.7) such that (cid:107) D | I n − B (cid:107) < (cid:15) in the topolopgy of uniform convergence.Proof. Recall that the set of polynomials on E , P , are a basis for the vector space C ω ( E ) . Thereforethe map B has a decomposition through nmaots κ, ∆ so that the following diagram commutes (cid:96) ( R ) R n C ω ( E ) κ ∆ B and κ : ( a i ) ∞ i =1 (cid:55)→ (cid:80) a n g n where g n is the mononomial of degree n . The existence of ∆ can beverified through a composition of the direct product of basis projections in C ω ( E ) and B .For each m ∈ N the projection image in the m th coordinate, π m [∆[ R n ]] = R and so again themaos factor into a countable collection of maps (∆ i : R n → R ) ∞ i =1 so that (cid:81) ∞ i =1 ∆ i = ∆ . We willapproximate B by approximations of κ ◦ ∆ via increasing products of ∆ i .Define the aforementioned increasing product map ∆ ( N ) as ∆ ( N ) = N (cid:89) i =1 ∆ i × ∞ (cid:89) N +1 c where c is the constant map. Now with (cid:15) > given, we wish to show that there exists an N so that (cid:107) κ ◦ ∆ ( N ) − B (cid:107) < (cid:15) in the topology of uniform convergence.To see this let P N ⊂ P denote the set of polynomials of degree at most n. Next we define a open’mullification’ of P N . In particular let O P N ( (cid:15) ) = { f ∈ C ω ( (cid:15) ) | | f − g | < (cid:15), g ∈ P N } . It is clear that O P N ( (cid:15) ) ⊂ O P N ( (cid:15) ) when N ≤ N and furthermore by the density of P in C ω ( E ) we have that { O P i ( (cid:15) ) } ∞ i =1 is an open cover of B [ I n ] ⊂ C ω ( E ) . Since I n is compact B [ I n ] is a compact subset of C ω ( E ) and thus there is a finite index set I (cid:48) = { N , . . . N k } so that (cid:83) t ∈ I (cid:48) O P t ⊃ B [ I n ] . If N = max I (cid:48) then O P N ( (cid:15) ) ⊃ B [ I n ] . Therefore for every x ∈ I n we havethat (cid:107) κ ◦ ∆ ( N ) − B (cid:107) < (cid:15) since κ ◦ ∆ ( N ) is a polynomial of degree at most N. ow we will filter the maps ψ N := π ...N ◦ ∆ ( N ) where ψ N : R n → R N through the universalapproximation theory of standard discrete neural networks. Let N be a two n -discrete layer DFM sothat (cid:107)N − ψ N (cid:107) < (cid:15) . For convienience let N := n ◦ g n Then we can instantiate N as the DFM in(8.7) using the same method as in the proof of the nonlinear operator approximation theory above.For the d -defunctional layer let W jk be the weight tensor of n in N so that n ( x ) j = (cid:80) k W jk x k .Then let the weight kernel for d be ω k ( v ) = ρ ∗ ( W · k ) . and then d [ x ] | v = j = n ( x ) j . We will ommitthe design of weight kernels for the o -operational layer, but this is not difficult to establish. Alltogether we now have that via the approximation of N and equivalence of N and its instantiation inthe statement of the theorem, (cid:107) κ ◦ ρ ◦ o ◦ g ◦ d − κ ◦ ι N ◦ ψ N (cid:107) < (cid:15). Finally we need deal with the basis map κ . On the compact set ∆ ( N ) [ I n ] = ι N ◦ ψ N [ I n ] , κ is abounded linear operator and its composition, κ ◦ ρ ◦ o is also a bounded linear operator. Therefore bythe bounded linear approximation theorem of o -operational layers, there is a (cid:107) o (cid:48) − κ ◦ ρ ◦ o (cid:107) < (cid:15) .Appending such o (cid:48) to d as above we achieve the approximation bound of the theorem. 
This completes the proof.

8.4 RESOLUTION INVARIANCE
Theorem 8.8. If $T_\ell$ is an $o$-operational layer with an integrable weight kernel $\omega(u, v)$ of $O(1)$ parameters, then there is a unique $n$-discrete layer with $O(N)$ parameters so that $o[\xi](j) = n[x]_j$ for all indices $j$ and for all $\xi, x$ as above.

Proof. Given some $o$, we will give a direct computation of the corresponding weight matrix of $n$. It follows that
$$o[\xi](v) = \int_{E_\ell} \xi(u)\, \omega_\ell(u, v)\, d\mu(u) = \sum_{n=1}^{N-1} \int_{n}^{n+1} \left((x_{n+1} - x_n)(u - n) + x_n\right) \omega_\ell(u, v)\, d\mu(u) = \sum_{n=1}^{N-1} \left[(x_{n+1} - x_n) \int_{n}^{n+1} (u - n)\, \omega_\ell(u, v)\, d\mu(u) + x_n \int_{n}^{n+1} \omega_\ell(u, v)\, d\mu(u)\right]. \qquad (8.8)$$
Now, let $V_n(v) = \int_{n}^{n+1} (u - n)\, \omega_\ell(u, v)\, d\mu(u)$ and $Q_n(v) = \int_{n}^{n+1} \omega_\ell(u, v)\, d\mu(u)$. We can now easily simplify (8.8) using the telescoping trick of summation:
$$o[\xi](v) = x_N V_{N-1}(v) + \sum_{n=2}^{N-1} x_n \left(Q_n(v) - V_n(v) + V_{n-1}(v)\right) + x_1 \left(Q_1(v) - V_1(v)\right). \qquad (8.9)$$
Given indices $j \in \{1, \cdots, M\}$, let $W \in \mathbb{R}^{N \times M}$ so that $W_{n,j} = Q_n(j) - V_n(j) + V_{n-1}(j)$, $W_{N,j} = V_{N-1}(j)$, and $W_{1,j} = Q_1(j) - V_1(j)$. It follows that if $W$ parameterizes some $n$, then $n[x]_j = o[\xi](j)$ for every $f$ sampled/approximated by $x$ and $\xi$. Furthermore, $\dim(W) \in O(N)$, and $n$ is unique up to $L^1(\mu)$ equivalence.

8.5 CONVOLUTIONAL NEURAL NETWORKS AND THE ULTRAHYPERBOLIC DIFFERENTIAL EQUATION
8.5 CONVOLUTIONAL NEURAL NETWORKS AND THE ULTRAHYPERBOLIC DIFFERENTIAL EQUATION

Proof.
A general solution to (4.3) is of the form $\omega(u,v) = F(u - cv) + G(u + cv)$, where $F, G$ are twice differentiable. Essentially, the shape of $\omega$ stays constant in $u$, but the position of $\omega$ varies in $v$. For every $h$ there exists a continuous $F$ so that $F(j) = h_j$ and $G = 0$. Let $\omega(u,v) = F(u - cv) + G(u + cv)$. Therefore, applying Theorem 4.1 to the $o$ parameterized by $\omega$, we yield a weight matrix $W$ so that

$$ o[\xi](j) = \int_E \xi(u)\,\big(F(u - cj) + 0\big)\,d\mu(u) = (Wx)_j = (h \star x)_j = n[x]_j. \tag{8.10} $$

This completes the proof.
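The following toy sketch illustrates (8.10) (the filter, sizes, and the interpolant $F$ are our own choices): sampling a traveling-wave kernel $\omega(u,v) = F(u - cv)$ on an integer grid yields a Toeplitz weight matrix, i.e. exactly a discrete convolution.

```python
import numpy as np

# A toy sketch of (8.10) (filter, sizes, and the interpolant F are our own
# choices): sampling a traveling-wave kernel omega(u, v) = F(u - c v) on an
# integer grid yields a Toeplitz weight matrix, i.e. a discrete convolution.
c = 1.0
h = np.array([1.0, -2.0, 1.0])                   # filter taps; F(j) = h_j

def F(t):
    # a compactly supported interpolant with F(k) = h[k] for k = 0, 1, 2
    k = np.clip(np.round(t).astype(int), 0, len(h) - 1)
    return np.where((t >= -0.5) & (t <= len(h) - 0.5), h[k], 0.0)

N = 8
u = np.arange(N + len(h) - 1)                    # (padded) input grid
v = np.arange(N)                                 # output indices j
W = F(u[None, :] - c * v[:, None])               # W_{j,u} = F(u - c j): Toeplitz

x = np.random.randn(len(u))
print(np.allclose(W @ x, np.convolve(x, h[::-1], mode="valid")))   # True
```

Each row of $W$ is the same filter shifted by $cj$, which is precisely the weight-sharing scheme of a convolutional layer.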
9 APPENDIX C: VC DIMENSION OF DISCRETIZED OPERATOR NEURAL NETWORKS

In order to calculate the VC dimension of DFMs containing only discretized $o$-operational layers, denoted $\mathcal{D}$, we note that $\mathcal{D} \subset \mathcal{N}$, where $\mathcal{N}$ is the family of all DFMs with $n$-discrete skeletons whose per-node dimensionality is exactly that of the discretization $\mathcal{D}$. Thus the VC dimension of $\mathcal{D}$ can be bounded by that of $\mathcal{N}$; however, a more fine-tuned estimate is both possible and essential.

Suppose that in designing some deep architecture, one wishes to keep the VC dimension low whilst increasing per-node activation dimensionality. In practice, optimization in higher dimensions is easier when a low-dimensional parameterization is embedded therein. For example, hyperdimensional computing, sparse coding, and convolutional neural networks naturally necessitate high-dimensional hidden spaces but benefit from regularized capacity. Since the dimensionality of the discretization $\mathcal{O}$ does not depend on the original dimensionality of the space, the capacity of $\mathcal{O}$ depends directly on the "complexity" of the family of weight surfaces with which it is endowed. It would therefore be convenient to answer the following question formally.

The VC Problem. Let $\mathcal{W}_\ell \subset L^2(\mathbb{R}^2, \mu)$ be some family of weight surfaces. Then induce $\mathcal{O}_\mathcal{W}$, a family of discretized $o$-operational layers, with $\mathcal{O}_\mathcal{W} := \{[o_W]_n\}_{W \in \mathcal{W}}$, where $[\cdot]_n$ denotes the discretization. What is
$\mathrm{VCDim}(\mathcal{O}_\mathcal{W})$?

Although in this work we do not directly attack this problem, a solution leads to another dimension of layer and architecture design beyond topological constraints. In practice, one would be able to choose a set $\mathcal{W}_\ell$ which gives a satisfactory generalizability condition on one's learning problem.
10 APPENDIX D: ANALYTICAL DERIVATION OF CONTINUOUS ERROR BACKPROPAGATION FOR SEPARABLE WEIGHT KERNELS
With these theoretical guarantees given for DFMs, implementing the feedforward and error backpropagation algorithms in this context is an essential next step. We will consider operator neural networks with polynomial kernels. As aforementioned, in the case where a DFM has nodes with non-separable kernels, we cannot give the guarantees we give in the following section; therefore, a standard auto-differentiation set-up will suffice for DFMs with, for example, wave layers.

Feedforward propagation is straightforward, and relies on memoizing operators by using the separability of weight polynomials. Essentially, integration need only occur once per layer to yield coefficients on power functions. See Algorithm 1.

10.0.1 FEED-FORWARD PROPAGATION
We will say that a function $f: \mathbb{R}^2 \to \mathbb{R}$ is numerically integrable if it can be separated into $f(x,y) = g(x)h(y)$.

Theorem 10.1. If $\mathcal{O}$ is an operator neural network with $L$ consecutive layers, then given any $\ell$ such that $1 \le \ell < L$, $y_\ell$ is numerically integrable, and if $\xi$ is any continuous and Riemann integrable input function, then $\mathcal{O}[\xi]$ is numerically integrable.

Algorithm 1 Feedforward Propagation on $\mathcal{F}$
  Input: input function $\xi$
  for $\ell \in \{1, \ldots, L-1\}$ do
    for $t \in Z_X^\ell$ do
      Calculate $I_t^\ell = \int_{E_\ell} y_\ell(j_\ell)\, j_\ell^t \, dj_\ell$.
    end for
    for $s \in Z_Y^\ell$ do
      Calculate $C_s^\ell = \sum_{a \in Z_X^\ell} k_{a,s}\, I_a^\ell$.
    end for
    Memoize $y_{\ell+1}(j) = g\big(\sum_{b \in Z_Y^\ell} j^b\, C_b^\ell\big)$.
  end for
  The output is given by $\mathcal{O}[\xi] = y_L$.

Proof.
Consider the first layer. We can write the sigmoidal output of the $(\ell+1)$th layer as a function of the previous layer; that is,

$$ y_{\ell+1} = g\left( \int_{E_\ell} w_\ell(j_\ell, j_{\ell+1})\, y_\ell(j_\ell)\, dj_\ell \right). \tag{10.1} $$

Clearly this composition can be expanded using the polynomial definition of the weight surface. Hence

$$ y_{\ell+1} = g\left( \int_{E_\ell} y_\ell(j_\ell) \sum_{x_{\ell+1} \in Z_Y^\ell} \sum_{x_\ell \in Z_X^\ell} k_{x_\ell, x_{\ell+1}}\, j_\ell^{x_\ell}\, j_{\ell+1}^{x_{\ell+1}} \, dj_\ell \right) = g\left( \sum_{x_{\ell+1} \in Z_Y^\ell} j_{\ell+1}^{x_{\ell+1}} \sum_{x_\ell \in Z_X^\ell} k_{x_\ell, x_{\ell+1}} \int_{E_\ell} y_\ell(j_\ell)\, j_\ell^{x_\ell}\, dj_\ell \right), \tag{10.2} $$

and therefore $y_{\ell+1}$ is numerically integrable. For the purpose of constructing an algorithm, let $I_{x_\ell}^\ell$ be the evaluation of the integral in the above definition for any given $x_\ell$.

It is important to note that the previous argument requires that $y_\ell$ be Riemann integrable. Hence, with $\xi$ satisfying those conditions, it follows inductively that every $y_\ell$ is integrable. That is, because $y_1$ is integrable, it follows from the numerical integrability of each layer that $\mathcal{O}[\xi] = y_L$ is numerically integrable. This completes the proof.

Using the logic of the previous proof, it follows that the development of some inductive algorithm is possible.
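The following is a runnable sketch of Algorithm 1 (the activation, degrees, coefficients, quadrature rule, and input function are our own choices): each layer costs one batch of one-dimensional integrals $I_t^\ell$, after which the next layer's output is memoized in closed form as $g$ applied to a polynomial.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# A runnable sketch of Algorithm 1 (activation, degrees, coefficients,
# quadrature rule, and input function are our own choices): each layer costs
# one batch of 1-D integrals I_t, after which the next layer's output is
# memoized in closed form as g applied to a polynomial.
g = lambda z: 1.0 / (1.0 + np.exp(-z))            # sigmoidal activation
nodes, weights = leggauss(64)                     # Gauss-Legendre on [-1, 1]
u, wq = 0.5 * (nodes + 1.0), 0.5 * weights        # rescaled to E = [0, 1]
integrate = lambda f: np.sum(wq * f(u))           # numerical integral over E

L, deg = 3, 4                                     # layers, polynomial degree
rng = np.random.default_rng(0)
k = [rng.normal(size=(deg, deg)) / deg for _ in range(L - 1)]

y = lambda j: np.sin(np.pi * j)                   # input function xi = y_1
for l in range(L - 1):
    # I_t = int_E y_l(j) j^t dj  -- the only integration this layer requires
    I = np.array([integrate(lambda j, t=t: y(j) * j**t) for t in range(deg)])
    C = k[l].T @ I                                # C_s = sum_a k_{a,s} I_a
    y = lambda j, C=C: g(sum(C[b] * j**b for b in range(deg)))  # memoize y_{l+1}

print("O[xi](0.5) =", y(0.5))                     # evaluate the output anywhere
```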
10.0.2 CONTINUOUS ERROR BACKPROPAGATION

As is common for non-convex problems in discretized neural networks, a stochastic gradient descent method will be developed using a continuous analogue of error backpropagation. We define the loss function as follows.
Definition 10.2.
For an operator neural network $\mathcal{O}$ and a dataset $\{(\gamma_n(j), \delta_n(j))\}$, we say that the error for a given $n$ is defined by

$$ E = \frac{1}{2} \int_{E_L} \big(\mathcal{O}(\gamma_n) - \delta_n\big)^2 \, dj_L \tag{10.3} $$

This error definition follows from $\mathcal{N}$, as the typical error function for $\mathcal{N}$ is just the square norm of the difference of the desired and predicted output vectors; in this case we use the $L^2$ norm on $C(E_L)$ in the same fashion.

We first propose the following lemma to aid in our derivation of a computationally suitable error backpropagation algorithm.
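As a quick illustration, the loss (10.3) is just a squared $L^2$ distance computed by quadrature on $E_L$ (the output and target functions and the grid below are our own stand-ins):

```python
import numpy as np

# A quick illustration of the loss (10.3) by quadrature on E_L = [0, 1]
# (the output/target functions and the grid are our own stand-ins).
j = np.linspace(0.0, 1.0, 1001)
O_gamma = np.tanh(2 * j)            # stand-in for the network output O(gamma_n)
delta = j                           # stand-in for the desired output delta_n

E_loss = 0.5 * np.trapz((O_gamma - delta) ** 2, j)
print("E =", E_loss)
```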
Lemma 10.3. Given some layer $\ell > 1$ in $\mathcal{O}$, functions of the form $\Psi_\ell = g'\big(\Sigma_{\ell-1}\, y_{\ell-1}\big)$, the argument being the pre-activation of layer $\ell$, are numerically integrable.

Proof. If

$$ \Psi_\ell = g'\left( \int_{E_{\ell-1}} y_{\ell-1}\, w_{\ell-1} \, dj_{\ell-1} \right) \tag{10.4} $$

then

$$ \Psi_\ell = g'\left( \sum_{b \in Z_Y^{\ell-1}} j_\ell^b \sum_{a \in Z_X^{\ell-1}} k_{a,b}^{\ell-1} \int_{E_{\ell-1}} y_{\ell-1}\, j_{\ell-1}^a \, dj_{\ell-1} \right), \tag{10.5} $$

hence $\Psi_\ell$ can be numerically integrated and thereby evaluated.

The ability to simplify the derivative of the output of each layer greatly reduces the computational time of the error backpropagation: it becomes a function defined on the interval of integration of the next iterated integral.
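A minimal sketch of the lemma (the sigmoid, degree, coefficients, and grid are our own choices): because the kernel is separable, the pre-activation is a polynomial in $j$ whose coefficients reuse the feedforward integrals $I_a$, so $\Psi_\ell$ is available in closed form with no nested integration.

```python
import numpy as np

# A minimal sketch of Lemma 10.3 (sigmoid, degree, coefficients, and grid are
# our own choices): separability makes the pre-activation of layer l a
# polynomial in j whose coefficients reuse the feedforward integrals I_a, so
# Psi_l = g'(pre-activation) needs no nested integration.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

deg = 4
rng = np.random.default_rng(1)
k = rng.normal(size=(deg, deg)) / deg               # k^{l-1}_{a,b}

u = np.linspace(0.0, 1.0, 2001)
y_prev = np.sin(np.pi * u)                          # y_{l-1} on a grid
I = np.array([np.trapz(y_prev * u**a, u) for a in range(deg)])   # I_a

def Psi(j):
    # Psi_l(j) = g'( sum_b j^b sum_a k_{a,b} I_a ), cf. (10.5)
    return g_prime(sum((k[:, b] @ I) * j**b for b in range(deg)))

print("Psi_l(0.3) =", Psi(0.3))
```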
Theorem 10.4. The gradient $\nabla E(\gamma, \delta)$ of the error function (10.3) on some $\mathcal{O}$ can be evaluated numerically.

Proof. Recall that $E$ over $\mathcal{O}$ is composed of the $k_{x,y}^\ell$ for $x \in Z_X^\ell$, $y \in Z_Y^\ell$, and $1 \le \ell \le L$. If we show that $\partial E / \partial k_{x,y}^\ell$ can be numerically evaluated for arbitrary $\ell, x, y$, then every component of $\nabla E$ is numerically evaluable and hence $\nabla E$ can be numerically evaluated. Given some arbitrary $\ell$ in $\mathcal{O}$, let $n = L - \ell$. We will examine the particular partial derivative for the case that $n = 1$, and then, for arbitrary $n$, induct over each iterated integral.

Consider the following expansion for $n = 1$:

$$ \begin{aligned} \frac{\partial E}{\partial k_{x,y}^{L-1}} &= \frac{\partial}{\partial k_{x,y}^{L-1}} \frac{1}{2}\int_{E_L} [\mathcal{O}(\gamma) - \delta]^2 \, dj_L \\ &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int_{E_{L-1}} j_{L-1}^x\, j_L^y\, y_{L-1} \, dj_{L-1}\, dj_L \\ &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \int_{E_{L-1}} j_{L-1}^x\, y_{L-1}\, dj_{L-1}\, dj_L \end{aligned} \tag{10.6} $$

Since the second integral in (10.6) is exactly $I_x^{L-1}$ from (10.2), it follows that

$$ \frac{\partial E}{\partial k_{x,y}^{L-1}} = I_x^{L-1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \, dj_L, \tag{10.7} $$

and clearly for the case of $n = 1$ the theorem holds.

Now we show that this is also the case for larger $n$. It will become clear why we have chosen to include $n = 1$ in the proof upon expansion of the partial derivative in these higher-order cases. Let us expand the gradient for $n \in \{2, \ldots, L-1\}$:

$$ \frac{\partial E}{\partial k_{x,y}^{L-n}} = \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \underbrace{\int_{E_{L-1}} w_{L-1}\, \Psi_{L-1} \int \cdots \int_{E_{L-n+1}} w_{L-n+1}\, \Psi_{L-n+1}}_{n-1 \text{ iterated integrals}} \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x\, j_{L-n+1}^y \, dj_{L-n} \ldots dj_L \tag{10.8} $$

As aforementioned, proving the $n = 1$ case separately is required because for $n = 1$, (10.8) has a section of $n - 1 = 0$ iterated integrals, which is not possible for the proceeding logic.

We now use the order-invariance property of iterated integrals (that is, $\int_A \int_B f(x,y)\,dx\,dy = \int_B \int_A f(x,y)\,dy\,dx$) and reverse the order of integration of (10.8). In order to reverse the order of integration, we must ensure each iterated integral has an integrand containing only variables which are guaranteed integration over some region. To examine this, we propose the following recurrence relation for the gradient. Let $\{B_s\}$ be defined along $L - n \le s \le L$ as follows:

$$ \begin{aligned} B_L &= \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, B_{L-1} \, dj_L, \\ B_s &= \int_{E_s} \Psi_s \sum_{a \in Z_X^s} \sum_{b \in Z_Y^s} j_s^a\, j_s^b\, B_{s-1} \, dj_s, \\ B_{L-n} &= \int_{E_{L-n}} j_{L-n}^x\, j_{L-n+1}^y \, dj_{L-n}, \end{aligned} \tag{10.9} $$

such that $\partial E / \partial k_{x,y}^\ell = B_L$. If we wish to reverse the order of integration, we must find a recurrence relation on a sequence $\{\tilde{B}_s\}$ such that $\partial E / \partial k_{x,y}^{L-n} = \tilde{B}_{L-n} = B_L$. Consider the gradual reversal of (10.8). Clearly,

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int_{E_{L-1}} w_{L-1}\, \Psi_{L-1} \int \cdots \int_{E_{L-n+1}} j_{L-n+1}^y\, w_{L-n+1}\, \Psi_{L-n+1} \, dj_{L-n+1} \ldots dj_L\, dj_{L-n} \tag{10.10} $$

is the first-order reversal of (10.8). We now show the second-order case, with the first weight function expanded:

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_{L-n}} y_{L-n}\, j_{L-n}^x \int_{E_{L-n+1}} \sum_b \sum_a k_{a,b}\, j_{L-n+1}^{a+y}\, \Psi_{L-n+1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L \int \cdots \int_{E_{L-n+2}} j_{L-n+2}^b\, w_{L-n+2}\, \Psi_{L-n+2} \, dj_{L-n+1} \ldots dj_L\, dj_{L-n}. \tag{10.11} $$
Repeated iteration of the method seen in (10.10) and (10.11), where the innermost integral is moved to the outside of the $(L-s)$th iterated integral, with $s$ the iteration, yields the following full reversal of (10.8). For notational simplicity, recall that $\ell = L - n$; then

$$ \frac{\partial E}{\partial k_{x,y}^\ell} = \int_{E_\ell} y_\ell\, j_\ell^x \int_{E_{\ell+1}} \sum_{a \in Z_X^{\ell+1}} j_{\ell+1}^{a+y}\, \Psi_{\ell+1} \int_{E_{\ell+2}} \sum_{b \in Z_Y^{\ell+1}} \sum_{c \in Z_X^{\ell+2}} k_{a,b}^\ell\, j_{\ell+2}^{b+c}\, \Psi_{\ell+2} \int_{E_{\ell+3}} \sum_{d \in Z_Y^{\ell+2}} \sum_{e \in Z_X^{\ell+3}} k_{c,d}^{\ell+2}\, j_{\ell+3}^{d+e}\, \Psi_{\ell+3} \int \cdots \int_{E_L} \sum_{q \in Z_Y^{L-1}} k_{p,q}^{L-1}\, j_L^q\, [\mathcal{O}(\gamma) - \delta]\, \Psi_L \, dj_L \ldots dj_{L-n}. \tag{10.12} $$

Observing the reversal in (10.12), we yield the following recurrence relation for $\{B_s\}$. Bear in mind that $\ell = L - n$, that $x$ and $y$ still correspond with $\partial E / \partial k_{x,y}^\ell$, and that the following relation uses its definition on $s$ for cases not otherwise defined:

$$ \begin{aligned} B_{L,t} &= \int_{E_L} \sum_{b \in Z_Y^{L-1}} k_{t,b}^{L-1}\, j_L^b\, [\mathcal{O}(\gamma) - \delta]\, \Psi_L \, dj_L, \\ B_{s,t} &= \int_{E_s} \sum_{b \in Z_Y^{s-1}} \sum_{a \in Z_X^s} k_{t,b}^{s-1}\, j_s^{a+b}\, \Psi_s\, B_{s+1,a} \, dj_s, \\ B_{\ell+1} &= \int_{E_{\ell+1}} \sum_{a \in Z_X^{\ell+1}} j_{\ell+1}^{a+y}\, \Psi_{\ell+1}\, B_{\ell+2,a} \, dj_{\ell+1}, \\ \frac{\partial E}{\partial k_{x,y}^\ell} &= B_\ell = \int_{E_\ell} j_\ell^x\, y_\ell\, B_{\ell+1} \, dj_\ell. \end{aligned} \tag{10.13} $$

Algorithm 2 Error Backpropagation
  Input: input $\gamma$, desired $\delta$, learning rate $\alpha$, time $t$.
  for $\ell \in \{2, \ldots, L\}$ do
    Calculate $\Psi_\ell = g'\big( \int_{E_{\ell-1}} y_{\ell-1}\, w_{\ell-1} \, dj_{\ell-1} \big)$
  end for
  For every $t$, compute $B_{L,t}$ from (10.13).
  Update the output coefficient matrix: $k_{x,y}^{L-1} - \alpha\, I_x^{L-1} \int_{E_L} [\mathcal{O}(\gamma) - \delta]\, \Psi_L\, j_L^y \, dj_L \to k_{x,y}^{L-1}$.
  for $\ell = L-2$ to $1$ do
    If it is null, compute and memoize $B_{\ell+2,t}$ from (10.13).
    Compute but do not store $B_{\ell+1} \in \mathbb{R}$.
    Compute $\partial E / \partial k_{x,y}^\ell = B_\ell$ from (10.13).
    Update the weights on layer $\ell$: $k_{x,y}^\ell(t) - \alpha B_\ell \to k_{x,y}^\ell(t+1)$.
  end for

Note that $B_{L-n} = B_L$ by this logic. With (10.13), we need only show that $B_{L-n}$ is integrable. Hence we induct on $L - n \le s \le L$ over $\{B_s\}$ under the proposition that $B_s$ is not only numerically integrable but also constant.

Consider the base case $s = L$. For every $t$, because every function in the integrand of $B_{L,t}$ in (10.13) is composed of $j_L$, functions of the form $B_{L,t}$ must be numerically integrable and clearly $B_{L,t} \in \mathbb{R}$. Now suppose that $B_{s+1,t}$ is numerically integrable and constant. Then, trivially, $B_{s,u}$ is also numerically integrable by the contents of the integrand in (10.13), and $B_{s,u} \in \mathbb{R}$. Hence the proposition that $s + 1$ implies $s$ holds for $\ell < s < L$.

Lastly we must show that both $B_{\ell+1}$ and $B_\ell$ are numerically integrable. By induction $B_{\ell+2}$ must be numerically integrable; hence, by the contents of its integrand, $B_{\ell+1}$ must also be numerically integrable and real. As a result, $B_\ell = \partial E / \partial k_{x,y}^\ell$ is real and numerically integrable.

Since we have shown that $\partial E / \partial k_{x,y}^\ell$ is numerically integrable, $\nabla E$ must therefore be numerically evaluable as aforementioned. This completes the proof.
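To ground the last-layer case concretely, here is a runnable sketch of the $n = 1$ update in Algorithm 2 via (10.7) (the toy two-layer network, activation, target, and grid are our own choices; deeper layers would reuse the $B_{s,t}$ recurrence of (10.13) in the same way):

```python
import numpy as np

# A runnable sketch of the n = 1 (last-layer) update in Algorithm 2 via
# (10.7); the toy network, activation, target, and grid are our own choices.
g = lambda z: 1.0 / (1.0 + np.exp(-z))
gp = lambda z: g(z) * (1.0 - g(z))                # g'

deg = 3
rng = np.random.default_rng(2)
k_out = rng.normal(size=(deg, deg))               # output coefficients k^{L-1}

j = np.linspace(0.0, 1.0, 2001)                   # grid on E_L = E_{L-1}
y_prev = np.sin(np.pi * j)                        # y_{L-1}
I = np.array([np.trapz(y_prev * j**x, j) for x in range(deg)])  # I^{L-1}_x

pre = sum((k_out[:, b] @ I) * j**b for b in range(deg))  # pre-activation on E_L
out, Psi_L = g(pre), gp(pre)                      # O(gamma) and Psi_L
delta = j                                         # toy desired output

grad = np.empty((deg, deg))
for x in range(deg):
    for y_ in range(deg):
        # dE/dk_{x,y} = I_x * int_{E_L} [O(gamma) - delta] Psi_L j^y dj  (10.7)
        grad[x, y_] = I[x] * np.trapz((out - delta) * Psi_L * j**y_, j)

alpha = 0.1
k_out -= alpha * grad                             # the gradient step on k^{L-1}
print("||grad|| =", np.linalg.norm(grad))
```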