Higher Order Automatic Differentiation of Higher Order Functions
MATHIEU HUOT, SAM STATON, AND MATTHIJS VÁKÁR
University of Oxford, University of Oxford, Utrecht University
Abstract.
We present semantic correctness proofs of automatic differentiation (AD). We consider a forward-mode AD method on a higher order language with algebraic data types, and we characterise it as the unique structure preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming, based on diffeological spaces. We show that it interprets our language, and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is, in essence, a logical relations argument. Throughout, we show how the analysis extends to AD methods for computing higher order derivatives using a Taylor approximation.

1. Introduction
Automatic differentiation (AD), loosely speaking, is the process of taking a program describing a function, and constructing the derivative of that function by applying the chain rule across the program code. As gradients play a central role in many aspects of machine learning, so too do automatic differentiation systems such as TensorFlow [1], PyTorch [53] or Stan [12].

[Figure 1 (overview of semantics/correctness of AD): a commuting square in which Programs map to Programs along the top via automatic differentiation, Differential geometry maps to Differential geometry along the bottom via mathematical differentiation, and the vertical arrows are denotational semantics.]

Differentiation has a well developed mathematical theory in terms of differential geometry. The aim of this paper is to formalize this connection between differential geometry and the syntactic operations of AD, particularly for AD methods that calculate higher order derivatives. In this way we
Key words and phrases: automatic differentiation, software correctness, denotational semantics.
∗ This paper provides an extended version of [28], extending it with proofs and a novel discussion of the semantics and correctness of automatic differentiation methods for computing higher order derivatives.
achieve two things: (1) a compositional, denotational understanding of differentiable programming and AD; (2) an explanation of the correctness of AD.

This intuitive correspondence (summarized in Fig. 1) is in fact rather complicated. In this paper, we focus on resolving the following problem: higher order functions play a key role in programming, and yet they have no counterpart in traditional differential geometry. Moreover, we resolve this problem while retaining the compositionality of denotational semantics.

1.0.1. Higher order functions and differentiation.
A major application of higher order functions is to support disciplined code reuse. Code reuse is particularly acute in machine learning. For example, a multi-layer neural network might be built of millions of near-identical neurons, as follows.

neuronₙ : (realⁿ ∗ (realⁿ ∗ real)) → real
neuronₙ := λ⟨x, ⟨w, b⟩⟩. ς(w · x + b)

layerₙ : ((τ ∗ P) → τ) → (τ ∗ Pⁿ) → τⁿ
layerₙ := λf. λ⟨x, ⟨p₁, ..., pₙ⟩⟩. ⟨f⟨x, p₁⟩, ..., f⟨x, pₙ⟩⟩

comp : (((τ ∗ P) → τ) ∗ ((τ ∗ Q) → τ)) → (τ ∗ (P ∗ Q)) → τ
comp := λ⟨f, g⟩. λ⟨x, ⟨p, q⟩⟩. g⟨f⟨x, p⟩, q⟩

(Here ς(x) := 1/(1 + e^{−x}) is the sigmoid function.) We can use these functions to build a network as follows (see also Fig. 2):

comp⟨layerₘ(neuronₖ), comp⟨layerₙ(neuronₘ), neuronₙ⟩⟩ : (realᵏ ∗ P) → real     (1.1)

[Figure 2: The network in (1.1), with k inputs and two hidden layers.]

Here P ≅ realᵖ with p = m(k+1) + n(m+1) + n + 1. This program (1.1) describes a smooth (infinitely differentiable) function. The goal of automatic differentiation is to find its derivative.

If we β-reduce all the λ's, we end up with a very long function expression just built from the sigmoid function and linear algebra. We can then find a program for calculating its derivative by applying the chain rule. However, automatic differentiation can also be expressed without first β-reducing, in a compositional way, by explaining how higher order functions like (layer) and (comp) propagate derivatives. This paper is a semantic analysis of this compositional approach.

The general idea of denotational semantics is to interpret types as spaces and programs as functions between the spaces. In this paper, we propose to use diffeological spaces and smooth functions [63, 30] to this end. These satisfy the following three desiderata:
• ℝ is a space, and the smooth functions ℝ → ℝ are exactly the functions that are infinitely differentiable;
• the set of smooth functions X → Y between spaces again forms a space, so we can interpret function types;
• the disjoint union of a sequence of spaces again forms a space, so we can interpret variant types and inductive types.
We emphasise that the most standard formulation of differential geometry, using manifolds, does not support spaces of functions. Diffeological spaces seem to us the simplest notion of space that satisfies these conditions, but there are other candidates [4, 64]. A diffeological space is in particular a set X equipped with a chosen set of curves C_X ⊆ X^ℝ, and a smooth map f : X → Y must be such that if γ ∈ C_X then γ; f ∈ C_Y. This is reminiscent of the method of logical relations.

1.0.2. From smoothness to automatic derivatives at higher types.
Our denotational semantics in diffeological spaces guarantees that all definable functions are smooth. But we need more than just to know that a definable function happens to have a mathematical derivative: we need to be able to find that derivative.

In this paper we focus on forward mode automatic differentiation methods for computing higher derivatives, which are macro translations on syntax (called −→D, defined in §3). To phrase and prove their correctness, we introduce a semantic model of glued spaces (§6): triples (X, X′, S) where, intuitively, X is a space of inhabitants of the type, X′ is a space serving as a chosen bundle of tangents (or jets, in the case of higher order derivatives) over X, and S ⊆ X^ℝ × X′^ℝ is a binary relation between curves, informally relating curves in X with their tangent (resp. jet) curves in X′. This new model gives a denotational semantics for higher order automatic differentiation.

1.0.3. Related work and context.
AD has a long history and has many implementations. AD was perhaps first phrased in a functional setting in [55], and there are now a number of teams working on AD in the functional setting (e.g. [68, 61, 21]), some providing efficient implementations. Although that work does not involve formal semantics, it is inspired by intuitions from differential geometry and category theory.

This paper adds to a very recent body of work on verified automatic differentiation. In the first order setting, there are recent accounts based on denotational semantics in manifolds [23, 44] and based on synthetic differential geometry [18], work making a categorical abstraction [16] and work connecting operational semantics with denotational semantics [2, 57], as well as work focussing on how to correctly differentiate programs that operate on tensors [8] and programs that make use of quantum computing [70]. Recently there has also been significant progress at higher types. The work of Brunel et al. [11] and Mazza and Pagani [49] give formal correctness proofs for reverse-mode derivatives on a linear λ-calculus with a particular operational semantics. The work of Barthe et al. [5] provides a general discussion of some new syntactic logical relations arguments, including one very similar to our syntactic proof of Theorem 1. Sherman et al. [62] discuss a differential programming technique that works at higher types, based on exact real arithmetic, and relate it to a computable semantics. We understand that the authors of [18] are working on higher types. Vákár [66] phrases and proves correct a reverse mode AD technique on a higher order language, based on a source-code transformation on a standard λ-calculus.
The differential λ-calculus [20] is related to AD, and explicit connections are made in [46, 47]. One difference is that the differential λ-calculus allows the addition of terms at all types, and hence vector space models are suitable, but this appears peculiar with the variant and inductive types that we consider here.

This paper builds on our previous work [28, 65] in which we gave denotational correctness proofs for forward mode AD algorithms for computing first derivatives. Here, we explain how these techniques extend to methods that calculate higher derivatives.

The idea of directly calculating higher order derivatives by using automatic differentiation methods that work with Taylor approximations (also known as jets in differential geometry) is well-known [26] and it has recently gained renewed interest [9, 10]. So far, such "Taylor-mode AD" methods have only been applied to first order functional languages, however. This paper shows how to extend these higher order AD methods to languages with support for higher order functions and algebraic data types.

The two main methods for implementing AD are operator overloading and source code transformation, the method used in this paper [67]. Taylor-mode AD has been seen to be significantly faster than iterated AD in the context of operator overloading [10] in JAX [24]. There are other notable implementations of forward Taylor-mode [6, 7, 33, 54, 69], some of them in functional languages [33, 54]. Taylor-mode implementations use the rich algebraic structure of derivatives to avoid and share much of the redundant computation that occurs with iterated first order methods. Perhaps the simplest example of this is the sin function, whose iterated derivatives only involve sin, cos, and negation. Importantly, most AD tools have the right complexity up to a constant factor, but this constant is quite important in practice, and Taylor-mode helps achieve better performance. Another striking result for a version of Taylor-mode was achieved in [41], where a performance gain of up to two orders of magnitude was obtained for computing certain Hessian-vector products using Ricci calculus; in essence, the algorithm used is a mixed-mode one that is derived via jets in [9]. This is further improved in [42]. Taylor-mode can also be useful for ODE solvers and hence will be important for neural differential equations [13].

Finally, we emphasise that we have chosen the neural network (1.1) as our running example mainly for its simplicity. There are many other examples of AD outside the neural networks literature: AD is useful whenever derivatives need to be calculated on high dimensional spaces. This includes optimization problems more generally, where the derivative is passed to a gradient descent method (e.g. [59, 34, 58, 35, 19, 45]). Optimization problems involving higher order functions naturally show up in the calculus of variations and its applications in physics, where one typically looks for a function minimizing a certain integral [25]. Other applications of AD are in advanced integration methods, since derivatives play a role in Hamiltonian Monte Carlo [52, 27] and variational inference [40]. Second order methods for gradient descent have also been extensively studied. As the basic second order Newton method requires inverting a high dimensional Hessian matrix, several alternatives and approximations have been proposed.
Some of them still require Taylor-like modes of differentiation and a matrix-vector product where the matrix resembles the Hessian or inverse Hessian [36, 48, 3].

1.0.4. Summary of contributions.
We have provided a semantic analysis of higher order automatic differentiation. Our syntactic starting point is the family of higher order forward-mode AD macros on a typed higher order language that extend their well-known first order equivalents
(e.g. [61, 68, 28]). We present these in §3.
• We give a denotational semantics for the language in diffeological spaces, showing that every definable expression is smooth (§4).
• We show correctness of the higher order AD macros by a logical relations argument (Th. 1).
• We give a categorical analysis of this correctness argument with two parts: canonicity of the macro in terms of syntactic categories, and a new notion of glued space that abstracts the logical relation (§6).
• We then use this analysis to state and prove a correctness argument at all first order types (Th. 2).
2. Rudiments of differentiation: how to calculate with dual numbers and Taylor approximations
2.1. First order differentiation: the chain rule and dual numbers.
Recall that the derivative of a function f : ℝ → ℝ, if it exists, is a function ∇f : ℝ → ℝ such that, for all a, ∇f(a) is the gradient of f at a, in the sense that the function x ↦ f(a) + ∇f(a) · (x − a) gives the best linear approximation of f at a. (The gradient ∇f(a) is often written df(x)/dx (a).)

The chain rule for differentiation tells us that we can calculate ∇(f; g)(a) = ∇f(a) · ∇g(f(a)). In that sense, the chain rule tells us how linear approximations to a function transform under post-composition with another function.

To find ∇f in a compositional way, using the chain rule, two generalizations are reasonable:
• We need both f and ∇f when calculating ∇(f; g) of a composition f; g, using the chain rule, so we are really interested in the pair (f, ∇f) : ℝ → ℝ × ℝ;
• In building f we will need to consider functions of multiple arguments, such as + : ℝ² → ℝ, and these functions should propagate derivatives.

Thus we are more generally interested in transforming a function g : ℝⁿ → ℝ into a function h : (ℝ × ℝ)ⁿ → ℝ × ℝ in such a way that for any f₁, ..., fₙ : ℝ → ℝ,

\[
((f_1, \nabla f_1), \ldots, (f_n, \nabla f_n)); h = ((f_1, \ldots, f_n); g,\; \nabla((f_1, \ldots, f_n); g)). \tag{2.1}
\]

An intuition for h is often given in terms of dual numbers. The transformed function operates on pairs of numbers, (x, x′), and it is common to think of such a pair as x + x′ε for an 'infinitesimal' ε. But while this is a helpful intuition, the formalization of infinitesimals can be intricate, and the development in this paper is focussed on the elementary formulation in (2.1).

A function h satisfying (2.1) encodes all the partial derivatives of g. For example, if g : ℝ² → ℝ, then with f₁(x) := x and f₂(x) := x₂ constant, by applying (2.1) to x₁ we obtain h(x₁, 1, x₂, 0) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁)), and similarly h(x₁, 0, x₂, 1) = (g(x₁, x₂), ∂g(x₁, x)/∂x (x₂)). And conversely, if g is differentiable in each argument, then a unique h satisfying (2.1) can be found by taking linear combinations of partial derivatives, for example:

h(x₁, x₁′, x₂, x₂′) = (g(x₁, x₂), x₁′ · ∂g(x, x₂)/∂x (x₁) + x₂′ · ∂g(x₁, x)/∂x (x₂)).

(Here, recall that the partial derivative ∂g(x, x₂)/∂x (x₁) is a notation for the gradient ∇(g(−, x₂))(x₁), i.e. with x₂ fixed.)
In summary, the idea of differentiation with dual numbers is to transform a differentiable function g : ℝⁿ → ℝ into a function h : (ℝ²)ⁿ → ℝ² which captures g and all its partial derivatives. We packaged this up in (2.1) as an invariant which is useful for building derivatives of compound functions ℝ → ℝ in a compositional way. The idea of (first order) forward mode automatic differentiation is to perform this transformation at the source code level.
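To make the dual-number transformation concrete, here is a minimal sketch in Haskell. It is our own illustration, not code from the paper; the names Dual, sigmoid and deriv are ours. Implementing h once for each primitive operation (via a Num instance) means that any composite program built from those primitives automatically satisfies the invariant (2.1).

```haskell
-- A dual number pairs a value with its derivative along a curve.
data Dual = Dual { primal :: Double, tangent :: Double } deriving Show

instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)  -- product rule
  negate (Dual x x')    = Dual (negate x) (negate x')
  fromInteger n         = Dual (fromInteger n) 0          -- constants have derivative 0
  abs (Dual x x')       = Dual (abs x) (x' * signum x)
  signum (Dual x _)     = Dual (signum x) 0

-- The sigmoid, propagating its derivative ς(x)(1 − ς(x)) by the chain rule.
sigmoid :: Dual -> Dual
sigmoid (Dual x x') = let y = 1 / (1 + exp (-x)) in Dual y (y * (1 - y) * x')

-- Derivative of f at a: seed the input with tangent 1.
deriv :: (Dual -> Dual) -> Double -> Double
deriv f a = tangent (f (Dual a 1))

main :: IO ()
main = print (deriv (\x -> sigmoid (x * x + 3)) 0.5)  -- d/dx ς(x² + 3) at 0.5
```

Seeding the input with tangent 1 corresponds exactly to taking f₁ = id in (2.1).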
Smooth functions. In what follows we will often speak of smooth functions ℝᵏ → ℝ, which are functions that are continuous and differentiable, such that their derivatives are also continuous and differentiable, and so on.

2.2. Higher order differentiation: the Faà di Bruno formula and Taylor approximations.
We now generalize the above in two directions:
• We look for the best local approximations to f with polynomials of some order R, generalizing the above use of linear functions (R = 1).
• We work directly with multivariate functions ℝᵏ → ℝ instead of functions of one variable ℝ → ℝ (k = 1).

To make this precise, we recall that, given a smooth function f : ℝᵏ → ℝ and a natural number R ≥ 0, the R-th order Taylor approximation of f at a ∈ ℝᵏ is defined in terms of the partial derivatives of f:

\[
x \mapsto \sum_{\{(\alpha_1,\ldots,\alpha_k)\in\mathbb{N}^k \mid \alpha_1+\ldots+\alpha_k\le R\}} \frac{1}{\alpha_1!\cdots\alpha_k!}\, \frac{\partial^{\alpha_1+\ldots+\alpha_k} f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a)\cdot (x_1-a_1)^{\alpha_1}\cdots(x_k-a_k)^{\alpha_k}.
\]

This is an R-th order polynomial. We can again recover the partial derivatives of f up to the R-th order from its Taylor approximation by evaluating the series at basis vectors.

Recall that the ordering of partial derivatives does not matter for smooth functions (Schwarz/Clairaut's theorem). So there will be $\binom{R+k-1}{k-1}$ R-th order partial derivatives, and altogether there are $\binom{R+k}{k}$ summands in the R-th order Taylor approximation. (This can be seen by a 'stars-and-bars' argument.)

Since there are $\binom{R+k}{k}$ partial derivatives of f of order ≤ R (often called (k,R)-velocities), we can store them in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$, which can also be regarded as the space of k-variate polynomials of degree ≤ R.

We will use a convention of coordinates $(y_{\alpha_1\ldots\alpha_k}\in\mathbb{R})$ for $(\alpha_1,\ldots,\alpha_k)$ ranging over $\{(\alpha_1,\ldots,\alpha_k)\in\mathbb{N}^k \mid 0\le\alpha_1+\ldots+\alpha_k\le R\}$, where $y_{\alpha_1\ldots\alpha_k}$ is intended to represent a partial derivative $\frac{\partial^{\alpha_1+\ldots+\alpha_k} f}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a)$ for some function f : ℝᵏ → ℝ. We will choose these coordinates in lexicographic order of the multi-indices (α₁, ..., α_k); that is, the indexes in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$ will typically range from (0, ..., 0) to (R, 0, ..., 0).
The (k, R)-Taylor representation of a function g : ℝⁿ → ℝ is a function $h : \big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ that transforms the partial derivatives of order ≤ R of any smooth f : ℝᵏ → ℝⁿ under postcomposition with g:

\[
\left(\left(\frac{\partial^{\alpha_1+\ldots+\alpha_k} f_j(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}\right)_{j=1}^{n} ; h \;=\; \left(\frac{\partial^{\alpha_1+\ldots+\alpha_k}((f_1,\ldots,f_n);g)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}. \tag{2.2}
\]

Thus the Taylor representation generalizes the dual numbers representation (R = k = 1).

To explicitly calculate the Taylor representation of a smooth function, we recall a generalization of the chain rule to higher derivatives. The chain rule tells us how the coefficients of linear approximations transform under composition of functions. The Faà di Bruno formula [60, 22, 17] tells us how coefficients of Taylor approximations, that is, higher derivatives, transform under composition. We recall the multivariate form from [60, Theorem 2.1]. Given functions f = (f₁, ..., f_l) : ℝᵏ → ℝˡ and g : ℝˡ → ℝ, for α₁ + ... + α_k > 0:

\[
\frac{\partial^{\alpha_1+\ldots+\alpha_k}(f;g)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a) = \alpha_1!\cdots\alpha_k! \!\!\sum_{\substack{(\beta_1,\ldots,\beta_l)\in\mathbb{N}^l\\ 1\le\beta_1+\ldots+\beta_l\le\alpha_1+\ldots+\alpha_k}}\!\! \frac{\partial^{\beta_1+\ldots+\beta_l} g(y)}{\partial y_1^{\beta_1}\cdots\partial y_l^{\beta_l}}(f(a)) \cdot \!\!\sum_{\substack{((e_{11},\ldots,e_{1l}),\ldots,(e_{q1},\ldots,e_{ql}))\in(\mathbb{N}^l)^q\\ e_{1j}+\ldots+e_{qj}=\beta_j,\\ (e_{11}+\ldots+e_{1l})\cdot\alpha_{1i}+\ldots+(e_{q1}+\ldots+e_{ql})\cdot\alpha_{qi}=\alpha_i}} \prod_{r=1}^{q}\prod_{j=1}^{l} \frac{1}{e_{rj}!}\left(\frac{1}{\alpha_{r1}!\cdots\alpha_{rk}!}\, \frac{\partial^{\alpha_{r1}+\cdots+\alpha_{rk}} f_j(x)}{\partial x_1^{\alpha_{r1}}\cdots\partial x_k^{\alpha_{rk}}}(a)\right)^{e_{rj}},
\]

where (α₁₁, ..., α₁ₖ), ..., (α_{q1}, ..., α_{qk}) ∈ ℕᵏ are an enumeration of all the vectors (α_{r1}, ..., α_{rk}) of k natural numbers such that α_{rj} ≤ α_j and α_{r1} + ... + α_{rk} > 0; we write q for the number of such vectors. The details of this formula reflect the complicated combinatorics that arise from repeated applications of the chain and product rules for differentiation that one uses to prove it. Conceptually, however, it is rather straightforward: it tells us that the coefficients of the R-th order Taylor approximation of f; g can be expressed exclusively in terms of those of f and g.

Thus the Faà di Bruno formula uniquely determines the Taylor representation $h : \big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ in terms of the derivatives of g : ℝⁿ → ℝ of order ≤ R, and we can also recover all such derivatives from h.

2.3. Example: a two-dimensional second order Taylor series.
As an example, we can specialize the Faà di Bruno formula above to the second order Taylor series of a function f : ℝ² → ℝˡ and its behaviour under postcomposition with a smooth function g : ℝˡ → ℝ:

\[
\frac{\partial^2 (f;g)(x)}{\partial x_i\,\partial x_{i'}}(a) = \sum_{j=1}^{l} \frac{\partial g(y)}{\partial y_j}(f(a))\, \frac{\partial^2 f_j(x)}{\partial x_i\,\partial x_{i'}}(a) + \sum_{j,j'=1}^{l} \frac{\partial^2 g(y)}{\partial y_j\,\partial y_{j'}}(f(a))\, \frac{\partial f_{j'}(x)}{\partial x_i}(a)\, \frac{\partial f_j(x)}{\partial x_{i'}}(a),
\]

where i, i′ ∈ {1, 2} might either coincide or be distinct.
Rather than working with the full (2,2)-Taylor representation of g, we ignore the non-mixed second order derivatives y_j^{(2,0)} = ∂²f_j(x)/∂x₁² and y_j^{(0,2)} = ∂²f_j(x)/∂x₂² for the moment, and we represent the derivatives of order ≤ 1 together with the mixed second order derivative of f_j : ℝ² → ℝ (at some point a) as the numbers

(y_j, y_j^{(1,0)}, y_j^{(0,1)}, y_j^{(1,1)}) = (f_j(a), ∂f_j(x)/∂x₁ (a), ∂f_j(x)/∂x₂ (a), ∂²f_j(x)/∂x₁∂x₂ (a)) ∈ ℝ⁴,

and we can choose a similar representation for the derivatives of (f; g). Then, we observe that the Faà di Bruno formula induces the function h : (ℝ⁴)ˡ → ℝ⁴:

h((y₁, y₁^{(1,0)}, y₁^{(0,1)}, y₁^{(1,1)}), ..., (y_l, y_l^{(1,0)}, y_l^{(0,1)}, y_l^{(1,1)})) =
  ( g(y₁, ..., y_l),
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,0)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(0,1)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,1)} + Σⱼ,ⱼ′₌₁ˡ ∂²g/∂yⱼ∂yⱼ′ (y₁, ..., y_l) · yⱼ^{(1,0)} · yⱼ′^{(0,1)} ).

In particular, we can note that

h((y₁, y₁^{(1,0)}, y₁^{(0,1)}, 0), ..., (y_l, y_l^{(1,0)}, y_l^{(0,1)}, 0)) =
  ( g(y₁, ..., y_l),
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,0)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(0,1)},
    Σⱼ,ⱼ′₌₁ˡ ∂²g/∂yⱼ∂yⱼ′ (y₁, ..., y_l) · yⱼ^{(1,0)} · yⱼ′^{(0,1)} ).

We can use this method to calculate any directional first and second order derivative of g in one pass. For example, if l = 3, so g : ℝ³ → ℝ, then the last component of h((x, x′, x″, 0), (y, y′, y″, 0), (z, z′, z″, 0)) calculates the second order derivative of g, differentiating first in direction (x′, y′, z′) and then in direction (x″, y″, z″), evaluating at (x, y, z).

In the proper Taylor representation we explicitly include the non-mixed second order derivatives as inputs and outputs, leading to a function h′ : (ℝ⁶)ˡ → ℝ⁶. Above we have followed a common trick to avoid some unnecessary storage and computation, since these extra inputs and outputs are not required for computing the second order derivatives of g. For instance, if l = 2 then the last component of h((x, 1, 0, 0), (y, 0, 1, 0)) is ∂²g(x, y)/∂x∂y (x, y).
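A minimal Haskell sketch of this four-component representation (our own illustration, not code from the paper; D22, seedX and seedY are our names): we track a value, the two first order partials, and the mixed second order partial, and implement h for addition and multiplication.

```haskell
-- Value, ∂/∂x, ∂/∂y, and mixed ∂²/∂x∂y, tracked simultaneously.
data D22 = D22 { val, dx, dy, dxy :: Double } deriving Show

instance Num D22 where
  D22 u ux uy uxy + D22 v vx vy vxy = D22 (u + v) (ux + vx) (uy + vy) (uxy + vxy)
  -- Product rule for the mixed partial: (uv)_xy = u_xy·v + u_x·v_y + u_y·v_x + u·v_xy
  D22 u ux uy uxy * D22 v vx vy vxy =
    D22 (u * v) (ux * v + u * vx) (uy * v + u * vy)
        (uxy * v + ux * vy + uy * vx + u * vxy)
  negate (D22 u ux uy uxy) = D22 (negate u) (negate ux) (negate uy) (negate uxy)
  fromInteger n = D22 (fromInteger n) 0 0 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Seeds: x varies in the first direction, y in the second.
seedX, seedY :: Double -> D22
seedX a = D22 a 1 0 0
seedY b = D22 b 0 1 0

-- ∂²/∂x∂y of g(x,y) = x·x·y at (3,7) is 2x = 6, computed in one pass.
main :: IO ()
main = print (dxy (let x = seedX 3; y = seedY 7 in x * x * y))
```

The four seeds (x, 1, 0, 0) and (y, 0, 1, 0) are precisely the inputs used in the l = 2 observation above.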
2.4. Example: a one-dimensional second order Taylor series.

As opposed to (2,2)-AD, (1,2)-AD computes the first and second order derivatives in the same direction. For example, if g : ℝ² → ℝ is a smooth function, then h : (ℝ³)² → ℝ³. An intuition for h can be given in terms of triple numbers. The transformed function operates on triples of numbers, (x, x′, x″), and it is common to think of such a triple as x + x′ε + x″ε² for an 'infinitesimal' ε which has the property that ε³ = 0. For instance we have

h((x₁, 1, 0), (x₂, 0, 0)) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁), ∂²g(x, x₂)/∂x² (x₁))
h((x₁, 0, 0), (x₂, 1, 0)) = (g(x₁, x₂), ∂g(x₁, x)/∂x (x₂), ∂²g(x₁, x)/∂x² (x₂))
h((x₁, 1, 0), (x₂, 1, 0)) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁) + ∂g(x₁, x)/∂x (x₂),
                             ∂²g(x, x₂)/∂x² (x₁) + ∂²g(x₁, x)/∂x² (x₂) + 2 ∂²g(x, y)/∂x∂y (x₁, x₂)).

We see that we directly get the non-mixed second order partial derivatives but not the mixed ones. Writing (−)₃ for the third component, we can recover ∂²g(x, y)/∂x∂y (x₁, x₂) as

½ ( h((x₁, 1, 0), (x₂, 1, 0))₃ − h((x₁, 1, 0), (x₂, 0, 0))₃ − h((x₁, 0, 0), (x₂, 1, 0))₃ ).

More generally, if g : ℝˡ → ℝ, then h : (ℝ³)ˡ → ℝ³ satisfies:

h((x₁, x₁′, 0), ..., (x_l, x_l′, 0)) = ( g(x₁, ..., x_l), Σᵢ₌₁ˡ ∂g/∂xᵢ (x₁, ..., x_l) · xᵢ′, Σᵢ,ⱼ₌₁ˡ ∂²g/∂xᵢ∂xⱼ (x₁, ..., x_l) · xᵢ′ · xⱼ′ ).

We can always recover the mixed second order partial derivatives from this, but it requires several evaluations of h. This is thus different from the (2,2) method, which was more direct.
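As an illustration, here is a small Haskell sketch of (1,2)-style triple numbers (again our own, with raw second derivatives in the third slot, matching the h equations above): each operation propagates a value, a first derivative, and a second derivative along one direction.

```haskell
-- Value, first and second derivative along a single direction.
data Triple = Triple { v0, v1, v2 :: Double } deriving Show

instance Num Triple where
  Triple x x1 x2 + Triple y y1 y2 = Triple (x + y) (x1 + y1) (x2 + y2)
  -- (uv)'' = u''·v + 2·u'·v' + u·v''
  Triple x x1 x2 * Triple y y1 y2 =
    Triple (x * y) (x1 * y + x * y1) (x2 * y + 2 * x1 * y1 + x * y2)
  negate (Triple x x1 x2) = Triple (negate x) (negate x1) (negate x2)
  fromInteger n = Triple (fromInteger n) 0 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Lift a primitive with known first and second derivative (Faà di Bruno):
-- (f∘u)'' = f''(u)·(u')² + f'(u)·u''
lift :: (Double -> Double) -> (Double -> Double) -> (Double -> Double)
     -> Triple -> Triple
lift f f' f'' (Triple x x1 x2) = Triple (f x) (f' x * x1) (f'' x * x1 * x1 + f' x * x2)

sinT :: Triple -> Triple
sinT = lift sin cos (negate . sin)

-- First and second derivative of x ↦ sin(x²) at x = 0.5, in one pass.
main :: IO ()
main = let x = Triple 0.5 1 0 in print (sinT (x * x))
```

Note how sin's higher derivatives reuse sin and cos, which is precisely the kind of sharing that makes Taylor-mode cheaper than iterated first order AD.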
2.5. Remark. In the rest of this article, we study forward-mode (k, R)-automatic differentiation for a language with higher-order functions. The reader may like to fix k = R = 1 for standard automatic differentiation with first-order derivatives, based on dual numbers. This is the approach taken in the conference version of this paper [29]. But the generalization to higher-order derivatives with arbitrary k and R flows straightforwardly through the whole narrative.

3. A Higher Order Forward-Mode AD Translation
3.1. A simple language of smooth functions.
We consider a standard higher order typed language with a first order type real of real numbers. The types (τ, σ) and terms (t, s) are as follows.

τ, σ, ρ ::=                                       types
    real                                          real numbers
  | (τ₁ ∗ ... ∗ τₙ)                               finite product
  | τ → σ                                         function

t, s, r ::=                                       terms
    x                                             variable
  | op(t₁, ..., tₙ)                               operations (including constants)
  | ⟨t₁, ..., tₙ⟩ | case t of ⟨x₁, ..., xₙ⟩ → s    tuples/pattern matching
  | λx.t | t s                                    function abstraction/application

The typing rules are in Figure 3. We have included some abstract basic n-ary operations op ∈ Opₙ for every n ∈ ℕ. These are intended to include the usual (smooth) mathematical operations that are used in programs to which automatic differentiation is applied. For example,
• for any real constant c ∈ ℝ, we typically include a constant c ∈ Op₀; we slightly abuse notation and will simply write c for c() in our examples;
• we include some unary operations such as ς ∈ Op₁, which we intend to stand for the usual sigmoid function, ς(x) := 1/(1 + e^{−x});
• we include some binary operations such as addition and multiplication (+), (∗) ∈ Op₂.

We add some simple syntactic sugar t − u := t + (−1) ∗ u and, for some natural number n, n · t := t + ... + t (n times) and tⁿ := t ∗ ... ∗ t (n times). Similarly, we will frequently denote repeated sums and products using Σ- and Π-signs, respectively: for example, we write t₁ + ... + tₙ as Σ_{i∈{1,...,n}} tᵢ and t₁ ∗ ... ∗ tₙ as Π_{i∈{1,...,n}} tᵢ. This is in addition to programming sugar such as let x = t in s for (λx.s) t, and λ⟨x₁, ..., xₙ⟩.t for λx. case x of ⟨x₁, ..., xₙ⟩ → t.

3.2. Syntactic automatic differentiation: a functorial macro.
The aim of higher order forward mode AD is to find the (k, R)-Taylor representation of a function by syntactic manipulations, for some choice of (k, R) that we fix. For our simple language, we implement this as the following inductively defined macro −→D(k,R) on both types and terms (see also [68, 61]). For the sake of legibility, we simply write −→D(k,R) as −→D here and leave the dimension k and order R of the Taylor representation implicit. The following definition is for general k and R, but we treat specific cases afterwards in Example 1.

−→D(real) := real^{$\binom{R+k}{k}$}   (i.e., the type of tuples of reals of length $\binom{R+k}{k}$)
−→D((τ₁ ∗ ... ∗ τₙ)) := (−→D(τ₁) ∗ ... ∗ −→D(τₙ))
−→D(τ → σ) := −→D(τ) → −→D(σ)

Figure 3 (typing rules for the simple language):

Γ ⊢ t₁ : real  ···  Γ ⊢ tₙ : real  ⟹  Γ ⊢ op(t₁, ..., tₙ) : real   (op ∈ Opₙ)
Γ ⊢ t₁ : τ₁  ···  Γ ⊢ tₙ : τₙ  ⟹  Γ ⊢ ⟨t₁, ..., tₙ⟩ : (τ₁ ∗ ... ∗ τₙ)
Γ ⊢ t : (σ₁ ∗ ... ∗ σₙ)  and  Γ, x₁ : σ₁, ..., xₙ : σₙ ⊢ s : τ  ⟹  Γ ⊢ case t of ⟨x₁, ..., xₙ⟩ → s : τ
(x : τ) ∈ Γ  ⟹  Γ ⊢ x : τ
Γ, x : τ ⊢ t : σ  ⟹  Γ ⊢ λx : τ.t : τ → σ
Γ ⊢ t : σ → τ  and  Γ ⊢ s : σ  ⟹  Γ ⊢ t s : τ
On terms, the macro is defined by:

−→D(x) := x
−→D(λx.t) := λx.−→D(t)
−→D(t s) := −→D(t) −→D(s)
−→D(⟨t₁, ..., tₙ⟩) := ⟨−→D(t₁), ..., −→D(tₙ)⟩
−→D(case t of ⟨x₁, ..., xₙ⟩ → s) := case −→D(t) of ⟨x₁, ..., xₙ⟩ → −→D(s)
−→D(op(t₁, ..., tₙ)) := case −→D(t₁) of ⟨x_{1,0···0}, ..., x_{1,R0···0}⟩ → ... case −→D(tₙ) of ⟨x_{n,0···0}, ..., x_{n,R0···0}⟩ →
    ⟨D_{0···0}op(x_{1,0···0}, ..., x_{n,R0···0}), ..., D_{R0···0}op(x_{1,0···0}, ..., x_{n,R0···0})⟩

where

D_{0···0}op(x_{1,0···0}, ..., x_{n,R0···0}) := op(x_{1,0···0}, ..., x_{n,0···0})

and, for α₁ + ... + α_k > 0,

D_{α₁···α_k}op(x_{1,0···0}, ..., x_{n,R0···0}) :=
  α₁!·...·α_k! · Σ_{{(β₁,...,βₙ)∈ℕⁿ | 1 ≤ β₁+...+βₙ ≤ α₁+...+α_k}} ∂^{β₁···βₙ}op(x_{1,0···0}, ..., x_{n,0···0}) ∗
  Σ_{{((e₁₁,...,e₁ₙ),...,(e_{q1},...,e_{qn}))∈(ℕⁿ)^q | e_{1j}+...+e_{qj}=βⱼ, (e₁₁+...+e₁ₙ)·α_{1i}+...+(e_{q1}+...+e_{qn})·α_{qi}=αᵢ}}
    Π_{r=1}^{q} Π_{j=1}^{n} 1/e_{rj}! · ( 1/(α_{r1}!·...·α_{rk}!) · x_{j,α_{r1}···α_{rk}} )^{e_{rj}},

mirroring the Faà di Bruno formula of §2.2. Here, (∂^{β₁···βₙ}op)(x₁, ..., xₙ) are some chosen terms of type real in the language with free variables from x₁, ..., xₙ. We think of these terms as implementing the partial derivative ∂^{β₁+...+βₙ}⟦op⟧(x₁, ..., xₙ)/∂x₁^{β₁}···∂xₙ^{βₙ} of the smooth function ⟦op⟧ : ℝⁿ → ℝ that op implements. For example, we could choose the following representations of derivatives of order ≤ 2:

∂^{10}(+)(x₁, x₂) = 1    ∂^{01}(+)(x₁, x₂) = 1    ∂^{20}(+)(x₁, x₂) = ∂^{11}(+)(x₁, x₂) = ∂^{02}(+)(x₁, x₂) = 0
∂^{10}(∗)(x₁, x₂) = x₂   ∂^{01}(∗)(x₁, x₂) = x₁   ∂^{11}(∗)(x₁, x₂) = 1    ∂^{20}(∗)(x₁, x₂) = ∂^{02}(∗)(x₁, x₂) = 0
∂¹(ς)(x) = let y = ς(x) in y ∗ (1 − y)
∂²(ς)(x) = let y = ς(x) in let z = y ∗ (1 − y) in z ∗ (1 − 2 ∗ y)

Note that our rules, in particular, imply that −→D(c) = ⟨c, 0, ..., 0⟩.

Example 1 ((1,1)- and (2,2)-AD). Our choices of partial derivatives of the example operations are sufficient to implement (k, R)-Taylor forward AD with R ≤ 2. To be explicit, the distinctive formulas for these instances of −→D(k,R) above are

−→D(1,1)(real) = (real ∗ real)
−→D(1,1)(op(t₁, ..., tₙ)) = case −→D(1,1)(t₁) of ⟨x_{1,0}, x_{1,1}⟩ → ... → case −→D(1,1)(tₙ) of ⟨x_{n,0}, x_{n,1}⟩ →
    ⟨op(x_{1,0}, ..., x_{n,0}), Σᵢ₌₁ⁿ x_{i,1} ∗ ∂_î op(x_{1,0}, ..., x_{n,0})⟩

−→D(2,2)(real) = real⁶
−→D(2,2)(op(t₁, ..., tₙ)) = case −→D(2,2)(t₁) of ⟨x_{1,00}, x_{1,01}, x_{1,02}, x_{1,10}, x_{1,11}, x_{1,20}⟩ → ...
  case −→D(2,2)(tₙ) of ⟨x_{n,00}, x_{n,01}, x_{n,02}, x_{n,10}, x_{n,11}, x_{n,20}⟩ →
    ⟨ op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,01} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,02} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,01} ∗ x_{j,01} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,10} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,11} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,10} ∗ x_{j,01} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,20} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,10} ∗ x_{j,10} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}) ⟩

where we informally write î for the one-hot encoding of i (the sequence of length n consisting exclusively of zeros except in position i, where it has a 1) and ĉ(i,j) for the two-hot encoding of i and j (the sequence of length n consisting exclusively of zeros except in positions i and j, where it has a 1 if i ≠ j and a 2 if i = j).

As noted in §2, it is often unnecessary to include all components of the (2,2)-jets; we may drop the non-mixed second order components, yielding the more economical variant

−→D(2,2)′(real) = real⁴
−→D(2,2)′(op(t₁, ..., tₙ)) = case −→D(2,2)′(t₁) of ⟨x_{1,0}, x_{1,1}, x_{1,2}, x_{1,3}⟩ → ...
  case −→D(2,2)′(tₙ) of ⟨x_{n,0}, x_{n,1}, x_{n,2}, x_{n,3}⟩ →
    ⟨ op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,1} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,2} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,3} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}) + Σᵢ,ᵢ′₌₁ⁿ x_{i,1} ∗ x_{i′,2} ∗ ∂_{ĉ(i,i′)} op(x_{1,0}, ..., x_{n,0}) ⟩.

We extend −→D to contexts: −→D({x₁ : τ₁, ..., xₙ : τₙ}) := {x₁ : −→D(τ₁), ..., xₙ : −→D(τₙ)}. This turns −→D into a well-typed, functorial macro in the following sense.

Lemma 1 (Functorial macro). If Γ ⊢ t : τ then −→D(Γ) ⊢ −→D(t) : −→D(τ). If Γ, x : σ ⊢ t : τ and Γ ⊢ s : σ then −→D(Γ) ⊢ −→D(t[s/x]) = −→D(t)[−→D(s)/x].

Proof. By induction on the structure of typing derivations.
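To make the macro concrete, here is a small Haskell sketch of −→D(1,1) as a source-to-source transformation on a fragment of the language with variables, constants, +, ∗ and ς. This is our own illustration with hypothetical names throughout, and for brevity it uses fixed fresh variable names; a real implementation must generate genuinely fresh names to avoid capture.

```haskell
-- A fragment of the object language.
data Tm
  = Var String
  | Con Double              -- constant c ∈ Op₀
  | Add Tm Tm | Mul Tm Tm   -- (+), (∗) ∈ Op₂
  | Sig Tm                  -- sigmoid ς ∈ Op₁
  | Pair Tm Tm | CasePair Tm String String Tm
  | Lam String Tm | App Tm Tm
  deriving Show

-- −→D(1,1): structure preserving on λ, application and pairs;
-- at an operation it emits the dual-number rule ⟨op x⃗, Σᵢ xᵢ' ∗ ∂ᵢop x⃗⟩.
d :: Tm -> Tm
d (Var x)    = Var x
d (Con c)    = Pair (Con c) (Con 0)
d (Lam x t)  = Lam x (d t)
d (App t s)  = App (d t) (d s)
d (Pair t s) = Pair (d t) (d s)
d (CasePair t x y s) = CasePair (d t) x y (d s)
d (Add t s)  = binOp t s (\a _ b _ -> Add a b) (\_ a' _ b' -> Add a' b')
d (Mul t s)  = binOp t s (\a _ b _ -> Mul a b)
                         (\a a' b b' -> Add (Mul a b') (Mul a' b))
d (Sig t)    = CasePair (d t) "x" "x'"
  (Pair (Sig (Var "x"))
        (Mul (Var "x'")
             (Mul (Sig (Var "x"))
                  (Add (Con 1) (Mul (Con (-1)) (Sig (Var "x")))))))
  -- ∂ς(x) = ς(x)·(1 − ς(x)); the paper's version uses let for sharing.

-- Shared scaffolding: destruct the two dual-number arguments.
binOp :: Tm -> Tm -> (Tm -> Tm -> Tm -> Tm -> Tm)
              -> (Tm -> Tm -> Tm -> Tm -> Tm) -> Tm
binOp t s prim tang =
  CasePair (d t) "a" "a'" (CasePair (d s) "b" "b'"
    (Pair (prim (Var "a") (Var "a'") (Var "b") (Var "b'"))
          (tang (Var "a") (Var "a'") (Var "b") (Var "b'"))))

main :: IO ()
main = print (d (Lam "w" (Sig (Add (Mul (Var "w") (Var "w")) (Con 3)))))
```

The transformation never inspects function bodies beyond recursing into them, which is exactly the functoriality property of Lemma 1.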
Example 2 (Inner products). Let us write τⁿ for the n-fold product (τ ∗ ... ∗ τ). Then, given Γ ⊢ t, s : realⁿ, we can define their inner product

Γ ⊢ t ·ₙ s := case t of ⟨z₁, ..., zₙ⟩ → case s of ⟨y₁, ..., yₙ⟩ → z₁ ∗ y₁ + ··· + zₙ ∗ yₙ : real.

To illustrate the calculation of −→D(1,1), let us expand (and β-reduce) −→D(1,1)(t ·₂ s):

case −→D(1,1)(t) of ⟨z₁, z₂⟩ → case −→D(1,1)(s) of ⟨y₁, y₂⟩ →
case z₁ of ⟨z_{1,0}, z_{1,1}⟩ → case y₁ of ⟨y_{1,0}, y_{1,1}⟩ → case z₂ of ⟨z_{2,0}, z_{2,1}⟩ → case y₂ of ⟨y_{2,0}, y_{2,1}⟩ →
⟨ z_{1,0} ∗ y_{1,0} + z_{2,0} ∗ y_{2,0},
  z_{1,0} ∗ y_{1,1} + z_{1,1} ∗ y_{1,0} + z_{2,0} ∗ y_{2,1} + z_{2,1} ∗ y_{2,0} ⟩.

Let us also expand the calculation of −→D(2,2)′(t ·₂ s):

case −→D(2,2)′(t) of ⟨z₁, z₂⟩ → case −→D(2,2)′(s) of ⟨y₁, y₂⟩ →
case z₁ of ⟨z₁, z₁′ₓ, z₁′ᵧ, z₁″⟩ → case y₁ of ⟨y₁, y₁′ₓ, y₁′ᵧ, y₁″⟩ →
case z₂ of ⟨z₂, z₂′ₓ, z₂′ᵧ, z₂″⟩ → case y₂ of ⟨y₂, y₂′ₓ, y₂′ᵧ, y₂″⟩ →
⟨ z₁ ∗ y₁ + z₂ ∗ y₂,
  z₁ ∗ y₁′ₓ + z₁′ₓ ∗ y₁ + z₂ ∗ y₂′ₓ + z₂′ₓ ∗ y₂,
  z₁ ∗ y₁′ᵧ + z₁′ᵧ ∗ y₁ + z₂ ∗ y₂′ᵧ + z₂′ᵧ ∗ y₂,
  z₁″ ∗ y₁ + z₂″ ∗ y₂ + y₁″ ∗ z₁ + y₂″ ∗ z₂ + z₁′ₓ ∗ y₁′ᵧ + z₁′ᵧ ∗ y₁′ₓ + z₂′ₓ ∗ y₂′ᵧ + z₂′ᵧ ∗ y₂′ₓ ⟩.

Example 3 (Neural networks). In our introduction, we provided a program (1.1) in our language to build a neural network out of expressions neuron, layer, comp; this program makes use of the inner product of Ex. 2. We can similarly calculate the derivatives of deep neural nets by mechanically applying the macro −→D.
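To see Examples 2 and 3 end to end, here is a hedged Haskell sketch (our own, hypothetical code, not the paper's implementation): the building blocks neuron, layer and comp from (1.1), written polymorphically where possible, so that instantiating them at a dual-number type differentiates the whole network mechanically, just as −→D does syntactically.

```haskell
-- Dual numbers again, kept local so this sketch is self-contained.
data Dual = Dual { primal, tangent :: Double } deriving Show
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

sigmoid :: Dual -> Dual
sigmoid (Dual x x') = let y = 1 / (1 + exp (-x)) in Dual y (y * (1 - y) * x')

-- The inner product of Ex. 2, for any Num type.
dot :: Num a => [a] -> [a] -> a
dot zs ys = sum (zipWith (*) zs ys)

-- neuron, layer and comp, in the shape of (1.1).
neuron :: [Dual] -> ([Dual], Dual) -> Dual
neuron x (w, b) = sigmoid (dot w x + b)

layer :: (x -> p -> y) -> x -> [p] -> [y]
layer f x ps = map (f x) ps

comp :: (x -> p -> y) -> (y -> q -> z) -> x -> (p, q) -> z
comp f g x (p, q) = g (f x p) q

-- ∂/∂b of a one-neuron network, by seeding b with tangent 1.
main :: IO ()
main = print (tangent (neuron [Dual 1 0, Dual 2 0]
                              ([Dual 0.5 0, Dual (-0.3) 0], Dual 0.1 1)))
```

The higher order combinators layer and comp need no special treatment: they propagate dual numbers simply by being applied, which is the compositionality this paper makes precise.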
4. Semantics of differentiation

Consider for a moment the first order fragment of the language in §3, with only one type, real, and no λ's or pairs. This has a simple semantics in the category of cartesian spaces and smooth maps. Indeed, a term x₁, ..., xₙ : real ⊢ t : real has a natural reading as a function ⟦t⟧ : ℝⁿ → ℝ, by interpreting our operation symbols by the well-known operations on ℝⁿ → ℝ with the corresponding name. In fact, the functions that are definable in this first order fragment are smooth. Let us write CartSp for this category of cartesian spaces (ℝⁿ for some n) and smooth functions.

The category CartSp has cartesian products, and so we can also interpret product types, tupling and pattern matching, giving us a useful syntax for constructing functions into and out of products of ℝ. For example, the interpretation of (neuronₙ) in (1.1) becomes

ℝⁿ × ℝⁿ × ℝ −−⟦·ₙ⟧×id_ℝ−→ ℝ × ℝ −−⟦+⟧−→ ℝ −−⟦ς⟧−→ ℝ,

where ⟦·ₙ⟧, ⟦+⟧ and ⟦ς⟧ are the usual inner product, addition and the sigmoid function on ℝ, respectively. Inside this category, we can straightforwardly study the first order language without λ's, and automatic differentiation. In fact, we can prove the following by plain induction on the syntax:

The interpretation of the (syntactic) forward AD −→D(t) of a first order term t equals the usual (semantic) derivative of the interpretation of t as a smooth function.
However, as is well known, the category CartSp does not support function spaces. To see this, notice that we have polynomial terms x₁, ..., x_d : real ⊢ λy. Σ_{n=1}^{d} xₙ ∗ yⁿ : real → real for each d, and so if we could interpret (real → real) as a Euclidean space ℝᵖ then, by interpreting these polynomial expressions, we would be able to find continuous injections ℝᵈ → ℝᵖ for every d, which is topologically impossible for any p, for example as a consequence of the Borsuk-Ulam theorem (see Appx. A).

This lack of function spaces means that we cannot interpret the functions (layer) and (comp) from (1.1) in CartSp, as they are higher order functions, even though they are very useful and innocent building blocks for differential programming! Clearly, we could define neural nets such as (1.1) directly as smooth functions without any higher order subcomponents, though that would quickly become cumbersome for deep networks. A problematic consequence of the lack of a semantics for higher order differential programs is that we have no obvious way of establishing compositional semantic correctness of −→D for the given implementation of (1.1).

We now show that every definable function is smooth, and then in Section 4.2 we show that the −→D macro witnesses its derivatives.

4.1. Smoothness at higher types and diffeologies.
The aim of this section is to introduce diffeological spaces as a semantic model for the simple language in Section 3. By way of motivation, we begin with a standard set theoretic semantics ⦇−⦈, where types are interpreted as follows:

⦇real⦈ := ℝ    ⦇(τ₁ ∗ ... ∗ τₙ)⦈ := Πᵢ₌₁ⁿ ⦇τᵢ⦈    ⦇τ → σ⦈ := (⦇τ⦈ → ⦇σ⦈)

and a term x₁ : τ₁, ..., xₙ : τₙ ⊢ t : σ is interpreted as a function ⦇t⦈ : Πᵢ₌₁ⁿ ⦇τᵢ⦈ → ⦇σ⦈, mapping a valuation of the context to a result.

We can show that the interpretation of a term x₁ : real, ..., xₙ : real ⊢ t : real is always a smooth function ℝⁿ → ℝ, even if it has higher order subterms. We begin with a fairly standard logical relations proof of this, and then move from this to the semantic model of diffeological spaces.

Proposition 1. If x₁ : real, ..., xₙ : real ⊢ t : real then the function ⦇t⦈ : ℝⁿ → ℝ is smooth.

Proof. Fix R and k. We show that ⦇t⦈ has (k,R)-velocities. For each type τ define a set Q_τ ⊆ [ℝᵏ → ⦇τ⦈] by induction on the structure of types:

Q_real = { f : ℝᵏ → ℝ | f has a (k,R)-Taylor approximation everywhere }
Q_{(τ₁∗...∗τₙ)} = { f : ℝᵏ → Πᵢ₌₁ⁿ ⦇τᵢ⦈ | each component fᵢ of f is in Q_{τᵢ} }
Q_{τ→σ} = { f : ℝᵏ → ⦇τ⦈ → ⦇σ⦈ | ∀g ∈ Q_τ. λr⃗. f(r⃗)(g(r⃗)) ∈ Q_σ }

Now we show the fundamental lemma: if x₁ : τ₁, ..., xₙ : τₙ ⊢ u : σ and g₁ ∈ Q_{τ₁}, ..., gₙ ∈ Q_{τₙ}, then ((g₁, ..., gₙ); ⦇u⦈) ∈ Q_σ. This is shown by induction on the structure of typing derivations. The only interesting step here is that the basic operations (+, ∗, ς etc.) are smooth. We deduce the statement of the theorem by putting u = t, k = n, and letting gᵢ : ℝⁿ → ℝ be the projections.

At higher types, the logical relations Q show that we can only define functions that send smooth functions to smooth functions, meaning that we can never use them to build first order functions that are not smooth. For example, (comp) in (1.1) has this property.

This logical relations proof suggests to build a semantic model by interpreting types as sets with structure: for each type we have a set X together with a set Q^{ℝᵏ}_X ⊆ [ℝᵏ → X] of plots.

Definition 1. A diffeological space (X, P_X) consists of a set X together with, for each n and each open subset U of ℝⁿ, a set P^U_X ⊆ [U → X] of functions, called plots, such that
• all constant functions are plots;
• if f : V → U is a smooth function and p ∈ P^U_X, then f; p ∈ P^V_X;
• if (pᵢ ∈ P^{Uᵢ}_X)_{i∈I} is a compatible family of plots (x ∈ Uᵢ ∩ Uⱼ ⇒ pᵢ(x) = pⱼ(x)) and (Uᵢ)_{i∈I} covers U, then the gluing p : U → X, defined by p(x) = pᵢ(x) for x ∈ Uᵢ, is a plot.

We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P^U_X, we have that p; f ∈ P^U_Y. We write Diff(X, Y) for the set of smooth maps from X to Y. Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions.

A diffeological space is thus a set equipped with structure. Many constructions of sets carry over straightforwardly to diffeological spaces.
Example 4 (Cartesian diffeologies). Each open subset U of ℝⁿ can be given the structure of a diffeological space by taking all the smooth functions V → U as the plots P^V_U. It is easily seen that smooth functions V → U in the traditional sense coincide with smooth functions in the sense of diffeological spaces. Thus diffeological spaces have a profound relationship with ordinary calculus. In categorical terms, this gives a full embedding of CartSp in Diff.

Example 5 (Product diffeologies). Given a family (Xᵢ)_{i∈I} of diffeological spaces, we can equip the product Π_{i∈I} Xᵢ of sets with the product diffeology, in which U-plots are precisely the functions of the form (pᵢ)_{i∈I} for pᵢ ∈ P^U_{Xᵢ}. This gives us the categorical product in Diff.

Example 6 (Functional diffeology). We can equip the set Diff(X, Y) of smooth functions between diffeological spaces with the functional diffeology, in which U-plots consist of functions f : U → Diff(X, Y) such that (u, x) ↦ f(u)(x) is an element of Diff(U × X, Y). This specifies the categorical function object in Diff.

We can now give a denotational semantics to our language from Section 3 in the category of diffeological spaces. We interpret each type τ as a set ⟦τ⟧ equipped with the relevant diffeology, by induction on the structure of types:

⟦real⟧ := ℝ    ⟦(τ₁ ∗ ... ∗ τₙ)⟧ := Πᵢ₌₁ⁿ ⟦τᵢ⟧    ⟦τ → σ⟧ := Diff(⟦τ⟧, ⟦σ⟧)

A context Γ = (x₁ : τ₁, ..., xₙ : τₙ) is interpreted as a diffeological space ⟦Γ⟧ := Πᵢ₌₁ⁿ ⟦τᵢ⟧. Now well typed terms Γ ⊢ t : τ are interpreted as smooth functions ⟦t⟧ : ⟦Γ⟧ → ⟦τ⟧, giving a meaning for t for every valuation of the context. This is routinely defined by induction on the structure of typing derivations, once we choose a smooth function ⟦op⟧ : ℝⁿ → ℝ to interpret each n-ary operation op ∈ Opₙ. For example, constants c : real are interpreted as constant functions, and the first order operations (+, ∗, ς) are interpreted by composing with the corresponding functions, which are smooth: e.g., ⟦ς(t)⟧(ρ) := ς(⟦t⟧(ρ)), where ρ ∈ ⟦Γ⟧. Variables are interpreted as ⟦xᵢ⟧(ρ) := ρᵢ. The remaining constructs are interpreted as follows, and it is straightforward to show that smoothness is preserved:

⟦⟨t₁, ..., tₙ⟩⟧(ρ) := (⟦t₁⟧(ρ), ..., ⟦tₙ⟧(ρ))
⟦λx : τ.t⟧(ρ)(a) := ⟦t⟧(ρ, a)   (a ∈ ⟦τ⟧)
⟦case t of ⟨...⟩ → s⟧(ρ) := ⟦s⟧(ρ, ⟦t⟧(ρ))
⟦t s⟧(ρ) := ⟦t⟧(ρ)(⟦s⟧(ρ))

The logical relations proof of Proposition 1 is reminiscent of diffeological spaces. We now briefly remark on the suitability of the axioms of diffeological spaces (Def. 1) for a semantic model of smooth programs. The first axiom says that we only consider reflexive logical relations. From the perspective of the interpretation, it recognizes in particular that the semantics of an expression of type (real → real) → real is defined by its value on smooth functions rather than arbitrary arguments. That is to say, the set-theoretic semantics at the beginning of this section, ⦇(real → real) → real⦈, is different to the diffeological semantics, ⟦(real → real) → real⟧. The second axiom for diffeological spaces ensures that the smooth maps in Diff(U, X) are exactly the plots in P^U_X. The third axiom ensures that categories of manifolds fully embed into Diff; it will not play a visible role in this paper (in fact, [5] prove similar results for a simple language like ours by using plain logical relations, over Set, and without demanding the diffeology axioms). However, we expect the third axiom to be crucial for programming with other smooth structures or partiality.

4.2. Correctness of AD.
We have shown that a term x₁ : real, ..., xₙ : real ⊢ t : real is interpreted as a smooth function ⟦t⟧ : ℝⁿ → ℝ, even if t involves higher order functions (like (1.1)). Moreover, the macro differentiation −→D(k,R)(t) is interpreted as a smooth function ⟦−→D(k,R)(t)⟧ : $\big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ (Prop. 1). This enables us to state a limited version of our main correctness theorem:

Theorem 1 (Semantic correctness of −→D (limited)). For any term x₁ : real, ..., xₙ : real ⊢ t : real, the function ⟦−→D(k,R)(t)⟧ is the (k, R)-Taylor representation (2.2) of ⟦t⟧. In detail: for any smooth functions f₁, ..., fₙ : ℝᵏ → ℝ,

\[
\left(\left(\frac{\partial^{\alpha_1+\ldots+\alpha_k} f_j(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}\right)_{j=1}^{n} ; \llbracket \vec{\mathcal{D}}_{(k,R)}(t) \rrbracket = \left(\frac{\partial^{\alpha_1+\ldots+\alpha_k}((f_1,\ldots,f_n);\llbracket t \rrbracket)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}.
\]

(For instance, if n = 2, then ⟦−→D(1,1)(t)⟧(x₁, 1, x₂, 0) = (⟦t⟧(x₁, x₂), ∂⟦t⟧(x, x₂)/∂x (x₁)).)

Proof.
We prove this by logical relations. Although the following proof is elementary, we found it by using the categorical methods in §6.

For each type τ, we define a binary relation S_τ between (open) k-dimensional plots in ⟦τ⟧ and (open) k-dimensional plots in ⟦−→D(k,R)(τ)⟧, i.e. S_τ ⊆ P^{ℝᵏ}_{⟦τ⟧} × P^{ℝᵏ}_{⟦−→D(k,R)(τ)⟧}, by induction on τ:

• S_real := { (f, ((∂^{α₁+...+α_k} f(x)/∂x₁^{α₁}···∂x_k^{α_k}))_{(α₁,...,α_k)=(0,...,0)}^{(R,0,...,0)}) | f : ℝᵏ → ℝ smooth };
• S_{(τ₁∗...∗τₙ)} := { ((f₁, ..., fₙ), (g₁, ..., gₙ)) | (f₁, g₁) ∈ S_{τ₁}, ..., (fₙ, gₙ) ∈ S_{τₙ} };
• S_{τ→σ} := { (f₁, f₂) | ∀(g₁, g₂) ∈ S_τ. (x ↦ f₁(x)(g₁(x)), x ↦ f₂(x)(g₂(x))) ∈ S_σ }.

Then, we establish the following 'fundamental lemma': if x₁ : τ₁, ..., xₙ : τₙ ⊢ t : σ and, for all 1 ≤ i ≤ n, fᵢ : ℝᵏ → ⟦τᵢ⟧ and gᵢ : ℝᵏ → ⟦−→D(k,R)(τᵢ)⟧ are such that (fᵢ, gᵢ) is in S_{τᵢ}, then we have that ((f₁, ..., fₙ); ⟦t⟧, (g₁, ..., gₙ); ⟦−→D(k,R)(t)⟧) is in S_σ. This is proved routinely by induction on the typing derivation of t. The case for op(t₁, ..., tₙ) relies on the precise definition of −→D(k,R)(op(t₁, ..., tₙ)).

We conclude the theorem from the fundamental lemma by considering the case where τᵢ = σ = real.
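The content of Theorem 1 can be illustrated (not proved, of course) with a quick numerical check, a sketch of our own: the dual-number interpretation of a program with higher order subterms agrees with a finite-difference approximation of the derivative of its denotation.

```haskell
data Dual = Dual { primal, tangent :: Double }
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- A program with a higher order subterm: comp applied to two closures.
prog :: Num a => a -> a
prog = comp (\x -> x * x + 1) (\y -> y * y * y)
  where comp f g = g . f   -- higher order composition, as in the language

main :: IO ()
main = do
  let a   = 0.7
      eps = 1e-6
      fd  = (prog (a + eps) - prog (a - eps)) / (2 * eps)  -- finite difference
      ad  = tangent (prog (Dual a 1))                      -- dual-number derivative
  print (fd, ad)  -- approximately equal
```

The two numbers agree up to the accuracy of the finite-difference scheme, which is the shape of the statement the theorem makes exactly.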
5. Extending the language: variant and inductive types

In this section, we show that the definition of forward AD and the semantics generalize if we extend the language of §3 with variant and list types.

5.1. Language.
We additionally consider the following types and terms:

τ, σ, ρ ::= ...                                    types
  | {ℓ₁ τ₁ | ... | ℓₙ τₙ}                           variant
  | list(τ)                                         list

t, s, r ::= ...                                     terms
  | τ.ℓ t                                           variant constructor
  | [] | t :: s                                     empty list and cons
  | case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ}        pattern matching: variants
  | fold (x₁, x₂).t over s from r                   list fold

We extend the type system according to the rules of Fig. 4. We can then extend −→D(k,R) (again, writing it as −→D, for legibility) to our new types and terms by

−→D({ℓ₁ τ₁ | ... | ℓₙ τₙ}) := {ℓ₁ −→D(τ₁) | ... | ℓₙ −→D(τₙ)}
−→D(list(τ)) := list(−→D(τ))
−→D(τ.ℓ t) := −→D(τ).ℓ −→D(t)
−→D([]) := []
−→D(t :: s) := −→D(t) :: −→D(s)
−→D(case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ}) := case −→D(t) of {ℓ₁ x₁ → −→D(s₁) | ··· | ℓₙ xₙ → −→D(sₙ)}
−→D(fold (x₁, x₂).t over s from r) := fold (x₁, x₂).−→D(t) over −→D(s) from −→D(r)

To demonstrate the practical use of expressive type systems for differential programming, we consider the following two examples.

Example 7 (Lists of inputs for neural nets). Usually, we run a neural network on a large data set, the size of which might be determined at runtime. To evaluate a neural network on multiple inputs, in practice, one often sums the outcomes. This can be coded in our extended language as follows. Suppose that we have a network f : (realⁿ ∗ P) → real that operates on single input vectors. We can construct one that operates on lists of inputs as follows:

g := λ⟨l, w⟩. fold (x₁, x₂). f⟨x₁, w⟩ + x₂ over l from 0 : (list(realⁿ) ∗ P) → real

Example 8 (Missing data). In practically every application of statistics and machine learning, we face the problem of missing data: for some observations, only partial information is available. In an expressive typed programming language like we consider, we can model missing data conveniently using the data type maybe(τ) = {Nothing () | Just τ}. In the context of a neural network, one might use it as follows. First, define some helper functions:

fromMaybe := λx.λm. case m of {Nothing x₀ → x | Just x′ → x′}
fromMaybeₙ := λ⟨x₁, ..., xₙ⟩.λ⟨m₁, ..., mₙ⟩. ⟨fromMaybe x₁ m₁, ..., fromMaybe xₙ mₙ⟩
    : realⁿ → (maybe(real))ⁿ → realⁿ
map := λf.λl. fold (x₁, x₂). f x₁ :: x₂ over l from [] : (τ → σ) → list(τ) → list(σ)

Given a neural network f : (list(realᵏ) ∗ P) → real, we can build a new one that operates on a data set for which some covariates (features) are missing, by passing in default values to replace the missing covariates:

λ⟨l, ⟨m, w⟩⟩. f⟨map (fromMaybeₖ m) l, w⟩ : (list((maybe(real))ᵏ) ∗ (realᵏ ∗ P)) → real

Then, given a data set l with missing covariates, we can perform automatic differentiation on this network to optimize, simultaneously, the ordinary network parameters w and the default values for missing covariates m.

Figure 4 (additional typing rules for the extended language):

Γ ⊢ t : τᵢ  ⟹  Γ ⊢ τ.ℓᵢ t : τ   ((ℓᵢ τᵢ) ∈ τ)
Γ ⊢ [] : list(τ)
Γ ⊢ t : τ  and  Γ ⊢ s : list(τ)  ⟹  Γ ⊢ t :: s : list(τ)
Γ ⊢ t : {ℓ₁ τ₁ | ... | ℓₙ τₙ}  and, for each 1 ≤ i ≤ n,  Γ, xᵢ : τᵢ ⊢ sᵢ : τ  ⟹  Γ ⊢ case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ} : τ
Γ ⊢ s : list(τ),  Γ ⊢ r : σ  and  Γ, x₁ : τ, x₂ : σ ⊢ t : σ  ⟹  Γ ⊢ fold (x₁, x₂).t over s from r : σ
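A hedged Haskell analogue of Examples 7 and 8 (our own sketch, hypothetical names): because the AD transformation is structure preserving on lists and variants, a generic fold over dual numbers differentiates a whole-data-set loss, and the default values for missing covariates are themselves differentiable parameters.

```haskell
import Data.Maybe (fromMaybe)

data Dual = Dual { primal, tangent :: Double }
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Example 7: sum a per-input network f over a list of inputs, as a fold.
sumOver :: Num a => ((x, p) -> a) -> ([x], p) -> a
sumOver f (l, w) = foldr (\x acc -> f (x, w) + acc) 0 l

-- Example 8: fill in missing covariates with (differentiable) defaults m.
fillMissing :: [Dual] -> [[Maybe Dual]] -> [[Dual]]
fillMissing m = map (zipWith (\d mx -> fromMaybe d mx) m)

-- ∂/∂m₁ of the summed loss, seeding the first default value with tangent 1.
main :: IO ()
main = do
  let f (xs, w) = sum (zipWith (*) xs w)      -- a toy linear 'network'
      m = [Dual 0 1, Dual 0 0]                -- defaults for 2 covariates
      dataset = [[Just 1, Nothing], [Nothing, Just 2]]
      w = [Dual 3 0, Dual 4 0]
  print (tangent (sumOver f (fillMissing m dataset, w)))
```

No case in this code is AD-specific: the derivative flows through foldr, map and Maybe exactly because −→D commutes with the list and variant constructs.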
5.2. Semantics. In §4, each type τ was interpreted as a diffeological space, which is a set equipped with a family of plots. We extend this interpretation to the new types:

• A variant type {ℓ₁ τ₁ | ... | ℓₙ τₙ} is interpreted as the disjoint union of the semantic spaces, ⟦{ℓ₁ τ₁ | ··· | ℓₙ τₙ}⟧ := ∐ᵢ₌₁ⁿ ⟦τᵢ⟧, with U-plots

P^U_{⟦{ℓ₁ τ₁ | ... | ℓₙ τₙ}⟧} := { [Uⱼ −fⱼ→ ⟦τⱼ⟧ ↪ ∐ᵢ₌₁ⁿ ⟦τᵢ⟧]ⱼ₌₁ⁿ | U = ∐ⱼ₌₁ⁿ Uⱼ, fⱼ ∈ P^{Uⱼ}_{⟦τⱼ⟧} }.

• A list type list(τ) is interpreted as the set of lists, ⟦list(τ)⟧ := ∐ᵢ₌₀^∞ ⟦τ⟧ⁱ, with U-plots

P^U_{⟦list(τ)⟧} := { [Uⱼ −fⱼ→ ⟦τ⟧ʲ ↪ ∐ᵢ₌₀^∞ ⟦τ⟧ⁱ]ⱼ₌₀^∞ | U = ∐ⱼ₌₀^∞ Uⱼ, fⱼ ∈ P^{Uⱼ}_{⟦τ⟧ʲ} }.

The constructors and destructors for variants and lists are interpreted as in the usual set theoretic semantics. It is routine to show inductively that these interpretations are smooth. Thus every term Γ ⊢ t : τ in the extended language is interpreted as a smooth function ⟦t⟧ : ⟦Γ⟧ → ⟦τ⟧ between diffeological spaces.

In this section we focused on a language with lists and variants. All inductive types are easily interpreted in the category of diffeological spaces in much the same way. The categorically minded reader may regard this as a consequence of Diff being a concrete Grothendieck quasitopos, e.g. [4], and hence complete and cocomplete. The interpretation follows exactly the usual categorical semantics of variant types and inductive types (e.g. [56]). List objects as initial algebras are computed as usual in a cocomplete category (e.g. [31]).
6. Categorical analysis of (higher order) forward AD and its correctness
This section has three parts. First, we give a categorical account of the functoriality of AD (Ex. 9). Then we introduce our gluing construction, and relate it to the correctness of AD (dgm. (6.1)). Finally, we state and prove a correctness theorem for all first order types by considering a category of manifolds (Th. 2).

Figure 5 (standard βη-laws, e.g. [56], for products, functions, variants and lists; we write =^{x₁,...,xₙ} to indicate that the variables x₁, ..., xₙ are not free in the left hand side):

case ⟨t₁, ..., tₙ⟩ of ⟨x₁, ..., xₙ⟩ → s = s[t₁/x₁, ..., tₙ/xₙ]
s[t/y] =^{x₁,...,xₙ} case t of ⟨x₁, ..., xₙ⟩ → s[⟨x₁, ..., xₙ⟩/y]
case ℓᵢ t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ} = sᵢ[t/xᵢ]
s[t/y] =^{x₁,...,xₙ} case t of {ℓ₁ x₁ → s[ℓ₁ x₁/y] | ··· | ℓₙ xₙ → s[ℓₙ xₙ/y]}
fold (x₁, x₂).t over [] from r = r
fold (x₁, x₂).t over s₁ :: s₂ from r = t[s₁/x₁, (fold (x₁, x₂).t over s₂ from r)/x₂]
u = s[[]/y], r[s/x₂] = s[x₁ :: y/y]  ⇒  s[t/y] =^{x₁,x₂} fold (x₁, x₂).r over t from u
(λx.t) s = t[s/x]
t =^{x} λx.t x

6.1. Syntactic categories.
Our language induces a syntactic category as follows.
Definition 2. Let Syn be the category whose objects are types, and where a morphism τ → σ is a term in context x : τ ⊢ t : σ, modulo the βη-laws (Fig. 5). Composition is by substitution.

For simplicity, we do not impose identities involving the primitive operations, such as the arithmetic identity x + y = y + x, in Syn. As is standard, this category has the following universal property.
Lemma 2 (e.g. [56]). For every bicartesian closed category C with list objects, and every choice of an object F(real) ∈ C and morphisms F(op) ∈ C(F(real)ⁿ, F(real)) for all op ∈ Opₙ and n ∈ ℕ, there is a unique functor F : Syn → C respecting this interpretation and preserving the bicartesian closed structure as well as list objects.

Proof notes.
The functor F : Syn → C is a canonical denotational semantics for the language, interpreting types as objects of C and terms as morphisms. For instance, F(τ → σ) := (Fτ → Fσ), the function space in the category C, and F(t s) is the composite (Ft, Fs); eval.

When C = Diff, the denotational semantics of the language in diffeological spaces (§4) arises in this way, as the unique structure preserving functor ⟦−⟧ : Syn → Diff satisfying ⟦real⟧ = ℝ, ⟦ς⟧ = ς and so on.

Example 9 (Canonical definition of forward AD). The forward AD macro −→D(k,R) (§3) also arises canonically in this way, as a functor Syn → Syn. Consider the unique cartesian closed functor F : Syn → Syn such that F(real) = real^{$\binom{R+k}{k}$} and

F(op) = z : (F(real) ∗ ... ∗ F(real)) ⊢ case z of ⟨x₁, ..., xₙ⟩ → −→D(k,R)(op(x₁, ..., xₙ)) : F(real).

Then for any type τ, F(τ) = −→D(k,R)(τ), and for any term x : τ ⊢ t : σ, F(t) = −→D(k,R)(t) as morphisms F(τ) → F(σ) in the syntactic category. This observation is a categorical counterpart to Lemma 1.

6.2. Categorical gluing and logical relations.
Gluing is a method for building new categorical models which has been used for many purposes, including logical relations and realizability [51]. Our logical relations argument in the proof of Theorem 1 can be understood in this setting. In this subsection, for the categorically minded, we explain this, and in doing so we quickly recover a correctness result for the more general language in Section 5 and for arbitrary first order types.

We define a category Gl_k whose objects are triples (X, X′, S) where X and X′ are diffeological spaces and S ⊆ P^{ℝᵏ}_X × P^{ℝᵏ}_{X′} is a relation between their k-dimensional plots. A morphism (X, X′, S) → (Y, Y′, T) is a pair of smooth functions f : X → Y, f′ : X′ → Y′, such that if (g, g′) ∈ S then (g; f, g′; f′) ∈ T. The idea is that this is a semantic domain where we can simultaneously interpret the language and its automatic derivatives.
The category Gl_k is bicartesian closed, has list objects, and the projection functor proj : Gl_k → Diff × Diff preserves this structure.
Proof notes.
The category Gl_k is a full subcategory of the comma category id_Set ↓ Diff(ℝᵏ, −) × Diff(ℝᵏ, −). The result thus follows by the general theory of categorical gluing (e.g. [32, Lemma 15]).

We give a semantics ⦅−⦆ = (⦅−⦆₁, ⦅−⦆₂, S₋) for the language in Gl_k, interpreting types τ as objects (⦅τ⦆₁, ⦅τ⦆₂, S_τ), and terms as morphisms. We let ⦅real⦆₁ = ℝ and ⦅real⦆₂ = ℝ^{$\binom{R+k}{k}$}, with the relation

S_real := { (f, ((∂^{α₁+...+α_k} f(x)/∂x₁^{α₁}···∂x_k^{α_k}))_{(α₁,...,α_k)=(0,...,0)}^{(R,0,...,0)}) | f : ℝᵏ → ℝ smooth }.

We interpret the operations op according to ⟦op⟧ in ⦅−⦆₁, but according to the (k, R)-Taylor representation of ⟦op⟧ in ⦅−⦆₂. For instance, when k = 2 and R = 2, ⦅∗⦆₂ : ℝ⁶ × ℝ⁶ → ℝ⁶ is

⦅∗⦆₂((x₀₀, x₀₁, x₀₂, x₁₀, x₁₁, x₂₀), (y₀₀, y₀₁, y₀₂, y₁₀, y₁₁, y₂₀)) :=
  ( x₀₀y₀₀,
    x₀₀y₀₁ + x₀₁y₀₀,
    x₀₀y₀₂ + 2x₀₁y₀₁ + x₀₂y₀₀,
    x₀₀y₁₀ + x₁₀y₀₀,
    x₀₀y₁₁ + x₀₁y₁₀ + x₁₀y₀₁ + x₁₁y₀₀,
    x₀₀y₂₀ + 2x₁₀y₁₀ + x₂₀y₀₀ ).

At this point one checks that these interpretations are indeed morphisms in Gl_k. This is equivalent to the statement that ⦅op⦆₂ is the (k, R)-Taylor representation of ⟦op⟧ (2.2). The remaining constructions of the language are interpreted using the categorical structure of Gl_k, following Lem. 2.
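As a sanity check of this (2,2) multiplication, here is a small Haskell transcription (our own sketch): the six components track the value and the partial derivatives up to order two, in the lexicographic index order (00, 01, 02, 10, 11, 20) used above.

```haskell
-- Six (2,2)-jet components: value, ∂y, ∂y², ∂x, ∂x∂y, ∂x², in that order.
data J22 = J22 Double Double Double Double Double Double deriving Show

mul :: J22 -> J22 -> J22
mul (J22 x00 x01 x02 x10 x11 x20) (J22 y00 y01 y02 y10 y11 y20) =
  J22 (x00*y00)
      (x00*y01 + x01*y00)
      (x00*y02 + 2*x01*y01 + x02*y00)
      (x00*y10 + x10*y00)
      (x00*y11 + x01*y10 + x10*y01 + x11*y00)
      (x00*y20 + 2*x10*y10 + x20*y00)

-- Seeds for coordinates x and y at a point (a, b).
seedX, seedY :: Double -> J22
seedX a = J22 a 0 0 1 0 0
seedY b = J22 b 1 0 0 0 0

-- All derivatives of g(x,y) = x·x·y up to order 2 at (3,7), in one pass:
-- value 63, ∂y = 9, ∂y² = 0, ∂x = 42, ∂x∂y = 6, ∂x² = 14.
main :: IO ()
main = print (mul (mul (seedX 3) (seedX 3)) (seedY 7))
```

Each component is just the product rule applied to the corresponding multi-index, which is the Faà di Bruno formula specialized to op = (∗).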
Notice that the following diagram commutes. One can check this by hand, or note that it follows from the initiality of $\mathbf{Syn}$ (Lem. 2): all the functors preserve all the structure.
\[
\begin{array}{ccc}
\mathbf{Syn} & \xrightarrow{\;(\mathrm{id},\,\overrightarrow{\mathcal{D}}_{(k,R)}(-))\;} & \mathbf{Syn} \times \mathbf{Syn} \\[2pt]
{\scriptstyle \llparenthesis - \rrparenthesis}\ \big\downarrow & & \big\downarrow\ {\scriptstyle \llbracket - \rrbracket \times \llbracket - \rrbracket} \\[2pt]
\mathbf{Gl}_k & \xrightarrow{\;\mathrm{proj}\;} & \mathbf{Diff} \times \mathbf{Diff}
\end{array}
\tag{6.1}
\]
We thus arrive at a restatement of the correctness theorem (Th. 1), which holds even for the extended language with variants and lists: for any $x_1 : \mathbf{real}, \ldots, x_n : \mathbf{real} \vdash t : \mathbf{real}$, the interpretations $(\llbracket t \rrbracket, \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket)$ are in the image of the projection $\mathbf{Gl}_k \to \mathbf{Diff} \times \mathbf{Diff}$, and hence $\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket$ is a $(k,R)$-Taylor representation of $\llbracket t \rrbracket$.
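The restated theorem can be checked numerically on examples. The following sketch is our own (it reuses Dual, primal, tangent, variable and sigmoid from the sketch in Example 9, and checkAt is a hypothetical name); it compares the forward AD derivative of a function assembled with a higher order combinator against a central finite difference, which by the correctness theorem agree up to the truncation error of the finite difference.

-- Check first order correctness: the dual number computed by the macro
-- agrees with a finite difference approximation of the derivative, even
-- when the function is assembled with higher order combinators.
checkAt :: Double -> (Double, Double)
checkAt x = (adDeriv, fdDeriv)
  where
    f :: Num a => (a -> a) -> a -> a        -- a higher order combinator
    f g y = g (g y) + g y
    body :: Dual -> Dual
    body = f (\y -> y * y + sigmoid y)
    adDeriv = tangent (body (variable x))   -- forward AD derivative
    h       = 1e-6
    atD v   = primal (body (variable v))    -- evaluate the primal only
    fdDeriv = (atD (x + h) - atD (x - h)) / (2 * h)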
6.3. Correctness at all first order types, via manifolds.

We now generalize Theorem 1 to hold at all first order types, not just the reals. To do this, we need to define the derivative of a smooth map between the interpretations of first order types. We do this by recalling the well-known theory of manifolds and jet bundles.

For our purposes, a smooth manifold $M$ is a second-countable Hausdorff topological space together with a smooth atlas: an open cover $\mathcal{U}$ together with homeomorphisms $\big(\phi_U : U \to \mathbb{R}^{n(U)}\big)_{U \in \mathcal{U}}$ (called charts, or local coordinates) such that $\phi_U^{-1}; \phi_V$ is smooth on its domain of definition for all $U, V \in \mathcal{U}$. A function $f : M \to N$ between manifolds is smooth if $\phi_U^{-1}; f; \psi_V$ is smooth for all charts $\phi_U$ and $\psi_V$ of $M$ and $N$, respectively. Let us write $\mathbf{Man}$ for this category. Our manifolds are slightly unusual because different charts in an atlas may have different finite dimensions $n(U)$; thus we consider manifolds with dimensions that are potentially unbounded, albeit locally finite. This does not affect the theory of differential geometry as far as we need it here.

Each open subset of $\mathbb{R}^n$ can be regarded as a manifold. This lets us regard the category of manifolds $\mathbf{Man}$ as a full subcategory of the category of diffeological spaces. We consider a manifold $(X, \{\phi_U\}_U)$ as a diffeological space with the same carrier set $X$ and where the plots $\mathcal{P}^U_X$, called the manifold diffeology, are the smooth functions in $\mathbf{Man}(U, X)$. A function $X \to Y$ is smooth in the sense of manifolds if and only if it is smooth in the sense of diffeological spaces [30]. For the categorically minded reader, this means that we have a full embedding of $\mathbf{Man}$ into $\mathbf{Diff}$. Moreover, the natural interpretation of the first order fragment of our language in $\mathbf{Man}$ coincides with that in $\mathbf{Diff}$. That is, the embedding of $\mathbf{Man}$ into $\mathbf{Diff}$ preserves finite products and countable coproducts (hence initial algebras of polynomial endofunctors).
Proposition 3.
Suppose that a type $\tau$ is first order, i.e. it is built only from reals, products, variants, and lists (or, again, arbitrary inductive types), and not function types. Then the diffeological space $\llbracket \tau \rrbracket$ is a manifold.

Proof notes. This is proved by induction on the structure of types. In fact, one may show that every such $\llbracket \tau \rrbracket$ is isomorphic to a manifold of the form $\coprod_{i=1}^n \mathbb{R}^{d_i}$, where the bound $n$ is either finite or $\infty$, but this isomorphism is typically not an identity function.

The restriction to first order types is necessary: for example, the space $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket$ is not a manifold, by a Borsuk-Ulam argument (see Appx. A).

We recall how the Taylor representation of any morphism $f : M \to N$ of manifolds is given by its action on jets [38, Chapter IV]. For each point $x$ in a manifold $M$, define the $(k,R)$-jet space $J^{(k,R)}_x M$ to be the set $\{\gamma \in \mathbf{Man}(\mathbb{R}^k, M) \mid \gamma(0) = x\}/\!\sim$ of equivalence classes $[\gamma]$ of $k$-dimensional plots $\gamma$ in $M$ based at $x$, where we identify $\gamma_1 \sim \gamma_2$ iff all partial derivatives of order $\leq R$ coincide, in the sense that
\[
\frac{\partial^{\alpha_1 + \cdots + \alpha_k}(\gamma_1; f)}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(0) = \frac{\partial^{\alpha_1 + \cdots + \alpha_k}(\gamma_2; f)}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(0)
\]
for all smooth $f : M \to \mathbb{R}$ and all multi-indices $(\alpha_1, \ldots, \alpha_k)$ with $\alpha_1 + \cdots + \alpha_k \leq R$. The $(1,1)$-jet space is better known as a tangent space. The $(k,R)$-jet bundle (a.k.a. tangent bundle, in case $(k,R) = (1,1)$) of $M$ is the set $J^{(k,R)}(M) \stackrel{\mathrm{def}}{=} \coprod_{x \in M} J^{(k,R)}_x(M)$. The charts of $M$ equip $J^{(k,R)}(M)$ with a canonical manifold structure. The (manifold) diffeology of these jet bundles can be concisely summarized by the plots
\[
\mathcal{P}^U_{J^{(k,R)}(M)} = \big\{ f : U \to |J^{(k,R)}(M)| \ \big|\ \exists g \in \mathcal{P}^{U \times \mathbb{R}^k}_M.\ \forall u \in U.\ (g(u, 0), [v \mapsto g(u, v)]) = f(u) \big\}.
\]
Then $J^{(k,R)}$ acts on smooth maps $f : M \to N$: the map $J^{(k,R)}(f) : J^{(k,R)}(M) \to J^{(k,R)}(N)$ is defined as $J^{(k,R)}(f)(x, [\gamma]) \stackrel{\mathrm{def}}{=} (f(x), [\gamma; f])$. In local coordinates, this action $J^{(k,R)}(f)$ is seen to coincide precisely with the $(k,R)$-Taylor representation of $f$, given by the Faà di Bruno formula [50]. All told, the $(k,R)$-jet bundle is a functor $J^{(k,R)} : \mathbf{Man} \to \mathbf{Man}$ [38].
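For concreteness, the smallest interesting instance of this local-coordinate action, $(k,R) = (1,2)$, reads as follows; this is a standard calculus computation, included here only as a worked example.

% The (1,2)-jet action in local coordinates: a point of J^{(1,2)}(R) is
% a triple (x, v_1, v_2), recording the value and the first two
% derivatives at 0 of a representing curve
% \gamma(t) = x + v_1 t + v_2 t^2/2 + o(t^2). For smooth f : R -> R,
\[
  J^{(1,2)}(f)(x, v_1, v_2)
  \;=\;
  \bigl(\, f(x),\; f'(x)\, v_1,\; f''(x)\, v_1^{2} + f'(x)\, v_2 \,\bigr),
\]
% since (\gamma; f)'(0) = f'(x) v_1 and, by the order-2 Fa\`a di Bruno
% (chain) rule, (\gamma; f)''(0) = f''(x) v_1^2 + f'(x) v_2.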
We can understand the jet bundle of a composite space in terms of that of its parts.
Lemma 3.
There are canonical isomorphisms $J^{(k,R)}(\coprod_{i=1}^\infty M_i) \cong \coprod_{i=1}^\infty J^{(k,R)}(M_i)$ and $J^{(k,R)}(M_1 \times \cdots \times M_n) \cong J^{(k,R)}(M_1) \times \cdots \times J^{(k,R)}(M_n)$.

Proof notes. For disjoint unions, notice that smooth morphisms from $\mathbb{R}^k$ into a disjoint union of manifolds always factor over a single inclusion, because $\mathbb{R}^k$ is connected. For products, it is well known that partial derivatives of a morphism $(f_1, \ldots, f_n)$ are calculated component-wise [43, ex. 3-2].

We define a canonical isomorphism $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket} : \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\tau) \rrbracket \to J^{(k,R)}(\llbracket \tau \rrbracket)$ for every first order type $\tau$, by induction on the structure of types. We let $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \mathbf{real} \rrbracket} : \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\mathbf{real}) \rrbracket \to J^{(k,R)}(\llbracket \mathbf{real} \rrbracket)$ send a tuple of Taylor coefficients to the jet of the corresponding Taylor polynomial; in the case $(k,R) = (1,1)$, it is given by $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \mathbf{real} \rrbracket}(x, x') \stackrel{\mathrm{def}}{=} (x, [t \mapsto x + x' t])$. For the other types, we use Lemma 3. We can now phrase correctness at all first order types.

Theorem 2 (Semantic correctness of $\overrightarrow{\mathcal{D}}_{(k,R)}$ (full)). For any first order type $\tau$, any first order context $\Gamma$, and any term $\Gamma \vdash t : \tau$, the syntactic translation $\overrightarrow{\mathcal{D}}_{(k,R)}$ coincides with the $(k,R)$-jet bundle functor, modulo these canonical isomorphisms: the following diagram commutes.
\[
\begin{array}{ccc}
\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\Gamma) \rrbracket & \xrightarrow{\;\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket\;} & \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\tau) \rrbracket \\[2pt]
{\scriptstyle \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \Gamma \rrbracket}}\ \big\downarrow \cong & & \cong \big\downarrow\ {\scriptstyle \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket}} \\[2pt]
J^{(k,R)}(\llbracket \Gamma \rrbracket) & \xrightarrow{\;J^{(k,R)}(\llbracket t \rrbracket)\;} & J^{(k,R)}(\llbracket \tau \rrbracket)
\end{array}
\]

Proof notes.
For any $k$-dimensional plot $\gamma \in \mathbf{Man}(\mathbb{R}^k, M)$, let $\bar{\gamma} \in \mathbf{Man}(\mathbb{R}^k, J^{(k,R)}(M))$ be the $(k,R)$-jet curve, given by $\bar{\gamma}(x) = (\gamma(x), [t \mapsto \gamma(x + t)])$. First, we note that a smooth map $h : J^{(k,R)}(M) \to J^{(k,R)}(N)$ is of the form $J^{(k,R)}(g)$ for some $g : M \to N$ if for all smooth $\gamma : \mathbb{R}^k \to M$ we have $\bar{\gamma}; h = \overline{(\gamma; g)} : \mathbb{R}^k \to J^{(k,R)}(N)$. This generalizes (2.2). Second, for any first order type $\tau$, $S_{\llbracket \tau \rrbracket} = \{(f, \tilde{f}) \mid \tilde{f}; \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket} = \bar{f}\}$. This is shown by induction on the structure of types. We conclude the theorem from diagram (6.1), by putting these two observations together.

7. Discussion: What are derivatives of higher order functions?
In our gluing categories $\mathbf{Gl}_k$ of Section 6.2, we have avoided the question of what semantic derivatives should be associated with higher order functions. Our syntactic macro $\overrightarrow{\mathcal{D}}$ provides a specific derivative for every definable function, but in the model $\mathbf{Gl}_k$ there is only a relation between plots and their corresponding Taylor representations, and this relation is not necessarily single-valued. Our approach has been rather indifferent about what "the" correct derivative of a higher order function should be. Instead, all we have cared about is that we are using "a" derivative that is correct, in the sense that it can never be used to produce incorrect derivatives for first order functions, where we do have an unambiguous notion of correct derivative.

7.1. Automatic derivatives of higher order functions may not be unique!
According to our gluing semantics, a function $g : \llparenthesis \tau \rrparenthesis_2 \to \llparenthesis \sigma \rrparenthesis_2$ defines a correct $(k,R)$-Taylor representation of a function $f : \llparenthesis \tau \rrparenthesis_1 \to \llparenthesis \sigma \rrparenthesis_1$ iff $(f, g)$ defines a morphism $\llparenthesis \tau \rrparenthesis \to \llparenthesis \sigma \rrparenthesis$ in $\mathbf{Gl}_k$. In particular, there is no guarantee that every $f$ has a unique correct $(k,R)$-Taylor representation $g$. (Such Taylor representations are, in fact, unique when $\tau$ and $\sigma$ are first order types.)

For a concrete example showing that derivatives of higher order functions need not be unique in our framework, let us consider the case $(k,R) = (1,1)$ and focus on first derivatives of the evaluation function
\[
\mathrm{ev} : \mathbb{R} \to \llbracket (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrbracket = \mathbb{R} \to (\mathbb{R} \Rightarrow \mathbb{R}) \Rightarrow \mathbb{R};\quad r \mapsto (f \mapsto f(r)).
\]
The gluing relation $\llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$ in $\mathbf{Gl}_1$ relates curves $\gamma : \mathbb{R} \to (\mathbb{R} \Rightarrow \mathbb{R}) \Rightarrow \mathbb{R}$ to "tangent curves" $\gamma' : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R}$. In this relation, the function $\mathrm{ev}$ is related to at least two different tangent curves.

Lemma 4.
We have a smooth map
\[
\mathrm{sort} : (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R});\quad
f \mapsto \big((r, v) \mapsto \big(\pi_1(f(r, 0)),\ \pi_2(f(r, 0)) + v \cdot \nabla\big((-, 0); f; \pi_1\big)(r)\big)\big).
\]
Proof. Let $f \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}}$ and let $\gamma_1, \gamma_2 \in \mathcal{P}^U_{\mathbb{R}}$. Then $(u, r) \mapsto f(u)(r, 0) \in \mathcal{P}^{U \times \mathbb{R}}_{\mathbb{R} \times \mathbb{R}}$ by definition of the exponential in $\mathbf{Diff}$, so both $(u, r) \mapsto \pi_1(f(u)(r, 0))$ and $(u, r) \mapsto \pi_2(f(u)(r, 0))$ lie in $\mathbf{Man}(U \times \mathbb{R}, \mathbb{R})$. Therefore, we also have $u \mapsto \nabla\big((-, 0); f(u); \pi_1\big)(\gamma_1(u)) \in \mathbf{Man}(U, \mathbb{R}) = \mathcal{P}^U_{\mathbb{R}}$, as we are working with infinitely differentiable smooth maps. Consequently,
\[
u \mapsto (f; \mathrm{sort})(u)(\gamma_1(u), \gamma_2(u)) = \big(\pi_1(f(u)(\gamma_1(u), 0)),\ \pi_2(f(u)(\gamma_1(u), 0)) + \gamma_2(u) \cdot \nabla\big((-, 0); f(u); \pi_1\big)(\gamma_1(u))\big) \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R}},
\]
by definition of the product in $\mathbf{Diff}$. It follows that $(f; \mathrm{sort}) \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}}$.

This map is idempotent, and it converts any map $\mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R}$ into one that treats its tangent argument $v$ according to the dual-numbers representation of its first component. For example, $\mathrm{sort}(\mathrm{swap})$ is the map $(r, v) \mapsto (0, r)$, which ignores $v$ entirely, where $\mathrm{swap} : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R};\ (r, r') \mapsto (r', r)$.
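The following Haskell sketch may help in reading the definition of sort. It is our own illustration: the semantic gradient $\nabla$ is replaced by a central finite difference as a numerical stand-in (legitimate here only because all maps involved are smooth), and DualMap, gradFD, sortMap and swap' are hypothetical names.

-- A map R x R -> R x R as a Haskell function, and the map `sort`
-- from Lemma 4 with the derivative approximated numerically.
type DualMap = (Double, Double) -> (Double, Double)

gradFD :: (Double -> Double) -> Double -> Double
gradFD g r = (g (r + h) - g (r - h)) / (2 * h)  where h = 1e-6

sortMap :: DualMap -> DualMap
sortMap f = \(r, v) ->
  ( fst (f (r, 0))                               -- value at (r, 0)
  , snd (f (r, 0))                               -- base derivative part
      + v * gradFD (\r' -> fst (f (r', 0))) r )  -- dual-numbers part

swap' :: DualMap
swap' (r, r') = (r', r)
-- sortMap swap' (r, v) = (0, r) for all v, since fst (swap' (r, 0)) = 0
-- is constant and snd (swap' (r, 0)) = r.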
Lemma 5. We have that both $(\mathrm{ev}, \mathrm{ev}'_1) \in \llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$ and $(\mathrm{ev}, \mathrm{ev}'_2) \in \llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$, for
\[
\mathrm{ev}'_1 : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R};\quad a \mapsto (f \mapsto f(a, 1))
\]
\[
\mathrm{ev}'_2 : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R};\quad a \mapsto (f \mapsto (\mathrm{sort}\ f)(a, 1)).
\]
Proof.
We need to show that for any $(\gamma, \gamma') \in \llparenthesis \mathbf{real} \to \mathbf{real} \rrparenthesis$, we have $(x \mapsto \mathrm{ev}(x)(\gamma(x)),\ x \mapsto \mathrm{ev}'_i(x)(\gamma'(x))) \in \llparenthesis \mathbf{real} \rrparenthesis$, which means that $x \mapsto \mathrm{ev}'_i(x)(\gamma'(x)) = (x \mapsto \mathrm{ev}(x)(\gamma(x)),\ \nabla(x \mapsto \mathrm{ev}(x)(\gamma(x))))$, for $i = 1, 2$.

That is, we need to show this for any $\gamma : \mathbb{R} \to \mathbb{R} \Rightarrow \mathbb{R}$ and $\gamma' : \mathbb{R} \to \mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}$ such that for any $(\delta, \delta') \in \llparenthesis \mathbf{real} \rrparenthesis$ (which means that $\delta : \mathbb{R} \to \mathbb{R}$ is smooth and $\delta' = (\delta, \nabla\delta)$), we have $(r \mapsto \gamma(r)(\delta(r)),\ r \mapsto \gamma'(r)(\delta'(r))) \in \llparenthesis \mathbf{real} \rrparenthesis$, which means that $r \mapsto \gamma'(r)(\delta(r), \nabla\delta(r)) = (r \mapsto \gamma(r)(\delta(r)),\ \nabla(r \mapsto \gamma(r)(\delta(r))))$.

Focussing first on $\mathrm{ev}'_1$: we need to show that $x \mapsto \mathrm{ev}'_1(x)(\gamma'(x)) = (x \mapsto \mathrm{ev}(x)(\gamma(x)),\ \nabla(x \mapsto \mathrm{ev}(x)(\gamma(x))))$. Inlining the definition of $\mathrm{ev}'_1$: we need to show that $x \mapsto \gamma'(x)(x, 1) = (x \mapsto \gamma(x)(x),\ \nabla(x \mapsto \gamma(x)(x)))$. This follows by the assumption, choosing $\delta(r) = r$ (and hence $\delta'(r) = (r, 1)$).

Focussing next on $\mathrm{ev}'_2$: inlining the definitions of $\mathrm{ev}'_2$ and $\mathrm{sort}$, we need to show that
\[
x \mapsto \big(\pi_1(\gamma'(x)(x, 0)),\ \pi_2(\gamma'(x)(x, 0)) + \nabla\big((-, 0); \gamma'(x); \pi_1\big)(x)\big) = \big(x \mapsto \gamma(x)(x),\ \nabla(x \mapsto \gamma(x)(x))\big).
\]
By the assumption, instantiated at the constant curves $\delta(x') = r$ (and hence $\delta'(x') = (r, 0)$), we have $\pi_1(\gamma'(x)(r, 0)) = \gamma(x)(r)$ and $\pi_2(\gamma'(x)(r, 0)) = \nabla(x' \mapsto \gamma(x')(r))(x)$, for all $x$ and $r$. Writing $g(x, r) = \gamma(x)(r)$, the left hand side is therefore $x \mapsto (g(x, x),\ \partial_1 g(x, x) + \partial_2 g(x, x))$, which equals the right hand side by the chain rule.

Yet $\mathrm{ev}'_1 \neq \mathrm{ev}'_2$, as $\mathrm{ev}'_1(a)(\mathrm{swap}) = (1, a)$ while $\mathrm{ev}'_2(a)(\mathrm{swap}) = (0, a)$. Thus $\mathrm{ev}'_1$ and $\mathrm{ev}'_2$ are both "valid" semantic derivatives of the evaluation function $\mathrm{ev}$ in our framework. In particular, this shows that semantic derivatives of higher order functions need not be unique. Our macro $\overrightarrow{\mathcal{D}}$ will return $\mathrm{ev}'_1$, but everything would still work just as well if it instead returned $\mathrm{ev}'_2$.
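In the same sketch notation as before (reusing DualMap, sortMap and swap' from the listing after Lemma 4; ev1' and ev2' are our own names), the two tangent curves and their disagreement on swap can be observed directly.

-- The two tangent curves of Lemma 5. Both are valid in the gluing
-- semantics, yet they disagree on the argument swap', which is not in
-- the image of the AD macro.
ev1' :: Double -> DualMap -> (Double, Double)
ev1' a f = f (a, 1)

ev2' :: Double -> DualMap -> (Double, Double)
ev2' a f = sortMap f (a, 1)

-- ev1' 3 swap' == (1.0, 3.0)
-- ev2' 3 swap' == (0.0, 3.0)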
7.2. Canonical derivatives of higher order functions?

Differential geometers and analysts have long pursued notions of a canonical derivative of various higher order functions arising, for example, in the calculus of variations and in the study of infinite dimensional Lie groups [39]. Such an uncontroversial notion of derivative exists on various (infinite dimensional) spaces of functions that form suitable (so-called convenient) vector spaces, or manifolds locally modelled on such vector spaces. At the level of generality of diffeological spaces, however, various natural notions of derivative that coincide on convenient vector spaces start to diverge, and it is no longer clear what the best definition of a derivative is [15]. Another, fundamentally different setting that defines canonical derivatives of many higher order functions is given by synthetic differential geometry [37].

While derivatives of higher order functions are of deep interest and have rightly been studied in their own right in differential geometry, we believe the situation is subtly different in computer science:

(1) In programming applications, we use higher order programs only to construct the first order functions that we ultimately end up running and calculating derivatives of. Automatic differentiation methods can exploit this freedom: derivatives of higher order functions only matter in so far as they can be used to construct the correct derivatives of first order functions, so we can choose an easy and cheap notion of derivative among the valid options. As such, the fact that our semantics does not commit to a single notion of derivative of higher order functions can be seen as a feature, rather than a bug, that models the pragmatics of programming practice.

(2) While function spaces in differential geometry are typically infinite dimensional objects that are unsuitable for representation in the finite memory of a computer, higher order functions as used in programming are much more restricted: all they can do is call a function on finitely many arguments and analyse the function outputs. As such, function types in programming can be thought of as (locally) finite dimensional. In case a canonical notion of automatic derivative of higher order functions is really desired, it may be worth pursuing a more intensional notion of semantics, such as one based on game semantics. Such intensional techniques could capture the computational notion of higher order function better than our current (and other) extensional semantics using existing techniques from differential geometry. We hope that an exploration of such techniques might lead to an appropriate notion of computable derivative, even for higher order functions.
8. Discussion and future work
8.1. Summary.
We have shown that diffeological spaces provide a denotational semantics for a higher order language with variants and inductive types, and that this semantics supports correctness proofs of forward-mode AD macros, including the $(k,R)$-Taylor methods for computing higher order derivatives.

8.2. Connection to the state of the art in AD implementation.
As is common in denotational semantics research, we have here focused on an idealized language and simple translations to illustrate the main aspects of the method. There are a number of points where our approach is simplistic compared to the advanced current practice, as we now explain.
8.2.1. Representation of vectors.
In our examples we have treated $n$-vectors as tuples of length $n$. This style of programming does not scale to large $n$. A better solution would be to use array types, following [61]. As demonstrated by [14], our categorical semantics and correctness proofs straightforwardly extend to cover them, in a similar way to our treatment of lists. In fact, [14] formalizes our correctness arguments in Coq and extends them to apply to the system of [61].

8.2.2. Efficient forward-mode AD.
For AD to be useful, it must be fast. The $(1,1)$-AD macro $\overrightarrow{\mathcal{D}}_{(1,1)}$ that we use is the basis of an efficient AD library [61]. Numerous optimizations are needed, ranging from algebraic manipulations, to partial evaluations, to the use of an optimizing C compiler, but the resulting implementation is performant in experiments [61]. The Coq formalization [14] validates some of these manipulations using a semantics similar to ours. We believe the implementation of [61] can be extended to apply to the more general $(k,R)$-AD methods we have described in this paper through minor changes.
8.2.3. Reverse-mode and mixed-mode AD.
While forward-mode AD methods are useful, many applications require reverse-mode AD, or even mixed-mode AD, for efficiency. In [28], we described how our correctness proof applies to a continuation-based AD technique that closely resembles reverse-mode AD, but that only has the correct complexity under a non-standard operational semantics [11] (in particular, the linear factoring rule is crucial). It remains to be seen whether this technique and its correctness proof can be adapted to yield genuine reverse AD under a standard operational semantics. Alternatively, by relying on a variation of our techniques, [66] gives a correctness proof of a rather different $(1,1)$-reverse AD method. We leave for future work the extension of such techniques to $(k,R)$-reverse AD and mixed-mode AD for efficiently computing higher order derivatives.

8.2.4. Other language features.
The idealized languages that we have considered so far do not touch on several useful language constructs: for example, functions that are partial (such as division) or partly smooth (such as ReLU), phenomena such as iteration and recursion, and probabilities. Recent work by MV [65] shows how our analysis extends to $(k,R)$-forward mode AD of programs with iteration and recursion. We leave the analysis of AD of probabilistic programs for future work.

Acknowledgment
We have benefited from discussing this work with many people, including M. Betancourt, B. Carpenter, O. Kammar, C. Mak, L. Ong, B. Pearlmutter, G. Plotkin, A. Shaikhha, J. Sigal, and others. Our work is supported by the Royal Society and by a Facebook Research Award. In the course of this work, MV has also been employed at Oxford (EPSRC Project EP/M023974/1) and at Columbia in the Stan development team. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proc. OSDI 2016, pages 265–283, 2016.
[2] Martín Abadi and Gordon D. Plotkin. A simple differentiable programming language. In Proc. POPL 2020. ACM, 2020.
[3] Shun-ichi Amari. Differential-Geometrical Methods in Statistics, volume 28. Springer Science & Business Media, 2012.
[4] John Baez and Alexander Hoffnung. Convenient categories of smooth spaces. Transactions of the American Mathematical Society, 363(11):5789–5825, 2011.
[5] Gilles Barthe, Raphaëlle Crubillé, Ugo Dal Lago, and Francesco Gavazzo. On the versatility of open logical relations: Continuity, automatic differentiation, and a containment theorem. In Proc. ESOP 2020. Springer, 2020. To appear.
[6] Claus Bendtsen and Ole Stauning. FADBAD, a flexible C++ package for automatic differentiation. Technical Report IMM-REP-1996-17, Department of Mathematical Modelling, Technical University of Denmark, 1996.
[7] Claus Bendtsen and Ole Stauning. TADIFF, a flexible C++ package for automatic differentiation. Technical Report IMM-REP-1997-07, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, 1997.
[8] Gilbert Bernstein, Michael Mara, Tzu-Mao Li, Dougal Maclaurin, and Jonathan Ragan-Kelley. Differentiating a tensor language. arXiv preprint arXiv:2008.11256, 2020.
[9] Michael Betancourt. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592, 2018.
[10] Jesse Bettencourt, Matthew J. Johnson, and David Duvenaud. Taylor-mode automatic differentiation for higher-order derivatives in JAX. 2019.
[11] Aloïs Brunel, Damiano Mazza, and Michele Pagani. Backpropagation in the simply typed lambda-calculus with linear negation. In Proc. POPL 2020, 2020.
[12] Bob Carpenter, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.
[13] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[14] Curtis Chin Jen Sem. Formalized correctness proofs of automatic differentiation in Coq. Master's Thesis, Utrecht University, 2020. Thesis: https://dspace.library.uu.nl/handle/1874/400790. Coq code: https://github.com/crtschin/thesis.
[15] J. Daniel Christensen and Enxin Wu. Tangent spaces and tangent bundles for diffeological spaces. arXiv preprint arXiv:1411.5425, 2014.
[16] J. Robin B. Cockett, Geoff S. H. Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon D. Plotkin, and Dorette Pronk. Reverse derivative categories. In Proc. CSL 2020, 2020.
[17] G. Constantine and T. Savits. A multivariate Faà di Bruno formula with applications. Transactions of the American Mathematical Society, 348(2):503–520, 1996.
[18] Geoff Cruttwell, Jonathan Gallagher, and Ben MacAdam. Towards formalizing and extending differential programming using tangent categories. In Proc. ACT 2019, 2019.
[19] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[20] Thomas Ehrhard and Laurent Regnier. The differential lambda-calculus. Theoretical Computer Science, 309(1-3):1–41, 2003.
[21] Conal Elliott. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):70, 2018.
[22] L. Hernández Encinas and J. Muñoz Masqué. A short proof of the generalized Faà di Bruno's formula. Applied Mathematics Letters, 16(6):975–979, 2003.
[23] Brendan Fong, David Spivak, and Rémy Tuyéras. Backprop as functor: A compositional perspective on supervised learning. In Proc. LICS 2019, pages 1–13. IEEE, 2019.
[24] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. Systems for Machine Learning, 2018.
[25] Izrail Moiseevitch Gelfand, Richard A. Silverman, et al. Calculus of Variations. Courier Corporation, 2000.
[26] Andreas Griewank, Jean Utke, and Andrea Walther. Evaluating higher derivative tensors by forward propagation of univariate Taylor series. Mathematics of Computation, 69(231):1117–1130, 2000.
[27] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
[28] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. In Proc. FoSSaCS 2020, pages 319–338, 2020.
[29] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. Full version, 2020. arXiv:2001.02209.
[30] Patrick Iglesias-Zemmour. Diffeology. American Mathematical Society, 2013.
[31] Bart Jacobs and Jan J. M. M. Rutten. An introduction to (co)algebras and (co)induction. In Advanced Topics in Bisimulation and Coinduction, pages 38–99. Cambridge University Press, 2011.
[32] Peter T. Johnstone, Stephen Lack, and Paweł Sobociński. Quasitoposes, quasiadhesive categories and Artin glueing. In Proc. CALCO 2007, 2007.
[33] Jerzy Karczmarczuk. Functional differentiation of computer programs. Higher-Order and Symbolic Computation, 14(1):35–57, 2001.
[34] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[35] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[36] Dana A. Knoll and David E. Keyes. Jacobian-free Newton–Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2):357–397, 2004.
[37] Anders Kock. Synthetic Differential Geometry, volume 333. Cambridge University Press, 2006.
[38] Ivan Kolář, Jan Slovák, and Peter W. Michor. Natural Operations in Differential Geometry. 1999.
[39] Andreas Kriegl and Peter W. Michor. The Convenient Setting of Global Analysis, volume 53. American Mathematical Society, 1997.
[40] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
[41] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. Advances in Neural Information Processing Systems, 31:2750–2759, 2018.
[42] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. A simple and efficient tensor calculus. In AAAI, pages 4527–4534, 2020.
[43] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.
[44] Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions. arXiv preprint arXiv:2006.06903, 2020.
[45] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[46] Carol Mak and Luke Ong. A differential-form pullback programming language for higher-order reverse-mode automatic differentiation. arXiv:2002.08241, 2020.
[47] Oleksandr Manzyuk. A simply typed λ-calculus of forward automatic differentiation. In Proc. MFPS 2012, 2012.
[48] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
[49] Damiano Mazza and Michele Pagani. Automatic differentiation in PCF. arXiv preprint arXiv:2011.03335, 2020.
[50] Joël Merker. Four explicit formulas for the prolongations of an infinitesimal Lie symmetry and multivariate Faà di Bruno formulas. arXiv preprint math/0411650, 2004.
[51] John C. Mitchell and Andre Scedrov. Notes on sconing and relators. In International Workshop on Computer Science Logic, pages 352–378. Springer, 1992.
[52] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall / CRC Press, 2011.
[53] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[54] Barak A. Pearlmutter and Jeffrey Mark Siskind. Lazy multivariate higher-order forward-mode AD. ACM SIGPLAN Notices, 42(1):155–160, 2007.
[55] Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):7, 2008.
[56] Andrew M. Pitts. Categorical logic. Technical report, University of Cambridge, Computer Laboratory, 1995.
[57] Gordon D. Plotkin. Some principles of differential programming languages. Invited talk, POPL 2018, 2018.
[58] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
[59] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[60] Thomas H. Savits. Some statistical applications of Faà di Bruno. Journal of Multivariate Analysis, 97(10):2131–2140, 2006.
[61] Amir Shaikhha, Andrew Fitzgibbon, Dimitrios Vytiniotis, and Simon Peyton Jones. Efficient differentiable programming in a functional array-processing language. Proceedings of the ACM on Programming Languages, 3(ICFP):97, 2019.
[62] Benjamin Sherman, Jesse Michel, and Michael Carbin. λS: Computable semantics for differentiable programming with higher-order functions and datatypes. arXiv preprint arXiv:2007.08017, 2020.
[63] Jean-Marie Souriau. Groupes différentiels. In Differential Geometrical Methods in Mathematical Physics, pages 91–128. Springer, 1980.
[64] Andrew Stacey. Comparative smootheology. Theory and Applications of Categories, 25(4):64–117, 2011.
[65] Matthijs Vákár. Denotational correctness of forward-mode automatic differentiation for iteration and recursion. arXiv preprint arXiv:2007.05282, 2020.
[66] Matthijs Vákár. Reverse AD at higher types: Pure, principled and denotationally correct. In Proc. ESOP 2021. Springer, 2021. To appear.
[67] Bart van Merriënboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. Automatic differentiation in ML: Where we are and where we should be going. In Advances in Neural Information Processing Systems, pages 8757–8767, 2018.
[68] Fei Wang, Xilun Wu, Gregory Essertel, James Decker, and Tiark Rompf. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. Proceedings of the ACM on Programming Languages, 3(ICFP), 2019.
[69] Mu Wang, Assefaw Gebremedhin, and Alex Pothen. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Mathematical Programming Computation, 8(4):393–433, 2016.
[70] Shaopeng Zhu, Shih-Han Hung, Shouvanik Chakrabarti, and Xiaodi Wu. On the principles of differentiable quantum programming languages. arXiv preprint arXiv:2004.01122, 2020.
Appendix A. CartSp and Man are not cartesian closed categories
Lemma 6.
There is no continuous injection $\mathbb{R}^{d+1} \to \mathbb{R}^d$.

Proof. If there were, it would restrict to a continuous injection $S^d \to \mathbb{R}^d$. The Borsuk-Ulam theorem, however, tells us that every continuous $f : S^d \to \mathbb{R}^d$ has some $x \in S^d$ such that $f(x) = f(-x)$, which is a contradiction.

Let us define the terms
\[
x_1 : \mathbf{real}, \ldots, x_n : \mathbf{real} \vdash t_n \stackrel{\mathrm{def}}{=} \lambda y.\ x_1 + x_2 * y + \cdots + x_n * y^{n-1} : \mathbf{real} \to \mathbf{real}.
\]
Assuming that $\mathbf{CartSp}$ (respectively $\mathbf{Man}$) is cartesian closed, observe that these terms get interpreted as injective continuous (because smooth) functions $\mathbb{R}^n \to \llbracket \mathbf{real} \to \mathbf{real} \rrbracket$ in $\mathbf{CartSp}$ and $\mathbf{Man}$: a polynomial function is uniquely determined by its coefficients.

Theorem 3. $\mathbf{CartSp}$ is not cartesian closed.

Proof. In case $\mathbf{CartSp}$ were cartesian closed, we would have $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket = \mathbb{R}^d$ for some $d$. Then we would get, in particular, a continuous injection $\llbracket t_{d+1} \rrbracket : \mathbb{R}^{d+1} \to \mathbb{R}^d$, which contradicts Lemma 6.

Theorem 4. $\mathbf{Man}$ is not cartesian closed.

Proof. Observe that we have $\iota_n : \mathbb{R}^n \to \mathbb{R}^{n+1};\ \langle a_1, \ldots, a_n \rangle \mapsto \langle a_1, \ldots, a_n, 0 \rangle$ and that $\iota_n; \llbracket t_{n+1} \rrbracket = \llbracket t_n \rrbracket$. Let us write $A_n$ for the image of $\llbracket t_n \rrbracket$ and $A = \bigcup_{n \in \mathbb{N}} A_n$. Then $A_n$ is connected because it is the continuous image of a connected set. Similarly, $A$ is connected because it is a non-disjoint union of connected sets. This means that $A$ lies in a single connected component of $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket$, which is a manifold of some finite dimension, say $d$. Take some $x \in \mathbb{R}^{d+1}$ (say, $0$), take some open $d$-ball $U$ around $\llbracket t_{d+1} \rrbracket(x)$, and take some open $(d+1)$-ball $V$ around $x$ in $\llbracket t_{d+1} \rrbracket^{-1}(U)$. Then $\llbracket t_{d+1} \rrbracket$ restricts to a continuous injection from $V$ to $U$, or equivalently, from $\mathbb{R}^{d+1}$ to $\mathbb{R}^d$, which contradicts Lemma 6.

This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.