Higher Order Automatic Differentiation of Higher Order Functions
MATHIEU HUOT, SAM STATON, AND MATTHIJS VÁKÁR
University of Oxford, University of Oxford, Utrecht University
Abstract.
We present semantic correctness proofs of automatic differentiation (AD). We consider a forward-mode AD method on a higher order language with algebraic data types, and we characterise it as the unique structure preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming, based on diffeological spaces. We show that it interprets our language, and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is, in essence, a logical relations argument. Throughout, we show how the analysis extends to AD methods for computing higher order derivatives using a Taylor approximation.

1. Introduction
Automatic differentiation (AD), loosely speaking, is the process of taking a program describing a function, and constructing the derivative of that function by applying the chain rule across the program code. As gradients play a central role in many aspects of machine learning, so too do automatic differentiation systems such as TensorFlow [1], PyTorch [53] or Stan [12].

[Figure 1 (overview of semantics/correctness of AD): a commuting square in which Programs map to Programs along the top via automatic differentiation, Differential geometry maps to Differential geometry along the bottom via mathematical differentiation, and the vertical arrows are denotational semantics.]

Differentiation has a well developed mathematical theory in terms of differential geometry. The aim of this paper is to formalize this connection between differential geometry and the syntactic operations of AD, particularly for AD methods that calculate higher order derivatives. In this way we
Key words and phrases: automatic differentiation, software correctness, denotational semantics.
∗ This paper provides an extended version of [28], extending it with proofs and a novel discussion of the semantics and correctness of automatic differentiation methods for computing higher order derivatives.
achieve two things: (1) a compositional, denotational understanding of differentiable programming and AD; (2) an explanation of the correctness of AD.

This intuitive correspondence (summarized in Fig. 1) is in fact rather complicated. In this paper, we focus on resolving the following problem: higher order functions play a key role in programming, and yet they have no counterpart in traditional differential geometry. Moreover, we resolve this problem while retaining the compositionality of denotational semantics.

1.0.1. Higher order functions and differentiation.
A major application of higher order functions is to support disciplined code reuse. Code reuse is particularly acute in machine learning. For example, a multi-layer neural network might be built of millions of near-identical neurons, as follows.

neuronₙ : (realⁿ ∗ (realⁿ ∗ real)) → real
neuronₙ := λ⟨x, ⟨w, b⟩⟩. ς(w · x + b)

layerₙ : ((τ ∗ P) → τ) → (τ ∗ Pⁿ) → τⁿ
layerₙ := λf. λ⟨x, ⟨p₁, ..., pₙ⟩⟩. ⟨f⟨x, p₁⟩, ..., f⟨x, pₙ⟩⟩

comp : (((τ ∗ P) → τ) ∗ ((τ ∗ Q) → τ)) → (τ ∗ (P ∗ Q)) → τ
comp := λ⟨f, g⟩. λ⟨x, ⟨p, q⟩⟩. g⟨f⟨x, p⟩, q⟩

(Here ς(x) := 1/(1 + e^{−x}) is the sigmoid function.) We can use these functions to build a network as follows (see also Fig. 2):

comp⟨layerₘ(neuronₖ), comp⟨layerₙ(neuronₘ), neuronₙ⟩⟩ : (realᵏ ∗ P) → real     (1.1)

[Figure 2: The network in (1.1), with k inputs and two hidden layers.]

Here P ≅ realᵖ with p = m(k+1) + n(m+1) + n + 1. This program (1.1) describes a smooth (infinitely differentiable) function. The goal of automatic differentiation is to find its derivative.

If we β-reduce all the λ's, we end up with a very long function expression just built from the sigmoid function and linear algebra. We can then find a program for calculating its derivative by applying the chain rule. However, automatic differentiation can also be expressed without first β-reducing, in a compositional way, by explaining how higher order functions like (layer) and (comp) propagate derivatives. This paper is a semantic analysis of this compositional approach.

The general idea of denotational semantics is to interpret types as spaces and programs as functions between the spaces. In this paper, we propose to use diffeological spaces and smooth functions [63, 30] to this end. These satisfy the following three desiderata:
• ℝ is a space, and the smooth functions ℝ → ℝ are exactly the functions that are infinitely differentiable;
• the set of smooth functions X → Y between spaces again forms a space, so we can interpret function types;
• the disjoint union of a sequence of spaces again forms a space, so we can interpret variant types and inductive types.
We emphasise that the most standard formulation of differential geometry, using manifolds, does not support spaces of functions. Diffeological spaces seem to us the simplest notion of space that satisfies these conditions, but there are other candidates [4, 64]. A diffeological space is in particular a set X equipped with a chosen set of curves C_X ⊆ X^ℝ, and a smooth map f : X → Y must be such that if γ ∈ C_X then γ; f ∈ C_Y. This is reminiscent of the method of logical relations.

1.0.2. From smoothness to automatic derivatives at higher types.
Our denotational semantics in diffeological spaces guarantees that all definable functions are smooth. But we need more than just to know that a definable function happens to have a mathematical derivative: we need to be able to find that derivative.

In this paper we focus on forward mode automatic differentiation methods for computing higher derivatives, which are macro translations on syntax (called −→D, defined in §3). To phrase and prove their correctness, we introduce a semantic model of glued spaces (§6): triples (X, X′, S) where, intuitively, X is a space of inhabitants of the type, X′ is a space serving as a chosen bundle of tangents (or jets, in the case of higher order derivatives) over X, and S ⊆ X^ℝ × X′^ℝ is a binary relation between curves, informally relating curves in X with their tangent (resp. jet) curves in X′. This new model gives a denotational semantics for higher order automatic differentiation.

1.0.3. Related work and context.
AD has a long history and has many implementations. AD was perhaps first phrased in a functional setting in [55], and there are now a number of teams working on AD in the functional setting (e.g. [68, 61, 21]), some providing efficient implementations. Although that work does not involve formal semantics, it is inspired by intuitions from differential geometry and category theory.

This paper adds to a very recent body of work on verified automatic differentiation. In the first order setting, there are recent accounts based on denotational semantics in manifolds [23, 44] and based on synthetic differential geometry [18], work making a categorical abstraction [16] and work connecting operational semantics with denotational semantics [2, 57], as well as work focussing on how to correctly differentiate programs that operate on tensors [8] and programs that make use of quantum computing [70]. Recently there has also been significant progress at higher types. The work of Brunel et al. [11] and Mazza and Pagani [49] give formal correctness proofs for reverse-mode derivatives on a linear λ-calculus with a particular operational semantics. The work of Barthe et al. [5] provides a general discussion of some new syntactic logical relations arguments, including one very similar to our syntactic proof of Theorem 1. Sherman et al. [62] discuss a differential programming technique that works at higher types, based on exact real arithmetic, and relate it to a computable semantics. We understand that the authors of [18] are working on higher types. Vákár [66] phrases and proves correct a reverse mode AD technique on a higher order language, based on a source-code transformation on a standard λ-calculus.
The differential λ-calculus [20] is related to AD, and explicit connections are made in [46, 47]. One difference is that the differential λ-calculus allows the addition of terms at all types, and hence vector space models are suitable, but this appears peculiar with the variant and inductive types that we consider here.

This paper builds on our previous work [28, 65] in which we gave denotational correctness proofs for forward mode AD algorithms for computing first derivatives. Here, we explain how these techniques extend to methods that calculate higher derivatives.

The idea of directly calculating higher order derivatives by using automatic differentiation methods that work with Taylor approximations (also known as jets in differential geometry) is well-known [26] and it has recently gained renewed interest [9, 10]. So far, such "Taylor-mode AD" methods have only been applied to first order functional languages, however. This paper shows how to extend these higher order AD methods to languages with support for higher order functions and algebraic data types.

The two main methods for implementing AD are operator overloading and source code transformation, the method used in this paper [67]. Taylor-mode AD has been seen to be significantly faster than iterated AD in the context of operator overloading [10] in JAX [24]. There are other notable implementations of forward Taylor-mode [6, 7, 33, 54, 69], some of them in functional languages [33, 54]. Taylor-mode implementations use the rich algebraic structure of derivatives to avoid and share much of the redundant computation that occurs with iterated first order methods. Perhaps the simplest example of this is the sin function, whose iterated derivatives only involve sin, cos, and negation. Importantly, most AD tools have the right complexity up to a constant factor, but this constant is quite important in practice, and Taylor-mode helps achieve better performance. Another striking result for a version of Taylor-mode was achieved in [41], where a performance gain of up to two orders of magnitude was obtained for computing certain Hessian-vector products using Ricci calculus; in essence, the algorithm used is a mixed-mode one that is derived via jets in [9]. This is further improved in [42]. Taylor-mode can also be useful for ODE solvers and hence will be important for neural differential equations [13].

Finally, we emphasise that we have chosen the neural network (1.1) as our running example mainly for its simplicity. There are many other examples of AD outside the neural networks literature: AD is useful whenever derivatives need to be calculated on high dimensional spaces. This includes optimization problems more generally, where the derivative is passed to a gradient descent method (e.g. [59, 34, 58, 35, 19, 45]). Optimization problems involving higher order functions naturally show up in the calculus of variations and its applications in physics, where one typically looks for a function minimizing a certain integral [25]. Other applications of AD are in advanced integration methods, since derivatives play a role in Hamiltonian Monte Carlo [52, 27] and variational inference [40]. Second order methods for gradient descent have also been extensively studied. As the basic second order Newton method requires inverting a high dimensional Hessian matrix, several alternatives and approximations have been proposed.
Some of them still require Taylor-like modes of differentiation and a matrix-vector product where the matrix resembles the Hessian or inverse Hessian [36, 48, 3].

1.0.4. Summary of contributions.
We have provided a semantic analysis of higher order automatic differentiation. Our syntactic starting point is the family of higher order forward-mode AD macros on a typed higher order language that extend their well-known first order equivalents
(e.g. [61, 68, 28]). We present these in §3.
• We give a denotational semantics for the language in diffeological spaces, showing that every definable expression is smooth (§4).
• We show correctness of the higher order AD macros by a logical relations argument (Th. 1).
• We give a categorical analysis of this correctness argument with two parts: canonicity of the macro in terms of syntactic categories, and a new notion of glued space that abstracts the logical relation (§6).
• We then use this analysis to state and prove a correctness argument at all first order types (Th. 2).
2. Rudiments of differentiation: how to calculate with dual numbers and Taylor approximations
2.1. First order differentiation: the chain rule and dual numbers.
Recall that the derivative of a function f : ℝ → ℝ, if it exists, is a function ∇f : ℝ → ℝ such that, for all a, ∇f(a) is the gradient of f at a, in the sense that the function x ↦ f(a) + ∇f(a) · (x − a) gives the best linear approximation of f at a. (The gradient ∇f(a) is often written df(x)/dx (a).)

The chain rule for differentiation tells us that we can calculate ∇(f; g)(a) = ∇f(a) · ∇g(f(a)). In that sense, the chain rule tells us how linear approximations to a function transform under post-composition with another function.

To find ∇f in a compositional way, using the chain rule, two generalizations are reasonable:
• We need both f and ∇f when calculating ∇(f; g) of a composition f; g, using the chain rule, so we are really interested in the pair (f, ∇f) : ℝ → ℝ × ℝ;
• In building f we will need to consider functions of multiple arguments, such as + : ℝ² → ℝ, and these functions should propagate derivatives.

Thus we are more generally interested in transforming a function g : ℝⁿ → ℝ into a function h : (ℝ × ℝ)ⁿ → ℝ × ℝ in such a way that for any f₁, ..., fₙ : ℝ → ℝ,

\[
((f_1, \nabla f_1), \ldots, (f_n, \nabla f_n)); h = ((f_1, \ldots, f_n); g,\; \nabla((f_1, \ldots, f_n); g)). \tag{2.1}
\]

An intuition for h is often given in terms of dual numbers. The transformed function operates on pairs of numbers, (x, x′), and it is common to think of such a pair as x + x′ε for an 'infinitesimal' ε. But while this is a helpful intuition, the formalization of infinitesimals can be intricate, and the development in this paper is focussed on the elementary formulation in (2.1).

A function h satisfying (2.1) encodes all the partial derivatives of g. For example, if g : ℝ² → ℝ, then with f₁(x) := x and f₂(x) := x₂ constant, by applying (2.1) to x₁ we obtain h(x₁, 1, x₂, 0) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁)), and similarly h(x₁, 0, x₂, 1) = (g(x₁, x₂), ∂g(x₁, x)/∂x (x₂)). And conversely, if g is differentiable in each argument, then a unique h satisfying (2.1) can be found by taking linear combinations of partial derivatives, for example:

h(x₁, x₁′, x₂, x₂′) = (g(x₁, x₂), x₁′ · ∂g(x, x₂)/∂x (x₁) + x₂′ · ∂g(x₁, x)/∂x (x₂)).

(Here, recall that the partial derivative ∂g(x, x₂)/∂x (x₁) is a notation for the gradient ∇(g(−, x₂))(x₁), i.e. with x₂ fixed.)
In summary, the idea of differentiation with dual numbers is to transform a differentiable function g : ℝⁿ → ℝ into a function h : (ℝ²)ⁿ → ℝ² which captures g and all its partial derivatives. We packaged this up in (2.1) as an invariant which is useful for building derivatives of compound functions ℝ → ℝ in a compositional way. The idea of (first order) forward mode automatic differentiation is to perform this transformation at the source code level.
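To make the dual-number transformation concrete, here is a minimal sketch in Haskell. It is our own illustration, not code from the paper; the names Dual, sigmoid and deriv are ours. Implementing h once for each primitive operation (via a Num instance) means that any composite program built from those primitives automatically satisfies the invariant (2.1).

```haskell
-- A dual number pairs a value with its derivative along a curve.
data Dual = Dual { primal :: Double, tangent :: Double } deriving Show

instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)  -- product rule
  negate (Dual x x')    = Dual (negate x) (negate x')
  fromInteger n         = Dual (fromInteger n) 0          -- constants have derivative 0
  abs (Dual x x')       = Dual (abs x) (x' * signum x)
  signum (Dual x _)     = Dual (signum x) 0

-- The sigmoid, propagating its derivative ς(x)(1 − ς(x)) by the chain rule.
sigmoid :: Dual -> Dual
sigmoid (Dual x x') = let y = 1 / (1 + exp (-x)) in Dual y (y * (1 - y) * x')

-- Derivative of f at a: seed the input with tangent 1.
deriv :: (Dual -> Dual) -> Double -> Double
deriv f a = tangent (f (Dual a 1))

main :: IO ()
main = print (deriv (\x -> sigmoid (x * x + 3)) 0.5)  -- d/dx ς(x² + 3) at 0.5
```

Seeding the input with tangent 1 corresponds exactly to taking f₁ = id in (2.1).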
Smooth functions. In what follows we will often speak of smooth functions ℝᵏ → ℝ, which are functions that are continuous and differentiable, such that their derivatives are also continuous and differentiable, and so on.

2.2. Higher order differentiation: the Faà di Bruno formula and Taylor approximations.
We now generalize the above in two directions:
• We look for the best local approximations to f with polynomials of some order R, generalizing the above use of linear functions (R = 1).
• We work directly with multivariate functions ℝᵏ → ℝ instead of functions of one variable ℝ → ℝ (k = 1).

To make this precise, we recall that, given a smooth function f : ℝᵏ → ℝ and a natural number R ≥ 0, the R-th order Taylor approximation of f at a ∈ ℝᵏ is defined in terms of the partial derivatives of f:

\[
x \mapsto \sum_{\{(\alpha_1,\ldots,\alpha_k)\in\mathbb{N}^k \mid \alpha_1+\ldots+\alpha_k\le R\}} \frac{1}{\alpha_1!\cdots\alpha_k!}\, \frac{\partial^{\alpha_1+\ldots+\alpha_k} f(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a)\cdot (x_1-a_1)^{\alpha_1}\cdots(x_k-a_k)^{\alpha_k}.
\]

This is an R-th order polynomial. We can again recover the partial derivatives of f up to the R-th order from its Taylor approximation by evaluating the series at basis vectors.

Recall that the ordering of partial derivatives does not matter for smooth functions (Schwarz/Clairaut's theorem). So there will be $\binom{R+k-1}{k-1}$ R-th order partial derivatives, and altogether there are $\binom{R+k}{k}$ summands in the R-th order Taylor approximation. (This can be seen by a 'stars-and-bars' argument.)

Since there are $\binom{R+k}{k}$ partial derivatives of f of order ≤ R (often called (k,R)-velocities), we can store them in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$, which can also be regarded as the space of k-variate polynomials of degree ≤ R.

We will use a convention of coordinates $(y_{\alpha_1\ldots\alpha_k}\in\mathbb{R})$ for $(\alpha_1,\ldots,\alpha_k)$ ranging over $\{(\alpha_1,\ldots,\alpha_k)\in\mathbb{N}^k \mid 0\le\alpha_1+\ldots+\alpha_k\le R\}$, where $y_{\alpha_1\ldots\alpha_k}$ is intended to represent a partial derivative $\frac{\partial^{\alpha_1+\ldots+\alpha_k} f}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a)$ for some function f : ℝᵏ → ℝ. We will choose these coordinates in lexicographic order of the multi-indices (α₁, ..., α_k); that is, the indexes in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$ will typically range from (0, ..., 0) to (R, 0, ..., 0).
The (k, R)-Taylor representation of a function g : ℝⁿ → ℝ is a function $h : \big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ that transforms the partial derivatives of order ≤ R of any smooth f : ℝᵏ → ℝⁿ under postcomposition with g:

\[
\left(\left(\frac{\partial^{\alpha_1+\ldots+\alpha_k} f_j(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}\right)_{j=1}^{n} ; h \;=\; \left(\frac{\partial^{\alpha_1+\ldots+\alpha_k}((f_1,\ldots,f_n);g)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}. \tag{2.2}
\]

Thus the Taylor representation generalizes the dual numbers representation (R = k = 1).

To explicitly calculate the Taylor representation of a smooth function, we recall a generalization of the chain rule to higher derivatives. The chain rule tells us how the coefficients of linear approximations transform under composition of functions. The Faà di Bruno formula [60, 22, 17] tells us how coefficients of Taylor approximations, that is, higher derivatives, transform under composition. We recall the multivariate form from [60, Theorem 2.1]. Given functions f = (f₁, ..., f_l) : ℝᵏ → ℝˡ and g : ℝˡ → ℝ, for α₁ + ... + α_k > 0:

\[
\frac{\partial^{\alpha_1+\ldots+\alpha_k}(f;g)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}(a) = \alpha_1!\cdots\alpha_k! \!\!\sum_{\substack{(\beta_1,\ldots,\beta_l)\in\mathbb{N}^l\\ 1\le\beta_1+\ldots+\beta_l\le\alpha_1+\ldots+\alpha_k}}\!\! \frac{\partial^{\beta_1+\ldots+\beta_l} g(y)}{\partial y_1^{\beta_1}\cdots\partial y_l^{\beta_l}}(f(a)) \cdot \!\!\sum_{\substack{((e_{11},\ldots,e_{1l}),\ldots,(e_{q1},\ldots,e_{ql}))\in(\mathbb{N}^l)^q\\ e_{1j}+\ldots+e_{qj}=\beta_j,\\ (e_{11}+\ldots+e_{1l})\cdot\alpha_{1i}+\ldots+(e_{q1}+\ldots+e_{ql})\cdot\alpha_{qi}=\alpha_i}} \prod_{r=1}^{q}\prod_{j=1}^{l} \frac{1}{e_{rj}!}\left(\frac{1}{\alpha_{r1}!\cdots\alpha_{rk}!}\, \frac{\partial^{\alpha_{r1}+\cdots+\alpha_{rk}} f_j(x)}{\partial x_1^{\alpha_{r1}}\cdots\partial x_k^{\alpha_{rk}}}(a)\right)^{e_{rj}},
\]

where (α₁₁, ..., α₁ₖ), ..., (α_{q1}, ..., α_{qk}) ∈ ℕᵏ are an enumeration of all the vectors (α_{r1}, ..., α_{rk}) of k natural numbers such that α_{rj} ≤ α_j and α_{r1} + ... + α_{rk} > 0; we write q for the number of such vectors. The details of this formula reflect the complicated combinatorics that arise from repeated applications of the chain and product rules for differentiation that one uses to prove it. Conceptually, however, it is rather straightforward: it tells us that the coefficients of the R-th order Taylor approximation of f; g can be expressed exclusively in terms of those of f and g.

Thus the Faà di Bruno formula uniquely determines the Taylor representation $h : \big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ in terms of the derivatives of g : ℝⁿ → ℝ of order ≤ R, and we can also recover all such derivatives from h.

2.3. Example: a two-dimensional second order Taylor series.
As an example, we can specialize the Faà di Bruno formula above to the second order Taylor series of a function f : ℝ² → ℝˡ and its behaviour under postcomposition with a smooth function g : ℝˡ → ℝ:

\[
\frac{\partial^2 (f;g)(x)}{\partial x_i\,\partial x_{i'}}(a) = \sum_{j=1}^{l} \frac{\partial g(y)}{\partial y_j}(f(a))\, \frac{\partial^2 f_j(x)}{\partial x_i\,\partial x_{i'}}(a) + \sum_{j,j'=1}^{l} \frac{\partial^2 g(y)}{\partial y_j\,\partial y_{j'}}(f(a))\, \frac{\partial f_{j'}(x)}{\partial x_i}(a)\, \frac{\partial f_j(x)}{\partial x_{i'}}(a),
\]

where i, i′ ∈ {1, 2} might either coincide or be distinct.
Rather than working with the full (2,2)-Taylor representation of g, we ignore the non-mixed second order derivatives y_j^{(2,0)} = ∂²f_j(x)/∂x₁² and y_j^{(0,2)} = ∂²f_j(x)/∂x₂² for the moment, and we represent the derivatives of order ≤ 1 together with the mixed second order derivative of f_j : ℝ² → ℝ (at some point a) as the numbers

(y_j, y_j^{(1,0)}, y_j^{(0,1)}, y_j^{(1,1)}) = (f_j(a), ∂f_j(x)/∂x₁ (a), ∂f_j(x)/∂x₂ (a), ∂²f_j(x)/∂x₁∂x₂ (a)) ∈ ℝ⁴,

and we can choose a similar representation for the derivatives of (f; g). Then, we observe that the Faà di Bruno formula induces the function h : (ℝ⁴)ˡ → ℝ⁴:

h((y₁, y₁^{(1,0)}, y₁^{(0,1)}, y₁^{(1,1)}), ..., (y_l, y_l^{(1,0)}, y_l^{(0,1)}, y_l^{(1,1)})) =
  ( g(y₁, ..., y_l),
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,0)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(0,1)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,1)} + Σⱼ,ⱼ′₌₁ˡ ∂²g/∂yⱼ∂yⱼ′ (y₁, ..., y_l) · yⱼ^{(1,0)} · yⱼ′^{(0,1)} ).

In particular, we can note that

h((y₁, y₁^{(1,0)}, y₁^{(0,1)}, 0), ..., (y_l, y_l^{(1,0)}, y_l^{(0,1)}, 0)) =
  ( g(y₁, ..., y_l),
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(1,0)},
    Σⱼ₌₁ˡ ∂g/∂yⱼ (y₁, ..., y_l) · yⱼ^{(0,1)},
    Σⱼ,ⱼ′₌₁ˡ ∂²g/∂yⱼ∂yⱼ′ (y₁, ..., y_l) · yⱼ^{(1,0)} · yⱼ′^{(0,1)} ).

We can use this method to calculate any directional first and second order derivative of g in one pass. For example, if l = 3, so g : ℝ³ → ℝ, then the last component of h((x, x′, x″, 0), (y, y′, y″, 0), (z, z′, z″, 0)) calculates the second order derivative of g, differentiating first in direction (x′, y′, z′) and then in direction (x″, y″, z″), evaluating at (x, y, z).

In the proper Taylor representation we explicitly include the non-mixed second order derivatives as inputs and outputs, leading to a function h′ : (ℝ⁶)ˡ → ℝ⁶. Above we have followed a common trick to avoid some unnecessary storage and computation, since these extra inputs and outputs are not required for computing the second order derivatives of g. For instance, if l = 2 then the last component of h((x, 1, 0, 0), (y, 0, 1, 0)) is ∂²g(x, y)/∂x∂y (x, y).
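A minimal Haskell sketch of this four-component representation (our own illustration, not code from the paper; D22, seedX and seedY are our names): we track a value, the two first order partials, and the mixed second order partial, and implement h for addition and multiplication.

```haskell
-- Value, ∂/∂x, ∂/∂y, and mixed ∂²/∂x∂y, tracked simultaneously.
data D22 = D22 { val, dx, dy, dxy :: Double } deriving Show

instance Num D22 where
  D22 u ux uy uxy + D22 v vx vy vxy = D22 (u + v) (ux + vx) (uy + vy) (uxy + vxy)
  -- Product rule for the mixed partial: (uv)_xy = u_xy·v + u_x·v_y + u_y·v_x + u·v_xy
  D22 u ux uy uxy * D22 v vx vy vxy =
    D22 (u * v) (ux * v + u * vx) (uy * v + u * vy)
        (uxy * v + ux * vy + uy * vx + u * vxy)
  negate (D22 u ux uy uxy) = D22 (negate u) (negate ux) (negate uy) (negate uxy)
  fromInteger n = D22 (fromInteger n) 0 0 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Seeds: x varies in the first direction, y in the second.
seedX, seedY :: Double -> D22
seedX a = D22 a 1 0 0
seedY b = D22 b 0 1 0

-- ∂²/∂x∂y of g(x,y) = x·x·y at (3,7) is 2x = 6, computed in one pass.
main :: IO ()
main = print (dxy (let x = seedX 3; y = seedY 7 in x * x * y))
```

The four seeds (x, 1, 0, 0) and (y, 0, 1, 0) are precisely the inputs used in the l = 2 observation above.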
2.4. Example: a one-dimensional second order Taylor series.

As opposed to (2,2)-AD, (1,2)-AD computes the first and second order derivatives in the same direction. For example, if g : ℝ² → ℝ is a smooth function, then h : (ℝ³)² → ℝ³. An intuition for h can be given in terms of triple numbers. The transformed function operates on triples of numbers, (x, x′, x″), and it is common to think of such a triple as x + x′ε + x″ε² for an 'infinitesimal' ε which has the property that ε³ = 0. For instance we have

h((x₁, 1, 0), (x₂, 0, 0)) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁), ∂²g(x, x₂)/∂x² (x₁))
h((x₁, 0, 0), (x₂, 1, 0)) = (g(x₁, x₂), ∂g(x₁, x)/∂x (x₂), ∂²g(x₁, x)/∂x² (x₂))
h((x₁, 1, 0), (x₂, 1, 0)) = (g(x₁, x₂), ∂g(x, x₂)/∂x (x₁) + ∂g(x₁, x)/∂x (x₂),
                             ∂²g(x, x₂)/∂x² (x₁) + ∂²g(x₁, x)/∂x² (x₂) + 2 ∂²g(x, y)/∂x∂y (x₁, x₂)).

We see that we directly get the non-mixed second order partial derivatives but not the mixed ones. Writing (−)₃ for the third component, we can recover ∂²g(x, y)/∂x∂y (x₁, x₂) as

½ ( h((x₁, 1, 0), (x₂, 1, 0))₃ − h((x₁, 1, 0), (x₂, 0, 0))₃ − h((x₁, 0, 0), (x₂, 1, 0))₃ ).

More generally, if g : ℝˡ → ℝ, then h : (ℝ³)ˡ → ℝ³ satisfies:

h((x₁, x₁′, 0), ..., (x_l, x_l′, 0)) = ( g(x₁, ..., x_l), Σᵢ₌₁ˡ ∂g/∂xᵢ (x₁, ..., x_l) · xᵢ′, Σᵢ,ⱼ₌₁ˡ ∂²g/∂xᵢ∂xⱼ (x₁, ..., x_l) · xᵢ′ · xⱼ′ ).

We can always recover the mixed second order partial derivatives from this, but it requires several evaluations of h. This is thus different from the (2,2) method, which was more direct.
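As an illustration, here is a small Haskell sketch of (1,2)-style triple numbers (again our own, with raw second derivatives in the third slot, matching the h equations above): each operation propagates a value, a first derivative, and a second derivative along one direction.

```haskell
-- Value, first and second derivative along a single direction.
data Triple = Triple { v0, v1, v2 :: Double } deriving Show

instance Num Triple where
  Triple x x1 x2 + Triple y y1 y2 = Triple (x + y) (x1 + y1) (x2 + y2)
  -- (uv)'' = u''·v + 2·u'·v' + u·v''
  Triple x x1 x2 * Triple y y1 y2 =
    Triple (x * y) (x1 * y + x * y1) (x2 * y + 2 * x1 * y1 + x * y2)
  negate (Triple x x1 x2) = Triple (negate x) (negate x1) (negate x2)
  fromInteger n = Triple (fromInteger n) 0 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Lift a primitive with known first and second derivative (Faà di Bruno):
-- (f∘u)'' = f''(u)·(u')² + f'(u)·u''
lift :: (Double -> Double) -> (Double -> Double) -> (Double -> Double)
     -> Triple -> Triple
lift f f' f'' (Triple x x1 x2) = Triple (f x) (f' x * x1) (f'' x * x1 * x1 + f' x * x2)

sinT :: Triple -> Triple
sinT = lift sin cos (negate . sin)

-- First and second derivative of x ↦ sin(x²) at x = 0.5, in one pass.
main :: IO ()
main = let x = Triple 0.5 1 0 in print (sinT (x * x))
```

Note how sin's higher derivatives reuse sin and cos, which is precisely the kind of sharing that makes Taylor-mode cheaper than iterated first order AD.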
2.5. Remark. In the rest of this article, we study forward-mode (k, R)-automatic differentiation for a language with higher-order functions. The reader may like to fix k = R = 1 for standard automatic differentiation with first-order derivatives, based on dual numbers. This is the approach taken in the conference version of this paper [29]. But the generalization to higher-order derivatives with arbitrary k and R flows straightforwardly through the whole narrative.

3. A Higher Order Forward-Mode AD Translation
3.1. A simple language of smooth functions.
We consider a standard higher order typed language with a first order type real of real numbers. The types (τ, σ) and terms (t, s) are as follows.

τ, σ, ρ ::=                                       types
    real                                          real numbers
  | (τ₁ ∗ ... ∗ τₙ)                               finite product
  | τ → σ                                         function

t, s, r ::=                                       terms
    x                                             variable
  | op(t₁, ..., tₙ)                               operations (including constants)
  | ⟨t₁, ..., tₙ⟩ | case t of ⟨x₁, ..., xₙ⟩ → s    tuples/pattern matching
  | λx.t | t s                                    function abstraction/application

The typing rules are in Figure 3. We have included some abstract basic n-ary operations op ∈ Opₙ for every n ∈ ℕ. These are intended to include the usual (smooth) mathematical operations that are used in programs to which automatic differentiation is applied. For example,
• for any real constant c ∈ ℝ, we typically include a constant c ∈ Op₀; we slightly abuse notation and will simply write c for c() in our examples;
• we include some unary operations such as ς ∈ Op₁, which we intend to stand for the usual sigmoid function, ς(x) := 1/(1 + e^{−x});
• we include some binary operations such as addition and multiplication (+), (∗) ∈ Op₂.

We add some simple syntactic sugar t − u := t + (−1) ∗ u and, for some natural number n, n · t := t + ... + t (n times) and tⁿ := t ∗ ... ∗ t (n times). Similarly, we will frequently denote repeated sums and products using Σ- and Π-signs, respectively: for example, we write t₁ + ... + tₙ as Σ_{i∈{1,...,n}} tᵢ and t₁ ∗ ... ∗ tₙ as Π_{i∈{1,...,n}} tᵢ. This is in addition to programming sugar such as let x = t in s for (λx.s) t, and λ⟨x₁, ..., xₙ⟩.t for λx. case x of ⟨x₁, ..., xₙ⟩ → t.

3.2. Syntactic automatic differentiation: a functorial macro.
The aim of higher order forward mode AD is to find the (k, R)-Taylor representation of a function by syntactic manipulations, for some choice of (k, R) that we fix. For our simple language, we implement this as the following inductively defined macro −→D(k,R) on both types and terms (see also [68, 61]). For the sake of legibility, we simply write −→D(k,R) as −→D here and leave the dimension k and order R of the Taylor representation implicit. The following definition is for general k and R, but we treat specific cases afterwards in Example 1.

−→D(real) := real^{$\binom{R+k}{k}$}   (i.e., the type of tuples of reals of length $\binom{R+k}{k}$)
−→D((τ₁ ∗ ... ∗ τₙ)) := (−→D(τ₁) ∗ ... ∗ −→D(τₙ))
−→D(τ → σ) := −→D(τ) → −→D(σ)

Figure 3 (typing rules for the simple language):

Γ ⊢ t₁ : real  ···  Γ ⊢ tₙ : real  ⟹  Γ ⊢ op(t₁, ..., tₙ) : real   (op ∈ Opₙ)
Γ ⊢ t₁ : τ₁  ···  Γ ⊢ tₙ : τₙ  ⟹  Γ ⊢ ⟨t₁, ..., tₙ⟩ : (τ₁ ∗ ... ∗ τₙ)
Γ ⊢ t : (σ₁ ∗ ... ∗ σₙ)  and  Γ, x₁ : σ₁, ..., xₙ : σₙ ⊢ s : τ  ⟹  Γ ⊢ case t of ⟨x₁, ..., xₙ⟩ → s : τ
(x : τ) ∈ Γ  ⟹  Γ ⊢ x : τ
Γ, x : τ ⊢ t : σ  ⟹  Γ ⊢ λx : τ.t : τ → σ
Γ ⊢ t : σ → τ  and  Γ ⊢ s : σ  ⟹  Γ ⊢ t s : τ
On terms, the macro is defined by:

−→D(x) := x
−→D(λx.t) := λx.−→D(t)
−→D(t s) := −→D(t) −→D(s)
−→D(⟨t₁, ..., tₙ⟩) := ⟨−→D(t₁), ..., −→D(tₙ)⟩
−→D(case t of ⟨x₁, ..., xₙ⟩ → s) := case −→D(t) of ⟨x₁, ..., xₙ⟩ → −→D(s)
−→D(op(t₁, ..., tₙ)) := case −→D(t₁) of ⟨x_{1,0···0}, ..., x_{1,R0···0}⟩ → ... case −→D(tₙ) of ⟨x_{n,0···0}, ..., x_{n,R0···0}⟩ →
    ⟨D_{0···0}op(x_{1,0···0}, ..., x_{n,R0···0}), ..., D_{R0···0}op(x_{1,0···0}, ..., x_{n,R0···0})⟩

where

D_{0···0}op(x_{1,0···0}, ..., x_{n,R0···0}) := op(x_{1,0···0}, ..., x_{n,0···0})

and, for α₁ + ... + α_k > 0,

D_{α₁···α_k}op(x_{1,0···0}, ..., x_{n,R0···0}) :=
  α₁!·...·α_k! · Σ_{{(β₁,...,βₙ)∈ℕⁿ | 1 ≤ β₁+...+βₙ ≤ α₁+...+α_k}} ∂^{β₁···βₙ}op(x_{1,0···0}, ..., x_{n,0···0}) ∗
  Σ_{{((e₁₁,...,e₁ₙ),...,(e_{q1},...,e_{qn}))∈(ℕⁿ)^q | e_{1j}+...+e_{qj}=βⱼ, (e₁₁+...+e₁ₙ)·α_{1i}+...+(e_{q1}+...+e_{qn})·α_{qi}=αᵢ}}
    Π_{r=1}^{q} Π_{j=1}^{n} 1/e_{rj}! · ( 1/(α_{r1}!·...·α_{rk}!) · x_{j,α_{r1}···α_{rk}} )^{e_{rj}},

mirroring the Faà di Bruno formula of §2.2. Here, (∂^{β₁···βₙ}op)(x₁, ..., xₙ) are some chosen terms of type real in the language with free variables from x₁, ..., xₙ. We think of these terms as implementing the partial derivative ∂^{β₁+...+βₙ}⟦op⟧(x₁, ..., xₙ)/∂x₁^{β₁}···∂xₙ^{βₙ} of the smooth function ⟦op⟧ : ℝⁿ → ℝ that op implements. For example, we could choose the following representations of derivatives of order ≤ 2:

∂^{10}(+)(x₁, x₂) = 1    ∂^{01}(+)(x₁, x₂) = 1    ∂^{20}(+)(x₁, x₂) = ∂^{11}(+)(x₁, x₂) = ∂^{02}(+)(x₁, x₂) = 0
∂^{10}(∗)(x₁, x₂) = x₂   ∂^{01}(∗)(x₁, x₂) = x₁   ∂^{11}(∗)(x₁, x₂) = 1    ∂^{20}(∗)(x₁, x₂) = ∂^{02}(∗)(x₁, x₂) = 0
∂¹(ς)(x) = let y = ς(x) in y ∗ (1 − y)
∂²(ς)(x) = let y = ς(x) in let z = y ∗ (1 − y) in z ∗ (1 − 2 ∗ y)

Note that our rules, in particular, imply that −→D(c) = ⟨c, 0, ..., 0⟩.

Example 1 ((1,1)- and (2,2)-AD). Our choices of partial derivatives of the example operations are sufficient to implement (k, R)-Taylor forward AD with R ≤ 2. To be explicit, the distinctive formulas for these instances of −→D(k,R) above are

−→D(1,1)(real) = (real ∗ real)
−→D(1,1)(op(t₁, ..., tₙ)) = case −→D(1,1)(t₁) of ⟨x_{1,0}, x_{1,1}⟩ → ... → case −→D(1,1)(tₙ) of ⟨x_{n,0}, x_{n,1}⟩ →
    ⟨op(x_{1,0}, ..., x_{n,0}), Σᵢ₌₁ⁿ x_{i,1} ∗ ∂_î op(x_{1,0}, ..., x_{n,0})⟩

−→D(2,2)(real) = real⁶
−→D(2,2)(op(t₁, ..., tₙ)) = case −→D(2,2)(t₁) of ⟨x_{1,00}, x_{1,01}, x_{1,02}, x_{1,10}, x_{1,11}, x_{1,20}⟩ → ...
  case −→D(2,2)(tₙ) of ⟨x_{n,00}, x_{n,01}, x_{n,02}, x_{n,10}, x_{n,11}, x_{n,20}⟩ →
    ⟨ op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,01} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,02} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,01} ∗ x_{j,01} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,10} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,11} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,10} ∗ x_{j,01} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}),
      Σᵢ₌₁ⁿ x_{i,20} ∗ ∂_î op(x_{1,00}, ..., x_{n,00}) + Σᵢ,ⱼ₌₁ⁿ x_{i,10} ∗ x_{j,10} ∗ ∂_{ĉ(i,j)} op(x_{1,00}, ..., x_{n,00}) ⟩

where we informally write î for the one-hot encoding of i (the sequence of length n consisting exclusively of zeros except in position i, where it has a 1) and ĉ(i,j) for the two-hot encoding of i and j (the sequence of length n consisting exclusively of zeros except in positions i and j, where it has a 1 if i ≠ j and a 2 if i = j).

As noted in §2, it is often unnecessary to include all components of the (2,2)-jets; we may drop the non-mixed second order components, yielding the more economical variant

−→D(2,2)′(real) = real⁴
−→D(2,2)′(op(t₁, ..., tₙ)) = case −→D(2,2)′(t₁) of ⟨x_{1,0}, x_{1,1}, x_{1,2}, x_{1,3}⟩ → ...
  case −→D(2,2)′(tₙ) of ⟨x_{n,0}, x_{n,1}, x_{n,2}, x_{n,3}⟩ →
    ⟨ op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,1} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,2} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}),
      Σᵢ₌₁ⁿ x_{i,3} ∗ ∂_î op(x_{1,0}, ..., x_{n,0}) + Σᵢ,ᵢ′₌₁ⁿ x_{i,1} ∗ x_{i′,2} ∗ ∂_{ĉ(i,i′)} op(x_{1,0}, ..., x_{n,0}) ⟩.

We extend −→D to contexts: −→D({x₁ : τ₁, ..., xₙ : τₙ}) := {x₁ : −→D(τ₁), ..., xₙ : −→D(τₙ)}. This turns −→D into a well-typed, functorial macro in the following sense.

Lemma 1 (Functorial macro). If Γ ⊢ t : τ then −→D(Γ) ⊢ −→D(t) : −→D(τ). If Γ, x : σ ⊢ t : τ and Γ ⊢ s : σ then −→D(Γ) ⊢ −→D(t[s/x]) = −→D(t)[−→D(s)/x].

Proof. By induction on the structure of typing derivations.
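To make the macro concrete, here is a small Haskell sketch of −→D(1,1) as a source-to-source transformation on a fragment of the language with variables, constants, +, ∗ and ς. This is our own illustration with hypothetical names throughout, and for brevity it uses fixed fresh variable names; a real implementation must generate genuinely fresh names to avoid capture.

```haskell
-- A fragment of the object language.
data Tm
  = Var String
  | Con Double              -- constant c ∈ Op₀
  | Add Tm Tm | Mul Tm Tm   -- (+), (∗) ∈ Op₂
  | Sig Tm                  -- sigmoid ς ∈ Op₁
  | Pair Tm Tm | CasePair Tm String String Tm
  | Lam String Tm | App Tm Tm
  deriving Show

-- −→D(1,1): structure preserving on λ, application and pairs;
-- at an operation it emits the dual-number rule ⟨op x⃗, Σᵢ xᵢ' ∗ ∂ᵢop x⃗⟩.
d :: Tm -> Tm
d (Var x)    = Var x
d (Con c)    = Pair (Con c) (Con 0)
d (Lam x t)  = Lam x (d t)
d (App t s)  = App (d t) (d s)
d (Pair t s) = Pair (d t) (d s)
d (CasePair t x y s) = CasePair (d t) x y (d s)
d (Add t s)  = binOp t s (\a _ b _ -> Add a b) (\_ a' _ b' -> Add a' b')
d (Mul t s)  = binOp t s (\a _ b _ -> Mul a b)
                         (\a a' b b' -> Add (Mul a b') (Mul a' b))
d (Sig t)    = CasePair (d t) "x" "x'"
  (Pair (Sig (Var "x"))
        (Mul (Var "x'")
             (Mul (Sig (Var "x"))
                  (Add (Con 1) (Mul (Con (-1)) (Sig (Var "x")))))))
  -- ∂ς(x) = ς(x)·(1 − ς(x)); the paper's version uses let for sharing.

-- Shared scaffolding: destruct the two dual-number arguments.
binOp :: Tm -> Tm -> (Tm -> Tm -> Tm -> Tm -> Tm)
              -> (Tm -> Tm -> Tm -> Tm -> Tm) -> Tm
binOp t s prim tang =
  CasePair (d t) "a" "a'" (CasePair (d s) "b" "b'"
    (Pair (prim (Var "a") (Var "a'") (Var "b") (Var "b'"))
          (tang (Var "a") (Var "a'") (Var "b") (Var "b'"))))

main :: IO ()
main = print (d (Lam "w" (Sig (Add (Mul (Var "w") (Var "w")) (Con 3)))))
```

The transformation never inspects function bodies beyond recursing into them, which is exactly the functoriality property of Lemma 1.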
Example 2 (Inner products). Let us write τⁿ for the n-fold product (τ ∗ ... ∗ τ). Then, given Γ ⊢ t, s : realⁿ, we can define their inner product

Γ ⊢ t ·ₙ s := case t of ⟨z₁, ..., zₙ⟩ → case s of ⟨y₁, ..., yₙ⟩ → z₁ ∗ y₁ + ··· + zₙ ∗ yₙ : real.

To illustrate the calculation of −→D(1,1), let us expand (and β-reduce) −→D(1,1)(t ·₂ s):

case −→D(1,1)(t) of ⟨z₁, z₂⟩ → case −→D(1,1)(s) of ⟨y₁, y₂⟩ →
case z₁ of ⟨z_{1,0}, z_{1,1}⟩ → case y₁ of ⟨y_{1,0}, y_{1,1}⟩ → case z₂ of ⟨z_{2,0}, z_{2,1}⟩ → case y₂ of ⟨y_{2,0}, y_{2,1}⟩ →
⟨ z_{1,0} ∗ y_{1,0} + z_{2,0} ∗ y_{2,0},
  z_{1,0} ∗ y_{1,1} + z_{1,1} ∗ y_{1,0} + z_{2,0} ∗ y_{2,1} + z_{2,1} ∗ y_{2,0} ⟩.

Let us also expand the calculation of −→D(2,2)′(t ·₂ s):

case −→D(2,2)′(t) of ⟨z₁, z₂⟩ → case −→D(2,2)′(s) of ⟨y₁, y₂⟩ →
case z₁ of ⟨z₁, z₁′ₓ, z₁′ᵧ, z₁″⟩ → case y₁ of ⟨y₁, y₁′ₓ, y₁′ᵧ, y₁″⟩ →
case z₂ of ⟨z₂, z₂′ₓ, z₂′ᵧ, z₂″⟩ → case y₂ of ⟨y₂, y₂′ₓ, y₂′ᵧ, y₂″⟩ →
⟨ z₁ ∗ y₁ + z₂ ∗ y₂,
  z₁ ∗ y₁′ₓ + z₁′ₓ ∗ y₁ + z₂ ∗ y₂′ₓ + z₂′ₓ ∗ y₂,
  z₁ ∗ y₁′ᵧ + z₁′ᵧ ∗ y₁ + z₂ ∗ y₂′ᵧ + z₂′ᵧ ∗ y₂,
  z₁″ ∗ y₁ + z₂″ ∗ y₂ + y₁″ ∗ z₁ + y₂″ ∗ z₂ + z₁′ₓ ∗ y₁′ᵧ + z₁′ᵧ ∗ y₁′ₓ + z₂′ₓ ∗ y₂′ᵧ + z₂′ᵧ ∗ y₂′ₓ ⟩.

Example 3 (Neural networks). In our introduction, we provided a program (1.1) in our language to build a neural network out of expressions neuron, layer, comp; this program makes use of the inner product of Ex. 2. We can similarly calculate the derivatives of deep neural nets by mechanically applying the macro −→D.
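To see Examples 2 and 3 end to end, here is a hedged Haskell sketch (our own, hypothetical code, not the paper's implementation): the building blocks neuron, layer and comp from (1.1), written polymorphically where possible, so that instantiating them at a dual-number type differentiates the whole network mechanically, just as −→D does syntactically.

```haskell
-- Dual numbers again, kept local so this sketch is self-contained.
data Dual = Dual { primal, tangent :: Double } deriving Show
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

sigmoid :: Dual -> Dual
sigmoid (Dual x x') = let y = 1 / (1 + exp (-x)) in Dual y (y * (1 - y) * x')

-- The inner product of Ex. 2, for any Num type.
dot :: Num a => [a] -> [a] -> a
dot zs ys = sum (zipWith (*) zs ys)

-- neuron, layer and comp, in the shape of (1.1).
neuron :: [Dual] -> ([Dual], Dual) -> Dual
neuron x (w, b) = sigmoid (dot w x + b)

layer :: (x -> p -> y) -> x -> [p] -> [y]
layer f x ps = map (f x) ps

comp :: (x -> p -> y) -> (y -> q -> z) -> x -> (p, q) -> z
comp f g x (p, q) = g (f x p) q

-- ∂/∂b of a one-neuron network, by seeding b with tangent 1.
main :: IO ()
main = print (tangent (neuron [Dual 1 0, Dual 2 0]
                              ([Dual 0.5 0, Dual (-0.3) 0], Dual 0.1 1)))
```

The higher order combinators layer and comp need no special treatment: they propagate dual numbers simply by being applied, which is the compositionality this paper makes precise.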
4. Semantics of differentiation

Consider for a moment the first order fragment of the language in §3, with only one type, real, and no λ's or pairs. This has a simple semantics in the category of cartesian spaces and smooth maps. Indeed, a term x₁, ..., xₙ : real ⊢ t : real has a natural reading as a function ⟦t⟧ : ℝⁿ → ℝ, by interpreting our operation symbols by the well-known operations on ℝⁿ → ℝ with the corresponding name. In fact, the functions that are definable in this first order fragment are smooth. Let us write CartSp for this category of cartesian spaces (ℝⁿ for some n) and smooth functions.

The category CartSp has cartesian products, and so we can also interpret product types, tupling and pattern matching, giving us a useful syntax for constructing functions into and out of products of ℝ. For example, the interpretation of (neuronₙ) in (1.1) becomes

ℝⁿ × ℝⁿ × ℝ −−⟦·ₙ⟧×id_ℝ−→ ℝ × ℝ −−⟦+⟧−→ ℝ −−⟦ς⟧−→ ℝ,

where ⟦·ₙ⟧, ⟦+⟧ and ⟦ς⟧ are the usual inner product, addition and the sigmoid function on ℝ, respectively. Inside this category, we can straightforwardly study the first order language without λ's, and automatic differentiation. In fact, we can prove the following by plain induction on the syntax:

The interpretation of the (syntactic) forward AD −→D(t) of a first order term t equals the usual (semantic) derivative of the interpretation of t as a smooth function.
However, as is well known, the category CartSp does not support function spaces. To see this, notice that we have polynomial terms x₁, ..., x_d : real ⊢ λy. Σ_{n=1}^{d} xₙ ∗ yⁿ : real → real for each d, and so if we could interpret (real → real) as a Euclidean space ℝᵖ then, by interpreting these polynomial expressions, we would be able to find continuous injections ℝᵈ → ℝᵖ for every d, which is topologically impossible for any p, for example as a consequence of the Borsuk-Ulam theorem (see Appx. A).

This lack of function spaces means that we cannot interpret the functions (layer) and (comp) from (1.1) in CartSp, as they are higher order functions, even though they are very useful and innocent building blocks for differential programming! Clearly, we could define neural nets such as (1.1) directly as smooth functions without any higher order subcomponents, though that would quickly become cumbersome for deep networks. A problematic consequence of the lack of a semantics for higher order differential programs is that we have no obvious way of establishing compositional semantic correctness of −→D for the given implementation of (1.1).

We now show that every definable function is smooth, and then in Section 4.2 we show that the −→D macro witnesses its derivatives.

4.1. Smoothness at higher types and diffeologies.
The aim of this section is to introduce diffeological spaces as a semantic model for the simple language in Section 3. By way of motivation, we begin with a standard set theoretic semantics ⦇−⦈, where types are interpreted as follows:

⦇real⦈ := ℝ    ⦇(τ₁ ∗ ... ∗ τₙ)⦈ := Πᵢ₌₁ⁿ ⦇τᵢ⦈    ⦇τ → σ⦈ := (⦇τ⦈ → ⦇σ⦈)

and a term x₁ : τ₁, ..., xₙ : τₙ ⊢ t : σ is interpreted as a function ⦇t⦈ : Πᵢ₌₁ⁿ ⦇τᵢ⦈ → ⦇σ⦈, mapping a valuation of the context to a result.

We can show that the interpretation of a term x₁ : real, ..., xₙ : real ⊢ t : real is always a smooth function ℝⁿ → ℝ, even if it has higher order subterms. We begin with a fairly standard logical relations proof of this, and then move from this to the semantic model of diffeological spaces.

Proposition 1. If x₁ : real, ..., xₙ : real ⊢ t : real then the function ⦇t⦈ : ℝⁿ → ℝ is smooth.

Proof. Fix R and k. We show that ⦇t⦈ has (k,R)-velocities. For each type τ define a set Q_τ ⊆ [ℝᵏ → ⦇τ⦈] by induction on the structure of types:

Q_real = { f : ℝᵏ → ℝ | f has a (k,R)-Taylor approximation everywhere }
Q_{(τ₁∗...∗τₙ)} = { f : ℝᵏ → Πᵢ₌₁ⁿ ⦇τᵢ⦈ | each component fᵢ of f is in Q_{τᵢ} }
Q_{τ→σ} = { f : ℝᵏ → ⦇τ⦈ → ⦇σ⦈ | ∀g ∈ Q_τ. λr⃗. f(r⃗)(g(r⃗)) ∈ Q_σ }

Now we show the fundamental lemma: if x₁ : τ₁, ..., xₙ : τₙ ⊢ u : σ and g₁ ∈ Q_{τ₁}, ..., gₙ ∈ Q_{τₙ}, then ((g₁, ..., gₙ); ⦇u⦈) ∈ Q_σ. This is shown by induction on the structure of typing derivations. The only interesting step here is that the basic operations (+, ∗, ς etc.) are smooth. We deduce the statement of the theorem by putting u = t, k = n, and letting gᵢ : ℝⁿ → ℝ be the projections.

At higher types, the logical relations Q show that we can only define functions that send smooth functions to smooth functions, meaning that we can never use them to build first order functions that are not smooth. For example, (comp) in (1.1) has this property.

This logical relations proof suggests to build a semantic model by interpreting types as sets with structure: for each type we have a set X together with a set Q^{ℝᵏ}_X ⊆ [ℝᵏ → X] of plots.

Definition 1. A diffeological space (X, P_X) consists of a set X together with, for each n and each open subset U of ℝⁿ, a set P^U_X ⊆ [U → X] of functions, called plots, such that
• all constant functions are plots;
• if f : V → U is a smooth function and p ∈ P^U_X, then f; p ∈ P^V_X;
• if (pᵢ ∈ P^{Uᵢ}_X)_{i∈I} is a compatible family of plots (x ∈ Uᵢ ∩ Uⱼ ⇒ pᵢ(x) = pⱼ(x)) and (Uᵢ)_{i∈I} covers U, then the gluing p : U → X, defined by p(x) = pᵢ(x) for x ∈ Uᵢ, is a plot.

We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P^U_X, we have that p; f ∈ P^U_Y. We write Diff(X, Y) for the set of smooth maps from X to Y. Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions.

A diffeological space is thus a set equipped with structure. Many constructions of sets carry over straightforwardly to diffeological spaces.
Example 4 (Cartesian diffeologies). Each open subset U of ℝⁿ can be given the structure of a diffeological space by taking all the smooth functions V → U as the plots P^V_U. It is easily seen that smooth functions V → U in the traditional sense coincide with smooth functions in the sense of diffeological spaces. Thus diffeological spaces have a profound relationship with ordinary calculus. In categorical terms, this gives a full embedding of CartSp in Diff.

Example 5 (Product diffeologies). Given a family (Xᵢ)_{i∈I} of diffeological spaces, we can equip the product Π_{i∈I} Xᵢ of sets with the product diffeology, in which U-plots are precisely the functions of the form (pᵢ)_{i∈I} for pᵢ ∈ P^U_{Xᵢ}. This gives us the categorical product in Diff.

Example 6 (Functional diffeology). We can equip the set Diff(X, Y) of smooth functions between diffeological spaces with the functional diffeology, in which U-plots consist of functions f : U → Diff(X, Y) such that (u, x) ↦ f(u)(x) is an element of Diff(U × X, Y). This specifies the categorical function object in Diff.

We can now give a denotational semantics to our language from Section 3 in the category of diffeological spaces. We interpret each type τ as a set ⟦τ⟧ equipped with the relevant diffeology, by induction on the structure of types:

⟦real⟧ := ℝ    ⟦(τ₁ ∗ ... ∗ τₙ)⟧ := Πᵢ₌₁ⁿ ⟦τᵢ⟧    ⟦τ → σ⟧ := Diff(⟦τ⟧, ⟦σ⟧)

A context Γ = (x₁ : τ₁, ..., xₙ : τₙ) is interpreted as a diffeological space ⟦Γ⟧ := Πᵢ₌₁ⁿ ⟦τᵢ⟧. Now well typed terms Γ ⊢ t : τ are interpreted as smooth functions ⟦t⟧ : ⟦Γ⟧ → ⟦τ⟧, giving a meaning for t for every valuation of the context. This is routinely defined by induction on the structure of typing derivations, once we choose a smooth function ⟦op⟧ : ℝⁿ → ℝ to interpret each n-ary operation op ∈ Opₙ. For example, constants c : real are interpreted as constant functions, and the first order operations (+, ∗, ς) are interpreted by composing with the corresponding functions, which are smooth: e.g., ⟦ς(t)⟧(ρ) := ς(⟦t⟧(ρ)), where ρ ∈ ⟦Γ⟧. Variables are interpreted as ⟦xᵢ⟧(ρ) := ρᵢ. The remaining constructs are interpreted as follows, and it is straightforward to show that smoothness is preserved:

⟦⟨t₁, ..., tₙ⟩⟧(ρ) := (⟦t₁⟧(ρ), ..., ⟦tₙ⟧(ρ))
⟦λx : τ.t⟧(ρ)(a) := ⟦t⟧(ρ, a)   (a ∈ ⟦τ⟧)
⟦case t of ⟨...⟩ → s⟧(ρ) := ⟦s⟧(ρ, ⟦t⟧(ρ))
⟦t s⟧(ρ) := ⟦t⟧(ρ)(⟦s⟧(ρ))

The logical relations proof of Proposition 1 is reminiscent of diffeological spaces. We now briefly remark on the suitability of the axioms of diffeological spaces (Def. 1) for a semantic model of smooth programs. The first axiom says that we only consider reflexive logical relations. From the perspective of the interpretation, it recognizes in particular that the semantics of an expression of type (real → real) → real is defined by its value on smooth functions rather than arbitrary arguments. That is to say, the set-theoretic semantics at the beginning of this section, ⦇(real → real) → real⦈, is different to the diffeological semantics, ⟦(real → real) → real⟧. The second axiom for diffeological spaces ensures that the smooth maps in Diff(U, X) are exactly the plots in P^U_X. The third axiom ensures that categories of manifolds fully embed into Diff; it will not play a visible role in this paper (in fact, [5] prove similar results for a simple language like ours by using plain logical relations, over Set, and without demanding the diffeology axioms). However, we expect the third axiom to be crucial for programming with other smooth structures or partiality.

4.2. Correctness of AD.
We have shown that a term x₁ : real, ..., xₙ : real ⊢ t : real is interpreted as a smooth function ⟦t⟧ : ℝⁿ → ℝ, even if t involves higher order functions (like (1.1)). Moreover, the macro differentiation −→D(k,R)(t) is interpreted as a smooth function ⟦−→D(k,R)(t)⟧ : $\big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ (Prop. 1). This enables us to state a limited version of our main correctness theorem:

Theorem 1 (Semantic correctness of −→D (limited)). For any term x₁ : real, ..., xₙ : real ⊢ t : real, the function ⟦−→D(k,R)(t)⟧ is the (k, R)-Taylor representation (2.2) of ⟦t⟧. In detail: for any smooth functions f₁, ..., fₙ : ℝᵏ → ℝ,

\[
\left(\left(\frac{\partial^{\alpha_1+\ldots+\alpha_k} f_j(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}\right)_{j=1}^{n} ; \llbracket \vec{\mathcal{D}}_{(k,R)}(t) \rrbracket = \left(\frac{\partial^{\alpha_1+\ldots+\alpha_k}((f_1,\ldots,f_n);\llbracket t \rrbracket)(x)}{\partial x_1^{\alpha_1}\cdots\partial x_k^{\alpha_k}}\right)_{(\alpha_1,\ldots,\alpha_k)=(0,\ldots,0)}^{(R,0,\ldots,0)}.
\]

(For instance, if n = 2, then ⟦−→D(1,1)(t)⟧(x₁, 1, x₂, 0) = (⟦t⟧(x₁, x₂), ∂⟦t⟧(x, x₂)/∂x (x₁)).)

Proof.
We prove this by logical relations. Although the following proof is elementary, we found it by using the categorical methods in §6.

For each type τ, we define a binary relation S_τ between (open) k-dimensional plots in ⟦τ⟧ and (open) k-dimensional plots in ⟦−→D(k,R)(τ)⟧, i.e. S_τ ⊆ P^{ℝᵏ}_{⟦τ⟧} × P^{ℝᵏ}_{⟦−→D(k,R)(τ)⟧}, by induction on τ:

• S_real := { (f, ((∂^{α₁+...+α_k} f(x)/∂x₁^{α₁}···∂x_k^{α_k}))_{(α₁,...,α_k)=(0,...,0)}^{(R,0,...,0)}) | f : ℝᵏ → ℝ smooth };
• S_{(τ₁∗...∗τₙ)} := { ((f₁, ..., fₙ), (g₁, ..., gₙ)) | (f₁, g₁) ∈ S_{τ₁}, ..., (fₙ, gₙ) ∈ S_{τₙ} };
• S_{τ→σ} := { (f₁, f₂) | ∀(g₁, g₂) ∈ S_τ. (x ↦ f₁(x)(g₁(x)), x ↦ f₂(x)(g₂(x))) ∈ S_σ }.

Then, we establish the following 'fundamental lemma': if x₁ : τ₁, ..., xₙ : τₙ ⊢ t : σ and, for all 1 ≤ i ≤ n, fᵢ : ℝᵏ → ⟦τᵢ⟧ and gᵢ : ℝᵏ → ⟦−→D(k,R)(τᵢ)⟧ are such that (fᵢ, gᵢ) is in S_{τᵢ}, then we have that ((f₁, ..., fₙ); ⟦t⟧, (g₁, ..., gₙ); ⟦−→D(k,R)(t)⟧) is in S_σ. This is proved routinely by induction on the typing derivation of t. The case for op(t₁, ..., tₙ) relies on the precise definition of −→D(k,R)(op(t₁, ..., tₙ)).

We conclude the theorem from the fundamental lemma by considering the case where τᵢ = σ = real.
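The content of Theorem 1 can be illustrated (not proved, of course) with a quick numerical check, a sketch of our own: the dual-number interpretation of a program with higher order subterms agrees with a finite-difference approximation of the derivative of its denotation.

```haskell
data Dual = Dual { primal, tangent :: Double }
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- A program with a higher order subterm: comp applied to two closures.
prog :: Num a => a -> a
prog = comp (\x -> x * x + 1) (\y -> y * y * y)
  where comp f g = g . f   -- higher order composition, as in the language

main :: IO ()
main = do
  let a   = 0.7
      eps = 1e-6
      fd  = (prog (a + eps) - prog (a - eps)) / (2 * eps)  -- finite difference
      ad  = tangent (prog (Dual a 1))                      -- dual-number derivative
  print (fd, ad)  -- approximately equal
```

The two numbers agree up to the accuracy of the finite-difference scheme, which is the shape of the statement the theorem makes exactly.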
5. Extending the language: variant and inductive types

In this section, we show that the definition of forward AD and the semantics generalize if we extend the language of §3 with variant and list types.

5.1. Language.
We additionally consider the following types and terms:

τ, σ, ρ ::= ...                                    types
  | {ℓ₁ τ₁ | ... | ℓₙ τₙ}                           variant
  | list(τ)                                         list

t, s, r ::= ...                                     terms
  | τ.ℓ t                                           variant constructor
  | [] | t :: s                                     empty list and cons
  | case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ}        pattern matching: variants
  | fold (x₁, x₂).t over s from r                   list fold

We extend the type system according to the rules of Fig. 4. We can then extend −→D(k,R) (again, writing it as −→D, for legibility) to our new types and terms by

−→D({ℓ₁ τ₁ | ... | ℓₙ τₙ}) := {ℓ₁ −→D(τ₁) | ... | ℓₙ −→D(τₙ)}
−→D(list(τ)) := list(−→D(τ))
−→D(τ.ℓ t) := −→D(τ).ℓ −→D(t)
−→D([]) := []
−→D(t :: s) := −→D(t) :: −→D(s)
−→D(case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ}) := case −→D(t) of {ℓ₁ x₁ → −→D(s₁) | ··· | ℓₙ xₙ → −→D(sₙ)}
−→D(fold (x₁, x₂).t over s from r) := fold (x₁, x₂).−→D(t) over −→D(s) from −→D(r)

To demonstrate the practical use of expressive type systems for differential programming, we consider the following two examples.

Example 7 (Lists of inputs for neural nets). Usually, we run a neural network on a large data set, the size of which might be determined at runtime. To evaluate a neural network on multiple inputs, in practice, one often sums the outcomes. This can be coded in our extended language as follows. Suppose that we have a network f : (realⁿ ∗ P) → real that operates on single input vectors. We can construct one that operates on lists of inputs as follows:

g := λ⟨l, w⟩. fold (x₁, x₂). f⟨x₁, w⟩ + x₂ over l from 0 : (list(realⁿ) ∗ P) → real

Example 8 (Missing data). In practically every application of statistics and machine learning, we face the problem of missing data: for some observations, only partial information is available. In an expressive typed programming language like we consider, we can model missing data conveniently using the data type maybe(τ) = {Nothing () | Just τ}. In the context of a neural network, one might use it as follows. First, define some helper functions:

fromMaybe := λx.λm. case m of {Nothing x₀ → x | Just x′ → x′}
fromMaybeₙ := λ⟨x₁, ..., xₙ⟩.λ⟨m₁, ..., mₙ⟩. ⟨fromMaybe x₁ m₁, ..., fromMaybe xₙ mₙ⟩
    : realⁿ → (maybe(real))ⁿ → realⁿ
map := λf.λl. fold (x₁, x₂). f x₁ :: x₂ over l from [] : (τ → σ) → list(τ) → list(σ)

Given a neural network f : (list(realᵏ) ∗ P) → real, we can build a new one that operates on a data set for which some covariates (features) are missing, by passing in default values to replace the missing covariates:

λ⟨l, ⟨m, w⟩⟩. f⟨map (fromMaybeₖ m) l, w⟩ : (list((maybe(real))ᵏ) ∗ (realᵏ ∗ P)) → real

Then, given a data set l with missing covariates, we can perform automatic differentiation on this network to optimize, simultaneously, the ordinary network parameters w and the default values for missing covariates m.

Figure 4 (additional typing rules for the extended language):

Γ ⊢ t : τᵢ  ⟹  Γ ⊢ τ.ℓᵢ t : τ   ((ℓᵢ τᵢ) ∈ τ)
Γ ⊢ [] : list(τ)
Γ ⊢ t : τ  and  Γ ⊢ s : list(τ)  ⟹  Γ ⊢ t :: s : list(τ)
Γ ⊢ t : {ℓ₁ τ₁ | ... | ℓₙ τₙ}  and, for each 1 ≤ i ≤ n,  Γ, xᵢ : τᵢ ⊢ sᵢ : τ  ⟹  Γ ⊢ case t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ} : τ
Γ ⊢ s : list(τ),  Γ ⊢ r : σ  and  Γ, x₁ : τ, x₂ : σ ⊢ t : σ  ⟹  Γ ⊢ fold (x₁, x₂).t over s from r : σ
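A hedged Haskell analogue of Examples 7 and 8 (our own sketch, hypothetical names): because the AD transformation is structure preserving on lists and variants, a generic fold over dual numbers differentiates a whole-data-set loss, and the default values for missing covariates are themselves differentiable parameters.

```haskell
import Data.Maybe (fromMaybe)

data Dual = Dual { primal, tangent :: Double }
instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
  negate (Dual x x') = Dual (negate x) (negate x')
  fromInteger n = Dual (fromInteger n) 0
  abs = undefined; signum = undefined  -- not needed for this sketch

-- Example 7: sum a per-input network f over a list of inputs, as a fold.
sumOver :: Num a => ((x, p) -> a) -> ([x], p) -> a
sumOver f (l, w) = foldr (\x acc -> f (x, w) + acc) 0 l

-- Example 8: fill in missing covariates with (differentiable) defaults m.
fillMissing :: [Dual] -> [[Maybe Dual]] -> [[Dual]]
fillMissing m = map (zipWith (\d mx -> fromMaybe d mx) m)

-- ∂/∂m₁ of the summed loss, seeding the first default value with tangent 1.
main :: IO ()
main = do
  let f (xs, w) = sum (zipWith (*) xs w)      -- a toy linear 'network'
      m = [Dual 0 1, Dual 0 0]                -- defaults for 2 covariates
      dataset = [[Just 1, Nothing], [Nothing, Just 2]]
      w = [Dual 3 0, Dual 4 0]
  print (tangent (sumOver f (fillMissing m dataset, w)))
```

No case in this code is AD-specific: the derivative flows through foldr, map and Maybe exactly because −→D commutes with the list and variant constructs.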
5.2. Semantics. In §4, each type τ was interpreted as a diffeological space, which is a set equipped with a family of plots. We extend this interpretation to the new types:

• A variant type {ℓ₁ τ₁ | ... | ℓₙ τₙ} is interpreted as the disjoint union of the semantic spaces, ⟦{ℓ₁ τ₁ | ··· | ℓₙ τₙ}⟧ := ∐ᵢ₌₁ⁿ ⟦τᵢ⟧, with U-plots

P^U_{⟦{ℓ₁ τ₁ | ... | ℓₙ τₙ}⟧} := { [Uⱼ −fⱼ→ ⟦τⱼ⟧ ↪ ∐ᵢ₌₁ⁿ ⟦τᵢ⟧]ⱼ₌₁ⁿ | U = ∐ⱼ₌₁ⁿ Uⱼ, fⱼ ∈ P^{Uⱼ}_{⟦τⱼ⟧} }.

• A list type list(τ) is interpreted as the set of lists, ⟦list(τ)⟧ := ∐ᵢ₌₀^∞ ⟦τ⟧ⁱ, with U-plots

P^U_{⟦list(τ)⟧} := { [Uⱼ −fⱼ→ ⟦τ⟧ʲ ↪ ∐ᵢ₌₀^∞ ⟦τ⟧ⁱ]ⱼ₌₀^∞ | U = ∐ⱼ₌₀^∞ Uⱼ, fⱼ ∈ P^{Uⱼ}_{⟦τ⟧ʲ} }.

The constructors and destructors for variants and lists are interpreted as in the usual set theoretic semantics. It is routine to show inductively that these interpretations are smooth. Thus every term Γ ⊢ t : τ in the extended language is interpreted as a smooth function ⟦t⟧ : ⟦Γ⟧ → ⟦τ⟧ between diffeological spaces.

In this section we focused on a language with lists and variants. All inductive types are easily interpreted in the category of diffeological spaces in much the same way. The categorically minded reader may regard this as a consequence of Diff being a concrete Grothendieck quasitopos, e.g. [4], and hence complete and cocomplete. The interpretation follows exactly the usual categorical semantics of variant types and inductive types (e.g. [56]). List objects as initial algebras are computed as usual in a cocomplete category (e.g. [31]).
6. Categorical analysis of (higher order) forward AD and its correctness
This section has three parts. First, we give a categorical account of the functoriality of AD (Ex. 9). Then we introduce our gluing construction, and relate it to the correctness of AD (dgm. (6.1)). Finally, we state and prove a correctness theorem for all first order types by considering a category of manifolds (Th. 2).

Figure 5 (standard βη-laws, e.g. [56], for products, functions, variants and lists; we write =^{x₁,...,xₙ} to indicate that the variables x₁, ..., xₙ are not free in the left hand side):

case ⟨t₁, ..., tₙ⟩ of ⟨x₁, ..., xₙ⟩ → s = s[t₁/x₁, ..., tₙ/xₙ]
s[t/y] =^{x₁,...,xₙ} case t of ⟨x₁, ..., xₙ⟩ → s[⟨x₁, ..., xₙ⟩/y]
case ℓᵢ t of {ℓ₁ x₁ → s₁ | ··· | ℓₙ xₙ → sₙ} = sᵢ[t/xᵢ]
s[t/y] =^{x₁,...,xₙ} case t of {ℓ₁ x₁ → s[ℓ₁ x₁/y] | ··· | ℓₙ xₙ → s[ℓₙ xₙ/y]}
fold (x₁, x₂).t over [] from r = r
fold (x₁, x₂).t over s₁ :: s₂ from r = t[s₁/x₁, (fold (x₁, x₂).t over s₂ from r)/x₂]
u = s[[]/y], r[s/x₂] = s[x₁ :: y/y]  ⇒  s[t/y] =^{x₁,x₂} fold (x₁, x₂).r over t from u
(λx.t) s = t[s/x]
t =^{x} λx.t x

6.1. Syntactic categories.
Our language induces a syntactic category as follows.
Definition 2. Let Syn be the category whose objects are types, and where a morphism τ → σ is a term in context x : τ ⊢ t : σ, modulo the βη-laws (Fig. 5). Composition is by substitution.

For simplicity, we do not impose identities involving the primitive operations, such as the arithmetic identity x + y = y + x, in Syn. As is standard, this category has the following universal property.
Lemma 2 (e.g. [56]). For every bicartesian closed category C with list objects, and every choice of an object F(real) ∈ C and morphisms F(op) ∈ C(F(real)ⁿ, F(real)) for all op ∈ Opₙ and n ∈ ℕ, there is a unique functor F : Syn → C respecting this interpretation and preserving the bicartesian closed structure as well as list objects.

Proof notes.
The functor F : Syn → C is a canonical denotational semantics for the language, interpreting types as objects of C and terms as morphisms. For instance, F(τ → σ) := (Fτ → Fσ), the function space in the category C, and F(t s) is the composite (Ft, Fs); eval.

When C = Diff, the denotational semantics of the language in diffeological spaces (§4) arises in this way, as the unique structure preserving functor ⟦−⟧ : Syn → Diff satisfying ⟦real⟧ = ℝ, ⟦ς⟧ = ς and so on.

Example 9 (Canonical definition of forward AD). The forward AD macro −→D(k,R) (§3) also arises canonically in this way, as a functor Syn → Syn. Consider the unique cartesian closed functor F : Syn → Syn such that F(real) = real^{$\binom{R+k}{k}$} and

F(op) = z : (F(real) ∗ ... ∗ F(real)) ⊢ case z of ⟨x₁, ..., xₙ⟩ → −→D(k,R)(op(x₁, ..., xₙ)) : F(real).

Then for any type τ, F(τ) = −→D(k,R)(τ), and for any term x : τ ⊢ t : σ, F(t) = −→D(k,R)(t) as morphisms F(τ) → F(σ) in the syntactic category. This observation is a categorical counterpart to Lemma 1.

6.2. Categorical gluing and logical relations.
Gluing is a method for building new categorical models which has been used for many purposes, including logical relations and realizability [51]. Our logical relations argument in the proof of Theorem 1 can be understood in this setting. In this subsection, for the categorically minded, we explain this, and in doing so we quickly recover a correctness result for the more general language in Section 5 and for arbitrary first order types.

We define a category Gl_k whose objects are triples (X, X′, S) where X and X′ are diffeological spaces and S ⊆ P^{ℝᵏ}_X × P^{ℝᵏ}_{X′} is a relation between their k-dimensional plots. A morphism (X, X′, S) → (Y, Y′, T) is a pair of smooth functions f : X → Y, f′ : X′ → Y′, such that if (g, g′) ∈ S then (g; f, g′; f′) ∈ T. The idea is that this is a semantic domain where we can simultaneously interpret the language and its automatic derivatives.
The category Gl_k is bicartesian closed, has list objects, and the projection functor proj : Gl_k → Diff × Diff preserves this structure.
Proof notes.
The category Gl_k is a full subcategory of the comma category id_Set ↓ Diff(ℝᵏ, −) × Diff(ℝᵏ, −). The result thus follows by the general theory of categorical gluing (e.g. [32, Lemma 15]).

We give a semantics ⦅−⦆ = (⦅−⦆₁, ⦅−⦆₂, S₋) for the language in Gl_k, interpreting types τ as objects (⦅τ⦆₁, ⦅τ⦆₂, S_τ), and terms as morphisms. We let ⦅real⦆₁ = ℝ and ⦅real⦆₂ = ℝ^{$\binom{R+k}{k}$}, with the relation

S_real := { (f, ((∂^{α₁+...+α_k} f(x)/∂x₁^{α₁}···∂x_k^{α_k}))_{(α₁,...,α_k)=(0,...,0)}^{(R,0,...,0)}) | f : ℝᵏ → ℝ smooth }.

We interpret the operations op according to ⟦op⟧ in ⦅−⦆₁, but according to the (k, R)-Taylor representation of ⟦op⟧ in ⦅−⦆₂. For instance, when k = 2 and R = 2, ⦅∗⦆₂ : ℝ⁶ × ℝ⁶ → ℝ⁶ is

⦅∗⦆₂((x₀₀, x₀₁, x₀₂, x₁₀, x₁₁, x₂₀), (y₀₀, y₀₁, y₀₂, y₁₀, y₁₁, y₂₀)) :=
  ( x₀₀y₀₀,
    x₀₀y₀₁ + x₀₁y₀₀,
    x₀₀y₀₂ + 2x₀₁y₀₁ + x₀₂y₀₀,
    x₀₀y₁₀ + x₁₀y₀₀,
    x₀₀y₁₁ + x₀₁y₁₀ + x₁₀y₀₁ + x₁₁y₀₀,
    x₀₀y₂₀ + 2x₁₀y₁₀ + x₂₀y₀₀ ).

At this point one checks that these interpretations are indeed morphisms in Gl_k. This is equivalent to the statement that ⦅op⦆₂ is the (k, R)-Taylor representation of ⟦op⟧ (2.2). The remaining constructions of the language are interpreted using the categorical structure of Gl_k, following Lem. 2.
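As a sanity check of this (2,2) multiplication, here is a small Haskell transcription (our own sketch): the six components track the value and the partial derivatives up to order two, in the lexicographic index order (00, 01, 02, 10, 11, 20) used above.

```haskell
-- Six (2,2)-jet components: value, ∂y, ∂y², ∂x, ∂x∂y, ∂x², in that order.
data J22 = J22 Double Double Double Double Double Double deriving Show

mul :: J22 -> J22 -> J22
mul (J22 x00 x01 x02 x10 x11 x20) (J22 y00 y01 y02 y10 y11 y20) =
  J22 (x00*y00)
      (x00*y01 + x01*y00)
      (x00*y02 + 2*x01*y01 + x02*y00)
      (x00*y10 + x10*y00)
      (x00*y11 + x01*y10 + x10*y01 + x11*y00)
      (x00*y20 + 2*x10*y10 + x20*y00)

-- Seeds for coordinates x and y at a point (a, b).
seedX, seedY :: Double -> J22
seedX a = J22 a 0 0 1 0 0
seedY b = J22 b 1 0 0 0 0

-- All derivatives of g(x,y) = x·x·y up to order 2 at (3,7), in one pass:
-- value 63, ∂y = 9, ∂y² = 0, ∂x = 42, ∂x∂y = 6, ∂x² = 14.
main :: IO ()
main = print (mul (mul (seedX 3) (seedX 3)) (seedY 7))
```

Each component is just the product rule applied to the corresponding multi-index, which is the Faà di Bruno formula specialized to op = (∗).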
Notice that the following diagram commutes. One can check this by hand, or note that it follows from the initiality of $\mathbf{Syn}$ (Lem. 2): all the functors preserve all the structure.
\[
\begin{array}{ccc}
\mathbf{Syn} & \xrightarrow{\;(\mathrm{id},\,\overrightarrow{\mathcal{D}}_{(k,R)}(-))\;} & \mathbf{Syn} \times \mathbf{Syn} \\[2pt]
{\scriptstyle \llparenthesis - \rrparenthesis}\ \big\downarrow & & \big\downarrow\ {\scriptstyle \llbracket - \rrbracket \times \llbracket - \rrbracket} \\[2pt]
\mathbf{Gl}_k & \xrightarrow{\;\mathrm{proj}\;} & \mathbf{Diff} \times \mathbf{Diff}
\end{array}
\tag{6.1}
\]
We thus arrive at a restatement of the correctness theorem (Th. 1), which holds even for the extended language with variants and lists: for any $x_1 : \mathbf{real}, \ldots, x_n : \mathbf{real} \vdash t : \mathbf{real}$, the interpretations $(\llbracket t \rrbracket, \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket)$ are in the image of the projection $\mathbf{Gl}_k \to \mathbf{Diff} \times \mathbf{Diff}$, and hence $\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket$ is a $(k,R)$-Taylor representation of $\llbracket t \rrbracket$.
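The restated theorem can be checked numerically on examples. The following sketch is our own (it reuses Dual, primal, tangent, variable and sigmoid from the sketch in Example 9, and checkAt is a hypothetical name); it compares the forward AD derivative of a function assembled with a higher order combinator against a central finite difference, which by the correctness theorem agree up to the truncation error of the finite difference.

-- Check first order correctness: the dual number computed by the macro
-- agrees with a finite difference approximation of the derivative, even
-- when the function is assembled with higher order combinators.
checkAt :: Double -> (Double, Double)
checkAt x = (adDeriv, fdDeriv)
  where
    f :: Num a => (a -> a) -> a -> a        -- a higher order combinator
    f g y = g (g y) + g y
    body :: Dual -> Dual
    body = f (\y -> y * y + sigmoid y)
    adDeriv = tangent (body (variable x))   -- forward AD derivative
    h       = 1e-6
    atD v   = primal (body (variable v))    -- evaluate the primal only
    fdDeriv = (atD (x + h) - atD (x - h)) / (2 * h)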
6.3. Correctness at all first order types, via manifolds.

We now generalize Theorem 1 to hold at all first order types, not just the reals. To do this, we need to define the derivative of a smooth map between the interpretations of first order types. We do this by recalling the well-known theory of manifolds and jet bundles.

For our purposes, a smooth manifold $M$ is a second-countable Hausdorff topological space together with a smooth atlas: an open cover $\mathcal{U}$ together with homeomorphisms $\big(\phi_U : U \to \mathbb{R}^{n(U)}\big)_{U \in \mathcal{U}}$ (called charts, or local coordinates) such that $\phi_U^{-1}; \phi_V$ is smooth on its domain of definition for all $U, V \in \mathcal{U}$. A function $f : M \to N$ between manifolds is smooth if $\phi_U^{-1}; f; \psi_V$ is smooth for all charts $\phi_U$ and $\psi_V$ of $M$ and $N$, respectively. Let us write $\mathbf{Man}$ for this category. Our manifolds are slightly unusual because different charts in an atlas may have different finite dimensions $n(U)$; thus we consider manifolds with dimensions that are potentially unbounded, albeit locally finite. This does not affect the theory of differential geometry as far as we need it here.

Each open subset of $\mathbb{R}^n$ can be regarded as a manifold. This lets us regard the category of manifolds $\mathbf{Man}$ as a full subcategory of the category of diffeological spaces. We consider a manifold $(X, \{\phi_U\}_U)$ as a diffeological space with the same carrier set $X$ and where the plots $\mathcal{P}^U_X$, called the manifold diffeology, are the smooth functions in $\mathbf{Man}(U, X)$. A function $X \to Y$ is smooth in the sense of manifolds if and only if it is smooth in the sense of diffeological spaces [30]. For the categorically minded reader, this means that we have a full embedding of $\mathbf{Man}$ into $\mathbf{Diff}$. Moreover, the natural interpretation of the first order fragment of our language in $\mathbf{Man}$ coincides with that in $\mathbf{Diff}$. That is, the embedding of $\mathbf{Man}$ into $\mathbf{Diff}$ preserves finite products and countable coproducts (hence initial algebras of polynomial endofunctors).
Proposition 3.
Suppose that a type $\tau$ is first order, i.e. it is built only from reals, products, variants, and lists (or, again, arbitrary inductive types), and not function types. Then the diffeological space $\llbracket \tau \rrbracket$ is a manifold.

Proof notes. This is proved by induction on the structure of types. In fact, one may show that every such $\llbracket \tau \rrbracket$ is isomorphic to a manifold of the form $\coprod_{i=1}^n \mathbb{R}^{d_i}$, where the bound $n$ is either finite or $\infty$, but this isomorphism is typically not an identity function.

The restriction to first order types is necessary: for example, the space $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket$ is not a manifold, by a Borsuk-Ulam argument (see Appx. A).

We recall how the Taylor representation of any morphism $f : M \to N$ of manifolds is given by its action on jets [38, Chapter IV]. For each point $x$ in a manifold $M$, define the $(k,R)$-jet space $J^{(k,R)}_x M$ to be the set $\{\gamma \in \mathbf{Man}(\mathbb{R}^k, M) \mid \gamma(0) = x\}/\!\sim$ of equivalence classes $[\gamma]$ of $k$-dimensional plots $\gamma$ in $M$ based at $x$, where we identify $\gamma_1 \sim \gamma_2$ iff all partial derivatives of order $\leq R$ coincide, in the sense that
\[
\frac{\partial^{\alpha_1 + \cdots + \alpha_k}(\gamma_1; f)}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(0) = \frac{\partial^{\alpha_1 + \cdots + \alpha_k}(\gamma_2; f)}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(0)
\]
for all smooth $f : M \to \mathbb{R}$ and all multi-indices $(\alpha_1, \ldots, \alpha_k)$ with $\alpha_1 + \cdots + \alpha_k \leq R$. The $(1,1)$-jet space is better known as a tangent space. The $(k,R)$-jet bundle (a.k.a. tangent bundle, in case $(k,R) = (1,1)$) of $M$ is the set $J^{(k,R)}(M) \stackrel{\mathrm{def}}{=} \coprod_{x \in M} J^{(k,R)}_x(M)$. The charts of $M$ equip $J^{(k,R)}(M)$ with a canonical manifold structure. The (manifold) diffeology of these jet bundles can be concisely summarized by the plots
\[
\mathcal{P}^U_{J^{(k,R)}(M)} = \big\{ f : U \to |J^{(k,R)}(M)| \ \big|\ \exists g \in \mathcal{P}^{U \times \mathbb{R}^k}_M.\ \forall u \in U.\ (g(u, 0), [v \mapsto g(u, v)]) = f(u) \big\}.
\]
Then $J^{(k,R)}$ acts on smooth maps $f : M \to N$: the map $J^{(k,R)}(f) : J^{(k,R)}(M) \to J^{(k,R)}(N)$ is defined as $J^{(k,R)}(f)(x, [\gamma]) \stackrel{\mathrm{def}}{=} (f(x), [\gamma; f])$. In local coordinates, this action $J^{(k,R)}(f)$ is seen to coincide precisely with the $(k,R)$-Taylor representation of $f$, given by the Faà di Bruno formula [50]. All told, the $(k,R)$-jet bundle is a functor $J^{(k,R)} : \mathbf{Man} \to \mathbf{Man}$ [38].
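For concreteness, the smallest interesting instance of this local-coordinate action, $(k,R) = (1,2)$, reads as follows; this is a standard calculus computation, included here only as a worked example.

% The (1,2)-jet action in local coordinates: a point of J^{(1,2)}(R) is
% a triple (x, v_1, v_2), recording the value and the first two
% derivatives at 0 of a representing curve
% \gamma(t) = x + v_1 t + v_2 t^2/2 + o(t^2). For smooth f : R -> R,
\[
  J^{(1,2)}(f)(x, v_1, v_2)
  \;=\;
  \bigl(\, f(x),\; f'(x)\, v_1,\; f''(x)\, v_1^{2} + f'(x)\, v_2 \,\bigr),
\]
% since (\gamma; f)'(0) = f'(x) v_1 and, by the order-2 Fa\`a di Bruno
% (chain) rule, (\gamma; f)''(0) = f''(x) v_1^2 + f'(x) v_2.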
We can understand the jet bundle of a composite space in terms of that of its parts.
Lemma 3.
There are canonical isomorphisms $J^{(k,R)}(\coprod_{i=1}^\infty M_i) \cong \coprod_{i=1}^\infty J^{(k,R)}(M_i)$ and $J^{(k,R)}(M_1 \times \cdots \times M_n) \cong J^{(k,R)}(M_1) \times \cdots \times J^{(k,R)}(M_n)$.

Proof notes. For disjoint unions, notice that smooth morphisms from $\mathbb{R}^k$ into a disjoint union of manifolds always factor over a single inclusion, because $\mathbb{R}^k$ is connected. For products, it is well known that partial derivatives of a morphism $(f_1, \ldots, f_n)$ are calculated component-wise [43, ex. 3-2].

We define a canonical isomorphism $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket} : \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\tau) \rrbracket \to J^{(k,R)}(\llbracket \tau \rrbracket)$ for every first order type $\tau$, by induction on the structure of types. We let $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \mathbf{real} \rrbracket} : \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\mathbf{real}) \rrbracket \to J^{(k,R)}(\llbracket \mathbf{real} \rrbracket)$ send a tuple of Taylor coefficients to the jet of the corresponding Taylor polynomial; in the case $(k,R) = (1,1)$, it is given by $\phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \mathbf{real} \rrbracket}(x, x') \stackrel{\mathrm{def}}{=} (x, [t \mapsto x + x' t])$. For the other types, we use Lemma 3. We can now phrase correctness at all first order types.

Theorem 2 (Semantic correctness of $\overrightarrow{\mathcal{D}}_{(k,R)}$ (full)). For any first order type $\tau$, any first order context $\Gamma$, and any term $\Gamma \vdash t : \tau$, the syntactic translation $\overrightarrow{\mathcal{D}}_{(k,R)}$ coincides with the $(k,R)$-jet bundle functor, modulo these canonical isomorphisms: the following diagram commutes.
\[
\begin{array}{ccc}
\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\Gamma) \rrbracket & \xrightarrow{\;\llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(t) \rrbracket\;} & \llbracket \overrightarrow{\mathcal{D}}_{(k,R)}(\tau) \rrbracket \\[2pt]
{\scriptstyle \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \Gamma \rrbracket}}\ \big\downarrow \cong & & \cong \big\downarrow\ {\scriptstyle \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket}} \\[2pt]
J^{(k,R)}(\llbracket \Gamma \rrbracket) & \xrightarrow{\;J^{(k,R)}(\llbracket t \rrbracket)\;} & J^{(k,R)}(\llbracket \tau \rrbracket)
\end{array}
\]

Proof notes.
For any $k$-dimensional plot $\gamma \in \mathbf{Man}(\mathbb{R}^k, M)$, let $\bar{\gamma} \in \mathbf{Man}(\mathbb{R}^k, J^{(k,R)}(M))$ be the $(k,R)$-jet curve, given by $\bar{\gamma}(x) = (\gamma(x), [t \mapsto \gamma(x + t)])$. First, we note that a smooth map $h : J^{(k,R)}(M) \to J^{(k,R)}(N)$ is of the form $J^{(k,R)}(g)$ for some $g : M \to N$ if for all smooth $\gamma : \mathbb{R}^k \to M$ we have $\bar{\gamma}; h = \overline{(\gamma; g)} : \mathbb{R}^k \to J^{(k,R)}(N)$. This generalizes (2.2). Second, for any first order type $\tau$, $S_{\llbracket \tau \rrbracket} = \{(f, \tilde{f}) \mid \tilde{f}; \phi^{\overrightarrow{\mathcal{D}}}_{\llbracket \tau \rrbracket} = \bar{f}\}$. This is shown by induction on the structure of types. We conclude the theorem from diagram (6.1), by putting these two observations together.

7. Discussion: What are derivatives of higher order functions?
In our gluing categories $\mathbf{Gl}_k$ of Section 6.2, we have avoided the question of what semantic derivatives should be associated with higher order functions. Our syntactic macro $\overrightarrow{\mathcal{D}}$ provides a specific derivative for every definable function, but in the model $\mathbf{Gl}_k$ there is only a relation between plots and their corresponding Taylor representations, and this relation is not necessarily single-valued. Our approach has been rather indifferent about what "the" correct derivative of a higher order function should be. Instead, all we have cared about is that we are using "a" derivative that is correct, in the sense that it can never be used to produce incorrect derivatives for first order functions, where we do have an unambiguous notion of correct derivative.

7.1. Automatic derivatives of higher order functions may not be unique!
According to our gluing semantics, a function $g : \llparenthesis \tau \rrparenthesis_2 \to \llparenthesis \sigma \rrparenthesis_2$ defines a correct $(k,R)$-Taylor representation of a function $f : \llparenthesis \tau \rrparenthesis_1 \to \llparenthesis \sigma \rrparenthesis_1$ iff $(f, g)$ defines a morphism $\llparenthesis \tau \rrparenthesis \to \llparenthesis \sigma \rrparenthesis$ in $\mathbf{Gl}_k$. In particular, there is no guarantee that every $f$ has a unique correct $(k,R)$-Taylor representation $g$. (Such Taylor representations are, in fact, unique when $\tau$ and $\sigma$ are first order types.)

For a concrete example showing that derivatives of higher order functions need not be unique in our framework, let us consider the case $(k,R) = (1,1)$ and focus on first derivatives of the evaluation function
\[
\mathrm{ev} : \mathbb{R} \to \llbracket (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrbracket = \mathbb{R} \to (\mathbb{R} \Rightarrow \mathbb{R}) \Rightarrow \mathbb{R};\quad r \mapsto (f \mapsto f(r)).
\]
The gluing relation $\llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$ in $\mathbf{Gl}_1$ relates curves $\gamma : \mathbb{R} \to (\mathbb{R} \Rightarrow \mathbb{R}) \Rightarrow \mathbb{R}$ to "tangent curves" $\gamma' : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R}$. In this relation, the function $\mathrm{ev}$ is related to at least two different tangent curves.

Lemma 4.
We have a smooth map
\[
\mathrm{sort} : (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R});\quad
f \mapsto \big((r, v) \mapsto \big(\pi_1(f(r, 0)),\ \pi_2(f(r, 0)) + v \cdot \nabla\big((-, 0); f; \pi_1\big)(r)\big)\big).
\]
Proof. Let $f \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}}$ and let $\gamma_1, \gamma_2 \in \mathcal{P}^U_{\mathbb{R}}$. Then $(u, r) \mapsto f(u)(r, 0) \in \mathcal{P}^{U \times \mathbb{R}}_{\mathbb{R} \times \mathbb{R}}$ by definition of the exponential in $\mathbf{Diff}$, so both $(u, r) \mapsto \pi_1(f(u)(r, 0))$ and $(u, r) \mapsto \pi_2(f(u)(r, 0))$ lie in $\mathbf{Man}(U \times \mathbb{R}, \mathbb{R})$. Therefore, we also have $u \mapsto \nabla\big((-, 0); f(u); \pi_1\big)(\gamma_1(u)) \in \mathbf{Man}(U, \mathbb{R}) = \mathcal{P}^U_{\mathbb{R}}$, as we are working with infinitely differentiable smooth maps. Consequently,
\[
u \mapsto (f; \mathrm{sort})(u)(\gamma_1(u), \gamma_2(u)) = \big(\pi_1(f(u)(\gamma_1(u), 0)),\ \pi_2(f(u)(\gamma_1(u), 0)) + \gamma_2(u) \cdot \nabla\big((-, 0); f(u); \pi_1\big)(\gamma_1(u))\big) \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R}},
\]
by definition of the product in $\mathbf{Diff}$. It follows that $(f; \mathrm{sort}) \in \mathcal{P}^U_{\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}}$.

This map is idempotent, and it converts any map $\mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R}$ into one that treats its tangent argument $v$ according to the dual-numbers representation of its first component. For example, $\mathrm{sort}(\mathrm{swap})$ is the map $(r, v) \mapsto (0, r)$, which ignores $v$ entirely, where $\mathrm{swap} : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R};\ (r, r') \mapsto (r', r)$.
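The following Haskell sketch may help in reading the definition of sort. It is our own illustration: the semantic gradient $\nabla$ is replaced by a central finite difference as a numerical stand-in (legitimate here only because all maps involved are smooth), and DualMap, gradFD, sortMap and swap' are hypothetical names.

-- A map R x R -> R x R as a Haskell function, and the map `sort`
-- from Lemma 4 with the derivative approximated numerically.
type DualMap = (Double, Double) -> (Double, Double)

gradFD :: (Double -> Double) -> Double -> Double
gradFD g r = (g (r + h) - g (r - h)) / (2 * h)  where h = 1e-6

sortMap :: DualMap -> DualMap
sortMap f = \(r, v) ->
  ( fst (f (r, 0))                               -- value at (r, 0)
  , snd (f (r, 0))                               -- base derivative part
      + v * gradFD (\r' -> fst (f (r', 0))) r )  -- dual-numbers part

swap' :: DualMap
swap' (r, r') = (r', r)
-- sortMap swap' (r, v) = (0, r) for all v, since fst (swap' (r, 0)) = 0
-- is constant and snd (swap' (r, 0)) = r.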
Lemma 5. We have that both $(\mathrm{ev}, \mathrm{ev}'_1) \in \llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$ and $(\mathrm{ev}, \mathrm{ev}'_2) \in \llparenthesis (\mathbf{real} \to \mathbf{real}) \to \mathbf{real} \rrparenthesis$, for
\[
\mathrm{ev}'_1 : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R};\quad a \mapsto (f \mapsto f(a, 1))
\]
\[
\mathrm{ev}'_2 : \mathbb{R} \to (\mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}) \Rightarrow \mathbb{R} \times \mathbb{R};\quad a \mapsto (f \mapsto (\mathrm{sort}\ f)(a, 1)).
\]
Proof.
We need to show that for any $(\gamma, \gamma') \in \llparenthesis \mathbf{real} \to \mathbf{real} \rrparenthesis$, we have $(x \mapsto \mathrm{ev}(x)(\gamma(x)),\ x \mapsto \mathrm{ev}'_i(x)(\gamma'(x))) \in \llparenthesis \mathbf{real} \rrparenthesis$, which means that $x \mapsto \mathrm{ev}'_i(x)(\gamma'(x)) = (x \mapsto \mathrm{ev}(x)(\gamma(x)),\ \nabla(x \mapsto \mathrm{ev}(x)(\gamma(x))))$, for $i = 1, 2$.

That is, we need to show this for any $\gamma : \mathbb{R} \to \mathbb{R} \Rightarrow \mathbb{R}$ and $\gamma' : \mathbb{R} \to \mathbb{R} \times \mathbb{R} \Rightarrow \mathbb{R} \times \mathbb{R}$ such that for any $(\delta, \delta') \in \llparenthesis \mathbf{real} \rrparenthesis$ (which means that $\delta : \mathbb{R} \to \mathbb{R}$ is smooth and $\delta' = (\delta, \nabla\delta)$), we have $(r \mapsto \gamma(r)(\delta(r)),\ r \mapsto \gamma'(r)(\delta'(r))) \in \llparenthesis \mathbf{real} \rrparenthesis$, which means that $r \mapsto \gamma'(r)(\delta(r), \nabla\delta(r)) = (r \mapsto \gamma(r)(\delta(r)),\ \nabla(r \mapsto \gamma(r)(\delta(r))))$.

Focussing first on $\mathrm{ev}'_1$: we need to show that $x \mapsto \mathrm{ev}'_1(x)(\gamma'(x)) = (x \mapsto \mathrm{ev}(x)(\gamma(x)),\ \nabla(x \mapsto \mathrm{ev}(x)(\gamma(x))))$. Inlining the definition of $\mathrm{ev}'_1$: we need to show that $x \mapsto \gamma'(x)(x, 1) = (x \mapsto \gamma(x)(x),\ \nabla(x \mapsto \gamma(x)(x)))$. This follows by the assumption, choosing $\delta(r) = r$ (and hence $\delta'(r) = (r, 1)$).

Focussing next on $\mathrm{ev}'_2$: inlining the definitions of $\mathrm{ev}'_2$ and $\mathrm{sort}$, we need to show that
\[
x \mapsto \big(\pi_1(\gamma'(x)(x, 0)),\ \pi_2(\gamma'(x)(x, 0)) + \nabla\big((-, 0); \gamma'(x); \pi_1\big)(x)\big) = \big(x \mapsto \gamma(x)(x),\ \nabla(x \mapsto \gamma(x)(x))\big).
\]
By the assumption, instantiated at the constant curves $\delta(x') = r$ (and hence $\delta'(x') = (r, 0)$), we have $\pi_1(\gamma'(x)(r, 0)) = \gamma(x)(r)$ and $\pi_2(\gamma'(x)(r, 0)) = \nabla(x' \mapsto \gamma(x')(r))(x)$, for all $x$ and $r$. Writing $g(x, r) = \gamma(x)(r)$, the left hand side is therefore $x \mapsto (g(x, x),\ \partial_1 g(x, x) + \partial_2 g(x, x))$, which equals the right hand side by the chain rule.

Yet $\mathrm{ev}'_1 \neq \mathrm{ev}'_2$, as $\mathrm{ev}'_1(a)(\mathrm{swap}) = (1, a)$ while $\mathrm{ev}'_2(a)(\mathrm{swap}) = (0, a)$. Thus $\mathrm{ev}'_1$ and $\mathrm{ev}'_2$ are both "valid" semantic derivatives of the evaluation function $\mathrm{ev}$ in our framework. In particular, this shows that semantic derivatives of higher order functions need not be unique. Our macro $\overrightarrow{\mathcal{D}}$ will return $\mathrm{ev}'_1$, but everything would still work just as well if it instead returned $\mathrm{ev}'_2$.
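In the same sketch notation as before (reusing DualMap, sortMap and swap' from the listing after Lemma 4; ev1' and ev2' are our own names), the two tangent curves and their disagreement on swap can be observed directly.

-- The two tangent curves of Lemma 5. Both are valid in the gluing
-- semantics, yet they disagree on the argument swap', which is not in
-- the image of the AD macro.
ev1' :: Double -> DualMap -> (Double, Double)
ev1' a f = f (a, 1)

ev2' :: Double -> DualMap -> (Double, Double)
ev2' a f = sortMap f (a, 1)

-- ev1' 3 swap' == (1.0, 3.0)
-- ev2' 3 swap' == (0.0, 3.0)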
7.2. Canonical derivatives of higher order functions?

Differential geometers and analysts have long pursued notions of a canonical derivative of various higher order functions arising, for example, in the calculus of variations and in the study of infinite dimensional Lie groups [39]. Such an uncontroversial notion of derivative exists on various (infinite dimensional) spaces of functions that form suitable (so-called convenient) vector spaces, or manifolds locally modelled on such vector spaces. At the level of generality of diffeological spaces, however, various natural notions of derivative that coincide on convenient vector spaces start to diverge, and it is no longer clear what the best definition of a derivative is [15]. Another, fundamentally different setting that defines canonical derivatives of many higher order functions is given by synthetic differential geometry [37].

While derivatives of higher order functions are of deep interest and have rightly been studied in their own right in differential geometry, we believe the situation is subtly different in computer science:

(1) In programming applications, we use higher order programs only to construct the first order functions that we ultimately end up running and calculating derivatives of. Automatic differentiation methods can exploit this freedom: derivatives of higher order functions only matter in so far as they can be used to construct the correct derivatives of first order functions, so we can choose an easy and cheap notion of derivative among the valid options. As such, the fact that our semantics does not commit to a single notion of derivative of higher order functions can be seen as a feature, rather than a bug, that models the pragmatics of programming practice.

(2) While function spaces in differential geometry are typically infinite dimensional objects that are unsuitable for representation in the finite memory of a computer, higher order functions as used in programming are much more restricted: all they can do is call a function on finitely many arguments and analyse the function outputs. As such, function types in programming can be thought of as (locally) finite dimensional. In case a canonical notion of automatic derivative of higher order functions is really desired, it may be worth pursuing a more intensional notion of semantics, such as one based on game semantics. Such intensional techniques could capture the computational notion of higher order function better than our current (and other) extensional semantics using existing techniques from differential geometry. We hope that an exploration of such techniques might lead to an appropriate notion of computable derivative, even for higher order functions.
8. Discussion and future work
8.1. Summary.
We have shown that diffeological spaces provide a denotational semantics for a higher order language with variants and inductive types, and that this semantics supports correctness proofs of forward-mode AD macros, including the $(k,R)$-Taylor methods for computing higher order derivatives.

8.2. Connection to the state of the art in AD implementation.
As is common in denotational semantics research, we have here focused on an idealized language and simple translations to illustrate the main aspects of the method. There are a number of points where our approach is simplistic compared to the advanced current practice, as we now explain.
8.2.1. Representation of vectors.
In our examples we have treated $n$-vectors as tuples of length $n$. This style of programming does not scale to large $n$. A better solution would be to use array types, following [61]. As demonstrated by [14], our categorical semantics and correctness proofs straightforwardly extend to cover them, in a similar way to our treatment of lists. In fact, [14] formalizes our correctness arguments in Coq and extends them to apply to the system of [61].

8.2.2. Efficient forward-mode AD.
For AD to be useful, it must be fast. The $(1,1)$-AD macro $\overrightarrow{\mathcal{D}}_{(1,1)}$ that we use is the basis of an efficient AD library [61]. Numerous optimizations are needed, ranging from algebraic manipulations, to partial evaluations, to the use of an optimizing C compiler, but the resulting implementation is performant in experiments [61]. The Coq formalization [14] validates some of these manipulations using a semantics similar to ours. We believe the implementation of [61] can be extended to apply to the more general $(k,R)$-AD methods we have described in this paper through minor changes.
8.2.3. Reverse-mode and mixed-mode AD.
While forward-mode AD methods are useful, many applications require reverse-mode AD, or even mixed-mode AD, for efficiency. In [28], we described how our correctness proof applies to a continuation-based AD technique that closely resembles reverse-mode AD, but that only has the correct complexity under a non-standard operational semantics [11] (in particular, the linear factoring rule is crucial). It remains to be seen whether this technique and its correctness proof can be adapted to yield genuine reverse AD under a standard operational semantics. Alternatively, by relying on a variation of our techniques, [66] gives a correctness proof of a rather different $(1,1)$-reverse AD method. We leave for future work the extension of such techniques to $(k,R)$-reverse AD and mixed-mode AD for efficiently computing higher order derivatives.

8.2.4. Other language features.
The idealized languages that we have considered so far do not touch on several useful language constructs: for example, functions that are partial (such as division) or partly smooth (such as ReLU), phenomena such as iteration and recursion, and probabilities. Recent work by MV [65] shows how our analysis extends to $(k,R)$-forward mode AD of programs with iteration and recursion. We leave the analysis of AD of probabilistic programs for future work.

Acknowledgment
We have benefited from discussing this work with many people, including M. Betancourt, B. Carpenter, O. Kammar, C. Mak, L. Ong, B. Pearlmutter, G. Plotkin, A. Shaikhha, J. Sigal, and others. Our work is supported by the Royal Society and by a Facebook Research Award. In the course of this work, MV has also been employed at Oxford (EPSRC Project EP/M023974/1) and at Columbia in the Stan development team. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proc. OSDI 2016, pages 265–283, 2016.
[2] Martín Abadi and Gordon D. Plotkin. A simple differentiable programming language. In Proc. POPL 2020. ACM, 2020.
[3] Shun-ichi Amari. Differential-Geometrical Methods in Statistics, volume 28. Springer Science & Business Media, 2012.
[4] John Baez and Alexander Hoffnung. Convenient categories of smooth spaces. Transactions of the American Mathematical Society, 363(11):5789–5825, 2011.
[5] Gilles Barthe, Raphaëlle Crubillé, Ugo Dal Lago, and Francesco Gavazzo. On the versatility of open logical relations: Continuity, automatic differentiation, and a containment theorem. In Proc. ESOP 2020. Springer, 2020. To appear.
[6] Claus Bendtsen and Ole Stauning. FADBAD, a flexible C++ package for automatic differentiation. Technical Report IMM-REP-1996-17, Department of Mathematical Modelling, Technical University of Denmark, 1996.
[7] Claus Bendtsen and Ole Stauning. TADIFF, a flexible C++ package for automatic differentiation. Technical Report IMM-REP-1997-07, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, 1997.
[8] Gilbert Bernstein, Michael Mara, Tzu-Mao Li, Dougal Maclaurin, and Jonathan Ragan-Kelley. Differentiating a tensor language. arXiv preprint arXiv:2008.11256, 2020.
[9] Michael Betancourt. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592, 2018.
[10] Jesse Bettencourt, Matthew J. Johnson, and David Duvenaud. Taylor-mode automatic differentiation for higher-order derivatives in JAX. 2019.
[11] Aloïs Brunel, Damiano Mazza, and Michele Pagani. Backpropagation in the simply typed lambda-calculus with linear negation. In Proc. POPL 2020, 2020.
[12] Bob Carpenter, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.
[13] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[14] Curtis Chin Jen Sem. Formalized correctness proofs of automatic differentiation in Coq. Master's Thesis, Utrecht University, 2020. Thesis: https://dspace.library.uu.nl/handle/1874/400790. Coq code: https://github.com/crtschin/thesis.
[15] J. Daniel Christensen and Enxin Wu. Tangent spaces and tangent bundles for diffeological spaces. arXiv preprint arXiv:1411.5425, 2014.
[16] J. Robin B. Cockett, Geoff S. H. Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon D. Plotkin, and Dorette Pronk. Reverse derivative categories. In Proc. CSL 2020, 2020.
[17] G. Constantine and T. Savits. A multivariate Faà di Bruno formula with applications. Transactions of the American Mathematical Society, 348(2):503–520, 1996.
[18] Geoff Cruttwell, Jonathan Gallagher, and Ben MacAdam. Towards formalizing and extending differential programming using tangent categories. In Proc. ACT 2019, 2019.
[19] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[20] Thomas Ehrhard and Laurent Regnier. The differential lambda-calculus. Theoretical Computer Science, 309(1-3):1–41, 2003.
[21] Conal Elliott. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):70, 2018.
[22] L. Hernández Encinas and J. Muñoz Masqué. A short proof of the generalized Faà di Bruno's formula. Applied Mathematics Letters, 16(6):975–979, 2003.
[23] Brendan Fong, David Spivak, and Rémy Tuyéras. Backprop as functor: A compositional perspective on supervised learning. In Proc. LICS 2019, pages 1–13. IEEE, 2019.
[24] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. Systems for Machine Learning, 2018.
[25] Izrail Moiseevitch Gelfand, Richard A. Silverman, et al. Calculus of Variations. Courier Corporation, 2000.
[26] Andreas Griewank, Jean Utke, and Andrea Walther. Evaluating higher derivative tensors by forward propagation of univariate Taylor series. Mathematics of Computation, 69(231):1117–1130, 2000.
[27] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
[28] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. In Proc. FoSSaCS 2020, pages 319–338, 2020.
[29] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. Full version, 2020. arXiv:2001.02209.
[30] Patrick Iglesias-Zemmour. Diffeology. American Mathematical Society, 2013.
[31] Bart Jacobs and Jan J. M. M. Rutten. An introduction to (co)algebras and (co)induction. In Advanced Topics in Bisimulation and Coinduction, pages 38–99. Cambridge University Press, 2011.
[32] Peter T. Johnstone, Stephen Lack, and Paweł Sobociński. Quasitoposes, quasiadhesive categories and Artin glueing. In Proc. CALCO 2007, 2007.
[33] Jerzy Karczmarczuk. Functional differentiation of computer programs. Higher-Order and Symbolic Computation, 14(1):35–57, 2001.
[34] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[35] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[36] Dana A. Knoll and David E. Keyes. Jacobian-free Newton–Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2):357–397, 2004.
[37] Anders Kock. Synthetic Differential Geometry, volume 333. Cambridge University Press, 2006.
[38] Ivan Kolář, Jan Slovák, and Peter W. Michor. Natural Operations in Differential Geometry. 1999.
[39] Andreas Kriegl and Peter W. Michor. The Convenient Setting of Global Analysis, volume 53. American Mathematical Society, 1997.
[40] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
[41] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. Advances in Neural Information Processing Systems, 31:2750–2759, 2018.
[42] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. A simple and efficient tensor calculus. In AAAI, pages 4527–4534, 2020.
[43] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.
[44] Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions. arXiv preprint arXiv:2006.06903, 2020.
[45] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[46] Carol Mak and Luke Ong. A differential-form pullback programming language for higher-order reverse-mode automatic differentiation. arXiv:2002.08241, 2020.
[47] Oleksandr Manzyuk. A simply typed λ-calculus of forward automatic differentiation. In Proc. MFPS 2012, 2012.
[48] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
[49] Damiano Mazza and Michele Pagani. Automatic differentiation in PCF. arXiv preprint arXiv:2011.03335, 2020.
[50] Joël Merker. Four explicit formulas for the prolongations of an infinitesimal Lie symmetry and multivariate Faà di Bruno formulas. arXiv preprint math/0411650, 2004.
[51] John C. Mitchell and Andre Scedrov. Notes on sconing and relators. In International Workshop on Computer Science Logic, pages 352–378. Springer, 1992.
[52] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall / CRC Press, 2011.
[53] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[54] Barak A. Pearlmutter and Jeffrey Mark Siskind. Lazy multivariate higher-order forward-mode AD. ACM SIGPLAN Notices, 42(1):155–160, 2007.
[55] Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):7, 2008.
[56] Andrew M. Pitts. Categorical logic. Technical report, University of Cambridge, Computer Laboratory, 1995.
[57] Gordon D. Plotkin. Some principles of differential programming languages. Invited talk, POPL 2018, 2018.
[58] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
[59] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[60] Thomas H. Savits. Some statistical applications of Faà di Bruno. Journal of Multivariate Analysis, 97(10):2131–2140, 2006.
[61] Amir Shaikhha, Andrew Fitzgibbon, Dimitrios Vytiniotis, and Simon Peyton Jones. Efficient differentiable programming in a functional array-processing language. Proceedings of the ACM on Programming Languages, 3(ICFP):97, 2019.
[62] Benjamin Sherman, Jesse Michel, and Michael Carbin. λS: Computable semantics for differentiable programming with higher-order functions and datatypes. arXiv preprint arXiv:2007.08017, 2020.
[63] Jean-Marie Souriau. Groupes différentiels. In Differential Geometrical Methods in Mathematical Physics, pages 91–128. Springer, 1980.
[64] Andrew Stacey. Comparative smootheology. Theory and Applications of Categories, 25(4):64–117, 2011.
[65] Matthijs Vákár. Denotational correctness of forward-mode automatic differentiation for iteration and recursion. arXiv preprint arXiv:2007.05282, 2020.
[66] Matthijs Vákár. Reverse AD at higher types: Pure, principled and denotationally correct. In Proc. ESOP 2021. Springer, 2021. To appear.
[67] Bart van Merriënboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. Automatic differentiation in ML: Where we are and where we should be going. In Advances in Neural Information Processing Systems, pages 8757–8767, 2018.
[68] Fei Wang, Xilun Wu, Gregory Essertel, James Decker, and Tiark Rompf. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. Proceedings of the ACM on Programming Languages, 3(ICFP), 2019.
[69] Mu Wang, Assefaw Gebremedhin, and Alex Pothen. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Mathematical Programming Computation, 8(4):393–433, 2016.
[70] Shaopeng Zhu, Shih-Han Hung, Shouvanik Chakrabarti, and Xiaodi Wu. On the principles of differentiable quantum programming languages. arXiv preprint arXiv:2004.01122, 2020.
Appendix A. CartSp and Man are not cartesian closed categories
Lemma 6.
There is no continuous injection $\mathbb{R}^{d+1} \to \mathbb{R}^d$.

Proof. If there were, it would restrict to a continuous injection $S^d \to \mathbb{R}^d$. The Borsuk-Ulam theorem, however, tells us that every continuous $f : S^d \to \mathbb{R}^d$ has some $x \in S^d$ such that $f(x) = f(-x)$, which is a contradiction.

Let us define the terms
\[
x_1 : \mathbf{real}, \ldots, x_n : \mathbf{real} \vdash t_n \stackrel{\mathrm{def}}{=} \lambda y.\ x_1 + x_2 * y + \cdots + x_n * y^{n-1} : \mathbf{real} \to \mathbf{real}.
\]
Assuming that $\mathbf{CartSp}$ (respectively $\mathbf{Man}$) is cartesian closed, observe that these terms get interpreted as injective continuous (because smooth) functions $\mathbb{R}^n \to \llbracket \mathbf{real} \to \mathbf{real} \rrbracket$ in $\mathbf{CartSp}$ and $\mathbf{Man}$: a polynomial function is uniquely determined by its coefficients.

Theorem 3. $\mathbf{CartSp}$ is not cartesian closed.

Proof. In case $\mathbf{CartSp}$ were cartesian closed, we would have $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket = \mathbb{R}^d$ for some $d$. Then we would get, in particular, a continuous injection $\llbracket t_{d+1} \rrbracket : \mathbb{R}^{d+1} \to \mathbb{R}^d$, which contradicts Lemma 6.

Theorem 4. $\mathbf{Man}$ is not cartesian closed.

Proof. Observe that we have $\iota_n : \mathbb{R}^n \to \mathbb{R}^{n+1};\ \langle a_1, \ldots, a_n \rangle \mapsto \langle a_1, \ldots, a_n, 0 \rangle$ and that $\iota_n; \llbracket t_{n+1} \rrbracket = \llbracket t_n \rrbracket$. Let us write $A_n$ for the image of $\llbracket t_n \rrbracket$ and $A = \bigcup_{n \in \mathbb{N}} A_n$. Then $A_n$ is connected because it is the continuous image of a connected set. Similarly, $A$ is connected because it is a non-disjoint union of connected sets. This means that $A$ lies in a single connected component of $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket$, which is a manifold of some finite dimension, say $d$. Take some $x \in \mathbb{R}^{d+1}$ (say, $0$), take some open $d$-ball $U$ around $\llbracket t_{d+1} \rrbracket(x)$, and take some open $(d+1)$-ball $V$ around $x$ in $\llbracket t_{d+1} \rrbracket^{-1}(U)$. Then $\llbracket t_{d+1} \rrbracket$ restricts to a continuous injection from $V$ to $U$, or equivalently, from $\mathbb{R}^{d+1}$ to $\mathbb{R}^d$, which contradicts Lemma 6.

This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.