A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation
arXiv preprint [cs.PL], February
Carol Mak and Luke Ong
Abstract
Building on the observation that reverse-mode automatic differentiation (AD), a generalisation of backpropagation, can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential λ-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.

1 Introduction

Automatic differentiation (AD) [34] is widely considered the most efficient and accurate algorithm for computing derivatives, thanks largely to the chain rule. There are two modes of AD:

• Forward-mode AD evaluates the chain rule from inputs to outputs; it has time complexity that scales with the number of inputs, and constant space complexity.
• Reverse-mode AD, a generalisation of backpropagation, evaluates the chain rule (in dual form) from outputs to inputs; it has time complexity that scales with the number of outputs, and space complexity that scales with the number of intermediate variables.

In machine learning applications such as neural networks, the number of input parameters is usually considerably larger than the number of outputs. For this reason, reverse-mode AD has been the preferred method of differentiation, especially in deep learning applications. (See Baydin et al. [5] for an excellent survey of AD.)

The only downside of reverse-mode AD is its rather involved definition, which has led to a variety of complicated implementations in neural networks. On the one hand, TensorFlow [1] and Theano [3] employ the define-and-run approach, where the model is constructed as a computational graph before execution. On the other hand, PyTorch [25] and Autograd [20] employ the define-by-run approach, where the computational graph is constructed dynamically during the execution.
Can we replace the traditional graphical representation of reverse-mode AD by a simple yet expressive framework?
Indeed, there have been calls from the neural network community for the development of differentiable programming [14, 19, 24], based on a higher-order functional language with a built-in differential operator that returns the derivative of a given program via reverse-mode AD. Such a language would free the programmer from the implementational details of differentiation. Programmers would be able to concentrate on the construction of machine learning models, and train them by calling the built-in differential operator on the cost function of their models.

The goal of this work is to present a simple higher-order programming language with an explicit differential operator, such that its reduction semantics is exactly reverse-mode AD, in a truly higher-order manner.

The syntax of our language is inspired by Ehrhard and Regnier [15]'s differential λ-calculus, which is an extension of the simply-typed λ-calculus with a differential operator that mimics standard symbolic differentiation (but not reverse-mode AD). Their definition of differentiation via a linear substitution provides a good foundation for our language. The reduction strategy of our language uses differential λ-category [11] (the model of differential λ-calculus) as a guide. Differential λ-category is a Cartesian closed differential category [9], and hence enjoys the fundamental properties of derivatives, and behaves well with exponentials (curry).

Contributions.
Our starting point (Section 2.2) is the observation that the computation of reverse-mode AD can naturally be expressed as a transformation of pullbacks of differential 1-forms. We argue that this viewpoint is essential for understanding reverse-mode AD in a functional setting. Standard reverse-mode AD (as presented in [4, 5]) is only defined in Euclidean spaces.

We present (in Section 3) a simple higher-order programming language, extending the simply-typed λ-calculus [12] with an explicit differential operator called the pullback, (Ω λx.P) · S, which serves as a reverse-mode AD simulator. Using differential λ-category [11] as a guide, we design a reduction strategy for our language so that the reduction of the application, ((Ω λx.P) · (λx.e_p*)) S, mimics reverse-mode AD in computing the p-th row of the Jacobian matrix (derivative) of the function λx.P at the point S, where e_p is the column vector with 1 at the p-th position and 0 everywhere else. Moreover, we show how our reduction semantics can be adapted to a continuation passing style evaluation (Section 3.5).

Owing to the higher-order nature of our language, standard differential calculus is not enough to model our language and hence cannot justify our reductions. Our final contribution (in Section 4) is to show that any differential λ-category [11] that satisfies the Hahn-Banach Separation Theorem is a model of our language (Theorem 4.6). Our reduction semantics is faithful to reverse-mode AD, in that it is exactly reverse-mode AD when restricted to first-order; moreover, we can perform reverse-mode AD on any higher-order abstraction, which may contain higher-order terms, duals, pullbacks, and free variables as subterms (Corollary 4.8).

Finally, we discuss related work in Section 5, and conclusions and future directions in Section 6. Throughout this paper, we will point to the attached Appendix for additional content.
All proofs are in Appendix E, unless stated otherwise.

We introduce forward- and reverse-mode automatic differentiation (AD), highlighting their respective benefits in practice. Then we explain how reverse-mode AD can naturally be expressed as the pullback of differential 1-forms. (The examples used to illustrate these methods are collated in Figure 4.)
Recall that the Jacobian matrix of a smooth function f : R^n → R^m at x ∈ R^n is

J(f)(x) := \begin{bmatrix} \frac{\partial f_1}{\partial z_1}\big|_x & \frac{\partial f_1}{\partial z_2}\big|_x & \cdots & \frac{\partial f_1}{\partial z_n}\big|_x \\ \frac{\partial f_2}{\partial z_1}\big|_x & \frac{\partial f_2}{\partial z_2}\big|_x & \cdots & \frac{\partial f_2}{\partial z_n}\big|_x \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial z_1}\big|_x & \frac{\partial f_m}{\partial z_2}\big|_x & \cdots & \frac{\partial f_m}{\partial z_n}\big|_x \end{bmatrix}

where f_j := π_j ∘ f : R^n → R. We call the function J : C^∞(R^n, R^m) → C^∞(R^n, L(R^n, R^m)) the Jacobian (here C^∞(A, B) is the set of all smooth functions from A to B, and L(A, B) is the set of all linear functions from A to B, for Euclidean spaces A and B); J(f) the Jacobian of f; J(f)(x) the Jacobian of f at x; J(f)(x)(v) the Jacobian of f at x along v ∈ R^n; and λx.J(f)(x)(v) the Jacobian of f along v.

Symbolic Differentiation
Numerical derivatives are standardly computed using symbolic differentiation: first compute ∂f_j/∂z_i for all i, j using rules (e.g. the product and chain rules), then substitute x for z to obtain J(f)(x).

For example, to compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by symbolic differentiation, first compute

∂f/∂x = 2(x+1)(2x+y²)((2x+y²) + 2(x+1))   and   ∂f/∂y = 2(x+1)(2x+y²)(2y(x+1)).

Then, substitute 1 for x and 3 for y to obtain J(f)(⟨1, 3⟩) = [660 528].

Symbolic differentiation is accurate but inefficient. Notice that the term (x+1) appears twice in ∂f/∂x, and (1+1) is evaluated twice in ∂f/∂x|_{⟨1,3⟩} (because for h : ⟨x, y⟩ ↦ (x+1)(2x+y²), both h(⟨x, y⟩) and ∂h/∂x contain the term (x+1), and the product rule tells us to calculate them separately). This duplication is a cause of the so-called expression swell problem, resulting in exponential time-complexity.
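The partial derivatives above, with the constants as we reconstruct them from the printed Jacobian [660 528], can be checked by direct evaluation. A minimal Python sketch:

```python
# Evaluate the symbolic partial derivatives of f(x, y) = ((x+1)(2x+y^2))^2
# at (1, 3), exactly as a symbolic differentiator would hand them back.

def df_dx(x, y):
    # 2(x+1)(2x+y^2) * ((2x+y^2) + 2(x+1))
    return 2 * (x + 1) * (2 * x + y ** 2) * ((2 * x + y ** 2) + 2 * (x + 1))

def df_dy(x, y):
    # 2(x+1)(2x+y^2) * 2y(x+1)  -- note (x+1) is recomputed here
    return 2 * (x + 1) * (2 * x + y ** 2) * (2 * y * (x + 1))

print(df_dx(1, 3), df_dy(1, 3))  # 660 528
```

Note how (x+1) is evaluated separately inside each partial derivative: this is exactly the duplication that causes expression swell.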
Automatic Differentiation

Automatic differentiation (AD) avoids this problem by a simple divide-and-conquer approach: first arrange f as a composite of elementary functions g_1, ..., g_k (i.e. f = g_k ∘ ··· ∘ g_1), then compute the Jacobian of each of these elementary functions, and finally combine them via the chain rule to yield the desired Jacobian of f.

Forward-mode AD
Recall the chain rule:
J(f)(x) = J(g_k)(x_{k−1}) × ··· × J(g_2)(x_1) × J(g_1)(x_0)

for f = g_k ∘ ··· ∘ g_1, where x_0 := x and x_i := g_i(x_{i−1}). Forward-mode AD computes the Jacobian matrix J(f)(x) by calculating α_i := J(g_i)(x_{i−1}) × α_{i−1} and x_i := g_i(x_{i−1}), with α_0 := I (identity matrix) and x_0 := x. Then, α_k = J(f)(x) is the Jacobian of f at x. This computation can neatly be presented as an iteration of the ⟨· | ·⟩-reduction,

⟨x | α⟩ →_g ⟨g(x) | J(g)(x) × α⟩,

for g = g_1, ..., g_k, starting from the pair ⟨x | I⟩. Besides being easy to implement, forward-mode AD computes the new pair from the current pair ⟨x | α⟩, requiring no additional memory.

To compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by forward-mode AD, first decompose f into elementary functions (in the sense of being easily differentiable) as

R² →_g R² →_∗ R →_{(−)²} R,

where g(⟨x, y⟩) := ⟨x+1, 2x+y²⟩. Then, starting from ⟨⟨1, 3⟩ | I⟩, iterate the ⟨· | ·⟩-reduction

⟨⟨1, 3⟩ | [1 0; 0 1]⟩ →_g ⟨⟨1+1, 2·1+3²⟩ | [1 0; 2 6]⟩ →_∗ ⟨2·11 | [15 12]⟩ →_{(−)²} ⟨484 | [660 528]⟩,

yielding [660 528] as the Jacobian of f at ⟨1, 3⟩. Notice that (1+1) is only evaluated once, even though its result is used in various calculations.

In practice, because storing the intermediate matrices α_i can be expensive, the matrix J(f)(x) is computed column-by-column, by simply changing the starting pair from ⟨x | I⟩ to ⟨x | e_p⟩, where e_p ∈ R^n is the column vector with 1 at the p-th position and 0 everywhere else. Then, the computation becomes a reduction of a vector-vector pair, and α_k = J(f)(x) × e_p is the p-th column of the Jacobian matrix J(f)(x). Since J(f)(x) is an m-by-n matrix, n runs are required to compute the whole Jacobian matrix.

For example, if we start from ⟨⟨1, 3⟩ | [1 0]ᵀ⟩, the reduction

⟨⟨1, 3⟩ | [1 0]ᵀ⟩ →_g ⟨⟨2, 11⟩ | [1 2]ᵀ⟩ →_∗ ⟨22 | [15]⟩ →_{(−)²} ⟨484 | [660]⟩

gives us the first column of the Jacobian matrix J(f)(⟨1, 3⟩).
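The column-by-column iteration above can be transcribed directly. A minimal Python sketch, assuming the elementary decomposition g(⟨x, y⟩) = ⟨x+1, 2x+y²⟩, mult and pow2 of the running example (the concrete formulas are our reconstruction, and the helper names are ours):

```python
# Forward-mode AD as iteration of the pair reduction <x | alpha>, run
# column-by-column on the running example.

def matmul(A, B):
    # Multiply matrices represented as lists of rows.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Each elementary function is a pair: (value map, Jacobian map).
g    = (lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
        lambda v: [[1, 0], [2, 2 * v[1]]])
mult = (lambda v: [v[0] * v[1]],
        lambda v: [[v[1], v[0]]])
pow2 = (lambda v: [v[0] ** 2],
        lambda v: [[2 * v[0]]])

def forward_ad(fs, x, alpha):
    # Iterate <x | alpha> -> <g(x) | J(g)(x) x alpha> for g = g1, ..., gk.
    for f, Jf in fs:
        x, alpha = f(x), matmul(Jf(x), alpha)
    return x, alpha

# One run per input dimension: start from <x | e_p>.
print(forward_ad([g, mult, pow2], [1, 3], [[1], [0]]))     # ([484], [[660]])
print(forward_ad([g, mult, pow2], [1, 3], [[0], [1]])[1])  # [[528]]
```

Two runs are needed for the two inputs, matching the n-runs cost stated above.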
Reverse-mode AD

By contrast, reverse-mode AD computes the dual of the Jacobian matrix, (J(f)(x))*, using the chain rule in dual (transpose) form

(J(f)(x))* = (J(g_1)(x_0))* × ··· × (J(g_k)(x_{k−1}))*

as follows: first compute x_i := g_i(x_{i−1}) for i = 1, ..., k−1 (Forward Phase); then compute β_i := (J(g_i)(x_{i−1}))* × β_{i+1} for i = k, ..., 1, with β_{k+1} := I (Reverse Phase).

For example, the reverse-mode AD computation on f is as follows.

Forward Phase: ⟨1, 3⟩ →_g ⟨2, 11⟩ →_∗ 22 →_{(−)²} 484
Reverse Phase: [660 528]ᵀ ←_g [484 88]ᵀ ←_∗ [44]ᵀ ←_{(−)²} I

In practice, like forward-mode AD, the matrix (J(f)(x))* is computed column-by-column, by simply setting β_{k+1} := π_p, where π_p ∈ L(R^m, R) is the p-th projection. Thus, a run (comprising Forward and Reverse Phases) computes (J(f)(x))*(π_p), the p-th row of the Jacobian of f at x. It follows that m runs are required to compute the m-by-n Jacobian matrix.

In many machine learning (e.g. deep learning) problems, the functions f : R^n → R^m we need to differentiate have many more inputs than outputs, in the sense that n ≫ m. Whenever this is the case, reverse-mode AD is more efficient than forward-mode.
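The two-phase schedule can likewise be sketched in Python: the Forward Phase stores every intermediate point, and the Reverse Phase folds the transposed Jacobians over them. The elementary functions are our reconstruction of the running example, and the helper names are ours:

```python
# Reverse-mode AD: Forward Phase stores x_0, ..., x_{k-1}; Reverse Phase
# computes beta_i := (J(g_i)(x_{i-1}))* x beta_{i+1} from i = k down to 1.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

fs = [
    (lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
     lambda v: [[1, 0], [2, 2 * v[1]]]),
    (lambda v: [v[0] * v[1]], lambda v: [[v[1], v[0]]]),
    (lambda v: [v[0] ** 2],   lambda v: [[2 * v[0]]]),
]

def reverse_ad(fs, x, beta):
    xs = [x]                       # Forward Phase
    for f, _ in fs:
        xs.append(f(xs[-1]))
    for (_, Jf), xi in zip(reversed(fs), reversed(xs[:-1])):
        beta = matvec(transpose(Jf(xi)), beta)   # Reverse Phase
    return beta

# beta_{k+1} := pi_p; with a single output, pi_1 = [1] gives the whole row.
print(reverse_ad(fs, [1, 3], [1]))  # [660, 528]
```

Unlike the forward-mode sketch, the intermediate points must all be stored before the reverse pass can begin, which is exactly the space cost discussed above.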
Remark 2.1. Unlike forward-mode AD, we cannot interleave the iteration of x_i and the computation of β_i. In fact, according to Hoffmann [18], nobody knows how to do reverse-mode AD using pairs ⟨· | ·⟩, as employed by forward-mode AD to great effect. In other words, reverse-mode AD does not seem presentable as an in-place algorithm.

Reverse-mode AD can naturally be expressed using pullbacks and differential 1-forms, as alluded to by Betancourt [7] and discussed in [26]. Let E := R^n and F := R^m. A differential 1-form of E is a smooth map ω ∈ C^∞(E, L(E, R)). Denote the set of all differential 1-forms of E as Ω(E). E.g. λx.π_p ∈ Ω(R^m). (Henceforth, by 1-form, we mean differential 1-form.) The pullback of a 1-form ω ∈ Ω(F) along a smooth map f : E → F is a 1-form Ω(f)(ω) ∈ Ω(E), where

Ω(f)(ω) : E → L(E, R)
          x ↦ (J(f)(x))*(ω(f(x))).

Notice that the result of an iteration of reverse-mode AD, (J(f)(x))*(π_p), can be expressed as Ω(f)(λx.π_p)(x), which can be expanded to (Ω(g_1) ∘ ··· ∘ Ω(g_k))(λx.π_p)(x). Hence, reverse-mode AD can be expressed as: first iterate the reduction of 1-forms, ω →_g Ω(g)(ω), for g = g_k, ..., g_1, starting from the 1-form λx.π_p; then compute ω(x), which yields the p-th row of J(f)(x).

Returning to our example,

Ω(f)(λx.[1]*)(⟨1, 3⟩)
= (Ω(g) ∘ Ω(∗) ∘ Ω((−)²))(λx.[1]*)(⟨1, 3⟩)
= (J(g)(⟨1, 3⟩))* ((Ω(∗) ∘ Ω((−)²))(λx.[1]*)(⟨2, 11⟩))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* ((Ω((−)²))(λx.[1]*)(22))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* (J((−)²)(22))* ((λx.[1]*)(484))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* [44]*
= (J(g)(⟨1, 3⟩))* [484 88]*
= [660 528]*,

which is the Jacobian J(f)(⟨1, 3⟩).

The pullback-of-1-forms perspective gives us a way to perform reverse-mode AD beyond Euclidean spaces (for example on the function sum : List(R) → R, which returns the sum of the elements of a list); and it shapes our language and reduction presented in Section 3. (Example 3.2 shows how sum can be defined in our language, and Appendix A.2 shows how reverse-mode AD can be performed on sum.)
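The defining equation Ω(f)(ω)(x) = (J(f)(x))*(ω(f(x))) is directly executable. A small sketch (the elementary maps and their Jacobians are our reconstruction of the running example; all names are ours) showing that composing the three pullbacks and evaluating at ⟨1, 3⟩ recovers the row [660 528]:

```python
# The pullback of a 1-form along a smooth map, computed literally from
# Omega(f)(omega) = lambda x. (J(f)(x))* (omega(f(x))).

def transpose(A):
    return [list(row) for row in zip(*A)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

def pullback(f, Jf):
    return lambda omega: lambda x: matvec(transpose(Jf(x)), omega(f(x)))

Og    = pullback(lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
                 lambda v: [[1, 0], [2, 2 * v[1]]])
Omult = pullback(lambda v: [v[0] * v[1]], lambda v: [[v[1], v[0]]])
Opow2 = pullback(lambda v: [v[0] ** 2],  lambda v: [[2 * v[0]]])

# Omega(f) = Omega(g) . Omega(mult) . Omega(pow2): pullbacks compose in the
# reverse order of the underlying maps. Start from the 1-form lambda x. pi_1.
omega = Og(Omult(Opow2(lambda x: [1])))
print(omega([1, 3]))  # [660, 528]
```

The contravariance is visible in the nesting: the innermost pullback belongs to the last elementary function, and the point is only supplied at the very end.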
Simple terms   S ::= x | λx.S | S P | π_i(S) | ⟨S_1, S_2⟩ | r | f(P) | J f · S | (λx.S_1)* · S_2 | (Ω(λx.P)) · S | 𝐫*
Pullback terms P ::= 0 | S | S + P

Figure 1. Grammar of simple terms S and pullback terms P. Assume a collection V of variables (typically x, y, z, ω), and a collection F (typically f, g, h) of easily-differentiable real-valued functions, in the sense that the Jacobian J(f) of each f ∈ F can be called by the language; r and 𝐫 range over R and R^n respectively.

Remark 2.2.
Pullbacks can be generalised to arbitrary p-forms, using essentially the same approach. However, the pullbacks of general p-forms no longer resemble reverse-mode AD as it is commonly understood.

Figure 1 presents the grammar of simple terms S and pullback terms P, and Figure 2 presents the type system. While the definition of simple terms S is relatively standard (except for the new constructs, which will be discussed later), the definition of pullback terms P as sums of simple terms is not.

The idea of sum is important since it specifies the "linear positions" in a simple term, just as it specifies the algebraic notion of linearity in mathematics. For example, x(y + z) is a term but (x + y)z is not. This is because (x + y)z is the same as xz + yz, but x(y + z) cannot be so decomposed. Hence in S P, S is in a linear position but P is not. Similarly, in mathematics (f_1 + f_2)(x) = f_1(x) + f_2(x), but in general f(x_1 + x_2) ≠ f(x_1) + f(x_2) for smooth functions f_1, f_2, f and points x, x_1, x_2. Hence, the function f in an application f(x) is in a linear position while the argument x is not.

Formally, we define the set lin(S) of linear variables in a simple term S by: y ∈ lin(S) if, and only if, y is in a linear position in S.

lin(x) := {x}
lin(λx.S) := lin(S) \ {x}
lin(S P) := lin(S) \ FV(P)
lin(π_i(S)) := lin(S)
lin(⟨S_1, S_2⟩) := lin(S_1) ∩ lin(S_2)
lin(J f · S) := lin(S)
lin((λx.S_1)* · S_2) := (lin(S_1) \ FV(S_2)) ∪ (lin(S_2) \ FV(S_1))
lin(S) := ∅ otherwise.

For example, lin(x z (y z)) = {x}. Any term of the dual type σ* is considered a linear functional of σ. For example, e_p* has the dual type R^n*. Then the term e_p* mimics the linear functional π_p ∈ L(R^n, R). The Jacobian J f · S is considered as the Jacobian of f along S, which is a smooth function. For example, let f : R^m → R^n be "easily differentiable"; then J f · v mimics the Jacobian along v, i.e. the function λx.
J(f)(x)(v).

The dual map (λx.S_1)* · S_2 is considered the dual of the linear functional S_2 along the function λx.S_1, where x ∈ lin(S_1). For example, let 𝐫 ∈ R^m. The dual map (λv.(J f · v) 𝐫)* · e_p* mimics (J(f)(𝐫))*(π_p) ∈ L(R^m, R), which is the dual of π_p along the Jacobian J(f)(𝐫).

The pullback (Ω λx.P) · S is considered the pullback of the 1-form S along the function λx.P. For example, (Ω λx.f(x)) · (λx.e_p*) mimics Ω(f)(λx.π_p) ∈ Ω(R^m), which is the pullback of the 1-form λx.π_p along f. Hence, to perform reverse-mode AD on a term λx.P at P′ with respect to ω, we consider the term ((Ω λx.P) · ω) P′.

We use syntactic sugar to ease writing. For n ≥ 1, with z a fresh variable:

R^{n+1} ≡ R^n × R        Ω σ ≡ σ ⇒ σ*
[r_1 ... r_n]ᵀ ≡ ⟨r_1, ..., r_n⟩        ⟨P_1, P_2, P_3⟩ ≡ ⟨⟨P_1, P_2⟩, P_3⟩
S_{πi} ≡ π_i(S)        let x = t in s ≡ (λx.s) t
Ω_𝐫 ≡ λx.𝐫*        λ⟨x, y⟩.S ≡ λz.S[z_{π1}/x][z_{π2}/y]

Capture-free substitution is applied recursively, e.g. ((λx.S_1)* · S_2)[P′/z] ≡ (λx.S_1[P′/z])* · (S_2[P′/z]) and ((Ω λx.P) · S)[P′/z] ≡ (Ω(λx.P[P′/z])) · (S[P′/z]). We treat 0 as the unit of our sum terms, i.e. 0 + S ≡ S + 0 ≡ S, and consider + as an associative and commutative operator. We also define S[S_1 + S_2/y] ≡ S[S_1/y] + S[S_2/y] if, and only if, y ∈ lin(S). For example, (S_1 + S_2) P ≡ S_1 P + S_2 P.

We finish this subsection with some examples that can be expressed in this language.

Example 3.1.
Consider the running example of computing the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩. Assume that g(⟨x, y⟩) := ⟨x+1, 2x+y²⟩, mult and pow2 are in the set of easily differentiable functions, i.e. g, mult, pow2 ∈ F. The function f can be presented by the term {⟨x, y⟩ : R²} ⊢ pow2(mult(g(⟨x, y⟩))) : R. More interestingly, the Jacobian

σ, τ ::= R | σ × σ | σ ⇒ σ | σ*

Γ ⊢ 0 : σ
If Γ ⊢ S : σ and Γ ⊢ P : σ, then Γ ⊢ S + P : σ.
Γ ∪ {x : σ} ⊢ x : σ
If Γ ∪ {x : σ} ⊢ S : τ, then Γ ⊢ λx.S : σ ⇒ τ.
If Γ ⊢ S : σ ⇒ τ and Γ ⊢ P : σ, then Γ ⊢ S P : τ.
If Γ ⊢ S : σ_1 × σ_2, then Γ ⊢ π_i(S) : σ_i.
If Γ ⊢ S_1 : σ_1 and Γ ⊢ S_2 : σ_2, then Γ ⊢ ⟨S_1, S_2⟩ : σ_1 × σ_2.
If r ∈ R, then Γ ⊢ r : R.
If Γ ⊢ P : R^n, then Γ ⊢ f(P) : R^m.
If Γ ⊢ S : R^n, then Γ ⊢ J f · S : R^n ⇒ R^m.
If 𝐫 ∈ R^n, then Γ ⊢ 𝐫* : R^n*.
If Γ ∪ {x : σ} ⊢ S_1 : τ, Γ ⊢ S_2 : τ* and x ∈ lin(S_1), then Γ ⊢ (λx.S_1)* · S_2 : σ*.
If Γ ∪ {x : σ} ⊢ P : τ and Γ ⊢ S : Ω τ, then Γ ⊢ (Ω λx.P) · S : Ω σ.

Figure 2.
The types and typing rules for DPPL. Ω σ ≡ σ ⇒ σ*, and f : R^n → R^m is easily differentiable, i.e. f ∈ F.

of f at ⟨1, 3⟩, i.e. J(f)(⟨1, 3⟩), can be presented by the term

⊢ ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω_{[1]}) ⟨1, 3⟩ : R²*.

This is the application of the pullback Ω(f)(λx.[1]*) to the point ⟨1, 3⟩, which we saw in Subsection 2.2 is the Jacobian of f at ⟨1, 3⟩.

Example 3.2.
Consider the function that takes a list of real numbers and returns the sum of the elements of the list. Using the standard Church encoding of lists, i.e. List(X) ≡ (X → D → D) → (D → D) and [x_1, x_2, ..., x_n] ≡ λf d. f x_n (... (f x_2 (f x_1 d))) for some dummy type D, sum : List(R) → R is defined to be λl. l (λx y. x + y) 0. Hence the Jacobian of sum at a two-element list [a, −b] can be expressed as {ω : Ω(List(R))} ⊢ ((Ω(sum)) · ω) [a, −b] : R*.

Now the question is how we could perform reverse-mode AD on this term. Recall that the result of reverse-mode AD on a function f : R^n → R^m at x ∈ R^n, i.e. the p-th row of the Jacobian matrix of f at x, can be expressed as Ω(f)(λx.π_p)(x), which is (J(f)(x))*((λx.π_p)(f x)) = (J(f)(x))* × π_p.

In the rest of this section, we consider how the term ((Ω λy.P′) · ω) P, which mimics Ω(f)(ω)(x), can be reduced. To avoid expression swell, we first perform A-reduction, P′ →*_A L, which decomposes a term into a series of "smaller" terms, as explained in Subsection 3.2. Then, we reduce ((Ω λy.L) · ω) P by induction on L, as explained in Subsection 3.3. Lastly, we complete our reduction strategy in Subsection 3.4.

We use the term in Example 3.1 as a running example in our reduction strategy, to illustrate that this reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order). The reduction of the term in Example 3.2 is given in Appendix A.2. It illustrates how reverse-mode AD can be performed on a higher-order function.

We use the administrative reduction (A-reduction) of Sabry and Felleisen [28] to decompose a pullback term P into a let series L of elementary terms, i.e. P →*_A let x_1 = E_1; ...; x_n = E_n in x_n, where elementary terms E and let series L are defined as

E ::= 0 | z_1 + z_2 | z | λx.L | z_1 z_2 | z_{πi} | ⟨z_1, z_2⟩ | r | f(z) | J f · z | (λx.L)* · z | (Ω λx.L) · z | 𝐫*
L ::= let z = E in L | let z = E in z.

Note that elementary terms E should be "fine enough" to avoid expression swell. The complete set of A-reductions on P can be found in Appendix B. We write →*_A for the reflexive and transitive closure of →_A.
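For concreteness, Example 3.2's encoding transcribes directly into Python; the two-element list below uses sample entries of ours:

```python
# A Church-encoded list [x1, ..., xn] is the function
# lambda f: lambda d: f(xn)(...(f(x2)(f(x1)(d)))).
lst = lambda f: lambda d: f(-2)(f(5)(d))   # the list [5, -2]

# sum = lambda l. l (lambda x y. x + y) 0, as in Example 3.2
def church_sum(l):
    return l(lambda x: lambda y: x + y)(0)

print(church_sum(lst))  # 3
```

Note that sum never inspects the list's structure: it only applies the list, which is why differentiating it takes us outside the first-order Euclidean setting.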
Example 3.3. We decompose the term considered in Example 3.1, pow2(mult(g(⟨x, y⟩))), via administrative reduction:

pow2(mult(g(⟨x, y⟩))) →*_A let z_1 = ⟨x, y⟩; z_2 = g(z_1); z_3 = mult(z_2); z_4 = pow2(z_3) in z_4.

This is reminiscent of the decomposition of f into R² →_g R² →_∗ R →_{(−)²} R before performing AD.

After decomposing P′ to a let series L of elementary terms via A-reductions in (Ω λy.P′) · ω, we reduce (Ω λy.L) · ω by induction on L, as shown in Figure 3 (Let Series). Reduction 7 is the base case, and Reduction 8 expresses the contravariant property of pullbacks.
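A toy version of such an administrative reduction for nested unary calls can be sketched as follows. This is only a fragment of the full A-reduction (Appendix B covers the whole term language); the function and variable names are ours:

```python
# Flatten nested calls into a let series of elementary terms, inside-out.
import itertools

fresh = (f"z{i}" for i in itertools.count(1))

def anf(expr, bindings):
    # expr is a variable name (str) or a nested call ("op", argument).
    if isinstance(expr, str):
        return expr
    op, arg = expr
    a = anf(arg, bindings)   # name the argument first (inside-out order)
    z = next(fresh)
    bindings.append((z, op, a))
    return z

bindings = []
result = anf(("pow2", ("mult", ("g", "p"))), bindings)
lets = "; ".join(f"{z} = {op}({a})" for z, op, a in bindings)
print(f"let {lets} in {result}")
# let z1 = g(p); z2 = mult(z1); z3 = pow2(z2) in z3
```

Because every intermediate result gets a name exactly once, later reductions can refer to it by its variable instead of re-deriving it, which is how expression swell is avoided.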
Example 3.4. Take (Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω_{[1]}, discussed in Example 3.1, which when applied to the point ⟨1, 3⟩ is the Jacobian J(f)(⟨1, 3⟩), where f(⟨x, y⟩) := ((x+1)(2x+y²))². In Example 3.3, we showed that pow2(mult(g(⟨x, y⟩))) is A-reduced to a let series L. Now, via Reductions 7 and 8, (Ω λ⟨x, y⟩. L) · ω is reduced to a series of pullbacks along elementary terms:

(Ω λ⟨x, y⟩. let z_1 = ⟨x, y⟩; z_2 = g(z_1); z_3 = mult(z_2); z_4 = pow2(z_3) in z_4) · ω
→* (Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1⟩. ⟨⟨x, y⟩, z_1, g(z_1)⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1, z_2⟩. ⟨⟨x, y⟩, z_1, z_2, mult(z_2)⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1, z_2, z_3⟩. pow2(z_3)) · ω)))

Via A-reductions and Reductions 7 and 8, (Ω λy.P′) · ω is reduced to a series of pullbacks along elementary terms, (Ω λy.E_1) · (... ((Ω λy.E_n) · ω)). Now, we define the reduction of a pullback along an elementary term when applied to a value V, i.e. ((Ω λy.E) · ω) V.

Recall that the pullback of a 1-form ω ∈ Ω(F) along a smooth function f : E → F is defined to be Ω(f)(ω) : x ↦ (J(f)(x))*(ω(f(x))). Hence, we have the following pullback reduction

((Ω λy.E) · ω) V → (λv.S)* · (ω (E[V/y]))

of the application ((Ω λy.E) · ω) V, which mimics the pullback of a variable ω along an abstraction λy.E at a term V. But how should one define the simple term S in (λv.S)* · (ω (E[V/y])) so that λv.S mimics the Jacobian of f at x, i.e. J(f)(x)? We do so by induction on the elementary terms E, shown in Figure 3, Reductions 9-20.

Let Series:
(7) (Ω(λy. let x = E in x)) · ω → (Ω λy.E) · ω
(8) (Ω(λy. let x = E in L)) · ω → (Ω λy.⟨y, E⟩) · ((Ω λ⟨y, x⟩.L) · ω)

Constant Functions:
(9) ((Ω λy.E) · ω) V → 0,  if y ∉ FV(E)

Linear Functions:
(10a) ((Ω λy. z + y_{πj}) · ω) V → (λv. v_{πj})* · (ω (z + V_{πj}))
(10b) ((Ω λy. y_{πi} + y_{πj}) · ω) V → (λv. v_{πi} + v_{πj})* · (ω (V_{πi} + V_{πj}))
(11) ((Ω λy. y) · ω) V → (λv. v)* · (ω V)
(12) ((Ω λy. y_{πi}) · ω) V → (λv. v_{πi})* · (ω V_{πi})
(13a) ((Ω λy. ⟨y_{πi}, z⟩) · ω) V → (λv. ⟨v_{πi}, 0⟩)* · (ω ⟨V_{πi}, z⟩)
(13b) ((Ω λy. ⟨z, y_{πj}⟩) · ω) V → (λv. ⟨0, v_{πj}⟩)* · (ω ⟨z, V_{πj}⟩)
(13c) ((Ω λy. ⟨y_{πi}, y_{πj}⟩) · ω) V → (λv. ⟨v_{πi}, v_{πj}⟩)* · (ω ⟨V_{πi}, V_{πj}⟩)
(14) ((Ω λy. J f · y_{πi}) · ω) V → (λv. J f · v_{πi})* · (ω (J f · V_{πi}))

Function Symbols:
(15) ((Ω λy. f(y_{πi})) · ω) V → (λv. (J f · v_{πi}) V_{πi})* · (ω (f(V_{πi})))

Dual Maps:
(16a) ((Ω λy. (λx.L)* · y_{πi}) · ω) V → (λv. (λx.L)* · v_{πi})* · (ω ((λx.L)* · V_{πi})),  if y ∉ FV(λx.L)
(16b) if ((Ω λy.L) · ω′) V →* (λv.S)* · ω′ V′ and y ∉ FV(z), then ((Ω λy. (λx.L)* · z) · ω) V → (λv. (λx.S)* · z)* · (ω ((λx.L[V/y])* · z))
(16c) if ((Ω λy.L) · ω′) V →* (λv.S)* · ω′ V′, then ((Ω λy. (λx.L)* · y_{πi}) · ω) V → (λv. (λx.L[V/y])* · v_{πi} + (λx.S)* · V_{πi})* · (ω ((λx.L[V/y])* · V_{πi}))

Pullback Terms:
(17) if ((Ω λx.L) · z) a →* (λv.S)* · (z L[a/x]), then ((Ω λy. (Ω λx.L) · z) · ω) V → ((Ω λy. λa. (λv.S)* · (z L[a/x])) · ω) V

Abstraction:
(18) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ L[V/y]) and x ∉ FV(V), then ((Ω λy. λx.L) · ω) V → (λv. λx.S)* · (ω (λx.L[V/y]))

Application:
(19a) ((Ω λy. y_{πi} z) · ω) V → (λv. v_{πi} z)* · (ω (V_{πi} z))
(19b) if ((Ω λz.P′) · ω′) V_{πj} → (λv′.S′)* · ω′ (P′[V_{πj}/z]) and V_{πi} ≡ λz.P′, then ((Ω λy. y_{πi} y_{πj}) · ω) V → (λv. v_{πi} V_{πj} + S′[v_{πj}/v′])* · (ω (V_{πi} V_{πj}))
(19c) if ((Ω λz.V′) · ω′) V_{πj} → 0 and V_{πi} ≡ λz.V′, then ((Ω λy. y_{πi} y_{πj}) · ω) V → (λv. v_{πi} V_{πj})* · (ω (V_{πi} V_{πj}))

Pair:
(20a) if ((Ω λy.E) · ω′) V → (λv.S)* · (ω′ (E[V/y])) and y ∈ FV(E), then ((Ω λy. ⟨y, E⟩) · ω) V → (λv. ⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)
(20b) ((Ω λy. ⟨y, E⟩) · ω) V → (λv. ⟨v, 0⟩)* · (ω ⟨V, E⟩),  if y ∉ FV(E)

Figure 3. Pullback Reductions.

Remark 3.5.
For readers familiar with differential λ-calculus [15], S is the result of substituting a linear occurrence of y by v, and then substituting all free occurrences of y by V, in the term E. Our approach is different from differential λ-calculus in that we define a reduction strategy instead of a substitution. A comprehensive comparison between our language and differential λ-calculus is given in Section 5.

If y is not a free variable in E, then λy.E mimics a constant function. The Jacobian of a constant function is 0, hence we reduce ((Ω λy.E) · ω) V to (λv.0)* · (ω (E[V/y])), which is the sugar for 0, as shown in Figure 3 (Constant Functions), Reduction 9. The redexes ((Ω λy.0) · ω) V, ((Ω λy.r) · ω) V and ((Ω λy.𝐫*) · ω) V all reduce to 0. Henceforth, we assume y ∈ FV(E). (A value is a normal form of the reduction strategy; its definition will be made precise in the next subsection.)

We consider the redexes where y ∈ lin(E). Then λy.E mimics a linear function, whose Jacobian is itself. Hence ((Ω λy.E) · ω) V is reduced to (λv.S)* · (ω (E[V/y])), where S is the result of substituting y by v in E. Figure 3 (Linear Functions), Reductions 10-14, shows how they are reduced.

Now consider the redexes where y might not be a linear variable in E. All reductions are shown in Figure 3.

Function Symbols
Let f be "easily differentiable". Then λy.f(y_{πi}) mimics f ∘ π_i, whose Jacobian at x is J(f)(π_i(x)) ∘ π_i. Hence the Jacobian of λy.f(y_{πi}) at V is λv.(J f · v_{πi}) V_{πi}, and ((Ω λy.f(y_{πi})) · ω) V is reduced to (λv.(J f · v_{πi}) V_{πi})* · (ω (f(V_{πi}))), as shown in Reduction 15.

Dual Maps
Consider the Jacobian of λy.(λx.L)* · z at V. It is easy to see that the result varies depending on where the variable y is located in the dual map (λx.L)* · z. We consider three cases.

First, if y ∉ FV(λx.L), we must have z ≡ y_{πi}. Then y is a linear variable in (λx.L)* · y_{πi}, and so the Jacobian of λy.(λx.L)* · y_{πi} at V is λv.(λx.L)* · v_{πi}. Hence, we have Reduction 16a.

Second, say y ∉ FV(z). Since dual and abstraction are both linear operations, and y is only free in L, the Jacobian of λy.(λx.L)* · z at V should be λv.(λx.S′)* · z, where λv.S′ is the Jacobian of λy.L at V. To find the Jacobian of λy.L at V, we reduce ((Ω λy.L) · ω) V to (λv.S′)* · (ω L[V/y]); then λv.S′ is the Jacobian of λy.L at V. The reduction is given in Reduction 16b. Note that this reduction avoids expression swell, as we are reducing the let series L in λy.(λx.L)* · z using our pullback reductions, which do not suffer from expression swell.

Finally, for y ∈ FV(λx.L) ∩ FV(z), the Jacobian of λy.(λx.L)* · z at V is the "sum" of the results for the two cases above, i.e. λv.(λx.L)* · v_{πi} + (λx.S)* · y_{πi}, where the remaining free occurrences of y are substituted by V, since the Jacobian of a bilinear function l : X_1 × X_2 → Y is J(l)(⟨x_1, x_2⟩)(⟨v_1, v_2⟩) = l⟨x_1, v_2⟩ + l⟨v_1, x_2⟩. Hence, we have Reduction 16c.
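The bilinear derivative fact invoked for Reduction 16c can be sanity-checked numerically for the simplest bilinear map, multiplication (a sketch; the helper names are ours):

```python
# J(l)(<x1, x2>)(<v1, v2>) = l<x1, v2> + l<v1, x2>, checked for l(a, b) = a*b
# against a central finite difference along the direction <v1, v2>.

def l(a, b):
    return a * b

def directional(f, x, v, h=1e-6):
    return (f(x[0] + h * v[0], x[1] + h * v[1])
            - f(x[0] - h * v[0], x[1] - h * v[1])) / (2 * h)

x1, x2, v1, v2 = 2.0, 11.0, 0.5, -3.0
exact = l(x1, v2) + l(v1, x2)          # 2*(-3) + 0.5*11 = -0.5
approx = directional(l, (x1, x2), (v1, v2))
print(exact, round(approx, 6))  # -0.5 -0.5
```

Each argument contributes one summand, which is why the two unmixed cases (16a and 16b) each recover one half of the sum in 16c.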
Pullback Terms

Consider ((Ω λy.(Ω λx.L) · z) · ω) V. Instead of reducing it to some (λv.S)* · (ω (((Ω λx.L) · z)[V/y])) like the others, here we simply reduce ((Ω λx.L) · z) a to (λv.S)* · (z L[a/x]), where a is a fresh variable and z ≢ x, and replace (Ω λx.L) · z by λa.(λv.S)* · (z L[a/x]) in ((Ω λy.(Ω λx.L) · z) · ω) V, as shown in Reduction 17.

Abstraction
Consider the Jacobian of λy.λx.L at V. We follow the treatment of exponentials in differential λ-category [11], where the (D-curry) rule states that, for all f : Y × X → A,

D[cur(f)] = cur(D[f] ∘ ⟨π_1 × 0_X, π_2 × Id_X⟩),

which means J(cur(f))(y) is equal to λv.J(cur(f))(y)(v) = λv x.J(f⟨−, x⟩)(y)(v). According to this (D-curry) rule, the Jacobian of λy.λx.L at V should be λv.λx.S, where λv.S is the Jacobian of λy.L at V. Hence, similarly to the dual map case, we first reduce ((Ω λy.L) · ω) V to (λv.S)* · (ω L[V/y]) to obtain the Jacobian of λy.L at V, i.e. λv.S, and then reduce ((Ω λy.λx.L) · ω) V to (λv.λx.S)* · (ω (λx.L[V/y])), as shown in Reduction 18.

Application
Consider the Jacobian of λy.z_1 z_2 at V. Note that z_1 and z_2 may or may not contain y as a free variable. Hence, there are two cases.

First, we consider λy.y_{πi} z, where z is fresh. Since y ∈ lin(y_{πi} z), λy.y_{πi} z mimics a linear function, and hence its Jacobian at V is λv.v_{πi} z. So ((Ω λy.y_{πi} z) · ω) V is reduced to (λv.v_{πi} z)* · (ω (V_{πi} z)), as shown in Reduction 19a.

Second, we consider the Jacobian of λy.y_{πi} y_{πj} at V. Now y is not a linear variable in y_{πi} y_{πj}, since it occurs in the argument y_{πj}. As proved in Lemma 4.4 of [21], every differential λ-category satisfies the (D-eval) rule,

D[ev ∘ ⟨h, g⟩] = ev ∘ ⟨D[h], g ∘ π_2⟩ + D[uncur(h)] ∘ ⟨⟨0, D[g]⟩, ⟨π_2, g ∘ π_2⟩⟩,

which means J(ev ∘ ⟨h, g⟩)(x)(v) is equal to

(J(h)(x)(v))(g(x)) + J(h(x))(g(x))(J(g)(x)(v))

for all h : C → (A ⇒ B) and g : C → A. Hence, the Jacobian of ev ∘ ⟨π_i, π_j⟩ at x along v, i.e. J(ev ∘ ⟨π_i, π_j⟩)(x)(v), is π_i(v)(π_j(x)) + J(π_i(x))(π_j(x))(π_j(v)). So the Jacobian of λy.y_{πi} y_{πj} at V is λv.v_{πi} V_{πj} + S′[v_{πj}/v′], where λv′.S′ is the Jacobian of V_{πi} at V_{πj}. Hence, assuming V_{πi} ≡ λz.P′, we first reduce ((Ω λz.P′) · ω) V_{πj} to (λv′.S′)* · ω (P′[V_{πj}/z]) to obtain λv′.S′ as the Jacobian of λz.P′ at V_{πj}; then we reduce ((Ω λy.y_{πi} y_{πj}) · ω) V to (λv.v_{πi} V_{πj} + S′[v_{πj}/v′])* · (ω (V_{πi} V_{πj})), as shown in Reduction 19b.

If ((Ω λz.V′) · ω) V_{πj} reduces to 0, which means λz.V′ ≡ V_{πi} is a constant function, the Jacobian of λy.y_{πi} y_{πj} at V is just λv.v_{πi} V_{πj}, and we have Reduction 19c.

Remark 3.6.
By induction on the elementary terms defined in Subsection 3.2, we can see that there are a few elementary terms E for which ((Ω λy. E) · ω) V is not a redex, namely

value 1: ((Ω λy. z (y πᵢ)) · ω) V where z is a free variable,
value 2: ((Ω λy. (y πᵢ)(y πⱼ)) · ω) V where V πᵢ ≢ λz. P′.

Having these terms as values makes sense intuitively, since they have "inappropriate" terms in function positions: value 1 has a free variable z in a function position, and value 2 substitutes y πᵢ by V πᵢ, which is not an abstraction, into a function position.

Pair
Last but not least, we consider the Jacobian of λy. ⟨y, E⟩ at V. It is easy to see that this Jacobian is λv. ⟨v, S⟩, where λv. S is the Jacobian of λy. E, as shown in Reductions 20a and 20b.

Example 3.7.
Take our running example. In Examples 3.3and 3.4 we showed that via A-reductions and Reductions 7and 8, ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · ω is reduced to © « ( Ω λ h x , y i . hh x , y i , h x , y ii) ·( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®®®¬ We show how it can be reduced when applied to h , i . © « ( Ω λ h x , y i . hh x , y i , h x , y ii) ·( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®®®¬ (cid:20) (cid:21) . −−−−→ © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ · © « ( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®¬ h (cid:20) (cid:21) , (cid:20) (cid:21) i ª®®¬ . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ · (cid:16) (cid:18) ( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·(( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ) (cid:19) h (cid:20) (cid:21) , (cid:20) (cid:21) , (cid:20) (cid:21) i (cid:17)ª®®®¬ ( ⋆ )20 . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ · (cid:0) (cid:16) ( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω (cid:17) h (cid:20) (cid:21) , (cid:20) (cid:21) , (cid:20) (cid:21) , i (cid:1) ª®®®®¬ . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . 
(J pow2 · v ) ) ∗ · ( ω ) ª®®®¬

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f at the point considered in Subsection 2.1. Moreover, we used the reduction f(r) → f(r) a couple of times in the argument position of an application. This is to avoid expression swell: in (⋆), the argument is evaluated to a value only once, even when the result is used in various computations. Hence, we must have a call-by-value reduction strategy, as presented below.

The reductions in Subsections 3.2 and 3.3 are the most interesting development of the paper. However, they alone are not enough to complete a reduction strategy. In this subsection, we define contexts and redexes so that any non-value term can be reduced.

The definition of a context C is the standard call-by-value context, extended with duals and pullbacks. Notice that the context (Ω λy. C_A) · S contains an A-context as defined in Subsection 3.2. This follows the idea of reverse-mode AD: decompose a term into elementary terms before differentiating them.

C ::= [] | C + P | V + C | C P | V C | πᵢ(C) | ⟨C, S⟩ | ⟨V, C⟩ | f(C) | J f · C | (λx. S)* · C | (λx. C)* · V | (Ω λy. C_A) · S | (Ω λy. E) · C | (Ω λy. ⟨y, E⟩) · C

Our redexes r extend the standard call-by-value redexes with four sets of terms.

r ::= (λx. S) V | πᵢ(⟨V₁, V₂⟩) | f(r) | (J f · r) r′ | (λv. (J f · v) r)* · r′* | (λv₁. V₁)* · ((λv₂. V₂)* · V₃) | (Ω λy. L) · S | ((Ω λy. E) · V₁) V₂ | ((Ω λy. ⟨y, E⟩) · V₁) V₂

where either V₂ ≢ (J f · v) r or V₃ ≢ r′*. A value V is a pullback term P that cannot be reduced further, i.e. a term in normal form. The following standard lemma, which is proved by induction on P, tells us that there is at most one redex to reduce.

Lemma 3.8.
Every term P can be expressed either as C[r] for some unique context C and redex r, or as a value V.

Let us look at the reductions of redexes. (1)–(4) are the standard call-by-value reductions, extended with the evaluation of function symbols and Jacobians at constants; (5) reduces the dual along a linear map, and (6) is the contravariance property of dual maps.

(1) (λx. S) V → S[V/x]
(2) πᵢ(⟨V₁, V₂⟩) → Vᵢ
(3) f(r) → f(r)
(4) (J f · r) r′ → J(f)(r′)(r)
(5) (λv. (J f · v) r)* · r′* → ((J(f)(r))*(r′))*
(6) (λv₁. V₁)* · ((λv₂. V₂)* · V₃) → (λv₁. V₂[V₁/v₂])* · V₃, where either V₂ ≢ (J f · v) r or V₃ ≢ r′*.

We say C[r] → C[V] if r → V, for all reductions except those with a proof tree, i.e. Reductions 16b, 16c, 17, 18, 19b, 19c and 20a; for those, C[r] reduces according to the conclusion of the proof tree once its premises r →* V and r′ → V′ have been established.

Example 3.9.
Consider our running example P ≡ (cid:0) ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · Ω (cid:2) (cid:3) (cid:1) h , i which rep-resents the Jacobian of f : h x , y i 7→ (cid:0) ( x + )( x + y ) (cid:1) at h , i , as shown in Example 3.1. Replacing ω by Ω (cid:2) (cid:3) ≡ λx . (cid:2) (cid:3) ∗ in Examples 3.3, 3.4 and 3.7, P is reduced to © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . (J pow2 · v ) ) ∗ · ( ω ) ª®®®¬ . Via reduction 5 and β reduction, P is reduced to © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . (J pow2 · v ) ) ∗ · (cid:2) (cid:3) ª®®®¬ −→ © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ · ª®¬ ∗ −→ (cid:18) ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ · (cid:19) ∗ −→ (cid:0) ( λ h v , v i . hh v , v i , h v , v ii) ∗ · (cid:1) ∗ −→ (cid:20) (cid:21) ∗ Notice how this mimics the reverse phase of reverse-modeAD on f : h x , y i 7→ (cid:0) ( x + )( x + y ) (cid:1) at h , i consideredin Subsection 2.1.Examples 3.3, 3.4 and 3.7 demonstrates that our reductionstrategy is faithful to reverse-mode AD (in that it is exactlyreverse-mode AD when restricted to first-order). Differential 1-forms Ω E : = C ∞ ( E , L ( E , R )) is similar to thecontinuation of E with the “answer” R . We can indeed writeour reduction in a continuation passing style (CPS) manner.Let h P | S i y ≡ ( Ω λy . P ) · S , then we can treat h P | S i y as aconfiguration of an element Γ ∪ { y : σ } ⊢ P : τ and a “con-tinuation” Γ ⊢ S : Ω τ . 
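The continuation reading of a configuration ⟨P | S⟩_y can be sketched concretely: each differentiable primitive, written in pullback style, passes its result to the continuation, receives the sensitivity of the final answer, and pulls it back through its own derivative. This is a minimal illustration with assumed primitives `square` and `sin_`, not the paper's formal rules.

```python
import math

# Each primitive p : R -> R takes a continuation k. It computes y = p(x),
# runs the rest of the computation via k, receives the sensitivity dy of the
# final result with respect to y, and returns dy * p'(x): the covector
# pulled back through J(p).
def square(x, k):
    dy = k(x * x)          # run the rest of the computation
    return dy * 2 * x      # reverse phase: multiply by the derivative

def sin_(x, k):
    dy = k(math.sin(x))
    return dy * math.cos(x)

# d/dx sin(x^2) at x = 0.5, seeding the final sensitivity with 1.0
grad = square(0.5, lambda y: sin_(y, lambda z: 1.0))
```

Running the pipeline with the identity seed plays the role of applying the pullback to the trivial 1-form; the analytic derivative is cos(x²)·2x.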
The rules for the redexes ⟨L | S⟩_y, (⟨E | V⟩_y) V and (⟨⟨y, E⟩ | V⟩_y) V can be converted directly from Reductions 7–20. For example, Reduction 8 can be written as

⟨let x = E in L | ω⟩_y → ⟨⟨y, E⟩ | ⟨L | ω⟩_{⟨y,x⟩}⟩_y.

We prefer to present our language without explicit mention of CPS, since this paper focuses on the syntactic notion of reverse-mode AD via pullbacks and 1-forms. Also, a 1-form of type σ is more precisely described as an element of the function type Ω σ ≡ σ ⇒ σ*, than of the continuation of σ, i.e. σ ⇒ (σ ⇒ R).

We show that any differential λ-category satisfying the Hahn-Banach Separation Theorem can soundly model our language. Cartesian differential categories [9] aim to axiomatise the fundamental properties of the derivative. Indeed, any model of synthetic differential geometry has an associated Cartesian differential category [13].
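The type Ω σ ≡ σ ⇒ σ* above can be modelled directly with functions: a dual value is a linear functional, and pulling back along a linear map precomposes with it, which is exactly the contravariance used by the dual-map reduction. A small sketch with assumed linear maps `f` and `g`:

```python
# (r1 ... rn)* as the linear functional v |-> sum_i r_i * v_i
def dual(rs):
    return lambda vs: sum(r * v for r, v in zip(rs, vs))

# (lin)* . omega  =  omega o lin : pullback precomposes the covector
def pullback(lin, omega):
    return lambda v: omega(lin(v))

omega = dual([2.0, -1.0])
f = lambda v: [v[0] + v[1], 3.0 * v[0]]     # assumed linear maps on R^2
g = lambda v: [2.0 * v[0], v[1] - v[0]]

# Contravariance: (f o g)* = g* . f* -- both sides send omega to the same covector.
lhs = pullback(g, pullback(f, omega))
rhs = pullback(lambda v: f(g(v)), omega)
```

Both sides compute ω(f(g(v))); the pullback reverses the order of composition, which is why chaining pullbacks yields the reverse phase of reverse-mode AD.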
Cartesian differential category
A category C is a Cartesian differential category if

• every homset C(A, B) is enriched with a commutative monoid (C(A, B), +_{AB}, 0_{AB}), and the additive structure is preserved by composition on the left, i.e. (g + h) ∘ f = g ∘ f + h ∘ f and 0 ∘ f = 0;
• it has products, and projections and pairings of additive maps are additive, where a morphism f is additive if it preserves the additive structure of the homset on the right, i.e. f ∘ (g + h) = f ∘ g + f ∘ h and f ∘ 0 = 0;

and it has an operator D[−] : C(A, B) → C(A × A, B) that satisfies the following axioms:

[CD1] D is linear: D[f + g] = D[f] + D[g] and D[0] = 0
[CD2] D is additive in its first coordinate: D[f] ∘ ⟨h + k, v⟩ = D[f] ∘ ⟨h, v⟩ + D[f] ∘ ⟨k, v⟩ and D[f] ∘ ⟨0, v⟩ = 0
[CD3] D behaves well with projections: D[Id] = π₁, D[π₁] = π₁ ∘ π₁ and D[π₂] = π₂ ∘ π₁
[CD4] D behaves well with pairings: D[⟨f, g⟩] = ⟨D[f], D[g]⟩
[CD5] Chain rule: D[g ∘ f] = D[g] ∘ ⟨D[f], f ∘ π₂⟩
[CD6] D[f] is linear in its first component: D[D[f]] ∘ ⟨⟨g, 0⟩, ⟨h, k⟩⟩ = D[f] ∘ ⟨g, k⟩
[CD7] Independence of the order of partial differentiation: D[D[f]] ∘ ⟨⟨0, h⟩, ⟨g, k⟩⟩ = D[D[f]] ∘ ⟨⟨0, g⟩, ⟨h, k⟩⟩

We call D the Cartesian differential operator of C.

Example 4.1.
The category
FVect of finite-dimensional vector spaces and differentiable functions is a Cartesian differential category, with the Cartesian differential operator D[f]⟨v, x⟩ = J(f)(x)(v).

A Cartesian differential operator does not necessarily behave well with exponentials. Hence, Bucciarelli et al. [11] added the (D-curry) rule and introduced differential λ-categories.

Differential λ-category A Cartesian differential category is a differential λ-category if

• it is Cartesian closed,
• λ(−) preserves the additive structure, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0,
• D[−] satisfies the (D-curry) rule: for any f : C × A → B, D[λ(f)] = λ(D[f] ∘ ⟨π₁ × 0_A, π₂ × Id_A⟩).

Linearity
A morphism f in a differential λ-category is linear if D[f] = f ∘ π₁.

Example 4.2.
The category
Con∞ of convenient vector spaces and smooth maps, considered by [8], is a differential λ-category with the Cartesian differential operator D[f]⟨v, x⟩ := lim_{t→0} (f(x + tv) − f(x))/t, as shown in Lemma E.2.

We say a differential λ-category C satisfies the Hahn-Banach Separation Theorem if R is an object in C and, for any object A in C and distinct elements x, y in A, there exists a linear morphism l : A → R that separates x and y, i.e. l(x) ≠ l(y).

Example 4.3.
The category
Con∞ of convenient vector spaces and smooth maps satisfies the Hahn-Banach Separation Theorem, as shown in Proposition E.3.

Let C be a differential λ-category that satisfies the Hahn-Banach Separation Theorem. Since C is Cartesian closed, the interpretations of the λ-calculus terms are standard, and hence omitted. The full set of interpretations can be found in Appendix C.

⟦R⟧ := R    ⟦σ₁ × σ₂⟧ := ⟦σ₁⟧ × ⟦σ₂⟧
⟦σ*⟧ := L(⟦σ⟧, R)    ⟦σ₁ ⇒ σ₂⟧ := C(⟦σ₁⟧, ⟦σ₂⟧)

where L(⟦σ⟧, R) := { f ∈ C(⟦σ⟧, R) | D[f] = f ∘ π₁ } is the set of all linear morphisms from ⟦σ⟧ to R.

⟦0⟧_γ := 0    ⟦S + P⟧_γ := ⟦S⟧_γ + ⟦P⟧_γ
⟦⟨r₁ ... r_n⟩*⟧_γ := λ⟨v₁ ... v_n⟩. Σᵢ₌₁ⁿ rᵢ vᵢ
⟦(λx. S₁)* · S₂⟧_γ := λv. ⟦S₂⟧_γ (⟦S₁⟧⟨γ, v⟩)
⟦(Ω λx. P) · S⟧_γ := λx. λv. ⟦S⟧_γ (⟦P⟧⟨γ, x⟩)(D[cur(⟦P⟧) γ]⟨v, x⟩)

We verify our definitions of linearity and substitution in Lemma 4.4 and Lemma 4.5 respectively.
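The Cartesian differential operator underlying these interpretations can be sanity-checked numerically in the smooth-maps reading, where D[f]⟨v, x⟩ is the directional derivative of f at x along v. A finite-difference check of the chain-rule axiom [CD5], with assumed one-variable maps `f` and `g` (an illustration, not part of the formal development):

```python
# D[f]<v, x>: central-difference directional derivative of f at x along v.
def D(f, v, x, eps=1e-6):
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

f = lambda x: x ** 3
g = lambda y: 2 * y ** 2
x, v = 0.7, 1.0

lhs = D(lambda t: g(f(t)), v, x)   # D[g o f]<v, x>
rhs = D(g, D(f, v, x), f(x))       # D[g]<D[f]<v, x>, f(x)>  (axiom [CD5])
```

Both sides evaluate g′(f(x)) · f′(x) · v, as the chain rule demands.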
Lemma 4.4 (Linearity). Let Γ₁ ∪ {x : σ} ⊢ P₁ : τ and Γ₂ ⊢ P₂ : σ*. Let γ₁ ∈ ⟦Γ₁⟧ and γ₂ ∈ ⟦Γ₂⟧. Then,

1. if x ∈ lin(P₁), then cur(⟦P₁⟧) γ₁ is linear, i.e. D[cur(⟦P₁⟧) γ₁] = (cur(⟦P₁⟧) γ₁) ∘ π₁;
2. ⟦P₂⟧_{γ₂} is linear, i.e. D[⟦P₂⟧_{γ₂}] = (⟦P₂⟧_{γ₂}) ∘ π₁.

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id_{⟦Γ⟧}, ⟦Γ ⊢ P : σ⟧⟩

Any differential λ-category satisfying the Hahn-Banach Separation Theorem is a sound model of our language. Note that the Hahn-Banach Separation Theorem is crucial in the proof.

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P →_A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P → P′ implies ⟦P⟧ = ⟦P′⟧.

Proof. The full proof can be found in Appendix E.2. We proceed by case analysis on the reductions of pullback terms. Consider Reduction 19b. Let γ ∈ ⟦Γ⟧. By the induction hypothesis and V πᵢ ≡ λz. P′, we have ⟦((Ω λz. P′) · ω) (V πⱼ)⟧ = ⟦(λv′. S′)* · ω (P′[V πⱼ / z])⟧, which means that for any 1-form ϕ and any v,

ϕ(⟦P′⟧⟨γ, ⟦V πⱼ⟧ γ⟩)(D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩) = ϕ(⟦P′⟧⟨γ, ⟦V πⱼ⟧ γ⟩)(⟦S′⟧⟨γ, v⟩).

Let l be a linear morphism to R; then λx. l is a 1-form, and hence we have l(D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩) = l(⟦S′⟧⟨γ, v⟩). By the contrapositive of the Hahn-Banach Separation Theorem, this implies

D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩ = ⟦S′⟧⟨γ, v⟩.

Note that by (D-eval) in [21], D[ev ∘ ⟨πᵢ, πⱼ⟩]⟨v, x⟩ = πᵢ(v)(πⱼ(x)) + D[πᵢ(x)]⟨πⱼ(v), πⱼ(x)⟩. Hence we have

⟦((Ω λy. (y πᵢ)(y πⱼ)) · ω) V⟧ γ
= λv. ⟦ω⟧ γ (⟦(y πᵢ)(y πⱼ)⟧⟨γ, ⟦V⟧ γ⟩)(D[ev ∘ ⟨πᵢ, πⱼ⟩]⟨v, ⟦V⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + D[⟦V πᵢ⟧ γ]⟨v πⱼ, ⟦V πⱼ⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + D[cur(⟦P′⟧) γ]⟨v πⱼ, ⟦V πⱼ⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + ⟦S′⟧⟨γ, v πⱼ⟩)
= ⟦(λv. (v πᵢ)(V πⱼ) + S′[v πⱼ / v′])* · ω ((V πᵢ)(V πⱼ))⟧ γ  □

A simple corollary of Theorem 4.6 is that types are invariant under reductions.
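The separation argument in the proof is elementary in the finite-dimensional setting: two distinct points of Rⁿ are separated by the linear functional induced by their difference vector. A small sketch (finite-dimensional only; the convenient-vector-space case needs the full Hahn-Banach theorem):

```python
# Given distinct x, y in R^n, l(v) = <x - y, v> is linear and separates them,
# since l(x) - l(y) = |x - y|^2 > 0.
def separating_functional(x, y):
    d = [a - b for a, b in zip(x, y)]
    return lambda v: sum(di * vi for di, vi in zip(d, v))

x, y = [1.0, 2.0], [1.0, 3.0]
l = separating_functional(x, y)
```

Here l(x) − l(y) equals the squared distance between x and y, so it is nonzero exactly when x ≠ y.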
Corollary 4.7 (Subject Reduction). For any pullback terms P and P′ with P → P′, if Γ ⊢ P : σ, then Γ ⊢ P′ : σ.

Recall that performing reverse-mode AD on a function f : Rⁿ → Rᵐ at a point x ∈ Rⁿ computes a row of the Jacobian matrix J(f)(x), i.e. (J(f)(x))*(π_p). The following corollary tells us that our reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first order), and that we can perform reverse-mode AD on any abstraction, which might contain higher-order terms, duals, pullbacks and free variables.

Corollary 4.8.
Let Γ ∪ {y : σ} ⊢ P : τ, Γ ⊢ P₀ : σ and γ ∈ ⟦Γ⟧.

1. Let σ ≡ Rⁿ and τ ≡ Rᵐ. If ((Ω λy. P) · Ω e_p) P₀ →* V, then the p-th row of the Jacobian matrix of ⟦P⟧⟨γ, −⟩ at ⟦P₀⟧ γ is (⟦V⟧ γ)*.
2. Let l be a linear morphism from ⟦τ⟧ to R. If ((Ω λy. P) · ω) P₀ →* (λv. P′)* · ω P″ for some fresh variable ω, then the derivative of l ∘ (⟦P⟧⟨γ, −⟩) at ⟦P₀⟧ γ along any v ∈ ⟦σ⟧ is l(⟦P′⟧⟨γ, λx. l, v⟩), i.e. D[l ∘ (⟦P⟧⟨γ, −⟩)]⟨v, ⟦P₀⟧ γ⟩ = l(⟦P′⟧⟨γ, λx. l, v⟩).

Example 4.9.
In Example 3.9, we showed that ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω [ ]) applied to the chosen point reduces to the dual value [660 528]*. Note that [660 528] is exactly the Jacobian matrix of f at that point.

We discuss recent works on calculi and languages that provide differentiation capabilities.
The standard bearer is none other than the differential λ-calculus [15], which has inspired the design of our language. The implementation induced by the differential λ-calculus is a form of symbolic differentiation, which suffers from expression swell. For this reason, Manzyuk [22] introduced the perturbative λ-calculus, a λ-calculus with a forward-mode AD operator. Our language is complementary to these calculi, in that it implements higher-order reverse-mode AD; moreover, it is call-by-value, which is crucial for reverse-mode AD to avoid expression swell, as illustrated in Example 3.7.

What is the relationship between our language and the differential λ-calculus? We can give a precise answer via a compositional translation (−)ᵗ to a differential λ-calculus extended with real numbers, function symbols, pairs and projections, defined as follows:

s, t ::= x | λx. s | s T | D s · t | πᵢ(s) | ⟨s, t⟩ | r | f(T) | D f · t
S, T ::= 0 | s | s + T    where r ∈ R, f ∈ F

The major cases of the definition of (−)ᵗ are:

(σ*)ᵗ := σᵗ ⇒ R
(J f · S)ᵗ := D f · Sᵗ
(⟨r₁ ... r_n⟩*)ᵗ := λv. Σᵢ₌₁ⁿ fᵢ(πᵢ(v))
((λy. S₁)* · S₂)ᵗ := λv. (S₂)ᵗ ((λy. (S₁)ᵗ) v)
((Ω λy. P) · S)ᵗ := λx. λv. Sᵗ ((λy. Pᵗ) x) ((D (λy. Pᵗ) · v) x)

for fᵢ := rᵢ × −. (The definitions are provided in full in Appendix D.) Because the differential λ-calculus does not have a linear function type, (S₁)ᵗ is no longer in a linear position in ((λx. S₁)* · S₂)ᵗ. Though the translation does not preserve linearity, it does preserve reductions and interpretations (Lemma 5.1).

Lemma 5.1.
Let P be a term.
1. If P → P′, then there exists a reduct s of (P′)ᵗ such that Pᵗ →* s in L_D.
2. ⟦P⟧ = ⟦Pᵗ⟧ in C.

A corollary of Lemma 5.1 (1) is that our reduction strategy is strongly normalizing.
Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.
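The expression-swell contrast with symbolic differentiation noted above can be seen in a few lines: symbolically differentiating an n-fold squaring duplicates the subject term at every step, while a value-level reverse pass keeps constant-size state. An illustrative sketch (not the paper's translation):

```python
# Symbolic derivative of x squared n times: the expression for the
# derivative roughly doubles in size at each step (expression swell).
def symbolic_derivative(n):
    e, de = "x", "1"
    for _ in range(n):
        # d(e^2) = 2 * e * de, then the new subject term is e^2
        e, de = f"({e}*{e})", f"(2*{e}*{de})"
    return de

# Value-level reverse mode: forward phase stores n intermediates, and the
# reverse phase folds a single scalar sensitivity back over them.
def reverse_mode(x, n):
    vals = [x]
    for _ in range(n):
        vals.append(vals[-1] ** 2)
    grad = 1.0
    for v in reversed(vals[:-1]):
        grad *= 2 * v
    return grad
```

At x = 1 the derivative of x^(2^n) is 2^n, which the reverse pass computes exactly while the symbolic expression grows exponentially.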
Encouraged by calls [14, 19, 24] from the machine learning community, the development of reverse-mode AD programming languages has been an active research problem. Following Pearlmutter and Siskind [27], these languages usually treat reverse-mode AD as a meta-operator on programs.
First-order
Elliott [16] gives a categorical presentation of reverse-mode AD. Using a functor over Cartesian categories, he presents a neat implementation of reverse-mode AD. As is well known, conditionals do not behave well with smoothness [6]; nor do loops and recursion. Abadi and Plotkin [2] address this problem via a first-order language with conditionals, recursively defined functions, and a construct for reverse-mode AD. Using real analysis, they prove the coincidence of its operational and denotational semantics. To our knowledge, these treatments of reverse-mode AD are restricted to first-order functions.
Towards higher-order
The first work that extends reverse-mode AD to higher orders is by Pearlmutter and Siskind [27]; they use a non-compositional program transformation to implement reverse-mode AD.

Inspired by Wang et al. [32, 33], Brunel et al. [10] study a simply-typed λ-calculus augmented with a notion of linear negation type. Though our dual type may resemble their linear negation, the two are actually quite different. In fact, our work can be viewed as providing a positive answer to the last paragraph of [10, Sec. 7], where the authors address the relation between their work and the differential λ-calculus. They describe a "naïve" approach to expressing reverse-mode AD in the differential λ-calculus, naïve in the sense that it suffers from expression swell, which our approach does not (see Example 3.7). Moreover, Brunel et al. use a program transformation to perform reverse-mode AD, whereas we use a first-class differential operator. Brunel et al. [10] prove correctness of reverse-mode AD on real-valued functions (Theorem 5.6 and Corollary 5.7 in [10]), whereas we allow any (higher-order) abstraction to be the argument of a pullback term, and we prove that the result of reducing such a pullback term is exactly the derivative of the abstraction (Corollary 4.8).

Building on Elliott [16]'s categorical presentation of reverse-mode AD, and Pearlmutter and Siskind [27]'s idea of differentiating higher-order functions, Vytiniotis et al. [31] developed an implementation of a simply-typed differentiable programming language.

However, none of these treatments is purely higher-order, in the sense that their differential operators can only compute the derivative of an "end to end" first-order program (which may be constructed using higher-order functions), but not the derivative of a higher-order function. As far as we know, our work gives the first implementation of reverse-mode AD in a higher-order programming language that directly computes the derivative of higher-order functions using
reverse-mode AD (Corollary 4.8 (2)).

After outlining the mathematical foundation of reverse-mode AD as the pullback of differential 1-forms (Section 2.2), we presented a simple higher-order programming language with an explicit differential operator, (Ω (λx. P)) · S (Subsection 3.1), and a call-by-value reduction strategy to divide (A-reductions, Subsection 3.2), conquer (pullback reductions, Subsection 3.3) and combine (Subsection 3.4) the term ((Ω (λx. P)) · ω) S, such that its reduction exactly mimics reverse-mode AD. Examples were given to illustrate that our reduction is faithful to reverse-mode AD. Moreover, we showed how our reduction can be adapted to a CPS evaluation (Subsection 3.5).

We showed (in Section 4) that any differential λ-category that satisfies the Hahn-Banach Separation Theorem is a sound model of our language (Theorem 4.6), and how our reduction precisely captures the notion of reverse-mode AD in both first-order and higher-order settings (Corollary 4.8).

Future Directions.
An interesting direction is to extend our language with probability, so that it can serve as a compiler intermediate representation for "deep" probabilistic programming frameworks such as Edward [29] and Pyro [30]. Inference algorithms that require the computation of gradients, such as Hamiltonian Monte Carlo and variational inference, on which Edward and Pyro rely, can then be expressed in such a language, allowing us to prove their correctness.
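The gradient consumers mentioned above are simple to state once a gradient is available. For instance, a single leapfrog step of Hamiltonian Monte Carlo only needs the gradient of the potential energy; a minimal illustrative sketch (the names and the step size are assumptions, not from the paper):

```python
# One leapfrog step: half-step on momentum, full step on position,
# half-step on momentum, using the gradient grad_U of the potential U.
def leapfrog(q, p, grad_U, step):
    p = p - 0.5 * step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

# For U(q) = q^2 / 2 (standard normal potential), grad_U(q) = q.
q, p = leapfrog(1.0, 0.0, lambda q: q, 0.1)
```

In a language with a first-class differential operator, `grad_U` would itself be a term of the language rather than a hand-written derivative, which is what makes correctness proofs for such inference algorithms feasible.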
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] Martín Abadi and Gordon D. Plotkin. 2020. A simple differentiable programming language. Proc. ACM Program. Lang. 4, POPL (2020). https://doi.org/10.1145/3371106
[3] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. (The Theano Development Team). 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016). http://arxiv.org/abs/1605.02688
[4] F. Bauer. 1974. Computational Graphs and Rounding Error. SIAM J. Numer. Anal. 11, 1 (1974), 87–96. https://doi.org/10.1137/0711010
[5] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: a Survey. J. Mach. Learn. Res. 18 (2017), 153:1–153:43. http://jmlr.org/papers/v18/17-468.html
[6] Thomas Beck and Herbert Fischer. 1994. The if-problem in automatic differentiation. J. Comput. Appl. Math. 50, 1 (1994), 119–131. https://doi.org/10.1016/0377-0427(94)90294-1
[7] Michael Betancourt. 2018. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592 (2018).
[8] Richard Blute, Thomas Ehrhard, and Christine Tasson. 2010. A convenient differential category. CoRR abs/1006.3140 (2010). http://arxiv.org/abs/1006.3140
[9] Richard F. Blute, J. Robin B. Cockett, and Robert A. G. Seely. 2009. Cartesian differential categories. Theory and Applications of Categories 22 (2009), 622–672.
[10] Aloïs Brunel, Damiano Mazza, and Michele Pagani. 2019. Backpropagation in the Simply Typed Lambda-Calculus with Linear Negation. CoRR abs/1909.13768 (2019). http://arxiv.org/abs/1909.13768
[11] Antonio Bucciarelli, Thomas Ehrhard, and Giulio Manzonetto. 2010. Categorical Models for Simply Typed Resource Calculi. Electr. Notes Theor. Comput. Sci. 265 (2010), 213–230. https://doi.org/10.1016/j.entcs.2010.08.013
[12] Alonzo Church. 1965. The Calculi of Lambda-Conversion. New York: Kraus Reprint Corporation.
[13] J. Robin B. Cockett and Geoff S. H. Cruttwell. 2014. Differential Structure, Tangent Structure, and SDG. Applied Categorical Structures (2014). https://doi.org/10.1007/s10485-013-9312-0
[14] David Dalrymple. 2016. 2016: What do you consider the most interesting recent [scientific] news? What makes it important? (2016). Accessed: 2020-01-07.
[15] Thomas Ehrhard and Laurent Regnier. 2003. The differential lambda-calculus. Theor. Comput. Sci. 309 (2003), 1–41. https://doi.org/10.1016/S0304-3975(03)00392-X
[16] Conal Elliott. 2018. The simple essence of automatic differentiation. Proc. ACM Program. Lang. 2, ICFP (2018), 70:1–70:29. https://doi.org/10.1145/3236765
[17] Alfred Frölicher and Andreas Kriegl. 1988. Linear spaces and differentiation theory. Chichester: Wiley.
[18] Philipp H. W. Hoffmann. 2016. A Hitchhiker's Guide to Automatic Differentiation. Numerical Algorithms 72, 3 (2016), 775–811. https://doi.org/10.1007/s11075-015-0067-6
[19] Yann LeCun. 2018. Deep Learning est mort. Vive Differentiable Programming! (2018). Accessed: 2020-01-07.
[20] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. 2015. Autograd: Effortless Gradients in Numpy. Presented at the AutoML Workshop, ICML.
[21] Giulio Manzonetto. 2012. What is a categorical model of the differential and the resource λ-calculi? Mathematical Structures in Computer Science 22, 3 (2012), 451–520. https://doi.org/10.1017/S0960129511000594
[22] Oleksandr Manzyuk. 2012. A Simply Typed λ-Calculus of Forward Automatic Differentiation. Electr. Notes Theor. Comput. Sci. 286 (2012), 257–272. https://doi.org/10.1016/j.entcs.2012.08.017
[23] Peter W. Michor and Andreas Kriegl. 1997. The convenient setting of global analysis. Providence, R.I.: American Mathematical Society.
[24] Christopher Olah. 2015. Neural Networks, Types, and Functional Programming. http://colah.github.io/posts/2015-09-NN-Types-FP/ (2015). Accessed: 2020-01-07.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. CoRR abs/1912.01703 (2019). http://arxiv.org/abs/1912.01703
[26] Barak A. Pearlmutter. 2019. A Nuts-and-Bolts Differential Geometric Perspective on Automatic Differentiation. Presented at the Languages for Inference Workshop, Cascais, Portugal.
[27] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
[28] Amr Sabry and Matthias Felleisen. 1992. Reasoning About Programs in Continuation-Passing Style. In Proceedings of the Conference on Lisp and Functional Programming (LFP 1992). ACM, 288–298. https://doi.org/10.1145/141471.141563
[29] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. CoRR abs/1701.03757 (2017). http://arxiv.org/abs/1701.03757
[30] Uber. 2017. Pyro. http://pyro.ai/ (Retrieved Nov 2018).
[31] Dimitrios Vytiniotis, Dan Belov, Richard Wei, Gordon Plotkin, and Martin Abadi. 2019. The Differentiable Curry. Presented at the Program Transformations for Machine Learning Workshop, NeurIPS, Vancouver, Canada.
[32] Fei Wang, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2018. Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018). 10201–10212.
[33] Fei Wang, Daniel Zheng, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying differentiable programming: shift/reset the penultimate backpropagator. Proc. ACM Program. Lang. 3, ICFP (2019), 96:1–96:31. https://doi.org/10.1145/3341700
[34] R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464. https://doi.org/10.1145/355586.364791

Appendix A Examples
A.1 Simple Example
We focus on how to compute the derivative of f at a point by the different modes of AD. First, f is decomposed into elementary functions as

R² --g--> R² --mult--> R --pow2--> R,

where g(⟨x, y⟩) := ⟨x + , x + y⟩. Then, Figure 4 summarizes the iterations of the different modes of AD. Now we show how Section 3 tells us how to perform reverse-mode AD on f.

Term
Assuming д , mult , pow2 ∈ F , we can define the fol-lowing term in the language. ⊢ (cid:0) ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · ( Ω (cid:2) (cid:3) ) (cid:1) h , i : R ∗ This term is the application of the pullback Ω ( f )( λx . (cid:2) (cid:3) ∗ ) tothe point h , i , which is exactly the Jacobian of f at h , i . Administrative Reduction
We decompose the term pow2(mult(g(⟨x, y⟩))), via administrative reductions, into a let-series of elementary terms:

pow2(mult(g(⟨x, y⟩))) →*_A L ≡ let z₁ = ⟨x, y⟩; z₂ = g(z₁); z₃ = mult(z₂); z₄ = pow2(z₃) in z₄.

This is reminiscent of the decomposition of f into R² --g--> R² --mult--> R --pow2--> R before performing AD.

Splitting the Omega
Now, via Reductions 7 and 8, (Ω λ⟨x, y⟩. L) · ω is reduced to a series of pullbacks along elementary terms:

(Ω λ⟨x, y⟩. let z₁ = ⟨x, y⟩; z₂ = g(z₁); z₃ = mult(z₂); z₄ = pow2(z₃) in z₄) · ω
→* (Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω

Pullback Reduction
We showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · ω is reduced to

(Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩)
· (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩)
· (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩)
· (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω

We now show how it is reduced when applied to the point [ ]:

((Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) · (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) [ ]
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) ⟨[ ], [ ]⟩)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · ((Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω)) ⟨[ ], [ ], [ ]⟩)   (⋆)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) ⟨[ ], [ ], [ ], ⟩)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · (ω )

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, considered in Figure 4. Moreover, we used the reduction that evaluates f(r) to its value a couple of times in the argument position of an application. This is to avoid expression swell: the application marked (⋆) is evaluated only once, even when its result is used in various computations.

Combine
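The forward phase traced above, followed by the reverse phase below, can be mimicked in ordinary code. The sketch is plain Python, not the paper's language; the constant c = 1 in g and the sample point (2, 3) are illustrative choices only:

```python
# Reverse-mode AD on f(x, y) = ((x + c)(x + y))**2 as a forward phase that
# records intermediate values, followed by a reverse phase that applies the
# transposed Jacobians to the seed covector. c = 1 and the sample point are
# illustrative choices, not fixed by the example.
c = 1.0

def g(x, y):
    return (x + c, x + y)             # R^2 -> R^2

def grad_f(x, y, seed=1.0):
    # Forward phase: evaluate and remember the intermediates.
    p, q = g(x, y)
    m = p * q                         # mult
    # Reverse phase: (J pow2)^T, (J mult)^T, (J g)^T applied in turn.
    dm = seed * 2.0 * m               # derivative of m**2 at m
    dp, dq = dm * q, dm * p           # mult^T at (p, q)
    return (dp + dq, dq)              # g^T: x feeds both components, y feeds q
```

The result agrees with finite differences on f, which is the sanity check the worked example is building towards.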
Replacing ω by Ω[1] ≡ λx. [1]∗, we have shown so far that ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω[1]) ⟨ , ⟩ is reduced to

(λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · (ω )

Now, via Reduction 5 and β-reduction, we further reduce it to

(λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · [1]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · [ ]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · [ ]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · [ ]∗
−→ [ ]∗

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, considered in Figure 4.

Carol Mak and Luke Ong

Naïve Forward Mode: ⟨⟨ , ⟩ | ⟨ ⟩⟩ →g ⟨⟨ , ⟩ | ⟨ ⟩⟩ →∗ ⟨ | [15 12]⟩ →(−)² ⟨ | [660 528]⟩
Forward Mode: ⟨⟨ , ⟩ | ⟨ ⟩⟩ →g ⟨⟨ , ⟩ | ⟨ ⟩⟩ →∗ ⟨ | [ ]⟩ →(−)² ⟨ | [ ]⟩
Reverse Mode, Forward Phase: ⟨ , ⟩ →g ⟨ , ⟩ →∗ →(−)² ; Reverse Phase: ⟨ ⟩ ←g [ ] ←∗ [ ] ←(−)²
Pullback:
(Ω(g) ◦ Ω(∗) ◦ Ω((−)²)) (λx. [ ]) (⟨ , ⟩)
= (J(g)(⟨ , ⟩))∗ (Ω(∗) ◦ Ω((−)²)) (λx. [ ]) (⟨ , ⟩)
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ (Ω((−)²)) (λx. [ ]) ( )
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ (J((−)²)( ))∗ ((λx. [ ]) ( ))
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ [ ]
= (J(g)(⟨ , ⟩))∗ ⟨ ⟩
= ⟨ ⟩

Figure 4. Different modes of automatic differentiation performed on the function f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, after f is decomposed into elementary functions: R² →g R² →∗ R →(−)² R, where g(⟨x, y⟩) := ⟨x + , x + y⟩.

A.2 Sum Example
Consider the function that takes a list of real numbers and returns the sum of its elements. We show how Section 3 tells us to perform reverse-mode AD on such a higher-order function.
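The Church encoding used below can be transcribed directly into plain Python (an illustrative sketch; the function names are ours, not the language's):

```python
# Church encoding of lists: a list is its own fold.
# [x1, ..., xn] encodes to  λf d. f xn (... (f x2 (f x1 d))),
# and sum encodes to  λl. l (λx y. x + y) 0.
def church_list(xs):
    def encoded(f, d):
        acc = d
        for x in xs:                  # applies f to x1 first, xn last
            acc = f(x, acc)
        return acc
    return encoded

def church_sum(l):
    return l(lambda x, y: x + y, 0.0)
```

Any fold can be expressed the same way, e.g. a product by passing multiplication and seed 1.0 to the encoded list.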
Term
Using the standard Church encoding of lists, i.e.

List(X) ≡ (X → D → D) → (D → D)
[x₁, x₂, . . . , x_n] ≡ λf d. f x_n (. . . (f x₂ (f x₁ d)))

for some dummy type D, sum : List(R) → R can be expressed in the language described in Section 3 as λl. l (λxy. x + y) 0. Hence the derivative of sum at a list [ , − ] can be expressed as

{ω : Ω(List(R))} ⊢ ((Ω(sum)) · ω) [ , − ] : R∗.

Administrative Reduction
We first decompose the body of the sum : List(R) → R term, considered in Example 3.2, i.e. l (λxy. x + y) 0:

l (λxy. x + y) 0
−→∗_A ((let z′ = l in z′) (λxy. let z′ = x + y in z′)) (let z′ = 0 in z′)
−→∗_A (let z₁ = l; z₂ = λxy. (let z′ = x + y in z′); z₃ = z₁ z₂ in z₃) (let z′ = 0 in z′)
−→∗_A let z₁ = l; z₂ = λxy. (let z′ = x + y in z′); z₃ = z₁ z₂; z₄ = 0; z₅ = z₃ z₄ in z₅

Splitting the Omega
After the A-reductions, we split (Ω(λl. l (λxy. x + y) 0)) · ω via Reductions 7 and 8:

(Ω(λl. l (λxy. x + y) 0)) · ω
−→∗_A (Ω λl. let z₁ = l; z₂ = λxy. let z′ = x + y in z′; z₃ = z₁ z₂; z₄ = 0; z₅ = z₃ z₄ in z₅) · ω
−→∗ (Ω λl. ⟨l, l⟩)
· (Ω λ⟨l, z₁⟩. ⟨l, z₁, λxy. L⟩)
· (Ω λ⟨l, z₁, z₂⟩. ⟨l, z₁, z₂, z₁ z₂⟩)
· (Ω λ⟨l, z₁, z₂, z₃⟩. ⟨l, z₁, z₂, z₃, 0⟩)
· (Ω λ⟨l, z₁, z₂, z₃, z₄⟩. z₃ z₄) · ω

Pullback Reduction
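Before the symbolic reduction, it is worth noting what the answer must be numerically: since sum is linear, its derivative at any list is the all-ones covector. A finite-difference sketch in plain Python (the list [7.0, -2.0] is an illustrative choice, not the example's own list):

```python
# sum is linear, so its gradient at any list is the all-ones covector.
# Finite-difference check on an illustrative list.
def list_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def fd_gradient(f, xs, eps=1e-6):
    base = f(xs)
    grads = []
    for i in range(len(xs)):
        bumped = list(xs)
        bumped[i] += eps              # perturb one coordinate at a time
        grads.append((f(bumped) - base) / eps)
    return grads
```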
First, Figure 5 shows that ((Ω [ , − ]) · ω′) (λxy. L) is reduced to

(λv. v (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v d)) (+(⟨ , d⟩)))∗ · ω′ A

where A ≡ ⟨λxy. L, λxy. L, − , λy. +(⟨− , y⟩), λxy. L, , λy. +(⟨ , y⟩), d, +(⟨ , d⟩)⟩.

((Ω [ , − ]) · ω′) (λxy. L)
≡ ((Ω λf d. f (− ) (f ( ) d)) · ω′) (λxy. L)
−→∗_A ((Ω λf d. let z₁ = f; z₂ = − ; z₃ = z₁ z₂; z₄ = f; z₅ = ; z₆ = z₄ z₅; z₇ = d; z₈ = z₆ z₇; z₉ = z₃ z₈ in z₉) · ω′) (λxy. L)
−→∗ ((Ω λf. ⟨f, f⟩) · (Ω λ⟨f, z₁⟩. ⟨f, z₁, − ⟩) · (Ω λ⟨f, z₁, z₂⟩. ⟨f, z₁, z₂, z₁ z₂⟩) · (Ω λ⟨f, z₁, z₂, z₃⟩. ⟨f, z₁, z₂, z₃, f⟩) · (Ω λ⟨f, z₁, …, z₄⟩. ⟨f, z₁, …, z₄, ⟩) · (Ω λ⟨f, z₁, …, z₅⟩. ⟨f, z₁, …, z₅, z₄ z₅⟩) · (Ω λ⟨f, z₁, …, z₆⟩. ⟨f, z₁, …, z₆, d⟩) · (Ω λ⟨f, z₁, …, z₇⟩. ⟨f, z₁, …, z₇, z₆ z₇⟩) · (Ω λ⟨f, z₁, …, z₈⟩. ⟨f, z₁, …, z₈, z₃ z₈⟩) · ω′) (λxy. L)
−→∗ (λv. ⟨v, v⟩)∗ · (λ⟨v₁, v₂⟩. ⟨v₁, v₂, 0⟩)∗ · (λ⟨v₁, v₂, v₃⟩. ⟨v₁, v₂, v₃, v₂ (− ) + λy. (J + · ⟨v₃, 0⟩) ⟨− , y⟩⟩)∗ · (λ⟨v₁, …, v₄⟩. ⟨v₁, …, v₄, v₁⟩)∗ · (λ⟨v₁, …, v₅⟩. ⟨v₁, …, v₅, 0⟩)∗ · (λ⟨v₁, …, v₆⟩. ⟨v₁, …, v₆, v₅ ( ) + λy. (J + · ⟨v₆, 0⟩) ⟨ , y⟩⟩)∗ · (λ⟨v₁, …, v₇⟩. ⟨v₁, …, v₇, 0⟩)∗ · (λ⟨v₁, …, v₈⟩. ⟨v₁, …, v₈, v₇ d + (J +(⟨ , −⟩) · v₈) d⟩)∗ · (λ⟨v₁, …, v₉⟩. ⟨v₁, …, v₉, v₄ (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · v₉) (+(⟨ , d⟩))⟩)∗ · ω′ A
−→∗ (λv. v (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v d)) (+(⟨ , d⟩)))∗ · ω′ A

Figure 5. Reduction of ((Ω [ , − ]) · ω′) (λxy. L).

Then, we reduce ((Ω(sum)) · ω) [ , − ] as follows.

((Ω λl. ⟨l, l⟩) · (Ω λ⟨l, z₁⟩. ⟨l, z₁, λxy. L⟩) · (Ω λ⟨l, z₁, z₂⟩. ⟨l, z₁, z₂, z₁ z₂⟩) · (Ω λ⟨l, z₁, z₂, z₃⟩. ⟨l, z₁, z₂, z₃, 0⟩) · (Ω λ⟨l, z₁, z₂, z₃, z₄⟩. z₃ z₄) · ω) [ , − ]
−→∗ (λv. ⟨v, v⟩)∗ · (λ⟨v₁, v₂⟩. ⟨v₁, v₂, 0⟩)∗ · (λ⟨v₁, v₂, v₃⟩. ⟨v₁, v₂, v₃, v₂ (λxy. L) + v₃ (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v₃ d)) (+(⟨ , d⟩))⟩)∗ · (λ⟨v₁, …, v₄⟩. ⟨v₁, …, v₄, 0⟩)∗ · (λ⟨v₁, …, v₅⟩. ⟨v₁, …, v₅, v₄ 0 + (J +(⟨− , +(⟨ , −⟩)⟩) · v₅) 0⟩)∗ · ω B
−→∗ (λv. v (λxy. L) 0)∗ · ω B

where B ≡ ⟨[ , − ], [ , − ], λxy. L, λd. +(⟨− , +(⟨ , d⟩)⟩), 0, ⟩. Hence (λv. v (λxy. L) 0)∗ · ω B is the pullback of sum ≡ λl. l (λxy. L) 0 at [ , − ]. This sequence of reductions tells us how the derivative of sum at [ , − ] can be computed using reverse-mode AD.

B Administrative Reduction
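The administrative reduction formalised in this section is essentially conversion to A-normal form: every compound subterm is bound to a fresh let-variable, leaving a let series of elementary terms. A minimal sketch for a toy expression tree (illustrative, not the paper's syntax):

```python
# Administrative reduction as A-normal-form conversion for a toy expression
# tree: each compound subterm is bound to a fresh let-variable z1, z2, ...
import itertools

def to_let_series(expr):
    counter = itertools.count(1)
    bindings = []                     # the let series, in dependency order

    def anf(e):
        if not isinstance(e, tuple):
            return e                  # atoms (variables, constants) are elementary
        op, *args = e
        simple = [anf(a) for a in args]
        z = f"z{next(counter)}"
        bindings.append((z, (op, *simple)))
        return z

    result = anf(expr)
    return bindings, result
```

For example, to_let_series(("pow2", ("mult", ("g", "x", "y")))) yields the let series z1 = g(x, y); z2 = mult(z1); z3 = pow2(z2) with result z3, mirroring the decomposition used in Appendix A.1.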
Elementary terms E, let series L, A-contexts C_A and A-redexes r_A are defined as follows.

E ::= 0 | z₁ + z₂ | z | λx. L | z₁ z₂ | π_i(z) | ⟨z₁, z₂⟩ | r | f(z) | J f · z | (λx. L)∗ · z | (Ω λx. L) · z | r∗
L ::= let z = E in L | let z = E in z.
C_A ::= [] | C_A + P | L + C_A | λz. C_A | C_A P | L C_A | π_i(C_A) | ⟨C_A, S⟩ | ⟨L, C_A⟩ | f(C_A) | J f · C_A | (λx. C_A)∗ · S | (λx. L)∗ · C_A | (Ω(λx. C_A)) · S | (Ω(λx. L)) · C_A
r_A ::= 0 | L₁ + L₂ | x | λz. L | L₁ L₂ | π_i(L) | ⟨L₁, L₂⟩ | r | f(L) | J f · L | (λx. L)∗ · L | (Ω λx. L) · L | r∗

Lemma B.1.
Every pullback term P can be expressed as either C_A[r_A] for some unique A-context C_A and A-redex r_A, or as a let series of elementary terms L. An A-redex r_A is reduced to a let series L as follows.

0 −→_A let x = 0 in x
L₁ + L₂ −→_A let x₁ = L₁; x₂ = L₂; x₃ = x₁ + x₂ in x₃
x −→_A let x₁ = x in x₁
λz. L −→_A let x = λz. L in x
L₁ L₂ −→_A let x₁ = L₁; x₂ = L₂; x₃ = x₁ x₂ in x₃
π_i(L) −→_A let x₁ = L; x₂ = π_i(x₁) in x₂
⟨L₁, L₂⟩ −→_A let x₁ = L₁; x₂ = L₂; x₃ = ⟨x₁, x₂⟩ in x₃
r −→_A let x = r in x
f(L) −→_A let x₁ = L; x₂ = f(x₁) in x₂
J f · L −→_A let x₁ = L; x₂ = J f · x₁ in x₂
(λx. L′)∗ · L −→_A let x₁ = L; x₂ = (λx. L′)∗ · x₁ in x₂
(Ω λx. L′) · L −→_A let x₁ = L; x₂ = (Ω(λx. L′)) · x₁ in x₂
r∗ −→_A let x = r∗ in x

Any pullback term P which can be expressed as C_A[r_A] can be A-reduced to C_A[L], where r_A −→_A L.

C Interpretation

⟦Γ ⊢ 0 : σ⟧γ = 0
⟦Γ ⊢ S + P : σ⟧γ = ⟦S⟧γ + ⟦P⟧γ
⟦Γ ∪ {x : σ} ⊢ x : σ⟧⟨γ, z⟩ = z
⟦Γ ⊢ λx. S : σ₁ ⇒ σ₂⟧γ = cur(⟦S⟧)γ
⟦Γ ⊢ S P : σ⟧γ = (⟦S⟧γ)(⟦P⟧γ)
⟦Γ ⊢ π_i(S) : σ_i⟧γ = π_i(⟦S⟧γ)
⟦Γ ⊢ ⟨S₁, S₂⟩ : σ₁ × σ₂⟧γ = ⟨⟦S₁⟧γ, ⟦S₂⟧γ⟩
⟦Γ ⊢ r : R⟧γ = r
⟦Γ ⊢ f(P) : R^m⟧γ = f(⟦P⟧γ)
⟦Γ ⊢ J f · S : R^n ⇒ R^m⟧γ = cur(D[f])(⟦S⟧γ)
⟦Γ ⊢ (λx. S₁)∗ · S₂ : σ∗⟧γ = λv. ⟦S₂⟧γ (⟦S₁⟧⟨γ, v⟩)
⟦Γ ⊢ (Ω λx. P) · S : Ω σ⟧γ = λx v. ⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
⟦Γ ⊢ [r₁ . . . r_n]∗ : R^n∗⟧γ = λ[v₁ . . . v_n]. Σⁿᵢ₌₁ r_i v_i

D Extended Differential Lambda-Calculus
Differential substitution for the extended differential λ-terms is defined as follows.

∂/∂x (π_i(s)) · T ≡ π_i(∂s/∂x · T)
∂/∂x ⟨s₁, s₂⟩ · T ≡ ⟨∂s₁/∂x · T, ∂s₂/∂x · T⟩
∂r/∂x · T ≡ 0
∂/∂x (f(s)) · T ≡ (D f · (∂s/∂x · T)) s
∂/∂x (D f · s) · T ≡ D f · (∂s/∂x · T)

Consider the term f(s). There are no linear occurrences of x in f. Hence, we ignore f and perform differential substitution on s directly, obtaining (D f · (∂s/∂x · T)) s.

We can interpret the extended differential λ-calculus in a differential λ-category, which gives the categorical semantics of the differential λ-calculus. Hence, what is left to show is the interpretation of the extended terms.

⟦π_i(s)⟧ = π_i ∘ ⟦s⟧
⟦⟨s₁, s₂⟩⟧ = ⟨⟦s₁⟧, ⟦s₂⟧⟩
⟦r⟧ = λγ. r
⟦f(s)⟧ = f ∘ ⟦s⟧
⟦D f · s⟧ = λγ x. D[f]⟨⟦s⟧γ, x⟩

Translation to Differential Lambda-Calculus

(π_i(S))ᵗ := π_i(Sᵗ)
(S + P)ᵗ := Sᵗ + Pᵗ
(⟨S₁, S₂⟩)ᵗ := ⟨S₁ᵗ, S₂ᵗ⟩
yᵗ := y
rᵗ := r
(λy. S)ᵗ := λy. Sᵗ
(f(P))ᵗ := f(Pᵗ)
(S P)ᵗ := Sᵗ Pᵗ
(J f · S)ᵗ := D f · Sᵗ
([r₁ . . . r_n]∗)ᵗ := λv. Σⁿᵢ₌₁ f_i(π_i(v)), where f_i := r_i × −
((λy. S₁)∗ · S₂)ᵗ := λv. S₂ᵗ ((λy. S₁ᵗ) v)
((Ω λy. P) · S)ᵗ := λx v. Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)

E Proofs
Proposition E.1. The derivative of any constant morphism f in a differential λ-category is 0, i.e. D[f] = 0.

Proof. A constant morphism f : A → B that maps all of A to b ∈ B can be written as f = (λz. b) ∘ 0, where 0 : A → B is the zero map and λz. b : B → B. So by [CD1, 2, 5] we have

D[f] = D[(λz. b) ∘ 0] = D[λz. b] ∘ ⟨D[0], 0 ∘ π₂⟩ = D[λz. b] ∘ ⟨0, 0 ∘ π₂⟩ = 0. □

Lemma E.2.
Con∞ is a differential λ-category with the differential operator D[f]⟨v, x⟩ := J(f)(x)(v) = lim_{t→0} (f(x + tv) − f(x))/t.

Proof. [17, 23] have shown that Con∞ is Cartesian closed, and [8] has shown that Con∞ is a Cartesian differential category. What is left to show is that λ(−) preserves the additive structure and that D[−] satisfies the (D-curry) rule, i.e. D[λ(f)] = λ(D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩).

We first show that λ(−) is additive, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0. Note that for f, g : A × B → C and a ∈ A, b ∈ B,

λ(f + g)(a)(b) = (f + g)⟨a, b⟩ = f⟨a, b⟩ + g⟨a, b⟩ = λ(f)(a)(b) + λ(g)(a)(b)

and λ(0)(a)(b) = 0⟨a, b⟩ = 0 = 0(a)(b).

Now we show that D[−] satisfies the (D-curry) rule. Let f : A × B → C, v, x ∈ A and b ∈ B.

D[λ(f)]⟨v, x⟩ b = (lim_{t→0} (λ(f)(x + vt) − λ(f)(x))/t) b
= lim_{t→0} (f⟨x + vt, b⟩ − f⟨x, b⟩)/t
= lim_{t→0} (f(⟨x, b⟩ + t⟨v, 0⟩) − f⟨x, b⟩)/t
= D[f]⟨⟨v, 0⟩, ⟨x, b⟩⟩
= (D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩)⟨⟨v, x⟩, b⟩
= λ(D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩)⟨v, x⟩ b □

Proposition E.3.
Let E be a convenient vector space, and let x, y ∈ E be distinct elements of E. Then there exists a bornological linear map l : E → R that separates x and y, i.e. l(x) ≠ l(y).

Proof. This follows from the fact that every convenient vector space is separated. x ≠ y implies that x − y ≠
0. Hence by separation, thereis a bornological linear map l : E → R such that l ( x − y ) , l is linear, so we have l ( x ) − l ( y ) , l ( x ) , l ( y ) . (cid:3) Lemma 4.4 (Linearity) . Let Γ ∪ { x : σ } ⊢ P : τ and Γ ⊢ P : σ ∗ . Let γ ∈ J Γ K and γ ∈ J Γ K . Then,1. if x ∈ lin ( P ) , then cur ( J P K ) γ is linear, i.e. D [ cur ( J P K ) γ ] = ( cur ( J P K ) γ ) ◦ π ,2. J P K γ is linear, i.e. D [ J P K γ ] = ( J P K γ ) ◦ π .Proof. Induction on the structure of P on the following twostatements.IH.1 If Γ ∪ { x : σ } ⊢ P : τ and x ∈ lin ( P ) , then forany γ ∈ J Γ K , cur ( J P K ) γ is linear, i.e. D [ cur ( J P K ) γ ] = ( cur ( J P K ) γ ) ◦ π .IH.2 If Γ ⊢ P : σ ∗ , then for any γ ∈ J Γ K , J P K γ is linear, i.e. D [ J P K γ ] = ( J P K γ ) ◦ π .(var) Say P ≡ x .(1) If Γ ∪ { x : σ } ⊢ x : σ and x ∈ lin ( x ) , then D [ cur ( J x K ) γ ] = D [ Id ] = π = Id ◦ π = ( cur ( J x K ) γ ) ◦ π .(2) If Γ ⊢ x : σ ∗ , then Γ = Γ ∪ { x : σ ∗ } so for any h γ , z i ∈ J Γ K , z is linear and D [ J x K h γ , z i] = D [ z ] = z ◦ π = ( J P K h γ , z i) ◦ π .(dual) Say P ≡ ( λx . S ) ∗ · S .(1) Let Γ ∪ { x : σ } ⊢ ( λx . S ) ∗ · S : τ and x ∈ lin (( λx . S ) ∗ · S ) : = (cid:0) lin ( S ) \ FV ( S ) (cid:1) ∪ (cid:0) lin ( S ) \ FV ( S ) (cid:1) , then for any γ ∈ J Γ K and since J S K h γ , x i is of a dual type, by IH.2, D [ cur ( J ( λx . S ) ∗ · S K ) γ ]h v , x i = λz . (cid:0) D [ J S K h γ , −i]h v , x i (cid:1) д (h x , z i) + D [ J S K h γ , x i]h D [ д (h− , z i)]h v , x i , д (h x , z i)i = λz . (cid:0) D [ J S K h γ , −i]h v , x i (cid:1) д (h x , z i) + J S K h γ , x i( D [ д (h− , z i)]h v , x i) where д : h x , z i 7→ J S K h γ , x , z i . Note that x can onlybe in either lin ( S ) \ FV ( S ) or lin ( S ) \ FV ( S ) but notboth. Say x ∈ lin ( S ) \ FV ( S ) , then by Proposition E.1and IH.1, D [ cur ( J ( λx . S ) ∗ · S K ) γ ]h v , x i = λz . 
(cid:0) D [ J S K h γ , −i]h v , x i (cid:1) ( J S K h γ , x , z i) + J S K h γ , x i( D [ J S K h γ , − , z i]h v , x i) = λz . J S K h γ , x i( D [ J S K h γ , − , z i]h v , x i) = λz . J S K h γ , v i( J S K h γ , v , z i) = J ( λx . S ) ∗ · S K h γ , v i = (cid:0) cur ( J ( λx . S ) ∗ · S K ) γ ) ◦ π (cid:1) h v , x i (2) Let Γ ⊢ ( λx . S ) ∗ · S : σ ∗ and γ ∈ J Γ K . Then, byIH.1 and IH.2, D [ J ( λx . S ) ∗ · S K γ ] = D [( J S K γ ) ◦ (cid:0) cur ( J S K ) γ (cid:1) ] = D [ J S K γ ] ◦ h D [ cur ( J S K ) γ ] , ( cur ( J S K ) γ ) ◦ π i = ( J S K γ ) ◦ ( cur ( J S K ) γ ) ◦ π = ( J ( λx . S ) ∗ · S K γ ) ◦ π All other cases are straight forward inductive proofs. (cid:3)
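The Linearity Lemma can be checked numerically for concrete maps: for a linear f, the directional derivative D[f]⟨v, x⟩ equals f(v), independently of the base point x. A finite-difference sketch (f_linear is an illustrative map, not drawn from the paper):

```python
# For a linear map f, the directional derivative D[f]<v, x> equals f(v),
# independently of the base point x: a numerical view of the Linearity Lemma.
def directional_derivative(f, v, x, eps=1e-6):
    fx = f(x)
    fxe = f([xi + eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / eps for a, b in zip(fxe, fx)]

def f_linear(p):
    # An illustrative linear map: f(x, y) = (3x + y, 2y).
    x, y = p
    return [3.0 * x + y, 2.0 * y]
```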
Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id_{⟦Γ⟧}, ⟦Γ ⊢ P : σ⟧⟩.

Proof.
The only interesting cases are dual and pullback maps.

(dual) ((λx. S₁)∗ · S₂)[P′/y] ≡ (λx. S₁[P′/y])∗ · S₂[P′/y]

⟦((λx. S₁)∗ · S₂)[P′/y]⟧γ = ⟦(λx. S₁[P′/y])∗ · S₂[P′/y]⟧γ
= ⟦S₂[P′/y]⟧γ ∘ cur(⟦S₁[P′/y]⟧)γ
= λx. ⟦S₂⟧⟨γ, ⟦P′⟧γ⟩ (⟦S₁⟧⟨γ, ⟦P′⟧γ, x⟩)   (IH)
= ⟦(λx. S₁)∗ · S₂⟧⟨γ, ⟦P′⟧γ⟩

(pb) ((Ω λx. P) · S)[P′/y] ≡ (Ω λx. P[P′/y]) · S[P′/y]

⟦((Ω λx. P) · S)[P′/y]⟧γ = ⟦(Ω λx. P[P′/y]) · S[P′/y]⟧γ
= λx v. (⟦S⟧⟨γ, ⟦P′⟧γ⟩) (⟦P⟧⟨γ, x, ⟦P′⟧γ⟩) (D[⟦P⟧⟨γ, −, ⟦P′⟧γ⟩]⟨v, x⟩)   (IH)
= ⟦(Ω λx. P) · S⟧⟨γ, ⟦P′⟧γ⟩ □

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P −→_A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P −→ P′ implies ⟦P⟧ = ⟦P′⟧.

Proof.
1. Easy induction on −→ A .2. Case analysis on reductions of pullback terms. Let γ ∈ J Γ K .(1-4) J ( λx . S ) V K = J S [ V / x ] K , π i (h V , V i) = J V i K , f ( r ) = J f ( r ) K and J J ( f )( r )( r ′ ) K = D [ f ]h r ′ , r i are easilyverified using the Substitution Lemma 4.5.(5) Let J ( f )( r ) = [ a ij ] i = ,..., m , j = ,..., n and r ′ = [ r ′ i ] i = ,..., m . J ( λv . J ( f )( r )( v )) ∗ · r ′∗ K γ arol Mak and Luke Ong = (J ( f )( r )) ∗ ( λv . m Õ i = r ′ i v i ) = λv . m Õ i = r ′ i n Õ j = a ij v j = λv . n Õ j = m Õ i = ( r ′ i · a ij ) v j = λv . n Õ j = ((J ( f )( r )) ⊤ × r ) j v j = J ((J ( f )( r )) ⊤ × r ) ∗ K γ (6) Say Γ ∪ { v : σ } ⊢ V : τ . Let Γ ∪ { v : σ , v : σ } ⊢ V ′ : τ where v is not a free variable in V . J ( λv . V ) ∗ · (cid:16) ( λv . V ) ∗ · V (cid:17) K γ = (cid:0) cur ( J V K ) γ (cid:1) ∗ (cid:16)(cid:0) cur ( J V K ) γ (cid:1) ∗ ( J V K γ ) (cid:17) = (cid:16) (cid:0) cur ( J V K ) γ (cid:1) ◦ (cid:0) cur ( J V K ) γ (cid:1)(cid:17) ∗ ( J V K γ ) = (cid:16) v J V K h γ , J V K h γ , v ii (cid:17) ∗ ( J V K γ ) = (cid:16) v J V ′ K hh γ , v i , J V K h γ , v ii (cid:17) ∗ ( J V K γ ) = (cid:16) v (cid:0) J V ′ K ◦ h Id , J V K i (cid:1) h γ , v i (cid:17) ∗ ( J V K γ ) = (cid:16) v (cid:0) J V ′ [ V / v ] K (cid:1) h γ , v i (cid:17) ∗ ( J V K γ ) = (cid:16) cur ( J V ′ [ V / v ] K ) γ (cid:17) ∗ ( J V K γ ) = J ( λv . V ′ [ V / v ]) ∗ · V K (7) Using the Substitution Lemma 4.5, J ( Ω ( λy . let x = E in x )) · ω K = J ( Ω ( λy . E )) · ω K follows immediately from J λy . let x = E in x K = cur ( J let x = E in x K ) = cur ( J E K ) = J λy . E K . (8) Consider ( Ω ( λy . let x = E in L )) · ω −→( Ω ( λy . h y , E i)) · (cid:0) ( Ω ( λz . b L )) · ω (cid:1) where Γ ∪ { z : σ × σ } ⊢ b L ≡ L [ π ( z )/ y ][ π ( z )/ x ] : τ . J ( Ω ( λy . 
let x = E in L )) · ω K γ = Ω (cid:16) cur ( J let x = E in L K ) γ (cid:17) ( J ω K γ ) = Ω (cid:16) cur ( J L K ◦ h Id , J E K i) γ (cid:17) ( J ω K γ ) = Ω (cid:16) s J L K hh γ , s i , J E K h γ , s ii (cid:17) ( J ω K γ ) = Ω (cid:16) s J b L K h γ , h s , J E K h γ , s iii (cid:17) ( J ω K γ ) = Ω (cid:16) (cid:0) cur ( J b L K ) γ (cid:1) ◦ h Id J σ K , cur ( J E K ) γ i (cid:17) ( J ω K γ ) = Ω (cid:0) h Id J σ K , cur ( J E K ) γ i (cid:1) (cid:16) Ω (cid:0) cur ( J b L K ) γ (cid:1) ( J ω K γ ) (cid:17) = Ω (cid:0) h Id J σ K , cur ( J E K ) γ i (cid:1) (cid:16) J ( Ω λz . b L ) · ω K γ (cid:17) = Ω (cid:0) cur ( J h y , E i K ) γ (cid:1) (cid:16) J ( Ω λz . b L ) · ω K γ (cid:17) = J ( Ω λy . h y , E i) · (cid:0) ( Ω λz . b L ) · ω (cid:1) K (9) Say y is not free in E and (cid:0) ( Ω ( λy . E )) · ω (cid:1) V −→ J (( Ω λy . E ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J E K h γ , x i)( D [ cur ( J E K ) γ ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ ( J E K h γ , x i) (cid:1) ( J V K γ ) = ( λxv . ) ( J V K γ ) = λv . = J K γ since cur ( J E K ) γ is a constant function and the deriv-ative of any constant function is 0 by PropositionE.1.(10) We present the proof for (10b) (cid:0) ( Ω λy . y πi + y π j ) · ω (cid:1) V −→ ( λv . v πi + v π j ) ∗ · ω (cid:0) V πi + V π j (cid:1) which leads to (10.1). J (( Ω λy . y πi + y π j ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y πi + y π j K h γ , x i)( D [ π i + π j ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ ( x πi + x π j )( v πi + v π j ) (cid:1) ( J V K γ ) = λv . J ω K γ ((( J V K γ ) πi + (( J V K γ ) π j )( v πi + v π j ) = λv . J ω ( V πi + V π j ) K γ ( v πi + v π j ) = J ( λv . v πi + v π j ) ∗ · ω ( V πi + V π j ) K γ (11) (cid:0) ( Ω λy . y ) · ω (cid:1) V −→ ( λv . v ) ∗ · ω V J (( Ω λy . y ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y K h γ , x i)( D [ Id ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ xv (cid:1) ( J V K γ ) = λv . J ω K γ ( J V K γ ) v = λv . 
J ω V K γv = J ( λv . v ) ∗ · ω V K γ (12) (( Ω λy . y πi ) · ω ) V −→ ( λv . v πi ) ∗ · ω V πi J (( Ω λy . y πi ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y πi K h γ , x i)( D [ π i ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ x πi v πi (cid:1) ( J V K γ ) = λv . J ω K γ ( J V πi K γ ) v πi = J ( λv . v πi ) ∗ · ω V πi K γ (13) We prove for (13c), (( Ω λy . h y πi , y π j i) · ω ) V −→( λv . h v πi , v π j i) ∗ · ω h V πi , V π j i which leads to (13a)and (13b). J (( Ω λy . h y πi , y π j i) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J h y πi , y π j i K h γ , x i)( D [h π i , π j i]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ h x πi , x π j ih v πi , v π j i (cid:1) ( J V K γ ) = λv . J ω K γ h( J V K γ ) πi , ( J V K γ ) π j ih v πi , v π j i = λv . J ω h V πi , V π j i K γ h v πi , v π j i = J ( λv . h v πi , v π j i) ∗ · ω h V πi , V π j i K γ (14) (cid:0) ( Ω λy . J f · y πi ) · ω (cid:1) V −→ ( λv . J f · v πi ) ∗ · (cid:0) ω (J f · V πi ) (cid:1) By [CD3,4,5,6], D [ λyz . D [ f ]h y πi , z i] = D [ cur ( D [ f ]) ◦ π i ] = D [ cur ( D [ f ])] ◦ ( π i × π i ) = cur ( D [ D [ f ]] ◦ h π × , π × Id i) ◦ ( π i × π i ) = cur ( D [ f ] ◦ ( π × Id )) ◦ ( π i × π i ) Hence J (cid:0) ( Ω λy . J f · y πi ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) λx . D [ f ]h J V π j K γ , x i (cid:1)(cid:0) D [ λyz . D [ f ]h y πi , z i]h v , J V K γ i (cid:1) = λv . J ω K γ (cid:0) λx . D [ f ]h J V π j K γ , x i (cid:1)(cid:16)(cid:0) cur ( D [ f ] ◦ ( π × Id )) ◦ ( π i × π i ) (cid:1) h v , J V K γ i (cid:17) = λv . J ω K γ (cid:0) J (J f · V πi ) K γ (cid:1) ( λx . D [ f ]h v πi , x i) ifferential-form Pullback Programming Language and Reverse-mode AD = J ( λv . J f · v πi ) ∗ · (cid:0) ω (J f · V πi ) (cid:1) K γ (15) (( Ω λy . f ( y πi )) · ω ) V −→ ( λv . (J f · v πi ) V πi ) ∗ · (cid:0) ω ( f ( V πi )) (cid:1) J (( Ω λy . f ( y πi )) · ω ) V K γ = (cid:0) λxv . J ω K γ (cid:0) f x πi (cid:1) (cid:0) D [ f ]h v πi , x πi i (cid:1) (cid:1) ( J V K γ ) = λv . 
J ω K γ (cid:0) f ( J V πi K γ ) (cid:1) (cid:0) D [ f ]h v πi , J V πi K γ i (cid:1) = λv . J ω K γ (cid:0) f ( J V πi K γ ) (cid:1) (cid:0) J (J f · v πi ) V πi K h γ , v i (cid:1) = J ( λv . (J f · v πi ) V πi ) ∗ · (cid:0) ω ( f ( V πi )) (cid:1) K γ (16) We prove for the most complicated case (16c) whichleads to (16a) and (16b).By IH, J (cid:0) ( Ω λy . L ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω V ′ K impliesfor any 1-form ϕ , γ and x , v , ϕ ( J L K h γ , J V K γ , x i) ( D [ J L K h γ , − , x i]h v , J V K γ i) = ϕ ( J V K γ ) ( J S K h γ , x , v i) . By Hahn-Banach Theorem, we have D [ J L K h γ , − , x i]h v , J V K γ i = J S K h γ , x , v i . First, note that since V πi is of the dual type, henceby Lemma 4.4 (2), D [ J V πi K γ ] = ( J V πi K γ ) ◦ π . D [ cur ( J ( λz . L ) ∗ · y πi K ) γ ]h v , J V K γ i = D [ λy . λz . y πi ( J L K h γ , y , z i)]h v , J V K γ i = D [ cur ( ev ◦ h π i ◦ π , д i)]h v , J V K γ i = λz . D [ ev ◦ h π i ◦ π , д i]hh v , i , h J V K γ , z ii = λz . (cid:0) ev ◦ h D [ π i ◦ π ] , д ◦ π i + D [ uncur ( π i ◦ π )]◦ hh , D [ д ]i , h π , д ◦ π ii (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i ◦ π )]hh , D [ д ]hh v , i , h J V K γ , z iii , hh J V K γ , z i , д h J V K γ , z iii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i )]hh , D [ д ]hh v , i , h J V K γ , z iii , h J V K γ , д h J V K γ , z iii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i )h J V K γ , −i]h D [ д ]hh v , i , h J V K γ , z ii , д h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + (cid:0) D [ J V πi K γ ] ◦ h D [ д ] , д ◦ π i (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + (cid:0) v πi ◦ π ◦ h D [ д ] , д ◦ π i (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + J V πi K γ ) (cid:0) D [ д h− , z i]h v , J V K γ i (cid:1) = λz . v πi ( J L K h γ , J V K γ , z i) + J V πi K γ (cid:0) D [ J L K h γ , − , z i]h v , J V K γ i (cid:1) = λz . 
v πi ( J L K h γ , J V K γ , z i) + J V πi K γ (cid:0) J S K h γ , x , v i (cid:1) = J ( λz . L [ V / y ]) ∗ · v πi + ( λz . S ) ∗ · V πi K h γ , v i . where д : h y , z i 7→ J L K h γ , y , z i .Now we have J (cid:0) ( Ω λy . ( λz . L ) ∗ · y πi ) · ω (cid:1) V K γ = λv . J ω K γ ( J ( λz . L ) ∗ · y πi K h γ , J V K γ i) (cid:0) D [ cur ( J λz . L ∗ · y πi K ) γ ]h v , J V K γ i (cid:1) = λv . J ω K γ ( J ( λz . L [ V / y ]) ∗ · V πi K γ ) (cid:0) J ( λz . L [ V / y ]) ∗ · v πi + ( λz . S ) ∗ · V πi K h γ , v i (cid:1) = J ( λv . ( λz . S ) ∗ · V πi + ( λz . L [ V / y ]) ∗ · v πi ) ∗ · ω (( λz . L [ V / y ]) ∗ · V πi ) K γ (17) (cid:0) ( Ω λy . ( Ω λx . L ) · y πi ) · ω (cid:1) V −→ (cid:0) ( Ω λy . λa . ( λv . S ) ∗ · z L [ a / x ]) · ω (cid:1) V if (cid:0) ( Ω λx . L ) · z ) a −→ ∗ ( λv . S ) ∗ · z L [ a / x ] forfresh variable a .By IH, J (cid:0) ( Ω λx . L ) · z ) a K = J ( λv . S ) ∗ · z L [ a / x ] K im-plies for any ϕ , γ y , a , v , ϕ ( J L K h γ , a , y i) ( D [ J L K h γ , − , y i]h v , a i) = ϕ ( J L K h γ , a , y i) ( J S K h γ , y , a , v i) . By Hahn-Banach Theorem, D [ J L K h γ , − , y i]h v , a i = J S K h γ , y , a , v i . J ( Ω λx . L ) · z K h γ , y i = λav . J z K h γ , y i ( J L K h γ , v , a i) ( D [ J L K h γ , − , y i]h v , a i) = λav . J z K h γ , y i ( J L [ a / x ] K h γ , v i) ( J S K h γ , y , a , v i) = λa . J ( λv . S ) ∗ · z L [ a / x ] K h γ , y , a i = J λa . ( λv . S ) ∗ · z L [ a / x ] K h γ , y i Hence we have J (cid:0) ( Ω λy . ( Ω λx . L ) · z ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) J ( Ω λx . L ) · z K h γ , J V K γ i (cid:1)(cid:0) D [ cur ( J ( Ω λx . L ) · z K ) γ ]h v , J V K γ i (cid:1) = λv . J ω K γ (cid:0) J λa . ( λv . S ) ∗ · z L [ a / x ] K h γ , J V K γ i (cid:1)(cid:0) D [ cur ( J ( Ω λx . L ) · z K ) γ ]h v , J V K γ i (cid:1) = J (cid:0) ( Ω λy . λa . ( λv . S ) ∗ · z L [ a / x ]) · ω (cid:1) V K γ (18) If (cid:0) ( Ω λy . L ) · ω (cid:1) V −→ ∗ ( λv . S ) ∗ · ω V and x < FV ( V ) , then (cid:0) ( Ω λy . 
λx . L ) · V (cid:1) V −→ ( λv . λx . S ) ∗ · V λx . L [ V / y ] . Recall the (D-curry) rule, D [ cur ( f )] = cur ( D [ f ] ◦ h π × , π × Id i) . By IH, we have J (cid:0) ( Ω λy . L ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω ( L [ V / y ]) K , whichmeans for any 1-form ϕ , γ and x , v , ϕ ( J L K h γ , J V K γ , x i) ( D [ J L K h γ , − , x i]h v , J V K γ i) = ϕ ( J L K h γ , J V K γ , x i) ( J S K h γ , x , v i) . By Hahn-Banach Theorem, D [ J L K h γ , − , x i]h v , J V K γ i = J S K h γ , x , v i . Now D [ cur ( J λx . L K ) γ ]h v , J V K γ i = D [ cur ( J L K )h γ , −i]h v , J V K γ i = D [ cur ( f )]h v , J V K γ i = cur ( D [ f ] ◦ h π × , π × Id i)h v , J V K γ i = λx . ( D [ f ] ◦ h π × , π × Id i)hh v , J V K γ i , x i = λx . D [ f h− , x i]h v , J V K γ i = λx . D [ J L K h γ , − , x i]h v , J V K γ i = λx . J S K h γ , x , v i where f : = uncur ( cur ( J L K )h γ , −i) . Hence, we have J (( Ω λy . λx . L ) · V ) V K γ = (cid:0) λxv . J V K γ ( J λx . L K h x , γ i)( D [ cur ( J λx . L K ) γ ]h v , x i) (cid:1) ( J V K γ ) arol Mak and Luke Ong = λv . J V K γ ( J λx . L K h J V K γ , γ i)( D [ cur ( J λx . L K ) γ ]h v , J V K γ i) = λv . J V K γ ( J λx . L [ V / y ] K γ )( λx . J S K h γ , x , v i) = λv . J V ( λx . L [ V / y ]) K γ ( J λx . S K h γ , v i) = J ( λv . λx . S ) ∗ · V ( λx . L [ V / y ]) K γ (19) We prove it for the complicated case (19c) and (19a)and (19b) follows.First note that by (D-eval) in [21], we have D [ ev ◦ h π i , π j i]h v , x i = π i ( v )( π j ( x )) + D [ π i ( x )]h π j ( v ) , π j ( x )i . By IH, and V πi ≡ λz . P ′ ,we have J (cid:0) ( Ω λz . P ′ ) · ω (cid:1) V π j K = J ( λv ′ . S ′ ) ∗ · ω ( P ′ [ V π j / z ]) K which means for any 1-form ϕ , γ and v , ϕ ( J P ′ K h γ , J V π j K γ i) (cid:0) D [ cur ( J P ′ K ) γ ]h v , J V π j K γ i (cid:1) = ϕ ( J P ′ K h γ , J V π j K γ i) ( J S ′ K h γ , v i) . 
By Hahn-Banach Theorem, D [ J V πi K γ ]h v π j , J V π j K γ i = D [ cur ( J P ′ K ) γ ]h v , J V π j K γ i = J S ′ K h γ , v i . Hencewe have J (cid:0) ( Ω λy . y πi y π j ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) J y πi y π j K h γ , J V K γ i (cid:1) (cid:0) D [ ev ◦ h π i , π j i]h v , J V K γ )i (cid:1) = λv . J ω K γ (cid:0) J V πi V π j K γ (cid:1)(cid:0) v πi ( J V π j K γ ) + D [ J V πi K γ ]h v π j , J V π j K γ i (cid:1) = λv . J ω K γ (cid:0) J V πi V π j K γ (cid:1) (cid:0) v πi ( J V π j K γ ) + J S ′ K h γ , J V π j K γ i (cid:1) = J ( λv . v πi V π j + S ′ [ V π j / v ]) ∗ · ω ( V πi V π j ) K γ (20a) Say y is a free variable in E , (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V −→ ( λv . h v , S i) ∗ · ω h V , E [ V / y ]i if (cid:0) ( Ω λy . E ) · ω (cid:1) V −→( λv . S ) ∗ · ω ( E [ V / y ]) . By IH, we have J (cid:0) ( Ω λy . E ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω ( E [ V / y ]) K , whichimplies for any γ ∈ J Γ K and v , J E K h γ , J V K γ i = J P K γ and D [ J E K h γ , −i]h v , J V K γ i = J S K h γ , v i . Now, J (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) h J V K γ , J E K h γ , J V K γ ii (cid:1)(cid:0) h D [ Id ]h v , J V K γ i , D [ J E K h γ , −i]h v , J V K γ ii (cid:1) = λv . J ω K γ (cid:0) h J V K γ , J E [ V / y ] K γ i (cid:1) (cid:0) h v , J S K h γ , v ii (cid:1) = λv . J h V , E [ V / y ]i ω K γ (cid:0) J λv . h v , S i K γv (cid:1) = J ( λv . h v , S i) ∗ · ω h V , E [ V / y ]i K γ (20b) If y < FV ( E ) , we have (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V −→ ( λv . h v , i) ∗ · ω h V , E i and J (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) h J V K γ , J E K h γ , J V K γ ii (cid:1)(cid:0) h D [ Id ]h v , J V K γ i , D [ J E K h γ , −i]h v , J V K γ ii (cid:1) = λv . J ω K γ (cid:0) h J V K γ , J E K γ i (cid:1) (cid:0) h v , i (cid:1) = λv . J ω K γ (cid:0) J h V , E i γ K (cid:1) (cid:0) J λv . h v , i K γv (cid:1) = J ( λv . 
h v , i) ∗ · ω h V , E i K γ (cid:3) Lemma 5.1.
Let P be a term.
1. If P −→ P′, then there exists a reduct s of P′ᵗ such that Pᵗ −→∗ s in L_D.
2. ⟦P⟧ = ⟦Pᵗ⟧ in C.

Proof. 1. Easy induction on −→.
2. We prove by induction on P. Most cases are trivial. Let γ ∈ ⟦Γ⟧.

(dual) ⟦(λx. S₁)∗ · S₂⟧γ = λv. ⟦S₂⟧γ (cur(⟦S₁⟧)γ v)
= λv. ⟦S₂⟧⟨γ, v⟩ (⟦λx. S₁⟧⟨γ, v⟩ (⟦v⟧⟨γ, v⟩))
= λv. ⟦S₂ᵗ⟧⟨γ, v⟩ (⟦λx. S₁ᵗ⟧⟨γ, v⟩ (⟦vᵗ⟧⟨γ, v⟩))
= λv. ⟦S₂ᵗ ((λx. S₁ᵗ) v)⟧⟨γ, v⟩
= ⟦λv. S₂ᵗ ((λx. S₁ᵗ) v)⟧γ

(pb) ⟦(Ω(λy. P)) · S⟧γ = λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[cur(⟦P⟧)γ]⟨v, x⟩)
= λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[⟦P⟧]⟨⟨0, v⟩, ⟨γ, x⟩⟩)
= λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[⟦P⟧]⟨⟨0, ⟦v⟧⟨γ, x, v⟩⟩, ⟨⟨γ, x, v⟩, x⟩⟩)
= λx v. (⟦S⟧γ)(cur(⟦P⟧)γ (⟦x⟧⟨γ, x, v⟩))(⟦D(λy. P) · v⟧⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))
= λx v. (⟦S⟧⟨γ, x, v⟩)(cur(⟦P⟧)⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))(⟦D(λy. P) · v⟧⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))
= λx v. ⟦Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)⟧⟨γ, x, v⟩
= ⟦λx v. Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)⟧γ □

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

Proof. If P does not terminate, then we can form a non-terminating reduction sequence in L_D using Lemma 5.1 (1) and the confluence of the differential λ-calculus, proved in [15]. This contradicts the strong normalization property of the differential λ-calculus. □
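The proofs above repeatedly use the separation property of Proposition E.3 (via the Hahn-Banach Separation Theorem) to turn an equality of 1-forms applied at all covectors into an equality of vectors. In the finite-dimensional case the separating linear functional can be written down directly; a sketch using an inner-product functional (our own construction, for illustration):

```python
# For distinct x and y, l(w) = <w, x - y> is linear and separates them:
# l(x) - l(y) = <x - y, x - y> = |x - y|^2 > 0.
def separating_functional(x, y):
    d = [a - b for a, b in zip(x, y)]
    return lambda w: sum(wi * di for wi, di in zip(w, d))
```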