A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation
arXiv preprint [cs.PL], February
Carol Mak and Luke Ong
Abstract
Building on the observation that reverse-mode automatic differentiation (AD), a generalisation of backpropagation, can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential λ-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.

1 Introduction

Automatic differentiation (AD) [34] is widely considered the most efficient and accurate algorithm for computing derivatives, thanks largely to the chain rule. There are two modes of AD:

• Forward-mode AD evaluates the chain rule from inputs to outputs; it has time complexity that scales with the number of inputs, and constant space complexity.
• Reverse-mode AD, a generalisation of backpropagation, evaluates the chain rule (in dual form) from outputs to inputs; it has time complexity that scales with the number of outputs, and space complexity that scales with the number of intermediate variables.

In machine learning applications such as neural networks, the number of input parameters is usually considerably larger than the number of outputs. For this reason, reverse-mode AD has been the preferred method of differentiation, especially in deep learning applications. (See Baydin et al. [5] for an excellent survey of AD.)

The only downside of reverse-mode AD is its rather involved definition, which has led to a variety of complicated implementations in neural networks. On the one hand, TensorFlow [1] and Theano [3] employ the define-and-run approach, where the model is constructed as a computational graph before execution. On the other hand, PyTorch [25] and Autograd [20] employ the define-by-run approach, where the computational graph is constructed dynamically during the execution.
Can we replace the traditional graphical representation of reverse-mode AD by a simple yet expressive framework?
Indeed, there have been calls from the neural network community for the development of differentiable programming [14, 19, 24], based on a higher-order functional language with a built-in differential operator that returns the derivative of a given program via reverse-mode AD. Such a language would free the programmer from the implementational details of differentiation. Programmers would be able to concentrate on the construction of machine learning models, and train them by calling the built-in differential operator on the cost function of their models.

The goal of this work is to present a simple higher-order programming language with an explicit differential operator, such that its reduction semantics is exactly reverse-mode AD, in a truly higher-order manner.

The syntax of our language is inspired by Ehrhard and Regnier [15]'s differential λ-calculus, which is an extension of the simply-typed λ-calculus with a differential operator that mimics standard symbolic differentiation (but not reverse-mode AD). Their definition of differentiation via a linear substitution provides a good foundation for our language. The reduction strategy of our language uses differential λ-category [11] (the model of differential λ-calculus) as a guide. Differential λ-category is a Cartesian closed differential category [9], and hence enjoys the fundamental properties of derivatives, and behaves well with exponentials (curry).

Contributions.
Our starting point (Section 2.2) is the observation that the computation of reverse-mode AD can naturally be expressed as a transformation of pullbacks of differential 1-forms. We argue that this viewpoint is essential for understanding reverse-mode AD in a functional setting. Standard reverse-mode AD (as presented in [4, 5]) is only defined in Euclidean spaces.

We present (in Section 3) a simple higher-order programming language, extending the simply-typed λ-calculus [12] with an explicit differential operator called the pullback, (Ω λx.P) · S, which serves as a reverse-mode AD simulator. Using differential λ-category [11] as a guide, we design a reduction strategy for our language so that the reduction of the application, ((Ω λx.P) · (λx.e_p*)) S, mimics reverse-mode AD in computing the p-th row of the Jacobian matrix (derivative) of the function λx.P at the point S, where e_p is the column vector with 1 at the p-th position and 0 everywhere else. Moreover, we show how our reduction semantics can be adapted to a continuation passing style evaluation (Section 3.5).

Owing to the higher-order nature of our language, standard differential calculus is not enough to model our language and hence cannot justify our reductions. Our final contribution (in Section 4) is to show that any differential λ-category [11] that satisfies the Hahn-Banach Separation Theorem is a model of our language (Theorem 4.6). Our reduction semantics is faithful to reverse-mode AD, in that it is exactly reverse-mode AD when restricted to first-order; moreover, we can perform reverse-mode AD on any higher-order abstraction, which may contain higher-order terms, duals, pullbacks, and free variables as subterms (Corollary 4.8).

Finally, we discuss related work in Section 5, and conclusions and future directions in Section 6. Throughout this paper, we will point to the attached Appendix for additional content.
All proofs are in Appendix E, unless stated otherwise.

We introduce forward- and reverse-mode automatic differentiation (AD), highlighting their respective benefits in practice. Then we explain how reverse-mode AD can naturally be expressed as the pullback of differential 1-forms. (The examples used to illustrate these methods are collated in Figure 4.)
Recall that the Jacobian matrix of a smooth function f : R^n → R^m at x ∈ R^n is

J(f)(x) := \begin{bmatrix} \frac{\partial f_1}{\partial z_1}\big|_x & \frac{\partial f_1}{\partial z_2}\big|_x & \cdots & \frac{\partial f_1}{\partial z_n}\big|_x \\ \frac{\partial f_2}{\partial z_1}\big|_x & \frac{\partial f_2}{\partial z_2}\big|_x & \cdots & \frac{\partial f_2}{\partial z_n}\big|_x \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial z_1}\big|_x & \frac{\partial f_m}{\partial z_2}\big|_x & \cdots & \frac{\partial f_m}{\partial z_n}\big|_x \end{bmatrix}

where f_j := π_j ∘ f : R^n → R. We call the function J : C^∞(R^n, R^m) → C^∞(R^n, L(R^n, R^m)) the Jacobian (here C^∞(A, B) is the set of all smooth functions from A to B, and L(A, B) is the set of all linear functions from A to B, for Euclidean spaces A and B); J(f) the Jacobian of f; J(f)(x) the Jacobian of f at x; J(f)(x)(v) the Jacobian of f at x along v ∈ R^n; and λx.J(f)(x)(v) the Jacobian of f along v.

Symbolic Differentiation
Numerical derivatives are standardly computed using symbolic differentiation: first compute ∂f_j/∂z_i for all i, j using rules (e.g. the product and chain rules), then substitute x for z to obtain J(f)(x).

For example, to compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by symbolic differentiation, first compute

∂f/∂x = 2(x+1)(2x+y²)((2x+y²) + 2(x+1))   and   ∂f/∂y = 2(x+1)(2x+y²)(2y(x+1)).

Then, substitute 1 for x and 3 for y to obtain J(f)(⟨1, 3⟩) = [660 528].

Symbolic differentiation is accurate but inefficient. Notice that the term (x+1) appears twice in ∂f/∂x, and (1+1) is evaluated twice in ∂f/∂x|_{⟨1,3⟩} (because for h : ⟨x, y⟩ ↦ (x+1)(2x+y²), both h(⟨x, y⟩) and ∂h/∂x contain the term (x+1), and the product rule tells us to calculate them separately). This duplication is a cause of the so-called expression swell problem, resulting in exponential time-complexity.
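The partial derivatives above, with the constants as we reconstruct them from the printed Jacobian [660 528], can be checked by direct evaluation. A minimal Python sketch:

```python
# Evaluate the symbolic partial derivatives of f(x, y) = ((x+1)(2x+y^2))^2
# at (1, 3), exactly as a symbolic differentiator would hand them back.

def df_dx(x, y):
    # 2(x+1)(2x+y^2) * ((2x+y^2) + 2(x+1))
    return 2 * (x + 1) * (2 * x + y ** 2) * ((2 * x + y ** 2) + 2 * (x + 1))

def df_dy(x, y):
    # 2(x+1)(2x+y^2) * 2y(x+1)  -- note (x+1) is recomputed here
    return 2 * (x + 1) * (2 * x + y ** 2) * (2 * y * (x + 1))

print(df_dx(1, 3), df_dy(1, 3))  # 660 528
```

Note how (x+1) is evaluated separately inside each partial derivative: this is exactly the duplication that causes expression swell.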
Automatic Differentiation

Automatic differentiation (AD) avoids this problem by a simple divide-and-conquer approach: first arrange f as a composite of elementary functions g_1, ..., g_k (i.e. f = g_k ∘ ··· ∘ g_1), then compute the Jacobian of each of these elementary functions, and finally combine them via the chain rule to yield the desired Jacobian of f.

Forward-mode AD
Recall the chain rule:
J(f)(x) = J(g_k)(x_{k−1}) × ··· × J(g_2)(x_1) × J(g_1)(x_0)

for f = g_k ∘ ··· ∘ g_1, where x_0 := x and x_i := g_i(x_{i−1}). Forward-mode AD computes the Jacobian matrix J(f)(x) by calculating α_i := J(g_i)(x_{i−1}) × α_{i−1} and x_i := g_i(x_{i−1}), with α_0 := I (identity matrix) and x_0 := x. Then, α_k = J(f)(x) is the Jacobian of f at x. This computation can neatly be presented as an iteration of the ⟨· | ·⟩-reduction,

⟨x | α⟩ →_g ⟨g(x) | J(g)(x) × α⟩,

for g = g_1, ..., g_k, starting from the pair ⟨x | I⟩. Besides being easy to implement, forward-mode AD computes the new pair from the current pair ⟨x | α⟩, requiring no additional memory.

To compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by forward-mode AD, first decompose f into elementary functions (in the sense of being easily differentiable) as

R² →_g R² →_∗ R →_{(−)²} R,

where g(⟨x, y⟩) := ⟨x+1, 2x+y²⟩. Then, starting from ⟨⟨1, 3⟩ | I⟩, iterate the ⟨· | ·⟩-reduction

⟨⟨1, 3⟩ | [1 0; 0 1]⟩ →_g ⟨⟨1+1, 2·1+3²⟩ | [1 0; 2 6]⟩ →_∗ ⟨2·11 | [15 12]⟩ →_{(−)²} ⟨484 | [660 528]⟩,

yielding [660 528] as the Jacobian of f at ⟨1, 3⟩. Notice that (1+1) is only evaluated once, even though its result is used in various calculations.

In practice, because storing the intermediate matrices α_i can be expensive, the matrix J(f)(x) is computed column-by-column, by simply changing the starting pair from ⟨x | I⟩ to ⟨x | e_p⟩, where e_p ∈ R^n is the column vector with 1 at the p-th position and 0 everywhere else. Then, the computation becomes a reduction of a vector-vector pair, and α_k = J(f)(x) × e_p is the p-th column of the Jacobian matrix J(f)(x). Since J(f)(x) is an m-by-n matrix, n runs are required to compute the whole Jacobian matrix.

For example, if we start from ⟨⟨1, 3⟩ | [1 0]ᵀ⟩, the reduction

⟨⟨1, 3⟩ | [1 0]ᵀ⟩ →_g ⟨⟨2, 11⟩ | [1 2]ᵀ⟩ →_∗ ⟨22 | [15]⟩ →_{(−)²} ⟨484 | [660]⟩

gives us the first column of the Jacobian matrix J(f)(⟨1, 3⟩).
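The column-by-column iteration above can be transcribed directly. A minimal Python sketch, assuming the elementary decomposition g(⟨x, y⟩) = ⟨x+1, 2x+y²⟩, mult and pow2 of the running example (the concrete formulas are our reconstruction, and the helper names are ours):

```python
# Forward-mode AD as iteration of the pair reduction <x | alpha>, run
# column-by-column on the running example.

def matmul(A, B):
    # Multiply matrices represented as lists of rows.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Each elementary function is a pair: (value map, Jacobian map).
g    = (lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
        lambda v: [[1, 0], [2, 2 * v[1]]])
mult = (lambda v: [v[0] * v[1]],
        lambda v: [[v[1], v[0]]])
pow2 = (lambda v: [v[0] ** 2],
        lambda v: [[2 * v[0]]])

def forward_ad(fs, x, alpha):
    # Iterate <x | alpha> -> <g(x) | J(g)(x) x alpha> for g = g1, ..., gk.
    for f, Jf in fs:
        x, alpha = f(x), matmul(Jf(x), alpha)
    return x, alpha

# One run per input dimension: start from <x | e_p>.
print(forward_ad([g, mult, pow2], [1, 3], [[1], [0]]))     # ([484], [[660]])
print(forward_ad([g, mult, pow2], [1, 3], [[0], [1]])[1])  # [[528]]
```

Two runs are needed for the two inputs, matching the n-runs cost stated above.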
Reverse-mode AD

By contrast, reverse-mode AD computes the dual of the Jacobian matrix, (J(f)(x))*, using the chain rule in dual (transpose) form

(J(f)(x))* = (J(g_1)(x_0))* × ··· × (J(g_k)(x_{k−1}))*

as follows: first compute x_i := g_i(x_{i−1}) for i = 1, ..., k−1 (Forward Phase); then compute β_i := (J(g_i)(x_{i−1}))* × β_{i+1} for i = k, ..., 1, with β_{k+1} := I (Reverse Phase).

For example, the reverse-mode AD computation on f is as follows.

Forward Phase: ⟨1, 3⟩ →_g ⟨2, 11⟩ →_∗ 22 →_{(−)²} 484
Reverse Phase: [660 528]ᵀ ←_g [484 88]ᵀ ←_∗ [44]ᵀ ←_{(−)²} I

In practice, like forward-mode AD, the matrix (J(f)(x))* is computed column-by-column, by simply setting β_{k+1} := π_p, where π_p ∈ L(R^m, R) is the p-th projection. Thus, a run (comprising Forward and Reverse Phases) computes (J(f)(x))*(π_p), the p-th row of the Jacobian of f at x. It follows that m runs are required to compute the m-by-n Jacobian matrix.

In many machine learning (e.g. deep learning) problems, the functions f : R^n → R^m we need to differentiate have many more inputs than outputs, in the sense that n ≫ m. Whenever this is the case, reverse-mode AD is more efficient than forward-mode.
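The two-phase schedule can likewise be sketched in Python: the Forward Phase stores every intermediate point, and the Reverse Phase folds the transposed Jacobians over them. The elementary functions are our reconstruction of the running example, and the helper names are ours:

```python
# Reverse-mode AD: Forward Phase stores x_0, ..., x_{k-1}; Reverse Phase
# computes beta_i := (J(g_i)(x_{i-1}))* x beta_{i+1} from i = k down to 1.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

fs = [
    (lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
     lambda v: [[1, 0], [2, 2 * v[1]]]),
    (lambda v: [v[0] * v[1]], lambda v: [[v[1], v[0]]]),
    (lambda v: [v[0] ** 2],   lambda v: [[2 * v[0]]]),
]

def reverse_ad(fs, x, beta):
    xs = [x]                       # Forward Phase
    for f, _ in fs:
        xs.append(f(xs[-1]))
    for (_, Jf), xi in zip(reversed(fs), reversed(xs[:-1])):
        beta = matvec(transpose(Jf(xi)), beta)   # Reverse Phase
    return beta

# beta_{k+1} := pi_p; with a single output, pi_1 = [1] gives the whole row.
print(reverse_ad(fs, [1, 3], [1]))  # [660, 528]
```

Unlike the forward-mode sketch, the intermediate points must all be stored before the reverse pass can begin, which is exactly the space cost discussed above.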
Remark 2.1. Unlike forward-mode AD, we cannot interleave the iteration of x_i and the computation of β_i. In fact, according to Hoffmann [18], nobody knows how to do reverse-mode AD using pairs ⟨· | ·⟩, as employed by forward-mode AD to great effect. In other words, reverse-mode AD does not seem presentable as an in-place algorithm.

Reverse-mode AD can naturally be expressed using pullbacks and differential 1-forms, as alluded to by Betancourt [7] and discussed in [26]. Let E := R^n and F := R^m. A differential 1-form of E is a smooth map ω ∈ C^∞(E, L(E, R)). Denote the set of all differential 1-forms of E as Ω(E). E.g. λx.π_p ∈ Ω(R^m). (Henceforth, by 1-form, we mean differential 1-form.) The pullback of a 1-form ω ∈ Ω(F) along a smooth map f : E → F is a 1-form Ω(f)(ω) ∈ Ω(E), where

Ω(f)(ω) : E → L(E, R)
          x ↦ (J(f)(x))*(ω(f(x))).

Notice that the result of an iteration of reverse-mode AD, (J(f)(x))*(π_p), can be expressed as Ω(f)(λx.π_p)(x), which can be expanded to (Ω(g_1) ∘ ··· ∘ Ω(g_k))(λx.π_p)(x). Hence, reverse-mode AD can be expressed as: first iterate the reduction of 1-forms, ω →_g Ω(g)(ω), for g = g_k, ..., g_1, starting from the 1-form λx.π_p; then compute ω(x), which yields the p-th row of J(f)(x).

Returning to our example,

Ω(f)(λx.[1]*)(⟨1, 3⟩)
= (Ω(g) ∘ Ω(∗) ∘ Ω((−)²))(λx.[1]*)(⟨1, 3⟩)
= (J(g)(⟨1, 3⟩))* ((Ω(∗) ∘ Ω((−)²))(λx.[1]*)(⟨2, 11⟩))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* ((Ω((−)²))(λx.[1]*)(22))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* (J((−)²)(22))* ((λx.[1]*)(484))
= (J(g)(⟨1, 3⟩))* (J(∗)(⟨2, 11⟩))* [44]*
= (J(g)(⟨1, 3⟩))* [484 88]*
= [660 528]*,

which is the Jacobian J(f)(⟨1, 3⟩).

The pullback-of-1-forms perspective gives us a way to perform reverse-mode AD beyond Euclidean spaces (for example on the function sum : List(R) → R, which returns the sum of the elements of a list); and it shapes our language and reduction presented in Section 3. (Example 3.2 shows how sum can be defined in our language, and Appendix A.2 shows how reverse-mode AD can be performed on sum.)
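The defining equation Ω(f)(ω)(x) = (J(f)(x))*(ω(f(x))) is directly executable. A small sketch (the elementary maps and their Jacobians are our reconstruction of the running example; all names are ours) showing that composing the three pullbacks and evaluating at ⟨1, 3⟩ recovers the row [660 528]:

```python
# The pullback of a 1-form along a smooth map, computed literally from
# Omega(f)(omega) = lambda x. (J(f)(x))* (omega(f(x))).

def transpose(A):
    return [list(row) for row in zip(*A)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

def pullback(f, Jf):
    return lambda omega: lambda x: matvec(transpose(Jf(x)), omega(f(x)))

Og    = pullback(lambda v: [v[0] + 1, 2 * v[0] + v[1] ** 2],
                 lambda v: [[1, 0], [2, 2 * v[1]]])
Omult = pullback(lambda v: [v[0] * v[1]], lambda v: [[v[1], v[0]]])
Opow2 = pullback(lambda v: [v[0] ** 2],  lambda v: [[2 * v[0]]])

# Omega(f) = Omega(g) . Omega(mult) . Omega(pow2): pullbacks compose in the
# reverse order of the underlying maps. Start from the 1-form lambda x. pi_1.
omega = Og(Omult(Opow2(lambda x: [1])))
print(omega([1, 3]))  # [660, 528]
```

The contravariance is visible in the nesting: the innermost pullback belongs to the last elementary function, and the point is only supplied at the very end.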
Simple terms   S ::= x | λx.S | S P | π_i(S) | ⟨S_1, S_2⟩ | r | f(P) | J f · S | (λx.S_1)* · S_2 | (Ω(λx.P)) · S | 𝐫*
Pullback terms P ::= 0 | S | S + P

Figure 1. Grammar of simple terms S and pullback terms P. Assume a collection V of variables (typically x, y, z, ω), and a collection F (typically f, g, h) of easily-differentiable real-valued functions, in the sense that the Jacobian J(f) of each f ∈ F can be called by the language; r and 𝐫 range over R and R^n respectively.

Remark 2.2.
Pullbacks can be generalised to arbitrary p-forms, using essentially the same approach. However, the pullbacks of general p-forms no longer resemble reverse-mode AD as it is commonly understood.

Figure 1 presents the grammar of simple terms S and pullback terms P, and Figure 2 presents the type system. While the definition of simple terms S is relatively standard (except for the new constructs, which will be discussed later), the definition of pullback terms P as sums of simple terms is not.

The idea of sum is important since it specifies the "linear positions" in a simple term, just as it specifies the algebraic notion of linearity in mathematics. For example, x(y + z) is a term but (x + y)z is not. This is because (x + y)z is the same as xz + yz, but x(y + z) cannot be so decomposed. Hence in S P, S is in a linear position but P is not. Similarly, in mathematics (f_1 + f_2)(x) = f_1(x) + f_2(x), but in general f(x_1 + x_2) ≠ f(x_1) + f(x_2) for smooth functions f_1, f_2, f and points x, x_1, x_2. Hence, the function f in an application f(x) is in a linear position while the argument x is not.

Formally, we define the set lin(S) of linear variables in a simple term S by: y ∈ lin(S) if, and only if, y is in a linear position in S.

lin(x) := {x}
lin(λx.S) := lin(S) \ {x}
lin(S P) := lin(S) \ FV(P)
lin(π_i(S)) := lin(S)
lin(⟨S_1, S_2⟩) := lin(S_1) ∩ lin(S_2)
lin(J f · S) := lin(S)
lin((λx.S_1)* · S_2) := (lin(S_1) \ FV(S_2)) ∪ (lin(S_2) \ FV(S_1))
lin(S) := ∅ otherwise.

For example, lin(x z (y z)) = {x}. Any term of the dual type σ* is considered a linear functional of σ. For example, e_p* has the dual type R^n*. Then the term e_p* mimics the linear functional π_p ∈ L(R^n, R). The Jacobian J f · S is considered as the Jacobian of f along S, which is a smooth function. For example, let f : R^m → R^n be "easily differentiable"; then J f · v mimics the Jacobian along v, i.e. the function λx.
J(f)(x)(v).

The dual map (λx.S_1)* · S_2 is considered the dual of the linear functional S_2 along the function λx.S_1, where x ∈ lin(S_1). For example, let 𝐫 ∈ R^m. The dual map (λv.(J f · v) 𝐫)* · e_p* mimics (J(f)(𝐫))*(π_p) ∈ L(R^m, R), which is the dual of π_p along the Jacobian J(f)(𝐫).

The pullback (Ω λx.P) · S is considered the pullback of the 1-form S along the function λx.P. For example, (Ω λx.f(x)) · (λx.e_p*) mimics Ω(f)(λx.π_p) ∈ Ω(R^m), which is the pullback of the 1-form λx.π_p along f. Hence, to perform reverse-mode AD on a term λx.P at P′ with respect to ω, we consider the term ((Ω λx.P) · ω) P′.

We use syntactic sugar to ease writing. For n ≥ 1, with z a fresh variable:

R^{n+1} ≡ R^n × R        Ω σ ≡ σ ⇒ σ*
[r_1 ... r_n]ᵀ ≡ ⟨r_1, ..., r_n⟩        ⟨P_1, P_2, P_3⟩ ≡ ⟨⟨P_1, P_2⟩, P_3⟩
S_{πi} ≡ π_i(S)        let x = t in s ≡ (λx.s) t
Ω_𝐫 ≡ λx.𝐫*        λ⟨x, y⟩.S ≡ λz.S[z_{π1}/x][z_{π2}/y]

Capture-free substitution is applied recursively, e.g. ((λx.S_1)* · S_2)[P′/z] ≡ (λx.S_1[P′/z])* · (S_2[P′/z]) and ((Ω λx.P) · S)[P′/z] ≡ (Ω(λx.P[P′/z])) · (S[P′/z]). We treat 0 as the unit of our sum terms, i.e. 0 + S ≡ S + 0 ≡ S, and consider + as an associative and commutative operator. We also define S[S_1 + S_2/y] ≡ S[S_1/y] + S[S_2/y] if, and only if, y ∈ lin(S). For example, (S_1 + S_2) P ≡ S_1 P + S_2 P.

We finish this subsection with some examples that can be expressed in this language.

Example 3.1.
Consider the running example of computing the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩. Assume that g(⟨x, y⟩) := ⟨x+1, 2x+y²⟩, mult and pow2 are in the set of easily differentiable functions, i.e. g, mult, pow2 ∈ F. The function f can be presented by the term {⟨x, y⟩ : R²} ⊢ pow2(mult(g(⟨x, y⟩))) : R. More interestingly, the Jacobian

σ, τ ::= R | σ × σ | σ ⇒ σ | σ*

Γ ⊢ 0 : σ
If Γ ⊢ S : σ and Γ ⊢ P : σ, then Γ ⊢ S + P : σ.
Γ ∪ {x : σ} ⊢ x : σ
If Γ ∪ {x : σ} ⊢ S : τ, then Γ ⊢ λx.S : σ ⇒ τ.
If Γ ⊢ S : σ ⇒ τ and Γ ⊢ P : σ, then Γ ⊢ S P : τ.
If Γ ⊢ S : σ_1 × σ_2, then Γ ⊢ π_i(S) : σ_i.
If Γ ⊢ S_1 : σ_1 and Γ ⊢ S_2 : σ_2, then Γ ⊢ ⟨S_1, S_2⟩ : σ_1 × σ_2.
If r ∈ R, then Γ ⊢ r : R.
If Γ ⊢ P : R^n, then Γ ⊢ f(P) : R^m.
If Γ ⊢ S : R^n, then Γ ⊢ J f · S : R^n ⇒ R^m.
If 𝐫 ∈ R^n, then Γ ⊢ 𝐫* : R^n*.
If Γ ∪ {x : σ} ⊢ S_1 : τ, Γ ⊢ S_2 : τ* and x ∈ lin(S_1), then Γ ⊢ (λx.S_1)* · S_2 : σ*.
If Γ ∪ {x : σ} ⊢ P : τ and Γ ⊢ S : Ω τ, then Γ ⊢ (Ω λx.P) · S : Ω σ.

Figure 2.
The types and typing rules for DPPL. Ω σ ≡ σ ⇒ σ*, and f : R^n → R^m is easily differentiable, i.e. f ∈ F.

of f at ⟨1, 3⟩, i.e. J(f)(⟨1, 3⟩), can be presented by the term

⊢ ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω_{[1]}) ⟨1, 3⟩ : R²*.

This is the application of the pullback Ω(f)(λx.[1]*) to the point ⟨1, 3⟩, which we saw in Subsection 2.2 is the Jacobian of f at ⟨1, 3⟩.

Example 3.2.
Consider the function that takes a list of real numbers and returns the sum of the elements of the list. Using the standard Church encoding of lists, i.e. List(X) ≡ (X → D → D) → (D → D) and [x_1, x_2, ..., x_n] ≡ λf d. f x_n (... (f x_2 (f x_1 d))) for some dummy type D, sum : List(R) → R is defined to be λl. l (λx y. x + y) 0. Hence the Jacobian of sum at a two-element list [a, −b] can be expressed as {ω : Ω(List(R))} ⊢ ((Ω(sum)) · ω) [a, −b] : R*.

Now the question is how we could perform reverse-mode AD on this term. Recall that the result of reverse-mode AD on a function f : R^n → R^m at x ∈ R^n, i.e. the p-th row of the Jacobian matrix of f at x, can be expressed as Ω(f)(λx.π_p)(x), which is (J(f)(x))*((λx.π_p)(f x)) = (J(f)(x))* × π_p.

In the rest of this section, we consider how the term ((Ω λy.P′) · ω) P, which mimics Ω(f)(ω)(x), can be reduced. To avoid expression swell, we first perform A-reduction, P′ →*_A L, which decomposes a term into a series of "smaller" terms, as explained in Subsection 3.2. Then, we reduce ((Ω λy.L) · ω) P by induction on L, as explained in Subsection 3.3. Lastly, we complete our reduction strategy in Subsection 3.4.

We use the term in Example 3.1 as a running example in our reduction strategy, to illustrate that this reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order). The reduction of the term in Example 3.2 is given in Appendix A.2. It illustrates how reverse-mode AD can be performed on a higher-order function.

We use the administrative reduction (A-reduction) of Sabry and Felleisen [28] to decompose a pullback term P into a let series L of elementary terms, i.e. P →*_A let x_1 = E_1; ...; x_n = E_n in x_n, where elementary terms E and let series L are defined as

E ::= 0 | z_1 + z_2 | z | λx.L | z_1 z_2 | z_{πi} | ⟨z_1, z_2⟩ | r | f(z) | J f · z | (λx.L)* · z | (Ω λx.L) · z | 𝐫*
L ::= let z = E in L | let z = E in z.

Note that elementary terms E should be "fine enough" to avoid expression swell. The complete set of A-reductions on P can be found in Appendix B. We write →*_A for the reflexive and transitive closure of →_A.
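For concreteness, Example 3.2's encoding transcribes directly into Python; the two-element list below uses sample entries of ours:

```python
# A Church-encoded list [x1, ..., xn] is the function
# lambda f: lambda d: f(xn)(...(f(x2)(f(x1)(d)))).
lst = lambda f: lambda d: f(-2)(f(5)(d))   # the list [5, -2]

# sum = lambda l. l (lambda x y. x + y) 0, as in Example 3.2
def church_sum(l):
    return l(lambda x: lambda y: x + y)(0)

print(church_sum(lst))  # 3
```

Note that sum never inspects the list's structure: it only applies the list, which is why differentiating it takes us outside the first-order Euclidean setting.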
Example 3.3. We decompose the term considered in Example 3.1, pow2(mult(g(⟨x, y⟩))), via administrative reduction:

pow2(mult(g(⟨x, y⟩))) →*_A let z_1 = ⟨x, y⟩; z_2 = g(z_1); z_3 = mult(z_2); z_4 = pow2(z_3) in z_4.

This is reminiscent of the decomposition of f into R² →_g R² →_∗ R →_{(−)²} R before performing AD.

After decomposing P′ to a let series L of elementary terms via A-reductions in (Ω λy.P′) · ω, we reduce (Ω λy.L) · ω by induction on L, as shown in Figure 3 (Let Series). Reduction 7 is the base case, and Reduction 8 expresses the contravariant property of pullbacks.
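A toy version of such an administrative reduction for nested unary calls can be sketched as follows. This is only a fragment of the full A-reduction (Appendix B covers the whole term language); the function and variable names are ours:

```python
# Flatten nested calls into a let series of elementary terms, inside-out.
import itertools

fresh = (f"z{i}" for i in itertools.count(1))

def anf(expr, bindings):
    # expr is a variable name (str) or a nested call ("op", argument).
    if isinstance(expr, str):
        return expr
    op, arg = expr
    a = anf(arg, bindings)   # name the argument first (inside-out order)
    z = next(fresh)
    bindings.append((z, op, a))
    return z

bindings = []
result = anf(("pow2", ("mult", ("g", "p"))), bindings)
lets = "; ".join(f"{z} = {op}({a})" for z, op, a in bindings)
print(f"let {lets} in {result}")
# let z1 = g(p); z2 = mult(z1); z3 = pow2(z2) in z3
```

Because every intermediate result gets a name exactly once, later reductions can refer to it by its variable instead of re-deriving it, which is how expression swell is avoided.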
Example 3.4. Take (Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω_{[1]}, discussed in Example 3.1, which when applied to the point ⟨1, 3⟩ is the Jacobian J(f)(⟨1, 3⟩), where f(⟨x, y⟩) := ((x+1)(2x+y²))². In Example 3.3, we showed that pow2(mult(g(⟨x, y⟩))) is A-reduced to a let series L. Now, via Reductions 7 and 8, (Ω λ⟨x, y⟩. L) · ω is reduced to a series of pullbacks along elementary terms:

(Ω λ⟨x, y⟩. let z_1 = ⟨x, y⟩; z_2 = g(z_1); z_3 = mult(z_2); z_4 = pow2(z_3) in z_4) · ω
→* (Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1⟩. ⟨⟨x, y⟩, z_1, g(z_1)⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1, z_2⟩. ⟨⟨x, y⟩, z_1, z_2, mult(z_2)⟩) ·
    ((Ω λ⟨⟨x, y⟩, z_1, z_2, z_3⟩. pow2(z_3)) · ω)))

Via A-reductions and Reductions 7 and 8, (Ω λy.P′) · ω is reduced to a series of pullbacks along elementary terms, (Ω λy.E_1) · (... ((Ω λy.E_n) · ω)). Now, we define the reduction of a pullback along an elementary term when applied to a value V, i.e. ((Ω λy.E) · ω) V.

Recall that the pullback of a 1-form ω ∈ Ω(F) along a smooth function f : E → F is defined to be Ω(f)(ω) : x ↦ (J(f)(x))*(ω(f(x))). Hence, we have the following pullback reduction

((Ω λy.E) · ω) V → (λv.S)* · (ω (E[V/y]))

of the application ((Ω λy.E) · ω) V, which mimics the pullback of a variable ω along an abstraction λy.E at a term V. But how should one define the simple term S in (λv.S)* · (ω (E[V/y])) so that λv.S mimics the Jacobian of f at x, i.e. J(f)(x)? We do so by induction on the elementary terms E, shown in Figure 3, Reductions 9-20.

Let Series:
(7) (Ω(λy. let x = E in x)) · ω → (Ω λy.E) · ω
(8) (Ω(λy. let x = E in L)) · ω → (Ω λy.⟨y, E⟩) · ((Ω λ⟨y, x⟩.L) · ω)

Constant Functions:
(9) ((Ω λy.E) · ω) V → 0,  if y ∉ FV(E)

Linear Functions:
(10a) ((Ω λy. z + y_{πj}) · ω) V → (λv. v_{πj})* · (ω (z + V_{πj}))
(10b) ((Ω λy. y_{πi} + y_{πj}) · ω) V → (λv. v_{πi} + v_{πj})* · (ω (V_{πi} + V_{πj}))
(11) ((Ω λy. y) · ω) V → (λv. v)* · (ω V)
(12) ((Ω λy. y_{πi}) · ω) V → (λv. v_{πi})* · (ω V_{πi})
(13a) ((Ω λy. ⟨y_{πi}, z⟩) · ω) V → (λv. ⟨v_{πi}, 0⟩)* · (ω ⟨V_{πi}, z⟩)
(13b) ((Ω λy. ⟨z, y_{πj}⟩) · ω) V → (λv. ⟨0, v_{πj}⟩)* · (ω ⟨z, V_{πj}⟩)
(13c) ((Ω λy. ⟨y_{πi}, y_{πj}⟩) · ω) V → (λv. ⟨v_{πi}, v_{πj}⟩)* · (ω ⟨V_{πi}, V_{πj}⟩)
(14) ((Ω λy. J f · y_{πi}) · ω) V → (λv. J f · v_{πi})* · (ω (J f · V_{πi}))

Function Symbols:
(15) ((Ω λy. f(y_{πi})) · ω) V → (λv. (J f · v_{πi}) V_{πi})* · (ω (f(V_{πi})))

Dual Maps:
(16a) ((Ω λy. (λx.L)* · y_{πi}) · ω) V → (λv. (λx.L)* · v_{πi})* · (ω ((λx.L)* · V_{πi})),  if y ∉ FV(λx.L)
(16b) if ((Ω λy.L) · ω′) V →* (λv.S)* · ω′ V′ and y ∉ FV(z), then ((Ω λy. (λx.L)* · z) · ω) V → (λv. (λx.S)* · z)* · (ω ((λx.L[V/y])* · z))
(16c) if ((Ω λy.L) · ω′) V →* (λv.S)* · ω′ V′, then ((Ω λy. (λx.L)* · y_{πi}) · ω) V → (λv. (λx.L[V/y])* · v_{πi} + (λx.S)* · V_{πi})* · (ω ((λx.L[V/y])* · V_{πi}))

Pullback Terms:
(17) if ((Ω λx.L) · z) a →* (λv.S)* · (z L[a/x]), then ((Ω λy. (Ω λx.L) · z) · ω) V → ((Ω λy. λa. (λv.S)* · (z L[a/x])) · ω) V

Abstraction:
(18) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ L[V/y]) and x ∉ FV(V), then ((Ω λy. λx.L) · ω) V → (λv. λx.S)* · (ω (λx.L[V/y]))

Application:
(19a) ((Ω λy. y_{πi} z) · ω) V → (λv. v_{πi} z)* · (ω (V_{πi} z))
(19b) if ((Ω λz.P′) · ω′) V_{πj} → (λv′.S′)* · ω′ (P′[V_{πj}/z]) and V_{πi} ≡ λz.P′, then ((Ω λy. y_{πi} y_{πj}) · ω) V → (λv. v_{πi} V_{πj} + S′[v_{πj}/v′])* · (ω (V_{πi} V_{πj}))
(19c) if ((Ω λz.V′) · ω′) V_{πj} → 0 and V_{πi} ≡ λz.V′, then ((Ω λy. y_{πi} y_{πj}) · ω) V → (λv. v_{πi} V_{πj})* · (ω (V_{πi} V_{πj}))

Pair:
(20a) if ((Ω λy.E) · ω′) V → (λv.S)* · (ω′ (E[V/y])) and y ∈ FV(E), then ((Ω λy. ⟨y, E⟩) · ω) V → (λv. ⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)
(20b) ((Ω λy. ⟨y, E⟩) · ω) V → (λv. ⟨v, 0⟩)* · (ω ⟨V, E⟩),  if y ∉ FV(E)

Figure 3. Pullback Reductions.

Remark 3.5.
For readers familiar with differential λ-calculus [15], S is the result of substituting a linear occurrence of y by v, and then substituting all free occurrences of y by V, in the term E. Our approach is different from differential λ-calculus in that we define a reduction strategy instead of a substitution. A comprehensive comparison between our language and differential λ-calculus is given in Section 5.

If y is not a free variable in E, then λy.E mimics a constant function. The Jacobian of a constant function is 0, hence we reduce ((Ω λy.E) · ω) V to (λv.0)* · (ω (E[V/y])), which is the sugar for 0, as shown in Figure 3 (Constant Functions), Reduction 9. The redexes ((Ω λy.0) · ω) V, ((Ω λy.r) · ω) V and ((Ω λy.𝐫*) · ω) V all reduce to 0. Henceforth, we assume y ∈ FV(E). (A value is a normal form of the reduction strategy; its definition will be made precise in the next subsection.)

We consider the redexes where y ∈ lin(E). Then λy.E mimics a linear function, whose Jacobian is itself. Hence ((Ω λy.E) · ω) V is reduced to (λv.S)* · (ω (E[V/y])), where S is the result of substituting y by v in E. Figure 3 (Linear Functions), Reductions 10-14, shows how they are reduced.

Now consider the redexes where y might not be a linear variable in E. All reductions are shown in Figure 3.

Function Symbols
Let f be "easily differentiable". Then λy.f(y_{πi}) mimics f ∘ π_i, whose Jacobian at x is J(f)(π_i(x)) ∘ π_i. Hence the Jacobian of λy.f(y_{πi}) at V is λv.(J f · v_{πi}) V_{πi}, and ((Ω λy.f(y_{πi})) · ω) V is reduced to (λv.(J f · v_{πi}) V_{πi})* · (ω (f(V_{πi}))), as shown in Reduction 15.

Dual Maps
Consider the Jacobian of λy.(λx.L)* · z at V. It is easy to see that the result varies depending on where the variable y is located in the dual map (λx.L)* · z. We consider three cases.

First, if y ∉ FV(λx.L), we must have z ≡ y_{πi}. Then y is a linear variable in (λx.L)* · y_{πi}, and so the Jacobian of λy.(λx.L)* · y_{πi} at V is λv.(λx.L)* · v_{πi}. Hence, we have Reduction 16a.

Second, say y ∉ FV(z). Since dual and abstraction are both linear operations, and y is only free in L, the Jacobian of λy.(λx.L)* · z at V should be λv.(λx.S′)* · z, where λv.S′ is the Jacobian of λy.L at V. To find the Jacobian of λy.L at V, we reduce ((Ω λy.L) · ω) V to (λv.S′)* · (ω L[V/y]); then λv.S′ is the Jacobian of λy.L at V. The reduction is given in Reduction 16b. Note that this reduction avoids expression swell, as we are reducing the let series L in λy.(λx.L)* · z using our pullback reductions, which do not suffer from expression swell.

Finally, for y ∈ FV(λx.L) ∩ FV(z), the Jacobian of λy.(λx.L)* · z at V is the "sum" of the results for the two cases above, i.e. λv.(λx.L)* · v_{πi} + (λx.S)* · y_{πi}, where the remaining free occurrences of y are substituted by V, since the Jacobian of a bilinear function l : X_1 × X_2 → Y is J(l)(⟨x_1, x_2⟩)(⟨v_1, v_2⟩) = l⟨x_1, v_2⟩ + l⟨v_1, x_2⟩. Hence, we have Reduction 16c.
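The bilinear derivative fact invoked for Reduction 16c can be sanity-checked numerically for the simplest bilinear map, multiplication (a sketch; the helper names are ours):

```python
# J(l)(<x1, x2>)(<v1, v2>) = l<x1, v2> + l<v1, x2>, checked for l(a, b) = a*b
# against a central finite difference along the direction <v1, v2>.

def l(a, b):
    return a * b

def directional(f, x, v, h=1e-6):
    return (f(x[0] + h * v[0], x[1] + h * v[1])
            - f(x[0] - h * v[0], x[1] - h * v[1])) / (2 * h)

x1, x2, v1, v2 = 2.0, 11.0, 0.5, -3.0
exact = l(x1, v2) + l(v1, x2)          # 2*(-3) + 0.5*11 = -0.5
approx = directional(l, (x1, x2), (v1, v2))
print(exact, round(approx, 6))  # -0.5 -0.5
```

Each argument contributes one summand, which is why the two unmixed cases (16a and 16b) each recover one half of the sum in 16c.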
Pullback Terms

Consider ((Ω λy.(Ω λx.L) · z) · ω) V. Instead of reducing it to some (λv.S)* · (ω (((Ω λx.L) · z)[V/y])) like the others, here we simply reduce ((Ω λx.L) · z) a to (λv.S)* · (z L[a/x]), where a is a fresh variable and z ≢ x, and replace (Ω λx.L) · z by λa.(λv.S)* · (z L[a/x]) in ((Ω λy.(Ω λx.L) · z) · ω) V, as shown in Reduction 17.

Abstraction
Consider the Jacobian of λy.λx.L at V. We follow the treatment of exponentials in differential λ-category [11], where the (D-curry) rule states that, for all f : Y × X → A,

D[cur(f)] = cur(D[f] ∘ ⟨π_1 × 0_X, π_2 × Id_X⟩),

which means J(cur(f))(y) is equal to λv.J(cur(f))(y)(v) = λv x.J(f⟨−, x⟩)(y)(v). According to this (D-curry) rule, the Jacobian of λy.λx.L at V should be λv.λx.S, where λv.S is the Jacobian of λy.L at V. Hence, similarly to the dual map case, we first reduce ((Ω λy.L) · ω) V to (λv.S)* · (ω L[V/y]) to obtain the Jacobian of λy.L at V, i.e. λv.S, and then reduce ((Ω λy.λx.L) · ω) V to (λv.λx.S)* · (ω (λx.L[V/y])), as shown in Reduction 18.

Application
Consider the Jacobian of λy.z_1 z_2 at V. Note that z_1 and z_2 may or may not contain y as a free variable. Hence, there are two cases.

First, we consider λy.y_{πi} z, where z is fresh. Since y ∈ lin(y_{πi} z), λy.y_{πi} z mimics a linear function, and hence its Jacobian at V is λv.v_{πi} z. So ((Ω λy.y_{πi} z) · ω) V is reduced to (λv.v_{πi} z)* · (ω (V_{πi} z)), as shown in Reduction 19a.

Second, we consider the Jacobian of λy.y_{πi} y_{πj} at V. Now y is not a linear variable in y_{πi} y_{πj}, since it occurs in the argument y_{πj}. As proved in Lemma 4.4 of [21], every differential λ-category satisfies the (D-eval) rule,

D[ev ∘ ⟨h, g⟩] = ev ∘ ⟨D[h], g ∘ π_2⟩ + D[uncur(h)] ∘ ⟨⟨0, D[g]⟩, ⟨π_2, g ∘ π_2⟩⟩,

which means J(ev ∘ ⟨h, g⟩)(x)(v) is equal to

(J(h)(x)(v))(g(x)) + J(h(x))(g(x))(J(g)(x)(v))

for all h : C → (A ⇒ B) and g : C → A. Hence, the Jacobian of ev ∘ ⟨π_i, π_j⟩ at x along v, i.e. J(ev ∘ ⟨π_i, π_j⟩)(x)(v), is π_i(v)(π_j(x)) + J(π_i(x))(π_j(x))(π_j(v)). So the Jacobian of λy.y_{πi} y_{πj} at V is λv.v_{πi} V_{πj} + S′[v_{πj}/v′], where λv′.S′ is the Jacobian of V_{πi} at V_{πj}. Hence, assuming V_{πi} ≡ λz.P′, we first reduce ((Ω λz.P′) · ω) V_{πj} to (λv′.S′)* · ω (P′[V_{πj}/z]) to obtain λv′.S′ as the Jacobian of λz.P′ at V_{πj}; then we reduce ((Ω λy.y_{πi} y_{πj}) · ω) V to (λv.v_{πi} V_{πj} + S′[v_{πj}/v′])* · (ω (V_{πi} V_{πj})), as shown in Reduction 19b.

If ((Ω λz.V′) · ω) V_{πj} reduces to 0, which means λz.V′ ≡ V_{πi} is a constant function, the Jacobian of λy.y_{πi} y_{πj} at V is just λv.v_{πi} V_{πj}, and we have Reduction 19c.

Remark 3.6.
By induction on the elementary terms defined in Subsection 3.2, we can see that there are a few elementary terms E for which ((Ω λy. E) · ω) V is not a redex, namely

value 1: ((Ω λy. z (y πᵢ)) · ω) V where z is a free variable,
value 2: ((Ω λy. (y πᵢ)(y πⱼ)) · ω) V where V πᵢ ≢ λz. P′.

Having these terms as values makes sense intuitively, since they have "inappropriate" terms in function positions: value 1 has a free variable z in a function position, and value 2 substitutes y πᵢ by V πᵢ, which is not an abstraction, into a function position.

Pair
Last but not least, we consider the Jacobian of λy. ⟨y, E⟩ at V. It is easy to see that this Jacobian is λv. ⟨v, S⟩, where λv. S is the Jacobian of λy. E, as shown in Reductions 20a and 20b.

Example 3.7.
Take our running example. In Examples 3.3and 3.4 we showed that via A-reductions and Reductions 7and 8, ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · ω is reduced to © « ( Ω λ h x , y i . hh x , y i , h x , y ii) ·( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®®®¬ We show how it can be reduced when applied to h , i . © « ( Ω λ h x , y i . hh x , y i , h x , y ii) ·( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®®®¬ (cid:20) (cid:21) . −−−−→ © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ · © « ( Ω λ hh x , y i , z i . hh x , y i , z , д ( z )i) ·( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ª®¬ h (cid:20) (cid:21) , (cid:20) (cid:21) i ª®®¬ . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ · (cid:16) (cid:18) ( Ω λ hh x , y i , z , z i . hh x , y i , z , z , mult ( z )i) ·(( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω ) (cid:19) h (cid:20) (cid:21) , (cid:20) (cid:21) , (cid:20) (cid:21) i (cid:17)ª®®®¬ ( ⋆ )20 . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ · (cid:0) (cid:16) ( Ω λ hh x , y i , z , z , z i . pow2 ( z )) · ω (cid:17) h (cid:20) (cid:21) , (cid:20) (cid:21) , (cid:20) (cid:21) , i (cid:1) ª®®®®¬ . −−−−→ , © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . 
(J pow2 · v ) ) ∗ · ( ω ) ª®®®¬

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f at the point considered in Subsection 2.1. Moreover, we used the reduction f(r) → f(r) a couple of times in the argument position of an application. This is to avoid expression swell: in (⋆), the argument is evaluated to a value only once, even when the result is used in various computations. Hence, we must have a call-by-value reduction strategy, as presented below.

The reductions in Subsections 3.2 and 3.3 are the most interesting development of the paper. However, they alone are not enough to complete a reduction strategy. In this subsection, we define contexts and redexes so that any non-value term can be reduced.

The definition of a context C is the standard call-by-value context, extended with duals and pullbacks. Notice that the context (Ω λy. C_A) · S contains an A-context as defined in Subsection 3.2. This follows the idea of reverse-mode AD: decompose a term into elementary terms before differentiating them.

C ::= [] | C + P | V + C | C P | V C | πᵢ(C) | ⟨C, S⟩ | ⟨V, C⟩ | f(C) | J f · C | (λx. S)* · C | (λx. C)* · V | (Ω λy. C_A) · S | (Ω λy. E) · C | (Ω λy. ⟨y, E⟩) · C

Our redexes r extend the standard call-by-value redexes with four sets of terms.

r ::= (λx. S) V | πᵢ(⟨V₁, V₂⟩) | f(r) | (J f · r) r′ | (λv. (J f · v) r)* · r′* | (λv₁. V₁)* · ((λv₂. V₂)* · V₃) | (Ω λy. L) · S | ((Ω λy. E) · V₁) V₂ | ((Ω λy. ⟨y, E⟩) · V₁) V₂

where either V₂ ≢ (J f · v) r or V₃ ≢ r′*. A value V is a pullback term P that cannot be reduced further, i.e. a term in normal form. The following standard lemma, which is proved by induction on P, tells us that there is at most one redex to reduce.

Lemma 3.8.
Every term P can be expressed either as C[r] for some unique context C and redex r, or as a value V.

Let us look at the reductions of redexes. (1)–(4) are the standard call-by-value reductions, extended with the evaluation of function symbols and Jacobians at constants; (5) reduces the dual along a linear map, and (6) is the contravariance property of dual maps.

(1) (λx. S) V → S[V/x]
(2) πᵢ(⟨V₁, V₂⟩) → Vᵢ
(3) f(r) → f(r)
(4) (J f · r) r′ → J(f)(r′)(r)
(5) (λv. (J f · v) r)* · r′* → ((J(f)(r))*(r′))*
(6) (λv₁. V₁)* · ((λv₂. V₂)* · V₃) → (λv₁. V₂[V₁/v₂])* · V₃, where either V₂ ≢ (J f · v) r or V₃ ≢ r′*.

We say C[r] → C[V] if r → V, for all reductions except those with a proof tree, i.e. Reductions 16b, 16c, 17, 18, 19b, 19c and 20a; for those, C[r] reduces according to the conclusion of the proof tree once its premises r →* V and r′ → V′ have been established.

Example 3.9.
Consider our running example P ≡ (cid:0) ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · Ω (cid:2) (cid:3) (cid:1) h , i which rep-resents the Jacobian of f : h x , y i 7→ (cid:0) ( x + )( x + y ) (cid:1) at h , i , as shown in Example 3.1. Replacing ω by Ω (cid:2) (cid:3) ≡ λx . (cid:2) (cid:3) ∗ in Examples 3.3, 3.4 and 3.7, P is reduced to © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . (J pow2 · v ) ) ∗ · ( ω ) ª®®®¬ . Via reduction 5 and β reduction, P is reduced to © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ ·( λ hh v , v i , v , v , v i . (J pow2 · v ) ) ∗ · (cid:2) (cid:3) ª®®®¬ −→ © « ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ ·( λ hh v , v i , v , v i . hh v , v i , v , v , (J mult · v )h , ii) ∗ · ª®¬ ∗ −→ (cid:18) ( λ h v , v i . hh v , v i , h v , v ii) ∗ ·( λ hh v , v i , v i . hh v , v i , v , (J д · v )h , ii) ∗ · (cid:19) ∗ −→ (cid:0) ( λ h v , v i . hh v , v i , h v , v ii) ∗ · (cid:1) ∗ −→ (cid:20) (cid:21) ∗ Notice how this mimics the reverse phase of reverse-modeAD on f : h x , y i 7→ (cid:0) ( x + )( x + y ) (cid:1) at h , i consideredin Subsection 2.1.Examples 3.3, 3.4 and 3.7 demonstrates that our reductionstrategy is faithful to reverse-mode AD (in that it is exactlyreverse-mode AD when restricted to first-order). Differential 1-forms Ω E : = C ∞ ( E , L ( E , R )) is similar to thecontinuation of E with the “answer” R . We can indeed writeour reduction in a continuation passing style (CPS) manner.Let h P | S i y ≡ ( Ω λy . P ) · S , then we can treat h P | S i y as aconfiguration of an element Γ ∪ { y : σ } ⊢ P : τ and a “con-tinuation” Γ ⊢ S : Ω τ . 
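The continuation reading of a configuration ⟨P | S⟩_y can be sketched concretely: each differentiable primitive, written in pullback style, passes its result to the continuation, receives the sensitivity of the final answer, and pulls it back through its own derivative. This is a minimal illustration with assumed primitives `square` and `sin_`, not the paper's formal rules.

```python
import math

# Each primitive p : R -> R takes a continuation k. It computes y = p(x),
# runs the rest of the computation via k, receives the sensitivity dy of the
# final result with respect to y, and returns dy * p'(x): the covector
# pulled back through J(p).
def square(x, k):
    dy = k(x * x)          # run the rest of the computation
    return dy * 2 * x      # reverse phase: multiply by the derivative

def sin_(x, k):
    dy = k(math.sin(x))
    return dy * math.cos(x)

# d/dx sin(x^2) at x = 0.5, seeding the final sensitivity with 1.0
grad = square(0.5, lambda y: sin_(y, lambda z: 1.0))
```

Running the pipeline with the identity seed plays the role of applying the pullback to the trivial 1-form; the analytic derivative is cos(x²)·2x.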
The rules for the redexes ⟨L | S⟩_y, (⟨E | V⟩_y) V and (⟨⟨y, E⟩ | V⟩_y) V can be converted directly from Reductions 7–20. For example, Reduction 8 can be written as

⟨let x = E in L | ω⟩_y → ⟨⟨y, E⟩ | ⟨L | ω⟩_{⟨y,x⟩}⟩_y.

We prefer to present our language without explicit mention of CPS, since this paper focuses on the syntactic notion of reverse-mode AD via pullbacks and 1-forms. Also, a 1-form of type σ is more precisely described as an element of the function type Ω σ ≡ σ ⇒ σ*, than of the continuation of σ, i.e. σ ⇒ (σ ⇒ R).

We show that any differential λ-category satisfying the Hahn-Banach Separation Theorem can soundly model our language. Cartesian differential categories [9] aim to axiomatise the fundamental properties of the derivative. Indeed, any model of synthetic differential geometry has an associated Cartesian differential category [13].
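The type Ω σ ≡ σ ⇒ σ* above can be modelled directly with functions: a dual value is a linear functional, and pulling back along a linear map precomposes with it, which is exactly the contravariance used by the dual-map reduction. A small sketch with assumed linear maps `f` and `g`:

```python
# (r1 ... rn)* as the linear functional v |-> sum_i r_i * v_i
def dual(rs):
    return lambda vs: sum(r * v for r, v in zip(rs, vs))

# (lin)* . omega  =  omega o lin : pullback precomposes the covector
def pullback(lin, omega):
    return lambda v: omega(lin(v))

omega = dual([2.0, -1.0])
f = lambda v: [v[0] + v[1], 3.0 * v[0]]     # assumed linear maps on R^2
g = lambda v: [2.0 * v[0], v[1] - v[0]]

# Contravariance: (f o g)* = g* . f* -- both sides send omega to the same covector.
lhs = pullback(g, pullback(f, omega))
rhs = pullback(lambda v: f(g(v)), omega)
```

Both sides compute ω(f(g(v))); the pullback reverses the order of composition, which is why chaining pullbacks yields the reverse phase of reverse-mode AD.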
Cartesian differential category
A category C is a Cartesian differential category if

• every homset C(A, B) is enriched with a commutative monoid (C(A, B), +_{AB}, 0_{AB}), and the additive structure is preserved by composition on the left, i.e. (g + h) ∘ f = g ∘ f + h ∘ f and 0 ∘ f = 0;
• it has products, and projections and pairings of additive maps are additive, where a morphism f is additive if it preserves the additive structure of the homset on the right, i.e. f ∘ (g + h) = f ∘ g + f ∘ h and f ∘ 0 = 0;

and it has an operator D[−] : C(A, B) → C(A × A, B) that satisfies the following axioms:

[CD1] D is linear: D[f + g] = D[f] + D[g] and D[0] = 0
[CD2] D is additive in its first coordinate: D[f] ∘ ⟨h + k, v⟩ = D[f] ∘ ⟨h, v⟩ + D[f] ∘ ⟨k, v⟩ and D[f] ∘ ⟨0, v⟩ = 0
[CD3] D behaves well with projections: D[Id] = π₁, D[π₁] = π₁ ∘ π₁ and D[π₂] = π₂ ∘ π₁
[CD4] D behaves well with pairings: D[⟨f, g⟩] = ⟨D[f], D[g]⟩
[CD5] Chain rule: D[g ∘ f] = D[g] ∘ ⟨D[f], f ∘ π₂⟩
[CD6] D[f] is linear in its first component: D[D[f]] ∘ ⟨⟨g, 0⟩, ⟨h, k⟩⟩ = D[f] ∘ ⟨g, k⟩
[CD7] Independence of the order of partial differentiation: D[D[f]] ∘ ⟨⟨0, h⟩, ⟨g, k⟩⟩ = D[D[f]] ∘ ⟨⟨0, g⟩, ⟨h, k⟩⟩

We call D the Cartesian differential operator of C.

Example 4.1.
The category
FVect of finite-dimensional vector spaces and differentiable functions is a Cartesian differential category, with the Cartesian differential operator D[f]⟨v, x⟩ = J(f)(x)(v).

A Cartesian differential operator does not necessarily behave well with exponentials. Hence, Bucciarelli et al. [11] added the (D-curry) rule and introduced differential λ-categories.

Differential λ-category A Cartesian differential category is a differential λ-category if

• it is Cartesian closed,
• λ(−) preserves the additive structure, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0,
• D[−] satisfies the (D-curry) rule: for any f : C × A → B, D[λ(f)] = λ(D[f] ∘ ⟨π₁ × 0_A, π₂ × Id_A⟩).

Linearity
A morphism f in a differential λ-category is linear if D[f] = f ∘ π₁.

Example 4.2.
The category
Con∞ of convenient vector spaces and smooth maps, considered by [8], is a differential λ-category with the Cartesian differential operator D[f]⟨v, x⟩ := lim_{t→0} (f(x + tv) − f(x))/t, as shown in Lemma E.2.

We say a differential λ-category C satisfies the Hahn-Banach Separation Theorem if R is an object in C and, for any object A in C and distinct elements x, y in A, there exists a linear morphism l : A → R that separates x and y, i.e. l(x) ≠ l(y).

Example 4.3.
The category
Con∞ of convenient vector spaces and smooth maps satisfies the Hahn-Banach Separation Theorem, as shown in Proposition E.3.

Let C be a differential λ-category that satisfies the Hahn-Banach Separation Theorem. Since C is Cartesian closed, the interpretations of the λ-calculus terms are standard, and hence omitted. The full set of interpretations can be found in Appendix C.

⟦R⟧ := R    ⟦σ₁ × σ₂⟧ := ⟦σ₁⟧ × ⟦σ₂⟧
⟦σ*⟧ := L(⟦σ⟧, R)    ⟦σ₁ ⇒ σ₂⟧ := C(⟦σ₁⟧, ⟦σ₂⟧)

where L(⟦σ⟧, R) := { f ∈ C(⟦σ⟧, R) | D[f] = f ∘ π₁ } is the set of all linear morphisms from ⟦σ⟧ to R.

⟦0⟧_γ := 0    ⟦S + P⟧_γ := ⟦S⟧_γ + ⟦P⟧_γ
⟦⟨r₁ ... r_n⟩*⟧_γ := λ⟨v₁ ... v_n⟩. Σᵢ₌₁ⁿ rᵢ vᵢ
⟦(λx. S₁)* · S₂⟧_γ := λv. ⟦S₂⟧_γ (⟦S₁⟧⟨γ, v⟩)
⟦(Ω λx. P) · S⟧_γ := λx. λv. ⟦S⟧_γ (⟦P⟧⟨γ, x⟩)(D[cur(⟦P⟧) γ]⟨v, x⟩)

We verify our definitions of linearity and substitution in Lemma 4.4 and Lemma 4.5 respectively.
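The Cartesian differential operator underlying these interpretations can be sanity-checked numerically in the smooth-maps reading, where D[f]⟨v, x⟩ is the directional derivative of f at x along v. A finite-difference check of the chain-rule axiom [CD5], with assumed one-variable maps `f` and `g` (an illustration, not part of the formal development):

```python
# D[f]<v, x>: central-difference directional derivative of f at x along v.
def D(f, v, x, eps=1e-6):
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

f = lambda x: x ** 3
g = lambda y: 2 * y ** 2
x, v = 0.7, 1.0

lhs = D(lambda t: g(f(t)), v, x)   # D[g o f]<v, x>
rhs = D(g, D(f, v, x), f(x))       # D[g]<D[f]<v, x>, f(x)>  (axiom [CD5])
```

Both sides evaluate g′(f(x)) · f′(x) · v, as the chain rule demands.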
Lemma 4.4 (Linearity). Let Γ₁ ∪ {x : σ} ⊢ P₁ : τ and Γ₂ ⊢ P₂ : σ*. Let γ₁ ∈ ⟦Γ₁⟧ and γ₂ ∈ ⟦Γ₂⟧. Then,

1. if x ∈ lin(P₁), then cur(⟦P₁⟧) γ₁ is linear, i.e. D[cur(⟦P₁⟧) γ₁] = (cur(⟦P₁⟧) γ₁) ∘ π₁;
2. ⟦P₂⟧_{γ₂} is linear, i.e. D[⟦P₂⟧_{γ₂}] = (⟦P₂⟧_{γ₂}) ∘ π₁.

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id_{⟦Γ⟧}, ⟦Γ ⊢ P : σ⟧⟩

Any differential λ-category satisfying the Hahn-Banach Separation Theorem is a sound model of our language. Note that the Hahn-Banach Separation Theorem is crucial in the proof.

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P →_A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P → P′ implies ⟦P⟧ = ⟦P′⟧.

Proof. The full proof can be found in Appendix E.2. We proceed by case analysis on the reductions of pullback terms. Consider Reduction 19b. Let γ ∈ ⟦Γ⟧. By the induction hypothesis and V πᵢ ≡ λz. P′, we have ⟦((Ω λz. P′) · ω) (V πⱼ)⟧ = ⟦(λv′. S′)* · ω (P′[V πⱼ / z])⟧, which means that for any 1-form ϕ and any v,

ϕ(⟦P′⟧⟨γ, ⟦V πⱼ⟧ γ⟩)(D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩) = ϕ(⟦P′⟧⟨γ, ⟦V πⱼ⟧ γ⟩)(⟦S′⟧⟨γ, v⟩).

Let l be a linear morphism to R; then λx. l is a 1-form, and hence we have l(D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩) = l(⟦S′⟧⟨γ, v⟩). By the contrapositive of the Hahn-Banach Separation Theorem, this implies

D[cur(⟦P′⟧) γ]⟨v, ⟦V πⱼ⟧ γ⟩ = ⟦S′⟧⟨γ, v⟩.

Note that by (D-eval) in [21], D[ev ∘ ⟨πᵢ, πⱼ⟩]⟨v, x⟩ = πᵢ(v)(πⱼ(x)) + D[πᵢ(x)]⟨πⱼ(v), πⱼ(x)⟩. Hence we have

⟦((Ω λy. (y πᵢ)(y πⱼ)) · ω) V⟧ γ
= λv. ⟦ω⟧ γ (⟦(y πᵢ)(y πⱼ)⟧⟨γ, ⟦V⟧ γ⟩)(D[ev ∘ ⟨πᵢ, πⱼ⟩]⟨v, ⟦V⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + D[⟦V πᵢ⟧ γ]⟨v πⱼ, ⟦V πⱼ⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + D[cur(⟦P′⟧) γ]⟨v πⱼ, ⟦V πⱼ⟧ γ⟩)
= λv. ⟦ω⟧ γ (⟦(V πᵢ)(V πⱼ)⟧ γ)((v πᵢ)(⟦V πⱼ⟧ γ) + ⟦S′⟧⟨γ, v πⱼ⟩)
= ⟦(λv. (v πᵢ)(V πⱼ) + S′[v πⱼ / v′])* · ω ((V πᵢ)(V πⱼ))⟧ γ  □

A simple corollary of Theorem 4.6 is that types are invariant under reductions.
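The separation argument in the proof is elementary in the finite-dimensional setting: two distinct points of Rⁿ are separated by the linear functional induced by their difference vector. A small sketch (finite-dimensional only; the convenient-vector-space case needs the full Hahn-Banach theorem):

```python
# Given distinct x, y in R^n, l(v) = <x - y, v> is linear and separates them,
# since l(x) - l(y) = |x - y|^2 > 0.
def separating_functional(x, y):
    d = [a - b for a, b in zip(x, y)]
    return lambda v: sum(di * vi for di, vi in zip(d, v))

x, y = [1.0, 2.0], [1.0, 3.0]
l = separating_functional(x, y)
```

Here l(x) − l(y) equals the squared distance between x and y, so it is nonzero exactly when x ≠ y.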
Corollary 4.7 (Subject Reduction). For any pullback terms P and P′ with P → P′, if Γ ⊢ P : σ, then Γ ⊢ P′ : σ.

Recall that performing reverse-mode AD on a function f : Rⁿ → Rᵐ at a point x ∈ Rⁿ computes a row of the Jacobian matrix J(f)(x), i.e. (J(f)(x))*(π_p). The following corollary tells us that our reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first order), and that we can perform reverse-mode AD on any abstraction, which might contain higher-order terms, duals, pullbacks and free variables.

Corollary 4.8.
Let Γ ∪ {y : σ} ⊢ P : τ, Γ ⊢ P₀ : σ and γ ∈ ⟦Γ⟧.

1. Let σ ≡ Rⁿ and τ ≡ Rᵐ. If ((Ω λy. P) · Ω e_p) P₀ →* V, then the p-th row of the Jacobian matrix of ⟦P⟧⟨γ, −⟩ at ⟦P₀⟧ γ is (⟦V⟧ γ)*.
2. Let l be a linear morphism from ⟦τ⟧ to R. If ((Ω λy. P) · ω) P₀ →* (λv. P′)* · ω P″ for some fresh variable ω, then the derivative of l ∘ (⟦P⟧⟨γ, −⟩) at ⟦P₀⟧ γ along any v ∈ ⟦σ⟧ is l(⟦P′⟧⟨γ, λx. l, v⟩), i.e. D[l ∘ (⟦P⟧⟨γ, −⟩)]⟨v, ⟦P₀⟧ γ⟩ = l(⟦P′⟧⟨γ, λx. l, v⟩).

Example 4.9.
In Example 3.9, we showed that ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω [ ]) applied to the chosen point reduces to the dual value [660 528]*. Note that [660 528] is exactly the Jacobian matrix of f at that point.

We discuss recent works on calculi and languages that provide differentiation capabilities.
The standard bearer is none other than the differential λ-calculus [15], which has inspired the design of our language. The implementation induced by the differential λ-calculus is a form of symbolic differentiation, which suffers from expression swell. For this reason, Manzyuk [22] introduced the perturbative λ-calculus, a λ-calculus with a forward-mode AD operator. Our language is complementary to these calculi, in that it implements higher-order reverse-mode AD; moreover, it is call-by-value, which is crucial for reverse-mode AD to avoid expression swell, as illustrated in Example 3.7.

What is the relationship between our language and the differential λ-calculus? We can give a precise answer via a compositional translation (−)ᵗ to a differential λ-calculus extended with real numbers, function symbols, pairs and projections, defined as follows:

s, t ::= x | λx. s | s T | D s · t | πᵢ(s) | ⟨s, t⟩ | r | f(T) | D f · t
S, T ::= 0 | s | s + T    where r ∈ R, f ∈ F

The major cases of the definition of (−)ᵗ are:

(σ*)ᵗ := σᵗ ⇒ R
(J f · S)ᵗ := D f · Sᵗ
(⟨r₁ ... r_n⟩*)ᵗ := λv. Σᵢ₌₁ⁿ fᵢ(πᵢ(v))
((λy. S₁)* · S₂)ᵗ := λv. (S₂)ᵗ ((λy. (S₁)ᵗ) v)
((Ω λy. P) · S)ᵗ := λx. λv. Sᵗ ((λy. Pᵗ) x) ((D (λy. Pᵗ) · v) x)

for fᵢ := rᵢ × −. (The definitions are provided in full in Appendix D.) Because the differential λ-calculus does not have a linear function type, (S₁)ᵗ is no longer in a linear position in ((λx. S₁)* · S₂)ᵗ. Though the translation does not preserve linearity, it does preserve reductions and interpretations (Lemma 5.1).

Lemma 5.1.
Let P be a term.
1. If P → P′, then there exists a reduct s of (P′)ᵗ such that Pᵗ →* s in L_D.
2. ⟦P⟧ = ⟦Pᵗ⟧ in C.

A corollary of Lemma 5.1 (1) is that our reduction strategy is strongly normalizing.
Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.
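The expression-swell contrast with symbolic differentiation noted above can be seen in a few lines: symbolically differentiating an n-fold squaring duplicates the subject term at every step, while a value-level reverse pass keeps constant-size state. An illustrative sketch (not the paper's translation):

```python
# Symbolic derivative of x squared n times: the expression for the
# derivative roughly doubles in size at each step (expression swell).
def symbolic_derivative(n):
    e, de = "x", "1"
    for _ in range(n):
        # d(e^2) = 2 * e * de, then the new subject term is e^2
        e, de = f"({e}*{e})", f"(2*{e}*{de})"
    return de

# Value-level reverse mode: forward phase stores n intermediates, and the
# reverse phase folds a single scalar sensitivity back over them.
def reverse_mode(x, n):
    vals = [x]
    for _ in range(n):
        vals.append(vals[-1] ** 2)
    grad = 1.0
    for v in reversed(vals[:-1]):
        grad *= 2 * v
    return grad
```

At x = 1 the derivative of x^(2^n) is 2^n, which the reverse pass computes exactly while the symbolic expression grows exponentially.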
Encouraged by calls [14, 19, 24] from the machine learning community, the development of reverse-mode AD programming languages has been an active research problem. Following Pearlmutter and Siskind [27], these languages usually treat reverse-mode AD as a meta-operator on programs.
First-order
Elliott [16] gives a categorical presentation of reverse-mode AD. Using a functor over Cartesian categories, he presents a neat implementation of reverse-mode AD. As is well known, conditionals do not behave well with smoothness [6]; nor do loops and recursion. Abadi and Plotkin [2] address this problem via a first-order language with conditionals, recursively defined functions, and a construct for reverse-mode AD. Using real analysis, they prove the coincidence of its operational and denotational semantics. To our knowledge, these treatments of reverse-mode AD are restricted to first-order functions.
Towards higher-order
The first work that extends reverse-mode AD to higher orders is by Pearlmutter and Siskind [27]; they use a non-compositional program transformation to implement reverse-mode AD.

Inspired by Wang et al. [32, 33], Brunel et al. [10] study a simply-typed λ-calculus augmented with a notion of linear negation type. Though our dual type may resemble their linear negation, the two are actually quite different. In fact, our work can be viewed as providing a positive answer to the last paragraph of [10, Sec. 7], where the authors address the relation between their work and the differential λ-calculus. They describe a "naïve" approach to expressing reverse-mode AD in the differential λ-calculus, naïve in the sense that it suffers from expression swell, which our approach does not (see Example 3.7). Moreover, Brunel et al. use a program transformation to perform reverse-mode AD, whereas we use a first-class differential operator. Brunel et al. [10] prove correctness of reverse-mode AD on real-valued functions (Theorem 5.6 and Corollary 5.7 in [10]), whereas we allow any (higher-order) abstraction to be the argument of a pullback term, and we prove that the result of reducing such a pullback term is exactly the derivative of the abstraction (Corollary 4.8).

Building on Elliott [16]'s categorical presentation of reverse-mode AD, and Pearlmutter and Siskind [27]'s idea of differentiating higher-order functions, Vytiniotis et al. [31] developed an implementation of a simply-typed differentiable programming language.

However, none of these treatments is purely higher-order, in the sense that their differential operators can only compute the derivative of an "end to end" first-order program (which may be constructed using higher-order functions), but not the derivative of a higher-order function. As far as we know, our work gives the first implementation of reverse-mode AD in a higher-order programming language that directly computes the derivative of higher-order functions using
reverse-mode AD (Corollary 4.8 (2)).

After outlining the mathematical foundation of reverse-mode AD as the pullback of differential 1-forms (Section 2.2), we presented a simple higher-order programming language with an explicit differential operator, (Ω (λx. P)) · S (Subsection 3.1), and a call-by-value reduction strategy to divide (A-reductions, Subsection 3.2), conquer (pullback reductions, Subsection 3.3) and combine (Subsection 3.4) the term ((Ω (λx. P)) · ω) S, such that its reduction exactly mimics reverse-mode AD. Examples were given to illustrate that our reduction is faithful to reverse-mode AD. Moreover, we showed how our reduction can be adapted to a CPS evaluation (Subsection 3.5).

We showed (in Section 4) that any differential λ-category that satisfies the Hahn-Banach Separation Theorem is a sound model of our language (Theorem 4.6), and how our reduction precisely captures the notion of reverse-mode AD in both first-order and higher-order settings (Corollary 4.8).

Future Directions.
An interesting direction is to extend our language with probability, so that it can serve as a compiler intermediate representation for "deep" probabilistic programming frameworks such as Edward [29] and Pyro [30]. Inference algorithms that require the computation of gradients, such as Hamiltonian Monte Carlo and variational inference, on which Edward and Pyro rely, can then be expressed in such a language, allowing us to prove their correctness.
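The gradient consumers mentioned above are simple to state once a gradient is available. For instance, a single leapfrog step of Hamiltonian Monte Carlo only needs the gradient of the potential energy; a minimal illustrative sketch (the names and the step size are assumptions, not from the paper):

```python
# One leapfrog step: half-step on momentum, full step on position,
# half-step on momentum, using the gradient grad_U of the potential U.
def leapfrog(q, p, grad_U, step):
    p = p - 0.5 * step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

# For U(q) = q^2 / 2 (standard normal potential), grad_U(q) = q.
q, p = leapfrog(1.0, 0.0, lambda q: q, 0.1)
```

In a language with a first-class differential operator, `grad_U` would itself be a term of the language rather than a hand-written derivative, which is what makes correctness proofs for such inference algorithms feasible.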
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] Martín Abadi and Gordon D. Plotkin. 2020. A simple differentiable programming language. Proc. ACM Program. Lang. 4, POPL (2020). https://doi.org/10.1145/3371106
[3] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. (The Theano Development Team). 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016). http://arxiv.org/abs/1605.02688
[4] F. Bauer. 1974. Computational Graphs and Rounding Error. SIAM J. Numer. Anal. 11, 1 (1974), 87–96. https://doi.org/10.1137/0711010
[5] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: a Survey. J. Mach. Learn. Res. 18 (2017), 153:1–153:43. http://jmlr.org/papers/v18/17-468.html
[6] Thomas Beck and Herbert Fischer. 1994. The if-problem in automatic differentiation. J. Comput. Appl. Math. 50, 1 (1994), 119–131. https://doi.org/10.1016/0377-0427(94)90294-1
[7] Michael Betancourt. 2018. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592 (2018).
[8] Richard Blute, Thomas Ehrhard, and Christine Tasson. 2010. A convenient differential category. CoRR abs/1006.3140 (2010). http://arxiv.org/abs/1006.3140
[9] Richard F. Blute, J. Robin B. Cockett, and Robert A. G. Seely. 2009. Cartesian differential categories. Theory and Applications of Categories 22 (2009), 622–672.
[10] Aloïs Brunel, Damiano Mazza, and Michele Pagani. 2019. Backpropagation in the Simply Typed Lambda-Calculus with Linear Negation. CoRR abs/1909.13768 (2019). http://arxiv.org/abs/1909.13768
[11] Antonio Bucciarelli, Thomas Ehrhard, and Giulio Manzonetto. 2010. Categorical Models for Simply Typed Resource Calculi. Electr. Notes Theor. Comput. Sci. 265 (2010), 213–230. https://doi.org/10.1016/j.entcs.2010.08.013
[12] Alonzo Church. 1965. The Calculi of Lambda-Conversion. New York: Kraus Reprint Corporation.
[13] J. Robin B. Cockett and Geoff S. H. Cruttwell. 2014. Differential Structure, Tangent Structure, and SDG. Applied Categorical Structures (2014). https://doi.org/10.1007/s10485-013-9312-0
[14] David Dalrymple. 2016. 2016: What do you consider the most interesting recent [scientific] news? What makes it important? (2016). Accessed: 2020-01-07.
[15] Thomas Ehrhard and Laurent Regnier. 2003. The differential lambda-calculus. Theor. Comput. Sci. 309 (2003), 1–41. https://doi.org/10.1016/S0304-3975(03)00392-X
[16] Conal Elliott. 2018. The simple essence of automatic differentiation. Proc. ACM Program. Lang. 2, ICFP (2018), 70:1–70:29. https://doi.org/10.1145/3236765
[17] Alfred Frölicher and Andreas Kriegl. 1988. Linear spaces and differentiation theory. Chichester: Wiley.
[18] Philipp H. W. Hoffmann. 2016. A Hitchhiker's Guide to Automatic Differentiation. Numerical Algorithms 72, 3 (2016), 775–811. https://doi.org/10.1007/s11075-015-0067-6
[19] Yann LeCun. 2018. Deep Learning est mort. Vive Differentiable Programming! (2018). Accessed: 2020-01-07.
[20] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. 2015. Autograd: Effortless Gradients in Numpy. Presented at the AutoML Workshop, ICML.
[21] Giulio Manzonetto. 2012. What is a categorical model of the differential and the resource λ-calculi? Mathematical Structures in Computer Science 22, 3 (2012), 451–520. https://doi.org/10.1017/S0960129511000594
[22] Oleksandr Manzyuk. 2012. A Simply Typed λ-Calculus of Forward Automatic Differentiation. Electr. Notes Theor. Comput. Sci. 286 (2012), 257–272. https://doi.org/10.1016/j.entcs.2012.08.017
[23] Peter W. Michor and Andreas Kriegl. 1997. The convenient setting of global analysis. Providence, R.I.: American Mathematical Society.
[24] Christopher Olah. 2015. Neural Networks, Types, and Functional Programming. http://colah.github.io/posts/2015-09-NN-Types-FP/ (2015). Accessed: 2020-01-07.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. CoRR abs/1912.01703 (2019). http://arxiv.org/abs/1912.01703
[26] Barak A. Pearlmutter. 2019. A Nuts-and-Bolts Differential Geometric Perspective on Automatic Differentiation. Presented at the Languages for Inference Workshop, Cascais, Portugal.
[27] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
[28] Amr Sabry and Matthias Felleisen. 1992. Reasoning About Programs in Continuation-Passing Style. In Proceedings of the Conference on Lisp and Functional Programming (LFP 1992). ACM, 288–298. https://doi.org/10.1145/141471.141563
[29] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. CoRR abs/1701.03757 (2017). http://arxiv.org/abs/1701.03757
[30] Uber. 2017. Pyro. http://pyro.ai/ (Retrieved Nov 2018).
[31] Dimitrios Vytiniotis, Dan Belov, Richard Wei, Gordon Plotkin, and Martin Abadi. 2019. The Differentiable Curry. Presented at the Program Transformations for Machine Learning Workshop, NeurIPS, Vancouver, Canada.
[32] Fei Wang, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2018. Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018). 10201–10212.
[33] Fei Wang, Daniel Zheng, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying differentiable programming: shift/reset the penultimate backpropagator. Proc. ACM Program. Lang. 3, ICFP (2019), 96:1–96:31. https://doi.org/10.1145/3341700
[34] R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464. https://doi.org/10.1145/355586.364791

Appendix A Examples
A.1 Simple Example
We focus on how to compute the derivative of f at a point by the different modes of AD. First, f is decomposed into elementary functions as

R² --g--> R² --mult--> R --pow2--> R,

where g(⟨x, y⟩) := ⟨x + , x + y⟩. Then, Figure 4 summarizes the iterations of the different modes of AD. Now we show how Section 3 tells us how to perform reverse-mode AD on f.

Term
Assuming д , mult , pow2 ∈ F , we can define the fol-lowing term in the language. ⊢ (cid:0) ( Ω λ h x , y i . pow2 ( mult ( д (h x , y i)))) · ( Ω (cid:2) (cid:3) ) (cid:1) h , i : R ∗ This term is the application of the pullback Ω ( f )( λx . (cid:2) (cid:3) ∗ ) tothe point h , i , which is exactly the Jacobian of f at h , i . Administrative Reduction
We decompose the term pow2(mult(g(⟨x, y⟩))), via administrative reductions, into a let-series of elementary terms:

pow2(mult(g(⟨x, y⟩))) →*_A L ≡ let z₁ = ⟨x, y⟩; z₂ = g(z₁); z₃ = mult(z₂); z₄ = pow2(z₃) in z₄.

This is reminiscent of the decomposition of f into R² --g--> R² --mult--> R --pow2--> R before performing AD.

Splitting the Omega
Now, via Reductions 7 and 8, (Ω λ⟨x, y⟩. L) · ω is reduced to a series of pullbacks along elementary terms:

(Ω λ⟨x, y⟩. let z₁ = ⟨x, y⟩; z₂ = g(z₁); z₃ = mult(z₂); z₄ = pow2(z₃) in z₄) · ω
→* (Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) ·
    (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω

Pullback Reduction
We showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · ω is reduced to

(Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩)
· (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩)
· (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩)
· (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω

We now show how it is reduced when applied to the point [ ]:

((Ω λ⟨x, y⟩. ⟨⟨x, y⟩, ⟨x, y⟩⟩) · (Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) [ ]
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁⟩. ⟨⟨x, y⟩, z₁, g(z₁)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · (Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) ⟨[ ], [ ]⟩)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁, z₂⟩. ⟨⟨x, y⟩, z₁, z₂, mult(z₂)⟩) · ((Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω)) ⟨[ ], [ ], [ ]⟩)   (⋆)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (((Ω λ⟨⟨x, y⟩, z₁, z₂, z₃⟩. pow2(z₃)) · ω) ⟨[ ], [ ], [ ], ⟩)
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · (ω )

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, considered in Figure 4. Moreover, we used the reduction that evaluates f(r) to its value a couple of times in the argument position of an application. This is to avoid expression swell: the application marked (⋆) is evaluated only once, even when its result is used in various computations.

Combine
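The forward phase traced above, followed by the reverse phase below, can be mimicked in ordinary code. The sketch is plain Python, not the paper's language; the constant c = 1 in g and the sample point (2, 3) are illustrative choices only:

```python
# Reverse-mode AD on f(x, y) = ((x + c)(x + y))**2 as a forward phase that
# records intermediate values, followed by a reverse phase that applies the
# transposed Jacobians to the seed covector. c = 1 and the sample point are
# illustrative choices, not fixed by the example.
c = 1.0

def g(x, y):
    return (x + c, x + y)             # R^2 -> R^2

def grad_f(x, y, seed=1.0):
    # Forward phase: evaluate and remember the intermediates.
    p, q = g(x, y)
    m = p * q                         # mult
    # Reverse phase: (J pow2)^T, (J mult)^T, (J g)^T applied in turn.
    dm = seed * 2.0 * m               # derivative of m**2 at m
    dp, dq = dm * q, dm * p           # mult^T at (p, q)
    return (dp + dq, dq)              # g^T: x feeds both components, y feeds q
```

The result agrees with finite differences on f, which is the sanity check the worked example is building towards.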
Replacing ω by Ω[1] ≡ λx. [1]∗, we have shown so far that ((Ω λ⟨x, y⟩. pow2(mult(g(⟨x, y⟩)))) · Ω[1]) ⟨ , ⟩ is reduced to

(λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗
· (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · (ω )

Now, via Reduction 5 and β-reduction, we further reduce it to

(λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄, v₅⟩. (J pow2 · v₅) )∗ · [1]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃, v₄⟩. ⟨⟨v₁, v₂⟩, v₃, v₄, (J mult · v₄) ⟨ , ⟩⟩)∗ · [ ]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · (λ⟨⟨v₁, v₂⟩, v₃⟩. ⟨⟨v₁, v₂⟩, v₃, (J g · v₃) ⟨ , ⟩⟩)∗ · [ ]∗
−→ (λ⟨v₁, v₂⟩. ⟨⟨v₁, v₂⟩, ⟨v₁, v₂⟩⟩)∗ · [ ]∗
−→ [ ]∗

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, considered in Figure 4.

Carol Mak and Luke Ong

Naïve Forward Mode: ⟨⟨ , ⟩ | ⟨ ⟩⟩ →g ⟨⟨ , ⟩ | ⟨ ⟩⟩ →∗ ⟨ | [15 12]⟩ →(−)² ⟨ | [660 528]⟩
Forward Mode: ⟨⟨ , ⟩ | ⟨ ⟩⟩ →g ⟨⟨ , ⟩ | ⟨ ⟩⟩ →∗ ⟨ | [ ]⟩ →(−)² ⟨ | [ ]⟩
Reverse Mode, Forward Phase: ⟨ , ⟩ →g ⟨ , ⟩ →∗ →(−)² ; Reverse Phase: ⟨ ⟩ ←g [ ] ←∗ [ ] ←(−)²
Pullback:
(Ω(g) ◦ Ω(∗) ◦ Ω((−)²)) (λx. [ ]) (⟨ , ⟩)
= (J(g)(⟨ , ⟩))∗ (Ω(∗) ◦ Ω((−)²)) (λx. [ ]) (⟨ , ⟩)
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ (Ω((−)²)) (λx. [ ]) ( )
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ (J((−)²)( ))∗ ((λx. [ ]) ( ))
= (J(g)(⟨ , ⟩))∗ (J(∗)(⟨ , ⟩))∗ [ ]
= (J(g)(⟨ , ⟩))∗ ⟨ ⟩
= ⟨ ⟩

Figure 4. Different modes of automatic differentiation performed on the function f : ⟨x, y⟩ ↦ ((x + )(x + y))² at ⟨ , ⟩, after f is decomposed into elementary functions: R² →g R² →∗ R →(−)² R, where g(⟨x, y⟩) := ⟨x + , x + y⟩.

A.2 Sum Example
Consider the function that takes a list of real numbers and returns the sum of its elements. We show how Section 3 tells us to perform reverse-mode AD on such a higher-order function.
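The Church encoding used below can be transcribed directly into plain Python (an illustrative sketch; the function names are ours, not the language's):

```python
# Church encoding of lists: a list is its own fold.
# [x1, ..., xn] encodes to  λf d. f xn (... (f x2 (f x1 d))),
# and sum encodes to  λl. l (λx y. x + y) 0.
def church_list(xs):
    def encoded(f, d):
        acc = d
        for x in xs:                  # applies f to x1 first, xn last
            acc = f(x, acc)
        return acc
    return encoded

def church_sum(l):
    return l(lambda x, y: x + y, 0.0)
```

Any fold can be expressed the same way, e.g. a product by passing multiplication and seed 1.0 to the encoded list.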
Term
Using the standard Church encoding of lists, i.e.

List(X) ≡ (X → D → D) → (D → D)
[x₁, x₂, . . . , x_n] ≡ λf d. f x_n (. . . (f x₂ (f x₁ d)))

for some dummy type D, sum : List(R) → R can be expressed in the language described in Section 3 as λl. l (λxy. x + y) 0. Hence the derivative of sum at a list [ , − ] can be expressed as

{ω : Ω(List(R))} ⊢ ((Ω(sum)) · ω) [ , − ] : R∗.

Administrative Reduction
We first decompose the body of the sum : List(R) → R term, considered in Example 3.2, i.e. l (λxy. x + y) 0:

l (λxy. x + y) 0
−→∗_A ((let z′ = l in z′) (λxy. let z′ = x + y in z′)) (let z′ = 0 in z′)
−→∗_A (let z₁ = l; z₂ = λxy. (let z′ = x + y in z′); z₃ = z₁ z₂ in z₃) (let z′ = 0 in z′)
−→∗_A let z₁ = l; z₂ = λxy. (let z′ = x + y in z′); z₃ = z₁ z₂; z₄ = 0; z₅ = z₃ z₄ in z₅

Splitting the Omega
After the A-reductions, we split (Ω(λl. l (λxy. x + y) 0)) · ω via Reductions 7 and 8:

(Ω(λl. l (λxy. x + y) 0)) · ω
−→∗_A (Ω λl. let z₁ = l; z₂ = λxy. let z′ = x + y in z′; z₃ = z₁ z₂; z₄ = 0; z₅ = z₃ z₄ in z₅) · ω
−→∗ (Ω λl. ⟨l, l⟩)
· (Ω λ⟨l, z₁⟩. ⟨l, z₁, λxy. L⟩)
· (Ω λ⟨l, z₁, z₂⟩. ⟨l, z₁, z₂, z₁ z₂⟩)
· (Ω λ⟨l, z₁, z₂, z₃⟩. ⟨l, z₁, z₂, z₃, 0⟩)
· (Ω λ⟨l, z₁, z₂, z₃, z₄⟩. z₃ z₄) · ω

Pullback Reduction
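Before the symbolic reduction, it is worth noting what the answer must be numerically: since sum is linear, its derivative at any list is the all-ones covector. A finite-difference sketch in plain Python (the list [7.0, -2.0] is an illustrative choice, not the example's own list):

```python
# sum is linear, so its gradient at any list is the all-ones covector.
# Finite-difference check on an illustrative list.
def list_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def fd_gradient(f, xs, eps=1e-6):
    base = f(xs)
    grads = []
    for i in range(len(xs)):
        bumped = list(xs)
        bumped[i] += eps              # perturb one coordinate at a time
        grads.append((f(bumped) - base) / eps)
    return grads
```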
First, Figure 5 shows that ((Ω [ , − ]) · ω′) (λxy. L) is reduced to

(λv. v (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v d)) (+(⟨ , d⟩)))∗ · ω′ A

where A ≡ ⟨λxy. L, λxy. L, − , λy. +(⟨− , y⟩), λxy. L, , λy. +(⟨ , y⟩), d, +(⟨ , d⟩)⟩.

((Ω [ , − ]) · ω′) (λxy. L)
≡ ((Ω λf d. f (− ) (f ( ) d)) · ω′) (λxy. L)
−→∗_A ((Ω λf d. let z₁ = f; z₂ = − ; z₃ = z₁ z₂; z₄ = f; z₅ = ; z₆ = z₄ z₅; z₇ = d; z₈ = z₆ z₇; z₉ = z₃ z₈ in z₉) · ω′) (λxy. L)
−→∗ ((Ω λf. ⟨f, f⟩) · (Ω λ⟨f, z₁⟩. ⟨f, z₁, − ⟩) · (Ω λ⟨f, z₁, z₂⟩. ⟨f, z₁, z₂, z₁ z₂⟩) · (Ω λ⟨f, z₁, z₂, z₃⟩. ⟨f, z₁, z₂, z₃, f⟩) · (Ω λ⟨f, z₁, …, z₄⟩. ⟨f, z₁, …, z₄, ⟩) · (Ω λ⟨f, z₁, …, z₅⟩. ⟨f, z₁, …, z₅, z₄ z₅⟩) · (Ω λ⟨f, z₁, …, z₆⟩. ⟨f, z₁, …, z₆, d⟩) · (Ω λ⟨f, z₁, …, z₇⟩. ⟨f, z₁, …, z₇, z₆ z₇⟩) · (Ω λ⟨f, z₁, …, z₈⟩. ⟨f, z₁, …, z₈, z₃ z₈⟩) · ω′) (λxy. L)
−→∗ (λv. ⟨v, v⟩)∗ · (λ⟨v₁, v₂⟩. ⟨v₁, v₂, 0⟩)∗ · (λ⟨v₁, v₂, v₃⟩. ⟨v₁, v₂, v₃, v₂ (− ) + λy. (J + · ⟨v₃, 0⟩) ⟨− , y⟩⟩)∗ · (λ⟨v₁, …, v₄⟩. ⟨v₁, …, v₄, v₁⟩)∗ · (λ⟨v₁, …, v₅⟩. ⟨v₁, …, v₅, 0⟩)∗ · (λ⟨v₁, …, v₆⟩. ⟨v₁, …, v₆, v₅ ( ) + λy. (J + · ⟨v₆, 0⟩) ⟨ , y⟩⟩)∗ · (λ⟨v₁, …, v₇⟩. ⟨v₁, …, v₇, 0⟩)∗ · (λ⟨v₁, …, v₈⟩. ⟨v₁, …, v₈, v₇ d + (J +(⟨ , −⟩) · v₈) d⟩)∗ · (λ⟨v₁, …, v₉⟩. ⟨v₁, …, v₉, v₄ (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · v₉) (+(⟨ , d⟩))⟩)∗ · ω′ A
−→∗ (λv. v (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v d)) (+(⟨ , d⟩)))∗ · ω′ A

Figure 5. Reduction of ((Ω [ , − ]) · ω′) (λxy. L).

Then, we reduce ((Ω(sum)) · ω) [ , − ] as follows.

((Ω λl. ⟨l, l⟩) · (Ω λ⟨l, z₁⟩. ⟨l, z₁, λxy. L⟩) · (Ω λ⟨l, z₁, z₂⟩. ⟨l, z₁, z₂, z₁ z₂⟩) · (Ω λ⟨l, z₁, z₂, z₃⟩. ⟨l, z₁, z₂, z₃, 0⟩) · (Ω λ⟨l, z₁, z₂, z₃, z₄⟩. z₃ z₄) · ω) [ , − ]
−→∗ (λv. ⟨v, v⟩)∗ · (λ⟨v₁, v₂⟩. ⟨v₁, v₂, 0⟩)∗ · (λ⟨v₁, v₂, v₃⟩. ⟨v₁, v₂, v₃, v₂ (λxy. L) + v₃ (− ) (+(⟨ , d⟩)) + (J +(⟨− , −⟩) · (v₃ d)) (+(⟨ , d⟩))⟩)∗ · (λ⟨v₁, …, v₄⟩. ⟨v₁, …, v₄, 0⟩)∗ · (λ⟨v₁, …, v₅⟩. ⟨v₁, …, v₅, v₄ 0 + (J +(⟨− , +(⟨ , −⟩)⟩) · v₅) 0⟩)∗ · ω B
−→∗ (λv. v (λxy. L) 0)∗ · ω B

where B ≡ ⟨[ , − ], [ , − ], λxy. L, λd. +(⟨− , +(⟨ , d⟩)⟩), 0, ⟩. Hence (λv. v (λxy. L) 0)∗ · ω B is the pullback of sum ≡ λl. l (λxy. L) 0 at [ , − ]. This sequence of reductions tells us how the derivative of sum at [ , − ] can be computed using reverse-mode AD.

B Administrative Reduction
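The administrative reduction formalised in this section is essentially conversion to A-normal form: every compound subterm is bound to a fresh let-variable, leaving a let series of elementary terms. A minimal sketch for a toy expression tree (illustrative, not the paper's syntax):

```python
# Administrative reduction as A-normal-form conversion for a toy expression
# tree: each compound subterm is bound to a fresh let-variable z1, z2, ...
import itertools

def to_let_series(expr):
    counter = itertools.count(1)
    bindings = []                     # the let series, in dependency order

    def anf(e):
        if not isinstance(e, tuple):
            return e                  # atoms (variables, constants) are elementary
        op, *args = e
        simple = [anf(a) for a in args]
        z = f"z{next(counter)}"
        bindings.append((z, (op, *simple)))
        return z

    result = anf(expr)
    return bindings, result
```

For example, to_let_series(("pow2", ("mult", ("g", "x", "y")))) yields the let series z1 = g(x, y); z2 = mult(z1); z3 = pow2(z2) with result z3, mirroring the decomposition used in Appendix A.1.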
Elementary terms E, let series L, A-contexts C_A and A-redexes r_A are defined as follows.

E ::= 0 | z₁ + z₂ | z | λx. L | z₁ z₂ | π_i(z) | ⟨z₁, z₂⟩ | r | f(z) | J f · z | (λx. L)∗ · z | (Ω λx. L) · z | r∗
L ::= let z = E in L | let z = E in z.
C_A ::= [] | C_A + P | L + C_A | λz. C_A | C_A P | L C_A | π_i(C_A) | ⟨C_A, S⟩ | ⟨L, C_A⟩ | f(C_A) | J f · C_A | (λx. C_A)∗ · S | (λx. L)∗ · C_A | (Ω(λx. C_A)) · S | (Ω(λx. L)) · C_A
r_A ::= 0 | L₁ + L₂ | x | λz. L | L₁ L₂ | π_i(L) | ⟨L₁, L₂⟩ | r | f(L) | J f · L | (λx. L)∗ · L | (Ω λx. L) · L | r∗

Lemma B.1.
Every pullback term P can be expressed as either C_A[r_A] for some unique A-context C_A and A-redex r_A, or as a let series of elementary terms L. An A-redex r_A is reduced to a let series L as follows.

0 −→_A let x = 0 in x
L₁ + L₂ −→_A let x₁ = L₁; x₂ = L₂; x₃ = x₁ + x₂ in x₃
x −→_A let x₁ = x in x₁
λz. L −→_A let x = λz. L in x
L₁ L₂ −→_A let x₁ = L₁; x₂ = L₂; x₃ = x₁ x₂ in x₃
π_i(L) −→_A let x₁ = L; x₂ = π_i(x₁) in x₂
⟨L₁, L₂⟩ −→_A let x₁ = L₁; x₂ = L₂; x₃ = ⟨x₁, x₂⟩ in x₃
r −→_A let x = r in x
f(L) −→_A let x₁ = L; x₂ = f(x₁) in x₂
J f · L −→_A let x₁ = L; x₂ = J f · x₁ in x₂
(λx. L′)∗ · L −→_A let x₁ = L; x₂ = (λx. L′)∗ · x₁ in x₂
(Ω λx. L′) · L −→_A let x₁ = L; x₂ = (Ω(λx. L′)) · x₁ in x₂
r∗ −→_A let x = r∗ in x

Any pullback term P which can be expressed as C_A[r_A] can be A-reduced to C_A[L], where r_A −→_A L.

C Interpretation

⟦Γ ⊢ 0 : σ⟧γ = 0
⟦Γ ⊢ S + P : σ⟧γ = ⟦S⟧γ + ⟦P⟧γ
⟦Γ ∪ {x : σ} ⊢ x : σ⟧⟨γ, z⟩ = z
⟦Γ ⊢ λx. S : σ₁ ⇒ σ₂⟧γ = cur(⟦S⟧)γ
⟦Γ ⊢ S P : σ⟧γ = (⟦S⟧γ)(⟦P⟧γ)
⟦Γ ⊢ π_i(S) : σ_i⟧γ = π_i(⟦S⟧γ)
⟦Γ ⊢ ⟨S₁, S₂⟩ : σ₁ × σ₂⟧γ = ⟨⟦S₁⟧γ, ⟦S₂⟧γ⟩
⟦Γ ⊢ r : R⟧γ = r
⟦Γ ⊢ f(P) : R^m⟧γ = f(⟦P⟧γ)
⟦Γ ⊢ J f · S : R^n ⇒ R^m⟧γ = cur(D[f])(⟦S⟧γ)
⟦Γ ⊢ (λx. S₁)∗ · S₂ : σ∗⟧γ = λv. ⟦S₂⟧γ (⟦S₁⟧⟨γ, v⟩)
⟦Γ ⊢ (Ω λx. P) · S : Ω σ⟧γ = λx v. ⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
⟦Γ ⊢ [r₁ . . . r_n]∗ : R^n∗⟧γ = λ[v₁ . . . v_n]. Σⁿᵢ₌₁ r_i v_i

D Extended Differential Lambda-Calculus
Differential substitution for the extended differential λ-terms is defined as follows.

∂/∂x (π_i(s)) · T ≡ π_i(∂s/∂x · T)
∂/∂x ⟨s₁, s₂⟩ · T ≡ ⟨∂s₁/∂x · T, ∂s₂/∂x · T⟩
∂r/∂x · T ≡ 0
∂/∂x (f(s)) · T ≡ (D f · (∂s/∂x · T)) s
∂/∂x (D f · s) · T ≡ D f · (∂s/∂x · T)

Consider the term f(s). There are no linear occurrences of x in f. Hence, we ignore f and perform differential substitution on s directly, obtaining (D f · (∂s/∂x · T)) s.

We can interpret the extended differential λ-calculus in a differential λ-category, which gives the categorical semantics of the differential λ-calculus. Hence, what is left to show is the interpretation of the extended terms.

⟦π_i(s)⟧ = π_i ∘ ⟦s⟧
⟦⟨s₁, s₂⟩⟧ = ⟨⟦s₁⟧, ⟦s₂⟧⟩
⟦r⟧ = λγ. r
⟦f(s)⟧ = f ∘ ⟦s⟧
⟦D f · s⟧ = λγ x. D[f]⟨⟦s⟧γ, x⟩

Translation to Differential Lambda-Calculus

(π_i(S))ᵗ := π_i(Sᵗ)
(S + P)ᵗ := Sᵗ + Pᵗ
(⟨S₁, S₂⟩)ᵗ := ⟨S₁ᵗ, S₂ᵗ⟩
yᵗ := y
rᵗ := r
(λy. S)ᵗ := λy. Sᵗ
(f(P))ᵗ := f(Pᵗ)
(S P)ᵗ := Sᵗ Pᵗ
(J f · S)ᵗ := D f · Sᵗ
([r₁ . . . r_n]∗)ᵗ := λv. Σⁿᵢ₌₁ f_i(π_i(v)), where f_i := r_i × −
((λy. S₁)∗ · S₂)ᵗ := λv. S₂ᵗ ((λy. S₁ᵗ) v)
((Ω λy. P) · S)ᵗ := λx v. Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)

E Proofs
Proposition E.1. The derivative of any constant morphism f in a differential λ-category is 0, i.e. D[f] = 0.

Proof. A constant morphism f : A → B that maps all of A to b ∈ B can be written as f = (λz. b) ∘ 0, where 0 : A → B is the zero map and λz. b : B → B. So by [CD1, 2, 5] we have

D[f] = D[(λz. b) ∘ 0] = D[λz. b] ∘ ⟨D[0], 0 ∘ π₂⟩ = D[λz. b] ∘ ⟨0, 0 ∘ π₂⟩ = 0. □

Lemma E.2.
Con∞ is a differential λ-category with the differential operator D[f]⟨v, x⟩ := J(f)(x)(v) = lim_{t→0} (f(x + tv) − f(x))/t.

Proof. [17, 23] have shown that Con∞ is Cartesian closed, and [8] has shown that Con∞ is a Cartesian differential category. What is left to show is that λ(−) preserves the additive structure and that D[−] satisfies the (D-curry) rule, i.e. D[λ(f)] = λ(D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩).

We first show that λ(−) is additive, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0. Note that for f, g : A × B → C and a ∈ A, b ∈ B,

λ(f + g)(a)(b) = (f + g)⟨a, b⟩ = f⟨a, b⟩ + g⟨a, b⟩ = λ(f)(a)(b) + λ(g)(a)(b)

and λ(0)(a)(b) = 0⟨a, b⟩ = 0 = 0(a)(b).

Now we show that D[−] satisfies the (D-curry) rule. Let f : A × B → C, v, x ∈ A and b ∈ B.

D[λ(f)]⟨v, x⟩ b = (lim_{t→0} (λ(f)(x + vt) − λ(f)(x))/t) b
= lim_{t→0} (f⟨x + vt, b⟩ − f⟨x, b⟩)/t
= lim_{t→0} (f(⟨x, b⟩ + t⟨v, 0⟩) − f⟨x, b⟩)/t
= D[f]⟨⟨v, 0⟩, ⟨x, b⟩⟩
= (D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩)⟨⟨v, x⟩, b⟩
= λ(D[f] ∘ ⟨π₁ × 0, π₂ × Id⟩)⟨v, x⟩ b □

Proposition E.3.
Let E be a convenient vector space, and let x, y ∈ E be distinct elements of E. Then there exists a bornological linear map l : E → R that separates x and y, i.e. l(x) ≠ l(y).

Proof. This follows from the fact that every convenient vector space is separated. x ≠ y implies that x − y ≠
0. Hence by separation, thereis a bornological linear map l : E → R such that l ( x − y ) , l is linear, so we have l ( x ) − l ( y ) , l ( x ) , l ( y ) . (cid:3) Lemma 4.4 (Linearity) . Let Γ ∪ { x : σ } ⊢ P : τ and Γ ⊢ P : σ ∗ . Let γ ∈ J Γ K and γ ∈ J Γ K . Then,1. if x ∈ lin ( P ) , then cur ( J P K ) γ is linear, i.e. D [ cur ( J P K ) γ ] = ( cur ( J P K ) γ ) ◦ π ,2. J P K γ is linear, i.e. D [ J P K γ ] = ( J P K γ ) ◦ π .Proof. Induction on the structure of P on the following twostatements.IH.1 If Γ ∪ { x : σ } ⊢ P : τ and x ∈ lin ( P ) , then forany γ ∈ J Γ K , cur ( J P K ) γ is linear, i.e. D [ cur ( J P K ) γ ] = ( cur ( J P K ) γ ) ◦ π .IH.2 If Γ ⊢ P : σ ∗ , then for any γ ∈ J Γ K , J P K γ is linear, i.e. D [ J P K γ ] = ( J P K γ ) ◦ π .(var) Say P ≡ x .(1) If Γ ∪ { x : σ } ⊢ x : σ and x ∈ lin ( x ) , then D [ cur ( J x K ) γ ] = D [ Id ] = π = Id ◦ π = ( cur ( J x K ) γ ) ◦ π .(2) If Γ ⊢ x : σ ∗ , then Γ = Γ ∪ { x : σ ∗ } so for any h γ , z i ∈ J Γ K , z is linear and D [ J x K h γ , z i] = D [ z ] = z ◦ π = ( J P K h γ , z i) ◦ π .(dual) Say P ≡ ( λx . S ) ∗ · S .(1) Let Γ ∪ { x : σ } ⊢ ( λx . S ) ∗ · S : τ and x ∈ lin (( λx . S ) ∗ · S ) : = (cid:0) lin ( S ) \ FV ( S ) (cid:1) ∪ (cid:0) lin ( S ) \ FV ( S ) (cid:1) , then for any γ ∈ J Γ K and since J S K h γ , x i is of a dual type, by IH.2, D [ cur ( J ( λx . S ) ∗ · S K ) γ ]h v , x i = λz . (cid:0) D [ J S K h γ , −i]h v , x i (cid:1) д (h x , z i) + D [ J S K h γ , x i]h D [ д (h− , z i)]h v , x i , д (h x , z i)i = λz . (cid:0) D [ J S K h γ , −i]h v , x i (cid:1) д (h x , z i) + J S K h γ , x i( D [ д (h− , z i)]h v , x i) where д : h x , z i 7→ J S K h γ , x , z i . Note that x can onlybe in either lin ( S ) \ FV ( S ) or lin ( S ) \ FV ( S ) but notboth. Say x ∈ lin ( S ) \ FV ( S ) , then by Proposition E.1and IH.1, D [ cur ( J ( λx . S ) ∗ · S K ) γ ]h v , x i = λz . 
(cid:0) D [ J S K h γ , −i]h v , x i (cid:1) ( J S K h γ , x , z i) + J S K h γ , x i( D [ J S K h γ , − , z i]h v , x i) = λz . J S K h γ , x i( D [ J S K h γ , − , z i]h v , x i) = λz . J S K h γ , v i( J S K h γ , v , z i) = J ( λx . S ) ∗ · S K h γ , v i = (cid:0) cur ( J ( λx . S ) ∗ · S K ) γ ) ◦ π (cid:1) h v , x i (2) Let Γ ⊢ ( λx . S ) ∗ · S : σ ∗ and γ ∈ J Γ K . Then, byIH.1 and IH.2, D [ J ( λx . S ) ∗ · S K γ ] = D [( J S K γ ) ◦ (cid:0) cur ( J S K ) γ (cid:1) ] = D [ J S K γ ] ◦ h D [ cur ( J S K ) γ ] , ( cur ( J S K ) γ ) ◦ π i = ( J S K γ ) ◦ ( cur ( J S K ) γ ) ◦ π = ( J ( λx . S ) ∗ · S K γ ) ◦ π All other cases are straight forward inductive proofs. (cid:3)
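The Linearity Lemma can be checked numerically for concrete maps: for a linear f, the directional derivative D[f]⟨v, x⟩ equals f(v), independently of the base point x. A finite-difference sketch (f_linear is an illustrative map, not drawn from the paper):

```python
# For a linear map f, the directional derivative D[f]<v, x> equals f(v),
# independently of the base point x: a numerical view of the Linearity Lemma.
def directional_derivative(f, v, x, eps=1e-6):
    fx = f(x)
    fxe = f([xi + eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / eps for a, b in zip(fxe, fx)]

def f_linear(p):
    # An illustrative linear map: f(x, y) = (3x + y, 2y).
    x, y = p
    return [3.0 * x + y, 2.0 * y]
```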
Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id_{⟦Γ⟧}, ⟦Γ ⊢ P : σ⟧⟩.

Proof.
The only interesting cases are dual and pullback maps.

(dual) ((λx. S₁)∗ · S₂)[P′/y] ≡ (λx. S₁[P′/y])∗ · S₂[P′/y]

⟦((λx. S₁)∗ · S₂)[P′/y]⟧γ = ⟦(λx. S₁[P′/y])∗ · S₂[P′/y]⟧γ
= ⟦S₂[P′/y]⟧γ ∘ cur(⟦S₁[P′/y]⟧)γ
= λx. ⟦S₂⟧⟨γ, ⟦P′⟧γ⟩ (⟦S₁⟧⟨γ, ⟦P′⟧γ, x⟩)   (IH)
= ⟦(λx. S₁)∗ · S₂⟧⟨γ, ⟦P′⟧γ⟩

(pb) ((Ω λx. P) · S)[P′/y] ≡ (Ω λx. P[P′/y]) · S[P′/y]

⟦((Ω λx. P) · S)[P′/y]⟧γ = ⟦(Ω λx. P[P′/y]) · S[P′/y]⟧γ
= λx v. (⟦S⟧⟨γ, ⟦P′⟧γ⟩) (⟦P⟧⟨γ, x, ⟦P′⟧γ⟩) (D[⟦P⟧⟨γ, −, ⟦P′⟧γ⟩]⟨v, x⟩)   (IH)
= ⟦(Ω λx. P) · S⟧⟨γ, ⟦P′⟧γ⟩ □

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P −→_A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P −→ P′ implies ⟦P⟧ = ⟦P′⟧.

Proof.
1. Easy induction on −→ A .2. Case analysis on reductions of pullback terms. Let γ ∈ J Γ K .(1-4) J ( λx . S ) V K = J S [ V / x ] K , π i (h V , V i) = J V i K , f ( r ) = J f ( r ) K and J J ( f )( r )( r ′ ) K = D [ f ]h r ′ , r i are easilyverified using the Substitution Lemma 4.5.(5) Let J ( f )( r ) = [ a ij ] i = ,..., m , j = ,..., n and r ′ = [ r ′ i ] i = ,..., m . J ( λv . J ( f )( r )( v )) ∗ · r ′∗ K γ arol Mak and Luke Ong = (J ( f )( r )) ∗ ( λv . m Õ i = r ′ i v i ) = λv . m Õ i = r ′ i n Õ j = a ij v j = λv . n Õ j = m Õ i = ( r ′ i · a ij ) v j = λv . n Õ j = ((J ( f )( r )) ⊤ × r ) j v j = J ((J ( f )( r )) ⊤ × r ) ∗ K γ (6) Say Γ ∪ { v : σ } ⊢ V : τ . Let Γ ∪ { v : σ , v : σ } ⊢ V ′ : τ where v is not a free variable in V . J ( λv . V ) ∗ · (cid:16) ( λv . V ) ∗ · V (cid:17) K γ = (cid:0) cur ( J V K ) γ (cid:1) ∗ (cid:16)(cid:0) cur ( J V K ) γ (cid:1) ∗ ( J V K γ ) (cid:17) = (cid:16) (cid:0) cur ( J V K ) γ (cid:1) ◦ (cid:0) cur ( J V K ) γ (cid:1)(cid:17) ∗ ( J V K γ ) = (cid:16) v J V K h γ , J V K h γ , v ii (cid:17) ∗ ( J V K γ ) = (cid:16) v J V ′ K hh γ , v i , J V K h γ , v ii (cid:17) ∗ ( J V K γ ) = (cid:16) v (cid:0) J V ′ K ◦ h Id , J V K i (cid:1) h γ , v i (cid:17) ∗ ( J V K γ ) = (cid:16) v (cid:0) J V ′ [ V / v ] K (cid:1) h γ , v i (cid:17) ∗ ( J V K γ ) = (cid:16) cur ( J V ′ [ V / v ] K ) γ (cid:17) ∗ ( J V K γ ) = J ( λv . V ′ [ V / v ]) ∗ · V K (7) Using the Substitution Lemma 4.5, J ( Ω ( λy . let x = E in x )) · ω K = J ( Ω ( λy . E )) · ω K follows immediately from J λy . let x = E in x K = cur ( J let x = E in x K ) = cur ( J E K ) = J λy . E K . (8) Consider ( Ω ( λy . let x = E in L )) · ω −→( Ω ( λy . h y , E i)) · (cid:0) ( Ω ( λz . b L )) · ω (cid:1) where Γ ∪ { z : σ × σ } ⊢ b L ≡ L [ π ( z )/ y ][ π ( z )/ x ] : τ . J ( Ω ( λy . 
let x = E in L )) · ω K γ = Ω (cid:16) cur ( J let x = E in L K ) γ (cid:17) ( J ω K γ ) = Ω (cid:16) cur ( J L K ◦ h Id , J E K i) γ (cid:17) ( J ω K γ ) = Ω (cid:16) s J L K hh γ , s i , J E K h γ , s ii (cid:17) ( J ω K γ ) = Ω (cid:16) s J b L K h γ , h s , J E K h γ , s iii (cid:17) ( J ω K γ ) = Ω (cid:16) (cid:0) cur ( J b L K ) γ (cid:1) ◦ h Id J σ K , cur ( J E K ) γ i (cid:17) ( J ω K γ ) = Ω (cid:0) h Id J σ K , cur ( J E K ) γ i (cid:1) (cid:16) Ω (cid:0) cur ( J b L K ) γ (cid:1) ( J ω K γ ) (cid:17) = Ω (cid:0) h Id J σ K , cur ( J E K ) γ i (cid:1) (cid:16) J ( Ω λz . b L ) · ω K γ (cid:17) = Ω (cid:0) cur ( J h y , E i K ) γ (cid:1) (cid:16) J ( Ω λz . b L ) · ω K γ (cid:17) = J ( Ω λy . h y , E i) · (cid:0) ( Ω λz . b L ) · ω (cid:1) K (9) Say y is not free in E and (cid:0) ( Ω ( λy . E )) · ω (cid:1) V −→ J (( Ω λy . E ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J E K h γ , x i)( D [ cur ( J E K ) γ ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ ( J E K h γ , x i) (cid:1) ( J V K γ ) = ( λxv . ) ( J V K γ ) = λv . = J K γ since cur ( J E K ) γ is a constant function and the deriv-ative of any constant function is 0 by PropositionE.1.(10) We present the proof for (10b) (cid:0) ( Ω λy . y πi + y π j ) · ω (cid:1) V −→ ( λv . v πi + v π j ) ∗ · ω (cid:0) V πi + V π j (cid:1) which leads to (10.1). J (( Ω λy . y πi + y π j ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y πi + y π j K h γ , x i)( D [ π i + π j ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ ( x πi + x π j )( v πi + v π j ) (cid:1) ( J V K γ ) = λv . J ω K γ ((( J V K γ ) πi + (( J V K γ ) π j )( v πi + v π j ) = λv . J ω ( V πi + V π j ) K γ ( v πi + v π j ) = J ( λv . v πi + v π j ) ∗ · ω ( V πi + V π j ) K γ (11) (cid:0) ( Ω λy . y ) · ω (cid:1) V −→ ( λv . v ) ∗ · ω V J (( Ω λy . y ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y K h γ , x i)( D [ Id ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ xv (cid:1) ( J V K γ ) = λv . J ω K γ ( J V K γ ) v = λv . 
J ω V K γv = J ( λv . v ) ∗ · ω V K γ (12) (( Ω λy . y πi ) · ω ) V −→ ( λv . v πi ) ∗ · ω V πi J (( Ω λy . y πi ) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J y πi K h γ , x i)( D [ π i ]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ x πi v πi (cid:1) ( J V K γ ) = λv . J ω K γ ( J V πi K γ ) v πi = J ( λv . v πi ) ∗ · ω V πi K γ (13) We prove for (13c), (( Ω λy . h y πi , y π j i) · ω ) V −→( λv . h v πi , v π j i) ∗ · ω h V πi , V π j i which leads to (13a)and (13b). J (( Ω λy . h y πi , y π j i) · ω ) V K γ = (cid:0) λxv . J ω K γ ( J h y πi , y π j i K h γ , x i)( D [h π i , π j i]h v , x i) (cid:1) ( J V K γ ) = (cid:0) λxv . J ω K γ h x πi , x π j ih v πi , v π j i (cid:1) ( J V K γ ) = λv . J ω K γ h( J V K γ ) πi , ( J V K γ ) π j ih v πi , v π j i = λv . J ω h V πi , V π j i K γ h v πi , v π j i = J ( λv . h v πi , v π j i) ∗ · ω h V πi , V π j i K γ (14) (cid:0) ( Ω λy . J f · y πi ) · ω (cid:1) V −→ ( λv . J f · v πi ) ∗ · (cid:0) ω (J f · V πi ) (cid:1) By [CD3,4,5,6], D [ λyz . D [ f ]h y πi , z i] = D [ cur ( D [ f ]) ◦ π i ] = D [ cur ( D [ f ])] ◦ ( π i × π i ) = cur ( D [ D [ f ]] ◦ h π × , π × Id i) ◦ ( π i × π i ) = cur ( D [ f ] ◦ ( π × Id )) ◦ ( π i × π i ) Hence J (cid:0) ( Ω λy . J f · y πi ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) λx . D [ f ]h J V π j K γ , x i (cid:1)(cid:0) D [ λyz . D [ f ]h y πi , z i]h v , J V K γ i (cid:1) = λv . J ω K γ (cid:0) λx . D [ f ]h J V π j K γ , x i (cid:1)(cid:16)(cid:0) cur ( D [ f ] ◦ ( π × Id )) ◦ ( π i × π i ) (cid:1) h v , J V K γ i (cid:17) = λv . J ω K γ (cid:0) J (J f · V πi ) K γ (cid:1) ( λx . D [ f ]h v πi , x i) ifferential-form Pullback Programming Language and Reverse-mode AD = J ( λv . J f · v πi ) ∗ · (cid:0) ω (J f · V πi ) (cid:1) K γ (15) (( Ω λy . f ( y πi )) · ω ) V −→ ( λv . (J f · v πi ) V πi ) ∗ · (cid:0) ω ( f ( V πi )) (cid:1) J (( Ω λy . f ( y πi )) · ω ) V K γ = (cid:0) λxv . J ω K γ (cid:0) f x πi (cid:1) (cid:0) D [ f ]h v πi , x πi i (cid:1) (cid:1) ( J V K γ ) = λv . 
J ω K γ (cid:0) f ( J V πi K γ ) (cid:1) (cid:0) D [ f ]h v πi , J V πi K γ i (cid:1) = λv . J ω K γ (cid:0) f ( J V πi K γ ) (cid:1) (cid:0) J (J f · v πi ) V πi K h γ , v i (cid:1) = J ( λv . (J f · v πi ) V πi ) ∗ · (cid:0) ω ( f ( V πi )) (cid:1) K γ (16) We prove for the most complicated case (16c) whichleads to (16a) and (16b).By IH, J (cid:0) ( Ω λy . L ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω V ′ K impliesfor any 1-form ϕ , γ and x , v , ϕ ( J L K h γ , J V K γ , x i) ( D [ J L K h γ , − , x i]h v , J V K γ i) = ϕ ( J V K γ ) ( J S K h γ , x , v i) . By Hahn-Banach Theorem, we have D [ J L K h γ , − , x i]h v , J V K γ i = J S K h γ , x , v i . First, note that since V πi is of the dual type, henceby Lemma 4.4 (2), D [ J V πi K γ ] = ( J V πi K γ ) ◦ π . D [ cur ( J ( λz . L ) ∗ · y πi K ) γ ]h v , J V K γ i = D [ λy . λz . y πi ( J L K h γ , y , z i)]h v , J V K γ i = D [ cur ( ev ◦ h π i ◦ π , д i)]h v , J V K γ i = λz . D [ ev ◦ h π i ◦ π , д i]hh v , i , h J V K γ , z ii = λz . (cid:0) ev ◦ h D [ π i ◦ π ] , д ◦ π i + D [ uncur ( π i ◦ π )]◦ hh , D [ д ]i , h π , д ◦ π ii (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i ◦ π )]hh , D [ д ]hh v , i , h J V K γ , z iii , hh J V K γ , z i , д h J V K γ , z iii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i )]hh , D [ д ]hh v , i , h J V K γ , z iii , h J V K γ , д h J V K γ , z iii = λz . v πi ( д h J V K γ , z i) + D [ uncur ( π i )h J V K γ , −i]h D [ д ]hh v , i , h J V K γ , z ii , д h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + (cid:0) D [ J V πi K γ ] ◦ h D [ д ] , д ◦ π i (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + (cid:0) v πi ◦ π ◦ h D [ д ] , д ◦ π i (cid:1) hh v , i , h J V K γ , z ii = λz . v πi ( д h J V K γ , z i) + J V πi K γ ) (cid:0) D [ д h− , z i]h v , J V K γ i (cid:1) = λz . v πi ( J L K h γ , J V K γ , z i) + J V πi K γ (cid:0) D [ J L K h γ , − , z i]h v , J V K γ i (cid:1) = λz . 
v πi ( J L K h γ , J V K γ , z i) + J V πi K γ (cid:0) J S K h γ , x , v i (cid:1) = J ( λz . L [ V / y ]) ∗ · v πi + ( λz . S ) ∗ · V πi K h γ , v i . where д : h y , z i 7→ J L K h γ , y , z i .Now we have J (cid:0) ( Ω λy . ( λz . L ) ∗ · y πi ) · ω (cid:1) V K γ = λv . J ω K γ ( J ( λz . L ) ∗ · y πi K h γ , J V K γ i) (cid:0) D [ cur ( J λz . L ∗ · y πi K ) γ ]h v , J V K γ i (cid:1) = λv . J ω K γ ( J ( λz . L [ V / y ]) ∗ · V πi K γ ) (cid:0) J ( λz . L [ V / y ]) ∗ · v πi + ( λz . S ) ∗ · V πi K h γ , v i (cid:1) = J ( λv . ( λz . S ) ∗ · V πi + ( λz . L [ V / y ]) ∗ · v πi ) ∗ · ω (( λz . L [ V / y ]) ∗ · V πi ) K γ (17) (cid:0) ( Ω λy . ( Ω λx . L ) · y πi ) · ω (cid:1) V −→ (cid:0) ( Ω λy . λa . ( λv . S ) ∗ · z L [ a / x ]) · ω (cid:1) V if (cid:0) ( Ω λx . L ) · z ) a −→ ∗ ( λv . S ) ∗ · z L [ a / x ] forfresh variable a .By IH, J (cid:0) ( Ω λx . L ) · z ) a K = J ( λv . S ) ∗ · z L [ a / x ] K im-plies for any ϕ , γ y , a , v , ϕ ( J L K h γ , a , y i) ( D [ J L K h γ , − , y i]h v , a i) = ϕ ( J L K h γ , a , y i) ( J S K h γ , y , a , v i) . By Hahn-Banach Theorem, D [ J L K h γ , − , y i]h v , a i = J S K h γ , y , a , v i . J ( Ω λx . L ) · z K h γ , y i = λav . J z K h γ , y i ( J L K h γ , v , a i) ( D [ J L K h γ , − , y i]h v , a i) = λav . J z K h γ , y i ( J L [ a / x ] K h γ , v i) ( J S K h γ , y , a , v i) = λa . J ( λv . S ) ∗ · z L [ a / x ] K h γ , y , a i = J λa . ( λv . S ) ∗ · z L [ a / x ] K h γ , y i Hence we have J (cid:0) ( Ω λy . ( Ω λx . L ) · z ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) J ( Ω λx . L ) · z K h γ , J V K γ i (cid:1)(cid:0) D [ cur ( J ( Ω λx . L ) · z K ) γ ]h v , J V K γ i (cid:1) = λv . J ω K γ (cid:0) J λa . ( λv . S ) ∗ · z L [ a / x ] K h γ , J V K γ i (cid:1)(cid:0) D [ cur ( J ( Ω λx . L ) · z K ) γ ]h v , J V K γ i (cid:1) = J (cid:0) ( Ω λy . λa . ( λv . S ) ∗ · z L [ a / x ]) · ω (cid:1) V K γ (18) If (cid:0) ( Ω λy . L ) · ω (cid:1) V −→ ∗ ( λv . S ) ∗ · ω V and x < FV ( V ) , then (cid:0) ( Ω λy . 
λx . L ) · V (cid:1) V −→ ( λv . λx . S ) ∗ · V λx . L [ V / y ] . Recall the (D-curry) rule, D [ cur ( f )] = cur ( D [ f ] ◦ h π × , π × Id i) . By IH, we have J (cid:0) ( Ω λy . L ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω ( L [ V / y ]) K , whichmeans for any 1-form ϕ , γ and x , v , ϕ ( J L K h γ , J V K γ , x i) ( D [ J L K h γ , − , x i]h v , J V K γ i) = ϕ ( J L K h γ , J V K γ , x i) ( J S K h γ , x , v i) . By Hahn-Banach Theorem, D [ J L K h γ , − , x i]h v , J V K γ i = J S K h γ , x , v i . Now D [ cur ( J λx . L K ) γ ]h v , J V K γ i = D [ cur ( J L K )h γ , −i]h v , J V K γ i = D [ cur ( f )]h v , J V K γ i = cur ( D [ f ] ◦ h π × , π × Id i)h v , J V K γ i = λx . ( D [ f ] ◦ h π × , π × Id i)hh v , J V K γ i , x i = λx . D [ f h− , x i]h v , J V K γ i = λx . D [ J L K h γ , − , x i]h v , J V K γ i = λx . J S K h γ , x , v i where f : = uncur ( cur ( J L K )h γ , −i) . Hence, we have J (( Ω λy . λx . L ) · V ) V K γ = (cid:0) λxv . J V K γ ( J λx . L K h x , γ i)( D [ cur ( J λx . L K ) γ ]h v , x i) (cid:1) ( J V K γ ) arol Mak and Luke Ong = λv . J V K γ ( J λx . L K h J V K γ , γ i)( D [ cur ( J λx . L K ) γ ]h v , J V K γ i) = λv . J V K γ ( J λx . L [ V / y ] K γ )( λx . J S K h γ , x , v i) = λv . J V ( λx . L [ V / y ]) K γ ( J λx . S K h γ , v i) = J ( λv . λx . S ) ∗ · V ( λx . L [ V / y ]) K γ (19) We prove it for the complicated case (19c) and (19a)and (19b) follows.First note that by (D-eval) in [21], we have D [ ev ◦ h π i , π j i]h v , x i = π i ( v )( π j ( x )) + D [ π i ( x )]h π j ( v ) , π j ( x )i . By IH, and V πi ≡ λz . P ′ ,we have J (cid:0) ( Ω λz . P ′ ) · ω (cid:1) V π j K = J ( λv ′ . S ′ ) ∗ · ω ( P ′ [ V π j / z ]) K which means for any 1-form ϕ , γ and v , ϕ ( J P ′ K h γ , J V π j K γ i) (cid:0) D [ cur ( J P ′ K ) γ ]h v , J V π j K γ i (cid:1) = ϕ ( J P ′ K h γ , J V π j K γ i) ( J S ′ K h γ , v i) . 
By Hahn-Banach Theorem, D [ J V πi K γ ]h v π j , J V π j K γ i = D [ cur ( J P ′ K ) γ ]h v , J V π j K γ i = J S ′ K h γ , v i . Hencewe have J (cid:0) ( Ω λy . y πi y π j ) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) J y πi y π j K h γ , J V K γ i (cid:1) (cid:0) D [ ev ◦ h π i , π j i]h v , J V K γ )i (cid:1) = λv . J ω K γ (cid:0) J V πi V π j K γ (cid:1)(cid:0) v πi ( J V π j K γ ) + D [ J V πi K γ ]h v π j , J V π j K γ i (cid:1) = λv . J ω K γ (cid:0) J V πi V π j K γ (cid:1) (cid:0) v πi ( J V π j K γ ) + J S ′ K h γ , J V π j K γ i (cid:1) = J ( λv . v πi V π j + S ′ [ V π j / v ]) ∗ · ω ( V πi V π j ) K γ (20a) Say y is a free variable in E , (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V −→ ( λv . h v , S i) ∗ · ω h V , E [ V / y ]i if (cid:0) ( Ω λy . E ) · ω (cid:1) V −→( λv . S ) ∗ · ω ( E [ V / y ]) . By IH, we have J (cid:0) ( Ω λy . E ) · ω (cid:1) V K = J ( λv . S ) ∗ · ω ( E [ V / y ]) K , whichimplies for any γ ∈ J Γ K and v , J E K h γ , J V K γ i = J P K γ and D [ J E K h γ , −i]h v , J V K γ i = J S K h γ , v i . Now, J (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) h J V K γ , J E K h γ , J V K γ ii (cid:1)(cid:0) h D [ Id ]h v , J V K γ i , D [ J E K h γ , −i]h v , J V K γ ii (cid:1) = λv . J ω K γ (cid:0) h J V K γ , J E [ V / y ] K γ i (cid:1) (cid:0) h v , J S K h γ , v ii (cid:1) = λv . J h V , E [ V / y ]i ω K γ (cid:0) J λv . h v , S i K γv (cid:1) = J ( λv . h v , S i) ∗ · ω h V , E [ V / y ]i K γ (20b) If y < FV ( E ) , we have (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V −→ ( λv . h v , i) ∗ · ω h V , E i and J (cid:0) ( Ω ( λy . h y , E i)) · ω (cid:1) V K γ = λv . J ω K γ (cid:0) h J V K γ , J E K h γ , J V K γ ii (cid:1)(cid:0) h D [ Id ]h v , J V K γ i , D [ J E K h γ , −i]h v , J V K γ ii (cid:1) = λv . J ω K γ (cid:0) h J V K γ , J E K γ i (cid:1) (cid:0) h v , i (cid:1) = λv . J ω K γ (cid:0) J h V , E i γ K (cid:1) (cid:0) J λv . h v , i K γv (cid:1) = J ( λv . 
h v , i) ∗ · ω h V , E i K γ (cid:3) Lemma 5.1.
Let P be a term.
1. If P −→ P′, then there exists a reduct s of P′ᵗ such that Pᵗ −→∗ s in L_D.
2. ⟦P⟧ = ⟦Pᵗ⟧ in C.

Proof. 1. Easy induction on −→.
2. We prove by induction on P. Most cases are trivial. Let γ ∈ ⟦Γ⟧.

(dual) ⟦(λx. S₁)∗ · S₂⟧γ = λv. ⟦S₂⟧γ (cur(⟦S₁⟧)γ v)
= λv. ⟦S₂⟧⟨γ, v⟩ (⟦λx. S₁⟧⟨γ, v⟩ (⟦v⟧⟨γ, v⟩))
= λv. ⟦S₂ᵗ⟧⟨γ, v⟩ (⟦λx. S₁ᵗ⟧⟨γ, v⟩ (⟦vᵗ⟧⟨γ, v⟩))
= λv. ⟦S₂ᵗ ((λx. S₁ᵗ) v)⟧⟨γ, v⟩
= ⟦λv. S₂ᵗ ((λx. S₁ᵗ) v)⟧γ

(pb) ⟦(Ω(λy. P)) · S⟧γ = λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[cur(⟦P⟧)γ]⟨v, x⟩)
= λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[⟦P⟧]⟨⟨0, v⟩, ⟨γ, x⟩⟩)
= λx v. (⟦S⟧γ)(⟦P⟧⟨γ, x⟩)(D[⟦P⟧]⟨⟨0, ⟦v⟧⟨γ, x, v⟩⟩, ⟨⟨γ, x, v⟩, x⟩⟩)
= λx v. (⟦S⟧γ)(cur(⟦P⟧)γ (⟦x⟧⟨γ, x, v⟩))(⟦D(λy. P) · v⟧⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))
= λx v. (⟦S⟧⟨γ, x, v⟩)(cur(⟦P⟧)⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))(⟦D(λy. P) · v⟧⟨γ, x, v⟩ (⟦x⟧⟨γ, x, v⟩))
= λx v. ⟦Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)⟧⟨γ, x, v⟩
= ⟦λx v. Sᵗ ((λy. Pᵗ) x) ((D(λy. Pᵗ) · v) x)⟧γ □

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

Proof. If P does not terminate, then we can form a non-terminating reduction sequence in L_D using Lemma 5.1 (1) and the confluence of the differential λ-calculus, proved in [15]. This contradicts the strong normalization property of the differential λ-calculus. □
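The proofs above repeatedly use the separation property of Proposition E.3 (via the Hahn-Banach Separation Theorem) to turn an equality of 1-forms applied at all covectors into an equality of vectors. In the finite-dimensional case the separating linear functional can be written down directly; a sketch using an inner-product functional (our own construction, for illustration):

```python
# For distinct x and y, l(w) = <w, x - y> is linear and separates them:
# l(x) - l(y) = <x - y, x - y> = |x - y|^2 > 0.
def separating_functional(x, y):
    d = [a - b for a, b in zip(x, y)]
    return lambda w: sum(wi * di for wi, di in zip(w, d))
```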