Categorical semantics of a simple differential programming language
David I. Spivak and Jamie Vicary (Eds.): Applied Category Theory 2020 (ACT2020), EPTCS 333, 2021, pp. 289–310, doi:10.4204/EPTCS.333.20
Geoffrey Cruttwell
Mount Allison University, Sackville, Canada [email protected]
Jonathan Gallagher and Dorette Pronk
Dalhousie University, Halifax, Canada [email protected] [email protected]
With the increased interest in machine learning, and deep learning in particular (where one extracts progressively higher-level features from data using multiple layers of processing), the use of automatic differentiation has become more widespread in computation. See for instance the surveys given in [32] and [4]. In fact, Facebook's Chief AI Scientist Yann LeCun has gone as far as famously exclaiming:

"Deep learning est mort. Vive Differentiable Programming! ...people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization."

The point is that differentiation is no longer viewed as merely a useful tool when creating software, but instead as a fundamental building block. This sort of ubiquity warrants a more in-depth study of automatic differentiation with a focus on treating it as a fundamental component. There have been two recent developments to provide the theoretical support for this type of structure. In fact, the settings described above use two types of differentiation: the usual forward derivative to analyse the effect of changes in the data, as well as the reverse derivative to allow for error correction (i.e., training) through the efficient calculation of the gradients of functions. Thus, any theoretical approach must be able to deal with both types of differentiation. One approach is presented in [2], where Abadi and Plotkin provide a simple differential programming language with conditionals, recursive function definitions, and a notion of reverse-mode differentiation (from which forward differentiation can be derived), together with both a denotational and an operational semantics, and theorems showing that the two coincide. Another approach is given in [16], where the authors present reverse differential categories, a categorical setting for reverse differentiation.
They also show how every reverse differential category gives rise to a (forward) derivative and a canonical "contextual linear dagger" operation. The converse is true as well: a category with a forward derivative (that is, a Cartesian differential category [8]) and a contextual linear dagger has a canonical reverse derivative.

(Footnote: the LeCun quotation above is from a Facebook post on Jan 5, 2018.)

In the present paper we bring these two approaches together. In particular, we show how an extension of reverse derivative categories models Abadi and Plotkin's language, and describe how this categorical model allows one to consider potential improvements to the operational semantics of the language. To model Abadi and Plotkin's language categorically, reverse derivative categories are not sufficient, due to their inability to handle partial functions and control structures. Thus, we need to add partiality to reverse differential categories. The standard categorical machinery to model partiality is restriction structure, which assigns to each map a partial identity map, subject to axioms as described in [14]. Combining this structure with reverse differential structure, we introduce reverse differential restriction categories. In addition to the list of axioms [RD.1–7] given for reverse differential categories in [16], we require two additional axioms expressing how the restriction of the reverse derivative of a function is related to the restriction of the function itself, and what the reverse derivative of a restriction of a function needs to be (cf. Definition 3.1). The results characterizing the relationship between differential and reverse differential categories in terms of a contextual linear dagger extend to the context of restriction categories. We also get for free that the reverse derivative preserves the order on the maps and preserves joins of maps, if they exist.

In Section 4 we show how Abadi and Plotkin's language can be modelled in a reverse differential restriction category. We do this in two steps: first we modify their language by omitting general recursion and instead only include while-loops. While-loops can be modelled in terms of recursion, but by separating them out we can see that source-transformation techniques (not discussed explicitly in Abadi and Plotkin but used in some commercial systems such as Theano [5], TensorFlow [1], and Tangent [45]) always hold in our semantics (see Section 4.3); source-transformation techniques are not used for general recursion. We also note that in order to be able to push differentiation through the control structure, Abadi and Plotkin need that for each predicate the inverse images of true and false are both open sets. In the context of restriction categories we model this instead by providing two partially defined maps into the terminal object 1 for each predicate symbol (one for true and one for false) with the requirement that their restrictions do not overlap (cf.
Section 4.2).

In the process of modelling Abadi and Plotkin's language, we see that not all our axioms are needed. Specifically, axioms [RD.6] and [RD.7] of a reverse differential restriction category (which deal with the behaviour of repeated reverse derivatives) are not strictly necessary to model Abadi and Plotkin's language. However, in the final section of the paper, we show that if these axioms are present, changes can be made to the operational semantics to improve the efficiency and applicability of the simple differential programming language.

Abadi and Plotkin's language represents an approach that makes the reverse derivative a language primitive in a functional language. Other approaches have been proposed to use reverse-mode accumulation for computing the derivative in a functional language. Given a function f : R^n → R^m, Pearlmutter and Siskind [39] discuss how to compute the Jacobian matrix of f in a functional language by performing transformations on the function's computational graph. This idea is similar to the symbolic differentiation of trace or tape terms in Abadi and Plotkin's language. Elliot [26] shows how to view this sort of reverse-mode accumulation using continuations: when a function, written as a composition of simple operations, is written in continuation-passing style, the reverse derivative of its computation graph corresponds to a sort of generalized derivative of the continuation. In [46], Wang et al. extend Pearlmutter and Siskind's work by showing that the move to continuations allows for getting around the nonlocality issue in the earlier work. Brunel et al. [10] extend Wang's work from the point of view of linear logic, and allow for additional analyses based on tracking the linearity of a variable.
Abadi and Plotkin's work represents a next step in this area by considering, in addition, control flow structures and general recursive functions.

This current work contrasts with work submitted to ACT 2019 on modelling differential programming using synthetic differential geometry (SDG) (see [34, 35] for an introduction to SDG). In the previous work a simple differential programming language featuring forward differentiation was introduced and an interpretation into a well-adapted model of SDG was given (see e.g. [23] for such models). The focus was on exploring what programming language features might be able to exist soundly with differential programming. The current work develops the categorical semantics of Abadi and Plotkin's language for reverse differentiation, as well as the categorical semantics of source transformations for their language. In particular, we show that the operational semantics is modelled soundly by a denotational semantics into our categorical framework. We will also see that using the axiomatic approach developed here leads to a sound exponential speedup when computing the reverse derivative of looping phenomena.

In this section, we briefly review some of the relevant structures from category theory which we will make use of.
The canonical category for differentiation is the category Smooth, whose objects are the powers of the reals R (R^0 = {∗}, R^1, R^2, R^3, etc.) and whose maps are the smooth (infinitely differentiable) maps between them. To any map f : A → B in this category, there is an associated map D[f] : A × A → B whose value at (x, v) ∈ A × A is J(f)(x) · v, the Jacobian of f at x, taken in the direction v. This map satisfies various rules; for example, the chain rule is equivalent to the statement that for any maps f : A → B, g : B → C,

  D[fg] = ⟨π₀ f, D[f]⟩ D[g].

(Note that we use the path-order for composition, so fg means "first f then g".) Many other familiar rules from calculus can be expressed via D; for example, the symmetry of mixed partial derivatives can be expressed as a condition on D²[f] = D[D[f]].

Definition 2.1. ([8, Defn. 2.1.1]) A
Cartesian differential category or CDC is a Cartesian left additive category ([8, Defn. 1.3.1]) which has, for any map f : A → B, a map

  D[f] : A × A → B

satisfying seven axioms [CD.1–7].

The formulation of CDCs, and indeed of the other flavours of categories with derivatives we will use, has the intent that in ⟨a, v⟩ D[f], a is the point and v is the direction; this is in contrast to the original formulation of CDCs, which had the point and direction swapped. We chose the point-direction formulation because most of the literature follows this convention. This causes a change to axioms [CD.2, 6, 7]. While
Smooth is the canonical example, there are many others, including examples from infinite dimensional vector spaces, synthetic differential geometry, algebraic geometry, differential lambda calculus, etc.: see [8, 28, 11, 19, 15].

In contrast, the reverse derivative, widely used in machine learning for its efficiency, is an operation which takes a smooth map f : A → B and produces a smooth map R[f] : A × B → A whose value at (x, w) ∈ A × B is J(f)^T(x) · w, the transpose of the Jacobian of f at x, taken in the direction w.

There are two possible ways to categorically axiomatize the reverse derivative. One way is to start with a CDC and ask for a dagger structure (representing the transpose); one could then use the dagger with the D from the CDC to define a reverse derivative R. However, there is some subtlety in this: the dagger structure is only present on the linear maps of the category, not on all the maps of the category. The other way is to axiomatize R directly, as was done in [16].

Definition 2.2. ([16, Defn. 13]) A reverse differential category or RDC is a Cartesian left additive category which has, for any map f : A → B, a map

  R[f] : A × B → A

satisfying seven axioms.
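As a concrete numerical reading (our own sketch, not part of the paper): for a map f : R² → R², the forward derivative D[f](x, v) = J(x)·v pushes a direction forward through the Jacobian, while the reverse derivative R[f](x, w) = J(x)^T·w pulls a covector back through its transpose; the two are related by the adjointness ⟨D[f](x, v), w⟩ = ⟨v, R[f](x, w)⟩. All function names below are ours.

```python
# Numerical sketch (ours): forward vs reverse derivative of a concrete map
# f(x0, x1) = (x0 * x1, x0 + x1), whose Jacobian is [[x1, x0], [1, 1]].
# The Jacobian and its transpose are hard-coded for this example.

def D_f(x, v):
    """D[f](x, v) = J(x) . v : push the direction v forward."""
    return [x[1] * v[0] + x[0] * v[1], v[0] + v[1]]

def R_f(x, w):
    """R[f](x, w) = J(x)^T . w : pull the covector w back."""
    return [x[1] * w[0] + w[1], x[0] * w[0] + w[1]]

x = [2.0, 3.0]
v, w = [1.0, -1.0], [0.5, 2.0]

# Adjointness of the transpose: <D[f](x, v), w> = <v, R[f](x, w)>.
lhs = sum(a * b for a, b in zip(D_f(x, v), w))
rhs = sum(a * b for a, b in zip(v, R_f(x, w)))
print(lhs, rhs)  # both 0.5
```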
For example, in this formulation the chain rule is equivalent to [RD.5], the rule that for any maps f : A → B, g : B → C,

  R[fg] = ⟨π₀, (f × 1) R[g]⟩ R[f].

Moreover, there is something striking about a reverse differential structure: any RDC is automatically a CDC. If one applies the reverse derivative twice, zeroes out a component, and projects, the result is the forward derivative. That is, the following defines a (forward) differential structure from a reverse differential structure (see [16, Theorem 16]): given f : A → B, we have R[f] : A × B → A and R[R[f]] : (A × B) × A → A × B, and we set

  D[f] := A × A --⟨⟨π₀, 0⟩, π₁⟩--> (A × B) × A --R[R[f]]--> A × B --π₁--> B.

Thus, while a "dagger on linear maps" is required to derive an RDC from a CDC, no such structure is required to go from an RDC to a CDC. In fact, one can show that a CDC with a "dagger on linear maps" is equivalent to an RDC: see Theorem 42 in [16]. For this reason, as well as the fact that the reverse derivative is of greater importance in machine learning, in this paper we take a reverse differential category to be the primary structure.
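The construction of D from a double application of R can be checked numerically for a concrete map (a sketch of ours, not from [16]: R[f] is given in closed form, and its reverse derivative R[R[f]] is approximated by finite differences; all names are hypothetical).

```python
# Numerical sketch (ours): recover D[f](x, v) as the pi_1-component of
# <<pi_0, 0>, pi_1> R[R[f]] for a concrete f : R^2 -> R^2.

def jacobian(fn, p, eps=1e-6):
    """Finite-difference Jacobian of fn : R^n -> R^m at the point p."""
    f0 = fn(p)
    cols = []
    for j in range(len(p)):
        q = list(p)
        q[j] += eps
        fq = fn(q)
        cols.append([(fq[i] - f0[i]) / eps for i in range(len(f0))])
    # cols[j][i] = d fn_i / d p_j ; return rows J[i][j]
    return [[cols[j][i] for j in range(len(p))] for i in range(len(f0))]

def transpose_apply(J, w):
    """Compute J^T . w for an m x n matrix J and a length-m vector w."""
    return [sum(J[i][j] * w[i] for i in range(len(w))) for j in range(len(J[0]))]

# f(x0, x1) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]], so
# R[f] : R^2 x R^2 -> R^2 is given in closed form:
R_f = lambda xw: transpose_apply([[xw[1], xw[0]], [1.0, 1.0]], xw[2:])

x, v = [2.0, 3.0], [1.0, 1.0]

# R[R[f]] at the point <<x, 0>, v>, computed by finite differences on R_f;
# pi_1 keeps the B-component of the result.
J_Rf = jacobian(R_f, x + [0.0, 0.0])
Df = transpose_apply(J_Rf, v)[2:]
print(Df)  # J(f)(x) . v = [x1*v0 + x0*v1, v0 + v1] = [5.0, 2.0]
```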
Of course, to model a real-world programming language which involves non-terminating computations, we must also be able to handle partial functions. For this, we turn to restriction categories [14], which allow one to algebraically model categories whose maps may only be partially defined. Consider the category of sets and partial functions between them. To any map f : A → B in this category, there is an associated "partial identity" map f̄ : A → A, which is defined to be the identity wherever f is defined, and undefined otherwise. This operation then has various properties, such as f̄ f = f. This is then axiomatized:

Definition 2.3. ([14, Defn. 2.1.1]) A restriction category is a category which has, for any map f : A → B, a map f̄ : A → A satisfying various axioms.
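In the category of sets and partial functions this structure is easy to make concrete. The following toy sketch (ours, with partiality modelled by returning None) illustrates the property f̄ f = f:

```python
# Illustrative sketch (ours): sets and partial functions as a restriction
# category, with partial maps modelled as Python functions returning None
# where undefined.

def compose(f, g):
    """Path-order composite: first f, then g; undefined where f is."""
    def h(x):
        y = f(x)
        return None if y is None else g(y)
    return h

def restriction(f):
    """f-bar: the identity wherever f is defined, undefined otherwise."""
    def h(x):
        return None if f(x) is None else x
    return h

# A partial map on integers, defined only on even numbers.
half = lambda n: n // 2 if n % 2 == 0 else None

# The axiom  f-bar ; f = f  holds pointwise:
fbar_f = compose(restriction(half), half)
print([fbar_f(n) == half(n) for n in range(6)])  # all True
```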
In Section 3, we will combine restriction structure with reverse differential structure to get the categorical structure we will use to model Abadi and Plotkin's language. Before we get to that, however, we will need to briefly review a few definitions from restriction category theory. It will also be helpful to consider the previously defined combination of restriction structure and (forward) differential structure.

A restriction category allows one to easily talk about when a map is "less than or equal to" a parallel map and when two parallel maps are "compatible":
Definition 2.4.
Suppose f, g : A → B are maps in a restriction category. Write f ≤ g if f̄ g = f, and write f ∼ g (and say "f is compatible with g") if f̄ g = ḡ f.

That is, f ≤ g if g is defined wherever f is defined, and when restricted to f's domain of definition, g is equal to f; f ∼ g if f and g are equal where they are both defined. One can show that ≤ is a partial order on each hom-set; in fact, restriction categories are canonically partial-order-enriched by ≤. Being able to "join" two compatible maps will be important when we define control structures such as "if" and "while", as we will need to be able to discuss when maps are "disjoint".

Definition 2.5.
1. If f, g : A → B and there is a least upper bound f ∨ g with respect to the partial order defined above, we call f ∨ g the join of f and g. Note that this implies that f and g are compatible.
2. The notion of join extends to families of maps that are pairwise compatible, and we write ∨ᵢ fᵢ to denote the join of the pairwise compatible family.
3. Say that a map ∅ : A → B is nowhere defined if ∅ is the minimum in the partial order.
4. Say that f, g : A → B are disjoint if f̄ ḡ is nowhere defined. Any two disjoint maps are compatible.
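For partial maps on a finite carrier these notions can be computed directly. A small illustrative sketch of ours (partial maps as dicts, with an absent key meaning "undefined"):

```python
# Illustrative sketch (ours): order, compatibility, disjointness, and join
# for partial maps on a finite carrier, modelled as Python dicts.

def leq(f, g):
    """f <= g: g is defined and agrees with f wherever f is defined."""
    return all(x in g and g[x] == f[x] for x in f)

def compatible(f, g):
    """f ~ g: f and g agree on the overlap of their domains."""
    return all(f[x] == g[x] for x in f if x in g)

def disjoint(f, g):
    """f-bar g-bar is nowhere defined: the domains do not meet."""
    return not any(x in g for x in f)

def join(f, g):
    """Least upper bound of a compatible pair: glue the two graphs."""
    assert compatible(f, g)
    return {**f, **g}

evens = {x: x // 2 for x in range(10) if x % 2 == 0}
odds  = {x: 3 * x  for x in range(10) if x % 2 == 1}

print(disjoint(evens, odds))                 # True: disjoint maps are compatible
total = join(evens, odds)
print(leq(evens, total), leq(odds, total))   # True True
```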
The formalization of disjoint joins in a restriction category was given in [18] as part of the story of formalizing Hoare semantics in a classical restriction category. Further analysis of joins in restriction categories was provided in [21]. Giles [31] used disjoint joins in connecting restriction categories to the semantics of reversible computing. Disjoint joins in partial map categories correspond to disjoint joins of monics, which often give a coproduct (e.g. as in coherent categories). One way to model iteration is to have a traced coproduct, and this can be directly expressed using disjoint joins: this approach was used in formalizing iteration in restriction categories and to build a partial combinatory algebra by iterating a step-function in [20, 13]. The formalization of iteration using disjoint joins was based on the work of Conway [22]. Another approach to formalizing the semantics of iterative processes in a category using algebraic formalizations was introduced in [25], refined in [7], and further developed categorically in [3].

Finally, it is worth noting that there has been previous work combining CDC structure with restriction structure [12]. The canonical example of such a category is the category of smooth partial maps between the Rⁿ's. The partiality acts in a compatible way with the derivative, as D[f] : A × A → B is entirely defined in the second (vector) component: that is, the only partiality D[f] has is from f itself. Thus, in a "differential restriction category", one asks that \overline{D[f]} = f̄ × 1. These are formulated on top of the notion of Cartesian left additive restriction category: these are restriction categories with restriction products (a lax notion of product for restriction categories developed in [17]), where each homset is a commutative monoid such that x(f + g) = xf + xg and x0 ≤ 0, and where projections fully preserve addition. The intuition comes from considering partial, smooth functions on open subsets of Rⁿ: not all smooth functions preserve addition, but smooth functions are addable under pointwise addition.

Definition 2.6. ([12, Defn. 3.18]) A differential restriction category is a Cartesian left additive restriction category which has, for each map f : A → B, a map

  D[f] : A × A → B

satisfying various axioms, including [DR.8]: \overline{D[f]} = f̄ × 1. There are nine equational axioms, mirroring the axioms for reverse differential restriction categories given in the sequel.
We are now ready to define the new structure which we will use to model Abadi and Plotkin’s language.
Definition 3.1. A reverse differential restriction category or RDRC is a Cartesian left additive restriction category which has an operation on maps, assigning to each f : A → B a map R[f] : A × B → A, such that:

[RD.1] R[f + g] = R[f] + R[g] and R[0] = 0;
[RD.2] for all a, b, c: ⟨a, b + c⟩ R[f] = ⟨a, b⟩ R[f] + ⟨a, c⟩ R[f] and ⟨a, 0⟩ R[f] = \overline{af} 0;
[RD.3] R[π_j] = π₁ ι_j (where ι₀ = ⟨1, 0⟩ and ι₁ = ⟨0, 1⟩);
[RD.4] R[⟨f, g⟩] = (1 × π₀) R[f] + (1 × π₁) R[g];
[RD.5] R[fg] = ⟨π₀, ⟨π₀ f, π₁⟩ R[g]⟩ R[f];
[RD.6] ⟨1 × π₀, 0 × π₁⟩ (ι₀ × 1) R[R[R[f]]] π₁ = (1 × π₀) R[f];
[RD.7] (ι₀ × 1) R[R[(ι₀ × 1) R[R[f]] π₁]] π₁ = ex (ι₀ × 1) R[R[(ι₀ × 1) R[R[f]] π₁]] π₁ (where ex is the exchange map);
[RD.8] \overline{R[f]} = f̄ × 1;
[RD.9] R[f̄] = (f̄ × 1) π₁.

As noted above, [RD.5] represents the chain rule, while [RD.8] says that the partiality of R[f] is entirely determined by the partiality of f itself, and [RD.9] says how to differentiate restriction idempotents. The other axioms are similar to those for an RDC; for an explanation of what they represent, see the discussion after Definition 13 in [16]. Also note that term logics have been given to simplify reasoning in Cartesian differential categories [8] and differential restriction categories [29]; a term logic for reverse differential restriction categories exists but will not be discussed further here.

Any Fermat theory [24], and more generally any Lawvere theory which is also a Cartesian differential category, can be given the structure of a reverse differential category; in these cases both the forward and reverse derivatives can be pushed down to sums and tuples of derivatives on maps R → R, and here the forward and reverse derivative necessarily coincide. A restriction version of this example is given by considering a topological ring R that satisfies the axiom of determinacy (see [6]); the category whose objects are the powers of R, and whose maps are C^∞-maps that are smooth on restriction to an open set, forms a reverse differential restriction category.
This meta-example includes the category Smooth_P of functions that are smooth on an open subset of Rⁿ. For an example whose objects are not of the form Rⁿ: the coKleisli category of the finite multiset comonad on the category of relations Rel is a Cartesian differential category, and its category of linear maps is Rel (see [9, 8] for details). As Rel is a compact closed self-dual category, and the derivative at a point is linear (hence a map in Rel), one can obtain a reverse derivative on the coKleisli category of the finite multiset comonad on Rel.

Just as with an RDC, we can derive a forward differential restriction structure from a reverse one.
Theorem 3.2.
Every reverse differential restriction category X is a differential restriction category, with the derivative defined as previously (see also [16, Theorem 16]).

Moreover, just as in [16, Theorem 42], one can prove that a DRC with a "contextual linear dagger" is equivalent to an RDRC; however, for space constraints we will not go into full details here. One must first describe fibrations for restriction categories: these were studied by Nester in [37]. One can give a version of the simple fibration for a restriction category, as well as the dual of the simple fibration (this is remarkable, as the dual of a restriction category is not generally a restriction category). Importantly, maps in the simple fibration have their partiality concentrated in the context (their restriction is of the form e × 1 for a restriction idempotent e on the context object). A contextual dagger is an involution of fibrations Lin(X)[X] → Lin(X)[X]*, where Lin(X)[X] denotes a subfibration of the simple fibration consisting of linear maps in context, using the notion of fibration for restriction categories. From a reverse differential restriction category one obtains such an involution of fibrations from (u, f) ↦ (u, (ι₀ × 1) R[f] π₁), and the second component is sometimes written f†[I], where I is the context object. There are a few subtleties that we will also not go further into here.

The reverse derivative automatically preserves the induced partial order (from the restriction structure) and joins, if they exist:

Proposition 3.3. If X is a reverse differential restriction category, then for any f, g : A → B, f ≤ g implies R[f] ≤ R[g]; and if X has joins, then for any pairwise compatible family {fᵢ}, R[∨ᵢ fᵢ] = ∨ᵢ R[fᵢ].

As we shall see, we will not strictly need [RD.6] and [RD.7] to model Abadi and Plotkin's language; thus, we make the following definition:
Definition 3.4. A basic reverse differential restriction category (or basic RDRC) is a structure satisfying all the requirements for an RDRC except [RD.6] and [RD.7].

However, as we discuss in the final section, using axioms [RD.6] and [RD.7] allows one to consider improvements to the operational semantics of the language.
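The intent of [RD.8], that R[f] is undefined exactly where f is, can be illustrated concretely for a map that is smooth on an open subset of R (a toy sketch of ours; the names are hypothetical):

```python
import math

# Illustrative check (ours) of the intent of [RD.8]: the reverse derivative
# R[f] is defined at (x, w) exactly when f is defined at x.  Partiality is
# modelled by returning None.

def f(x):
    """f : R -> R, defined only on the open set x > 0."""
    return [math.sqrt(x[0])] if x[0] > 0 else None

def R_f(x, w):
    """R[f](x, w) = J(f)^T(x) . w, with the same domain as f."""
    if f(x) is None:
        return None           # [RD.8]: the restriction of R[f] is f-bar x 1
    return [w[0] / (2 * math.sqrt(x[0]))]

print(R_f([4.0], [1.0]))   # defined: [0.25]
print(R_f([-1.0], [1.0]))  # None: undefined exactly where f is
```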
We will make use of the language defined by Abadi and Plotkin [2], with one modification up front: we will first consider the language without recursive function definitions and instead with while-loops (called SDPL); after showing the semantics works out, we will then add recursive definitions back in (called SDPL⁺). We remark that the presentation of SDPL given in [2] and followed here is parametrized over a single generating type; however, we can add arbitrary generating types as long as those types have operations that provide the structure of a commutative monoid.

In [2], Abadi and Plotkin remarked that there are two approaches to differentiating over control structures: there are source transformations, used in systems such as TensorFlow [1] and Theano [5], and there is the execution trace method, used in systems such as Autograd [36] and PyTorch [38]. The source transformation method for dealing with derivatives of control structures defines a way to distribute the derivative into control structures; for example,

  ∂(if b then m else n)/∂x

would be replaced by

  if b then ∂m/∂x else ∂n/∂x.

The execution trace method allows defining a symbolic derivative on simpler terms with no control structures or derivatives, and then evaluating a term enough so that no control structures or derivatives are present, allowing a symbolic trace through the derivative. This must be done at runtime: for example, when differentiating over an if-then-else statement we need to know which branch was taken, and once this control structure is eliminated the derivative can be computed on the simpler resultant term. This has the advantage of making it simpler to adapt to derivatives over more subtle structures such as recursive function definitions. Since it is done at runtime, it can be performed as a source transformation by a JIT compiler, ensuring efficiency.

Table 1: Typing rules for SDPL.

  Γ, x : A ⊢ x : A
  r ∈ R  ⟹  Γ ⊢ r : real
  Γ ⊢ m : real and Γ ⊢ n : real  ⟹  Γ ⊢ m + n : real
  Γ ⊢ m : T and op : T → U ∈ Σ  ⟹  Γ ⊢ op(m) : U
  Γ ⊢ m : T and Γ, x : T ⊢ n : U  ⟹  Γ ⊢ let x : T = m in n : U
  Γ ⊢ ∗ : 1
  Γ ⊢ m : U and Γ ⊢ n : T  ⟹  Γ ⊢ (m, n)_{U,T} : U × T
  Γ ⊢ m : U × T  ⟹  Γ ⊢ fst_{U,T}(m) : U
  Γ ⊢ m : U × T  ⟹  Γ ⊢ snd_{U,T}(m) : T
  Γ ⊢ b and Γ ⊢ m : T and Γ ⊢ n : T  ⟹  Γ ⊢ if b then m else n : T
  p : U ⊢ b and p : U ⊢ f : U  ⟹  p : U ⊢ while b do f : U
  Γ, x : U ⊢ m : T and Γ ⊢ a : U and Γ ⊢ v : T  ⟹  Γ ⊢ v.rd(x : U. m)(a) : U
  Γ ⊢ true        Γ ⊢ false
  Γ ⊢ m : U and pred : U ∈ Pred  ⟹  Γ ⊢ pred(m)
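The execution-trace treatment of if-then-else can be sketched in a few lines (a toy illustration of ours, not Abadi and Plotkin's system: terms are tuples, and straight-line branches carry their own derivative):

```python
# Toy illustration (ours) of the execution-trace method: to differentiate
# "if b then m else n" at a concrete input, first evaluate the guard, keep
# only the branch actually taken, then differentiate the residual
# straight-line term.

def d_trace(term, x):
    """Derivative at x of a toy term.  Terms are ('if', guard, m, n) or
    ('line', deriv) for a straight-line term with known derivative."""
    if term[0] == 'if':
        _, guard, m, n = term
        taken = m if guard(x) else n   # runtime: record which branch ran
        return d_trace(taken, x)
    _, deriv = term
    return deriv(x)

# d/dx of: if x > 0 then x*x else -x
relu_sq = ('if', lambda x: x > 0,
           ('line', lambda x: 2.0 * x),   # derivative of x*x
           ('line', lambda x: -1.0))      # derivative of -x

print(d_trace(relu_sq, 3.0), d_trace(relu_sq, -2.0))  # 6.0 -1.0
```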
The types of SDPL are given by the following grammar:

  Ty := real | 1 | Ty × Ty

Powers are assumed to be left-associated, so realⁿ⁺¹ := realⁿ × real. To form the raw terms of SDPL we assume a countable supply of variables, a set of typed operation symbols Σ, and a set of typed predicate symbols Pred. The raw terms are then defined by the following grammar:

  m := x | r (r ∈ R) | m + m | op(m) (op ∈ Σ) | let x : Ty = m in m | ∗ | (m, n)_{Ty,Ty}
     | fst_{Ty,Ty}(m) | snd_{Ty,Ty}(m) | if b then m else n | while b do m | m.rd(x : T. m)(m)

  b := pred(m) (pred ∈ Pred) | true | false

Note that the typing rules will disallow inputs or outputs to come from boolean terms. This is to ensure that all typed terms are differentiable with respect to every argument. The typing rules for SDPL are given in Table 1. In the typing rules, Γ is assumed to be a list of typed variables Γ = [xᵢ : Aᵢ] for i = 1, …, n, where Aᵢ ∈ Ty. Free variables are defined in the usual way; note that let expressions bind the variable x, and when forming the reverse differential term v.rd(x : U. m)(a) the variable x is also bound. The reverse differential expression may be read as "the reverse differential of m with respect to x, evaluated at the point a, in the direction v."
Let X be a basic reverse differential restriction category with countable joins of disjoint maps. An interpretation structure for SDPL into X is given by a tuple of structures

  (A ∈ X, (1 --a_r--> A)_{r ∈ R}, ⟦−⟧, ⟦−⟧_T, ⟦−⟧_F),

and we extend such a structure to an interpretation of all the terms of SDPL as explained below. We must first interpret types, and to begin we need an object A from X to carry our signatures. We also require that A has a point a_r : 1 → A for each element r ∈ R, since we must interpret the R constants which are part of SDPL. (Footnote: it is not strictly necessary that SDPL contain a constant for every r ∈ R; as long as we include 0, we could require only the constants that we actually use, such as the computable reals.) With such an A we define an interpretation of types:

  ⟦1⟧ := 1    ⟦real⟧ := A    ⟦T × U⟧ := ⟦T⟧ × ⟦U⟧

We extend the interpretation to contexts:

  ⟦·⟧ := 1    ⟦x : U⟧ := ⟦U⟧    ⟦Γ, x : U⟧ := ⟦Γ⟧ × ⟦U⟧

We also require an interpretation of each operation symbol op : T → U ∈ Σ of the correct type: ⟦op⟧ : ⟦T⟧ → ⟦U⟧. We additionally require two interpretations of each predicate symbol pred : U ∈ Pred:

  ⟦pred⟧_T : ⟦U⟧ → 1    ⟦pred⟧_F : ⟦U⟧ → 1    with    \overline{⟦pred⟧_T} \overline{⟦pred⟧_F} = ∅.

To summarize:

  Σ(U, T) --⟦−⟧--> X(⟦U⟧, ⟦T⟧)    Pred(U) --⟦−⟧_T, ⟦−⟧_F--> X(⟦U⟧, 1)

The intent of giving two interpretations of predicate symbols is that we must give an interpretation of the "true" part of the predicate and of the "false" part. In [2] an interpretation of predicate symbols is given as maps ⟦U⟧ → {true, false} with the property that the preimages of both true and false are open. This necessarily makes the interpretation of a predicate partial or trivial; moreover, it is equivalent to giving an interpretation of predicate symbols into disjoint open sets of ⟦U⟧, which is again equivalent to giving an interpretation into disjoint predicates on ⟦U⟧. A way around this non-standard interpretation of predicates is given by taking the manifold completion [33, 15] of the model, noting that the coproduct 1 + 1 is then available.

Projections:
  • ⟦x : U ⊢ x : U⟧ := 1_{⟦U⟧};
  • ⟦Γ, x : U ⊢ x : U⟧ := ⟦Γ⟧ × ⟦U⟧ --π₁--> ⟦U⟧;
  • ⟦Γ, y : U ⊢ x : T⟧ := ⟦Γ⟧ × ⟦U⟧ --π₀--> ⟦Γ⟧ --⟦Γ ⊢ x : T⟧--> ⟦T⟧.

Real operations:
  • ⟦Γ ⊢ 0 : real⟧ := ⟦Γ⟧ --0--> A, and for the other elements r ∈ R,
    ⟦Γ ⊢ r : real⟧ := ⟦Γ⟧ --!--> 1 --a_r--> A = ⟦real⟧;
  • ⟦Γ ⊢ m + n : real⟧ := ⟦Γ⟧ --⟦Γ ⊢ m : real⟧ + ⟦Γ ⊢ n : real⟧--> ⟦real⟧.

Operation terms: given op : T → U ∈ Σ,

  ⟦Γ ⊢ op(m) : U⟧ := ⟦Γ⟧ --⟦Γ ⊢ m : T⟧--> ⟦T⟧ --⟦op⟧--> ⟦U⟧.

Let:

  ⟦Γ ⊢ let x : T = m in n : U⟧ := ⟦Γ⟧ --⟨1, ⟦Γ ⊢ m : T⟧⟩--> ⟦Γ⟧ × ⟦T⟧ --⟦Γ, x : T ⊢ n : U⟧--> ⟦U⟧.

Product terms:
  • ⟦Γ ⊢ ∗ : 1⟧ := ⟦Γ⟧ --!--> 1;
  • ⟦Γ ⊢ (m, n)_{A,B} : A × B⟧ := ⟦Γ⟧ --⟨⟦Γ ⊢ m : A⟧, ⟦Γ ⊢ n : B⟧⟩--> ⟦A⟧ × ⟦B⟧;
  • ⟦Γ ⊢ fst_{A,B}(m) : A⟧ := ⟦Γ⟧ --⟦Γ ⊢ m : A × B⟧--> ⟦A⟧ × ⟦B⟧ --π₀--> ⟦A⟧;
  • ⟦Γ ⊢ snd_{A,B}(m) : B⟧ := ⟦Γ⟧ --⟦Γ ⊢ m : A × B⟧--> ⟦A⟧ × ⟦B⟧ --π₁--> ⟦B⟧.

Control structures:

  ⟦Γ ⊢ if b then m else n : U⟧ := \overline{⟦Γ ⊢ b⟧_T} ⟦Γ ⊢ m : U⟧ ∨ \overline{⟦Γ ⊢ b⟧_F} ⟦Γ ⊢ n : U⟧

  ⟦p : A ⊢ while b do m : A⟧ := ∨_{i=0}^∞ ( \overline{⟦p : A ⊢ b⟧_T} ⟦p : A ⊢ m : A⟧ )^i \overline{⟦p : A ⊢ b⟧_F}

Reverse derivatives:

  ⟦Γ ⊢ v.rd(x : T. m)(a) : T⟧ := ⟦Γ⟧ --⟨⟨1, ⟦Γ ⊢ a : T⟧⟩, ⟦Γ ⊢ v : U⟧⟩--> (⟦Γ⟧ × ⟦T⟧) × ⟦U⟧ --R[⟦Γ, x : T ⊢ m : U⟧]--> ⟦Γ⟧ × ⟦T⟧ --π₁--> ⟦T⟧

Boolean terms: ⟦Γ ⊢ true⟧_T := ⟦Γ⟧ --!--> 1 and ⟦Γ ⊢ true⟧_F := ∅; likewise, ⟦Γ ⊢ false⟧_T := ∅ and ⟦Γ ⊢ false⟧_F := !. Finally, for any pred ∈ Pred(A):

  ⟦Γ ⊢ pred(m)⟧_H := ⟦Γ⟧ --⟦m⟧--> ⟦A⟧ --⟦pred⟧_H--> 1,    where H ranges over {T, F}.

For a brief explanation of the interpretation of while-loops: for f : A → A we set f⁰ = id and f^{n+1} = f f^n. Then our interpretation says either the guard was false, or it was true and we executed m and then it was false, or it was true and we executed m and it was still true and we executed m again and then it was false, and so on. This yields

  ⟦while b do m⟧ = \overline{⟦b⟧_F} ∨ \overline{⟦b⟧_T} ⟦m⟧ \overline{⟦b⟧_F} ∨ \overline{⟦b⟧_T} ⟦m⟧ \overline{⟦b⟧_T} ⟦m⟧ \overline{⟦b⟧_F} ∨ ···

In this section we show that the interpretation above always soundly models source-code transformations for differentiating if-then-else statements and while-loops.
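The join formula for while-loops can be read operationally: the i-th disjunct is defined exactly on inputs that exit the loop after i iterations of the body, and the join glues these disjoint pieces into one partial map. A small sketch of ours, with partiality modelled by None:

```python
# Illustrative sketch (ours): the while-loop semantics as a join of disjoint
# pieces, compared against an ordinary while loop.

def piece(b, m, i, x):
    """(b_T-bar m)^i b_F-bar at x: run m exactly i times, defined only if
    the guard held before each run and fails at the end."""
    for _ in range(i):
        if not b(x):
            return None       # guard failed too early: this piece is undefined
        x = m(x)
    return None if b(x) else x

def while_join(b, m, x, fuel=1000):
    """Join over i of the pieces; None approximates divergence."""
    for i in range(fuel):
        y = piece(b, m, i, x)
        if y is not None:
            return y
    return None

b = lambda x: x < 100         # guard
m = lambda x: x * 2           # body

direct = 7
while b(direct):
    direct = m(direct)

print(while_join(b, m, 7), direct)  # both 112
```

Note that for a fixed input exactly one piece is defined (here i = 4), which is why the family is pairwise disjoint and the join exists.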
Proposition 4.1.
In an interpretation structure on a basic RDRC, for any terms Γ, x : U ⊢ m : T, Γ, x : U ⊢ n : T, Γ ⊢ a : U, and Γ ⊢ v : T, and for any predicate Γ, x : U ⊢ b, we have

  ⟦Γ ⊢ v.rd(x : U. if b then m else n)(a)⟧ = ⟦Γ ⊢ if (let x = a in b) then v.rd(x : U. m)(a) else v.rd(x : U. n)(a)⟧

Corollary 4.2 (If-then-else transformation). In an interpretation structure on a basic RDRC, we always have

  ⟦Γ, x : U ⊢ v.rd(x : U. if b then m else n)(x)⟧ = ⟦Γ, x : U ⊢ if b then v.rd(x : U. m)(x) else v.rd(x : U. n)(x)⟧

Turning to iteration: if a while-loop terminates, then while b do f is f^n for some n. The forward derivative admits a tail-recursive description:

  D[f^{n+1}] = ⟨π₀ f, D[f]⟩ D[f^n]

SDPL has two admissible operations, dagger and forward differentiation:

  m†[Γ] := y.rd(x. m)(0)
  fd(x. m)(a). v := let z = v in (y.rd(x. m)(a))†[Γ]

where the y in m†[Γ] is fresh. The recursive description of D[f^n] is useful in proving the following:

Proposition 4.3 (Forward differentiation for while-loops). In an interpretation structure on a basic RDRC:

1. For any Γ, x : A ⊢ m : B,

  ⟦fd(x. m)(a). v⟧ = ⟨⟨1, ⟦a⟧⟩, ⟦v⟧⟩ (1 × ι₁) D[⟦m⟧]
2. For any x : A ⊢ f : A we have

  ⟦⊢ fd(x. while b do f)(a). v⟧ = ⟦⊢ let x = a, y = v in snd(while π₀ b do (π₀ f, fd(x. f)(x). y))⟧

On the other hand, the reverse derivative satisfies

  R[f^n](a, b) = R[f](a, R[f](f(a), R[f](f(f(a)), ···, b)))

which at first glance looks head-recursive, and not like something that can be implemented by an iteration. However, with [RD.6], we can do the following:

  R[f^{n+1}] = D[f^{n+1}]†[A] = (T(f)^n D[f])†[A]    where    T(f) = ⟨π₀ f, D[f]⟩

This is the basis of the following source transformation for while-loops.
Corollary 4.4 (Reverse differentiation of while-loops). In an interpretation structure on an RDRC, let z:A ⊢ f:A and z:A ⊢ b; then we have
  ⟦v:A ⊢ v.rd(x. while b do f)(a)⟧ = ⟦v:A ⊢ (let x = a, y = v in snd(while π₀ b do (π₀ f, fd(x.f)(x).y)))†[·]⟧
where †[·] denotes the dagger defined above with respect to the empty context.
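The loop transformations above can be sketched numerically. The following Python model (our own illustration, with a hand-written body f and hand-written forward and reverse derivatives for it) runs the forward derivative of a while-loop as a loop on pairs carrying the tangent, and the reverse derivative by folding R[f] back over the stored trajectory, mirroring the head-recursive formula for R[fⁿ].

```python
def f(x):
    return 0.5 * x           # loop body (an assumption for this sketch)

def df(x, v):
    return 0.5 * v           # D[f]: hand-written forward derivative

def rf(x, w):
    return 0.5 * w           # R[f]: hand-written reverse derivative

def b(x):
    return x > 1.0           # loop guard

def fd_while(a, v):
    """Forward derivative of the loop: (x, y) ← (f(x), D[f](x)·y) while b(x)."""
    x, y = a, v
    while b(x):
        x, y = f(x), df(x, y)
    return y

def rd_while(a, w):
    """Reverse derivative: R[fⁿ](a, w) = R[f](a, R[f](f(a), …, w))."""
    xs, x = [], a
    while b(x):              # record the trajectory (the loop unrolled)
        xs.append(x)
        x = f(x)
    for x in reversed(xs):   # fold R[f] backwards over it
        w = rf(x, w)
    return w

print(fd_while(8.0, 1.0), rd_while(8.0, 1.0))  # both 0.125 = (1/2)³
```

With the linear body chosen here, the forward and reverse passes agree exactly; for a nonlinear body the reverse pass needs the stored trajectory, which is exactly why the tail-recursive reformulation via [RD.6] matters.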
Until now we have discussed the semantics of a fragment of the language described in [2]. We formally extended their language with while-loops to isolate their behaviour, but omitted recursive function definitions. Given general recursion, one can implement loops using tail recursion. We now move on to their full language with recursive function definitions, which we call SDPL+. To give such an extension, we introduce two new raw terms:
  m := ... as before | f(m) | letrec f(x) := m in n
In the above, when we form f(a), the symbol f is taken to be a free function variable, and the term letrec f(x) := m in n binds the variable x in m and the function variable f in m and n. However, these function variables are of a different sort than ordinary variables because they have arity. That is, f(a) only makes sense if a : B and f has arity B → C, which we write as f : B → C. Thus, our typing/term-formation rules have two sorts of contexts, one to record function names and the other for ordinary variables. Our terms in context then have the form Φ | Γ ⊢ m : B, and to update the rules from before, we just add Φ to all the contexts. The two new rules are:

  Φ, f : A → B | Γ ⊢ m : A
  ------------------------------
  Φ, f : A → B | Γ ⊢ f(m) : B

  Φ, f : A → B | x : A ⊢ m : B    Φ, f : A → B | Γ ⊢ n : C
  ----------------------------------------------------------
  Φ | Γ ⊢ letrec f(x) := m in n : C

We will now give the interpretation of recursive definitions and calls in a basic reverse differential join restriction category. But first, we review a bit of basic intuition from recursive function theory, in case the reader is unfamiliar. We often write computable functions f : A → B as f(n) := m, but it is usually helpful to think of f as simply a name for the unnamed function λn.m, and then write f = λn.m. The idea is that, as a computation, f has an internal representation that uses the variable n somewhere. If f is recursive, then the symbol f also appears in m, and thus f is a function that depends on itself. To break this cycle we abstract out the symbol f too: we write f̲ := λf.λn.m. This creates a function f̲ : Fun(A, B) → Fun(A, B); f̲ takes an arbitrary computable function h : A → B and creates the function that uses h anywhere f was used in the body m. To give a quick example, consider the computable function
  fac(n) := if n < 1 then 1 else n ∗ fac(n − 1).
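The fixed-point view of fac can be sketched in Python (our own illustration: fac_ names the non-recursive "underlined" functional, and partial functions are modelled as Python functions returning None when undefined).

```python
def fac_(h):
    """The non-recursive functional: the recursive call is replaced by h."""
    def g(n):
        if n < 1:
            return 1
        r = h(n - 1)
        return None if r is None else n * r
    return g

import math
print(fac_(math.factorial)(5))   # 120: factorial is a fixed point of fac_

bottom = lambda n: None          # the nowhere-defined partial function ⊥
h = bottom
for _ in range(10):              # Kleene chain ⊥ ≤ fac_(⊥) ≤ fac_(fac_(⊥)) ≤ ...
    h = fac_(h)                  # each step extends the domain by one input
print([h(n) for n in range(6)])  # [1, 1, 2, 6, 24, 120]
```

After ten steps the approximant is defined on inputs up to 9 and agrees with factorial there; its join over all steps is the least fixed point.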
Then
  fac̲(h)(n) := if n < 1 then 1 else n ∗ h(n − 1)
The point is that this new function is not recursive. However, it is instructive to see what happens when we apply it to the function it represents. As an exercise, we leave it to the reader to prove that
  fac̲(fac) = fac
In other words, the recursive function fac is a fixed point of the functional fac̲. It is also the best fixed point of fac̲, in the sense that it is the least defined function that is a fixed point of fac̲. This works in general: any recursive function r may be obtained as the least fixed point of r̲.

To model least-fixed-point phenomena we will use the notion of a pointed directed complete partial order, or DCPPO for short. The first use of DCPPOs to model recursive phenomena is due to Scott [41], in giving models of the untyped λ-calculus. DCPPOs are used in the semantics of the functional programming language PCF in [40]. Abstract DCPPO-enriched categories of partial maps were used in modelling the semantics of the functional programming language FPC in [27]. The DCPPO structure on the homsets of Smooth P was used in [2] to provide a semantics for SDPL. The approach taken here generalizes [2] to an arbitrary basic reverse differential join restriction category, highlights the structural aspects of the interpretation, and uses the axioms of such a category to derive some simplifications to the operational behaviour. A connection between ω-CPPOs and restriction categories was introduced using the delay monad in [44].

Definition 4.5.
Let (D, ≤) be a partial order. A subset A ⊆ D is directed if A is nonempty and any two elements f, g ∈ A have an upper bound in A; i.e., there is an h ∈ A with f, g ≤ h. A partial order (D, ≤) is a directed complete partial order if every directed subset A has a supremum, written ⋁_{a∈A} a ∈ D. A directed complete partial order is pointed (a DCPPO) if there is a supremum for the empty set, that is, a minimal element ∅ ≤ d for all d ∈ D.

By a morphism of DCPPOs g : (P, ≤) → (Q, ≤) we mean a function g on the underlying sets that is monotone and preserves suprema. We observe minimally that the category of DCPPOs is Cartesian closed.

Lemma 4.6 ([42]). Let (D, ≤) be a DCPPO. Then:
1. Every morphism g : D → D has a least fixed point; i.e., a u ∈ D such that g(u) = u.
2. For any other DCPPO (P, ≤), every morphism g : P × D → D has a parametrized fixed point; i.e., a u : P → D such that ⟨1, u⟩ g = u. In other words, for each x ∈ P, u(x) is a fixed point of g(x, −). This parametrized fixed point is often denoted µy.g(−, y), and the fixed point of g(x, −) by µy.g(x, y).

Join restriction categories are DCPPO enriched:
Proposition 4.7.
Let X be a restriction category. Then, with respect to the order enrichment of restriction categories:
1. If X is a join restriction category, then the enrichment lies in DCPPOs.
2. If X has joins and restriction products, then those products are DCPPO-enriched products; i.e., X(A, B × C) ≃ X(A, B) × X(A, C) qua an isomorphism of DCPPOs. Moreover, the "contraction operator" ∆_A : X(A, B) → X(A, A × B) that sends f : A → B to ⟨1, f⟩ : A → A × B is a morphism of DCPPOs.
3. If X has joins and is a Cartesian left additive restriction category, then the addition on homsets + : X(A, B) × X(A, B) → X(A, B) is a morphism of DCPPOs.
4. If X is a reverse differential join restriction category, then the operation of reverse differentiation R[−] : X(A, B) → X(A × B, A) is a morphism of DCPPOs.

The parts of Proposition 4.7 together imply that certain operations we will need to form from monotone, join-preserving maps are again monotone and join-preserving.

To give the categorical semantics of SDPL+, we must extend the interpretation developed in Section 4.2. We begin with the interpretation of function contexts. Since the idea is that a free function symbol could be any map of the correct type, the interpretation of function contexts is given as a product of homsets:
  ⟦∅⟧ := 1        ⟦Φ, f : A → B⟧ := ⟦Φ⟧ × X(⟦A⟧, ⟦B⟧)
The interpretation of a term in context Γ ⊢ m : B constructs a map ⟦m⟧ : ⟦Γ⟧ → ⟦B⟧. With function contexts, the maps we build now depend on a morphism from X(⟦A⟧, ⟦B⟧) to fill in each call to a function. That is, the interpretation is now a function
  ⟦Φ | Γ ⊢ m : B⟧ : ⟦Φ⟧ → X(⟦Γ⟧, ⟦B⟧)
Since we are building a function, it suffices to build a map in X(⟦Γ⟧, ⟦B⟧) for each element φ ∈ ⟦Φ⟧. We write ⟦m⟧_φ for the value of ⟦m⟧ at φ. The construction is by induction, and for the terms from SDPL it is exactly the same as before, with a φ subscript decorating the terms appropriately. For example, ⟦Φ | Γ ⊢ let x = m in n⟧_φ := ⟨1, ⟦m⟧_φ⟩ ⟦n⟧_φ. However, we can also build the interpretation entirely using external structure, again by induction. For example, the interpretation of let x = m in n may be given using the "contraction operator", and this construction is element-free:
  ⟦Φ⟧ --⟨⟦m⟧, ⟦n⟧⟩--> X(⟦Γ⟧, ⟦A⟧) × X(⟦Γ⟧×⟦A⟧, ⟦B⟧) --∆_Γ × 1--> X(⟦Γ⟧, ⟦Γ⟧×⟦A⟧) × X(⟦Γ⟧×⟦A⟧, ⟦B⟧) --∘--> X(⟦Γ⟧, ⟦B⟧)
and this composite is ⟦let x = m in n⟧. We leave it to the reader to construct the interpretation of v.rd(x.m)(a) using a similar idea, as well as the reverse differential operator R[−] : X(A, B) → X(A × B, A).

For SDPL+, we extend this to the two new terms. Given a function context Φ = (f₁, ..., fₙ), for any φ ∈ ⟦Φ⟧ we have that φ = (φ₁, ..., φₙ) are all maps in X: if fᵢ : Aᵢ → Bᵢ then φᵢ : ⟦Aᵢ⟧ → ⟦Bᵢ⟧. We will write φ(fᵢ) to denote φᵢ. We also make use of the "no-free-variable" assumption for recursive definitions; that is, in the type-formation rule for recursive definitions letrec f(x) := m in n, m must have at most a unique free variable, and it must be x.

Fun-Call:
  ⟦Φ, f : A → B | Γ ⊢ f(m) : B⟧_φ := ⟦Γ⟧ --⟦m⟧_φ--> ⟦A⟧ --φ(f)--> ⟦B⟧

Rec-Def:
First note that if we just translate a simple recursive function letrec f(x) := m, we see that x is a free variable and f is a free function variable in m; that is, we have f : A → B | x : A ⊢ m : B. Then note that the interpretation we are developing would interpret m as a function
  ⟦m⟧ : X(⟦A⟧, ⟦B⟧) → X(⟦A⟧, ⟦B⟧)
This is exactly the sort of underlined function we looked at earlier: it takes each h : ⟦A⟧ → ⟦B⟧ in X and, via the above translation of function calls, uses it anywhere that f was used in m. Then, by Lemma 4.6, we may take its least fixed point µ. We thus get a map µ : ⟦A⟧ → ⟦B⟧ such that ⟦m⟧(µ) = µ, and it is the least defined such map, giving us the interpretation of the recursive function; we would write ⟦letrec f(x) := m⟧ = µ. More generally, in m the unique-variable condition only applies to ordinary variables, but m could have multiple function variables. Then if we translate Φ, f : A → B | x : A ⊢ m : B we get a map
  ⟦m⟧ : ⟦Φ⟧ × X(⟦A⟧, ⟦B⟧) → X(⟦A⟧, ⟦B⟧)
We may then apply the second part of Lemma 4.6 and obtain a parametrized fixed point
  µf.⟦m⟧(−, f) : ⟦Φ⟧ → X(⟦A⟧, ⟦B⟧)
Likewise, if we translate Φ, f : A → B | Γ ⊢ n : C, we get a map
  ⟦n⟧ : ⟦Φ⟧ × X(⟦A⟧, ⟦B⟧) → X(⟦Γ⟧, ⟦C⟧)
Then, finally, the interpretation of letrec f(x) := m in n is the composite
  ⟦Φ⟧ --⟨1, µf.⟦m⟧(−, f)⟩--> ⟦Φ⟧ × X(⟦A⟧, ⟦B⟧) --⟦n⟧--> X(⟦Γ⟧, ⟦C⟧)
We may also define it componentwise as
  ⟦letrec f(x) := m in n⟧_φ := ⟦n⟧(φ, µf.⟦m⟧(φ, f))
Note that the above definition is only well-defined if we can prove that the interpretation ⟦m⟧ : ⟦Φ⟧ → X(⟦Γ⟧, ⟦B⟧) always yields a monotone and join-preserving function between the DCPPOs, so that the use of Lemma 4.6 in the last step is justified.

Proposition 4.8.
Let X be a basic reverse differential join restriction category, with a specified interpretation structure for SDPL+. Then the interpretation of terms in context is always a monotone, join-preserving function between the DCPPOs. In particular, the construction is well-defined.

The operational semantics used by [2] defined a sublanguage of the raw terms called trace terms. These are generated by the following grammar:
  tr := x | r (r ∈ R) | op(tr) | let x = tr in tr | ∗ | (tr, tr) | fst(tr) | snd(tr)
Abadi and Plotkin also defined a sublanguage of trace terms called values:
  v := x | r (r ∈ R) | ∗ | (v, v)        v_bool := true | false
The operational semantics of a program then consists of two mutually inductively defined reductions: symbolic evaluation and ordinary evaluation; the former yields a trace term and the latter yields a value. The main idea is that, to evaluate a term, when you hit a reverse differential v.rd(x.f)(a), you evaluate f symbolically, just enough to remove control structures and derivatives, giving a trace term. This trace term is then differentiated symbolically, yielding a trace term, and the evaluation continues.

Note that defining symbolic reverse differentiation does not require any evaluation functions. However, we do at this point require, as [2] did, that for each function symbol op ∈ Σ(T, U) there is an associated function symbol op_R ∈ Σ(T × U, T). The idea is that op_R is the reverse derivative of op. We will write v.op_R(a) as notation for op_R(a, v). Then define symbolic reverse differentiation w.R(x.f)(a) by induction over trace terms f, where w and a are values:

  w.R(x. y)(a) = w if x = y, and 0 if x ≠ y
  w.R(x. r)(a) = 0   (r ∈ R)
  w.R(x. m + n)(a) = w.R(x.m)(a) + w.R(x.n)(a)
  w.R(x. op(m))(a) = let x = a, t = w.op_R(m) in t.R(x.m)(a)   (t fresh)
  w.R(x. let y = d in e)(a) = let x = a, y = d in w.R(x.e)(a) + (let t = w.R(y.e)(y) in t.R(x.d)(a))   (t fresh)
  w.R(x. ∗)(a) = 0
  w.R(x. (u, v))(a) = let (y, z) = w in y.R(x.u)(a) + z.R(x.v)(a)
  w.R(x. fst(m))(a) = let x = a, t = m in (w, 0).R(x.m)(a)   (t fresh)
  w.R(x. snd(m))(a) = let x = a, t = m in (0, w).R(x.m)(a)   (t fresh)

The let rule is again the chain rule, but for differentiating with respect to the two-variable function Γ, x, y ⊢ n, so that we get the usual rule ∂n/∂t = ∂n/∂x · ∂x/∂t + ∂n/∂y · ∂y/∂t, appropriately reversed. Also, for the projection rules, one might have expected just (w, 0).R(x.m)(a). Under interpretation we certainly get a term of the form R[aπ₀] = (1 × ι₀)R[a]; however, by [RD.8], (ā × 1)R[a] = R[a]. We will see below that if our evaluation satisfies a certain property, then the simpler translation is warranted.

Then, as long as our interpretation always sends op_R to the reverse derivative of op, symbolic and formal reverse differentiation agree under interpretation.

Proposition 4.9 (Symbolic differentiation correctness). Suppose X is a basic reverse differential join restriction category, and suppose that we have a fixed interpretation of SDPL into X for which ⟦op_R⟧ = R[⟦op⟧]. Then for all values a, v and for all trace terms m,
  ⟦v.rd(x.m)(a)⟧ = ⟦v.R(x.m)(a)⟧
We have an analogous proposition for the interpretation of all of
SDPL+.

Proposition 4.10 (Symbolic differentiation correctness, extended). Suppose X is a basic reverse differential join restriction category, and that we have a fixed interpretation of SDPL+ into X for which ⟦op_R⟧ = R[⟦op⟧]. Then for all values a, v and for all trace terms m:
  ⟦v.rd(x.m)(a)⟧_φ = ⟦v.R(x.m)(a)⟧_φ

We then define the operational semantics of SDPL exactly as done by Abadi and Plotkin [2]: an operational structure is given by (ev, bev, R), where
  ev_{T,U} : Σ(T, U) × v_T → v_U        bev_T : Pred(T) × v_T → val_bool
are partial functions. We denote by v_Y the closed value terms of type Y, and by val_bool the set of closed v_bool (these sets are precisely those that require formation in an empty context, ⊢ m : A and ⊢ b). Further,
  R : Σ(T, U) → Σ(T × U, T),  op ↦ op_R
With these three pieces one may define ordinary reduction ⇒ from terms to values and symbolic reduction ⇝ from terms to trace terms by induction; see [2] for details. For SDPL these reduction relations are formulated with respect to a value environment: this is a mapping of variable names to closed value terms. For
SDPL+ we also require a function environment: this is a mapping ϕ of function names to closures. A closure is a tuple (ϕ, f, x, m) where m has at most the free ordinary variable x and, additionally, all the free function variables in m except f are in the domain of ϕ. The idea is that closures are created when evaluating letrec f(x) := m in n: if our current function environment is ϕ, we extend it with (ϕ, f, x, m) and continue evaluating n. This way, if n calls f then the definition of f can be looked up in the function environment, and any symbol that the body of f requires to operate will be there too.

An interpretation structure (A ∈ X, (r : 1 → A)_{r∈R}, ⟦−⟧, ⟦−⟧_T, ⟦−⟧_F) is a differentially denotational interpretation structure when:
1. For all closed value terms v we have that ⟦v⟧ : 1 → ⟦A⟧ is a total point of ⟦A⟧;
2. For all op ∈ Σ we have R[⟦op⟧] = ⟦op_R⟧;
3. For all closed value terms v ∈ v_A, the triangles
  ⟦ev(op, v)⟧ = ⟦v⟧ ⟦op⟧ : 1 → ⟦B⟧        ⟦bev(pred, v)⟧_H = ⟦v⟧ ⟦pred⟧_H
commute, where H is either T or F. In particular, both sides may be undefined, but they must be undefined simultaneously.

The idea behind showing that a denotational semantics captures a language's operational semantics is that if m ⇒ v then ⟦m⟧ = ⟦v⟧. However, the operational semantics for SDPL and
SDPL+ is defined with respect to value and function environments, and we have two operational relations. Interpreting a term m with free variables x₁, ..., xₙ in a value environment {xᵢ := vᵢ}_{1≤i≤n} is straightforward: since each vᵢ is a closed term, first interpret m as above, ⟦m⟧ : ⟦Γ⟧ → ⟦B⟧, and then precompose with the point ⟨⟦vᵢ⟧⟩_{i≤n} : 1 → ⟦Γ⟧. Next we need the following lemma:

Lemma 5.1.
The interpretation of terms of
SDPL+ extends to allow the construction of an element of ⟦Φ⟧ for each function environment ϕ whose domain is Φ.
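Returning to the symbolic reverse-differentiation rules, a minimal executable model can make them concrete. The sketch below (our own illustration, not the paper's operational semantics) covers only variables, constants, unary operation symbols with a chosen op_R, and let; derivatives are evaluated numerically in an environment rather than emitted as trace terms, to keep the code short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var: name: str
@dataclass(frozen=True)
class Const: val: float
@dataclass(frozen=True)
class Op: name: str; arg: object
@dataclass(frozen=True)
class Let: x: str; d: object; e: object

# Hypothetical signature: each op with its chosen reverse derivative op_R
OPS   = {'square': lambda u: u * u,  'double': lambda u: 2 * u}
OPS_R = {'square': lambda u, w: 2 * u * w,  'double': lambda u, w: 2 * w}

def ev(t, env):
    if isinstance(t, Var):   return env[t.name]
    if isinstance(t, Const): return t.val
    if isinstance(t, Op):    return OPS[t.name](ev(t.arg, env))
    if isinstance(t, Let):   return ev(t.e, {**env, t.x: ev(t.d, env)})

def rd(w, x, t, env):
    """w.R(x.t)(a), with the point a supplied through env."""
    if isinstance(t, Var):   return w if t.name == x else 0.0
    if isinstance(t, Const): return 0.0
    if isinstance(t, Op):    # chain rule through the operation symbol
        u = ev(t.arg, env)
        return rd(OPS_R[t.name](u, w), x, t.arg, env)
    if isinstance(t, Let):   # the reversed two-variable chain rule
        env2 = {**env, t.x: ev(t.d, env)}
        direct  = rd(w, x, t.e, env2)                    # w.R(x.e)
        through = rd(rd(w, t.x, t.e, env2), x, t.d, env) # via y = d
        return direct + through

# m = let y = square(x) in double(y), i.e. 2x²; derivative 4x
m = Let('y', Op('square', Var('x')), Op('double', Var('y')))
print(rd(1.0, 'x', m, {'x': 3.0}))   # 4·3 = 12.0
```

The Let branch is exactly the let rule above: one summand differentiates the body with respect to x directly, the other goes through the bound variable.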
Note that any trace term c always fully evaluates: it requires no function context because it has no function symbols, and for any value environment ρ we have c ⇒ v for some closed value term v. The goal is then to prove the following theorem by mutual induction: for any term m, any value environment ρ, and any function environment ϕ, we have that if m ⇝ c ⇒ v then ⟦m⟧ = ⟦c⟧ = ⟦v⟧.

In this section we describe additional properties our categorical semantics has that may lead to a more refined operational semantics. The compatibility between differentiation and restriction, [RD.8,9], states essentially that the definedness of the reverse derivative of a term is completely determined by the term itself. This is relevant to a more efficient semantics: the operational semantics used here has the property that when taking the reverse derivative over looping or recursive constructs, we first build a trace term, which turns out to be a (long) series of let expressions describing the evolution of the state of the computation. We then symbolically differentiate these let expressions, which always results in the creation of a sum of two expressions for each such let expression; and the number of let expressions created by recursion or looping is the number of times that the function recursed or the number of times the loop ran. Thus we quickly get wide trees of sums of symbolic terms that need to be evaluated. However, at each step of this process, one of these terms is of the form v.rd(x.m)(a) where x does not occur freely in m, and hence can be proven to always evaluate to 0 if it evaluates to anything. Our semantics has the following property:

Lemma 6.1.
For any term m in which x does not occur,
  ⟦v.rd(x.m)(a)⟧ = \overline{⟨1, ⟨⟦a⟧, ⟦v⟧⟩⟩ ⟦m⟧} 0

Lemma 6.2.
If we add the rule
  x ∉ fv(e)  ⇒  w.R(x. let y = d in e)(a) := let x = a, y = d, t = w.R(y.e)(y) in t.R(x.d)(a)
then Propositions 4.9 and 4.10 still hold.
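The effect of this extra rule on term growth can be sketched with a toy count (our own model, not the paper's precise cost measure) of the derivative subterms generated when symbolically differentiating a chain of k nested lets, let x₁ = f(x₀) in let x₂ = f(x₁) in ... in x_k, with respect to x₀.

```python
def naive(k):
    """The general let rule always spawns two recursive differentiations
    (the live branch through the bound variable, and the dead branch in
    which x no longer occurs), plus one leaf term."""
    return 1 if k == 0 else 2 * naive(k - 1) + 1

def optimized(k):
    """With the x ∉ fv(e) rule, the dead branch is dropped."""
    return 1 if k == 0 else optimized(k - 1) + 1

print(naive(10), optimized(10))   # 2047 11
```

The naive count satisfies a(k) = 2a(k−1) + 1, i.e. 2^{k+1} − 1, while the optimized count is linear in k, which is the exponential speedup claimed below.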
This gives an operational semantics in which differentiating over looping constructs has no branching blowup, and hence enjoys an exponential speedup. Reverse differential restriction categories, as we have seen earlier, allow forming a forward derivative from the reverse derivative. They also allow forming a reverse derivative from that forward derivative. In a reverse differential restriction category, [RD.6] is equivalent to the requirement that the process of going from a reverse derivative to a forward derivative and then back to a reverse derivative gives exactly the starting reverse derivative.
Lemma 6.3.
For any map f : A × B → C define a map f†[A] := (ι₀ × 1) R[f] π₁ : A × C → B. We always get a forward derivative as D[f] := R[f]†[A]. Then [RD.6] is equivalent to requiring that D[f]†[A] = R[f].

This kind of coherence for defining forward derivatives from their reverse could be useful when using the forward derivative and then converting back by daggering the result.
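Lemma 6.3 can be checked on a one-dimensional example with hand-written reverse derivatives (our own illustration, under the assumption f(x) = x³; in one dimension R[f](a, w) = f′(a)·w).

```python
Rf = lambda a, w: 3 * a * a * w            # R[f] for f(x) = x³
# Hand-derived reverse derivative of Rf itself: the gradient of Rf at (a, w),
# weighted by a cotangent c.
R_Rf = lambda a, w, c: (6 * a * w * c, 3 * a * a * c)

def dagger(R_h):
    """h†[A] := (ι₀ × 1) R[h] π₁ : zero the second argument, then keep the
    second component of the reverse derivative."""
    return lambda a, c: R_h(a, 0.0, c)[1]

Df = dagger(R_Rf)                          # D[f] := R[f]†[A]
print(Df(2.0, 1.0))                        # f'(2)·1 = 12.0
# [RD.6]: daggering D[f] gives back R[f]. In one dimension R[D[f]] coincides
# with R_Rf, so the round trip is immediate:
print(dagger(R_Rf)(2.0, 5.0) == Rf(2.0, 5.0))  # True
```

In one dimension the round trip is trivial because D[f](a, v) and R[f](a, w) are the same formula f′(a)·v; the content of [RD.6] shows up only in higher dimensions, where daggering transposes the Jacobian.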
Lemma 6.4.
For any operational symbol op, if the evaluation function used by the operational semantics satisfies
  eval(snd(op_RRR, (((a, 0), 0), (0, b)))) = eval(op_R, (a, b))
then this can be modelled in any reverse differential restriction category. Moreover, for every term m we have
  ⟦rd(x. rd(y. rd(z.m)(a).y)(b).x)(c).w⟧ = \overline{⟦b⟧} \overline{⟦c⟧} ⟦rd(z.m)(a).w⟧
Crucially, for the above we require [RD.6].

An aspect of forward differentiation that is modelled in our semantics is that differentiating a differential with respect to its "direction" is just substitution. That is,
  fd(x. fd(y.m)(a).x)(b).v = fd(y.m)(a).v
is modelled. This uses [RD.6]. More generally, we can modify the type system slightly to keep track of the arguments with respect to which a term is differentiated, by introducing another context, which we call a linearity context. The typing judgment for the reverse differential term would then have two forms:

  Γ, x : A | ∆ ⊢ m : B
  ----------------------------------------- (a, v fresh)
  Γ, a : A | ∆, v : B ⊢ rd(x.m)(a).v : A

  Γ | ∆, x : A ⊢ m : B
  ----------------------------------------- (a, v fresh)
  Γ, a : A | ∆, v : B ⊢ rd(x.m)(a).v : A

And if we are forward differentiating with respect to a variable from the linearity context, that is, if v was in the linearity context of a term m and we form fd(v.m)(a).w, then the operational reduction
  fd(v.m)(a).w ⇝ let v = w in m
is modelled in our semantics. This means that we can completely avoid doing differentiation in some cases, at the cost of having to carry around more type information. There is a similar version of this rule for reverse derivatives, and it has to do with "colet" expressions. In SDPL we can use the reverse derivative to create a term that substitutes linearly into the output variable of a term. We could use these "colet" expressions to allow for speedups of reverse derivatives as well.
It might also be interesting to characterize these constructions in their own right. This approach also allows us to force [RD.6] into the operational semantics.

The axiom [RD.7], dealing with the symmetry of mixed partial derivatives, may also have a role to play in simplifying the operational semantics. Some machine-learning algorithms use the Hessian of the error function to optimize backpropagation itself, allowing for both more efficient and more effective training (for one example, see [30, 43]). These second derivatives are expected to satisfy a higher-dimensional analogue of the chain rule. In fact, one might expect higher analogues of the chain rule to hold in general; these are sometimes called the Faà di Bruno formulae, for higher chain rules on terms of the form ∂ⁿ(f g). These expected formulae all hold in our semantics, due to a result showing that [CD.6,7] are equivalent to having all the Faà di Bruno formulae [19]. These higher chain rule expansions can be used to determine a slightly different operational semantics for rd(x.m)(a).v expressions, in which the chain rule is maximally expanded first, linearity reductions occur, and then symbolic differentiation is used. While it is unclear whether this is more efficient, it would make things simpler, as it would guarantee that the operational semantics captured the higher chain rule formulae without having to make a requirement of the evaluation function on op_RRR.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore,
Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: a system for large-scale machine learning, pages 265–283, 2016.
[2] Martin Abadi and Gordon D. Plotkin. A simple differentiable programming language. Proceedings of the ACM on Programming Languages, 4:38:1–38:28, 2019. doi:10.1145/3371106.
[3] Jiří Adámek, Stefan Milius, and Jiří Velebil. Elgot algebras. Electronic Notes in Theoretical Computer Science, 155:87–109, 2006. doi:10.1016/j.entcs.2005.11.053.
[4] Atilim Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
[5] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 4, 2010. doi:10.25080/Majora-92bf1922-003.
[6] W. Bertram, H. Glöckner, and K.-H. Neeb. Differential calculus over general base fields and rings. Expositiones Mathematicae, 22(3):213–282, 2004. doi:10.1016/s0723-0869(04)80006-9.
[7] S. Bloom and Z. Esik. Iteration Theories. Springer EATCS Series, 1993. doi:10.1007/978-3-642-78034-9_7.
[8] R. Blute, R. Cockett, and R. Seely. Cartesian differential categories. Theory and Applications of Categories, 22:622–672, 2009.
[9] R.F. Blute, J.R.B. Cockett, and R.A.G. Seely. Differential categories. Mathematical Structures in Computer Science, 16(6):1049–1083, 2006.
[10] Aloïs Brunel, Damiano Mazza, and Michele Pagani. Backpropagation in the simply typed lambda-calculus with linear negation. Proc. ACM Program. Lang., 4(POPL), December 2019. doi:10.1145/3371132.
[11] J.R.B. Cockett and G.S.H. Cruttwell. Differential bundles and fibrations for tangent categories. Cahiers de Topologie et Géométrie Différentielle Catégoriques, LIX(1):10–92, 2018.
[12] J.R.B. Cockett, G.S.H. Cruttwell, and J.D. Gallagher. Differential restriction categories. Theory and Applications of Categories, 25(21):537–613, 2011.
[13] J.R.B. Cockett, P.J.W. Hofstra, and P. Hrubeš. Total maps of Turing categories. Electronic Notes in Theoretical Computer Science, 308:129–146, 2014. Proceedings of the 30th Conference on the Mathematical Foundations of Programming Semantics (MFPS XXX). doi:10.1016/j.entcs.2014.10.008.
[14] J.R.B. Cockett and Stephen Lack. Restriction categories I: categories of partial maps. Theoretical Computer Science, 270(1):223–259, 2002. doi:10.1016/S0304-3975(00)00382-0.
[15] R. Cockett and G. Cruttwell. Differential structure, tangent structure, and SDG. Applied Categorical Structures, 22:331–417, 2014. doi:10.1007/s10485-013-9312-0.
[16] R. Cockett, G. Cruttwell, J. Gallagher, J-S. Lemay, B. MacAdam, G. Plotkin, and D. Pronk. Reverse derivative categories. arXiv:1910.07065, (18):1–25, 2019.
[17] R. Cockett and S. Lack. Restriction categories III: colimits, partial limits and extensivity. Mathematical Structures in Computer Science, 17(4):775–817, 2007. doi:10.1017/S0960129507006056.
[18] R. Cockett and E. Manes. Boolean and classical restriction categories. Mathematical Structures in Computer Science, 19(2):357–416, 2009. doi:10.1017/s0960129509007543.
[19] R. Cockett and R. Seely. The Faà di Bruno construction. Theory and Applications of Categories, 25:294–425, 2011.
[20] Robin Cockett, Joaquín Díaz-Boïls, Jonathan Gallagher, and Pavel Hrubeš. Timed sets, functional complexity, and computability. Electronic Notes in Theoretical Computer Science, 286:117–137, 2012. Proceedings of the 28th Conference on the Mathematical Foundations of Programming Semantics (MFPS XXVIII). doi:10.1016/j.entcs.2012.08.009.
[21] Robin Cockett, Xiuzhan Guo, and Pieter Hofstra. Range categories II: towards regularity. Theory and Applications of Categories, 26(18):453–500, 2012.
[22] J. H. Conway. Regular Algebra and Finite Machines. Chapman and Hall Mathematics Series, 1971.
[23] E. Dubuc. Sur les modèles de la géométrie différentielle synthétique. Cahiers de Topologie et Géométrie Différentielle Catégoriques, 20(1):231–279, 1979.
[24] E. Dubuc and A. Kock. On 1-form classifiers. Communications in Algebra, 12(12):1471–1531, 1984. doi:10.1080/00927878408823064.
[25] Calvin C. Elgot. Monadic computation and iterative algebraic theories, pages 179–234. Springer New York, New York, NY, 1982. doi:10.1007/978-1-4613-8177-8_6.
[26] Conal Elliott. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):70, 2018. doi:10.1145/3236765.
[27] M. P. Fiore and G. D. Plotkin. An axiomatisation of computationally adequate domain theoretic models of FPC. In Proceedings Ninth Annual IEEE Symposium on Logic in Computer Science, pages 92–102, 1994. doi:10.1109/lics.1994.316083.
[28] J. Gallagher. The differential lambda-calculus: syntax and semantics for differential geometry. PhD thesis, University of Calgary, 2009.
[29] J. Gallagher. What is a differential partial combinatory algebra? Master's thesis, University of Calgary, 2011.
[30] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. Proceedings of Machine Learning Research, 97, 2019.
[31] Brett Giles. An investigation of some theoretical aspects of reversible computing. PhD thesis, University of Calgary, 2014.
[32] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[33] Marco Grandis. Cohesive categories and manifolds. Annali di Matematica Pura ed Applicata, 157(1):199–244, 1990. doi:10.1007/bf01765319.
[34] A. Kock. Synthetic Differential Geometry. Cambridge University Press, 1981. doi:10.1017/cbo9780511550812.
[35] R. Lavendhomme. Basic Concepts of Synthetic Differential Geometry. Kluwer Texts in the Mathematical Sciences. Kluwer Academic Publishers, 1996. doi:10.1007/978-1-4757-4588-7.
[36] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: effortless gradients in NumPy. ICML 2015 AutoML Workshop, 238, 2015.
[37] C. Nester. Turing categories and realizability. Master's thesis, University of Calgary, 2017.
[38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[39] Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst., 30(2), March 2008. doi:10.1145/1330017.1330018.
[40] G.D. Plotkin. LCF considered as a programming language. Theoretical Computer Science, 5(3):223–255, 1977. doi:10.1016/0304-3975(77)90044-5.
[41] Dana S. Scott. A type-theoretical alternative to ISWIM, CUCH, OWHY. Theoretical Computer Science, 121(1–2):411–440, 1993. doi:10.1016/0304-3975(93)90095-b.
[42] Alfred Tarski. A lattice-theoretical fixpoint theorem and its applications. Pacific Journal of Mathematics, 5(2):285–309, 1955. doi:10.2140/pjm.1955.5.285.
[43] L. Tzu-Mao. Differentiable Visual Computing. PhD thesis, Massachusetts Institute of Technology, 2019.
[44] Tarmo Uustalu and Niccolò Veltri. The delay monad and restriction categories. In Theoretical Aspects of Computing – ICTAC 2017, pages 32–50. Springer International Publishing, 2017. doi:10.1007/978-3-319-67729-3_3.
[45] Bart van Merrienboer, Dan Moldovan, and Alexander B. Wiltschko. Tangent: automatic differentiation using source-code transformation for dynamically typed array programming. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 6259–6268, 2018.
[46] Fei Wang, Daniel Zheng, James Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. Demystifying differentiable programming: shift/reset the penultimate backpropagator. Proc. ACM Program. Lang., 3(ICFP), July 2019. doi:10.1145/3341700.