[PDF] Automatic Differentiation via Effects and Handlers: An Implementation in Frank

Abstract

Automatic differentiation (AD) is an important family of algorithms which enables derivative based optimization. We show that AD can be simply implemented with effects and handlers by doing so in the Frank language. By considering how our implementation behaves in Frank's operational semantics, we show how our code performs the dynamic creation of programs during evaluation.

Full PDF

aa r X i v : . [ c s . P L ] J a n Automatic Diﬀerentiation via Eﬀects and Handlers

An Implementation in Frank

Jesse Sigal

University of EdinburghEdinburgh, U.K. [email protected]

Abstract

Automatic diﬀerentiation (AD) is an important family of al-gorithms which enables derivative based optimization. Weshow that AD can be simply implemented with eﬀects andhandlers by doing so in the Frank language. By consideringhow our implementation behaves in Frank’s operational se-mantics, we show how our code performs the dynamic cre-ation of programs during evaluation.

CCS Concepts: • Theory of computation → Operationalsemantics ; Control primitives ; •

Mathematics of comput-ing → Automatic diﬀerentiation . Keywords: automatic diﬀerentiation, algebraic eﬀects, eﬀecthandlers, diﬀerentiable programming

ACM Reference Format:

Jesse Sigal. 2021. Automatic Diﬀerentiation via Eﬀects and Han-dlers: An Implementation in Frank. In

Proceedings of Partial Eval-uation and Program Manipulation (PEPM’21).

ACM, New York, NY,USA, 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Machine learning, artiﬁcial intelligence, scientiﬁc modelling,information analysis, and other data heavy ﬁelds have driventhe demand for tools which enable derivative based opti-mization. The family of algorithms known as automatic dif-ferentiation (AD) is the foundation of the tools which achievethis. The family can be coarsely divided into forward mode and reverse mode . Multiple modes exist because their asymp-totics depend on diﬀerent features of the diﬀerentiated pro-grams. Forward mode AD was introduced in 1964 by Wengert[14], and reverse mode AD was created by Speelpenning inhis 1980 thesis [12].It is not surprising that, given its history, AD has beenimplemented in many diﬀerent ways. Many popular toolssuch as ADIFOR [1], ADIC [2], and Tapenade [6, 9] workvia source transformation. These transformations take placeon languages such as C and FORTRAN, and thus all of theaforementioned tools work externally from the program be-ing written. We shall show here that the recent Frank lan-guage [3, 8] and its operational semantics, which leverages

PEPM’21, January 18–19, Online https://doi.org/10.1145/nnnnnnn.nnnnnnn eﬀects and handlers , can be informally seen as dynamicallyperforming partial evaluation and program manipulation.

We are most interested in showing the structure of AD al-gorithms, so we shall only give a short intuition for AD. Let 𝑓 , 𝑔 : R → R be smooth functions (i.e. inﬁnitely diﬀeren-tiable at all points). The chain rule states that ( 𝑓 ◦ 𝑔 ) ′ ( 𝑥 ) = 𝑓 ′ ( 𝑔 ( 𝑥 )) · 𝑔 ′ ( 𝑥 ) . AD algorithms use this compositional prop-erty to incrementally calculate the derivative of an entireprogram one basic operation at a time during evaluation. Werefer the reader to the textbook of Griewank and Walther[4] for general knowledge and to Hascoët and Araya-Polo[5] for checkpointed reverse mode, our most interesting ex-ample. Eﬀects and handlers are a structured method of includingside-eﬀects into programs. Algebraic eﬀects were introducedin 2001 by Plotkin and Power [10] handlers for them in 2009by Plotkin and Pretnar [11]. Eﬀects and handlers can be viewedas an extension of the common feature of catchable excep-tions. Catching an exception terminates the program delim-ited by the exception handling code, but eﬀect handlers canresume the handled code and pass a value to it. Eﬀects andhandlers can implement many common side eﬀects suchas state, exceptions, non-determinism, logging, and input-output.

We will be using the Frank language to implement AD. Frank’styping and operational semantics are inspired by call-by-push-value [7], meaning there is a distinction between val-ues and computations. We note Frank has a ﬁxed left-to-right evaluation order. Frank combines the concepts of func-tions and handlers by unifying them into what Frank refersto as operators, which act by application. However, we shallusually say handler for operators which handle eﬀects andfunctions otherwise. We shall also simplify certain aspectsfor ease of exposition, see Convent et al. [3] for a tutorialand details.Let us consider a simple example of a handler for a pro-gram which uses state of type S . EPM’21, January 18–19, Online Jesse Sigal (cid:12)(cid:12) state : {S -> X -> [Console]X} (cid:12)(cid:12) state _ x = print "end"; x (cid:12)(cid:12) state s k> = print "get"; state s (k s) (cid:12)(cid:12) state _ k> = print "put"; state s (k unit)

We ﬁrst explain the type of state . The handler state takestwo arguments, one of type S and one of type X . In orderfor state to be used, the context in which it is called mustsupport the ability [Console] which is a snoc-list containingexactly one instance Console of the interface

Console (the abil-ity [Console, Console] contains two distinct instances of thesame interface). The ability [Console] means we can use the command print deﬁned by the interface

Console . In the term state s x , the value produced by s can only be computed us-ing commands from the instances in the ability [Console] . Onthe other hand, the value produced by x can use commandsfrom [Console, State S] . The value x has access to State com-mands because the adjustment extends the ambi-ent ability [Console] . We note that the adjustment guarantees that state handles all commands of the

State in-terface ( get and set ). The full type of state includes braces,showing that state is a suspended computation. Frank auto-matically inserts these if they are absent.We shall brieﬂy explain some aspects of Frank’s opera-tional semantics before we go into more detail during ADexamples. Consider the example top-level use of state wheresemicolon is sequencing and postﬁx ! is nullary function ap-plication. (cid:12)(cid:12) The ability [Console] is permitted at the top-level as Frank’simplementation will handle it. As the program executes, theunderlined get is encountered and a continuation of the pro-gram delimited by state is captured, namely the operator {r -> (put (r + get!); get!)} , and bound to k in the body ofthe second line of state ’s deﬁnition. Once the execution of (put (get! + get!); get!) ﬁnishes, the ﬁrst line of state ’s def-inition is matched and state exits. We will cover the implementation of four diﬀerent handlersin Frank: evaluate : the most basic handler which dispatches tobuiltin arithmetic operations; diff : an implementation of forward mode AD; reverse : an implementation of reverse mode AD whichmakes use of the builtin mutable state interface; and reversec : an implementation of checkpointed reverse modeAD which extends reverse .Each of the handlers handle the interface

Smooth , which con-ceptually corresponds to smooth functions on the real num-bers. We only include constants, negation, addition, and mul-tiplication for simplicity, but any number of other smoothfunctions could be included. Additionally, Frank currently does not support ﬂoats, so we use integers instead, howeverwith language support ﬂoats could be used. (cid:12)(cid:12) data Nullary = constE Int (cid:12)(cid:12) data Unary = negateE (cid:12)(cid:12) data Binary = plusE | timesE (cid:12)(cid:12)(cid:12)(cid:12) interface Smooth X = (cid:12)(cid:12) ap0 : Nullary -> X (cid:12)(cid:12) | ap1 : Unary -> X -> X (cid:12)(cid:12) | ap2 : Binary -> X -> X -> X

The above deﬁnition says the

Smooth interface is parameter-ized by X and has three eﬀectful commands. The command apN is the N -ary application of a smooth function. Note thatthe nullary functions are constants. For ease of use, we de-ﬁne the following helper functions. (cid:12)(cid:12) c : Int -> [Smooth X] X (cid:12)(cid:12) c i = ap0 (constE i) (cid:12)(cid:12)(cid:12)(cid:12) n : X -> [Smooth X] X (cid:12)(cid:12) n x = ap1 negateE x (cid:12)(cid:12)(cid:12)(cid:12) p : X -> X -> [Smooth X] X (cid:12)(cid:12) p x y = ap2 plusE x y (cid:12)(cid:12)(cid:12)(cid:12) t : X -> X -> [Smooth X] X (cid:12)(cid:12) t x y = ap2 timesE x y The operational semantics of Frank allows us to treat theabove helper functions as if they were commands themselves,which we will do throughout. We will also deﬁne helperfunctions for the dispatching of 𝑛 -ary functions and theirderivatives to make the similarity between diﬀerent AD modesmore apparent. (cid:12)(cid:12) op0 : Nullary -> [Smooth X] X (cid:12)(cid:12) op1 : Unary -> X -> [Smooth X] X (cid:12)(cid:12) op2 : Binary -> X -> X -> [Smooth X] X (cid:12)(cid:12) op0 (constE i) = c i (cid:12)(cid:12) op1 negateE x = n x (cid:12)(cid:12) op2 plusE x y = p x y (cid:12)(cid:12) op2 timesE x y = t x y (cid:12)(cid:12)(cid:12)(cid:12) der1 : Unary -> X -> [Smooth X] X (cid:12)(cid:12) der2L : Binary -> X -> X -> [Smooth X] X (cid:12)(cid:12) der2R : Binary -> X -> X -> [Smooth X] X (cid:12)(cid:12) der1 negateE x = n (c 1) 𝑑 / 𝑑𝑥 (− 𝑥 ) = − (cid:12)(cid:12) der2L plusE x y = c 1 𝑑 / 𝑑𝑥 ( 𝑥 + 𝑦 ) = (cid:12)(cid:12) der2L timesE x y = y 𝑑 / 𝑑𝑥 ( 𝑥 · 𝑦 ) = 𝑦 (cid:12)(cid:12) der2R plusE x y = c 1 𝑑 / 𝑑𝑦 ( 𝑥 + 𝑦 ) = (cid:12)(cid:12) der2R timesE x y = x 𝑑 / 𝑑𝑦 ( 𝑥 · 𝑦 ) = 𝑥 The most basic handler we will consider is the evaluate han-dler. It only handles

Smooth X where X is instantiated to Int . (cid:12)(cid:12) evaluate : X -> X (cid:12)(cid:12) evaluate x = x (cid:12)(cid:12) evaluate k> = evaluate (k i) (cid:12)(cid:12) evaluate k> = evaluate (k (-x)) (cid:12)(cid:12) evaluate k> = evaluate (k (x + y)) (cid:12)(cid:12) evaluate k> = evaluate (k (x * y)) utomatic Diﬀerentiation via Eﬀects and Handlers PEPM’21, January 18–19, Online In the case of constE i , its integer parameter i is returned.Each other case of evaluate takes the integer arguments passedto the command and performs the corresponding integer op-eration.The evaluate handler will always be our top-level handler,and it is the only way to remove all Smooth interfaces. Weshall evaluate an example program where evaluate is the top-level handler to illustrate how Frank executes. We will bepaying special attention to how delimited continuations arecaptured. We will use underlining to show what term is cur-rently at the focus of evaluation.Our initial program is below, and represents the term + 𝑥 + − 𝑦 evaluated at 𝑥 = and 𝑦 = , which equals − . (cid:12)(cid:12) evaluate (p (c 1) (p (t (t 2 2) 2) (n (t 4 4)))) The current focus of evaluation is the command c 1 . (cid:12)(cid:12) evaluate (p (c 1) (p (t (t 2 2) 2) (n (t 4 4)))) The argument is in normal form (fully evaluated). There-fore, we can handle the command c 1 . The handling processbegins by capturing the proper delimited continuation byincrementally freezing the stack of evaluation frames. Werepresent freezing by highlighting and boldface. (cid:12)(cid:12) evaluate (p (c 1) (p (t (t 2 2) 2) (n (t 4 4)))) (cid:12)(cid:12) evaluate (p (c 1) (p (t (t 2 2) 2) (n (t 4 4)))) We have now reached a handler, evaluate , for the commandin focus. The frozen command (highlighted) is the captureddelimited continuation. The ap0 case of evaluate is then matchedto the command c 1 , where k is bound to the continuationwith c 1 removed and i is bound to . The bound variables k and i are then substituted into the corresponding body of evaluate . (cid:12)(cid:12) evaluate ({x -> (p x (p (t (t 2 2) 2) (n (t 4 4))))} 1) The next step applies the continuation to . (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4)))) The focus of evaluation now moves to t 2 2 , and a new de-limited continuation is dynamically captured. (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4)))) (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2)

2) (n (t 4 4)))) (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4)))) (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4))) ) (cid:12)(cid:12) evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4)))) We have now again reached the evaluate handler, and thistime match the ap2 case, resulting in the following. (cid:12)(cid:12) evaluate ({x -> (p 1 (p (t x 2) (n (t 4 4))))} (2 * 2)) (cid:12)(cid:12) evaluate ({x -> (p 1 (p (t x 2) (n (t 4 4))))} 4) (cid:12)(cid:12) evaluate (p 1 (p (t 4 2) (n (t 4 4))))

Evaluation will continue as such until the ﬁnal answer of − is calculated.We have now seen how the evaluate handler interprets Smooth commands with the builtin arithmetic operations. Eventhough evaluate is simple, it allows us to write our other han-dlers in a polymorphic fashion independent of

Int . Our next handler is the diff handler, which implements for-ward mode AD via a method known as dual numbers . A dualnumber is a pair of real numbers where the second numberrepresents the derivative of the ﬁrst. The diff handler han-dles commands with dual number arguments. The mathe-matical justiﬁcation of AD is not our focus, and thus we shalljust focus on the patterns of computation present withoutproving their correctness. We deﬁne the

Dual datatype and diff below. (cid:12)(cid:12) data Dual X = dual X X (cid:12)(cid:12)(cid:12)(cid:12) v : Dual X -> X (cid:12)(cid:12) v (dual x _) = x (cid:12)(cid:12)(cid:12)(cid:12) dv : Dual X -> X (cid:12)(cid:12) dv (dual _ dx) = dx (cid:12)(cid:12)(cid:12)(cid:12) diff : Y -> [Smooth X] Y (cid:12)(cid:12) diff x = x (cid:12)(cid:12) diff k> = (cid:12)(cid:12) let r = dual (op0 n) (c 0) in (cid:12)(cid:12) diff (k r) (cid:12)(cid:12) diff k> = (cid:12)(cid:12) let r = dual (op1 u x) (t (der1 u x) dx) in (cid:12)(cid:12) diff (k r) (cid:12)(cid:12) diff k> = (cid:12)(cid:12) let r = dual (op2 b x y) (p (t (der2L b x y) dx) (cid:12)(cid:12) (t (der2R b x y) dy)) in (cid:12)(cid:12) diff (k r)

Notice the similarities between each of the apN cases. Thecommand being handled by diff is evaluated with opN in theﬁrst component of

Dual , and a calculation involving deriva-tives creates the second component.We will evaluate an example program similar to our previ-ous one. The program will represent the same mathematicalterm + 𝑥 + − 𝑦 evaluated at 𝑥 = and 𝑦 = , but addi-tionally we shall be calculating the derivative with respectto 𝑥 at this point, which is . This is achieved by setting 𝑥 to dual 2 1 and 𝑦 to dual 4 0 , where 𝑥 has its second compo-nent set to to treat it as the diﬀerentiated variable and 𝑦 has its second component set to to treat it as a constant. (cid:12)(cid:12) evaluate (diff ( (cid:12)(cid:12) p (c 1) (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1)) (cid:12)(cid:12) (n (t (dual 4 0) (dual 4 0)))) (cid:12)(cid:12) )) Evaluation begins as before, with the c 1 command beingin focus and a delimited continuation being captured. (cid:12)(cid:12) evaluate (diff ( (cid:12)(cid:12) p (c 1) (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1)) (cid:12)(cid:12) (n (t (dual 4 0) (dual 4 0)))) (cid:12)(cid:12) )) Note how the continuation captured is delimited by diff and not evaluate . This behavior is due to the eﬀect typingsystem of Frank. There are two diﬀerent instances of the

Smooth interface available to the portion of the program be-ing handled. By default, the innermost handler provides the

EPM’21, January 18–19, Online Jesse Sigal instance being used by extending the ambient ability withan adaptor. As we shall see later, Frank provides constructsallowing us to select handlers other than the innermost. Thetop case of diff is matched by c 1 with the following result. (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) let r = dual (op0 (constE 1)) (c 0) in (cid:12)(cid:12) diff ( (cid:12)(cid:12) {x -> (p x (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1)) (cid:12)(cid:12) (n (t (dual 4 0) (dual 4 0)))))} r) (cid:12)(cid:12) ) (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) let r = dual (c 1) (c 0) in (cid:12)(cid:12) diff ( (cid:12)(cid:12) {x -> (p x (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1)) (cid:12)(cid:12) (n (t (dual 4 0) (dual 4 0)))))} r) (cid:12)(cid:12) ) We now have two c commands which will be be handled by evaluate , producing dual 1 0 for r ’s value. After handling, r will be be substituted and the continuation applied. (cid:12)(cid:12) evaluate (diff ( (cid:12)(cid:12) p (dual 1 0) (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1)) (cid:12)(cid:12) (n (t (dual 4 0) (dual 4 0)))) (cid:12)(cid:12) )) Evaluation will then continue in a similar manner for allremaining commands. Each command will ﬁrst be handledby diff , and the commands in the body of each diff casehandled by evaluate , eventually producing dual -7 12 .We will now focus on Frank’s ability to dynamically de-termine which handler handles a command. First, we deﬁnetwo auxiliary functions. (cid:12)(cid:12) lift : X -> [Smooth X, Smooth (Dual X)] (Dual X) (cid:12)(cid:12) lift x = dual x ( (c 0)) (cid:12)(cid:12)(cid:12)(cid:12) d : {(Dual X) -> [Smooth X, Smooth (Dual X)] (Dual X)} (cid:12)(cid:12) -> X -> [Smooth X] X (cid:12)(cid:12) d f x = dv (diff (f (dual x ( (c 1)))))

The adaptor in lift causes the command c 0 to be as-sociated with Smooth X and not the rightmost instance

Smooth (Dual X) .The d function returns the derivative of a unary function and lift will enable us to nest d . Note that as in lift , in d causes the command c 1 to be associated with Smooth X . Con-sider the expression 𝑑 / 𝑑𝑥 ( 𝑥 · 𝑑 / 𝑑𝑦 ( 𝑥 + 𝑦 )| 𝑦 = )| 𝑥 = (which equals ). The corresponding program requires lift . (cid:12)(cid:12) evaluate (d {x -> t x (d {y -> p (lift x) y} (c 1))} (c 1)) We evaluate until the delimited continuation is captured. (cid:12)(cid:12) evaluate (d {x -> t x (d {y -> p (lift x) y} (c 1))} 1) (cid:12)(cid:12) evaluate (dv (diff ( (cid:12)(cid:12) {x -> t x (d {y -> p (lift x) y} (c 1))} (cid:12)(cid:12) (dual 1 ( (c 1))) (cid:12)(cid:12) ))) (cid:12)(cid:12) evaluate (dv (diff ( (cid:12)(cid:12) {x -> t x (d {y -> p (lift x) y} (c 1))} (cid:12)(cid:12) (dual 1 ( (c 1))) (cid:12)(cid:12) )))

The continuation for c 1 is delimited by evaluate due to .We conclude by noting that Frank will reject the nested pro-gram if lift is not present.

The evaluate and diff handlers manipulate programs by cap-turing delimited continuations, but only in quite simple ways.They each eventually compute a value based on the com-mand being handled and then continue with the originalprogram with the computed value substituted in. The reverse handler will be diﬀerent, and will build up a secondary pro-gram during the evaluation of the initial program.Reverse mode AD works by creating a mutable cell foreach value which accumulates contributions to its deriva-tive. The method of accumulation is a generalized versionof the backpropagation algorithm made prominent by ma-chine learning. We deﬁne the datatype

Prop for backpropa-gation where

Ref X is a reference to a mutable cell contain-ing a value of type X . The reverse handler handles commandscontaining Prop ’s. (cid:12)(cid:12) data Prop X = prop X (Ref X) (cid:12)(cid:12)(cid:12)(cid:12) fwd : Prop X -> X (cid:12)(cid:12) fwd (prop x _) = x (cid:12)(cid:12)(cid:12)(cid:12) deriv : Prop X -> Ref X (cid:12)(cid:12) deriv (prop _ r) = r (cid:12)(cid:12)(cid:12)(cid:12) reverse : Unit -> [RefState, Smooth X] Unit (cid:12)(cid:12) reverse x = x (cid:12)(cid:12) reverse k> = (cid:12)(cid:12) let r = prop (op0 n) (new (c 0)) in (cid:12)(cid:12) reverse (k r) (cid:12)(cid:12) reverse k> = (cid:12)(cid:12) let r = prop (op1 u x) (new (c 0)) in (cid:12)(cid:12) reverse (k r); (cid:12)(cid:12) write dx (p (read dx) (t (der1 u x) (read (deriv r)))) (cid:12)(cid:12) reverse k> = (cid:12)(cid:12) let r = prop (op2 b x y) (new (c 0)) in (cid:12)(cid:12) reverse (k r); (cid:12)(cid:12) write dx (p (read dx) (t (der2L b x y) (read (deriv r)))); (cid:12)(cid:12) write dy (p (read dy) (t (der2R b x y) (read (deriv r))))

The reverse handler makes use of the same op and der func-tions as diff , but is diﬀerent from evaluate and diff in twoimportant ways. Firstly, the type of reverse shows that it re-quires access to the RefState interface of mutable state (abuiltin eﬀect of Frank that can be handled by the languageimplementation). Secondly, the body of the ap1 and ap2 casescontains code after the use of the captured delimited contin-uation k . We shall see these writes to memory will form thesecondary program which actually accumulates derivatives.To properly calculate derivatives with reverse , we requirea small helper function which starts the process of backprop-agation, which we call grad for gradient. (cid:12)(cid:12) grad : {(Prop X) (cid:12)(cid:12) -> [RefState, Smooth X, Smooth (Prop X)] (Prop X)} (cid:12)(cid:12) -> X -> [RefState, Smooth X] X utomatic Diﬀerentiation via Eﬀects and Handlers PEPM’21, January 18–19, Online (cid:12)(cid:12) grad f x = (cid:12)(cid:12) let z = prop x (new (c 0)) in (cid:12)(cid:12) reverse (write (deriv (f z)) ( (c 1))); (cid:12)(cid:12) read (deriv z) We evaluate the same term as before. (cid:12)(cid:12) evaluate (grad ({x -> (cid:12)(cid:12) let y = c 4 in p (c 1) (p (t (t x x) x) (n (t y y))) (cid:12)(cid:12) } 2)) (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) let z = prop 2 (new (c 0)) in (cid:12)(cid:12) reverse (write (deriv ({x -> (cid:12)(cid:12) let y = c 4 in p (c 1) (p (t (t x x) x) (n (t y y))) (cid:12)(cid:12) } z)) ( (c 1))); (cid:12)(cid:12) read (deriv z))

The term new (c 0) is handled ﬁrst by evaluate for c 0 (re-turning ), and the command new 0 is handled by the Frankimplementation and returns a new reference whose cellcontains . The result is then substituted for z . (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write (deriv ({x -> (cid:12)(cid:12) let y = c 4 in p (c 1) (p (t (t x x) x) (n (t y y))) (cid:12)(cid:12) } (prop 2 ))) ( (c 1))); (cid:12)(cid:12) read (deriv (prop 2 ))) Next, the anonymous function is applied to prop 2 . (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write (deriv ( (cid:12)(cid:12) let y = c 4 in (cid:12)(cid:12) p (c 1) (p (t (t (prop 2 ) (prop 2 )) (prop 2 )) (cid:12)(cid:12) (n (t y y))) (cid:12)(cid:12) )) ( (c 1))); (cid:12)(cid:12) read (deriv (prop 2 ))) The command c 4 is handled by the ap0 case of reverse , whichas before creates a new reference , and thus y is substi-tuted by prop 4 . The command c 1 will create . (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write (deriv ( (cid:12)(cid:12) p (prop 1 ) (cid:12)(cid:12) (p (t (t (prop 2 ) (prop 2 )) (prop 2 )) (cid:12)(cid:12) (n (t (prop 4 ) (prop 4 )))) (cid:12)(cid:12) )) ( (c 1))); (cid:12)(cid:12) read (deriv (prop 2 ))) We have now reached the ﬁrst interesting command, whichmatches the ap2 case of reverse . The captured delimited con-tinuation is now explicitly highlighted. (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write (deriv ( (cid:12)(cid:12) p (prop 1 ) (cid:12)(cid:12) (p (t (t (prop 2 ) (prop 2 )) (prop 2 )) (cid:12)(cid:12) (n (t (prop 4 ) (prop 4 )))) (cid:12)(cid:12) )) ( (c 1))) ; (cid:12)(cid:12) read (deriv (prop 2 ))) The result of reverse handling the command produces a newreference . (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write (deriv ( (cid:12)(cid:12) p (prop 1 ) (cid:12)(cid:12) (p (t (prop 4 ) (prop 2 )) (cid:12)(cid:12) (n (t (prop 4 ) (prop 4 )))) (cid:12)(cid:12) )) ( (c 1))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L timesE 2 2) (read (deriv (prop 4 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R timesE 2 2) (read (deriv (prop 4 ))))); (cid:12)(cid:12) read (deriv (prop 2 ))) We see that the evaluation of the initial program has pro-duced new expressions to be evaluated after the initial pro-gram ﬁnishes. The handling by reverse will eventually han-dle all commands meant for it, producing the following. (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reverse (write ( (c 1))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L plusE 1 -8) (read (deriv (prop -7 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R plusE 1 -8) (read (deriv (prop -7 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L plusE 8 -16) (read (deriv (prop -8 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R plusE 8 -16) (read (deriv (prop -8 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der1 negateE 16) (read (deriv (prop -16 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L timesE 4 4) (read (deriv (prop 16 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R timesE 4 4) (read (deriv (prop 16 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L timesE 4 2) (read (deriv (prop 8 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R timesE 4 2) (read (deriv (prop 8 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L timesE 2 2) (read (deriv (prop 4 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R timesE 2 2) (read (deriv (prop 4 ))))); (cid:12)(cid:12) read (deriv (prop 2 )))

The above code is the secondary program created by reverse ,which performs backpropagation. Furthermore, if a user wishedto capture this secondary program, the deﬁnition of reverse could be changed to return a suspended computation. Thus,we could also partially evaluate the whole program (initialand backpropagation) by running only the initial programand capturing the backpropagation computation.It could also be possible to use multi-stage programmingby reifying the initial and secondary programs as a compu-tation graph in the style of Wang et al. [13]. Their approachuses delimited continuations and combines normal execu-tion with building an intermediate representation. As eﬀectsand handlers are essentially a structured use of delimitedcontinuations, a similar story for Frank may be possible.

The ﬁnal algorithm we shall cover is checkpointed reversemode. Reverse mode has maximum memory residency pro-portional to the number of operations (as seen in the deﬁni-tion of reverse ). Checkpointed reverse mode allows a trade-oﬀ between space and time by recomputing checkpointedsubprograms, once without allocating memory and an addi-tional time with memory. However, any memory allocatedin between these two runs can be safely deallocated, as it

EPM’21, January 18–19, Online Jesse Sigal corresponds to code after the checkpointed subprogram inthe original program, thus reducing maximum memory res-idency.To deﬁne our new handler, we introduce a

Checkpoint ef-fect which takes a suspended computation that will be runmultiple times. We also deﬁne a simple evaluate like handler, evaluatet (see appendix for deﬁnition). (cid:12)(cid:12) interface Checkpoint X = (cid:12)(cid:12) checkpoint : (cid:12)(cid:12) {[Checkpoint X , Smooth (Prop X)] Prop X} -> Prop X

Frank also contains a catch-all pattern match which matchesvalues and commands not handled above it. We use this fea-ture to extend reverse by delegating any

Smooth commandsreceived to reverse and only adding a case for checkpoint . (cid:12)(cid:12) reversec : Unit (cid:12)(cid:12) -> [RefState, Smooth X] Unit (cid:12)(cid:12) reversec x = x (cid:12)(cid:12) reversec k> = (cid:12)(cid:12) let s = new (c 0) in (cid:12)(cid:12) let res = (evaluatet s ( (cid:12)(cid:12) s b)> p! (cid:12)(cid:12) )) in (cid:12)(cid:12) let r = prop (fwd res) (new (c 0)) in (cid:12)(cid:12) reversec (k r); (cid:12)(cid:12) reversec (write (cid:12)(cid:12) (deriv ( s b), RefState> p!)) (cid:12)(cid:12) (read (deriv r))) (cid:12)(cid:12) reversec = reversec ( s)> (reverse m!)) Note how the checkpointed subprogram (the suspended com-putation p which is the argument of checkpoint ) is calledtwice, once with evaluatet as the handler and once with reversec as the handler. Additionally, the last case will match every Smooth command, and then reinvoke the captured computa-tion with a new reverse handler to handle the command.Consider the following program where gradc is grad with reversec in the place of reverse . (cid:12)(cid:12) evaluate (gradc ({x -> (cid:12)(cid:12) let y = c 2 in (cid:12)(cid:12) let z = checkpoint {p x y} in (cid:12)(cid:12) let a = checkpoint {let w = checkpoint {t x z} in p w y} in (cid:12)(cid:12) p a x (cid:12)(cid:12) } (c 2))) The ﬁrst interesting evaluation step is after the underlined checkpoint has been handled. (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reversec ( s)> (reverse (write (deriv ( (cid:12)(cid:12) let z = prop 4 in (cid:12)(cid:12) let a = checkpoint { (cid:12)(cid:12) let w = checkpoint t (prop 2 ) z in (cid:12)(cid:12) p w (prop 2 )} in (cid:12)(cid:12) p a (prop 2 ) (cid:12)(cid:12) )) ( (c 1))))); (cid:12)(cid:12) reversec (write (deriv ( s b), RefState> (cid:12)(cid:12) {p (prop 2 ) (prop 2 )}!)) (cid:12)(cid:12) (read (deriv (prop 4 )))); (cid:12)(cid:12) read (deriv (prop 2 )))

Note how on the second line the reverse handler has beenmade the innermost delimiter of the remainder of the ini-tial program, via the catch-all case of reversec . Additionally,note how the checkpointed code (underlined) is stored as athunk to be run after the initial program in the second use of reversec . After the initial program has been evaluated away,we obtain the following. (cid:12)(cid:12) evaluate ( (cid:12)(cid:12) reversec ( s)> ( (cid:12)(cid:12) reverse (write 1); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2L plusE 10 2) (read (deriv (prop 12 ))))); (cid:12)(cid:12) write (p (read ) (cid:12)(cid:12) (t (der2R plusE 10 2) (read (deriv (prop 12 )))))); (cid:12)(cid:12) reversec (write (deriv ( s b), RefState> (cid:12)(cid:12) {let w = checkpoint t (prop 2 ) (prop 4 ) in (cid:12)(cid:12) p w (prop 2 )}!)) (cid:12)(cid:12) (read (deriv (prop 10 )))); (cid:12)(cid:12) reversec (write (deriv ( s b), RefState> (cid:12)(cid:12) {p (prop 2 ) (prop 2 )}!)) (cid:12)(cid:12) (read (deriv (prop 4 )))); (cid:12)(cid:12) read (deriv (prop 2 )))

The remaining checkpoint command illustrates the recursivenature of reversec . It shows how even nested checkpointingin checkpointed code can be properly evaluated by delayingthe program transformation happening via evaluation.

We have seen the implementation and evaluation of AD inFrank via Frank’s operation semantics and four handlers: evaluate , diff , reverse , and reversec . While evaluate and diff do eﬀectively no program transformations, reverse and reversec build up ancillary programs via delimited continuations. Theeﬀects and handler style of Frank allowed us to composeand nest our deﬁned handlers, which is especially appar-ent in the modular deﬁnition of reversec which delegatesall Smooth commands to reverse . It may also be possible tointegrate multi-stage programming by using the system ofWang et al.. In conclusion, we have illustrated in Frank thateﬀects and handlers are a good match for AD, and that ef-fects and handlers can be seen as a form of program manip-ulation.

Acknowledgments

I would like to thank Sam Lindley for answering my manyquestions about Frank, my supervisor Chris Heunen for hissupport, and Ohad Kammar for conversations about this workand encouragement to improve it. I would also like to thankthe reviewers for their valuable feedback.

References [1] Christian Bischof, Peyvand Khademi, Andrew Mauer, and AlanCarle. 1996. Adifor 2.0: Automatic Diﬀerentiation of Fortran77 Programs.

IEEE Comput. Sci. Eng.

3, 3 (Sept. 1996), 18–32. https://doi.org/10.1109/99.537089utomatic Diﬀerentiation via Eﬀects and Handlers PEPM’21, January 18–19, Online [2] C. H. Bischof, L. Roh, and A. J. Mauer-Oats. 1997. ADIC:an extensible automatic diﬀerentiation tool for ANSI-C.

Software: Practice and Experience

27, 12 (1997), 1427–1456. https://doi.org/10.1002/(SICI)1097-024X(199712)27:12<1427::AID-SPE138>3.0.CO;2-Q [3] Lukas Convent, Sam Lindley, Conor Mcbride, and Craig Mclaughlin.2020. Doo bee doo bee doo.

Journal of Functional Programming https://doi.org/10.1017/S0956796820000039

Publisher: Cam-bridge University Press.[4] A. Griewank and A. Walther. 2008.

Evaluating Deriva-tives . Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898717761 [5] Laurent Hascoët and Mauricio Araya-Polo. 2006. Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Diﬀeren-tiation. arXiv:cs/0606042 (June 2006). http://arxiv.org/abs/cs/0606042 arXiv: cs/0606042.[6] Laurent Hascoët and Valérie Pascual. 2013. The Tapenadeautomatic diﬀerentiation tool: Principles, model, and speciﬁca-tion.

ACM Trans. Math. Softw.

39, 3 (May 2013), 20:1–20:43. https://doi.org/10.1145/2450153.2450158 [7] P. B. Levy. 2003.

Call-By-Push-Value: A Functional/Imperative Synthe-sis . Springer Netherlands. https://doi.org/10.1007/978-94-007-0954-6 [8] Sam Lindley, Conor McBride, and Craig McLaughlin. 2017. Dobe do be do. In

Proceedings of the 44th ACM SIGPLAN Sym-posium on Principles of Programming Languages (POPL 2017) .Association for Computing Machinery, Paris, France, 500–514. https://doi.org/10.1145/3009837.3009897 [9] Valérie Pascual and Laurent Hascoët. 2008. TAPENADE for C. In

Ad-vances in Automatic Diﬀerentiation (Lecture Notes in ComputationalScience and Engineering) , Christian H. Bischof, H. Martin Bücker, PaulHovland, Uwe Naumann, and Jean Utke (Eds.). Springer, Berlin, Hei-delberg, 199–209. https://doi.org/10.1007/978-3-540-68942-3_18 [10] Gordon Plotkin and John Power. 2001. Adequacy for Algebraic Ef-fects. In

Foundations of Software Science and Computation Structures ,Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, Furio Honsell, andMarino Miculan (Eds.). Vol. 2030. Springer Berlin Heidelberg, Berlin,Heidelberg, 1–24. https://doi.org/10.1007/3-540-45315-6_1

Series Ti-tle: Lecture Notes in Computer Science.[11] Gordon Plotkin and Matija Pretnar. 2009. Handlers of Algebraic Ef-fects. In

Programming Languages and Systems , Giuseppe Castagna(Ed.). Vol. 5502. Springer Berlin Heidelberg, Berlin, Heidelberg, 80–94. https://doi.org/10.1007/978-3-642-00590-9_7

Series Title: LectureNotes in Computer Science.[12] Bert Speelpenning. 1980.

Compiling fast partial derivatives of func-tions given by algorithms . Ph.D. University of Illinois at Urbana-Champaign, USA. AAI8017989.[13] Fei Wang, Daniel Zheng, James Decker, Xilun Wu, Grégory M. Esser-tel, and Tiark Rompf. 2019. Demystifying diﬀerentiable programming:shift/reset the penultimate backpropagator.

Proc. ACM Program. Lang.

3, ICFP (July 2019), 96:1–96:31. https://doi.org/10.1145/3341700 [14] R. E. Wengert. 1964. A simple automatic derivative evalu-ation program.

Commun. ACM

7, 8 (Aug. 1964), 463–464. https://doi.org/10.1145/355586.364791

Appendix

The following is the deﬁnition of evaluatet use in reversec .Note the similarities with evaluate . (cid:12)(cid:12) evaluatet : Ref X (cid:12)(cid:12) -> Y (cid:12)(cid:12) -> [Smooth X] Y (cid:12)(cid:12) evaluatet _ x = x (cid:12)(cid:12) evaluatet s k> = (cid:12)(cid:12) let res = evaluatet s ( s b)> p!) in (cid:12)(cid:12) evaluatet s (k (prop (fwd res) s)) (cid:12)(cid:12) evaluatet s k> = (cid:12)(cid:12) evaluatet s (k (prop ( (op0 n)) s)) (cid:12)(cid:12) evaluatet s k> = (cid:12)(cid:12) evaluatet s (k (prop ( (op1 u x)) s)) (cid:12)(cid:12) evaluatet s k> = (cid:12)(cid:12)(cid:12)(cid:12)