Automatic Differentiation via Effects and Handlers: An Implementation in Frank
Jesse Sigal
University of Edinburgh, Edinburgh, U.K. [email protected]
Abstract
Automatic differentiation (AD) is an important family of algorithms which enables derivative based optimization. We show that AD can be simply implemented with effects and handlers by doing so in the Frank language. By considering how our implementation behaves in Frank's operational semantics, we show how our code performs the dynamic creation of programs during evaluation.
CCS Concepts: • Theory of computation → Operational semantics; Control primitives; • Mathematics of computing → Automatic differentiation.

Keywords: automatic differentiation, algebraic effects, effect handlers, differentiable programming
ACM Reference Format:
Jesse Sigal. 2021. Automatic Differentiation via Effects and Handlers: An Implementation in Frank. In Proceedings of Partial Evaluation and Program Manipulation (PEPM'21). ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Machine learning, artificial intelligence, scientific modelling, information analysis, and other data heavy fields have driven the demand for tools which enable derivative based optimization. The family of algorithms known as automatic differentiation (AD) is the foundation of the tools which achieve this. The family can be coarsely divided into forward mode and reverse mode. Multiple modes exist because their asymptotics depend on different features of the differentiated programs. Forward mode AD was introduced in 1964 by Wengert [14], and reverse mode AD was created by Speelpenning in his 1980 thesis [12].

It is not surprising that, given its history, AD has been implemented in many different ways. Many popular tools such as ADIFOR [1], ADIC [2], and Tapenade [6, 9] work via source transformation. These transformations take place on languages such as C and FORTRAN, and thus all of the aforementioned tools work externally from the program being written. We shall show here that the recent Frank language [3, 8] and its operational semantics, which leverages effects and handlers, can be informally seen as dynamically performing partial evaluation and program manipulation.
We are most interested in showing the structure of AD algorithms, so we shall only give a short intuition for AD. Let f, g : R → R be smooth functions (i.e. infinitely differentiable at all points). The chain rule states that (f ∘ g)′(x) = f′(g(x)) · g′(x). AD algorithms use this compositional property to incrementally calculate the derivative of an entire program one basic operation at a time during evaluation. We refer the reader to the textbook of Griewank and Walther [4] for general knowledge and to Hascoët and Araya-Polo [5] for checkpointed reverse mode, our most interesting example.

Effects and handlers are a structured method of including side-effects into programs. Algebraic effects were introduced in 2001 by Plotkin and Power [10], and handlers for them in 2009 by Plotkin and Pretnar [11]. Effects and handlers can be viewed as an extension of the common feature of catchable exceptions. Catching an exception terminates the program delimited by the exception handling code, but effect handlers can resume the handled code and pass a value to it. Effects and handlers can implement many common side effects such as state, exceptions, non-determinism, logging, and input-output.
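The resumable-exception reading of handlers can be made concrete outside Frank. The following Python sketch is my own illustration, not code from the paper: the names state_handler and prog are assumptions. Commands are modeled as values yielded by a generator, and the handler resumes the computation by sending a result back.

```python
# A minimal sketch (not from the paper) of effects and handlers in Python,
# using generators: a computation yields operations ("commands") and a
# handler decides how to resume it with a value, like resumable exceptions.

def state_handler(s, comp):
    """Run generator `comp`, interpreting ('get',) and ('put', v) commands."""
    try:
        op = next(comp)               # start the computation
        while True:
            if op[0] == "get":
                op = comp.send(s)     # resume with the current state
            elif op[0] == "put":
                s = op[1]
                op = comp.send(None)  # resume with unit
    except StopIteration as done:
        return done.value             # the computation's final value

def prog():
    # put (get! + get!); get!
    a = yield ("get",)
    b = yield ("get",)
    yield ("put", a + b)
    return (yield ("get",))

print(state_handler(1, prog()))  # prints 2: the program doubles the state
```

Resuming with send is what distinguishes a handler from an exception handler: the handled code continues from the point where the command was issued.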
We will be using the Frank language to implement AD. Frank's typing and operational semantics are inspired by call-by-push-value [7], meaning there is a distinction between values and computations. We note Frank has a fixed left-to-right evaluation order. Frank combines the concepts of functions and handlers by unifying them into what Frank refers to as operators, which act by application. However, we shall usually say handler for operators which handle effects and functions otherwise. We shall also simplify certain aspects for ease of exposition; see Convent et al. [3] for a tutorial and details.

Let us consider a simple example of a handler for a program which uses state of type S.

  state : {S -> <State S>X -> [Console]X}
  state _ x = x
  state s <get -> k> = state s (k s)
  state _ <put s -> k> = state s (k unit)

We first explain the type of state. The handler state takes two arguments, one of type S and one of type X. In order for state to be used, the context in which it is called must support the ability [Console], which is a snoc-list containing exactly one instance Console of the interface Console (the ability [Console, Console] contains two distinct instances of the same interface). The ability [Console] means we can use the command print defined by the interface Console. In the term state s x, the value produced by s can only be computed using commands from the instances in the ability [Console]. On the other hand, the value produced by x can use commands from [Console, State S]. The value x has access to State commands because the adjustment <State S> extends the ambient ability with an instance of the State interface (get and set). The full type of state includes braces, showing that state is a suspended computation. Frank automatically inserts these if they are absent.

We shall briefly explain some aspects of Frank's operational semantics before we go into more detail during AD examples. Consider the example top-level use of state, where semicolon is sequencing and postfix ! is nullary function application.

  state 1 (put (get! + get!); get!)

The ability [Console] is permitted at the top-level as Frank's implementation will handle it. As the program executes, the first get is encountered and a continuation of the program delimited by state is captured, namely the operator {r -> (put (r + get!); get!)}, and bound to k in the body of the second line of state's definition. Once the execution of (put (get! + get!); get!) finishes, the first line of state's definition is matched and state exits.

We will cover the implementation of four different handlers in Frank:

  evaluate: the most basic handler which dispatches to builtin arithmetic operations;
  diff: an implementation of forward mode AD;
  reverse: an implementation of reverse mode AD which makes use of the builtin mutable state interface; and
  reversec: an implementation of checkpointed reverse mode AD which extends reverse.

Each of the handlers handles the interface
Smooth, which conceptually corresponds to smooth functions on the real numbers. We only include constants, negation, addition, and multiplication for simplicity, but any number of other smooth functions could be included. Additionally, Frank currently does not support floats, so we use integers instead; with language support, floats could be used.

  data Nullary = constE Int
  data Unary = negateE
  data Binary = plusE | timesE

  interface Smooth X = ap0 : Nullary -> X
                     | ap1 : Unary -> X -> X
                     | ap2 : Binary -> X -> X -> X
The above definition says the Smooth interface is parameterized by X and has three effectful commands. The command apN is the N-ary application of a smooth function. Note that the nullary functions are constants. For ease of use, we define the following helper functions.

  c : Int -> [Smooth X] X
  c i = ap0 (constE i)

  n : X -> [Smooth X] X
  n x = ap1 negateE x

  p : X -> X -> [Smooth X] X
  p x y = ap2 plusE x y

  t : X -> X -> [Smooth X] X
  t x y = ap2 timesE x y

The operational semantics of Frank allows us to treat the above helper functions as if they were commands themselves, which we will do throughout. We will also define helper functions for the dispatching of n-ary functions and their derivatives to make the similarity between different AD modes more apparent.

  op0 : Nullary -> [Smooth X] X
  op1 : Unary -> X -> [Smooth X] X
  op2 : Binary -> X -> X -> [Smooth X] X
  op0 (constE i) = c i
  op1 negateE x = n x
  op2 plusE x y = p x y
  op2 timesE x y = t x y

  der1 : Unary -> X -> [Smooth X] X
  der2L : Binary -> X -> X -> [Smooth X] X
  der2R : Binary -> X -> X -> [Smooth X] X
  der1 negateE x = n (c 1)    -- d/dx (-x) = -1
  der2L plusE x y = c 1       -- d/dx (x + y) = 1
  der2L timesE x y = y        -- d/dx (x * y) = y
  der2R plusE x y = c 1       -- d/dy (x + y) = 1
  der2R timesE x y = x        -- d/dy (x * y) = x

The most basic handler we will consider is the evaluate handler. It only handles Smooth X where X is instantiated to Int.

  evaluate : {<Smooth Int>X -> X}
  evaluate x = x
  evaluate <ap0 (constE i) -> k> = evaluate (k i)
  evaluate <ap1 negateE x -> k> = evaluate (k (0 - x))
  evaluate <ap2 plusE x y -> k> = evaluate (k (x + y))
  evaluate <ap2 timesE x y -> k> = evaluate (k (x * y))

As an example, consider evaluating the term 1 + x^3 - y^2 at x = 2 and y = 4.

  evaluate (p (c 1) (p (t (t (c 2) (c 2)) (c 2)) (n (t (c 4) (c 4)))))

Evaluation proceeds one command at a time; once the c commands have been handled, we reach the following, with t 2 2 the next command in focus.

  evaluate (p 1 (p (t (t 2 2) 2) (n (t 4 4))))

We have now again reached the evaluate handler, and this time match the ap2 case, resulting in the following.

  evaluate ({x -> (p 1 (p (t x 2) (n (t 4 4))))} (2 * 2))
  evaluate ({x -> (p 1 (p (t x 2) (n (t 4 4))))} 4)
  evaluate (p 1 (p (t 4 2) (n (t 4 4))))
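The same reduction pattern, interpret one command and resume with its value, can be sketched outside Frank. The following Python dual-number class is my own illustration, not the paper's code: it mirrors the c, n, p, and t helpers and previews the forward mode pattern that the diff handler implements.

```python
# A dual-number sketch of forward mode in Python (illustrative; the paper's
# diff handler realizes the same pattern with effects). Each value carries
# (v, dv): the primal and its derivative with respect to the chosen input.

class Dual:
    def __init__(self, v, dv):
        self.v, self.dv = v, dv

def c(i):            # constants have derivative 0
    return Dual(i, 0)

def n(x):            # d(-x) = -dx
    return Dual(-x.v, -x.dv)

def p(x, y):         # d(x + y) = dx + dy
    return Dual(x.v + y.v, x.dv + y.dv)

def t(x, y):         # product rule: d(x*y) = y*dx + x*dy
    return Dual(x.v * y.v, y.v * x.dv + x.v * y.dv)

# 1 + x^3 - y^2 at x = 2 (seed dv = 1), y = 4 (seed dv = 0)
x, y = Dual(2, 1), Dual(4, 0)
r = p(c(1), p(t(t(x, x), x), n(t(y, y))))
print(r.v, r.dv)  # -7 12
```

Each operation computes the primal exactly as evaluate does, plus a derivative by the chain rule; this pairing is the essence of dual numbers.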
Evaluation will continue as such until the final answer of -7 is calculated. We have now seen how the evaluate handler interprets Smooth commands with the builtin arithmetic operations. Even though evaluate is simple, it allows us to write our other handlers in a polymorphic fashion independent of Int.

Our next handler is the diff handler, which implements forward mode AD via a method known as dual numbers. A dual number is a pair of real numbers where the second number represents the derivative of the first. The diff handler handles commands with dual number arguments. The mathematical justification of AD is not our focus, and thus we shall just focus on the patterns of computation present without proving their correctness. We define the
Dual datatype and diff below.

  data Dual X = dual X X

  v : Dual X -> X
  v (dual x _) = x

  dv : Dual X -> X
  dv (dual _ dx) = dx

  diff : {<Smooth (Dual X)>Y -> [Smooth X]Y}
  diff y = y
  diff <ap0 f -> k> =
    let r = dual (op0 f) (c 0) in diff (k r)
  diff <ap1 f x -> k> =
    let r = dual (op1 f (v x))
                 (t (der1 f (v x)) (dv x)) in diff (k r)
  diff <ap2 f x y -> k> =
    let r = dual (op2 f (v x) (v y))
                 (p (t (der2L f (v x) (v y)) (dv x))
                    (t (der2R f (v x) (v y)) (dv y))) in diff (k r)
Notice the similarities between each of the apN cases. The command being handled by diff is evaluated with opN in the first component of Dual, and a calculation involving derivatives creates the second component.

We will evaluate an example program similar to our previous one. The program will represent the same mathematical term 1 + x^3 - y^2 evaluated at x = 2 and y = 4, but additionally we shall be calculating the derivative with respect to x at this point, which is 12. This is achieved by setting x to dual 2 1 and y to dual 4 0, where x has its second component set to 1 to treat it as the differentiated variable and y has its second component set to 0 to treat it as a constant.

  evaluate (diff (
    p (c 1) (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1))
             (n (t (dual 4 0) (dual 4 0))))
  ))

Evaluation begins as before, with the c 1 command being in focus and a delimited continuation being captured. Note how the continuation captured is delimited by diff and not evaluate. This behavior is due to the effect typing system of Frank. There are two different instances of the
Smooth interface available to the portion of the program being handled. By default, the innermost handler provides the
instance being used by extending the ambient ability with an adaptor. As we shall see later, Frank provides constructs allowing us to select handlers other than the innermost. The top case of diff is matched by c 1 with the following result.

  evaluate (
    let r = dual (op0 (constE 1)) (c 0) in
    diff (
      {x -> (p x (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1))
                  (n (t (dual 4 0) (dual 4 0)))))} r)
  )
  evaluate (
    let r = dual (c 1) (c 0) in
    diff (
      {x -> (p x (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1))
                  (n (t (dual 4 0) (dual 4 0)))))} r)
  )

We now have two c commands which will be handled by evaluate, producing dual 1 0 for r's value. After handling, r will be substituted and the continuation applied.

  evaluate (diff (
    p (dual 1 0) (p (t (t (dual 2 1) (dual 2 1)) (dual 2 1))
                  (n (t (dual 4 0) (dual 4 0))))
  ))

Evaluation will then continue in a similar manner for all remaining commands. Each command will first be handled by diff, and the commands in the body of each diff case handled by evaluate, eventually producing dual -7 12.

We will now focus on Frank's ability to dynamically determine which handler handles a command. First, we define two auxiliary functions.

  lift : X -> [Smooth X, Smooth (Dual X)] (Dual X)
  lift x = dual x (<Smooth> (c 0))
The adaptor in lift redirects the c 0 command past the inner instance so that it is handled by Smooth X rather than Smooth (Dual X). The d function returns the derivative of a unary function, and lift will enable us to nest d. Note that, as in lift, the continuation for c 1 is delimited by evaluate due to an adaptor selecting the outer handler.
The evaluate and diff handlers manipulate programs by capturing delimited continuations, but only in quite simple ways. They each eventually compute a value based on the command being handled and then continue with the original program with the computed value substituted in. The reverse handler will be different, and will build up a secondary program during the evaluation of the initial program.

Reverse mode AD works by creating a mutable cell for each value which accumulates contributions to its derivative. The method of accumulation is a generalized version of the backpropagation algorithm made prominent by machine learning. We define the datatype Prop for backpropagation, where Ref X is a reference to a mutable cell containing a value of type X. The reverse handler handles commands containing Prop's.

  data Prop X = prop X (Ref X)

  fwd : Prop X -> X
  fwd (prop x _) = x

  deriv : Prop X -> Ref X
  deriv (prop _ r) = r

  reverse : {<Smooth (Prop X)>Unit -> [RefState, Smooth X]Unit}
  reverse unit = unit
  reverse <ap0 f -> k> =
    let r = prop (op0 f) (new (c 0)) in reverse (k r)
  reverse <ap1 f x -> k> =
    let r = prop (op1 f (fwd x)) (new (c 0)) in
    reverse (k r);
    write (deriv x) (p (read (deriv x))
                       (t (der1 f (fwd x)) (read (deriv r))))
  reverse <ap2 f x y -> k> =
    let r = prop (op2 f (fwd x) (fwd y)) (new (c 0)) in
    reverse (k r);
    write (deriv x) (p (read (deriv x))
                       (t (der2L f (fwd x) (fwd y)) (read (deriv r))));
    write (deriv y) (p (read (deriv y))
                       (t (der2R f (fwd x) (fwd y)) (read (deriv r))))
The reverse handler makes use of the same op and der functions as diff, but is different from evaluate and diff in two important ways. Firstly, the type of reverse shows that it requires access to the RefState interface of mutable state (a builtin effect of Frank that can be handled by the language implementation). Secondly, the body of the ap1 and ap2 cases contains code after the use of the captured delimited continuation k. We shall see these writes to memory will form the secondary program which actually accumulates derivatives.

To properly calculate derivatives with reverse, we require a small helper function which starts the process of backpropagation, which we call grad for gradient.

  grad : {(Prop X)
          -> [RefState, Smooth X, Smooth (Prop X)] (Prop X)}
         -> X -> [RefState, Smooth X] X
  grad f x =
    let z = prop x (new (c 0)) in
    reverse (write (deriv (f z)) (c 1));
    read (deriv z)
The term new (c 0) is handled first by evaluate for c 0 (returning 0), and the command new 0 is handled by the Frank implementation and returns a new reference.
The above code is the secondary program created by reverse, which performs backpropagation. Furthermore, if a user wished to capture this secondary program, the definition of reverse could be changed to return a suspended computation. Thus, we could also partially evaluate the whole program (initial and backpropagation) by running only the initial program and capturing the backpropagation computation.

It could also be possible to use multi-stage programming by reifying the initial and secondary programs as a computation graph in the style of Wang et al. [13]. Their approach uses delimited continuations and combines normal execution with building an intermediate representation. As effects and handlers are essentially a structured use of delimited continuations, a similar story for Frank may be possible.
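For readers outside Frank, the write-after-continuation pattern of reverse can be sketched in Python. This is my own illustration, not the paper's code; the Cell and tape names are assumptions. The forward pass records derivative writes, and replaying them in reverse order is the secondary backpropagation program.

```python
# A mutable-cell sketch of reverse mode in Python (illustrative of the
# pattern in `reverse`, not the paper's code): the forward pass records a
# tape of (cell, local-derivative, result-cell) writes, and running the
# tape in reverse accumulates each input's derivative (the secondary
# program built up behind the continuation k).

class Cell:
    def __init__(self, v):
        self.v = v        # primal value
        self.d = 0        # accumulated derivative

tape = []                 # backprop writes, recorded during the forward pass

def t(x, y):              # x * y, recording both partial derivatives
    r = Cell(x.v * y.v)
    tape.append((x, y.v, r))   # d/dx (x*y) = y
    tape.append((y, x.v, r))   # d/dy (x*y) = x
    return r

def n(x):                 # -x
    r = Cell(-x.v)
    tape.append((x, -1, r))    # d/dx (-x) = -1
    return r

def p(x, y):              # x + y
    r = Cell(x.v + y.v)
    tape.append((x, 1, r))
    tape.append((y, 1, r))
    return r

def grad(out):            # seed the output, then replay the tape backwards
    out.d = 1
    for cell, der, res in reversed(tape):
        cell.d += der * res.d

# 1 + x^3 - y^2 at x = 2, y = 4
x, y = Cell(2), Cell(4)
r = p(Cell(1), p(t(t(x, x), x), n(t(y, y))))
grad(r)
print(r.v, x.d, y.d)  # -7 12 -8
```

Replaying the recorded writes last-to-first corresponds to running the secondary program after the initial program finishes, exactly the shape of the code placed after k in reverse.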
The final algorithm we shall cover is checkpointed reverse mode. Reverse mode has maximum memory residency proportional to the number of operations (as seen in the definition of reverse). Checkpointed reverse mode allows a trade-off between space and time by recomputing checkpointed subprograms, once without allocating memory and an additional time with memory. However, any memory allocated in between these two runs can be safely deallocated, as it corresponds to code after the checkpointed subprogram in the original program, thus reducing maximum memory residency.

To define our new handler, we introduce a
Checkpoint effect which takes a suspended computation that will be run multiple times. We also define a simple evaluate-like handler, evaluatet (see appendix for definition).

  interface Checkpoint X =
    checkpoint :
      {[Checkpoint X, Smooth (Prop X)] Prop X} -> Prop X
Frank also contains a catch-all pattern match, which we use to define reversec by forwarding all Smooth commands received to reverse and only adding a case for checkpoint.

  reversec : {<Checkpoint X, Smooth (Prop X)>Unit
              -> [RefState, Smooth X]Unit}
Note how on the second line the reverse handler has been made the innermost delimiter of the remainder of the initial program, via the catch-all case of reversec. Additionally, note how the checkpointed code is stored as a thunk to be run after the initial program in the second use of reversec. After the initial program has been evaluated away, we obtain the following.

  evaluate (
    reversec (
The remaining checkpoint command illustrates the recursive nature of reversec. It shows how even nested checkpointing in checkpointed code can be properly evaluated by delaying the program transformation happening via evaluation.
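The trade-off between space and time can be sketched in Python. This is an illustrative analogue under my own framing, not the reversec code: a checkpoint remembers only a segment's input, the segment first runs without recording, and it is re-run with recording during the backward sweep.

```python
# An illustrative sketch of checkpointed reverse mode in Python (my own
# framing, not the paper's reversec). A checkpoint stores only a segment's
# input; during the backward sweep the segment is re-run with recording, so
# its tape exists only briefly, trading recomputation time for memory.

def square(x): return x * x            # primitive operation
def dsquare(x): return 2 * x           # its derivative

SEG = [(square, dsquare)] * 3          # the checkpointed segment: x^8

def run_plain(x):                      # first run: no tape allocated
    for f, _ in SEG:
        x = f(x)
    return x

def run_taped(x):                      # re-run: record each local derivative
    tape = []
    for f, df in SEG:
        tape.append(df(x))
        x = f(x)
    return x, tape

# forward sweep: remember only the segment's input (the checkpoint)
ckpt = 2
out = run_plain(ckpt)                  # 2 -> 4 -> 16 -> 256

# backward sweep: recompute the segment with a tape, then backpropagate
_, tape = run_taped(ckpt)
d = 1
for der in reversed(tape):
    d *= der
print(out, d)   # 256 1024  (d/dx x^8 = 8 x^7 = 8 * 128)
```

Only the checkpoint survives the forward sweep; the segment's tape is created and consumed during the backward sweep, which is exactly the residency reduction described above.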
We have seen the implementation and evaluation of AD in Frank via Frank's operational semantics and four handlers: evaluate, diff, reverse, and reversec. While evaluate and diff do effectively no program transformations, reverse and reversec build up ancillary programs via delimited continuations. The effects and handlers style of Frank allowed us to compose and nest our defined handlers, which is especially apparent in the modular definition of reversec, which delegates all Smooth commands to reverse. It may also be possible to integrate multi-stage programming by using the system of Wang et al. [13]. In conclusion, we have illustrated in Frank that effects and handlers are a good match for AD, and that effects and handlers can be seen as a form of program manipulation.
Acknowledgments
I would like to thank Sam Lindley for answering my many questions about Frank, my supervisor Chris Heunen for his support, and Ohad Kammar for conversations about this work and encouragement to improve it. I would also like to thank the reviewers for their valuable feedback.
References

[1] Christian Bischof, Peyvand Khademi, Andrew Mauer, and Alan Carle. 1996. ADIFOR 2.0: Automatic Differentiation of Fortran 77 Programs. IEEE Comput. Sci. Eng. 3, 3 (Sept. 1996), 18–32. https://doi.org/10.1109/99.537089
[2] C. H. Bischof, L. Roh, and A. J. Mauer-Oats. 1997. ADIC: an extensible automatic differentiation tool for ANSI-C. Software: Practice and Experience 27, 12 (1997), 1427–1456. https://doi.org/10.1002/(SICI)1097-024X(199712)27:12<1427::AID-SPE138>3.0.CO;2-Q
[3] Lukas Convent, Sam Lindley, Conor McBride, and Craig McLaughlin. 2020. Doo bee doo bee doo. Journal of Functional Programming. Cambridge University Press. https://doi.org/10.1017/S0956796820000039
[4] A. Griewank and A. Walther. 2008. Evaluating Derivatives. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898717761
[5] Laurent Hascoët and Mauricio Araya-Polo. 2006. Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation. arXiv:cs/0606042 (June 2006). http://arxiv.org/abs/cs/0606042
[6] Laurent Hascoët and Valérie Pascual. 2013. The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Trans. Math. Softw. 39, 3 (May 2013), 20:1–20:43. https://doi.org/10.1145/2450153.2450158
[7] P. B. Levy. 2003. Call-By-Push-Value: A Functional/Imperative Synthesis. Springer Netherlands. https://doi.org/10.1007/978-94-007-0954-6
[8] Sam Lindley, Conor McBride, and Craig McLaughlin. 2017. Do be do be do. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2017). Association for Computing Machinery, Paris, France, 500–514. https://doi.org/10.1145/3009837.3009897
[9] Valérie Pascual and Laurent Hascoët. 2008. TAPENADE for C. In Advances in Automatic Differentiation (Lecture Notes in Computational Science and Engineering), Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke (Eds.). Springer, Berlin, Heidelberg, 199–209. https://doi.org/10.1007/978-3-540-68942-3_18
[10] Gordon Plotkin and John Power. 2001. Adequacy for Algebraic Effects. In Foundations of Software Science and Computation Structures (Lecture Notes in Computer Science, Vol. 2030), Furio Honsell and Marino Miculan (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–24. https://doi.org/10.1007/3-540-45315-6_1
[11] Gordon Plotkin and Matija Pretnar. 2009. Handlers of Algebraic Effects. In Programming Languages and Systems (Lecture Notes in Computer Science, Vol. 5502), Giuseppe Castagna (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 80–94. https://doi.org/10.1007/978-3-642-00590-9_7
[12] Bert Speelpenning. 1980. Compiling fast partial derivatives of functions given by algorithms. Ph.D. dissertation. University of Illinois at Urbana-Champaign, USA. AAI8017989.
[13] Fei Wang, Daniel Zheng, James Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying differentiable programming: shift/reset the penultimate backpropagator. Proc. ACM Program. Lang. 3, ICFP (July 2019), 96:1–96:31. https://doi.org/10.1145/3341700
[14] R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (Aug. 1964), 463–464. https://doi.org/10.1145/355586.364791
Appendix
The following is the definition of evaluatet used in reversec. Note the similarities with evaluate.

  evaluatet : Ref X ->