FastAD: Expression Template-Based C++ Library for Fast and Memory-Efficient Automatic Differentiation
James Yang
Department of Statistics, Stanford University [email protected]
Abstract.
Automatic differentiation is a set of techniques to efficiently and accurately compute the derivative of a function represented by a computer program. Existing C++ libraries for automatic differentiation (e.g. Adept, Stan Math Library), however, exhibit large memory consumption and runtime performance issues. This paper introduces FastAD, a new C++ template library for automatic differentiation that overcomes all of these challenges in existing libraries by using vectorization, simpler memory management via a fully expression-template-based design, and other compile-time optimizations to remove some run-time overhead. Benchmarks show that FastAD performs 2-10 times faster than Adept and 2-19 times faster than Stan across various test cases including a few real-world examples.
Keywords: automatic differentiation · forward-mode · reverse-mode · C++ · expression templates · template metaprogramming · lazy-evaluation · lazy-allocation · vectorization.

Gradient computation plays a critical role in modern computing problems surrounding optimization, statistics, and machine learning. For example, one may wish to compute sensitivities of an Ordinary-Differential-Equation (ODE) integrator for optimization or parameter estimation [3]. In Bayesian statistics, advanced Markov-Chain-Monte-Carlo (MCMC) algorithms such as Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) rely heavily on computing the gradient of a (log) joint probability density function to update proposal samples in the leapfrog algorithm [7][10]. Neural networks rely on computing the gradient of a loss function during back-propagation to update the weights between each layer of the network [4].

Oftentimes, the target function to differentiate is extremely complicated, and it is very tedious and error-prone for the programmer to manually define the analytical formula for the gradient [9]. It is rather desirable to have a generic framework where the programmer only needs to specify the target function to differentiate it. Moreover, computing the derivative may be one of the more expensive parts of an algorithm if it requires numerous such evaluations, as is the case for the three examples discussed above. Hence, it is imperative that gradient computation is as efficient as possible. These desires motivated the development of such a framework: automatic differentiation (AD).

FastAD is a general-purpose AD library in C++ supporting both forward and reverse modes of automatic differentiation. It is highly optimized to compute gradients, but also supports computing the full Jacobian matrix. Similar to the Stan Math Library, FastAD is primarily intended for differentiating scalar functions, and it is well-known that reverse-mode is more efficient than forward-mode for these cases [3]. For these reasons, this paper will focus only on reverse-mode and computing gradients rather than forward-mode or computing a full Jacobian matrix.
Section 3 will first explain the reverse-mode automatic differentiation algorithm to give context for how FastAD is implemented. With this background, Section 4 will discuss some implementation details and draw design comparisons with other libraries to see how FastAD overcomes the common challenges seen in these libraries. In Section 5, we show a series of benchmarks against other libraries, including two real-world examples of a regression model and a stochastic volatility model.
We briefly discuss the reverse-mode automatic differentiation algorithm for context and to motivate the discussion in the next sections. For a more in-depth treatment, we direct the readers to [3][9][5].

As an example, consider the function

    f(x_1, x_2, x_3) = sin(x_1) + cos(x_2) · x_3 − log(x_3)    (1)

This function can be represented as an expression graph where each node of the graph represents a sub-expression. Fig. 1 shows the corresponding expression graph drawn to evaluate in the same order as defined by the operator precedence in the C++ standard.

Fig. 1: Expression graph for Eq. 1.

Note that, in general, a variable x_i can be referenced by multiple nodes. For example, x_3 is referenced by the * and log nodes. It is actually more helpful to convert this expression graph into an expression tree by replacing all such nodes with multiple parents as separate nodes that have a reference back to the actual variable. Mathematically,

    f(x_1, x_2, x_3) = f~(g(x_1, x_2, x_3))    (2)
    f~(w_1, w_2, w_3, w_4) = sin(w_1) + cos(w_2) · w_3 − log(w_4)
    g(x_1, x_2, x_3) = (x_1, x_2, x_3, x_3)

Fig. 2 shows the corresponding converted expression tree.

Fig. 2: Converted expression tree for Eq. 2. Nodes x_1, x_2, x_3 are now separated from the rest of the graph by a layer of w variables. Note in particular that x_3 is now replaced with w_3 and w_4 (red boundary).

This way all nodes except possibly the x_i nodes have exactly one parent and have a unique path back up to the root, if we view the tree as a directed graph. While this may seem more complicated than the original expression graph, implementation becomes much cleaner with this new approach. With the converted expression tree, we will see momentarily that we may actually start the algorithm from the w_i instead of the x_i. In fact, rather than treating the x_i as nodes of the expression graph, it is more helpful to instead treat them as containers that hold the initial values and their adjoints, ∂f/∂x_i. For this reason, we denote the path between x_i and w_i with dotted lines, and consider only up to the nodes w_i in the expression tree.

The reverse-mode algorithm consists of two passes of the expression graph: forward-evaluation (not to be confused with forward-mode AD) and backward-evaluation. During the forward-evaluation, we compute the expression in the usual fashion, i.e. start at the root, recursively forward-evaluate from left to right all of its children, then finally take those results and evaluate the current node. Evaluation of each node will compute the operation that it represents. For example, after forward-evaluating the w_1 node, which is a trivial operation of retrieving the value from the container x_1, the sin node will return sin(w_1) = sin(x_1).

The backward-evaluation for a given node starts by receiving its adjoint from its parent. We will also refer to this value as its seed to further distinguish the x_i from the expression nodes. Since the seed is exactly the partial derivative of f with respect to the node, the root of the expression tree will receive the value ∂f/∂f = 1. Using the seed, the node computes the correct seed for all of its children and backward-evaluates from right-to-left, the opposite direction of forward-evaluation.

The correct seed for each child is computed by a simple chain-rule. Assuming the current node is represented by w ∈ R and one of its children is v ∈ R, the seed for v is the following:

    ∂f/∂v = (∂f/∂w)(∂w/∂v)    (3)

Each node w has node-specific information to compute ∂w/∂v.
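To make the two passes concrete, the following is a minimal, self-contained sketch of how nodes might implement forward- and backward-evaluation for scalars; the class and member names here are illustrative only and are not FastAD's actual API.

    #include <cmath>

    // Illustrative scalar nodes for the two-pass reverse-mode algorithm
    // (hypothetical names, not FastAD's actual classes).
    struct Variable {
        double value   = 0.0;  // initial value held by the container x_i
        double adjoint = 0.0;  // accumulated adjoint, i.e. one entry of the gradient
    };

    // Leaf node w_i: an identity view onto a container x_i.
    struct LeafNode {
        Variable* var;
        double value = 0.0;

        double forward() { return value = var->value; }
        // dw_i/dx_i = 1, so the seed is added directly to the container's adjoint.
        void backward(double seed) { var->adjoint += seed; }
    };

    // Unary node for w = log(v), where v is a leaf.
    struct LogNode {
        LeafNode* child;
        double value = 0.0;

        double forward() { return value = std::log(child->forward()); }
        // Chain rule (Eq. 3): seed(v) = seed(w) * dw/dv = seed(w) / v.
        void backward(double seed) { child->backward(seed / child->value); }
    };

A full expression would be evaluated by calling forward() on the root, then seeding the root's backward() with 1 and letting each node propagate seeds to its children as described above.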
For example, for the log node in Fig. 2, with w as the log node and v as w_4,

    ∂w/∂v = d log(w_4)/dw_4 = 1/w_4

In general, if v ∈ R^{m×n} and w ∈ R^{p×q}, then

    ∂f/∂v_ij = Σ_{k=1}^{p} Σ_{l=1}^{q} (∂f/∂w_kl)(∂w_kl/∂v_ij)    (4)

In particular, the adjoint will always have the same shape and size as the value. Nodes that have references back to the containers x_i, i.e. w_1, ..., w_4, must increment the adjoints in the containers with their seeds. For example, nodes w_3 and w_4 will take their seeds and increment the adjoint for x_3. This is easily seen by the chain-rule again: let w_1, ..., w_k denote all of the variables that reference x and, for simplicity, assume they are all scalar, although the result can easily be generalized to multiple dimensions. Then,

    ∂f/∂x = Σ_{i=1}^{k} (∂f/∂w_i)(∂w_i/∂x) = Σ_{i=1}^{k} ∂f/∂w_i

where ∂w_i/∂x = 1 because w_i is simply the identity function with respect to x. The fully accumulated adjoints for x_1, x_2, x_3 form the gradient of f.

In general, an expression node can be quite general and is only required to define how to compute its value and the adjoints of its children using Eq. 4. In the general case where f : R^n → R^m, we can apply this algorithm to the scalar functions f_j for j = 1, ..., m and save each gradient as the j-th row of a matrix. The final matrix is then the Jacobian of f.

In this section, we cover a few key ideas of our implementation (github page: https://github.com/JamesYang007/FastAD). In Section 4.1, we first discuss the benefits of vectorization and the difficulties of integrating it into an AD system. We then explain how FastAD fully utilizes vectorization and demonstrate that other libraries do not fully take advantage of it. In Section 4.2, we discuss some memory management and performance issues stemming from the use of the "tape" data structure. We then explain how FastAD overcomes these challenges using expression templates and a lazy allocation strategy. Finally, Section 4.3 covers other compile-time optimizations that can further maximize performance.

Vectorization refers to the parallelization of operations on multiple data at the hardware level. On a modern Intel 64-bit processor supporting AVX, four double-precision floating point numbers can be processed simultaneously, roughly improving performance by a factor of four. While compiler optimization is sometimes able to vectorize a user's code, it is not guaranteed because vectorization requirements are quite stringent. For example, vectorization is not guaranteed if memory access is not done in a contiguous fashion, and it is impossible if there is any dependency between loop iterations. This makes it quite challenging to design an AD system that can always predict compiler optimization to create vectorized code. However, vectorization can make AD extremely fast, powerful, and practical even in complex problems. In practice, we come across many examples where operations can be vectorized during gradient computation: matrix multiplication; any reduction from a multi-dimensional variable to a scalar such as summation or product of all elements; and any unary or binary function that is applied element-wise such as exponential, logarithm, power, sin, cos, tan, and the usual arithmetic operators.

Since most of the opportunities for vectorization occur in matrix operations, the goal is to use a well-polished and efficient matrix library such as
Eigen, one of the most popular C++ matrix libraries. However, this is not an easy task. Adept 2.0 notes in its documentation that integrating an expression-template-based matrix library such as
Eigen into an AD system can be quite challenging in design.

Fig. 3: The left shows the memory layout for a matrix of univariate AD variables. This refers to the design pattern "a matrix of pairs of pointers to double", since each element of the matrix contains two pointers pointing to its value and adjoint. The right shows the memory layout for a single matrix-shaped AD variable, referring to the reverse pattern "a pair of pointers to matrices of doubles". This AD variable has two pointers, each pointing to a matrix of doubles for the value and adjoint.

To circumvent these design issues, Adept 2.0 integrates its own matrix library designed specifically for its AD system. This, however, only introduces another problem of writing an efficient matrix library, another daunting task. In fact, the author of Adept notes that matrix multiplication, one of the most widely used matrix operations, is currently very slow [8]. Stan mentions that integrating matrix libraries can lead to unexpected problems that would require extra memory allocations to resolve [3]. Other libraries such as ADOL-C, CppAD, and Sacado do not integrate any matrix libraries directly, but they do allow the use of
Eigen matrix classes simply as containers.

The one library among those mentioned previously that ultimately incorporates
Eigen is Stan. Stan provides their own “plug-ins”, which extend
Eigen classes for their AD variables. For example, in Stan, one would use
Eigen::Matrix with Stan's AD variable type as the element type to define a matrix of AD variables. However, these plug-ins do not take full advantage of the optimizations that
Eigen provides. One reason is that
Eigen is only highly optimized for matrices of primitive types such as double-precision floating-point numbers and integers. For any other class types,
Eigen defaults to a less optimized version with no guarantees of vectorization. Therefore, unless one extends all
Eigen methods with optimizations for their specific class type, they will not receive much benefit from vectorization. Another reason is that these matrices now have a heterogeneous structure where each element is an AD variable representing a pair of value and adjoint. As a result, it is not possible to read only the values (and similarly, only the adjoints) of the matrix in a contiguous fashion. The compiler then cannot guarantee any automatic vectorization.
FastAD fully utilizes the benefits of
Eigen through one simple design difference, which we refer to as shape traits. Stan, as well as the other libraries mentioned previously except Adept 2.0, follows the design pattern of "a matrix of pairs of pointers to double" when defining a matrix of AD variables. Note that a univariate AD variable internally contains two pointers to doubles, one pointing to the value and one to the adjoint. In contrast, FastAD follows the reverse pattern: "a pair of pointers to matrices of doubles". That is, rather than defining a univariate AD variable, which gets reused as an element of a matrix, we define a separate matrix AD variable containing a pair of pointers, each pointing to a matrix of doubles. Figure 3 shows the memory layout for each of the design patterns. This subtle difference provides a few important benefits. Since the value and adjoint are now represented as matrices of primitive types, matrix operations will be vectorized. The second benefit is the significant reduction of memory consumption. For other libraries, a matrix of AD variables contains two pointers for each element. However, with our design, a single matrix AD variable only stores two pointers regardless of the size of the matrix. Hence, if a matrix is m × n, the traditional approach has an O(mn) memory consumption for the pointers, while the FastAD approach has an O(1) consumption.

Using templates in C++, it is easy to unify the API for the different AD variable shapes by providing an additional template parameter:

    ad::Var<double, ad::scl> x;  // scalar shape
    ad::Var<double, ad::vec> v;  // vector shape
    ad::Var<double, ad::mat> m;  // matrix shape

Shape traits also apply to any arbitrary node because we can deduce the output shape given the input shapes. The following is a declaration of a generic node representing an element-wise unary operation, which demonstrates this idea:

    template <class Unary, class ExprType>
    struct UnaryNode:
        ValueAdjView<...>
The node is templated on the unary functor Unary and the child expression type ExprType, and it deduces its shape type from ExprType: since an element-wise unary operation always has the same shape as its input, the unary node takes on the same shape as its child.

To verify that FastAD vectorizes more than Stan, we performed reverse-AD for a simple summation function f(x) = Σ_{i=1}^{n} x_i and generated the disassembly for Stan and FastAD (benchmark code: https://github.com/JamesYang007/ADBenchmark). We extracted the part that performs the summation and compared the instructions to see whether vectorization was taking place. The following is the disassembly for Stan:

    L3178:
        movq    (%rax), %rdx
        addq    $

The instruction used to add is vaddsd, which is an AVX instruction to add scalar double-precision values. This is not a vectorized instruction, and hence the addition is not done in parallel. This portion of the disassembly is related to a specialization of an
Eigen class responsible for summation with default traversal, which is no different from a naive for-loop. Compare the above disassembly with the one generated for FastAD:
    L3020:
        ...
        addq    $64, %rax
        cmpq    %rdx, %rcx
        jg      L3020
This portion of the assembly is indeed related to the linear vectorized traversal specialization of the same
Eigen class responsible for the summation. The instruction used to add is vaddpd, which is an AVX instruction to add packed double-precision values. This is a vectorized instruction and the operation is truly done in parallel.

Sometimes, Stan is able to produce vectorized code, such as in matrix multiplication. This is consistent with our benchmark results, since Stan came closest to FastAD for this operation (see Section 5.3). It is also consistent with how it is implemented, since they allocate extra memory for double values for each matrix and the multiplication is carried out with these matrices of primitive types. However, this vectorization does come at a cost of at least 4 times more memory allocation than what FastAD allocates. Moreover, the backward-evaluation requires heap-allocating a matrix on-the-fly every time. FastAD incurs no such cost, only allocates what is needed, and never heap-allocates during AD evaluation.
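To make the layout difference concrete, the following is a minimal sketch of the two design patterns using Eigen; the struct names are illustrative and are not FastAD's actual classes.

    #include <Eigen/Dense>

    // "A matrix of pairs of pointers to double": each element owns its own
    // value/adjoint pointer pair, so the values are not contiguous and Eigen
    // falls back to an unoptimized code path for this non-primitive element type.
    struct UnivariateAD {
        double* value;
        double* adjoint;
    };
    using MatrixOfADVars =
        Eigen::Matrix<UnivariateAD, Eigen::Dynamic, Eigen::Dynamic>;

    // "A pair of pointers to matrices of doubles": one AD variable views two
    // plain matrices of doubles, so operations on the value and adjoint run
    // through Eigen's vectorized kernels for primitive types.
    struct MatrixADVar {
        Eigen::MatrixXd* value;
        Eigen::MatrixXd* adjoint;
    };

With the second layout, a single matrix AD variable carries O(1) pointer overhead regardless of the matrix size, and reads of the values (or the adjoints) are contiguous, which is what permits the packed vaddpd-style instructions shown above.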
Most AD systems manage a data structure in memory, often referred to as the "tape", to store the sequence of operations via function pointers as well as the node values and adjoints. This tape is modified dynamically and requires sophisticated memory management to efficiently reuse memory whenever possible. Stan even writes its own custom memory allocator to alleviate memory fragmentation, promote data locality, and amortize the cost of memory allocations [3]. However, the memory is not fully contiguous and may still be over-allocated. For some libraries, on top of memory management of these operations, a run-time check must be performed at every evaluation to determine the correct operation [2]. Others like Stan rely on dynamic polymorphism and look up the vtable to call the correct operation [3].
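As a schematic illustration of this design (not the actual layout of any particular library), a tape stores one record per operation, which the reverse pass later replays in reverse order:

    #include <vector>

    struct Tape;  // forward declaration

    // Schematic tape record: one operation, stored as a function pointer plus
    // slot indices of its operands and output in the value/adjoint arrays.
    struct TapeRecord {
        void (*backward_op)(Tape&, const TapeRecord&);
        int lhs, rhs, out;
    };

    struct Tape {
        std::vector<double> values;      // node values, filled by the forward pass
        std::vector<double> adjoints;    // node adjoints, filled by the reverse pass
        std::vector<TapeRecord> records; // grown dynamically as the program executes
    };

    // The reverse pass replays the records backwards, dispatching through the
    // stored function pointer (or, in Stan's case, a virtual call) per record.
    inline void reverse_pass(Tape& tape) {
        for (auto it = tape.records.rbegin(); it != tape.records.rend(); ++it)
            it->backward_op(tape, *it);
    }

The dynamic growth of the record array and the per-record indirect call are exactly the run-time costs that the expression-template design described next avoids.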
FastAD is unique in that it uses expression templates to represent the sequence of operations as a single stack-allocated object. By overloading the comma operator, we can chain expressions together into a single object. The following is an example of chaining multiple expressions:

    auto expr = (x = y * z,                        // expr 1
                 w = x * x + 3 * ad::sin(x + y),   // expr 2
                 w + z * x);                       // expr 3

Each of the three sub-expressions separated by the commas returns an expression object containing the necessary information to perform reverse-AD on its respective structure. Those expression objects are then chained by the comma operators to build a final expression object that contains the information to perform reverse-AD on all three sub-expressions in the order presented. This final object is saved into the variable expr.

Expression templates make it possible to build these expression objects containing the reverse-AD logic. Expression templates are a template metaprogramming technique that builds complex structures representing a computation at compile-time. For a full treatment of expression templates, we direct the readers to [12]. As an example, in the following case,
    Var<double, scl> x, y;
    auto expr = x + y;

x + y returns a new object of type
BinaryNode, parametrized on the addition functor and the types of its two operands (analogous to the UnaryNode declaration above), rather than a computed value.
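The following is a minimal, self-contained sketch of this expression-template idea: operator+ returns a node type that encodes the whole computation in its type rather than a computed value. The names are illustrative and heavily simplified relative to FastAD's actual node classes.

    // Minimal expression-template sketch (illustrative, scalar-only).
    struct Add {
        static double fmap(double a, double b) { return a + b; }
    };

    struct VarExpr {
        double value;
        double get() const { return value; }
    };

    template <class Binary, class LeftExpr, class RightExpr>
    struct BinaryNode {
        LeftExpr left;
        RightExpr right;
        // Forward-evaluation is composed at compile-time from the children.
        double get() const { return Binary::fmap(left.get(), right.get()); }
    };

    template <class LeftExpr, class RightExpr>
    BinaryNode<Add, LeftExpr, RightExpr> operator+(LeftExpr l, RightExpr r) {
        return {l, r};
    }

    // Usage: decltype(expr) is BinaryNode<Add, VarExpr, VarExpr>, a single
    // stack-allocated object; nothing is computed until get() is called.
    //   VarExpr x{1.0}, y{2.0};
    //   auto expr = x + y;
    //   double result = expr.get();  // 3.0

Because the entire structure of the expression is known at compile-time, the compiler can inline the whole evaluation, and no tape, function pointers, or virtual dispatch are needed at run-time.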
Using C++17 features, we can make further compile-time optimizations that could potentially save a tremendous amount of time during run-time.

One example is choosing the correct specialization of an operation depending on the shapes of the inputs. As seen in Section 4.1, all nodes are given a shape trait. Depending on the input shapes, one may need to invoke different routines for the same node. For example, the normal log-pdf node behaves quite differently depending on whether the variance parameter is a scalar σ or a (covariance) matrix Σ. Namely, if the variance has a matrix shape, we must perform a matrix inverse to compute the log-pdf, which requires different code from the scalar case. Using a C++ design pattern called Substitution-Failure-Is-Not-An-Error (SFINAE), we can choose the correct routine at compile-time. The benefit is that no time is spent during run-time choosing the routine anymore, whereas libraries like CppAD choose the routines at run-time for every evaluation of the node [2].

Another example is detecting constants in an expression. We can optimize a node by saving temporary results when certain inputs are constants, which we can check at compile-time using the C++ language feature if constexpr. These results can then be reused in subsequent AD evaluations, sometimes changing the performance by orders of magnitude. As an example, consider an expression containing a normal log-pdf node which we differentiate many times. Suppose also that the input variables to the node are x, an n-dimensional vector of constants, and µ, σ, which are scalar AD variables. In general, for every evaluation of the node, the time to forward-evaluate the log-pdf is O(n), since we must compute

    log(p(x | µ, σ)) = − (1 / (2σ²)) Σ_{i=1}^{n} (x_i − µ)² − n log(σ) + C

However, the normal log-pdf node has an opportunity to make a powerful optimization in this particular case. Since the normal family defines a two-dimensional exponential family, it admits a sufficient statistic T(x) = (x̄, S_n(x)) where

    x̄ := (1/n) Σ_{i=1}^{n} x_i ,    S_n(x) := (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Since x is a constant, the sufficient statistic can be computed once and saved for future use. Using the identity Σ_{i=1}^{n} (x_i − µ)² = n(S_n(x) + (x̄ − µ)²), the normal log-pdf forward-evaluation now only takes O(1) time given the sufficient statistic, as seen in Eq. 5 below:

    log(p(x | µ, σ)) = − (1 / (2σ²)) Σ_{i=1}^{n} (x_i − µ)² − n log(σ) + C
                     = − (n / (2σ²)) (S_n(x) + (x̄ − µ)²) − n log(σ) + C    (5)

In this section, we compare performances of various libraries against FastAD for a range of examples (benchmark code: https://github.com/JamesYang007/ADBenchmark). The following is an alphabetical list of the libraries used for benchmarking:

– Adept 2.0.8 [8]
– ADOL-C 2.7.2 [6]
– CppAD 20200000 [2]
– FastAD 3.1.0
– Sacado 13.0.0 [1]
– Stan Math Library 3.3.0 [3]

All the libraries are at their most recent release at the time of benchmarking. These libraries have also been used by others [3][9][8].

All benchmarks were run on a Google Cloud Virtual Machine with the following configuration:

– OS: Ubuntu 18.04.1
– Architecture: x86 64-bit
– Processor: Intel Xeon Processor
– Cores: 8
– Compiler: GCC 10
– C++ Standard: 17
– Compiler Optimization Flags: -O3 -march=native

All benchmarks cover the case where a user wishes to differentiate a scalar function f for different values of x.
This is a very common use-case. For example, in the Newton-Raphson method, we have to compute f′(x_n) at every iteration with the updated value x_n. In HMC and NUTS, the leapfrog algorithm frequently updates a quantity called the "momentum vector" ρ with ∇_θ log(p(θ, x)) (x here is treated as a constant), where θ is a "position vector" that also gets frequently updated.

Our benchmark drivers are very similar to the ones used by Stan [3]. As noted in [9], there is no standard set of benchmark examples for AD, but since Stan is most similar to FastAD in both design and intended usage, we wanted to keep the benchmark as similar as possible.

We measure the average time to differentiate a function with an initial input and save the function evaluation result as well as the gradient. After timing each library, the gradients are compared with a manually-written gradient computation to check accuracy. All libraries had some numerical issues for some of the examples, but the maximum proportion of error relative to the actual gradient values was negligible. Hence, in terms of accuracy, all libraries were equally acceptable. Finally, all benchmarks were written in the most optimal way for every library based on their documentation.

Every benchmark also times the "baseline", which is a manually-written forward evaluation (computing the function value). This serves as a way to measure the extra overhead of computing the gradient relative to computing the function value. This approach was also used in [3]; however, in our benchmark, the baseline is also optimized to be vectorized. Hence, our results for all libraries with respect to the baseline may differ from those in the reference.

Sections 5.1-5.4 cover some micro-benchmarks where we benchmark commonly used functions: summation and product of elements in a vector, log-sum-exponential, matrix multiplication, and normal log-pdf. Sections 5.5-5.6 cover some macro-benchmarks where we benchmark practical examples: a regression model and a stochastic volatility model.

The summation function is defined as f_s : R^n → R where f_s(x) = Σ_{i=1}^{n} x_i, and the product function is defined as f_p : R^n → R where f_p(x) = Π_{i=1}^{n} x_i. We also tested the case where we forced all libraries to use a manually-written for-loop-based summation and product. Fig. 4 shows the benchmark results.

Fig. 4: Sum and product benchmarks of other libraries against FastAD plotted relative to FastAD average time. Panels: (a) Sum, (b) Sum (naive, for-loop-based), (c) Product, (d) Product (naive, for-loop-based). Fig. 4a, 4c use built-in functions whenever available. Fig. 4b, 4d use the naive iteration-based code for all libraries.

In all four cases, it is clear that FastAD outperforms all libraries for all values of N by a significant factor. For sum, the next three fastest libraries, asymptotically, are Adept, Stan, and CppAD, respectively. The trend stabilizes starting from N = 2^6 = 64, where Adept is about 10 times slower than FastAD and Stan about 15 times. The naive, for-loop-based benchmark shown in Fig. 4b exhibits a similar pattern, although the performance difference with FastAD is less extreme: Adept is roughly 4 times slower. The prod and prod_iter cases (Fig. 4c, 4d) again show a similar trend as sum and sum_iter. Overall, the comparison with FastAD is less extreme. For prod, Adept is about 5 times slower than FastAD, and Stan about 7 times. For prod_iter, Adept is about 3 times slower and Stan roughly 4 times.
The log-sum-exp function is defined as f : R^n → R where f(x) = log(Σ_{i=1}^{n} e^{x_i}). Fig. 5 shows the benchmark results.

Fig. 5: Log-sum-exp benchmark of other libraries against FastAD plotted relative to FastAD average time.

FastAD outperforms all libraries for all values of N. The next three fastest libraries are Adept, Stan, and CppAD, respectively. The trend stabilizes starting from N = 2^6 = 64, where Adept is roughly 2 times slower. FastAD also remains close to the baseline across all values of N, despite calls to expensive functions like log and exp. In fact, even at the largest value of N, FastAD stays close to the baseline, in part because the exp expression is reused during the backward-evaluation.

For the matrix multiplication benchmark, all matrices are square matrices of the same size. In order to have a scalar target function, we add another step of summing all of the entries of the matrix multiplication, i.e.

    f(A, B) = Σ_{i=1}^{K} Σ_{j=1}^{K} (A · B)_ij

Fig. 6 shows the benchmark results.

Fig. 6: Matrix multiplication benchmark of other libraries against FastAD plotted relative to FastAD average time.

FastAD is still the fastest library for all values of N, but Stan performs much closer to FastAD than in the previous examples. All other libraries consistently take longer than both FastAD and Stan as N increases. Towards the end, at the largest values of N, Stan is about 2 times slower. For moderately sized N, Stan is about 3 times slower.

This example really highlights the benefits of vectorization. As noted in Section 4.1, this was the one benchmark example where Stan was able to produce vectorized code, which is consistent with Figure 6 showing that Stan is the only library within the same order of magnitude as FastAD. Other libraries did not produce vectorized code.

The comparison with the baseline shows that FastAD takes
3.27 times longer than the baseline. Note that forward-evaluation requires one matrix multiplication between two K × K matrices, and backward-evaluation additionally requires two matrix multiplications of the same order, one for each adjoint:

    ∂f/∂A = ∂f/∂(A · B) · B^T ,    ∂f/∂B = A^T · ∂f/∂(A · B)

Hence, in total, one AD evaluation requires three matrix multiplications between two K × K matrices. If we approximate a manually-written gradient computation to take three times as long as the baseline (one multiplication), the FastAD time relative to this approximated time is 3.27/3 =
1.09. This shows that FastAD only has about 9% overhead relative to a manually-written code, which is extremely optimal.
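The adjoint rule used above can be written down directly; the following is a small illustrative sketch (not FastAD's actual matrix-multiplication node) of the forward and backward passes for C = A · B using Eigen.

    #include <Eigen/Dense>

    // Illustrative adjoint computation for a function f depending on C = A * B.
    // Assumes A_adj and B_adj are zero-initialized to the shapes of A and B.
    struct MatMulAdjoints {
        Eigen::MatrixXd A, B;          // operand values
        Eigen::MatrixXd A_adj, B_adj;  // operand adjoints

        // Forward pass: one K x K matrix multiplication.
        Eigen::MatrixXd forward() const { return A * B; }

        // Backward pass: given the seed df/dC, two more multiplications:
        //   df/dA = (df/dC) * B^T,   df/dB = A^T * (df/dC).
        void backward(const Eigen::MatrixXd& seed) {
            A_adj += seed * B.transpose();
            B_adj += A.transpose() * seed;
        }
    };

This is where the count of three K × K multiplications per AD evaluation comes from: one in forward() and two in backward().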
Fig. 7: Normal log-pdf benchmark of other libraries against FastAD plotted relative to FastAD average time.
The normal log-pdf function is defined up to a constant as:

    f(x) = − (1 / (2σ²)) Σ_{i=1}^{N} (x_i − µ)² − N log(σ)

For this benchmark, µ and σ are kept as constants, with σ =
1.37. Fig. 7 shows the benchmark results.

FastAD is the fastest library for all values of N. The trend stabilizes at around N = 2^7 = 128. Towards the end, Adept is about 9 times slower, and Stan about 19 times slower. Comparing all libraries, the overall difference we see in this example is the largest we have seen so far, and this is partly due to how we compute log(σ). Section 4.3 showed that we can check at compile-time whether a certain variable is a constant, in which case we can perform a more optimized routine. Here, since σ is a constant, the node computes the normalizing constant log(σ) once, and it gets reused over multiple AD evaluations with no additional cost during runtime, which saves a lot of time since the logarithm is a relatively expensive function.
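A minimal sketch of this kind of compile-time constant detection with if constexpr follows; the type and member names are illustrative and are not FastAD's actual log-pdf node.

    #include <cmath>
    #include <type_traits>

    // Illustrative input types: a compile-time-known constant vs. an AD variable.
    struct ConstantScalar { double value; };
    struct ADScalar       { double value; double adjoint; };

    template <class T> struct is_constant : std::false_type {};
    template <> struct is_constant<ConstantScalar> : std::true_type {};

    template <class SigmaType>
    struct NormalLogPdfNode {
        SigmaType sigma;
        double cached_log_sigma = 0.0;
        bool cached = false;

        double log_sigma() {
            if constexpr (is_constant<SigmaType>::value) {
                // Constant sigma: log(sigma) is computed on the first evaluation
                // only and reused for every subsequent AD evaluation.
                if (!cached) { cached_log_sigma = std::log(sigma.value); cached = true; }
                return cached_log_sigma;
            } else {
                // sigma is an AD variable: it may change, so recompute every time.
                return std::log(sigma.value);
            }
        }
    };

The branch is selected at compile-time, so the non-constant case pays no run-time check, and the constant case reduces the per-evaluation cost of the normalizing term to a cached read.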
Fig. 8: Bayesian linear regression benchmark of Stan against FastAD plotted relative to FastAD average time.

This section marks the first macro-benchmark example. We consider the following Bayesian linear regression model:

    y ∼ N(X · w + b, σ²)
    w ∼ N(0, ·)
    b ∼ N(0, ·)
    σ ∼ Unif(·, ·)

The target function is the log of the joint probability density function (up to a constant). Fig. 8 shows the benchmark results.

FastAD outperforms Stan by roughly 8 times for all values of N, and Adept by 10 times. The trend stabilizes starting from around N = 70. It is interesting to see that FastAD is only about 2 times slower than the baseline; this is in part because no adjoint needs to be computed for the constant data matrix X, since constants implement a no-op for backward-evaluation.

If we assume that the most expensive operation is the matrix multiplication, AD evaluation approximately takes two matrix multiplications between a matrix and a vector. We can then approximate a lower bound for the manually-written gradient computation time to be two times that of the baseline. The relative time of FastAD to this approximated time is
1.1, implying about 10% overhead from a manually-written code.
Fig. 9: Stochastic volatility benchmark of Stan against FastAD plotted relative to FastAD average time.
This section marks the second and last macro-benchmark example. We consider the following stochastic volatility model taken from the Stan user guide [11]:

    y ∼ N(0, e^h)
    h_std ∼ N(0, 1)
    σ ∼ Cauchy(0, ·)
    µ ∼ Cauchy(0, ·)
    φ ∼ Unif(−1, 1)
    h = h_std · σ
    h[0] = h[0] / √(1 − φ²)
    h = h + µ
    h[i] = h[i] + φ · (h[i−1] − µ),  i > 0

The target function is the log of the joint probability density function (up to a constant), differentiated with respect to h_std, h, φ, σ, µ. Fig. 9 shows the benchmark results.

FastAD outperforms Adept by 2 times and Stan by roughly 2 times for all values of N. The trend seems to stabilize as N grows. It is interesting to see that FastAD actually performs better than the baseline for moderate to large N values. For a complicated model such as this, there are many opportunities for FastAD to cache certain evaluations for constants, as mentioned in Sections 4.3 and 5.4. In particular, the exponential function e^h reuses its forward-evaluated result, and many log-pdfs cache the log of their parameters, such as log(σ) in the normal log-pdfs and log(γ) in the Cauchy log-pdfs (σ, γ are the second parameters of their respective distributions, which are constant in this model). Note that this caching is done automatically in FastAD, and it would be tedious to code manually for the baseline. Hence, this shows that, due to automatic caching, FastAD forward- and backward-evaluation combined can be faster than a manually-written forward evaluation only, which puts FastAD at an optimal performance.

In this paper, we first introduced the reverse-mode automatic differentiation algorithm to give context and background on how FastAD is implemented. We then discussed how FastAD uses vectorization to boost the overall performance, expression templates and a lazy allocation strategy to simplify memory management, and compile-time checks to further reduce run-time. To see how FastAD performs in practice, we ran a rigorous set of micro- and macro-benchmarks against other popular C++ AD libraries and showed that FastAD consistently achieved 2 to 19 times faster speed than the next two fastest libraries across a wide range of examples.
FastAD is still quite new and does not yet have full support for all commonly-used functions, probability density functions, and matrix decompositions. While FastAD is currently optimized for CPU performance, its design can also be extended to support GPUs. Having support for both processors would allow FastAD to be well-suited for extremely large-scale problems as well. Although computing the Hessian would be easy to implement in FastAD, generalizing to higher-order derivatives seems to pose a great challenge, and this feature could be useful in areas such as physics and optimization problems. Lastly, one application that could potentially benefit greatly from FastAD is probabilistic programming languages such as Stan, which heavily use automatic differentiation for differentiating scalar functions.
We would like to give special thanks to the members of the Laplace Lab at Columbia University, especially Bob Carpenter, Ben Bales, Charles Margossian, and Steve Bronder, as well as Andrea Walther, Eric Phipps, and Brad Bell, the maintainers of ADOL-C, Sacado, and CppAD, respectively, for their helpful comments on the first draft and for taking the time to optimize the benchmark code for their respective libraries. We also thank Art Owen and our colleagues Kevin Guo, Dean Deng, and John Cherian for useful feedback and corrections on the first draft.
References

6. Griewank, A., Juedes, D., Utke, J.: Algorithm 755: ADOL-C: a package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software 22(2), 131–167 (Jun 1996). https://doi.org/10.1145/229473.229474
7. Hoffman, M.D., Gelman, A.: The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. http://arxiv.org/abs/1111.4246v1
8. Hogan, R.J.: Fast reverse-mode automatic differentiation using expression templates in C++. ACM Transactions on Mathematical Software 40(4) (2014)