Notes on Computational Graph and Jacobian Accumulation
Yichong [email protected]

December 24, 2020
Abstract
The optimal calculation order of a computational graph can be represented by a set of algebraic expressions. Computational graphs and algebraic expressions have close relations as well as significant differences. This paper looks into these relations and differences, making plain their interconvertibility. By revealing different types of multiplication relations in algebraic expressions and their elimination dependencies in the line-graph, we establish a theoretical limit on the efficiency of face elimination.
1. Face Elimination and D* Algorithm
After a sparse computational graph or subgraph is linearized, i.e., all the derivatives denoted by edges and vertices in the computational graph are evaluated at a point, there are many elimination techniques to accumulate derivative values, including vertex elimination, edge elimination and face elimination.

Naumann states in his paper that all the elimination techniques mentioned above "are based on the elimination of transitive dependences between variables in F. A variable v_j depends transitively on v_i via v_k if i ≺ k ≺ j. In general, there is no structural representation for eliminating such dependences in G; that is, it cannot be expressed by modifying either V or E. A richer data structure is required, namely, a directed variant of the line graph of G." "All intermediate vertices belong to exactly two directed complete bipartite subgraphs of G̃. They are minimal in one and maximal in the other. Intermediate vertices in G are mapped onto complete bipartite subgraphs (or bicliques) K_{ν,μ} of G̃," he continues to explain. "The elimination of transitive dependences can be interpreted as the elimination of edges in G̃. The modification of the dual c-graph is referred to as face elimination in order to distinguish between edge elimination in G̃ and G." ([7] 2003, p5-8)

Griewank and Walther state in their book that in edge elimination "we are forced to simultaneously merge them with all their predecessor or successor edges, respectively. Sometimes we may wish to do that only with one of them. However, that desire cannot be realized by a simple modification of the linearized computational graph. Instead we have to consider the so-called line-graph of the computational graph with a slight extension." They continue: "Geometrically, we may interpret the edges of the line-graph as faces of the extended computational graph. Therefore, we will refer to them as faces." ([1] 2008, p204)

Fig. 1.1: Line-Graph and Biclique
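The line-graph construction the quotes describe can be sketched in a few lines: each edge of G becomes a vertex of G̃, and each adjacent edge pair through an intermediate vertex becomes an edge of G̃. The following is an illustrative sketch of our own (the example graph and its labels are hypothetical, not the graph of Fig. 1.1):

```python
from collections import defaultdict

def line_graph(edges):
    """Each edge of G becomes a vertex of ~G; every pair of edges (u,v), (v,w)
    meeting at an intermediate vertex v becomes one edge of ~G."""
    out_edges = defaultdict(list)
    for (u, v) in edges:
        out_edges[u].append((u, v))
    lg = []
    for (u, v) in edges:
        for succ in out_edges[v]:
            lg.append(((u, v), succ))  # one ~G edge per adjacent edge path through v
    return lg

# hypothetical diamond DAG: v1 -> v2, v1 -> v3, v2 -> v4, v3 -> v4
G = [(1, 2), (1, 3), (2, 4), (3, 4)]
print(line_graph(G))  # [((1, 2), (2, 4)), ((1, 3), (3, 4))]
```

An intermediate vertex with ν incoming and μ outgoing edges contributes ν·μ such line-graph edges, which is exactly the biclique K_{ν,μ} mentioned in the quote.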
The transitive dependences Naumann talks about are actually the adjacent edge paths in the computational graph, i.e., one in-going edge and one out-going edge of a vertex form an in-out vertex path; Fig. 1.1 (a) contains such adjacent edge paths. As multiple paths can overlap on the same edge or vertex, elimination of one edge can result in the elimination of multiple paths: two adjacent edge paths that overlap on a shared edge are both destroyed when that edge is eliminated. It therefore seems impossible to eliminate transitive dependences or adjacent edge paths one by one in the computational graph. However, the adjacent edges connected by one vertex in G form a complete bipartite subgraph (or biclique) in G̃, and every adjacent edge path in G has an edge in G̃ representing it; e.g., Fig. 1.1 (b) is the line-graph of the computational graph in Fig. 1.1 (a), and each edge of Fig. 1.1 (b) represents one adjacent edge path in Fig. 1.1 (a). An adjacent edge path can thus be eliminated individually in the corresponding line-graph, i.e., face elimination in G̃ is equivalent to adjacent edge path elimination in G.

In an article uploaded in 2017, Naumann and Utke make a further explanation: "AD exploits the associativity of the chain rule to compute derivatives on the basis of the dependency information structurally represented by paths in the computational graph. The derivative computation can be represented by elimination steps in the computational graph." "Face elimination reduces the granularity of an elimination step to a single multiplication operation." "We define the optimal Jacobian accumulation (OJA) under chain rule arithmetic as a computation of Bauer's formula with minimal multiplication count." "Until now there has been no proof that one can attain the minimum with face elimination either. Face elimination does not cover all possibilities of computing Bauer's formula, particularly the commutativity and distributivity of the multiplications.
We outline a proof showing that there is no exploit of chain rule arithmetic manipulations in Bauer's formula that can undercut an optimal face elimination, thereby establishing that face elimination can indeed solve OJA." ([8] p1-2)

As for the rule of face elimination, Naumann describes it as follows. For all intermediate faces (i, j) ∈ Ẽ_Z:

"(1). If there exists a vertex k ∈ Ṽ such that P_k = P_i and S_k = S_j, then set c_k = c_k + c_j c_i (absorption); else Ṽ = Ṽ ∪ {k′} with a new vertex k′ such that P_{k′} = P_i and S_{k′} = S_j (fill-in), labeled with c_{k′} = c_j c_i.
(2). Remove edge (i, j) from Ẽ.
(3). Remove i ∈ Ṽ if it is isolated. Otherwise, if there exists a vertex i′ ∈ Ṽ such that P_{i′} = P_i and S_{i′} = S_i, then set c_i = c_i + c_{i′} (merge); remove i′.
(4). Repeat Step (3) for j ∈ Ṽ." ([7] 2003, p8)

In step (1) of Naumann's rule, the absorption c_k = c_k + c_j c_i means that we can perform this operation by modifying and reusing k instead of creating a new vertex. On the principle of reusing vertices, when |S_i| = 1, i.e., S_i = {j}, we can perform the fill-in operation by c_i = c_j c_i and S_i = S_j; when |P_j| = 1, i.e., P_j = {i}, we can perform the fill-in operation by c_j = c_j c_i and P_j = P_i.

Griewank and Walther also presented their version of modification rules; they explained that these modifications will not alter the accumulated values a_{ij}, "but hopefully simplifying their computation".
In addition to the merge and remove operations similar to those in Naumann's description, they added the "interior vertex split" operation: "make a copy of any interior vertex by assigning the same value and the same predecessor or successor set while splitting the other set, i.e., reattaching some of the incoming or some of the outgoing edges to the new copy." They continued to explain: "we can split and merge until every interior vertex e is simply connected in that it has exactly one incoming and one outgoing edge connecting it to the vertices o_j and d_i for which then c_e = a_{ij}." "Applying the modifications described above, one may generate new directed graphs that are no longer line-graphs of an underlying computational graph." ([1] p205)

According to the splitting Griewank and Walther mentioned, in addition to those presented by Naumann, there are more operations that can be performed in the line-graph. If P_k = P_i and S_k ⊂ S_j, we can perform absorption by c_k = c_k + c_j c_i, S_j = S_j − S_k, without removing (i, j). If P_k ⊂ P_i and S_k = S_j, we can perform absorption by c_k = c_k + c_j c_i, P_i = P_i − P_k, without removing (i, j).

If P_k = P_i and S_k ⊃ S_j, we can perform the fill-in operation by
Ṽ = Ṽ ∪ {k′}, c_{k′} = c_j c_i + c_k, S_k = S_k − S_j if |S_i| > 1 ∩ |P_j| > 1;
c_i = c_j c_i + c_k, S_k = S_k − S_j, S_i = S_j if |S_i| = 1 ∩ |P_j| ≥ 1;
c_j = c_j c_i + c_k, S_k = S_k − S_j, P_j = P_i if |S_i| ≥ 1 ∩ |P_j| = 1.

If P_k ⊃ P_i and S_k = S_j, we can perform the fill-in operation by
Ṽ = Ṽ ∪ {k′}, c_{k′} = c_j c_i + c_k, P_k = P_k − P_i if |S_i| > 1 ∩ |P_j| > 1;
c_i = c_j c_i + c_k, P_k = P_k − P_i, S_i = S_j if |S_i| = 1 ∩ |P_j| ≥ 1;
c_j = c_j c_i + c_k, P_k = P_k − P_i, P_j = P_i if |S_i| ≥ 1 ∩ |P_j| = 1.
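The absorption-versus-fill-in choice in step (1) of the rule can be sketched operationally. The data structure below, a dict of (P, S, c) triples, is our own simplification for illustration, not Naumann's representation:

```python
def eliminate_face(vertices, faces, i, j):
    """One face-elimination step on face (i, j).
    vertices: {id: (P, S, c)} with P, S frozensets and c the local derivative."""
    Pi, Si, ci = vertices[i]
    Pj, Sj, cj = vertices[j]
    k = next((v for v, (P, S, _) in vertices.items()
              if P == Pi and S == Sj and v not in (i, j)), None)
    if k is not None:                      # absorption: c_k = c_k + c_j * c_i
        P, S, c = vertices[k]
        vertices[k] = (P, S, c + cj * ci)
    else:                                  # fill-in: new vertex k' labeled c_j * c_i
        vertices[max(vertices) + 1] = (Pi, Sj, cj * ci)
    faces.discard((i, j))                  # step (2): remove the face (i, j)

# two line-graph vertices forming a single face (1, 2); values hypothetical
V = {1: (frozenset({'x'}), frozenset({2}), 2.0),
     2: (frozenset({1}), frozenset({'y'}), 3.0)}
F = {(1, 2)}
eliminate_face(V, F, 1, 2)
print(V[3], F)  # fill-in vertex with c = 3.0 * 2.0 = 6.0; the face is removed
```

The isolated-vertex removal and merge of steps (3) and (4) are omitted here to keep the sketch minimal.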
If there exists a vertex i′ ∈ Ṽ such that P_{i′} ⊇ P_i and S_{i′} = S_i, then we can perform the merge operation by c_i = c_i + c_{i′}, P_{i′} = P_{i′} − P_i; or if P_{i′} = P_i and S_{i′} ⊇ S_i, then we can perform the merge operation by c_i = c_i + c_{i′}, S_{i′} = S_{i′} − S_i.

In an article published in 2002, Griewank and Naumann explain the disadvantages of face elimination: "it is clear that face elimination offers an even larger number of choices than edge elimination"; "a poor choice of edge or face elimination sequences might lead to an exponential growth of the intermediate graph structure and thus the overall cost. In other words the extra freedom may lead one astray, unless suitable safeguards can be found." "If implemented dynamically, it also requires examining and possibly reconnecting all predecessors of the face's origin and all successors of its destination. Concerning the ratio between the number of floating point operations and the number of memory accesses, dynamic face elimination is thus a true nightmare." "we may view edge and vertex eliminations as chunkings of face eliminations to achieve a better ratio between computation and communication operations." "in any case we must therefore strive for a static implementation, where all decisions regarding the elimination sequence and the necessary memory allocations as well as some of the address and pointer calculation are performed at compile time. They may then be hardwired into a code for evaluating the Jacobian F′(x) efficiently at many different arguments x". ([3] 2002, p8-9)

Brian Guenter states in his paper that "D* computes efficient symbolic derivatives by symbolically executing the expression graph at compile time to eliminate common subexpressions and by exploiting the special nature of the graph that represents the derivative of a function." "factorable subgraphs are defined by a dominator or postdominator node at a branch in the graph. If a dominator node b has more than one child, or if a post-dominator node b has more than one parent, then b is a factor node. If c is dominated by a factor node b and has more than one parent, or c is postdominated by b and has more than one child, then c is a factor base of b. A factor subgraph [b, c] consists of a factor node b, a factor base c of b, and those nodes on any path from b to c." He then describes a factoring algorithm to eliminate factor subgraphs in the derivative graph and convert them into subgraph edges, adding that "factorization does not change the value of the sum of products expression which represents the derivative so factor subgraphs can be factored in any order". He says: "The solution to the problem of common factor subgraphs is to count the number of times each factor subgraph [i, j] appears in the nm derivative graphs. The factor subgraph which appears most often is factored first. If factor subgraph [k, l] disappears in some derivative graphs as a result of factorization then the count of [k, l] is decremented. To determine if factorization has eliminated [k, l] from some derivative graph f_{ij} it is only necessary to count the children of a dominator node or the parents of a postdominator node. If either is one the factor subgraph no longer exists. The counts of the [k, l] are efficiently updated during factorization by observing if either node of a deleted edge is either a factor or factor base node. Ranking of the [k, l] can be done efficiently with a priority queue." ([5] 2007, p1-7)
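Guenter's factor nodes rest on a standard dominator computation. A minimal iterative dominator-set sketch for a rooted DAG (our own illustration, not D*'s actual implementation):

```python
def dominators(succ, root):
    """succ: {v: [children]}; returns {v: set of dominators of v}.
    Iterates dom(v) = (intersection of dom over preds) | {v} to a fixed point."""
    verts = list(succ)
    preds = {v: [u for u in verts if v in succ[u]] for v in verts}
    dom = {v: set(verts) for v in verts}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in verts:
            if v == root or not preds[v]:
                continue
            new = set.intersection(*(dom[p] for p in preds[v])) | {v}
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

# diamond: r branches to a, b which rejoin at c, so dom(c) = {r, c};
# in D*'s terms r is a factor node (two children) with factor base c
d = dominators({'r': ['a', 'b'], 'a': ['c'], 'b': ['c'], 'c': []}, 'r')
print(d['c'])  # {'r', 'c'} (set order may vary)
```

Postdominators, which define the factor bases in the other direction, are the same computation on the reversed graph.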
2. Differentiation Graph
We realize that, in d y/d x, the relation between d y and d x is a partial order d y ≥ d x. E.g., for y = f(g(x)), if we write d f/d g and d g/d x as d f → d g and d g → d x respectively, the transitive relation of the partial order between d f and d x is given by the chain rule:

d f/d x = (d f/d g)(d g/d x)  ⟹  d f → d x = d f → d g → d x

d f → d x means d f depends on d x, or that there is a path of dependence from d f to d x. The relation between adjacent arrows in the path is multiplication. To get the value of d f → d x, we multiply all the derivatives of adjacent pairs along the path from d f to d x, i.e.,

d f → d g_1 → d g_2 → ··· → d g_n → d x  ⟹  (d f/d g_1)(d g_1/d g_2)···(d g_n/d x)

As for the multivariate chain rule, e.g., y = f(g_1(x), g_2(x), g_3(x)), the derivative d f/d x would be like this:

d f/d x = (d f/d g_1)(d g_1/d x) + (d f/d g_2)(d g_2/d x) + (d f/d g_3)(d g_3/d x)

Notice that d f is divided into three different paths through d g_1, d g_2 and d g_3, and we sum up the products of the different paths in the end. The relation between different paths is addition, i.e., the paths d f → d g_1 → d x, d f → d g_2 → d x, ..., d f → d g_n → d x together give

(d f/d g_1)(d g_1/d x) + (d f/d g_2)(d g_2/d x) + ··· + (d f/d g_n)(d g_n/d x)

If there is another layer h between the g_i and x, e.g., y = f(g_1(h(x)), g_2(h(x)), g_3(h(x))), then each path d f → d g_i → d h → d x contributes one product:

d f/d x = (d f/d g_1)(d g_1/d h)(d h/d x) + (d f/d g_2)(d g_2/d h)(d h/d x) + (d f/d g_3)(d g_3/d h)(d h/d x)

Fig. 2.1: The transformation of differentiation expression
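The multi-path chain rule above can be checked numerically; the concrete functions below are our own illustrative choices, not from the paper:

```python
# y = f(g1, g2, g3) = g1*g2 + g3 with g1 = x^2, g2 = x^3, g3 = 2x,
# so y = x^5 + 2x and dy/dx = 5x^4 + 2. Sum one product of partials per path.
def df_dx(x):
    g1, g2 = x**2, x**3
    path1 = g2 * (2 * x)      # (df/dg1)(dg1/dx)
    path2 = g1 * (3 * x**2)   # (df/dg2)(dg2/dx)
    path3 = 1 * 2             # (df/dg3)(dg3/dx)
    return path1 + path2 + path3

print(df_dx(2))  # 5 * 2**4 + 2 = 82
```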
We notice that the above structure forms a rooted tree hierarchy. In practice, instead of seeing h in different branches as different subexpressions with the same structure, computing d h/d x and multiplying it by the remaining derivatives in each branch respectively, we would like to factor out the common factor, and compute and multiply by it once for all:

(d f/d g_1)(d g_1/d h)(d h/d x) + (d f/d g_2)(d g_2/d h)(d h/d x) + (d f/d g_3)(d g_3/d h)(d h/d x)
⟹  ((d f/d g_1)(d g_1/d h) + (d f/d g_2)(d g_2/d h) + (d f/d g_3)(d g_3/d h)) (d h/d x)

Similar to the above factorization, when we factor out the common factor d h/d x, we merge the d h nodes in the graph.

Fig. 2.2

Notice that we merged not only all the d h nodes, but also all the d x nodes and the edges between d h and d x. The same rule also applies to all the leaf nodes of the same variable, because d x/d x = 1:

(d f/d g_1)(d g_1/d x) + (d f/d g_2)(d g_2/d x) + (d f/d g_3)(d g_3/d x)
⟹  ((d f/d g_1)(d g_1/d x) + (d f/d g_2)(d g_2/d x) + (d f/d g_3)(d g_3/d x)) (d x/d x)

Fig. 2.3
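Counting multiplications makes the benefit of merging the d h nodes concrete; the numeric values below are arbitrary stand-ins for the partial derivatives:

```python
# Compare (a1*b1*c + a2*b2*c + a3*b3*c) with (a1*b1 + a2*b2 + a3*b3)*c:
# factoring out the shared factor c = dh/dx saves one multiplication per branch.
def unfactored(dfdg, dgdh, dhdx):
    value = sum(f * g * dhdx for f, g in zip(dfdg, dgdh))  # 2 mults per branch
    return value, 2 * len(dfdg)

def factored(dfdg, dgdh, dhdx):
    inner = sum(f * g for f, g in zip(dfdg, dgdh))         # 1 mult per branch
    return inner * dhdx, len(dfdg) + 1                     # plus 1 final mult

v1, m1 = unfactored([1, 2, 3], [4, 5, 6], 7)
v2, m2 = factored([1, 2, 3], [4, 5, 6], 7)
print(v1, m1, v2, m2)  # same value 224, but 6 vs 4 multiplications
```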
Notice that we removed the d x/d x self loop in the end. The parentheses "()" operator sees the inside differentiation expression as a whole and evaluates it before any adjacent outside operation. Because we see x as the same variable during the process of factoring out the common factor, it naturally merges all the d x nodes to evaluate.

We notice that there is a close correspondence between differentiation graph and differentiation expression, which can be converted to each other. Although a differentiation graph and a differentiation expression are basically the same thing in different forms, there are some differences between them. The most obvious one is the terminal variable. In an algebraic differentiation expression, we use the same symbol in different positions to denote that they are the same, whereas in a differentiation graph the terminal variable can be in different positions, e.g., the trees in Fig. 2.2 and Fig. 2.3; it can also be in the same position, e.g., the DAGs in Fig. 2.2 and Fig. 2.3.

Note how we arrived at the rooted DAG in Fig. 2.2: we see the rooted tree as an intermediate graph, just like how we manipulate the function expression, first constructing the tree hierarchy, then factoring out the common factor by merging vertices. The rooted tree is more similar to the algebraic expression in terms of using the same variable in different positions; e.g., the trees in Fig. 2.2 and Fig. 2.3 have exactly the same structures as the corresponding algebraic expressions.

So what does d y → d x mean? In fact, we can further simplify the notation by removing 'd': y → x. The arrow → denotes that variable y depends on variable x, and we take the derivative of y with respect to x.

Fig. 2.4: Remove 'd'
We find that the trace of the multi-path chain rule is the trace of dependence in the function expression. Therefore a differentiation graph is also a dependence graph of function expressions. For a differentiation graph, if we define the direction of edges in paths as vertical and the direction of aligning sibling paths as horizontal, then according to the multi-path chain rule there are multiplication operators between edges in the vertical direction and addition operators between sibling paths aligned in the horizontal direction. However, we cannot compute the root value or an intermediate value of a function expression using its dependence graph, as there are no operators in a dependence graph.

Note that if we reverse the direction of all the edges in Fig. 2.4 (c), it becomes a computational graph. Computational graph is more of a generalized term, which means it is not limited to the computation of derivatives. A differentiation graph is a computational graph, just as we say a differentiation expression is an algebraic expression. As Parr and Howard have mentioned in their paper ([9], 2018), the opposite arrow direction denotes the direction of data flow. In the diagrams of the following sections, we will still use the direction of dependence as the default direction.
3. Notations and Terminology
Let G = (V, E) be a differentiation graph, with v_i, v_j ∈ V and v_i ≺ v_j. We use P_{v_j→v_i} to denote the set of all the paths from vertex v_j to v_i, and p_{v_j→v_i} (∈ P_{v_j→v_i}) to denote a path in P_{v_j→v_i}. |P_{v_j→v_i}| is the number of paths in P_{v_j→v_i}; |p_{v_j→v_i}| is the length of path p_{v_j→v_i}, i.e., the number of edges in the path. Then the formula of the multi-path chain rule, or Bauer's formula, can be rewritten as:

∂v_y/∂v_x = Σ_{p_j ∈ P_{v_y→v_x}} Π_{e_i ∈ p_j} e_i

We use e_i to denote the i-th edge in E, and e_i^- and e_i^+ to denote the source node and destination node of directed edge e_i respectively. We also use ⟨j, i⟩ or ⟨v_j, v_i⟩ to denote the directed edge from vertex v_j to v_i.

We use P_{e_j→e_i} to denote the set of all the paths from edge e_j to e_i. Apparently, P_{e_j→e_i} ≡ P_{e_j^+→e_i^-}. We use v_i^- and v_i^+ to denote the incoming edges and outgoing edges of intermediate vertex v_i respectively. Therefore the set of all adjacent edge paths between the incoming edges and outgoing edges of v_i is P_{v_i} ≡ P_{v_i^-→v_i^+}, and ∀p_k ∈ P_{v_i}, |p_k| = 2. We use v_i^{--} (or v_i^=) to denote all the predecessors of v_i, and v_i^{++} to denote all the successors of v_i. Let V′ ⊆ V; then V′^= is the set of all the predecessors of vertices in V′, and V′^{++} is the set of all the successors of vertices in V′.

Let Y, Z, X be the roots, intermediates and terminals respectively, Y ∪ Z ∪ X = V. Then P_{Y→X} is the set of all the paths from roots Y to terminals X. We use Y^+ to denote all the outgoing edges from Y, Y^{++} to denote all the successors of the roots, X^- to denote all the incoming edges to X, and X^= to denote all the predecessors of the terminals. Y^- = ∅, Y^= = ∅, X^+ = ∅, X^{++} = ∅.
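Bauer's formula as written can be executed directly on a small graph; the edge derivative values below are arbitrary illustrations:

```python
from math import prod

def bauer(edges, y, x):
    """Sum, over all paths from v_y to v_x, of the product of edge derivatives.
    edges: {(u, v): derivative value}."""
    succ = {}
    for (u, v), c in edges.items():
        succ.setdefault(u, []).append((v, c))
    def paths(u):
        if u == x:
            yield []
            return
        for v, c in succ.get(u, []):
            for rest in paths(v):
                yield [c] + rest
    return sum(prod(p) for p in paths(y))

# diamond: y -> a -> x and y -> b -> x
E = {('y', 'a'): 2, ('y', 'b'): 3, ('a', 'x'): 5, ('b', 'x'): 7}
print(bauer(E, 'y', 'x'))  # 2*5 + 3*7 = 31
```

Enumerating all paths this way is exponential in general, which is exactly why the elimination orderings discussed in this paper matter.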
∀v_i ∈ V: if v_i^- = ∅ ∪ v_i^= = ∅, then v_i ∈ Y; if v_i^+ = ∅ ∪ v_i^{++} = ∅, then v_i ∈ X.

We use D^-(v_i) to denote the in-degree of v_i, i.e., the number of incoming edges of v_i, and D^+(v_i) to denote the out-degree of v_i, i.e., the number of outgoing edges of v_i. ∀x_i ∈ X, D^+(x_i) = 0; ∀y_j ∈ Y, D^-(y_j) = 0; ∀v_i ∈ V, D^-(v_i) ≥ 0, D^+(v_i) ≥ 0. We use D_r^-(v_i) to denote the r-degree of v_i, i.e., the number of root vertices that have at least one path to v_i, and D_t^+(v_i) to denote the t-degree of v_i, i.e., the number of terminal vertices to which v_i has at least one path. We use X_i to denote the set of terminals to which v_i has at least one path, and Y_i to denote the set of roots that have at least one path to v_i. Note that X_i and Y_i are the concepts of index domain and index range mentioned in [1] (p147). D_r^-(v_i) = |Y_i|, D_t^+(v_i) = |X_i|.

Notice that there can be more than one path going through one edge. We use O_P(e_k) to denote the overlapping degree of an edge e_k ∈ E in a set of paths P, i.e., the number of paths in P going through edge e_k. Therefore, ∀v_i ∈ Z,

O_{P_{v_i}}(e_k) = D^+(v_i) if e_k ∈ v_i^-;  O_{P_{v_i}}(e_k) = D^-(v_i) if e_k ∈ v_i^+.

∀e_k ∈ E, O_P(e_k) ≥ 0.

Definition 3.1.
In a differentiation graph G = (V, E), if there are at least two separate paths between vertex v_i and v_j that do not share any intermediate vertex at the same graph location, and all the intermediate vertices in P_{v_i→v_j} have no outgoing edge to or incoming edge from vertices that are not in P_{v_i→v_j}, then all the vertices (including v_i and v_j) and edges in all those paths from v_i to v_j form a block b⟨v_i, v_j⟩. v_i is the source of b⟨v_i, v_j⟩ and v_j is the sink of b⟨v_i, v_j⟩. If there is a block between v_i and v_j, it is a differentiation graph of d v_i/d v_j.

Definition 3.2.
Assume that there are two different vertices v_i and v_j in a differentiation graph G = (V, E). A path p_{v_i→v_j} is a direct chain or direct simple chain e′⟨v_i, v_j⟩ if it has at least 1 intermediate vertex and the in-degree and out-degree of every intermediate vertex in p_{v_i→v_j} are both 1. If, after every block whose intermediate vertices do not include v_i or v_j is replaced by a substitute edge in G, the path p_{v_i→v_j} is a direct simple chain that contains at least 1 substitute edge, then all the non-substitute edges and all the vertices in p_{v_i→v_j}, together with all the vertices and edges in the corresponding blocks of those substitute edges in p_{v_i→v_j}, form an indirect chain c⟨v_i, v_j⟩. Both direct chains and indirect chains are called chains. If all the corresponding blocks of those substitute edges in p_{v_i→v_j} are simple blocks, then c⟨v_i, v_j⟩ is an indirect simple chain e′⟨v_i, v_j⟩. Both direct simple chains and indirect simple chains are called simple chains. If a chain c⟨v_i, v_j⟩ is not a simple chain, then it is a complex chain. v_i is the source of c⟨v_i, v_j⟩ and e′⟨v_i, v_j⟩; v_j is the sink of c⟨v_i, v_j⟩ and e′⟨v_i, v_j⟩.

Note that an edge could be a path of length 1, but here in our definition it is not a chain, and an indirect chain has more than one path.

Definition 3.3.
In a differentiation graph G = (V, E), if every outgoing edge of vertex v_i belongs to a simple chain to v_j or connects to v_j, and at least 2 outgoing edges of v_i are not in the same simple chain, then all the vertices (including v_i and v_j) and edges in all those paths from v_i to v_j form a simple block e⟨v_i, v_j⟩. v_i is the source of e⟨v_i, v_j⟩ and v_j is the sink of e⟨v_i, v_j⟩. If all the simple chains in e⟨v_i, v_j⟩ are direct simple chains, then e⟨v_i, v_j⟩ is a direct simple block, otherwise it is an indirect simple block. If a block b⟨v_i, v_j⟩ is not a simple block, then it is a complex block.

Definition 3.4. In a differentiation graph G = (V, E), all the root vertices are in the same first level of depth, and all the terminal vertices are in the same last level of depth. The depth level of an intermediate vertex is the length of the longest path among all the paths from root vertices to that vertex. We use L(v_i) to denote the depth level of vertex v_i. If vertex v_i has a directed edge e connected to vertex v_j and L(v_j) − L(v_i) > 1, then e is a cross-level edge. We use L(Y) to denote the first level of depth, L(X) to denote the last level of depth, and L(G) to denote the depth of G. L(Y) = 0, L(G) = L(X).

Fig. 3.1: Cross-level edge segmentation
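Definition 3.4's depth level (longest path from a root) and the cross-level test take only a few lines; the example graph below is a hypothetical one of our own:

```python
from functools import lru_cache

def depth_levels(edges, vertices):
    """L(v) = 0 for roots, else 1 + the maximum depth level over predecessors."""
    preds = {v: [] for v in vertices}
    for u, v in edges:
        preds[v].append(u)
    @lru_cache(maxsize=None)
    def L(v):
        return 0 if not preds[v] else 1 + max(L(u) for u in preds[v])
    return {v: L(v) for v in vertices}

def cross_level_edges(edges, levels):
    """An edge is cross-level when it skips at least one depth level."""
    return [e for e in edges if levels[e[1]] - levels[e[0]] > 1]

E = [('y', 'a'), ('a', 'b'), ('b', 'x'), ('y', 'x')]
levels = depth_levels(E, ['y', 'a', 'b', 'x'])
print(levels)                        # {'y': 0, 'a': 1, 'b': 2, 'x': 3}
print(cross_level_edges(E, levels))  # [('y', 'x')], spanning 3 levels
```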
We can segment a cross-level edge by filling the positions of the depth levels crossed by it with vertices derived from its head or tail vertex, turning it into a direct simple chain; e.g., Fig. 3.1 (b) and (c) are segmentation results of the cross-level edge in Fig. 3.1 (a). In a differentiation graph in which all the cross-level edges are segmented, every two adjacent depth levels form a bipartite subgraph or local Jacobian. Note that the multiplications by the filler derivative 1 introduced by segmentation are unnecessary, just like the calculation of zero entries in a local Jacobian multiplication.

We denote the set of all vertices in depth level l by V_l, e.g., Y = V_0, X = V_h, where h is the depth level of the terminals. Let V_a, V_b and V_c (a < b < c) be the sets of vertices in different depth levels; then for v_i ∈ V_b, the set of vertices in V_c to which v_i has at least one path is V_{ci} or X_{ci}, and the set of vertices in V_a that have at least one path to v_i is V_{ai} or Y_{ai}, where X_i = V_{hi} = X_{hi} and Y_i = V_{0i} = Y_{0i}.

We use J_{ik} to denote the Jacobian between depth levels i and k, e.g., the global Jacobian is J_{0,L(X)}. However, local Jacobians are not limited to two different depth levels. We use J⟨S_1|S_2⟩ to denote the Jacobian between two sets of vertices, e.g., J⟨Y|X⟩ is the global Jacobian. We use J⟨v_a, v_b|v_c, v_d⟩ to denote the local Jacobian between v_a, v_b and v_c, v_d, or J⟨v_i|v_i^{++}⟩ to denote the local Jacobian between v_i and its successors.

Definition 3.5.
Let G′ = (U, V, E′) be a bipartite graph or subgraph with directed edges from U to V; after we remove all the directions of the directed edges in E′, it becomes an undirected graph G = (V, E). If ∃v_i ∈ G such that ∀v_j ∈ G with v_j ≠ v_i, P_{v_i→v_j} ≠ ∅, then G′ is an undivided bipartite graph or subgraph.

4. Factorization and Conversion

Fig. 4.1: Simple Block and Complex Block
The conversion between differentiation graph and algebraic expression is obvious when the graph is simple. E.g., in Fig. 4.1 (a) there are two simple blocks e⟨v_1, v_4⟩ and e⟨v_4, v_7⟩ chained together, and we can immediately write down its algebraic expression:

d v_1/d v_7 = (e_1 e_3 + e_2 e_4)(e_5 e_7 + e_6 e_8)   (4.1)

where an edge label e_i denotes the derivative of its source node with respect to its destination node, e.g., e_1 = ∂v_1/∂v_2. We notice that a simple block in a differentiation graph can be converted to a subexpression block with parentheses in an algebraic expression.

However, it is difficult to get an algebraic expression for Fig. 4.1 (b) at one glance, because it is a complex block. The differentiation graph is more flexible when it comes to the same intermediate variables (vertices) or derivatives (edges): it can always merge them, whereas in a single algebraic expression, due to its limitation of expressivity, there is no direct representation or corresponding counterpart of a complex block.

One solution for converting a complex differentiation graph to an algebraic expression is to transform it into a simple differentiation graph with simple chains and simple blocks, which can be converted to an algebraic expression directly. E.g., Fig. 4.1 (c) is a direct simple block that lists all paths from root to terminal in Fig. 4.1 (a). Its algebraic expression would be like this:

d v_1/d v_7 = (e_1 e_3 e_5 e_7 + e_1 e_3 e_6 e_8 + e_2 e_4 e_5 e_7 + e_2 e_4 e_6 e_8)   (4.2)

But there is a difference between the resulting algebraic expressions: it takes 12 fma's to compute Eq. (4.2) while only 5 fma's for Eq. (4.1).

There are two ways to transform Fig. 4.1 (b). One way is to first transform it to a direct simple block that lists all the paths from root to terminal, then merge these vertices and edges again, making sure only simple blocks or simple chains are generated. The other way is to break down the overly merged graph into a more convertible one that has no complex blocks. E.g., Fig. 4.2 demonstrates how we factorize Fig.
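The 5-versus-12 fma comparison generalizes to a chain of k simple blocks, each containing w parallel two-edge chains; a quick count (our own sketch, charging one fma per multiply or multiply-accumulate):

```python
def ops_factored(k, w=2):
    # w two-edge products per block (w fma's per block),
    # then k - 1 multiplications to chain the block values together
    return k * w + (k - 1)

def ops_expanded(k, w=2):
    # w**k root-to-terminal paths, each a product of 2k edge labels
    return w**k * (2 * k - 1)

print(ops_factored(2), ops_expanded(2))  # 5 12, matching Eq. (4.1) and Eq. (4.2)
```

The gap widens quickly: for k = 3 blocks the factored form needs 8 fma's against 40 for the expanded sum over paths.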
4.1 (b).

Fig. 4.2: Complex Block Factorization
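The factorization steps in this section repeatedly split vertices whose edges carry overlapping paths. The basic split operation, in the spirit of Griewank and Walther's interior vertex split, can be sketched as follows (edge-list representation and naming scheme are our own):

```python
def split_vertex(edges, v):
    """Duplicate vertex v so each copy keeps one incoming edge and a copy of
    all outgoing edges, reducing every copy's in-degree to 1."""
    incoming = [e for e in edges if e[1] == v]
    outgoing = [e for e in edges if e[0] == v]
    new = [e for e in edges if v not in e]
    for n, (u, _) in enumerate(incoming):
        copy = v if n == 0 else f"{v}#{n}"   # hypothetical names for the copies
        new.append((u, copy))
        new += [(copy, w) for (_, w) in outgoing]
    return new

# v has in-degree 2, so 2 paths overlap on edge (v, x); split removes the overlap
E = [('a', 'v'), ('b', 'v'), ('v', 'x')]
print(split_vertex(E, 'v'))
# [('a', 'v'), ('v', 'x'), ('b', 'v#1'), ('v#1', 'x')]
```

Backward factorization applies this split bottom-up to vertices with in-degree greater than 1; forward factorization does the mirror image with out-degrees.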
There are two methods to factorize a complex block: one is forward factorization, the other is backward factorization.

To apply backward factorization, we first compute the in-degree of every intermediate vertex in the block, treating every simple block whose intermediate vertices do not include the vertex being computed as an edge; e.g., in Fig. 4.1 (a), the in-degree of vertex v_4 is then 1. Then, from bottom to top, we search for vertices whose in-degrees are greater than 1 and split from behind their outgoing edges, reducing the overlapping degrees of those outgoing edges to 1. E.g., in Fig. 4.1 (b), the two predecessors of the sink both have in-degree 2, which means 2 paths overlap on each of their outgoing edges; we split from behind those edges by creating duplicates of the two predecessor vertices and their outgoing edges in locations different from the originals, connecting the duplicated edges to the new vertices respectively, as shown in Fig. 4.2 (a). After that, we adjust the in-degrees of the vertices and continue to search for the next target vertex from among the predecessor vertices of the current position; e.g., one remaining vertex in Fig. 4.2 (a) still has in-degree 2. Since the structure below it is a simple block, we regard it as one edge and split it just like how we split the edges in the previous step, as shown in Fig. 4.2 (b).

Forward factorization is basically the same as backward factorization except in the opposite direction. Instead of tracking the in-degrees of intermediate vertices, searching for target vertices from bottom up and splitting from behind an edge, it tracks out-degrees, searches for target vertices from top down and splits from the head of an edge. Fig. 4.2 (c) and (d) illustrate the process of forward factorization.

We can quickly convert a simple block to an algebraic expression. The algebraic expression of Fig. 4.2 (b) has the form:

d v/d v = e(e e e + e(e e + e e)) + e(e e e + e(e e + e e))   (4.3)

The algebraic expression of Fig.
4.2 (d) has the form:

d v/d v = (e e e + (e e + e e) e) e + (e e e + (e e + e e) e) e   (4.4)

Note that in Eq. (4.3) and Eq. (4.4), the repeated intermediate subexpressions of the form (e e + e e) cannot be factored out, despite the fact that they can be merged in the graph. In practice, we would like to use some intermediate variables to offset this limitation of a single algebraic expression; e.g., letting s = e e + e e, Eq. (4.3) becomes:

d v/d v = e(e e e + e s) + e(e e e + e s)   (4.5)

If we put (e e + e e) under Eq. (4.5) and draw directed edges from each occurrence of s to it, denoting '=' or reference, it becomes a DAG. We notice that multiple algebraic expressions connected by intermediate variables are effectively a DAG of references. During the process of complex block factorization, in order to take advantage of the reference structure of multiple algebraic expressions, for every simple block that splits like an edge, we can create a new reference edge representing it, and replace the block with that edge before splitting; e.g., in Fig. 4.3, a new edge s is created to replace a simple block.

Fig. 4.3: Factorization with Reference
A differentiation graph is path-oriented: edges and vertices are the carriers of paths, and multiple paths can overlap on the same edges and vertices. In a direct simple chain, all the edges and vertices form only one path; therefore elimination of the path is equivalent to elimination of the corresponding edges and vertices. We notice that simple chains and simple blocks can be nested structures like algebraic expressions. In an optimized arithmetic expression, the optimal calculation order is from the inside out, and so it is for simple blocks and simple chains. When the innermost direct simple chains and direct simple blocks are eliminated and replaced by shortcut edges, the second innermost indirect simple chains and indirect simple blocks become direct simple chains and direct simple blocks. We can repeat this all the way to the outermost simple chains and simple blocks. Thus, for simple blocks and simple chains, elimination of paths is equivalent to elimination of the corresponding edges and vertices as long as it follows the optimal calculation order. Notice that in Fig. 4.3 (a), the simple block and its sum-of-products subexpression of the form (e e + e e) are equivalent: the latter is the algebraic form and the former is the graph form. We can easily convert a simple block or a simple chain to its corresponding algebraic form, and then deduce it back again from the algebraic expression we have; they are interconvertible. It is worth noting that the commutativity of the multiplication relation in an algebraic expression, e.g., e_i e_j and e_j e_i, makes sense, while in a differentiation graph it makes no sense because e_i ≻ e_j, or e_i^- ≻ e_i^+ = e_j^- ≻ e_j^+. Therefore, the precondition of their interconvertibility is to impose a noncommutativity restriction of the multiplication relation on the algebraic expressions converted from a differentiation graph.

A × B × C × D   (4.6)

where A is the 1×2 row of the root's out-edge derivatives, B and C are 2×2 local Jacobians, and D is the 2×1 column of the terminal's in-edge derivatives. Notice that we can convert Fig. 4.1 (b) to this form of local Jacobians, as shown in Eq.
(4.6). If we use (AB) to denote the matrix product of A and B, then (AB)CD is Fig. 4.2 (c), ((AB)C)D is Fig. 4.2 (d), AB(CD) is Fig. 4.2 (a), and A(B(CD)) is Fig. 4.2 (b). The accumulation process of ABCD is the factorization process of the differentiation graph in Fig. 4.1 (b). If we accumulate it from left to right, i.e. (((AB)C)D), the result is Eq. (4.4); from right to left, i.e. (A(B(CD))), the result is Eq. (4.3). Similar to the reference edge in Fig. 4.3, we can also use reference variables to replace the entries in (AB). It is worth noting that there are more than just the two accumulation orders discussed above: we can also accumulate local Jacobians from both sides or from the middle, e.g., ((AB)(CD)) or ((A(BC))D), which means we can also factorize complex blocks from both sides or from the middle. Different accumulation orders result in different algebraic expressions with different structures; our goal is to find an algebraic expression that requires fewer fma's to compute. Note that local Jacobian accumulation merges or eliminates depth levels through matrix multiplications. Although it can automatically factorize complex blocks by treating every entry of the resulting local Jacobian as an independent simple block or simple chain and packing their accumulations into one atomic operation, it ignores the optimal calculation order inside the simple chains and simple blocks, which means it is up to us to find a better accumulation order. E.g., in Fig. 4.1 (a), we prefer (J J)(J J) to J(J J)J, because the latter splits v into 4 vertices, resulting in Fig. 4.1 (c), which requires more fma's to compute.

Fig. 4.4:
Conflicting depth accumulation order
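The cost comparison between accumulation orders such as (((AB)C)D), (A(B(CD))) and ((AB)(CD)) can be sketched numerically. The matrix dimensions below are illustrative assumptions, not the actual sizes in Fig. 4.1 or Fig. 4.4; they are chosen so that accumulating from the middle is visibly cheaper.

```python
# Sketch: counting fma's for different accumulation orders of a
# local-Jacobian chain A*B*C*D (dimensions are assumed, not the paper's).

def fma_cost(dims, order):
    """dims: per-factor (rows, cols) of the chain, left to right.
    order: positions at which the current chain's i-th and (i+1)-th
    factors are multiplied next. A dense (p x q) * (q x r) product
    costs p*q*r fma's."""
    dims = list(dims)
    total = 0
    for i in order:
        (p, q), (q2, r) = dims[i], dims[i + 1]
        assert q == q2, "inner dimensions must agree"
        total += p * q * r
        dims[i:i + 2] = [(p, r)]  # replace the pair by its product's shape
    return total

chain = [(2, 4), (4, 1), (1, 4), (4, 2)]          # A, B, C, D
print(fma_cost(chain, [0, 0, 0]))   # ((AB)C)D -> 32
print(fma_cost(chain, [2, 1, 0]))   # A(B(CD)) -> 32
print(fma_cost(chain, [0, 1, 0]))   # (AB)(CD) -> 20, cheaper from the middle
```

With these shapes, both one-sided orders cost 32 fma's while the middle grouping costs 20, mirroring the preference for (J J)(J J) over J(J J)J in the text.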
When accumulating local Jacobians of different depth levels, different subgraphs may sometimes have conflicting optimal accumulation orders. E.g., in Fig. 4.4 (a), the optimal accumulation order for the left subgraph is (J J)(J J)J, while for the right subgraph it is (J J)(J J). Although depth local Jacobian multiplication can be performed separately in unrelated subgraphs that span the same depth levels, local Jacobians are not limited to two different depth levels; we can also apply non-depth local Jacobian accumulation. E.g., the result of J⟨v, v | v, v, v, v⟩ × J⟨v, v, v, v | v, v⟩ = J⟨v, v | v, v⟩ is Fig. 4.4 (b), where s = e e + e e and s = e e + e e. In fact, we can first eliminate all the simple chains and simple blocks in a differentiation graph using reference edges, and then eliminate depth levels by performing depth local Jacobian multiplication when accumulating a layered dense graph or subgraph.

Fig. 4.5:
Chain-Block Elimination in the Line-Graph
Note that we can perform face elimination in the line-graph in the same way we factorize the differentiation graph, splitting vertices explicitly and then eliminating edges, as shown in Fig. 4.5 (a) (b) (c). We can also perform face elimination using the rules presented by Naumann, splitting vertices implicitly, as shown in Fig. 4.5 (d) (e) (f). The first way corresponds more closely to the algebraic expression, while the second way costs less when implementing the process in a program, because it creates fewer new vertices.
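The construction underlying both variants is the directed line-graph itself: every edge of G becomes a vertex of G̃, and two of them are joined whenever the first edge's head is the second edge's tail. A minimal sketch, on an assumed diamond-shaped toy graph rather than any figure from the paper:

```python
# Sketch: the directed line-graph ~G of a computational graph G.
# Each edge of G is a vertex of ~G; <e1, e2> is an edge (a face)
# whenever e1's head equals e2's tail.

def line_graph(edges):
    """edges: (tail, head) pairs of G; returns the edge set of ~G."""
    return {(e1, e2) for e1 in edges for e2 in edges if e1[1] == e2[0]}

G = [("v0", "v1"), ("v0", "v2"), ("v1", "v3"), ("v2", "v3")]
for face in sorted(line_graph(G)):
    print(face)
# Each intermediate vertex (v1, v2) shows up as a complete bipartite
# subgraph between its incoming and outgoing edges (here trivially K_{1,1}).
```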
5. Multi-Root or Multi-Terminal
So far we have only considered the situation (|Y| = 1) ∩ (|X| = 1). When |Y| > 1 or |X| > 1, as Griewank and Walther put it, "… ( j, i ) but across all nonzero partials a_ij." ([1] 2008, p196) We need to identify the common sub-chains or sub-blocks nested in root-terminal chains or blocks, and replace them with reference edges. The purpose of multiple blocks or chains being intertwined together is to merge the same common sub-chains or sub-blocks. If the outermost blocks or chains have no nested common sub-chains or sub-blocks, there is no need to put them together to compute; they can easily be separated from each other. Just as local Jacobian accumulation treats every entry of the resulting local Jacobian independently, another solution is to simply list, or completely split, all the subgraphs of root-terminal pairs from the differentiation graph as the entries of the root-terminal Jacobian, since they do not share any common sub-chain or sub-block that needs to be computed first. E.g., we can list all the root-terminal pairs in Fig. 1.1 (a). In other words, if we know the optimal common sub-chain or sub-block combination set, we can compute them separately. The upper bound is to ignore the common sub-chains or sub-blocks, separating root-terminal pairs directly and optimizing them individually. We notice that different combination sets of root-terminal chains or blocks may have different sets of common sub-chains or sub-blocks. Let A = {(v_i, v_j) | v_i ∈ Y, v_j ∈ X}, B = {S | S ⊆ A}, C = {S | S ⊆ B}; then |B| = 2^{|A|} = 2^{|X|·|Y|} and |C| = 2^{|B|} = 2^{2^{|X|·|Y|}}. When |X| and |Y| increase, |B| increases exponentially and |C| increases doubly exponentially, and it becomes extremely difficult to conduct a brute-force search for an optimal combination that requires the minimum number of fma's to compute the root-terminal Jacobian.
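The double-exponential growth of the candidate spaces is easy to check numerically; the sizes of X and Y below are arbitrary small examples:

```python
# Sketch: growth of the search spaces from the text, where
# |A| = |X|*|Y|, |B| = 2**|A| (subsets of root-terminal pairs)
# and |C| = 2**|B| (sets of such subsets).

for nx, ny in [(1, 1), (2, 2), (2, 3)]:
    a = nx * ny
    b = 2 ** a
    c = 2 ** b
    print(f"|X|={nx} |Y|={ny}: |A|={a} |B|={b} |C|={c}")
```

Already at |X| = 2, |Y| = 3 we get |B| = 64 and |C| = 2^64, which is why brute force over C is hopeless.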
However, the combinations in B are not equally important; we need to narrow down the search space, e.g., find the largest common sub-chains or sub-blocks overlapped by the most root-terminal pairs, where 'largest' means the largest number of fma's needed to compute it. Another problem is that when a solution is found, how do we prove that it is optimal? What if there is another solution that requires fewer fma's to calculate? It is hard to verify whether a solution is optimal, especially when |X| · |Y| is large. What we can do is find a solution that requires as few fma's as possible. Note that for dynamic value accumulation in a linearized differentiation graph, pursuing the optimal calculation order is not necessarily the best strategy: if the cost of searching for an optimal calculation order exceeds the calculation cost it saves, we would rather not do it. However, for a static symbolic differentiation graph that is evaluated on many different inputs, the optimal calculation order matters.

Fig. 5.1:
Multi-root and Multi-terminal Differentiation Graph and Line Graph

… P_{v_−→v}, i.e., all the vertices and edges that are irrelevant to the calculation of ∂v_−/∂v; then all the relevant vertices and edges in P_{v_−→v} form an indirect simple chain e′⟨v_−, v⟩. We can quickly write down its algebraic form: e(e e + e e)e. However, after finishing the calculation or conversion, we find ourselves in an awkward situation: it cannot be eliminated directly, because ∀ e_k ∈ P_{v_−→v}, O_{P_{Y→X}}(e_k) > 1. Similarly, in its corresponding line-graph, Fig. 5.1 (b), it cannot be eliminated directly either, because ∀ e_k ∈ P_{y→x}, O_{P_{Ỹ→X̃}}(e_k) > 1.

Fig. 5.2:
Multi-Roots or Multi-Terminals Factorization
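The overlap condition above — an edge lying on more than one root-terminal path, O(e_k) > 1 — can be computed with a standard path-counting sketch. The toy DAG below is an assumption for illustration, not Fig. 5.1 itself:

```python
# Sketch: the overlap count O(e) of each edge, i.e. how many
# root-terminal paths pass through it. An edge with O(e) > 1
# cannot be eliminated directly.

from collections import defaultdict
from functools import lru_cache

def path_overlap(edges, roots, terminals):
    succ, pred = defaultdict(list), defaultdict(list)
    for tail, head in edges:
        succ[tail].append(head)
        pred[head].append(tail)

    @lru_cache(maxsize=None)
    def down(v):  # number of paths from v down to any terminal
        return 1 if v in terminals else sum(down(w) for w in succ[v])

    @lru_cache(maxsize=None)
    def up(v):    # number of paths from any root down to v
        return 1 if v in roots else sum(up(u) for u in pred[v])

    return {e: up(e[0]) * down(e[1]) for e in edges}

edges = [("y1", "a"), ("y2", "a"), ("a", "x1"), ("a", "x2")]
print(path_overlap(edges, {"y1", "y2"}, {"x1", "x2"}))
# every edge lies on 2 of the 4 root-terminal paths
```

An edge's overlap is simply (paths from the roots to its tail) × (paths from its head to the terminals), which is why the two memoized counts suffice.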
Our strategy for finding common sub-chains and sub-blocks:

1. If there are simple chains or simple blocks, replace them with reference edges. E.g., after replacing all the simple chains in Fig. 5.2 (a), it becomes Fig. 5.2 (b).

2. Find a vertex v_i that has the largest r-degree, ∀ v_k ∈ V_{L(v_i)}, v_k ≠ v_i: D_r^-(v_k) < D_r^-(v_i), in the depth level closest to the first level; find a vertex v_j in P_{v_i→X} that has the largest t-degree, ∀ v_k ∈ V_{L(v_j)}, v_k ≠ v_j: D_t^+(v_k) < D_t^+(v_j), in the depth level closest to the last level. E.g., in Fig. 5.2 (b), we find that v_i = v_j = v.

3. If Y_i = Y and X_j = X, go to step 4. If Y_i ≠ Y or X_j ≠ X, then split all the edges and intermediate vertices in P_{Y_i→X_j}, i.e. P_{Y_i^+→X_j^-}, into a new page from the current page, and go to step 6. E.g., we split Fig. 5.2 (b) into two pages, one of which is Fig. 5.2 (c), and the edge between v_− and v is on the other page. Note that we do not split roots and terminals when splitting pages. All the roots are still in the same line of the first depth level, and all the terminals are still in the same line of the last depth level. Different pages share the same lines of roots and terminals.

4. If L(v_i) = L(v_j) or (|P_{v_i→v_j}| = 1) ∩ (|p_{v_i→v_j}| = 1), i.e., ||P_{v_i→v_j}|| = 1, then separate root-terminal pairs as follows. If there are cross-level outgoing edges from root vertices or cross-level incoming edges to terminal vertices, then let E′ be the set of all outgoing edges that cross the most levels from roots and separate P_{E′→X^-} from the current page, or let E′ be the set of all incoming edges that cross the most levels to terminals and separate P_{Y^+→E′} from the current page. E.g., we separate Fig. 5.2 (c) into Fig. 5.2 (d) and Fig. 5.2 (e), and separate Fig. 5.2 (e) into Fig. 5.2 (g) and Fig. 5.2 (h). Otherwise, if L(v_i) − L(Y) ≥ L(X) − L(v_j), select a set of roots Y′ and separate P_{Y′^+→X^-}; if L(v_i) − L(Y) ≤ L(X) − L(v_j), select a set of terminals X′ and separate P_{Y^+→X′^-}. Go to step 6.

5. If L(v_i) < L(v_j) and ||P_{v_i→v_j}|| > 1, split P_{(V_{L(v_i)}^j)^+→(V_{L(v_j)}^i)^-} into a new page and go to step 6 if corresponding v_i and v_j can be found in it; otherwise factorize the vertices and edges in P_{V_{L(v_i)}^j→V_{L(v_j)}^i} locally and replace simple chains or simple blocks with reference edges until L(v_i) = L(v_j) or ||P_{v_i→v_j}|| = 1. E.g., we factorize the vertices and edges between V and V in Fig. 5.2 (d) and replace simple chains and simple blocks with reference edges; the resulting page is Fig. 5.2 (f).

6. Apply the above process to the separated pages until there is only one root-terminal pair or no such v_i and v_j can be found in each page. Merge the same root-terminal pairs on different pages.

Fig. 5.3

s = e e (5.1)
s = e e (5.2)
s = (e e + e e) (5.3)
s = e(s e + e s) = e(e e e + e(e e + e e)) (5.4)
s = e(s e + e s) = e(e e e + e(e e + e e)) (5.5)
∂v_−/∂v = e(s + e e)e = e(e e + e e)e (5.6)
∂v_−/∂v = e e e e (5.7)
∂v_−/∂v = s e = e(e e e + e(e e + e e))e (5.8)
∂v_−/∂v = s e = e(e e e + e(e e + e e))e (5.9)
∂v_−/∂v = e(e e + s)e = e(e e + e e)e (5.10)
∂v_−/∂v = e e e e (5.11)
∂v_−/∂v = s e = e(e e e + e(e e + e e))e (5.12)
∂v_−/∂v = s e = e(e e e + e(e e + e e))e (5.13)

We notice that every edge between two intermediate vertices in a line-graph is a multiplication relation; in algebraic expressions, however, there are two types of multiplication relation: direct and indirect.
E.g., in e(e e + e e)e, there is a direct multiplication relation between e and e, whereas in e(e e + e e), the multiplication relation between e and e is indirect: there is a parenthesis between them. Assume we have found the optimal calculation order of a differentiation graph G, represented by a set of algebraic expressions connected by reference variables. In its corresponding line-graph G̃, there may be different types of multiplication relations in different root-terminal pairs for the same edge, and eliminating such an edge may break parentheses in the optimal algebraic expressions of some root-terminal pairs, i.e., break the optimal calculation order. E.g., Fig. 5.3 is the line-graph of Fig. 5.2 (e), and we assume Eq. (5.1) - Eq. (5.13) is an optimal calculation order. If we eliminate edge ⟨e, e⟩ directly from Fig. 5.3 for Eq. (5.7), it will break the calculation order of Eq. (5.6), Eq. (5.8) and Eq. (5.9). We can fix this by explicitly splitting edges with conflicting multiplication relations, separating the direct multiplication relations from the indirect, and only eliminating those edges that represent direct multiplication relations. However, this method not only copies a lot of vertices and edges, but also requires more decision-making about how to split and which edges to eliminate. Furthermore, the splitting of an edge sometimes depends on the splitting of another adjacent edge; e.g., in Fig. 5.3, the splitting of ⟨e, e⟩ depends on the splitting of ⟨e, e⟩. Another solution is to search for a face elimination sequence without explicit splitting. Assuming we already know a set of optimal algebraic expressions, we can eliminate edges in the line-graph in the same order as the calculation order of the corresponding set of optimal algebraic expressions, avoiding breaking parentheses.
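Following the calculation order of the expression set amounts to evaluating innermost sub-expression blocks first and reusing each reference variable once it is materialized. A minimal sketch of this innermost-first, memoized evaluation, with assumed names (s1, out) and assumed edge values rather than the actual Eq. (5.1) - Eq. (5.13):

```python
# Sketch: evaluating an expression set through reference variables,
# innermost-first, so each shared sub-expression block is computed
# exactly once before any expression that contains it.

defs = {
    "s1": ("add", ("mul", "e1", "e2"), ("mul", "e3", "e4")),
    "out": ("mul", "e0", ("mul", "s1", "e5")),
}
vals = {"e0": 2.0, "e1": 1.0, "e2": 3.0, "e3": 0.5, "e4": 4.0, "e5": 1.5}

def ev(x, memo):
    if isinstance(x, str):
        if x in vals:
            return vals[x]          # a leaf edge value
        if x not in memo:           # a reference variable: compute once
            memo[x] = ev(defs[x], memo)
        return memo[x]
    op, a, b = x
    a, b = ev(a, memo), ev(b, memo)
    return a + b if op == "add" else a * b

memo = {}
print(ev("out", memo))  # -> 15.0
print(sorted(memo))     # s1 was materialized before out used it
```

Because the recursion bottoms out at the leaves, the parenthesized blocks are collapsed strictly from the inside out, which is exactly the order that avoids breaking parentheses.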
Unlike parenthesis elimination in an algebraic expression, the face elimination of a sub-expression block in a line-graph may depend not only on the sub-expression blocks nested in it but also on the sub-expression blocks outside it, because one edge that denotes a direct multiplication relation in a sub-expression block of one expression may denote an indirect multiplication relation in a sub-expression block of another expression. E.g., the face elimination of (e e + e e) in Eq. (5.6) depends on the face elimination of e e in Eq. (5.4). Therefore, when eliminating multiplication relations inside sub-expression blocks, we need to first eliminate those direct multiplication relations that have no indirect multiplication relations in other sub-expression blocks. E.g., elimination of ⟨e, e⟩ and ⟨e, e⟩ in Fig. 5.3 will not break the optimal calculation order of the algebraic expression set Eq. (5.1) - Eq. (5.13).

Lemma 1.
Assume that G(V, E) is a differentiation graph, A is one of its corresponding optimal algebraic expression sets, and s, s are edges or reference edges in G with s ≻ s. If s and s have both direct and indirect multiplication relations in A, then there must be an edge or reference edge s in the indirect multiplication relation of s and s, such that s(s s + ∆s) or (∆s + s s)s, where ∆s is the remaining sub-expression in the block, ∆s ≠ 0, and s ∉ ∆s.

Proof.
There are 4 possible forms of indirect multiplication relation for s and s: s(s + ∆s), s(s s + ∆s), (∆s + s)s and (∆s + s s)s. Assume that a ∈ A, and ∃ s(s + ∆s) ∈ a ∩ s_i(s s + ∆s_a)s_j ∈ a, where s_i is the predecessor of (s s + ∆s_a), s_j is the successor of (s s + ∆s_a), and ∆s_a is the sibling of s s; s_i = 1 if there is no predecessor, s_j = 1 if there is no successor, and ∆s_a = 0 if there is no sibling. From the sub-expression s(s + ∆s) we know that s and ∆s have the same source vertex and destination vertex, i.e., s^- = ∆s^- and s^+ = ∆s^+; therefore there must be a sibling sub-expression of s_i(s s + ∆s_a)s_j, such that s_i(s ∆s + ∆s_b)s_j + s_i(s s + ∆s_a)s_j ∈ a, where ∆s_b = 0 if s ∆s has no sibling. Obviously, s_i(s ∆s + ∆s_b)s_j + s_i(s s + ∆s_a)s_j = s_i(s(∆s + s) + ∆s_b + ∆s_a)s_j, so the number of calculations can be reduced, which means a and A are not optimal. Therefore, s(s + ∆s) is impossible. Similarly, we can prove that (∆s + s)s is also impossible.

Notice that we can omit ∆s in the indirect multiplication relations above and simply use s|s s and s s|s to denote s(s s + ∆s) and (∆s + s s)s respectively.

Lemma 2.
Assume that G(V, E) is a differentiation graph, A is one of its corresponding optimal algebraic expression sets, and s, s are edges or reference edges in G with s ≻ s. If s and s have both direct and indirect multiplication relations in A, then in order to avoid breaking parentheses between s and s in A, the face elimination of ⟨s, s⟩ in the line-graph depends on the face eliminations of all ⟨s, s_i⟩ that have s|s s_i in A.

Proof.
For all the edges ⟨s, s_i⟩ in the line-graph that have s|s s_i in A, the elimination of ⟨s, s_i⟩ will implicitly split s into another vertex and create a new edge connecting s to it.

Notice that it is possible for direct and indirect multiplication relations to form a circular structure in an algebraic expression set, because there are two directions, s|s s and s s|s; e.g., s|s s, s|s s, s s|s, s s|s, s s|s, s s|s, s|s s, s|s s. It seems impossible for us to eliminate an edge in the circular dependencies without breaking parentheses.

Proposition 1. Given a differentiation graph, face elimination can achieve optimal Jacobian accumulation, requiring the minimum number of calculations, if there exists at least one corresponding optimal algebraic expression set that does not contain circular elimination dependencies.
Proof.
Assume that G(V, E) is a differentiation graph, A is one of its corresponding optimal algebraic expression sets that does not contain circular elimination dependencies, and s, s are edges or reference edges in G with s ≻ s. Suppose s and s have both direct and indirect multiplication relations in A. Then for all the edges ⟨s, s_i⟩ in the line-graph that have s|s s_i in A: if ⟨s, s_i⟩ has no indirect multiplication relation in A, we can eliminate it first; once all ⟨s, s_i⟩ that have s|s s_i in A are eliminated, we can safely eliminate the edge ⟨s, s⟩ in the line-graph without worrying about accidentally breaking parentheses in any s|s s_i. If ⟨s, s_i⟩ also has indirect multiplication relations s|s_i s_j in A, we can apply this process to s s_i and s|s_i s_j recursively.

We do not yet know whether it is impossible to form circular dependencies in the optimal algebraic expression sets of a differentiation graph. If all the optimal algebraic expression sets of a differentiation graph contain circular dependencies, we may not be able to achieve optimality by performing face elimination.
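The precondition of Proposition 1 can be phrased as an ordinary acyclicity check: record each elimination dependency "face u must be eliminated before face v", and a safe face-elimination sequence exists exactly when the dependency relation admits a topological order. A sketch, with assumed face names f1..f3:

```python
# Sketch: checking Proposition 1's precondition. A pair (u, v) records
# "face u must be eliminated before face v"; a safe elimination order
# exists iff the dependency graph is acyclic.

from graphlib import TopologicalSorter, CycleError

def elimination_order(deps):
    ts = TopologicalSorter()
    for before, after in deps:
        ts.add(after, before)      # 'before' is a predecessor of 'after'
    try:
        return list(ts.static_order())
    except CycleError:
        return None                # circular dependencies: no safe order

print(elimination_order([("f1", "f2"), ("f2", "f3")]))  # -> ['f1', 'f2', 'f3']
print(elimination_order([("f1", "f2"), ("f2", "f1")]))  # -> None
```

The recursive process in the proof above is exactly a traversal of this dependency graph; a returned `None` corresponds to the circular case where no elimination avoids breaking parentheses.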
6. Conclusion
There is a close correspondence between differentiation graphs and algebraic expressions: a simple differentiation graph and an algebraic expression are interconvertible, provided that the multiplication relations in the algebraic expressions converted from differentiation graphs are noncommutative. The differentiation graph is more flexible: it can merge the same intermediate variable or subexpression even when it cannot be factored out in an algebraic expression. There is no direct representation or corresponding counterpart of a complex block in an algebraic expression. We can factorize overly merged complex blocks or chains and transform them into simple differentiation graphs with only simple chains and simple blocks. For simple blocks and simple chains, elimination of paths is equivalent to elimination of the corresponding edges and vertices as long as it follows the optimal calculation order. Local Jacobian multiplication can automatically factorize complex blocks, but it is up to us to find a better accumulation order. The purpose of multiple blocks or chains being intertwined together is to merge the same common sub-chains or sub-blocks; if we know the optimal common sub-chain or sub-block combination set, we can compute them separately. In a line-graph there may be different types of multiplication relations for the same edge, elimination of which may break the optimal calculation order. Unlike parenthesis elimination in an algebraic expression, the face elimination of a sub-expression block in a line-graph may depend not only on the sub-expression blocks nested in it but also on the sub-expression blocks outside it. It is possible for direct and indirect multiplication relations to form circular dependencies in an algebraic expression set. If all the optimal algebraic expression sets of a differentiation graph contain circular dependencies, we may not be able to achieve optimality by performing face elimination.

References

[1] Andreas Griewank and Andrea Walther.
Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd edition, 2008.

[2] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey, 2015. https://arxiv.org/abs/1502.05767

[3] A. Griewank and U. Naumann. Accumulating Jacobians by vertex, edge, or face elimination.
Proceedings of the 6th African Conference on Research in Computer Science, INRIA, France, pages 375–383, 2002.

[4] Andreas Griewank and Uwe Naumann. Accumulating Jacobians as chained sparse matrix products.
Math. Prog., 2003.

[5] Brian Guenter. The D* symbolic differentiation algorithm, 2007.

[6] Paul Wilhelm Matthias Kuhn. Edge and vertex elimination, 2009. https://doi.org/10.1007/s10107-003-0456-9

[7] U. Naumann. Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph, 2003. https://doi.org/10.1007/s10107-003-0456-9

[8] Uwe Naumann and Jean Utke. On face elimination in computational graphs, 2017. http://cerfacs.fr/wp-content/uploads/2017/06/7_Naumann_utke.pdf

[9] Terence Parr and Jeremy Howard. The matrix calculus you need for deep learning, 2018. https://arxiv.org/abs/1802.01528