[PDF] Index handling and assign optimization for Algorithmic Differentiation reuse index managers

Abstract

For operator overloading Algorithmic Differentiation tools, the identification of primal variables and adjoint variables is usually done via indices. Two common schemes exist for their management and distribution. The linear approach is easy to implement and supports memory optimization with respect to copy statements. On the other hand, the reuse approach requires more implementation effort but results in much smaller adjoint vectors, which are more suitable for the vector mode of Algorithmic Differentiation. In this paper, we present both approaches, how to implement them, and discuss their advantages, disadvantages and properties of the resulting Algorithmic Differentiation type. In addition, a new management scheme is presented which supports copy optimizations and the reuse of indices, thus combining the advantages of the other two. The implementations of all three schemes are compared on a simple synthetic example and on a real world example using the computational fluid dynamics solver in SU2.

Full PDF

aa r X i v : . [ c s . M S ] J un Index handling and assign optimization forAlgorithmic Diﬀerentiation reuse index managers

Max Sagebaum, Johannes Blühdorn, Nicolas R. GaugerChair for Scientiﬁc ComputingTechnische Universität Kaiserslautern

Abstract

For operator overloading Algorithmic Diﬀerentiation tools, the iden-tiﬁcation of primal variables and adjoint variables is usually done viaindices. Two common schemes exist for their management and distribu-tion. The linear approach is easy to implement and supports memoryoptimization with respect to copy statements. On the other hand, thereuse approach requires more implementation eﬀort but results in muchsmaller adjoint vectors, which are more suitable for the vector mode ofAlgorithmic Diﬀerentiation. In this paper, we present both approaches,how to implement them, and discuss their advantages, disadvantages andproperties of the resulting Algorithmic Diﬀerentiation type. In addition,a new management scheme is presented which supports copy optimiza-tions and the reuse of indices, thus combining the advantages of the othertwo. The implementations of all three schemes are compared on a simplesynthetic example and on a real world example using the computationalﬂuid dynamics solver in SU2.

Algorithmic Diﬀerentiation (AD) refers to techniques for the machine accuratediﬀerentiation of computer codes. To that end, we interpret a code with inputs x ∈ R k and outputs y ∈ R m as a mathematical function F : R k → R m . Usually,the code does not only contain ﬂoating point computations but also controlﬂow statements. For each speciﬁc input choice x , however, the control ﬂow isﬁxed and we may consider the computations as a composition of elementaryoperations like additions or multiplications together with, e. g., standard mathlibrary functions like sin or exp. The derivatives of these elementary operationsare known and can be implemented explicitly. Then, AD evaluates the derivativeof the full code at the speciﬁc input x by applying the chain rule to the sequenceof elementary operations. In the following, we summarize the principles of ADwith a focus on the reverse mode and operator overloading implementations.Comprehensive introductions to AD are given in [GW08, Nau12].The reverse mode of AD is a speciﬁc strategy for the automatic evaluationof the chain rule. There, each variable v of the computer program — whichmight be an input, output or intermediate variable — is associated with acorresponding adjoint variable ¯ v . ¯ v stands for the derivative of the outputs y as1 function of v in the direction of ¯ y , that is, ¯ v = dydv ( v ) T ¯ y . The adjoint outputs ¯ y — that is, the derivative direction — must be speciﬁed by the user. Particularly,the adjoint inputs satisfy the relation ¯ x = dFdx ( x ) T ¯ y (1)which is the standard way to describe the reverse mode of AD. By suitablechoices of ¯ y , various derivatives of the computer program can be computed. Forexample, if m = 1 and ¯ y = 1 . , ¯ x is the gradient of F evaluated at x .During a single evaluation of (1), the reverse mode of AD avoids the com-putational cost of setting up the full Jacobian dFdx . Instead, an evaluation onthe statement level is performed that is equivalent to the computation of thematrix vector product. Let w = φ ( v ) be a statement in the computer codewith φ : R n → R and n ∈ N . φ might be a single elementary operation or acomposition of multiple elementary operations. For each statement, reverse ADcomputes the adjoint update as ¯ v j + = ∂φ∂v j ( v ) ¯ w ; ¯ w = 0 ∀ j = 1 . . . n . (2)The order of statement evaluation is reversed, that is, if we have statements φ i with i = 1 . . . l for the primal evaluation, we need to evaluate Equation (2)for i running from l to . In more general terms, the sensitivity information ispropagated from the outputs to the inputs, which corresponds to the evaluationof the chain rule in reverse order. Since the reverse run requires information thatis only available after the primal evaluation is ﬁnished, certain information needsto be stored by AD such that the reversal is possible. The process of recordingthis information alongside the primal computation is known as taping.In this paper, we focus on AD implementations that use operator overloadingfor the augmentation of the computer program. There, a new computation type(e.g. adouble ) replaces the original computation type (e.g. double ). The newtype and corresponding operator overloads store information for each statementsuch that the adjoint updates according to Equation (2) can be performed afterthe primal computation is ﬁnished. First, we must store information for eachstatement that allows for the computation of the derivative expression in (2).Jacobian taping approaches compute ∂φ∂v ( v ) alongside the primal evaluation andstore it for the reversal, primal value taping approaches store the primal values v and some handles to evaluate ∂φ∂v with these values in the reverse run. De-tailed discussions on how these two approaches can be implemented are given in[SAG19] and [SAG18] for Jacobian tapes and primal value tapes, respectively.Second, we must be able to identify the variables that were involved in a state-ment, and associate them with their adjoint variables. For operator overloadingAD tools, an adjoint variable is usually not stored together with the associatedprimal variable in the new computation type because the lifetime of the pri-mal variable is not the same as the lifetime of the adjoint variable. Instead,an identiﬁer is stored in the new computation type which associates the primalvariable with an entry in a global adjoint vector. During the primal evaluation,identiﬁers are assigned to all variables, and for each statement, the identiﬁersof the involved variables are stored for the reverse run.Currently, there are two established identiﬁer management schemes. In theﬁrst, identiﬁers are handed out in a linear fashion where each identiﬁer is only2sed once for a variable. For example, dco/c++ [LLN16], CppAD [BB08] andthe CoDiPack RealReverse type use this management strategy. The secondtechnique relies on the reuse of the identiﬁers. The life span of the identiﬁeris coupled to the life span of the AD value and it is used again after the ADvalue is deleted. Among other tools, Adept [Hog14], ADOL-C [WG09] andthe CoDiPack

RealReverseIndex type use this management scheme. Bothschemes have their advantages and yield diﬀerent properties for the AD typethat is using them. The linear management allows to skip copy operations inthe taping process and the reuse scheme has a much smaller adjoint vector.In this paper, we introduce a new scheme for the handling of identiﬁers thatcombines the advantages of both established schemes. The basic idea is to adda use counter for the identiﬁers.The outline of this paper is as follows. First, we deﬁne some nomencla-ture and prerequisites which are essential for the discussion. Afterwards, wederive from the use cases of the identiﬁer management the interface betweenits implementation and the AD tool. Since there are neither resources on theimplementation details nor discussions about the properties of the establishedidentiﬁer schemes available, we introduce both schemes, their properties andprovide an example implementation. The assign optimization for copy opera-tions is discussed afterwards and leads us to the introduction of the new identi-ﬁer management scheme. Finally, we summarize the properties of the diﬀerentschemes and compare all three in an artiﬁcial test case and apply them in theSU2 CFD solver [EPC +

15] on a real world test case. All code presented in thispaper is C++11 code. An AD tool encompasses the full library of an AD implementation (e.g. dco/c++,ADOL-C, or CoDiPack).

AD type refers to the implementation that performsthe operator overloading approach (e.g., adolc::adouble , codi::RealReverse ,or codi::RealReverseIndex ). An AD value is a speciﬁc instance of an ADtype (e.g. codi::RealReverse a; ) that has a lifetime and memory. It is con-structed, written, read and destructed.For a given AD type, its

AD tape is the structure that stores all informationfor the reverse mode of AD. The term recording describes the process of storingthe information on the AD tape and the term reverse evaluation describes theprocess of seeding the output variables, running the reverse mode of AD andobtaining the adjoints of the input variables.We refer to the data used to identify adjoints (e.g. ¯ w ) of the primal values(e.g. w ) as identiﬁers . Since the optimal choice for the identiﬁer type is aninteger, we use for the implementing classes names like IndexManager . Forclarity, we do not introduce a template parameters for the index type and assumethat it is handled like an int . In the linear management scheme, this allowsfor tapes that are approximately 100 GB in size, that is, statements withan average of four arguments per statement. For reuse index schemes, thisintroduces limitations only if there are more than ≈ AD values in useat the same time. Only up to version 1.2, afterwards the new scheme described in this paper is used. or (;;) {AD::resetTape();AD::startRecording(); for ( auto input : inputs) { AD::registerInput(input); }callFunc(inputs, outputs) for ( auto output : outputs) { AD::registerOutput(output); }AD::stopRecording(); for ( auto output : outputs) { AD::setAdjoint(output, ...); }AD::evaluateReverse(); for ( auto input : inputs) { AD::getAdjoint(input, ...); }} Figure 1: Interaction between program and AD tool.For the identiﬁers, we assume that they are unique in the application, thatis, a single identiﬁer must not be used for two diﬀerent AD values. Each ADtool must provide mechanisms that guarantee this property. In this paper, wefocus on CoDiPack. There, it is suﬃcient to initialize all memory with zerosbecause CoDiPack assumes that the zero identiﬁer is used for all independentvariables, that is, variables that do not depend on the input variables. Then,zero is the only identiﬁer which does not have to be unique with respect to thedeﬁnition above.We assume that the program interacts with the AD tool as depicted in Figure1. Particularly, the index manager must • support multiple tape recordings • and handle overwrites of old AD values from previous recordings.The call to resetTape , that is expected to be placed every time a new tape isrecorded, can be used by the index manager to clean up internal data structures.One additional assumption is that the repeated recording of the tape is safe.This means that all AD variables have gained a new identiﬁer before they areused in callFunc . A new index can either be gained by calling registerInput on that variable or overwriting it with an assignment. This is especially impor-tant for the linear management scheme since otherwise, the correctness of theresult cannot be guaranteed. Usually, we distinguish only two use cases for identiﬁer updates. The ﬁrst oneis an assignment of an identiﬁer to an AD value. Here, the old identiﬁer ofthe AD value must be taken into account by the identiﬁer scheme. In theimplementation, this is realized via a call to the assignIndex( int & i) methodof the index manager. The second case is the deletion of an identiﬁer, which isindicated to the implementation by a call to the method freeIndex( int & i) .4or the assign optimization, which is introduced later, a third use case isrequired. There, the implementation for the index manager must take the lefthand side (lhs) identiﬁer and the right hand side (rhs) identiﬁer into account. copyIndex( int & lhs, const int &rhs) is the corresponding method in theinterface.It is now important to determine the scenarios in which the methods of theindex manager implementation are called. In general, the lifetime of an ADvalue can be categorized into four operations: It can be created, overwritten,used and deleted.The used case occurs if the AD value is used on the right hand side of anassignment (e.g. ... = a * a ). There, no change of the identiﬁer is requiredand the index manager does not have to be informed.The deleted scenario occurs only if the AD value runs out of scope or deleteis called. For both cases, it is suﬃcient that the destructor of the AD type callsthe freeIndex method.The created case occurs whenever AD values are constructed. It is easilytreated in CoDiPack since we assume that the zero identiﬁer can be used multi-ple times and indicates an inactive AD value. The AD tape can simply set theidentiﬁer and does not have to inform the index manager implementation.The overwrite case is the most common and interesting one. Examplesare a = ... , ADType b = ... , or c = d . In general, it can be handled by acombined freeIndex call and assignIndex call. The free call ensures that theidentiﬁer seen by the assign call is zero so that the assign call always generatesa new identiﬁer for the AD value. This approach is in accordance with the ADtheory: The old identiﬁer of an AD value is not required for the correctness ofthe reverse evaluation.This approach allows optimizations of the data management. An identiﬁerthat is ﬁrst freed with freeIndex can then be immediately used in the following assignIndex call. Therefore, the freeIndex call is skipped and assignIndex has to check the old value in the memory location. It can then decide how tohandle identiﬁers that are overwritten. This allows for an optimal implementa-tion with respect to memory movements.One general consideration has to be made about when assignIndex is calledduring the recording process of a statement, especially when the left hand sideof the statement is captured as a reference and also occurs on the right handside of the statement, e.g. a = a * a . A call to assignIndex might changethe identiﬁer of a . Then, if the information for the right hand side is gath-ered afterwards, the wrong identiﬁer is stored. It is therefore essential to call assignIndex only after all data for the right hand side has been gathered. The simplest option for the assignment of identiﬁers to AD values is the linearmanagement scheme. Here, each new variable gets its own unique identiﬁer.The implementation is straightforward, see Figure 2.The class contains a member which stores the last identiﬁer given to any ADvalue. Based on this value, a new identiﬁer is created by increasing the memberby one and returning the new value. 5 truct

LinearIndexManager { int lastIndex;LinearIndexManager() : lastIndex(0) {} void assignIndex( int & i) {lastIndex += 1; // leave out the zero index return lastIndex;} void freeIndex( int & i) { // empty } void reset() {lastIndex = 0;}}; Figure 2: Implementation of the linear management scheme.To each AD value, a unique identiﬁer is assigned that is only used for thelife time of the respective variable. After the AD value has been overwrittenor deleted, the identiﬁer is not used again. This often leads to adjoint vectorswhich are quite large while most of the entries are used only a few times, leadingto “dead” memory.The advantage of the

LinearIndexManager is that the left hand side identi-ﬁers for statements do not have to be recorded on the tape [SAG19]. The basicidea is that each new statement increases the counter by one and in the reverseevaluation, the counter is decremented by one. Therefore, the left hand sideidentiﬁer does not have to be stored. Note that this change interlocks the state-ments and identiﬁers. Each statement generates an identiﬁer and, conversely,to generate an identiﬁer, a statement must be written to the tape.

The life time of a variable can be very short. It would therefore make sense toreuse the identiﬁer generated for the variable. An index manager with such aproperty needs mechanisms to track identiﬁers which are no longer used.An implementation that tracks the identiﬁers is shown in Figure 3. Thelogic of the manager is slightly more involved. The constructor preallocates thedefault block size of identiﬁers. Afterwards, an initial set of available identiﬁersis generated with createNewIndices . This method does not have to check ifthe indices vector has enough space since the default block size is preallocatedand it is only called if indicesLeft is zero.The method assignIndex skips the assignment of an identiﬁer if there isalready one. This can be done as a result of the overwrite discussion in Section3. If the value i is freed, it is immediately assigned to i again, therefore we can6 truct ReuseIndexManager { int maximumIndex;std::vector< int > indices; size_t indicesLeft; static const size_t

INDEX_BLOCK_SIZE = 256;ReuseIndexManager() : maximumIndex(0), indices(INDEX_BLOCK_SIZE),indicesLeft(0){ createNewIndices(); } void createNewIndices() { // only called if indicesLeft is zero for (; indicesLeft < INDEX_BLOCK_SIZE; indicesLeft += 1) {maximumIndex += 1;indices[indicesLeft] = maximumIndex;}} void assignIndex( int & i) { if (0 == i) { // leave index in place if not zero if (0 == indicesLeft) {createNewIndices();}indicesLeft -= 1;i = indices[indicesLeft];}} void freeIndex( int & i) { if (0 != i) { if (indicesLeft == indices.size()) {indices.resize(indices.size() + INDEX_BLOCK_SIZE);}indices[indicesLeft] = i;indicesLeft += 1;i = 0;}} void copyIndex( int & lhs, const int & rhs) { if (0 != rhs) { assignIndex(lhs); } else { freeIndex(lhs); }} void reset() { // index management is global, no reset required }}; Figure 3: Basic reuse index manager implementation.7 void func() { ADVar a, b; AD::registerInput(a); AD::resetTape(); AD::registerInput(b); } Figure 4: Example for possibly wrong index management in a reuse index man-ager.optimize this case.If the identiﬁer is zero, assignIndex checks if there are indices left andcreates new ones if necessary. Afterwards, it pops the last index from the vector.The freeIndex method needs to check if the vector has space left for the pushof the index. Since the zero identiﬁer is the passive identiﬁer, it should not befreed which is checked ﬁrst in the method.We also implement the copyIndex method to have the full interface deﬁnedin Section 3 available. This basic implementation ensures that the left handside identiﬁer is assigned a valid index if the right hand side is active.The most interesting method is the reset method since it does not contain anylogic. This is mandatory since all identiﬁers for the current AD variables stayvalid after the reset. A variable that is overwritten or freed after the reset willreturn its identiﬁer to the index manager which has — in this implementation— no means to check if the identiﬁer has been created before or after the reset.The issue can be demonstrated with the example in Figure 4. In line 6, bothvariables a and b are freed, which adds both identiﬁers to the list of indices.Suppose a gets the identiﬁer 1. If, together with the tape reset, the indexmanager is reset as well, then b also gets the identiﬁer 1. The free in line 6adds then two times the identiﬁer 1 to the indices list. In the further recordingprocess, the identiﬁer 1 is then given to two distinct AD values, which violatesthe uniqueness property postulated in Section 2.It is therefore necessary that the reset method of the reuse index managerkeeps all information as it is and does not revert to a state where all identiﬁersare unused. For Algorithmic Diﬀerentiation, the deﬁnition of the input values is quite im-portant as it determines which part of the program is diﬀerentiated. In general,the registration of an input includes the assignment of a new identiﬁer to anAD value. Depending on the used index manager, special care must be takenhow this is done.In order to illustrate problems which can occur for the index manager, theexample in Figure 5 is used. The example contains two variable registrationswith a short computation in between. Identiﬁers that are generated and freed inlines 4 and 5 can lead to problems, which are discussed for each index managerseparately. 8 void func(ADVar* x, ADVar* y) { AD::registerInput(x[0]); { ADVar t = x[0] * x[0]; y[0] = t + x[0]; } // t is freed AD::registerInput(x[1]); y[1] = y[0] * x[1]; } Figure 5: Example for diﬃculties with input registrations.

For the linear index manager, it is important to keep in mind that statementsand identiﬁers are interlocked. Therefore, if a variable requires a new identiﬁer,the tape has to write a statement. Hence, ﬁve statements are written in theexample from Figure 5, that is, two statements for the registrations in lines 2and 8 and three statements for the expressions in lines 4, 5 and 9.The theory of the reverse mode of AD requires that each reverse evaluationof a statement is followed by setting the adjoint of the left hand side to zero, seeEquation (2). The index manager has no inﬂuence on this and cannot changethis default behavior.Particulary, the adjoint of an input variable is set to zero after the reverseevaluation of a registerInput statement, that is, the derivative informationwe are interested in is overwritten.This can be ﬁxed to some extent if we do not store registerInput state-ments at the beginning of the tape. In the above example, however, this pre-serves only the adjoint of x[0] , the adjoint of x[1] is still overwritten.Another solution is to skip the set to zero of the left hand side adjoint afterthe reverse evaluation of a statement. For a linear index management, this ispossible since each identiﬁer is used only once. It further provides the userwith the adjoints of all intermediate computations. However, the adjoint vectorneeds to be cleared before another reverse interpretation of the same tape canbe performed, which costs approximately 25% of the interpretation time of atape. If the adjoints of the left hand side are set to zero directly in the reverseinterpretation, this cost is hidden behind other operations.The third option is to tag the statements for registerInput calls and skipthe reverse interpretation for them. Operator tapes can have a special operatorfor input statements and know from this operator that the reverse evaluationmust be skipped for this statement. Statement level tapes can use a specialnumber for the right hand side active variable count like − and identify registerinput statements via this number. To implement these changes, it is suﬃcient tomodify the register input and evaluate logic in the tape implementation. Notethat this is a modiﬁcation of the statement storing and evaluation process thatcannot be realized by changes to the index manager. In general, this optionensures the correct adjoints for x[0] and x[1] and eliminates the extra cost ofresetting the adjoint vector after the interpretation.9or CoDiPack, the second option was the default behavior until version 1.6.Since then, the third option is the new default behavior. The old logic can beactivated with the preprocessor ﬂag -DCODI_ZeroAdjointReverse=0 The ﬁrst option — in addition to the other solutions — is not implementedin CoDiPack since the memory gains are rather small and it complicates thetape management.

With a fresh reuse index manager, the identiﬁers for the statements in lines 2,4, 5 from Figure 5 are id(x[0]) == 1 , id(t) == 2 and id(y[0]) == 3 . Inline 6, the identiﬁer of t is freed and then given to x[1] in line 8. Withoutadditional measures, the adjoints of x[0] and x[1] are wrong after the reverseinterpretation. The correct value for x[1] is set in the reverse interpretationof the statement in line 9. Line 8 has not registered a statement and line 6assumes implicitly that the adjoint of t is zero, which is in general wrong sincethe identiﬁer points to the adjoint of x[1] . The reverse evaluation in line 5 and4 yields therefore wrong results. Furthermore, after the evaluation of line 4, theadjoint of t and thus the adjoint of x[1] are set to zero.A possible solution consists in the creation of a statement for registerInput calls and a special management of the adjoints for input variables. However,this increases the complexity of the tape implementation for CoDiPack. Thistechnique is for example used in ADOL-C without any problems since they usea diﬀerent philosophy: The user provides the adjoint vectors directly.The solution suggested in this paper and used in CoDiPack adds logic onlyto the reuse index manager. The key problem of the above example is that theidentiﬁer of x[1] has already been used on the current tape. If the identiﬁerwas a “fresh” one, then the problem would not occur. A “fresh” identiﬁer is onethat is used for the ﬁrst time in a tape recording process.The implementation in Figure 3 is extended by a second index vector thatstores the unused (“fresh”) indices. All new and overwritten routines are listed inFigure 6. The reset method receives the logic for appending the used identiﬁersto the unused ones. The assignIndex method returns an unused identiﬁer asa fallback if no used ones are left. The method assignUnusedIndex forces therelease of the identiﬁer stored in i since we do not know if this identiﬁer hasalready been used in the current tape recording, otherwise it works like the old assignIndex method.This implementation avoids the errors observed in the code of Figure 5 andguarantees a correct adjoint vector after the reverse interpretation. In Section 2, we postulated that all identiﬁers of AD values are unique. Wewant to lift this restriction now for the assign statement b = a . The reverseevaluation of this statement is ¯ a + = ¯ b ; ¯ b = 0 , that is, all updates received by ¯ b are added to ¯ a . From a programming point of view, b could be realized as areference to the memory location of a until the next write access to a . A similarargument holds true for the index managers.10 / struct ReuseIndexManager std::vector< int > unusedIndices; size_t unusedIndicesLeft; // createNewIndices: stores indices now in the unusedIndices list void assignUnusedIndex( int & i) { if (0 != i) {freeIndex(i); // force change of index } if (0 == unusedIndicesLeft) {createNewIndices();}unusedIndicesLeft -= 1;i = unusedIndices[unusedIndicesLeft];} void assignIndex( int & i) { if (0 == i) { // leave index in place if not zero if (0 == indicesLeft) {assignUnusedIndex(i); // fallback to unused indices } else {indicesLeft -= 1;i = indices[indicesLeft];}}} // freeIndex stays the same// copyIndex stays the same void reset() { size_t totalSize = indicesLeft + unusedIndicesLeft; if (unusedIndices.size() < totalSize) {unusedIndices.resize(totalSize);}std::copy(indices.begin(), indices.begin() + indicesLeft,unusedIndices.begin() + unusedIndicesLeft);unusedIndicesLeft = totalSize;indicesLeft = 0;} Figure 6: Extended reuse index manager implementation by unused identiﬁers.11orward eval.: Finish a assigned a = . . . a is used a copied to bb = a a and b are usedReverse eval.: ¯ a and ¯ b are zero ¯ a is used foran update . . . + = J ∗ ¯ a ¯ a isupdated ¯ a + = . . . Add updatesof ¯ b to ¯ a ¯ a + = ¯ b ¯ a and ¯ b are updated ¯ a + = . . . ¯ b + = . . . Figure 7: Primal and reverse connection between two variables that are a copyof each other. // LinearIndexManager static const bool

AssignNeedsStatement = false; void copyIndex( int & lhs, const int & rhs) {lhs = rhs;}

Figure 8: Implementation for the assign optimization of a linear managementscheme.

The optimization for the linear management scheme presented below has beenimplemented by the group of Uwe Naumann in dco/c++.For the

LinearIndexManager , it is guaranteed that each identiﬁer is usedfor at most one AD value. Speciﬁcally, the identiﬁer of b is only used for b . ¯ b is initially zero and for every operation that uses b , ¯ b is updated accordingly.The reversal of b = a is the last time that ¯ b is used in the reverse interpretationprocess and all of its updates are added to ¯ a at once. This is also picturedin Figure 7. For ¯ a , we see in this Figure that it is only changed by updateoperations (+=) and used with a multiplier for an update at the point were theassignment of a is reversed. Since all changes to ¯ a from ¯ b are done beforehand,it does not matter when the value of ¯ b is added to ¯ a . Thus, updates to ¯ b candirectly be added to ¯ a instead. If we use the same identiﬁer for a and b , thishappens automatically in the reverse interpretation. The advantage is that nostatement needs to be stored on the tape for the copy operation.We introduce the static member AssignNeedsStatement to the index man-ager implementation. The tape uses this member to decide whether or not astatement has to be pushed for statements like a = b . In such cases, the tapecalls the copyIndex method to indicate to the index manager that an assignoperation is performed. The additional code for the linear index manager isstraightforward, see Figure 8. 12 truct IdentifierUseCount {std::vector< int > useCount; void setMaximumSize( size_t size) {useCount.resize(size);} bool unuseIndex( int & i) {useCount[i] -= 1; return useCount[i] == 0; // indicates last use of this identifier } void useIndex( int &i) {useCount[i] += 1; // new use location of the identifier }}; Figure 9: Use count manager for identiﬁers.

In principle, the same argument as for the linear management scheme can bedone for the reuse management scheme. However, the reuse index managementas implemented in Figure 3 recollects freed indices. Therefore, if we simplycopy the right hand side identiﬁer in assignments, two AD values with the sameidentiﬁer are eventually freed, adding the identiﬁer twice to the list of indices.After that, it might be handed out to two diﬀerent, completely unrelated ADvalues, violating the (relaxed) uniqueness assumption. Therefore, the assignoptimization is not possible with the implementation from Figure 3.Only with an extension that keeps track on how often an identiﬁer is used,it is possible to use the assign optimization with a reuse index manger. Such anextension is described in the next section.

The basic principle that enables the assign optimization for index managersthat reuse identiﬁers is the same as for the reference counting in smart pointers[Cop92]. Each time an identiﬁer is used or destructed, a corresponding counteris incremented or decremented, respectively. If the counter reaches zero, theidentiﬁer can be freed. However, since millions of such reference counts arerequired, we need a solution that has less overhead and uses the least amountof memory possible.First, we implement a class that manages the use count of the identiﬁers.Then, we use it to extend the

ReuseIndexManager and enable the assign op-timization. The general implementation in Figure 9 has only three functionsand is self-explanatory. It is assumed that the function setMaximumSize isalways called when the index manager generates new identiﬁers. Therefore,the other methods do not need to check for the bounds. The implementa-13 / struct UseCountIndexManager {

IdentifierUseCount uc; void copyIndex( int & lhs, const int & rhs) { if (lhs != rhs) {freeIndex(lhs);uc.useIndex(rhs);lhs = rhs;}} Figure 10: Assign optimization implementation for a reuse management scheme.tion of the

UseCountIndexManager class is obtained by modiﬁcations of the

ReuseIndexManager class. Too ensure the correct size of the use count vec-tor, we call setMaximumSize after each call to createNewIndices . unuseIndex is called at the beginning of freeIndex and we evaluate the remaining logicin freeIndex only if the return value indicates the last use of the identi-ﬁer. assignUnusedIndex calls useIndex for the newly assigned identiﬁer. Themethod assignIndex ﬁrst calls unusedIndex . If it is the last use of the iden-tiﬁer, then no new identiﬁer is acquired. If the identiﬁer is still in use, then anew one is retrieved by the old logic. Afterwards, useIndex is called in bothcases to indicate a new use of the identiﬁer. copyIndex is the only method where the logic is changed. The old imple-mentation in Figure 3 only needs to ensure that the left hand side index is valid.The new logic in the UseCountIndexManager class has to free the index of theleft hand side since it is overwritten. It also calls useIndex for the identiﬁer ofthe right hand side. The implementation is shown in Figure 10. The if conditionin the code avoids the logic if the left hand side and right hand side have thesame index. In such a situation, freeIndex would decrease the use count byone only to increase it in the call of uc.useIndex by one immediately after-wards. This avoids also the corner case where the arguments &lhs and &rhs point to the same memory location. There, the freeIndex call that is supposedto set the left hand side identiﬁer to zero erroneously resets the right hand sideidentiﬁer as well, resulting in a passive variable. This situation can easily occurin a self assignment like a = a .This implementation of the

UseCountIndexManager supports the assign op-timization for an index manager that reuses identiﬁers.

The three index managers have some special properties that we want to highlightnow.

The

LinearIndexManager from Figure 2 is compatible with C-like memoryoperations (e.g. memcpy ). All AD types that use this kind of index manager14ave this property. It can directly be deduced from the copyIndex methodimplementation in Figure 8. Since the only operation for the identiﬁers is thecopy operation, a C-like memory operation would do the same.Some tape implementations might store additional data for each identiﬁerin their internal data structure. As long as a copy operation of an active valueresults in a C-like memory operation of the internal data associated to theidentiﬁer these AD types are also compatible with C-like memory operations.The other two index managers are not compatible with C-like memory oper-ations. For the

ReuseIndexManager , it is the same reason why it cannot supportthe assign optimization. On the other hand, the

UseCountIndexManager cansupport the assign optimization but the logic in the copyIndex method in Fig-ure 10 is not a trivial copy operation and therefore it is not compatible withC-like memory operations.

The

ReuseIndexManager in Figure 3 is implemented such that the assignIndex method keeps the old index in place. Therefore, a variable that is never deletedand is always assigned an active value will keep the same identiﬁer for the overallruntime of the application. This property can be used to create arrays of ADvalues that have continuous indices for the whole vector. Special AD constructslike external functions can then build on this property of the vector to reducethe required memory and to improve runtime performance.The index manager in CoDiPack does not support the creation of vectorswith continuous identiﬁers. Until now, no use case required it. Nevertheless,ADOL-C supports this behavior with a call to ensureContiguousLocations prior to allocating the vector.The other two index managers do not support this behavior. This is clearfor the

LinearIndexManager since it creates a new identiﬁer for each statement.The behavior for the

UseCountIndexManager cannot be guaranteed: In a state-ment, a new identiﬁer has to be created if the use count of the left hand sideidentiﬁer is greater than one. With an extension that tags unique identiﬁers,however, this behavior could be added to the

UseCountIndexManager .

10 Summary

We presented three index managers with diﬀerent properties for the handling ofidentiﬁers for AD types. The properties of the three index managers are summa-rized in Table 1. The memory per statement is zero for the

LinearIndexManager since the identiﬁer can be computed from counting statements [SAG19]. For theother two, the identiﬁer of the left hand side has to be stored. In addition, the

UseCountIndexManager requires memory for each identiﬁer that is created sinceit needs to store the use count for each identiﬁer. The number of branches forthe diﬀerent identiﬁer operations show that for the more advanced features, thecomplexity in the control ﬂow increases.The assign optimization with an index manager that reuses indices is thenew contribution of the paper. In the results section below, we test the runtimeand memory impacts. 15able 1: Summary of properties and implementation details for thethree index managers:

LinearIndexManager ( LIM ), ReuseIndexManager ( RIM ), UseCountIndexManager ( UCIM ). LIM RIM UCIM

Memory per statement (byte) 0 4 4Memory per identiﬁer (byte) 0 0 4Branches for assign 0 3 4Branches for free 0 2 3Branches for copy 0 3 4Assign optimization + - +Special properties C-mem comp. (ﬁxed identiﬁers) -Statement for register input + - -

11 Results

The coupled Burgers’ equation is used for a general comparison of performancevalues for the diﬀerent implementations. Problem setup and discretization arealready described in [SAG19] and [SAG18]. For completeness, we recapitulatehere the basic information.The coupled Burgers’ equation [BA09, Bah03, ZSD10] u t + uu x + vu y = 1 R ( u xx + u yy ) , (3) v t + uv x + vv y = 1 R ( v xx + v yy ) (4)is discretized with an upwind ﬁnite diﬀerence scheme. The initial and boundaryconditions are taken from the exact solution u ( x, y, t ) = x + y − xt − t ( x, y, t ) ∈ D × R , (5) v ( x, y, t ) = x + y − xt − t ( x, y, t ) ∈ D × R (6)given in [BA09]. The computational domain D is the unit square D = [0 , × [0 , ⊂ R × R . As far as the diﬀerentiation is concerned, we choose the ini-tial solution of the time stepping scheme as input parameters, and as outputparameter we take the norm of the ﬁnal solution.For the implementation of the program, all methods are written in such a waythat they can be inlined and the requested memory is allocated and initializedbefore the time measurement starts. Particularly, the required memory for thetape is computed beforehand and then allocated. This yields very stable timemeasurements which are run on one node of the Elwetritsch cluster at the TUKaiserslautern. The node consists of two Intel Xeon 6126 CPUs with a total of24 cores and 384 GB of main memory. We discretize the Burgers’ equation ona × grid and solve it with 32 iterations. All timing values are averagedover 10 evaluations.For the time measurements, two diﬀerent conﬁgurations are tested.16 ype Adjoint vector Statement data Argument data Tape memoryJacobian LIM 711.77 MB 88.97 MB 3963.54 MB 4764.28 MBJacobian RIM 16.50 MB 444.86 MB 3971.81 MB 4433.18 MBJacobian UCIM 16.50 MB 441.41 MB 3963.54 MB 4429.72 MBPrimal LIM 711.77 MB 1512.70 MB 1849.80 MB 4074.26 MBPrimal RIM 16.40 MB 1868.40 MB 1852.50 MB 3754.17 MBPrimal UCIM 16.50 MB 1853.93 MB 1849.80 MB 3745.20 MB Table 2: Memory requirement for the Burgers taste case. • The multi test conﬁguration runs the same process on each of the 24 cores.This setup simulates a use case where the full node is used for computationand every core uses the memory bandwidth of the socket. • The single test conﬁguration runs just one process on the whole node.This eliminates the memory bandwidth limitations and provides a betterview on the computational performance.Both test conﬁgurations are evaluated with all three index managers, and withthe Jacobian taping approach as well as the primal value taping approach. Thememory consumption for one process is shown in Table 2.

Adjoint Vector , State-ment Data and

Argument Data show the required memory for the correspondingtape entries. The size of the adjoint vector is coupled in the linear managementcase to the size of the statement data. In the reuse case, the adjoint vector is aslarge as the number of simultaneously active variables. An entry in statementdata is generated for every statement in the application. With the assign op-timization, the number of entries should be reduced. Argument data containsone entry per argument to a statement. It is reduced by the same size as thestatement data with the assign optimization.

Tape Memory shows the overallmemory of the tape. The adjoint vector, statement data and argument data arethe largest part of the tape memory but there are some additional structuresthat require some resources.Since the Burgers test case is written such that nearly no copy operations areperformed, there is no memory gain with the new index manager. Therefore,the following performance comparison shows the management overhead for thenew index manager.The timing evaluations for the Jacobian types and primal values types canbe seen in Figure 11 for the tape recording and in Figure 12 for the tape reversal.The Jacobian recording results show quite clearly that the increased complexityof the index handler decreases the recording performance. For the single case,this amounts to 5% for the reuse index manager and to 17% for the use countindex manager compared to the linear index manager. For the multi case, theperformance drop is in the same region of about 7% and 11%, respectively. Thisdecrease in performance also occurs since the required data for each statementis larger and therefore, more memory bandwidth is used. During the reversal ofthe tape in Figure 12, the performance increases compared to the linear indexmanager. For the single case, the performance gain is around 9% and around20% for the multi case. The multi case yields here a higher performance increasesince the memory bandwidth limitations are lifted a little bit with the reducedadjoint vector size. For the new index manager, no further performance increaseis seen since it does not have a large eﬀect on the size of the adjoint vector.17 . . . . . . LinearReuseUseCount . .

22 1 .

36 1 . . . .

26 1 . . .

38 1 . . time in sec.Jacobi gcc single recordJacobi gcc mult recordPrimal gcc single recordPrimal gcc mult recordFigure 11: Burgers test case tape recording.The same conclusions can be drawn for the performance results of primalvalue tapes. During the recording, the peformance decreases for both conﬁg-urations. The initial step from a linear management approach to the reuseapproach reduces the performance by 19% for both conﬁgurations. From thereuse handling to the use count handling, the decrease is only 1% with a total20% performance decrease for both conﬁgurations. During the reverse evalua-tion, there is an increase in performance by 6% for the single case and 13% forthe multi case. The drop in performance from the reuse handling to the usecount handle scheme cannot really be explained by the algorithm. In theory,the performance should not decrease since the algorithm is the same. It can beexplained by the layout of the executable. Function pointers are used in thereverse run of the primal value tapes and the higher complexity of the codeyields a larger binary. The jump for calling the function pointer is now largerand the CPU will handle it in a diﬀerent manner.As expected, the new index manager has a negative impact on performanceduring the recording of a tape. In the reverse interpretation of the tape, noperformance gains are expected since the Burgers test does not contain manycopy operations. For the second test case, we use the discrete adjoint of SU2 [EPC + U = G ( U, X ) . (7) X represents the design variables and G ( U ) some (pseudo) time-stepping schemelike the explicit or implicit Euler method. The ﬁxed-point iteration U n +1 = G ( U n , X ) is applied for n → N ∗ until the ﬁxed point U ∗ is reached. For the18 . . . . . . LinearReuseUseCount . . .

78 1 . .

97 1 . . .

13 1 . . . time in sec.Jacobi gcc single reverseJacobi gcc mult reversePrimal gcc single reversePrimal gcc mult reverseFigure 12: Burgers test case tape reversal.shape optimization with respect to the design X , we formulate the minimiza-tion problem min U J ( U, X ) s.t. U = G ( U, X ) . As detailed in [ASG15], wesolve the minimization problem by fulﬁlling the KKT conditions [KT51] on theLagrangian function L ( U, X, ¯ U ) = J ( U, X ) + ( G ( U, X ) − U ) T ¯ U . (8)This requires the solution of the adjoint state equation ¯ U = (cid:20) ∂J ( U ∗ , X ) ∂U (cid:21) T + (cid:20) ∂G ( U ∗ , X ) ∂U (cid:21) T ¯ U (9)which is formulated here as a ﬁxed-point equation. To solve this equation, werecord the tape for the Lagrangian function L with CoDiPack once. Then, thetape is evaluated again and again until the ﬁxed point ¯ U ∗ is reached. It istherefore important for SU2 that the tape evaluation is as fast a possible. Thetape recording is not that time critical.As a testcase, we consider the viscous ﬂow over the Onera M6 wing. Thecomputational mesh consists of . million interior elements and for the solution,the 3D RANS equations are used. The test is run on two nodes of the Elwetritschcluster at the TU Kaiserslautern. Each node consists of two Intel Xeon 6126CPUs with a total of 24 cores and 96 GB of main memory each. In total,the case is run with 192 GB of main memory and 48 cores. For the primalcomputation, one ﬁxed-point iteration step takes about 1.1 seconds and uses13.73 GB of main memory.The performance values for SU2 are measured also for two diﬀerent conﬁg-urations, namely with the preaccumulation switch either disabled or enabled.Preaccumulation is a technique that computes the Jacobian for a code regionand stores this Jacobian on the tape. If the region has few input and outputvalues but takes a lot of statements to compute, then this technique reducesthe required memory. The switch in SU2 enables this technique for certain code19laces and the default conﬁguration in SU2 has it enabled. Since the preaccumu-lation takes some additional time during the recording of the tape and improvesthe reverse evaluation time of the tape, it would be hard to see the pure eﬀectof the new index manager in the recording and reversal sweep. Therefore, themore theoretical results without preaccumulation and the realistic results withpreaccumulation are both presented here.All recording results in Figure 13 look qualitatively similar in the four dif-ferent conﬁgurations. In each conﬁguration, a performance drop is seen fromthe linear index management to the reuse index management. Other than inthe Burgers case, the performance improves here for the use count index man-ager. This is due to the reduced memory seen in Table 3, which shows thatthe increased complexity can be hidden behind the memory bandwidth of theRAM.The reversal results in Figure 14 show the same relative behavior as therecording results, but here, the diﬀerences are due to the increased or decreasedsize of the stored data. Since the reuse index manager stores additional state-ments for the copy operations, it is in general slower than the other two indexmanagers. An increase in performance is seen for all cases when using the newindex manager instead of the old one. For the Jacobian tapes, this amounts toan increase of 13% both with preaccumulation and without preaccumulation.The primal value taping approach has a performance increase of 20% withoutpreaccumulation and 15% with.The memory analysis is more interesting than for the Burgers case. In SU2,a lot of copy operations are performed, which shows the importance of the newindex manager. For the conﬁguration without preaccumulation in Table 3, thememory reduction from the reuse index manager to the use count index manageris about 30% for the Jacobian and primal value taping approaches. This reduc-tion is achieved by removing the unnecessary copy operations from the tape,which is indicated by the size of the argument data and the number of state-ment entries. With the new indexing scheme, the same number of statementsis recorded as for the linear index manager, which uses the assign optimizationalready. The memory reduction with respect to the linear index manager isthen about 12% for the Jacobian approach and 10% for the primal value tapingapproach. Here, the reduction is due to the reduced size of the adjoint vector. Itis very beneﬁcial for the tape evaluation speed and the scaling for vector modeapplications.The cases in Table 4, where preaccumulation is enabled, show the same gen-eral trend and also the memory reductions are around the same size. A com-parison between the memory values with and without preaccumulation yields ageneral reduction of about 35% for each taping approach.

12 Conclusion

In this paper, we presented two existing schemes for the index handling in ADtool implementations, namely linear index handling and reuse index handling.The ﬁrst scheme admits an optimization for assign statements such that thesestatements do not have to be stored on the tape. We presented a new indexmanager which uses a references counting technique to enable the assign opti-mization also for an indexing scheme that reuses indices. An implementation20 ype Adjoint Statement Statement Argument Tape Memoryvector entries data data memory factor(in GB) (in GB) (in GB) (in GB)Jacobian LIM 24.288 3,259,937,668 3.036 70.096 97.420 7.09Jacobian RIM 0.633 5,317,452,010 24.761 94.029 119.531 8.70Jacobian UCIM 0.514 3,175,942,216 14.789 70.757 85.706 6.23Primal LIM 24.303 3,261,899,960 51.650 50.288 126.241 9.19Primal RIM 0.633 5,320,297,083 104.053 58.269 163.708 11.91Primal UCIM 0.514 3,177,904,508 62.152 50.288 113.788 8.28

Table 3: Memory requirement for the SU2 Onera M6 test case without preac-cumulation. The memory factor is tape memory relative to the primal memoryof 13.73 GB.

Type Adjoint Statement Statement Argument Tape Memoryvector entries data data memory factor(in GB) (in GB) (in GB) (in GB)Jacobian LIM 13.666 1,834,310,180 1.753 46.509 61.884 4.50Jacobian RIM 0.633 3,450,714,509 16.068 65.517 82.326 5.99Jacobian UCIM 0.514 1,750,314,728 8.150 46.509 55.480 4.03Primal LIM 13.674 1,835,297,413 29.063 34.456 77.194 5.61Primal RIM 0.633 3,452,339,485 67.520 40.798 109.703 7.98Primal UCIM 0.514 1,751,301,961 34.251 34.456 70.055 5.10

Table 4: Memory requirement for the SU2 Onera M6 test case with preaccu-mulation. The memory factor is tape memory relative to the primal memory of13.73 GB. . . . . LinearReuseUseCount .

33 1 . . .

87 2 . . .

56 2 . .

06 2 .

96 4 . . time factor w.r.t. primalJacobian No PreAccJacobian PreAccPrimal No PreAccPrimal PreAccFigure 13: SU2 Onera M6 test case recording21 . . . . . LinearReuseUseCount . . . . . .

37 0 .

94 1 . . .

63 0 . . time factor w.r.t. primalJacobian No PreAccJacobian PreAccPrimal No PreAccPrimal PreAccFigure 14: SU2 Onera M6 test case reversalfor this new manager in CoDiPack has been presented and analyzed for itsproperties and possible eﬀects on the taping process.The new use count index manager was then tested with two applications.The Burgers test case showed that the overhead for the additional data man-agement is about 4% in the bandwidth limited case. In the SU2 test case, theadvantages with respect to the memory reduction could be shown. With thenew manager, the same number of statements as for the linear index managercould be reached, that uses the assign optimization already. Overall the memoryconsumption could be reduced by 10% with respect to the linear index manager,that is, the previous best case scenario. References [ASG15] T. Albring, M. Sagebaum, and N. R. Gauger. Development of aConsistent Discrete Adjoint Solver in an Evolving Aerodynamic De-sign Framework. In , AIAA Aviation. American Institute ofAeronautics and Astronautics, jun 2015.[BA09] J. Biazar and H. Aminikhah. Exact and numerical solutions fornon-linear burgers’ equation by vim.

Mathematical and ComputerModelling , 49(7):1394–1400, 2009.[Bah03] A. Bahadır. A fully implicit ﬁnite-diﬀerence scheme for two-dimensional burgers’ equations.

Applied Mathematics and Computa-tion , 137(1):131–137, 2003.[BB08] B. M. Bell and J. V. Burke. Algorithmic Diﬀerentiation of implicitfunctions and optimal values. In

Advances in Automatic Diﬀerenti-ation , pages 67–77. Springer, 2008.22Cop92] James O. Coplien.

Advanced C++ Programming Styles and Idioms .Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,1992.[EPC +

15] T. D. Economon, F. Palacios, S. R. Copeland, T. W. Lukaczyk, andJ. J. Alonso. SU2: An Open-Source Suite for Multiphysics Simulationand Design.

AIAA Journal , 54(3):828–846, 2015.[GW08] A. Griewank and A. Walther.

Evaluating Derivatives, second edition .SIAM, 2008.[Hog14] R. J. Hogan. Fast reverse-mode Automatic Diﬀerentiation using ex-pression templates in C++.

ACM TOMS , 40(4):26:1–26:24, 2014.[KT51] H. W. Kuhn and A. W. Tucker. Nonlinear Programming. In

Proceed-ings of the Second Berkeley Symposium on Mathematical Statisticsand Probability , pages 481–492, Berkeley, Calif., 1951. University ofCalifornia Press.[LLN16] K. Leppkes, J. Lotz, and U. Naumann. Derivative code by over-loading in C++ (dco/c++): Introduction and summary of features.Technical Report AIB-2016-08, RWTH Aachen University, Septem-ber 2016.[Nau12] U. Naumann.

The Art of Diﬀerentiating Computer Programs: AnIntroduction to Algorithmic Diﬀerentiation . Software, Environments,and Tools. SIAM, 2012.[SAG18] M. Sagebaum, T. Albring, and N.R. Gauger. Expression templatesfor primal value taping in the reverse mode of algorithmic diﬀerenti-ation.

Optimization Methods and Software , 2018.[SAG19] M. Sagebaum, T. Albring, and N.R. Gauger. High-PerformanceDerivative Computations Using CoDiPack.

ACM Trans. Math.Softw. , 45(4), December 2019.[WG09] A. Walther and A. Griewank. Getting started with ADOL-C. In

Combinatorial scientiﬁc computing , pages 181–202, 2009.[ZSD10] H. Zhu, H. Shu, and M. Ding. Numerical solutions of two-dimensionalburgers’ equations by discrete adomian decomposition method.