Index handling and assign optimization for Algorithmic Differentiation reuse index managers
aa r X i v : . [ c s . M S ] J un Index handling and assign optimization forAlgorithmic Differentiation reuse index managers
Max Sagebaum, Johannes Blühdorn, Nicolas R. GaugerChair for Scientific ComputingTechnische Universität Kaiserslautern
Abstract
For operator overloading Algorithmic Differentiation tools, the iden-tification of primal variables and adjoint variables is usually done viaindices. Two common schemes exist for their management and distribu-tion. The linear approach is easy to implement and supports memoryoptimization with respect to copy statements. On the other hand, thereuse approach requires more implementation effort but results in muchsmaller adjoint vectors, which are more suitable for the vector mode ofAlgorithmic Differentiation. In this paper, we present both approaches,how to implement them, and discuss their advantages, disadvantages andproperties of the resulting Algorithmic Differentiation type. In addition,a new management scheme is presented which supports copy optimiza-tions and the reuse of indices, thus combining the advantages of the othertwo. The implementations of all three schemes are compared on a simplesynthetic example and on a real world example using the computationalfluid dynamics solver in SU2.
Algorithmic Differentiation (AD) refers to techniques for the machine accuratedifferentiation of computer codes. To that end, we interpret a code with inputs x ∈ R k and outputs y ∈ R m as a mathematical function F : R k → R m . Usually,the code does not only contain floating point computations but also controlflow statements. For each specific input choice x , however, the control flow isfixed and we may consider the computations as a composition of elementaryoperations like additions or multiplications together with, e. g., standard mathlibrary functions like sin or exp. The derivatives of these elementary operationsare known and can be implemented explicitly. Then, AD evaluates the derivativeof the full code at the specific input x by applying the chain rule to the sequenceof elementary operations. In the following, we summarize the principles of ADwith a focus on the reverse mode and operator overloading implementations.Comprehensive introductions to AD are given in [GW08, Nau12].The reverse mode of AD is a specific strategy for the automatic evaluationof the chain rule. There, each variable v of the computer program — whichmight be an input, output or intermediate variable — is associated with acorresponding adjoint variable ¯ v . ¯ v stands for the derivative of the outputs y as1 function of v in the direction of ¯ y , that is, ¯ v = dydv ( v ) T ¯ y . The adjoint outputs ¯ y — that is, the derivative direction — must be specified by the user. Particularly,the adjoint inputs satisfy the relation ¯ x = dFdx ( x ) T ¯ y (1)which is the standard way to describe the reverse mode of AD. By suitablechoices of ¯ y , various derivatives of the computer program can be computed. Forexample, if m = 1 and ¯ y = 1 . , ¯ x is the gradient of F evaluated at x .During a single evaluation of (1), the reverse mode of AD avoids the com-putational cost of setting up the full Jacobian dFdx . Instead, an evaluation onthe statement level is performed that is equivalent to the computation of thematrix vector product. Let w = φ ( v ) be a statement in the computer codewith φ : R n → R and n ∈ N . φ might be a single elementary operation or acomposition of multiple elementary operations. For each statement, reverse ADcomputes the adjoint update as ¯ v j + = ∂φ∂v j ( v ) ¯ w ; ¯ w = 0 ∀ j = 1 . . . n . (2)The order of statement evaluation is reversed, that is, if we have statements φ i with i = 1 . . . l for the primal evaluation, we need to evaluate Equation (2)for i running from l to . In more general terms, the sensitivity information ispropagated from the outputs to the inputs, which corresponds to the evaluationof the chain rule in reverse order. Since the reverse run requires information thatis only available after the primal evaluation is finished, certain information needsto be stored by AD such that the reversal is possible. The process of recordingthis information alongside the primal computation is known as taping.In this paper, we focus on AD implementations that use operator overloadingfor the augmentation of the computer program. There, a new computation type(e.g. adouble ) replaces the original computation type (e.g. double ). The newtype and corresponding operator overloads store information for each statementsuch that the adjoint updates according to Equation (2) can be performed afterthe primal computation is finished. First, we must store information for eachstatement that allows for the computation of the derivative expression in (2).Jacobian taping approaches compute ∂φ∂v ( v ) alongside the primal evaluation andstore it for the reversal, primal value taping approaches store the primal values v and some handles to evaluate ∂φ∂v with these values in the reverse run. De-tailed discussions on how these two approaches can be implemented are given in[SAG19] and [SAG18] for Jacobian tapes and primal value tapes, respectively.Second, we must be able to identify the variables that were involved in a state-ment, and associate them with their adjoint variables. For operator overloadingAD tools, an adjoint variable is usually not stored together with the associatedprimal variable in the new computation type because the lifetime of the pri-mal variable is not the same as the lifetime of the adjoint variable. Instead,an identifier is stored in the new computation type which associates the primalvariable with an entry in a global adjoint vector. During the primal evaluation,identifiers are assigned to all variables, and for each statement, the identifiersof the involved variables are stored for the reverse run.Currently, there are two established identifier management schemes. In thefirst, identifiers are handed out in a linear fashion where each identifier is only2sed once for a variable. For example, dco/c++ [LLN16], CppAD [BB08] andthe CoDiPack RealReverse type use this management strategy. The secondtechnique relies on the reuse of the identifiers. The life span of the identifieris coupled to the life span of the AD value and it is used again after the ADvalue is deleted. Among other tools, Adept [Hog14], ADOL-C [WG09] andthe CoDiPack
RealReverseIndex type use this management scheme. Bothschemes have their advantages and yield different properties for the AD typethat is using them. The linear management allows to skip copy operations inthe taping process and the reuse scheme has a much smaller adjoint vector.In this paper, we introduce a new scheme for the handling of identifiers thatcombines the advantages of both established schemes. The basic idea is to adda use counter for the identifiers.The outline of this paper is as follows. First, we define some nomencla-ture and prerequisites which are essential for the discussion. Afterwards, wederive from the use cases of the identifier management the interface betweenits implementation and the AD tool. Since there are neither resources on theimplementation details nor discussions about the properties of the establishedidentifier schemes available, we introduce both schemes, their properties andprovide an example implementation. The assign optimization for copy opera-tions is discussed afterwards and leads us to the introduction of the new identi-fier management scheme. Finally, we summarize the properties of the differentschemes and compare all three in an artificial test case and apply them in theSU2 CFD solver [EPC +
15] on a real world test case. All code presented in thispaper is C++11 code. An AD tool encompasses the full library of an AD implementation (e.g. dco/c++,ADOL-C, or CoDiPack).
AD type refers to the implementation that performsthe operator overloading approach (e.g., adolc::adouble , codi::RealReverse ,or codi::RealReverseIndex ). An AD value is a specific instance of an ADtype (e.g. codi::RealReverse a; ) that has a lifetime and memory. It is con-structed, written, read and destructed.For a given AD type, its
AD tape is the structure that stores all informationfor the reverse mode of AD. The term recording describes the process of storingthe information on the AD tape and the term reverse evaluation describes theprocess of seeding the output variables, running the reverse mode of AD andobtaining the adjoints of the input variables.We refer to the data used to identify adjoints (e.g. ¯ w ) of the primal values(e.g. w ) as identifiers . Since the optimal choice for the identifier type is aninteger, we use for the implementing classes names like IndexManager . Forclarity, we do not introduce a template parameters for the index type and assumethat it is handled like an int . In the linear management scheme, this allowsfor tapes that are approximately 100 GB in size, that is, statements withan average of four arguments per statement. For reuse index schemes, thisintroduces limitations only if there are more than ≈ AD values in useat the same time. Only up to version 1.2, afterwards the new scheme described in this paper is used. or (;;) {AD::resetTape();AD::startRecording(); for ( auto input : inputs) { AD::registerInput(input); }callFunc(inputs, outputs) for ( auto output : outputs) { AD::registerOutput(output); }AD::stopRecording(); for ( auto output : outputs) { AD::setAdjoint(output, ...); }AD::evaluateReverse(); for ( auto input : inputs) { AD::getAdjoint(input, ...); }} Figure 1: Interaction between program and AD tool.For the identifiers, we assume that they are unique in the application, thatis, a single identifier must not be used for two different AD values. Each ADtool must provide mechanisms that guarantee this property. In this paper, wefocus on CoDiPack. There, it is sufficient to initialize all memory with zerosbecause CoDiPack assumes that the zero identifier is used for all independentvariables, that is, variables that do not depend on the input variables. Then,zero is the only identifier which does not have to be unique with respect to thedefinition above.We assume that the program interacts with the AD tool as depicted in Figure1. Particularly, the index manager must • support multiple tape recordings • and handle overwrites of old AD values from previous recordings.The call to resetTape , that is expected to be placed every time a new tape isrecorded, can be used by the index manager to clean up internal data structures.One additional assumption is that the repeated recording of the tape is safe.This means that all AD variables have gained a new identifier before they areused in callFunc . A new index can either be gained by calling registerInput on that variable or overwriting it with an assignment. This is especially impor-tant for the linear management scheme since otherwise, the correctness of theresult cannot be guaranteed. Usually, we distinguish only two use cases for identifier updates. The first oneis an assignment of an identifier to an AD value. Here, the old identifier ofthe AD value must be taken into account by the identifier scheme. In theimplementation, this is realized via a call to the assignIndex( int & i) methodof the index manager. The second case is the deletion of an identifier, which isindicated to the implementation by a call to the method freeIndex( int & i) .4or the assign optimization, which is introduced later, a third use case isrequired. There, the implementation for the index manager must take the lefthand side (lhs) identifier and the right hand side (rhs) identifier into account. copyIndex( int & lhs, const int &rhs) is the corresponding method in theinterface.It is now important to determine the scenarios in which the methods of theindex manager implementation are called. In general, the lifetime of an ADvalue can be categorized into four operations: It can be created, overwritten,used and deleted.The used case occurs if the AD value is used on the right hand side of anassignment (e.g. ... = a * a ). There, no change of the identifier is requiredand the index manager does not have to be informed.The deleted scenario occurs only if the AD value runs out of scope or deleteis called. For both cases, it is sufficient that the destructor of the AD type callsthe freeIndex method.The created case occurs whenever AD values are constructed. It is easilytreated in CoDiPack since we assume that the zero identifier can be used multi-ple times and indicates an inactive AD value. The AD tape can simply set theidentifier and does not have to inform the index manager implementation.The overwrite case is the most common and interesting one. Examplesare a = ... , ADType b = ... , or c = d . In general, it can be handled by acombined freeIndex call and assignIndex call. The free call ensures that theidentifier seen by the assign call is zero so that the assign call always generatesa new identifier for the AD value. This approach is in accordance with the ADtheory: The old identifier of an AD value is not required for the correctness ofthe reverse evaluation.This approach allows optimizations of the data management. An identifierthat is first freed with freeIndex can then be immediately used in the following assignIndex call. Therefore, the freeIndex call is skipped and assignIndex has to check the old value in the memory location. It can then decide how tohandle identifiers that are overwritten. This allows for an optimal implementa-tion with respect to memory movements.One general consideration has to be made about when assignIndex is calledduring the recording process of a statement, especially when the left hand sideof the statement is captured as a reference and also occurs on the right handside of the statement, e.g. a = a * a . A call to assignIndex might changethe identifier of a . Then, if the information for the right hand side is gath-ered afterwards, the wrong identifier is stored. It is therefore essential to call assignIndex only after all data for the right hand side has been gathered. The simplest option for the assignment of identifiers to AD values is the linearmanagement scheme. Here, each new variable gets its own unique identifier.The implementation is straightforward, see Figure 2.The class contains a member which stores the last identifier given to any ADvalue. Based on this value, a new identifier is created by increasing the memberby one and returning the new value. 5 truct
LinearIndexManager { int lastIndex;LinearIndexManager() : lastIndex(0) {} void assignIndex( int & i) {lastIndex += 1; // leave out the zero index return lastIndex;} void freeIndex( int & i) { // empty } void reset() {lastIndex = 0;}}; Figure 2: Implementation of the linear management scheme.To each AD value, a unique identifier is assigned that is only used for thelife time of the respective variable. After the AD value has been overwrittenor deleted, the identifier is not used again. This often leads to adjoint vectorswhich are quite large while most of the entries are used only a few times, leadingto “dead” memory.The advantage of the
LinearIndexManager is that the left hand side identi-fiers for statements do not have to be recorded on the tape [SAG19]. The basicidea is that each new statement increases the counter by one and in the reverseevaluation, the counter is decremented by one. Therefore, the left hand sideidentifier does not have to be stored. Note that this change interlocks the state-ments and identifiers. Each statement generates an identifier and, conversely,to generate an identifier, a statement must be written to the tape.
The life time of a variable can be very short. It would therefore make sense toreuse the identifier generated for the variable. An index manager with such aproperty needs mechanisms to track identifiers which are no longer used.An implementation that tracks the identifiers is shown in Figure 3. Thelogic of the manager is slightly more involved. The constructor preallocates thedefault block size of identifiers. Afterwards, an initial set of available identifiersis generated with createNewIndices . This method does not have to check ifthe indices vector has enough space since the default block size is preallocatedand it is only called if indicesLeft is zero.The method assignIndex skips the assignment of an identifier if there isalready one. This can be done as a result of the overwrite discussion in Section3. If the value i is freed, it is immediately assigned to i again, therefore we can6 truct ReuseIndexManager { int maximumIndex;std::vector< int > indices; size_t indicesLeft; static const size_t
INDEX_BLOCK_SIZE = 256;ReuseIndexManager() : maximumIndex(0), indices(INDEX_BLOCK_SIZE),indicesLeft(0){ createNewIndices(); } void createNewIndices() { // only called if indicesLeft is zero for (; indicesLeft < INDEX_BLOCK_SIZE; indicesLeft += 1) {maximumIndex += 1;indices[indicesLeft] = maximumIndex;}} void assignIndex( int & i) { if (0 == i) { // leave index in place if not zero if (0 == indicesLeft) {createNewIndices();}indicesLeft -= 1;i = indices[indicesLeft];}} void freeIndex( int & i) { if (0 != i) { if (indicesLeft == indices.size()) {indices.resize(indices.size() + INDEX_BLOCK_SIZE);}indices[indicesLeft] = i;indicesLeft += 1;i = 0;}} void copyIndex( int & lhs, const int & rhs) { if (0 != rhs) { assignIndex(lhs); } else { freeIndex(lhs); }} void reset() { // index management is global, no reset required }}; Figure 3: Basic reuse index manager implementation.7 void func() { ADVar a, b; AD::registerInput(a); AD::resetTape(); AD::registerInput(b); } Figure 4: Example for possibly wrong index management in a reuse index man-ager.optimize this case.If the identifier is zero, assignIndex checks if there are indices left andcreates new ones if necessary. Afterwards, it pops the last index from the vector.The freeIndex method needs to check if the vector has space left for the pushof the index. Since the zero identifier is the passive identifier, it should not befreed which is checked first in the method.We also implement the copyIndex method to have the full interface definedin Section 3 available. This basic implementation ensures that the left handside identifier is assigned a valid index if the right hand side is active.The most interesting method is the reset method since it does not contain anylogic. This is mandatory since all identifiers for the current AD variables stayvalid after the reset. A variable that is overwritten or freed after the reset willreturn its identifier to the index manager which has — in this implementation— no means to check if the identifier has been created before or after the reset.The issue can be demonstrated with the example in Figure 4. In line 6, bothvariables a and b are freed, which adds both identifiers to the list of indices.Suppose a gets the identifier 1. If, together with the tape reset, the indexmanager is reset as well, then b also gets the identifier 1. The free in line 6adds then two times the identifier 1 to the indices list. In the further recordingprocess, the identifier 1 is then given to two distinct AD values, which violatesthe uniqueness property postulated in Section 2.It is therefore necessary that the reset method of the reuse index managerkeeps all information as it is and does not revert to a state where all identifiersare unused. For Algorithmic Differentiation, the definition of the input values is quite im-portant as it determines which part of the program is differentiated. In general,the registration of an input includes the assignment of a new identifier to anAD value. Depending on the used index manager, special care must be takenhow this is done.In order to illustrate problems which can occur for the index manager, theexample in Figure 5 is used. The example contains two variable registrationswith a short computation in between. Identifiers that are generated and freed inlines 4 and 5 can lead to problems, which are discussed for each index managerseparately. 8 void func(ADVar* x, ADVar* y) { AD::registerInput(x[0]); { ADVar t = x[0] * x[0]; y[0] = t + x[0]; } // t is freed AD::registerInput(x[1]); y[1] = y[0] * x[1]; } Figure 5: Example for difficulties with input registrations.
For the linear index manager, it is important to keep in mind that statementsand identifiers are interlocked. Therefore, if a variable requires a new identifier,the tape has to write a statement. Hence, five statements are written in theexample from Figure 5, that is, two statements for the registrations in lines 2and 8 and three statements for the expressions in lines 4, 5 and 9.The theory of the reverse mode of AD requires that each reverse evaluationof a statement is followed by setting the adjoint of the left hand side to zero, seeEquation (2). The index manager has no influence on this and cannot changethis default behavior.Particulary, the adjoint of an input variable is set to zero after the reverseevaluation of a registerInput statement, that is, the derivative informationwe are interested in is overwritten.This can be fixed to some extent if we do not store registerInput state-ments at the beginning of the tape. In the above example, however, this pre-serves only the adjoint of x[0] , the adjoint of x[1] is still overwritten.Another solution is to skip the set to zero of the left hand side adjoint afterthe reverse evaluation of a statement. For a linear index management, this ispossible since each identifier is used only once. It further provides the userwith the adjoints of all intermediate computations. However, the adjoint vectorneeds to be cleared before another reverse interpretation of the same tape canbe performed, which costs approximately 25% of the interpretation time of atape. If the adjoints of the left hand side are set to zero directly in the reverseinterpretation, this cost is hidden behind other operations.The third option is to tag the statements for registerInput calls and skipthe reverse interpretation for them. Operator tapes can have a special operatorfor input statements and know from this operator that the reverse evaluationmust be skipped for this statement. Statement level tapes can use a specialnumber for the right hand side active variable count like − and identify registerinput statements via this number. To implement these changes, it is sufficient tomodify the register input and evaluate logic in the tape implementation. Notethat this is a modification of the statement storing and evaluation process thatcannot be realized by changes to the index manager. In general, this optionensures the correct adjoints for x[0] and x[1] and eliminates the extra cost ofresetting the adjoint vector after the interpretation.9or CoDiPack, the second option was the default behavior until version 1.6.Since then, the third option is the new default behavior. The old logic can beactivated with the preprocessor flag -DCODI_ZeroAdjointReverse=0 The first option — in addition to the other solutions — is not implementedin CoDiPack since the memory gains are rather small and it complicates thetape management.
With a fresh reuse index manager, the identifiers for the statements in lines 2,4, 5 from Figure 5 are id(x[0]) == 1 , id(t) == 2 and id(y[0]) == 3 . Inline 6, the identifier of t is freed and then given to x[1] in line 8. Withoutadditional measures, the adjoints of x[0] and x[1] are wrong after the reverseinterpretation. The correct value for x[1] is set in the reverse interpretationof the statement in line 9. Line 8 has not registered a statement and line 6assumes implicitly that the adjoint of t is zero, which is in general wrong sincethe identifier points to the adjoint of x[1] . The reverse evaluation in line 5 and4 yields therefore wrong results. Furthermore, after the evaluation of line 4, theadjoint of t and thus the adjoint of x[1] are set to zero.A possible solution consists in the creation of a statement for registerInput calls and a special management of the adjoints for input variables. However,this increases the complexity of the tape implementation for CoDiPack. Thistechnique is for example used in ADOL-C without any problems since they usea different philosophy: The user provides the adjoint vectors directly.The solution suggested in this paper and used in CoDiPack adds logic onlyto the reuse index manager. The key problem of the above example is that theidentifier of x[1] has already been used on the current tape. If the identifierwas a “fresh” one, then the problem would not occur. A “fresh” identifier is onethat is used for the first time in a tape recording process.The implementation in Figure 3 is extended by a second index vector thatstores the unused (“fresh”) indices. All new and overwritten routines are listed inFigure 6. The reset method receives the logic for appending the used identifiersto the unused ones. The assignIndex method returns an unused identifier asa fallback if no used ones are left. The method assignUnusedIndex forces therelease of the identifier stored in i since we do not know if this identifier hasalready been used in the current tape recording, otherwise it works like the old assignIndex method.This implementation avoids the errors observed in the code of Figure 5 andguarantees a correct adjoint vector after the reverse interpretation. In Section 2, we postulated that all identifiers of AD values are unique. Wewant to lift this restriction now for the assign statement b = a . The reverseevaluation of this statement is ¯ a + = ¯ b ; ¯ b = 0 , that is, all updates received by ¯ b are added to ¯ a . From a programming point of view, b could be realized as areference to the memory location of a until the next write access to a . A similarargument holds true for the index managers.10 / struct ReuseIndexManager std::vector< int > unusedIndices; size_t unusedIndicesLeft; // createNewIndices: stores indices now in the unusedIndices list void assignUnusedIndex( int & i) { if (0 != i) {freeIndex(i); // force change of index } if (0 == unusedIndicesLeft) {createNewIndices();}unusedIndicesLeft -= 1;i = unusedIndices[unusedIndicesLeft];} void assignIndex( int & i) { if (0 == i) { // leave index in place if not zero if (0 == indicesLeft) {assignUnusedIndex(i); // fallback to unused indices } else {indicesLeft -= 1;i = indices[indicesLeft];}}} // freeIndex stays the same// copyIndex stays the same void reset() { size_t totalSize = indicesLeft + unusedIndicesLeft; if (unusedIndices.size() < totalSize) {unusedIndices.resize(totalSize);}std::copy(indices.begin(), indices.begin() + indicesLeft,unusedIndices.begin() + unusedIndicesLeft);unusedIndicesLeft = totalSize;indicesLeft = 0;} Figure 6: Extended reuse index manager implementation by unused identifiers.11orward eval.: Finish a assigned a = . . . a is used a copied to bb = a a and b are usedReverse eval.: ¯ a and ¯ b are zero ¯ a is used foran update . . . + = J ∗ ¯ a ¯ a isupdated ¯ a + = . . . Add updatesof ¯ b to ¯ a ¯ a + = ¯ b ¯ a and ¯ b are updated ¯ a + = . . . ¯ b + = . . . Figure 7: Primal and reverse connection between two variables that are a copyof each other. // LinearIndexManager static const bool
AssignNeedsStatement = false; void copyIndex( int & lhs, const int & rhs) {lhs = rhs;}
Figure 8: Implementation for the assign optimization of a linear managementscheme.
The optimization for the linear management scheme presented below has beenimplemented by the group of Uwe Naumann in dco/c++.For the
LinearIndexManager , it is guaranteed that each identifier is usedfor at most one AD value. Specifically, the identifier of b is only used for b . ¯ b is initially zero and for every operation that uses b , ¯ b is updated accordingly.The reversal of b = a is the last time that ¯ b is used in the reverse interpretationprocess and all of its updates are added to ¯ a at once. This is also picturedin Figure 7. For ¯ a , we see in this Figure that it is only changed by updateoperations (+=) and used with a multiplier for an update at the point were theassignment of a is reversed. Since all changes to ¯ a from ¯ b are done beforehand,it does not matter when the value of ¯ b is added to ¯ a . Thus, updates to ¯ b candirectly be added to ¯ a instead. If we use the same identifier for a and b , thishappens automatically in the reverse interpretation. The advantage is that nostatement needs to be stored on the tape for the copy operation.We introduce the static member AssignNeedsStatement to the index man-ager implementation. The tape uses this member to decide whether or not astatement has to be pushed for statements like a = b . In such cases, the tapecalls the copyIndex method to indicate to the index manager that an assignoperation is performed. The additional code for the linear index manager isstraightforward, see Figure 8. 12 truct IdentifierUseCount {std::vector< int > useCount; void setMaximumSize( size_t size) {useCount.resize(size);} bool unuseIndex( int & i) {useCount[i] -= 1; return useCount[i] == 0; // indicates last use of this identifier } void useIndex( int &i) {useCount[i] += 1; // new use location of the identifier }}; Figure 9: Use count manager for identifiers.
In principle, the same argument as for the linear management scheme can bedone for the reuse management scheme. However, the reuse index managementas implemented in Figure 3 recollects freed indices. Therefore, if we simplycopy the right hand side identifier in assignments, two AD values with the sameidentifier are eventually freed, adding the identifier twice to the list of indices.After that, it might be handed out to two different, completely unrelated ADvalues, violating the (relaxed) uniqueness assumption. Therefore, the assignoptimization is not possible with the implementation from Figure 3.Only with an extension that keeps track on how often an identifier is used,it is possible to use the assign optimization with a reuse index manger. Such anextension is described in the next section.
The basic principle that enables the assign optimization for index managersthat reuse identifiers is the same as for the reference counting in smart pointers[Cop92]. Each time an identifier is used or destructed, a corresponding counteris incremented or decremented, respectively. If the counter reaches zero, theidentifier can be freed. However, since millions of such reference counts arerequired, we need a solution that has less overhead and uses the least amountof memory possible.First, we implement a class that manages the use count of the identifiers.Then, we use it to extend the
ReuseIndexManager and enable the assign op-timization. The general implementation in Figure 9 has only three functionsand is self-explanatory. It is assumed that the function setMaximumSize isalways called when the index manager generates new identifiers. Therefore,the other methods do not need to check for the bounds. The implementa-13 / struct UseCountIndexManager {
IdentifierUseCount uc; void copyIndex( int & lhs, const int & rhs) { if (lhs != rhs) {freeIndex(lhs);uc.useIndex(rhs);lhs = rhs;}} Figure 10: Assign optimization implementation for a reuse management scheme.tion of the
UseCountIndexManager class is obtained by modifications of the
ReuseIndexManager class. Too ensure the correct size of the use count vec-tor, we call setMaximumSize after each call to createNewIndices . unuseIndex is called at the beginning of freeIndex and we evaluate the remaining logicin freeIndex only if the return value indicates the last use of the identi-fier. assignUnusedIndex calls useIndex for the newly assigned identifier. Themethod assignIndex first calls unusedIndex . If it is the last use of the iden-tifier, then no new identifier is acquired. If the identifier is still in use, then anew one is retrieved by the old logic. Afterwards, useIndex is called in bothcases to indicate a new use of the identifier. copyIndex is the only method where the logic is changed. The old imple-mentation in Figure 3 only needs to ensure that the left hand side index is valid.The new logic in the UseCountIndexManager class has to free the index of theleft hand side since it is overwritten. It also calls useIndex for the identifier ofthe right hand side. The implementation is shown in Figure 10. The if conditionin the code avoids the logic if the left hand side and right hand side have thesame index. In such a situation, freeIndex would decrease the use count byone only to increase it in the call of uc.useIndex by one immediately after-wards. This avoids also the corner case where the arguments &lhs and &rhs point to the same memory location. There, the freeIndex call that is supposedto set the left hand side identifier to zero erroneously resets the right hand sideidentifier as well, resulting in a passive variable. This situation can easily occurin a self assignment like a = a .This implementation of the
UseCountIndexManager supports the assign op-timization for an index manager that reuses identifiers.
The three index managers have some special properties that we want to highlightnow.
The
LinearIndexManager from Figure 2 is compatible with C-like memoryoperations (e.g. memcpy ). All AD types that use this kind of index manager14ave this property. It can directly be deduced from the copyIndex methodimplementation in Figure 8. Since the only operation for the identifiers is thecopy operation, a C-like memory operation would do the same.Some tape implementations might store additional data for each identifierin their internal data structure. As long as a copy operation of an active valueresults in a C-like memory operation of the internal data associated to theidentifier these AD types are also compatible with C-like memory operations.The other two index managers are not compatible with C-like memory oper-ations. For the
ReuseIndexManager , it is the same reason why it cannot supportthe assign optimization. On the other hand, the
UseCountIndexManager cansupport the assign optimization but the logic in the copyIndex method in Fig-ure 10 is not a trivial copy operation and therefore it is not compatible withC-like memory operations.
The
ReuseIndexManager in Figure 3 is implemented such that the assignIndex method keeps the old index in place. Therefore, a variable that is never deletedand is always assigned an active value will keep the same identifier for the overallruntime of the application. This property can be used to create arrays of ADvalues that have continuous indices for the whole vector. Special AD constructslike external functions can then build on this property of the vector to reducethe required memory and to improve runtime performance.The index manager in CoDiPack does not support the creation of vectorswith continuous identifiers. Until now, no use case required it. Nevertheless,ADOL-C supports this behavior with a call to ensureContiguousLocations prior to allocating the vector.The other two index managers do not support this behavior. This is clearfor the
LinearIndexManager since it creates a new identifier for each statement.The behavior for the
UseCountIndexManager cannot be guaranteed: In a state-ment, a new identifier has to be created if the use count of the left hand sideidentifier is greater than one. With an extension that tags unique identifiers,however, this behavior could be added to the
UseCountIndexManager .
10 Summary
We presented three index managers with different properties for the handling ofidentifiers for AD types. The properties of the three index managers are summa-rized in Table 1. The memory per statement is zero for the
LinearIndexManager since the identifier can be computed from counting statements [SAG19]. For theother two, the identifier of the left hand side has to be stored. In addition, the
UseCountIndexManager requires memory for each identifier that is created sinceit needs to store the use count for each identifier. The number of branches forthe different identifier operations show that for the more advanced features, thecomplexity in the control flow increases.The assign optimization with an index manager that reuses indices is thenew contribution of the paper. In the results section below, we test the runtimeand memory impacts. 15able 1: Summary of properties and implementation details for thethree index managers:
LinearIndexManager ( LIM ), ReuseIndexManager ( RIM ), UseCountIndexManager ( UCIM ). LIM RIM UCIM
Memory per statement (byte) 0 4 4Memory per identifier (byte) 0 0 4Branches for assign 0 3 4Branches for free 0 2 3Branches for copy 0 3 4Assign optimization + - +Special properties C-mem comp. (fixed identifiers) -Statement for register input + - -
11 Results
The coupled Burgers’ equation is used for a general comparison of performancevalues for the different implementations. Problem setup and discretization arealready described in [SAG19] and [SAG18]. For completeness, we recapitulatehere the basic information.The coupled Burgers’ equation [BA09, Bah03, ZSD10] u t + uu x + vu y = 1 R ( u xx + u yy ) , (3) v t + uv x + vv y = 1 R ( v xx + v yy ) (4)is discretized with an upwind finite difference scheme. The initial and boundaryconditions are taken from the exact solution u ( x, y, t ) = x + y − xt − t ( x, y, t ) ∈ D × R , (5) v ( x, y, t ) = x + y − xt − t ( x, y, t ) ∈ D × R (6)given in [BA09]. The computational domain D is the unit square D = [0 , × [0 , ⊂ R × R . As far as the differentiation is concerned, we choose the ini-tial solution of the time stepping scheme as input parameters, and as outputparameter we take the norm of the final solution.For the implementation of the program, all methods are written in such a waythat they can be inlined and the requested memory is allocated and initializedbefore the time measurement starts. Particularly, the required memory for thetape is computed beforehand and then allocated. This yields very stable timemeasurements which are run on one node of the Elwetritsch cluster at the TUKaiserslautern. The node consists of two Intel Xeon 6126 CPUs with a total of24 cores and 384 GB of main memory. We discretize the Burgers’ equation ona × grid and solve it with 32 iterations. All timing values are averagedover 10 evaluations.For the time measurements, two different configurations are tested.16 ype Adjoint vector Statement data Argument data Tape memoryJacobian LIM 711.77 MB 88.97 MB 3963.54 MB 4764.28 MBJacobian RIM 16.50 MB 444.86 MB 3971.81 MB 4433.18 MBJacobian UCIM 16.50 MB 441.41 MB 3963.54 MB 4429.72 MBPrimal LIM 711.77 MB 1512.70 MB 1849.80 MB 4074.26 MBPrimal RIM 16.40 MB 1868.40 MB 1852.50 MB 3754.17 MBPrimal UCIM 16.50 MB 1853.93 MB 1849.80 MB 3745.20 MB Table 2: Memory requirement for the Burgers taste case. • The multi test configuration runs the same process on each of the 24 cores.This setup simulates a use case where the full node is used for computationand every core uses the memory bandwidth of the socket. • The single test configuration runs just one process on the whole node.This eliminates the memory bandwidth limitations and provides a betterview on the computational performance.Both test configurations are evaluated with all three index managers, and withthe Jacobian taping approach as well as the primal value taping approach. Thememory consumption for one process is shown in Table 2.
Adjoint Vector , State-ment Data and
Argument Data show the required memory for the correspondingtape entries. The size of the adjoint vector is coupled in the linear managementcase to the size of the statement data. In the reuse case, the adjoint vector is aslarge as the number of simultaneously active variables. An entry in statementdata is generated for every statement in the application. With the assign op-timization, the number of entries should be reduced. Argument data containsone entry per argument to a statement. It is reduced by the same size as thestatement data with the assign optimization.
Tape Memory shows the overallmemory of the tape. The adjoint vector, statement data and argument data arethe largest part of the tape memory but there are some additional structuresthat require some resources.Since the Burgers test case is written such that nearly no copy operations areperformed, there is no memory gain with the new index manager. Therefore,the following performance comparison shows the management overhead for thenew index manager.The timing evaluations for the Jacobian types and primal values types canbe seen in Figure 11 for the tape recording and in Figure 12 for the tape reversal.The Jacobian recording results show quite clearly that the increased complexityof the index handler decreases the recording performance. For the single case,this amounts to 5% for the reuse index manager and to 17% for the use countindex manager compared to the linear index manager. For the multi case, theperformance drop is in the same region of about 7% and 11%, respectively. Thisdecrease in performance also occurs since the required data for each statementis larger and therefore, more memory bandwidth is used. During the reversal ofthe tape in Figure 12, the performance increases compared to the linear indexmanager. For the single case, the performance gain is around 9% and around20% for the multi case. The multi case yields here a higher performance increasesince the memory bandwidth limitations are lifted a little bit with the reducedadjoint vector size. For the new index manager, no further performance increaseis seen since it does not have a large effect on the size of the adjoint vector.17 . . . . . . LinearReuseUseCount . .
22 1 .
36 1 . . . .
26 1 . . .
38 1 . . time in sec.Jacobi gcc single recordJacobi gcc mult recordPrimal gcc single recordPrimal gcc mult recordFigure 11: Burgers test case tape recording.The same conclusions can be drawn for the performance results of primalvalue tapes. During the recording, the peformance decreases for both config-urations. The initial step from a linear management approach to the reuseapproach reduces the performance by 19% for both configurations. From thereuse handling to the use count handling, the decrease is only 1% with a total20% performance decrease for both configurations. During the reverse evalua-tion, there is an increase in performance by 6% for the single case and 13% forthe multi case. The drop in performance from the reuse handling to the usecount handle scheme cannot really be explained by the algorithm. In theory,the performance should not decrease since the algorithm is the same. It can beexplained by the layout of the executable. Function pointers are used in thereverse run of the primal value tapes and the higher complexity of the codeyields a larger binary. The jump for calling the function pointer is now largerand the CPU will handle it in a different manner.As expected, the new index manager has a negative impact on performanceduring the recording of a tape. In the reverse interpretation of the tape, noperformance gains are expected since the Burgers test does not contain manycopy operations. For the second test case, we use the discrete adjoint of SU2 [EPC + U = G ( U, X ) . (7) X represents the design variables and G ( U ) some (pseudo) time-stepping schemelike the explicit or implicit Euler method. The fixed-point iteration U n +1 = G ( U n , X ) is applied for n → N ∗ until the fixed point U ∗ is reached. For the18 . . . . . . LinearReuseUseCount . . .
78 1 . .
97 1 . . .
13 1 . . . time in sec.Jacobi gcc single reverseJacobi gcc mult reversePrimal gcc single reversePrimal gcc mult reverseFigure 12: Burgers test case tape reversal.shape optimization with respect to the design X , we formulate the minimiza-tion problem min U J ( U, X ) s.t. U = G ( U, X ) . As detailed in [ASG15], wesolve the minimization problem by fulfilling the KKT conditions [KT51] on theLagrangian function L ( U, X, ¯ U ) = J ( U, X ) + ( G ( U, X ) − U ) T ¯ U . (8)This requires the solution of the adjoint state equation ¯ U = (cid:20) ∂J ( U ∗ , X ) ∂U (cid:21) T + (cid:20) ∂G ( U ∗ , X ) ∂U (cid:21) T ¯ U (9)which is formulated here as a fixed-point equation. To solve this equation, werecord the tape for the Lagrangian function L with CoDiPack once. Then, thetape is evaluated again and again until the fixed point ¯ U ∗ is reached. It istherefore important for SU2 that the tape evaluation is as fast a possible. Thetape recording is not that time critical.As a testcase, we consider the viscous flow over the Onera M6 wing. Thecomputational mesh consists of . million interior elements and for the solution,the 3D RANS equations are used. The test is run on two nodes of the Elwetritschcluster at the TU Kaiserslautern. Each node consists of two Intel Xeon 6126CPUs with a total of 24 cores and 96 GB of main memory each. In total,the case is run with 192 GB of main memory and 48 cores. For the primalcomputation, one fixed-point iteration step takes about 1.1 seconds and uses13.73 GB of main memory.The performance values for SU2 are measured also for two different config-urations, namely with the preaccumulation switch either disabled or enabled.Preaccumulation is a technique that computes the Jacobian for a code regionand stores this Jacobian on the tape. If the region has few input and outputvalues but takes a lot of statements to compute, then this technique reducesthe required memory. The switch in SU2 enables this technique for certain code19laces and the default configuration in SU2 has it enabled. Since the preaccumu-lation takes some additional time during the recording of the tape and improvesthe reverse evaluation time of the tape, it would be hard to see the pure effectof the new index manager in the recording and reversal sweep. Therefore, themore theoretical results without preaccumulation and the realistic results withpreaccumulation are both presented here.All recording results in Figure 13 look qualitatively similar in the four dif-ferent configurations. In each configuration, a performance drop is seen fromthe linear index management to the reuse index management. Other than inthe Burgers case, the performance improves here for the use count index man-ager. This is due to the reduced memory seen in Table 3, which shows thatthe increased complexity can be hidden behind the memory bandwidth of theRAM.The reversal results in Figure 14 show the same relative behavior as therecording results, but here, the differences are due to the increased or decreasedsize of the stored data. Since the reuse index manager stores additional state-ments for the copy operations, it is in general slower than the other two indexmanagers. An increase in performance is seen for all cases when using the newindex manager instead of the old one. For the Jacobian tapes, this amounts toan increase of 13% both with preaccumulation and without preaccumulation.The primal value taping approach has a performance increase of 20% withoutpreaccumulation and 15% with.The memory analysis is more interesting than for the Burgers case. In SU2,a lot of copy operations are performed, which shows the importance of the newindex manager. For the configuration without preaccumulation in Table 3, thememory reduction from the reuse index manager to the use count index manageris about 30% for the Jacobian and primal value taping approaches. This reduc-tion is achieved by removing the unnecessary copy operations from the tape,which is indicated by the size of the argument data and the number of state-ment entries. With the new indexing scheme, the same number of statementsis recorded as for the linear index manager, which uses the assign optimizationalready. The memory reduction with respect to the linear index manager isthen about 12% for the Jacobian approach and 10% for the primal value tapingapproach. Here, the reduction is due to the reduced size of the adjoint vector. Itis very beneficial for the tape evaluation speed and the scaling for vector modeapplications.The cases in Table 4, where preaccumulation is enabled, show the same gen-eral trend and also the memory reductions are around the same size. A com-parison between the memory values with and without preaccumulation yields ageneral reduction of about 35% for each taping approach.
12 Conclusion
In this paper, we presented two existing schemes for the index handling in ADtool implementations, namely linear index handling and reuse index handling.The first scheme admits an optimization for assign statements such that thesestatements do not have to be stored on the tape. We presented a new indexmanager which uses a references counting technique to enable the assign opti-mization also for an indexing scheme that reuses indices. An implementation20 ype Adjoint Statement Statement Argument Tape Memoryvector entries data data memory factor(in GB) (in GB) (in GB) (in GB)Jacobian LIM 24.288 3,259,937,668 3.036 70.096 97.420 7.09Jacobian RIM 0.633 5,317,452,010 24.761 94.029 119.531 8.70Jacobian UCIM 0.514 3,175,942,216 14.789 70.757 85.706 6.23Primal LIM 24.303 3,261,899,960 51.650 50.288 126.241 9.19Primal RIM 0.633 5,320,297,083 104.053 58.269 163.708 11.91Primal UCIM 0.514 3,177,904,508 62.152 50.288 113.788 8.28
Table 3: Memory requirement for the SU2 Onera M6 test case without preac-cumulation. The memory factor is tape memory relative to the primal memoryof 13.73 GB.
Type Adjoint Statement Statement Argument Tape Memoryvector entries data data memory factor(in GB) (in GB) (in GB) (in GB)Jacobian LIM 13.666 1,834,310,180 1.753 46.509 61.884 4.50Jacobian RIM 0.633 3,450,714,509 16.068 65.517 82.326 5.99Jacobian UCIM 0.514 1,750,314,728 8.150 46.509 55.480 4.03Primal LIM 13.674 1,835,297,413 29.063 34.456 77.194 5.61Primal RIM 0.633 3,452,339,485 67.520 40.798 109.703 7.98Primal UCIM 0.514 1,751,301,961 34.251 34.456 70.055 5.10
Table 4: Memory requirement for the SU2 Onera M6 test case with preaccu-mulation. The memory factor is tape memory relative to the primal memory of13.73 GB. . . . . LinearReuseUseCount .
33 1 . . .
87 2 . . .
56 2 . .
06 2 .
96 4 . . time factor w.r.t. primalJacobian No PreAccJacobian PreAccPrimal No PreAccPrimal PreAccFigure 13: SU2 Onera M6 test case recording21 . . . . . LinearReuseUseCount . . . . . .
37 0 .
94 1 . . .
63 0 . . time factor w.r.t. primalJacobian No PreAccJacobian PreAccPrimal No PreAccPrimal PreAccFigure 14: SU2 Onera M6 test case reversalfor this new manager in CoDiPack has been presented and analyzed for itsproperties and possible effects on the taping process.The new use count index manager was then tested with two applications.The Burgers test case showed that the overhead for the additional data man-agement is about 4% in the bandwidth limited case. In the SU2 test case, theadvantages with respect to the memory reduction could be shown. With thenew manager, the same number of statements as for the linear index managercould be reached, that uses the assign optimization already. Overall the memoryconsumption could be reduced by 10% with respect to the linear index manager,that is, the previous best case scenario. References [ASG15] T. Albring, M. Sagebaum, and N. R. Gauger. Development of aConsistent Discrete Adjoint Solver in an Evolving Aerodynamic De-sign Framework. In , AIAA Aviation. American Institute ofAeronautics and Astronautics, jun 2015.[BA09] J. Biazar and H. Aminikhah. Exact and numerical solutions fornon-linear burgers’ equation by vim.
Mathematical and ComputerModelling , 49(7):1394–1400, 2009.[Bah03] A. Bahadır. A fully implicit finite-difference scheme for two-dimensional burgers’ equations.
Applied Mathematics and Computa-tion , 137(1):131–137, 2003.[BB08] B. M. Bell and J. V. Burke. Algorithmic Differentiation of implicitfunctions and optimal values. In
Advances in Automatic Differenti-ation , pages 67–77. Springer, 2008.22Cop92] James O. Coplien.
Advanced C++ Programming Styles and Idioms .Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,1992.[EPC +
15] T. D. Economon, F. Palacios, S. R. Copeland, T. W. Lukaczyk, andJ. J. Alonso. SU2: An Open-Source Suite for Multiphysics Simulationand Design.
AIAA Journal , 54(3):828–846, 2015.[GW08] A. Griewank and A. Walther.
Evaluating Derivatives, second edition .SIAM, 2008.[Hog14] R. J. Hogan. Fast reverse-mode Automatic Differentiation using ex-pression templates in C++.
ACM TOMS , 40(4):26:1–26:24, 2014.[KT51] H. W. Kuhn and A. W. Tucker. Nonlinear Programming. In
Proceed-ings of the Second Berkeley Symposium on Mathematical Statisticsand Probability , pages 481–492, Berkeley, Calif., 1951. University ofCalifornia Press.[LLN16] K. Leppkes, J. Lotz, and U. Naumann. Derivative code by over-loading in C++ (dco/c++): Introduction and summary of features.Technical Report AIB-2016-08, RWTH Aachen University, Septem-ber 2016.[Nau12] U. Naumann.
The Art of Differentiating Computer Programs: AnIntroduction to Algorithmic Differentiation . Software, Environments,and Tools. SIAM, 2012.[SAG18] M. Sagebaum, T. Albring, and N.R. Gauger. Expression templatesfor primal value taping in the reverse mode of algorithmic differenti-ation.
Optimization Methods and Software , 2018.[SAG19] M. Sagebaum, T. Albring, and N.R. Gauger. High-PerformanceDerivative Computations Using CoDiPack.
ACM Trans. Math.Softw. , 45(4), December 2019.[WG09] A. Walther and A. Griewank. Getting started with ADOL-C. In
Combinatorial scientific computing , pages 181–202, 2009.[ZSD10] H. Zhu, H. Shu, and M. Ding. Numerical solutions of two-dimensionalburgers’ equations by discrete adomian decomposition method.