Event-Based Automatic Differentiation of OpenMP with OpDiLib
Johannes Blühdorn, Max Sagebaum, and Nicolas R. Gauger
Chair for Scientific Computing, Technische Universität Kaiserslautern, Germany
Abstract
We present the new software OpDiLib, a universal add-on for classical operator overloading AD tools that enables the automatic differentiation (AD) of OpenMP parallelized code. With it, we establish support for OpenMP features in a reverse mode operator overloading AD tool to an extent that was previously only reported on in source transformation tools. We achieve this with an event-based implementation ansatz that is unprecedented in AD. Combined with modern OpenMP features around OMPT, we demonstrate how it can be used to achieve differentiation without any additional modifications of the source code; neither do we impose a priori restrictions on the data access patterns, which makes OpDiLib highly applicable. For further performance optimizations, restrictions like atomic updates on the adjoint variables can be lifted in a fine-grained manner for any parts of the code. OpDiLib can also be applied in a semi-automatic fashion via a macro interface, which supports compilers that do not implement OMPT. In a detailed performance study, we demonstrate the applicability of OpDiLib for a pure operator overloading approach in a hybrid parallel environment. We quantify the cost of atomic updates on the adjoint vector and showcase the speedup and scaling that can be achieved with the different configurations of OpDiLib in both the forward and the reverse pass.
Keywords
Automatic Differentiation, OpenMP, OMPT, High Performance Computing, AD Tool Design, Reusable Software
CR Subject Classification
D.1.3 (Concurrent Programming), D.2.13 (Reusable Software), G.1.4 (Automatic Differentiation), G.4 (Mathematical Software)
1 Introduction

Automatic differentiation (AD) deals with the evaluation of derivatives of computer codes, in a machine accurate manner and for a computational cost that is often only a small multiple of the cost of the original program. AD has repeatedly proven its value in scientific computing, especially for large simulation codes where symbolic derivatives are not feasible and finite differences produce numerically unstable and unreliable results. Areas of application include discrete adjoint solvers [32, 1], shape optimization [11, 24], inverse problems [2], and machine learning [13]. A comprehensive introduction to AD is given in [12].

We view a computer program with input variables x and output variables y as a mathematical function $F\colon \mathbb{R}^n \to \mathbb{R}^m$, $x \mapsto y$. While an execution of this program for a specific input x might involve millions of code statements, it always amounts to a sequence of elementary operations like sums, products, or standard math library functions like sin or exp. For all these operations, the derivatives are known; hence, the derivative of the full code in x can be obtained by a rigorous application of the chain rule. The reverse mode of AD is a specific strategy for the structured traversal of the chain rule. The code is executed in its original structure in a forward pass; throughout, it is recorded in a suitable representation on a stack-like structure called tape. In the subsequent reverse pass, the tape is evaluated and derivative information is propagated from the outputs to the inputs in reverse direction compared to the original data flow. To formalize this, we introduce for each variable v of the code, whether input, output, or intermediate variable, an adjoint variable $\bar{v}$ that fulfills the relation $\bar{v} = \frac{dF}{dv}(v)^T \bar{y}$. For the adjoint inputs $\bar{x}$, we obtain
$$\bar{x} = \frac{dF}{dx}(x)^T \bar{y}$$
so that suitable choices of $\bar{y}$ yield various derivatives of the computer code. Choosing $\bar{y}$ is known as seeding. In an implementation, it is practical to assemble the adjoint variables in an incremental fashion on the statement level. Let a code statement be given by
$$w = \varphi(u) \qquad (1)$$
where $w \in \mathbb{R}$ is an intermediate variable or an output, $u \in \mathbb{R}^k$ consists of inputs and intermediate variables from previous statements, and $\varphi\colon \mathbb{R}^k \to \mathbb{R}$ is composed of one or multiple elementary operations. The adjoint updates for the statement read
$$\bar{u}_j \mathrel{+}= \frac{\partial \varphi}{\partial u_j}(u)\, \bar{w} \;\text{ for } j = 1, \ldots, k, \text{ then set } \bar{w} = 0. \qquad (2)$$
All information required for the reversal of a statement must be stored on the tape, that is, identification for the adjoint variables $\bar{u}$ and $\bar{w}$ that have to be used [29], and information for the computation of the terms $\frac{\partial \varphi}{\partial u_j}(u)$ [17, 27]. The order in which all statements (1) occurred in the forward code is preserved in their recording on the tape, and unwinding the tape from end to start produces the corresponding adjoint updates (2) in reverse order.

A key objective of AD tools is the automatic augmentation of the original source code by additional statements required for the recording of tapes and the computation of derivatives. The two main approaches for this are source transformation and operator overloading. Source transformation tools are essentially compilers that take the original source code and transform it to an augmented source code. Besides the original computations, the augmented source code also contains statements for the computation of derivatives. TAPENADE [14] is an example for a source transformation tool.
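Both approaches implement the same statement-level mechanics. As a small worked instance of (1) and (2) (our own illustration), consider the statement
$$w = u_1 \sin(u_2).$$
The tape stores identifiers for $u_1$, $u_2$, and $w$ together with the partial derivatives $\frac{\partial \varphi}{\partial u_1}(u) = \sin(u_2)$ and $\frac{\partial \varphi}{\partial u_2}(u) = u_1 \cos(u_2)$. During the reverse pass, the corresponding adjoint updates read
$$\bar{u}_1 \mathrel{+}= \sin(u_2)\, \bar{w}, \qquad \bar{u}_2 \mathrel{+}= u_1 \cos(u_2)\, \bar{w}, \qquad \bar{w} = 0.$$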
Operator overloading tools rely on a replacement of the floating point datatype by a custom AD type. All elementary operations are overloaded for the AD type. From the user's point of view, the AD type behaves like a floating point type; internally, it performs anything that is required for differentiation, for example the recording in the reverse mode of AD. Examples for operator overloading AD tools are ADOL-C [33] and CoDiPack [28].

In order to achieve maximum performance on today's multicore architectures, simulation codes typically employ shared memory parallelism, for example by means of OpenMP [23]. Hence, AD tools should be capable of differentiating codes that contain OpenMP directives. Research on combining AD and OpenMP started in the realm of source transformation tools when [6, 7] suggested the generation of OpenMP parallel code for the simultaneous computation of multiple derivatives of an otherwise serial code with the forward vector mode. This is extended in [8] to the case of an already parallel code by using nested parallelism. First reports on source-to-source reverse mode differentiation of an OpenMP parallel simulation code are given in [15, 16]. Subsequently, support for reverse mode AD for OpenMP parallel codes was also established in operator overloading tools, for example ADOL-C [19, 5]. However, the fact that OpenMP parallelism is specified in large parts via directives posed a serious challenge for operator overloading tools. Such directives are outside the scope of the programming language and inherently inaccessible to overloading techniques. This limited the supported set of OpenMP features in operator overloading to parallel regions and parallel loops. Source transformation tools, on the other hand, have never been subject to this limitation. Transformation rules for many additional OpenMP directives were established in [10].

We summarize some key insights from previous research. Transitioning from the reverse mode of AD in a serial context as described above to parallel codes, we face the problem that taping is an inherently sequential process. Therefore, computations that are performed simultaneously in different threads have to be recorded on different tapes. Furthermore, it was observed that the parallelization of the forward pass can be used to deduce a corresponding parallelism for the reverse pass. However, this is — except for special cases — not as simple as evaluating the tapes in parallel. First, any additional synchronization that occurred in the forward pass has to be transformed to corresponding reverse synchronization; second, while simultaneous read access to the same variable from different threads is fine in the forward pass, the corresponding adjoint updates (2) then require simultaneous write access to the same adjoint variable, which introduces data races in the reverse pass. Given the incremental nature of the adjoint updates (2), it was shown in [10] that using only atomic updates is a general solution for this. However, atomic updates are associated with a high performance cost and previous research has therefore identified forward data access patterns that do not require reversal with atomic updates [19, 10, 18].

In this paper, we present a novel event-based approach for the reverse mode automatic differentiation of OpenMP parallel codes with operator overloading tools and its implementation in the new open source tool OpDiLib. Starting with version 5.0 of the OpenMP specification [23], internal parts of the OpenMP runtime are formalized by execution model events.
Furthermore, OMPT becomes part of the specification, an API that allows tool developers to observe and react to these events. OMPT was originally designed to provide a standardized interface for the performance monitoring of OpenMP applications [9]; more generally, it allows for the coupling of the OpenMP runtime with tools that execute in the address space of the original program [23], which is precisely the case for operator overloading AD tools. In this work, we identify and demonstrate the applicability of OMPT for automatic differentiation. Specifically, it allows us to implement augmentations required for AD handling as side effects of OpenMP constructs, without any additional modifications of the original source code. We start in Section 2 with an overview over OpDiLib's architecture. It has evolved from the OMPT core concept to a point where it supports all the different configurations and use cases presented in this paper.

In order to support OpenMP in an operator overloading AD tool, two kinds of adaptions must be made. First, the AD tool itself must be thread-safe. This is an AD tool specific property that requires an individual implementation. Second, an appropriate handling for OpenMP mechanisms must be provided. While certain OpenMP support has been crafted into some operator overloading AD tools by now, we demonstrate with OpDiLib for the first time that the second part can be implemented independently of a specific operator overloading AD tool. Moreover, this implementation is based on a limited set of features that many AD tools already have, which allows us to design OpDiLib as a universal add-on for classical operator overloading AD tools. A similar add-on approach is taken for the differentiation of MPI in the tools AMPI [30] and MeDiPack. The coupling is described in Section 3. OpDiLib bindings are already implemented for CoDiPack [28]. We hope that this paper is also a useful starting point for coupling OpDiLib with other operator overloading AD tools in the future.

OpDiLib's event-based differentiation logic is presented in Section 4. It is formulated in terms of AD events, and the AD handling is implemented as reactions to these events. This corresponds to an AD-centered point of view that allows for a unified handling of similar OpenMP mechanisms and hides many OpenMP and OMPT specific details that are not relevant for AD. Furthermore, the differentiation logic itself does not depend on OMPT at all. While OMPT execution model events can be used to generate AD events, the differentiation logic can also be reused together with other AD event generation mechanisms. With this approach, we establish in Section 4.1 AD support for many OpenMP constructs in operator overloading, including several that were previously only treated by source transformation tools.

From a user's point of view, OpenMP is very easy to apply. Introducing parallelism to a code can be as straightforward as supplying a loop with a #pragma omp parallel for directive. To achieve maximum parallel performance, on the other hand, additional restructuring of the code and the application of more advanced OpenMP features might be required. OpDiLib follows the same philosophy. With OMPT, a first differentiated version of an OpenMP parallel code can be achieved without any additional code changes and without restrictions on the data access patterns. We achieve this by defaulting to atomic updates on adjoint variables. Starting from there, a user may employ additional optimizations.
An important part of improving the performance of the parallel reverse pass is the elimination of atomic updates, which might only be possible after a revision of the data access patterns. Then, OpDiLib's adjoint access control tools which we present in Section 4.2 can be used to disable atomic updates for selected parts of the code.

Section 5 presents two different mechanisms for the generation of AD events. Section 5.1 details the generation of AD events via OMPT execution model events. OpDiLib is designed to be used with these modern OMPT features. Nonetheless, compilers are still in the process of establishing support for OMPT. As many projects rely on older compiler versions or compilers without OMPT features, we provide a second backend that does not require OMPT. It is based on an alternative interface to OpenMP functionality that consists of replacement macros for OpenMP directives and clauses as well as replacements for OpenMP's runtime functions. The only additional step required for its application is the replacement of OpenMP constructs. As the macro interface is fully compatible with the OMPT backend, these code changes do not have to be revisited for a transition from the macro backend to the OMPT backend. Section 5.2 provides an overview over the macro interface and highlights some implementational details of the macro backend.

In Section 6, we conduct a performance study of OpDiLib applied to an OpenMP-MPI hybrid parallel code. We use CoDiPack [28] as the underlying AD tool and MeDiPack for the differentiation of MPI. While a mixed approach between operator overloading and source transformation was established for hybrid parallel codes in [31], we demonstrate with OpDiLib that they can also be handled by pure operator overloading, which admits a simplified toolchain without additional source transformation compilers and does not involve interfaces between different AD paradigms. We compare OpDiLib's backends and different compilers, quantify the cost of atomic updates on the adjoint vector, and demonstrate the impact of OpDiLib on the parallel performance of the forward and reverse pass. We conclude our work in Section 7.

2 OpDiLib's Architecture

[Figure 1: Layers of OpDiLib's architecture. The tool layer wraps the classical AD tool; the logic layer manages tapes and embeds the reverse OpenMP logic into the tapes; the backend generates AD events alongside the OpenMP workflow, registers AD data for begin events and provides it again for end events.]

OpDiLib is divided into three parts, backend, logic and tool. These layers are illustrated together with their relations in Figure 1. The tool layer couples OpDiLib with a classical operator overloading AD tool. It provides access to AD tool features required by OpDiLib while abstracting away the complexity of the AD tool. OpDiLib's logic layer implements an OpenMP differentiation logic that is formulated in terms of AD events. These occur in pairs that mark the beginning and end of OpenMP constructs, respectively. On the one hand, the event handling consists of augmentations of the forward pass, for example preparations of the parallel recording environment. On the other hand, the reverse OpenMP logic is embedded into the tapes. For both parts, the logic layer relies on the AD tool features exposed by the tool layer. OpDiLib's backend layer is responsible for the generation of AD events alongside the OpenMP workflow, for example at the beginning or end of a parallel region, or upon encountering or leaving a synchronization barrier.
Furthermore, the backend layer is responsible for AD data exchange between matching begin and end events. Any data registered by the logic layer for a begin event is also provided to the matching end event. The backend layer also provides low-level information such as identifiers of locks or data associated with the current parallel region. We discuss these layers in detail in Sections 3, 4, and 5, respectively.

The intention of the design is that each layer can easily be exchanged without affecting the others, thus increasing the reusability of all parts of OpDiLib. For each type of layer, an interface is defined that guides its implementation. Coupling OpDiLib with another operator overloading AD tool, for example, is realized via a new implementation of the tool layer interface. In Section 5, we present two backend implementations that employ different mechanisms of AD event generation. The only part of OpDiLib that depends on OMPT is the specific realization of the backend layer presented in Section 5.1.

While the backend layer is compiled into the application, both tool and logic layer can also be exchanged at runtime. For the tool layer, this is important if different AD types are used in different parts of the application. Since the tapes used by the logic layer usually depend on the AD type, this use case requires also an exchange of the logic layer. Specifically for the logic layer, we could imagine the use case of an optimized implementation tailored to a specific way of using OpenMP; to improve the performance, one would switch to the other logic layer for such parts of the code.

3 Coupling OpDiLib with Operator Overloading AD Tools

In this section, we present the assumptions on the AD workflow that guide the implementation of OpDiLib and collect the properties and features that an operator overloading AD tool must have in order to be coupled with OpDiLib. We showcase the interface that has to be implemented for the coupling and close with the discussion of some advanced use cases. To begin with, we assume that an AD workflow such as depicted in Listing 1 is used, and that the AD tool supports the classical operations involved.

// within serial code
AD::masterTape->registerInputs(inputs);
AD::masterTape->setActive();  // start recording

// function to be differentiated
// involves parallel code
f(inputs, outputs);

AD::masterTape->setPassive();  // stop recording
AD::masterTape->registerOutputs(outputs);

AD::masterTape->seedOutputs(seeds);
AD::masterTape->evaluate();  // reverse pass

AD::masterTape->reset();
Listing 1: AD workflow.

Particularly, we assume that the recording is controlled by the initial implicit task, that is, serial parts of the code. In Listing 1, all parallel regions are assumed to be contained in f. Hence, when transitioning to parallel code, the general AD workflow set up for the serial version can be kept and as before, the user interacts only with the tape of the initial implicit task, to which we refer as master tape. Any additional treatment required for parallel code is embedded by OpDiLib into the master tape in a fully automatic fashion. In order to be coupled with OpDiLib, an AD tool must additionally

• allow for creation and destruction of tapes at runtime,
• treat tapes as thread-local objects that can be exchanged at runtime, and on which the classical operations such as recording or evaluation can be performed individually,
• provide access to tape positions and support positional evaluation and positional reset of tapes,
• have capabilities to embed custom function calls into the tape evaluation, commonly referred to as external functions [19].

Furthermore, the AD tool must be thread-safe for use in an OpenMP parallel environment.

• Recordings that are performed in parallel on different tapes must be thread-safe. Particularly, the simultaneous distribution of indices to active variables in multiple threads must be safe.
• The AD tool must support atomic operations on the adjoint vector. It is sufficient if each evaluation uses atomic access by default. However, to allow for additional performance optimizations, it is preferred that the adjoint access mode can be specified for each individual positional tape evaluation. The most obvious optimization is that serial parts of the code are not evaluated with atomic updates.

The full interface that has to be implemented for coupling OpDiLib with an operator overloading AD tool is shown in Listing 2. It reflects the AD tool requirements identified above. Note that any specific types used by the AD tool for tapes and tape positions are abstracted away by the tool layer, hence the various void* pointers in the interface.

struct ToolInterface {
  public:
    virtual ~ToolInterface() {}

    // tape creation and deletion
    virtual void* createTape() = 0;
    virtual void deleteTape(void* tape) = 0;

    // management of thread local tape of caller
    virtual void* getThreadLocalTape() = 0;
    virtual void setThreadLocalTape(void* tape) = 0;

    // position handling
    virtual void* allocPosition() = 0;
    virtual void freePosition(void* position) = 0;
    virtual size_t getPositionSize() = 0;
    virtual std::string positionToString(void* position) = 0;
    virtual void getTapePosition(void* tape, void* position) = 0;

    // tape handling
    virtual bool isActive(void* tape) = 0;
    virtual void setActive(void* tape, bool active) = 0;
    virtual void evaluate(void* tape, void* start, void* end,
                          bool useAtomics = true) = 0;
    virtual void reset(void* tape, bool clearAdjoints = true) = 0;
    virtual void reset(void* tape, void* position,
                       bool clearAdjoints = true) = 0;
    virtual void pushExternalFunction(void* tape, Handle const* handle) = 0;
};

Listing 2: Interface to AD tool features.

In the remaining parts of this section, we discuss some refinements of the AD workflow from Listing 1 for advanced use cases.
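To give a flavor of the coupling before turning to these refinements, the following fragment sketches a tool layer implementation for a hypothetical tape type MyTape. The tape type and all names besides the ToolInterface functions are our own illustration; actual bindings, such as those for CoDiPack, must implement every function of Listing 2 against the real tape type.

// Hypothetical minimal tape type, for illustration only.
struct MyTape {
  static thread_local MyTape* threadLocal;
  bool active = false;
};
thread_local MyTape* MyTape::threadLocal = nullptr;

// Fragment of a ToolInterface implementation for MyTape.
struct MyToolInterface : public ToolInterface {
  void* createTape() override { return new MyTape(); }
  void deleteTape(void* tape) override { delete static_cast<MyTape*>(tape); }

  void* getThreadLocalTape() override { return MyTape::threadLocal; }
  void setThreadLocalTape(void* tape) override {
    MyTape::threadLocal = static_cast<MyTape*>(tape);
  }

  bool isActive(void* tape) override {
    return static_cast<MyTape*>(tape)->active;
  }
  void setActive(void* tape, bool active) override {
    static_cast<MyTape*>(tape)->active = active;
  }

  // Position handling, evaluation, reset, and external function pushes
  // follow the same casting pattern and are omitted in this sketch.
};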
Multiple recordings and evaluations.

The AD tool may support multiple recordings on the same tape and multiple evaluations of one recorded tape, but OpDiLib does not depend on these features. However, it differentiates them correctly if they are present.
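As a sketch of what such a use might look like on the user side (assuming the workflow API of Listing 1 and a hypothetical clearAdjoints helper), one recording can drive several reverse passes with different seeds:

// One recording, two reverse passes with different seeds. clearAdjoints()
// is a hypothetical helper that zeroes the adjoint vector between passes.
AD::masterTape->seedOutputs(seedsA);
AD::masterTape->evaluate();       // first reverse pass

AD::masterTape->clearAdjoints();  // hypothetical: reset adjoints, keep tape
AD::masterTape->seedOutputs(seedsB);
AD::masterTape->evaluate();       // second reverse pass of the same tape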
Positional evaluation of the master tape.
If the AD workflow in the serial part of the code involves positional evaluations of the master tape, additional small modifications of the workflow are required in order to get correct results with OpDiLib. As detailed in Section 4, OpDiLib maintains internal states for the differentiation. State changes are tracked alongside the recording on the classical tapes, and reverted accordingly during evaluation. Hence, if parts of the master tape are evaluated starting from a position that does not match OpDiLib's current state, the state corresponding to that position must be restored first. To that end, OpDiLib's logic layer offers functions for the export and recovery of internal states. These have to be used alongside the export of tape positions and prior to positional evaluation, respectively. The resulting workflow is depicted in Listing 3.

AD::Position startPosition = AD::masterTape->getPosition();
AD::masterTape->startRecording();

// function to be differentiated
// involves parallel code
f(inputs, outputs);

AD::masterTape->endRecording();
AD::Position endPosition = AD::masterTape->getPosition();
OpDiLib::State endState = OpDiLib::logic->exportState();

...  // other code

OpDiLib::logic->recoverState(endState);
AD::masterTape->evaluate(endPosition, startPosition);
Listing 3: Modified workflow for positional evaluation of the master tape.
Tape activity changes within parallel code.
There are some edge cases that — unlike what we assumed in Listing 1 — require changes of the tape activity during the recording phase. External functions that are executed on the AD type instead of a scalar type, for example, require an interruption of the recording. OpDiLib handles AD events only if the tape of the thread that encounters them is active. Hence, care must be taken that changing the tape activity does not produce incorrect results. For example, a parallel region that is encountered by a thread with an inactive tape does not prepare a parallel recording environment. Recording may only be performed inside parallel regions that were already encountered by a recording thread. Also, AD events occur in pairs of which either both or none should be handled. The tape activity at any begin event must be identical to the tape activity at the matching end event. Even then, care must be taken that no important synchronization information due to OpenMP directives encountered during interruptions of the recording gets lost. An experienced user, on the other hand, can take advantage of this and eliminate synchronization from the tapes that is not required during the reverse pass. To give an example, the following code showcases a barrier that does not result in a barrier in the reverse pass.

AD::threadLocalTape->setPassive();
#pragma omp barrier
AD::threadLocalTape->setActive();
External functions involving parallelism.
While OpDiLib uses external functions to embed state changes and reverse parallelization into classical tapes, they are originally designed as a tool for users to provide custom derivatives for code, say a function g, where blackbox differentiation is inaccurate or costly in terms of memory or runtime [19]. A typical example for this is the differentiation through a solver for a linear system of equations [26]. Instead of the statement-level recording, user-provided code for the differentiation of g is embedded into the tape. Transitioning to the parallel case, the user-provided derivative code does not only replace the derivative computation but also the reverse parallelism that would otherwise be deduced automatically by OpDiLib. This gives experienced users another opportunity to fine-tune the performance. OpDiLib is not aware whether the tapes it evaluates contain usual statement-level recordings or external functions. Hence, support for external functions involving parallelism is mostly an issue of the underlying AD tool. Particularly, its external function tools must be thread-safe, which involves declaration of inputs and outputs, tape activity changes, the registrations of the external function on the tapes, and where necessary, barrier synchronization between these steps. From OpDiLib's point of view, each thread that registered the external function on its tape in the forward pass results in a thread that participates in the evaluation of the custom derivative code in the reverse pass. In typical use cases for external functions, the same degree of parallelism is used in the forward and reverse pass. Nonetheless, OpDiLib does not prevent alternative patterns here.

4 Event-Based Differentiation Logic

As outlined in the introduction, previous research on differentiation of parallel code made the observation that the parallelization of the forward code can be used to deduce a corresponding parallelization of the reverse pass. In Section 4.1, we collect previous results on the transformation of OpenMP constructs to their reverse counterparts, transfer them — where necessary — from the realm of source transformation to the realm of operator overloading, extend them to a larger class of OpenMP constructs, and detail their event-based implementations in OpDiLib.

OpDiLib achieves a thread-safe parallel evaluation by defaulting to atomic updates on the adjoint variables. It depends on the specific data access patterns whether this can be relaxed, but doing so can have a significant impact on the performance, see e. g. [18]. We defer the performance discussion to Section 6. In Section 4.2, we present the tools provided by OpDiLib with which an experienced user can disable atomic evaluation where appropriate.
4.1 AD Support for OpenMP Constructs

In OpDiLib, the AD treatment of OpenMP constructs is implemented as reactions to events that indicate the beginning and end of OpenMP constructs alongside the OpenMP workflow. This is inspired by the execution model events defined in the OpenMP 5.0 specification [23]. However, there is a multitude of execution model events, and a lot of internal information is passed to the callbacks that can be registered for these events. Not all of it is relevant for differentiation. For the purpose of AD, we derive from the execution model events a reduced set of events with correspondingly adapted data, to which we refer as AD events in the following. Another advantage of the AD event formulation is that it does not depend on OMPT and can also be used together with other event generation mechanisms, as we will discuss in Section 5. The logic layer implements handling for the AD events. For each AD event, this may consist of a forward action and a reverse action that are accompanied by appropriate AD data. The forward action contains additional code for the forward pass and the reverse action contains reverse OpenMP logic in the shape of an external function. Upon encountering the event, the forward action is executed and the reverse action is pushed onto the thread's tape. AD data encompasses any data required in the forward and reverse actions of matching begin and end events. It is usually created in the forward action of a begin event, passed to the matching end event, and provided to the reverse actions of either of both events as external function data. These patterns will become apparent in the examples throughout this section.
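Schematically, this pattern might look as follows for a simple event. This is our own sketch with hypothetical handler names, using only the pushExternalFunction capability from Listing 2; it is not OpDiLib's actual implementation.

// Sketch of an AD event handler (hypothetical names): the forward action
// runs immediately in the forward pass; the reverse action is pushed onto
// the encountering thread's tape and runs during tape evaluation.
void reverseBarrier(void* /*adData*/) {
  #pragma omp barrier  // executes inside the reverse parallel region
}

void onSyncRegionBegin() {
  // forward action: nothing to do for a barrier itself
  // reverseBarrierHandle is a hypothetical Handle wrapping reverseBarrier
  tool->pushExternalFunction(tool->getThreadLocalTape(),
                             &reverseBarrierHandle);
}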
Parallel regions and implicit tasks.
[Figure 2: Parallel region related AD events in the intuitive ordering with respect to each other and the OpenMP workflow, as well as corresponding AD data flow. The events ParallelBegin, ImplicitTaskBegin, ImplicitTaskEnd, and ParallelEnd surround the structured block. ParallelData: int requested (omp), int actual (omp), Tape* tapes[requested], Position start[requested], Position end[requested]. ImplicitTaskData: int index (omp), Tape* oldTape. Data set according to OpenMP runtime information is tagged with (omp).]

Support for OpenMP in an operator overloading tool was already established on the level of parallel regions in [19], which led to the implementation in ADOL-C [5]. The core idea is that concurrent recordings are performed on different tapes that are then also evaluated in parallel. In OpDiLib, the handling of the #pragma omp parallel directive is split into events for parallel regions and implicit tasks. The resulting event family is presented in Figure 2. The corresponding forward and reverse actions are detailed in Figure 3. The ParallelBegin event is received by the thread that encounters the OpenMP directive. It receives the requested number of threads, e. g. due to a num_threads clause. Since the actual number of threads is bounded from above by the number of requested threads [23], it can be used to prepare the parallel recording environment with the possibility of a small overallocation. Each thread that participates in the parallel region receives an ImplicitTaskBegin event before it executes the parallel region's structured block. It receives the actual number of threads and its index in the team of threads that execute the parallel region. An ImplicitTaskEnd event is received by each thread after its execution of the structured block. The key objective of the ImplicitTask events is that each thread is equipped with a dedicated tape for the extent of the parallel region, which is realized via their forward actions, see Figure 3. The ParallelEnd event is again received by the thread that encountered the parallel region, after it received its ImplicitTaskEnd event. Its reverse action contains the parallel evaluation of the tapes that were recorded in the forward pass.

[Figure 3: Forward and reverse actions of the parallel region AD event family. Actions without description are either empty (white) or contain management tasks like data allocation, data initialization, and the reverse action external function push (grey). ImplicitTaskBegin forward action: oldTape = AD::threadLocalTape; tapes[index] = OpDiLib::acquireTape(); AD::threadLocalTape = tapes[index]; start[index] = tapes[index]->getPosition(); tapes[index]->setActive(). ImplicitTaskEnd forward action: tapes[index]->setPassive(); end[index] = tapes[index]->getPosition(); OpDiLib::releaseTape(tapes[index]); AD::threadLocalTape = oldTape. ParallelEnd reverse action: a parallel region with the actual number of threads in which each thread evaluates its tape, int index = omp_get_thread_num(); tapes[index]->evaluate(end[index], start[index]). The corresponding AD data is displayed in Figure 2.]

We do not want to create new tapes for every parallel region. Therefore, the tapes are acquired from and released to a tape pool and reused again in later parallel regions. In order to evaluate the correct parts of the tapes in the reverse parallel regions, we must remember the positions of the tapes at the boundaries of the structured block in the forward pass. This is implemented as part of the ImplicitTask event handling. As an advancement to what is reported on in [19], our implementation supports also nested parallelism. This requires additional care in the tape management. To see this, consider the following example.

#pragma omp parallel num_threads(2)
{
  workA();

  #pragma omp parallel num_threads(1)
  {
    workB();
  }
}

Suppose the workload workA is imbalanced for the two threads, so that the first thread is finished with both workA and the inner parallel region before the second thread finishes workA. Then, the tape that thread one used in the inner parallel region is available again and might be acquired by thread two so that both inner parallel regions are recorded on the same tape. In the reverse pass, however, both inner parallel regions are most likely evaluated in parallel. If no indices are reused, this is not a problem. Otherwise, the same index pool is used for temporaries throughout workB in both threads, which creates data races in the reverse pass. To avoid this, it must be ensured that no tape is acquired twice within a parallel region, including all levels of nested parallelism. The resulting hierarchical arrangement of tapes can be seen as an extension of the nested taping described in [19].

Figure 2 displays the intuitive ordering of the parallel region related AD events with respect to each other and with respect to the AD workflow, which is sufficient for a concise presentation of the differentiation logic. In practice, however, we have to support all orderings of the corresponding OpenMP execution model events that are admissible according to the specification [23]. We revisit this in Section 5.1.
Sync regions.
The first class of OpenMP constructs that were only addressed by source transformation tools are sync regions, with the explicit #pragma omp barrier directive as a prominent example. According to [10], its reverse counterpart is, again, a barrier directive. This is handled in OpDiLib by a SyncRegionBegin event that occurs upon encountering the barrier, and a SyncRegionEnd event that occurs upon leaving it. No AD data is required, and the reverse action of either of both events may contain the reverse barrier. In addition to the explicit barrier directive, this logic covers also other types of barriers, for example implicit barriers due to OpenMP worksharing directives that do not specify a nowait clause, as we explain in the next paragraph.
Worksharing.
It is common to all worksharing directives that they define independent units of work that are scheduled for execution by specific threads at runtime. Since each thread is equipped with its own tape already, the specific worksharing is reflected in the recordings performed on the different tapes; without any additional treatment at all, the same worksharing is already achieved in the reverse pass simply by evaluating the tapes in parallel. This covers the for, sections, and single directives. Nonetheless, care must be taken that synchronization due to the implicit barrier at the end of such worksharing regions results in corresponding reverse barrier synchronization, as explained in the previous paragraph. Then, if the forward synchronization is skipped with a nowait clause, the reverse synchronization is omitted as well. Additional augmentations in the source transformation rules for worksharing directives in [10] are either specific to source transformation or part of a so-called joint reversal approach, where the memory footprint is reduced by checkpointing techniques already as part of the transformation. As is common in operator overloading tools, a user has to employ such techniques manually, and the same holds true if OpDiLib is applied. Still, a WorkBegin and a WorkEnd event are reserved for future augmentations of worksharing directives.
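The following small example illustrates this behavior; it is our own sketch of a typical worksharing pattern, assuming active arrays x and y of length n, not code from OpDiLib.

// Each thread records its loop chunk on its own tape; the per-thread
// worksharing is preserved automatically in the reverse pass.
#pragma omp parallel
{
  #pragma omp for nowait
  for (int i = 0; i < n; ++i) {
    y[i] = sin(x[i]);  // recorded on the executing thread's tape
  }
  // due to nowait, there is no implicit barrier here, and accordingly
  // no reverse barrier is embedded into the tapes
}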
Mutexes.
There are two kinds of OpenMP constructs. The first kind does not require awareness of the ordering between threads, that is, upon encountering an OpenMP construct, it is sufficient to add a corresponding reverse construct to the tape. A barrier, for example, may be encountered by the team of threads in any order, and the same holds true for the corresponding reverse barrier. All examples we have discussed so far are of this kind. The second kind, however, requires that the order in which threads execute the OpenMP construct is also reversed, that is, the reverse of the order in which the threads executed the OpenMP construct in the forward pass must be identical to the order in which the reverse constructs are executed in the reverse pass. The #pragma omp critical directive is an example for this and the issue is reflected in the source transformation AD approach that is developed for it in [10]. Reformulated in terms of operator overloading, a global counter is associated with the critical region. A thread that enters the critical region increments the counter and remembers its subsequent value. In the reverse pass, a thread busy-waits until the counter matches the remembered value, then it evaluates the recording of the protected forward code, and decrements the counter afterwards. We make the observation that this concept is not limited to critical regions. More generally, it can be used to differentiate any synchronization of mutex type, where entering the critical region corresponds to acquiring the mutex and leaving the critical region corresponds to releasing it. We apply this technique for critical regions, locks, nested locks, ordered clauses and reduction clauses. Counters are then associated with mutexes via the mutex kind and a unique mutex id that is provided by the backend layer as discussed in Section 5. The current values of all counters are part of OpDiLib's state that we mentioned before in Section 3 in connection with Listing 3. The event based implementation consists of a MutexAcquired event that is encountered after acquiring a mutex, and a MutexReleased event that occurs after releasing it. This is displayed for the example of a critical region together with the corresponding AD data flow in Figure 4.

[Figure 4: Mutex related AD events in the intuitive ordering with respect to the OpenMP workflow, as well as corresponding AD data flow, for the example of a #pragma omp critical region: MutexAcquired occurs after entering and MutexReleased after leaving the structured block. MutexData: int kind, int id, size_t value. counter is an abbreviation for the global data size_t& counter = OpDiLib::mutexCounter[kind][id].]

[Figure 5: Forward and reverse actions of the mutex AD events. MutexAcquired forward action: counter++; value = counter. MutexReleased reverse action (executed first during reversal): size_t currentValue; do { currentValue = counter; } while (currentValue != value);. MutexAcquired reverse action (executed after the protected recording is evaluated): counter--. The omitted MutexReleased forward action (grey) contains only the external function push for the corresponding reverse action. The corresponding AD data is displayed in Figure 4.]

Again, precise event ordering constraints, in particular with respect to acquisition and release of the same mutex from different threads, are inherited from the corresponding OpenMP execution model events. The forward and reverse action of the differentiation logic discussed above are displayed in Figure 5. We perform the increment of the counter as part of the MutexAcquired forward action because it is — unlike the MutexReleased forward action — protected by the corresponding mutex.

Reductions.
Since the OpenMP 4.0 specification [22], user-defined reductions are supported. This allows reduction clauses involving data types that are used in operator overloading AD tools. However, a declaration of a custom reduction alone is not sufficient to obtain correct reverse mode derivatives. As described in detail in the OpenMP 5.0 specification [23], reductions are performed in OpenMP in two stages. First, each thread that participates in the reduction accumulates its results in a private variable. At the end of the OpenMP construct with the reduction clause, the threads' individual results are then accumulated into the shared reduction target, usually involving several threads and shared intermediate variables, for example in a tree-like reduction. From an AD point of view, the operations in the course of the reduction are recorded just as any other operations on the tapes of the threads that execute them. However, access to shared intermediate variables or the final reduction target requires synchronization, which must be turned into reverse synchronization again. To that end, we require that a thread receives a MutexAcquired event before contributing to a shared variable, and a MutexReleased event afterwards; just as if access to the shared variable was safeguarded by a mutex, regardless of the actual synchronization mechanism that is used by the OpenMP runtime. We must further ensure that no thread begins with the reversal of the structured block of the OpenMP construct before the reversal of the second reduction stage has progressed to a point where all contributions have been propagated to the private reduction variable of that thread. The implementational details of this are discussed in the course of AD event generation in Section 5.
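For illustration, a user-defined reduction for an AD type can be declared as follows. ADType stands in for the AD tool's overloaded type; the declaration is our own sketch of the usual OpenMP 4.0 syntax.

// Declare how two private ADType results are combined and how a private
// copy is initialized; this makes the reduction clause below compile.
#pragma omp declare reduction(+ : ADType : omp_out += omp_in) \
    initializer(omp_priv = ADType(0.0))

ADType sum = ADType(0.0);
#pragma omp parallel for reduction(+ : sum)
for (int i = 0; i < n; ++i) {
  sum += x[i] * x[i];  // recorded like any other statement
}
// correct reverse synchronization additionally relies on the mutex AD
// events generated for the reduction, as described above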
Directives that do not require treatment.
There are several directives that do not require any additional treatment for AD in the framework presented above. The master directive, for example, does not involve any synchronization; it only declares that the code will be executed by a specific thread, and hence, recorded on a specific tape. As it is only recorded by one thread, it will also be evaluated by only one thread. This is in line with the results on source transformation from [19]. The same holds true for the section directive. The essential augmentations are already performed by the handling of parallel regions, implicit tasks and the sections worksharing directive. The section directive then marks only parts of the code that can be executed in parallel. The specific assignment of sections to threads, and hence tapes, depends on the scheduling in the forward pass and is automatically preserved for the reverse pass. Furthermore, OpenMP allows combined constructs, usually a parallel directive combined with a worksharing construct, for example #pragma omp parallel for. Such combined constructs are supported automatically by the logic presented above as long as all involved mechanisms generate their corresponding events. Furthermore, we do not have to add special treatment for data-sharing attribute clauses such as private or firstprivate or data copying clauses such as copyprivate. The resulting copy operations are recorded in the same fashion as any other operations. Source transformation tools, on the other hand, have to implement treatment for such clauses [10, 18].
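As a small illustration of these points (our own example, assuming an active scale and arrays x and y of length n), a combined construct with a data-sharing clause needs no dedicated handling:

// parallel and for both generate their events; the firstprivate copy of
// scale is recorded like any other assignment on each thread's tape
#pragma omp parallel for firstprivate(scale)
for (int i = 0; i < n; ++i) {
  y[i] = scale * x[i];
}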
Atomics.

In principle, synchronization due to atomics could also be handled by mutex AD events; in fact, atomics generate mutex-acquired and mutex-released execution model events. However, the OpenMP specifications only allow atomic directives for scalar types, which excludes the custom types that are used for active variables in an operator overloading code. Therefore, we do not add AD support for atomic directives. Atomics can still be used in parts of the code that are not differentiated and otherwise for inactive variables. This does not affect AD in typical use cases like incrementing a counter from within a worksharing loop where the synchronization due to the atomics has no impact on the correctness of the derivative. In general, the missing AD handling implies that any synchronization due to atomics is not transformed to corresponding reverse synchronization. Therefore, custom synchronization patterns that rely on atomics, for example busy-waits, are not captured correctly. If OpDiLib is used, synchronization via atomics has to be replaced by one of the mutex AD event generating constructs described above, or by barriers. Alternatively, the user may generate mutex AD events manually for custom synchronization mechanisms. Expanding on the example of a busy-wait, this has the advantage that mutex events do not have to be generated for each individual atomic statement, which would quickly blow up the mutex counters. Specifically in the reverse synchronization in Figure 5, one MutexAcquired event could be emitted upon a completed busy-wait, with the corresponding MutexReleased event placed after decrementing the counter.

We remark that missing AD support for atomics does not prevent us from using them in the reverse pass, for example for atomic updates on the adjoint variables or for synchronization as displayed in Figure 5. The reason is that the tape is not in recording mode during reversal.
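Returning to the manual generation of mutex events, a sketch might look as follows. The handshake flag and the event-emitting calls are hypothetical, chosen only to mirror the MutexAcquired/MutexReleased pair; the actual interface of OpDiLib's logic layer may differ.

// Hypothetical: a custom busy-wait handshake made visible to the AD logic
// by manually emitting one mutex event pair instead of one per atomic.
int current;
do {
  #pragma omp atomic read
  current = flag;             // flag is set to 1 by another thread
} while (current != 1);       // busy-wait; not captured by OpDiLib itself

// treat the completed busy-wait like acquiring a mutex
OpDiLib::logic->onMutexAcquired(kind, id);  // hypothetical call
// ... work protected by the handshake ...
OpDiLib::logic->onMutexReleased(kind, id);  // hypothetical call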
Flush directive.
The flush directive enforces a consistent view of the threads on the memory, either in general or with respect to specific variables. As can be seen e. g. in the examples in the OpenMP 2.5 specification [21], it can be used to implement rather irregular synchronization patterns. Like with atomics in the previous paragraph, this is not supported in OpDiLib. Also, we see no way to handle the flush directive without turning it implicitly into a stronger kind of synchronization like atomics, mutexes, or barriers, effectively eliminating its advantages. Regarding the reverse pass, no issues with inconsistent views of memory should occur if atomic updates are used on adjoint variables. Otherwise, we offer a reverse flush that the user may push to the tapes manually, in the same spirit as the reverse barrier described in Section 4.2.
Higher order derivatives.
There are different strategies for the computation of higher order derivatives involving the reverse mode of AD. In the adjoints of tangents approach, the forward mode of AD is used for all derivative orders except the outermost one. This approach is not only favoured for its efficiency properties [12] but also supported by the differentiation logic presented above. We recommend using this strategy with OpDiLib for higher order derivatives. The adjoints of adjoints approach, on the other hand, applies the reverse mode to the reverse mode. This requires recordings during reversal and in order to support it, a reverse mode AD tool must have the closure property that it can be applied to the code it produces [10]. In the differentiation logic presented above, this leads to conflicts. On the one hand, there is no AD support for atomics. On the other hand, atomics are used during reversal for updates on adjoint variables (see Section 4.2) and synchronization (see Figure 5). Hence, this strategy cannot be supported without suitable adaptions of the differentiation logic. It also gives rise to further implementational issues in the backend layer that are discussed in Section 5.
4.2 Adjoint Access Control

In typical parallel applications, data is read from different threads simultaneously. As we explained already in Section 1, active variables that are read from multiple threads simultaneously in the forward pass lead to data races in the reverse pass and require additional synchronization, which is for example achieved by atomic updates on the corresponding adjoint variables. However, atomic updates come at a performance cost, especially if they are used for all adjoint variables [18]. The challenge is hence to identify the variables that need atomic updates. Source transformation tools have similarities to compilers and may attempt to address this issue with static program analysis techniques like the exclusive read analysis introduced in [10]. In the research on operator overloading in [19] and the implementation in ADOL-C [5], a compromise is made. At the beginning of a parallel region, threadprivate copies of all active variables are made. This way, simultaneous read access is prevented by design and no atomic updates are required. However, no write access to variables that served already as inputs to the parallel region is allowed from within the parallel region, a variable can be written by at most one thread and during reversal, an additional step is needed to reduce the contributions to the adjoint input values from the different threads. In [18], a pattern of so-called self-adjoint memory access is discussed. It admits reversal of certain loops without atomics; the user has to identify and mark such loops so that they can be handled appropriately by a source transformation tool.

In order to make OpDiLib as applicable as possible, we do not want to impose any restrictions on the data access patterns and we do not want to require users to restructure their code to get a first working version. Therefore, we use atomic updates on the adjoint variables by default. On the other hand, a user should have as much flexibility as possible to disable atomic adjoints where they are not needed. To that end, OpDiLib tracks an adjoint access mode per thread alongside the recording. The available options are Atomic, which is the default for OpDiLib, or Classical, which is the non-atomic counterpart. The user may control the adjoint access mode by calls to the setAdjointAccessMode routine of the logic layer at arbitrary points in the code. The recording that is performed after the call until the next change of the adjoint access mode in that thread is then evaluated in the reverse pass according to the chosen mode. This means that whole parallel regions can be marked as safe for evaluation without atomics, for example if the user decides to make thread-private copies of all inputs to the parallel region as in [19, 5], or if a parallel loop has self-adjoint memory access [18]. In addition, one may isolate any parts of large parallel regions as safe for evaluation without atomics in a fine-grained manner, even across the boundaries of parallel regions. The following example demonstrates that additional synchronization might be required during the reverse pass.

 1  ADType shared_data[2] = ...;
 2  ADType result[2] = ...;
 3
 4  #pragma omp parallel num_threads(2)
 5  {
 6    int i = omp_get_thread_num();
 7
 8    OpDiLib::logic->setAdjointAccessMode(OpDiLib::Classical);
 9    // part without shared reading
10    result[i] *= shared_data[i];
11    OpDiLib::logic->addReverseBarrier();
12    OpDiLib::logic->setAdjointAccessMode(OpDiLib::Atomic);
13
14    // part with shared reading
15    result[i] += shared_data[0] * shared_data[1];
16  }

The change of the adjoint access mode in line 12 separates adjacent parts of the code with and without shared reading. During the forward pass, shared reading is not a problem and no synchronization is required. During the reverse pass, however, one thread might finish first with the atomic evaluation of the recording for line 15 and begin with unsafe updates on the adjoint of shared_data[0] while the other thread is still busy with the safe evaluation of its recording for line 15, resulting in simultaneous safe and unsafe writes to the same variable. This is prevented by introducing a barrier exclusively in the reverse pass by the additional call in line 11. It pushes an external function containing a #pragma omp barrier statement to the tape, just as in the handling of the SyncRegion events in Section 4.1. Hence, one must make sure to follow the rules for OpenMP barriers, that is, either all threads of a team encounter a reverse barrier or none [23]. When using reverse barriers in nested parallel regions, one must be aware that they affect only the team of threads of the corresponding nested reverse parallel region.
5 Generation of AD Events

The AD events introduced in Section 4.1 have to be generated alongside the OpenMP workflow. In principle, the required augmentation of the source code by AD events could be performed by hand. The objective of the backend layer is to automate this tedious task, up to the point where no changes to the original source code are required. This is achieved with the OMPT backend that is presented in Section 5.1. It makes use of modern OpenMP features around OMPT that allow for an observation and augmentation of the OpenMP workflow. This way, we solve the principal problem that OpenMP directives are inaccessible to overloading techniques, which previously limited support for OpenMP in operator overloading tools and required manual annotations of the source code [19, 5].

As a fallback for compilers without OMPT support, we provide a second backend that hides the augmentations of the source code inside replacement macros for OpenMP directives and clauses as well as replacements for OpenMP's runtime functions. They form an alternative interface to OpenMP functionality which we present in Section 5.2. It provides access to the same set of AD features as the OMPT backend. Clearly, the macro backend requires the rewriting of all OpenMP directives and runtime function calls. In principle, given the structural similarities of the original constructs and their replacements, this task could be performed by a source transformation preprocessing script in an automatic fashion. Code that is written with the replacements is also fully compatible with the OMPT backend. This means that no additional rewriting is required for a transition to the modern features. As long as the OMPT backend is used, original OpenMP constructs and their respective replacements can be used in the same code.
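To convey the flavor of such a macro interface, the following self-contained sketch shows how a replacement macro can wrap a directive via the standard _Pragma operator. The macro names are our own illustration, not OpDiLib's actual interface; OpDiLib's macros additionally emit the corresponding AD events around the construct.

#include <cstdio>

// Illustrative replacement macros (hypothetical names). A real macro
// backend would also generate the begin events at the opening macro and
// the matching end events at the closing macro.
#define MY_PARALLEL() _Pragma("omp parallel") {
#define MY_END_PARALLEL }

int main() {
  MY_PARALLEL()
    std::printf("hello from one thread of the team\n");
  MY_END_PARALLEL
  return 0;
}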
5.1 The OMPT Backend

The OpenMP 5.0 specification [23] formalizes the OpenMP workflow by defining execution model events. For each event, a custom callback function can be registered that is invoked by the OpenMP runtime at each occurrence of the corresponding event in the respective thread. The callback function receives detailed information about the OpenMP mechanisms and objects involved. In many cases, it also offers possibilities to associate custom data with events. This allows developers to build their own tools on top of OpenMP and integrate them seamlessly into the workflow of the OpenMP runtime. We use these features for the generation of AD events without any modifications of the original source code. Since the AD events introduced in Section 4.1 are derived from the execution model events, a key part of a callback implementation is always the transformation to the corresponding AD event. Additionally, OMPT is powerful enough to fulfill further requirements identified throughout Section 4, for example AD data transport, mutex identifier generation or general access to parallel AD data.

We illustrate how a backend can be implemented with functionality provided by OMPT using the example of a parallel region and its implicit tasks. The OpenMP 5.0 specification [23] defines the following order of events for a parallel region.

• A thread that encounters a parallel construct receives a parallel-begin event before any implicit task associated with the parallel region is created.
• Guided by the requested degree of parallelism, implicit tasks are generated, all of which execute the structured block of the parallel construct. Each implicit task is executed by a dedicated thread. The thread that encountered the parallel region executes the implicit task with index zero.
• Each thread that participates in the parallel region receives an implicit-task-begin event before it starts executing the structured block. After it has finished executing the structured block and has received events associated with the implicit barrier at the end of the parallel region, it receives an implicit-task-end event.
• After the thread that encountered the parallel region has received its implicit-task-end event, it receives a parallel-end event. After that, it resumes execution of the task in which it encountered the parallel region.

This order is inherited by the corresponding AD events. We remark that no ordering is specified for the parallel-end event and implicit-task-end events of implicit tasks other than the one with index zero. The extreme case is that they occur no sooner than immediately prior to the next implicit-task-begin in that thread, in particular after the next parallel-begin event. We observed this in the LLVM OpenMP runtime. The implementation of the differentiation logic has to be correct also for these extreme cases. Listing 4 displays implementations of the callbacks for parallel-begin and parallel-end events that match the ompt_callback_parallel_begin_t and ompt_callback_parallel_end_t signatures as defined by the OpenMP 5.0 specification [23].

void on_parallel_begin(ompt_data_t*, const ompt_frame_t*,
                       ompt_data_t* parallelData, unsigned int requested,
                       int, void const*) {
  parallelData->ptr = OpDiLib::logic->onParallelBegin(requested);
}

void on_parallel_end(ompt_data_t* parallelData, ompt_data_t*, int,
                     void const*) {
  OpDiLib::logic->onParallelEnd(parallelData->ptr);
}

Listing 4: Callbacks for parallel-begin and parallel-end events.

We omitted the names of arguments that we do not use.
The callback for the parallel-begin event receives information about the requested parallelism, that is, the number of threads that was specified with the num_threads clause if present, or the default number of threads instead. As explained in Section 4.1, this information is needed to prepare the recording environment and hence forwarded to the AD event. Furthermore, the OpenMP runtime allocates a parallel data object of structured type ompt_data_t that is associated with the parallel region. It has a member void* ptr that can be used to add custom data. We use it to store the parallel AD data object returned by the logic layer. The callback for the corresponding parallel-end event receives the same OMPT data object with the ptr member as it was assigned in the parallel-begin callback. Hence, we can use it to transport the AD data between the matching begin and end events as displayed in Figure 1 and discussed in Section 4.1. OMPT's data object associated with the current parallel region can also be queried at arbitrary points in the code via the ompt_get_parallel_info runtime entry point. Thus, we implement access to the parallel AD data object of the current parallel region as is needed in Section 4.2 by querying OMPT's parallel data object, and extracting the parallel AD data object from it in the same fashion as in Listing 4. Listing 5 displays the implementation of the callback for implicit-task-begin and implicit-task-end events. It has the ompt_callback_implicit_task_t signature as required by the OpenMP 5.0 specification [23].

void on_implicit_task(ompt_scope_endpoint_t endpoint,
                      ompt_data_t* parallelData, ompt_data_t* taskData,
                      unsigned int actual, unsigned int index, int flags) {
  if (flags & ompt_task_initial) {
    return;
  }

  if (ompt_scope_begin == endpoint) {
    taskData->ptr = OpDiLib::logic->onImplicitTaskBegin(actual, index,
                                                        parallelData->ptr);
  } else {
    OpDiLib::logic->onImplicitTaskEnd(taskData->ptr);
  }
}

Listing 5: Callback for implicit-task-begin and implicit-task-end events.

It filters out the special case of the initial implicit task, that is, the task that executes the serial parts of the program. It does not require additional AD treatment. The callback receives all parameters from the OpenMP runtime that are needed for the implicit task AD events, that is, the actual degree of parallelism (as opposed to the requested one), the index in the team of threads, and the parallel AD data object. The task data object provided by OMPT is again used to exchange AD data with the corresponding end event.

Table 1 provides an overview over all callbacks that we use to generate the AD events with OpDiLib's OMPT backend. Not each execution model event has a dedicated callback function. On the contrary, a single callback often handles a large number of events. For example, the critical-acquired event due to a #pragma omp critical directive and the lock-acquired event caused by the runtime function omp_set_lock both result in invocations of the ompt_callback_mutex_acquired callback. Likewise, an ompt_callback_sync_region callback handles (amongst others) explicit barrier directives and implicit barriers at the end of, e. g., worksharing directives that do not specify a nowait clause. It is beneficial for AD that similar types of events are collected in one callback for two reasons. First, this coincides often with the fact that all execution model events collected by the same callback also require the same treatment from an AD point of view and can be forwarded to the same module of the logic layer.
For example, this is the case for mutex type events. Second, we do not have to distinguish whether the execution model event is caused by a directive, the presence or absence of clauses, a runtime function call, or other mechanisms. This allows us to conveniently establish AD support for a large range of OpenMP features.

OMPT callback type              AD events
ompt_callback_parallel_begin    ParallelBegin
ompt_callback_parallel_end      ParallelEnd
ompt_callback_implicit_task     ImplicitTaskBegin, ImplicitTaskEnd
ompt_callback_mutex_acquired    MutexAcquired
ompt_callback_mutex_released    MutexReleased
ompt_callback_sync_region       SyncRegionBegin, SyncRegionEnd
ompt_callback_work              WorkBegin, WorkEnd
ompt_callback_reduction         MutexAcquired, MutexReleased
Table 1: Mapping between OMPT callbacks and AD events.

In connection with the discussion in Section 4.1, the OpenMP runtime already passes unique mutex identifiers to mutex callbacks, which we can reuse immediately to identify and track mutexes in the logic layer. We remark that it is not guaranteed that pending mutex-released events in one thread occur prior to mutex-acquired events for the same mutex in another thread: unlike the handling of the mutex-acquired event, the handling of the mutex-released event is not protected by the corresponding mutex. The implementation of the differentiation logic has to support this. The reduction callback is special among the cases in Table 1 in the sense that it produces mutex AD events. According to the OpenMP 5.0 specification [23], the reduction-begin event is received before the task begins to perform loads and stores associated with the reduction, and the reduction-end event is received after these are completed. The reduction is usually implemented in a tree-like fashion, so that the reduction events might mark updates both to intermediate shared variables and to the final reduction target. By mapping the reduction events to AD events for an imaginary mutex, we record the reduction operations as if they were performed in an equivalent serial manner, that is, we choose a total order for all reduction events that preserves the partial order indicated by the reduction tree. Reversal is then performed according to this total order. Conveniently, the LLVM OpenMP runtime produces sync region events for a so-called implementation specific barrier that separates the execution of the structured block and the accumulation in private reduction variables from the part of the reduction that actually involves synchronization. These events are detected and handled by OpDiLib like any other sync region. Hence, when using this OpenMP runtime, it is ensured that no thread begins with the reversal of the structured block before all reduction operations are reverted. This is important for threads that did not participate in the reduction and did not receive mutex events. This implementation specific barrier is not guaranteed by the OpenMP specification. Besides its usefulness for AD, it provides insights into the reduction process that should also be of interest to performance monitoring tools. Therefore, we think that it is a desirable feature in general and would like to see it in future versions of the OpenMP specification. If OpDiLib is used together with other OMPT implementations, it might be necessary to add a reverse barrier by hand, or to use semi-automatic techniques similar to those in the macro backend.
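As an illustration of the mapping of reduction events to mutex AD events, consider the following sketch of a reduction callback. It assumes the ompt_callback_sync_region_t signature that the OpenMP 5.0 specification prescribes for ompt_callback_reduction; the logic layer function onMutexReleased, the mutex kind OpDiLib::Reduction, and the fixed imaginary mutex identifier are our illustrative assumptions, not OpDiLib's verbatim interface.

void on_reduction(ompt_sync_region_t, ompt_scope_endpoint_t endpoint,
                  ompt_data_t*, ompt_data_t*, void const*) {
  // a single imaginary mutex serializes the recorded reduction updates
  int const imaginaryReductionMutex = 0;  // hypothetical fixed identifier

  if (ompt_scope_begin == endpoint) {
    OpDiLib::logic->onMutexAcquired(OpDiLib::Reduction, imaginaryReductionMutex);
  } else {
    OpDiLib::logic->onMutexReleased(OpDiLib::Reduction, imaginaryReductionMutex);
  }
}

We close this section with a discussion of additional issues that arise from the use of OMPT.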
Execution model events due to internal synchronization.
OpDiLib uses OpenMP's lock facilities for internal synchronization purposes. From the OpenMP runtime's point of view, these are no different from locks in user code and hence produce the same execution model events and cause invocations of the same callbacks. Besides the unnecessary performance cost of differentiating the logic layer's internals, there is also a more immediate issue. If the user successfully sets a lock, an ompt_callback_mutex_acquired callback is dispatched and generates a
MutexAcquired
AD event. The handling of this event, however, itself requires the acquisition of an internal lock that invokes the same callback, resulting in an infinite recursion. To fix this, the logic layer keeps track of inactive mutexes. These are then filtered out in the AD event handling. An experienced user can also use these mechanisms to mark mutexes from user code as inactive by a call to
OpDiLib::logic->registerInactiveMutex(mutexKind, mutexId). Such mutexes are then disregarded for the differentiation. This is also recommended for mutexes introduced in the underlying AD tool for the purpose of thread safety.
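For example, a user lock that serves purely synchronization purposes and has no differentiable effect could be excluded as follows. This is a usage sketch; it assumes that the mutex identifier is obtained via the backend function getLockIdentifier shown in Listing 6 below.

omp_lock_t guardLock;
omp_init_lock(&guardLock);

// exclude this lock from the differentiation; its mutex events are
// subsequently filtered out in the AD event handling
OpDiLib::logic->registerInactiveMutex(OpDiLib::Lock,
    OpDiLib::backend->getLockIdentifier(&guardLock));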
Execution model events during reversal.
Since OpDiLib embeds OpenMP directives into the tape, execution model events are also produced during the reverse pass. However, the tapes are not in recording mode at this point. This is detected in the logic layer and the AD handling is skipped. We remark that, despite the limited practical relevance, execution model events during reversal might have an application for higher order derivatives that are computed in an adjoints-of-adjoints fashion, see the previous discussion at the end of Section 4.1. In an application of OpDiLib to itself, the execution model events during an evaluation for the higher order type could be used to perform the recording for the lower order type. This relates also to the next topic of discussion.
Multiple OMPT tools.
Suppose we want to apply OpDiLib to an OpenMP parallelized code that already uses an OMPT tool, for example for performance monitoring purposes. The simultaneous use of multiple OMPT tools is beyond the scope of the OpenMP 5.0 specification [23]. To overcome this limitation, there has been work on the multiplexing of OMPT tools [25], which is now part of clang. We will conduct further tests to see whether it can be used for our purposes. Otherwise, the macro backend may serve as a fallback solution.
Without OMPT, we are not aware of any way to change the behaviour of the original OpenMP constructs in the scope of the program. Instead, we provide an alternative interface to OpenMP functionality. Code that is written according to this interface can then also be differentiated if a compiler without OMPT support is used. Specifically, we introduce macros that have to be used instead of the usual OpenMP directives and clauses, as well as functions that have to be used instead of the usual OpenMP runtime functions. This provides a semi-automatic way of differentiating OpenMP parallel code with operator overloading. The replacements always contain the original OpenMP construct but embed it into additional code that generates the required AD events; aside from that, the replacements follow mostly the same structure as the original OpenMP constructs.
Runtime functions.
For each runtime function omp_*, we introduce a replacement opdi_*. For example, omp_set_lock is replaced by opdi_set_lock. The replacement has the same signature as the original function. Its implementation always forwards the call to the original runtime function, but surrounds it with the appropriate generation of AD events, as showcased in Listing 6.

void opdi_set_lock(omp_lock_t* lock) {
  omp_set_lock(lock);
  OpDiLib::logic->onMutexAcquired(OpDiLib::Lock,
      OpDiLib::backend->getLockIdentifier(lock));
}

Listing 6: omp_set_lock runtime function replacement.

As can be seen in Listing 6, the generation of mutex identifiers that was previously performed by the OMPT runtime now requires a dedicated custom implementation in the backend layer. For reasons of consistency, we offer replacements also for runtime functions that do not require AD treatment. A user code that should compile both with and without OpDiLib can then use a global macro to decide whether the omp_ or opdi_ prefix is used for the runtime function calls, without keeping track of the necessity of this switch for each individual runtime function.
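A minimal sketch of such a global switch is given below; the macro names OPDILIB_ENABLED and OMP_RUNTIME as well as this exact pattern are our illustration, not a fixed part of OpDiLib's interface.

#ifdef OPDILIB_ENABLED
  #define OMP_RUNTIME(function) opdi_##function
#else
  #define OMP_RUNTIME(function) omp_##function
#endif

// usage: expands to opdi_set_lock(&lock) or omp_set_lock(&lock), respectively
OMP_RUNTIME(set_lock)(&lock);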
Directives.

We replace each OpenMP directive of the form #pragma omp directive clauses by a custom macro
OPDI_DIRECTIVE(...). The idea is that the same clauses and hints that were passed to the original directive can also be passed to the macro, possibly again in the shape of their replacements. The macro expands to the original OpenMP directive surrounded by corresponding AD treatment. This is showcased for the barrier directive in Listing 7.

// OPDI_STR is a stringization helper added here so that the schematic
// listing forms a valid _Pragma operand
#define OPDI_STR(...) #__VA_ARGS__

#define OPDI_BARRIER(...) \
  OpDiLib::logic->onSyncRegionBegin(OpDiLib::Barrier); \
  _Pragma(OPDI_STR(omp barrier __VA_ARGS__)) \
  OpDiLib::logic->onSyncRegionEnd(OpDiLib::Barrier);
Listing 7: Barrier replacement macro.

The situation is more involved for OpenMP directives that are followed by a structured block, for example the parallel directive. This has two reasons. First, it might be necessary to generate AD events within the structured block. Second, it might be necessary to spawn AD events after the structured block. We make heavy use of a privatization technique that was already applied to address these issues for the handling of parallel regions in ADOL-C [19, 5]. Augmentations at the beginning and end of the structured block can be generated as side effects of constructors and destructors of objects that are privatized via data-sharing clauses. These privatizations can be generated as part of both replacement macros and replacement clauses. However, this technique is not applicable or practical in all situations. First, it covers only OpenMP directives that support data-sharing clauses. Furthermore, the OpenMP specification leaves the respective order of constructor and destructor calls unspecified, does not support multiple privatizations of the same variable on one directive, and restricts subsequent privatization in nested directives to some extent [23]. Privatization is the only approach we are aware of to design replacement clauses with side effects on the recording environment. Thus, their limitations give rise to some special cases which we discuss below. To find a remedy for directives, we complement each
OPDI_DIRECTIVE(...) replacement that is followed by a structured block with a corresponding
OPDI_END_DIRECTIVE macro that has to be placed after the structured block. This is showcased in Listing 8 and resembles the Fortran way of using OpenMP directives rather than the C++ one.

OPDI_DIRECTIVE(...)
{
  ...
}
OPDI_END_DIRECTIVE
Listing 8: Schematic application of a replacement macro and the corresponding end macro.

OpenMP directive                   replacement macro               end macro
#pragma omp barrier ...            OPDI_BARRIER(...)
#pragma omp parallel ...           OPDI_PARALLEL(...)              OPDI_END_PARALLEL
#pragma omp for ...                OPDI_FOR(...)                   OPDI_END_FOR
#pragma omp sections ...           OPDI_SECTIONS(...)              OPDI_END_SECTIONS
#pragma omp single ...             OPDI_SINGLE(...)                OPDI_END_SINGLE
#pragma omp single nowait ...      OPDI_SINGLE_NOWAIT(...)         OPDI_END_SINGLE
#pragma omp critical ...           OPDI_CRITICAL(...)              OPDI_END_CRITICAL
#pragma omp critical (name) ...    OPDI_CRITICAL_NAME(name, ...)   OPDI_END_CRITICAL
#pragma omp ordered ...            OPDI_ORDERED(...)               OPDI_END_ORDERED
#pragma omp master ...             OPDI_MASTER(...)                OPDI_END_MASTER
#pragma omp section ...            OPDI_SECTION(...)               OPDI_END_SECTION
Table 2: Replacement macros and corresponding end macros.

End macros also provide important degrees of flexibility in our implementation with respect to future adaptions. Table 2 provides an overview of all replacement directives and the corresponding end macros. Among the replacement directives, there are two special cases. The first are named critical regions. The critical directive does not support data-sharing clauses, but the supplied name is essential for the generation of a unique mutex identifier, so that it has to be provided via a dedicated macro. Second, while the single directive supports data-sharing clauses, these apply only to the single thread that executes the structured block, which is not sufficient for the treatment of the implicit barrier that regards all threads. Hence, a dedicated macro for single directives with a nowait clause is introduced. Note that the special replacements still use the same respective end macros.
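Schematically, a parallel region containing a critical section would then be written as follows. This usage sketch is based on Table 2; the clause and the block contents are placeholders of our choosing.

OPDI_PARALLEL(num_threads(4))
{
  // structured block, executed by every thread of the team

  OPDI_CRITICAL()
  {
    // exclusively executed part
  }
  OPDI_END_CRITICAL
}
OPDI_END_PARALLEL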
Clauses.
Most clauses and hints do not require replacements and are passed to the replacement directives in their original form. Besides the special replacement directives discussed above, we provide an OPDI_NOWAIT replacement clause that has to be used instead of nowait for correct treatment of the corresponding implicit barrier. Furthermore, any directive that specifies reduction(...) clauses has to specify the OPDI_REDUCTION macro once, alongside the otherwise unmodified reduction clauses. It provides the reverse barrier that separates the threadprivate reduction phase from the shared, tree-like reduction phase. We remark that this macro has to be provided as a supplement instead of a replacement because multiple reduction clauses on one directive are allowed, but multiple privatizations of the same variable on one directive are forbidden [23].
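As a usage sketch, a worksharing loop with a custom reduction could then look as follows; adSum stands for a custom reduction declared via the macro discussed in the next paragraph, and Real for the AD type. Both names are our placeholders.

Real sum = 0.0;
OPDI_FOR(reduction(adSum : sum) OPDI_REDUCTION)
for (int i = 0; i < n; ++i) {
  sum += x[i];
}
OPDI_END_FOR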
Reductions.
While the OPDI_REDUCTION macro handles the required reduction barrier, it does not handle the mapping of reductions to mutex AD events. In the macro backend, we associate reductions with actual nested locks so that the required mutex AD events are produced by setting and unsetting these locks. This logic is hidden inside a replacement macro for the declaration of custom reductions. Since any reduction involving AD types is of custom type, this covers all the relevant cases. The replacement macro reads OPDI_DECLARE_REDUCTION(OP_NAME, TYPE, OP, INIT) and expands to code as displayed in Listing 9.

#pragma omp declare reduction(OP_NAME : TYPE : \
    OpDiLib::Reducer(omp_out) \
        = OpDiLib::Reducer(omp_out) OP OpDiLib::Reducer(omp_in)) \
  initializer(omp_priv = INIT)

Listing 9: Schematic reduction declaration.

It involves casts of omp_in and omp_out to a Reducer object that is associated with a global nested lock. Note that a nested lock has the distinguished property that it can be set multiple times from within the same thread without causing deadlocks. We implement the Reducer such that the nested lock is set in constructor calls, which happens three times in the course of one reduction statement. The three set operations are complemented by three unset operations as part of the overload of operator= for the Reducer. This ensures that the whole reduction statement is exclusively performed by one thread at a time. This produces a serialization of the reduction tree in the forward pass that is preserved by the mutex AD handling for the reverse pass. In addition, we ensure via templating techniques that different custom reductions use different Reducer objects and hence different nested locks.
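The following sketch illustrates how such a Reducer could be implemented. The details, in particular the lock management and the single supported operator, are our simplifying assumptions and not OpDiLib's actual code.

// one nested lock per custom reduction (OpDiLib achieves this via templating)
static omp_nest_lock_t reducerLock;

template<typename T>
struct Reducer {
  T& value;

  explicit Reducer(T& v) : value(v) {
    // three Reducer objects are constructed per reduction statement,
    // hence the nested lock is set three times by the same thread
    omp_set_nest_lock(&reducerLock);
  }

  // combination for a '+' reduction; other operators are analogous
  friend T operator+(Reducer const& a, Reducer const& b) {
    return a.value + b.value;
  }

  Reducer& operator=(T const& rhs) {
    value = rhs;
    // complement the three set operations performed by the constructors
    omp_unset_nest_lock(&reducerLock);
    omp_unset_nest_lock(&reducerLock);
    omp_unset_nest_lock(&reducerLock);
    return *this;
  }
};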
To study the performance of OpDiLib, we apply it to a PDE solver for the coupled Burgers' equations [3, 4, 34]

u_t + u u_x + v u_y = \frac{1}{R} \left( u_{xx} + u_{yy} \right), \qquad v_t + u v_x + v v_y = \frac{1}{R} \left( v_{xx} + v_{yy} \right) \qquad (3)

that were previously used for the performance evaluation of various aspects of automatic differentiation in [27, 28, 29]. We solve (3) on (x, y, t) \in D \times \mathbb{R}_{\geq 0} where D is a scaling of the unit square [0, 1]^2, together with boundary conditions and initial conditions at t = 0 which are taken from the exact solution

u(x, y, t) = \frac{x + y - 2xt}{1 - 2t^2}, \qquad v(x, y, t) = \frac{x - y - 2yt}{1 - 2t^2}

that is given in [4]. We solve the equations with an upwind finite difference scheme, for which we discretize D by an equidistant grid and advance in time steps of fixed size \Delta t.

The solver is implemented in a hybrid parallel manner. The spatial domain is first decomposed into columns of equal width, one for each MPI process. Halo exchange via MPI communication is required at the column boundaries. Second, each MPI process works on its column with multiple OpenMP threads. To that end, the column is further subdivided into rows of equal height, resulting in one block for each OpenMP thread. This is visualized in Figure 6a. Each MPI process stores its part of the solution field in row-major order and an OpenMP thread performs row-wise sweeps with the finite difference stencil. This is illustrated in Figure 7. We use distributed memory parallelization between NUMA domains and employ shared memory parallelization within these. To that end, we bind MPI processes to NUMA domains with hwloc and tie OpenMP threads to cores using OpenMP environment variables [23].
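For illustration, a single forward sweep over one thread's block could look roughly as follows. This is our sketch, not the authors' solver code; it assumes positive transport velocities for the upwind directions and an AD type Real such as codi::RealReverse, with the fields stored in row-major order.

// one time step on a thread's block; u, v and uNew are fields of the AD
// type Real; dx, dy, dt and R are problem parameters, nx is the row length
for (int i = rowBegin; i < rowEnd; ++i) {    // rows of this thread's block
  for (int j = 1; j < nx - 1; ++j) {         // row-wise sweep, cf. Figure 7
    Real const lapU = (u[i][j+1] - 2.0 * u[i][j] + u[i][j-1]) / (dx * dx)
                    + (u[i+1][j] - 2.0 * u[i][j] + u[i-1][j]) / (dy * dy);

    uNew[i][j] = u[i][j]
               - dt * (u[i][j] * (u[i][j] - u[i][j-1]) / dx    // upwind u u_x
                     + v[i][j] * (u[i][j] - u[i-1][j]) / dy)   // upwind v u_y
               + dt / R * lapU;                                // diffusion

    // the update for v follows analogously from the second equation in (3)
  }
}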
Figure 6: Decomposition of the computational domain. Column splitting for MPI processes and row splitting for OpenMP threads. Shared reading might occur at block boundaries between adjacent OpenMP threads in (a). This is avoided in (b) by two separate OpenMP sweeps.

Figure 7: Pattern of subsequent memory access with the five-point stencil in a sweep along a row of the solution field, starting at the left column boundary. Showcases the read-write pattern in the forward pass as well as close and distant memory locations with a row-major memory layout. The reverse pass follows an analogous pattern with reversed order of access and exchanged read/write roles.

We use the reverse mode of AD to differentiate the L2 norm of the solution after the final time step with respect to the initial data. CoDiPack [28] is used as the underlying AD tool, configured for Jacobi taping [17] with a reuse index handler [29]. For the differentiation of OpenMP, we couple it with OpDiLib as explained in Section 3. CoDiPack is also coupled with MeDiPack for the differentiation of MPI.

In the splitting strategy displayed in Figure 6a, shared reading might occur at the boundary between two adjacent OpenMP threads. In order to evaluate the impact of atomic updates on the performance of the reverse pass, we include a configuration that prevents shared reading. This is achieved by doubling the number of blocks in the OpenMP decomposition, see Figure 6b. In the first sweep, the blocks with even index are processed in parallel by the OpenMP threads; in the second sweep, the ones with odd index are treated. This way, no adjacent blocks are processed in parallel at the same time. We compile with both clang++ (https://releases.llvm.org/) and g++ (https://gcc.gnu.org/gcc-10/). The latter does not support OMPT yet and is hence an important use case for the macro backend presented in Section 5.2.

The simulations in this paper were carried out with R = 1 and D = [0, …]^2, on a grid with 2000 × … cells and a fixed time step size ∆t = 0.…
Figure 8: Forward pass (a) and reverse pass (b) performance comparison of different parallel configurations within one NUMA domain, average time and variation in s. Includes setups with classical AD, with the macro backend, and with the OMPT backend, compiled with g++ and clang++. In (a), we display also primal performance values. Numbers denote averages of 20 runs after five discarded warm-up runs.

Figure 9: Memory high water marks in MiB of different parallel configurations within one NUMA domain (no OpenMP; 1 OpenMP thread; 1 OpenMP thread, no shared reading; 4 OpenMP threads; 4 OpenMP threads, no shared reading). Includes setups with classical AD, with the macro backend, and with the OMPT backend, compiled with g++ and clang++.

This effect is reduced with the OMPT backend, but the comparison is still in favour of serial reverse runs. Without shared reading (“4 OpenMP threads, no shared reading”), on the other hand, we obtain the expected speedup of slightly less than four with respect to the serial configurations also in the reverse pass. Figure 9 displays memory high water marks for the different AD setups from Figure 8. There is almost no difference between all setups and configurations. Compared to the classical serial configuration, any additional memory consumption due to OpDiLib's differentiation logic and backend is negligible.

Next, we investigate the strong scaling of the different AD setups up to the full node in an OpenMP-MPI hybrid parallel manner. Given the hardware specifics, we fix the number of OpenMP threads to four per MPI process and subsequently increase the number of MPI processes until the full node is used. For each setup, we use the respective configuration of one MPI process with one OpenMP thread as the baseline for the strong scaling. Note that the scaling plots cannot be used to compare absolute performance values. The results for the forward and the reverse pass are displayed in Figure 10. The results for the forward pass include also the scaling results for the primal, undifferentiated code, and we see that all AD setups achieve better scaling than the primal ones. This is often observed in AD and is attributed to the fact that AD improves the computation to memory ratio in the forward pass [20]. Neither the backend choice nor the shared reading optimization have any influence on the scaling in the forward pass. Among the AD setups, the best scaling is achieved with clang++. In the reverse pass, all AD configurations achieve good scaling. With clang++, configurations without atomic updates outscale their shared reading counterparts in the reverse pass and exhibit even superlinear behaviour. With g++, on the other hand, the reverse pass scaling is better if atomic updates are used. To understand how this is possible, we observe that there are two ways in which atomics influence the performance of the parallel reverse pass.
Figure 10: Scaling results of OpDiLib for the forward pass (a) and the reverse pass (b), speedup over cores. Includes for comparison primal scaling results in the forward pass and the ideal scaling curves.

First, there is the absolute performance impact of performing operations atomically. Second, the performance may decrease due to additional serialization in the reverse pass, which should be the most significant factor of influence on the scaling curve. In the splitting strategy displayed in Figure 6, however, shared reading, and hence serialization in the reverse pass, can only occur at the cells at block boundaries between adjacent OpenMP threads. The number of these cells is furthermore invariant with respect to the number of MPI processes, so that we cannot observe serialization at all in our scaling plots. This explains why we achieve good scaling despite atomic updates. Furthermore, the fact that configurations without atomics are much faster in terms of absolute performance implies that they operate closer to the memory bandwidth of the socket, which hinders scaling. We attribute the overall good scaling results also to the splitting strategy displayed in Figure 6. Increasing the number of MPI processes corresponds to decreasing the width of the column for each process. This, in turn, implies that the distant memory locations that are accessed during one application of the finite difference stencil move closer to each other, see Figure 7. This applies analogously to the corresponding pattern of adjoint variable access in the reverse pass. Hence, memory locality improves with an increasing number of MPI processes.

To summarize, parallel performance is, as always, driven by multiple factors. For now, the most important observation is that the application of OpDiLib does not, in principle, prevent very good parallel performance in practice.
With OpDiLib, we have designed a tool for the automatic differentiation of OpenMP parallel codes that is fast, flexible, reusable, and easy to apply. Our interpretation of AD augmentations as events alongside the OpenMP workflow allowed us to cover an unprecedented extent of OpenMP mechanisms in an operator overloading AD tool. We provide this functionality in the new open source software OpDiLib as a universal add-on for operator overloading AD tools. We publish it together with bindings for CoDiPack and hope that it is also a useful addition to other operator overloading AD tools. In order to couple OpDiLib to another AD tool, that tool must be made thread-safe in a first step. The coupling performed in the second step provides capabilities for handling OpenMP constructs, including an automatic deduction of a parallel reverse pass. With the OMPT backend, we achieve full automation, that is, we do not require any additional code changes. We do not restrict the data access patterns either. Hence, if a serial code that was already differentiated with the classical AD tool is parallelized by means of OpenMP, an initial parallel differentiated code is obtained immediately. While we apply the conservative strategy of atomic updates on the adjoint variables, we do not prevent further optimization and equip the user with tools that can disable atomic evaluation for any parts of the code. We complemented the OMPT features with an alternative macro backend, which serves as a fallback solution and covers also compilers without OMPT support. OpDiLib's flexible design allowed us to support the different use cases and modes of operation presented in this paper and will likely benefit future modifications and applications. We have demonstrated in the performance tests that all configurations of OpDiLib preserve the serial AD memory footprint and that very good parallel performance can be achieved in practice. We will enhance and improve OpDiLib in the future, and work on additional features and configurations is in progress. OpDiLib's OpenMP support as presented in this paper covers all directives, clauses and runtime functions of the OpenMP 2.5 specification [21] with the exception of …

References

[1] T. Albring, M. Sagebaum, and N. R. Gauger. Development of a Consistent Discrete Adjoint Solver in an Evolving Aerodynamic Design Framework. In AIAA Aviation. American Institute of Aeronautics and Astronautics, 2015. doi:10.2514/6.2015-3240.

[2] D. Auroux and V. Groza. Optimal parameters identification and sensitivity study for abrasive waterjet milling model.
Inverse Problems in Science and Engineering, 25(11):1560–1576, 2017. doi:10.1080/17415977.2016.1273916.

[3] A. Bahadır. A fully implicit finite-difference scheme for two-dimensional Burgers' equations. Applied Mathematics and Computation, 137(1):131–137, 2003. doi:10.1016/S0096-3003(02)00091-7.

[4] J. Biazar and H. Aminikhah. Exact and numerical solutions for non-linear Burger's equation by VIM. Mathematical and Computer Modelling, 49(7):1394–1400, 2009. doi:10.1016/j.mcm.2008.12.006.

[5] C. Bischof, N. Guertler, A. Kowarz, and A. Walther. Parallel Reverse Mode Automatic Differentiation for OpenMP Programs with ADOL-C. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, and J. Utke, editors, Advances in Automatic Differentiation, pages 163–173. Springer Berlin Heidelberg, 2008. doi:10.1007/978-3-540-68942-3_15.

[6] H. M. Bücker, B. Lang, D. an Mey, and C. H. Bischof. Bringing Together Automatic Differentiation and OpenMP. In Proceedings of the 15th International Conference on Supercomputing, ICS '01, pages 246–251, New York, NY, USA, 2001. Association for Computing Machinery. doi:10.1145/377792.377842.

[7] H. M. Bücker, B. Lang, A. Rasch, C. H. Bischof, and D. an Mey. Explicit loop scheduling in OpenMP for parallel automatic differentiation. In Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications, pages 121–126. IEEE, 2002. doi:10.1109/HPCSA.2002.1019144.

[8] H. M. Bücker, A. Rasch, and A. Wolf. A Class of OpenMP Applications Involving Nested Parallelism. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04, pages 220–224, New York, NY, USA, 2004. Association for Computing Machinery. doi:10.1145/967900.967948.

[9] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, D. Lorenz, et al. OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis. In A. P. Rendell, B. M. Chapman, and M. S. Müller, editors, OpenMP in the Era of Low Power Devices and Accelerators, pages 171–185. Springer Berlin Heidelberg, 2013. doi:10.1007/978-3-642-40698-0_13.

[10] M. Förster. Algorithmic Differentiation of Pragma-Defined Parallel Regions. Springer Vieweg, 2014. doi:10.1007/978-3-658-07597-2. PhD thesis, RWTH Aachen University.

[11] N. R. Gauger, A. Walther, C. Moldenhauer, and M. Widhalm. Automatic Differentiation of an Entire Design Chain for Aerodynamic Shape Optimization. In C. Tropea, S. Jakirlic, H.-J. Heinemann, R. Henke, and H. Hönlinger, editors, New Results in Numerical and Experimental Fluid Mechanics VI, pages 454–461. Springer Berlin Heidelberg, 2008. doi:10.1007/978-3-540-74460-3_56.

[12] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. SIAM, 2008. doi:10.1137/1.9780898717761.

[13] S. Günther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger. Layer-Parallel Training of Deep Residual Neural Networks. SIAM Journal on Mathematics of Data Science, 2(1):1–23, 2020. doi:10.1137/19M1247620.

[14] L. Hascoët and V. Pascual. The Tapenade Automatic Differentiation Tool: Principles, Model, and Specification. ACM Trans. Math. Softw., 39(3), 2013. doi:10.1145/2450153.2450158.

[15] P. Heimbach, C. Hill, and R. Giering. Automatic Generation of Efficient Adjoint Code for a Parallel Navier-Stokes Solver. In P. M. A. Sloot, A. G. Hoekstra, C. J. K. Tan, and J. J. Dongarra, editors, Computational Science — ICCS 2002, pages 1019–1028. Springer Berlin Heidelberg, 2002. doi:10.1007/3-540-46080-2_107.

[16] P. Heimbach, C. Hill, and R. Giering. An efficient exact adjoint of the parallel MIT General Circulation Model, generated via automatic differentiation. Future Generation Computer Systems, 21(8):1356–1371, 2005. doi:10.1016/j.future.2004.11.010.

[17] R. J. Hogan. Fast Reverse-Mode Automatic Differentiation Using Expression Templates in C++. ACM Trans. Math. Softw., 40(4), 2014. doi:10.1145/2560359.

[18] J. Hückelheim, P. Hovland, M. M. Strout, and J.-D. Müller. Reverse-mode algorithmic differentiation of an OpenMP-parallel compressible flow solver. The International Journal of High Performance Computing Applications, 33(1):140–154, 2019. doi:10.1177/1094342017712060.

[19] A. Kowarz. Advanced concepts for automatic differentiation based on operator overloading. PhD thesis, Techn. Univ. Dresden, 2008. URL https://nbn-resolving.org/urn:nbn:de:bsz:14-ds-1206719130404-22306.

[20] A. Nemili, E. Özkaya, N. R. Gauger, F. Kramer, and F. Thiele. Discrete Adjoint Based Optimal Active Control of Separation on a Realistic High-Lift Configuration. In A. Dillmann, G. Heller, E. Krämer, C. Wagner, and C. Breitsamter, editors, New Results in Numerical and Experimental Fluid Mechanics X, pages 237–246, Cham, 2016. Springer International Publishing. doi:10.1007/978-3-319-27279-5_21.

[21] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 2.5, May 2005.

[22] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.0, July 2013.

[23] OpenMP Architecture Review Board. OpenMP Application Programming Interface Version 5.0, November 2018.

[24] E. Özkaya and N. R. Gauger. Automatic transition from simulation to one-shot shape optimization with Navier-Stokes equations. GAMM-Mitteilungen, 33(2):133–147, 2010. doi:10.1002/gamm.201010011.

[25] J. Protze, T. Cramer, S. Convent, and M. S. Müller. OMPT-Multiplex: Nesting of OMPT Tools. In C. Niethammer, M. M. Resch, W. E. Nagel, H. Brunst, and H. Mix, editors, Tools for High Performance Computing 2017, pages 73–83, Cham, 2019. Springer International Publishing. doi:10.1007/978-3-030-11987-4_5.

[26] M. Sagebaum, N. R. Gauger, U. Naumann, J. Lotz, and K. Leppkes. Algorithmic Differentiation of a Complex C++ Code with Underlying Libraries. Procedia Computer Science, 18:208–217, 2013. doi:10.1016/j.procs.2013.05.184. 2013 International Conference on Computational Science.

[27] M. Sagebaum, T. Albring, and N. R. Gauger. Expression templates for primal value taping in the reverse mode of algorithmic differentiation. Optimization Methods and Software, 33(4-6):1207–1231, 2018. doi:10.1080/10556788.2018.1471140.

[28] M. Sagebaum, T. Albring, and N. R. Gauger. High-Performance Derivative Computations using CoDiPack. ACM Trans. Math. Softw., 45(4), 2019. doi:10.1145/3356900.

[29] M. Sagebaum, J. Blühdorn, and N. R. Gauger. Index handling and assign optimization for Algorithmic Differentiation reuse index managers, 2020. URL https://arxiv.org/abs/2006.12992. Preprint arXiv:2006.12992, submitted to ACM TOMS.

[30] M. Schanen, U. Naumann, L. Hascoët, and J. Utke. Interpretative adjoints for numerical simulation codes using MPI. Procedia Computer Science, 1(1):1825–1833, 2010. doi:10.1016/j.procs.2010.04.204. ICCS 2010.

[31] M. Schanen, M. Förster, J. Lotz, K. Leppkes, and U. Naumann. Adjoining Hybrid Parallel Code. In B. H. V. Topping, editor, Proceedings of the Eighth International Conference on Engineering Computational Technology, volume 100 of Civil-Comp Proceedings, Kippen, Stirlingshire, 2012. Civil-Comp Press. doi:10.4203/ccp.100.7.

[32] S. Schlenkrich, A. Walther, N. R. Gauger, and R. Heinrich. Differentiating Fixed Point Iterations with ADOL-C: Gradient Calculation for Fluid Dynamics. In H. G. Bock, E. Kostina, H. X. Phu, and R. Rannacher, editors, Modeling, Simulation and Optimization of Complex Processes, pages 499–508. Springer Berlin Heidelberg, 2008. doi:10.1007/978-3-540-79409-7_36.

[33] A. Walther. Getting Started with ADOL-C. In U. Naumann, O. Schenk, H. D. Simon, and S. Toledo, editors, Combinatorial Scientific Computing, number 09061 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2009. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Germany. URL https://drops.dagstuhl.de/opus/volltexte/2009/2084.

[34] H. Zhu, H. Shu, and M. Ding. Numerical solutions of two-dimensional Burgers' equations by discrete Adomian decomposition method.