C11Tester: A Race Detector for C/C++ Atomics (Technical Report)

Abstract

Writing correct concurrent code that uses atomics under the C/C++ memory model is extremely difficult. We present C11Tester, a race detector for the C/C++ memory model that can explore executions in a larger fragment of the C/C++ memory model than previous race detector tools. Relative to previous work, C11Tester's larger fragment includes behaviors that are exhibited by ARM processors. C11Tester uses a new constraint-based algorithm to implement modification order that is optimized to allow C11Tester to make decisions in terms of application-visible behaviors. We evaluate C11Tester on several benchmark applications, and compare C11Tester's performance to both tsan11rec, the state-of-the-art tool that controls scheduling for C/C++, and tsan11, the state-of-the-art tool that does not control scheduling.


Weiyu Luo

University of California, Irvine; Irvine, California; [email protected]

Brian Demsky

University of California, Irvine; Irvine, California; [email protected]


The C/C++11 standards added a weak memory model with support for low-level atomics operations [11, 34] that allows experts to craft efficient concurrent data structures that scale better or provide stronger liveness guarantees than lock-based data structures. The potential benefits of atomics can lure both experts and novice developers to use them. However, writing correct concurrent code using these atomics operations is extremely difficult.

Simply executing concurrent code is not an effective approach to testing. Exposing concurrency bugs often requires executing a specific path that might only occur when the program is heavily loaded during deployment, executed on a specific processor, or compiled with a specific compiler. Some prior work helps record and replay buggy executions [45]. Debuggers like Symbiosis [43] and Cortex [44] focus on sequential consistency and test programs by modifying thread scheduling of given initial executions. However, both the thread scheduling and the relaxed behavior of C/C++ atomics are sources of nondeterminism in C/C++ programs that use atomics. Thus, it is necessary to develop tools to help test for concurrency bugs. We present the C11Tester tool for testing C/C++ programs that use atomics.

Figure 1 presents an overview of the C11Tester system. C11Tester is implemented as a dynamically linked library together with an LLVM compiler pass, which instruments atomic operations, non-atomic accesses to shared memory locations, and fence operations with function calls into the C11Tester dynamic library. The C++ and pthread library functions are overridden by the C11Tester library: C11Tester implements its own threading library using fibers to precisely control the scheduling of each thread. The C11Tester library implements a race detector, and C11Tester reports any races or assertion violations that it discovers.
[Figure 1 diagram: unmodified C/C++ source code, LLVM compiler, instrumented executable, C11Tester dynamic library (scheduling, C/C++ memory model; threads API, atomics API, race API), error reports; rerun until the execution count is hit.]

Figure 1: C11Tester system overview

The C/C++ memory model defines the modification order relation to totally order all atomic stores to a memory location. This relation captures the notion of cache coherence. The modification order is not directly observable by the program execution: it is only observed indirectly through its effects on program-visible behaviors such as the values returned by loads. Under the C/C++ memory model, modification order cannot be extended to be a total order over all stores that is consistent with the happens-before relation.

This paper presents a new technique for scaling a constraint-based treatment of the modification order relation to long executions. This technique allows C11Tester to support a larger fragment of the C/C++ memory model than previous race detectors. In particular, this technique can handle the full range of modification orders that are permitted by the C/C++ memory model.

Constraint-based modification order delays decisions about the modification order until the decisions have observable effects on the program's behavior. For example, when an algorithm decides which store a load will read from, C11Tester adds the corresponding constraints to the modification order. This approach allows testing algorithms to focus on program-visible behaviors such as the value a load reads and does not require them to eagerly decide the modification order.

Fibers provide a more efficient means to control thread schedules than kernel threads. However, C/C++ programs commonly make use of thread local storage (TLS), and fibers do not directly support TLS. This paper presents a new technique, thread context borrowing, that allows fiber-based scheduling to support thread local storage without incurring dependencies on TLS implementation details that can vary across different library versions.

arXiv [cs.PL]. Conference'17, July 2017, Washington, DC, USA. Weiyu Luo and Brian Demsky.

Prior work on data race detectors for C/C++11, such as tsan11 [40] and tsan11rec [41], requires hb ∪ rf ∪ mo ∪ sc to be acyclic and thus misses potentially bug-revealing executions that both are allowed by the C/C++ memory model and can be produced by mainstream hardware, including ARM processors. We have found examples of bugs that C11Tester can detect but tsan11 and tsan11rec miss because, in those tools, the set of hb ∪ rf edges orders writes in the modification order. C11Tester's constraint-based approach to modification order supports a larger fragment of the C/C++ memory model than tsan11 and tsan11rec. C11Tester adds minor constraints to the C/C++ memory model to forbid out-of-thin-air (OOTA) executions for relaxed atomics. Furthermore, these constraints appear to incur minimal overheads on existing ARM processors [49], while x86 and PowerPC processors already implement these constraints.

This paper makes the following contributions:

• Scalable Concurrency Testing Tool: It presents a tool for the C/C++ memory model that can test full programs.

• Supports a Larger Fragment of the C/C++ Memory Model: It presents a tool that supports a larger fragment of the C/C++ memory model than previous tools.

• Constraint-Based Modification Order: The modification order relation is not directly visible to the application; instead, it constrains the behaviors of visible relations such as the reads-from relation. Eagerly selecting the modification order limits the choices of stores that a load can read from and thus limits the information available to algorithms. We develop a scalable constraint-based approach to modeling the modification order relation that allows algorithms to ignore the modification order relation and focus on program-visible behaviors.

• Support for Limiting Memory Usage: The size of the C/C++ execution graph and execution trace grows as the program executes and thus limits the length of executions that a testing tool can support. Naively freeing portions of the graph can cause a tool to produce executions that are forbidden by the memory model. We present techniques that can limit the memory usage of C11Tester while ensuring that C11Tester only produces executions that are allowed by the C/C++ memory model.

• Fiber-based Support for Thread Local Storage: Fibers are the most efficient way to control the scheduling of the application under test, but supporting thread local storage with fibers is problematic. We develop a novel approach for borrowing the context of a kernel thread to support thread local storage.

• Evaluation: We evaluate C11Tester on several applications and compare against both tsan11 and tsan11rec. We show that C11Tester can find bugs that tsan11 and tsan11rec miss. We present a performance comparison with both tsan11 and tsan11rec.

In this section, we present general background on the C/C++ memory model and then discuss the fragment of the C/C++ memory model that C11Tester supports. The C and C++ standards were extended in 2011 to include a weak memory model that provides precise guarantees about the behavior of both the compiler and the underlying processor. The standards divide memory locations into two types: normal types, which are accessed using normal memory primitives; and atomic types, which are accessed using atomic memory primitives. The standards forbid data races on normal memory types and allow arbitrary accesses to atomic memory types. Accesses to atomic memory types have an optional memory_order argument that explicitly specifies the ordering constraints. Any operation on an atomic object will have one of six memory orders, each of which falls into one or more of the following categories. To our knowledge, like all other tools, compilers, and formalization work for the C/C++ memory model, C11Tester does not support the consume memory order, and thus we omit consume from our presentation.

• seq-cst: memory_order_seq_cst. The strongest memory ordering; there exists a total order of all operations with this memory ordering. Loads that are seq_cst either read from the last store in the seq_cst order or from some store that is not part of the seq_cst total order.

• release: memory_order_release, memory_order_acq_rel, and memory_order_seq_cst. When a load-acquire reads from a store-release, it establishes a happens-before relation between the store and the load. Release sequences generalize this notion to allow intervening RMW operations to not break synchronization.

• acquire: memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst. May form release/acquire synchronization.

• relaxed: memory_order_relaxed. The weakest memory ordering. The only constraints for relaxed memory operations are a per-location total order, the modification order, that is equivalent to cache coherence.

The C/C++ memory model expresses program behavior in the form of binary relations or orderings. We briefly summarize the relations:

• Sequenced-Before: The evaluation order within a program establishes an intra-thread sequenced-before (sb) relation, a strict preorder of the atomic operations over the execution of a single thread.

• Reads-From: The reads-from (rf) relation consists of store-load pairs $(X, Y)$ such that $Y$ takes its value from $X$. In the C/C++ memory model, this relation is non-trivial, as a given load operation may read from one of many potential stores in the execution.

• Synchronizes-With: The synchronizes-with (sw) relation captures the synchronization that occurs when certain atomic operations interact across threads.

• Happens-Before: In the absence of memory operations with the consume memory ordering, the happens-before (hb) relation is the transitive closure of the union of the sequenced-before and the synchronizes-with relations.

• Sequentially Consistent: All operations that declare the memory_order_seq_cst memory order have a total ordering (sc) in the program execution.

• Modification Order: Each atomic object in a program has an associated modification order (mo), a total order of all stores to that object, which informally represents an ordering in which those stores may be observed by the rest of the program.

To explore some of the key concepts of the memory-ordering operations provided by the C/C++ memory model, consider the example in Figure 2, assuming that two independent threads execute the methods threadA() and threadB(). This example uses the C++ syntax for atomics; shared, concurrently-accessed variables are given an atomic type, whose loads and stores are marked with an explicit memory_order governing their inter-thread ordering and visibility properties. In the example, the memory operations are specified to have the relaxed memory ordering, which is the weakest ordering in the C/C++ memory model and allows memory operations to different locations to be reordered.

In this example, a few simple interleavings of threadA() and threadB() show that we may see executions in which $\{r1 = r2 = 0\}$, $\{r1 = r2 = 1\}$, or $\{r1 = 0 \wedge r2 = 1\}$, but it is somewhat counter-intuitive that we may also see $\{r1 = 1 \wedge r2 = 0\}$, in which the first load statement sees the second store but the second load statement does not see the first store. While this latter behavior cannot occur under a sequentially-consistent execution of this program, it is, in fact, allowed by the relaxed memory ordering used in the example (and achieved by compiler or processor reorderings).

Now, consider a modification of the same example, where the store and load on variable y (Line 5 and Line 8) now use memory_order_release and memory_order_acquire, respectively, so that when the load-acquire reads from the store-release, they form a release/acquire synchronization pair. Then in any execution where r1 = 1, and thus the load-acquire statement (Line 8) reads from the store-release statement (Line 5), the synchronization between the store-release and the load-acquire forms an ordering between threadB() and threadA(); in particular, the actions in threadB() after the acquire must observe the effects of the actions in threadA() before the release. In the terminology of the C/C++ memory model, we say that all actions in threadA() sequenced before the release happen before all actions in threadB() sequenced after the acquire.

So when r1 = 1, threadB() must see r2 = 1. In summary, this modified example allows only three of the four previously-described behaviors: $\{r1 = r2 = 0\}$, $\{r1 = r2 = 1\}$, and $\{r1 = 0 \wedge r2 = 1\}$.

    atomic<int> x(0), y(0);

    void threadA() {
      x.store(1, memory_order_relaxed);
      y.store(1, memory_order_relaxed);
    }
    void threadB() {
      int r1 = y.load(memory_order_relaxed);
      int r2 = x.load(memory_order_relaxed);
      printf("r1 = %d\n", r1);
      printf("r2 = %d\n", r2);
    }

Figure 2: A Variant of Message Passing in C++
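The release/acquire variant discussed above can be written as a small standalone program. The following is a minimal sketch, not code from the paper: the `Observation` struct and both function names are hypothetical, and it uses plain `std::thread` rather than C11Tester's fiber-based scheduler. Running it natively only samples whatever interleavings the host machine happens to produce, which is exactly the limitation that motivates a controlled-scheduling tool.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper type: the values observed by threadB in one run.
struct Observation { int r1; int r2; };

// Runs the release/acquire variant of Figure 2 once.
Observation run_message_passing_once() {
    std::atomic<int> x(0), y(0);
    Observation obs{0, 0};

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_release);       // store-release
    });
    std::thread b([&] {
        obs.r1 = y.load(std::memory_order_acquire);  // load-acquire
        obs.r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();  // joining makes the writes to obs visible to the caller
    return obs;
}

// Returns true if the forbidden outcome r1 = 1 and r2 = 0 was never observed.
bool forbidden_outcome_absent(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        Observation o = run_message_passing_once();
        if (o.r1 == 1 && o.r2 == 0) return false;
    }
    return true;
}
```

With both accesses to y changed back to `memory_order_relaxed`, the memory model would permit the fourth outcome, although a given machine and compiler may never exhibit it in practice.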

We next describe the fragment of the C/C++ memory model that C11Tester supports. Our memory model has the following changes based on the formalization of Batty et al. [10]:

1) Use the C/C++20 release sequence definition: Since the original C/C++11 memory model, the definition of release sequences has been weakened [17]. This change is part of the C/C++20 standard [1]. C11Tester uses the newly weakened definition. The new definition of release sequences does not allow memory_order_relaxed stores by the thread that originally performed the memory_order_release store that heads the release sequence to appear in the release sequence.

2) Add hb ∪ sc ∪ rf is acyclic: Supporting load buffering or out-of-thin-air executions is extremely difficult, and the existing approaches introduce high overheads in dynamic tools [20, 47, 48]. Thus, we prohibit out-of-thin-air executions with an assumption similar to that made by much work on the C/C++ memory model: we add the constraint that the union of the happens-before, sequential consistency, and reads-from relations, i.e., hb ∪ sc ∪ rf, is acyclic [59]. This feature of the C/C++ memory model is known to be generally problematic, and similar solutions have been proposed to fix the C/C++ memory model [13, 15, 16, 49].

3) Strengthen consume atomics to acquire: No compilers support the consume access mode. Instead, all compilers strengthen consume atomics to acquire.

We formalize the above changes in Section A.1 of the Appendix. Our fragment of the C/C++ memory model is larger than that of tsan11 and tsan11rec [40, 41]. The tsan11 and tsan11rec tools add a very strong restriction to the C/C++ memory model that requires that hb ∪ sc ∪ rf ∪ mo be acyclic.

Footnote: The C/C++11 memory model already requires that hb ∪ sc is acyclic.

We present our algorithm in this section. In our presentation, we adapt some terminology and symbols from stateless model checking [29]. We denote the initial state with $s_0$. We associate every state transition $t$ taken by thread $p$ with the dynamic operation that affected the transition. We use $enabled(s)$ to denote the set of all threads that are enabled in state $s$ (threads can be disabled when waiting on a mutex or condition variable, or when completed). We say that $next(s, p)$ is the next transition in thread $p$ at state $s$.

    procedure Explore
        s := s_0
        while enabled(s) is not empty do
            Select p from enabled(s)
            t := next(s, p)
            behaviors(t) := {Initial behaviors}
            Select a behavior b from behaviors(t)
            s := Execute(s, t, b)
        end while
    end procedure

Figure 3: Pseudocode for C11Tester's Algorithm

    atomic<int> x(0);

    void threadA() {
      x.store(1, memory_order_relaxed);
      x.store(2, memory_order_relaxed);
    }
    void threadB() {
      int r1 = x.load(memory_order_relaxed);
    }

Figure 4: Bias of a Purely Randomized Algorithm

Figure 3 presents pseudocode for C11Tester's exploration algorithm. C11Tester calls Explore multiple times, and each call generates one program execution. Recall from Section 2 that the thread schedule does not uniquely define the behavior of C/C++ atomics. Therefore, we split the exploration into two components: (1) selecting the next thread to execute and (2) selecting the behavior of that thread's next operation. C11Tester has a pluggable framework for testing algorithms: C11Tester generates a set of legal choices for the next thread and behavior, and then the plugin selects the next thread and behavior. The default plugin implements a random strategy.
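The split between "which thread runs next" and "which behavior its next operation takes" can be illustrated with a toy version of the Explore loop. This is a sketch under strong simplifying assumptions: the `Strategy` interface, `RandomStrategy`, and `explore` are hypothetical names (not C11Tester's actual plugin API), a thread is modeled only by its count of remaining visible operations, and every transition has a single placeholder behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical plugin interface: the framework offers n legal choices and
// the plugin picks one of them.
struct Strategy {
    virtual ~Strategy() = default;
    virtual std::size_t choose(std::size_t n) = 0;  // n > 0, result in [0, n)
};

// Default strategy in this sketch: uniform random choice.
struct RandomStrategy : Strategy {
    std::mt19937 rng;
    explicit RandomStrategy(unsigned seed) : rng(seed) {}
    std::size_t choose(std::size_t n) override {
        return std::uniform_int_distribution<std::size_t>(0, n - 1)(rng);
    }
};

// remaining[p] = number of visible operations thread p still has to execute.
// Returns the schedule (sequence of thread ids) of one explored execution.
std::vector<std::size_t> explore(std::vector<int> remaining, Strategy& plugin) {
    std::vector<std::size_t> schedule;
    for (;;) {
        std::vector<std::size_t> enabled;  // enabled(s)
        for (std::size_t p = 0; p < remaining.size(); ++p)
            if (remaining[p] > 0) enabled.push_back(p);
        if (enabled.empty()) break;

        // Component (1): select the next thread.
        std::size_t p = enabled[plugin.choose(enabled.size())];

        // Component (2): select a behavior of that thread's next operation.
        // In a real tool this would be, e.g., the store a load reads from.
        std::vector<int> behaviors{0};
        int b = behaviors[plugin.choose(behaviors.size())];
        (void)b;

        --remaining[p];  // stands in for s := Execute(s, t, b)
        schedule.push_back(p);
    }
    return schedule;
}
```

Calling `explore` repeatedly with differently seeded strategies corresponds to rerunning the program to obtain different executions.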

Scheduling.

Thread scheduling decisions are made at each atomic operation, threading operation, or synchronization operation (such as locking a mutex). Every time a thread finishes a visible operation, the next thread to execute is randomly selected from the set of enabled threads. However, when a thread performs several consecutive stores with memory order release or relaxed, the scheduler executes these stores consecutively without interruption from other threads. Executing these stores consecutively does not limit the set of possible executions and provides C11Tester with more stores to select from when deciding which store a load should read from. This decision also reduces bias in comparison to a purely randomized algorithm.

For example, in Figure 4, under a purely randomized algorithm, the probability that r1 = 1 is much greater than that of r1 = 2, because in order for r1 = 2, the scheduler must schedule threadA() twice before threadB() is scheduled. However, under C11Tester's strategy, once threadA is scheduled to run, both stores at line 4 and line 5 will be performed consecutively. So when the load is encountered, the may-read-from set (defined in the paragraphs below) either contains only the initial store at line 1 or contains all three stores. Thus, r1 is equally likely to read 1 or 2.

Transition Behaviors.

The source of multiple behaviors for a given schedule arises from the reads-from relation: in C/C++, loads can read from stores besides just the "last" store to an atomic object. We use the concept of a may-read-from set, which is an overapproximation of the stores that a given atomic load may read from that just considers constraints from the happens-before relation. The may-read-from set for a load $Y$ is constructed as:

$$\textit{may-read-from}(Y) = \{ X \in \textit{stores}(Y) \mid \neg (Y \xrightarrow{hb} X) \wedge (\nexists Z \in \textit{stores}(Y).\ X \xrightarrow{hb} Z \xrightarrow{hb} Y) \},$$

where $\textit{stores}(Y)$ denotes the set of all stores to the same object from which $Y$ reads. C11Tester selects a store from the may-read-from set. C11Tester then checks that establishing this rf relation does not violate constraints imposed by the modification order, as described in Section 4. If the given selection is not allowed, C11Tester repeats the selection process. C11Tester delays the modification order check until after a selection is made to optimize for performance.

In this section, we present how C11Tester efficiently supports key aspects of the C/C++ memory model.

CDSChecker [47] initially introduced the technique of using a constraint-based treatment of modification order to remove redundancy from the search space it explores. There are essentially two types of constraints on the modification order: (1) that a store $s_A$ is modification ordered before a store $s_B$ and (2) that a store $s_A$ immediately precedes an RMW $r_B$ in the modification order.

CDSChecker models these constraints using a modification order graph. Two types of edges correspond to these two types of constraints. Edges only exist between two nodes if they both represent memory accesses to the same location. There is a cycle in the modification order graph if and only if the graph corresponds to an unsatisfiable set of constraints. Otherwise, a topological sort of the graph (with the additional constraint that an RMW node immediately follows the store that it reads from) yields a modification order that is consistent with the observed program behavior. CDSChecker used depth first search to check for cycles in the graph. CDSChecker would add edges to the modification order graph to determine whether a given reads-from edge was plausible; if the edge made the set of constraints unsatisfiable, CDSChecker would roll back the changes that the edge made to the graph.

This approach works well for model checking where the graphs are small: the fundamental scalability limits of model checking ensure that the executions always contain a very small number of stores. This approach is infeasible when executions (and thus the modification order graphs) can contain millions of atomic stores, because the graph traversals become extremely expensive.
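The may-read-from definition given earlier in this section translates almost directly into code. The sketch below is illustrative only: it assumes events are numbered 0..n-1 and that happens-before is supplied as a transitively closed boolean matrix (`HbMatrix`, a hypothetical representation; a real tool would more likely use clock vectors for hb).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hb[i][j] == true means event i happens before event j.
// Assumption of this sketch: the matrix is already transitively closed.
using HbMatrix = std::vector<std::vector<bool>>;

// stores: indices of all stores to the object that load y reads.
// Returns the stores that y may read from, considering only hb constraints.
std::vector<std::size_t> may_read_from(const HbMatrix& hb,
                                       const std::vector<std::size_t>& stores,
                                       std::size_t y) {
    std::vector<std::size_t> result;
    for (std::size_t x : stores) {
        if (hb[y][x]) continue;  // violates the condition not (Y hb-> X)
        bool hidden = false;
        for (std::size_t z : stores) {
            // Some other store Z with X hb-> Z hb-> Y hides X from Y.
            if (z != x && hb[x][z] && hb[z][y]) {
                hidden = true;
                break;
            }
        }
        if (!hidden) result.push_back(x);
    }
    return result;
}
```

For the program in Figure 4, with the initial store as event 0, the two stores of threadA as events 1 and 2, and the load as event 3, the function returns all three stores when threadA has run but nothing synchronizes it with threadB. If the load were instead ordered after both stores by happens-before, only the last store would remain.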

We next describe the modification order graph in more detail. We represent modification order (mo) as a set of constraints, built as a constraint graph, namely the modification order graph (mo-graph). A node in the mo-graph represents a single store or RMW in the execution. There are two types of edges in the graph. An mo edge from node $A$ to node $B$ represents the constraint $A \xrightarrow{mo} B$. An rmw edge from node $A$ to node $B$ represents the constraint that $A$ must immediately precede $B$, or formally that: $A \xrightarrow{mo} B$ and $\forall C.\ C \ne A \wedge C \ne B \Rightarrow (A \xrightarrow{mo} C \Rightarrow B \xrightarrow{mo} C) \wedge (C \xrightarrow{mo} B \Rightarrow C \xrightarrow{mo} A)$.

C11Tester must only ensure that there exists some mo that satisfies the set of constraints, or equivalently an acyclic mo-graph. C11Tester dynamically adds edges to the mo-graph when new rf and hb relations are formed. We briefly summarize the properties of mo as implications [47] in Figure 5. C11Tester maintains a per-thread list of atomic memory accesses to each memory location. Whenever a new atomic load or store is executed, C11Tester uses this list to evaluate the implications in Figure 5 as well as additional implications for fences.

• Read-Read Coherence: X: v.store(1) $\xrightarrow{rf}$ A: v.load(), Y: v.store(2) $\xrightarrow{rf}$ B: v.load(), and $A \xrightarrow{hb} B$ $\Longrightarrow$ $X \xrightarrow{mo} Y$.

• Write-Read Coherence: X: v.store(2) $\xrightarrow{rf}$ B: v.load() and A: v.store(1) $\xrightarrow{hb}$ B $\Longrightarrow$ $A \xrightarrow{mo} X$.

• Read-Write Coherence: X: v.store(1) $\xrightarrow{rf}$ A: v.load() and A $\xrightarrow{hb}$ B: v.store(2) $\Longrightarrow$ $X \xrightarrow{mo} B$.

• Write-Write Coherence: A: v.store(1) $\xrightarrow{hb}$ B: v.store(2) $\Longrightarrow$ $A \xrightarrow{mo} B$.

• Seq-cst / MO Consistency: A: v.store(1) $\xrightarrow{sc}$ B: v.store(2) $\Longrightarrow$ $A \xrightarrow{mo} B$.

• Seq-cst Write-Read Coherence: X: v.store(2) $\xrightarrow{rf}$ B: v.load(seq_cst) and A: v.store(1, seq_cst) $\xrightarrow{sc}$ B $\Longrightarrow$ $A \xrightarrow{mo} X$.

• RMW / MO Consistency: A: v.store(1) $\xrightarrow{rf}$ B: v.rmw() $\Longrightarrow$ $A \xrightarrow{mo} B$.

• RMW Atomicity: A: v.store(1) $\xrightarrow{rf}$ B: v.rmw() and A $\xrightarrow{mo}$ C: v.store(2) $\Longrightarrow$ $B \xrightarrow{mo} C$.

Figure 5: Modification order implications. On the left side of each implication, $A$, $B$, $C$, $X$, and $Y$ must be distinct.

Due to the high cost of graph traversals for large graphs, graph traversals are not a feasible implementation approach for C11Tester. We next describe how we adapt clock vectors [39] to efficiently compute reachability in the mo-graph and scale the constraint-based modification order approach to large executions. We associate a clock vector with each node in the mo-graph. It is important to note that our use of clock vectors in the mo-graph is not to track the happens-before relation. Instead, we use clock vectors to efficiently compute reachability between nodes in the mo-graph. Thus, our mo-graph clock vectors model a partial order that contains the current set of ordering constraints on the modification order.

Each event $E$ in C11Tester has a unique sequence number $s_E$. Sequence numbers are a global counter of events across all threads, which is incremented by one at each event. We denote the thread that executed $E$ as $t_E$. Each node in the mo-graph represents an atomic store. The initial mo-graph clock vector $\bot_{CV_A}$ associated with the node representing an atomic store $A$, the union operator $\cup$, and the comparison operator $\le$ for mo-graph clock vectors are defined as follows:

$$\bot_{CV_A} = \lambda t.\ \text{if } t == t_A \text{ then } s_A \text{ else } 0,$$
$$CV_1 \cup CV_2 \triangleq \lambda t.\ \max(CV_1(t), CV_2(t)),$$
$$CV_1 \le CV_2 \triangleq \forall t.\ CV_1(t) \le CV_2(t).$$

Note that two mo-graph clock vectors can only be compared if their associated nodes represent atomic stores to the same memory location.
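The three definitions above map onto a small data type. The following is a sketch only; the `ClockVector` struct and its member names are hypothetical, and a dense `std::vector` indexed by thread id is just one possible representation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ClockVector {
    std::vector<unsigned> slots;  // slots[t] is the entry for thread t

    // Initial vector for a store A: s_A in slot t_A, 0 everywhere else.
    static ClockVector bottom(std::size_t num_threads, std::size_t tid,
                              unsigned seq) {
        assert(tid < num_threads);
        ClockVector cv;
        cv.slots.assign(num_threads, 0);
        cv.slots[tid] = seq;
        return cv;
    }

    // Slots beyond the stored length are treated as 0.
    unsigned get(std::size_t t) const {
        return t < slots.size() ? slots[t] : 0;
    }

    // Union operator: pointwise maximum, stored in place.
    void unite(const ClockVector& other) {
        if (other.slots.size() > slots.size())
            slots.resize(other.slots.size(), 0);
        for (std::size_t t = 0; t < other.slots.size(); ++t)
            slots[t] = std::max(slots[t], other.slots[t]);
    }

    // Comparison operator: this <= other iff every slot is <=.
    bool leq(const ClockVector& other) const {
        for (std::size_t t = 0; t < slots.size(); ++t)
            if (slots[t] > other.get(t)) return false;
        return true;
    }
};
```

Two freshly created vectors for stores by different threads are incomparable under `leq`; after one is united into the other, the first compares less than or equal to the result.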

The mo-graph clock vectors are updated when new mo relations are formed. For example, if $A \xrightarrow{mo} B$ is a newly formed mo relation, then the node $B$'s mo-graph clock vector is merged with that of node $A$, i.e., $CV_B := CV_A \cup CV_B$. If $CV_B$ is updated by this merge, the change in $CV_B$ must be propagated to all nodes reachable from $B$ using the union operator.

Figure 6 presents pseudocode for updating the modification order graph. The Merge procedure merges the mo-graph clock vector of the src node into the dst node and returns true if the dst mo-graph clock vector changed. The AddEdge procedure adds a new modification order edge to the graph. It first compares mo-graph clock vectors to check if the edge is redundant and, if so, drops the edge update. Recall that RMW operations are ordered immediately after the stores that they read from. To implement this, AddEdge checks to see if the from node has an rmw edge, and if so, follows the rmw edge. AddEdge finally adds the relevant edge, and then propagates any changes in the mo-graph clock vectors. The AddRMWEdge procedure has two parameters, where the rmw node reads from the from node. It first adds an rmw edge and then migrates any outgoing edges from the source of the edge to the rmw node. Finally, it calls the AddEdge procedure to add a normal modification order edge and to propagate mo-graph clock vector changes.

Figure 7 presents pseudocode for the helper method AddEdges that adds a set of edges to the mo-graph. The parameter set is a set of atomic stores or RMWs, and $S$ is an atomic store or RMW. The GetNode method converts an atomic action to the corresponding node in the mo-graph. If such a node does not exist yet, then the method will create a new node in the mo-graph.

Footnote: Events in each thread consist of atomic operations, thread creation and join, mutex lock and unlock, and other synchronization operations.
    procedure Merge(Node dst, Node src)
        if src.cv ≤ dst.cv then
            return false
        end if
        dst.cv := dst.cv ∪ src.cv
        return true
    end procedure

    procedure AddEdge(Node from, Node to)
        mustAddEdge := (from.rmw == to ∨ from.tid == to.tid)
        if from.cv ≤ to.cv ∧ ¬mustAddEdge then
            return
        end if
        while from.rmw ≠ null do
            next := from.rmw
            if next == to then
                break
            end if
            from := next
        end while
        from.edges := from.edges ∪ to
        if Merge(to, from) then
            Q := {to}
            while Q is not empty do
                node := remove item from Q
                for each dst in node.edges do
                    if Merge(dst, node) then
                        Q := Q ∪ dst
                    end if
                end for
            end while
        end if
    end procedure

    procedure AddRMWEdge(Node from, Node rmw)
        from.rmw := rmw
        for each dst in from.edges do
            if dst ≠ rmw then
                rmw.edges := rmw.edges ∪ dst
            end if
        end for
        from.edges := ∅
        AddEdge(from, rmw)
    end procedure

Figure 6: Pseudocode for Updating mo-graph

    procedure AddEdges(set, S)
        n_S := GetNode(S)
        for each e in set do
            n_e := GetNode(e)
            AddEdge(n_e, n_S)
        end for
    end procedure

Figure 7: Helper method for adding a set of edges to the mo-graph
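The procedures in Figure 6 can be transcribed into C++ fairly directly. The sketch below is one possible rendering, not C11Tester's source: the `Node` layout, the fixed thread count passed to the constructor, the non-owning raw pointers, and the `add_unique` helper (which stands in for the set union on edges) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Node {
    std::size_t tid;            // thread that executed the store
    std::vector<unsigned> cv;   // mo-graph clock vector
    std::vector<Node*> edges;   // outgoing mo edges (non-owning)
    Node* rmw = nullptr;        // outgoing rmw edge, if any

    Node(std::size_t num_threads, std::size_t tid_, unsigned seq)
        : tid(tid_), cv(num_threads, 0) { cv[tid_] = seq; }
};

// Pointwise comparison; both vectors are assumed to have the same length.
bool leq(const std::vector<unsigned>& a, const std::vector<unsigned>& b) {
    for (std::size_t t = 0; t < a.size(); ++t)
        if (a[t] > b[t]) return false;
    return true;
}

// Merge src's clock vector into dst; returns true if dst changed.
bool merge(Node& dst, const Node& src) {
    if (leq(src.cv, dst.cv)) return false;
    for (std::size_t t = 0; t < dst.cv.size(); ++t)
        dst.cv[t] = std::max(dst.cv[t], src.cv[t]);
    return true;
}

// Set-style insertion so that an edge is stored at most once.
static void add_unique(std::vector<Node*>& v, Node* n) {
    if (std::find(v.begin(), v.end(), n) == v.end()) v.push_back(n);
}

void add_edge(Node* from, Node* to) {
    bool must_add = (from->rmw == to) || (from->tid == to->tid);
    if (leq(from->cv, to->cv) && !must_add) return;  // redundant edge
    while (from->rmw != nullptr) {                   // follow rmw edges
        Node* next = from->rmw;
        if (next == to) break;
        from = next;
    }
    add_unique(from->edges, to);
    if (merge(*to, *from)) {                         // propagate changes
        std::deque<Node*> q{to};
        while (!q.empty()) {
            Node* node = q.front();
            q.pop_front();
            for (Node* dst : node->edges)
                if (merge(*dst, *node)) q.push_back(dst);
        }
    }
}

void add_rmw_edge(Node* from, Node* rmw) {
    from->rmw = rmw;
    for (Node* dst : from->edges)                    // migrate outgoing edges
        if (dst != rmw) add_unique(rmw->edges, dst);
    from->edges.clear();
    add_edge(from, rmw);
}
```

As a usage example, adding edges B to C and then A to B for three stores by different threads leaves A's clock vector less than or equal to C's, so the later edge A to C would be detected as redundant without any graph traversal.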

Theorem 1 guarantees the soundness of our use of mo-graph clock vectors. We present the theorem and its proof in Section 5. This theorem states that we can solely rely on mo-graph clock vectors to compute reachability between nodes in the mo-graph.

Mo-graph

Prior work on constraint-based modification order utilized rollback when it was determined that a given reads-from relation was not feasible [47, 48]. C11Tester may also hit such infeasible executions because the may-read-from set defined in Section 3 is an overapproximation of the set of stores that a load can read from. To determine precisely whether a load can read from a store, a naive approach is to add edges to the mo-graph and then utilize rollback if adding these edges introduces cycles in the mo-graph. However, the addition of clock vectors and clock vector propagation makes rollback much more expensive. It is thus critical that C11Tester avoids the need for rollback. We now discuss how C11Tester avoids rollback.

The mo-graph is updated whenever a new atomic store, atomic load, or atomic RMW is encountered. Processing a new atomic store, atomic load, or atomic RMW can potentially add multiple edges to the mo-graph. We next analyze each case to understand how to avoid rollback:

• Atomic Store: Since an atomic load can only read from past stores, a newly created store node in the mo-graph has no outgoing edges. By the properties of mo, only incoming edges from other nodes to this new node will be created. Hence, a new store node cannot introduce any cycles.

• Atomic Load: Consider a new atomic load $Y$ that reads from a store $X$. Forming a new rf relation may only cause edges to be created from other nodes to the node representing the store $X$. We denote this set of "other nodes" as $ReadPriorSet(X)$ and compute it using the ReadPriorSet procedure in Figure 13. Lines 6, 7, and 8 in the ReadPriorSet procedure consider statements 5, 4, and 6 in Section 29.3 of the C++11 standard. Line 9 in the procedure considers write-read and read-read coherences. Therefore, the set returned by the ReadPriorSet procedure captures the set of stores from where new mo relations are to be formed if the rf relation is established.
Before forming the rf relation, C11Tester checks whether any node in $ReadPriorSet(X)$ is reachable from $X$. If so, then having load $Y$ read from store $X$ will introduce a cycle in the mo-graph, so we discard $X$ and try another store. While it is possible for a cycle to contain two or more edges in the set of newly created edges, this also implies that there is a cycle with one edge (since all edges have the same destination).

• Atomic RMWs: An atomic RMW is similar to both a load and a store, but with the constraint that it must be immediately modification ordered after the store it reads from. We implement this by moving modification order edges from the store it reads from to the RMW. Thus, the same checks used by the load suffice to check for cycles for atomic RMWs.

Thus, C11Tester first computes a set of edges that reading from a given store would add to the mo-graph. Then for each edge, it checks the mo-graph clock vectors to see if the destination of the edge can reach the source of the edge. If none of the edges would create a cycle, it adds all of the edges to the mo-graph using the AddEdge and AddRMWEdge procedures.
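The check-before-commit step described above can be sketched with clock vectors alone. This sketch assumes that a node X can reach a node P in the mo-graph exactly when X's clock vector is less than or equal to P's, which is how the redundancy test in AddEdge reads; the `CV` alias and the function `can_read_from` are hypothetical names, and computing ReadPriorSet itself is outside the scope of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using CV = std::vector<unsigned>;  // one slot per thread, equal lengths assumed

bool leq(const CV& a, const CV& b) {
    for (std::size_t t = 0; t < a.size(); ++t)
        if (a[t] > b[t]) return false;
    return true;
}

// x_cv: clock vector of the candidate store X.
// read_prior_set: clock vectors of the nodes in ReadPriorSet(X), i.e. the
// sources of the edges P -> X that reading from X would add.
// Returns true if none of those edges would close a cycle.
bool can_read_from(const CV& x_cv, const std::vector<CV>& read_prior_set) {
    for (const CV& p_cv : read_prior_set) {
        // X already reaches P, so adding P -> X would create a cycle.
        if (leq(x_cv, p_cv)) return false;
    }
    return true;
}
```

Because nothing is added to the graph until every edge has passed this test, a rejected candidate store leaves the mo-graph untouched and no rollback is needed.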


Conference’17, July 2017, Washington, DC, USA

MO-GRAPH

To prove the correctness of mo-graphs, we first prove three lemmas and then prove Theorem 1. Lemma 1 and Lemma 2 characterize some important properties of mo-graph clock vectors. Lemma 3 proves one direction in Theorem 1. Mo-graph clock vectors are simply referred to as clock vectors in the following context.

Lemma 1.

Let $C_0 \xrightarrow{mo} C_1 \xrightarrow{mo} \dots \xrightarrow{mo} C_n$ be a path in a modification order graph $G$, such that $CV_{C_0} \le \dots \le CV_{C_n}$. Then if any new edge $E$ is added to $G$ using procedures in Figure 6, it holds that

$$CV'_{C_0} \le \dots \le CV'_{C_n} \tag{5.1}$$

for the updated clock vectors. We define $CV'_{C_i} := CV_{C_i}$ if the values of $CV_{C_i}$ are not actually updated.

Proof. To simplify notation, we define $CV_i := CV_{C_i}$ for all $i \in \{0, \dots, n\}$. Let's first consider the case where no rmw edge is added, i.e., the AddRMWEdge procedure is not called.

By the definition of the union operator, each slot in clock vectors is monotonically increasing when the Merge procedure is called. By the structure of procedure AddEdge's algorithm, a node $X$ is added to $Q$ if and only if this node's clock vector is updated by the Merge procedure.

Let's assume that adding the new edge $E$ updates any of $CV_0, \dots, CV_n$. Otherwise, it is trivial. Let $i$ be the smallest integer in $\{0, \dots, n\}$ such that $CV_i$ is updated. Then $CV'_k = CV_k$ for all $k \in I := \{0, \dots, i-1\}$, and we have

$$CV'_0 \le \dots \le CV'_i. \tag{5.2}$$

If $i = 0$, then we take $I = \emptyset$. There are two cases.

Case 1: Suppose $CV'_i \le CV_j$ for some $j \in \{i+1, \dots, n\}$, and let $j$ be the smallest such integer. Then $CV'_k = CV_k$ for all $k \in \{j, \dots, n\}$, as nodes $\{C_j, \dots, C_n\}$ will not be added to $Q$ in the AddEdge procedure, and it holds trivially that

$$CV'_j \le \dots \le CV'_n. \tag{5.3}$$

By line 14 to line 24 in the AddEdge procedure, we have

$$CV'_k = CV_k \cup CV'_{k-1}, \tag{5.4}$$

for all $k \in S := \{i+1, \dots, j-1\}$. If $j$ happens to be $i+1$, then take $S = \emptyset$. And we have for all $k \in S$, $CV'_{k-1} \le CV'_k$. Then combining with inequality (5.2), we have $CV'_0 \le \dots \le CV'_i \le \dots \le CV'_{j-1}$. Together with inequality (5.3), we only need to show that $CV'_{j-1} \le CV'_j$ to complete the proof.

If $j = i+1$, then we are done, because by assumption $CV'_i \le CV_j = CV'_j$. If $j > i+1$, then $CV'_i \le CV_j$ and $CV_{i+1} \le CV_j$ imply that $CV'_{i+1} = CV_{i+1} \cup CV'_i \le CV_j = CV'_j$. Based on equation (5.4), we can deduce in a similar way that $CV'_{i+1} \le \dots \le CV'_{j-1} \le CV'_j$.

Case 2: Suppose $CV'_i \not\le CV_j$ for all $j \in \{i+1, \dots, n\}$. Then by line 14 to line 24 in the AddEdge procedure, all nodes $\{C_i, \dots, C_n\}$ are added to $Q$ in the AddEdge procedure, and $CV'_k = CV_k \cup CV'_{k-1}$ for all $k \in S := \{i+1, \dots, n\}$. This recursive formula guarantees that for all $k \in S$, $CV'_{k-1} \le CV'_k$. Therefore, combining with inequality (5.2), we have $CV'_0 \le \dots \le CV'_n$.

Now suppose the newly added edge $E$ is an rmw edge. If $E : X \xrightarrow{rmw} C_i$ where $i \in \{0, \dots, n\}$ and $X$ is some node not in the path $P : C_0 \xrightarrow{mo} \dots \xrightarrow{mo} C_n$, then the path $P$ remains unchanged and AddEdge($X$, $C_i$) is called. Then the above proof shows that inequality (5.1) holds. If $E : C_i \xrightarrow{rmw} X$, then $C_i \xrightarrow{mo} C_{i+1}$ is migrated to $X \xrightarrow{mo} C_{i+1}$ by line 3 to line 7 in the AddRMWEdge procedure, and $C_i \xrightarrow{mo} X$ is added.

If $X$ is not in path $P$, then path $P$ becomes $C_0 \xrightarrow{mo} \dots \xrightarrow{mo} C_i \xrightarrow{mo} X \xrightarrow{mo} C_{i+1} \xrightarrow{mo} \dots \xrightarrow{mo} C_n$. Since AddEdge($C_i$, $X$) is called, the same proof as in the case without rmw edges applies. If $X$ is in path $P$, then $X$ can only be $C_{i+1}$ and the path $P$ remains unchanged. Otherwise, a cycle is created and this execution is invalid. In any case, the same proof applies. □

Let $\vec{x} = (x_1, x_2, \dots, x_n)$. We define the projection function $U_i$ that extracts the $i$th position of $\vec{x}$ as $U_i(\vec{x}) = x_i$, where we assume $i \le n$.

Lemma 2. Let $A$ be a store with sequence number $s_A$ performed by thread $i$ in an acyclic modification order graph $G$. Then $U_i(CV_A) = U_i(\bot_{CV_A}) = s_A$ throughout each execution that terminates.

Proof. We will prove by contradiction. Let $S = \{A_1, A_2, \dots\}$ be the sequence of stores performed by thread $i$ with sequence numbers $\{s_1, s_2, \dots\}$, respectively.
Suppose that there is a point of time in a terminating execution such that the first store $A_n$ in the sequence with $U_i(CV_{A_n}) > s_n$ appears. Sequence numbers are strictly increasing and by the Merge procedure, $U_i(CV_{A_n}) \in \{s_{n+1}, s_{n+2}, \dots\}$. Let $U_i(CV_{A_n}) = s_N$ for some $N > n$.

For $U_i(CV_{A_n})$ to increase to $s_N$ from $s_n$, $CV_{A_n}$ must be merged with the clock vector of some node $X$ (i.e., some store $X$) in $G$ such that $U_i(CV_X) = s_N$. Such $X$ is modification ordered before $A_n$.

If $X$ is performed by thread $i$, then $X$ has to be the store $A_N$, because $U_i(CV_{A_j})$ is unique for all stores $A_j$ in the sequence $S$ other than $A_n$. Then $\bot_{CV_X} \ge \bot_{CV_{A_n}}$. By the definition of initial values of clock vectors and sequence numbers, $X$ happens after and is modification ordered after $A_n$. However, $X$ is also modification ordered before $A_n$, and we have a cycle in $G$. This is a contradiction.

If $X$ is not performed by thread $i$, then $U_i(\bot_{CV_X}) = 0$. For $U_i(CV_X)$ to be $s_N$, $X$ must be modification ordered after some store $Y$ in $G$ such that $U_i(CV_Y) = s_N$. If $Y$ is done by thread $i$, then the same argument as in the last paragraph leads to a contradiction; otherwise, by repeating the same argument as in this paragraph finitely many times (there are only a finite number of stores in such a terminating execution), we would eventually deduce that $X$ is modification ordered after some store by thread $i$. Hence, we would have a cycle in $G$, a contradiction. □

Lemma 3.

Let $A$ and $B$ be two nodes that write to the same location in an acyclic modification order graph $G$. If $B$ is reachable from $A$ in $G$, then $CV_A \le CV_B$.

Proof. Suppose that $B$ is reachable from $A$ in $G$. Let $A \xrightarrow{mo} C_1 \xrightarrow{mo} \dots \xrightarrow{mo} C_{n-1} \xrightarrow{mo} B$ be the shortest path $P$ from $A$ to $B$ in graph $G$. To simplify notation, $X \xrightarrow{mo} Y$ is abbreviated as $X \to Y$ in the following. As the AddRMWEdge procedure calls the AddEdge procedure to create an mo edge, we can assume that all the mo edges in $P$ are created by directly calling AddEdge.

Base Case 1: Suppose the path $P$ has length 1, i.e., $A$ immediately precedes $B$. Then when the edge $A \to B$ was formed by calling AddEdge($A$, $B$), $CV_B$ was merged with $CV_A$ in line 14 of the AddEdge procedure. In other words, $CV_B = CV_B \cup CV_A \ge CV_A$.

Base Case 2: Suppose the path $P$ has length 2, i.e., $A \to C \to B$. There are two cases:

(a) If $A \to C$ was formed first, then $CV_A \le CV_C$. When $C \to B$ was formed, $CV_B$ was merged with $CV_C$ and $CV_C \le CV_B$. According to Lemma 1, adding the edge $C \to B$ or any edge not in path $P$ (if any such edges were formed before $C \to B$ was formed) to $G$ would not break the inequality $CV_A \le CV_C$. It follows that $CV_A \le CV_C \le CV_B$.

(b) If $C \to B$ was formed first, then $CV_C \le CV_B$. Based on Lemma 1, this inequality remains true when $A \to C$ was formed. Therefore $CV_A \le CV_C \le CV_B$.

Inductive Step: Suppose that $B$ being reachable from $A$ implies that $CV_A \le CV_B$ for all paths with length $k$ or less, for some $k > 1$. Let $P$ be a path from $A$ to $B$ with length $k+1$:

$$P : A = C_0 \to C_1 \to \dots \to C_k \to C_{k+1} = B.$$

We denote $A$ as $C_0$ and $B$ as $C_{k+1}$ in the following. Let $E : C_i \to C_{i+1}$ be the last edge formed in path $P$, where $i \in \{0, \dots, k\}$. Then before edge $E$ was formed, the inductive hypothesis implies that $CV_{C_0} \le \dots \le CV_{C_i}$ and $CV_{C_{i+1}} \le \dots \le CV_{C_{k+1}}$, because both $C_0 \to \dots \to C_i$ and $C_{i+1} \to \dots \to C_{k+1}$ have length $k$ or less. Lemma 1 guarantees that

$$CV_{C_0} \le \dots \le CV_{C_i}, \qquad CV_{C_{i+1}} \le \dots \le CV_{C_{k+1}}$$

remain true if any edge not in path $P$ was added to $G$, as well as at the moment when $E$ was formed. Therefore when the edge $E$ was formed, we have $CV_{C_i} \le CV_{C_{i+1}}$, and $CV_A = CV_{C_0} \le \dots \le CV_{C_{k+1}} = CV_B$. □

Theorem 1.

Let $A$ and $B$ be two nodes that write to the same location in an acyclic modification order graph $G$ for a terminating execution. Then $CV_A \le CV_B$ iff $B$ is reachable from $A$ in $G$.

Proof. Lemma 3 proves the backward direction, so we only need to prove the forward direction. Suppose that $CV_A \le CV_B$. Let's first consider the situation where the graph $G$ contains no rmw edges.

Case 1: $A$ and $B$ are two stores performed by the same thread with thread id $i$. Then either $A$ happens before $B$ or $B$ happens before $A$. If $A$ happens before $B$, then $A$ precedes $B$ in the modification order because $A$ and $B$ are performed by the same thread. Hence $B$ is reachable from $A$ in $G$. We want to show that the other case is impossible.

If $B$ happens before $A$ and hence precedes $A$ in the modification order, then $A$ is reachable from $B$. By Lemma 3, $A$ being reachable from $B$ implies that $CV_B \le CV_A$. Since $CV_A \le CV_B$ by assumption, we deduce that $CV_A = CV_B$. This is impossible according to Lemma 2, because each store has a unique sequence number and $U_i(CV_A) = s_A \ne s_B = U_i(CV_B)$, implying that $CV_A \ne CV_B$.

Case 2: $A$ and $B$ are two stores done by different threads. Suppose that $A$ is performed by thread $i$. Let $CV_A = (\dots, s_A, \dots)$ and $CV_B = (\dots, t_b, \dots)$ where both $s_A$ and $t_b$ are in the $i$th position. By assumption, we have $0 < s_A \le t_b$.

Since $B$ is not performed by thread $i$, we have $U_i(\bot_{CV_B}) = 0$. For the $i$th slot of $CV_B$ to have grown to $t_b \ge s_A$, $B$ must be modification ordered after $A$ or some store sequenced after $A$. Since modification order is consistent with the sequenced-before relation, it follows that $B$ is reachable from $A$ in graph $G$.

Now, consider the case where rmw edges are present. Adding an rmw edge from a node $S$ to a node $R$ first transfers to $R$ all outgoing mo edges coming from $S$ and then adds a normal mo edge from $S$ to $R$. So, any updates in $CV_S$ are propagated to all nodes that are reachable from $S$. Therefore, the above argument still applies.
□

We present our operational model with respect to the tsan11 [40] core language described by the grammar in Figure 8. A program is a sequence of statements. LocNA and LocA denote disjoint sets of non-atomic and atomic memory locations. A statement can be one of these forms: an if statement, assigning the result of an expression to a non-atomic location, forking a new thread, joining a thread via its thread handle, and atomic statements. The symbol 𝜖 denotes an empty statement. Atomic statements denoted by StmtA include atomic loads, stores, RMWs, and fences. An RMW takes a functor, F, to implement RMW operations, such as atomic_fetch_add. We omit loops for simplicity and leave the details of an expression unspecified. We omit lock and unlock operations because they can be implemented with atomic statements.

Prog  ::= Stmt ; 𝜖
Stmt  ::= Stmt ; Stmt
        | if (LocNA) {Stmt} else {Stmt}
        | LocNA := Expr
        | LocNA = Fork(Prog)
        | Join(LocNA)
        | StmtA
        | 𝜖
StmtA ::= LocNA = Load(LocA, MO)
        | Store(LocNA, LocA, MO)
        | RMW(LocA, MO, F)
        | Fence(MO)
MO    ::= relaxed | release | acquire | acq_rel | seq_cst
Expr  ::= | LocNA | Expr op Expr

Figure 8: Syntax for our core language
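For orientation, the four atomic statement forms of the core language have direct counterparts in C++11. The sketch below is only an illustration of that correspondence; the function name `core_language_demo` and the chosen values are assumptions, and the functor F is taken to be "add 2".

```cpp
#include <atomic>

// Hypothetical single-threaded walk through the four StmtA forms.
int core_language_demo() {
    std::atomic<int> a{0};  // a location in LocA
    int l = 5;              // a location in LocNA

    a.store(l, std::memory_order_release);                // Store(l, a, release)
    l = a.load(std::memory_order_acquire);                // l = Load(a, acquire)
    a.fetch_add(2, std::memory_order_acq_rel);            // RMW(a, acq_rel, F), F adds 2
    std::atomic_thread_fence(std::memory_order_seq_cst);  // Fence(seq_cst)

    // a now holds 7 and l holds 5.
    return a.load(std::memory_order_relaxed) + l;
}
```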

We next discuss the various happens-before clock vectors that C11Tester uses to implement happens-before relations. Figure 9 presents our algorithm for updating clock vectors used to track happens-before relations for atomic loads, stores, RMWs, and fences. The union operator $\cup$ between clock vectors is defined the same way as in Section 4.2.

For each thread $t$, the algorithm maintains the thread's own clock vector $\mathbb{C}_t$, and release- and acquire-fence clock vectors $\mathbb{F}^{rel}_t$ and $\mathbb{F}^{acq}_t$. The algorithm also records a reads-from clock vector $\mathbb{RF}_s$ for each atomic store and RMW. Recall that the sequence number is a global counter of events across all threads, and thus uniquely identifies an event. We use $\mathbb{C}$, $\mathbb{F}^{rel}$, $\mathbb{F}^{acq}$ and $\mathbb{RF}$ to denote these clock vectors across all threads, and atomic stores and RMWs. The rules for atomic loads and RMWs also require the stores or RMWs that are read from to be specified, which are denoted as rf.

States: $Tid \triangleq \mathbb{Z}$, $Seq \triangleq \mathbb{Z}$, $\mathbb{C} : Tid \to CV$, $\mathbb{F}^{rel} : Tid \to CV$, $\mathbb{RF} : Seq \to CV$, $\mathbb{F}^{acq} : Tid \to CV$

[RELEASE STORE]
$$\frac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{C}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{store_{rel}(s,t)} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

[RELAXED STORE]
$$\frac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{F}^{rel}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{store_{rlx}(s,t)} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

[RELEASE RMW]
$$\frac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{C}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{rmw_{rel}(s,t),\, rf(s',t')} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

[RELAXED RMW]
$$\frac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{F}^{rel}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{rmw_{rlx}(s,t),\, rf(s',t')} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

[ACQUIRE LOAD]
$$\frac{\mathbb{C}' = \mathbb{C}[t := \mathbb{C}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{load_{acq}(s,t),\, rf(s',t')} (\mathbb{C}', \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

[RELAXED LOAD]
$$\frac{\mathbb{F}^{acq\prime} = \mathbb{F}^{acq}[t := \mathbb{F}^{acq}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{load_{rlx}(s,t),\, rf(s',t')} (\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq\prime})}$$

[RELEASE FENCE]
$$\frac{\mathbb{F}^{rel\prime} = \mathbb{F}^{rel}[t := \mathbb{C}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{fence_{rel}(t)} (\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel\prime}, \mathbb{F}^{acq})}$$

[ACQUIRE FENCE]
$$\frac{\mathbb{C}' = \mathbb{C}[t := \mathbb{C}_t \cup \mathbb{F}^{acq}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq}) \Rightarrow^{fence_{acq}(t)} (\mathbb{C}', \mathbb{RF}, \mathbb{F}^{rel}, \mathbb{F}^{acq})}$$

Figure 9: Semantics for tracking happens-before clock vectors for atomic loads, stores, RMWs, and fences. An RMW also triggers a load rule initially.

Release Sequences.

The 2011 standard used a complicated definition of release sequences that allowed the possibility of relaxed writes blocking release sequences [40]. The 2020 standard simplifies and weakens the definition of release sequences. In a recently approved draft [1], a store-release heads a release sequence and an RMW is part of the release sequence if and only if it reads from a store or RMW that is part of the release sequence. A load-acquire synchronizes with a store-release $S$ if the load reads from a store or RMW in the release sequence headed by $S$.

We first discuss C11Tester's treatment of release sequences in the absence of fences. C11Tester uses two clock vectors for store/RMW operations: both the current thread clock vector $\mathbb{C}_t$ and a second reads-from clock vector $\mathbb{RF}_S$ that tracks the happens-before relation for all release sequences that the RMW/store $S$ is part of. For a normal store release, these two clock vectors are the same. When a relaxed or release RMW $A$ reads from another store $B$, C11Tester computes the RMW's reads-from clock vector $\mathbb{RF}_A$ as the union of: (1) the store $B$'s reads-from clock vector $\mathbb{RF}_B$ and (2) the RMW $A$'s current thread clock vector $\mathbb{C}_{t_A}$ if $A$ is a release. When a load-acquire $A$ reads from a store-release or RMW, C11Tester computes the load-acquire's new thread clock vector as the union of: (1) the load-acquire's current thread clock vector $\mathbb{C}_{t_A}$ and (2) the store release/RMW's reads-from clock vector.

Fences.

The C/C++ memory model also contains fences. Fences can have one of four different memory orders: acquire, release, acq_rel, and seq_cst. Release fences effectively make later relaxed stores into store-releases, but the happens-before relation is established at the fence-release. C11Tester maintains a release fence clock vector $\mathbb{F}^{rel}_t$ for each thread and uses this clock vector when computing the clock vector for release sequences. Acquire fences effectively make previous relaxed loads into load-acquires, but the happens-before relation starts at the fence. When a relaxed load reads from a release sequence, C11Tester updates the per-thread acquire-fence clock vector $\mathbb{F}^{acq}_t$. When C11Tester processes an acquire fence, it uses $\mathbb{F}^{acq}_t$ to update the thread's clock vector $\mathbb{C}_t$. Seq_cst fences constrain the interactions between sequentially consistent atomics and non-sequentially consistent atomics. The behavior of seq_cst fences can be represented as rules for generating modification order constraints [10]. C11Tester maintains a list of all seq_cst fences for each thread so that C11Tester can quickly locate the relevant fence instructions. It then generates the relevant modification order edges to implement the fence semantics.

Figure 10 formalizes the operational state of a program. The state of the system, State, consists of the list of ThrState, the mapping ALocs from memory locations to atomic information, the mapping NALocs from memory locations to values stored at non-atomic locations, the mapping FenceInfo, and the mo-graph described in Section 4. ALocInfo records the list of atomic loads, stores, and RMWs performed at a given atomic location. FenceInfo records the list of fences performed by each thread. Prog is a program described by the grammar in Figure 8. The initial state of the system has empty mappings ALocs, NALocs, and FenceInfo, only one thread representing the main function, and an empty mo-graph.

Tid       ≜ Z
Epoch     ≜ Z
Val       ≜ Z
Seq       ≜ Z
CV        ≜ Tid → Epoch
ThrState  ≜ (t : Tid) × (C : CV) × (F^{rel,acq} : CV) × (RF : Seq → CV) × (P : Prog)
StoreElem ≜ (t : Tid) × (s : Seq) × (a : LocA) × (mo : MemoryOrder) × (v : Val)
LoadElem  ≜ (t : Tid) × (s : Seq) × (a : LocA) × (mo : MemoryOrder) × (rf : StoreElem)
RMWElem   ≜ (t : Tid) × (s : Seq) × (a : LocA) × (mo : MemoryOrder) × (rf : StoreElem or RMWElem) × (v : Val)
FenceElem ≜ (t : Tid) × (s : Seq) × (mo : MemoryOrder)
ALocInfo  ≜ (StoreElem or LoadElem or RMWElem) list
FenceInfo ≜ Tid → FenceElem list
ALocs     ≜ LocA → ALocInfo
NALocs    ≜ LocNA → Val
State     ≜ ThrState list × ALocs × NALocs × FenceInfo × (M : mo-graph)

Figure 10: Operational State

Figures 11 to 13 present state transitions and related algorithms for our operational model. A system under evaluation is a triple of the form ($\Sigma$, ss, $T$), where $\Sigma$ represents the state of the system State, ss is the program being executed, and $T$ represents ThrState of the thread currently running the program. The current thread only updates its own state $T$ when the program ss executes, which causes the copy of $T$ in $\Sigma$ to become outdated. However, the updated $T$ will replace the old copy in $\Sigma$ when the thread switching function $\delta$ is called at the end of each atomic statement. The mo-graph is a data structure in State and represented as $\Sigma.M$. The mo-graph has methods Merge, AddEdge, AddRMWEdge, and AddEdges described in Figure 6 and Figure 7.

Figure 11 shows semantics for atomic statements. Every time an atomic statement is encountered, a corresponding LoadElem, StoreElem, RMWElem, or FenceElem is created with the sequence number auto-assigned. The process of assigning sequence numbers is omitted in Figure 11. Function calls [LOAD], [STORE], [RMW], and [FENCE] invoke the corresponding inference rules for updating clock vectors described in Figure 9 based on the type of atomic statements and the memory orders. Atomic statements with seq_cst or acq_rel memory orderings invoke both acquire and release clock vector rules if they apply. [LOAD], [STORE], [RMW], and [FENCE] take the current state of the system, the current atomic element, and the state of the current thread as arguments, pass necessary input into the inference rules for updating clock vectors, and finally return the updated state of the current thread.

For atomic loads and RMWs, the store that is read from is randomly selected from the may-read-from set computed using the algorithm BuildMayReadFrom presented in Figure 12, and the store must satisfy the constraint that the second return value of ReadPriorSet is true, i.e., having the load read from the selected store does not create a cycle in the mo-graph. The atomic RMW rule first triggers an atomic load rule, and the store/RMW $S$ that is read from is recorded in the rf field of the RMWElem. Then, the mo-graph is updated using the procedure AddRMWEdge, and the atomic RMW rule finishes by invoking an atomic store rule.
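The clock-vector rules of Figure 9 that [LOAD], [STORE], [RMW], and [FENCE] dispatch to can be sketched in a few lines. This is a minimal illustration under stated assumptions, not C11Tester's implementation: `HbState`, `join`, `tick`, and `release_sequence_demo` are hypothetical names, the number of threads is fixed up front, and `tick` (advancing a thread's own slot to the event's sequence number) is an assumed bookkeeping step that Figure 9 leaves implicit.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

using CV = std::vector<unsigned>;  // one slot per thread

// Union of clock vectors: pointwise maximum (vectors of equal length).
CV join(const CV& a, const CV& b) {
    CV out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = std::max(a[i], b[i]);
    return out;
}

// Hypothetical container for the four families of clock vectors.
struct HbState {
    std::vector<CV> C, Frel, Facq;  // per-thread vectors
    std::map<unsigned, CV> RF;      // reads-from vector per store/RMW sequence number

    explicit HbState(std::size_t n)
        : C(n, CV(n, 0u)), Frel(n, CV(n, 0u)), Facq(n, CV(n, 0u)) {}

    // Assumed step: thread t performs the event with sequence number s.
    void tick(std::size_t t, unsigned s) { C[t][t] = s; }

    void store_release(unsigned s, std::size_t t) { RF[s] = C[t]; }
    void store_relaxed(unsigned s, std::size_t t) { RF[s] = Frel[t]; }
    // Store side of an RMW with sequence number s that reads from s2.
    void rmw_release(unsigned s, std::size_t t, unsigned s2) { RF[s] = join(C[t], RF.at(s2)); }
    void rmw_relaxed(unsigned s, std::size_t t, unsigned s2) { RF[s] = join(Frel[t], RF.at(s2)); }
    void load_acquire(std::size_t t, unsigned s2) { C[t] = join(C[t], RF.at(s2)); }
    void load_relaxed(std::size_t t, unsigned s2) { Facq[t] = join(Facq[t], RF.at(s2)); }
    void fence_release(std::size_t t) { Frel[t] = C[t]; }
    void fence_acquire(std::size_t t) { C[t] = join(C[t], Facq[t]); }
};

// Thread 0 store-releases, thread 1 extends the release sequence with a
// relaxed RMW, thread 2 load-acquires from the RMW.
HbState release_sequence_demo() {
    HbState st(3);
    st.tick(0, 1);
    st.store_release(1, 0);   // RF[1] = (1,0,0)
    st.tick(1, 2);
    st.load_relaxed(1, 1);    // load half of the RMW: Facq[1] = (1,0,0)
    st.rmw_relaxed(2, 1, 1);  // RF[2] = Frel[1] joined with RF[1] = (1,0,0)
    st.tick(2, 3);
    st.load_acquire(2, 2);    // C[2] = (1,0,3): synchronizes with thread 0 only
    st.fence_acquire(1);      // C[1] = (1,2,0): the fence upgrades the relaxed read
    return st;
}
```

In the demo, thread 2 ends up ordered after thread 0's store-release but not after thread 1's relaxed RMW, which is the behavior the release-sequence discussion describes.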
Both atomic load and atomic store rules call the helper method AddEdges in Figure 7 to add edges to the mo-graph.

Figure 13 presents the procedures ReadPriorSet and WritePriorSet which compute the set of atomic actions (mo-graph nodes) from which new mo edges will be formed.

We use the following helper functions in Figure 12 and Figure 13:
• last_sc_fence($t$) returns the last seq_cst fence in thread $t$;
• last_sc_store($a$, $S$) returns the last seq_cst store performed at location $a$ that is different from $S$;
• sc_fences($t$) returns the list of seq_cst fences performed by thread $t$;
• sc_stores($t$, $a$) returns the list of seq_cst stores and RMWs performed by thread $t$ at location $a$;
• stores($t$, $a$) returns the list of stores and RMWs performed by thread $t$ at location $a$;
• loads_stores($t$, $a$) returns the list of loads, stores, and RMWs performed by thread $t$ at location $a$;
• last(list) returns the element with the largest sequence number in the list, excluding null elements;
• get_write($A$) returns $A$ if $A$ is an atomic store or RMW and returns $A.rf$ if $A$ is an atomic load.

All the above functions return null if the result does not exist.

[ATOMIC LOAD]
$$\frac{\begin{array}{c}
(\Sigma, T) \to_{load} (\Sigma, T') \quad L.t = T'.t \quad L.a = a \quad L.mo = mo \\
S \in BuildMayReadFrom(L) \quad L.rf = S \quad (pset, ret) = ReadPriorSet(L, S) \quad ret == True \\
T'' = [LOAD](\Sigma, L, T') \quad \Sigma' = \Sigma[M := \Sigma.M.AddEdges(pset, S)] \\
\Sigma'' = \Sigma'[NALocs := \Sigma'.NALocs[l := S.v]] \quad \Sigma''' = \Sigma''[ALocs := \Sigma''.ALocs(a).pushback(L)]
\end{array}}{(\Sigma,\ l = Load(a, mo);\ ss,\ T) \Rightarrow (\Sigma''',\ \delta;\ ss,\ T'')}$$

[ATOMIC STORE]
$$\frac{\begin{array}{c}
(\Sigma, T) \to_{store} (\Sigma', T) \quad S.t = T.t \quad S.a = a \quad S.mo = mo \quad S.v = \Sigma'.NALocs(l) \\
pset = WritePriorSet(S) \quad T' = [STORE](\Sigma', S, T) \\
\Sigma'' = \Sigma'[M := \Sigma'.M.AddEdges(pset, S)] \quad \Sigma''' = \Sigma''[ALocs := \Sigma''.ALocs(a).pushback(S)]
\end{array}}{(\Sigma,\ Store(l, a, mo);\ ss,\ T) \Rightarrow (\Sigma''',\ \delta;\ ss,\ T')}$$

[ATOMIC RMW]
$$\frac{\begin{array}{c}
(\Sigma, T) \to_{rmw} (\Sigma', T') \quad R.t = T'.t \quad R.a = a \quad R.mo = mo \\
(\Sigma',\ l = Load(a, mo),\ T') \to (\Sigma'', ss, T'') \quad R.rf = S \quad T''' = [RMW](\Sigma'', R, T'') \\
\Sigma''' = \Sigma''[M := \Sigma''.M.AddRMWEdge(GetNode(R.rf), GetNode(R))] \\
\Sigma'''' = \Sigma'''[ALocs := \Sigma'''.ALocs(a).pushback(R)]
\end{array}}{(\Sigma,\ RMW(a, mo, F);\ ss,\ T) \Rightarrow (\Sigma'''',\ l = F(l);\ R.v = \Sigma''''.NALocs(l);\ Store(l, a, mo);\ \delta;\ ss,\ T''')}$$

[ATOMIC FENCE]
$$\frac{F.t = T.t \quad F.mo = mo \quad T' = [FENCE](\Sigma, F, T) \quad \Sigma' = \Sigma[FenceInfo := \Sigma.FenceInfo(t).pushback(F)]}{(\Sigma,\ Fence(mo);\ ss,\ T) \Rightarrow (\Sigma',\ \delta;\ ss,\ T')}$$

Figure 11: Semantics for atomic statements

 1: procedure BuildMayReadFrom($L$)
 2:     $ret := \emptyset$
 3:     if $L.mo == seq\_cst$ then
 4:         $S := last\_sc\_store(L.a, L)$
 5:     end if
 6:     for all threads $t$ do
 7:         $stores := stores(t, L.a)$
 8:         $base := \{X \in stores \mid \neg(X \xrightarrow{hb} L) \lor (X \xrightarrow{hb} L \land (\nexists Y \in stores .\ X \xrightarrow{sb} Y \xrightarrow{hb} L))\}$
 9:         if $L.mo == seq\_cst \land S \ne null$ then
10:             $base := base \setminus \{X \in stores \mid X \xrightarrow{sc} S \lor X \xrightarrow{hb} S\}$
11:         end if
12:         $ret := ret \cup base$
13:     end for
14:     if $L$ is rmw then
15:         $ret := \{X \in ret \mid$ no rmw has read from $X\}$
16:     end if
17:     return $ret$
18: end procedure

Figure 12: Pseudocode for computing may-read-from sets

 1: procedure WritePriorSet($S$)
 2:     $priorset := \emptyset$; $F_S := last\_sc\_fence(S.t)$; $is\_sc\_store := (S.mo == seq\_cst)$
 3:     if $is\_sc\_store$ then
 4:         add $last\_sc\_store(S.a, S)$ to $priorset$
 5:     end if
 6:     for all threads $t$ do
 7:         $F_t := last\_sc\_fence(t)$
 8:         $F_b := last(\{F \in sc\_fences(t) \mid F_S \ne null \land F \xrightarrow{sc} F_S\})$
 9:         $S_1 := last(\{X \in stores(t, S.a) \mid is\_sc\_store \land F_t \ne null \land X \xrightarrow{sb} F_t\})$
10:         $S_2 := last(\{X \in sc\_stores(t, S.a) \mid F_S \ne null \land X \xrightarrow{sc} F_S\})$
11:         $S_3 := last(\{X \in stores(t, S.a) \mid F_b \ne null \land X \xrightarrow{sb} F_b\})$
12:         $S_4 := last(\{X \in loads\_stores(t, S.a) \mid X \xrightarrow{hb} S\})$
13:         add $get\_write(last(\{S_1, S_2, S_3, S_4\}))$ to $priorset$
14:     end for
15:     return $priorset$
16: end procedure

 1: procedure ReadPriorSet($L$, $S$)
 2:     $priorset := \emptyset$; $F_L := last\_sc\_fence(L.t)$; $is\_sc\_load := (L.mo == seq\_cst)$
 3:     for all threads $t$ do
 4:         $F_t := last\_sc\_fence(t)$
 5:         $F_b := last(\{F \in sc\_fences(t) \mid F_L \ne null \land F \xrightarrow{sc} F_L\})$
 6:         $S_1 := last(\{X \in stores(t, L.a) \mid is\_sc\_load \land F_t \ne null \land X \xrightarrow{sb} F_t\})$
 7:         $S_2 := last(\{X \in sc\_stores(t, L.a) \mid F_L \ne null \land X \xrightarrow{sc} F_L\})$
 8:         $S_3 := last(\{X \in stores(t, L.a) \mid F_b \ne null \land X \xrightarrow{sb} F_b\})$
 9:         $S_4 := last(\{X \in loads\_stores(t, L.a) \mid X \xrightarrow{hb} L\})$
10:         $A := get\_write(last(\{S_1, S_2, S_3, S_4\}))$
11:         if $A \ne S$ then
12:             add $A$ to $priorset$
13:         end if
14:     end for
15:     for each $e$ in $priorset$ do
16:         if $e$ is reachable from $S$ in mo-graph then
17:             return ($\emptyset$, false)
18:         end if
19:     end for
20:     return ($priorset$, true)
21: end procedure

Figure 13: Pseudocode for computing priorsets for atomic stores and loads

We make our axiomatic model precise and prove the equivalence of our operational and axiomatic models in Section A of the Appendix.

We next present several aspects of the C11Tester implementation. Section 7.1 presents C11Tester's support for limiting memory usage. Section 7.2 presents C11Tester's support for mixed mode accesses to a memory location. Section 7.3 discusses the overheads of different approaches to controlling thread schedules. Section 7.4 describes how C11Tester implements thread local storage with fiber-based scheduling. Section 7.5 presents C11Tester's support for static initializers. Section 7.6 presents C11Tester's support for repeated execution.

While keeping the complete C/C++ execution graph and execution trace is feasible for short executions and can help with debugging, for longer executions their size eventually becomes too large to store in memory. Naively pruning the execution trace to retain the most recent actions is not safe: an older store $S_A$ to an atomic location $X$ in the trace can be modification ordered after a later store $S_B$ to $X$ in the trace. If a thread has already read from $S_A$, it cannot read from $S_B$ because it is modification ordered before $S_A$. Naively pruning $S_A$ from the execution graph without also removing $S_B$ might erroneously produce an invalid execution in which a thread reads from $S_A$ and then $S_B$.

C11Tester supports two approaches to limiting memory usage: (1) a conservative mode that limits the size of the execution graph with the constraint that C11Tester must retain the ability to generate all possible executions and (2) an aggressive mode that can potentially reduce the set of executions that C11Tester can produce.

Conservative Mode.

The key idea behind the conservative mode is to compute a set of older stores that can no longer be read by any thread and thus can be safely removed from the execution graph. The basic idea is to compute the latest action $A_t$ for each thread $t$ such that for the last action $L_{t'}$ in every other thread $t'$, we have $A_t \xrightarrow{hb} L_{t'}$. If action $S$ is a store that either happens before $A_t$ or is $A_t$, then any new loads from the same memory location must either read from $S$ or some store that is modification ordered after $S$. Thus any store $S_{old}$ that is modification ordered before the store $S$ can no longer be read from by any thread and can be safely pruned.

C11Tester efficiently computes a clock vector $CV_{min}$ to identify such actions $A_t$ for each thread by using the intersection operator, $\cap$, to combine the clock vectors of all running threads. We define the intersection operator $\cap$ as follows:

$$CV_1 \cap CV_2 \triangleq \lambda t.\ \min(CV_1(t), CV_2(t)).$$

C11Tester then searches for stores that happen before these operations. It then uses the mo-graph to identify old stores to prune. Finally, it prunes these stores and any loads that read from them.
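The intersection step can be sketched as a pointwise minimum folded over the clock vectors of the running threads. The helper names `intersect`, `cv_min`, and `known_to_all_threads` are hypothetical, and the sketch assumes equal-length vectors and at least one running thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using ClockVector = std::vector<unsigned>;  // one slot per thread

// Intersection of two clock vectors: pointwise minimum.
ClockVector intersect(const ClockVector& a, const ClockVector& b) {
    ClockVector out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = std::min(a[i], b[i]);
    return out;
}

// Fold the intersection over all running threads (precondition: non-empty).
ClockVector cv_min(const std::vector<ClockVector>& thread_cvs) {
    ClockVector acc = thread_cvs.front();
    for (std::size_t i = 1; i < thread_cvs.size(); ++i) {
        acc = intersect(acc, thread_cvs[i]);
    }
    return acc;
}

// An action of thread tid with sequence number seq is at or before the
// point every running thread has synchronized with when seq <= cvmin[tid].
bool known_to_all_threads(std::size_t tid, unsigned seq, const ClockVector& cvmin) {
    return seq <= cvmin[tid];
}
```

Stores that pass the `known_to_all_threads` test play the role of $S$ in the argument above; anything modification ordered before them becomes a pruning candidate.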

Aggressive Mode.

If a thread fails to synchronize with other threads, this can prevent C11Tester from freeing much of the execution graph or execution trace, as such a thread can potentially read from older stores in the execution trace and thus prevent freeing those stores. In the aggressive mode, the user provides a window of the trace that C11Tester attempts to keep in the graph. Simply deleting all memory operations before that window is not sound as newer (with respect to the trace) memory operations may be modification ordered before older memory operations. Thus removing older memory operations could cause C11Tester to erroneously allow loads to read from stores they should not.

For a store $S$ outside of this window, C11Tester attempts to remove all stores modification ordered before $S$. Such stores can in some cases be inside of the window that C11Tester attempts to preserve, but they must also be removed. C11Tester then removes any loads that read from the removed stores.

Fences.

Release fences that happen before actions whose sequence numbers correspond to components of $CV_{min}$ are not necessary to keep since every running thread has already synchronized with a later point in the respective thread's execution. Thus such release fences can be safely removed.

After an acquire fence is executed, its effect is summarized in the clock vector of subsequent actions in the same thread. Thus acquire fences can be safely removed.

Sequentially consistent fences that happen before $CV_{min}$ are no longer necessary since the happens-before relation will enforce the same orderings. Thus, such sequentially consistent fences can be safely removed.

The C/C++ memory model is silent on the semantics of how atomic accesses interact with non-atomic accesses to the same memory location. Researchers have recognized this as a serious limitation of the standard [8]. It is necessary to handle mixtures of atomics and non-atomics in C11Tester for three reasons: (1) atomic_init is implemented in the header files as a non-atomic store and may race with concurrent atomic accesses to the same memory location, (2) memory can be reused in C/C++, e.g., via malloc and free, and the new use may use a memory location for a different purpose, and (3) C/C++ programs may use non-atomic accesses to copy memory that contains atomics, e.g., via realloc or memcpy. C11Tester thus supports non-atomic operations that access the same memory location as an atomic, as they must be tolerated provided that the accesses are ordered by the happens-before relation. If the accesses conflict and are not ordered by happens-before, C11Tester reports a data race.

Handling non-atomic stores poses a challenge. For performance reasons, it is important to implement non-atomic stores as simple writes to memory.
But if a non-atomic store is later read by an atomic load, then C11Tester must include that non-atomic store in the modification order graph and other internal data structures. The challenge is that by the time C11Tester observes the atomic load, it has lost information about the non-atomic store.

C11Tester uses a FastTrack [27]-like approach to race detection. It maintains a 64-bit shadow word for each byte of memory. The shadow word either contains 25-bit read and write clocks and 6-bit read and write thread identifiers or a reference to an expanded access record. We use one bit in the shadow word to record whether the last store to the address was a non-atomic or an atomic store. If C11Tester performs an atomic access to a memory location that was last written to by a non-atomic store, C11Tester creates a special non-atomic access record and adds the access to the modification order graph.

Many applications also contain legacy libraries that use pre-C/C++11 atomic operations such as LLVM intrinsics and volatile accesses. C11Tester supports converting such volatile accesses into atomic accesses (with a user-specified memory order) to allow code that incorporates legacy libraries to execute.

There are two general techniques for controlling the schedule for executing threads. The first technique is to map application threads to kernel threads and then use synchronization constructs to control which thread takes a step. The second technique is to simulate application threads with user threads or fibers that are all mapped to one kernel thread. While there is a proposal for user-space control of thread scheduling that provides very low latency context switches, it still has not been implemented in the mainline Linux kernel [57] after six years.

Scheduling Approach          Time for all cores   Time for 1 core
Pthread condition variable   1.95 μs              1.61 μs
Futex                        1.85 μs              1.32 μs
Spinning                     0.07 μs              15,976.7 μs
Spinning w/ yield            0.21 μs              0.54 μs
Swapcontext                  0.34 μs              0.34 μs
Swapcontext w/ tls           0.63 μs              0.63 μs
Setjmp/Longjmp               0.01 μs              0.01 μs
Setjmp/Longjmp w/ tls        0.30 μs              0.30 μs

Figure 14: Context Switch Costs

We implemented a microbenchmark on x86 to measure the context switch costs for several implementations of these two techniques. Our microbenchmark starts two threads or fibers and measures the time to switch between these threads. Figure 14 reports the results of these experiments. We ran each experiment in two configurations: (1) in the all-core configuration the microbenchmark could use all 4 hardware cores and (2) in the single-core configuration the microbenchmark was pinned to a single hardware thread.

For the kernel threads, we implemented four approaches to context switches. The first approach uses standard pthread condition variables and was generally the slowest approach. The second approach uses Linux futexes and is a little faster. The next two approaches use spinning to wait. Simply spinning is very fast if every thread has its own core. As soon as two threads have to share a core, this approach becomes 10,000× slower than the other approaches because it has to wait for a scheduling epoch to occur to switch contexts. We also implemented a version that adds a yield call. This hurts performance if both threads run on their own core, but significantly helps performance if threads share a core. But in general, spinning is problematic as idle threads keep cores busy.

For the fiber-based approaches, we used both swapcontext and setjmp to implement fibers. Swapcontext is significantly slower than setjmp because it makes a system call to update the signal mask. An issue with these approaches is that neither call updates the register that points to thread local storage. Updating this register requires a system call, and this slows down both fiber approaches. We report context switches with this system call in the "w/ tls" entries.

For practical implementation strategies, the fiber-based approach is faster than kernel threads. Thus, C11Tester uses fibers implemented via swapcontext to simulate application threads.
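A minimal sketch of a swapcontext-based fiber, assuming a Linux/glibc system that provides `<ucontext.h>`, is shown below. The names `fiber_sketch`, `fiber_body`, and `run_fiber` are hypothetical, the stack size is an arbitrary choice, and the sketch does not update the thread local storage register, so it corresponds to the plain "Swapcontext" row rather than the "w/ tls" row.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

namespace fiber_sketch {

ucontext_t scheduler_ctx;                 // context of the controlling code
ucontext_t fiber_ctx;                     // context of the simulated thread
alignas(16) char fiber_stack[64 * 1024];  // stack for the fiber (size is an assumption)
std::vector<int> trace;                   // records who ran, in order

// Body of the fiber: takes three steps, yielding to the scheduler after each.
void fiber_body() {
    for (int i = 0; i < 3; ++i) {
        trace.push_back(i);
        swapcontext(&fiber_ctx, &scheduler_ctx);  // yield to the scheduler
    }
    // Returning here resumes the context named by uc_link.
}

// Creates the fiber and switches into it four times; returns the trace.
std::vector<int> run_fiber() {
    trace.clear();
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
    fiber_ctx.uc_link = &scheduler_ctx;  // where to go when fiber_body returns
    makecontext(&fiber_ctx, fiber_body, 0);

    for (int step = 0; step < 4; ++step) {
        trace.push_back(100 + step);              // scheduler decides to run the fiber
        swapcontext(&scheduler_ctx, &fiber_ctx);  // switch to the fiber
    }
    return trace;
}

}  // namespace fiber_sketch
```

`fiber_sketch::run_fiber()` returns 100, 0, 101, 1, 102, 2, 103: the scheduler and the fiber interleave on one kernel thread, and the fourth switch only lets the fiber run to completion.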

A major challenge with implementing fibers is supporting thread local storage. The specification for thread local storage on x86-64 [23] is complicated and leaves many important details implementation-defined and these details vary across different versions of the standard library. Generating a correct thread local storage region for each thread is a significant effort as it requires continually updating C11Tester code to support the current set of library implementation strategies. This is complicated by the fact that creating the thread local storage may involve calling initializers and freeing the thread local storage may involve calling destructors.


Instead, C11Tester implements a technique for borrowing the thread context including the thread local storage from a kernel thread. The idea is that for each fiber C11Tester creates a real kernel thread and the fiber borrows the kernel thread's entire context including its thread local storage.

C11Tester implements thread context borrowing by first creating and locking a mutex to protect the thread context and then creating a new kernel thread to serve as a lending thread that lends its context to C11Tester. The lending thread then creates a fiber context and switches to the fiber context. The fiber context then transfers the lending thread's context along with its thread local storage to C11Tester. Finally, the fiber context grabs the context mutex to wait for C11Tester to return its context. Once the application thread is finished, C11Tester returns the thread context to the lending thread by releasing the context mutex. The lending thread then switches back to its original context, frees its fiber context, and then exits. Migrating thread local storage on x86 requires a system call to change the fs register. C11Tester implements thread context borrowing for x86, but the basic idea should work for any architecture.

Static initializers in C++ can and do create threads, perform atomic operations, and call arbitrary functions in the C++ and pthread libraries. C11Tester guards access to itself with initialization checks. In the first call to a C11Tester routine, C11Tester initializes itself and converts the current application thread into a fiber context. It then takes control of the execution and controls the remainder of the program execution. This allows C11Tester to support programs that perform arbitrary operations in their static initializers.
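The initialization check for static initializers can be illustrated with a small sketch. All names here (`init_sketch`, `Tester`, `ensure_initialized`, `intercepted_atomic_load`) are hypothetical and only model the idea of initializing on the first intercepted call; the real tool would also convert the calling thread into a fiber at that point, which is not shown.

```cpp
#include <atomic>
#include <cassert>

namespace init_sketch {

// Stand-in for the tool's internal state.
struct Tester {
    bool initialized = false;
    int intercepted_ops = 0;
};

// A function-local static is constructed on first use, so it is available
// even when called from another object's static initializer.
Tester& tester() {
    static Tester t;
    return t;
}

// Every entry point performs this check before doing any work.
void ensure_initialized() {
    Tester& t = tester();
    if (!t.initialized) {
        t.initialized = true;  // real setup (fibers, scheduler) would go here
    }
}

// Hypothetical entry point that an instrumented atomic load would call.
int intercepted_atomic_load(const std::atomic<int>& a) {
    ensure_initialized();
    tester().intercepted_ops++;
    return a.load(std::memory_order_relaxed);
}

std::atomic<int> flag{7};

// A static initializer that performs an atomic operation before main runs.
int value_seen_by_static_initializer = intercepted_atomic_load(flag);

}  // namespace init_sketch
```

By the time `main` starts, the tester has already been initialized by the static initializer's call, and later calls skip the setup branch.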

C11Tester supports repeatedly executing the same benchmark to find hard-to-trigger bugs. It can be desirable for testing algorithms to maintain state between executions to attempt to explore different program behaviors across different executions. C11Tester maintains its internal state across executions of the application under test and resets the application's state between executions.

C11Tester uses fork-based snapshots to restore the application to its initial state. C11Tester uses the mmap library call to map a shared memory region to store its internal state. The data in this shared memory region persists across different executions. This state allows C11Tester to report data races only once as opposed to reporting the same race on each execution. It also allows for the creation of smart plugins that explore different behaviors across different executions.
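The combination of fork-based snapshots and an mmap-ed shared region can be sketched as below, assuming Linux. The names `snapshot_sketch`, `TesterState`, `app_state`, and `run` are hypothetical, and forking one child per execution is an assumption about how the snapshot is taken; the source only states that fork and a shared mapping are used.

```cpp
#include <cassert>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace snapshot_sketch {

// Tester state that must survive across executions.
struct TesterState {
    int executions;    // completed executions
    int race_reports;  // number of times the (single) race was reported
    bool race_seen;    // set once the race has been reported
};

int app_state = 0;  // stands in for application memory; private to each process

// Runs `n` executions and returns a copy of the persistent tester state.
// Returns executions == -1 if mmap or fork fails.
TesterState run(int n) {
    TesterState failed{-1, -1, false};
    void* mem = mmap(nullptr, sizeof(TesterState), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return failed;
    TesterState* shared = static_cast<TesterState*>(mem);
    *shared = TesterState{0, 0, false};

    for (int i = 0; i < n; ++i) {
        pid_t pid = fork();  // child starts from the initial application state
        if (pid < 0) {
            munmap(mem, sizeof(TesterState));
            return failed;
        }
        if (pid == 0) {
            app_state += 1;  // application mutation, discarded when the child exits
            if (!shared->race_seen) {  // report the same race only once
                shared->race_seen = true;
                shared->race_reports += 1;
            }
            shared->executions += 1;  // written to the shared mapping, so it persists
            _exit(0);
        }
        int status = 0;
        waitpid(pid, &status, 0);
    }
    TesterState result = *shared;
    munmap(mem, sizeof(TesterState));
    return result;
}

}  // namespace snapshot_sketch
```

After `snapshot_sketch::run(3)`, the shared state records three executions but only one race report, while `app_state` in the parent is still 0 because each child's private memory is thrown away.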

We compare C11Tester with both tsan11rec, a race detector that supports controlled execution [41], and tsan11 [40], a race detector that relies on the operating system scheduler to control the scheduling of threads. We ran our experiments on an Ubuntu Linux 18.04 LTS machine with a 6-core Intel Core i7-8700K CPU and 64GB RAM.

We first evaluated the above tools on buggy implementations of seqlock and reader-writer lock to check whether all three tools can detect the injected bugs. Then we evaluated the three tools on both a set of five applications that make extensive use of C/C++ atomics and the data structure benchmarks used to evaluate CDSChecker previously [47].

We were not able to build tsan11rec and tsan11 directly on our machine due to dependencies on legacy versions of software. Instead, we compiled tsan11rec and tsan11 inside two Docker containers whose base images were both Ubuntu 14.04 LTS. The tsan11rec-instrumented benchmarks were compiled with Clang v4.0 revision 286346, the tsan11-instrumented benchmarks were compiled with Clang v3.9 revision 375507, and the C11Tester-instrumented benchmarks were compiled with Clang v8.0 revision 346999.

The way these three tools support multi-threading differs significantly. C11Tester sequentializes thread executions and only allows one thread to execute at a time, tsan11 allows multiple threads to execute in parallel, while tsan11rec falls in between: it sequentializes visible operations (such as atomics, thread operations, and synchronization operations) and runs invisible operations in parallel. The closest tool to compare C11Tester with is tsan11rec because both C11Tester and tsan11rec support controlled scheduling, while results for tsan11 are also presented for completeness. Although both tsan11 and tsan11rec execute all or some operations in parallel, we present a best-effort comparison in the following.

We have injected bugs into two commonly used data structures and verified that both tsan11 and tsan11rec miss these bugs due to the restrictions of their memory models and that the buggy executions contained cycles in hb ∪ rf ∪ mo ∪ sc.

Seqlock.

We took the seqlock implementation from Figure 5 of Hans Boehm's MSPC 2012 paper [14], made the writer correctly use release atomics for the data field stores, and injected a bug by weakening atomics that initially increment the counter to relaxed memory ordering.
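To show where such a weakening sits in the code, here is a small seqlock sketch in which the memory order of the initial counter increment is a parameter. It is not a reproduction of Figure 5 in [14]: the struct `SeqLockSketch`, its members, the retry loop, and the acquire loads in the reader are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

struct SeqLockSketch {
    std::atomic<unsigned> seq{0};  // even: no writer active, odd: write in progress
    std::atomic<int> data1{0};
    std::atomic<int> data2{0};

    // `first_increment` is the ordering of the increment that opens the write.
    // Passing std::memory_order_relaxed models the kind of injected weakening
    // described above.
    void write(int a, int b, std::memory_order first_increment) {
        unsigned s = seq.load(std::memory_order_relaxed);
        for (;;) {
            if (s & 1u) {  // another writer is active: re-read the counter
                s = seq.load(std::memory_order_relaxed);
                continue;
            }
            // Try to move the counter from even (s) to odd (s + 1).
            if (seq.compare_exchange_weak(s, s + 1, first_increment,
                                          std::memory_order_relaxed)) {
                break;
            }
        }
        data1.store(a, std::memory_order_release);
        data2.store(b, std::memory_order_release);
        seq.store(s + 2, std::memory_order_release);  // close the write
    }

    // Retries until it observes the same even counter before and after
    // reading the data fields.
    std::pair<int, int> read() const {
        for (;;) {
            unsigned before = seq.load(std::memory_order_acquire);
            int a = data1.load(std::memory_order_acquire);
            int b = data2.load(std::memory_order_acquire);
            unsigned after = seq.load(std::memory_order_relaxed);
            if (before == after && (before & 1u) == 0) return {a, b};
        }
    }
};
```

Run on a single thread, both orderings behave identically; the effect of the weaker increment can only show up when a tool explores concurrent executions permitted by the memory model, which is what the experiment above measures.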

Reader-Writer Lock.

We also implemented a broken reader-writer lock where the write-lock operation incorrectly uses relaxed atomics. The test case uses the read-lock to protect reads from atomic variables and the write-lock to protect writes to atomic variables.

C11Tester was able to detect the injected bugs in the broken seqlock and reader-writer lock with bug detection rates of 28.8% and 55.3%, respectively, in 1,000 runs. However, tsan11 and tsan11rec failed to detect the bugs in 10,000 runs.

Ideally, we would evaluate the tools against real-world applications that make extensive use of C/C++ atomics. However, to our knowledge, no such standard benchmark suite exists so far. So we gathered our benchmarks by searching for benchmarks evaluated in previous work as well as concurrent programs on GitHub.

The five large applications that we have gathered include: GDAX [7], an in-memory copy of the order book of the GDAX cryptocurrency exchange; Iris [67], a low-latency C++ logging library; Mabain [22], a key-value store library; Silo [55, 56], a multicore in-memory storage engine; and the Firefox JavaScript engine release 50.0.1. To make our results as reproducible as possible, we tested the JavaScript engine using the offline version of JSBench v2013.1 [51].

As the three tools supported multi-threading in different ways, to make a fair comparison, we ran each experiment on application benchmarks in both the all-core configuration, where all hardware cores could be utilized, and the single-core configuration, where the tools were restricted to running on a single CPU using the Linux command taskset. As it is always trivial to parallelize testing by running several copies of a tool in parallel, the rationale behind the single-core experiment is to compare the total CPU time used to execute a benchmark or the equivalent throughput under different tools. However, to understand the performance benefits of parallelism for the other tools, we also ran experiments in the all-core configuration. The performance of C11Tester does not vary much between the two configurations, because C11Tester only schedules one thread to run at a time.

Table 1 summarizes the average and relative standard deviation (in parentheses) of execution time or throughput for each of the five benchmarks in the single-core and all-core configurations. Table 1 reports wall-clock time for Iris and Mabain.
The throughput of Silo is the aggregate throughput (agg_throughput) reported by Silo, and the unit is ops/sec, i.e., the number of database operations performed per second. The throughput of GDAX is the number of times the entire data set is iterated over in 120s. The time and relative standard deviation reported for JSBench are the statistics reported by the Python script in JSBench over 10 runs. For the other four benchmarks, the average and relative standard deviation of the time and throughput are calculated over 10 runs.

C11Tester is slower than tsan11 in all benchmarks except Silo in the single-core configuration. C11Tester is faster than tsan11rec in all benchmarks except JSBench in the all-core configuration.

Figure 15 summarizes speedups compared to tsan11 on the single-core configuration for each tool under both configurations, which are derived from data in Table 1. Tsan11 on the single-core configuration is set as the baseline and is omitted from Figure 15.

Based on the results in Figure 15, we further calculated the geometric mean of the speedup over the five benchmarks for each tool under both configurations. According to the geometric means, C11Tester is 14.9× and 11.1× faster than tsan11rec in the single-core configuration and all-core configuration, respectively. C11Tester is 1.6× and 3.1× slower than tsan11 in the single-core configuration and all-core configuration, respectively.

Table 3 presents the number of atomic operations and normal accesses to shared memory locations executed by C11Tester for each benchmark. As the compiler pass of C11Tester was adapted from the LLVM ThreadSanitizer pass, the number of atomic operations and normal accesses to shared memory executed by tsan11 and tsan11rec should be relatively similar, except for the two throughput-based benchmarks, Silo and GDAX, as the amount of work depends on how fast a tool is.

Silo.

Silo [55, 56] is an in-memory database that is designed for performance and scalability for modern multicore machines. The test driver we used is dbtest.cc. We ran the driver for 30 seconds each run with option "-t 5", i.e., 5 threads in parallel.

In the first part of the experiment, Silo was compiled with invariant checking turned on. C11Tester found executions in which invariants were violated. We found that it was because Silo used volatiles with gcc intrinsic atomics to implement a spinlock and assumed stronger behaviors from volatiles than C11Tester's default handling of volatiles as relaxed atomics. The bug disappeared when we handled volatile loads and stores as load-acquire and store-release atomics. Volatile variables were commonly used to implement atomic memory accesses before C/C++11. However, this usage of volatile is technically incorrect, because the C++ standard provides no guarantee when volatiles are mixed with atomics, and weaker behaviors for volatiles can be exhibited by ARM processors.

We ran both tsan11rec and tsan11 on Silo for 100 runs with 30s each run. Tsan11rec was not able to reproduce the weak behaviors that C11Tester discovered, while tsan11 could reproduce the weak behaviors 35% of the time. Tsan11rec and tsan11 both found racy accesses on volatile variables that were used to implement a spinlock. C11Tester did not report an error message for the volatile races because C11Tester intentionally elides race warnings for races involving volatiles and atomic accesses or races involving volatiles and volatiles because volatiles are in practice still commonly used to implement atomics.

When measuring performance for Silo, we turned off invariant checking. We measured performance in terms of aggregate throughput reported by Silo. C11Tester is faster than tsan11 in the single-core configuration, because reporting data races caused significant overhead for tsan11 in the case of Silo.

https://ftp.mozilla.org/pub/firefox/releases/50.0.1/source/
https://plg.uwaterloo.ca/~dynjs/jsbench/

Figure 15: Speedups compared to tsan11 on the single-core configuration for all three tools under both configurations, derived from Table 1. The performance of tsan11 on the single-core configuration is set as the baseline and is omitted from the figure. Larger values indicate faster tools. The "(S)" label stands for the single-core configuration, and "(A)" stands for the all-core configuration.

Figure 16: Performance comparisons for data structure benchmarks, based on data in Table 2.

Table 1: Performance results for application benchmarks in the single-core and all-core configurations. The results are averaged over 10 runs. Relative standard deviation is reported in parentheses. Larger throughputs are better for throughput-based measurements, smaller times are better for time-based measurements.

        Single-core Configuration                       All-core Configuration
Test    C11Tester       tsan11rec      tsan11           C11Tester       tsan11rec       tsan11           Measurement
Silo    15267 (0.45%)   436 (2.52%)    5496 (4.54%)     15297 (1.17%)   438.3 (0.59%)   46688 (1.68%)    Throughput (ops/sec)
GDAX    2953 (1.80%)    69.3 (0.97%)   15700 (0.12%)    2946 (1.64%)    49.4 (1.04%)    53362 (11.4%)    Throughput (

Table 2: Performance results for data structure benchmarks. The time column gives the time taken to execute the test case once, averaged over 500 runs. The rate column gives the percentage of executions in which the data race is detected among 500 runs.

                   C11Tester         tsan11rec         tsan11
Test               Time    Rate      Time     Rate     Time    Rate
barrier            4ms     76.6%     19ms     36.4%    12ms    0.0%
chase-lev-deque    2ms     94.6%     7ms      0.0%     3ms     0.0%
dekker-fences      2ms     21.6%     10ms     41.4%    5ms     53.2%
linuxrwlocks       2ms     86.2%     10ms     53.4%    5ms     1.6%
mcs-lock           3ms     89.4%     11ms     71.4%    14ms    0.8%
mpmc-queue         4ms     59.4%     10ms     58.2%    5ms     0.4%
ms-queue           4ms     100.0%    136ms    100.0%   9ms     100.0%
Average                    75.4%              51.5%            22.3%

Table 3: The number of atomic operations (including synchronization operations such as mutex and condition variable operations) and normal accesses to shared memory locations executed in each benchmark by C11Tester.

Silo    GDAX    Mabain    Iris    JSBench
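The Silo spinlock issue discussed above comes down to which memory order a tool assigns to the store that releases the lock. The sketch below is not Silo's code: `SpinLock` and `count_with_lock` are hypothetical names, `std::atomic` operations stand in for the gcc intrinsics and volatile accesses, and the template parameter models how the unlocking store is interpreted.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// UnlockOrder models how the plain (volatile) store that releases the lock
// is interpreted: relaxed by default, or release when volatile stores are
// treated as store-release atomics.
template <std::memory_order UnlockOrder>
class SpinLock {
public:
    void lock() {
        // Stand-in for an intrinsic test-and-set that acquires the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(false, UnlockOrder); }
private:
    std::atomic<bool> locked_{false};
};

// Two threads each increment a plain counter `n` times inside the lock.
template <std::memory_order UnlockOrder>
long count_with_lock(int n) {
    SpinLock<UnlockOrder> lock;
    long counter = 0;  // non-atomic: protected only by the lock
    auto work = [&] {
        for (int i = 0; i < n; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

With `std::memory_order_release`, each unlock synchronizes with the next successful lock, so `count_with_lock<std::memory_order_release>(10000)` returns 20000. Instantiating the same code with `std::memory_order_relaxed` removes that synchronization, so the critical sections are no longer ordered by happens-before under the C/C++ memory model, even though an x86 machine may never exhibit the difference.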

Mabain.

Mabain is a lightweight key-value store library [22]. Mabain contains a few test drivers that insert key-value pairs concurrently into the Mabain system; we used mb_multi_thread_insert_test.cpp. All tools discovered an application bug that caused assertions in the test driver to fail, although tsan11 required us to set a different number of threads than our standard test harness to detect it. For performance measurements, we turned off assertions in the test driver. All tools found data races in Mabain.

The application bug is as follows. The test driver has one asynchronous writer and a few workers. The workers and the writer communicate via a shared queue protected by a lock. The writer consumes jobs (insertion into the database) in the queue and inserts values into the Mabain database, while the workers submit jobs into the queue. When workers finish submitting all jobs into the queue, the writer is stopped. However, there is no check to make sure that all jobs in the queue have been cleared before the writer is stopped. Thus, after the writer is stopped, some values may not be found in the Mabain database, causing assertion failures.

The time reported in Table 1 was measured for inserting 100,000 key-value pairs into the Mabain system.
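The application bug described above can be modeled deterministically without threads. The sketch below is not Mabain's code: `WriterModel`, its methods, and `missing_keys` are hypothetical, the `budget` parameter stands in for how far the writer happened to get before being stopped, and draining the queue before stopping is just one possible fix.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>

// Simplified model of the test driver's asynchronous writer.
struct WriterModel {
    std::deque<int> queue;  // jobs submitted by workers
    std::set<int> db;       // keys already inserted into the database
    bool stopped = false;

    void submit(int key) { queue.push_back(key); }

    // The writer processes at most `budget` jobs unless it has been stopped.
    void writer_step(std::size_t budget) {
        while (!stopped && budget > 0 && !queue.empty()) {
            db.insert(queue.front());
            queue.pop_front();
            --budget;
        }
    }

    // Models the bug: stop without checking that the queue is empty.
    void stop_without_check() { stopped = true; }

    // One possible fix: drain the remaining jobs, then stop.
    void stop_after_drain() {
        while (!queue.empty()) {
            db.insert(queue.front());
            queue.pop_front();
        }
        stopped = true;
    }
};

// Returns how many of the `jobs` submitted keys are missing from the
// database once the writer has been stopped.
std::size_t missing_keys(bool drain_first, int jobs, std::size_t processed_before_stop) {
    WriterModel w;
    for (int k = 0; k < jobs; ++k) w.submit(k);
    w.writer_step(processed_before_stop);
    if (drain_first) {
        w.stop_after_drain();
    } else {
        w.stop_without_check();
    }
    std::size_t missing = 0;
    for (int k = 0; k < jobs; ++k) {
        if (w.db.count(k) == 0) ++missing;
    }
    return missing;
}
```

If the writer has handled 2 of 5 jobs when it is stopped, `missing_keys(false, 5, 2)` reports 3 missing keys, which corresponds to the failed lookups behind the assertion failures; `missing_keys(true, 5, 2)` reports 0.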

GDAX.

GDAX [7] implements an in-memory copy of the order book for the GDAX cryptocurrency exchange using a lock-free skip list with garbage collection from the libcds library [35]. The original GDAX fetches data from a server, but we have recorded input data from a previous run and modified GDAX to read local data. All tools reported data races in GDAX.

In our experiment, GDAX was run for 120s each time, during which 5 threads kept iterating over the data set. We counted the number of times the data set was iterated over under each tool in each run and computed statistics based on 10 runs.

Iris.

Iris [67] is a low-latency asynchronous C++ logging library that buffers data using lock-free circular queues. The test driver we used to measure performance was test_lfringbuffer.cpp, in which there is one producer and one consumer. To make the test driver finish in a timely manner, we reduced the number of ITERATIONS to 1 million in the test driver. All tools reported data races in Iris.

Firefox JavaScript Engine.

We compiled the Firefox JavaScript engine release 50.0.1 following the instructions for building the JavaScript shell with Thread Sanitizer given by the developers of Firefox. We tested the JavaScript engine with the JSBench suite, which contains 25 JavaScript benchmarks sampled from real-world applications. The Python script of JSBench first calculated the arithmetic mean of each of the 25 benchmarks over 10 runs, and then took the geometric mean of the 25 arithmetic means, as reported in Table 1.
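The aggregation described above (an arithmetic mean per benchmark, then a geometric mean across benchmarks) can be written out as a short sketch. This is not the JSBench script: the function names are hypothetical and the input values in the usage note are made up purely to check the arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of the run times of one benchmark.
double arithmetic_mean(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Geometric mean computed in log space; all inputs must be positive.
double geometric_mean(const std::vector<double>& xs) {
    double log_sum = 0.0;
    for (double x : xs) log_sum += std::log(x);
    return std::exp(log_sum / static_cast<double>(xs.size()));
}

// One inner vector per benchmark, holding that benchmark's run times.
double suite_score(const std::vector<std::vector<double>>& runs_per_benchmark) {
    std::vector<double> means;
    for (const auto& runs : runs_per_benchmark) {
        means.push_back(arithmetic_mean(runs));
    }
    return geometric_mean(means);
}
```

For two benchmarks with runs {1, 3} and {8, 8}, the per-benchmark means are 2 and 8, and the suite score is sqrt(2 × 8) = 4. The same `geometric_mean` function applies to per-benchmark speedups when summarizing a tool across benchmarks.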

To assess the ability of C11Tester to discover data races, we also used the data structure benchmarks that were originally used to evaluate CDSChecker and subsequently modified to evaluate tsan11 and tsan11rec. We used the version of the benchmarks available at https://github.com/mc-imperial/tsan11. Note that sleep statements were added to 6 of these benchmarks to induce some variability in the schedules explored by tsan11 [40]. We replicated the same timing strategy used in [40] and reported times that were the sum of the user time and system time measured by the time command. Due to differences in the implementation of the sleep statement, sleep time is partially included in C11Tester's user time and thus we removed the sleep statements for C11Tester to make the comparison fair. We executed the benchmarks in the all-core configuration to ensure that we did not put tsan11 at a disadvantage since it does not control the thread schedule.

Table 2 summarizes the experiment results for the data structure benchmarks. The times reported in Table 2 were averaged over 500 runs, and the rate columns report data race detection rates based on 500 runs. Out of 7 benchmarks, C11Tester detects data races with rates higher than tsan11rec in 4 benchmarks and tsan11 in 5 benchmarks. Tsan11 and tsan11rec did not detect races in chase-lev-deque, but C11Tester did. All three tools always detected races in ms-queue.

Related work falls into three categories: model checkers, fuzzers, and race detectors.

Model Checkers.

In the context of weak hardware memory models, researchers have developed stateful model checkers [33, 38, 50]. Stateful model checkers however are limited by the state explosion problem and have the general problem of comparing abstractly equivalent but concretely different program states.

Stateless model checkers have been developed for the C/C++ memory model. CDSChecker can model check real-world C/C++ concurrent data structures [47, 48]. More recent work has led to the development of other model checking tools that can efficiently check fragments of the C/C++ memory model [5, 36, 37]. Recent work on model checking for sequential consistency has developed partial order reduction techniques that only explore all reads-from relations and do not need to explore all sequentially consistent orderings [4]. Other tools such as Herd [6], Nitpick [12], and CppMem [10] are intended to help understand the behaviors of memory models and do not scale to real-world data structures.

CHESS [46] is designed to find and reproduce concurrency bugs in C, C++, and C … reorder memory operations. Line-Up [19] extends CHESS to check for linearizability. Like CHESS, it can miss bugs that are exposed by reordering of memory operations. The Inspect tool combines stateless model checking and stateful model checking to model check C and C++ code [61–64]. The Inspect tool checks code using the sequential consistency model rather than the weaker C/C++ memory model and therefore may miss concurrency bugs arising from reordered memory operations.

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer

Dynamic Partial Order Reduction [29] and Optimal Dynamic Partial Order Reduction [2] seek to make stateless model checking more efficient by skipping equivalent executions. Maximal causality reduction [30] further refines the technique with the insight that it is only necessary to explore executions in which threads read different values.
Recent work has extended these algorithms to handle the TSO and PSO memory models [3, 32, 65]. SATCheck further develops partial order reduction with the insight that it is only necessary to explore executions that exhibit new behaviors [21]. CheckFence checks concurrent code by translating it into SAT [18]. Despite these advances, model checking faces fundamental limitations that prevent it from scaling to full applications.

Fuzzers.

The Relacy race detector [60] explores thread interleavings and memory operation reorderings for C++ code. The Relacy race detector has several limitations that cause it to miss executions allowed by the C/C++ memory model. Relacy imposes an execution order on the program under test in which it executes the program. Relacy then derives the modification order from the execution order; it cannot simulate (legal) executions in which the modification order is inconsistent with the execution order.

Industry tools like the IBM ConTest tool support testing concurrent software. They work by injecting noise into the execution schedule [58]. While such tools may increase the likelihood of finding races, they do not precisely control the schedule. They also do not handle weak memory models like the C/C++ memory model. Adversarial memory increases the likelihood of observing weak memory system behaviors for the purpose of testing [28]. In the context of Java, prescient memory can simulate some of the weak behaviors allowed by the Java memory model [20]. Prescient memory however requires that the entire application be amenable to deterministic record and replay and uses a single profiling run to generate future values, limiting the executions it can discover.

Concutest-JUnit extends JUnit with checks for concurrent unit tests and support for perturbing schedules using randomized waits [52]. Concurrit is a DSL designed to help reproduce concurrency bugs [24]. Developers write code in a DSL to help guide Concurrit to a bug-reproducing schedule. CalFuzzer more uniformly samples non-equivalent thread interleavings by using techniques inspired by partial order reduction [54]. These approaches are largely orthogonal to C11Tester.

Race Detectors.

Several tools have been designed to detect data races in code that uses standard lock-based concurrency control [25–27, 31, 42]. These tools typically verify that all accesses to shared data are protected by a locking discipline. They miss higher-level semantic races that occur when the locks allow unexpected orderings that produce incorrect results.


Tsan11 extends the tsan tool to support a fragment of the C/C++11 memory model [40]. Tsan11 supports a restricted version of the C/C++ memory model and cannot produce many of the behaviors that real-world programs may exhibit. In particular, it requires that the modification order relation can be extended to a total order (that is, the order in which tsan executes the statements) and thus can only produce executions in which the modification order for all memory locations is a total order. Tsan11 also does not control the thread schedule: threads execute in whatever order happens to occur. Tsan11rec extends tsan11 with support for controlled execution [41]. It has the same limitation regarding the fragment of the memory model it supports.

In [66], the lockset algorithm [53] is implemented in hardware. However, such data race detection algorithms have a high false-positive rate due to their inferential nature as well as their handling of a limited portfolio of synchronization primitives, namely locks. Moreover, these approaches were not designed to detect errors in concurrent data structures based on atomic operations.

10 CONCLUSION

We have presented C11Tester, which implements a novel approach for efficiently testing C/C++11 programs. C11Tester supports a larger fragment of the C/C++ memory model than prior work while still delivering performance competitive with prior systems. C11Tester uses a constraint-based approach to the modification order that allows testing tools to make decisions about the modification order implicitly when they select the store that a load reads from. C11Tester includes a data race detector that can identify races. C11Tester supports controlled scheduling for C/C++11 at lower overhead than prior systems. Our evaluation shows that C11Tester can find bugs in all of our benchmark applications including bugs that were missed by other tools.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their thorough and insightful comments. We are especially grateful to our shepherd Caroline Trippel for her feedback. We also thank Derek Yeh for his work on performance improvement for the C11Tester tool. This work is supported by the National Science Foundation grants CNS-1703598, OAC-1740210, and CCF-2006948.

REFERENCES

Proceedings of the 2014 Symposium on Principles of Programming Languages, pages 373–384, 2014.
[3] Parosh Aziz Abdulla, Stavros Aronis, Mohamed Faouzi Atig, Bengt Jonsson, Carl Leonardsson, and Konstantinos Sagonas. Stateless model checking for TSO and PSO. In Proceedings of the 21st International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 353–367, 2015.
[4] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, Magnus Lång, Tuan Phong Ngo, and Konstantinos Sagonas. Optimal stateless model checking for reads-from equivalence under sequential consistency. Proceedings of the ACM on Programming Languages, 3(OOPSLA):150:1–150:29, October 2019.
[5] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Tuan Phong Ngo. Optimal stateless model checking under the release-acquire semantics. Proceedings of the ACM on Programming Languages, 2(OOPSLA):135:1–135:29, October 2018.
[6] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Transactions on Programming Languages and Systems, 36(2):7:1–7:74, July 2014.
[7] F. Eugene Aumson. gdax-orderbook-hpp. https://github.com/feuGeneA/gdax-orderbook-hpp, June 2018.
[8] Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. The problem of programming language concurrency semantics. In European Symposium on Programming Languages and Systems
Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2011.
[11] Pete Becker. ISO/IEC 14882:2011, Information technology – programming languages – C++, 2011.
[12] Jasmin Christian Blanchette, Tjark Weber, Mark Batty, Scott Owens, and Susmit Sarkar. Nitpicking C++ concurrency. In Proceedings of the 13th International ACM SIGPLAN Symposium on Principles and Practices of Declarative Programming, pages 113–124, 2011.
[13] Hans Boehm and Brian Demsky. Outlawing ghosts: Avoiding out-of-thin-air results. In Proceedings of ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, pages 7:1–7:6, June 2014.
[14] Hans-J. Boehm. Can seqlocks get along with programming language memory models? In Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Proceedings of the 2007 Conference on Programming Language Design and Implementation, pages 12–21, 2007.
[19] Sebastian Burckhardt, Chris Dern, Madanlal Musuvathi, and Roy Tan. Line-up: A complete and automatic linearizability checker. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 330–340, 2010.
[20] Man Cao, Jake Roemer, Aritra Sengupta, and Michael D. Bond. Prescient memory: Exposing weak memory model behavior by looking into the future. In Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, pages 99–110, 2016.
[21] Brian Demsky and Patrick Lam. SATCheck: SAT-directed stateless model checking for SC and TSO. In Proceedings of the 2015 Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 20–36, October 2015.
[22] Changxue Deng. Mabain: A fast and light-weighted key-value store library. https://github.com/chxdeng/mabain, November 2018.
[23] Ulrich Drepper. ELF handling for thread-local storage. https://akkadia.org/drepper/tls.pdf, August 2013.
[24] Tayfun Elmas, Jacob Burnim, George Necula, and Koushik Sen. Concurrit: A domain specific language for reproducing concurrency bugs. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 153–164, 2013.
[25] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. Goldilocks: A race and transaction-aware Java runtime. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 245–255, 2007.
[26] Dawson Engler and Ken Ashcraft. RacerX: Effective, static detection of race conditions and deadlocks. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 237–252, 2003.
[27] Cormac Flanagan and Stephen N. Freund. FastTrack: Efficient and precise dynamic race detection. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 121–133, 2009.
[28] Cormac Flanagan and Stephen N. Freund. Adversarial memory for detecting destructive races. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 244–254, 2010.
[29] Cormac Flanagan and Patrice Godefroid. Dynamic partial-order reduction for model checking software. In Proceedings of the 2005 Symposium on Principles of Programming Languages, pages 110–121, 2005.
[30] Jeff Huang. Stateless model checking concurrent programs with maximal causality reduction. In Proceedings of the 2015 Conference on Programming Language Design and Implementation, pages 165–174, 2015.
[31] Jeff Huang, Patrick Meredith, and Grigore Rosu. Maximal sound predictive race detection with control flow abstraction. In Proceedings of the 35th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'14), pages 337–348. ACM, June 2014.
[32] Shiyou Huang and Jeff Huang. Maximal causality reduction for TSO and PSO. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 447–461, 2016.
[33] Bengt Jonsson. State-space exploration for concurrent algorithms under weak memory orderings. SIGARCH Computer Architecture News, 36(5):65–71, June 2009.
[34] ISO JTC. ISO/IEC 9899:2011, Information technology – programming languages – C, 2011.
[35] Max Khiszinsky. https://github.com/khizmax/libcds, Dec 2017.
[36] Michalis Kokologiannakis, Ori Lahav, Konstantinos Sagonas, and Viktor Vafeiadis. Effective stateless model checking for C/C++ concurrency. Proceedings of the ACM on Programming Languages, 2(POPL):17:1–17:32, December 2017.
[37] Michalis Kokologiannakis, Azalea Raad, and Viktor Vafeiadis. Model checking for weakly consistent libraries. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, pages 96–110, 2019.
[38] Michael Kuperstein, Martin Vechev, and Eran Yahav. Automatic inference of memory fences. In Proceedings of the Conference on Formal Methods in Computer-Aided Design, pages 111–120, 2010.
[39] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.

Communications of the ACM , 21(7):558–565, July 1978.[40] Christopher Lidbury and Alastair F. Donaldson. Dynamic race detection forC++11. In

Proceedings of the 44th ACM SIGPLAN Symposium on Principles ofProgramming Languages , POPL 2017, pages 443–457, New York, NY, USA, 2017.ACM.[41] Christopher Lidbury and Alastair F. Donaldson. Sparse record and replay withcontrolled scheduling. In

Proceedings of the 40th ACM SIGPLAN Conference onProgramming Language Design and Implementation , PLDI 2019, pages 576–593,2019.[42] Brandon Lucia, Luis Ceze, Karin Strauss, Shaz Qadeer, and Hans Boehm. Conflictexceptions: Simplifying concurrent language semantics with precise hardware ex-ceptions for data-races. In

Proceedings of the 37th Annual International Symposiumon Computer Architecture , pages 210–221, 2010.[43] Nuno Machado, Brandon Lucia, and Luís Rodrigues. Concurrency debuggingwith differential schedule projections.

SIGPLAN Not. , 50(6):586–595, June 2015.[44] Nuno Machado, Brandon Lucia, and Luís Rodrigues. Production-guided concur-rency debugging.

SIGPLAN Not. , 51(8), February 2016.[45] Pablo Montesinos, Luis Ceze, and Josep Torrellas. Delorean: Recording anddeterministically replaying shared-memory multiprocessor execution efficiently.

SIGARCH Comput. Archit. News , 36(3):289–300, June 2008.[46] Madanlal Musuvathi, Shaz Qadeer, Piramanayagam Arumuga Nainar, ThomasBall, Gerard Basler, and Iulian Neamtiu. Finding and reproducing Heisenbugsin concurrent programs. In

Proceedings of the Eighth USENIX Symposium onOperating Systems Design and Implementation , pages 267–280, 2008.[47] Brian Norris and Brian Demsky. CDSChecker: Checking concurrent data struc-tures written with C/C++ atomics. In

Proceedings of the 2013 Conference onObject-Oriented Programming, Systems, Languages, and Applications , pages 131–150, October 2013.[48] Brian Norris and Brian Demsky. A practical approach for model checkingC/C++11 code.

ACM Transactions on Programming Languages and Systems ,38(3):10:1–10:51, May 2016.[49] Peizhao Ou and Brian Demsky. Towards understanding the costs of avoiding out-of-thin-air results.

Proceedings of the ACM on Programming Languages Volume 2Issue OOPSLA , 2(OOPSLA):136:1–136:29, October 2018.[50] Seungjoon Park and David L. Dill. An executable specification and verifier forrelaxed memory order.

IEEE Transactions on Computers , 48(2):227–235, February1999.[51] Gregor Richards, Andreas Gal, Brendan Eich, and Jan Vitek. Automated con-struction of javascript benchmarks. In

Proceedings of the 2011 ACM InternationalConference on Object Oriented Programming Systems Languages and Applications ,OOPSLA ’11, page 677–694, New York, NY, USA, 2011. Association for ComputingMachinery.[52] Mathias Guenter Ricken.

A Framework for Testing Concurrent Programs . PhDthesis, Houston, TX, USA, 2011. AAI3463989.[53] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and ThomasAnderson. Eraser: A dynamic data race detector for multithreaded programs.

ACM Transactions on Computer Systems , 15:391–411, November 1997.[54] Koushik Sen. Effective random testing of concurrent programs. In

Proceedingsof the Twenty-second IEEE/ACM International Conference on Automated SoftwareEngineering , ASE ’07, pages 323–332, 2007.[55] Stephen Tu, Wenting Zheng, and Eddie Kohler. Silo: Multicore in-memory storageengine. https://github.com/stephentu/silo, March 2015.[56] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden.Speedy transactions in multicore in-memory databases. In

Proceedings of theTwenty-Fourth ACM Symposium on Operating Systems Principles , SOSP ’13, pages18–32, 2013. [57] Paul Turner. User-level threads...with threads. https://blog.linuxplumbersconf.org/2013/ocw/system/presentations/1653/original/LPC%20-%20User%20Threading.pdf

Proceedings of the 2013 ACM SIGPLAN InternationalConference on Object-Oriented Programming, Systems, Languages, and Applications ,pages 867–884, 2013.[60] Dmitriy Vyukov. Relacy race detector. http://relacy.sourceforge.net/, October2011.[61] Chao Wang, Yu Yang, Aarti Gupta, and Ganesh Gopalakrishnan. Dynamic modelchecking with property driven pruning to detect race conditions.

ATVA LNCS ,(126–140), 2008.[62] Yu Yang, Xiaofang Chen, Ganesh Gopalakrishnan, and Robert M. Kirby. Dis-tributed dynamic partial order reduction based verification of threaded software.In

Proceedings of the 14th International SPIN Conference on Model Checking Soft-ware , pages 58–75, 2007.[63] Yu Yang, Xiaofang Chen, Ganesh Gopalakrishnan, and Robert M. Kirby. Effi-cient stateful dynamic partial order reduction. In

Proceedings of the FifteenthInternational SPIN Workshop , pages 288–305, August 2008.[64] Yu Yang, Xiaofang Chen, Ganesh Gopalakrishnan, and Chao Wang. Automaticdiscovery of transition symmetry in multithreaded programs using dynamicanalysis. In

Proceedings of the 16th International SPIN Workshop on Model CheckingSoftware , pages 279–295, 2009.[65] Naling Zhang, Markus Kusano, and Chao Wang. Dynamic partial order reductionfor relaxed memory models. In

Proceedings of the 36th ACM SIGPLAN Conferenceon Programming Language Design and Implementation , pages 250–259, 2015.[66] Pin Zhou, Radu Teodorescu, and Yuanyuan Zhou. HARD: Hardware-assistedlockset-based race detection. In

Proceedings of the 2007 IEEE 13th InternationalSymposium on High Performance Computer Architecture , pages 121–132, 2007.[67] Xinjing Zhou. Iris: A low latency asynchronous C++ logging library. https://github.com/zxjcarrot/iris, October 2015.

Technical Report

Conference’17, July 2017, Washington, DC, USA

A PROOF OF EQUIVALENCE BETWEEN OPERATIONAL AND AXIOMATIC MODELS

We first present the formalization of our axiomatic model. We then show how to lift a trace produced by our operational model to an axiomatic-style execution and prove that lifting the set of traces produced by our operational model gives rise to executions that exactly match the executions allowed by our restricted axiomatic model. We refer to the axiomatic model that is based on the C++11 memory model but incorporates the first and the third changes described in Section A.1 below as the modified C++11 memory model. We refer to the axiomatic model presented in Section A.1 as the restricted axiomatic model or our axiomatic model. Our axiomatic model is stronger than the modified C++11 memory model.

We also introduce some notation in this section. Let $P$ be a program written in our language described in Figure 8 of the paper. Let $\mathit{Consistent}(P)$ denote the set of executions allowed by the modified C++11 memory model, $\mathit{rConsistent}(P)$ denote the set of executions allowed by our axiomatic memory model, and $\mathit{traces}(P)$ denote the set of traces produced by our operational model. We use $\sigma$ to denote an individual trace, which is a finite sequence of state transitions, i.e.,

$\sigma = s_0 \xrightarrow{t_1} s_1 \xrightarrow{t_2} \cdots \xrightarrow{t_m} s_m$.

The set of axiomatic-style executions obtained by lifting a trace $\sigma$ is denoted as $\mathit{lift}(\sigma)$. Lifting a trace gives rise to a set of executions because the extension of the mo-graph is not unique, as explained in Section A.2. We also write $\mathit{lift}(\sigma)$ when we wish to refer to a single execution in the set $\mathit{lift}(\sigma)$.

A.1 Restricted Axiomatic Model

We present the formalization of our axiomatic model by making the following changes to the formalization of Batty et al. [9, 10]:

1) Use the C/C++20 release sequence definition: This corresponds to changing the definition of "rs_element" (in Section 3.6 of their formalization) by dropping the "same_thread a rs_head" term.

2) Add the requirement that hb ∪ sc ∪ rf is acyclic: This is implemented by adding the following to their formalization in Section 3:
• Add the following definition: "let acyclic_hb_sc_rf actions hb sc rf = irrefl actions tc ( hb ∪ sc ∪ rf )"
• Add the following term to the conjunct in Section 3.11: "acyclic_hb_sc_rf Xo.actions hb Xw.sc Xw.rf ∧"

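3) Strengthen consume atomics to acquire: This is implemented with the following two changes in Section 2.1:
• Make is_consume always false.
• Change the MO_CONSUME case for is_acquire a to "is_read a ∨ is_fence a".

Therefore, the set of executions allowed by our restricted axiomatic model can be expressed as:

$\mathit{rConsistent}(P) = \mathit{Consistent}(P) \wedge \mathit{acyclic}(\mathit{hb} \cup \mathit{sc} \cup \mathit{rf})$.

Since we do not consider the consume memory ordering, the hb relation is the transitive closure of sb and sw. In the simple language described in Figure 8, an sw edge is either an additional-synchronizes-with (asw) edge, an rf edge, or a combination of sb and rf edges (release-acquire synchronization involving fences). Therefore, we have $\mathit{sb} \cup \mathit{asw} \subseteq \mathit{hb} \subseteq \mathit{sb} \cup \mathit{asw} \cup \mathit{rf}$, and we can deduce that $\mathit{sb} \cup \mathit{asw} \cup \mathit{sc} \cup \mathit{rf} \subseteq \mathit{hb} \cup \mathit{sc} \cup \mathit{rf} \subseteq \mathit{sb} \cup \mathit{asw} \cup \mathit{sc} \cup \mathit{rf}$, which implies that $\mathit{hb} \cup \mathit{sc} \cup \mathit{rf} = \mathit{sb} \cup \mathit{asw} \cup \mathit{sc} \cup \mathit{rf}$. Here the upper bounds and the equality are understood up to transitive closure, which is all that acyclicity depends on. Therefore,

$\mathit{rConsistent}(P) = \mathit{Consistent}(P) \wedge \mathit{acyclic}(\mathit{sb} \cup \mathit{asw} \cup \mathit{sc} \cup \mathit{rf})$.

A.2 Lifting Traces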
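Before the lifting construction is introduced, it may help to read the extra constraint of the restricted model as a plain graph check: collect the sb, asw, sc, and rf edges of an execution and test that their union has no cycle. The Python sketch below illustrates this under simplifying assumptions. The function name `is_acyclic`, the representation of each relation as a set of (source, destination) pairs, and the event names are hypothetical and are not part of the formalization of Batty et al. or of C11Tester.

```python
from collections import defaultdict


def is_acyclic(events, *relations):
    """Return True if the union of the given edge sets contains no cycle."""
    successors = defaultdict(set)
    for relation in relations:
        for src, dst in relation:
            successors[src].add(dst)

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {e: UNSEEN for e in events}

    def visit(e):
        # Depth-first search; reaching an ACTIVE node again closes a cycle.
        state[e] = ACTIVE
        for nxt in successors[e]:
            if state[nxt] == ACTIVE:
                return False
            if state[nxt] == UNSEEN and not visit(nxt):
                return False
        state[e] = DONE
        return True

    return all(state[e] != UNSEEN or visit(e) for e in events)


# Load-buffering shape: each thread loads one location, then stores the other.
# a: load y, b: store x (thread 1); c: load x, d: store y (thread 2).
events = ["a", "b", "c", "d"]
sb = {("a", "b"), ("c", "d")}
rf_cyclic = {("b", "c"), ("d", "a")}  # each load reads the other thread's later store
rf_benign = {("b", "c")}              # only c reads from b

cyclic_allowed = is_acyclic(events, sb, rf_cyclic)  # cycle a -> b -> c -> d -> a
benign_allowed = is_acyclic(events, sb, rf_benign)
```

In this toy example the first candidate execution contains a cycle in sb ∪ rf and would be rejected by the added acyclicity term, while the second would pass it. The check says nothing about the remaining consistency axioms.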

We need to extend our operational states with auxiliary labels in order to track events. We define a label as

$\mathit{Label} \triangleq \mathbb{N} \cup \{\bot\}$.

We extend ThrState with a last sequenced before (lsb) label and a last additional synchronizes with (lasw) label to track sb and asw relations, and State with a last sequentially consistent (lsc) label to track sc relations. The lasw label stores the last instruction the parent thread performs before forking a new thread. Each load, store, or RMW element will have an event label representing its unique event id.

The Load, Store, RMW, and Fence in Figure 8 correspond to atomic load, store, RMW, and fence events in an execution. Loads from and stores to LocNA correspond to non-atomic reads and writes in an execution. The event labels inside LoadElem, StoreElem, RMWElem, and FenceElem will match the event ids of their corresponding events in the execution.

We now describe how to lift traces. The instructions referred to below are the ones that create events in an execution.

When a thread $T$ performs an instruction and $T.\mathit{lsb} \neq \bot$, an sb edge is created from $T.\mathit{lsb}$ to the current instruction. Similarly, when a seq_cst instruction is performed and $\Sigma.\mathit{lsc} \neq \bot$, an sc edge is created from $\Sigma.\mathit{lsc}$ to the current seq_cst instruction. The rf edges can be created by inspecting traces and checking the rf fields of LoadElems and RMWElems.

The asw edges can be created in two ways: a) When a thread $T$ performs a Fork instruction, creating a new thread $T'$, the new thread $T'$ stores $T.\mathit{lsb}$ in the field $T'.\mathit{lasw}$. Then when $T'$ performs an instruction and $T'.\mathit{lasw} \neq \bot$, an asw edge is created; b) When thread $T'$ has finished, the parent thread $T$ performs a Join instruction with the thread id of $T'$. If $T'.\mathit{lsb} \neq \bot$, an asw edge is created when $T$ performs the next instruction.

The mo-graph of a trace models mo relations in an execution. However, the mo-graph at a memory location may sometimes only capture a partial order over all stores to the location. We know that a partial order can always be extended to a total order. Therefore, we can perform a topological sort of the mo-graph at each memory location, and extend the obtained mo relations to total orders if necessary. Then, we can create mo edges in the lifted trace based on the extended mo relations. Since linear extensions of a partial order are not unique, lifting a trace may give rise to multiple axiomatic-style executions.

Since the operational model requires atomic loads to read from stores that have been processed by the operational model, and the sb, sc, and asw edges are constructed to be consistent with the program order in the lifting process, we have that $\mathit{sb} \cup \mathit{asw} \cup \mathit{rf} \cup \mathit{sc}$ is acyclic. We summarize this property in the lemma below.

Lemma 4. Let $P$ be an arbitrary program and $\sigma \in \mathit{traces}(P)$. Then for any $\mathrm{E} \in \mathit{lift}(\sigma)$, the union of sb, asw, sc, and rf edges in $\mathrm{E}$ is acyclic.

A.3 Equivalence of Axiomatic and Operational Models
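The results in this part of the appendix quantify over every execution in lift(σ), which is a set because a partial mo-graph can have several linear extensions (Section A.2). The following Python sketch enumerates those extensions for a single memory location. It assumes the mo-graph is given as a set of (earlier, later) store pairs; the function name `linear_extensions` and the store names are illustrative only and do not come from C11Tester.

```python
def linear_extensions(stores, mo_edges):
    """Yield every total order of `stores` that respects all pairs in `mo_edges`."""
    # Stores that must already be placed before each store can be placed.
    preds = {s: {a for a, b in mo_edges if b == s} for s in stores}

    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        for s in sorted(remaining):
            if preds[s] <= set(prefix):
                yield from extend(prefix + [s], remaining - {s})

    yield from extend([], frozenset(stores))


# Three stores to one location; the mo-graph only orders w1 before w2.
stores = ["w1", "w2", "w3"]
mo_graph = {("w1", "w2")}
extensions = list(linear_extensions(stores, mo_graph))
for total_order in extensions:
    print(" -> ".join(total_order))
```

Each printed order is one candidate modification order, and hence one candidate lifted execution for this location; a cyclic mo-graph would yield no extension at all. Enumerating every extension is exponential in general and is used here only to make the non-uniqueness visible.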

Our goal is to show that for an arbitrary program $P$, the set of executions allowable by our restricted axiomatic model is equivalent to the set of executions we get by lifting traces that our operational model can produce, i.e.,

$\forall P\, \forall \mathrm{E}.\ \mathrm{E} \in \mathit{rConsistent}(P) \Leftrightarrow \exists \sigma \in \mathit{traces}(P).\ \mathrm{E} \in \mathit{lift}(\sigma)$.

Definition 1. Let $P$ be an arbitrary program and $\mathrm{E}$ be an axiomatic-style execution of $P$. We define $E$ as the execution that only contains the sb, asw, sc, and rf edges in $\mathrm{E}$ together with the events in $\mathrm{E}$.

Given an execution $\mathrm{E} \in \mathit{rConsistent}(P)$ that consists of $n$ events, $E$ is a DAG, which can be topologically sorted to give an ordering, $e_1, \ldots, e_n$, that is consistent with the order in which events are added to $\mathrm{E}$ as the program is running.

[Figure 17: Example where the partial execution graph $\mathrm{E}_2$ misses mo edges. Events: e1: v.store(1); e2: v.store(3); e3: v.fetch_add(1); the figure draws one sb edge and one rf edge.]

Based on the topological sort, we define the partial execution graph $\mathrm{E}_i$ of $\mathrm{E}$ as the execution that consists of the first $i$ events, together with the sb, asw, sc, rf, and mo edges whose sources and destinations are both events in $\mathrm{E}_i$, where $0 \le i \le n$. $\mathrm{E}_0$ is defined as the empty execution.

Since we do not consider mo edges in the topological sort, some partial execution graph $\mathrm{E}_i$ may contain events that are related by modification order in $\mathrm{E}$ while the corresponding mo edges are missing in $\mathrm{E}_i$. For example, in Figure 17, $\{e_1, e_2, e_3\}$ is a valid topological sort of $E$, and we also have $e_1 \xrightarrow{mo} e_3 \xrightarrow{mo} e_2$, but no mo edge is present in the partial execution graph $\mathrm{E}_2$. To deal with this issue, we define the modification order at an atomic location $M$ in partial execution graph $\mathrm{E}_i$ as a total order $S^M_i$ over the events of $\mathrm{E}_i$ that modify $M$ such that $S^M_i$ is consistent with the modification order at location $M$ in the complete execution graph $\mathrm{E}$. For atomic modifications $X$ and $Y$ in $S^M_i$, if $X$ precedes $Y$, then we write $X \xrightarrow{S^M_i} Y$.

The equivalence proof between the axiomatic and operational models has two directions, which we break down into Lemma 5 and Lemma 6. In the proofs of the two lemmas below, for a store event $e_i$ in $\mathrm{E}$, there is a corresponding node in the mo-graph of the equivalent trace in the operational model. Although $e_i$ is technically an event in an axiomatic execution, we sometimes abuse the notation and use $e_i$ to refer to the corresponding node in the mo-graph of the equivalent trace. If the node corresponding to $e_i$ is modification ordered before the node corresponding to $e_j$ in a mo-graph, then we may say $e_i \xrightarrow{mo} e_j$ exists in the mo-graph.

In the forward direction (Lemma 5), the proof strategy is to apply induction on the construction of partial execution graphs. More specifically, if $\mathrm{E}_i$ is a partial execution graph of $\mathrm{E}$ such that there exists at least one trace $\sigma_i$ where $\mathrm{E}_i \in \mathit{lift}(\sigma_i)$, then when $\mathrm{E}_i$ is extended to $\mathrm{E}_{i+1}$, we can construct a trace $\sigma_{i+1}$ that is an extension of $\sigma_i$ such that $\mathrm{E}_{i+1} \in \mathit{lift}(\sigma_{i+1})$.

Lemma 5. Let $P$ be an arbitrary program, and $\mathrm{E} \in \mathit{rConsistent}(P)$ be an execution. Then there exists a trace $\sigma \in \mathit{traces}(P)$ such that $\mathrm{E} \in \mathit{lift}(\sigma)$.

Proof. Let $P$ be an arbitrary program, and $\mathrm{E} \in \mathit{rConsistent}(P)$ be an execution. Let $e_1, \ldots, e_n$ be a topological sort of $E$. We will prove the lemma by induction on the construction of partial execution graphs of $\mathrm{E}$, as described in our proof strategy.

Base case: When $i = 0$, $\mathrm{E}_0$ is the empty execution. We can take the initial trace $\sigma_0$ that is the initial state of the program $P$ without any transitions, and $\mathrm{E}_0 \in \mathit{lift}(\sigma_0)$. At this point the mo-graph is empty as well.

Inductive step: Suppose that for some $i < n$, we have constructed the partial execution graph $\mathrm{E}_i$ and that there exists some trace $\sigma_i$ such that $\mathrm{E}_i \in \mathit{lift}(\sigma_i)$. We will assume throughout the proof that the instructions used for the state transitions match the events being added. Also note that each visible instruction adds a context switch after it, so there is always the ability to change to the required thread. We will extend $\mathrm{E}_i$ by adding the next event $e_{i+1}$, and show that we can construct a $\sigma_{i+1}$ that is the extension of $\sigma_i$ such that $\mathrm{E}_{i+1} \in \mathit{lift}(\sigma_{i+1})$.
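The objects used in this induction can be made concrete with a small Python sketch. It assumes a Figure 17-style execution with three events on one location: an sb edge from e1 to e2, an rf edge from e1 to e3, and the modification order e1, e3, e2. These edge directions are our reading of the figure, and the helper names `topological_sort`, `partial_graph`, and `restricted_mo` are hypothetical.

```python
def topological_sort(events, edges):
    """Order events so that every edge goes from an earlier to a later event."""
    order = []
    while len(order) < len(events):
        for e in events:
            ready = all(src in order for src, dst in edges if dst == e)
            if e not in order and ready:
                order.append(e)
                break
        else:
            raise ValueError("edges contain a cycle")
    return order


def partial_graph(order, i, relations):
    """Events and edges of the i-th partial execution graph (first i events)."""
    prefix = set(order[:i])
    edges = {
        name: {(s, d) for s, d in rel if s in prefix and d in prefix}
        for name, rel in relations.items()
    }
    return order[:i], edges


def restricted_mo(order, i, mo_total):
    """The total modification order restricted to the first i events."""
    prefix = set(order[:i])
    return [e for e in mo_total if e in prefix]


events = ["e1", "e2", "e3"]                  # store(1), store(3), fetch_add(1)
sb = {("e1", "e2")}                          # assumed direction of the sb edge
rf = {("e1", "e3")}                          # assumed direction of the rf edge
mo_total = ["e1", "e3", "e2"]                # modification order in the full execution
mo_edges = {("e1", "e3"), ("e3", "e2")}      # immediate mo edges

order = topological_sort(events, sb | rf)
events_2, edges_2 = partial_graph(order, 2, {"sb": sb, "rf": rf, "mo": mo_edges})
s_2 = restricted_mo(order, 2, mo_total)
```

Under these assumptions the second partial graph contains e1 and e2 and the sb edge between them, but none of the immediate mo edges, because both of them touch e3. The restricted order `s_2` still records that e1 precedes e2, which is the role played by the total order $S^M_i$ in the proof.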
In the following, we will first analyze five different cases of incoming edges to $e_{i+1}$, and then show that we can extend the mo-graph of $\sigma_{i+1}$ to the modification orders in $\mathrm{E}_{i+1}$.

A.1. If $e_{i+1}$ has no incoming edges, then $e_{i+1}$ must be the event $e_1$, corresponding to the first visible instruction in the initial thread, because if $e_{i+1} \neq e_1$, there must be an sb edge or an asw edge coming to $e_{i+1}$. We can construct $\sigma_{i+1} = \sigma_1$ by letting the initial thread execute until the instruction corresponding to $e_1$ is executed.

A.2. If $e_{i+1}$ has an incoming sb edge from some event $e_j$, then from the last state of $\sigma_i$, we must switch to the thread that executes $e_j$ and continue until the instruction corresponding to $e_{i+1}$ is executed, to obtain $\sigma_{i+1}$.

A.3. If $e_{i+1}$ has an incoming asw edge, then we have two scenarios: $e_{i+1}$ is the first event in a newly created thread; or a thread is finishing and joining onto its parent thread, and $e_{i+1}$ is an event in the parent thread. In either case, let $e_j$ be the source of the asw edge, and $t$ be the thread performing $e_j$.

In the first case, there must be a Fork instruction after $e_j$ and before the next visible instruction in thread $t$. The event $e_j$ must also be the last event completed in thread $t$, because otherwise, the source of the asw edge would be some event other than $e_j$. To produce the trace $\sigma_{i+1}$, we can switch to thread $t$ and run the program until the Fork is done, switch to the newly created thread, and run until $e_{i+1}$ is done.

In the second case, the thread $t$ is finishing, and there are no more visible instructions in thread $t$. To produce the trace $\sigma_{i+1}$, we can switch to thread $t$ and run until $t$ finishes, then switch to the parent thread, and run until $e_{i+1}$ is done. We must encounter a Join instruction before $e_{i+1}$ is done, because otherwise the destination of the asw edge would be some event other than $e_{i+1}$.

A.4. If $e_{i+1}$ has an incoming sc edge from some event $e_j$, then it is similar to the case of sb. To obtain $\sigma_{i+1}$, we will switch to the thread that performs $e_{i+1}$ and continue until $e_{i+1}$ is done. There can be no seq_cst event sequenced before $e_{i+1}$ that has not yet been performed, because otherwise, there could not be an sc edge from $e_j$ to $e_{i+1}$ in $\mathrm{E}_{i+1}$.

A.5. If $e_{i+1}$ has an incoming rf edge from some event $e_j$, then we need to show that it is valid for $e_{i+1}$ to read from $e_j$ in $\sigma_{i+1}$. Let $e_{i+1}$ be an RMW or atomic read at $M$. We claim that $e_j$ belongs to the set constructed by the BuildMayReadFrom procedure when the operational model processes the instruction that creates $e_{i+1}$. The for loop in the procedure considers the thread that performs $e_j$. If event $e_j$ does not happen before $e_{i+1}$, then $e_j \in \mathit{base}$ at line 8. If $e_j$ happens before $e_{i+1}$, then there cannot be any event $e_k$ that modifies $M$ such that $e_j \xrightarrow{sb} e_k \xrightarrow{hb} e_{i+1}$, because in that case, Write-Read Coherence (CoWR) would forbid $e_{i+1}$ from reading from $e_j$ in $\mathrm{E}_{i+1}$. Therefore, $e_j \in \mathit{base}$ at line 8. Now we will show that $e_j$ is not removed at line 10 when $e_{i+1}$ has seq_cst memory ordering. If the last seq_cst modification of $M$ that precedes $e_{i+1}$ in the total order of sc, i.e., $S$ at line 4, does not exist, then we are done. Assume such $S$ exists. According to Section 29.3 statement 3 of the C++11 standard, $e_{i+1}$ either reads from $S$ or some non-seq_cst modification of $M$ that does not happen before $S$. The fact that $e_{i+1}$ reads from $e_j$ in $\mathrm{E}_{i+1}$ implies that $e_j$ is either $S$ or a non-seq_cst modification that does not happen before $S$. Hence, $e_j$ is not removed from $\mathit{base}$ at line 10. If $e_{i+1}$ is an RMW, then $e_j$ has not been read by any other RMW, because no two RMWs can read from the same modification in $\mathrm{E}_{i+1}$. Hence, $e_j$ is in the set returned from the BuildMayReadFrom procedure.

We still need to show that having $e_{i+1}$ read from $e_j$ does not create a cycle in the mo-graph. The discussion in paragraph B.2 below about how an atomic load updates the mo-graph shows that when $e_{i+1}$ reads from $e_j$, the updated mo-graph is still consistent with the modification orders in $\mathrm{E}_{i+1}$. Thus, $e_{i+1}$ reading from $e_j$ does not create a cycle in the mo-graph. We defer the proof to paragraph B.2.

Now we will show that the mo-graph of $\sigma_{i+1}$ can be extended to the modification orders in $\mathrm{E}_{i+1}$. The mo-graph is updated when the operational model processes an atomic store, load, or RMW, so we will assume that $e_{i+1}$ corresponds to an atomic store, load, or RMW. Otherwise, the modification orders in $\mathrm{E}_{i+1}$ are the same as those in $\mathrm{E}_i$, the mo edges in the mo-graph are not updated, and hence the mo-graph of $\sigma_{i+1}$ can be extended to the modification orders in $\mathrm{E}_{i+1}$ by the inductive hypothesis.

B.1.

Suppose $e_{i+1}$ is an atomic store that modifies atomic location $M$. We consider two cases: $e_{i+1}$ is the last element in $S^M_{i+1}$; or $e_{i+1}$ is not the last element in $S^M_{i+1}$.

In the first case, $e_{i+1}$ is the last element in the modification order $S^M_{i+1}$ of $\mathrm{E}_{i+1}$. Let $e_j$ be the second last element in $S^M_{i+1}$. Since $e_j$ precedes $e_{i+1}$ in modification order, the modification ordering is either forced by coherence rules, consistent with sc relations, consistent with Section 29.3 statement 7 of the C++11 standard, or not forced by any relations in $\mathrm{E}_{i+1}$. If the modification ordering between $e_j$ and $e_{i+1}$ is forced by coherence rules under the hb and rf relations in $\mathrm{E}_{i+1}$, then the applicable coherence rules can only be Coherence of Write-Write (CoWW) or Coherence of Read-Write (CoRW), because the other two coherence rules require rf relations that are not present in $\mathrm{E}_{i+1}$. For CoWW, line 12 in the WritePriorSet procedure considers the atomic store $X$ corresponding to $e_j$ and adds it to priorset. We claim that the store $X$ will not be filtered out by the last function call in line 13, because otherwise there would be an atomic store event $e_k$ (different from $e_{i+1}$) sequenced after $e_j$, contradicting the assumption that $e_j$ is the second last element in $S^M_{i+1}$. The case for CoRW is similar. If the modification ordering between $e_j$ and $e_{i+1}$ is consistent with sc relations, then $e_j$ and $e_{i+1}$ are both seq_cst atomic stores, and line 4 in the WritePriorSet procedure considers this case. If the modification ordering between $e_j$ and $e_{i+1}$ is consistent with Section 29.3 statement 7 of the C++11 standard, then this case is dealt with at line 11 in the WritePriorSet procedure. If the modification ordering between $e_j$ and $e_{i+1}$ is not forced by any relations in $\mathrm{E}_{i+1}$, then the mo-graph does not contain an mo edge from $e_j$ to $e_{i+1}$, and we are free to extend the mo-graph to include $S^M_{i+1}$. In fact, the algorithm WritePriorSet may add mo edges from events modification ordered before $e_j$ to $e_{i+1}$. This is not a problem, because adding such edges does not introduce modification ordering that is not present in the modification order of $\mathrm{E}_{i+1}$.

In the second case, let $e_j$ be the event immediately preceding $e_{i+1}$ in $S^M_{i+1}$, and $e_k$ be the event immediately succeeding $e_{i+1}$ in $S^M_{i+1}$. Without loss of generality, assume $e_k$ is the last event in $S^M_{i+1}$. The modification ordering between $e_j$ and $e_{i+1}$ could be any one of the cases discussed in the first case in this paragraph, so we do not repeat the same argument. However, since all other events in $\mathrm{E}_{i+1}$, including $e_k$, come before $e_{i+1}$ in the topological order, there is no chain of rf, sb, asw, and sc edges from $e_{i+1}$ to any other event in $\mathrm{E}_{i+1}$. Thus, the only possibility is that no relations in $\mathrm{E}_{i+1}$ force the existence of the modification ordering between $e_{i+1}$ and $e_k$. Hence, the mo edge between $e_{i+1}$ and $e_k$ does not exist in the mo-graph of $\sigma_{i+1}$, and we are free to extend the mo-graph to include $S^M_{i+1}$. If $e_k$ is not the last event in $S^M_{i+1}$, then the same argument applies to any event modification ordered after $e_k$ in $S^M_{i+1}$.

B.2.

Suppose that $e_{i+1}$ is an atomic load that reads from event $e_j$ at atomic location $M$. Adding event $e_{i+1}$ does not change the modification order at $M$ from $\mathrm{E}_i$ to $\mathrm{E}_{i+1}$, but performing the instruction corresponding to $e_{i+1}$ may change the mo-graph from $\sigma_i$ to $\sigma_{i+1}$. We will show that the mo-graph in $\sigma_{i+1}$ can still be extended to the modification orders in $\mathrm{E}_{i+1}$. Lines 6, 7, and 8 in the ReadPriorSet procedure consider statements 5, 4, and 6 in Section 29.3 of the standard. For each thread, if the stores identified by these three lines exist, the events corresponding to them are all modification ordered before $e_j$ in $\mathrm{E}_{i+1}$. Therefore, having mo edges from these stores to $e_j$ in the mo-graph does not conflict with the modification orders in $\mathrm{E}_{i+1}$. Line 9 in the ReadPriorSet procedure considers Write-Read Coherence (CoWR) and Read-Read Coherence (CoRR). If an mo edge from some event $e_k$ to $e_j$ in the mo-graph is induced by CoWR or CoRR, then $e_k$ must be modification ordered before $e_j$ in $\mathrm{E}_{i+1}$ by the standard. Hence, when extending $\sigma_i$ to $\sigma_{i+1}$, the newly created mo edges in the mo-graph do not conflict with the modification orders in $\mathrm{E}_{i+1}$. Because the mo-graph in $\sigma_i$ can be extended to the modification orders of $\mathrm{E}_i$ by the inductive hypothesis, the mo-graph in $\sigma_{i+1}$ can likewise be extended to the modification orders of $\mathrm{E}_{i+1}$.

B.3.

Suppose that $e_{i+1}$ is an atomic RMW that reads from $e_j$. An RMW modifies the mo-graph in three phases: performing an atomic load, migrating mo edges using the AddRMWEdge procedure, and performing an atomic store. We again consider two cases.

In the first case, $e_{i+1}$ is the last element in $S^M_{i+1}$ of $\mathrm{E}_{i+1}$. The first and third phases have been discussed in paragraphs B.2 and B.1, respectively. In the second phase, since $e_{i+1}$ is the last element in $S^M_{i+1}$, no edges in the mo-graph are migrated. Then an mo edge is added from $e_j$ to $e_{i+1}$ in the mo-graph, which does not conflict with the modification order $S^M_{i+1}$, because an atomic RMW is immediately modification ordered after the modification it reads from.

In the second case, $e_{i+1}$ is not the last element in $S^M_{i+1}$. Since $e_{i+1}$ reads from $e_j$, $e_j$ must immediately precede $e_{i+1}$ in $S^M_{i+1}$. The first phase of the RMW is equivalent to an atomic load. In the second phase, any outgoing mo edges from $e_j$ will be migrated to outgoing mo edges from $e_{i+1}$, and an mo edge is added from $e_j$ to $e_{i+1}$ in the mo-graph. The third phase is the same as the second case of an atomic store, except that for an event $e_k$ modification ordered after $e_{i+1}$ in $S^M_{i+1}$, there may exist mo edges from $e_{i+1}$ to $e_k$ in the mo-graph due to edge migrations in the second phase.

In both cases, the mo-graph of $\sigma_{i+1}$ does not contain mo edges that conflict with the modification orders in $\mathrm{E}_{i+1}$. Hence, the mo-graph of $\sigma_{i+1}$ can be extended to include the modification orders in $\mathrm{E}_{i+1}$.

Considering all of the above cases in paragraphs B.1, B.2, and B.3, the mo-graph of $\sigma_{i+1}$ can be extended to the modification orders in $\mathrm{E}_{i+1}$, which completes the proof. □

Lemma 6.

Let $P$ be an arbitrary program, and $\sigma \in \mathit{traces}(P)$ be a trace. Then for all $\mathrm{E} \in \mathit{lift}(\sigma)$, we have $\mathrm{E} \in \mathit{rConsistent}(P)$.

Proof. In the backward direction, we want to show that given a program $P$ and a trace $\sigma$ produced by the operational model, any execution $\mathrm{E} \in \mathit{lift}(\sigma)$ obtained by lifting the trace $\sigma$ is an element of $\mathit{rConsistent}(P)$. We will prove this by induction on the construction of the partial trace $\sigma_i$. Specifically, if we have $\mathrm{E}_k \in \mathit{lift}(\sigma_i)$, where $\mathrm{E}_k$ is a partial execution graph of $\mathrm{E}$ based on a topological sort of $E$, then when $\sigma_i$ is extended to $\sigma_{i+1}$, we have $\mathrm{E}_k \in \mathit{lift}(\sigma_{i+1})$ or $\mathrm{E}_{k+1} \in \mathit{lift}(\sigma_{i+1})$, where $\mathrm{E}_{k+1}$ is also a partial execution graph of $\mathrm{E}$ subject to the same topological sort.

Let $P$ be an arbitrary program and $\sigma \in \mathit{traces}(P)$ be a trace produced by our operational model. Let $\mathrm{E} \in \mathit{lift}(\sigma)$ be an execution obtained by lifting the trace $\sigma$. In Section A.2, we have argued that when only considering sb, asw, rf, and sc edges, any execution obtained by lifting a trace is acyclic. In the process of lifting traces, some transitions create events while some do not. Labeling the events in $\mathrm{E}$ in the order in which they are created by transitions in $\sigma$ gives a natural topological sort of $E$. All the partial execution graphs described below are based on this natural topological sort. We also require that the mo-graph of every partial trace $\sigma_i$ is extended in a way that is consistent with $\mathrm{E}$ in the lifting process. We will use $\mathit{lift}(\sigma_i)$ to refer to the specific execution in $\mathit{lift}(\sigma_i)$ whose mo-graph extension is consistent with that of $\mathrm{E}$.

Base case: When $i = 0$, $\sigma_0$ is the empty trace, which is the initial state of the operational model for $P$. Thus, $\mathit{lift}(\sigma_0)$ is the empty execution graph $\mathrm{E}_0$, which is a valid partial execution graph of $E$.

Inductive step: Suppose we have constructed a partial trace $\sigma_i$ of $\sigma$ and that $\mathit{lift}(\sigma_i) = \mathrm{E}_k$, where $\mathrm{E}_k$ is a partial execution graph of $\mathrm{E}$. We will show that when $\sigma_i$ is extended to $\sigma_{i+1}$ by executing the next transition $t_{i+1}$, we have either $\mathit{lift}(\sigma_{i+1}) = \mathrm{E}_k$ or $\mathit{lift}(\sigma_{i+1}) = \mathrm{E}_{k+1}$, where $\mathrm{E}_{k+1}$ is a partial execution graph of $\mathrm{E}$.

We will consider different cases for the transition $t_{i+1}$ below.

Invisible Instruction.

If the transition $t_{i+1}$ is an invisible instruction, such as an if statement, an assignment to a non-atomic location, or the empty statement $\epsilon$, then it leaves $\mathrm{E}_k$ unchanged, and $\mathit{lift}(\sigma_{i+1}) = \mathrm{E}_k$. For if statements, the partial trace at this point determines which branch to take, and proving that taking the branch will produce a valid partial execution graph when lifted comes down to proving the remaining cases analyzed below.

Visible Instruction.

If the next transition $t_{i+1}$ creates a new event $e_{k+1}$ in the lifting process, we have $\mathit{lift}(\sigma_{i+1}) = \mathrm{E}_{k+1}$. Moreover, new sb, asw, and sc edges, and updates to the modification orders, may also be added to $\mathrm{E}_k$ to form $\mathrm{E}_{k+1}$. We have already shown that $\mathrm{E}_{k+1}$ has acyclic $\mathit{sb} \cup \mathit{asw} \cup \mathit{rf} \cup \mathit{sc}$ edges, so we only need to show that $\mathrm{E}_{k+1}$ is a partial execution graph of $\mathrm{E}$.

We will discuss the newly added sb and asw edges first. Suppose the transition $t_{i+1}$ is a general visible instruction that corresponds to the event $e_{k+1}$. We will consider the new sb and asw edges that may be added to $\mathrm{E}_k$ when lifting $\sigma_{i+1}$. Suppose that $t_{i+1}$ is performed by thread $t$ and is not the first visible instruction in thread $t$. Then an sb edge will be drawn from the last visible event performed by $t$ to $e_{k+1}$. Suppose instead that $t_{i+1}$ is the first visible instruction performed by thread $t$. If $t$ is the main thread, then no new edges to $e_{k+1}$ will be added during lifting. If $t$ is not the main thread, then the parent thread $t'$ that created $t$ must have performed a Fork instruction, and an asw edge from the last visible instruction sequenced before the Fork instruction to $e_{k+1}$ will be added to $\mathrm{E}_k$ to obtain $\mathrm{E}_{k+1}$. It is clear that the way we construct sb and asw edges in lifting traces is consistent with the C++ axiomatic model.

Visible Instruction (Atomic Store).

If the next transition $t_{i+1}$ is an atomic store statement at location $M$, it will create a StoreElem that corresponds to the event $e_{k+1}$. We will focus on the changes to modification orders and sc relations. If $e_{k+1}$ is a seq_cst store, lifting $\sigma_{i+1}$ will cause an sc edge to be added from the last seq_cst event to $e_{k+1}$. Line 4 in the WritePriorSet procedure ensures that mo edges in the mo-graph conform with sc relations by adding an mo edge from the last seq_cst store at $M$ to $e_{k+1}$ in the mo-graph if the last seq_cst store at $M$ exists. So modification orders conform with sc relations in $E_{k+1}$. Since the operational model may only add incoming mo edges to $e_{k+1}$ in the mo-graph when processing $t_{i+1}$, modification orders in $E_{k+1}$ do not have cycles. Since $\mathit{sb} \cup \mathit{asw} \cup \mathit{rf} \cup \mathit{sc}$ is acyclic in $E$ by Lemma 4, it follows that sc relations conform with hb relations in $E_{k+1} = \mathit{lift}(\sigma_{i+1})$. Line 11 and line 12 in the WritePriorSet procedure ensure that mo edges induced by seq_cst fences, CoRW, and CoWW are added to the mo-graph. Therefore, the new modification orders in $E_{k+1}$ induced by the changes in the mo-graph in the lifting process conform with CoRW, CoWW, Section 29.3 statement 7 of the C++11 standard, and sc relations. Conformity with CoRW and CoWW implies that mo conforms with hb.

Now, if $e_{k+1}$ is the last element in $S^M_{k+1}$, then the above discussion shows that $E_{k+1}$ is a valid partial execution graph of $E$ based on the natural topological sort. It is also possible that $e_{k+1}$ is not the last element in $S^M_{k+1}$. Let $e_j$ be any event modification ordered after $e_{k+1}$ in $S^M_{k+1}$. Note that $e_j$ is topologically ordered before $e_{k+1}$. Since the WritePriorSet procedure only adds incoming mo edges to $e_{k+1}$, no mo edge or chain of mo edges from $e_{k+1}$ to $e_j$ exists in the mo-graph. Since the operational model forbids cycles in the mo-graph, the final mo-graph of $\sigma$ is free of cycles, and the modification orders in the final mo-graph at each location are extended to a total order in $E$ during lifting. Because we assume that the mo-graphs of all partial traces $\sigma_i$ are extended in a way that is consistent with $E$ in the lifting process, we can conclude that the modification ordering between $e_{k+1}$ and $e_j$ does not cause any cycles in the modification orders of $E_{k+1}$ and is only added to make $S^M_{k+1}$ a total order. Therefore, $E_{k+1}$ is a valid partial execution graph of $E$.

Visible Instruction (Atomic Load).

If the transition $t_{i+1}$ is an atomic load statement at location $M$, it creates a LoadElem that corresponds to the event $e_{k+1}$. To obtain $E_{k+1} = \mathit{lift}(\sigma_{i+1})$, a new rf edge is added to $E_k$, and the modification orders at $M$ may be updated. Suppose that $e_{k+1}$ reads from $e_j$, where $e_j$ is topologically ordered before $e_{k+1}$. We make the following claim:

(Claim 1) Any valid store that the operational model allows the LoadElem corresponding to $e_{k+1}$ to read from is also valid for $e_{k+1}$ to read from under our axiomatic model.

We will prove this claim by contradiction or by contrapositive, using case analysis. We will assume that our axiomatic model forbids $e_{k+1}$ from reading from $e_j$.

Case 1:

If having $e_{k+1}$ read from $e_j$ violates CoWR, then there exists an event $e_l$ in $E_{k+1}$ such that $e_j \xrightarrow{S^M_{k+1}} e_l$ and $e_l \xrightarrow{hb} e_{k+1}$. We will first assume that $e_j \xrightarrow{mo} e_l$ exists in the mo-graph of $\sigma_i$. The ReadPriorSet procedure iterates over each thread, and when considering the thread that performs $e_l$, line 9 finds either the store $e_l$, a store sequenced after $e_l$, or a load sequenced after $e_l$. Then the store $A$ in line 10 of ReadPriorSet is the store $e_l$, a store sequenced after $e_l$, or a store read by a load sequenced after $e_l$. In any case, based on CoWW, CoWR, and the inductive hypothesis that $E_k$ is a valid partial execution graph of $E$, we can deduce that the mo edge (or the equivalent chain of mo edges) $e_j \xrightarrow{mo} A$ exists in the mo-graph of $\sigma_i$. Since $A$ is reachable from $e_j$, line 16 in the ReadPriorSet procedure forbids the LoadElem corresponding to $e_{k+1}$ from reading from the store corresponding to $e_j$ in the operational model. This proves Claim 1 by contrapositive.

However, it is also possible that $e_j$ and $e_l$ are unordered in the mo-graph of $\sigma_i$. Then the modification ordering between $e_j$ and $e_l$ in $E_{k+1}$ is due to the extension of the final mo-graph of $\sigma$ in $E$, as the mo-graphs of all partial traces $\sigma_i$ are extended in a way that is consistent with $E$. Having $e_{k+1}$ read from $e_j$ will add mo edges so that $e_l \xrightarrow{mo} e_j$ exists in the mo-graph of $\sigma_{i+1}$ and in the final mo-graph of $\sigma$. This is a contradiction because the extension of the final mo-graph is only required between two unordered stores in the mo-graph. Therefore, we prove Claim 1 by contradiction.

Case 2:

If having $e_{k+1}$ read from $e_j$ violates CoRR, the proof is similar to the case of CoWR.

Case 3:

If having $e_{k+1}$ read from $e_j$ violates Section 29.3 statement 4 in the C++11 standard, then in $E_{k+1}$ there exists a fence $X$ sequenced before $e_{k+1}$. Let $X'$ be the last seq_cst store preceding $X$ in sc in the partial execution graph $E_{k+1}$. Then $e_j$ is either some store modification ordered before $X'$ or a seq_cst store that precedes $X'$ in sc.

First consider the case $e_j \xrightarrow{sc} X'$ in $E_{k+1}$. Then the inductive hypothesis implies that $e_j \xrightarrow{mo} X'$ in the mo-graph of $\sigma_i$, as line 4 in the WritePriorSet procedure adds mo edges between seq_cst stores at the same location in the mo-graph. When the ReadPriorSet procedure iterates over the thread that performs $X'$, line 7 returns $S$ as $X'$, and line 10 will return $A$ as either $X'$, a store sequenced after $X'$, or a store read from by a load sequenced after $X'$. In any case, CoWW, CoWR, and the inductive hypothesis guarantee that $X' \xrightarrow{mo} A$ in the mo-graph of $\sigma_i$. Therefore, we have $e_j \xrightarrow{mo} A$ in the mo-graph of $\sigma_i$, and line 16 in the ReadPriorSet procedure forbids $e_{k+1}$ from reading from $e_j$ in the operational model.

Now consider the case that $e_j$ is modification ordered before $X'$ in $E_{k+1}$. If $e_j \xrightarrow{mo} X'$ exists in the mo-graph of $\sigma_i$, then the argument is the same as in the previous paragraph. If $e_j$ and $X'$ are two unordered stores in the mo-graph of $\sigma_i$, then we can deduce the same contradiction as in the analysis of the CoWR violation in Case 1.

Case 4:

If having $e_{k+1}$ read from $e_j$ violates statement 5 or statement 6 of Section 29.3 in the C++11 standard, then the proof is similar to the analysis of the violation of statement 4 in Case 3 by considering line 6 or line 8 in the ReadPriorSet procedure. So we do not present it here.
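Cases 1 through 4 all end with the same step: some store in the prior set of the load is already reachable from the candidate store in the mo-graph, so the read is rejected. The following C++ fragment is a minimal sketch of that reachability test, not the C11Tester implementation. The names MoGraph, may_read_from, and commit_read are hypothetical, stores are identified by plain indices, and the prior set is assumed to be given rather than computed as the ReadPriorSet procedure does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified mo-graph for a single location: nodes are stores,
// and an edge u -> v means "u is modification ordered before v".
struct MoGraph {
    std::vector<std::vector<std::size_t>> succ;  // adjacency lists

    // Add a new store node and return its index.
    std::size_t add_store() {
        succ.emplace_back();
        return succ.size() - 1;
    }

    // Record the edge from -> to.
    void add_edge(std::size_t from, std::size_t to) {
        assert(from < succ.size() && to < succ.size());
        succ[from].push_back(to);
    }

    // True if `to` is reachable from `from` along at least one mo edge.
    bool reachable(std::size_t from, std::size_t to) const {
        assert(from < succ.size() && to < succ.size());
        std::vector<bool> seen(succ.size(), false);
        std::vector<std::size_t> stack(succ[from].begin(), succ[from].end());
        while (!stack.empty()) {
            std::size_t n = stack.back();
            stack.pop_back();
            if (n == to) return true;
            if (seen[n]) continue;
            seen[n] = true;
            for (std::size_t m : succ[n]) stack.push_back(m);
        }
        return false;
    }
};

// Assumed feasibility test: a load may read from `candidate` only if no store
// in its prior set is already mo-after `candidate`. Reading would add an edge
// from each prior store to `candidate`, which would then close a cycle.
bool may_read_from(const MoGraph& g, std::size_t candidate,
                   const std::vector<std::size_t>& prior_set) {
    for (std::size_t a : prior_set) {
        if (g.reachable(candidate, a)) return false;
    }
    return true;
}

// If the read is allowed, commit it by adding the incoming edges to `candidate`.
bool commit_read(MoGraph& g, std::size_t candidate,
                 const std::vector<std::size_t>& prior_set) {
    if (!may_read_from(g, candidate, prior_set)) return false;
    for (std::size_t a : prior_set) {
        if (a != candidate) g.add_edge(a, candidate);
    }
    return true;
}
```

For two stores $e_j$ and $e_l$ with an existing edge $e_j \xrightarrow{mo} e_l$ and a prior set containing $e_l$, may_read_from rejects $e_j$ as a candidate, which mirrors the contrapositive step in Case 1. Because commit_read only adds edges after the test passes, the sketch never introduces a cycle through the newly added edges.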

Case 5:

If having $e_{k+1}$ read from $e_j$ violates Section 29.3 statement 3 in the C++11 standard, then $e_{k+1}$ has seq_cst ordering, the last seq_cst store $X$ at $M$ that precedes $e_{k+1}$ in sc exists, and $e_j$ is either a seq_cst store that precedes $X$ in sc or a store that happens before $X$. Then $e_j$ will be removed from the may-read-from set in line 10 of the BuildMayReadFrom procedure, and $e_j$ is not a valid store for $e_{k+1}$ to read from in the operational model.

We have completed the proof of Claim 1 by analyzing the above five cases. Claim 1 shows that $E_{k+1}$ has valid rf edges. We will next show the acyclicity of modification orders and the conformity of sc edges in $E_{k+1}$. Suppose $e_{k+1}$ reads from some event $e_j$. Establishing this rf relation adds incoming mo edges to $e_j$ in the mo-graph of $\sigma_i$. We claim that the updated mo-graph is free of cycles. If adding these edges causes a cycle in the mo-graph of $\sigma_{i+1}$, then the cycle contains only one of the newly added edges, and line 16 in the ReadPriorSet procedure should have forbidden $e_{k+1}$ from reading from $e_j$. Therefore, the mo-graph of $\sigma_{i+1}$ is free of cycles. Lines 6 to 9 in the ReadPriorSet procedure consider mo edges that are enforced by statements 5, 4, and 6 of Section 29.3 in the C++11 standard, CoRR, and CoWR. Line 10 filters out redundant mo edges: once the mo edges that are not filtered out are added, the mo edges that are filtered out follow from the transitivity of mo edges. Since the mo-graph of $\sigma_{i+1}$ is acyclic, modification orders in $E_{k+1}$ are also acyclic in the lifting process. If $e_{k+1}$ has seq_cst ordering, then an sc edge will be drawn from the last seq_cst event to $e_{k+1}$ in $E_{k+1}$. This sc edge conforms with modification orders, as $e_{k+1}$ is a load and not an element of $S^M_{k+1}$. However, we also need to show that modification orders in $E_{k+1}$ conform with the other sc edges.

Because we assume conformity of modification orders with sc edges in $E_k$, if any cycle exists in the union of sc and $S^M_{k+1}$ in $E_{k+1}$, the cycle must involve one of the newly added modification orderings. Suppose a cycle $C$ exists and it contains events $X$ and $e_j$ where $X \xrightarrow{S^M_{k+1}} e_j$ is a newly added modification ordering in $E_{k+1}$. Since all events in $S^M_{k+1}$ are atomic stores, the two endpoints of any maximal chain of sc edges (possibly a single sc edge) in $C$ must be seq_cst atomic stores. Let the two endpoints be events $Z_1$ and $Z_2$. Then the relation $Z_1 \xrightarrow{mo} Z_2$ or $Z_2 \xrightarrow{mo} Z_1$, whichever conforms with the sc edges, must exist in the mo-graph of $\sigma_i$. No $\xrightarrow{S^M_{k+1}}$ edges in $C$ can be due to the extensions of unordered stores in the mo-graph of $\sigma_{i+1}$, because we assume that the mo-graph of $\sigma$ is extended without cycles. Therefore, we must have $e_j \xrightarrow{mo} X$ in the mo-graph of $\sigma_i$. Then the relation $X \xrightarrow{mo} e_j$ cannot exist in the mo-graph of $\sigma_{i+1}$, as it would make the mo-graph of $\sigma_{i+1}$ cyclic. However, $X \xrightarrow{S^M_{k+1}} e_j$ cannot be due to the extension of unordered stores in the mo-graph of $\sigma_{i+1}$. Therefore, we have a contradiction, and sc conforms with modification orders in $E_{k+1}$. Therefore, $E_{k+1}$ is a valid partial execution graph of $E$.

Visible Instruction (Atomic RMW).

If the transition $t_{i+1}$ is an atomic RMW statement at location $M$, it creates an RMWElem that corresponds to the event $e_{k+1}$. An atomic RMW is both a load and a store, except that the standard requires that RMW operations shall always read the last value (in the modification order) written before the write associated with the RMW operation. In the operational model, the BuildMayReadFrom procedure forbids two RMWs from reading from the same store. Denote the store that $e_{k+1}$ reads from as $e_j$. Then the AddRMWEdge procedure adds all outgoing mo edges from $e_j$ to the set of outgoing edges of $e_{k+1}$ and adds an mo edge from $e_j$ to $e_{k+1}$ to form the mo-graph of $\sigma_{i+1}$. Therefore, when lifting $\sigma_{i+1}$, $e_j$ is immediately modification ordered before $e_{k+1}$ in $E_{k+1}$. Hence, $E_{k+1}$ is a valid partial execution graph of $E$.

If the transition $t_{i+1}$ is a fence instruction, it creates a FenceElem that corresponds to the event $e_{k+1}$. Lifting $\sigma_{i+1}$ adds $e_{k+1}$ to $E_k$ but does not create rf edges or change the modification ordering. If $e_{k+1}$ has seq_cst ordering, an sc edge from the last seq_cst event (if it exists) to $e_{k+1}$ will be created, and sc conforms with hb and modification orders in $E_{k+1}$. Thus, $E_{k+1}$ is a valid partial execution graph of $E$.

We have completed the inductive proof by case analysis. □

B ADDITIONAL DATA

Table 4 reports detailed statistics for the 25 JavaScript benchmarks in the JSBench suite.

Table 4: Performance results (in ms) for individual JavaScript benchmarks in the JSBench suite for tsan11, tsan11rec, and C11Tester under two configurations. Smaller times are better. The "Memory Accesses" columns report the number of atomic operations (including synchronization operations such as mutex and condition variable operations) and normal accesses to shared memory locations executed by individual JavaScript benchmarks under C11Tester.