VSync: Push-Button Verification and Optimization for Synchronization Primitives on Weak Memory Models (Technical Report)
Jonas Oberhauser, Rafael Lourenco de Lima Chehab, Diogo Behrens, Ming Fu, Antonio Paolillo, Lilith Oberhauser, Koustubha Bhat, Yuzhong Wen, Haibo Chen, Jaeho Kim, Viktor Vafeiadis
Huawei Dresden Research Center · Huawei OS Kernel Lab · Shanghai Jiao Tong University · Max Planck Institute for Software Systems (MPI-SWS)
Abstract
This technical report contains material accompanying our work with the same title published at ASPLOS'21 [24]. We start in §1 with a detailed presentation of the core innovation of this work, Await Model Checking (AMC). The correctness proofs of AMC can be found in §2. Next, we discuss three case studies in §3, presenting bugs found and challenges encountered when applying VSync to existing code bases. Finally, in §4 we describe the setup details of our evaluation and report further experimental results.

1 Await Model Checking in Detail
AMC is an enhancement of stateless model checking (SMC) capable of handling programs with awaits on WMMs. SMC constructs all possible executions of a program and filters out those inconsistent with the underlying memory model. However, SMC falls short when the program has infinitely many or non-terminating executions (e.g., due to await loops), because the check never terminates. AMC overcomes this limitation by filtering out executions in which multiple iterations of an await loop read from the same writes. We start by introducing basic notation and definitions, including execution graphs, which are used to represent executions. Next, we explain how awaits lead to infinitely many and/or non-terminating executions, and how AMC overcomes these problems. We present sufficient conditions under which AMC correctly verifies programs, including await termination (AT) and safety. Finally, we show the integration of AMC into a stateless model checker from the literature.
Executions as graphs.
An execution graph G is a formal abstraction of executions, where nodes are events such as reads and writes, and the edges indicate program order (po), modification order (mo), and reads-from (rf) relationships, as illustrated in Fig. 2. A read event R^m_T(x,v) reads the value v from the variable x by the thread T with the mode m, a write event W^m_T(x,v) updates x with v, and W_init(x,v) initializes x with v. The short notations R_T(x,v) and W_T(x,v) represent the relaxed events R^rlx_T(x,v) and W^rlx_T(x,v) respectively. Note that the po is identical in (a) and (b) of Fig. 2 because it is the order of events in the program text. In contrast, the mo and rf edges differ; e.g., the two non-initial writes to l appear in one mo order in (a) and in the opposite order in (b), and the rf edges indicate that one of these writes is never read in (a) while it is read by a read of l in (b).

locked = 0, q = 0;

T1: lock                      T2: unlock
  locked = 1;                   while (q == 0);
  q = 1;                        locked = 0;
  while (locked == 1);          assert(locked == 0);
  /* Critical Section */

Figure 1: Awaits in one path of a partial MCS lock. T1 signals q = 1 to notify T2 that it enqueued, and T2 waits for the notification, then signals locked = 0 to pass the lock to T1.

Figure 2: Two execution graphs (a) and (b) of Fig. 1, where l = locked.
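For concreteness, the handover of Fig. 1 can be expressed with C11 atomics roughly as follows (our own sketch, not code from the paper's artifact): the store to q gets a release barrier and the loads of q get acquire barriers, matching the rel/acq modes discussed next, while the accesses to locked stay relaxed.

    #include <stdatomic.h>
    #include <assert.h>

    atomic_int locked = 0, q = 0;

    void lock_T1(void) {                /* T1: lock */
        atomic_store_explicit(&locked, 1, memory_order_relaxed);
        /* rel write: notify T2 that T1 enqueued */
        atomic_store_explicit(&q, 1, memory_order_release);
        /* await: spin until T2 passes the lock by writing locked = 0 */
        while (atomic_load_explicit(&locked, memory_order_relaxed) == 1)
            ;
        /* Critical Section */
    }

    void unlock_T2(void) {              /* T2: unlock */
        /* await with acq reads: spin until T1's notification is visible */
        while (atomic_load_explicit(&q, memory_order_acquire) == 0)
            ;
        atomic_store_explicit(&locked, 0, memory_order_relaxed);
        assert(atomic_load_explicit(&locked, memory_order_relaxed) == 0);
    }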
Consistency predicates.

A weak memory model M is defined by a consistency predicate cons_M over graphs, where cons_M(G) holds iff G is consistent with M. For instance, the IMM model used by VSync forbids the cyclic path of highlighted edges in (b) of Fig. 2 due to the rel and acq modes, as it forbids all cyclic paths consisting of edges in this order: 1) po ending in W^rel, 2) rf ending in R^acq, 3) po, 4) mo. (A path is cyclic if it starts and ends with the same node.) Such a path is compactly written as "po; [W^rel]; rf; [R^acq]; po; mo", and is never cyclic in graphs consistent with IMM. Thus cons_IMM((b)) does not hold. If, say, the rel barriers on the accesses to q were removed, the graph would be consistent with IMM.

Awaits.

Intuitively, an await is a special type of loop which waits for a write of another thread. To make this intuition more precise, imagine a demonic scheduler that prioritizes threads currently inside awaits. Under such a scheduler, an await has two possible outcomes: either the write of the other thread is currently visible, and the await terminates immediately; or the write of the other thread is not visible. In the latter case, the scheduler continuously prevents the write from becoming visible by never scheduling the writer thread, and hence the await never terminates. A more precise definition of an await is a loop that, for every possible value of the polled variables (the variables read in each loop iteration to evaluate the loop's condition), either exits immediately or loops forever when executed in isolation.

/* lock acquire */
do {
    atomic_await_neq(&lock, 1);
} while (atomic_xchg(&lock, 1) != 0);
x++; /* CS */
/* lock release */
atomic_write(&lock, 0);

Figure 3: TTAS lock example.

We illustrate this with the two loops of the TTAS lock from Fig. 3. The inner loop is an await; to show this, we need to consider every potential value v of lock: for v = 1, the loop repeats forever, and for v ≠ 1, it exits immediately. The outer loop is not an await: there is a value v of lock for which the loop is neither executed infinitely often nor exited immediately. One such value is v = 1, for which the thread never reaches the outer loop again after entering the inner loop.
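To make the loop classification above concrete, here is the same TTAS acquire path with the await spelled out as a plain spin loop (a sketch; atomic_read is assumed here as the read counterpart of atomic_write from Fig. 3):

    /* inner loop: an await over the polled variable `lock`            */
    /* - for lock == 1 it spins forever, for any other value it exits  */
    /*   immediately, so it is an await                                */
    while (atomic_read(&lock) == 1)
        ;

    /* outer loop: not an await                                        */
    /* - for lock == 1 it gets stuck inside the inner await, so it     */
    /*   neither exits immediately nor iterates infinitely often       */
    do {
        while (atomic_read(&lock) == 1)
            ;
    } while (atomic_xchg(&lock, 1) != 0);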
Sets of execution graphs.
Given a fair scheduler, awaits either exit after a number of failed iterations or (intentionally) loop forever. We separate execution graphs that satisfy cons_M into two sets, G_F and G_∞. G_F is the set of execution graphs where all awaits exit after a finite number of failed iterations. In Fig. 2, for example, (a) is in G_F since cons_M((a)) holds and the await of T2 exits after two failed iterations. Note that G_F consists exactly of the graphs in which all awaits terminate, but that does not imply that these graphs are finite: there may still be infinitely many steps outside of the awaits. We will later state a sufficient condition that excludes such cases. G_∞ is the set of the remaining consistent graphs. In each of these at least one await loops forever. We define:

Definition 1 (Await termination). AT holds iff G_∞ = ∅.

Due to the barriers on q, cons_IMM((b)) does not hold, and hence (b) is not in G_∞. In fact, with these acq and rel barriers, all graphs with an infinite number of failed iterations violate consistency, and hence G_∞ is empty; AT is not violated.

Note that we can splice an additional failed iteration into (a) by repeating R^acq_T2(q, 0), resulting in a new graph in G_F. We generalize this idea. Let G_k ⊆ G_F be the set of consistent execution graphs with a total of k ∈ {0, 1, ...} failed iterations. (We count here the sum of failed iterations of all executed instances of awaits, including multiple instances by the same thread, e.g., when the inner await loop in the TTAS lock from Fig. 3 is executed multiple times.) Thus, G_0 is the set of consistent graphs with no failed iterations. With two failed iterations of T2's await and zero of T1's await, (a) has a total of 2 + 0 = 2 failed iterations, i.e., (a) ∈ G_2.

Let now G ∈ G_k with k > 0, i.e., G has at least one failed iteration; we can always repeat one of its failed iterations to obtain a graph G′ ∈ G_{k+1}, due to the non-deterministic number of iterations of await loops. Since all G_k are disjoint, their union (denoted by G_F = ⊎_{k ∈ {0,1,...}} G_k) is infinite despite every set G_k being finite. State-of-the-art stateless model checkers [15–17] cannot construct all execution graphs in G_F or any in G_∞. In order for SMC to complete in finite time, the user has to limit the search space to a finite subset of execution graphs in G_F. Consequently, SMC cannot verify AT (Definition 1) and can only verify safety within this subset of execution graphs.

Key challenges.
For SMC to become feasible in our problem domain, we need to solve three key challenges:
Infinity:
We need to produce an answer in finite time without a user-specified search space, even though the search space G_F ∪ G_∞ is infinite.

Soundness:
We need to make sure not to miss any execution graph that may potentially uncover a safety bug.
Await termination:
We need to verify that G_∞ = ∅.

Under certain conditions specified later, AMC overcomes these challenges through three crucial implications:

1. The infinite set G_F is collapsed into a finite set of finite execution graphs G_F* ⊂ G_F. Moreover, the infinite execution graphs in G_∞ are collapsed into finite execution graphs in a (possibly infinite) set G_∞*. AMC explores at most all graphs in G_F* and up to one graph in G_∞*.

2. For all G ∈ G_F, there exists G′ ∈ G_F* such that G and G′ are equivalent. Thus, bugs present in G_F are also in G_F*.

3. Detecting whether there exists a finite execution graph in G_∞* is sufficient to conclude whether G_∞ is empty. Thus AMC can stop after exploring one graph in the set G_∞* and report an AT violation, and if AMC does not come across such a graph, AT is not violated.

We now explain how AMC achieves these three implications, as well as the conditions under which it does so.

/* lock acquire */
do {
    d = 2;
    while (d--);
} await_while(atomic_xchg(lock, 1) != 0);
assert(d == 0);

Figure 4: Execution graph of the acquire loop above, where B is the body of T1's inner loop. Failed await iterations are indicated with dotted boxes, the final (non-failed) iteration with a solid box. The inner loop violates the Bounded-Effect principle, as rf-edges leave failed await iterations. The outer loop obeys the principle, as rf-edges only leave the final await iteration.

The key to AMC.
In contrast to existing SMCs, AMC filters out execution graphs that contain awaits where multiple iterations read from the same writes to the polled variables. This idea is captured by the predicate W(G), which defines wasteful executions.

Definition 2 (Wasteful). An execution graph G is wasteful, i.e., W(G) holds, if an await in G reads the same combination of writes in two consecutive iterations.

AMC does not generate wasteful executions, as they do not add any additional information. For instance, AMC does not generate execution graph α ∈ G_2 from Fig. 5, because W(α) holds: T1 reads from its own write twice in the await. Similarly, AMC does not generate any of the infinitely many variations of α in which T1 reads even more often from that write. Instead, if we remove all references to q in the program of Fig. 1, AMC generates only two execution graphs, which form G_F*; in these, each write is read at most once by the await of T1. T1 can read from at most two different writes, thus there is at most one failed iteration of the await in the execution graphs in G_F*, and we have G_F* ⊆ ⊎_{k ∈ {0,1}} G_k. In general, if there are at most n ∈ ℕ writes each await can read from (only writes that change the value of the variable matter here), there are at most n − 1 failed iterations per await in the graphs of G_F*. If there are at most a ∈ ℕ executed instances of awaits, we have G_F* ⊆ ⊎_{k ∈ {0,...,a·(n−1)}} G_k, which is a union of a finite number of finite sets and thus finite. We will later define sufficient conditions to ensure this.

Figure 5: Execution graphs of Fig. 1, where l = locked.

Figure 6: AMC exploration.

We proceed to discuss how AMC discovers AT violations. Consider execution graph β of Fig. 5. In the first iteration of the await, T1 reads from W_T1(l, 1). In the next iteration, T1's read has no incoming rf-edge; coherence forbids T1 from reading an older write, and the await progress condition forbids it from reading the same write. Since there is no further write to the same location, AMC detects an AT violation and uses the finite graph β as the evidence. In general, if the mo of every polled variable is finite, then AT violations from G_∞ are represented by graphs in G_∞* where some read has no incoming rf-edge. AMC exploits this fact to detect AT violations.
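As an illustration of Definition 2, the wasteful check can be sketched as follows (our own sketch, not GenMC or VSync code): an await iteration is summarized by the writes its polled reads read from, and two consecutive iterations reading from exactly the same writes make the execution wasteful.

    #include <stdbool.h>
    #include <stddef.h>

    /* One executed await: rf[q * nreads + i] is an id of the write that the
     * i-th polled read of iteration q reads from. */
    typedef struct {
        size_t nreads;   /* polled reads per iteration */
        size_t niters;   /* executed iterations        */
        const int *rf;
    } await_trace;

    /* W(G) restricted to one await: true if two consecutive iterations read
     * from the same combination of writes. */
    static bool await_is_wasteful(const await_trace *a) {
        for (size_t q = 0; q + 1 < a->niters; q++) {
            bool same = true;
            for (size_t i = 0; i < a->nreads; i++)
                if (a->rf[q * a->nreads + i] != a->rf[(q + 1) * a->nreads + i])
                    same = false;
            if (same)
                return true;
        }
        return false;
    }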
Conditions of AMC.

State-of-the-art SMC only terminates and produces correct results for terminating and loop-free programs. (Exploration depths are often used to transform programs with loops into loop-free programs, potentially changing the behavior of the program.) AMC extends the domain of SMC to a fragment of looping and/or non-terminating programs, on which AMC not only produces correct results but can also decide termination. With the generic client code provided by VSync, all synchronization primitives we have studied are in this fragment, showing that it is practically useful. The fragment includes all programs satisfying the following two principles:
Bounded-Length Principle:
There is a bound b (chosen globally for the program) so that all executions in G_0 of the program have length ≤ b.

Bounded-Effect Principle:

Failed await iterations satisfy the bounded-effect principle: the effect of a failed loop iteration is limited to that loop iteration.

Informally, the Bounded-Length principle means that the number of execution steps outside of awaits is bounded, and each individual iteration of a failed await is also bounded. Obviously, infinite loops in the client code are disallowed by the Bounded-Length principle.
The Bounded-Effect principle means that no side effects from failed await iterations may be referenced by either subsequent loop iterations, other threads, or code outside the loop. The principle can be defined more precisely in terms of execution graphs: rf-edges starting with writes generated by a failed await iteration must go to read events that are generated in the same iteration. Figure 4 illustrates the principle: rf-edges from the decrements of d in failed iterations of the loop body B go to subsequent iterations of the loop, but for the outer loop only the final iteration has outgoing rf-edges. The Bounded-Effect principle allows removing any failed iteration from a graph without affecting the rest of the graph, since the effects of the failed iteration are never referenced outside the iteration. This implies that any bugs in graphs from G_F are also present in graphs in G_0. Furthermore, if the Bounded-Effect principle and the Bounded-Length principle hold, then graphs in G_k are bounded for every k: a graph in G_k has at most b + k · x events, where b is the bound from the Bounded-Length principle and x is the maximum number of steps in a failed iteration of an await.

The two principles jointly imply that mos and the number of awaits are bounded; thus, as discussed before, G_F* is a finite set, G_∞* contains only finite graphs, and AMC always terminates. In synchronization primitives, awaits either just poll a variable (without side effects) or perform some operation which only changes global state if it succeeds, e.g., await_while(q == 0); or await_while(!trylock(&L));. These awaits satisfy the Bounded-Effect principle. The former does not have any side effects. The latter encapsulates its local side effects in trylock(&L), which can therefore not leave failed iterations of the loop. A global side effect (i.e., acquiring the lock) only occurs in the last iteration of the await (cf. Fig. 4). When called in our generic client code, synchronization primitives also satisfy the Bounded-Length principle: the client code invokes the functions of the primitives only a bounded number of times, and each function of the primitives is also bounded.
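Such a generic client can look roughly as follows (a sketch under assumed names: lock_t, lock_acquire, and lock_release stand for the primitive under verification; the actual VSync harness may differ). Each of a fixed number of threads enters the critical section once and increments a shared counter, so every function of the primitive is invoked a bounded number of times, and the final assertion checks for lost updates.

    #include <pthread.h>
    #include <assert.h>

    #define NTHREADS 3

    static lock_t L;      /* primitive under test (assumed type)      */
    static int x;         /* shared counter protected by the lock     */

    static void *client(void *arg) {
        (void)arg;
        lock_acquire(&L); /* assumed API of the primitive under test  */
        x++;              /* critical section                         */
        lock_release(&L);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, client, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        assert(x == NTHREADS);  /* no lost update => mutual exclusion held */
        return 0;
    }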
AMC Correctness.
For programs which satisfy the Bounded-Length principle and the Bounded-Effect principle, 1) AMC terminates, 2) AMC detects every possible safety violation, 3) AMC detects every possible non-terminating await, and 4) AMC has no false positives. See Section 2 for the formal proof.
We implement AMC on top of GenMC [16, 17], a state-of-the-art SMC from the literature. The exploration algorithm in Fig. 6 extends GenMC's algorithm with the highlighted essential changes: 1) detecting AT violations through reads with no incoming rf-edge; 2) checking if W(G) holds, to filter out graphs G in which an await reads from the same writes in multiple iterations. AMC builds up execution graphs through a kind of depth-first search, starting with an empty graph G_init, which is extended step by step with new events and edges. The search is driven by a stack S of possibly incomplete and/or inconsistent graphs, which is initialized to contain only G_init.

Each iteration of the exploration pops a graph G from S. If the graph G violates the consistency predicate cons_M or is wasteful (i.e., W(G) holds), it is discarded and the iteration ends. Otherwise, a program state is reconstructed by emulating the execution of threads until every thread has executed all its events in G; e.g., if a thread U executes a read instruction that corresponds to R_U(x, 0) in G, the emulator looks into G for the corresponding read event and returns the value read by the event, in this case 0. We denote the set of runnable threads in the reconstructed program state by T_G. Initially, all threads are runnable; a thread is removed from T_G once it terminates or if it is stuck in an await. If the set is not empty, we can explore further and pick some arbitrary thread T ∈ T_G to run next. We emulate the next instruction of T in the reconstructed program state. For the sake of brevity, we discuss only three types of instructions: failed assertions, writes, and reads.

F: an assertion failed. We stop the exploration and report G.

If the instruction executes a read or write, a new graph with the corresponding event should be generated. Usually there are several options for the event; for each option, a new graph is generated and pushed onto the stack S. In particular:

W: a write event w is added. In this case, for every existing read r to the same variable, a partial copy of G with an edge w →rf r is generated and pushed onto S.

R: a read event r is added; for every write event w in G, a copy of G with an additional edge w →rf r is generated and pushed onto S.

Crucially, if the read event r is in an await, an additional copy of G is generated in which r has no incoming rf-edge (we write this new graph as G[⊥ →rf r]). This missing rf-edge indicates a potential AT violation. It is not an actual AT violation yet, because a new write w to the same variable might be added by another thread later during exploration. If such a write is added, it leads to the generation of two types of graphs: graphs in which r is still missing an rf-edge, and graphs with the edge w →rf r in which r no longer has a missing rf-edge (and the potential AT violation got resolved). Otherwise, if a missing rf-edge is still present and no other thread can be run, we know that such a write cannot become available anymore; the potential AT violation turns into an actual AT violation. The algorithm detects the violation after popping a graph G in which no threads can be run (T_G = ∅), but a read without incoming rf-edge is present (⊥ →rf r ∈ G), and this missing rf-edge could not be resolved except through a wasteful execution, i.e., every consistent graph G′ obtained by adding the missing rf-edge and completing the await iteration is wasteful.
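In outline, the exploration of Fig. 6 has the following shape (our own pseudocode-style sketch in C syntax, not compilable as-is; all helper names such as consistent, wasteful, runnable threads, and the revisit handling are assumptions standing in for GenMC's actual machinery):

    void amc_explore(void) {
        push(S, G_init);                       /* stack of (partial) graphs      */
        while (!empty(S)) {
            graph *G = pop(S);
            if (!consistent(G) || wasteful(G)) /* cons_M(G) fails or W(G) holds  */
                continue;                      /* discard G                      */
            if (no_runnable_threads(G)) {      /* T_G = empty                    */
                if (has_read_without_rf(G))    /* some read still has no rf-edge */
                    report_at_violation(G);
                continue;
            }
            thread T = pick_runnable(G);
            instr  i = next_instruction(G, T); /* emulate T's next step          */
            if (is_failed_assertion(i)) {
                report_counterexample(G);      /* safety violation               */
                return;
            }
            if (is_write(i))                   /* revisit existing reads of x    */
                push_all(S, revisit_graphs(G, i));
            if (is_read(i)) {
                push_all(S, graphs_with_each_rf_option(G, i));
                if (in_await(i))               /* extra option: missing rf-edge  */
                    push(S, graph_with_missing_rf(G, i));
            }
        }
        report_success();
    }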
(The assertion expression assert(x==0) consists of at least two instructions: the first reads x, and the second just compares the result to zero and potentially fails the assertion. Note that once the rf-edge for the event corresponding to the first instruction has been fixed, whether the second instruction is a failed assertion or not is also fixed.)

(For details of the revisit step in the write case, we refer the reader to the function CalcRevisits from [16].)

2 Proving the Correctness of Await Model Checking
In this section we formally prove Theorem 1 to show that our AMC is correct. In order to formally prove this theorem, we first define a tiny concurrent assembly-like language in §2.1, which follows our Bounded-Length principle and allows us to map execution graphs to the execution of instruction sequences. Then we formalize the Bounded-Effect principle in §2.2. Finally, we give the formal representation of Theorem 1 and prove it in §2.3. Throughout this section we use the notation of [16].
Theorem 1 (AMC Correctness). For programs which satisfy the Bounded-Length principle and the Bounded-Effect principle, 1) AMC terminates, 2) AMC detects every possible safety violation, 3) AMC detects every possible non-terminating await, and 4) AMC has no false positives.
To have a formal foundation for these proofs we need to provide a formal programming language. Without such a programming language, execution graphs just float in the air, detached from any program. Using the consistency predicate cons_M we can state that an execution graph can be generated by the weak memory model, but not whether it can be generated by a given program.

Figure 7: Divergent execution graph: Thread 1 performs infinitely many relaxed reads R^rlx(x, 0), all reading from the initial write W_init(x, 0).

For example, the non-terminating execution graph in Fig. 7, consisting only of reads from the initial store, is consistent with all standard memory models, but is obviously irrelevant to most programs. If we were only to decide whether a non-terminating execution graph is consistent with the memory model or not, we could simply always return true and be done with it. What we really want to decide is whether a given program (which satisfies our two principles) has a non-terminating execution graph which is consistent with the memory model. For this purpose we will in this section define a tiny assembly-like programming language, and define whether a program P can generate an execution graph G through a new consistency predicate cons_P(G). This will require formally defining an execution-graph-driven semantics for the language, in which threads are executed in isolation using values provided by an execution graph. After we define the programming language, we formally define the Bounded-Effect principle. We then show that if the Bounded-Effect principle is satisfied, we can always remove one failed iteration of an await from a graph without making the graph inconsistent with the program. This will allow us to show that graphs in G_F can always be "trimmed" to a graph in G_F* which has the same error events, and thus all safety violations are detected by AMC. Next we show that we can also add failed iterations of an await; indeed, for graphs in G_∞* we can add infinitely many such iterations. Thus these graphs can be "extended" to graphs in G_∞ which are consistent with the program. This implies that there are no false positives. Finally we show that due to the Bounded-Length principle and the Bounded-Effect principle, graphs in G_F* and G_∞* have a bounded number of failed iterations of awaits, and the remaining steps are bounded as well. Thus the search space itself must be finite, and AMC always terminates.

(Program)   P ::= T_1 ∥ ... ∥ T_i ∥ ... ∥ T_n     Composition of parallel threads T_i
(Thread)    T ::= S_1; ...; S_n                    Sequence of statements S_i
(Stmts)     S ::= await(n, κ) | step(ε, δ)         Await-loop and non-await-loop steps
(LoopCon)   κ ∈ State → {0, 1}                     Loop condition
(EvtGen)    ε ∈ State → Events                     Event generator
(StTrans)   δ ∈ State × Value? → Update            State transformer
(Events)    e ::= R^m(x) | W^m(x, v) | F^m | E     Read, write, fence, error events; x ∈ Location, v ∈ Value
(Modes)     m ::= rlx | rel | acq | sc             Barrier modes
(State)     σ ∈ Register → Value                   Set of thread-local states
(Update)    μ ∈ Register ⇀ Value                   Updated register values
(Register)  ...                                    Set of thread-local registers
(Location)  ...                                    Set of shared memory locations
(Value)     ...                                    Set of possible values of registers and memory locations

Figure 8: Compact Syntax and Types of our Language

C-like program:
    x = r1;
    r1 = y;
    if (r1 == 0)
        r2 = x;

Our toy language:
    step(λσ. W^rlx(x, σ(r1)),  λσ _. [ ]);
    step(λσ. R^rlx(y),         λσ v. [r1 → v]);
    step(λσ. match σ(r1) with | 0 → R^rlx(x) | _ → F^rlx,
         λσ v. match σ(r1) with | 0 → [r2 → v] | _ → [ ])

Figure 9: Using lambda functions inside step to implement different control paths
Recall that the Bounded-Length principle requires that the number of steps inside await loops and the number of steps outside awaits are bounded. We define a tiny concurrent assembly-like language which represents such programs, but not programs that violate the Bounded-Length principle. The purpose of this language is to allow us to prove things easily, not to conveniently program in it. Thus instead of a variety of statements, we consider only two statements: await loops (await) and event generating instructions (step). The syntax and types of our language are summarized in Fig. 8. The event generating instructions step use a pair of two lambda functions to generate events and modify the thread-local state. Thus the execution of the steps yields a sequence σ(t) of thread-local states.

We illustrate this with the small example in Fig. 9, which we execute starting with the (arbitrarily picked, for demonstrative purposes) thread-local state

    σ_0(r) = { 5   if r = r1
             { 1   if r = r2

in which the value of r1 is 5 and the value of r2 is 1. The first instruction of the program first evaluates λσ. W^rlx(x, σ(r1)) on σ_0 to determine which event (if any) should be generated by this instruction. In this case, the generated event is W^rlx(x, σ_0(r1)) = W^rlx(x, 5), i.e., a relaxed write of 5 to x. Next, the function λσ _. [ ] is evaluated on σ_0 to determine the set of changed registers and their new values in the next thread-local state σ_1. This function takes a second parameter which represents the value returned by the generated event in case the generated event is a read. Because the event in this case is not a read, no value is returned, and the function simply ignores the second parameter. The empty list [ ] indicates that no registers should be updated. Thus σ_1 = σ_0 and execution proceeds with the next instruction.

The second instruction generates the read event R^rlx(y), which reads the value of variable y. Assume for the sake of demonstration that this read (e.g., due to some other thread not shown here) returns the value 8. Now the function λσ v. [r1 → v] is evaluated on σ_1 and v = 8. The result [r1 → 8] indicates that the value of r1 should be updated to 8, and the next state σ_2 is computed as

    σ_2(r) = { 8        if r = r1
             { σ_1(r)   o.w.

In this state, the third instruction is executed. Because in σ_2 the value of r1 is not 0, the match goes to the second case, in which no event is generated (indicated by F^rlx, i.e., a relaxed fence which indicates a NOP). Thus again there is no read result v, and the next state σ_3 is computed simply as σ_3 = σ_2.

normal code                        event generators                     state transformers

r1 = x;                            λ_. R^rlx(x)                         λ_ v. [r1 → v]
y = r1 + 2;                        λσ. W^rlx(y, σ(r1)+2)                λ_ _. [ ]

if (x == 1) {                      λ_. R^rlx(x)                         λ_ v. [r1 → v]
    y = 2;                         λσ. match σ(r1) with                 λσ v. match σ(r1) with
    r2 = z;                             | 1 → W^rlx(y, 2)                    | 1 → [ ]
} else {                                | _ → R^rlx(z)                       | _ → [r2 → v]
    r2 = z;                        λσ. match σ(r1) with                 λσ v. match σ(r1) with
    y = r2;                             | 1 → R^rlx(z)                       | 1 → [r2 → v]
}                                       | _ → W^rlx(y, σ(r2))                | _ → [ ]

for (r1 = 0; r1 < 3; r1++) {       λσ. W^rlx(x, σ(r1))                  λσ _. [r1 → σ(r1)+1]
    x = r1;                        λσ. W^rlx(x, σ(r1))                  λσ _. [r1 → σ(r1)+1]
}                                  λσ. W^rlx(x, σ(r1))                  λσ _. [r1 → σ(r1)+1]

Figure 10: Example Encodings of Language Constructs as Event-generator/State-transformer Pairs

Note that each thread's program text is finite, and the only allowed loops are awaits. Thus programs with infinite behaviors or unbounded executions (which violate the Bounded-Length principle) can not be represented in this language.

Each statement generates up to one event that depends on a thread-local state, and modifies the thread-local state based on the previous state and (in case a read event was generated) the result of the read.

do_await_while({r1 = y;}, x == 1)      step(λ_. R^rlx(y), λ_ v. [r1 → v]);
                                       step(λ_. R^rlx(x), λ_ v. [r2 → v]);
                                       await(2, λσ. σ(r2) == 1)

Figure 11: Encoding of Do-Await-While as a Program in our Language

This is encoded using two types of lambda functions: the event generators, which map the current state to an event (possibly F^rlx), i.e., have type State → Event, and the state transformers, which map the current state and possibly a read result to an update to the thread-local state representing the new values of all the registers that are changed by the instruction, i.e., have type State × Value? → Update. Here T? is a so-called option type, which is similar to the nullable types of C#: v ∈ T? is either a value of T or ⊥ (standing for "none" or "null"):

    v ∈ T? ⟺ v ∈ T ∨ v = ⊥

Await loops are the only control construct in our language. Apart from awaits, the control of the thread only moves forward, one statement at a time. Different conditional branches are implemented through internal logic of the event generating instructions: the state keeps track of the active branch in the code, and the event generator and state transformer functions do a case split on this state. Bounded loops have to be unrolled. See Fig. 10.

We formalize the syntax of the language. There is a fixed, finite set of threads T, and each thread T ∈ T has a finite program text P_T, which is a sequence of statements. We denote the k-th statement in the program text of thread T by P_T(k). A statement is either an event generating instruction or a do-await-while statement. We assume that the sets of registers, values, and locations are all finite.

Event Generating Instruction
An event generating instruction has the syntax step(ε, δ), where ε : State → Event is an event generator and δ : State × Value? → Update is a state transformer. Note that the event generating instruction is roughly a tuple of two functions ε and δ. When the statement is executed in a thread-local state σ ∈ State, we first evaluate ε(σ) to determine which event is generated. If this event is a read, it returns a value v (defined based on reads-from edges in an execution graph G), which is then passed to δ to compute the new values for updated registers in the update δ(σ, v). The next state is then defined by taking all new values from δ(σ, v) and the remaining (unchanged) values from σ.

Do-Await-While
A do-await-while statement has the syntax await(n, κ), where n ∈ {1, 2, 3, ...} is the number of statements in the loop, and the loop condition κ : State → {0, 1} is a predicate over states telling us whether we must stay in the loop. If this statement is executed in thread-local state σ, we first evaluate κ(σ). In case κ(σ) evaluates to true, the control jumps back n statements, thus repeating the loop; otherwise it moves one statement ahead, thus exiting the loop.
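The two statement forms can be modelled directly in C with function pointers, which may make the semantics easier to follow (our own sketch; the struct and function names are not from the paper). An event generator and a state transformer become function pointers over a register file, and executing one statement either generates an event and applies the update, or evaluates the loop condition and adjusts the position of control.

    #include <stddef.h>
    #include <stdbool.h>

    #define NREGS 4
    typedef struct { int r[NREGS]; } state_t;                 /* thread-local state σ */
    typedef struct { int kind; int loc; int val; } event_t;   /* R/W/F/E events       */

    typedef event_t (*evt_gen)(const state_t *);              /* ε : State → Event    */
    typedef void (*st_trans)(state_t *, int has_val, int val);/* δ, applied in place  */
    typedef bool (*loop_cond)(const state_t *);               /* κ : State → {0,1}    */

    typedef struct {
        enum { STEP, AWAIT } tag;
        evt_gen   eps;    /* only for STEP  */
        st_trans  delta;  /* only for STEP  */
        int       n;      /* only for AWAIT: statements to jump back */
        loop_cond kappa;  /* only for AWAIT */
    } stmt_t;

    /* Execute statement program[k] in state s; the read result (if any) must be
     * supplied from the execution graph. Returns the next position of control. */
    static size_t exec_stmt(const stmt_t *program, size_t k, state_t *s,
                            int has_read_result, int read_result) {
        const stmt_t *st = &program[k];
        if (st->tag == STEP) {
            (void)st->eps(s);                           /* generate the event (it would be recorded in G) */
            st->delta(s, has_read_result, read_result); /* apply the register update                      */
            return k + 1;                               /* control moves forward                          */
        }
        /* AWAIT: jump back n statements if κ(σ) holds, otherwise exit the loop */
        return st->kappa(s) ? k - st->n : k + 1;
    }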
Syntactic Restriction of Awaits

We add two syntactic restrictions on awaits: 1) no nesting of awaits, and 2) an await which jumps back n statements needs to be at least at position n in the program:

    P_T(k) = await(n, _) → n ≤ k ∧ ∀k′ ∈ [k−n : k). P_T(k′) ≠ await(_, _)

These restrictions will allow us to easily identify the steps in an iteration of an await as steps in the range [k−n : k).

Figure 12: Two overlapping awaits.

The semantics of our language consist of two components: an execution graph G, which represents the concurrent execution of the events, and local instruction sequences that generate and refer to these events. At first glance, the instruction sequences and the event graph interlock like gears: the instruction sequences generate the events in the event graph, e.g., the reads and writes, and the event graph generates the values that are returned by those reads to the instruction sequences. Of course, the values returned by reads determine which events are generated next by the instruction sequences. Unfortunately, it is a bit more complex than this: due to weak memory models, the interlock is actually cyclical; a write event w of thread A can be generated based on a value returned by a previous read of A, which reads from a write event of thread B, which is generated based on a previous read of B which reads from w. Thus a simple step-by-step parallel construction of G and the instruction sequences is not possible.

Instead, we follow a more indirect (so-called axiomatic) semantics: we take an arbitrary (potentially cyclical) execution graph G and try to justify it ad hoc by finding local instruction sequences that are consistent with the events in the graph, i.e., 1) every event in G is generated by the instruction sequences, and 2) the instruction sequences use read results from the graph G. We define this in two steps: at first we ignore for simplicity the consistency predicate cons_M(G), which states that G is consistent with the memory model, and only check that G can be justified by the program text. We define this by a predicate cons_P(G) stating that G is consistent with the program. Then we make the definition complete by combining the two consistency predicates into a single predicate

    cons_PM(G) = cons_P(G) ∧ cons_M(G)

which states that the execution graph G is consistent with the program P under the weak memory model M. We borrow the notation of execution graphs from the work of Vafeiadis et al. [16], with the minor change that our events include barrier modes.

Defining cons_P(G)

The semantics of our language is defined relative to an execution graph G, which provides the values individual reads read. With reference to these values, the local program text of each thread can be executed locally. The graph is consistent with the program text if it contains exactly the events that occur during this local execution. The local execution of thread T is described by four sequences, which are defined by mutual recursion on the number of executed steps t ∈ {0, 1, 2, ...}: the thread-local state σ^T_G(t) after t steps, the position of control k^T_G(t) after t steps, the (potential) event e^T_G(t) generated in the t-th step, and the (potential) read result v^T_G(t) of that event. The definitions of e^T_G(t) and v^T_G(t) are not themselves recursive but refer to σ^T_G(t), while themselves being referenced in the definition of σ^T_G(t+1).
For this reason, the four sequences do not all have the same length. We denote the number of execution steps of thread T by N^T_G ∈ ℕ ∪ {∞}, where N^T_G = ∞ indicates that the thread does not terminate and hence makes infinitely many steps. The number of steps N^T_G coincides with the lengths |e^T_G| and |v^T_G| of the sequences e^T_G of events and v^T_G of read results of thread T, as every step generates up to one event and returns up to one read result:

    |e^T_G| = |v^T_G| = N^T_G

As usual for these fence-post cases, the number of states and positions of control is N^T_G + 1:

    |σ^T_G| = |k^T_G| = N^T_G + 1

Position of Control
The position of control k^T_G(t) ∈ {0, 1, 2, ...} is the index of the next statement P_T(k^T_G(t)) to be executed by thread T after executing t steps. All programs start at the first statement, i.e., k^T_G(0) = 0, so that P_T(k^T_G(0)) = P_T(0). After t ≤ N^T_G steps, the position of control may leave the program text, i.e., k^T_G(t) may no longer be an index into the sequence P_T of statements of thread T:

    k^T_G(t) ≥ |P_T|

In this case, the computation of thread T ends,

    k^T_G(t) ≥ |P_T| → N^T_G = t

and thus there is no k^T_G(t+1) that needs to be defined. Otherwise, the execution of the t-th step changes the position based on the statement P_T(k^T_G(t)) executed in that step. We abbreviate

    S^T_G(t) = P_T(k^T_G(t))

If this statement is an event generating instruction, we always move to the next statement, i.e.,

    P_T(k^T_G(t)) = step(_, _) → k^T_G(t+1) = k^T_G(t) + 1

If it is an await(n, κ), k is either also incremented by 1 (the loop is exited) or decremented by n (the loop is continued), depending on whether κ evaluates to false or true in the internal state σ^T_G(t) of thread T after t steps:

    P_T(k^T_G(t)) = await(n, κ) → k^T_G(t+1) = { k^T_G(t) + 1   if κ(σ^T_G(t)) = 0
                                                { k^T_G(t) − n   o.w.

Event Sequence
The t-th step (with t < N^T_G) generates the event e^T_G(t) ∈ Event, which is either ε(σ^T_G(t)), if the statement executed in this step is an event generating instruction step(ε, _):

    P_T(k^T_G(t)) = step(ε, _) → e^T_G(t) = ε(σ^T_G(t))

or, in case the statement is a do-await-while, a NOP event:

    P_T(k^T_G(t)) = await(_, _) → e^T_G(t) = F^rlx

This event must be in G, or G is not consistent with the program. Recall that G stores the event together with meta data indicating the thread T and the event index t in the program order of T. If no event with this meta data exists in the graph, the graph represents a partial execution of the program. In this case we stop the execution of thread T before the event is generated:

    ⟨T, t, −⟩ ∉ G.E → N^T_G = t

Otherwise, the event and its meta data form the triplet ⟨T, t, e^T_G(t)⟩. If some event with this meta data exists in the graph, but not this particular event, then the program generated a different event than the one provided by the graph; the graph is inconsistent with the program:

    ⟨T, t, e⟩ ∈ G.E ∧ e ≠ e^T_G(t) → ¬cons_P(G)

Note that T and t already uniquely identify e^T_G(t) in a consistent execution graph. To avoid redundancy we abbreviate

    ⟨⟨T, t⟩⟩_G = ⟨T, t, e^T_G(t)⟩

Read Result

The read result v^T_G(t) ∈ Value? is the value returned by a read event generated in step t < N^T_G of thread T. If no read event is generated, there is no read result:

    e^T_G(t) ≠ R^−(−) → v^T_G(t) = ⊥

Otherwise, the read reads from the write w = G.rf(⟨⟨T, t⟩⟩_G) and returns the value w.val written by that write. Note that in the case of a missing rf-edge, w may be ⊥ even though e^T_G(t) is a read event. In such cases, we return the read result ⊥. However, an instruction that generates a read event usually depends on the read result to compute the next state. Thus we will define in the next section that the thread-local execution terminates in case e^T_G(t) is a read event but the read result is ⊥. We collect the read result as the value v^T_G(t):

    e^T_G(t) = R^o(x) → v^T_G(t) = { G.rf(⟨⟨T, t⟩⟩_G).val   if G.rf(⟨⟨T, t⟩⟩_G) ≠ ⊥
                                   { ⊥                      o.w.

State Sequence
The thread-local state σ^T_G(t) ∈ State of thread T after executing t steps contains the values of all thread-local registers. We leave the initial state σ^T_G(0) of thread T uninterpreted. Each step then updates the local state based on the executed statement. Do-await-whiles never change the program state:

    P_T(k^T_G(t)) = await(_, _) → σ^T_G(t+1) = σ^T_G(t)

For event generating instructions, we consider two cases. The first (regular) case is that the value v^T_G(t) matches the event e^T_G(t), in the sense that v^T_G(t) provides a value if e^T_G(t) is a read. In this case the event generating instruction step(_, δ) with state transformer δ updates the state based on δ under control of the read result v^T_G(t):

    P_T(k^T_G(t)) = step(_, δ) ∧ (e^T_G(t) ≠ R^−(−) ∨ v^T_G(t) ≠ ⊥) → σ^T_G(t+1) = σ^T_G(t) ◁ δ(σ^T_G(t), v^T_G(t))

Here ◁ is the update operator, which takes all the new (updated) register values from δ(σ, v) and the remaining (unchanged) register values from σ:

    (σ ◁ σ′)(r) = { σ′(r)   if r ∈ Dom(σ′)
                  { σ(r)    o.w.

In the second (irregular) case, step t generated a read event but no read result was returned. In this case the computation gets stuck. We define that t is the last step. Since steps 0, ..., t have been executed, the number of steps is thus N^T_G = t + 1. Thus formally we need to define a state σ^T_G(t+1), despite the computation being stuck. We arbitrarily define σ^T_G(t+1) = σ^T_G(t):

    P_T(k^T_G(t)) = step(_, δ) ∧ e^T_G(t) = R^−(−) ∧ v^T_G(t) = ⊥ → N^T_G = t + 1 ∧ σ^T_G(t+1) = σ^T_G(t)

No Superfluous Events
We have so far checked that every event generated by the program is also in G. However, G is only consistent with a program if there is also no event in G that was not generated by the program, i.e., there are no superfluous events. More precisely, G is consistent with the program P exactly when the set of events G.E in G is exactly the set of events (plus meta data) generated by P:

    cons_P(G) ⟺ G.E = { ⟨⟨T, t⟩⟩_G | T ∈ T, t < N^T_G }

Register Read-From

Recall that the Bounded-Effect principle states that there must not be a visible side effect of a failed iteration of an await. Between threads, the only potentially visible side effects are the generated stores. Within a thread, updates to registers can also be visible, provided these registers are not overwritten in the meantime. We define a register read-from relation within the events of a single thread,

    G.rrf ⊆ G.po

which holds between events e and e′ exactly when the statement that generated e′ "depends" on a register updated by the statement that generated e which has not been overwritten in the meantime. More precisely, we put a register read-from edge in the graph between the t-th and u-th steps (with u ≥ t) if any function in the statement P_T(k^T_G(u)) executed by the u-th step depends on the visible output of the t-th step to the u-th step:

    ⟨⟨T, t⟩⟩_G →rrf ⟨⟨T, u⟩⟩_G ⟺ u ≥ t ∧ ∃f ∈ F(P_T(k^T_G(u))). depends-on(f, vis^T_G(t, u))

To define what it means to "depend" on these registers, we look at the functions ε, δ, κ of the statement and see whether the registers can affect the functions. We collect the functions of statement S in a set F(S) defined as follows:

    F(S) = { {ε, δ}   if S = step(ε, δ)
           { {κ}      if S = await(_, κ)

A function f ∈ F(S) depends on a set of registers R ⊆ Register if there are two states which only differ on registers in R on which f produces different results:

    depends-on(f, R) ⟺ ∃σ, σ′ ∈ State. (∀r ∉ R. σ(r) = σ′(r)) ∧ f(σ) ≠ f(σ′)

(For the sake of simplicity we use here curried notation for f = δ, i.e., δ(σ) ≠ δ(σ′) iff there is a v such that δ(σ, v) ≠ δ(σ′, v).)

To unify notation we define an update δ^T_G(t) for each step t by

    δ^T_G(t) = { δ(σ^T_G(t), v^T_G(t))   if P_T(k^T_G(t)) = step(_, δ) ∧ (e^T_G(t) ≠ R^−(−) ∨ v^T_G(t) ≠ ⊥)
               { ∅                        o.w.

where ∅ is the empty update (no registers changed). A straightforward induction shows:

Lemma 1. σ^T_G(t+1) = σ^T_G(t) ◁ δ^T_G(t)

We define the visible output of the t-th step to the u-th step to be the set of registers that are updated by the t-th step but not by the steps between t and u:

    vis^T_G(t, u) = Dom(δ^T_G(t)) \ ⋃_{u′ ∈ (t : u)} Dom(δ^T_G(u′))

Iterations of Await
In this section we define the steps that constitute iterations of an await. These are steps that execute statements with numbers k′ ∈ [k−n : k], where statement number k is a do-await-while statement that jumps back n steps. We enumerate the steps which are endpoints of await iterations:

    end^T_G(0)   = min { t | P_T(k^T_G(t)) = await(_, _) }
    end^T_G(q+1) = min { t > end^T_G(q) | P_T(k^T_G(t)) = await(_, _) }

We denote the length of such an iteration, i.e., the number n of steps jumped back by the do-await-while statement, by

    len^T_G(q) = n    where P_T(k^T_G(end^T_G(q))) = await(n, _)

The start point of the q-th iteration is defined by

    start^T_G(q) = end^T_G(q) − len^T_G(q)

We show that the intervals [start^T_G(q) : end^T_G(q)] for all q do not overlap. For this it suffices to show that only the last step in such an interval executes a do-await-while statement.

Lemma 2. t ∈ [start^T_G(q) : end^T_G(q)) → P_T(k^T_G(t)) ≠ await(_, _)

Proof.
Assume for the sake of contradiction that step t executes a do-await-while. W.l.o.g. t is the last step to do so:

    t = max { u ∈ [start^T_G(q) : end^T_G(q)) | P_T(k^T_G(u)) = await(_, _) }

Since the remaining steps between t and end^T_G(q) are not do-await-while statements, they move the position of control ahead one statement at a time. Thus the difference in the positions of control is equal to the difference in the step numbers:

    k^T_G(end^T_G(q)) − k^T_G(t+1) = end^T_G(q) − (t+1)

Furthermore, the loop condition of step t can not be satisfied, since otherwise control would jump back and then have to cross position k^T_G(t) a second time; but this contradicts the assumption that no more do-await-while statements are executed. Thus k^T_G(t+1) = k^T_G(t) + 1, and we conclude with simple arithmetic

    k^T_G(t) = k^T_G(end^T_G(q)) − (end^T_G(q) − t)

Since t is in the interval [start^T_G(q) : end^T_G(q)), which by definition has length len^T_G(q), we conclude first

    end^T_G(q) − t ≤ len^T_G(q)

and then that the await must be positioned in the last len^T_G(q) statements before the q-th executed await:

    k^T_G(t) ∈ [k^T_G(end^T_G(q)) − len^T_G(q) : k^T_G(end^T_G(q)))

but this contradicts the assumption that awaits are never nested.

We conclude that interval number q+1 starts only after interval number q ends. Monotonicity then immediately implies that the intervals are pairwise disjoint.

Lemma 3. end^T_G(q) < start^T_G(q+1)

Proof.
By definition, step end^T_G(q) executes a do-await-while statement:

    P_T(k^T_G(end^T_G(q))) = await(_, _)

By contraposition of Lemma 2, the step is not in interval number q+1:

    end^T_G(q) ∉ [start^T_G(q+1) : end^T_G(q+1))

Due to monotonicity, interval number q ends before interval number q+1,

    end^T_G(q) < end^T_G(q+1)

and the claim follows:

    end^T_G(q) < start^T_G(q+1)

Iteration q is failed if the loop condition in step end^T_G(q) evaluates to 1:

    fail^T_G(q) ⟺ κ(σ^T_G(end^T_G(q))) = 1    where P_T(k^T_G(end^T_G(q))) = await(_, κ)
We plan to cut from the graph all failed iterations which result in a wasteful graph. These are failed iterations q in which the next iteration reads from exactly the same stores. We define this precisely through a predicate WI^T_G(q):

    WI^T_G(q) ⟺ fail^T_G(q) ∧ ∀m ≤ len^T_G(q). e^T_G(start^T_G(q) + m) = R^−(−) →
                 G.rf(⟨⟨T, start^T_G(q) + m⟩⟩_G) = G.rf(⟨⟨T, start^T_G(q+1) + m⟩⟩_G)

Iff a graph has any such iterations, we say that it is wasteful. Formally:

    W(G) ⟺ ∃T, q. WI^T_G(q)

The Bounded-Effect principle is now easily formalized. We define BE(G) to hold if in graph G no register read-from arrow leaves a failed iteration of an await. For the sake of simplicity we fully forbid generating write events in failed iterations of awaits. This is unlikely to be a practical restriction (it is possible to construct theoretical examples where this restriction changes the logic of the code, but we are not aware of any practical examples; in any case, the restriction can be lifted with considerable elbow grease); as reads-from edges from such writes that leave the iteration are anyway forbidden, this directly only affects loop-internal write-read pairs, which can be simulated using registers.

Definition 3.
Graph G satisfies the Bounded-Effect principle, i.e., BE(G), iff for all threads T, failed await iterations q with fail^T_G(q), and step numbers t ∈ [start^T_G(q) : end^T_G(q)) we have:

1. no write event is generated in step t of thread T:

    e^T_G(t) ≠ W^−(−, −)
2. if the event generated in step t of thread T is register read-from by the event generated in step u of thread T, then u is in the same failed iteration q:

    ⟨⟨T, t⟩⟩_G →rrf ⟨⟨T, u⟩⟩_G → u ∈ [start^T_G(q) : end^T_G(q))

In the remainder of this text, we assume that all graphs that are consistent with the memory model and with P satisfy the Bounded-Effect principle:

    ∀G. cons_PM(G) → BE(G)    (1)

We define the sets G_F and G_∞ of execution graphs with a finite resp. infinite number of failed await iterations by

    G_F = { G | cons_PM(G) ∧ ∀T. ∃q. ∀q′ ≥ q. ¬fail^T_G(q′) }
    G_∞ = { G | cons_PM(G) ∧ ∃T. ∀q. ∃q′ ≥ q. fail^T_G(q′) } = { G | cons_PM(G) } \ G_F

Await termination holds if G_∞ is empty:

    AT ⟺ G_∞ = ∅

We now show a series of lemmas that lead us to the main theorem:
Theorem 1.
1. The set G_F* = { G ∈ G_F | ¬W(G) } of non-wasteful execution graphs is finite, and every G ∈ G_∞* = { G ∈ G_F* | stagnant(G) } which is stagnant is finite.

2. If an error event E exists in a graph G ∈ G_F, it also exists in a graph G′ ∈ G_F*.

3. Every graph G ∈ G_∞ can be cut to a graph G′ ∈ G_∞*.
4. Every graph G ∈ G_∞* can be extended to a graph G′ ∈ G_∞.

In the proofs we ignore the memory-model consistency. This can be easily proven on a case-by-case basis (i.e., for concrete M), but a generic proof must rely on certain abstract features of the memory model which are hard to identify in a generic fashion. We leave a generic proof with appropriate conditions as future work.

We first show that after a failed iteration, we return to the start of the await and immediately repeat the iteration (possibly with a different outcome).

Lemma 4. fail^T_G(q) → k^T_G(end^T_G(q)+1) = k^T_G(start^T_G(q)) ∧ start^T_G(q+1) = end^T_G(q)+1 ∧ end^T_G(q+1) = end^T_G(q)+1+n

Proof. By definition the loop condition κ in step end^T_G(q) is satisfied:

    P_T(k^T_G(end^T_G(q))) = await(n, κ) ∧ κ(σ^T_G(end^T_G(q))) = 1

Control thus jumps back n = len^T_G(q) steps:

    k^T_G(end^T_G(q)+1) = k^T_G(end^T_G(q)) − n

By Lemma 2 the previous n steps all do not execute do-await-while statements and thus moved control forward linearly; the first part of the claim follows:

    k^T_G(end^T_G(q)) − n = k^T_G(end^T_G(q) − n) = k^T_G(start^T_G(q))

We next prove the third part of the claim. Observe that the next n statements are exactly the same (non-await) statements, and thus after an additional n steps we have again

    k^T_G(end^T_G(q)+1+n) = k^T_G(end^T_G(q)+1) + n = k^T_G(end^T_G(q))

which is a do-await-while. Hence by the definition of end^T_G we have end^T_G(q+1) = end^T_G(q)+1+n, which is the third part of the claim. For the remaining second part of the claim, note that the do-await-while still jumps back n statements. Thus by definition of start^T_G and len^T_G we have

    start^T_G(q+1) = end^T_G(q+1) − len^T_G(q+1) = end^T_G(q)+1+n−n = end^T_G(q)+1

Lemma 5.
Let q be the index of an iteration that is wasteful, and m be the number of steps taken by the thread inside the iteration (without leaving it):

    WI^T_G(q) ∧ m ≤ len^T_G(q)

Then all of the following hold:

1. the same events are generated in iterations q and q+1 after m steps:

    e^T_G(start^T_G(q) + m) = e^T_G(start^T_G(q+1) + m)

2. the same values are observed:

    v^T_G(start^T_G(q) + m) = v^T_G(start^T_G(q+1) + m)
3. the same position of control is reached:

    k^T_G(start^T_G(q) + m) = k^T_G(start^T_G(q+1) + m)
4. if the value of register r is not the same after m steps in the two iterations,

    σ^T_G(start^T_G(q) + m)(r) ≠ σ^T_G(start^T_G(q+1) + m)(r)

then r must be an output of one of the steps u of the failed iteration q which is still visible after m steps:

    ∃u ∈ [start^T_G(q) : end^T_G(q)]. r ∈ vis^T_G(u, start^T_G(q+1) + m)

Proof.
We first show that claims 1 and 2 follow from claims 3 and 4. We know from claim 3 that the position of control is the same. Thus also the executed statement is the same:

    P_T(k^T_G(start^T_G(q) + m)) = P_T(k^T_G(start^T_G(q+1) + m))

We split cases on the type of statement executed by the steps; in the case it is a do-await-while we are done, as no event or read result is generated. In the other case, we have

    P_T(k^T_G(start^T_G(q) + m)) = P_T(k^T_G(start^T_G(q+1) + m)) = step(ε, _)

From the Bounded-Effect principle we know that there is no register read-from edge from a step u ∈ [start^T_G(q) : end^T_G(q)] of the failed iteration q to step start^T_G(q+1) + m, which is outside that iteration (Lemma 3):

    ⟨⟨T, u⟩⟩_G ↛rrf ⟨⟨T, start^T_G(q+1) + m⟩⟩_G

and thus in particular ε does not depend on the visible outputs of step u to that step:

    ¬depends-on(ε, vis^T_G(u, start^T_G(q+1) + m))

From claim 4 we know that the only differences between the two states are on registers which are such visible outputs. Thus with the definition of depends-on we know

    ε(σ^T_G(start^T_G(q) + m)) = ε(σ^T_G(start^T_G(q+1) + m))

and thus the generated events are the same, which is claim 1:

    e^T_G(start^T_G(q) + m) = e^T_G(start^T_G(q+1) + m)

For claim 2 we only consider read events:

    e^T_G(start^T_G(q) + m) = R^−(−)

We have by assumption that iteration q is wasteful, thus the two events read from the same store,

    G.rf(⟨⟨T, start^T_G(q) + m⟩⟩_G) = G.rf(⟨⟨T, start^T_G(q+1) + m⟩⟩_G)

and thus read the same value. Claim 2 immediately follows.

Claims 4 and 3 are shown by joint induction on m. In the base case m = 0, the positions of control agree by Lemma 4:

    k^T_G(start^T_G(q+1)) = k^T_G(end^T_G(q)+1) = k^T_G(start^T_G(q))

Any register r which differs before and after iteration q,

    σ^T_G(start^T_G(q))(r) ≠ σ^T_G(end^T_G(q)+1)(r)

must have been modified by some step u ∈ [start^T_G(q) : end^T_G(q)] in that iteration:

    r ∈ Dom(δ^T_G(u))

W.l.o.g. u is the last such write, in which case the effect is still visible to step end^T_G(q)+1,

    r ∈ vis^T_G(u, end^T_G(q)+1)

and the claim follows, as by Lemma 4 step end^T_G(q)+1 is the first step of iteration q+1:

    end^T_G(q)+1 = start^T_G(q+1)

In the induction step m → m+1,
we know by the induction hypothesis that the position of control is the same after m steps in the respective iterations and that the states are the same (modulo visible outputs). As we have shown before, this implies that the read result is also the same:

    v^T_G(start^T_G(q) + m) = v^T_G(start^T_G(q+1) + m)

Analogous to the proof that ε produces the same event due to the Bounded-Effect principle, one can also conclude that the new position of control must be the same (i.e., claim 3 holds),

    k^T_G(start^T_G(q) + m + 1) = k^T_G(start^T_G(q+1) + m + 1)

and that the applied update is the same (which apart from the state also depends on the read result):

    δ^T_G(start^T_G(q) + m) = δ^T_G(start^T_G(q+1) + m)

Assume, for the sake of showing the only remaining claim, that register r has a different value in the new states:

    σ^T_G(start^T_G(q) + m + 1)(r) ≠ σ^T_G(start^T_G(q+1) + m + 1)(r)

By Lemma 1 the updates δ^T_G determine the state change; since the states were updated in the same way, r cannot have been updated,

    r ∉ Dom(δ^T_G(start^T_G(q) + m))

and the difference was already present before the step:

    σ^T_G(start^T_G(q) + m)(r) ≠ σ^T_G(start^T_G(q+1) + m)(r)

By the induction hypothesis this implies that r was a visible output of some step u from iteration q to the previous step,

    ∃u ∈ [start^T_G(q) : end^T_G(q)]. r ∈ vis^T_G(u, start^T_G(q+1) + m)

and since it is not updated in this step, it is still visible to the next step,

    ∃u ∈ [start^T_G(q) : end^T_G(q)]. r ∈ vis^T_G(u, start^T_G(q+1) + m + 1)

which is the claim.

Next we show that such a failed iteration can be safely removed. For this we define a deletion operation G − (T, q), which deletes the q-th iteration of thread T from the graph. We only define this in case the q-th iteration failed. In this case, by the Bounded-Effect principle there are no write events that are deleted, and thus we do not have to pay attention to deleting writes that are referenced by other reads. Events of other threads are not affected at all. Neither are events generated before the start of the deleted iteration. For events generated after the deleted iteration we simply reduce the event index by the number of events in the deleted iteration (len^T_G(q) + 1):

    (G − (T, q)).E_U = G.E_U    if U ≠ T
    (G − (T, q)).E_T = { ⟨T, t, e⟩ ∈ G.E_T | t < start^T_G(q) }
                       ∪ { ⟨T, t − (len^T_G(q)+1), e⟩ | ⟨T, t, e⟩ ∈ G.E_T ∧ t > end^T_G(q) }

This can also be defined by means of a partial, invertible renaming function

    r : G.E ⇀ (G − (T, q)).E

which maps each non-deleted event to its renamed event in G − (T, q):

    r(⟨U, t, e⟩) = { ⟨U, t, e⟩                      if U ≠ T ∨ t < start^T_G(q)
                   { ⟨U, t − (len^T_G(q)+1), e⟩     if U = T ∧ t > end^T_G(q)

We have:

    (G − (T, q)).E = r(G.E)

For the reads-from relationship, we simply re-map the edges between the renamed events:

    (G − (T, q)).rf(e) = r(G.rf(r^{−1}(e)))

We show that this graph is still consistent with the program.
Lemma 6. $\mathit{cons}_P(G) \land \mathit{fail}^T_G(q) \rightarrow \mathit{cons}_P(G - (T,q))$

Proof.
For threads other than T there is nothing to show as the event sequences and rf-edges are fully unchanged.For T , we focus on the steps after the deletion, which may be affected by the change in registers. We will show:any changes to registers after step start TG ( q ) were visible changes of a register by one of the deleted steps. Otherthings have not changed. Thus any dependence on these changed registers would imply a register-read-fromrelation in the original graph G , which is forbidden by the Bounded-Effect principle. For the sake of brevity wedefine G (cid:48) = G − ( T , q ) Lemma 7.
If $t$ is a step behind the deleted parts in the new graph, $t \ge \mathit{start}^T_G(q)$, then both of the following hold:

1. The position of control in, event generated by, and read result seen in step $t$ are unaffected by the deletion (relative to the original values of step $t + \mathit{len}^T_G(q) + 1$):
$$k^T_{G'}(t) = k^T_G(t + \mathit{len}^T_G(q) + 1) \;\land\; e^T_{G'}(t) = e^T_G(t + \mathit{len}^T_G(q) + 1) \;\land\; v^T_{G'}(t) = v^T_G(t + \mathit{len}^T_G(q) + 1)$$
2. If $r$ is a register whose value was changed by the deletion, i.e., $\sigma^T_{G'}(t)(r) \ne \sigma^T_G(t + \mathit{len}^T_G(q) + 1)(r)$, then there is a step $u$ in the original graph which still has a visible effect on $r$:
$$\exists u \in [\mathit{start}^T_G(q) : \mathit{end}^T_G(q)].\; r \in \mathit{vis}^T_G(u,\, t + \mathit{len}^T_G(q) + 1)$$

Proof.
The proof is analogous to the proof of Lemma 5 and omitted.21ow to show that cons P is preserved we simply consider the sets of events of the individual threads U ∈ T and show that they are not affected: ∀ U . G (cid:48) . E U = (cid:8) (cid:104)(cid:104) U , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t < N UG (cid:48) (cid:9) (2)We split cases on U (cid:54) = T . For threads U (cid:54) = T other than T , nothing has changed and the claim follows from theconsistency of G G (cid:48) . E U = G (cid:48) . E U = (cid:8) (cid:104)(cid:104) U , t (cid:105)(cid:105) G (cid:12)(cid:12) t < N UG (cid:9) = (cid:8) (cid:104)(cid:104) U , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t < N UG (cid:48) (cid:9) (3)For thread T , we split the set into those events generated before the cut (which have not changed) (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:12)(cid:12) t < start TG ( q ) (cid:9) = (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t < start TG ( q ) (cid:9) (4)and the events generated after the cut, for which Lemma 7 shows that only indices have changed: (cid:8) (cid:104) T , t − ( len TG ( q ) + ) , e TG ( t ) (cid:105) (cid:12)(cid:12) t > end TG ( q ) ∧ t < N TG (cid:9) = (cid:8) (cid:104) T , t − ( len TG ( q ) + ) , e TG ( t ) (cid:105) (cid:12)(cid:12) t > start TG ( q ) + ( len TG ( q ) + ) ∧ t < N TG (cid:9) = (cid:8) (cid:104) T , t − ( len TG ( q ) + ) , e TG ( t ) (cid:105) (cid:12)(cid:12) t − ( len TG ( q ) + ) > start TG ( q ) ∧ t < N TG (cid:9) = (cid:8) (cid:104) T , t , e TG ( t + ( len TG ( q ) + )) (cid:105) (cid:12)(cid:12) t > start TG ( q ) ∧ t + ( len TG ( q ) + ) < N TG (cid:9) rebase t = (cid:8) (cid:104) T , t , e TG (cid:48) ( t ) (cid:105) (cid:12)(cid:12) t > start TG ( q ) ∧ t < N TG − ( len TG ( q ) + ) (cid:9) L 7 = (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t > start TG ( q ) ∧ t < N TG (cid:48) (cid:9) (5)Jointly with Eq. (4) this proves Eq. (2) for U : = T : G (cid:48) . E T = r ( G . E T )= (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:12)(cid:12) t < start TG ( q ) (cid:9) ∪ (cid:8) (cid:104) T , t − ( len TG ( q ) + ) , e TG ( t ) (cid:105) (cid:12)(cid:12) t > end TG ( q ) ∧ t < N TG (cid:9) = (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t < start TG ( q ) (cid:9) ∪ (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:48) (cid:105) (cid:12)(cid:12) t ≥ start TG ( q ) ∧ t < N TG (cid:48) (cid:9) E (4), (5) = (cid:8) (cid:104)(cid:104) T , t (cid:105)(cid:105) G (cid:48) (cid:12)(cid:12) t < N TG (cid:48) (cid:9) Together with Eq. (3) this shows Eq. (2). By Eq. (2) we conclude that the union over all threads U of events in G (cid:48) . E U is equal to the events generated by all threads: (cid:91) U ∈ T G (cid:48) . E U = (cid:91) U ∈ T (cid:8) (cid:104)(cid:104) U , t (cid:105)(cid:105) G (cid:48) (cid:105) (cid:12)(cid:12) t < N TG (cid:48) (cid:9) It immediately follows that the set of all events in G (cid:48) is equal to the set of events generated by all threads G (cid:48) . E = (cid:8) (cid:104)(cid:104) U , t (cid:105)(cid:105) G (cid:48) (cid:105) (cid:12)(cid:12) U ∈ T , t < N TG (cid:48) (cid:9) which is the definition of cons P ( G (cid:48) ) , i.e. , the claim.Next we iteratively eliminate all wasteful iterations of awaits. This takes us from any graph G ∈ G F to agraph G (cid:48) ∈ G F ∗ but preserves at least some error events. Lemma 8. G ∈ G F ∧ (cid:104)− , − , E (cid:105) ∈ G . 
$E \rightarrow \exists G' \in \mathcal{G}_F^{*}.\; \langle -, -, E\rangle \in G'.E$

Proof.
We construct a series G ( i ) of graphs in which i wasteful iterations have been eliminated from G . Since G ∈ G F there are only finitely many failed iterations we need to remove, thus the sequence is finite. We begingraph G in which 0 iterations have been deleted G = G G i + = G i − ε (cid:8) ( T , q ) (cid:12)(cid:12) WI TG ( q ) (cid:9) The last graph G I − in the sequence with index I = | G ( i ) | by definition does not have any wasteful iterations (cid:64) T , q . WI TG I − ( q ) and thus is not wasteful ¬ W ( G I − ) By repeated application of Lemma 6 we conclude that all the graphs in the sequence, including G I − , areconsistent with the program cons P ( G I − ) and thus G (cid:48) : = G I − is in G F ∗ G I − ∈ G F ∗ It now suffices to show that the error event is preserved (although possibly generated by a different step).Assume (cid:104)− , − , E (cid:105) ∈ G . EBy definition we only delete events in wasteful iterations. By Lemma 5 every event in such an iteration is repeatedin the next iteration. The events outside the deleted iteration are maintained (cf. Lemma 7). We conclude: theevent E is generated in every graph in the sequence, in particular also in the last one (cid:104)− , − , E (cid:105) ∈ G I − . Ewhich is the claim.This shows that no bugs are missed by AMC. Our next goal is to show that AMC can terminate. We firstshow that the number of writes generated in a graph is bounded.
Lemma 9. ∃ b . ∀ G . cons PM ( G ) → | (cid:8) ( T , t ) (cid:12)(cid:12) e TG ( t ) = W − ( − , − ) , T ∈ T , t < N TG (cid:9) | ≤ bProof. The bound is equal to the sum of the program lengths of each thread b : = ∑ T ∈ T | P T | The reason for this is that threads only repeat statements in case they fail a loop iteration; but these failed loopiterations by the Bounded-Effect principle do not produce writes. We show: if step t of thread T generates awrite event, statement k TG ( t ) is never executed again. e TG ( t ) = W − ( − , − ) → ∀ u > t . k TG ( u ) > k TG ( t ) (6)Let u w.l.o.g. be the first step after t in which the position of control is at k TG ( t ) or before k TG ( u ) ≤ k TG ( t ) ∧ k TG ( u − ) > k TG ( t ) By the semantics of the language, step u − P T ( k TG ( u − )) = await ( n , κ ) ∧ κ ( σ TG ( u − )) = k TG ( u ) and k TG ( u − ) are not awaits ∀ k ∈ [ k TG ( u ) : k TG ( u − )) . P T ( k ) (cid:54) = await ( − , − )
23f course k TG ( t ) is in that interval; thus all steps between k TG ( t ) and K TG ( u − ) are not awaits. By the semantics ofthe language, these steps moved control ahead one statement per step. Thus at most n steps have passed between t and u − ( u − ) − t = k TG ( u − ) − k TG ( t ) < n Since step u − q of an await of length nend TG ( q ) = u − ∧ len TG ( q ) = n it follows that step t is one of the steps in that iteration t ∈ [ start TG ( q ) : end TG ( q )] Because the loop condition is satisfied, the iteration is a failed iteration fail TG ( q ) and we conclude from the Bounded-Effect principle: step t does not produce a write e TG ( t ) (cid:54) = W − ( − , − ) which is a contradiction. This proves Eq. (6), i.e. , each statement can produce at most one write. Thus the totalnumber of writes produced by thread T is at most the size | P T | of the program text of thread T | (cid:8) t (cid:12)(cid:12) e TG ( t ) = W − ( − , − ) , t < N TG (cid:9) | ≤ | P T | The claim follows: | (cid:8) ( T , t ) (cid:12)(cid:12) e TG ( t ) = W − ( − , − ) , T ∈ T , t < N TG (cid:9) | = ∑ T ∈ T | (cid:8) t (cid:12)(cid:12) e TG ( t ) = W − ( − , − ) , t < N TG (cid:9) | ≤ ∑ T ∈ T | P T | In a graph which satisfies the progress condition, each iteration of await reads from a different combinationof writes. Since the set of writes is bounded, the possible number of combinations is bounded as well. Thus thenumber of failed iterations is also bounded.
Lemma 10. G F ∗ is finiteProof. We show instead that each G ∈ G F ∗ has bounded length (bounded by some constant B ). This impliesthat the graphs are finite as every G can then be encoded as a pair of a sequence of B events and a sequenceof B numbers in the range [ b − ] indicating rf-edges. Since the set of events is finite, the number of suchencodings is finite, and so is G F ∗ .Assume for the sake of contradiction that no such bound exists. Thus graphs in G F ∗ can be arbitrarily large.Since they are consistent with the program, this means that the program execution can become arbitrarily long.Since there are only finitely many threads, one thread T can be executed for arbitrarily long. Since the programtext P T has finite length and thus only finitely many awaits, there has to be one await that can be made to failarbitrarily often. Let the line number of this await be k , and let G , G , G , . . . be graphs in which the await failszero times, once, twice, etc. The await jumps back some number n of statements P T ( k ) = await ( n , κ ) and thus produces at most n reads. By the progress condition, at least one of them must read from a new writein each iteration. Coherence forbids going back to the previous writes. By Lemma 9 the number of availablewrites in each graph G i is at most b . Thus there can be at most n · b iterations of this loop. Consider G n · b + , i.e. ,the graph in which one iteration beyond that has been executed. Note that after l iterations of the await, at most n · b − l writes are still available to read from. Thus in the final iteration n · b +
1, the thread has at most − B must exist and thus G F ∗ is finite.24his proves our claims about G F / G F ∗ . Next we consider the graphs in G ∞ ∗ which indicate await violations. Lemma 11.
Every graph G ∈ G ∞ ∗ is finite.Proof. Observe that every G ∈ G ∞ ∗ is by definition an element of G F ∗ and thus finite.We proceed to show that graphs in G ∞ ∗ always indicate await terminations. This depends on the definition ofstagnancy which we have not shown yet. We define it as follows. Let V G = (cid:8) T ∈ T (cid:12)(cid:12) k TG ( N TG ) < | P T | (cid:9) be the set of threads which have not terminated (yet) in G . G is stagnant iff all of the following are true:1. Some threads have not terminated V G (cid:54) = /0
2. All of those threads have just completed a failed await loop iteration ∀ T ∈ V G . N TG = end TG ( q ) ∧ fail TG ( q )
3. There is no extension G (cid:48) of G (with G . X ⊆ G (cid:48) . X for X ∈ { E , rf , mo } ) where any threads in V G haveterminated and threads read only from stores that are already available in G cons PM ( G (cid:48) ) ∧ Dom ( G (cid:48) . rf ) ⊆ G . E → V G = V G (cid:48) Lemma 12. G ∞ ∗ (cid:54) = /0 → G ∞ (cid:54) = /0 Proof.
Assume that G ∞ ∗ is non-empty. Thus there is some stagnant graph G ∈ G ∞ ∗ stagnant ( G ) The set of threads that have not completed their program in G V G = (cid:8) T ∈ T (cid:12)(cid:12) k TG ( N TG ) < | P T | (cid:9) is by definition of stagnancy non-empty.By definition of stagnancy, each of these threads in V G must be in the q T + q T of the same await failed, and 3) there are no other writes to read from that would result in the threadterminating. We extend the graph to an infinite graph G (cid:48) by adding failed await loop iterations in which eachload reads from the mo-maximal store to its location. Let for T ∈ V G the index of the previous (failed) iterationof the await be q T fail TG ( q T ) ∧ N TG = end TG ( q T ) We add for T ∈ V G events for an infinite number of iterations, numbering each with an index n ∈ N , and eachstep inside that iteration with an index m ≤ len TG ( q ) , replicating the same events over and over G (cid:48) . E T = G . E ∪ (cid:8) (cid:104) T , N TG + n · len TG ( q ) + m , e TG ( start TG ( q ) + m ) (cid:105) (cid:12)(cid:12) n ∈ N , m ≤ len TG ( q ) (cid:9) The newly added read events will read from the mo-maximal stores to their locations. Let e = (cid:104) T , N TG + n · len TG ( q ) + m , e TG ( start TG ( q ) + m ) (cid:105) be such a read event. We define G (cid:48) . rf ( e ) = max G . mo G . W loc ( e ) Other than that we change nothing. Obviously this repetition results in a wasteful execution. With Lemma 5 itis easy to show that all of these new loop iterations are consistent with the program and are themselves failedwasteful iterations cons PM ( G (cid:48) ) G ∞ , which is the claim G (cid:48) ∈ G ∞ It only remains to show that every graph G ∈ G ∞ can be cut to a graph in G ∞ ∗ . Lemma 13. G ∞ (cid:54) = /0 → G ∞ ∗ (cid:54) = /0 Proof.
Assume that there are graphs which violate await termination and let G ∈ G ∞ be such a graph. Analogousto how we showed in Lemma 10 that there is some thread T which executes some await in line k an arbitrarynumber of times, we can show that some thread executes infinitely many steps. Let V be the (non-empty) set ofthese threads V = (cid:8) T (cid:12)(cid:12) N TG = ∞ (cid:9) Each of these threads T ∈ V executes some await in line k T infinitely often P T ( k T ) = await ( n , κ ) ∧ ∀ t . ∃ u ≥ t . k TG ( u ) = k T By Lemma 9 the number of writes is bounded, thus (as also shown in Lemma 10) this loop have only finitelymany non-wasteful iterations. Starting from some iteration q T , the writes observed by the await never change;iteration q T and all subsequent iterations are wasteful. ∀ q (cid:48) ≥ q T . WI TG ( q (cid:48) ) Due to fairness, this is only allowed if there is no possibility for the reads to read anything else: otherwise,eventually one of the subsequent iterations would need to read from one of the other available writes and hencenot be a wasteful iteration. This allows us to construct a graph G (cid:48) ∈ G ∞ ∗ which stagnates. We generate G (cid:48) bycutting down all failed wasteful iterations of threads in T . Assume w.l.o.g. that the iteration q T is the firstwasteful iteration of thread T (the finitely many preceding wasteful iterations can be deleted iteratively withLemma 7) ¬ WI TG ( q (cid:48) − ) We define: G (cid:48) . E T = (cid:8) (cid:104) T , t , e (cid:105) ∈ G . E T (cid:12)(cid:12) t ≤ end TG ( q T ) (cid:9) This ensures all conditions of stagnant : these threads have not terminated, by definition they just finished a failedawait loop iterationevent, and the only available writes force the thread to stay in the loop indefinitely. Thus wehave stagnant ( G (cid:48) ) Furthermore, the graph is still consistent with the program because the beginning of the thread-local execution isexactly the same as for G cons P ( G (cid:48) ) Since all wasteful iterations have been deleted, the graph G (cid:48) is not wasteful ¬ W ( G (cid:48) ) and can thus in the set of graphs G ∞ ∗ that are searched by AMC G (cid:48) ∈ G ∞ ∗ which proves the claim.The main theorem (Theorem 1) follows from Lemmas 8 and 11 to 13.26 Study Cases
In this section, we discuss in detail three study cases: a bug in the MCS lock of the DPDK library, a bug in the MCS lock of an internal Huawei product, and a comparison of expert optimization and VSync optimization of the Linux qspinlock. We report on bugs found with VSync as well as limitations.
The Data Plane Development Kit (DPDK) is a popular set of libraries used to develop packet-processing software in user space. VSync found a bug in the MCS lock of the current DPDK version (v20.05). Figure 13 shows the part of the implementation that concerns us. At the end of the code, we added the bug scenario in which two threads, Alice and Bob, are involved. Alice wants to acquire the lock (see the run_alice() function), and Bob currently holds the lock and is about to release it (see run_bob()). Note that we removed the slowpath of rte_mcslock_unlock() since the bug only occurs in the fastpath. The core of the bug is a missing rel barrier before or at Line 27, which causes Alice to hang and never enter the critical section.

    /* SPDX-License-Identifier: BSD-3-Clause
     * Copyright(c) 2019 Arm Limited
     */
    typedef struct rte_mcslock {
        struct rte_mcslock *next;
        int locked; /* 1 if the queue locked, 0 otherwise */
    } rte_mcslock_t;

    static inline void
    rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
    {
        rte_mcslock_t *prev;

        /* Init me node */
        __atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);
        __atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);

        /* If the queue is empty, the exchange operation is
         * enough to acquire the lock. Hence, the exchange
         * operation requires acquire semantics. The store to
         * me->next above should complete before the node is
         * visible to other CPUs/threads. Hence, the exchange
         * operation requires release semantics as well.
         */
        prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
        if (prev == NULL) {
            return;
        }
        __atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);

        /* The while-load of me->locked should not move above
         * the previous store to prev->next. Otherwise it will
         * cause a deadlock. Need a store-load barrier.
         */
        __atomic_thread_fence(__ATOMIC_ACQ_REL);
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
            rte_pause();
    }

    static inline void
    rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
    {
        if (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL) {
            // **ignore this branch**
        }
        /* Pass lock to next waiter. */
        __atomic_store_n(&me->next->locked, 0, __ATOMIC_RELEASE);
    }

    //---------------------------------------------------------
    // bug scenario
    //---------------------------------------------------------
    // 2 threads: alice and bob.
    rte_mcslock_t alice, bob;
    // bob has the lock
    rte_mcslock_t *tail = &bob;

    void run_alice() { rte_mcslock_lock(&tail, &alice); }
    void run_bob()   { rte_mcslock_unlock(&tail, &bob); }

Figure 13: Part of the DPDK MCS lock implementation describing the scenario in which Alice hangs.
The bug on IMM.
Figure 14 shows an execution graph in which Alice hangs. AMC gives exactly this executiongraph as counter-example for await termination, but in the text form. The U sc pair of events are the “read part”and “write part” of the atomic exchange (Line 23 of code in Fig. 13). To understand why exchange is modeledwith two events, remember that atomic exchange is implemented with load-linked/store-conditional instructionpairs in many architectures. For example in ARMv8, atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL) iscompiled to
    38: c85ffc02  ldaxr x2, [x0]
    3c: c803fc01  stlxr w3, x1, [x0]
    40: 35ffffc3  cbnz  w3, 38
Intuitively, the load instruction is the "read part" of the exchange, whereas the store instruction is the "write part" of the exchange. Also note that, for the sake of this bug, __ATOMIC_ACQ_REL is equivalent to __ATOMIC_SEQ_CST, i.e., even with the stronger sc barrier mode, the bug can still manifest. (DPDK source: https://github.com/DPDK/dpdk)
Figure 14: IMM. The bug results in Alice hanging: Alice writes to bob->next with rlx mode, and Bob reads with rlx mode, which allows Bob's write to be ordered before the initialization of me->locked.
Figure 15: IMM. With the bug fixed, Alice writes to prev->next with rel mode, and Bob reads with acq mode, creating a synchronizes-with edge, which forces Bob's write to occur after the initialization.

Returning to the bug in IMM, Alice starts by initializing her node, in particular, setting alice->locked to 1. After exchanging the tail, Alice writes to bob->next. Although Bob reads from Alice's write to bob->next, IMM allows Alice's write to alice->locked to happen after Bob's write to alice->locked because no happens-before relation is established between Alice and Bob. The mo relation shows this order of modifications. If that occurs, Alice's fate is to await alice->locked becoming 0 forever. To establish the correct happens-before relation between Alice and Bob, Alice's write to bob->next has to be rel, and Bob's read of bob->next has to be acq (see Fig. 15). That causes both events to "synchronize with", guaranteeing that Alice's write to alice->locked happens before Bob's. Note that in IMM the happens-before relation projected to one memory location, e.g., alice->locked, implies the same order in the visible memory updates of that location, i.e., in the modification order mo of alice->locked. The happens-before relation does not, however, imply an ordering between the writes to distinct memory locations [25].

The bug on ARM.
The bug is not exclusive to the IMM model. Figure 16 shows an execution graph manually adapted to the ARM memory model (ARM for short). In this memory model, a global order of events exists; so, we number the events with one possible global order. ARM allows the write to alice->locked (event 7) to happen after the read part of the atomic exchange (U^R_sc, event 3), but does not allow it to happen after the write part (U^W_sc, event 8). Moreover, since Alice's write to bob->next is rlx (event 4), ARM allows the write to alice->locked (along with U^W_sc) to happen after it. As a consequence, although Bob reads (event 5) from Alice's write to bob->next (event 4), the effect of setting alice->locked to 1 (event 7) happens after Bob has set it to 0 (event 6). In contrast to IMM, making Alice's write rel is sufficient in ARM (see Fig. 17) because the control/address dependency between Bob's events guarantees that writes that happen before the read event also happen before the subsequent dependent events.
Validation of the bug.
So far we have not been able to reproduce the effect of the bug on real hardware. The situation that triggers the bug is very unlikely to happen, but it is nevertheless possible and still a potential problem for code using DPDK on ARM platforms. To gain higher confidence about the bug on ARM, we checked the scenario of Fig. 13 with Rmem [26], a stateful model checking tool capable of verifying small pieces of binary code compiled for the ARMv8 architecture. Although Rmem cannot deal with the infinite loop of Alice, we can reproduce the bug by asserting that Bob does not see alice->locked being reverted to 1 after he has set it to 0; as expected, the assertion fails.
Figure 16: ARM memory model. The bug results in Alice hanging: Alice writes to bob->next with rlx mode, causing the initialization of alice->locked to be reordered after Bob's write.
Figure 17: ARM memory model. To fix the bug, Alice has to write to bob->next with rel mode, forcing Bob's write to occur after the initialization of alice->locked.

Discussion.
The DPDK MCS lock bug is a good example of how understanding WMMs can be challenging even for experts. Note that the MCS lock was contributed by ARM Limited to the DPDK project. In Fig. 13, Line 17, the developer considers exactly the situation observed in this bug: "The store to me->next above should complete before the node is visible to other CPUs/threads. Hence, the exchange operation requires release semantics as well." However, making the exchange rel is not sufficient because the node can also become visible to another thread via the write at Line 27, and nothing stops the store of Line 14 and the write part of the exchange of Line 23 from being reordered after Line 27. Another interesting finding in this code is that, as far as we can verify, the explicit fence at Line 32 is useless and can be removed.
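To make the fix concrete, the following sketch shows the barrier changes implied by the discussion above. It is our illustration of the described fix, not an official DPDK patch; all surrounding code is as in Fig. 13.

    /* In rte_mcslock_lock(): publish the node with release semantics, so
     * that the initialization of me->locked cannot be reordered after it. */
    __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);

    /* In rte_mcslock_unlock() (needed on IMM): read the successor with
     * acquire semantics so the store above synchronizes-with this load. */
    static inline void
    rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
    {
        if (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == NULL) {
            /* slowpath omitted, as in Fig. 13 */
        }
        /* Pass lock to next waiter. */
        __atomic_store_n(&me->next->locked, 0, __ATOMIC_RELEASE);
    }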
Our next study case is concerned with the MCS lock implementation found in an internal Huawei product. In this implementation, VSync identified a missing acq barrier that causes serious data corruption problems. We were able to reproduce the problem on real hardware and reported the bug along with a simple fix to the maintainers. Here, we describe this issue to illustrate the challenges of porting x86 code to ARM, which is the reason why such a bug was introduced in the code base. With the recently increased demand for software for ARM servers, we believe that similar bugs are going to become more and more common in production.
The bug on IMM.
Figure 18 presents a slightly simplified version of the original MCS lock implementation. The bug is a missing acq barrier at the end of mcslock_acquire(). To understand the scenario, consider the execution graph in Fig. 19, where the critical section is a simple increment x++, Alice wants to enter, and Bob is inside the critical section. Similarly to the DPDK bug, Bob sees Alice's node when releasing the lock and sets Alice's spin = 0; this flag is called locked in DPDK. The first fence in Alice's mcslock_acquire synchronizes with the fence in Bob's mcslock_release due to the write and read of the bob->next field. That establishes the happens-before relation marked with the dashed arrows in the figure. The happens-before relation, however, does not specify whether Bob's critical-section execution happens before Alice's critical-section execution, or vice versa. In this execution graph, Alice and Bob run their critical sections concurrently and both read from the initial write to x, causing one of the increments to be lost. By introducing an acq barrier in the reads of me->spin or after them (Fig. 18, Line 19), Alice is guaranteed to execute her critical section after Bob. Note that, although the ARM model also introduces control dependencies, reads of me->spin and the read of x inside the critical

    static inline void
    mcslock_acquire(volatile mcslock_t *tail, volatile mcs_node_t *me)
    {
        mcs_node_t *prev;

        me->next = NULL;
        me->spin = 1;
        smp_wmb(); // ** consider to be SC fence **

        // equivalent to xchg_acq
        prev = __sync_lock_test_and_set(tail, me);
        if (!prev)
            return;

        prev->next = me;
        smp_mb(); // ** consider to be SC fence **

        while (me->spin);
        // BUG: Missing ACQ barrier, eg, smp_mb();
    }

    static inline void
    mcslock_release(volatile mcslock_t *tail, volatile mcs_node_t *me)
    {
        if (!me->next) {
            // SC cmpxchg
            if (__sync_val_compare_and_swap(tail, me, NULL) == me) {
                return;
            }
            while (!me->next);
        }
        smp_mb(); // ** consider to be SC fence **
        me->next->spin = 0;
    }

Figure 18: MCS lock implementation in a commercial OS. A barrier bug causes data races in the critical section.
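A minimal sketch of the fix described above, applied to mcslock_acquire() from Fig. 18 (our illustration, not necessarily the exact patch sent to the maintainers): an acq barrier is placed after the await loop, so that reads inside the critical section cannot be ordered before the read of me->spin.

    static inline void
    mcslock_acquire(volatile mcslock_t *tail, volatile mcs_node_t *me)
    {
        mcs_node_t *prev;

        me->next = NULL;
        me->spin = 1;
        smp_wmb();
        prev = __sync_lock_test_and_set(tail, me); /* acq mode */
        if (!prev)
            return;
        prev->next = me;
        smp_mb();
        while (me->spin);
        smp_mb(); /* FIX: acq barrier after the await; acquire loads of
                     me->spin would also suffice */
    }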
Figure 19: In IMM, Alice’s read of x may happen before Bob’swrite to x .section of Alice, a reordering of these operations is not precluded because they are all read operations. Discussion.
Besides the barrier bug, some issues may be interesting to point out. The developers that implemented this code opted to use compiler-specific atomic operations. We do not recommend their use because they hide the barrier mode used underneath. In particular, __sync_lock_test_and_set has an acq mode, whereas __sync_val_compare_and_swap has an sc mode. Moreover, the developers overuse fences: the smp_mb() fence in mcslock_acquire(), Line 18, is redundant and can be eliminated.

The qspinlock code was originally introduced in Linux version 4.2 as a new and faster spinlock [5]. Since then, experts have slightly improved the algorithm and carefully optimized the barriers, achieving excellent performance. Currently, qspinlock is the default spinlock inside the Linux kernel for many architectures, including x86 and ARM. We now describe how we used VS
YNC to automate the barrier optimization process of the Linuxqspinlock and obtain similar barriers as the current code, at version 5.6 [21].
Baseline.
Our optimization is based on Linux 4.4 [20], where the barriers were yet to be completely optimized,but other (algorithm) optimizations were mainly done. Because our purpose is exclusively the barrier optimization,we ported a few remaining algorithmic optimizations present on 5.6 back to version 4.4. Specifically, webackported the prefetch instructions to receive the next node in the queue.
Code preparation.
To optimize existing code with VSync, we may have to perform minor changes. In particular, if the code uses custom atomics implemented in assembly, these have to be replaced with either compiler builtin atomics or VSync atomics (which compile down to builtin atomics). In the case of qspinlock, we replaced Linux's atomic operations with VSync atomics. Because custom atomics do not always follow the same model as IMM and compiler builtin atomics, some discrepancies may arise during the replacement. We encountered one such case: the cmpxchg function in Linux is defined as having a full memory barrier (i.e., an sc fence) before and after the operation only if it succeeds [2]. So we replace the Linux cmpxchg with a wrapper that calls VSync's atomic_cmpxchg and additionally issues an atomic fence in the success case to mimic the original behavior (for reference, see Fig. 22). Another issue we encountered was unions of different-sized variables: the qspinlock code uses a union that allows the same memory location to be read and written with either 8, 16, or 32 bits. Currently, AMC requires that accesses to the same memory location always have the same size; this limitation may be fixed in the future. To overcome this issue, we made all accesses to the qspinlock data structure 32 bits wide.

    Version          acq  rel  sc   Time        Correctness
    Linux 4.4 [20]   3    6    6    2015/09/11  Not verified
    Linux 4.5 [19]   6    2    1    2015/11/09  Barrier bug, fixed in [6]
    Linux 4.8 [27]   6    3    0    2016/06/03  Barrier bug, fixed in [6]
    Linux 4.16 [6]   6    4    0    2018/02/13  Not verified
    Linux 5.6 [21]   6    2    1    2020/01/07  Not verified
    VSync                                       VSync-verified

Table 1: Barrier optimization results for Linux's qspinlock
Optimization results. VSync recommended barrier modes similar to those used by the experts (see Table 1) in roughly 11 minutes; in contrast, the expert optimization took several release cycles. The details of the optimization can be seen in Fig. 20. The bold text marks the optimizations suggested by VSync. The boxes in the figure correspond to the Linux cmpxchg function. We see that VSync removes all atomic fences and transfers their barrier modes to the atomic operations. We now relate the VSync optimization to the optimizations made by the Linux experts over the several release cycles:
Version 4.5 – optimization of cmpxchg:
The first optimization by the experts was exactly in the cmpxchg functions (the boxes in Fig. 20), changing them from sc to a more relaxed mode [19].

Version 4.8 – optimization of unlock function:
Experts optimized the unlock code (see Fig. 20), removing the fence and changing the atomic_sub to rel mode [27] – identical to VSync's suggestion.
Version 4.16 – bug fix:
The optimization from version 4.5 introduced a bug that was only found and fixed in version 4.16 [6]. The bug allowed the node initialization to occur after the update of prev->next, similarly to DPDK's bug discussed in §3.1. In version 4.16, the experts used a rel barrier in the atomic write immediately after the decode_tail function, but finally replaced it with an atomic sc fence in the current version. Optimizations with VSync are verified and hence not affected by such bugs.
Version 5.6 – current version:
Figure 21 shows the barrier modes used in the current version of qspinlock [21].The dotted lines connect our barriers in Fig. 20 with the equivalent barriers in the current version. The fewdifferent barrier modes are due to two reasons: First, there exists multiple maximally-relaxed combinationsthat are correct. Second, both optimizations are based on different memory models (LKMM and IMM).VS
YNC extended with an LKMM module would likely suggest the same barriers as used in Linux.31 ockatomic32_cmpxchg_rel --> acquire atomic_fence --> remove queued_spin_lock_slowpathatomic32_await_neq_rlxatomic32_cmpxchg_rel --> acquire atomic_fence --> remove atomic32_await_mask_eq_acq --> relaxed atomic32_add_rlx --> acquire encode_tailatomic32_write_rlxatomicptr_write_rlxatomic32_read_rlxatomic32_cmpxchg_rel --> acquire atomic_fence --> remove atomic32_read_rlxatomic32_cmpxchg_rel --> seq_cst atomic_fence --> remove decode_tailatomicptr_write_rlxatomic32_await_neq_acqatomicptr_read_rlxatomic32_await_mask_eq_acq --> relaxed atomic32_or_rlx --> acquire atomic32_cmpxchg_rel --> acquire atomic_fence --> remove atomicptr_await_neq_rlxatomic32_write_relunlockatomic_fence --> remove atomic32_sub_rlx --> release
Figure 20: Barrier modes in version 4.4 andVS
YNC optimizations in bold . Optimizations inthe red boxes are similar to version 4.5; those inthe blue box are identical to version 4.8. lock atomic32_cmpxchg_acq queued_spin_lock_slowpathatomic32_await_counter_neq_rlx atomic32_get_or_acq atomic32_sub_rlx atomic32_await_mask_eq_acq atomic32_add_rlxencode_tailgrab_mcs_nodeatomic32_write_rlxatomicptr_write_rlxatomic32_read_rlx atomic32_cmpxchg_acqatomic_fence atomic32_read_rlxatomic32_cmpxchg_rlxdecode_tailatomicptr_write_rlxatomic32_await_neq_acqatomicptr_read_rlx atomic32_await_mask_eq_acq atomic32_cmpxchg_rlxatomic32_or_rlxatomicptr_await_neq_rlx atomic32_write_rel unlock atomic32_sub_rel
Figure 21: Barrier mode information for qspin-lock in Linux version 5.6 (current version). Dot-ted lines connect related barrier optimizationsof VS
YNC and the current version.

    #define cmpxchg(p, a, n) ({                       \
            typeof(a) __r = atomic_cmpxchg(p, a, n);  \
            if (__r == a)                             \
                    atomic_fence();                   \
            __r;                                      \
    })

Figure 22: Using VSync atomics to implement code compatible with Linux's cmpxchg.
Optimized-code Evaluation
In this section, we present details of the setup used in our "optimized-code evaluation" section. Moreover, we discuss the results obtained with microbenchmarks at length. See the full paper for results with real-world workloads [24].
We conduct our experiments on the following hardware platforms:
• a Huawei TaiShan 200 (Model 2280) rack server (https://e.huawei.com/uk/products/servers/taishan-server/taishan-2280-v2) with 128 GB of RAM and 2 Kunpeng 920-6426 processors (https://en.wikichip.org/wiki/hisilicon/kunpeng/920-6426), a HiSilicon chip with ARMv8.2 cores, totaling 128 cores running at a nominal 2.6 GHz frequency. The identifier used to denote this machine in this document is taishan200-128c.
• a GIGABYTE R182-Z91-00 rack server with 128 GB of RAM and 2 EPYC 7352 processors, an AMD chip with x86_64 cores, totaling 48 cores (96 if counting hyperthreading) running at a nominal 2.3 GHz frequency. The identifier used to denote this machine in this document is gigabyte-96c.
We installed on these servers the Ubuntu 18.04.4 LTS (aarch64) operating system, with the following Linux kernel version: .
To produce stable benchmark results on a kernel as complex as Linux, we took some precautions in terms of the environment configuration of our target platforms. We list these precautions here:
1.
Atomic types isolation.
Linux and VSync each declare their own atomic types, such as atomic_t and atomic64_t. When writing the kernel benchmark module, to avoid name conflicts between Linux kernel headers and VSync library headers, we separated the benchmark "main" code (where the entry point lies and where the kernel threads are created) from the lock-primitive function definitions and data-structure instantiations (where the contention loops are executed, see Section 4.2.1) into different translation units. Therefore, the "main" code of the benchmark kernel module can use the classic Linux headers for its needs, while the VSync test units include the VSync library headers (which define the required atomic types) and use from Linux only primitive types (such as uint32_t and the like). This technique also makes it possible to benchmark individual modules from the Linux kernel, such as the qspinlock located in the "linux/spinlock.h" header.
2.
Thread to core affinity assignment.
The benchmark module spawns as many kernel threads as requested on the module's invocation command line. To measure the multi-core overheads of the locks, these threads must be pinned to individual cores (both within the same NUMA node and on different NUMA nodes). For this purpose, the Linux kthread_bind() function is used (a short sketch follows after this list).
3.
Operating frequency fixing.
To avoid suffering from thermal effects, and thus the OS dynamically changing the operating frequency while the benchmarks were running (and by doing so skewing our results), we fixed the frequency to 1.5 GHz, a frequency point available on all the platforms used in our evaluations. For this purpose, we used the Linux cpufreq mechanism. We set the governor to userspace to be able to choose the frequency. We observed that using a fixed governor such as userspace instead of an adaptive one (such as ondemand) yields much better predictability in our results.
4.
Disable network.
In the preliminary experiments we conducted for our work, we observed that the network introduced a lot of noise into the evaluations by widely spreading the distributions of results; we therefore disable the network during the benchmark runs.
5. Disable IRQ balancing. irqbalance is a Linux daemon in charge of distributing the hardware Interrupt Requests (IRQs) among the different processing cores of the platform for the purpose of overall system performance. However, sporadic IRQs and the subsequent execution of Interrupt Service Routines (ISRs) on an uncontrolled set of cores would bring unpredictability in the system response time and interfere with our benchmark measurements. We simply disable this mechanism. The Linux fallback strategy is then to pin all IRQs to the first core, which we remove from our thread affinity assignment to completely avoid the issue of running ISRs and benchmarks concurrently on the same cores.
6.
Disable NUMA balancing.
On platforms with a large number of cores such as the ones used in theseexperiments (see Section 4.1.1), the CPU cores are organized in NUMA nodes (for
Non-Uniform MemoryAccess ). This structure allows to palliate the unavoidable pressure on the memory bus due to the highamount of processing cores operating in parallel. Banks of memory are allocated per NUMA node,reflecting the cache hierarchy.
NUMA balancing is a feature of Linux that periodically moves the taskscloser to the memory they use, i.e. in the right NUMA node.
NUMA control is an additional tool allowingto configure NUMA-aware task scheduling and memory allocation in a fine-grain manner, this overridingthe overall system NUMA balancing. We disable system NUMA balancing, and we enable task-localNUMA control, using this syntax: sudo numactl -- cpubind =0 -- membind =0 < cmd_to_insert_kernel_module >
This goes one step further as task affinity assignment, as it forces memory allocation to be bound on NUMAnode 0. Our benchmark being inherently concurrent, as soon as there will be more threads than cores perNUMA node, these threads will be allocated on the next NUMA node. For userspace benchmarks, weuse libnuma directly in our pthread wrapper to pin threads to cores and control the allocation of contextdata structures. Therefore, the cross-node threads will suffer some performance loss when trying to accessshared data ( e.g. spinlock data structures).7.
Completely isolate the cores.
The Linux kernel provides the possibility to isolate a subset of the CPU cores. To do so, the parameter isolcpus must be filled with the list of CPU core identifiers to isolate. This parameter is given on the Linux boot command line (i.e., in GRUB in our case, prior to the kernel boot). This has the effect of completely preventing the scheduler and other task-balancing mechanisms (such as SMP balancing) from operating on these cores. Unless explicitly required by a task affinity configuration (with the corresponding system calls or by using the taskset program), the OS will not schedule any task on these isolated cores. In our case, we decided to isolate all cores but the first, with the idea of running our benchmarks on the isolated cores, while the rest of the Linux processes run on core 0 to avoid interfering with our results.
8.
Kernel threads priority.
We tried several configurations of niceness for our kernel threads by calling the set_user_nice() Linux function, but this did not seem to impact the distribution of our results. This is to be expected given the precautions described above: the benchmark response-time variability was not influenced by the priority of the kernel threads.
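The thread-to-core pinning described in item 2 can be illustrated with the following hedged sketch of a kernel-module snippet (a simplified illustration, not the benchmark module itself; the function names spawn_pinned_threads and bench_thread_fn are ours, and error handling is minimal):

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int bench_thread_fn(void *data); /* runs the loop of Listing 1 */

    /* Create one kernel thread per requested core and pin it with
     * kthread_bind() before waking it up, so it only ever runs on its
     * assigned core. Core 0 is left to the rest of the system. */
    static int spawn_pinned_threads(int nthreads)
    {
        for (int cpu = 1; cpu <= nthreads; cpu++) {
            struct task_struct *t =
                kthread_create(bench_thread_fn, NULL, "bench/%d", cpu);
            if (IS_ERR(t))
                return PTR_ERR(t);
            kthread_bind(t, cpu);
            wake_up_process(t);
        }
        return 0;
    }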
In this section, we present the microbenchmark experiments carried out for the paper. We first describe the experiment itself and then discuss the results obtained.
The microbenchmark works as follow: each thread repeatedly acquires a (writer) lock, increments a sharedcounter, and releases the lock. This is summarized in pseudo-C in Listing 1.34isting 1: Pseudo-C code of microbenchmark /* the tested lock is parameterized here */ /* lock variable */ static lock_s lock_var ; /* * supposedly already allocated, * we also take care of cacheline-alignment * to avoid false sharing. */ unsigned long long * shared_counter ; /* ... */ unsigned long long run () { * shared_counter = 0 ull ; lock_init (& lock_var ); do { lock_acquire (& lock_var ); (* shared_counter ) ++; lock_release (& lock_var ); } while (! thread_should_stop () ); return * shared_counter ; } The returned counter is used to compute the throughput (number of times the critical sections was accessedby any thread). We vary the number of threads in the following set : { , , , , , , , , , } . We runeach experiment for a fixed period of time (30 seconds) and measure the throughput (number of critical sectionsper second). We run the experiments 5 times to ensure the stability of the results (for each case, we pick themedian of these repeated runs).For spinlocks and reader-writer locks, the benchmark runs as a Linux kernel module. It means that we insertin the kernel a kernel module ( *.ko file) that is linked with the code listed in Listing 1. When inserted, themodule runs an initialization routine that consists in spawning as many kernel threads as requested (as explainedabove, this is a parameter of the experiment), and each thread runs the code listed in Listing 1, until interruptedby a timer (the 30-seconds execution). This timer is triggered externally, when removing the module from thekernel, as the exit routine of the module is requiring all the threads to finish their execution (it is not possibleto kill a kernel thread with a kill signal). This kernel module is inspired by the previous work of Kashyap etal. [14], where they (nano-)benchmarked the behavior of a hash-table in kernel-mode to evaluate several lockingprimitives.For mutexes, we replace pthread_mutex using LD_PRELOAD , and the benchmark runs in Linux userspace.About the selected locking primitives, we compare two variants of each primitive: an sc -only variant, and aVS YNC -optimized variant with barriers. Obviously, the 127-thread case can only be run on platforms with 128 cores. It is omitted in the other cases. rchitecture algorithm seqopt threads_nb run_nb atomics count duration throughput0 ARMv8 array opt 1 1 a64 957109580 30.0122 3.18907e+071 ARMv8 array opt 1 2 a64 957161287 30.0133 3.18913e+072 ARMv8 array opt 1 3 a64 957576858 30.025 3.18926e+073 ARMv8 array opt 1 4 a64 957238417 30.0143 3.18927e+074 ARMv8 array opt 1 5 a64 957129609 30.0141 3.18893e+075 ARMv8 array opt 2 1 a64 209273223 30.0116 6.97308e+066 ARMv8 array opt 2 2 a64 205836422 30.0118 6.85851e+067 ARMv8 array opt 2 3 a64 205883982 30.0119 6.86008e+06... ... ... ... ... ... ... ... ... ...3697 x86_64 musl seq 63 3 a64 10917032 30.0052 3638383698 x86_64 musl seq 63 4 a64 12659470 30.0059 4219003699 x86_64 musl seq 63 5 a64 10882122 30.0047 3626813700 x86_64 musl seq 95 1 a64 11842053 30.0067 3946473701 x86_64 musl seq 95 2 a64 11655763 30.0056 3884533702 x86_64 musl seq 95 3 a64 13233013 30.0067 4410023703 x86_64 musl seq 95 4 a64 13896114 30.0062 4631083704 x86_64 musl seq 95 5 a64 13857038 30.0065 461801 Table 2: Raw captured records, with parameters and output values. stability (binned) C o un t o f R e c o r d s ARMv8x86_64
Figure 23: Density of the stability of the different records, per architecture. The chart exhibits the fact that most results are very stable (< 1.16 for stability).
The raw experiment results look like a list of records as showed in Table 2.The count column represents the value returned by the function run() of Listing 1, i.e. the number of timesa thread access the critical section. The duration column is the measured duration (even if it is fixed, somedeviation may occur). Lastly, the throughput column is simply countduration , effectively capturing the number ofcritical sections per second.The records then get grouped together by parameters, and throughput mean, median and stability arecomputed, as reported on Table 3.As can be seen on the table, these values are computed for different versions of the algorithm. opt refers tothe VS
YNC -optimized version of the algorithm, while seq refers to the sc -only variant.Mean, median and standard deviation are computed using the usual definitions, while the stability is computedby dividing the maximum throughput by the minimum throughput, effectively giving an indication on the stabilityof the data set. The closer the stability is to 1 .
00, the more stable the sample is for these fixed values of theparameters. Figure 23 shows the repartition of the stability among the records of the above table. As can beobserved on the density chart, most observed values are stable.36 ean median std stabilityarch algorithm seqopt threads_nbaarch64 array opt 1 3.18913e+07 3.18913e+07 1436.94 1.000112 6.87696e+06 6.85993e+06 54935.1 1.020474 4.08881e+06 4.10817e+06 57940.8 1.030058 3.91338e+06 3.90199e+06 49164.6 1.0324616 3.75618e+06 3.74699e+06 15217 1.0095823 2.48333e+06 2.41812e+06 105903 1.0930231 2.23512e+06 2.23639e+06 6561.7 1.0075963 1.74074e+06 1.74047e+06 8351.25 1.0119195 1.32026e+06 1.32048e+06 5805.31 1.01172127 1.11277e+06 1.10708e+06 11596.4 1.02398seq 1 2.67509e+07 2.67693e+07 30079.4 1.00262 5.66552e+06 5.69127e+06 62049.8 1.026524 3.47092e+06 3.46877e+06 15011.9 1.010148 3.10139e+06 3.10546e+06 7845.61 1.00616 3.05401e+06 3.05231e+06 4886.13 1.0039523 2.08272e+06 2.06173e+06 64897.5 1.0788931 1.92977e+06 1.92845e+06 10816.1 1.013263 1.52597e+06 1.53138e+06 17730.6 1.0281595 1.11215e+06 1.10892e+06 13310.7 1.02923127 899750 898612 10382.7 1.02754... ... ... ... ... ... ... ...x86_64 ttas seq 63 1.16617e+06 1.16119e+06 14598.2 1.0309195 1.16657e+06 1.16654e+06 9672.9 1.01849twa opt 1 3.64516e+07 3.64537e+07 29737.1 1.002162 6.02238e+06 6.02219e+06 1489.88 1.000694 2.10028e+06 2.12157e+06 68586.1 1.090878 2.27264e+06 2.28098e+06 34452.2 1.0379216 2.06737e+06 2.06864e+06 3631.41 1.0046523 1.85262e+06 1.85316e+06 1579.55 1.0022731 1.48692e+06 1.48739e+06 3060.34 1.0057463 1.23218e+06 1.23213e+06 3422.3 1.0066595 1.11774e+06 1.11715e+06 5174.11 1.01166seq 1 1.39532e+07 1.39525e+07 8965.07 1.001572 1.51132e+07 1.50894e+07 119766 1.021144 3.3227e+06 3.31492e+06 13600.1 1.009488 2.69598e+06 2.6822e+06 19174.8 1.013216 2.09911e+06 2.10757e+06 17185.9 1.0186623 1.89835e+06 1.89785e+06 3783.31 1.0054231 1.71193e+06 1.7124e+06 5194.5 1.007663 1.44299e+06 1.44049e+06 12048.1 1.0189695 1.09068e+06 1.08852e+06 8834 1.01751
Table 3: Records grouped by target platform, lock algorithm, sc -only/VS YNC -optimized version and numberof threads. Computed values are median, mean, standard deviation and stability of the throughput column ofTable 2. 37tability values Amount (absolute) Amount (%) ≤ . . > . . > . . > . . > . . Total 741 100.00%
Table 4: Number of experiments categorized by stability. The mentioned records are lines of Table 3.

Figure 24: Density of the speedups of the different locks, per architecture.

Another way to see the results is to group the lines of Table 3 by stability values, as done in Table 4. We see that more than 84% of the results have a stability below 10%. Records above the stability threshold are filtered out before further analysis.

Analysis of speedups of VSync-optimized over sc-only implementations. Then, we use the values in the filtered table of records to compute the speedup $T_o/T_s -$
1, where T o is median throughput of VS YNC -optimized and T s is the median throughput sc -only variants, respectively. Descriptive statistics aggregates about the observedspeedups are showed in Table 5 and the density of the speedup values is showed in Figure 24. In the paper,for the sake of space, we only reported maximum observed speedups (the max column in Table 5). We canobserve on Figure 24 that most speedups are close to 0. This effect can mainly be observed on the plot becauseof the highly-contended cases (number of threads from 8 and up), where the impact of optimizing barrier isnegligible. On the other hand, if we observe the same data but split for all measured contention levels ( i.e. numberof threads) as depicted per architecture on Figures 25 and 26, we can analyze the results with finer-grain details.For ARMv8 (taishan200-128c) (Fig. 25), good results are scattered across the different contentions levels, butspeedups tend to be better for low contention level (especially the 1 thread case). In the case of x86 (Fig. 26), thetremendous low-contention speedup case (up to 7 × for 1 thread) is emphasized. This is so big that it overshadowsthe other cases. However, the qspinlock column is clearly better than the others, illustrating that in the x86 case, qspinlock has no negative speedup . 38 ock aarch64 x86_64max mean min std max mean min stdArrayQ lock [13] 0.256496 0.195695 0.136534 0.035925 3.704002 0.512900 0.050954 1.198285CertiKOS MCS [12] 0.741137 0.148102 0.014506 0.217878 0.711755 0.323380 0.184185 0.191676CLH lock [13] 0.326751 0.116884 -0.008331 0.090660 7.034888 0.767551 -0.150228 2.350886c-TKT-MCS [9] 0.633937 0.317441 0.046994 0.196817 1.379088 0.122309 -0.457079 0.497779c-TTAS-MCS [9] 0.538990 0.265129 0.063337 0.157456 1.388887 0.114989 -0.454959 0.502378c-MCS-TWA 0.610119 0.040497 -0.065989 0.201884 1.250191 0.121885 -0.282231 0.487366HCLH lock [13, 22] 0.297295 0.050683 -0.084593 0.105824 1.331446 0.201763 -0.019287 0.430856MCS lock [13, 23] 0.776025 0.111308 -0.031085 0.252447 3.605063 0.387545 -0.404894 1.216811musl mutex [3] 0.039510 -0.000101 -0.031193 0.026376 0.048818 -0.050181 -0.206353 0.1107433-state mutex [10] 0.000157 -0.013167 -0.025476 0.012847 -0.001148 -0.140026 -0.287827 0.143548qspinlock [5] 0.235271 0.117755 -0.148553 0.131182 3.966020 0.606631 0.085625 1.261861rec. CAS lock 0.078310 0.010387 -0.024441 0.037942 4.099000 1.950000 -0.338620 1.324669RW lock 0.546915 -0.007887 -0.405652 0.289225 1.194217 0.671709 -0.015319 0.462634Semaphore 0.112009 -0.001584 -0.089416 0.062090 0.004707 0.000205 -0.003467 0.002518CAS lock 0.045456 -0.007812 -0.060903 0.031734 4.099994 1.247507 -0.255443 1.175026Ticketlock [1] 0.162076 0.009903 -0.060949 0.059517 3.935485 0.633132 -0.392997 1.260925TTAS lock [13] 0.094544 -0.011786 -0.110472 0.057227 3.875872 1.322232 -0.286289 1.078801TWA lock [7] 0.337384 0.026431 -0.044020 0.112353 1.612696 0.023385 -0.600899 0.627365 Table 5: Speedups of VS
YNC -optimized version of the algorithm over the sc-only variant. This descriptive summary must be read with care (especially the values of mean), as they are only aggregated from our own experiment samples (our arbitrarily selected thread numbers, etc.).
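To make the computed aggregates concrete, the following sketch illustrates how throughput, stability, and speedup are derived from the raw records (our own illustration with assumed function names; it is not the actual analysis script, and the sample numbers are taken from Tables 2 and 3):

    #include <stdio.h>

    /* Throughput of one run: critical sections per second. */
    static double throughput(unsigned long long count, double duration_s) {
        return (double)count / duration_s;
    }

    /* Stability of a group of runs: max throughput divided by min throughput
     * (1.00 means perfectly stable). */
    static double stability(const double *tp, int n) {
        double min = tp[0], max = tp[0];
        for (int i = 1; i < n; i++) {
            if (tp[i] < min) min = tp[i];
            if (tp[i] > max) max = tp[i];
        }
        return max / min;
    }

    /* Speedup of the VSync-optimized variant over the sc-only variant,
     * computed from the median throughputs T_o and T_s. */
    static double speedup(double median_opt, double median_seq) {
        return median_opt / median_seq - 1.0;
    }

    int main(void) {
        double opt_runs[] = { 3.18907e7, 3.18913e7, 3.18926e7,
                              3.18927e7, 3.18893e7 };
        printf("throughput: %f\n", throughput(957109580ULL, 30.0122));
        printf("stability:  %f\n", stability(opt_runs, 5));
        printf("speedup:    %f\n", speedup(3.18913e7, 2.67693e7));
        return 0;
    }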
MCS lock comparisons. Figure 27 compares the performance of several MCS lock implementations on
ARMv8 (taishan200-128c) and x86_64 (gigabyte-96c) . As reported in the paper, the different MCS lockimplementations are: DPDK [11], Concurrency Kit (ck) [4], CertiKOS [12] and our VS
YNC -optimized.
Critical and non-critical section sizes.
We observed a few other things while conducting this campaign of experiments. Our benchmark setting allows for additional parameters: cs_size and es_size (not reported in the charts of this report).
• The cs_size parameter (for "critical section size") allows us to artificially increase (and control) the size of the critical section. Instead of only touching one cache line by increasing a counter (as depicted in Listing 1), we can touch an arbitrary number of cache lines, corresponding to the value of the cs_size parameter.
• The es_size parameter allows us to set an arbitrary number of cache lines touched outside the critical section, to simulate different relative sizes of the critical section with regard to the size of the whole program.
We observed the following:
1. The es_size parameter did not influence the results, meaning that the lock primitive performance and the speedups obtained with VS
YNC -optimized over sc-only variants are not affected by the size of the program that is not in the critical section.
2. The cs_size parameter strongly influenced the results in the following way: the bigger the critical section, the smaller the impact of the barrier optimization. Additionally, all locking primitives converge towards the same performance value for an increasing critical section, which is expected, as the entry/exit protocols become negligible relative to a sufficiently large critical section.
From this, we can conclude that barrier optimizations and locking protocols matter especially for small critical sections and fine-grained locking. For the final results of the paper, we decided to set cs_size to 1 and es_size to 0.
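As a hedged illustration of how the cs_size parameter can be realized (our own sketch; the array name, the MAX_CS_SIZE bound, and the 64-byte cache-line size are assumptions, not the benchmark's actual code), the critical section can touch a configurable number of distinct cache lines:

    #define CACHE_LINE   64
    #define MAX_CS_SIZE  64 /* illustrative upper bound */

    /* One counter per cache line, so each touched line causes separate
     * coherence traffic. */
    static unsigned long long
        pad_counters[MAX_CS_SIZE][CACHE_LINE / sizeof(unsigned long long)]
        __attribute__((aligned(CACHE_LINE)));

    /* Executed while holding the lock: touch cs_size distinct cache lines. */
    static void critical_section(int cs_size)
    {
        for (int i = 0; i < cs_size; i++)
            pad_counters[i][0]++;
    }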
Hash table benchmarks. Linked to these last findings, prior to running our custom-made kernel benchmark module, we tried to use the work of Kashyap et al. [14] (which is publicly available on GitHub: https://github.com/sslab-gatech/shfllock/tree/master/benchmarks/kernel-syncstress), but we were
Figure 25: Heat map showing the speedups of the different locks on
ARMv8 (taishan200-128c). White squares correspond to data filtered out for instability.

not able to produce predictable results. Indeed, the variability of such results was very high, and each time we changed a small parameter it produced different output values. This happened even for details that should not influence the results, such as changing the linking order of the object modules in the makefile. This was unusable for our work, and can be explained in the following way: basically, the critical section in the kernel syncstress module of Kashyap et al. accesses nodes of a hash table. The data of this hash table is randomly populated. However, accessing a hash table is not a predictable operation in terms of run-time, and different accesses can yield very different execution times (especially if the hash table is seeded with different random values at each run). This would lead to the critical-section size being very different for different runs (even with the same parameter values), and was therefore not usable to compare different techniques. In comparison, our microbenchmark framework, although simpler in its structure, produces very predictable results (with very small deviations and good stability, as showcased above in this section) and allows us to precisely measure the overheads of barriers and the performance of different locking primitive implementations.
Figure 26: Heat map showing the speedups of the different locks on x86_64 (gigabyte-96c). White squares correspond to data filtered out for instability.
Figure 27: Comparison of the performance (median throughput in M. iters/s) of different MCS lock implementations (CertiKOS, ck, DPDK, own implementation) on ARMv8 (taishan-128c) and x86_64 (gigabyte-96c).

References

[1] Linux ticketlock. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=314cdbefd1fd0a7acf3780e9628465b77ea6a836, 2008.
[2] Linux-Kernel Memory Model, 2018.
[3] musl libc: an implementation of the C standard library, 2020. https://musl.libc.org.
[4] Samy Al Bahra. Concurrency Kit, 2015. https://github.com/concurrencykit/ck. Retrieved November 8, 2018.
[5] Jonathan Corbet. MCS locks and qspinlocks. https://lwn.net/Articles/590243/, 2014.
[6] Will Deacon. locking/qspinlock: Ensure node is initialized before updating prev->next, Feb 13, 2018. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95bcade33a8a.
[7] Dave Dice and Alex Kogan. TWA - ticket locks augmented with a waiting array. In European Conference on Parallel Processing, pages 334-345. Springer, 2019.
[8] Dave Dice, Alex Kogan, Yossi Lev, Timothy Merrifield, and Mark Moir. Adaptive integration of hardware and software lock elision techniques. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '14, pages 188-197, New York, NY, USA, 2014. Association for Computing Machinery.
[9] David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: A general technique for designing NUMA locks. ACM Trans. Parallel Comput., 1(2), February 2015.
[10] Ulrich Drepper. Futexes are tricky. Red Hat Inc, 2005.
[11] Linux Foundation. Data Plane Development Kit (DPDK), 2015.
[12] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An extensible architecture for building certified concurrent OS kernels. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 653-669, USA, 2016. USENIX Association.
[13] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 2011.
[14] Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Taesoo Kim. Scalable and practical locking with shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 586-599, New York, NY, USA, 2019. Association for Computing Machinery.
[15] Michalis Kokologiannakis, Ori Lahav, Konstantinos Sagonas, and Viktor Vafeiadis. Effective stateless model checking for C/C++ concurrency. Proceedings of the ACM on Programming Languages, 2(POPL), December 2017.
[16] Michalis Kokologiannakis, Azalea Raad, and Viktor Vafeiadis. Model checking for weakly consistent libraries. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, pages 96-110, New York, NY, USA, 2019. Association for Computing Machinery.
[17] Michalis Kokologiannakis and Viktor Vafeiadis. HMC: Model checking for hardware memory models. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 1157-1171, New York, NY, USA, 2020. Association for Computing Machinery.
[18] FAL Labs. Kyoto Cabinet: A straightforward implementation of DBM, 2011. http://fallabs.com/kyotocabinet.
[19] Waiman Long. locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg(), Nov 10, 2015. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64d816cba06c.
[20] Waiman Long and Peter Zijlstra. qspinlock code at version 4.4 of the Linux kernel, 2015. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/qspinlock.c?h=v4.4.
[21] Waiman Long and Peter Zijlstra. qspinlock code at version 5.6 of the Linux kernel, 2020. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/qspinlock.c?h=v5.6.
[22] Victor Luchangco, Dan Nussbaum, and Nir Shavit. A hierarchical CLH queue lock. In European Conference on Parallel Processing, pages 801-810. Springer, 2006.
[23] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21-65, February 1991.
[24] Jonas Oberhauser, Rafael Lourenco de Lima Chehab, Diogo Behrens, Ming Fu, Antonio Paolillo, Lilith Oberhauser, Koustubha Bhat, Yuzhong Wen, Haibo Chen, Jaeho Kim, and Viktor Vafeiadis. VSync: Push-button verification and optimization for synchronization primitives on weak memory models. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '21, New York, NY, USA, 2021. Association for Computing Machinery.
[25] Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. Bridging the gap between programming languages and hardware weak memory models. Proceedings of the ACM on Programming Languages, 3(POPL), January 2019.
[26] Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. Simplifying ARM concurrency: Multicopy-atomic axiomatic and operational models for ARMv8. Proceedings of the ACM on Programming Languages, 2(POPL), December 2017.
[27] Pan Xinhui. locking/qspinlock: Use atomic_sub_return_release() in queued_spin_unlock(), Jun 3, 2016. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca50e426f96c.