An Ownership Policy and Deadlock Detector for Promises
Caleb Voss [email protected]
Georgia Institute of Technology
Vivek Sarkar [email protected]
Georgia Institute of Technology
Abstract
Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or Transitive Joins. However, the promise, a popular synchronization primitive for parallel tasks, does not enjoy deadlock-freedom guarantees. Promises can exhibit deadlock-like bugs; however, the concept of a deadlock is not currently well-defined for promises.

To address these challenges, we propose an ownership semantics in which each promise is associated to the task which currently intends to fulfill it. Ownership immediately enables the identification of bugs in which a task fails to fulfill a promise for which it is responsible. Ownership further enables the discussion of deadlock cycles among tasks and promises and allows us to introduce a robust definition of deadlock-like bugs for promises.

Cycle detection in this context is non-trivial because it is concurrent with changes in promise ownership. We provide a lock-free algorithm for precise runtime deadlock detection. We show how to obtain the memory consistency criteria required for the correctness of our algorithm under TSO and the Java and C++ memory models. An evaluation compares the execution time and memory usage overheads of our detection algorithm on benchmark programs relative to an unverified baseline. Our detector exhibits a 12% (1.12×) geometric mean time overhead and a 6% (1.06×) geometric mean memory overhead, which are smaller overheads than in past approaches to deadlock cycle detection.

The task-parallel programming model is based on the principle that structured parallelism (using high-level abstractions such as spawn-sync [20, 47], async-finish [12, 26, 33], futures [24, 36], barriers [47], and phasers [10, 50]) is a superior style to unstructured parallelism (using explicit low-level constructs like threads and locks).
Structured programming communicates programmer intent in an upfront and visible way, providing an accessible framework for reasoning about complex code by isolating and modularizing concerns. However, the promise construct, found in mainstream languages including C++ and Java, introduces an undesirable lack of structure into task-parallel programming. A promise generalizes a future in that it need not be bound to the return value of a specific task. Instead, any task may elect to supply the value, and the code may not clearly communicate which task is intended to do so.

Listing 1. A deadlock?

 1: Promise p, q;
 2: t1 = async { ... };
 3: t2 = async {
 4:     p.get();  // stuck
 5:     q.set();
 6: };
 7: q.get();  // stuck
 8: p.set();

Promises provide point-to-point synchronization wherein one or more tasks can await the arrival of a payload, to be produced by another task. Although the promise provides a safe abstraction for sharing data across tasks, there is no safety in the kinds of inter-task blocking dependencies that can be created using promises. The inherent lack of structure in promises not only leads to deadlock-like bugs in which tasks block indefinitely due to a cyclic dependence, but such bugs are not well-defined and are undetectable in the general case due to the lack of information about which task is supposed to fulfill which promise.

A deadlock-like cycle may only be detected once all tasks have terminated or blocked. For example, the Go language runtime reports a deadlock if no task is eligible to run [25]. However, if even one task remains active, this technique cannot raise an alarm. An example of such a program is in Listing 1; the root task and t2 are in a deadlock that may be hidden if t1 is a long-running task, such as a web server. An alternative detection approach is to impose timeouts on waits, which is only a heuristic solution that may raise an alarm when there is no cycle. In both of these existing approaches, the detection mechanism may find the deadlock some time after the cycle has been created. It is instead more desirable to detect a cycle immediately when it forms.

There is inconsistency across programming languages about what to call a promise and sometimes about what functionality "promise" refers to. The synchronization primitive we intend to discuss is called by many names, including promise [36], handled future [46], completable future [48], and one-shot channel [16].
For us, a promise is a wrapper for a data payload that is initially absent; each get of the payload blocks until the first and only set of the payload is performed. Setting the payload may also be referred to as completing, fulfilling, or resolving the promise.

Some languages, such as C++, divide the promise construct into a pair of objects; in this case, "promise" refers only to the half with a setter method, while "future" refers to the half with a getter method. In Java, the CompletableFuture class is a promise, as it implements the Future interface and additionally provides a setter method.

Habanero-Java introduced the data-driven future [51], which is a promise with limitations on when gets may occur. When a new task is spawned, the task must declare up front which promises it intends to consume. The task does not become eligible to run until all such promises are fulfilled. In JavaScript, the code responsible for resolving a promise must be specified during construction of the promise [44]. This is a limitation that makes deadlock cycles impossible, although the responsible code may omit to resolve the promise altogether, leading to unexecuted callbacks.

Promises may provide a synchronous or an asynchronous API. The Java concurrency library provides both, for example [48]. The synchronous API consists of the get and set methods. The asynchronous API associates each of the synchronous operations to a new task. A call to supplyAsync binds the eventual return value of a new task to the promise. The then operation schedules a new task to operate on the promise's value once it becomes available. The asynchronous API can be implemented using the synchronous API. Conversely, the synchronous API can be implemented using continuations and an asynchronous event-driven scheduler [32]. We focus on the synchronous API in this work.
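For concreteness, the two styles can be contrasted with Java's CompletableFuture; the following is a small sketch of ours, not code drawn from any particular system:

```java
import java.util.concurrent.CompletableFuture;

public class PromiseApis {
    public static void main(String[] args) throws Exception {
        // Synchronous API: an explicit set (complete) paired with a blocking get.
        CompletableFuture<Integer> p = new CompletableFuture<>();
        new Thread(() -> p.complete(21)).start();
        int fromSync = p.get();  // blocks until some task completes p

        // Asynchronous API: supplyAsync binds a new task's return value to a
        // promise, and then-style operations schedule consumers of that value.
        CompletableFuture<Integer> q =
                CompletableFuture.supplyAsync(() -> fromSync).thenApply(x -> x * 2);
        System.out.println(q.get());
    }
}
```

Note that get is the only blocking call here; the supplyAsync/thenApply chain runs entirely in pool threads, illustrating how the asynchronous API can be built on top of the synchronous one.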
We identify two kinds of synchronization bug in which the improper use of promises causes one or more tasks to block indefinitely:

1. the deadlock cycle, in which tasks are mutually blocked on promises that would be set only after these tasks unblock, and
2. the omitted set, in which a task is blocked on a promise that no task intends to set.

However, neither of these bugs manifests in an automatically recognizable way at runtime unless every task in the program is blocked. In fact, the definitions of these bugs describe conditions which cannot generally be detected. What does it mean for no task to intend to set a promise? What does it mean that a task would set a promise once the task unblocks? In a traditional deadlock, say one involving actual locks, the cycle is explicit: Task 1 holds lock A and blocks while acquiring lock B, because task 2 is holding lock B and concurrently blocked during its acquisition of lock A. Intention to release a lock (thereby unblocking any waiters) is detectable by the fact that a task holds the lock. But we currently have no concept of a task "holding" a promise and no way to tell that a task intends to set it.

Listing 2. An omitted set?

 1: Promise r, s;
 2: t3 = async {      // should set r, s
 3:     t4 = async {  // should set s
 4:         // (forgot to set s)
 5:     }
 6:     r.set();
 7: };
 8: r.get();
 9: s.get();  // stuck

Consider the small deadlock in Listing 1. Two promises, p, q, are created. Task t2 waits for p prior to setting q, whereas the root task waits for q prior to setting p. Clearly a deadlock cycle arises? Not so fast. To accurately call this pattern a deadlock cycle requires knowing that task t1 will not ever set p or q. Such a fact about what will not happen is generally not determinable from the present state without an offline program analysis. For this reason, a deadlock cycle among promises evades runtime detection unless the cycle involves every currently executing task.

Now consider the bug in Listing 2. Two promises, r, s, are created. According to the comments, task t3 is responsible for setting both, and it subsequently delegates the responsibility for s to t4. However, t4 fails to perform its intended behavior, terminating without setting s. The root task then blocks on s forever. If a bug has occurred, we would like to raise an alarm at runtime when and where it occurs. Where is this bug? Should the root task not have blocked on s? Should t3 have set s? Should t4 have set s? The blame cannot be attributed, and the bug may, in fact, be in any one of the tasks involved. Furthermore, when does this bug occur? The symptom of the bug manifests in the indefinite blocking of the root task, potentially after t4 terminates successfully. If some other task may yet set s, then this bug is not yet confirmed to have occurred.
Omitted sets evade runtime detection and, even once discovered, evade proper blame assignment. We propose to augment the task creation syntax (async in our examples) to carry information about promise ownership and responsibility within the code itself, not in the comments. In doing so, omitted sets become detectable at runtime with blame appropriately assigned. Moreover, programmer intent is necessarily communicated in the code. Finally, in knowing which task is expected to set each promise, it becomes possible to properly discuss deadlock cycles among promises.

An example of an omitted set bug was exhibited by the Amazon Web Services SDK for Java (v2) when a certain checksum validation failed [31]. An abbreviated version of the code is given in Listing 3; line 16 was absent prior to the bug fix.

Listing 3. An omitted set in Amazon AWS SDK (v2) [31]. Code abbreviated and inlined for clarity.

 1: private CompletableFuture<Void> cf;
 2:
 3: public void onComplete() {
 4:     ...
 5:     if (streamChecksumInt != computedChecksumInt) {
 6:         // Assumed to fulfill promise:
 7:         onError(...);
 8:         return;  // Don't fulfill promise again
 9:     }
10:     ...
11:     cf.complete(null);  // Fulfills promise
12: }
13:
14: public void onError(Throwable t) {
15:     // Originally a no-op. Fixed to:
16:     cf.completeExceptionally(t);
17: }

The control flow ensures that either exception handling code or non-exceptional code was executed, not both (line 8) [43]. However, only the non-exceptional code would set the value of a CompletableFuture (Java's promise) to indicate the work was completed (line 11), whereas the onError method would take no action. If checksum validation failed after a file download, any consumer tasks waiting for the download to complete would block indefinitely.
A month later, the omitted set bug was identified and corrected by adding line 16 [2]. When this bug arises at runtime, the symptom (the blocked consumer) is far from the cause (the omitted set), and the bug is not readily diagnosable. If the runtime could track which tasks are responsible for which promises, then this bug could be detected and reported as an exception as soon as the responsible task terminates. Using our approach, the bug would be detected when the task running the onComplete callback finishes, and the alarm would name the offending task and the unfulfilled promise.
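The shape of the bug and its one-line fix can be reproduced in miniature with CompletableFuture. The class below is our own stand-in for the SDK's handler, not its actual structure, with a fixed flag selecting between the original no-op onError and the patched version:

```java
import java.util.concurrent.CompletableFuture;

public class OmittedSetDemo {
    static class ChecksumHandler {
        final CompletableFuture<Void> cf = new CompletableFuture<>();
        final boolean fixed;
        ChecksumHandler(boolean fixed) { this.fixed = fixed; }

        void onComplete(int streamChecksum, int computedChecksum) {
            if (streamChecksum != computedChecksum) {
                onError(new RuntimeException("checksum mismatch"));
                return;            // the non-exceptional path is skipped...
            }
            cf.complete(null);     // ...so only this path fulfills the promise
        }

        void onError(Throwable t) {
            if (fixed) cf.completeExceptionally(t);  // the added fix
            // originally a no-op: cf is never fulfilled on this path
        }
    }

    public static void main(String[] args) {
        ChecksumHandler buggy = new ChecksumHandler(false);
        buggy.onComplete(1, 2);      // mismatch: promise left unfulfilled
        System.out.println("buggy fulfilled: " + buggy.cf.isDone());

        ChecksumHandler patched = new ChecksumHandler(true);
        patched.onComplete(1, 2);    // mismatch: promise fulfilled exceptionally
        System.out.println("patched fulfilled: " + patched.cf.isDone());
    }
}
```

A consumer calling buggy.cf.join() would block forever; under the ownership policy proposed in this work, the alarm would instead fire as soon as the handler's task terminated while still owning cf.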
In this work, we propose the addition of ownership semantics for promises which enables a task's intention to set a promise to be reflected in the runtime state. In so doing,

1. we enable a precise definition of a deadlocked cycle of promises in terms of runtime state;
2. we define a second kind of blocking bug, the omitted set, which does not involve a cycle;
3. we require important programmer intent to be encoded explicitly and to respect a runtime-verifiable policy, thereby enabling structured programming for promises.

In addition to these theoretical contributions,

1. we introduce a new lock-free algorithm for detecting our now-identifiable deadlock-cycle and omitted-set bugs when they occur;
2. we identify properties critical for establishing the correctness of the algorithm under weak memory consistency and show how to ensure these properties hold under the TSO, Java, and C++ memory models;
3. we prove that our algorithm precisely detects every deadlock without false alarms;
4. we experimentally show that a Java implementation has low execution time and memory usage overheads on nine benchmarks relative to the original, unverified baseline (geometric mean overheads of 1.12× and 1.06×, respectively).

In promise-based synchronization, a task does not directly await another task; it awaits a promise, thereby indirectly waiting on whichever task fulfills that promise. It is a runtime error to fulfill a promise twice, so there ought to be one and only one fulfilling task. However, the relationship between a promise and the task which will fulfill it is not explicit, and this inhibits the identification of deadlocks. To make this relationship explicit and meaningful, we say that each promise is owned by exactly one task at any given time. The owner is responsible for fulfilling the promise eventually, or else handing ownership off to another task. Ownership hand-offs may only occur at the time of spawning a new task.
We augment the async keyword, used to spawn tasks, with a list of promises currently owned by the parent task that should be transferred to the new child.
We define an abstract language, showing only its synchronization instructions and leaving its sequential control flow and other instructions unspecified. For simplicity, we have abstracted away the payload values of promises and refer to individual promises by globally unique identifiers.
Definition 2.1.
The L_p language consists of task-parallel programs, P, whose synchronization instructions have the syntax

    new p | set p | get p | async (p_1, ..., p_n) { P }

where n may be 0. The instruction new p represents the point of allocation for the promise p, and we assume well-formed programs do not allocate a given p twice or operate on p prior to its allocation. Each invocation of get p blocks the current task until after set p has been invoked for the first (and only) time.

The async block creates a new task to execute a sub-program P; the block is annotated with a list of promises, which should be moved from the parent task to the new task. In many task-parallel languages, async automatically creates a future which can be used to retrieve the new task's return value. We can readily reproduce this behavior using promises in the pattern new p; async (p, ...) { ...; set p }.

Definition 2.2.
The ownership policy, P_p, maintains state during the execution of an L_p program in the form of a map

    owner : Promise → Task ∪ { null }

according to these rules:

1. When task t executes new p, set owner(p) := t.
2. When task t spawns task t′ as async (p_1, ..., p_n) { P }, prior to t′ becoming eligible to run, ensure owner(p_i) = t and update owner(p_i) := t′ for each p_i.
3. When task t terminates, ensure the set of promises owner⁻¹(t) is empty.
4. When task t executes set p, ensure that owner(p) = t and set owner(p) := null.

These four rules together ensure that there is at least one set for each promise, with omitted sets being detected by rule 3. Rule 4 guarantees there is at most one set.

Our proposed modification to the program given in Listing 1 is to annotate the async in line 3 as async (q), indicating that t2 takes on the responsibility to set q. It is now possible to trace the cycle when it occurs: the root task awaits q, owned by t2, awaiting p, owned by the root task. It is clear that t1, whose async is not given any parameters, is not involved, as it can set neither p nor q (rule 4).

The proposed modification to the program given in Listing 2 is to write async (r, s) in line 2 and async (s) in line 3. That is, the information already present in the comments is incorporated into the code itself. The moment t4 terminates, the runtime can observe that t4 still holds an outstanding obligation to set s. We treat this as an error immediately (rule 3), irrespective of whether any task is awaiting s.

Algorithm 1 implements the P_p policy by providing code to be run during new, async, and set operations. Each promise has an owner field to store the task that is currently its owner, and each task has an associated owned list that maintains the inverse map, owner⁻¹.
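As a minimal illustration, the four rules can be prototyped as a bookkeeping layer over an owner map. The sketch below is ours, with tasks modeled as plain strings and run sequentially; it replays Listing 2 under the proposed annotations and raises the omitted-set alarm the moment t4 terminates:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OwnershipPolicy {
    // owner : Promise -> Task ∪ {null}, plus its inverse map owner⁻¹.
    static final Map<String, String> owner = new HashMap<>();
    static final Map<String, Set<String>> ownedBy = new HashMap<>();

    static void newPromise(String task, String p) {                    // rule 1
        owner.put(p, task);
        ownedBy.computeIfAbsent(task, k -> new HashSet<>()).add(p);
    }

    static void spawn(String parent, String child, String... moved) {  // rule 2
        for (String p : moved) {
            if (!parent.equals(owner.get(p)))
                throw new IllegalStateException(parent + " does not own " + p);
            owner.put(p, child);
            ownedBy.get(parent).remove(p);
            ownedBy.computeIfAbsent(child, k -> new HashSet<>()).add(p);
        }
    }

    static void terminate(String task) {                               // rule 3
        Set<String> left = ownedBy.getOrDefault(task, new HashSet<>());
        if (!left.isEmpty())
            throw new IllegalStateException(task + " omitted set of " + left);
    }

    static void set(String task, String p) {                           // rule 4
        if (!task.equals(owner.get(p)))
            throw new IllegalStateException(task + " does not own " + p);
        owner.put(p, null);
        ownedBy.get(task).remove(p);
    }

    public static void main(String[] args) {
        // Listing 2 with annotations: async(r, s) for t3 and async(s) for t4.
        newPromise("root", "r");
        newPromise("root", "s");
        spawn("root", "t3", "r", "s");
        spawn("t3", "t4", "s");
        set("t3", "r");
        terminate("t3");              // fine: t3 owns nothing anymore
        boolean alarm = false;
        try { terminate("t4"); }      // t4 still owns s: rule 3 fires
        catch (IllegalStateException e) { alarm = true; }
        System.out.println("omitted set detected: " + alarm);
    }
}
```

The alarm names both the offending task and the outstanding promise, which is exactly the diagnostic information the policy makes available.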
The functions currentTask and setCurrentTask interact with thread-local storage. In compliance with P_p rule 1, the New procedure creates a promise owned by the currently running task (line 3) and adds this promise to that task's owned list (line 4).

Async(P, f) schedules f to be called asynchronously as a new task and moves the promises listed in P into this task. These promises are first confirmed to belong to the parent task (line 9), then moved into the child task (lines 10-12), in accordance with rule 2. (Line 10 is in preparation for Algorithm 2, presented in section 3.) Once the child task terminates, rule 3 requires that the task not own any remaining promises (line 16). The Init procedure shows how to set up a root task to execute the main function.
Algorithm 1. Promise Ownership Management

 1: procedure New()
 2:     t ← currentTask()
 3:     p ← { owner: t }                     ▷ C++: atomic, Java: volatile
 4:     append p to t.owned
 5:     return p
 6:
 7: procedure Async(P, f)
 8:     t ← currentTask()
 9:     assert p.owner = t forall p ∈ P
10:     t′ ← { owned: P, waitingOn: null }   ▷ C++: atomic, Java: volatile
11:     remove all of P from t.owned
12:     p.owner ← t′ forall p ∈ P
13:     do asynchronously:
14:         setCurrentTask(t′)
15:         f()
16:         assert t′.owned is empty
17:     return t′

18: procedure Init(main)
19:     setCurrentTask(null)
20:     Async([], main)

21: procedure Set(p, v)
22:     t ← currentTask()
23:     assert p.owner = t
24:     p.owner ← null
25:     remove p from t.owned
26:     set_impl(p, v)

Finally, Set(p, v) achieves rule 4, checking that the current task owns p and marking p as fulfilled by assigning it no owner (lines 23-25). The procedure then invokes the underlying mechanism for actually setting the promise value to v (line 26).

As an example of how Algorithm 1 enforces compliance with P_p, refer again to Listing 2. When promise s is first created, it belongs to the root task (Algorithm 1 line 4). If the async that creates t4 is annotated with s, then Algorithm 1 line 12 changes the owner of s to t4. Since t4 does not set s, upon termination of t4, an assertion fails in Algorithm 1 line 16. The offending task, t4, and the outstanding promise, s, are directly identifiable and can be reported in the alarm.

Now that we have established the relationship between promises and tasks, it is possible to describe what a deadlock is. A deadlock is a cycle of n tasks, t_i, and n promises, p_i, such that t_i awaits p_i while p_i is owned by t_{i+1} (mod n). The information required to identify such a deadlock is, for the first time, made available explicitly at runtime through the use of the P_p policy. We can now develop a runtime detection mechanism to identify deadlocks based on this information and raise an alarm as soon as one is created.
Even assuming sequential consistency, the algorithm for finding such a cycle is non-trivial. Conceptually, whenever a get p is executed by t, t must alternately traverse owned-by and waits-for edges to see if the path of dependences returns to t. If another task, t′, is encountered which is not currently awaiting a promise, this proves that progress is still being made and there is no deadlock (yet). In this case, t passes verification and commits to blocking on p. Should this path of dependences grow due to a subsequent get p′ by t′, then the same algorithm runs again in task t′ to verify that the new waits-for edge does not create a deadlock.

Crucially, during verification t must establish a waits-for edge to mark that it is awaiting p prior to traversing the dependence path. That is, a waits-for edge is created before it is determined that t will be allowed to await p. A two-task cycle shows what would go wrong if this procedure is not followed. If t begins to verify its wait of p (say, owned by t′) without marking that t is awaiting p, and concurrently t′ begins to verify its wait of p′ (owned by t) without marking that t′ is awaiting p′, then each task may find that the other is apparently not awaiting any promises at this time, and both commit to blocking, creating an undetected deadlock. However, by ensuring that each task marks itself as awaiting a promise prior to verifying whether that wait is safe, we guarantee the last task to arrive in the formation of a deadlock cycle will be able to detect this cycle.

A second consideration is how this approach handles concurrent transfer of promise ownership or concurrent fulfillment of promises. Suppose that while the cycle detection algorithm is traversing a dependence path, an earlier promise in the path is transferred to a new owner or is fulfilled, thereby invalidating the remainder of the traversed path. Failure to handle this correctly could result in an alarm when there is no deadlock.
The first observation we make is that this scenario cannot arise for any but the most recent promise encountered on the path. If p_1 is owned by t_2, awaiting p_2, owned by t_3, then it is impossible for p_1 to move into a new task or to become fulfilled, since its current owner, t_2, is blocked (or about to block, pending successful verification). The concern is only that t_3 has not yet blocked and may transfer or fulfill p_2. The natural solution is that when traversing the dependence path, upon reaching each promise in the path we must go back and double-check that the preceding promise still belongs to the task it belonged to in the previous iteration and is still unfulfilled. If this check fails, then the present verification passes because progress is still being made.

Algorithm 2. Deadlock Cycle Detection

 1: procedure Get(p)
 2:     t ← currentTask()
 3:     t.waitingOn ← p                      ▷ C++: seq_cst
 4:                                          ▷ TSO: memory fence
 5:     i ← 1; p_1 ← p
 6:     t_{i+1} ← p_i.owner
 7:     while t_{i+1} ≠ t do
 8:         if t_{i+1} = null then break
 9:         p_{i+1} ← t_{i+1}.waitingOn      ▷ C++: acquire
10:         if p_{i+1} = null then break
11:         if t_{i+1} ≠ p_i.owner then break
12:         i ← i + 1
13:         t_{i+1} ← p_i.owner
14:     try
15:         assert t_{i+1} ≠ t
16:         return get_impl(p)
17:     finally
18:         t.waitingOn ← null               ▷ C++: release
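Setting aside the memory-model annotations for the moment, the traversal can be sketched sequentially as follows. This single-threaded model is ours; in it the re-check of the previous owner (line 11) is trivially satisfied, whereas under concurrency it is the guard against ownership transfers mid-traversal:

```java
public class CycleCheck {
    static class Promise { Task owner; }
    static class Task { Promise waitingOn; }

    // Returns true iff blocking task t on promise p would close a cycle.
    static boolean wouldDeadlock(Task t, Promise p) {
        t.waitingOn = p;                        // mark before verifying
        Promise pi = p;
        Task ti = pi.owner;
        while (ti != t) {
            if (ti == null) return false;       // p_i already fulfilled
            Promise next = ti.waitingOn;
            if (next == null) return false;     // t_{i+1} is not blocked
            if (ti != pi.owner) return false;   // ownership moved: progress
            pi = next;                          // i <- i + 1
            ti = pi.owner;
        }
        return true;                            // the chain returned to t
    }

    public static void main(String[] args) {
        // Listing 1 with async(q): the root task owns p, t2 owns q and is
        // already blocked in get(p); the root task now attempts get(q).
        Task root = new Task(), t2 = new Task();
        Promise p = new Promise(), q = new Promise();
        p.owner = root;
        q.owner = t2;
        t2.waitingOn = p;
        boolean dl = wouldDeadlock(root, q);
        System.out.println("deadlock: " + dl);
    }
}
```

In the example, the traversal from q reaches t2, then p, then the root task itself, so the alarm is raised by the last task to enter the cycle.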
The deadlock detector occupies the implementation of the get instruction, given in Algorithm 2. This detector can thereby raise an alarm in a task as soon as the task attempts a deadlock-forming await of a promise. At the time of raising an alarm, the available diagnostic information that can be reported includes the task, the awaited promise, as well as every other task and promise in the cycle, if desired.

For a preliminary understanding of the procedure's logic, we assume sequential consistency in this section. Upon entering Get, the currently executing task records the promise that it will be waiting on (line 3). This waitingOn field was initialized to null in Algorithm 1 line 10 and is always reset to null upon exiting Get (Algorithm 2 line 18), either normally (line 16) or abnormally (line 15). Doing so makes the algorithm robust to programs with more than one deadlock.

The loop in the detection algorithm traverses the chain of alternating owner and waitingOn fields. If task t is waiting on promise p, which is owned by a task t′, then t is effectively waiting on whatever t′ awaits. In traversing this chain, if t finds that it is transitively waiting on itself, then we have identified a deadlock (lines 7, 15). If the algorithm reaches the end of this chain without finding t again, as indicated by finding a null value in line 8 (p_i is already fulfilled) or in line 10 (t_{i+1} is not awaiting a promise), then it is safe to commit to a blocking wait on the desired promise (line 16). Recall that p_i.owner is null after p_i has been fulfilled, and t_{i+1}.waitingOn is null when t_{i+1} is not currently executing Get.

In order to guarantee that an apparent cycle always corresponds to a real deadlock, even under concurrent updates to promises, we rely on line 11 to establish that task t_{i+1} was waiting on promise p_{i+1} while t_{i+1} was still the owner of promise p_i.
This is achieved by reading the owner field both before (lines 6, 13) and after (line 11) reading the waitingOn field (line 9). If the task observes the owner of p_i to have changed, it turns out that it is safe to abandon the deadlock check and commit to the blocking wait.

In sections 4-5, we will move to a weaker memory model. There are two crucial points to remember. We must preserve the ability to reason temporally over the edges in the dependence path, and we must guarantee that at least one task entering a deadlock can observe the existence of the whole deadlock cycle.

With a few tweaks, we can obtain a correctness guarantee for our deadlock detector under a weak memory model, which implies the same guarantee under any stronger model, including sequential consistency. First, we must define this weak memory model and give a definition of deadlock that is compatible with it.

In practice, we do not want to assume that maps such as the owner field have a single, globally consistent state that is observed by all tasks. Machines and languages often have weaker consistency guarantees, and there are performance costs for requesting stronger consistency due to the synchronization required. Instead, we will assume a weak memory model and use unsynchronized accesses whenever possible. We now define this weak memory model, which we will use to establish the correctness of our deadlock detection algorithm under models at least as strong as this one.
Definition 4.1.
The happens-before (h.b.) order is a partial order over the instructions in a program execution that subsumes the intra-task program order and, upon spawning each new task, the ordering of Algorithm 1 line 14 (the start of the new task) after Algorithm 1 line 12 (the last action of the parent task before spawning). The reverse of happens-before is happens-after.

Definition 4.2.
With respect to a given memory location, a read may only observe a (not necessarily unique) last write which happens-before it or any write with which the read is not h.b. ordered. Two writes, or a write and a read, of the same location which are not h.b. ordered are racing.

A typical language has a more refined happens-before ordering and definition of observable writes, especially relating to reads-from edges on promises; however, we will not need to appeal to such edges in our formalism.
Definition 4.3.
A program in L_p is well-formed if, in every execution, for each promise, p, there is at most one new p instruction, and each set, get, or async instruction referring to p happens-after such a new p.

We note that although the owners of different promises may be updated concurrently, it is not possible in Algorithm 1 for a write-write race to occur on the same owner field.

Lemma 4.4.
Consider an execution of a well-formed program. If w_1, w_2 are two writes to p.owner in Algorithm 1, then w_1 and w_2 are not racing. Further, if r is a read of p.owner by task t, and r observes the value to be t, then r does not race with the write it observes.

Proof. The two claims can be shown together. Line 3 represents the initialization of the owner field and so happens-before every other write to it. The writes in lines 12 and 24 each happen-after a read of the same field observes the value to be the currently executing task (lines 9, 23). Take this together with the fact that there are only two ways to set p.owner to t: line 3, executed by t itself, or line 12, executed by the parent of t prior to spawning t. In either case, writing t to p.owner happens-before any read of p.owner by t itself. ∎

Since we do not assume a globally consistent state, we have to be careful in the definition of deadlock cycle. Two tasks need not agree on the value of owner(p) for a given promise, p. Instead of freely referring to owner as a map Promise → Task ∪ { null }, we must additionally state which task's perspective is being used to observe the owner map.

Definition 4.5.
A non-empty set of tasks, T, is in a deadlock cycle if for every task t ∈ T,

1. t is executing get p_t for some promise, p_t,
2. there exists a task, o_{p_t}, also in T which observes that owner(p_t) = o_{p_t},

and T is minimal with respect to these constraints. The set of promises associated to the deadlock is { p_t | t ∈ T }.

The subtle point in this definition is that task o_{p_t} necessarily has the most up-to-date information about the owner of p_t, since o_{p_t} is itself the owner. Per Lemma 4.4, we know that all the writes to p_t.owner are ordered and that o_{p_t} is observing the last such write, since only o_{p_t} is capable of performing the next write to follow the observed one.

Algorithm 2 correctly and precisely detects all deadlocks under our weak memory consistency model with some additional specific consistency requirements on certain accesses. We define these requirements, show how to meet them in each of the TSO, Java, and C++ memory models, and then prove the algorithm raises an alarm exactly when there is a deadlock.
In order to prove correctness, we require the following additional memory consistency.

1. There is a total order, <, over all instances of the write in Algorithm 2 line 3, across all memory locations. Let w_1 < w_2. Any write preceding and including w_1 in h.b. order is visible to any read following w_2 in h.b. order.
2. The consistency of any owner field is expected to follow from release-acquire semantics for any waitingOn field. Specifically, let w_1 be an Algorithm 1 line 3 or line 12 write to an owner field, let w_2 be an Algorithm 2 line 3 write to a waitingOn field, let r_1 be an Algorithm 2 line 9 read, and let r_2 be an Algorithm 2 line 11 read. Suppose w_1, r_2 refer to the same location, as do w_2, r_1. If w_1 happens-before w_2, if w_2 is visible to r_1, and if r_1 happens-before r_2, then w_1 is visible to r_2.
3. The write in Algorithm 2 line 18 must not become visible until the fulfillment of p is visible (Algorithm 1 line 24) or it is determined that an exception should be raised (Algorithm 2 line 15).

These three requirements are readily attained in TSO, Java, and C++ as follows.

• Under TSO, a memory fence is needed in Algorithm 2 line 4 to achieve requirement 1 by ordering line 9 after line 3 and sequentializing all instances of line 4 with each other. TSO naturally achieves requirement 2 by respecting the local store order, as well as requirement 3 by not allowing the line 18 write to become visible early. Note that the loop contains no fences.

• Under the Java memory model, it suffices to mark the two fields, owner and waitingOn, as volatile to satisfy all three requirements. This eliminates all write-read data races. Remember that there are no write-write races (see Lemma 4.4). In the absence of any races on these two fields, the Java memory model guarantees sequential consistency with respect to these fields.

• In C++, both of the fields must be std::atomic to eliminate data races, but this alone is insufficient.
Algorithm 2 line 3 must be tagged as a std::memory_order_seq_cst access to achieve requirement 1, establishing a total order over these writes and subsuming release consistency. Line 9 must then be tagged std::memory_order_acquire to achieve requirement 2. And finally, line 18 must be std::memory_order_release to satisfy requirement 3.
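In the Java case, then, the entire weak-memory treatment reduces to two field declarations; the following minimal rendering is ours:

```java
public class VolatileFields {
    // With both fields volatile, every write-read pair on them is ordered by
    // the JMM's synchronizes-with relation; since write-write races are ruled
    // out (Lemma 4.4), accesses to these two fields behave sequentially
    // consistently, which covers all three requirements at once.
    static class Promise { volatile Task owner; }
    static class Task { volatile Promise waitingOn; }

    public static void main(String[] args) {
        Task t = new Task();
        Promise p = new Promise();
        p.owner = t;        // the Algorithm 1 ownership writes
        t.waitingOn = p;    // the Algorithm 2 pre-verification write
        System.out.println(p.owner == t && t.waitingOn == p);
    }
}
```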
Under the preceding consistency requirements, we can now prove important theoretical guarantees of correctness for our deadlock detector. Throughout, we consider an execution of a well-formed program (Definition 4.3). We first show that Algorithm 2 raises no false alarms.
Theorem 5.1.
If task t fails the assertion in line 15 during Get(p), then a deadlock cycle exists, involving t and p.

Proof. We have t_0 = t and p_0 = p. If the execution had broken out of the while loop in line 8, 10, or 11, then the assertion would have succeeded. Therefore, it is the loop condition that fails. Upon reaching line 12 in each iteration, we have found p_i.owner to be t_{i+1} both before and after we found t_{i+1}.waitingOn to be p_{i+1}. Therefore, we know 1) that at one time t_{i+1} was the owner of p_i, and 2) that while t_{i+1} still observed itself to own p_i, t_{i+1} had invoked Get(p_{i+1}). This follows from memory consistency requirement 2. At this point in the reasoning, we do not yet know whether t_{i+1} is still the owner of p_i or whether t_{i+1} is still awaiting p_{i+1}.

When the loop (lines 7–13) terminates with t_{n+1} = t, since t is the current task, we deduce that the final t_{n+1}, set by line 6 or 13, is the current owner of p_n. For all i modulo n + 1, t_i at one time concurrently observed itself to be the owner of p_{i-1} and was in a call to Get(p_i). This meets our definition of deadlock. β–‘

The following series of lemmas builds to the theorem that Algorithm 2 detects every deadlock.
Definition 5.2.
In a deadlock cycle comprising tasks T, a t* task is a task in T to which the line 3 write by every task in T is visible.

Lemma 5.3.
Every deadlock cycle has a t* task.

Proof. Corollary to memory consistency requirement 1. β–‘

A t* task, which need not be unique, should be thought of as the (or a) last task to enter the deadlock.

Lemma 5.4.
If a program execution exhibits a deadlock cycle comprising tasks T and promises P, then when a t* task calls Get it constructs a sequence {t_i}_i that is a subset of T and a sequence {p_i}_i that is a subset of P.

Proof. We have t_0 = t* ∈ T and, by definition, p_0 ∈ P. If the loop immediately terminates, then t_1 = t ∈ T, and we are done. Otherwise, the values of t_{i+1} and p_{i+1} inductively depend on t_i and p_i. By definition of deadlock, one of the tasks in T, call it t_{o_i}, observes itself to be the owner of p_i. The most recent write to p_i.owner (recall that all such writes are ordered, by lemma 4.4) occurred in program order before t_{o_i}'s line 3 write. Therefore, memory consistency requirement 1 establishes that t* must read t_{i+1} = t_{o_i} ∈ T in line 11. By definition of t* and by memory consistency requirement 3, we see that line 9 observes t_{i+1}'s line 3 write, not its line 18 write. Thus, p_{i+1} ∈ P by definition of deadlock. β–‘

Lemma 5.5.
If a program execution exhibits a deadlock cycle comprising tasks T, then no t* task executes a diverging loop (lines 7–13) in its call to Get.

Proof. Suppose that, during the call to Get by t*, the loop does not terminate. Then t_i β‰  t for all i > 0. But by lemma 5.4, the infinite sequence {t_i}_i is a subset of T. Therefore, T, in fact, exhibits a smaller cycle not involving t, violating the minimality condition in the definition of deadlock cycle. β–‘

Theorem 5.6. If a program execution exhibits a deadlock cycle comprising tasks T and promises P, then at least one task in T fails the assertion in Algorithm 2 line 15.

Proof. Suppose for the sake of contradiction that a deadlock cycle arises and yet no assertion fails. Then every task t ∈ T enters the Get procedure and either blocks at line 16 on a promise in P or diverges in an infinite loop. No task exits the loop by failing the loop condition, t_{i+1} β‰  t, since this would directly fail the assertion in line 15. For each invocation of Get by a t* task, the loop cannot break in line 8 or line 10, because lemma 5.4 implies that no tasks or promises in the sequence are null. If the loop breaks in line 11, then t* has observed the owner of p_i to change from one read to the next. This is impossible: both reads observe the current owner, t_{o_i}, by the same reasoning as in the proof of lemma 5.4. Finally, the loop cannot diverge for t*, by lemma 5.5. Since there exists at least one t* task, by lemma 5.3, we have a contradiction. β–‘

Corollary 5.7 (to theorems 5.1, 5.6). Algorithm 2 is precise and correct, guaranteeing the existence of a deadlock when an alarm is raised and raising an alarm upon every deadlock.
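The loop analyzed in these proofs can be sketched as a walk over alternating owner and waitingOn edges. The names Task, Promise, and findCycle below are illustrative choices of ours; the real Algorithm 2 interleaves this walk with the blocking logic and the re-reads the proofs account for:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the cycle walk performed by Get (Algorithm 2), under our
// assumed Task/Promise records; not the authors' exact code.
class Task { volatile Promise waitingOn; }
class Promise { volatile Task owner; }

public class CycleWalkDemo {
    /** Returns the deadlock cycle if the calling task t, about to block
     *  on p, closes a cycle of owner/waitingOn edges; null otherwise. */
    static List<Task> findCycle(Task t, Promise p) {
        List<Task> chain = new ArrayList<>();
        Promise cur = p;
        Task next = p.owner;                    // t_1
        while (next != null && next != t) {
            chain.add(next);
            Promise q = next.waitingOn;         // what is t_{i+1} awaiting?
            if (q == null) return null;         // owner is not blocked: no cycle
            if (cur.owner != next) return null; // ownership moved meanwhile: give up
            cur = q;
            next = q.owner;                     // t_{i+2}
        }
        if (next == t) { chain.add(t); return chain; } // cycle closed: deadlock
        return null;                            // reached an unowned promise
    }

    public static void main(String[] args) {
        // Build a two-task cycle by hand: t1 owns p1 and awaits p2,
        // while t2 owns p2 and is about to await p1.
        Task t1 = new Task(), t2 = new Task();
        Promise p1 = new Promise(), p2 = new Promise();
        p1.owner = t1; p2.owner = t2;
        t1.waitingOn = p2;
        List<Task> cycle = findCycle(t2, p1);   // t2 calls Get(p1)
        System.out.println(cycle != null && cycle.size() == 2); // prints "true"
    }
}
```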
We have implemented the ownership semantics, with omitted-set and deadlock detection, in Java. We give a brief discussion of some of the practical considerations in the design of this implementation. We then present the results of a performance evaluation on a set of benchmark programs.
Introducing an explicit conception of ownership is minimally disruptive. It is already the case that every promise is fulfilled by at most one task, since two sets cause a runtime error. We only ask that the programmer identify this task by leveraging the existing structure of async directives. However, for large, complex synchronization patterns that rely on many promises, it can become tedious for a programmer to specify all the relevant promises one by one.

In our Java implementation of these language features, an object-oriented approach can reduce the burden of identifying which promises should be moved to new tasks: classes containing many promises may implement a
PromiseCollection interface, so that moving a composite object to a new task is equivalent to moving each of its constituent promises. A channel class is shown in Listing 4, illustrating that complex and versatile primitives can be built on top of promises with the aid of
PromiseCollection. This class behaves like a promise that can be used repeatedly, where the n-th recv operation obtains the value from the n-th send operation. This behavior depends on dynamically allocated promises, and the responsibility for the sending end of the channel is associated not with the ownership of a single promise, but with the ownership of different promises at different times. It is abstraction-breaking to ask the channel user to manually specify which promise to move to a new task in order to effectively move the sending end of the channel. Instead, we give the impression that the channel object itself is movable like a promise (line 39), since it is a PromiseCollection, and the implementation of async relies on the getPromises method (line 11) to determine which promises should be moved.

Listing 4. Object-oriented approach to promise movement.

 1  class Channel<T> implements PromiseCollection {
 2    class Payload {
 3      T value;
 4      Promise<Payload> next;
 5    }
 6
 7    Promise<Payload> producer = new Promise<>();
 8    Promise<Payload> consumer = producer;
 9
10    @Override // from PromiseCollection
11    Iterable<Promise> getPromises() {
12      // Return the set of all promises that
13      // should be moved when this object moves
14      return Collections.singleton(producer);
15    }
16
17    void send(T value) {
18      // Fulfills one promise; allocates another
19      Promise<Payload> next = new Promise<>();
20      producer.set({value, next});
21      producer = next;
22    }
23
24    void stop() {
25      // Fulfills a promise
26      producer.set(null);
27    }
28
29    T recv() {
30      Payload p = consumer.get();
31      consumer = p.next;
32      return p.value;
33    }
34  }
35
36  void main() {
37    Channel<Integer> ch = new Channel<>();
38    ch.send(1);
39    async (ch) {   // Move entire channel
40      ch.send(2);
41      ch.stop();   // No remaining promises
42    }              // No remaining promises
43    ch.recv();  // 1
44    ch.recv();  // 2
45  }

6.2 Exception Handling

In an implementation of Algorithm 1, some care must go into the exception handling mechanism. What code is capable of, and responsible for, recovering from the failed assertion in line 16? And what happens if a task terminates early, with unfulfilled promises, because of an exception?

Observe that line 16 occurs within an asynchronous task after the user-supplied code for that task has completed. One solution is to add a parameter to Async so that the user can supply a post-termination exception handler, which accepts the list of unfulfilled promises, t'.owned, as input. Indeed, the fix for the AWS omitted set bug included such a mechanism (not shown in Listing 3) [2]. Alternatively, the runtime could automatically fulfill every unfulfilled promise upon an assertion failure in line 16. Some APIs, including in C++ and Java, provide an exceptional variant of the completion mechanism for promises [36, 48].
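Java's CompletableFuture exhibits this exceptional completion mechanism. The sketch below, with an illustrative error message of our own, shows how a runtime could complete a promise left unfulfilled by a failed task so that blocked consumers observe the failure rather than hanging:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ExceptionalCompletionDemo {
    public static void main(String[] args) {
        CompletableFuture<Integer> promise = new CompletableFuture<>();
        // Suppose the owning task died before fulfilling this promise.
        // Completing it exceptionally propagates the error to consumers.
        promise.completeExceptionally(
            new IllegalStateException("owner exited with unfulfilled promise"));
        try {
            promise.join();  // a consumer's blocking get
        } catch (CompletionException e) {
            System.out.println(e.getCause().getMessage());
        }
    }
}
```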
In our implementation, we use this mechanism to propagate an exception through the promises that were left unfulfilled.

Finally, observe that the correctness of Algorithm 1 depends only on knowing when a task's owned list is empty. Therefore, the owned list could be correctly replaced with a counter, which would at least reduce the memory footprint of ownership tracking, if not also the execution time of maintaining a list. However, doing so would mean that an assertion failure in line 16 could not indicate which promises went unfulfilled. Therefore, the implementation we evaluate uses an actual list.

We evaluate the execution time and memory usage overheads introduced by our promise deadlock detector on nine task-parallel programs. The overheads are measured relative to the original, unverified baseline versions.

1. Conway [56] parallelizes a 2D cellular automaton by dividing the grid into chunks. We adapted the code from C to Java, using our
Channel class (Listing 4) in place of the MPI primitives used by worker tasks to exchange chunk borders with their neighbors.

2. Heat [9] simulates diffusion on a one-dimensional surface, with 50 tasks operating on chunks of 40,000 cells for 5000 iterations. Neighboring tasks again use
Channel in place of MPI primitives.

3. QSort sorts 1M integers using a parallelized divide-and-conquer recursion; the partition phase is not parallelized. This is a standard technique for parallelizing Quicksort [19] and has been previously implemented using the Habanero-Java Library [33]. We implemented the finish construct, which awaits task termination, using promises.

4. Randomized distributes 5000 promises over 2535 tasks spawned in a tree with branching factor 3. Each task awaits a random promise with probability 0.8 before performing some work, fulfilling its own promises, and awaiting all its child tasks. We chose a random seed that does not construct a deadlock.

5. Sieve counts the primes below 100,000 with a pipeline of tasks, each filtering out the multiples of an earlier prime. A similar program is found in prior work [45].

6. SmithWaterman (adapted from HClib [26]; also used in prior work [15, 55]) aligns DNA sequences having 18,000–20,000 bases. Each task operates on a 25 Γ— 25 chunk.

7. Strassen multiplies 128 Γ— 128 matrices containing around 8000 values. Divide-and-conquer recursion issues asynchronous addition and multiplication tasks, up to depth 5.

8. StreamCluster (from PARSEC [5]) computes a streaming k-means clustering of 102,400 points in 128 dimensions, using 8 worker tasks at a time. We replaced the OpenMP barriers with promises in an all-to-all dependence pattern.

9. StreamCluster2 reduces synchronization in StreamCluster by replacing some of the all-to-all patterns with all-to-one when it is correct to do so. We also correct a data race in the original implementation.

All benchmarks were run on a Linux machine with a 16-core AMD Opteron processor under the OpenJDK 11 VM with a 1 GB memory limit. A thread pool schedules asynchronous tasks, spawning a new thread for a new task when all existing threads are in use. This execution strategy is necessary in general for promises because there is no a priori bound on the number of tasks that can block simultaneously. We measured both execution time and, in a separate run, average memory usage, sampled every 10 ms. Each measurement is averaged over thirty runs within the same VM instance, after five discarded warm-up runs; this is a standard technique to mitigate the variability of JVM overheads, including JIT compilation [22].

Table 1 gives the unverified baseline measurements for each program and the overhead factors introduced by the verifiers. The table also gives the geometric mean of overheads across all benchmarks. There is an overall factor of 1.12Γ— in execution time and 1.06Γ— in memory usage. The total number of tasks in the program and the average rates of promise get and set actions per millisecond (with respect to the baseline execution time) are also reported. Figure 1 represents the execution times of each benchmark, showing the 95% confidence interval.
Table 1. Mean execution time and memory overheads for verification.

Benchmark        Time: Baseline (s)  Overhead  Memory: Baseline (MB)  Overhead  Tasks  Gets/ms  Sets/ms
Conway           4.43                1.01Γ—     β€”                      β€”         101    361.74   361.58
Heat             5.06                1.00Γ—     β€”                      β€”         51     98.92    98.89
QSort            3.14                0.98Γ—     β€”                      β€”         β€”      β€”        β€”
Randomized       β€”                   β€”         β€”                      β€”         β€”      β€”        β€”
Sieve            β€”                   β€”         β€”                      β€”         β€”      β€”        β€”
SmithWaterman    β€”                   β€”         β€”                      β€”         β€”      β€”        β€”
Strassen         β€”                   β€”         β€”                      β€”         β€”      β€”        β€”
StreamCluster    β€”                   β€”         β€”                      β€”         33     39.27    274.89
StreamCluster2   16.81               0.99Γ—     β€”                      β€”         33     17.92    125.93
Geometric Mean Overhead              1.12Γ—                            1.06Γ—

Figure 1. Execution times for each benchmark, showing the mean with a 95% confidence interval (red).

The low overheads indicate that our deadlock detection algorithm does not introduce serialization bottlenecks. The overall execution time overheads are within 1.1Γ— for each of Conway, Heat, QSort, Randomized, SmithWaterman, Strassen, and StreamCluster2. The same is true of the memory overheads for this subset of benchmarks, excepting SmithWaterman. In many cases, the verified run narrowly outperforms the baseline, which can be attributed to perturbations in scheduling and garbage collection.

It is worth noting that the execution overhead for Sieve is in excess of 2Γ—. Sieve has the single highest rate of get operations, by an order of magnitude (over 37,000, compared to SmithWaterman's 536). The Sieve program requires almost 9,594 tasks to be live simultaneously, each waiting on the next, with the potential to form very long dependence chains for Algorithm 2 to traverse.

We can also remark on the 1.4Γ— memory overhead in SmithWaterman. Unlike Conway, Heat, Sieve, and both of the StreamCluster benchmarks, in which most promises are allocated by the same task that fulfills them, SmithWaterman (and Randomized) allocates all promises in the root task and moves them later. In maintaining the owned lists in Algorithm 1, one can make trade-offs between speed and space. Our implementation favors speed, so instead of literally removing a promise p from t.owned in lines 11 and 25, we simply rely on the fact that p.owner β‰  t to detect that p should no longer be counted in line 16.

For comparison with deadlock verification in other settings, the Armus tool [14] can identify barrier deadlocks as soon as they occur, with execution overheads of up to 1.5Γ— on Java benchmarks. Our benchmark results represent an acceptable performance overhead when one desires runtime-identifiable deadlocks and omitted sets with attributable blame.

Task-parallel programming is prevalent in a variety of languages and libraries. Multilisp [28] is one of the earliest languages with futures, a mechanism for parallel execution of functional code. Fork-join parallelism is employed in Cilk [20], and the more general async-finish with futures model was introduced in X10 [12].
Habanero-Java [10] modernized X10 as an extension to Java and, later, as a Java library, HJlib [33]; this language incorporates additional synchronization primitives, such as the phaser [50] and the data-driven future [51], which is a promise-like mechanism. Many other languages, libraries, and extensions include spawning and synchronizing facilities, whether for threads or lightweight tasks, including Chapel [11], Fortress [3], OpenMP [47], Intel Threading Building Blocks [34], Java [24], C++17 [36], and Scala [27].

The promise, as we define it, can be traced back to the I-structures of the Id language [4], which are also susceptible to deadlock. Cells of data in an I-structure are uninitialized when allocated, may be written to at most once, and support a read operation that blocks until the data is available.

The classic definition of a deadlock is found in Isloor and Marsland [35], which is primarily concerned with concurrent allocation of limited resources. Solutions in this domain fall into the three categories of Coffman: static prevention, run-time detection, and run-time avoidance [13].

We consider logical deadlocks, which are distinct from resource deadlocks in that there is an unresolvable cyclic dependence among computational results. Solutions in the logical deadlock domain include techniques that dynamically detect cycles [29, 30, 37, 38, 40, 54], that raise alarms upon the formation or possible formation of cycles [1, 6, 14, 15, 23, 55], that statically check for cycles through analysis [42, 45, 57] or through type systems [7, 52], or that preclude cycles by carefully limiting the blocking synchronization semantics available to the programmer, either statically or dynamically [10, 12, 15, 50, 55].
The present work includes a dynamic, precise cycle detection algorithm, enabled only by the introduction of a structured ownership semantics on the otherwise unrestricted promise primitive.

Futures are a special case of promises in which each promise is bound to a task whose return value is automatically put into the promise. Transitive Joins [55] and its predecessor, Known Joins [15], are policies with runtime algorithms for deadlock detection on futures. They are, in general, not applicable to promises. These two techniques impose additional structure on the synchronization pattern by limiting the set of futures that a given task may await at any given time.

Recent work identifies the superior flexibility of promises over futures, along with the problematic loss of a guarantee that they will be fulfilled, and develops a forward construct as a middle ground [18]. Forwarding can be viewed in terms of delegating promise ownership, but it is restricted in that 1) it moves only a single promise into a new task, and 2) in particular, it moves only the implicit promise that is used to retrieve a task's return value. In terms of futures, forwarding amounts to re-binding a future to a new task.

Other synchronization constructs benefit from annotations similar to the one we have proposed for promises. This includes event-driven programming models in which events have semantics similar to those of promises. JavaScript, though a single-threaded language, still uses an asynchronous task model to schedule callbacks on an event loop [39], and could benefit from our approach. Likewise, our approach is directly applicable to multithreaded execution models, such as Concurrent Collections [8] and the Open Community Runtime [41], that use event-driven execution as a fundamental primitive. As another example, the MPI blocking receive primitive must name the sending task; from this information, a waits-for graph for deadlock detection can be directly constructed [29].
In addition, nonblocking communications in MPI use MPI_Request objects in a manner similar to promises, and the
MPI_Wait operation is akin to the get operation on promises.

Languages with barriers and phasers sometimes require the participating tasks to register with the construct [50]. Notably, this kind of registration is absent from the Java API, which is problematic for the Armus deadlock tool [14]. In that work, registration annotations had to be added to the Java benchmarks in order to apply the Armus methodology.

In this work, we considered programs that use only promises for blocking synchronization, and we constrained ownership transfer to occur only when a task is spawned. Since a promise can have multiple readers or no readers at all, it is not possible in principle to use one promise to synchronize the ownership hand-off of a second promise between two existing tasks: we cannot guarantee that the receiving task exists and is unique. In future work, one could consider a slightly higher abstraction in the form of a pair of promises acting like a rendezvous, which is a primitive in languages like Ada and Concurrent C [21]. Such a synchronization pattern could be leveraged to hand off promise ownership, since there would be a guaranteed single receiving task.

The Rust language incorporates affine types in its move semantics to ensure that certain objects have at most one extant reference at all times [49]. The movement of promise ownership from one task to another, and the obligation to fulfill each promise exactly once, may be expressible at compile time through the use of a linear type system, which restricts references to exactly one instance.
We have introduced an ownership mechanism for promises, whereby each task is responsible for ensuring that all of its owned promises are fulfilled. This mechanism makes it possible to identify a bug, called the omitted set, at runtime when the bug actually occurs, and to report which task is to blame for the error. The ownership mechanism also makes it meaningful, for the first time, to formally define, discuss, and detect deadlock cycles among tasks synchronizing with promises. Such a bug is now detectable as soon as the cycle forms.

In our approach, any code that spawns a new asynchronous task must name the promises that are to be transferred to the new task. The programmer must already be aware of this critical information in order to even informally reason about omitted set and deadlock bugs. We now ask that it be explicitly notated in the code.

We provided an algorithm to check for compliance with the ownership policy at runtime, thereby detecting omitted sets, as well as an algorithm for detecting deadlock cycles using ownership information. Both types of bug are detected when they occur, not after the fact. Our deadlock detector is provably precise and correct under a weak memory model, and we described how to obtain this correct behavior under the TSO, Java, and C++ memory models. Every alarm corresponds to a true deadlock, and every deadlock results in an alarm. Experimental evaluation demonstrates that our lock-free approach to deadlock detection exhibits low execution time and memory overheads relative to an uninstrumented baseline.

Acknowledgments

This work is supported by the National Science Foundation under Collaborative Grant No. 1822919 and Graduate Research Fellowship Grant No. 1650044.
References

[1] Rahul Agarwal and Scott D. Stoller. 2006. Run-Time Detection of Potential Deadlocks for Programs with Locks, Semaphores, and Condition Variables. In
Proc. 2006 Worksh. on Parallel and Distributed Systems:Testing and Debugging (PADTAD β06) . ACM, New York, NY, 51β60.[2] Dongie Agnir. 2019.
Call exceptionOcurred in case of stream error .Amazon Web Services. Retrieved 30 July 2020 from https://github.com/aws/aws-sdk-java-v2/commit/bfdd0d2063 [3] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-WillemMaessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt.2008.
The Fortress Language Specification . Sun Microsystems, Inc.[4] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. 1989. I-Structures:Data Structures for Parallel Computing.
ACM Trans. Program. Lang.Syst.
11, 4 (1989), 598β632.[5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008.The PARSEC Benchmark Suite: Characterization and ArchitecturalImplications. In
Proc. 17th Intβl. Conf. on Parallel Architectures andCompilation Techniques (PACT β08) . ACM, New York, NY, 72β81.[6] GΓ©rard Boudol. 2009. A Deadlock-Free Semantics for Shared MemoryConcurrency. In
Proc. 6th Intβl. Coll. on Theoretical Aspects of Computing(ICTAC β09) . Springer, Berlin, Germany, 140β154.[7] Chandrasekhar Boyapati, Robert Lee, and Martin Rinard. 2002. Own-ership Types for Safe Programming: Preventing Data Races and Dead-locks. In
Proc. 17th ACM SIGPLAN Conf. on Object-Oriented Program-ming, Systems, Languages, and Applications (OOPSLA β02) . ACM, NewYork, NY, 211β230.[8] Zoran BudimliΔ, Michael Burke, Vincent CavΓ©, Kathleen Knobe, GeoffLowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar,Frank Schlimbach, and SaΔnak TaΕΔ±rlar. 2010. Concurrent Collections.
Sci. Program.
18, 3β4 (Aug. 2010), 203β217. https://doi.org/10.1155/2010/521797 [9] John Burkardt. 2016.
HEAT_MPI: Solve the 1D Time Dependent HeatEquation using MPI . Florida State University. Retrieved 13 Au-gust 2020 from https://people.sc.fsu.edu/~jburkardt/cpp_src/heat_mpi/heat_mpi.html [10] Vincent CavΓ©, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. 2011.Habanero-Java: The New Adventures of Old X10. In
Proc. 9th Intβl.Conf. on Principles and Practice of Programming in Java (PPPJ β11) .ACM, New York, NY, 51β61.[11] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. 2007.Parallel Programmability and the Chapel Language.
Intβl. Journal ofHigh Performance Computing Applications
21, 3 (2007), 291β312.[12] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Don-awa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and VivekSarkar. 2005. X10: An Object-Oriented Approach to Non-Uniform Clus-ter Computing. In
Proc. 20th ACM SIGPLAN Conf. on Object-OrientedProgramming, Systems, Languages, and Applications (OOPSLA β05) .ACM, New York, NY, 519β538.[13] E. G. Coffman, M. Elphick, and A. Shoshani. 1971. System Deadlocks.
ACM Comput. Surv.
3, 2 (1971), 67β78.[14] Tiago Cogumbreiro, Raymond Hu, Francisco Martins, and NobukoYoshida. 2018. Dynamic Deadlock Verification for General BarrierSynchronisation.
ACM Trans. Program. Lang. Syst.
41, 1, Article 1(2018), 38 pages.[15] Tiago Cogumbreiro, Rishi Surendran, Francisco Martins, Vivek Sarkar,Vasco T. Vasconcelos, and Max Grossman. 2017. Deadlock Avoidancein Parallel Programs with Futures: Why Parallel Tasks Should NotWait for Strangers.
Proc. ACM Program. Lang.
OOPSLA, Article 103 (2017), 26 pages.[16] Alex Crichton. 2020. futures::channel . Retrieved 30 July 2020 from https://docs.rs/futures/0.3.5/futures/channel [17] Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, andEduard Ayguade. 2009. Barcelona OpenMP Tasks Suite: A Set ofBenchmarks Targeting the Exploitation of Task Parallelism in OpenMP.In
Proc. 2009 Intβl. Conf. on Parallel Processing (ICPP β09) . IEEE ComputerSociety, Washington, DC, 124β131.[18] Kiko Fernandez-Reyes, Dave Clarke, Elias Castegren, and Huu-PhucVo. 2018. Forward to a Promising Future. In
Intβl. Conf. on CoordinationModels and Languages . Springer, Cham, Switzerland, 162β180.[19] Rhys S. Francis and Linda J.H. Pannan. 1992. A parallel partition forenhanced parallel QuickSort.
Parallel Comput.
18, 5 (1992), 543β550.[20] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. TheImplementation of the Cilk-5 Multithreaded Language. In
Proc. ACMSIGPLAN 1998 Conf. on Programming Language Design and Implemen-tation (PLDI β98) . ACM, New York, NY, 212β223.[21] Narain H. Gehani and William D. Roome. 1988. Rendezvous Facilities:Concurrent C and the Ada Language.
IEEE Transactions on SoftwareEngineering
14, 11 (1988), 1546β1553.[22] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. StatisticallyRigorous Java Performance Evaluation. In
Proc. 22nd ACM SIGPLANConf. on Object-Oriented Programming, Systems, Languages, and Appli-cations (OOPSLA β07) . ACM, New York, NY, 57β76.[23] Prodromos Gerakios, Nikolaos Papaspyrou, Konstantinos Sagonas,and Panagiotis Vekris. 2011. Dynamic Deadlock Avoidance in SystemsCode Using Statically Inferred Effects. In
Proc. 6th Worksh. on Program-ming Languages and Operating Systems (PLOS β11) . ACM, New York,NY, Article 5, 5 pages.[24] B. Goetz, T. Peierls, J. Bloch, J. Bowbeer, D. Lea, and D. Holmes. 2006.
Java Concurrency in Practice . Pearson Education, London, England.[25] Google. 2014.
Go 1.3 Release Notes . Retrieved 30 July 2020 from https://golang.org/doc/go1.3 [26] Habanero Extreme Scale Software Research Lab. 2020.
Habanero-C Li-brary . Retrieved 15 February 2020 from https://github.com/habanero-rice/hclib [27] Philipp Haller, Aleksandar Prokopec, Heather Miller, Viktor Klang,Roland Kuhn, and Vojin Jovanovic. 2012.
Futures and Promises . Γcolepolytechnique fΓ©dΓ©ral de Lausanne. Retrieved 24 August 2020 from https://docs.scala-lang.org/overviews/core/futures.html [28] Robert H. Halstead, Jr. 1985. Multilisp: A Language for ConcurrentSymbolic Computation.
ACM Trans. Program. Lang. Syst.
7, 4 (1985),501β538.[29] Tobias Hilbrich, Bronis R. de Supinski, Martin Schulz, and Matthias S.MΓΌller. 2009. A Graph Based Approach for MPI Deadlock Detection.In
Proc. 23rd Intβl. Conf. on Supercomputing (ICS β09) . ACM, New York,NY, 296β305.[30] Tobias Hilbrich, Joachim Protze, Martin Schulz, Bronis R. de Supinski,and Matthias S. MΓΌller. 2012. MPI Runtime Error Detection withMUST: Advances in Deadlock Detection. In
Proc. Intβl. Conf. on HighPerformance Computing, Networking, Storage and Analysis (SC β12) .IEEE Computer Society, Los Alamitos, CA, Article 30, 11 pages.[31] Oliver Hsu. 2019.
S3: FileAsyncResponseTransformer future does notcomplete when checksum error occurs . Amazon Web Services. Retrieved30 July 2020 from https://github.com/aws/aws-sdk-java-v2/issues/1279 [32] Shams Imam and Vivek Sarkar. 2014. Cooperative Scheduling ofParallel Tasks with General Synchronization Patterns. In
EuropeanConference on Object-Oriented Programming (ECOOP β14) . Springer,Berlin, Germany, 618β643.[33] Shams Imam and Vivek Sarkar. 2014. Habanero-Java Library: A Java 8Framework for Multicore Programming. In
Proc. 2014 Intβl. Conf. onPrinciples and Practices of Programming on the Java Platform: VirtualMachines, Languages, and Tools (PPPJ β14) . ACM, New York, NY, 75β86.
[34] Intel. 2020.
Intel Threading Building Blocks Developer Guide . Intel.[35] S. Sreekaanth Isloor and T. Anthony Marsland. 1980. The DeadlockProblem: An Overview.
Computer
13, 9 (1980), 58β78.[36] ISO. 2017.
ISO/IEC 14882:2017: Programming Languages β C++ . Inter-national Organization for Standardization, Geneva, Switzerland.[37] Bettina Krammer, Tobias Hilbrich, Valentin Himmler, Blasius Czink,Kiril Dichev, and Matthias S. MΓΌller. 2008. MPI Correctness Check-ing with Marmot. In
Tools for High Performance Computing . Springer,Berlin, Germany, 61β78.[38] Bettina Krammer, Matthias S. MΓΌller, and Michael M. Resch. 2004.MPI Application Development Using the Analysis Tool MARMOT. In
Proc. Intβl. Conf. on Computational Science (ICCS β04) . Springer, Berlin,Germany, 464β471.[39] Matthew C. Loring, Mark Marron, and Daan Leijen. 2017. Semanticsof Asynchronous JavaScript. In
Proc. 13th ACM SIGPLAN Intβl. Symp.on Dynamic Languages (DLS β17) . ACM, New York, NY, 51β62.[40] Glenn Luecke, Hua Chen, James Coyle, Jim Hoekstra, Marina Kraeva,and Yan Zou. 2003. MPI-CHECK: A Tool for Checking Fortran 90 MPIPrograms.
Concurrency and Computation: Practice and Experience. [41] Timothy G. Mattson et al. 2016. The Open Community Runtime: A Runtime System for Extreme Scale Computing. In Proc. 2016 IEEE High Performance Extreme Computing Conf. (HPEC '16). IEEE, 1–7. https://doi.org/10.1109/HPEC.2016.7761580 [42] Mayur Naik, Chang-Seo Park, Koushik Sen, and David Gay. 2009. Effective Static Deadlock Detection. In
Proc. 31st Intβl. Conf. on SoftwareEngineering (ICSE β09) . IEEE Computer Society, Washington, DC, 386β396.[43] Varun Nandi. 2019.
Donβt call onComplete after onError in Checksum-ValidatingSubscriber . AmazonWeb Services. Retrieved 30 July 2020 from https://github.com/aws/aws-sdk-java-v2/commit/eaecf99a02 [44] Mozilla Developer Network. 2020.
Promise β JavaScript | MDN . Re-trieved 5 August 2020 from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise [45] Nicholas Ng and Nobuko Yoshida. 2016. Static Deadlock Detectionfor Concurrent Go by Global Session Graph Synthesis. In
Proc. 25thIntβl. Conf. on Compiler Construction (CC β16) . ACM, New York, NY,174β184.[46] Joachim Niehren, Jan Schwinghammer, and Gert Smolka. 2005. AConcurrent Lambda Calculus with Futures. In
Intβl. Worksh. on Frontiers of Combining Systems (FroCoS β05) . Springer, Berlin, Germany, 338β356.[47] OpenMP Architecture Review Board 2018.
OpenMP Application Pro-gramming Interface . OpenMP Architecture Review Board.[48] Oracle. 2020.
CompletableFuture (Java SE 14 & JDK 14) . Retrieved30 July 2020 from https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/util/concurrent/CompletableFuture.html [49] Rust Lang. 2020.
The Rust Programming Language . Retrieved 12August 2020 from https://doc.rust-lang.org/1.8.0/book/index.html [50] Jun Shirako, David M. Peixotto, Vivek Sarkar, and William N. Scherer,III. 2008. Phasers: A Unified Deadlock-Free Construct for Collectiveand Point-to-Point Synchronization. In
Proc. 22nd Ann. Int'l. Conf. on Supercomputing (ICS '08). ACM, New York, NY, 277–288. [51] SaΔŸnak TaΕŸΔ±rlar and Vivek Sarkar. 2011. Data-Driven Tasks and Their Implementation. In Proc. 2011 Int'l. Conf. on Parallel Processing (ICPP '11). 652–661. [52] Vasco T. Vasconcelos, Francisco Martins, and Tiago Cogumbreiro. 2010. Type Inference for Deadlock Detection in a Multithreaded Polymorphic Typed Assembly Language. In
Proc. 2nd Intβl. Worksh. on ProgrammingLanguage Approaches to Concurrency and Communication-cEntric Soft-ware (Electronic Proceedings in Theoretical Computer Science, Vol. 17) ,Alastair R. Beresford and Simon Gay (Eds.). Open Publishing Associa-tion, 95β109.[53] Philippe Virouleau, Pierrick Brunet, FranΓ§ois Broquedis, Nathalie Fur-mento, Samuel Thibault, Olivier Aumage, and Thierry Gautier. 2014.Evaluation of OpenMP Dependent Tasks with the KASTORS Bench-mark Suite. In
Intβl. Worksh. on OpenMP (IWOMP β14) . Springer, Cham,Switzerland, 16β29.[54] Anh Vo, Ganesh Gopalakrishnan, Robert M. Kirby, Bronis R. de Supin-ski, Martin Schulz, and Greg Bronevetsky. 2011. Large Scale Verifica-tion of MPI Programs Using Lamport Clocks with Lazy Update. In
Proc.2011 Intβl. Conf. on Parallel Architectures and Compilation Techniques(PACT β11) . IEEE Computer Society, Washington, DC, 330β339.[55] Caleb Voss, Tiago Cogumbreiro, and Vivek Sarkar. 2019. TransitiveJoins: A Sound and Efficient Online Deadlock-Avoidance Policy. In
Proc. 24th ACM SIGPLAN Symp. on Principles and Practice of ParallelProgramming (PPoPP β19) . ACM, New York, NY, 378β390.[56] Aaron Weeden. 2012.
Parallelization: Conwayβs Game of Life . TheShodor Education Foundation. Retrieved 13 August 2020 from [57] Amy Williams, William Thies, and Michael D. Ernst. 2005. StaticDeadlock Detection for Java Libraries. In
Proc. 19th European Conf. on Object-Oriented Programming (ECOOP '05). Springer, Berlin, Germany, 602–629.