Deterministic Consistency: A Programming Model for Shared Memory Parallelism
Amittai Aviram and Bryan Ford
Yale University
Draft of 2018/10/29 10:12
Abstract
The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose deterministic consistency, a parallel programming model as easy to understand as the “parallel assignment” construct in sequential languages such as Perl and JavaScript, where concurrent threads always read their inputs before writing shared outputs. DC supports common data- and task-parallel synchronization abstractions such as fork/join and barriers, as well as non-hierarchical structures such as producer/consumer pipelines and futures. A preliminary prototype suggests that software-only implementations of DC can run applications written for popular parallel environments such as OpenMP with low ( < ) overhead for some applications.

1 Introduction

For decades, the “gold standard” in multiprocessor programming models has been sequentially consistent shared memory [25] with mutual exclusion [20]. Alternative models, such as explicit message passing [29] or weaker consistency [17], usually represent compromises to improve performance without giving up “too much” of the simplicity and convenience of sequentially consistent shared memory. But are sequential consistency and mutual exclusion really either simple or convenient? In this model, we find that slight concurrency errors yield subtle heisenbugs [27, 28] and security vulnerabilities [34].
Data race detection [16, 30] or transactional memory [19, 32] can help ensure mutual exclusion, but even “race-free” programs may have heisenbugs [2]. Heisenbugs result from nondeterminism in general, a realization that has inspired new languages that ensure determinism through communication constraints [33] or type systems [7]. But to parallelize the vast body of sequential code for new multicore systems, we would like a programming model that is simple, convenient, deterministic, and compatible with existing languages.

Figure 1: Deterministic versus sequential consistency

To this end, we propose a new memory model called deterministic consistency or DC. In DC, concurrent threads logically share an address space but never see each others’ writes, except when they synchronize explicitly and deterministically. To illustrate DC, consider the “parallel assignment” operator in many sequential languages such as Python, Perl, Ruby, and JavaScript, with which one may swap two variables as follows:

    x,y := y,x

This construct implies no actual parallel execution: the statement merely evaluates all right-side expressions (in some order) before writing their results to the left-side variables. Now consider a “truly parallel” analog, using Hoare’s notation for fork/join parallelism [20]:

    {x := y} // {y := x}
This statement forks two threads, each of which reads one variable and then writes the other; the threads then synchronize and rejoin. As Figure 1 illustrates, under sequential consistency, this parallel statement may swap the variables or overwrite one with the other, depending on timing. Making each thread’s actions atomic, by enclosing the assignments in critical sections or transactions, eliminates the swapping case but leaves a nondeterministic race between x overwriting y and y overwriting x. How popular would the former “parallel assignment” construct be if it behaved in this way? Deterministic consistency, in contrast, reliably behaves like a parallel assignment: each thread reads all inputs before writing any shared results.

Like release consistency [17], DC distinguishes ordinary reads and writes from synchronization operations and classifies the latter into acquires and releases, which determine at what point one thread sees (acquires) results produced (released) by another thread. DC ensures determinism by requiring that (1) program logic uniquely pairs each acquire with a matching release, (2) only an intervening acquire/release pair makes one thread’s writes visible to another thread, and (3) acquires handle conflicting writes deterministically. Unlike most memory models, reads never conflict with writes in DC: the swapping example above contains no data race. A natural way to understand DC—and one way to implement it—is as a distributed shared memory [1, 24] in which a release explicitly “transmits” a message containing memory updates, and the matching acquire operation “receives” and integrates these updates locally.

DC supports not only block-structured synchronization abstractions such as the fork/join, barrier, and task constructs of OpenMP [6], but also non-hierarchical synchronization patterns such as dynamic producer/consumer graphs and inter-thread queues. DC can emulate nondeterministic synchronization constructs in existing parallel code via techniques such as deterministic scheduling [3, 4, 12], but for new or newly parallelized code, we develop deterministic alternatives for common idioms such as pipelines and futures.
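This behavior can be modeled concretely. The sketch below is illustrative Python, not the prototype's implementation; the `dc_fork_join` helper and its dict-based "diff" representation are our own invention. Each forked body reads a private snapshot of shared state, buffers its writes, and the join acquires and merges all diffs, treating a write-write conflict as a data race:

```python
import threading
from queue import Queue

def dc_fork_join(shared, bodies):
    """Fork one thread per body; each reads a private snapshot of `shared`
    and returns its writes as a dict "diff". The join acquires every diff
    and merges them, raising deterministically on a write-write conflict."""
    channel = Queue()                      # diffs released by children at join
    def run(body):
        workspace = dict(shared)          # acquire at fork: snapshot only
        channel.put(body(workspace))      # release at join: transmit the diff
    threads = [threading.Thread(target=run, args=(b,)) for b in bodies]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    merged = {}
    for _ in bodies:                      # parent acquires each child's diff
        for var, val in channel.get().items():
            if var in merged:             # conflicting writes: a data race
                raise RuntimeError("data race on " + var)
            merged[var] = val
    shared.update(merged)

# The parallel swap {x := y} // {y := x}: both threads read pre-fork values.
state = {'x': 1, 'y': 2}
dc_fork_join(state, [lambda w: {'x': w['y']},
                     lambda w: {'y': w['x']}])
print(state)  # {'x': 2, 'y': 1} on every run: a true swap
```

Because non-conflicting diffs commute, the merged result does not depend on which child finishes first; conflicting writes raise an exception rather than being resolved by timing.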
A prototype in progress promises to be flexible and efficient enough for a variety of parallel applications. Section 2 defines DC at a low level, and Section 3 explores its use in high-level environments like OpenMP. Section 4 outlines implementation issues, Section 5 discusses related work, and Section 6 concludes.

2 Deterministic Consistency

Since others have eloquently made the case for deterministic parallelism [7, 27], we will take its desirability for granted and focus on deterministic consistency (DC). This section defines the basic DC model and its low-level synchronization primitives, leaving the model’s mapping to high-level abstractions to the next section.
2.1 The Basic DC Model

As in release consistency (RC) [17, 24], DC separates normal data accesses from synchronization operations and classifies the latter into release, where a thread makes recent state changes available for use by other threads, and acquire, where a thread obtains state changes made by other threads. A thread performs a release when forking a child thread or leaving a barrier, for example, and an acquire when joining with a child or entering a barrier. As in RC, synchronization operations in DC are sequentially consistent relative to each other, and these synchronization operations determine when a normal write in one thread must become visible to a normal read in another thread: namely, when an intervening chain of acquire/release pairs connects the two accesses in a “happens-before” synchronization relation.

Figure 2: Example synchronization trace for three threads with labeled and matched release/acquire pairs

While RC relaxes the constraints of sequential consistency [25], allowing an even wider range of nondeterministic orderings, DC in turn tightens RC’s constraints to permit only one unique execution behavior for a given parallel program. DC ensures determinism by adding three new constraints to those of RC:

1. Program logic must uniquely pair release and acquire operations, so that each release “transmits” updates to a specific acquire in another thread.
2. One thread’s writes never become visible to another thread’s reads until mandated by synchronization: i.e., writes propagate “as slowly as possible.”
3. If two threads perform conflicting writes to the same location, the implementation handles the conflict deterministically at the relevant acquire.

Constraint 1 makes synchronization deterministic by ensuring that a release in one thread always interacts with the same acquire in some other thread, at the same point in each thread’s execution, regardless of execution speeds.
A program might in theory satisfy this constraint by specifying each synchronization operation’s “partner” explicitly through a labeling scheme. If each thread has a unique identifier T, and we assign each of T’s synchronization actions a consecutive integer N, then a (T, N) pair uniquely names any synchronization event in a program’s execution. The program then invokes synchronization primitives of the form acquire(Tr, Nr) and release(Ta, Na), where (Tr, Nr) names the acquire’s partner release and vice versa. Figure 2 illustrates a 3-thread execution trace with matched and labeled acquire/release pairs. We suggest this scheme only to clarify DC: explicit labeling would be an unwelcome practical burden, and Section 3 discusses more convenient high-level abstractions.

Constraint 2 makes normal accesses deterministic by ensuring that writes in a given thread become visible to reads in another thread at only one possible moment. Release consistency already requires a write by thread T1 to become visible to thread T2 no later than the moment T2 performs an acquire directly or indirectly following T1’s next release after the write. RC permits the write to become visible to T2 before this point, but DC requires the write to propagate to T2 at exactly this point. By delaying writes “as long as possible,” DC ensures that non-conflicting normal accesses behave deterministically while preserving the key property that makes RC efficient: it keeps parallel execution as independent as possible subject to synchronization constraints.

DC’s third constraint affects only programs with data races. If both threads in Figure 1 wrote to the same variable before rejoining, for example, DC requires the join to handle this race deterministically. Since data races usually indicate software bugs, one response is to throw a runtime exception. Other behaviors, e.g., prioritizing one write over the other, would not affect correct programs but may be less helpful with buggy code.

2.2 Why DC Is Deterministic

To clarify why the above rules adequately ensure deterministic execution in spite of arbitrary parallelism, we briefly sketch a proof of DC’s determinism.
Theorem:
A parallel program whose sequential fragments execute deterministically, and whose memory access and synchronization behavior conforms to the rules in Section 2.1, yields at most one possible result.
Proof Sketch:
Assume each synchronization operation explicitly names its “partner” as described above. Suppose we implement DC by accumulating memory “diffs” and passing them at synchronization points atop a message-passing substrate, as in distributed shared memory [1, 24]. Assume the substrate provides an unlimited number of buffered message channels, each with a unique name of the form (Tr, Nr, Ta, Na). When a thread Tr invokes a release(Ta, Na) operation labeled (Tr, Nr), Tr sends all diffs it has accumulated so far on channel (Tr, Nr, Ta, Na). Similarly, when thread Ta invokes an acquire(Tr, Nr) operation labeled (Ta, Na), it receives a set of diffs on channel (Tr, Nr, Ta, Na) and applies those it does not already have. Since each channel (Tr, Nr, Ta, Na) is used by only one sender Tr and one receiver Ta, the resulting system forms a Kahn process network [23], and DC’s determinism follows from that of Kahn networks.

3 Deterministic OpenMP

We are developing DOMP, a variant of OpenMP [6] with deterministic consistency. DOMP retains OpenMP’s language neutrality and convenience, supporting most OpenMP constructs except for fundamentally nondeterministic ones, and extending OpenMP to support general reductions and non-hierarchical dependency structures.
Fork/Join:
OpenMP’s foundation is its parallel construct, which forks multiple threads to execute a parallel code block and then rejoins them. Fork/join parallelism maps readily to DC, as shown in Figure 3(a): on fork, the parent releases to an acquire at the birth of each child; on join, the parent acquires the final results each child releases at its death. OpenMP’s work-sharing constructs, such as parallel for loops, merely affect each child thread’s actions within this fork/join model.
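A work-sharing loop under this fork/join mapping can be sketched as follows (illustrative Python; `domp_parallel_for` and its snapshot/diff bookkeeping are our own model, not DOMP's API). Each child works on a private snapshot and a disjoint index range, and the parent acquires all writes at the join:

```python
import threading

def domp_parallel_for(data, fn, nthreads=4):
    """Model of a DC 'parallel for': on fork the parent releases its state
    to each child; children write private buffers over disjoint index
    ranges; on join the parent acquires and applies every child's writes."""
    snapshot = list(data)                      # state released at fork
    results = [None] * nthreads                # one diff slot per child
    chunks = [range(i, len(data), nthreads) for i in range(nthreads)]
    def child(tid):
        # Writes stay in a private buffer until the join.
        results[tid] = {i: fn(snapshot[i]) for i in chunks[tid]}
    threads = [threading.Thread(target=child, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                               # parent acquires children's diffs
    for diff in results:                       # disjoint ranges: no conflicts
        for i, v in diff.items():
            data[i] = v
    return data

squares = domp_parallel_for(list(range(8)), lambda x: x * x)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```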
Barrier:
At a barrier, each thread releases to each other thread, then acquires from each other thread, as in Figure 3(b). Although we view an n-thread barrier as n − 1 releases and acquires per thread, DOMP avoids this n² cost using “broadcast” release/acquire primitives, which are consistent with DC as long as each release matches a well-defined set of acquires and vice versa.

Ordering:
OpenMP’s ordered construct orders a particular code block within a loop by iteration while permitting parallelism in other parts. DOMP implements this construct using a chain of acquire/release pairs among worker threads, as shown in Figure 3(c).
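A minimal sketch of such a chain (our own Python model, not DOMP's implementation): each iteration's ordered block waits on a one-slot channel released by the previous iteration, so the ordered blocks run in iteration order however the threads are scheduled:

```python
import threading
from queue import Queue

def ordered_loop(niters, body, ordered_part):
    """Model of OpenMP 'ordered': bodies run in parallel, but each ordered
    block acquires from iteration i-1 and releases to iteration i+1."""
    tokens = [Queue(maxsize=1) for _ in range(niters + 1)]
    tokens[0].put(None)                  # iteration 0 may enter at once
    def run_iter(i):
        body(i)                          # unordered part: fully parallel
        tokens[i].get()                  # acquire: release from iteration i-1
        ordered_part(i)                  # runs strictly in iteration order
        tokens[i + 1].put(None)          # release: admit iteration i+1
    threads = [threading.Thread(target=run_iter, args=(i,)) for i in range(niters)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

order = []
ordered_loop(5, body=lambda i: None, ordered_part=order.append)
print(order)  # [0, 1, 2, 3, 4] under any thread schedule
```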
Reductions:
OpenMP’s reduction attributes and atomic constructs enable programs to accumulate sums, maxima, or bit masks efficiently across threads. OpenMP unfortunately supports reductions only on simple scalar types, leading programmers to serialize complex reductions unnecessarily via ordered or critical sections or locks. All uses of these serialization constructs in the NAS Parallel Benchmarks [21] implement reductions, for example. DOMP therefore provides a generalized reduction construct, by which a program can specify a custom reduction on pairs of variables of any matching types. DOMP accumulates each thread’s partial results in thread-private variables and reduces them at the next join.
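The semantics can be sketched in Python (the `domp_reduce` helper is our own illustration, not DOMP's syntax): each thread folds its share of the input into a thread-private partial, and the join combines the partials in a fixed thread-id order, independent of completion timing:

```python
import threading

def domp_reduce(values, combine, identity, nthreads=4):
    """Generalized reduction sketch: thread-private accumulators, combined
    at the join in thread-id order so the result is deterministic."""
    partials = [identity] * nthreads
    def child(tid):
        acc = identity                      # thread-private accumulator
        for v in values[tid::nthreads]:     # this thread's share of the input
            acc = combine(acc, v)
        partials[tid] = acc                 # released to the parent at join
    threads = [threading.Thread(target=child, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    result = identity
    for p in partials:                      # fixed combine order: deterministic
        result = combine(result, p)
    return result

total = domp_reduce(list(range(10)), lambda a, b: a + b, 0)
print(total)  # 45
```

A custom `combine` should be associative and commutative for the strided partition above; the point is that the combine order is fixed by program structure rather than by scheduling.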
Tasks:
OpenMP 3.0’s task constructs express a form of fork/join parallelism suited to dynamic work structures. Since DC rules prevent a task from seeing any writes of other tasks until it completes and synchronizes at a barrier or taskwait, DOMP eliminates OpenMP’s risk of subtle bugs if one task uses shared inputs that are freed or go out of scope in a concurrent task.

DOMP extends OpenMP with explicit task objects, with which a taskwait construct can name and synchronize with a particular task instance independently of other tasks, in order to express futures [18] or non-hierarchical dependency graphs [15] deterministically:

    omp_task mytask;
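The task-object code above is truncated in this draft; the sketch below models the intended semantics in Python (the `OmpTask` class is our own illustration, not DOMP syntax). A task's result travels over a unique release/acquire channel consumed by exactly one taskwait, which makes the task behave like a deterministic future:

```python
import threading
from queue import Queue

class OmpTask:
    """Illustrative deterministic future: the task's result travels over a
    unique release->acquire channel consumed by exactly one taskwait."""
    def __init__(self, fn, *args):
        self._chan = Queue(maxsize=1)
        self._thread = threading.Thread(
            target=lambda: self._chan.put(fn(*args)))  # release at completion
        self._thread.start()

    def taskwait(self):
        result = self._chan.get()   # acquire: the one point of visibility
        self._thread.join()
        return result

mytask = OmpTask(lambda a, b: a * b, 6, 7)
other = OmpTask(sum, [1, 2, 3])
a = other.taskwait()    # wait on a specific named task...
b = mytask.taskwait()   # ...independently of any other task
print(a, b)  # 6 42
```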
Mutual exclusion:
Unlike ordered, which specifies a particular sequential ordering, mutual exclusion facilities such as critical sections and locks imply an arbitrary, nondeterministic ordering. Mutual exclusion violates Constraint 1 in Section 2.1 because it permits multiple acquire/release pairings, as illustrated in Figure 3(d). While DOMP could emulate mutual exclusion via deterministic scheduling, we prefer to focus on developing deterministic abstractions to replace common uses of mutual exclusion, such as general reductions.
Flush:
Some OpenMP programs implement custom synchronization structures such as pipelines using the flush (memory barrier) construct in spin loops. As with mutual exclusion, DOMP omits support for such constructions, in favor of expressing dependency graphs such as pipelines deterministically using task objects.
4 Implementation

We have built an early user space prototype implementing DC with a pthreads-like fork/join API. The prototype encouragingly shows less than overhead on the coarse-grained PARSEC benchmarks [5] Blackscholes and Swaptions. Finer-grained benchmarks such as Streamcluster currently show high overheads, but many optimization opportunities remain. The rest of this section outlines key challenges and opportunities in implementing deterministic consistency, for both shared memory multithreaded programs and multiprocess systems.
Memory Access Isolation:
Since DC requires one thread’s writes to remain invisible to a second thread until the two threads synchronize, the threads must effectively execute in separate “workspaces” between synchronization events. Virtual memory and write-sharing techniques like those used to implement lazy release consistent distributed shared memory [1] should apply to DC. Memory accesses may also be isolated via instruction-level rewriting [3], possibly reducing the cost of synchronization operations at the expense of adding overhead to all ordinary memory accesses. Hardware support [12, 17] could mitigate the performance cost of isolation, but is unlikely to appear in commodity hardware unless software-based approaches first demonstrate deterministic parallelism to be viable and compelling.
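The write-sharing technique mentioned above can be sketched briefly. In TreadMarks-style distributed shared memory [1], an implementation saves a pristine "twin" of a page on its first write, then diffs the page against the twin at a release so that only the bytes this thread actually wrote are transmitted. A hypothetical Python model:

```python
def make_twin(page):
    """On the first write to a page, save a pristine copy (a "twin")."""
    return bytearray(page)

def compute_diff(twin, page):
    """At a release, diff the modified page against its twin so only the
    bytes this thread actually wrote are transmitted to the acquirer."""
    return {i: b for i, (a, b) in enumerate(zip(twin, page)) if a != b}

def apply_diff(page, diff):
    """At the matching acquire, integrate the received diff locally."""
    for i, b in diff.items():
        page[i] = b

page = bytearray(b"hello world")
twin = make_twin(page)
page[0:5] = b"HELLO"                      # this thread's private writes
diff = compute_diff(twin, page)           # only the five changed bytes
remote = bytearray(b"hello world")        # another thread's copy
apply_diff(remote, diff)
print(remote.decode())  # HELLO world
```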
Shared Resources:
Shared resources in current environments implicitly introduce nondeterminism through mutual exclusion: calling malloc() concurrently in multiple threads may yield different pointers depending on execution timing, for example, and the file descriptor number returned by a call to Unix’s open() may have similar timing dependencies on other threads’ file descriptor operations. The malloc() problem may be addressed by assigning each thread a separate virtual memory address range and allocation pool from which to satisfy malloc() requests; such an allocator may also benefit scalability. The file descriptor table problem might be addressed by using higher-level equivalents such as fopen() that do not imply mutual exclusion. These approaches do not address shared resources outside the application process, however, such as reads and writes to shared files in an external file system.
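A per-thread allocation pool of the kind suggested above can be sketched as follows (a hypothetical Python model; a real implementation would carve up the process address space). Because each thread bumps a pointer within its own disjoint range, the addresses it receives depend only on its own allocation sequence, not on cross-thread timing:

```python
import threading

class ThreadLocalArena:
    """Sketch of a race-free malloc: each thread allocates from its own
    disjoint address range, so no lock or cross-thread ordering is needed."""
    ARENA_SIZE = 1 << 20                     # hypothetical per-thread range

    def __init__(self):
        self._next = {}                      # thread id -> bump pointer

    def malloc(self, tid, size):
        base = tid * self.ARENA_SIZE         # disjoint range per thread
        off = self._next.get(tid, 0)
        self._next[tid] = off + size         # private state: no lock needed
        return base + off

arena = ThreadLocalArena()
addrs = {}
def worker(tid):
    addrs[tid] = [arena.malloc(tid, 64) for _ in range(3)]

threads = [threading.Thread(target=worker, args=(t,)) for t in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(addrs[1])  # [1048576, 1048640, 1048704], independent of timing
```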
4.2 Beyond Shared Memory

While we have focused on the intra-process shared memory abstraction, DC may also be applicable at the system level for state shared among processes. Standard operating systems, for example, commonly give all processes sequentially consistent access to a globally shared file system (though network file systems often relax consistency somewhat). This design yields the same problems of nondeterminism and heisenbugs at the inter-process level that we see within multithreaded programs: we often find that a large software source tree builds reliably under a sequential ‘make’ but fails nondeterministically under a parallel ‘make -j’ command, for example.

In place of sequential consistency, an OS might provide a deterministically consistent file system to processes, enabling a multi-process computation to run deterministically even as processes share state by reading and writing files. If a parallel make forks off two compiler instances running in parallel, for example, each compiler would execute in its own private virtual copy of the file system until completion; the system would then reconcile the .o files produced by each compiler into a single directory once both compilers complete.

There will always be shared resources “outside the reach” of any deterministic environment, whose use will introduce nondeterminism into the program: for example, I/O requests arriving at a network server from its clients. In such cases the only solution may be to accept some nondeterminism, log nondeterministic inputs to enable later replay, or avoid their use entirely.

5 Related Work

DC conceptually builds on release consistency [17] and lazy release consistency [24], which relax sequential consistency’s ordering constraints to increase the independence of parallel activities.
DC retains these independence benefits, additionally providing determinism by delaying the propagation of any thread’s writes to other threads until required by explicit synchronization.

Race detectors [16, 30] can detect certain heisenbugs, but only determinism eliminates their possibility. Language extensions can dynamically check determinism assertions in parallel code [10, 31], but heisenbugs may persist if the programmer omits an assertion. SHIM [14, 15, 33] provides a deterministic message-passing programming model, and DPJ [7, 8] enforces determinism in a parallel shared memory environment via type system constraints. While we find language-based solutions promising, parallelizing the huge body of existing sequential code will require parallel programming models compatible with existing languages.

DMP [3, 12] uses binary rewriting to execute existing parallel code deterministically, dividing threads’ execution into fixed “quanta” and synthesizing an artificial round-robin execution schedule. Since DMP is effectively a deterministic implementation of a nondeterministic programming model, slight input changes may still reveal schedule-dependent bugs. Grace [4] runs fork/join-style programs deterministically using virtual memory techniques. These systems still pursue sequential consistency as an “ideal” and rely on speculation for parallelism: if a thread reads a variable concurrently written by another, as in the “swap” example in Section 1, one thread aborts and re-executes sequentially. A partial exception is DMP-B [3], which weakens consistency within a parallel execution quantum. DC, in contrast, keeps threads fully independent between program-defined synchronization points, never requires speculation or rollback, and imposes no artificial execution schedules prone to accidental perturbation.

Replay systems can log and reproduce particular executions of conventional nondeterministic programs, for debugging [11, 26] or intrusion analysis [13, 22]. The performance and space costs of logging nondeterministic events usually make replay usable only “in the lab,” however: if a bug or intrusion manifests under deployment with logging disabled, the event may not be subsequently reproducible. In a deterministic environment, any event is reproducible provided only that the original external inputs to the computation are logged.

As with deterministic release consistency, transactional memory (TM) systems [19, 32] isolate a thread’s memory accesses from visibility to other threads except at well-defined synchronization points, namely between transaction start and commit/abort events. TM offers no deterministic ordering between transactions, however: like mutex-based synchronization, transactions guarantee only atomicity, not determinism.
6 Conclusion

Building reliable software on massively multicore processors demands a predictable, understandable programming model, a goal that may require giving up sequential consistency and mutual exclusion. Deterministic consistency provides an alternative parallel programming model as simple as “parallel assignment,” and supports existing languages and synchronization abstractions.
References

[1] Cristiana Amza et al. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, February 1996.
[2] Cyrille Artho, Klaus Havelund, and Armin Biere. High-level data races. In Workshop on Verification and Validation of Enterprise Information Systems (VVEIS), pages 82–93, April 2003.
[3] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In ASPLOS, March 2010.
[4] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe multithreaded programming for C/C++. In OOPSLA, October 2009.
[5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In PACT, October 2008.
[6] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008.
[7] Robert L. Bocchino Jr., Vikram S. Adve, Sarita V. Adve, and Marc Snir. Parallel programming must be deterministic by default. In HotPar. USENIX, March 2009.
[8] Robert L. Bocchino Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. A type and effect system for Deterministic Parallel Java. October 2009. http://dpj.cs.uiuc.edu/DPJ/Publications_files/paper_1.pdf.
[9] Per Brinch Hansen, editor. The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls. Springer-Verlag, Berlin, Germany, 2002.
[10] Jacob Burnim and Koushik Sen. Asserting and checking determinism for multithreaded programs. In ACM SIGSOFT Symposium on the Foundations of Software Engineering, August 2009.
[11] Ronald S. Curtis and Larry D. Wittie. BugNet: A debugging system for parallel programming environments. Pages 394–400, October 1982.
[12] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin. DMP: Deterministic shared memory multiprocessing. In ASPLOS, March 2009.
[13] George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In OSDI, December 2002.
[14] Stephen A. Edwards and Olivier Tardieu. SHIM: A deterministic model for heterogeneous embedded systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(8):854–867, August 2006.
[15] Stephen A. Edwards, Nalini Vasudevan, and Olivier Tardieu. Programming shared memory multiprocessors with deterministic message-passing concurrency: Compiling SHIM to Pthreads. In Design, Automation, and Test in Europe, March 2008.
[16] Dawson Engler and Ken Ashcraft. RacerX: Effective, static detection of race conditions and deadlocks. In SOSP, October 2003.
[17] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In ISCA, pages 15–26, May 1990.
[18] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, October 1985.
[19] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, pages 289–300, May 1993.
[20] C. A. R. Hoare. Towards a theory of parallel programming. In C. A. R. Hoare and R. H. Perrott, editors, Operating Systems Techniques: Proceedings of a Seminar at Queen’s University, pages 61–71, New York, New York, USA, 1972. Academic Press. Reprinted in [9], 231–244.
[21] H. Jin, M. Frumkin, and J. Yan. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical Report NAS-99-011, NASA Ames Research Center, October 1999.
[22] Ashlesha Joshi, Samuel T. King, George W. Dunlap, and Peter M. Chen. Detecting past and present intrusions through vulnerability-specific predicates. In SOSP ’05: Proceedings of the twentieth ACM symposium on Operating systems principles, pages 91–104, New York, NY, USA, 2005. ACM.
[23] Gilles Kahn. The semantics of a simple language for parallel programming. In Information Processing, pages 471–475, Amsterdam, Netherlands, 1974. North-Holland.
[24] Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Lazy release consistency for software distributed shared memory. In ISCA, pages 13–21, May 1992.
[25] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, September 1979.
[26] Thomas J. LeBlanc and John M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Transactions on Computers, C-36(4):471–482, April 1987.
[27] E. A. Lee. The problem with threads. Computer, 39(5):33–42, May 2006.
[28] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes — a comprehensive study on real world concurrency bug characteristics. In ASPLOS, pages 329–339, March 2008.
[29] Message Passing Interface Forum. MPI: A message-passing interface standard version 2.2, September 2009.
[30] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, and Gerard Basler. Finding and reproducing heisenbugs in concurrent programs. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (OSDI ’08), pages 267–280, Berkeley, California, USA, 2008. USENIX Association.
[31] Caitlin Sadowski, Stephen N. Freund, and Cormac Flanagan. SingleTrack: A dynamic determinism checker for multithreaded programs. In ESOP, March 2009.
[32] Nir Shavit and Dan Touitou. Software transactional memory. Distributed Computing, 10(2):99–116, February 1997.
[33] Olivier Tardieu and Stephen A. Edwards. Scheduling-independent threads and exceptions in SHIM. In EMSOFT, pages 142–151, October 2006.
[34] Robert N. M. Watson. Exploiting concurrency vulnerabilities in system call wrappers. In WOOT, August 2007.