BRAVO -- Biased Locking for Reader-Writer Locks
Dave Dice
Oracle Labs [email protected]
Alex Kogan
Oracle Labs [email protected]
2019-07-11 • Copyright Oracle and or its affiliates

Abstract
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Lock implementations that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator" state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs use distributed reader indicators, one per NUMA node, per core or even per thread. This improves reader-reader scalability, but also increases the size of each lock instance and creates overhead for writers.

We propose a simple transformation, BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. Readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table, and installing the lock address into the table. All locks and threads in an address space can share the same readers table. Crucially, readers of the same lock tend to write to different locations in the table, reducing coherence traffic. Therefore, BRAVO can augment a simple compact lock to provide scalable concurrent reading, but with only a modest and constant increase in footprint.

We implemented BRAVO in user-space, as well as integrated it with the Linux kernel reader-writer semaphore (rwsem). Our evaluation with numerous benchmarks and real applications, both in user and kernel-space, demonstrates that BRAVO improves performance and scalability of underlying locks in read-heavy workloads while introducing virtually no overhead, including in workloads in which writes are frequent.
CCS Concepts • Software and its engineering → Multithreading; Mutual exclusion; Concurrency control; Process synchronization

Keywords Reader-Writer Locks, Synchronization, Concurrency Control
A reader-writer lock, also known as a shared-exclusive lock, is a synchronization primitive for controlling access by multiple threads (or processes) to a shared resource (critical section). It allows shared access for read-only use of the resource, while write operations access the resource exclusively. Such locks are ubiquitous in modern systems, and can be found, for example, in database software, file systems, key-value stores and operating systems.

Reader-writer locks have to keep track of the presence of active readers before a writer can be granted the lock. In the common case, such presence is recorded in a shared counter, incremented and decremented with every acquisition and release of the lock in read mode. This is the way reader-writer locks are implemented in the Linux kernel, the POSIX pthread library and several other designs [3, 35, 42]. The use of a shared counter lends itself to a relatively simple implementation and has a compact memory representation for a lock. However, it suffers under high intensity read-dominated workloads when the "reader indicator" state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic [7, 15, 18, 31].

Alternative designs for reader-writer locks use distributed reader indicators, for instance, one per NUMA node as in cohort locks [6], or even one lock per core as in the Linux kernel brlock [10] and other related ideas [26, 31, 39, 46]. This improves reader-reader scalability, but also considerably increases the size of each lock instance. Furthermore, lock performance is hampered when writes are frequent, as multiple indicators have to be accessed and/or modified. Finally, such locks have to be instantiated dynamically, since the number of sockets or cores can be unknown until runtime.
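To make the centralized baseline concrete, the following is a minimal sketch of a counter-based reader-writer lock; the class and field names are ours, and real implementations such as pthread_rwlock add fairness and preference policies omitted here:

```python
import threading

class CounterRWLock:
    """Illustrative centralized-counter reader-writer lock (sketch only).

    Every reader increments and decrements one shared counter, so all
    readers contend on a single cache line under load, which is exactly
    the scalability problem described above.
    """
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of active readers (the shared counter)
        self._writer = False   # whether a writer currently holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1  # every reader writes this one field

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Multiple readers may hold the lock at once; a writer simply waits for the counter to drain to zero before entering.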
As a result, designers of modern reader-writer locks confront a difficult trade-off related to the scalability of maintaining the indication of the readers' presence.

In this paper, we propose a simple transformation, called BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. When applied on top of a counter-based reader-writer lock, BRAVO allows us to achieve, and often beat, the performance levels of locks that use distributed reader indicators while maintaining a compact footprint of the underlying lock. With BRAVO, readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table. Readers attempt to install the lock address into the element (slot) in the table identified by that index. If successful, readers can proceed with their critical section without modifying the shared state of the underlying lock. Otherwise, readers resort to the acquisition path of the underlying lock. Note that the visible readers table is shared by all locks and threads in an address space. Crucially, readers of the same lock tend to write to different locations in the table, reducing coherence traffic and thus resulting in a NUMA-friendly design. At the same time, a writer always uses the acquisition path of the underlying lock, but also scans the readers table and waits for all readers that acquired that lock through it.
A simple mechanism is put in place to limit the overhead of scanning the table for workloads in which writes are frequent.

We implemented BRAVO and evaluated it on top of several locks, such as the POSIX pthread_rwlock lock and the PF-Q reader-writer lock by Brandenburg and Anderson [3]. For our evaluation, we used numerous microbenchmarks as well as rocksdb [40], a popular open-source key-value store. Furthermore, we integrated BRAVO with rwsem, a read-write semaphore in the Linux kernel. We evaluated the modified kernel through kernel microbenchmarks as well as several user-space applications (from the Metis suite [33]) that create contention on read-write semaphores in the kernel. All our experiments in user-space and in the kernel demonstrate that BRAVO is highly efficient in improving performance and scalability of underlying locks in read-heavy workloads while introducing virtually no overhead, even in workloads in which writes are frequent.

The rest of the paper is organized as follows. Related work is surveyed in Section 2. We present the BRAVO algorithm in Section 3 and discuss how we apply BRAVO in the Linux kernel in Section 4. The performance evaluation in user-space and in the Linux kernel is provided in Sections 5 and 6, respectively. We conclude the paper and elaborate on multiple directions for future work in Section 7.
▶ Reader-Indicator Design
Readers that are active – currently executing in a reader critical section – must be visible to potential writers. Writers must be able to detect active readers in order to resolve read-vs-write conflicts, and wait for active readers to depart. The mechanism through which readers make themselves visible is the reader indicator. Myriad designs have been described in the literature. At one end of the spectrum we find a centralized reader indicator implemented as an integer field within each reader-writer lock instance that reflects the number of active readers. Readers use atomic instructions (or a central lock) to safely increment and decrement this field. Classic examples of such locks can be found in the early work of Mellor-Crummey and Scott [35] and more recent work by Shirako et al. [42]. Another reader-writer lock algorithm having a compact centralized reader indicator is Brandenburg and Anderson's Phase-Fair Ticket lock, designated PF-T in [3], where the reader indicator is encoded in two central fields. Their Phase-Fair Queue-based lock, PF-Q, uses a centralized counter for active readers and an MCS-like central queue, with local spinning, for readers that must wait. We refer to this latter algorithm as "BA" throughout the remainder of this paper. Such approaches are compact, having a small per-lock footprint, and simple, but, because of coherence traffic, do not scale in the presence of concurrent readers that are arriving and departing frequently [7, 15, 18, 31].

To address this concern, many designs turn toward distributed reader indicators. Each cohort reader-writer lock [6], for instance, uses a per-NUMA node reader indicator. While distributed reader indicators improve scalability, they also significantly increase the footprint of a lock instance, with each reader indicator residing on its own private cache line or sector to reduce false sharing. In addition, the size of the lock varies with the number of nodes, and is not known at compile-time, precluding simple static preallocation of locks. Writers are also burdened with the overhead of checking multiple reader indicators. Kashyap et al. [27] attempt to address some of those issues by maintaining a dynamic list of per-socket structures and expanding the lock instance on-demand. However, this only helps if a lock is accessed by threads running on a subset of nodes.

At the extreme end of the spectrum we find lock designs with reader indicators assigned per-CPU or per-thread [10, 26, 31, 39, 46]. These designs promote read-read scaling, but have a large variable-sized footprint. They also favor readers in that writers must traverse and examine all the reader indicators to resolve read-vs-write conflicts, possibly imposing a performance burden on writers.
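As an illustration of this per-slot style, the following hedged sketch (ours, not any particular kernel's brlock) lets readers touch only their own slot while a writer must sweep all of them; it admits one reader per slot at a time, which suffices when slots are per-CPU:

```python
import threading

NSLOTS = 8  # illustrative stand-in for the number of CPUs or threads

class DistributedRWLock:
    """Sketch of a brlock-style lock with per-slot reader indicators.

    A reader acquires only the mutex for its own slot, so readers on
    different slots never contend with each other. A writer must
    acquire every slot's mutex, paying a cost proportional to NSLOTS.
    """
    def __init__(self):
        self._slots = [threading.Lock() for _ in range(NSLOTS)]

    def acquire_read(self, slot_id):
        self._slots[slot_id % NSLOTS].acquire()

    def release_read(self, slot_id):
        self._slots[slot_id % NSLOTS].release()

    def acquire_write(self):
        # The writer's burden: visit and lock every reader slot.
        for m in self._slots:
            m.acquire()

    def release_write(self):
        for m in self._slots:
            m.release()
```

This captures the trade-off in the text: excellent read-read scaling, but a footprint and a writer cost that grow with the number of slots.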
We note there are a number of varieties of such distributed locks: a set of reader indicators coupled with a central mutual exclusion lock for writer permission, as found in cohort locks [6]; sets of mutexes where readers must acquire one mutex and writers must acquire all mutexes, as found in Linux kernel brlocks [10]; or sets of reader-writer locks where readers must acquire read permission on one lock, and writers must acquire write permission on all locks. To reduce the impact on writers, which must visit all reader indicators, some designs use a tree of distributed counters where the root element contains a sum of the indicators within the subtrees [30].

Dice et al. [22] devised read-write byte-locks for use in the TLRW software transactional memory infrastructure. Briefly, read-write byte-locks are reader-writer locks augmented with an array of bytes, serving as reader indicators, where indices in the array are assigned to favored threads that are frequent readers. These threads can simply set and clear these reader indicators with normal store operations. The motivation for read-write byte-locks was to avoid atomic read-modify-write instructions, which were particularly expensive on the system under test. The design, as described, is not NUMA-friendly, as the byte array occupies a single cache line.

In addition to distributing or dispersing the counters, individual counters can themselves be further split into constituent ingress and egress fields to further reduce write sharing. Arriving readers increment the ingress field and departing readers increment the egress field. Cohort reader-writer locks use this approach [6].

BRAVO takes a different approach, opportunistically representing active readers in the shared global visible readers table. The table (array) is fixed in size and shared over all threads and locks within an address space. Each BRAVO lock has, in addition to the underlying reader-writer lock, a boolean flag that indicates if reader bias is currently enabled for that lock. Publication of active readers in the array is strictly optional and best-effort. A reader can always fall back to acquiring read permission via the underlying reader-writer lock. BRAVO's benefit comes from reduced coherence traffic arising from reader arrival. Such coherence traffic is particularly costly on NUMA systems, consuming shared interconnect bandwidth and also exhibiting high latency. As such, BRAVO is naturally NUMA-friendly, but unlike most other NUMA-aware reader-writer locks, it does not need to understand or otherwise query the system topology, further simplifying the design and reducing dependencies. We note that coherence traffic caused by waiting – such as global vs. local waiting – is determined by the nature of the underlying lock.

▶ Optimistic Invisible Readers
Synchronization constructs such as seqlocks [9, 23, 29] allow concurrent readers, but forgo the need for readers to make themselves visible. Critically, readers do not write to synchronization data and thus do not induce coherence traffic. Instead, writers update state – typically a modification counter – to indicate that updates have occurred. Readers check that counter at the start and then again at the end of their critical section, and if writers were active or the counter changed, the readers self-abort and retry. An additional challenge for seqlocks is that readers can observe inconsistent state, and special care must be taken to constrain the effects and avoid errant behavior in readers. Often, non-trivial reader critical sections must be modified to safely tolerate optimistic execution. Various hybrid forms exist, such as the StampedLock [36] facility in java.util.concurrent, which consists of a reader-writer lock coupled with a seqlock, providing three modes: classic pessimistic write locking, classic pessimistic read locking, and optimistic reading.

To avoid the problem where optimistic readers might see inconsistent state, transactional lock elision [17, 19, 24, 28, 38] based on hardware transactional memory can be used. Readers are invisible and do not write to shared data. Such approaches can be helpful, but are still vulnerable to indefinite abort and progress failure. In addition, the hardware transactional memory facilities required to support lock elision are not available on all systems, and are usually best-effort, without any guaranteed progress, requiring some type of fallback to pessimistic mechanisms.

▶ Biased Locking
BRAVO draws inspiration from biased locking [14, 21, 37, 41, 44]. Briefly, biased locking allows the same thread to repeatedly acquire and release a mutual exclusion lock without requiring atomic instructions, except on the initial acquisition. If another thread attempted to acquire the lock, then expensive revocation is required to wrest bias from the original thread. The lock would then revert to normal non-biased mode for some period before again becoming potentially eligible for bias. (Conceptually, we can think of the lock as just being left in the locked state until there is contention. Subsequent lock and unlock operations by the original thread are ignored – the unlock operation is deferred until contention arises). Biased locking was a response to the CPU-local latencies incurred by atomic instructions on early Intel and SPARC processors, and to the fact that locks in Java were often dominated by a single thread. Subsequently, processor designers have addressed the latency concern, rendering biased locking less profitable.

Classic biased locking identifies a preferred thread, while BRAVO identifies a preferred access mode. That is, BRAVO biases toward a mode instead of thread identity. BRAVO is suitable for read-dominated workloads, allowing a fast-path for readers when reader bias is enabled for a lock. If a write request is issued against a reader-biased lock, reader bias is disabled and revocation (scanning of the visible readers table) is required, shifting some cost from readers to writers. Classic biased locking provides benefit by reducing the number of atomic operations and improving latency. It does not improve scalability. BRAVO reader-bias, however, can improve both latency and scalability by reducing coherence traffic on the reader indicators in the underlying reader-writer lock.

(Footnote: While BRAVO is topology oblivious, it does require a high-resolution low-latency means of reading the system clock. We further expect that reading the clock is scalable, and that concurrent readers do not interfere with each other. On systems with modern Intel CPUs and Linux kernels, the RDTSCP instruction or the clock_gettime(CLOCK_MONOTONIC) fast system call suffices.)
BRAVO transforms any existing reader-writer lock A into BRAVO-A, which provides scalable reader acquisition. We say A is the underlying lock in BRAVO-A. In typical circumstances A might be a simple compact lock that suffers under high levels of reader concurrency. BRAVO-A will also be compact, but is NUMA-friendly, as it reduces coherence traffic and offers scalability in the presence of frequently arriving concurrent readers.

Listing 1 depicts a pseudo-code implementation of the BRAVO algorithm. BRAVO extends A's structure with a new RBias boolean field (Line 2). Arriving readers first check the
RBias field, and, if found set, then hash the address ofthe lock with a value reflecting the calling thread’s identityto form an index into the visible readers table (Lines 12–13). (This readers table is shared by all locks and threads inan address space. In all our experiments we sized the tableat 4096 entries. Each table element, or slot , is either null or a pointer to a reader-writer lock instance). The readerthen uses an atomic compare-and-swap (CAS) operator toattempt to change the element at that index from null tothe address of the lock, publishing its existence to potentialwriters (Line 14). If the CAS is successful then the reader2019-07-11 • Copyright Oracle and or its affiliates ave Dice and Alex Kogan rechecks the
RBias field to ensure it remains set (Line 18). If so, the reader has successfully gained read permission and can enter the critical section (Line 19). Upon completing the critical section the reader executes the complementary operation to release read permission, simply storing null into that slot (Lines 29–31). We refer to this as the fast-path. The fast-path attempt prefix (Lines 11–23) runs in constant time. Our hash function is based on the Mix32 operator found in [43]. If the recheck operation above happens to fail, as would be the case if a writer intervened and cleared RBias and the reader lost the race, then the reader simply clears the slot (Line 21) and reverts to the traditional slow-path, where it acquires read permission via the underlying lock (Line 24). Similarly, if the initial check of RBias found the flag clear (Line 12), or the CAS failed because of collisions in the array (Line 14) – the slot was found to be populated – then control diverts to the traditional slow-path. After a slow-path reader acquires read permission from the underlying lock, it enters and executes the critical section, and then at unlock time releases read permission via the underlying lock (Line 33).

Arriving writers first acquire write permission on the underlying reader-writer lock (Line 36). Having done so, they then check the RBias flag (Line 37). If set, the writer must perform revocation, first clearing the RBias flag (Line 40) and then scanning all the elements of the visible readers table, checking for conflicting fast-path readers (Lines 42–44). If any elements match the lock, the writer must wait for that fast-path reader to depart and clear the slot. If lock L has 2 fast-path active readers, for instance, then L will appear twice in the array. Scanning the array might appear to be onerous, but in practice the sequential scan is assisted by the automatic hardware prefetchers present in modern CPUs. We observe a scan rate of about 1.1 nanoseconds per element on our system-under-test (described later). Having checked RBias and performed revocation if necessary, the writer then enters the critical section (Line 50). At unlock-time, the writer simply releases write permission on the underlying reader-writer lock (Line 51). Therefore the only difference for writers under BRAVO is the requirement to check, and potentially revoke, reader bias if RBias was found set.
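The fast-path index computation can be sketched as follows. This is an illustrative sketch only: the mixing constants shown are the standard Murmur3 32-bit finalizer, assumed here as a stand-in for the Mix32 operator of [43], and the way thread identity is combined with the lock address is our own choice, not necessarily the paper's.

```python
TABLE_SIZE = 4096  # power-of-two visible readers table, shared process-wide

def mix32(x: int) -> int:
    """Murmur3-style 32-bit finalizer: diffuses all input bits."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def reader_slot(thread_id: int, lock_addr: int) -> int:
    """Hash the thread's identity with the lock address to pick a slot.
    Distinct threads reading the same lock tend to land on distinct
    slots, diffusing reader updates across the table."""
    return mix32(thread_id ^ lock_addr) & (TABLE_SIZE - 1)
```

Because the function is deterministic, a given thread re-acquiring the same lock maps to the same slot, which gives fast-path readers temporal locality in the table.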
Amortized scan rate
We note that writers only scan the visible readers table, and never write into it. Yet, this scan may pollute the writer's cache. One way to cope with that is to use non-temporal loads; however, exploring this idea is left for future work. Note that revocation is only required on transitions from reading to writing, and only when RBias was previously set.

In summary, active readers can make their existence public in one of two ways: either via the visible readers table (fast-path), or via the traditional underlying reader-writer lock (slow-path). Our mechanism allows both slow-path and fast-path readers simultaneously. Absent hash collisions, concurrent fast-path readers will write to different locations in the visible readers table. Collisions are benign, and impact performance but not correctness. Writers resolve read-vs-write conflicts against fast-path readers via the visible readers table, and against slow-path readers via the underlying reader-writer lock.

 1  class BRAVOLock :
 2    int RBias
 3    Time InhibitUntil
 4    ReaderWriterLock Underlying
 5
 6  ## Visible readers table, shared by all
 7  ## locks and threads in the address space
 8  BRAVOLock * VisibleReaders [TableSize]
 9
10  def Reader (BRAVOLock * L) :
11    auto slot = null
12    if L.RBias :
13      slot = &VisibleReaders [Hash (Self, L)]
14      if CAS (slot, null, L) == null :
15        ## our entry is now visible to writers ;
16        ## recheck that bias is still in force to
17        ## resolve the race with concurrent writers
18        if L.RBias :
19          goto EnterCS        ## fast-path success
20        ## writer intervened : clear the slot and
21        *slot = null          ## revert to slow-path
22        slot = null
23    ## slow-path
24    AcquireRead (L.Underlying)
25    if L.RBias == 0 and Time() >= L.InhibitUntil :
26      L.RBias = 1
27   EnterCS :
28    ReaderCriticalSection()
29    if slot != null :
30      assert *slot == L
31      *slot = null            ## fast-path unlock
32    else :
33      ReleaseRead (L.Underlying)
34
35  def Writer (BRAVOLock * L) :
36    AcquireWrite (L.Underlying)
37    if L.RBias :
38      ## revocation : clear bias, then wait for
39      ## any conflicting fast-path readers
40      L.RBias = 0
41      auto start = Time()
42      for i in xrange (TableSize) :
43        while VisibleReaders[i] == L :
44          Pause()
45      auto now = Time()
46      ## inhibit re-biasing for N times the
47      ## measured revocation latency
48      L.InhibitUntil = now + ((now - start) * N)
49
50    WriterCriticalSection()
51    ReleaseWrite (L.Underlying)

Listing 1. Simplified Python-like implementation of BRAVO
BRAVO provides a dual existence representation for active readers, with their existence reflected in either the array or the underlying lock.
In our early prototypes we set RBias in the reader slow-path based on a low-cost Bernoulli trial with small fixed probability P, using a thread-local Marsaglia XOR-Shift [34] pseudo-random number generator. While this simplistic policy for enabling bias worked well in practice, we were concerned about situations where we might have enabled bias too eagerly, and incurred revocation so frequently that BRAVO-A might be slower than A. Specifically, the worst-case scenario would be where slow readers repeatedly set RBias, only to have it revoked immediately by a writer.

The key additional cost in BRAVO is the revocation step, which executes under the underlying write lock and thus serializes operations associated with the lock. As such, we measure the latency of revocation, multiply that period by N, a configurable parameter, and then inhibit the subsequent setting of bias in the reader slow-path for that period, bounding the worst-case expected slow-down from BRAVO for writers to 1/(N + 1) (cf. Lines 41–49). Our specific performance goal is primum non nocere – first, do no harm – with BRAVO-A never underperforming A by any significant margin on any workload. This tactic is simple and effective, but excessively conservative, taking into account only the worst-case performance penalty imposed by BRAVO, and not accounting for any potential benefit conferred by the BRAVO fast-path. Furthermore, measuring the revocation duration also incorporates the waiting time, as well as the scanning time, yielding a conservative over-estimate of the revocation scan cost and resulting in less aggressive use of reader bias. Despite these concerns, we find this policy yields good and predictable performance. We used the same fixed value of N for all benchmarks in this paper. The inhibit deadline is recorded in InhibitUntil (Line 3), which reflects the earliest time at which slow readers should re-enable bias. We note that for safety, readers can only set RBias while they hold read permission on the underlying reader-writer lock, avoiding interactions with writers (cf. Lines 25–26).

In our implementations revoking writers busy-wait for readers to depart. There can be at most one such busy-waiting thread for a given lock at any given time. We note, however, that it is trivial to shift to a waiting policy that uses blocking.
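The inhibition arithmetic above can be made concrete with a small sketch. The value N = 9 below is purely illustrative (the paper fixes one value of N for all benchmarks but the sketch does not depend on which): if revocation took R time units, re-biasing is inhibited for N*R units, so revocation can consume at most R out of every R + N*R units, a worst-case writer slowdown of 1/(N + 1).

```python
N = 9  # illustrative inhibit multiplier; larger N = more conservative

def inhibit_until(start: float, now: float, n: int = N) -> float:
    """Earliest time at which slow-path readers may re-enable bias:
    'now' plus n times the measured revocation latency (now - start)."""
    return now + (now - start) * n

def worst_case_slowdown(n: int) -> float:
    """Upper bound on the fraction of writer time spent in revocation:
    R units of revocation per R + n*R units of inhibited time."""
    return 1.0 / (n + 1)
```

For example, a revocation that took 4 time units ending at t = 104 inhibits re-biasing until t = 140, and with n = 9 the worst-case writer slowdown is bounded by 10%.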
Additional costs associated with BRAVO include futile atomic operations from collisions, and sharing or false-sharing arising from near-collisions in the table. Our simplified cost model ignores these secondary factors. We note that the odds of collision are equivalent to those given by the “Birthday Paradox” [48], and that the general problem of deciding to set bias is equivalent to the classic “ski-rental” problem [47]. Our approach conservatively forgoes the potential of better performance afforded by the aggressive use of reader bias in order to limit the possibility of worsened performance [49]. We observe that it is trivial to collapse RBias and InhibitUntil into just a single field. For clarity, we did not do so in our implementation.
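The Birthday-Paradox analogy above is easy to quantify. The sketch below computes the probability that at least two of k concurrently active fast-path readers collide in an s-slot table, assuming uniformly distributed hash indices; the table size 4096 matches the size used later in the paper.

```python
def collision_probability(k: int, s: int = 4096) -> float:
    """Birthday-Paradox odds: probability that at least two of k
    readers hash to the same slot in an s-entry table, assuming
    uniform, independent slot choices."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (s - i) / s  # i-th reader avoids i occupied slots
    return 1.0 - p_distinct
```

With 72 active readers (one per logical CPU on the user-space system used later) and 4096 slots, some collision occurs with probability of roughly 0.45; as the text notes, such collisions degrade only performance, never correctness.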
BRAVO acts as an accelerator layer, as readers can always fall back to the traditional underlying lock to gain read access. The benefit arises from avoiding coherence traffic on the centralized reader indicators in the underlying lock, and instead relying on updates to be diffused over the visible readers table. Fast-path readers write only into the visible readers table, and not the lock instance proper. This access pattern improves performance on NUMA systems, where write sharing is particularly expensive. We note that if the underlying lock algorithm A has reader preference or writer preference, then BRAVO-A will exhibit that same property. Write performance and the scalability of read-vs-write and write-vs-write behavior depend solely on the underlying lock. Under high write intensity, with write-vs-write and write-vs-read conflicts, the performance of BRAVO devolves to that of the underlying lock. BRAVO accelerates reads only. BRAVO fully supports the case where a thread holds multiple locks at the same time.
Holding multiple locks increases table occupancy and the odds of collision, but this is rare in practice; with respect to table occupancy and collision rates, holding multiple locks is equivalent to increasing concurrency.
BRAVO supports try-lock operations as follows. For read try-lock attempts, an implementation could try the BRAVO fast path and then fall back, if the fast path fails, to the underlying slow-path try-lock. An implementation can also opt to forgo the fast-path attempt and simply call the underlying try-lock operator. We use the former approach when applying BRAVO in the Linux kernel, as detailed in the next section. We note that if the underlying try-lock call is successful, one may set RBias if the BRAVO policy allows it (e.g., if the current time is larger than InhibitUntil). For write try-lock operators, an implementation will invoke the underlying try-lock operation. If successful, and bias is set, then revocation must be performed following the same procedure described in Lines 37–49.

As seen in Listing 1, the slot value must be passed from the read lock operator to the corresponding unlock. Null indicates that the slow path was used to acquire read permission. To provide correct errno error return values in the POSIX pthread environment, a thread must be able to determine if it holds read, write, or no permission on a given lock. This is typically accomplished by using per-thread lists of locks currently held in read mode. We leverage those list elements to pass the slot. We note that the Cohort read-write lock implementation [6] passed the reader's NUMA node ID from lock to corresponding unlock in this exact fashion.
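The read try-lock policy described above (fast path first, then the underlying try-lock) can be sketched in Python as follows. This is a sketch under stated assumptions: `ToyUnderlying`, `ToyBravoLock`, and the method name `try_acquire_read` are hypothetical stand-ins, and a real implementation would publish the slot with a CAS rather than the plain check-then-store shown here.

```python
import time

TABLE_SIZE = 4096
visible_readers = [None] * TABLE_SIZE  # shared table, as in Listing 1

def read_trylock(lock, slot_index):
    """Sketch of a BRAVO read try-lock: attempt the fast path, then
    fall back to the underlying lock's try-lock.  Returns ("fast", slot)
    or ("slow", None) on success, None on failure."""
    if lock.RBias and visible_readers[slot_index] is None:
        visible_readers[slot_index] = lock      # publish (CAS in real code)
        if lock.RBias:                          # recheck bias
            return ("fast", slot_index)
        visible_readers[slot_index] = None      # writer intervened
    if lock.underlying.try_acquire_read():
        # per the text: on slow-path success, re-enable bias if allowed
        if not lock.RBias and time.time() >= lock.InhibitUntil:
            lock.RBias = True
        return ("slow", None)
    return None                                 # underlying try-lock failed

class ToyUnderlying:
    """Toy stand-in for the underlying reader-writer lock."""
    def __init__(self):
        self.readers = 0
    def try_acquire_read(self):
        self.readers += 1
        return True  # toy: read acquisition always succeeds

class ToyBravoLock:
    def __init__(self):
        self.RBias = True
        self.InhibitUntil = 0.0
        self.underlying = ToyUnderlying()
```

Note how the result encodes which path succeeded: exactly the slot-passing obligation discussed above, since a fast-path unlock must clear the table slot while a slow-path unlock must release the underlying lock.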
In this section, we describe a prototype integration of BRAVO in the Linux kernel, where we apply it to rwsem. Rwsem is a read-write semaphore construct. Among many places inside the kernel, it is used to protect access to the virtual memory area (VMA) structure of each process [11], which makes it a source of contention for data intensive applications [8, 11].

2019-07-11 • Copyright Oracle and/or its affiliates. Dave Dice and Alex Kogan
On a high level, rwsem consists of a counter and a waiting queue protected by a spin-lock. The counter keeps track of the number of active readers, as well as encodes the presence of a writer. To acquire the rwsem in the read mode, a reader atomically increments the counter and checks its value. If a (waiting or active) writer is not present, the read acquisition is successful; otherwise, the reader acquires the spin-lock protecting the waiting queue, joins the queue at the tail, releases the spin-lock and blocks, waiting for a wake-up signal from a writer. As a result, when there is no reader-writer contention, the read acquisition boils down to one atomic counter increment. On architectures that do not support an atomic increment instruction, this requires acquisition (and subsequent release) of the spin-lock. Even on architectures that have such an instruction (such as Intel x86), the read acquisition of rwsem creates contention over the cache line hosting the counter.

In our integration of BRAVO on top of rwsem, we make a simplifying assumption that the semaphore is always released by the same thread that acquired it for read. This is not guaranteed by the API of rwsem; however, this is a common way of using semaphores in the kernel. This assumption allows us to preserve the existing rwsem API and limits the scope of changes required, resulting in a patch of only three files and adding just a few dozen lines of code. We use this assumption when determining the slot into which a thread would store the semaphore address on the fast acquisition path, and clear that slot during the release operation. (We determine the slot by hashing the task struct pointer (current) with the address of the semaphore.)

While we have not observed any issue when running and evaluating the modified kernel, we note that our assumption can be easily eliminated by, for example, extending the API of rwsem to allow an additional pointer argument for the read acquisition and release functions. In case the acquisition is made on the fast path, this pointer would be used to store the address of the corresponding slot; later, this pointer can be passed to a (different) releasing thread to specify the slot to be cleared. Alternatively, we can extend the API of rwsem to include a flag explicitly allowing the use of the fast path for read acquisition and release. This flag would be set only in call sites known for high read contention (such as in functions operating on VMAs), where a thread that releases the semaphore is known to be the one that acquired it. Other call sites for semaphore acquisition and release can be left untouched, letting them use the slow path only.

We note that the default configuration of the kernel enables a so-called spin-on-owner optimization of rwsem [32]. With this optimization, the rwsem structure includes an owner field that contains a pointer to the task struct of the owner task when rwsem is acquired for write. Using this field, a reader may check whether the writer is currently running on a CPU, and if so, spin rather than block [32].
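The stock read-acquisition path described at the start of this section can be modeled with a toy sketch. This is a much-simplified model under stated assumptions: the real rwsem packs the reader count and writer bits into one atomically-updated word and uses kernel wait queues, whereas here a Python lock stands in for the atomics and the bit layout is illustrative.

```python
import threading

WRITER_PRESENT = 1 << 31  # illustrative bit; the real rwsem encoding differs

class ToyRwsem:
    """Toy model of rwsem read acquisition: a counter tracking active
    readers, with a flag bit encoding the presence of a writer."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for atomic RMW ops
        self.count = 0
        self.wait_queue = []

    def down_read(self):
        with self._lock:
            self.count += 1             # the single contended update:
                                        # every reader stores to this line
            if not (self.count & WRITER_PRESENT):
                return True             # uncontended: one increment suffices
            self.count -= 1             # writer present: undo and queue
            self.wait_queue.append(threading.current_thread())
        return False                    # caller would block here

    def up_read(self):
        with self._lock:
            self.count -= 1
```

Even in this toy form, the scalability problem is visible: every reader, contended or not, performs a store to the shared `count` line, which is exactly the coherence traffic that the BRAVO fast path sidesteps.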
While writers do not use this field to decide whether they have to spin (as there might be multiple readers), in the current rwsem implementation a reader updates the owner field regardless, storing there its current pointer along with a few control bits (that specify that the lock is owned by a reader). These writes by readers are for debugging purposes only, yet they create unnecessary contention on the owner field. We fix that by letting a reader set only the control bits in the owner field, and only if those bits were not set before, i.e., when the first reader acquires that rwsem instance after a writer. Note that all subsequent readers would read, but not update, the owner field, until it is updated again by a writer.
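The owner-field fix just described might look like the following sketch. The bit name and value are assumptions (loosely modeled on the kernel's reader-owned control bit), and the `owner_stores` counter is instrumentation added here to make the point observable; the essence is the read-before-write, so that only the first reader after a writer performs a store.

```python
RWSEM_READER_OWNED = 0x1  # illustrative control bit; kernel encoding differs

class ToySem:
    def __init__(self):
        self.owner = 0          # writer task pointer, or reader control bits
        self.owner_stores = 0   # instrumentation: count reader stores

def reader_set_owner(sem):
    """On read acquisition, set the reader-owned control bits only if
    they are not already set: subsequent readers merely read `owner`,
    avoiding contention on its cache line."""
    if not (sem.owner & RWSEM_READER_OWNED):
        sem.owner = RWSEM_READER_OWNED
        sem.owner_stores += 1
```

Five consecutive readers thus perform a single store to `owner`; only after a writer overwrites the field does the next reader store again.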
All user-space data was collected on an Oracle X5-2 system. The system has 2 sockets, each populated with an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 72 logical CPUs in total. The system was running Ubuntu 18.04 with a stock Linux version 4.15 kernel, and all software was compiled using the provided GCC version 7.3 toolchain at optimization level “-O3”. 64-bit code was used for all experiments. Factory-provided system defaults were used in all cases, and Turbo mode [45] was left enabled. In all cases default free-range unbound threads were used.

We implemented all locks within LD_PRELOAD interposition libraries that expose the standard POSIX pthread_rwlock_t programming interface. This allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses reader-writer locks. This same framework was used to implement Cohort reader-writer locks [6].

In the following figures, “BA” refers to the Brandenburg-Anderson PF-Q lock [3]; “Cohort-RW” refers to the C-RW-WP lock [6]; “Per-CPU” reflects a lock that consists of an array of BA locks, one for each CPU, where readers acquire read-permission on the sub-lock associated with their CPU, and writers acquire write-permission on all the sub-locks (this lock is inspired by the Linux kernel brlock construct [10]); “BRAVO-BA” reflects BRAVO implemented on top of BA; “pthread” is the default Linux POSIX pthread_rwlock read-write lock mechanism; and “BRAVO-pthread” is BRAVO implemented on top of pthread_rwlock.
Arrival cost vs waiting cost
We also took data on the Brandenburg-Anderson PF-T lock and the BRAVO form thereof. PF-T implements the reader indicator via a central pair of counters, one incremented by arriving readers and the other incremented by departing readers. Waiting readers busy-wait on a dedicated writer-present bit encoded in the reader arrival counter. In PF-Q, active readers are tallied on a central pair of counters in the same fashion as PF-T, but waiting readers enqueue on an MCS-like queue. In both PF-T and PF-Q, arriving readers update the central reader indicator state, generating more coherence traffic than would be the case for locks that use distributed reader indicators or BRAVO. Waiting (blocked) readers in PF-T use global spinning, while waiting readers in PF-Q use local spinning on a thread-local field in the enqueued element. PF-T enjoys slightly shorter code paths but also suffers from lessened scalability because of the global spinning. We found that PF-T and PF-Q offer broadly similar performance, with PF-T having a slight advantage when the arrival rate is high, the number of waiting threads is low, and the waiting period is shorter. PF-T is slightly more compact, having just 4 integer fields, while PF-Q has 2 such fields and 4 pointers. For brevity, we do not include PF-T results. We also found that the “fair lock with local only spinning” by Mellor-Crummey and Scott [35] yielded performance similar to or slower than that of PF-Q.

We note that the default pthread read-write lock implementation found in our Linux distribution provides strong reader preference, and admits indefinite writer starvation. The reader indicator is centralized and the lock has a footprint of 56 bytes for 64-bit programs. Waiting threads block immediately in the kernel without spinning.
While this policy incurs overheads associated with voluntary context switching, it may also yield benefits by allowing “polite” waiting, enabling Turbo mode for those threads making progress. Except where otherwise noted, we plot the number of concurrent threads on the X-axis, and aggregate throughput on the Y-axis, reporting the median of 7 independent runs for each data point.

We use a 128 byte sector size on Intel processors for alignment to avoid false sharing. The unit of coherence is 64 bytes throughout the cache hierarchy, but 128 bytes is required because of the adjacent cache line prefetch facility, where pairs of lines are automatically fetched together. BA requires just 128 bytes – 2 32-bit integer fields plus 4 pointer fields, with the overall size rounded up to the next sector boundary. BRAVO-BA adds the 8-byte InhibitUntil field, which contains a timestamp, and the 4-byte RBias field. Rounding up to the sector size, this still yields a 128 byte lock instance. Per-CPU consists of one instance of BA for each logical CPU, yielding a lock size of 9216 bytes on our 72-way system. Cohort-RW consists of one reader indicator (128 bytes) per NUMA node, a central location for state (128 bytes) and a full cohort mutex [20] to provide writer exclusion. In turn, the cohort mutex requires one 128-byte sub-lock per NUMA node, and another 256 bytes for central state, for a total of 896 bytes. (While our implementation did not do so, we note that a more space-aggressive implementation of Cohort-RW could colocate the per-node reader indicators with the mutex sub-locks, and the central state for the read-write lock with its associated cohort mutex, yielding a size of 512 bytes.)

The pthread implementation allows writer preference to be selected via a non-portable API. Unfortunately, in the Linux distribution we used this feature has bugs that result in lost wakeups and hangs: https://sourceware.org/bugzilla/show_bug.cgi?id=23861

As noted above, the size of the pthread read-write lock is 56 bytes, and the BRAVO variant adds 12 bytes. The sizes of BA, BRAVO-BA, pthread, and BRAVO-pthread are fixed and known at compile-time, while the size of Per-CPU varies with the number of logical CPUs, and the size of Cohort-RW varies with the number of NUMA nodes. Finally, we observe that BRAVO allows a more relaxed approach toward the alignment and padding of the underlying lock. Since fast-path readers do not mutate the underlying lock fields, the designer can reasonably forgo alignment and padding on that lock without trading off reader scalability.

The size of the lock can be important in concurrent data structures, such as linked lists or binary search trees, that use a lock per node or entry [4, 12, 25]. As Bronson et al. observe, when a scalable lock is striped across multiple cache lines to avoid contention in the coherence fabric, it is “prohibitively expensive to store a separate lock per node” [4].

BRAVO also requires the visible readers table. With 4096 entries on a system with 64-bit pointers, the additional footprint is 32KB. The table is aligned and sized to minimize the number of underlying pages (reducing TLB footprint) and to eliminate false sharing from variables that might be placed adjacent to the table. We selected a table size of 4096 empirically, but in general believe the size should be a function of the number of logical CPUs in the system. Similar tables in the Linux kernel, such as the futex hash table, are sized in this fashion [5]. Ideally, we would place the visible readers array on large pages to reduce TLB pressure.
BRAVO yields a favorable trade-off between space and scalability, offering a viable alternative that resides on the design spectrum between classic centralized locks, such as BA, which have small footprint but poor reader scalability, and large locks with high reader scalability.
As the visible readers array is shared over all locks and threads within an address space, one potential concern is collisions and near collisions that might arise when multiple threads are using a large set of locks. Near collisions are also of concern as they can cause false sharing within the array. To determine BRAVO's performance sensitivity to such effects, we implemented a microbenchmark program that spawns 64 concurrent threads. Each thread loops as follows: randomly pick a reader-writer lock from a pool of such locks; acquire that lock for read; advance a thread-local pseudo-random number generator 20 steps; release read permission on the lock; and finally advance that random number generator 100 steps. At the end of a 10 second measurement interval we report the number of lock acquisitions. No locks are ever acquired with write permission. Each data point is the median of 7 distinct runs. We report the results in Figure 1, where the X-axis reflects the number of locks in the pool (varying through powers-of-two between 1 and 8192) and the Y-axis is the number of acquisitions completed by BRAVO-BA divided by the number completed by a specialized version of BRAVO-BA where each lock instance has a private array of 4096 elements. This fraction reflects the performance drop attributable to inter-lock conflicts and near conflicts in the shared array, where the modified form of BRAVO-BA can be seen as an idealized form that has a large per-instance footprint but which is immune to inter-lock conflicts. The worst-case penalty arising from inter-thread interference (the lowest fraction value) is always under 6%.

See https://blog.stgolabs.net/2014/01/futexes-and-hash-table-collisions.html regarding futexes and hash table collisions.

Figure 1. Inter-Lock Interference (X-axis: locks; Y-axis: throughput fraction)
(We note that as we increase the number of locks, cache pressure constitutes a confounding factor for the specialized version of BRAVO-BA.)

Figure 2 shows the results of our alternator benchmark. The benchmark spawns the specified number of concurrent threads, which organize themselves into a logical ring, each waiting for notification from its “left” sibling. Notification is accomplished by setting a thread-specific variable via a store instruction, and waiting is via simple busy-waiting. Once notified, the thread acquires and then immediately releases read permission on a shared reader-writer lock. Next, the thread notifies its “right” sibling and then again waits. There are no writers, and there is no concurrency between readers. At most one reader is active at any given moment. At the end of a 10 second measurement interval the program reports the number of notifications.

The BA lock suffers as the lines underlying the reader indicators “slosh” and migrate from cache to cache. In contrast, BRAVO-BA readers touch different locations in the visible readers table as they acquire and release read permissions. BRAVO enables reader-bias early in the run, and it remains set for the duration of the measurement interval. All locks experience a significant performance drop between 1 and 2 threads due to the impact of coherent communication for notification. Crucially, we see that BRAVO-BA outperforms the underlying BA by a wide margin, and is competitive with the much larger Per-CPU lock. In addition, the performance of BA can be seen to degrade as we add threads, whereas the performance of BRAVO-BA remains stable. The same observations hold for the BRAVO-pthread and pthread locks.

Since the hash function that associates a read locking request with an index is deterministic, threads repeatedly locking and unlocking a specific lock will enjoy temporal locality and reuse in the visible readers table.

Figure 2. Alternator (X-axis: threads; Y-axis: aggregate throughput, M steps/sec; series: Cohort-RW, Per-CPU, BA, BRAVO-BA, pthread, BRAVO-pthread)
We next report results from the test_rwlock benchmark described by Desnoyers et al. [13]. The benchmark was designed to evaluate the performance and scalability of reader-writer locks against the RCU (Read-Copy Update) synchronization mechanism. We used the following command-line: test_rwlock T 1 10 -c 10 -e 10 -d 1000. The benchmark launches 1 fixed-role writer thread and T fixed-role reader threads for a 10 second measurement interval. The writer loops as follows: acquire a central reader-writer lock instance; execute 10 units of work, which entails counting down a local variable; release writer permission; execute a non-critical section for 1000 work units. Readers loop, acquiring the central lock for reading, executing 10 steps of work in the critical section, and then releasing the lock. (The benchmark has no facilities to allow a non-trivial critical section for readers.) At the end of the measurement interval the benchmark reports the sum of iterations completed by all the threads.

As we can see in Figure 3, BRAVO-BA significantly outperforms BA, and even the Cohort-RW lock at higher thread counts. Since the workload is extremely read-dominated, the Per-CPU lock yields the best performance, albeit with a very large footprint, and only because of the relatively low write rate. For that same reason, and due to its default reader preference, BRAVO-pthread easily beats pthread, and comes close to the performance level of Per-CPU.

The benchmark was obtained from https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_rwlock.c and modified slightly to allow a fixed measurement interval.

Figure 3. test_rwlock (X-axis: threads; Y-axis: aggregate throughput, ops/msec)
Using RWBench – modeled on a benchmark of the same name described by Calciu et al. [6] – we evaluated the reader-writer lock algorithms over a variety of read-write ratios, ranging from write-intensive in Figure 4a (9 out of every 10 operations are writes) to read-intensive in Figure 4f (1 out of every 10000 operations are writes), demonstrating that BRAVO inflicts no harm for write-intensive workloads, but improves performance for more read-dominated workloads.
RWBench launches T concurrent threads for a 10 second measurement interval. Each thread loops as follows: using a thread-local pseudo-random generator, decide to write with probability P via a Bernoulli trial; writers acquire a central reader-writer lock for write permission and then execute 10 steps of a thread-local C++ std::mt19937 random number generator and then release write permission, while readers do the same, but under read permission; execute a non-critical section of N steps of the same random-number generator, where N is a uniformly distributed random number with average and median of 100. At the end of the measurement interval the benchmark reports the total number of top-level loops completed.

In Figure 4a we see poor scalability over all the locks by virtue of the highly serialized write-heavy nature of the workload. Per-CPU fares poorly as writes, which are common, need to scan the array of per-CPU sub-locks. Cohort-RW provides some benefit, while BRAVO-BA (BRAVO-pthread) tracks closely to BA (pthread, respectively), providing neither benefit nor harm. The same behavior plays out in Figure 4b (P = 1/2) and Figure 4c (P = 1/10), while in the increasingly read-dominated configurations of Figure 4d (P = 1/100), Figure 4e (P = 1/1000) and Figure 4f (P = 1/10000) the BRAVO variants pull ahead of their underlying locks.

rocksdb readwhilewriting

We next explore performance sensitivity to reader-writer lock behavior in the rocksdb database [40]. We observed high frequency reader traffic arising from calls in ::Get() to GetLock() in db/memtable.cc in the readwhilewriting benchmark. In Figure 5 we see that the performance of BRAVO-BA and BRAVO-pthread tracks that of Per-CPU and always exceeds that of Cohort-RW and the respective underlying locks.

rocksdb hash_table_bench

Rocksdb also provides a benchmark to stress the hash table used by their persistent cache. The benchmark implements a central shared hash table as a C++ std::unordered_map protected by a reader-writer lock. The cache is pre-populated before the measurement interval. At the end of the 50 second measurement interval the benchmark reports the aggregate operation rate – reads, erases, insertions – per millisecond. A single dedicated thread loops, erasing random elements, and another dedicated thread loops inserting new elements with a random key. Both erase and insertion operations require write access. The benchmark launches T reader threads, which loop, running lookups on randomly selected keys. We vary T on the X-axis. All the threads execute operations back-to-back without a delay between operations. The benchmark makes frequent use of malloc-free operations in the std::unordered_map. The default malloc allocator fails to fully scale in this environment and masks any benefit conferred by improved reader-writer locks, so we instead used the index-aware allocator by Afek et al. [1]. The results are shown in Figure 6. Once again, BRAVO enhances the performance of the underlying locks, and shows substantial speedup at high thread counts.

All kernel-space data was collected on an Oracle X5-4 system. The system has 4 sockets, each populated with an Intel Xeon CPU E7-8895 v3 running at 2.60GHz.
We used rocksdb version 5.13.4 with the following command line: db_bench --threads=T --benchmarks=readwhilewriting --memtablerep=cuckoo --duration=100 --inplace_update_support=1 --allow_concurrent_memtable_write=0 --num=10000 --inplace_update_num_locks=1 --histogram --stats_interval=10000000

The hash table benchmark (https://github.com/facebook/rocksdb/blob/master/utilities/persistent_cache/hash_table_bench.cc) was run with the following command-line: hash_table_bench -nread_thread=T -nsec=50
Figure 4. RWBench (X-axis: threads; Y-axis: aggregate throughput, ops/msec): (a) 90% writes (9/10); (b) 50% writes (1/2); (c) 10% writes (1/10); (d) 1% writes (1/100); (e) 0.1% writes (1/1000); (f) 0.01% writes (1/10000)
Figure 5. rocksdb readwhilewriting (X-axis: threads; Y-axis: aggregate throughput, M ops/sec)

Figure 6. rocksdb hash_table_bench with std::unordered_map (X-axis: threads; Y-axis: aggregate throughput, ops/msec)
Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 144 logical CPUs in total. The patch was applied on top of a recent Linux version 4.20.rc4 kernel, which we refer to as stock. Factory-provided system defaults were used in all cases. In particular, we compiled the kernel in the default configuration, which notably disables the lock performance data collection mechanism (aka lockstat) built into the kernel for debugging lock performance. As we mention below, this mechanism was useful to gain insights into the usage patterns of kernel locks under various applications. However, we kept it disabled during performance measurements as it adds a probing effect by generating stores into shared variables, e.g., by keeping track of the last CPU on which a given lock instance, rwsem included, was acquired. These stores hamper the benefit of techniques like BRAVO that aim to reduce updates to shared state during lock acquisition and release operations.

Each experiment is repeated 7 times, and the reported numbers are the average of the corresponding results. Unless noted, the reported results were relatively stable, with variance of less than 5% from the average in most cases. In the following, we refer to the kernel version modified to use BRAVO simply as BRAVO.

locktorture is a loadable kernel module distributed with the kernel. As its name suggests, locktorture contains a set of microbenchmarks for evaluating the performance of various synchronization constructs in the kernel, including rwsem. It allows specifying the number of readers and writers that repeatedly acquire the rwsem in the corresponding mode, and hold it (aka run in a critical section) for some amount of time. The typical length of the critical section is 50ms for readers and 10ms for writers. Occasionally, a long delay is introduced (according to a comment in the source code, “to force massive contention”), which means a critical section of 200ms for a reader and of 1000ms for a writer.
We notethat the probability for those delays is small, yet it dependson the number of threads. Therefore, the average length ofthe critical section is not the same across all thread counts,and thus the reported results do not necessarily measurescalability.Figure 7 presents results for the experiment in which wevary the number of readers and set the number of writers to 1.We run the experiment for 30 seconds, and report separatelythe total number of read and write acquisitions. For lowthread counts, the number of read operations on the stockand the BRAVO versions increases linearly in the number ofthreads. Once the number of threads increases, the scalabilityof the stock is hampered. The BRAVO version continues toscale perfectly across all thread counts.When considering the number of write acquisitions, wenote that the stock version has a better result. This can beattributed to the nature of the benchmark, where a writerrepeatedly acquires and releases the rwsem in the write mode.In the BRAVO version, each such acquisition is likely to gothrough the revocation of the fast path, where a writer waitsfor readers, which in this case have a relatively long criticalsection. This increases the latency of the write acquisitionand results in a smaller number of such acquisitions in thegiven period of time. We validated this hypothesis by manu-ally disabling the setting of
the RBias flag in BRAVO (that is, in that modified version, the fast acquisition path and revocation were never used) and observing that the modified BRAVO version produced results similar to stock.

Figure 7. locktorture results with 1 writer: (a) reader ops; (b) writer ops.

Figure 8 (a) presents results for the same experiment but with no writers. Here both versions, stock and BRAVO, increase the number of reads linearly with the number of threads. This is not surprising, given the relatively long read critical section (50ms), which masks any contention on the shared counter state during semaphore acquisition in the stock version. To validate this claim, we modified the locktorture benchmark such that a reader would hold the lock for only 5us. We note that we chose this number based on the typical length of a rwsem critical section as reported in the will-it-scale benchmarks described in the next section. We also made sure that readers in locktorture do not contend on any other shared variables, e.g., by using a local random number generator seed instead of a shared one. The results from this modified setup and a read-only workload are presented in Figure 8 (b). As expected, they show that the stock version stops scaling once the contention on the shared counter in rwsem grows. At the same time, BRAVO avoids access to the shared counter and scales perfectly across all thread counts. As a result, BRAVO refutes the common opinion that read-write locks should be used only when critical sections are long [7].

Figure 8. locktorture results with 0 writers: (a) original (50ms critical section); (b) modified (5us critical section).

will-it-scale (https://github.com/antonblanchard/will-it-scale) is an open-source collection of microbenchmarks for stress-testing various kernel subsystems. Will-it-scale runs in user-mode but is known to induce contention on kernel locks [16]. Each microbenchmark runs a given number of tasks (that can be either threads or processes), performing a series of specific system calls (such as opening and closing a file, mapping and unmapping memory pages, raising a signal, etc.). We experiment with a subset of microbenchmarks that access the VMA structure and create contention on mmap_sem, an instance of rwsem that protects the access to VMA [11]. In particular, the relevant microbenchmarks are page_fault and mmap. The former continuously maps a large (128M) chunk of memory, writes one word into every page in the chunk (causing a page fault for every write), and unmaps the memory. The latter simply maps and unmaps large chunks of memory.

Figure 9. will-it-scale results: (a) page_fault1_threads; (b) page_fault2_threads; (c) mmap1_threads; (d) mmap2_threads.
(Each of those benchmarks has several variants, denoted page_fault1, page_fault2, etc.) Page faults require the acquisition of mmap_sem for read, while memory mapping and unmapping operations acquire mmap_sem for write [8]. Therefore, the access pattern for mmap_sem is expected to be read-heavy in the page_fault microbenchmark and more write-heavy in mmap. We confirmed that through lockstat statistics. We note that BRAVO is not expected to provide any benefit for mmap, yet we include it to evaluate any overhead BRAVO might introduce in write-heavy workloads.

Figure 9 presents the results of our experiments for page_fault and mmap, respectively. In page_fault, the BRAVO version performs similarly to stock as long as the latter scales. After 16 threads, however, the throughput of the stock version decreases while the BRAVO version continues to scale, albeit at a slower rate. At 142 threads, BRAVO outperforms stock by up to 93%. At the same time, mmap shows no significant difference in the performance of BRAVO vs. stock, suggesting that BRAVO does not introduce overhead in scenarios where it is not profitable.

Metis is an open-source MapReduce library [33] used in the past to assess the scalability of Linux kernel locks [2, 8, 27]. Metis is known for relatively intense access to VMA through a mix of page-fault and mmap operations [27]. By collecting lock performance statistics with lockstat, however, we found that only a few of the Metis benchmarks have both a large number of mmap_sem acquisitions and a large portion of those acquisitions for read. We note that, as in
all other kernel benchmarks, lockstat was disabled when measuring the performance numbers reported below.

Tables 1 and 2 present the performance results, respectively, for wc, a map-reduce-based word count, and wrmem, which allocates a large chunk of memory and fills it with random "words" that are fed into the map-reduce framework for inverted index calculation. The BRAVO version can achieve speedups of over 30%. We note that some of the data, particularly for wc, was noisy; we print values with variance larger than 5% from the mean in italics. (All values have variance of 19% or less.) We also note that BRAVO did not create significant overhead for any other Metis benchmark, although some benchmarks produced noisy results similarly to wc.

Table 1. wc runtime (sec). Table 2. wrmem runtime (sec). (Columns: threads, stock, BRAVO, speedup.)

BRAVO easily composes with existing locks, preserving desirable properties of those underlying locks and yielding a composite lock with improved read-read scalability. We specifically target read-dominated workloads with multiple concurrent threads that acquire and release read permission at a high rate. The approach is simple, effective, and yields improved performance for read-dominated workloads compared to commonly used compact locks. The key trade-off inherent in the design is the benefit accrued by readers against the potential slow-down imposed by revocation. Even in mixed or write-heavy workloads, we limit any slow-down stemming from revocation costs and bound harm, making the decision to use BRAVO simple. BRAVO incurs a very small footprint increase per lock instance, and also adds a shared table of fixed size that can be used by all threads and locks. BRAVO's key benefit arises from reducing the coherence cost that would normally be incurred by locks having a central reader indicator. Write performance is left unchanged relative to the underlying lock.
BRAVO provides read-read performance at, and often above, that of the best modern reader-writer locks that use distributed read indicators, but without the footprint or complexity of such locks. By reducing coherence traffic, BRAVO is implicitly NUMA-friendly.

▶ Future directions
We identify a number of future directions for our investigation into BRAVO-based designs:

• Dynamic sizing of the visible readers table based on collisions. Large tables will have reduced collision rates but larger revocation scan overheads.

• The reader fast path currently probes just a single location and reverts to the slow path after a collision. We plan on using a secondary hash to probe an alternative location. In that vein, we note that while we currently use a hash function to map a thread's identity and the lock address to an index in the table, there is no particular requirement that the function that associates a read request with an index be deterministic. We plan on exploring other functions, using time or random numbers to form indices. While this will be less beneficial in terms of cache locality for the reader, it might be helpful in case of temporal contention over specific slots.

• Accelerate the revocation scan operation via SIMD instructions such as AVX. The visible readers table is usually sparsely populated, making it amenable to such optimizations. Non-temporal non-polluting loads may also be helpful for the scan operation.

• As noted, our current policy to enable bias is conservative, and leaves untapped performance. We intend to explore more sophisticated adaptive policies based on recent behavior and to use a more faithful cost model.

• An interesting variation is to implement BRAVO on top of an underlying mutex instead of a reader-writer lock. Slow-path readers must acquire the mutex, and the sole source of read-read concurrency is via the fast path. We note that some applications might expect the reader-writer lock implementation to be fully work conserving and maximally admissive – always allowing full read concurrency where available.
For example, if no writers are waiting or arrive, and a reader arrives while another reader thread T is active, the first reader will be allowed admission – if a thread can enter, then it will enter.

(An extended version of this paper is available at https://arxiv.org/abs/1810.01573.)

• In our current implementation, arriving readers are blocked while a revocation scan is in progress. This could be avoided by adding a mutex to each BRAVO-enhanced lock. Arriving writers immediately acquire this mutex, which resolves all write-write conflicts. They then perform revocation, if necessary; acquire the underlying reader-writer lock with write permission; execute the writer critical section; and finally release both the mutex and the underlying reader-writer lock. The underlying reader-writer lock resolves read-vs-write conflicts. The code used by readers remains unchanged. This optimization allows readers to make progress during revocation by diverting through the reader slow path, mitigating the cost of revocation. This also reduces variance in the latency of read operations. We note that this general technique can be readily applied to other existing reader-writer locks that employ distributed reader indicators, such as Linux's br-lock family [31].

• Modify the hash function that selects indices so that a given lock maps to subsets of the visible readers table. In turn, we can then accelerate the revocation scan by restricting the scan to those subsets.

• Partition the visible readers table into contiguous sectors specified by the NUMA node of the reader. This optimization will reduce the cost of false sharing within the table.

• Our preferred embodiment is
BRAVO-2D, where we partition the visible readers table into disjoint "rows". A row is a contiguous set of slots, aligned on a cache sector boundary, and we configure the length of a row as a multiple of the cache sector size. Readers index into the table by using the caller's CPUID to select a row, and then hashing the lock address to form an offset into that row. We can think of this offset as selecting a column. This approach virtually eliminates inter-thread interference and false sharing (near collisions). Threads enjoy spatial and temporal locality within their row. Inter-thread collisions would typically arise only because of migration and preemption, which we expect to be rare. The thread-to-CPUID relationship is expected to be fairly stable over the short term.

We may still encounter inter-lock collisions within a row, but fall back as usual to the underlying lock. Note that inter-lock collisions will be intra-thread, within a row. BRAVO-2D admits a higher inter-lock collision rate than does our baseline approach, where we just hash the thread and lock to form the index. Specifically, for a given thread and lock, the set of possible slots is restricted to just one row under BRAVO-2D, instead of the whole table as was the case in the baseline design. But in practice we find that the improved intra-row locality in the BRAVO-2D form overcomes that penalty. And since most threads hold very few reader-writer locks at a given time, the odds of inter-lock collisions are still very low, even if slightly higher compared to our baseline. At revocation time we need only scan the column associated with the lock, instead of the full table, reducing the number of slots accessed compared to the baseline design and accelerating revocation. Furthermore, the automatic hardware stride-based prefetcher can still track the revocation access pattern and provide benefit.
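The BRAVO-2D indexing scheme can be sketched as follows. The sizes and helper names (ROWS, COLS, slot_2d, column_of) are hypothetical; a real embodiment would size rows per logical CPU, obtain the row from the caller's CPUID, and align rows to cache-sector boundaries:

```c
#include <stdint.h>
#include <stddef.h>

#define ROWS 64   /* e.g., one row per logical CPU */
#define COLS  8   /* slots per row, a multiple of the cache sector size */

/* 2D visible-readers table; rows are cache-sector aligned (gcc/clang). */
static uintptr_t vr2d[ROWS][COLS] __attribute__((aligned(128)));

/* The column depends only on the lock address, so a revoking writer
 * needs to scan just one column. */
static size_t column_of(const void *lock) {
    uintptr_t a = (uintptr_t)lock;
    a ^= a >> 17; a *= 0x9E3779B97F4A7C15ULL; a ^= a >> 29;
    return (size_t)(a % COLS);
}

/* Readers index by (their CPU, the lock): CPU picks the row,
 * the lock hash picks the column within that row. */
static uintptr_t *slot_2d(unsigned cpu, const void *lock) {
    return &vr2d[cpu % ROWS][column_of(lock)];
}

/* Revocation visits only ROWS slots instead of ROWS * COLS. */
static int column_has_reader(const void *lock) {
    size_t c = column_of(lock);
    for (size_t r = 0; r < ROWS; r++)
        if (vr2d[r][c] == (uintptr_t)lock)
            return 1;
    return 0;
}
```

Because the column is a fixed-stride walk through memory, the revocation scan is also the kind of access pattern a hardware stride prefetcher tracks well.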
Acknowledgments
We thank Shady Issa for useful discussions about revocation and the cost model. We also thank the anonymous reviewers and our shepherd Yu Hua for providing insightful comments.
References

[1] Yehuda Afek, Dave Dice, and Adam Morrison. 2011. Cache Index-aware Memory Allocation. In Proceedings of the International Symposium on Memory Management. ACM. http://doi.acm.org/10.1145/1993478.1993486
[2] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2010. An Analysis of Linux Scalability to Many Cores. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI). 1–16.
[3] B. B. Brandenburg and J. H. Anderson. 2010. Spin-Based Reader-Writer Synchronization for Multiprocessor Real-Time Systems. In Real-Time Systems Journal. https://doi.org/10.1007/s1124
[4] Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. 2010. A Practical Concurrent Binary Search Tree. In Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). 257–268. http://doi.acm.org/10.1145/1693453.1693488
[5] Davidlohr Bueso. 2014. futexes and hash table collisions. https://blog.stgolabs.net/2014/01/futexes-and-hash-table-collisions.html. Accessed: 2019-05-14.
[6] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. 2013. NUMA-aware Reader-writer Locks. In Proceedings of ACM PPoPP. ACM, 157–166. http://doi.acm.org/10.1145/2442516.2442532
[7] Bryan Cantrill and Jeff Bonwick. 2008. Real-World Concurrency. ACM Queue 6, 5 (2008), 16–25.
[8] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. 2012. Scalable Address Spaces Using RCU Balanced Trees. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 199–210.
[9] Jonathan Corbet. 2003. Driver porting: mutual exclusion with seqlocks. http://lwn.net/Articles/22818. Accessed: 2018-04-20.
[10] Jonathan Corbet. 2010. Big reader locks. https://lwn.net/Articles/378911. Accessed: 2018-04-20.
[11] Jonathan Corbet. 2018. The LRU lock and mmap_sem. https://lwn.net/Articles/753058. Accessed: 2019-01-10.
[12] Tyler Crain, Vincent Gramoli, and Michel Raynal. 2012. A Speculation-friendly Binary Search Tree. In Proceedings of ACM PPoPP. ACM. http://doi.acm.org/10.1145/2145816.2145837
[13] M. Desnoyers, P. E. McKenney, A. S. Stern, M. R. Dagenais, and J. Walpole. 2012. User-Level Implementations of Read-Copy Update. IEEE Transactions on Parallel and Distributed Systems (2012). https://doi.org/10.1109/TPDS.2011.159
[14] Dave Dice. 2006. Biased Locking in HotSpot. https://blogs.oracle.com/dave/biased-locking-in-hotspot.
[15] David Dice, Danny Hendler, and Ilya Mirsky. 2013. Lightweight Contention Management for Efficient Compare-and-swap Operations. In Proceedings of the International Conference on Parallel Processing (Euro-Par). Springer-Verlag. http://dx.doi.org/10.1007/978-3-642-40047-6_60
[16] Dave Dice and Alex Kogan. 2019. Compact NUMA-aware Locks. In Proceedings of the ACM European Conference on Computer Systems (EuroSys).
[17] Dave Dice, Alex Kogan, and Yossi Lev. 2016. Refined Transactional Lock Elision. In Proceedings of ACM PPoPP.
[18] Dave Dice, Yossi Lev, and Mark Moir. 2013. Scalable Statistics Counters. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). http://doi.acm.org/10.1145/2486159.2486182
[19] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. 2009. Early Experience with a Commercial Hardware Transactional Memory Implementation. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). http://doi.acm.org/10.1145/1508244.1508263
[20] David Dice, Virendra J. Marathe, and Nir Shavit. 2015. Lock Cohorting: A General Technique for Designing NUMA Locks. ACM Trans. Parallel Comput. (2015). https://doi.org/10.1145/2686884
[21] David Dice, Mark Moir, and William N. Scherer III. 2002. Quickly Reacquirable Locks – US Patent 7,814,488. https://patents.google.com/patent/US7814488
[22] Dave Dice and Nir Shavit. 2010. TLRW: Return of the Read-write Lock. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). http://doi.acm.org/10.1145/1810479.1810531
[23] William B. Easton. 1971. Process Synchronization Without Long-term Interlock. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP). http://doi.acm.org/10.1145/800212.806505
[24] Pascal Felber, Shady Issa, Alexander Matveev, and Paolo Romano. 2016. Hardware Read-write Lock Elision. In Proceedings of the ACM European Conference on Computer Systems (EuroSys). http://doi.acm.org/10.1145/2901318.2901346
[25] Steve Heller, Maurice Herlihy, Victor Luchangco, Mark Moir, William N. Scherer, and Nir Shavit. 2006. A Lazy Concurrent List-based Set Algorithm. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS'05). Springer-Verlag. http://dx.doi.org/10.1007/11795490_3
[26] W. C. Hsieh and W. E. Weihl. 1992. Scalable reader-writer locks for parallel systems. In Proceedings of the Sixth International Parallel Processing Symposium. https://doi.org/10.1109/IPPS.1992.222989
[27] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable NUMA-aware Blocking Synchronization Primitives. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC).
[28] Andi Kleen. 2013. Lock elision in the GNU C library. https://lwn.net/Articles/534758. Accessed: 2019-05-13.
[29] Christoph Lameter. 2005. Effective synchronization on Linux/NUMA systems. In Gelato Conference.
[30] Yossi Lev, Victor Luchangco, and Marek Olszewski. 2009. Scalable Reader-writer Locks. In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA). http://doi.acm.org/10.1145/1583991.1584020
[31] Ran Liu, Heng Zhang, and Haibo Chen. 2014. Scalable Read-mostly Synchronization Using Passive Reader-Writer Locks. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC).
[32] Waiman Long. 2017. locking/rwsem: Enable reader optimistic spinning. https://lwn.net/Articles/724384/. Accessed: 2019-01-24.
[33] Yandong Mao, Robert Morris, and Frans Kaashoek. 2010. Optimizing MapReduce for Multicore Architectures. Technical Report. MIT.
[34] George Marsaglia. 2003. Xorshift RNGs. Journal of Statistical Software, Articles (2003). https://doi.org/10.18637/jss.v008.i14
[35] John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable Reader-writer Synchronization for Shared-memory Multiprocessors. In Proceedings of ACM PPoPP. http://doi.acm.org/10.1145/109625.109637
[36] Oracle. 2012. API Documentation for java.util.concurrent.locks.StampedLock. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html.
[37] Filip Pizlo, Daniel Frampton, and Antony L. Hosking. 2011. Fine-grained Adaptive Biased Locking. In Proceedings of the International Conference on Principles and Practice of Programming in Java (PPPJ). http://doi.acm.org/10.1145/2093157.2093184
[38] Ravi Rajwar and James R. Goodman. 2001. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO). http://dl.acm.org/citation.cfm?id=563998.564036
[39] Andreia Craveiro Ramalhete and Pedro Ramalhete. 2013. Distributed Cache-Line Counter Scalable RW-Lock. http://concurrencyfreaks.blogspot.com/2013/09/distributed-cache-line-counter-scalable.html.
[40] rocksdb.org. 2018. A persistent key-value store for fast storage environments. rocksdb.org.
[41] Kenneth Russell and David Detlefs. 2006. Eliminating Synchronization-related Atomic Operations with Biased Locking and Bulk Rebiasing. In Proceedings of the ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications (OOPSLA). http://doi.acm.org/10.1145/1167473.1167496
[42] Jun Shirako, Nick Vrvilo, Eric G. Mercer, and Vivek Sarkar. 2012. Design, Verification and Applications of a New Read-write Lock Algorithm. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). http://doi.acm.org/10.1145/2312005.2312015
[43] Guy L. Steele, Jr., Doug Lea, and Christine H. Flood. 2014. Fast Splittable Pseudorandom Number Generators. In Proceedings of the ACM Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA). 453–472. http://doi.acm.org/10.1145/2660193.2660195
[44] N. Vasudevan, K. S. Namjoshi, and S. A. Edwards. 2010. Simple and fast biased locks. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT).
[45] U. Verner, A. Mendelson, and A. Schuster. 2017. Extending Amdahl's Law for Multicores with Turbo Boost. IEEE Computer Architecture Letters (2017). https://doi.org/10.1109/LCA.2015.2512982
[46] D. Vyukov. 2011. Distributed Reader-Writer Mutex.
[47] Wikipedia contributors. 2017. Ski rental problem — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Ski_rental_problem&oldid=813551905. [Online; accessed 8-August-2018].
[48] Wikipedia contributors. 2018. Birthday problem — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Birthday_problem&oldid=853622452. [Online; accessed 8-August-2018].
[49] Wikipedia contributors. 2018. Loss Aversion. https://en.wikipedia.org/wiki/Loss_aversion.