An Analysis of Concurrency Control Protocols for In-Memory Databases with CCBench (Extended Version)
Takayuki Tanabe∗ (NAUTILUS Technologies, Inc.), Takashi Hoshino (Cybozu Labs, Inc.), Hideyuki Kawashima (Faculty of Environment and Information Studies, Keio University), Osamu Tatebe (Center for Computational Sciences, University of Tsukuba)
[email protected], [email protected], [email protected], [email protected]
ABSTRACT
This paper presents yet another concurrency control analysis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods, and enables the configuration of seven workload parameters. We analyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classified the optimization methods on the basis of three performance factors: CPU cache, delay on conflict, and version lifetime. Analyses using CCBench and 224 threads produced six insights. (I1) The performance of an optimistic concurrency control protocol for a read-only workload rapidly degrades as cardinality increases, even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using the invisible reads optimization. (I3) The effectiveness of two approaches to coping with conflict (wait and no-wait) depends on the situation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can improve performance. (I5) Mixing different implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if a single long transaction is mixed into the workload. On the basis of I4, we defined the read phase extension optimization, in which an artificial delay is added to the read phase. On the basis of I6, we defined the aggressive garbage collection optimization, in which even visible versions are collected. The code for CCBench and all the data in this paper are available online at GitHub.

∗ Work done while at University of Tsukuba
1. INTRODUCTION

1.1 Motivation
Transaction processing systems containing thousands of CPU cores in a single server have been emulated [75], implemented [44, 69], and analyzed [24]. Along with these systems, a variety of concurrency control protocols for a single server [43, 46, 65, 69, 76] have been proposed in recent years for use in many-core architectures. These modern protocols use a variety of optimization methods, and their behaviors depend on the workload characteristics (e.g., thread count, skew, read ratio, cardinality, payload size, transaction size, and read-modify-write or not). Recent studies proposing new protocols have compared them with conventional ones [30, 32, 33, 35, 37, 39, 43, 44, 46, 56, 63, 65, 69, 70, 73, 74, 76, 77], and recent analytical studies have compared the performance of conventional protocols on a common platform [23, 24, 72, 75]. These studies mostly evaluated protocol scalability. This paper acknowledges that modern protocols are scalable and focuses on other factors that contribute to performance in a many-core environment. To the best of our knowledge, this type of analysis is novel.

Fairness is important in comparison. However, some recent studies were unable to perform such analysis fairly because they compared their new protocols with others using different platforms. For example, the evaluations of ERMIA [43], mostly-optimistic concurrency control (MOCC) [69], and Cicada [46] each used two or three platforms. Experiments using a mix of platforms can produce only approximate results because the performance of protocols on a many-core architecture depends greatly on the exploitation of the underlying hardware. In such conditions, only macroscopic evaluations (i.e., scalability) can be conducted, and a detailed analysis is difficult. A single unified comparison platform is needed to conduct a detailed analysis. For a fair comparison, a common analysis platform is necessary.
It should provide shared core modules such as access methods (concurrent indexes) and thread-local data structures (read and write sets) for all protocols. It should also provide a variety of optimization methods to enable a deep understanding of the protocols. Finally, the platform should be publicly available for reproducing the experiments. Although there are several open-source platforms, including DBx1000 [7], Peloton [12], and Cavalia [4], they do not provide certain modern protocols [46, 68, 69, 76]. Appuswamy et al. [23] evaluated protocols in four types of architecture using Trireme, which is not publicly available.

1.2 Contributions

The first contribution of this paper is
CCBench: a platform for fairly evaluating concurrency control protocols. The protocols are two-phase locking (2PL), Silo, MOCC, TicToc, snapshot isolation (SI), latch-free serial safety net (SSN), and Cicada. CCBench supports seven versatile optimization methods and two identified in this work (Table 1, § 3). The NoWait method [28] can be applied to Silo, the rapid garbage collection (GC) method [46] can be applied to multi-version protocols, and the AdaptiveBackoff optimization method [46] can be applied to all protocols. Fairness in analyzing protocols is achieved through shared modules, including a workload generator, access methods (Masstree [11, 47]), local datasets (read/write sets), and a memory manager (mimalloc [21] and AssertiveVersionReuse, presented in this paper). Evaluation of protocols under various conditions is enabled by providing seven workload configuration parameters: skew, payload size, transaction size, cardinality, read ratio, read-modify-write (RMW) or not, and the number of worker threads. CCBench and all of the data in this paper are available online at GitHub [5].

The second contribution is an analysis of cache, delay, and version lifetime using 224 threads.
We clarified the effects of the protocols on the performance factors by configuring the optimization methods and workload settings. As suggested elsewhere [75] and [44], the era of a thousand cores is just around the corner, so conventional analytical studies focused on evaluating thread scalability [38, 72, 75]. In contrast, we performed all analyses with 224 threads on 224 cores.

We first investigated the effects of the optimization methods related to cache. By determining that a centralized counter increases cache-line conflicts and degrades performance, we gained two insights. I1: The performance of optimistic concurrency control (OCC) for a read-only workload rapidly degrades as cardinality increases, even without L3 cache misses (§ ). I2: Silo outperforms TicToc for write-intensive workloads by using InvisibleReads (§ ).

We then investigated the effects of the optimization methods related to delay, and gained two more insights. I3: The effectiveness of two approaches to coping with conflict (Wait and NoWait) depends on the situation (§ ). I4: OCC reads the same record two or more times due to concurrent transaction interruption. Surprisingly, OCC can improve performance in certain situations with it, and we have defined a new optimization method, ReadPhaseExtension, based on it (§ ).

Finally, we investigated version lifetime for multi-version concurrency control (MVCC) protocols, and gained two final insights. I5: Mixing different implementations is inappropriate. Silo outperformed Cicada on the Yahoo! Cloud Serving Benchmark B (YCSB-B) workload in an unskewed environment, which is inconsistent with previously reported testing results on different systems [46]. I6: Even a state-of-the-art GC technique cannot improve the performance of MVCC if there is a single long transaction mixed into the workload. To overcome this problem, we defined a new optimization method, AggressiveGC. It requires an unprecedented protocol that weaves GC into MVCC, thus going beyond the current assumption that versions cannot be collected if they might be read by transactions (§ ).

The rest of this paper is organized as follows.
2. PRELIMINARIES

2.1 Concurrency Control Protocols
The concurrency control protocols we analyzed are classified as (1) pessimistic (2PL [34]), (2) optimistic (Silo [65] and TicToc [76]), (3) multi-version (SI [41] and ERMIA [43]), (4) an integration of optimistic and pessimistic (MOCC [69]), and (5) an integration of optimistic and multi-version (Cicada [46]).
Silo [65] is an OCC protocol that has influenced subsequent concurrency control protocols. For example, FOEDUS [44] and MOCC [69] extend the commit protocol of Silo. Silo selects the design of InvisibleRead [48], so it does not update the metadata during a read operation. The invisible read process avoids cache-line conflicts, so it provides scalability, as shown by Wang and Kimura [69].
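As a concrete sketch of the invisible-read idea, the following shows one way a Silo-style protocol can read a record without writing to shared memory and then validate it at commit time. The record layout and names here are illustrative assumptions, not Silo's or CCBench's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical record layout: one atomic TID word guards the payload.
// Bit 0 is a lock bit; the upper bits hold the version number.
struct Record {
    std::atomic<uint64_t> tid{2};  // version 1, unlocked
    uint64_t payload{42};
};

// Invisible read: capture a consistent (tid, payload) pair without
// writing to any shared memory, so no cache line is invalidated.
bool invisible_read(Record& r, uint64_t& tid_out, uint64_t& val_out) {
    while (true) {
        uint64_t t1 = r.tid.load(std::memory_order_acquire);
        if (t1 & 1) continue;              // locked by a writer: spin
        uint64_t v = r.payload;            // copy the payload
        std::atomic_thread_fence(std::memory_order_acquire);
        uint64_t t2 = r.tid.load(std::memory_order_acquire);
        if (t1 == t2) { tid_out = t1; val_out = v; return true; }
        // TID changed: a writer intervened, retry
    }
}

// Commit-time validation: the read is valid only if the record still
// carries the TID observed in the read phase and is not locked.
bool validate(const Record& r, uint64_t observed_tid) {
    uint64_t t = r.tid.load(std::memory_order_acquire);
    return t == observed_tid && !(t & 1);
}
```

If validation fails, the transaction aborts and retries; the key point is that the read path issues only loads, which is what keeps read-mostly workloads free of cache-line ping-pong.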
TicToc [76] is an OCC protocol based on timestamp ordering with data-driven timestamp management. TicToc has a larger scheduling space than Silo, so it can commit schedules that Silo cannot. TicToc provides three optimization methods. PreemptiveAbort initiates abort processing immediately if an abort is detected before write locking in the validation phase. NoWaitTT does not wait for lock release in the validation phase; instead, it releases the locks and retries the validation phase after a fixed-duration wait, without aborting. TimestampHistory expands the scheduling space by recording the write timestamp of an older version so that some of the read operations on the version can be verified after the record is overwritten.
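The data-driven timestamp management can be illustrated with a simplified, single-threaded sketch of TicToc-style commit validation. Real TicToc performs this under write locks with atomically packed timestamp words; the names and structures below are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-record timestamps in the spirit of TicToc:
// wts = write timestamp of the version, rts = latest read timestamp.
struct Item { uint64_t wts; uint64_t rts; };

struct ReadEntry  { Item* item; uint64_t observed_wts; };
struct WriteEntry { Item* item; };

// Data-driven commit timestamp: it must be no smaller than the wts of
// every record read and strictly larger than the rts of every record
// written. Reads then validate by extending rts up to commit_ts.
bool validate_and_commit(std::vector<ReadEntry>& rs,
                         std::vector<WriteEntry>& ws) {
    uint64_t commit_ts = 0;
    for (auto& re : rs) commit_ts = std::max(commit_ts, re.observed_wts);
    for (auto& we : ws) commit_ts = std::max(commit_ts, we.item->rts + 1);
    for (auto& re : rs) {
        if (re.item->wts != re.observed_wts) return false;  // overwritten
        // Extend the read timestamp so the read is valid at commit_ts.
        re.item->rts = std::max(re.item->rts, commit_ts);
    }
    for (auto& we : ws) {                   // install the new version
        we.item->wts = we.item->rts = commit_ts;
    }
    return true;
}
```

Because the commit timestamp is computed from the accessed records rather than from a global counter, no centralized ordering point is needed.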
MOCC [69] exhibits high performance for a wide range of workloads by adaptively switching its policy, adding pessimistic locking and temperature management to Silo. MOCC locks hot (highly conflicted) records while keeping the order of records locked to avoid deadlocks in the read phase. The MOCC queuing lock (MQL) integrates an MCS (Mellor-Crummey and Scott) lock that can time out [58] and a reader/writer fairness MCS lock [49].
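A minimal sketch of the temperature idea follows; the counter layout and threshold are assumptions for illustration, not MOCC's actual design.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical temperature tracking in the spirit of MOCC: records that
// cause many aborts become "hot" and are read under a pessimistic lock.
constexpr uint64_t kHotThreshold = 8;  // assumed tuning knob

struct TempRecord {
    std::atomic<uint64_t> temperature{0};
};

// Called when a transaction aborts after conflicting on this record.
void on_abort(TempRecord& r) {
    r.temperature.fetch_add(1, std::memory_order_relaxed);
}

// Policy switch: lock hot records in the read phase, use invisible
// reads (no shared-memory writes) for cold ones.
bool should_lock_on_read(const TempRecord& r) {
    return r.temperature.load(std::memory_order_relaxed) >= kHotThreshold;
}
```

The point of the switch is that cold records keep Silo's cheap invisible-read path, while hot records pay for a lock up front instead of repeatedly failing read validation.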
Cicada [46] combines the OCC protocol, a multi-version protocol with timestamp ordering, and distributed timestamp generation. The distributed timestamp generation eliminates the need to access the centralized counter that conventional MVCC protocols require, thereby dramatically mitigating cache-line conflicts. Cicada has a variety of optimization methods. EarlyAbort initiates an abort if one is predicted during the read phase. BestEffortInlining embeds an inline version in the record header at the top of the version list to reduce indirect reference cost. PrecheckValidation checks whether read versions are still valid in the early validation phase. RapidGC is a quick and parallel GC optimization method. AdaptiveBackoff dynamically determines how long to wait before retrying.
SortWriteSetByContention detects conflicts at an early stage, before performing actions that would be wasted due to aborts.

Figure 1: CCBench architecture. Worker threads interact with shared objects: a read/write set (all protocols), a version cache (ERMIA, SI, Cicada), a workload generator (all protocols), an access method (Masstree), a key-value store (record design depends on each protocol), a central data object (SI: counter, Silo: epoch, ERMIA: mapping table, Cicada: backoff delta), and a memory manager (allocator (mimalloc), pre-allocation, numactl, thread affinity).
SI [26] is an MVCC protocol that can generate write-skew and read-only-transaction anomalies, so it does not produce serializable schedules. Under SI, a transaction reads a snapshot of the latest committed versions. The write operations are also reflected in the snapshot. SI requires a monotonically increasing timestamp assignment for each transaction to provide snapshots. To determine the timestamp, a centralized shared counter is required.
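The centralized counter mentioned above can be as simple as a single atomic fetch-and-add; every transaction then touches the same cache line, which is exactly the contention point that Cicada's distributed timestamp generation avoids. A minimal sketch:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One shared counter provides monotonically increasing timestamps for
// SI. Every fetch_add below contends on the same cache line under high
// thread counts, which is the centralized bottleneck discussed here.
std::atomic<uint64_t> global_ts{0};

// Begin timestamp: the transaction's snapshot contains exactly the
// versions committed with a timestamp no larger than this value.
uint64_t begin_timestamp() {
    return global_ts.load(std::memory_order_acquire);
}

// Commit timestamp: unique and monotonically increasing.
uint64_t commit_timestamp() {
    return global_ts.fetch_add(1, std::memory_order_acq_rel) + 1;
}
```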
SI with latch-free SSN [43] integrates SI and SSN [67] and exploits the latch-free mechanism [68]. SSN detects and aborts dangerous transactions that could lead to non-serializable schedules. We refer to the integration of SI and SSN as ERMIA in this paper.

2PL [34] releases all read/write locks at the end of each transaction. The NoWait method, which immediately aborts the running transaction when a conflict is detected, was originally proposed as a deadlock resolution mechanism [27]. We use it as an optimization method.
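The difference between the two conflict-handling policies can be sketched with a minimal test-and-set write lock; this is an illustration, not CCBench's lock implementation.

```cpp
#include <atomic>
#include <cassert>

// Minimal test-and-set write lock used to contrast the two policies.
struct SpinLock {
    std::atomic<bool> held{false};
    bool try_acquire() {               // fail fast on conflict
        bool expected = false;
        return held.compare_exchange_strong(expected, true);
    }
    void release() { held.store(false); }
};

enum class Decision { Acquired, Abort };

// NoWait policy: a conflict immediately aborts the running transaction,
// which releases its locks and retries, so deadlock cannot persist.
// The Wait policy would instead spin on try_acquire until it succeeds,
// and must rely on some other mechanism to resolve deadlocks.
Decision lock_no_wait(SpinLock& l) {
    return l.try_acquire() ? Decision::Acquired : Decision::Abort;
}
```

NoWait trades wasted work on abort-and-retry for freedom from deadlock detection; Wait avoids redundant retries but can stall behind long lock holders.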
The evaluation environment consisted of a single server with four Intel(R) Xeon(R) Platinum 8176 CPUs (2.10 GHz). Each CPU had 28 physical cores with hyper-threading, and the server had 224 logical cores. Each physical core had a 32 KB private L1d cache and a 1 MB private L2 cache. The 28 cores in each processor shared a 38.5 MB L3 cache, so the total cache size was about 154 MB. Forty-eight DDR4-2666 32 GB DIMMs were attached, for a total of 1.5 TB of memory.

In all graphs in this paper showing the results of CCBench, each plot point shows the average of five runs, each longer than 3 s. We confirmed that these numbers produce stable CCBench results. The error bars indicate the range between the maximum and minimum values. We counted the number of committed and aborted transactions to calculate average throughput. We used the rdtscp instruction to measure the latency of the transactions and their portions. Each worker thread was pinned to a specific CPU core in every experiment.
3. CCBENCH

3.1 Fairness Condition
When analyzing concurrency control protocols, their performance should be evaluated under a fairness condition. The meaning of fairness depends on the situation. In this paper, fairness means that basic modules are shared in the analysis. This is because the performance of modern protocols is closely related to the exploitation of the underlying hardware; i.e., they are sensitive to engineering details. Therefore, the basic modules of an evaluation platform should be shared for a fair evaluation. Access methods (e.g., Masstree) and local data structures (read and write sets) need to be shared among protocols. The effects of memory allocation should be reduced as much as possible. The workload should be consistent among experiments.

Several analysis studies satisfy our fairness condition. Yu et al. [75] developed DBx1000 [7]. It was used in the evaluation in the TicToc paper [76] and in another evaluation on a real machine with 1568 cores [24]. DBx1000 currently does not support some modern protocols (Cicada, MOCC, ERMIA). Wu et al. developed Cavalia [4] to compare the hardware transactional memory (HTM)-assisted OCC-style protocol (HTCC) [74] with conventional protocols. Several modern protocols (Cicada, ERMIA, TicToc, MOCC) were beyond their scope. Wu et al. [72] developed Peloton [12] to compare MVCC protocols. Several modern protocols (Cicada, ERMIA, Silo, TicToc, MOCC) were beyond their scope. Appuswamy et al. [23] developed Trireme to compare protocols in four types of architecture, including shared-everything. Several modern protocols (TicToc, Cicada, MOCC, ERMIA) were beyond their scope.

A protocol can include a variety of optimization methods. Even if protocol P does not initially include optimization method O, an analysis platform should provide O to P if users request it, because the performance of a protocol greatly depends on the optimization method. For example, RapidGC, included in Cicada, can also be applied to both SI and ERMIA.
NoWait [27] can be applied to Silo to improve performance, as shown in Fig. 10c. Conventional platforms do not support this concept.

Our fairness condition was not satisfied in several studies. The Cicada paper (§ ), for example, compared protocols implemented on different platforms.

Our CCBench analysis platform for in-memory CC protocols satisfies our fairness condition because it shares the basic modules among protocols. The architecture of CCBench is illustrated in Fig. 1. The code for CCBench is available on GitHub [5].

In CCBench, each thread executes both the client and the server logic. The client generates a transaction with read/write operations using the workload parameters at runtime. The server provides APIs as C++ functions, such as read, write, commit, and abort. The client calls the APIs to run transactions. The server runs the transactions requested by the client inside a worker thread. The client code is separated from the server code, and both are compiled into a single executable binary.

Table 1: Versatile optimization methods in CCBench. —: irrelevant or incompatible. Org: supported by the original protocol. CCB: supported by CCBench. (α) Delay inspired by extra reads in OCC (§ ). (β) CCBench performs read locks for hot records and invisible reads for non-hot records. (γ) The NoWait locking optimization detects a write lock conflict and immediately aborts and retries the transaction; NoWaitTT releases all locks and re-locks them. (δ) Lightweight memory management, proposed in this work (§ ). (ε) PreemptiveAbort and TimestampHistory cannot be applied to others. (ζ) SortWriteSetByContention, PrecheckVersionConsistency, EarlyAborts, and BestEffortInlining cannot be applied to others. We applied optimization methods as follows. NoWait: Silo in Fig. 10c. RapidGC: ERMIA and SI in all cases. AssertiveVersionReuse: Cicada, ERMIA, and SI in all cases. AdaptiveBackoff: TicToc in Fig. 2c and all protocols in Figs. 4a, 4b, 4c, and 4d.

Performance factors: Cache (DecentralizedOrdering, InvisibleReads); Delay (NoWait or Wait, AdaptiveBackoff, ReadPhaseExtension (α)); Version lifetime (AssertiveVersionReuse (δ), RapidGC).

Protocol          DecentralizedOrdering  InvisibleReads  NoWait or Wait  AdaptiveBackoff  ReadPhaseExtension  AssertiveVersionReuse  RapidGC
2PL [34]          —                      —               Org, CCB        CCB              —                   —                      —
Silo [65]         Org, CCB               Org, CCB        CCB (γ)         CCB              CCB                 —                      —
MOCC (β) [69]     Org, CCB               Org, CCB        —               CCB              CCB                 —                      —
TicToc (ε) [76]   Org, CCB               —               CCB (γ)         CCB              CCB                 —                      —
SI [41]           —                      —               —               CCB              —                   CCB                    CCB
ERMIA [43]        —                      —               —               CCB              —                   CCB                    CCB
Cicada (ζ) [46]   Org, CCB               —               —               Org, CCB         CCB                 CCB                    Org, CCB

The memory allocator, mimalloc [21], allocates interleaved memory among CPU sockets by using the Linux numactl command. Memory allocation is avoided as much as possible so that the penalty [29] imposed by the Linux virtual memory system can be ignored and the execution performance of protocols is not degraded by undesirable side effects of memory allocation. CCBench initializes the database for each experiment, pre-allocates the memory for objects, and reuses it. Allocated memory for local data structures (read/write sets) and the generated operations for a transaction are reused for the next transaction. The metadata and record data are carefully aligned to reduce false sharing. A wrapper of Masstree [11] was implemented, and all protocols use it as the access method.

CCBench supports seven versatile optimization methods, as shown in Table 1. (1) DecentralizedOrdering: prevents contended accesses to a single shared counter. (2)
InvisibleReads: read operations that do not update memory, so cache-line conflicts do not occur. (3) NoWait or Wait: immediate abort upon detecting a conflict followed by a retry, or waiting for lock release. (4) ReadPhaseExtension: an artificial delay added to the read phase, inspired by the extra read process that retries the read operation. (5) AdaptiveBackoff: an artificial delay before restarting an aborted transaction. (6) AssertiveVersionReuse: allocates thread-local space, denoted as the version cache in Fig. 1, to retain versions so that access to the memory manager is not needed. (7) RapidGC: frequent updating of the timestamp watermark for GC in MVCC protocols. ReadPhaseExtension (4) and AssertiveVersionReuse (6) are presented in this paper.

CCBench enables the use of additional optimization methods for 2PL, Silo, TicToc, SI, ERMIA, and Cicada. It does not support the read-only optimization methods introduced in Silo, FOEDUS, MOCC, and ERMIA because they improve performance only in a specific case (i.e., 99% read-only transactions, 1% write transactions, skew at 0.99), and such a workload is not considered here.

Users of CCBench can easily add new protocols to the platform. After replicating a directory that has an existing protocol, the user rewrites the functions corresponding to the transactions with begin/read/write/commit/abort operations in the transaction executor class. The user can then reuse the basic modules in the platform: the workload generator, memory management, and so on. Users can easily attach or detach optimization methods provided by CCBench to their protocols simply by rewriting the preprocessor definitions in CMakeLists.txt. A developer guide for CCBench users is available [1]. One of our team members implemented a simple OCC protocol following this guide [3] and published a description of the experience for other users [17]. Users can also switch the configuration of the optimization methods, key-value size, and protocol details in the same way as done in the online example. In an experiment, the user can set the workload by specifying runtime arguments via gflags [16] using the seven workload parameters listed in Table 2. This design makes it easier to conduct experiments than DBx1000 [7], which manages these values as preprocessor definitions in a file.
2PL: For efficiency, we implemented from scratch a reader/writer lock system based on the compare-and-swap (CAS) operation for use in the protocol.
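A minimal CAS-based reader/writer lock in this spirit might look as follows. The conventions (one 64-bit word, top bit for the writer, remaining bits counting readers) are assumptions for illustration, not CCBench's exact implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One 64-bit word: the top bit marks a writer, the low bits count readers.
class RWLock {
    static constexpr uint64_t kWriter = 1ull << 63;
    std::atomic<uint64_t> word_{0};
public:
    bool try_read_lock() {
        uint64_t w = word_.load(std::memory_order_relaxed);
        while (!(w & kWriter)) {   // only while no writer holds the lock
            if (word_.compare_exchange_weak(w, w + 1,
                                            std::memory_order_acquire))
                return true;       // reader count incremented atomically
        }
        return false;              // a writer holds the lock
    }
    void read_unlock() { word_.fetch_sub(1, std::memory_order_release); }
    bool try_write_lock() {
        uint64_t expected = 0;     // only when no readers and no writer
        return word_.compare_exchange_strong(expected, kWriter,
                                             std::memory_order_acquire);
    }
    void write_unlock() { word_.store(0, std::memory_order_release); }
};
```

Both lock paths resolve to a single CAS on one word, which keeps the uncontended cost to one atomic instruction.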
Silo: We padded the global epoch (64 bits) to the cache-line size to reduce the false sharing that can be caused by a global epoch state change. Calculation of the commit transaction ID (TID) is expensive since it requires loops over both the read set and the write set. We eliminated these two loops by performing the calculation at write lock acquisition and read verification.
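The epoch padding can be sketched as follows; the 64-byte cache-line size is an assumption that holds on common x86 CPUs, including the Xeon used here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Padding the global epoch to a full cache line keeps unrelated hot data
// off the epoch's line, so an epoch advance invalidates only this one
// line in other cores' caches instead of falsely sharing with neighbors.
struct alignas(64) PaddedEpoch {
    std::atomic<uint64_t> value{1};
    // the remaining bytes of the 64-byte line are implicit padding
};

static_assert(sizeof(PaddedEpoch) == 64,
              "epoch occupies exactly one cache line");

PaddedEpoch global_epoch;

// Called periodically by a dedicated epoch-advancing thread.
void advance_epoch() {
    global_epoch.value.fetch_add(1, std::memory_order_acq_rel);
}
```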
ERMIA: We used latch-free SSN to avoid expensive SSN serial validation; this implementation prevents the validation from becoming a critical section. We integrated pstamp and sstamp to reduce memory usage, as described in Section 5.1 of the original paper [68]. We also optimized the transaction mapping table. A straightforward design expresses the mapping table as a two-dimensional global array holding a thread number, TID, cstamp, and last cstamp, which requires a lock manager for updates and reads over the rows; this degrades performance due to serialization. We therefore designed a data structure that expresses the rows in a one-dimensional array so that each row can be latched with a CAS operation. Performance was improved by avoiding serialization and cache-line conflicts on row element accesses. We used RapidGC [46]. The transaction mapping table objects exploit our AssertiveVersionReuse method, so the memory manager had almost no work to do during the experiments.
TicToc: We implemented all three optimization methods (NoWaitTT, PreemptiveAbort, and TimestampHistory). We also improved Algorithm 5 in the original paper [76] by removing redundant read timestamp updates: if the recorded read timestamp is the same as the previous one, it does not need to be updated.
MOCC: MOCC periodically resets its temperature information [69], which switches many cache lines to the invalid state simultaneously and thereby degrades performance. In contrast, our implementation stores the epoch ID of the latest reset together with the temperature and thereby avoids multiple resets within the same epoch. This reduces the cost of cache-line conflicts and the number of reset operations while maintaining fine control of the temperature. We did not implement MQL since our environment had only four NUMA (non-uniform memory access) nodes, so its effect would be limited. We instead used the reader/writer lock from our 2PL implementation.
Cicada: We implemented all six optimization methods: SortWriteSetByContention, PrecheckVersionConsistency, AdaptiveBackoff, EarlyAborts, RapidGC, and BestEffortInlining. Moreover, we fixed a logical bug: the incomplete version-consistency check in the validation phase. In the original paper [46], only the read version is confirmed to still be the visible version. However, whether the latest version is still the visible version needs to be confirmed as well; the existence of a newer version in the observable range means that it should have been read, and the transaction's view would thus be broken. We turned off the one-sided synchronization to improve throughput.
CCBench: In the MVCC protocols (SI, ERMIA, and Cicada), creating new versions and deleting old ones put a load on the memory manager. We therefore developed a new optimization method, dubbed AssertiveVersionReuse, that avoids overloading the memory manager in MVCC protocols. This method lets each worker thread maintain a container for future version reuse. When GC begins, the collected versions are stored in this container. A new version is taken from this container unless it is empty; if it is empty, a request for space is sent to the memory manager as usual. The space needed for GC is estimated before the experiment, and memory space is allocated in advance. This optimization minimizes the burden on the memory manager for the MVCC protocols. Moreover, we introduce the use of
ReadPhaseExtension to delay execution, inspired by the extra read process described in § , and the AggressiveGC optimization, which collects even visible versions, described in § .

CCBench supports the seven parameters shown in Table 2. Skew is an access pattern that follows a Zipf distribution. Cardinality is the number of records in the database. Payload is the size of a record (key plus value). Transaction size is the number of operations in a transaction. Read ratio is the ratio of reads in a transaction. Write model is whether to perform RMW or not. Thread count is the number of worker threads (fixed at 224 here, except for the reproduction experiments). To determine the skew, CCBench uses a fast approximation method [36].

Table 2: Analyzed parameter sets. (α) Cardinality (from 10 to 10 records); (β) Cardinality (10 or 10 records); (γ) Read ratio (from 0% to 100%); (δ) Transaction size (from 10 to 100 operations); (ε) Payload size (from 4 to 1000 bytes); (ζ) Skew (from 0.6 to 0.99).

Performance factor   Cache        Delay         Version
Figure number        F5     F7    F9     F10    F11
Skew                 0      0     0.8    0.9    ζ
Cardinality          α      β
Payload (byte)       4      4     4      ε
Transaction size                                δ
Read ratio (%)       0,5    γ

We analyzed the CC protocols using 224 threads with CCBench. Most of the benchmarks were the YCSB workload, and some were variants of YCSB. The analyzed parameter sets are summarized in Table 2. We varied the skew, cardinality, payload, transaction size, and read ratio and fixed the thread count at 224. The caption of the table describes the varied parameters using α...ζ.

This paper focuses on identifying factors that significantly affect performance in a highly parallel environment. We chose YCSB-like benchmarks because they can generate various workloads despite the simplicity of their data model, which offers only read and update operations on a single table with a primary index. It is preferable for a benchmark to also support realistic workloads, which often contain insertion, deletion, range search, and secondary indexes over multiple tables. Difallah et al. [31] summarized such workloads, which include an industry-standard benchmark, TPC-C [22]. CCBench currently supports only a subset of TPC-C, including the New-Order and Payment transactions, and we obtained the following results: (1) CCBench exhibited scalability in both the thread count and the warehouse count; (2) the majority of the execution time was spent on index traversal, and its acceleration is important for high performance; (3) different analysis platforms exhibited different behavior even for the same workload, depending on their design and implementation. The details are described in Appendix A. CCBench will support the TPC-C full-mix in future work, which will be available at [1].
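For the skew parameter, a fast Zipf generator typically precomputes the zeta normalizer once and then draws each sample in O(1). The following is a common textbook-style approximation, shown for illustration; it is not necessarily the exact method of [36].

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Fast approximate Zipf sampler over keys [1, n] with parameter theta in
// [0, 1). The zeta normalizer is computed once in the constructor; each
// call to next() then needs only a few pow() evaluations.
class ZipfGen {
    uint64_t n_;
    double theta_, zetan_, alpha_, eta_;
    std::mt19937_64 rng_;
    std::uniform_real_distribution<double> uni_{0.0, 1.0};
public:
    ZipfGen(uint64_t n, double theta, uint64_t seed = 1)
        : n_(n), theta_(theta), rng_(seed) {
        zetan_ = 0.0;                       // zeta(n, theta), computed once
        for (uint64_t i = 1; i <= n_; ++i)
            zetan_ += 1.0 / std::pow((double)i, theta_);
        alpha_ = 1.0 / (1.0 - theta_);
        double zeta2 = 1.0 + std::pow(0.5, theta_);
        eta_ = (1.0 - std::pow(2.0 / (double)n_, 1.0 - theta_)) /
               (1.0 - zeta2 / zetan_);
    }
    uint64_t next() {                       // returns a key in [1, n]
        double u = uni_(rng_);
        double uz = u * zetan_;
        if (uz < 1.0) return 1;
        if (uz < 1.0 + std::pow(0.5, theta_)) return 2;
        uint64_t k = 1 + (uint64_t)((double)n_ *
                     std::pow(eta_ * u - eta_ + 1.0, alpha_));
        return k > n_ ? n_ : k;             // clamp floating-point edge cases
    }
};
```

The O(n) zeta computation happens only at database initialization; per-operation sampling cost is constant, which matters when 224 worker threads generate operations at runtime.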
4. REPRODUCTION OF PRIOR WORK
Before presenting our analysis of reproducing experiments performed in prior work, we explain how we validated the correctness of the CCBench implementation so that it successfully reproduced the results of those experiments. We did this by evaluating DBx1000 [7], a standard analysis platform, as used by Bang et al. [24], to validate the behavior of CCBench. For all graphs showing DBx1000 results, we set CPU_FREQ to 2.095 after measuring the real clock rate used by the rdtscp instruction; each plot point shows the average of ten runs. The duration of each run was 4.5 to 10 s. As the access method, we used a hash index for DBx1000 and Masstree for CCBench. DBx1000 does not support RMW operations in the YCSB workload, so its plots are not presented in the reproduction results for MOCC and Cicada. More parameter spaces were explored, and the results are available online [2].

Figure 2: Reproduction of the TicToc experiment. 10 M records, payload 1 K bytes. (a) YCSB (read-only): 2 queries per transaction, skew 0. (b) YCSB (medium contention): 16 queries per transaction, 90% reads, 10% writes, skew 0.8. (c) YCSB (high contention): 16 queries per transaction, 50% reads, 50% writes, skew 0.9. Different access methods (hash for DBx1000, Masstree for CCBench). CCBench without an index (39.6 Mtps) outperformed DBx1000 (30 Mtps) at 80 threads.
Figure 3: Reproduction of the MOCC experiment. YCSB, 50 records, 10 ops/trans, skew 0, payload 4 bytes; reads were pure reads, writes were read-modify-writes. (a) Read-only workload. (b) Read-write YCSB, 224 threads (throughput). (c) Read-write YCSB, 224 threads (abort ratio).
Our first experiment was aimed at reproducing TicToc results. We experimented with the settings of Figs. 6, 7, and 8 in the original paper [76]. We set the epoch interval longer than 40 ms so that cache-line conflicts on the global epoch rarely occurred in Silo. The experimental results are shown in Figs. 2a, 2b, and 2c. TicToc uses the NoWait optimization instead of NoWaitTT; note that hereafter, TicToc means TicToc with NoWait attached instead of NoWaitTT. Torg is the original TicToc with all the optimizations, including NoWaitTT. T+BO is TicToc with AdaptiveBackoff.

Fig. 2a shows that Silo and TicToc had comparable performance for read-only workloads, consistent with the results shown in Fig. 6 of the original paper. Fig. 2b shows that Silo and TicToc both scale in performance. This is also consistent with the results shown in Fig. 7 of the original paper. The results for DBx1000 are also shown in the figures, and their behaviors were similar to those of CCBench. DBx1000 outperformed CCBench, which we attribute to the difference in access methods: a hash index was used for DBx1000 while Masstree was used for CCBench, and hash indexes are typically faster than tree indexes. We determined that CCBench without indexing had 30% higher performance (39.6 Mtps) than DBx1000 (30 Mtps). As we have already stated, the effect of the access method must be considered for a fair comparison.

One finding is particularly noteworthy. In Fig. 2c, the behaviors of Silo and TicToc differ from those in Fig. 8 of the original paper [76]. The performance of TicToc improved as the number of threads was increased, but when the number exceeded a certain value, the performance started to deteriorate due to an increase in read/write conflicts. We therefore presumed that some sort of backoff was applied to TicToc in the original paper, and attached AdaptiveBackoff to TicToc. This is shown as T+BO in Fig. 2c. The performance curve for T+BO is similar to that in Fig. 9 of the original paper. The lower performance of the original TicToc protocol, shown as Torg, is attributed to excessive delays and read validation failures. NoWaitTT inserts fixed delays into the validation phase, which tends to result in read validation failure. Therefore, Torg shows low performance in this graph, and TicToc shows better performance due to its shorter read phase. The results for DBx1000 are also plotted in the figures, and their behaviors are similar to our results.

Fig. 2a shows that Silo and TicToc exhibited linear scalability for a read-only workload, and Fig. 6 of the original paper shows almost linear scalability. Fig. 2b above and Fig. 7(a) of the original paper show almost the same results. In summary, CCBench tended to closely reproduce the results for TicToc, and the results of CCBench are consistent with those of DBx1000.
The second experiment was aimed at reproducing the MOCC results [69]. Our MOCC implementation differs from the original one in terms of temperature and MQL, as described in § . When MOCC reads a record, it accesses the corresponding temperature information and checks whether the temperature is high. The results in Figs. 3b and 3c closely reproduce those in Fig. 7 in the original paper. When the write ratio increased from 0% to 10%, the abort rate rapidly increased, and the performance deteriorated. This is because the write operation produces a large number of read validation failures. As shown in Fig. 3b, the performance of Silo deteriorated at a write rate of 0-40%, while it improved at a write rate of 40-60%. There is thus a trade-off between reduced record-access contention due to reduced write operations and frequent read validation failures due to increased read operations. In contrast, MOCC showed stable performance, apparently because aborts are less likely to occur thanks to temperature control. MOCC and Silo exhibited scalability, as shown in Fig. 3a, while MOCC and FOEDUS exhibited scalability in the original paper [69] (Fig. 6). Fig. 3b shows the throughputs of MOCC and Silo for various write ratios; they are almost the same as those in Fig. 7 in the original paper. The abort ratios of Silo and MOCC shown in Fig. 3c are consistent with those in Fig. 7 in the original paper. In summary, CCBench closely reproduced the results for MOCC.

Figure 4: Reproduction of Cicada experiment. 10 M records, payload 100 bytes. (a) Write-intensive: 16 requests/trans, skew 0.99, 50% RMW, 50% pure reads, varying threads. (b) Write-intensive: 28 threads, 16 requests/trans, 50% RMW, 50% pure reads, varying skew. (c) Read-intensive: 28 threads, 16 requests/trans, 5% RMW, 95% pure reads, varying skew. (d) Read-intensive: 1 request/trans, skew 0.99, 5% RMW, 95% pure reads, varying threads.
Our third experiment was aimed at reproducing the Cicada results. The workload parameters were set almost the same as those in the original paper [46]; the details of the workload are described in the caption of Fig. 4. The differences between this paper and the original paper are as follows: (1) we redesigned and reimplemented all protocols; (2) AdaptiveBackoff was implemented in all protocols because it is easily attachable and effective for performance improvement. Figs. 4a, 4b, 4c, and 4d present the experimental results reproducing those in Figs. 6a, 6b, 6c, and 7 in the original paper, respectively. In Figs. 4b, 4c, and 4d, the tendencies in our results for Silo, TicToc, MOCC, ERMIA, and 2PL are the same as those in the corresponding figures in the original paper. In contrast, the results for Cicada differ. Fig. 4d shows that Cicada had worse performance than Silo and TicToc for a read-intensive workload, which is inconsistent with the results shown in Fig. 7 in the original paper. We discuss this inconsistency in Insight 5 (§ ).
5. ANALYSIS OF CACHE
Here, we discuss two effects of the cache. Cache-line conflict occurs when many worker threads access the same cache line and at least some of them write to it (e.g., when accessing a single shared counter). The cache line becomes occupied by a single thread, and the other threads are internally blocked. Cache-line replacement occurs when transactions access a large amount of data (e.g., high cardinality, low skew, or large payload). The data accesses tend to replace the contents of the L1/L2/L3 cache.
Some protocols use centralized ordering (e.g., a shared counter) [41, 43] while others use decentralized ordering [46, 65, 69, 76]. We first analyzed the negative effect of centralized ordering. The results are shown in Fig. 5. Since the results for YCSB-A have the same tendency as those for YCSB-B, they are omitted. Fig. 5 shows that the throughputs of ERMIA and SI did not scale although their scheduling spaces are wider than those of single-version concurrency control (SVCC) serializable protocols. This is because both protocols depend on a centralized data structure, which causes cache-line conflicts, which in turn degrade performance. This is consistent with previous findings [65, 69]. To investigate the cause of the inefficiency, we measured the throughput of fetch_add. The results are shown in Fig. 6. The L3 cache miss rate was measured using the Linux perf tool. Fig. 6a shows that one thread exhibited the best performance. This is because frequent reads/writes to the same memory address by multiple threads cause many cache-line conflicts. Fig. 6b shows that the L3 cache miss rate increased with the number of threads. Cache-line conflicts often result in L3 cache misses and longer communication delays because while one CPU core uses a cache line, other cores running in different CPU sockets also require the line. In the experimental setting, 28 threads used one socket, 56 threads used two sockets, and 112 or more threads used four sockets. Thus, the higher L3 cache miss rate at larger thread counts further degraded the throughput of fetch_add. This means that centralized ordering should be avoided.

We now discuss the results for YCSB-B. In Figs. 5a and 5c, as cardinality increased, (1) the throughput first improved and then deteriorated, and (2) the abort ratio monotonically improved. When cardinality was quite low, the footprint of accessible records in the database was small, so all of the records fit inside the L3 cache, and contention was more likely in the L1 or L2 cache.
As cardinality increases, such contention is alleviated because the number of accessible records increases, and performance improves. With still higher cardinality, the total number of records starts to overflow the L3 cache. This results in L3 cache misses, and consecutive accesses to a remote socket's cache or to DRAM degrade performance.

Figure 5: Varying cardinality: payload 4 bytes, 224 threads, skew 0, 10 ops/trans. (a) YCSB-B throughput. (b) YCSB-C throughput. (c) YCSB-B abort ratio. (d) YCSB-C cache-miss ratio.

Figure 6: Scalability of fetch_add. (a) Throughput. (b) Cache-miss ratio.

We investigated the strong effect of the L3 cache miss ratio. As shown in Fig. 5d, when cardinality was no more than 10^5, the L3 cache miss ratios for Silo, TicToc, and Cicada were almost zero. This is because the L3 cache size was 38.5 MB and the size of a record including the header was 64 bytes (64 B x 10^5 = 6.4 MB < 38.5 MB). However, as cardinality increased, their throughputs decreased linearly, as shown in Fig. 5b; this is attributed to L1/L2 cache misses. At higher cardinalities, we observed that L3 cache misses affected performance even for read-only workloads. The negative effect of read contention is shown in Figs. 1 and 6 in the original paper [69]. The cache-line conflicts that occurred due to read lock contention degraded performance, and the difference in performance between OCC and 2PL rapidly increased as the number of threads increased. However, there is no mention in that paper that the greater the number of L3 cache misses for a read-only workload, the smaller the performance difference between OCC and 2PL. As shown in Fig. 5b, at the highest cardinalities, Silo, TicToc, and 2PL had almost the same throughput. Moreover, their L3 cache miss ratios were almost the same, as shown in Fig. 5d. For read-only workloads, the differences in performance among protocols converged due to L3 cache misses when the database size was large. This has not been reported so far.

Insight 1: Even if the entire database is within the L3 cache, as cardinality increases, (1) OCC for read-intensive workloads improves due to a decrease in L1/L2 cache-line conflicts, and (2) OCC for read-only workloads deteriorates due to an increase in L1/L2 cache-line replacements. If the entire database slightly overflows the L3 cache, the performances of the protocols diverge. If the size of the entire database is much larger than the L3 cache, the performances of Silo, TicToc, MOCC, and 2PL converge due to frequent L3 cache-line replacements.
Updating shared data among threads running in different CPU sockets triggers L3 cache misses, mainly due to cache-line conflicts, leading to performance degradation. The read operation in Silo does not update the corresponding metadata, which prevents cache-line conflicts. This read method is referred to as InvisibleReads [48]. It effectively improves the performance of protocols for read-intensive workloads; however, its behavior had not been explored in detail. The read operations of TicToc typically update the corresponding metadata (i.e., the read timestamp), so they are not InvisibleReads.

To investigate the details of InvisibleReads, we measured the performance of Silo and TicToc for various read ratios and cardinalities. The results are shown in Figs. 7a-7c. Fig. 7a shows that TicToc-1K (where 1K means 1000 records) outperformed Silo-1K for low-cardinality, write-intensive workloads, even though the abort ratio of TicToc-1K was much worse than that of Silo-1K, as shown in Fig. 7b. As shown in Fig. 7a, in the read-mostly (90%) case, Silo-1K outperformed TicToc-1K, as evidenced by the rapid increase in TicToc's abort ratio (Fig. 7b). The InvisibleReads method contributed to the performance improvement in this case. For high-cardinality workloads, Silo-1M (where 1M means 1 million records) always outperformed TicToc-1M. The abort ratios of Silo-1M and TicToc-1M were almost zero, as shown in Fig. 7b. This result seems mysterious: the abort ratios for Silo and TicToc were almost the same, and their L3 cache miss ratios were almost the same, as shown in Fig. 7c, yet Silo-1M outperformed TicToc-1M. This is because contention resolution in TicToc requires three functions: timestamp history management, early abort judgment, and read timestamp update. These functions require executing additional instructions and thus degrade performance. If InvisibleReads is woven into the process, a protocol has difficulty exploiting them.
Insight 2: For a detailed analysis of protocols, contention should be carefully considered. The occurrence of contention depends not only on the read ratio but also on cardinality and skew. We observed that Silo outperformed TicToc not only for read-only workloads but also for some write-intensive workloads. This fact has not been reported in previous studies [23, 46, 76]. These results are due to the InvisibleReads method, which prevents cache misses in low-contention cases. The efficiency of InvisibleReads comes from its simple design, which does not require computations for conflict resolution or anti-dependency tracking.

Figure 7: Analysis of InvisibleReads: 224 threads, payload 4 bytes, skew 0, 10 ops/trans, cardinality 10^3 or 10^6 records. (a) Throughput. (b) Abort ratio. (c) Cache-miss ratio.
Figure 8: Wait or NoWait: YCSB-B, 100 M records, 224 threads, payload 4 bytes, 10 ops/trans. (a) Throughput. (b) Abort ratio.

Figure 9: Impact of transaction size: YCSB-B, 100 M records, 224 threads, payload 4 bytes, skew 0.8. (a) Throughput. (b) Abort ratio.
6. ANALYSIS OF DELAY

6.1 Effect of NoWait
A method for mitigating performance degradation caused by conflicts is to place an additional artificial delay before a retry, as done in NoWaitTT [76] and AdaptiveBackoff [46]. In contrast to waiting for a lock release, the NoWait method immediately releases all acquired locks when a conflict is detected and immediately retries. The question in dealing with contention, then, is whether to choose NoWait or Wait.

To answer this question, we compared the performance of three methods for various degrees of skew: 2PL-Wait, 2PL-NoWait, and MOCC (a mix of OCC and 2PL-Wait). All operations of each transaction for 2PL-Wait were sorted in advance to avoid deadlock. As shown in Fig. 8a, the performance of all three protocols started to degrade at a skew of 0.7 due to a large number of conflicts. MOCC demonstrated excellent performance, as shown in Fig. 8a, because it manages fewer lock objects for workloads with less skew. 2PL-Wait and 2PL-NoWait showed similar performance at skews of 0.60-0.85, so the cost of abort-retry and the cost of waiting for a lock were similar. We therefore needed more contention to answer the question.

To provide more contention, we increased the transaction size. The cost of aborts should be smaller for short transactions and larger for long ones. Therefore, with long transactions, the abort-retry cost is higher and may exceed the lock-waiting cost. Fig. 8 shows that the lock-waiting cost of Wait and the abort-retry cost of NoWait were similar. Fig. 9a shows that Wait began to outperform NoWait as the transaction size increased. In contrast, Silo-NoWait outperformed Silo(-Wait), as shown in Fig. 10 (described in § ).

Insight 3:
One cannot make a general statement about which is better, NoWait or Wait, since it depends on the situation. As shown in Fig. 9a, Wait begins to outperform NoWait as the transaction size increases. On the other hand, as shown in Fig. 10c, Silo with NoWait comes to outperform the original Silo. This suggests the importance of adaptive behavior. Although adaptive concurrency controls have been studied [37, 55, 59, 69], their importance has not been discussed in depth with regard to optimizations, except for backoff [46]. Studies on adaptive optimization methods remain on the frontier.
We analyzed the effect of payload size, which had not previously been done, to the best of our knowledge. Fig. 10a shows the relationship between throughput and payload size for Silo and TicToc. Throughput initially increased with payload size and then started to decrease at a certain point. An increase in payload size would likely degrade OCC performance, which is consistent with the throughput decrease shown in the right half of the graph. However, the throughput increase shown in the left half of the graph is counter-intuitive.

We hypothesized that this increase was due to the delay caused by the extra-read process, in which a record is read two or more times if a concurrent update transaction interrupts the read. An increase in payload size lengthens the time needed to read the payload. This increases the probability of an interruption by an update transaction, which increases the number of extra reads. This behavior extends the read phase and reduces the number of transactions that run through the validation phase in parallel. As the number of committed update transactions decreases, the number of read validation failures also decreases, which leads to throughput improvement. Delaying the retry of atomic reads reduces the number of concurrent worker threads involved in the validation phase for the entire system, which also reduces contention.

Figure 10: Effect of the delay provided by extra reads: YCSB-A, 100 M records, 224 threads, skew 0.9, 10 ops/trans. (a) Throughput. (b) Extra reads. (c) ReadPhaseExtension.

Besides
Wait, NoWait, and AdaptiveBackoff, we present a novel delay-related optimization method, ReadPhaseExtension, which was inspired by the positive effect of extra reads. Comparing Figs. 10a and 10b reveals that the curves are similar, which indicates that throughput and the number of extra reads are correlated. If this is correct, adding an artificial delay to the read phase should produce similar results. We conducted such an experiment for a payload size of 4 bytes. The result in Fig. 10c shows that a certain amount of additional delay (less than 9000 clocks) improved performance. This effect is similar to that of the extra-read process. We refer to this artificial delay as ReadPhaseExtension and define it as a new optimization method. ReadPhaseExtension is configured by setting the delay on the basis of local conflict information. This optimization method can exploit information on the access conflicts for each record during the read phase, whereas AdaptiveBackoff uses only global information across all worker threads.
Insight 4: The extra-read process plays a key role in the performance of OCC protocols. It is known that the contention regulation caused by delay can contribute to performance improvement depending on the situation. A remarkable finding here is that ReadPhaseExtension, inspired by the extra-read process, can also improve performance. ReadPhaseExtension differs from NoWaitTT in that it can exploit information on conflicting records inside transactions to adjust the delay, whereas the delay in NoWaitTT is fixed. The additional delay in the read phase and the use of thread-local conflict information combine to create a unique optimization method.
7. ANALYSIS OF VERSION LIFETIME

7.1 Determining Version Overhead
The life of a version begins when a corresponding write operation creates it. The version state is called visible during the period when other transactions can read it; otherwise, the version state is called non-visible. Recent CC protocols typically make the version visible at the pre-commit phase [40, 65]. After a certain period, the life of the version ends, and it is made non-visible. SVCC protocols typically make a version by overwriting its previous version, with the former becoming visible and the latter becoming non-visible at the same time. MVCC protocols typically make a version on a different memory fragment from other visible versions of the same record; therefore, a version's life does not end unless GC does its work.

The performance of MVCC is thought to suffer from the existence of many visible versions [45, 54]. They impose a significant burden due to memory exhaustion or an increase in cache-line replacements for traversal or installation. MVCC protocols can provide higher performance than SVCC ones due to their larger scheduling spaces. It should be noted that architectural factors sometimes negate the benefits, as mentioned in § . AssertiveVersionReuse (§ ) and the RapidGC optimization effectively mitigate this problem. To evaluate the efficiency of RapidGC, we varied the interval setting and measured the throughput of Cicada.

Figure 11: Effect of workload skew: 100 M records, 224 threads, payload 4 bytes, 10 ops/trans. (a) YCSB-A. (b) YCSB-A (high skew). (c) YCSB-B. (d) YCSB-B (high skew).

Figure 12: Latency breakdown: skew 0, 100 M records, 224 threads, payload 4 bytes, 10 ops/trans. (a) YCSB-A. (b) YCSB-B.

Figure 13: Latency breakdown of Cicada-SV and Silo for a partitioned workload. Other settings were the same as for Fig. 12.

The legend and the 100 us line in Fig. 14 show that high throughputs were obtained when the interval was no more than 100 us, as reported in prior studies [46, 72]. This is because the number of visible versions was small, and most of them were kept in cache. So, can RapidGC with its short intervals perform efficiently for any workload? The answer is no. We explain the reason why in § .

Insight 5:
The overhead of multi-version management is not negligible. Silo and TicToc outperformed Cicada in high-skew (0.99), read-intensive (YCSB-B), non-RMW, high-cardinality (100 M records) cases. A previous study [46] found that all three exhibited similar performance for a similar workload (skew 0.99, read 95%, RMW, 10 M records, payload 100 bytes), as shown in Fig. 7 in the original paper. This inconsistency is due to the use of different platforms in the previous study. Using a single platform, we observed the difference and found that the version management cost of Cicada is not negligible even for a low-contention case, as shown in Figs. 12a and 12b. It is difficult to obtain precise knowledge using different reference implementations or platforms, and deep analysis of CC protocols must be done on a single platform.

Figure 14: Analysis of RapidGC: YCSB-A, skew 0, 224 threads, 1 M records, payload 4 bytes, 10 ops/trans. Inserted delays for long transactions: 0 to 10 ms.
We investigated the behavior of RapidGC using the same workloads as in our third experiment, with a long transaction added. Even state-of-the-art GC does not sufficiently reduce the number of visible versions if there is even a single long transaction. This is because modern protocols assume that transaction execution is one-shot and that transactions are relatively short (e.g., 16 operations for YCSB). Long transactions other than read-only ones have been ignored in modern transaction studies.

To generate a long transaction, we added an artificial delay at the end of the read phase. Both long and short transactions used the same number of operations with the same read-write ratio. One worker thread executed the long transaction, and the remaining worker threads executed the short transactions. The skew was set to 0 so that contention in record accesses rarely occurred and thus did not affect performance, even though there was a long transaction.

We measured the performance of Cicada under the workload described above, varying the RapidGC interval settings and the delay added to the long transaction. As shown in Fig. 14, performance saturated when a delay was inserted; saturation occurred when the GC interval was the same as the added delay. For example, the light blue line includes a long transaction with a 1 ms delay, and performance saturated when the GC interval was 1 ms. Similar behaviors were observed with longer delays. This is because current GC methods do not collect visible versions that may be read by active long transactions; the current GC scheme does not collect the versions until the transaction finishes. We consider this limitation the primary hurdle to improving MVCC performance by reducing the number of visible versions. We could not obtain results for a 1 s delay because such a delay requires a huge amount of memory, which causes the Linux OOM (out-of-memory) killer to kill the process.
Given the limitation of the current GC scheme described above, we suggest a novel GC scheme, AggressiveGC, that aggressively collects versions beyond the current constraint to deal with long transactions. For example, the multi-version timestamp ordering (MVTO) protocol could be integrated with a GC method that aggressively collects visible versions. It could make some versions non-visible even though active or future transactions need to read them. Such a protocol might incur read operation failures, unlike conventional MVTO, which could be handled by aborting the transaction and retrying it with a new timestamp. Restricting the number of versions has been discussed for kVSR [50] and 2V2PL [25, 60]. However, only the latest contiguous versions are kept there, so this approach is less flexible than our suggested scheme. We claim that the visible versions do not need to be contiguous and that the number of versions can be flexible depending on the context. An interesting topic in our proposed scheme is the risk of starvation. One way to mitigate this problem is to manage priorities among transactions, as in wait-die [57], which has not yet been discussed in modern MVCC studies.

We suggest two optimizations for write operations in terms of aggressive protocols. The first is version overwriting, i.e., creating a new version by overwriting the memory segment of the previous version, which becomes non-visible at the same time, as is done in SVCC protocols. Version overwriting is efficient because two operations are combined into one. The second is non-visible write, i.e., making versions non-visible from the beginning of their life. The idea of non-visible write was originally proposed as the Thomas write rule [64] and recently generalized as the non-visible write rule (NWR) [52] to deal with blind writes. Novel version lifetime management is a promising way to improve MVCC protocols.
Insight 6: Even state-of-the-art GC cannot hold down the number of versions in MVCC protocols if a single long transaction is mixed into the workload. An aggressive approach can solve this problem by changing version states to non-visible, for both reads and writes, even if transactions still require those versions.
8. RELATED WORK
Yu et al. [75] evaluated CC protocols using DBx1000 [7], which is open source. They evaluated the scalability of CC protocols using a real machine and an emulator. Three recent protocols [43, 46, 69] supported in CCBench are not included in DBx1000 [7]. Wu et al. [72] empirically evaluated MVCC protocols using Peloton [12], which is open source. They evaluated not only scalability but also the contention effect, read ratio, attributes, memory allocation, and index. They did not evaluate SVCC or the modern MVCC protocols evaluated in this paper [46, 65, 69, 76]. Appuswamy et al. [23] evaluated CC protocols on four types of architecture using Trireme, which is not open source. They determined that the shared-everything architecture is still the best option for contention-tolerant in-memory transaction engines. CC protocols for distributed systems have been evaluated elsewhere [38, 66]; this paper focuses on a single many-core architecture.

Whereas previous studies mostly evaluated scalability and did not explore the behavior of protocols when thread parallelism was set to a high degree [30, 32, 33, 35, 37, 39, 43, 44, 46, 56, 63, 65, 69, 70, 73, 74, 76, 77], we fixed the thread parallelism at 224 and analyzed protocols for various settings. We classified a variety of optimization methods on the basis of three performance factors: cache, delay, and version lifetime. This analysis let us identify three new optimization methods.
9. CONCLUSION
Using CCBench, we analyzed concurrency control protocols and optimization methods for various settings of the workload parameters with the number of threads fixed at 224, whereas previous studies mostly focused on thread scalability and none of them explored the space we analyzed. We classified versatile optimization methods on the basis of three performance factors: cache, delay, and version lifetime. Through the analysis of protocols with CCBench, we gained six insights. I1: The performance of optimistic concurrency control for a read-only workload rapidly degrades as cardinality increases even without L3 cache misses. I2: Silo outperforms TicToc for write-intensive workloads in unskewed, high-cardinality cases, which is attributed to InvisibleReads. I3: The effectiveness of the two approaches to coping with conflict (Wait and NoWait) depends on the situation. I4: Extra reads can regulate contention. I5: Results produced from mixed implementations may be inconsistent with the theory. I6: Even a state-of-the-art garbage collection method, RapidGC, cannot improve the performance of multi-version concurrency control if a single long transaction is mixed into the workload. On the basis of I4, we defined the ReadPhaseExtension optimization, in which an artificial delay is added to the read phase. On the basis of I6, we defined the AggressiveGC optimization, in which even visible versions are collected.

In future work, we plan to support the TPC-C full-mix and to include logging and recovery modules based on our preliminary studies [51, 61]. The code for CCBench and all the data in this paper are available online at GitHub [5]. We expect that CCBench will help to advance transaction processing research.
Acknowledgments
We thank Masahiro Tanaka, Jun Nemoto, Takuma Ojiro, Kosei Masumura, and Sho Nakazono for thoughtful support on drafts. We are grateful to the anonymous reviewers for their valuable feedback. This work was supported by JST JPMJCR1414, JSPS 19H04117, and the New Energy and Industrial Technology Development Organization (NEDO).
10. REFERENCES

[1] CCBench Developer Guide. https://github.com/thawk105/ccbench/tree/master/cc_format.
[2] CCBench Experimental Data. https://github.com/thawk105/ccdata.
[3] CCBench OCC. https://github.com/thawk105/ccbench/tree/master/occ.
[4] Code of Cavalia. https://github.com/Cavalia.
[5] Code of CCBench. https://github.com/thawk105/ccbench.
[6] Code of Cicada. https://github.com/efficient/cicada-engine.
[7] Code of DBx1000. https://github.com/yxymit/DBx1000.
[8] Code of DBx1000 Extension for Cicada. https://github.com/efficient/cicada-exp-sigmod2017-DBx1000.
[9] Code of ERMIA. https://github.com/ermia-db/ermia.
[10] Code of FOEDUS. https://github.com/hkimura/foedus_code.
[11] Code of Masstree. https://github.com/kohler/masstree-beta.
[12] Code of Peloton. https://pelotondb.io.
[13] Code of SGT. https://github.com/durner/No-False-Negatives.
[14] Code of Silo. https://github.com/stephentu/silo.
[15] Code of STO. https://readablesystems.github.io/sto.
[16] gflags. https://github.com/gflags/gflags.
[17] How to Extend CCBench. https://medium.com/@jnmt.
[18] Masstree Debug about Cast. https://github.com/thawk105/masstree-beta/commit/d4bcf7711dc027818b1719a5a4c29aee547c58f6.
[19] Masstree Debug about string slice.hh. https://github.com/thawk105/masstree-beta/commit/77ef355f868a6db4eac7b44669c508d8db053502.
[20] Masstree when Controlling Key Length is 9. https://github.com/kohler/masstree-beta/issues/42.
[21] mimalloc. https://github.com/microsoft/mimalloc.
[22] The Transaction Processing Council. TPC-C Benchmark (Revision 5.11), February 2011.
[23] R. Appuswamy, A. G. Anadiotis, D. Porobic, M. Iman, and A. Ailamaki. Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-Contention Workloads. PVLDB, 11(2):121–134, 2017.
[24] T. Bang, N. May, I. Petrov, and C. Binnig. The Tale of 1000 Cores: An Evaluation of Concurrency Control on Real(ly) Large Multi-Socket Hardware. In DaMoN, 2020.
[25] R. Bayer, H. Heller, and A. Reiser. Parallelism and Recovery in Database Systems. ACM TODS, 5(2):139–156, 1980.
[26] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A Critique of ANSI SQL Isolation Levels. In SIGMOD Record, volume 24, pages 1–10, 1995.
[27] P. A. Bernstein and N. Goodman. Concurrency Control in Distributed Database Systems. ACM Comput. Surv., 13(2):185–221, 1981.
[28] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. 1987.
[29] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. RadixVM: Scalable Address Spaces for Multithreaded Applications. In EuroSys, pages 211–224, 2013.
[30] M. Dashti, S. Basil John, A. Shaikhha, and C. Koch. Transaction Repair for Multi-Version Concurrency Control. In SIGMOD Conf., pages 235–250, 2017.
[31] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux. OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases. PVLDB, 7(4):277–288, 2013.
[32] B. Ding, L. Kot, and J. Gehrke. Improving Optimistic Concurrency Control through Transaction Batching and Operation Reordering. PVLDB, 12(2):169–182, 2018.
[33] D. Durner and T. Neumann. No False Negatives: Accepting All Useful Schedules in a Fast Serializable Many-Core System. In ICDE, pages 734–745, 2019.
[34] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The Notions of Consistency and Predicate Locks in a Database System. Comm. ACM, 19(11):624–633, 1976.
[35] J. M. Faleiro, D. J. Abadi, and J. M. Hellerstein. High Performance Transactions via Early Write Visibility. PVLDB, 10(5):613–624, 2017.
[36] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. In SIGMOD Record, volume 23, pages 243–252, 1994.
[37] J. Guo, P. Cai, J. Wang, W. Qian, and A. Zhou. Adaptive Optimistic Concurrency Control for Heterogeneous Workloads. PVLDB, 12(5):584–596, 2019.
[38] R. Harding, D. Van Aken, A. Pavlo, and M. Stonebraker. An Evaluation of Distributed Concurrency Control. PVLDB, 10(5):553–564, 2017.
[39] Y. Huang, W. Qian, E. Kohler, B. Liskov, and L. Shrira. Opportunities for Optimism in Contended Main-Memory Multicore Transactions. PVLDB, 13(5):629–642, 2020.
[40] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Aether: A Scalable Approach to Logging. PVLDB, 3(1-2):681–692, 2010.
[41] H. Jung, H. Han, A. Fekete, U. Röhm, and H. Y. Yeom. Performance of Serializable Snapshot Isolation on Multicore Servers. In DASFAA, pages 416–430, 2013.
[42] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything You Always Wanted to Know about Compiled and Vectorized Queries but Were Afraid to Ask. PVLDB, 11(13):2209–2222, 2018.
[43] K. Kim, T. Wang, R. Johnson, and I. Pandis. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD Conf., pages 1675–1687, 2016.
[44] H. Kimura. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In SIGMOD Conf., pages 691–706, 2015.
[45] P. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB, 5(4):298–309, 2011.
[46] H. Lim, M. Kaminsky, and D. G. Andersen. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In SIGMOD Conf., pages 21–35, 2017.
[47] Y. Mao, E. Kohler, and R. T. Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In EuroSys, pages 183–196, 2012.
[48] V. J. Marathe, W. N. Scherer, and M. L. Scott. Design Tradeoffs in Modern Software Transactional Memory Systems. In LCR, pages 1–7, 2004.
[49] J. M. Mellor-Crummey and M. L. Scott. Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors. In
SIGPLAN Notices , volume 26,pages 106–113, 1991.[50] T. Morzy. The Correctness of Concurrency Control forMultiversion Database Systems with Limited Numberof Versions. In
ICDE , pages 595–604, 1993.[51] Y. Nakamura, H. Kawashima, and O. Tatebe.Integration of TicToc Concurrency Control Protocolwith Parallel Write Ahead Logging Protocol.
Journalof Network Computing , 9(2):339–353, 2019.[52] S. Nakazono, H. Uchiyama, Y. Fujiwara,Y. Nakamura, and H. Kawashima. NWR: RethinkingThomas Write Rule for Omittable Write Operations. http://arxiv.org/abs/1904.08119 , 2020.[53] N. Narula, C. Cutler, E. Kohler, and R. Morris. PhaseReconciliation for Contended In-MemoryTransactions. In
OSDI , pages 511–524, 2014.[54] T. Neumann, T. M¨uhlbauer, and A. Kemper. FastSerializable Multi-Version Concurrency Control forMain-Memory Database Systems. In
SIGMOD Conf. ,pages 677–689, 2015.[55] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma,P. Menon, T. C. Mowry, M. Perron, I. Quah,S. Santurkar, A. Tomasic, S. Toor, D. V. Aken,Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-DrivingDatabase Management Systems. In
CIDR , 2017.[56] G. Prasaad, A. Cheung, and D. Suciu. HandlingHighly Contended OLTP Workloads Using FastDynamic Partitioning. In
SIGMOD Conf. , pages527–542, 2020.[57] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis.System Level Concurrency Control for DistributedDatabase Systems.
ACM TODS , 3(2):178–198, 1978.[58] M. L. Scott and W. N. Scherer. Scalable Queue-basedSpin Locks with Timeout. In
SIGPLAN Notices ,volume 36, pages 44–52, 2001.[59] Y. Sheng, A. Tomasic, T. Zhang, and A. Pavlo.Scheduling OLTP Transactions via Learned AbortPrediction. In aiDM , 2019.[60] R. E. Stearns and D. J. Rosenkrantz. DistributedDatabase Concurrency Controls Using Before-Values.In
SIGMOD Conf. , pages 74–83, 1981. [61] T. Tanabe, H. Kawashima, and O. Tatebe. Integrationof Parallel Write Ahead Logging and CicadaConcurrency Control Method. In
BITS , pages291–296, 2018.[62] D. Tang and A. J. Elmore. Toward coordination-freeand reconfigurable mixed concurrency control. In
ATC , pages 809–822, 2018.[63] D. Tang, H. Jiang, and A. J. Elmore. AdaptiveConcurrency Control: Despite the Looking Glass, OneConcurrency Control Does Not Fit All. In
CIDR ,2017.[64] R. H. Thomas. A Majority Consensus Approach toConcurrency Control for Multiple Copy Databases.
ACM TODS , 4(2):180–209, 1979.[65] S. Tu, W. Zheng, E. Kohler, B. Liskov, andS. Madden. Speedy Transactions in Multicorein-Memory Databases. In
SOSP , pages 18–32, 2013.[66] C. Wang, K. Huang, and X. Qian. A ComprehensiveEvaluation of RDMA-enabled Concurrency ControlProtocols. http://arxiv.org/abs/2002.12664 , 2020.[67] T. Wang, R. Johnson, A. Fekete, and I. Pandis. TheSerial Safety Net: Efficient Concurrency Control onModern Hardware. In
DaMoN , 2015.[68] T. Wang, R. Johnson, A. Fekete, and I. Pandis.Efficiently Making (Almost) Any Concurrency ControlMechanism Serializable.
VLDB Journal ,26(4):537–562, 2017.[69] T. Wang and H. Kimura. Mostly-OptimisticConcurrency Control for Highly Contended DynamicWorkloads on a Thousand Cores.
PVLDB ,10(2):49–60, 2016.[70] Z. Wang, S. Mu, Y. Cui, H. Yi, H. Chen, and J. Li.Scaling Multicore Databases via Constrained ParallelExecution. In
SIGMOD Conf. , pages 1643–1658, 2016.[71] G. Weikum and G. Vossen.
Transactional InformationSystems . Elsevier, 2001.[72] Y. Wu, J. Arulraj, J. Lin, R. Xian, and A. Pavlo. Anempirical evaluation of in-memory multi-versionconcurrency control.
PVLDB , 10(7):781–792, 2017.[73] Y. Wu, C.-Y. Chan, and K.-L. Tan. TransactionHealing: Scaling Optimistic Concurrency Control onMulticores. In
SIGMOD Conf. , pages 1689–1704, 2016.[74] Y. Wu and K.-L. Tan. Scalable In-MemoryTransaction Processing with HTM. In
ATC , pages365–377, 2016.[75] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, andM. Stonebraker. Staring into the Abyss: AnEvaluation of Concurrency Control with OneThousand Cores.
PVLDB , 8(3):209–220, 2014.[76] X. Yu, A. Pavlo, D. Sanchez, and S. Devadas. Tictoc:Time Traveling Optimistic Concurrency Control. In
SIGMOD Conf. , pages 1629–1642, 2016.[77] Y. Yuan, K. Wang, R. Lee, X. Ding, J. Xing,S. Blanas, and X. Zhang. BCC: Reducing False Abortsin Optimistic Concurrency Control with Low Cost forin-Memory Databases.
PVLDB , 9(6):504–515, 2016.able 3: Supported TPC-C transactions and execution plat-forms in modern CC studies for a single node.
The Type column indicates the type of research. Papers with Exp. describe analytical studies, and those with Pro. propose new protocols or optimization methods. The TPC-C column indicates the type of TPC-C transactions supported in the paper. NP indicates NewOrder and Payment. Full indicates the full-mix; α includes an original StockScan transaction in addition to the full-mix; β includes an original Reward transaction in addition to the NP; φ indicates no TPC-C evaluation. The System column shows the evaluation system used in the paper. No citation indicates that the code seems to be publicly unavailable.

Paper               | Type | TPC-C | System
MVCC Eval. [72] (α) | Exp. | Full  | Peloton [12]
CCBench             | Exp. | NP    | CCBench [5]
Abyss [75]          | Exp. | NP    | DBx1000 [7]
1000 Cores [24]     | Exp. | NP    | DBx1000 [7]
Trireme [23]        | Exp. | φ     | Original
Repair [30]         | Pro. | Full  | Cavalia [4]
Healing [73]        | Pro. | Full  | Cavalia [4]
HTCC [74]           | Pro. | Full  | Cavalia [4]
Cicada [46]         | Pro. | Full  | Cicada [6]
ACC [63]            | Pro. | Full  | Doppel [53]
Latch-free SSN [68] | Pro. | Full  | ERMIA [9]
FOEDUS [44]         | Pro. | Full  | FOEDUS [10]
MOCC [69]           | Pro. | Full  | FOEDUS [10]
Silo [65]           | Pro. | Full  | Silo [14]
IC3 [70]            | Pro. | Full  | Silo [14]
STOv2 [39]          | Pro. | Full  | STO [15]
CormCC [62]         | Pro. | Full  | Original
TicToc [76]         | Pro. | NP    | DBx1000 [7]
Strife [56]         | Pro. | NP    | DBx1000 [7]
AOCC [37] (β)       | Pro. | NP    | DBx1000 [7]
EWV [35]            | Pro. | NP    | Original
BCC [77]            | Pro. | NP    | Silo [14]
Batching [32]       | Pro. | φ     | Cicada [6]
SGT [33]            | Pro. | φ     | Original [13]
Doppel [53]         | Pro. | φ     | Original
APPENDIX
A. EVALUATION OF A SUBSET OF TPC-C
The TPC-C benchmark emulates the workload of a wholesale supplier and is the current industry standard for evaluating transaction processing systems [22]. Here, we describe how previous studies and CCBench support TPC-C; present the results of our evaluation, including a comparison with other systems [7, 8]; analyze the effect of index structures on performance; and discuss the transactions that CCBench does not currently support.
A.1 Supported Transactions
TPC-C or its variants have been widely used to evaluate concurrency control protocols. Some studies fully support all five transactions, whereas others support only a subset of them. We summarize the studies in Table 3. Most of the papers with new proposals evaluated the full-mix. There were three experimental papers, among which only Wu et al. [72] evaluated the full-mix, using Peloton [12]. Appuswamy et al. [23] did not evaluate TPC-C. Finally, Yu et al. [75] and Bang et al. [24] evaluated only New-Order and Payment from among the five transactions, using DBx1000 [7].

CCBench supports only two transactions (New-Order and Payment). Among the CC protocols, CCBench currently supports only Silo for TPC-C. Here, we evaluate the New-Order and Payment transactions in CCBench and validate the correctness of its implementation by comparing it with other platforms [7, 8]. An evaluation of the TPC-C full-mix will be included in an updated version of this paper. In this appendix, we denote the subset of the TPC-C workload that includes only the New-Order and Payment transactions as NP, and we evaluate two variants of NP, denoted NP-NoInsert and NP-Insert. Both were supported by DBx1000 and evaluated by Bang et al. [24].

NP-NoInsert does not include any insertion operations. NP originally includes insertions into the Order, Order-Line, and New-Order tables in the New-Order transaction and insertions into the History table in the Payment transaction. However, from the viewpoint of semantics, it is possible to omit these insertions. Records inserted into the History table by the Payment transaction are not handled thereafter (i.e., other transactions do not perform CRUD operations on the History table). Records inserted into the Orders, New-Order, and Order-Line tables by the New-Order transaction are not read, updated, or deleted by the Payment transaction. Therefore, NP can omit all insertion operations.

NP-Insert includes the insertion operations that are omitted in NP-NoInsert. Running NP-NoInsert and NP-Insert requires different functions, which we illustrate in Table 4. As shown in the table, the functions required for NP-NoInsert are the same as those for a part of YCSB. To implement NP-Insert, an insertion operation is necessary.
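The distinction summarized above can be phrased as a minimal operation interface. The following sketch is ours, not CCBench's actual API: NP-NoInsert (like key-wise YCSB) exercises only read and update, while NP-Insert additionally exercises insert; the full-mix would further require delete, range search, and phantom avoidance.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical key-value operation set illustrating Table 4.
// NP-NoInsert needs only Read and Update; NP-Insert adds Insert.
class Table {
 public:
  bool Read(uint64_t key, std::string* val) const {
    auto it = rows_.find(key);
    if (it == rows_.end()) return false;
    *val = it->second;
    return true;
  }
  bool Update(uint64_t key, const std::string& val) {
    auto it = rows_.find(key);
    if (it == rows_.end()) return false;  // update never creates a row
    it->second = val;
    return true;
  }
  // Required only by NP-Insert and the full-mix.
  bool Insert(uint64_t key, const std::string& val) {
    return rows_.emplace(key, val).second;  // fails if the key exists
  }

 private:
  std::map<uint64_t, std::string> rows_;
};
```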
A.2 Implementation and Settings
We implemented the TPC-C client code with reference to the original DBx1000 system [7] and an extension of DBx1000 by the Cicada team [8], which we denote by DBx1000(C).

We implemented NP in CCBench as follows. For the access method, we used a Masstree implementation that is available online [11] with some modifications (fixing bugs regarding a cast [18] and long keys of more than 9 bytes [19, 20]). All nine tables were searchable by primary key. The History table did not originally need primary keys, but the table was stored using Masstree; thus, it used dummy primary keys produced by a scalable unique identifier generator that we developed. The primary keys for all tables were encoded into 8 bytes in CCBench.

NP requires a secondary index on the Customer table composed of the c_w_id, c_d_id, and c_last columns. We stored multiple customer primary keys in a std::vector container for each secondary key. The size of a secondary key was at most 20 bytes: 2 bytes for c_w_id, 1 byte for c_d_id, and up to 17 bytes for c_last. Our NP implementation is publicly available on GitHub [5].

The settings of the other platforms were as follows. DBx1000 separates the table data structures from the primary index structures and can omit a primary index for a table if it is unnecessary. We did not use the B+tree index owing to its low performance; instead, we used a hash index for all indexes in our experiments. DBx1000 was configured by omitting unnecessary primary indexes in NP: the Order, New-Order, Order-Line, and History tables had no primary index. The Customer secondary index composed of the (c_w_id, c_d_id, c_last) key was encoded into 8 bytes; a pointer to the records was retrieved with the encoded key through the hash index and accessed through the linked list. DBx1000(C) also separates tables and indexes; it uses the Masstree and the hash index as appropriate.

Figure 15: TPC-C-NP-NoInsert (full scale); throughput [MTPS] vs. thread count for (a) warehouse count = 1, (b) warehouse count = 4, (c) warehouse count = thread count.

Figure 16: TPC-C-NP-NoInsert (small scale); same panels as Fig. 15.

Table 4: Workload and required functions. YCSB' indicates only YCSB-A, B, C, and F, which are key-wise; it does not include insert (required by D) or range query (required by E). Further, NP-NoInsert does not include any insertion operations. NP originally includes insertions into the Order, Order-Line, and New-Order tables in the New-Order transaction and insertions into the History table in the Payment transaction. NP-Insert includes the insertion operations omitted in NP-NoInsert. Phantom avd. indicates phantom avoidance. Each cell denotes whether the corresponding function is required.

Function     | YCSB' | NP-NoInsert | NP-Insert | Full-mix
Insertion    | No    | No          | Yes       | Yes
Deletion     | No    | No          | No        | Yes
Range search | No    | No          | No        | Yes
Phantom avd. | No    | No          | No        | Yes
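The two encodings described above can be sketched as follows. This is our illustrative code, not CCBench's actual implementation: a per-thread unique 8-byte key generator (thread ID in the high bits, a private counter in the low bits, so no synchronization is needed on the hot path), and a fixed-layout Customer secondary key that concatenates 2 bytes of c_w_id, 1 byte of c_d_id, and up to 17 bytes of c_last.

```cpp
#include <cstdint>
#include <string>

// Sketch of a scalable unique 8-byte identifier generator: each thread owns
// a disjoint key space (thread ID in the high 16 bits, a thread-private
// counter in the low 48 bits). CCBench's actual generator may differ.
class UniqueIdGenerator {
 public:
  explicit UniqueIdGenerator(uint16_t thread_id)
      : prefix_(static_cast<uint64_t>(thread_id) << 48), counter_(0) {}
  uint64_t Next() { return prefix_ | counter_++; }

 private:
  uint64_t prefix_;
  uint64_t counter_;  // thread-local, so a plain integer suffices
};

// Sketch of the <= 20-byte Customer secondary key:
// 2 bytes of c_w_id + 1 byte of c_d_id + up to 17 bytes of c_last.
std::string EncodeCustomerSecondaryKey(uint16_t c_w_id, uint8_t c_d_id,
                                       const std::string& c_last) {
  std::string key;
  key.push_back(static_cast<char>(c_w_id >> 8));  // big-endian keeps order
  key.push_back(static_cast<char>(c_w_id & 0xff));
  key.push_back(static_cast<char>(c_d_id));
  key.append(c_last, 0, 17);  // truncate c_last to at most 17 bytes
  return key;                 // at most 2 + 1 + 17 = 20 bytes
}
```

Big-endian byte order in the sketch preserves the numeric order of c_w_id under lexicographic key comparison, which matters for ordered indexes such as Masstree.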
A.3 Analysis
We used the experimental environment described in §. Each worker thread chose a New-Order or Payment transaction at random in a 50:50 ratio each time.
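The per-thread transaction choice described above can be sketched as follows (our illustration; CCBench's actual client loop may differ):

```cpp
#include <cstdint>
#include <random>

enum class TxType { kNewOrder, kPayment };

// Each worker thread draws the next transaction type uniformly at random,
// giving the 50:50 New-Order/Payment mix of the NP workload.
class NpTxChooser {
 public:
  explicit NpTxChooser(uint64_t seed) : rng_(seed), coin_(0, 1) {}
  TxType Next() {
    return coin_(rng_) == 0 ? TxType::kNewOrder : TxType::kPayment;
  }

 private:
  std::mt19937_64 rng_;                      // per-thread PRNG, no sharing
  std::uniform_int_distribution<int> coin_;  // fair coin
};
```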
A.3.1 Varying Thread Count
We evaluated the workloads by varying the thread count.
NP-NoInsert: For DBx1000(C), we set the two parameters TPCC_INSERT_ROWS and TPCC_INSERT_INDEX to false. We commented out the code of the insertion operations for DBx1000. We set the corresponding command-line argument for CCBench to omit insertion operations. The results are shown in Figs. 15 and 16. Fig. 15c shows that all the systems exhibited scalability in the low-contention case, and Figs. 15a and 15b show that they were less efficient in the high-contention cases. This is consistent with prior studies (Figs. 4 and 9 of the TicToc study [76], Fig. 5 of the Cicada study [46], and Figs. 2 and 3 of the 1000 cores study [24]). Fig. 15 shows that DBx1000 outperformed CCBench in all cases. One of the reasons for this is the different indexes used: CCBench used Masstree, and DBx1000 used hash tables.

When the warehouse count was equal to the thread count, DBx1000(C) exhibited the best performance, as illustrated in Fig. 15c, when the thread count was less than or equal to 96. When the thread count was more than 96, it did not run owing to an ASSERT failure or a segmentation fault; increasing the NMaxCores parameter did not solve this issue. Fig. 16 shows that all three platforms scaled in all settings when the thread count was no more than 20.
NP-Insert: For DBx1000(C), we set the two parameters TPCC_INSERT_ROWS and TPCC_INSERT_INDEX to true. We uncommented the code of the insertion operations for DBx1000. We set the corresponding command-line argument for CCBench to enable insertion operations. The results are shown in Figs. 17 and 18. The performance of DBx1000 for NP-Insert was overwhelmingly worse than that for NP-NoInsert, which should be due to the insertion operations. This degradation was reported by Bang et al. [24] (Fig. 9), who improved the workload by weaving modern optimizations [39, 42, 44] into DBx1000; their code is not publicly available. CCBench and DBx1000 exhibited similar performance trends in all cases, and CCBench outperformed DBx1000 under this setting. DBx1000(C) exhibited anomalous performance when the warehouse count was set to 1 or 4, as shown in Figs. 17a, 17b, 18a, and 18b; we could not determine the reason for this behavior.

The performance of NP-Insert was worse than that of NP-NoInsert because of the additional insert operations. CCBench exhibited approximately 14 Mtps for NP-NoInsert and approximately 8 Mtps for NP-Insert with 224 threads and 224 warehouses, as shown in Figs. 15(c) and 17(c), respectively. Insert operations require an index traversal as well as memory allocation for index nodes and record data, which typically causes page faults in the operating system. Therefore, they are considered expensive compared with other operations.

Figure 17: TPC-C-NP-Insert (full scale); throughput [MTPS] vs. thread count for (a) warehouse count = 1, (b) warehouse count = 4, (c) warehouse count = thread count.

Figure 18: TPC-C-NP-Insert (small scale); same panels as Fig. 17.

Figure 19: Breakdown of TPC-C-NP-Insert with 224 threads, 224 warehouses, and New-Order and Payment transactions; relative latency broken down into search, update, insert, commit, and other work.

Figure 20: Varying warehouse count, NP-NoInsert; throughput [MTPS].

Figure 21: Varying warehouse count, NP-Insert; throughput [MTPS].
Impact of Index on Performance: The difference in indexes produces a difference in performance; in theory, the access cost of a tree index is higher than that of a hash index. To understand the impact of the Masstree index on performance, we measured the latency breakdown of the transactions in the NP-Insert workload of CCBench with 224 threads and 224 warehouses using the Linux perf tool. As shown in Fig. 19, 68.4% and 58.9% of the execution time of the New-Order and Payment transactions, respectively, was spent on search, update, and insert. These operations need to find a record and therefore require frequent Masstree traversals. Because the size of a Masstree node is a few hundred bytes, traversal decreases the spatial locality of memory accesses; thus, cache misses tend to increase. NP can be executed using hash indexes; in such a case, the ratio of search, update, and insert to the execution time would be significantly reduced, and the performance of NP would improve.

A.3.2 Varying Warehouse Count
We evaluated NP-NoInsert and NP-Insert while varying the warehouse count. The settings are the same as those in Appendix A.3.1. The results for NP-NoInsert and NP-Insert are shown in Figs. 20 and 21, respectively. All results of DBx1000(C) were measured with 96 threads owing to errors. The results show that all three systems tended to scale. For NP-NoInsert, DBx1000 outperformed both DBx1000(C) and CCBench, as shown in Fig. 20. A similar experiment is shown in Fig. 5 of the TicToc study [76], and it exhibited scalability, which is consistent with this result. For NP-Insert, DBx1000 underperformed both DBx1000(C) and CCBench, as shown in Fig. 21.
A.3.3 Difference between NP and Full-mix

The transactions of TPC-C include OrderStatus, Delivery, and Stock-Level in addition to New-Order and Payment. We did not obtain results for them in this study. Under a full-mix workload, which includes all five transactions, performance deteriorates to a fraction of that of NP. Figs. 4 and 5 of the original Cicada study [46] show the results of the full-mix and NP, respectively; the full-mix result in Fig. 4 was a fraction of that in Fig. 5, even though the New-Order and Payment transactions account for 88% of the mix and the remaining three account for only 12%. The reason for the performance degradation is that the Delivery transaction contains many record accesses compared with the New-Order and Payment transactions. Owing to the existence of such long transactions, many conflicts occur for small warehouse counts, which deteriorates performance. For many warehouses, we expect the abort ratio to be low because Kimura [44] determined that the ratio was 0.12% when the warehouse count was equal to the thread count. The effect of such long transactions is described in §.

A.4 Summary of TPC-C
TPC-C is an important benchmark. CCBench currently supports only two (New-Order and