MV-PBT: Multi-Version Index for Large Datasets and HTAP Workloads
Christian Riegger, Tobias Vinçon, Robert Gottstein, Ilia Petrov
Data Management Lab, Reutlingen University, [email protected]
ABSTRACT
Modern mixed (HTAP) workloads execute fast update-transactions and long-running analytical queries on the same dataset and system. In multi-version (MVCC) systems, such workloads result in many short-lived versions and long version-chains as well as in increased and frequent maintenance overhead. Consequently, the index pressure increases significantly. Firstly, the frequent modifications cause frequent creation of new versions, yielding a surge in index maintenance overhead. Secondly, and more importantly, index-scans incur extra I/O overhead to determine which of the resulting tuple-versions are visible to the executing transaction (visibility-check), as current designs only store version/timestamp information in the base table – not in the index. Such an index-only visibility-check is critical for HTAP workloads on large datasets. In this paper we propose the Multi-Version Partitioned B-Tree (MV-PBT) as a version-aware index structure, supporting index-only visibility checks and flash-friendly I/O patterns. The experimental evaluation indicates a 2x improvement for analytical queries and 15% higher transactional throughput under HTAP workloads (CH-Benchmark). MV-PBT offers 40% higher transactional throughput compared to WiredTiger's LSM-Tree implementation under YCSB.
© 2019 Copyright held by the owner/author(s).

1 INTRODUCTION

The spread of large-scale, data-intensive, real-time analytical applications is increasing. Such applications result in Hybrid Transactional and Analytical Processing (HTAP) workloads, combining long-running analytical queries (OLAP) with frequent, low-latency update transactions (OLTP) on the same dataset and even on the same system [18]. Multi-versioning is at the core of many approaches and system designs suitable for HTAP. Under Multi-Version Concurrency Control (MVCC), reading transactions executing long-running queries do not block the frequent, low-latency modifying transactions. Under such approaches multiple versions of each data item (i.e. tuple) may physically co-exist, and every transaction operates against a snapshot of the database comprising all versions it is allowed to see for consistent execution. Read operations simply operate on the latest committed version visible to them and are therefore never blocked, yielding good read performance and concurrency. An update operation produces a new version of the updated data item and invalidates the predecessor version. All versions of a tuple form a version-chain.

Figure 1: HTAP and Version-Chain Lengths: TX_U1 ... TX_U3 create new versions of tuple t, which are indexed. The index scan of TX_R returns only the index entries (t.v0) visible to TX_R, filtering the invisible ones (t.v1 ... t.v3) matching the search predicate.

Timestamps placed on every physical version-record are used to determine which of the existing tuple-versions is visible to a transaction. Under OLTP workloads, version-chains tend to be short, due to the predominantly short-lived transactions. For instance, under TPC-C the average version-chain length is approx. 1.2 [8]. Under HTAP the DBMS needs to handle much longer version-chains due to the mix of long-running and short-lived transactions (Figure 1). Whenever a transaction TX_R reads a tuple t, the DBMS returns the latest version of that tuple, t.v0, committed before the start of TX_R. Even though, in the meantime, multiple low-latency updating transactions TX_U1 ... TX_U3 might have committed, producing successor-versions (t.v1 ... t.v3), t.v0 cannot be garbage collected as long as it is visible to an active transaction, i.e. TX_R. Thus, the amount of such transient versions can be as high as several hundred million in real systems [13]. HTAP workloads in combination with long version-chains exercise significant pressure on indices.
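To make the visibility rule concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of a snapshot visibility-check over a version-chain linked from the newest version to its predecessors; the names `Version` and `visible_version` are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: str
    t_creation: int              # commit timestamp of the creating transaction
    prev: Optional["Version"]    # link to the predecessor version

def visible_version(entry: Version, snapshot_ts: int) -> Optional[Version]:
    """Walk the chain from the newest version and return the first one
    whose creating transaction committed before the snapshot was taken."""
    v = entry
    while v is not None:
        if v.t_creation <= snapshot_ts:
            return v
        v = v.prev
    return None  # no version is visible to this snapshot

# Tuple t: v0 committed at ts=10, then updated at ts=20 and ts=30.
v0 = Version("a=5", 10, None)
v1 = Version("a=7", 20, v0)
v2 = Version("a=9", 30, v1)

# A long-running query that took its snapshot at ts=15 still sees v0,
# so v0 cannot be garbage collected despite two committed successors.
assert visible_version(v2, 15) is v0
assert visible_version(v2, 25) is v1
```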
In a single-versioned system there is one index entry per tuple. However, in a multi-versioned system, the DBMS needs to index at least all committed tuple-versions (Figure 1), even the transient ones. Long version-chains thus put extra pressure on the index. Although most of today's systems are multi-versioned, the majority of index approaches still handle tuple-versions of the same tuple as if they were separate tuples, ignoring the version semantics. If naïvely integrated, these slow down index lookups and may cause significant maintenance overhead to persistent indices, as index updates are very frequent and since index entries corresponding to obsolete tuple-versions need to be frequently garbage collected. Given the read/write asymmetry of modern persistent storage technologies, these operations result in prohibitively expensive in-place updates. In this context, append-based index structures trading sequential writes for complex reads are a good candidate.

All in all, the following observations can be made:
1) Version-obliviousness: Although all tuple-versions need to be indexed, current indexing approaches lack version information.
2) No index-only visibility-check: Currently, it is impossible to determine which index-entries resulting from an index lookup/scan correspond to versions visible to the calling transaction.
3) I/O overhead: Version-oblivious indices or naïve support for multi-versioning yield significant I/O overhead.

In the present paper we propose Multi-Version Partitioned B-Trees (MV-PBT) as a version-aware index structure for MV-DBMS, in an attempt to address the above issues. MV-PBT is based on a variant of B+-Trees called Partitioned B-Trees [12]. The contributions of this paper are:
• MV-PBT is a version-aware index structure. It contains version information and supports index-only visibility-checks.
• MV-PBT supports append-based write-behavior and exhibits much lower write-amplification compared to LSM-Trees.
• MV-PBT has been implemented in PostgreSQL. The performance evaluation under HTAP workloads (CH-Benchmark [2]) indicates that it improves the analytical throughput by 2x due to index-only visibility-checks, while improving the transactional throughput by 15% compared to PostgreSQL's highly-optimized B+-Tree. Under TPC-C, MV-PBT performs 15% better.
• MV-PBT has also been implemented in WiredTiger (MongoDB). The performance evaluation indicates approx. 40% higher throughput under YCSB compared to WiredTiger's highly-optimized LSM-Trees.

The rest of the paper is organized as follows. We motivate the missing version-awareness and the need for index-only visibility-checks in Section 2, while Section 3 provides some background on various multi-versioning aspects. The design and implementation of MV-PBT is described in detail in Section 5, while the experimental evaluation is presented in Section 6. We discuss related approaches in Section 7 and conclude in Section 8.
In this section we give a more detailed perspective on the above issues: 1) version-obliviousness in indices; 2) missing index-only visibility-check; and 3) I/O overhead. Consider the example in Figure 2, which is a more detailed version of Figure 1 with a conventional B+-Tree. An initial transaction TX_U0 (not depicted) inserts tuple t prior to TX_R, creating its initial version t.v0. While TX_R is running, multiple concurrent transactions TX_U1 ... TX_U3 update tuple t and each of them produces new versions of it (t.v1 ... t.v3). Only TX_U3 inserts tuple y in its initial version y.v0, in addition to creating t.v3. Each tuple-version is a separate physical version record (Figure 2.A). It contains version-information: the recordID of the predecessor version and two timestamps, t_creation – the timestamp of the transaction that created that tuple-version – and t_invalidation – the timestamp of the transaction that invalidated it by creating a successor version. The invalidation-timestamp is null if there is no successor. If a tuple gets deleted, a special tombstone version-record is inserted to mark the logical end of the chain. The version-information is only available on the version-record.

Since version-records are independent physical entities, they can be stored on any DB-page with enough free space. Figure 2.B depicts an example of the physical version-storage. An index on a table must contain index-entries for each committed version of every tuple for consistency. Therefore, a B+-Tree index idx on attribute a of table R (Figure 2.C) should reflect all versions of each tuple of R. Since the index is version-oblivious, it contains no version-information and treats each tuple-version as if it were a separate tuple. Consequently, if TX_R uses the index to count all tuples satisfying "a ≤ 10" (Figure 2.D), the index scan will return the matching index entries (referencing versions t.v0 ... t.v3). Now, each one of them must be checked for visibility, i.e. whether it is the latest committed tuple-version prior to the start of TX_R. However, the necessary timestamps are available only on the version-records. Therefore, all of them are retrieved, at the cost of random I/Os.
Figure 2: Index-Only Visibility-Check in Multi-Version DBMS: (a) logical tuples (t and y) of a table R and their versions; (b) the physical storage of these versions into database pages; (c) an index created over the table R must index all versions; (d) an index-scan retrieves all versions meeting the predicate, out of which (e) the visibility-check returns the ones visible to the calling transaction TX_R.

Consider the example in Figure 2.D, C and E: the index-scan for the condition "a ≤ 10" will return versions t.v0, t.v1, t.v2 and t.v3. Subsequently, they are read to extract the version-information (t_creation and t_invalidation – Figure 2.A), yielding four random I/Os. The visibility-check then determines the latest version committed prior to the start of TX_R, returning the recordID of t.v0 and ignoring the rest. Since the index is version-oblivious and thus does not support index-only visibility-checks, the I/O costs amount to: COST(Index-Scan) + 1 random I/O for each matching tuple-version. Especially for HTAP workloads this yields significant performance degradation, depending on the length of the version-chains.
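The cost formula above can be restated as a toy calculation. The helper names below are ours, and the unit cost of one random I/O is an assumption made for illustration:

```python
def cost_version_oblivious(matching_candidates: int, index_scan_cost: int = 1) -> int:
    # one random base-table I/O per matching candidate, visible or not
    return index_scan_cost + matching_candidates

def cost_index_only(visible_results: int, index_scan_cost: int = 1) -> int:
    # invisible entries are filtered inside the index; only visible
    # tuple-versions are fetched from the base table
    return index_scan_cost + visible_results

# Figure 2: four candidate versions of tuple t match "a <= 10",
# but only t.v0 is visible to TX_R.
assert cost_version_oblivious(4) == 5  # index scan + 4 random I/Os
assert cost_index_only(1) == 2         # index scan + 1 random I/O
```

The gap widens linearly with the version-chain length, which is exactly the HTAP scenario of Figure 1.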
To quantify the combined effect, we designed a simple experiment with YCSB and PostgreSQL. We run YCSB workloads A (update) and E (scan) combined, performing frequent scans and updates. In parallel, we perform a point-query on a tuple every 30 seconds (simulating an HTAP workload). Additionally, we continuously increase the version-chain by updating the tuple, until 50 versions are reached. In realistic HTAP settings, the amount of active versions can be as high as several hundred million, while analyses can take as long as 1000s [13]. The experimental results are shown in Figure 3. The highly-optimized B+-Tree implementation in PostgreSQL performs better than MV-PBT on a single tuple-version. However, as the version-chain length increases (6-8 versions), the performance drops rapidly to approx. 50 transactions/sec, due to version-obliviousness and random I/O.

Figure 3: Performance Impact of Version Visibility Check.

Basic Partitioned B-Trees (PBT) are likewise version-oblivious, but exhibit append-based write behaviour, avoiding in-place updates, and therefore perform slightly better (approx. 150 tx/sec). Due to its version-awareness and support for index-only visibility-checks, MV-PBT exhibits much higher and robust performance (approx. 1200 tx/sec) with growing chain lengths.
Multi-Version Concurrency Control (MVCC) is one of the most popular transaction management schemes and is used in most modern DBMS: Oracle, Microsoft SQL Server, HyPer, SAP HANA, MongoDB WiredTiger, NuoDB, PostgreSQL or MySQL-InnoDB, just to name a few. These DBMS make different design decisions regarding the aspects described below.
Under MVCC a logical tuple corresponds to one or more tuple-versions (Figure 2.A). They form a singly linked list, which represents a version chain. There are two possible physical representations of a tuple-version (Figure 4): physically materialized or delta-record based. The former implies that each tuple-version record is stored physically materialized in its entirety and is the focus of this paper. The latter implies that each modification of a logical tuple results in a delta-record, indicating the difference to another version (à la BW-Tree [14, 21]). The delta-records are connected and retrieved on demand by the DBMS storage manager to restore a tuple-version. Delta-record based system designs typically store a single version (oldest or newest) in the main store and use a separate store for the delta-records, which may be the undo log (à la InnoDB) or a temporary version store (à la MS SQL Server). Both organizations can perform modifications in-place or out-of-place. Out-of-place updates with physically materialized version-maintenance insert a new version-record in the base table. Based on the version ordering, additional modifications may be necessary to maintain logical timestamps or references.
Figure 4: Version Storage Alternatives
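The trade-off between the two storage models can be sketched as follows. This is an illustrative Python toy (version names and attribute values are hypothetical): materialized storage returns a version directly, while delta-record storage must replay deltas to rebuild the requested version.

```python
# Materialized storage: each version is stored in its entirety.
materialized = {"v1": {"a": 7, "z": "V0"}, "v2": {"a": 7, "z": "V1"}}

# Delta storage: one full base version plus per-version field diffs.
base = ("v1", {"a": 7, "z": "V0"})
deltas = [("v2", {"z": "V1"})]  # v2 changed only attribute z

def reconstruct(version: str) -> dict:
    """Rebuild a version by replaying deltas on top of the base version."""
    name, tuple_state = base
    state = dict(tuple_state)
    if name == version:
        return state
    for delta_name, delta in deltas:
        state.update(delta)      # extra processing per intermediate delta
        if delta_name == version:
            return state
    raise KeyError(version)

# Deltas consume less space but need the predecessor chain to answer a read.
assert reconstruct("v2") == materialized["v2"]
```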
Considering the characteristics of modern storage technologies, physically materialized version storage and out-of-place updates are preferable, due to lower write-amplification and higher parallelism. Delta records tend to consume less space than materialized tuple-versions, but require additional processing and access to all predecessors or successors for tuple reconstruction.
The set of tuple-versions of a database tuple is organized as a singly linked list. There are two different ordering methods (Figure 5): old-to-new and new-to-old.

Old-to-new ordering: The entry-point is the oldest tuple-version in the version chain and each version contains a reference (recordID) to its successor. A visibility-check must therefore process all successors, beginning from the oldest tuple-version. This behavior is beneficial for lookups of long-running analytical (OLAP) queries under HTAP workloads, where older tuple-versions are likely to be the visible ones. OLTP workloads, in contrast, mostly require the newest version and would need to process the whole version chain.

New-to-old ordering implies that the entry-point is the newest tuple-version, which refers to its predecessor. Queries in the typically short OLTP transactions find the visible version very fast, but long-running OLAP queries may need to process several successors in the version chain (Figure 3). In-place and out-of-place update strategies are possible for both methods.

Considering the characteristics of modern storage technologies, new-to-old ordering for physical version storage results in lower write-amplification and matches append-only storage. All other approaches require in-place updates.
Figure 5: Version Ordering Alternatives
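The asymmetry between the two orderings can be illustrated by counting how many version-records a visibility-check inspects before it can stop. This is a simplified Python sketch of ours (chains modeled as lists of creation timestamps, entry-point first):

```python
new_to_old = [30, 20, 10]  # entry-point = newest version
old_to_new = [10, 20, 30]  # entry-point = oldest version

def hops_to_visible(chain, snapshot_ts, newest_first: bool) -> int:
    """Number of version-records inspected to locate the visible version."""
    if newest_first:
        # walk towards older versions until one committed before the snapshot
        for hops, ts in enumerate(chain, start=1):
            if ts <= snapshot_ts:
                return hops
        return len(chain)
    else:
        # walk towards newer versions; stop once a version is too new
        hops = 0
        for ts in chain:
            hops += 1
            if ts > snapshot_ts:
                break
        return hops

# Short OLTP transaction (recent snapshot): new-to-old finds it in 1 hop...
assert hops_to_visible(new_to_old, 35, newest_first=True) == 1
# ...but a long-running OLAP query (old snapshot) walks the whole chain.
assert hops_to_visible(new_to_old, 15, newest_first=True) == 3
```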
Under MVCC a version is said to be invalidated whenever a successor version exists. There are two possible invalidation models [8] (Figure 6). First, two-point invalidation is the state-of-the-art model, where the creation timestamp of the successor version is also placed as invalidation timestamp on the predecessor. Two-point invalidation works well with old-to-new ordering. However, with new-to-old ordering, the invalidation timestamp must be set on the predecessor version, yielding an in-place update and possibly a random write. Second, with one-point invalidation [10], the existence of a successor implicitly invalidates the predecessor, and all version-records contain only the creation timestamp. One-point invalidation matches well with new-to-old ordering, the use of an indirection layer (VIDs and entry-points) as well as append-based storage.
Figure 6: Version Invalidation Model
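The difference between the two invalidation models can be sketched in a few lines of Python (an illustration of ours, not an implementation from the paper): two-point invalidation writes the predecessor in place, one-point invalidation only appends.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class V:
    t_creation: int
    t_invalidation: Optional[int] = None  # used by two-point invalidation only

def append_two_point(chain: List[V], ts: int) -> None:
    # two-point: write the new version AND update the predecessor in place
    if chain:
        chain[-1].t_invalidation = ts  # in-place update -> possible random write
    chain.append(V(ts))

def append_one_point(chain: List[V], ts: int) -> None:
    # one-point: the successor's existence implicitly invalidates the
    # predecessor; nothing is written in place (append-only friendly)
    chain.append(V(ts))

def invalidation_ts_one_point(chain: List[V], i: int) -> Optional[int]:
    return chain[i + 1].t_creation if i + 1 < len(chain) else None

two, one = [], []
for ts in (10, 20, 30):
    append_two_point(two, ts)
    append_one_point(one, ts)

assert two[0].t_invalidation == 20              # stored explicitly
assert invalidation_ts_one_point(one, 0) == 20  # derived from the successor
```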
Under MVCC, modifications of a tuple result in the creation of a new tuple-version. Old tuple-versions become obsolete if they are no longer visible to any of the active transactions. Therefore, some form of version GC is necessary to reclaim space and can improve performance. However, GC causes performance spikes (as it interferes with foreground I/O), reduces concurrency (as some form of locking is required) and increases write-amplification on secondary storage. GC [22] can be performed on transaction [13], tuple and index levels [14, 21]. Index-level GC (Section 5.6) purges index entries resulting from index updates, maintenance or tuple-level GC.
There are two possibilities to map index records to tuple-versions in base tables (Figure 7). First, classical physical references (recordIDs) can be used. Thus, the latest tuple-version in the base table (entry-point of the version chain) can be accessed directly, but changes to the latest version or its location result in index-record modifications. Such changes comprise: creation of a successor-version; storage management and physical movement (as in append storage); or garbage collection. Second, an indirection layer with logical references can be employed. Each tuple-version is augmented with a unique tuple-identifier (Virtual Tuple Identifier – VID), which is also stored in the index records. An index operation resolves the VID using a mapping table (indirection layer) to locate the physical entry-point. An indirection layer can reduce index maintenance costs for in-place and out-of-place updates, but requires additional structures and processing.
Figure 7: Version/Index-Record Referencing
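A minimal sketch of the indirection layer (page numbers, slot numbers and the VID naming below are hypothetical): index records hold a stable logical reference, and only the mapping table changes when the version-chain entry-point moves.

```python
# Indirection layer: the index stores a virtual tuple identifier (VID)
# instead of a physical recordID; a mapping table resolves the VID to the
# recordID of the current entry-point of the version chain.
mapping = {"VID(t)": ("page 42", 3)}  # VID -> recordID of the latest version
index = {7: "VID(t)"}                 # search key -> logical reference

def lookup(key):
    return mapping[index[key]]        # one extra indirection per lookup

# A new entry-point (e.g. after an update or physical movement) only
# updates the mapping table; the index record stays untouched, which
# lowers index maintenance cost at the price of the extra hop.
mapping["VID(t)"] = ("page 117", 1)
assert lookup(7) == ("page 117", 1)
```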
Traditional index designs use physical references and contain no version-information, which tends to increase index maintenance overhead as well as the visibility-check costs for lookups and scans. Alternatively, modern index-structures (BW-Tree) use an indirection layer, but contain no version-information and support no index-only visibility-check. This can cause massive read amplification for mixed workloads. An optimal index structure should reduce write amplification and return only references to tuple-versions that are visible to a transaction snapshot. MV-PBT uses physical or logical references, is version-aware and produces an append-only sequential write pattern.
We have outlined some relevant design decisions for storing tuple-versions in multi-version DBMS. Modifications are preferably stored as physically materialized tuple-versions in base tables, rather than deltas, due to tuple reconstruction costs. Moreover, this enables direct access to each tuple-version from additional access paths. Out-of-place updates reduce write amplification on secondary storage. Garbage collection is required for space reclamation, but brings additional complexity to data structures. A new-to-old version ordering requires index maintenance for every new tuple-version, because the entry-point of the version chain for that tuple changes. A logical indirection layer ensures fast lookups by efficiently returning the entry-point of a version chain and reduces index maintenance effort. New-to-old ordering is beneficial for OLTP and speeds up the visibility-check, as the latest version is typically the visible one, yet older versions may require slow reconstruction. Alternatively, old-to-new ordering supports long-running OLAP operations and the visibility-check in HTAP settings, as the oldest version is directly accessible; yet modifications and maintenance may suffer low performance. Indices for mixed workloads and large datasets should rather return visible tuple-versions. Traditional index structures, in contrast, only return version candidates, which have to be subsequently verified by fetching version-records from base tables, performing random I/O.
Modern database storage management needs to address the characteristics of semiconductor storage technologies [19]. Consider Figure 8, depicting the I/O characteristics of the enterprise Flash storage used in the evaluation. Typical index search operations result in a large amount of small (8K) random reads. Hence, we optimize for read IOPS and sequential writes (≥

Figure 8: I/O Characteristics of Intel DC P3600 SSD.

                         Read               Write
Blocksize [KB]       8       64         8       64
Sequential  IOPS  122382    24180    11104     1343
            MB/s     956     1511       87       84
Random      IOPS  112479    23631     7185     1184
            MB/s     879     1477       56       74
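As a consistency check on the figures above, throughput in MB/s is approximately IOPS × block size:

```python
def mb_per_s(iops: int, blocksize_kb: int) -> float:
    # KB/s -> MB/s (1 MB = 1024 KB)
    return iops * blocksize_kb / 1024

# Sequential 8K reads: 122382 IOPS correspond to ~956 MB/s (Figure 8).
assert round(mb_per_s(122382, 8)) == 956
# Random 8K writes: 7185 IOPS correspond to ~56 MB/s.
assert round(mb_per_s(7185, 8)) == 56
```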
Multi-Version Partitioned B-Trees (MV-PBT) – Figure 9 – are based on Partitioned B-Trees (PBT), introduced by Goetz Graefe [11, 12]. PBT in turn represent an enhancement of traditional B+-Trees [4]. PBT (and MV-PBT) create index partitions based on an artificial, leading key-column – the partition number. All index-entries in a partition have the same partition number in the search key. PBT (and MV-PBT) utilize a portion of the database buffer (partition buffer) to host the latest partition P_N, where insertions and updates to existing partitions (P_1 ... P_N-1) are placed. Updates to existing index entries are treated as replacement records to avoid in-place updates. Once P_N gets full, an MV-PBT partition is appended to persistent storage and becomes immutable.

Regular MV-PBT records comprise a partition number, the search key columns, and a recordID (set). Furthermore, MV-PBT index records contain version-information: a logical transaction timestamp for validation or invalidation of the tuple-version and, optionally, a unique virtual identifier (indirection layer). Each partition number value identifies a single partition. Partition numbers are unique, monotonically increasing, two-byte integer values. This enables the MV-PBT to maintain partitions within one single tree structure in alphanumeric sort order. The partition number is an artificial column and is therefore transparent to higher database layers. Each MV-PBT maintains partitions independent of other MV-PBTs. Partitions appear and vanish as simply as inserting or deleting records. They can be reorganized and optimized on-line in system-transaction merge steps, depending on the workload. Partitions can support additional functionality, like bulk loads, or can serve as a multi-version store [12].

Figure 9: Structure of a Multi-Version Partitioned B-Tree.

MV-PBTs write any modification of index records exactly once – upon eviction of a partition – except for later reorganization or garbage collection operations. This is realized by forcing sequential writes of all leaf nodes in a partition (Figure 9). Leaf nodes of modifiable main-memory partitions are stored in a separate buffer cache – the
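The composite-key idea can be sketched compactly. The encoding below is an assumption of ours (big-endian two-byte prefix), chosen so that byte-wise comparison yields the partition-then-key sort order described above:

```python
import struct

def mvpbt_key(partition_no: int, search_key: bytes) -> bytes:
    """Prefix the search key with an artificial two-byte partition number."""
    assert 0 <= partition_no <= 0xFFFF  # two-byte, monotonically increasing
    return struct.pack(">H", partition_no) + search_key

# Entries of the same partition sort together; partitions are ordered by
# partition number, so one B-Tree holds all partitions.
keys = sorted([mvpbt_key(2, b"apple"), mvpbt_key(1, b"zebra"),
               mvpbt_key(1, b"apple"), mvpbt_key(2, b"melon")])
assert keys == [mvpbt_key(1, b"apple"), mvpbt_key(1, b"zebra"),
                mvpbt_key(2, b"apple"), mvpbt_key(2, b"melon")]
```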
MV-PBT Buffer. This area is shared by all MV-PBT indices in the database. Once the MV-PBT Buffer gets full, a victim MV-PBT is selected and its P_N is written to secondary storage. The MV-PBT Buffer is managed by a special replacement policy, giving active partitions the chance to grow (Section 5.5).

Persistent index partitions are immutable. Direct modification-operations are forbidden. Therefore, modifications to existing index-records as well as insertions are placed in the buffered partition P_N. To handle this behavior, MV-PBT introduces new index-record types. Currently the following are defined.

Regular Index Records are created upon the insertion of new tuples. The partition number of the newest MV-PBT partition P_N is inserted together with the search key values. The recordID (pageID and slot) of the newly inserted tuple-version is included, as well as the transaction timestamp of the inserting transaction (Figure 10). The latter is essential for index-only visibility-checks. For example, transaction TX_U0 (Figure 10) inserts a new tuple (t) in its initial version (t.v0), causing the creation of a regular index record in partition P1.

Figure 10: MV-PBT Index-Record Types and Their Use: MV-PBT record format (top); an example including a sequence of transactions (TX_U0: INSERT INTO r VALUES (7, 'V0'); TX_U1: UPDATE r SET z='V1' WHERE a=7; TX_U2: UPDATE r SET a=1 WHERE a=7; TX_U3: DELETE FROM r WHERE a=1) and their index records (bottom).

Replacement-Records result from tuple-updates on non-index-key columns of existing index-entries. Such updates yield a new tuple-version that becomes the new chain entry-point, which needs to be reflected in the index. Although the index-record for the previous version has not changed (non-index-key update), the version-information and recordID of the new version need to be replaced. However, this is not possible if the index-record is already in an immutable partition (P1 ... P_N-1). Therefore a replacement record is inserted in the newest partition P_N to logically replace the old one with the recordID and the version-information. The Replacement Record (Figure 10) contains: the recordID of the new version, its creation-timestamp, as well as the recordID of the predecessor version. Hence the record includes some "anti-matter" [12] (the predecessor recordID, invalidating the predecessor) as well as some "matter", i.e. the recordID and timestamp of the new version. For example, transaction TX_U1 (Figure 10) updates the attribute z of the previously inserted tuple (t), producing a new version (t.v1). Although the index-key remains unchanged, the version-information of (t.v0) has to be updated, causing the creation of a replacement-record in partition P1.

Anti-Records are required for updates on index-key attributes and are always used in combination with replacement records in the same partition. If the index-key of an existing index-record (in the immutable partitions) gets updated, MV-PBT inserts a combination of an anti-record and a replacement record. Anti-records are pure "anti-matter", as they mark the extinction of the old index record (from partitions P1 ... P_N-1), whereas the simultaneously inserted replacement record represents the new "matter" and reflects the new index-key and the new version-information. The anti-record and the replacement record are inserted in P_N and are placed according to the sort-order of the search-key value. An anti-record contains the recordID of the old version, together with its search key and the transaction timestamp of the updating transaction (Figure 10). For example, transaction TX_U2 (Figure 10) updates the indexed attribute a of the previously inserted and updated tuple (t), producing a new version (t.v2) and modifying the index-key from 7 to 1. The anti-record (marking the extinction of the replacement-record from partition P1) reflects the recordID of the predecessor version (t.v1), contains its index-key value (7) and the transaction-timestamp of the current updating transaction (TX_U2). The simultaneously inserted replacement record reflects the new and updated value of the search key (i.e. 1), the recordIDs of the old and the new tuple-versions (t.v1 and t.v2) as well as the transaction-timestamp of TX_U2. Since the index records are kept in sort order of the search-key values within a partition (as in a B-Tree), the replacement record is placed first in order, followed by the anti-record.

Tombstone-records indicate the deletion of a tuple. If a tuple is logically deleted, it is not erased immediately in an MV-DBMS, because it could be visible to a concurrent transaction. Rather, a tombstone tuple-version record is inserted in the DB, which needs to be reflected in the MV-PBT index.
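The record types above can be summarized as a small data model. This is a sketch of ours: field names and the string recordIDs are stand-ins for the pageID/slot recordIDs of Figure 10.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RecType(Enum):
    REGULAR = 1      # new tuple inserted
    REPLACEMENT = 2  # update: new recordID/timestamp replacing an older record
    ANTI = 3         # index-key update: marks extinction of the old record
    TOMBSTONE = 4    # delete: marks extinction of the whole version chain

@dataclass
class MVPBTRecord:
    partition_no: int
    search_key: int
    rec_type: RecType
    ts: int                              # timestamp of the modifying transaction
    record_id: Optional[str] = None      # "matter": recordID of the new version
    record_id_old: Optional[str] = None  # "anti-matter": predecessor recordID

# The transaction sequence of Figure 10, expressed as index records:
log = [
    MVPBTRecord(1, 7, RecType.REGULAR,     ts=0, record_id="t.v0"),
    MVPBTRecord(1, 7, RecType.REPLACEMENT, ts=1, record_id="t.v1",
                record_id_old="t.v0"),                     # non-key update
    MVPBTRecord(1, 7, RecType.ANTI,        ts=2, record_id_old="t.v1"),
    MVPBTRecord(1, 1, RecType.REPLACEMENT, ts=2, record_id="t.v2",
                record_id_old="t.v1"),                     # index-key update
    MVPBTRecord(1, 1, RecType.TOMBSTONE,   ts=3, record_id_old="t.v2"),
]
assert sum(r.rec_type is RecType.REPLACEMENT for r in log) == 2
```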
Tombstone-records are similar to
Anti Records in that they represent pure"anti-matter", marking the extinction of the whole tuple-versionchain. The difference is that if a tombstone-record is visible toa transaction, no further tuple-version belonging to this chaincan be visible, even no replacement record . Tombstone-records (Figure 10) contain the recordID of the latest tuple-version andthe transaction-timestamp of the deleting transaction.For example, transaction TX U (Figure 10) deletes tuple ( t ),creating a tombstone-version ( t . v ) in the DB. Therefore a tomb-stone record is inserted in partition P with the recordID of thedeleted tuple-version t . v , reflecting deletion of the whole ver-sion chain t . v → t . v → t . v . In the following we describe the index operations in a MV-PBT: • Insert Operations are only performed in P N . An insertion yieldsthe creation of an regular index-record in P N with the recor-dID of the newly created tuple-version and the timestamp ofthe creating transaction. The insertion traverses the bufferedpartition P N and places the new index record according tothe alphanumeric sort-order of the search-key (ordering issuesare described in Section 5.3). The MV-PBT buffer managementstrategy (Section 5.5) guarantees sufficient space for the in-sertion and possible maintenance. In case of an non-uniqueindex the insertion is performed directly. Alternatively, givena unique index a lookup operation (see Search and Scan) isperformed first to guarantee the non-existence of the newindex-key. • Update Operations are performed in different ways. If a trans-action modifies a tuple-version in a way that a non-index-keyattribute is changed ( non-key update ) a new tuple-version iscreated and its version-information needs to be reflected in theindex. In case of non-key updates
MV-PBT inserts a replacement record in P_N (Figure 10), containing the version-information (recordID and timestamp) of the modifying transaction. By doing so, it logically replaces the index-record which is located in an older partition and reflects the predecessor version. Alternatively, if the modifying transaction updates an index-key attribute (index-key update), a replacement record as well as an anti-record are inserted in P_N (ordering issues are described in Section 5.3). The former reflects the new, modified index-key value of the new tuple-version; the latter indicates the extinction of the old index-record, reflecting the index-key value of the predecessor version. In case of a unique index, MV-PBT first performs a lookup to ensure the non-existence of the new key value.
• Delete Operations cause the insertion of a tombstone record in P_N. If a transaction deletes a logical tuple, a tombstone version is created, indicating the deletion of the whole version chain to those transactions to which it is visible. Analogously, MV-PBT inserts a tombstone record to indicate the extinction of all index-records corresponding to the version chain. Ordering issues are described in Section 5.3.
• Search and Scan Operations process partitions in reverse order, from P_N to P_0. Filter techniques such as Partition Range Keys, Minimum Transaction Timestamps or Bloom- and Range-Filters (Section 5.7) are needed for selecting the preceding partition which may contain an index record matching the search conditions (Algorithm 1). The search conditions are extended to match the format of an MV-PBT: the partition number is prepended to the first search-key column. A regular root-to-leaf traversal is performed and the cursor is positioned. Afterwards, the next matching index record is requested and checked for visibility (Section 5.4). This process is repeated until an index record visible to the current transaction is found, which is returned together with the respective recordID.
The partition number and timestamp are transparent to higher database layers and are removed. Index records of the most recent tuple-versions are found and processed first, due to the index-record ordering (Section 5.3), which is very beneficial for simple search conditions like point lookups.
Complex scan operations (Algorithm 2) build a set of all matching index records, spanning all MV-PBT partitions. Every partition is pre-selected by the filter techniques and processed from P_N to P_0. Traversal operations benefit from the commonly buffered higher levels of the tree structure. Matching index records of any record type in a partition are processed by the index-only visibility-check. Visible index records are added to the result set, without partition number and timestamp, in regular sort order. If no further index record matches the scan conditions, the algorithm proceeds with the preceding partition. Finally, the result set is returned. It is filled with all index records (including recordIDs) matching the scan and visibility conditions of the calling transaction. A single scan pass, without rechecking for concurrent modifications in P_N, is sufficient due to transaction snapshots: concurrent modifications in P_N are invisible anyway. The expensive retrieval of version-records in the base table (random read I/O) for version-information is avoided. If non-index attributes are selected, the recordID indicates the location of the version-record in the base table, which can be accessed directly.

Algorithm 1
MV-PBT Search

function search(SearchConditions |attr val, cond|, ...)
Output: IndexRecord
  while hasNext() do
    Let idx_record ← next()                 ▷ fetch next index record
    if VisCheck(idx_record) equals VISIBLE then
      return set_return_format(idx_record)  ▷ hide partition number and timestamp
  while part ← previousPartition(part) do
    if |attr val, cond| ∈ part.filter then
      Let |skeys_part| ← form_rec(part, |attr val, cond|)
      traverse(|skeys_part|)
      return search()
  return ∅

Algorithm 2 MV-PBT Scan

function scan(ScanConditions |attr val, cond|, ...)
Output: ResultSet of |IndexRecords|
  part ← ∅                                  ▷ previousPartition returns P_N for ∅
  while part ← previousPartition(part) do
    if |attr val, cond| ∈ part.filter then
      Let |skeys_part| ← form_rec(part, |attr val, cond|)
      traverse(|skeys_part|)
      while hasNext() do
        Let idx_record ← next()             ▷ neighbor in B-Tree
        if VisCheck(idx_record) equals VISIBLE then
          |IndexRecords|.add(set_return_format(idx_record))
  return |IndexRecords|

The version/partition placement in MV-PBT is determined by the modification, search and scan algorithms of MV-PBT.
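The partition-backward scan with anti-matter tracking can be sketched in runnable Python as follows (a simplified model with dict-based records and a timestamp-only snapshot test; filters, latching and concurrent-transaction checks are omitted):

```python
def scan(partitions, pred, tx_ts):
    """Scan partitions newest-first (P_N ... P_0); records within a partition
    are assumed sorted by (key asc, timestamp desc), so successors come first."""
    seen_anti = {}                     # recordID -> invalidating timestamp
    result = []
    for part in reversed(partitions):  # P_N down to P_0
        for rec in part:
            if not pred(rec["key"]) or rec["ts"] > tx_ts:
                continue               # no match, or created after our snapshot
            shadowed = rec["rid"] in seen_anti
            if rec["type"] in ("anti", "tombstone", "replacement"):
                seen_anti[rec.get("rid_old", rec["rid"])] = rec["ts"]
            if rec["type"] in ("regular", "replacement") and not shadowed:
                result.append((rec["key"], rec["rid"]))
    return result

# Workload of Figure 10: insert a=7, non-key update, key update 7->1, delete.
p1 = [{"type": "replacement", "key": 7, "rid": "t.v1", "rid_old": "t.v0", "ts": 1},
      {"type": "regular",     "key": 7, "rid": "t.v0", "ts": 0}]
p2 = [{"type": "tombstone",   "key": 1, "rid": "t.v2", "ts": 3},
      {"type": "replacement", "key": 1, "rid": "t.v2", "rid_old": "t.v1", "ts": 2},
      {"type": "anti",        "key": 7, "rid": "t.v1", "ts": 2}]

print(scan([p1, p2], lambda k: True, tx_ts=2))  # [(1, 't.v2')]
print(scan([p1, p2], lambda k: True, tx_ts=3))  # []
```

A snapshot taken before the delete (tx_ts=2) sees only the key-updated version; a snapshot after the delete (tx_ts=3) encounters the tombstone first and sees nothing, without ever touching the base table.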
Index-records of predecessor versions are likely to be located in lower-numbered partitions, successors in higher-numbered ones (Figure 10).
This, however, would necessitate multiple in-memory partitions for an MV-PBT. To avoid this, the current MV-PBT design uses a single main-memory partition P_N for each MV-PBT. However, for index-records with the same index key it is mandatory that records of newer/successor versions are always placed before index-records of older/predecessor versions in P_N. In other words, the primary sort order of the index-records in P_N is on the search key (ascending), while all records with the same search key are sorted in inverse secondary sort order (descending) on the transactional timestamp. Search and scan operations traverse partitions backwards, starting from the buffered partition P_N (i.e. P_N → P_N−1 → · · · → P_0). Yet, given the above ordering, index-records of newer tuple-versions matching the search predicates are processed first in forward direction (i.e. in descending timestamp order). Only then is the next lower-numbered partition traversed and processed. This is how MV-PBT ensures that in a search or scan operation newer versions are always found before older ones, both within a partition and across partitions.
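The composite in-partition order can be illustrated with a plain Python sort key (a sketch; real MV-PBT records are byte-encoded):

```python
# (search_key, timestamp) pairs; primary order: key ascending,
# secondary order: timestamp descending, so newer records come first.
records = [(7, 0), (1, 3), (7, 1), (1, 2)]
ordered = sorted(records, key=lambda r: (r[0], -r[1]))
print(ordered)  # [(1, 3), (1, 2), (7, 1), (7, 0)]
```

For key 1, the record with timestamp 3 (the newest) precedes the one with timestamp 2, which is exactly the ordering the scan relies on.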
Consider for example Figure 11, where we have only two partitions and index-records reflecting updates to the same tuple go to the same partition, and contrast it with Figure 10, where all index-records with higher timestamps are placed in higher-numbered partitions. Observe that the index-records in Figure 11 appear in their primary order (on the search key), i.e. records with search-key 1 precede those with search-key 7. Observe also that the tombstone record with key 1 precedes the regular record as a result of the secondary sort order, since timestamp(TX_U3) > timestamp(TX_U2).

MV-PBT is version-aware and supports an index-only visibility-check, i.e. it returns the set of index records matching the search condition and visible to the calling transaction. In doing so, MV-PBT avoids the expensive retrieval of base-table version-records to extract their version-information. The index-only visibility-check (Algorithm 3) is inherently supported by the data structure: MV-PBT index records (Section 5.1) contain version-information and define the modifications and recordIDs of tuple-versions. The respective index-record ordering
[Figure 11 illustrates the example workload on table R(a, z): TX_U0 (insert): INSERT INTO r VALUES (7, 'V0'); TX_U1 (non-key update): UPDATE r SET z='V1' WHERE a=7; TX_U2 (index-key update): UPDATE r SET a=1 WHERE a=7; TX_U3 (delete): DELETE FROM r WHERE a=1; and the resulting regular, replacement, anti- and tombstone records; alternatively TX_U3 and TX_U4: UPDATE/DELETE ... WHERE a=8.]
Figure 11: MV-PBT Index-Record Ordering.

is essential to scans (Section 5.3), whereby records indicating the invalidation of a tuple-version are guaranteed to be placed before the "validating" records for a given transactional timestamp. Index records of any type matching the search conditions are processed by the visibility check. They are invisible to a transaction if:
(a) the index record is flagged for garbage collection;
(b) the transaction timestamp of the index record is greater than the timestamp of the calling transaction, or the transaction corresponding to the index-record timestamp is concurrent to the calling transaction;
(c) a visible record with anti-matter for the recordID (anti-, replacement- or tombstone-record) was already encountered (in this case the record is also checked for GC); or
(d) the index record is either a tombstone record or an anti-record.
An additional visibility check by processing the version chain in the base table is not required. Skewed updates on tuples do not lower the performance of the index-only visibility check, due to well-performing garbage collection and well-cached version chains in the main-memory partition P_N.

MV-PBT accumulates modifications to persistent partitions in the latest partition P_N, which is held in the MV-PBT partition buffer (Figure 9). All MV-PBT indices place their respective P_N in the MV-PBT buffer, which raises the question of the proper buffer management strategy. Well-known replacement policies (like LRU or ARC) are not suitable for managing the set of leaf nodes contained in the respective P_N, or different P_N. The MV-PBT buffer should (a) only evict partitions as a whole instead of individual pages (as in LRU) to achieve sequential write patterns; and (b) give partitions of update-intensive indices a fair chance to grow, balanced across all indices. Remember that MV-PBT read operations place persistent partition nodes in the main/shared DB-buffer.
The MV-PBT buffer-management strategy can be summarized as follows. Whenever the buffer-size threshold is reached, the MV-PBT buffer manager selects the largest partition of all indices as the victim for eviction. This prevents smaller, less update-intensive partitions from being frequently evicted, avoiding an imbalanced number of partitions per MV-PBT and shrinking partition sizes.

Algorithm 3 MV-PBT Index-Only Visibility-Check

function VisibilityCheck(idx_record)
Input: idx_record at the current scan position
Output: BOOL_VISIBLE
  Let anti_map ← Map of (recID | TS)        ▷ anti-matter
  if IS_SET(idx_record, FLAG_GC) then
    return INVISIBLE
  if not precedes(idx_record.ts, CurrentTxId)
     or isConcurrent(idx_record.ts, CurrentTxId) then
    return INVISIBLE
  if ts_anti ← anti_map.get(idx_record.recID_matter)
     and precedes(idx_record.ts, ts_anti) then
    checkForGC(idx_record)
    return INVISIBLE
  if IS_SET(idx_record, FLAG_ANTI_MATTER) then
    anti_map.put(idx_record.recID_anti, idx_record.ts)
  if IS_SET(idx_record, FLAG_MATTER) then
    return VISIBLE
  return INVISIBLE

The eviction process (Algorithm 4) can be summarized as follows. A new partition numbered P_N+1 (initially P_N+1) is created for ongoing modifications. The current victim partition P_N becomes immutable and is scanned; the following operations are performed cooperatively and latch-free, piggybacking that scan:
(1) Version chains are built from the scan result-set records using their timestamps and recordIDs, creating a temporary VID for each chain. While doing so, obsolete index-records (parts of the version chain) are detected and marked for garbage collection.
(2) Garbage collection is performed on the marked records (records that are no longer needed or invisible are removed) and the result is written out to new leaf nodes.
(3) During this process, index-records and leaf nodes are transformed to an on-disk format, whereby prefix-truncation, compression and encoding as well as dense-packing (Section 5.7) are performed. Furthermore, the partition number of each index record is decremented from P_N to P_N−1. Now P_N−1 is a separate partition, which is not yet known in the MV-PBT partition metadata. The process resembles a leaf-build in PostgreSQL: having full pages, the intermediary index nodes can be built on top. Concurrent lookups and scans are still performed on the old P_N nodes.
(4) In parallel, well-sized (prefix-) Bloom filters are created (Section 5.7).
(5) Dense-packing, compression and read-optimizations are performed on the higher-level intermediary nodes, resembling a bottom-up build. All nodes are written out sequentially.
(6) Finally, P_N−1 is added to the MV-PBT partition metadata. The old P_N leaf nodes, on which concurrent non-blocking reads had been executing, are detached from the MV-PBT and freed for reuse.
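The eviction steps can be condensed into a runnable Python sketch (hypothetical names; the cooperative latch-free mechanics, filter creation and on-disk encoding are elided; GC here keeps, per version chain, the records still potentially visible to an active transaction plus the newest record below the cutoff):

```python
def evict(buffered, persisted, oldest_active_ts):
    """Evict buffered partition P_N: build chains, GC, renumber to P_N-1,
    append sequentially to the persisted partition list, open a new buffer."""
    chains = {}
    for rec in sorted(buffered["records"], key=lambda r: -r["ts"]):
        chains.setdefault(rec["vid"], []).append(rec)   # (1) chains, newest first
    live = []
    for chain in chains.values():                       # (2) garbage collection
        kept_cutoff = False
        for rec in chain:
            if rec["ts"] >= oldest_active_ts:
                live.append(rec)                        # still possibly visible
            elif not kept_cutoff:
                live.append(rec)                        # newest globally visible one
                kept_cutoff = True                      # older records are garbage
    persisted.append({                                  # (3)-(6) pack and append
        "pno": buffered["pno"] - 1,                     # decremented partition number
        "records": sorted(live, key=lambda r: (r["key"], -r["ts"]))})
    return {"pno": buffered["pno"] + 1, "records": []}  # new partition for writes

disk = []
pn = {"pno": 5, "records": [
    {"vid": "t", "key": 7, "ts": 1}, {"vid": "t", "key": 7, "ts": 3},
    {"vid": "t", "key": 7, "ts": 5}]}
new_pn = evict(pn, disk, oldest_active_ts=4)
print(disk[0]["pno"], len(disk[0]["records"]), new_pn["pno"])  # 4 2 6
```

The record with timestamp 1 is shadowed by the globally visible version with timestamp 3 and is dropped; the evicted partition is written with a decremented number while a fresh buffered partition takes over new modifications.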
Mixed workloads with high update rates result in massive amounts of tuple-versions, which need to be garbage collected once a long-running reading/analytical query completes [13]. The same is true for the corresponding index-records. With high probability these
Algorithm 4 MV-PBT Partition Eviction

function evict(|P_N|)
Input: set of P_N in the MV-PBT buffer
  Let p_evict ← SelectEvictionVictim(|P_N|)
  Add partition p_evict+1 to the B+-Tree partitions list
  SET(p_evict, FLAG_IMMUTABLE)
  Let recordSet ← scanRecords(p_evict)
  garbageCollectionP3(recordSet)
  worker.loadAndFlush(p_evict.pNo − 1, recordSet)
  worker.createFilters(p_evict, recordSet)
  wait()
  Let p_evict_new ← decrementPartitionNumber(p_evict)
  detachAndFree(p_evict)

records are located in the main-memory partition P_N of an MV-PBT, due to their temporal locality. Therefore, we implemented a cooperative page-level garbage collection (GC) for P_N.

Phase (1):
The GC piggybacks regular index-scans to identify index-records of versions that are not visible to any active transaction (cutoff transaction). As the page is already latched (shared), the following checks are performed on each record: (a) comparison with the lowest active transaction timestamp; if lower, mark the predecessors as victim-versions for GC; (b) if higher, but a successor exists, mark all predecessors as victims for GC. In both cases a hasGarbage flag is set in the page header (no exclusive latch required). This step also piggybacks the in-memory structures of the scan and index-only visibility-check algorithms. Records with anti-matter (anti-, replacement and tombstone records) require special attention, as they are still required for the invalidation of predecessors. Hence, the anti-matter record with the highest timestamp smaller than the cutoff-transaction timestamp must not be garbage collected. The index-record ordering (Section 5.3) supports GC while scanning, since successors are mostly processed first.
Phase (2):
Update operations check the hasGarbage flag in the page header. If it is set, they first set the recordID of the oldest required record with anti-matter (anti-, replacement and tombstone records) to the recordID of the oldest victim-version of that chain on the page. Next, the GC victims on that page are removed and the space is reclaimed; only then does the update operation proceed. This behavior saves memory, speeds up scans and visibility checks, and reduces index maintenance operations (splits).
Phase (3):
To handle version chains spanning several pages, and for the final cleanup before partition eviction, the whole partition is scanned and the version chains (based on timestamps and recordIDs) are built in memory. This scan is also piggybacked for filter creation and dense-packing (Section 5.7). Before switching to the sibling page, obsolete versions are removed after updating the invalidation reference, and the in-memory version chain is updated.
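Phase-(1) marking can be sketched as follows (simplified: records are grouped per version chain, newest first; the anti-matter exception described above is not modeled):

```python
def mark_garbage(chains, header, lowest_active_ts):
    """Mark as GC victims all records shadowed by a version that is already
    visible to every active transaction; set the page's hasGarbage flag."""
    for chain in chains.values():      # vid -> records, newest first
        cutoff_seen = False
        for rec in chain:
            if cutoff_seen:
                rec["gc"] = True       # predecessor of a globally visible version
                header["hasGarbage"] = True
            elif rec["ts"] < lowest_active_ts:
                cutoff_seen = True     # this record itself must stay
    return header

chains = {"t": [{"ts": 5}, {"ts": 3}, {"ts": 1}]}
header = {"hasGarbage": False}
mark_garbage(chains, header, lowest_active_ts=4)
print(header["hasGarbage"], chains["t"][2].get("gc"))  # True True
```

Only a flag is set under the shared latch; the actual space reclamation is deferred to the next update on that page (Phase 2) or to eviction (Phase 3).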
Various optimizations can be performed, based on the fact that MV-PBT partitions are immutable once written to storage.

Bloom Filters.
Each MV-PBT partition has a Bloom filter (BF) on the search key. Using Bloom filters accelerates key lookups (point queries) in a partition by avoiding unnecessary scans. Whenever a key lookup is performed, a BF query is executed first, to verify whether the key does not exist in the partition. If it does not exist, MV-PBT proceeds with the next partition. Alternatively, if the BF returns true (i.e. the key may exist), MV-PBT scans the whole partition.

[Figure 12: Index Performance under Mixed Workloads (CH-Benchmark). (a) index performance under mixed workloads; (b) standard vs. index-only visibility-check for different chain lengths (time in ms); (c) sequential write pattern (LBA over time) of the eviction of a single MV-PBT partition; (d) requests / cache hit rate.]
Our experimental evaluation (Figure 13) indicates that the average BF size is small, in the order of a few hundred KB. Therefore frequently used filters are usually cached in the MV-PBT buffer. Furthermore, their precision is high on average, thus false positives and therefore superfluous scans are rare. The BF is computed efficiently on eviction, piggybacking the existing maintenance scan, and is persisted as part of the partition metadata.
Range Filters.
Partition Bloom filters accelerate point lookups, but cannot handle range predicates. Currently, we employ prefix Bloom filters (pBF), if appropriate, to speed up range scans.
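A minimal partition Bloom filter can be sketched in Python (illustrative parameters; the paper's filters are sized per partition and persisted with the partition metadata):

```python
import hashlib

class PartitionBloomFilter:
    """Bit-array Bloom filter: a negative answer is exact, so the lookup
    can skip the partition; a positive answer may be a false positive."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k, self.bits = m_bits, k_hashes, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = PartitionBloomFilter()
for key in (1, 7, 42):
    bf.add(key)
assert bf.may_contain(7)   # inserted keys always answer True
# may_contain() on an absent key is False with high probability -> skip partition
```

A prefix Bloom filter follows the same scheme but inserts fixed-length key prefixes, so a range scan whose bounds share that prefix can probe the filter before descending into the partition.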
Dense-packed, Read-Optimized immutable storage.
Since a partition is immutable once persisted, various space and read-optimization techniques can be applied. Dense-packing is used to perform coalescing and free-space optimization. In memory, leaf nodes are kept on average 67% full to accumulate modifications and avoid splitting; once persisted, however, the space utilization can be maximized. MV-PBT performs dense-packing as part of the final garbage collection and space reclamation. Especially for non-unique indices,
MV-PBT performs reconciliation upon eviction to convert all regular records with the same search key into a single regular record with a set of {recordID, timestamp} pairs, instead of holding a separate record for each key instance. The same is true for replacement records, where sets of {recordID_NEW, Timestamp_NEW, recordID_OLD} are created for the same search key. Last but not least, compression techniques such as prefix-truncation or delta-compression are performed on the search key. Along the same lines, various read- and cache-aware optimizations can be performed.

We present the analysis of Partitioned B-Trees (PBT) and Multi-Version Partitioned B-Trees (MV-PBT) together with traditional B+-Trees in PostgreSQL 9.04, which serve as the baseline. Standard PostgreSQL uses an old-to-new version ordering, physically materialized version storage and two-point invalidation. Index records have a physical reference to base tables – denoted as B-Tree (PG/HOT). The PostgreSQL base-table storage was also modified to Snapshot Isolation Append Storage (SIAS) [8, 10], with a beneficial append-only write pattern, one-point invalidation and new-to-old version ordering. We implemented and evaluated B+-Trees and PBT with physical references and with logical tuple references on top of SIAS [8, 10].

Experimental Setup.
We deployed PostgreSQL 9.04 and PostgreSQL with SIAS [10] on an
Ubuntu 16.04.4 LTS server with an eight-core
Intel(R) Xeon(R) E5-1620
CPU, 2GB RAM and an
Intel DC P3600 400GB SSD. We used the well-known DBT-2 [1] TPC-C-like OLTP benchmark and the mixed-workload CH-Benchmark [6] in OLTP-Bench [2, 7] for the experimental evaluation. The OS page cache was cleaned every second to ensure repeatable and reliable results (even though conservative).
Mixed Workloads: CH-Benchmark.
MV-PBT is designed for large datasets and mixed workloads. We evaluate the throughput of B+-Trees, PBT and MV-PBT under the CH-Benchmark [6] in OLTP-Bench [2, 7]. MV-PBT doubles the analytical throughput compared to B+-Trees (Figure 12a), improving it from 0.29 to 0.61 queries/transactions per minute. At the same time, MV-PBT yields higher transactional throughput than B+-Trees (Figure 12a). The performance improvements are mainly due to the index-only visibility-check and partition garbage collection. To illustrate the combined effect, we turn both off and repeat the experiment. Consider now the lower MV-PBT performance bars in Figure 12a. Without partition garbage collection and index-only visibility-check, the OLAP performance drops by 75% from 0.61 to 0.16 queries per minute, whereas the OLTP throughput plummets from 4232 to 3093 tx/min.

Mixed Workloads: Index-Only Visibility-Check and Garbage Collection.
In a further experiment we investigate MV-PBT GC and the visibility-check in more detail, varying the version-chain length. We run the OLTP part of the CH-Benchmark and execute a query on the same dataset (Figure 12b); however, we pause it (using pg_sleep) for 30/60/90/120 seconds to simulate a long-running query and to vary the amount of transient versions and the chain length. Clearly, as the version-chain length increases, index-only visibility-checks gain importance, because unnecessary read I/O on the base table can be reduced. We compare PBT with the standard visibility-check in the base table (VC) to MV-PBT with the index-only visibility-check (idxVC) (Figure 12b). As the query processing time and version-chain length increase, index scans and VC slow PBT down by an order of magnitude. Even though the version-chain length does not grow linearly, pages in the base table get evicted and need to be fetched more frequently. When MV-PBT performs idxVC without garbage collection (Figure 12b, MV-PBT w/o GC), every index record of successor tuple-versions has to be processed; accordingly, the scan time increases proportionally to the chain length. With garbage collection (Figure 12b, MV-PBT w/ GC), the number of scanned index records and the scan time remain almost constant. However, GC requires additional processing and latches index nodes in P_N. Reading transactions have to wait for latches and the scan time increases; consider Figure 12b at 30 seconds sleep time. As more index records get garbage collected, GC improves the index scan time (compare MV-PBT with and without GC at 30 and 120 seconds, Figure 12b).

Sequential write-pattern/Append-based storage.
Based on the tradeoffs derived in Section 4, MV-PBT needs to support write sequentialization and append-based storage. In this experiment we evaluate the write pattern of MV-PBT (Figure 12c). Using blktrace and blkparse, we record an I/O trace during the partition eviction from the MV-PBT buffer. The X-axis represents the eviction time; the average write I/O time is about 1 ms. The Y-axis represents the logical block addresses (LBA), i.e. the file-system addresses where the blocks of the index file are written. Each red cross indicates the write of a single index node. A horizontal line therefore indicates a sequential write, i.e. multiple blocks written to neighbouring addresses over time; hence the sequential write pattern of MV-PBT. The horizontal lines in Figure 12c represent database extents and result from the database space-allocation strategy. Each evicted partition comprises leaf nodes allocated in new extents of the index file, placed at (mostly) adjacent addresses by the file system. The overall pattern confirms the sequential append behaviour of MV-PBT.
MV-PBT Buffer Efficiency.
Figure 12d shows the fetch requests on index nodes (blue) and base-table nodes (red) for an OLTP benchmark, together with the cache hit rate. Requests yielding a cache hit are displayed in a brighter colour than fetches (cache misses) from secondary storage. The request scale is logarithmic. The results are calculated for equal throughput over the test duration and over all tables and indices. PBT and MV-PBT require more requests on index nodes, due to the partitioning of index records and the greater record sizes. Most requests hit buffered nodes, because many queries can be answered in the main-memory partition; index records of new tuple-versions are commonly located there. MV-PBT reduces the requests on the base table by up to 40%, because the base table is not required for the visibility-check. The version chains are short for this benchmark; for mixed workloads this effect is more pronounced. This can be seen in the reduced cache hit rate on base-table nodes in comparison to PBT. Most saved base-table requests concern new tuple-versions, which are located in main memory.
Partition Filters.
Partition-based indices like MV-PBT, PBT or LSM-Trees incur higher lookup and scan overhead than B-Trees, since matching records may exist in older partitions. Hence, the effort of lookups and especially of scans increases with the number of index partitions, since in the worst case every partition has to be traversed. Point lookups can stop the partition traversal after finding the first matching record that is visible to the transaction, since older partitions are guaranteed to contain older records.
Figure 13: Effectiveness and Size of Partition Filters
Using Bloom filters (BF) (Section 5.7), point lookups can skip partitions, increasing throughput by up to 10% under TPC-C (Figure 14c). Furthermore, prefix Bloom filters (pBF) may, under certain conditions, speed up scans by skipping partitions not matching the range predicate. pBFs covering a fixed set of scan attributes increase the throughput by another 10% (Figure 14c). The precision of both Bloom filters is relatively high (Figure 13): the false-positive rate is 2% for BF and 10% for pBF, while the negative (skipping) rate is approx. 82% for BF and 84.5% for pBF. The size of BF and pBF is small relative to the partition size (Figure 13): for a 24MB partition the BF is 0.57MB, while the pBF is 0.36MB. Since index operations account for only a share of the overall database operations under TPC-C (besides logging, CC and I/O), the above numbers yield moderate performance improvements.

Comparison to LSM-Trees.
LSM-Trees [16] are used as the workhorse storage structure in many key/value stores for large datasets. Today's highly optimized multi-level LSM-Trees with levelling or tiering resemble MV-PBT, as they exhibit append behaviour and employ buffered components. We implemented MV-PBT in WiredTiger [3], the high-performance KV-store of MongoDB. In this experiment we compare MV-PBT to the LSM-Tree in WiredTiger under YCSB (Figure 15a). YCSB has been instrumented as follows: a dataset of 100 million keys (approx. 100GB); workloads A (30 mil. requests), B and D (10 mil. req.) and E (2 mil. req.). Workload A comprises 50% read and 50% update requests, which require fast lookups and updates. MV-PBT is approx. 42% faster than LSM-Trees. Each LSM level comprises multiple components, which themselves are small read-optimized B-Trees. A search needs to process separate LSM components, even though some can be skipped (Bloom filters). MV-PBT partition search is faster than LSM component search, since the leaf nodes in each

[Figure 14: OLTP Performance Evaluation under TPC-C. (a) indirection layer vs. physical version-record reference; (b) performance of indexing approaches under TPC-C; (c) influence of filter techniques on throughput under TPC-C; (d) MV-PBT garbage collection under TPC-C.]

[Figure 15 (partial): (a) throughput [thousand tx/s] of BTree, LSM and MV-PBT under YCSB; (b) YCSB throughput (workload A, tx/min and execution time) vs. number of MV-PBT partitions.]
Number of MV-PBT Partitions
Figure 15: Performance Evaluation under YCSB partition are under the same common index. Updates in MV-PBT, hit P N , which accommodates more kv-pairs than the mainmemory L in LSM-Trees. Workload B comprises 95% read and5% update requests, with zipfian distribution. BTree performsrandom reads, the LSM Tree caches the updates but has an equal amount of random reads spread over more componets. MV-PBThave much lower index maintenance compared to BTrees andplace the updates in P N . The reads are performed with maximumI/O parallelism. Workload D comprising 95% read and 5% updaterequests, which given the latest distribution stress the memorycomponents and BTrees performs most of the operations in mem-ory. MV-PBT is marginally better than LSM-Tree. Last but notleast, we run workload E comprising 95% scans and 5% insertrequests. Even though the scans are slow under MV-PBT, theyoutperform LSM-Trees due to the faster search and updates.Consider, Figure 15b depicting the YCSB throughput (work-load A) and the number of MV-PBT partitions over time. Thethroughput remains stable as the number of partitions increases. OLTP: comparison of B-Tree alternatives.
To establish the baseline, we first compare standard PostgreSQL B-Trees (PG/HOT) to B+-Trees with physical reference and indirection layer on top of append-only storage (SIAS [8, 10]) under TPC-C. In Figure 14a we show the throughput for different dataset sizes. The buffer cache of the DBMS is fixed to 600MB. The B-Tree (PG/HOT) performs well (Figure 14a) as long as the database buffer can accommodate most modifications. Under standard Postgres, updates are performed in base tables by Heap-Only Tuples, i.e. the predecessor version is cached on the same page where its successor is located. Therefore the index-maintenance effort is low. With growing data sizes (and therefore more modifications), the throughput falls rapidly. Append-based storage and one-point invalidation (SIAS [8, 10]) exhibit robust throughput: (a) physical references (Section 3.5) yield lower performance, due to the higher index-management overhead; (b) an indirection layer reduces index maintenance for insertions and index-key updates, yielding measurably better throughput, in particular for larger datasets.

Indexing Approaches under OLTP.
In a follow-up experiment, we compare the B-Tree with indirection layer (Section 3.5) to PBT and MV-PBT under TPC-C (Figure 14b). PBT and MV-PBT exhibit robust performance, which improves with larger datasets compared to the B-Tree. PBT with indirection layer exhibits high and robust performance (Figure 14b). PBT with physical references tends to close the performance gap for larger datasets, as the update density decreases with larger datasets.
MV-PBT is slower than PBT under OLTP workloads for several reasons. First, fewer MV-PBT index records fit on the same-sized P_N, since the records are larger because of the version information (transaction timestamps). Consequently, the number of partitions increases, yielding more I/O. Second, the average version-chain length under TPC-C is short: 1.15/2.18 versions for customer/stock, respectively [8]. Therefore, index-only visibility checks cannot improve performance significantly. Thus, MV-PBT exhibits 6% lower performance than PBT under TPC-C (Figure 14b). We implemented MV-PBT with an indirection layer as well as with physical references (Section 3.5). For brevity, Figure 14b depicts only the performance with physical references; both curves are almost identical. Therefore, MV-PBT is general enough to be implemented to match the rest of the system design.
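The index-only visibility check at the heart of this trade-off can be illustrated with a minimal Python sketch. The record layout and the field names t_create/t_invalidate are our simplification of the version information (transaction timestamps) carried in MV-PBT index records, not the actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IndexRecord:
    key: str
    tuple_id: int                 # reference into the base table
    t_create: int                 # commit timestamp of the creating transaction
    t_invalidate: Optional[int]   # commit timestamp of the invalidating transaction

def visible(rec: IndexRecord, snapshot_ts: int) -> bool:
    """Snapshot-isolation visibility: the version existed at snapshot
    time and had not yet been invalidated."""
    created = rec.t_create <= snapshot_ts
    not_invalidated = rec.t_invalidate is None or rec.t_invalidate > snapshot_ts
    return created and not_invalidated

def index_only_scan(records: List[IndexRecord], snapshot_ts: int) -> List[IndexRecord]:
    # The timestamps in the index records suffice; no base-table I/O
    # is needed to decide which tuple-versions the snapshot sees.
    return [r for r in records if visible(r, snapshot_ts)]
```

With version chains of only 1-2 versions (as under TPC-C) this saves little, but when a long-running query must sift through many transient versions, skipping the per-version base-table lookup is what yields the HTAP gains reported here.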
OLTP Garbage Collection.
In this experiment (Figure 14d) we quantify the performance effect of MV-PBT partition garbage collection (Section 5.6). It improves performance by 5% to 17%, since old, invisible versions are purged and need not be processed by scans, and since space is reclaimed, letting more index records fit in P_N. The opportunity for improvement under OLTP is, however, limited by the short average version-chain length: 1.15 versions for customer and 2.18 versions for stock under TPC-C [8]. With HTAP workloads, the amount of 'transient' versions (short-lived versions visible only throughout the duration of an analytical query) increases rapidly, as does the effect of garbage collection. Garbage collecting larger amounts of transient versions plays a major role in the performance improvement of MV-PBT over PBT and B-Tree under mixed workloads (Figure 12a).

RELATED WORK

The most popular indexing approaches in database management systems (DBMS) are based on B+-Trees. Their alphanumerically sorted structure can result in high write amplification under high update rates, and visibility checks require information that is only located in the tuple-versions in the base table. PostgreSQL uses Heap-Only Tuples (HOT) to reduce index-management operations. Index records reference items in the base table, which point to tuple-versions in the heap node. The corresponding tuple-versions are held on the same node and can be located by processing the version chain. If a tuple-version becomes garbage collected, the item is modified to reference the next version. This indirection layer reduces index modifications, but it cannot avoid write amplification of index nodes and requires the base table for visibility checking. Furthermore, the write amplification of base-table nodes increases for large datasets. MV-IDX [9] maintains a virtual identifier for each tuple and data nodes for each version as an indirection layer.
With Snapshot Isolation Append Storage (SIAS) [10], write amplification on base tables is reduced in comparison to HOT, but index-management operations can cause high write amplification and base-table nodes are still required for visibility checking [20]. LSM-Trees [17] reduce write amplification by collecting modifications in main-memory components, but they have no concept for managing tuple-versions and performing an index-only visibility check [20]. Time-Split B-Trees [15] and Multiversion B-Trees [5] are able to separate index records of old tuple-versions from the current dataset and to perform an index-only visibility check, but their maintenance operations are complex and can cause high write amplification of index nodes [20].

CONCLUSION

In the present paper we introduced MV-PBT as an approach to multi-version indexing. MV-PBT is an extension of a B-Tree, where an artificial leading column is prepended to the search key of each index record, and index records are placed in a buffered index partition which, when full, gets evicted and appended to persistent storage. MV-PBT is version-aware: index records contain version information and allow for an index-only visibility check. This is particularly beneficial for HTAP workloads, since long chains of transient versions exist due to the mix of short-lived updating transactions and long-running queries. Furthermore, MV-PBT exhibits a sequential write pattern due to the concept of partitions, which leads to less write amplification and better utilization of modern storage technologies. Under mixed workloads (CH-Benchmark), MV-PBT doubles the analytical throughput while improving the transactional throughput by 15%.
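The core mechanism summarized above (an artificial leading partition column, a buffered write partition P_N, and append-only eviction) can be sketched in a few lines of Python. This is an illustrative toy model under our own simplifying assumptions, not the WiredTiger-based implementation:

```python
import bisect

class MVPBTSketch:
    """Toy model of a Multi-Version Partitioned B-Tree: one common
    sorted key space, with the partition number as an artificial
    leading key column prepended to each search key."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # max records in the buffered partition
        self.partition_no = 0     # number of the buffered partition P_N
        self.buffer = {}          # in-memory P_N: key -> record
        self.sealed = []          # sorted list of ((partition, key), record)

    def insert(self, key, record):
        # All modifications go to the buffered partition P_N.
        self.buffer[key] = record
        if len(self.buffer) >= self.capacity:
            self._evict()

    def _evict(self):
        # Seal P_N: sort once and append sequentially; already persisted
        # index nodes are never updated in place (flash-friendly pattern).
        for key, record in sorted(self.buffer.items()):
            self.sealed.append(((self.partition_no, key), record))
        self.buffer = {}
        self.partition_no += 1

    def get(self, key):
        # Newest version wins: check P_N first, then probe sealed
        # partitions from newest to oldest within the common index.
        if key in self.buffer:
            return self.buffer[key]
        for p in range(self.partition_no - 1, -1, -1):
            i = bisect.bisect_left(self.sealed, ((p, key),))
            if i < len(self.sealed) and self.sealed[i][0] == (p, key):
                return self.sealed[i][1]
        return None
```

Because the partition number leads the key, every sealed partition occupies a contiguous range of the one common key space, so a probe is a range lookup in a single index rather than a search in a separate component, which mirrors the partition-search advantage over LSM components discussed in the YCSB evaluation.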
REFERENCES

[1] 2019. Database Test Suite. https://sourceforge.net/projects/osdldbt/files/dbt2/
[2] 2019. Oltpbench. https://github.com/oltpbenchmark/oltpbench/
[3] 2019. WiredTiger (MongoDB).
[4] Rudolf Bayer and Edward M. McCreight. 1970. Organization and Maintenance of Large Ordered Indices. In Proc. SIGFIDET (SIGMOD) 1970. 107–141.
[5] Bruno Becker, Stephan Gschwind, Thomas Ohler, Bernhard Seeger, and Peter Widmayer. 1996. An Asymptotically Optimal Multiversion B-tree. The VLDB Journal 5, 4 (Dec. 1996), 264–275.
[6] Richard Cole, Florian Funke, Alfons Kemper, et al. 2011. The Mixed Workload CH-benCHmark. In Proc. DBTest '11. Article 8, 6 pages.
[7] Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. 2013. OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases. Proc. VLDB Endow. 7, 4 (Dec. 2013), 277–288.
[8] Robert Gottstein. 2016. Impact of new storage technologies on an OLTP DBMS, its architecture and algorithms. Ph.D. Dissertation. TU Darmstadt.
[9] Robert Gottstein, Rohit Goyal, Sergej Hardock, Ilia Petrov, and Alejandro Buchmann. 2014. MV-IDX: Indexing in Multi-version Databases. In Proc. IDEAS 2014. 142–148.
[10] Robert Gottstein, Ilia Petrov, Sergey Hardock, and Alejandro P. Buchmann. 2017. SIAS-Chains: Snapshot Isolation Append Storage Chains. In ADMS@VLDB.
[11] Goetz Graefe. 2003. Partitioned B-trees - a user's guide. In Proc. BTW. 668–671.
[12] Goetz Graefe. 2003. Sorting And Indexing With Partitioned B-Trees. In CIDR.
[13] Juchang Lee, Hyungyu Shin, Chang Gyoo Park, Seongyun Ko, et al. 2016. Hybrid Garbage Collection for Multi-Version Concurrency Control in SAP HANA. In Proc. SIGMOD 2016. 1307–1318.
[14] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree: A B-tree for New Hardware Platforms. In Proc. ICDE 2013. 302–313.
[15] David Lomet and Betty Salzberg. 1990. The Performance of a Multiversion Access Method. In Proc. SIGMOD 1990. 353–363.
[16] Chen Luo and Michael J. Carey. 2019. LSM-based storage techniques: a survey. The VLDB Journal (19 Jul 2019).
[17] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The Log-structured Merge-tree (LSM-tree). Acta Inf. 33, 4 (June 1996), 351–385.
[18] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. 2017. Hybrid Transactional/Analytical Processing: A Survey. In Proc. SIGMOD 2017. 1771–1775.
[19] I. Petrov, R. Gottstein, and S. Hardock. 2015. DBMS on modern storage hardware. In Proc. ICDE 2015. 1545–1548.
[20] Christian Riegger, Tobias Vincon, and Ilia Petrov. 2017. Multi-version Indexing and Modern Hardware Technologies: A Survey of Present Indexing Approaches. In Proc. iiWAS 2017. 266–275.
[21] Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, et al. 2018. Building a Bw-Tree Takes More Than Just Buzz Words. In Proc. SIGMOD 2018. 473–488.
[22] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. 2017. An Empirical Evaluation of In-memory Multi-version Concurrency Control.