Real-Time LSM-Trees for HTAP Workloads
Hemant Saxena
University of Waterloo
[email protected]
Lukasz Golab
University of Waterloo
[email protected]
Stratos Idreos
Harvard University
[email protected]
Ihab F. Ilyas
University of Waterloo
[email protected]
ABSTRACT
Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER, a prototype implementation of our idea built on top of the RocksDB key-value store. In our evaluation, LASER is almost 5x faster than Postgres (a pure row-store) and two orders of magnitude faster than MonetDB (a pure column-store) for real-time data analytics workloads.
1 INTRODUCTION
The need for real-time analytics or Hybrid Transactional-Analytical Processing (HTAP) is ubiquitous in present-day applications such as content recommendation, real-time inventory/pricing, high-frequency trading, blockchain data analytics, and IoT [28]. These applications differ from traditional pure On-Line Transactional Processing (OLTP) and On-Line Analytical Processing (OLAP) applications in two aspects: 1) they have high data ingest rates [28]; 2) access patterns change over the lifecycle of the data [5, 15]. For example, recent data may be accessed via OLTP-style operations (point queries and updates) as part of an alerting application, or to impute a missing attribute value based on the values of other attributes [28]. Additionally, recent and historical data may be accessed via OLAP-style processes, perhaps to generate hourly, weekly, and monthly reports, where all values of one column (or a few selected columns, depending on the time span of the report) are scanned [28].
Recent systems, such as SAP HANA [20], MemSQL [3], and IBM Wildfire [14], support real-time analytics with write-optimized features for high data ingest rates and read-optimized features for HTAP workloads. In these systems, recent data are stored in a row-oriented format to serve point queries (OLTP), and older data are transformed to a column-oriented format suitable for OLAP. Such systems can be described as having a lifecycle-aware data layout.
We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine. LSM-Trees are widely used in key-value stores (e.g., Google's BigTable and LevelDB, Cassandra, Facebook's RocksDB), RDBMSs (e.g., Facebook's MyRocks, SQLite4), blockchains (e.g., Hyperledger uses LevelDB), and data stream and time-series databases (e.g., InfluxDB). While Cassandra and RocksDB can simulate columnar storage via column families, we are not aware of any lifecycle-aware LSM-Trees in which the storage layout can change throughout the lifetime of the data. We fill this gap in our work by extending the capabilities of LSM-based systems to efficiently serve real-time analytics and HTAP workloads.
An LSM-Tree is a multi-level data structure with a main-memory buffer and a number of levels of exponentially-increasing size (shown in Figure 1 and discussed in detail in Section 2). Periodically, or when full, the buffer is flushed to Level-0. When Level-0, which stores multiple flushed buffers, is nearly full, its data are merged into the sorted runs residing in level one (via a compaction process), and so on. We observe that LSM-Trees provide a natural framework for a lifecycle-aware storage engine for real-time analytics due to the following reasons.
(1) LSM-Trees are write optimized:
All writes and data transfers between levels are batched, allowing high write throughput.
(2) LSM-Trees naturally propagate data through the levels over time:
At any point in time, the buffer stores the most recent data that have not yet been flushed (perhaps data inserted within the last hour), Level-0 may contain data between one hour and 24 hours old, and levels one and beyond store even older data.
(3) Different levels can store data in different layouts:
Data may be stored in row format in the buffer and in some of the levels, and in column format in other levels. This suggests a flexible and configurable storage engine that can be adapted to the workload.
(4) Compaction can be used to change data layout:
Transforming the data from a row to a column format can be done seamlessly during compaction, when a level is merged into the next level.
We make the following contributions in this paper.
• We propose the
Real-Time LSM-Tree, which extends the design of a traditional LSM-Tree with the ability to store data in a row-oriented or a column-oriented format in each level.
• We characterize the design space of possible Real-Time LSM-Trees, where different designs are suitable for different workloads. To navigate this design space, we provide a cost model to select good designs for a given workload.
• We develop and empirically evaluate LASER, a Lifecycle-Aware Storage Engine for Real-time analytics based on Real-Time LSM-Trees. We implement LASER using RocksDB, which is a popular open-source key-value store based on LSM-Trees. We show that for real-time data analytics workloads, LASER is almost 5x faster than Postgres and two orders of magnitude faster than MonetDB.
2 BACKGROUND
Compared to traditional read-optimized data structures such as B-trees or B+-trees, LSM-Trees focus on high write throughput while allowing indexed access to data [26]. LSM-Trees have two components: an in-memory piece that buffers inserts and a secondary-storage (SSD or disk) piece. The in-memory piece consists of trees or skiplists, whereas the disk piece consists of sorted runs.
Figure 1 shows the architecture of an LSM-Tree, with the memory piece at the top, followed by multiple levels of sorted runs on disk (four levels, numbered zero to three, are shown in the figure). The memory piece contains two or more skiplists of user-configured size (two are shown in the figure). New records are inserted into the most recent (mutable) skiplist and into a write-ahead-log for durability. Once inserted, a record cannot be modified or deleted directly. Instead, a new version of it must be inserted, and marked with a tombstone flag in the case of deletion.
Once a skiplist is full, it becomes immutable and can be flushed to disk via a sequential write. Flushing is executed by a background thread (or can be called explicitly) and does not block new data from being inserted into the mutable skiplist. During flushing, each skiplist is sorted and serialized to a sorted run. Sorted runs are typically range-partitioned into smaller chunks called Sorted Sequence Tables (SSTs), which consist of fixed-size blocks. In Figure 1, we show sorted runs being range-partitioned by key into multiple SSTs. For example, the sorted run in Level-1 has four SSTs; the first SST contains values for the keys in the range 0-20, the second in the range 21-50, and so on. Each SST contains a list of data blocks and an index block. A data block stores key-value pairs ordered by key, and an index block stores the key ranges of the data blocks.
As sorted runs accumulate over time, query performance tends to degrade since multiple sorted runs may be accessed to find a record with a given key. To address this, sorted runs are gradually merged by a background process called compaction. The merging process organizes the disk piece into L logical levels of increasing sizes with a size ratio of T. For example, a size ratio of two means that every level is twice the size of the previous one. In Figure 1, we show four levels with increasing sizes. The parameters L and T are user-configurable in most LSM-Tree implementations, and their values depend on the expected number of entries in the database. Two common merging strategies are leveling and tiering [18, 26]. Their trade-offs are well understood: leveling has higher write amplification but is more read-optimized than tiering.
Furthermore, the "wacky continuum" [19] provides tunable read/write performance by adjusting the merging strategy and size ratios. Our Real-Time LSM-Tree is independent of the merging strategy, but we will use the leveling strategy in LASER since this is also used by RocksDB.
In leveling, each level consists of one sorted run, so the run at level i is T times larger than the run at level i − 1. As a result, the run at level i will be merged up to T times with runs from level i − 1 before it fills up, and each merge involves only the SSTs at level i whose key ranges overlap the incoming SSTs from level i − 1. This divides the merging process into smaller tasks, bounding the processing time and allowing parallelism. Sorted runs in Level-0 are not partitioned into SSTs (or have exactly one SST) because they are directly flushed from memory. Some implementations, such as RocksDB, make an exception for Level-0 and allow multiple sorted runs to absorb write bursts.
The merging process moves data from one level to the next over time. This puts recent data in the upper levels and older data in the lower levels, providing a natural framework for the lifecycle-aware storage engine proposed in this paper. In Figure 2, we present the results of an experiment using RocksDB with an LSM-Tree having five levels (zero through four), with Level-0 starting at 64MB and T = 2.
We inserted data at a steady rate until all the levels were full, with background compaction enabled. We show the distribution of keys in terms of their time-since-insertion for two compaction policies commonly used in RocksDB: kByCompensatedSize (Figure 2(a)) prioritizes the largest SST, and kOldestSmallestSeqFirst (Figure 2(b)) prioritizes SSTs whose key range has not been compacted for the longest time. For both compaction priorities, each level has a high density of keys within a certain time range. We will use time-based compaction priority because it is better at distributing keys based on time since insertion.
A point query starts from the most recent data and stops as soon as the search key is found (there may be older versions of this key deeper in the LSM-Tree, but the query only returns the latest version). First, the in-memory skiplists are probed. If the search key has not been found, then the sorted runs on disk are searched starting from Level-0. Within a sorted run, binary search is used to find the SST whose key range includes the key requested by the query. Then, the index block of this SST is binary-searched to identify the data block that may contain the key. Many LSM-Tree implementations include a bloom filter with each SST, and an SST is searched only if the bloom filter reports that the key may exist. We assume that the ranges of SSTs, the index blocks of SSTs, and bloom filters fit in main memory and are cached, as illustrated in Figure 1.
For range queries, all the skiplists and the sorted runs are scanned to find keys within the desired range. In many implementations (including RocksDB), range queries are implemented using multiple iterators, which are opened in parallel over each sorted run and the skiplists. Then, similar to a k-way merge, keys are emitted in sorted order while discarding old versions.
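To make this read path concrete, here is a minimal Python sketch of the lookup order just described. All names and structures (SST, lookup) are illustrative rather than RocksDB's API, and the bloom filter is simulated with a plain set:

from bisect import bisect_left

class SST:
    # Sketch of one Sorted Sequence Table: a key range, sorted data
    # blocks, and a "bloom filter" (a set standing in for the real thing).
    def __init__(self, key_range, blocks, bloom):
        self.key_range = key_range
        self.blocks = blocks
        self.bloom = bloom

    def get(self, key):
        lo, hi = self.key_range
        if not (lo <= key <= hi) or key not in self.bloom:
            return None
        for block in self.blocks:  # an index block would pick the block directly
            keys = [k for k, _ in block]
            i = bisect_left(keys, key)  # binary search within the data block
            if i < len(keys) and keys[i] == key:
                return block[i][1]
        return None

def lookup(key, skiplists, levels):
    # Probe newest-to-oldest: mutable/immutable skiplists, then Level-0,
    # Level-1, ...; stop at the first (most recent) version found.
    for sl in skiplists:
        if key in sl:
            return sl[key]
    for sorted_run in levels:  # one sorted run per level under leveling
        for sst in sorted_run:
            value = sst.get(key)
            if value is not None:
                return value
    return None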
We now summarize the cost of LSM-Trees in terms of writes, point queries, range queries, and space amplification [17, 18, 26]. We assume that leveling is used for compaction, that sorted runs are not partitioned into SSTs, and that the LSM-Tree is in a steady state, with all levels full and the volume of inserts equal to the volume of deletes.
Figure 1: LSM-Tree with leveling merge strategy.
Figure 2: Distribution of keys across levels based on time. (a) Compaction prioritized by size (kByCompensatedSize); (b) compaction prioritized by time (kOldestSmallestSeqFirst).
Table 1 summarizes the symbols used in the analysis. Let N be the total number of records, T be the size ratio between consecutive levels, and L be the number of levels. Let B denote the number of records in each data page, and let pg denote the number of pages in Level-0. For example, with a 4kB page and each record of size 100 bytes, B = 40; with Level-0 of size 64MB, pg = 16,384. Level-0 contains at most B.pg entries, and level i (i ≥ 0) contains at most T^i.B.pg entries. Furthermore, the largest level contains approximately N.(T − 1)/T (≈ T^L.B.pg) entries. The total number of levels is given by Equation 1.

L = ⌈ log_T( N/(B.pg) . (T − 1)/T ) ⌉    (1)
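As a quick sanity check of Equation 1, a small calculation following the running example (4kB pages, 100-byte records, a 64MB Level-0; the total entry count N is assumed for illustration):

import math

B, pg, T = 40, 16_384, 2        # entries per page, Level-0 pages, size ratio
N = 400_000_000                 # assumed total number of entries
L = math.ceil(math.log(N / (B * pg) * (T - 1) / T, T))
print(L)                        # number of levels needed to hold N entries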
Write amplification: Inserted or updated keys are merged multiple times across different levels over time; therefore, the insert or update I/O cost is measured in terms of write amplification. The worst-case write amplification corresponds to the I/O required to merge an entry all the way to the last level. An entry in level i is copied and merged every time level i − 1 is merged into level i. This can happen up to T times. Adding this up over L levels, each entry is merged L.T times. Since each disk page contains B entries, the write cost for each entry across all the levels is O(T.L/B).
Point queries: The worst-case lookup cost for an existing key is O(L) without bloom filters, because the entry may exist in the last level, requiring access to one block (whose range overlaps with the search key) in each level along the way. With bloom filters, the average cost of fetching a block from the first L − 1 levels is (L − 1).fpr, plus one I/O to fetch the entry from the last level, where fpr is the false positive rate of the bloom filter. In practice, fpr is roughly 1%, giving an I/O cost of O(1).
Range queries: Let s be the selectivity, which is the number of unique entries across all the sorted runs that fall within the target key range. If keys are uniformly spread across the levels, then in each level i, s/T^(L−i) entries will be scanned. With B entries per block, the total number of I/Os is O( (s/B) . Σ_{i=0}^{L} 1/T^(L−i) ). Since the largest level contributes most of the I/O, the cost simplifies to O(s/B).
Table 1: Summary of terms used in this paper.
N: number of entries
L: total number of levels
T: size ratio between adjacent levels
B: number of entries per data block
B_i^j: number of entries per block of the j-th CG at level i
pg: number of blocks in Level-0
c: number of columns
s: range query selectivity (i.e., the number of unique entries in the target key range)
Π: set of projected columns
g_i: number of CGs at level i, 1 ≤ g_i ≤ c
cg_size_i^j: size of the j-th CG at level i
CG_i: set of CGs at level i
E_gi: estimated number of CGs required by a projection
E_Gi: estimated sum of sizes of CGs required by a projection
Space amplification: This is defined as amp = N/unq − 1, where unq is the number of unique entries (keys). The worst-case space amplification occurs when all the entries in the first L − 1 levels are updates of entries in the last level. The first L − 1 levels contain 1/T of the data. Therefore, 1/T of the data in the last level are obsolete, giving a space amplification of O(1/T).
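These asymptotic costs translate into simple back-of-envelope helpers. The following sketch omits constants and is only meant to mirror the expressions above for a leveled, row-oriented LSM-Tree:

def write_amp(T, L, B):
    return T * L / B            # each entry is rewritten up to T times per level

def point_lookup_cost(L, fpr=0.01):
    return (L - 1) * fpr + 1    # bloom-filtered probes plus one real I/O

def range_query_cost(s, B):
    return s / B                # dominated by the largest level

def space_amp(T):
    return 1 / T                # worst case: last level fully shadowed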
3 REAL-TIME LSM-TREE
Lifecycle-driven hybrid workloads: We target real-time analytics workloads with high data ingest rates and access patterns that change with the lifecycle of the data. These workloads include a mix of writes and reads, with recent data accessed by OLTP-style queries (point queries, inserts, updates), and older data by OLAP-style queries (range queries) [5]. From a storage engine's viewpoint, we represent these workloads as combinations of inserts, updates, deletes, point reads, and scans. With key as the row identifier, row as the tuple with all the column values, and Π as the set of projected columns (e.g., Π = {A, C} means that the query requires values for columns A and C only), we consider the following operations (see the interface sketch after this list):
• insert(key, row): inserts a new entry.
• read(key, Π): for the given key, reads the values of the columns in Π.
• scan(key_low, key_high, Π): reads the values of the columns in Π where the key is in the range [key_low, key_high]. Range queries based on non-key column values also use this operator by simply scanning all the entries and filtering out the entries that are not within the range.
• update(key, value_Π): updates the values of the columns in Π for the given key. value_Π contains the column identifiers and their new values. For example, value_Π = {(A, nv_a), (B, nv_b)} indicates new values for columns A and B for the given key.
• delete(key): deletes the entry identified by key.
We assume that read and update access recently inserted keys with a wide Π (almost all the columns), while scan accesses a range of keys spanning historical and recent data with a narrow Π (one column or a few columns depending on the age of the data).
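In code form, this interface looks roughly as follows (a sketch with hypothetical names, not LASER's actual API):

from typing import Iterator

class LifecycleStore:
    # Sketch of the five operations defined above; bodies omitted.
    def insert(self, key: int, row: dict) -> None:
        """Insert a new entry; row maps every column name to a value."""

    def read(self, key: int, proj: set) -> dict:
        """Point read: values of the columns in proj for the given key."""

    def scan(self, key_low: int, key_high: int, proj: set) -> Iterator:
        """Values of the columns in proj for keys in [key_low, key_high]."""

    def update(self, key: int, value_proj: dict) -> None:
        """Update only the columns in value_proj, e.g. {'A': nv_a, 'B': nv_b}."""

    def delete(self, key: int) -> None:
        """Insert a tombstone for the given key."""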
Column groups (CGs): Hybrid storage layouts support the HTAP workloads defined above [28]. A hybrid storage layout is defined by column groups (CGs) that are stored together as rows [13]. Suppose we have a table with four columns: A, B, C, and D. In a row-oriented layout, there is a single CG corresponding to all the columns. In a column-oriented layout, each column corresponds to a separate CG. Other hybrid layouts are possible, e.g., two CGs of <A, B, C> and <D>, where the projection over columns A, B, and C is stored in row format, and the projection over D is stored separately. Column groups are advantageous when certain column values are co-accessed often in the workload.
The key insight that makes the Real-Time LSM-Tree a natural fit for a lifecycle-aware storage engine is that different levels may store data in different layouts. This creates a design space.
Figure 3: Design space of Real-Time LSM-Trees, with example column group (CG) configurations.
Design space: The design space for Real-Time LSM-Trees can be characterized by the column groups used in each level. In Figure 3, we show three examples. On the left, we show an extreme design point corresponding to a row-oriented format, which is used by existing LSM-Tree storage engines and is suitable for OLTP. On the right, we show the other extreme, corresponding to a pure columnar layout, which is suitable for OLAP. In the middle, we show a hybrid design, in which Level-0 is row-oriented, levels 1 and 2 use different combinations of CGs, and Level-3 switches to a pure columnar layout. This design may be suitable for mixed or HTAP workloads, with a column group configuration depending on the access patterns during the data lifecycle.
In the Real-Time LSM-Tree design, we keep the in-memory component and Level-0 the same as in the original LSM-Tree, as described in Section 2, to maintain high write throughput. However, the on-disk levels beyond Level-0 are split into CGs, where each CG stores its own sorted runs. As we will see in Section 4, each such sorted run is associated with tail indices and bloom filters to answer queries that access columns within the CG.
Since different levels may have different CG configurations, a Real-Time LSM-Tree must be able to change the data layout as data move from one level to another. As we will explain in Section 4.4, this can naturally be done during the compaction process.
CG containment assumption:
In principle, the space of Real-Time LSM-Trees consists of all possible combinations of CGs in each level. However, we make a simplifying assumption, since access patterns throughout the data lifecycle tend to change from row-friendly OLTP to column-friendly OLAP. In particular, we assume that any CG in level i must be a subset of (i.e., contained in) a single CG in level i − 1, for i ≥ 1. Returning to Figure 3, the design in the middle has two column groups in Level-1: <A, B> and <C, D>. This means that, for example, a CG of <A, B, C> or a CG of <B, C> is not a valid choice in Level-2. This assumption also simplifies layout changes during compaction, as we will see in Section 4.4.
No replication assumption: For some workloads, data in a given level may be touched by both OLTP and OLAP style queries, meaning that no single CG layout is suitable for that level. This may be true especially in the last level, which stores the oldest and the majority of the data. For these workloads, a level can be replicated to maintain two layouts, similar to the "fractured mirrors" approach [30], at the cost of storage and write amplification. However, we expect such situations to be rare in practice because OLTP patterns tend to be limited to recent data in real-time analytics workloads [5, 28], which are expected to fit in the first few levels.
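The containment assumption is straightforward to check programmatically. A minimal sketch over candidate designs (levels listed top-down, each level a list of column groups):

def satisfies_containment(design):
    # Every CG at level i must be contained in a single CG at level i-1.
    for upper, lower in zip(design, design[1:]):
        for cg in lower:
            if not any(cg <= parent for parent in upper):
                return False
    return True

hybrid = [
    [frozenset("ABCD")],                 # Level-1: row layout
    [frozenset("AB"), frozenset("CD")],  # Level-2
    [frozenset("A"), frozenset("B"), frozenset("CD")],  # Level-3
]
print(satisfies_containment(hybrid))     # True
invalid = [
    [frozenset("AB"), frozenset("CD")],
    [frozenset("BC"), frozenset("AD")],  # <B,C> spans two parent CGs
]
print(satisfies_containment(invalid))    # False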
4 LASER
We now describe the design of LASER, our HTAP storage engine based on Real-Time LSM-Trees. LASER borrows several concepts from column-store systems [6]: a data model for storing column groups (Section 4.1), column updates (Section 4.2), and "stitching" individual column values to reconstruct tuples (Section 4.3). LASER also requires a mechanism to change the data layout from one level to the next in the Real-Time LSM-Tree (Section 4.4).
4.1 Data Model
Since entries in an LSM-Tree are stored across multiple sorted runs and levels, column scans do not access the data contiguously. To fetch data in sorted order from different levels, we need to locate entries by their keys. Therefore, we store the keys along with the column group values, as shown in Figure 4. This is known as simulated columnar storage [6], and incurs read and storage overhead compared to storing only the column values in a contiguous data block. However, in LSM-Trees, this overhead is reduced due to the leveling merge strategy, and can be further reduced by compressing the data blocks and delta-encoding the keys within each data block. For example, in our evaluation, we observed that naïvely storing keys along with column group values took 86GB of disk space, using Snappy compression took 51GB, and delta-encoding the keys further reduced the space usage to 48GB. Storing the same amount of data in a pure column-store (MonetDB [22]), which stores only the column values, requires 43GB.
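As a concrete illustration of this simulated representation (with made-up values), a single logical row is split into one key-value entry per CG, so the key is repeated in every CG:

# One logical row of a table with columns A, B, C, D and key 100.
row = {"A": 5, "B": 7, "C": 2, "D": 9}
key = 100

# Level with CGs <A,B>; <C>; <D>: the key is stored with each CG's values.
cg_ab_entry = (key, (row["A"], row["B"]))
cg_c_entry = (key, (row["C"],))
cg_d_entry = (key, (row["D"],))
# Within a data block, consecutive keys can then be delta-encoded
# (store 100, then +1, +1, ...) to reduce the key-duplication overhead.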
Figure 4: Simulated column-group representation.
Figure 5: ColumnMergingIterators and LevelMergingIterators.
4.2 Inserts and Updates
Inserts are performed in the same way as in original LSM-Trees: an entry is inserted into the in-memory skiplist and is eventually moved to lower levels via flush and compaction jobs. Insertion of an existing key (and a corresponding value, containing the values of the remaining attributes) acts as an update, whereas insertion of an existing key with a tombstone flag acts as a deletion.
Updates of individual columns may be implemented in two ways. A straightforward way is to fetch the entire tuple that is to be updated, modify the column that is being updated, and re-insert the entire tuple. This is the standard approach in a row-oriented storage engine. Column-oriented storage engines [24, 32] and some HTAP storage engines [11] allow updates of individual columns. Similarly, in LASER, we allow insertion of partial rows that contain only the updated column values. Partial rows are eventually merged with complete rows, or other partial rows, at the time of compaction, and any older column values are discarded. For example, suppose we have four columns, <A, B, C, D>, and suppose we update columns B and C of the tuple with key 100. Here, we insert the following key-value pair: 100 : −, b', c', −, where b', c' are the updated values, and − denotes an unchanged value. If, during compaction, we find another entry for the same key, 100 : a, b, c, d, then the two entries are merged to give 100 : a, b', c', d.
4.3 Queries
Point queries (with projections) are handled by searching for the given key in the skiplists, and then down the levels until the latest value is found. To support projections efficiently, in each level we only probe the CGs that overlap with the projected columns, and the query result is returned as soon as the values for all of the projected columns are found. Since we allow updates of individual columns, (the latest version of) a given tuple may exist partially in one level and partially in another. For example, in Figure 5, the latest values of A and B for tuple 108 exist in Level-0, but the values of C and D exist in Level-2.
Range queries (with projections) are also handled by opening iterators for each level and returning values in sorted order, while discarding older versions of the entries. We optimize range queries with projections by opening iterators only for the overlapping column groups in each level. As was the case for point queries, subsets of column values may be found across different levels. We use
LevelMergingIterators to merge values across levels, and to stitch column values within a level we use ColumnMergingIterators. We provide the details of these iterators in Section 4.4.
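Returning to the key-100 example from above, merging a partial row with an older version reduces to a column-wise coalesce. A minimal sketch:

def merge_versions(newer, older):
    # newer/older: lists of column values; None marks an unchanged column.
    return [n if n is not None else o for n, o in zip(newer, older)]

update = [None, "b'", "c'", None]   # key 100: only columns B and C updated
full = ["a", "b", "c", "d"]         # older full row found during compaction
print(merge_versions(update, full)) # ['a', "b'", "c'", 'd']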
4.4 Changing the Data Layout via Compaction
In Section 2, we described the compaction process used by LSM-Trees to improve query performance. In LASER, we also utilize compaction to change the data layout. A compaction job selects the level that overflows the most, and merges all of its entries with the next level. Using this approach in LASER would require merging entries from all the CGs of an overflowing level with the next level. However, since we allow individual column updates, as mentioned in Section 4.2, different CGs can fill up at different rates. For example, certain column groups (bank balance, inventory) may be updated more frequently than others (contact information, item description). Therefore, treating all the CGs in the same way when scheduling a compaction job might push certain CG values to deeper levels even when top levels are not full, and therefore disrupt the time-based distribution of entries across levels. We modify the compaction strategy to select the most overflowing CG in the most overflowing level. To determine if a CG is overflowing, we define the capacity of a CG within a level by proportionally dividing the level capacity across all the CGs; any CG that exceeds its capacity is identified as an overflowing CG. We call this strategy a
CG local compaction strategy, in which the span of a compaction job is limited to only one CG from level i and the overlapping CGs at level i + 1.
We show two example compaction jobs in Figure 6. Compaction job 1 merges entries from CG <A, B> in level-1 to overlapping CGs (i.e., <A>; <B>) in level-2. Similarly, compaction job 2 is limited to only CG <C> in level-2 and level-3. To perform CG local compaction, we require two types of merging iterators:
LevelMergingIterators that merge entries from different levels, and
ColumnMergingIterators that combine column values from different CGs within the same level.
LevelMergingIterators support range queries and compaction jobs by fetching and merging qualifying tuples from each level, and discarding old attribute values when multiple versions are found for the same key. Figure 5 shows LevelMergingIterators collecting
tuples from three levels to answer a range query for keys between 50 and 108. Only the latest versions of keys 107 and 108 are returned.
Figure 6: Sorted runs of a Real-Time LSM-Tree with two highlighted compaction jobs.
ColumnMergingIterators combine values from different column groups within the same level. For each LevelMergingIterator, multiple ColumnMergingIterators are opened. Since there can exist only one version of each key and column value within a level, these iterators do not have to discard old versions. Instead, they fetch all the required column values for each key, some of which may be empty due to column updates. In Figure 5, we show ColumnMergingIterators for each level. In level-0, the iterators return partial values for 108 because the corresponding entry corresponds to an update of columns A and B. Similarly, in level-1, key 107 has a partial value.
The above iterators are used by CG local compaction in the following way: we first identify the most overflowing level, and the most overflowing CG at that level. Then, we identify the overlapping CGs in the next level, open LevelMergingIterators for both levels, and open the required ColumnMergingIterators for the respective LevelMergingIterators. Once the iterators are opened, entries are emitted in sorted order and are written to the new sorted run belonging to the next level.
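The following simplified sketch captures the essence of both iterator types; sorted runs are modelled as plain Python dicts and iterators, and cross-level stitching of partial values is omitted:

import heapq

def column_merging_iterator(cg_runs):
    # Align the CGs of ONE level by key; cg_runs is a list with one
    # {key: values} dict per CG. Missing values (column updates) yield None.
    for key in sorted(set().union(*cg_runs)):
        yield key, [run.get(key) for run in cg_runs]

def level_merging_iterator(level_iters):
    # K-way merge of per-level (key, values) streams, newest level first;
    # older versions of a key are discarded (ties broken by level index).
    heap = []
    for idx, it in enumerate(map(iter, level_iters)):
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], idx, entry[1], it))
    last_key = object()
    while heap:
        key, idx, values, it = heapq.heappop(heap)
        if key != last_key:
            yield key, values
            last_key = key
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], idx, entry[1], it))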
5 COST ANALYSIS
In this section, we analyze the cost of each operation supported by LASER, and compare it with the cost of a purely row-oriented LSM-Tree (Section 2) and a purely column-oriented LSM-Tree (a special case of a Real-Time LSM-Tree with as many CGs as columns). Table 2 summarizes the operations and their costs.
We use the variables listed in Table 1. Let 1 ≤ g_i ≤ c be the number of CGs at level 0 ≤ i ≤ L, where c is the total number of columns. The size of the j-th (1 ≤ j ≤ g_i) CG at level i is defined as the number of columns in the CG and is represented by cg_size_i^j. cg_size_i^j is c for all column groups at all levels in a row-style LSM-Tree, and 1 for all column groups at all levels in a column-style LSM-Tree. For each level i, we have the following relation between c, g_i, and cg_size_i^j:

c = Σ_{j=1}^{g_i} cg_size_i^j    (2)

We define B_i^j to be the number of entries in a data block of the j-th CG at level i. From Section 2, we know that a row-style LSM-Tree contains B entries in a block. The block size, in bytes, is fixed for an LSM-Tree; for example, in RocksDB it is 4kB by default. If D is the block size in bytes, then we have D = B.(key-size + value-size) = B.(1.dt_size + c.dt_size), where dt_size is the average datatype size of the columns, which covers both the column values and the key. This can be generalized for a Real-Time LSM-Tree, in which a block contains B_i^j entries: D = B_i^j.(1 + cg_size_i^j).dt_size. For example, in Figure 4, the relationship between the number of entries in a block of CG <A, B> and CG <C> is D = B_<A,B>.(1 + 2).dt_size = B_<C>.(1 + 1).dt_size, or B_<A,B> = 2.B_<C>/3.
The relationship between B and B_i^j is as follows:

B_i^j = B.(1 + c)/(1 + cg_size_i^j)    (3)

As cg_size_i^j decreases, B_i^j increases, because we can pack more entries of a smaller CG size into a block.
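A worked instance of Equation 3 for the Figure 4 example (c = 4 columns, with B = 40 from the running example):

def entries_per_block(B, c, cg_size):
    return B * (1 + c) / (1 + cg_size)   # Equation 3

B, c = 40, 4
print(entries_per_block(B, c, 2))   # CG <A,B>: 40*5/3, about 66.7 entries
print(entries_per_block(B, c, 1))   # CG <C>:   40*5/2 = 100 entries
print(entries_per_block(B, c, 4))   # full row: back to B = 40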
Write amplification: We start with the cost of write amplification for insert(key, row) operations. For a row-style LSM-Tree, the write amplification is the same as described in Section 2, i.e., O(T.L/B). For a column-style LSM-Tree, each level has c column groups (each with one column); therefore, the write amplification is O(c.T.L/B_i^j), where B_i^j = B.(1 + c)/2. For a Real-Time LSM-Tree, consider, for example, a level with CGs <A, B>; <C>; <D>, where B_<A,B> = B.(1 + 4)/(1 + 2) = 5B/3 and B_<C> = B_<D> = B.(1 + 4)/(1 + 1) = 5B/2. For each CG at each level, the merge cost is T/B_i^j (because entries are merged T times per level, as explained in Section 2). The total write amplification cost is O(Σ_{i=1}^{L} Σ_{j=1}^{g_i} T/B_i^j). Using Equations 2 and 3, this simplifies to O(T.L/B + (T/(B.c)).Σ_{i=1}^{L} g_i). The second term (i.e., (T/(B.c)).Σ_{i=1}^{L} g_i) represents the overhead of storing keys along with CG values due to the simulated column group representation. This overhead is at most T.L/B (because 1 ≤ g_i ≤ c), reached in a column-style LSM-Tree.

W := O( T.L/B + (T/(B.c)).Σ_{i=1}^{L} g_i )    (4)
Point lookups: The cost for a row-style LSM-Tree is the same as in Section 2, i.e., O(1) (assuming the false positive rate of the bloom filters is much smaller than 1). For a column-style LSM-Tree, the cost at each level is equal to the number of column groups containing the columns projected by the query. For a Real-Time LSM-Tree, this cost is similarly equal to the number of column groups containing the projected columns, summed over all the levels. We use E_gi (1 ≤ E_gi ≤ g_i) to denote the number of column groups required at level i. For example, if there are two CGs, <A, B>; <C, D>, in level i, then E_gi = 2 for Π = {A, C} and E_gi = 1 for Π = {A, B}. The total I/O cost is bounded by:

P := O( Σ_{i=1}^{L} E_gi )    (5)
Range queries: The I/O cost for a row-style LSM-Tree is the same as in Section 2, i.e., O(s/B). For a column-style LSM-Tree, the cost depends on the number of CGs containing the projected columns: blocks from each of the |Π| single-column CGs must be fetched (here, B_i^j = B.(1 + c)/2), giving an I/O cost of O(|Π|.s/(B.c)). For a Real-Time LSM-Tree, for each level we have O(Σ_{j∈G_i} s_i/B_i^j), where s_i is the selectivity at level i, and G_i is the set of CGs containing the projected columns. In Section 2, we defined s to be the selectivity of a range query over all the levels; the selectivity s_i for an individual level i can be estimated by dividing s by the capacity of that level. Using Equation 3, we obtain the following cost for each level: O((s_i/(c.B)).Σ_{j∈G_i}(1 + cg_size_i^j)). We define E_Gi := Σ_{j∈G_i}(1 + cg_size_i^j), i.e., the sum of the sizes of all the required CGs and their corresponding keys. For example, if there are CGs <A, B>; <C, D> in level i, then E_Gi = 6 for Π = {A, C} and E_Gi = 3 for Π = {A, B}. The overall cost of a range query is:

Q := O( Σ_{i=1}^{L} s_i.E_Gi/(c.B) )    (6)

Table 2: Summary of operations and their costs.
Insert amplification (W): row-style O(T.L/B); Real-Time O(T.L/B + (T/(B.c)).Σ_{i=1}^{L} g_i); column-style O(T.L/B)
Existing key lookup (P): row-style O(1); Real-Time O(Σ_{i=1}^{L} E_gi); column-style O(|Π|)
Range query (Q): row-style O(s/B); Real-Time O(Σ_{i=1}^{L} s_i.E_Gi/(c.B)); column-style O(|Π|.s/(c.B))
Update amplification (U): row-style O(T.L/B); Real-Time O(Σ_{i=1}^{L} T.E_Gi/(c.B)); column-style O(T.L.|Π|/(c.B))
Update amplification: The update amplification for a row-style LSM-Tree is the same as the insert amplification: O(T.L/B). For a column-style LSM-Tree, the cost depends on the number of column values that are updated, due to our CG local compaction strategy (Section 4.4). The amplification is O(T.L.|Π|/(B.c)), where Π is the set of updated columns. For a Real-Time LSM-Tree, update amplification depends on the sum of the sizes of the required CGs, which is estimated by E_Gi (see the range query cost above). Therefore, the amplification of an update operation is:

U := O( Σ_{i=1}^{L} T.E_Gi/(c.B) )    (7)
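Equations 4 through 7 translate directly into code. The sketch below takes per-level statistics as lists indexed by level (g, E_g, E_G, and per-level selectivities s), as defined in Table 1:

def insert_amp(T, L, B, c, g):
    return T * L / B + T / (B * c) * sum(g)                  # Eq. 4

def point_lookup(E_g):
    return sum(E_g)                                          # Eq. 5

def range_query(B, c, s, E_G):
    return sum(s_i * e / (c * B) for s_i, e in zip(s, E_G))  # Eq. 6

def update_amp(T, B, c, E_G):
    return sum(T * e / (c * B) for e in E_G)                 # Eq. 7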
Space amplification: As explained in Section 2, similar to a row-oriented LSM-Tree, the worst-case space amplification in a Real-Time LSM-Tree happens when all the entries in the first L − 1 levels are updates of entries in the last level. Therefore, the space amplification is still O(1/T).
6 DESIGN SELECTION
In this section, we describe how to select a suitable Real-Time LSM-Tree design for a given workload using the cost analysis from Section 5. Our goal is to find an optimal CG configuration for each level to minimize the total I/O cost for a given workload. Finding an optimal CG layout for a given workload has been studied in the context of hybrid database systems. Solutions span from heuristics [31] to complex modeling techniques that allow ranking the candidate CGs [7, 21].
In the context of LSM-Trees, the problem is critically different due to the flexibility of assigning a different layout to each level of the tree. That is, we are not searching for a single CG layout across the whole data and tree; rather, we are searching for an optimal layout for each level of the tree in a way that holistically optimizes the overall performance. A critical invariant is the CG containment constraint described in Section 3.2 (a CG at level i must be a subset of some CG at level i − 1).
Parameters: To find an optimal CG layout for a given workload, LASER requires: 1) parameters defining the Real-Time LSM-Tree structure, and 2) parameters defining the workload. As explained in Section 5, the costs of the operations depend on the Real-Time LSM-Tree structure, which is defined by the parameters T, L, and B (Section 2), and on the CG configuration CG. We represent the workload by wl, which is a set of operations. Let w be the number of insert operations, p be the number of read operations for existing keys, q be the number of scan operations, and u be the number of update operations in wl. Since we are searching for an optimal CG layout for each level independently, we additionally denote level i's workload by wl_i; similarly, p_i, q_i, and u_i represent the number of read, scan, and update operations, respectively, served at level i.
Obtaining parameter values: We assume that the values of the LSM-Tree parameters (T, L, B) are fixed based on the data size (N) and the operating system configuration (e.g., page size). Past research has shown how to tune T and L in an LSM-Tree [17-19]. Furthermore, B is usually fixed based on a 4kB block size (as in RocksDB). Overall, these parameter choices are orthogonal to LASER: they govern the high-level LSM-Tree architecture, while LASER optimizes the architecture within each run. As for the workload, we assume that, at the logical level, it consists of SQL statements. For the LASER storage engine, we convert the workload to the operations defined in Section 3.1. Profiling the workload wl_i at each level allows us to determine the values of w, p_i, q_i, u_i, and s_i. Finally, the values of E_gi and E_Gi are determined by the workload trace and the CG configuration under consideration, as discussed in Section 5.
Cost function: Let W_k be the cost of the k-th write operation in the workload, obtained using Equation 4; we define P_k, Q_k, and U_k similarly based on Equations 5 through 7. Following previous work on LSM-Tree design [18], we compute the cost of a workload for a given CG configuration CG by adding up the costs of all operations, as shown in Equation 8.

cost(CG) = Σ_{k=1}^{w} W_k + Σ_{k=1}^{p} P_k + Σ_{k=1}^{q} Q_k + Σ_{k=1}^{u} U_k    (8)

Since we need to find an optimal CG configuration at each level using per-level workload statistics, the cost function in Equation 8 can be split into a per-level cost, given by the following equation:

cost(CG_i) := w.T.g_i/(B.c) + Σ_{k=1}^{p_i} E_gik + Σ_{k=1}^{q_i} s_ik.E_Gik/(c.B) + Σ_{k=1}^{u_i} T.E_Gik/(c.B)    (9)

Here, CG_i = {cg_i1, cg_i2, ..., cg_ig} is the partitioning of the columns into g groups at level i that satisfies the CG containment constraint.
Optimization problem: For each level i, we want to find an optimal CG_i such that cost(CG_i) is minimized for the workload wl_i and the CG containment constraint is satisfied. This leads to the following optimization problem, for all i, 1 ≤ i ≤ L:

CG*_i = argmin_{CG_i} cost(CG_i)
s.t. ∀ cg_ij ∈ CG_i ∃ cg_(i−1)k ∈ CG_(i−1) such that cg_ij ⊆ cg_(i−1)k    (10)

Recall that we keep Level-0 row-oriented, so the CG containment constraint is trivially satisfied for level-1.
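A brute-force sketch of this per-level optimization: enumerate the set partitions of one parent CG's attributes (which automatically respects the containment constraint of Equation 10) and keep the partition that minimizes Equation 9. The workload inputs (w, reads, scans, updates) are assumed to be profiled per level as described above.

def partitions(attrs):
    # All set partitions of a list of attributes (exponential; the actual
    # algorithm, described below, prunes this space using the projections).
    if not attrs:
        yield []
        return
    head, rest = attrs[0], attrs[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part

def level_cost(cgs, w, reads, scans, updates, T, B, c):
    # Equation 9; reads/updates are lists of projections (sets of columns),
    # scans are (projection, selectivity) pairs.
    def E_g(p):
        return sum(1 for cg in cgs if set(cg) & p)
    def E_G(p):
        return sum(1 + len(cg) for cg in cgs if set(cg) & p)
    return (w * T * len(cgs) / (B * c)
            + sum(E_g(p) for p in reads)
            + sum(s * E_G(p) / (c * B) for p, s in scans)
            + sum(T * E_G(p) / (c * B) for p in updates))

def best_layout(parent_cg, w, reads, scans, updates, T, B, c):
    # Only partitions of a single parent CG are considered, which
    # enforces the containment constraint of Equation 10.
    return min(partitions(list(parent_cg)),
               key=lambda cgs: level_cost(cgs, w, reads, scans, updates, T, B, c))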
Previous work [21] takes the following three-step approach: 1) pruning the space of candidate CGs, 2) merging candidate CGs to avoid overfitting, and 3) selecting an optimal CG layout from the candidate CGs. The CG containment constraint can be added to the first step, further pruning the space of candidate CGs.
Let {a1, a2, ..., ac} be the attributes in relation R, and let Π_j be the projection of the j-th operation (point lookup, range query, or update operation) at level i. In the first step, we generate a CG partitioning with the smallest subsets, where every subset contains columns that are co-accessed by at least one operation. This is done by recursively splitting the attributes of R using the projections Π_j. For example, suppose R = {a1, a2, a3, a4}, and let Π1 = {a2, a3, a4}, Π2 = {a3, a4}, and Π3 = {a1, a2, a3, a4}. Then, splitting using Π1 gives the subsets {a1}, {a2, a3, a4}, and further splitting using Π2 gives the subsets {a1}, {a2}, {a3, a4} (Π3 does not split any subsets).
The next step is to merge the subsets from the previous step. Merging is beneficial for point queries, which typically have wider projections, while smaller subsets are beneficial for range scan operations, which typically have narrow projections. This tension between the access patterns of point queries and scan operations is used to decide which subsets should be merged: we merge smaller subsets only if the cost of running the workload with the larger subsets is lower. To systematically evaluate all merging possibilities, we start with the smallest subsets from the previous step, and consider all possible permutations of them for merging.
Finally, in the third step, we generate all possible CG partitions (covering all attributes of R) from the subsets generated in the previous step, and output the least-cost solution (the cost is given by Equation 9).
To satisfy the CG containment constraint, when considering level i, we change the initial set of attributes R to be the set of attributes from one CG at level i − 1, and we separately execute our solution for each CG at level i − 1. For example, if level-2 has two CGs, one over attributes {a1, ..., ak} and one over {ak+1, ..., ac}, then we solve two CG selection problems for level-3, one with R = {a1, ..., ak} and one with R = {ak+1, ..., ac}. The design selection algorithm starts with level-1, where the complete schema R is split into CGs using the three steps described above. Then, this process is repeated for level-2 onwards, where each CG at level i − 1 is further partitioned for level i. The worst-case time complexity of finding an optimal CG configuration at a single level is given by [21], and is exponential in the number of partitions generated in the first step. The overall worst-case time complexity for all levels equals the number of levels times the worst-case complexity of each level. Since the number of partitions is small in practice [21], and the CGs get smaller from one level to the next, the actual time taken by the design selection algorithm is expected to be small. For example, in our evaluation (Section 7), design selection took only 3 seconds for 100 columns and 8 LSM-Tree levels.
7 EXPERIMENTAL EVALUATION
In the experimental evaluation of LASER, we show that: 1) the empirical behaviour of LASER matches the cost model from Section 5, 2) LASER can outperform a pure row-store, a pure column-store, and other column-group hybrid designs, and 3) LASER's design is robust to minor workload changes.
Experimental setup:
We deployed LASER on a Linux machine running 64-bit Ubuntu 14.04.3 LTS. The machine has 12 CPUs in total across two NUMA nodes, running at 1.2GHz, with 15MB of L3 cache, 16GB of RAM, and a 4TB rotating Seagate Constellation ES.3 disk.
LASER implementation:
We implemented LASER on top of RocksDB 5.14. We added the components described in Section 4: the simulated CG layout, CG updates, support for projections in queries, LevelMergingIterators and ColumnMergingIterators, and the CG local compaction strategy. We reused other necessary but orthogonal components provided by RocksDB, such as in-memory skiplists, index blocks for SSTs, bloom filters, snapshots, and concurrency. To collect workload traces for design selection, we modified the RocksDB profiling tools to collect per-level statistics about operations and their projections. We implemented the design selection algorithm as an offline process that takes the workload trace and the LSM-Tree parameters as input.
LASER configuration:
Unless specified otherwise, we use the leveling compaction strategy with kOldestLargestSeqFirst compaction priority, with a maximum of 6 compaction threads. We use the RocksDB default values for other parameters, such as Level-0 size, SST size, and compression.
Alternative compaction strategies:
While we use leveling in our experiments, the results are orthogonal to the compaction strategy: regardless of the strategy, the number of entries in every level remains constant given a fixed size ratio (T). For example, tiering (the write-optimized merging strategy), or lazy leveling and the wacky continuum [19], which balance read and write costs, only affect the number of runs within a level but do not affect the number of entries in a level (runs will simply be smaller with those strategies compared to leveling). In our experiments, we vary the size ratio (T), which affects how entries spread across the levels and the number of levels. This is critical, as it affects the number of column-group layouts a Real-Time LSM-Tree can hold simultaneously.
Workload:
We generate workloads using the benchmark proposed by previous work [11, 12]. The benchmark consists of the following queries, which are common in HTAP workloads: (Q1) inserts new tuples, (Q2) is a point query that selects a specific row, (Q3) is an aggregate query that computes the maximum values of selected attributes over the selected tuples, (Q4) is an arithmetic query that sums a subset of attributes over the selected tuples, and (Q5) is an update query that updates a subset of attributes of a specific row. These queries are written in SQL as follows:
Q1: INSERT INTO R VALUES (a0, a1, ..., ac)
Q2: SELECT a1, a2, ..., ak FROM R WHERE a0 = v
Q3: SELECT MAX(a1), ..., MAX(ak) FROM R WHERE a0 ∈ [vs, ve)
Q4: SELECT a1 + a2 + ... + ak FROM R WHERE a0 ∈ [vs, ve)
Q5: UPDATE R SET a1 = v1, ..., ak = vk WHERE a0 = v
The parameters k, v, vs, and ve control projectivity, selectivity, overlap between queries, and access patterns throughout the data lifecycle. The benchmark includes two types of tables: a narrow table (with 30 columns) and a wide table (with 100 columns). Each table contains tuples with an 8-byte integer primary key a0 and a payload of c columns (a1, a2, ..., ac). Unless otherwise noted, the experiments use the table with 30 columns, with uniformly distributed integer values as keys. In all experiments, we run an initial data load phase, followed by a steady workload phase in which we record measurements.
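For concreteness, query instances can be generated from these templates as follows (a sketch; k controls projectivity, and v, vs, ve pick the keys, and hence the levels, that are touched):

def q2(k, v):
    cols = ", ".join(f"a{i}" for i in range(1, k + 1))
    return f"SELECT {cols} FROM R WHERE a0 = {v}"

def q3(k, vs, ve):
    aggs = ", ".join(f"MAX(a{i})" for i in range(1, k + 1))
    return f"SELECT {aggs} FROM R WHERE a0 >= {vs} AND a0 < {ve}"

print(q2(3, 42))        # SELECT a1, a2, a3 FROM R WHERE a0 = 42
print(q3(2, 100, 200))  # SELECT MAX(a1), MAX(a2) FROM R WHERE a0 >= 100 AND a0 < 200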
7.1 Validating the Cost Model
Goal: We begin by validating the costs of point reads, range scans, and write amplification presented in Section 5.
Methodology:
For a fixed schema and system environment (i.e., fixed c and B), the cost of these operations depends on the query projection size and the CG configuration. We validate the cost model using the narrow table with T = 2, as well as the wide table with T = 10. For the narrow table, we consider six Real-Time LSM-Tree designs, in which the CG sizes vary from 1 to 30, covering the extreme pure-row and pure-column layouts, and other designs in between. For each design, we use g = c/cg_size equi-width column groups in each level, and we set cg_size to a value in {1, 2, 3, 6, 15, 30}. In each design, the LSM-Tree has 8 levels, with Level-0 in row format. For the wide table, we consider four Real-Time LSM-Tree designs with equi-width CG sizes ranging from 1 to 100, and 5 LSM-Tree levels. To generate read and scan operations, we use Q2 and Q3, respectively, and we vary k from 1 to 30 to control the projection size. The queries are executed after the load phase (400 million entries loaded into the narrow table, and 200 million entries into the wide table) with the OS cache cleared, and we measure the average latency. The write amplification cost of the LSM-Tree is reflected in the background compaction process. To measure the compaction time, we first loaded all the entries in Level-0, with compaction disabled, and then scheduled compaction manually and measured its runtime. Compaction is completed when no level exceeds its capacity.
Results: Figures 7(a) and 7(b) show the latency of read operations with respect to the projection size and the number of CGs, respectively. The top figures correspond to the narrow table and the bottom figures correspond to the wide table. In Figure 7(a), when the CGs are small (i.e., similar to a column-oriented layout), latency increases linearly with the projection size because more CGs need to be fetched from disk. However, when the CGs are large (i.e., similar to a row-oriented layout), latency stays unchanged with the projection size because, for any projection size, all the columns are fetched. This is also implied by the point query cost given by Equation 5, which is plotted in black dotted lines for cg_size=1 (top line) and for cg_size=30/100 (bottom line). The empirical data in Figure 7(a) are in line with the cost equation.
In Figure 7(b), we vary the number of CGs while keeping the projection sizes fixed. For wide projections (i.e., fetching complete rows), the cost increases linearly with the number of CGs, because each CG is fetched in a separate disk I/O. However, for narrow projections (i.e., fetching a single column value), the I/O cost stays unchanged because a single disk I/O is enough to fetch the required column value. This is consistent with the point query cost given by Equation 5, which is plotted in black dotted lines for projection size 30/100 (top line) and 1 (bottom line) in Figure 7(b).
In Figures 7(c) and 7(d), we measure the latency of scan operations with respect to the projection size and CG size, respectively. Again, the top figures correspond to the narrow table and the bottom figures correspond to the wide table. Similar to Figure 7(a), we vary the projection sizes in Figure 7(c). For small CGs (i.e., similar to a column-oriented layout), latency increases linearly with the projection size because more disk I/O is required to fetch more CGs. However, for large CGs (i.e., similar to a row-oriented layout), latency stays almost constant with projection size, because many columns are fetched in a single disk I/O. This is consistent with the range query cost given by Equation 6, which is plotted in black dotted lines for CG size 1 (top line) and 30/100 (bottom line) in Figure 7(c).
In Figure 7(d), we vary the CG size while keeping the projection sizes fixed. For wider projection sizes, latency should stay constant with CG size, because almost all the columns need to be fetched irrespective of the CG layout, keeping the disk I/O cost constant. However, we see a latency decrease as we increase the CG size; this is because of the simulated CG layout used in LASER. For large CGs, we fetch the key only once, whereas for smaller CGs, the key is fetched along with each CG, increasing the latency. The change in latency for wider projection sizes is proportional to const1/cg_size + const2 (derived from Equation 6), as shown by the top black dotted line in Figure 7(d). For smaller projections, we expect latency to increase with CG size because of the overhead of fetching unnecessary columns. This is empirically observed in Figure 7(d) and matches the cost given by Equation 6. Similar observations were made in prior work on HTAP systems that allow configurable column groups [8, 11].
In Figure 7(e), we show that the compaction time (which reflects write amplification) for different CG sizes matches our write amplification cost. The cost given by Equation 4 is shown using a black dotted line for reference. This completes the empirical validation of the cost model described in Section 5, which is an important component of our design selection algorithm in Section 6.
Figure 7: The cost of operations in LASER matches the cost model in Section 5. The top row corresponds to the narrow table with T=2, and the bottom row corresponds to the wide table with T=10. (a) Read: average latency w.r.t. projection size for different CG sizes; (b) Read: average latency w.r.t. number of CGs for different projection sizes; (c) Scan: average latency w.r.t. projection size for different CG sizes; (d) Scan: average latency w.r.t. CG sizes for different projection sizes; (e) Write amplification: compaction time w.r.t. CG size.
7.2 Lifecycle-Driven HTAP Workload
Goal:
We show that LASER can speed up mixed workloads that change with the data lifecycle. We compare the performance of LASER's Real-Time LSM-Tree storage layout against pure row-oriented, pure column-oriented, and certain fixed column-group layouts. We also compare LASER with a row-store DBMS (Postgres) and a column-store DBMS (MonetDB [22]).
Methodology:
We generate an HTAP workload (HW) using queries Q1−Q5. To emulate a data lifecycle, we continuously insert new data (Q1) at a steady rate of 10,000 insert operations per second. This ensures that entries continuously move from one level to the next. Along with the inserts, we issue 100 updates per second, i.e., one percent of the insert rate, via Q5, where a randomly chosen column value is updated for a recently inserted key. This update pattern mimics updates and corrections frequently taking place in mixed analytical and transactional processing [12]. Furthermore, we control the access patterns throughout the data lifecycle by selecting k, v, vs, and ve for queries Q2−Q4 such that the upper levels of the LSM-Tree are mostly accessed by point read operations and wider projections, whereas lower levels are accessed by scan operations and narrower projections. This allows us to generate a lifecycle-driven hybrid workload, as described in Section 3.1, to stress-test LASER's Real-Time LSM-Tree.
We use two variants of Q2 for point access of recent data: HW-Q2a and HW-Q2b. The v value in each variant is determined by a normal distribution over the time-since-insertion values of the keys. In Figure 9(a), we show the two distributions from which v is selected. The mean of the first distribution is 0.98 (typically accessing data from the in-memory skiplists, Level-0, or Level-1), and 0.85 for the second distribution (typically accessing data from Level-2 or Level-3); each distribution has a standard deviation of 0.02. Q2a queries fetch all 30 attributes, whereas Q2b fetches columns 16-30. For analytical operations, we use Q3, which accesses columns 21-30 for 5% of the keys, and Q4, which accesses columns 28-30 for 50% of the keys. Since our keys are uniformly distributed integer values, these queries access data from all the levels, but the amount of data scanned at level i + 1 is T times that at level i. Table 3 summarizes the properties of these operations.
We first load 400 million entries, and then execute the workload until another 20 million entries are inserted. Queries HW-Q2a and HW-Q2b are spread uniformly, whereas Q3 and Q4 are executed towards the end. Queries Q2−Q4 are issued using four concurrent client threads, whereas a separate client thread is responsible for write operations (Q1 and Q5).
The CG configuration used by LASER for this workload is labelled D-opt (see Figure 9(b)) and is obtained using the approach described in Section 6.3. For comparison, we select five other designs with varying CG sizes. The design with cg_size=30 corresponds to a pure row-oriented layout, which is default RocksDB in our setting, and the design with cg_size=1 corresponds to a pure column-oriented layout. The remaining three designs correspond to CG sizes that match the projections of the operations in the workload HW: the design with cg_size=15 matches the projection of Q2b, the design with cg_size=3 matches the projection of Q4, and the design with cg_size=6 partly matches the projections of the remaining queries. To test various CG layouts within a reasonable time, we opted for deeper LSM-Trees and therefore set the level size ratio (T) to 2. For all the designs, the LSM-Trees have 8 levels, with Level-0 in row format. To isolate the impact of the storage layout, we simulate these five designs within LASER. Additionally, we executed this workload in Postgres 9.3 (a row-store) and MonetDB 5 server v11.33.3 (a column-store) [22].
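The age-skewed key choice for HW-Q2a and HW-Q2b can be sketched as follows, assuming keys can be mapped to their insertion order: a normalized age rank in [0, 1] (1 = newest) is drawn from the stated normal distribution and mapped to an insertion position.

import random

def pick_key(num_inserted, mean, stddev=0.02):
    r = min(max(random.gauss(mean, stddev), 0.0), 1.0)
    return int(r * (num_inserted - 1))   # keys indexed by insertion order

recent = pick_key(400_000_000, mean=0.98)  # HW-Q2a: skiplists/Level-0/Level-1
older = pick_key(400_000_000, mean=0.85)   # HW-Q2b: typically Level-2/Level-3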
Results: Figure 8 shows that LASER's optimal design outperforms the other storage layouts when executing the mixed workload described in Table 3. Figure 8(a) shows that LASER took the least total time to execute queries Q2a, Q2b, Q3, and Q4.
Figure 8: LASER performs the best on the HTAP workload (HW). (a) Workload runtime of different designs; (b) average latency of inserts (Q1), point queries (Q2a, Q2b), and updates (Q5); (c) average latency of range queries (Q3, Q4).
Figure 9: Read operation patterns and optimal design used in Section 7.2. (a) Q2a: mean=0.98, Q2b: mean=0.85; (b) design D-opt used by LASER.
Table 3: Summary of the HTAP workload HW (each operation's projection k, key (v) distribution, and count).
Designs with cg_size=1 and cg_size=2, as well as MonetDB, did not finish within our time-limit-exceeded (TLE) window of 24 hours. Therefore, we instead report their average latencies in Figures 8(b) and 8(c). In Figure 8(b), LASER's design has the lowest latency for inserts (Q1), point queries (Q2a, Q2b), and updates (Q5). MonetDB has the highest latency, orders of magnitude slower than LASER. Insert and update latencies across all the LSM-Tree designs (including LASER's) are the same because they all append the data to an in-memory skiplist, which is not impacted by the layout of the disk levels. In Figure 8(c), LASER's latency is close to the latency of the design best suited to each query. For example, cg_size=3 is suitable for Q4 and cg_size=15 is suitable for Q3, and LASER's latency is close to the best latency for both of these queries. For Q3, MonetDB performs 5x better than LASER because it stores all the data in contiguous columns, making it suitable for aggregation queries. However, MonetDB performs 20x worse than LASER for Q4. Postgres performs as well as LASER on one of the range queries but performs 1.5x worse on the other.
Traditional HTAP systems, such as SAP HANA [20], typically consist of a write-optimized component for inserts/updates and point queries against recent data, and a read-optimized component for analytical queries. To demonstrate the benefits of LASER against an ideal traditional HTAP system, we utilize Postgres and MonetDB. That is, we extrapolate the performance of a hypothetical HTAP system that consists of Postgres as a write-optimized component and MonetDB as a read-optimized component. We call this system
HTAP-X, and we report numbers that represent an ideal scenario, which does not include the cost of moving data from the row-store to the column-store component. This cost can be high if we need to persist columnar data on disk, or relatively low if we only keep columnar data temporarily in memory and then discard it. Even if this move happens in the "background", i.e., query processing happens on the rest of the data in parallel, those background threads could otherwise have been used for query processing, especially by the read-optimized part of the HTAP system, which will typically perform large scans. In Figure 8(a), we show results where the Postgres part of HTAP-X handles 𝑄1, 𝑄2, and 𝑄5, and the MonetDB part handles 𝑄3 and 𝑄4. The results show that LASER would perform almost 9x better than this ideal HTAP system. The reason for the improved performance is that LASER leverages the fundamental LSM architecture, which is able to absorb data much faster than Postgres, while at the same time being able to execute analytical queries as fast as MonetDB due to its hybrid layouts.
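To make the extrapolation concrete, the sketch below (our illustration; the timings are placeholders, not measured values from the paper) shows how HTAP-X's total workload time is composed: each component is charged only for the operations routed to it, and the row-to-column migration cost is set to zero for the ideal case.

```python
# Hypothetical per-query times in minutes; placeholders for illustration only.
postgres_minutes = {"Q1": 0.0, "Q2a": 0.0, "Q2b": 0.0, "Q5": 0.0}
monetdb_minutes = {"Q3": 0.0, "Q4": 0.0}

def htap_x_total(postgres_minutes, monetdb_minutes, migration_cost=0.0):
    # Ideal HTAP-X: the write-optimized part (Postgres) handles Q1, Q2,
    # and Q5; the read-optimized part (MonetDB) handles Q3 and Q4.
    # migration_cost=0.0 models the best case reported in Figure 8(a).
    return (sum(postgres_minutes.values())
            + sum(monetdb_minutes.values())
            + migration_cost)

print(htap_x_total(postgres_minutes, monetdb_minutes))
```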
Goal: The actual workload can be different from the representative workload that was used to design the storage layout, either due to inaccurate workload monitoring or due to shifts in the workload over time. We now measure the performance impact when the tested workload deviates from the representative workload.
Methodology:
We consider two types of shifts from the workload HW used in Section 7.2: 1) a vertical shift in read access patterns, and 2) a horizontal shift in scan projections. For the vertical shift, we offset the mean of the normal distribution used to generate keys for 𝑄2𝑎 and 𝑄2𝑏 in HW. For example, an offset of 0.1 results in a mean of 0.88 for the distribution generating keys for 𝑄2𝑎 and a mean of 0.75 for 𝑄2𝑏. For the horizontal shift, we offset the projection of 𝑄4 in HW to the left by some amount. For example, offsetting the projection by 2 results in projecting columns 26-28, a shift of 4 results in columns 24-26, and so on.
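To see why a horizontally shifted projection can hurt, note that a scan must read, in full, every column group that overlaps the requested columns. The sketch below is our illustration with hypothetical column-group boundaries, not the actual D-opt layout; it computes how many columns a scan physically reads for a given level layout and projection.

```python
def columns_read(cg_layout, projection):
    """Columns physically read by a scan: every column group (CG) that
    overlaps the projection must be read in full.
    cg_layout: list of (first_col, last_col) CGs, inclusive.
    projection: (first_col, last_col) requested by the query."""
    lo, hi = projection
    touched = [g for g in cg_layout if g[0] <= hi and g[1] >= lo]
    return sum(last - first + 1 for first, last in touched)

# Hypothetical layout for one deep level: two wide CGs split at column 15.
deep_level = [(1, 15), (16, 30)]

print(columns_read(deep_level, (28, 30)))  # aligned Q4: reads 15 columns
print(columns_read(deep_level, (14, 16)))  # offset by 14: straddles both
                                           # CGs and reads all 30 columns
```

Under this hypothetical layout, the misaligned projection doubles the data read (30 columns versus 15), which is consistent with the roughly 2x latency degradation reported below.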
Results: Figure 10(a) shows the results of vertical shifts in the read pattern while keeping the rest of the workload the same as HW.

[Figure 10: LASER's design is robust to minor workload shifts. (a) Change in read latency (microsec) with shifting read pattern; (b) change in scan latency (sec) with shifting projections.]

The latency of the read operations initially increases with the shift but then stays constant. This is because the top levels are small, so the CG layout changes with the offset, but eventually, when the offset fetches keys from the last few levels (which are larger), the CG layout no longer changes with the offset.

In Figure 10(b), we show the results of horizontal shifts in the scan projection while keeping the rest of the workload the same as HW. The scan latency can become up to 2x worse if the projections are misaligned and fetch data from a few wide CGs. For example, when the offset is 14, the projection is columns 14-16, which spans two large CGs in levels 6 and 7 and results in 2x higher latency. However, if the CGs are small, or only a single CG is required by the projection, the impact of misalignment is smaller. From Figures 10(a) and 10(b), we conclude that the performance of both read and scan operations can deteriorate significantly if the mismatch between the actual workload and the representative workload is significant. Thus, if the actual workload deviates from the representative workload over time, then LASER should be re-configured. We consider a self-configuring Real-Time LSM-Tree as a direction for future work, as part of online tuning of physical design [16].

RELATED WORK

The proposed idea of a Real-Time LSM-Tree lies at the intersection of LSM-Trees and HTAP storage engines. In this section, we discuss related work in these domains.
Adoption of LSM-Trees:
LSM-Trees are used in key-value stores and NoSQL systems such as LevelDB [2], RocksDB [4], Cassandra [23], HBase [1], and AsterixDB [10]. A recent survey [26] describes how the original LSM-Tree idea [29] has been adopted by various industry and open-source DBMSs. Other applications of LSM-Trees include the log-structured history access method (LHAM) [27], which supports temporal workloads by attaching timestamp ranges to each sorted run and pruning irrelevant sorted runs at query time. Furthermore, LSM-trie [33] is an LSM-Tree-based hash index for managing a large number of key-value pairs where the metadata, such as index pages, cannot be fully cached. However, LSM-trie only supports point lookups, since its optimizations heavily depend on hashing. Finally, the LSM-based tuple compaction framework in AsterixDB [9] leverages LSM lifecycle events (flushing and compaction) to extract and infer schemas for semi-structured data. Similarly, we exploited LSM-Tree properties, such as data propagation through the levels over time and compaction, in LASER.
Improvements of LSM-Trees:
Recent works have optimized various components of LSM-Trees, such as allocating space for Bloom filters [17] and tuning the compaction strategy [18]. LSM-Trees have been criticized for their write stalls and large performance variance due to the mismatch between fast in-memory writes and slow background operations. To mitigate this, Luo et al. [25] proposed a scheduler for concurrent compaction jobs. Many of these recent improvements are orthogonal to the design of our Real-Time LSM-Tree.
HTAP systems and storage engines:
One of the first approaches, fractured mirrors, simultaneously maintained one copy of the data in row-major layout for OLTP workloads and another copy in column-major layout for OLAP workloads [30]. This approach has been adopted by Oracle and IBM to support a columnar layout as an add-on. Although these systems achieve better ad-hoc OLAP performance than a pure row-store, the cost of synchronizing the two replicas is high. HYRISE [21] automatically partitions tables into column groups based on how columns are co-accessed by queries. Systems such as SAP HANA [20], MemSQL [3], and IBM Wildfire [14] split the storage into OLTP-friendly and OLAP-friendly components. Data are ingested by the OLTP-friendly component, which is write-optimized and uses a row-major layout, and are eventually moved to the OLAP-friendly component, which is read-optimized and uses a column-major layout. Peloton [11] generalizes this idea by partitioning the data into multiple components called tiles, with different column group layouts. In this work, we described these systems as having a lifecycle-aware data layout, and we showed that LSM-Trees are a natural fit for a lifecycle-aware storage engine.
CONCLUSION

In this paper, we showed that Log-Structured Merge (LSM) Trees can be used to design a lifecycle-aware storage engine for HTAP systems. To do so, we proposed the idea of a Real-Time LSM-Tree, in which different levels may be configured to store the data in different formats, ranging from purely row-oriented to purely column-oriented. We presented a design advisor that selects an appropriate Real-Time LSM-Tree design given a representative workload, and we implemented a proof-of-concept prototype, called LASER, on top of the RocksDB key-value store. Our experimental results showed that for real-time data analytics workloads, LASER is almost 5x faster than Postgres (a pure row-store) and two orders of magnitude faster than MonetDB (a pure column-store).
REFERENCES
[6] Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., Hanover, MA, USA.
[7] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. 2004. Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04). Association for Computing Machinery, New York, NY, USA, 359–370. https://doi.org/10.1145/1007568.1007609
[8] Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-Free Adaptive Store. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 1103–1114. https://doi.org/10.1145/2588555.2610502
[9] Wail Y. Alkowaileet, Sattam Alsubaiee, and Michael J. Carey. 2020. An LSM-based Tuple Compaction Framework for Apache AsterixDB. Proc. VLDB Endow.
[10] Sattam Alsubaiee et al. 2014. AsterixDB: A Scalable, Open Source BDMS. Proc. VLDB Endow. 7, 14 (2014), 1905–1916. https://doi.org/10.14778/2733085.2733096
[11] Joy Arulraj, Andrew Pavlo, and Prashanth Menon. 2016. Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, 583–598. https://doi.org/10.1145/2882903.2915231
[12] Manos Athanassoulis, Kenneth S. Bøgh, and Stratos Idreos. 2019. Optimal Column Layout for Hybrid Workloads. Proc. VLDB Endow. 12, 13 (Sept. 2019), 2393–2407. https://doi.org/10.14778/3358701.3358707
[13] Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Richard Sidle, Knut Stolze, and Sandor Szabo. 2012. Business Analytics in (a) Blink. IEEE Data Eng. Bull. 35, 1 (2012), 9–14. http://sites.computer.org/debull/A12mar/blink.pdf
[14] Ronald Barber, Christian Garcia-Arellano, Ronen Grosman, René Müller, Vijayshankar Raman, Richard Sidle, Matt Spilchen, Adam J. Storm, Yuanyuan Tian, Pinar Tözün, Daniel C. Zilio, Matt Huras, Guy M. Lohman, Chandrasekaran Mohan, Fatma Özcan, and Hamid Pirahesh. 2017. Evolving Databases for New-Gen Big Data Applications. In CIDR.
[15] Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a Needle in Haystack: Facebook's Photo Storage. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10). USENIX Association, USA, 47–60.
[16] Nicolas Bruno and Surajit Chaudhuri. 2007. An Online Approach to Physical Design Tuning. In Proc. International Conference on Data Engineering (ICDE '07). 826–835.
[17] Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 79–94. https://doi.org/10.1145/3035918.3064054
[18] Niv Dayan and Stratos Idreos. 2018. Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 505–520. https://doi.org/10.1145/3183713.3196927
[19] Niv Dayan and Stratos Idreos. 2019. The Log-Structured Merge-Bush & the Wacky Continuum. In ACM SIGMOD International Conference on Management of Data.
[20] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA Database - An Architecture Overview. IEEE Data Eng. Bull. 35, 1 (2012), 28–33.
[21] Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. 2010. HYRISE: A Main Memory Hybrid Storage Engine. Proc. VLDB Endow. 4, 2 (Nov. 2010), 105–116. https://doi.org/10.14778/1921071.1921077
[22] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull. 35, 1 (2012), 40–45.
[23] Avinash Lakshman and Prashant Malik. 2010. Cassandra: A Decentralized Structured Storage System. SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35–40. https://doi.org/10.1145/1773912.1773922
[24] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. 2012. The Vertica Analytic Database: C-Store 7 Years Later. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1790–1801. https://doi.org/10.14778/2367502.2367518
[25] Chen Luo and Michael J. Carey. 2019. On Performance Stability in LSM-based Storage Systems. Proc. VLDB Endow.
[26] Chen Luo and Michael J. Carey. 2020. LSM-based Storage Techniques: A Survey. VLDB J. 29, 1 (2020), 393–418. https://doi.org/10.1007/s00778-019-00555-y
[27] Peter Muth, Patrick O'Neil, Achim Pick, and Gerhard Weikum. 2000. The LHAM Log-Structured History Data Access Method. The VLDB Journal 8 (2000), 199–221.
[28] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. 2017. Hybrid Transactional/Analytical Processing: A Survey. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1771–1775. https://doi.org/10.1145/3035918.3054784
[29] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Inf. 33, 4 (June 1996), 351–385. https://doi.org/10.1007/s002360050048
[30] Ravishankar Ramamurthy, David J. DeWitt, and Qi Su. 2002. A Case for Fractured Mirrors. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02). VLDB Endowment, 430–441.
[31] Philipp Rösch, Lars Dannecker, Franz Färber, and Gregor Hackenbroich. 2012. A Storage Advisor for Hybrid-Store Databases. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1748–1758. https://doi.org/10.14778/2367502.2367514
[32] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2005. C-Store: A Column-Oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, 553–564.
[33] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. 2015. LSM-Trie: An LSM-Tree-Based Ultra-Large Key-Value Store for Small Data. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '15). USENIX Association, USA.