Breaking Down Memory Walls: Adaptive Memory Management in LSM-based Storage Systems (Extended Version)
Chen Luo
University of California, [email protected]
Michael J. Carey
University of California, [email protected]
ABSTRACT
Log-Structured Merge-trees (LSM-trees) have been widely used in modern NoSQL systems. Due to their out-of-place update design, LSM-trees have introduced memory walls among the memory components of multiple LSM-trees and between the write memory and the buffer cache. Optimal memory allocation among these regions is non-trivial because it is highly workload-dependent. Existing LSM-tree implementations instead adopt static memory allocation schemes due to their simplicity and robustness, sacrificing performance. In this paper, we attempt to break down these memory walls in LSM-based storage systems. We first present a memory management architecture that enables adaptive memory management. We then present a partitioned memory component structure with new flush policies that better exploits the write memory to minimize the write cost. To break down the memory wall between the write memory and the buffer cache, we further introduce a memory tuner that tunes the memory allocation between these two regions. We have conducted extensive experiments in the context of Apache AsterixDB using the YCSB and TPC-C benchmarks and we present the results here.
Log-Structured Merge-trees (LSM-trees) [42] are widely used inmodern NoSQL systems, such as LevelDB [4], RocksDB [5], Cas-sandra [2], HBase [3], X-Engine [26], and AsterixDB [1]. Unliketraditional in-place update structures, LSM-trees adopt an out-of-place update design by first buffering all writes in memory; they aresubsequently flushed to disk to form immutable disk components.The disk components are periodically merged to improve queryperformance and reclaim space occupied by obsolete records.Efficient memory management is critical for storage systems toachieve optimal performance. Compared to update-in-place sys-tems where all pages are managed within shared buffer pools, LSM-trees have introduced additional memory walls. Due to the LSM-tree’s out-of-place update nature, the write memory is isolatedfrom the buffer cache. Moreover, the write memory must be sharedamong multiple LSM-trees since each LSM-tree manages its mem-ory component independently. Since the optimal memory allocationheavily depends on the workload, memory management should beworkload-adaptive to maximize the system performance.Unfortunately, adaptivity is non-trivial, as it is highly workload-dependent. Existing LSM-tree implementations, such as RocksDB [5]and AsterixDB [28], have opted for simplicity and robustness overoptimal performance by adopting static memory allocation schemes.For example, RocksDB sets a static size limit (default 64MB) for eachmemory component. AsterixDB specifies the maximum number Nof writable datasets (default 8) so that each active dataset, includingits primary and secondary indexes, receives 1/N of the total write memory. Both systems allocate separate static budgets for the writememory and the buffer cache.In this paper, we seek to break down these memory walls in LSM-based storage systems to maximize performance and efficiency. Asthe first contribution, we present a memory management archi-tecture to enable adaptive memory management for LSM-basedstorage systems. In this architecture, the overall memory budget isdivided into the write memory region and the buffer cache region.Within the write memory region, the memory allocation of eachmemory component is purely driven by its demands, i.e., write rates,to minimize the overall write amplification. The two regions areconnected via a memory tuner that adaptively tunes the memoryallocation between the write memory and the buffer cache.As the second contribution of this paper, we propose a new LSMmemory component structure to manage the write memory. Thekey insight is to adopt an in-memory LSM-tree to maximize thememory utilization and reduce the write amplification. We furtherpresent new flush policies to manage the memory components ofmultiple LSM-trees to minimize the overall write cost.The third contribution of this paper is the detailed design of amemory tuner that adaptively tunes the memory allocation betweenthe write memory and the buffer cache to minimize the system’soverall I/O cost. The memory tuner performs on-line tuning bymodeling the I/O cost of LSM-trees without any a priori knowledgeof the workload. This further allows the memory tuner to quicklyadjust the memory allocation when the workload changes.We have implemented all of the proposed techniques insideApache AsterixDB [1]. We have carried out extensive experimentson both the YCSB benchmark [18] and the TPC-C benchmark [6]to evaluate the effectiveness of the proposed techniques. 
The ex-perimental results show that the proposed techniques successfullyreduce the disk I/O cost via adaptive memory management, whichin turn maximizes system efficiency and overall performance.The remainder of this paper is organized as follows. Section 2 dis-cusses background information and related work. Section 3 presentsour adaptive memory management architecture for LSM-trees. Sec-tion 4 describes the new memory component structure for manag-ing the write memory. Section 5 presents the design and implemen-tation of the memory tuner. Section 6 experimentally evaluates theproposed techniques. Finally, Section 7 concludes the paper.
The LSM-tree [42] is a persistent index structure optimized for write-intensive workloads. LSM-trees perform out-of-place updates by always buffering writes into a memory component and appending log records to a transaction log for durability. Writes are flushed to disk when either the memory component is full, called a memory-triggered flush, or when the transaction log length becomes too long, called a log-triggered flush.

A query over an LSM-tree has to reconcile the entries with identical keys from multiple components, as entries from newer components override those from older components. A range query searches all components simultaneously using a priority queue to perform reconciliation. A point lookup query simply searches all components from newest to oldest until the first match is found. To speed up point lookups, a common optimization is to build Bloom filters [13] over the sets of keys stored in disk components.

To improve query performance and space utilization, disk components are periodically merged according to a pre-defined merge policy. In practice, two types of merge policies are commonly used [38], both of which organize disk components into "levels". The leveling merge policy maintains one component per level. When a component at Level i is T times larger than that of Level i-1, it will be merged into Level i+1. The tiering merge policy maintains up to T components per level. When a Level i becomes full with T components, they are merged together into a new component at Level i+1.
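To make the read path described above concrete, the following is a minimal sketch of a point lookup that scans components from newest to oldest and consults a per-component Bloom filter before searching it. All class and method names are illustrative, and a dict and a set stand in for real SSTables and Bloom filters; this is not AsterixDB's actual API.

```python
from typing import List, Optional

_MISSING = object()  # sentinel: key not present in this component

class Component:
    """One LSM component (memory or disk); a dict stands in for its SSTable(s)."""
    def __init__(self, data: dict, bloom: set):
        self.data = data      # key -> value; a deleted key maps to None (tombstone)
        self.bloom = bloom    # stand-in for a Bloom filter over the component's keys

    def may_contain(self, key) -> bool:
        return key in self.bloom   # a real filter may also return false positives

def point_lookup(components: List[Component], key) -> Optional[str]:
    """Components are ordered newest to oldest; the first match wins."""
    for comp in components:
        if not comp.may_contain(key):
            continue                          # Bloom filter saves a disk read
        value = comp.data.get(key, _MISSING)
        if value is not _MISSING:
            return value                      # None here means the key was deleted
    return None                               # key does not exist in the tree
```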
Figure 1: Example Partitioned LSM-tree

Partitioning. In practice, a common optimization is to range-partition a disk component into multiple (often fixed-size) SSTables to bound the processing time and temporary space of each merge. This optimization is often used together with the leveling merge policy, as pioneered by LevelDB [4]. An example of a partitioned LSM-tree with the leveling merge policy is shown in Figure 1, where each SSTable is labeled with its key range. Note that L0 is not partitioned since its SSTables are directly flushed from memory. L0 also stores multiple SSTables with overlapping key ranges to absorb write bursts. To merge an SSTable from Li to Li+1, all of its overlapping SSTables at Li+1 are selected and these SSTables are merged to form new SSTables at Li+1. For example, in Figure 1 the SSTable labeled 0-50 at L1 will be merged with the SSTables labeled 0-20 and 22-52 at L2, producing new SSTables labeled 0-15, 17-30, and 32-52 at L2. When the LSM-tree becomes too large, a new level must be added. To maximize space utilization, the new level should be added at L1 instead of the last level, as suggested by [23]. In this optimization, the last level is always treated as full, which in turn determines the maximum sizes of the other levels. When the maximum size of L1 is larger than T times the write memory size (or the configured base level size), a new L1 is added while all remaining levels Li become Li+1. In our work, we will also focus on the partitioned leveling structure due to its wide adoption in today's LSM-based systems.

Table 1: LSM-tree Notation
Notation | Definition | Example
Global Notation
  T       | size ratio of the merge policy                                  | 10
  P       | disk page size                                                  | 4 KB/page
  M_w     | total write memory size                                         | 1 GB
Local Notation
  e_i     | entry size                                                      | 100 B/entry
  a_i     | ratio of an LSM-tree's write memory to the total write memory   | 20%
  N_i     | number of levels (excluding L0)                                 | 3
  |L_li|  | size of Level L_li                                              | 10 GB
  C_i     | write I/O cost per entry                                        | 4 pages/entry

Write Memory vs. Write Cost.
Here we provide a simple cost analysis to show the relationship between the write memory size and the per-entry write I/O cost. Our notation is shown in Table 1. Note that since we consider multiple LSM-trees, Table 1 contains global notation that is valid for all LSM-trees and local notation that is specific to one LSM-tree. In the remainder of this paper, for the i-th LSM-tree, we add the subscript i to denote the local notation for this LSM-tree. Note that we have further introduced the notation a to denote the write memory ratio of an LSM-tree. Thus, for the i-th LSM-tree, its write memory size is a_i * M_w. Moreover, given a collection of K LSM-trees, we have sum_{i=1..K} a_i = 1.

For a given LSM-tree, the flush cost is e/P pages/entry. Merging an SSTable at Level Li usually involves T overlapping SSTables at Level Li+1. Thus, to merge an entry from L0 to the last level, the overall merge cost is (e/P) * (T+1) * N pages/entry. Here the number of levels N can be expressed using other terms as follows. Given an LSM-tree whose write memory size is a * M_w, the maximum size of the i-th level is a * M_w * T^i. Based on the size of the last Level |L_N|, we have |L_N| <= a * M_w * T^N. Thus, N can be approximated as log_T(|L_N| / (a * M_w)). Putting everything together, the per-entry write cost C is approximately

    C = e/P + (e/P) * (T+1) * log_T(|L_N| / (a * M_w))    (1)

As Equation 1 shows, a larger write memory reduces the write cost by reducing the number of disk levels. Thus, it is important to utilize a large write memory efficiently to reduce the write cost.

Apache AsterixDB [1, 8, 16] is a parallel, semi-structured Big Data Management System (BDMS) for efficiently managing large amounts of data. It supports a feed-based framework for efficient data ingestion [25, 52]. The records of a dataset in AsterixDB are hash-partitioned based on their primary keys across multiple nodes of a shared-nothing cluster. Each partition of a dataset uses a primary LSM-based B+-tree index to store the data records, while local secondary indexes, including LSM-based B+-trees, R-trees, and inverted indexes, can be built to expedite query processing. AsterixDB uses a static memory allocation scheme for simplicity and robustness [28]. It specifies static memory budgets for the buffer cache and the write memory. Moreover, AsterixDB specifies the maximum number D of writable datasets (default 8) so that each active dataset receives 1/D of the total write memory. When a dataset's write memory is full, all of its LSM-trees, including its primary index and secondary indexes, will be flushed to disk together. If the user writes to the D+1-st dataset, the least recently written active dataset will be evicted to reclaim its write memory. In this work, we use AsterixDB as a testbed to evaluate the proposed techniques and compare them to other baselines.

LSM-trees.
Recently, a large number of improvements have beenproposed to optimize the original [42] LSM-tree design. These im-provements include optimizing write performance [10, 12, 21, 22,30, 33, 39, 40, 44, 55], supporting auto-tuning of LSM-trees [19, 20,31], optimizing LSM-based secondary indexes [36, 43], minimizingwrite stalls [11, 37, 47], and extending the applicability of LSM-trees [35, 45]. We refer readers to a recent survey [38] for a moredetailed description of these LSM-tree improvements.In terms of memory management, FloDB [9] presents a two-levelmemory component structure to mask write latencies by first stor-ing writes into a small hash index that is later migrated to a largersorted index. However, it mainly optimizes for peak throughput in-stead of reducing the overall write cost. Accordion [14] introduces amulti-level memory component structure with memory flushes andmerges. One drawback is that Accordion does not range-partitionmemory components, resulting in high memory utilization duringlarge memory merges. We will further experimentally evaluateAccordion in Section 6. Monkey [19] uses analytical models totune the memory allocation between memory components andBloom filters. ElasticBF [29] proposes a dynamic Bloom filter man-agement scheme to adjust Bloom filter false positives rates basedon the data hotness. Different from Monkey and ElasticBF, in ourwork Bloom filters are managed the same paged way as SSTablesthrough the buffer cache. It should also be noted that virtually allprevious research only considers the memory management of a sin-gle LSM-tree. Except [28], which describes memory management inAsterixDB, we are not aware of any previous work that considersmemory management of multiple heterogeneous LSM-trees.
Database Memory Management.
The importance of memory management, or buffer management, has long been recognized for database systems. Various buffer replacement policies, such as DBMIN [17], 2Q [27], LRU-K [41], and Hot-Set [46], have been proposed to reduce buffer cache misses. These replacement policies are orthogonal to this work because we mainly focus on the memory walls introduced by the LSM-tree's out-of-place update design.

Automatic memory tuning is also an important problem for database systems. Some commercial DBMSs have offered functionalities to tune the memory allocation among different memory regions [7, 48]. Depending on the tuning goals, the memory tuning techniques can be classified as maximizing the overall throughput or meeting latency requirements. DB2's self-tuning memory manager (STMM) [48] is an example of the former, using control theory to tune the memory allocation. For the latter, the relationship between the buffer cache size and the cache miss rate must be predicted, using either analytical models [50] or machine learning approaches [49]. In our work, the memory tuner attempts to minimize the total I/O cost, which indirectly maximizes the overall throughput. One key difference between our memory tuner and STMM is that STMM targets a traditional in-place update system, which does not include the write memory used by LSM-trees.

There has been recent interest in exploiting machine learning to tune database configurations [24, 32, 51, 54], where memory allocation is treated as one tuning knob. These approaches usually require additional training steps and user inputs. Different from these approaches, our memory tuner uses a white-box approach; it carefully models the I/O cost of LSM-based storage systems.

Figure 2: Memory Management Architecture
In this section, we present our memory management architecture toenable adaptive memory management. In this architecture, depictedin Figure 2, the total memory budget is divided into the write mem-ory M write and the buffer cache M cache . These two regions arefurther connected via a memory tuner, which periodically performsmemory tuning to minimize the total I/O cost. Write Memory.
The write memory stores incoming writes forall LSM-trees. To maximize memory utilization, we do not set staticsize limits for the individual memory components. Instead, all mem-ory components are managed through a shared memory pool. Whenan LSM-tree has insufficient memory to store its incoming writes,more pages will be requested from the pool. When the overallwrite memory usage is too high, an LSM-tree is selected to flush itsmemory component to disk.While the basic idea of this design is straightforward, there areseveral technical challenges here. First, how can we best utilizethe write memory to minimize the write cost? Existing LSM-treeimplementations use B + -trees or skiplists to manage memory com-ponents and always flush a memory component entirely to disk.However, this negatively impacts memory utilization since B + -treeshave internal fragmentation [53] and a large chunk of memory willbe freed all at once during flushes. Second, since the memory com-ponent of an LSM-tree now becomes highly dynamic, how canwe adjust the disk levels as the write memory changes to alwaysmake optimal performance trade-offs? Finally, given a collection ofheterogeneous LSM-trees with different sizes, how can we allocatethe write memory to these LSM-trees to minimize the overall writecost? We will present our solutions to these challenges in Section 4. Buffer Cache.
The buffer cache stores the (immutable) disk pages of the SSTables as well as their Bloom filters. As in traditional database systems, all disk pages are managed together using a predefined buffer replacement policy. For example, AsterixDB uses the clock replacement policy to manage its shared buffer cache. In this work, we mainly focus on the memory allocation given to the buffer cache instead of cache replacement within the buffer cache. Memory Tuner.
Given a memory budget, the memory tunerattempts to find an optimal memory allocation between the writememory and the buffer cache to minimize the total I/O cost. The keyproperty of the memory tuner is that it takes a white-box approachby carefully modeling the I/O cost of LSM-based storage systemsand thus does not require any offline training. We will describe thedesign and implementation of the memory tuner in Section 5.
Now we present our solution for managing the write memory. Wefirst describe the memory component structure of a single LSM-treeand then extend it to multiple LSM-trees.
Existing LSM-tree implementations use skiplists or B+-trees to manage memory components and always flush a memory component entirely to disk. As mentioned before, this causes lower memory utilization for two reasons. First, a B+-tree has internal fragmentation, as its pages are about 2/3 full [53]. Second, after a flush, a large chunk of write memory will be freed (vacated) at once. To address these two problems, we introduce a partitioned LSM-tree to manage the memory component, which is called a partitioned memory component for short. An LSM-tree achieves much higher space utilization than B+-trees. For example, with a size ratio of 10, an LSM-tree achieves 90% space utilization, which is much higher than that of a B+-tree. Moreover, since the structure is range-partitioned, it naturally supports flushing the write memory incrementally and continuously by flushing one memory SSTable at a time.

Figure 3 shows an example LSM-tree with a partitioned memory component. Compared to the basic partitioned LSM-tree design depicted in Figure 1, the new design has two key differences. First, the memory component itself is managed by a partitioned LSM-tree. This LSM-tree has an active SSTable at M0 that stores incoming writes and a set of partitioned in-memory levels that contain immutable SSTables. When a memory level Mi is full, one of its SSTables is merged into the next level Mi+1 using a memory merge. A greedy selection policy is used to select SSTables to merge by minimizing the overlapping ratio, i.e., the ratio between the size of the overlapping SSTables at Mi+1 and the size of the selected SSTable at Mi. This reduces the merge cost and provides better support for concurrent merges. Memory SSTables must be flushed to disk eventually. For a memory-triggered flush, SSTables at the last memory level (M2 in Figure 3) are flushed to disk in a round-robin way. This policy ensures that the flushed SSTables always have disjoint key ranges, which minimizes write amplification. For a log-triggered flush, the SSTable with the minimum log sequence number (LSN) will be flushed to facilitate log truncation. Suppose this SSTable is at Level Mi. In order to flush this SSTable, all overlapping SSTables at higher levels (Mj s.t. j > i) must be flushed together for correctness.

Figure 3: LSM-tree with a Partitioned Memory Component
The second key difference is that the disk level L0 is now range-partitioned. L0 organizes its SSTables into groups, where all SSTables within each group have disjoint key ranges. Groups are ordered based on their recency, where the keys in a newer group override the keys in an older group. When the total number of groups at L0 exceeds a predefined threshold, incoming flushes must be stopped. To minimize the number of groups at L0, which in turn minimizes write stalls, two heuristics are used. First, when an SSTable is flushed to disk, it is always inserted into the oldest possible group where all newer groups do not have any overlapping SSTables. Otherwise, if no such group can be found, a new group is created. Consider the two groups in Figure 3, where group 0 is older than group 1. When flushing the SSTable labeled 81-99, the resulting SSTable will be inserted into the older group 0. If the SSTable labeled 25-53 is flushed, a new group will be created because group 1 contains an overlapping SSTable 25-50. Second, to merge SSTables from L0 into L1, the smallest group, i.e., the one that contains the fewest SSTables, is always selected for the merge. Specifically, an SSTable from this group as well as any overlapping SSTables from other L0 groups are merged with the overlapping SSTables at L1. To reduce write amplification, the SSTable to merge is selected to minimize the overlapping ratio, i.e., the ratio between the total size of the overlapping SSTables at L1 and the total size of the merging SSTables at L0. Consider the example LSM-tree in Figure 3. Group 1 will be selected for the merge because it has fewer SSTables than group 0. The SSTable labeled 0-23 could be merged with the SSTable labeled 10-30 at group 0 and the SSTables labeled 0-15 and 20-35 at L1, whose overlapping ratio would be 1. The SSTable labeled 25-50 could be merged with the SSTables labeled 10-30 and 32-55 in group 0 and the SSTables labeled from 0-15 to 50-60 at L1, whose overlapping ratio would be 4/3. Thus, to reduce write amplification, the SSTable labeled 0-23 will be selected for the merge.

One potential issue with the above design is that memory merges and flushes may lead to deadlocks. To see this problem, consider an extreme situation where all SSTables at the last memory level are merging and the total write memory is full. As a result, memory merges cannot proceed until some write memory is reclaimed by flushes. However, flushes also cannot proceed because all of the last-level SSTables are merging. Since deadlocks are rare, our solution is to break deadlocks when they occur. Specifically, when the write memory is full and no flush can proceed, some memory merges that are blocking flushes will be aborted.
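As an illustration of the L0 grouping heuristic described above, the sketch below places a freshly flushed SSTable into the oldest group such that neither that group nor any newer group contains an overlapping SSTable, creating a new group otherwise. It is a simplification under the assumption that each SSTable is represented by its key range; all names are illustrative.

```python
from typing import List, Tuple

Range = Tuple[int, int]   # inclusive key range of an SSTable: (min_key, max_key)

def overlaps(a: Range, b: Range) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def choose_l0_group(groups: List[List[Range]], flushed: Range) -> int:
    """Return the index of the L0 group that should receive a flushed SSTable.

    groups[0] is the oldest group; each group holds disjoint SSTable ranges.
    The SSTable goes into the oldest group g such that neither g nor any newer
    group has an overlapping SSTable; otherwise a new (newest) group is created.
    """
    for g in range(len(groups)):
        blocked = any(
            overlaps(flushed, sstable)
            for candidate in groups[g:]      # group g itself and all newer groups
            for sstable in candidate
        )
        if not blocked:
            return g
    groups.append([])                        # no eligible group: open a new one
    return len(groups) - 1

# Mirroring the example in the text (group 0 is older than group 1):
groups = [[(10, 30), (32, 55)], [(0, 23), (25, 50)]]
assert choose_l0_group(groups, (81, 99)) == 0   # goes into the older group 0
assert choose_l0_group(groups, (25, 53)) == 2   # overlaps 25-50, so a new group
```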
Figure 4: Example Merge for Removing L1

In the new memory component architecture, the write memory of each LSM-tree is allocated on demand and is thus dynamic. Since the write cost of an LSM-tree depends on the number of disk levels, the number of disk levels needs to be adjusted as its write memory size changes. (In our preliminary solution [34], the number of disk levels was only ever increased and never decreased; however, our subsequent evaluation showed that this led to a 5%-10% performance loss compared to an optimal LSM-tree.)

Recall that to maximize the space utilization, levels are only added or deleted at L1. For each disk level Li, its maximum size is a * M_w * T^i. Here we assume that the disk levels of an LSM-tree are relatively stable, i.e., the size of each level |Li| is stable, but the write memory allocated to an LSM-tree may change, i.e., a * M_w is dynamic. When an LSM-tree's write memory size a * M_w becomes too small, i.e., a * M_w * T < |L1|, a new L1 should be added to reduce the write cost. One can simply add a new empty L1, and all remaining levels Li automatically become Li+1. In contrast, when the write memory size becomes too big, i.e., a * M_w * T > |L2|, L1 becomes redundant and can be deleted. Implementing this strategy directly can cause oscillation when the write memory is close to this threshold. To avoid this, the deletion of L1 can be delayed until the write memory further grows by a factor of f, i.e., a * M_w * T > f * |L2|. As we will see in Section 6, delaying the deletion of L1 has a much smaller impact than delaying the addition of a level. In general, a larger f better avoids oscillation but may have a larger negative impact on write amplification. By default, we set f to a small constant larger than 1.

To remove L1, all existing SSTables from L1 must be merged into L2. Here we describe an efficient solution to delete L1 smoothly with minimal overhead. When L1 needs to be deleted, SSTables from L0 can be directly merged into L2 along with all overlapping SSTables at L1. Consider the example LSM-tree in Figure 4, where the write memory is large enough to remove L1. In this case, the SSTable labeled 0-23 at L0 as well as the overlapping SSTable labeled 0-46 at L1 are directly merged into L2. This mechanism ensures that L1 will not receive new SSTables, but it does not itself guarantee that L1 will eventually become empty. To address that problem, low-priority merges are also scheduled to merge SSTables from L1 directly into L2 when there are no schedulable merges at other levels. These two operations ensure that L1 will eventually become empty, and it can then be removed from the LSM-tree.

The partitioned memory component design allows for the flushing of one SSTable at a time, which we call partial flushes. For memory-triggered flushes, partial flushes reduce the disk write amplification by creating skews at the last level [31]. The reason is that since SSTables are flushed in a round-robin way, the flushed SSTable will have received the most updates.
Thus, the key ranges of these SSTables will be denser than theaverage key range, which in turns reduces the write amplification.While possible, partial flushes may not always be an optimalchoice. Consider the case when the total write memory is largeand flushes are only triggered by log truncation. Since the oldestentries can be distributed across all memory SSTables, most memorySSTables may have to be flushed in order to truncate the log. Ifpartial flushes are used, the flushed SSTable may have overlappingkey ranges. In contrast, if a full flush is performed, which will merge-sort all memory SSTables across all levels, the flushed SSTables willhave non-overlapping key ranges. Thus, for a log-triggered flush,the optimal flush choice depends on the write memory size and themaximum transaction log length.Developing an optimal flush solution is non-trivial since it alsoheavily depends on the key distribution of the write workload.Here we propose a simple heuristic to dynamically switch betweenpartial and full flushes for log-triggered flushes. The basic idea isto use a window to keep track of how much write memory hasbeen partially flushed before the log-triggered flush, where thewindow size is set as the maximum transaction log length. Whenlog truncation is needed, if the total amount of previously flushedwrite memory is larger than β times the total write memory, where β is a configurable parameter, then partial flushes will be performed.Otherwise, we flush the entire memory component using a full flush.Based on some preliminary simulation results, we set our defaultvalue for β to be 0 . In summary, the partitioned memory componentdesign described here provides additional adaptivity on top of atraditional monolithic memory component. When the write mem-ory is small, it behaves much like a monolithic memory componentbecause its memory merges are rarely performed. When the writememory is large, however, memory merges are performed to max-imize memory utilization and to reduce write amplification. Thepartitioned memory component design also permits concurrentflushes and concurrent disk merges of L0 because both the memorycomponents and L0 are now range-partitioned. However, one po-tential drawback is that memory merges incur extra CPU overhead,which may not be ideal for CPU-bound workloads. We will furtherevaluate this issue in Section 6.
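The log-triggered flush heuristic described above can be summarized as a small decision routine: track how much write memory has been partially flushed within a window of one maximum log length, and fall back to a full flush when that amount is small relative to β times the total write memory. The sketch below is illustrative only; the statistics collection and the threshold β are assumptions, not AsterixDB's exact implementation.

```python
def choose_log_triggered_flush(partially_flushed_bytes: float,
                               total_write_memory: float,
                               beta: float) -> str:
    """Decide how to truncate the log: keep using partial flushes or do a full flush.

    partially_flushed_bytes: write memory flushed via partial flushes within the
        last window (window size = maximum transaction log length).
    beta: configurable threshold in (0, 1), chosen via preliminary simulation.
    """
    if partially_flushed_bytes > beta * total_write_memory:
        # Partial flushes are already reclaiming memory fast enough; flushing
        # one SSTable at a time keeps the flushed key ranges dense.
        return "partial"
    # Otherwise the oldest log entries are likely spread over many memory SSTables,
    # so merge-sorting the whole memory component yields disjoint flushed SSTables.
    return "full"
```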
When managing multiple LSM-trees, a fundamental question is howto allocate portions of the write memory to these LSM-trees. Sincewrite memory is allocated on-demand, this question becomes howto select LSM-trees to flush. For log-triggered flushes, the LSM-treewith the minimum LSN should be flushed to perform log truncation.For memory-triggered flushes, existing LSM-tree implementations,such as RocksDB [5] and HBase [3], choose to flush the LSM-treewith the largest memory component. We call this policy the max-memory flush policy. The intuition is that flushing this LSM-tree canreclaim the most write memory, which can be used for subsequentwrites. However, this policy may not be suitable for our partitionedmemory components because flushing any LSM-tree will reclaimthe same amount of write memory due to partial SSTable flushes. hen Luo and Michael J. Carey Min-LSN Policy . One alternative flush policy is to always flushthe LSM-tree with the minimum LSN for both log-triggered andmemory-triggered flushes. We call this policy the min-LSN flushpolicy. The intuition is that the flush rate of an LSM-tree shouldbe approximately proportional to its write rate. A hotter LSM-treeshould be flushed more often than a colder one, but it still receivesmore write memory. This policy also facilitates log truncation,which can be beneficial if flushes are dominated by log truncation.
Optimal Policy. Given a collection of K LSM-trees, our ultimate goal is to find an optimal memory allocation that minimizes the overall write cost. For the i-th LSM-tree, we denote r_i as its write rate (bytes/s). The optimal memory allocation can be obtained by solving the following optimization problem:

    minimize over a_i:  sum_{i=1..K} (r_i / e_i) * C_i,   subject to  sum_{i=1..K} a_i = 1    (2)

Solving this optimization problem shows that the optimal write memory ratio for the i-th LSM-tree is a_i^opt = r_i / sum_{j=1..K} r_j. This shows that the write memory allocated to each LSM-tree should be proportional to its write rate.

We call this policy the optimal flush policy. In terms of its implementation, we can use a window to keep track of the total number of writes to each LSM-tree, where the window size is set as the maximum transaction log length. When a memory-triggered flush is requested, each active LSM-tree is checked in turn and a flush is scheduled if its write memory ratio a_i is larger than its optimal write memory ratio a_i^opt.
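A minimal sketch of the optimal flush policy described above: each LSM-tree's recent write volume is tracked over a window, and when a memory-triggered flush is requested, any tree whose current share of the write memory exceeds its write-rate-proportional share is scheduled for a flush. The function name, inputs, and scheduling interface are illustrative assumptions.

```python
from typing import Dict, List

def trees_to_flush(recent_writes: Dict[str, float],
                   memory_usage: Dict[str, float]) -> List[str]:
    """Pick LSM-trees to flush under the optimal policy.

    recent_writes: bytes written to each tree within the tracking window
        (window size = maximum transaction log length).
    memory_usage: current write memory held by each tree, in bytes.
    A tree is flushed if its memory ratio a_i exceeds its optimal ratio
    a_i^opt = r_i / sum_j r_j (memory proportional to write rate).
    """
    total_writes = sum(recent_writes.values()) or 1.0
    total_memory = sum(memory_usage.values()) or 1.0
    victims = []
    for tree, used in memory_usage.items():
        a_i = used / total_memory
        a_opt = recent_writes.get(tree, 0.0) / total_writes
        if a_i > a_opt:
            victims.append(tree)          # holding more memory than its share
    return victims

# Example: a cold tree hoarding write memory is flushed before the hot one.
print(trees_to_flush(recent_writes={"hot": 900.0, "cold": 100.0},
                     memory_usage={"hot": 400.0, "cold": 600.0}))  # ['cold']
```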
After discussing how to efficiently manage the write memory, we now proceed to describe the memory tuner that tunes the memory allocation between the write memory and the buffer cache. We first provide an overview of the tuning approach, which is followed by its design and implementation.

The goal of the memory tuner is to find an optimal memory allocation between the write memory and the buffer cache to minimize the I/O cost per operation. This in turn should maximize the system efficiency as well as the overall throughput. Suppose the total available memory is M. For ease of discussion, let us assume the write memory size is x, which implies that the buffer cache size is M - x. Let write(x) and read(x) be the write cost and read cost per operation when the write memory is x. Our tuning goal is to minimize the weighted I/O cost per operation (pages/op)

    cost(x) = ω * write(x) + γ * read(x)    (3)

The weights ω and γ allow us to instantiate the objective function for different use cases. For example, on hard disks, one can set a smaller ω since LSM-trees mainly use sequential I/Os for writes, while on SSDs one can make ω larger since SSD writes are often more expensive than SSD reads.

In general, cost(x) is a U-shaped function. To see this, when x is very small, the per-operation merge cost will be very large. In contrast, when x is very large, there will be a lot of buffer cache misses caused by both queries and merges. Thus, to minimize Equation 3, our goal is to find x such that the derivative cost'(x) = ω * write'(x) + γ * read'(x) = 0. Intuitively, write'(x) / read'(x) measures how the write/read cost changes if more write memory is allocated. If we can estimate both write'(x) and read'(x), then we can find the optimal x using some root-finding algorithm.

Based on this idea, our memory tuner uses a feedback-control loop to tune memory allocation, as depicted in Figure 5. The system periodically reports workload statistics to the memory tuner. The memory tuner then uses the collected statistics to find an optimal memory allocation between the write memory and the buffer cache. Note that the whole tuning process does not require any user input nor training samples. Instead, the memory tuner continuously tunes the memory allocation based on the current statistics as well as some past history. Before describing the details of the memory tuner, we first introduce some notation used by the memory tuner (Table 2) in addition to the LSM-tree notation listed in Table 1. Note that with secondary indexes each operation may write multiple entries to multiple LSM-trees.

Figure 5: Workflow of Memory Tuner

Table 2: Memory Tuner Notation
Notation | Definition | Example
Global Notation
  K           | number of LSM-trees                              | 8
  op          | number of operations observed                    | 10K ops
  saved_q     | saved query disk I/O by the simulated cache      | 0.01 page/op
  saved_m     | saved merge disk I/O by the simulated cache      | 0.002 page/op
  sim         | simulated cache size                             | 32 MB
Local Notation
  w_i         | number of entries written to an LSM-tree         | 50K entries
  flush_log_i | write memory flushed by log truncation           | 1 GB
  flush_mem_i | write memory flushed by high memory usage        | 8 GB

For the i-th LSM-tree, recall that Equation 1 computes the per-entry write cost C_i. Since each operation writes w_i/op entries to this LSM-tree, its write cost per operation write_i(x) can be computed as (w_i/op) * C_i. By taking the derivative of write_i(x), we have

    write'_i(x) = -(w_i/op) * (e_i/P) * (T+1) / (x * ln T)    (4)

To reduce the estimation error, instead of collecting statistics for op, w_i, e_i and P, we simply collect the total number of merge writes per operation, merge_i(x), in the last tuning cycle. By substituting merge_i(x) into Equation 4, we have

    write'_i(x) = -merge_i(x) / (x * ln(|L_Ni| / (a_i * x)))    (5)

Here we assume that the write memory of an LSM-tree is always smaller than its last level size. Thus, the estimated value of write'_i(x) in Equation 5 is always negative as long as merge_i(x) is not zero. This implies that adding more write memory can always reduce the write cost, which may not hold in practice. Once flushes are dominated by log truncation, adding more write memory will not further reduce the write cost. To account for the impact of log-triggered flushes, we further multiply Equation 5 by a scale factor flush_mem_i / (flush_mem_i + flush_log_i) that we also keep statistics for. Intuitively, this scale factor will be close to 1 if flushes are mainly triggered by high memory usage, and it will approach 0 if flushes are mostly triggered by log truncation. Finally, write'(x) is the sum of write'_i(x) over all LSM-trees:

    write'(x) = sum_{i=1..K} [ -merge_i(x) / (x * ln(|L_Ni| / (a_i * x))) * flush_mem_i / (flush_mem_i + flush_log_i) ]    (6)
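To show how Equations 1, 5, and 6 fit together, here is a minimal numerical sketch of write'(x): the per-tree merge writes per operation and flush statistics are plugged into Equation 5, scaled by the log-truncation factor, and summed. The statistics below are hypothetical inputs of the kind the tuner would collect per cycle; the first tree roughly matches the setting of Example 5.1 below, while the second tree's numbers are made up.

```python
import math

def write_cost_per_entry(e: float, P: float, T: float,
                         last_level_bytes: float, write_mem_bytes: float) -> float:
    """Equation 1: per-entry write cost (pages/entry) for one LSM-tree."""
    levels = math.log(last_level_bytes / write_mem_bytes, T)
    return (e / P) + (e / P) * (T + 1) * levels

def write_derivative(x: float, trees: list) -> float:
    """Equation 6: d(write cost)/d(write memory), summed over all LSM-trees.

    x: total write memory in bytes.  Each tree carries its merge writes per
    operation, its share a_i of the write memory, its last level size, and the
    write memory flushed by each flush trigger in the last cycle.
    """
    total = 0.0
    for t in trees:
        # Equation 5: -merge_i(x) / (x * ln(|L_Ni| / (a_i * x)))
        deriv = -t["merge_per_op"] / (x * math.log(t["last_level"] / (t["a"] * x)))
        # Scale factor: approaches 0 when flushes are mostly log-triggered.
        scale = t["flush_mem"] / (t["flush_mem"] + t["flush_log"])
        total += deriv * scale
    return total

x = 128 * 2**20  # 128 MB of total write memory
trees = [
    {"merge_per_op": 1.0, "a": 0.8, "last_level": 100 * 2**30,
     "flush_mem": 8 * 2**30, "flush_log": 1 * 2**30},
    {"merge_per_op": 0.2, "a": 0.2, "last_level": 10 * 2**30,
     "flush_mem": 2 * 2**30, "flush_log": 1 * 2**30},
]
print(write_derivative(x, trees))  # negative: more write memory saves merge writes
```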
Example 5.1. Consider an example with two LSM-trees and suppose that the total write memory x is 128 MB. Suppose that the first LSM-tree receives 80% of the write memory (a_1 = 0.8), that its last level size is 100 GB (|L_N1| = 100 GB), and that its merge cost per operation is 1 page/op (merge_1(128 MB) = 1). The statistics of the second LSM-tree (a_2 = 0.2, its last level size |L_N2|, and merge_2(128 MB)) are collected in the same way. Plugging these statistics into Equation 5 gives write'_1(128 MB) and write'_2(128 MB), and summing them according to Equation 6 yields write'(128 MB), a negative value whose magnitude tells us by how many pages per operation the write cost decreases if one more byte of write memory is allocated.

Estimating read'(x) is slightly more complicated because disk reads are performed by both queries and merges. Thus, we break down read(x) into read(x) = read_q(x) + read_m(x), where read_q(x) is the total number of query disk reads per operation and read_m(x) is the total number of merge disk reads per operation.

We use a simulated cache to estimate read'_q(x), as suggested by [48]. This simulated cache only stores page IDs. Whenever a page is evicted from the buffer cache, its page ID is added to the simulated cache. Whenever a page is about to be read from disk, a disk I/O could have been saved if the simulated cache contains that page ID. Suppose that the simulated cache size is sim and the saved read cost per operation is saved_q; then read'_q(x) = saved_q / sim.

To estimate read'_m(x), we first rewrite read_m(x) = pin_m(x) * miss_m(x), where pin_m(x) is the total number of page pins for disk merges and miss_m(x) is the cache miss ratio for merges. Based on the derivative product rule, we have read'_m(x) = pin'_m(x) * miss_m(x) + pin_m(x) * miss'_m(x). pin_m(x) can be obtained by counting the number of merge page pins per operation, and miss_m(x) = read_m(x) / pin_m(x). pin'_m(x) is the number of saved merge page pins per unit of write memory. Recall that we have computed write'(x), which is the number of saved disk writes per unit of write memory. On average, each merge disk write requires pin_m(x) / merge(x) page pins. As a result, pin'_m(x) = write'(x) * pin_m(x) / merge(x). To estimate miss'_m(x), we again use the simulated cache to estimate the number of saved merge reads per operation, saved_m. Thus, miss'_m(x) = saved_m / (pin_m(x) * sim). Putting everything together, read'_m(x) = write'(x) * read_m(x) / merge(x) + saved_m / sim. Finally, read'(x) can be computed as

    read'(x) = (saved_q + saved_m) / sim + write'(x) * read_m(x) / merge(x)    (7)
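The simulated cache and Equation 7 can be sketched as follows: the cache stores only the page IDs of pages evicted from the real buffer cache, counts the disk reads a slightly larger buffer cache would have saved, and those counters feed directly into read'(x). This is an illustrative sketch (a plain LRU of page IDs), not the actual AsterixDB implementation; Example 5.2 below walks through the same computation with concrete statistics.

```python
from collections import OrderedDict

class SimulatedCache:
    """Tracks page IDs evicted from the real buffer cache and counts the
    disk reads that a slightly larger buffer cache would have saved."""
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()            # page_id -> None, in LRU order
        self.saved_query_reads = 0
        self.saved_merge_reads = 0

    def on_buffer_cache_evict(self, page_id) -> None:
        self.pages[page_id] = None
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # drop the least recently added page

    def on_disk_read(self, page_id, for_merge: bool) -> None:
        if page_id in self.pages:             # this I/O would have been a cache hit
            if for_merge:
                self.saved_merge_reads += 1
            else:
                self.saved_query_reads += 1

def read_derivative(saved_q: float, saved_m: float, sim_bytes: float,
                    write_deriv: float, read_m: float, merge_writes: float) -> float:
    """Equation 7, with all statistics normalized per operation."""
    return (saved_q + saved_m) / sim_bytes + write_deriv * read_m / merge_writes
```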
Example 5.2. Continuing from Example 5.1, suppose that the simulated cache size is 32 MB, that the simulated cache reports that the saved query disk reads per operation are saved_q = 0.01 page/op, and that the saved merge disk reads per operation are saved_m = 0.008 page/op. Together with the total number of merge disk reads per operation, read_m(x), and the value of write'(x) from Example 5.1, Equation 7 gives read'(x). In this example read'(x) is negative as well, which means that allocating one more byte of write memory would also decrease the disk read cost per operation, since disk reads are mainly performed by merges in this case.

To find the optimal memory allocation x, we can use the Newton-Raphson method to find the root of cost'(x) = ω * write'(x) + γ * read'(x). The basic idea is to use a series of approximations to find the root of a function f(x). At the i-th iteration, the next approximation is computed as x_{i+1} = x_i - f(x_i) / f'(x_i). Since we only know the evaluations of cost'(x), we further approximate cost'(x) using a linear function. That is, we use the last K samples to fit a linear function cost'(x) = Ax + B, where by default K is set to 3. Thus, at each tuning step, the next memory allocation is computed as x_{i+1} = x_i - cost'(x_i) / A.

To ensure the stability of the memory tuner, we employ several heuristics here. First, during the startup phase, the tuner does not have enough samples to construct the linear function. In this case, cost'(x) only tells whether the write memory should be increased or decreased, but not the exact amount. To address this, a simple heuristic is to use a fixed step size, e.g., 5% of the total memory. Second, the maximum step size is limited based on the memory region whose memory needs to be decreased. The intuition is that taking memory from a region may be harmful because both the write memory and the buffer cache are subject to diminishing returns. Thus, at each tuning step, we limit the maximum decrease for either memory region to 10% of its currently allocated memory size. Finally, the memory tuner uses two stopping criteria to avoid oscillation. The memory allocation is not changed if the step size is too small, e.g., smaller than 32 MB, or if the expected cost reduction is too small, e.g., smaller than 0.1% of the current I/O cost.

The last question for implementing the memory tuner is determining the appropriate tuning cycle length. Ideally, the tuning cycle should be long enough to capture the workload characteristics but be as short as possible for better responsiveness. To balance these two requirements, memory tuning is triggered whenever the accumulated log records exceed the maximum log length. This allows the memory tuner to capture the workload statistics more accurately by waiting for log-triggered flushes to complete. For read-heavy workloads, it may take a very long time to produce enough log records. To address this, the memory tuner also uses a timer-based tuning cycle, e.g., 10 minutes.
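The tuning step described above can be sketched as a small routine: fit cost'(x) ≈ Ax + B over the last few observations, take a Newton-style step x - cost'(x)/A, and clamp the step using the safeguards mentioned in the text (fixed step during startup, capped shrinkage of either region, and a minimum step size). The 5%, 10%, and 32 MB constants are the ones quoted in the text; everything else is an illustrative simplification under assumed names.

```python
def fit_slope(samples):
    """Least-squares slope A of cost'(x) ~ A*x + B over (x, cost') samples."""
    xs, ds = zip(*samples)
    n = len(samples)
    mean_x, mean_d = sum(xs) / n, sum(ds) / n
    var = sum((xi - mean_x) ** 2 for xi in xs)
    if var == 0.0:
        return 0.0
    return sum((xi - mean_x) * (di - mean_d) for xi, di in zip(xs, ds)) / var

def next_write_memory(samples, total_memory, min_step=32 * 2**20):
    """One memory-tuner step; samples are (write_memory, cost_prime) pairs."""
    x, d = samples[-1]
    A = fit_slope(samples[-3:]) if len(samples) >= 3 else 0.0
    if A > 0.0:
        step = -d / A                  # Newton-style step toward cost'(x) = 0
    else:
        # Startup (or a degenerate fit): cost'(x) only gives a direction,
        # so move by a fixed 5% of the total memory toward lower cost.
        step = -0.05 * total_memory if d > 0 else 0.05 * total_memory

    # Never take more than 10% away from the region that is being shrunk.
    if step < 0:
        step = max(step, -0.10 * x)                    # shrinking the write memory
    else:
        step = min(step, 0.10 * (total_memory - x))    # shrinking the buffer cache

    return x if abs(step) < min_step else x + step
```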
In this section, we experimentally evaluate the proposed techniquesin the context of Apache AsterixDB [1]. Throughout the evaluation,we focus on the following two questions. First, what are the benefitsof the partitioned memory component compared to alternativeapproaches? Second, what is the effectiveness of the memory tunerin terms of its accuracy and responsiveness? In the remainder of thissection, we first describe the general experimental setup followedby the detailed evaluation results.
Hardware.
All experiments were performed on a single nodem5d.2xlarge on AWS. The node has an 8-core 2.50GHZ vCPUs,32GB of memory, a 300GB NVMe SSD, and a 500GB elastic blockstore (EBS). We use the native NVMe for LSM storage and EBS forstoring transaction logs. The NVMe SSD provides a write through-put of 250MB/s and a read throughput of 500MB/s. We allocated26GB of memory for the AsterixDB instance. Unless otherwisenoted, the total storage memory budget, including the buffer cacheand the write memory, was set at 20GB. Both the disk page sizeand memory page size were set at 16KB. The maximum transactionlog length was set at 10GB. Finally, we used 8 worker threads toexecute benchmark operations.
LSM-tree Setup.
All LSM-trees used a partitioned leveling mergepolicy with a size ratio of 10, which is a common setting in existingsystems. Unless otherwise noted, the number of disk levels wasdynamically determined based on the current write memory size.For the partitioned memory component, its active SSTable size wasset at 32MB and the size ratio of the memory merge policy wasalso set at 10. We used 2 threads to execute flushes, 2 threads toexecute memory merges, and 4 threads to execute disk merges. Ineach set of experiments, we first loaded the LSM storage based onthe given workload. Each experiment always started with a freshcopy of the loaded LSM storage. For both memory and disk levels,we built a Bloom filter for each SSTable with a false positive rateof 1% to accelerate point lookups. Finally, both the memory flushthreshold and the log truncation threshold were set at 95%.
Workloads.
We used two popular benchmarks, YCSB [18] andTPC-C [6], to evaluate the proposed techniques. YCSB is a popularand extensible benchmark for evaluating key-value stores. Due toits simplicity, we used YCSB to understand the basic performanceof various techniques. In all experiments, we used the default YCSBrecord size, where each record has 10 fields with 1KB size in total,and the default Zipfian distribution. Since YCSB only supports a single LSM-tree, we further extended it to support multiple primaryand secondary LSM-trees, which is described in Section 6.2. TPC-Cis an industrial standard benchmark used to evaluate transactionprocessing systems. We chose TPC-C because it represents a morerealistic workload with multiple datasets and secondary indexes.It should be noted that AsterixDB only supports a basic record-level transaction model without full ACID transactions. Thus, alltransactions in our evaluation were effectively running under theread-uncommitted isolation level from the TPC-C perspective. Be-cause of this, we disabled the client-triggered aborts (1%) of theNewOrder transaction. The detailed setup of these two benchmarksare further described below.
We first evaluated the benefits of the partitioned memory compo-nent structure for managing the write memory. Specifically, wedesigned the following four sets of experiments. The first set ofexperiments uses a single LSM-tree to evaluate the basic perfor-mance of various memory component structures. The second setof experiments uses multiple datasets, each of which just has a pri-mary LSM-tree. The third set of experiments focuses on LSM-basedsecondary indexes which all belong to the same dataset. Finally,the last set of experiments uses a more realistic workload that con-tains multiple primary and secondary indexes. For the first threesets of experiments, we used the YCSB benchmark [18] due to itssimplicity and customizability. For the last set of experiments, weused the TPC-C benchmark [6] since it represents a more realisticworkload.
Evaluated Write Memory Management Schemes.
First, weevaluated two variations of AsterixDB’s static memory allocationscheme. The first variation, called B + -tree-static-default , uses As-terixDB’s default number of active datasets, which is 8. The sec-ond variation, called B + -tree-static-tuned , configures the numberof active datasets parameter setting based on each experiment. Wefurther evaluated an optimized version of the write memory man-agement scheme (called B + -tree-dynamic ) used in existing systems,e.g., RocksDB and HBase, by not limiting the size of each mem-ory component. When the overall write memory becomes full, theLSM-tree with the largest memory component is selected to flush.Moreover, we also evaluated two variations of Accordion [14]. Ac-cordion separates keys from values by storing keys into an indexstructure while putting values into a log. The first variation, called Accordion-index , only merges the indexes without rewriting thelogs. The second variation, called
Accordion-data , merges both theindexes and logs. Finally, for the proposed partitioned memorycomponent structure, called
Partitioned , we further evaluated threevariations based on the three flush policies described in Section 4.2,namely max-memory (called
Partitioned-MEM ), min-LSN (called
Partitioned-LSN ), and optimal (called
Partitioned-OPT). In this experiment, the LSM-tree had 100 million records with a 110 GB storage size. We evaluated four types of workloads, namely write-only (100% writes), write-heavy (50% writes and 50% lookups), read-heavy (5% writes and 95% lookups), and scan-heavy (5% writes and 95% scans). A write operation updates an existing key and each scan query accesses a range of 100 records. Each experiment ran for 30 minutes and the first 10-minute period was excluded when computing the throughput. It should be noted that in this experiment all flush policies have identical behavior because there was only one LSM-tree.

Figure 6: Experimental Results for a Single LSM-tree
Basic Performance.
Figure 6 shows the throughput of eachmemory component scheme under different workloads and writememory sizes. In general, the write memory mainly impacts write-dominated workloads, such as write-only and write-heavy, andlarger write memory improves the overall throughput by reducingthe write cost. Among these structures, B + -tree-static-default al-ways performs the worst since any one LSM-tree is only allocated1/8 of the write memory. B + -tree-dynamic performs slightly bet-ter than B + -tree-static-tuned because the former does not leavememory idle by preallocating two memory components for doublebuffering. The partitioned memory component structure has thehighest throughput under write-dominated workloads since betterutilizes the write memory. It also improves the overall throughputslightly under the read-heavy workload by reducing write ampli-fication. For both B + -tree-dynamic and partitioned, the through-put stops increasing after the write memory exceeds 4GB. Thisis because flushes are then dominated by log-truncation. Finally,Accordion does not provide any improvement compared to B + -tree-dynamic. Accordion-data actually reduces the overall throughputbecause a large memory merge will temporarily double the mem-ory usage, forcing memory components to be flushed. Moreover,Accordion was designed for reducing GC overhead since HBase [3]uses Java objects to manage memory components. Although As-terixDB is written in Java, it uses off-heap structures for memorymanagement [15, 28]. In all experiments, its measured GC time wasalways less than 1% of the total run time. Based on these results,and because Accordion is mainly designed for a single LSM-tree, weexcluded Accordion for further evaluation with multiple LSM-trees.As suggested by [37], we further carried out an experiment toevaluate the 99th percentile write latencies of each scheme using aconstant data arrival process, whose arrival rate was set at a highutilization level (95% of the measured maximum write throughput).We found out that the resulting 99th percentile latencies of allschemes were less than 1s, which suggests that all structures canprovide a stable write throughput with a relatively small variance,even under a very high utilization level. Benefits of Dynamically Adjusting Disk Levels.
To evalu-ate the benefit of dynamically adjusting disk levels as the writememory changes, we conducted an experiment where the write memory size alternates between 1GB and 32MB every 30 minutes.Each experiment ran for two hours in total. We used the partitionedmemory component structure but the disk levels were determineddifferently. In addition to the proposed approach that adjusts disklevels dynamically (called “dynamic”), we used two baselines wherethe number of disk levels is determined statically by assuming thatthe write memory is always 32MB (called “static-32MB”) or always1GB (called “static-1GB”). The resulting write throughput, aggre-gated over 5-minute windows, is shown in Figure 7. The dynamicapproach always has the highest throughput, which confirms theutility of adjusting disk levels as the write memory changes. More-over, we see that having fewer levels when the write memory issmall has a more negative impact than having more levels when thewrite memory is large since the write throughput for static-1GB ismuch lower under the small write memory.
In this set of experiments, weused 10 primary LSM-trees, each of which had 10 million records.Since the write memory mainly impacts write performance, a write-only workload was used in this experiment. Writes were distributedamong the multiple LSM-trees following a hotspot distribution,where x % of the writes go to y % of the LSM-trees. For example,an 80-20 distribution means that 80% of the writes go to 20% ofthe LSM-trees, i.e., 2 hot LSM-trees, while the 20% of the writes goto 80% of the LSM-trees, yielding 8 cold LSM-trees. Within eachLSM-tree, writes still followed YCSB’s default Zipfian distribution. Impact of Write Memory.
Figure 7: Write Throughput with Varying Write Memory

We first evaluated the impact of the write memory size by fixing the skewness to be 80-20. The resulting write throughput is shown in Figure 8a. Note that B+-tree-static-default results in a much lower throughput because of thrashing. Since the default number of active datasets in AsterixDB is only 8, some LSM-trees have to be constantly activated and deactivated, resulting in many tiny flushes. Moreover, thrashing has a larger negative impact under a large write memory because it takes longer to allocate larger memory components. B+-tree-static-tuned avoids the thrashing problem, but it still performs worse than the other baselines because it does not differentiate hot LSM-trees from cold ones. B+-tree-dynamic allows the write memory to be allocated dynamically. However, since it always flushes the LSM-tree with the largest memory component, the memory components of the cold LSM-trees are not flushed until they are large enough or until the transaction log has to be truncated. Because of this, Partitioned-MEM, which uses the max-memory flush policy, also has a relatively lower throughput. In contrast, both the min-LSN (Partitioned-LSN) and optimal (Partitioned-OPT) flush policies improve the write throughput via better memory allocation. Moreover, the min-LSN policy has a write throughput comparable to the optimal policy, which makes it a good approximation but with less implementation complexity. Finally, all three flush policies start to have similar throughput when the write memory is larger than 2 GB because flushes then become dominated by log truncation.
Next, we evaluated the impact of skew-ness by fixing the write memory to be 1GB. The resulting writethroughput is shown in Figure 8b. All memory component struc-tures except B + -tree-static-tuned benefit from skewed workloads.The problem of B + -tree-static-tuned is that it always allocates thewrite memory evenly to the active datasets without differentiat-ing hot LSM-trees from cold ones. For B + -tree-static-default, thethrashing problem is alleviated under skewed workloads since mostwrites go to a small number of LSM-trees. The partitioned mem-ory component structure also outperforms B + -tree-dynamic, as wehave seen before. Moreover, when the workload is more heavilyskewed, the performance differences among the three flush policiesalso become larger. Under the 50-50 workload, where each LSM-treereceives the same volume of writes, these flush policies have nearlyidentical behavior. When the workload becomes more skewed, themin-LSN (Partitioned-LSN) and optimal (Partitioned-OPT) policiesstart to outperform the max-memory policy (Partitioned-MEM) byallocating more write memory to the hot LSM-trees. T h r oughpu t ( kop s / s ) (a) Vary Write Memory 50-50 60-40 70-30 80-20 90-10Skewness02040 T h r oughpu t ( kop s / s ) (b) Vary SkewnessB + -tree-static-defaultB + -tree-static-tuned B + -tree-dynamicPartitioned-MEM Partitioned-LSNPartitioned-OPT Figure 8: Experimental Results for 10 Primary LSM-trees
We further evaluated the al-ternative memory component structures using multiple secondaryLSM-trees for one dataset. The dataset had one primary LSM-treeand 10 secondary LSM-trees, with one secondary LSM-tree per field.The primary LSM-tree had 50 million records with 55GB storagesize, and each secondary LSM-tree was about 5GB. As before, weused the write-only workload to focus on write performance. Itshould be noted that each write must also performs a primary in-dex lookup to cleanup secondary indexes [36]. Unless otherwisenoted, each write only updates one secondary field, but the choiceof updated fields followed the same hotspot distribution as in Sec-tion 6.2.2. For each field, its values followed the default Zipfiandistribution used in YCSB.
Impact of Write Memory.
First, we varied the total write mem-ory to evaluate its impact on the different memory componentstructures with secondary indexes. The resulting write throughputis shown in Figure 9a. In general, the results are consistent withthe multiple primary LSM-tree case in Figure 8a. Note that theperformance difference between B + -tree-static-tuned and B + -tree-dynamic becomes smaller in this case because B + -tree-static-tunedallocates the write memory at a dataset level, so the primary LSM-tree and all secondary LSM-trees share the same budget here, whichis similar to B + -tree-dynamic. Impact of Update Field Skewness.
Impact of Update Field Skewness. We further varied the skewness of updated fields to study its impact on write throughput. Figure 9b shows the resulting write throughput. As one can see, the skewness of updated fields has a smaller performance impact here compared to the multiple primary LSM-tree case in Figure 8b because the size of a secondary LSM-tree is much smaller than that of the primary one. However, B+-tree-dynamic and Partitioned-MEM, both of which use the max-memory flush policy, still benefit from a more skewed workload. The reason is that when most writes access a small number of hot secondary indexes, the sizes of their memory components grow faster and they will be selected by the max-memory policy to flush.
Impact of Number of Updated Fields. Finally, we studied the performance impact of the number of updated fields per write, ranging from 1 to 5. The resulting throughput is shown in Figure 9c. Increasing the number of updated fields per write negatively impacts the write throughput because each logical write produces more physical writes. Because of this, the write throughput of all memory component structures decreases in the same way when each write updates more fields.
Figure 9: Experimental Results for Multiple Secondary LSM-trees ((a) varying write memory; (b) varying skewness; (c) varying number of updated fields).

Finally, we used the TPC-C benchmark to evaluate the alternative memory management schemes on a more realistic workload. We used two scale factors (SF) of TPC-C, i.e., 500, which results in a 50GB storage size, and 2000, which results in a 200GB storage size. Each experiment ran for one hour and the throughput was measured excluding the first 30 minutes.

The resulting throughput and the per-transaction disk writes (KB) under the two scale factors are shown in Figure 10. Note that there is only one static baseline, B+-tree-static, because the number of active datasets in TPC-C is 8, which is the same as the default value used in AsterixDB. B+-tree-static still has the highest I/O cost because it allocates write memory evenly to all datasets. TPC-C contains some hot datasets, such as order_line and stock, that receive most of the writes, as well as some cold datasets, such as warehouse and district, that only require a few megabytes of write memory. Partitioned-OPT always seeks to minimize the transaction write cost, improving the system's I/O efficiency. However, note that this may not always improve the overall throughput. When the workload is CPU-bound at scale factor 500, the extra CPU overhead incurred by memory merges actually decreases the overall throughput as compared to B+-tree-dynamic. When the workload is I/O-bound at scale factor 2000, reducing the disk writes does increase the overall throughput. Thus, we observe that it would be useful to design a memory management scheme that balances the CPU overhead and the I/O cost, which we leave as future work. Finally, the results also show that increasing the write memory may not always increase the overall transaction throughput. For example, when the scale factor is 2000, the optimal throughput is reached when the write memory is between 1GB and 2GB. This confirms the importance of memory tuning, which will be evaluated next.

Figure 10: Experimental Results on TPC-C ((a) throughput, SF=500; (b) write cost, SF=500; (c) throughput, SF=2000; (d) write cost, SF=2000).

Here we briefly summarize the findings from the evaluation of the various memory component schemes. As all of the experiments have illustrated, it is important to utilize a large write memory efficiently to reduce the I/O cost.
Although AsterixDB's static memory allocation scheme is relatively simple and robust, it leads to sub-optimal performance because the write memory is always evenly allocated to the active datasets. The optimized version of the memory management scheme used by existing systems, i.e., B+-tree-dynamic, reduces the I/O cost by dynamically allocating the write memory to active LSM-trees. This still does not achieve optimal performance, however, because it fails to manage large memory components efficiently and its choice of flushes does not optimize the overall write cost. Finally, the proposed partitioned memory component structure and the optimal flush policy minimize the write cost for all workloads. The use of partitioned memory components manages the large write memory more effectively to reduce the write amplification of a single LSM-tree. Moreover, the optimal flush policy allocates the write memory to multiple LSM-trees based on their write rates to minimize the overall write cost. However, the partitioned memory component structure may incur extra CPU overhead, which makes it less suitable for CPU-heavy workloads. Finally, we have observed that the min-LSN policy achieves performance comparable to the optimal policy, which makes it a good approximation but with less implementation complexity.

We now proceed to evaluate the memory tuner with a focus on the following questions: First, what are the basic mechanics of the memory tuner in terms of how it tunes the memory allocation for different workloads? Second, what is the accuracy of the memory tuner as compared to manually tuned memory allocation? Finally, how responsive is the memory tuner when the workload changes?

Recall that the memory tuner minimizes the weighted sum of the I/O costs. Thus, instantiating this cost function to maximize the overall throughput is hardware-dependent. To avoid this dependency on the underlying hardware in our evaluation, we set both weights to 1 and focus on the per-operation I/O cost, instead of the absolute throughput. The I/O cost was measured by dividing the total number of monitored disk I/Os by the total number of operations. Moreover, to show the effectiveness of the memory tuner, the write memory size always starts from 64MB. The simulated cache size was set to 128MB. Unless otherwise noted, the other settings of the memory tuner, such as the number of samples for fitting the linear function, the stopping threshold, and the maximum step size, all used the default values given in Section 5.4.
Figure 11: Evaluation of Memory Tuner on YCSB ((a) tuned write memory, 4GB budget; (b) tuned I/O cost, 4GB budget; (c) tuned write memory, 20GB budget; (d) tuned I/O cost, 20GB budget; write ratios from 10% to 50%).

To understand the basic mechanics of the memory tuner, we carried out a set of experiments using YCSB [18] with a single LSM-tree. As before, the LSM-tree had 100 million records with a total size of 110GB. We used a mixed read/write workload where the write ratio varied from 10% to 50%. The total memory budget was set at 4GB or 20GB. Each experiment ran for 1 hour. The tuned write memory size and the corresponding I/O costs over time are shown in Figure 11. Note that each point denotes one tuning step performed by the memory tuner. We see that the memory tuner balances the relative gains of allocating more memory to the write memory and to the buffer cache to minimize the overall I/O cost. As shown in Figures 11a and 11c, when the overall memory budget is fixed, the memory tuner allocates more write memory when the write ratio is increased because the benefit of having a large write memory increases. Moreover, by comparing the allocated write memory sizes in Figures 11a and 11c, we can see that when the write ratio is fixed, the memory tuner also allocates more write memory when the total memory becomes larger. This is because the benefit of having more buffer cache memory plateaus.
Finally, as shown in Figures 11b and 11d, the overall I/O cost also decreases after the memory allocation is tuned over time.
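The balancing behavior observed in Figure 11 can be illustrated with a small sketch of the decision the tuner faces at each step: compare the estimated marginal I/O savings of growing the write memory against those of growing the buffer cache, and shift memory toward the larger gain. The function below is only a sketch of that idea, not the tuner's actual cost model (described in Section 5); the derivative estimates and the threshold are assumed inputs.

```python
def tuning_direction(d_write_cost, d_read_cost, threshold=1e-6):
    """Decide how to shift memory between write memory and buffer cache.

    d_write_cost: estimated reduction in write I/O (KB/op) per extra MB
                  given to the write memory (a positive number).
    d_read_cost:  estimated reduction in read I/O (KB/op) per extra MB
                  given to the buffer cache (a positive number).

    Memory flows toward the region with the larger marginal benefit; when
    the two marginal benefits are (nearly) equal, the allocation is at a
    local optimum and tuning pauses.
    """
    if abs(d_write_cost - d_read_cost) < threshold:   # stopping condition (illustrative)
        return 0
    return +1 if d_write_cost > d_read_cost else -1   # +1: grow write memory

# Example: writes currently benefit more from extra memory, so write memory grows.
print(tuning_direction(d_write_cost=0.020, d_read_cost=0.005))  # -> 1
```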
Figure 12: Experimental Results of Memory Tuner's Accuracy on TPC-C.

To evaluate the accuracy of the memory tuner, we carried out a set of experiments on TPC-C to compare the tuned I/O cost versus the optimal I/O cost. We used TPC-C because it represents a more complex and more realistic workload than YCSB. The scale factor was set at 2000. To find the memory allocation with the optimal I/O cost, we used an exhaustive search with an increment of 128MB. To show the effectiveness of the memory tuner, we included two additional baselines. The first baseline always set the write memory to 64MB, which is the starting point of the memory tuner. The second baseline divided the total memory budget evenly between the buffer cache and the write memory. We further varied the total memory budget from 4GB to 20GB. Each experiment ran for 1 hour and the I/O cost was measured after the first 30 minutes.

Figure 12 shows the I/O cost per transaction, which includes both the read and write costs, for the different memory allocation approaches. In general, the auto-tuned I/O cost is always very close to the optimal I/O cost found via exhaustive search, which shows the effectiveness of our memory tuner. Moreover, the memory tuner performs notably better than the two heuristic-based baselines. Allocating a small write memory minimizes the read cost but leads to a higher write cost. In contrast, allocating a large write memory minimizes the write cost but the read cost becomes much higher. An optimal memory allocation must balance these two costs in order to minimize the overall cost.
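The exhaustive search used above to find the optimal allocation can be sketched as follows. Here measure_io_cost is a placeholder for running the TPC-C workload under a fixed allocation and reporting its per-transaction disk I/O, and the synthetic cost curve at the end exists only to exercise the function; neither is part of our system.

```python
def exhaustive_search(total_mb, measure_io_cost, step_mb=128):
    """Sweep write-memory allocations to find the lowest per-transaction I/O cost.

    Tries every write-memory size from step_mb up to total_mb - step_mb in
    increments of step_mb; the rest of the budget goes to the buffer cache.
    measure_io_cost(write_mb, cache_mb) is expected to run the workload with
    that fixed allocation and return the observed KB of disk I/O per transaction.
    """
    best_write_mb, best_cost = None, float("inf")
    for write_mb in range(step_mb, total_mb, step_mb):
        cost = measure_io_cost(write_mb, total_mb - write_mb)
        if cost < best_cost:
            best_write_mb, best_cost = write_mb, cost
    return best_write_mb, best_cost

# Exercising the search with a synthetic cost curve (not measured data).
synthetic_cost = lambda w, c: 2.0 / (w ** 0.5) + 1.0 / (c ** 0.5)
print(exhaustive_search(4 * 1024, synthetic_cost))
```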
Finally, we used a variation of TPC-C to evaluate the responsiveness of the memory tuner. This experiment started with the default TPC-C transaction mix, and the workload then changed into a read-mostly variation, one which contains 5% write transactions, i.e., new_order, payment, and delivery, and 95% read transactions, i.e., order_status and stock_level. Each experiment ran for two hours and the workload was changed after the first hour. The resulting allocated write memory and I/O cost over time are shown in Figure 13. After the workload changes, the memory tuner immediately detects the change in the next tuning cycle and begins allocating more memory to the buffer cache. Note that the write memory decreases relatively slowly because the memory tuner limits its step size to 10% of the current write memory size to ensure stability. However, we see that this does not impact the overall I/O cost too much because the buffer cache already occupies most of the memory. Also note that the write memory size does not change when the total memory is 16GB or 20GB. This is because the buffer cache already occupies most of the memory and allocating more write memory would not change the total I/O cost too much.

Figure 13: Experimental Results of Memory Tuner's Responsiveness on TPC-C ((a) tuned write memory; (b) tuned I/O cost; total memory from 4GB to 20GB).

Figure 14: Impact of Maximum Step Size on Memory Tuner's Responsiveness ((a) tuned write memory; (b) tuned I/O cost; maximum step size from 10% to 100%).

To study the impact of the maximum step size on the responsiveness and stability of the memory tuner, we further carried out an experiment that varies the maximum step size from 10% to 100%. The total memory was set at 12GB. Each experiment ran for four hours and the workload changed from the default TPC-C mix into the read-heavy mix after the first hour. The tuned write memory and I/O cost over time are shown in Figures 14a and 14b, respectively. As the results show, increasing the maximum step size improves responsiveness by allowing the memory tuner to change the memory allocation more quickly. However, this also negatively impacts the memory tuner's stability and leads to some oscillation. Also note that decreasing the write memory more rapidly has a very small impact on the I/O cost since the buffer cache already occupies most of the memory. Thus, the memory tuner's default maximum step size is set at 10% to ensure stability while providing reasonable responsiveness.
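The stability/responsiveness trade-off discussed above stems from capping each tuning step relative to the current write memory size. A minimal sketch of such a clamped update is shown below, assuming the tuner has already computed a desired change; the 64MB floor and the parameter names are illustrative assumptions rather than the tuner's actual interface.

```python
def clamped_step(current_write_mb, desired_delta_mb, max_step_frac=0.10,
                 min_write_mb=64):
    """Apply one tuning step with the step size capped at a fraction of the
    current write memory (10% by default, matching the tuner's default).

    A small cap keeps the allocation stable but reacts to workload changes
    over several cycles; a large cap reacts faster but can oscillate.
    """
    cap = max_step_frac * current_write_mb
    delta = max(-cap, min(cap, desired_delta_mb))        # clamp the step
    return max(min_write_mb, current_write_mb + delta)   # never drop below the floor

# Example: shrinking a 10GB write memory after a shift to a read-mostly mix.
write_mb = 10 * 1024
for _ in range(5):                       # five tuning cycles
    write_mb = clamped_step(write_mb, desired_delta_mb=-4096)
    print(round(write_mb))               # decreases by at most 10% per cycle
```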
In this set of experiments, we evaluated the memory tuner in terms of its mechanics, accuracy, and responsiveness. The memory tuner uses a white-box approach, modeling the I/O cost of the LSM storage system and minimizing the overall I/O cost based on the relative gains of allocating more memory to the buffer cache or to the write memory. The experimental results show that this white-box approach enables the memory tuner to achieve high accuracy with reasonable responsiveness, making it suitable for online tuning.
In this paper, we have described and evaluated a number of techniques to break down the memory walls in LSM-based storage systems. We first presented an LSM memory management architecture that facilitates adaptive memory management. We further proposed a partitioned memory component structure with new flush policies to better utilize the write memory in order to minimize the overall write cost. To break down the memory wall between the write memory and the buffer cache, we further introduced a memory tuner that uses a white-box approach to continuously tune the memory allocation. We have empirically demonstrated that these techniques together enable adaptive memory management to minimize the I/O cost for LSM-based storage systems.
ACKNOWLEDGMENTS
This work has been supported by NSF awards CNS-1305430, IIS-1447720, IIS-1838248, and CNS-1925610, along with industrial support from Amazon, Google, and Microsoft and support from the Donald Bren Foundation (via a Bren Chair).
REFERENCES
[7] … ACM SIGMOD. ACM, 930–932.
[8] Sattam Alsubaiee et al. 2014. AsterixDB: A Scalable, Open Source BDMS. PVLDB 7, 14 (2014), 1905–1916.
[9] Oana Balmau et al. 2017. FloDB: Unlocking Memory in Persistent Key-Value Stores. In European Conference on Computer Systems (EuroSys). 80–94.
[10] Oana Balmau et al. 2017. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In USENIX Annual Technical Conference (ATC). 363–375.
[11] Oana Balmau et al. 2019. SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores. In USENIX Annual Technical Conference (ATC). 753–766.
[12] Laurent Bindschaedler et al. 2020. Hailstorm: Disaggregated Compute and Storage for Distributed LSM-based Databases. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 301–316.
[13] Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors. CACM 13, 7 (July 1970), 422–426.
[14] Edward Bortnikov et al. 2018. Accordion: Better Memory Organization for LSM Key-value Stores. PVLDB 11, 12 (2018), 1863–1875.
[15] Yingyi Bu et al. 2013. A Bloat-Aware Design for Big Data Applications. SIGPLAN Not. 48, 11 (June 2013), 119–130.
[16] Michael J. Carey. 2019. AsterixDB Mid-Flight: A Case Study in Building Systems in Academia. In ICDE. 1–12.
[17] Hong Tai Chou and David J. DeWitt. 1986. An evaluation of buffer management strategies for relational database systems. Algorithmica 1, 1 (Nov. 1986), 311–336.
[18] Brian F. Cooper et al. 2010. Benchmarking Cloud Serving Systems with YCSB. In ACM SoCC. 143–154.
[19] Niv Dayan et al. 2017. Monkey: Optimal Navigable Key-Value Store. In ACM SIGMOD. 79–94.
[20] Niv Dayan et al. 2018. Optimal Bloom Filters and Adaptive Merging for LSM-Trees. ACM TODS 43, 4, Article 16 (Dec. 2018), 16:1–16:48.
[21] Niv Dayan and Stratos Idreos. 2018. Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. In ACM SIGMOD. 505–520.
[22] Niv Dayan and Stratos Idreos. 2019. The Log-Structured Merge-Bush & the Wacky Continuum. In ACM SIGMOD. 449–466.
[23] Siying Dong et al. 2017. Optimizing Space Amplification in RocksDB. In CIDR, Vol. 3. 3.
[24] Songyun Duan et al. 2009. Tuning Database Configuration Parameters with iTuned. PVLDB 2, 1 (2009), 1246–1257.
[25] Raman Grover and Michael J. Carey. 2015. Data Ingestion in AsterixDB. In EDBT. 605–616.
[26] Gui Huang et al. 2019. X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing. In ACM SIGMOD. 651–665.
[27] Theodore Johnson and Dennis Shasha. 1994. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In VLDB. 439–450.
[28] Taewoo Kim et al. 2020. Robust and efficient memory management in Apache AsterixDB. Software: Practice and Experience (2020). https://doi.org/10.1002/spe.2799
[29] Yongkun Li et al. 2019. ElasticBF: Elastic Bloom Filter with Hotness Awareness for Boosting Read Performance in Large Key-Value Stores. In USENIX Annual Technical Conference (ATC). 739–752.
[30] Yongkun Li et al. 2019. Enabling Efficient Updates in KV Storage via Hashing: Design and Performance Evaluation. ACM Transactions on Storage (TOS) 15, 3 (2019), 20.
[31] Hyeontaek Lim et al. 2016. Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs. In USENIX Conference on File and Storage Technologies (FAST). 149–166.
[32] Jiaheng Lu et al. 2019. Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems. PVLDB 12, 12 (2019), 1970–1973.
[33] Lanyue Lu et al. 2016. WiscKey: Separating Keys from Values in SSD-conscious Storage. In USENIX Conference on File and Storage Technologies (FAST). 133–148.
[34] Chen Luo. 2020. Breaking Down Memory Walls in LSM-based Storage Systems. In ACM SIGMOD. https://doi.org/10.1145/3318464.3384399
[35] Chen Luo et al. 2019. Umzi: Unified Multi-Zone Indexing for Large-Scale HTAP. In EDBT. 1–12.
[36] Chen Luo and Michael J. Carey. 2019. Efficient Data Ingestion and Query Processing for LSM-Based Storage Systems. PVLDB 12, 5 (2019), 531–543.
[37] Chen Luo and Michael J. Carey. 2019. On Performance Stability in LSM-based Storage Systems. PVLDB 13, 4 (2019), 449–462.
[38] Chen Luo and Michael J. Carey. 2020. LSM-based storage techniques: a survey. The VLDB Journal 29, 1 (2020), 393–418.
[39] Qizhong Mao et al. 2019. Experimental Evaluation of Bounded-Depth LSM Merge Policies. In IEEE International Conference on Big Data. 523–532.
[40] Fei Mei et al. 2018. SifrDB: A Unified Solution for Write-Optimized Key-Value Stores in Large Datacenter. In ACM SoCC. 477–489.
[41] Elizabeth J. O'Neil et al. 1993. The LRU-K Page Replacement Algorithm for Database Disk Buffering. SIGMOD Rec. 22, 2 (June 1993), 297–306.
[42] Patrick O'Neil et al. 1996. The Log-structured Merge-tree (LSM-tree). Acta Inf. 33, 4 (1996), 351–385.
[43] Mohiuddin Abdul Qader et al. 2018. A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases. In ACM SIGMOD. 551–566.
[44] Pandian Raju et al. 2017. PebblesDB: Building Key-Value Stores Using Fragmented Log-Structured Merge Trees. In ACM SOSP. 497–514.
[45] Kai Ren et al. 2017. SlimDB: A Space-efficient Key-value Storage Engine for Semi-sorted Data. PVLDB 10, 13 (2017), 2037–2048.
[46] Giovanni Maria Sacco and Mario Schkolnick. 1982. A Mechanism for Managing the Buffer Pool in a Relational Database System Using the Hot Set Model. In VLDB. 257–262.
[47] Russell Sears and Raghu Ramakrishnan. 2012. bLSM: A General Purpose Log Structured Merge Tree. In ACM SIGMOD. 217–228.
[48] Adam J. Storm et al. 2006. Adaptive Self-tuning Memory in DB2. In VLDB (Seoul, Korea). 1081–1092.
[49] Jian Tan et al. 2019. IBTune: Individualized Buffer Tuning for Large-Scale Cloud Databases. PVLDB 12, 10 (2019), 1221–1234.
[50] Dinh Nguyen Tran et al. 2008. A new approach to dynamic self-tuning of database buffers. ACM Transactions on Storage (TOS) 4, 1 (2008), 1–25.
[51] Dana Van Aken et al. 2017. Automatic Database Management System Tuning Through Large-Scale Machine Learning. In ACM SIGMOD. 1009–1024.
[52] Xikui Wang and Michael J. Carey. 2019. An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB. PVLDB 12, 11 (2019), 1485–1498.
[53] Andrew Chi-Chih Yao. 1978. On random 2–3 trees. Acta Informatica 9, 2 (1978), 159–170.
[54] Ji Zhang et al. 2019. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. In ACM SIGMOD. 415–432.
[55] Teng Zhang et al. 2020. FPGA-Accelerated Compactions for LSM-based Key-Value Store. In 18th USENIX Conference on File and Storage Technologies (FAST).