Benchmarking Learned Indexes
Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska
MIT CSAIL, TUM, Intel Labs
{ryanmarcus, kipf, kraska}@mit.edu, {renen, stoian, kemper, neumann}@in.tum.de, [email protected]

ABSTRACT
Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We also investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.
1. INTRODUCTION
While index structures are one of the most well-studied components of database management systems, recent work [12, 19] provided a new perspective on this decades-old topic, showing how machine learning techniques can be used to develop so-called learned index structures. Unlike their traditional counterparts (e.g., [10, 15, 16, 20, 31, 33]), learned index structures build an explicit model of the underlying data to provide effective indexing.

Since learned index structures have been proposed, they have been criticized [26, 27]. The main reasons for these criticisms were the lack of an efficient open-source implementation of the learned index structure, inadequate datasets, and the lack of a standardized benchmark suite to ensure a fair comparison between the different approaches. Even worse, the lack of an open-source implementation forced researchers to re-implement the techniques of [19], or to rely on back-of-the-envelope calculations, to compare against learned index structures. While not a bad thing per se, it is easy to leave the baseline unoptimized, or make other unrealistic assumptions, even with the best of intentions, potentially rendering the main takeaways void.

For example, Ferragina and Vinciguerra recently proposed the PGM index [13], a learned index structure with interesting theoretical properties, which is recursively built bottom-up. Their experimental evaluation showed that the PGM index was strictly superior to traditional indexes as well as their own implementation of the original learned index [19]. This strong result surprised the authors of [19], who had experimented with bottom-up approaches and usually found them to be slower to execute (see Section 3.4 for a discussion of why this may be the case). This motivated us to investigate whether the results of [13] would hold against tuned implementations of the original learned index [19] and other structures.

Further complicating matters, learned structures have an "unfair" advantage on synthetic datasets, as synthetic datasets are often surprisingly easy to learn. Hence, it is often easy to show that a learned structure outperforms the more traditional approaches just by using the right kind of data. While this is also true for almost any benchmark, it is much more pronounced for learned algorithms and data structures, as their entire goal is to automatically adjust to the data distribution and even the workload.

In this paper, we try to address these problems on three fronts: (1) we provide a first open-source implementation of RMIs for researchers to compare against and improve upon, (2) we created a repository of several real-world datasets and workloads for testing, and (3) we created a benchmarking suite which makes it easy to compare against learned and traditional index structures. To avoid comparing against weak baselines, our open-source benchmarking suite [5] contains implementations of index structures that are either widespread, tuned by their original authors, or both.
Understanding learned indexes.
In addition to providing an open-source benchmark for use in future research, we also tried to achieve a deeper understanding of learned index structures, extending the work of [17].

First, we present a Pareto analysis of three recent learned index structures (RMIs [19], PGM indexes [13], and RS indexes [18]) and several traditional index structures, including trees, tries, and hash tables. We show that, in a warm-cache, tight-loop setting, all three variants of learned index structures can provide better performance/size tradeoffs than several state-of-the-art traditional index structures. We extend this analysis to multiple dataset sizes, 32- and 64-bit integers, and different search techniques (i.e., binary search, linear search, interpolation search).

Second, we analyze why learned index structures achieve such good performance. While we were unable to find a single metric that fully explains the performance of an index structure (it seems intuitive that such a metric does not exist), we offer a statistical analysis of performance counters and other properties. The single most important explanatory variable was cache misses, although cache misses alone are not enough for a statistically significant explanation. Surprisingly, we found that branch misses do not explain why learned index structures perform better than traditional structures, as originally claimed in [19]. In fact, we found that both learned index structures and traditional index structures use branching efficiently.

Figure 1: Index structures map each lookup key to a search bound. (1) A query for a particular key is made. (2) An index structure maps the lookup key to a search bound, which must contain the correct index. (3) Given a valid search bound, a search function (e.g., binary search) is used to locate the correct index within the search bound. The search bound must contain the "lower bound" of the key (i.e., the smallest key greater than or equal to the lookup key). The depicted search bound is valid for the lookup key 72 because the key 95 is in the bound.

Third, we analyze the performance of a wide range of index structures in the presence of memory fences, cold caches, and multi-threaded environments, to test their behavior under more realistic settings. In all scenarios, we found that learned approaches perform surprisingly well.

However, our study is not without its limitations. We focused only on read-only workloads, and we tested each index structure in isolation (e.g., in a lookup loop, without integration into any broader application). While this certainly does not cover all potential use cases, in-memory performance is increasingly important, and many write-heavy DBMS architectures are also moving towards immutable read-only data structures (for example, see LSM-trees in RocksDB [4, 21]). Hence, we believe our benchmark can still guide the design of many systems to come and, more importantly, serve as a foundation to develop benchmarks for mixed read/write workloads and the next generation of learned index structures which support writes [11, 13, 14].
2. FORMULATION & DEFINITIONS
As depicted in Figure 1, we define an index structure I over a zero-indexed sorted array D as a mapping between an integer lookup key x ∈ Z and a search bound (lo, hi) ∈ (Z⁺ × Z⁺), where Z⁺ is the positive integers and zero:

    I : Z → (Z⁺ × Z⁺)

We do not consider indexes over unsorted data, nor do we consider non-integer keys. We assume that data is stored in a way supporting fast random access (e.g., an array).

Search bounds are indexes into D. A valid index structure maps any possible lookup key x to a bound that contains the "lower bound" of x: the index of the smallest key in D that is greater than or equal to x. Formally, we define the lower bound of a key x, LB(x), as:

    LB(x) = i  ↔  [ D_i ≥ x ∧ ¬∃j (j < i ∧ D_j ≥ x) ]

As a special case, we define the lower bound of any key greater than the largest key in D as |D|, one past the largest valid index: LB(x) = |D| for x > max D. Our definition of "lower bound" corresponds to the C++ standard [2].
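This correspondence with the C++ standard can be made concrete with a short sketch (ours, for illustration only), using std::lower_bound:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch: LB(x) as defined above matches std::lower_bound,
// which returns the first position whose key is >= x.
size_t lower_bound_index(const std::vector<uint64_t>& data, uint64_t x) {
  auto it = std::lower_bound(data.begin(), data.end(), x);
  // For x greater than the largest key, this yields data.size(),
  // i.e., one past the largest valid index, matching the special case above.
  return static_cast<size_t>(it - data.begin());
}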
Figure 2: The cumulative distribution function (CDF) view of a sorted array.

We say that an index structure is valid if and only if it produces search bounds that contain the lower bound for every possible lookup key:

    ∀x ∈ Z [ I(x) = (lo, hi) → lo ≤ LB(x) ≤ hi ]

Intuitively, this view of index structures corresponds to an approximate index, an index that returns a search range instead of the exact position of a key. We are not the first to note that both traditional structures like B-Trees and learned index structures can be viewed in this way [8, 19]. Given a valid index, the actual index of the lower bound for a lookup key is located via a "last mile" search (e.g., binary search). This last mile search only needs to examine the keys within the provided search bound (e.g., Figure 1).

Learned index structures use machine learning techniques ranging from deep neural networks to simple regression in order to model the cumulative distribution function, or CDF, of a sorted array [19]. Here, we use the term CDF to mean the function mapping keys to their relative position in an array. This is strongly connected to the traditional interpretation of the CDF from statistics: the CDF of a particular key x is the proportion of keys less than x. Figure 2 shows the CDF for some example data.

Given the CDF of a dataset, finding the lower bound of a lookup key x in a dataset D with a CDF CDF_D is trivial: one simply computes CDF_D(x) × |D|. Learned index structures function by approximating the CDF of the dataset using learned models (e.g., linear regressions). Of course, such learned models are never entirely accurate. For example, the blue line in Figure 2 represents one possible imperfect approximation of the CDF. While imperfect, this approximation has a bounded error: the largest deviation from the blue line to the actual CDF occurs at key 12, which has a true CDF value of 0.4 but an approximated value of 0.24. The maximum error of this approximation is thus 0.4 − 0.24 = 0.16 (some adjustments may be required for lookups of absent keys). Given this approximation function A and the maximum error of A, we can define an index structure I_A as such:

    I_A(x) = ( A(x) − |D| × 0.16 ,  A(x) + |D| × 0.16 )

using the maximum error of the approximation. Note that this technique, while utilizing approximate machine learning techniques, never produces an incorrect search bound.

Figure 3: A recursive model index (RMI). The linear model (stage 1) makes a coarse-grained prediction. Based on this, one of the cubic models (stage 2) makes a refined prediction.

One can view a B-Tree as a way of memorizing the CDF function for a given dataset: a B-Tree in which every n-th key is inserted can be viewed as an approximate index with an error bound of n − 1. At one extreme, if every key is inserted into the B-Tree, the B-Tree perfectly maps any possible lookup key to its position in the underlying data (an error bound of zero). Instead, one can insert every other key into a B-Tree in order to reduce the size of the index. This results in a B-Tree with an error bound of one: any location given by the B-Tree can be off by at most one position.
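To make the construction of I_A concrete, the following is a minimal sketch (ours, under the assumption that the approximation returns an estimated position and that its maximum error is known in positions); it derives a search bound and resolves it with a bounded last-mile binary search:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: given a CDF approximation (`approx`, returning an estimated
// position of `key` in `data`) with maximum error `max_err` (in positions),
// build a valid search bound and finish with a "last mile" binary search.
// `approx` and `max_err` are illustrative placeholders.
template <typename Approx>
size_t lookup(const std::vector<uint64_t>& data, uint64_t key,
              const Approx& approx, double max_err) {
  double pos = approx(key);  // estimated position of `key` in `data`
  size_t hi = static_cast<size_t>(
      std::min<double>(static_cast<double>(data.size()),
                       std::ceil(pos + max_err) + 1));
  size_t lo = static_cast<size_t>(std::max(0.0, std::floor(pos - max_err)));
  lo = std::min(lo, hi);  // guard against degenerate bounds
  // The bound is valid, so the lower bound of `key` lies inside [lo, hi).
  auto it = std::lower_bound(data.begin() + lo, data.begin() + hi, key);
  return static_cast<size_t>(it - data.begin());
}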
3. LEARNED INDEX STRUCTURES
In this work, we evaluate the performance of three different learned index structures: recursive model indexes (RMI), radix spline indexes (RS), and piecewise geometric model indexes (PGM). We do not compare with a number of other learned index structures [11, 14, 24] because tuned implementations could not be made publicly available. While all three of these techniques approximate the CDF of the underlying data, the way these approximations are constructed varies. We next give a high-level overview of each technique, followed by a discussion of their differences.
3.1 RMI

Originally presented by Kraska et al. [19], RMIs use a multi-stage model, combining simpler machine learning models together. For example, as depicted in Figure 3, an RMI with two stages, a linear stage and a cubic stage, would first use a linear model to make an initial prediction of the CDF for a specific key (stage 1). Then, based on that prediction, the RMI would select one of several cubic models to refine this initial prediction (stage 2).
Structure.
When all keys fit in memory, RMIs with more than two stages are almost never required [22]. Thus, here we explain only two-stage RMIs for simplicity; see [19] for a generalization to n stages. A two-stage RMI is a CDF approximator A trained on |D| data points (key/index pairs). The RMI approximator A is composed of a single first-stage model f_0 and B second-stage models f_i. The value B is referred to as the "branching factor" of the RMI. Formally, the RMI is defined as:

    A(x) = f_⌊B × f_0(x) / |D|⌋ (x)    (1)

Intuitively, the RMI first uses the stage-one model f_0(x) to compute a rough approximation of the CDF of the input key x. This coarse-grained approximation is then scaled between 0 and the branching factor B, and this scaled value is used to select a model f_i(x) from the second stage.
The selected second-stage model is used to produce the final approximation. The stage-one model f_0(x) can be thought of as partitioning the data into B buckets, and each second-stage model f_i(x) is responsible for approximating the CDF of only the keys that fall into the i-th bucket.

Choosing the correct models for both stages (f_0 and f_i) and selecting the best branching factor for a particular dataset depends on the desired memory footprint of the RMI as well as the underlying data. In this work, we use the CDFShop [22] auto-tuner to determine these hyperparameters.

Training.
Let (x, y) ∈ D be the set of key/index pairs in the underlying data. Then, an RMI is trained by adjusting the parameters contained in f_0(x) and f_i(x) to minimize the squared-error objective:

    Σ_{(x,y) ∈ D} (A(x) − y)²    (2)

Intuitively, minimizing Equation 2 is done by training "top down": first, the stage-one model is trained, and then each stage-two model is trained to fine-tune the prediction. Details can be found in [19] and our implementation at [1]. A minimal sketch of the two-stage prediction follows.
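The following sketch implements Equation (1) for a two-stage RMI with linear models in both stages. It is our illustration, not the tuned open-source implementation [1]: the real RMI additionally tunes model types per stage and records per-leaf error bounds to produce search bounds.

#include <cstddef>
#include <cstdint>
#include <vector>

// A simple linear model: y = a * x + b.
struct Linear {
  double a, b;
  double operator()(double x) const { return a * x + b; }
};

// Minimal two-stage RMI sketch (Equation 1).
struct TwoStageRMI {
  Linear f0;                // stage-one model
  std::vector<Linear> f1;   // B second-stage models
  size_t n;                 // |D|, the number of keys

  // Predicted position of `key`: stage one selects a bucket, stage two refines.
  double predict(uint64_t key) const {
    double first = f0(static_cast<double>(key));  // rough position estimate
    size_t B = f1.size();
    // Scale into [0, B) to select the second-stage model (floor in Eq. 1).
    long i = static_cast<long>(static_cast<double>(B) * first /
                               static_cast<double>(n));
    if (i < 0) i = 0;
    if (i >= static_cast<long>(B)) i = static_cast<long>(B) - 1;
    return f1[static_cast<size_t>(i)](static_cast<double>(key));
  }
};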
3.2 RadixSpline (RS)

An RS index [18] consists of a linear spline [25] that approximates the CDF of the data and a radix table that indexes the resulting spline points (cf. Figure 4). In contrast to RMI [19], and similar to FITing-Tree [14] and PGM [13], RS is built in a bottom-up fashion. Uniquely, RS can be built in a single pass with a constant worst-case cost per element (PGM provides a constant amortized cost per element).

Figure 4: A radix spline index. A linear spline is used to approximate the CDF of the data. Prefixes of the resulting spline points are indexed in a radix table to accelerate the search on the spline. Figure from [18].

Structure.
As depicted in Figure 4, RS consists of a radix table and a set of spline points that define a linear spline over the CDF of the data. The radix table indexes r-bit prefixes of the spline points and serves as an approximate index over the spline points; its purpose is to accelerate binary searches over the spline points. The radix table is represented as an array containing 2^r offsets into the sorted array of spline points. The spline points themselves are represented as key/index pairs. To locate a key in a spline segment, linear interpolation between the two spline points is used.

Using the example in Figure 4, a lookup in RS works as follows: First, the r most significant bits b of the lookup key are extracted (r = 3 and b = 101). Then, the extracted bits b are used as an offset into the radix table to retrieve the offsets stored at the b-th and the (b+1)-th position (e.g., the 5th and the 6th position). Next, RS performs a binary search between the two offsets on the sorted array of spline points to locate the two spline points that encompass the lookup key. Once the relevant spline segment has been identified, RS uses linear interpolation between the two spline points to estimate the position of the lookup key in the underlying data. A sketch of this lookup path is shown below.
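The following is an illustrative re-sketch of the lookup path just described, under our own naming (SplinePoint, RadixSpline, and estimate are not the reference implementation's identifiers); it assumes the lookup key falls within the spline's key range.

#include <algorithm>
#include <cstdint>
#include <vector>

struct SplinePoint { uint64_t key; double pos; };

// Sketch of an RS lookup: a radix table over r-bit key prefixes narrows a
// binary search on the spline points; linear interpolation between the two
// enclosing spline points estimates the position in the underlying data.
struct RadixSpline {
  unsigned r;                          // number of radix bits
  std::vector<uint32_t> radix_table;   // 2^r + 1 offsets into `spline`
  std::vector<SplinePoint> spline;

  double estimate(uint64_t key) const {
    uint64_t b = key >> (64 - r);      // r most significant bits
    uint32_t begin = radix_table[b], end = radix_table[b + 1];
    // Binary search for the first spline point with key >= lookup key.
    auto it = std::lower_bound(
        spline.begin() + begin, spline.begin() + end, key,
        [](const SplinePoint& p, uint64_t k) { return p.key < k; });
    const SplinePoint& right = *it;
    const SplinePoint& left = *(it - 1);  // assumes key >= first spline key
    double frac = static_cast<double>(key - left.key) /
                  static_cast<double>(right.key - left.key);
    return left.pos + frac * (right.pos - left.pos);
  }
};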
Training.

To build the spline layer, RS uses a one-pass spline fitting algorithm [25] that is similar to the shrinking cone algorithm of FITing-Tree [14]. The spline algorithm guarantees a user-defined error bound: at a high level, whenever the current error corridor exceeds the user-supplied bound, a new spline point is created. Whenever the spline algorithm encounters a new r-bit prefix, a new entry is inserted into the pre-allocated radix table.

RS has only two hyperparameters (spline error and number of radix bits), which makes it straightforward to tune. In practice, few configurations need to be tested to reach a desired performance/size tradeoff on a given dataset [18].

3.3 PGM

The PGM index is a multi-level structure, where each level represents an error-bounded piecewise linear regression [13]. An example PGM index is depicted in Figure 5. In the first level, the data is partitioned into three segments, each represented by a simple linear model (f_0, f_1, f_2). By construction, each of these linear models predicts the CDF of keys in its corresponding segment to within a preset error bound. The partition boundaries of this first level are then treated as their own sorted dataset, and another error-bounded piecewise linear regression is computed. This is repeated until the top level of the PGM is sufficiently small.

Figure 5: A piecewise geometric model (PGM) index.

Structure.
A piecewise linear regression partitions the data into n + 1 segments with a set of points p_1, p_2, ..., p_n. The entire piecewise linear regression is expressed as a piecewise function:

    F(x) = a_0 × x + b_0    if x < p_1
           a_1 × x + b_1    if p_1 ≤ x < p_2
           a_2 × x + b_2    if p_2 ≤ x < p_3
           ...
           a_n × x + b_n    if x ≥ p_n

Each regression in the PGM index is constructed with a fixed error bound ε. Such a regression can trivially be used as an approximate index. PGM indexes apply this trick recursively, first building an error-bounded piecewise regression model over the underlying data, then building another error-bounded piecewise regression model over the partitioning points of the first regression. Key lookups are performed by searching each index layer until the regression over the underlying data is reached. A sketch of evaluating one such layer follows.
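As an illustration, the following sketch evaluates one error-bounded piecewise linear level; a full PGM lookup applies this per level, descending until the data itself is reached. Names and layout are ours, not the authors' code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One segment of a piecewise linear regression: valid from `first_key` on.
struct Segment { uint64_t first_key; double slope, intercept; };

// One PGM level: segments sorted by first_key.
struct PGMLevel {
  std::vector<Segment> segs;

  // Predict a position in the next level (or in the data, at the bottom),
  // searching only segments in [lo, hi) -- the window implied by the
  // previous level's error bound epsilon.
  double predict(uint64_t key, size_t lo, size_t hi) const {
    auto it = std::upper_bound(
        segs.begin() + lo, segs.begin() + std::min(hi, segs.size()), key,
        [](uint64_t k, const Segment& s) { return k < s.first_key; });
    const Segment& s = *(it - 1);  // last segment starting at or before key
    return s.slope * static_cast<double>(key) + s.intercept;
  }
};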
Training.
Each regression is constructed optimally, in the sense that the fewest pieces are used to achieve a preset maximum error. This can be done quickly using the approach of [32]. The first regression is performed on the underlying data, resulting in a set of split points (the boundaries of each piece of the regression) and regression coefficients. These split points are then treated as if they were a new dataset, and the process is repeated, resulting in fewer and fewer pieces at each level. Since each piecewise linear regression contains the fewest possible segments, the PGM index is optimal in the sense of piecewise linear models [13].

Intuitively, PGM indexes are constructed "bottom-up": first, an error bound is chosen, and then a minimal piecewise linear model is found that achieves that error bound. This process is repeated until the piecewise models become smaller than some threshold. The PGM index can also handle inserts, and can be adapted to a particular query workload. We do not evaluate either capability here.
3.4 Differences

RMIs, RS indexes, and PGM indexes all provide an approximation of the CDF of some underlying data using machine learning techniques. However, the specifics vary.
Model types.
While RS indexes and PGM indexes use only a single type of model (spline regression and piecewise linear regression, respectively), RMIs can use a wide variety of model types. This gives the RMI a greater degree of flexibility, but also increases the complexity of tuning the RMI. While both the PGM index and RS index can be tuned by adjusting just two knobs, automatically optimizing an RMI requires a more involved approach, such as [22]. Both the PGM index authors and the RS index authors mention integrating other model types as future work [13, 18].
Top-down vs. bottom-up.
RMIs are trained "top down", first fitting the topmost model and then training subsequent layers to correct errors. PGM and RS indexes are trained "bottom up", first fitting the bottommost layer to a fixed accuracy and then building subsequent layers to quickly search the bottommost layer for the appropriate model. Because both PGM and RS indexes require searching this bottommost layer (PGM may require searching several intermediate layers), they may require more branches or cache misses than an RMI. While an RMI uses its topmost model to directly index into the next layer, avoiding a search entirely, the bottommost layer of the RMI does not have a fixed error bound; any bottom-layer model could have a large maximum error.

RS indexes and PGM indexes also differ in how the bottommost layer is searched. PGM indexes decompose the problem recursively, essentially building a second PGM index on top of the bottommost layer. Thus, a PGM index may have many layers, each of which must be searched (within a fixed range) during inference. On the other hand, an RS index uses a radix table to narrow the search range, but there is no guarantee on the search range's size. If the radix table provides a comparable search range as the upper level of a PGM index, then an RS index locates the proper final model with a comparatively cheaper operation (a bitshift and an array lookup). If the radix table does not provide a narrow search range, significant time may be spent searching for the appropriate bottom-layer model.
Method          Updates   Ordered   Type
PGM [13]        Yes       Yes       Learned
RS [18]         No        Yes       Learned
RMI [19]        No        Yes       Learned
BTree [7]       Yes       Yes       Tree
IBTree [15]     Yes       Yes       Tree
FAST [16]       No        Yes       Tree
ART [20]        Yes       Yes       Trie
FST [33]        Yes       Yes       Trie
Wormhole [31]   Yes       Yes       Hybrid hash/trie
CuckooMap [6]   Yes       No        Hash
RobinHash [3]   Yes       No        Hash
RBS             No        Yes       Lookup table
BS              No        Yes       Binary search

Table 1: Search techniques evaluated.
4. EXPERIMENTS
Our experimental analysis is divided into six sections.

1. Setup (Section 4.1): we describe the index structures, baselines, and datasets used.
2. Pareto analysis (Section 4.2): we analyze the size and performance tradeoffs offered by each index structure, including variations in dataset and key size. We find that learned index structures offer competitive performance.
3. Explanatory analysis (Section 4.3): we analyze indexes via performance counters (e.g., cache misses) and other descriptive statistics. We find that no single metric can fully account for the performance of learned structures.
4. CPU interactions (Section 4.4): we analyze how the CPU cache and operator reordering impact the performance of index structures. We find that learned index structures benefit disproportionately from these effects.
5. Multithreading (Section 4.5): we analyze the throughput of each index in a multithreaded environment. We find that learned structures have comparatively high throughput, possibly attributable to the fact that they incur fewer cache misses per lookup.
6. Build times (Section 4.6): we analyze the time to build each index structure. We find that RMIs are slow to build compared to PGM and RS indexes, but that (unsurprisingly) no learned structure yet provides builds as fast as insert-optimized traditional index structures.
Experiments are conducted on a machine with 256 GB of RAM and an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz.
4.1 Setup

In this section, we describe the index structures we evaluate, and how we tune their size/performance tradeoffs. Table 1 lists each technique and its capabilities.
Learned indexes.
We compare with RMIs, PGM indexes, and RadixSpline indexes (RS), each of which is described in Section 3. We use implementations tuned by each structure's original authors. RMIs are tuned using CDFShop [22], an automatic RMI optimizer. RS and PGM are tuned by varying the error tolerance of the underlying models.
Tree structures.
We compare with several tree-structured indexes: the STX B-Tree (BTree) [7], an interpolating B-Tree (IBTree) [15], the Adaptive Radix Trie (ART) [20], the Fast Architectural-Sensitive Tree (FAST) [16], the Fast Succinct Trie (FST) [33], and Wormhole [31].

For each tree structure, we tune the size/performance tradeoff by inserting a subset of the data, as described in Section 2. To build a tree of maximum size with perfect accuracy, we insert every key. To build a tree with a smaller size and decreased accuracy, we insert every other key. We note that this technique, while simple, may not be the ideal way to trade space for accuracy in each tree structure. Specifically, ART may admit a smarter method in which keys are retained or discarded based on the fill level of a node. We only evaluate the simple and universal technique of inserting fewer keys into each structure, and leave structure-specific optimizations to future work.
Hashing.
While most hash tables do not support range queries, hash tables are still an interesting point of comparison due to their unmatched lookup performance. (Wormhole [31], which we evaluate, represents a state-of-the-art ordered hashing approach.) Unordered hash tables cannot be shrunk using the same technique as we use for trees; therefore, we only evaluate hash tables that contain every key. We evaluate a standard implementation of a Robinhood hash table (RobinHash) [3] and a SIMD-optimized Cuckoo map (CuckooMap) [6].
Baselines.
We also include two naive baselines: binary search (BS) and radix binary search (RBS). Radix binary search [17] stores only the radix table used by the learned RS approach. We vary the size of the radix table to achieve different size/performance tradeoffs. A sketch of an RBS lookup is shown below.
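For concreteness, a radix binary search lookup can be sketched as follows (the naming is ours; the actual baseline lives in the benchmark suite [5]): a radix table over the r most significant key bits maps each prefix to a range of the sorted data, and a binary search finishes within that range.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of radix binary search (RBS).
struct RadixBinarySearch {
  unsigned r;                          // radix bits; table has 2^r + 1 entries
  std::vector<uint64_t> table;         // table[p] = first index with prefix >= p
  const std::vector<uint64_t>* data;   // the sorted key array

  size_t lookup(uint64_t key) const {
    uint64_t p = key >> (64 - r);      // r most significant bits of the key
    uint64_t lo = table[p], hi = table[p + 1];
    // Binary search restricted to the range implied by the radix table.
    auto it = std::lower_bound(data->begin() + lo, data->begin() + hi, key);
    return static_cast<size_t>(it - data->begin());
  }
};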
We use four real-world datasets for our evaluation. Each dataset consists of 200 million unsigned 64-bit integer keys. We test larger datasets in Section 4.2.1, and we test 32-bit datasets in Section 4.2.2. We generate 8-byte (random) payloads for each key. For each lookup, we compute the sum of these values to ensure the results are accurate.

• amzn: book popularity data from Amazon. Each key represents the popularity of a particular book.
• face: randomly sampled Facebook user IDs [30]. Each key uniquely identifies a user.
• osm: cell IDs from Open Street Map. Each key represents an embedded location.
• wiki: timestamps of edits from Wikipedia. Each key represents the time an edit was committed.

The CDFs of each of these datasets are plotted in Figure 6. The zoom window on each plot shows 100 keys. While the "zoomed out" plots appear smooth, each CDF function is much more complex, containing both structure and noise.

Figure 6: CDF plots of each testing dataset. The face dataset contains ≈100 large outlier keys, not plotted.

For each dataset, we generate 10M random lookup keys. Indexes are required to return search bounds that contain the lower bound of each lookup key (see Section 2).

Why not test synthetic datasets?
Synthetic datasets are often used to benchmark index structures, learned or otherwise [13, 19, 20]. However, synthetic datasets are problematic for evaluating learned index structures. Synthetic datasets are either (1) entirely random, in which case there is no possibility of learning an effective model of the underlying data (although a model may be able to overfit to the noise), or (2) drawn from a known distribution, in which case learning the distribution is trivial. Here, we focus only on datasets drawn from real-world distributions, which we believe are the most important. For readers specifically interested in synthetic datasets, we refer to [17].
4.2 Pareto analysis

A primary concern of index structures is lookup performance: given a query, how quickly can the correct record be fetched? However, size is also important: with no limits, one could simply store a lookup table and retrieve the correct record with only a single cache miss. Such a lookup table would be prohibitively large in many cases, such as 64-bit keys. Thus, we consider the performance/size tradeoff provided by each index structure, plotted in Figure 7. For each index structure, we selected ten configurations ranging from minimum to maximum size.

While different applications may weigh performance and size differently, all applications almost surely desire a Pareto optimal index: an index for which no alternative has both a smaller size and improved performance. For the amzn and wiki datasets, learned structures are Pareto optimal up to a size of 100MB, at which point the RBS lookup table becomes effective. For face, learned structures are Pareto optimal throughout.
Poor performance on osm. Both traditional and learned index structures fail to outperform RBS on the osm dataset for nearly any size. The poor performance of learned index structures can be attributed to the osm dataset's lack of local structure: even small pieces of the CDF exhibit difficult-to-model, erratic behavior. This is an artifact of the technique used to project the Earth into one-dimensional space (a Hilbert curve). In Section 4.3, we confirm this intuition by analyzing the errors of the learned models; all three learned structures required significantly more storage to achieve errors comparable to those observed on the other datasets. Simply put, learned structures perform poorly on osm because osm is difficult to learn. Because osm is a one-dimensional projection of multi-dimensional data, a multi-dimensional learned index [24] may yield improvements.
Performance of PGM.
In [13], the authors showed that "the PGM-index dominates RMI," contradicting our previous experience that the time spent on searches between the layers of the index outweighed the benefits of having a lower error. Indeed, in our experimental evaluation we found that the PGM index performs significantly worse than RMI on 3 out of the 4 datasets and slightly worse on osm. After contacting the authors of [13], we found that their RMI implementation was missing several key optimizations: their RMI only used linear models rather than tuning different types of models as proposed in [19, 22], and omitted some optimizations for RMIs with only linear models. This highlights how implementation details can affect experimental results, and the importance of having a common benchmark with strong implementations. We stress that our results are the first to compare RMI and PGM implementations tuned by their respective authors. (We shared our RMI implementation with Ferragina and Vinciguerra before the publication of [13], but since [13] was already undergoing revision, they elected to continue with their own RMI implementation instead, without note. All PGM results in this paper are based on Ferragina and Vinciguerra's tuned PGM code as of May 18th, 2020.)
Performance of RBS.
RBS exhibits substantially degraded performance on face compared to other datasets. This is due to the small number (≈100) of large outlier keys in the face dataset: most keys fall within a narrow lower portion of the 64-bit key space, but the outliers fall near its top. As a result, nearly all keys share the same few most-significant-bit prefixes, so the radix table does little to narrow the search range. A radix table with b bits provides equally accurate bounds as a binary search tree with b levels, but requires only a single cache miss. When the keys are heavily skewed (as is the case with face), the radix table is nearly useless.

Tree structures are non-monotonic.
All tree structures tested (ART, BTree, IBTree, and FAST) become less effective after a certain size. For example, the largest ART index for the amzn data occupies nearly 1GB of space, but has worse lookup performance than an ART index occupying only 100MB of space. This is because, at a certain point, performing a binary search on a small, densely-packed array becomes more efficient than traversing a tree. As a result, tree structures show non-monotonic behavior in Figure 7.
Indexes slower than binary search?
At extremely small or large sizes, some index structures perform worse than binary search. In both cases, this is because some index structures are unable to provide sufficiently small search bounds to make up for the inference time required. For example, on the osm dataset, very small RMIs barely narrow down the search range at all. Because these small RMIs fit the data so poorly (analyzed later, Figure 12), the time required to execute the RMI model and produce the search bound is comparatively worse than executing a binary search on the entire dataset.
Figure 7: Performance and size tradeoffs provided by several index structures for four different datasets. The black horizontal line represents the performance of binary search (which has a size of zero). Extended plots with all techniques are available here: https://rm.cab/lis1
Figure 8: Performance of index structures built for strings (stars) on our integer datasets.
Method      Time       Size
PGM         326.48 ns  14.0 MB
RS          266.58 ns  4.0 MB
RMI         180.90 ns  48.0 MB
BTree       482.11 ns  166.0 MB
IBTree      446.55 ns  9.0 MB
FAST        435.33 ns  102.0 MB
BS          741.69 ns  0.0 MB
CuckooMap   114.50 ns  1541.0 MB
RobinHash   93.69 ns   6144.0 MB
Table 2: The fastest variant of each index structure compared against two hashing techniques on the amzn dataset.
Structures for strings.
Many recent works on index structures have focused on indexing keys of arbitrary length (e.g., strings) [31, 33]. For completeness, we evaluated two structures designed for string keys, FST and Wormhole, in Figure 8. Unsurprisingly, neither performed as well as binary search. These string indexes contain optimizations that assume that comparing two keys is expensive. These optimizations translate to overhead when considering only integer keys, which can be compared in a single instruction. ART, an index designed for both string and integer data, handles both by indexing one key-byte per radix tree level.
Hashing.
Hashing provides O(1)-time point lookups. However, hashing differs from both traditional and learned indexes in a number of ways: first, hashing generally does not support lower bound lookups. (Wormhole, evaluated in Figure 8, is a hash-based technique that provides ordering, but is primarily optimized for strings.) Second, hash tables generally have a large footprint, as they store every key. We evaluate two hashing techniques, a Cuckoo hash table [6] and a Robinhood hash table [3]. We found that load factors of 0.99 and 0.25 (respectively) maximized lookup performance.

Table 2 lists the size and lookup performance of the best-performing (and thus often largest) variant of each index structure and both hashing techniques for a 32-bit version of the amzn dataset (results are similar for the other datasets; the SIMD Cuckoo implementation only supports 32-bit keys). Unsurprisingly, both hashing techniques offer superior point-lookup latency compared to traditional and learned index structures. This decreased latency comes at the cost of a larger in-memory footprint. For example, CuckooMap provides a 114ns lookup time compared to the 180ns provided by the RMI, but CuckooMap uses over 1GB of memory, whereas the RMI uses only 48MB. When range lookups and memory footprint are not concerns, hashing is a clear choice.

4.2.1 Dataset size

Figure 9 shows the performance/size tradeoff for each learned structure and a BTree for four different data sizes of the amzn dataset, ranging from 200M to 800M keys. All three learned structures are capable of scaling to larger dataset sizes with only a logarithmic slowdown (as is expected from the final binary search step). For example, consider an RMI that produces an average search bound that spans 128 keys. Such a bound requires 7 steps of binary search. If the dataset size doubles, an RMI of equal size is likely to return bounds that are twice as large: one could expect an RMI of equal size to produce search bounds that span 256 keys. Such a bound requires only 8 total (1 additional) binary search steps. Thus, learned index structures scale to larger datasets in much the same way as BTrees. If larger datasets have more pronounced and modelable patterns, learned index structures may provide better scaling.
Figure 9: Performance/size tradeoffs for datasets of various sizes (200M, 400M, 600M, and 800M keys) for the amzn dataset. The face and wiki datasets were not sufficiently large to compare. Extended plots with all techniques and the osm dataset are available here: https://rm.cab/lis2

4.2.2 32-bit keys

Other sections evaluate 64-bit datasets. Here, we scale down the amzn dataset from 64 to 32 bits, and compare the performance of the three learned index structures, BTrees, and FAST. The results are plotted in Figure 10.
Figure 10: Performance/size tradeoff for 32- and 64-bit keys. While decreasing the key size to 32 bits has a minimal impact on learned structures, the ability to pack more values into a single cache line improves the performance of tree structures.

For learned structures, the performance on 32-bit data is nearly identical to performance on 64-bit data. Our implementations of RS and RMI both transform query keys to 64-bit floats, so this is not surprising. We attempted to perform computations on 32-bit keys using 32-bit floats, but found that the decreased precision caused floating point errors. The PGM implementation uses 32-bit computations for 32-bit inputs, achieving some modest performance gains.

For both tree structures, the switch from 64-bit to 32-bit keys allows twice as many keys to fit into a single cache line, improving performance. For FAST, which makes heavy use of AVX-512 streaming operations, doubling the number of keys per cache line essentially doubles computational throughput as well, as each operator can work on 16 32-bit values simultaneously (as opposed to 8 64-bit values).
4.2.3 Search techniques

Normally, we use binary search to locate the correct key within the search bound provided by the index structure. However, other search techniques can be used. Figure 11 evaluates binary, linear, and interpolation search for each learned structure and the RBS baseline on osm and amzn.

We observed that binary search (first column) was always faster than linear search (second column). This aligns with prior work that showed binary search being effective until the data size dropped below a very small threshold [29].

Figure 11: A comparison of "last mile" (Section 2) search techniques for the osm and amzn datasets.

Interpolation search (third column) behaves similarly to binary search on the amzn dataset, even offering improved performance on average. If the data were globally close to linear, one would expect a learned index to learn this distribution, subsuming any gains from interpolation search. However, because the learned structures have a limited size, there can be many segments of the underlying data that exhibit linear behavior that the learned structure does not have the capacity to learn. For the osm dataset, which is relatively complex, interpolation search does not provide a benefit, and is often slower than binary search. This is unsurprising, since interpolation search works best on smooth datasets. (A simplified sketch of an interpolation-based last-mile search follows below.)

One could also integrate more complex interpolation search techniques, such as SIP [30]. One difficulty with incorporating SIP is the precomputation steps, which vary depending on the search bound used. Integrating an exponential search [9] technique could also be of interest, although it is not immediately clear how to integrate a search bound. We leave such investigations to future work.
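For reference, the following is a simplified sketch (ours) of an interpolation-based last-mile search restricted to a search bound; production implementations typically add fallbacks to binary search when convergence is slow.

#include <cstdint>
#include <vector>

// Interpolation search for the lower bound of `key`, restricted to the
// search bound [lo, hi). Each probe estimates the key's position assuming
// the keys in the current range are uniformly distributed.
size_t interpolation_lower_bound(const std::vector<uint64_t>& data,
                                 uint64_t key, size_t lo, size_t hi) {
  while (lo < hi) {
    uint64_t klo = data[lo], khi = data[hi - 1];
    if (key <= klo) return lo;   // everything from lo onward is >= key
    if (key > khi) return hi;    // everything in [lo, hi) is < key
    // Interpolated probe, clamped so the range always shrinks.
    double frac = static_cast<double>(key - klo) /
                  static_cast<double>(khi - klo);
    size_t mid = lo + static_cast<size_t>(frac *
                 static_cast<double>(hi - 1 - lo));
    if (mid <= lo) mid = lo + 1;
    if (data[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}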
4.3 Explanatory analysis

In this section, we investigate why learned index structures have such strong performance and size properties. While prior work [19] attributed this to decreased branching and instruction counts, we discovered that the whole story is more complex. None of model accuracy, model size (or "precision gain", the combination of the two in [19]), cache misses, instruction count, or branch misses can fully account for learned index structures' performance.

Figure 12 shows the correlation between lookup time and various performance characteristics of learned index structures, BTrees, and ART for the amzn and osm datasets. The first column shows the total in-memory size of each model, the second column shows the average log search bound size (i.e., the expected number of binary search steps required), the third column shows last-level cache misses, the fourth column shows branch mispredictions, and the fifth column shows instruction counts. One can visually dismiss any single metric as explanatory: any vertical line corresponds to structures that are equal on the given metric, but exhibit different lookup times. For example, at a size of 1MB, RMIs achieve a latency of 220ns on amzn, but a BTree with the same size achieves a latency of 650ns (blue vertical line).

The second column ("log error") is especially interesting. Learned indexes must balance inference time with model error [22]. For example, with a log error of 7, an RMI achieves a lookup time of 250ns on the amzn dataset, but the PGM index with the same log error achieves a latency of 480ns (red vertical line). In other words, even though the average size of the search bound generated by both structures was the same, the RMI still achieved faster lookup times. This is attributable to the higher inference time of the PGM index. Of course, other factors, such as overall model size, must be taken into account as well.

Analysis.
In order to statistically test each potential explanatory factor, we performed a linear regression analysis using every index structure on all four datasets at 200 million 64-bit keys. The results indicated that cache misses, branch misses, and instruction count had a statistically significant effect on lookup time, whereas size and log error did not. In other words, given the branch misses, cache misses, and instruction counts, size and log error do not significantly affect performance. This does not mean that log error and size do not have an impact on cache misses; just that the relevant variation in lookup time explained by model size and log error is accounted for fully by the other measures.

Overall, a regression on cache misses, branch misses, and instruction count explained 95% of the variance (R² = 0.95); the standardized regression coefficients for branch misses and instruction count were −0.28 and 0.50, respectively, with cache misses carrying the largest coefficient. Standardized regression coefficients can be interpreted as the number of standard deviations that a particular measure needs to increase by, assuming the other measures stay fixed, in order to increase the output by one standard deviation; in other words, these coefficients are descriptive of the variations within our measurements, not of the actual hardware impact of the metrics (although these are obviously related).
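For clarity, the textbook definition of a standardized coefficient can be written out (this is standard statistics, not something specific to this study):

\[
  \hat{\beta}^{\mathrm{std}}_j \;=\; \hat{\beta}_j \cdot \frac{\sigma_{x_j}}{\sigma_{y}}
\]

where β̂ⱼ is the raw regression coefficient, σ_{xⱼ} is the standard deviation of measure j (e.g., cache misses), and σ_y is the standard deviation of the lookup time.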
Interpretation: branch misses.
While the magnitudes of standardized regression coefficients are not useful on their own, their sign can provide interesting insights. Surprisingly, the coefficient on branch misses is negative. This does not mean that an increased number of branch misses leads to increased model performance. Instead, the negative coefficient means that, for a fixed number of cache misses and instructions, the tested indexes that incurred more branch misses performed better. In other words, indexes are getting significant value from branch misses; when an index incurs a branch miss, it does so in such a way that reduces lookup time more than a hypothetical alternative index that uses the same number of instructions and cache misses.

We offer two possible explanations for this surprising observation. First, structures may be over-optimized to avoid branching, trading additional cache misses or instructions to reduce branching. Second, indexes that experience more branch misses may benefit from speculative loads on modern hardware. We leave further investigation to future work.
Interpretation: what metrics matter?
If there is a single metric that explains the performance of learned index structures, we were unable to find it. None of model size, log error, cache misses, branch misses, and instruction count alone is enough to determine whether one index structure will be faster than another. Linear regression analysis suggests that cache misses, branch misses, and instruction counts are all significant, and account for model size and log error.

Of the significant measures, cache misses had the largest explanatory power. This is consistent with indexes being latency-bound (i.e., limited by the round-trip time to RAM). The vast majority of cache misses for RMIs happen during the last-mile search. Two-layer RMIs require at most two cache misses for inference (potentially only one if the RMI's top layer is small enough). On the other hand, for a full BTree, no cache misses happen during the final search at all, but BTrees generally require at least one cache miss per level of the tree. Cache misses also help explain performance differences between RMI and PGM: since each additional PGM layer likely requires a cache miss at inference time, a large RMI with low log error will incur fewer cache misses than a large PGM index with a similar log error (e.g., amzn in Figure 12). When an RMI is not able to achieve a low log error, this advantage vanishes, as more cache misses are required during the last-mile search (e.g., osm in Figure 12).

Current implementations of learned index structures seem to prioritize fast inference time over log error. This makes sense, since a linear increase in log error only leads to a logarithmic increase in lookup time (due to binary search). However, our analysis suggests that a learned index structure could use significantly more cache misses if it could accurately pinpoint the cache line containing the lookup key. We experimented with multi-stage RMIs (> 10 levels), but were unable to achieve such accuracy. This could be an interesting direction for future work.

We encourage future development of index structures to take into account cache misses, branch misses, and instruction counts. Since all three of these metrics have a statistically significant impact on performance, ignoring one or two of them in favor of the others may lead to poor results. While we cannot suggest a single metric for evaluating index structures, if one must select a single metric, our analysis suggests that cache misses are the most significant.

Figure 12: Various metrics compared with lookup times across index structures and datasets. No single metric can fully explain the performance of different index structures, suggesting a multi-metric analysis is required. Extended plots for all techniques and datasets are available here: https://rm.cab/lis5

Figure 13: Size and log error bound of various index structures. When evaluated as a compression technique, learned index structures can be evaluated purely based on their size and log error. Extended plots are available here: https://rm.cab/lis7

Learned indexes as compression. A common view of learned index structures is to think of learned indexes as a lossy compression of the CDF function [13, 19]. In this view, the goal of a learned index is similar to lossy image compression (like JPG): come up with a representation that is smaller than the CDF with minimal information loss. The quality of a learned index can thus be judged by just two metrics: the size of the structure, and the log error (information loss). Figure 13 plots these two metrics for the three learned index structures and BTrees. These plots indicate that the information theoretic view, while useful, is not fully predictive of index performance. For example, for face, all three structures have very similar sizes and log errors after 1MB. However, some structures are substantially faster than others at a fixed size (Figure 7).

We encourage researchers and practitioners to familiarize themselves with the information theoretic view of learned index structures, but we caution against ending analysis at this stage. For example, an index structure that achieves optimal compression (i.e., an optimal size to log error ratio) is not necessarily going to outperform an index with suboptimal compression. The simplest way this could occur is because of inference time: if the index structure with superior compression takes a long time to produce a search bound, an index structure that quickly generates less accurate search bounds may be superior. However, if one assumes that storage mediums are arbitrarily slow (i.e., search time is strictly dominated by the size of the search bound), then there is merit in viewing learned index structures as a pure compression problem, and investigating more advanced compression techniques for these structures [13] could be fruitful.

4.4 CPU interactions

Many prior works on both learned and non-learned index structures (including those by authors of this work) have evaluated their index structures by repeatedly performing lookups in a tight loop. While convenient and applicable to many applications, this experimental setup may exaggerate the performance of some index structures due, in part, to caching and operator reordering.

Executing index lookups in a tight loop, as is often done to evaluate an index structure, will cause nearly all of the CPU cache to be filled with the index structure and underlying data. Since accessing a cached value is significantly faster (10s of nanoseconds) than accessing an uncached value (≈100 nanoseconds), such tight-loop experiments may exaggerate the performance of an index structure.
The amount of data that will remain cached from one index lookup to another is clearly application dependent. In Figure 14, we investigate the effects of caching by evaluating the two possible extremes: the datapoints labeled "warm" correspond to a tight loop in which large portions of the index structure and underlying data can be cached between lookups. The datapoints labeled "cold" correspond to the same workload, but with the cache additionally fully flushed after each lookup (a sketch of such a flushing procedure follows below). The gain from a warm cache for all five index structures ranges from 2x to 2.5x. With small index sizes, even cold-cache learned index structures outperform the warm-cache BTree. With larger (and arguably more realistic) index structure sizes, whether the cache is warm or cold is obviously more important than the choice of index structure. Regardless of whether the cache is warm or cold, we found that learned approaches exhibited dominant performance/size tradeoffs.

Figure 14: The performance impact of having a cold cache for various index structures. Extended plots with all techniques are available here: https://rm.cab/lis3
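As an illustration of the "cold" configuration, the following sketch evicts every cache line covering a buffer between lookups. It assumes an x86 CPU (SSE2), and flush_range is a hypothetical helper, not the benchmark's actual harness code; a real harness would also flush the index's own allocations.

#include <cstddef>
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (x86, SSE2)

// Flush every cache line covering [p, p + bytes) from all cache levels.
void flush_range(const void* p, size_t bytes) {
  const char* c = static_cast<const char*>(p);
  for (size_t i = 0; i < bytes; i += 64)  // 64-byte cache lines
    _mm_clflush(c + i);
  _mm_mfence();  // ensure the flushes complete before timing resumes
}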
Modern CPUs and compilers may reorder instructions to overlap computation and memory access or otherwise improve pipelining. For example, consider a simple program that loads x, does a computation f(x), loads y, and then does a computation g(y). Assuming the load of y does not depend on x, the load of y may be reordered to occur before the computation of f(x), so that the latency from loading y can be hidden within the computation of f(x). When considering index structures, lookups placed in a tight loop may cause the CPU or compiler to overlap the final computation of one query with the initial memory read of the next query. In some applications, this may be realistic and desirable; in other applications, expensive computations between index lookups may prevent such overlapping. Thus, some indexes may disproportionately benefit from this reordering.

To test the impact of reordering on lookup time, we inserted a memory fence instruction into our experimental loop (a sketch follows below). This prevents the CPU or compiler from reordering operations across the fence. Figure 15 shows that RMI and RS, two of the most competitive index structures, have the largest drop in performance when a memory fence is introduced (approximately a 50% slowdown). The BTree, FAST, and PGM are almost entirely unaffected. While the inclusion of a memory fence harms the performance of RMI and RS, learned structures still provide a better performance/size tradeoff for the amzn dataset (results for other datasets are similar, but omitted due to space constraints).

The impact of a memory fence was highly correlated with the number of instructions used by an index structure (Figure 12): indexes using fewer instructions, like RMI and RS, were impacted to a greater extent than structures using more instructions, like BTrees. Since reordering optimizations often examine only a small window of instructions (i.e., "peephole optimizations" [23]), reordering optimizations may be more effective when instruction counts are lower. This may explain why RMI and RS are impacted more by a memory fence.

We recommend that future researchers test their index structures with memory fences to determine how much benefit their structure gets from reordering. Getting a lot of benefit from reordering is not necessarily a bad thing; plenty of applications require performing index lookups in a tight loop, with only minimal computation being performed on each result. Ideally, researchers should evaluate their index structures within a specific application, although this is much more difficult.
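The fenced measurement loop can be sketched as follows. This is an illustration of the methodology, not the benchmark's exact harness: index.lookup stands in for any evaluated structure, and the sequentially consistent fence compiles to a full memory barrier (e.g., mfence on x86).

#include <atomic>
#include <cstdint>
#include <vector>

// A full memory fence after each lookup prevents the CPU and compiler from
// overlapping the tail of one lookup with the start of the next.
template <typename Index>
uint64_t fenced_loop(const Index& index, const std::vector<uint64_t>& keys) {
  uint64_t checksum = 0;
  for (uint64_t k : keys) {
    checksum += index.lookup(k);
    std::atomic_thread_fence(std::memory_order_seq_cst);
  }
  return checksum;  // consume results so lookups are not optimized away
}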
4.5 Multithreading

Here, we evaluate how various index structures scale when queried by concurrent threads. Our test CPU has 20 physical cores, capable of executing 40 simultaneous threads with hyperthreading. Since multithreading strictly increases latency, here we evaluate throughput (lookups per second).

Varying thread count.
We first vary the number of threads, fixing the model size at 50MB (except for RobinHash, which is still full size). The results are plotted in Figure 16a, with and without a memory fence. Overall, all three learned index variants scale with an increasing number of threads, although only the RMI achieved higher throughput than the RBS lookup table in this experiment. (The SIMD Cuckoo implementation [6] only supports 32-bit keys, and was not included in this experiment.)

RobinHash, the technique with the lowest latency with a single thread, does not achieve the highest throughput in a concurrent environment. Even the RBS lookup table achieves higher throughput than RobinHash, regardless of whether or not a memory fence was used. We do not consider hash tables optimized for concurrent environments [28]; here we only demonstrate that an off-the-shelf hash table with a load factor optimized for single-threaded lookups does not scale seamlessly.

To help explain why certain indexes scaled better than others, we measured the number of cache misses incurred per second by each structure, plotted in Figure 16c. If an index structure incurs more cache misses per second, then the benefits of multithreading will be diminished, since threads will be latency bound waiting for access to RAM. Indeed, RobinHash incurs a much larger number of cache misses per second than any other technique. The larger size of the hash table may contribute to this, as fewer cache lines may be shared between lookups compared with a smaller index.

PGM and FAST have the fewest cache misses per second at 40 threads, suggesting that PGM and FAST may benefit the most from multithreading. To investigate this, we tabulated the relative speedup factor of each technique. Due to space constraints, the plot is available online: https://rm.cab/lis8. FAST has the highest relative speedup, achieving 32x throughput with 40 threads. In addition to having few cache misses per second, FAST also takes advantage of streaming AVX-512 instructions, which allows for effective overlap of computation with memory reads. PGM, despite having the fewest cache misses per second, achieved only a 27x speedup at 40 threads. On the other hand, RobinHash had by far the most cache misses per second and the lowest relative speedup at 40 threads (20x). Thus, cache misses per second correlate with, but do not always determine, the speedup factor of an index structure.
Figure 15: Performance of various index structures with and without a memory fence on the amzn dataset (lookup time in ns vs. index size in MB; one panel per index: RMI, RS, PGM, BTree, and FAST). Without the fence, the CPU may reorder instructions and overlap computation between lookups. With the fence, each lookup must be completed before the next lookup begins. Extended plots with all techniques and datasets are available here: https://rm.cab/lis4
Figure 16: Multithreading results. (a) Multithreaded throughput for the amzn dataset; models have a fixed size of 50MB, without (left) and with (right) a memory fence. (b) Model size vs. 40-thread throughput for the amzn dataset; an extended plot with all index techniques is available here: https://rm.cab/lis6 (c) Cache misses per lookup per second for various data structures; more cache misses per second indicates that the speedup from multithreading may be negatively impacted.

In addition to having few cache misses per second, FAST also takes advantage of streaming AVX-512 instructions, which allow for effective overlap of computation with memory reads. PGM, despite having the fewest cache misses per second, achieved only a 27x speedup at 40 threads. On the other hand, RobinHash had by far the most cache misses per second and the lowest relative speedup at 40 threads (20x). Thus, cache misses per second correlate with, but do not always determine, the speedup factor of an index structure.
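For concreteness, the following is a minimal sketch of how such a multithreaded throughput measurement can be structured; it is an assumption-laden illustration, not our benchmark's actual harness. The index's lookup method is a placeholder, and the strided partitioning of keys is one arbitrary choice among many.

    // Minimal sketch: each worker performs read-only lookups over a strided
    // share of the keys; throughput is total lookups over wall-clock time.
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    template <typename Index>
    double lookups_per_second(const Index& index,
                              const std::vector<uint64_t>& keys,
                              unsigned n_threads) {
      std::vector<std::thread> workers;
      std::vector<uint64_t> sinks(n_threads, 0);  // consume results
      auto start = std::chrono::steady_clock::now();
      for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
          // Read-only lookups on an immutable index need no synchronization.
          for (std::size_t i = t; i < keys.size(); i += n_threads)
            sinks[t] += index.lookup(keys[i]);
        });
      }
      for (auto& w : workers) w.join();
      std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
      return keys.size() / elapsed.count();
    }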
Varying index size.
Next, we fix the number of threads at 40 and vary the size of the index. Results are plotted in Figure 16b. One might expect smaller structures to have better throughput because of caching effects; we did not find this to be the case. In general, larger indexes had higher throughput than smaller ones. One possible explanation of this behavior is that smaller models, while more likely to remain cached, produce larger search bounds, which cause more cache misses during the last-mile search (sketched below).

The PGM, BTree, RS, and ART indexes suffered decreased throughput at large model sizes. This suggests that, for these structures, the refinement in the search bound is not enough to make up for the cache misses incurred by the larger model. The RMI did not suffer such a regression, possibly because each RMI inference requires at most two cache misses (one for each model level), whereas for other indexes the number of cache misses per inference can be higher.
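The following is a minimal sketch of the last-mile search, assuming a hypothetical model that outputs a predicted position and a maximum error bound. The wider the error window, the more cache lines the binary search may touch, which is the effect described above.

    // Minimal sketch of the "last mile": binary search within the window
    // [pos - err, pos + err] predicted by a (hypothetical) learned model.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::size_t last_mile_search(const std::vector<uint64_t>& data,
                                 uint64_t key, std::size_t predicted_pos,
                                 std::size_t max_error) {
      std::size_t lo =
          predicted_pos > max_error ? predicted_pos - max_error : 0;
      std::size_t hi = std::min(predicted_pos + max_error + 1, data.size());
      // Standard binary search, restricted to the model's error window; a
      // coarser model means a wider window and potentially more misses.
      auto it = std::lower_bound(data.begin() + lo, data.begin() + hi, key);
      return static_cast<std::size_t>(it - data.begin());
    }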
Build times.
Figure 17 shows the single-threaded build time required for the fastest (in terms of lookup time) variants of each index structure on amzn at different dataset sizes. We do not include the time required to tune each structure (automatically via CDFShop [22] for RMIs, manually for other structures). We note that automatically tuning an RMI may take several minutes. Unsurprisingly, BTrees, FST, and Wormhole provide the fastest build times, as these structures were designed to support fast updates. Of the non-learned indexes, FAST and RobinHash have the longest build times. Maximizing the performance of Robin Hood hashing requires using a high load factor (to keep the structure compact), which induces a high number of swaps, as sketched below. We note that many variants of Robin Hood hashing support parallel operations, and thus lower build times. (In particular, Wormhole and PGM can handle parallel inserts and builds, respectively, which we do not evaluate here.) For the largest dataset, the build times for the fastest variants of RMI, PGM, and RS were 80 seconds, 38 seconds, and …, respectively.

Figure 17: Build times for the fastest (in terms of query time) variant of each index type (PGM, RS, RMI, RBS, ART, BTree, IBTree, FAST, FST, Wormhole, RobinHash) for the amzn dataset at four different data sizes. Note the log scale.
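As an illustration of the swap behavior, here is a simplified, hypothetical Robin Hood insertion (no resizing or deletion; assumes at least one free slot). At load factors near 1, the probe-and-swap chains it performs grow long, which is what drives up build time.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Slot {
      uint64_t key;
      uint32_t dist;  // distance from the key's "home" slot
      bool used;
    };

    // Insert `key` into an open-addressing table using Robin Hood probing.
    void robin_hood_insert(std::vector<Slot>& table, uint64_t key) {
      uint32_t dist = 0;
      std::size_t i = key % table.size();  // a real table would hash the key
      while (true) {
        if (!table[i].used) {
          table[i].key = key;
          table[i].dist = dist;
          table[i].used = true;
          return;
        }
        if (table[i].dist < dist) {
          // "Rob from the rich": displace the resident entry, which sits
          // closer to its home slot, and continue inserting the displaced key.
          std::swap(key, table[i].key);
          std::swap(dist, table[i].dist);
        }
        i = (i + 1) % table.size();
        ++dist;  // at high load factors these probe-and-swap chains grow long
      }
    }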
5. CONCLUSION AND FUTURE WORK
In this work, we present an open source benchmark that includes several state-of-the-art tuned implementations of learned and traditional index structures, as well as several real-world datasets. Our experiments on read-only in-memory workloads searching over dense arrays showed that learned structures provided Pareto-dominant performance/size behavior. This dominance, while sometimes diminished, persists even when varying dataset sizes, key sizes, memory fences, cold caches, and multi-threading. We demonstrate that the performance of learned index structures is not attributable to any single metric, although cache misses played the largest explanatory role. In our experiments, learned structures generally had higher build times than insert-optimized traditional structures like BTrees. Amongst learned structures, we found that RMIs provided the strongest performance/size tradeoff but the longest build times, whereas both RS and PGM indexes could be constructed faster but had slightly slower lookup times.

In the future, we plan to examine the end-to-end impact of learned index structures on real applications. Opportunities to combine a simple radix table with an RMI structure (or vice versa) are also worth investigating. As more learned index structures begin to support updates [11, 13, 14], a benchmark against traditional indexes (which are often optimized for updates) could be fruitful.
Acknowledgments
This research is supported by Google, Intel, and Microsoft as part of the MIT Data Systems and AI Lab (DSAIL) at MIT, NSF IIS 1900933, DARPA Award 16-43-D3M-FP040, and the MIT Air Force Artificial Intelligence Innovation Accelerator (AIIA).
6. REFERENCES
[8] … Proceedings of the VLDB Endowment, 4(8):470–481, May 2011.
[9] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3):82–87, Aug. 1976.
[10] R. Binna, E. Zangerle, M. Pichl, G. Specht, and V. Leis. HOT: A height optimized trie index for main-memory database systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 521–534, New York, NY, USA, 2018. Association for Computing Machinery.
[11] J. Ding, U. F. Minhas, H. Zhang, Y. Li, C. Wang, B. Chandramouli, J. Gehrke, D. Kossmann, and D. Lomet. ALEX: An updatable adaptive learned index. arXiv:1905.08898 [cs], May 2019.
[12] P. Ferragina and G. Vinciguerra. Learned data structures. In Recent Trends in Learning From Data, volume 896 of Studies in Computational Intelligence. Springer, 2020.
[13] P. Ferragina and G. Vinciguerra. The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds. Proceedings of the VLDB Endowment, 13(8):1162–1175, Apr. 2020.
[14] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska. FITing-Tree: A data-aware index structure. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 1189–1206, New York, NY, USA, 2019. ACM.
[15] G. Graefe. B-tree indexes, interpolation search, and skew. In Proceedings of the 2nd International Workshop on Data Management on New Hardware, DaMoN '06, Chicago, Illinois, June 2006. Association for Computing Machinery.
[16] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, 2010.
[17] A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. SOSD: A benchmark for learned indexes. In ML for Systems at NeurIPS, MLForSystems @ NeurIPS '19, Dec. 2019.
[18] A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. RadixSpline: A single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM @ SIGMOD '20, pages 1–5, Portland, Oregon, June 2020. Association for Computing Machinery.
[19] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 489–504, New York, NY, USA, 2018. ACM.
[20] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. In Proceedings of the 2013 IEEE International Conference on Data Engineering, ICDE '13, pages 38–49, USA, 2013. IEEE Computer Society.
[21] C. Luo and M. J. Carey. LSM-based storage techniques: A survey. The VLDB Journal, 29(1):393–418, Jan. 2020.
[22] R. Marcus, E. Zhang, and T. Kraska. CDFShop: Exploring and optimizing learned index structures. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, Portland, OR, June 2020.
[23] W. M. McKeeman. Peephole optimization. Communications of the ACM, 8(7):443–444, July 1965.
[24] V. Nathan, J. Ding, M. Alizadeh, and T. Kraska. Learning multi-dimensional indexing. In ML for Systems at NeurIPS, MLForSystems @ NeurIPS '19, Dec. 2019.
[25] T. Neumann and S. Michel. Smooth interpolating histograms with error guarantees. In Sharing Data, Information and Knowledge, 25th British National Conference on Databases, BNCOD '08, pages 126–138, 2008.
[26] P. Bailis, K. S. Tai, P. Thaker, and M. Zaharia. Don't throw out your algorithms book just yet: Classical data structures that can outperform learned indexes (blog post). https://dawn.cs.stanford.edu/2018/01/11/index-baselines/, 2018.
[27] P. Boncz and T. Neumann. The case for B-tree index structures (blog post). http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html, 2017.
[28] S. Richter, V. Alvarez, and J. Dittrich. A seven-dimensional analysis of hashing methods and its implications on query processing. Proceedings of the VLDB Endowment, 9(3):96–107, Nov. 2015.
[29] L.-C. Schulz, D. Broneske, and G. Saake. An eight-dimensional systematic evaluation of optimized search algorithms on modern processors. Proceedings of the VLDB Endowment, 11(11):1550–1562, July 2018.
[30] P. Van Sandt, Y. Chronis, and J. M. Patel. Efficiently searching in-memory sorted arrays: Revenge of the interpolation search? In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 36–53, New York, NY, USA, 2019. ACM.
[31] X. Wu, F. Ni, and S. Jiang. Wormhole: A fast ordered index for in-memory data management. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, pages 1–16, Dresden, Germany, Mar. 2019. Association for Computing Machinery.
[32] Q. Xie, C. Pang, X. Zhou, X. Zhang, and K. Deng. Maximum error-bounded piecewise linear representation for online stream approximation. The VLDB Journal, 23(6):915–937, Dec. 2014.
[33] H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, and A. Pavlo. SuRF: Practical range query filtering with fast succinct tries. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, 2018.