Shift-Table: A Low-latency Learned Index for Range Queries using Model Correction
Ali Hadian
Imperial College London
Thomas Heinis
Imperial College London
ABSTRACT
Indexing large-scale databases in main memory is still challenging today. Learned index structures, in which the core components of classical indexes are replaced with machine learning models, have recently been suggested to significantly improve performance for read-only range queries. However, a recent benchmark study shows that learned indexes only achieve limited performance improvements for real-world data on modern hardware. More specifically, a learned model cannot learn the micro-level details and fluctuations of data distributions, thus resulting in poor accuracy; or it can fit the data distribution at the cost of training a big model whose parameters cannot fit into cache. As a consequence, querying a learned index on real-world data takes a substantial number of memory lookups, thereby degrading performance. In this paper, we adopt a different approach for modeling a data distribution that complements the model fitting approach of learned indexes. We propose
Shift-Table, an algorithmic layer that captures the micro-level data distribution and resolves the local biases of a learned model at the cost of at most one memory lookup. Our suggested model combines the low latency of lookup tables with learned indexes and enables low-latency processing of range queries. Using Shift-Table, we achieve a speedup of 1.5X to 2X on real-world datasets compared to trained and tuned learned indexes.
Trends in new hardware play a significant role in the way we design high-performance systems. A recent technological trend is the divergence of CPU and memory latencies, which encourages decreasing random memory accesses at the cost of doing more compute on cache-resident data [24, 40, 43].

A particularly interesting family of methods exploiting the memory/CPU latency gap are learned index structures. A learned index uses machine learning instead of algorithmic data structures to learn the patterns in the data distribution, and exploits the trained model to carry out the operations supported by an algorithmic index, e.g., determining the location of records on physical storage [7, 12, 17, 23, 24, 28]. If the learned index manages to build a model that is compact enough to fit in processor cache, then the results can ideally be fetched with a single access to main memory, hence outperforming algorithmic structures such as B-trees and hash tables.

In particular, learned index models have shown great potential for range queries, e.g., retrieving all records where the key is in a certain range A < key < B. To enable efficient retrieval of range queries, range indexes keep the records physically sorted. Processing a range query A < key < B is therefore equivalent to finding the first result, i.e., the smallest key in the dataset that is greater than or equal to A (similar to lower_bound(A) in the C++ standard library), and then sequentially scanning the records to retrieve the entire result set.

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
A learned index can be built by fitting a regression model to the cumulative distribution function (CDF) of the key distribution. The learned CDF model can be used to determine the physical location where the lower bound of the query resides, i.e., pos(A) = N × F_θ(A), where N is the number of keys and F_θ is the learned CDF model with model parameters θ.

Learned indexes are very efficient for sequence-like data (e.g., machine-generated IDs), as well as synthetic data sampled from statistical distributions. However, a recent study using the Search-On-Sorted-Data benchmark (SOSD) [21] shows that for real-world data distributions, a learned index has the same or even poorer performance compared to algorithmic indexes. For many real-world data distributions, the CDF is too complex to be learned efficiently by a small cache-resident model. The distribution of real-world data carries "too much information" to be accurately represented by a small machine-learning model, while an accurate model is needed for an accurate prediction. One can of course use smaller models that fit in cache at the cost of lower prediction accuracy, but then a larger set of records must be searched to find the actual result, which increases memory lookups and degrades performance. Alternatively, high accuracy can be achieved by training a bigger model, but accessing the model parameters incurs multiple cache misses and likewise increases memory lookups, reducing the margins for performance improvement.

In this paper, we address the challenge of using learned models on real-world data and illustrate how the micro-level details (e.g., local variance) of a cumulative distribution can dramatically affect the performance of a range index.
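To make the idea concrete, the position model pos(A) = N × F_θ(A) can be sketched in a few lines. The following is a minimal illustration using a single straight line as the "learned CDF"; the function name and toy data are our own, not the paper's RMI implementation:

```python
def train_linear_cdf(keys):
    """Fit a one-line 'CDF model': anchor a straight line on the min/max key,
    so the returned function approximates N * F_theta(q). A sketch only."""
    lo, hi, n = keys[0], keys[-1], len(keys)
    slope = (n - 1) / (hi - lo)
    return lambda q: max(0, min(n - 1, round((q - lo) * slope)))

keys = list(range(0, 2000, 2))   # dense, uniform-like synthetic keys
predict = train_linear_cdf(keys)
print(predict(500))              # key 500 sits exactly at index 250
```

On such smooth, uniform-like data the linear model is essentially exact; the point of this paper is that on real-world CDFs the same model can be off by hundreds of positions.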
We also argue that a pure machine learning approach cannot shoulder the burden of learning the fine-grained details of an empirical data distribution, and demonstrate that not much improvement can be achieved by tuning the complexity or size thresholds of the models. We suggest that by going beyond mere machine learning models, the performance of a learned index architecture can be significantly improved using a complementary enhancement layer rather than over-emphasizing the machine learning task. Our suggested layer, called Shift-Table, is an algorithmic solution that improves the precision of a learned model and effectively accelerates the search performance. Shift-Table targets the micro-level bias of the model and significantly improves the accuracy, at the cost of only one memory lookup. The suggested layer is optional and applied after the prediction; it can hence be switched on or off without re-training the model. Our contributions can be summarized as follows:

• We identify the problem of learning a range index for real-world data, and illustrate the difficulty of learning from this data.
• We suggest the Shift-Table approach for correcting a learned index model, which complements a valid (monotonically increasing) CDF model by correcting its error.
• We show how, and in which circumstances, the suggested methods can be used for best performance.
• We suggest cost models that determine whether the Shift-Table layer can boost performance.
• The experimental results show that our suggested method can improve existing learned index structures and brings stable and almost-constant lookup time for real-world data distributions. Our enhancement layer achieves up to 3X performance improvement over existing learned indexes. More interestingly, we show that for non-skewed distributions, the Shift-Table layer is effective enough to help a dummy linear model outperform the state-of-the-art learned indexes on real-world datasets.
In modern hardware, the lookup times of in-memory range indexes and the binary search algorithm are mainly affected by their memory access pattern, most notably by how the algorithm uses the cache and the last-level cache (LLC) miss rate. Processing a range query in a learned index has two stages: 1)
Prediction: running the learned model to predict the location of the first result for the range query; and 2)
Local search (also known as last-mile search): searching around the predicted location to find the actual location of the first result. Figure 1a shows common search methods for the local search. If the learned model can determine a guaranteed range around the predicted position, one can perform binary search. Otherwise, exponential or linear search must be used, starting from the predicted position.

A cache miss in a learned index can occur in the first stage when accessing the parameters of the model (if the model is too big to fit in cache), or in the second stage during the local search. Key to understanding the cost of a learned index is that local search is done entirely over non-cached blocks of memory. A learned index built over millions of records could predict the location of records with an error of, say, 1000 records and yet achieve no performance gain over binary search or algorithmic indexes. This is because while the learned index fits its model in cache, its algorithmic competitors also fit the frequently-accessed parts of the data in cache, which limits the potential for improvement for a learned index.
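When no error bound is available, the last-mile phase can widen an exponentially growing window around the prediction and then bisect inside the bracketed range. A minimal sketch of this idea (our own helper, not code from the paper):

```python
import bisect

def last_mile_lower_bound(keys, pos, q):
    """Exponential search around a predicted position `pos`, then binary
    search inside the bracketed window (sketch of the last-mile phase)."""
    lo = hi = pos
    step = 1
    while lo > 0 and keys[lo] >= q:              # widen left until keys[lo] < q
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < len(keys) - 1 and keys[hi] < q:   # widen right until keys[hi] >= q
        hi = min(len(keys) - 1, hi + step)
        step *= 2
    return bisect.bisect_left(keys, q, lo, hi + 1)

keys = [3, 7, 19, 21, 42, 56, 70, 85, 93]
print(last_mile_lower_bound(keys, pos=6, q=20))  # finds index 3 despite the bad guess
```

The number of widening steps, and hence the number of cache misses, grows with the prediction error; this is exactly the cost that Figure 2 quantifies.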
Classical algorithms, such as binary search, can be seen as a hierarchy of (non-learned) models, each of which takes the middle point as its parameter and predicts (accurately) which direction the search should follow. Specifically, for the first few steps of binary search, where the middle points usually reside in cache, the functionality of binary search is the same as that of a learned model from a performance point of view. In a pure binary search on the entire data, the first set of memory locations accessed by the algorithm (i.e., the median, quarters, etc.) will already be in the CPU cache after a few lookups. Therefore, the major bottleneck in binary search is in the latter stages of the search, where the middle elements are not in cache, causing last-level cache (LLC) misses. Figure 1b shows a schematic illustration of how caching accelerates binary search.

In basic implementations of binary search, the "hot keys" are cached with their payload and nearby records in the same cache line, which wastes cache space. Binary search thus uses the cache poorly, and there are more efficient algorithmic approaches whose performance is not sensitive to the data distribution. Cache-optimized versions of binary search, e.g., FAST [20], a read-only binary search tree that co-locates the hot keys but still follows the simple bisecting method of binary search, are up to 3X faster than binary search [21]. This is because FAST keeps more hot keys in the cache and hence needs to scan a shorter range of records in the local search phase (the cache-non-resident iterations of the search).
For a tangible discussion and to elaborate on the real cost of a learned model, we provide a micro-benchmark that measures the cost of errors in a learned index. We use the experimental configuration of the SOSD benchmark [21], i.e., searching over 200M records with 32-bit keys and 64-bit payloads. Figure 2a shows the lookup time of the second phase (local search) in a learned model for different prediction errors. We include the lookup times for binary search, as well as FAST [20], over the whole array of 200M keys. We are interested to see, if the position predicted by a learned index, say predicted_pos(x), has an error Δ, how long the local phase takes to find the correct record. Thus, for each query x_i, we pre-compute the "output" of the learned index with error Δ, i.e., [predicted_pos(x_i) ± Δ], and then run the benchmark given the {x_i, [predicted_pos(x_i) ± Δ]} tuples.

As shown in Figure 2a, if the error of the model is more than ~300 records on average, then FAST outperforms the learned model (with linear or exponential local search). Even if the learned model can give a guaranteed range around the predicted point to guide the local search and enable binary search, FAST outperforms it if the error exceeds 1000 records. The same trend can be seen for the LLC miss rates in Figure 2b. Note that this micro-benchmark over-estimates the maximum error that the learned index can afford, because we only compare the time of the local search phase of a learned index with the total search time of FAST and binary search. Considering the time taken to execute the model for predicting the location, a learned model needs to have a much lower error to compete with generic, reliable, and distribution-independent algorithms such as binary search and FAST. For example, FAST takes 200 nanoseconds to search a key in the entire 200M-key dataset. If a learned index takes, say, 120 nanoseconds to run (for accessing model parameters and computing the prediction), then the local search can take at most 80 nanoseconds for the learned index to outperform FAST, which means that the prediction error (Δ) must be less than 16 records (based on Figure 2a).

Tuning the learned index for a balance of model size and accuracy is a challenging task. Improving the local search time requires a more accurate model with a higher learning capacity and more parameters. However, accessing such a big model typically incurs further cache misses during model execution, increasing the lookup time. Therefore, if the data distribution cannot be learned efficiently with a small memory footprint (fitting into cache), outperforming cache-efficient algorithmic indexes is very challenging. This is indeed the case for most real-world datasets, which cannot be modelled accurately with a small-sized model.
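The latency-budget argument above can be written down as a one-line cost model. The 5 ns-per-record slope below is our own rough reading of a Figure-2a-style curve, not a measured constant:

```python
def max_tolerable_error(competitor_ns, model_exec_ns, ns_per_record=5.0):
    """Largest prediction error (in records) a learned index can afford and
    still beat a baseline index. The per-record slope is an assumption."""
    budget_ns = competitor_ns - model_exec_ns    # time left for local search
    return max(0, int(budget_ns / ns_per_record))

# FAST ~200 ns end-to-end, hypothetical model execution ~120 ns:
print(max_tolerable_error(200, 120))             # 16 records, as in the text
```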
(a) Different "last-mile" search methods (performed after location prediction) in a learned index: linear, binary, and exponential search around the predicted position, with Δ as the model-provided maximum error estimate. The locations predicted by the model depend on the query and are not known in advance. Since the last-mile search algorithms need to access different memory locations for each query, they cannot exploit the processor cache, and the search algorithm incurs multiple cache misses.

(b) Schematic illustration of processor caching in binary search. The locations accessed in the very early stages of binary search, such as the min, max, and the midpoints (N/4, N/2, 3N/4), are frequently accessed and reside in the L1 cache. Further steps of binary search access locations that are less frequently accessed and fit in lower levels of the memory hierarchy; only the last steps touch non-cached locations, causing LLC misses. Therefore, a deterministic search algorithm like binary search enjoys a high cache hit rate.
Figure 1: Comparison of access patterns in binary search (partially cached) and local search in learned indexes (non-cached).
To use a learned index in a production system, it is essential to identify when learned indexes fail to achieve superior performance and which aspects of the data distribution contribute to the performance of a learned index model. We realized that a major challenge in understanding learned indexes is that the common practices of performance evaluation for indexing algorithms are misleading for learned indexes. For example, it is common to use uniform and skewed distributions (such as log-normal) as arguably the two best- and worst-case extremes for a search task [24]. However, for evaluating search over sorted read-only data, the difficulty of the task is determined by the unpredictability of the data, which is not necessarily a factor of the skewness or shape parameter of the data distribution. As we will show in this section, most statistical distributions are much easier to model than real-world data.
Distributions that matter.
An interesting observation from the SOSD benchmark results is that even for datasets that have the same background distribution, e.g., both closely matching a uniform distribution, the performance of a learned model can vary significantly, depending on the fine-grained details of the empirical CDFs. For example, consider Figures 3a and 3b, which represent two CDFs that are both close to uniform. The uniform data (uden64 [21]) comprises dense integers synthetically sampled from a uniform distribution, and Facebook (face64 [21]) is a Facebook user ID dataset. While both datasets match closely with the uniform distribution, face64 is significantly harder to model due to the fine-grained details of its CDF. The lookup time of learned indexes (both RMI and RadixSpline) for face64 is 6-7× higher than that for uden64 (see Table 2), because there are many micro-level details (unpredictability) in the CDF, and hence a huge model with a high learning capacity is needed to fit the CDF accurately. Using the RMI learned index, for example, the uden64 data is easily modelled with a simple line (two parameters) with near-zero error, while the best architecture found by the SOSD benchmark for modelling the face64 data is a two-level hierarchy of linear models, a huge model (136MB), with an average error of 202 records.

Generally speaking, real-world datasets are more difficult to learn than synthetic ones, and a learned index built over them is not significantly faster than its algorithmic rivals. The main question remains: what distinguishes real-world data from synthetic data? Consider the four distributions in Figure 3, where Figures 3a and 3c are synthetic (generated from uniform and log-normal distributions), and Figures 3b and 3d are real-world data. The mini-chart inside each CDF highlights the distribution in a small sub-range, i.e., a "zoomed-in" view of the CDF. For the synthetic data, the CDF is very smooth in any short sub-range of the whole CDF.
Synthetic data (such as uniform, normal, and log-normal) is generated from a cumulative density function that is differentiable, meaning that in any small sub-range, the shape of the CDF is close to a straight line whose slope is close to the derivative of the underlying CDF in that range. Such a smooth CDF has less information to be compressed into a model. For example, a learned index model based on linear splines can accurately fit the whole CDF by fitting each part of the CDF to a line. Even for very skewed distributions, such as log-normal, the data is so predictable that it can easily be fitted with simple, linear models.

Figure 2: Cost of local search in a learned index: (a) lookup time and (b) cache misses as a function of the error (search area).

Real-world data, however, is much less predictable and has a much higher level of complexity in its patterns. Even if an ideal learning algorithm is used to model the real-world data, the model itself needs to be very big, because the compressed version of the CDF (to be stored as a model) is still very big. This explains why state-of-the-art learned indexes perform extremely well for datasets that are synthetically generated from a statistical distribution (such as uniform, normal, and log-normal), but perform comparably poorly for real-world data that even almost matches (shape-wise) those synthetic distributions [21]. On real-world datasets, learned indexes have a high cache miss rate and lookup time, contrary to their primary goal of having fewer cache misses.

Using learned models is beneficial when they are 1) accurate enough to predict a position within the same cache line that contains the data point, as otherwise the lookup time is adversely affected by multiple cache misses, and 2) compact enough to fit in cache and not cause LLC misses.
With this in mind, we can argue that a pure machine-learning approach might fail to "learn the data perfectly" and "fit the model in cache" simultaneously, specifically in the case of real-world datasets that contain many underlying patterns, spikes, and general noise. As a consequence, learned models are crucial to indexing, but they cannot shoulder the burden of indexing the data alone. We hence suggest an algorithmic layer that can mitigate the difficulty

(a) uniform (b) Facebook (c) Lognormal (d) OSMC
Figure 3: Example distributions with different complexities at the micro and macro levels.
(a) Learned index (b) Model + Shift-Table
Figure 4: Leveraging correction layers in a learned index.

of learning the data distribution. In this approach, the learned model is allowed to be a semi-accurate, small model that learns the holistic shape of the distribution, while the fine-grained modelling is provided by the algorithmic layer.
While learned index models are powerful tools for describing a data distribution in a compact representation, merely focusing on learning a highly accurate model does not necessarily lead to a high-performance index. In this paper, we suggest a new approach for boosting existing learned models with additional layers, specifically developed with hardware costs in mind. The suggested helper layers add a small overhead when executing queries, but significantly reduce the overall lookup time of the learned index. The suggested layers are very powerful and consequently allow for using more lightweight models, yet ideally avoid computationally expensive algorithms for training. As Figure 4 illustrates, in addition to the learned index model we add a correction layer, an optional component that can improve the performance. We explore the potential of correction layers in the next sections.
A learned model predicts a relative position F_θ(x) for a given query x. To calculate the position of the result, the estimated relative position is multiplied by the number of keys and truncated to an integer (the index); hence the predicted position is [N F_θ(x)]. The actual position of the record, however, is N F(x), where F(x) is the empirical CDF of the data points and N is the data size. Therefore, the result is N F(x) − [N F_θ(x)] records ahead of the predicted position. We identify N F(x) − [N F_θ(x)] as the drift of F_θ at key x, which is the signed error of the prediction, as opposed to the absolute error.

The idea of the Shift-Table layer is to keep a lookup table that contains the drift values so that the drift of the prediction can be corrected. Capturing the drift for every value of x would require an auxiliary index, which is not feasible. However, we can use the output of the learned index model, [N F_θ(x)], which is in the range [0, N], and construct a mapping from each possible output of the model, say k, to "how far ahead the actual record is if the model predicts the k'th record", so that we can correct the predictions using this mapping. This means that for each prediction, we only need an extra lookup of k in a fixed array of size N.

To build the Shift-Table layer, we first partition the keys x_0, ..., x_{N−1} into N partitions. We define P_k as the set of keys for which the model predicts k as the position:

    P_k = { x | [N F_θ(x)] = k }    (1)

Each of the indexed keys in P_k has an index, say N F(x), and a prediction k = [N F_θ(x)]. For each partition, we extract two parameters that specify the range for the local search, namely Δ_k and C_k. Δ_k is defined as:

    Δ_k = min { N F(x) − k : x ∈ P_k }    (2)

which indicates that if the predicted location is k, the search should start at point k + Δ_k.
Also, C_k = |P_k| is the cardinality of P_k, i.e., the number of indexed keys for which the model predicts the k'th record; in other words, the length of the area that has to be searched in the local search phase. To correct the prediction, we first compute the predicted position k = [N F_θ(x)], and then perform a local search in the range [k + Δ_k, k + Δ_k + C_k − 1]. The number of partitions depends on the range of the output of the learned index, which should be [0, N). The <Δ_k, C_k> pairs are therefore stored in a single array of size N, so that the correction can be done with a single lookup into the array of pairs.

A Shift-Table layer is depicted in Figure 5. The index contains 100 elements in the range [0,999]. The CDF model is a simple linear model, F_θ(x) = x/1000, so the predicted position is [N F_θ(x)] = [x/10]. If the query is 771, for example, the prediction of the model is k = 77, with Δ_77 = −41 and C_77 = 2, which indicates that the result is 41 records ahead of the prediction and the search area is of length 2. Therefore, the local search is performed on the indexes in the range [36, 37].

Algorithm 1 shows how Shift-Table is used to accelerate query processing. The Shift-Table layer reduces the prediction error of the model, but incurs an additional memory lookup. If the query is one of the indexed keys, the result is in the range [k + Δ_k, k + Δ_k + C_k − 1]. In Figure 5, for example, querying 771 and 782 points to the correct range that contains the result. However, if the query is not among the indexed keys, then the result is either within the range, or in the position just after the range (at data[k + Δ_k + C_k]). For example, in Figure 5, the record corresponding to queries 778 and 781 is the same (the key at index 38), though the model (k = [q/10]) maps 778 to the range [36, 37] and 781 to [38, 38]. In both cases, however, the local search algorithm (either binary or linear search) over the range computes the correct position of the result (i.e., 38).

Algorithm 1 Search with direct-mapped learned index
procedure FIND_LOWER(q, model, Shift_Table)
    pos = model.predict(q)
    range = Shift_Table.mapping[pos].range
    pos = Shift_Table.mapping[pos].startPoint
    if range < linear_to_binary_threshold then
        pos = LinearSearch(start=data[pos], range)
    else
        pos = BinarySearch(start=data[pos], range)
    end if
    return pos
end procedure

Another issue can arise for non-indexed keys when the predicted position falls in an empty partition, i.e., one to which none of the indexed keys belongs. In Figure 5, if the query is 15, then the predicted position is k = [15/10] = 1, but P_1 is empty because the model does not predict position 1 for any of the indexed keys. If the query is predicted to be in an empty partition, the result is the first record of the next non-empty partition; e.g., the result of query=15 is record 3. To make the Shift-Table layer consistent for the empty partitions, we put pseudo-values for <Δ, C> in the mapping layer such that they refer to the same range as the next existing partition. If P_{k'} is an empty partition and P_k is the first non-empty partition after P_{k'}, then C_{k'} = C_k and Δ_{k'} = Δ_k + (k − k'). The pseudo <Δ, C>-values are depicted using dashed arrows in Figure 5.

It should be noted that the empirical CDF function does not exactly identify the result of a range query on x. In this paper, we use the CDF notation F(x) for the relative position of the result corresponding to x. We consider range queries of type (query ≤ key); hence N F(x) is the index of the first key x_i such that x ≤ x_i, as the range is scanned towards the right. The same index can be used for other operators (≥, >, =, etc.) with a brief left/right scan. However, if there are too many duplicates in the indexed data, then the performance of the learned index will be worse for queries that do not match the presumed definition of F(x). In such cases, it is more efficient to use the specific definition of F(x) that reflects the position of the result under the most common type of constraint in the queries. For example, if most of the queries are of type x ≥ q, then F(x) should be defined such that N F(x) identifies the index of the last key among the duplicate values.

Algorithm 2 describes how the mapping of the Shift-Table layer is built.
In the first stage, it computes the <Δ, C> values for the non-empty partitions, i.e., the P_k's to which at least one of the indexed keys is mapped. In the second stage, a backward traversal is performed over the Shift-Table layer to compute the pseudo-values for the empty partitions (Algorithm 2, lines 10-14). Starting from the last entry, a pseudo-partition takes the same count (C) as the first non-empty partition on its right side, but its shift Δ is adjusted so that both point to the same region for the local search.

Figure 5: Shift-Table layer over an index of 100 keys (only keys shown), with the correction layer mapping each predicted position to its <Δ, C> entry.

The computational complexity of building the Shift-Table layer is O(N) × O(F_θ) to compute the drifts and update the mapping, as it only traverses the data and the Shift-Table layer once. In case running the model is expensive, model executions can be parallelized for a faster build.

Algorithm 2
Building the Shift-Table layer
procedure SHIFT_TABLE_BUILD(model F_θ, data)
    Shift_Table = array of N tuples <Δ, C>, all set to zero
    for all x ∈ data do
        pos = N F(x)                          ▷ position of x (Sec. 3.2)
        k = [N F_θ(x)]
        Δ = pos − k
        if Shift_Table[k].C = 0 then
            Shift_Table[k].Δ = Δ
        else
            Shift_Table[k].Δ = min(Shift_Table[k].Δ, Δ)
        end if
        Shift_Table[k].C += 1
    end for
    for k ← N−2 ··· 0 do
        if Shift_Table[k].C = 0 then          ▷ empty partitions
            Shift_Table[k].C = Shift_Table[k+1].C
            Shift_Table[k].Δ = Shift_Table[k+1].Δ + 1
        end if
    end for
    return Shift_Table
end procedure
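Algorithms 1 and 2 can be prototyped together in a few lines of Python. The sketch below uses our own naming and toy data (binary-search variant only), and it assumes the model's last output slot is non-empty:

```python
import bisect

def build_shift_table(data, predict, n_slots):
    """Two-pass construction sketch of Algorithm 2: minimum drift and count
    per slot, then a backward pass back-filling empty slots."""
    table = [[0, 0] for _ in range(n_slots)]   # [delta, count] per slot
    for pos, x in enumerate(data):
        k = predict(x)
        drift = pos - k                        # signed error N*F(x) - k
        if table[k][1] == 0 or drift < table[k][0]:
            table[k][0] = drift
        table[k][1] += 1
    for k in range(n_slots - 2, -1, -1):       # backward pass
        if table[k][1] == 0:                   # empty partition:
            table[k][0] = table[k + 1][0] + 1  #   delta_k = delta_{k+1} + 1
            table[k][1] = table[k + 1][1]      #   C_k     = C_{k+1}
    return [tuple(t) for t in table]

def find_lower(q, predict, table, data):
    """Algorithm-1-style lookup: predict, correct with one table read, then
    bisect the window [start, start+count-1]; a non-indexed query may land
    just after the window."""
    k = predict(q)
    delta, count = table[k]
    start = k + delta
    return bisect.bisect_left(data, q, start, min(start + count, len(data)))

# Toy example in the spirit of Figure 5 (our own data, not the paper's):
data = [5, 7, 9, 11, 40, 42, 44, 46, 90, 95]   # N = 10 keys in [0, 99]
predict = lambda q: max(0, min(9, q // 10))    # toy model: [N * F_theta(q)]
table = build_shift_table(data, predict, 10)
print(find_lower(42, predict, table, data))    # exact lower bound: index 5
```

Each slot stores the minimum drift and the number of keys mapped there; the back-filled pseudo-values satisfy Δ_{k'} = Δ_k + (k − k'), so empty slots redirect the search to the next non-empty window.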
Correcting the prediction of the model using the Shift-Table layertakes a single DRAM lookup irrespective of the size of the index.However, it might be of interest to reduce the size of the layer.The Shift-Table layer is an array of size N, containing < ฮ , ๐ถ > tuples. Further compression can be used to decrease the memoryfootprint of the Shift-Table layer.One approach is to keep a single parameter instead of the < ฮ , ๐ถ > tuples. In this regard, a predicted position ๐ should bemapped to the key that is in the median point among the keys in ๐ ๐ , which is ยฏ ฮ ๐ = (cid:20) ฮ ๐ + ๐ถ ๐ (cid:21) (3)To correct using the ยฏ ฮ ๐ values, the final position is computedas ๐๐๐ = ๐ + ยฏ ฮ ๐ , which indicates where the search should bestarted without specifying the guaranteed range that should besearched. Therefore, search algorithms that require the bound-aries specified such as binary search cannot be used for localsearch. As discussed in section 2.4, linear or exponential searchcan be used for local search without boundaries, but they areslightly slower if the error is considerable after the correction.A second approach that complements the first one, is to shrinkthe size of the Shift-Table layer by merging nearby partitions.We can extend the definition of P = { ๐ , ยท ยท ยท , ๐ ๐ } to allowpartitions that have a size of ๐ < ๐ . 
We define M partitions P^M = {p_1^M, · · · , p_M^M}, where each partition is defined as:

    p_k^M = { x | ⌊M F_λ(x)⌋ = k }    (4)

Similarly, Δ_k^M is the minimum "move to the right" shift that each of the keys in p_k^M needs:

    Δ_k^M = min( N F(x) − ⌊N F_λ(x)⌋ )  ∀x ∈ p_k^M    (5)

and C_k^M should be defined such that the boundary is valid for all keys in p_k^M, which is:

    C_k^M = max( N F(x) − (⌊N F_λ(x)⌋ + Δ_k^M) )  ∀x ∈ p_k^M    (6)

where ⌊N F_λ(x)⌋ + Δ_k^M marks the start of the search window. To combine the two approaches for compacting the Shift-Table layer, we can use average drifts Δ̄_k^M instead of the ⟨Δ_k^M, C_k^M⟩ pairs:

    Δ̄_k^M = ⌊ (1/|p_k^M|) Σ_{x ∈ p_k^M} ( N F(x) − ⌊N F_λ(x)⌋ ) ⌋    (7)

and then use ⌊N F_λ(x)⌋ + Δ̄_{⌊M F_λ(x)⌋} as the corrected prediction. Suppose the same data as in Figure 5, but instead of a Shift-Table layer of size N we use only M = 30 partitions. Table 1 shows how a compact Shift-Table layer is built and used for correction, on a portion of the index. We use the same model (F_λ(x) = x/1000), hence the prediction is ⌊N F_λ(x)⌋ = ⌊0.1x⌋ and the partition corresponding to a key is ⌊M F_λ(x)⌋ = ⌊0.03x⌋. All of the records in data[35..39] are assigned to the same partition p_23 and their predictions are shifted 40 records backwards. Note that when M ≠ N, a partition does not specify a single point (or range) for all of the keys in the partition. Instead, the position of a key after correction depends on both ⌊N F_λ(x)⌋ (the prediction) and ⌊M F_λ(x)⌋ (the partition number). For example, all keys belonging to p_23, i.e., data[35..39], have the same correction of Δ̄_23 = −40, but their final predictions are different. Consequently, the correction error of a compact Shift-Table layer is less than the number of elements in the partitions.
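Equations (4)-(7) can be sketched as follows; again a simplified Python illustration with hypothetical names, assuming model(x) is an estimated CDF in [0, 1].

```python
import math

def build_compact_shift_table(data, model, m):
    """Sketch of Eqs. (4)-(7): group the keys into m partitions by
    floor(m * model(x)) and aggregate the shifts of each partition."""
    n = len(data)
    lo = [None] * m      # Delta_k: minimum shift in partition k (Eq. 5)
    hi = [None] * m      # maximum shift, used for the window size
    tot = [0] * m        # sum of shifts, for the average drift (Eq. 7)
    cnt = [0] * m
    for true_pos, x in enumerate(data):
        pred = min(max(math.floor(n * model(x)), 0), n - 1)
        k = min(max(math.floor(m * model(x)), 0), m - 1)
        s = true_pos - pred
        lo[k] = s if lo[k] is None else min(lo[k], s)
        hi[k] = s if hi[k] is None else max(hi[k], s)
        tot[k] += s
        cnt[k] += 1
    delta = [0 if v is None else v for v in lo]                   # Eq. (5)
    c = [0 if cnt[k] == 0 else hi[k] - lo[k] for k in range(m)]   # Eq. (6)
    # Python's // floors toward -infinity, matching the floor in
    # Eq. (7) even for negative drifts.
    avg = [0 if cnt[k] == 0 else tot[k] // cnt[k] for k in range(m)]
    return delta, c, avg

delta, c, avg = build_compact_shift_table([0, 1, 2, 3, 100],
                                          lambda x: x / 101, m=2)
```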
Table 1: Illustration of Shift-Table with M = 30 mapping entries on an index with N = 100 keys.

| Index                       | 34  | 35  | 36  | 37  | 38  | 39  | 40  | 41  |
| key (x)                     | 752 | 769 | 770 | 771 | 782 | 785 | 820 | 830 |
| Predicted index ⌊0.1x⌋      | 75  | 76  | 77  | 77  | 78  | 78  | 82  | 83  |
| Error before correction     | -41 | -41 | -41 | -40 | -40 | -39 | -42 | -42 |
| Partition k = ⌊0.03x⌋       | 22  | 23  | 23  | 23  | 23  | 23  | 24  | 24  |
| Δ̄_k                         | -41 | -40 | -40 | -40 | -40 | -40 | -42 | -42 |
| Prediction after correction | 34  | 36  | 37  | 37  | 38  | 38  | 40  | 41  |
| Error after correction      | 0   | 1   | 1   | 0   | 0   | -1  | 0   | 0   |

The drift Δ̄_k^M of p_k^M points at the index of the median key among the members of p_k^M. This means that if a key is predicted to be in the k-th partition (among the M partitions), the local search is done around ⌊N F_λ(x)⌋ + Δ̄_k^M.

Using a Shift-Table layer of size M < N does not affect the complexity of building the layer, which is O(N) × O(F_λ) + O(N). However, if the midpoint values are used (correction without specifying the boundary), it is possible to construct the map using a sample of the indexed keys, at the cost of some accuracy. Using a sample of size s < N, the layer can be built in O(s) × O(F_λ) + O(M) time. Nonetheless, keep in mind that the Shift-Table layer is designed for applications that favour latency over memory footprint; reducing the memory footprint of the Shift-Table layer by a large factor will limit its margin for improvement, as the fine-grained details of the empirical CDF will be lost to some extent.

Since the Shift-Table layer specifies a range for local search, the notion of error is not trivial. However, we can use the estimates without a range (Δ̄_i), for which the correction picks the median value among the keys in p_i. The error for the keys in each partition is {−⌊C_i/2⌋, · · · , 0, · · · , ⌊C_i/2⌋} if C_i is odd, and {−⌊C_i/2⌋, · · · , 0, · · · , ⌊C_i/2⌋ − 1} if C_i is even. The average error is approximately C_i/4. Before the correction, the error depends on the divergence between F(x) and F_λ(x).
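The corrections in Table 1 can be checked mechanically; the sketch below uses integer arithmetic for the floors (⌊0.1x⌋ = x // 10 and ⌊0.03x⌋ = (3x) // 100), with the drift entries taken from the table.

```python
keys = [752, 769, 770, 771, 782, 785, 820, 830]
true_index = list(range(34, 42))
avg_drift = {22: -41, 23: -40, 24: -42}  # compact-layer entries, Table 1

# corrected position = floor(0.1x) + avg_drift[floor(0.03x)]
corrected = [x // 10 + avg_drift[(3 * x) // 100] for x in keys]
errors = [c - t for c, t in zip(corrected, true_index)]
print(corrected)  # [34, 36, 37, 37, 38, 38, 40, 41]
print(errors)     # [0, 1, 1, 0, 0, -1, 0, 0]
```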
After correcting the model with the Shift-Table, however, the error only depends on the C_i values, i.e., a prediction error only occurs when ⌊N F_λ(x)⌋ predicts the same position for multiple keys. Therefore, the local search range and the error are combinations of multiple step functions over the partitions p_i with C_i > 1, and the expected error after correction is approximately:

    E ≈ (1/N) Σ_{p_i ∈ P} C_i² / 4    (8)
Figure 6: Error correction using the Shift-Table layer. (a) Example data and model; (b) prediction error (log scale) of the model alone vs. the model with Shift-Table.
Figure 6 illustrates how the Shift-Table layer corrects the error of a linear interpolation model on the osmc data. While the model is too simple to capture the patterns in the data, the Shift-Table layer alone is effective at correcting the predictions: the average error of the model is 28 million keys, but Shift-Table reduces it to only 129 keys.

Shift-Table corrects two types of error. The first is a considerable local bias in the model, i.e., when N F(x) diverges significantly from N F_λ(x) in a sub-range of the data distribution. The second is the fluctuation of the distribution between nearby keys, for most of which the Shift-Table layer is very effective. The only type of error that can degrade the performance of the Shift-Table layer is a congestion of keys in a small sub-range of values, which leads to many keys being mapped to a single position and hence to some partitions with a high C_i.

Local bias occurs when the model cannot capture the CDF in a local neighborhood, i.e., the error N F_λ(x) − N F(x) has a considerable bias in some sub-ranges of the distribution. Table 2 shows that even if a single line is used as a model, which has a huge bias in most areas of the distribution, the Shift-Table layer can efficiently eliminate that bias and reduce the error significantly, such that the linear model outperforms all other algorithms on the real-world datasets, as well as on the uspr dataset (sparse uniformly-distributed integers), which has a significantly higher variance than uniformly-distributed dense integers.

Local variance, the fluctuation of values between nearby keys, is very common in real-world data. For example, the face, uspr, and uden datasets all follow a uniform distribution, but they have different local variances, i.e., different amounts of fluctuation between nearby keys. The uden dataset is very easy to model using learned indexes and does not require a helping layer such as Shift-Table.
The other two datasets, however, are very hard to model using learned index structures, whereas the Shift-Table layer can easily correct the fluctuations of the values (different increments between each two points), as long as the model does not predict a single record for many nearby keys (which would result in a high C_i value).

3.7 Cost model of the Shift-Table layer

The accuracy of the model after correction with Shift-Table depends on the cardinalities of the partitions (the C_i values). Ideally, if the records of each partition reside on a single cache line, the results will be retrieved in a single memory lookup. The cost of local search, i.e., the mapping between the accuracy in each partition and the latency of the local search, depends on the hardware. As discussed in Section 2.1, the latency of search over various ranges can be measured by a micro-benchmark over non-cached regions of different sizes. Let L(r) be the measured latency of non-cached search over a range containing r records; the latency of looking up a key in a region of size C_i is then L(C_i). Assuming that the queries have the same distribution as the data points, the average lookup latency of the index is:

    Latency with Shift-Table = Latency(F_λ) + (1/N) Σ_{p_i ∈ P} C_i · L(C_i)    (9)

The cost model can also be used to estimate which local search algorithm should be used, by substituting into Equation 9 the cost of each local search algorithm, i.e., the L(r) measurements for linear, binary, and exponential search and their different implementations.
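Equations (9) and (10) can be evaluated directly from the ⟨Δ_i, C_i⟩ arrays once a latency curve L(r) has been measured. The sketch below is illustrative (the latency curve is made up, not a measurement); the without-correction estimate assumes the per-key error is roughly the midpoint drift Δ_i + ⌊C_i/2⌋.

```python
import math

def estimated_latency(delta, count, L, t_model):
    """Eqs. (9)/(10): average lookup latency with and without the
    Shift-Table correction, assuming queries follow the data."""
    n = sum(count)
    with_st = t_model + sum(c * L(c) for c in count if c > 0) / n
    # Without the correction, the local search must cover roughly
    # |Delta_i + C_i // 2| records for each key.
    without = t_model + sum(
        c * L(abs(d + c // 2)) for d, c in zip(delta, count) if c > 0) / n
    return with_st, without

# Made-up latency curve: a base cost plus a log term per searched record.
L = lambda r: 36.0 + 5.0 * math.log2(r + 1)
with_st, without = estimated_latency([0, -500], [3, 1], L, t_model=10.0)
# A heavily drifted slot makes the uncorrected index far slower.
```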
Branch-optimized binary search is the natural choice if the Shift-Table layer can determine the boundary (i.e., when using the ⟨Δ_i, C_i⟩ pairs); otherwise, either linear or exponential search should be chosen based on the latency estimate.

Taking the cost of running the Shift-Table layer into account, we can estimate how much the correction improves the accuracy of the learned index model and hence estimate the speedup. The lookup time of the model without the Shift-Table layer can be estimated once the layer is built, without running a speedup benchmark. The model error for each key is Δ̄_i = Δ_i + ⌊C_i/2⌋, therefore the estimated runtime of the index without correction is:

    Latency without Shift-Table = Latency(F_λ) + (1/N) Σ_{p_i ∈ P} C_i · L(Δ̄_i)    (10)

The correction layer requires the learned model to be a valid CDF, i.e., F_λ(x) should be monotonically increasing: x_i > x_j ⟹ F_λ(x_i) ≥ F_λ(x_j). Among our baselines, the RadixSpline learned index always produces a valid (increasing) CDF, but the RMI index does not always produce monotonically increasing predictions. In RMI, for example, the CDF model might decrease when using cubic models [29] or at the edge point between two models in the second level. If F_λ(x) is not monotonically increasing, then the correction layer could identify a range that does not include the query result, because the values of x for which the learned model predicts the i-th record are not in a contiguous memory block.

A learned index model that is non-monotonic can still use the Shift-Table layer: the output of the Shift-Table layer would still predict a position, but it is not guaranteed that the result is in the predicted range. Therefore, the local search algorithm should check whether the query is in the predicted range and, if not, perform a search outside of the range.
Another workaround for non-monotonic models is to use the Δ̄_i midpoint values instead of the ⟨Δ_i, C_i⟩ pairs, which predicts a location (instead of a range) from which to start the local search.

If the Shift-Table layer uses the ⟨Δ_k^M, C_k^M⟩ pairs, it can determine the range for local search and we can apply either linear or binary search, depending on the error range. We do a linear search if the range is smaller than a threshold (8 keys, in our experiments); otherwise, a binary search is performed. However, if the layer only contains the average shift values (Δ̄_i), it predicts a position without specifying the boundaries that contain the record; hence, either linear or exponential search can be performed, depending on the average error rate and the performance objectives (average or worst-case latency).

The Shift-Table layer is optional and adds overhead to the search. Therefore, enabling Shift-Table is only worthwhile if it eventually accelerates the original learned index structure. An effective configuration of the index is thus a choice between 1) using the model alone, or 2) the model + Shift-Table. Note that the Shift-Table layer can be deactivated at zero cost: the output of the model and that of the Shift-Table layer are of the same type and both represent a prediction of the record position, hence if the Shift-Table layer is disabled, we can simply use the model alone.

While tuning the system, the performance of each configuration can be measured directly using performance tests, or by measuring the model error and then using the cost model of the Shift-Table layer (Section 3.7). The parameters of the architecture, i.e., the Shift-Table array size M and the parameters of the learned CDF model, can be tuned by computing the error estimate using Shift-Table's cost model or, alternatively, by running performance tests on the built architecture.
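The last-mile policy described above (a linear scan if the window is smaller than 8 keys, binary search otherwise) can be sketched as follows, with lower_bound semantics; the threshold of 8 follows the experiments, all other names are illustrative.

```python
import bisect

def last_mile_search(data, query, pred, delta, c, threshold=8):
    """Return the first index i with data[i] >= query, given that the
    result lies in the window [pred + delta, pred + delta + c]."""
    start = max(pred + delta, 0)
    end = min(start + c + 1, len(data))   # exclusive upper bound
    if c < threshold:
        # Short window: a branchy binary search is not worth it.
        i = start
        while i < end and data[i] < query:
            i += 1
        return i
    # Wider window: binary search restricted to the guaranteed range.
    return bisect.bisect_left(data, query, start, end)

data = list(range(0, 200, 2))   # keys 0, 2, 4, ..., 198
# Hypothetical model output for key 95: slot 40 with <delta=5, c=12>.
idx = last_mile_search(data, 95, 40, 5, 12)   # -> 48 (data[48] == 96)
```

With only a midpoint drift available, the same routine could be called with a symmetric window around the predicted position, or replaced by exponential search when no bound is known.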
Our suggested default for the Shift-Table layer is M = N: a mapping layer with the same number of entries as keys ensures that the layer exhibits its full effect in eliminating the signed error, and it incurs no additional latency compared to smaller values of M.

An advantage of Shift-Table is that the learned model does not need to be very accurate, as a correction will be applied anyway. Therefore, a more relaxed measure can be used instead of the least-squares error. In this paper, however, we do not learn the model w.r.t. the Shift-Table layer, for the sake of simplicity and to keep the Shift-Table layer detachable (optional), preserving the assumption that the Shift-Table layer can be disabled at run-time to free up memory while the model can still be used.

The accuracy of the learned model also determines the size of the entries of the Shift-Table layer. Each mapping entry should be wide enough to fit a Δ value of Δ_MAX, which is the maximum error of the model. If, for example, the error is smaller than 2^15, then a 16-bit integer (short type) can be used.
In this section, we compare the performance of our proposed method against the SOSD benchmark, a recent benchmark for search on sorted data that includes learned indexes, classical indexes, and no-index search algorithms.

Experimental Setup.
The algorithms are implemented in C++ and compiled with GCC 9.1. The experiments are performed on a system with 16 GB of memory and an Intel Core i7-6700 (Skylake), which has four cores and runs at 3.4 GHz with 32 KB L1, 256 KB L2, and 8 MB L3 caches. The operating system is Ubuntu 18.04 with kernel version 4.15.0-65. In our setup, the LLC miss penalty measured by Intel Memory Latency Checker is 36 ns, which is the minimum lookup time of an ideal index.
https://github.com/learnedsystems/SOSD/tree/mlforsys19

Note that all data resides in main memory. The range index finds the first indexed key that is equal to or greater than the lookup key. Also, the keys are sorted on the physical layout (i.e., it is a clustered index), so that the entire result set of the range query can be returned once the first key is found. Similar to [21, 24], we only report the lookup time for the first result and do not include scan times in our experiments, because all indexes use the same layout for the data records.

Datasets.
For the sake of reproducibility, we use the same datasets as in the SOSD benchmark, which contains four datasets synthetically generated from known distributions and four real-world ones. The synthetic datasets are generated from different distributions, namely logn: a lognormal distribution, norm: a normal distribution, uden: uniformly-generated dense integers, and uspr: uniformly-generated sparse integers. The real-world datasets are face: Facebook user IDs [40], amzn: book sale popularity from Amazon sales rank data, osmc: a uniform sample of OpenStreetMap locations, and wiki: timestamps of edit actions on Wikipedia articles. All datasets contain 200M unsigned integers.

Implementation details.
Our experiments are based on the SOSD benchmark [21]. The baselines include two learned indexes, namely RadixSpline [32] (RS), which uses linear splines, and the Recursive Model Index (RMI), which uses a hierarchy of models. Note that RMI has a choice of different models, and SOSD [21] specifically handpicked the best models for each dataset in the benchmark. SOSD also includes no-index search algorithms such as binary search (BS), linear interpolation search (IS), and the recently suggested non-linear three-point interpolation (TIP) [40]. We also compare against algorithmic index structures: ART, the Adaptive Radix Tree [25]; FAST [20]; RBS (Radix Binary Search), a two-stage algorithm in which a radix structure maps a fixed-length key prefix to the range of all keys having that prefix and a binary search is then performed on that range [21]; and the STX implementation of the B+tree [1]. Finally, we include four on-the-fly search algorithms: BS, binary search (STL implementation); TIP, three-point interpolation [40]; IS, interpolation search, which is similar to binary search but uses interpolated positions in each iteration; and IM, Interpolation as a Model: a dummy model that interpolates the key between the minimum and maximum values of the keys and then performs exponential search around the predicted position.

The experiments use either 32- or 64-bit unsigned integer IDs for the key (depending on the dataset), and 64 bytes for the payload.
To test the effectiveness of the suggested layers compared to learned indexes, we use a simple interpolation model (IM), i.e., F_λ(x) = (x − min_key)/(max_key − min_key). Such a dummy model is deliberately chosen to delegate the burden of data modelling entirely to the correction layers.
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
https://aws.amazon.com/public-datasets/osm
https://dumps.wikimedia.org
The architectures and parameters of the RMI models used for each dataset are specified at https://github.com/learnedsystems/SOSD/blob/mlforsys19/scripts/build_rmis.sh
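As described, IM reduces to a two-point min/max interpolation. A minimal sketch (illustrative, not the SOSD implementation):

```python
def im_predict(data, x):
    """Interpolation as a Model: F_lambda(x) = (x - min) / (max - min),
    scaled to a position in the sorted array."""
    lo, hi = data[0], data[-1]
    f = (x - lo) / (hi - lo)
    return min(max(int(len(data) * f), 0), len(data) - 1)

uniform = list(range(0, 1000, 10))   # keys 0, 10, ..., 990
skewed = [0, 1, 2, 3, 1000]
pred_uniform = im_predict(uniform, 500)  # exact: uniform[50] == 500
pred_skewed = im_predict(skewed, 3)      # 0, though the true index is 3
```

The skewed case illustrates exactly the burden that the correction layers are meant to absorb.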
The Shift-Table layer has the same number of entries as the actual data, i.e., M = N. We followed the tuning procedure discussed in Section 3.9: we start from the model (IM and RS) and consequently evaluate IM+Shift-Table and RS+Shift-Table. The cost of running the Shift-Table layer is around 40 ns, which pays off by reducing the prediction error and thus the lookup time. Based on the cost model of the Shift-Table layer (Section 3.7) and the error-to-latency micro-benchmark (Figure 2a), we should not add the Shift-Table layer if 1) the error before adding the layer is less than a threshold (10 records), or 2) the error of the index after adding the Shift-Table layer does not decrease by a factor of 10 (roughly equivalent to the 50-nanosecond latency of the additional layer, according to the error-to-latency micro-benchmark).

Table 2 compares the lookup times (nanoseconds per lookup) of the baseline algorithms with our dummy interpolation model (IM) and the two corrected versions, i.e., IM+Shift-Table and RS+Shift-Table. Note that ART does not support data with duplicate keys, and FAST does not support 64-bit keys. Also, interpolation search (IS) takes too much time on some datasets, because its execution time highly depends on the uniformity of the data distribution, varying from O(log log N) + O(1) iterations on uniform distributions to O(N) iterations for very skewed ones [40].

For the synthetic datasets, the difficulty of the datasets for our dummy linear interpolation model varies from very easy (uden64) to extremely hard (logn64). While the Shift-Table layer significantly improves a dummy layer on non-uniform data distributions, it cannot outperform the learned index models.
This is not surprising, as all synthetic datasets (uniform, normal, and lognormal) have a pattern derived from continuously differentiable density functions, hence the distribution resembles a straight line on smaller sub-ranges as we "zoom in" on the data distribution (e.g., see Figure 3c). Therefore, a learned index structure composed of linear segments at the bottom (including both RMI and RS) can effectively model the distribution using a very compact representation.

For the real-world data, however, the fluctuations in the data severely affect both the RMI and RS learned indexes. The Shift-Table layer effectively corrects a highly inaccurate dummy IM model, such that it outperforms the RMI learned index by 1.5X to 2X on all datasets, while RS falls behind both. Keep in mind that RMI must be tuned with the best architecture and parameters, while Shift-Table does not require a manual training process and can even work with a simple, untrained model such as IM, and yet deliver a lower latency.

Figure 7 shows the average build times of the indexes, along with standard-deviation bars indicating how the build time varies across distributions. Note that the RMI implementation used in the SOSD benchmark needs to be compiled for faster retrievals; however, we did not include RMI's extra overhead for compiling the code and only report the build time. IM+Shift-Table, the latency-wise winner, also takes either the same or even less build time than the competing learned indexes.

The latencies reported in Table 2 present the fastest configuration for each learned index. In this section, we present the details of the tuning process to see the optimum performance of each learned index.

Table 2: Comparison of lookup times (nanoseconds per lookup) with the SOSD benchmark. The red box indicates the base model (IM) and the enhanced versions.
| Dataset | ART | FAST | RBS | B+tree | BS  | TIP  | IS   | IM   | IM+Shift-Table | RMI | RS  | RS+Shift-Table |
| logn32  | N/A | 230  | 385 | 375    | 624 | 551  | N/A  | 1384 | 166            | 141 | 166 | 153.5          |
| logn64  | 238 | N/A  | 622 | 427    | 674 | 377  | N/A  | 1075 | 376            | 132 | 472 | 92.8           |
| amzn32  | N/A | 208  | 243 | 393    | 658 | 569  | 3228 | 1524 |                | 185 | 236 | 110.8          |
| face32  | 179 | 203  | 238 | 388    | 654 | 717  | 792  | 861  |                | 213 | 310 | 142.8          |
| amzn64  | N/A | N/A  | 284 | 428    | 676 | 578  | 3510 | 1575 |                | 189 | 238 | 119.3          |
| face64  | 290 | N/A  | 257 | 427    | 671 | 925  | 1257 | 918  |                | 247 | 344 | 204.1          |
| osmc64  | N/A | N/A  | 410 | 428    | 675 | 4617 | N/A  | 1462 |                | 194 | 297 | 339            |
| wiki64  | N/A | N/A  | 271 | 437    | 686 | 767  | 5867 | 1687 |                | 172 | 191 | 124.1          |
Figure 7: Build times (average time for all datasets).
For those indexes that have a parameter affecting the index size (such as the branching factor in B+tree and the number of radix bits in ART, RS, and RBS), the performance can be tuned by evaluating the latency for different index sizes.

Figure 8 shows the latencies of the indexes for the face64 and osmc64 datasets, along with the average log2 error, CPU instructions, and L1/LLC cache misses. IM+Shift-Table and RS+Shift-Table achieve the fastest lookup times on both datasets. For most indexes, except RMI and RBS, the latency does not improve beyond a certain optimum index size, after which the latency increases again. RBS has a much larger latency than both [IM/RS]+Shift-Table indexes of the same size, and extrapolating the RMI latencies also suggests that even if we could extend the RMI size to 1400 MB (equal to Shift-Table's size), it would not achieve a game-changing performance on either of the datasets. Note that we could not run RMI with larger models because RMI embeds the parameters into the code, and the compile times for models larger than 400 MB were prohibitively high.

The average log2 errors indicate the average number of iterations in binary search for the last-mile search stage. Larger models result in lower log2 errors in all indexes and lead to a faster last-mile search; however, once the model exceeds the LLC size, the cache-miss rate increases (when running the model) and hence the prediction time worsens. For RS, ART, and B+tree, the cache misses and extra overhead of running the models increase the number of instructions, the cache misses, or both, enough to prevent the index from improving latency by increasing the footprint.
As discussed in Section 3.4, the Shift-Table layer can be compressed by merging multiple entries, hence reducing its footprint. Figure 9 shows the effect of the Shift-Table layer size on lookup time and prediction error. Shift-Table can operate in two modes:
R-1: a full layer containing ⟨Δ_i, C_i⟩ pairs, similar to Figure 5, that indicates the exact range for local search (hence enabling binary search); and S-X: a compressed single-entry map, similar to Table 1, containing one Δ̄ entry per X records. Thus, S-X contains M = N/X entries, and the memory footprint of S-1 is half the size of R-1.

The error of the R-1 Shift-Table is slightly higher than that of S-1. This is because R-1 is designed to draw boundaries for binary search, hence it always points to the first record of each partition, while S-1 points to the middle of the partition and therefore has almost half the error of R-1. Performance-wise, however, R-1 always has the lowest latency, because the boundaries for the last-mile search operation do not need to be discovered using additional boundary-detection algorithms such as exponential search. As expected, compressing the Shift-Table by allocating one entry per X records increases the error and hence degrades the performance: with higher compression ratios, the ability of Shift-Table to "memorize" the fine-grained details of the data distribution degrades due to the loss of information after merging.

Figure 8: Analysis of the effect of index size on performance (lookup time, log2 error, CPU instructions, and L1/LLC cache misses on face64 and osmc64).

Figure 9: Analysis of the effect of the Shift-Table layer size: (a) latency and (b) average error (records) for R-1, S-1, S-10, S-100, S-1000, and no Shift-Table, on all datasets.
On-the-fly search on sorted data
A fundamental problem that has been studied for decades is how to find a key in a sorted list of items. The classic approach is binary search, and numerous extensions have been suggested to improve it for special cases, most notably interpolation search [33] and exponential search [3]. For data distributions that are close to uniform, interpolation search is shown to be very effective [13, 34, 40]. Due to the growing gap between CPU power and memory latency in the past decade, more advanced interpolation techniques such as three-point interpolation have become viable on modern hardware [40]. Exponential search enables binary search over an unbounded list. It is also extensively used in learned indexes when the key is likely to be near a "guessed" location but a guaranteed boundary around the guessed point is not known [7, 24, 31].
Range indexes
An alternative to on-the-fly binary search over sorted data is to keep the data in an index structure. Nonetheless, indexes built to answer range queries (such as B-trees) are similar to binary search in that they need to keep the data sorted internally. Common structures for range indexing include skiplists, B+trees, and radix trees. The B+tree is cache-efficient but requires pointer chasing, which incurs multiple cache misses [14]. There has been a tremendous effort to make binary search trees and B+trees efficient on modern hardware. For example, FAST [20] organizes tree elements to exploit modern hardware features such as cache lines and SIMD. Another common solution is to use compression techniques on the indexed keys, most notably in radix trees. Modern radix trees exploit hardware-efficient heuristics for fitting a distribution in memory (usually by building a heuristically-optimized compressed trie), such as the adaptive radix tree (ART) [5, 25] and the Succinct Range Filter (SuRF) [43]. Skiplists are particularly efficient for concurrent-update workloads [39, 42].
Learned index structures
Learned range indexes [7, 12, 24, 28, 32] have recently been suggested as an alternative to algorithmic range indexes. In this approach, a model is trained on the data with the intent of capturing the data distribution and processing the queries more efficiently. We refer to the paper by Kraska et al. [24], which introduced the idea of the learned index. In a learned index, the CDF of the key distribution is learned by fitting a model, and the learned model is subsequently used as a replacement for the index (B+tree or similar) to find the location of the query results on the storage medium. Index learning frameworks such as the RMI model [24, 29] can learn arbitrary models [29], although a theoretical study [9] as well as a recent experimental benchmark [21] have shown that simple models like linear splines are very effective for most datasets. Spline-based learned indexes include the Piecewise Geometric Model index (PGM-index) [11], FITing-Tree [12], the Model-Assisted B-tree (MAB-tree) [18], RadixSpline [22], the Interpolation-friendly B-tree (IF-Btree) [17], and others [28, 38]. We refer to [10] for an extensive comparison of learned indexes. Recently, there have been numerous theoretical works [4, 26, 36, 37] on learned indexes, and numerous efforts have been made to handle practical challenges around using a learned index, including update handling [7, 16] and designing a learned DBMS [23]. The idea of using a model of the data to boost an existing algorithmic index has been a center of focus in the past few years [14, 16, 18, 35]. In the multivariate area, learning from a sample workload has also shown interesting results [8, 19, 27, 31]. Aside from the main trend in learned indexes, which is range indexing, machine learning has also inspired other indexing and retrieval tasks, including Bloom filters [6, 30], inverted indexes [41], computing list intersections [2], and multidimensional indexing on datasets with correlated attributes [15].
Learning and modeling data distributions via machine learning is a promising direction for data management systems. However, the approaches and objective functions that are common in machine learning problems are not necessarily optimal choices when the ultimate target is performance improvement. Instead of pushing a machine learning model to its limits for highly accurate modeling of the data distribution, it is more efficient to use ML models only to approximate the high-level, generalizable "patterns" in the data distribution (its holistic shape), and to handle the fluctuations and fine-grained details of the distribution using a more hardware-efficient approach. This combination outperforms learned models as well as algorithmic index structures, even if a simple or somewhat dummy model such as a min/max linear interpolation is used. The Shift-Table layer is effective in capturing almost all distributions even without models that require training, and it takes only a single pass over the data points to build. Our results show that even a simple linear model equipped with the Shift-Table enhancement layer outperforms trained and tuned learned indexes by 1.5X to 2X on real-world datasets.

Our current work only considers read-only workloads. We leave it as future work to adapt Shift-Table to workloads with updates. One idea is to capture the drifts in the data distribution using update-tracking segments [16], and to use Fenwick trees to estimate and correct the drifts in both the model and the Shift-Table.
REFERENCES
[1] STX B+Tree C++ Template Classes. http://panthema.net/2007/stx-btree.
[2] Naiyong Ao, Fan Zhang, Di Wu, Douglas S Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient parallel lists intersection and index compression algorithms using graphics processing units. VLDB Endowment.
[3] Jon Louis Bentley and Andrew Chi-Chih Yao. 1976. An almost optimal algorithm for unbounded searching. Information Processing Letters.
[4] Cost Models for Learned Index with Insertions. Technical Report. University of Aalborg.
[5] Robert Binna, Eva Zangerle, Martin Pichl, Günther Specht, and Viktor Leis. 2018. HOT: a height optimized Trie index for main-memory database systems. In SIGMOD. 521-534.
[6] Zhenwei Dai and Anshumali Shrivastava. 2019. Adaptive learned Bloom filter (Ada-BF): Efficient utilization of the classifier. arXiv:1910.09131 (2019).
[7] Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, and Tim Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. In SIGMOD. 969-984.
[8] Mohamad Dolatshah, Ali Hadian, and Behrouz Minaei-Bidgoli. 2015. Ball*-tree: Efficient Spatial Indexing for Constrained Nearest-neighbor Search in Metric Spaces. arXiv:cs.DB/1511.00628.
[9] Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2020. Why are learned indexes so effective?. In ICML, Vol. 119. PMLR.
[10] Paolo Ferragina and Giorgio Vinciguerra. 2020. Learned data structures. Recent Trends in Learning From Data (2020), 5-41.
[11] Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. VLDB Endowment 13, 8 (2020), 1162-1175.
[12] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-aware Index Structure. In SIGMOD. 1189-1206.
[13] Goetz Graefe. 2006. B-tree indexes, interpolation search, and skew. In DaMoN.
[14] Goetz Graefe and Harumi Kuno. 2011. Modern B-tree Techniques. Foundations and Trends in Databases 3, 4 (2011), 203-402.
[15] Ali Hadian, Behzad Ghaffari, Taiyi Wang, and Thomas Heinis. 2021. COAX: Correlation-Aware Indexing on Multidimensional Data with Soft Functional Dependencies. arXiv:cs.DB/2006.16393.
[16] Ali Hadian and Thomas Heinis. 2019. Considerations for handling updates in learned index structures. In AIDM.
[17] Ali Hadian and Thomas Heinis. 2019. Interpolation-friendly B-trees: Bridging the Gap Between Algorithmic and Learned Indexes. In EDBT.
[18] Ali Hadian and Thomas Heinis. 2020. MADEX: Learning-augmented Algorithmic Index Structures. In AIDB.
[19] Ali Hadian, Ankit Kumar, and Thomas Heinis. 2020. Hands-off Model Integration in Spatial Index Structures. In AIDB.
[20] Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D Nguyen, Tim Kaldewey, Victor W Lee, Scott A Brandt, and Pradeep Dubey. 2010. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD. 339-350.
[21] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on Machine Learning for Systems (2019).
[22] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2020. RadixSpline: a single-pass learned index. In AIDM.
[23] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR.
[24] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In SIGMOD. 489-504.
[25] Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In ICDE. 38-49.
[26] Pengfei Li, Yu Hua, Pengfei Zuo, and Jingnan Jia. 2019. A Scalable Learned Index Scheme in Storage Systems. arXiv:1905.06256 (2019).
[27] Pengfei Li, Hua Lu, Qian Zheng, Long Yang, and Gang Pan. 2020. LISA: A Learned Index Structure for Spatial Data. In SIGMOD.
[28] Anisa Llavesh, Utku Sirin, Robert West, and Anastasia Ailamaki. 2019. Accelerating B+tree Search by Using Simple Machine Learning Techniques. In AIDB.
[29] Ryan Marcus, Emily Zhang, and Tim Kraska. 2020. CDFShop: Exploring and Optimizing Learned Index Structures. In SIGMOD. 2789-2792.
[30] Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Related Structures. arXiv:1802.00884 (2018).
[31] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-dimensional Indexes. In SIGMOD. 985-1000.
[32] Thomas Neumann and Sebastian Michel. 2008. Smooth interpolating histograms with error guarantees. In
BNCOD . Springer, 126โ138.[33] W Wesley Peterson. 1957. Addressing for random-access storage.
IBM journalof Research and Development
1, 2 (1957), 130โ146.[34] CE Price. 1971. Table lookup techniques.
Comput. Surveys
3, 2 (1971), 49โ64.[35] Wenwen Qu, Xiaoling Wang, Jingdong Li, and Xin Li. 2019. Hybrid Indexesby Exploring Traditional B-Tree and Linear Regression. In
WEBIST . 601โ613.[36] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervรฉ Jรฉgou.2018. Deja Vu: an empirical evaluation of the memorization properties ofConvNets. arXiv:1809.06396 (2018).[37] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervรฉ Jรฉgou.2019. Spreading vectors for similarity search. In
ICLR .[38] Naufal Fikri Setiawan, Benjamin IP Rubinstein, and Renata Borovica-Gajic.2020. Function Interpolation for Learned Index Structures. In
ADC . 68โ80.[39] Stefan Sprenger, Steffen Zeuch, and Ulf Leser. 2016. Cache-sensitive skip list:Efficient range queries on modern cpus. In
DaMoN . Springer, 1โ17.[40] Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Search-ing In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In
SIG-MOD . 36โ53.[41] Wenkun Xiang, Hao Zhang, Rui Cui, Xing Chu, Keqin Li, and Wei Zhou. 2018.Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised?
IEEE Access
[42] ICDE. IEEE, 119–122.
[43] Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. SuRF: Practical range query filtering with fast succinct tries. In