A Pluggable Learned Index Method via Sampling and Gap Insertion
Yaliang Li∗, Daoyuan Chen∗, Bolin Ding, Kai Zeng, and Jingren Zhou
Alibaba Group
{yaliang.li, daoyuanchen.cdy, bolin.ding, zengkai.zk, jingren.zhou}@alibaba-inc.com
ABSTRACT
Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of index, named learned index, has been proposed to learn hidden yet useful data distributions and incorporate such information into the learning of indexes, which leads to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In this paper, we propose a formal machine learning based framework to quantify the index learning objective, and study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness of learned indexes. With the guidance of the formal learning objective, we can efficiently learn an index by incorporating the proposed sampling technique, and learn a precise index with enhanced generalization ability brought by the proposed result-driven gap insertion technique.

We conduct extensive experiments on real-world datasets and compare several indexing methods from the perspective of the index learning objective. The results show the ability of the proposed framework to help design suitable indexes for different scenarios. Further, we demonstrate the effectiveness of the proposed sampling technique, which achieves up to 78x construction speedup while maintaining non-degraded indexing performance. Finally, we show that the gap insertion technique can enhance both the static and dynamic indexing performance of existing learned index methods, with up to 1.59x query speedup. We will release our code and processed data for further study, which can enable more exploration of learned indexes from the perspectives of both machine learning and databases.
1 INTRODUCTION

Database indexes have been extensively studied in the database community for past decades, resulting in fruitful index method designs and broad real-world applications, e.g., [1, 9, 41, 50]. The related topics become even more important in the era of big data, as tremendous data are generated and collected from numerous sources every second. Recently, a new family of index, namely learned index, has attracted increasing attention due to its promising results in terms of both index size and query time [13, 16, 19, 29].

∗The first two authors contributed equally to this work and are joint first authors.

The research direction of learned index was opened up by [29], which regards traditional indexes such as B+Tree as models that predict the actual location within a sorted array for a queried key. From this view, indexes can be machine learning models trained from the data to be indexed, and the hidden distribution information of the data (for example, patterns and regular trends) can be leveraged to optimize the "indexing" components of traditional indexes. For example, to achieve small index size, the index storage layout of
B+Tree, i.e., the height-balanced tree, can be replaced by a small tree whose nodes are machine learning models with a few learnable parameters [29], or by a small B+Tree whose nodes maintain the learned parameters of a few piece-wise linear segments [19] rather than the whole data. To achieve fast query speed, the index querying algorithm of B+Tree is changed from traversing internal nodes with multiple comparisons and branches to inference of machine learning models with a few numeric calculations [13, 16].

However, the "learning" of learned indexes is still under-explored in existing learned index methods. First, how can we compare different learned indexes with evaluation from the perspective of machine learning, such as learning objectives and the capacity of chosen models? For different data with varied hidden patterns and different resource constraints, a formal learning objective can help us design and optimize learned indexes. Second, few existing learned index methods study the cost of index learning, while the sizes of real-world data keep increasing heavily. Can we reduce the learning cost of learned indexes to enhance their applicability for large-scale datasets? Third, existing learned index methods explore little about the distributions of the data to be learned. Can we deeply explore the data distribution information to further improve learned index performance?

Motivated by these three questions, in this paper, we propose a formal machine learning based framework to measure the index learning objective, and study two general and pluggable learning-oriented techniques, i.e., sampling to enhance the learning efficiency, and data re-distribution via gap insertion to improve the index performance from the view of learning effectiveness.

First, to formally quantify the learning objective and learning quality of learned indexes, we regard learned indexes as encoding mechanisms that compress data information into learned models, and measure a learned index using the minimum description length (MDL) principle [22]. By formulating an MDL-based objective function as the learning objective, we connect two important concepts of machine learning, overfitting and underfitting, to the flexible learning of indexes. With the help of the index learning objective, we discuss how to compare existing learned index methods and design suitable learned indexes for different scenarios.

Second, to reduce the learning cost and accelerate index construction, especially for large-scale datasets, we propose to learn the index with small sampled subsets. By capturing the data distribution information hidden in the data, it is possible to learn the index with a small subset of data while achieving high learning effectiveness. We theoretically prove the feasibility of learning an index with sampling, and provide an asymptotic guideline on how large the sample should be in order to learn an index with performance comparable to the index learned on the whole dataset.

Last but not least, to enhance the performance and generalization ability of off-the-shelf learned indexes with little extra effort, we study what kind of data distribution can be beneficial to index learning, and propose a data re-distribution technique via gap insertion. Specifically, we design a result-driven strategy to estimate the gaps that should be inserted in a data-dependent manner, and propose a gap-based data layout and a strategy to physically place keys on positions with gaps inserted.
In comparison to the original distribution, the re-distributed data is easier to learn and can boost the performance of static indexing operations. Surprisingly, the result-driven gap insertion also enables us to handle dynamic indexing scenarios well, because the estimated gap-inserted positions can be predictively reserved for keys possibly inserted in dynamic workloads.

We conduct comprehensive experiments on four widely adopted real-world datasets. We compare several index methods [1, 16, 19, 29] from the perspective of the proposed MDL-based framework, and examine these methods in terms of learning objective, model regularization and model capacity, providing a new understanding of them. Further, we apply the proposed two general learning-oriented techniques, sampling and data re-distribution via gap insertion, to existing learned index methods. Promising improvements are achieved for both techniques: the proposed sampling technique achieves up to 78x construction speedup while maintaining non-degraded query performance and reasonable prediction preciseness; the gap insertion technique significantly improves the preciseness of learned indexes and achieves up to 1.59x query speedup over strong learned index baselines.
Finally, we also show that learned indexes with gap insertion can achieve good performance on dynamic workloads.

To summarize, we make the following contributions: (1) We propose an MDL-based framework that enables formal analysis of learning objectives and comparison of different learned indexes; more importantly, the framework can guide us to design flexible learned indexes for various scenarios. (2) We propose a pluggable sampling technique that improves the learning efficiency of learned indexes, which is practical and useful for index construction acceleration, especially on large-scale datasets. (3) We propose a pluggable technique, data re-distribution via gap insertion, to better utilize the hidden distribution information of indexed data, which improves the preciseness and generalization ability of learned indexes. (4) We conduct comprehensive experiments to verify the effectiveness of the proposed techniques, and we will release our code and datasets to promote further studies. With these contributions, we hope to better connect the database community with the machine learning community around the topic of learned indexes.

Indexes as Mechanisms.
Let's first formalize the task of learning an index from data: Given a database D with n records (rows), assume that a range index structure will be built on a specific column x. For each record i ∈ [n], the value of this column, x_i, is adopted as the key, and y_i is the position where the record is stored (for the case of primary indexes), or the position of the pointer to the record (for the case of secondary indexes). y_i can be interpreted as the position of x_i in an array sorted by x_i, and to support range queries, a database to be indexed needs to satisfy the key-position monotonicity.

Definition 1 (Key-position Monotonicity). For a set of key-position pairs {(x_i, y_i)}, the key-position monotonicity means that, for any x_i and x_j, y_i < y_j iff x_i < x_j.
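As a concrete illustration of this setup, the following minimal sketch (our own illustrative code, not the paper's implementation) builds key-position pairs from a column of keys and checks the monotonicity of Definition 1:

```python
# Minimal sketch of the indexing setup and Definition 1 (illustrative only).

def build_key_positions(keys):
    """Sort the keys; position y_i is the rank of key x_i in the sorted array."""
    return [(x, y) for y, x in enumerate(sorted(keys))]

def is_monotonic(pairs):
    """Key-position monotonicity: y_i < y_j iff x_i < x_j."""
    return all((xi < xj) == (yi < yj)
               for xi, yi in pairs for xj, yj in pairs)

pairs = build_key_positions([42, 7, 19, 88, 3])
assert pairs[0] == (3, 0) and pairs[-1] == (88, 4)
assert is_monotonic(pairs)
```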
The task of learned index aims to train and learn a predictive index mechanism M(y|x) from D: M takes a record's key x as input and outputs a predicted position ŷ ← M(x) in the sorted array for data access. From this perspective, classic index structures such as B+tree [1, 24], B*tree [10] and Prefix B-tree [4] can also be regarded as such mechanisms designed by experts, which give the exact position or the page number for a given key.

Prediction-Correction Decomposition.
The learning of an index mechanism M is essentially to approximate the joint distribution of keys and positions. Ideally, the predicted position ŷ should be exactly the same as the true position of a record x, i.e., ŷ = y. However, in general, ŷ differs from y, since it is difficult to perfectly fit a real-world dataset containing complex patterns. Thus, the query process for x can be decomposed into a "prediction" step that gives the prediction ŷ based on x using M, and a "correction" step that finds the true position y of the indexed record based on ŷ.

For example, considering a B-Tree index, it gives the page number ŷ where the record x is located, and requires a further page scan to get the exact position of this particular record. For learned indexes, after getting the predicted position ŷ through machine learning model inference, we also need to conduct, e.g., a binary/exponential search around ŷ to find the true position y where the record is stored. The cost of this "correction" step depends on the difference |ŷ − y|. Indeed, one goal is to minimize the difference between ŷ and y. But the difference can be non-zero, which provides the flexibility to adjust the cost of the "prediction" step for a mechanism, i.e., how much space we need to store M and how much time we need to calculate M(x). This raises several questions: can we evaluate a learned index from the costs of these two decomposed steps? And what roles do these two steps play in index learning?
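The prediction-correction decomposition can be sketched as follows (a hedged illustration under our own assumptions: a hypothetical linear model predicts ŷ, then a bounded binary search around ŷ recovers the true position; the names and the toy key layout are ours, not the paper's):

```python
import bisect

# Sketch of the prediction-correction decomposition (illustrative only).

def predict(slope, intercept, x):
    """Prediction step: a hypothetical linear model gives y_hat."""
    return int(round(slope * x + intercept))

def correct(sorted_keys, x, y_hat, err):
    """Correction step: binary search within [y_hat - err, y_hat + err]."""
    lo = max(0, y_hat - err)
    hi = min(len(sorted_keys), y_hat + err + 1)
    return bisect.bisect_left(sorted_keys, x, lo, hi)

keys = [2 * i + 1 for i in range(100)]   # keys 1, 3, 5, ..., 199
slope, intercept = 0.5, -0.5             # exact fit: y = (x - 1) / 2
y_hat = predict(slope, intercept, 57)
pos = correct(keys, 57, y_hat, err=4)
assert keys[pos] == 57
```

The cost of `correct` grows with |ŷ − y|, which is exactly the L(D|M) term discussed next.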
In this section, we formally quantify index mechanisms using the Minimum Description Length (MDL) principle [21, 22]. We first connect the above prediction-correction decomposition of index mechanisms with MDL, then formulate the learning objective of index mechanisms with two general terms that can encode various indexing criteria, and finally discuss the physical meanings of the proposed concepts with several instantiations of index mechanisms.
The idea of MDL is to regard both the given data D and a mechanism M as codes, and to regard learning as appropriately encoding or compressing the data D using M. Then we can use the code length, or description length, to measure the "simplicity" of a mechanism: the more we learn the hidden patterns in the data and reduce its redundancy, the shorter the description lengths for the compressed data and the learned mechanism. Specifically, the description length of a mechanism M is decomposed into L(M) and L(D|M): L(M) indicates the description length of the mechanism itself, and L(D|M) indicates the conditional description length of D given M, which can be interpreted as how many extra bits we need to exactly describe D using M. In the context of learned index, we can link L(M) and L(D|M) to the prediction and correction steps respectively.

Formally, in this paper, we will use the MDL criterion to measure the quality of an index mechanism M as MDL(M, D) = L(M) + L(D|M). To be more specific, for the two terms:

• L(M) measures the cost of using M for prediction, i.e., getting the predicted position ŷ based on the key x, which is usually proportional to the size of M;

• L(D|M) = E_{(x,y)∈D} L(y, ŷ) measures the average cost of getting the true position y based on the predicted position ŷ in the position correction step.

Two Example Instantiations.
To better illustrate the idea of MDL, let's consider two instantiations of the mechanism M: a classic B-tree and a polynomial function. For a classic B-tree, L(M) denotes the cost of traversing the index tree, which is usually proportional to the height h of the index tree, i.e., L(M) ∝ h. The concrete form of L(M) = f(h) depends on several implementation details of the B-tree, such as the maximum number of children a node can have, or the minimum number of children an internal (non-root) node can have. The second term L(D|M) denotes the cost of the page scan in the leaf node, which is a function of the page size s, i.e., L(D|M) = f(s). When binary search is adopted, f(s) is a linear function of log(s).

For the second example, let's consider the case where a polynomial function is learned to act as the mechanism, that is, M(x) = Σ_{t=0}^{T} (a_t · x^t + c_t), where a_t and c_t are learnable parameters for the corresponding degree t, and T is the highest degree. In this example, L(M) can be measured as O(T), because O(T) multiplications are needed to get the predicted position ŷ based on the key x. For the second term in MDL, L(D|M) = E_{(x,y)∈D} L(y, ŷ) = E_{(x,y)∈D} (log(|y − ŷ|) + 1) if binary search is adopted for the correction step.

From the above discussion of the two instantiations, we can see that MDL provides a criterion to formally analyze and compare different indexes, including both classic B-trees and machine learning based indexes. This enables us to design suitable index structures for various scenarios. Formally, given the data D and a family of possible mechanisms M, the process of learning an index from data can be formulated as finding the best mechanism M* that minimizes the description length as follows:

M* = argmin_{M∈M} MDL(M, D) = argmin_{M∈M} (L(M) + α·L(D|M)),   (1)

where α is a coefficient to balance the two terms in MDL.
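Equation (1) can be sketched as a small model-selection loop (our own illustrative instantiation: L(M) is taken as the number of parameters, L(D|M) as the average binary-search cost log2(|y − ŷ| + 1), and the two candidate "mechanisms" are hypothetical):

```python
import math

# Illustrative instantiation of Equation (1): pick the candidate mechanism
# minimizing L(M) + alpha * L(D|M).

def l_data(model, data):
    """Average correction cost: log2(|y - y_hat| + 1) over the dataset."""
    return sum(math.log2(abs(y - model(x)) + 1) for x, y in data) / len(data)

def mdl(model, n_params, data, alpha):
    return n_params + alpha * l_data(model, data)

data = [(x, x // 2) for x in range(0, 200, 2)]   # positions follow y = x / 2
candidates = [
    (lambda x: round(0.5 * x), 2),               # well-fitted line
    (lambda x: round(0.25 * x), 2),              # poorly-fitted line
]
alpha = 1.0
best = min(candidates, key=lambda c: mdl(c[0], c[1], data, alpha))
assert best is candidates[0]                     # the well-fitted line wins
```

Larger α penalizes correction cost more heavily, favoring bigger but more precise mechanisms, which mirrors the overfitting/underfitting trade-off discussed below.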
From the view of machine learning, D is fed into the learning procedure stated in Equation (1), and the description length MDL(M, D) acts as the objective function to be minimized. By fitting the dataset, the learning procedure finds an optimal mechanism M to predict the position y based on a given x. If a mechanism achieves the minimum description length, it learns and stores the underlying key-position distribution information of the data in the most compact form. For example, given a toy dataset whose key-position pairs lie on a line, a learned index can represent it with a single rounded linear function M(x) = Round(a·x + b): instead of storing the key-position pairs, only the parameters of M need to be stored, which reduces the redundancy of the raw data and compresses the information to store. Although the distributions of real-world data are not so simple, various patterns can be mined by machine learning methods and expressed in compact forms.

In Equation (1), the three factors L(M), L(D|M) and α act as the knobs to tune the performance of learned indexes, and next we will discuss these factors in more detail.

Choice of L(M) and L(D|M).
To learn an index mechanism from data, the first step is to choose the concrete forms of these two terms in the above framework. As the list of keys and corresponding positions {(x, y)} is given and the underlying physical storage format is fixed, we can first choose a specific search method for the correction step, and thus the concrete form of L(D|M) = E_{(x,y)∈D} L(y, ŷ) can be determined. L(M) has more flexibility: it can be set as the number of operations to calculate M(x), the number of model parameters in M, or the sum of the p-norms of all model parameters, etc. The choice of L(M) can be made by considering the requirements of database systems and the constraints of computation resources (e.g.
, on-disk or in-memory) to fit various scenarios. For example, the existing learned index method PGM [16] implicitly adopts L(D|M) = log(|ŷ − y|) + 1 and sets L(M) to be the number of learned model parameters, as it uses an optimal linear segmentation learning algorithm.

Overfitting vs. Underfitting.
When learning to build the index from data, the coefficient α in Equation (1) plays an important role. From the perspective of machine learning, L(D|M) = E_{(x,y)∈D} L(y, ŷ) measures the prediction loss on the training data D, while L(M) is the regularization term of the learned model. These two terms are usually in conflict: we can learn a very complex model M to make the prediction loss on the training data zero or close to zero, which leads to a small L(D|M) but a large L(M), the so-called overfitting phenomenon [8, 46]; on the other hand, if we learn a simple model, L(M) is small while L(D|M) is large, as the model is too simple to make precise predictions, the so-called underfitting phenomenon. The coefficient α makes a trade-off between these two terms and aims to learn an index mechanism with the minimum description length, i.e., a relatively simple model M that also has a small prediction loss on the training data.

α in Existing Index Methods.
For existing index methods, there are some tunable parameters playing the role of α to adjust the L(M) term and the L(D|M) term in MDL, for example, the page size of B+Tree, the tree depth and layer width of RMI [29], and the error bound ε of FITing-Tree [19] and PGM [16]. These parameters are to be tuned for a given D and reflect the degree to which we want to fit D: with a small page size, a deep and wide model tree, and a small ε, these index mechanisms will gain substantially smaller prediction errors L(D|M) and usually a larger L(M), such as larger index sizes or longer prediction times. In other words, these parameters implicitly act as regularization factors in index learning. In the experiments (Section 6.2), we will investigate the trade-offs of several index mechanisms with different αs in more detail.

As formulated above, we aim to choose M from a candidate family M to minimize the objective function L(M) + α·L(D|M) for a given dataset D.
When |D| is large, this is an expensive learning task. In fact, it is expensive just to evaluate the loss L(D|M). In this section, we investigate a computationally efficient solution via sampling.

4.1 Accelerating Index Construction

Recall that the cost of the correction step (corresponding to the term L(D|M) in the MDL objective function) is proportional to log|y − M(x)| = log|y − ŷ| when binary search is adopted (refer to Two Example Instantiations above). Let's focus on the objective function in the following form:

L(D|M) = E_{(x,y)∈D} [L(y, M(x))] = E_{(x,y)∈D} [log|y − M(x)|].

Let E be the maximum absolute prediction error, i.e., the maximum absolute value of all differences between predicted positions and true positions: ∀(x, y) ∈ D, |M(x) − y| ≤ E. We then have the upper bound L(y, M(x)) ≤ log E for any (x, y) ∈ D and a binary or an exponential search. Thus, we can show that a small sample from D suffices to approximate L(D|M):

Proposition 1. Given that L(y, M(x)) ≤ log E, we draw a random sample D_s from D with |D_s| = n_s. For any candidate mechanism M, we can estimate L(D|M) using

L(D_s|M) = (1/n_s) Σ_{(x,y)∈D_s} L(y, M(x)),

s.t., with probability at least 1 − δ, we have

|L(D_s|M) − L(D|M)| ≤ (log E / √(2·n_s)) · √(log(2/δ)).

Proof. Here we provide a proof sketch due to the space limitation. Consider the random sample {(x, y) ∈ D_s}. Indeed, we have

E[L(D_s|M)] = (1/n_s) Σ_{(x,y)∈D_s} E[L(y, M(x))] = L(D|M).

Since L(y, M(x)) ∈ [0, log E], we apply Hoeffding's inequality to finish the proof.

We can interpret L(D|M) as the expected cost of the correction step. Theoretically, we only need to estimate it with an error up to a constant factor (e.g., no more than the size of a page). To this end, we only need to draw a small sample.

Theorem 1. Consider the optimization problem

M* = argmin_{M∈M} MDL(M, D) = argmin_{M∈M} (L(M) + α·L(D|M)).

We can solve it on a random sample D_s with size n_s = O(α²·log²E) as

M̂* = argmin_{M∈M} MDL(M, D_s),

s.t., MDL(M̂*, D) ≤ MDL(M*, D) + O(1) with high probability.

Proof. Indeed, we have

MDL(M̂*, D) − MDL(M*, D)
= MDL(M̂*, D) − MDL(M̂*, D_s) + MDL(M̂*, D_s) − MDL(M*, D_s) + MDL(M*, D_s) − MDL(M*, D)
≤ α·|L(D|M̂*) − L(D_s|M̂*)| + 0 + α·|L(D|M*) − L(D_s|M*)|,

where MDL(M̂*, D_s) − MDL(M*, D_s) ≤ 0 by the optimality of M̂* on D_s. Using Proposition 1, both terms |L(D|M̂*) − L(D_s|M̂*)| and |L(D|M*) − L(D_s|M*)| can be bounded to complete the proof.

We can interpret Theorem 1 as follows. For the purpose of minimizing the MDL function considered in Equation (1), it suffices to learn the index M on a small random sample with size as small as O(α²·log²E). Further, the parameter α controls how much we weigh the cost of the correction step relative to the size of the index L(M) in our goal. The larger the weight is, the larger the sample we need to draw in order to learn the index; in other words, more details about the data are needed to learn an index at a finer granularity. It should be noted that there are constants hidden in the big-O notation, which do matter in practice; nevertheless, the above theorem enables us to speed up index learning with small samples, and serves as an asymptotic guideline on how large the sample needs to be. We will show that the proposed sampling technique can achieve significant construction speedup while maintaining comparable query performance in the experiments (Section 6.3).

E in Existing Learned Index Methods.
Now let's examine the maximum prediction error E for several existing learned index methods. FITing-Tree [19] and PGM [16] introduce an error bound ε to limit the maximum absolute difference between the actual and predicted position of any key, and thus E can be set as E = ε. For RMI [29], recall that the maximum prediction errors (y′ − y), where y′ can be larger or smaller than y, are recorded during training, and thus E can be bounded by the maximum absolute value of them: E = max(|max_positive_error|, |min_negative_error|).
For these existing learned index methods, we can see that E is usually far smaller than |D|, indicating that we can learn an index mechanism efficiently by sampling a small subset of D. In the experiments (Section 6.3), we will see that these three learned index methods require different numbers of samples to achieve non-degraded performance, and their sample sizes meet our asymptotic analysis O(α²·log²E) with different αs and different Es.

Unseen Keys in Sampling.
Interestingly enough, the sampling technique can bridge two indexing scenarios, static indexing and dynamic indexing, which correspond to two machine learning concepts, overfitting and generalization. In static indexing scenarios, we only need to consider how to overfit D, since all {x_i, y_i} to be indexed can be accessed during learning. However, when learning on a sampled subset D_s, the learned index mechanism M has to consider the model's generalization ability to precisely predict positions for the unseen data D_{−s} = D − D_s. Similarly, for dynamic indexing scenarios that include possibly inserted data, a learned index mechanism M needs to generalize to unseen new keys. This suggests that if we can learn an index mechanism with good generalization ability from a small sampled dataset D_s, we will be able to learn a dynamic index that has good generalization ability and can handle inserted keys. Thus, as shown in the example of Figure 1(a), besides index construction speedup, the sampling technique also improves the model's generalization ability. This seems to contradict machine learning theory: usually the generalization bound improves when we have more data, not less. However, for learned indexes, the data pairs are not independent, and a small sample can be enough to learn the correlation between keys and their positions.
Moreover, the sampling technique can help to learn the underlying distribution of D better by extracting more general patterns (red segments in Figure 1(a)) from D_s and alleviating the overfitting (blue segments in Figure 1(a)) caused by some noisy keys in the unseen D_{−s}.

Figure 1: Examples of learned index with sampling and gap insertion. (a) Sampling. (b) Gap Insertion.
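The estimation behind Proposition 1 can be sketched as follows (an illustrative experiment under our own assumptions: uniformly drawn keys, an identity-like hypothetical model, and the binary-search cost log2(|y − ŷ| + 1); none of this is the paper's released code):

```python
import math
import random

# Sketch of Proposition 1: L(D|M) estimated on a small random sample D_s.

random.seed(0)
n = 100_000
keys = sorted(random.uniform(0, n) for _ in range(n))
data = list(zip(keys, range(n)))              # (key, true position) pairs

def model(x):
    return int(x)                             # y_hat ~ x for uniform keys

def l_data(pairs):
    """Average binary-search cost log2(|y - y_hat| + 1)."""
    return sum(math.log2(abs(y - model(x)) + 1) for x, y in pairs) / len(pairs)

full_cost = l_data(data)                          # expensive: touches all of D
sample_cost = l_data(random.sample(data, 2_000))  # cheap sample estimate
assert abs(full_cost - sample_cost) < 0.3         # close, as the bound predicts
```

Because each per-pair cost is bounded by log2(E), Hoeffding-style concentration makes the 2,000-pair estimate close to the full 100,000-pair average.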
As discussed above, the proposed sampling technique can, to some extent, improve the generalization ability of learned index methods. Here we explore further along this direction: can we further enhance the generalization ability of learned index methods and improve their preciseness? In the fruitful research on index structures in the database community, there are some ideas about how to leave certain empty space for dynamic data, such as the Packed Memory Array [5] or the Gapped Array [13]. Actually, we can further leverage the empty space to re-distribute the data such that the updated key-position distribution is easier to learn and the generalization ability of the index can be further enhanced.

Let's consider the example in Figure 1(b): can we insert some gaps, move the original data (i.e., blue dots) to the gap-inserted positions (i.e., green dots), and learn the index as the red segments? If so, the newly learned index (i.e., red segments) fits the gap-inserted data better and has better generalization ability. However, we should note that the number of inserted gaps cannot be very large, since the gaps increase storage and query costs, which are also part of the optimization goals.
Result-Driven Gap Insertion.
The gap insertion can be formulated as a position manipulation problem: given a set of data D = {(x_i, y_i)}_{i=1}^{n} satisfying key-position monotonicity, how do we insert gaps and adjust each record's position from y_i to y^g_i while preserving the key-position monotonicity, such that the preciseness of the learned mechanism is improved? Formally, let D^g = {(x_i, y^g_i)}_{i=1}^{n} be the gap-inserted data, and let u_i = (y^g_i − y^g_{i−1}) − (y_i − y_{i−1}) denote the number of gaps inserted between x_{i−1} and x_i. Then y^g_i = y_i + Σ_{1≤j≤i} u_j, and we aim to find a manipulated D^g that maximizes the objective:

max_{y^g_i} Σ_{(x_i,y_i)∈D} [L(y_i, M*_D(x_i)) − L(y^g_i, M*_{D^g}(x_i))]
s.t. ∀i: u_i ∈ Z_{≥0}, and Σ_{i=1}^{n} u_i ≤ ρ·n,
∀i: y^g_i = y_i + Σ_{1≤j≤i} u_j,
∀i, j: y^g_i < y^g_j iff x_i < x_j,   (2)

where ρ indicates the gap ratio that defines at most how many gaps (i.e., ρ·n) can be introduced, and M*_D and M*_{D^g} indicate the optimal index mechanisms learned from the original data D and the gap-inserted data D^g respectively.

Inspired by block coordinate gradient descent [6], we describe a two-step result-driven solution to the optimization problem in Equation (2): the solution first proposes a new index mechanism M′ that can achieve better preciseness than the index mechanism on the original data, and then moves records to positions y^g that are as close as possible to the positions predicted by M′. In other words, as the new index mechanism M′ meets our optimization goal, i.e., reducing the prediction error, we can "backward infer" y^g using the predictions from M′ and learn a mechanism M_{D^g} from the estimated gap-inserted data D^g.
Figure 2: The illustration of result-driven gap insertion (original segment vs. hypothetical segment).
Specifically, since linear models are adopted in existing learned index methods including RMI, FITing-Tree and PGM, here we discuss how to manipulate the positions for linear models based on the proposed result-driven solution. Considering that real-world datasets usually cannot be indexed with only one linear model, we first learn an index mechanism with K linear segments to globally split the data, and then locally insert gaps for each linear segment. Let's consider the example in Figure 2: for the k-th linear segment (k ∈ [1, K]), we can propose a hypothetical linear model (the red line) by connecting two anchoring points after gap insertion: the first data point (x_{k,1}, y^g_{k,1}) and the last data point (x_{k,m}, y^g_{k,m}). For the first data point, y^g_{k,1} = y_{k,1} + Σ_{j<k} U_j, where U_j is the total number of gaps inserted in the previous j-th segment; for the last data point, y^g_{k,m} = y_{k,m} + Σ_{j<k} U_j + ρ(y_{k,m} − y_{k,1}), as ρ(y_{k,m} − y_{k,1}) is the total number of gaps to be inserted in the k-th segment. In this way, the gap-inserted positions y^g_{k,i} can be calculated as:

y^g_{k,i} = y^g_{k,1} + (x_{k,i} − x_{k,1}) × (y^g_{k,m} − y^g_{k,1}) / (x_{k,m} − x_{k,1})
         = y_{k,1} + Σ_{j<k} U_j + (x_{k,i} − x_{k,1}) × (y_{k,m} − y_{k,1})(1 + ρ) / (x_{k,m} − x_{k,1}).   (3)

From a geometric perspective, the data after inserting gaps are placed along the red hypothetical line; consequently, the data are re-distributed to meet our "expected linear results". Note that the estimated y^g values can be non-integers, and we will discuss how to physically place the keys according to their y^g values in Section 5.2. Now let's first theoretically examine the effectiveness of the proposed result-driven gap insertion technique.

Theoretical Analysis.
In this section, we theoretically show that the index mechanism learned on D^g achieves better preciseness and a tighter generalization bound than the one learned on D.
Here we leverage the information bottleneck principle [47] to quantify the mutual information between the input keys and the positions to be predicted, and analyze the preciseness and generalization of indexes learned on D^g and D.

Let's consider the index learning task in the context of information compression: the key x ∈ X and the position y ∈ Y are both random variables, and there are some statistical dependencies between x and y. The index learning task can be regarded as finding an optimal compact representation x̃ of x, which compresses x by removing the parts of x that have no contribution to the prediction of y, and then learning the correlation between x̃ and y. Formally, the index learning task minimizes I(x; x̃) − β·I(x̃; y), where I(·) indicates mutual information and β is a coefficient to adjust the trade-off between the degree of compression I(x; x̃) and the degree of preserved predictive information I(x̃; y).

Let X̃ be the set of minimal sufficient statistics of X with respect to Y, i.e., ∀x ∈ X, x̃ ∈ X̃, y ∈ Y: p(x|x̃, y) = p(x|x̃). In the context of index learning, we interpret X̃ as the optimal hidden compact representation of the keys X. Note that X̃ = X is always sufficient for Y, while usually not the optimal compact one with minimal |X̃|. Then we can see that |X̃| becomes smaller on D^g after inserting gaps into the original data D, which makes the index learning easier and improves the index preciseness. To clarify the reason, recall that we insert gaps based on several hypothetical linear models. As the linearity of the transformed data increases, the linear correlation between keys and gap-inserted positions increases accordingly. In other words, we need less data information to learn the parameters of the linear models, resulting in a more redundant X̃ and a smaller |X̃|.
In the extreme case, X̃ can contain only one trivial x̃ if the whole transformed data lies on a single line. Moreover, the index learned on D^g achieves a tighter generalization bound after inserting gaps into the original D. Let Î(·) be the empirical estimate of the mutual information for a given dataset, which is a sample of size n from the joint distribution {X, Y}. Shamir et al. [47] prove that the generalization error E(x̃, y) = |I(x̃; y) − Î(x̃; y)| can be bounded in a data-dependent form with probability at least 1 − δ:

E(x̃, y) ≤ (|X̃| + 1)√(log(4/δ)) / √(2n) + ((|Y| + 1)(|X̃| + 1) − 4) / n.   (4)

In summary, the generalization error between the optimal hidden compact representation and the empirical estimate from a finite sample D or D^g is bounded by O(|X̃||Y| / √n). Now let's compare the bounds on the original data D and the gap-inserted data D^g. Note that we transform the original y into y^g with a one-to-one mapping due to the key-position monotonicity, and thus |Y| and the data size n remain the same. The other factor, |X̃|, does matter, and it becomes smaller after gap insertion as analyzed above. As a result, the index mechanism learned on D^g has a tighter generalization bound than the index mechanism learned on D, and it generalizes better to possible keys. Later, we will discuss how the proposed gap insertion technique handles dynamic scenarios in Section 5.3 and experimentally confirm its advantages in Section 6.4.

Gap Insertion for Non-Linear Models . So far, our discussion of the proposed gap insertion is based on a collection of linear models. However, the idea of result-driven gap insertion is general and easy to extend to other, non-linear models. Specifically, as long as the non-linear models to be learned are monotonically increasing functions, we can also introduce non-linear hypothetical lines by anchoring a few points whose transformed positions can be determined.
Then, based on the hypothetical lines, we can infer the positions of the keys at the other, non-anchoring points so that they fit the hypothetical models.
Physical Key Placement . We have discussed how to logically estimate the gaps to be inserted such that a better index can be learned on the gap-inserted data. However, as the estimated positions can be non-integral, we need to round them into integers and physically place the keys according to their adjusted positions. This is non-trivial, since we have to maintain the key-position monotonicity during the placement. Let's assume two anchoring keys x_i and x_j with x_i < x_j, whose physical positions are integers and have been determined to be y^g_i and y^g_j respectively. In practice, the first and last points of each learned segment can be such anchoring points (please refer to Result-Driven Gap Insertion above). Considering m non-anchoring keys to be placed, whose keys are larger than x_i and less than x_j, we need to choose suitable positions for them. On one hand, the rounding of estimated (non-integral) positions may lead to conflicting y^g s. On the other hand, there may be more than m positions available for physical placement, i.e., y^g_j − y^g_i > m.

Figure 3: Illustration of linking array based key placement.
To address this problem, we propose a linking array based key placement strategy, as shown in Figure 3. We place a key x_l according to its predicted position M(x_l). If two keys get conflicting predictions, we place them in an external linking array associated with the same position. This strategy reduces the differences between physical positions and the result-driven estimated positions by fully utilizing the prediction ability of the learned index M, at the price of introducing potential disambiguation costs to determine exact positions in the linking arrays, and increasing storage costs for the linking arrays. We also explored two other strategies to balance the advantage of accurate prediction and the disadvantage of extra cost, and experimental results show that the strategy presented above works well in practice. Due to space limitations, we omit the details and keep the above linking array based key placement strategy.

As a result, we place all data in a gapped array G and several linking arrays A = {A_i}. A linking array A_i contains the i-th occupied key in G and at least one other key having the same position i in G. The i-th occupied key in G is the minimum key of A_i, i.e., G(i) = min(A_i). Clearly, the key-position monotonicity is maintained for keys in the first-level gapped array. Next, we describe how to read (i.e., the lookup operation) and write (i.e., the dynamic scenario) with the gap-based array.

Lookup Operation . As mentioned above, in the gapped array G, there are more positions than indexed keys, i.e., y^g_j − y^g_i > m. Further, the proposed linking arrays can increase the number of unoccupied positions. So how do we conduct lookup operations on such a data layout? The key idea is to maintain a total order for the keys in the first-level gapped array G, such that the unoccupied positions in G are comparable to indexed keys.
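A minimal sketch of the linking-array placement could look as follows. The helper names are hypothetical, and details such as payloads and the repair of out-of-order rounded predictions are omitted:

```python
def place_keys(keys, predict, size):
    """Place sorted keys into a gapped array G of `size` slots.

    predict(key) -> estimated (possibly non-integral) position, assumed
    monotone in key. Keys whose rounded predictions collide are spilled
    into a linking array; G keeps the minimum key of each collision set.
    """
    G = [None] * size
    linking = {}                                  # slot -> keys sharing it
    for key in keys:                              # increasing key order
        pos = min(max(round(predict(key)), 0), size - 1)
        if G[pos] is None:
            G[pos] = key
        else:                                     # conflict: linking array
            bucket = linking.setdefault(pos, [G[pos]])
            bucket.append(key)                    # G[pos] stays the minimum
    return G, linking
```

Because the keys are processed in increasing order, G(i) = min(A_i) holds by construction, so the first-level gapped array remains monotone.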
Specifically, we set the key of an unoccupied position to be the key of the first occupied position to its right, and use an additional indicator to mark that such unoccupied positions have no payloads. In this way, we can define a comparison rule for the situation of having the same key but different payloads: a key pointing to an empty payload is smaller than the same key pointing to a non-empty payload, and thus G is totally ordered. In practice, if a searched position points to a secondary linking array, we conduct a linear scan on the linking array, since there are usually only a limited number of keys with the same conflicting predicted position.

Handling Dynamic Scenarios . Dynamic operations, especially inserting new keys, are challenging for learned indexes, since the machine learning models may need to be re-trained to maintain precise predictions. Thanks to the proposed gap insertion, we can introduce a simple yet effective extension to efficiently support dynamic operations while maintaining comparable prediction preciseness without model re-training. This is because we have re-distributed the data by introducing gaps in a result-driven manner. Further, we place keys directly according to their predicted positions and allow position conflicts, so the unoccupied positions are data-dependently reserved for possibly inserted new keys. In other words, we can insert new keys based on their positions predicted by a learned index M. These positions can be either unoccupied or occupied, and in both cases the inserted data follow the hidden key-position distribution already learned by M, such that M maintains its preciseness at no cost, or at a small correction cost in the lookup operation. Specifically, given a key x to be inserted, we expect to place x at a position as close as possible to its position predicted by M.
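Returning to the lookup operation described above, a simplified sketch is given below. It assumes empty slots store `None` rather than a copy of the right-neighbor key, and uses a plain bounded scan where the real index would use binary or exponential search; the helper names are assumptions:

```python
def lookup(G, linking, key, predict, err):
    """Predict a slot, correct within a bounded window, then linearly scan
    the linking array of a slot if the key was spilled there."""
    pos = min(max(round(predict(key)), 0), len(G) - 1)
    lo, hi = max(pos - err, 0), min(pos + err + 1, len(G))
    for i in range(lo, hi):                 # bounded correction window
        if G[i] == key:
            return i
        if i in linking and key in linking[i]:
            return i                        # found via the linking array
    return None
```

The linear scan over a linking array is cheap in practice because only a handful of keys share a conflicting predicted position.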
Meanwhile, to simplify the lookup operation after new data insertion, we need to maintain the total order of G with the corresponding linking arrays A: ∀A_{i−1} ∈ A, G(i−1) = min(A_{i−1}) ≤ max(A_{i−1}) < G(i). Using M, we first get x's predicted position ŷ and the position y_ub of its upper bound in G, i.e., the position of the largest key in G that is less than or equal to x; then we insert x either at G(ŷ) (unoccupied case) or into the linking array A_{y_ub} (occupied case), to maintain the key-position monotonicity and the total order of G.

For the delete operation, we first look up the key x to be deleted. If x is stored in a linking array A, we can simply remove x from A when |A| > 2; when |A| = 2, we make the corresponding position an occupied one holding the other key in A, and delete A. Otherwise, the deletion creates an unoccupied position, and we need to update the keys in G whose values equal x by setting them to the key at x's upper bound's position y_ub, i.e., ∀y_j: G(y_j) = x, set G(y_j) ← G(y_ub). For the update operation, we look up the data to be updated using its key and simply reset its payload value.

Now let's analyze the cost of the proposed gap-based index learning. Recall that we first learn an index with K linear segments on the original data D = {x_i, y_i}_{i=1}^{n}, then generate the gap-inserted data D^g = {x_i, y^g_i}_{i=1}^{n}, and finally re-learn a better index on D^g. With existing learned index methods such as FITing-Tree [19] and PGM [16], the costs of these three steps are all O(n). Although the complexity of the proposed gap-based index learning is still O(n), we introduce some extra training cost due to the inserted gaps. Fortunately, we can leverage the sampling technique introduced in Section 4 to further reduce the learning cost using a small sampled dataset, and thus maintain high learning efficiency.

Combining Sampling and Gap Insertion . Formally, with a sample rate s ∈ (0, 1.0] and sample size n_s = n · s, we first learn an index M on a sampled subset D_s = {x_i, y_i}_{i=1}^{n_s} of D. Then we use all K segments of M to estimate the positions of the anchoring keys, physically place the other non-anchoring keys based on their positions predicted by the proposed hypothetical models, and get the gap-inserted dataset D_{s,g} = {x_i, y^g_i}_{i=1}^{n_s}. Finally, we re-learn an index mechanism on D_{s,g}, and get the whole gap-inserted data D^g = {x_i, y^g_i}_{i=1}^{n} by physically placing the un-sampled keys D_{−s} = D − D_s into D_{s,g}. Summing up, we can learn an index at a data scale of n_s and physically maintain the whole data layout in O(n).
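The insertion path described above can be sketched as follows. This is a deliberate simplification with `None` gaps and per-slot linking arrays, ignoring the upper-bound adjustment needed to maintain the full total order; the helper names are assumptions, not the released implementation:

```python
def insert_key(G, linking, key, predict):
    """Insert `key` at its predicted slot if it is an unoccupied gap;
    otherwise spill it into the linking array of that slot, keeping
    G(i) = min(A_i) so the first-level array stays ordered."""
    pos = min(max(round(predict(key)), 0), len(G) - 1)
    if G[pos] is None:
        G[pos] = key                       # a reserved gap absorbs the key
    else:
        bucket = linking.setdefault(pos, [G[pos]])
        bucket.append(key)
        bucket.sort()
        G[pos] = bucket[0]                 # keep the minimum key in G
```

Because the gaps were reserved in a result-driven way, most insertions land on unoccupied slots and the model M needs no re-training.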
In Section 6.4.2, we will experimentally show that by combining the sampling and gap insertion techniques, both the preciseness and the efficiency of existing learned index methods can be significantly improved.

In this section, we conduct experiments aiming to answer the following questions: (1) What are the strengths and weaknesses of the existing learned index methods, evaluated by the proposed MDL framework (Section 6.2)? (2) Can we improve the learning efficiency of learned indexes with the sampling technique (Section 6.3)? (3) Can we enhance the learning effectiveness of learned indexes with the proposed result-driven gap insertion technique (Section 6.4)?
We conduct all the experiments on a Linux server with an Intel Xeon Platinum 8163 2.50GHz CPU, whose L1, L2 and L3 cache sizes are 32KiB, 1MiB and 33MiB respectively.
Baselines . We compare one traditional index method and three learned index methods, which are also adopted as base methods to incorporate the plug-in sampling and gap insertion techniques.

B+ Tree : we use a standard in-memory B+ Tree implementation, stx::btree (v0.9) [7]. Following [19, 29], we evaluate the B+ Tree index with dense pages, i.e., a filling factor of 100%.

Recursive Model Index (RMI) [29]: RMI is a hierarchical learned index method consisting of typically two or three layers of machine learning models. Following previous works [13, 16, 29], we adopt a two-layer RMI with linear models.

FITing-Tree [19]: This is an error-guaranteed learned index method using a greedy shrinking-cone algorithm to learn piece-wise linear segments. The learned segments are organized by a B+ Tree, and here we adopt stx::btree to organize them.
Piecewise Geometric Model (PGM) [16]: PGM is a state-of-the-art error-guaranteed learned index method, which improves upon FITing-Tree by learning an optimal number of linear segments. There are three PGM variants, based on binary search, CSS-Tree [43] and recursive construction. Here we evaluate the recursive version, since it beats the other two variants.
Datasets . We conduct experiments on four widely adopted real-world datasets that cover different data scales, key types, data distributions and patterns:
Weblogs [16, 19, 29]: The Weblogs dataset contains about 715M log entries of requests to a university web server. The index keys are unique log timestamps. This dataset contains typical non-linear temporal patterns caused by school online transactions, such as department events and class schedule arrangements.
IoT [16, 19]: The IoT dataset contains about 26M recordings from different IoT sensors in a building. The index keys are unique timestamps of the recordings. This dataset has more complex temporal patterns than Weblogs, since IoT data are more diverse (e.g., motion, door, etc.) and prone to noise during data collection.
Longitude and LatiLong [13, 16, 19, 29]: These two datasets contain location-based data collected around the world from Open Street Map [38]. The index keys of Longitude are the longitude coordinates of about 1.8M buildings and points of interest. Similar to [13], the index keys of LatiLong are compounded from latitudes and longitudes as key = c × latitude + longitude, for a scaling constant c.

Evaluation Metrics . For the storage cost evaluation, we measure the index size. We use 64-bit payloads for all baselines and 64-bit key pointers for all datasets. The index size of B+ Tree is the sum of the sizes of its inner nodes and leaf nodes, including payloads. The index size of RMI is the sum of the payloads and the sizes of its linear models, including slopes, intercepts, and maximum positive/negative prediction errors stored as double-precision floats. The index sizes of FITing-Tree and PGM are the sum of the payloads and the sizes of their linear segment models, including slopes and intercepts stored as double-precision floats.

For the efficiency evaluation, we measure several kinds of time costs in nanoseconds, including the index construction time, the index prediction time per query (i.e., getting the predicted position ŷ given a queried key x), the index correction time per query (i.e., getting the true position y given ŷ), and the overall query time per query (i.e., getting y given x). Besides, we calculate the Mean Absolute Error (MAE) between predicted and true positions as (1/|D|) Σ_{x∈D} |y − ŷ|. MAE is a metric widely adopted for machine learning algorithms, and it determines the index correction time in the context of learned indexes.

The proposed MDL-based framework quantifies an index with the terms L(M) and L(D|M). In this subsection, we compare several existing index methods to demonstrate the performance trade-offs between L(M) and L(D|M), and the impact of varying the mechanism family M. We demonstrate the results on the IoT dataset and omit the results on the other three datasets due to the space limitation and similar conclusions. From the view of machine learning, L(D|M) measures the prediction loss on training data, while L(M) plays the regularization role for the learned model.
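As a small aside on the metrics above, the MAE used throughout the experiments reduces to a few lines:

```python
def mean_absolute_error(true_pos, pred_pos):
    """MAE = (1/|D|) * sum over x in D of |y - y_hat|; in a learned index
    it approximates the correction range around a predicted position."""
    assert len(true_pos) == len(pred_pos) and true_pos
    return sum(abs(y - p) for y, p in zip(true_pos, pred_pos)) / len(true_pos)
```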
These two terms are usually in tension and can be balanced by the coefficient α that controls the performance trade-off. As analyzed in Section 3.2, several parameters of existing index methods implicitly take the role of α: the number of layer-2 models in RMI is proportional to α, while the page size of B+ Tree and the error bound ϵ of FITing-Tree and PGM are inversely proportional to α. Here we vary these tunable parameters to study the various performance trade-offs obtained by adopting different L(M) and L(D|M).

Trade-off between Storage Cost and Query Efficiency.
We first set L(D|M) = t_q(D|M), the overall query time per query, and L(M) = SIZE(M), the index size, to explore the trade-off between storage cost and query efficiency. We plot the curves of overall query time per query and index size by varying the αs of the different methods in Figure 4.

Figure 4: Trade-off between storage cost and query efficiency of four index methods.
Overall, all four methods show consistent trade-off trends: a smaller α achieves a smaller index size but leads to a larger query time, as it penalizes the term L(D|M) less. Here we test several αs for each method, and among them some choices gain good trade-offs (e.g., ϵ = 128 for FITing-Tree). These results naturally raise an open question: what is the "best" space-time trade-off, and how can it be achieved by tuning α? In practice, it depends on the demands of users. PGM has explored both ends of the space-time trade-off by varying the degree of linear approximation and searching for an ϵ that achieves minimal space or minimal query time.

Table 1: Performance comparison on the IoT dataset for B+ Tree with dense pages, RMI, FITing-Tree and PGM. For RMI, FITing-Tree and PGM, |M| indicates their numbers of last-layer linear models. T indicates time in ns, and the index size is in bytes.

Method                   | T_build       | T_predict | T_correct | T_overall | Index Size  | MAE
B+ Tree (pageSize=256)   | 7,770,595,420 | 305       | 338       | 643       | 489,877,504 | 63.5005
RMI (|M|=100k)           | 682,842,396   | 68        | 539       | 607       | 128,720,736 | 173.513
FITing-Tree (|M|=11,830) | 788,580,446   | 106       | 203       | 309       | 121,910,880 | 27.3392
PGM (|M|=8,813)          | 1,556,264,268 | 121       | 224       | 355       | 121,740,264 | 36.817

In the context of MDL, we can explicitly incorporate the preferences of users into the objective function. Then the fruitful ideas of hyper-parameter optimization [17] can be incorporated to automatically search for an optimal α that achieves preferable trade-offs. Also note that the query time of B+ Tree increases heavily from page size 256 to 512, since the index begins to fall out of the L1 cache. Inspired by this phenomenon, we see an interesting future direction: designing cache-sensitive regularization in index learning. For example, we can harness information about the size of cache lines in the objective function via gate-control techniques [11].

Trade-off between Prediction and Correction.
We then drill down into the index querying process and explore the trade-off between prediction cost and correction cost. To measure the prediction cost, we set L(M) = t_p(M), where t_p(M) is the prediction time. To measure the correction cost, we can set L(D|M) to the correction time, i.e., L(D|M) = t_c(D|M), or to the MAE, i.e., L(D|M) = (1/|D|) Σ_{x∈D} |y − ŷ|.

Figure 5: Trade-off between prediction time and correction cost of four index methods.

We plot the prediction time and correction cost (including correction time and MAE) for the different index methods in Figure 5. As the parameter α increases (i.e., page size and error bound decrease, as they are inversely proportional to α), the prediction times of B+ Tree, FITing-Tree and PGM all increase while their correction costs decrease. From the view of MDL, a more complex L(M), requiring a longer prediction time, usually makes more precise predictions and thus yields a smaller correction cost L(D|M). Here RMI is a bit different, as its prediction time stays almost the same as α changes. This is due to the fact that the major part of inference in RMI is calculating L linear functions through an L-layer tree (here L is 2), which is independent of the tunable number of layer-2 models.

Besides, the MAE and correction time of all the index methods show consistent increasing trends: all of them increase as α increases, but at different speeds (see the different gaps between the red and green lines). This meets our expectation, since the evaluated index methods use a binary search within certain ranges: the page size for B+ Tree, the maximum positive/negative errors for RMI, and ϵ for FITing-Tree and PGM. Meanwhile, the MAEs are bounded by these values and approximately reflect the correction time. Note that the MAEs are usually smaller than these fixed search ranges, which allows us to further use exponential search to speed up the correction, as studied in [13]. Another observation is that the prediction time is usually larger than the correction time for B+ Tree, FITing-Tree and PGM. Compared to their lookups within a continuous memory region in the correction stage, their recursive lookups in the prediction stage tend to incur more cost, due to the organization of hierarchical models. This inspires us to design learned indexes with as few layers as possible, as adopted by the RMI method.
We have studied the impact of varying the regularization coefficient α for different index methods. Now we study the intrinsic property of the mechanism by comparing different model capacities, that is, comparing different families of learning models. To fairly compare the four index methods, we choose the most favorable α for each of them, namely the value of α closest to the intersections of their time-space trade-off lines in Figure 4. The results are summarized in Table 1. We can see that all three learned index methods achieve much smaller storage costs and faster lookup speeds than the traditional B+ Tree. Among the three learned index methods, we observe that FITing-Tree and PGM have much smaller MAE, better query efficiency, and smaller index size than RMI. However, the RMI method requires less construction time, since it does not need to organize learned segments, such as the assistant B+ Tree of FITing-Tree and the recursive strategy of PGM.

Recall that existing learned index methods including RMI, FITing-Tree and PGM need to scan at least one pass over the whole data to learn several sub-models, and further organize the learned sub-models in a model tree or a B+ Tree. There is great potential to reduce the data scanning cost with smaller data, and to reduce the index organization cost with fewer sub-models to be learned. Here we plug the proposed sampling technique into these learned index methods to accelerate index learning. Specifically, given a sample rate s ∈ (0, 1.0], we first get a sampled dataset D_s by randomly sampling s × |D| keys from D, then learn the indexes from D_s, and finally test the indexes on the whole dataset D. When directly applying this procedure to the original RMI, FITing-Tree and PGM methods, we found some fairly large prediction errors, caused by a few un-sampled keys that cannot be covered by the learned sub-models (i.e., linear segments here).
Fortunately, we can eliminate these large errors with some simple yet effective patches. For FITing-Tree and PGM, we connect the adjacent segments learned from D_s to cover all the un-sampled keys. For RMI, we propose an RMI-Nearest-Seg patch, which re-assigns a key covered by an un-trained (empty) sub-model to its nearest trained sub-model. Since the sampling may cause a few violations of the error bounds, we adopt exponential search to find the search boundary around the predicted position.
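A sketch of the sampling plug-in for a segment-based index is shown below. The `fit_segments` learner and all names are hypothetical stand-ins (the released code may differ); the "patch" lets each learned segment extend until the start of the next one, so un-sampled keys falling between sampled ones remain covered:

```python
import bisect
import random

def sample_and_patch(keys, sample_rate, fit_segments, seed=0):
    """Learn piecewise-linear segments on a random sample of `keys` and
    return a predict(key) function whose patched segments cover all keys."""
    rng = random.Random(seed)
    n_s = max(2, int(len(keys) * sample_rate))
    sample = sorted(rng.sample(keys, n_s))
    segs = fit_segments(sample)        # sorted list of (start, slope, icept)
    starts = [s[0] for s in segs]

    def predict(key):
        # patched coverage: segment i handles [start_i, start_{i+1})
        i = max(bisect.bisect_right(starts, key) - 1, 0)
        _, slope, icept = segs[i]
        return slope * key + icept     # error bound may be violated here;
    return predict                     # exponential search corrects it
```

Since only n_s keys are scanned by the learner, the fitting cost drops roughly in proportion to the sample rate, while the patched coverage keeps every key predictable.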
Figure 6: MAE, build time and overall query time of the PGM index with varied s. The gray and green horizontal dotted lines indicate the MAE of M_D and ϵ = 256, respectively.

By varying s from 1.0 to 0.001, we evaluate the MAE, build time and query time for all patched learned indexes on all the datasets. The results of PGM on the IoT dataset are shown in Figure 6; the other two index methods and the other datasets lead to similar conclusions, and we omit those results due to the space limitation. In Figure 6, the two right subfigures show the results for extremely small sample rates (i.e., 0.1 to 0.001). For comparison, let's denote the index learned from the original whole data D as M_D. In the top-right subfigure, the gray and green horizontal dotted lines indicate the MAE of M_D and the adopted error bound ϵ = 256, respectively.
Construction Speedup . From Figure 6, we can see that the patched PGM gains a significant construction speedup: the build time becomes about 78x smaller as s decreases from 1 to 0.01, while the query performance remains non-degraded (e.g., when s = 0.01, the MAE is still very close to that of M_D, the gray dotted line). Generally speaking, the build time decreases linearly as the sample rate decreases. On the other hand, the curves of MAE and query time are near-horizontal as the sample rate decreases, until very small sample rates are reached.

Generalization Improvement . As discussed in Section 4.2, the sampling technique can improve the generalization ability of learned index methods, which leads to a smaller number of learned segments and a correspondingly smaller index size. We collect statistics on the number of learned segments for the patched FITing-Tree and the patched PGM, and the results are illustrated in Figure 7.

Figure 7: Number of learned segments for patched FITing-Tree and patched PGM with varied s, on the IoT and Longitude datasets.

We can observe that the numbers of learned segments of both the patched FITing-Tree and PGM decrease as the sample rates decrease. This is because, as the sampled dataset becomes smaller, the learned index methods can extract more general patterns in the data, since some noisy keys are discarded. Thus some adjacent learned segments having similar slopes can be merged, and the generalization ability improves. Note that PGM adopts an optimal piece-wise segmentation learning algorithm, and is thus more stable than FITing-Tree, which adopts a greedy learning algorithm.
The α Adjustment . To gain more insights into the proposed sampling technique, here we vary the tunable α of the learned index methods and check the smallest "safe" sampled data size n_safe that maintains non-degraded performance.

Figure 8: The smallest "safe" sampled data size n_safe that maintains non-degraded performance, obtained by varying the α of different learned index methods on the IoT dataset (x-axes: the error bound for patched FITing-Tree and patched PGM, and the number of layer-2 models for RMI-Nearest-Seg).

The results are shown in Figure 8. Recall that in Section 4, we theoretically show in Theorem 1 that it suffices to learn the index on a sample of size as small as O(α^{log log E}). Applying a simple log transformation, we get log(O(α^{log log E})) = O(log log E × log α) = O(log α). That is, log(n_safe) for these methods is asymptotically linear in log(α). In Figure 8, the x-axis and y-axis represent α and n_safe with log transformation respectively, and the linear trends of the plots match the above theoretical analysis (again, note that the number of layer-2 models in RMI is proportional to α, while the error bounds of FITing-Tree and PGM are inversely proportional to α). With a smaller α, we can achieve non-degraded performance with fewer samples, while a larger α requires us to draw more samples, since more details about the data are needed to learn an index at finer granularity.

As discussed in Section 5, we propose to learn a precise index by adjusting the distribution of keys and positions, at the cost of gap insertion and index re-training. The cost can be further reduced by combining the sampling technique. In this section, we empirically examine the effectiveness and efficiency of the proposed gap insertion technique.

Figure 9: Boxplots of performance for the PGM index with gap insertion on the IoT dataset.
We conduct experiments on all the adopted datasets and the three learned index methods, varying the gap insertion rate ρ from 0.5 to 0.001 and the sample rate s from 1 to 0.005. Due to similar results and the space limitation, we only report the results for PGM on the IoT dataset here. The overall query time and other detailed performance numbers are summarized as boxplots in Figure 9, where each box covers the middle 50% of experimental points and the "No Gap" boxes represent the baseline index M_D learned without gap insertion.

Clearly, the indexes learned on the gap-inserted datasets gain significantly smaller overall query times compared with the baseline (the speedup is up to 1.59x), which verifies the effectiveness of the gap insertion technique. Note that the overall query time reflects the overall performance, including the advantage of improved MAE and the disadvantage of additional index size. To analyze the detailed performance, we further break down the overall query time into prediction time and correction time, and also plot MAE and index size in Figure 9. Compared with the baseline without gap insertion, PGM with gap insertion achieves slightly better prediction time and much better correction time. To explain the improvements, we can check the MAE results, which show a significant improvement. However, the index size, including the size of the introduced gaps and linking arrays, becomes larger and reduces the benefit brought by the MAE improvements. In total, the correction time shows an average 2x improvement, and the overall query time shows an average 1.4x speedup, which is less than the improvement in terms of MAE. These results and analysis show that our gap insertion technique can learn more precise indexes, and thus improve the overall performance.

Above, we analyzed the overall performance of the gap insertion and sampling techniques; here we discuss the performance of various cases with specific s and ρ, as illustrated in Figure 10.

Figure 10: The performance (overall query time and mean of MAE) of PGM with varied s and ρ.

Overall, compared with the results without gap insertion and sampling (s = 1, ρ = 0, i.e., the upper-left corner of each subfigure), large gap rates and moderate sample rates achieve much smaller MAE and significant query efficiency improvements, verifying the effectiveness of the proposed sampling and gap insertion techniques again. Specifically, a larger ρ allows us to transform the data distribution and enhance the patterns via the result-driven gap insertion. For the sampling technique, we observe that it behaves a bit differently from the results in Section 6.3, where both reasonable MAE and query time can be maintained as s decreases from 1.0 to 0.1. Here, a reasonable MAE is maintained, but the overall query time first decreases and then increases slightly. This is because, when combining sampling and gap insertion with a very small s, we need to put more un-sampled keys into the linking arrays, resulting in an increased total query time.

As mentioned in Section 5.3, our proposed techniques can be easily extended to handle dynamic scenarios. To show how the performance varies under different dynamic scenarios, as an example we evaluate PGM with dynamic linking arrays on both read-heavy and write-heavy workloads. Specifically, we randomly split the IoT dataset into D_init and D_{−init} with a write proportion w, i.e., D = D_init + D_{−init}, |D_{−init}| = w · |D|. We choose two values of w to obtain a read-heavy and a write-heavy workload respectively, and split D_{−init} into B equal-sized data batches D_{−1}, . . . , D_{−B}. We initially learn the index on D_init and then insert D_{−init} in batches. After inserting the b-th batch, we evaluate the MAE, prediction time, correction time and overall query time by randomly querying the data seen so far.
Figure 11: The performance of PGM with linking arrays in dynamic scenarios (left: read-heavy workload; right: write-heavy workload).
The left subfigure in Figure 11 shows the querying-related time and gap fraction after each batch insertion for the read-heavy workload. As more new data are inserted into the reserved unoccupied positions, the fraction of available gaps decreases, as expected. For querying-related performance, the prediction time remains the same, while the correction and overall query time slightly increase as some new data are placed into the linking arrays. Note that during the insertion procedure, the index with dynamic linking arrays achieves about 1.364x faster correction and 1.227x faster overall lookup than the index without the gap insertion technique (the leftmost points in the left subfigure of Figure 11; this baseline has access to all the data D), showing the great potential of the gap-based index for handling dynamic scenarios. In the right subfigure, we plot the results for the write-heavy workload, which shows similar trends to the read-heavy workload. The only difference is that the correction and overall query time increase a bit faster than in the read-heavy workload, as there are more new data to insert in the write-heavy workload. All these results confirm that the proposed method can effectively leverage the reserved gaps and maintain comparable query performance in dynamic scenarios.

Learned Indexes. Many traditional indexes have been well studied, such as tree-based [3, 4, 20, 26, 32, 40, 44], hash-based [41, 50], and bitmap-based [2, 9, 25, 42, 51] indexes. Recently, learned indexes have gained increasing interest; they learn and utilize the hidden distribution of the data to be indexed. Recursive Model Index (RMI) [29] first introduces the idea of predicting the positions of keys with machine learning models. FITing-Tree [19] greedily learns piece-wise linear segments with a pre-defined bounded error, and PGM [16] further improves FITing-Tree by finding the optimal number of learned segments given an error bound.
ALEX [13] proposes an adaptive RMI with workload-specific optimization, achieving high performance on dynamic workloads. RadixSpline [28] attains performance competitive with RMI using a radix structure and single-pass training. Beyond single-dimensional indexes, existing methods also explore multi-dimensional scenarios, such as Flood [37], Tsunami [14], NEIST [52], and LISA [33]. Our work is complementary to the above learned index methods. We propose an MDL-based learning framework to quantify the learning objective, and two pluggable techniques that existing methods can incorporate to further enhance their performance, as shown in our experiments.
Sampling in Index. Sampling has also been explored in partial indexes [45, 48], the tail index [18], cost estimation [12, 13], and index layout estimation [31, 37]. Different from sampling a user-interested subset in partial indexes [48] or a rare subset in the tail index [18], we adopt uniform sampling in our experiments. ALEX [13] samples every 𝑛-th key of the data to predict the cost of a data node for bulk loading. Flood [37] trains an RMI on each dimension and samples data to estimate how often certain dimensions are used. Different from ALEX and Flood, we propose the sampling technique for learning acceleration, and we also provide theoretical analysis that is further confirmed by experimental results.

Gapped Structure. To support dynamic operations, several gapped data structures have been studied that reserve gaps between elements, including the Packed Memory Array (PMA) [5], the Packed-Memory Quadtree [49], B+Tree, which reserves continuous gaps at the end of data arrays, and ALEX [13], which adopts a gapped array with key shifts and model-based insertion. PMA and B+Tree reserve gaps independently of the data distribution, while our method and ALEX reserve data-dependent gaps using learned models. Further, different from ALEX, we use linking arrays to simplify the dynamic operations and make the key-position distribution of the gap-inserted data suitable for data that may be inserted later.
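As a concrete illustration of sampling for learning acceleration (a sketch under simplified assumptions, not ALEX's, Flood's, or our exact procedure; function names are ours), the following fits a least-squares linear key-to-position model on every n-th key only and then reports the mean absolute position error (MAE) over all keys. On near-linear key distributions, the sampled fit touches only a fraction of the data while losing little accuracy.

```python
def fit_linear(xs, ys):
    # Ordinary least squares for a 1-D linear model y ~ a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs) or 1.0
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def sampled_fit_mae(keys, n_th=10):
    """Train on every n_th key (uniform sampling), evaluate MAE on ALL keys."""
    keys = sorted(keys)
    xs = keys[::n_th]                     # sampled keys
    ys = list(range(0, len(keys), n_th))  # their true positions
    a, b = fit_linear(xs, ys)
    return sum(abs(a * k + b - i) for i, k in enumerate(keys)) / len(keys)
```

Here the fitting cost scales with the sample size rather than the data size, which is the source of the construction speedup.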
Machine Learning based Database. We compare several learned indexes from the perspective of minimum description length [21, 22], which falls into the category of machine learning based databases. Recently, many works have facilitated database components with machine learning, such as query optimization [15, 27, 30, 35, 36, 39], workload forecasting [34], memory prefetching [23], and selectivity estimation [53].
Learned indexes gain promising performance by learning and utilizing the hidden distribution of the data to be indexed. To facilitate learned indexes from the view of machine learning, we propose a minimum description length based framework that formally quantifies the index learning objective and helps to design suitable learned indexes for different scenarios. Besides, we study two general and pluggable techniques, i.e., the sampling technique to enhance learning efficiency with theoretical guidance, and the result-driven gap insertion technique to enhance learning effectiveness in terms of index preciseness and generalization ability. Extensive experiments demonstrate the efficiency and effectiveness of the proposed framework and the two pluggable techniques, which boost existing learned index methods by up to 78x construction speedup while maintaining non-degraded performance, and up to 1.59x query speedup in both static and dynamic indexing scenarios. With this paper, we hope to provide a deeper understanding of current learned index methods from the perspective of machine learning, and to promote more exploration of learned indexes from the perspectives of both machine learning and databases.

REFERENCES
[1] David J Abel. 1984. A B+-tree structure for large quadtrees. Computer Vision, Graphics, and Image Processing 27, 1 (1984), 19–31.
[2] Manos Athanassoulis et al. 2016. UpBit: Scalable In-Memory Updatable Bitmap Indexing. In SIGMOD. 1319–1332.
[3] Manos Athanassoulis and Anastasia Ailamaki. 2014. BF-tree: Approximate Tree Indexing. In PVLDB. 1881–1892.
[4] Rudolf Bayer and Karl Unterauer. 1977. Prefix B-trees. TODS 2, 1 (1977), 11–26.
[5] Michael A Bender and Haodong Hu. 2007. An adaptive packed-memory array. TODS 32, 4 (2007), 26.
[6] Dimitri P Bertsekas. 1997. Nonlinear programming. Journal of the Operational Research Society 48, 3 (1997), 334–334.
[7] Timo Bingmann. 2013. STX B+ Tree. https://panthema.net/2007/stx-btree/.
[8] Christopher M Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[9] Chee-Yong Chan and Yannis E. Ioannidis. 1998. Bitmap Index Design and Evaluation. In SIGMOD. 355–366.
[10] Yun-Chih Chang, Yao-Wen Chang, Guang-Ming Wu, and Shu-Wei Wu. 2000. B*-Trees: A new representation for non-slicing floorplans. In DAC. 458–463.
[11] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555 (2014).
[12] Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig, and Tim Kraska. 2016. The case for interactive data exploration accelerators (IDEAs). In HILDA. 1–6.
[13] Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David B. Lomet. 2020. ALEX: An Updatable Adaptive Learned Index. In SIGMOD. 969–984.
[14] Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads. arXiv preprint arXiv:2006.13282 (2020).
[15] Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity Estimation for Range Predicates Using Lightweight Models. PVLDB 12, 9, 1044–1057.
[16] Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB 13, 8, 1162–1175.
[17] Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, Cham, 3–33.
[18] Alex Galakatos, Andrew Crotty, Emanuel Zgraggen, Carsten Binnig, and Tim Kraska. 2017. Revisiting reuse for approximate query processing. PVLDB 10, 10, 1142–1153.
[19] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-aware Index Structure. In SIGMOD. 1189–1206.
[20] Goetz Graefe and Per-Åke Larson. 2001. B-Tree Indexes and CPU Caches. In ICDE. 349–358.
[21] Peter Grünwald and Teemu Roos. 2019. Minimum description length revisited. arXiv preprint arXiv:1908.08484 (2019).
[22] Peter D Grünwald and Abhijit Grunwald. 2007. The Minimum Description Length Principle. MIT Press.
[23] Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, and Parthasarathy Ranganathan. 2018. Learning Memory Access Patterns. In ICML.
[24] Hosagrahar V Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang. 2005. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. TODS 30, 2 (2005), 364–397.
[25] Theodore Johnson. 1999. Performance Measurements of Compressed Bitmap Indices. In PVLDB. 278–289.
[26] Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D Nguyen, Tim Kaldewey, Victor W Lee, Scott A Brandt, and Pradeep Dubey. 2010. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD. 339–350.
[27] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. arXiv preprint arXiv:1809.00677 (2018).
[28] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2020. RadixSpline: A single-pass learned index. arXiv preprint arXiv:2004.14541 (2020).
[29] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In SIGMOD. 489–504.
[30] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).
[31] Christian A Lang and Ambuj K Singh. 2001. Modeling high-dimensional index structures using sampling. In SIGMOD. 389–400.
[32] Viktor Leis et al. 2013. The Adaptive Radix Tree: ARTful Indexing for Main-memory Databases. In ICDE. 38–49.
[33] Pengfei Li, Hua Lu, Qian Zheng, Long Yang, and Gang Pan. 2020. LISA: A Learned Index Structure for Spatial Data. In SIGMOD. 2119–2133.
[34] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J Gordon. 2018. Query-based workload forecasting for self-driving database management systems. In SIGMOD. 631–645.
[35] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A learned query optimizer. PVLDB.
[36] Ryan Marcus and Olga Papaemmanouil. 2018. Towards a hands-free query optimizer through deep learning. arXiv preprint arXiv:1809.10212 (2018).
[37] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-dimensional Indexes. In SIGMOD.
[39] arXiv preprint arXiv:1803.08604 (2018).
[40] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[41] Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing. Journal of Algorithms 51, 2 (2004), 122–144.
[42] Ali Pinar et al. 2005. Compressing Bitmap Indices by Data Reorganization. In ICDE. 310–321.
[43] Jun Rao and Kenneth A Ross. 1998. Cache conscious indexing for decision-support in main memory. In VLDB. 78–89.
[44] Jun Rao and Kenneth A. Ross. 2000. Making B+-Trees Cache Conscious in Main Memory. In SIGMOD. 475–486.
[45] Praveen Seshadri and Arun Swami. 1995. Generalized partial indexes. In ICDE. 420–427.
[46] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[47] Ohad Shamir, Sivan Sabato, and Naftali Tishby. 2010. Learning and generalization with the information bottleneck. Theoretical Computer Science (2010).
[48] Michael Stonebraker. 1989. The case for partial indexes. SIGMOD Record 18, 4 (1989), 4–11.
[49] Julio Toss, Cicero AL Pahins, Bruno Raffin, and João LD Comba. 2018. Packed-Memory Quadtree: A cache-oblivious data structure for visual exploration of streaming spatiotemporal big data. Computers & Graphics 76 (2018), 117–128.
[50] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. PAMI 40, 4 (2017), 769–790.
[51] Kesheng Wu et al. 2006. Optimizing Bitmap Indices with Efficient Compression. TODS (2006), 1–38.
[52] Sai Wu, Zhifei Pang, Gang Chen, Yunjun Gao, Cenjiong Zhao, and Shili Xiang. 2019. NEIST: A Neural-Enhanced Index for Spatio-Temporal Queries. TKDE (2019).
[53] Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Selectivity Estimation with Deep Likelihood Models. arXiv preprint arXiv:1905.04278 (2019).