A Lazy Approach for Efficient Index Learning
Guanli Liu, University of Melbourne
Lars Kulik, University of Melbourne
Xingjun Ma, Deakin University
Jianzhong Qi, University of Melbourne
ABSTRACT
Learned indices using neural networks have been shown to outperform traditional indices such as B-trees in both query time and memory. However, learning the distribution of a large dataset can be expensive, and updating learned indices is difficult, thus hindering their usage in practical applications. In this paper, we address the efficiency and update issues of learned indices through agile model reuse. We pre-train learned indices over a set of synthetic (rather than real) datasets and propose a novel approach to reuse these pre-trained models for a new (real) dataset. The synthetic datasets are created to cover a large range of different distributions. Given a new dataset D_t, we select the learned index of a synthetic dataset similar to D_t to index D_t. We show a bound over the indexing error when a pre-trained index is selected. We further show how our techniques can handle data updates and bound the resultant indexing errors. Experimental results on synthetic and real datasets confirm the effectiveness and efficiency of our proposed lazy (model reuse) approach.

1 INTRODUCTION

Learned indices using neural networks have been shown to outperform traditional indices such as B-trees in both query time and memory [5, 10, 13]. Given a dataset (e.g., a database table), an index is a structure that maps the index key p.key of a data point p (e.g., a data record) to its storage address p.pos. The idea of learned indices is to train a machine learning model F (e.g., a neural network) to approximate the mapping from p.key to p.pos. Previous work has shown that such learned indices can be simpler and more query-efficient than traditional indices. The trained model F can predict p.pos with a bounded error range [err_l, err_u], i.e., the data point p can be found in the range [F(p.key) + err_l, F(p.key) + err_u] [5].

While learned indices have efficient query procedures, they are prone to slow building and updates, since machine learning models are expensive to train, and once trained, they are difficult to update. Even with simple models such as linear splines, cubic splines, or linear regression, a learned index such as the recursive model index (RMI) [10] is two orders of magnitude slower to build than a B-tree [13]. Techniques that learn indices in a single pass such as RadixSpline [8] can be built faster, but they tend to produce sub-optimal indices of large sizes and lower query efficiency. The high costs in model training also prevent the retraining of learned indices for every data update. Existing learned indices [4-6] avoid model retraining by storing newly inserted points into additional structures, which inevitably adds query processing costs. This limits the applicability of learned indices in dynamic scenarios with frequent dataset creation or updates, which is common in practice, for example, to index sensor data or data from scientific studies (simulations) [19].
Table 1: Two example datasets D_s and D_t.

In this paper, we aim to address the efficiency issues in training and updating learned indices without hindering their query efficiency. Our solution is inspired by domain adaptation [1]. Given a model M_s trained on an existing (source) dataset D_s, domain adaptation reuses M_s for a new (target) dataset D_t by fine-tuning M_s over D_t. This avoids training a new model on D_t from scratch, which can be extremely time-consuming.

A key requirement for successful adaptation of M_s to D_t is that D_s and D_t should have similar distributions [2, 12]. Otherwise, M_s may yield large errors on D_t. This is important in our problem as we aim to further skip fine-tuning on D_t, to achieve fast updates. This motivates us to generate synthetic datasets to cover a wide range of different data distributions and pre-train reusable indices on such datasets. Our dataset generation is based on the cumulative distribution function (CDF). Given a new dataset D_t, we measure the CDF similarity between D_t and the synthetic datasets. We select a model pre-trained on a synthetic dataset similar to D_t as the index model for D_t.

Figure 1: CDFs of D_s and D_t.

Table 1 and Fig. 1 illustrate two example datasets (both sorted in ascending order) and their corresponding CDFs. We denote the CDFs of D_s and D_t as CDF_s(·) and CDF_t(·), respectively. In this toy example, an index model M_s is learned to predict the rank p.pos (or percentile) of point p ∈ D_s based on its search key p.key; that is, M_s(p.key) predicts the percentile of p, and p.pos ≈ M_s(p.key) · |D_s|. This effectively learns CDF_s(p). For p ∈ D_s, CDF_s(p) measures the probability of a value less than or equal to p, which is also the (normalized) rank of p in D_s. When reusing M_s for D_t, the additional prediction errors introduced can be bounded with respect to the dissimilarity between CDF_s(·) and CDF_t(·). This suggests that, if we can generate synthetic datasets that cover a sufficiently large area of the space of all possible CDFs, the learned indices on the synthetic datasets can be reused for any new dataset with bounded prediction errors. Since the CDF of a dataset with n points takes O(n) time to compute, we further propose a histogram-based approximation of the CDF, with bounded errors, to reduce the computation time to only O(log n).

We use a model reuse threshold τ ∈ (0, 1] to help determine whether to reuse a pre-trained model for a new dataset D_t. When the CDF similarity between D_t and a synthetic dataset D_s is greater than or equal to τ, we reuse the model M_s pre-trained on D_s to index D_t. Based on τ, we further derive the maximum additional prediction error of M_s on D_t, and we derive the number of synthetic datasets to be generated. Since our model reuse procedure is fast and flexible, we call it agile model reuse. Following a similar idea, we adapt agile model reuse to handle updates. When the similarity between the CDFs of a dataset D_t and its updated version D′_t is greater than or equal to τ, we can reuse model M_t trained on D_t without retraining.

To showcase the applicability of our agile model reuse technique, we integrate it into the RMI learned index [10]. We show that agile model reuse can significantly reduce the training time of the sub-models in RMI. We then propose a new index structure named recursive model reuse tree (RMRT) with built-in agile model reuse support.
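To make the rank-prediction idea concrete, the following minimal Python sketch (ours, not the paper's implementation) stands in a perfect empirical CDF for the trained model F and uses it to predict storage positions:

```python
# A toy illustration: a (perfect) CDF model predicts storage positions.
import numpy as np

def empirical_cdf(data):
    """Return a function approximating the CDF of a dataset."""
    data = np.sort(data)
    def cdf(x):
        # Fraction of points with key <= x, i.e., the normalized rank.
        return np.searchsorted(data, x, side="right") / len(data)
    return cdf

D_s = np.random.rand(1000)        # toy dataset with keys in [0, 1]
M_s = empirical_cdf(D_s)          # stands in for a trained model
key = 0.42
pos = int(M_s(key) * len(D_s))    # predicted position: CDF(key) * |D_s|
# A learned index would search a small error range around pos.
```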
RMRT is designed to be adaptive to different data distributions: it builds sub-models with more layers for denser regions of a dataset. This is particularly useful for skewed data, which has not been addressed in RMI.

In summary, our key contributions are:
(1) We propose an agile model reuse technique to pre-train a set of models on synthetic datasets and adaptively select the most suitable model to index a new dataset with respect to a model reuse threshold τ. We show how to bound the additional model prediction error given τ.
(2) We propose a new index structure named RMRT, which has built-in agile model reuse support and adaptively builds an unbalanced hierarchical structure for better indexing of skewed data.
(3) Extensive experiments on synthetic and real datasets show that agile model reuse can accelerate the building time of neural network-based learned indices by two orders of magnitude, while retaining the lookup efficiency. Further, our agile model reuse based index RMRT is faster than RMI-based structures to build, while it outperforms all baseline models in lookup performance.

2 RELATED WORK

A learned index [3-6, 10, 11] learns a mapping from a search key to the storage address of a data point with a machine learning model. Due to limits on the learning capacity of a single model, existing learned indices such as RMI [10] build a hierarchy of models to index large datasets. The idea is similar to that of traditional hierarchical indices: top-level models predict partitions of the data points (i.e., the lower-level model in which a point is indexed), while leaf-level models predict the storage locations. The training and updates of a hierarchical learned index can be very expensive, especially when neural networks are used. Follow-up studies aim to bound the prediction error of the learned model. For example, PGM [5] builds a hierarchical learned index bottom-up, with a worst-case prediction error bound ε on every learned model. The building time of such learned indices is also high.

Update handling.
Updates may change the data distribution from which an index model is learned and amplify the model prediction error. Existing studies have focused on handling insertions, since deleted points can simply be flagged as "removed" with a light impact on query processing. For query correctness, one may update the prediction error bounds to err_l − k and err_u + k after k insertions. Tighter bounds are achieved by keeping track of the error bound drifts for a number of reference points [7]. At query time, the closest reference points on both sides of the query point are fetched, and their error bound drifts are used to estimate the updated error bounds with a linear interpolation. PGM [5] uses two different strategies to handle insertions. For time series data insertion, it can either add a new point to the last model or add a new model to handle the new point. For arbitrary insertions, it applies the logarithmic method [14] and builds a series of PGM indices for the insertions. All these indices need extra structures to handle updates, which impact the query efficiency.

Domain adaptation.
The idea of domain adaptation is to adapt a model pre-trained on a dataset D_s for a new problem with a different dataset D_t. A key step is to measure the similarity between the distributions of D_s and D_t. The L1 distance is often used [12]. It does not suit our problem because it cannot help bound the index prediction error on D_t. The discrepancy [2] is another measure. It is designed based on testing whether the training loss differs significantly on D_s and D_t. This is inapplicable because we require a highly efficient test to determine online whether a model can be reused for D_t. Typical domain adaptation techniques also fine-tune the pre-trained model on D_t, while we skip this step for efficiency considerations.

3 AGILE MODEL REUSE

Given a new or updated dataset D_t, we aim to construct a learned index M_t for D_t with a high efficiency. We first present an overview of our agile model reuse technique. We then detail its key components, including dataset similarity measurement, synthetic dataset generation, model adaptation, and error bounding. We will also showcase the applicability of our technique on an existing learned index and a novel one.
Algorithm 1: Agile Model Reuse
Input: D_t, Q_sm
Output: M_t
1 for <D_s, M_s> ∈ Q_sm do
2     dist ← cdf_distance(D_s, D_t);
3     if dist ≤ 1 − τ then
4         M_t ← adapt_model(M_s, D_s, D_t);
5         return M_t;
6 Train model M_t over D_t;
7 M_t.max_err_rng ← M_t.calc_err(D_t);
8 Q_sm.enqueue(<D_t, M_t>, M_t.max_err_rng);
9 return M_t;

Agile model reuse overview.
Algorithm 1 summarizes our agile (i.e., fast and flexible) model reuse technique. We pre-train models on synthetic datasets (detailed later) which are reused to index D_t. The pre-trained models are organized in a priority queue Q_sm. Each entry in Q_sm contains the information of a synthetic dataset D_s and its corresponding trained model M_s. The trained models are sorted by their error bounds in ascending order. Algorithm 1 traverses Q_sm (line 1), calculates the distance (dissimilarity) between D_t and each synthetic dataset D_s (line 2), and finds the first model where the distance is smaller than or equal to 1 − τ, where τ ∈ (0, 1] is the model reuse threshold (line 3). If such a model is found, the model and its error bounds are adapted based on the dataset distance (line 4, detailed later), and the adapted model is returned as M_t (line 5). Otherwise, we train a new model M_t for D_t (line 6) and obtain the error range (err_u − err_l, line 7). We enqueue and return the model (lines 8 and 9).

We use τ to control the dataset similarity in model reuse. A smaller τ allows the algorithm to return a model earlier, which may not have a high similarity with D_t and may lead to low prediction accuracy and high query costs. In contrast, a larger τ can cost more time in traversing Q_sm but gain a better-fitted model with high prediction accuracy and low query costs. As τ increases in the range (0, 1], the requirement for agile model reuse gets stricter. We elaborate on the effect of τ in Section 5.
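A minimal Python sketch of Algorithm 1 follows. The helpers cdf_distance, adapt_model, train_model, and calc_err_range are hypothetical names of ours (cdf_distance can be the histogram-based distance introduced next):

```python
# A runnable sketch of Algorithm 1, assuming the four helpers above.
def agile_model_reuse(D_t, Q_sm, tau):
    """Q_sm: list of (err_range, D_s, M_s) sorted by err_range ascending."""
    for _, D_s, M_s in Q_sm:                    # line 1
        dist = cdf_distance(D_s, D_t)           # line 2
        if dist <= 1.0 - tau:                   # line 3: similarity >= tau
            return adapt_model(M_s, D_s, D_t)   # lines 4-5
    M_t = train_model(D_t)                      # line 6: no match, train anew
    err = calc_err_range(M_t, D_t)              # line 7: err_u - err_l
    Q_sm.append((err, D_t, M_t))                # line 8: enqueue for reuse
    Q_sm.sort(key=lambda entry: entry[0])       # keep sorted by error range
    return M_t                                  # line 9
```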
A model M_s for dataset D_s effectively learns a CDF of D_s. To reuse M_s on D_t, it is important that the CDFs of D_s and D_t are similar. We thus define the similarity between D_s and D_t by their CDFs.

Definition 3.1 (Similarity between two datasets).
Given two datasets D_s and D_t, their similarity is defined by the maximum distance between their CDFs:

sim(D_s, D_t) = 1 − sup_x |CDF_s(x) − CDF_t(x)|    (1)

Here, sup_x |CDF_s(x) − CDF_t(x)| is the maximum gap between CDF_s(x) and CDF_t(x). We use sim(D_s, D_t) and dist(D_s, D_t) = 1 − sim(D_s, D_t) to denote the similarity and the distance (dissimilarity) between D_s and D_t, respectively.

This similarity metric is based on the Kolmogorov-Smirnov (KS) test [9], which takes O(|D_s| + |D_t|) time to compute, assuming that both datasets are already sorted. This may be too expensive for online computation on large datasets. We present an approximate similarity metric for faster computation.

Our approximate similarity metric uses relative frequency histograms ("histograms" for short) that discretize the data domain into bins and record the relative frequencies (i.e., percentages) of the data points in each bin. A histogram is a discrete approximation of the probability density function (PDF) of a dataset. We use it to compute an approximation of the CDF and, in turn, an approximation of dist(D_s, D_t), denoted by dist*(D_s, D_t).

Algorithm 2 summarizes the computation process, which takes as input histograms of D_s and D_t with m (a system parameter) bins each, denoted by H_s and H_t. We use H_s[i] and H_t[i] to denote the i-th bins and their relative frequencies. The sums of the probabilities of the first i bins of H_s and H_t are denoted by P_s and P_t, i.e., P_s = Σ_{j=0}^{i−1} H_s[j] and P_t = Σ_{j=0}^{i−1} H_t[j].

The algorithm computes dist*(D_s, D_t) (dist* for short) by looping through the bins (lines 2 to 4). In the i-th iteration (i ∈ [0, m − 1]), it computes H_t[i] + P_t. This is the maximum CDF_t(x) for any x in the i-th bin's range (b_i, b_{i+1}] (in our synthetic datasets, x ∈ [0, 1]), because P_t has accumulated the probabilities for x ≤ b_i while H_t[i] further adds the probability for x ∈ (b_i, b_{i+1}]. Meanwhile, P_s is the minimum CDF_s(x) for any x ∈ (b_i, b_{i+1}]. Thus, ∀x ∈ (b_i, b_{i+1}]:

H_t[i] + P_t − P_s ≥ CDF_t(x) − CDF_s(x)
H_s[i] + P_s − P_t ≥ CDF_s(x) − CDF_t(x)    (2)

After going through all bins, we have:

dist*(D_s, D_t) ≥ |CDF_s(x) − CDF_t(x)|, ∀x ∈ (0, 1]    (3)

Thus, dist*(D_s, D_t) ≥ dist(D_s, D_t).
Algorithm 2: Histogram-based-Distance
Input: H_s, H_t
Output: dist*
1 dist* ← 0, P_s ← 0, P_t ← 0;
2 for i ∈ [0, m − 1] do
3     dist* ← max{H_t[i] + P_t − P_s, H_s[i] + P_s − P_t, dist*};
4     P_s ← P_s + H_s[i], P_t ← P_t + H_t[i];
5 return dist*;

Using histograms to discretize CDFs reduces the similarity computation time to O(m log |D_t| + m), i.e., O(m log |D_t|) time for computing H_t (m binary searches over the sorted D_t) and O(m) time for Algorithm 2. Histogram H_s is pre-computed since D_s is known; its cost is omitted here.
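A runnable Python sketch of the histogram construction and of Algorithm 2 (our code, under the assumption that keys are normalized to [0, 1] and sorted):

```python
import numpy as np

def histogram(D_sorted, m):
    """Relative-frequency histogram over [0, 1] with m equal-width bins.
    Uses m binary searches over the sorted data: O(m log |D|)."""
    edges = np.linspace(0.0, 1.0, m + 1)
    counts = (np.searchsorted(D_sorted, edges[1:], side="right")
              - np.searchsorted(D_sorted, edges[:-1], side="right"))
    return counts / len(D_sorted)

def hist_distance(H_s, H_t):
    """dist*: an upper bound on the true CDF distance (Algorithm 2)."""
    dist, P_s, P_t = 0.0, 0.0, 0.0
    for i in range(len(H_s)):                               # lines 2-4
        dist = max(H_t[i] + P_t - P_s, H_s[i] + P_s - P_t, dist)
        P_s, P_t = P_s + H_s[i], P_t + H_t[i]
    return dist

D_s = np.sort(np.random.rand(10_000))           # uniform keys
D_t = np.sort(np.random.beta(2, 5, 10_000))     # skewed keys in (0, 1)
print(hist_distance(histogram(D_s, 10), histogram(D_t, 10)))
```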
We aim to generate a small number of datasets with CDFs that can be similar to those of a large number of real datasets, as bounded by threshold τ.

We first generate a set of CDFs to cover the space of possible CDFs. We discretize the CDF space, such that it can be covered by a limited number of CDFs given threshold τ. As shown in Fig. 2a, after data normalization, all CDFs lie in a [0, 1] × [0, 1] space. Any CDF can be seen as a curve that starts at (0, 0) and travels to (1, 1) in a non-decreasing manner (in the CDF value dimension). We discretize this space with a grid, where each row has a height of 1 − τ, i.e., the grid has ⌈1/(1 − τ)⌉ rows. Consider the set L of polylines each starting from (0, 0) and traveling to (1, 1) via the grid vertices in a non-decreasing manner (in the CDF value dimension, e.g., the colored lines). Straightforwardly, given any CDF, there must be a polyline l ∈ L such that the distance between l and the CDF is bounded by 1 − τ (cf. Fig. 2b).

Figure 2: CDF space discretization. (a) CDFs of synthetic data; (b) CDFs of real and synthetic data.
The CDFs in L correspond to histograms with ⌈1/(1 − τ)⌉ bins, where each bin has a probability value in {0, 1 − τ, 2(1 − τ), . . . , 1}. To limit the bin value combinations and hence the number of CDFs (synthetic datasets) generated, we limit the probability value of each bin to be within {0, (1 − τ)/2, 1 − τ}, and we use m = ⌈2/(1 − τ)⌉ bins in the histogram heuristically. Our synthetic datasets hence will not cover the most skewed CDFs (e.g., the black polylines in Fig. 2a). However, when a target dataset D_t is matched by a synthetic dataset, the distance between their CDFs may be within (1 − τ)/2 rather than 1 − τ, which improves the query performance. Our total number of histograms generated is:

Σ_i ( C(m, i) · C(m − i, (1 − i(1 − τ)) / ((1 − τ)/2)) ),

where i ranges over the feasible counts of bins with probability value 1 − τ, and the two combination terms represent the numbers of bins with probability values of (1 − τ) and (1 − τ)/2, respectively. Once a histogram is generated, we generate a synthetic dataset of n_k key values (n_k = 100 in our experiments) based on the histogram, where the data range is [0, 1], and random key values are generated for each bin. This procedure is shown to be effective and efficient empirically.
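A Python sketch of this generator (our helper names; the histogram enumeration is brute force for clarity, which suffices for the small m values used here):

```python
# Enumerate histograms whose m bins take values in {0, (1-tau)/2, 1-tau}
# and sum to 1, then draw random keys per bin.
import itertools
import numpy as np

def generate_histograms(tau, m):
    half, full = (1 - tau) / 2, 1 - tau
    return [bins for bins in itertools.product([0.0, half, full], repeat=m)
            if abs(sum(bins) - 1.0) < 1e-9]

def dataset_from_histogram(hist, n_k=100):
    """Draw ~hist[i] * n_k uniform keys inside bin i's sub-range of [0, 1]."""
    m, keys = len(hist), []
    for i, p in enumerate(hist):
        keys.extend(np.random.uniform(i / m, (i + 1) / m,
                                      int(round(p * n_k))))
    return np.sort(np.asarray(keys))

hists = generate_histograms(tau=0.5, m=4)
print(len(hists))   # 19 histograms for tau = 0.5, m = 4 (cf. Table 2)
data = dataset_from_histogram(hists[0])
```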
When a model M_s pre-trained on D_s has been selected to index D_t, we need to adapt M_s based on the data domains of D_s and D_t. This is because M_s will not work properly on a domain over which it was not trained, even if the CDFs of D_s and D_t share a similar shape. Let the data ranges of D_s and D_t be [x_s^l, x_s^u] and [x_t^l, x_t^u], and their data storage position ranges be [y_s^l, y_s^u] and [y_t^l, y_t^u], respectively. Model M_s has been trained to take a search key in [x_s^l, x_s^u] as the input and predict a storage position in [y_s^l, y_s^u]. Here, we assume that M_s predicts the storage position of point p directly rather than its rank (or percentile), i.e., p.pos ≈ M_s(p.key) (instead of M_s(p.key) · |D_s| as shown in Section 1). This simplifies the discussion but does not impact our key findings. To adapt M_s for D_t, we take a search key in [x_t^l, x_t^u], map it into [x_s^l, x_s^u], and feed the mapped value into M_s for prediction. The predicted output needs to be mapped back into [y_t^l, y_t^u] for D_t.

Let s_Δx = (x_s^u − x_s^l) / (x_t^u − x_t^l) and s_Δy = (y_t^u − y_t^l) / (y_s^u − y_s^l). The input mapping is done by a linear transformation T_in(x) = a · x + b, where a = s_Δx and b = x_s^l − x_t^l · s_Δx. This is an affine transformation that maps the data range (i.e., T_in(x_t^l) = x_s^l and T_in(x_t^u) = x_s^u) without changing the distribution. Similarly, the output mapping is done by T_out(y) = c · y + d, where c = s_Δy and d = y_t^l − y_s^l · s_Δy. The mappings may incur extra costs (floating point calculation), which can be mitigated by adjusting the parameters of M_s. We use a linear model as an example.

Lemma 3.2.
Input and output mappings for a linear model M_s incur no additional prediction costs.

Proof.
Let M_s be the linear model y = w·x + z, where w and z are parameters. With input mapping, the output ŷ′ of M_s is:

ŷ′ = M_s(T_in(x)) = w · T_in(x) + z = w · s_Δx · x − w · x_t^l · s_Δx + w · x_s^l + z    (4)

After output mapping, the final prediction output ŷ is:

ŷ = T_out(ŷ′) = (ŷ′ − y_s^l) · s_Δy + y_t^l
  = w · s_Δx · s_Δy · x + (−w · x_t^l · s_Δx + w · x_s^l + z − y_s^l) · s_Δy + y_t^l    (5)

Thus, we can adapt M_s to a new linear model y = w′·x + z′ for D_t, where w′ = w · s_Δx · s_Δy and z′ = (−w · x_t^l · s_Δx + w · x_s^l + z − y_s^l) · s_Δy + y_t^l. Input and output mappings can be combined with linear models without extra prediction costs. □
Error bounding.
Recall the prediction errors of M_s over D_s, err_l and err_u. Given a query key x ∈ D_s, the position of x is bounded in [M_s(x) + err_l, M_s(x) + err_u]. After input and output mappings, we also need to adjust the error bounds for D_t.

Theorem 3.3.
Let M_s be a model trained on D_s with prediction error bounds err_l and err_u. Let dist be the distance between D_s and D_t. The error bounds of M_s over D_t of size n_t are:

err′_l = −dist · n_t + err_l · s_Δy    (6)
err′_u = dist · n_t + err_u · s_Δy    (7)

Proof.
For any x ∈ D_t, let y be the storage position of x in D_t, and let y′ be the storage position of the data point in D_s corresponding to T_in(x). Then, M_s(T_in(x)) is bounded by:

y′ − err_l ≥ M_s(T_in(x))    (8)

After output mapping, T_out(M_s(T_in(x))) is the predicted position of M_s over D_t, which should be bounded by:

y − err′_l ≥ T_out(M_s(T_in(x)))    (9)

Since the mappings are monotonic, Inequality (8) can be rewritten as:

T_out(y′ − err_l) = T_out(y′) − T_out(err_l) + d ≥ T_out(M_s(T_in(x)))    (10)

where d is the intercept of T_out(·). Given dist as the distance between D_s and D_t, after input and output mappings, we have |T_out(y′) − y| ≤ dist · n_t, i.e., y + dist · n_t ≥ T_out(y′). Combining with Inequality (10), we have y + dist · n_t − T_out(err_l) + d ≥ T_out(M_s(T_in(x))). Given this inequality, to satisfy Inequality (9), we enforce y − err′_l ≥ y + dist · n_t − T_out(err_l) + d. Thus, err′_l ≤ −dist · n_t + err_l · s_Δy, where s_Δy is the slope of T_out(·). Letting err′_l = −dist · n_t + err_l · s_Δy ensures query correctness. Similarly, we can derive err′_u = dist · n_t + err_u · s_Δy. □
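A Python sketch of Theorem 3.3, widening the reused model's error bounds (our helper name; dist may be the true CDF distance or its upper bound dist* from Algorithm 2):

```python
def adapt_error_bounds(err_l, err_u, dist, n_t, y_sl, y_su, y_tl, y_tu):
    """Return the error bounds of a reused model M_s over D_t."""
    s_dy = (y_tu - y_tl) / (y_su - y_sl)     # slope of T_out
    return (-dist * n_t + err_l * s_dy,      # Eq. (6)
            dist * n_t + err_u * s_dy)       # Eq. (7)

# E.g., bounds [-2, 3] on D_s, dist* = 0.05, |D_t| = 1000, equal ranges:
print(adapt_error_bounds(-2, 3, 0.05, 1000, 0, 999, 0, 999))
# -> (-52.0, 53.0): the search range grows with dist and |D_t|.
```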
Learned indices with agile model reuse.

We showcase the applicability of agile model reuse over existing learned indices by building a two-layer RMI [10], as shown in Fig. 3. We first compute sim(D_s, D_{1,1}) between the full dataset D_{1,1} to be indexed and a synthetic dataset D_s, which has a pre-trained model M_s in Q_sm.

Figure 3: Building RMI with agile model reuse (τ = 0.8).

Suppose sim(D_s, D_{1,1}) ≥ τ = 0.8. Then, M_s is reused over D_{1,1}. For D_{2,1}, a subset to be indexed by a child model in RMI, we cannot find a synthetic dataset with a similarity greater than or equal to 0.8. We train a model M_{2,1} over D_{2,1} and put it into Q_sm for reuse later. For the other subset D_{2,2}, we find another synthetic dataset D_s′ with sim(D_s′, D_{2,2}) ≥ 0.8. Its model M_s′ is reused over D_{2,2}, which completes the RMI.

Recursive Model Reuse Tree.
In RMI, the number of layers and the number of models in each layer are fixed. If the data is skewed, the cardinality of the subsets assigned to different models can vary considerably, resulting in high prediction errors and search costs on some models. To address this issue, we design a learned index structure with built-in agile model reuse support named the recursive model reuse tree (RMRT).

Suppose that the models used in RMRT have the same learning capacity (e.g., neural networks of the same structure), which can fit at most K points each. When |D_t| is greater than K, we first learn a model M_{1,1} to predict the points into B partitions, where B is a system parameter. Then, recursively, for points predicted to partition i, we learn another model M_{2,i} to partition them. This process continues until each partition has at most K points, at which point the partition is indexed by a learned model. Agile model reuse is applied whenever a model is needed in this process (see the sketch below). Fig. 4a gives an example with K = 4 and B = 2. Model M_{1,1} predicts two partitions (i.e., subsets) D_{2,1} and D_{2,2} that contain the first four and the last eight points, respectively. Further partitioning is needed for D_{2,2}. A model M_{2,2} is learned for this, creating two partitions of size K each. The partitioning then stops.

Figure 4: RMRT and insertion handling examples. (a) RMRT (K = 4, B = 2); (b) Insertion handling.
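A Python sketch of the RMRT build recursion, assuming the agile_model_reuse sketch shown earlier and a hypothetical partition(model, D, B) helper that routes each point to one of B child partitions via the model's prediction:

```python
K = 4   # max points a single model can fit (system parameter)
B = 2   # branching factor (system parameter)

class Node:
    def __init__(self, model, children=None):
        self.model = model          # reused or newly trained model
        self.children = children    # None for leaves

def build_rmrt(D, Q_sm, tau):
    model = agile_model_reuse(D, Q_sm, tau)   # Algorithm 1
    if len(D) <= K:
        return Node(model)                    # leaf: model indexes D
    subsets = partition(model, D, B)          # may be very uneven
    return Node(model, [build_rmrt(S, Q_sm, tau) for S in subsets])
```

Because the recursion only continues where a partition still exceeds K points, dense regions naturally receive deeper sub-trees, which is how RMRT adapts to skew.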
4 UPDATE HANDLING

We focus on insertions. Deletions can be implemented simply as a point query that marks the queried point as "deleted".

To handle insertions, we examine their impact on the CDF. As shown in Fig. 4b, when data point 5 is inserted, only the CDF of D_{2,1} is impacted in the second layer of the recursive model. For D_{2,2}, which is not impacted, we can simply add 1 to its model error bounds, since all of its storage positions shift by one. For D_{2,1}, we have to check whether the reused model M_{2,1} can still meet the similarity bound defined by threshold τ. To enable efficient checks, we propose a bound on the maximum number of insertions that can be handled without requiring model updates.

Lemma 4.1.
Let D be a dataset with cardinality n_D and M_D be a model over D. Let sim be the similarity between D and the dataset from which M_D is trained, which can be D itself or a synthetic dataset. If there are fewer than ((sim − τ) / (1 + τ − sim)) · n_D insertions on D, we can still reuse model M_D for the resultant dataset D′.

Proof.
After n_i insertions, the CDFs CDF_D(x) and CDF_D′(x) may become different. In the worst case, all new data points are inserted at the same position, where the difference between CDF_D(x) and CDF_D′(x) is bounded by dist(D, D′) ≤ n_i / (n_i + n_D). Recall that M_D is reused with sim ≥ τ or trained on D itself (sim = 1 ≥ τ). We can use the gap sim − τ as a buffer to accommodate the CDF drift caused by the insertions. By the transitivity of inequalities, if n_i / (n_i + n_D) ≤ sim − τ, there is no need to rebuild a model over D′, since the impact of the insertions on the CDF of D cannot exceed the buffer sim − τ. Solving this inequality for n_i yields the bound n_i ≤ ((sim − τ) / (1 + τ − sim)) · n_D. □

According to Lemma 4.1, insertions can be handled without model rebuilds when their number does not exceed the bound. When a new data point is inserted, we find the target insertion position through a point query and obtain the corresponding model M. If the number of insertions on M has not exceeded the bound, the insertion is completed. Otherwise, we only rebuild the model indexing the inserted data point.
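A Python sketch of the Lemma 4.1 insertion budget, i.e., how many insertions a model can absorb before its CDF drift may exceed the τ buffer:

```python
def insertion_budget(sim, tau, n_D):
    """sim = 1.0 if the model was trained on D itself."""
    return int((sim - tau) / (1 + tau - sim) * n_D)

print(insertion_budget(sim=1.0, tau=0.9, n_D=10_000))   # 1111
print(insertion_budget(sim=0.95, tau=0.9, n_D=10_000))  # 526
```

As the second call shows, a reused model (sim < 1) has a smaller budget than a freshly trained one, since part of the τ buffer is already consumed by the initial dissimilarity.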
5 EXPERIMENTS

All experiments are done on a 64-bit machine with a 3.60 GHz Intel i9 CPU, an RTX 2080Ti GPU, 64 GB RAM, and a 1 TB hard disk. We implement the linear models and the neural network models using Scikit-learn [17] and PyTorch [15], respectively.
Competitors.
We compare with both traditional and learned indices: 1) BTree [18], a C++ based in-memory B+-tree from the STX B+ Tree package; 2) RMI [10], the linear RMI model from the SOSD benchmark [13]; 3) RMI-NN, our implementation of the neural network RMI model; 4) PGM [5], a piecewise geometric model index; and 5) RS [8], a single-pass learned index.

Proposed models.
We study the performance of the following proposed and adapted models: 1) RMI-MR, the linear RMI model enhanced by agile model reuse; 2) RMI-NN-MR, the neural RMI model enhanced by agile model reuse; and 3) RMRT, our proposed learned index.
Implementation details.
For BTree, RMI, PGM, and RS, we use their published source code and default configurations. For RMI-MR, we adapt the original model training code to include agile model reuse. For neural network based models, including RMI-NN, RMI-NN-MR, and RMRT, we use feedforward neural networks, each with one hidden layer of four neurons. We use a fixed RMRT model size threshold K, which shows strong empirical performance. We set the default value for τ to 0.9.

We summarize the number of synthetic datasets (each with 100 points) and the time to pre-train models on them in Table 2. Note that we use m = 12 < ⌈2/(1 − τ)⌉ = 20 when τ = 0.9. As a result, the number of datasets generated is bounded by 10,000; all pre-trained models can be loaded in memory within a second (30 MB in size); and the total model comparison time to build an index in any of the experiments is also within a second.
Table 2: Summary of Synthetic Datasets.

m                               4     5     7     10     12
Number of datasets              19    95    987   8,953  1,221
Total model training time (s)   2.1   8.8   63.5  839.5  109.1

Datasets.
Following SOSD [13], we use four real datasets: amzn (default), an Amazon book popularity dataset; face, a Facebook user ID dataset; osm, an OpenStreetMap cell ID dataset; and wiki, a Wikipedia edit timestamp dataset. We further generate skewed datasets from uniform data by raising each key value x to a power x^α (α = 1, 3, 5, 7, 9).
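A one-line Python sketch of this skewed-data generation (raising uniform keys in [0, 1] to the power α concentrates them toward one end as α grows):

```python
import numpy as np

def skewed_dataset(n, alpha):
    return np.sort(np.random.rand(n) ** alpha)   # alpha = 1: uniform

datasets = {alpha: skewed_dataset(1_000_000, alpha)
            for alpha in (1, 3, 5, 7, 9)}
```

Performance metrics.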
We report the index build time, lookup (i.e., point query) time, and update (insertion) time.
Index build time on real datasets.
The four real datasets have different distributions. They lead to different index build times, as shown in Fig. 5a. BTree is the fastest to build and is little impacted by the data distribution, due to its simple building procedure. With agile model reuse, RMI-NN-MR is two orders of magnitude faster than RMI-NN on all four datasets, while RMI-MR is also consistently faster than its non-model-reusing counterpart RMI. Our RMRT further outperforms RMI-NN-MR and PGM, by up to 74% (3.0 s vs. 11.7 s on face) and 32% (5.0 s vs. 7.4 s on face), respectively. RS is faster than RMRT, due to its single-pass procedure, but its lookup time is substantially higher than that of RMRT, which is detailed next.

Figure 5: Build and lookup time on real datasets. (a) Index build time; (b) Lookup time.

Lookup time on real datasets.
Each real dataset has 10 million random lookup keys, and we report the average lookup time in Fig. 5b. RMRT is the fastest over all four datasets. On amzn, RMRT (189 ns) is 46%, 35%, 14%, 7%, 52%, 38%, and 20% faster than BTree (534 ns), RMI-NN (440 ns), RMI (221 ns), RMI-MR (205 ns), RMI-NN-MR (401 ns), PGM (308 ns), and RS (236 ns), respectively. RMI-MR and RMI-NN-MR perform better than RMI and RMI-NN over amzn, face, and wiki, since these three datasets are well distributed and can be fitted by the pre-trained models. The osm dataset differs more from the synthetic datasets, so the reused models incur increased lookup costs. We further note that the RS index stores spline points, the number of which is a decisive factor in RS lookup time. To obtain the lookup performance shown here, the index size of RS is about an order of magnitude larger than that of RMRT.

Figure 6: Build and lookup time on skewed datasets. (a) Index build time; (b) Lookup time.

Index build time on skewed datasets.
Since our RMRT targets skewed data, we further test the indices on synthetic data with increasing skewness. As Fig. 6a shows, the index build times are little impacted by data skewness. This is consistent with the results on real datasets, where the index build times are also similar across different datasets. RMI-NN-MR and RMI-MR again outperform RMI-NN and RMI, respectively, while our RMRT is only slightly slower than BTree and the single-pass learned index RS.
Lookup time on skewed datasets.
As Fig. 6b shows, our RMRT again yields the best lookup performance on all skewed datasets (skewness α = 1 denotes uniform data), and its performance is stable as the data skewness increases, confirming its capability to adapt to skewed data. In contrast, the four RMI-based indices have fast-increasing lookup times as the data skewness increases, as analyzed in Section 3. BTree, PGM, and RS are also less impacted because their designs are based on worst-case scenarios.
Index build time under varying τ.

Table 3 shows that, as τ increases, the build times of both RMI-NN-MR and RMRT increase. This is because a larger τ requires a higher similarity for model reuse, and thus more datasets are examined. For our RMRT, the build time decreases initially. This is because a small τ (τ = 0.5) cannot fit the datasets well, which creates uneven partitions that take more models to fit. As τ increases beyond τ = 0.6, the build time of RMRT rises again. Note that RMI-MR shows a similar trend to RMI-NN-MR and is omitted for conciseness.
Lookup time under varying τ.

Table 3 also shows that, as τ increases, the lookup times of both models decrease. This is because better-fitted models are selected for a larger τ, which brings shorter search ranges. We see that the benefit in lookup outweighs the extra index building costs when using a larger τ.

Table 3: Build, lookup, and insertion time under varying τ.

Update time under varying τ.

Table 3 further shows the times for inserting 100% more points (following the distribution of amzn, same below). We see that the insertion times increase with τ. This is because the bound on the number of insertions before model rebuilding shrinks as τ grows (Lemma 4.1), i.e., a larger τ triggers rebuilds more eagerly.

Update time under varying insertion ratios.
Next, we test the impact of the number of points inserted. RMI, RMI-NN, and RS are static indices and are omitted from this experiment. For the fanout (i.e., the maximum error in PGM), we use fixed powers of two for PGM and for RMRT and RMI-NN-MR, respectively. As shown in Fig. 7a, the insertion times of BTree, RMI-NN-MR, and RMRT increase with the insertion ratio. For PGM, the insertion time rises in stages. It has higher insertion costs than RMI-NN-MR in most cases (from 10% to 90%). This is because PGM uses the logarithmic method [14], which builds and merges (hence the cost jumps) a set of PGMs for the insertions. For RMI-NN-MR, when the insertion ratio is within 80%, the cost increases slowly because most points are inserted directly. The costs increase faster when the insertion ratio exceeds 80%, where the insertion bound is exceeded, leading to more rebuilds. For RMRT, the insertion time is more stable. This is because each RMRT sub-model indexes a relatively small set of data points and has a relatively small rebuild bound. Model rebuilds are triggered steadily as new data points are inserted.

Update time under varying branching parameters.
Due to the parameter limitation of dynamic PGM, we vary the fanout (i.e., the maximum error in PGM) over a range of powers of two.

Figure 7: Insertion time results. (a) Varying insertion ratio; (b) Varying branching parameter.

As Fig. 7b shows,
PGM outperforms RMI-NN-MR when its maximum error (fanout) is sufficiently large, because a larger fanout for the RMI learned indices means fewer data points in each model. The cardinality of each model in the second layer is smaller, and the bound for insertion is also smaller (cf. Lemma 4.1), such that rebuilds happen more often. For RMRT, the insertion performance becomes better as the fanout increases. This is because it can adaptively divide the underlying models and provide more positions for direct insertions without frequent model rebuilds.

6 CONCLUSIONS AND FUTURE WORK

We note several directions to be explored next. We have omitted the sorting costs in index building, since these are shared by all indices. It would be interesting to further optimize these costs with a learning-based technique. Our CDF similarity approximation considers the maximum distance between the CDFs. An alternative is to take the average distance. How to bound the search range in this case is an interesting challenge.
We proposed to reuse pre-trained models for indexing new (or updated) datasets to address the building and update efficiency issues in learned indices. We proposed a similarity metric to measure the distribution difference between two datasets. Based on this metric, our agile model reuse algorithm can efficiently select the most suitable pre-trained model to index a new (or updated) dataset. We showed how the prediction error of the selected pre-trained model is bounded on the new (or updated) dataset. We demonstrated the effectiveness of the proposed algorithm by applying it to the RMI learned index [10] and our proposed learned index RMRT. Experimental results on synthetic datasets and four real datasets show that our agile model reuse technique can improve the building and update time of learned indices substantially with little impact on the lookup performance.
REFERENCES
[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. 2010. A Theory of Learning from Different Domains. Machine Learning.
[2] C. Cortes and M. Mohri. 2011. Domain Adaptation in Regression. In International Conference on Algorithmic Learning Theory. 308-323.
[3] A. Davitkova, E. Milchevski, and S. Michel. 2020. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries. In EDBT. 407-410.
[4] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, B. Chandramouli, J. Gehrke, D. Kossmann, D. Lomet, and T. Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. In SIGMOD. 969-984.
[5] P. Ferragina and G. Vinciguerra. 2020. The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. PVLDB 13, 8 (2020), 1162-1175.
[6] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In SIGMOD. 1189-1206.
[7] A. Hadian and T. Heinis. 2019. Considerations for Handling Updates in Learned Index Structures. In International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 3:1-3:4.
[8] A. Kipf, R. Marcus, A. v. Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. 2020. RadixSpline: A Single-Pass Learned Index. In International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 5:1-5:5.
[9] Kolmogorov-Smirnov Test. 2020. https://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test. Accessed: 2021-01-06.
[10] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. 2018. The Case for Learned Index Structures. In SIGMOD. 489-504.
[11] X. Li, J. Li, and X. Wang. 2019. ASLM: Adaptive Single Layer Model for Learned Index. In DASFAA. 80-95.
[12] Y. Mansour, M. Mohri, and A. Rostamizadeh. 2009. Domain Adaptation: Learning Bounds and Algorithms. In COLT.
[13] R. Marcus, A. Kipf, A. v. Renen, M. Stoian, S. Misra, A. Kemper, T. Neumann, and T. Kraska. 2020. Benchmarking Learned Indexes. PVLDB 14, 1 (2020), 1-13.
[14] M. H. Overmars. 1983. The Design of Dynamic Data Structures. Lecture Notes in Computer Science, Vol. 156. Springer.
[15] PyTorch. 2016. https://pytorch.org. Accessed: 2021-01-06.
[16] J. Qi, Y. Tao, Y. Chang, and R. Zhang. 2018. Theoretically Optimal and Empirically Efficient R-trees with Strong Parallelizability. PVLDB 11, 5 (2018), 621-634.
[17] Scikit-learn. 2007. https://scikit-learn.org. Accessed: 2021-01-06.
[18] STX B+ Tree. 2007. https://panthema.net/2007/stx-btree. Accessed: 2021-01-06.
[19] J. C. Thibault, D. R. Roe, J. C. Facelli, and T. E. Cheatham III. 2014. Data Model, Dictionaries, and Desiderata for Biomolecular Simulation Data Indexing and Sharing.