A Lazy Approach for Efficient Index Learning
Guanli Liu, University of Melbourne
Lars Kulik, University of Melbourne
Xingjun Ma, Deakin University
Jianzhong Qi, University of Melbourne
ABSTRACT
Learned indices using neural networks have been shown to outperform traditional indices such as B-trees in both query time and memory. However, learning the distribution of a large dataset can be expensive, and updating learned indices is difficult, thus hindering their usage in practical applications. In this paper, we address the efficiency and update issues of learned indices through agile model reuse. We pre-train learned indices over a set of synthetic (rather than real) datasets and propose a novel approach to reuse these pre-trained models for a new (real) dataset. The synthetic datasets are created to cover a large range of different distributions. Given a new dataset D_t, we select the learned index of a synthetic dataset similar to D_t to index D_t. We show a bound over the indexing error when a pre-trained index is selected. We further show how our techniques can handle data updates and bound the resultant indexing errors. Experimental results on synthetic and real datasets confirm the effectiveness and efficiency of our proposed lazy (model reuse) approach.

1 INTRODUCTION

Learned indices using neural networks have been shown to outperform traditional indices such as B-trees in both query time and memory [5, 10, 13]. Given a dataset (e.g., a database table), an index is a structure that maps the index key p.key of a data point p (e.g., a data record) to its storage address p.pos. The idea of learned indices is to train a machine learning model F (e.g., a neural network) to approximate the mapping from p.key to p.pos. Previous work has shown that such learned indices can be simpler and more query-efficient than traditional indices. The trained model F can predict p.pos with a bounded error range [err_l, err_u], i.e., the data point p can be found in the range [F(p.key) + err_l, F(p.key) + err_u] [5].

While learned indices have efficient query procedures, they are prone to slow building and updates, since machine learning models are expensive to train, and once trained, they are difficult to update. Even with simple models such as linear splines, cubic splines, or linear regression, a learned index such as the recursive model index (RMI) [10] is two orders of magnitude slower to build than a B-tree [13]. Techniques that learn indices in a single pass such as RadixSpline [8] can be built faster, but they tend to produce sub-optimal indices of large sizes and lower query efficiency. The high costs in model training also prevent the retraining of learned indices for every data update. Existing learned indices [4-6] avoid model retraining by storing newly inserted points into additional structures, which inevitably adds query processing costs. This limits the applicability of learned indices in dynamic scenarios with frequent dataset creation or updates, which is common in practice, for example, to index sensor data or data from scientific studies (simulations) [19].
Table 1: Two example datasets D_s and D_t.

In this paper, we aim to address the efficiency issues in training and updating learned indices without hindering their query efficiency. Our solution is inspired by domain adaptation [1]. Given a model M_s trained on an existing (source) dataset D_s, domain adaptation reuses M_s for a new (target) dataset D_t by fine-tuning M_s over D_t. This avoids training a new model on D_t from scratch, which can be extremely time-consuming.

A key requirement for successful adaptation of M_s to D_t is that D_s and D_t should have similar distributions [2, 12]. Otherwise, M_s may yield large errors on D_t. This is important in our problem as we aim to further skip fine-tuning on D_t, to achieve fast updates. This motivates us to generate synthetic datasets to cover a wide range of different data distributions and pre-train reusable indices on such datasets. Our dataset generation is based on the cumulative distribution function (CDF). Given a new dataset D_t, we measure the CDF similarity between D_t and the synthetic datasets. We select a model pre-trained on a synthetic dataset similar to D_t as the index model for D_t.

Figure 1: CDFs of D_s and D_t.

Table 1 and Fig. 1 illustrate two example datasets (both sorted in ascending order) and their corresponding CDFs. We denote the CDFs of D_s and D_t as CDF_s(·) and CDF_t(·), respectively. In this toy example, an index model M_s is learned to predict the rank p.pos (or percentile) of point p ∈ D_s based on its search key p.key; that is, M_s(p.key) predicts the percentile of p, and p.pos ≈ M_s(p.key) · |D_s|. This effectively learns CDF_s(p). For p ∈ D_s, CDF_s(p) measures the probability of a value less than or equal to p, which is also the (normalized) rank of p in D_s. When reusing M_s for D_t, the additional prediction errors introduced can be bounded with respect to the dissimilarity between CDF_s(·) and CDF_t(·). This suggests that, if we can generate synthetic datasets that cover a sufficiently large area of the space of all possible CDFs, the learned indices on the synthetic datasets can be reused for any new dataset with bounded prediction errors. Since the CDF of a dataset with n points takes O(n) time to compute, we further propose a histogram-based approximation of the CDF, with bounded errors, to reduce the computation time to only O(log n).

We use a model reuse threshold τ ∈ (0, 1] to help determine whether to reuse a pre-trained model for a new dataset D_t. When the CDF similarity between D_t and a synthetic dataset D_s is greater than or equal to τ, we reuse the model M_s pre-trained on D_s to index D_t. Based on τ, we further derive the maximum additional prediction error of M_s on D_t, and we derive the number of synthetic datasets to be generated. Since our model reuse procedure is fast and flexible, we call it agile model reuse. Following a similar idea, we adapt agile model reuse to handle updates. When the similarity between the CDFs of a dataset D_t and its updated version D′_t is greater than or equal to τ, we can reuse model M_t trained on D_t without retraining.

To showcase the applicability of our agile model reuse technique, we integrate it into the RMI learned index [10]. We show that agile model reuse can significantly reduce the training time of the sub-models in RMI. We then propose a new index structure named recursive model reuse tree (RMRT) with built-in agile model reuse support.
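To make the rank-prediction idea concrete, the following minimal Python sketch (ours, not the paper's implementation) stands in a perfect empirical CDF for the trained model F and uses it to predict storage positions:

```python
# A toy illustration: a (perfect) CDF model predicts storage positions.
import numpy as np

def empirical_cdf(data):
    """Return a function approximating the CDF of a dataset."""
    data = np.sort(data)
    def cdf(x):
        # Fraction of points with key <= x, i.e., the normalized rank.
        return np.searchsorted(data, x, side="right") / len(data)
    return cdf

D_s = np.random.rand(1000)        # toy dataset with keys in [0, 1]
M_s = empirical_cdf(D_s)          # stands in for a trained model
key = 0.42
pos = int(M_s(key) * len(D_s))    # predicted position: CDF(key) * |D_s|
# A learned index would search a small error range around pos.
```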
RMRT is designed to be adaptive to different data distributions: it builds sub-models with more layers for denser regions of a dataset. This is particularly useful for skewed data, which has not been addressed in RMI.

In summary, our key contributions are:
(1) We propose an agile model reuse technique to pre-train a set of models on synthetic datasets and adaptively select the most suitable model to index a new dataset with respect to a model reuse threshold τ. We show how to bound the additional model prediction error given τ.
(2) We propose a new index structure named RMRT, which has built-in agile model reuse support and adaptively builds an unbalanced hierarchical structure for better indexing of skewed data.
(3) Extensive experiments on synthetic and real datasets show that agile model reuse can accelerate the building time of neural network-based learned indices by two orders of magnitude, while retaining the lookup efficiency. Further, our agile model reuse based index RMRT is faster than RMI-based structures to build, while it outperforms all baseline models in lookup performance.

2 RELATED WORK

A learned index [3-6, 10, 11] learns a mapping from a search key to the storage address of a data point with a machine learning model. Due to limits on the learning capacity of a single model, existing learned indices such as RMI [10] build a hierarchy of models to index large datasets. The idea is similar to that of traditional hierarchical indices: top-level models predict partitions of the data points (i.e., the lower-level model in which a point is indexed), while leaf-level models predict the storage locations. The training and updates of a hierarchical learned index can be very expensive, especially when neural networks are used. Follow-up studies aim to bound the prediction error of the learned model. For example, PGM [5] builds a hierarchical learned index bottom-up, with a worst-case prediction error bound ε on every learned model. The building time of such learned indices is also high.

Update handling.
Updates may change the data distribution from which an index model is learned and amplify the model prediction error. Existing studies have focused on handling insertions, since deleted points can simply be flagged as "removed" with a light impact on query processing. For query correctness, one may update the prediction error bounds to err_l − k and err_u + k after k insertions. Tighter bounds are achieved by keeping track of the error bound drifts for a number of reference points [7]. At query time, the closest reference points on both sides of the query point are fetched, and their error bound drifts are used to estimate the updated error bounds with a linear interpolation. PGM [5] uses two different strategies to handle insertions. For time series data insertion, it can either add a new point to the last model or add a new model to handle the new point. For arbitrary insertions, it applies the logarithmic method [14] and builds a series of PGM indices for the insertions. All these indices need extra structures to handle updates, which impact the query efficiency.

Domain adaptation.
The idea of domain adaptation is to adapt a model pre-trained on a dataset D_s for a new problem with a different dataset D_t. A key step is to measure the similarity between the distributions of D_s and D_t. The L1 distance is often used [12]. It does not suit our problem because it cannot help bound the index prediction error on D_t. The discrepancy [2] is another measure. It is designed based on testing whether the training loss differs significantly on D_s and D_t. This is inapplicable because we require a highly efficient test to determine online whether a model can be reused for D_t. Typical domain adaptation techniques also fine-tune the pre-trained model on D_t, while we skip this step for efficiency considerations.

3 AGILE MODEL REUSE

Given a new or updated dataset D_t, we aim to construct a learned index M_t for D_t with a high efficiency. We first present an overview of our agile model reuse technique. We then detail its key components, including dataset similarity measurement, synthetic dataset generation, model adaptation, and error bounding. We will also showcase the applicability of our technique on an existing learned index and a novel one.
Algorithm 1: Agile Model Reuse
Input: D_t, Q_sm
Output: M_t
1 for <D_s, M_s> ∈ Q_sm do
2     dist ← cdf_distance(D_s, D_t);
3     if dist ≤ 1 − τ then
4         M_t ← adapt_model(M_s, D_s, D_t);
5         return M_t;
6 Train model M_t over D_t;
7 M_t.max_err_rng ← M_t.calc_err(D_t);
8 Q_sm.enqueue(<D_t, M_t>, M_t.max_err_rng);
9 return M_t;

Agile model reuse overview.
Algorithm 1 summarizes our agile (i.e., fast and flexible) model reuse technique. We pre-train models on synthetic datasets (detailed later) which are reused to index D_t. The pre-trained models are organized in a priority queue Q_sm. Each entry in Q_sm contains the information of a synthetic dataset D_s and its corresponding trained model M_s. The trained models are sorted by their error bounds in ascending order. Algorithm 1 traverses Q_sm (line 1), calculates the distance (dissimilarity) between D_t and each synthetic dataset D_s (line 2), and finds the first model where the distance is smaller than or equal to 1 − τ, where τ ∈ (0, 1] is the model reuse threshold (line 3). If such a model is found, the model and its error bounds are adapted based on the dataset distance (line 4, detailed later), and the adapted model is returned as M_t (line 5). Otherwise, we train a new model M_t for D_t (line 6) and obtain the error range (err_u − err_l, line 7). We enqueue and return the model (lines 8 and 9).

We use τ to control the dataset similarity in model reuse. A smaller τ allows the algorithm to return a model earlier, which may not have a high similarity with D_t and may lead to low prediction accuracy and high query costs. In contrast, a larger τ can cost more time in traversing Q_sm but gain a better-fitted model with high prediction accuracy and low query costs. As τ increases in the range (0, 1], the requirement for agile model reuse gets stricter. We elaborate on the effect of τ in Section 5.
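A minimal Python sketch of Algorithm 1 follows. The helpers cdf_distance, adapt_model, train_model, and calc_err_range are hypothetical names of ours (cdf_distance can be the histogram-based distance introduced next):

```python
# A runnable sketch of Algorithm 1, assuming the four helpers above.
def agile_model_reuse(D_t, Q_sm, tau):
    """Q_sm: list of (err_range, D_s, M_s) sorted by err_range ascending."""
    for _, D_s, M_s in Q_sm:                    # line 1
        dist = cdf_distance(D_s, D_t)           # line 2
        if dist <= 1.0 - tau:                   # line 3: similarity >= tau
            return adapt_model(M_s, D_s, D_t)   # lines 4-5
    M_t = train_model(D_t)                      # line 6: no match, train anew
    err = calc_err_range(M_t, D_t)              # line 7: err_u - err_l
    Q_sm.append((err, D_t, M_t))                # line 8: enqueue for reuse
    Q_sm.sort(key=lambda entry: entry[0])       # keep sorted by error range
    return M_t                                  # line 9
```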
A model M_s for dataset D_s effectively learns a CDF of D_s. To reuse M_s on D_t, it is important that the CDFs of D_s and D_t are similar. We thus define the similarity between D_s and D_t by their CDFs.

Definition 3.1 (Similarity between two datasets).
Given two datasets D_s and D_t, their similarity is defined by the maximum distance between their CDFs:

sim(D_s, D_t) = 1 − sup_x |CDF_s(x) − CDF_t(x)|    (1)

Here, sup_x |CDF_s(x) − CDF_t(x)| is the maximum gap between CDF_s(x) and CDF_t(x). We use sim(D_s, D_t) and dist(D_s, D_t) = 1 − sim(D_s, D_t) to denote the similarity and the distance (dissimilarity) between D_s and D_t, respectively.

This similarity metric is based on the Kolmogorov-Smirnov (KS) test [9], which takes O(|D_s| + |D_t|) time to compute, assuming that both datasets are already sorted. This may be too expensive for online computation on large datasets. We present an approximate similarity metric for faster computation.

Our approximate similarity metric uses relative frequency histograms ("histograms" for short) that discretize the data domain into bins and record the relative frequencies (i.e., percentages) of the data points in each bin. A histogram is a discrete approximation of the probability density function (PDF) of a dataset. We use it to compute an approximation of the CDF and, in turn, an approximation of dist(D_s, D_t), denoted by dist*(D_s, D_t).

Algorithm 2 summarizes the computation process, which takes as input histograms of D_s and D_t with m (a system parameter) bins each, denoted by H_s and H_t. We use H_s[i] and H_t[i] to denote the i-th bins and their relative frequencies. The sums of the probabilities of the first i bins of H_s and H_t are denoted by P_s and P_t, i.e., P_s = Σ_{j=0}^{i−1} H_s[j] and P_t = Σ_{j=0}^{i−1} H_t[j].

The algorithm computes dist*(D_s, D_t) (dist* for short) by looping through the bins (lines 2 to 4). In the i-th iteration (i ∈ [0, m − 1]), it computes H_t[i] + P_t. This is the maximum CDF_t(x) for any x in the i-th bin's range (b_i, b_{i+1}] (in our synthetic datasets, x ∈ [0, 1]), because P_t has accumulated the probabilities for x ≤ b_i while H_t[i] further adds the probability for x ∈ (b_i, b_{i+1}]. Meanwhile, P_s is the minimum CDF_s(x) for any x ∈ (b_i, b_{i+1}]. Thus, ∀x ∈ (b_i, b_{i+1}]:

H_t[i] + P_t − P_s ≥ CDF_t(x) − CDF_s(x)
H_s[i] + P_s − P_t ≥ CDF_s(x) − CDF_t(x)    (2)

After going through all bins, we have:

dist*(D_s, D_t) ≥ |CDF_s(x) − CDF_t(x)|, ∀x ∈ (0, 1]    (3)

Thus, dist*(D_s, D_t) ≥ dist(D_s, D_t).
Algorithm 2: Histogram-based-Distance
Input: H_s, H_t
Output: dist*
1 dist* ← 0, P_s ← 0, P_t ← 0;
2 for i ∈ [0, m − 1] do
3     dist* ← max{H_t[i] + P_t − P_s, H_s[i] + P_s − P_t, dist*};
4     P_s ← P_s + H_s[i], P_t ← P_t + H_t[i];
5 return dist*;

Using histograms to discretize CDFs reduces the similarity computation time to O(m log |D_t| + m), i.e., O(m log |D_t|) time for computing H_t (m binary searches over the sorted D_t) and O(m) time for Algorithm 2. Histogram H_s is pre-computed since D_s is known; its cost is omitted here.
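A runnable Python sketch of the histogram construction and of Algorithm 2 (our code, under the assumption that keys are normalized to [0, 1] and sorted):

```python
import numpy as np

def histogram(D_sorted, m):
    """Relative-frequency histogram over [0, 1] with m equal-width bins.
    Uses m binary searches over the sorted data: O(m log |D|)."""
    edges = np.linspace(0.0, 1.0, m + 1)
    counts = (np.searchsorted(D_sorted, edges[1:], side="right")
              - np.searchsorted(D_sorted, edges[:-1], side="right"))
    return counts / len(D_sorted)

def hist_distance(H_s, H_t):
    """dist*: an upper bound on the true CDF distance (Algorithm 2)."""
    dist, P_s, P_t = 0.0, 0.0, 0.0
    for i in range(len(H_s)):                               # lines 2-4
        dist = max(H_t[i] + P_t - P_s, H_s[i] + P_s - P_t, dist)
        P_s, P_t = P_s + H_s[i], P_t + H_t[i]
    return dist

D_s = np.sort(np.random.rand(10_000))           # uniform keys
D_t = np.sort(np.random.beta(2, 5, 10_000))     # skewed keys in (0, 1)
print(hist_distance(histogram(D_s, 10), histogram(D_t, 10)))
```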
We aim to generate a small number of datasets with CDFs that can be similar to those of a large number of real datasets, as bounded by threshold τ.

We first generate a set of CDFs to cover the space of possible CDFs. We discretize the CDF space, such that it can be covered by a limited number of CDFs given threshold τ. As shown in Fig. 2a, after data normalization, all CDFs lie in a [0, 1] × [0, 1] space. Any CDF can be seen as a curve that starts at (0, 0) and travels to (1, 1) in a non-decreasing manner (in the CDF value dimension). We discretize this space with a grid, where each row has a height of 1 − τ, i.e., the grid has ⌈1/(1 − τ)⌉ rows. Consider the set L of polylines each starting from (0, 0) and traveling to (1, 1) via the grid vertices in a non-decreasing manner (in the CDF value dimension, e.g., the colored lines). Straightforwardly, given any CDF, there must be a polyline l ∈ L such that the distance between l and the CDF is bounded by 1 − τ (cf. Fig. 2b).

Figure 2: CDF space discretization. (a) CDFs of synthetic data; (b) CDFs of real and synthetic data.
The CDFs in L correspond to histograms with ⌈1/(1 − τ)⌉ bins, where each bin has a probability value in {0, 1 − τ, 2(1 − τ), . . . , 1}. To limit the bin value combinations and hence the number of CDFs (synthetic datasets) generated, we limit the probability value of each bin to be within {0, (1 − τ)/2, 1 − τ}, and we use m = ⌈2/(1 − τ)⌉ bins in the histogram heuristically. Our synthetic datasets hence will not cover the most skewed CDFs (e.g., the black polylines in Fig. 2a). However, when a target dataset D_t is matched by a synthetic dataset, the distance between their CDFs may be within (1 − τ)/2 rather than 1 − τ, which improves the query performance. Our total number of histograms generated is:

Σ_i ( C(m, i) · C(m − i, (1 − i(1 − τ)) / ((1 − τ)/2)) ),

where i ranges over the feasible counts of bins with probability value 1 − τ, and the two combination terms represent the numbers of bins with probability values of (1 − τ) and (1 − τ)/2, respectively. Once a histogram is generated, we generate a synthetic dataset of n_k key values (n_k = 100 in our experiments) based on the histogram, where the data range is [0, 1], and random key values are generated for each bin. This procedure is shown to be effective and efficient empirically.
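A Python sketch of this generator (our helper names; the histogram enumeration is brute force for clarity, which suffices for the small m values used here):

```python
# Enumerate histograms whose m bins take values in {0, (1-tau)/2, 1-tau}
# and sum to 1, then draw random keys per bin.
import itertools
import numpy as np

def generate_histograms(tau, m):
    half, full = (1 - tau) / 2, 1 - tau
    return [bins for bins in itertools.product([0.0, half, full], repeat=m)
            if abs(sum(bins) - 1.0) < 1e-9]

def dataset_from_histogram(hist, n_k=100):
    """Draw ~hist[i] * n_k uniform keys inside bin i's sub-range of [0, 1]."""
    m, keys = len(hist), []
    for i, p in enumerate(hist):
        keys.extend(np.random.uniform(i / m, (i + 1) / m,
                                      int(round(p * n_k))))
    return np.sort(np.asarray(keys))

hists = generate_histograms(tau=0.5, m=4)
print(len(hists))   # 19 histograms for tau = 0.5, m = 4 (cf. Table 2)
data = dataset_from_histogram(hists[0])
```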
When a model M_s pre-trained on D_s has been selected to index D_t, we need to adapt M_s based on the data domains of D_s and D_t. This is because M_s will not work properly on a domain over which it was not trained, even if the CDFs of D_s and D_t share a similar shape. Let the data ranges of D_s and D_t be [x_s^l, x_s^u] and [x_t^l, x_t^u], and their data storage position ranges be [y_s^l, y_s^u] and [y_t^l, y_t^u], respectively. Model M_s has been trained to take a search key in [x_s^l, x_s^u] as the input and predict a storage position in [y_s^l, y_s^u]. Here, we assume that M_s predicts the storage position of point p directly rather than its rank (or percentile), i.e., p.pos ≈ M_s(p.key) (instead of M_s(p.key) · |D_s| as shown in Section 1). This simplifies the discussion but does not impact our key findings. To adapt M_s for D_t, we take a search key in [x_t^l, x_t^u], map it into [x_s^l, x_s^u], and feed the mapped value into M_s for prediction. The predicted output needs to be mapped back into [y_t^l, y_t^u] for D_t.

Let s_Δx = (x_s^u − x_s^l) / (x_t^u − x_t^l) and s_Δy = (y_t^u − y_t^l) / (y_s^u − y_s^l). The input mapping is done by a linear transformation T_in(x) = a · x + b, where a = s_Δx and b = x_s^l − x_t^l · s_Δx. This is an affine transformation that maps the data range (i.e., T_in(x_t^l) = x_s^l and T_in(x_t^u) = x_s^u) without changing the distribution. Similarly, the output mapping is done by T_out(y) = c · y + d, where c = s_Δy and d = y_t^l − y_s^l · s_Δy. The mappings may incur extra costs (floating point calculation), which can be mitigated by adjusting the parameters of M_s. We use a linear model as an example.

Lemma 3.2.
Input and output mappings for a linear model M_s incur no additional prediction costs.

Proof.
Let M_s be the linear model y = w·x + z, where w and z are parameters. With input mapping, the output ŷ′ of M_s is:

ŷ′ = M_s(T_in(x)) = w · T_in(x) + z = w · s_Δx · x − w · x_t^l · s_Δx + w · x_s^l + z    (4)

After output mapping, the final prediction output ŷ is:

ŷ = T_out(ŷ′) = (ŷ′ − y_s^l) · s_Δy + y_t^l
  = w · s_Δx · s_Δy · x + (−w · x_t^l · s_Δx + w · x_s^l + z − y_s^l) · s_Δy + y_t^l    (5)

Thus, we can adapt M_s to a new linear model y = w′·x + z′ for D_t, where w′ = w · s_Δx · s_Δy and z′ = (−w · x_t^l · s_Δx + w · x_s^l + z − y_s^l) · s_Δy + y_t^l. Input and output mappings can be combined with linear models without extra prediction costs. □
Error bounding.
Recall the prediction errors of M_s over D_s, err_l and err_u. Given a query key x ∈ D_s, the position of x is bounded in [M_s(x) + err_l, M_s(x) + err_u]. After input and output mappings, we also need to adjust the error bounds for D_t.

Theorem 3.3.
Let M_s be a model trained on D_s with prediction error bounds err_l and err_u. Let dist be the distance between D_s and D_t. The error bounds of M_s over D_t of size n_t are:

err′_l = −dist · n_t + err_l · s_Δy    (6)
err′_u = dist · n_t + err_u · s_Δy    (7)

Proof.
For any x ∈ D_t, let y be the storage position of x in D_t, and let y′ be the storage position of the data point in D_s corresponding to T_in(x). Then, M_s(T_in(x)) is bounded by:

y′ − err_l ≥ M_s(T_in(x))    (8)

After output mapping, T_out(M_s(T_in(x))) is the predicted position of M_s over D_t, which should be bounded by:

y − err′_l ≥ T_out(M_s(T_in(x)))    (9)

Since the mappings are monotonic, Inequality (8) can be rewritten as:

T_out(y′ − err_l) = T_out(y′) − T_out(err_l) + d ≥ T_out(M_s(T_in(x)))    (10)

where d is the intercept of T_out(·). Given dist as the distance between D_s and D_t, after input and output mappings, we have |T_out(y′) − y| ≤ dist · n_t, i.e., y + dist · n_t ≥ T_out(y′). Combining with Inequality (10), we have y + dist · n_t − T_out(err_l) + d ≥ T_out(M_s(T_in(x))). Given this inequality, to satisfy Inequality (9), we enforce y − err′_l ≥ y + dist · n_t − T_out(err_l) + d. Thus, err′_l ≤ −dist · n_t + err_l · s_Δy, where s_Δy is the slope of T_out(·). Letting err′_l = −dist · n_t + err_l · s_Δy ensures query correctness. Similarly, we can derive err′_u = dist · n_t + err_u · s_Δy. □
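A Python sketch of Theorem 3.3, widening the reused model's error bounds (our helper name; dist may be the true CDF distance or its upper bound dist* from Algorithm 2):

```python
def adapt_error_bounds(err_l, err_u, dist, n_t, y_sl, y_su, y_tl, y_tu):
    """Return the error bounds of a reused model M_s over D_t."""
    s_dy = (y_tu - y_tl) / (y_su - y_sl)     # slope of T_out
    return (-dist * n_t + err_l * s_dy,      # Eq. (6)
            dist * n_t + err_u * s_dy)       # Eq. (7)

# E.g., bounds [-2, 3] on D_s, dist* = 0.05, |D_t| = 1000, equal ranges:
print(adapt_error_bounds(-2, 3, 0.05, 1000, 0, 999, 0, 999))
# -> (-52.0, 53.0): the search range grows with dist and |D_t|.
```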
Learned indices with agile model reuse.

We showcase the applicability of agile model reuse over existing learned indices by building a two-layer RMI [10], as shown in Fig. 3. We first compute sim(D_s, D_{1,1}) between the full dataset D_{1,1} to be indexed and a synthetic dataset D_s, which has a pre-trained model M_s in Q_sm.

Figure 3: Building RMI with agile model reuse (τ = 0.8).

Suppose sim(D_s, D_{1,1}) ≥ τ = 0.8. Then, M_s is reused over D_{1,1}. For D_{2,1}, a subset to be indexed by a child model in RMI, we cannot find a synthetic dataset with a similarity greater than or equal to 0.8. We train a model M_{2,1} over D_{2,1} and put it into Q_sm for reuse later. For the other subset D_{2,2}, we find another synthetic dataset D_s′ with sim(D_s′, D_{2,2}) ≥ 0.8. Its model M_s′ is reused over D_{2,2}, which completes the RMI.

Recursive Model Reuse Tree.
In RMI, the number of layers and the number of models in each layer are fixed. If the data is skewed, the cardinality of the subsets assigned to different models can vary considerably, resulting in high prediction errors and search costs on some models. To address this issue, we design a learned index structure with built-in agile model reuse support named the recursive model reuse tree (RMRT).

Suppose that the models used in RMRT have the same learning capacity (e.g., neural networks of the same structure), which can fit at most K points each. When |D_t| is greater than K, we first learn a model M_{1,1} to predict the points into B partitions, where B is a system parameter. Then, recursively, for points predicted to partition i, we learn another model M_{2,i} to partition them. This process continues until each partition has at most K points, at which point the partition is indexed by a learned model. Agile model reuse is applied whenever a model is needed in this process (see the sketch below). Fig. 4a gives an example with K = 4 and B = 2. Model M_{1,1} predicts two partitions (i.e., subsets) D_{2,1} and D_{2,2} that contain the first four and the last eight points, respectively. Further partitioning is needed for D_{2,2}. A model M_{2,2} is learned for this, creating two partitions of size K each. The partitioning then stops.

Figure 4: RMRT and insertion handling examples. (a) RMRT (K = 4, B = 2); (b) Insertion handling.
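A Python sketch of the RMRT build recursion, assuming the agile_model_reuse sketch shown earlier and a hypothetical partition(model, D, B) helper that routes each point to one of B child partitions via the model's prediction:

```python
K = 4   # max points a single model can fit (system parameter)
B = 2   # branching factor (system parameter)

class Node:
    def __init__(self, model, children=None):
        self.model = model          # reused or newly trained model
        self.children = children    # None for leaves

def build_rmrt(D, Q_sm, tau):
    model = agile_model_reuse(D, Q_sm, tau)   # Algorithm 1
    if len(D) <= K:
        return Node(model)                    # leaf: model indexes D
    subsets = partition(model, D, B)          # may be very uneven
    return Node(model, [build_rmrt(S, Q_sm, tau) for S in subsets])
```

Because the recursion only continues where a partition still exceeds K points, dense regions naturally receive deeper sub-trees, which is how RMRT adapts to skew.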
4 UPDATE HANDLING

We focus on insertions. Deletions can be implemented simply as a point query that marks the queried point as "deleted".

To handle insertions, we examine their impact on the CDF. As shown in Fig. 4b, when data point 5 is inserted, only the CDF of D_{2,1} is impacted in the second layer of the recursive model. For D_{2,2}, which is not impacted, we can simply add 1 to its model error bounds, since all of its storage positions shift by one. For D_{2,1}, we have to check whether the reused model M_{2,1} can still meet the similarity bound defined by threshold τ. To enable efficient checks, we propose a bound on the maximum number of insertions that can be handled without requiring model updates.

Lemma 4.1.
Let D be a dataset with cardinality n_D and M_D be a model over D. Let sim be the similarity between D and the dataset from which M_D is trained, which can be D itself or a synthetic dataset. If there are fewer than ((sim − τ) / (1 + τ − sim)) · n_D insertions on D, we can still reuse model M_D for the resultant dataset D′.

Proof.
After n_i insertions, the CDFs CDF_D(x) and CDF_D′(x) may become different. In the worst case, all new data points are inserted at the same position, where the difference between CDF_D(x) and CDF_D′(x) is bounded by dist(D, D′) ≤ n_i / (n_i + n_D). Recall that M_D is reused with sim ≥ τ or trained on D itself (sim = 1 ≥ τ). We can use the gap sim − τ as a buffer to accommodate the CDF drift caused by the insertions. By the transitivity of inequalities, if n_i / (n_i + n_D) ≤ sim − τ, there is no need to rebuild a model over D′, since the impact of the insertions on the CDF of D cannot exceed the buffer sim − τ. Solving this inequality for n_i yields the bound n_i ≤ ((sim − τ) / (1 + τ − sim)) · n_D. □

According to Lemma 4.1, insertions can be handled without model rebuilds when their number does not exceed the bound. When a new data point is inserted, we find the target insertion position through a point query and obtain the corresponding model M. If the number of insertions on M has not exceeded the bound, the insertion is completed. Otherwise, we only rebuild the model indexing the inserted data point.
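A Python sketch of the Lemma 4.1 insertion budget, i.e., how many insertions a model can absorb before its CDF drift may exceed the τ buffer:

```python
def insertion_budget(sim, tau, n_D):
    """sim = 1.0 if the model was trained on D itself."""
    return int((sim - tau) / (1 + tau - sim) * n_D)

print(insertion_budget(sim=1.0, tau=0.9, n_D=10_000))   # 1111
print(insertion_budget(sim=0.95, tau=0.9, n_D=10_000))  # 526
```

As the second call shows, a reused model (sim < 1) has a smaller budget than a freshly trained one, since part of the τ buffer is already consumed by the initial dissimilarity.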
5 EXPERIMENTS

All experiments are done on a 64-bit machine with a 3.60 GHz Intel i9 CPU, an RTX 2080Ti GPU, 64 GB RAM, and a 1 TB hard disk. We implement the linear models and the neural network models using Scikit-learn [17] and PyTorch [15], respectively.
Competitors.
We compare with both traditional and learned indices: 1) BTree [18], a C++ based in-memory B+-tree from the STX B+ Tree package; 2) RMI [10], the linear RMI model from the SOSD benchmark [13]; 3) RMI-NN, our implementation of the neural network RMI model; 4) PGM [5], a piecewise geometric model index; and 5) RS [8], a single-pass learned index.

Proposed models.
We study the performance of the following proposed and adapted models: 1) RMI-MR, the linear RMI model enhanced by agile model reuse; 2) RMI-NN-MR, the neural RMI model enhanced by agile model reuse; and 3) RMRT, our proposed learned index.
Implementation details.
For BTree, RMI, PGM, and RS, we use their published source code and default configurations. For RMI-MR, we adapt the original model training code to include agile model reuse. For neural network based models, including RMI-NN, RMI-NN-MR, and RMRT, we use feedforward neural networks, each with one hidden layer of four neurons. We use a fixed RMRT model size threshold K, which shows strong empirical performance. We set the default value for τ to 0.9.

We summarize the number of synthetic datasets (each with 100 points) and the time to pre-train models on them in Table 2. Note that we use m = 12 < ⌈2/(1 − τ)⌉ = 20 when τ = 0.9. As a result, the number of datasets generated is bounded by 10,000; all pre-trained models can be loaded in memory within a second (30 MB in size); and the total model comparison time to build an index in any of the experiments is also within a second.
Table 2: Summary of Synthetic Datasets.

m                               4     5     7     10     12
Number of datasets              19    95    987   8,953  1,221
Total model training time (s)   2.1   8.8   63.5  839.5  109.1

Datasets.
Following SOSD [13], we use four real datasets: amzn (default), an Amazon book popularity dataset; face, a Facebook user ID dataset; osm, an OpenStreetMap cell ID dataset; and wiki, a Wikipedia edit timestamp dataset. We further generate skewed datasets from uniform data by raising each key value x to a power x^α (α = 1, 3, 5, 7, 9).
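A one-line Python sketch of this skewed-data generation (raising uniform keys in [0, 1] to the power α concentrates them toward one end as α grows):

```python
import numpy as np

def skewed_dataset(n, alpha):
    return np.sort(np.random.rand(n) ** alpha)   # alpha = 1: uniform

datasets = {alpha: skewed_dataset(1_000_000, alpha)
            for alpha in (1, 3, 5, 7, 9)}
```

Performance metrics.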
We report the index build time, lookup (i.e., point query) time, and update (insertion) time.
Index build time on real datasets.
The four real datasets have different distributions. They lead to different index build times, as shown in Fig. 5a. BTree is the fastest to build and is little impacted by the data distribution, due to its simple building procedure. With agile model reuse, RMI-NN-MR is two orders of magnitude faster than RMI-NN on all four datasets, while RMI-MR is also consistently faster than its non-model-reusing counterpart RMI. Our RMRT further outperforms RMI-NN-MR and PGM, by up to 74% (3.0 s vs. 11.7 s on face) and 32% (5.0 s vs. 7.4 s on face), respectively. RS is faster than RMRT, due to its single-pass procedure, but its lookup time is substantially higher than that of RMRT, which is detailed next.

Figure 5: Build and lookup time on real datasets. (a) Index build time; (b) Lookup time.

Lookup time on real datasets.
Each real dataset has 10 million random lookup keys, and we report the average lookup time in Fig. 5b. RMRT is the fastest over all four datasets. On amzn, RMRT (189 ns) is 46%, 35%, 14%, 7%, 52%, 38%, and 20% faster than BTree (534 ns), RMI-NN (440 ns), RMI (221 ns), RMI-MR (205 ns), RMI-NN-MR (401 ns), PGM (308 ns), and RS (236 ns), respectively. RMI-MR and RMI-NN-MR perform better than RMI and RMI-NN over amzn, face, and wiki, since these three datasets are well distributed and can be fitted by the pre-trained models. The osm dataset differs more from the synthetic datasets, so the reused models incur increased lookup costs. We further note that the RS index stores spline points, the number of which is a decisive factor in RS lookup time. To obtain the lookup performance shown here, the index size of RS is about an order of magnitude larger than that of RMRT.

Figure 6: Build and lookup time on skewed datasets. (a) Index build time; (b) Lookup time.

Index build time on skewed datasets.
Since our RMRT targets skewed data, we further test the indices on synthetic data with increasing skewness. As Fig. 6a shows, the index build times are little impacted by data skewness. This is consistent with the results on real datasets, where the index build times are also similar across different datasets. RMI-NN-MR and RMI-MR again outperform RMI-NN and RMI, respectively, while our RMRT is only slightly slower than BTree and the single-pass learned index RS.
Lookup time on skewed datasets.
As Fig. 6b shows, our RMRT again yields the best lookup performance on all skewed datasets (skewness α = 1 denotes uniform data), and its performance is stable as the data skewness increases, confirming its capability to adapt to skewed data. In contrast, the four RMI-based indices have fast-increasing lookup times as the data skewness increases, as analyzed in Section 3. BTree, PGM, and RS are also less impacted because their designs are based on worst-case scenarios.
Index build time under varying τ.

Table 3 shows that, as τ increases, the build times of both RMI-NN-MR and RMRT increase. This is because a larger τ requires a higher similarity for model reuse, and thus more datasets are examined. For our RMRT, the build time decreases initially. This is because a small τ (τ = 0.5) cannot fit the datasets well, which creates uneven partitions that take more models to fit. As τ increases beyond τ = 0.6, the build time of RMRT rises again. Note that RMI-MR shows a similar trend to RMI-NN-MR and is omitted for conciseness.
Lookup time under varying τ.

Table 3 also shows that, as τ increases, the lookup times of both models decrease. This is because better-fitted models are selected for a larger τ, which brings shorter search ranges. We see that the benefit in lookup outweighs the extra index building costs when using a larger τ.

Table 3: Build, lookup, and insertion time under varying τ.

Update time under varying τ.

Table 3 further shows the times for inserting 100% more points (following the distribution of amzn, same below). We see that the insertion times increase with τ. This is because the bound on the number of insertions before model rebuilding shrinks as τ grows (Lemma 4.1), i.e., a larger τ triggers rebuilds more eagerly.

Update time under varying insertion ratios.
Next, we test the impact of the number of points inserted. RMI, RMI-NN, and RS are static indices and are omitted from this experiment. For the fanout (i.e., the maximum error in PGM), we use fixed powers of two for PGM and for RMRT and RMI-NN-MR, respectively. As shown in Fig. 7a, the insertion times of BTree, RMI-NN-MR, and RMRT increase with the insertion ratio. For PGM, the insertion time rises in stages. It has higher insertion costs than RMI-NN-MR in most cases (from 10% to 90%). This is because PGM uses the logarithmic method [14], which builds and merges (hence the cost jumps) a set of PGMs for the insertions. For RMI-NN-MR, when the insertion ratio is within 80%, the cost increases slowly because most points are inserted directly. The costs increase faster when the insertion ratio exceeds 80%, where the insertion bound is exceeded, leading to more rebuilds. For RMRT, the insertion time is more stable. This is because each RMRT sub-model indexes a relatively small set of data points and has a relatively small rebuild bound. Model rebuilds are triggered steadily as new data points are inserted.

Update time under varying branching parameters.
Due to the parameter limitation of dynamic PGM, we vary the fanout (i.e., the maximum error in PGM) over a range of powers of two.

Figure 7: Insertion time results. (a) Varying insertion ratio; (b) Varying branching parameter.

As Fig. 7b shows,
PGM outperforms RMI-NN-MR when its maximum error (fanout) is sufficiently large, because a larger fanout for the RMI learned indices means fewer data points in each model. The cardinality of each model in the second layer is smaller, and the bound for insertion is also smaller (cf. Lemma 4.1), such that rebuilds happen more often. For RMRT, the insertion performance becomes better as the fanout increases. This is because it can adaptively divide the underlying models and provide more positions for direct insertions without frequent model rebuilds.

6 CONCLUSIONS AND FUTURE WORK

We note several directions to be explored next. We have omitted the sorting costs in index building, since these are shared by all indices. It would be interesting to further optimize these costs with a learning-based technique. Our CDF similarity approximation considers the maximum distance between the CDFs. An alternative is to take the average distance. How to bound the search range in this case is an interesting challenge.
We proposed to reuse pre-trained models for indexing new (or updated) datasets to address the building and update efficiency issues in learned indices. We proposed a similarity metric to measure the distribution difference between two datasets. Based on this metric, our agile model reuse algorithm can efficiently select the most suitable pre-trained model to index a new (or updated) dataset. We showed how the prediction error of the selected pre-trained model is bounded on the new (or updated) dataset. We demonstrated the effectiveness of the proposed algorithm by applying it to the RMI learned index [10] and our proposed learned index RMRT. Experimental results on synthetic datasets and four real datasets show that our agile model reuse technique can improve the building and update time of learned indices substantially with little impact on the lookup performance.
REFERENCES
[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. 2010. A Theory of Learning from Different Domains. Machine Learning.
[2] C. Cortes and M. Mohri. 2011. Domain Adaptation in Regression. In International Conference on Algorithmic Learning Theory. 308-323.
[3] A. Davitkova, E. Milchevski, and S. Michel. 2020. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries. In EDBT. 407-410.
[4] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, B. Chandramouli, J. Gehrke, D. Kossmann, D. Lomet, and T. Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. In SIGMOD. 969-984.
[5] P. Ferragina and G. Vinciguerra. 2020. The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. PVLDB 13, 8 (2020), 1162-1175.
[6] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In SIGMOD. 1189-1206.
[7] A. Hadian and T. Heinis. 2019. Considerations for Handling Updates in Learned Index Structures. In International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 3:1-3:4.
[8] A. Kipf, R. Marcus, A. v. Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. 2020. RadixSpline: A Single-Pass Learned Index. In International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 5:1-5:5.
[9] Kolmogorov-Smirnov Test. 2020. https://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test. Accessed: 2021-01-06.
[10] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. 2018. The Case for Learned Index Structures. In SIGMOD. 489-504.
[11] X. Li, J. Li, and X. Wang. 2019. ASLM: Adaptive Single Layer Model for Learned Index. In DASFAA. 80-95.
[12] Y. Mansour, M. Mohri, and A. Rostamizadeh. 2009. Domain Adaptation: Learning Bounds and Algorithms. In COLT.
[13] R. Marcus, A. Kipf, A. v. Renen, M. Stoian, S. Misra, A. Kemper, T. Neumann, and T. Kraska. 2020. Benchmarking Learned Indexes. PVLDB 14, 1 (2020), 1-13.
[14] M. H. Overmars. 1983. The Design of Dynamic Data Structures. Lecture Notes in Computer Science, Vol. 156. Springer.
[15] PyTorch. 2016. https://pytorch.org. Accessed: 2021-01-06.
[16] J. Qi, Y. Tao, Y. Chang, and R. Zhang. 2018. Theoretically Optimal and Empirically Efficient R-trees with Strong Parallelizability. PVLDB 11, 5 (2018), 621-634.
[17] Scikit-learn. 2007. https://scikit-learn.org. Accessed: 2021-01-06.
[18] STX B+ Tree. 2007. https://panthema.net/2007/stx-btree. Accessed: 2021-01-06.
[19] J. C. Thibault, D. R. Roe, J. C. Facelli, and T. E. Cheatham III. 2014. Data Model, Dictionaries, and Desiderata for Biomolecular Simulation Data Indexing and Sharing.