There is No Such Thing as an "Index"! or: The next 500 Indexing Papers
Jens Dittrich, Joris Nix∗, Christian Schön∗
Big Data Analytics Group, Saarland University, Saarland Informatics Campus
https://bigdata.uni-saarland.de
∗ Both authors contributed equally to this work.
September 23, 2020
Abstract
Index structures are a building block of query processing and computer science in general. Since the dawn of computer technology there have been index structures, and since then, a myriad of index structures have been invented and published each and every year.

In this paper we argue that the very idea of "inventing an index" is a misleading concept in the first place. It is the analogue of "inventing a physical query plan". This paper is a paradigm shift in which we propose to drop the idea of handcrafting index structures (from binary search trees over B-trees to any form of learned index) altogether. We present a new automatic index breeding framework coined Genetic Generic Generation of Index Structures (G³oI). It is based on the observation that almost all index structures are assembled along three principal dimensions: (1) structural building blocks, e.g., a B-tree is assembled from two different structural node types (inner and leaf nodes), (2) a couple of invariants, e.g., for a B-tree all paths have the same length, and (3) decisions on the internal layout of nodes (row or column layout, etc.). We propose a generic indexing framework that can mimic many existing index structures along those dimensions. Based on that framework we propose a generic genetic index generation algorithm that, given a workload and an optimization goal, can automatically assemble and mutate, in other words 'breed', new index structure 'species'. In our experiments we follow multiple goals. We reexamine some good old wisdom from database technology. Given a specific workload, will G³oI even breed an index that is equivalent to what our textbooks and papers currently recommend for such a workload? Or can we do even more? Our initial results strongly indicate that generated indexes are the next step in designing "index structures".

Motivation 1: Two Completely Different Methodologies to Solve a Similar Problem. It is remarkable that there is quite a divide in databases when it comes to designing efficient components of a database system like index structures as opposed to query plans. For index structures, the historic and state-of-the-art approach is to define some performance goals, reason about complexities, design something on a blackboard, and then implement it. Like that, an index (much like any other system component) has to be designed from scratch and then implemented. Eventually, we receive a piece of software that then (hopefully) serves the original purpose.

In sharp contrast to this, since the 70s and the seminal Selinger paper [34], database researchers follow a completely different, and rather successful, design path when it comes to query planning: we automatically assemble complex plans from logical and physical operators. In most cases there is no such thing as a hand-assembled query plan, unless you only have very few plans to optimize anyway, can afford the effort, and are brave enough to do this manually. So why follow two completely different design approaches if at the core these are similar problems? For this reason, in this paper, inspired by logical and physical operators, we will split the definition of indexes into logical and physical indexes.

Motivation 2: There is no such Thing as "an Index".
Let's look at our good old B-tree: a B-tree index consists of two types of structural elements: inner nodes and leaves. Inner nodes keep pointers to other inner nodes and leaves. The main purpose of an inner node is to route incoming lookups to other inner nodes or leaves. In addition, a B-tree index algorithmically preserves a couple of invariants, e.g., all paths from the root to a leaf have the same length, and second, each node only has one parent node (i.e., nodes are structurally organized into a tree). In addition, all nodes keep data in a specific layout (row or column layout, cache- and SIMD-efficient layouts, etc.) and define which search algorithm to use inside a node (binary search, interpolation, prediction, etc.). Since the publication of the original B-tree paper [5] almost 50 years ago, the physical organization of B-trees has been improved in a zillion different ways, e.g., [28, 29, 32, 20].

So what concretely is "the index" here? So far we only defined two different node types pointing to each other, and we added a couple of constraints (fan-outs, tree structure, concrete physical organization of inner nodes and leaves). We may also add heuristics for invariant maintenance (split and merge). But if we change any aspect of this, do we receive a completely different "index"? When is it just a variant of an existing "index"? And when is it a new "index"? For instance, if we allowed inner nodes and leaves to have more than one parent, would that be a completely different index? Or is it just that one property that changes (with possible implications for other features of the index)? For this reason, in this paper, we will drop the idea of considering an "index" as one monolithic entity. We will show that most indexes can be expressed as a specific configuration in a generic logical and physical indexing framework, including B-trees, radix trees, RMIs, and even extendible hashing. And those configurations can be combined almost arbitrarily within the same configuration. Actually, in our framework, one extreme of an index (say a single hash table) can smoothly be morphed into another extreme (say a B-tree-style index with all kinds of different layouts and search algorithms inside its nodes). We will use the term logical index to denote a specific configuration in our logical indexing framework. The term physical index denotes a specific configuration in our physical indexing framework. The term index denotes both logical and physical configurations.

Motivation 3: Why handcraft an Index?
Once we are in the position to express an "index" as a configuration in a generic logical and physical indexing framework, there is one question left: why should we configure indexes by hand anyway? Why should we handcraft which node type to use, which node-internal search algorithm, which data layout, which tree levels, etc.? Why should we even decide manually how many tree levels to use? If we have different components of an index which can be interchanged freely, plus options to play with, well, then we have an optimization problem! For this reason, in this paper, we will propose a genetic algorithm that, given a dataset and workload, will automatically determine a suitable logical and physical index configuration. Note that we will not introduce this as a software framework as done in [14, 7] but rather on a conceptual level.

Related Work for the Impatient. Notice that the method we propose in this work is for the moment considered an offline method for database architects. Hence, long optimization times are not an issue. This is in line with the body of work on read-only index structures [28, 20, 25, 22] which allow for long offline bulkloading times. Similar to these works, extensions to an online use are possible but for the moment part of future work. Also note that, though our work is partially inspired by [19], we go far beyond that work in that we do not need to rely on cost models. Moreover, we are not only semi-automatic but fully automatic. In fact, our biggest inspiration is actually the well-known conceptual split of operators into logical and physical operators. In this work, this will lead us to define logical and physical indexes. Furthermore, our paper goes far beyond what has recently been proposed with the wave of so-called 'learned indexes' [25]. Those indexes, and also the follow-up works [22, 10, 13] we are aware of, are still handcrafted indexes. We will argue that, from the point of view of our generic indexing framework, all those indexes are special cases. We will also discuss this more technically, after having introduced our logical and physical indexing frameworks, throughout this paper and in Related Work (Section 5).

The problem statement we investigate in this work is as follows:
1. How can we generalize the most important index structures into a common conceptual indexing framework?
2. How can we automatically breed index structures using (1)?

In this paper we make the following contributions:
1. We introduce a generic index structure framework that is split into a logical and a physical indexing framework. This is inspired by the split into logical and physical operators in relational and physical algebras/operators.
2. We present a genetic algorithm which allows us to automatically generate (breed) efficient index configurations (aka indexes).
3. We present an extensive experimental evaluation of our approach demonstrating that we can both rediscover existing, previously handcrafted indexes as well as breed new types of hybrid indexes.

The paper is structured as follows: in Section 2, we introduce our logical generic indexing framework. After that, in Section 3, we introduce our physical generic indexing framework. Both serve as the basis for Section 4 where we introduce our index breeding approach. Section 5 contrasts our approach to related work. Section 6 presents our experimental evaluation. We conclude and point out a couple of exciting future research directions in Section 7.

In this section we introduce our generic logical indexing framework. The physical indexing framework is explained in Section 3.

Descriptions of index structures tend to mix up logical (what is done) and physical aspects (how is that achieved). For instance, consider the following sentence you might find in almost any work describing some index structure: "every node in our tree corresponds to a cache line size and maintains a range to pointer mapping which is stored in a hash table". In this sentence the logical (black underlines) and physical (colored underlines) aspects of an index are introduced at the same time. In this paper we want to clearly separate the logical and physical aspects of an index. This is analogous to the separation of logical and physical operators in query processing and optimization.
Basic Definitions.
Any expression σ_P(R), where P is a predicate defined on a relational schema [R] : {[A_1 : D_1, …, A_n : D_n]}, i.e., a function P : [R] → {true, false}, is called a query on R. The result of a query is σ_P(R) ⊆ R. Given [R] with an attribute A_i with a corresponding non-categorical one-dimensional domain D_i, and two constants l, h ∈ D_i, l ≤ h, σ_{l ≤ A_i ≤ h}(R) is a range query on R. It selects all tuples t = (a_1, …, a_i, …, a_n) ∈ R where a_i is contained in the interval [l; h]. A range query with l = h is called a point query.

Definition 2.1.
Logical Node. A logical node is a tuple (p, RI, DT) where:
1. p : [R] → D is a partitioning function on the schema [R] of the dataset to index (p may be undefined),
2. RI is the routing information. It is a function RI : D → P(N), where N is a set of nodes and P(N) is the power set of N. In other words, each element of D (the target domain of p) is mapped to a subset of the nodes in N. For each outcome of the partitioning function p we can find a set of associated nodes or the empty set. Notice that the routing information does neither imply nor assume a specific physical organization, including a sort order on its entries (like in B-trees). RI may be undefined. In the following, we use nodes(RI) for the set of nodes mapped to by RI.
3. DT is the data. It is a set of tuples with relational schema [R]; DT may be empty.

[Figure 1: An example of a logical node with a hash-style partitioning function p(t) := t.e mod 5, four mappings in the routing information RI, and two tuples in the data part.]

Figure 1 visualizes the principal structure of a logical node. The partitioning function p computes t.e mod 5 which yields a domain D = {0, 1, 2, 3, 4}. Here, only a subset of D is shown in the visualization of RI, i.e., 3 is not shown as it maps to the empty set. In addition, RI maps 2 and 0 to the same node. Moreover, the data part DT contains two tuples (2, A) and (1, B).

Footnote: In principle, DT could also be defined as a similar function as RI, the difference being that RI maps to nodes whereas DT maps to tuples. However, to simplify matters a bit, we stick to a set definition at this point. Also note that the DT-fields can be used to very naturally support buffer-tree-style indexes [3], bulkloading mechanisms [8], as well as any form of recursive partitioning algorithm.

[Figure 2: The modeling power of our logical indexing framework for traditional indexes. Four special cases of possible logical indexes for the running example. All examples mimic existing and handcrafted (physiological) index structures.
(a) B-tree with ISAM: Here the partitioning function returns t.e. The routing information maps ranges to nodes on the next level. This induces a B-tree-style partitioning. Notice that the common textbook explanation of B-trees showing k pivots and k + 1 […]
(b) RMI: Here the partitioning function is a linear function of t.e mapping keys to [0;4]. This is equivalent to a linear regression over the key space. RI groups the data into bins (corresponding to nodes on the next level). However, p and RI can be set to use any form of regression method, for any node independently.
(c) extendible hashing: Here the partitioning function only considers a suffix of the lowest three bits (& 0x7) of t.e. This implies that it partitions exactly like an extendible hashing [12] directory with a global depth of three. Note that there is no need to create entries for empty 'buckets'.
(d) radix tree: Here the partitioning functions partition the dataset on two adjacent bits each: the root node partitions on the first two bits of the prefix, the next level on the next two bits. This induces a radix-partitioning. Note that in this example the index is configured to keep at most one tuple per leaf. This can of course be configured. So alternatively, we could force a two-level tree just partitioning on the first two bits. The second level would then keep multiple entries in their DT-fields.]

Definition 2.2.
Complete Logical Index. Let LN be a set of logical nodes with ∀n ∈ LN : nodes(n.RI) ⊆ LN. Then the graph λ = (LN) is called a complete logical index. In other words, only if all routing information in the nodes of LN points to nodes contained in LN do we call LN a complete logical index. At first, this definition sounds a bit trivial, but it makes an important observation that is frequently overlooked: a logical index is-a graph of logical nodes, and nothing else.

Running Example.
Figure 2 illustrates the modeling power of our framework and shows four possible logical indexes for a running example [R] = {[e : int, g : char]}, R = {(2, A), (7, B), (1, B), (6, C), (12, Z), (11, C)}. Notice that in all these examples the DT-fields are empty for internal nodes. The implications of allowing data in internal nodes are however considered future work and will therefore not be investigated in this paper. Figure 3 demonstrates how we can model arbitrary 'hybrid' logical indexes.

[Figure 3: The modeling power of our logical indexing framework for any form of 'hybrid' index. The example combines properties from four different traditional index structures (B-tree-style, RMI-style, extendible hashing-style, and radix-style nodes). Notice that this is just one of countless possible examples: any node in this logical index may be exchanged by any other suitable logical node as long as the data in the index is partitioned in a way that all possible queries on the logical index return the correct result set. On this abstraction level it is still undefined how data is represented in the different nodes, in particular in the RI-function and the DT-set, and how we search.]

Definition 2.3.
RQ: Result of a Range Query on a Logical Index.
Given a range query with predicate P := l ≤ A_i ≤ h, a logical index λ built upon a relation R, and a non-empty start node-set S_N ⊆ LN, the result set of the range query is given by:

RQ(P, S_N) := ⋃_{n ∈ S_N} ( σ_P(n.DT) ∪ RQ( P, ⋃_{t ∈ R, l ≤ t.A_i ≤ h} n.RI(n.p(t)) ) )

where the first operand σ_P(n.DT) is the data in n. Notice that the set semantics will implicitly remove duplicates which in a physical graph-structured index (possibly not obeying set semantics) may result from visiting nodes multiple times. Also note that this query will recursively traverse the graph for all qualifying nodes in the RI-fields. This is fine for a strictly tree-structured index; however, as soon as we do not have a tree structure anymore but a more general DAG, it may become possible that, given a set of start nodes S_N, certain nodes are reachable via multiple paths. For a general graph, the implementing algorithm has to be modified to not visit nodes multiple times.
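The recursion of Definition 2.3 can be sketched as follows. This is our own minimal rendering, not the paper's code: instead of routing every tuple through n.RI(n.p(t)), it enumerates the range-keyed RI entries overlapping [l, h], which is equivalent for this tree-structured, range-partitioned example.

```python
class Node:
    """Minimal logical node (Definition 2.1): partitioning function p,
    routing information RI, and data set DT."""
    def __init__(self, p=None, RI=None, DT=None):
        self.p, self.RI, self.DT = p, (RI or {}), (DT or set())

def RQ(l, h, i, nodes):
    """Sketch of Definition 2.3: result of sigma_{l <= A_i <= h} started at `nodes`."""
    result = set()
    for n in nodes:
        result |= {t for t in n.DT if l <= t[i] <= h}   # data in n
        children = set()
        for (lo, hi), targets in n.RI.items():          # qualifying RI entries
            if lo <= h and l < hi:                      # [lo, hi) overlaps [l, h]
                children |= targets
        if children:
            result |= RQ(l, h, i, children)             # recurse into children
    return result

# B-tree-style partitioning of the running example R, on attribute e (index 0).
INF = float("inf")
leaf1 = Node(DT={(1, "B"), (2, "A")})
leaf2 = Node(DT={(7, "B"), (6, "C")})
leaf3 = Node(DT={(11, "C"), (12, "Z")})
root = Node(p=lambda t: t[0],
            RI={(-INF, 6): {leaf1}, (6, 11): {leaf2}, (11, INF): {leaf3}})
```

For instance, RQ(6, 11, 0, {root}) descends only into leaf2 and leaf3 and filters their DT-sets with the predicate.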
Definition 2.4.
Correctness of a Logical Index.
Let λ = (LN) be a complete logical index. Let S_N be an arbitrary non-empty subset of start nodes: S_N ⊆ LN. Let DT_λ := ⋃_{n ∈ LN} n.DT be the data contained in λ. Let σ_{l ≤ A_i ≤ h}(R) be a range query on R. If ∀l, h : σ_{l ≤ A_i ≤ h}(DT_λ) = RQ(P, S_N), then λ is called a correct logical index w.r.t. S_N. Notice that the correctness of an index depends on whether data is placed into the different DT-sets according to the properties of the different partitioning functions used at the various nodes. Furthermore, the start nodes S_N must be chosen such that all qualifying data can be reached by the range query. For instance, in a tree-structured index picking the start node is trivial: we call it 'the root node'. In a general graph structure, which may even be disconnected, things can become more complex, i.e., we might have multiple 'root nodes', i.e., all nodes that cannot be reached from any other node of the index, or even no root nodes (in case of a cyclic graph). This discussion is beyond the scope of this paper and therefore, in the following, we will only consider correct, DAG-structured indexes and assume that S_N is chosen accordingly.

[Figure 4: The arrows show some possible transitions from a logical to a physical index (we specify an algorithm and/or a data layout). Notice that neither the partitioning tree nor the assignment of data to nodes are changed in this process.]
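Definition 2.4 suggests a direct, if brute-force, test: compare an index traversal against a full scan of DT_λ for every candidate range. Below is our own toy sketch (the names and the two-leaf partition are assumptions, not from the paper):

```python
def check_correctness(all_data, lookup, key_of, keys):
    """Brute-force check of Definition 2.4: for every range [l, h] over the given
    keys, the index lookup must return exactly sigma_{l <= key <= h}(DT_lambda)."""
    for l in keys:
        for h in keys:
            if l > h:
                continue
            expected = {t for t in all_data if l <= key_of(t) <= h}
            if lookup(l, h) != expected:
                return False
    return True

# DT_lambda as the union of two leaves, partitioned on attribute e at pivot 6.
leaf_lo = {(1, "B"), (2, "A")}
leaf_hi = {(7, "B"), (6, "C"), (11, "C"), (12, "Z")}

def lookup(l, h):
    """Traversal stand-in: visit only leaves whose key range overlaps [l, h]."""
    result = set()
    if l < 6:                       # leaf_lo covers keys < 6
        result |= {t for t in leaf_lo if l <= t[0] <= h}
    if h >= 6:                      # leaf_hi covers keys >= 6
        result |= {t for t in leaf_hi if l <= t[0] <= h}
    return result

keys = [1, 2, 6, 7, 11, 12]
```

A lookup that, say, forgets to visit leaf_hi would fail this check on any range reaching keys ≥ 6.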
As we have just defined logical indexes (our counterparts to the logical relational algebra operators), we can now proceed to devise physical indexes (our counterparts to physical operators). So, how do we get from a logical index to a physical index?
For each logical node, and for each of its RI- and DT-parts, we eventually have to specify how to realize it. We do this by making a physical decision on the search algorithm (Section 3.1) and the data layout to use for that set (Section 3.2). Or we delegate those decisions by using a nested index (Section 3.3). Any index where for all its nodes the data layouts and algorithms are sufficiently specified is called a physical index.

We decide which search algorithm to use for searching (key/value)-pairs in RI and/or DT. Note that all search algorithms stop once a qualifying key was found, i.e., we found the corresponding entry in RI or we have an exact key match in DT. The principal options are as follows: (1) scan: linear search through all entries; for each key, check if it qualifies. (2) binS: binary search. (3) intS: interpolation search; iteratively compute slope and intercept, i.e., a linear function, for the left and right key, predict the key location pred, and reduce the search area to [left, pred] or (pred, right] respectively until the key qualifies. (4) expS: exponential search; start with the first entry, increase the exponent i for the key position specified by 2^i until the key is greater than the search value, then use binary search (or any other suitable method) inside the range [2^(i−1), end]. (5) hashS: chained hashing (or any other suitable hashing variant); use the underlying hash function to compute the location of the key (and its associated mapping). (6) linregS: linear regression (or any other form of approximation and/or learning); compute slope and intercept, i.e., a linear function, for all data points, compute error bounds, predict the key location pred, and use linear search (or any other suitable error correction method) inside [pred − lower error bound, pred + upper error bound]. (7) hybridS: any suitable hybrid algorithm (i.e., a composite of the former options).

We decide which data layout to use for representing the data from RI and/or DT. To define a data layout, we have to specify the following: (1) col vs row: key/value-pairs are in column or row layout. (2) func: we use a function to specify the RI- and/or DT-mapping; thus we do not need to represent pivots and/or data and therefore do not need a data layout. As discussed in Definition 2.1 already, we assume the DT-fields to be actual sets even though they could be modeled as a more general mapping as well. (3) unsorted vs sorted: the entries are (or are not) sorted by their key. (4) comp: the entries are compressed (including which compression method exactly). (5) hybridDL: any suitable hybrid data layout (i.e., any composite of the former options). Notice that some of these data layout decisions cannot be made independently from the search algorithms to use, e.g., binary search implies a sorted data layout. Figure 4 shows an example of a logical index that, by specifying the search algorithms and data layouts, may be transformed into different physical indexes.

We make a decision to specify RI and DT by a nested physical index. Notice that this is not equivalent to the recursively reachable set of nodes pointed to by one particular RI.
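Among the search algorithms of Section 3.1, exponential search (expS) is perhaps the least familiar; a minimal sketch over a sorted Python list (our own rendering, not the paper's code):

```python
from bisect import bisect_left

def exp_search(keys, x):
    """Exponential search (expS): double the probe position until keys[bound] >= x,
    then binary-search the bracketed range [bound/2, bound].
    Returns the index of x in the sorted list `keys`, or -1 if absent."""
    n = len(keys)
    if n == 0:
        return -1
    bound = 1
    while bound < n and keys[bound] < x:
        bound *= 2                          # probe positions 1, 2, 4, 8, ...
    lo, hi = bound // 2, min(bound, n - 1)  # x can only lie in this range
    i = bisect_left(keys, x, lo, hi + 1)    # binary search inside the bracket
    return i if i <= hi and keys[i] == x else -1
```

On the running example's sorted keys [1, 2, 6, 7, 11, 12], searching 11 probes positions 1, 2, 4 and then binary-searches [2, 4].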
Nesting is about representing the key/value-lookup search algorithms and data layout inside a node by another index. For instance, consider a physical binary search tree (BST). If we use such a BST to represent and search RI, we basically have a nested physical index in our node. However, this is just a special case, so in theory we can allow for arbitrary nested indexes at this point.

As we have just defined our logical and physical generic indexing frameworks, we proceed to present our genetic algorithm allowing us to automatically generate indexes. This is structured as follows:
1. Core algorithm (Section 4.1),
2. Initial population generation (Section 4.2),
3. The set of applicable mutations describing possible changes to individual logical and physical index structures (Section 4.3), and
4. The fitness function used to measure the performance of individual physical index structures (Section 4.4).
The major challenge with the generic indexing framework presented in Section 3 is the intractable search space. Therefore, we need an optimization method that can cope with such a huge search space.
Algorithm 1 Genetic Search Algorithm of G³oI

 1: function InitPopulation(DS, s_init)
 2:     Π = ∅                                           ▷ initialize population with empty set
 3:     for (i = 0; i < s_init; i++) do                 ▷ create s_init initial indexes
 4:         π = buildAndPopulateRandomIndex(DS)         ▷ build and populate index
 5:         Π = Π ∪ {π}                                 ▷ add index to population Π
 6:     end for
 7:     return Π                                        ▷ return population Π
 8: end function
 9: function TournamentSelection(Π, s_T, W)
10:     T = sample_subset(Π, s_T)                       ▷ draw random subset T ⊆ Π of size s_T
11:     π_min = argmin_{π ∈ T} f(π, W)                  ▷ select fittest individual π_min in T under W
12:     t̃ = median_fitness(T)                           ▷ compute median fitness of all π ∈ T
13:     return (π_min, t̃)                               ▷ return fittest individual π_min and median fitness t̃
14: end function
15: function GeneticSearch(g_max, s_init, s_max, s_Π, s_T, s_ch, DS, MD, ND, W)
16:     Π = InitPopulation(s_init, DS)                  ▷ initialize population
17:     for (i = 0; i < g_max; i++) do                  ▷ perform g_max iterations/generations
18:         (π_min, t̃) = TournamentSelection(Π, s_T, W) ▷ run tournament selection
19:         for (j = 0; j < s_max; j++) do              ▷ create s_max mutations
20:             m = draw_mutation(MD)                   ▷ draw from mutation distribution
21:             n = draw_node(ND(π_min, m))             ▷ draw from node distribution
22:             ph = draw_phys(PD(m, n))                ▷ draw from phys distribution
23:             π_mut = m(π_min, n, ph)                 ▷ perform mutation
24:             if f(π_mut, W) ≤ t̃ then                 ▷ add π_mut to Π if fitter than median t̃
25:                 if |Π| ≥ s_Π then                   ▷ if capacity exceeded
26:                     Π = Π \ argmax_{π ∈ Π} f(π, W)  ▷ remove unfittest individual
27:                 end if
28:                 Π = Π ∪ {π_mut}                     ▷ add index to population
29:             end if
30:         end for
31:     end for
32:     π_min = argmin_{π ∈ Π} f(π, W)                  ▷ fittest individual of final population
33:     return π_min
34: end function

λ: logical index
π: physical index
Π: population
s_init: initial size of the population
s_Π: maximum number of indexes in the population
g_max: number of generations
s_max: number of mutations created and evaluated in a single iteration
s_T: size of the sample in tournament selection
s_ch: maximum length of a mutation chain applied in one iteration
DS: dataset
π_min: best individual in tournament selection
π_mut: mutated element
t̃: median fitness
MD: probability distribution of mutations
m: a single mutation
ND(π, m): probability distribution of nodes
PD(m, N): probability distribution of physical implementations
W: workload of queries
f(π, W): fitness of a physical index
Table 1: Symbols used.

Notice that an intractable search space does not imply that we cannot find a good solution. In fact, entire research communities work on these kinds of problems, including planning, reinforcement learning, and genetic optimization. We decided to design our search algorithm based on genetic optimization.
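Stripped of all index specifics, the skeleton of Algorithm 1 can be rendered executably as follows; the bit-list 'configuration', the fitness function, and all parameter defaults are our own toy stand-ins, not the paper's implementation:

```python
import random

def flip_one_bit(cfg, rng):
    """Toy mutation: flip exactly one randomly chosen bit."""
    j = rng.randrange(len(cfg))
    return [b ^ 1 if i == j else b for i, b in enumerate(cfg)]

def genetic_search(fitness, mutate, init, g_max=200, s_init=8, s_max=4,
                   s_T=3, s_cap=16, seed=0):
    """Skeleton of Algorithm 1: tournament selection plus mutation; a mutated
    individual enters the population if it beats the tournament's median
    fitness (lower fitness = fitter)."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(s_init)]
    for _ in range(g_max):                                  # generations
        tournament = rng.sample(population, min(s_T, len(population)))
        best = min(tournament, key=fitness)                 # fittest in sample
        median = sorted(fitness(p) for p in tournament)[len(tournament) // 2]
        for _ in range(s_max):                              # s_max mutations
            child = mutate(best, rng)
            if fitness(child) <= median:                    # fitter than median?
                if len(population) >= s_cap:                # capacity exceeded?
                    population.remove(max(population, key=fitness))
                population.append(child)
    return min(population, key=fitness)                     # fittest individual

# Toy stand-in for an index configuration: a bit list; fitness counts zeros.
best = genetic_search(fitness=lambda cfg: cfg.count(0),
                      mutate=flip_one_bit,
                      init=lambda rng: [rng.randint(0, 1) for _ in range(12)])
```

Since the removal step always evicts a worst individual, the best fitness in the population never degrades; the returned configuration is at least as fit as the best of the initial population.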
Genetic optimization algorithms have been developed for more than 40 years [15], but recently gained a lot of attention due to growing computational resources. They allow researchers to effectively explore larger search spaces. Recent surprising, and not widely known, results include: genetic algorithms can rediscover state-of-the-art machine learning algorithms (!) [30]. Furthermore, they can devise yet unknown mathematical equations [6]. Genetic optimization tasks are very domain-specific as the possible mutations and the performance measure depend heavily on the concrete task.

The general design of our algorithm follows the principle of evolution known from nature. We start with the main function GeneticSearch (line 15). We begin by initializing a population of individuals (line 16), in our case a set of physical index structures Π := {π | π is a physical index} (see function InitPopulation, line 1). To create the initial population, we build and populate s_init physical index structures (line 4) and add them to the population Π (line 5). This build process is described in more detail in Section 4.2. Now, we enter the central iteration: we perform g_max iterations in genetic search (lines 17–31). We start with tournament selection (line 18), see function TournamentSelection (line 9). We select a sample of size s_T of the current population Π (line 10) from which we select the fittest index π_min (line 11). We keep a trace of the fitness of physical indexes to never evaluate indexes multiple times. We compute the median fitness t̃ of the sample T (line 12) and return both π_min and t̃ (line 13) to the GeneticSearch function (line 18). Then, we enter the mutation loop (line 19). The core idea is to compute s_max mutations of π_min. We draw a random mutation m from a precomputed distribution of mutations MD (line 20). For the mutation m we draw a start node n to be used for this mutation (line 21) as well as a physical implementation ph (line 22). The mutations and distributions are described in detail in Section 4.3. Then, we perform the actual mutation on π_min (line 23) and receive π_mut. We originally also experimented with applying chains of mutations (lines 20 and 23) but it did not show any benefits. We check whether the mutated index π_mut has a better fitness than the median t̃ (line 24). If it has a better fitness, we check if Π exceeds its capacity of maximally allowed physical indexes s_Π (line 25). If that is the case, we remove the physical index with the worst fitness from Π (line 26). Then we add π_mut to the population Π (line 28). Once the outer loop terminates, we determine the fittest index from Π (line 32) and return it.

What is a good start population Π for the genetic algorithm? In Algorithm 1, function InitPopulation (line 1), we need to define an initial population of individual index structures. There are several possible dimensions to consider. First, we can change the initial number s_init of indexes in Π. This basically defines how diverse the initial set of indexes may be. Second, we should determine how to actually build and populate the initial physical indexes with data from dataset DS (line 4). There are several options:
1. We start with a single physical node that does not contain data, mutate it, and only then insert the actual data. We experimented with this approach initially but discarded it quickly due to its high training costs. Thus we do not support it in our algorithm anymore.
2. We start with a single physical node containing all data. For the data layout/search method we either pick randomly or pick one that we believe works well for the given workload.
3. We use bottom-up bulkloading, with the difference that for all nodes the search algorithms and data layouts are picked randomly. In our current version we exclude hash nodes for inner nodes as we have not defined a radix-partition search method on this data layout yet.
We will integrate this in future versions of our optimization framework. The resulting tree is logically similar to a standard B-tree; the physical nodes, however, differ considerably.
4. We start with a population containing a physical index that resembles a state-of-the-art hand-tuned index, i.e., we define the logical index (including its partitioning functions) as well as the physical nodes. Then we check whether we can still improve that index through our genetic algorithm.
Notice that, going from option (1) to (4), we postulate that we increasingly take load away from GoI and use it more and more as a refinement tool: the more we start with something that already represents a very efficient (or fit, however fitness is defined) index, the more we expect that only small mutations will be performed by GoI. At least that is what we would believe. In fact, even if we (non-randomly) specify an initial physical index to start with, recall that GoI has all degrees of freedom to pick mutations, and may surprise us by taking unexpected turns and making different decisions.

In this section we introduce a suitable set of mutations and discuss how they are used in our algorithm.
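Option 3 above, bottom-up bulkloading with randomly drawn node configurations, can be sketched as follows. The enums, helper names, and the flat node representation are our assumptions for illustration; real physical nodes of course also carry payloads and routing information, and hash layouts are excluded for inner nodes as stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch of option 3: bottom-up bulkloading where every node's
// data layout and search method are drawn at random.
enum class Layout { SortedCol, Tree, Hash };
enum class Search { Scan, BinS, IntS, ExpS, HashS };

struct Node {
    Layout dl; Search s;
    std::vector<uint64_t> keys;   // leaf data (simplified: keys only)
    std::vector<int> children;    // child node ids (empty for leaves)
};

std::vector<Node> bulkloadRandom(std::vector<uint64_t> keys, std::size_t leafCapacity,
                                 std::size_t fanout, std::mt19937& rng) {
    std::sort(keys.begin(), keys.end());
    std::uniform_int_distribution<int> anyLayout(0, 2);    // leaves: any layout
    std::uniform_int_distribution<int> innerLayout(0, 1);  // inner: no hash layout
    std::uniform_int_distribution<int> anySearch(0, 4);
    std::vector<Node> nodes;
    std::vector<int> level;  // ids of the current level
    for (std::size_t i = 0; i < keys.size(); i += leafCapacity) {  // build leaves
        Node leaf{Layout(anyLayout(rng)), Search(anySearch(rng)),
                  {keys.begin() + i, keys.begin() + std::min(i + leafCapacity, keys.size())},
                  {}};
        level.push_back((int)nodes.size());
        nodes.push_back(leaf);
    }
    while (level.size() > 1) {  // build inner levels bottom-up
        std::vector<int> next;
        for (std::size_t i = 0; i < level.size(); i += fanout) {
            Node inner{Layout(innerLayout(rng)), Search(anySearch(rng)), {},
                       {level.begin() + i, level.begin() + std::min(i + fanout, level.size())}};
            next.push_back((int)nodes.size());
            nodes.push_back(inner);
        }
        level = next;
    }
    return nodes;  // last node (if any) is the root
}
```

The resulting tree is structurally B-tree-like, but each node's layout/search combination is an independent random draw, which is exactly what gives the initial population its diversity.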
Mutation.
In our framework, a mutation is a function m: Index → Index. A mutation takes a single index as input, mutates it, and returns a modified index. By 'Index' we mean that either a logical index (λ) or a physical index (π) is given and a mutated index (λ_mut or π_mut) is returned. λ_mut and π_mut must preserve the correctness of λ and π. This is inspired by rewrite rules in classical query optimization: there, we also only consider rules that are guaranteed not to change the query result. We will only consider mutations on tree-structured indexes. This is not a restriction of our generic framework but makes the following mutations a bit more digestible.

Figure 5: Performing the mutations described in Section 4.3 on actual physical indexes. (a) M1 Change node type: node n's data layout is changed from a sorted column-layout to a tree-layout, yielding the new node n′. (b) M2 Change search method: node n's search algorithm is changed from a scan to binary search, yielding the new node n′. (c) M3 Merge nodes horizontally: merge left & middle child node. (d) M4 Split node horizontally: split left child node. (e) M5 Merge nodes vertically: merge top-level node's left child. (f) M6 Split node vertically: split left child.

Mutation distributions.
We use a probability distribution MD that allows us to assign different probabilities to the different mutations (line 20), e.g., we can give higher probabilities to certain mutations. Given a mutation m and a physical index π_min, we then draw from a second distribution ND(π_min, m) to determine the nodes to use for this mutation (line 21). Finally, we draw from a third distribution PD(m, N) to determine which physical implementation to use for this mutation and node set. Note that these distributions can be created based on microbenchmarks, see Section 6.1.

Fundamental Mutations.
Our goal is to implement a minimal set of mutations that allows for breeding a huge variety of physical index structures.

M1 Change data layout:
From n, we randomly select either its RI- or DT-part. Then we create a new physical node n′ with a data layout n′.dl ≠ n.dl and with the same data and routing information as n: n′.DT = n.DT ∧ n′.RI = n.RI. The options for data layouts are described in Section 3.2. If n contains child partitions, we enforce the additional condition n′.dl ≠ hash, as our software framework does not (yet) support child partitions in nodes with a hash layout. In π, we replace n by n′. Figure 5(a) shows an example: the input node n has a sorted column-layout. In the index, we replace n by n′, which has a tree-layout.

M2 Change search method:
From n, we randomly select either its RI- or DT-part. Given the existing search method n.s ∈ S := {scan, binS, intS, expS, linRegS, hashS}, we draw an s′ ∈ S with s′ ≠ s. Then we create a new physical node n′ with the new search method s′ and with the same data and routing information as n: n′.DT = n.DT ∧ n′.RI = n.RI. Figure 5(b) shows an example: the input node n uses a scan as search method. In the index, we replace n by n′, which uses binary search.

M3 Merge sibling nodes horizontally:
We set node n_parent := n whose RI maps to at least one other node in π; if not, we abort this mutation. From the set of nodes mapped to by n_parent, we randomly select a child node n_target ∈ nodes(n_parent.RI). We select a non-empty subset N_sources ⊆ nodes(n_parent.RI) of nodes to merge into n_target using the following restrictions: n_target ∉ N_sources ∧ ∀n ∈ N_sources: n.p = n_target.p. This implies that the source domain of the routing information function D is equal for all nodes in N_sources ∪ {n_target}. We then need to perform updates on two levels of the index: the node n_target that we merge with and the parent node n_parent. We start by describing the updates to the node n_target. First we update the data n_target.DT and set it to the union of all data within the merged nodes:

n′_target.DT = n_target.DT ∪ ⋃_{n ∈ N_sources} n.DT.

Next, we update the routing information function n_target.RI such that

∀d ∈ D: n′_target.RI(d) = n_target.RI(d) ∪ ⋃_{n ∈ N_sources} n.RI(d),

where D is the common domain of the RIs in N_sources ∪ {n_target}. This ensures that our target node n_target now maps to all child nodes that any node n ∈ N_sources previously mapped to, i.e., we can still reach all child nodes. For the parent node n_parent, we have to update the routing information n_parent.RI such that

∀d ∈ D_parent ∀n ∈ N_sources with n ∈ n_parent.RI(d): n_parent.RI(d) = {n_target} ∪ (n_parent.RI(d) \ {n}).

In other words: we remove all mappings to merged nodes n ∈ N_sources and replace them with a new mapping to the node n_target. Notice that the merge operation performed in B-trees is essentially just a specialized version of this general merge mutation: in a B-tree, the number of merged nodes is typically k = |N_sources| = 1. Merge operations with larger source sets can easily be achieved by recursively executing the merge operation on the same node. Figure 5(c) shows an example: the set N_sources contains a single leaf that we want to merge into n_target. To achieve this, we first merge all data contained in N_sources.DT into n′_target.DT. As N_sources.RI is empty, we do not have to do anything here. In n_parent.RI, we need to remove the mapping to all nodes in N_sources; in this case, the key-range [6; 11) ⊂ D must be changed to map to n′_target. For this example, this is equivalent to merging the old entry (−∞; 6) with [6; 11) into (−∞; 11). Now, all nodes in N_sources can be removed from the index.

M4 Split child node horizontally into k nodes:
This is the inverse mutation of M3. Figure 5(d) shows an example.

M5 Merge nodes vertically:
We set node n_parent := n whose RI maps to at least one other node in π; if not, we abort this mutation. From the set of nodes mapped to by n_parent, we randomly select a child node n_child ∈ nodes(n_parent.RI) using the following restriction: n_child.p = n_parent.p. To merge n_child into n_parent, we then need to perform the following updates. First, we move all data in n_child.DT to the parent node:

n_parent.DT = n_parent.DT ∪ n_child.DT.

Next, we move potential child nodes of n_child to the parent node n_parent:

∀d ∈ D_parent: n_child ∈ n_parent.RI(d) ⇒ n_parent.RI(d) = (n_parent.RI(d) \ {n_child}) ∪ n_child.RI(d),

where D_parent is the domain of n_parent.RI. In other words: we remove all mappings to the merged node n_child and replace them with mappings to the child nodes of n_child. In our actual implementation, we restrict ourselves to merging a single parent-child pair during a single mutation. Merge operations for longer chains of nodes can easily be achieved by recursively executing the merge operation on the same node. Figure 5(e) shows an example: we select the root node as n_parent and its left child node as n_child, which we want to merge into the root node. To achieve this, we first merge all data contained in n_child.DT into n_parent.DT. In n_parent.RI, we need to remove the mapping to n_child and replace it with mappings to the children of n_child. In this case, we remove the key-range (−∞; 6) ⊂ D and replace it with the corresponding entries of n_child.RI. For this example, this is equivalent to inserting the entries (−∞; 2) and [2; 6) into n_parent.RI.

M6 Split child node vertically into k nodes:
This is the inverse operation of M5. Figure 5(f) shows an example.

The fitness function is used to measure the performance of a single physical index and describes what is to be optimized by the genetic algorithm (either by minimizing or maximizing its value). Its definition can be chosen freely depending on the optimization goal. We have chosen to optimize our index structures for runtime given a specific workload consisting of point and range queries. We therefore define the fitness function f: Physical Index × Workload → ℝ, to be minimized, in the following way:

f(π, W) = r(π, W)_c.

Here, π denotes the physical index (the individual) to evaluate, and W is a sequence of queries denoting the workload of the specific experiment. r(π, W)_c is the median runtime measured for this physical index on the workload over c runs. The fitness function can also easily be adapted to factor in other optimization goals like memory- or energy-efficiency. Other interesting extensions include regularization, i.e., the complexity of an index could be penalized (similar to model complexity in machine learning). Furthermore, we could penalize or incentivize the filling degree of leaves: fully packed leaves are beneficial for read-optimized indexes, but for inserts they can quickly lead to structural modifications of the tree, whereas partially filled leaves allow many inserts to be handled by leaf-local changes. All these requirements can be modeled into the fitness function. Due to space constraints, we leave this to future work.

Handcrafted Indexes.
Since the original B-tree paper [5] in 1972, B-trees have become a workhorse in database systems. Since then, a myriad of B-tree variants and improvements have been proposed [28, 29, 32, 20]. Other classes of handcrafted index structures include radix-trees like Judy arrays [4] and their modern SIMDified incarnation ARTful [26]. Moreover, considerable work has been done in the past years to better understand the performance of hash tables, which are widely used in query processing [1, 31].
Learned Indexes.
The core task of a learned index [25] is to provide an index on a densely packed, sorted array. The main idea is to manually define an (outer) B-tree-like structure, typically a two-level tree (coined RMI by the authors). Then, inside each node, rather than performing a binary search on the keys contained in that node (as done in a textbook B-tree), a learned regression function is used to predict the position in the sorted array. Care has to be taken to handle prediction errors. This is done through an error correction method: the prediction actually defines a range which must be post-filtered through a different algorithm like binary or interpolation search. The biggest advantage of a 'learned index' is that no space is required to store pivots in internal nodes, thus allowing for high branching factors.

Like our work, the original work was a read-only index. It bulkloaded the index top-down, but as with any other B-tree-like structure, bottom-up bulkloading is also possible [22] and actually easier. Later on, different proposals were made to use different regression techniques [21] and to support inserts and deletes [10, 13].

As illustrated in Figure 2(b), an RMI is just one special configuration in GoI: an RMI is (1) a logical index: a classical B-tree (however, with a fixed number of layers, enforced balancing, and high fan-out), and (2) a physical index: node-internal search constrained to use some form of linear regression. In other words, an RMI handcrafts its logical structure. Then, inside its nodes, it uses a fixed physical regression method to learn a CDF. In contrast, we allow for optimizing the structure as well as the search methods and data layouts used inside nodes. Thus, we fully embrace the orthogonality of learning a model only inside a node vs optimizing the entire index structure.

Periodic Tables and Data Calculator.
The work by Stratos Idreos et al. on semi-automatic data structure design is truly inspiring. In their vision paper [17], they aim at a complete dissection and classification of the individual primitives used to design data structures. The authors sketch the huge design space of indexes and conclude that many quadrants in that space are still unexplored. They also phrase the high-level vision to synthesize an index from a declarative specification. The main idea is to use fine-grained learned cost models to be able to cost the individual physical index primitives (like scans, binary search, etc.). However, they do not go further and show how this can be achieved concretely. In addition, no split into logical and physical indexes is given, which is the key enabler in our approach. The paper also suggests to use their tool more as a design tool in the hands of a database architect. The follow-up work [18] is another vision paper which goes into somewhat more detail in describing the problem space of this endeavor and proposes a workbench-like "Data Alchemist" architecture, a semi-automatic design tool. However, again no experiments and/or results are shown. Then, [19] explores a large set of physical index design primitives, benchmarks them, and uses the results to learn cost models for physical primitives. This is used to build synthesized cost models for the expected cost of a combination of those physical primitives. The authors show several indexes where these cost estimates match the actual runtimes very well. At the same time, the paper emphasizes that many physical design primitives and their cost models are missing, including compression, concurrency, updates, etc.

This was one big motivation for our work: we believe that the index design space is simply too big to be fully modeled by cost models. A major problem we witnessed in our lab is the non-additive interaction of different physical primitives (and their associated cost models). Also recall that the creation of an index structure (in contrast to the creation of an index instance at query time!) is not performance-critical, and hence there is no need to predict the runtime: we can simply measure the runtime (or fitness). In a query optimizer, cost models make sense as we need to quickly predict the runtimes of different options for the physical plan. However, when automatically creating index structures, we can do this offline and hence use the actual measured performance to decide about the fate of an index variant. In summary, [17, 18, 19] do not propose a separation into logical and physical indexes, they rely on complex cost models, and their methods are only semi-automatic.

Generic Frameworks.
A couple of generic indexing frameworks have been proposed in the past, most notably GiST [14] and XXL [7]. Those frameworks also aimed at generalizing presumably different index structures into a common software framework. This in turn allowed architects to implement important database algorithms for the generic index. The specialized indexes could then relatively easily be adapted to use the generic algorithms. Prominent examples include generic bulkloading [9] and concurrency control [23]. Though that work was inspiring to us, we stress that in our paper we argue on a conceptual level rather than an OO-level. Moreover, we are primarily inspired by the analogous separation into logical and physical relational operators without immediately specifying how physical operators get implemented (ONC, vectorization, SIMD, whatever) or even how software interfaces need to be defined, as that is a tertiary concern.

DQO.
A recent vision paper proposed Deep Query Optimization [11]. The core idea is to break operators into smaller components which can then possibly be optimized using traditional query optimization techniques. This paper is another inspiration for our work. However, it does not go into any detail on how such an idea can be realized in the context of indexing. It neither details how traditional operators can be split nor how this can be turned into an optimization problem for automatic index creation. Our paper fills that gap.
Index Selection.
Index selection [27, 24] operates on a completely different level than our approach. In index selection, the goal is to determine a suitable set of attributes to index in order to improve the runtime of a workload. In contrast, in our work we consider how to devise efficient index structures in the first place; these structures could then in turn be leveraged by index selection algorithms.
As index selection is NP-hard, an interesting strategy is to not consider indexing a binary decision but rather to allow indexes to become more and more fine-grained over time. That is at the heart of adaptive indexing [16]. Several interesting proposals have been made in this space, see [33] for a survey. However, all these indexes are still handcrafted indexes. In future work, we are planning to revisit some of these techniques, as the DT-field of our logical nodes can be used to mimic many of them.
In this section, we give an extensive evaluation of our framework and the capabilities of our genetic algorithm. In our experiments, we first establish some baselines for the physical node and search method performance in our setting. Then we carefully evaluate GoI. We highlight the cost of training and the ability to automatically reach a certain performance baseline. Furthermore, we show the capability of GoI to breed handcrafted index structures. Finally, we depict how the structure of our generated indexes varies under different workloads.

System.
All experiments were executed on a machine with an AMD Ryzen Threadripper 1900X 8-core processor and 32 GiB memory, running Linux. Our framework and the respective experiments are implemented in C++ and compiled with Clang 8.0.1 at -O3. All experiments are run single-threaded and in main-memory.
Datasets.
We use three types of datasets. All datasets consist of unique 64-bit uint keys and a 64-bit payload. In the following, we refer to the keys as data.keys. The datasets exhibit a variety of different characteristics like distribution, density, domain, and size. The first two datasets, uni_dense and uni_sparse, are uniformly distributed over a dense and a sparse domain, respectively. Concretely, uni_dense contains keys in the range [0, n), where n is the size of the dataset, and uni_sparse contains keys drawn uniformly at random from the full 64-bit domain without duplicates. The third dataset, wiki, is taken from [21] and down-sampled to our specific data size by uniformly drawing elements without duplicates. We have two main dataset sizes, 100K and 1M, with 100K being the default unless explicitly mentioned otherwise. Table 2 gives an overview of the datasets.

Workloads.
We use three classes of workloads: point, range, and mixed point and range query workloads. For the moment, all our workloads are read-only, i.e., we do not consider insert, delete, or update statements. Note, however, that update statements would not alter the structure of the index, so we could easily integrate them into our framework. Table 3 summarizes the basic workload types. Point_dist(data, min, max) represents a point query workload where the keys to look up are taken from the keys in the dataset data in the domain [min, max) ⊆ D(data.keys) with distribution dist. We support a uniform distribution uni and a normal distribution norm, with uni being the default unless explicitly mentioned otherwise. Likewise, Range_dist,sel(data, min, max) describes a range query workload consisting of pairs specifying the lower and upper bound of each query. The lower bound is drawn with the underlying distribution dist in the domain [min, max - data.size * sel) ⊆ D(data.keys), and the upper bound is set based on the dataset size and the given selectivity sel. If the domain is not explicitly specified, we assume it to be equal to the data domain. Mix(data, P, R) represents a mix of point and range queries, with P and R being sets of point and range query workloads, respectively, based on data. For each specific experiment, we concretely specify the workload type and size. Note that, in contrast to the datasets, our workloads may contain duplicates.

In this section we establish baselines. In particular, we want to see how the lookup performance of the different search methods varies for a single physical node with an increasing data size.
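The uniform variants of the dataset and workload generators described above can be sketched as follows. These are hypothetical helpers (names and signatures are ours), and we assume the sparse domain is the full 64-bit range:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// uni_dense: keys 0..n-1 (dense domain [0, n)).
std::vector<uint64_t> makeUniDense(std::size_t n) {
    std::vector<uint64_t> keys(n);
    std::iota(keys.begin(), keys.end(), uint64_t{0});
    return keys;
}

// uni_sparse: n unique keys drawn uniformly; assumed full 64-bit domain.
std::vector<uint64_t> makeUniSparse(std::size_t n, uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint64_t> dist;  // [0, 2^64 - 1]
    std::unordered_set<uint64_t> seen;
    std::vector<uint64_t> keys;
    while (keys.size() < n) {
        uint64_t k = dist(rng);
        if (seen.insert(k).second) keys.push_back(k);  // reject duplicates
    }
    return keys;
}

// Point_uni(data): point queries drawn uniformly from the dataset's keys
// (workloads, unlike datasets, may contain duplicates).
std::vector<uint64_t> makePointWorkload(const std::vector<uint64_t>& keys,
                                        std::size_t numQueries, std::mt19937_64& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, keys.size() - 1);
    std::vector<uint64_t> queries;
    for (std::size_t i = 0; i < numQueries; ++i)
        queries.push_back(keys[pick(rng)]);
    return queries;
}

// Range_uni,sel(data): [lower, upper) pairs; the upper bound is derived from
// the dataset size and the selectivity sel, as described above.
std::vector<std::pair<uint64_t, uint64_t>> makeRangeWorkload(
        const std::vector<uint64_t>& sortedKeys, std::size_t numQueries,
        double sel, std::mt19937_64& rng) {
    std::size_t width = std::size_t(sortedKeys.size() * sel);
    std::uniform_int_distribution<std::size_t> pick(0, sortedKeys.size() - 1 - width);
    std::vector<std::pair<uint64_t, uint64_t>> queries;
    for (std::size_t i = 0; i < numQueries; ++i) {
        std::size_t lo = pick(rng);
        queries.push_back({sortedKeys[lo], sortedKeys[lo + width]});
    }
    return queries;
}
```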
The goal here is to use these results to later assign different mutation probabilities as described in Section 4.3.

Table 2: Overview of the datasets. All datasets consist of 64-bit unique unsigned integers with n ∈ {100K, 1M}; uni_dense has domain [0, n), uni_sparse a sparse 64-bit domain, and wiki a real-world domain starting at 979672113.

As already showcased in Sections 3 and 4, there is a huge search space for designing physical index structures. Consequently, in our experiments, we focus on the most important data layouts and search algorithms. We use the data layouts depicted in Table 4. The specified hash functions standard and murmur correspond to the default hash function of the STL container and a 64-bit Murmur3 finalizer [2], respectively. As search algorithms, we use scan, binS, intS, expS, and hashS, described in more detail in Section 3.1. All following experiments compare the lookup time of different combinations of data layouts and their applicable search methods.

Figure 6 compares the average lookup time of the different data layout and search method combinations for an increasing size of the dataset uni_dense on the point query workload Point(uni_dense). The reported lookup times are obtained by taking the average of the median of five consecutive runs consisting of 100K point queries. For each run, we shuffle the workload in order to avoid caching effects. As we can see, the sorted col node with intS performs best for almost all data sizes. This can be attributed to the prevalent best-case scenario for intS due to the perfectly linear cumulative distribution function given by the uniform dense dataset. Interestingly, the zoom-in shows that for very small data sizes, hashS with the standard hash function performs even better. At around 40K tuples both search algorithms perform equally well, and starting at 80K tuples, intS outperforms hashS. We assume collision resolution in combination with chained hashing to be the reason that hashS becomes more costly than intS at one point.
All other combinations do not exhibit competitive performance. In particular, the tree node with binS shows a much worse lookup time.

Workload | Characteristics | Parameters
Point_dist(data [, min, max]) | point queries in domain [min, max) ⊆ D(data.keys) with distribution dist | dist ∈ {uni, norm}
Range_dist,sel(data [, min, max]) | range queries in domain [min, max) ⊆ D(data.keys) with distribution dist and selectivity sel | dist ∈ {uni, norm}, sel ∈ [0, 1]
Mix(data, P, R) | mix of point and range query workloads, with P and R being sets of respective workloads based on data | P := {p : p is Point_dist(data [, min, max])}, R := {r : r is Range_dist,sel(data [, min, max])}

Table 3: Workloads used in the experiments.
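For intuition, here are simplified versions of two of the compared search methods, binS and intS. These are our own sketches, not the framework's implementations:

```cpp
#include <cstdint>
#include <vector>

// Simplified binary search (binS): returns the position of key, or -1.
long binS(const std::vector<uint64_t>& keys, uint64_t key) {
    long lo = 0, hi = (long)keys.size() - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (keys[mid] == key) return mid;
        if (keys[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

// Simplified interpolation search (intS): estimates the position from the key
// value assuming a roughly linear CDF; ideal on uniform dense keys.
long intS(const std::vector<uint64_t>& keys, uint64_t key) {
    long lo = 0, hi = (long)keys.size() - 1;
    while (lo <= hi && key >= keys[lo] && key <= keys[hi]) {
        if (keys[hi] == keys[lo]) break;  // all keys in range are equal
        long pos = lo + (long)(((double)(key - keys[lo]) /
                                (double)(keys[hi] - keys[lo])) * (double)(hi - lo));
        if (keys[pos] == key) return pos;
        if (keys[pos] < key) lo = pos + 1; else hi = pos - 1;
    }
    return (lo <= hi && keys[lo] == key) ? lo : -1;
}
```

On a perfectly linear CDF, the interpolation estimate lands on (or right next to) the target in O(1) steps, which is consistent with the uni_dense results discussed above; binS always pays its logarithmic number of probes.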
Data Layout | Characteristics | Implementation Details
sorted col | RI and DT have columnar layout for both keys and values; sorted according to keys. | C++ standard library container std::vector
tree | | C++ standard library container std::map

Table 4: Data layouts used in the experiments.

We also conducted a variation of the previous experiment on a range query workload Range(uni_dense). The implementation of our range queries on sorted data structures consists of a point lookup for the lower bound and a subsequent scan until the elements do not qualify anymore. Therefore, the lookup time is heavily dominated by scanning the indexed elements once the lower bound for the given range is found. The scanning time scales linearly with the data size due to the fixed selectivity, while the time to find the lower bound is negligible. As a consequence, the different search methods on a sorted array perform similarly.

Overall, we conclude that the results reflect the prevalent understanding that on exclusive point query workloads on a uniformly distributed dataset in a dense domain, intS and hashS show the best performance. The good intS performance can be attributed to the optimal use case, while the hashS performance is due to the workload type being exclusive point queries. For the range queries, intS exhibits the best performance on all data sizes due to the best-case scenario; however, the lookup times are in general heavily dominated by the scanning time.

We also ran both point and range query workloads on the corresponding uni_sparse dataset. As expected, the results are very similar to the uniform dense dataset since the underlying distribution does not change and
Figure 6: Average point lookup time for different physical indexes with a single node and an increasing dataset size. The underlying dataset is uni_dense and the workload is Point(uni_dense) consisting of 100K elements.

the data layouts and search algorithms are mostly independent of the actual domain values. Therefore, we refrain from showing the concrete results.

Figure 7 shows the average point lookup times for the wiki dataset, varying the data size on a Point(wiki) workload. The plot shows that binS on sorted col and the different hashS methods display the overall best performance. For up to 1,000 tuples, the zoom-in shows a superior performance of the hashS methods, and even binS on a tree has a better average lookup time than all combinations on sorted col. From ∼ ,000 tuples onwards, expS and binS on a sorted array become faster than the tree implementation. At ∼ .8M elements, sorted col with binS has the overall best performance. Afterwards, the hashS lookups perform better again. The reason for this is the asymptotic complexity of binS in comparison with the relatively constant lookup times of hashS (except for very small data sizes). Note that, as the best possible use case for intS is not given anymore, its lookup time is much worse compared to binS and hashS.

For the wiki dataset, we again ran a similar experiment on the range query workload Range(wiki). The experimental results are very similar to the uniform dense results, i.e., the lookup times scale linearly with the data size and are mostly dominated by the scanning time, while the lookup time for the lower bound is negligible.
Figure 7: Average point lookup time for different physical indexes with a single node and an increasing dataset size. The underlying dataset is wiki, with Point(wiki) being the workload, consisting of 100K elements.

As our genetic search algorithm offers a multitude of hyperparameters which can be tuned and might have a huge influence on the overall performance (e.g., in terms of runtime), we first conduct a systematic grid search to measure the influence of various hyperparameters. Our experimental setup is based on the results of the baseline experiments: we try to find the best index structure for the dataset uni_dense with 100K entries and the workload Point(uni_dense), i.e., a workload with 100K point queries uniformly distributed over the whole domain of the dataset. During the initial bulkloading, each node is created in a way that it can store up to 100K pointers to children and the same number of entries. Using a maximum load factor, we enforce, however, tree structures consisting of 10 leaves with 1,000 entries each, two inner nodes with 5 leaves as children each, and a single root node as our initial index structures. According to our baseline experiments, we expect to find a single node with sorted col as data layout and intS as search method as the best index for this combination of dataset and workload. To obtain an overview of the influence of several parameter choices on the performance of the genetic search, we measure the runtime until we first encounter this index structure as the best individual of our population. We vary five different parameters within this experiment: (1) the number of mutations per generation s_max, (2) the maximum population size s_Π, (3) the tournament selection size s_T, (4) the initial population size s_init, and (5) the population insertion criterion q: instead of taking the median of the subset drawn during tournament selection, we define a percentile q ∈ {0%, 50%, 100%} to be reached for a mutated individual to be inserted into the population. For the 0% percentile, we always insert the mutated individual; for the 100% percentile, we only add it if it is better than the previous best individual within the tournament selection subset. During the parameter grid search, we test each possible combination of these parameters. As the
As the random choice of mutations might have a huge impact on runtime, we measure the runtime for each combination five times, allowing us to eliminate extreme outliers if necessary.

Rank  s_max  s_Π   s_T   s_init  q    median runtime [s]  mean runtime [s]
1     10     200   100%  50      0%   13.72               91.72
2     10     1000  50%   50      50%  14.58               26.10
3     10     1000  100%  10      50%  16.71               24.94
4     10     1000  100%  50      0%   16.87               94.48
5     10     1000  50%   10      50%  18.21               158.49

Table 5: Best Genetic Search Configurations (over 5 runs)

Table 5 shows the best configurations (based on the median of the 5 runs executed per configuration). Given a total number of mutations we want to perform, we conclude that it is more beneficial to use a smaller number of mutations per generation combined with a larger number of generations. As the population size has a limited influence, we decided to keep it very small to reduce the overhead of maintaining the population. We therefore used the following default parameters for the experiments in the following sections: s_max = , s_Π = , s_T = , s_init =
10 and q = .

In this experiment, we will demonstrate that our genetic algorithm is capable of reproducing the performance of various baseline index structures as known from textbooks. We consider two different datasets: uni dense and wiki. We combine each of those two datasets with three different workloads containing 10,000 queries each: Point(uni dense), Range.(uni dense), and a Mix(uni dense, P, R) workload, with P := {Point(uni dense)} and R := {Range.(uni dense)}, consisting of 80% point and 20% range queries. For each workload, we define a baseline within our generic framework which we believe to have decent performance: For the point-query-only workload, we assume a simple hash table to perform best, which is implemented as an index structure with a single node having the hash data layout. For the range-query-only and mixed workloads, we assume a B-tree-like structure to offer decent performance. We initialize the tree to have 100 fully filled leaves, each containing 1,000 elements, and a fan-out of 10 for the internal nodes. Each node is configured to use the sorted_col layout and binS. We configured the algorithm to allow all nodes to contain up to 100,000 elements and 100,000 child partitions (potentially leading to solutions consisting of a single node or solutions with one node per element assembled under a single root node). The trees in the initial population were bulkloaded similar to B-trees with 100 leaves and a fan-out of 5. Each experiment is conducted for 3,500 generations at most, but aborted early if we could not register any improvement within the last 1,000 generations. As this is the longest-running experiment, we decided to switch from single-threaded execution to a setup using two threads in parallel to reduce the execution time.

Figure 8 shows the results. Each plot compares the performance of the baseline to the performance of the genetic algorithm, where we plotted the best individual of each generation.
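The generational loop used throughout these experiments — mutate, evaluate, insert a mutant only if it beats the q-th percentile of a tournament sample, and abort early after a fixed number of generations without improvement — can be sketched roughly as follows. All names, the toy problem, and the fitness function are illustrative assumptions, not our actual implementation:

```python
import random

def genetic_search(initial_population, mutate, cost,
                   s_max=10, s_pi=200, s_t=0.5, q=50,
                   max_generations=3500, patience=1000, rng=None):
    """Toy sketch of the genetic search loop; lower cost is better."""
    rng = rng or random.Random(42)
    population = list(initial_population)
    best = min(population, key=cost)
    last_improvement = 0
    for gen in range(max_generations):
        for _ in range(s_max):  # s_max mutations per generation
            parent = rng.choice(population)
            child = mutate(parent, rng)
            # Tournament: draw a fraction s_t of the population and insert
            # the child only if it reaches the q-th percentile of the sample.
            sample = rng.sample(population, max(1, int(s_t * len(population))))
            costs = sorted(cost(ind) for ind in sample)
            threshold = costs[min(len(costs) - 1, int(q / 100 * len(costs)))]
            if cost(child) <= threshold:
                population.append(child)
        population.sort(key=cost)
        population = population[:s_pi]  # cap the population size at s_pi
        if cost(population[0]) < cost(best):
            best = population[0]
            last_improvement = gen
        if gen - last_improvement >= patience:  # early abort
            break
    return best

# Toy problem: "individuals" are integers, cost is the distance to 123.
result = genetic_search([0, 500, 1000],
                        mutate=lambda x, rng: x + rng.randint(-50, 50),
                        cost=lambda x: abs(x - 123))
```

In the real setting an individual is an index structure, mutations are the structural and physical mutations of Section 4.3, and cost is the median workload runtime.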
Although we conducted 3,500 generations per workload, we only plot the curves up to the point of the last improvement. As we can clearly see, the algorithm usually only needs a few hundred generations to match the performance of our baselines. This is mostly due to the fact that the algorithm can rather easily merge inefficient nodes in the beginning. After reaching the baseline, the genetic algorithm only finds slight improvements, e.g., by changing search algorithms within nodes, which are hardly visible in the plot.

The index structures found by our genetic algorithm are very similar to the baselines: For the uni dense dataset, the point-query-only workload delivered a single hash table as the best solution. For the range-query-only workload, we found a tree with a root node and three children, each having a sorted_col layout. The search methods used were either binS or intS. For the mixed workload, we received a larger tree with 13 nodes in total, arranged on 3 levels. Each node used the sorted_col layout. The search methods varied, although binS and intS dominated. Considering the wiki dataset, the point- and range-query-only workloads both resulted in a single node with sorted_col layout and binS. For the mixed workload, we obtained a larger tree consisting of 17 leaves, each having a hash data layout and hashS.

[Figure 8: six plots of median runtime [ms] per generation, baseline vs. GoI: (a) PQ, uni dense; (b) RQ, uni dense; (c) Mixed, uni dense; (d) PQ, wiki; (e) RQ, wiki; (f) Mixed, wiki.]
Figure 8: Comparison of our genetic algorithm with handcrafted baselines on three different workloads: a point-query-only workload (PQ), a range-query-only workload (RQ), and a mixed workload consisting of 80% point and 20% range queries.

We also experimented with an additional mixed workload, again consisting of 10,000 queries with an 80%/20% point-to-range-query ratio, based on the uni dense dataset. However, this time we chose the queries to be normally distributed around key 75,000 with a standard deviation of 10,000, i.e., the queries were mainly focused on the upper half of the key domain. Our GoI algorithm again decided to shrink the initial index structures considerably; however, it stopped after 3,500 generations, returning a tree with 4 levels and 25 nodes in total, 17 of them being leaves. The nodes containing the upper half of the key domain were again using the sorted_col layout and either intS or binS.

The total runtimes of GoI heavily depend on the concrete datasets and workloads. The fastest execution was wiki with the point-query-only workload, which took about 3 minutes until the last improvement was found. The longest run was uni dense with the range-query-only workload, which took about 500 minutes.

In this experiment, we evaluate the capabilities of GoI to recreate a B-tree from a binary tree. The experimental setup is as follows: We use the dataset uni dense with 100K elements to bulkload a B-tree-like index structure with a leaf capacity of 1,000 elements and a fanout of 10. While the leaves have a load factor of 1, we limit the inner nodes to a load factor of 0.2, which results in a binary tree. The generated index structures are evaluated on a Mix(uni dense, P, R) workload, with P := {Point(uni dense)} and R := {Range.(uni dense)}. The workload consists of 100K elements with a ratio of 80% point queries to 20% range queries. The hyperparameters are the default parameters described in Section 6.2. We then let the genetic algorithm transform this binary tree using all mutations described in Section 4.3, with the intention that the algorithm realizes the benefit of utilizing the full capacity of the inner nodes.

Figure 9 shows for each generation the total number of nodes, separated into the different levels, with level 0 being the root node. By limiting the maximum capacity to 10, the shallowest tree consists of 3 levels, which the algorithm is not quite able to achieve in just 100K iterations. As we can see, the algorithm is able to reduce the number of nodes significantly very early. In particular, the number of nodes in the higher levels (1-3) increases up to 800 generations and afterwards stays relatively constant. In the end, the lowest level vanishes completely and many of the nodes are pulled towards the root. Note that the generated index structures are not perfectly balanced anymore, which results in leaf nodes being distributed over multiple levels. Afterwards, the number of nodes stays relatively constant until around 35K generations onwards, where the algorithm trims down the number of nodes again. In the end, we can see that level 2 and level 3 contain the most nodes, while the lower levels have very few nodes. This demonstrates the capability of the genetic algorithm to fully utilize the capacity. Overall, we can clearly see that the genetic algorithm realizes the benefit of utilizing the full capacity of the inner nodes and consequently reduces the overall number of levels while increasing the number of nodes in the levels close to the root.
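Conceptually, the structural mutation driving this transformation is a merge that pulls the children of an underfull inner node up into the node, as long as its capacity allows. The following is a simplified, self-contained sketch under assumed node and mutation definitions, not our actual implementation (which applies such merges randomly rather than greedily):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list = field(default_factory=list)       # separator keys / entries
    children: list = field(default_factory=list)   # empty for leaves

CAPACITY = 10  # maximum fanout per node

def merge_child(parent: Node, i: int) -> bool:
    """Merge inner child i into parent if the combined fanout fits.
    Returns True if the mutation was applied."""
    child = parent.children[i]
    if not child.children:  # never merge a leaf into its parent
        return False
    if len(parent.children) - 1 + len(child.children) > CAPACITY:
        return False
    parent.keys[i:i] = child.keys            # splice separator keys in place
    parent.children[i:i + 1] = child.children  # replace child by its children
    return True

def flatten(root: Node) -> Node:
    """Greedily apply merges until no inner node can absorb a child.
    A binary tree with CAPACITY = 10 collapses toward a B-tree shape."""
    changed = True
    while changed:
        changed = False
        stack = [root]
        while stack:
            node = stack.pop()
            for i in range(len(node.children)):
                if merge_child(node, i):
                    changed = True
                    break
            stack.extend(node.children)
    return root

def height(node: Node) -> int:
    return 1 + max((height(c) for c in node.children), default=0)

# A small binary tree over 4 leaves: root -> (left, right) -> leaves.
leaves = [Node(keys=[k]) for k in (10, 20, 30, 40)]
left = Node(keys=[15], children=leaves[0:2])
right = Node(keys=[35], children=leaves[2:4])
root = Node(keys=[25], children=[left, right])
flatten(root)
```

Here both inner nodes are absorbed into the root, so the 3-level binary tree becomes a 2-level tree with a single fan-out-4 root — the same effect that, at scale, pulls nodes toward the root in Figure 9.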
Figure 9: Total number of nodes, separated into the different tree levels (level 0 through level 8), for each generation. The initial population consists of binary trees, which are transformed into B-trees by utilizing the full capacity of inner nodes.

In this experiment, we want to assess how the genetic algorithm handles the generation of index structures on a more complex workload. We again use the uni dense and wiki datasets with 100K elements. The workload is a split workload based on the dataset domain, which is conceptually divided into 11 equally large and disjoint partitions. Each partition has a different point-range query ratio, from 0% point and 100% range, over 50% point and range, to 100% point and 0% range. Based on this, we adjust the structure of our initial population to fit the workload, i.e., we have a single root node and 11 leaf nodes, one for each partition. The step functions in the lower part of Figure 10(a) illustrate the composition of our workload. Our mixed workload consists of 11,000 queries in total, with 1,000 queries per partition. With this underlying setup, we conduct two different experimental variations.

First, we restrict the mutations to ones without structural modifications, i.e., we only apply M and M. The resulting index structure of this experiment with an execution of 2,000 generations on the uni dense dataset can be seen in Figure 10(a). The experiment starts with arbitrary physical data layouts and search methods and tries to find the best physical node for the different point and range query ratios at each leaf. As we can see, only for an exclusive point query workload, i.e., for the rightmost leaf, does the algorithm reasonably choose a hash table with standard hash-based accesses. At every other leaf, we have a range query ratio of at least 10%, for which the algorithm always chooses sorted arrays in combination with intS. These results match our expectations of a near-optimal index structure without structural modifications for the underlying workload. For the wiki dataset, the algorithm produces basically the same index but decides to use binS instead of intS.

Second, we allow all mutations, i.e., we explicitly enable structural modifications. The experimental results are shown in Figure 10(b). As can be seen from the intermediate index structures, horizontal merges predominantly lead to improvements. The resulting index is a single node with sorted_col as data layout and either intS for the uni dense dataset or binS for the wiki dataset. Both runs were able to find the index structures in less than 200 generations. The results are to be expected, since intS on a uniform dense dataset performs very well on both point and range query workloads, while binS is the best performing search method for the wiki dataset, and using a single node avoids additional costly dynamic dispatches.

In addition to this set of experiments, we also conducted the experiment with structural modifications on the uni dense dataset with 1M tuples. As expected, the result is equivalent to the version with 100K elements, i.e., a single physical node with sorted_col as data layout and intS as search method for the data. However, we are well aware of the long runtimes of our genetic algorithm, especially for large dataset sizes. In this experiment, it took ∼14 hours until the algorithm found the optimal index structure. To remedy this, we can partition the data domain into smaller parts, iteratively run our algorithm on the independent partitions, and manually construct the final index from the resulting index structures. This also offers a lot of optimization potential: If multiple partitions have similar statistical properties and workloads, we only have to execute our algorithm once and can reuse the resulting index structure for all partitions. Furthermore, we have the option of post-processing the manually constructed index structure with our genetic algorithm and, with it, optimizing the final index beyond partition boundaries.

All in all, this experiment shows that our genetic algorithm is capable of handling more complex workloads. It was able to identify locally the best combination of data layout and search method and, furthermore, to realize globally that it is beneficial to use a single node covering the complete key domain.

Conclusions.
[Figure 10: Resulting index structures for a split workload on the key domain. (a) Split workload with different point and range query ratios for 11 equally large and disjoint partitions of the uni dense dataset. The step function shows the different ratios for each partition, while the depicted index is obtained by restricting the set of mutations to M and M. (b) Several snapshots of temporary best index structures (initial, after 4, 8, and 12 improvements, and final) on the split workload with structural modifications on both the uni dense and wiki datasets. The legend distinguishes the data layouts (sorted_col, hash) and the search methods for the RI and DT fields (binS, intS, hashS, expS, scan).]

This paper has opened the book on automatically generated index structures. We have proposed a powerful generic indexing framework on the logical and physical level, analogous to logical and physical operators in query processing and optimization. We have shown that, by clearly separating the logical and physical dimensions of an index, a huge number of existing (physical) indexes can be represented in our generic indexing framework. Furthermore, we introduced
Genetic Generic Generation of Indexes (GoI). Given a workload, GoI can come up with an efficient physical index structure automatically. Our initial experimental results outline the potential and efficiency of our approach.

Future Work.
This paper is obviously just a starting point of a much longer story. There are many possible exciting research directions ahead, including:

1. code generation: similar to generating code for the most efficient physical plan found, generate code for the most efficient physical index structure found,
2. The Index Farm: we plan to open source our framework; the goal is that people submit a workload on a web page and the framework emits suitable source code for an index structure,
3. runtime adaptivity: how to mutate structurally at runtime; this can also simulate the adaptive indexing family of index structures,
4. updates: simple updates are trivial, i.e., if a value assigned to an existing key is changed, no structural modifications are required at this point; deletes and inserts are more complex, also in batch, in particular if factoring in bulkloading [8] and buffer-trees [3],
5. effects of non-empty DT-fields in internal nodes,
6. extending GoI to support more data layouts, search algorithms, and hardware acceleration (SIMD).

References

[1] V. Alvarez, S. Richter, X. Chen, and J. Dittrich. A comparison of adaptive radix trees and hash tables. In J. Gehrke, W. Lehner, K. Shim, S. K. Cha, and G. M. Lohman, editors, 31st IEEE International Conference on Data Engineering, ICDE 2015, pages 1227–1238. IEEE Computer Society, 2015.
[2] A. Appleby.
Murmurhash3 64-bit finalizer, accessed September 15, 2020.
[3] L. Arge. The buffer tree: A new technique for optimal I/O-algorithms (extended abstract). In S. G. Akl, F. K. H. A. Dehne, J. Sack, and N. Santoro, editors, Algorithms and Data Structures, 4th International Workshop, WADS '95, Kingston, Ontario, Canada, August 16-18, 1995, Proceedings, volume 955 of Lecture Notes in Computer Science, pages 334–345. Springer, 1995.
[4] D. Baskins. Judy arrays, 2004, accessed September 10, 2020.
[5] R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1:173–189, 1972.
[6] M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, and S. Ho. Discovering symbolic models from deep learning with inductive biases, 2020.
[7] J. V. den Bercken, B. Blohsfeld, J. Dittrich, J. Krämer, T. Schäfer, M. Schneider, and B. Seeger. XXL - a library approach to supporting efficient implementations of advanced database queries. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 39–48. Morgan Kaufmann, 2001.
[8] J. V. den Bercken and B. Seeger. An evaluation of generic bulk loading techniques. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 461–470. Morgan Kaufmann, 2001.
[9] J. V. den Bercken, B. Seeger, and P. Widmayer. A generic approach to bulk loading multidimensional index structures. In VLDB, 1997.
[10] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, B. Chandramouli, J. Gehrke, D. Kossmann, D. B. Lomet, and T. Kraska. ALEX: an updatable adaptive learned index. In D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo, editors, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 969–984. ACM, 2020.
[11] J. Dittrich and J. Nix. The case for deep query optimisation. In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings, 2020.
[12] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong. Extendible hashing - a fast access method for dynamic files. ACM Trans. Database Syst., 4(3):315–344, 1979.
[13] P. Ferragina and G. Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow., 13(8):1162–1175, 2020.
[14] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. Generalized search trees for database systems. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland, pages 562–573. Morgan Kaufmann, 1995.
[15] J. H. Holland. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, 1975.
[16] S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings, 2007.
[17] S. Idreos, K. Zoumpatianos, B. Hentschel, M. S. Kester, and D. Guo. The periodic table of data structures. IEEE Data Eng. Bull., 41(3):64–75, 2018.
[18] S. Idreos, K. Zoumpatianos, S. Chatterjee, W. Qin, A. Wasay, B. Hentschel, M. S. Kester, N. Dayan, D. Guo, M. Kang, and Y. Sun. Learning data structure alchemy. IEEE Data Eng. Bull., 42(2):47–58, 2019.
[19] S. Idreos, K. Zoumpatianos, B. Hentschel, M. S. Kester, and D. Guo. The data calculator: Data structure design and cost synthesis from first principles and learned cost models. In G. Das, C. M. Jermaine, and P. A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 535–550. ACM, 2018.
[20] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In A. K. Elmagarmid and D. Agrawal, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 339–350. ACM, 2010.
[21] A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. SOSD: A benchmark for learned indexes.
CoRR, abs/.
[22] A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. RadixSpline: a single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2020, Portland, Oregon, USA, June 19, 2020, pages 5:1–5:5. ACM, 2020.
[23] M. Kornacker, C. Mohan, and J. M. Hellerstein. Concurrency and recovery in generalized search trees. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 62–72. ACM Press, 1997.
[24] J. Kossmann, S. Halfpap, M. Jankrift, and R. Schlosser. Magic mirror in my hand, which is the best in the land? An experimental evaluation of index selection algorithms. Proc. VLDB Endow., 13(11):2382–2395, 2020.
[25] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In G. Das, C. M. Jermaine, and P. A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 489–504. ACM, 2018.
[26] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. In C. S. Jensen, C. M. Jermaine, and X. Zhou, editors, 29th IEEE International Conference on Data Engineering, ICDE 2013, pages 38–49. IEEE Computer Society, 2013.
[27] V. Y. Lum and H. Ling. An optimization problem on the selection of secondary keys. In Proceedings of the 1971 26th Annual Conference, ACM '71, pages 349–356, New York, NY, USA, 1971. Association for Computing Machinery.
[28] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 78–89. Morgan Kaufmann, 1999.
[29] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 475–486. ACM, 2000.
[30] E. Real, C. Liang, D. R. So, and Q. V. Le. AutoML-Zero: Evolving machine learning algorithms from scratch, 2020.
[31] S. Richter, V. Alvarez, and J. Dittrich. A seven-dimensional analysis of hashing methods and its implications on query processing. Proc. VLDB Endow., 9(3):96–107, 2015.
[32] B. Schlegel, R. Gemulla, and W. Lehner. k-ary search on modern processors. In P. A. Boncz and K. A. Ross, editors, Proceedings of the Fifth International Workshop on Data Management on New Hardware, DaMoN 2009, Providence, Rhode Island, USA, June 28, 2009, pages 52–60. ACM, 2009.
[33] F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. Proc. VLDB Endow., 7(2):97–108, 2013.
[34] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In P. A. Bernstein, editor,