Automated Database Indexing using Model-free Reinforcement Learning
Gabriel Paludo Licks† and Felipe Meneguzzi‡
Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil
Graduate Program in Computer Science, School of Technology
† [email protected] ‡ [email protected]

Abstract
Configuring databases for efficient querying is a complex task, often carried out by a database administrator. Solving the problem of building indexes that truly optimize database access requires a substantial amount of database and domain knowledge, the lack of which often results in wasted space and memory for irrelevant indexes, possibly jeopardizing database performance for querying and certainly degrading performance for updating. We develop an architecture to solve the problem of automatically indexing a database by using reinforcement learning to optimize queries by indexing data throughout the lifetime of a database. In our experimental evaluation, our architecture shows superior performance compared to related work on reinforcement learning and genetic algorithms, maintaining near-optimal index configurations and efficiently scaling to large databases.
Introduction
Despite the multitude of tools available to manage and gain insights from very large datasets, indexing databases that store such data remains a challenge with multiple opportunities for improvement [27]. Slow information retrieval in databases entails not only wasted time for a business but also indicates a high computational cost being paid. Unnecessary indexes, or columns that should be indexed but are not, directly impact the query performance of a database. Nevertheless, achieving the best indexing configuration for a database is not a trivial task [4, 5]. To do so, we have to learn from the queries that are running and take into account their performance, the system resources, and the storage budget, so that we can find the best index candidates [18].

In an ideal scenario, all frequently queried columns should be indexed to optimize query performance. Since creating and maintaining indexes incur a cost in terms of storage, as well as in computation whenever database insertions or updates take place in indexed columns [21], choosing an optimal set of indexes for querying purposes is not enough to ensure optimal performance: we must reach a trade-off between query and insert/update performance. Thus, this is a fundamental task that needs to be performed continuously, as the indexing configuration directly impacts a database's overall performance.

We developed an architecture for automated and dynamic database indexing that evaluates query and insert/update performance to make decisions on whether to create or drop indexes using Reinforcement Learning (RL). We performed experiments using a scalable benchmark database, where we empirically evaluate our architecture in comparison to standard baseline index configurations, database advisor tools, genetic algorithms, and other reinforcement learning methods applied to database indexing. The architecture we implemented to automatically manage indexes through reinforcement learning successfully converged in its training to a configuration that outperforms all baselines and related work, both in performance and in storage used by indexes.
Background
Reinforcement Learning
Reinforcement learning is an approach to learning optimal agent policies in stochastic environments modeled as Markov Decision Processes (MDPs) [2]. It is characterized by a trial-and-error learning method, where an agent interacts and transitions through states of an MDP environment model by taking actions and observing rewards [23, Ch. 1]. An MDP is formally defined as a tuple M = ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is a transition probability function which defines the dynamics of the MDP, R is a reward function, and γ ∈ [0, 1] is a discount factor [23, Ch. 3].

In order to solve an MDP, an agent needs to know the state-transition and the reward functions. However, in most realistic applications, modeling knowledge about the state-transition or the reward function is either impossible or impractical, so an agent interacts with the environment taking sequential actions to collect information and explore the search space by trial and error [23, Ch. 1]. The Q-learning algorithm is the natural choice for solving such MDPs [23, Ch. 16]. This method learns the values of state-action pairs, denoted by Q(s, a), representing the value of taking action a in a state s [23, Ch. 6]. Assuming that states can be described in terms of features that are well informative, such a problem can be handled by using linear function approximation, which is to use a parameterized representation for the state-action value function other than a look-up table [25]. The simplest differentiable function approximator is a linear combination of features, though there are other ways of approximating functions, such as using neural networks [23, Ch. 9, p. 195].
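To make the combination of Q-learning and linear function approximation concrete, the sketch below (ours, not code from the paper) implements the one-step update for Q(s, a) = w_a · s; the step size alpha and discount gamma are placeholder values.

```python
import numpy as np

def epsilon_greedy(weights, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(weights @ state))  # Q(s, a) = w_a . s, one weight row per action

def q_learning_update(weights, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move w_a toward the bootstrapped TD target."""
    td_target = r + gamma * np.max(weights @ s_next)
    td_error = td_target - weights[a] @ s
    weights[a] += alpha * td_error * s  # gradient of w_a . s w.r.t. w_a is s
    return weights
```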
Indexing in Relational Databases

An important technique for file organization in a DBMS is indexing [21, Ch. 8, p. 274], which is usually managed by a DBA. However, index selection without the need for a domain expert is a long-time research subject and remains a challenge due to the problem's complexity [27, 5, 4]. The idea is that, given the database schema and the workload it receives, we can define the problem of finding an efficient index configuration that optimizes database operations [21, Ch. 20, p. 664]. The complexity stems from the potential number of attributes that can be indexed and all of their subsets.

While DBMSs strive to provide automatic index tuning, the usual scenario is that performance statistics for optimizing queries and index recommendations are offered, but the DBA makes the decision on whether or not to apply the changes. The most recent versions of DBMSs such as Oracle [13] and Azure SQL Database [19] can automatically adjust indexes; however, the underlying systems are not openly described.

A general way of evaluating DBMS performance is through benchmarking. Since DBMSs are complex pieces of software, and each has its own techniques for optimization, external organizations have defined protocols to evaluate their performance [21, Ch. 20, p. 682]. The goals of benchmarks are to provide measures that are portable to different DBMSs and to evaluate a wider range of aspects of the system, e.g., transactions per second and price-performance ratio [21, Ch. 20, p. 683].
TPC-H Benchmark
The tools provided by TPC-H include a database generator (DBGen) able to create up to 100 TB of data to load in a DBMS, and a query generator (QGen) that creates 22 queries with different levels of complexity. Using the database and workload generated with these tools, TPC-H specifies a benchmark that consists of inserting records, executing queries, and deleting records in the database to measure the performance of these operations.

The TPC-H performance metric is expressed in Queries-per-Hour (QphH@Size), which is achieved by computing the Power@Size and the Throughput@Size metrics [24]. The resulting values are related to the scale factor (@Size), i.e., the database size in gigabytes. The Power@Size metric evaluates how fast the DBMS computes the answers to single queries. This metric is computed using Equation 1:

    Power@Size = (3600 × SF) / ( ∏_{i=1}^{22} QI(i, 0) × ∏_{j=1}^{2} RI(j, 0) )^{1/24}    (1)

where 3600 is the number of seconds per hour, QI(i, s) is the execution time of each query i, RI(j, s) is the execution time of refresh function j (insert/update) in the query stream s, and SF is the scale factor or database size, ranging from 1 to 100,000 according to its @Size.

The Throughput@Size metric measures the ability of the system to process the most queries in the least amount of time, taking advantage of I/O and CPU parallelism [24]. It computes the performance of the system against a multi-user workload performed in an elapsed time, using Equation 2:

    Throughput@Size = (S × 22 × 3600 / T_S) × SF    (2)

where S is the number of query streams executed, and T_S is the total time required to run the test for S streams.

    QphH@Size = √( Power@Size × Throughput@Size )    (3)

Equation 3 shows the Queries-per-Hour performance (QphH@Size) metric, which is obtained from the geometric mean of the previous two metrics and reflects multiple aspects of the capability of a database to process queries. The QphH@Size metric is the final output of the benchmark and summarizes both single-user and multiple-user overall database performance.
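As a worked illustration of Equations 1–3 (our sketch, assuming the standard TPC-H setting of 22 queries and 2 refresh functions), the metrics can be computed as follows:

```python
import math

def power_at_size(qi, ri, sf):
    """Equation 1: 3600 * SF over the 24th root (geometric mean) of the
    22 single-stream query times QI(i, 0) and 2 refresh times RI(j, 0)."""
    times = list(qi) + list(ri)                   # 24 timings, in seconds
    geo_mean = math.prod(times) ** (1.0 / len(times))
    return 3600 * sf / geo_mean

def throughput_at_size(s, ts, sf):
    """Equation 2: S streams of 22 queries each, completed in Ts seconds."""
    return (s * 22 * 3600 / ts) * sf

def qphh_at_size(power, throughput):
    """Equation 3: geometric mean of the two component metrics."""
    return math.sqrt(power * throughput)
```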
Architecture
In this section, we introduce our database indexing architecture to automatically choose indexes in relational databases, which we refer to as SmartIX. The main motivation of SmartIX is to abstract the database administrator's task of frequently analyzing all candidate columns and verifying which ones are likely to improve the database index configuration. For this purpose, we use reinforcement learning to explore the space of possible index configurations in the database, aiming to find an optimal strategy over a long time horizon while improving the performance of an agent in the environment.

The SmartIX architecture is composed of a reinforcement learning agent, an environment model of a database, and a DBMS interface to apply agent actions to the database, as illustrated in Figure 1. The reinforcement learning agent is responsible for the decision-making process. The agent interacts with an environment model of the database, which computes system transitions and rewards that the agent receives for its decisions. To make changes persistent, there is a DBMS interface that is responsible for communicating with the DBMS to create or drop indexes and get statistics of the current index configuration.

Figure 1: SmartIX architecture. The RL agent exchanges actions a_t, states s_{t+1}, and rewards r_{t+1} with the environment model, which communicates indexing options, changes to apply, and database statistics with the DBMS through the DBMS interface.
Agent
Our agent is based on the Deep Q-Network agent proposed by Mnih et al. [11]. Algorithm 1 shows a reinforcement learning method built around Q-learning, using a neural network for function approximation and a replay memory for experience replay. The neural network approximates the action-value function and is trained using mini-batches of experience randomly sampled from the replay memory. At each time step, the agent performs one transition in the environment: the agent chooses an action using an epsilon-greedy exploration function at the current state, the action is applied in the environment, and the environment returns a reward signal and the next state. Finally, each transition is stored in the replay memory, and the agent performs a mini-batch update of the action-value function.
Algorithm 1 Indexing agent with function approximation and experience replay. From [23, Ch. 6, p. 131] and [11].

1:  Randomly initialize the value function
2:  Initialize an empty replay memory D
3:  s ← DB initial index configuration
4:  for each step do
5:      a ← epsilon_greedy(s)
6:      s′, r ← execute(a)
7:      Store experience e = ⟨s, a, r, s′⟩ in D
8:      Sample a random mini-batch of experiences e ∼ D
9:      Perform experience replay using the mini-batch
10:     s ← s′
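A minimal Python rendering of Algorithm 1 is sketched below; `env` and `agent` are assumed objects exposing the calls named in the pseudocode (initial configuration, epsilon-greedy action selection, action execution, and a mini-batch replay update), not an interface defined by the paper.

```python
import random
from collections import deque

def train(env, agent, steps, batch_size=1024, memory_size=10_000):
    """Skeleton of Algorithm 1: epsilon-greedy interaction plus experience replay."""
    memory = deque(maxlen=memory_size)        # replay memory D
    s = env.initial_index_configuration()     # s <- DB initial index configuration
    for _ in range(steps):
        a = agent.epsilon_greedy(s)           # a <- epsilon_greedy(s)
        s_next, r = env.execute(a)            # s', r <- execute(a)
        memory.append((s, a, r, s_next))      # store e = <s, a, r, s'> in D
        batch = random.sample(memory, min(batch_size, len(memory)))
        agent.replay(batch)                   # experience replay on the mini-batch
        s = s_next                            # s <- s'
```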
Environment

The environment component is responsible for computing transitions in the system and for computing the reward function. To successfully apply a transition, we implement a model of the database environment, modeling states that contain features that are relevant to the agent's learning, and a transition function that is able to modify the state according to the action the agent chooses. Each transition in the environment outputs a reward signal that is fed back to the agent along with the next state, and the reward function has to be informative enough that the agent learns which actions yield better decisions at each state.
State representation
The state is the formal representation of the environment information used by the agent in the learning process. Thus, deciding which information should be used to define a state of the environment is critical for task performance. The amount of information encoded in a state imposes a trade-off for reinforcement learning agents: if the state encodes too little information, the agent might not learn a useful policy, whereas if the state encodes too much information, the learning algorithm may need so many samples of the environment that it does not converge to a policy.

For the database indexing problem, the state is defined as a feature vector S = I · Q, the concatenation of the feature vectors I and Q. The feature vector I encodes the current index configuration of the database, with length |I| = C, where C is the total number of columns in the database schema. Each element of I holds a binary value, 1 or 0, depending on whether the column at that position is indexed or not. The second part of our state representation is a feature vector Q, also with length |Q| = C, which encodes which indexes were used in the last queries received by the database. To organize this information, we set a constant H that defines the horizon of queries we keep track of. For each of the last H queries, we verify whether any of the indexes currently created in the database are used to run that query. Each position of Q corresponds to a column and holds 1 if that column is indexed and used in the last H queries, and 0 otherwise. Finally, we concatenate I and Q to generate our binary state vector S with length |S| = 2C.
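The following sketch shows one way to assemble this state vector, assuming each of the last H queries has already been reduced to the set of indexed columns its execution plan used (that preprocessing step is our assumption, not specified above):

```python
import numpy as np

def build_state(indexed, recent_query_usage, columns):
    """Binary state vector S: concatenation of I (index configuration)
    and Q (indexes used by the last H queries)."""
    i_vec = [1 if c in indexed else 0 for c in columns]
    used = set().union(*recent_query_usage) if recent_query_usage else set()
    q_vec = [1 if (c in indexed and c in used) else 0 for c in columns]
    return np.array(i_vec + q_vec)            # length 2C
```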
Actions

In our environment, we define the set of possible actions A with size C + 1. Each of the first C actions refers to one column in the database schema and is implemented as a "flip" that creates or drops an index on that column. Therefore, for each such action, there are two possible behaviors: CREATE INDEX or DROP INDEX on the corresponding column. The last action is a "do nothing" action, which enables the agent to leave the index configuration unchanged when no modification is necessary in the current state.
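Below is a hedged sketch of how such a "flip" action could be applied through the DBMS interface; the index naming scheme, the column-to-table mapping, and the DB-API cursor are illustrative assumptions:

```python
def apply_action(cursor, action, columns, tables, indexed):
    """Apply an action id: flip the index on one column, or do nothing."""
    if action == len(columns):                 # the extra "do nothing" action
        return
    col = columns[action]
    if col in indexed:                         # flip: currently indexed -> drop
        cursor.execute(f"DROP INDEX idx_{col}")
        indexed.remove(col)
    else:                                      # flip: not indexed -> create
        cursor.execute(f"CREATE INDEX idx_{col} ON {tables[col]} ({col})")
        indexed.add(col)
```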
Reward
Deciding the reward function is critical for the quality of the ensuing learned policy. On the one hand, we want the agent to learn that indexes that are used by the queries in the workload must be maintained in order to optimize those queries. On the other hand, indexes that are not being used by queries must not be maintained, as they consume system resources without being useful to the current workload. Therefore, we compute the reward signal based on the next state's feature vector S after an action is applied, since our state representation encodes information both on the current index configuration and on the indexes used in the last queries, i.e., the information contained in vectors I and Q. Our reward function is computed using Equation 4:

    R(op, use) = (1 − op)((1 − use)(1) + (use)(−5)) + (op)((1 − use)(−5) + (use)(1))    (4)

where op = I_c and use = Q_c. That is, the first parameter, op, holds 0 if the last action dropped the index on column c, or 1 if it created one. The latter parameter, use, holds 0 if an index on column c is not being used by the last H horizon queries, and 1 otherwise.
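In code, Equation 4 reduces to a four-case lookup; this sketch mirrors the formula above:

```python
def reward(op, use):
    """Equation 4: +1 for creating a used index or dropping an unused one,
    -5 for the two opposite cases. op = I_c, use = Q_c."""
    return (1 - op) * ((1 - use) * 1 + use * (-5)) \
         + op * ((1 - use) * (-5) + use * 1)
```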
Table 1: TPC-H database – table stats and indexes

Table      Total  Indexed  Indexable
REGION         3        1          2
NATION         4        2          2
PART           9        1          8
SUPPLIER       7        2          5
PARTSUPP       5        2          3
CUSTOMER       8        2          6
ORDERS         9        2          7
LINEITEM      16        4         12
Totals        61       16         45
Therefore, our reward function returns a value of +1 if an index is created and it actually benefits the current workload, or if an index is dropped and it is not beneficial to the current workload. Otherwise, the function returns −5 to penalize the agent: if an index is dropped although it benefits the current workload, or if an index is created although it does not benefit the current workload. The choice of the values +1 and −5 is empirical; we want the magnitude of the penalty to be larger than the +1 reward, so that occasional bad actions are not simply canceled out when rewards accumulate. Finally, if the action is the "do nothing" operation, the environment simply returns a reward of 0, without computing Equation 4.

Experiments
Experimental setup
Database setup
For experimental purposes, and due to its usage in the literature for measuring database performance, we choose to run experiments using the database schema and data provided by the TPC-H benchmark. The tools provided by TPC-H include a data generator (DBGen), which is able to create up to 100 TB of data to load in a DBMS, and a query generator (QGen) that creates 22 queries with different levels of complexity. The database we use in these experiments is populated with 1 GB of data. To run benchmarks using each baseline index configuration, we implemented the TPC-H benchmark protocol using a Python script that runs queries, fetches execution times, and computes the performance metrics.

To provide statistics on the database, we show in Table 1 the number of columns that each table contains and an analysis of the indexing possibilities. For that, we mapped, for each table in the TPC-H database, the total number of columns, the columns that are already indexed (primary and foreign keys, indexed by default), and the remaining columns that are available for indexing.

By summing the number of indexable columns in each table, we have a total of 45 columns that are available for indexing. Since a column is either indexed or not, there are two possibilities for each of the 45 indexable columns. This means there are exactly 2^45 = 35,184,372,088,832, i.e., more than 35 trillion, possible configurations of simple indexes. This is also the number of states that the database indexing configuration can assume, and that can therefore be explored by the algorithms.

For comparison purposes, we manually check which columns compose the ground-truth optimal index configuration. We manually create each index possibility and check whether an index benefits at least one query within the 22 TPC-H queries. To check whether an index is used or not, we run the EXPLAIN command to visualize the execution plan of each query. Finally, we have 6 columns from the TPC-H schema that compose our ground-truth optimal indexes:
C_ACCTBAL, L_SHIPDATE, O_ORDERDATE, P_BRAND, P_CONTAINER, and P_SIZE.
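The EXPLAIN check can be scripted; this sketch assumes a PostgreSQL setup (consistent with the POWA baseline) accessed through psycopg2, and a simple textual test for index-scan nodes in the plan:

```python
import psycopg2  # assumption: PostgreSQL; the paper does not name a driver

def index_is_used(conn, query):
    """Return True if the query's execution plan contains an index scan."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN " + query)
        plan = "\n".join(row[0] for row in cur.fetchall())
    return "Index Scan" in plan or "Index Only Scan" in plan
```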
Baselines

The baselines comprise index configurations produced by different indexing approaches, including commercial and open-source database advisors and related work on genetic algorithms and reinforcement learning methods. Each baseline index configuration is the result of training on, or analyzing, the same workload of queries from the TPC-H benchmark, in order to make an even comparison between the approaches. The following list briefly introduces each of them. Default: indexes only on primary and foreign keys; All indexed: all columns indexed; Random: indexes randomly explored by an agent; EDB: the configuration recommended by the commercial EnterpriseDB advisor [6]; POWA: the configuration recommended by the open-source PostgreSQL Workload Analyzer [20]; ITLCS: the configuration found by the learning-classifier-system approach of Pedrozo et al. [17]; GADIS: the configuration found by the genetic algorithm of [12]; NoDBA: the configuration found by the deep reinforcement learning method of Sharma et al. [22]; rCOREIL: the configuration found by the policy-iteration-based method of Basu et al. [1].
Agent training
Training the reinforcement learning agent consists of time steps of agent-environment interaction and value function updates until it converges to a desired policy. In our case, to approximate the value function, we use a simple multi-layer perceptron neural network with two hidden layers and ReLU activation, and an Adam optimizer with mean-squared-error loss, both PyTorch 1.5.1 implementations using default hyperparameters [15]. The input and output dimensions depend on the number of columns available to index in the database schema, as described in the Architecture section.

The hyperparameters used while training are set as follows. The first, the learning rate α and the discount factor γ, are used in the update equation of the value function. The next are related to experience replay, where the replay memory size = 10000 defines the number of experiences the agent is capable of storing, and the replay batch size = 1024 defines the number of samples the agent uses at each time step to update the value function. The last are related to the epsilon-greedy exploration function, where we define an initial epsilon = 1 as the maximum epsilon value, a final epsilon as the minimum epsilon value, a percentage by which epsilon decays = 1%, and an interval of time steps at each decay = 128.

We train our agent over the course of 64 thousand time steps in the environment. Training statistics are gathered every 128 steps and are shown in Figure 2. Sub-figure 2a shows the total reward accumulated by the agent at each 128 steps in the environment, which consistently improves over time and stabilizes after the 400th x-axis value. Sub-figure 2b shows the accumulated loss at each 128 steps, i.e., the errors in the predictions of the value function during experience replay, and illustrates how it decreases towards zero as parameters are adjusted and the agent approximates the true value function.

Figure 2: Training statistics. (a) Accumulated reward per 128 steps. (b) Accumulated loss per 128 steps. (c) Index configurations while training.

To evaluate the agent's behavior and the index configuration to which it converges, we plot in Figure 2c each of the index configurations explored by the agent in the 64 thousand training steps. Each index configuration is represented in terms of the total indexes and total optimal indexes it contains. Total indexes is simply a count of the number of indexes in the configuration, while total optimal indexes is a count of the number of ground-truth optimal indexes in the configuration. The lines are smoothed using a running mean of the last 5 values, and a fixed red dashed line across the x-axis represents the configuration to which the agent should converge. As we can see, both the total number of indexes and the total optimal indexes converge towards the ground-truth optimal indexes. That is, the agent learns both to keep the optimal indexes in the configuration and to drop indexes that are irrelevant for the workload.
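For concreteness, the network and update step described above might look as follows in PyTorch; the hidden width (128) and the discount factor (γ = 0.99) are placeholder assumptions of ours, not values from the paper:

```python
import torch
import torch.nn as nn

C = 45                                    # indexable columns (TPC-H, Table 1)
net = nn.Sequential(                      # two hidden layers with ReLU
    nn.Linear(2 * C, 128), nn.ReLU(),     # input: state vector of length 2C
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, C + 1),                # output: one value per action
)
optimizer = torch.optim.Adam(net.parameters())   # default hyperparameters
loss_fn = nn.MSELoss()

def replay_step(s, a, r, s_next, gamma=0.99):
    """One mini-batch update over tensors of experiences (s, a, r, s')."""
    with torch.no_grad():
        target = r + gamma * net(s_next).max(dim=1).values
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```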
Performance Comparison
We now evaluate each baseline index configuration in comparison to the one to which our agent converged in the last episode of training. We report the TPC-H performance metric (QphH, i.e., the query-per-hour metric) and the index size of each configuration. Figure 3a shows the query-per-hour metric of each configuration (higher values denote better performance). The plotted values are the result of a trimmed mean, where we run the TPC-H benchmark 12 times for each index configuration, remove the highest and lowest results, and average the 10 remaining results. Figure 3b shows the disk space required for the indexes in each configuration (index size in MB), which allows us to analyze the trade-off between the number of indexes and the resources needed to maintain them. In an ideal scenario, the index size is just the bare minimum needed to maintain the indexes that are necessary to support query performance.

While SmartIX achieves the best query-per-hour metric, the two genetic algorithms [12] and [17] have very similar query-per-hour and index size metrics in comparison to our agent. GADIS [12] uses a state-space model similar to SmartIX's, with individuals represented as binary vectors over the indexable columns. The fitness function GADIS optimizes is the actual query-per-hour metric, and it runs the whole TPC-H benchmark every time it needs to compute the fitness function. It is therefore expected to find an individual with a high performance metric, although this is unrealistic for real-world applications in production due to the computational cost of running the benchmark.

Indexing all columns is among the highest query-per-hour results and can seem a natural alternative to solve the indexing problem. However, it results in the highest amount of disk space used to maintain stored indexes. This alternative is also less efficient in the query-per-hour metric, as the benchmark takes into account not only the performance of SELECT queries but also INSERT and DELETE operations, whose performance is affected by the presence of indexes due to the overhead of updating and maintaining the structure when records change [21, Ch. 8, p. 290-291]. It has the lowest ratio of performance to storage due to the space it needs to maintain indexes.

Figure 3: Static index configuration results. (a) Query-per-hour metric. (b) Index size (in MB).

While rCOREIL [1] is the most competitive reinforcement learning method in comparison to SmartIX, the amount of storage used to maintain its indexes is the highest among all baselines (except for having all columns indexed). rCOREIL does not track whether primary and foreign key indexes are already created, causing it to create duplicate indexes. The policy iteration algorithm used in rCOREIL is a dynamic programming method, characterized by complete sweeps of the state space at each iteration to update the value function. Since dynamic programming methods are not suitable for large state spaces [23, Ch. 4, p. 87], this can become a problem in databases that contain a larger number of columns to index.

Among the database advisors, the commercial tool EDB [6] achieves a higher query-per-hour metric than the open-source tool POWA [20], while its indexes occupy virtually the same disk space. The other baselines and related work are able to optimize the index configuration with lightweight index sizes, but are not competitive with the previously discussed methods in terms of the query-per-hour metric. Finally, among all the baselines, the index configuration obtained using SmartIX not only yields the best query-per-hour metric but also the smallest index size (except for the default configuration), i.e., it finds the balance between performance and storage.
Dynamic configurations
This section evaluates the behavior of the algorithms that generate policies, i.e., functions that guide an agent's behavior. The three algorithms that generate policies are SmartIX, rCOREIL, and NoDBA, all reinforcement learning algorithms, although with different strategies (see the Related Work section). While rCOREIL and SmartIX show a more interesting and dynamic behavior, NoDBA shows a fixed behavior and keeps only three columns indexed over the whole time horizon, without changing the index configuration over time (see its limitations discussed in Related Work). Therefore, we do not include NoDBA in the following analysis and focus the discussion on rCOREIL and SmartIX.
Fixed workload
We now evaluate the index configurations of rCOREIL and SmartIX over time while the database receives a fixed workload of queries. Figure 4 shows the behavior of rCOREIL and SmartIX, respectively. Notice that rCOREIL takes some time to create the first indexes in the database, doing so only after receiving about 150 queries, while SmartIX creates indexes at the very beginning of the workload. On the one hand, rCOREIL shows a fixed behavior and maintains all ground-truth optimal indexes, but it creates a total of 22 indexes, 16 of which are unnecessary, with the remaining 6 being optimal. On the other hand, SmartIX shows a dynamic behavior, consistently maintains 5 out of the 6 ground-truth optimal indexes, and does not maintain unnecessary indexes throughout most of the received workload.

Figure 4: Agent behavior with a fixed workload. (a) rCOREIL. (b) SmartIX.

Figure 5: Agent behavior with a shifting workload. (a) rCOREIL. (b) SmartIX.
Shifting workload
We now evaluate the algorithms' behavior while receiving a workload that shifts over time. To do so, we divide the 22 TPC-H queries into two sets of 11 queries, where for each set there is a different ground-truth set of indexes. That is, out of the 6 ground-truth indexes from the previous fixed workload, we now separate the workload so that 3 indexes are optimal for the first set of queries and 3 other indexes are optimal for the second set of queries. We thus aim to evaluate whether the algorithms can adapt the index configuration over time when the workload shifts and a different set of indexes is needed for each workload. The behavior of each algorithm is shown in Figure 5. The vertical dashed lines along the x-axis represent the time steps where the workload shifts from one set of queries to the other, and therefore where the set of ground-truth optimal indexes also changes. On the one hand, notice that rCOREIL behaves much as in the previous fixed-workload experiment: it takes some time to create the first indexes, and then maintains a fixed index configuration, not adapting as the workload shifts. On the other hand, SmartIX shows a more dynamic behavior with regard to the shifts in the workload. Notice that, at the beginning of each set of queries in the workload, there is a peak in total indexes, which decreases as soon as the index configuration adapts to the new workload and SmartIX drops the indexes that are unnecessary for the current workload. Even though rCOREIL maintains all 3 ground-truth indexes over time, it still maintains 16 unnecessary indexes, while SmartIX consistently maintains 2 out of the 3 ground-truth optimal indexes and adapts as the workload shifts.
Scaling up database size
In the previous sections, we showed that the SmartIX architecture can consistently achieve near-optimal index configurations in a database of size 1 GB. In this section, we report experiments on indexing larger databases, where we transfer the policy trained on the 1 GB database to perform indexing in databases of size 10 GB and 100 GB. We plot the behavior of our agent in Figure 6.

Figure 6: Agent behavior in larger databases. (a) 10 GB TPC-H database. (b) 100 GB TPC-H database.

As we can see, the agent shows behavior similar to that in the 1 GB database reported in previous experiments. The reason is that neither the state features nor the reward function is influenced by the database size. The only information relevant to the state and the reward function is the current index configuration and the workload being received. Therefore, we can successfully transfer the value function learned on smaller databases to index larger databases, consuming fewer resources to train the agent.
Related Work
Machine learning techniques are used in a variety of tasks related to database management systems and automated database administration [26]. One example is the work of Kraska et al. [8], which outperforms traditional index structures used in current DBMSs by replacing them with learned index models, with significant advantages under particular assumptions. Pavlo et al. [16] developed Peloton, which autonomously optimizes the system for incoming workloads and uses predictions to prepare the system for future workloads. In this section, though, we further discuss related work focused on developing methods for optimizing queries through automatic index tuning. Specifically, we focus our analysis on work that bases its approach on reinforcement learning techniques.

Basu et al. [1] developed a technique for index tuning based on a cost model that is learned with reinforcement learning. However, once the cost model is known, it becomes trivial to find the configuration that minimizes the cost through dynamic programming, such as the policy iteration method used by the authors. They use DBTune [3] to reduce the state space by considering only indexes that are recommended by the DBMS. Our approach, on the other hand, focuses on finding the optimal index configuration without complete knowledge of the environment and without DBMS heuristics to reduce the state space.

Sharma et al. [22] use a cross-entropy deep reinforcement learning method to administer databases automatically. Their set of actions, however, only includes the creation of indexes, and a budget of 3 indexes is set to deal with space constraints and index maintenance costs; indexes are only dropped once an episode is finished. A strong limitation of their evaluation process is that it queries only the LINEITEM table, which does not exploit how indexes on other tables can optimize database performance and consequently reduces the state space of the problem. Furthermore, they do not use the TPC-H benchmark performance measure to evaluate performance, but rather query execution time in milliseconds.

Other papers show that reinforcement learning can also be explored in the context of query optimization by predicting query plans: Marcus et al. [10] proposed a proof-of-concept to determine the join ordering for a fixed database; Ortiz et al. [14] developed a learned state representation to predict the cardinality of a query. These approaches could possibly be used alongside ours, generating better query execution plans while we focus on maintaining indexes from which these queries can benefit.
Conclusion
In this research, we developed the SmartIX architecture for automated database indexing using reinforcement learning. The experimental results show that our agent consistently outperforms the baseline index configurations and related work on genetic algorithms and reinforcement learning. Our agent is able to find the trade-off between the disk space its index configuration occupies and the performance metric it achieves. The state representation and the reward function allow us to successfully index larger databases while training on smaller databases and consuming fewer resources.

Regarding the limitations of our architecture, we do not yet deal with composite indexes, due to the resulting state space of all possible indexes that use two or more columns. Our experiments show results using workloads that are read-intensive (i.e., intensively fetching data from the database), which is exactly the type of workload that benefits from indexes. However, experiments using write-heavy workloads (i.e., intensively writing data to the database) would be interesting to verify whether the agent learns to avoid indexes on write-intensive tables. Considering these limitations, in future work we plan to: (1) investigate techniques that allow us to deal with composite indexes; (2) improve the reward function to provide feedback in the case of write-intensive workloads; (3) investigate pattern recognition techniques to predict incoming queries and index ahead of time; and (4) evaluate SmartIX on big data ecosystems (e.g., Hadoop).

Finally, our contributions include: (1) a formalization of a reward function shaped for the database indexing problem, independent of the DBMS's statistics, that allows the agent to adapt the index configuration according to the workload; (2) an environment representation for database indexing that is independent of schema or DBMS; and (3) a reinforcement learning agent that efficiently scales to large databases while being trained on small databases, consuming fewer resources. As a result of this research, we published a paper in the Applied Intelligence journal [9].

In closing, we envision this kind of architecture being deployed in cloud platforms such as Heroku and similar platforms that often provide database infrastructure for various clients' applications. The reality is that these clients do not prioritize, or it is not in their scope of interest to focus on, database management. Especially in the case of early-stage start-ups, the aim to shorten time-to-market and quickly ship code motivates externalizing complexity to third-party solutions [7]. From an overall platform performance point of view, efficient database management results in an optimized use of hardware and software resources. Thus, in the absence of a database administrator, the SmartIX architecture is a potential stand-in solution, as experiments show that it provides at least equivalent and often superior indexing choices compared to baseline indexing recommendations.
References

[1] Basu, D.; Lin, Q.; Chen, W.; Vo, H. T.; Yuan, Z.; Senellart, P.; and Bressan, S. 2016. Regularized cost-model oblivious database tuning with reinforcement learning. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Springer. 96–132.
[2] Bellman, R. 1957. A Markovian decision process. Journal of Mathematics and Mechanics 6(5):679–684.
[4] Proceedings of the VLDB Endowment.
[5] IKE proceedings, 1.
[6] EnterpriseDB. 2019. Enterprise Database.
[7] Giardino, C.; Paternoster, N.; Unterkalmsteiner, M.; Gorschek, T.; and Abrahamsson, P. 2016. Software development in startup companies: the greenfield startup model. IEEE Transactions on Software Engineering.
[8] Kraska, T.; Beutel, A.; Chi, E. H.; Dean, J.; and Polyzotis, N. 2018. The case for learned index structures. In Proceedings of the 2018 SIGMOD, 489–504. ACM.
[9] Licks, G. P.; Couto, J. C.; de Fátima Miehe, P.; De Paris, R.; Ruiz, D. D.; and Meneguzzi, F. 2020. SmartIX: A database indexing agent based on reinforcement learning. Applied Intelligence.
[10] Marcus, R., and Papaemmanouil, O. 2018. Deep reinforcement learning for join order enumeration. CoRR abs/1803.00055:1–7.
[11] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
[13] Olofson, C. W. 2018. Ensuring a fast, reliable, and secure database through automation: Oracle autonomous database. White paper, IDC Corporate USA, sponsored by Oracle Corp.
[14] Ortiz, J.; Balazinska, M.; Gehrke, J.; and Keerthi, S. S. 2018. Learning state representations for query optimization with deep reinforcement learning. CoRR abs/1803.08604:1–5.
[15] Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. 8024–8035.
[16] Pavlo, A.; Angulo, G.; Arulraj, J.; Lin, H.; Lin, J.; Ma, L.; Menon, P.; Mowry, T. C.; Perron, M.; Quah, I.; et al. 2017. Self-driving database management systems. In CIDR proceedings, 1–6.
[17] Pedrozo, W. G.; Nievola, J. C.; and Ribeiro, D. C. 2018. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In HAIS, 716–729. Springer.
[18] Petraki, E.; Idreos, S.; and Manegold, S. 2015. Holistic indexing in main-memory column-stores. In Proceedings of the 2015 SIGMOD, 1153–1166. New York, NY, USA: ACM.
[19] Popovic, J. 2017. Automatic tuning - SQL Server. Accessed: 2019-06-17.
[20] POWA. 2019. PostgreSQL Workload Analyzer.
[21] Ramakrishnan, R., and Gehrke, J. 2003. Database Management Systems. New York, NY, USA: McGraw-Hill, Inc., 3rd edition.
[22] Sharma, A.; Schuhknecht, F. M.; and Dittrich, J. 2018. The case for automatic database administration using deep reinforcement learning. CoRR abs/1801.05643:1–9.
[23] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press.
[24] Thanopoulou, A.; Carreira, P.; and Galhardas, H. 2012. Benchmarking with TPC-H on off-the-shelf hardware. 205–208. Wroclaw, Poland: Springer.
[25] Tsitsiklis, J. N., and Van Roy, B. 1996. An analysis of temporal-difference learning with function approximation. Technical report, Laboratory for Information and Decision Systems, MIT.
[26] Van Aken, D.; Pavlo, A.; Gordon, G. J.; and Zhang, B. 2017. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 SIGMOD, 1009–1024. ACM.
[27] Wang, J.; Liu, W.; Kumar, S.; and Chang, S.-F. 2016. Learning to hash for indexing big data—a survey. Proceedings of the IEEE 104(1):34–57.