Automated Database Indexing using Model-free Reinforcement Learning
Gabriel Paludo Licks† and Felipe Meneguzzi‡
Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil
Graduate Program in Computer Science, School of Technology
† [email protected] ‡ [email protected]

Abstract
Configuring databases for efficient querying is a complex task, often carried out by a database administrator. Solving the problem of building indexes that truly optimize database access requires a substantial amount of database and domain knowledge, the lack of which often results in wasted space and memory for irrelevant indexes, possibly jeopardizing database performance for querying and certainly degrading performance for updating. We develop an architecture to solve the problem of automatically indexing a database by using reinforcement learning to optimize queries by indexing data throughout the lifetime of a database. In our experimental evaluation, our architecture shows superior performance compared to related work on reinforcement learning and genetic algorithms, maintaining near-optimal index configurations and efficiently scaling to large databases.
Introduction
Despite the multitude of tools available to manage and gain insights from very large datasets, indexing databases that store such data remains a challenge with multiple opportunities for improvement [27]. Slow information retrieval in databases entails not only wasted time for a business but also indicates a high computational cost being paid. Unnecessary indexes, or columns that should be indexed but are not, directly impact the query performance of a database. Nevertheless, achieving the best indexing configuration for a database is not a trivial task [4, 5]. To do so, we have to learn from the queries that are running and take into account their performance, the system resources, and the storage budget, so that we can find the best index candidates [18].

In an ideal scenario, all frequently queried columns should be indexed to optimize query performance. Since creating and maintaining indexes incur a cost in terms of storage, as well as in computation whenever database insertions or updates take place in indexed columns [21], choosing an optimal set of indexes for querying purposes is not enough to ensure optimal performance: we must reach a trade-off between query and insert/update performance. Thus, this is a fundamental task that needs to be performed continuously, as the indexing configuration directly impacts a database's overall performance.

We developed an architecture for automated and dynamic database indexing that evaluates query and insert/update performance to make decisions on whether to create or drop indexes using Reinforcement Learning (RL). We performed experiments using a scalable benchmark database, where we empirically evaluate our architecture in comparison to standard baseline index configurations, database advisor tools, genetic algorithms, and other reinforcement learning methods applied to database indexing. The architecture we implemented to automatically manage indexes through reinforcement learning successfully converged in its training to a configuration that outperforms all baselines and related work, both in performance and in storage used by indexes.
Background
Reinforcement Learning
Reinforcement learning is an approach to learning optimal agent policies in stochastic environments modeled as Markov Decision Processes (MDPs) [2]. It is characterized by a trial-and-error learning method, where an agent interacts and transitions through states of an MDP environment model by taking actions and observing rewards [23, Ch. 1]. An MDP is formally defined as a tuple M = ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is a transition probability function which defines the dynamics of the MDP, R is a reward function, and γ ∈ [0, 1] is a discount factor [23, Ch. 3].

In order to solve an MDP, an agent needs to know the state-transition and the reward functions. However, in most realistic applications, modeling knowledge about the state-transition or the reward function is either impossible or impractical, so an agent interacts with the environment taking sequential actions to collect information and explore the search space by trial and error [23, Ch. 1]. The Q-learning algorithm is the natural choice for solving such MDPs [23, Ch. 16]. This method learns the values of state-action pairs, denoted by Q(s, a), representing the value of taking action a in a state s [23, Ch. 6]. Assuming that states can be described in terms of features that are well informative, such a problem can be handled by using linear function approximation, which is to use a parameterized representation for the state-action value function other than a look-up table [25]. The simplest differentiable function approximator is a linear combination of features, though there are other ways of approximating functions, such as using neural networks [23, Ch. 9, p. 195].
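To make the combination of Q-learning and linear function approximation concrete, the sketch below (ours, not code from the paper) implements the one-step update for Q(s, a) = w_a · s; the step size alpha and discount gamma are placeholder values.

```python
import numpy as np

def epsilon_greedy(weights, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(weights @ state))  # Q(s, a) = w_a . s, one weight row per action

def q_learning_update(weights, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move w_a toward the bootstrapped TD target."""
    td_target = r + gamma * np.max(weights @ s_next)
    td_error = td_target - weights[a] @ s
    weights[a] += alpha * td_error * s  # gradient of w_a . s w.r.t. w_a is s
    return weights
```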
Indexing in Relational Databases

An important technique for file organization in a DBMS is indexing [21, Ch. 8, p. 274], which is usually managed by a DBA. However, index selection without the need for a domain expert is a long-time research subject and remains a challenge due to the problem's complexity [27, 5, 4]. The idea is that, given the database schema and the workload it receives, we can define the problem of finding an efficient index configuration that optimizes database operations [21, Ch. 20, p. 664]. The complexity stems from the potential number of attributes that can be indexed and all of their subsets.

While DBMSs strive to provide automatic index tuning, the usual scenario is that performance statistics for optimizing queries and index recommendations are offered, but the DBA makes the decision on whether or not to apply the changes. The most recent versions of DBMSs such as Oracle [13] and Azure SQL Database [19] can automatically adjust indexes; however, the underlying systems are not openly described.

A general way of evaluating DBMS performance is through benchmarking. Since DBMSs are complex pieces of software, and each has its own techniques for optimization, external organizations have defined protocols to evaluate their performance [21, Ch. 20, p. 682]. The goals of benchmarks are to provide measures that are portable to different DBMSs and to evaluate a wider range of aspects of the system, e.g., transactions per second and price-performance ratio [21, Ch. 20, p. 683].
TPC-H Benchmark
The tools provided by TPC-H include a database generator (DBGen) able to create up to 100 TB of data to load in a DBMS, and a query generator (QGen) that creates 22 queries with different levels of complexity. Using the database and workload generated with these tools, TPC-H specifies a benchmark that consists of inserting records, executing queries, and deleting records in the database to measure the performance of these operations.

The TPC-H performance metric is expressed in Queries-per-Hour (QphH@Size), which is achieved by computing the Power@Size and the Throughput@Size metrics [24]. The resulting values are related to the scale factor (@Size), i.e., the database size in gigabytes. The Power@Size metric evaluates how fast the DBMS computes the answers to single queries. This metric is computed using Equation 1:

    Power@Size = (3600 × SF) / ( ∏_{i=1}^{22} QI(i, 0) × ∏_{j=1}^{2} RI(j, 0) )^{1/24}    (1)

where 3600 is the number of seconds per hour, QI(i, s) is the execution time of each query i, RI(j, s) is the execution time of refresh function j (insert/update) in the query stream s, and SF is the scale factor or database size, ranging from 1 to 100,000 according to its @Size.

The Throughput@Size metric measures the ability of the system to process the most queries in the least amount of time, taking advantage of I/O and CPU parallelism [24]. It computes the performance of the system against a multi-user workload performed in an elapsed time, using Equation 2:

    Throughput@Size = (S × 22 × 3600 / T_S) × SF    (2)

where S is the number of query streams executed, and T_S is the total time required to run the test for S streams.

    QphH@Size = √( Power@Size × Throughput@Size )    (3)

Equation 3 shows the Queries-per-Hour performance (QphH@Size) metric, which is obtained from the geometric mean of the previous two metrics and reflects multiple aspects of the capability of a database to process queries. The QphH@Size metric is the final output of the benchmark and summarizes both single-user and multiple-user overall database performance.
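As a worked illustration of Equations 1–3 (our sketch, assuming the standard TPC-H setting of 22 queries and 2 refresh functions), the metrics can be computed as follows:

```python
import math

def power_at_size(qi, ri, sf):
    """Equation 1: 3600 * SF over the 24th root (geometric mean) of the
    22 single-stream query times QI(i, 0) and 2 refresh times RI(j, 0)."""
    times = list(qi) + list(ri)                   # 24 timings, in seconds
    geo_mean = math.prod(times) ** (1.0 / len(times))
    return 3600 * sf / geo_mean

def throughput_at_size(s, ts, sf):
    """Equation 2: S streams of 22 queries each, completed in Ts seconds."""
    return (s * 22 * 3600 / ts) * sf

def qphh_at_size(power, throughput):
    """Equation 3: geometric mean of the two component metrics."""
    return math.sqrt(power * throughput)
```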
Architecture
In this section, we introduce our database indexing architecture to automatically choose indexes in relational databases, which we refer to as SmartIX. The main motivation of SmartIX is to abstract the database administrator's task of frequently analyzing all candidate columns and verifying which ones are likely to improve the database index configuration. For this purpose, we use reinforcement learning to explore the space of possible index configurations in the database, aiming to find an optimal strategy over a long time horizon while improving the performance of an agent in the environment.

The SmartIX architecture is composed of a reinforcement learning agent, an environment model of a database, and a DBMS interface to apply agent actions to the database, as illustrated in Figure 1. The reinforcement learning agent is responsible for the decision-making process. The agent interacts with an environment model of the database, which computes system transitions and rewards that the agent receives for its decisions. To make changes persistent, there is a DBMS interface that is responsible for communicating with the DBMS to create or drop indexes and get statistics of the current index configuration.

Figure 1: SmartIX architecture. The RL agent exchanges actions a_t, states s_{t+1}, and rewards r_{t+1} with the environment model, which communicates indexing options, changes to apply, and database statistics with the DBMS through the DBMS interface.
Agent
Our agent is based on the Deep Q-Network agent proposed by Mnih et al. [11]. Algorithm 1 shows a reinforcement learning method built around Q-learning, using a neural network for function approximation and a replay memory for experience replay. The neural network approximates the action-value function and is trained using mini-batches of experience randomly sampled from the replay memory. At each time step, the agent performs one transition in the environment: the agent chooses an action using an epsilon-greedy exploration function at the current state, the action is applied in the environment, and the environment returns a reward signal and the next state. Finally, each transition is stored in the replay memory, and the agent performs a mini-batch update of the action-value function.
Algorithm 1 Indexing agent with function approximation and experience replay. From [23, Ch. 6, p. 131] and [11].

1:  Randomly initialize the value function
2:  Initialize an empty replay memory D
3:  s ← DB initial index configuration
4:  for each step do
5:      a ← epsilon_greedy(s)
6:      s′, r ← execute(a)
7:      Store experience e = ⟨s, a, r, s′⟩ in D
8:      Sample a random mini-batch of experiences e ∼ D
9:      Perform experience replay using the mini-batch
10:     s ← s′
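A minimal Python rendering of Algorithm 1 is sketched below; `env` and `agent` are assumed objects exposing the calls named in the pseudocode (initial configuration, epsilon-greedy action selection, action execution, and a mini-batch replay update), not an interface defined by the paper.

```python
import random
from collections import deque

def train(env, agent, steps, batch_size=1024, memory_size=10_000):
    """Skeleton of Algorithm 1: epsilon-greedy interaction plus experience replay."""
    memory = deque(maxlen=memory_size)        # replay memory D
    s = env.initial_index_configuration()     # s <- DB initial index configuration
    for _ in range(steps):
        a = agent.epsilon_greedy(s)           # a <- epsilon_greedy(s)
        s_next, r = env.execute(a)            # s', r <- execute(a)
        memory.append((s, a, r, s_next))      # store e = <s, a, r, s'> in D
        batch = random.sample(memory, min(batch_size, len(memory)))
        agent.replay(batch)                   # experience replay on the mini-batch
        s = s_next                            # s <- s'
```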
Environment

The environment component is responsible for computing transitions in the system and for computing the reward function. To successfully apply a transition, we implement a model of the database environment, modeling states that contain features that are relevant to the agent's learning, and a transition function that is able to modify the state according to the action the agent chooses. Each transition in the environment outputs a reward signal that is fed back to the agent along with the next state, and the reward function has to be informative enough that the agent learns which actions yield better decisions at each state.
State representation
The state is the formal representation of the environment information used by the agent in the learning process. Thus, deciding which information should be used to define a state of the environment is critical for task performance. The amount of information encoded in a state imposes a trade-off for reinforcement learning agents: if the state encodes too little information, the agent might not learn a useful policy, whereas if the state encodes too much information, the learning algorithm may need so many samples of the environment that it does not converge to a policy.

For the database indexing problem, the state is defined as a feature vector S = I · Q, the concatenation of the feature vectors I and Q. The feature vector I encodes the current index configuration of the database, with length |I| = C, where C is the total number of columns in the database schema. Each element of I holds a binary value, 1 or 0, depending on whether the column at that position is indexed or not. The second part of our state representation is a feature vector Q, also with length |Q| = C, which encodes which indexes were used in the last queries received by the database. To organize this information, we set a constant H that defines the horizon of queries we keep track of. For each of the last H queries, we verify whether any of the indexes currently created in the database are used to run that query. Each position of Q corresponds to a column and holds 1 if that column is indexed and used in the last H queries, and 0 otherwise. Finally, we concatenate I and Q to generate our binary state vector S with length |S| = 2C.
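The following sketch shows one way to assemble this state vector, assuming each of the last H queries has already been reduced to the set of indexed columns its execution plan used (that preprocessing step is our assumption, not specified above):

```python
import numpy as np

def build_state(indexed, recent_query_usage, columns):
    """Binary state vector S: concatenation of I (index configuration)
    and Q (indexes used by the last H queries)."""
    i_vec = [1 if c in indexed else 0 for c in columns]
    used = set().union(*recent_query_usage) if recent_query_usage else set()
    q_vec = [1 if (c in indexed and c in used) else 0 for c in columns]
    return np.array(i_vec + q_vec)            # length 2C
```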
Actions

In our environment, we define the set of possible actions A with size C + 1. Each of the first C actions refers to one column in the database schema and is implemented as a "flip" that creates or drops an index on that column. Therefore, for each such action, there are two possible behaviors: CREATE INDEX or DROP INDEX on the corresponding column. The last action is a "do nothing" action, which enables the agent to leave the index configuration unchanged when no modification is necessary in the current state.
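Below is a hedged sketch of how such a "flip" action could be applied through the DBMS interface; the index naming scheme, the column-to-table mapping, and the DB-API cursor are illustrative assumptions:

```python
def apply_action(cursor, action, columns, tables, indexed):
    """Apply an action id: flip the index on one column, or do nothing."""
    if action == len(columns):                 # the extra "do nothing" action
        return
    col = columns[action]
    if col in indexed:                         # flip: currently indexed -> drop
        cursor.execute(f"DROP INDEX idx_{col}")
        indexed.remove(col)
    else:                                      # flip: not indexed -> create
        cursor.execute(f"CREATE INDEX idx_{col} ON {tables[col]} ({col})")
        indexed.add(col)
```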
Reward
Deciding the reward function is critical for the quality of the ensuing learned policy. On the one hand, we want the agent to learn that indexes that are used by the queries in the workload must be maintained in order to optimize those queries. On the other hand, indexes that are not being used by queries must not be maintained, as they consume system resources without being useful to the current workload. Therefore, we compute the reward signal based on the next state's feature vector S after an action is applied, since our state representation encodes information both on the current index configuration and on the indexes used in the last queries, i.e., the information contained in vectors I and Q. Our reward function is computed using Equation 4:

    R(op, use) = (1 − op)((1 − use)(1) + (use)(−5)) + (op)((1 − use)(−5) + (use)(1))    (4)

where op = I_c and use = Q_c. That is, the first parameter, op, holds 0 if the last action dropped the index on column c, or 1 if it created one. The latter parameter, use, holds 0 if an index on column c is not being used by the last H horizon queries, and 1 otherwise.
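In code, Equation 4 reduces to a four-case lookup; this sketch mirrors the formula above:

```python
def reward(op, use):
    """Equation 4: +1 for creating a used index or dropping an unused one,
    -5 for the two opposite cases. op = I_c, use = Q_c."""
    return (1 - op) * ((1 - use) * 1 + use * (-5)) \
         + op * ((1 - use) * (-5) + use * 1)
```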
Table 1: TPC-H database – table stats and indexes

Table      Total  Indexed  Indexable
REGION         3        1          2
NATION         4        2          2
PART           9        1          8
SUPPLIER       7        2          5
PARTSUPP       5        2          3
CUSTOMER       8        2          6
ORDERS         9        2          7
LINEITEM      16        4         12
Totals        61       16         45
Therefore, our reward function returns a value of +1 if an index is created and it actually benefits the current workload, or if an index is dropped and it is not beneficial to the current workload. Otherwise, the function returns −5 to penalize the agent: if an index is dropped although it benefits the current workload, or if an index is created although it does not benefit the current workload. The choice of the values +1 and −5 is empirical; we want the magnitude of the penalty to be larger than the +1 reward, so that occasional bad actions are not simply canceled out when rewards accumulate. Finally, if the action is the "do nothing" operation, the environment simply returns a reward of 0, without computing Equation 4.

Experiments
Experimental setup
Database setup
For experimental purposes, and due to its usage in the literature for measuring database performance, we choose to run experiments using the database schema and data provided by the TPC-H benchmark. The tools provided by TPC-H include a data generator (DBGen), which is able to create up to 100 TB of data to load in a DBMS, and a query generator (QGen) that creates 22 queries with different levels of complexity. The database we use in these experiments is populated with 1 GB of data. To run benchmarks using each baseline index configuration, we implemented the TPC-H benchmark protocol using a Python script that runs queries, fetches execution times, and computes the performance metrics.

To provide statistics on the database, we show in Table 1 the number of columns that each table contains and an analysis of the indexing possibilities. For that, we mapped, for each table in the TPC-H database, the total number of columns, the columns that are already indexed (primary and foreign keys, indexed by default), and the remaining columns that are available for indexing.

By summing the number of indexable columns in each table, we have a total of 45 columns that are available for indexing. Since a column is either indexed or not, there are two possibilities for each of the 45 indexable columns. This means there are exactly 2^45 = 35,184,372,088,832, i.e., more than 35 trillion, possible configurations of simple indexes. This is also the number of states that the database indexing configuration can assume, and that can therefore be explored by the algorithms.

For comparison purposes, we manually check which columns compose the ground-truth optimal index configuration. We manually create each index possibility and check whether an index benefits at least one query within the 22 TPC-H queries. To check whether an index is used or not, we run the EXPLAIN command to visualize the execution plan of each query. Finally, we have 6 columns from the TPC-H schema that compose our ground-truth optimal indexes:
C_ACCTBAL, L_SHIPDATE, O_ORDERDATE, P_BRAND, P_CONTAINER, and P_SIZE.
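The EXPLAIN check can be scripted; this sketch assumes a PostgreSQL setup (consistent with the POWA baseline) accessed through psycopg2, and a simple textual test for index-scan nodes in the plan:

```python
import psycopg2  # assumption: PostgreSQL; the paper does not name a driver

def index_is_used(conn, query):
    """Return True if the query's execution plan contains an index scan."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN " + query)
        plan = "\n".join(row[0] for row in cur.fetchall())
    return "Index Scan" in plan or "Index Only Scan" in plan
```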
Baselines

The baselines comprise index configurations produced by different indexing approaches, including commercial and open-source database advisors and related work on genetic algorithms and reinforcement learning methods. Each baseline index configuration is the result of training on, or analyzing, the same workload of queries from the TPC-H benchmark, in order to make an even comparison between the approaches. The following list briefly introduces each of them. Default: indexes only on primary and foreign keys; All indexed: all columns indexed; Random: indexes randomly explored by an agent; EDB: the configuration recommended by the commercial EnterpriseDB advisor [6]; POWA: the configuration recommended by the open-source PostgreSQL Workload Analyzer [20]; ITLCS: the configuration found by the learning-classifier-system approach of Pedrozo et al. [17]; GADIS: the configuration found by the genetic algorithm of [12]; NoDBA: the configuration found by the deep reinforcement learning method of Sharma et al. [22]; rCOREIL: the configuration found by the policy-iteration-based method of Basu et al. [1].
Agent training
Training the reinforcement learning agent consists of time steps of agent-environment interaction and value function updates until it converges to a desired policy. In our case, to approximate the value function, we use a simple multi-layer perceptron neural network with two hidden layers and ReLU activation, and an Adam optimizer with mean-squared-error loss, both PyTorch 1.5.1 implementations using default hyperparameters [15]. The input and output dimensions depend on the number of columns available to index in the database schema, as described in the Architecture section.

The hyperparameters used while training are set as follows. The first, the learning rate α and the discount factor γ, are used in the update equation of the value function. The next are related to experience replay, where the replay memory size = 10000 defines the number of experiences the agent is capable of storing, and the replay batch size = 1024 defines the number of samples the agent uses at each time step to update the value function. The last are related to the epsilon-greedy exploration function, where we define an initial epsilon = 1 as the maximum epsilon value, a final epsilon as the minimum epsilon value, a percentage by which epsilon decays = 1%, and an interval of time steps at each decay = 128.

We train our agent over the course of 64 thousand time steps in the environment. Training statistics are gathered every 128 steps and are shown in Figure 2. Sub-figure 2a shows the total reward accumulated by the agent at each 128 steps in the environment, which consistently improves over time and stabilizes after the 400th x-axis value. Sub-figure 2b shows the accumulated loss at each 128 steps, i.e., the errors in the predictions of the value function during experience replay, and illustrates how it decreases towards zero as parameters are adjusted and the agent approximates the true value function.

Figure 2: Training statistics. (a) Accumulated reward per 128 steps. (b) Accumulated loss per 128 steps. (c) Index configurations while training.

To evaluate the agent's behavior and the index configuration to which it converges, we plot in Figure 2c each of the index configurations explored by the agent in the 64 thousand training steps. Each index configuration is represented in terms of the total indexes and total optimal indexes it contains. Total indexes is simply a count of the number of indexes in the configuration, while total optimal indexes is a count of the number of ground-truth optimal indexes in the configuration. The lines are smoothed using a running mean of the last 5 values, and a fixed red dashed line across the x-axis represents the configuration to which the agent should converge. As we can see, both the total number of indexes and the total optimal indexes converge towards the ground-truth optimal indexes. That is, the agent learns both to keep the optimal indexes in the configuration and to drop indexes that are irrelevant for the workload.
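For concreteness, the network and update step described above might look as follows in PyTorch; the hidden width (128) and the discount factor (γ = 0.99) are placeholder assumptions of ours, not values from the paper:

```python
import torch
import torch.nn as nn

C = 45                                    # indexable columns (TPC-H, Table 1)
net = nn.Sequential(                      # two hidden layers with ReLU
    nn.Linear(2 * C, 128), nn.ReLU(),     # input: state vector of length 2C
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, C + 1),                # output: one value per action
)
optimizer = torch.optim.Adam(net.parameters())   # default hyperparameters
loss_fn = nn.MSELoss()

def replay_step(s, a, r, s_next, gamma=0.99):
    """One mini-batch update over tensors of experiences (s, a, r, s')."""
    with torch.no_grad():
        target = r + gamma * net(s_next).max(dim=1).values
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```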
Performance Comparison
We now evaluate each baseline index configuration in comparison to the one to which our agent converged in the last episode of training. We report the TPC-H performance metric (QphH, i.e., the query-per-hour metric) and the index size of each configuration. Figure 3a shows the query-per-hour metric of each configuration (higher values denote better performance). The plotted values are the result of a trimmed mean, where we run the TPC-H benchmark 12 times for each index configuration, remove the highest and lowest results, and average the 10 remaining results. Figure 3b shows the disk space required for the indexes in each configuration (index size in MB), which allows us to analyze the trade-off between the number of indexes and the resources needed to maintain them. In an ideal scenario, the index size is just the bare minimum needed to maintain the indexes that are necessary to support query performance.

While SmartIX achieves the best query-per-hour metric, the two genetic algorithms [12] and [17] have very similar query-per-hour and index size metrics in comparison to our agent. GADIS [12] uses a state-space model similar to SmartIX's, with individuals represented as binary vectors over the indexable columns. The fitness function GADIS optimizes is the actual query-per-hour metric, and it runs the whole TPC-H benchmark every time it needs to compute the fitness function. It is therefore expected to find an individual with a high performance metric, although this is unrealistic for real-world applications in production due to the computational cost of running the benchmark.

Indexing all columns is among the highest query-per-hour results and can seem a natural alternative to solve the indexing problem. However, it results in the highest amount of disk space used to maintain stored indexes. This alternative is also less efficient in the query-per-hour metric, as the benchmark takes into account not only the performance of SELECT queries but also INSERT and DELETE operations, whose performance is affected by the presence of indexes due to the overhead of updating and maintaining the structure when records change [21, Ch. 8, p. 290-291]. It has the lowest ratio of performance to storage due to the space it needs to maintain indexes.

Figure 3: Static index configuration results. (a) Query-per-hour metric. (b) Index size (in MB).

While rCOREIL [1] is the most competitive reinforcement learning method in comparison to SmartIX, the amount of storage used to maintain its indexes is the highest among all baselines (except for having all columns indexed). rCOREIL does not track whether primary and foreign key indexes are already created, causing it to create duplicate indexes. The policy iteration algorithm used in rCOREIL is a dynamic programming method, characterized by complete sweeps of the state space at each iteration to update the value function. Since dynamic programming methods are not suitable for large state spaces [23, Ch. 4, p. 87], this can become a problem in databases that contain a larger number of columns to index.

Among the database advisors, the commercial tool EDB [6] achieves a higher query-per-hour metric than the open-source tool POWA [20], while its indexes occupy virtually the same disk space. The other baselines and related work are able to optimize the index configuration with lightweight index sizes, but are not competitive with the previously discussed methods in terms of the query-per-hour metric. Finally, among all the baselines, the index configuration obtained using SmartIX not only yields the best query-per-hour metric but also the smallest index size (except for the default configuration), i.e., it finds the balance between performance and storage.
Dynamic configurations
This section evaluates the behavior of the algorithms that generate policies, i.e., functions that guide an agent's behavior. The three algorithms that generate policies are SmartIX, rCOREIL, and NoDBA, all reinforcement learning algorithms, although with different strategies (see the Related Work section). While rCOREIL and SmartIX show a more interesting and dynamic behavior, NoDBA shows a fixed behavior and keeps only three columns indexed over the whole time horizon, without changing the index configuration over time (see its limitations discussed in Related Work). Therefore, we do not include NoDBA in the following analysis and focus the discussion on rCOREIL and SmartIX.
Fixed workload
We now evaluate the index configurations of rCOREIL and SmartIX over time while the database receives a fixed workload of queries. Figure 4 shows the behavior of rCOREIL and SmartIX, respectively. Notice that rCOREIL takes some time to create the first indexes in the database, doing so only after receiving about 150 queries, while SmartIX creates indexes at the very beginning of the workload. On the one hand, rCOREIL shows a fixed behavior and maintains all ground-truth optimal indexes, but it creates a total of 22 indexes, 16 of which are unnecessary, with the remaining 6 being optimal. On the other hand, SmartIX shows a dynamic behavior, consistently maintains 5 out of the 6 ground-truth optimal indexes, and does not maintain unnecessary indexes throughout most of the received workload.

Figure 4: Agent behavior with a fixed workload. (a) rCOREIL. (b) SmartIX.

Figure 5: Agent behavior with a shifting workload. (a) rCOREIL. (b) SmartIX.
Shifting workload
We now evaluate the algorithms' behavior while receiving a workload that shifts over time. To do so, we divide the 22 TPC-H queries into two sets of 11 queries, where for each set there is a different ground-truth set of indexes. That is, out of the 6 ground-truth indexes from the previous fixed workload, we now separate the workload so that 3 indexes are optimal for the first set of queries and 3 other indexes are optimal for the second set of queries. We thus aim to evaluate whether the algorithms can adapt the index configuration over time when the workload shifts and a different set of indexes is needed for each workload. The behavior of each algorithm is shown in Figure 5. The vertical dashed lines along the x-axis represent the time steps where the workload shifts from one set of queries to the other, and therefore where the set of ground-truth optimal indexes also changes. On the one hand, notice that rCOREIL behaves much as in the previous fixed-workload experiment: it takes some time to create the first indexes, and then maintains a fixed index configuration, not adapting as the workload shifts. On the other hand, SmartIX shows a more dynamic behavior with regard to the shifts in the workload. Notice that, at the beginning of each set of queries in the workload, there is a peak in total indexes, which decreases as soon as the index configuration adapts to the new workload and SmartIX drops the indexes that are unnecessary for the current workload. Even though rCOREIL maintains all 3 ground-truth indexes over time, it still maintains 16 unnecessary indexes, while SmartIX consistently maintains 2 out of the 3 ground-truth optimal indexes and adapts as the workload shifts.
Scaling up database size
In the previous sections, we showed that the SmartIX architecture can consistently achieve near-optimal index configurations in a database of size 1 GB. In this section, we report experiments on indexing larger databases, where we transfer the policy trained on the 1 GB database to perform indexing in databases of size 10 GB and 100 GB. We plot the behavior of our agent in Figure 6.

Figure 6: Agent behavior in larger databases. (a) 10 GB TPC-H database. (b) 100 GB TPC-H database.

As we can see, the agent shows behavior similar to that in the 1 GB database reported in previous experiments. The reason is that neither the state features nor the reward function is influenced by the database size. The only information relevant to the state and the reward function is the current index configuration and the workload being received. Therefore, we can successfully transfer the value function learned on smaller databases to index larger databases, consuming fewer resources to train the agent.
Related Work
Machine learning techniques are used in a variety of tasks related to database management systems and automated database administration [26]. One example is the work of Kraska et al. [8], which outperforms traditional index structures used in current DBMSs by replacing them with learned index models, with significant advantages under particular assumptions. Pavlo et al. [16] developed Peloton, which autonomously optimizes the system for incoming workloads and uses predictions to prepare the system for future workloads. In this section, though, we further discuss related work focused on developing methods for optimizing queries through automatic index tuning. Specifically, we focus our analysis on work that bases its approach on reinforcement learning techniques.

Basu et al. [1] developed a technique for index tuning based on a cost model that is learned with reinforcement learning. However, once the cost model is known, it becomes trivial to find the configuration that minimizes the cost through dynamic programming, such as the policy iteration method used by the authors. They use DBTune [3] to reduce the state space by considering only indexes that are recommended by the DBMS. Our approach, on the other hand, focuses on finding the optimal index configuration without complete knowledge of the environment and without DBMS heuristics to reduce the state space.

Sharma et al. [22] use a cross-entropy deep reinforcement learning method to administer databases automatically. Their set of actions, however, only includes the creation of indexes, and a budget of 3 indexes is set to deal with space constraints and index maintenance costs; indexes are only dropped once an episode is finished. A strong limitation of their evaluation process is that it queries only the LINEITEM table, which does not exploit how indexes on other tables can optimize database performance and consequently reduces the state space of the problem. Furthermore, they do not use the TPC-H benchmark performance measure to evaluate performance, but rather query execution time in milliseconds.

Other papers show that reinforcement learning can also be explored in the context of query optimization by predicting query plans: Marcus et al. [10] proposed a proof-of-concept to determine the join ordering for a fixed database; Ortiz et al. [14] developed a learned state representation to predict the cardinality of a query. These approaches could possibly be used alongside ours, generating better query execution plans while we focus on maintaining indexes from which these queries can benefit.
Conclusion
In this research, we developed the SmartIX architecture for automated database indexing using reinforcement learning. The experimental results show that our agent consistently outperforms the baseline index configurations and related work on genetic algorithms and reinforcement learning. Our agent is able to find the trade-off between the disk space its index configuration occupies and the performance metric it achieves. The state representation and the reward function allow us to successfully index larger databases while training on smaller databases and consuming fewer resources.

Regarding the limitations of our architecture, we do not yet deal with composite indexes, due to the resulting state space of all possible indexes that use two or more columns. Our experiments show results using workloads that are read-intensive (i.e., intensively fetching data from the database), which is exactly the type of workload that benefits from indexes. However, experiments using write-heavy workloads (i.e., intensively writing data to the database) would be interesting to verify whether the agent learns to avoid indexes on write-intensive tables. Considering these limitations, in future work we plan to: (1) investigate techniques that allow us to deal with composite indexes; (2) improve the reward function to provide feedback in the case of write-intensive workloads; (3) investigate pattern recognition techniques to predict incoming queries and index ahead of time; and (4) evaluate SmartIX on big data ecosystems (e.g., Hadoop).

Finally, our contributions include: (1) a formalization of a reward function shaped for the database indexing problem, independent of the DBMS's statistics, that allows the agent to adapt the index configuration according to the workload; (2) an environment representation for database indexing that is independent of schema or DBMS; and (3) a reinforcement learning agent that efficiently scales to large databases while being trained on small databases, consuming fewer resources. As a result of this research, we published a paper in the Applied Intelligence journal [9].

In closing, we envision this kind of architecture being deployed in cloud platforms such as Heroku and similar platforms that often provide database infrastructure for various clients' applications. The reality is that these clients do not prioritize, or it is not in their scope of interest to focus on, database management. Especially in the case of early-stage start-ups, the aim to shorten time-to-market and quickly ship code motivates externalizing complexity to third-party solutions [7]. From an overall platform performance point of view, efficient database management results in an optimized use of hardware and software resources. Thus, in the absence of a database administrator, the SmartIX architecture is a potential stand-in solution, as experiments show that it provides at least equivalent and often superior indexing choices compared to baseline indexing recommendations.
References

[1] Basu, D.; Lin, Q.; Chen, W.; Vo, H. T.; Yuan, Z.; Senellart, P.; and Bressan, S. 2016. Regularized cost-model oblivious database tuning with reinforcement learning. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Springer. 96–132.
[2] Bellman, R. 1957. A Markovian decision process. Journal of Mathematics and Mechanics 6(5):679–684.
[4] Proceedings of the VLDB Endowment.
[5] IKE proceedings, 1.
[6] EnterpriseDB. 2019. Enterprise Database.
[7] Giardino, C.; Paternoster, N.; Unterkalmsteiner, M.; Gorschek, T.; and Abrahamsson, P. 2016. Software development in startup companies: the greenfield startup model. IEEE Transactions on Software Engineering.
[8] Kraska, T.; Beutel, A.; Chi, E. H.; Dean, J.; and Polyzotis, N. 2018. The case for learned index structures. In Proceedings of the 2018 SIGMOD, 489–504. ACM.
[9] Licks, G. P.; Couto, J. C.; de Fátima Miehe, P.; De Paris, R.; Ruiz, D. D.; and Meneguzzi, F. 2020. SmartIX: A database indexing agent based on reinforcement learning. Applied Intelligence.
[10] Marcus, R., and Papaemmanouil, O. 2018. Deep reinforcement learning for join order enumeration. CoRR abs/1803.00055:1–7.
[11] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
[13] Olofson, C. W. 2018. Ensuring a fast, reliable, and secure database through automation: Oracle autonomous database. White paper, IDC Corporate USA, sponsored by Oracle Corp.
[14] Ortiz, J.; Balazinska, M.; Gehrke, J.; and Keerthi, S. S. 2018. Learning state representations for query optimization with deep reinforcement learning. CoRR abs/1803.08604:1–5.
[15] Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. 8024–8035.
[16] Pavlo, A.; Angulo, G.; Arulraj, J.; Lin, H.; Lin, J.; Ma, L.; Menon, P.; Mowry, T. C.; Perron, M.; Quah, I.; et al. 2017. Self-driving database management systems. In CIDR proceedings, 1–6.
[17] Pedrozo, W. G.; Nievola, J. C.; and Ribeiro, D. C. 2018. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In HAIS, 716–729. Springer.
[18] Petraki, E.; Idreos, S.; and Manegold, S. 2015. Holistic indexing in main-memory column-stores. In Proceedings of the 2015 SIGMOD, 1153–1166. New York, NY, USA: ACM.
[19] Popovic, J. 2017. Automatic tuning - SQL Server. Accessed: 2019-06-17.
[20] POWA. 2019. PostgreSQL Workload Analyzer.
[21] Ramakrishnan, R., and Gehrke, J. 2003. Database Management Systems. New York, NY, USA: McGraw-Hill, Inc., 3rd edition.
[22] Sharma, A.; Schuhknecht, F. M.; and Dittrich, J. 2018. The case for automatic database administration using deep reinforcement learning. CoRR abs/1801.05643:1–9.
[23] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press.
[24] Thanopoulou, A.; Carreira, P.; and Galhardas, H. 2012. Benchmarking with TPC-H on off-the-shelf hardware. 205–208. Wroclaw, Poland: Springer.
[25] Tsitsiklis, J. N., and Van Roy, B. 1996. An analysis of temporal-difference learning with function approximation. Technical report, Laboratory for Information and Decision Systems, MIT.
[26] Van Aken, D.; Pavlo, A.; Gordon, G. J.; and Zhang, B. 2017. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 SIGMOD, 1009–1024. ACM.
[27] Wang, J.; Liu, W.; Kumar, S.; and Chang, S.-F. 2016. Learning to hash for indexing big data—a survey. Proceedings of the IEEE 104(1):34–57.