Automatic Storage Structure Selection for Hybrid Workload
Hongzhi Wang, Yan Wei and Hao Yan
School of Computer Science and Technology, Harbin Institute of Technology
Harbin, China
{wangzh, 18s003043}@hit.edu.cn, [email protected]

Abstract—In the use of database systems, the design of the storage engine and the data model directly affects query performance. Database users therefore need to select a storage engine and design a data model according to the workload they face. Under a hybrid workload, however, the query set of the database changes dynamically, and the optimal storage structure changes with it. Motivated by this, we propose an automatic storage structure selection system based on a learned cost model, which dynamically selects the optimal storage structure of the database under hybrid workloads. In the system, we introduce a machine learning method to build a cost model for the storage engine, together with a column-oriented data layout generation algorithm. Experimental results show that the proposed system chooses the optimal combination of storage engine and data model for the current workload, greatly improving on the performance of the default storage structure. The system is also designed to be compatible with different storage engines, making it easy to use in practical applications.
Index Terms—storage structure, self-driving database, hybrid workload, database system
I. INTRODUCTION
Large companies, especially those in the Internet industry, are keen to collect and analyze data. After a large amount of data is generated by trades or mobile applications, it is quickly used for decision support, business intelligence, and personalized recommendation. In this process, the data needs to be processed under both transactional and analytical workloads, which is also known as hybrid transactional-analytical processing (HTAP). Data engineers need to balance hardware cost against performance and provide HTAP systems suitable for their business scenarios. They need to investigate and compare databases from different vendors and select a solution within a limited time. Such a solution often involves a combination of multiple databases that support different workload types, which may include an OLAP database and an OLTP database (or a NoSQL store that supports write-heavy workloads), connected and synchronized through ETL tools [1].

Database development has entered the stage of designing for specific tasks, and different databases have different designs and advantages. As the saying goes, "one size does not fit all" [2]: databases choose different models for the underlying storage engine in order to adapt to the needs of different workloads. These databases differ from each other and cannot replace each other [3]. For example, even databases based on the same LSM model come in different variants that trade off the performance of write and scan operations [4]. Among these rich choices, it is difficult for data engineers to pick the most suitable type based on database descriptions or experience. Worse, the workloads created by the business are not static; they may shift over time, so the selected database may no longer be suitable for the new workloads. In summary, data engineers face the following issues when building HTAP systems:

• To combine different databases, engineers may need to use ETL tools to synchronize data across them. The synchronization process generally pulls the latest data from the OLTP database and imports it into the OLAP database. In this case, analytical queries cannot directly access the latest data in the OLTP database.

• Data engineers need to understand the principles of different databases and their performance characteristics, and rely on experience to select the database types in the system. Given the capabilities and experience of data engineers, this may take days or even months of investigation and experimentation.

• The workload may change dynamically over time. Such changes may make the original design no longer optimal, or even unable to meet performance requirements.

In recent years, self-driving databases have liberated data engineers from tedious tasks [5], but they tend to be developed as new systems. Systems have emerged that can dynamically adjust the storage design to achieve high-performance processing of HTAP workloads. However, such systems require expensive hardware (such as in-memory databases using large amounts of memory [6] or global shared memory [7]). Moreover, the immaturity of these systems also prevents data engineers from choosing them.

For enterprises, the best solution is still to design the system with existing mature databases.
Thus, instead of establishing a new system, we attempt to take advantage of existing databases to solve the problem of automatically designing the HTAP system, which makes full use of the community support and low cost of these databases.

Motivated by this, in this paper we develop an automatic design method for the HTAP database based on existing databases. The focus is to select a proper storage structure according to the features of the hybrid workload. To achieve this goal, we require an effective way to evaluate the cost of executing the workload on the data storage. Considering the diversity of data storage structures, we develop a cost evaluation approach based on machine learning that needs no knowledge of database internals, and the storage structure is selected according to the estimated cost. The contributions of this paper are summarized as follows:

• We regard the data layout under a specific storage engine as a storage structure and propose a storage structure selection system based on a learned cost model. The proposed system can automatically perform storage structure generation, evaluation, and conversion.

• We propose a learning method for the storage engine cost model, which does not require internal knowledge of the storage engine.

• A novel benchmark is proposed for building the learning models, which can dynamically generate data schemas and workloads for performance data collection.

• To increase flexibility, we propose a column-oriented data layout recommendation algorithm, which is used to generate storage structure candidates.

The structure of this paper is as follows. Section II introduces the architecture of the system, including the explanation of each key module and the workflows of the system. Section III introduces the method of building a storage engine learning model, including how to design features, how to collect training data, and how to build a more accurate model. Section IV introduces the selection method of the storage structure, especially the column-oriented data layout recommendation algorithm. Section V conducts an experimental evaluation of the system. Finally, Section VI summarizes the paper and discusses future work.
II. SYSTEM OVERVIEW
In this section, we present the architecture of the storage structure recommendation system. First, we summarize the methodology of relevant systems in Section II-A, which helps in understanding the proposed system. Then we introduce the main modules of the system in Section II-B, especially the candidate execution module. Finally, we introduce the workflows related to storage structure recommendation in Section II-C.
A. Methodology
The purpose of a self-driving database is to establish a database management system that designs and tunes itself. The optimization task of automatically selecting the storage structure plays an important role in self-driving databases. Over decades of autonomous database development, a multi-level theoretical system has been established [8], covering knob tuning, algorithm picking, and even the self-design of database components. Current research on search methods for algorithm picking problems [9] can be simplified into the following three steps:
• Pruning: The original search space of an algorithm picking problem may be huge, so it is better to first prune the search space. For example, a multi-index optimization problem can be transformed into a combination of single-index optimization problems, and queries that have little effect can be removed from the workload to narrow the input.
• Candidate Generation: Traverse the search space and generate candidates by enumerating solutions using related rules. If the original problem is decomposed into several sub-problems, the candidates of the original problem need to be generated from the candidates of the sub-problems.
• Evaluation: Choose an optimal solution among the candidates by calculating costs. For candidates whose cost is easy to test, we can run a set of experiments to find the best one. For candidates that cannot be easily tested, a What-If cost model needs to be established [10].

The system proposed in this paper follows the search steps mentioned above. In order to remain compatible with existing storage engines, our system is designed with low coupling to the host database. The system is divided into three parts as shown in Fig. 1, namely the host database, the database adapter, and the storage structure selection module. The workload is analyzed in the storage structure selection module to obtain the candidate storage structures, and the optimal storage structure is selected by comparing the costs of the candidates. Since a storage structure scheme is relatively expensive to test directly, it is not feasible to run experiments to see which candidate is the best; instead, we propose a series of What-If cost models to evaluate the candidates.
B. Storage structure selection module
The storage structure selection module is divided into five submodules, namely the benchmark module, the learning module, the candidate generation module, the storage structure evaluation module, and the candidate execution module. The submodules are introduced as follows:
• Benchmark module: This module manages the execution of the benchmark and the collection of its performance data. The benchmark is a simulation on the host database, based on generated workloads over generated data. It is used to obtain performance data of the host database under different storage engines and different operations. The benchmark is customized for the proposed system and is discussed in Section III-B.
• Learning module: This module learns and updates the cost model. The cost model is learned from the benchmark result data and the performance data collected at runtime, and is used for cost estimation of different storage structures under a given workload. The model learning method is discussed in Section III-C.
Fig. 1. System architecture.
• Candidate generation module: This module analyzes the recent workload and then generates a set of candidates, which are combinations of data layout and storage engine. It takes the workloads and the table metadata as input, and the workloads are pruned before being used for candidate generation. The details of the data layout recommendation algorithm are discussed in Section IV.
• Storage structure evaluation module: This module computes the cost of each candidate storage structure and obtains the optimal storage structure by comparison. Since the workload needs to be converted into a set of query plans before it can be used for cost estimation, this module requires a connector to the built-in query planner of the host database.
• Candidate execution module: This module applies the optimal storage structure to the target data partition when required. Whether the change is a switch of the storage engine or a transformation of the data layout, the data format in that partition must be converted. This module is responsible for executing the conversion and provides atomicity guarantees for the database system. Through the snapshots and logs of the database, the conversion of data partitions between different formats becomes relatively simple.
C. Workflow
In the proposed system, there are two main workflows running through the five submodules of the storage structure selection module. The performance data collection and model learning workflow is used to establish the cost model for the evaluation step mentioned at the beginning of this section. The storage structure recommendation workflow is used to prune the workloads and then generate, evaluate, and apply storage structures.
1) Performance data collection and model learning:
Performance data records the behavior of the storage engine under various operations. The learning module uses the performance data to train and update the cost model. The performance data consists of two parts. First, the system performs a customized benchmark in the early stage of deployment, and the benchmark module uses system behavior such as the elapsed time of each operation to construct the first batch of performance data. Since the benchmark needs to cover different data schemas, workloads, and storage engines, this process takes some time. If the system is deployed on servers with the same hardware configuration, the benchmark performance tests can be run in parallel, because servers with the same hardware share the performance characteristics of the storage engine. The other part of the performance data is collected while the system is running. When users run the database under real-world workloads, the system samples the performance data. The sampled data is generated from real scenarios, so it is more representative and is used to enrich the performance data and update the cost models.

After sufficient performance data has been collected in the benchmark, the benchmark module is not started again until the hardware environment of the database changes. Whenever the database detects a change in the hardware environment, the benchmark module must be rerun and the performance data it collects is used to initialize the cost model. After initialization, the performance data gradually grows, and the learning module periodically checks whether it has increased. If the performance data has changed, the learning module reruns the learning algorithm and updates the cost model.
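As an illustration, this maintenance workflow can be sketched as a simple loop; the callables passed in (run_benchmark, collect_runtime_samples, train_cost_model, hardware_changed) are hypothetical hooks into the host system, not interfaces defined by the paper.

```python
import time

def model_maintenance_loop(run_benchmark, collect_runtime_samples,
                           train_cost_model, hardware_changed,
                           check_interval_s=3600):
    """Keep the cost model in sync with newly collected performance data."""
    performance_data = list(run_benchmark())      # first batch from the customized benchmark
    cost_model = train_cost_model(performance_data)
    while True:
        time.sleep(check_interval_s)
        if hardware_changed():                    # hardware changed: benchmark and rebuild
            performance_data = list(run_benchmark())
            cost_model = train_cost_model(performance_data)
            continue
        new_samples = collect_runtime_samples()   # samples taken under the real workload
        if new_samples:                           # data grew: rerun the learning algorithm
            performance_data.extend(new_samples)
            cost_model = train_cost_model(performance_data)
```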
2) Storage Structure Recommendation:
The storage structure is defined as a combination of a storage engine and a data layout. Fortunately, the choice of storage engines is relatively small, depending on the number of storage engines supported by the current database. The search space of data layouts, however, is relatively large, so the system focuses more on the selection of the data layout. To evaluate the storage engine and the data layout at the same time, the system generates multiple candidate data layouts for the current workload of each data table. These data layouts represent local optima among all possible data layouts. They are arbitrarily combined with the possible storage engines to create all candidate storage structures. Then, the query engine of the database generates a corresponding query plan for each candidate storage structure and the workload. The cost estimation module predicts the cost of each candidate storage structure according to the query plan and selects the optimal storage structure among them.

Note that the optimal storage structure may not be applied immediately. The DBA can control the trigger conditions or the trigger time of the application process, such as a threshold of predicted performance improvement that the DBA considers worth applying the new storage structure for, or a schedule that the DBA considers suitable for storage changes.
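The recommendation step can be pictured with the following sketch; plan_queries and estimate_cost are assumed hooks to the host database's planner and to the learned cost model of Section III, not APIs defined by the paper.

```python
from itertools import product

def recommend_storage_structure(workload, layout_candidates, engines,
                                plan_queries, estimate_cost):
    """Enumerate (data layout, storage engine) candidates, estimate the
    workload cost of each from its query plans, and return the cheapest."""
    best, best_cost = None, float("inf")
    for layout, engine in product(layout_candidates, engines):
        plans = plan_queries(workload, layout, engine)      # reuse the built-in planner
        cost = sum(estimate_cost(plan, engine) for plan in plans)
        if cost < best_cost:
            best, best_cost = (layout, engine), cost
    return best, best_cost
```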
III. LEARNED COST MODEL
To compare storage structures, a natural way is to establish a cost model for storage engines. Many methods for modeling the cost have been proposed, such as analytical methods based on disk addressing and scanning [11] or learning methods using query features [12]. In our system, since storage structures under different workloads need to be compared, the cost model has the following requirements:

Comparability: The operation costs obtained under different storage engines should be comparable to each other. This means that cost models based on query cardinality cannot be used.

Compatibility: The cost model should be compatible with different storage engines. The purpose of the system is to reuse the richness and reliability of existing query engines, so it cannot be limited to a specific query engine. The cost model should avoid using storage engine internals.

Workload-oriented: Compared with the goal of query optimization, the system cares more about the total cost of the workload. Therefore, the cost model should represent the global performance during the execution of the workload.

This section first introduces the features used by the cost model in Section III-A. The training data is collected based on these features. In order to collect the training data, we propose a novel benchmark in Section III-B. Finally, Section III-C describes how to use the features and training data to learn the cost model.

To achieve comparability, our cost model uses the query elapsed time as its measure, because the elapsed time directly reflects the latency of the system when responding to different requests, and reducing this latency is the primary task of system design. Besides, the query elapsed time is roughly proportional to the system's other resource usage, such as IO and memory usage. Another reason for choosing time as the cost measure is that many query features are empirically proportional to query time, such as the selectivity of query predicates. The linear relationship between these features and the target measure is easily captured by machine learning algorithms.
A. Model Features
Since the results of machine learning are strongly related to the quality of the input training data, it is necessary to provide high-quality training data. Since time is used as the measure of cost, we use performance data as the training data. In order to obtain good performance data, we need to define the features provided in the data. In this section, we first analyze how different factors affect database performance, and then show the features used in the proposed model.

The features of the model should be the factors that affect queries the most, while following the requirements mentioned in the previous section. Many factors determine the running time of a query, and they can be divided into the following categories.
Hardware features:
These features describe the types and parameters of the hardware resources that the database system can use at runtime. The hardware resources of a server rarely change while the database is running.
Database configuration features:
Each database has different configuration knobs that can be tuned for performance improvement.
Workload features:
These features include the table schema, metadata, and query attributes. They determine the logical cost of query execution, such as the number of data blocks read, the utilization of the cache, and so on.
Runtime features:
These features describe the specific usage of operating system components and database components during execution, such as the use of the page cache and the hit rate of the database buffer. Database-level features are related to internal knowledge of the database and are hard to capture uniformly. System-level features use the status of operating system components, such as the utilization of the page cache and the read and write throughput of the disk.

In addition to the above features, unpredictable factors are encountered during query execution, such as whether the queried data is cached, whether the query returns early through an auxiliary structure (such as a Bloom filter), and whether hardware performance is jittery. These factors, as well as the previously described features that require internal knowledge of the database, are not used in the cost model because they are difficult for the proposed system to capture.

In summary, the cost model selects workload features and system-level runtime features as the input of the machine learning model. These features are summarized in Table I.
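For concreteness, a performance-data record consumed by the learning model might look like the following sketch; the field names are ours and simply mirror Table I, they are not identifiers defined by the paper.

```python
from dataclasses import dataclass

@dataclass
class PerformanceSample:
    # table schema features
    avg_row_length: float
    num_key_fields: int
    num_value_fields: int
    key_length: float
    value_length: float
    num_fixed_fields: int
    num_var_fields: int
    # operation features
    op_type: str                    # e.g. "point_lookup", "scan", "insert"
    result_size: int                # rows returned or inserted
    selectivity: float              # fraction of the key range touched
    insertion_randomness: float     # normalized inversion number in [0, 1]
    # system-level runtime features
    disk_read_mb_s: float
    disk_write_mb_s: float
    page_cache_ratio: float         # loaded pages / total pages of the table file
    num_files: int                  # plus level-1 / level-2 counts for LSM engines
    elapsed_time_per_row_us: float  # learning target
```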
1) Workload features:
The system extracts features from the table schema and the operations. When considering the table schema, we treat all fields in the primary key as key fields and the others as value fields. To simplify the data types, fields are divided into fixed-length fields and variable-length fields, because the database system cares about the storage form of a field rather than the way the field is expressed. These features affect the number of data blocks actually accessed when reading data or indexes for a given query.

We decompose a query into a series of data access operations. Data access operations are simpler than queries and easier for feature extraction.
TABLE I
FEATURES OF PERFORMANCE DATA

Table Schema Features:
• Average Length of Rows
• Number of Key/Value Fields
• Length of Key/Value Fields
• Number of Fixed-length / Variable-length Fields
• Length of Fixed-length / Variable-length Fields

Operation Features:
• Operation Type
• Operation Result Size (records returned by read operations or records inserted by update operations)
• Operation Selectivity (selectivity of scan operations or affected key range of update operations)
• Randomness of Data Insertion (normalized inversion number of the key insertion sequence)

System-level Runtime Features:
• Disk Read/Write Throughput (in a recent period)
• Loaded Pages in Page Cache / Total Pages of the Table File
• Number of Files / Number of Level-1 Files / Number of Level-2 Files (used for amplification analysis of LSM storage)

The features of data access operations include the operation result size and the operation selectivity. The reason for using selectivity is that we want to capture the locality of the operation. The size of the result is directly related to the number of data block accesses, and the locality of the operation is related to whether the data blocks can be cached.

In addition, the randomness of the data upon insertion is also recorded in the features. Data inserted sequentially is closer to a sorted array in the physical structure, so it has higher efficiency in insertion and search. We denote the insertion sequence of keys as $\pi$ and use its inversion number $\mathrm{inv}(\pi)$ to measure the randomness of data insertion, normalized to [0, 1]. If the data is inserted into the storage engine sequentially, as in a log, the randomness is 0; if the data is inserted randomly, the randomness is 1. Note that sorted order and reverse order are treated equally. The randomness is defined as:

$$\mathrm{randomness}(\pi) = \begin{cases} \dfrac{\mathrm{inv}(\pi)}{(|\pi|^2 - |\pi|)/4}, & \mathrm{inv}(\pi) < \dfrac{|\pi|^2 - |\pi|}{4} \\[2ex] \dfrac{(|\pi|^2 - |\pi|)/2 - \mathrm{inv}(\pi)}{(|\pi|^2 - |\pi|)/4}, & \mathrm{inv}(\pi) \ge \dfrac{|\pi|^2 - |\pi|}{4} \end{cases}$$
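A minimal sketch of how this insertion randomness can be computed, assuming the normalization reconstructed above (inversions counted with a merge sort):

```python
def count_inversions(keys):
    """Count inversions of a sequence with a merge-sort pass, O(n log n)."""
    def sort(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort(a[:mid])
        right, inv_right = sort(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
                inv += len(left) - i        # right[j] precedes every remaining left element
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, inv
    return sort(list(keys))[1]

def insertion_randomness(keys):
    """Normalized inversion number in [0, 1]; sorted and reverse-sorted input both give 0."""
    n = len(keys)
    if n < 2:
        return 0.0
    max_inv = n * (n - 1) / 2        # inversions of a fully reversed sequence
    half = max_inv / 2               # expected inversions of a random permutation
    inv = count_inversions(keys)
    return inv / half if inv < half else (max_inv - inv) / half

# insertion_randomness([1, 2, 3, 4]) -> 0.0 (log-style sequential insertion)
# insertion_randomness([4, 3, 2, 1]) -> 0.0 (reverse order treated the same)
```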
2) Runtime features:
For the runtime features, we mainly monitor the operating system page cache usage and the disk read/write throughput in the most recent period. Page caching is a general optimization of the operating system: it uses free memory to cache recently opened files, reducing the number of disk access requests. In a database, different tables normally correspond to different data files. If a table has recently been opened for data access, some data pages of the file corresponding to this table will enter the operating system page cache. The LRU cache strategy of a typical database will also keep recently opened data blocks in the database cache. Therefore, monitoring the page cache can reflect the usage of each table in the cache of the storage engine, without caring about the cache design details.

Disk read and write throughput also affects query performance. When the disk is busy, data access requests are delayed due to waiting or scheduling. Therefore, recording the page cache state and disk throughput under different storage engines helps analyze the read and write performance of different storage engines and their use of disk IO, thereby improving the accuracy of prediction.

Besides, the proposed system considers some runtime features that are not related to storage engine implementation details but to the storage engine category. For example, for an LSM storage engine, the number of files in the first level, the number of files in the second level, and the total number of files are collected to analyze the read or write amplification phenomenon [4].

In order to visually show the relationship between features and performance, we selected the two most obvious workload and runtime features for testing. The results are shown in Fig. 2.

Fig. 2. Operation time per row under different numbers of scanned rows and page cache utilization.

The figure is a scatter diagram of the operation time per row under two characteristics: the number of scanned rows and the operating system page cache utilization. It can be observed that as the number of scanned rows increases, the operation time per row decreases significantly and eventually stabilizes. Under the same number of scanned rows, the higher the operating system page cache utilization, the lower the operation time per row.
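The disk-throughput part of these runtime features can be sampled from operating system counters, for example with the psutil library as in the sketch below; per-file page-cache residency is platform-specific (e.g. the mincore(2) syscall or vmtouch on Linux) and is only noted in a comment.

```python
import time
import psutil  # assumed available; any OS metrics library would do

def sample_disk_throughput(interval_s=1.0):
    """Approximate recent disk read/write throughput in MB/s from OS counters."""
    before = psutil.disk_io_counters()
    time.sleep(interval_s)
    after = psutil.disk_io_counters()
    read_mb_s = (after.read_bytes - before.read_bytes) / interval_s / 2**20
    write_mb_s = (after.write_bytes - before.write_bytes) / interval_s / 2**20
    return read_mb_s, write_mb_s

# Page-cache residency of a table file (loaded pages / total pages) can be
# obtained on Linux with the mincore(2) syscall or tools such as vmtouch;
# it is omitted here to keep the sketch portable.
```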
B. Benchmark Design
In collecting the performance data, we need to capture the features mentioned above and ensure that the training data meets the requirements of machine learning in both size and quality. Common benchmarks such as TPC-H and YCSB focus on evaluating the performance of different databases under specific workloads, so they are not suitable for the proposed system. This paper proposes a brand-new benchmark method, which automatically generates a variety of workloads to obtain more comprehensive training data. The workflow of the proposed benchmark is shown in Fig. 3.
Fig. 3. Workflow of the Benchmark
1) Data schema generation:
The data schema used in the proposed benchmark is randomly generated. The generation mainly considers the number, length, and type of fields. The data schema, together with the data generation pattern of each field, is assigned at the beginning of the performance test. The diversity of data schemas guarantees the sufficiency of the training data. The number of fields is drawn from a long-tail distribution within a fixed range, which reflects actual database application scenarios.

In a general data schema, each field also has a type attribute that determines its size and encoding method; we use two types, fixed-length fields and variable-length fields. The size of a fixed-length field is designed to be relatively small, considering that this kind of field is generally used to store numeric data. A variable-length field can take a length within a relatively large range as its average length. In addition, the average length of variable-length fields is also drawn from a long-tail distribution, since most data fields store numbers or labels, while only a few store long text.

Another reason for using long-tail distributions is that, in terms of database design, the database usually uses a fixed-size data block as the unit of data operations. When the row length is large, a difference of several bytes does not affect the number of rows stored in a data block. Therefore, longer rows are less influential than shorter rows for data blocks and have less weight in the distribution.

When generating the data schema, the pattern of field content should also be designed, such as the data insertion order (log-style sequential insertion or transaction-style random insertion) and the amount of information in the field (information entropy). The design of these patterns can refer to real-world data.
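A sketch of such a schema generator is shown below; the Pareto tail parameters, the field-length ranges, and the fixed/variable split are illustrative assumptions, not values taken from the paper.

```python
import random

def generate_schema(max_fields=32, seed=None):
    """Randomly generate a table schema; the number of fields and the average
    length of variable-length fields follow long-tailed (Pareto) distributions."""
    rng = random.Random(seed)
    num_fields = min(max_fields, 2 + int(rng.paretovariate(1.5)))   # long-tailed field count
    fields = []
    for i in range(num_fields):
        if rng.random() < 0.7:
            # most fields are small fixed-length values (e.g. numeric data)
            fields.append({"name": f"f{i}", "kind": "fixed", "length": rng.choice([4, 8])})
        else:
            # a few fields hold longer variable-length text; long-tailed average length
            avg_len = min(1024, int(8 * rng.paretovariate(1.2)))
            fields.append({"name": f"f{i}", "kind": "variable", "length": avg_len})
    return fields
```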
2) Workload of Performance Test:
After the data schema is specified, the benchmark module inserts into and queries the table according to the workload. The workload of the performance test should be feasible and contain variable features. To reduce complexity, the workload is defined as a series of basic operations on the storage engine, which are common access methods [13].
Fig. 4. Outliers under different operations.

If the storage engine supports special operations such as secondary index lookup or column aggregation, additional cost models and training data are needed for the engines that support them. For convenience of discussion, this paper only introduces the design of common operations.

Although write operations are classified into single-row insertion and batch insertion, most storage engines that use disk as the storage medium buffer single-row insertions and convert them into batch updates using techniques such as the write-ahead log. To specify a write operation, we only need the number of written rows once the data has been generated. Since large-scale insertions are rare in practice, the number of written rows should also follow a long-tail distribution.

The read operations are divided into point lookups and range scans. Since the database system uses data blocks as the operation unit, sequential scans benefit from locality, and the per-row performance of point lookups is significantly worse than that of range scans. When designing the read operation workload, it is necessary to pay attention to the impact of the cache: if the keys of two adjacent queries are also adjacent, the second query will read the data through the cache, thereby distorting the measured time of a single read operation.

After the database runs for a while, some data may be cached in memory as "hot data", biasing the performance results. To avoid most of the training data coming from hot-data queries, the workload should be controlled. There are two ways to control it. One is to clear the operating system's page cache. The page cache is usually larger than the actual cache of the database, which makes data become hot easily; clearing the page cache simulates page cache replacement caused by IO operations outside the target table. The other is to interrupt the workload and wait for a period, simulating idle periods of the database workload. These two operations affect the characteristics of the page cache and the disk throughput in the operating system.
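One run of the performance test could therefore be organized as in the following sketch; the engine wrapper and its methods are hypothetical, while the cache clearing uses the standard Linux drop_caches mechanism mentioned above.

```python
import random
import subprocess
import time

def drop_page_cache():
    """Linux-only, requires root: flush dirty pages and drop the OS page cache."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")

def one_benchmark_run(engine, schema, rows=100_000, seed=0):
    """One run of the performance test: insert, then point lookups, then scans,
    waiting and clearing the cache between phases. `engine` is a hypothetical
    wrapper around the storage engine that records the elapsed time per operation."""
    rng = random.Random(seed)
    keys = engine.insert_rows(schema, rows, rng)                    # write phase
    time.sleep(5); drop_page_cache()                                # wait and clear cache
    engine.point_lookups(rng.sample(keys, k=min(1_000, len(keys))))
    time.sleep(5); drop_page_cache()                                # wait and clear cache
    engine.range_scans(keys, rng)                                   # scans with varied selectivity
```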
C. Learning Model
In this section, we use the performance data to train the cost model. The training is divided into training the writing cost model and training the reading cost model. In order to obtain a more accurate cost model, we analyzed the collected performance and designed the cost model according to the features of the performance data.
Fig. 5. Cost model for writing data.

Performance data obtained from the benchmark and from the system runtime is stored separately according to the storage engine and operation category, because each cost model is designed to predict only certain operations on a certain storage engine. Outliers need to be removed before training. By observing the performance data, we found that different operations have different performance characteristics, as shown in Fig. 4. For read performance, when the number of scanned rows is low, the elapsed time per row increases significantly. Write performance is relatively stable because of the write buffer in the database. Therefore, for write performance data, thresholds can be used to distinguish outliers, while for read data, density clustering is more suitable for outlier discrimination.

However, the outliers in the write performance data do not mean that these points contribute nothing to the storage engine model. Due to the write buffer, data is accumulated until a threshold is reached and then written out. At this moment, the resulting IO increases the time of write operations, which manifests as a surge in the elapsed time per row. This surge has an impact on the total performance: in our test on RocksDB, the query delay caused by the outliers in the write data accounts for about 5-10% of the total delay. Therefore, the writing cost model should take the surge into consideration. We refer to the outliers in the write data as IO surge points. The appearance of IO surge points is also considered to be related to the features in the training data.

When training the writing cost model, we use multiple learners, as shown in Fig. 5. The proposed writing cost model uses logistic regression to learn the IO surge points and predict the probability of their occurrence under a given workload, because it is a parameter-free learner with high efficiency. Since the goal of the system is to estimate the cost of the workload, there should be massive write operations in the workload; by the law of large numbers, the sum of the IO surge probabilities of the write operations in the workload should be close to the actual number of IO surge events.

The writing cost model also establishes regression models of the cost for IO surge points and for regular data points, to determine the elapsed time under IO surge events and regular write events respectively. These regression models use the XGBoost learner [14], which is a trade-off between performance and prediction accuracy. Considering the characteristics of commercial servers, they may not have enough resources to run some machine learning algorithms, while XGBoost has lower computation requirements and stronger accuracy than traditional learning methods. If the predicted elapsed time under a regular write event is expressed as $f_r(x)$ and the predicted elapsed time under an IO surge event is expressed as $f_o(x)$, then the overall time prediction for the write workload is:

$$f_w(X) = \sum_{x \in X} \big( f_o(x)\, p(x) + f_r(x) \big)$$

Read operations are generally not performed after buffering multiple requests, so there is no need to model outliers for them.
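A sketch of this writing cost model, assuming scikit-learn for the logistic regressor and the xgboost package for the regressors (the hyperparameters and class name are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor

class WriteCostModel:
    """Logistic regression predicts the probability of an IO surge event;
    two XGBoost regressors predict the elapsed time for surge and regular
    writes. Feature rows follow Table I."""

    def __init__(self):
        self.surge_clf = LogisticRegression(max_iter=1000)
        self.surge_reg = XGBRegressor(n_estimators=200, max_depth=6)
        self.regular_reg = XGBRegressor(n_estimators=200, max_depth=6)

    def fit(self, X, elapsed, is_surge):
        X = np.asarray(X, dtype=float)
        elapsed = np.asarray(elapsed, dtype=float)
        is_surge = np.asarray(is_surge, dtype=bool)
        self.surge_clf.fit(X, is_surge)                          # p(x): IO surge probability
        self.surge_reg.fit(X[is_surge], elapsed[is_surge])       # f_o(x): cost under IO surge
        self.regular_reg.fit(X[~is_surge], elapsed[~is_surge])   # f_r(x): regular write cost
        return self

    def predict_workload(self, X):
        """f_w(X): sum over write operations of f_o(x) * p(x) + f_r(x)."""
        X = np.asarray(X, dtype=float)
        p = self.surge_clf.predict_proba(X)[:, 1]
        return float(np.sum(self.surge_reg.predict(X) * p + self.regular_reg.predict(X)))
```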
The reading cost model only uses the XGBoost learner for training.

After the query engine parses a SQL statement, it generates a query plan tree composed of operators, including operators related to data access and operators related to computation. When predicting the workload, the system uses the query engine to generate the query plan tree for each query of the workload, and uses the data access operators to predict the workload cost. For storage structures that use the same query plan tree, the order of the predicted data access costs represents the order of the query costs, because the computation operators are exactly the same. For storage structures that use different query plan trees, prediction errors may be caused by the different computation operators. In general, for data-intensive applications, the data access operators take much longer than the computation operators, so the error caused by the computation operators has relatively little influence on the cost order of the storage structures. To be safe, we set a threshold $\epsilon$, e.g., 10%: if the best storage structure cannot exceed the predicted performance of the current storage structure by $\epsilon$, that storage structure will not be applied in the database.
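The $\epsilon$-threshold decision can be expressed as a one-line check; the 10% default simply mirrors the example above.

```python
def should_switch(current_cost, best_cost, epsilon=0.10):
    """Apply the recommended structure only if it beats the current one by more
    than epsilon, absorbing estimation error from differing computation operators."""
    return best_cost < current_cost * (1.0 - epsilon)
```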
IV. DATA LAYOUT RECOMMENDATION

In many studies of storage optimization [15], the data layout is one of the focuses of research. In our proposed automatic storage structure selection system, the data layout is also considered in order to better utilize the storage engine. In this section, we first introduce the types of storage models in Section IV-A, and then propose a novel column-oriented data layout algorithm in Section IV-B.
A. Storage Model and Workload
The storage model refers to the logical arrangement of data in storage, and its selection is closely related to the workload. There are three storage models in current database design: the N-ary Storage Model (NSM) for traditional and OLTP databases, the Decomposition Storage Model (DSM) for OLAP databases, and the Flexible Storage Model (FSM) [16] for hybrid workloads.
Fig. 6. Storage models. a) NSM, b) DSM, c) FSM inside an in-memory database, d) FSM inside disk storage.
In NSM, data is stored sequentially by rows, as shown in Fig. 6a. Typical OLTP queries such as point lookups and data insertion benefit from NSM because they access the fewest data blocks. In DSM, data is stored by cells that belong to the same column, as shown in Fig. 6b, and cells in different columns are stored in separate files. In DSM, single-row reads and writes become expensive, because these operations need to access multiple files to assemble the whole row, and the scattered data access pattern reduces data locality. In contrast, column-wise aggregation operations no longer need to access irrelevant cells, only the cells of the target column, which greatly reduces the amount of data to access. Besides, cells of the same type are stored together, which is suitable for compression and encoding. OLAP workloads generally do not include point lookups and updates, and analytical queries only group and aggregate the columns of interest, so they are suitable for the DSM storage model.

For a hybrid workload that contains both analytical and transactional queries, some studies combine the advantages of NSM and DSM into the Flexible Storage Model (FSM). The intuitive idea is to divide the columns into multiple column groups, as shown in Fig. 6c. Each column belongs to exactly one column group, which is easy to realize in an in-memory database: the in-memory database only needs to expand the pointer in the index into a list of pointers, each of which corresponds to one column group.

However, in disk storage implementations, another solution is used. In order to allow the storage engine to access the tuples in different column groups by the primary key, each column group holds a complete copy of the key fields, as shown in Fig. 6d. Under this scheme, an increase in the number of column groups leads to an increase in the redundancy of key fields. In FSM, it is necessary to balance the performance of transactional queries against the performance of representative analytical queries. For example, suppose we find some important analytical queries in the workload and hope to optimize them. If we divide the table into many column groups, these queries can access fewer columns to obtain their results; but since each column group keeps a copy of the keys, the storage contains a lot of redundancy, which reduces the performance of transactional queries.
B. Data Layout Recommendation Algorithm
The design of column groups is also called the data layout. Data layout design is the main problem in the study of FSM; only a good data layout can maximize the performance of an FSM database, and the best data layout depends on the workload. To judge whether a data layout is optimal, the cost of the data layout under the target workload must be evaluated. The solution space of the data layout problem is the Bell number of the column set (https://en.wikipedia.org/wiki/Bell_number), so it is not feasible to traverse all possible data layouts to obtain the optimal result. Therefore, heuristic or cost-based algorithms are needed, which search the solution space using the strategies mentioned in Section II.

Pruning:
Use frequent-item thresholds or clustering methods to reduce the huge number of queries in the workload to a few representative queries.
Candidate generation:
For each representative query, the candidate data layout is the set of column groups that lets the representative query achieve maximum performance. That is, all the columns accessed by the representative query are combined into one column group, and the unvisited columns are grouped separately. Data layouts obtained from different representative queries are then combined into a new data layout. For example, the columns accessed by the most important queries are grouped into the first column group; among the remaining columns, the columns accessed by the second most important queries are combined into the second column group, and so on.
Evaluation:
Evaluate the generated data layout candidates using the cost model. The candidate with the least cost is selected.

Among the data layouts generated by the above methods, each "primary" column group can always be traced back to a specific query (or representative query). This paper refers to these methods as query-oriented data layout recommendation algorithms.

In addition to the query-oriented data layout recommendation algorithm, this paper proposes a new perspective on the data layout problem: the column-oriented data layout recommendation algorithm. The main difference between the two kinds of algorithms is how candidates are generated. In the column-oriented algorithm, a candidate data layout is not regarded as a combination of the optimal data layouts of particular queries. Instead, the data layout is regarded as a judgment of the similarity between columns, and similar columns are clustered to obtain the candidate data layouts.
Example.
Consider the following queries: T is a table with 5 columns (a, b, c, d, e). Q1 visits three columns (a, b, c) of T, and Q2 visits three columns (c, d, e) of T. Using a query-oriented algorithm, the candidate data layouts obtained from Q1 and Q2 are either (a, b, c)(d, e) or (a, b)(c, d, e), because the result of Q1 and the result of Q2 are mutually exclusive and cannot be merged to generate new candidates.

In column-oriented algorithms, each column is represented as a vector consisting of query information. After pruning the queries, representative queries are obtained, and each representative query contains the collection of columns it needs to access. The vector of a column is composed of the representative queries accessing that column. In the previous example, the representative vectors of the columns are a = b = (1, 0), c = (1, 1), and d = e = (0, 1). The vectors represent the similarity between the columns during query access: columns that are close to each other are often accessed together by the queries in the workload. In this scenario, we get a new data layout (a, b)(c)(d, e), which is different from the result of the previous algorithm and is beneficial to both queries. The column vector is defined more precisely as follows:

$$c_{ij} = \sum_{q \in Q_j} w_q\, I(c_i, q)$$

Here $Q_j$ represents the $j$-th representative query (and the queries it represents), and $c_{ij}$ represents the $j$-th dimension of column $c_i$. When $c_i$ is a column accessed by $q$, $I(c_i, q) = 1$; otherwise $I(c_i, q) = 0$. If the representative query is the central query of a series of queries, then these queries share one dimension of the column vector, but each query can assign a different value to that dimension according to the specific columns it accesses. In addition, the formula weights the vector by the cost $w_q$ of the query, so that columns accessed by costly queries are grouped preferentially. It should be pointed out that if the value of one dimension is composed of multiple queries, the accuracy of the vector distance between columns depends on the quality of the clustering result: it is impossible to tell which queries contributed the value in that dimension, and if the values are contributed by dissimilar queries, the distance between different columns may be misjudged.

In some algorithms, special methods are used to make the calculation result drift toward recent queries. In the proposed algorithm, we use a time-dependent factor for this purpose, setting the weight of a query $q$ to $(1 - \alpha)^s w_q$. In this formula, $s$ indicates how old the query is; a larger value means the query occurred earlier. $\alpha$ is the attenuation coefficient; the larger its value, the less attention is paid to old queries.

After obtaining the representative vectors of the columns, we use the Euclidean distance to measure the similarity between different columns, and use clustering methods such as hierarchical clustering to generate column group candidates. The advantage of hierarchical clustering is that, for the example above, the result with cluster parameter 3 is exactly a split of one of the column groups obtained with cluster parameter 2. In this way, the maintenance cost caused by more column groups can be weighed against the additional data reads caused by fewer column groups.
Algorithm 1 DATA LAYOUT RECOMMENDATION
Input: Query set Q, table T
Output: Data layout candidates
Function DataLayoutRecommendation(Q, T):
  for col_i in T.col do
    col_i := zero vector
    for q_j in Q do
      if col_i in q_j.columns then
        col_ij := col_ij + q_j.cost
  C := LevelCluster(T.col)
  R := ∅
  for i in 1 .. len(T.col) do
    R := R ∪ C.ExtractCluster(i)
  return R

Since the performance of a data layout is related to the implementation of the specific storage engine, it is difficult to determine the number of column groups in advance. Therefore, the algorithm outputs all hierarchical clustering results, from one group (that is, NSM) to the maximum number of groups (that is, DSM), at the same time. The Cartesian product of the data layout result set and the storage engine set is taken as the candidate storage structure set, which is used for cost estimation. In the example presented in this section, assuming that Q1 has a greater weight than Q2, there will be 4 data layouts, namely (a, b, c, d, e), (a, b, c)(d, e), (a, b)(c)(d, e), and (a)(b)(c)(d)(e). If we have 3 storage engines to choose from, we get a total of 3 × 4 = 12 candidate storage structures. The final optimal storage structure is estimated and ranked among these candidates according to the cost model.
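A sketch of the column-oriented recommendation using SciPy's hierarchical clustering is shown below; the input format of the representative queries (a dict with columns, cost, and age s) is our assumption, not an interface defined by the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def data_layout_candidates(columns, rep_queries, alpha=0.1):
    """Column-oriented layout recommendation (a sketch of Algorithm 1).
    Returns one candidate per cluster count, from NSM (one group) down to
    DSM (one column per group)."""
    # build one vector per column: dimension j holds the decayed cost of
    # representative query j if that query touches the column, otherwise 0
    vec = np.zeros((len(columns), len(rep_queries)))
    for j, q in enumerate(rep_queries):
        weight = (1.0 - alpha) ** q["s"] * q["cost"]
        for i, col in enumerate(columns):
            if col in q["columns"]:
                vec[i, j] = weight
    Z = linkage(vec, method="ward")                   # hierarchical clustering on column vectors
    candidates = []
    for k in range(1, len(columns) + 1):              # k = 1 is NSM, k = len(columns) is DSM
        labels = fcluster(Z, t=k, criterion="maxclust")
        groups = {}
        for col, lab in zip(columns, labels):
            groups.setdefault(lab, []).append(col)
        candidates.append(list(groups.values()))
    return candidates

# e.g. data_layout_candidates(list("abcde"),
#          [{"columns": {"a", "b", "c"}, "cost": 2.0, "s": 0},
#           {"columns": {"c", "d", "e"}, "cost": 1.0, "s": 0}])
# yields candidates such as (a,b,c,d,e), (a,b,c)(d,e), (a,b)(c)(d,e), ..., (a)(b)(c)(d)(e).
```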
V. EXPERIMENTAL EVALUATION
A. Experimental configuration
The evaluation consists of three parts. The first part tests the storage engine cost model proposed in Section III. The second part tests the data layout recommendation algorithm proposed in Section IV. The third part evaluates the quality of the recommended storage structures.

Experiments run on a personal computer with an Intel Core i7-4790 processor (3.6 GHz × 8), 16 GB of RAM, and a 7200 RPM hard drive.

The goal of the experiments is to optimize the storage structure of the LineItem table in the TPC-H benchmark. To run the experiments, we build the whole system. Our system uses two well-known storage engines: RocksDB, with the LSM tree as its storage model, and WiredTiger, with the B+ tree as its major storage model. Both storage engines have very strong performance and are used in many applications. WiredTiger is MongoDB's default storage engine and can be used independently. It also supports columnar storage, which is considered as another storage engine in the experiment, meaning that the system has three different storage engines to choose from.

TABLE II
COLUMNS ACCESSED IN DIFFERENT TEST QUERIES

Query  Columns Accessed
Q3     K1, V2, V3, V7
Q5     K1, K3, V2, V3
WQ     K1-K4, V1-V12
RQ1    K1-K4, V1-V12
RQ2    K1-K3, V1-V3, V5-V8, V11

Since the purpose of the system is to recommend storage structures under hybrid workloads, we design workloads on the LineItem table. They contain two analytical queries and three custom transactional queries. The analytical queries are the standard queries Q3 and Q5 of the TPC-H benchmark. The transactional queries consist of a single-row write query and two single-row read queries. We use K1-K4 to denote the first 4 columns (the primary key) of the LineItem schema, and V1-V12 to denote the last 12 columns of LineItem. The columns accessed by each query are shown in Table II.

We designed four workloads based on the above queries, namely a transactional workload, a transactional mixed workload, an analytical mixed workload, and an analytical workload. The details of the workloads are shown in Table IV.
B. Storage Engine Cost Model
The proposed system is based on the learned cost model, whose recommendation results depend on the accuracy of the cost model in estimating the workload. We also need to find out whether the features of Section III are sufficient for cost prediction, and whether the learned cost model has enough generalization capability. To this end, we use a cost model that only learns from the performance of generated data schemas to make predictions on the performance of LineItem data.

We performed a series of data read and write operations following the performance test described in Section III-A. For convenience of observation, we draw scatter plots with the accessed rows per operation as the x-axis and the elapsed time per row as the y-axis, since the accessed rows per operation are the most important factor for the elapsed time. The results of the experiment are shown in Fig. 7. The green dots in the insets represent the actual time of the operations, and the blue dots represent the predicted time. The shapes of the blue dots and green dots basically overlap, indicating that the learned cost model describes well the relationship between the elapsed time per row and the accessed rows per operation. Moreover, the shape of the blue dots falls within the shape of the green dots, which means that although the model cannot produce exact time predictions, it captures the average behavior. These results demonstrate that the data schemas and workloads generated by the benchmark algorithm provide sufficient features to generalize the cost model.
Fig. 7. Actual elapsed time vs. predicted time. Insets a), b), c), d) use the model for read operations while e), f), g), h) use the model for write operations. Insets a), b), e), f) use WiredTiger as the storage engine while c), d), g), h) use RocksDB. Insets a), c), e), g) use sequential insertion while b), d), f), h) use random insertion.

Although the cost model learned without internal knowledge of the storage engine cannot accurately predict the cost of a single operation, it can represent the average time of the workload.

We also report the mean absolute error of operation prediction and the relative error of workload prediction (using the benchmark workload) in Table III. The relative error of the workload is defined as the relative error between the time to perform the benchmark test (excluding the waiting time) and the predicted time. The experimental results corroborate the above conclusion: the cost model may contain large errors when predicting a single operation, but when making an overall prediction of the workload, its error is no more than 10%.
TABLE III
PREDICTION ERROR FOR DIFFERENT OPERATIONS

Operation of Storage      Mean absolute error for operation prediction (µs)   Relative error for workload prediction (%)
B+ / read / sequential    2.43    9.93
B+ / read / random        14.40   6.81
B+ / write / sequential   0.9     2.75
B+ / write / random       3.42    0.08
LSM / read / sequential   2.22    6.35
LSM / read / random       7.13    2.11
LSM / write / sequential  1.48    7.63
LSM / write / random      1.41    4.01
TABLE IV
DATA LAYOUT RECOMMENDATION OF DIFFERENT ALGORITHMS

Transactional workload (9M, 10k, 10k, 0, 0), s = 0
  Query-oriented Algorithm (Peloton): NSM
  Column-oriented Algorithm (Proposed): Best: NSM; Second best: (V1-V3, V5-V8, V11)(V4, V9, V10, V12)
Transactional mixed workload (3M, 30k, 30k, 5, 5), s = 9M
  Query-oriented Algorithm (Peloton): NSM
  Column-oriented Algorithm (Proposed): Best: NSM; Second best: (V1, V4-V6, V8-V12)(V2, V3, V7)
Analytical mixed workload (0, 10k, 10k, 15, 15), s = 12M
  Query-oriented Algorithm (Peloton): (V1, V4-V6, V8-V12)(V2, V3, V7)
  Column-oriented Algorithm (Proposed): Best: (V1, V4-V6, V8-V12)(V2, V3, V7); Second best: NSM
Analytical workload (0, 0, 0, 30, 30), s = 12M
  Query-oriented Algorithm (Peloton): (V1, V4-V6, V8-V12)(V2, V3, V7)
  Column-oriented Algorithm (Proposed): Best: DSM; Second best: (V1, V4-V6, V8-V12)(V2, V3, V7)

C. Data Layout Recommendation
This section examines whether the column-oriented algorithm can produce data layout recommendations as good as those of the query-oriented algorithm. We implemented a typical query-oriented data layout recommendation algorithm, the one proposed in Peloton [16], to compare with our proposed algorithm. In the experiment, we run the algorithms with four types of workloads.

The comparison results are shown in Table IV. The vectors in the workload characteristics represent the number of executions of the corresponding queries. For example, the transactional workload includes 9M data insertions, 10k RQ1 queries, and 10k RQ2 queries. s indicates the size of the table before the workload is executed. For each workload, the Peloton algorithm gives its optimal data layout design. Since our proposed algorithm is designed to be evaluated together with the storage engine, its results include the storage engines (omitted in the table), and only the best two data layouts are shown for convenience.

In Table IV, we find that the results obtained by the Peloton algorithm are included in the results of our proposed algorithm. This shows that the proposed column-oriented data layout algorithm can produce similar results to the query-oriented data layout algorithm.
Fig. 8. Performance comparison under different storage structures.

Note that since the storage engines participate in the data layout cost evaluation, the proposed algorithm can obtain the DSM solution for the analytical workload, which is not easy to obtain with query-oriented data layout algorithms. This also shows that the selection of the data layout is closely related to the storage engine implementation, so the two should not be designed independently.
D. Storage Structure Selection
Finally, we evaluate the storage structures recommended by the proposed system. We choose the RocksDB storage engine with NSM as the default storage structure design, and discuss the impact of different storage structures on system performance under the four workloads. We use the default design as the baseline to normalize the time used by the recommended storage structures. We also record the storage structure obtained with a fixed storage engine, to explore the importance of storage engine selection within the storage structure. The results are shown in Fig. 8.

Since the default storage structure uses NSM, optimizing only the data layout has a significant impact only on workloads with a large number of analytical queries; that is, as the proportion of analytical queries increases, the recommended data layout executes the workload faster. The results also show that if the storage engine and the data layout are changed at the same time, the workload executes faster than when only the data layout is changed. According to the results of Section V-C, the recommended data layouts differ under different storage engines, which means that the data layout and the storage engine need to be decided together in storage structure recommendation. Changing only the data layout does not take full advantage of the possible changes in the storage structure.

We also observe that the optimized storage structure achieves a large performance improvement only in the purely transactional and purely analytical workloads. The underlying reason is that neither WiredTiger nor RocksDB is designed for mixed workloads, and the upper limit of system performance is determined by the upper limit of these two storage engines. Besides, the proposed system uses only one storage engine in one partition, so it naturally performs worse on a mixed workload; the data partition is still affected by the idea that one size does not fit all.

In hybrid workloads, although there are different types of queries, the data accessed by these queries may be different.
Fig. 9. Different types of hybrid workloads.

In the scenario where an OLAP system and an OLTP system are used together, the data cannot be accessed across systems due to their independence, as shown in Fig. 9a. In the proposed system, however, the transactional queries and the analytical queries are allowed to intersect on the data they access, as shown in Fig. 9b. Data outside the intersection only needs to process a single type of query and can benefit from the storage structure. This means that after horizontal partitioning of the data for such workloads, only a single type of workload exists in some data partitions, which can benefit from the proposed system. For workloads with large intersections, as shown in Fig. 9c, the proposed system may not serve well due to the theoretical limitations of storage.

In either case, the storage structure selection algorithm proposed in this paper can automatically select different storage structures for different partitions. The proposed system thus automatically completes all the work related to HTAP storage design, including choosing the storage engine and data layout for every partition in the database, and performing the conversion of the storage structure when the workload changes.
VI. CONCLUSION
This paper proposes a system that automatically selects the storage structure, using the theory of self-driving databases to solve the problems described above. We integrate all the modules into one system, realize the mechanism of data conversion between different storage structures, and avoid cross-system data ETL. The proposed system can dynamically analyze the workload and recommend the storage engine and data layout that are most suitable for the current workload. The problem of workload shift can also be solved by the proposed system.

We develop solutions to several key issues in the design of the system. We establish a storage engine cost model that does not depend on internal knowledge of the storage engine. We propose a benchmarking technique to provide training data for the storage engine cost model, and use CPU-friendly learning algorithms suited to common commercial servers without special hardware such as large memory or GPUs. We regard the storage engine and the data layout as a whole and propose a novel data layout recommendation algorithm to make full use of the features of different storage engines.

We build a prototype system for the experiments. We show that the proposed system can automatically select the appropriate storage engine and data layout according to the current workload. Limited by "one size does not fit all", mixed-workload performance under a single data partition cannot be significantly improved. However, compared with an HTAP system composed of separate databases, our system can automatically select the storage structure, automatically convert the data, and adapt to changing workloads. The system proposed in this paper is relatively preliminary, and there is still much room for improvement. If the space cost of the storage structure were considered together, the system would be able to produce more appropriate recommendations. In addition, the current cost model cannot predict the concurrent performance of queries well, which can also be considered in future work.
REFERENCES

[1] F. Özcan, Y. Tian, and P. Tözün, "Hybrid transactional/analytical processing: A survey," in Proceedings of the 2017 ACM SIGMOD, 2017.
[2] M. Stonebraker and U. Çetintemel, ""One size fits all": An idea whose time has come and gone," in Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker, 2018.
[3] D. J. Abadi, S. R. Madden, and N. Hachem, "Column-stores vs. row-stores: How different are they really?," in Proceedings of the 2008 ACM SIGMOD, 2008.
[4] N. Dayan and S. Idreos, "Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging," in Proceedings of the 2018 ACM SIGMOD, 2018.
[5] J. Lu, Y. Chen, H. Herodotou, and S. Babu, "Speedup your analytics: Automatic parameter tuning for databases and big data systems," Proceedings of the VLDB Endowment, vol. 12, no. 12, pp. 1970-1973, 2019.
[6] M. Grund, J. Krüger, H. Plattner, A. Zeier, P. Cudre-Mauroux, and S. Madden, "HYRISE: A main memory hybrid storage engine," Proceedings of the VLDB Endowment, vol. 4, no. 2, 2010.
[7] R. Appuswamy, M. Karpathiotakis, D. Porobic, and A. Ailamaki, "The case for heterogeneous HTAP," in CIDR, 2017.
[8] T. Kraska, M. Alizadeh, A. Beutel, E. H. Chi, J. Ding, A. Kristo, G. Leclerc, S. Madden, H. Mao, and V. Nathan, "SageDB: A learned database system," 2019.
[9] S. Chaudhuri and V. Narasayya, "Self-tuning database systems: A decade of progress," in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007.
[10] S. Chaudhuri and V. Narasayya, "AutoAdmin "what-if" index analysis utility," ACM SIGMOD Record, vol. 27, no. 2, 1998.
[11] A. Rasin and S. Zdonik, "An automatic physical design tool for clustered column-stores," in Proceedings of the 16th International Conference on Extending Database Technology, 2013.
[12] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson, "Predicting multiple metrics for queries: Better decisions enabled by machine learning," in Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE), 2009.
[14] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD, 2016.
[15] I. Alagiannis, S. Idreos, and A. Ailamaki, "H2O: A hands-free adaptive store," in Proceedings of the 2014 ACM SIGMOD, 2014.
[16] J. Arulraj, A. Pavlo, and P. Menon, "Bridging the archipelago between row-stores and column-stores for hybrid workloads," in Proceedings of the 2016 ACM SIGMOD, 2016.