Automatic Generation of a Hybrid Query Execution Engine
AAutomatic Generation of a Hybrid Query Execution Engine
Aleksei KashubaVrije Universiteit AmsterdamAmsterdam, The Netherlands [email protected]
Hannes M¨uhleisenCentrum Wiskunde & InformaticaAmsterdam, The Netherlands [email protected]
Abstract — The ever-increasing need for fast data processingdemands new methods for efficient query execution. Just-in-time query compilation techniques have been demonstrated toimprove performance in a set of analytical tasks significantly. Inthis work, we investigate the possibility of adding this approachto existing database solutions and the benefits it provides. Tothat end, we create a set of automated tools to create a runtimecode generation engine and integrate such an engine into SQLitewhich is one of the most popular relational databases in theworld and is used in a large variety of contexts. Speedups ofup to 1.7x were observed in microbenchmarks with queriesinvolving a large number of operations.
I. INTRODUCTIONQuery processing is an essential part of any databasemanagement system. The traditional way of executing aSQL query is to convert it into a query plan consistingof relational algebra operators, which communicate witheach other through an iterator interface. It is a simple andextensible design which allows for various optimizations atthe query plan level and became known as Volcano-stylequery processing [1]. However, this method was designedwhen I/O performance was the main issue, and it is thereforenot optimized for CPU performance. In the last decade,faster and cheaper main memory has made the Volcano-stylequery execution model less attractive for modern databasemanagement systems, which prompted the development ofnew optimizations for query processing one of which iscalled just-in-time compilation.The fundamental principle of JIT compilation is to gener-ate code on the fly once the query is known and then compileit into native machine instructions. Using the informationavailable at runtime allows to tailor the algebraic operatorsfor the specific query and eliminate the interpretation over-head. Additional improvements come from the optimizationsperformed by the compiler [2]. The method has proved to beeffective and was adopted in both commercial (Hekaton [3])and research projects (HIQUE[4], HyPer [5], Peloton [6]).One of the approaches to JIT compilation named “holisticquery evaluation” is to generate a source file implementingthe functionality of the query using a set of templates,compile the code and dynamically link it with the databaseserver [4].The idea of runtime code generation for query execution isnot new and was part of IBM System R developed in 1970’swhere a procedure to execute the query was assembled fromcomponents consisting from machine language instructions[7]. Still, the successor of System R did not utilize thetechnique because of the impracticality of maintaining a code generation engine for a machine language, e.g., porting it toa different operating system [2]. Modern compilers, such asLLVM, use an intermediate representation as a lightweightabstraction over machine code. Writing a code generator foran IR is simpler than writing a code generator for a machinelanguage but is still time-consuming and challenging tomaintain. Writing a query interpreter is more straightforward,portable and allows to reduce latency for short runningqueries if a cost model estimates the compilation time tobe too high. Therefore, we investigate the possibility ofautomatically deriving a code generation engine from theinterpreter. Such an approach could reduce the workloadassociated with maintaining a code generation engine whilestill providing significant performance improvements.The contributions of this paper are the following. • We present a template-based code generation engine forthe popular embedded DBMS SQLite [8] and describea method for automatic extraction of operator templatesdirectly from the source code of the query interpreter. • We design a microbenchmark and a set of experimentsto investigate the performance of the JIT query execu-tion engine in relation to the complexity of a query andits selectivity. The new system provides a speedup ofup to 1.7x in complex queries. • The JIT system is compared to the standard SQLite aswell as a second baseline that implements a techniquefor the reduction of the interpretation overhead withoutcompilation called “direct threading”.The rest of the paper is organized as follows. Section 2summarizes the relevant parts of the internal organizationof SQLite and the direct threading technique. Section 3describes how the JIT system functions and the algorithmfor automatic template generation. Section 4 presents theperformed experiments and their results. Section 5 reviewsrelevant work. Section 6 describes future work and con-cludes. II. BACKGROUND INFORMATIONSQLite somewhat departs from the traditional Volcano-style query execution. Instead of linking algebraic operatorsbetween each other with indirect function calls it creates aquery execution plan in the form of its custom bytecode.The bytecode program and other necessary information suchas open cursors (pointers into a table) are contained in aninstance of a “Virtual Database Engine” (VDBE) [9]. Theinterpreter contains a switch statement which determinesthe code that needs to be executed based on the opcode of a r X i v : . [ c s . D B ] A ug int sqlite3VdbeExec( Vdbe *p // The VDBE ){ // The array of operations Op *aOp = p->aOp; // Current operation Op *pOp = aOp; ... for (pOp=&aOp[p->pc]; 1; pOp++){ switch ( pOp->opcode ){ case OP_Goto: { jump_to_p2: pOp = &aOp[pOp->p2 - 1]; break ; } case OP_Ge: { ... break ; } } } } Fig. 1: Opcode interpreter (simplified)the current instruction. In the remainder of the paper, werefer to the code that is executed when a particular case isencountered as a “case block”.Each SQLite instruction consists of an opcode and upto three operands. The first and the third operands usuallycontain the addresses of virtual registers, the contents ofwhich should be processed. The second operand is usuallythe next program counter for a jump. For each producedresult a callback function is invoked, and the processing ofthe opcode continues until the end of the program.Figure 1 shows the simplified source code of the in-terpreter function in SQLite. For each invocation of thefunction, the pointer to the correct operation is set at thebeginning of the for-loop by accessing the array of opcodesat the position specified by the virtual program counter (Line7). Then the opcode is decoded in the switch statement,and the case block corresponding to it is executed. After a break statement in the case block, the pointer to the currentoperation is incremented to point to the next opcode. Tomove execution of the bytecode to a different point in theprogram, opcodes perform a jump to jump to p2 label,where the current operation pointer is set to point to theopcode at the position stored in p2 . To stop the execution, anopcode performs a jump to a label, such as vdbe return or abort due to error which handle the release of theresources.Figure 2 illustrates the bytecode for a simple query execu-tion plan. The first instruction sets up some internal valuesand jumps to the position specified in its p2 operand to Line9. After starting a transaction, an integer with the value from p1 (20) is stored in a virtual register at position p2 (2). Afterthat, the program counter moves to Line 1, where a newcursor is obtained. The cursor is set to point to the first rowat Line 2. Line 3 marks the beginning of the main loop of the Query:
SELECT i FROM test
WHERE i<20;
Addr Opcode p1 p2 p3 Control flow0
Init 0 9 0 OpenRead 0 2 0 Rewind 0 8 0 Column 0 0 1 Ge 2 7 1 Copy 1 3 0 ResultRow 3 1 0 Next 0 3 0 Halt 0 0 0 Transaction 0 0 1 Integer 20 2 0 Goto 0 1 0
Fig. 2: A simple query execution plan (abridged)query.
Column retrieves the value from p2 -th column andstores in p3 . Ge jumps to position p2 if the value of register p1 is greater than or equal to the value of p3 . Otherwise,it proceeds to Copy and to
ResultRow , the latter sets thevirtual program counter to the correct value to be restoredlater and returns the return code. The interpretation of thebytecode continues from the last position after the callbackfunction returns.
Next moves the table cursor to the nextposition and moves the execution to the start of the loop onLine 3.
A. Direct threading
Many implementations of interpreted programming lan-guages, for instance, CPython, use “direct threading”. Thistechnique eliminates the switch statement by using thecompiler extension first introduced by GCC but now alsosupported by other compilers known as “Labels as Values”.This extension allows goto statements to jump to labels thatare calculated at runtime. By using this technique, the boundschecking done by the switch statement is removed, whichdecreases the number of instructions executed per one virtualinstruction. Typically, a switch statement is implementedby using a jump table. However, the offset to the jumptable is calculated in only one place of the program. It iscommonly believed that branch prediction can be improvedby calculating the offset to the jump table in each opcodewhich is how it is done in direct threading [10].III. JIT ENGINE OVERVIEWThe JIT engine consists of three main components: a loopdetector, a template generator, and a code emitter. Figure 3provides a top-level overview of the whole system. Duringbuild time the template generator extracts C macro templatesfrom the SQLite source code that can be dispatched at
IT Engine
Backend BytecodeCompilerUser API Tokenizer Parser OpcodeGenerator Virtual Machine Code Emitter LoopDetection OS Interface Pager B-treeTemplateGenerator
Fig. 3: System overviewruntime by the code emitter. The loop detector keeps track ofthe instructions executed by the VDBE and invokes the codeemitter once a loop has executed enough times. The codeemitter composes and compiles a new function implementingthe functionality of the loop.
A. Runtime code execution
During the interpretation of the query execution plan, theJIT engine records the number of times a loop is executed.A loop is detected when the program counter moves tothe position in the program which precedes the currentone. At that point, a counter contained in the operation isincremented. The position from which the jump happenedis also recorded. When a counter becomes greater than thespecified threshold, a “hot loop” is detected, and the codegeneration procedure is called. Hot loop detection allowsreducing compilation times by only compiling the parts ofthe query plan that are run repeatedly. The code emittergenerates a temporary file containing the definition of thefunction that will implement the functionality of the loop.The emitter iterates over the instructions in the loop and addsthe corresponding macro invocations to the function body.The file is then compiled to a shared object which is linkedwith the SQLite library. The new object is loaded, and thefunction pointer is recorded in the operation structure. Nowwhenever the interpreter reaches the beginning of the hotloop, it calls the compiled function.The runtime code compilation relies on the existenceof macro templates implementing the functionality of theinstructions contained in a hot loop. The structure of theinterpreter facilitates the creation of such templates. After case OP_Ge: { pIn3 = &aMem[pOp->p3]; pIn1 = &aMem[pOp->p1]; if (pIn3->u.i >= pIn1->u.i){ goto jump_to_p2; } break ; } (a) A simplified opcode implementation in a case block1 GE_TEMPL(pos, next, P1, P3, P2) \ L pOp = &aOp[pos]; \ pIn3 = &aMem[P3]; \ pIn1 = &aMem[P1]; \ if (pIn3->u.i >= pIn1->u.i){ \ goto L } \ goto next; \ } (b) A generated instruction template Fig. 4: Transformation of an opcodesome modifications, each case block can be converted intoan opcode template. However, the number of the requiredtemplates make this process tedious which creates a needfor automation. Additionally, automating this process allowsonly to maintain the codebase of the original interpreter. Thefollowing section describes how the templates are generated.
B. Instruction template generation
To prepare the source code for template generation thefile containing the implementation of the VDBE is prepro-cessed using dummy system library header files, and the noinline compiler directives are removed. The preprocessingreduces the bloat by removing the unnecessary code ofsystem libraries and allows for the file to be efficientlyparsed into an abstract syntax tree (AST) representationwith pycparser [11]. Next, the case blocks of the opcodeimplementation being generated are located in the AST bytraversing it in the depth-first order. At this point, the firstparameter of the macro is created, which is the position ofthe instruction in the program. The position is used to createa label by which the invocation of the template is addressed.SQLite extensively uses goto statements which need tobe modified for the future compiled functions to work cor-rectly. The goto statements that are used for error handlingare converted into return statements, which moves the burdenof resource deallocation to the interpreter. The other crucialtype of goto statements in SQLite is goto jump to p2 .The section of the code at that label is responsible forchanging the value of the program counter to the appropriateposition. The jump to p2 label is therefore replaced bythe label of the template invocation corresponding to theinstruction at the address contained in the p2 register. Therest of the goto statements are made local to the macroinvocation by appending the position of the instruction tohe labels. The break statement in the case block is alsoreplaced by a goto statement with the label addressing thetemplate invocation of the next instruction in the sequence.See Fig. 4 for a simplified example of a conversion. In Fig.4a Line 3 sets the current operation to the correct value,Line 7 contains a goto statement which is completed atcode generation time with the value of p2 .Additionally, some of the constants that are known atcompile time are replaced and added to the parameters ofthe macro. The constants are values of the first three virtualregisters and the opcode value. The last is of particularinterest because it allows for some opcode specialization. Insome cases, multiple opcodes are implemented in one caseblock using the fall through mechanism. The specifics ofthe implementation are determined based on the value of theopcode. By inserting the value of the opcode at compile time,we can force the compiler to remove the redundant code dur-ing compilation. Finally, a statement which sets the pointerto the current operation structure is added to the beginningof the block. After the macro has been generated a functionwhich writes the name of the macro to the temporary filewith the appropriate parameters is automatically added tothe emitter function.IV. EXPERIMENTAL STUDYThe primary goal of the experimental study was to evaluatethe potential benefits of the system using JIT compilationand measure the effect of reducing interpreter overhead onquery execution time. The number of operations that have tobe executed for each tuple heavily influences the interpreteroverhead. Therefore, to achieve our goal we designed amicrobenchmarking program that can generate a query withan arbitrary number of instructions in a hot loop and measurethe relation between CPU execution time and the number ofoperations in a loop. Experiment A compares the perfor-mance of the unmodified version of SQLite, a version thatuses JIT compilation and a version that was modified to usea common technique to reduce interpretation overhead thatdoes not involve compilation. Additionally, to assert that JITcompilation is not only beneficial in one particular context,in experiment B we measure the effect of query selectivity onthe performance of the systems. For experiment C the opcodeimplementations were manually specialized for the datatypethat is used in the microbenchmark to estimate the potentialbenefits of automatic opcode specialization. All tests wereconducted on a system with Intel Xeon W-2145 CPU and 128GB of physical memory running Fedora 26 (kernel 4.14.11).The version of SQLite used is 3.23.0 which was compiled byGCC version 7.3.1 with the -O3 flag . The code generatedby JIT was compiled with the -O2 flag because it providedthe best balance between compilation speed and executiontime. A separate process with the SQLite shell was createdfor every query execution. Each measurement is an averageover five runs. The source code is available on GitHub: https://github.com/AlexKashuba/SQLite_JIT
Number of jitted operations C P U T i m e ( s ) Baseline (slope=0.185)JIT (slope=0.104)Threaded (slope=0.162)Compilation (slope=0.01)
Fig. 5: Experiment A. Comparison of performance on querieswith no results
SELECT i FROM test
WHERE (i<1
AND i>6) OR (i<101 AND i>106) OR (i<201 AND i>206);
Fig. 6: Example of a test queryJIT techniques are most useful when the number of opera-tions to interpret grows large. To test how the performance ofthe systems depends on the number of instructions involvedin the execution plan we generate queries with a numberof conditions in the where clause. Since SQLite invokesa callback function for each produced result, which mayaffect performance, we add conditions that cannot changethe number of rows produced by the query. To increase thenumber of operations in the loop, the test generator addsconjunction of comparisons that will always evaluate to false,such as (i < 200 and i > 300) . The SQLite queryoptimizer is not able to detect the optimization opportunityand eliminate the comparisons. The pairs of comparisonsare connected with disjunction. See an example query inFig. 6. The queries were executed on a dataset consistingof a hundred million random numbers, and the table did notinclude an index or a primary key to force a full table scan.
A. Number of operations and system performance
Experiment A measures the increase in CPU processingtime in relation to the number of operations in the main loopof the query plan. The queries included up to sixty opera-tions. For comparison, TPC-H Q5 includes a total of forty-four instructions in its three loops. In order not to measurethe overhead associated with materializing the results thequeries involved in experiment A produce an empty set. Theexperiment compares the standard version of SQLite as theprimary baseline, the version with the JIT engine, and thesecond “threaded” baseline version. The third version useddirect threading and was created to see how much of theperformance improvement could be attributed to eliminatingthe overhead of the bytecode interpreter compared to otherig. 7: Experiment B. Comparison of performance on querieswith various degrees of selectivitypotential effects such as improved cache access patterns.Direct threading is a technique that eliminates the switch statement by using the compiler extension that allows theuse of computed goto statements. It is discussed in furtherdetail in the background section of the paper.The results of the evaluation are presented in Figure 5. TheJIT version consistently outperforms both baselines with themargin growing with the number of operations reaching themaximum speedup of 1.72x. The time required to compilethe function produced by the JIT engine is also displayed.The compilation time does not appear to be a significantslowdown even for complex queries.Even though it has been argued that modern branch predic-tors do not require direct threading and can correctly predictthe result of the offset calculation used in a switch state-ment [12], the new threaded baseline succeeds in reducingthe interpreter overhead and demonstrates better performancethan the original SQLite. However, it is unclear whether theimprovement should be attributed to the improved branchprediction or other effects.
B. Effects of query selectivity
Whenever SQLite finds a tuple satisfying the selectioncondition, a callback function is executed. Experiment Bevaluates how the number of produced rows affects query ex-ecution time to find out whether query selectivity influencesthe JIT version more than the standard one. The query usedin the experiment contains 65 operations in the main loopand is produced by the microbenchmark with an additionalcondition such as i<2000 that evaluates to true in a subsetof rows. For each measurement, the newly added condition istweaked to include more rows and the execution CPU timeis recorded. The results of the experiment can be seen inFigure 7. Query selectivity does affect the JIT version morecompared to the baselines. The query processing time in theJIT version grows 2.21 times faster than in the standard onerendering the benefits of JIT negligible when more than sixtypercent of the rows are selected. This effect can potentially
Number of jitted operations C P U T i m e ( s ) Baseline (slope=0.153)JIT (slope=0.013)
Fig. 8: Experiment C. Comparison of performance usingspecialized opcode implementationsbe explained by the increase in the number of indirect callsto the function implementing the functionality of the hot loopof the query. If that is the case, the overhead of the indirectcall could potentially be amortized over multiple rows bybuffering the results instead of exiting the JIT function eachtime a new result is produced.
C. Manual opcode implementation specialization
SQLite is a dynamically typed system, and the opcodeimplementations have to perform many type checks for everytuple to execute their functions, which results in inefficientexecution. By using the information available at runtime,such as the table schema, a JIT engine could generate codethat is specialized for a particular query. To investigatethe potential benefits of opcode specialization we manuallyspecialized the opcodes used in comparisons to processinteger values and added a simple fallback condition thatstops the execution in case the type of the attribute is notthe same as the expected one.Figure 8 shows the results of the experiment. The JITversion performs remarkably well as the execution timegrows extremely slowly with the number of the operations.The new system achieves the maximum speedup of 4.17xover the baseline. V. RELATED WORKThe HIQUE system is one of seminal works in JITcompilation for query execution [4]. The authors presenteda system that translated SQL to C using a collection oftemplates and exhibited remarkable performance signifi-cantly outperforming the competition. However, the queryexecution engine of HIQUE was created from scratch and didnot integrate into an existing system. The templates HIQUEuses were therefore manually written rather than generatedfrom source code.SQPyte is a partial reimplementation of the SQLite VDBEusing the RPython framework, which can derive a JIT com-piler from a description of an interpreter [13]. This approacho the automatic creation of a JIT engine also allows devel-oping and maintaining only the interpreter. However, on theTPC-H benchmark suite, SQPyte performed only marginallybetter than stock SQLite providing a 2.2% improvement.The authors of [14] draw on the idea of Futamura pro-jections [15], i.e., viewing compilation as the specializationof an interpreter, to produce an efficient query compiler.Through the use of LMS [16], a generative programmingframework, a query interpreter is modified to symbolicallyexecute a query plan while also producing C code. However,the authors show that in a single compiler pass this methodcan only generate efficient code for data-centric style inter-preters and involves modifications to the interpreter code forsome optimizations.In [17] the authors also make use of the idea of Futamuraprojections and propose to separate the development ofthe algorithmic part of the query execution system fromthe methods that reduce the interpretation overhead. Theydevelop a query compiler for PostgreSQL, a DBMS whichimplements Volcano-style query execution. The system au-tomatically finds variables and instructions that store orcompute static data in the LLVM IR of the execution engine.The technique allows the system to combine the code of theinterpreter with the data available at runtime to automaticallyproduce an interpreter specialized for a particular query.VI. CONCLUSIONSIn this work, we demonstrate an implementation of JITcompilation integrated into SQLite and a method for auto-matic generation of the templates used for code generation.The experimental study shows the interpreter overhead tobe a significant bottleneck in the execution time of complexqueries in SQLite. The JIT system substantially improves theperformance of query evaluation as compared to the standardversion of SQLite as well as a version implementing directthreading. However, due to the design of SQLite where acallback is executed for each produced result, the benefitsdiminish when the selectivity of a query becomes large. Thisproblem could be remedied in future work by making the JITsystem produce results in batches rather than individually.Lastly, the results of manual opcode specialization show agreat potential of JIT compilation in the reduction of queryexecution time. In future work, the template generation couldbe improved by generating versions of opcodes specializedfor a particular type. This would additionally require addingautomatic generation and insertion of fallback conditions intothe templates. R
EFERENCES[1] G. Graefe and W. J. McKenna, “The Volcano optimizer generator:Extensibility and efficient search,” in
Data Engineering, 1993. Pro-ceedings. Ninth International Conference on . IEEE, 1993, pp. 209–218.[2] S. D. Viglas, “Just-in-time compilation for SQL query processing,” in
Data Engineering (ICDE), 2014 IEEE 30th International Conferenceon . IEEE, 2014, pp. 1298–1301.[3] C. Freedman, E. Ismert, and P.- ˚A. Larson, “Compilation in theMicrosoft SQL Server Hekaton engine.”
IEEE Data Eng. Bull. , vol. 37,no. 1, pp. 22–30, 2014. [4] K. Krikellas, S. D. Viglas, and M. Cintra, “Generating code for holisticquery evaluation,” in
Data Engineering (ICDE), 2010 IEEE 26thInternational Conference on . IEEE, 2010, pp. 613–624.[5] T. Neumann, “Efficiently compiling efficient query plans for modernhardware,”
Proceedings of the VLDB Endowment , vol. 4, no. 9, pp.539–550, 2011.[6] P. Menon, T. C. Mowry, and A. Pavlo, “Relaxed operator fusion for in-memory databases: Making compilation, vectorization, and prefetchingwork together at last,”
Proceedings of the VLDB Endowment , vol. 11,no. 1, pp. 1–13, 2017.[7] D. D. Chamberlin, M. M. Astrahan, M. W. Blasgen, J. N. Gray, W. F.King, B. G. Lindsay, R. Lorie, J. W. Mehl, T. G. Price, F. Putzolu et al. , “A history and evaluation of System R,”
Communications ofthe ACM
Journal of Instruction-Level Parallelism , vol. 5, pp. 1–25,2003.[11] E. Bendersky et al. , “Pycparser.” [Online]. Available: https://github.com/eliben/pycparser[12] E. Rohou, B. N. Swamy, and A. Seznec, “Branch prediction and theperformance of interpreters don’t trust folklore,” in
Code Generationand Optimization (CGO), 2015 IEEE/ACM International Symposiumon . IEEE, 2015, pp. 103–114.[13] C. F. Bolz, D. Kurilova, and L. Tratt, “Making an embedded DBMSJIT-friendly,” in , 2016.[14] R. Y. Tahboub, G. M. Essertel, and T. Rompf, “How to architect aquery compiler, revisited,” in
Proceedings of the 2018 InternationalConference on Management of Data . ACM, 2018, pp. 307–322.[15] Y. Futamura, “Partial evaluation of computation process–an approachto a compiler-compiler,”
Higher-Order and Symbolic Computation ,vol. 12, no. 4, pp. 381–391, 1999.[16] T. Rompf and M. Odersky, “Lightweight modular staging: a pragmaticapproach to runtime code generation and compiled DSLs,”
Communi-cations of the ACM , vol. 55, no. 6, pp. 121–130, 2012.[17] E. Y. Sharygin, R. Buchatskiy, R. Zhuykov, and A. Sher, “Querycompilation in PostgreSQL by specialization of the DBMS sourcecode,”