A Unified System for Data Analytics and In Situ Query Processing
Alex Watson§, Suvam Kumar Das§, and Suprio Ray
Faculty of Computer Science, University of New Brunswick, Canada
§ Equal contribution
Email: {awatson, suvam.das, sray}@unb.ca

Abstract—Today, data is generated at such a high rate that it has become essential to analyze it efficiently and produce results quickly. Yet data scientists and analysts are often forced to use several different systems, because relational databases offer little beyond SQL querying and are not well equipped for complex data analysis; as a result, data science frameworks are in huge demand. This can require significant data movement across multiple systems, which is expensive. Furthermore, with relational databases, the data must be completely loaded into the database before any analysis can be performed. We believe there is a pressing need for a single system that can perform both data analysis tasks and SQL querying, while offering adequate performance, scalability, built-in functionality, and usability. We extend Python's Dask framework to present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking any Python API as a User-Defined Function (UDF), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a novel distributed learned index to improve join performance. Our experimental evaluation involves the TPC-H benchmark and a custom UDF benchmark that we developed for data analytics, and we demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.
I. INTRODUCTION
Due to increasing digitalization, large volumes of data are constantly being generated. To make sense of this deluge of data, it must be cleaned, transformed and analyzed. Data science offers tools and techniques to manipulate data in order to extract actionable insights, including support for data wrangling, statistical analysis and machine learning model building. Traditionally, practitioners and researchers make a distinction between query processing and data analysis tasks. Whereas relational database systems (henceforth, databases or DBMSs) are used for SQL-style query processing, a separate category of frameworks is used for data analyses that involve statistical and machine learning tasks. Currently, Python has emerged as the most popular language-based framework, owing to its rich ecosystem of data analysis libraries, such as Pandas, NumPy and scikit-learn. These tools make it possible to perform in situ analysis of data that is stored outside of any database. However, a significant amount of data is still stored in databases. To analyze this data, it must be moved from a database into the address space of the data analysis application. Similarly, to do SQL query processing on data that is stored in a raw file, the data must be loaded into a database through a loading mechanism known as ETL (extract, transform, load). Both the movement and the loading of data are time-consuming operations.
To address the movement of data across databases and data analysis frameworks, researchers have recently proposed several approaches. Among them are in-database solutions that incorporate data analysis functionality within an existing database, including PostgreSQL/MADlib [1], HyPer [2] and AIDA [3]. In these systems, application developers write SQL code and invoke data analysis functionality through user-defined functions (UDFs). There are several issues with these approaches.
First, the vast body of existing data science applications written in a popular language (Python or R) needs to be converted into SQL.
Second, databases typically expose data analysis features only through UDFs, which are not as rich as the language-based API ecosystems of Python or R.
Third, data stored in raw files needs to be loaded into a database through ETL. Although some support for executing SQL queries on raw files exists, such as PostgreSQL's foreign data wrappers, it can easily break if the file is not well formatted. In recent years, several projects [4], [5], [6] investigated how to support in situ SQL querying on raw data files. However, they primarily focused on supporting database-like query processing on a single machine. These systems lack the sophisticated data wrangling and data science features that are available in Python or R.
Fourth, most relational databases are not horizontally scalable. Even with parallel databases, parallel execution of UDFs is either not supported or not efficient. Although "Big Data" systems such as Spark [7] and Hive/Hivemall [8] address some of these issues, they often involve more complex APIs and a steeper learning curve. It is also not practical to rewrite the large body of existing data science code with these APIs.
To address the issues with the existing approaches, we introduce a scalable data science system, DaskDB, which seamlessly supports in situ SQL query execution and data analysis using Python. DaskDB extends the scalable data analytics framework Dask [9], which can scale to more than one machine. Dask's high-level collection APIs mimic many of the popular Python data analytics library APIs based on Pandas and NumPy, so existing applications written using Pandas collections need not be modified. On the other hand, Dask does not support SQL query processing. In contrast, DaskDB can execute SQL queries in situ, without requiring the expensive ETL step or movement of data from raw files into a database system. Furthermore, with DaskDB, SQL queries can contain UDFs that directly invoke Python APIs. This provides a powerful mechanism for mixing SQL with Python and enables data scientists to take advantage of the rich data science libraries with the convenience of SQL. Thus, DaskDB unifies query processing and analytics in a scalable manner.
A key issue with distributed query processing and data analytics is the movement of data across nodes, which can significantly impact performance. We propose a distributed learned index to improve the performance of join, an important data operation. In DaskDB, a relation (or dataframe) is split into multiple partitions, where each partition consists of numerous tuples of that relation. These partitions are distributed across different nodes. When two relations are joined, it is possible that not all partitions of a relation contribute to the final result. The distributed learned index is designed to consider only those partitions that contain the required data, identifying them in constant time from the data pattern in each partition. This minimizes unnecessary data movement across nodes. DaskDB also incorporates selective data persistence, which significantly reduces serialization/de-serialization overhead and data movement.
We conduct an extensive experimental evaluation to compare the performance of DaskDB against two horizontally scalable systems: PySpark and Hive/Hivemall. Our experiments involve several queries from the TPC-H benchmark [10] with different data sizes (scale factors). We also created a custom UDF benchmark to evaluate DaskDB and PySpark. Our results show that DaskDB outperforms the other systems on both benchmarks. The key contributions of this paper are:
• We propose DaskDB, which integrates in situ query processing and data analytics in a scalable manner.
• DaskDB supports SQL queries with UDFs that can directly invoke Python data science APIs.
• We introduce a novel distributed learned index.
• We present extensive experimental results involving the TPC-H benchmark and a custom UDF benchmark.
II. BACKGROUND
DaskDB was built by extending Dask [9], an open-source library for parallel computing in Python. The main advantage of Dask is that it provides Python APIs and data structures that are similar to those of NumPy, Pandas and scikit-learn. Hence, programs written using these Python data science APIs can often be switched to Dask simply by changing the import statement. Dask comes with an efficient task scheduler, which can run programs on a single node and scale to many nodes. However, Dask does not support SQL queries. We show that DaskDB, which is built on top of Dask, can outperform Spark [11].
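As a minimal illustration of this switch (a sketch of ours; the file name and columns are illustrative, not taken from the paper), a Pandas script can typically be ported to Dask by changing the import and adding a compute() call:

import pandas as pd                       # Pandas version: single machine, in-memory
df = pd.read_csv('lineitem.csv')
print(df.groupby('l_orderkey')['l_extendedprice'].sum())

import dask.dataframe as dd               # Dask version: same API, parallel and out-of-core
ddf = dd.read_csv('lineitem.csv')
print(ddf.groupby('l_orderkey')['l_extendedprice'].sum().compute())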
III. RELATED WORK
In this section, we first discuss systems for data analytics and query processing. Next, we look at work related to learned indexes, followed by in situ query processing.
A. Data Analytics and Query Processing
1) Dedicated Data Analytics Frameworks:
Many open-source data analytics applications traditionally use R. More recently, Python has become very popular, in part because of the Anaconda distribution [12], which bundles many data science and analytics packages, such as Pandas, SciPy, Matplotlib and scikit-learn. Popular commercial data analytics systems include Tableau [13] and MATLAB [14].
2) In-Database Analytics:
An increasing number of the major DBMSs now include data science and machine learning tools. For instance, PostgreSQL supports SQL-based machine learning algorithms through the Apache MADlib library [1]. However, interacting with a DBMS to implement analytics can be challenging [15]. Although SQL is a mature technology, it is not rich enough for extensive data analysis. DBMSs typically support analytics functionality through User-Defined Functions (UDFs). Since a UDF may execute arbitrary external code written in R, Python, Java or T-SQL, a DBMS treats it as a black box on which no optimization can be performed; UDFs are also difficult to debug and to develop incrementally [3]. In-database analytics approaches still have the constraint of ETL (extract, transform, load), which is a time-consuming process and not practical in many cases.
3) Integrating Analytics and Query Processing:
There have been several attempts at creating more efficient solutions by combining two or more of dedicated data analytics systems, DBMSs and big data frameworks. These systems can be classified into two categories, described next.
Hybrid Solutions.
These solutions integrate two or more system types into one and are primarily DBMS-centric approaches. AIDA [3] integrates a Python client directly into the DBMS memory space, eliminating the bottleneck of transferring data. In [16], the authors present a prototype system that integrates a columnar relational database (MonetDB) with R using a same-process, zero-copy data sharing mechanism. In [17], the authors propose DuckDB, an embeddable analytical database. The key drawback of these hybrid systems is ETL, since the data needs to be loaded into a database. Moreover, existing data science applications written in Python or R need to be modified to work with such systems, since their interface is SQL-based.
"Big Data" Analytics Frameworks.
The most popular big data frameworks are Hadoop and Spark [7]. Spark supports machine learning with MLlib [18] and SQL-like queries. Hive is based on Hadoop, supports SQL-like queries, and offers analytics through the machine learning library Hivemall [8]. Drawbacks of big data frameworks include more complicated development and a steeper learning curve than most other analytics systems, as well as the difficulty of integrating with DBMS applications. Running any existing Python or R application within a big data system requires rewriting it with new APIs, which is not a viable option in most cases.
Fig. 1: DaskDB System Architecture. (1) SQLParser: extracts metadata information from the SQL query; (2) QueryPlanner: generates a logical and a physical plan; (3) DaskPlanner: converts the physical plan to a Daskplan and then to Dask code; (4) DaskDB Execution Engine: client, scheduler and workers; (5) HDFS: distributed data storage. Data flow: (a) metadata and SQL query, (b) physical plan, (c) executable Dask code.
B. Learned Index
Data structures such as B+-trees are the mainstay of indexing techniques. These approaches require storing all the keys of a dataset. Recent studies have shown that learned models can instead model the cumulative distribution function (CDF) of the keys in a sorted array and use it to predict their locations; this idea was termed the learned index [19]. Several learned indexes have since been developed, including the FITing-Tree [20] and the PGM-Index [21]. The learned index approaches proposed so far are meant only for stand-alone systems. To the best of our knowledge, these ideas have not yet been incorporated into any database system, and no learned index model has been developed for a distributed data system.
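As a simple illustration of the general idea (our own sketch, not the distributed index described later in this paper), a linear model can approximate the CDF of a sorted key array and predict a key's position, with a bounded search window correcting the prediction:

import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.integers(0, 1_000_000, size=10_000))   # sorted key array
positions = np.arange(len(keys))

# Learn an approximation of the CDF: position ~= slope * key + intercept.
slope, intercept = np.polyfit(keys, positions, deg=1)
max_err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

def lookup(key):
    # Predict the position, then correct it within the model's error bound.
    pred = int(slope * key + intercept)
    lo = max(0, pred - max_err)
    hi = min(len(keys), pred + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))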
C. In Situ Query Processing
A vast amount of data is stored in raw file formats outside traditional databases. Data scientists, who frequently lack expertise in data modeling, database administration and ETL tools, often need to run interactive analysis on this data. To reduce the "time to query" and avoid the overhead associated with relational databases, a number of research projects have investigated in situ query processing on raw data. NoDB [4] was one of the earliest systems to support in situ query processing on raw data files. PostgresRaw [22] is based on the ideas of NoDB and supports SQL querying over CSV files in PostgreSQL. The SCANRAW [5] system exploits parallelism during in situ raw data processing. Slalom [6] introduced adaptive partitioning and on-the-fly per-partition indexes to improve query processing performance on raw data. All of these systems focus on database-style SQL query processing over raw data on a single machine. Our system, DaskDB, supports in situ querying on heterogeneous data sources as well as data science workloads. Moreover, it is a distributed data system that can scale over a cluster of nodes.
IV. OUR APPROACH: DASKDB
In this section, we present DaskDB. In addition to supporting all Dask features, DaskDB enables in situ SQL querying on raw data in a data-science-friendly environment. Next, we describe the DaskDB system architecture and its components.
A. System Architecture
The system architecture of DaskDB comprises five main components: the SQLParser, the QueryPlanner, the DaskPlanner, the DaskDB Execution Engine, and HDFS. They are shown in Figure 1. First, the SQLParser gathers metadata pertaining to the SQL query, such as the names of the tables, columns and functions. This information is passed along to the QueryPlanner, which generates a physical plan from it. The physical plan is an ordered set of steps that specify a particular execution plan for the query and how the data will be accessed. The QueryPlanner then sends the physical plan to the DaskPlanner, which generates a plan whose operations closely resemble Dask APIs, called the Daskplan. The Daskplan is produced from the physical plan and then converted into Python code, which is sent to the DaskDB Execution Engine. The Execution Engine runs the code, gathering the data from HDFS, and thus executes the SQL query. Further details are provided in the following sections.
1) SQLParser:
The SQLParser is the first component of DaskDB involved in query processing. Its input is the original SQL query. It first checks for syntax errors and creates a parse tree with all the metadata information about the query. We then process the parse tree to gather the metadata needed by the QueryPlanner: table names, column names and UDFs. We then check whether the referenced table(s) exist and, if they do, dynamically generate a schema containing the table names, column names and data types used in the SQL query. The schema, the UDFs (if any) and the original SQL query are then passed to the QueryPlanner.
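As a rough sketch of this metadata-gathering step (our illustration, based on the sql-metadata package [25] used by the SQLParser; the version 2.x API is assumed), table and column names can be pulled from the query text as follows:

from sql_metadata import Parser

sql = "SELECT l_orderkey, l_extendedprice FROM lineitem WHERE l_orderkey < 50"
parser = Parser(sql)
tables = parser.tables     # ['lineitem']
columns = parser.columns   # ['l_orderkey', 'l_extendedprice']
# DaskDB would then verify that 'lineitem' exists as a raw file and
# build a schema (column names and inferred data types) for it.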
2) QueryPlanner:
The QueryPlanner creates the logical and preliminary physical plans. The schema and UDFs produced by the SQLParser, along with the SQL query, are passed into the QueryPlanner, which uses them to first create a logical plan and then an optimized preliminary physical plan. The QueryPlanner extends Raco [26], the query compiler of the Myria big data management and analytics stack developed by the database group at the University of Washington. We chose Raco because it is implemented natively in Python, which allows easy integration without the overhead of bridging separate run-time environments, and we extended its parser to support additional commonly used SQL keywords. The resulting plan is then sent to the DaskPlanner.
3) DaskPlanner:
The DaskPlanner transforms the preliminary physical query plan from the QueryPlanner into Python code that is ready for execution. The first step is for the DaskPlanner to walk through the physical plan obtained from the QueryPlanner and convert it into a Daskplan. This maps the operators of the physical plan onto operators that more closely resemble the Dask API. The Daskplan also associates relevant information with each operator from the physical plan, including the columns and tables involved and operator-specific metadata. We also keep track of each operator's data dependencies, which is needed to pass intermediate results from one operation to the next. Algorithm 1 shows how DaskDB converts the physical plan into the Daskplan.

Algorithm 1: DaskPlanner: Conversion of Physical Plan to Daskplan
Input: A physical plan P, containing ordered groups of dependent operators G. Each G consists of an ordered list of tuples (k, o, d), where k is the unique key of an operation, o is the operation type and d is the operation metadata.
Output: A Daskplan DP, consisting of an ordered list of operators.
  DP ← list()
  for sorted G ∈ P do
    for sorted (k, o, d) ∈ G do
      dp ← dict()                            // create Daskplan operator
      dp[o] ← convertToDaskPlanOperator(o)
      dp[d] ← getMetadataInfo(d)
      dp[key] ← k                            // key used to resolve data dependencies
      DP.add(dp)                             // add dp to the Daskplan
  for sorted dp ∈ DP do
    while dp has children c do
      dp[t_i] ← table information from c_i
  return DP

In the next step, the DaskPlanner converts the Daskplan into Python code that utilizes the Dask API. The details of each table and its column names and indexes are tracked and maintained in a dynamic dictionary throughout the execution of a query, because multiple tables and columns may be created, removed or manipulated during execution. For this reason, the names of the tables, columns and indexes are maintained dynamically while the Daskplan is transformed into Python code. A few Dask features also had to be adapted along the way, such as returning the nth smallest or largest values over a multi-index.
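The following is a short Python rendering of Algorithm 1 (our sketch; the operator-conversion helpers and the child-traversal function are placeholders for DaskDB internals that the paper does not spell out and are passed in as parameters):

def build_daskplan(physical_plan, convert_op, get_metadata, children_of):
    """Convert a physical plan (ordered groups of (key, op, meta) tuples)
    into a Daskplan: an ordered list of Dask-API-like operator dicts."""
    daskplan = []
    for group in physical_plan:                  # ordered groups of dependent operators
        for key, op, meta in group:
            daskplan.append({
                'op': convert_op(op),            # map to a Dask-API-like operator
                'meta': get_metadata(meta),      # columns, tables, operator options
                'key': key,                      # used to resolve data dependencies
            })
    for dp in daskplan:                          # pull table info up from child operators
        for child in children_of(dp):
            dp.setdefault('tables', []).append(child['meta'])
    return daskplan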
4) DaskDB Execution Engine:
The DaskDB Execution Engine has three main components: the client, the scheduler and the workers. The client transforms the generated Dask Python code into a set of tasks. The scheduler creates a DAG (directed acyclic graph) from the set of tasks and automatically partitions the data into chunks while taking data dependencies into account. The scheduler sends one task at a time to each worker according to several scheduling policies, which depend on various factors including data locality. A worker stores a data chunk until it is no longer needed and the scheduler instructs it to release it.
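For concreteness, this is how generated code is typically handed to a Dask cluster (a generic Dask sketch, not DaskDB-specific; the scheduler address, file path and persist call are illustrative):

from dask.distributed import Client
import dask.dataframe as dd

client = Client('tcp://scheduler-host:8786')    # connect to the scheduler coordinating the workers
ddf = dd.read_csv('hdfs:///tpch/lineitem.csv')  # workers read partitions in parallel
ddf = client.persist(ddf)                       # keep partitions in distributed worker memory
result = ddf.l_extendedprice.sum().compute()    # scheduler executes the task graph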
5) HDFS:
The Hadoop Distributed File System (HDFS) [23] is a storage system used by Hadoop applications. DaskDB uses HDFS to store and share the data files among its nodes.
B. Illustration of SQL query execution
An in situ query is executed within DaskDB by calling the query function with the SQL string as its argument. The query in Figure 2 is a simplified version of a typical TPC-H query. The Daskplan shown in Figure 3 is generated from the physical plan in the DaskPlanner component; its operators closely resemble the Dask API, for example the read_csv and filter operators shown in the tree. This Daskplan is then converted into executable Python code, which is omitted due to space constraints.

from daskdb_core import query

sql = """SELECT l_orderkey,
                sum(l_extendedprice * (1 - l_discount)) as revenue
         FROM orders, lineitem
         WHERE l_orderkey = o_orderkey and o_orderdate >= '1995-01-01'
         GROUP BY l_orderkey
         ORDER BY revenue LIMIT 5;"""
query(sql)

Fig. 2: Code showing SQL query execution
Fig. 3: Generated Daskplan for the code in Figure 2: read_csv(lineitem) and read_csv(orders); filter o_orderdate >= '1995-01-01' on orders; DaskDBJoin on o_orderkey = l_orderkey; ColumnMapping to l_orderkey, l_extendedprice, l_discount; NewColumn revenue = l_extendedprice * (1 - l_discount); groupby(l_orderkey) with aggregate sum(revenue); order by revenue; limit 5.
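Although the generated code itself is omitted, a plausible hand-written equivalent of the plan in Figure 3 using plain Dask dataframe APIs (our sketch; the file paths and HDFS layout are assumptions) looks roughly like this:

import dask.dataframe as dd

# Read only the columns the plan needs (column pruning from the Daskplan).
lineitem = dd.read_csv('hdfs:///tpch/lineitem.csv',
                       usecols=['l_orderkey', 'l_extendedprice', 'l_discount'])
orders = dd.read_csv('hdfs:///tpch/orders.csv',
                     usecols=['o_orderkey', 'o_orderdate'])

orders = orders[orders.o_orderdate >= '1995-01-01']                   # filter
joined = lineitem.merge(orders, left_on='l_orderkey',
                        right_on='o_orderkey')                        # DaskDBJoin
joined['revenue'] = joined.l_extendedprice * (1 - joined.l_discount)  # NewColumn
result = (joined.groupby('l_orderkey')['revenue'].sum()               # groupby + aggregate
                .nsmallest(5)                                         # ORDER BY revenue LIMIT 5
                .compute())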
C. Support for SQL query with UDFs
DaskDB supports UDFs in SQL as part of in situ querying. A UDF enables a user to define a function in Python and embed it in an SQL query. Since DaskDB converts the SQL query, including the UDF, back into Python code, UDFs can reference and utilize features from any of the data science packages available in the Anaconda Python ecosystem. Spark introduced UDFs in version 0.7, but they operated one row at a time and thus suffered from high serialization and invocation overhead. Later, Spark added Pandas UDFs, which provide low-overhead, high-performance UDFs entirely in Python. Although Pandas UDFs are efficient, they are restrictive to use in queries and sometimes require Spark's own data types, which is inconvenient for users who are not experienced with Spark. In contrast, UDFs for SQL queries can easily be written in DaskDB: any native Python function (either imported from an existing package or custom-made) that accepts Pandas dataframes as parameters can be applied as a UDF in a DaskDB SQL query. The return type of a UDF is not fixed, unlike Spark's Pandas UDFs, which allows users to design UDFs with ease. Like any general Python function, UDFs involving machine learning, data visualization and numerous other functionalities can easily be developed and applied to queries in DaskDB.
D. Illustration of SQL query with UDF
We illustrate a UDF with SQL in DaskDB using K-Means clustering. Similar to Spark, a UDF needs to be registered with the DaskDB system using the register_udf API. As shown in Figure 4, the UDF myKMeans takes as input a single Pandas dataframe with two columns; hence the UDF is registered as register_udf(myKMeans, [2]). As part of the query, the UDF is invoked as myKMeans(l_discount, l_tax), which means that after the selection condition (l_orderkey < 50) and the limit (limit 50) are applied to the lineitem relation, the columns l_discount and l_tax together form a Pandas dataframe that is passed to myKMeans. The output of the code is plotted in Figure 5, where the different clusters are shown in different colours.

from daskdb_core import query, register_udf
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def myKMeans(df):
    kmeans = KMeans(n_clusters=4).fit(df)
    col1 = list(df.columns)[0]
    col2 = list(df.columns)[1]
    plt.scatter(df[col1], df[col2],
                c=kmeans.labels_.astype(float), s=50)
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

register_udf(myKMeans, [2])
sql_kmeans = """select myKMeans(l_discount, l_tax)
                from lineitem where l_orderkey < 50 limit 50;"""
query(sql_kmeans)

Fig. 4: UDF code showing K-Means Clustering
E. Distributed Learned Index
Join is considered one of the most expensive data operations. We propose a novel distributed learned index to accelerate the distributed execution of joins in DaskDB. Our distributed learned index relies on the Heaviside step function [24]: any step function can be represented as a combination of multiple Heaviside step functions, which forms the basis of our learned index structure. The Heaviside step function is defined as

H(x) = 0 if x < 0, and 1 if x ≥ 0.    (1)

We define a Partition Function P as

P_{a,b,c}(y) = H((b − y) ∗ (y − a)) ∗ c,    (2)

which returns c whenever a ≤ y ≤ b, and 0 otherwise.
While constructing the distributed learned index, it is assumed that one of the relations is sorted by the join attribute. Hence, if the index maintains the first and last key values of each partition, then given any key, the partition containing it can be identified by a Partition Function. Such a sparse index can entail huge storage savings, since not all keys need to be stored. We illustrate this with a simplified example using the customer table from TPC-H, where c_custkey is the primary key. If the relation has 500 tuples and each partition can store 100 tuples, then there are 5 partitions. The distribution of the keys is shown in Table II and plotted in Figure 6. Assuming the 500 c_custkey values are 1 through 500, the plot is a step function f, where

f(key) = 1 for 1 ≤ key ≤ 100; 2 for 101 ≤ key ≤ 200; 3 for 201 ≤ key ≤ 300; 4 for 301 ≤ key ≤ 400; 5 for 401 ≤ key ≤ 500,    (3)

which can equivalently be represented by summing several Partition Functions, giving the Learned Index Function L_customer on the customer table:

L_customer(key) = P_{1,100,1}(key) + P_{101,200,2}(key) + P_{201,300,3}(key) + P_{301,400,4}(key) + P_{401,500,5}(key).

Next, we explain how our learned index can be used for join queries. Let A and B be two relations that need to be joined on columns col_A of A and col_B of B. Without loss of generality, we assume col_A is sorted and a Learned Index Function L_A is generated on this column. Since DaskDB internally uses Dask APIs, the relations are converted to Dask dataframes before joining, and each Dask dataframe consists of several partitions. For each tuple t of a partition of B, L_A(t[col_B]) is computed (returning the number of the partition of A to which t[col_B] belongs) and appended to t as a new 'Partition' column. This is done in parallel for every partition of B. Then, for each partition number i of A, all tuples t of B with t['Partition'] = i are selected and joined with the i-th partition of A. Partition identification, partition selection and the partition-wise joins are each executed in parallel. For joining the dataframes, we developed a variant of the merge API of the Pandas package.
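A compact Python sketch of these ideas (our illustration; the partition boundaries, column names and routing step are simplified, and the actual DaskDB implementation operates on distributed Dask dataframe partitions):

import pandas as pd

def heaviside(x):
    return 1 if x >= 0 else 0                           # H(x), Eq. (1)

def make_partition_fn(a, b, c):
    return lambda y: heaviside((b - y) * (y - a)) * c   # P_{a,b,c}(y), Eq. (2)

def build_learned_index(partitions):
    """partitions: list of (first_key, last_key) per sorted partition of A.
    Returns L_A(key) -> 1-based partition number (0 if key is out of range)."""
    fns = [make_partition_fn(lo, hi, i + 1) for i, (lo, hi) in enumerate(partitions)]
    return lambda key: sum(f(key) for f in fns)

# Sparse index over five partitions of the sorted join column of A.
L_A = build_learned_index([(1, 100), (101, 200), (201, 300), (301, 400), (401, 500)])

def route(b_chunk: pd.DataFrame, col_b: str) -> pd.DataFrame:
    # Tag every tuple of a partition of B with the A-partition it joins with,
    # so only the matching partition pairs need to be co-located and merged.
    b_chunk = b_chunk.copy()
    b_chunk['Partition'] = b_chunk[col_b].map(L_A)
    return b_chunk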
V. EVALUATION
In this section, we present the experimental setup, the TPC-H benchmark results and the custom UDF benchmark results.
A. Experimental Setup
DaskDB was implemented in Python by extending Dask. The SQLParser of DaskDB utilizes the tool sql-metadata [25], and the QueryPlanner extends Raco [26]. We ran experiments on a cluster of 8 machines, each with 16 GB of memory and 8 Intel(R) Xeon(R) CPUs running at 3.00 GHz under Ubuntu 16.04. We evaluated DaskDB against two systems that support both SQL query execution and data analytics: PySpark and Hive/Hivemall (henceforth referred to as Hivemall). HDFS was used to store the datasets for each system. The versions of Python, PySpark and Hive were 3.7.6, 3.0.1 and 2.1.0, respectively. PySpark and Hivemall were allocated the maximum resources available (i.e., cores and memory).
TABLE I: Queries with UDF. One SQL query per task (LR, K-Means, Quantiles, CGO), each invoking the corresponding UDF on the lineitem relation; e.g. LR invokes myLinearFit(l_discount, l_tax).
Fig. 5: K-Means output (scatter plot of l_discount vs. l_tax, clusters shown in different colours).
Fig. 7: Execution times for TPC-H benchmark queries Q1, Q3, Q5, Q6 and Q10 with Hivemall, PySpark and DaskDB; panels (a)-(d) correspond to scale factors 1, 5, 10 and 20 (GB); y-axis: time (seconds).
Fig. 8: Execution times for the SQL queries with UDFs (LR, K-Means, Quantiles, CGO) with PySpark and DaskDB; panels (a)-(d) correspond to scale factors 1, 5, 10 and 20; the PySpark K-Means execution at SF 20 took too long to complete.
TABLE II: Sparse index for customer relation (columns: Begin, End, Partition; one row per partition).
Fig. 6: Keys vs. partition number plot for the customer relation.
B. TPC-H Benchmark Evaluation Results
We evaluated the systems with several queries from the TPC-H decision support benchmark [10]. We used four scale factors (SF): 1, 5, 10 and 20, where SF 1 corresponds to roughly 1 GB. We executed five TPC-H queries, and the results are plotted in Figure 7. DaskDB outperforms PySpark and Hivemall on all queries at all scale factors, and Hivemall performs worse than both DaskDB and PySpark in all cases. In general, the higher the SF, the larger the performance gap. For instance, with Q10, DaskDB is 3.5x faster than PySpark at SF 1 and 4.7x faster at SF 20. DaskDB achieves a speedup of 182x with Q5 at SF 20. The superior performance of DaskDB can be credited to its efficient data distribution, its join implementation using the distributed learned index, and selective data persistence.

C. UDF Benchmark Evaluation Results
We developed a custom UDF benchmark consisting of four machine learning tasks with UDFs: LR (Linear Regression), K-Means (K-Means Clustering), Quantiles (Quantiles Estimation) and CGO (Conjugate Gradient Optimization). They were implemented using the machine learning packages available in Python. For each task, the SQL queries in Table I were executed. The UDFs were written to perform the same task in both systems by importing the same Python packages. For this evaluation, the performance of DaskDB was compared with that of PySpark; Hivemall results are omitted due to its poor performance. The results are plotted in Figure 8. As with the TPC-H benchmark, DaskDB outperforms PySpark for all machine learning tasks at all scale factors. Among the tasks, K-Means performs worst in PySpark: DaskDB is 28.5x faster than PySpark at SF 1 and 64x faster at SF 10, and at SF 20 PySpark took too long to finish K-Means and could not be measured, whereas DaskDB took only about 41 seconds. For Quantiles, DaskDB was 4x and 16.6x faster than PySpark at SF 1 and SF 20, respectively. These results also show that the larger the SF, the better DaskDB performs compared to PySpark.

VI. CONCLUSION
We presented DaskDB, a scalable data science system. It brings in situ SQL querying to a data science platform in a way that supports high usability, performance, scalability and built-in capabilities. DaskDB can also incorporate UDFs into SQL queries, where a UDF may invoke any Python library call. Furthermore, we introduced a novel distributed learned index that accelerates the join/merge operation. We evaluated DaskDB against two state-of-the-art systems, PySpark and Hive/Hivemall, using the TPC-H benchmark and a custom UDF benchmark, and showed that DaskDB significantly outperforms both.
REFERENCES
[1] J. M. Hellerstein et al., "The MADlib Analytics Library: Or MAD Skills, the SQL," PVLDB, vol. 5, no. 12, pp. 1700-1711, 2012.
[2] A. Kemper and T. Neumann, "HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots," in ICDE, 2011, pp. 195-206.
[3] J. V. D'silva, F. De Moor, and B. Kemme, "AIDA: Abstraction for Advanced In-database Analytics," PVLDB, vol. 11, no. 11, 2018.
[4] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, "NoDB: Efficient query execution on raw data files," in SIGMOD, 2012, pp. 241-252.
[5] Y. Cheng and F. Rusu, "Parallel in-situ data processing with speculative loading," in SIGMOD, 2014, pp. 1287-1298.
[6] M. Olma et al., "Adaptive partitioning and indexing for in situ query processing," VLDB J., vol. 29, no. 1, pp. 569-591, 2020.
[7] M. Zaharia et al., "Spark: Cluster computing with working sets," in USENIX HotCloud, 2010, pp. 10-10.
[8] "Hivemall," https://hivemall.apache.org/.
[9] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Python in Science Conference, 2015.
[10] "TPC-H Benchmark," http://www.tpc.org/tpch/.
[12] "Anaconda Software Distribution," https://www.anaconda.com/.
[13] "Tableau," https://www.tableau.com/.
[14] "MATLAB," https://www.mathworks.com/products/matlab.html.
[15] SSDBM, 2018.
[16] J. Lajus and H. Mühleisen, "Efficient data management and statistics with zero-copy integration," in SSDBM, 2014.
[17] M. Raasveldt and H. Mühleisen, "Data management for data science - towards embedded analytics," in CIDR, 2020.
[18] X. Meng et al., "MLlib: Machine Learning in Apache Spark," Journal of Machine Learning Research, vol. 17, no. 34, pp. 1-7, 2016.
[19] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in SIGMOD, 2018, pp. 489-504.
[20] A. Galakatos et al., "FITing-Tree: A data-aware index structure," in SIGMOD, 2019, pp. 1189-1206.
[21] P. Ferragina and G. Vinciguerra, "The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds," PVLDB, vol. 13, no. 8, pp. 1162-1175, 2020.
[22] "PostgresRAW," https://github.com/HBPMedical/PostgresRAW.
[23] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in MSST, 2010, pp. 1-10.
[24] "Heaviside step function," https://en.wikipedia.org/wiki/Heaviside_step_function.
[25] "sql-metadata," https://pypi.org/project/sql-metadata/.
[26] J. Wang et al., "The Myria Big Data Management and Analytics System and Cloud Services," in CIDR, 2017.