A Unified System for Data Analytics and In Situ Query Processing
Alex Watson§, Suvam Kumar Das§, and Suprio Ray
Faculty of Computer Science, University of New Brunswick, Canada
§ Equal contribution
Email: {awatson, suvam.das, sray}@unb.ca

Abstract—Today, data is generated at such a high rate that it has become essential to analyze it efficiently and produce results quickly. Yet data scientists and analysts are often forced to use several different systems, because relational databases offer little beyond SQL querying and are not well equipped for complex data analysis; as a result, data science frameworks are in huge demand. This can require significant data movement across multiple systems, which is expensive. Furthermore, with relational databases, the data must be completely loaded into the database before any analysis can be performed. We believe there is a pressing need for a single system that can perform both data analysis tasks and SQL querying, while offering adequate performance, scalability, built-in functionality, and usability. We extend Python's Dask framework to present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking any Python API as a User-Defined Function (UDF), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a novel distributed learned index to improve join performance. Our experimental evaluation involves the TPC-H benchmark and a custom UDF benchmark that we developed for data analytics, and we demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.
I. INTRODUCTION
Due to increasing digitalization, large volumes of data are constantly being generated. To make sense of this deluge of data, it must be cleaned, transformed and analyzed. Data science offers tools and techniques to manipulate data in order to extract actionable insights, including support for data wrangling, statistical analysis and machine learning model building. Traditionally, practitioners and researchers make a distinction between query processing and data analysis tasks. Whereas relational database systems (henceforth, databases or DBMSs) are used for SQL-style query processing, a separate category of frameworks is used for data analyses that involve statistical and machine learning tasks. Currently, Python has emerged as the most popular language-based framework, owing to its rich ecosystem of data analysis libraries, such as Pandas, NumPy and scikit-learn. These tools make it possible to perform in situ analysis of data that is stored outside of any database. However, a significant amount of data is still stored in databases. To analyze this data, it must be moved from a database into the address space of the data analysis application. Similarly, to do SQL query processing on data that is stored in a raw file, the data must be loaded into a database through a loading mechanism known as ETL (extract, transform, load). Both the movement and the loading of data are time-consuming operations.
To address the movement of data across databases and data analysis frameworks, researchers have recently proposed several approaches. Among them are in-database solutions that incorporate data analysis functionality within an existing database, including PostgreSQL/MADlib [1], HyPer [2] and AIDA [3]. In these systems, application developers write SQL code and invoke data analysis functionality through user-defined functions (UDFs). There are several issues with these approaches.
First, the vast body of existing data science applications written in a popular language (Python or R) needs to be converted into SQL.
Second, databases typically expose data analysis features only through UDFs, which are not as rich as the language-based API ecosystems of Python or R.
Third, data stored in raw files needs to be loaded into a database through ETL. Although some support for executing SQL queries on raw files exists, such as PostgreSQL's foreign data wrappers, it can easily break if the file is not well formatted. In recent years, several projects [4], [5], [6] investigated how to support in situ SQL querying on raw data files. However, they primarily focused on supporting database-like query processing on a single machine. These systems lack the sophisticated data wrangling and data science features that are available in Python or R.
Fourth, most relational databases are not horizontally scalable. Even with parallel databases, parallel execution of UDFs is either not supported or not efficient. Although "Big Data" systems such as Spark [7] and Hive/Hivemall [8] address some of these issues, they often involve more complex APIs and a steeper learning curve. It is also not practical to rewrite the large body of existing data science code with these APIs.
To address the issues with the existing approaches, we introduce a scalable data science system, DaskDB, which seamlessly supports in situ SQL query execution and data analysis using Python. DaskDB extends the scalable data analytics framework Dask [9], which can scale to more than one machine. Dask's high-level collection APIs mimic many of the popular Python data analytics library APIs based on Pandas and NumPy, so existing applications written using Pandas collections need not be modified. On the other hand, Dask does not support SQL query processing. In contrast, DaskDB can execute SQL queries in situ, without requiring the expensive ETL step or movement of data from raw files into a database system. Furthermore, with DaskDB, SQL queries can contain UDFs that directly invoke Python APIs. This provides a powerful mechanism for mixing SQL with Python and enables data scientists to take advantage of the rich data science libraries with the convenience of SQL. Thus, DaskDB unifies query processing and analytics in a scalable manner.
A key issue with distributed query processing and data analytics is the movement of data across nodes, which can significantly impact performance. We propose a distributed learned index to improve the performance of join, an important data operation. In DaskDB, a relation (or dataframe) is split into multiple partitions, where each partition consists of numerous tuples of that relation. These partitions are distributed across different nodes. When two relations are joined, it is possible that not all partitions of a relation contribute to the final result. The distributed learned index is designed to consider only those partitions that contain the required data, identifying them in constant time from the data pattern in each partition. This minimizes unnecessary data movement across nodes. DaskDB also incorporates selective data persistence, which significantly reduces serialization/de-serialization overhead and data movement.
We conduct an extensive experimental evaluation to compare the performance of DaskDB against two horizontally scalable systems: PySpark and Hive/Hivemall. Our experiments involve several queries from the TPC-H benchmark [10] with different data sizes (scale factors). We also created a custom UDF benchmark to evaluate DaskDB and PySpark. Our results show that DaskDB outperforms the other systems on both benchmarks. The key contributions of this paper are:
• We propose DaskDB, which integrates in situ query processing and data analytics in a scalable manner.
• DaskDB supports SQL queries with UDFs that can directly invoke Python data science APIs.
• We introduce a novel distributed learned index.
• We present extensive experimental results involving the TPC-H benchmark and a custom UDF benchmark.
II. BACKGROUND
DaskDB was built by extending Dask [9], an open-source library for parallel computing in Python. The main advantage of Dask is that it provides Python APIs and data structures that are similar to those of NumPy, Pandas and scikit-learn. Hence, programs written using these Python data science APIs can often be switched to Dask simply by changing the import statement. Dask comes with an efficient task scheduler, which can run programs on a single node and scale to many nodes. However, Dask does not support SQL queries. We show that DaskDB, which is built on top of Dask, can outperform Spark [11].
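As a minimal illustration of this switch (a sketch of ours; the file name and columns are illustrative, not taken from the paper), a Pandas script can typically be ported to Dask by changing the import and adding a compute() call:

import pandas as pd                       # Pandas version: single machine, in-memory
df = pd.read_csv('lineitem.csv')
print(df.groupby('l_orderkey')['l_extendedprice'].sum())

import dask.dataframe as dd               # Dask version: same API, parallel and out-of-core
ddf = dd.read_csv('lineitem.csv')
print(ddf.groupby('l_orderkey')['l_extendedprice'].sum().compute())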
III. RELATED WORK
In this section, we first discuss systems for data analytics and query processing. Next, we look at work related to learned indexes, followed by in situ query processing.
A. Data Analytics and Query Processing
1) Dedicated Data Analytics Frameworks:
Many open-source data analytics applications traditionally use R. More recently, Python has become very popular, in part because of the Anaconda distribution [12], which bundles many data science and analytics packages, such as Pandas, SciPy, Matplotlib and scikit-learn. Popular commercial data analytics systems include Tableau [13] and MATLAB [14].
2) In-Database Analytics:
An increasing number of the major DBMSs now include data science and machine learning tools. For instance, PostgreSQL supports SQL-based machine learning algorithms through the Apache MADlib library [1]. However, interacting with a DBMS to implement analytics can be challenging [15]. Although SQL is a mature technology, it is not rich enough for extensive data analysis. DBMSs typically support analytics functionality through User-Defined Functions (UDFs). Since a UDF may execute arbitrary external code written in R, Python, Java or T-SQL, a DBMS treats it as a black box on which no optimization can be performed; UDFs are also difficult to debug and to develop incrementally [3]. In-database analytics approaches still have the constraint of ETL (extract, transform, load), which is a time-consuming process and not practical in many cases.
3) Integrating Analytics and Query Processing:
There have been several attempts at creating more efficient solutions by combining two or more of dedicated data analytics systems, DBMSs and big data frameworks. These systems can be classified into two categories, described next.
Hybrid Solutions.
These solutions integrate two or more system types into one and are primarily DBMS-centric approaches. AIDA [3] integrates a Python client directly into the DBMS memory space, eliminating the bottleneck of transferring data. In [16], the authors present a prototype system that integrates a columnar relational database (MonetDB) with R using a same-process, zero-copy data sharing mechanism. In [17], the authors propose DuckDB, an embeddable analytical database. The key drawback of these hybrid systems is ETL, since the data needs to be loaded into a database. Moreover, existing data science applications written in Python or R need to be modified to work with such systems, since their interface is SQL-based.
"Big Data" Analytics Frameworks.
The most popular big data frameworks are Hadoop and Spark [7]. Spark supports machine learning with MLlib [18] and SQL-like queries. Hive is based on Hadoop, supports SQL-like queries, and offers analytics through the machine learning library Hivemall [8]. Drawbacks of big data frameworks include more complicated development and a steeper learning curve than most other analytics systems, as well as the difficulty of integrating with DBMS applications. Running any existing Python or R application within a big data system requires rewriting it with new APIs, which is not a viable option in most cases.
Fig. 1: DaskDB System Architecture. (1) SQLParser: extracts metadata information from the SQL query; (2) QueryPlanner: generates a logical and a physical plan; (3) DaskPlanner: converts the physical plan to a Daskplan and then to Dask code; (4) DaskDB Execution Engine: client, scheduler and workers; (5) HDFS: distributed data storage. Data flow: (a) metadata and SQL query, (b) physical plan, (c) executable Dask code.
B. Learned Index
Data structures such as B+-trees are the mainstay of indexing techniques. These approaches require storing all the keys of a dataset. Recent studies have shown that learned models can instead model the cumulative distribution function (CDF) of the keys in a sorted array and use it to predict their locations; this idea was termed the learned index [19]. Several learned indexes have since been developed, including the FITing-Tree [20] and the PGM-Index [21]. The learned index approaches proposed so far are meant only for stand-alone systems. To the best of our knowledge, these ideas have not yet been incorporated into any database system, and no learned index model has been developed for a distributed data system.
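As a simple illustration of the general idea (our own sketch, not the distributed index described later in this paper), a linear model can approximate the CDF of a sorted key array and predict a key's position, with a bounded search window correcting the prediction:

import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.integers(0, 1_000_000, size=10_000))   # sorted key array
positions = np.arange(len(keys))

# Learn an approximation of the CDF: position ~= slope * key + intercept.
slope, intercept = np.polyfit(keys, positions, deg=1)
max_err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

def lookup(key):
    # Predict the position, then correct it within the model's error bound.
    pred = int(slope * key + intercept)
    lo = max(0, pred - max_err)
    hi = min(len(keys), pred + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))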
C. In Situ Query Processing
A vast amount of data is stored in raw file formats outside traditional databases. Data scientists, who frequently lack expertise in data modeling, database administration and ETL tools, often need to run interactive analysis on this data. To reduce the "time to query" and avoid the overhead associated with relational databases, a number of research projects have investigated in situ query processing on raw data. NoDB [4] was one of the earliest systems to support in situ query processing on raw data files. PostgresRaw [22] is based on the ideas of NoDB and supports SQL querying over CSV files in PostgreSQL. The SCANRAW [5] system exploits parallelism during in situ raw data processing. Slalom [6] introduced adaptive partitioning and on-the-fly per-partition indexes to improve query processing performance on raw data. All of these systems focus on database-style SQL query processing over raw data on a single machine. Our system, DaskDB, supports in situ querying on heterogeneous data sources as well as data science workloads. Moreover, it is a distributed data system that can scale over a cluster of nodes.
IV. OUR APPROACH: DASKDB
In this section, we present DaskDB. In addition to supporting all Dask features, DaskDB enables in situ SQL querying on raw data in a data-science-friendly environment. Next, we describe the DaskDB system architecture and its components.
A. System Architecture
The system architecture of DaskDB comprises five main components: the SQLParser, the QueryPlanner, the DaskPlanner, the DaskDB Execution Engine, and HDFS. They are shown in Figure 1. First, the SQLParser gathers metadata pertaining to the SQL query, such as the names of the tables, columns and functions. This information is passed along to the QueryPlanner, which generates a physical plan from it. The physical plan is an ordered set of steps that specify a particular execution plan for the query and how the data will be accessed. The QueryPlanner then sends the physical plan to the DaskPlanner, which generates a plan whose operations closely resemble Dask APIs, called the Daskplan. The Daskplan is produced from the physical plan and then converted into Python code, which is sent to the DaskDB Execution Engine. The Execution Engine runs the code, gathering the data from HDFS, and thus executes the SQL query. Further details are provided in the following sections.
1) SQLParser:
The SQLParser is the first component of DaskDB involved in query processing. Its input is the original SQL query. It first checks for syntax errors and creates a parse tree with all the metadata information about the query. We then process the parse tree to gather the metadata needed by the QueryPlanner: table names, column names and UDFs. We then check whether the referenced table(s) exist and, if they do, dynamically generate a schema containing the table names, column names and data types used in the SQL query. The schema, the UDFs (if any) and the original SQL query are then passed to the QueryPlanner.
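As a rough sketch of this metadata-gathering step (our illustration, based on the sql-metadata package [25] used by the SQLParser; the version 2.x API is assumed), table and column names can be pulled from the query text as follows:

from sql_metadata import Parser

sql = "SELECT l_orderkey, l_extendedprice FROM lineitem WHERE l_orderkey < 50"
parser = Parser(sql)
tables = parser.tables     # ['lineitem']
columns = parser.columns   # ['l_orderkey', 'l_extendedprice']
# DaskDB would then verify that 'lineitem' exists as a raw file and
# build a schema (column names and inferred data types) for it.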
2) QueryPlanner:
The QueryPlanner creates the logical and preliminary physical plans. The schema and UDFs produced by the SQLParser, along with the SQL query, are passed into the QueryPlanner, which uses them to first create a logical plan and then an optimized preliminary physical plan. The QueryPlanner extends Raco [26], the query compiler of the Myria big data management and analytics stack developed by the database group at the University of Washington. We chose Raco because it is implemented natively in Python, which allows easy integration without the overhead of bridging separate run-time environments, and we extended its parser to support additional commonly used SQL keywords. The resulting plan is then sent to the DaskPlanner.
3) DaskPlanner:
The DaskPlanner transforms the preliminary physical query plan from the QueryPlanner into Python code that is ready for execution. The first step is for the DaskPlanner to walk through the physical plan obtained from the QueryPlanner and convert it into a Daskplan. This maps the operators of the physical plan onto operators that more closely resemble the Dask API. The Daskplan also associates relevant information with each operator from the physical plan, including the columns and tables involved and operator-specific metadata. We also keep track of each operator's data dependencies, which is needed to pass intermediate results from one operation to the next. Algorithm 1 shows how DaskDB converts the physical plan into the Daskplan.

Algorithm 1: DaskPlanner: Conversion of Physical Plan to Daskplan
Input: A physical plan P, containing ordered groups of dependent operators G. Each G consists of an ordered list of tuples (k, o, d), where k is the unique key of an operation, o is the operation type and d is the operation metadata.
Output: A Daskplan DP, consisting of an ordered list of operators.
  DP ← list()
  for sorted G ∈ P do
    for sorted (k, o, d) ∈ G do
      dp ← dict()                            // create Daskplan operator
      dp[o] ← convertToDaskPlanOperator(o)
      dp[d] ← getMetadataInfo(d)
      dp[key] ← k                            // key used to resolve data dependencies
      DP.add(dp)                             // add dp to the Daskplan
  for sorted dp ∈ DP do
    while dp has children c do
      dp[t_i] ← table information from c_i
  return DP

In the next step, the DaskPlanner converts the Daskplan into Python code that utilizes the Dask API. The details of each table and its column names and indexes are tracked and maintained in a dynamic dictionary throughout the execution of a query, because multiple tables and columns may be created, removed or manipulated during execution. For this reason, the names of the tables, columns and indexes are maintained dynamically while the Daskplan is transformed into Python code. A few Dask features also had to be adapted along the way, such as returning the nth smallest or largest values over a multi-index.
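The following is a short Python rendering of Algorithm 1 (our sketch; the operator-conversion helpers and the child-traversal function are placeholders for DaskDB internals that the paper does not spell out and are passed in as parameters):

def build_daskplan(physical_plan, convert_op, get_metadata, children_of):
    """Convert a physical plan (ordered groups of (key, op, meta) tuples)
    into a Daskplan: an ordered list of Dask-API-like operator dicts."""
    daskplan = []
    for group in physical_plan:                  # ordered groups of dependent operators
        for key, op, meta in group:
            daskplan.append({
                'op': convert_op(op),            # map to a Dask-API-like operator
                'meta': get_metadata(meta),      # columns, tables, operator options
                'key': key,                      # used to resolve data dependencies
            })
    for dp in daskplan:                          # pull table info up from child operators
        for child in children_of(dp):
            dp.setdefault('tables', []).append(child['meta'])
    return daskplan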
4) DaskDB Execution Engine:
The DaskDB Execution Engine has three main components: the client, the scheduler and the workers. The client transforms the generated Dask Python code into a set of tasks. The scheduler creates a DAG (directed acyclic graph) from the set of tasks and automatically partitions the data into chunks while taking data dependencies into account. The scheduler sends one task at a time to each worker according to several scheduling policies, which depend on various factors including data locality. A worker stores a data chunk until it is no longer needed and the scheduler instructs it to release it.
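For concreteness, this is how generated code is typically handed to a Dask cluster (a generic Dask sketch, not DaskDB-specific; the scheduler address, file path and persist call are illustrative):

from dask.distributed import Client
import dask.dataframe as dd

client = Client('tcp://scheduler-host:8786')    # connect to the scheduler coordinating the workers
ddf = dd.read_csv('hdfs:///tpch/lineitem.csv')  # workers read partitions in parallel
ddf = client.persist(ddf)                       # keep partitions in distributed worker memory
result = ddf.l_extendedprice.sum().compute()    # scheduler executes the task graph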
5) HDFS:
The Hadoop Distributed File System (HDFS) [23] is a storage system used by Hadoop applications. DaskDB uses HDFS to store and share the data files among its nodes.
B. Illustration of SQL query execution
An in situ query is executed within DaskDB by calling the query function with the SQL string as its argument. The query in Figure 2 is a simplified version of a typical TPC-H query. The Daskplan shown in Figure 3 is generated from the physical plan in the DaskPlanner component; its operators closely resemble the Dask API, for example the read_csv and filter operators shown in the tree. This Daskplan is then converted into executable Python code, which is omitted due to space constraints.

from daskdb_core import query

sql = """SELECT l_orderkey,
                sum(l_extendedprice * (1 - l_discount)) as revenue
         FROM orders, lineitem
         WHERE l_orderkey = o_orderkey and o_orderdate >= '1995-01-01'
         GROUP BY l_orderkey
         ORDER BY revenue LIMIT 5;"""
query(sql)

Fig. 2: Code showing SQL query execution
Fig. 3: Generated Daskplan for the code in Figure 2: read_csv(lineitem) and read_csv(orders); filter o_orderdate >= '1995-01-01' on orders; DaskDBJoin on o_orderkey = l_orderkey; ColumnMapping to l_orderkey, l_extendedprice, l_discount; NewColumn revenue = l_extendedprice * (1 - l_discount); groupby(l_orderkey) with aggregate sum(revenue); order by revenue; limit 5.
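Although the generated code itself is omitted, a plausible hand-written equivalent of the plan in Figure 3 using plain Dask dataframe APIs (our sketch; the file paths and HDFS layout are assumptions) looks roughly like this:

import dask.dataframe as dd

# Read only the columns the plan needs (column pruning from the Daskplan).
lineitem = dd.read_csv('hdfs:///tpch/lineitem.csv',
                       usecols=['l_orderkey', 'l_extendedprice', 'l_discount'])
orders = dd.read_csv('hdfs:///tpch/orders.csv',
                     usecols=['o_orderkey', 'o_orderdate'])

orders = orders[orders.o_orderdate >= '1995-01-01']                   # filter
joined = lineitem.merge(orders, left_on='l_orderkey',
                        right_on='o_orderkey')                        # DaskDBJoin
joined['revenue'] = joined.l_extendedprice * (1 - joined.l_discount)  # NewColumn
result = (joined.groupby('l_orderkey')['revenue'].sum()               # groupby + aggregate
                .nsmallest(5)                                         # ORDER BY revenue LIMIT 5
                .compute())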
C. Support for SQL query with UDFs
DaskDB supports UDFs in SQL as part of in situ querying. A UDF enables a user to define a function in Python and embed it in an SQL query. Since DaskDB converts the SQL query, including the UDF, back into Python code, UDFs can reference and utilize features from any of the data science packages available in the Anaconda Python ecosystem. Spark introduced UDFs in version 0.7, but they operated one row at a time and thus suffered from high serialization and invocation overhead. Later, Spark added Pandas UDFs, which provide low-overhead, high-performance UDFs entirely in Python. Although Pandas UDFs are efficient, they are restrictive to use in queries and sometimes require Spark's own data types, which is inconvenient for users who are not experienced with Spark. In contrast, UDFs for SQL queries can easily be written in DaskDB: any native Python function (either imported from an existing package or custom-made) that accepts Pandas dataframes as parameters can be applied as a UDF in a DaskDB SQL query. The return type of a UDF is not fixed, unlike Spark's Pandas UDFs, which allows users to design UDFs with ease. Like any general Python function, UDFs involving machine learning, data visualization and numerous other functionalities can easily be developed and applied to queries in DaskDB.
D. Illustration of SQL query with UDF
We illustrate a UDF with SQL in DaskDB using K-Means clustering. Similar to Spark, a UDF needs to be registered with the DaskDB system using the register_udf API. As shown in Figure 4, the UDF myKMeans takes as input a single Pandas dataframe with two columns; hence the UDF is registered as register_udf(myKMeans, [2]). As part of the query, the UDF is invoked as myKMeans(l_discount, l_tax), which means that after the selection condition (l_orderkey < 50) and the limit (limit 50) are applied to the lineitem relation, the columns l_discount and l_tax together form a Pandas dataframe that is passed to myKMeans. The output of the code is plotted in Figure 5, where the different clusters are shown in different colours.

from daskdb_core import query, register_udf
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def myKMeans(df):
    kmeans = KMeans(n_clusters=4).fit(df)
    col1 = list(df.columns)[0]
    col2 = list(df.columns)[1]
    plt.scatter(df[col1], df[col2],
                c=kmeans.labels_.astype(float), s=50)
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

register_udf(myKMeans, [2])
sql_kmeans = """select myKMeans(l_discount, l_tax)
                from lineitem where l_orderkey < 50 limit 50;"""
query(sql_kmeans)

Fig. 4: UDF code showing K-Means Clustering
E. Distributed Learned Index
Join is considered one of the most expensive data operations. We propose a novel distributed learned index to accelerate the distributed execution of joins in DaskDB. Our distributed learned index relies on the Heaviside step function [24]: any step function can be represented as a combination of multiple Heaviside step functions, which forms the basis of our learned index structure. The Heaviside step function is defined as

H(x) = 0 if x < 0, and 1 if x ≥ 0.    (1)

We define a Partition Function P as

P_{a,b,c}(y) = H((b − y) ∗ (y − a)) ∗ c,    (2)

which returns c whenever a ≤ y ≤ b, and 0 otherwise.
While constructing the distributed learned index, it is assumed that one of the relations is sorted by the join attribute. Hence, if the index maintains the first and last key values of each partition, then given any key, the partition containing it can be identified by a Partition Function. Such a sparse index can entail huge storage savings, since not all keys need to be stored. We illustrate this with a simplified example using the customer table from TPC-H, where c_custkey is the primary key. If the relation has 500 tuples and each partition can store 100 tuples, then there are 5 partitions. The distribution of the keys is shown in Table II and plotted in Figure 6. Assuming the 500 c_custkey values are 1 through 500, the plot is a step function f, where

f(key) = 1 for 1 ≤ key ≤ 100; 2 for 101 ≤ key ≤ 200; 3 for 201 ≤ key ≤ 300; 4 for 301 ≤ key ≤ 400; 5 for 401 ≤ key ≤ 500,    (3)

which can equivalently be represented by summing several Partition Functions, giving the Learned Index Function L_customer on the customer table:

L_customer(key) = P_{1,100,1}(key) + P_{101,200,2}(key) + P_{201,300,3}(key) + P_{301,400,4}(key) + P_{401,500,5}(key).

Next, we explain how our learned index can be used for join queries. Let A and B be two relations that need to be joined on columns col_A of A and col_B of B. Without loss of generality, we assume col_A is sorted and a Learned Index Function L_A is generated on this column. Since DaskDB internally uses Dask APIs, the relations are converted to Dask dataframes before joining, and each Dask dataframe consists of several partitions. For each tuple t of a partition of B, L_A(t[col_B]) is computed (returning the number of the partition of A to which t[col_B] belongs) and appended to t as a new 'Partition' column. This is done in parallel for every partition of B. Then, for each partition number i of A, all tuples t of B with t['Partition'] = i are selected and joined with the i-th partition of A. Partition identification, partition selection and the partition-wise joins are each executed in parallel. For joining the dataframes, we developed a variant of the merge API of the Pandas package.
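A compact Python sketch of these ideas (our illustration; the partition boundaries, column names and routing step are simplified, and the actual DaskDB implementation operates on distributed Dask dataframe partitions):

import pandas as pd

def heaviside(x):
    return 1 if x >= 0 else 0                           # H(x), Eq. (1)

def make_partition_fn(a, b, c):
    return lambda y: heaviside((b - y) * (y - a)) * c   # P_{a,b,c}(y), Eq. (2)

def build_learned_index(partitions):
    """partitions: list of (first_key, last_key) per sorted partition of A.
    Returns L_A(key) -> 1-based partition number (0 if key is out of range)."""
    fns = [make_partition_fn(lo, hi, i + 1) for i, (lo, hi) in enumerate(partitions)]
    return lambda key: sum(f(key) for f in fns)

# Sparse index over five partitions of the sorted join column of A.
L_A = build_learned_index([(1, 100), (101, 200), (201, 300), (301, 400), (401, 500)])

def route(b_chunk: pd.DataFrame, col_b: str) -> pd.DataFrame:
    # Tag every tuple of a partition of B with the A-partition it joins with,
    # so only the matching partition pairs need to be co-located and merged.
    b_chunk = b_chunk.copy()
    b_chunk['Partition'] = b_chunk[col_b].map(L_A)
    return b_chunk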
V. EVALUATION
In this section, we present the experimental setup, the TPC-H benchmark results and the custom UDF benchmark results.
A. Experimental Setup
DaskDB was implemented in Python by extending Dask. The SQLParser of DaskDB utilizes the tool sql-metadata [25], and the QueryPlanner extends Raco [26]. We ran experiments on a cluster of 8 machines, each with 16 GB of memory and 8 Intel(R) Xeon(R) CPUs running at 3.00 GHz under Ubuntu 16.04. We evaluated DaskDB against two systems that support both SQL query execution and data analytics: PySpark and Hive/Hivemall (henceforth referred to as Hivemall). HDFS was used to store the datasets for each system. The versions of Python, PySpark and Hive were 3.7.6, 3.0.1 and 2.1.0, respectively. PySpark and Hivemall were allocated the maximum resources available (i.e., cores and memory).
TABLE I: Queries with UDF. One SQL query per task (LR, K-Means, Quantiles, CGO), each invoking the corresponding UDF on the lineitem relation; e.g. LR invokes myLinearFit(l_discount, l_tax).
Fig. 5: K-Means output (scatter plot of l_discount vs. l_tax, clusters shown in different colours).
Fig. 7: Execution times for TPC-H benchmark queries Q1, Q3, Q5, Q6 and Q10 with Hivemall, PySpark and DaskDB; panels (a)-(d) correspond to scale factors 1, 5, 10 and 20 (GB); y-axis: time (seconds).
Fig. 8: Execution times for the SQL queries with UDFs (LR, K-Means, Quantiles, CGO) with PySpark and DaskDB; panels (a)-(d) correspond to scale factors 1, 5, 10 and 20; the PySpark K-Means execution at SF 20 took too long to complete.
TABLE II: Sparse index for customer relation (columns: Begin, End, Partition; one row per partition).
Fig. 6: Keys vs. partition number plot for the customer relation.
B. TPC-H Benchmark Evaluation Results
We evaluated the systems with several queries from the TPC-H decision support benchmark [10]. We used four scale factors (SF): 1, 5, 10 and 20, where SF 1 corresponds to roughly 1 GB. We executed five TPC-H queries, and the results are plotted in Figure 7. DaskDB outperforms PySpark and Hivemall on all queries at all scale factors, and Hivemall performs worse than both DaskDB and PySpark in all cases. In general, the higher the SF, the larger the performance gap. For instance, with Q10, DaskDB is 3.5x faster than PySpark at SF 1 and 4.7x faster at SF 20. DaskDB achieves a speedup of 182x with Q5 at SF 20. The superior performance of DaskDB can be credited to its efficient data distribution, its join implementation using the distributed learned index, and selective data persistence.

C. UDF Benchmark Evaluation Results
We developed a custom UDF benchmark consisting of four machine learning tasks with UDFs: LR (Linear Regression), K-Means (K-Means Clustering), Quantiles (Quantiles Estimation) and CGO (Conjugate Gradient Optimization). They were implemented using the machine learning packages available in Python. For each task, the SQL queries in Table I were executed. The UDFs were written to perform the same task in both systems by importing the same Python packages. For this evaluation, the performance of DaskDB was compared with that of PySpark; Hivemall results are omitted due to its poor performance. The results are plotted in Figure 8. As with the TPC-H benchmark, DaskDB outperforms PySpark for all machine learning tasks at all scale factors. Among the tasks, K-Means performs worst in PySpark: DaskDB is 28.5x faster than PySpark at SF 1 and 64x faster at SF 10, and at SF 20 PySpark took too long to finish K-Means and could not be measured, whereas DaskDB took only about 41 seconds. For Quantiles, DaskDB was 4x and 16.6x faster than PySpark at SF 1 and SF 20, respectively. These results also show that the larger the SF, the better DaskDB performs compared to PySpark.

VI. CONCLUSION
We presented DaskDB, a scalable data science system. It brings in situ SQL querying to a data science platform in a way that supports high usability, performance, scalability and built-in capabilities. DaskDB can also incorporate UDFs into SQL queries, where a UDF may invoke any Python library call. Furthermore, we introduced a novel distributed learned index that accelerates the join/merge operation. We evaluated DaskDB against two state-of-the-art systems, PySpark and Hive/Hivemall, using the TPC-H benchmark and a custom UDF benchmark, and showed that DaskDB significantly outperforms both.
REFERENCES
[1] J. M. Hellerstein et al., "The MADlib Analytics Library: Or MAD Skills, the SQL," PVLDB, vol. 5, no. 12, pp. 1700-1711, 2012.
[2] A. Kemper and T. Neumann, "HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots," in ICDE, 2011, pp. 195-206.
[3] J. V. D'silva, F. De Moor, and B. Kemme, "AIDA: Abstraction for Advanced In-database Analytics," PVLDB, vol. 11, no. 11, 2018.
[4] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, "NoDB: Efficient query execution on raw data files," in SIGMOD, 2012, pp. 241-252.
[5] Y. Cheng and F. Rusu, "Parallel in-situ data processing with speculative loading," in SIGMOD, 2014, pp. 1287-1298.
[6] M. Olma et al., "Adaptive partitioning and indexing for in situ query processing," VLDB J., vol. 29, no. 1, pp. 569-591, 2020.
[7] M. Zaharia et al., "Spark: Cluster computing with working sets," in USENIX HotCloud, 2010, pp. 10-10.
[8] "Hivemall," https://hivemall.apache.org/.
[9] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Python in Science Conference, 2015.
[10] "TPC-H Benchmark," http://www.tpc.org/tpch/.
[12] "Anaconda Software Distribution," https://www.anaconda.com/.
[13] "Tableau," https://www.tableau.com/.
[14] "MATLAB," https://www.mathworks.com/products/matlab.html.
[15] SSDBM, 2018.
[16] J. Lajus and H. Mühleisen, "Efficient data management and statistics with zero-copy integration," in SSDBM, 2014.
[17] M. Raasveldt and H. Mühleisen, "Data management for data science - towards embedded analytics," in CIDR, 2020.
[18] X. Meng et al., "MLlib: Machine Learning in Apache Spark," Journal of Machine Learning Research, vol. 17, no. 34, pp. 1-7, 2016.
[19] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in SIGMOD, 2018, pp. 489-504.
[20] A. Galakatos et al., "FITing-Tree: A data-aware index structure," in SIGMOD, 2019, pp. 1189-1206.
[21] P. Ferragina and G. Vinciguerra, "The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds," PVLDB, vol. 13, no. 8, pp. 1162-1175, 2020.
[22] "PostgresRAW," https://github.com/HBPMedical/PostgresRAW.
[23] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in MSST, 2010, pp. 1-10.
[24] "Heaviside step function," https://en.wikipedia.org/wiki/Heaviside_step_function.
[25] "sql-metadata," https://pypi.org/project/sql-metadata/.
[26] J. Wang et al., "The Myria Big Data Management and Analytics System and Cloud Services," in CIDR, 2017.