Building your Cross-Platform Application with RHEEM

Sanjay Chawla, Bertty Contreras-Rojas, Zoi Kaoudi, Sebastian Kruse*, Jorge-Arnulfo Quiané-Ruiz
Qatar Computing Research Institute, Hamad Bin Khalifa University
Hasso Plattner Institute, University of Potsdam
http://da.qcri.org/rheem/
ABSTRACT
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging because it requires good expertise in all the available data processing platforms. In this report, we present Rheem, a general-purpose cross-platform data processing system that relieves users from the pain of finding the most efficient data processing platform for a given task. It also splits a task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). To offer cross-platform functionality, it features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Rheem is released under an open source license.
1. INTRODUCTION
The pursuit of comprehensive, efficient, and scalable data analytics, as well as the one-size-does-not-fit-all dictum, have given rise to a plethora of data processing platforms (platforms for short). These specialized platforms include DBMSs, NoSQL systems, and MapReduce-like platforms. In fact, just under the umbrella of NoSQL, there are reportedly over 200 different platforms (see http://db-engines.com). Each excels in specific aspects, allowing applications to achieve high performance and scalability. For example, while Spark supports Select queries, Postgres can execute them much faster by using indices. However, Postgres is not as good as Spark for general-purpose batch processing, where parallel full scans are the key performance factor. Several studies have shown this kind of performance differences [20, 32, 36, 50, 57].

Moreover, today's data analytics is moving beyond the limits of a single platform. For example: (i) IBM reported that North York hospital needs to process 50 diverse datasets, which run on a dozen different platforms [35]; (ii) airlines need to analyze large datasets, which are produced by different departments, are of different data formats, and reside on multiple data sources, to produce global reports for decision makers [9]; (iii) Oil & Gas companies need to process large amounts of diverse data spanning various platforms [19, 34]; (iv) several data warehouse applications require data to be moved from a MapReduce-like system into a DBMS for further analysis [27, 53]; and (v) using multiple platforms for machine learning improves performance significantly [20, 36].

To cope with these new requirements, developers (or data scientists) have to write ad-hoc programs and scripts to integrate different platforms. This is not only a tedious, time-consuming, and costly task, but it also requires knowledge of the intricacies of the different platforms to achieve high efficiency and scalability.

(* Work partially done while interning at QCRI.)
Some systems have appeared with the goal of facilitating platform integration [2, 4, 10, 12]. Nonetheless, they all require a good deal of expertise from developers, who still need to decide which processing platforms to use for each task at hand. Recent research has taken steps towards transparent cross-platform execution [15, 28, 32, 43, 55, 56], but lacks several important aspects. Usually these efforts do not automatically map tasks to platforms. Additionally, they do not consider complex data movement (i.e., with data transformations) among platforms [28, 32]. Finally, most of the research focuses on specific applications [15, 43, 55].

Therefore, there is a clear need for a systematic approach to enable efficient cross-platform data processing, i.e., the use of multiple data processing platforms. The Holy Grail would be to replicate the success of DBMSs for cross-platform data processing: users simply send their tasks expressing the logic of their applications, and the cross-platform system decides on which platform(s) to execute each task with the goal of minimizing its cost (e.g., runtime or monetary cost). In other words, users focus on the high-level details and the cross-platform system takes care of the low-level details.

Building a cross-platform system is challenging on numerous fronts: (i) a cross-platform system not only has to effectively find all the suitable platforms for a given task, but also has to choose the most efficient one; (ii) cross-platform settings are characterized by high uncertainty, as different platforms are autonomous and thus one has little control over them; (iii) the performance gains of using multiple platforms should compensate the added cost of moving data across platforms; (iv) it is crucial to achieve inter-platform parallelism to prevent slow platforms from dominating execution time; and (v) the system should be extensible to new platforms and application requirements.

Figure 1: Rheem in the data analytics stack.

In this report, we present
Rheem, the first general-purpose cross-platform system to tackle all of the above challenges. The goal of Rheem is to enable applications and users to run data analytic tasks efficiently on one or more data processing platforms. To do so, it decouples applications from platforms, as shown in Figure 1. Applications issue their tasks to Rheem, which in turn decides where to execute them. As of today, Rheem supports a variety of platforms: Spark, Flink, JavaStreams, Postgres, GraphX, GraphChi, and Giraph. We are currently testing Rheem in a large international airline company and in a biomedical research institute. In the former case, we aim at seamlessly integrating all data analytic activity governing an aircraft; in the latter case, we aim at reducing the effort scientists need for building data analytic pipelines while at the same time speeding up the running time. Several papers show different aspects of Rheem: the vision behind it [17]; its optimizer [39]; its inequality join algorithm [38]; and a couple of its applications [36, 37]. A couple of demo papers showcase the benefits of Rheem [16] and its interface [44]. This report aims at presenting the complete design of Rheem and how all its pieces work together.

In summary, we identify four situations in which applications require support for cross-platform data processing in Section 2. For each case, we use a real application to show experimentally the benefits of cross-platform data processing using Rheem. In Section 3, we present the data and processing model of Rheem and show how it shields users from the intricacies of the underlying platforms. Rheem provides flexible operator mappings that allow for better exploiting the underlying platforms. Also, its extensible design allows users to add new platforms and operators with very little effort. Then, in Section 4, we discuss the key components of Rheem that make it novel: among them, a cost-based cross-platform optimizer that considers data movement costs; a progressive optimization mechanism to deal with inconsistent cardinality estimates; and a learning tool that relieves users from the burden of tuning the cost model. We present the Rheem interfaces whereby users can easily code and run a data analytic task in Section 5. In particular, we present a data-flow language (RheemLatin) and a visual integrated development environment (Rheem Studio). In Section 6, we show in detail three examples of real Rheem plans to better illustrate how developers can build their applications using these interfaces. Section 8 outlines the limitations of Rheem. Finally, we discuss related work in Section 9 and conclude with some open problems in Section 10.
2. CROSS-PLATFORM PROCESSING
We identified four situations in which an application requires support for cross-platform data processing [51]. Figure 2 illustrates these four cases.

(1) Platform independence. Applications run an entire task on a single platform but may require switching platforms for different input datasets or tasks, usually with the goal of achieving better performance (Figure 2(a)).

(2) Opportunistic cross-platform. Applications might also benefit performance-wise from using multiple platforms to run one single task (Figure 2(b)).

(3) Mandatory cross-platform. Applications may require multiple platforms because the platform where the input data resides, e.g., PostgreSQL, cannot perform the incoming task, e.g., a machine learning task. Thus, data should be moved from the platform where it resides to another platform (Figure 2(c)).

(4) Polystore. Applications may require multiple platforms because the input data is stored on multiple data stores (Figure 2(d)).

In contrast to existing systems [28, 29, 32, 55, 58], Rheem helps users in all of the above cases. The design of our system has been mainly driven by four applications: a data cleaning application, BigDansing [37]; a machine learning application, ML4all [36]; a database application, xDB; and an end-to-end data discovery and preparation application, Data Civilizer [31]. We use these applications to showcase the benefits of performing cross-platform data processing, instead of single-platform data processing, in terms of both performance and ease of use.

(Rheem is open source under the Apache Software License 2.0 and can be found at https://github.com/rheem-ecosystem/rheem.)
Applications are usually tied to a specific platform. This may not constitute the ideal case, for two reasons. First, as more efficient platforms become available, developers need to re-implement existing applications on top of these new platforms. For example, Spark SQL [14] and MLlib [13] are the Spark counterparts of Hive [6] and Mahout [7]. Migrating an application from one platform to another is a time-consuming and costly task and hence is not always a viable choice. Second, for different inputs of a specific task, a different platform may be the most efficient one, so the best platform cannot be determined statically. For instance, running a specific task on a big data platform for very large datasets is often a good choice, while single-node platforms with only little overhead are often a better choice for small datasets [20]. Thus, enabling applications to seamlessly switch from one platform to another according to the input dataset and task is important. Rheem dynamically determines the best platform to run an incoming task.
Benefits. We use BigDansing [37] to demonstrate the benefits of providing platform independence. Users specify a data cleaning task with five logical operators: Scope (identifies relevant data), Block (defines the groups of data among which an error may occur), Iterate (enumerates candidate errors), Detect (determines whether a candidate error is indeed an error), and GenFix (generates a set of possible repairs). Rheem maps these operators to
Rheem operators to decide the best underlying platform. We show the power of supporting cross-platform data processing by running an error detection task on a widely used Tax dataset [30]. The task is based on the denial constraint ∀t1, t2, ¬(t1.Salary > t2.Salary ∧ t1.Tax < t2.Tax), which states that there is an inconsistency between two tuples representing two different persons if one earns a higher salary but pays a lower tax. We considered NADEEF [24], a data cleaning tool, and SparkSQL, a general-purpose framework, as baselines, and forced Rheem to use either Spark or JavaStreams per run.

Figure 2: Cross-platform cases.

Figure 3: Benefits of the cross-platform data processing approach (using Rheem).

Figure 3(a) shows the results. Overall, we observe that Rheem (DC@Rheem) allows data cleaning tasks to scale up to large datasets and to be at least three orders of magnitude faster than the baselines. One order of magnitude of this gain comes from the ability of Rheem to automatically switch platforms: Rheem used JavaStreams for small datasets, speeding up the data cleaning task by avoiding Spark's overhead, while it used Spark for the largest datasets. Furthermore, in contrast to SparkSQL, which cannot process inequality joins efficiently, Rheem's extensibility allowed us to plug in a more efficient inequality-join algorithm [38], thereby further improving over these baselines. In a nutshell, BigDansing benefited from Rheem because of its ability to effectively switch platforms and because of its extensibility to easily plug in optimized algorithms. We demonstrated how BigDansing benefits from Rheem in [16].
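The violation check behind this denial constraint amounts to an inequality self-join. The sketch below detects violating tuple pairs naively in plain Python; it is purely illustrative (made-up records, not BigDansing's or Rheem's API, and not the optimized inequality-join algorithm of [38]):

```python
# Naive O(n^2) check of the denial constraint: a pair (t1, t2) is an
# error candidate when t1 earns a higher salary but pays a lower tax.
from itertools import permutations

def detect_violations(tuples):
    """Return all ordered pairs (t1, t2) with t1.Salary > t2.Salary
    and t1.Tax < t2.Tax."""
    return [
        (t1, t2)
        for t1, t2 in permutations(tuples, 2)
        if t1["Salary"] > t2["Salary"] and t1["Tax"] < t2["Tax"]
    ]

people = [
    {"Name": "a", "Salary": 90_000, "Tax": 10_000},
    {"Name": "b", "Salary": 60_000, "Tax": 12_000},  # more tax on less salary
    {"Name": "c", "Salary": 30_000, "Tax": 3_000},
]
violations = detect_violations(people)
# One violation: ("a", "b") — a earns more than b but pays less tax.
```

A quadratic scan like this is exactly what SparkSQL falls back to for inequality joins, which is why plugging in a specialized inequality-join operator pays off at scale.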
While some applications can be executed on a single platform, there are cases where their performance would be sped up by using multiple platforms. For instance, users can run a gradient descent algorithm, such as SGD, on top of Spark relatively fast. Still, we recently showed that mixing it with JavaStreams significantly improves performance [36]. In fact, opportunistic cross-platform processing can be seen as the execution counterpart of polyglot persistence [52], where different types of databases are combined to leverage their individual strengths. However, developing such cross-platform applications is difficult: developers must know all the cases where it is beneficial to use multiple platforms and how exactly to use them. These opportunities are often very hard (if not impossible) to spot. Even worse, as in the platform-independence case, they usually cannot be determined a priori. Rheem finds and exploits opportunities of using multiple processing platforms.

(In Figure 3, a red cross means we stopped the execution after 40 hours.)
Benefits. Let us now take our machine learning application, ML4all [36], to showcase the benefits of using multiple platforms to perform one single task. ML4all abstracts three fundamental phases (namely preparation, processing, and convergence) found in most machine learning tasks via seven logical operators, which are mapped to Rheem operators. In the preparation phase, the dataset is prepared appropriately, along with the necessary initialization of the algorithm (Transform and Stage operators). The processing phase computes the gradient and updates the current estimate of the solution (Sample, Compute, and Update operators), while the convergence phase repeats the processing phase based on the number of iterations or other criteria (Loop and Converge operators). We demonstrate the benefits of using Rheem with a classification task over three benchmark datasets, using stochastic gradient descent (SGD).

Figure 3(b) shows the results. We observe that, even though all systems use the same SGD algorithm, Rheem allows this algorithm to run significantly faster than competing Spark-based systems. This is for two main reasons. First, the gain comes from opportunistically running the Compute, Update, Converge, and Loop operators on JavaStreams, thereby avoiding some of Spark's overhead; Rheem runs the rest of the operators on Spark. MLlib and SystemML do not avoid such overhead, as they purely use Spark for the entire algorithm. Second, ML4all leverages Rheem's extensibility to plug in an efficient sampling operator, resulting in significant speedups. We demonstrated how ML4all further benefits from Rheem in [16].
There are cases where an application needs to go beyond the functionality offered by the platform on which the data is stored. For instance, a dataset is stored in a relational database and a user needs to perform a clustering task on particular attributes. Doing so inside the relational database might simply be disastrous in terms of performance. Thus, the user needs to move the projected data out of the relational database and, for example, put it on HDFS in order to use Apache Flink [3], which is known to be efficient for iterative tasks. A similar situation occurs in complex data analytics applications with disparate subtasks. As an example, an application that extracts a graph from a text corpus to perform subsequent graph analytics may require using both a text and a graph analytics system. The required integration of platforms is tedious, repetitive, and particularly error-prone. Nowadays, developers write ad-hoc programs to move the data around and integrate different platforms. Rheem not only selects the right platforms for each task but also moves the data, if necessary, at execution time.

Benefits.
We use xDB, a system on top of Rheem with database functionalities, to demonstrate the benefits of performing cross-platform data processing in the above situation. It provides a declarative language to compose data analytic tasks, while its optimizer produces a plan to be executed in Rheem. We evaluate the benefits of Rheem with the cross-community PageRank task, which is not only hard to express in SQL but also inefficient to run on a DBMS. Thus, it is important to move the computation to another platform. In this experiment, the input datasets are on Postgres and Rheem moves the data into Spark.

Figure 3(c) shows the results. As a baseline, we consider the ideal case where the data is on HDFS and Rheem simply uses either JavaStreams or Spark to run the tasks. We observe that Rheem allows xDB (xDB@Rheem) to achieve performance similar to the ideal case in all situations, while fully automating the process. This is a remarkable result, as Rheem needs to move data out of Postgres to perform the tasks, in contrast to the ideal case.
In many organizations, data is collected in different formats and on heterogeneous storage platforms (data lakes). Typically, a data lake comprises various DBMSs, document stores, key-value stores, graph databases, and pure file systems. As most of these stores are tightly coupled with an execution engine, e.g., a DBMS, it is crucial to be able to run analytics over multiple platforms. For this, users perform not only tedious, time-intensive, and costly data migration, but also complex integration tasks for analyzing the data. Rheem shields users from all these tedious tasks and allows them to instead focus on the logic of their applications.

Benefits.
A clear example that shows the benefits of cross-platform data processing in a polystore case is the Data Civilizer system [31]. Data Civilizer is a big data management system for data discovery, extraction, and cleaning from data lakes in large enterprises [26]. It constructs a graph that expresses relationships among data existing in heterogeneous data sources. Data Civilizer uses Rheem to perform complex tasks over information that spans multiple data storages. We measure the efficiency of Rheem for these polystore tasks with TPC-H query 5. In this experiment, we assume that the data is stored in HDFS (LINEITEM and ORDERS), Postgres (CUSTOMER, REGION, and SUPPLIER), and a local file system (NATION). Thus, this task performs join, group-by, and order-by operations across three different platforms. In this scenario, the common practice is either to move the data into the database to enact the queries inside the database [27, 53] or to move the data entirely to HDFS and use Spark. We consider these two practices as the baselines. For a fairer comparison, we also set the “parallel query” and “effective IO concurrency” features of Postgres to 4.

Figure 3(d) shows the results. Rheem (DataCiv@Rheem) is significantly faster, namely up to 5×, than the current practice. We observed that merely loading the data into Postgres is already approximately 3× slower than it takes Rheem to complete the entire task. Even when discarding data migration times, Rheem still performs quite similarly to the parallel version of Postgres: the pure execution time in Postgres for scale factor 100 amounts to 1,541 sec, compared to 1,608 sec for Rheem, which exploits Spark for data parallelism. We also observe that Rheem has negligible overhead over the case where the developer writes ad-hoc scripts to move the data to HDFS for running the task on Spark. In particular, Rheem is twice as fast as Spark for scale factor 1 because it moves less data from Postgres to Spark.

(xDB is available at https://github.com/rheem-ecosystem/xdb. The cross-community PageRank task intersects two community DBpedia datasets and runs PageRank on the resulting dataset.)

Figure 4: SGD example.
3. RHEEM MODEL
First of all, let us emphasize that Rheem is not yet another data processing platform. On the contrary, it is designed to work between applications and platforms (as shown in Figure 1), helping applications to choose the right platform(s) for a given task. Rheem is the first general-purpose cross-platform system that shields users from the intricacies of the underlying platforms and lets them focus only on the logic of their applications. We define the Rheem data and processing models in the following.
Data Quanta. The Rheem data model relies on data quanta, the smallest processing units of the input datasets. A data quantum can express a large spectrum of data formats, such as database tuples, edges in a graph, or the full content of a document. This flexibility allows applications and users to define a data quantum at any granularity level, e.g., at the attribute level rather than at the tuple level for a relational database. This fine-grained data model allows Rheem to work in a highly parallel fashion, if necessary, to achieve better scalability and performance.
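To make the granularity choice concrete, the same relational tuple can be modeled as one coarse-grained quantum or several fine-grained ones. This is a schematic illustration only, not Rheem's data model API:

```python
# One relational tuple, viewed at two data-quantum granularities.
row = {"Name": "Alice", "Salary": 90_000, "Tax": 10_000}

tuple_quanta = [row]                   # coarse: one quantum per tuple
attribute_quanta = list(row.items())   # fine: one quantum per attribute value

# The fine-grained view exposes more independent units, which is what
# enables the highly parallel processing mentioned above.
```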
Rheem Plan. Rheem accepts as input a Rheem plan: a directed data flow graph whose vertices are Rheem operators and whose edges represent data flows among the operators. A Rheem operator is a platform-agnostic data transformation over its input data quanta, e.g., a Map operator transforms an individual data quantum, while a Reduce operator aggregates input data quanta into a single output data quantum. Only Loop operators accept feedback edges, which allows iterative data flows to be expressed. Users or applications can refine the behavior of operators with a UDF. Optionally, applications can also attach the selectivities of the operators through a UDF; Rheem comes with default selectivity values in case they are not provided. A Rheem plan must have at least one source operator, i.e., an operator reading or producing input data quanta, and one sink operator per branch, i.e., an operator retrieving or storing the result. Intuitively, data quanta flow from source to sink operators, thereby being manipulated by all inner operators. As our processing model is based on primitive operators, Rheem plans are highly expressive. This is in contrast to other systems that accept either declarative queries [32, 58] or coarse-granular operators [28].
Example. Figure 4(a) shows a Rheem plan for the stochastic gradient descent algorithm (SGD). Initially, the dataset containing the data points is read via a TextFileSource operator and parsed using a Map operator, while the initial weights are read via a Collection source operator. After the RepeatLoop operator, the weights are fed to the Sample operator, where a set of input data points is sampled. Next, Map (compute) computes the gradient for each sampled data point. Note that, as Map (compute) requires all weights to compute the gradient, the weights are broadcast at each iteration to the Sample operator (denoted by the dotted line). Then, the Reduce operator computes the sum and count of all gradients. The next Map operator uses these sum and count values to update the weights. This process is repeated until the loop condition is satisfied. The resulting weights are output in a collection sink.
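The operator pipeline above can be traced with a toy, sequential rendering in plain Python. The comments name the corresponding operators of Figure 4(a); the function itself is purely illustrative (a one-dimensional least-squares SGD), not Rheem's operator API:

```python
import random

def run_sgd_plan(lines, iterations=100, lr=0.1, seed=7):
    rnd = random.Random(seed)
    # TextFileSource + Map(parse): each "x,y" line becomes a float pair.
    points = [tuple(map(float, ln.split(","))) for ln in lines]
    w = 0.0                                  # Collection source: initial weight
    for _ in range(iterations):              # RepeatLoop
        batch = rnd.sample(points, 2)        # Sample (weights broadcast here)
        grads = [2 * (w * x - y) * x for x, y in batch]  # Map(compute)
        s, c = sum(grads), len(grads)        # Reduce(sum & count)
        w -= lr * s / c                      # Map(update)
    return w                                 # Collection sink

weight = run_sgd_plan(["1,2", "2,4", "3,6"])  # points on the line y = 2x
# weight converges to roughly 2.0
```

In the actual execution plan of Figure 4(b), the loop body around the small weight collection is what Rheem places on JavaStreams, while the parsing and sampling of the large point dataset run on Spark.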
Execution Plan. Given a Rheem plan as input, Rheem uses a cost-based optimization approach to produce an execution plan, selecting one or more platforms to efficiently execute the input plan. The cost can be any user-specified cost, e.g., runtime or monetary cost. The resulting execution plan is again a data flow graph, where the vertices are now execution operators. An execution operator implements one or more Rheem operators with platform-specific code. For instance, the Cache Spark execution operator in Rheem implements the Cache Rheem operator by calling the RDD.cache() operation of Spark. An execution plan may also comprise additional execution operators for data movement (e.g., data broadcasting) or data reuse (e.g., data caching). Additionally, each execution operator has attached a UDF where its cost is specified. Rheem learns such costs from execution logs using machine learning. We discuss more details in Section 4.5.

Example. Figure 4(b) shows the SGD execution plan produced by Rheem when Spark and JavaStreams are the only available platforms. This execution plan exploits high parallelism for the large dataset of input data points and avoids the extra overhead incurred by big data processing platforms for the smaller collection of weights. Note that the execution plan also contains three execution operators for transferring data (Broadcast, Collect) and for making data quanta reusable across the platforms (Cache).

Figure 5: Operator mappings.

Operator Mappings.
To produce an execution plan, Rheem relies on flexible m-to-n mappings from Rheem operators to execution operators. Supporting m-to-n mappings is particularly useful, as it allows whole subplans of Rheem operators to be mapped to subplans of execution operators. Additionally, a subplan of Rheem (or execution) operators can map to another subplan of Rheem (respectively, execution) operators. As a result, we can handle different abstraction levels among platforms, e.g., to emulate Rheem operators that are not natively supported by a specific platform. This is not possible in other systems, such as [28].

Example. Figure 5 illustrates the mappings for the Reduce Rheem operator. This operator directly maps to the Reduce Spark execution operator via a 1-to-1 mapping (mapping (a)). However, it does not have a direct mapping to a JavaStreams execution operator. Instead, it maps to a set of Rheem operators (GroupBy and Map) via a 1-to-n mapping (mapping (b)) and vice versa (n-to-1 mapping (c)). In turn, this set of Rheem operators maps to a set of JavaStreams execution operators (GroupBy and Map) via an m-to-n mapping (mapping (d)).
Data Movement. Data flows among operators via communication channels (or simply channels). A channel can be any internal data structure within a data processing platform (e.g., an RDD for Spark or a Collection for JavaStreams), or simply a file. When two execution operators of different platforms are connected within a plan, it is necessary to convert the output channel of one into the input channel of the other (e.g., from RDD to Collection). These conversions are handled by conversion operators, which are in fact regular execution operators. For example, we can convert a Spark RDD channel into a JavaStreams Collection channel using the SparkCollect operator (see Figure 4(b)). We represent the space of data movement paths across all platforms as a channel conversion graph, where the channels form its vertices and the conversion operators form its directed edges, connecting a source channel to a target channel. Unlike other approaches [28, 32], developers do not need to provide conversion operators for all combinations of source and target channels. It is thus much easier for developers to add new platforms to Rheem.
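The channel conversion graph lends itself to a shortest-path formulation: channels are vertices, conversion operators are weighted edges, and data movement planning becomes a path search. The sketch below uses Dijkstra's algorithm over illustrative channels and costs (SparkCollect appears in the text; the other operator names and all costs are invented for the example):

```python
import heapq

# (source channel, target channel, conversion operator, illustrative cost)
CONVERSIONS = [
    ("RDD",        "Collection", "SparkCollect",     3.0),
    ("Collection", "RDD",        "SparkParallelize", 3.0),
    ("RDD",        "HDFS-File",  "SparkSaveAsText",  5.0),
    ("HDFS-File",  "RDD",        "SparkTextSource",  5.0),
    ("Collection", "LocalFile",  "JavaFileSink",     2.0),
]

def cheapest_conversion(src, dst):
    """Dijkstra over the conversion graph; returns (cost, operator path)."""
    graph = {}
    for s, t, op, c in CONVERSIONS:
        graph.setdefault(s, []).append((t, op, c))
    queue, seen = [(0.0, src, [])], set()
    while queue:
        cost, chan, path = heapq.heappop(queue)
        if chan == dst:
            return cost, path
        if chan in seen:
            continue
        seen.add(chan)
        for nxt, op, c in graph.get(chan, []):
            heapq.heappush(queue, (cost + c, nxt, path + [op]))
    return float("inf"), []

cost, ops = cheapest_conversion("RDD", "LocalFile")
# RDD -> Collection -> LocalFile via SparkCollect then JavaFileSink (cost 5.0)
```

Because paths compose transitively, a developer adding a platform only needs to connect its channels to a few existing ones rather than to every channel in the graph.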
Extensibility. We designed Rheem to address extensibility as a first-class citizen rather than as a “nice-to-have” feature. Users add new Rheem and execution operators by merely extending or implementing a few abstract classes/interfaces. Rheem provides template classes to facilitate the development of different operator types. Users also add operator mappings by simply implementing an interface and specifying a graph pattern that matches the Rheem operator. As a result, users can plug in a new platform by providing: (i) its execution operators and their mappings; and (ii) the communication channels that are specific to the new platform (e.g., RDDChannel for Spark). Users neither have to modify the Rheem code nor integrate the newly added platform with all the already supported platforms.

Figure 6: Rheem's ecosystem and architecture.
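In code, the extension contract roughly amounts to implementing a small operator interface. The sketch below is a loose Python analogy; Rheem's actual template classes are Java/Scala interfaces, and the Postgres filter operator shown here is invented for illustration:

```python
# Plugging in a platform-specific execution operator by implementing a
# minimal interface. Class and method names are illustrative only.
from abc import ABC, abstractmethod

class ExecutionOperator(ABC):
    platform = None                     # e.g., "Postgres", "Spark"

    @abstractmethod
    def execute(self, input_channel):
        """Consume an input channel and produce an output channel."""

class PostgresFilterOperator(ExecutionOperator):
    """Implements a filter with Postgres-specific code by pushing the
    predicate down into SQL (a hypothetical example operator)."""
    platform = "Postgres"

    def __init__(self, predicate_sql):
        self.predicate_sql = predicate_sql

    def execute(self, input_channel):
        # input_channel: a table (or subquery) name living in Postgres.
        return f"SELECT * FROM {input_channel} WHERE {self.predicate_sql}"

op = PostgresFilterOperator("salary > 50000")
sql = op.execute("tax")
# sql == "SELECT * FROM tax WHERE salary > 50000"
```

Together with its operator mappings and channel declarations, such an operator is all a new platform needs to participate in plan enumeration; the existing platforms remain untouched.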
4. RHEEM INTERNALS
In this section, we give the details of the Rheem internals. Figure 6 depicts the Rheem ecosystem, i.e., the Rheem core architecture together with three main applications built on top of it. Users provide a Rheem plan to the system (Step (1) in Figure 6) using the Java, Scala, Python, REST, RheemLatin, or Rheem Studio API (yellow boxes in Figure 6). The cross-platform optimizer compiles the Rheem plan into an execution plan (Step (2)), which specifies the processing platforms to use; the executor schedules the resulting execution plan on the selected platforms (Step (3)); the monitor collects statistics and checks the health of the execution (Step (4)); the progressive optimizer re-optimizes the plan if the cardinality estimates turn out to be inaccurate (Step (5)); and the cost learner helps users build the cost model offline. In the following, we explain each of these components using the pseudocode in Algorithm 1, which shows the entire data processing pipeline.
The cross-platform optimizer (Line 1 in Algorithm 1) is responsible for selecting the most efficient platform for executing each single operator in a Rheem plan. One might think of a rule-based optimizer for selecting the right platforms to execute a given Rheem plan. However, while a rule-based optimizer could determine how to split and execute a plan, e.g., based on its processing patterns [32, 58], it is neither practical nor effective. First, by setting rules, one may make only very simplistic decisions based on the different cardinality and complexity of each operator. Second, the cost of a task on any given platform depends on many input parameters, which hampers a rule-based optimizer's effectiveness as it oversimplifies the problem. Third, as new platforms and applications emerge, maintaining a rule-based optimizer becomes cumbersome.

We thus pursue a more flexible cost-based approach: we split a given Rheem plan into subplans and determine the best platform for each subplan so that the total plan cost is minimized. Figure 4(b) shows how the Rheem plan of Figure 4(a) was split into two subplans to be executed on JavaStreams and Spark. Below, we describe the four main phases of the optimizer, namely plan inflation, cardinality and cost estimation, data movement planning, and plan enumeration. Technical details about these can be found in [39].

At first, the optimizer passes the Rheem plan through an inflation phase. That is, it applies a set of operator mappings as described in Section 3. The optimizer then annotates the inflated plan with the cost of each execution operator.
Rheem represents cost estimates as intervals with a confidence value, which allows it to perform on-the-fly re-optimization, as we will see in Section 4.4.

Algorithm 1: Cross-platform data processing
Input: Rheem plan rheemPlan
1: exPlan ← Optimize(rheemPlan)
2: monitor ← StartMonitor(exPlan)
3: finished ← ExecuteUntilCheckpoint(exPlan, monitor)
4: while ¬finished do
5:     updated ← UpdateEstimates(exPlan, monitor)
6:     if updated then exPlan ← ReOptimize(exPlan)
7:     finished ← ResumeExecution(exPlan, monitor)

The cost (e.g., wallclock time or monetary cost) of an execution operator depends on (i) its resource usage (CPU, memory, disk, and network) and (ii) the unit costs of each resource (e.g., how much one CPU cycle costs). While the unit costs depend on hardware characteristics, the resource usage of each execution operator depends on its input cardinality.

Next, the optimizer looks for the best way to move data quanta among execution operators of different platforms. As noted earlier, we model the problem of finding the most efficient communication path among execution operators as a graph problem, which we proved to be NP-hard. Our solution to this problem relies on kernelization and can discover all ways to connect execution operators of different platforms via a sequence of communication channels. After the best data movement strategy is found, the optimizer attaches the data movement cost to the inflated plan.

At last, the optimizer determines the optimal way of executing a Rheem plan based on the cost estimates of its inflated plan. For this, it must consider the previously computed data movement costs as well as the start-up costs of data processing platforms. Thus, instead of taking a simple greedy approach that neglects data movement and platform start-up costs, we follow a principled approach: we use an enumeration algebra together with a lossless pruning technique. Our pruning technique is guaranteed not to prune a subplan that is part of the optimal execution plan. As a result, the optimizer can output the optimal execution plan without an exhaustive enumeration.
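For illustration, the control flow of Algorithm 1 can be sketched as follows; the function names mirror the pseudocode and are hypothetical stand-ins for Rheem's internal components, not its actual API.

```python
# Hypothetical sketch of Algorithm 1: optimize, execute until an optimization
# checkpoint, and re-optimize whenever the cardinality estimates were updated.

def run(rheem_plan, optimize, start_monitor, execute_until_checkpoint,
        update_estimates, re_optimize, resume_execution):
    ex_plan = optimize(rheem_plan)                         # Line 1
    monitor = start_monitor(ex_plan)                       # Line 2
    finished = execute_until_checkpoint(ex_plan, monitor)  # Line 3
    while not finished:                                    # Line 4
        updated = update_estimates(ex_plan, monitor)       # Line 5
        if updated:
            ex_plan = re_optimize(ex_plan)                 # Line 6
        finished = resume_execution(ex_plan, monitor)      # Line 7
    return ex_plan
```

Passing the components as functions keeps the sketch self-contained; in Rheem they are, of course, stateful system components.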
The executor receives an execution plan from the optimizer to run it on the selected data processing platforms (Lines 3 and 7 in Algorithm 1). For example, the optimizer selected the Spark and JavaStreams platforms for our SGD example in Figure 4(a). Overall, the executor follows well-known approaches to parallelize a task over multiple compute nodes, with only few differences in the way it divides an execution plan. In particular, it divides an execution plan into stages. A stage is a subplan where (i) all its execution operators are from the same platform; (ii) at the end of its execution, the platforms need to give back the execution control to the executor; and (iii) its terminal operators materialize their output data quanta in a data structure, instead of being pipelined into the next operator.

In our SGD example of Figure 4(b), the executor divides the execution plan into six stages, as illustrated in Figure 7. Note that Stage3 contains only the RepeatLoop operator, as the executor must have the execution control to evaluate the loop condition. This is also why the executor separates Stage1 from Stage5. Then, it dispatches the stages to the relevant platform drivers, which in turn submit the stages as a job to the underlying platforms. Stages are connected by data flow dependencies so that stages with no dependencies (e.g., Stage1 and Stage2) are dispatched first in parallel, and any other stage is dispatched once its input dependencies are satisfied (e.g., Stage3 after Stage2).

Figure 7: Stage dependencies for SGD.
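The dependency-driven dispatching can be illustrated with a small sketch; the stage graph below is a simplified stand-in for Figure 7, and the scheduling function is hypothetical rather than Rheem's actual executor code.

```python
# Hypothetical sketch: group stages into dispatch waves, where a stage is
# dispatched as soon as all of its input dependencies are satisfied.

def dispatch_waves(stages, deps):
    """stages: stage names; deps: stage -> set of prerequisite stages.
    Returns lists of stages that can be dispatched in parallel."""
    done, waves, pending = set(), [], set(stages)
    while pending:
        ready = sorted(s for s in pending if deps.get(s, set()) <= done)
        if not ready:
            raise ValueError("cyclic stage dependencies")
        waves.append(ready)
        done.update(ready)
        pending -= set(ready)
    return waves
```

For instance, with Stage3 depending on Stage2, the first wave contains the independent Stage1 and Stage2, mirroring the parallel dispatch described above.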
Data Exploration.
As data exploration is a key piece in the field of data science, the executor optionally allows applications to run in an exploratory mode where they can pause and resume the execution of a task at any point. Achieving this in a cross-platform setting is very challenging, because most platforms, such as Spark, Flink, Giraph, Postgres, and Hadoop, do not support pausing task computations at all, let alone resuming a task from an intermediate state. Thus, the challenge resides in enabling the underlying platforms to support data exploration efficiently. Rheem achieves this by injecting sniffers into execution plans and attaching auxiliary execution plans. A sniffer is an execution operator that duplicates intermediate results and sends them to an auxiliary execution plan. For example, if the user would like to keep track of the weights at each iteration of SGD, a sniffer is necessary right after updating the weights (Stage5 in Figure 7). The sniffer sends the weights to an auxiliary plan that is responsible for reporting them back to the user (the socket sink operator in Figure 7). This auxiliary plan is also responsible for computing and storing additional metadata for efficient task resumption (the map and collection sink operators of the auxiliary plan in Figure 7). When resuming a task, the executor performs the task by re-using as much as possible of the previously computed metadata. For instance, if the user pauses the SGD task at iteration i and resumes it later on, the executor fetches the previously computed weights of iteration i and resumes the task.

Recall that the cross-platform optimizer operates in a setting that is characterized by high uncertainty. For instance, the semantics of UDFs and data distributions are usually unknown because of the little control over the underlying platforms. This uncertainty can cause poor cardinality and cost estimates and hence can negatively impact the effectiveness of the optimizer [42]. To compensate for this uncertainty, Rheem registers the execution of a plan with the monitor (Line 2 in Algorithm 1). The monitor collects lightweight execution statistics for the given plan, such as data cardinalities and operator execution times. It is also aware of lazy execution strategies used by the underlying platforms and assigns measured execution times correctly to operators. Rheem uses these statistics to improve its cost model and to re-optimize ongoing execution plans in case of poor cardinality estimates. Additionally, the monitor is responsible for checking the health of the execution. For instance, if it finds a large mismatch between the real output cardinalities and the estimated ones, it pauses the execution plan and sends it to the progressive optimizer.
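Conceptually, the sniffer described above is a tee on the stream of data quanta: it forwards every quantum unchanged to the main plan while duplicating it into the auxiliary plan. A minimal sketch (hypothetical interface, not Rheem's operator API):

```python
# Hypothetical sketch of a sniffer: duplicate each data quantum into an
# auxiliary sink (e.g., a socket sink) while passing it through unchanged.

def sniffer(quanta, auxiliary_sink):
    for quantum in quanta:
        auxiliary_sink(quantum)  # feed the auxiliary execution plan
        yield quantum            # forward unchanged to the main plan
```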
To mitigate the effects of bad cardinality estimates, Rheem employs a progressive query optimization approach. The key principle is to re-optimize the plan whenever the cardinalities observed by the monitor greatly mismatch the estimated ones [45]. Applying progressive query optimization in our setting comes with two main challenges. First, we have only limited control over the underlying platforms, which makes plan instrumentation and halting executions difficult. Second, re-optimizing an ongoing execution plan must efficiently consider the results already produced.

We tackle these challenges by using optimization checkpoints. An optimization checkpoint tells the executor to pause the plan execution in order to consider a re-optimization of the plan beyond the checkpoint. The progressive optimizer inserts optimization checkpoints into execution plans wherever (i) cardinality estimates are uncertain (having a wide interval or low confidence) or (ii) the data is at rest (e.g., a Java collection or a file). For instance, the optimizer inserts an optimization checkpoint right after Stage1, as the data is at rest because of the Cache operator (see Figure 7). When the executor cannot dispatch a new stage anymore without crossing an optimization checkpoint, it pauses the execution and gives the control to the progressive optimizer. The latter gets the actual cardinalities observed so far by the monitor and re-computes all cardinalities from the current optimization checkpoint (Line 5 in Algorithm 1). In case of a mismatch, it re-optimizes the remainder of the plan (from the current optimization checkpoint) using the new cardinalities (Line 6). It then gives the new execution plan to the executor, which resumes the execution from the current optimization checkpoint (Line 7). Rheem can switch between execution and progressive optimization any number of times at a negligible cost.
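Because cost and cardinality estimates are intervals with a confidence value, the mismatch check at an optimization checkpoint can be sketched as follows; the threshold and function names are illustrative assumptions, not Rheem's actual ones.

```python
# Hypothetical sketch: decide whether to re-optimize at a checkpoint by
# testing the observed cardinality against its estimated interval.

def needs_reoptimization(observed, low, high, confidence, min_confidence=0.5):
    if confidence < min_confidence:
        return True  # the estimate is too uncertain to be trusted
    return not (low <= observed <= high)
```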
Profiling operators in isolation might be unrealistic whenever platforms optimize execution across multiple operators, e.g., by pipelining. Indeed, we found cost functions derived from isolated benchmarking to be insufficiently accurate. We thus take a different approach.
Learning the Cost Model.
Recall that each execution operator o is associated with a number of resource usage functions r_o^m, where m is CPU, memory, disk, or network. For instance, the cost function to estimate the CPU cycles required by the JavaFilter operator is r_JavaFilter^CPU := c_in × (α + β) + δ, where the parameters α and β denote the number of required CPU cycles for each input data quantum in the operator itself and in its UDF, respectively, and the parameter δ describes some fixed overhead for operator start-up and scheduling. We then multiply each of these resource usage functions r_o^m with the time required per unit (e.g., msec per CPU cycle) to get the time estimate t_o^m. The total cost estimate for operator o is defined as: f_o = t_o^CPU + t_o^mem + t_o^disk + t_o^net.

However, obtaining the parameters for each resource, such as the α, β, δ values for CPU, is not trivial. We thus use execution logs to learn these parameters in an offline fashion and model the cost of individual execution operators as a regression problem. Note that the execution logs contain the runtimes of execution stages (i.e., pipelines of operators as defined in Section 4.2) and not of individual operators. Let ({(o_1, C_1), (o_2, C_2), ..., (o_n, C_n)}, t) be an execution stage, where the o_i (0 < i ≤ n) are execution operators, the C_i are the true input and output cardinalities, and t is the measured execution time for the entire stage. Furthermore, let f_i(x, C_i) be the total cost function for execution operator o_i, with x being a vector with the parameters of all resource usage functions (e.g., CPU cycles and disk I/O per data quantum). We are interested in finding x_min = argmin_x loss(t, Σ_{i=1}^n f_i(x, C_i)).

Specifically, we use a relative loss function defined as loss(t, t′) = (|t − t′| + s) / (t + s), where t′ is the geometric mean of the lower and upper bounds of the cost estimate produced by Σ_i f_i(x, C_i), and s is a regularizer inspired by additive smoothing that tempers the loss for small t. Note that we can easily generalize this optimization problem to multiple execution stages: we minimize the weighted arithmetic mean of the losses of multiple execution stages. In particular, we use as stage weights the sum of the relative frequencies of the stages' operators among all stages, so as to deal with skewed workloads that contain certain operators more often than others. Finally, we apply a genetic algorithm [47] to find x_min. In contrast to other optimization algorithms, genetic algorithms impose only few restrictions on the loss function to be minimized. Hence, our cost learner can deal with arbitrary cost functions. Applying this technique allows us to calibrate the cost functions with only little additional effort.

Logs Generation.
Clearly, the more execution logs are available, the better Rheem can tune the cost model. Thus, Rheem comes with a log generator. It first creates a set of Rheem plans by composing all possible combinations of Rheem operators forming a particular topology. We found that most data analytic tasks in practice follow three different topologies: pipeline (e.g., batch tasks), iterative (e.g., ML tasks), and merge (e.g., SPJA tasks). It then generates all possible execution plans for the previously created set of Rheem plans. Next, it creates different configurations for each execution plan, i.e., it varies the UDF complexity, output cardinalities, input dataset sizes, and data types. Once it has generated all possible plans with different configurations, it executes them and logs their runtimes.
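To make the calibration step concrete, the sketch below implements the relative loss from above and fits the parameters of a toy per-quantum cost function to logged stage runtimes. A plain random search stands in for the genetic algorithm of [47], and all shapes and constants are illustrative, not Rheem's actual cost model.

```python
import random

def relative_loss(t, t_pred, s=1.0):
    """Relative loss (|t - t'| + s) / (t + s) with additive-smoothing regularizer s."""
    return (abs(t - t_pred) + s) / (t + s)

def stage_cost(x, cardinalities):
    """Toy stage cost: alpha/beta are per-quantum CPU costs, delta a fixed overhead."""
    alpha, beta, delta = x
    return sum(c * (alpha + beta) + delta for c in cardinalities)

def fit(logs, iterations=2000, seed=0):
    """logs: list of (cardinalities, measured_time); random-search for x_min."""
    rng = random.Random(seed)
    best_x, best_loss = None, float("inf")
    for _ in range(iterations):
        x = [rng.uniform(0.0, 2.0) for _ in range(3)]
        total = sum(relative_loss(t, stage_cost(x, cs)) for cs, t in logs)
        if total < best_loss:
            best_x, best_loss = x, total
    return best_x
```

In Rheem, the fitted parameters calibrate the per-operator resource usage functions; here they merely minimize the summed relative loss over the logged stages.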
5. RHEEM INTERFACES
Rheem provides a set of native APIs for developers to build their applications. These include Java, Scala, Python, and REST. Examples of using these APIs can be found in the Rheem repository (https://github.com/rheem-ecosystem/rheem-benchmark). The code developers have to write is fully agnostic of the underlying platforms. Still, in case the user wants to force Rheem to execute a given operator on a specific platform, she can invoke the withTargetPlatform method. Similarly, she can force the system to use a specific execution operator via the customOperator method, which further enables users to employ custom operators without having to extend the API.

Although the native APIs are quite popular among developers, many users are not proficient with these APIs. Thus, Rheem also provides two APIs that target non-expert users: a data-flow language (RheemLatin) and a visual IDE (Rheem Studio). We explain these interfaces using our SGD example from Figure 4. However, before going into the details of these two interfaces, we first show how one can implement SGD on Rheem using one of its native APIs. The salient feature of all these APIs is that they are platform-agnostic: it is Rheem that figures out on which platform to execute each of the operators.
Let us explain how users can code their applications using one of the native APIs of Rheem. We use the Scala API and our SGD running example (see Listing 1).

  val context = new RheemContext(new Configuration)
    .withPlugin(Spark.basicPlugin)
    .withPlugin(JavaStreams.basicPlugin)
  val plan = new PlanBuilder(context)
  val points = plan.readTextFile("hdfs://myData.csv")
    .map(parsePoints)
  val finalWeights = plan.loadCollection(createRandomWeights())
    .repeat(50, { weights =>
      points.sample(sampleSize).withBroadcast(weights)
        .map(computeGradient())
        .reduce(_ + _)
        .map(updateWeights())
    }).collect()

Listing 1: SGD task using the Scala API.

First, a user creates the Rheem context, where she specifies the available platforms (Lines 1-3): Spark and JavaStreams in this example. She then initializes her Rheem plan with this context (Line 4). Eventually, she creates the graph of Rheem operators that defines the SGD task (Lines 5-13). Note that Rheem plans must have at least one source operator (Line 5), i.e., an operator reading or producing input data quanta, and one sink operator per branch (Line 13), i.e., an operator retrieving or storing the result. Also, observe that this code is fully agnostic of the underlying platforms. For clarity reasons, we did not include the UDF implementations in Listing 1.
Rheem provides a data-flow language (RheemLatin) for users to specify their tasks [44]. Our goal is to provide ease-of-use without compromising expressiveness. RheemLatin follows a procedural programming style to naturally fit the pipeline paradigm of Rheem. This is similar to the R language, which is quite popular among data scientists. It draws its inspiration from PigLatin [48] and hence it has PigLatin's grammar and supports most of PigLatin's keywords. In fact, one could see it as an extension of PigLatin for cross-platform settings. For example, users can specify the platform for any part of their queries. More importantly, it provides a set of configuration files whereby users can add new keywords to the language together with their mappings to Rheem operators. As a result, users can easily adapt RheemLatin for their applications. Listing 2 illustrates how one can express our SGD example in RheemLatin.

  import '/sgd/udfs.class' as taggedPointCounter;
  lines = load 'hdfs://myData.csv';
  points = map lines -> {taggedPointCounter.parsePoints(lines)};
  weights = load taggedPointCounter.createWeights();
  final_weights = repeat 50 {
    sample_points = sample points -> {taggedPointCounter.getSample()} with broadcast weights;
    gradient = map sample_points -> {taggedPointCounter.computeGradient()};
    gradient_sum_count = reduce gradient -> {gradient.sumcount()};
    weights = map gradient_sum_count -> {gradient_sum_count.average()} with platform 'JavaStreams'; }
  store final_weights 'hdfs://output/sgd';

Listing 2: SGD task in RheemLatin.

The user starts by importing all her required UDFs (Line 1). She then parses all the data points from the input dataset (Lines 2 and 3) and initializes the weights (Line 4). Next, she proceeds to perform the core of SGD: she takes a sample of data points (Line 6), computes the gradient for each sampled data point (Line 7), updates the weights (Lines 8 and 9), and repeats the process 50 times (Line 5). She can also repeat such a core process until convergence by using WhileLoop instead of Repeat. Optionally, she can specify the platform for any part of her query. For instance, she might know that updating the weights in each iteration is a lightweight computation and hence might specify to use JavaStreams (Line 9). She finishes by storing the final weights on HDFS (Line 10).
Although the native APIs and RheemLatin cover a large number of users, some might still be unfamiliar with programming and data-flow languages. Also, some other users may simply desire to speed up the process of composing their data analytic tasks. To this end, Rheem provides a visual IDE (Rheem Studio) where users can compose their data analytic tasks in a drag-and-drop fashion [44]. Figure 8 shows the Rheem Studio's GUI. The GUI is composed of four parts: a panel containing all Rheem operators, the drawing surface, a console for writing RheemLatin queries, and the output terminal. Users can draw a plan by simply dragging as many Rheem operators as required from the left-side panel and dropping them on the drawing surface. They consequently connect the operators as required by their data analytic task. The right side of Figure 8 shows how operators are connected for SGD. While connecting operators, the studio validates such connections and gives feedback to users in case a connection cannot be established, e.g., when the output and input of two connected operators are of different data types. Last but not least, the studio provides default implementations for any of the Rheem operators, which enables users to run common data analytic tasks without writing a single line of code. Yet, expert users can provide a UDF by double-clicking on any operator.

Figure 8: SGD task in the Rheem Studio.
6. EXAMPLES OF RHEEM PLANS
We now provide three detailed examples of how users can implement their tasks using the Scala native API and the RheemLatin interface. For this, we consider three popular data analytic tasks: WordCount (a well-known aggregate task), K-means (a very representative iterative task), and PolyJoin (a common task over different data sources).

Users start their Rheem plans in Scala with a preamble that defines the context and the platforms to be used, as shown in Listing 3. For the sake of presentation, we do not include this preamble in our Scala code examples below.

  val context = new RheemContext(new Configuration)
    .withPlugin(Spark.basicPlugin)
    .withPlugin(JavaStreams.basicPlugin)
  val plan = new PlanBuilder(context)
Listing 3: Preamble in the Scala API.
WordCount is an aggregate task that computes the frequency with which each word appears in a dataset. Listing 4 shows the RheemLatin query for this task: Line 1 imports all the required UDFs; Line 2 loads the input data; Lines 3 and 4 parse the words and convert them into records; Line 5 computes the frequency of each word; and Line 6 stores the final word count on disk. Note that users naturally define the flow of their analytical tasks with RheemLatin. Alternatively, users can implement this task using one of the native APIs of Rheem. Listing 5 shows the Scala code for this task. Similar to the RheemLatin query, the Scala code keeps the plan composition simple.

  import '/wordcount/udfs.class' as wordcount;
  lines = load 'hdfs://myWords.txt';
  words = flatmap lines -> {wordcount.splitWords()};
  tuples = map words -> {wordcount.convert2Tuple()};
  adds = reduce tuples -> {wordcount.getWord()}, tuples -> {wordcount.reduce()};
  store adds '/output/wordcount';

Listing 4: Word Count task in RheemLatin.
K-means is a widely used ML task for clustering data points together according to their similarity. We show the RheemLatin query in Listing 6. In contrast to the WordCount task, this task is iterative (Lines 4-7). We observe that defining loops in RheemLatin is quite similar to coding in a high-level language (e.g., Scala), which makes it intuitive for most users. Listing 7 shows its counterpart in Scala.

  val words = plan.readTextFile("hdfs://myWords.csv")
    .flatMap(_.split("\\W+"))
    .map(word => (word.toLowerCase, 1))
    .reduceByKey(_._1, (c1, c2) => (c1._1, c1._2 + c2._2))
    .collect()

Listing 5: Word Count task using the Scala API.

  lines = load 'hdfs://myPoints.txt';
  points = map lines -> kmeans.parsePoints();
  centroids = load 'hdfs://myInitialCentroids.txt';
  final_centroids = repeat centroids AS current_centroid for {
    distance = map points -> kmeans.selectNearestCentroid() with broadcast current_centroid;
    centroids_sum = reduce distance -> kmeans.reduce();
    new_centroids = map centroids_sum -> kmeans.average(); }
  store final_centroids 'hdfs:///output/kmeans';

Listing 6: K-means task in RheemLatin.

  val points = plan.readTextFile("hdfs://myPoints.csv")
    .map(createPoints)
  val initialCentroids =
    plan.loadCollection(Kmeans.createRandomCentroids(k))
  val finalCentroids = initialCentroids.repeat(iterations, { currentCentroids =>
    val newCentroids = points
      .mapJava(new SelectNearestCentroid)
      .withBroadcast(currentCentroids, "centroids")
      .reduceByKey(_.centroidId, _ + _)
      .map(_.average)
    newCentroids
  })
  finalCentroids.collect()

Listing 7: K-means task using the Scala API.
PolyJoin is a common task in polystore scenarios, i.e., joining several datasets from different data sources. In this case, we consider TPC-H Q5 and assume that: the region, suppliers, and customers relations are on Postgres; the nations relation is on the local file system; and the orders and lineitems relations are on HDFS. Despite the complexity of this query, we observe that the RheemLatin query (Listing 8) and the Scala code (Listing 9) are still simple, as they follow the logical flow of the task itself. Lines 1-7 in Listing 8 load the datasets, Lines 8-12 select and project the required tuples, and Lines 13-22 join the resulting tuples before performing the group-by in Line 23.
7. RHEEM VS. MUSKETEER
We experimentally compare Rheem with its closest competitor, Musketeer [32]. More experiments concerning the optimizer can be found in [40].
Setup.
We ran our experiments on a cluster of 10 machines. Each node has one 2 GHz Quad Core Xeon processor, 32 GB main memory, 500 GB SATA hard disks, and a 1 Gigabit network card, and runs 64-bit Linux Ubuntu 14.04.05. In Rheem, we used the following platforms: Java's Stream library (JavaStreams), Spark 1.6.0 (Spark), Flink 1.3.2 (Flink), GraphX 1.6.0 (GraphX), Giraph 1.2.0 (Giraph), a Java graph library (JGraph), and HDFS 2.6.0 to store files. We used all these platforms with their default settings and configured the maximum RAM of each platform to 20 GB. We disabled the Rheem stage parallelization feature to have only one single platform running at any time. We obtained all the cost functions required by our optimizer as described in Section 4.5. We considered the cross-community pagerank task (CrocoPR), because the authors reported this task to be a case where Musketeer chooses multiple platforms. Note that, for fairness reasons, we perform the data preparation part of CrocoPR (i.e., the union of the different communities' pages) as a separate script for Musketeer. This is because its language (Mindi) is not optimized for dealing with UDFs, and hence it would be much slower to provide the data preparation as a UDF. In contrast, Rheem seamlessly performs both parts (data preparation and page rank) as a single task. We used the DBpedia pagelinks dataset (20 GB).

  import '/polyjoin/udfs.class' as polyjoin;
  region = load 'postgres:///tpch/region';
  suppliers = load 'postgres:///tpch/suppliers';
  customers = load 'postgres:///tpch/customers';
  nations = load 'file:///nations' delimiter '|';
  orders = load 'hdfs:///orders' delimiter '|';
  lineitems = load 'hdfs:///lineitems' delimiter '|';
  region_filter = filter region[1] == 'ASIA';
  region_project = map region_filter -> {polyjoin.projectRecord(0, 1)};
  suppliers_project = map suppliers -> {polyjoin.projectRecord(0, 3)};
  customers_project = map customers -> {polyjoin.projectRecord(0, 3)};
  order_filter = filter orders -> {polyjoin.isBetween(4, '1994-…')};
  join1 = join nations[2], region_project[0];
  map_join1 = map join1 -> {polyjoin.tuple2Record(0, 0, 0, 1)};
  join2 = join map_join1[0], customers_project[1];
  map_join2 = map join2 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 0)};
  join3 = join map_join2[2], order_filter[0];
  map_join3 = map join3 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 0)};
  join4 = join map_join3[2], lineitems[0];
  map_join4 = map join4 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 2, 1, 5, 1, 6)};
  join5 = join map_join4 -> {polyjoin.record2Tuple(2, 0)}, suppliers_project -> {polyjoin.record2Tuple(0, 1)};
  map_join5 = map join5 -> {polyjoin.tuple2Record(0, 1, 0, 3, 0, 4)};
  groupBy = groupby map_join5[0];
  store groupBy '/output/polyjoin';

Listing 8: PolyJoin task in RheemLatin.

  val regions: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/region")
      .map(createRecord(_))
      .filter((r: Record) => r.getString(1) == "ASIA")
      .map(projectRecord(_, 0, 1))
  val suppliers: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/supplier")
      .map[Record](createRecord(_))
      .map[Record](projectRecord(_, 0, 3))
  val customers: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/customer")
      .map[Record](createRecord(_))
      .map(projectRecord(_, 0, 3))
  val nations: DataQuanta[Record] =
    plan.readTextFile("file:///nation")
      .map(createRecord(_))
  val orders: DataQuanta[Record] =
    plan.readTextFile("hdfs:///order")
      .map(createRecord(_))
      .filter(isBetween(_, 4, fromDate, toDate))
  val lineitems: DataQuanta[Record] =
    plan.readTextFile("hdfs:///lineitem")
      .map(createRecord(_))

  nations.join(getColumn(_, 2), regions, getColumn(_, 0))
    .map(tuple2Record(_, 0, 0, 0, 1))
    .join(getColumn(_, 0), customers, getColumn(_, 1))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 0))
    .join(getColumn(_, 2), orders, getColumn(_, 1))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 0))
    .join(getColumn(_, 2), lineitems, getColumn(_, 0))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 2, 1, 5, 1, 6))
    .join[Record, Tuple2[String, String]](record2Tuple(_, 2, 0),
      suppliers, record2Tuple(_, 0, 1))
    .map(tuple2Record(_, 0, 1, 0, 3, 0, 4))
    .groupByKey((r: Record) => r.getField(0))
    .collect()

Listing 9: PolyJoin task using the Scala API.

Figure 9: Rheem outperforms Musketeer by more than one order of magnitude.
Results.
Figure 9 shows the results in log scale when varying the dataset size for 10 iterations and when varying the number of iterations for 10% of the dataset. Overall, we observe the superiority of Rheem over Musketeer, especially as the number of iterations increases: Rheem is up to 85 times faster than Musketeer. Note that, in contrast to Musketeer, Rheem keeps its runtime constant as the number of iterations increases. This is because: (i) Musketeer, among other things, checks dependencies, compiles and packages the code, and writes the output to HDFS at each iteration (or stage), which comes with a high overhead; and (ii) Rheem executes the page rank part of the task (i.e., after the data preparation) on JavaStreams, which allows it to perform each iteration with almost zero overhead.
8. LIMITATIONS
As of now, Rheem does not support any stream processing platform. While users can easily supply new batch processing platforms, stream processing requires extending Rheem's core. We plan to do so by following the lambda architecture paradigm [46]. In addition, Rheem currently relies on the fault-tolerance of the underlying platforms and is thus susceptible to failures while moving data across platforms. We plan to incorporate a basic fault-tolerance mechanism at the cross-platform level. Other remaining issues include: adding methods that speed up inter-platform communication, such as the one proposed in [33]; integrating Rheem with resource managers to incorporate changes in the availability of computing resources; and supporting the simultaneous execution of Rheem jobs.
9. RELATED WORK
The research and industry communities have proposed a myriad of different data processing platforms [5, 8, 11, 18, 25, 59]. In contrast, we do not provide a data processing platform but a novel system on top of them.

Cross-platform data processing has been in the spotlight only very recently. Some works focus only on integrating different data processing platforms with the goal of alleviating users from their intricacies [1, 2, 10, 12, 29]. However, they still require expertise from users to decide when to use a specific data processing platform. For example, BigDAWG [29] requires users to specify where to run tasks via its Scope and Cast commands, which already require expertise from users. Only few works share a similar goal with us [28, 32, 43, 55, 58]. However, they substantially differ from Rheem. Two main differences are that they consider neither data movement costs nor progressive task optimization techniques, although both aspects are crucial in cross-platform settings. Additionally, each of these works differs from Rheem in various ways. As Musketeer's main goal is to decouple front-end languages (e.g., SQL and PigLatin) from the underlying platforms [32], it is not as expressive and extensible as Rheem. Furthermore, as it maps task patterns to specific underlying platforms, it is not clear how one can efficiently map a task when having similar platforms (e.g., Spark vs. Flink or Postgres vs. MySQL). Similarly, in Myria [58], it is hard to allocate tasks when having similar platforms because it comes with a rule-based optimizer, which additionally makes it hard to maintain. IReS [28] supports only 1-to-1 mappings between abstract tasks and their implementations, which limits expressiveness and optimization opportunities. Moreover, it assumes direct data movement paths between platforms, which is hard to maintain for many platforms. QoX [55] focuses only on ETL workloads. DBMS+ [43] is limited by the expressiveness of its declarative language and hence it is neither adaptive nor extensible. Other complementary works focus on improving data movement across different platforms [33] or libraries by using a common intermediate representation and executing the scripts in LLVM [49], but none of them addresses the cross-platform optimization problem. TensorFlow [15] follows a similar idea, but for the cross-device execution of machine learning tasks, and is thus orthogonal to Rheem. In fact, Rheem could use TensorFlow as an underlying platform.

The research community has also studied the problem of federating relational databases [54]. Garlic [22], TSIMMIS [23], and InterBase [21] are just three examples. However, all these works significantly differ from Rheem in that they consider a single data model and simply push query processing to where the data is. Other works integrate Hadoop with an RDBMS [27, 41]; however, one cannot easily extend them to deal with more diverse tasks and platforms.
10. CONCLUSION
Given today's data analytics ecosystem, supporting cross-platform data processing has become crucial for organizations. We have identified four different situations in which an application requires or benefits from cross-platform data processing. Driven by these cases, we built Rheem, a cross-platform system that decouples applications from data processing platforms to achieve efficient task execution over multiple platforms. Rheem follows a cost-based optimization approach: it splits an input task into subtasks and assigns each subtask to a specific platform such that the overall cost (e.g., runtime or monetary cost) is minimized. Our experience while building Rheem raised several interesting questions that need to be addressed in the future, namely: How can we (i) reduce inter-platform data movement costs? (ii) address the cardinality and cost estimation problem? (iii) efficiently support fault tolerance across platforms? (iv) add new platforms automatically? and (v) improve data exploration in cross-platform settings?
11. REFERENCES
[1] Apache Beam. https://beam.apache.org.
[2] Apache Drill. https://drill.apache.org.
[3] Apache Flink. https://flink.apache.org.
[4] Apache Flume. https://flume.apache.org/index.html.
[5] Apache HBase. http://hbase.apache.org/.
[6] Apache Hive: A data warehouse software for distributed storage. http://hive.apache.org.
[7] Apache Mahout. http://mahout.apache.org.
[8] Apache Spark: Lightning-Fast Cluster Computing. http://spark.incubator.apache.org/.
[9] Fortune magazine. http://fortune.com/2014/06/19/big-data-airline-industry/.
[10] Luigi Project. https://github.com/spotify/luigi.
[11] PostgreSQL.
[12] PrestoDB Project. https://prestodb.io.
[13] Spark MLlib. http://spark.apache.org/mllib.
[14] Spark SQL programming guide. http://spark.apache.org/docs/latest/sql-programming-guide.html.
[15] M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, pages 265–283, 2016.
[16] D. Agrawal, L. Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and M. Zaki. Rheem: Enabling Multi-Platform Task Execution. In SIGMOD, pages 2069–2072, 2016.
[17] D. Agrawal et al. Road to Freedom in Big Data Analytics. In EDBT, pages 479–484, 2016.
[18] A. Alexandrov et al. The Stratosphere platform for big data analytics. VLDB J., 23(6):939–964, 2014.
[19] A. Baaziz and L. Quoniam. How to use big data technologies to optimize operations in upstream petroleum industry. In World Petroleum Congress, 2014.
[20] M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative Machine Learning on Spark. PVLDB, 9(13):1425–1436, 2016.
[21] O. A. Bukhres et al. InterBase: An Execution Environment for Heterogeneous Software Systems. IEEE Computer, 26(8):57–69, 1993.
[22] M. J. Carey et al. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In RIDE-DOM, pages 124–131, 1995.
[23] S. S. Chawathe et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In IPSJ, pages 7–18, 1994.
[24] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: A commodity data cleaning system. In SIGMOD, pages 541–552, 2013.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 2008.
[26] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The Data Civilizer System. In CIDR, 2017.
[27] D. J. DeWitt, A. Halverson, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in Polybase. In SIGMOD, pages 1255–1266, 2013.
[28] K. Doka, N. Papailiou, V. Giannakouris, D. Tsoumakos, and N. Koziris. Mix 'n' match multi-engine analytics. In IEEE BigData, pages 194–203, 2016.
[29] A. J. Elmore et al. A Demonstration of the BigDAWG Polystore System. PVLDB, 8(12):1908–1911, 2015.
[30] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1–6:48, 2008.
[31] R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. A Demo of the Data Civilizer System. In SIGMOD, pages 1639–1642, 2017.
[32] I. Gog et al. Musketeer: all for one, one for all in data processing systems. In EuroSys, 2015.
[33] B. Haynes, A. Cheung, and M. Balazinska. PipeGen: Data Pipe Generator for Hybrid Analytics. In SoCC, pages 470–483, 2016.
[34] A. Hems, A. Soofi, and E. Perez. How innovative oil and gas companies are using big data to outmaneuver the competition. Microsoft White Paper, http://goo.gl/2Bn0xq, 2014.
[35] IBM. Data-driven healthcare organizations use big data analytics for big gains. White paper, http://goo.gl/AFIHpk.
[36] Z. Kaoudi, J.-A. Quiané-Ruiz, S. Thirumuruganathan, S. Chawla, and D. Agrawal. A Cost-based Optimizer for Gradient Descent Optimization. In SIGMOD, 2017.
[37] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. BigDansing: A System for Big Data Cleansing. In SIGMOD, pages 1215–1230, 2015.
[38] Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and P. Kalnis. Lightning Fast and Space Efficient Inequality Joins. PVLDB, 8(13):2074–2085, 2015.
[39] S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle – A Cross-Platform Query Optimizer. arXiv:1805.03533, https://arxiv.org/abs/1805.03533, 2018.
[40] S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle – A Cross-Platform Query Optimizer. arXiv:1805.03533, https://arxiv.org/abs/1805.03533, 2018.
[41] J. LeFevre, J. Sankaranarayanan, H. Hacıgümüş, J. Tatemura, N. Polyzotis, and M. J. Carey. MISO: souping up big data query processing with a multistore system. In SIGMOD, pages 1591–1602, 2014.
[42] V. Leis et al. How good are query optimizers, really? PVLDB, 9(3):204–215, 2015.
[43] H. Lim, Y. Han, and S. Babu. How to Fit when No One Size Fits. In CIDR, 2013.
[44] J. Lucas, Y. Idris, B. Contreras-Rojas, J.-A. Quiané-Ruiz, and S. Chawla. Cross-Platform Data Analytics Made Easy. In ICDE, 2018.
[45] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In SIGMOD, pages 659–670, 2004.
[46] N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.
[47] M. Mitchell. An introduction to genetic algorithms. MIT Press, 1998.
[48] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In SIGMOD, pages 1099–1110, 2008.
[49] S. Palkar, J. J. Thomas, A. Shanbhag, M. Schwarzkopf, S. P. Amarasinghe, and M. Zaharia. Weld: A Common Runtime for High Performance Data Analysis. In CIDR, 2017.
[50] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165–178, 2009.
[51] J.-A. Quiané-Ruiz and Z. Kaoudi. Cross-Platform Query Processing. In ICDE (tutorial), 2018.
[52] P. J. Sadalage and M. Fowler. NoSQL distilled: A brief guide to the emerging world of polyglot persistence. Addison-Wesley Professional, 2012.
[53] S. Shankar, A. Choi, and J.-P. Dijcks. Integrating Hadoop Data with Oracle Parallel Processing. Oracle White Paper, 2010.
[54] A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183–236, 1990.
[55] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. Optimizing Analytic Data Flows for Multiple Execution Engines. In SIGMOD, pages 829–840, 2012.
[56] M. Stonebraker. The Case for Polystores. http://wp.sigmod.org/?p=1629, 2015.
[57] D. Tsoumakos and C. Mantas. The Case for Multi-Engine Data Analytics. In Euro-Par, pages 406–415, 2013.
[58] J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR, 2017.
[59] F. Yang, J. Li, and J. Cheng. Husky: Towards a More Efficient and Expressive Distributed Computing Framework.