Building your Cross-Platform Application with RHEEM

Sanjay Chawla, Bertty Contreras-Rojas, Zoi Kaoudi, Sebastian Kruse*, Jorge-Arnulfo Quiané-Ruiz
Qatar Computing Research Institute, Hamad Bin Khalifa University
Hasso Plattner Institute, University of Potsdam
http://da.qcri.org/rheem/
ABSTRACT
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging because it requires good expertise in all the available data processing platforms. In this report, we present Rheem, a general-purpose cross-platform data processing system that relieves users from the pain of finding the most efficient data processing platform for a given task. It also splits a task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). To offer cross-platform functionality, it features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Rheem is released under an open source license.
1. INTRODUCTION
The pursuit of comprehensive, efficient, and scalable data analytics, as well as the one-size-does-not-fit-all dictum, have given rise to a plethora of data processing platforms (platforms for short). These specialized platforms include DBMSs, NoSQL systems, and MapReduce-like platforms. In fact, just under the umbrella of NoSQL, there are reportedly over 200 different platforms (see http://db-engines.com). Each excels in specific aspects, allowing applications to achieve high performance and scalability. For example, while Spark supports Select queries, Postgres can execute them much faster by using indices. However, Postgres is not as good as Spark for general-purpose batch processing, where parallel full scans are the key performance factor. Several studies have shown this kind of performance differences [20, 32, 36, 50, 57].

Moreover, today's data analytics is moving beyond the limits of a single platform. For example: (i) IBM reported that North York hospital needs to process 50 diverse datasets, which run on a dozen different platforms [35]; (ii) airlines need to analyze large datasets, which are produced by different departments, are of different data formats, and reside on multiple data sources, to produce global reports for decision makers [9]; (iii) Oil & Gas companies need to process large amounts of diverse data spanning various platforms [19, 34]; (iv) several data warehouse applications require data to be moved from a MapReduce-like system into a DBMS for further analysis [27, 53]; and (v) using multiple platforms for machine learning improves performance significantly [20, 36].

To cope with these new requirements, developers (or data scientists) have to write ad-hoc programs and scripts to integrate different platforms. This is not only a tedious, time-consuming, and costly task, but it also requires knowledge of the intricacies of the different platforms to achieve high efficiency and scalability.

(* Work partially done while interning at QCRI.)
Some systems have appeared with the goal of facilitating platform integration [2, 4, 10, 12]. Nonetheless, they all require a good deal of expertise from developers, who still need to decide which processing platforms to use for each task at hand. Recent research has taken steps towards transparent cross-platform execution [15, 28, 32, 43, 55, 56], but lacks several important aspects. Usually these efforts do not automatically map tasks to platforms. Additionally, they do not consider complex data movement (i.e., with data transformations) among platforms [28, 32]. Finally, most of the research focuses on specific applications [15, 43, 55].

Therefore, there is a clear need for a systematic approach to enable efficient cross-platform data processing, i.e., the use of multiple data processing platforms. The Holy Grail would be to replicate the success of DBMSs for cross-platform data processing: users simply send their tasks expressing the logic of their applications, and the cross-platform system decides on which platform(s) to execute each task with the goal of minimizing its cost (e.g., runtime or monetary cost). In other words, users focus on the high-level details and the cross-platform system takes care of the low-level details.

Building a cross-platform system is challenging on numerous fronts: (i) a cross-platform system not only has to effectively find all the suitable platforms for a given task, but also has to choose the most efficient one; (ii) cross-platform settings are characterized by high uncertainty, as different platforms are autonomous and thus one has little control over them; (iii) the performance gains of using multiple platforms should compensate the added cost of moving data across platforms; (iv) it is crucial to achieve inter-platform parallelism to prevent slow platforms from dominating execution time; and (v) the system should be extensible to new platforms and application requirements.

Figure 1: Rheem in the data analytics stack.

In this report, we present
Rheem, the first general-purpose cross-platform system to tackle all of the above challenges. The goal of Rheem is to enable applications and users to run data analytic tasks efficiently on one or more data processing platforms. To do so, it decouples applications from platforms, as shown in Figure 1. Applications issue their tasks to Rheem, which in turn decides where to execute them. As of today, Rheem supports a variety of platforms: Spark, Flink, JavaStreams, Postgres, GraphX, GraphChi, and Giraph. We are currently testing Rheem in a large international airline company and in a biomedical research institute. In the former case, we aim at seamlessly integrating all data analytic activity governing an aircraft; in the latter case, we aim at reducing the effort scientists need for building data analytic pipelines while at the same time speeding up the running time. Several papers show different aspects of Rheem: the vision behind it [17]; its optimizer [39]; its inequality join algorithm [38]; and a couple of its applications [36, 37]. A couple of demo papers showcase the benefits of Rheem [16] and its interface [44]. This report aims at presenting the complete design of Rheem and how all its pieces work together.

In summary, we identify four situations in which applications require support for cross-platform data processing in Section 2. For each case, we use a real application to show experimentally the benefits of cross-platform data processing using Rheem. In Section 3, we present the data and processing model of Rheem and show how it shields users from the intricacies of the underlying platforms. Rheem provides flexible operator mappings that allow for better exploiting the underlying platforms. Also, its extensible design allows users to add new platforms and operators with very little effort. Then, in Section 4, we discuss the key components of Rheem that make it novel: among them, a cost-based cross-platform optimizer that considers data movement costs; a progressive optimization mechanism to deal with inconsistent cardinality estimates; and a learning tool that relieves users from the burden of tuning the cost model. We present the Rheem interfaces whereby users can easily code and run a data analytic task in Section 5. In particular, we present a data-flow language (RheemLatin) and a visual integrated development environment (Rheem Studio). In Section 6, we show in detail three examples of real Rheem plans to better illustrate how developers can build their applications using these interfaces. Section 8 outlines the limitations of Rheem. Finally, we discuss related work in Section 9 and conclude with some open problems in Section 10.
2. CROSS-PLATFORM PROCESSING
We identified four situations in which an application requires support for cross-platform data processing [51]. Figure 2 illustrates these four cases.

(1) Platform independence. Applications run an entire task on a single platform but may require switching platforms for different input datasets or tasks, usually with the goal of achieving better performance (Figure 2(a)).

(2) Opportunistic cross-platform. Applications might also benefit performance-wise from using multiple platforms to run one single task (Figure 2(b)).

(3) Mandatory cross-platform. Applications may require multiple platforms because the platform where the input data resides, e.g., PostgreSQL, cannot perform the incoming task, e.g., a machine learning task. Thus, data should be moved from the platform where it resides to another platform (Figure 2(c)).

(4) Polystore. Applications may require multiple platforms because the input data is stored on multiple data stores (Figure 2(d)).

In contrast to existing systems [28, 29, 32, 55, 58], Rheem helps users in all of the above cases. The design of our system has been mainly driven by four applications: a data cleaning application, BigDansing [37]; a machine learning application, ML4all [36]; a database application, xDB; and an end-to-end data discovery and preparation application, Data Civilizer [31]. We use these applications to showcase the benefits of performing cross-platform data processing, instead of single-platform data processing, in terms of both performance and ease of use.

(Rheem is open source under the Apache Software License 2.0 and can be found at https://github.com/rheem-ecosystem/rheem.)
Applications are usually tied to a specific platform. This may not constitute the ideal case, for two reasons. First, as more efficient platforms become available, developers need to re-implement existing applications on top of these new platforms. For example, Spark SQL [14] and MLlib [13] are the Spark counterparts of Hive [6] and Mahout [7]. Migrating an application from one platform to another is a time-consuming and costly task and hence is not always a viable choice. Second, for different inputs of a specific task, a different platform may be the most efficient one, so the best platform cannot be determined statically. For instance, running a specific task on a big data platform for very large datasets is often a good choice, while single-node platforms with only little overhead are often a better choice for small datasets [20]. Thus, enabling applications to seamlessly switch from one platform to another according to the input dataset and task is important. Rheem dynamically determines the best platform to run an incoming task.
Benefits. We use BigDansing [37] to demonstrate the benefits of providing platform independence. Users specify a data cleaning task with five logical operators: Scope (identifies relevant data), Block (defines the groups of data among which an error may occur), Iterate (enumerates candidate errors), Detect (determines whether a candidate error is indeed an error), and GenFix (generates a set of possible repairs). Rheem maps these operators to
Rheem operators to decide the best underlying platform. We show the power of supporting cross-platform data processing by running an error detection task on a widely used Tax dataset [30]. The task is based on the denial constraint ∀t1, t2, ¬(t1.Salary > t2.Salary ∧ t1.Tax < t2.Tax), which states that there is an inconsistency between two tuples representing two different persons if one earns a higher salary but pays a lower tax. We considered NADEEF [24], a data cleaning tool, and SparkSQL, a general-purpose framework, as baselines, and forced Rheem to use either Spark or JavaStreams per run.

Figure 2: Cross-platform cases.

Figure 3: Benefits of the cross-platform data processing approach (using Rheem).

Figure 3(a) shows the results. Overall, we observe that Rheem (DC@Rheem) allows data cleaning tasks to scale up to large datasets and to be at least three orders of magnitude faster than the baselines. One order of magnitude of this gain comes from the ability of Rheem to automatically switch platforms: Rheem used JavaStreams for small datasets, speeding up the data cleaning task by avoiding Spark's overhead, while it used Spark for the largest datasets. Furthermore, in contrast to SparkSQL, which cannot process inequality joins efficiently, Rheem's extensibility allowed us to plug in a more efficient inequality-join algorithm [38], thereby further improving over these baselines. In a nutshell, BigDansing benefited from Rheem because of its ability to effectively switch platforms and because of its extensibility to easily plug in optimized algorithms. We demonstrated how BigDansing benefits from Rheem in [16].
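The violation check behind this denial constraint amounts to an inequality self-join. The sketch below detects violating tuple pairs naively in plain Python; it is purely illustrative (made-up records, not BigDansing's or Rheem's API, and not the optimized inequality-join algorithm of [38]):

```python
# Naive O(n^2) check of the denial constraint: a pair (t1, t2) is an
# error candidate when t1 earns a higher salary but pays a lower tax.
from itertools import permutations

def detect_violations(tuples):
    """Return all ordered pairs (t1, t2) with t1.Salary > t2.Salary
    and t1.Tax < t2.Tax."""
    return [
        (t1, t2)
        for t1, t2 in permutations(tuples, 2)
        if t1["Salary"] > t2["Salary"] and t1["Tax"] < t2["Tax"]
    ]

people = [
    {"Name": "a", "Salary": 90_000, "Tax": 10_000},
    {"Name": "b", "Salary": 60_000, "Tax": 12_000},  # more tax on less salary
    {"Name": "c", "Salary": 30_000, "Tax": 3_000},
]
violations = detect_violations(people)
# One violation: ("a", "b") — a earns more than b but pays less tax.
```

A quadratic scan like this is exactly what SparkSQL falls back to for inequality joins, which is why plugging in a specialized inequality-join operator pays off at scale.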
While some applications can be executed on a single platform, there are cases where their performance would be sped up by using multiple platforms. For instance, users can run a gradient descent algorithm, such as SGD, on top of Spark relatively fast. Still, we recently showed that mixing it with JavaStreams significantly improves performance [36]. In fact, opportunistic cross-platform processing can be seen as the execution counterpart of polyglot persistence [52], where different types of databases are combined to leverage their individual strengths. However, developing such cross-platform applications is difficult: developers must know all the cases where it is beneficial to use multiple platforms and how exactly to use them. These opportunities are often very hard (if not impossible) to spot. Even worse, as in the platform-independence case, they usually cannot be determined a priori. Rheem finds and exploits opportunities of using multiple processing platforms.

(In Figure 3, a red cross means we stopped the execution after 40 hours.)
Benefits. Let us now take our machine learning application, ML4all [36], to showcase the benefits of using multiple platforms to perform one single task. ML4all abstracts three fundamental phases (namely preparation, processing, and convergence) found in most machine learning tasks via seven logical operators, which are mapped to Rheem operators. In the preparation phase, the dataset is prepared appropriately, along with the necessary initialization of the algorithm (Transform and Stage operators). The processing phase computes the gradient and updates the current estimate of the solution (Sample, Compute, and Update operators), while the convergence phase repeats the processing phase based on the number of iterations or other criteria (Loop and Converge operators). We demonstrate the benefits of using Rheem with a classification task over three benchmark datasets, using stochastic gradient descent (SGD).

Figure 3(b) shows the results. We observe that, even though all systems use the same SGD algorithm, Rheem allows this algorithm to run significantly faster than competing Spark-based systems. This is for two main reasons. First, the gain comes from opportunistically running the Compute, Update, Converge, and Loop operators on JavaStreams, thereby avoiding some of Spark's overhead; Rheem runs the rest of the operators on Spark. MLlib and SystemML do not avoid such overhead, as they purely use Spark for the entire algorithm. Second, ML4all leverages Rheem's extensibility to plug in an efficient sampling operator, resulting in significant speedups. We demonstrated how ML4all further benefits from Rheem in [16].
There are cases where an application needs to go beyond the functionality offered by the platform on which the data is stored. For instance, a dataset is stored in a relational database and a user needs to perform a clustering task on particular attributes. Doing so inside the relational database might simply be disastrous in terms of performance. Thus, the user needs to move the projected data out of the relational database and, for example, put it on HDFS in order to use Apache Flink [3], which is known to be efficient for iterative tasks. A similar situation occurs in complex data analytics applications with disparate subtasks. As an example, an application that extracts a graph from a text corpus to perform subsequent graph analytics may require using both a text and a graph analytics system. The required integration of platforms is tedious, repetitive, and particularly error-prone. Nowadays, developers write ad-hoc programs to move the data around and integrate different platforms. Rheem not only selects the right platforms for each task but also moves the data, if necessary, at execution time.

Benefits.
We use xDB, a system on top of Rheem with database functionalities, to demonstrate the benefits of performing cross-platform data processing in the above situation. It provides a declarative language to compose data analytic tasks, while its optimizer produces a plan to be executed in Rheem. We evaluate the benefits of Rheem with the cross-community PageRank task, which is not only hard to express in SQL but also inefficient to run on a DBMS. Thus, it is important to move the computation to another platform. In this experiment, the input datasets are on Postgres and Rheem moves the data into Spark.

Figure 3(c) shows the results. As a baseline, we consider the ideal case where the data is on HDFS and Rheem simply uses either JavaStreams or Spark to run the tasks. We observe that Rheem allows xDB (xDB@Rheem) to achieve performance similar to the ideal case in all situations, while fully automating the process. This is a remarkable result, as Rheem needs to move data out of Postgres to perform the tasks, in contrast to the ideal case.
In many organizations, data is collected in different formats and on heterogeneous storage platforms (data lakes). Typically, a data lake comprises various DBMSs, document stores, key-value stores, graph databases, and pure file systems. As most of these stores are tightly coupled with an execution engine, e.g., a DBMS, it is crucial to be able to run analytics over multiple platforms. For this, users perform not only tedious, time-intensive, and costly data migration, but also complex integration tasks for analyzing the data. Rheem shields users from all these tedious tasks and allows them to instead focus on the logic of their applications.

Benefits.
A clear example that shows the benefits of cross-platform data processing in a polystore case is the Data Civilizer system [31]. Data Civilizer is a big data management system for data discovery, extraction, and cleaning from data lakes in large enterprises [26]. It constructs a graph that expresses relationships among data existing in heterogeneous data sources. Data Civilizer uses Rheem to perform complex tasks over information that spans multiple data storages. We measure the efficiency of Rheem for these polystore tasks with TPC-H query 5. In this experiment, we assume that the data is stored in HDFS (LINEITEM and ORDERS), Postgres (CUSTOMER, REGION, and SUPPLIER), and a local file system (NATION). Thus, this task performs join, group-by, and order-by operations across three different platforms. In this scenario, the common practice is either to move the data into the database to enact the queries inside the database [27, 53] or to move the data entirely to HDFS and use Spark. We consider these two practices as the baselines. For a fairer comparison, we also set the “parallel query” and “effective IO concurrency” features of Postgres to 4.

Figure 3(d) shows the results. Rheem (DataCiv@Rheem) is significantly faster, namely up to 5×, than the current practice. We observed that merely loading the data into Postgres is already approximately 3× slower than it takes Rheem to complete the entire task. Even when discarding data migration times, Rheem still performs quite similarly to the parallel version of Postgres: the pure execution time in Postgres for scale factor 100 amounts to 1,541 sec, compared to 1,608 sec for Rheem, which exploits Spark for data parallelism. We also observe that Rheem has negligible overhead over the case where the developer writes ad-hoc scripts to move the data to HDFS for running the task on Spark. In particular, Rheem is twice as fast as Spark for scale factor 1 because it moves less data from Postgres to Spark.

(xDB is available at https://github.com/rheem-ecosystem/xdb. The cross-community PageRank task intersects two community DBpedia datasets and runs PageRank on the resulting dataset.)

Figure 4: SGD example.
3. RHEEM MODEL
First of all, let us emphasize that Rheem is not yet another data processing platform. On the contrary, it is designed to work between applications and platforms (as shown in Figure 1), helping applications to choose the right platform(s) for a given task. Rheem is the first general-purpose cross-platform system that shields users from the intricacies of the underlying platforms and lets them focus only on the logic of their applications. We define the Rheem data and processing models in the following.
Data Quanta. The Rheem data model relies on data quanta, the smallest processing units of the input datasets. A data quantum can express a large spectrum of data formats, such as database tuples, edges in a graph, or the full content of a document. This flexibility allows applications and users to define a data quantum at any granularity level, e.g., at the attribute level rather than at the tuple level for a relational database. This fine-grained data model allows Rheem to work in a highly parallel fashion, if necessary, to achieve better scalability and performance.
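To make the granularity choice concrete, the same relational tuple can be modeled as one coarse-grained quantum or several fine-grained ones. This is a schematic illustration only, not Rheem's data model API:

```python
# One relational tuple, viewed at two data-quantum granularities.
row = {"Name": "Alice", "Salary": 90_000, "Tax": 10_000}

tuple_quanta = [row]                   # coarse: one quantum per tuple
attribute_quanta = list(row.items())   # fine: one quantum per attribute value

# The fine-grained view exposes more independent units, which is what
# enables the highly parallel processing mentioned above.
```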
Rheem Plan. Rheem accepts as input a Rheem plan: a directed data flow graph whose vertices are Rheem operators and whose edges represent data flows among the operators. A Rheem operator is a platform-agnostic data transformation over its input data quanta, e.g., a Map operator transforms an individual data quantum, while a Reduce operator aggregates input data quanta into a single output data quantum. Only Loop operators accept feedback edges, which allows iterative data flows to be expressed. Users or applications can refine the behavior of operators with a UDF. Optionally, applications can also attach the selectivities of the operators through a UDF; Rheem comes with default selectivity values in case they are not provided. A Rheem plan must have at least one source operator, i.e., an operator reading or producing input data quanta, and one sink operator per branch, i.e., an operator retrieving or storing the result. Intuitively, data quanta flow from source to sink operators, thereby being manipulated by all inner operators. As our processing model is based on primitive operators, Rheem plans are highly expressive. This is in contrast to other systems that accept either declarative queries [32, 58] or coarse-granular operators [28].
Example. Figure 4(a) shows a Rheem plan for the stochastic gradient descent algorithm (SGD). Initially, the dataset containing the data points is read via a TextFileSource operator and parsed using a Map operator, while the initial weights are read via a Collection source operator. After the RepeatLoop operator, the weights are fed to the Sample operator, where a set of input data points is sampled. Next, Map (compute) computes the gradient for each sampled data point. Note that, as Map (compute) requires all weights to compute the gradient, the weights are broadcast at each iteration to the Sample operator (denoted by the dotted line). Then, the Reduce operator computes the sum and count of all gradients. The next Map operator uses these sum and count values to update the weights. This process is repeated until the loop condition is satisfied. The resulting weights are output in a collection sink.
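The operator pipeline above can be traced with a toy, sequential rendering in plain Python. The comments name the corresponding operators of Figure 4(a); the function itself is purely illustrative (a one-dimensional least-squares SGD), not Rheem's operator API:

```python
import random

def run_sgd_plan(lines, iterations=100, lr=0.1, seed=7):
    rnd = random.Random(seed)
    # TextFileSource + Map(parse): each "x,y" line becomes a float pair.
    points = [tuple(map(float, ln.split(","))) for ln in lines]
    w = 0.0                                  # Collection source: initial weight
    for _ in range(iterations):              # RepeatLoop
        batch = rnd.sample(points, 2)        # Sample (weights broadcast here)
        grads = [2 * (w * x - y) * x for x, y in batch]  # Map(compute)
        s, c = sum(grads), len(grads)        # Reduce(sum & count)
        w -= lr * s / c                      # Map(update)
    return w                                 # Collection sink

weight = run_sgd_plan(["1,2", "2,4", "3,6"])  # points on the line y = 2x
# weight converges to roughly 2.0
```

In the actual execution plan of Figure 4(b), the loop body around the small weight collection is what Rheem places on JavaStreams, while the parsing and sampling of the large point dataset run on Spark.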
Execution Plan. Given a Rheem plan as input, Rheem uses a cost-based optimization approach to produce an execution plan, selecting one or more platforms to efficiently execute the input plan. The cost can be any user-specified cost, e.g., runtime or monetary cost. The resulting execution plan is again a data flow graph, where the vertices are now execution operators. An execution operator implements one or more Rheem operators with platform-specific code. For instance, the Cache Spark execution operator in Rheem implements the Cache Rheem operator by calling the RDD.cache() operation of Spark. An execution plan may also comprise additional execution operators for data movement (e.g., data broadcasting) or data reuse (e.g., data caching). Additionally, each execution operator has attached a UDF where its cost is specified. Rheem learns such costs from execution logs using machine learning. We discuss more details in Section 4.5.

Example. Figure 4(b) shows the SGD execution plan produced by Rheem when Spark and JavaStreams are the only available platforms. This execution plan exploits high parallelism for the large dataset of input data points and avoids the extra overhead incurred by big data processing platforms for the smaller collection of weights. Note that the execution plan also contains three execution operators for transferring data (Broadcast, Collect) and for making data quanta reusable across the platforms (Cache).

Figure 5: Operator mappings.

Operator Mappings.
To produce an execution plan, Rheem relies on flexible m-to-n mappings from Rheem operators to execution operators. Supporting m-to-n mappings is particularly useful, as it allows whole subplans of Rheem operators to be mapped to subplans of execution operators. Additionally, a subplan of Rheem (or execution) operators can map to another subplan of Rheem (respectively, execution) operators. As a result, we can handle different abstraction levels among platforms, e.g., to emulate Rheem operators that are not natively supported by a specific platform. This is not possible in other systems, such as [28].

Example. Figure 5 illustrates the mappings for the Reduce Rheem operator. This operator directly maps to the Reduce Spark execution operator via a 1-to-1 mapping (mapping (a)). However, it does not have a direct mapping to a JavaStreams execution operator. Instead, it maps to a set of Rheem operators (GroupBy and Map) via a 1-to-n mapping (mapping (b)) and vice versa (n-to-1 mapping (c)). In turn, this set of Rheem operators maps to a set of JavaStreams execution operators (GroupBy and Map) via an m-to-n mapping (mapping (d)).
Data Movement. Data flows among operators via communication channels (or simply channels). A channel can be any internal data structure within a data processing platform (e.g., an RDD for Spark or a Collection for JavaStreams), or simply a file. When two execution operators of different platforms are connected within a plan, it is necessary to convert the output channel of one into the input channel of the other (e.g., from RDD to Collection). These conversions are handled by conversion operators, which are in fact regular execution operators. For example, we can convert a Spark RDD channel into a JavaStreams Collection channel using the SparkCollect operator (see Figure 4(b)). We represent the space of data movement paths across all platforms as a channel conversion graph, where the channels form its vertices and the conversion operators form its directed edges, connecting a source channel to a target channel. Unlike other approaches [28, 32], developers do not need to provide conversion operators for all combinations of source and target channels. It is thus much easier for developers to add new platforms to Rheem.
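The channel conversion graph lends itself to a shortest-path formulation: channels are vertices, conversion operators are weighted edges, and data movement planning becomes a path search. The sketch below uses Dijkstra's algorithm over illustrative channels and costs (SparkCollect appears in the text; the other operator names and all costs are invented for the example):

```python
import heapq

# (source channel, target channel, conversion operator, illustrative cost)
CONVERSIONS = [
    ("RDD",        "Collection", "SparkCollect",     3.0),
    ("Collection", "RDD",        "SparkParallelize", 3.0),
    ("RDD",        "HDFS-File",  "SparkSaveAsText",  5.0),
    ("HDFS-File",  "RDD",        "SparkTextSource",  5.0),
    ("Collection", "LocalFile",  "JavaFileSink",     2.0),
]

def cheapest_conversion(src, dst):
    """Dijkstra over the conversion graph; returns (cost, operator path)."""
    graph = {}
    for s, t, op, c in CONVERSIONS:
        graph.setdefault(s, []).append((t, op, c))
    queue, seen = [(0.0, src, [])], set()
    while queue:
        cost, chan, path = heapq.heappop(queue)
        if chan == dst:
            return cost, path
        if chan in seen:
            continue
        seen.add(chan)
        for nxt, op, c in graph.get(chan, []):
            heapq.heappush(queue, (cost + c, nxt, path + [op]))
    return float("inf"), []

cost, ops = cheapest_conversion("RDD", "LocalFile")
# RDD -> Collection -> LocalFile via SparkCollect then JavaFileSink (cost 5.0)
```

Because paths compose transitively, a developer adding a platform only needs to connect its channels to a few existing ones rather than to every channel in the graph.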
Extensibility. We designed Rheem to address extensibility as a first-class citizen rather than as a “nice-to-have” feature. Users add new Rheem and execution operators by merely extending or implementing a few abstract classes/interfaces. Rheem provides template classes to facilitate the development of different operator types. Users also add operator mappings by simply implementing an interface and specifying a graph pattern that matches the Rheem operator. As a result, users can plug in a new platform by providing: (i) its execution operators and their mappings; and (ii) the communication channels that are specific to the new platform (e.g., RDDChannel for Spark). Users neither have to modify the Rheem code nor integrate the newly added platform with all the already supported platforms.

Figure 6: Rheem's ecosystem and architecture.
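In code, the extension contract roughly amounts to implementing a small operator interface. The sketch below is a loose Python analogy; Rheem's actual template classes are Java/Scala interfaces, and the Postgres filter operator shown here is invented for illustration:

```python
# Plugging in a platform-specific execution operator by implementing a
# minimal interface. Class and method names are illustrative only.
from abc import ABC, abstractmethod

class ExecutionOperator(ABC):
    platform = None                     # e.g., "Postgres", "Spark"

    @abstractmethod
    def execute(self, input_channel):
        """Consume an input channel and produce an output channel."""

class PostgresFilterOperator(ExecutionOperator):
    """Implements a filter with Postgres-specific code by pushing the
    predicate down into SQL (a hypothetical example operator)."""
    platform = "Postgres"

    def __init__(self, predicate_sql):
        self.predicate_sql = predicate_sql

    def execute(self, input_channel):
        # input_channel: a table (or subquery) name living in Postgres.
        return f"SELECT * FROM {input_channel} WHERE {self.predicate_sql}"

op = PostgresFilterOperator("salary > 50000")
sql = op.execute("tax")
# sql == "SELECT * FROM tax WHERE salary > 50000"
```

Together with its operator mappings and channel declarations, such an operator is all a new platform needs to participate in plan enumeration; the existing platforms remain untouched.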
4. RHEEM INTERNALS
In this section, we give the details of the Rheem internals. Figure 6 depicts the Rheem ecosystem, i.e., the Rheem core architecture together with three main applications built on top of it. Users provide a Rheem plan to the system (Step (1) in Figure 6) using the Java, Scala, Python, REST, RheemLatin, or Rheem Studio API (yellow boxes in Figure 6). The cross-platform optimizer compiles the Rheem plan into an execution plan (Step (2)), which specifies the processing platforms to use; the executor schedules the resulting execution plan on the selected platforms (Step (3)); the monitor collects statistics and checks the health of the execution (Step (4)); the progressive optimizer re-optimizes the plan if the cardinality estimates turn out to be inaccurate (Step (5)); and the cost learner helps users build the cost model offline. In the following, we explain each of these components using the pseudocode in Algorithm 1, which shows the entire data processing pipeline.
The cross-platform optimizer (Line 1 in Algorithm 1) is responsible for selecting the most efficient platform for executing each single operator in a Rheem plan. One might think of a rule-based optimizer for selecting the right platforms to execute a given Rheem plan. However, while a rule-based optimizer could determine how to split and execute a plan, e.g., based on its processing patterns [32, 58], it is neither practical nor effective. First, by setting rules, one may make only very simplistic decisions based on the different cardinality and complexity of each operator. Second, the cost of a task on any given platform depends on many input parameters, which hampers a rule-based optimizer's effectiveness as it oversimplifies the problem. Third, as new platforms and applications emerge, maintaining a rule-based optimizer becomes cumbersome.

We thus pursue a more flexible cost-based approach: we split a given Rheem plan into subplans and determine the best platform for each subplan so that the total plan cost is minimized. Figure 4(b) shows how the Rheem plan of Figure 4(a) was split into two subplans to be executed on JavaStreams and Spark. Below, we describe the four main phases of the optimizer, namely plan inflation, cardinality and cost estimation, data movement planning, and plan enumeration. Technical details about these can be found in [39].

At first, the optimizer passes the Rheem plan through an inflation phase. That is, it applies a set of operator mappings as described in Section 3. The optimizer then annotates the inflated plan with the cost of each execution operator.
Rheem represents cost estimates as intervals with a confidence value, which allows it to perform on-the-fly re-optimization, as we will see in Section 4.4.

Algorithm 1: Cross-platform data processing
Input: Rheem plan rheemPlan
1: exPlan ← Optimize(rheemPlan)
2: monitor ← StartMonitor(exPlan)
3: finished ← ExecuteUntilCheckpoint(exPlan, monitor)
4: while ¬finished do
5:     updated ← UpdateEstimates(exPlan, monitor)
6:     if updated then exPlan ← ReOptimize(exPlan)
7:     finished ← ResumeExecution(exPlan, monitor)

The cost (e.g., wallclock time or monetary cost) of an execution operator depends on (i) its resource usage (CPU, memory, disk, and network) and (ii) the unit costs of each resource (e.g., how much one CPU cycle costs). While the unit costs depend on hardware characteristics, the resource usage of each execution operator depends on its input cardinality.

Next, the optimizer looks for the best way to move data quanta among execution operators of different platforms. As noted earlier, we model the problem of finding the most efficient communication path among execution operators as a graph problem, which we proved to be NP-hard. Our solution to this problem relies on kernelization and can discover all ways to connect execution operators of different platforms via a sequence of communication channels. After the best data movement strategy is found, the optimizer attaches the data movement cost to the inflated plan.

At last, the optimizer determines the optimal way of executing a Rheem plan based on the cost estimates of its inflated plan. For this, it must consider the previously computed data movement costs as well as the start-up costs of data processing platforms. Thus, instead of taking a simple greedy approach that neglects data movement and platform start-up costs, we follow a principled approach: we use an enumeration algebra together with a lossless pruning technique. Our pruning technique is guaranteed not to prune a subplan that is part of the optimal execution plan. As a result, the optimizer can output the optimal execution plan without an exhaustive enumeration.
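For illustration, the control flow of Algorithm 1 can be sketched as follows; the function names mirror the pseudocode and are hypothetical stand-ins for Rheem's internal components, not its actual API.

```python
# Hypothetical sketch of Algorithm 1: optimize, execute until an optimization
# checkpoint, and re-optimize whenever the cardinality estimates were updated.

def run(rheem_plan, optimize, start_monitor, execute_until_checkpoint,
        update_estimates, re_optimize, resume_execution):
    ex_plan = optimize(rheem_plan)                         # Line 1
    monitor = start_monitor(ex_plan)                       # Line 2
    finished = execute_until_checkpoint(ex_plan, monitor)  # Line 3
    while not finished:                                    # Line 4
        updated = update_estimates(ex_plan, monitor)       # Line 5
        if updated:
            ex_plan = re_optimize(ex_plan)                 # Line 6
        finished = resume_execution(ex_plan, monitor)      # Line 7
    return ex_plan
```

Passing the components as functions keeps the sketch self-contained; in Rheem they are, of course, stateful system components.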
The executor receives an execution plan from the optimizer to run it on the selected data processing platforms (Lines 3 and 7 in Algorithm 1). For example, the optimizer selected the Spark and JavaStreams platforms for our SGD example in Figure 4(a). Overall, the executor follows well-known approaches to parallelize a task over multiple compute nodes, with only few differences in the way it divides an execution plan. In particular, it divides an execution plan into stages. A stage is a subplan where (i) all its execution operators are from the same platform; (ii) at the end of its execution, the platforms need to give back the execution control to the executor; and (iii) its terminal operators materialize their output data quanta in a data structure, instead of being pipelined into the next operator.

In our SGD example of Figure 4(b), the executor divides the execution plan into six stages, as illustrated in Figure 7. Note that Stage3 contains only the RepeatLoop operator, as the executor must have the execution control to evaluate the loop condition. This is also why the executor separates Stage1 from Stage5. Then, it dispatches the stages to the relevant platform drivers, which in turn submit the stages as a job to the underlying platforms. Stages are connected by data flow dependencies so that stages with no dependencies (e.g., Stage1 and Stage2) are dispatched first in parallel, and any other stage is dispatched once its input dependencies are satisfied (e.g., Stage3 after Stage2).

Figure 7: Stage dependencies for SGD.
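The dependency-driven dispatching can be illustrated with a small sketch; the stage graph below is a simplified stand-in for Figure 7, and the scheduling function is hypothetical rather than Rheem's actual executor code.

```python
# Hypothetical sketch: group stages into dispatch waves, where a stage is
# dispatched as soon as all of its input dependencies are satisfied.

def dispatch_waves(stages, deps):
    """stages: stage names; deps: stage -> set of prerequisite stages.
    Returns lists of stages that can be dispatched in parallel."""
    done, waves, pending = set(), [], set(stages)
    while pending:
        ready = sorted(s for s in pending if deps.get(s, set()) <= done)
        if not ready:
            raise ValueError("cyclic stage dependencies")
        waves.append(ready)
        done.update(ready)
        pending -= set(ready)
    return waves
```

For instance, with Stage3 depending on Stage2, the first wave contains the independent Stage1 and Stage2, mirroring the parallel dispatch described above.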
Data Exploration.
As data exploration is a key piece in the field of data science, the executor optionally allows applications to run in an exploratory mode where they can pause and resume the execution of a task at any point. Achieving this in a cross-platform setting is very challenging, because most platforms, such as Spark, Flink, Giraph, Postgres, and Hadoop, do not support pausing task computations at all, let alone resuming a task from an intermediate state. Thus, the challenge resides in enabling the underlying platforms to support data exploration efficiently. Rheem achieves this by injecting sniffers into execution plans and attaching auxiliary execution plans. A sniffer is an execution operator that duplicates intermediate results and sends them to an auxiliary execution plan. For example, if the user would like to keep track of the weights at each iteration of SGD, a sniffer is necessary right after updating the weights (Stage5 in Figure 7). The sniffer sends the weights to an auxiliary plan that is responsible for reporting them back to the user (the socket sink operator in Figure 7). This auxiliary plan is also responsible for computing and storing additional metadata for efficient task resumption (the map and collection sink operators of the auxiliary plan in Figure 7). When resuming a task, the executor performs the task by re-using as much as possible of the previously computed metadata. For instance, if the user pauses the SGD task at iteration i and resumes it later on, the executor fetches the previously computed weights of iteration i and resumes the task.

Recall that the cross-platform optimizer operates in a setting that is characterized by high uncertainty. For instance, the semantics of UDFs and data distributions are usually unknown because of the little control over the underlying platforms. This uncertainty can cause poor cardinality and cost estimates and hence can negatively impact the effectiveness of the optimizer [42]. To compensate for this uncertainty, Rheem registers the execution of a plan with the monitor (Line 2 in Algorithm 1). The monitor collects lightweight execution statistics for the given plan, such as data cardinalities and operator execution times. It is also aware of lazy execution strategies used by the underlying platforms and assigns measured execution times correctly to operators. Rheem uses these statistics to improve its cost model and to re-optimize ongoing execution plans in case of poor cardinality estimates. Additionally, the monitor is responsible for checking the health of the execution. For instance, if it finds a large mismatch between the real output cardinalities and the estimated ones, it pauses the execution plan and sends it to the progressive optimizer.
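Conceptually, the sniffer described above is a tee on the stream of data quanta: it forwards every quantum unchanged to the main plan while duplicating it into the auxiliary plan. A minimal sketch (hypothetical interface, not Rheem's operator API):

```python
# Hypothetical sketch of a sniffer: duplicate each data quantum into an
# auxiliary sink (e.g., a socket sink) while passing it through unchanged.

def sniffer(quanta, auxiliary_sink):
    for quantum in quanta:
        auxiliary_sink(quantum)  # feed the auxiliary execution plan
        yield quantum            # forward unchanged to the main plan
```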
To mitigate the effects of bad cardinality estimates, Rheem employs a progressive query optimization approach. The key principle is to re-optimize the plan whenever the cardinalities observed by the monitor greatly mismatch the estimated ones [45]. Applying progressive query optimization in our setting comes with two main challenges. First, we have only limited control over the underlying platforms, which makes plan instrumentation and halting executions difficult. Second, re-optimizing an ongoing execution plan must efficiently consider the results already produced.

We tackle these challenges by using optimization checkpoints. An optimization checkpoint tells the executor to pause the plan execution in order to consider a re-optimization of the plan beyond the checkpoint. The progressive optimizer inserts optimization checkpoints into execution plans wherever (i) cardinality estimates are uncertain (having a wide interval or low confidence) or (ii) the data is at rest (e.g., a Java collection or a file). For instance, the optimizer inserts an optimization checkpoint right after Stage1, as the data is at rest because of the Cache operator (see Figure 7). When the executor cannot dispatch a new stage anymore without crossing an optimization checkpoint, it pauses the execution and gives the control to the progressive optimizer. The latter gets the actual cardinalities observed so far by the monitor and re-computes all cardinalities from the current optimization checkpoint (Line 5 in Algorithm 1). In case of a mismatch, it re-optimizes the remainder of the plan (from the current optimization checkpoint) using the new cardinalities (Line 6). It then gives the new execution plan to the executor, which resumes the execution from the current optimization checkpoint (Line 7). Rheem can switch between execution and progressive optimization any number of times at a negligible cost.
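Because cost and cardinality estimates are intervals with a confidence value, the mismatch check at an optimization checkpoint can be sketched as follows; the threshold and function names are illustrative assumptions, not Rheem's actual ones.

```python
# Hypothetical sketch: decide whether to re-optimize at a checkpoint by
# testing the observed cardinality against its estimated interval.

def needs_reoptimization(observed, low, high, confidence, min_confidence=0.5):
    if confidence < min_confidence:
        return True  # the estimate is too uncertain to be trusted
    return not (low <= observed <= high)
```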
Profiling operators in isolation might be unrealistic whenever platforms optimize execution across multiple operators, e.g., by pipelining. Indeed, we found cost functions derived from isolated benchmarking to be insufficiently accurate. We thus take a different approach.
Learning the Cost Model.
Recall that each execution operator o is associated with a number of resource usage functions r_o^m, where m is CPU, memory, disk, or network. For instance, the cost function to estimate the CPU cycles required by the JavaFilter operator is r_JavaFilter^CPU := c_in × (α + β) + δ, where the parameters α and β denote the number of required CPU cycles for each input data quantum in the operator itself and in its UDF, respectively, and the parameter δ describes some fixed overhead for operator start-up and scheduling. We then multiply each of these resource usage functions r_o^m with the time required per unit (e.g., msec per CPU cycle) to get the time estimate t_o^m. The total cost estimate for operator o is defined as: f_o = t_o^CPU + t_o^mem + t_o^disk + t_o^net.

However, obtaining the parameters for each resource, such as the α, β, δ values for CPU, is not trivial. We thus use execution logs to learn these parameters in an offline fashion and model the cost of individual execution operators as a regression problem. Note that the execution logs contain the runtimes of execution stages (i.e., pipelines of operators as defined in Section 4.2) and not of individual operators. Let ({(o_1, C_1), (o_2, C_2), ..., (o_n, C_n)}, t) be an execution stage, where the o_i (0 < i ≤ n) are execution operators, the C_i are the true input and output cardinalities, and t is the measured execution time for the entire stage. Furthermore, let f_i(x, C_i) be the total cost function for execution operator o_i, with x being a vector with the parameters of all resource usage functions (e.g., CPU cycles and disk I/O per data quantum). We are interested in finding x_min = argmin_x loss(t, Σ_{i=1}^n f_i(x, C_i)).

Specifically, we use a relative loss function defined as loss(t, t′) = (|t − t′| + s) / (t + s), where t′ is the geometric mean of the lower and upper bounds of the cost estimate produced by Σ_i f_i(x, C_i), and s is a regularizer inspired by additive smoothing that tempers the loss for small t. Note that we can easily generalize this optimization problem to multiple execution stages: we minimize the weighted arithmetic mean of the losses of multiple execution stages. In particular, we use as stage weights the sum of the relative frequencies of the stages' operators among all stages, so as to deal with skewed workloads that contain certain operators more often than others. Finally, we apply a genetic algorithm [47] to find x_min. In contrast to other optimization algorithms, genetic algorithms impose only few restrictions on the loss function to be minimized. Hence, our cost learner can deal with arbitrary cost functions. Applying this technique allows us to calibrate the cost functions with only little additional effort.

Logs Generation.
Clearly, the more execution logs are available, the better Rheem can tune the cost model. Thus, Rheem comes with a log generator. It first creates a set of Rheem plans by composing all possible combinations of Rheem operators forming a particular topology. We found that most data analytic tasks in practice follow three different topologies: pipeline (e.g., batch tasks), iterative (e.g., ML tasks), and merge (e.g., SPJA tasks). It then generates all possible execution plans for the previously created set of Rheem plans. Next, it creates different configurations for each execution plan, i.e., it varies the UDF complexity, output cardinalities, input dataset sizes, and data types. Once it has generated all possible plans with different configurations, it executes them and logs their runtimes.
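To make the calibration step concrete, the sketch below implements the relative loss from above and fits the parameters of a toy per-quantum cost function to logged stage runtimes. A plain random search stands in for the genetic algorithm of [47], and all shapes and constants are illustrative, not Rheem's actual cost model.

```python
import random

def relative_loss(t, t_pred, s=1.0):
    """Relative loss (|t - t'| + s) / (t + s) with additive-smoothing regularizer s."""
    return (abs(t - t_pred) + s) / (t + s)

def stage_cost(x, cardinalities):
    """Toy stage cost: alpha/beta are per-quantum CPU costs, delta a fixed overhead."""
    alpha, beta, delta = x
    return sum(c * (alpha + beta) + delta for c in cardinalities)

def fit(logs, iterations=2000, seed=0):
    """logs: list of (cardinalities, measured_time); random-search for x_min."""
    rng = random.Random(seed)
    best_x, best_loss = None, float("inf")
    for _ in range(iterations):
        x = [rng.uniform(0.0, 2.0) for _ in range(3)]
        total = sum(relative_loss(t, stage_cost(x, cs)) for cs, t in logs)
        if total < best_loss:
            best_x, best_loss = x, total
    return best_x
```

In Rheem, the fitted parameters calibrate the per-operator resource usage functions; here they merely minimize the summed relative loss over the logged stages.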
5. RHEEM INTERFACES
Rheem provides a set of native APIs for developers to build their applications. These include Java, Scala, Python, and REST. Examples of using these APIs can be found in the Rheem repository (https://github.com/rheem-ecosystem/rheem-benchmark). The code developers have to write is fully agnostic of the underlying platforms. Still, in case the user wants to force Rheem to execute a given operator on a specific platform, she can invoke the withTargetPlatform method. Similarly, she can force the system to use a specific execution operator via the customOperator method, which further enables users to employ custom operators without having to extend the API.

Although the native APIs are quite popular among developers, many users are not proficient with these APIs. Thus, Rheem also provides two APIs that target non-expert users: a data-flow language (RheemLatin) and a visual IDE (Rheem Studio). We explain these interfaces using our SGD example from Figure 4. However, before going into the details of these two interfaces, we first show how one can implement SGD on Rheem using one of its native APIs. The salient feature of all these APIs is that they are platform-agnostic: it is Rheem that figures out on which platform to execute each of the operators.
Let us explain how users can code their applications using one of the native APIs of Rheem. We use the Scala API and our SGD running example (see Listing 1).

  val context = new RheemContext(new Configuration)
    .withPlugin(Spark.basicPlugin)
    .withPlugin(JavaStreams.basicPlugin)
  val plan = new PlanBuilder(context)
  val points = plan.readTextFile("hdfs://myData.csv")
    .map(parsePoints)
  val finalWeights = plan.loadCollection(createRandomWeights())
    .repeat(50, { weights =>
      points.sample(sampleSize).withBroadcast(weights)
        .map(computeGradient())
        .reduce(_ + _)
        .map(updateWeights())
    }).collect()

Listing 1: SGD task using the Scala API.

First, a user creates the Rheem context, where she specifies the available platforms (Lines 1-3): Spark and JavaStreams in this example. She then initializes her Rheem plan with this context (Line 4). Eventually, she creates the graph of Rheem operators that defines the SGD task (Lines 5-13). Note that Rheem plans must have at least one source operator (Line 5), i.e., an operator reading or producing input data quanta, and one sink operator per branch (Line 13), i.e., an operator retrieving or storing the result. Also, observe that this code is fully agnostic of the underlying platforms. For clarity reasons, we did not include the UDF implementations in Listing 1.
Rheem provides a data-flow language (RheemLatin) for users to specify their tasks [44]. Our goal is to provide ease-of-use without compromising expressiveness. RheemLatin follows a procedural programming style to naturally fit the pipeline paradigm of Rheem. This is similar to the R language, which is quite popular among data scientists. It draws its inspiration from PigLatin [48] and hence it has PigLatin's grammar and supports most of PigLatin's keywords. In fact, one could see it as an extension of PigLatin for cross-platform settings. For example, users can specify the platform for any part of their queries. More importantly, it provides a set of configuration files whereby users can add new keywords to the language together with their mappings to Rheem operators. As a result, users can easily adapt RheemLatin for their applications. Listing 2 illustrates how one can express our SGD example in RheemLatin.

  import '/sgd/udfs.class' as taggedPointCounter;
  lines = load 'hdfs://myData.csv';
  points = map lines -> {taggedPointCounter.parsePoints(lines)};
  weights = load taggedPointCounter.createWeights();
  final_weights = repeat 50 {
    sample_points = sample points -> {taggedPointCounter.getSample()} with broadcast weights;
    gradient = map sample_points -> {taggedPointCounter.computeGradient()};
    gradient_sum_count = reduce gradient -> {gradient.sumcount()};
    weights = map gradient_sum_count -> {gradient_sum_count.average()} with platform 'JavaStreams'; }
  store final_weights 'hdfs://output/sgd';

Listing 2: SGD task in RheemLatin.

The user starts by importing all her required UDFs (Line 1). She then parses all the data points from the input dataset (Lines 2 and 3) and initializes the weights (Line 4). Next, she proceeds to perform the core of SGD: she takes a sample of data points (Line 6), computes the gradient for each sampled data point (Line 7), updates the weights (Lines 8 and 9), and repeats the process 50 times (Line 5). She can also repeat such a core process until convergence by using WhileLoop instead of Repeat. Optionally, she can specify the platform for any part of her query. For instance, she might know that updating the weights in each iteration is a lightweight computation and hence might specify to use JavaStreams (Line 9). She finishes by storing the final weights on HDFS (Line 10).
Although the native APIs and RheemLatin cover a large number of users, some might still be unfamiliar with programming and data-flow languages. Also, some other users may simply desire to speed up the process of composing their data analytic tasks. To this end, Rheem provides a visual IDE (Rheem Studio) where users can compose their data analytic tasks in a drag-and-drop fashion [44]. Figure 8 shows the Rheem Studio's GUI. The GUI is composed of four parts: a panel containing all Rheem operators, the drawing surface, a console for writing RheemLatin queries, and the output terminal. Users can draw a plan by simply dragging as many Rheem operators as required from the left-side panel and dropping them on the drawing surface. They consequently connect the operators as required by their data analytic task. The right side of Figure 8 shows how operators are connected for SGD. While connecting operators, the studio validates such connections and gives feedback to users in case a connection cannot be established, e.g., when the output and input of two connected operators are of different data types. Last but not least, the studio provides default implementations for any of the Rheem operators, which enables users to run common data analytic tasks without writing a single line of code. Yet, expert users can provide a UDF by double-clicking on any operator.

Figure 8: SGD task in the Rheem Studio.
6. EXAMPLES OF RHEEM PLANS
We now provide three detailed examples of how users can implement their tasks using the Scala native API and the RheemLatin interface. For this, we consider three popular data analytic tasks: WordCount (a well-known aggregate task), K-means (a very representative iterative task), and PolyJoin (a common task over different data sources).

Users start their Rheem plans in Scala with a preamble that defines the context and the platforms to be used, as shown in Listing 3. For the sake of presentation, we do not include this preamble in our Scala code examples below.

  val context = new RheemContext(new Configuration)
    .withPlugin(Spark.basicPlugin)
    .withPlugin(JavaStreams.basicPlugin)
  val plan = new PlanBuilder(context)
Listing 3: Preamble in the Scala API.
WordCount is an aggregate task that computes the frequency with which each word appears in a dataset. Listing 4 shows the RheemLatin query for this task: Line 1 imports all the required UDFs; Line 2 loads the input data; Lines 3 and 4 parse the words and convert them into records; Line 5 computes the frequency of each word; and Line 6 stores the final word count on disk. Note that users naturally define the flow of their analytical tasks with RheemLatin. Alternatively, users can implement this task using one of the native APIs of Rheem. Listing 5 shows the Scala code for this task. Similar to the RheemLatin query, the Scala code keeps the plan composition simple.

  import '/wordcount/udfs.class' as wordcount;
  lines = load 'hdfs://myWords.txt';
  words = flatmap lines -> {wordcount.splitWords()};
  tuples = map words -> {wordcount.convert2Tuple()};
  adds = reduce tuples -> {wordcount.getWord()}, tuples -> {wordcount.reduce()};
  store adds '/output/wordcount';

Listing 4: Word Count task in RheemLatin.
K-means is a widely used ML task for clustering data points together according to their similarity. We show the RheemLatin query in Listing 6. In contrast to the WordCount task, this task is iterative (Lines 4-7). We observe that defining loops in RheemLatin is quite similar to coding in a high-level language (e.g., Scala), which makes it intuitive for most users. Listing 7 shows its counterpart in Scala.

  val words = plan.readTextFile("hdfs://myWords.csv")
    .flatMap(_.split("\\W+"))
    .map(word => (word.toLowerCase, 1))
    .reduceByKey(_._1, (c1, c2) => (c1._1, c1._2 + c2._2))
    .collect()

Listing 5: Word Count task using the Scala API.

  lines = load 'hdfs://myPoints.txt';
  points = map lines -> kmeans.parsePoints();
  centroids = load 'hdfs://myInitialCentroids.txt';
  final_centroids = repeat centroids AS current_centroid for {
    distance = map points -> kmeans.selectNearestCentroid() with broadcast current_centroid;
    centroids_sum = reduce distance -> kmeans.reduce();
    new_centroids = map centroids_sum -> kmeans.average(); }
  store final_centroids 'hdfs:///output/kmeans';

Listing 6: K-means task in RheemLatin.

  val points = plan.readTextFile("hdfs://myPoints.csv")
    .map(createPoints)
  val initialCentroids =
    plan.loadCollection(Kmeans.createRandomCentroids(k))
  val finalCentroids = initialCentroids.repeat(iterations, { currentCentroids =>
    val newCentroids = points
      .mapJava(new SelectNearestCentroid)
      .withBroadcast(currentCentroids, "centroids")
      .reduceByKey(_.centroidId, _ + _)
      .map(_.average)
    newCentroids
  })
  finalCentroids.collect()

Listing 7: K-means task using the Scala API.
PolyJoin is a common task in polystore scenarios, i.e., joining several datasets from different data sources. In this case, we consider TPC-H Q5 and assume that: the region, suppliers, and customers relations are on Postgres; the nations relation is on the local file system; and the orders and lineitems relations are on HDFS. Despite the complexity of this query, we observe that the RheemLatin query (Listing 8) and the Scala code (Listing 9) are still simple, as they follow the logical flow of the task itself. Lines 1-7 in Listing 8 load the datasets, Lines 8-12 select and project the required tuples, and Lines 13-22 join the resulting tuples before performing the group-by in Line 23.
7. RHEEM VS. MUSKETEER
We experimentally compare Rheem with its closest competitor, Musketeer [32]. More experiments concerning the optimizer can be found in [40].
Setup.
We ran our experiments on a cluster of 10 machines. Each node has one 2 GHz Quad Core Xeon processor, 32 GB main memory, 500 GB SATA hard disks, and a 1 Gigabit network card, and runs 64-bit Linux Ubuntu 14.04.05. In Rheem, we used the following platforms: Java's Stream library (JavaStreams), Spark 1.6.0 (Spark), Flink 1.3.2 (Flink), GraphX 1.6.0 (GraphX), Giraph 1.2.0 (Giraph), a Java graph library (JGraph), and HDFS 2.6.0 to store files. We used all these platforms with their default settings and configured the maximum RAM of each platform to 20 GB. We disabled the Rheem stage parallelization feature to have only one single platform running at any time. We obtained all the cost functions required by our optimizer as described in Section 4.5. We considered the cross-community pagerank task (CrocoPR), because the authors reported this task to be a case where Musketeer chooses multiple platforms. Note that, for fairness reasons, we perform the data preparation part of CrocoPR (i.e., the union of the different communities' pages) as a separate script for Musketeer. This is because its language (Mindi) is not optimized for dealing with UDFs, and hence it would be much slower to provide the data preparation as a UDF. In contrast, Rheem seamlessly performs both parts (data preparation and page rank) as a single task. We used the DBpedia pagelinks dataset (20 GB).

  import '/polyjoin/udfs.class' as polyjoin;
  region = load 'postgres:///tpch/region';
  suppliers = load 'postgres:///tpch/suppliers';
  customers = load 'postgres:///tpch/customers';
  nations = load 'file:///nations' delimiter '|';
  orders = load 'hdfs:///orders' delimiter '|';
  lineitems = load 'hdfs:///lineitems' delimiter '|';
  region_filter = filter region[1] == 'ASIA';
  region_project = map region_filter -> {polyjoin.projectRecord(0, 1)};
  suppliers_project = map suppliers -> {polyjoin.projectRecord(0, 3)};
  customers_project = map customers -> {polyjoin.projectRecord(0, 3)};
  order_filter = filter orders -> {polyjoin.isBetween(4, '1994-…')};
  join1 = join nations[2], region_project[0];
  map_join1 = map join1 -> {polyjoin.tuple2Record(0, 0, 0, 1)};
  join2 = join map_join1[0], customers_project[1];
  map_join2 = map join2 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 0)};
  join3 = join map_join2[2], order_filter[0];
  map_join3 = map join3 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 0)};
  join4 = join map_join3[2], lineitems[0];
  map_join4 = map join4 -> {polyjoin.tuple2Record(0, 0, 0, 1, 1, 2, 1, 5, 1, 6)};
  join5 = join map_join4 -> {polyjoin.record2Tuple(2, 0)}, suppliers_project -> {polyjoin.record2Tuple(0, 1)};
  map_join5 = map join5 -> {polyjoin.tuple2Record(0, 1, 0, 3, 0, 4)};
  groupBy = groupby map_join5[0];
  store groupBy '/output/polyjoin';

Listing 8: PolyJoin task in RheemLatin.

  val regions: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/region")
      .map(createRecord(_))
      .filter((r: Record) => r.getString(1) == "ASIA")
      .map(projectRecord(_, 0, 1))
  val suppliers: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/supplier")
      .map[Record](createRecord(_))
      .map[Record](projectRecord(_, 0, 3))
  val customers: DataQuanta[Record] =
    plan.readTable("postgres:///tpch/customer")
      .map[Record](createRecord(_))
      .map(projectRecord(_, 0, 3))
  val nations: DataQuanta[Record] =
    plan.readTextFile("file:///nation")
      .map(createRecord(_))
  val orders: DataQuanta[Record] =
    plan.readTextFile("hdfs:///order")
      .map(createRecord(_))
      .filter(isBetween(_, 4, fromDate, toDate))
  val lineitems: DataQuanta[Record] =
    plan.readTextFile("hdfs:///lineitem")
      .map(createRecord(_))

  nations.join(getColumn(_, 2), regions, getColumn(_, 0))
    .map(tuple2Record(_, 0, 0, 0, 1))
    .join(getColumn(_, 0), customers, getColumn(_, 1))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 0))
    .join(getColumn(_, 2), orders, getColumn(_, 1))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 0))
    .join(getColumn(_, 2), lineitems, getColumn(_, 0))
    .map(tuple2Record(_, 0, 0, 0, 1, 1, 2, 1, 5, 1, 6))
    .join[Record, Tuple2[String, String]](record2Tuple(_, 2, 0),
      suppliers, record2Tuple(_, 0, 1))
    .map(tuple2Record(_, 0, 1, 0, 3, 0, 4))
    .groupByKey((r: Record) => r.getField(0))
    .collect()

Listing 9: PolyJoin task using the Scala API.

Figure 9: Rheem outperforms Musketeer by more than one order of magnitude.
Results.
Figure 9 shows the results in log scale when varying the dataset size for 10 iterations and when varying the number of iterations for 10% of the dataset. Overall, we observe the superiority of Rheem over Musketeer, especially as the number of iterations increases: Rheem is up to 85 times faster than Musketeer. Note that, in contrast to Musketeer, Rheem keeps its runtime constant as the number of iterations increases. This is because: (i) Musketeer, among other things, checks dependencies, compiles and packages the code, and writes the output to HDFS at each iteration (or stage), which comes with a high overhead; and (ii) Rheem executes the page rank part of the task (i.e., after the data preparation) on JavaStreams, which allows it to perform each iteration with almost zero overhead.
8. LIMITATIONS
As of now, Rheem does not support any stream processing platform. While users can easily supply new batch processing platforms, stream processing requires extending Rheem's core. We plan to do so by following the lambda architecture paradigm [46]. In addition, Rheem currently relies on the fault-tolerance of the underlying platforms and is thus susceptible to failures while moving data across platforms. We plan to incorporate a basic fault-tolerance mechanism at the cross-platform level. Other remaining issues include: adding methods that speed up inter-platform communication, such as the one proposed in [33]; integrating Rheem with resource managers to incorporate changes in the availability of computing resources; and supporting the simultaneous execution of Rheem jobs.
9. RELATED WORK
The research and industry communities have proposed a myriad of different data processing platforms [5, 8, 11, 18, 25, 59]. In contrast, we do not provide a data processing platform but a novel system on top of them.

Cross-platform data processing has been in the spotlight only very recently. Some works focus only on integrating different data processing platforms with the goal of alleviating users from their intricacies [1, 2, 10, 12, 29]. However, they still require expertise from users to decide when to use a specific data processing platform. For example, BigDAWG [29] requires users to specify where to run tasks via its Scope and Cast commands, which already require expertise from users. Only few works share a similar goal with us [28, 32, 43, 55, 58]. However, they substantially differ from Rheem. Two main differences are that they consider neither data movement costs nor progressive task optimization techniques, although both aspects are crucial in cross-platform settings. Additionally, each of these works differs from Rheem in various ways. As Musketeer's main goal is to decouple front-end languages (e.g., SQL and PigLatin) from the underlying platforms [32], it is not as expressive and extensible as Rheem. Furthermore, as it maps task patterns to specific underlying platforms, it is not clear how one can efficiently map a task when having similar platforms (e.g., Spark vs. Flink or Postgres vs. MySQL). Similarly, in Myria [58], it is hard to allocate tasks when having similar platforms because it comes with a rule-based optimizer, which additionally makes it hard to maintain. IReS [28] supports only 1-to-1 mappings between abstract tasks and their implementations, which limits expressiveness and optimization opportunities. Moreover, it assumes direct data movement paths between platforms, which is hard to maintain for many platforms. QoX [55] focuses only on ETL workloads. DBMS+ [43] is limited by the expressiveness of its declarative language and hence it is neither adaptive nor extensible. Other complementary works focus on improving data movement across different platforms [33] or libraries by using a common intermediate representation and executing the scripts in LLVM [49], but none of them addresses the cross-platform optimization problem. TensorFlow [15] follows a similar idea, but for the cross-device execution of machine learning tasks, and is thus orthogonal to Rheem. In fact, Rheem could use TensorFlow as an underlying platform.

The research community has also studied the problem of federating relational databases [54]. Garlic [22], TSIMMIS [23], and InterBase [21] are just three examples. However, all these works significantly differ from Rheem in that they consider a single data model and simply push query processing to where the data is. Other works integrate Hadoop with an RDBMS [27, 41]; however, one cannot easily extend them to deal with more diverse tasks and platforms.
10. CONCLUSION
Given today's data analytics ecosystem, supporting cross-platform data processing has become crucial for organizations. We have identified four different situations in which an application requires or benefits from cross-platform data processing. Driven by these cases, we built Rheem, a cross-platform system that decouples applications from data processing platforms to achieve efficient task execution over multiple platforms. Rheem follows a cost-based optimization approach: it splits an input task into subtasks and assigns each subtask to a specific platform such that the overall cost (e.g., runtime or monetary cost) is minimized. Our experience while building Rheem raised several interesting questions that need to be addressed in the future, namely: How can we (i) reduce inter-platform data movement costs? (ii) address the cardinality and cost estimation problem? (iii) efficiently support fault tolerance across platforms? (iv) add new platforms automatically? and (v) improve data exploration in cross-platform settings?
11. REFERENCES
[1] Apache Beam. https://beam.apache.org.
[2] Apache Drill. https://drill.apache.org.
[3] Apache Flink. https://flink.apache.org.
[4] Apache Flume. https://flume.apache.org/index.html.
[5] Apache HBase. http://hbase.apache.org/.
[6] Apache Hive: A data warehouse software for distributed storage. http://hive.apache.org.
[7] Apache Mahout. http://mahout.apache.org.
[8] Apache Spark: Lightning-Fast Cluster Computing. http://spark.incubator.apache.org/.
[9] Fortune magazine. http://fortune.com/2014/06/19/big-data-airline-industry/.
[10] Luigi Project. https://github.com/spotify/luigi.
[11] PostgreSQL.
[12] PrestoDB Project. https://prestodb.io.
[13] Spark MLlib. http://spark.apache.org/mllib.
[14] Spark SQL programming guide. http://spark.apache.org/docs/latest/sql-programming-guide.html.
[15] M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, pages 265–283, 2016.
[16] D. Agrawal, L. Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and M. Zaki. Rheem: Enabling Multi-Platform Task Execution. In SIGMOD, pages 2069–2072, 2016.
[17] D. Agrawal et al. Road to Freedom in Big Data Analytics. In EDBT, pages 479–484, 2016.
[18] A. Alexandrov et al. The Stratosphere platform for big data analytics. VLDB J., 23(6):939–964, 2014.
[19] A. Baaziz and L. Quoniam. How to use big data technologies to optimize operations in upstream petroleum industry. In World Petroleum Congress, 2014.
[20] M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative Machine Learning on Spark. PVLDB, 9(13):1425–1436, 2016.
[21] O. A. Bukhres et al. InterBase: An Execution Environment for Heterogeneous Software Systems. IEEE Computer, 26(8):57–69, 1993.
[22] M. J. Carey et al. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In RIDE-DOM, pages 124–131, 1995.
[23] S. S. Chawathe et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In IPSJ, pages 7–18, 1994.
[24] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: A commodity data cleaning system. In SIGMOD, pages 541–552, 2013.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 2008.
[26] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The Data Civilizer System. In CIDR, 2017.
[27] D. J. DeWitt, A. Halverson, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in Polybase. In SIGMOD, pages 1255–1266, 2013.
[28] K. Doka, N. Papailiou, V. Giannakouris, D. Tsoumakos, and N. Koziris. Mix 'n' match multi-engine analytics. In IEEE BigData, pages 194–203, 2016.
[29] A. J. Elmore et al. A Demonstration of the BigDAWG Polystore System. PVLDB, 8(12):1908–1911, 2015.
[30] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1–6:48, 2008.
[31] R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. A Demo of the Data Civilizer System. In SIGMOD, pages 1639–1642, 2017.
[32] I. Gog et al. Musketeer: all for one, one for all in data processing systems. In EuroSys, 2015.
[33] B. Haynes, A. Cheung, and M. Balazinska. PipeGen: Data Pipe Generator for Hybrid Analytics. In SoCC, pages 470–483, 2016.
[34] A. Hems, A. Soofi, and E. Perez. How innovative oil and gas companies are using big data to outmaneuver the competition. Microsoft White Paper, http://goo.gl/2Bn0xq, 2014.
[35] IBM. Data-driven healthcare organizations use big data analytics for big gains. White paper, http://goo.gl/AFIHpk.
[36] Z. Kaoudi, J.-A. Quiané-Ruiz, S. Thirumuruganathan, S. Chawla, and D. Agrawal. A Cost-based Optimizer for Gradient Descent Optimization. In SIGMOD, 2017.
[37] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. BigDansing: A System for Big Data Cleansing. In SIGMOD, pages 1215–1230, 2015.
[38] Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and P. Kalnis. Lightning Fast and Space Efficient Inequality Joins. PVLDB, 8(13):2074–2085, 2015.
[39] S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle – A Cross-Platform Query Optimizer. arXiv:1805.03533, https://arxiv.org/abs/1805.03533, 2018.
[40] S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle – A Cross-Platform Query Optimizer. arXiv:1805.03533, https://arxiv.org/abs/1805.03533, 2018.
[41] J. LeFevre, J. Sankaranarayanan, H. Hacıgümüş, J. Tatemura, N. Polyzotis, and M. J. Carey. MISO: souping up big data query processing with a multistore system. In SIGMOD, pages 1591–1602, 2014.
[42] V. Leis et al. How good are query optimizers, really? PVLDB, 9(3):204–215, 2015.
[43] H. Lim, Y. Han, and S. Babu. How to Fit when No One Size Fits. In CIDR, 2013.
[44] J. Lucas, Y. Idris, B. Contreras-Rojas, J.-A. Quiané-Ruiz, and S. Chawla. Cross-Platform Data Analytics Made Easy. In ICDE, 2018.
[45] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In SIGMOD, pages 659–670, 2004.
[46] N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.
[47] M. Mitchell. An introduction to genetic algorithms. MIT Press, 1998.
[48] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In SIGMOD, pages 1099–1110, 2008.
[49] S. Palkar, J. J. Thomas, A. Shanbhag, M. Schwarzkopf, S. P. Amarasinghe, and M. Zaharia. Weld: A Common Runtime for High Performance Data Analysis. In CIDR, 2017.
[50] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165–178, 2009.
[51] J.-A. Quiané-Ruiz and Z. Kaoudi. Cross-Platform Query Processing. In ICDE (tutorial), 2018.
[52] P. J. Sadalage and M. Fowler. NoSQL distilled: A brief guide to the emerging world of polyglot persistence. Addison-Wesley Professional, 2012.
[53] S. Shankar, A. Choi, and J.-P. Dijcks. Integrating Hadoop Data with Oracle Parallel Processing. Oracle White Paper, 2010.
[54] A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183–236, 1990.
[55] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. Optimizing Analytic Data Flows for Multiple Execution Engines. In SIGMOD, pages 829–840, 2012.
[56] M. Stonebraker. The Case for Polystores. http://wp.sigmod.org/?p=1629, 2015.
[57] D. Tsoumakos and C. Mantas. The Case for Multi-Engine Data Analytics. In Euro-Par, pages 406–415, 2013.
[58] J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR, 2017.
[59] F. Yang, J. Li, and J. Cheng. Husky: Towards a More Efficient and Expressive Distributed Computing Framework.