Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
casper.uwplse.org
Maaz Bin Safeer Ahmad
University of Washington [email protected]
Alvin Cheung
University of Washington [email protected]
ABSTRACT
MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. We present Casper, a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2× faster compared to the original.

ACM Reference Format:
Maaz Bin Safeer Ahmad and Alvin Cheung. 2018. Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications: casper.uwplse.org. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196891
MapReduce [21], a popular paradigm for developing data-intensive applications, has varied and highly efficient implementations [4, 5, 8, 36]. All these implementations expose an application programming interface (API) to developers. While the concrete syntax differs slightly across the different APIs, they all require developers to organize their computation into map and reduce stages in order to leverage their optimizations.

While exposing optimization via an API shields application developers from the complexities of distributed computing, this approach contains a major drawback: for legacy applications to leverage MapReduce frameworks, developers must first understand the
existing code's function and subsequently re-organize the computation using mappers and reducers. Similarly, novice programmers unfamiliar with the MapReduce paradigm must first learn the different APIs in order to express their computation accordingly. Both require a significant expenditure of time and effort. Further, each code rewrite or algorithm reformulation opens another opportunity to introduce bugs.

One way to alleviate these issues is to build a compiler that translates code written in another paradigm (e.g., imperative code) into MapReduce. Classical compilers, like logical-to-physical query plan compilers [29], use pattern matching rules, i.e., the compilers contain a number of rules that recognize different input code patterns (e.g., a sequential loop over lists) and translate the matched code fragment into the target (e.g., a single-stage map and reduce). As in query compilers, designing the rules is challenging: they must be both correct, i.e., the translated code should have the same semantics as the input, and sufficiently expressive to capture the wide variety of coding patterns that developers use to express their computations. We are aware of only one such compiler that translates imperative Java programs into MapReduce [38], and the number of rules involved in that compiler makes it difficult to maintain and modify.

This paper describes a new tool, Casper, that translates sequential Java code into semantically equivalent MapReduce programs. Rather than relying on rules to translate different code patterns, Casper is inspired by prior work on cost-based query optimization [41], which considers compilation to be a dynamic search problem. However, given that the inputs are general-purpose programs, the space of possible target programs is much larger than it is for query optimization.
To address this issue, Casper leverages recent advances in program synthesis [11, 25] to search for MapReduce programs into which it can rewrite a given input sequential Java code fragment. To reduce the search space, Casper searches over the space of program summaries, which are expressed using a high-level intermediate language (IR) that we designed. As we discuss in §3.1, the IR's design succinctly expresses computations in the MapReduce paradigm yet remains sufficiently easy to translate into the concrete syntax of the target API.

To search for summaries, Casper first performs lightweight program analysis to generate a description of the space of MapReduce programs that a given input code fragment might be equivalent to. The search space is also described using our high-level IR. Casper then uses an off-the-shelf program synthesizer to perform the search, guided by an incremental search algorithm and our domain-specific cost model to speed the process. A theorem prover is used to check whether the found program summary is indeed semantically equivalent to the input. Once proved, the summary is translated into the concrete syntax of the target MapReduce API.
Since the performance of the translated program often depends on input data characteristics (e.g., skewness), Casper generates multiple semantically equivalent MapReduce programs for a given input and produces a monitor module that switches among them based on runtime statistics; the monitor and switcher are automatically generated during compilation.

Compared to prior approaches, Casper does not require compiler developers to design or maintain any pattern matching rules. Furthermore, the entire translation process is completely automatic. We evaluated Casper using a number of benchmarks and real-world Java applications and demonstrated both Casper's ability to translate an input program into MapReduce equivalents and the significant performance improvements that result.

In summary, our paper makes the following contributions:

• We propose a new high-level intermediate representation (IR) to express the semantics of sequential Java programs in the MapReduce paradigm. The language is succinct enough to be easily translated into multiple MapReduce APIs, yet expressive enough to describe the semantics of many real-world benchmarks written in a general-purpose language. Furthermore, programs written in our IR can be automatically checked for correctness using a theorem prover (§4.1). The IR, being a high-level language, also lets us perform various semantic optimizations using our cost model (§5).

• We describe an efficient search technique for program summaries expressed in the IR without requiring any pattern matching rules. Our technique is both sound and complete with respect to the input search space. Unlike classical compilers, which rely on pattern matching to drive translation, our technique leverages program synthesis to dynamically search for summaries. Our technique is novel in that it incrementally searches for summaries based on cost. It also uses verification failures to systematically prune the search space and a hierarchy of search grammars to speed the summary search.
This lets us translate benchmarks that have not been translated in any prior work (§4.1).

• There are often multiple ways to express the same input as MapReduce programs. Therefore, our technique can generate multiple semantically equivalent MapReduce versions of the input. It also automatically inserts code that collects statistics during program execution to adaptively switch among the different generated versions (§5.2).

• We implemented our methodology in Casper, a tool that converts sequential Java programs into three MapReduce implementations: Spark, Hadoop, and Flink. We evaluated the feasibility and effectiveness of Casper by translating real-world benchmarks from 7 different suites from multiple domains. Across 55 benchmarks, Casper translated 82 of 101 code fragments. The translated benchmarks performed up to 48.2× faster compared to the original ones and were competitive even with other distributed implementations, including manual ones (§7).

This section describes how we model the MapReduce programming paradigm and demonstrates by example how Casper translates sequential code into MapReduce programs.

@Summary(
  m = map(reduce(map(mat, λm1), λr), λm2)
  λm1: (i, j, v) → {(i, v)}
  λr:  (v1, v2) → v1 + v2
  λm2: (k, v) → {(k, v / cols)}
)
int[] rwm(int[][] mat, int rows, int cols) {
  int[] m = new int[rows];
  for (int i = 0; i < rows; i++) {
    int sum = 0;
    for (int j = 0; j < cols; j++)
      sum += mat[i][j];
    m[i] = sum / cols;
  }
  return m;
}
(a) Input: Sequential Java code

RDD rwm(RDD mat, int rows, int cols) {
  RDD m = mat.mapToPair(e -> new Tuple(e.i, e.v));
  m = m.reduceByKey((v1, v2) -> (v1 + v2));
  m = m.mapValues(v -> (v / cols));
  return m;
}
(b) Output: Apache Spark code

Figure 1: Using Casper to translate the row-wise mean benchmark to MapReduce (Spark).
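To make the correspondence in Figure 1 concrete, the following self-contained sketch (our own code, not Casper's output; a plain Java map stands in for Spark's keyed RDD) checks that the summary's map → reduce → map pipeline computes the same vector as the sequential loop:

```java
import java.util.*;

public class RowWiseMeanCheck {
    // Sequential version, as in Figure 1(a).
    public static int[] rwmSeq(int[][] mat, int rows, int cols) {
        int[] m = new int[rows];
        for (int i = 0; i < rows; i++) {
            int sum = 0;
            for (int j = 0; j < cols; j++) sum += mat[i][j];
            m[i] = sum / cols;
        }
        return m;
    }

    // MapReduce-style version mirroring the summary in Figure 1(a):
    // λm1 keys each matrix entry by its row, λr sums values per key,
    // and λm2 divides each per-row sum by cols.
    public static int[] rwmMR(int[][] mat, int rows, int cols) {
        Map<Integer, Integer> sums = new HashMap<>();
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                sums.merge(i, mat[i][j], Integer::sum); // λm1 + reduce(·, λr)
        int[] m = new int[rows];
        for (Map.Entry<Integer, Integer> e : sums.entrySet())
            m[e.getKey()] = e.getValue() / cols;        // λm2: (k, v) → (k, v / cols)
        return m;
    }
}
```

On any input the two versions agree, which is exactly the equivalence the synthesized summary asserts.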
MapReduce organizes computation using two operators: map and reduce. The map operator has the following type signature:

map : (mset[τ], λm) → mset[(κ, ν)]
λm : τ → mset[(κ, ν)]

The input to map is a multiset (i.e., bag) of type τ and a unary transformer function λm, which converts a value of type τ into a multiset of key-value pairs of types κ and ν. The map operator then concurrently applies λm to every element in the multiset and returns the union of all multisets generated by λm.

reduce : (mset[(κ, ν)], λr) → mset[(κ, ν)]
λr : (ν, ν) → ν

The input to reduce is a multiset of key-value pairs and a binary transformer function λr, which combines two values of type ν to produce a final value. The reduce operator first groups all key-value pairs by key (also known as shuffling) and then uses λr to combine, in parallel, the bag of values for each key-group into a single value. The output of reduce is another multiset of key-value pairs, where each pair holds a unique key. If the transformer function λr is commutative and associative, then reduce can be further optimized by concurrently applying λr to pairs of values in a key-group.

Casper's goal is to translate a sequential code fragment into a MapReduce program that is expressed using the map and reduce operators. The challenges in doing so are: (1) identifying the correct sequence of operators to apply, and (2) implementing the corresponding transformer functions. We next discuss how Casper overcomes these challenges.

Figure 2: Casper's system architecture. Sequential code fragments (green) are translated into MapReduce tasks (orange).

Casper takes in Java code with loop nests that sequentially iterate over data and translates the code into a semantically equivalent MapReduce program to be executed by the target framework.
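The semantics of the two operators above can be sketched in plain Java; this is our own illustrative code (the names MapReduceOps and KV are ours), using lists to stand in for multisets:

```java
import java.util.*;
import java.util.function.*;

public class MapReduceOps {
    // A key-value pair of types K and V.
    public static class KV<K, V> {
        public final K k; public final V v;
        public KV(K k, V v) { this.k = k; this.v = v; }
    }

    // map: applies the unary transformer λm to every element and
    // returns the union of the emitted key-value multisets.
    public static <T, K, V> List<KV<K, V>> map(
            List<T> data, Function<T, List<KV<K, V>>> lm) {
        List<KV<K, V>> out = new ArrayList<>();
        for (T t : data) out.addAll(lm.apply(t));
        return out;
    }

    // reduce: groups pairs by key (the shuffle), then folds each
    // key-group's values with the binary transformer λr.
    public static <K, V> List<KV<K, V>> reduce(
            List<KV<K, V>> pairs, BinaryOperator<V> lr) {
        Map<K, V> groups = new LinkedHashMap<>();
        for (KV<K, V> p : pairs)
            groups.merge(p.k, p.v, lr);
        List<KV<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> e : groups.entrySet())
            out.add(new KV<>(e.getKey(), e.getValue()));
        return out;
    }
}
```

For example, mapping each integer x in {1, 2, 3} to the pair (x mod 2, x) and reducing with addition yields one summed value per key, mirroring the shuffle-then-combine behavior described above.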
To demonstrate, we show how Casper translates a real-world benchmark from the Phoenix suite [39]. As shown in Figure 1(a), the benchmark takes as input a matrix (mat) and computes, using nested loops, the column vector (m) containing the mean value of each row in the matrix. Assume the code is annotated with a program summary that helps with the translation into MapReduce. The program summary, written using a high-level intermediate representation (IR), describes how the output of the code fragment (i.e., m) can be computed using a series of map and reduce stages from the input data (i.e., mat), as shown in lines 1 to 5 in Figure 1(a). While the summary is not executable, translating from it into the concrete syntax of a MapReduce framework (say, Spark) is much easier than translating from the original input code. This is shown in Figure 1(b), where the map and reduce primitives from our summary are translated into the corresponding Spark API calls.

Unfortunately, the input code does not come with such a summary, which must therefore be inferred. Casper does this via program synthesis and verification, as we explain in §3.

Figure 2 shows Casper's overall design. We now discuss the three primary modules that make up Casper's compilation pipeline. First, the program analyzer parses the input code into an Abstract Syntax Tree (AST) and uses static program analysis to identify code fragments for translation (§6.1). In addition, for each identified code fragment, it prepares: (1) a search space description, encoded using our high-level IR, that lets the synthesizer search for a valid program summary (§3.1), and (2) verification conditions (VCs) (§3.3) to automatically ascertain that the inferred program summary is semantically equivalent to the input. Next, the summary generator synthesizes and verifies program summaries (§3.4 and §4.1).
To speed up the search, it partitionsthe search space so that it can be efficiently traversed using ourincremental synthesis algorithm (§4.2).Once a summary is inferred, the code generator translates itinto executable code. Casper currently supports three MapReduce PS : = ∀ v . v = MR | ∀ v . v = MR [ v id ] MR : = map ( MR , λ m ) | reduce ( MR , λ r ) | join ( MR , MR ) | dataλ m : = f : ( val ) → { Emit } λ r : = f : ( val , val ) → ExprEmit : = emit ( Expr , Expr ) | if ( Expr ) emit ( Expr , Expr ) | if ( Expr ) emit ( Expr , Expr ) else EmitExpr : = Expr op Expr | op Expr | f ( Expr , Expr , ... ) | n | var | ( Expr , Expr ) v ∈ Output V ariablesop ∈ Operators v id ∈ V ariable I D , f ∈ Library Methods
Figure 3: Excerpt of the IR for program summaries (PSs), afull description of which is provided in Appendix B. frameworks: Spark, Hadoop, and Flink. Additionally, this compo-nent also generates code that collects data statistics to adaptivelychoose among different implementations during runtime (§5.2).
As discussed, Casper discovers a program summary for each code fragment before translation. Technically, a program summary is a postcondition [26] of the input code that describes the program state after the code fragment is executed. In this section, we explain: (1) the IR Casper uses to express summaries, (2) how Casper verifies a summary's validity, and (3) the search algorithm Casper uses to find valid summaries given a search space description.
One approach to synthesizing summaries is to search directly over programs written in the target framework's API. This does not scale well; Spark alone offers over 80 high-level operators, even though many of them have similar semantics and differ only in their implementation or syntax (e.g., map, flatMap, filter). To speed up synthesis, we instead search over programs written in a high-level IR that abstracts away syntactic differences and describes only the functionality of a few essential operators. The goals of the IR are: (1) to express summaries that are translatable into the target API, and (2) to let the synthesizer efficiently search for summaries that are equivalent to the input program. To address these goals, Casper's IR models two MapReduce primitives that are similar to the map and fold operators in Haskell (see §2.1). In addition, our IR models the join primitive, which takes as input two multisets of key-value pairs and returns all pairs of elements with matching keys. The IR does not currently model the full range of operators across different MapReduce implementations; however, it already lets Casper capture a wide array of computations expressible using the paradigm and is sufficiently general to be translatable into different MapReduce APIs while keeping the search problem tractable, as we demonstrate in §7.

Figure 3 shows a subset of Casper's IR, used to express both program summaries and the search space. The IR assumes that program summaries are expressed in the stylized form shown in Figure 3 as PS, which states that each output variable v (i.e., a variable updated in the code fragment) must be computed using a sequence of map, reduce, and join operations over the inputs (e.g., the arrays or collections being iterated). While doing so ensures that the summary is translatable into the target API, the implementations of λm and λr for the map and reduce operators depend on the code fragment being translated.
We leave these functions to be synthesized and restrict the body of λm to a sequence of emit statements, where each emit statement produces a single key-value pair, and the body of λr to an expression that evaluates to a single value of the required type. Besides emit, the bodies of λm and λr can consist of conditionals and other operations on tuples, as shown in Figure 3. The output of the MapReduce expression is an associative array of key-value pairs; the unique key vid for each variable is used to access the computed value of that variable. Appendix B lists the full set of types and operators that our IR supports.

In addition to program summaries, Casper also uses the IR to describe the search space of summaries for the synthesizer. It does so by generating a grammar for each input code fragment, like the one shown in Figure 3. The synthesizer traverses the grammar by expanding each production rule and checks whether any generated candidate constitutes a valid summary (as explained in §3.3).

To generate the search space grammar, Casper analyzes the input code to extract the following properties and their type information:
(1) Variables in scope at the beginning of the input code
(2) Variables that are modified within the input code
(3) The operators and library methods used

The code analyzer extracts these properties using standard program analyses. It computes (1) and (2) using live variable and dataflow analysis [1], and it computes (3) by scanning functions that are invoked in the input code. We currently assume that input variables are not aliased to each other and put guards on the translated code to ensure that is the case. Appendix D shows the analysis results for the TPC-H Q6 benchmark, and we discuss the limitations of our program analyzer module implementation in §6.1. Given this information, the summary generator builds a search space grammar that is specialized to the code fragment being translated.
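For illustration, a candidate drawn from such a grammar can be represented as a small AST over the IR's map, reduce, join, and data productions. The sketch below is our own (Casper's internal representation may differ), and transformer bodies are abbreviated to names:

```java
// A minimal AST for the MR production of the IR (class names are ours).
public class IRNode {
    public enum Kind { DATA, MAP, REDUCE, JOIN }
    public final Kind kind;
    public final IRNode left, right;   // operands (right is used only by join)
    public final String lambda;        // λm or λr name; null for data/join
    public final String name;          // input dataset name, for DATA nodes

    private IRNode(Kind k, IRNode l, IRNode r, String lam, String n) {
        kind = k; left = l; right = r; lambda = lam; name = n;
    }
    public static IRNode data(String n) { return new IRNode(Kind.DATA, null, null, null, n); }
    public static IRNode map(IRNode in, String lm) { return new IRNode(Kind.MAP, in, null, lm, null); }
    public static IRNode reduce(IRNode in, String lr) { return new IRNode(Kind.REDUCE, in, null, lr, null); }
    public static IRNode join(IRNode a, IRNode b) { return new IRNode(Kind.JOIN, a, b, null, null); }

    @Override public String toString() {
        switch (kind) {
            case DATA:   return name;
            case MAP:    return "map(" + left + ", " + lambda + ")";
            case REDUCE: return "reduce(" + left + ", " + lambda + ")";
            default:     return "join(" + left + ", " + right + ")";
        }
    }
}
```

Building the row-wise mean summary of Figure 1(a) with these constructors pretty-prints back to the stylized form map(reduce(map(mat, λm1), λr), λm2).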
Figure 6 shows sample grammars that are generated for the code shown in Figure 1(a). The input code uses addition and division; hence, the grammar includes addition and division in its production rules for λm and λr. Furthermore, Casper also uses the type information of variables to prune invalid production rules from the grammar. For instance, if the output variable v is of type int, the final operation in the synthesized MapReduce expression must evaluate to a value of type int. Since the output type of a reduce operation is inferred from the type of its input, we can propagate this information to restrict the type of values the reduce operation accepts. To make synthesis tractable and the search space finite, Casper imposes recursive bounds on the production rules. For instance, it limits the number of MapReduce operations a program summary can use and the number of emit statements in a single transformer function. In §4.2, we discuss how Casper further specializes the search space by changing the set of production rules available in the grammar or specifying different recursive bounds.

Thus, if variable handles v1 and v2 are both inputs into the same code fragment, Casper wraps the translated code as: if (v1 != v2) { [Casper translated code] } else { [original code] }. Computing precise alias information requires more engineering [43] and does not impact our approach. Refer to Appendix D to see how a grammar can be encoded in our IR.

invariant(m, i) ≡ 0 ≤ i ≤ rows ∧ m = map(reduce(map(mat[0..i], λm1), λr), λm2)
(a) Outer loop invariant

Initiation: (i = 0) → Inv(m, i)
Continuation:
Inv(m, i) ∧ (i < rows) → Inv(m[i ↦ sum(mat[i])/cols], i + 1)
Termination:
Inv(m, i) ∧ ¬(i < rows) → PS(m, i)
(b) Verification conditions to ascertain the correctness of the program summary PS given loop invariant Inv
Figure 4: Proof of soundness for the row-wise mean benchmark.
To search for a valid summary within the search space, Casper requires a way to check whether a candidate summary is semantically equivalent to the input code. It does so using standard techniques in program verification, namely, by creating verification conditions based on Hoare logic [26]. Verification conditions are Boolean predicates that, given a program statement S and a postcondition (i.e., program summary) P, state what must be true before S is executed in order for P to be a valid postcondition of S. Verification conditions can be systematically generated for imperative program statements, including those processed by Casper [33, 47]. However, each loop statement requires an extra loop invariant to construct an inductive proof. Loop invariants are Boolean predicates that are true before and after every execution of the loop body, regardless of how many times the loop executes.

The general problem of inferring the strongest loop invariants or postconditions is undecidable [33, 47]. Unlike prior work, however, two factors make our problem solvable: first, our summaries are restricted to only those expressible using the IR described in §3.1, which lacks many problematic features (e.g., pointers) that a general-purpose language would have. Moreover, we are interested only in finding loop invariants that are strong enough to establish the validity of the synthesized program summaries.

As an example, Figure 4(a) shows an outer loop invariant Inv, which can be used to prove the validity of the program summary shown in Figure 1(a). Figure 4(b) shows the verification conditions Casper constructs to state what the program summary and invariant must satisfy. We can check that this loop invariant and program summary are indeed valid based on Hoare logic as follows. First, the initiation clause asserts that the invariant holds before the loop, i.e., when i is zero.
This is true because the invariant asserts that theMapReduce expression is true only for the first i rows of the inputmatrix. Hence, when i is zero, the MapReduce expression is exe-cuted on an empty dataset, and the output value for each row is 0.Next, the continuation clause asserts that after one more executionof the loop body, the i th index of output vector m should hold themean for the i th row of the matrix mat . This is true since the valueof i is incremented inside the loop body, which implies that themean for the i th row has been computed. Finally, the termination condition completes the proof by asserting that if the invariant istrue, and i has reached the end of the matrix, then the programummary PS must now hold as well. This is true since i now equalsthe number of rows in the matrix, and the loop invariant assertsthat m equals the MapReduce expression executed over the entirematrix, which is the same assertion as our program summary.Casper formulates the search problem for finding program sum-maries by constructing the verification conditions for the givencode fragment and leaving the body of the summary (and any nec-essary invariants for loops) to be synthesized. For the programsummary and invariants, the search space is expressed using thesame IR as discussed in §3.1. Formally, the synthesis problem is: ∃ ps , inv , . . . , inv n . ∀ σ . VC ( P , ps , inv , . . . , inv n , σ ) (1)In other words, Casper’s goal is to find a program summary ps andany required invariants inv , . . . , inv n such that for all possibleprogram states σ , the verification conditions for the input codefragment P are true. After the synthesizer has identified a candidatesummary and invariants, Casper sends them and the verificationconditions to a theorem prover (see §4.1), and to the code generatorto generate executable MapReduce code if the program summaryis proven to be correct. 
Casper uses an off-the-shelf program synthesizer, Sketch [42], to infer program summaries and loop invariants. Sketch takes as input: (1) a set of candidate summaries and invariants encoded as a grammar (e.g., Figure 3), and (2) the correctness specification for the summary in the form of verification conditions. It then attempts to find a program summary (and any invariants needed) using the provided grammar such that the verification conditions hold.

The universal quantifier in Eq. 1 makes the synthesis problem challenging. Therefore, Casper uses a two-step process to ensure that the found summary is valid. First, it leverages Sketch's bounded model checking to verify the candidate program summary over a finite (i.e., "bounded") subset of all possible program states. For example, Casper restricts the maximum size of the input dataset and the range of values for integer inputs. Finding a solution for this weakened specification can be done very efficiently by the synthesizer. Once a candidate program summary has been verified for the bounded domain, Casper passes the summary to a theorem prover to determine its soundness over the entire domain, which is computationally more expensive. Casper currently translates the summary, along with an automatically generated proof script, to Dafny [31] for full verification. This two-step verification makes Casper's synthesis algorithm sound without compromising efficiency.
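The bounded step can be approximated by exhaustive testing over a small finite domain. The sketch below is our own harness (not Sketch's bounded model checker): it checks candidate implementations of the row-wise mean benchmark against the sequential code on every matrix whose dimensions and entry values are below a small bound, and returns a counter-example on disagreement:

```java
import java.util.*;

public class BoundedCheck {
    public interface Candidate { int[] apply(int[][] mat, int rows, int cols); }

    // The sequential reference from Figure 1(a).
    public static int[] reference(int[][] mat, int rows, int cols) {
        int[] m = new int[rows];
        for (int i = 0; i < rows; i++) {
            int sum = 0;
            for (int j = 0; j < cols; j++) sum += mat[i][j];
            m[i] = sum / cols;
        }
        return m;
    }

    // Returns null if the candidate agrees with the reference on the
    // whole bounded domain, otherwise a counter-example matrix.
    public static int[][] boundedVerify(Candidate cand, int maxDim, int maxVal) {
        for (int rows = 1; rows <= maxDim; rows++)
            for (int cols = 1; cols <= maxDim; cols++)
                for (int[][] mat : allMatrices(rows, cols, maxVal))
                    if (!Arrays.equals(reference(mat, rows, cols), cand.apply(mat, rows, cols)))
                        return mat;
        return null;
    }

    // Enumerates every rows×cols matrix with entries in 0..maxVal.
    private static List<int[][]> allMatrices(int rows, int cols, int maxVal) {
        List<int[][]> out = new ArrayList<>();
        int cells = rows * cols, combos = (int) Math.pow(maxVal + 1, cells);
        for (int code = 0; code < combos; code++) {
            int[][] mat = new int[rows][cols];
            int c = code;
            for (int k = 0; k < cells; k++) { mat[k / cols][k % cols] = c % (maxVal + 1); c /= maxVal + 1; }
            out.add(mat);
        }
        return out;
    }
}
```

A wrong candidate is rejected with a concrete counter-example, which is exactly what the CEGIS loop described next feeds back into the search.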
Figure 5 (lines 1 to 8) shows the core CEGIS [45] algorithm Casper's synthesizer uses. The algorithm is an iterative interaction between two modules: a candidate program summary generator and a bounded model checker. The candidate summary generator takes as input the IR grammar G, the verification conditions for the input code fragment VC, and a set of concrete program states Φ. To start the process, the synthesizer populates Φ with a few randomly chosen states and generates a program summary candidate ps and any needed invariants inv1, ..., invn from G such that ∀σ ∈ Φ. VC(ps, inv1, ..., invn, σ) is true. Next, the bounded model checker verifies whether the candidate program summary holds over the bounded domain. If it does, the algorithm returns ps as the solution. Otherwise, the model checker returns a counter-example state ϕ such that VC(ps, inv1, ..., invn, ϕ) is false. The algorithm adds ϕ to Φ and restarts the program summary generator.

function synthesize(G, VC):
  Φ = {}  // set of random program states
  while true do
    ps, inv1..n = generateCandidate(G, VC, Φ)
    if ps is null then
      return null  // search space exhausted
    ϕ = boundedVerify(ps, inv1..n, VC)
    if ϕ is null then
      return (ps, inv1..n)  // summary found
    else
      Φ = Φ ∪ ϕ  // counter-example found

function findSummary(A, VC):
  G = generateGrammar(A)
  Γ = generateClasses(G)
  Ω = {}  // summaries that failed verification
  ∆ = {}  // summaries that passed verification
  for γ ∈ Γ do
    while true do
      c = synthesize(γ - Ω - ∆, VC)
      if c is null and ∆ is empty then
        break  // move to next grammar class
      else if c is null then
        return ∆  // search complete
      else if fullVerify(c, VC) then
        ∆ = ∆ ∪ c
      else
        Ω = Ω ∪ c
  return null  // no solution found

Figure 5: Casper's search algorithm.
This continues until either a program summary is found that passes bounded model checking or the search space is exhausted.

A limitation of the CEGIS algorithm is that, while efficient, the found program summary might be true only over the finite domain and thus will be rejected by the theorem prover when checked for validity over the entire domain. In this case, Casper dynamically changes the search space grammar to exclude the candidate program summary that does not verify and restarts the synthesizer to generate a new candidate summary using the preceding algorithm. We discuss this process in detail in §4.1.

We now discuss the techniques Casper uses to make the search for program summaries more robust and efficient.
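Before turning to those techniques, the CEGIS loop of Figure 5 can be illustrated on a toy problem. In this sketch (entirely our own; the real synthesizer searches grammars of summaries, not integer constants), the "grammar" is a single unknown constant c in the candidate x → x + c, and bounded verification returns a counter-example input whenever the specification is violated:

```java
import java.util.*;
import java.util.function.*;

public class ToyCegis {
    // Searches for a constant c with x + c == spec(x) for all x in the
    // bounded domain [-domain, domain], or null if no such c exists.
    public static Integer synthesize(IntUnaryOperator spec, int maxConst, int domain) {
        Set<Integer> examples = new HashSet<>();   // Φ: program states seen so far
        examples.add(0);
        for (int c = 0; c <= maxConst; c++) {      // candidate generator
            final int cc = c;
            boolean ok = examples.stream().allMatch(x -> x + cc == spec.applyAsInt(x));
            if (!ok) continue;                     // refuted by Φ already
            Integer cex = boundedVerify(cc, spec, domain);
            if (cex == null) return cc;            // passes bounded checking
            examples.add(cex);                     // keep the counter-example
        }
        return null;                               // search space exhausted
    }

    // Bounded model checking: test the candidate on every input in the domain.
    private static Integer boundedVerify(int c, IntUnaryOperator spec, int domain) {
        for (int x = -domain; x <= domain; x++)
            if (x + c != spec.applyAsInt(x)) return x;
        return null;
    }
}
```

Each refuted candidate contributes a counter-example to Φ, so later candidates are first screened against Φ before the more expensive bounded check, mirroring the interaction in Figure 5.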
As mentioned, the program summary that the synthesizer returns can fail theorem prover validation due to the bounded domain used during search. For instance, assume we bound the integer inputs to have a maximum value of 4 in the synthesizer. In this bounded domain, the expressions v and Math.min(4, v) (where v is an input integer) are deemed to be equivalent even though they are not equal in practice. While prior work [17, 28] simply fails to translate such benchmarks if the theorem prover rejects the candidate summary, Casper uses a two-phase verification technique to eliminate such candidates. This ensures that Casper's search is complete with respect to the search space defined by the grammar.

To achieve completeness, Casper must first prevent summaries that failed the theorem prover from being regenerated by the synthesizer. A naive approach would be to restart the synthesizer until a new summary is found, assuming that the algorithm implemented by the synthesizer is non-deterministic. However, this approach is incomplete because the algorithm may never terminate, since it can continually return the same incorrect summary.

[Figure 6 table: grammar classes G1–G3, their Map/Reduce sequences (m; m → r; m → r → m), and sample candidate λm and λr transformers for each class.]

Figure 6: Incremental grammar generation. Casper generates a hierarchy of grammars to optimize search.
Instead, Casper modifies the search space to ensure forward progress. Recall from §3.4 that the search space of candidate summaries {c1, ..., cn} is specified using an input grammar that is generated by the program analyzer and passed to the synthesizer. Thus, to prevent a candidate cf that fails the theorem prover from being repeatedly generated from grammar G, Casper simply passes a new grammar G − {cf} to the synthesizer. This is implemented by passing additional constraints to the synthesizer to block a summary from being regenerated.

Theorem.
Casper's algorithm for inferring program summaries is sound and complete with respect to the given search space.

A proof sketch for this theorem is provided in Appendix A. Figure 5 shows how Casper infers program summaries and invariants. Casper calls the synthesizer to generate a candidate summary c on line 17 and attempts to verify c by passing it to the theorem prover on line 22. If verification fails, c is added to Ω, the set of incorrect summaries, and the synthesizer is restarted with a new grammar G − Ω. We explain the full algorithm in §4.3. In §7.3.2, we provide experimental results that illustrate how our two-phase verification algorithm effectively finds program summaries even when faced with verification failures.

Although Casper's search algorithm is complete, the space of possible summaries to consider remains large. To address this, Casper incrementally expands the search space for program summaries to speed up the search. It does this by: (1) adding new production rules to the grammar, and (2) increasing the number of times that each production rule is expanded. The benefits of this approach are twofold. First, since the search time for a valid summary is proportional to the search space size, Casper often finds valid summaries quickly, as our experiments show. Second, since larger grammars are more syntactically expressive, the found summaries are likely to be more expensive computationally. Hence, biasing the search towards smaller grammars likely produces program summaries that run more efficiently. Although this is not sufficient to guarantee the optimality of generated summaries, our experiments show that in practice Casper generates efficient solutions (§7.2).
To implement incremental grammar generation, Casper partitions the space of program summaries into different grammar classes, where each class is defined based on these syntactic features: (1) the number of MapReduce operations, (2) the number of emit statements in each map stage, (3) the size of key-value pairs emitted in each stage, as inferred from the types of the key and value, and (4) the length of expressions (e.g., x + y is an expression of length 2, while x + y + z has a length of 3). All of these features are implemented by altering production rules in the search space grammar. A grammar hierarchy is created such that all program summaries expressible in a grammar class G_i are also expressible in a higher-level class, i.e., G_j where j > i.

Figure 5 shows Casper's algorithm for searching program summaries. The algorithm begins by constructing a grammar G using the results of program analysis A on the input code. First, Casper partitions the grammar G into a hierarchy of grammar classes Γ (line 12). Then, it incrementally searches each grammar class γ ∈ Γ, invoking the synthesizer to find summaries in γ (line 17). Each summary (and invariant) the synthesizer returns is checked by a theorem prover (line 22); Casper saves the set of correct program summaries in ∆ and all summaries that fail verification in Ω. Each synthesized summary (correct or not) is eliminated from the search space, forcing the synthesizer to generate a new summary each time, as explained in §4.1. When the grammar γ is exhausted, i.e., the synthesizer has returned null, Casper returns the set of correct summaries ∆ if it is non-empty. Otherwise, no valid solution was found, and the algorithm proceeds to search the next grammar class in Γ. If ∆ is empty after exploring every grammar in Γ, i.e., no summary could be found in the entire search space, the algorithm returns null.

We now illustrate how findSummary searches for program summaries using the row-wise mean benchmark discussed in §2.2.
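The search procedure just described (a grammar hierarchy, candidate blocking, and returning the first non-empty ∆) can be modeled compactly. The following is a minimal Python sketch, not Casper's actual implementation: `synthesize` and `verify` are stand-ins for Casper's Sketch-based synthesizer and theorem prover.

```python
def find_summary(grammar_classes, synthesize, verify):
    """Search grammar classes G1 ⊆ G2 ⊆ ... for verified summaries.

    synthesize(grammar, blocked) returns a candidate summary from the
    grammar that is not in `blocked`, or None when it is exhausted.
    verify(candidate) models the theorem prover.
    """
    blocked = set()                 # Ω plus already-returned summaries
    for gamma in grammar_classes:   # search the smallest class first
        correct = set()             # Δ: fully verified summaries
        while True:
            c = synthesize(gamma, blocked)
            if c is None:           # grammar class exhausted
                break
            if verify(c):
                correct.add(c)
            blocked.add(c)          # block c, correct or not (§4.1)
        if correct:                 # stop at the first class with a solution
            return correct
    return None                     # entire search space exhausted
```

Blocking every returned candidate, not just the failing ones, is what guarantees forward progress: the synthesizer can never propose the same summary twice.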
Figure 6 shows three sample (incremental) grammars Casper generated as a result of calling generateClasses (Figure 5, line 12), along with their properties. For example, the first class, G1, consists of program summaries expressed using a single map or reduce operator, and the transformer functions λm and λr are restricted to emit only one integer key-value pair. A few candidates for λm are shown in the figure. For instance, the first candidate, (i, j, v) → [(i, j)], maps each matrix entry to its row and column as the output. If findSummary fails to find a valid summary in G1 for the benchmark, it advances to the next grammar class, G2. G2 expands upon G1 by including summaries that consist of two map or reduce operators, and each λm can emit up to 2 key-value pairs. The search next moves to G3, which expands upon G2 with summaries that include up to three map or reduce operators, whose transformers can emit either integers or tuples. As shown in Figure 1(a), a valid summary is finally found in G3 and added to ∆. Search continues in G3 for other valid summaries in the same grammar class. The search terminates after all valid summaries in G3, i.e., those returned by the synthesizer and fully verified, are found. This includes the one shown in Figure 1(a).

There often exist many semantically equivalent MapReduce implementations for a given sequential code fragment, with significant performance differences. Many frameworks come with optimizers that perform low-level optimizations (e.g., fusing multiple map operators). However, performing semantic transformations is often difficult. For instance, at least three different implementations of the StringMatch benchmark exist in MapReduce, and they differ in the type of key-value pairs the map stage emits (see §7.4).
Although it is difficult for a low-level optimizer to discover these equivalences by syntax analysis, Casper can perform such optimizations because it searches for a high-level program summary expressed using the IR. We now discuss Casper's use of a cost model and runtime monitoring module for this purpose.
Casper uses a cost model to evaluate the different semantically equivalent program summaries found for a code fragment. Because Casper aims to translate data-intensive applications, its cost model estimates data transfer costs as opposed to compute costs. Each synthesized program summary is a sequence of map, reduce, and join operations. The semantics of these operations are known, but the transformer functions they use (λm and λr) are synthesized and determine each operation's cost. We define the cost functions of the map, reduce, and join operations below:

  cost_m(λm, N, Wm) = Wm × N × Σ_{i=1}^{|λm|} sizeOf(emit_i) × p_i    (2)

  cost_r(λr, N, Wr) = Wr × N × sizeOf(λr) × ε(λr)    (3)

  cost_j(N1, N2, Wj) = Wj × N1 × N2 × sizeOf(emit_j) × p_j    (4)

The function cost_m estimates the amount of data generated in the map stage. For each emit statement in λm, the size of the key-value pair emitted is multiplied by the probability that the emit statement will execute (p_i). These values are then summed to get the expected size of the output record. The total amount of data emitted during the map stage equals the product of the expected record size and the number of times λm is executed (N). The cost function for a reduce stage, cost_r, is defined similarly, except that λr produces only a single value and the cost is adjusted based on whether λr is commutative and associative. The function ε returns 1 if these properties hold; otherwise, it returns W_csg. The cost function for join operations, cost_j, is defined over the number of elements in the two input datasets (N1 and N2), the selectivity of the join predicate (p_j), and the size of the output record. W_m, W_r, and W_j are the weights assigned to the map, reduce, and join operations; W_csg is the penalty for a reduction that is not commutative and associative.
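The cost functions above, together with the pipeline recurrence described next, can be transcribed directly. The sketch below is a minimal Python model, not Casper's implementation: the stage encoding and function names are our own, and the weight values are the ones reported in our experiments.

```python
# Weights from our experiments: W_m, W_r, W_j, and the penalty W_csg.
W_M, W_R, W_J, W_CSG = 1, 2, 2, 50

def cost_map(emits, n, w_m=W_M):
    """Eqn 2: expected bytes emitted by a map stage. `emits` lists one
    (record_size_bytes, emit_probability) pair per emit statement."""
    return w_m * n * sum(size * p for size, p in emits)

def cost_reduce(size, n, assoc_comm, w_r=W_R):
    """Eqn 3: cost of a reduce stage; eps(λr) = 1 when the reducer is
    commutative and associative, W_csg otherwise."""
    return w_r * n * size * (1 if assoc_comm else W_CSG)

def cost_join(out_size, n1, n2, selectivity, w_j=W_J):
    """Eqn 4: cost of joining inputs with n1 and n2 records."""
    return w_j * n1 * n2 * out_size * selectivity

def pipeline_cost(stages, n):
    """The cost_mr recurrence: sum the per-stage costs, threading each
    stage's output record count (count()) into the next stage."""
    total = 0
    for s in stages:
        if s['op'] == 'map':
            total += cost_map(s['emits'], n)
            n = n * sum(p for _, p in s['emits'])  # expected emits per input
        elif s['op'] == 'reduce':
            total += cost_reduce(s['size'], n, s['assoc'])
            n = s['keys']  # one output record per distinct key
    return total
```

For example, a map stage emitting one 50-byte pair per record, followed by a commutative and associative reduce over 50-byte values, costs 1·N·50 + 2·N·50 = 150N for N input records.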
In our experiments, we used the values 1, 2, 2, and 50 for these weights, respectively, based on our empirical studies.

To estimate the cost of a program summary, we simply sum the cost of each individual operation. The first operator in the pipeline takes a symbolic variable N as its number of input records. For each subsequent stage, we use the number of key-value pairs generated by the preceding stage, expressed as a function of N:

  cost_mr([(op1, λ1), (op2, λ2), ...], N) = cost_op1(λ1, N, W_op1) + cost_mr([(op2, λ2), ...], count(λ1, N))

The function count returns the number of key-value pairs generated by a given stage. For map stages, this equals N × Σ_{i=1}^{|emits|} p_i; for reduce stages, it equals the number of unique key values on which the reducer was executed; for joins, it equals N1 × N2 × p_j.

The cost model computes the cost of a program summary as a function of the input data size N. We use this cost model to compare the synthesized summaries both statically and dynamically. First, calling findSummary returns a list of verified summaries that were found. Casper then uses the cost model to prune a summary when a less costly one exists in the list. Not all summaries can be compared this way, however, since their costs may depend on the value distribution of the input data or on how frequently a conditional evaluates to true, as shown by the candidate λm functions in Figure 6. In such cases, Casper generates code for all remaining validated summaries and uses a runtime monitoring module to evaluate their costs dynamically when the generated program executes. As the program runs, the runtime module samples values from the input dataset (Casper currently uses first-k values sampling, although different sampling methods may be used). It then uses the samples to estimate the probabilities of conditionals by counting the number of data elements in the sample for which each conditional evaluates to true.
Similarly, it counts the number of unique data values that are emitted as keys. These estimates are inserted into Eqn 2 and Eqn 3 for each program summary to obtain comparable cost values. Finally, the summary with the lowest cost is executed at runtime. Hence, if the generated program is executed over different data distributions, it will run different implementations, as illustrated in §7.4.

We implemented Casper using the Polyglot framework [37] to parse Java code into an abstract syntax tree (AST). Casper traverses the program AST to identify candidate code fragments, performs program analysis, and generates target code. We now describe the Java features supported by our compiler front-end. We also discuss how Casper identifies code fragments for translation and generates executable code from the verified program summary.

6.1 Supported Language Features
To translate a code fragment, Casper must first successfully generate verification conditions for that fragment (as explained in §3.3). Casper can currently do this for basic Java statements, conditionals, functions, user-defined types, and loops.
Basic Types.
Casper supports all basic Java arithmetic, logical, and bit-wise operators. It can also process reads and writes into primitive arrays and common Java Collection interfaces, such as java.util.{List, Set, Map}. Casper can be extended to support other data structures, such as
Stack or Queue.

User-defined Types.
Casper traverses the program AST to find the declarations of all types used in the code fragment being translated. It then dynamically translates and adds these types to the IR as structs, as shown in Appendix B.
Loops.
Casper computes VCs for different types of loops (for, while, do), including those with loop-carried dependencies [1], after applying classical transformations [1] to convert loops into the while(true){...} format.

Methods.
Casper handles methods by inlining their bodies. Polymorphic methods can be supported similarly by inlining the different versions, guarded by conditionals that check the type of the host object at runtime. Recursive methods and methods with side effects are not currently supported because they are unlikely to gain any speedup from being translated to MapReduce.
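To illustrate the inlining step (in Python for brevity; Casper performs the rewrite on Java ASTs), a call site is replaced by the callee's body with formal parameters substituted by the actual arguments, so VC generation sees a single straight-line fragment. The helper name below is our own invention.

```python
# Before inlining: a helper method called inside the candidate loop.
def scale(x, factor):
    return x * factor

data = [1, 2, 3]
out_before = []
for v in data:
    out_before.append(scale(v, 10))

# After inlining: the callee's body is substituted at the call site,
# with x := v and factor := 10; the two loops compute the same result.
out_after = []
for v in data:
    out_after.append(v * 10)
```

Since inlining duplicates the callee's body at every call site, it is only practical for the non-recursive, side-effect-free methods Casper supports.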
External Library Methods.
Casper supports common library methods from standard Java libraries (e.g., java.lang.Math methods) by modeling their semantics explicitly using the IR. Users can similarly provide models for other methods that Casper does not currently support.

Casper traverses the input AST to identify code fragments that are amenable to translation by searching for loops that iterate over one or more data structures (e.g., a list or an array). We target loops since they are most likely to benefit from translation to MapReduce. We have kept our loop selection criteria lenient to avoid false negatives.
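The lenient selection pass can be sketched as a plain AST walk. The Python model below stands in for Casper's Java/Polyglot traversal: it flags every loop as a candidate, deferring the filtering of false positives to the later synthesis and verification stages.

```python
import ast

def find_candidate_loops(source):
    """Lenient candidate identification, modeled on Casper's AST pass:
    flag every loop so that no translatable fragment is missed; loops
    that cannot actually be translated fail later, during synthesis
    or verification, rather than being rejected up front."""
    tree = ast.parse(source)
    return sorted(node.lineno
                  for node in ast.walk(tree)
                  if isinstance(node, (ast.For, ast.While)))
```

For example, `find_candidate_loops("total = 0\nfor x in data:\n    total += x\n")` flags the loop on line 2.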
Once an identified code fragment is translated, Casper replaces the original code fragment with the translated MapReduce code. It also generates "glue" code to merge the generated code into the rest of the program. This includes creating a
SparkContext (or an
ExecutionEnvironment for Flink), converting data into
RDDs (or Flink's
DataSets), broadcasting required variables, etc. Since some API calls (such as Spark's reduceByKey) are not defined for transformer functions that are not commutative and associative, Casper uses these API calls only if the generated code is indeed commutative and associative (otherwise, Casper uses safe, albeit less efficient, transformations, such as groupByKey). Finally, Casper also generates code for sampling input data and dynamic switching, as discussed in §5.2. Appendix C presents a subset of the code-generation rules for the Spark API. We provide examples of library function and type models in Appendix B.
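The distinction matters because reduceByKey may combine values and partial results in any order. A small model, plain Python standing in for the Spark API, shows why a reducer that is not associative and commutative is only safe under a groupByKey-style plan that fixes the fold order:

```python
def reduce_by_key(pairs, f):
    """Minimal model of reduceByKey: fold values per key, in whatever
    order the pairs happen to arrive (Spark guarantees no order)."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

pairs = [("a", 1), ("a", 2), ("a", 3)]
add = lambda x, y: x + y   # commutative and associative
sub = lambda x, y: x - y   # neither

# A commutative, associative reducer is order-insensitive: safe.
assert reduce_by_key(pairs, add) == reduce_by_key(pairs[::-1], add)

# Subtraction is not: the two orders disagree, so reduceByKey would be
# incorrect and the generator must fall back to groupByKey + an
# ordered fold instead.
assert reduce_by_key(pairs, sub) != reduce_by_key(pairs[::-1], sub)
```

Forward order folds 1, 2, 3 into 1 − 2 − 3 = −4, while reversed order yields 3 − 2 − 1 = 0, which is exactly the divergence the commutativity/associativity check guards against.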
Suite      Translated   Mean Speedup   Max Speedup
Phoenix    7 / 11       14.8×          32×
Ariths     11 / 11      12.6×          18.1×
Stats      18 / 19      18.2×          28.9×
Big λ

Table 1: Number of code fragments translated by Casper and their mean and max speedups compared to sequential implementations.
In this section, we present a comprehensive evaluation of Casper on a number of dimensions, including its ability to: (1) handle diverse and realistic workloads, (2) find efficient translations, (3) compile efficiently, and (4) extend to support other IRs and cost models in the future. All experiments were conducted on an AWS cluster of 10 m3.2xlarge instances (1 master node, 9 core nodes), where each node contains an Intel Xeon 2.5 GHz processor with 8 vCPUs, 30 GB of memory, and 160 GB of SSD storage. We used the latest versions of all frameworks available on AWS: Spark 2.3.0, Hadoop 2.8.3, and Flink 1.4.0. The data files for all experiments were stored on HDFS.
We first assess Casper's ability to handle a variety of data-processing applications. Specifically, we determine whether: (1) Casper can generate verification conditions for a syntactically diverse set of programs, (2) our IR can express summaries for a broad range of data-processing workloads, and (3) Casper can find such summaries. To this end, we used Casper to optimize a set of 55 diverse benchmarks from real-world applications that contained a total of 101 translatable code fragments.
Basic Applications.
For benchmarking, we assembled a set of small applications from prior work and online repositories. These applications, summarized below, contain a diverse set of code patterns commonly found in data-processing workloads (e.g., aggregations, selections, grouping, etc.):

• Big λ [44] consists of several data analysis tasks such as sentiment analysis, database operations (e.g., selection and projection), and Wikipedia log processing. Since Big λ generates code from input-output examples rather than from an actual implementation, we recruited computer science graduate students in our department to implement a representative subset of the benchmarks from their textual descriptions. This resulted in 211 lines of code across 7 files.

• Stats is a set of benchmarks Casper automatically extracted from an online repository for the statistical analysis of data [32]. Examples include
Covariance, Standard Error, and
Hadamard Product. The repository contains 1162 lines of code across 12 Java files, mostly consisting of vector and matrix operations.

• Ariths is a set of simple mathematical functions and aggregations collected from prior work [14, 19, 23, 40]. Examples include
Min, Max, Delta, and
Conditional Sum. The suite contains 245 lines of code that span 11 files.

Across the 3 suites, Casper identified 38 code fragments, of which 35 were successfully translated. One code fragment that Casper failed to translate used a variable-sized kernel to convolve a matrix; two others required broadcasting data values to many reducers during the map stage, but such mappers are currently inexpressible in our IR due to the absence of loops.
Traditional Data-Processing Benchmarks.
Next, we used Casper to translate a set of well-known data-processing benchmarks that resemble real-world workloads:

• We manually implemented Q1, Q6, Q15 and Q17 from the
TPC-H benchmark using sequential Java and used Casper to translate the Java implementations to MapReduce. The selected queries cover many SQL features, such as aggregations, joins, and nested queries.

• Phoenix [39] is a collection of standard MapReduce problems—such as
3D Histogram, Linear Regression, KMeans, etc.—used in prior work [38]. Since the original sequential implementations were written in C, we used the sequential Java translations of the benchmarks from prior work in our experiments. The suite consists of 440 lines of code across 7 files.

• Iterative represents two popular iterative algorithms that we manually implemented as sequential versions:
PageRank and
Logistic Regression Based Classification.

Casper successfully translated all 4 TPC-H queries and both iterative algorithms. It successfully translated 7 of 11 from the Phoenix suite. Three of the 4 failures were due to the IR's lack of support for loops inside transformer functions. One benchmark failed to synthesize within 90 minutes, causing Casper to time out.
Real-World Applications.
Fiji [24] is a popular distribution of the ImageJ [27] library for scientific image analysis. We ran Casper on the source code of four Fiji packages (aka plugins).
NL Means is a plugin for denoising images via the non-local-means algorithm [13] with optimizations [20].
Red To Magenta transforms images by changing red pixels to magenta.
Temporal Median is a probabilistic filter for extracting foreground objects from a sequence of images.
Trails averages pixel intensities over a time window in an image sequence. These packages, authored by different developers, contain 1411 lines of code that span 5 files. Of the 35 candidate code fragments identified across all 4 packages, Casper successfully optimized 23. Three of the failures were caused by the use of unsupported types or methods from the ImageJ library, since we did not model them using the IR; the search timed out for the remaining 9.

Table 1 summarizes the results of our feasibility analysis. Of the 101 individual code fragments identified by the compiler across all benchmarks, Casper translated 82. We manually inspected all code files to ensure that Casper's code fragment identifier missed no translatable code fragments. Overall, the benchmarks form a syntactically diverse set of applications.

Because MOLD is not publicly available, we obtained the generated code from the MOLD authors for the benchmarks used in its evaluation [38]. Of the 7 Phoenix benchmarks, MOLD could not translate 2 (
PCA and
KMeans). Another 2 (
Histogram and
Matrix Multiplication) generated semantically correct translations that worked well for multi-core execution but failed to execute on the cluster because they ran out of memory. For the remaining 3 benchmarks (
Word Count, String Match, and
Linear Regression), MOLD generated working implementations. In contrast, Casper translated 4 of the 7 Phoenix benchmarks.

(We summarize the syntactic features of the code fragments in Appendix E.1.)

[Figure 7: A runtime comparison of Casper-generated implementations against reference implementations. Panels: (a) speedups of MOLD (Spark), manual (Spark), and Casper (Spark, Flink, Hadoop) translations, showing that Casper achieves speedup competitive with manual translations; (b) runtime (s) on TPC-H benchmarks, Casper vs. SparkSQL; (c) runtime (s) on iterative algorithms, Casper vs. Spark tutorial implementations.]

For
PCA and
KMeans, Casper translated and successfully executed a subset of all the loops found, while translation failed for the other loops and the
Matrix Multiplication benchmark for reasons explained above.
Casper helps an application leverage the optimizations and parallelization provided by MapReduce implementations by translating its code. In this section, we therefore examine the quality of the translations Casper produced by comparing their performance to that of reference distributed implementations.

We used Casper to translate summaries for these benchmarks to three popular implementations of the MapReduce programming model: Hadoop, Spark, and Flink. The translated Spark implementations, along with their original sequential implementations, were executed on three synthetic datasets of sizes 25GB, 50GB, and 75GB. Overall, the Spark implementations Casper generated are on average 15× faster than their sequential counterparts, with a maximum improvement of 48.2×. Table 1 shows the mean and max speedup observed for each benchmark suite using Spark on the 75GB dataset. We also executed the Hadoop and Flink implementations generated by Casper for a subset of 10 benchmarks, some of which are shown in Figure 7(a). The average speedups observed (over the 10 benchmarks) by these implementations are 6.4× and 10.8×, respectively. These results show that Casper can effectively improve the performance of applications by an order of magnitude by retargeting critical code fragments for execution on MapReduce frameworks.

Figure 7(a) plots the speedup achieved by the MOLD-generated implementations for String Match, Word Count, and
Linear Regression. The Spark translations MOLD generated for these benchmarks performed 12.3× faster on average than the sequential versions. The solutions generated by Casper for String Match and
Linear Regression were faster than those generated by MOLD by 1.44× and 2.34×, respectively. For String Match, Casper found an efficient encoding to reduce the amount of data emitted in the map stage (see §7.4), whereas MOLD emitted a key-value pair for every word in the dataset. Furthermore, MOLD used separate MapReduce operations to compute the result for each keyword; Casper computed the results for all keywords in the same set of operations. For
Linear Regression, MOLD discovered the same overall algorithm as Casper, except that its implementation zipped the input RDD with its index as a pre-processing step, almost doubling the size of the input data and hence the amount of time spent in data transfers. For the
Ariths, Stats, Big λ, and Fiji benchmarks, we recruited Spark developers through UpWork.com to manually rewrite the benchmarks, since reference distributed implementations were not available. Figure 7(a) compares the performance of (a subset of) the Casper-generated implementations to the handwritten benchmark implementations over the 75GB dataset. Results show that the Casper-generated implementations perform competitively, even against those manually written by developers. In fact, of the 42 hand-translated benchmark implementations, 24 used the same high-level algorithm as the one generated by Casper, and most of the remaining ones differed only by using framework-specific methods instead of an explicit map/reduce (e.g., using Spark's built-in filter, sum, and count methods). However, these variations did not cause a noticeable performance difference. One interesting case was the 3D Histogram benchmark, where the developer exploited knowledge about the data to improve runtime performance. Specifically, the developer recognized that since RGB values always range between 0-255, the histogram data structure would never exceed 768 values. Therefore, the developer used Spark's more efficient aggregate operator to implement the solution. Casper, not knowing that pixel RGB values are bounded, assumed that the number of keys could grow arbitrarily large and that using the aggregate operator could cause out-of-memory errors; hence it generated a single-stage map and reduce instead. For
PageRank and
Logistic Regression, we compared Casper against the implementations found in the Spark Tutorials [46] (see Figure 7(c)). The reference PageRank implementation was 1.3× faster than the one Casper generated on a dataset of about 2.25 billion graph edges, running 10 iterations. This is because Casper currently does not generate any cache() statements, nor does it co-partition data. Deciding when to cache can lead to further performance gains; prior work [12] suggested heuristics for inserting such statements into Spark algorithms that could be integrated into Casper's code generator to improve performance for iterative workloads. For Logistic Regression, we found no noticeable difference in performance.

For the TPC-H queries, we compared the performance of the Spark code generated by Casper against SparkSQL's implementation. Figure 7(b) plots the results of this experiment.

(Appendix E.2 describes the hiring criteria.)

Source      Mean Time (s)   Mean LOC      Mean Ops    Mean Rejections
Phoenix     944             13.8 (13.1)   2.3 (2.1)   0.35
Ariths      223             9.4 (7.6)     1.6 (1.2)   4
Stats       351             7.6 (5.8)     1.8 (1.8)   0.6
Big λ       112             13.6 (10)     1.8 (2.0)   0.4
Fiji        1294            7.2 (7.4)     1.4 (1.6)   0.1
TPC-H       476             5.9 (n/a)     7.25 (n/a)  0
Iterative   788             3.3 (3.7)     4.5 (3.5)   2

Table 2: Summary of Casper's compilation performance. Values for the reference implementations are shown in parentheses.

For Q1, Q6 and
Q15, Casper's implementations executed 2×, 1.8×, and 2.8× faster, respectively, than SparkSQL at a scale factor of 100. For Q1 and Q6, we attribute this to the extra data shuffling performed by the SparkSQL query plan. In Q15, SparkSQL's query plan scanned the lineitem relation twice, whereas Casper's implementation did so only once, resulting in worse runtime performance for SparkSQL. For Q17, SparkSQL executed 1.7× faster because it performed better scheduling of the query operators than the Casper-generated implementation. In sum, the results show that the Casper-generated implementations of the TPC-H benchmarks have performance comparable to implementations written directly against the MapReduce frameworks; yet, by using Casper, developers need not learn the different MapReduce APIs.

We next evaluate Casper's compilation performance. We discuss the time taken by Casper to compile the benchmarks, the effectiveness of Casper's two-phase verification strategy, the quality of the generated code, and incremental grammar generation.
On average, Casper took 11.4 minutes to compile a single code fragment. However, the median compile time for a single benchmark was only 2.1 minutes: for some benchmarks, the synthesizer discovered a low-cost solution during the first few grammar classes, letting Casper terminate the search early. Table 2 shows the mean compilation time for a single benchmark by suite.
In our experiments, the candidate summary generator produced at least one incorrect solution for 13 out of the 101 successfully translated code fragments. The synthesizer proposed a total of 76 incorrect summaries across all benchmarks. Table 2 lists the average number of times the theorem prover rejected a solution for each benchmark suite. As an example, the
Delta benchmark computes the difference between the largest and smallest values in the dataset. It incurred 7 rounds of interaction with the theorem prover before the candidate generator found a correct solution, due to errors from bounded model checking (discussed in §4.1).
In addition to measuring the runtime performance of the Casper-generated implementations, we manually inspected the code generated by Casper and compared it to the reference implementations on two code quality metrics: lines of code (LOC) and the number of MapReduce operations used. Table 2 shows the results of our analysis. Implementations generated by Casper were comparable and did not use more MapReduce operations or LOC than were necessary to implement a given task.

Benchmark             With Incr. Grammar   Without Incr. Grammar
WordCount             2                    827
StringMatch           24                   416
Linear Regression     1                    94
3D Histogram          5                    118
YelpKids              1                    286
Wikipedia PageCount   1                    568
Covariance            5                    11
Hadamard Product      1                    484
Database Select       1                    397
Anscombe Transform    2                    78
Table 3: With incremental grammar generation, Casper produces far fewer redundant summaries.
Note that the LOC pertain to individual code fragments, not entire benchmarks.
We also measured the effectiveness of incremental grammar generation in optimizing the search. To measure its impact on compilation time, we used Casper to translate the benchmarks without incremental grammar generation and compared the results. The synthesizer was allowed to run for 90 minutes, after which it was manually killed. The results of this experiment are summarized in Table 3. Exhaustively searching the entire search space produced hundreds of additional, more expensive solutions. The cost of searching, verifying, and sorting all these superfluous solutions dramatically increased the overall synthesis time. In fact, Casper timed out for every benchmark in that set (which represents a slowdown of at least one order of magnitude).
The final set of experiments evaluated the runtime monitor module and whether the dynamic cost model could select the correct implementations. As explained in §5.2, the performance of some solutions depends on the distribution of the input data. Therefore, we used Casper to generate different implementations for the StringMatch benchmark (Figure 8(a)). Figure 8(d) shows three (out of 400+) correct candidate solutions, with their respective costs based on the formula described in §5.1 and the following values for data-type sizes: 40 bytes for a String, 10 bytes for a Boolean, and 28 bytes for a tuple of Boolean objects. Solution (a) can be disqualified at compile time because it has a higher cost than solution (b) for all possible data distributions. However, the costs of solutions (b) and (c) cannot be statically compared due to the unknowns p1 and p2 (the respective probabilities that the conditionals will evaluate to true and a key-value pair will be emitted). The values of p1 and p2 depend on the input data, i.e., how often the keywords appear in the text, and thus can be determined only dynamically at run-time.

Casper handles this by generating a runtime monitor in the output code. The monitor samples the input data (the first 5000 values) in each execution to estimate values for the unknown variables in the cost formulas. The estimated values are then plugged back into the original cost functions (Eqn 2 and 3), and the solution with the lowest cost is then executed.

We executed solutions (b) and (c) on three 75GB datasets with different amounts of skew: one with no matching words (i.e., (c) emits nothing), one with 50% matching words (i.e., (c) emits a key-value pair for half of the words in the dataset), and one with 95% matching words (i.e., (c) emits a key-value pair for 95% of the words in the dataset). Figure 8(c) shows the dynamically computed final cost of solution (c) using the p1 and p2 estimates calculated via sampling. Figure 8(b) shows the actual performance of the two solutions.
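The monitor's decision can be sketched as follows. This is an illustrative Python model, not the generated code: the constant cost for solution (b) (84 per input record) is our own derivation from the data-type sizes and weights given earlier (one 28-byte tuple emitted with weight 1, reduced with weight 2), while solution (c)'s coefficient of 150 per matching record is taken from Figure 8(c) (75N at 50% match, 142.5N at 95%).

```python
def estimate_match_prob(sample, keywords):
    """First-k sampling: estimate p1 + p2, the fraction of input words
    that match a keyword and so trigger an emit in solution (c)."""
    return sum(1 for w in sample if w in keywords) / len(sample)

def pick_solution(words, keywords, k=5000):
    """Choose between StringMatch solutions (b) and (c) at runtime.
    cost(b) is constant per record; cost(c) scales with p1 + p2."""
    p = estimate_match_prob(words[:k], keywords)
    cost_b = 84       # assumed constant for (b), derived from sizes
    cost_c = 150 * p  # per Figure 8(c): 75 at p=0.5, 142.5 at p=0.95
    return "b" if cost_b < cost_c else "c"
```

On a low-skew input the estimated p1 + p2 keeps cost(c) below 84, so (c) is selected; at 95% match, cost(c) rises to 142.5 and the monitor switches to (b), matching the outcomes in Figure 8.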
For datasets with very high skew, it is beneficial to use solution (b) due to the smaller size of the key-value pairs it emits. Otherwise, solution (c) performs better. Casper, with the help of the dynamic input from the runtime monitor, makes this inference and selects the correct solution for all three datasets.

Dynamic cost estimation is particularly impactful in workloads with multiple join operations. The size of each relation participating in the join, in addition to the selectivity of the join predicate, dictates the most cost-efficient join ordering. To demonstrate this, we translated a simple query based on the TPC-H schema that implements a 3-way join between the part, supplier, and partsupplier relations. The query parameters are the name of the supplier and the customer_id, and the outputs are the customer's name, email address, and the sum of discount savings across all sales between the two parties. We executed this query over two parameter configurations: one where the cardinality of join(sales, supplier) was much greater than that of join(sales, customer), and one where it was much smaller. On compilation, Casper generated two semantically equivalent implementations for the query with different join orderings; which one to use depends on the cardinality of the input data. Upon execution, the Casper runtime estimated the cost of each join ordering and executed the faster solution for both configurations, showing the effectiveness of our dynamic tuning approach. We discuss the accuracy of the cost functions we used in Appendix E.3.

The translation techniques Casper uses are not coupled to our IR or the target frameworks. To demonstrate Casper's extensibility, we implemented the Fold-IR of prior work [22] in our system. Adding the fold construct to our IR required just 5 lines of code. An additional 43 lines of code were required to implement compilation of the fold operator to Dafny for verification of the synthesized summaries.
Since operations such as min, max, set.insert, and list.append were already available in our IR, no extra work was needed for them. We did not implement any incremental grammar exploration for Fold-IR and used a constant bound to restrict the maximum size of summary expressions. With this minimal amount of work, we synthesized summaries expressed in Fold-IR for all benchmarks in the Ariths set. We believe it should be easy to extend Casper's code generator to output the same code as in the original work.

We also explored using WeldIR [35] to express summaries. Although WeldIR is an excellent abstraction for data-processing workloads, we believe it is not suited for synthesis because it is too low-level. However, since both our IR and Fold-IR are conceptually subsets of WeldIR, summaries expressed using them can be translated to Weld through simple rewrite rules. To demonstrate, we successfully translated the summary for TPC-H Q6 expressed in our IR to Weld and used the Weld compiler to produce vectorized, multi-threaded code.

(a) Sequential code for StringMatch:

    key1_found = false
    key2_found = false
    for word in text:
        if word == key1: key1_found = true
        if word == key2: key2_found = true

(b) Performance of solutions (b) and (c) over datasets with different levels of skew (runtime in seconds vs. match probability p1 + p2; plot omitted).

(c) Dynamic selection of the optimal algorithm:

    Dataset      Cost of Soln (c)    Optimal Solution
    0% match     0                   (c)
    50% match    75 N                (c)
    95% match    142.5 N             (b)

(d) Candidate solutions and their statically computed costs. The constant factors in the cost terms were lost in extraction; the totals show that solutions (a) and (b) have static costs proportional to N, while solution (c)'s cost is proportional to (p1 + p2) · N:

    (a) output = reduceByKey(map(text, λm), λr)
        λm: (word) → {(key1, word = key1), (key2, word = key2)}
        λr: (v1, v2) → v1 ∨ v2

    (b) output = reduce(map(text, λm), λr)
        λm: (word) → {(word = key1, word = key2)}
        λr: (t1, t2) → (t1[0] ∨ t2[0], t1[1] ∨ t2[1])

    (c) output = reduceByKey(map(text, λm), λr)
        λm: (word) → {if (word = key1): (key1, true), if (word = key2): (key2, true)}
        λr: (v1, v2) → v1 ∨ v2

Figure 8: StringMatch benchmark: Casper dynamically selects the optimal implementation for execution at runtime.
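The two competing StringMatch summaries from Figure 8 can be emulated sequentially with Java streams to see that they are semantically equivalent even though their emitted data volumes differ. This is an illustrative sketch of ours, not the Spark code Casper emits; class and method names are our own.

```java
import java.util.Arrays;
import java.util.List;

// Emulation of Figure 8's candidate StringMatch summaries.
// Solution (b): every word maps to a pair of booleans, which are OR-reduced;
// the map phase emits one record per input word.
// Solution (c): a record is emitted only when a word matches a key, so the
// emitted data volume is proportional to the match probability.
public class StringMatchDemo {

    // Solution (b): unconditional emit, single global reduce.
    static boolean[] solutionB(List<String> text, String key1, String key2) {
        return text.stream()
                   .map(w -> new boolean[] { w.equals(key1), w.equals(key2) })
                   .reduce(new boolean[] { false, false },
                           (t1, t2) -> new boolean[] { t1[0] || t2[0], t1[1] || t2[1] });
    }

    // Solution (c): conditional emit, emulated as a filter-style scan.
    static boolean[] solutionC(List<String> text, String key1, String key2) {
        boolean[] found = { false, false };
        for (String w : text) {
            if (w.equals(key1)) found[0] = true; // emit (key1, true)
            if (w.equals(key2)) found[1] = true; // emit (key2, true)
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("a", "cat", "sat", "on", "a", "mat");
        System.out.println(Arrays.toString(solutionB(text, "cat", "dog")));
        System.out.println(Arrays.toString(solutionC(text, "cat", "dog")));
        // both print [true, false]: the summaries agree on the output
    }
}
```

Since both produce identical results, the choice between them is purely a cost question, which is exactly what the runtime monitor decides.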
Implementations of MapReduce.
MapReduce [21] is a popular programming model that has been implemented by various systems [6–8]. These systems provide their own high-level DSLs that developers must use to express their computation. In contrast, Casper works with native Java programs and infers rewrites automatically.
Source-to-Source Compilers.
Many efforts translate programs from low-level languages into high-level DSLs. MOLD [38], a source-to-source compiler, relies on syntax-directed rules to convert native Java programs to Apache Spark. Unlike MOLD, Casper translates based on program semantics and eliminates the need for rewrite rules, which are difficult to devise and brittle to code-pattern changes. Many source-to-source compilers have been built similarly for other domains [34]. Unlike prior approaches in automatic parallelization [3, 10], Casper targets data-parallel processing frameworks and translates only code fragments that are expressible in the IR for program summaries.
Synthesizing Efficient Implementations.
Prior work has used synthesis to generate efficient implementations and optimize programs. [44] synthesizes MapReduce solutions from user-provided input and output examples. QBS [15–17] and STNG [28] both use synthesis to convert low-level languages to specialized high-level DSLs, for database applications and stencil computations, respectively. Casper takes inspiration from these approaches by applying verified lifting to construct compilers. Unlike prior work, however, Casper: (1) addresses the problem of verifier failures and designs a grammar hierarchy to prune away non-performant summaries, and (2) has a dynamic cost model and runtime monitoring module for adaptively choosing among different implementations at runtime.
Query Optimizers and IRs.
Modern frameworks usually ship with sophisticated query optimizers [2, 9, 18, 29, 30] for generating efficient execution plans. However, these tools require users to express their queries in the provided APIs. Our objective is orthogonal: to find the best way to express program semantics using the APIs provided by these tools. We essentially enable these tools to optimize code not written in their API. Furthermore, unlike our IR, most IRs meant to capture data-processing workloads [22, 35] are not designed with synthesis in mind, which makes it difficult both to find and to verify programs expressed in them.
9 CONCLUSION
We presented Casper, a new compiler that identifies and converts sequential Java code fragments to run on MapReduce frameworks. Rather than defining pattern-matching rules to search for convertible code fragments, Casper automatically discovers high-level summaries of each input code fragment using program synthesis and retargets the found summary to the framework's API. Our experiments show that Casper can convert a wide variety of benchmarks from both prior work and real-world applications and can generate code for three different MapReduce frameworks. The generated code performs up to 48.2× faster compared to the original implementation, and is competitive with translations done manually by developers.
10 ACKNOWLEDGEMENTS
This work is supported by the National Science Foundation through grants IIS-1546083, IIS-1651489, OAC-1739419, and CNS-1563788; DARPA award FA8750-16-2-0032; DOE award DE-SC0016260; the Intel-NSF CAPA center; and gifts from Adobe, Amazon, and Google.
REFERENCES
[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[2] Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, and Volker Markl. 2016. Implicit Parallelism Through Deep Language Embedding. SIGMOD Rec.
[3] … PPSC. 662–667.
[4] Apache Flink 2018. https://flink.apache.org/. (2018). Accessed on: 2018-04-09.
[5] Apache Hadoop 2018. http://hadoop.apache.org. (2018). Accessed on: 2018-04-09.
[6] Apache Hive 2018. http://hive.apache.org. (2018). Accessed on: 2018-04-09.
[7] Apache Pig 2018. https://pig.apache.org/. (2018). Accessed on: 2018-04-09.
[8] Apache Spark 2018. https://spark.apache.org. (2018). Accessed on: 2018-04-09.
[9] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383–1394.
[10] William Blume, Rudolf Eigenmann, Jay Hoeflinger, David A. Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. 1994. Automatic Detection of Parallelism: A Grand Challenge for High Performance Computing. IEEE P&DT 2, 3 (1994), 37–47.
[11] Rastislav Bodík and Barbara Jobstmann. 2013. Algorithmic Program Synthesis: Introduction. International Journal on Software Tools for Technology Transfer.
[12] … Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425–1436.
[13] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. 2005. A Non-Local Algorithm for Image Denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05). IEEE Computer Society, Washington, DC, USA, 60–65.
[14] Yu-Fang Chen, Lei Song, and Zhilin Wu. 2016. The Commutativity Problem of the MapReduce Framework: A Transducer-based Approach. CoRR abs/1605.01497 (2016).
[15] Alvin Cheung, Samuel Madden, Armando Solar-Lezama, Owen Arden, and Andrew C. Myers. 2014. Using Program Analysis to Improve Database Applications. IEEE Data Eng. Bull. 37, 1 (2014), 48–59.
[16] Alvin Cheung and Armando Solar-Lezama. 2016. Computer-Assisted Query Formulation. Foundations and Trends in Programming Languages 3, 1 (2016), 1–94.
[17] Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3–14.
[18] Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Çetintemel, and Stanley B. Zdonik. 2014. Tupleware: Redefining Modern Analytics. CoRR abs/1406.6667 (2014).
[19] Przemyslaw Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. CoRR abs/1603.06850 (2016).
[20] Jerome Darbon, Alexandre Cunha, Tony F. Chan, Stanley Osher, and Grant J. Jensen. 2008. Fast Nonlocal Filtering Applied to Electron Cryomicroscopy. In ISBI. IEEE, 1331–1334.
[21] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113.
[22] K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, and S. Sudarshan. 2016. Extracting Equivalent SQL from Imperative Code in Database Applications. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1781–1796.
[23] Grigory Fedyukovich, Maaz Bin Safeer Ahmad, and Rastislav Bodik. 2017. Gradual Synthesis for Static Parallelization of Single-pass Array-processing Programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 572–585.
[24] Fiji: ImageJ 2018. https://github.com/fiji. (2018). Accessed on: 2018-04-09.
[25] Sumit Gulwani. 2010. Dimensions in Program Synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP '10). ACM, New York, NY, USA, 13–24.
[26] C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576–580.
[27] ImageJ 2018. https://imagej.net/Welcome. (2018). Accessed on: 2018-04-09.
[28] Shoaib Kamil, Alvin Cheung, Shachar Itzhaky, and Armando Solar-Lezama. 2016. Verified Lifting of Stencil Computations. SIGPLAN Not. 51, 6 (June 2016), 711–726.
[29] Alfons Kemper, Thomas Neumann, Florian Funke, Viktor Leis, and Henrik Mühe. 2012. HyPer: Adapting Columnar Main-Memory Data Management for Transactional AND Query Processing. IEEE Data Eng. Bull. 35, 1 (2012), 46–51.
[30] Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853–864.
[31] K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'10). Springer-Verlag, Berlin, Heidelberg, 348–370.
[32] MagPie Analysis Repository 2018. https://github.com/thisMagpie/Analysis. (2018). Accessed on: 2018-04-09.
[33] John Matthews, J. Strother Moore, Sandip Ray, and Daron Vroon. 2006. Verification Condition Generation via Theorem Proving. In Proceedings of the 13th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'06). Springer-Verlag, Berlin, Heidelberg, 362–376.
[34] Cedric Nugteren and Henk Corporaal. 2012. Introducing 'Bones': A Parallelizing Source-to-source Compiler Based on Algorithmic Skeletons. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 1–10.
[35] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. 2017. Weld: A Common Runtime for High Performance Data Analytics. (January 2017).
[36] Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08).
[37] …
[38] … In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '14). ACM, New York, NY, USA, 909–927.
[39] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. 2007. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society, Washington, DC, USA, 13–24.
[40] Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153–167.
[41] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23–34.
[42] Sketch 2018. https://people.csail.mit.edu/asolar/. (2018). Accessed on: 2018-04-09.
[43] Yannis Smaragdakis and George Balatsouras. 2015. Pointer Analysis. Found. Trends Program. Lang. 2, 1 (April 2015), 1–69.
[44] Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. SIGPLAN Not. 51, 6 (June 2016), 326–340.
[45] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav.
[46] Spark GitHub Repository 2018. https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples. (2018). Accessed on: 2018-01-20.
[47] Glynn Winskel. 1993. The Formal Semantics of Programming Languages: An Introduction. MIT Press, Cambridge, MA, USA.
A PROOF SKETCH FOR SOUNDNESS AND COMPLETENESS
Here, we first formalize the definitions of soundness and completeness, and then present a proof sketch to show that Casper's synthesis algorithm for program summaries has these properties. We use terms and acronyms defined in the paper without explaining them again here.
Definition 1. (Soundness of Search)
An algorithm for generating program summaries is sound if and only if, for all program summaries ps and loop invariants inv1, . . . , invn generated by the algorithm, the verification conditions hold over all possible program states after we execute the input code fragment P. In other words, ∀σ. VC(P, ps, inv1, . . . , invn, σ).

Definition 2. (Completeness of Search)

An algorithm for generating program summaries is complete if and only if, whenever there exist ps, inv1, . . . , invn ∈ G such that ∀σ. VC(P, ps, inv1, . . . , invn, σ), then ∆ ≠ ∅. Here, G is the search space traversed, P is the input code fragment, VC is the set of verification conditions, and ∆ is the set of sound summaries found by the algorithm. In other words, the algorithm will never fail to find a correct program summary as long as one exists in the search space.

Proof of Soundness.
The soundness guarantee for Casper's synthesis algorithm is derived from the soundness guarantees offered by Hoare-style verification conditions. The proof is constructed using a loop invariant, namely, a statement that is true immediately before and after each loop execution. Hoare logic dictates that in order to prove correctness of a given postcondition (i.e., program summary) for a given loop, we must prove the following holds over all possible program states:
(1) The invariant is true before the loop.
(2) Each iteration of the loop maintains the invariant.
(3) Once the loop has terminated, the invariant implies the postcondition.
This is essentially an inductive proof. The first two constraints prevent Casper from finding a loop invariant strong enough to imply an incorrect program summary. Our correctness guarantee is, of course, subject to the correct implementation of our VC generation module and of the theorem prover we use (Dafny). Establishing that the summary is a correct postcondition is sufficient to establish that it is a correct translation, because summaries in our IR must describe the final value of all output variables (i.e., variables that were modified) as a function over the inputs (see Figure 3).
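The three conditions above can be written as standard Hoare-logic verification conditions. The formulation below is a generic textbook rendering in our own notation (pre is the fragment's precondition, b the loop guard, body the loop body's state transformer, and ps the summary serving as postcondition), not lifted verbatim from the paper:

```latex
\begin{align*}
\text{(initiation)}   \quad & \forall \sigma.\; \mathit{pre}(\sigma) \rightarrow \mathit{inv}(\sigma) \\
\text{(preservation)} \quad & \forall \sigma.\; \mathit{inv}(\sigma) \wedge b(\sigma) \rightarrow \mathit{inv}(\mathit{body}(\sigma)) \\
\text{(termination)}  \quad & \forall \sigma.\; \mathit{inv}(\sigma) \wedge \neg b(\sigma) \rightarrow \mathit{ps}(\sigma)
\end{align*}
```

The conjunction of these three implications is what Casper discharges to Dafny for each candidate summary and invariant.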
Proof of Completeness.
To show that Casper's algorithm is complete with respect to the search space, we first establish that the algorithm always terminates. Recall that we use recursive bounds to finitize the number of solutions expressible by our IR's grammar. As explained in §4.1, we prevent the same solution from being regenerated, thus ensuring forward progress in the search. These two facts imply that our algorithm always terminates. There are only two possible exit points for the while(true) loop in our algorithm: line 24 and line 21 of Figure 5. The first is only reached once the entire search space has been exhausted. The second implies that a solution is successfully returned, as ∆ is not empty. It is important to note that our search algorithm is complete only for verifiably correct summaries. If a correct summary exists in the search space but cannot be proven correct using the available automated theorem prover, it will not be returned. Therefore, the completeness of the algorithm is modulo the completeness of the theorem prover.

B INTERMEDIATE REPRESENTATION SPECIFICATION
Here, we list the full set of types available in our IR and provide examples to demonstrate how they may be used to express models for library methods and types.
Primitive Data Types

    Scalars              bool, int, float, string, char, ...
    Structures           class(id1:Type1, id2:Type2, ...)
    List                 list(Type)
    Array                array(dimensions, Type)
    Functions            name(arg1:Type1, ...) : Type -> Body
    Conditionals         if cond then e1 else e2
    Synthesis Construct  choose(e1, e2, ..., en)

Built-in Operations

    Arithmetic   +, -, *, /, %, ...
    Bitwise      <<, >>, &, ...
    Relational   <, >, <=, >=, ...
    Logical      &&, ||, ==, !=
    List         len, append, get, equals, concat, slice
    Array        select, store
To provide support for a datatype found in a library, users must define the type of the object using our IR and annotate it with the fully qualified name, as follows:

    @java.awt.Point
    class Point(x:int, y:int)

Similarly, users may also provide support for library methods; for instance, the following defines a model for the absolute value function:

    @java.lang.Math.abs
    abs(val: int) : int ->
        if val < 0 then val * -1 else val
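Such a model is only useful if it agrees with the library method it models. As a sanity check, its semantics can be transcribed into Java and compared against the real method on sample inputs; the harness below is our own illustration, not part of Casper.

```java
// A direct Java transcription of the IR model for java.lang.Math.abs,
// checked against the real library method on a few sample inputs.
public class AbsModelCheck {

    // Transcription of: abs(val:int):int -> if val < 0 then val * -1 else val
    static int absModel(int val) {
        return val < 0 ? val * -1 : val;
    }

    public static void main(String[] args) {
        int[] samples = { -7, -1, 0, 1, 42, Integer.MAX_VALUE };
        for (int v : samples) {
            if (absModel(v) != Math.abs(v)) {
                throw new AssertionError("model diverges at " + v);
            }
        }
        System.out.println("model agrees with Math.abs on all samples");
    }
}
```

A model that diverges from the real method would make the synthesized summary verify against the wrong semantics, so checks of this kind are a cheap safeguard when adding new models.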
Using the core IR described above, we implemented in Casper the map, reduce, and join primitives used to synthesize summaries. We have also implemented commonly used methods from Java standard libraries such as java.lang.Math, String, and Date and other essential data types, along with methods that were needed to translate the Fiji plugins.

The choose operator in the IR is a special construct that enables us to express a search space using the IR. The parameters to choose are one or more expressions of matching types. The synthesizer is then free to select any expression from the list of choices in order to satisfy the correctness specification.

C CODE GENERATION RULES
To generate target DSL code from the synthesized program summary, we implemented in Casper a set of translation rules that map the operators in our IR to the concrete syntax of the target DSL. Here, we list a subset of such code-generation rules for the Spark RDD API.
    TR⟦map(l, λm : T → list(Pair))⟧  = l.flatMapToPair(⟦λm⟧);
    TR⟦map(l, λm : T → list(U))⟧     = l.flatMap(⟦λm⟧);
    TR⟦map(l, λm : T → Pair)⟧        = l.mapToPair(⟦λm⟧);
    TR⟦map(l, λm : T → U)⟧           = l.map(⟦λm⟧);
    TR⟦reduce(l : list(Pair), λr)⟧   = l.reduceByKey(⟦λr⟧);
    TR⟦reduce(l : list(U), λr)⟧      = l.reduce(⟦λr⟧);
    TR⟦λm(e) → eb⟧                   = (e -> ⟦eb⟧)
    TR⟦e1 + e2⟧                      = ⟦e1⟧ + ⟦e2⟧

The translation function TR takes as input an expression in our IR language and maps it to an equivalent expression in Spark. Since Spark provides multiple variations of the operators defined in our IR, such as map, we can select the appropriate variation by looking at the type information of the λm function used by map. For example, if λm returns a list of Pairs, we translate to JavaRDD.flatMapToPair. If it instead returns a list of a non-Pair type, we use the more general rule that translates map to JavaRDD.flatMap. Translation for the other expressions proceeds similarly.
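The type-directed dispatch in these rules amounts to ordinary case analysis over the summary's type information. The sketch below is an illustrative stand-in for Casper's code generator (the enum and method names are ours, not Casper's real AST), showing how the return type of the map lambda selects the Spark RDD operator.

```java
// Minimal sketch of type-directed code generation: the emitter inspects
// the return type of the map lambda to pick the matching Spark RDD method.
public class CodeGenDispatch {

    enum IrType { PAIR, LIST_OF_PAIR, LIST_OF_OTHER, OTHER }

    // Choose the Spark operator name for a map() whose lambda returns `ret`.
    static String sparkOpForMap(IrType ret) {
        switch (ret) {
            case LIST_OF_PAIR:  return "flatMapToPair";
            case LIST_OF_OTHER: return "flatMap";
            case PAIR:          return "mapToPair";
            default:            return "map";
        }
    }

    public static void main(String[] args) {
        System.out.println(sparkOpForMap(IrType.LIST_OF_PAIR)); // flatMapToPair
        System.out.println(sparkOpForMap(IrType.OTHER));        // map
    }
}
```

Because the rules are keyed purely on types, adding a new target framework means supplying a new table of operator names rather than changing the synthesis pipeline.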
D PROGRAM ANALYZER OUTPUTS
Here, we use TPC-H Query 6 to illustrate the outputs computed by Casper's program analyzer. Since the queries are originally in SQL, we have manually translated them to Java as follows:

    double query6(List<LineItem> l, Date dt1, Date dt2) {
        double revenue = 0;
        for (LineItem item : l) {
            if (item.l_shipdate.after(dt1) && item.l_shipdate.before(dt2) &&
                item.l_discount >= 0.05 && item.l_discount <= 0.07 &&
                item.l_quantity < 24) {
                revenue += item.l_extendedprice * item.l_discount;
            }
        }
        return revenue;
    }
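The summary for this fragment is, in effect, a filter-map-sum over the line items. A sequential Java-streams emulation of that shape is shown below; it is our own illustration, with a simplified LineItem record (dates reduced to ints) rather than the schema used in the benchmark.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative filter-map-sum emulation of a TPC-H Q6 summary using
// Java streams. The LineItem record is a simplified stand-in of ours.
public class Query6Summary {

    record LineItem(int shipdate, double extendedprice, double discount, int quantity) {}

    static double revenue(List<LineItem> items, int dt1, int dt2) {
        return items.stream()
                    .filter(i -> i.shipdate() > dt1 && i.shipdate() < dt2)
                    .filter(i -> i.discount() >= 0.05 && i.discount() <= 0.07)
                    .filter(i -> i.quantity() < 24)
                    .mapToDouble(i -> i.extendedprice() * i.discount())
                    .sum();
    }

    public static void main(String[] args) {
        List<LineItem> items = Arrays.asList(
            new LineItem(10, 100.0, 0.06, 10),  // qualifies: 100 * 0.06
            new LineItem(10, 100.0, 0.10, 10),  // discount out of range
            new LineItem(99, 100.0, 0.06, 10)); // shipdate out of range
        System.out.println(revenue(items, 0, 50)); // prints the qualifying revenue (~6.0)
    }
}
```

Each stage of this pipeline maps directly onto a map/reduce operator, which is why this query is a natural target for the translation.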
Program Analysis Results

    Input Vars    l: list(LineItem), dt1: Date, dt2: Date
    Output Vars   revenue: double
    Constants     [(24, int), (0.05, double), (0.07, double)]
    Operators     +, -, *, >=, <=, <
    Methods       Date.before, Date.after
With this information, Casper generates verification conditions like those shown in Figure 4(b) for the row-wise mean benchmark. Next, the program analyzer defines a search space within which Casper searches for summaries and the needed loop invariant. Since the full search-space description is too large to show, we only show a small snippet below:

    generator doubleExpr(val:LineItem, depth:int) : double ->
        if depth = 0 then
            choose(val.l_quantity,
                   val.l_extendedprice,
                   val.l_discount,
                   0.05, 0.07, 24)
        else
            choose(doubleExpr(val, 0),
                   doubleExpr(val, depth-1) + doubleExpr(val, depth-1),
                   doubleExpr(val, depth-1) * doubleExpr(val, depth-1),
                   doubleExpr(val, depth-1) / doubleExpr(val, depth-1))
The doubleExpr generator is the part of the grammar used to construct expressions that evaluate to a double. The generator keyword indicates that this is a special type of function, one that can select a different value from its choose operators on each invocation. The depth parameter controls how large the generated expression is allowed to grow. The choose construct presents a set of possible productions to the synthesizer. This grammar is tailored specifically to our implementation of TPC-H Query 6.
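A depth-bounded grammar of this shape can be enumerated mechanically. The sketch below is our own simplification (a two-field record instead of LineItem, and only + and * as productions): it enumerates expressions up to a small depth and filters them against one input-output example, mimicking how a synthesizer prunes candidates before attempting verification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Bounded enumeration of a tiny expression grammar, in the spirit of the
// doubleExpr generator: terminals are field accesses and constants; larger
// expressions combine smaller ones with + and *. Candidates are kept only
// if they agree with an input-output example.
public class GrammarEnumeration {

    record Row(double price, double discount) {}

    static List<ToDoubleFunction<Row>> terminals() {
        List<ToDoubleFunction<Row>> ts = new ArrayList<>();
        ts.add(Row::price);
        ts.add(Row::discount);
        ts.add(r -> 0.05);
        return ts;
    }

    static List<ToDoubleFunction<Row>> enumerate(int depth) {
        if (depth == 0) return terminals();
        List<ToDoubleFunction<Row>> smaller = enumerate(depth - 1);
        List<ToDoubleFunction<Row>> out = new ArrayList<>(smaller);
        for (ToDoubleFunction<Row> a : smaller)
            for (ToDoubleFunction<Row> b : smaller) {
                out.add(r -> a.applyAsDouble(r) + b.applyAsDouble(r));
                out.add(r -> a.applyAsDouble(r) * b.applyAsDouble(r));
            }
        return out;
    }

    public static void main(String[] args) {
        Row example = new Row(100.0, 0.06);
        double expected = 6.0; // price * discount
        long matching = enumerate(1).stream()
                .filter(e -> Math.abs(e.applyAsDouble(example) - expected) < 1e-9)
                .count();
        // price * discount and discount * price both match, so matching >= 2
        System.out.println(matching + " candidate(s) match the example");
    }
}
```

Real synthesizers like Sketch explore such spaces symbolically rather than by explicit enumeration, but the bounded, compositional structure of the search space is the same.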
E SUPPLEMENTARY EXPERIMENTS

E.1 Benchmark Details

The benchmarks Casper extracted form a diverse and challenging problem set. As shown in the table below, they vary across programming style as well as the structure of their solutions.
Benchmark Properties

    Conditionals          26    19
    User-Defined Types    14    10
    Nested Loops          40    22
    Multiple Datasets     22    18
    Multidim. Dataset     38    23

E.2 Developer Selection Criteria
To get reference Spark implementations for the non-SQL benchmarks, we hired developers through the online freelancing platform UpWork.com. While hiring, we ensured all candidates met the following basic criteria:
(1) At least an undergraduate or equivalent degree in computer science.
(2) A minimum of 500 hours of work logged on the platform.
(3) A minimum 4-star rating for previous projects (on a scale of 5).
(4) A portfolio of at least one successfully completed contract using Spark.
Finally, applicants were required to answer three test questions regarding Spark API internals to bid on our contract.
E.3 Evaluating Cost Model Heuristics
We present here experiments that measure whether Casper's cost model can effectively identify efficient solutions during the search process.
    Program    Emitted (MB)    Shuffled (MB)    Runtime (s)
    WC 1       105k            30               254
    WC 2       105k            58k              2627
    SM 1       16              0.7              189
    SM 2       90k             0.7              362

Table 4: The correlation of data shuffle and execution time (WC = WordCount, SM = StringMatch).
As discussed in §5.1, Casper uses a data-centric cost model. The cost model is based on the hypothesis that the amount of data generated and shuffled during the execution of a MapReduce program determines how fast the program executes. For our first experiment, we measured the correlation between the amount of data shuffled and the runtime of a benchmark to check the validity of the hypothesis. To do so, we compared the performance of two different Spark WordCount implementations: one that aggregates data locally before shuffling using combiners [21] (WC 1), and one that does not (WC 2). Although both implementations processed the same amount of input data, the former significantly outperformed the latter, as the latter incurred the expensive overhead of moving data across the network to the nodes responsible for processing it. Table 4 shows the amount of data shuffled along with the corresponding runtimes for both implementations using the 75GB dataset. As shown, the implementation that used combiners to reduce data shuffling was almost an order of magnitude faster.

Next, we verified the second part of our hypothesis by measuring the correlation between the amount of data generated and the runtime of a benchmark. To do so, we compared two solutions for the StringMatch benchmark (sequential code shown in Figure 8(a)). The benchmark determines whether certain keywords exist in a large body of text. Both solutions use combiners to locally aggregate data before shuffling. However, one solution emits a key-value pair only when a matching word is found (SM 1), whereas the other always emits either (key, true) or (key, false) (SM 2). Since the data is locally aggregated, each node in the cluster only generates 2 records for shuffling (one for each keyword) regardless of how many records were emitted during the map phase. As shown in Table 4, the implementation that minimized the amount of data emitted in the map phase executed almost twice as fast.

Figure 9: The top 2 benchmarks with the best performance (Wikipedia PageCount, Database Select) along with the bottom 2 (3D Histogram, Fiji: Red To Magenta), comparing Casper against manual implementations. The x-axis plots the size of the input data, while the y-axis plots the runtime speedup over sequential implementations. (Plots omitted.)

In sum, the two experiments confirm that the heuristics used in our cost model are accurate indicators of runtime performance for MapReduce applications. We also demonstrated the need for a data-centric cost model: solutions that minimize data costs execute significantly faster than those that do not.